Data Analysis in Research

The researcher in any discipline faces the problem of what to do with data after they have been collected. The mass of data may be so large that the researcher cannot include all of it in the report in the form in which it was collected.

Much of the data must be reduced to some form suitable for analysis so that a concise set of conclusions or findings can be reported to a scientific audience.

In an attempt to analyze the data, we must first decide

  • Whether the tabulation of data will be performed by hand or by computer
  • How the information can be converted into a form that will allow it to be processed efficiently, and
  • What statistical tools or methods will be employed

In recent times computers have become an essential tool for the tabulation and analysis of survey data.

Even in small scale studies that employ relatively simple statistical procedures, computer tabulation is encouraged for easy and flexible handling of data.

Micro and laptop computers can produce tables of any dimension and perform statistical operations much more easily and usually with far less error than is possible manually.

Assuming that the database is large and the processing of the data will be undertaken by computer, we will address the following major issues in the task of data analysis:

  • Data preparation, which includes:
    • editing,
    • coding, and
    • data entry.
  • Exploring, displaying, and examining data, which involves breaking down, examining, and rearranging data to search for meaningful descriptions, patterns, and relationships.


The customary first step in the analysis is to edit the raw data. Editing detects errors and omissions, corrects them whenever possible, and certifies that minimum data quality standards are achieved.

The editor’s responsibility is to guarantee that data are:

  1. accurate,
  2. consistent with the intent of the question or other information,
  3. uniformly entered,
  4. complete, and
  5. arranged to simplify coding and tabulation.

Editing of data may be accomplished in two ways: field editing and in-house editing, also called central editing.

Field editing is the preliminary editing of data by a field supervisor on the same day as the interview. Its purpose is to identify technical omissions, check legibility, and clarify responses that are logically or conceptually inconsistent.

When gaps are present from interviews, a call-back should be made rather than guessing what the respondent “would have probably said.”

A second important task of the supervisor is to re-interview a few respondents, at least on some pre-selected questions, as a validity check. In central or in-house editing, all the questionnaires undergo thorough editing. It is a rigorous job performed by central office staff.


Coding is the process of assigning numbers or other symbols to answers so that the responses can be grouped into a limited number of classes or categories. Coding helps the researcher to reduce several thousand replies to a few categories containing the critical information intended for the question asked.

Numerical coding can be incorporated when the questionnaire itself is being prepared, which we call pre-coding, or after the questionnaire has been administered and the questions answered, which we call post-coding.

Pre-coding is necessarily limited chiefly to questions whose answer categories are known in advance.

These are primarily closed-ended questions (such as sex, religion) or questions whose answer is already a number and thus does not need to be converted (such as age, number of children).

Pre-coding is particularly helpful for data entry because it makes the intermediate step of completing a coding sheet unnecessary. The data are accessible directly from the questionnaire.

A respondent, interviewer, field supervisor, or researcher (depending on the data collection method) can assign appropriate numerical responses on the instrument by checking or circling it in the proper coding location.

The chief advantage of post-coding over pre-coding is that post-coding allows the coder to ascertain which answers are given by the respondent before beginning coding.

This can lead to great simplification. Post-coding also allows the researcher to code multiple answers to a single variable by writing a different code number for each combination of answers given.

Coding, whether pre or post, is a two-part procedure involving:

  1. choice of a different number for every possible answer category; and
  2. choice of the appropriate column or columns on the computer card that is to contain the code numbers for those variables.

The coding of data sacrifices some data detail, but it is necessary for efficient analysis. Instead of recording the word Muslim or Christian in answer to a question that asks for one’s religion, we could use the code “M” or “C.”

Normally this variable would be coded 1 for Muslim and 2 for Christian. Codes of the type “Q1” or “V1” are called alphanumeric codes. When numbers are used exclusively (e.g., 1, 2, etc.), the codes are numeric.
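As a rough illustration of numeric coding, the sketch below maps written answers to the codes mentioned above (1 for Muslim, 2 for Christian). The names `RELIGION_CODES` and `code_answer` are hypothetical and chosen only for this example.

```python
# Hypothetical code scheme for the religion question
# (1 = Muslim, 2 = Christian, as in the text above).
RELIGION_CODES = {"Muslim": 1, "Christian": 2}


def code_answer(answer, scheme):
    """Return the numeric code for a written answer.

    Raises ValueError when the answer has no category in the scheme,
    so that uncoded answers are noticed rather than silently dropped.
    """
    key = answer.strip().title()  # tolerate stray spaces and letter case
    if key not in scheme:
        raise ValueError(f"no code defined for answer: {answer!r}")
    return scheme[key]


answers = ["Muslim", "christian ", "Muslim"]
coded = [code_answer(a, RELIGION_CODES) for a in answers]
print(coded)
```

Raising an error for an unknown answer is one possible design choice; a coder could equally route such answers to a residual category.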

Codebook and its Construction

The codebook is a type of booklet compiled by the survey personnel that tells the meaning of each code from each question on a questionnaire.

For example, the codebook might show that for question number 10, male is coded 1 and female 2.

The codebook is used by the researcher as a guide to making data entry less prone to error and more efficient. It is also the definitive source for locating the positions of variables in the data file during analysis.

If a questionnaire can be completely pre-coded, with an edge-code indicating the location of the variable on the data file, then a separate codebook is not necessary, and a blank questionnaire can be used as a codebook.

However, particularly for post-coding and for open-ended questions that receive many answers, there is not sufficient room on the questionnaire to identify all codes.

The following is an example of the part of a codebook.

Sample Codebook

Question no. | Column location | Variable number | Variable description | Variable name
 | | | Respondent number: self code, 999=Missing |
 | 4 | V102 | Place of residence: 1=Rural, 2=Urban, 9=Missing | RES
2 | 5 | V103 | Sex of respondent: 1=Male, 2=Female, 9=Missing | SEX
 | | | Self code |
4 | 8 | V105 | Marital status: 1=Single, 2=Married, 3=Widowed, 4=Divorced, 5=Separated, 9=Missing | MARITAL
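A codebook of this kind can also be held in a small data structure so that codes are translated back into their meanings by program. The sketch below copies the variable numbers, names, and value labels for RES, SEX, and MARITAL from the sample codebook; the dictionary layout and the `decode` helper are assumptions made for illustration.

```python
# Entries copied from the sample codebook; the layout is an assumption.
CODEBOOK = {
    "RES": {
        "number": "V102",
        "description": "Place of residence",
        "values": {1: "Rural", 2: "Urban", 9: "Missing"},
    },
    "SEX": {
        "number": "V103",
        "description": "Sex of respondent",
        "values": {1: "Male", 2: "Female", 9: "Missing"},
    },
    "MARITAL": {
        "number": "V105",
        "description": "Marital status",
        "values": {
            1: "Single", 2: "Married", 3: "Widowed",
            4: "Divorced", 5: "Separated", 9: "Missing",
        },
    },
}


def decode(variable, code):
    """Look up the meaning of a code for the named variable."""
    return CODEBOOK[variable]["values"][code]


# One hypothetical coded record, decoded back into words.
record = {"RES": 2, "SEX": 1, "MARITAL": 9}
decoded = {name: decode(name, code) for name, code in record.items()}
print(decoded)
```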

Coding Non-responses

Non-response (or missing cases) occurs as a result of failure to provide any answer at all for a question, and these are inevitable in any questionnaire.

Care should be taken to prevent non-responses, but if these occur, the researcher must devise some scheme for coding them, preferably a standard scheme so that the same code can be used for non-response regardless of the particular question.

A numerical code should be assigned to a non-response.

The numbers used most often for non-response are 0 and 9. For variables requiring more than one column, the number is merely repeated for each column (e.g., 99, 999).

Any numerical code is satisfactory for non-response as long as it is not a number that could occur as a legitimate response.

For example, if you were to ask the respondent to list the number of children in his/her family, you should not use 9 for non-response because you could not distinguish a non-response from a family of nine children.

In addition to non-response items, codes may also need to be assigned for “don’t know” (DK) responses and for “not applicable” (NA) responses, where the question does not apply to a particular respondent. “Don’t know” responses are often coded as 0 or 00.
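The rule that a non-response code must never coincide with a legitimate answer can be checked mechanically. The sketch below is one possible approach: it starts from a code of repeated 9s and widens it until it falls outside the set of legitimate values. The function name `missing_code` and the assumed upper limit of 20 children are hypothetical.

```python
def missing_code(width, legitimate_values):
    """Return a code made of repeated 9s (9, 99, 999, ...) that cannot
    be confused with any legitimate answer.

    The code starts at the requested column width and gains one more
    digit each time it collides with a legitimate value.
    """
    code = int("9" * width)
    while code in legitimate_values:
        width += 1
        code = int("9" * width)
    return code


# Sex is coded 1 or 2, so a single 9 is safe.
sex_missing = missing_code(1, {1, 2})

# Number of children may legitimately be 9 (0 to 20 is an assumed range),
# so the code widens to 99.
children_missing = missing_code(1, range(0, 21))

print(sex_missing, children_missing)
```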

Data Entry

After coding is over, the next step is to enter the coded information into a file, which can be stored on a disc, diskette, or tape.

If the questionnaires are pre-coded, including edge-coding to signify the proper columns in the data file for each variable, the codes can be extracted directly from the questionnaires.

This is advisable if possible since it saves clerical work, which not only costs time and money but also provides the potential for additional error.

However, if the questionnaire has been post-coded, and if the codes are complicated, requiring a lengthy codebook, it will be difficult or impossible to work directly from the questionnaires. In such a case, a standard procedure is to split the task of constructing the data file into two separate operations:

  1. reading the questionnaires and the codebook and transferring the correct numerical codes for each question onto a transcription or transfer sheet, and
  2. entering the data into the computer through a computer terminal.

Until quite recently, the use of punch cards was the most common way of entering data onto computers.

This system has virtually disappeared. In recent times, many computers allow for the entry of data from optical-scan forms. In examinations, examinees darken small circles, ellipses, or sets of parallel lines to choose a test answer.

Optical scanners process the mark-sensed questionnaires and store the answers in a raw data file in the computer. Some questionnaires are now designed with optical-scan forms as answer sheets, or the questionnaire itself may be superimposed on an optical-scan form.

If this is the case, the researcher will not need to transfer the data to the forms.

This technology has been adopted by questionnaire designers for the most routine data collection. It reduces the number of times the data are handled, thereby reducing the number of errors that are introduced.

In addition to the above procedure, keyboard entry remains a mainstay for researchers who need to create a data file immediately and store it in a minimal space on a variety of media.

For this procedure, you take your coded data, sit down at a computer terminal, and enter the data on the keyboard of the terminal, case by case. Once the data are entered, you can get a listing from the computer of what you have entered and check the listing against the original coded data.
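Checking the computer listing against the original coded data can itself be done by program when both are available in machine-readable form. The following is a minimal sketch under that assumption; `find_entry_errors` and the sample values are hypothetical.

```python
def find_entry_errors(coded_sheet, entered):
    """Compare entered data with the original coded data, case by case.

    Returns a list of (case number, column number, expected, found)
    tuples, one per mismatch. Case and column numbers start at 1.
    Note: zip() stops at the shorter input, so cases or columns that
    are missing entirely must be checked separately.
    """
    errors = []
    for case_no, (expected_row, entered_row) in enumerate(
        zip(coded_sheet, entered), start=1
    ):
        for col_no, (expected, found) in enumerate(
            zip(expected_row, entered_row), start=1
        ):
            if expected != found:
                errors.append((case_no, col_no, expected, found))
    return errors


coded_sheet = [[1, 2, 1], [2, 1, 9]]  # codes on the transfer sheet
entered = [[1, 2, 1], [2, 7, 9]]      # codes as keyed in (one typing error)
errors = find_entry_errors(coded_sheet, entered)
print(errors)
```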

Telephone keypad response is another capability made possible by computers linked to telephone lines.

Using the telephone keypad (touch-tone), the respondent answers questions by pressing the appropriate number.

The computer captures the data by “listening,” decoding the tone’s electrical signal, and storing the numeric or alphabetic answer in a data file.

Nowadays, bar code readers are extensively used in business. This technology can be used to simplify the interviewer’s role as a data recorder.

Instead of writing or typing information about the respondents and their answers by hand, the interviewer can pass a bar code wand over the appropriate codes. The data are recorded in a small, lightweight unit for translation later.

Variable Transformation

It is often necessary to transform or modify data for subsequent analyses. Data transformation is the process of changing data from their original form to a format that better supports data analysis to achieve research objectives.

Many researchers believe that response bias will be less if interviewers ask the respondents for their year of birth rather than their age, even though the objective of the data analysis is to investigate respondents’ age in years.

The raw data coded as the year of birth can be easily transformed to the current age by subtracting the birth year from the current year.

Since this calculation can be done more easily and accurately by the computer than by hand, it should be done during the data analysis phase rather than during coding.

Collapsing or combining adjacent categories of a variable is a common data transformation that reduces the number of categories, and all such transformations can be done on the computer at any stage of the analysis. For example, single years of age (such as 0, 1, ...) can be collapsed and transformed into age categories 0-4, 5-9, 10-14, etc.

One of the disadvantages of this collapsing process is that the individual identity of the observations is lost in the collapsed variable.

To avoid losing it permanently, it is advisable to create a new variable from the old one while retaining the original variable.

In any event, the original variable must be retained, and the transformed variable must be given a new name so that you can make further transformation whenever needed.
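Both transformations described in this section, deriving age from year of birth and collapsing single years of age into five-year groups, can be sketched as follows. Each result is stored under a new name so that the original variable is retained. The field names, the `age_group` helper, and the survey year are assumptions made for illustration.

```python
def age_group(age, width=5):
    """Return the label of the age band containing `age`, e.g. 7 -> '5-9'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


survey_year = 2020  # assumed year of data collection

# Raw data as coded: year of birth only.
respondents = [
    {"birth_year": 2018},
    {"birth_year": 2013},
    {"birth_year": 2008},
]

for r in respondents:
    # New variables are added; birth_year itself is left untouched.
    r["age"] = survey_year - r["birth_year"]
    r["age_group"] = age_group(r["age"])

print(respondents)
```

Because `birth_year` and `age` are both kept, a different grouping (for example, ten-year bands via `age_group(age, width=10)`) can still be produced later.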

Computing a New Variable

Sometimes it is necessary to compute a new variable by combining two or more variables for analysis.

Suppose that, for each individual, you recorded separately the number of daughters (x1) and the number of sons (x2) that he or she has.

You want to combine these two variables into a single variable (x), which denotes the total number of children the individual has such that x=x1+x2.

In computing a new variable, you can apply addition, subtraction, multiplication, and division to one or more original variables.

To compute your profit margin P, you can subtract purchase value (Y) from sales value (X): P = X - Y. P is thus your computed variable. All these operations can easily be done by a computer program at any stage of your data analysis.
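The two computed variables described above, x = x1 + x2 and P = X - Y, might be produced as in the sketch below. The field names and sample figures are hypothetical.

```python
# x = x1 + x2: total children from daughters (x1) and sons (x2).
individuals = [
    {"daughters": 2, "sons": 1},
    {"daughters": 0, "sons": 3},
]
for person in individuals:
    # The computed variable is stored alongside the originals.
    person["children"] = person["daughters"] + person["sons"]


def profit(sales_value, purchase_value):
    """P = X - Y: sales value minus purchase value."""
    return sales_value - purchase_value


print([p["children"] for p in individuals])
print(profit(120, 90))
```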

Recoding Data

Recoding is a common manipulation procedure that you need to adopt in setting up your variables for analysis.

The purpose of recoding is generally to reduce the number of categories in a variable to a more manageable number for numerical analysis.

Suppose, for example, that you have coded religion as follows:

Muslim=1, Hindu=2, Christian=3, Buddhist=4, Others=5

Suppose a frequency run shows that the Christian, Buddhist, and ‘others’ categories together constitute only a small proportion of the whole, and you are therefore convinced that separate analysis of your data by these under-represented categories may not be meaningful.

In that event, you will be tempted to put these three categories together and assign them a new code ‘3’.
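The recoding just described can be sketched with a simple lookup table that leaves codes 1 and 2 unchanged and merges codes 3, 4, and 5 into the new code 3. The sample data and variable names are hypothetical, and the recoded values are stored under a new name so that the original coding is kept.

```python
from collections import Counter

# Original codes: Muslim=1, Hindu=2, Christian=3, Buddhist=4, Others=5.
# Recode: Christian, Buddhist, and Others are combined under code 3.
RECODE = {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

religion = [1, 3, 2, 5, 4, 1]                      # hypothetical coded data
religion_recoded = [RECODE[code] for code in religion]

# Frequency run before and after recoding.
print(Counter(religion))
print(Counter(religion_recoded))
```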
