Validity in Experimentation

In scientific research design and experimentation, validity refers to whether a study can scientifically answer the questions it is intended to answer. The validity of an experimental result is the degree to which it measures what it is supposed to measure.

Validity is not the same as reliability, which is the extent to which a measurement gives consistent results. Validity, unlike reliability, does not require that repeated measurements always be similar.

Even when an experiment follows an ideal design, it may encounter various types of error that reduce the reliability and validity of its results.

In other words, an experimental design may suffer from validity threats.


Two major types of experimental validity are considered here: internal validity and external validity.

Internal Validity

Internal validity refers to whether the experimental treatment was the sole cause of observed changes in the dependent variable. In other words, internal validity addresses the ‘true’ causes of outcomes that we observe in our study.

Strong internal validity means that we have not only reliable measures of our independent and dependent variables but also a strong justification that causally links our independent variables to our dependent variables.

In other words, strong internal validity refers to the unambiguous assignment of causes to effects.

Good experimental techniques, in which the effect of an independent variable on a dependent variable is studied under highly controlled conditions, usually allow for higher degrees of internal validity than, for example, single-case designs.

Campbell and Stanley (1963) and, later, Cook and Campbell (1979) listed eight kinds of confounding variables that can interfere with internal validity (i.e., with the attempt to isolate causal relationships).

These threats are:

  • History
  • Selection
  • Testing
  • Instrumentation
  • Maturation
  • Experimental mortality
  • Statistical regression
  • Selection-maturation interaction

We discuss these threats in turn below.


History

While an experiment is taking place, unanticipated and unplanned events may occur that confound the relationship being studied. This is a history effect.

In many experimental designs, we take a pretest measurement (O1) of the dependent variable before introducing the intervention (X). After the intervention, we take a posttest measurement (O2).

Then the difference between O1 and O2 is the change that we believe the intervention has caused.

Between O1 and O2, however, many unexpected events beyond the control of the experimenter could occur and confound the effect of the intervention.

This makes it impossible for the experimenter to know whether the change was due to the intervention or to the unanticipated events (extraneous factors).
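
The arithmetic behind this problem can be sketched in a few lines of Python. The numbers below are hypothetical and chosen only for illustration; the names `intervention_effect` and `history_effect` are assumptions introduced here, and in a real study neither quantity is observed separately.

```python
# A minimal sketch of why a one-group pretest-posttest comparison
# cannot separate the intervention from a history effect.
# All numbers are hypothetical.

def observed_change(o1: float, o2: float) -> float:
    """Return the pretest-posttest difference O2 - O1."""
    return o2 - o1

o1 = 60.0                    # pretest measurement (O1)
intervention_effect = 8.0    # change caused by the intervention (X); unknown in practice
history_effect = -8.0        # change caused by an unplanned event; unknown in practice

# The posttest (O2) reflects both influences at once.
o2 = o1 + intervention_effect + history_effect

change = observed_change(o1, o2)
print(change)
```

Under these assumed numbers the observed change is zero even though the intervention effect is positive, because the experimenter sees only the sum of the two effects.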


As an example, suppose a six-week training program is organized for a group of bank employees to improve their knowledge of bank management.

A few months later, an evaluation is conducted, and its results indicate that performance has not improved compared with the pre-training period. Naturally, the evaluator will conclude that the training was of no use.

Why is this so?

Upon inquiry, it is revealed that soon after the training, the bank employees went on an unplanned strike to press some of their long-standing demands. The strike represents a history threat to the validity of the evaluation.

Even if we could control for the effect of the strike, we would still ask: what is the true and valid effect of the training program?

The answer to this question depends on the type of design the researcher uses for the study.


Selection

An important threat to internal validity arises if the subjects in the experimental group (the group that receives the intervention) and the control group (the group that does not) are not equivalent in every relevant respect, e.g., age, occupation, race, and similar characteristics.

If the subjects are randomly assigned to experimental and control groups, this selection problem can largely be overcome.
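
Random assignment can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the function name `randomly_assign`, the employee labels, and the seed are assumptions made for the example, not part of any particular study.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal size.

    Returns (experimental_group, control_group).
    """
    rng = random.Random(seed)      # a seeded generator makes the split reproducible
    shuffled = list(subjects)      # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical pool of 20 employees.
employees = [f"employee_{i}" for i in range(1, 21)]
experimental, control = randomly_assign(employees, seed=42)

print(len(experimental), len(control))
```

Because chance alone decides who lands in each group, characteristics such as age or education tend to balance out across the groups as the number of subjects grows, although small samples can still differ by chance.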


Consider the previous example. Suppose the employees are divided into two groups: Group A and Group B.

A training program is organized for Group A (experimental group) but not for Group B (control group). After one year, the performance of the two groups is evaluated.

It is observed that the experimental group made significant improvements in their performance. Can we conclude that this is due to the effect of the training imparted to Group A?

Upon scrutiny, it appears that the participants in Group A were younger, more energetic, and better educated than the participants in Group B.

That is, the two groups differ in their background characteristics.

Hence the participants of Group A are more likely to be efficient and thus could pick up things more quickly, resulting in better performance than the control group.

This is the selection effect. If both groups had been similar in characteristics initially, we would be in a safer position to draw conclusions about the impact of the training program.

Thus we need a design that will be able to isolate the effect of the program intervention.


Testing

The process of taking a pretest can affect the scores or measurements of a posttest. This happens simply because the experience of taking the first test is likely to have a learning or diffusion effect that influences the results of the second test.

How does this happen?

People who undergo a pretest become more conscious and aware of the dimension of the problem and thus are likely to remember some of the questions and some of the errors they made when they take the posttest. They are likely to do somewhat better on the posttest than they did on the pretest.

This better performance on the posttest might have nothing to do with a program intervention and instead be entirely due to the effect of the pretest.

Thus, whenever a test is given repeatedly to the same group of individuals, a validity threat of this nature is likely to arise.


Suppose you ask a salesperson in a department store: How many food items do you keep in your store?

His answer may well be incorrect because he was not prepared for the question.

A week later, if he is asked the same question, he might answer it more accurately simply because by then he has become more familiar with the store’s items.

Repeated measurement can thus change the responses on its own, without any intervention.

Thus, if training is organized for the employees to make them more familiar with the store’s items, the independent effect of the program intervention is difficult to isolate because of the testing effect. A design that guards against this effect is therefore called for.


Instrumentation

Measuring the dependent variable in an experiment requires the use of a questionnaire, a test, or other forms of measuring instruments.

Any change in the wording of questions, a change in interviewers, or a change in other procedures to measure the dependent variable causes an instrumentation effect.

This may lead to a threat to internal validity.

For example, interviewers who were used in the pretest measurement may have acquired more knowledge and skill in interviewing by the time of the posttest, or they may become fatigued and reword the questions in their own style.

As a result, an experienced interviewer may obtain more complete information from a respondent than an inexperienced interviewer.

The additional information obtained may be because the interviewer has become more skilled in asking questions or observing events, and not due to the effect of program intervention.


A respondent is asked at a pretest: At what age did you get married? At a later date (posttest), the same respondent is asked: In which year and month did you get married?

The two questions might lead to two different answers, even though the objective of the study in both cases is to know the respondent’s age at marriage.


Maturation

The maturation effect is an effect on the results of an experiment caused by changes in the experimental subjects over time.

It is a function of time rather than a response to a specific event. In a long training program, for example, it is not unusual for trainees to become tired, hungry, or bored, or sometimes even discouraged.

In longitudinal studies covering a long period, respondents usually become more experienced, more knowledgeable, wiser, sometimes more resistant, and, of course, older.

In other words, people mature over time, and this maturation process can produce changes that are independent of the changes a program intervention is designed to produce.

Experimental mortality

Experimental mortality occurs when the composition of the study groups changes during the study.

Attrition is especially likely in the experimental group, and with each dropout, the group changes. Mortality effects refer to these losses. Because members of the control group are not affected by the testing situation, they are less likely to withdraw.

If the cases that have dropped out (lost to follow-up) differ from those that have not, a large difference between the pretest and posttest measurements is likely.

These differences may be due to the loss of cases rather than the effect of program intervention.
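
A small numerical sketch shows how dropouts alone can produce an apparent change. The scores and subject labels below are hypothetical, and the intervention is assumed to have no effect at all, so every remaining subject scores exactly the same at posttest as at pretest.

```python
from statistics import mean

# Hypothetical pretest scores for six subjects.
pretest = {"s1": 40, "s2": 45, "s3": 50, "s4": 70, "s5": 75, "s6": 80}

# Assume the two lowest scorers drop out before the posttest.
dropouts = {"s1", "s2"}

# Assume no intervention effect: those who stay score the same as before.
posttest = {s: score for s, score in pretest.items() if s not in dropouts}

# Naive comparison: everyone at pretest versus whoever is left at posttest.
naive_change = mean(posttest.values()) - mean(pretest.values())

# Fairer comparison: the same subjects at both points in time.
completers_pretest = [pretest[s] for s in posttest]
completers_change = mean(posttest.values()) - mean(completers_pretest)

print(naive_change, completers_change)
```

With these assumed scores the naive comparison shows a gain of 8.75 points, while the comparison restricted to subjects measured at both points shows no change. The whole apparent gain comes from who left the group.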

Statistical regression

This factor operates when groups have been selected based on their extreme scores and, on second testing, tend to move back toward the mean score of the group.

Suppose we measure the performance of all workers in a department store for a few days before an experiment and then conduct the experiment with only those workers in the top 25% and bottom 25% productivity scores.

No matter what is done between O1 (the pretest measurement) and O2 (the posttest measurement), there is a strong tendency for the average of the high scores at O1 to decline at O2 and for the low scores at O1 to increase.

This tendency results from imperfect measurement that, in effect, records some persons abnormally high and abnormally low at O1. In the second measurement, members of both groups tend to score more closely to their long-run mean scores (Cooper and Schindler, 1995).
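
This tendency can be reproduced in a short simulation. The sketch below assumes that each observed score is a worker's stable long-run productivity plus independent measurement noise, and that nothing at all happens between O1 and O2. The sample size, means, and spreads are hypothetical values chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 2000                 # hypothetical number of workers

# Assumed model: observed score = long-run level + measurement noise.
long_run = [rng.gauss(100, 10) for _ in range(n)]
o1 = [level + rng.gauss(0, 10) for level in long_run]   # pretest
o2 = [level + rng.gauss(0, 10) for level in long_run]   # posttest, no intervention

# Select the bottom 25% and top 25% of workers by their O1 scores.
order = sorted(range(n), key=lambda i: o1[i])
quarter = n // 4
bottom, top = order[:quarter], order[-quarter:]

def group_mean(scores, members):
    """Average score of the given group members."""
    return statistics.fmean([scores[i] for i in members])

top_o1, top_o2 = group_mean(o1, top), group_mean(o2, top)
bottom_o1, bottom_o2 = group_mean(o1, bottom), group_mean(o2, bottom)

print(f"top 25%:    O1 = {top_o1:.1f}, O2 = {top_o2:.1f}")
print(f"bottom 25%: O1 = {bottom_o1:.1f}, O2 = {bottom_o2:.1f}")
```

Under these assumptions the top group's average falls at O2 and the bottom group's average rises, although no intervention took place. An experimenter who ignored this could mistake the shift for a treatment effect.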

Selection-maturation interaction

This refers to the differential maturation of members of experimental and control groups.

Unlike the independent effect of only one factor, this threat to validity comes from two factors that may influence the experimental results.

To see how effective a specific training method is, two groups of individuals, one consisting of males and the other consisting of females of the same age and similar educational background, are chosen. Suppose the pretest measurements made on two groups showed equal performance levels.

Following the pretest measurement, the females were given the training while the males were not. Then a second measurement was taken for both the groups.

If the posttest measure shows a better performance level for the females, can we say that this better performance is entirely due to the intervention administered to them?

Perhaps not. It is possible that between the pretest and posttest, the maturation effect acted more strongly among the females than among the males, and this contributed to a higher score for the females. If so, we call it a selection-maturation interaction effect.

The validity threats that we have just elaborated need to be carefully considered while designing a study.

Before we conclude that an independent variable has a causal relationship with the dependent variable, it is important to be certain that validity threats have not contaminated the relationships.

It must be ensured that the choice of the study design has been such that it rules out as many alternative explanations of an observed effect as possible.

External Validity

External validity concerns the extent to which the (internally valid) results of a study can be held to be true for other cases, for example, for different people, settings, places, or times.

In other words, it is about whether findings can be validly generalized.

It seeks to answer the question: if the same research study were conducted in those other cases, would it yield the same results?

To what populations, settings, treatment variables, and measurement variables can these results be generalized?

The findings of a study of fifth-graders in a rural school that found one teaching method superior to another, for example, may not apply to students of the same grade or of a different grade in an urban school.

As a second example, suppose you select patients from a public hospital to study their attitude towards the services they receive from the hospital. The question may remain about whether you can extrapolate the results to all public hospitals in the country.

Or consider a study in which you ask a cross-section of a population to participate in an experiment, but a substantial number refuses.

If you experiment only with those who agree to participate, can the results be generalized to the whole population?

This raises the issue of external validity.

External validity can be split into two distinct types: population validity and ecological validity.

Population validity refers to the extent to which the results of a study can be generalized from the specific sample that was studied to a larger group of subjects.

For example, if the sample is drawn from an accessible population, rather than a target population, generalizing the research results from the accessible population to the target population is risky.

On the other hand, ecological validity refers to the extent to which the results of an experiment can be generalized from the set of environmental conditions created by the researcher to other settings and conditions.
