Confounding

Overview
A confounding variable (also called a confounding factor, lurking variable, confound, or confounder) is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable. The methodologies of scientific studies therefore need to control for these factors to avoid what is known as a type I error: a 'false positive' conclusion that the dependent variable is in a causal relationship with the independent variable. Such a relation between two observed variables is termed a spurious relationship. Confounding is thus a major threat to the validity of inferences made about cause and effect (internal validity), because the observed effects may be attributable to the confounder rather than to the independent variable.

For example, assume that both a child's weight and a country's gross domestic product (GDP) rise with time. A person carrying out a study could measure weight and GDP, and conclude that a higher GDP causes children to gain weight, or that children's weight gain boosts the GDP. However, the confounding variable, time, was not accounted for, and it is what drives both rises.
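
The weight and GDP example can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the numbers are invented, both series are generated from time alone, and the helper names `pearson` and `residuals` are hypothetical. It compares the raw correlation between the two series with the correlation that remains once a straight-line time trend has been removed from each.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, ts):
    """Residuals of ys after a least-squares straight-line fit on ts."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
             / sum((t - mean_t) ** 2 for t in ts))
    return [y - (mean_y + slope * (t - mean_t)) for t, y in zip(ts, ys)]

rng = random.Random(0)
months = list(range(120))

# Invented data: each series depends on time only, never on the other.
weight = [4.0 + 0.2 * t + rng.gauss(0, 1) for t in months]   # child's weight
gdp = [100.0 + 0.5 * t + rng.gauss(0, 5) for t in months]    # GDP index

raw_r = pearson(weight, gdp)
adjusted_r = pearson(residuals(weight, months), residuals(gdp, months))

print(f"raw correlation: {raw_r:.2f}")
print(f"correlation after removing the time trend: {adjusted_r:.2f}")
```

In this sketch the raw correlation is strong even though neither series influences the other, while the correlation of the detrended residuals is close to zero, which is what one would expect when time is the only shared driver.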

By definition, a confounding variable is associated with both the probable cause and the outcome. The confounder must not lie on the causal pathway between the cause and the outcome: if A is thought to be the cause of disease C, the confounding variable B must not be caused solely by A, and B must not always lead to C. An example: being female does not always lead to smoking tobacco, and smoking tobacco does not always lead to cancer. Therefore, any study that tries to elucidate the relation between being female and cancer should take smoking into account as a possible confounder. In addition, a confounder is always a risk factor that has a different prevalence in the two risk groups (e.g. females and males) (Hennekens, Buring & Mayrent, 1987).

Though criteria for causality in statistical studies have been researched intensely, Judea Pearl has shown that confounding variables cannot be defined in terms of statistical notions alone; some causal assumptions are necessary. In a 1965 paper, Austin Bradford Hill proposed a set of causal criteria. Many working epidemiologists take these as a good place to start when considering confounding and causation; however, they are of heuristic value at best. When causal assumptions are articulated in the form of a causal graph, a simple criterion, called the backdoor criterion, is available to identify sets of confounding variables.
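
Once a set of variables satisfying the backdoor criterion has been identified, it can be used to adjust for confounding. As a brief illustration in Pearl's notation (the symbols X, Y, Z and the do-operator are introduced here and do not appear in the text above): if Z is a set of variables satisfying the backdoor criterion relative to a cause X and an outcome Y, the causal effect of X on Y can be computed from observational quantities by averaging over Z.

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr) \;=\; \sum_{z} P(y \mid x, z)\, P(z)
```

The right-hand side is the stratum-specific association between X and Y, weighted by how common each stratum of Z is in the population.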

How to remove confounding in a study
There are various ways to modify a study design to actively exclude or control confounding variables:

 * Case-control studies: confounders are distributed equally across both groups, cases and controls, by matching. For example, if somebody wants to study the cause of myocardial infarction and thinks that age is a probable confounding variable, each 67-year-old infarction patient is matched with a healthy 67-year-old "control" person. In case-control studies, the matched variables are most often age and sex.
 * Cohort studies: a degree of matching is also possible, often by admitting only certain age groups or one sex into the study population, so that all cohorts are comparable with regard to the possible confounding variable. For example, if age and sex are thought to be confounders, only 40- to 50-year-old males would be enrolled in a cohort study assessing myocardial infarction risk in cohorts that are either physically active or inactive.
 * Stratification: as in the example above, physical activity is thought to be a behaviour that protects against myocardial infarction, and age is assumed to be a possible confounder. The sampled data are then stratified by age group, meaning that the association between activity and infarction is analyzed within each age group. If the risk ratios within the age groups (or age strata) differ markedly from the crude, unstratified risk ratio, age must be viewed as a confounding variable. Statistical tools such as Mantel-Haenszel methods deal with stratified data.
 * Multivariate analysis: known confounders are measured and included as covariates in multivariate analyses. A drawback is that, compared with stratification, this gives little information about the strength of the confounding variable.

All these methods have their drawbacks. This can be seen clearly in an example: a 45-year-old African American from Alaska, an avid football player and vegetarian who works in education, suffers from a disease and is enrolled in a case-control study. Proper matching would call for a person with the same characteristics whose sole difference is being healthy, but finding such a person would be an enormous task. Additionally, there is always the risk of over- and undermatching the study population. In cohort studies, too many people can be excluded; and in stratification, single strata can become too thin and thus contain too few samples to support meaningful conclusions.
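
The stratified analysis described in the list above can be sketched in a few lines. This is an illustrative example only: the counts are invented, the two age strata are hypothetical, and "exposed" stands for physically inactive participants. It compares the crude risk ratio with the stratum-specific risk ratios and with the Mantel-Haenszel pooled risk ratio for cumulative-incidence data.

```python
# Invented counts per age stratum:
# (exposed cases, exposed total, unexposed cases, unexposed total)
# "Exposed" here means physically inactive; a case is a myocardial infarction.
strata = {
    "40-49": (8, 200, 16, 800),
    "60-69": (160, 800, 20, 200),
}

def risk_ratio(a, n1, c, n0):
    """Risk among the exposed divided by risk among the unexposed."""
    return (a / n1) / (c / n0)

def crude_risk_ratio(strata):
    """Risk ratio after collapsing all strata into a single table."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return risk_ratio(a, n1, c, n0)

def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel pooled risk ratio across strata."""
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata.values():
        n = n1 + n0
        numerator += a * n0 / n
        denominator += c * n1 / n
    return numerator / denominator

crude = crude_risk_ratio(strata)
pooled = mantel_haenszel_risk_ratio(strata)
per_stratum = {name: risk_ratio(*counts) for name, counts in strata.items()}

print(f"crude risk ratio: {crude:.2f}")
for name, rr in per_stratum.items():
    print(f"risk ratio in age group {name}: {rr:.2f}")
print(f"Mantel-Haenszel risk ratio: {pooled:.2f}")
```

With these invented counts, inactivity is much more common in the older stratum, which also has the higher baseline risk. The crude risk ratio is therefore considerably larger than the risk ratio found within either age group, and the pooled estimate agrees with the stratum-specific values.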

One major problem is that confounding variables are not always known or measurable. This leads to 'residual confounding', epidemiological jargon for incompletely controlled confounding. Hence, randomization is often the best solution: if performed successfully on sufficiently large numbers, all confounding variables (known and unknown) will tend to be equally distributed across all study groups.
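
The balancing effect of randomization can be illustrated with a short simulation. This is a sketch under assumptions: the participants are invented, age stands in for any confounder (known or unknown), and the helper name `mean_age_gap` is hypothetical. Each participant is assigned to one of two groups by a fair coin flip, and the gap between the groups' mean ages is reported for a small and a large study.

```python
import random

def mean_age_gap(n_participants, rng):
    """Randomize participants into two groups; return the absolute
    difference between the groups' mean ages."""
    ages = [rng.uniform(30, 80) for _ in range(n_participants)]
    treated, control = [], []
    for age in ages:
        # Fair coin flip: assignment does not depend on age in any way.
        (treated if rng.random() < 0.5 else control).append(age)
    if not treated or not control:
        raise ValueError("a group is empty; use more participants")
    return abs(sum(treated) / len(treated) - sum(control) / len(control))

rng = random.Random(42)
small_gap = mean_age_gap(40, rng)
large_gap = mean_age_gap(10_000, rng)

print(f"mean age gap with 40 participants: {small_gap:.2f} years")
print(f"mean age gap with 10,000 participants: {large_gap:.2f} years")
```

In the large study the two groups end up with nearly identical mean ages even though age was never used in the assignment. In a small study the gap can still be sizeable by chance, which is why the text stresses sufficiently large numbers.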