At first glance, it might seem that you can discover whether a treatment is successful quite simply: just try it and see. However, a close analysis of the subject reveals that it's much harder to identify effective treatments than one might think. Decades of investigation have led scientists gradually to the conclusion that there's only one truly trustworthy source of information on whether a medical therapy really works: the double-blind, placebo-controlled study. The reasons behind this conclusion are complicated, and run counter to almost everybody's intuition. In this article, we explore this crucial topic in depth.
For a hint why double-blind studies are so important, consider the following examples: in medical trials of drugs used to treat the symptoms of menopause, many of the participants were given a fake treatment (placebo) without being informed that it was fake.1 The combined results of multiple studies showed that women given placebo experienced a 51% reduction in hot flashes! Similarly, in a large study of men with prostate enlargement, participants given placebo therapy showed significantly improved symptoms and maintained at least some improvement for a full two years.2
Effects like these can be highly misleading to both physicians and patients. Suppose, for example, a physician prescribes a new drug for menopausal symptoms or prostate enlargement, and his or her patients report wonderful improvements. Does this indicate that the drug is effective? Not at all. As we know from the results described in the previous paragraph, many patients will report improvement no matter what they are taking. Thus, a drug can seem to be effective even if it doesn't possess any healing powers beyond the power of suggestion.
For a particularly dramatic example of this phenomenon, consider what happened when orthopedic surgeon Bruce Moseley, team physician for the Houston Rockets, decided he needed to properly evaluate the efficacy of an operation commonly used to treat the pain caused by arthritic knees. This surgery involves scraping away rough areas in the knee's cartilage. It is widely believed to be effective, and as many as 400,000 such surgeries are performed each year.
Moseley decided to see if the surgery really worked. He conducted a study in which five patients were given the real surgery and five were given fake surgery consisting of little incisions over the knee. He then followed his patients for two years.
The results were amazing. Interviews showed that pain and swelling were reduced just as much in the placebo group as in the group that received the real surgery. Four out of the five participants who experienced the fake surgery said it was so helpful they'd gladly recommend it to a friend. Glowing testimonials, in other words, mean nothing.
A follow-up trial of 180 individuals confirmed these results,6 and this surgical approach is on its way to well-deserved oblivion. However, if these properly designed trials had not been undertaken, surgeons might have continued to scrape arthritic knees. No doubt there are other ineffective surgeries that pass for effective, as well as ineffective herbs, supplements, and alternative therapies.
The double-blind, placebo-controlled trial is the best way to eliminate such misleading results. Such trials are the foundation of modern evidence-based medicine, and of the information in the Natural & Alternative Treatments database as well.
In the following discussion, we'll begin by exploring the many factors that can deceive medical researchers. We'll follow that with an explanation of how the double-blind study design solves these problems. After that, we'll analyze the many difficulties involved in performing a meaningful double-blind study and properly interpreting the results. Finally, we'll look at other forms of scientific evidence and explain their limitations.
At least twelve factors tend to confound (confuse) the results of studies.
First, researchers tend to observe what they expect to observe, a confounding factor known as observer bias. One placebo-controlled study evaluated a new treatment for multiple sclerosis.7 The researchers behind this study added an interesting twist: while most of the physicians assigned to evaluate the participants for improvement were blinded, a few were not blinded, and they knew which participants were receiving placebo. As it happened, the treatment proved to be no more effective than placebo. However, the unblinded physicians managed to “observe” a significant difference in outcome between patients on placebo and those getting the treatment under study. In other words, they imagined they saw improvement where they expected to see it. No doubt this happens frequently in the daily life of a practicing physician, who is never blinded. For this reason the professional opinions of practicing doctors are far less reliable than the outcomes of double-blind, placebo-controlled studies.
Not only do observers’ expectations influence their own observations, they can also cause study participants to behave in the way the observers expect. This is the Rosenthal effect, and it is startlingly powerful. In one famous set of experiments, graduate students were given several photographs and told to show them to their subjects.3 The subjects were supposed to rate their impression of the people in the photos on a scale whose extremes were "big success in life" and "utter failure in life." (The photos were selected from magazines and were intended to show rather normal people.)
Next, half the graduate student experimenters were informed by their professors that their subjects would rate most of the people in the photos as failures. The rest of the graduate students were led to expect their subjects to rate the photos as showing only successful people.
Almost invariably, subjects gave precisely the ratings experimenters expected. This is particularly amazing because the graduate students were only allowed to read a set speech to their subjects. They were not allowed to change a single word, and did not do so. Apparently, they managed to communicate their expectations through small changes in inflection of voice.
One of the reasons study participants respond to observer expectation in medical studies is a desire to please their physician. Patients tend to stress improvements and downplay problems if that's what they sense the doctor wants to hear. This does not necessarily involve lying. Participants may simply reinterpret their own experience to show improvement. A good example of this reinterpretation effect occurs when you take vitamin C over the winter and then decide that, no matter how many colds you had, you would have gotten more if you hadn't taken the vitamin C. You don't really know this, but you may tell yourself it is true nonetheless.
An entirely different possibility is that the power of suggestion may actually improve your health. This is the concept of the placebo effect. It may be, for example, that if you expect your knee arthritis to improve, it really will improve, through the power of the mind. (The concept of the placebo effect has recently undergone serious challenge, but it probably does occur at least to some extent.)
Memory distortion effects also influence the apparent outcome of treatments. Physicians (like everyone else) have a tendency to remember their greatest successes and most extreme failures, and drop from their memory everything in between. This can lead to a very skewed recall of the effectiveness of a treatment. Suppose a surgery works dramatically 15 times, fails absolutely 5 times, and yields mediocre results in the great majority of patients. The surgeon will most likely recall the surgery as highly effective.
Cognitive dissonance is another influence that makes physician impression unreliable. It is a well-established principle of experimental psychology that if you state out loud that something is true (eg, a treatment is effective), your mind will jump through hoops to make you experience the results as consistent with your beliefs. If you make your living doing something, you will similarly experience a strong tendency to believe that what you are doing really works.
Another major confusing influence is the natural course of the disease. Many diseases eventually run their course, and symptoms improve on their own. This can give a false impression that a treatment has worked. Moreover, due to a very powerful psychological tendency called the illusion of agency, a doctor will tend to feel that her efforts caused this improvement.
A related effect is called regression toward the mean. This term refers to a statistical principle. Simply put, most objective measurements of the state of the body fluctuate over time. Cholesterol level is a good example. People who are admitted to a study because their cholesterol levels are high may simply have high cholesterol at the moment they were tested for the study. During the subsequent several months, their cholesterol level will naturally move up and down. Suppose they happened to have been caught at a time of particularly high levels at the beginning of the study. By the end of that study, odds are they will show a lower reading. You might object that the effect should be symmetrical, and they just as well could have been caught at a low cholesterol moment at the beginning of the study. However, if that had been the case, they wouldn't have been allowed to participate in the study, because they wouldn't appear to have high cholesterol. Thus, this effect tends to produce an impression of improvement when in fact what is being observed is simply the workings of chance.
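The selection effect described above can be reproduced with a small simulation. The sketch below is only an illustration under assumed numbers: the population average, the amount of day-to-day fluctuation, and the enrollment cutoff of 240 are all hypothetical, and nobody in the simulation receives any treatment.

```python
import random

def simulate_regression_to_mean(n_people=10000, true_mean=200.0, person_sd=20.0,
                                noise_sd=25.0, cutoff=240.0, seed=0):
    """Return (mean screening reading, mean follow-up reading) for people
    enrolled because their screening reading was at or above `cutoff`.
    All numeric defaults are illustrative assumptions, not clinical values."""
    rng = random.Random(seed)
    screening, followup = [], []
    for _ in range(n_people):
        stable_level = rng.gauss(true_mean, person_sd)       # long-run personal average
        first = stable_level + rng.gauss(0, noise_sd)        # reading on screening day
        if first >= cutoff:                                  # only "high" readers are enrolled
            second = stable_level + rng.gauss(0, noise_sd)   # later reading, no treatment given
            screening.append(first)
            followup.append(second)
    return sum(screening) / len(screening), sum(followup) / len(followup)

before, after = simulate_regression_to_mean()
print(f"enrolled group, screening average: {before:.1f}")
print(f"enrolled group, follow-up average: {after:.1f}")
```

Because enrollment admits only people caught at a high reading, the follow-up average of the enrolled group comes out lower than its screening average, even though nothing was done to anyone.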
Another influence is called the study effect. Individuals in scientific studies (or under the care of a physician generally) often feel motivated to take better care of themselves overall. If you have diabetes, for example, and you enroll in a study of a new diabetes treatment, you may subconsciously begin to take your insulin shots more religiously, control your diet more enthusiastically, and make sure that you don't miss any doctors' appointments. The net result may be an improvement in symptoms that has nothing to do with a specific therapy under study.
Finally, participants with bad results may drop out of a study (or stop coming to a physician), while those with good results remain. This will tend to bias the apparent outcome toward more positive results.
All of these factors combine to make it immensely tricky to informally discover whether a treatment is effective. Neither a physician's clinical experience nor a patient's personal experience is particularly trustworthy. By the 1960s, researchers had begun to settle on an effective solution to this problem.
Medical researchers now agree that a treatment cannot really be said to be proven effective unless it has been examined in properly designed and sufficiently large double-blind studies.
In such experiments, half the participants are randomly assigned to receive the "real thing"—the treatment being tested. The other half receives a fake treatment designed to appear as much as possible like the real thing (the placebo control). Both participants and researchers are kept in the dark regarding which is which. Hence, they are both "blind"—resulting in the term “double-blind.”
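As a rough sketch of how random assignment and blinding might be organized, the following example shuffles a list of participants, splits it in half, and hands out neutral kit codes. The function name, the participant IDs, and the "KIT-0001" code format are hypothetical; real trials use more elaborate allocation procedures.

```python
import random

def assign_blinded(participant_ids, seed=None):
    """Randomly split participants into treatment and placebo arms.

    Returns (allocation, kit_codes). `allocation` is the secret key that is
    held back until the study ends; `kit_codes` is what participants and
    researchers see, and it carries no information about the arm."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)                      # random order decides the arm
    half = len(shuffled) // 2
    allocation = {pid: ("treatment" if i < half else "placebo")
                  for i, pid in enumerate(shuffled)}
    # Codes are numbered by participant ID, not by arm, so they reveal nothing.
    kit_codes = {pid: f"KIT-{n:04d}"
                 for n, pid in enumerate(sorted(allocation), start=1)}
    return allocation, kit_codes

ids = [f"P{i:03d}" for i in range(1, 11)]      # hypothetical participant IDs
allocation, kit_codes = assign_blinded(ids, seed=42)
print(kit_codes["P001"])                       # visible to everyone
print(sum(arm == "treatment" for arm in allocation.values()), "in treatment arm")
```

The point of the design is that the allocation key is kept away from everyone who treats or evaluates participants until the results are in.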
If performed correctly, a double-blind study can eliminate all the confounding effects described above. If the people in the real treatment group fare significantly better than those in the placebo group, it is a strong indication that the treatment really works on its own merits.
However, conducting a proper double-blind, placebo-controlled study isn't easy.
One problem is that participants may be able to discern whether they are getting a real treatment or placebo. For example, the smell and taste of a liquid preparation of some herbs is distinctive. Creating a substance that looks and tastes similar but lacks any active ingredients is difficult. This means that it's possible for those in the treatment group to know they are taking the real thing and for those in the control group to know they are taking placebo. Technically, this is described as "breaking the blind," and it can invalidate the results of a study. Similar difficulties occur in studies of conventional medications. If a treatment causes side effects, participants and physicians may be able to tell whether they are part of the treated group rather than the untreated (placebo) group. A top quality study will report on the success researchers had in efforts to keep the participants "blind." Surprisingly, many studies of medications reported in prestigious medical journals fail to do so.
In addition, some treatments are difficult to fit into the double-blind format, and others may be impossible. Studies on therapies such as acupuncture, physical therapy, diet, surgery, chiropractic, and massage are quite challenging to design in a double-blind manner. How do you keep the acupuncturist or surgeon in the dark as to whether he or she is performing real or fake treatment? How do you make study participants unaware of what they are eating?
Even properly designed double-blind studies aren't perfect.4 For example, individuals willing to participate in studies may not be representative of the general population. This could skew the results. It's not clear what can be done to eliminate this issue.
Another important issue is statistical significance. Sometimes you will read that people in the treatment group did better than those in the placebo group, but that the results were not statistically significant. This means you cannot assume that the results proved the treatment was effective.
Evaluation of statistical significance is a mathematical analysis used to ensure that the apparent improvement seen in the treated group represents a genuine difference, rather than just chance. Consider the following analogy: Suppose you flip one coin 20 times and end up with 9 heads. Then, you flip a second coin 20 times and count 12 heads. Does this mean that the first coin is less likely to fall with the head side up than the second coin? Or was the difference just due to chance? A special mathematical technique can help answer this question. The bottom line is that when study results look good but aren't statistically significant, they can't be taken any more seriously than the apparent "bias" of the coin that happens to fall heads more often when you flip it a few times.
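To make the coin analogy concrete, here is a minimal sketch of one technique that could be applied to it. The article does not say which method it has in mind; Fisher's exact test is our own choice for illustration, and the function name is hypothetical. It asks how often a split at least as lopsided as 9 versus 12 heads would arise if the two coins were in fact identical.

```python
from math import comb

def fisher_exact_two_sided(heads_a, flips_a, heads_b, flips_b):
    """Two-sided Fisher exact p-value for two coins (or two study groups).

    With the total number of heads held fixed, each possible split of heads
    between the coins has a hypergeometric weight. The p-value adds up the
    weights of all splits that are no more likely than the observed one."""
    total_heads = heads_a + heads_b
    total_flips = flips_a + flips_b
    lowest = max(0, total_heads - flips_b)
    highest = min(flips_a, total_heads)
    # Integer weights keep the comparison exact (no rounding issues).
    weights = {k: comb(flips_a, k) * comb(flips_b, total_heads - k)
               for k in range(lowest, highest + 1)}
    observed = weights[heads_a]
    as_extreme = sum(w for w in weights.values() if w <= observed)
    return as_extreme / comb(total_flips, total_heads)

p = fisher_exact_two_sided(9, 20, 12, 20)
print(f"p-value for 9/20 vs 12/20 heads: {p:.2f}")
```

For the coins in the analogy, the p-value comes out far above the conventional 0.05 cutoff, so the apparent difference between them is entirely compatible with chance.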
A related issue is called statistical power. If a study enrolls too few people, the chance of discovering a true treatment effect diminishes. The number of enrollees necessary to identify a benefit depends on the strength of the treatment—a powerful treatment can be identified as effective in a relatively small study, but a modestly effective treatment may require hundreds of study participants to identify an effect. This effect is compounded when it is tricky to measure the benefits of a treatment.
Antidepressant drugs and herbs are a good example of a form of treatment requiring very large studies to demonstrate benefit. There are two reasons for this. First, in antidepressant studies, people given placebo typically show about 75% as much improvement as those in the treated group.5 Additionally, the method of rating depression severity—a questionnaire—is relatively coarse and subject to wide variations in interpretation. The net result is a great deal of statistical "noise." In consequence, numerous studies of antidepressants have failed to identify any difference between treatment and placebo. This doesn't mean that the drugs don't work—only that very large studies are necessary to show that they work. Similarly, when small trials fail to find an herb effective, one shouldn't think they have proven it ineffective. They simply have failed to find it effective. Only relatively large negative trials truly prove that a treatment doesn't work. Small trials may simply lack sufficient statistical power to show benefit.
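The relationship between study size and statistical power can be estimated by simulating many imaginary trials. The sketch below assumes, purely for illustration, that 30% of people improve on placebo and 45% improve on the real treatment, and it uses a standard two-proportion z-test as the significance check; the response rates and function names are our own assumptions.

```python
import math
import random

def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:                      # everyone (or no one) improved in both arms
        return 1.0
    z = (x_a / n_a - x_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def estimated_power(n_per_arm, placebo_rate=0.30, treatment_rate=0.45,
                    trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated trials in which a real effect reaches significance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        placebo = sum(rng.random() < placebo_rate for _ in range(n_per_arm))
        treated = sum(rng.random() < treatment_rate for _ in range(n_per_arm))
        if two_proportion_p_value(treated, n_per_arm, placebo, n_per_arm) < alpha:
            hits += 1
    return hits / trials

for n in (20, 100, 400):
    print(f"{n} per group: power is roughly {estimated_power(n):.2f}")
```

Under these assumptions the treatment genuinely works in every simulated trial, yet the small trials usually fail to show it, while the large ones almost always succeed.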
Another statistical problem involves what is called data dredging. Before performing an experiment, researchers are supposed to pick one or two hypotheses that their study will test. This is called the primary outcome measure(s). For example, in a study of a treatment for Alzheimer's disease, the primary outcome measure may be the score on a given memory test. The researchers hypothesize that scores on this test will improve, and then conduct the study to determine whether their hypothesis is correct.
Once a study has begun, however, there's a temptation to gather more information by applying numerous tests to the participants. These are called secondary outcome measures. In the Alzheimer's example, these may involve such ratings as questionnaire assessments of ability to perform a daily task, physician opinion of overall progress, caregiver assessment of overall progress, and other perfectly reasonable ways of evaluating the success of therapy. There is a problem, however, with using a multitude of secondary outcomes: by the laws of statistics, if you measure enough things, some will indicate improvement, just by chance. Researchers who look at dozens of factors in hopes of finding evidence of improvement in a few of them are said to be engaged in data dredging. Only the results on the primary outcome measure are trustworthy. There is simply too much leeway to find favorable data by digging deep in the mass of other data recorded.
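The arithmetic behind data dredging is short enough to check directly. The sketch below assumes that the outcome measures are independent of one another and that each is tested at the usual 0.05 threshold; both are simplifying assumptions, since real secondary outcomes are often correlated.

```python
import random

def chance_of_false_positive(n_outcomes, alpha=0.05):
    """Probability that at least one of `n_outcomes` independent measures
    looks significant when the treatment does nothing at all."""
    return 1 - (1 - alpha) ** n_outcomes

def simulated_chance(n_outcomes, alpha=0.05, studies=20000, seed=0):
    """Same quantity, estimated by simulating studies of a useless treatment."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(studies):
        # With no real effect, each p-value is uniformly distributed on [0, 1).
        if any(rng.random() < alpha for _ in range(n_outcomes)):
            flagged += 1
    return flagged / studies

for k in (1, 5, 20):
    print(f"{k:2d} outcomes: formula {chance_of_false_positive(k):.2f}, "
          f"simulation {simulated_chance(k):.2f}")
```

With a single prespecified outcome the risk of a spurious "positive" stays at 5%, but with 20 independent outcomes it rises to roughly 64%.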
This is not a complete list of the challenges involved in designing a proper double-blind trial. There are numerous other tricky considerations, including study dropouts, ethical issues that interfere with an accurate determination of outcome, and many more. Nonetheless, when properly designed, the double-blind, placebo-controlled trial is the best method we have of objectively determining the effectiveness of a treatment.
There are many other types of studies besides double-blind, placebo-controlled trials; however, none can be considered as reliable.
Some double-blind studies compare a new treatment against an accepted treatment. If the comparative treatment is known to be effective beyond a shadow of a doubt, and the new treatment proves equally effective, such a study can provide evidence of effectiveness. However, it is preferable that such studies should also include a placebo group to enhance their meaningfulness.
In a single-blind trial, even though participants are not informed of who is receiving real treatment and who is not, researchers know the difference. Studies of acupuncture are typically single-blind, because it is difficult to design a study in which acupuncturists can deliver fake acupuncture without knowing it. Similar problems apply in studies of modalities such as physical therapy, surgery, chiropractic, and massage.
The problem with single-blind trials is that they can't eliminate all the confounding factors described above. Some can be prevented by using blinded evaluators; in other words, the acupuncturists know who is receiving real treatment, but a separate researcher evaluates how well participants have improved, and that researcher has no idea who received real treatment. Nonetheless, a single-blind study can't eliminate all confounding factors. The Rosenthal effect described above, for example, still has full sway to bias the results.
In some studies, a portion of the participants simply receives no treatment at all. Their outcome is compared to those who do receive treatment. Unfortunately, this form of study proves little. Every one of the confounding factors described above comes into play, and the results almost universally indicate that the tested treatment is successful, regardless of what it is.
Double-blind studies involve giving participants a treatment; in other words, "intervening" in their lives. They all fall in the category of an "intervention trial."
Observational studies (also called epidemiologic or population studies), on the other hand, simply follow large groups of people for years and keep track of a great deal of information about them, including diet. Researchers do not change anything in participants' lives; they simply observe what is already going on, examine the collected data closely, and try to identify which dietary and lifestyle factors are associated with better health and longer life. Such studies have most often tried to find connections between what people eat and the development of different diseases. A few have looked at the effect of taking nutrition supplements.
Observational studies are often the only practical way to gain information about the long-term health effects of nutrition and lifestyle. As noted above, how would you set up a study with placebo diet in which neither participants nor researchers knew who was eating what? It's basically impossible.
Furthermore, when looking for changes in events like heart attacks, you need to enroll enormous numbers of people and you have to follow them for decades. It is simply very expensive and difficult to conduct double-blind studies that meet such requirements.
Thus, for many treatments, such as low fat diets, observational studies are the primary source of information on their effects. Unfortunately, the results of observational studies can be misleading. Consider, for example, an observational study that discovers the following piece of data: people who consume high levels of saturated fat develop more heart disease. Does this mean that reducing saturated fat in the diet will reduce heart disease risk? Not necessarily. It is quite possible that the saturated fat in the diet was an innocent bystander, not the cause of the heart disease. Perhaps people who eat a great deal of saturated fat also tend to exercise less and smoke more cigarettes than other people. Those habits, and not the fat, might play the most important role.
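The "innocent bystander" scenario can be illustrated with a simulated population. In the sketch below, every number is invented for the example: 30% of people smoke, smokers are more likely to eat a high-fat diet, and heart disease risk depends only on smoking, never on fat.

```python
import random

def simulate_population(n=100000, seed=0):
    """Return counts keyed by (smoker, high_fat) -> [people, disease cases].
    All probabilities are illustrative assumptions, not real-world estimates."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        smoker = rng.random() < 0.30
        high_fat = rng.random() < (0.70 if smoker else 0.30)   # smokers eat more fat
        disease = rng.random() < (0.20 if smoker else 0.05)    # risk ignores fat entirely
        cell = counts.setdefault((smoker, high_fat), [0, 0])
        cell[0] += 1
        cell[1] += disease
    return counts

def disease_rate(counts, high_fat, smoker=None):
    """Disease rate for a diet group, optionally restricted to one smoking status."""
    people = cases = 0
    for (is_smoker, is_high_fat), (p, c) in counts.items():
        if is_high_fat == high_fat and (smoker is None or is_smoker == smoker):
            people += p
            cases += c
    return cases / people

counts = simulate_population()
print(f"all people:   high fat {disease_rate(counts, True):.3f}, "
      f"low fat {disease_rate(counts, False):.3f}")
for status, label in ((True, "smokers"), (False, "non-smokers")):
    print(f"{label:12s}: high fat {disease_rate(counts, True, status):.3f}, "
          f"low fat {disease_rate(counts, False, status):.3f}")
```

In the population as a whole, the high-fat group shows clearly more heart disease. Once smokers and non-smokers are examined separately, the difference essentially disappears, because fat never affected risk in this simulation. Real studies can only make this kind of adjustment for the factors they thought to measure.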
Researchers try to look closely at the data and eliminate such factors, but it can never be done perfectly. Knowing this, scientists reporting the results of observational studies tend to make very cautious statements, such as "high saturated fat intake is associated with increased heart disease," rather than "high saturated fat intake causes heart disease." However, the media will frequently rephrase the results to make them more impressive. This leads to the all-too-frequent situation where scientists appear to change their minds from year to year. In many cases, it's not that the scientists have revised their opinion; just that those who report the studies (including physicians, who should know better) have drawn unwarranted, firm conclusions from them.
In the Natural & Alternative Treatments database, when we report the results of observational studies, we add the caveat that they can't be taken as definitive proof.
An in vitro study is a trial that tests a substance in a test tube. Such studies are really only spurs for further research, as they don't prove that a treatment is effective in real life. An herb or supplement taken by mouth must be absorbed into the bloodstream, survive processing by the liver, and still manage to be effective when diluted by the fluids of the body. It's a long leap from a test tube result to a treatment that actually works.
Evidence from studies enrolling animals means more than evidence from in vitro studies. However, because animals may process nutrients and herbs differently than we do, the results can't be taken as completely reliable.
Sometimes a group of people is given a treatment and simply followed for a period of time to see if they improve. The results of such open studies mean practically nothing at all. Due to all the influences described above, one can expect even before conducting the trial that improvements will be reported. Such studies are mostly worthwhile for discovering harmful effects of new treatments.
1. MacLennan A, Lester S, Moore V. Oral estrogen replacement therapy versus placebo for hot flushes: a systematic review. Climacteric. 2001;4:58-74.
2. Nickel JC. Placebo therapy of benign prostatic hyperplasia: a 25-month study. Canadian PROSPECT Study Group. Br J Urol. 1998;81:383-387.
3. Rosenthal R. Interpersonal expectations: effects of the experimenter's hypothesis. In: Rosenthal R, Rosnow RL, eds. Artifact in Behavioral Research. New York, NY: Academic Press; 1969:181-277.
4. Kaptchuk TJ. The double-blind, randomized, placebo-controlled trial. Gold standard or golden calf? J Clin Epidemiol. 2001;54:541-549.
5. Andrews G. Placebo response in depression: bane of research, boon to therapy. Br J Psychiatry. 2001;178:192-194.
6. Moseley JB, O'Malley K, Petersen NJ, et al. A controlled trial of arthroscopic surgery for osteoarthritis of the knee. N Engl J Med. 2002;347:81-88.
7. Noseworthy JH, Ebers GC, Vandervoort MK, et al. The impact of blinding on the results of a randomized, placebo-controlled multiple sclerosis clinical trial. Neurology. 2001;57:S31-S35.