Stats #44: Things You Need to Know Before Starting a Research Project

Content: This class will introduce you to the statistical issues important in developing a research study. It combines material from classes #32, 42, and 52. This class is useful for anyone who participates in the planning of research. There are no pre-requisites for this class.

Teaching strategies: Didactic lectures and small group exercises.

Objectives: In this class you will learn how to:

This class qualifies for 3 IRB Education Credits (IRBECs).

Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 United States License. It was written by Steve Simon, edited by Linda Foland, and was last modified on 07/17/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details



Statistical Evidence: Overview

This is a first draft of the overview for "Statistical Evidence."

"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories." Sherlock Holmes in The Adventure of Wisteria Lodge.

Reading medical research is hard work. I'm not talking about the medical terminology, though that is often quite bad (if I hear the word "emesis" one more time, I'm going to throw up!). The hard part is assessing the strength of the evidence. When you read a journal article, you have to decide if the authors present a case that is persuasive enough to get you to change your practice. This means assessing the strength of the evidence.

Some evidence is so strong that it stands on its own. Other evidence is weaker and requires support from other studies, from mechanistic arguments, and so forth. Still other evidence is so weak that you should not consider any changes in your practice until the study is replicated using a more rigorous approach. I hope to elaborate on the criteria that you should use when assessing the strength of the evidence.

0.1. What should you look for?

When you are assessing the quality of the evidence, it's not how the data are analyzed that's important. Far more important is HOW THE DATA ARE COLLECTED. Don't agonize over technical details about the statistical analysis. After all, if you collect the wrong data, it doesn't matter how fancy the analysis is.

This is good news, because you don't need a lot of statistical training or a lot of mathematical sophistication to assess how the data are collected.

In this book, I want to show you what to look for and why. I will also highlight real research articles and use them as examples. Although all of the examples represent good and valuable research, some of the examples represent a level of evidence that by itself is less persuasive. It is helpful to understand why these examples are less persuasive.

0.2. Schizophrenic Research

Unfortunately, there is a lot of less than persuasive research out there. You don't have to look very hard to find solid empirical evidence of this. One of my favorite examples is a study by Ben Thornley and Clive Adams that appeared in the British Medical Journal in 1998. You can find the full text of this article on the web at bmj.com/cgi/content/full/317/7167/1181 and it is well worth reading. Thornley and Adams looked at the quality of clinical trials for treating schizophrenia. Since they work for the Cochrane Collaboration Group, a group that provides systematic reviews of the results of medical trials, they are in a good position to write such an article.

Thornley and Adams actually identified over 2500 studies of schizophrenia, but decided to summarize only the first 2000 that they uncovered. Perhaps they reached the point of sheer exhaustion. I am very impressed at the amount of work this must have taken.

The research covered fifty years, from 1948 through 1997, and a variety of therapies: drug therapies, psychotherapy, policy or care packages, and physical interventions like electroconvulsive therapy.

What did Thornley and Adams find? It wasn't a pretty picture. First, researchers in schizophrenia studied the wrong patients. Most studies used institutionalized patients, who are easier to recruit and follow up with, but who do not provide a good representation of all patients with schizophrenia. Readers would probably be as interested in community based studies, if not more so, but only 14% of the studies were community based.

Second, the researchers did not study enough patients. Thornley and Adams estimated that a good study of schizophrenia should have at least 300 patients in each group, based on the rates of improvement that might be expected for an active drug compared to a placebo. Even though the desired sample size was 300, the average study had only 65 patients. Only 3% of the studies had 300 or more patients.

Third, the researchers did not study the patients long enough. A good study of schizophrenia should last for six months or more; long term changes are more important than short term changes. Unfortunately, more than half of the studies lasted for six weeks or less.

Finally, the researchers did not measure these patients consistently. In the 2,000 studies, the researchers used 640 different ways to measure the impact of the interventions. Granted, there are a lot of dimensions to schizophrenia, and there were measures of symptoms, behavior, cognitive functioning, side effects, social functioning, and so forth. Still, there is no justification for using so many different measurements. Imagine how hard this makes it for anyone to summarize the results of this research. Failure to use and re-use a few standardized assessments has led to a very fragmentary (dare I say, schizophrenic) picture of schizophrenia treatments.

I don't wish to single out research in just this area. There are many reviews in other areas that also point out the flaws and shortcomings of research. Also keep in mind that research on schizophrenia is especially hard to do well. The take home message from Thornley and Adams is that just because the research is peer-reviewed does not mean that it is perfect. I hope to help you identify factors that limit the quality of peer-reviewed research.

0.3. Healthy Skepticism

Please don't panic. Research studies have many flaws, but usually those flaws do not make the research wholly uninterpretable. These limitations should make you skeptical, perhaps, but not cynical.

The cynical attitude says "you can prove anything with statistics" and leads to a nihilistic view that all research is garbage. The cynical attitude would lead you to nitpick a research paper, finding a flaw here and a flaw there, and then use these flaws to disregard any research whose conclusions make you uncomfortable.

A skeptical attitude, on the other hand, asks "how persuasive is this research?" and looks at both the strengths and the weaknesses of a research paper. It places limits on how persuasive the research is. When the research is not sufficiently persuasive, a skeptical attitude encourages you to think about what level of evidence would be enough to persuade you.

Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 United States License. It was written by Steve Simon on (unknown date), edited by Steve Simon and Linda Foland, and was last modified on 2008-06-24. This page needs minor revisions. Category: Statistical evidence


Apples or oranges?

1.0 Introduction

Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left handed people die at an earlier age than right handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?

When you make such a comparison between an exposure/treatment group and a control group, you want a fair comparison. You want the control group to be identical to the exposure/treatment group in all respects, except for the exposure/treatment in question. You want an apples to apples comparison.

1.0.1 Covariate imbalance

Sometimes, however, you get an unfair comparison, an apples to oranges comparison. The control group differs on some important characteristics that might influence the outcome measure. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.

Women who take oral contraceptives appear to have a higher risk of cervical cancer. But covariate imbalance might be producing an artificial rise in cancer rates for this group. Women who take oral contraceptives behave, as a group, differently than other women. For example, women who take oral contraceptives have a larger number of pap smears. This is probably because these women visit their doctors more regularly in order to get their prescriptions refilled and therefore have more opportunities to be offered a pap smear. This difference could lead to an increase in the number of detected cancer cases. Perhaps, though, the other women have just as much cancer, but it is more likely to remain undetected.

There are many other variables that influence the development of cervical cancer: age of first intercourse, number of sexual partners, use of condoms, and smoking habits. If women who take oral contraceptives differ in any of these lifestyle factors, then that might also produce a difference in cervical cancer rates. The possibility that oral contraceptives cause an increase in the risk of cervical cancer is quite complex; a good summary of all the issues involved appears on the web at www.jhuccp.org/pr/a9/a9chap5.shtml.

1.0.2 Case study: Vitamin C and Cancer

Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples to oranges comparison. Ewan Cameron and Linus Pauling published an observational study of Vitamin C as a treatment for advanced cancer (Cameron 1976). For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).

Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital."

Ten years later, the Mayo Clinic (Moertel 1985) conducted a randomized experiment which showed no statistically significant effect of Vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?

The first limitation of the Cameron and Pauling study was that all of their patients received Vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.

But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples versus oranges comparison because the initial prognosis was worse in the control group than in the treatment group. As Paul Rosenbaum says so well: "one can say with total confidence, without reservation or caveat, that the prognosis of the patient who is already dead is not good" (page 4).

When the treatment group is apples and the control group is oranges, you can't make a fair comparison.

1.0.3 Apples or oranges: What to look for.

To ensure that the researchers made an apples to apples comparison, ask the following questions:

Did the authors use randomization? In some studies, the researchers control who gets the new therapy and who gets the standard (control) therapy. When the researchers have this level of control, they almost always will randomize the choice. This type of study, a randomized study, is a very effective and very simple way to prevent covariate imbalance.

If randomization was not done, how were the patients selected? Several alternative approaches are available when the researchers have control of treatment assignment, but minimization is the only credible alternative. When researchers do not have control over treatment assignments, you have an observational study. The three major observational designs (cohort, case-control, and historical controls) all have weaknesses, but may represent the best available approach that is practical and ethical.

Did the authors use matching to prevent covariate imbalance? Matching is a method for selecting subjects that ensures a similar set of patients for the control group. A crossover design represents the ideal form of matching because each subject serves as his or her own control. Stratification ensures that broad demographic groups are equally represented in the treatment and control groups.

Did the authors use statistical adjustments to control for covariate imbalance? Covariate adjustment uses statistical methods to try to correct for any existing imbalance. These methods work well, but only on variables that can be measured easily and accurately.

1.1 Did the authors use randomization?

Randomization is the assignment of treatment groups through the use of a random device, like the flip of a coin or the roll of a die, or numbers randomly generated by a computer.

Example: In a study of allergy shots (Adkinson 1997), 121 children with moderate-to-severe asthma were "randomly assigned to receive subcutaneous injections of either a mixture of seven aeroallergen extracts or a placebo."

Example: In a study of acupuncture (Bullock 1989) "80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group)."

In both studies the researchers decided who got what. This is a hallmark of a randomized design, and it can only occur when the patients and/or their doctors have no say in the assignment.

1.1.2 How does randomization help?

Randomization helps ensure that both measurable and unmeasurable factors are balanced out across both the standard and the new therapy, assuring a fair comparison. Used correctly, it also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.

There are situations where covariate imbalance can appear, even in a well randomized study (Roberts 1999). Just as you have no guarantee that a flip of 100 coins will yield exactly 50 heads and 50 tails, you have no guarantee that covariate imbalances cannot creep into a randomized study once in a while. This is not just a theoretical concern. One article (Mann 2002) argues that a difference in baseline stroke severity in a randomized trial of tPA produced an incorrect assertion of the effectiveness of this treatment.

Randomization relies on the law of large numbers. With small sample sizes, covariate imbalance may still creep in. A study examining the probability of covariate imbalance (Hsu 1989) showed that total sample sizes less than 10 could have a 50% chance or higher of having a categorical covariate with levels twice as large in one group as in the other. This study also showed that total sample sizes of 40 or greater would have very little chance of such a serious imbalance, and that a total of 20-40 subjects would be acceptable if there were only one or two important covariates.
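
To get a feel for these numbers, here is a rough simulation in Python. It is my own illustration of the general phenomenon, not a reproduction of the Hsu calculations: it estimates how often a 50/50 binary covariate ends up at least twice as common in one randomized group as in the other.

import random

def imbalance_rate(total_n, trials=10000):
    # Count how often a binary covariate (prevalence 50%) turns out
    # at least twice as common in one group as in the other.
    bad = 0
    for _ in range(trials):
        covariate = [random.random() < 0.5 for _ in range(total_n)]
        half = total_n // 2
        a = sum(covariate[:half])   # covariate count, treatment group
        b = sum(covariate[half:])   # covariate count, control group
        lo, hi = min(a, b), max(a, b)
        if hi > 0 and hi >= 2 * lo:  # twofold or worse imbalance
            bad += 1
    return bad / trials

for n in (10, 20, 40):
    print(n, imbalance_rate(n))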

1.1.3 A fishy story about randomization

I was told this story but have no way of verifying its accuracy. A long, long time ago, the U.S. Environmental Protection Agency wanted to examine a pollutant to find concentration levels that would kill fish. This research required that 100 fish be separated into five tanks, each of which would get a different level of the pollutant. The researchers caught the first twenty fish and put them in the first tank, then the next twenty fish and put them in a second tank and so forth. The last twenty fish went into the fifth tank. Each fish tank got a different concentration of the pollutant. When the research was done, the mortality was related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes being in the first tank filled and the best outcomes in the last tank filled. What happened was that the slow-moving, easy-to-catch fish (the weakest and most sickly fish) were all allocated to the first tank. The fast-moving, hard-to-catch fish (the strongest and healthiest fish) ended up in the last tank.

Failure to randomize in this study ruined the entire effort. The huge imbalance caused by putting the sickest fish in the first tank and the healthiest fish in the last tank overwhelmed any differences in mortality caused by varying levels of the pollutant.

1.1.4 The mechanics of randomization

Random assignment means that the choice is left to some device that is inherently random and unpredictable. A flip of a coin is one approach, but usually a table of random numbers or a random number generator is more practical.

The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each value in the schedule, and then sort the schedule by the random number. Sorting by a random number is effectively the same thing as putting the list in a random order.
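
In Python, for example, this might look like the following sketch.

import random

# Lay out the treatment schedule in a systematic (non-random) fashion.
schedule = ["Treatment"] * 10 + ["Control"] * 10

# Generate a random number for each value, then sort by that number.
keyed = [(random.random(), assignment) for assignment in schedule]
keyed.sort()
randomized = [assignment for _, assignment in keyed]
print(randomized)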

1.1.5 Concealing the randomization list

Another important aspect of randomization is concealed allocation, which is the concealment of the randomization list from those involved with recruiting subjects. This concealment occurs until after subjects agree to participate and the recruiter determines that the patient is eligible for the study. Only then is a sealed envelope opened that reveals the treatment status. Concealed allocation can also be done through an 800 number that the doctor calls to discover the treatment status.

Please note that concealing the randomization list is not the same as blinding the study (a topic I discuss later in this book). Certain treatments, such as surgery, cannot be blinded, but the allocation list can still be concealed. Consider, for example, a randomized trial comparing laparoscopic surgery to traditional surgery. After the fact, the patient can tell by the size of the scar what type of surgery they received. But the choice of surgery could be made as the patient is being sedated. There is an example of a research study where a sterilized coin was flipped in the operating room to decide which surgery would be used.

If the randomization list is not concealed, doctors have the ability to consciously or unconsciously influence the composition of the groups. They can do this by applying exclusion criteria differentially or by delaying entry of a certain healthier (or unhealthier) subject so he/she gets into the "desirable" group. Unblinded allocation schemes show an average bias of 30-40% (Schulz 1996).

There are many stories of physicians who have tried and succeeded in recruiting a patient into a preferred group. If the treatment allocation is hidden in sealed envelopes, they can hold it up to a strong light. If the sealed envelopes are not sequentially numbered, they can open several envelopes at once. If the allocation is controlled by a central operator, they can call and ask for the allocation of several patients at once.

When a doctor has an overt preference to enroll a patient into one group over another, it raises ethical issues and perhaps the doctor should not be participating in the trial. You should only participate in a research study if you believe there is genuine uncertainty about whether the new therapy or the standard therapy is better. If not, you have no business participating in a study where some of your patients will be randomized to a treatment that you consider inferior. Unfortunately, some doctors will continue to participate in these trials but will try to skew the enrollment of some or all of the patients towards a favored therapy.

Concealed allocation only makes sense for a truly randomized study. If patients are assigned in an alternating fashion, concealed allocation is like buying a fancy burglar alarm and leaving the front door wide open. You already know that alternating assignment is a bad idea, but it is even worse here because the doctors will immediately recognize which group the next patient will be allocated to. This makes it easy for them to preferentially recruit to a specific treatment if they want to.

1.1.6 Randomization: what to look for.

If a study is randomized, look for the following features:

1.2 If randomization was not done, how were the patients selected?

Randomization is not always used. There are three alternatives to randomization when treatment assignment is under the researchers' control. When practical or ethical issues prevent researchers from controlling treatment assignment, you have an observational study.

1.2.1 Minimization.

An alternative, when the researchers have sufficient control, is to allocate the assignments so that at each step, the covariate imbalance is minimized. So if the treatment group has a slight surplus of older patients and the next patient to join the study is also older than average, then that patient would be assigned to the control group so as to reduce the age discrepancy.
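
Here is a minimal sketch of a simplified, two-group version of minimization in Python. The covariates and names are invented for illustration; real implementations (such as the Pocock-Simon method) add weights and a random element to the assignment rule.

import random

def minimization_assign(new_patient, treatment, control):
    # new_patient is a dict of covariate levels, e.g.
    # {"age": "older", "sex": "F"}; treatment and control are lists
    # of already-assigned patients in the same form.
    def imbalance(t_group, c_group):
        # Sum, over the covariates, of the count differences for the
        # new patient's levels.
        total = 0
        for cov, level in new_patient.items():
            t = sum(1 for p in t_group if p[cov] == level)
            c = sum(1 for p in c_group if p[cov] == level)
            total += abs(t - c)
        return total

    # Compute the imbalance if the patient joined each group in turn.
    if_treat = imbalance(treatment + [new_patient], control)
    if_ctrl = imbalance(treatment, control + [new_patient])
    if if_treat < if_ctrl:
        return "treatment"
    if if_ctrl < if_treat:
        return "control"
    return random.choice(["treatment", "control"])

# An older woman joins a study whose treatment group already has a
# surplus of older patients; she is steered to the control group.
print(minimization_assign({"age": "older", "sex": "F"},
                          treatment=[{"age": "older", "sex": "M"}],
                          control=[{"age": "younger", "sex": "M"}]))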

Example: In a study of behavioral counseling (Steptoe 1999), twenty general practices were allocated either to use behavioral counseling based on the stages of change model for all their patients, or to offer no counseling other than their current standard of care. These practices were assigned using minimization to ensure balance on three factors: the proportion of underprivileged patients served, the patient to nurse ratio of the practice, and fund holding status.

Minimization is a good approach if there are one or two covariates which are especially important and which are easily measured at the start of the study. It will perform better than randomization on those factors, although there is no guarantee of covariate balance for other covariates not used in the minimization. Minimization also cannot control for unmeasured covariates.

There is more effort required in setting up a study with minimization. You need a computer to be available at the time and location of the recruitment of each patient because you can't just print a list ahead of time. Another difficulty is that minimization is open to possible abuse because doctors might be able to predict what the next assignment would be.

1.2.2 Alternating assignments.

Another approach used in place of randomization is to alternate the assignment, so that every even-numbered patient is in the treatment group and every odd-numbered patient is in the control group.

Alternate assignment was popular in trials before World War II; it was felt that researchers would not understand or tolerate randomization (Yoshioka 1998).

[Insert a recent example of alternating assignment]

Alternating assignment seems on the surface to be a good approach, but it can sometimes lead to trouble. This is especially true when consecutive patients can influence one another. You may have seen this level of influence if you grow vegetables in a garden. If you have a row of cabbages, for example, you will often see a pattern of big cabbage, little cabbage, big cabbage, little cabbage, and so on. What usually happens, when the cabbages are planted a bit too closely, is that one of the cabbages will grow just a bit faster at first. It will extend into the neighboring cabbage's territory, stealing some of the nutrients and water, and thus growing even faster at the expense of its neighbor. If you assigned a fertilizer to every other cabbage, you would probably see an artificial difference because of the alternating pattern of growth within a row.

This alternating pattern can also occur in medicine. Consider, for example, a study of how much time doctors spend with their patients. If the first patient takes longer than expected, the doctor will probably rush a bit with the second patient in order to keep from falling further behind schedule. On the other hand, if the first patient finishes quickly, then the doctor will feel more relaxed and might tend to take a bit more time with the next patient.

In some situations, alternating assignment would be tolerable, but there is no good reason to prefer this over random assignment. You should be skeptical of this approach because studies with alternating assignment will tend, on average, to overstate the effectiveness of a new therapy by 15% (Colditz 1989).

1.2.3 Haphazard assignment.

Another choice researchers sometimes make is to base assignments on some arbitrary value. For example, patients born on even-numbered days would be assigned to the treatment group and those born on odd-numbered days would be assigned to the control group.

Example: In a study of heparinized saline to maintain the patency of patient catheters (Kulkarni 1994), patients admitted on odd-numbered dates received heparinized saline and patients admitted on even-numbered days received normal saline.

In some situations, haphazard assignment might be tolerable, but there is no good reason to use this approach. The study mentioned above was excluded from a meta-analysis of heparinized saline (Randolph 1998) because the reviewers felt the quality level was too low.

1.2.4 Observational studies

There are many situations where randomization is not practical or possible. Sometimes patients have a strong preference for one particular treatment and would not consider the possibility of being randomized into a different treatment. Surgery is one area with strong patient preferences especially for newer approaches like laparoscopic surgery (www.symposion.com/nrccs/lefering.htm).

Sometimes we are studying noxious agents, like second hand cigarette smoke, noisy workplaces, or boring statistics teachers like me. It would be unethical to deliberately expose people to any of these agents, so we have to collect data on those people who are unavoidably exposed to these things.

Sometimes, the sample sizes required or the duration of the study make it difficult to use randomization. Diseases like cancer that have a long latency period are especially hard to study with a randomized design.

Retrospective studies, studies where the outcome of interest has already occurred and you are looking at factors in the past that might have caused this outcome, are also impossible to randomize, unless you have a time machine.

Sometimes, the groups being studied existed prior to the start of the research. Genetic conditions like Down's syndrome cannot be randomly assigned to half of the patients in your study.

Sometimes researchers just do not want to go to the effort of randomizing. It is usually faster and cheaper to use existing non-randomized databases, and these are often helpful in evaluating the feasibility of then performing a large randomized study.

When randomization is not possible, then you are looking at an observational study. There are three major flavors of observational studies: cohort studies, case-control studies, and historical controls studies.

1.2.5 The cohort study.

In a cohort study, a group of patients has a certain exposure or condition. They are compared to a group of patients without that exposure or condition. Does the exposed cohort differ from the unexposed cohort on an outcome of interest?

Example: In a study of dietary fat (Hu 1997), 80,082 women between the ages of 34 and 59 years were followed for 14 years to look for instances of non-fatal myocardial infarction or death from coronary heart disease. These women were divided into low, intermediate, and high groups on the basis of their consumption of dietary fat. This is an observational study because the women chose the type of diets they ate, not the researchers. This particular observational study is a cohort design, with the three levels of fat consumption representing three different exposure groups.

Cohort studies are intuitively appealing and selection of a control group is usually not too difficult. You have to be very wary of covariate imbalance, but other observational designs are likely to have even more problems. Don't worry about every possible covariate imbalance. You should look for large imbalances, especially for covariates which are closely related to the outcome variable.

When you are studying a very rare outcome, the sample size may have to be extremely large. As a rough rule of thumb, you need to observe 25 to 50 outcomes in each group in order to have a reasonable level of precision. So when a condition occurs only once in every thousand patients, a cohort study would require tens of thousands of patients.
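
The arithmetic behind that rule of thumb is simple enough to sketch in a few lines of Python:

incidence = 1 / 1000   # the outcome occurs once per thousand patients
for target_events in (25, 50):
    per_group = target_events / incidence
    print(target_events, "events requires about", round(per_group),
          "patients per group")
# 25 events requires about 25000 patients per group
# 50 events requires about 50000 patients per group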

You want to avoid "leaky groups" in a cohort design. If the exposure group includes some unexposed patients and the control group includes some exposed patients, then any effect you are trying to detect will be diluted. Be especially aware of situations where one group is more leaky than the other.

For example, many studies will classify people into various levels of caffeine exposure on the basis of how much coffee they drink. Although coffee is the major source of caffeine for most people, failure to ask about other sources of caffeine consumption can lead to large underestimates of caffeine intake, which can obscure relationships to various diseases (Brown 2001).

Dietary studies will sometimes rely on household food surveys, but these need adjustment for the varying consumption of individual family members. For example, within the same family, males (especially boys aged 11-17 years) will have higher average intakes of calories and nutrients (Nelson 1986).

1.2.6 The case control study

A case-control study selects patients on the basis of an outcome, such as the development of breast cancer, and compares them to a group of patients without that outcome. Do the cases differ from the controls in some exposures?

Example: In a study of HIV infection (Cardo 1997), 33 health care workers who became seropositive to HIV after percutaneous exposure to HIV-infected blood were compared to 665 health care workers with similar exposure who did not become seropositive. This is an observational study, since the researchers did not control who became seropositive. This particular observational study is a case-control design because patients were selected on the basis of the outcome, seroconversion.

A case-control study is very efficient for studying rare diseases. With this design, you round up all of the limited number of cases of the disease and then find a comparable control group. By contrast, a cohort design has to round up far more exposed patients to ensure that a handful of them will develop the rare disease.

Case-control studies do not perform well when you are evaluating a diagnostic test. They are easy to set up, because you have a group of patients with the disease and you estimate the probability of a positive result for the diagnostic test in this group (sensitivity). You also have a control group and you estimate the probability of a negative result for the diagnostic test in this group (specificity). Unfortunately, the case-control design usually has a collection of very obviously diseased patients among the cases and very obviously healthy patients among the controls. This is an example of spectrum bias, a lack of patients in the ambiguous middle of the spectrum. A study with spectrum bias will often overstate the sensitivity and specificity of a diagnostic test.
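
Here is the calculation in miniature, with counts invented purely for illustration. The point is that when the cases are all obviously diseased and the controls all obviously healthy, both numbers will look better than they would across the full spectrum of patients.

cases_total, cases_positive = 100, 90         # obviously diseased
controls_total, controls_negative = 100, 95   # obviously healthy

sensitivity = cases_positive / cases_total         # P(test+ | disease)
specificity = controls_negative / controls_total   # P(test- | healthy)
print(sensitivity, specificity)  # 0.9 0.95, likely optimistic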

[Include reference on spectrum bias.]

Because the outcome in a case-control study has already occurred, this study is always retrospective. Retrospective studies usually have more problems with data quality because our memory is not always perfect. What's worse is that sometimes the ability to remember is sharply influenced by the outcome being studied. People who experience a tragic event like a miscarriage will have a strong desire to try to understand why this has happened and will search their pasts for risk factors that have been highly publicized in the press (Bryant 1989). They don't make things up, but the problem is that the people in the control group only seem to remember about half the things that have happened in their past. This selective underreporting in the control group is known as recall bias, and it can lead to some seriously faulty findings.

"Leaky groups" can cause problems in a case-control design as well. Do some of the disease outcomes get left out of the cases? It might be harder, for example, to identify the less serious examples of the disease, and this can lead to serious problems. You can avoid this problem if there is some type of registry that allows the researchers to identify every possible case. Watch out also for situations where healthy people, or people with a different disease, are accidentally classified as cases.

The other major problem with this type of study is that it is so hard to find a good control group. You want to find controls that are identical to the cases in all aspects except for the outcome itself. When there is a roster of all potentially eligible subjects (subjects who would be classified as cases if they developed the disease), then selection of a good quality control group is easy (Wacholder 1995). Most studies would not have such a roster. In this case, the controls are often patients admitted to the hospital for outcomes unrelated to the study. So if cases represent newly diagnosed lung cancer, then the controls might be patients admitted for a bone fracture. Other times, you might ask the case to bring a friend with them or to identify a relative.

Finally, the case-control design just does not sit well with your intuition. You are trying to find factors that cause an outcome, but you are sampling from the effects, while a cohort design samples from the causes. Don't let this bother you too much, though. The mathematics that justify the case-control design were developed half a century ago by Jerome Cornfield (JNCI 1951, 11: 1269-75), and careful use of the case-control design has helped establish the use of aspirin as a cause of Reye's syndrome (Monto 1999).
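
The calculation at the heart of the case-control design is the odds ratio, computed from a two-by-two table. The counts below are invented for illustration; Cornfield's insight was that, for a rare disease, this odds ratio approximates the relative risk you would have gotten by sampling on exposure instead.

def odds_ratio(a, b, c, d):
    # a = exposed cases, b = unexposed cases,
    # c = exposed controls, d = unexposed controls
    return (a * d) / (b * c)

print(odds_ratio(a=60, b=40, c=30, d=70))  # 3.5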

1.2.7 The historical controls study.

In a historical controls study, researchers will assign all of the research subjects to the new therapy. The outcomes of these subjects are compared to historical records representing the standard therapy.

Example: In a study of the rapid parathyroid hormone test (Johnson 2001), 49 patients undergoing parathyroidectomy received the rapid test. These patients were compared to 55 patients undergoing the same procedure before the rapid test was available. This is an observational study because the calendar, not the researchers, determined which test was applied. This particular observational study is a historical controls design because the control group represents patients tested before the availability of the rapid test.

The very nature of a historical controls study guarantees that there will be a major covariate imbalance between the two groups. Thus, you have to consider any factors that have changed over time that might be related to the outcome. To what extent might these factors affect the outcome differentially? For the most part, historical controls are considered one of the weakest forms of evidence. The one exception is when a disease has close to 100% mortality. In that situation, there is no need for a concurrent control group, since any therapy that is remotely effective can be detected readily. Even in this situation, you want to be sure there is a biological basis for the treatment and that the disease group is homogeneous (www.pharmafile.com/Pharmafocus/Features/feature.asp?fID=354).

1.2.8 Non-randomized studies, what to look for.

When a study was not randomized, look for the following features.

For a study using minimization:

For studies using alternating assignments or haphazard assignments:

For studies using a cohort design:

For studies using a case-control design:

For studies using a historical controls design:

1.3 Did the authors use matching?

To ensure an apples to apples comparison, researchers will often use matching. Matching is the systematic selection, for every subject in the treatment/exposure group, of a control subject with similar characteristics. For example, in a study of fetal exposure to cocaine, you would select infants born to mothers who abused cocaine during pregnancy for your exposure group. For every such infant, you would select for your control group an infant who was unexposed to cocaine in utero but who had the same sex, race, and socio-economic status.
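
A minimal sketch of this selection process in Python, using the variables from the cocaine example (the data and function are invented for illustration):

def find_match(case, control_pool, variables=("sex", "race", "ses")):
    # Return the first unused control that agrees with the case on
    # every matching variable, or None if no such control exists.
    for control in control_pool:
        if all(control[v] == case[v] for v in variables):
            control_pool.remove(control)  # use each control at most once
            return control
    return None

exposed = [{"id": 1, "sex": "F", "race": "white", "ses": "low"}]
controls = [{"id": 101, "sex": "M", "race": "white", "ses": "high"},
            {"id": 102, "sex": "F", "race": "white", "ses": "low"}]
# Every case in this toy example has a match, so the lookup is safe.
print([(case["id"], find_match(case, controls)["id"])
       for case in exposed])  # [(1, 102)]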

Example: In a study of home versus hospital delivery (Ackerman-Liebrich 1996), 489 women who planned to deliver their babies at home were matched with women who planned to deliver at the hospital. Matching was based on age category (5 categories), parity category (3 categories), category of gynecological and obstetric history (24 categories or none), category of medical history (12 categories or none), social class (5 categories), and nationality. Because the matching criteria were so elaborate, they were only able to find a matched hospital delivery for about half of their home deliveries.

Matching will prevent covariate imbalance for those variables used in matching. It will also reduce covariate imbalance for any variables closely related to the matching variables. It will not, however, protect against all covariate imbalance, especially for those covariates that are difficult to measure.

Matching often presents difficult logistical issues, because a matching control subject may not always be available. The logistics are especially difficult when there are several matching variables and when the pool of control subjects that you can draw from is not substantially larger than the pool of treatment/exposed subjects.

Matching is usually reserved for those variables that are known to be highly predictive of the outcome measure. In a cancer study, for example, matching is usually done on smoking. Many neonatology studies will match on gestational age.

1.3.1 Matching in a case control design

When you are selecting patients on the basis of disease and looking back at what exposure might have caused the disease, selection of matching control patients (patients without disease) can sometimes be tricky. You need to find a control that is similar to the case, except for the disease of interest. There are several possibilities, but none of them works perfectly.

Example: In a study of early onset myocardial infarction (Danesh 1999), 1122 survivors of heart attacks between the ages of 30-49 were matched with people of the same age and gender who did not have heart attacks. These controls were recruited from a pool of subjects related to the cases. A second analysis used 510 survivors and their siblings, if the sibling was the same sex and within five years of age. All of the cases and the controls had blood tests to look for Helicobacter pylori infection, which was more commonly found in the cases than the controls.

[Insert additional discussion here]

1.3.2 Matching in a randomized design

In some randomized studies, matching will be used as well. Partly, this is a recognition that randomization will not totally remove covariate imbalance, just like a flip of 100 coins will not always result in exactly 50 heads and 50 tails. More importantly, however, matching in a randomized study will provide extra precision. Matching creates pairs of subjects who will have greater homogeneity and therefore less variability.

Example: In a study of tinnitus (Drew 2001), 1,121 patients were randomly assigned to either Ginkgo biloba or a placebo. The researchers also identified 489 pairs of subjects (978 total) who were the same sex, of similar age (within 10 years), and of similar duration of tinnitus (within 5 years) to try to improve the precision of this study.

1.3.3 Matching can sometimes backfire

In the tinnitus study mentioned above, although there were 1,121 patients, 143 of them did not have a close match in the data and were excluded from the matched analysis. There was also some attrition in the study, which caused a greater loss in the matched analysis. If one of the patients in a pair dropped out, the other patient's data could not be used in the matched analysis. So the analysis of improvement after 4 weeks included only 414 pairs and the analysis after 14 weeks included only 354 pairs. Although the loss in sample size was probably offset by the added precision from the matching, the authors do acknowledge that this was probably "an unnecessary and disadvantageous complication."

In a case-control design, matching can sometimes remove the very effect you are trying to study. If the matching variable is caused by the exposure, or is itself a similar measure of exposure, you might "over match" the data and remove the effect of the exposure. In a study examining radiation exposure and the risk of leukemia at a nuclear reprocessing plant (Marsh 2002), there were 37 workers diagnosed with leukemia (cases), and they were each matched to four control workers. Each of the four control workers had to work at the same site, have the same gender, have the same job code, be born within two years of the case, and be hired within two years of the hire date of the case.

Unfortunately, there was a strong trend between hire date and exposure. Exposures were highest early in the plant's history and declined over time. So both hire date and exposure were measuring the same thing. When the data were matched on hire date, this artifactually controlled the exposure and pretty much ensured that the average radiation exposure would be the same among both the cases and the controls. This led to an estimate of the effect of radiation exposure that was actually slightly negative and not statistically significant. When the data were rematched using all the variables except for hire date, the effect of radiation dose was large, positive, and close to statistical significance.

1.3.4 The crossover design

The crossover design represents a special type of matching. In a crossover design, a subject is randomly assigned to a specific treatment order. Some subjects will receive the standard therapy first, followed by the new therapy (AB). Others will receive the new therapy first, followed by the standard therapy (BA). Since the same subject receives both treatments, there is no possibility of covariate imbalance.
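
The mechanics of the random order assignment are simple; here is a sketch of how it might be coded (the names are invented for illustration):

import random

def assign_orders(subject_ids):
    # Randomly assign half the subjects to the AB order (standard
    # therapy first) and half to the BA order (new therapy first).
    ids = list(subject_ids)
    random.shuffle(ids)
    half = len(ids) // 2
    return {sid: ("AB" if i < half else "BA")
            for i, sid in enumerate(ids)}

print(assign_orders(range(1, 11)))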

Example: In a study of electronic records (Brown 2003), ten physicians were asked to code patient records with two separate systems: Clinical Terms Version 3 and with the Read Codes 5 byte set. Half of the physicians were randomly assigned to code using Clinical Terms Version 3 first and then later with the Read Codes 5 Byte Set. The other half coded using Read Codes 5 Byte Set first.

When therapies are applied in sequence, timing effects are of great concern. Are the therapies set far enough apart so that the effect of one therapy is unlikely to carry over into the other therapy? For example, if the two therapies represent different drugs, did the researchers allow enough time so that one drug was fully eliminated from the body before they administered the second drug?

The washout period can sometimes cause ethical concerns. If you are treating patients for depression, an extensive amount of time during the washout would leave the patient without any effective treatment and increase the chances of something bad happening, like the patient committing suicide.

Learning effects are also a potential problem in a crossover design. You can't use a crossover design, for example, to test alternative training approaches. Imagine the instructions for this study (now forget everything we just told you; we're going to teach it a different way). I guess that would work for the classes I teach; the only things my students remember are the jokes.

Also watch out for the possibility that a subject may get tired or bored. This could lead to the second treatment assigned being worse than the first. Or if the outcome involves skill, maybe "practice makes perfect," leading to the second treatment assigned being better than the first.

If there are timing effects, randomization is critical. Even with randomization, though, timing effects are a problem because they increase uncertainty by adding an extra source of variation.

Special problems arise when each subject always receives one therapy first and it is always followed by the other therapy. Many factors other than the change in therapy can cause a shift in the health of patients over time. If you cannot randomize the order of treatments, you have all the problems of a historical controls study.

1.3.5 Stratification

Stratification is a method similar to matching that tries to achieve covariate balance across broad groups or strata. The selection of subjects in both the treatment group and the control group is constrained to have identical proportions in each stratum. This guarantees covariate balance for the strata themselves and any other factors closely related to the strata.

Example: In a study of medical records (Fine 2003), 54 records from each of 10 cardiac surgery centers were examined for accuracy and completeness. To ensure a good balance, the 54 records at each site were allocated evenly to six different predefined risk strata (nine in each stratum).

Example: In a study of retention of doctors in rural Australia (Humphreys 2002), a random sample of 1400 doctors was sent a questionnaire. The doctors were selected in strata defined by the size of the town they lived in, to keep the proportion in each stratum equivalent to the proportions in the entire population of Australian doctors.

Another use of stratification is to ensure that the sample has numbers in each stratum that are proportional to the numbers in that stratum for the entire population of interest. This helps ensure that the sample is generalizable to the entire population.

The strata are usually broadly drawn. If there were only a small number of possible patients within each stratum, then the logistics would become too difficult. So, for example, stratification by age will usually involve large intervals such as 21-30 years, 31-40 years, and so on.

You cannot stratify on factors that you cannot measure or on information that is not immediately available at the start of the study. And like matching, stratification only works when you have a large pool of subjects to draw from.

Stratification can add precision to a randomized study. A separate randomization list would be drawn up for each stratum. This ensures that each stratum has perfect balance between the treatment group and the control group.
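
Here is a sketch of what drawing up a separate, balanced list for each stratum might look like in Python, using the age intervals mentioned above (the details are invented for illustration):

import random

def stratified_lists(strata_sizes):
    # Draw up a separate, perfectly balanced randomization list
    # for each stratum.
    lists = {}
    for stratum, n in strata_sizes.items():
        schedule = ["Treatment"] * (n // 2) + ["Control"] * (n // 2)
        random.shuffle(schedule)
        lists[stratum] = schedule
    return lists

print(stratified_lists({"21-30 years": 6, "31-40 years": 6}))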

1.3.6 Things to look for in a study with matching

When a study uses matching, look for the following features.

For a study using matching (stratification):

For studies using a cross-over design:

1.4 Did the researchers use statistical adjustments?

Statistical adjustments represent one way of correcting for covariate imbalance. While matching and stratification try to prevent covariate imbalance before it occurs, statistical adjustment corrects for the imbalance after the fact.

Example: A study (Smith 1997) examined the relationship between frequency of orgasm and ten-year mortality among male residents of Caerphilly, South Wales. They divided the men into low, medium, and high frequency groups. Low frequency meant less than monthly and high frequency meant twice a week or more often. This is a study which would have been impossible to randomize--the men (and presumably their wives) determined which group they belonged to. As you might expect, there were demographic differences among the three groups. Age was significantly associated with frequency of orgasm. Men in the low, medium, and high frequency groups were 54, 52, and 50 years old, on average. The job categories also differed, with the proportion of non-manual labor being 29%, 42%, and 42% in the three groups. For other variables (height, body mass index, systolic blood pressure, cholesterol, existing coronary heart disease, and smoking status), the differences were smaller and less important. The adjustments used a combination of regression approaches and weighting. After adjustment, there was a strong trend in mortality, with men in the low frequency group having an adjusted mortality rate twice that of the high frequency group. Both the article itself and a subsequent letter to the editor (Batty 1998) mentioned, however, that additional unmeasured variables could have influenced the outcome.

Example: In a breast feeding study here at Children's Mercy Hospital (Kliethermes 1999), pre-term infants were randomized either to a group that received normal bottle feeding while they were in the hospital or to a nasogastric (ng) tube feeding group. The researchers wanted to see if the latter group of infants, because they had not become habituated to bottle feeding, would be more likely to breastfeed after discharge from the hospital. The randomization was only partially effective at preventing covariate imbalance. The infants had comparable birth weights, gestational ages, and Apgar scores. There were similar proportions of caesarian section and vaginal births in both groups. But the mothers in the ng tube group were older on average than the mothers in the bottle fed group. Since older mothers are more likely to breast feed than younger mothers, we had to include mother's age in an analysis of covariance model so that the effect of ng tube feeding could be estimated independent of mother's age. From a regression model, we discover that older mothers breastfeed for longer periods of time, on average, than younger mothers. In fact, for each year of age, the duration of breastfeeding increases by 0.25 weeks on average. So we would adjust the difference of the two groups by 0.25 weeks for every year in discrepancy between the average mothers' ages.
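
Here is a sketch of this kind of adjustment on simulated data. The numbers are invented to mimic the pattern described above (older mothers in the ng tube group, and duration rising by roughly 0.25 weeks per year of mother's age); it uses the pandas and statsmodels packages.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 100
group = np.repeat(["bottle", "ng_tube"], n)
age = np.concatenate([rng.normal(24, 4, n),    # bottle fed mothers
                      rng.normal(28, 4, n)])   # ng tube mothers, older
duration = (6 + 0.25 * age + 2.0 * (group == "ng_tube")
            + rng.normal(0, 3, 2 * n))
df = pd.DataFrame({"duration": duration, "age": age, "group": group})

# Unadjusted comparison: the group effect is contaminated by the
# difference in the mothers' average ages.
print(smf.ols("duration ~ group", data=df).fit().params)

# Analysis of covariance: adding age estimates the group effect at a
# common mother's age, stripping out the covariate imbalance.
print(smf.ols("duration ~ group + age", data=df).fit().params)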

1.4.1 Imperfectly measured covariates

Some covariates can be measured, but only crudely. If the covariate itself is difficult to measure accurately, then any attempt to make statistical adjustments will be only partially successful. Your measurement may capture only half of the information in the covariate. The half of the covariate that is unaccounted for will remain behind, leading to an unfair comparison. This is sometimes called residual confounding.

Example: In a study of factors influencing Down syndrome (Chen 1999), smoking had a surprisingly protective effect. This could be explained by the age of the mother. Older mothers smoke less and are also at greater risk for the birth of a Down syndrome child. The unadjusted odds ratio for this effect was 0.80 and was borderline statistically significant (95% CI 0.65 to 0.98). A crude adjustment for age used the categories <35 years and >=35 years. With this adjustment, the odds ratio was still small (0.87) and borderline (95% CI 0.71 to 1.07). But when the exact year of age was used for the adjustment, and race and parity were also included, there was no association (odds ratio=1.00, 95% CI 0.82 to 1.24). This shows that an imperfect adjustment can produce an incorrect conclusion.

Self report measures are often measured imperfectly, and are especially troublesome if they require the patient to recall events from the distant past.

Smoking is an important covariate for many studies, and it would be better to ask current smokers about the amount they smoke. For smokers who have quit recently, you might also like to know how recently they quit. For both groups it might also help to know when they started. But often, the only question asked is a yes/no question like "Do you smoke cigarettes?"

Some covariates like blood cholesterol levels are inherently variable. In an ideal world, these covariates would be measured at a second time and the two measures could be averaged to reduce some of the uncertainty. But this is not always possible or practical.

[Expand the discussion of this section.]

1.4.2 Unmeasured covariates

You can only adjust for those things that you can measure. Unfortunately, there are many things such as a patient's psychological state, presence of co-morbid conditions, and initial severity of the disease that are so difficult to assess that they are often just not measured.

[Add discussion about this topic.]

1.4.3 Other alternatives to covariate adjustment

If there is covariate imbalance in the entire sample, there may still be a subgroup where the covariate is balanced. If you can find such a subgroup and it produces results similar to the entire sample, you can have greater confidence in the findings of the entire sample.

Example: In a study of the effect of men's age on time to pregnancy (Hassan 2003), older men tended to have a longer time to pregnancy. These older men, though, also had older wives, on average. This creates an unfair comparison, since the wife's age would probably also influence time to pregnancy. To produce a fairer comparison, the researchers conducted a separate analysis looking at men of all ages who married young wives.

Of course, it is not always possible to find a subgroup without covariate imbalance. And when you do find such a subgroup, the smaller sample size may lead to an unacceptable loss of precision. Furthermore, the subgroup may be somewhat unusual, making it difficult for you to generalize the findings.

Another way to restore balance in a study is the use of weights. Suppose the treatment group includes 25 males and 75 females, but in the population we know that there should be a 50/50 split by gender. We could re-weight the data, so that each male has a weighting factor of 2.0 and each female has a weighting factor of 0.67. This artificially inflates the number of males to 50 and deflates the number of females to 50. The control group might have 40 males and 60 females. For this group, we would use weights of 1.25 and 0.83.
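
The weight calculation itself is just the population share divided by the observed share. Here it is in Python, with the counts from the example above:

def reweight(counts, population_share):
    # Weight each group so the weighted sample matches the
    # population split.
    n = sum(counts.values())
    return {k: population_share[k] * n / counts[k] for k in counts}

# Treatment group: 25 males and 75 females, population split 50/50.
print(reweight({"male": 25, "female": 75}, {"male": 0.5, "female": 0.5}))
# {'male': 2.0, 'female': 0.67 (approximately)}

# Control group: 40 males and 60 females.
print(reweight({"male": 40, "female": 60}, {"male": 0.5, "female": 0.5}))
# {'male': 1.25, 'female': 0.83 (approximately)}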

[Insert a better example here]

The statistical analysis gets a bit tricky with weights, but nothing that a professional statistician can't handle. Weights can also improve the generalizability of a study. If the overall sample has skewed demographics, weights can help bring it back in line with the population of interest.

1.4.4 Matching and adjustments: what to look for.

If a study uses covariate adjustments, look for the following things:

1.5 Summary -- Is randomized better than observational?

Can matching and/or statistical adjustments in an observational study provide a comparison as fair and as persuasive as a randomized study? This is an unfair question, because sometimes a randomized study is just not possible. Also, there are so many different types of observational studies that it would be difficult to come up with a good general answer. Still, some people have tried to answer this question.

An empirical study of observational and randomized studies of the same topic (Concato 2000) found that there was a high level of consistency between the two. This contradicted the previously held belief that observational studies tended to overstate the effectiveness of a new treatment. The debate about this finding continues to rage, but perhaps the quality of the design and the sophistication of the adjustments used in observational studies places them on a level comparable to randomized studies. Another study published on the web (www.symposion.com/nrccs/koch.htm) showed that a large non-randomized registry provided data that was comparable to that collected in randomized studies.

In spite of this research, information from a randomized study is usually considered a stronger form of evidence. Randomization provides a greater level of assurance that the two groups are comparable in every way except for the therapy received. An editorial in the Journal of the American Medical Association (Sherwin 1997) noted the weakness of observational studies while trying to make sense of recent studies of the effect of dietary fat on obesity, heart disease, and stroke. After reviewing numerous studies, the editorial comments:

"At present, most of this evidence in humans is observational and, consequently, an imperfect basis for causal inference. Large scale experimental studies that would provide more compelling data (such as the Women's Health Initiative) cost hundreds of millions of dollars and take decades to complete. Each study can only address the effects of a single nutritional change. Thus, it is still necessary to base advice to patients on dietary information that is less than certain and complete."

Randomized studies do have some weaknesses. The very process of randomization will create an artificial environment that does not represent how medicine is normally practiced (Sackett 1997). When you go to your doctor for assistance with birth control, you do not expect him/her to randomly assign you to a particular method. And if your doctor said you had a 50% chance of getting a placebo contraceptive, you would probably switch doctors. Because an observational study does not have to cope with the intrusion of the randomization process, it can often study medicine in an environment much closer to reality.

Another problem with randomized designs is the limit to their size and scope. The logistics of randomization make it more expensive than a comparable observational study. Thus effects that require a very large sample size to detect (such as rare side effects) or effects that take a long time to manifest themselves (such as the progression of many types of cancer) cannot be examined in a randomized experiment. An observational approach like post marketing surveillance is more likely to be successful in these situations.

Furthermore, the use of a placebo in a randomized trial creates an artificial situation where patients are more likely to drop out and less likely to report side effects (Rochon 1999).

Studies of the potential harm caused by environmental exposures (such as lead based paint, second hand tobacco smoke, or electro-magnetic fields) are often impossible to randomize because of logistical and ethical issues.

On the other hand, observational studies often require either matching or statistical adjustments. While both matching and adjustments can help to some extent with covariate imbalance, these approaches do not work as well as randomization. In particular, some of the covariate imbalance may be due to factors that are difficult to measure like the psychological state of the patient, initial severity of the disease, and/or the presence of comorbid conditions. All of these factors can influence the outcome, but if you can't measure them easily, matching or adjustment is not possible.

Generally, the advantages of a randomized design outweigh the disadvantages. All other things being equal, a randomized study provides a higher standard of evidence than an observational study. Nevertheless, much can be learned from observational studies. Even though observational studies provide weaker evidence, if you can bring other data to bear on the problem, as through replication or the establishment of a scientific mechanism, you can gain quite persuasive evidence from observational data. Almost everything we know about the risks of cigarette smoking, for example, came from observational designs. The identification of Reye's syndrome and its link to aspirin was also established solely through observational data.

Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 United States License. It was written by Steve Simon on 2003-07-03, edited by Steve Simon, and was last modified on 2008-06-24. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Statistical evidence

