These pages discuss modeling issues that are relevant across a broad class of
statistical models. A page may mention a specific model, such as logistic
regression, to provide context, but the ideas generalize easily to other
models. Also see Category: Covariate adjustment, Category: Linear regression, Category: Logistic regression,
and Category: Unusual data.
Other entries about modeling issues can be found in the
modeling issues page at the
StATS website.
2010
- P.Mean: What is principal components analysis?
(created 2010-07-19). I was asked to help someone who was reviewing a
paper that used principal components analysis (PCA) as part of the statistical
methodology. I have not yet seen the article, so I could only offer very
general advice.
- P.Mean: Calculating weights to
correct for over and under sampling (created 2010-03-22). Someone asked
how to use weights to adjust for the fact that certain strata in a study were
recruited more vigorously than other strata. For example, suppose you sampled
at four communities and noted the age distribution as 0-14 years, 15-39 years,
and 40+ years. How would you adjust for differential age distributions?
- P.Mean: Can sex be an outcome variable
(created 2010-03-16). Someone asked whether it was legitimate to use sex
(gender) as a dependent variable or outcome variable in a logistic regression
model. It seems wrong, on the face of it, to think that various factors can
influence whether we are male or female. It actually is perfectly fine to use
sex as an outcome variable. Here is how I would justify its use.
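
The weighting question in the entry on over and under sampling can be sketched with a post-stratification style calculation: each stratum's weight is its share of the target population divided by its share of the sample. Only the three age groups come from the entry. The counts and population shares below are invented placeholders, and this is one common approach, not necessarily the one the linked page recommends.

```python
# Hypothetical illustration: every number below is invented for the example.
age_groups = ["0-14", "15-39", "40+"]

# Assumed share of each age group in the target population (sums to 1).
population_share = {"0-14": 0.25, "15-39": 0.40, "40+": 0.35}

# Assumed number of people actually sampled in each age group.
sample_count = {"0-14": 50, "15-39": 100, "40+": 50}

n = sum(sample_count.values())

# Weight = population share / sample share.
# Over-sampled groups get weights below 1, under-sampled groups above 1.
weights = {g: population_share[g] / (sample_count[g] / n) for g in age_groups}

# Check: after weighting, the sample shares match the population shares.
weighted_share = {g: weights[g] * sample_count[g] / n for g in age_groups}

for g in age_groups:
    print(f"{g:>6}: sample share {sample_count[g] / n:.2f}, "
          f"weight {weights[g]:.2f}, weighted share {weighted_share[g]:.2f}")
```

In this made-up sample the 15-39 group is over-represented, so its members are weighted down, while the 40+ group is weighted up.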
2009
- P.Mean: Formula for multiple
imputation (created 2009-07-24). I'm working on a project that involves
multiple imputation, and I may have to program some of the work myself. I can
use the R package MICE to generate the imputed data sets, but then I have to
use a mixed linear model rather than a linear model. How do I combine the
estimates from the multiple imputed data sets? The estimate is just the
average of the individual estimates, but what about the standard error?
- P.Mean: Fewer than 10 events per
variable (created 2009-02-18). I am in the
process of advising on the design of a study using logistic regression. There
are five confounding variables and a treatment variable. If I apply the rule
that you need 10 events per variable (EPV), then I need 60 events. I expect
that the probability of observing an event is 40%. This means that I'll need
data on 60 / 0.4 = 150 patients. I can only collect data on 90 patients, and
that sample size gives me more than adequate power. Since my power will be
fine, can I ignore the rule of thumb about 10 EPV?
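
The arithmetic in the events per variable entry can be checked with a few lines of code. The numbers (five confounders plus a treatment variable, 10 EPV, a 40% event rate, 90 available patients) come from the entry; the function names are hypothetical helpers written for this sketch.

```python
import math

def required_sample_size(n_variables, events_per_variable, event_probability):
    """Patients needed so the expected number of events satisfies the EPV rule."""
    required_events = n_variables * events_per_variable
    # Round before taking the ceiling to guard against floating point noise.
    return math.ceil(round(required_events / event_probability, 9))

def expected_epv(n_patients, event_probability, n_variables):
    """Expected events per variable for a fixed sample size."""
    return n_patients * event_probability / n_variables

n_variables = 5 + 1  # five confounders plus the treatment variable

needed = required_sample_size(n_variables, 10, 0.4)
achieved = expected_epv(90, 0.4, n_variables)

print(f"Patients needed for 10 EPV: {needed}")
print(f"Expected EPV with 90 patients: {achieved:.1f}")
```

With 90 patients the expected number of events is 36, which works out to about 6 events per variable rather than 10.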
2008
- P.Mean: Using a sub-optimal
approach in meta-analysis (created 2008-12-06). I am having difficulty
understanding the meta-analysis of ordinal data in a Cochrane systematic
review, and would appreciate advice and comments. One study in the
meta-analysis had an ordinal efficacy outcome with categories None, Some,
Good, and Excellent. The meta-analysis did 4 separate analyses, treating each
category as if it were a dichotomous outcome. Aside from the fact that this
generates (almost) more analyses than there are data, this approach seems
unnecessary and uninterpretable. The Cochrane Handbook says: "Ordinal and
measurement scale outcomes are most commonly meta-analysed as
dichotomous data." And "Occasionally it is possible to analyse the data using
proportional odds models where ordinal scales have a small number of
categories, the numbers falling into each category for each intervention group
can be obtained, and the same ordinal scale has been used in all studies."
What should the authors of the systematic review have done?
- P.Mean: Explaining CART models in simple
terms (created 2008-11-05). I need some help understanding and explaining Classification and Regression
Trees (CART). I am personally not familiar with this technique. When would
someone select this over a linear or logistic regression model?
- P.Mean: What's the difference between
regression and ANOVA? (created 2008-10-15). Someone asked me to explain the
difference between regression and ANOVA. That's challenging because regression
and ANOVA are like the flip sides of the same coin. They are different, but they
have more in common than you might think at first glance.
- P.Mean: A simple example of
overfitting (created 2008-10-08). A couple of the Internet discussion groups that I participate in have been
discussing the concept of overfitting. Overfitting occurs when a model is too
complex for a given sample size. I want to show a simple example of the
negative consequences of overfitting.
- P.Mean: Using ANOVA for a sum of Likert scaled
variables (created 2008-10-09). I want to analyse data derived from a
questionnaire. The range of possible values that my variable can take goes from
20 to 100. No evidence for rejecting the hypothesis of normality was found. I
would therefore apply an ANOVA, but I still have some doubts whether this
method of analysis is valid, since the range of my dependent variable is not
[-infinity; +infinity]. Is the ANOVA a valid method of analysis or are there other
approaches I can apply?
- P.Mean: What distribution does this
data come from? (created 2008-07-23). I'm very interested in assessing
distributional fits for empirical data and I've found tidbits of information
here and there but no real good source. Could you recommend a few good sources?
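
The idea in the overfitting entry, that a model can be too complex for a given sample size, can be illustrated with a small simulation. This is a minimal sketch and not the example from the linked page: the data-generating line, the noise level, the seed, and the sample sizes are all assumptions chosen for illustration. It compares a straight-line fit with a polynomial that passes through every training point.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial of degree n - 1 that passes through every point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def sse(model, xs, ys):
    """Sum of squared prediction errors."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

rng = random.Random(1)  # fixed seed so the run is repeatable

def simulate(xs):
    """Assumed truth: y = 1 + 2x plus standard normal noise."""
    return [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

train_x = [float(i) for i in range(8)]
test_x = [i + 0.5 for i in range(7)]  # new points between the training points
train_y, test_y = simulate(train_x), simulate(test_x)

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

for name, model in [("straight line", line), ("degree 7 polynomial", poly)]:
    print(f"{name:>20}: training SSE {sse(model, train_x, train_y):10.3f}, "
          f"hold-out SSE {sse(model, test_x, test_y):10.3f}")
```

The polynomial reproduces the training data exactly, which looks like a perfect fit but mostly reflects fitted noise. The hold-out error is printed so the two models can be compared on data they have not seen.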
Outside resources:
- Anova for Unbalanced Data: An Overview. Shaw RG, Mitchell-Olds T.
Ecology 74:6 (Sep., 1993); 1638-1645.
[Abstract]
[PDF]. Description: An analysis of variance model with multiple
factors is very easy to analyze when the data is balanced, that is, when
every combination of the factors has the same number of observations. If
some combinations have more or fewer observations, you need to approach the
ANOVA model very carefully. This article shows some of the issues you need
to be aware of with unbalanced data.
- Steve Miller. Biostatistics, Open Source and BI – an Interview with
Frank Harrell. Description: This article, published in Information
Management Online, February 25, 2009, offers a nice interview with Frank
Harrell, a leading proponent of modern statistical methods. Excerpt: "My
correspondence with Frank provided the opportunity to ask him to do an
interview for the OpenBI Forum. He graciously accepted, turning around deft
responses to my sometimes ponderous questions in very short order. What
follows is text for our questions and answer session. I trust that readers
will learn as much from Frank’s responses as I did." [Accessed July 19,
2010]. Available at:
http://www.information-management.com/news/10015023-1.html.
- Phil Ender. Centering (ED230B/C). Excerpt: "Centering a
variable involves subtracting the mean from each of the scores, that is,
creating deviation scores. Centering can be done two ways; 1) centering
using the grand mean and 2) centering using group means, which is also known
as context centering." [Accessed July 26, 2010]. Available at:
http://www.gseis.ucla.edu/courses/ed230bc1/notes4/center.html.
- McCandless L, Gustafson P, Austin P, Levy A. Covariate balance in a
Bayesian propensity score analysis of beta blocker therapy in heart failure
patients. Epidemiologic Perspectives & Innovations. 2009;6(1):5.
Available at:
http://www.epi-perspectives.com/content/6/1/5 [Accessed September 14,
2009]. Abstract: Regression adjustment for the propensity score is a
statistical method that reduces confounding from measured variables in
observational data. A Bayesian propensity score analysis extends this idea
by using simultaneous estimation of the propensity scores and the treatment
effect. In this article, we conduct an empirical investigation of the
performance of Bayesian propensity scores in the context of an observational
study of the effectiveness of beta-blocker therapy in heart failure
patients. We study the balancing properties of the estimated propensity
scores. Traditional Frequentist propensity scores focus attention on
balancing covariates that are strongly associated with treatment. In
contrast, we demonstrate that Bayesian propensity scores can be used to
balance the association between covariates and the outcome. This balancing
property has the effect of reducing confounding bias because it reduces the
degree to which covariates are outcome risk factors.
- Karyn Heavner, Carl Phillips, Igor Burstyn, Warren Hare.
Dichotomization: 2 x 2 (x2 x 2 x 2...) categories: infinite possibilities.
BMC Medical Research Methodology. 2010;10(1):59. Abstract: "BACKGROUND:
Consumers of epidemiology may prefer to have one measure of risk arising
from analysis of a 2-by-2 table. However, reporting a single measure of
association, such as one odds ratio (OR) and 95% confidence interval, from a
continuous exposure variable that was dichotomized withholds much
potentially useful information. Results of this type of analysis are often
reported for one such dichotomization, as if no other cutoffs were
investigated or even possible. METHODS: This analysis demonstrates the
effect of using different theory and data driven cutoffs on the relationship
between body mass index and high cholesterol using National Health and
Nutrition Examination Survey data. The recommended analytic approach,
presentation of a graph of ORs for a range of cutoffs, is the focus of most
of the results and discussion. RESULTS: These cutoff variations resulted in
ORs between 1.1 and 1.9. This allows investigators to select a result that
either strongly supports or provides negligible support for an association;
a choice that is invisible to readers. The OR curve presents readers with
more information about the exposure disease relationship than a single OR
and 95% confidence interval. CONCLUSION: As well as offering results for
additional cutoffs that may be of interest to readers, the OR curve provides
an indication of whether the study focuses on a reasonable representation of
the data or outlier results. It offers more information about trends in the
association as the cutoff changes and the implications of random
fluctuations than a single OR and 95% confidence interval." [Accessed
July 19, 2010]. Available at:
http://www.biomedcentral.com/1471-2288/10/59.
- H. Gilbert Welch, Lisa M. Schwartz, Steven Woloshin. The exaggerated
relations between diet, body weight and mortality: the case for a
categorical data approach. CMAJ. 2005;172(7):891-895. Excerpt:
"Multivariate analysis has become a major statistical tool for medical
research. It is most commonly used for adjustment — the process of
correcting the main effect for multiple variables that confound the relation
between exposure and outcome in an observational study. Any apparent
relation between estrogen replacement and dementia, for example, should be
adjusted for socioeconomic status, a variable that is known to relate both
to access (and thus the likelihood of having received estrogen) and to
measures of cognitive function (and thus the likelihood of being diagnosed
with dementia). The capacity to account for numerous variables (e.g.,
income, education and insurance status) simultaneously constitutes a major
advance in the ability of researchers to estimate the true effect of the
exposure of interest. But this advance has come at a cost: the actual
relation between exposure and outcome is increasingly opaque to readers,
researchers and editors alike." [Accessed July 26, 2010]. Available at:
http://www.ecmaj.com/cgi/content/full/172/7/891.
- The Fourth Quadrant: A Map of the Limits of Statistics. Nassim Nicholas
Taleb, published September 15, 2008 by the Edge Foundation, Inc. Excerpt:
When Nassim Taleb talks about the limits of statistics, he becomes outraged.
"My outrage," he says, "is aimed at the scientist-charlatan putting society
at risk using statistical methods. This is similar to iatrogenics, the study
of the doctor putting the patient at risk." As a researcher in probability,
he has some credibility. In 2006, using FNMA and bank risk managers as his
prime perpetrators, he wrote the following: The government-sponsored
institution Fannie Mae, when I look at its risks, seems to be sitting on a
barrel of dynamite, vulnerable to the slightest hiccup. But not to worry:
their large staff of scientists deemed these events "unlikely." In the
following Edge original essay, Taleb continues his examination of Black
Swans, the highly improbable and unpredictable events that have massive
impact. He claims that those who are putting society at risk are "no true
statisticians", merely people using statistics either without understanding
them, or in a self-serving manner. "The current subprime crisis did wonders
to help me drill my point about the limits of statistically driven claims,"
he says. URL: www.edge.org/documents/archive/edge257.html#taleb
- Interesting quote: The purpose of models is not to fit the data but to sharpen the
questions. Samuel Karlin (1924 - ) as quoted in
www.causeweb.org/resources/fun/db.php?id=102.
- Mitchell H. Katz. Multivariable Analysis: A Primer for Readers of
Medical Research. Annals of Internal Medicine. 2003;138(8):644-650.
Abstract: "Many clinical readers, especially those uncomfortable with
mathematics, treat published multivariable models as a black box, accepting
the author's explanation of the results. However, multivariable analysis can
be understood without undue concern for the underlying mathematics. This
paper reviews the basics of multivariable analysis, including what
multivariable models are, why they are used, what types exist, what
assumptions underlie them, how they should be interpreted, and how they can
be evaluated. A deeper understanding of multivariable models enables readers
to decide for themselves how much weight to give to the results of published
analyses." [Accessed July 7, 2010]. Available at:
http://www.annals.org/content/138/8/644.abstract.
- Negative Consequences of Dichotomizing Continuous Predictor Variables.
Gary McClelland. Description: This Java applet shows
graphically how creating a median split for a predictor variable leads to
loss of precision and power. This website was last verified on
2003-02-10. URL: psych.colorado.edu/~mcclella/MedianSplit
- Vittinghoff E, McCulloch CE. Relaxing the Rule of Ten Events per
Variable in Logistic and Cox Regression. Am. J. Epidemiol.
2007;165(6):710-718. Available at:
http://aje.oxfordjournals.org/cgi/content/abstract/165/6/710
[Accessed February 18, 2009]. Description: This article examines the rule
that you need 10 events per independent variable. Some sources cite 15
events and others 20 events per independent variable. The authors argue that
in the context of adjusting for confounders, this rule might be relaxed a
bit.
- Simpson's Paradox, Lord's Paradox, and Suppression
Effects are the same phenomenon - the reversal paradox. YK Tu, D Gunnell,
Gilthorpe MS. Emerg Themes Epidemiol 2008: 5; 2.
[Medline]
[Abstract] [Full text]
[PDF]. Description: This article provides a nice overview of how associations
between two variables can be modified by a third variable.
- P Peduzzi, J Concato, E Kemper, T R Holford, A R Feinstein. A
simulation study of the number of events per variable in logistic regression
analysis. J Clin Epidemiol. 1996;49(12):1373-1379. Abstract: "We
performed a Monte Carlo study to evaluate the effect of the number of events
per variable (EPV) analyzed in logistic regression analysis. The simulations
were based on data from a cardiac trial of 673 patients in which 252 deaths
occurred and seven variables were cogent predictors of mortality; the number
of events per predictive variable was (252/7 =) 36 for the full sample. For
the simulations, at values of EPV = 2, 5, 10, 15, 20, and 25, we randomly
generated 500 samples of the 673 patients, chosen with replacement,
according to a logistic model derived from the full sample. Simulation
results for the regression coefficients for each variable in each group of
500 samples were compared for bias, precision, and significance testing
against the results of the model fitted to the original sample. For EPV
values of 10 or greater, no major problems occurred. For EPV values less
than 10, however, the regression coefficients were biased in both positive
and negative directions; the large sample variance estimates from the
logistic model both overestimated and underestimated the sample variance of
the regression coefficients; the 90% confidence limits about the estimated
values did not have proper coverage; the Wald statistic was conservative
under the null hypothesis; and paradoxical associations (significance in the
wrong direction) were increased. Although other factors (such as the total
number of events, or sample size) may influence the validity of the logistic
model, our findings indicate that low EPV can lead to major problems."
[Accessed June 14, 2010]. Available at:
http://www.ncbi.nlm.nih.gov/pubmed/8970487.
- Wuensch K. Stepwise Regression = Voodoo Regression. Available at:
http://core.ecu.edu/psyc/wuenschk/StatHelp/Stepwise-Voodoo.htm [Accessed
April 16, 2009]. Excerpt: It is pretty cool, but not necessarily very
useful, and just plain dangerous in the hands of somebody not well educated
in the multiple regression techniques, including effects of collinearity,
redundancy, and suppression. Here are some quotes from others I have
collected from the now departed STAT-L.
- Flom PL, Cassell DL. Stopping stepwise: Why stepwise and similar
selection methods are bad, and what you should use. Available at:
http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf [Accessed April 24,
2009].
- Sylvia Sudat, Elizabeth Carlton, Edmund Seto, Robert Spear, Alan
Hubbard. Using variable importance measures from causal inference to rank
risk factors of schistosomiasis infection in a rural setting in China.
Epidemiologic Perspectives & Innovations. 2010;7(1):3. Abstract:
"BACKGROUND: Schistosomiasis infection, contracted through contact with
contaminated water, is a global public health concern. In this paper we
analyze data from a retrospective study reporting water contact and
schistosomiasis infection status among 1011 individuals in rural China. We
present semi-parametric methods for identifying risk factors through a
comparison of three analysis approaches: a prediction-focused machine
learning algorithm, a simple main-effects multivariable regression, and a
semi-parametric variable importance (VI) estimate inspired by a causal
population intervention parameter. RESULTS: The multivariable regression
found only tool washing to be associated with the outcome, with a relative
risk of 1.03 and a 95% confidence interval (CI) of 1.01-1.05. Three types of
water contact were found to be associated with the outcome in the
semi-parametric VI analysis: July water contact (VI estimate 0.16, 95% CI
0.11-0.22), water contact from tool washing (VI estimate 0.88, 95% CI
0.80-0.97), and water contact from rice planting (VI estimate 0.71, 95% CI
0.53-0.96). The July VI result, in particular, indicated a strong
association with infection status - its causal interpretation implies that
eliminating water contact in July would reduce the prevalence of
schistosomiasis in our study population by 84%, or from 0.3 to 0.05 (95% CI
78%-89%). CONCLUSIONS: The July VI estimate suggests possible within-season
variability in schistosomiasis infection risk, an association not detected
by the regression analysis. Though there are many limitations to this study
that temper the potential for causal interpretations, if a high-risk time
period could be detected in something close to real time, new prevention
options would be opened. Most importantly, we emphasize that traditional
regression approaches are usually based on arbitrary pre-specified models,
making their parameters difficult to interpret in the context of real-world
applications. Our results support the practical application of analysis
approaches that, in contrast, do not require arbitrary model
pre-specification, estimate parameters that have simple public health
interpretations, and apply inference that considers model selection as a
source of variation." [Accessed July 16, 2010]. Available at:
http://www.epi-perspectives.com/content/7/1/3.
- What you see may not be what you get: a brief, nontechnical
introduction to overfitting in regression-type models. Babyak MA.
Psychosomatic Medicine 66:411-421 (2004).
[Abstract]
[Full text]
[PDF]. Description: If you have too many variables relative to the
amount of data you have, then your model will suffer from overfitting. This
article outlines the problems caused by overfitting and offers some
solutions.
- Why do we still use stepwise modelling in ecology and behaviour?
Whittingham MJ, Stephens PA, Bradbury RB, Freckleton RP. J Anim Ecol. 2006
Sep;75(5):1182-9.
[Medline]
[Abstract]
[PDF]. Description: This article reviews the continued use of
stepwise regression methods in leading ecological and behavioral journals
and explains the drawbacks of this approach.
All of the material above this paragraph is licensed under a
Creative Commons Attribution 3.0 United States License. This page was written by
Steve Simon and was last modified on
2010-07-26. The material
below this paragraph links to my
old website, StATS. Although I wrote all of the material
listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright
ownership of this material. The brief excerpts shown here are included under
the fair use provisions of U.S. Copyright laws.
2008
- Stats: Presenting unadjusted and
adjusted estimates side by side (March 24, 2008). Someone on the Medstats
discussion group asked about reporting the analysis of a model without
adjustment for covariates along with the analysis adjusted for covariates.
What is the purpose of reporting the unadjusted analysis?
- Stats: Assessing the assumption of
an exponential distribution (February 25, 2008). The following 41
observations: 8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2,
12, 1, 5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5, represent
waiting times that we suspect follow an exponential distribution. There are
several ways to examine this belief, and the simplest way is to draw a Q-Q
plot for the exponential distribution.
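
The coordinates of the Q-Q plot mentioned in the exponential distribution entry can be computed as follows. The 41 waiting times are the ones listed in the entry. The plotting positions (i - 0.5) / n are one common convention and are an assumption here, as is estimating the exponential mean by the sample mean. The sketch prints the pairs and leaves the actual plotting to whatever graphics tool is at hand.

```python
import math

# The 41 waiting times listed in the entry.
waits = [8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2,
         12, 1, 5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5]

n = len(waits)
mean_wait = sum(waits) / n  # sample mean, used as the estimated exponential mean

# Observed quantiles are simply the sorted data.
observed = sorted(waits)

# Assumed plotting positions: (i - 0.5) / n for i = 1, ..., n.
probs = [(i - 0.5) / n for i in range(1, n + 1)]

# Exponential quantile function: Q(p) = -mean * ln(1 - p).
theoretical = [-mean_wait * math.log(1 - p) for p in probs]

print("theoretical  observed")
for theo, obs in zip(theoretical, observed):
    print(f"{theo:11.2f}  {obs:8d}")
```

If the waiting times are roughly exponential, the pairs should fall close to a straight line through the origin with slope 1. Systematic curvature, especially in the upper tail, suggests a different distribution.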
2007
- Stats: When should you use a log
transformation? (December 28, 2007). Dear Professor Mean, How do I
know whether it is appropriate to use a log transformation for my data?
- Stats: The order of
entering interactions into a model (September 20, 2007). Dear
Professor Mean, I like your titanic example! But shouldn't you enter the
interaction term on a second step following entry of the main effects on the
first step? If you enter the terms all at the same time, the interaction term
will compete for variance with the two main effects on which it depends.
- Stats: Are we assuming a normal
sample or a normal population? (August 30, 2007). Dear Professor Mean,
I'm fitting an ANOVA model to a sample of 25 observations, and the data is
skewed. I'm quite worried about this, but my husband reassures me that this
is not a problem. He says that the assumption is that the population is
normal, not the sample. Should I listen to him?
- Stats: How good is my prediction?
(August 13, 2007). Dear Professor Mean, I have two time series of
data, one actual and one predicted. Since I'm quite new to statistical
methods, I would like to know what methods are used to evaluate the difference
between the two time series. I would like to be able to say something like "the
predicted values were 70% accurate."
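
For the question about comparing an actual and a predicted time series, three widely used summaries are the mean absolute error (MAE), the root mean squared error (RMSE), and the mean absolute percentage error (MAPE). The sketch below uses two short invented series. It is not taken from the linked page, and a statement like "70% accurate" has no single agreed definition, so treat these measures as options rather than as the answer.

```python
import math

# Hypothetical series, invented for illustration.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [90.0, 115.0, 120.0, 143.0]

n = len(actual)
errors = [p - a for a, p in zip(actual, predicted)]

# Mean absolute error: average size of the miss, in the data's own units.
mae = sum(abs(e) for e in errors) / n

# Root mean squared error: like MAE, but penalizes large misses more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# Mean absolute percentage error: average miss relative to the actual value.
# Only meaningful when no actual value is zero or close to zero.
mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
print(f"MAPE = {mape:.2f}%")
```

MAE and RMSE are expressed in the units of the series, while MAPE is unit free, which is probably the closest of the three to a percentage statement.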
2006
- Stats: Frank Harrell's
Philosophy of Biostatistics (October 10, 2006). There are a lot of people
in the world who are a lot smarter than I am and it is always a humbling
experience when I recognize how little I really know. Frank Harrell, chair of
the Department of Biostatistics at Vanderbilt University, is one of those
people.
- Stats: Slash and burn models (June
26, 2006). I received an email question about developing a logistic
regression model with some interaction terms. One of the interaction terms
was statistically significant but one or both of the main effects associated
with the interaction was not. So is it okay, I was asked, to include the
interaction in the final model but not the non-significant main effects?
First, I need to comment on the "slash and burn" model building
practice that this person is using. A recent posting to the MedStats email
discussion group outlines problems with this approach (although it does not
use the term "slash and burn"). The person who adopts a "slash and burn"
approach to models has a parsimonious intent. He/she wants to use as few
degrees of freedom as possible in the final statistical model and one way to
do this is to strip out anything that has an insignificant p-value. The ideal
in the "slash and burn" world is a model where every single p-value is
smaller than 0.05.
- Stats: Multicollinearity is
not a violation of assumptions (January 20, 2006). A colleague from my
days at the National Institute for Occupational Safety and Health emailed me
a question. Apparently, one of the co-authors of a paper he is writing is in
a bit of a panic because the linear regression model that they are using has
multicollinearity. She calls this a violation of assumptions and wonders if
she should look at certain transformations that are difficult to interpret
but which remove much of the multicollinearity. To me this seems like jumping
from the frying pan into the fire.
2005
- Stats: I abhor Lilliefor and
other tests of normality (April 14, 2005). Someone asked me about a log
transformation for their data. It seemed like a good idea, based on my
general comments on the log transformation, but the test of significance for
normality (the Lilliefors test) was still rejected even after the log
transformation. In general, I dislike the Lilliefors test (and other tests of
normality like the Shapiro-Wilk test).
2004
- Stats: Discrepancy between
univariate and multivariate models (November 12, 2004). Someone asked me
about an analysis that showed certain factors were predictive of a health
outcome when considered individually. When these factors were included in a
multivariate model that included other factors, they were no longer
statistically significant. This is worth investigating further but perhaps
you need to live with a bit of ambiguity in the data.
- Stats: What is the best statistical
model? (September 17, 2004). Someone asked me by email about the
advantages and disadvantages of various statistical models (multinomial
logistic regression, ordinal logistic regression, and structural equations
models). This is a somewhat difficult question to answer by email, but as a
general rule, I think that people worry too much about the particular model
that they choose.
- Stats: Central Limit Theorem (March 9, 2004).
Dear Professor Mean, How does the central limit theorem affect the
statistical tests that I might use for my data?
2003
- Stats: What does "overfitting" mean? (July
24, 2003). Dear Professor Mean, I am conducting binary logistic
regression analyses with a sample size of 80 of which 20 have the outcome of
interest (e.g. are "very successful" versus somewhat/not very successful). I
have thirty possible independent variables which I examined in a univariate
logistic regression with the dependent variable. Of these thirty, five look
like they might have a relationship with the dependent variable. Now I want
to include these six variables in a stepwise logistic regression model, but I
am worried about overfitting the data. I have heard that there should be
about 10 cases with the outcome of interest per independent variable to avoid
overfitting. What exactly does overfitting mean?
2002
- Stats: Log transformation (October 11, 2002).
Dear Professor Mean, I have some data
that I need help with analysis. One suggestion is that I use a log
transformation. Why would I want to do this? -- Stumped Susan
- Stats: Checking the assumption of normality
(September 11, 2002). Dear Professor Mean, I have some data that don't
seem to meet the assumption of normality. What should I do? -Anxious Abby
2000
- Stats: What is collinearity? (January 27,
2000). Dear Professor Mean, Could you describe the term collinearity
for me? I understand that it has to do with variables which are not totally
independent, but that is all I know!
- Stats: Best fitting curve (January 26, 2000).
Dear Professor Mean: I have a graph of the trend for the mean frequency of
injuries among children from 1 to 11 years of age. The shape of the curve
suggests a nonlinear relationship between the age and the frequency of
injuries. Is there some software that would provide the best fitting curve
for this data from among a large family of nonlinear curves?