Category: Modeling issues (created 2007-08-13). These pages discuss issues about
statistical models which are relevant across a broad class of models. These
pages may mention a specific model like logistic regression to provide context,
but the ideas generalize easily to other models. Also see Category: Covariate adjustment, Category: Linear regression, Category: Logistic regression,
and Category: Unusual data.
Other entries about modeling issues can be found in the
modeling issues page at the
StATS website.
2008
- P.Mean: Using a sub-optimal
approach in meta-analysis (created 2008-12-06). I am having difficulty
understanding the meta-analysis of ordinal data in a Cochrane systematic
review, and would appreciate advice and comments. One study in the
meta-analysis had an ordinal efficacy outcome with categories None, Some,
Good, and Excellent. The meta-analysis did 4 separate analyses, treating each
category as if it were a dichotomous outcome. Aside from the fact that this
generates (almost) more analyses than there are data, this approach seems
unnecessary and uninterpretable. The Cochrane Handbook says: "Ordinal and
measurement scale outcomes are most commonly meta-analysed as
dichotomous data." And "Occasionally it is possible to analyse the data using
proportional odds models where ordinal scales have a small number of
categories, the numbers falling into each category for each intervention group
can be obtained, and the same ordinal scale has been used in all studies."
What should the authors of the systematic review have done?
- P.Mean: Explaining CART models in simple
terms (created 2008-11-05). I need some help understanding and explaining Classification and Regression
Trees (CART). I am personally not familiar with this technique. When would
someone select this over a linear/logistic regression model?
- P.Mean: What's the difference between
regression and ANOVA? (created 2008-10-15). Someone asked me to explain the
difference between regression and ANOVA. That's challenging because regression
and ANOVA are like the flip sides of the same coin. They are different, but they
have more in common than you might think at first glance.
- P.Mean: A simple example of
overfitting (created 2008-10-08). A couple of the Internet discussion groups that I participate in have been
discussing the concept of overfitting. Overfitting occurs when a model is too
complex for a given sample size. I want to show a simple example of the
negative consequences of overfitting.
- P.Mean: Using ANOVA for a sum of Likert scaled
variables (created 2008-10-09). I want to analyse data derived from a
questionnaire. The range of possible values that my variable can take goes from
20 to 100. No evidence for rejecting the hypothesis of normality was found. I
would therefore apply an ANOVA, but I still have some doubts whether this
method of analysis is valid, since the range of my dependent variable is not
[-infinity; +infinity]. Is the ANOVA a valid method of analysis or are there other
approaches I can apply?
- P.Mean: What distribution does this
data come from? (created 2008-07-23). I'm very interested in assessing
distributional fits for empirical data and I've found tidbits of information
here and there but no real good source. Could you recommend a few good sources?
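To make the overfitting entry above concrete, here is a minimal Python sketch. It is not the example from the linked page. It assumes a straight-line relationship with normal noise and compares a two-parameter line with a polynomial that passes through all 10 training points. The names true_mean, fit_line, and fit_interpolating_polynomial are hypothetical and used only for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def true_mean(x):
    # Assumed underlying relationship (an illustration, not real data)
    return 2.0 + 0.5 * x

# Ten training points and nine new points between them, all with noise
xs_train = [float(i) for i in range(10)]
ys_train = [true_mean(x) + random.gauss(0, 1) for x in xs_train]
xs_new = [x + 0.5 for x in xs_train[:-1]]
ys_new = [true_mean(x) + random.gauss(0, 1) for x in xs_new]

def fit_line(xs, ys):
    """Ordinary least squares straight line (2 parameters)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return lambda x: intercept + slope * x

def fit_interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial through every point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            weight = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    weight *= (x - xm) / (xj - xm)
            total += yj * weight
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

line = fit_line(xs_train, ys_train)
poly = fit_interpolating_polynomial(xs_train, ys_train)

train_mse_line = mse(line, xs_train, ys_train)
train_mse_poly = mse(poly, xs_train, ys_train)
new_mse_line = mse(line, xs_new, ys_new)
new_mse_poly = mse(poly, xs_new, ys_new)

print("training MSE: line", train_mse_line, "polynomial", train_mse_poly)
print("new-data MSE: line", new_mse_line, "polynomial", new_mse_poly)
```

The polynomial reproduces the training data exactly, so its training error is zero. Its error on the new points is typically much larger than the line's, because it has fitted the noise as well as the signal.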
Outside resources:
- Anova for Unbalanced Data: An Overview. Shaw RG, Mitchell-Olds T.
Ecology 74:6 (Sep., 1993); 1638-1645.
[Abstract]
[PDF]. Description: An analysis of variance model with multiple
factors is very easy to analyze when the data is balanced, that is, when
every combination of the factors has the same number of observations. If
some combinations have more or fewer observations, you need to approach the
ANOVA model very carefully. This article shows some of the issues you need
to be aware of with unbalanced data.
- The
Fourth Quadrant: A Map of the Limits of Statistics. Nassim Nicholas
Taleb, published September 15, 2008 by the Edge Foundation, Inc. Excerpt:
When Nassim Taleb talks about the limits of statistics, he becomes outraged.
"My outrage," he says, "is aimed at the scientist-charlatan putting society
at risk using statistical methods. This is similar to iatrogenics, the study
of the doctor putting the patient at risk." As a researcher in probability,
he has some credibility. In 2006, using FNMA and bank risk managers as his
prime perpetrators, he wrote the following: The government-sponsored
institution Fannie Mae, when I look at its risks, seems to be sitting on a
barrel of dynamite, vulnerable to the slightest hiccup. But not to worry:
their large staff of scientists deemed these events "unlikely." In the
following Edge original essay, Taleb continues his examination of Black
Swans, the highly improbable and unpredictable events that have massive
impact. He claims that those who are putting society at risk are "no true
statisticians", merely people using statistics either without understanding
them, or in a self-serving manner. "The current subprime crisis did wonders
to help me drill my point about the limits of statistically driven claims,"
he says. URL: www.edge.org/documents/archive/edge257.html#taleb
- Interesting quote: The purpose of models is not to fit the data but to sharpen the
questions. Samuel Karlin (1924 - ) as quoted in
www.causeweb.org/resources/fun/db.php?id=102.
- Negative Consequences of Dichotomizing Continuous Predictor Variables.
Gary McClelland. Description: This Java applet shows
graphically how creating a median split for a predictor variable leads to
loss of precision and power. This website was last verified on
2003-02-10. URL: psych.colorado.edu/~mcclella/MedianSplit
- Simpson's Paradox, Lord's Paradox, and Suppression
Effects are the same phenomenon - the reversal paradox. Tu YK, Gunnell D,
Gilthorpe MS. Emerg Themes Epidemiol 2008: 5; 2.
[Medline]
[Abstract] [Full text]
[PDF]. Description: This article provides a nice overview of how associations
between two variables can be modified by a third variable.
- What you see may not be what you get: a brief, nontechnical
introduction to overfitting in regression-type models. Babyak MA.
Psychosomatic Medicine 66:411-421 (2004).
[Abstract]
[Full text]
[PDF]. Description: If you have too many variables relative to the
amount of data you have, then your model will suffer from overfitting. This
article outlines the problems caused by overfitting and offers some
solutions.
- Why do we still use stepwise modelling in ecology and behaviour?
Whittingham MJ, Stephens PA, Bradbury RB, Freckleton RP. J Anim Ecol. 2006
Sep;75(5):1182-9.
[Medline]
[Abstract]
[PDF]. Description: This article reviews the continued use of
stepwise regression methods in leading ecological and behavioral journals
and explains the drawbacks of this approach.
All of the material above this paragraph is licensed under a
Creative Commons Attribution 3.0 United States License. This page was written by
Steve Simon and was last modified on
2008-12-06. The material
below this paragraph links to my
old website, StATS. Although I wrote all of the material
listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright
ownership of this material. The brief excerpts shown here are included under
the fair use provisions of U.S. Copyright laws.
2008
- Stats: Presenting unadjusted and
adjusted estimates side by side (March 24, 2008). Someone on the Medstats
discussion group asked about reporting the analysis of a model without
adjustment for covariates along with the analysis adjusted for covariates.
What is the purpose of reporting the unadjusted analysis?
- Stats: Assessing the assumption of
an exponential distribution (February 25, 2008). The following 41
observations: 8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2,
12, 1, 5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5, represent
waiting times that we suspect follow an exponential distribution. There are
several ways to examine this belief, and the simplest way is to draw a Q-Q
plot for the exponential distribution.
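As a rough illustration of the Q-Q plot idea in the entry above, the following Python sketch pairs the sorted waiting times with quantiles of an exponential distribution whose mean equals the sample mean. The function name exponential_qq and the plotting positions (i - 0.5)/n are assumptions made here for illustration. They are not taken from the original StATS page.

```python
import math

# The 41 waiting times quoted in the entry above
waits = [8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2,
         12, 1, 5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5]

def exponential_qq(data):
    """Return (theoretical, observed) quantile pairs for an exponential Q-Q plot.

    The exponential mean is estimated by the sample mean, and the
    plotting positions (i - 0.5) / n are one common convention.
    """
    n = len(data)
    mean = sum(data) / n
    pairs = []
    for i, observed in enumerate(sorted(data), start=1):
        p = (i - 0.5) / n
        # Inverse CDF of the exponential distribution with the given mean
        theoretical = -mean * math.log(1.0 - p)
        pairs.append((theoretical, observed))
    return pairs

qq_pairs = exponential_qq(waits)
for theoretical, observed in qq_pairs:
    print(f"{theoretical:8.2f} {observed:4d}")
```

If the exponential assumption is reasonable, plotting the observed values against the theoretical ones should give points close to a straight line through the origin with slope 1. Systematic curvature suggests a different distribution.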
2007
- Stats: When should you use a log
transformation? (December 28, 2007). Dear Professor Mean, How do I
know whether it is appropriate to use a log transformation for my data?
- Stats: The order of
entering interactions into a model (September 20, 2007). Dear
Professor Mean, I like your titanic example! But shouldn't you enter the
interaction term on a second step following entry of the main effects on the
first step? If you enter the terms all at the same time, the interaction term
will compete for variance with the two main effects on which it depends.
- Stats: Are we assuming a normal
sample or a normal population? (August 30, 2007). Dear Professor Mean,
I'm fitting an ANOVA model to a sample of 25 observations, and the data is
skewed. I'm quite worried about this, but my husband reassures me that this
is not a problem. He says that the assumption is that the population is
normal, not the sample. Should I listen to him?
- Stats: How good is my prediction?
(August 13, 2007). Dear Professor Mean, I have two time series of
data, one actual and one predicted. Since I'm quite new to statistical
methods, I would like to know what methods are used to evaluate the difference
between the two time series. I would like to be able to say something like "the
predicted values were 70% accurate."
2006
- Stats: Frank Harrell's
Philosophy of Biostatistics (October 10, 2006). There are a lot of people
in the world who are a lot smarter than I am and it is always a humbling
experience when I recognize how little I really know. Frank Harrell, chair of
the Department of Biostatistics at Vanderbilt University, is one of those
people.
- Stats: Slash and burn models (June
26, 2006). I received an email question about developing a logistic
regression model with some interaction terms. One of the interaction terms
was statistically significant but one or both of the main effects associated
with the interaction was not. So is it okay, I was asked, to include the
interaction in the final model but not the non-significant main effects?
First, I need to comment on the "slash and burn" model building
practice that this person is using. A recent posting to the MedStats email
discussion group outlines problems with this approach (although it does not
use the term "slash and burn"). The person who adopts a "slash and burn"
approach to models has a parsimonious intent. He/she wants to use as few
degrees of freedom as possible in the final statistical model and one way to
do this is to strip out anything that has an insignificant p-value. The ideal
in the "slash and burn" world is a model where every single p-value is
smaller than 0.05.
- Stats: Multicollinearity is
not a violation of assumptions (January 20, 2006). A colleague from my
days at the National Institute for Occupational Safety and Health emailed me
a question. Apparently, one of the co-authors of a paper he is writing is in
a bit of a panic because the linear regression model that they are using has
multicollinearity. She calls this a violation of assumptions and wonders if
she should look at certain transformations that are difficult to interpret
but which remove much of the multicollinearity. To me this seems like jumping
from the frying pan into the fire.
2005
- Stats: I abhor Lilliefor and
other tests of normality (April 14, 2005). Someone asked me about a log
transformation for their data. It seemed like a good idea, based on my
general comments on the log transformation, but the test of significance for
normality (the Lilliefors test) was still rejected even after the log
transformation. In general, I dislike the Lilliefors test (and other tests of
normality like the Shapiro-Wilk test).
2004
- Stats: Discrepancy between
univariate and multivariate models (November 12, 2004). Someone asked me
about an analysis that showed certain factors were predictive of a health
outcome when considered individually. When these factors were included in a
multivariate model that included other factors, they were no longer
statistically significant. This is worth investigating further but perhaps
you need to live with a bit of ambiguity in the data.
- Stats: What is the best statistical
model? (September 17, 2004). Someone asked me by email about the
advantages and disadvantages of various statistical models (multinomial
logistic regression, ordinal logistic regression, and structural equations
models). This is a somewhat difficult question to answer by email, but as a
general rule, I think that people worry too much about the particular model
that they choose.
- Stats: Central Limit Theorem (March 9, 2004).
Dear Professor Mean, How does the central limit theorem affect the
statistical tests that I might use for my data?
2003
- Stats: What does "overfitting" mean? (July
24, 2003). Dear Professor Mean, I am conducting binary logistic
regression analyses with a sample size of 80 of which 20 have the outcome of
interest (e.g. are "very successful" versus somewhat/not very successful). I
have thirty possible independent variables which I examined in a univariate
logistic regression with the dependent variable. Of these thirty, five look
like they might have a relationship with the dependent variable. Now I want
to include these six variables in a stepwise logistic regression model, but I
am worried about overfitting the data. I have heard that there should be
about 10 cases with the outcome of interest per independent variable to avoid
overfitting. What exactly does overfitting mean?
2002
- Stats: Log transformation (October 11, 2002).
Dear Professor Mean, I have some data
that I need help with analysis. One suggestion is that I use a log
transformation. Why would I want to do this? -- Stumped Susan
- Stats: Checking the assumption of normality
(September 11, 2002). Dear Professor Mean, I have some data that don't
seem to meet the assumption of normality. What should I do? -Anxious Abby
2000
- Stats: What is collinearity? (January 27,
2000). Dear Professor Mean, Could you describe the term collinearity
for me? I understand that it has to do with variables which are not totally
independent, but that is all I know!
- Stats: Best fitting curve (January 26, 2000).
Dear Professor Mean: I have a graph of the trend for the mean frequency of
injuries among children from 1 to 11 years of age. The shape of the curve
suggests a nonlinear relationship between the age and the frequency of
injuries. Is there some software that would provide the best fitting curve
for this data from among a large family of nonlinear curves?