Category: Modeling issues (created 2007-08-13). These pages discuss issues about statistical models which are relevant across a broad class of models. These pages may mention a specific model like logistic regression to provide context, but the ideas generalize easily to other models. Also see Category: Covariate adjustment, Category: Linear regression, Category: Logistic regression, and Category: Unusual data. Other entries about modeling issues can be found in the modeling issues page at the StATS website.

2008

  1. P.Mean: Using a sub-optimal approach in meta-analysis (created 2008-12-06). I am having difficulty understanding the meta-analysis of ordinal data in a Cochrane systematic review, and would appreciate advice and comments. One study in the meta-analysis had an ordinal efficacy outcome with categories None, Some, Good, and Excellent. The meta-analysis did 4 separate analyses, treating each category as if it were a dichotomous outcome. Aside from the fact that this generates (almost) more analyses than there are data, this approach seems unnecessary and uninterpretable. The Cochrane Handbook says: "Ordinal and measurement scale outcomes are most commonly meta-analysed as dichotomous data." And "Occasionally it is possible to analyse the data using proportional odds models where ordinal scales have a small number of categories, the numbers falling into each category for each intervention group can be obtained, and the same ordinal scale has been used in all studies." What should the authors of the systematic review have done?
  2. P.Mean: Explaining CART models in simple terms (created 2008-11-05). I need some help understanding and explaining Classification and Regression Trees (CART). I am personally not familiar with this technique. When would someone select this over a linear or logistic regression model?
  3. P.Mean: What's the difference between regression and ANOVA? (created 2008-10-15). Someone asked me to explain the difference between regression and ANOVA. That's challenging because regression and ANOVA are like the flip sides of the same coin. They are different, but they have more in common than you might think at first glance.
  4. P.Mean: A simple example of overfitting (created 2008-10-08). A couple of the Internet discussion groups that I participate in have been discussing the concept of overfitting. Overfitting occurs when a model is too complex for a given sample size. I want to show a simple example of the negative consequences of overfitting.
  5. P.Mean: Using ANOVA for a sum of Likert scaled variables (created 2008-10-09). I want to analyse data derived from a questionnaire. The range of possible values that my variable can take goes from 20 to 100. No evidence for rejecting the hypothesis of normality was found. I would therefore apply an ANOVA, but I still have some doubts whether this method of analysis is valid, since the range of my dependent variable is not (-infinity, +infinity). Is the ANOVA a valid method of analysis or are there other approaches I can apply?
  6. P.Mean: What distribution does this data come from? (created 2008-07-23). I'm very interested in assessing distributional fits for empirical data and I've found tidbits of information here and there but no real good source. Could you recommend a few good sources?

Creative Commons License All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2008-12-06. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. Copyright laws.

2008

  1. Stats: Presenting unadjusted and adjusted estimates side by side (March 24, 2008). Someone on the Medstats discussion group asked about reporting the analysis of a model without adjustment for covariates along with the analysis adjusted for covariates. What is the purpose of reporting the unadjusted analysis?
  2. Stats: Assessing the assumption of an exponential distribution (February 25, 2008). The following 41 observations: 8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2, 12, 1, 5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5, represent waiting times that we suspect follow an exponential distribution. There are several ways to examine this belief, and the simplest way is to draw a Q-Q plot for the exponential distribution.
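
As a rough illustration of the Q-Q plot idea in the entry just above, here is a minimal Python sketch that computes the points such a plot would show for the 41 waiting times. It is not the code from the original StATS page. The function name exponential_qq_points and the plotting position (i - 0.5)/n are assumptions chosen for illustration; other plotting positions are also in common use.

```python
import math

# Waiting times copied from the entry above (41 observations).
waiting_times = [
    8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2, 12, 1,
    5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5,
]


def exponential_qq_points(data):
    """Return (theoretical, observed) quantile pairs for an exponential Q-Q plot.

    Hypothetical helper: the exponential mean is estimated by the sample mean,
    and the plotting position (i - 0.5) / n is an assumed convention.
    """
    n = len(data)
    mean = sum(data) / n          # estimate of the exponential mean (1 / rate)
    observed = sorted(data)       # empirical quantiles
    points = []
    for i, x in enumerate(observed, start=1):
        p = (i - 0.5) / n         # plotting position, strictly between 0 and 1
        theoretical = -mean * math.log(1.0 - p)  # exponential quantile function
        points.append((theoretical, x))
    return points


qq_points = exponential_qq_points(waiting_times)
for theoretical, observed in qq_points:
    print(f"{theoretical:6.2f}  {observed:3d}")
```

If the exponential assumption is reasonable, the printed pairs should fall close to a straight line through the origin with slope 1 when plotted against each other. Marked departures at the upper end would suggest a heavier or lighter tail than the exponential.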

2007

  3. Stats: When should you use a log transformation? (December 28, 2007). Dear Professor Mean, How do I know whether it is appropriate to use a log transformation for my data?
  4. Stats: The order of entering interactions into a model (September 20, 2007). Dear Professor Mean, I like your Titanic example! But shouldn't you enter the interaction term on a second step following entry of the main effects on the first step? If you enter the terms all at the same time, the interaction term will compete for variance with the two main effects on which it depends.
  5. Stats: Are we assuming a normal sample or a normal population? (August 30, 2007). Dear Professor Mean, I'm fitting an ANOVA model to a sample of 25 observations, and the data is skewed. I'm quite worried about this, but my husband reassures me that this is not a problem. He says that the assumption is that the population is normal, not the sample. Should I listen to him?
  6. Stats: How good is my prediction? (August 13, 2007). Dear Professor Mean, I have two time series of data, one actual and one predicted. Since I'm quite new to statistical methods, I would like to know what methods are used to evaluate the difference between the two time series. I would like to be able to say something like "the predicted values were 70% accurate."

2006

  7. Stats: Frank Harrell's Philosophy of Biostatistics (October 10, 2006). There are a lot of people in the world who are a lot smarter than I am and it is always a humbling experience when I recognize how little I really know. Frank Harrell, chair of the Department of Biostatistics at Vanderbilt University, is one of those people.
  8. Stats: Slash and burn models (June 26, 2006). I received an email question about developing a logistic regression model with some interaction terms. One of the interaction terms was statistically significant but one or both of the main effects associated with the interaction was not. So is it okay, I was asked, to include the interaction in the final model but not the non-significant main effects? First, I need to comment on the "slash and burn" model building practice that this person is using. A recent posting to the MedStats email discussion group outlines problems with this approach (although it does not use the term "slash and burn"). The person who adopts a "slash and burn" approach to models has a parsimonious intent. He/she wants to use as few degrees of freedom as possible in the final statistical model and one way to do this is to strip out anything that has an insignificant p-value. The ideal in the "slash and burn" world is a model where every single p-value is smaller than 0.05.
  9. Stats: Multicollinearity is not a violation of assumptions (January 20, 2006). A colleague from my days at the National Institute for Occupational Safety and Health emailed me a question. Apparently, one of the co-authors of a paper he is writing is in a bit of a panic because the linear regression model that they are using has multicollinearity. She calls this a violation of assumptions and wonders if she should look at certain transformations that are difficult to interpret but which remove much of the multicollinearity. To me this seems like jumping from the frying pan into the fire.

2005

  10. Stats: I abhor Lilliefor and other tests of normality (April 14, 2005). Someone asked me about a log transformation for their data. It seemed like a good idea, based on my general comments on the log transformation, but the test of normality (the Lilliefors test) still rejected normality even after the log transformation. In general, I dislike the Lilliefors test (and other tests of normality like the Shapiro-Wilk test).

2004

  11. Stats: Discrepancy between univariate and multivariate models (November 12, 2004). Someone asked me about an analysis that showed certain factors were predictive of a health outcome when considered individually. When these factors were included in a multivariate model that included other factors, they were no longer statistically significant. This is worth investigating further but perhaps you need to live with a bit of ambiguity in the data.
  12. Stats: What is the best statistical model? (September 17, 2004). Someone asked me by email about the advantages and disadvantages of various statistical models (multinomial logistic regression, ordinal logistic regression, and structural equations models). This is a somewhat difficult question to answer by email, but as a general rule, I think that people worry too much about the particular model that they choose.
  13. Stats: Central Limit Theorem (March 9, 2004). Dear Professor Mean, How does the central limit theorem affect the statistical tests that I might use for my data?

2003

  14. Stats: What does "overfitting" mean? (July 24, 2003). Dear Professor Mean, I am conducting binary logistic regression analyses with a sample size of 80 of which 20 have the outcome of interest (e.g. are "very successful" versus somewhat/not very successful). I have thirty possible independent variables which I examined in a univariate logistic regression with the dependent variable. Of these thirty, five look like they might have a relationship with the dependent variable. Now I want to include these six variables in a stepwise logistic regression model, but I am worried about overfitting the data. I have heard that there should be about 10 cases with the outcome of interest per independent variable to avoid overfitting. What exactly does overfitting mean?
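
The rule of thumb quoted in the question just above (about 10 cases with the outcome of interest per independent variable) can be worked through with the figures the reader gives. The Python sketch below is only an illustration of that arithmetic, not a method from the original page. The function name max_predictors is hypothetical, and using the rarer of the two outcome groups as the limiting count is an assumption.

```python
def max_predictors(events, non_events, per_variable=10):
    """Largest number of predictors allowed by an events-per-variable rule of thumb.

    Hypothetical helper. Assumption: the rarer of the two outcome groups
    is the count that limits model complexity.
    """
    limiting = min(events, non_events)
    return limiting // per_variable


sample_size = 80   # total sample size stated in the question
events = 20        # subjects with the outcome of interest
print(max_predictors(events, sample_size - events))
```

With 20 events and a threshold of 10 per variable, the rule allows roughly two predictors. The threshold of 10 is a rough guide rather than a strict cutoff.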

2002

  15. Stats: Log transformation (October 11, 2002). Dear Professor Mean, I have some data that I need help analyzing. One suggestion is that I use a log transformation. Why would I want to do this? -- Stumped Susan
  16. Stats: Checking the assumption of normality (September 11, 2002). Dear Professor Mean, I have some data that don't seem to meet the assumption of normality. What should I do? -- Anxious Abby

2000

  17. Stats: What is collinearity? (January 27, 2000). Dear Professor Mean, Could you describe the term collinearity for me? I understand that it has to do with variables which are not totally independent, but that is all I know!
  18. Stats: Best fitting curve (January 26, 2000). Dear Professor Mean: I have a graph of the trend for the mean frequency of injuries among children from 1 to 11 years of age. The shape of the curve suggests a nonlinear relationship between the age and the frequency of injuries. Is there some software that would provide the best fitting curve for this data from among a large family of nonlinear curves?
