Category >> Modeling issues (created 2007-08-13).


These pages discuss issues about statistical models which are relevant across a broad class of models. These pages may mention a specific model like logistic regression to provide context, but the ideas generalize easily to other models. Also see Category: Covariate adjustment, Category: Linear regression, Category: Logistic regression, and Category: Unusual data.


40. P.Mean: Looking at another grant opportunity (created 2011-11-07). It must be masochism on my part, but I'm looking at writing yet another grant. This grant would go to the Heartland Institute for Clinical and Translational Research and would be for a pilot study.

39. P.Mean: A bunch of univariate nonparametric tests versus a single parametric model (created 2011-11-03). Dear Professor Mean, I'm working on a study of weight loss during hospitalization. I'm measuring the percent change in the weight loss from admission to discharge and looking at the factors that influence it. I ran some non parametric tests and found a few factors that were associated with the weight loss. When I run a multivariate linear regression model, only one factor is associated with weight loss. The linear regression model assumes normative data, so I am not sure I can do that here. The data appears to be normally distributed but fails the test of normality. So, should I just report the non parametric tests? Is there a multivariable model for non normally distributed data?


39. The Monthly Mean: When and why to log (September/October 2010)

38. The Monthly Mean: Unequal sample sizes? Don't worry! (September/October 2010)

37. P.Mean: Oh those pesky interactions! (created 2010-09-16). Someone was fitting a binary logistic regression model and regretfully (that was his word) found two significant (p < 0.05) interactions. The tone was that he was testing for interactions using some type of stepwise approach, but was hoping that no interactions would appear. When they did appear, he had a panic, not about how to interpret the interactions, but rather whether he should include them in his publication. Here's the advice I offered.

36. P.Mean: What is principal components analysis? (created 2010-07-19). I was asked to help someone who was reviewing a paper that used principal components analysis (PCA) as part of the statistical methodology. I have not yet seen the article, so I could only offer very general advice.

35. P.Mean: Calculating weights to correct for over and under sampling (created 2010-03-22). Someone asked how to use weights to adjust for the fact that certain strata in a study were recruited more vigorously than other strata. For example, suppose you sampled at four communities and noted the age distribution as 0-14 years, 15-39 years, and 40+ years. How would you adjust for differential age distributions?
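The weighting idea in the entry above can be sketched in a few lines. This is only an illustration: the sample counts and population shares below are hypothetical and do not come from the original question. Each stratum's weight is its population share divided by its sample share, so an over-recruited stratum gets a weight below 1.

```python
# Hypothetical numbers for illustration only; not data from the original question.
sample_counts = {"0-14": 50, "15-39": 30, "40+": 20}
population_share = {"0-14": 0.25, "15-39": 0.40, "40+": 0.35}


def stratum_weights(sample_counts, population_share):
    """Return weight = population share / sample share for each stratum."""
    n = sum(sample_counts.values())
    return {
        group: population_share[group] / (count / n)
        for group, count in sample_counts.items()
    }


weights = stratum_weights(sample_counts, population_share)

# When the population shares sum to 1, the weighted counts add back up
# to the original sample size.
weighted_total = sum(weights[g] * sample_counts[g] for g in sample_counts)
```

In this made-up example the youngest group is over-sampled (half the sample but a quarter of the population), so its weight is below 1, while the two older groups are weighted up.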

34. P.Mean: Can sex be an outcome variable (created 2010-03-16). Someone asked whether it was legitimate to use sex (gender) as a dependent variable or outcome variable in a logistic regression model. It seems wrong, on the face of it, to think that various factors can influence whether we are male or female. It actually is perfectly fine to use sex as an outcome variable. Here is how I would justify its use.

33. The Monthly Mean: Using weights to correct for over and under sampling (February/March 2010)


32. The Monthly Mean: Risk adjustment using propensity scores (December 2009)

31. The Monthly Mean: Risk adjustment using the case mix index (November 2009)

30. The Monthly Mean: Risk adjustment using reweighting (July/August 2009)

29. P.Mean: Formula for multiple imputation (created 2009-07-24). I'm working on a project that involves multiple imputation, and I may have to program some of the work myself. I can use the R package MICE to generate the imputed data sets, but then I have to use a mixed linear model rather than a linear model. How do I combine the estimates from the multiple imputed data sets? The estimate is just the average of the individual estimates, but what about the standard error?
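For the standard error question in the entry above, the usual answer is the set of combining rules generally attributed to Rubin: the total variance is the average within-imputation variance plus an inflated between-imputation variance. The sketch below assumes that is the formula intended (the linked page may present it differently), and the input numbers are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine estimates from m imputed data sets.

    Returns the pooled estimate and its standard error, where
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average squared SE
    between = variance(estimates)                  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical estimates and standard errors from three imputed data sets.
pooled_estimate, pooled_se = pool_rubin([1.0, 1.2, 0.8], [0.5, 0.5, 0.5])
```

The pooled standard error is always at least as large as the root of the average squared standard error, because the between-imputation term reflects the extra uncertainty due to the missing data.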

28. The Monthly Mean: Risk adjustment using Analysis of Covariance (May/June 2009)

27. P.Mean: Fewer than 10 events per variable (created 2009-02-18). I am in the process of advising on the design of a study using logistic regression. There are five confounding variables and a treatment variable. If I apply the rule that you need 10 events per variable (EPV), then I need 60 events. I expect that the probability of observing an event is 40%. This means that I'll need data on 60 / 0.4 = 150 patients. I can only collect data on 90 patients, and that sample size gives me more than adequate power. Since my power will be fine, can I ignore the rule of thumb about 10 EPV?
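The arithmetic in this question is easy to check. The short sketch below uses the numbers from the question (six variables, a 40% event probability, 90 available patients); the function names are made up for illustration.

```python
def patients_needed(n_variables, event_prob, epv=10):
    """Sample size needed so the expected number of events is epv per variable."""
    return n_variables * epv / event_prob


def events_per_variable(n_patients, event_prob, n_variables):
    """Expected events per variable for a given sample size."""
    return n_patients * event_prob / n_variables


# Five confounders plus one treatment variable.
required_n = patients_needed(n_variables=6, event_prob=0.4)
achieved_epv = events_per_variable(n_patients=90, event_prob=0.4, n_variables=6)
```

With 90 patients the expected number of events is 36, which works out to about 6 events per variable rather than the 10 that the rule of thumb asks for.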

26. The Monthly Mean: Crude versus adjusted comparisons. (January 2009)


25. The Monthly Mean: A simple example of overfitting (December 2008)

24. P.Mean: Using a sub-optimal approach in meta-analysis (created 2008-12-06). I am having difficulty understanding the meta-analysis of ordinal data in a Cochrane systematic review, and would appreciate advice and comments. One study in the meta-analysis had an ordinal efficacy outcome with categories None, Some, Good, and Excellent. The meta-analysis did 4 separate analyses, treating each category as if it were a dichotomous outcome. Aside from the fact that this generates (almost) more analyses than there are data, this approach seems unnecessary and uninterpretable. The Cochrane Handbook says: "Ordinal and measurement scale outcomes are most commonly meta-analysed as dichotomous data." And "Occasionally it is possible to analyse the data using proportional odds models where ordinal scales have a small number of categories, the numbers falling into each category for each intervention group can be obtained, and the same ordinal scale has been used in all studies." What should the authors of the systematic review have done?

23. P.Mean: Explaining CART models in simple terms (created 2008-11-05). I need some help understanding and explaining Classification and Regression Trees (CART). I am personally not familiar with this technique. When would someone select this over linear/logistic regression model?

22. P.Mean: What's the difference between regression and ANOVA? (created 2008-10-15). Someone asked me to explain the difference between regression and ANOVA. That's challenging because regression and ANOVA are like the flip sides of the same coin. They are different, but they have more in common than you might think at first glance.

21. P.Mean: A simple example of overfitting (created 2008-10-08). A couple of the Internet discussion groups that I participate in have been discussing the concept of overfitting. Overfitting occurs when a model is too complex for a given sample size. I want to show a simple example of the negative consequences of overfitting.

20. P.Mean: Using ANOVA for a sum of Likert scaled variables (created 2008-10-09). I want to analyse data derived from a questionnaire. The range of possible values that my variable can take goes from 20 to 100. No evidence for rejecting the hypothesis of normality was found. I would therefore apply an ANOVA, but I still have some doubts whether this method of analysis is valid, since the range of my dependent variable is not [- infinity;+ infinity]. Is the ANOVA a valid method of analysis or are there other approaches I can apply?

19. P.Mean: What distribution does this data come from? (created 2008-07-23). I'm very interested in assessing distributional fits for empirical data and I've found tidbits of information here and there but no real good source. Could you recommend a few good sources?

Outside resources:

Anova for Unbalanced Data: An Overview. Shaw RG, Mitchell-Olds T. Ecology 74:6 (Sep., 1993); 1638-1645. [Abstract] [PDF]. Description: An analysis of variance model with multiple factors is very easy to analyze when the data is balanced, that is, when every combination of the factors has the same number of observations. If some combinations have more or fewer observations, you need to approach the ANOVA model very carefully. This article shows some of the issues you need to be aware of with unbalanced data.

Steve Miller. Biostatistics, Open Source and BI - an Interview with Frank Harrell. Description: This article, published in Information Management Online, February 25, 2009, offers a nice interview with Frank Harrell, a leading proponent of modern statistical methods. Excerpt: "My correspondence with Frank provided the opportunity to ask him to do an interview for the OpenBI Forum. He graciously accepted, turning around deft responses to my sometimes ponderous questions in very short order. What follows is text for our questions and answer session. I trust that readers will learn as much from Frank's responses as I did." [Accessed July 19, 2010]. Available at:

Phil Ender. Centering (ED230B/C). Excerpt: "Centering a variable involves subtracting the mean from each of the scores, that is, creating deviation scores. Centering can be done two ways; 1) centering using the grand mean and 2) centering using group means, which is also known as context centering." [Accessed July 26, 2010]. Available at:
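The two kinds of centering described in the excerpt above can be sketched as follows. The scores and group labels are hypothetical, and the function names are invented for this illustration.

```python
from statistics import mean


def center_grand(scores):
    """Subtract the grand mean from every score."""
    grand_mean = mean(scores)
    return [x - grand_mean for x in scores]


def center_by_group(scores, groups):
    """Subtract each score's own group mean (context centering)."""
    group_means = {
        g: mean([x for x, label in zip(scores, groups) if label == g])
        for g in set(groups)
    }
    return [x - group_means[g] for x, g in zip(scores, groups)]


# Hypothetical scores from two groups.
scores = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
groups = ["a", "a", "a", "b", "b", "b"]

grand_centered = center_grand(scores)
group_centered = center_by_group(scores, groups)
```

Grand mean centering keeps the difference between the two groups in the deviation scores, while group mean centering removes it, so the two versions answer different questions.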

McCandless L, Gustafson P, Austin P, Levy A. Covariate balance in a Bayesian propensity score analysis of beta blocker therapy in heart failure patients. Epidemiologic Perspectives & Innovations. 2009;6(1):5. Available at: [Accessed September 14, 2009]. Abstract: Regression adjustment for the propensity score is a statistical method that reduces confounding from measured variables in observational data. A Bayesian propensity score analysis extends this idea by using simultaneous estimation of the propensity scores and the treatment effect. In this article, we conduct an empirical investigation of the performance of Bayesian propensity scores in the context of an observational study of the effectiveness of beta-blocker therapy in heart failure patients. We study the balancing properties of the estimated propensity scores. Traditional Frequentist propensity scores focus attention on balancing covariates that are strongly associated with treatment. In contrast, we demonstrate that Bayesian propensity scores can be used to balance the association between covariates and the outcome. This balancing property has the effect of reducing confounding bias because it reduces the degree to which covariates are outcome risk factors.

Karyn Heavner, Carl Phillips, Igor Burstyn, Warren Hare. Dichotomization: 2 x 2 (x2 x 2 x 2...) categories: infinite possibilities. BMC Medical Research Methodology. 2010;10(1):59. Abstract: "BACKGROUND: Consumers of epidemiology may prefer to have one measure of risk arising from analysis of a 2-by-2 table. However, reporting a single measure of association, such as one odds ratio (OR) and 95% confidence interval, from a continuous exposure variable that was dichotomized withholds much potentially useful information. Results of this type of analysis are often reported for one such dichotomization, as if no other cutoffs were investigated or even possible. METHODS: This analysis demonstrates the effect of using different theory and data driven cutoffs on the relationship between body mass index and high cholesterol using National Health and Nutrition Examination Survey data. The recommended analytic approach, presentation of a graph of ORs for a range of cutoffs, is the focus of most of the results and discussion. RESULTS: These cutoff variations resulted in ORs between 1.1 and 1.9. This allows investigators to select a result that either strongly supports or provides negligible support for an association; a choice that is invisible to readers. The OR curve presents readers with more information about the exposure disease relationship than a single OR and 95% confidence interval. CONCLUSION: As well as offering results for additional cutoffs that may be of interest to readers, the OR curve provides an indication of whether the study focuses on a reasonable representation of the data or outlier results. It offers more information about trends in the association as the cutoff changes and the implications of random fluctuations than a single OR and 95% confidence interval." [Accessed July 19, 2010]. Available at:

H. Gilbert Welch, Lisa M. Schwartz, Steven Woloshin. The exaggerated relations between diet, body weight and mortality: the case for a categorical data approach. CMAJ. 2005;172(7):891-895. Excerpt: "Multivariate analysis has become a major statistical tool for medical research. It is most commonly used for adjustment - the process of correcting the main effect for multiple variables that confound the relation between exposure and outcome in an observational study. Any apparent relation between estrogen replacement and dementia, for example, should be adjusted for socioeconomic status, a variable that is known to relate both to access (and thus the likelihood of having received estrogen) and to measures of cognitive function (and thus the likelihood of being diagnosed with dementia). The capacity to account for numerous variables (e.g., income, education and insurance status) simultaneously constitutes a major advance in the ability of researchers to estimate the true effect of the exposure of interest. But this advance has come at a cost: the actual relation between exposure and outcome is increasingly opaque to readers, researchers and editors alike." [Accessed July 26, 2010]. Available at:

The Fourth Quadrant: A Map of the Limits of Statistics. Nassim Nicholas Taleb, published September 15, 2008 by the Edge Foundation, Inc. Excerpt: When Nassim Taleb talks about the limits of statistics, he becomes outraged. "My outrage," he says, "is aimed at the scientist-charlatan putting society at risk using statistical methods. This is similar to iatrogenics, the study of the doctor putting the patient at risk." As a researcher in probability, he has some credibility. In 2006, using FNMA and bank risk managers as his prime perpetrators, he wrote the following: The government-sponsored institution Fannie Mae, when I look at its risks, seems to be sitting on a barrel of dynamite, vulnerable to the slightest hiccup. But not to worry: their large staff of scientists deemed these events "unlikely." In the following Edge original essay, Taleb continues his examination of Black Swans, the highly improbable and unpredictable events that have massive impact. He claims that those who are putting society at risk are "no true statisticians", merely people using statistics either without understanding them, or in a self-serving manner. "The current subprime crisis did wonders to help me drill my point about the limits of statistically driven claims," he says. URL:

Webpage: Amara Lynn Graps. An Introduction to Wavelets. Abstract: "Wavelets are mathematical functions that cut up data into different frequency components, and then study each component with a resolution matched to its scale. They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes. Wavelets were developed independently in the fields of mathematics, quantum physics, electrical engineering, and seismic geology. Interchanges between these fields during the last ten years have led to many new wavelet applications such as image compression, turbulence, human vision, radar, and earthquake prediction. This paper introduces wavelets to the interested technical person outside of the digital signal processing field. I describe the history of wavelets beginning with Fourier, compare wavelet transforms with Fourier transforms, state properties and other special aspects of wavelets, and finish with some interesting applications such as image compression, musical tones, and de-noising noisy data. Keywords: Wavelets, Signal Processing Algorithms, Orthogonal Basis Functions, Wavelet Applications." [Accessed on May 12, 2011].

Interesting quote: The purpose of models is not to fit the data but to sharpen the questions. Samuel Karlin (1924 - ) as quoted in

Mitchell H. Katz. Multivariable Analysis: A Primer for Readers of Medical Research. Annals of Internal Medicine. 2003;138(8):644-650. Abstract: "Many clinical readers, especially those uncomfortable with mathematics, treat published multivariable models as a black box, accepting the author's explanation of the results. However, multivariable analysis can be understood without undue concern for the underlying mathematics. This paper reviews the basics of multivariable analysis, including what multivariable models are, why they are used, what types exist, what assumptions underlie them, how they should be interpreted, and how they can be evaluated. A deeper understanding of multivariable models enables readers to decide for themselves how much weight to give to the results of published analyses." [Accessed July 7, 2010]. Available at:

Negative Consequences of Dichotomizing Continuous Predictor Variables. Gary McClelland. Description: This Java applet shows graphically how creating a median split for a predictor variable leads to loss of precision and power. This website was last verified on 2003-02-10. URL:

Vittinghoff E, McCulloch CE. Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression. Am. J. Epidemiol. 2007;165(6):710-718. Available at: [Accessed February 18, 2009]. Description: This article examines the rule that you need 10 events per independent variable. Some sources cite 15 events and others cite 20 events per independent variable. The authors argue that in the context of adjusting for confounders, this rule might be relaxed a bit.

David MacKinnon. RIPL - Statistical Mediation. Excerpt: "Once a relationship between two variables is established, it is common for researchers to consider the role of other variables in this relationship (Lazarsfeld, 1955). In one situation, moderation or effect modification, an observed relationship may be different at different levels of a third variable. In a second situation, which is the focus of this site, a third variable provides a clearer interpretation of the relationship between the two variables. A clearer interpretation may be obtained by elucidating the causal process among the three variables, a mediational hypothesis." [Accessed October 13, 2010]. Available at:

David Vergouw, Martijn Heymans, George Peat, et al. The search for stable prognostic models in multiple imputed data sets. BMC Medical Research Methodology. 2010;10(1):81. Abstract: "BACKGROUND: In prognostic studies model instability and missing data can be troubling factors. Proposed methods for handling these situations are bootstrapping (B) and Multiple imputation (MI). The authors examined the influence of these methods on model composition. METHODS: Models were constructed using a cohort of 587 patients consulting between January 2001 and January 2003 with a shoulder problem in general practice in the Netherlands (the Dutch Shoulder Study). Outcome measures were persistent shoulder disability and persistent shoulder pain. Potential predictors included socio-demographic variables, characteristics of the pain problem, physical activity and psychosocial factors. Model composition and performance (calibration and discrimination) were assessed for models using a complete case analysis, MI, bootstrapping or both MI and bootstrapping. RESULTS: Results showed that model composition varied between models as a result of how missing data was handled and that bootstrapping provided additional information on the stability of the selected prognostic model. CONCLUSION: In prognostic modeling missing data needs to be handled by MI and bootstrap model selection is advised in order to provide information on model stability." [Accessed October 25, 2010]. Available at:

Simpson's Paradox, Lord's Paradox, and Suppression Effects are the same phenomenon - the reversal paradox. YK Tu, D Gunnell, Gilthorpe MS. Emerg Themes Epidemiol 2008: 5; 2. [Medline] [Abstract] [Full text] [PDF]. Description: This article provides a nice overview of how associations between two variables can be modified by a third variable.

P Peduzzi, J Concato, E Kemper, T R Holford, A R Feinstein. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996;49(12):1373-1379. Abstract: "We performed a Monte Carlo study to evaluate the effect of the number of events per variable (EPV) analyzed in logistic regression analysis. The simulations were based on data from a cardiac trial of 673 patients in which 252 deaths occurred and seven variables were cogent predictors of mortality; the number of events per predictive variable was (252/7 =) 36 for the full sample. For the simulations, at values of EPV = 2, 5, 10, 15, 20, and 25, we randomly generated 500 samples of the 673 patients, chosen with replacement, according to a logistic model derived from the full sample. Simulation results for the regression coefficients for each variable in each group of 500 samples were compared for bias, precision, and significance testing against the results of the model fitted to the original sample. For EPV values of 10 or greater, no major problems occurred. For EPV values less than 10, however, the regression coefficients were biased in both positive and negative directions; the large sample variance estimates from the logistic model both overestimated and underestimated the sample variance of the regression coefficients; the 90% confidence limits about the estimated values did not have proper coverage; the Wald statistic was conservative under the null hypothesis; and paradoxical associations (significance in the wrong direction) were increased. Although other factors (such as the total number of events, or sample size) may influence the validity of the logistic model, our findings indicate that low EPV can lead to major problems." [Accessed June 14, 2010]. Available at:

Wuensch K. Stepwise Regression = Voodoo Regression. Available at: [Accessed April 16, 2009]. Excerpt: It is pretty cool, but not necessarily very useful, and just plain dangerous in the hands of somebody not well educated in the multiple regression techniques, including effects of collinearity, redundancy, and suppression. Here are some quotes from others I have collected from the now departed STAT-L.

Flom PL, Cassell DL. Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use. Available at: [Accessed April 24, 2009].

Sylvia Sudat, Elizabeth Carlton, Edmund Seto, Robert Spear, Alan Hubbard. Using variable importance measures from causal inference to rank risk factors of schistosomiasis infection in a rural setting in China. Epidemiologic Perspectives & Innovations. 2010;7(1):3. Abstract: "BACKGROUND: Schistosomiasis infection, contracted through contact with contaminated water, is a global public health concern. In this paper we analyze data from a retrospective study reporting water contact and schistosomiasis infection status among 1011 individuals in rural China. We present semi-parametric methods for identifying risk factors through a comparison of three analysis approaches: a prediction-focused machine learning algorithm, a simple main-effects multivariable regression, and a semi-parametric variable importance (VI) estimate inspired by a causal population intervention parameter. RESULTS: The multivariable regression found only tool washing to be associated with the outcome, with a relative risk of 1.03 and a 95% confidence interval (CI) of 1.01-1.05. Three types of water contact were found to be associated with the outcome in the semi-parametric VI analysis: July water contact (VI estimate 0.16, 95% CI 0.11-0.22), water contact from tool washing (VI estimate 0.88, 95% CI 0.80-0.97), and water contact from rice planting (VI estimate 0.71, 95% CI 0.53-0.96). The July VI result, in particular, indicated a strong association with infection status - its causal interpretation implies that eliminating water contact in July would reduce the prevalence of schistosomiasis in our study population by 84%, or from 0.3 to 0.05 (95% CI 78%-89%). CONCLUSIONS: The July VI estimate suggests possible within-season variability in schistosomiasis infection risk, an association not detected by the regression analysis. 
Though there are many limitations to this study that temper the potential for causal interpretations, if a high-risk time period could be detected in something close to real time, new prevention options would be opened. Most importantly, we emphasize that traditional regression approaches are usually based on arbitrary pre-specified models, making their parameters difficult to interpret in the context of real-world applications. Our results support the practical application of analysis approaches that, in contrast, do not require arbitrary model pre-specification, estimate parameters that have simple public health interpretations, and apply inference that considers model selection as a source of variation." [Accessed July 16, 2010]. Available at:

Journal article: Stefan Walter, Henning Tiemeier. Variable selection: current practice in epidemiological studies. Eur J Epidemiol. 2009;24(12):733-736. Abstract: "Selection of covariates is among the most controversial and difficult tasks in epidemiologic analysis. Correct variable selection addresses the problem of confounding in etiologic research and allows unbiased estimation of probabilities in prognostic studies. The aim of this commentary is to assess how often different variable selection techniques were applied in contemporary epidemiologic analysis. It was of particular interest to see whether modern methods such as shrinkage or penalized regression were used in recent publications. Stepwise selection methods remained the predominant method for variable selection in publications in epidemiological journals in 2008. Shrinkage methods were not used in any of the reviewed articles. Editors, reviewers and authors have insufficiently promoted the new, less controversial approaches of variable selection in the biomedical literature, whereas statisticians may not have adequately addressed the method's feasibility." [Accessed November 26, 2010]. Available at

What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Babyak MA. Psychosomatic Medicine 66:411-421 (2004). [Abstract] [Full text] [PDF]. Description: If you have too many variables relative to the amount of data you have, then your model will suffer from overfitting. This article outlines the problems caused by overfitting and offers some solutions.

Why do we still use stepwise modelling in ecology and behaviour? Whittingham MJ, Stephens PA, Bradbury RB, Freckleton RP. J Anim Ecol. 2006 Sep;75(5):1182-9. [Medline] [Abstract] [PDF]. Description: This article reviews the continued use of stepwise regression methods in leading ecological and behavioral journals and explains the drawbacks of this approach.

Creative Commons License All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2017-06-15. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. Copyright laws.


18. Stats: Presenting unadjusted and adjusted estimates side by side (March 24, 2008). Someone on the Medstats discussion group asked about reporting the analysis of a model without adjustment for covariates along with the analysis adjusted for covariates. What is the purpose of reporting the unadjusted analysis?

17. Stats: Assessing the assumption of an exponential distribution (February 25, 2008). The following 41 observations: 8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2, 12, 1, 5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5, represent waiting times that we suspect follow an exponential distribution. There are several ways to examine this belief, and the simplest way is to draw a Q-Q plot for the exponential distribution.
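The coordinates for such a Q-Q plot can be computed with nothing more than the standard library. This is a minimal sketch using the 41 waiting times quoted above; the plotting positions (i - 0.5) / n are an assumption, since the linked page may use a different convention.

```python
import math

waits = [8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2, 12, 1,
         5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5]


def exponential_qq(data):
    """Return (theoretical, observed) pairs for an exponential Q-Q plot.

    Theoretical quantiles come from a unit-rate exponential distribution,
    evaluated at the plotting positions (i - 0.5) / n.
    """
    n = len(data)
    observed = sorted(data)
    theoretical = [-math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


pairs = exponential_qq(waits)
```

If the waiting times are roughly exponential, plotting the observed values against the theoretical quantiles should give points close to a straight line through the origin, with a slope near the mean waiting time.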


16. Stats: When should you use a log transformation? (December 28, 2007). Dear Professor Mean, How do I know whether it is appropriate to use a log transformation for my data?

15. Stats: The order of entering interactions into a model (September 20, 2007). Dear Professor Mean, I like your titanic example! But shouldn't you enter the interaction term on a second step following entry of the main effects on the first step? If you enter the terms all at the same time, the interaction term will compete for variance with the two main effects on which it depends.

14. Stats: Are we assuming a normal sample or a normal population? (August 30, 2007). Dear Professor Mean, I'm fitting an ANOVA model to a sample of 25 observations, and the data is skewed. I'm quite worried about this, but my husband reassures me that this is not a problem. He says that the assumption is that the population is normal, not the sample. Should I listen to him?

13. Stats: How good is my prediction? (August 13, 2007). Dear Professor Mean, I have two time series of data, one actual and one predicted. Since I'm quite new to statistical methods, I would like to know what methods are used to evaluate the difference between the two time series. I would like to be able to say something like "the predicted values were 70% accurate."


12. Stats: Frank Harrell's Philosophy of Biostatistics (October 10, 2006). There are a lot of people in the world who are a lot smarter than I am and it is always a humbling experience when I recognize how little I really know. Frank Harrell, chair of the Department of Biostatistics at Vanderbilt University, is one of those people.

11. Stats: Slash and burn models (June 26, 2006). I received an email question about developing a logistic regression model with some interaction terms. One of the interaction terms was statistically significant but one or both of the main effects associated with the interaction was not. So is it okay, I was asked, to include the interaction in the final model but not the non-significant main effects? First, I need to comment on the "slash and burn" model building practice that this person is using. A recent posting to the MedStats email discussion group outlines problems with this approach (although it does not use the term "slash and burn"). The person who adopts a "slash and burn" approach to models has a parsimonious intent. He/she wants to use as few degrees of freedom as possible in the final statistical model and one way to do this is to strip out anything that has an insignificant p-value. The ideal in the "slash and burn" world is a model where every single p-value is smaller than 0.05.

10. Stats: Multicollinearity is not a violation of assumptions (January 20, 2006). A colleague from my days at the National Institute for Occupational Safety and Health emailed me a question. Apparently, one of the co-authors of a paper he is writing is in a bit of a panic because the linear regression model that they are using has multicollinearity. She calls this a violation of assumptions and wonders if she should look at certain transformations that are difficult to interpret but which remove much of the multicollinearity. To me this seems like jumping from the frying pan into the fire.


9. Stats: I abhor Lilliefor and other tests of normality (April 14, 2005). Someone asked me about a log transformation for their data. It seemed like a good idea, based on my general comments on the log transformation, but the test of normality (the Lilliefors test) still rejected normality even after the log transformation. In general, I dislike the Lilliefors test (and other tests of normality like the Shapiro-Wilk test).


8. Stats: Discrepancy between univariate and multivariate models (November 12, 2004). Someone asked me about an analysis that showed certain factors were predictive of a health outcome when considered individually. When these factors were included in a multivariate model that included other factors, they were no longer statistically significant. This is worth investigating further but perhaps you need to live with a bit of ambiguity in the data.

7. Stats: What is the best statistical model? (September 17, 2004). Someone asked me by email about the advantages and disadvantages of various statistical models (multinomial logistic regression, ordinal logistic regression, and structural equations models). This is a somewhat difficult question to answer by email, but as a general rule, I think that people worry too much about the particular model that they choose.

6. Stats: Central Limit Theorem (March 9, 2004). Dear Professor Mean, How does the central limit theorem affect the statistical tests that I might use for my data?


5. Stats: What does "overfitting" mean? (July 24, 2003). Dear Professor Mean, I am conducting binary logistic regression analyses with a sample size of 80 of which 20 have the outcome of interest (e.g. are "very successful" versus somewhat/not very successful). I have thirty possible independent variables which I examined in a univariate logistic regression with the dependent variable. Of these thirty, five look like they might have a relationship with the dependent variable. Now I want to include these six variables in a stepwise logistic regression model, but I am worried about overfitting the data. I have heard that there should be about 10 cases with the outcome of interest per independent variable to avoid overfitting. What exactly does overfitting mean?


4. Stats: Log transformation (October 11, 2002). Dear Professor Mean, I have some data that I need help with analysis. One suggestion is that I use a log transformation. Why would I want to do this? -- Stumped Susan

3. Stats: Checking the assumption of normality (September 11, 2002). Dear Professor Mean, I have some data that don't seem to meet the assumption of normality. What should I do? -Anxious Abby


2. Stats: What is collinearity? (January 27, 2000). Dear Professor Mean, Could you describe the term collinearity for me? I understand that it has to do with variables which are not totally independent, but that is all I know!

1. Stats: Best fitting curve (January 26, 2000). Dear Professor Mean: I have a graph of the trend for the mean frequency of injuries among children from 1 to 11 years of age. The shape of the curve suggests a nonlinear relationship between the age and the frequency of injuries. Is there some software that would provide the best fitting curve for this data from among a large family of nonlinear curves?
