P.Mean >> Category >> Unusual data (created 2007-06-20). 

These pages describe data analyses that do not fit easily into the more traditional categories. If I get a sufficient number of pages on the same general topic, I will create a new category. Also see Category: Modeling issues, Category: Statistical theory.


25. P.Mean: I can't get SAS to model the cluster effects in the MEPS data set (created 2011-09-02). I'm trying to use SAS to analyze data from the Medical Expenditure Panel Survey (MEPS), but when I try to model the cluster effect using proc glimmix, I get an error message. What am I doing wrong?

24. What is the bootstrap? (May/June 2011)


23. P.Mean: More discussion on instrumental variables (created 2010-05-03). I attended the May meeting of the KUMC Statistics Journal Club. The topic of discussion was a paper outlining the properties and applications of instrumental variables.


22. P.Mean: Generating multinomial random variables in Excel (created 2009-11-23). Someone asked how to generate six random integers subject to the condition that the sum of those random integers had to equal a value, x. This is a classic description of a multinomial distribution. Unstated in the question, but assumed by me, was that each random integer had to have the same distribution. That forces the probability vector for the multinomial to be (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

21. The Monthly Mean: Explaining CART models in simple terms (December 2008)
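
As a rough illustration of the idea in entry 22 above, here is a minimal sketch in Python (not Excel, and not the method used on the linked page). It assigns each of the x units independently to one of six equally likely categories, which is one standard way to draw from a multinomial with probability vector (1/6, ..., 1/6). The function name `equal_multinomial` and its parameters are hypothetical names chosen for this example.

```python
import random
from collections import Counter


def equal_multinomial(total, categories=6, rng=None):
    """Draw `categories` non-negative integers that sum to `total`.

    Each of the `total` units is assigned independently to one of the
    categories with equal probability, so the resulting counts follow a
    multinomial distribution with probabilities (1/k, ..., 1/k).
    """
    rng = rng or random.Random()
    # Count how many units landed in each category.
    counts = Counter(rng.randrange(categories) for _ in range(total))
    return [counts[i] for i in range(categories)]


# Example: six random integers constrained to sum to x = 20.
draw = equal_multinomial(total=20, categories=6, rng=random.Random(1))
print(draw, sum(draw))
```

The counts always sum to the requested total by construction, and every category has the same marginal distribution, which matches the assumption stated in the entry.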

Other resources:

Kanji GK. One hundred statistical tests. 3rd ed. Thousand Oaks, Calif: Sage Publications; 2006. Description: Gopal Kanji lists specific details of many statistical tests, some quite obscure. This book is for students who want more mathematical details.

Journal article: Jason S Haukoos, Roger J Lewis. Advanced statistics: bootstrapping confidence intervals for statistics with "difficult" distributions. Acad Emerg Med. 2005;12(4):360-365. Abstract: "The use of confidence intervals in reporting results of research has increased dramatically and is now required or highly recommended by editors of many scientific journals. Many resources describe methods for computing confidence intervals for statistics with mathematically simple distributions. Computing confidence intervals for descriptive statistics with distributions that are difficult to represent mathematically is more challenging. The bootstrap is a computationally intensive statistical technique that allows the researcher to make inferences from data without making strong distributional assumptions about the data or the statistic being calculated. This allows the researcher to estimate confidence intervals for statistics that do not have simple sampling distributions (e.g., the median). The purposes of this article are to describe the concept of bootstrapping, to demonstrate how to estimate confidence intervals for the median and the Spearman rank correlation coefficient for non-normally-distributed data from a recent clinical study using two commonly used statistical software packages (SAS and Stata), and to discuss specific limitations of the bootstrap." [Accessed on September 21, 2011]. http://gcrc.labiomed.org/Biostat/Education/Case%20studies%202005/session2/Haukoos%20and%20Lewis%20Bootstrapping.pdf.

Gunnes N, Seierstad T, Aamdal S, et al. Assessing quality of life in a randomized clinical trial: Correcting for missing data. BMC Medical Research Methodology. 2009;9(1):28. Available at: http://www.biomedcentral.com/1471-2288/9/28 [Accessed May 20, 2009]. Excerpt: Use of proper methodology developed for analysing data subject to missingness is necessary to reduce potential estimation bias. The quality of life of patients receiving radiation therapy with concurrent chemotherapy (docetaxel) appears somewhat worse than that of patients receiving radiation therapy alone in the period during which treatment is given. The conclusions are robust for the choice of statistical methods.

Westaby S, Archer N, Manning N, et al. Comparison of hospital episode statistics and central cardiac audit database in public reporting of congenital heart surgery mortality. BMJ. 2007;335(7623):759. Available at: http://www.bmj.com/cgi/content/abstract/335/7623/759 [Accessed March 4, 2009]. Description: One of the more lively debates in medicine today is the use of report cards to summarize performance of hospitals and/or individual physicians. This paper takes individual statistics compiled by hospitals (hospital episode statistics) and compares them to a centralized database. There are large discrepancies between the two, and the authors suggest that individual hospitals should spend the effort to more rigorously collect and validate their data.

Micheloud F. Jean Paul Benzécri's Correspondence Analysis. Available at: http://www.micheloud.com/FXM/COR/E/index.htm [Accessed March 4, 2009]. Excerpt: This paper is an introduction to correspondence analysis, a statistical method allowing to analyze and describe graphically and synthetically big contingency tables, that is tables in which you find at the intersection of a row and a column the number of individuals who share the characteristic of the row and that of the column.

Molly Kelton, Cynthia LeardMann, Besa Smith, et al. Exploratory factor analysis of self-reported symptoms in a large, population-based military cohort. BMC Medical Research Methodology. 2010;10(1):94. Abstract: "BACKGROUND: US military engagements have consistently raised concern over the array of health outcomes experienced by service members postdeployment. Exploratory factor analysis has been used in studies of 1991 Gulf War-related illnesses, and may increase understanding of symptoms and health outcomes associated with current military conflicts in Iraq and Afghanistan. The objective of this study was to use exploratory factor analysis to describe the correlations among numerous physical and psychological symptoms in terms of a smaller number of unobserved variables or factors. METHODS: The Millennium Cohort Study collects extensive self-reported health data from a large, population-based military cohort, providing a unique opportunity to investigate the interrelationships of numerous physical and psychological symptoms among US military personnel. This study used data from the Millennium Cohort Study, a large, population-based military cohort. Exploratory factor analysis was used to examine the covariance structure of symptoms reported by approximately 50,000 cohort members during 2004-2006. Analyses incorporated 89 symptoms, including responses to several validated instruments embedded in the questionnaire. Techniques accommodated the categorical and sometimes incomplete nature of the survey data. RESULTS: A 14-factor model accounted for 60 percent of the total variance in symptoms data and included factors related to several physical, psychological, and behavioral constructs. A notable finding was that many factors appeared to load in accordance with symptom co-location within the survey instrument, highlighting the difficulty in disassociating the effects of question content, location, and response format on factor structure. CONCLUSIONS: This study demonstrates the potential strengths and weaknesses of exploratory factor analysis to heighten understanding of the complex associations among symptoms. Further research is needed to investigate the relationship between factor analytic results and survey structure, as well as to assess the relationship between factor scores and key exposure variables." [Accessed October 25, 2010]. Available at: http://www.biomedcentral.com/1471-2288/10/94.

Rigdon E. Frequently Asked Questions about SEM. Available at: http://www2.gsu.edu/~mkteer/semfaq.html [Accessed March 4, 2009]. Description: This is the first place you should look if you have questions about Structural Equation Models.

Wikipedia. Instrumental variable. Available at: http://en.wikipedia.org/wiki/Instrumental_variable [Accessed March 4, 2009]. Excerpt: In statistics and econometrics, an instrumental variable (IV, or instrument) can be used to produce a consistent estimator of a parameter when the explanatory variables (covariates) are correlated with the error terms. Such correlation can be caused by endogeneity, by omitted covariates, or by measurement errors in the covariates. In this situation, ordinary linear regression produces biased and inconsistent estimates. However, if an instrument is available, consistent estimates may still be obtained. An instrument is a variable that does not itself belong in the explanatory equation, that is correlated with the suspect explanatory variable, and that is uncorrelated with the error term.

Kenny DA. SEM: Instrumental Variables. Available at: http://davidakenny.net/cm/iv.htm [Accessed March 4, 2009]. Excerpt: One way of identifying models that cannot be estimated by using multiple regression is through the use of instrumental variables. For path analysis, the disturbance must not be correlated with each causal variable. There are three reasons why such a correlation might exist: * Spuriousness (Third Variable Causation): A variable causes both the endogenous variable and one its causal variables and that variable is not included in the model. * Reverse Causation (Feedback Model): The endogenous variable causes, either directly or indirectly, one of its causes. * Measurement Error: There is measurement error in a causal variable.

Michel Chavance, Sylvie Escolano, Monique Romon, et al. Latent variables and structural equation models for longitudinal relationships: an illustration in nutritional epidemiology. BMC Medical Research Methodology. 2010;10(1):37. Abstract: "BACKGROUND: The use of structural equation modeling and latent variables remains uncommon in epidemiology despite its potential usefulness. The latter was illustrated by studying cross-sectional and longitudinal relationships between eating behavior and adiposity, using four different indicators of fat mass. METHODS: Using data from a longitudinal community-based study, we fitted structural equation models including two latent variables (respectively baseline adiposity and adiposity change after 2 years of follow-up), each being defined, by the four following anthropometric measurement (respectively by their changes): body mass index, waist circumference, skinfold thickness and percent body fat. Latent adiposity variables were hypothesized to depend on a cognitive restraint score, calculated from answers to an eating-behavior questionnaire (TFEQ-18), either cross-sectionally or longitudinally. RESULTS: We found that high baseline adiposity was associated with a 2-year increase of the cognitive restraint score and no convincing relationship between baseline cognitive restraint and 2-year adiposity change could be established. CONCLUSIONS: The latent variable modeling approach enabled presentation of synthetic results rather than separate regression models and detailed analysis of the causal effects of interest. In the general population, restrained eating appears to be an adaptive response of subjects prone to gaining weight more than as a risk factor for fat-mass increase." [Accessed May 6, 2010]. Available at: http://www.biomedcentral.com/1471-2288/10/37.

Kelly PA. Overview of Computer-Intensive Statistics. Available at: http://www.hsrd.houston.med.va.gov/AdamKelly/resampling.html [Accessed March 4, 2009]. Description: This page provides a nice overview of the permutation test, randomization test, Monte Carlo estimation, bootstrapping, the jackknife, and Markov Chain Monte Carlo methods.

Walters S, Campbell M. The use of bootstrap methods for analysing health-related quality of life outcomes (particularly the SF-36). Health and Quality of Life Outcomes. 2004;2(1):70. Available at: http://www.hqlo.com/content/2/1/70 [Accessed March 4, 2009]. Description: The article provides an illustrative example of how to use the bootstrap method.

Li P. The Zoo of Loglinear Analysis. Available at: http://facultystaff.richmond.edu/~pli/psy538/loglin02/index.html [Accessed March 4, 2009]. Excerpt: "Loglinear Analysis is a multivariate extension of Chi Square. You use Loglinear when you have more than two qualitative variables. Chi Square is insufficient when you have more than two qualitative variables because it only tests the independence of the variables. When you have more than two, it cannot detect the varying associations and interactions between the variables. Loglinear is a goodness-of-fit test that allows you to test all the effects (the main effects, the association effects and the interaction effects) at the same time."

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2017-06-15. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. copyright law.


20. Stats: Bootstrap estimates of the standard error (June 20, 2008). A regular correspondent (JU) on the MEDSTATS email discussion group asked about using the bootstrap to estimate the standard error of the mean in a simple case with 9 data values. He wanted to know why the commonly used approach in the bootstrap community was to use n instead of n-1 in the variance denominator. It seemed to him that n-1 would produce an unbiased estimate of the standard error, and he wanted to know if that was true just in this special case or true in general. He quoted the book by Efron and Tibshirani, who felt that for most purposes either method would work well.

19. Stats: A brief overview of instrumental variables (April 14, 2008). People will often ask me questions that are outside my area of expertise. Yes, I know you're shocked to hear this, but there are lots of areas of statistics where I only have a vague understanding. One of these questions was about instrumental variables. I could only offer a vague explanation, but I hope that is better than no explanation at all.


18. Stats: Parametric tests for a ratio (October 27, 2006). Dear Professor Mean, I computed a variable, Y3, which is the ratio of two other variables, Y1 and Y2. Can I use a parametric test on this ratio?

17. Stats: The problem with ranking ordinal scales (June 29, 2006). When I was young and naive, I thought that anytime you encountered ordinal data, it would make the most sense to use a test statistic based on ranks, such as the Mann-Whitney-Wilcoxon test or the Kruskal-Wallis test. Unfortunately, the ranks can sometimes distort the true nature of an ordinal scale. I thought that I had provided an example of how ranks can distort things, but I could not find it this morning when someone asked a question relating to ordinal scales. So here is the example again.

16. Stats: Randomization tests for paired data (January 24, 2006). The randomization test offers a lot of flexibility for analyzing data in ways well beyond what traditional tests might offer. Here's a simple example from the Chance Data Sets web page.


15. Stats: Outcomes research (November 24, 2004). Someone asked me for a simple definition of outcomes research. I hemmed and hawed and could not come up with a good definition. It turns out that the Agency for Healthcare Research and Quality has a nice definition.

14. Stats: Report cards (August 27, 2004). I'm working on a project looking at some outcomes that might eventually become part of a report card or benchmarking system. This is an area fraught with controversy and it needs to be handled very carefully. Here are a few references that I have accumulated that address some of these issues.

13. Stats: Randomization test (July 14, 2004). I received some data from a project where the outcome measure was the degree of improvement after a treatment, with values of -1 (slight decline), 0 (no change), 1 (slight improvement), 2 (moderate improvement), and 3 (large improvement). The two treatments had quite different results. The old therapy had eight patients, three of whom showed a slight decline and five of whom showed no change. Among the eight patients in the new therapy, one showed no change, three showed a slight improvement, six showed moderate improvement, and two showed a large improvement. There are several approaches that you could try with this data. Even though I did not have a problem with computing averages, I was a bit nervous about the t-test. This data is clearly non-normal, and with the sample sizes as small as they are, I'd be worried about whether the t-test would be valid. An interesting alternative is the randomization test.

12. Stats: McNemar's Test (June 17, 2004). I received an email asking how to test two correlated proportions to see if one proportion is significantly larger than another. This is a classic application of McNemar's test.

11. Stats: Analyzing percentage data (May 24, 2004). I received one of those difficult-to-answer questions: how do I analyze my data when the outcome variable is a percentage? That depends a lot on the context of the problem. The first thing to look at is whether the percentage involves counts of some type, and if so, whether you know the numerator and denominator. Alternatively, the percentage might be the ratio of two continuous measurements.


10. Stats: Parametric versus nonparametric tests (July 30, 2001). Dear Professor Mean: When should I use a parametric test versus a non-parametric test?


9. Stats: Outliers (January 28, 2000). Dear Professor Mean: I have recently conducted a survey of attitudes toward research from a professional group. There are some outliers (+/- 3SD) that I would eliminate, but others conducting the research with me feel that this might be a minority view, and should not be eliminated from the dataset. Any views or references that I should read to confirm my view, or theirs?

8. Stats: Composite scores (January 27, 2000). Dear Professor Mean: I have developed a method to distinguish among several products that we need to buy so our company can make a good purchasing decision. I created a composite score which is a weighted average of several different indicators of quality. I want to use statistics to determine when two different products have significantly different composite scores.

7. Stats: Mixture models (January 27, 2000). Dear Professor Mean: I have read a journal article where the authors used a mixture model. What is this?

6. Stats: Physician Performance Data (January 27, 2000). Dear Professor Mean: Producing statistics of physician performance or group performance or whatever seems to be one of the great growth industries in medicine. Graphs of performance in just about anything seem to be produced - usually with something that looks at first glance like a normal distribution (and almost never with any statistical addenda). But I would like to know whether we can use them sensibly as anything other than pictures? In particular when I am one of the subjects of the analysis how do I interpret my own performance?

5. Stats: Splines (January 27, 2000). Dear Professor Mean: Can you send me a basic definition of splines?

4. Stats: Bootstrap (January 26, 2000). Dear Professor Mean: I've heard a lot about how the bootstrap is going to revolutionize statistics. How does the bootstrap work?


3. Stats: Injury index creation (September 23, 1999). Dear Professor Mean: I want to create an injury index that describes the severity of an injury to a child. This would include information about the type of injury, the location of the injury, the age of the child, etc. What's the best way to do this?

2. Stats: Chi-square (September 3, 1999). Dear Professor Mean: Can the Chi-squared test be used for anything besides categorical data?

1. Stats: Page's test (September 3, 1999). Dear Professor Mean: I have recently come across a statistical test (Page's L test), with which I am unfamiliar. Does anyone either have information about this test or know where I might find information about it?
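
The n versus n-1 question raised in entry 20 above can be explored numerically. The sketch below uses a hypothetical sample of 9 values (not the data from the original question) and a hypothetical helper, `bootstrap_se_of_mean`. It resamples the data with replacement many times, takes the standard deviation of the resampled means, and compares it with the two formula-based standard errors. With enough replicates, the bootstrap value should land close to the version that divides by n, because resampling treats the observed sample as if it were the whole population. This is only an illustration and does not settle which divisor is preferable in practice.

```python
import math
import random
import statistics

# Hypothetical sample of 9 values, chosen only for illustration.
data = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 17.0, 13.0, 16.0]
n = len(data)


def bootstrap_se_of_mean(values, replicates=20000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)
    size = len(values)
    means = [
        statistics.fmean(rng.choices(values, k=size))  # one bootstrap sample mean
        for _ in range(replicates)
    ]
    # Spread of the bootstrap means is the bootstrap standard error.
    return statistics.stdev(means)


boot_se = bootstrap_se_of_mean(data)

# Formula-based standard errors with the two competing divisors.
se_divisor_n = statistics.pstdev(data) / math.sqrt(n)          # variance uses n
se_divisor_n_minus_1 = statistics.stdev(data) / math.sqrt(n)   # variance uses n-1

print("bootstrap SE:      ", round(boot_se, 4))
print("formula, divisor n:", round(se_divisor_n, 4))
print("formula, n-1:      ", round(se_divisor_n_minus_1, 4))
```

For a sample of 9, the two formulas differ by a factor of sqrt(8/9), roughly 6 percent, so the gap is visible here but shrinks quickly as n grows.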
