Archive organized by date (created 2009-01-06)
This page is moving to a new website.
This page lists files created in calendar year 2009. Also look at the archives
for 2012, 2011,
2010, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, and 1999. You can
also browse through an archive of pages organized by topic.
December 2009 (1 entry)
- P.Mean: Entering and analyzing data from a
two by two table, using PASW/SPSS (created 2009-12-14). One of the most common questions I hear is how to enter and analyze data
from a two by two crosstabulation. It is not immediately obvious, especially
to beginners, how to get started with this type of data. The table shown below
presents some data and statistics from several two by two crosstabulations. How do you take information
like this and enter it into PASW/SPSS,
so that you can produce a useful analysis?
November 2009 (5 entries)
- P.Mean: Randomly generating simple math
problems using R (created 2009-11-30). To help drill simple concepts in
math for my second grade son, I developed a series of R programs to generate
these problems randomly. It makes use of the sample function on a sequence of
integers and allows you to limit or expand the scope of the problems
generated. It is far from perfect, but it shows a few simple tricks in R.
- P.Mean: Generating multinomial random
variables in Excel (created 2009-11-23). Someone asked how to generate
six random integers subject to the conditions that the sum of those random
integers had to equal a value, x. This is a classic description of a
multinomial distribution. Unstated in the question, but assumed by me, was
that each random integer had to have the same distribution. That forces the
probability vector for the multinomial to be (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
- P.Mean: Jump start statistics for beginning
researchers (created 2009-11-07). I am a part-time faculty member at the
University of Missouri-Kansas City, one of four schools in the University of
Missouri system. I responded recently to an email sent to all faculty asking
for suggestions about eLearning and distance education. Here was my proposal.
- P.Mean: New Bioinformatics degree
program at UMKC (created 2009-11-03). I am working part-time in the
Department of Informatic Medicine and Personalized Health in the School of
Medicine at the University of Missouri-Kansas City. This department has just
developed a website advertising the program. Here are a few key links.
- P.Mean: Rotating locations (created
2009-11-02). Someone asked about holding a series of meetings with
subgroups of people and wanted to ensure during any round of the meetings
that people would meet at a different location than the previous round and
with a different mix of people. So on the first round of meetings, Allen,
Barb, Charlie, and Denise would meet at location E and Fred, Gina, Harry, and
Iona would meet at location J. On the next round, you'd mix things up so that
it wasn't the same four people at the same location.
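As a rough illustration of the multinomial entry above: the linked page works in Excel, so the Python below is a swapped-in sketch, not the method from that page. The function name `equal_multinomial` and the seed are hypothetical. It drops x items into six equally likely bins, so the six counts always sum to x.

```python
import random
from collections import Counter

def equal_multinomial(x, k=6, rng=None):
    """Draw k nonnegative integer counts that sum to x, each bin having probability 1/k."""
    rng = rng or random.Random()
    # Assign each of the x items to one of k equally likely bins.
    bins = Counter(rng.randrange(k) for _ in range(x))
    return [bins[i] for i in range(k)]

# Hypothetical example: six random integers that sum to 20.
counts = equal_multinomial(20, rng=random.Random(2009))
print(counts, sum(counts))
```

Counting bin assignments, rather than drawing six integers and rescaling them, is what keeps the sum exactly equal to x.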
October 2009 (6 entries)
- P.Mean: Sneaking ineligible patients
into a clinical trial (created 2009-10-30). There was an interesting
article in the New York Times (Chen PW. Bending the Rules of Clinical Trials)
that described a terminal cancer patient and the doctor's goal to get them
access to a new experimental drug, even though the patient was not eligible
for the clinical trial that was studying this drug. It's a difficult
situation for doctors. Do you do what's best for the patient in front of you,
knowing that the data collected from this patient might corrupt the findings
of the clinical trial?
- P.Mean: What is a normal probability plot?
(created 2009-10-29). The normal probability plot, sometimes called the qq plot,
is a graphical way of assessing whether a set of data looks like it
might come from a standard bell shaped curve (normal distribution). To compute a
normal probability plot, first sort your data, then compute evenly spaced
percentiles from a normal distribution. Optionally, you can choose the normal
distribution to have the same mean and standard deviation as your data, or you
can save some time by using evenly spaced percentiles from a standard normal
distribution. Finally, plot the evenly spaced percentiles versus the sorted
data. A reasonably straight line indicates a distribution that is close to
normal. A markedly curved line indicates a distribution that deviates from
- P.Mean: Data layout for an ROC curve (created
2009-10-16). Back in 1999, I wrote a brief description of the ROC curve
and showed what it would look like in SPSS. That page can be found at
www.childrensmercy.org/stats/ask/roc.asp. I didn't show, however, what
the data would look like when entered into SPSS or what the dialog boxes
would look like.
- P.Mean: Are we statisticians gods? (created
2009-10-13). I'm helping someone who wants an alternative statistical
analysis to the one used by the principal investigator. I'm happy to help and
will offer advice about why my approach may be better, but I was warned that
the PI considers the analysis chosen to be ordained by the "Statistic Gods"
at her place of work. I'm not sure what to make of the words "Statistic
- P.Mean: Use of Likert data with ANOVA (created
2009-10-13). I never quite feel I can offer my students a thoughtful
explanation about the use of Likert data with ANOVA. It is recommended that
ANOVA be used with interval or ratio data, but, in practice, ANOVA is
sometimes used when the data is ordinal (as you'd find when using Likert
scales). This confuses some students. Are there any good references out there
I can share with my students that might explain the pros and cons of using
ordinal data with ANOVA?
- P.Mean: Accounting for clusters in
an individually randomized clinical trial (created 2009-10-13). I have
a clinical trial with clusters (the clusters are medical practice), but
unlike a cluster randomized trial, I am able to randomize within each
cluster. From what I've read about this, I can provide an estimate for the
Intraclass Correlation Coefficient (ICC) that will decrease my sample size.
But I'm uncomfortable doing this. Can you help?
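The steps listed in the normal probability plot entry above (sort the data, compute evenly spaced percentiles from a standard normal distribution, pair them up) can be sketched as follows. This is a minimal sketch, not code from the linked page. The function name `normal_plot_points`, the sample data, and the plotting position (i + 0.5) / n are assumptions; other plotting positions are also in common use.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Return (theoretical quantile, sorted value) pairs for a normal probability plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Evenly spaced percentiles; (i + 0.5) / n is one common choice (an assumption here).
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))

# Hypothetical data values.
for q, value in normal_plot_points([4.1, 5.3, 2.8, 6.0, 4.7]):
    print(f"{q:7.3f}  {value}")
```

Plotting the first column against the second gives the graph; a roughly straight pattern suggests the data are close to normal.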
September 2009 (4 entries)
- P.Mean: Can I stop this study? (created
2009-05-29). I got an email from a researcher on a project I was peripherally involved
with a while back. Here's what she wrote (with a few details removed to protect
anonymity). As you all are aware, enrollment for the BLANK study has been slower
than anticipated. However, due to a high suspicion that patients in the
CONTROL arm were having more complications (more rescue therapy) and less
improvement, we have decided to look at the data prior to reaching our initial
proposed N=140. We had 79 patients enrolled. We found that significantly more
patients in the TREATMENT arm reported that their main symptom was better at
24 hours than the CONTROL arm (p=0.02). Also, we had 6 patients need some kind
of rescue, 5 of those were in the CONTROL arm (this approached statistical
significance, p=0.08). Therefore, I am writing to see if you agree with
stopping the study at this point. Please let me know at your earliest
- P.Mean: The problem with being too sensitive
or too specific (created 2009-09-16). Somebody asked my opinion about
cost effectiveness research. My bottom line is that I like it, but I
understand why it is controversial. Here's the logic that I presented to draw
- P.Mean: Power for a three arm experiment
(created 2009-09-14). "I want to compute power for a three arm
experiment. The outcome variable is binary (yes/no). I know how to compute
power for a two-arm experiment already, but have no idea how to handle the
- P.Mean: Getting a good cut-off
when sensitivity is more important than specificity (created 2009-09-14).
"I am working on a prediction model to help with diagnosis. In this
particular area I need a model that has the highest possible sensitivity (low
specificity is not a problem)." One obvious comment is that you can
achieve a sensitivity of 100% if you don't mind a specificity of 0%. So when
you say "low specificity is not a problem" that statement is only partially
true. What you mean to say is that false negatives are far more serious than
false positives. How much more serious, though? Five times? Ten times? Once
you've decided the relative costs of false negatives and false positives, the
rest is easy.
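Here is one way "the rest" might look once the relative costs are chosen. This is a minimal sketch, not the method from the linked page: the function name `best_cutoff`, the scores, and the disease labels are all hypothetical, and a subject is assumed to test positive when the score is at or above the cut-off.

```python
def best_cutoff(scores, diseased, cost_fn, cost_fp):
    """Pick the cut-off (test positive if score >= cut-off) with the lowest total cost."""
    best = None
    for cutoff in sorted(set(scores)):
        # False negatives: diseased subjects who fall below the cut-off.
        fn = sum(1 for s, d in zip(scores, diseased) if d and s < cutoff)
        # False positives: healthy subjects at or above the cut-off.
        fp = sum(1 for s, d in zip(scores, diseased) if not d and s >= cutoff)
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (cutoff, cost)
    return best

# Hypothetical predicted scores and true disease status.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
diseased = [False, False, True, False, False, True, True, True]
print(best_cutoff(scores, diseased, cost_fn=1, cost_fp=1))   # equal costs
print(best_cutoff(scores, diseased, cost_fn=10, cost_fp=1))  # false negatives ten times worse
```

With these made-up data, raising the cost of a false negative from 1 to 10 moves the chosen cut-off from 0.7 down to 0.35, trading specificity for sensitivity.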
August 2009 (3 entries)
- P.Mean: Tentative training schedule
(created 2009-08-31). I've been asked to develop a series of training
classes. Here's a first draft.
- P.Mean: The controversy over
standardized beta coefficients (created 2009-09-12). I have a client who
is working on her dissertation. I always warn people working on dissertations
or theses that they should listen more to what their committee members say
about statistics than what I say about statistics. If the committee loves the
statistical analysis and I hate it, you still get your degree. If I love the
statistical analysis and the committee hates it, you get nothing. For this
client, a committee member asked if she could produce standardized beta
coefficients in her regression models. I helped her write an argument as to
why the unstandardized coefficients are better, but the committee member gave
a reasonable counter-argument, so there was no point in persisting. Still, it
would be helpful here to outline some of the controversy over standardized
- P.Mean: Standard error for an odds ratio
(created 2009-08-12). I submitted an article to a journal that
included some odds ratios and their confidence intervals. The journal editor
said that their policy was to report standard errors and not confidence
intervals. How do I do this for an odds ratio?
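The answer given on the linked page is not reproduced in this summary. One standard approach, offered here as an assumption about what such an answer might use, is to report the large-sample standard error of the natural log of the odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), where a, b, c, and d are the four cell counts. The function name and the counts below are hypothetical.

```python
import math

def log_odds_ratio_se(a, b, c, d):
    """Odds ratio and large-sample standard error of its natural log for a two by two table.

    a, b = events and non-events in group 1; c, d = events and non-events in group 2.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, se_log_or

# Hypothetical counts, not from any study mentioned on this page.
odds_ratio, se = log_odds_ratio_se(20, 80, 10, 90)
low = math.exp(math.log(odds_ratio) - 1.96 * se)
high = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, SE(log OR) = {se:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Note that this standard error is on the log scale, which is why the confidence limits are computed on the log scale and then exponentiated.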
July 2009 (3 entries)
- P.Mean: Formula for multiple imputation
(created 2009-07-24). I'm working on a project that involves multiple
imputation, and I may have to program some of the work myself. I can use the
R package MICE to generate the imputed data sets, but then I have to use a
mixed linear model rather than a linear model. How do I combine the estimates
from the multiple imputed data sets? The estimate is just the average of the
individual estimates, but what about the standard error?
- P.Mean: The first three steps in
selecting an appropriate sample size (created 2009-07-20). I got an email
last week from a client wanting to start a new research project looking at
relationships between parenting beliefs and childhood behaviors. The
description of the sorts of things to examine was quite elaborate, and it
ended with the question "how many families would we need to have any
significant differences if they exist?" Unfortunately, all the elaborate
information provided did not include the information I would need to answer
this question. Justifying a sample size usually involves three steps.
- P.Mean: Do multiple time points require
a Bonferroni adjustment? (created 2009-07-18). I'm a little confused as
to when to apply the multiple comparisons correction. If I had a measure
which compared blood pressure (say) between two groups after 7, 14 and 21
days post procedure, would I need to adjust for multiple comparisons of the
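For the multiple imputation entry above, the usual way to combine results is Rubin's rules: the pooled estimate is the average of the individual estimates, and the pooled variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below assumes that is the formula intended; the linked page's own formula is not shown here, and whether it carries over unchanged to a mixed linear model is the question that entry raises. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance

def pool_estimates(estimates, std_errors):
    """Combine estimates from m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                      # average of the individual estimates
    within = mean(se ** 2 for se in std_errors)   # average squared standard error
    between = variance(estimates)                 # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical estimates and standard errors from five imputed data sets.
est, se = pool_estimates([1.8, 2.1, 2.0, 1.9, 2.2], [0.50, 0.48, 0.52, 0.49, 0.51])
print(f"pooled estimate = {est:.2f}, pooled SE = {se:.3f}")
```

The pooled standard error is never smaller than the root of the average within-imputation variance, because the between-imputation term only adds uncertainty.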
June 2009 (2 entries)
- P.Mean: The perils of self-evaluation
(created 2009-06-30). A survey by New Scientist magazine examined a
phenomenon called "citation amnesia." This is the tendency of researchers to
overlook previously published work in the bibliography of their articles.
Most of the respondents felt that citation amnesia was a problem. "Indeed,
the vast majority of the survey's roughly 550 respondents -- 85% -- said that
citation amnesia in the life sciences literature is an already-serious or
potentially serious problem. A full 72% of respondents said their own work
had been regularly or frequently ignored in the citations list of subsequent
publications. Respondents' explanations of the causes range from
maliciousness to laziness." There are several problems with this survey,
- P.Mean: What is the effect of an
unmeasured covariate? (created 2009-06-09). Suppose you want to conduct
an analysis of covariance, but you have data on some but not all of the
covariates. What do you miss out on because of the unmeasured covariate? To
understand this, we need to venture into the world of partitioned matrices.
May 2009 (5 entries)
- P.Mean: Institute of Medicine report on conflict
of interest (created 2009-05-24). The National Academies Press has
announced the release of a report, Conflict of Interest in Medical Research,
Education, and Practice, prepared by a special committee of the Institute of
- P.Mean: Data that IRBs should collect about
themselves (created 2009-05-22). Someone on the IRBForum (TS) asked about
what type of reports an IRB should provide. There were a lot of good
comments. I encouraged a data centric approach to reporting. Here's what I
- P.Mean: Developing a website logo (created
2009-05-22). I'm not big on graphic logos, but the Zotero website asked
for one. So I wrote a short R program to create a simple logo.
- P.Mean: Analyzing bad data (created 2009-05-22).
A discussion on the MEDSTATS email discussion group centered around a data
set involving blood loss. Blood loss was quantified into categories with
values of less than 250 ml, 250-500 ml, 500-1000 ml, and greater than 1000 ml.
The discussion centered on the inefficiencies created when continuous data is
reported in categories like these.
- P.Mean: NYTimes advice on increasing
website traffic (created 2009-05-11). The New York Times has an excellent
blog entry on increasing traffic to your website. It is well worth reading if
you write a lot of stuff for the web. I had a few additional comments which I
added in the comment section of this webpage.
April 2009 (4 entries)
- P.Mean: Is this a case-control design (created
2009-04-28). I have a stats study design question. If I were to look at the association
of curly hair for instance with a rash on the forehead, I pick a case control
study design. When I analyze this I find that 45% of kids in the clinic
(surprise) had curly hair. But I look at two groups curly vs non curly and the
outcome of interest is the rash on the forehead, instead of cases vs controls
so now, has this become an observational study instead of case control? Hope I
am making sense, this is only a theoretical question.
- P.Mean: Can I use you as a teaching example
(created 2009-04-20). I frequently ask people for permission to talk
about the projects I am helping them with, as they make great teaching
examples. Some people say no, and that's fine. I do offer a discount for
paying clients if you let me talk about this work on my web pages. One person
raised an important issue when I asked. That person asked me to keep details
about his/her organization anonymous if I was illustrating any boneheaded
- P.Mean: Calculating NNT for
indirect comparisons (created 2009-04-20). To calculate the Numbers
Needed to Treat (NNT) statistic for response rates when the effect size is
shown as an odds ratio I carry out the following calculation:
NNT = (1-(CER*(1-OR))) / ((1-CER)*(CER)*(1-OR)), where CER = Control Event
Rate and OR = Odds Ratio. My query occurs when I am calculating this for an
indirect comparison. So for example if I am comparing A and B vs a common
comparator C I have the following set up: Trial 1 - A vs C: Response rate A =
0.8, Response rate C = 0.6. Trial 2 - B vs C: Response rate B = 0.7, Response
rate C = 0.55. Indirect comparison gives (for example) A vs B odds ratio of
0.85 (0.6, 1.2). Is it valid to calculate the NNT by substituting CER = 0.8
and OR = 0.85 into the first equation?
- P.Mean: Calculating NNT for infection rates
(created 2009-04-15). I will be leading an EBM teaching session for
housestaff on an article about Methicillin-Resistant Staphylococcus aureus
infection rates. I was planning to analyze it using the standard questions
about therapy from the Users' Guides to the Medical Literature, but I was
wondering if there should be any special considerations, given that therapy (MRSA
screening & eradication) was given at a hospital-wide level. For example, the
results are presented as incidence of nosocomial MRSA infections per
person-years -- can I convert this to a percentage, to churn out a number
needed to treat (NNT)? Or is this statistically forbidden? Please let me know
of any journal articles you're aware of that address the issue of
studies taken at a hospital- or population-based level.
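The NNT formula quoted in the indirect comparisons entry above can be checked numerically. The sketch below evaluates it with the values from that question (CER = 0.8, OR = 0.85) and cross-checks it against the equivalent route through the implied experimental event rate. The function names are hypothetical. The arithmetic says nothing about whether the substitution is valid for an indirect comparison, which is the question that entry asks, and the formula as written assumes an odds ratio below 1.

```python
def nnt_from_odds_ratio(cer, odds_ratio):
    """NNT from a control event rate and an odds ratio, using the formula quoted above."""
    return (1 - cer * (1 - odds_ratio)) / ((1 - cer) * cer * (1 - odds_ratio))

def nnt_from_event_rates(cer, odds_ratio):
    """Same quantity via the implied experimental event rate (a cross-check)."""
    # Convert the control odds to experimental odds, then back to a rate.
    eer = odds_ratio * cer / (1 - cer + odds_ratio * cer)
    return 1 / (cer - eer)

print(round(nnt_from_odds_ratio(0.8, 0.85), 1))
print(round(nnt_from_event_rates(0.8, 0.85), 1))
```

Both routes give the same number, which confirms that the quoted formula is just 1 divided by the difference between the control event rate and the experimental event rate implied by the odds ratio.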
March 2009 (9 entries)
- P.Mean: A sportswriter tackles the Monty Hall
problem (created 2009-03-31). Joe Posnanski, a famous sports writer who
loves statistics, wrote a couple of entries in his blog about the famous
Monty Hall problem. I find the problem trite and annoying, but that probably
says something more about me than about the problem. It is a very popular
problem highly cited on the Internet and in many print publications. The
Wikipedia quotes it from Parade magazine in 1990. Suppose you're on a game
show, and you're given the choice of three doors: Behind one door is a car;
behind the others, goats. You pick a door, say No. 1, and the host, who knows
what's behind the doors, opens another door, say No. 3, which has a goat. He
then says to you, "Do you want to pick door No. 2?" Is it to your advantage
to switch your choice?
- P.Mean: Short biography (created
2009-03-30). At irregular intervals, I am asked to provide a brief
biography of myself. Here is the latest version, along with links to earlier
versions. I usually put this up on my website, not out of vanity, but rather
so that I would remember all the nice things that I am supposed to say about
myself. If you need material to introduce me as a speaker, to help write a
grant, or to get a better appreciation of who I am and what I do, please feel
free to read and use any of this material.
- P.Mean: Two business contacts (created
2009-03-30). I got a phone call today from someone applying for a very
high level job at Children's Mercy Hospital (CMH). I no longer work at CMH,
but this person wanted to see what I knew about this position, the person
making the hiring decision, the management climate at CMH, etc. I couldn't
offer too much advice as this position was quite different from the areas I
worked in, but I did try to help as best I could. During the discussion, this
person mentioned two business contacts that I might want to follow up with to
help build my consulting customer base.
- P.Mean: DNA binding image (created
2009-03-25). There is an important application of information theory in
DNA binding that I discussed at my old website. I may want to expand that
discussion into an article for Chance Magazine. If I do, here is an open
source image of DNA binding that might be useful.
- P.Mean: I love to write (or my newsletters are
getting longer) (created 2009-03-19). In high school and college I
dreaded writing term papers. Something has changed because now I love to
write. I started a monthly newsletter, and its length seems to be growing
with each month. Here are some statistics on newsletter lengths.
- P.Mean: Five points or seven points on a
survey scale (created 2009-03-12). I am creating a survey and wanted
to know if anybody can suggest a scale: both the wording and 5 versus 7
- P.Mean: Good papers for a journal club
(created 2009-03-07). I work as a biostatistician within a medical research
area and I am planning on starting a stats/research methods journal club.
This would be aimed at postgraduate students (from both science and medical
degree background), early career academic researchers (again, they come from
both science and medical backgrounds), and clinical researchers (medical
doctors from areas such as critical care and gastroenterology). In
conjunction with published work from their research areas I wish to use
papers that present fairly fundamental statistical concepts in an easy to
read manner. I imagine focusing more on theoretical/philosophical issues,
rather than 'this is how you do an ANOVA' type treatises. Does anyone have
any favourite such papers that they find useful for
- P.Mean: Locating individual points on an ROC
curve (created 2009-03-05). In a project examining a diagnostic test, I
was asked to develop an ROC curve. That is fairly easy to do. Six months
later, though, I was asked to designate a particular point on the curve
corresponding to a cutpoint of 7. This is a bit ambiguous, but in re-reading
the paper, it was obvious from the context that this meant locating the point
on the curve where a positive test result of 7 or less (alternatively a
negative test result of 8 or more) occurred. It takes a while to get oriented
properly on an ROC curve. Here's what I did.
- P.Mean: The surprisal matrix and
applications to exploration of very large discrete data sets (created
2009-03-04). The surprisal, defined as the negative of the base 2
logarithm of a probability, is a fundamental component used in the
calculation of entropy. In this talk, I will define a surprisal matrix for a
data set consisting of multiple discrete variables, possibly with different
supports. The surprisal matrix is useful in identifying areas of high
heterogeneity in such a data set, which often corresponds to interesting and
unusual patterns among the observations or among the variables. I will
illustrate two applications of the surprisal matrix: monitoring data quality
in a large stream of fixed format text data, and examining consensus in the
evaluation of sperm morphology.
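The definition in the surprisal entry above (the negative of the base 2 logarithm of a probability) can be sketched for a single discrete variable. This is only an illustration: it uses observed relative frequencies as the probabilities, which is an assumption, and it does not attempt to reproduce the surprisal matrix from the talk. The function names and data are hypothetical.

```python
import math
from collections import Counter

def surprisals(values):
    """Map each observed category to its surprisal, -log2 of its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {category: -math.log2(count / n) for category, count in counts.items()}

def entropy(values):
    """Entropy is the average surprisal across the observations."""
    s = surprisals(values)
    return sum(s[v] for v in values) / len(values)

# Hypothetical discrete variable with two rare categories.
codes = ["A", "A", "A", "A", "A", "A", "B", "C"]
print(surprisals(codes))
print(entropy(codes))
```

Rare categories get large surprisals, which is why unusually high values can flag observations worth a closer look.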
February 2009 (3 entries)
- P.Mean: Fewer than 10 events per variable
(created 2009-02-18). I am in the
process of advising on the design of a study using logistic regression. There
are five confounding variables and a treatment variable. If I apply the rule
that you need 10 events per variable (EPV), then I need 60 events. I expect
that the probability of observing an event is 40%. This means that I'll need
data on 60 / 0.4 = 150 patients. I can only collect data on 90 patients, and
that sample size gives me more than adequate power. Since my power will be
fine, can I ignore the rule of thumb about 10 EPV?
- P.Mean: Interpreting a negative
autocorrelation (created 2009-02-16). I have two questions regarding
autocorrelation: if there is negative autocorrelation is it correct to say
that "past values decreasingly influence future values"? Why is positive
autocorrelation considered more important by most statisticians?
- P.Mean: Acknowledging the
contributions of a statistician (created 2009-02-16). A while back you
assisted me with stats on my paper. I am finally ready to submit and wanted
to know how I should appropriately acknowledge you for your participation
since you are no longer at Children's Mercy Hospital.
January 2009 (5 entries)
- P.Mean: Calibrating information
using a two by two table (created 2009-01-28). In a previous webpage, I discussed the concept of joint entropy,
conditional entropy, and information. The information for two measurements is
zero if the two measurements are statistically independent. Information
increases between two measurements as the degree of dependence (either
positive or negative) increases. I thought it would be helpful to visualize
this relationship graphically.
- P.Mean: Changes in the adjusted hazard
ratio, but not in the precision of the ratio (created 2009-01-19).
Does anyone know a good reference on why, in Cox regression of a clinical
trial, including covariates often changes the treatment hazard ratio rather
than narrowing the confidence interval? I can remember attending a talk on
this years ago, but cannot remember the details.
- P.Mean: Drawing simple mathematical
graphs (created 2009-01-14). I'm looking for a good, basic, relatively
easy-to-use graphing program, to draw simple mathematical graphs one would
see in basic calculus, algebra, statistics. Something similar to Paint, but a
step or two up from it, and that I could copy and paste images, venn
diagrams, etc., into a Word file, and the quality would be publication
quality. I want something that is MUCH more versatile than one would get
using Excel or similar.
- P.Mean: A simple example of joint and
conditional entropy (created 2009-01-07). In a project involving sperm
morphology classification, I have found the concept of entropy very
useful in analyzing the data and describing certain patterns. I want to
extend the work to include joint and conditional entropy. I wanted to start
with a simple data set, so I downloaded a file from the Data and Story
Library website. There is an interesting file "High Fiber Diet Plan" that
provides a useful way to explore joint and conditional entropy.
- P.Mean: Maybe Powerpoint isn't so bad
(created 2009-01-06). I have been harshly critical of PowerPoint in the
past (though I did post a rejoinder from one of the readers of my old
website). Most of my criticisms were inspired by Edward Tufte, who wrote an
article for Wired magazine (Powerpoint is evil) and a short monograph (The
Cognitive Style of Powerpoint). In preparing a newsletter article about
Edward Tufte and his new book, Beautiful Evidence, I came across some
reviewers who take Dr. Tufte to task for his harsh criticisms of Powerpoint.
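The claim in the two by two table entry above, that information is zero under independence and grows with either positive or negative dependence, can be checked with a small calculation. This is a minimal sketch that takes "information" to mean mutual information in bits, computed from cell counts. The function name and the tables are hypothetical, not taken from the linked page.

```python
import math

def information(table):
    """Mutual information (in bits) for a two by two table of counts [[a, b], [c, d]]."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    info = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # a zero cell contributes nothing
                p = count / total
                info += p * math.log2(p / (row_p[i] * col_p[j]))
    return info

print(information([[25, 25], [25, 25]]))  # independent measurements
print(information([[40, 10], [10, 40]]))  # positive dependence
print(information([[10, 40], [40, 10]]))  # negative dependence
```

The second and third tables give the same value, since swapping the columns reverses the direction of the dependence without changing its strength.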
2011-01-01. Need more