The Monthly Mean newsletter, May 2013. Released June 16, 2013.

--> Introduction

--> What can you do with a p-value of 0.078?

--> My data set is so big that everything ends up statistically significant

--> Article: Barnes B. Solving Equation of a Hit Film Script, With Data

--> Book: The Signal and the Noise: Why So Many Predictions Fail — but Some Don't.

--> Nick News: Nick takes a turn at pitching

--> Quote: Statisticians, like artists, have...

--> Trivia: What is the number of characters who sing ...

--> Tell me what you think.

--> Permission to re-use any of the material in this newsletter

--> Introduction. Welcome to the Monthly Mean newsletter for May 2013. "What?" you say, "June is more than halfway done!" I know, I know. My goal was to send this out no later than June 5, but I was helping on a couple of major research grants, and that was quite a distraction. I'm still planning to get the June newsletter out by early July. Wish me luck. My Dad had a saying about someone being "a day late and a dollar short." He wasn't talking about me, but I must admit that the day late part applies to almost everything I do.

--> What can you do with a p-value of 0.078? Someone posed an interesting hypothetical question. Suppose you wrote a paper that compared two treatments, found a p-value of, say, 0.078, and then claimed that the new method worked better than the old method. Do you think the editor and/or reviewers would let it pass?

Well, no, they wouldn't let it pass, but maybe they should. I like to bring out the 1965 article by Sir Austin Bradford Hill and his nine factors that can help establish a causal relationship. These were designed for observational studies, but I think they are reasonable criteria for randomized studies as well. If you have a p-value of 0.078, and the effect is strong, and there is a biological gradient, and there is a plausible scientific mechanism, and so on, then that p-value of 0.078 can be very strong support for the new method working better than the old method. If you lack all of these things, then even a p-value smaller than 0.05 is inadequate support for the claim that the new method works better than the old method.
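A small numerical sketch may help. The numbers below are entirely hypothetical, chosen only so that a two-sided z-test lands near p = 0.078; the point is that a p-value like this is far more informative when reported alongside the effect size and its confidence interval than as a bare pass/fail verdict.

```python
# Hypothetical example: report the effect size and confidence interval
# alongside a p-value of about 0.078, rather than just "not significant."
from statistics import NormalDist

# Assumed numbers, chosen only so the test lands near p = 0.078.
diff = 2.2   # observed improvement, new method minus old
se = 1.25    # standard error of that difference

z = diff / se
p = 2 * (1 - NormalDist().cdf(z))            # two-sided p-value
lo, hi = diff - 1.96 * se, diff + 1.96 * se  # 95% confidence interval

print(f"z = {z:.2f}, p = {p:.3f}")                       # z = 1.76, p = 0.078
print(f"95% CI for the difference: ({lo:.2f}, {hi:.2f})")  # (-0.25, 4.65)
```

The interval barely crosses zero and mostly covers values favoring the new method, which is exactly the setting where the Bradford Hill factors can tip the overall judgment one way or the other.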

--> My data set is so big that everything ends up statistically significant. Dear Professor Mean, I am analyzing data from the Behavioral Risk Factor Surveillance System, and I have over 40,000 observations in my data set. When I fit a logistic regression model with Chronic Obstructive Pulmonary Disease (COPD) as the outcome, every single variable ends up being significant. Should I remove variables based on the effect size or biological plausibility or something else?

It's important to ask yourself why you wanted to run a logistic regression model in the first place. Is it to understand risk factors associated with COPD? Is it to try to develop a diagnostic test for COPD? Is it to get a risk-adjusted comparison of a key variable? The objective will help you decide what approach is best.

For example, if your goal is to develop a diagnostic test, then leaving out many of the variables may make sense. A parsimonious model leads to a simpler and cheaper diagnostic test.

If your goal is to identify risk factors, then a parsimonious model is not necessarily your friend. If two risk factors are highly correlated but both are highly related to COPD, then eliminating one will prevent you from gaining important insights into the risk factors.

In risk adjustment, the very large data set gives you the luxury of putting every reasonably plausible confounding variable into the mix. Don't worry about eliminating variables that turn out to be unimportant. Leaving out an important confounder is a far greater sin than keeping a worthless variable in your model.

But whatever your goal, please do screen out first any variable that you know based on your medical knowledge is inappropriate. Putting a bunch of "garbage" variables in your model is never going to be helpful, no matter what your objective is.
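To see why sheer sample size drives this, here is a simplified sketch with made-up numbers. It uses a two-proportion z-test rather than a full logistic regression, but the mechanism is the same: the standard error shrinks with the square root of n, so even a one-percentage-point difference in COPD rates becomes "significant" at a BRFSS-scale sample size.

```python
# Sketch (assumed numbers): the same small difference in COPD rates,
# 10% versus 11%, tested at two different sample sizes.
from math import sqrt
from statistics import NormalDist

def two_prop_p(p1, p2, n):
    """Two-sided p-value for a difference in proportions, two groups of n each."""
    pooled = (p1 + p2) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n)  # standard error shrinks as n grows
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))

print(two_prop_p(0.10, 0.11, 200))    # small study: p = 0.74, not significant
print(two_prop_p(0.10, 0.11, 20000))  # 40,000 total: p = 0.001, "significant"
```

The effect is identical in both runs; only the sample size changed. That is why, with 40,000 observations, the question shifts from "is it significant?" to "is the effect large enough to matter for my objective?"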

--> Article: Barnes B. Solving Equation of a Hit Film Script, With Data. The New York Times, May 5, 2013. Excerpt: "Forget zombies. The data crunchers are invading Hollywood. The same kind of numbers analysis that has reshaped areas like politics and online marketing is increasingly being used by the entertainment industry."

--> Book: Nate Silver. The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. From the back cover: Nate Silver built an innovative system for predicting baseball performance, predicted the 2008 election within a hair's breadth, and became a national sensation as a blogger—all by the time he was thirty. The New York Times now publishes FiveThirtyEight.com, where Silver is one of the nation's most influential political forecasters.

Drawing on his own groundbreaking work, Silver examines the world of prediction, investigating how we can distinguish a true signal from a universe of noisy data. Most predictions fail, often at great cost to society, because most of us have a poor understanding of probability and uncertainty. Both experts and laypeople mistake more confident predictions for more accurate ones. But overconfidence is often the reason for failure. If our appreciation of uncertainty improves, our predictions can get better too. This is the "prediction paradox": The more humility we have about our ability to make predictions, the more successful we can be in planning for the future.

In keeping with his own aim to seek truth from data, Silver visits the most successful forecasters in a range of areas, from hurricanes to baseball, from the poker table to the stock market, from Capitol Hill to the NBA. He explains and evaluates how these forecasters think and what bonds they share. What lies behind their success? Are they good—or just lucky? What patterns have they unraveled? And are their forecasts really right? He explores unanticipated commonalities and exposes unexpected juxtapositions. And sometimes, it is not so much how good a prediction is in an absolute sense that matters but how good it is relative to the competition. In other cases, prediction is still a very rudimentary—and dangerous—science.
Silver observes that the most accurate forecasters tend to have a superior command of probability, and they tend to be both humble and hardworking. They distinguish the predictable from the unpredictable, and they notice a thousand little details that lead them closer to the truth. Because of their appreciation of probability, they can distinguish the signal from the noise. With everything from the health of the global economy to our ability to fight terrorism dependent on the quality of our predictions, Nate Silver's insights are an essential read.

--> Nick News: Nick takes a turn at pitching. Nick really wanted to try organized baseball this year and convinced us to sign him up for a team. We went with the leagues organized by Blue Valley Recreation. I was worried about whether he would find the game boring, but he has really gotten into it. At his age, a lot depends on the pitcher, and many of the pitchers for our team and the other teams give up more walks than hits, just because it is so hard to get the ball over the plate consistently. After one particularly rough game with our pitchers (one of our pitchers walked seven out of eight batters and put the other one on base when he hit him with a pitch), Nick asked to try pitching.

I worked a bit with him on accuracy for a couple of days before his debut, but I don't know enough about baseball to tell him anything about his pitching form. I would kneel in a spot in our front yard (I am too old for the classic catcher's crouch) and hold my glove out as a target. He would be about 45 feet away and could hit my glove more often than not. I'm a bit of a klutz, and a couple of his pitches hit me in the chest, but that's a small price to pay. In his debut, he started out a bit rocky. He came in with the bases loaded and one out and proceeded to walk the first three batters. But even when he missed, he missed just by a little. Then he struck out the next two batters to end the inning. He started out the next inning with another strikeout. After that, it was a mix of hits and walks, but he did strike out two more batters, and a couple of the hits could be attributed to poor fielding. In any case, he loved pitching because he was in the center of the action.

His second outing was also a mixed bag, but he didn't have any wild pitches and no hit batters, so that is a sign that he is keeping things close. Here's a picture of him on the mound in the second game.

--> Quote: Statisticians, like artists, have the bad habit of falling in love with their models. George Box, as quoted at J. Michael Steele's website.

--> Trivia: What is the number of characters who sing the song "Heigh Ho" in the classic Disney cartoon "Snow White"?

Many people got last month's trivia question: What tune by the band Chicago has more numbers than words in its title? Jane Yank was first with "25 or 6 to 4" though she and a couple of others pointed out that if you write the title as "Twenty Five or Six to Four" it no longer has more numbers than words.

--> Tell me what you think. How did you like this newsletter? Give me some feedback by responding to this email. Unlike most newsletters where your reply goes to the bottomless bit bucket, a reply to this newsletter goes back to my main email account. Comment on anything you like, but I am especially interested in answers to the following three questions.
--> What was the most important thing that you learned in this newsletter?
--> What was the one thing that you found confusing or difficult to follow?
--> What other topics would you like to see covered in a future newsletter?

If you send a comment, I'll mention your name and summarize what you said in the next newsletter. It's a small thank you and acknowledgement to those who take the time to help me improve my newsletter. If you send feedback and you want to remain anonymous, please let me know.

One anonymous person liked my article on the model not running and characterized it as "Give up the fancy and do what grandma did." Neither of my grandmothers was a statistician, but I appreciate the point. He mentioned a personal example: he spent a long time working with vCard to share his phone number with a neighbor, while his son took a couple of seconds to simply text the number instead.

I always worry a bit that someone will get offended if I talk about their data analysis experience, but the person who asked about sequentially numbered opaque envelopes was actually pleased to see her query highlighted in my newsletter. I do try very hard not to reveal anything private or unflattering in the newsletter (or my website), but if any of you ever find something that I talk about relating to a consulting interaction with you to be too invasive or personal, please let me know.

--> Permission to re-use any of the material in this newsletter. This newsletter is published under the Creative Commons Attribution 3.0 United States License. You are free to re-use any of this material, as long as you acknowledge the original source. A link to or a mention of my main website, www.pmean.com, is sufficient attribution. If your re-use of my material is at a publicly accessible webpage, it would be nice to hear about that link, but this is optional.
