P.Mean >> Category >>  Hypothesis testing (created 2007-06-16).

Hypothesis testing is a set of formal methods to select between two competing research hypotheses. These pages discuss some of the philosophical underpinnings for hypothesis testing as well as some pragmatic concerns. Articles are arranged by date with the most recent entries at the top. Also see Bayesian statistics and Confidence intervals. You can find outside resources at the bottom of this page.

2016

18. P.Mean: So you're thinking about a t-test (created 2016-06-18). You've got your data and you've heard that you need to analyze it using a t-test. Congratulations on getting this far. A t-test is pretty easy as far as statistical tests go, but it never hurts to talk to a statistician about this if you can. I'm going to give some general guidance about how to approach a data analysis when you think you need a t-test.

2012

17. P.Mean: A single wildly large value makes you less confident that the mean of your data is large (created 2012-12-12). I was working on a project that seemed to be producing some counter-intuitive results. The work involved ratios, and one of the experiments had an unusually large ratio. I tried a log transformation, which tends to pull down that large ratio. It improved the precision of the results, which you might expect. But it also reduced the p-value, which you might not expect. After all, if you use a log transformation to de-emphasize large values, won't that attenuate a test that tries to show that the average value is large? This bothered me for a while, so I developed a series of simple examples to resolve the apparent inconsistency.
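The effect described in the entry above can be reproduced with a small sketch. The ratios below are invented for illustration and are not the author's examples. The null hypothesis (a mean ratio of 1, or a mean log ratio of 0) and the helper name one_sample_t are also assumptions.

```python
import math
import statistics

def one_sample_t(values, null_mean):
    """One-sample t statistic: (mean - null_mean) / (sd / sqrt(n))."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - null_mean) / standard_error

# Hypothetical ratios; the last value in the first list is the "wildly large" one.
ratios_with_outlier = [1.5, 2.0, 1.8, 2.2, 40.0]
ratios_without_outlier = [1.5, 2.0, 1.8, 2.2, 2.5]

# Raw scale: test whether the mean ratio differs from 1.
t_raw = one_sample_t(ratios_with_outlier, null_mean=1.0)
# Log scale: test whether the mean log ratio differs from 0.
t_log = one_sample_t([math.log(r) for r in ratios_with_outlier], null_mean=0.0)
# Same test on the raw scale, with the large value replaced by an ordinary one.
t_tame = one_sample_t(ratios_without_outlier, null_mean=1.0)

print(f"raw scale, with large value:    t = {t_raw:.2f}")
print(f"log scale, with large value:    t = {t_log:.2f}")
print(f"raw scale, without large value: t = {t_tame:.2f}")
```

With these numbers the large value raises the mean but inflates the standard deviation even more, so the raw-scale t statistic is the smallest of the three. All three tests have the same degrees of freedom, so a larger t statistic corresponds to a smaller p-value.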

2011

16. P.Mean: Borderline p-values (created 2012-09-19). Dear Professor Mean, I originally reported a p-value of 0.04 for a Chi-Square test, but I was told to use the Fisher's Exact Test instead. The p-value for Fisher's Exact Test is 0.06. Do I have to drop the discussion of statistical significance?

15. P.Mean: When should I use the Fisher's Exact Test and when should I use the Chi-Square Test (created 2012-09-19). Dear Professor Mean, I was running crosstabs in SPSS for a two-by-two table and the p-values disagree. The p-value for the Pearson Chi-Square is 0.04 and the p-value for the Fisher's Exact Test (2-sided) is 0.06. Which one should I use?
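To see how the two tests can land on opposite sides of 0.05, here is a minimal sketch using only the Python standard library. The two-by-two table is hypothetical (it is not the reader's data), the chi-square test is computed without a continuity correction, and the two-sided Fisher p-value uses the common "tables no more likely than the observed one" rule. The function names are illustrative.

```python
import math

def chi_square_p(a, b, c, d):
    """Pearson chi-square p-value (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # The upper tail of a chi-square with 1 df equals erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Integer proportional to P(top-left cell = k); avoids float ties.
        return math.comb(row1, k) * math.comb(row2, col1 - k)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / math.comb(row1 + row2, col1)

# Hypothetical table: 8 of 10 successes in one group, 3 of 10 in the other.
p_chi = chi_square_p(8, 2, 3, 7)
p_fisher = fisher_exact_p(8, 2, 3, 7)
print(f"Pearson chi-square p = {p_chi:.3f}")
print(f"Fisher's exact p     = {p_fisher:.3f}")
```

For this small table the chi-square p-value falls below 0.05 while the Fisher's Exact Test p-value falls above it, which is the same kind of disagreement the reader describes.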

14. The Monthly Mean: First things first--tell me your research hypothesis (September-November 2011)

2010

13. The Monthly Mean: Tests of equivalence and non-inferiority (December 2010)

12. P.Mean: If you knew that failure was not an option, what would you do (created 2010-10-01). There is a question and answer forum on LinkedIn where people ask all sorts of questions. A common theme among some people there is to ask motivational questions, which I try to respond to sometimes with an off-beat answer. There was a question along these lines: "If you knew that failure was not an option, what would you do?" I started off with a rather flippant answer, but then realized that there was a more serious answer.

11. What are Type I and Type II errors? (August 2010)

2008

10. P.Mean: What is an intervening variable (created 2008-10-20). I'm familiar with dependent and independent variables but I just heard about intervening variables. Please tell me what are they, and how they deal with the other variables.

9. Normality assumptions for the paired t-test (created 2008-10-14). I am confused about which data have to be normally distributed in a paired t-test for testing that two data sets differ significantly. Everitt-Hothorn "A handbook of statistical analyses using R", page 33 says that the differences between the data should be normally distributed without implying anything about whether the original data should be normally distributed, while Wiki t-test and Field "Discovering statistics using SPSS" page 287 imply that both of the original data sets should be normally distributed. Considering that I am a beginner in statistics, I am confused. Can you give me any clues please?
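In the usual textbook formulation, the paired t-test is a one-sample t-test on the within-pair differences, so the test statistic depends on the original measurements only through those differences. The sketch below illustrates that point with invented data. It is not taken from the linked article, and the names paired_t, before, after, and baselines are illustrative.

```python
import math
import statistics

def paired_t(first, second):
    """Paired t statistic, computed only from the within-pair differences."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical before/after measurements on six subjects.
before = [10.0, 12.0, 9.0, 11.0, 10.5, 13.0]
after = [11.2, 12.9, 10.5, 11.8, 12.0, 13.6]

# Give each subject a very different baseline, which makes both original
# columns badly skewed but leaves every within-pair difference the same.
baselines = [0.0, 1.0, 2.0, 50.0, 400.0, 1000.0]
before_shifted = [x + s for x, s in zip(before, baselines)]
after_shifted = [y + s for y, s in zip(after, baselines)]

t_original = paired_t(before, after)
t_shifted = paired_t(before_shifted, after_shifted)
print(f"t from original data: {t_original:.4f}")
print(f"t from shifted data:  {t_shifted:.4f}")
```

The two t statistics agree up to floating-point rounding, even though the shifted columns are far from normal.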

8. P.Mean: Comparing two proportions out of the same multinomial population (created 2008-08-05). I am lucky enough to be researching wine. Specifically I am exploring which components in wine result in maximised preference. At the moment I am trying to compare proportions from the same population. N = 68. 8 people most preferred wine 1, 25 most preferred wine 2, 1 most preferred wine 1 and 2, for 34 of participants their most preferred wine was another wine. I want to see if the proportion of people that chose wine 1 was significantly different from the proportion that chose wine 2. I have been recommended to use McNemar's. But I just don't know how. I found your website which is as close as I have got but is slightly different. Just wondering if you had any thoughts? Cheers
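One common large-sample approach to this kind of question, which is not necessarily the one the linked article recommends, uses the fact that two counts n1 and n2 from the same multinomial sample have a difference with estimated variance n1 + n2 when the two proportions are equal. The sketch below applies that McNemar-style statistic to the reader's counts. Leaving out the one participant who preferred both wines is an assumption made here for simplicity, and the function name is illustrative.

```python
import math

def compare_multinomial_counts(n1, n2):
    """Large-sample z-test that two categories of one multinomial
    sample have equal proportions: z = (n1 - n2) / sqrt(n1 + n2)."""
    z = (n1 - n2) / math.sqrt(n1 + n2)
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Counts from the question: 8 preferred wine 1, 25 preferred wine 2.
# Assumption: the participant who preferred both wines is excluded.
z, p = compare_multinomial_counts(8, 25)
print(f"z = {z:.2f}, chi-square = {z * z:.2f}, two-sided p = {p:.4f}")
```

The squared z statistic has the same form as McNemar's chi-square, (n1 - n2)^2 / (n1 + n2), which may be why that test was suggested to the reader.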

7. P.Mean: How to report a one-tailed Fisher's Exact test (created 2008-07-12). Thank you for your informative page about the Fisher's Exact test. Can you please clarify how whether the test was 1 or 2-tailed affects the way that a significant result would be reported?

Outside resources:

An alternative to null-hypothesis significance tests. P. R. Killeen. Psychol Sci 2005: 16(5); 345-53. Description: This article describes p-rep, a statistic that measures the probability of replication. The article argues that this measure is superior to the p-value and also covers the mathematical details needed for calculation of the statistic.

EDF 5841 Methods of Educational Research. Guide 2: Variables and Hypotheses. Susan Carol Losh, Florida State University, September 3, 2001. Description: This webpage provides simple definitions of terms commonly used in educational research such as intervening variable, conceptual hypothesis, and operational variables. URL: edf5481-01.fa01.fsu.edu/Guide2.html

Webpage: U.S. Food and Drug Administration. Guidance for Industry Non-Inferiority Clinical Trials. Excerpt: "This guidance provides sponsors and review staff in the Center for Drug Evaluation and Research (CDER) and Center for Biologic Evaluation and Research (CBER) at the Food and Drug Administration (FDA) with our interpretation of the underlying principles involved in the use of non-inferiority (NI) study designs to provide evidence of the effectiveness of a drug or biologic. The guidance gives advice on when NI studies can be interpretable, on how to choose the NI margin, and how to analyze the results." [Accessed on November 16, 2011]. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM202140.pdf.

Committee for Medicinal Products for Human Use. Guideline on the choice of the non-inferiority margin. Excerpt: "Many clinical trials comparing a test product with an active comparator are designed as non-inferiority trials. The term 'non-inferiority' is now well established, but if taken literally could be misleading. The objective of a non-inferiority trial is sometimes stated as being to demonstrate that the test product is not inferior to the comparator. However, only a superiority trial can demonstrate this. In fact a noninferiority trial aims to demonstrate that the test product is not worse than the comparator by more than a pre-specified, small amount. This amount is known as the non-inferiority margin, or delta (Δ)." [Accessed September 13, 2010]. Available at: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003636.pdf.

Wikipedia: Intervening variable. Excerpt: An intervening variable is a hypothetical internal state that is used to explain relationships between observed variables, such as independent and dependent variables, in empirical research. An intervening variable facilitates a better understanding of the relationship between the independent and dependent variables when the variables appear to not have a definite connection. They are studied by means of operational definitions and have no existence apart. URL: en.wikipedia.org/wiki/Intervening_variable

Webpage: U.S. General Accounting Office. New Drug Approval: FDA's Consideration of Evidence from Certain Clinical Trials. Excerpt: "Before approving a new drug, the Food and Drug Administration (FDA)--an agency of the Department of Health and Human Services (HHS)--assesses a drug's effectiveness. To do so, it examines information contained in a new drug application (NDA), including data from clinical trials in humans. Several types of trials may be used to gather this evidence. For example, superiority trials may show that a new drug is more effective than an active control--a drug known to be effective. Non-inferiority trials aim to demonstrate that the difference between the effectiveness of a new drug and an active control is small--small enough to show that the new drug is also effective. Drugs approved on this basis may provide important benefits, such as improved safety. Because non-inferiority trials are difficult to design and interpret, they have received attention within the research community and FDA. FDA has issued guidance on these trials. GAO was asked to examine FDA's use of non-inferiority trial evidence. This report (1) identifies NDAs for new molecular entities--potentially innovative new drugs not FDA-approved in any form--that included evidence from non-inferiority trials, (2) examines the characteristics of these trials, and (3) describes FDA's guidance on these trials. GAO reviewed NDAs submitted to FDA between fiscal year 2002 (the first full year that FDA documentation was available electronically) and fiscal year 2009 (the last full year of submissions), examined FDA's guidance, and interviewed agency officials." [Accessed on November 16, 2011]. http://www.gao.gov/products/GAO-10-798.

Webpage: Randall Munroe. xkcd: Null Hypothesis. Excerpt: "I can't believe schools are still teaching kids about the null hypothesis." [Accessed on May 16, 2011]. http://xkcd.com/892/.

Luis Carlos Silva-Aycaguer, Patricio Suarez-Gil, Ana Fernandez-Somoano. The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation. BMC Medical Research Methodology. 2010;10(1):44. Abstract: "BACKGROUND: The null hypothesis significance test (NHST) is the most frequently used statistical method, although its inferential validity has been widely criticized since its introduction. In 1988, the International Committee of Medical Journal Editors (ICMJE) warned against sole reliance on NHST to substantiate study conclusions and suggested supplementary use of confidence intervals (CI). Our objective was to evaluate the extent and quality in the use of NHST and CI, both in English and Spanish language biomedical publications between 1995 and 2006, taking into account the International Committee of Medical Journal Editors recommendations, with particular focus on the accuracy of the interpretation of statistical significance and the validity of conclusions. METHODS: Original articles published in three English and three Spanish biomedical journals in three fields (General Medicine, Clinical Specialties and Epidemiology - Public Health) were considered for this study. Papers published in 1995-1996, 2000-2001, and 2005-2006 were selected through a systematic sampling method. After excluding the purely descriptive and theoretical articles, analytic studies were evaluated for their use of NHST with P-values and/or CI for interpretation of statistical "significance" and "relevance" in study conclusions. RESULTS: Among 1,043 original papers, 874 were selected for detailed review. The exclusive use of P-values was less frequent in English language publications as well as in Public Health journals; overall such use decreased from 41% in 1995-1996 to 21% in 2005-2006. While the use of CI increased over time, the "significance fallacy" (to equate statistical and substantive significance) appeared very often, mainly in journals devoted to clinical specialties (81%). In papers originally written in English and Spanish, 15% and 10%, respectively, mentioned statistical significance in their conclusions. CONCLUSIONS: Overall, results of our review show some improvements in statistical management of statistical results, but further efforts by scholars and journal editors are clearly required to move the communication toward ICMJE advices, especially in the clinical setting, which seems to be imperative among publications in Spanish." [Accessed June 14, 2010]. Available at: http://www.biomedcentral.com/1471-2288/10/44.

Committee for Proprietary Medicinal Products. Points to consider on switching between superiority and non-inferiority. Br J Clin Pharmacol. 2001;52(3):223-228. Excerpt: "A number of recent applications have led to CPMP discussions concerning the interpretation of superiority, noninferiority and equivalence trials. These issues are covered in ICH E9 (Statistical Principles for Clinical Trials). There is further relevant material in the Step 2 draft of ICH E10 (Choice of Control Group) and in the CPMP Note for Guidance on the Investigation of Bioavailability and Bioequivalence. However, the guidelines do not address some specific difficulties that have arisen in practice. In broad terms, these difficulties relate to switching from one design objective to another at the time of analysis." Available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2014556/.

Journal article: Jennifer Schumi, Janet T Wittes. Through the looking glass: understanding non-inferiority. Trials. 2011;12:106. Abstract: "Non-inferiority trials test whether a new product is not unacceptably worse than a product already in use. This paper introduces concepts related to non-inferiority, and discusses the regulatory views of both the European Medicines Agency and the United States Food and Drug Administration." [Accessed on November 16, 2011].

Angele Gayet-Ageron, Thomas Agoritsas, Christophe Combescure, et al. What differences are detected by superiority trials or ruled out by noninferiority trials? A cross-sectional study on a random sample of two-hundred two-arms parallel group randomized clinical trials. BMC Medical Research Methodology. 2010;10(1):93. Abstract: "BACKGROUND: The smallest difference to be detected in superiority trials or the largest difference to be ruled out in noninferiority trials is a key determinant of sample size, but little guidance exists to help researchers in their choice. The objectives were to examine the distribution of differences that researchers aim to detect in clinical trials and to verify that those differences are smaller in noninferiority compared to superiority trials. METHODS: Cross-sectional study based on a random sample of two hundred two-arm, parallel group superiority (100) and noninferiority (100) randomized clinical trials published between 2004 and 2009 in 27 leading medical journals. The main outcome measure was the smallest difference in favor of the new treatment to be detected (superiority trials) or largest unfavorable difference to be ruled out (noninferiority trials) used for sample size computation, expressed as standardized difference in proportions, or standardized difference in means. Student t test and analysis of variance were used. RESULTS: The differences to be detected or ruled out varied considerably from one study to the next; e.g., for superiority trials, the standardized difference in means ranged from 0.007 to 0.87, and the standardized difference in proportions from 0.04 to 1.56. On average, superiority trials were designed to detect larger differences than noninferiority trials (standardized difference in proportions: mean 0.37 versus 0.27, P = 0.001; standardized difference in means: 0.56 versus 0.40, P = 0.006). Standardized differences were lower for mortality than for other outcomes, and lower in cardiovascular trials than in other research areas. CONCLUSIONS: Superiority trials are designed to detect larger differences than noninferiority trials are designed to rule out. The variability between studies is considerable and is partly explained by the type of outcome and the medical context. A more explicit and rational approach to choosing the difference to be detected or to be ruled out in clinical trials may be desirable." [Accessed December 28, 2010]. Available at: http://www.biomedcentral.com/1471-2288/10/93.

2008

6. Stats: An alternative to the p-value (April 3, 2008). A discussion on edstat-l concerned a statistic called p-rep. I had not heard of this statistic before, but at least one journal is calling for its use in all papers published by that journal.

5. Stats: What is a critical value? (February 22, 2008). Someone wrote in asking about the difference between a p-value and a critical value.

4. Stats: Type III error (January 3, 2008). Dear Professor Mean, What is the definition of a Type III error?

2007

3. Stats: Further exploration of Type I and Type II errors (April 5, 2007). I got some feedback that my definitions of Type I errors and Type II errors would be clearer if I specified what the actual hypotheses are. I wanted to avoid symbols like mu or pi, so here is what I wrote.

1999

2. Stats: Type II error (September 3, 1999). Dear Professor Mean: A journal reviewer criticized the small sample size in my research study and suggested that I mention a Type II error as a possible explanation for my results. I've never heard this term before. What is a Type II error?
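The reviewer's link between small sample size and Type II error can be made concrete with a rough calculation. The sketch below uses a normal approximation for a two-sided, two-sample comparison of means. The effect size of half a standard deviation, the group sizes, and the function name type_ii_error are assumptions chosen for illustration and do not come from the linked article.

```python
import math
from statistics import NormalDist

def type_ii_error(n_per_group, effect_size, alpha=0.05):
    """Approximate Type II error rate (beta) for a two-sided, two-sample z-test.

    effect_size is the true difference in means divided by the common
    standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true effect is effect_size.
    shift = effect_size * math.sqrt(n_per_group / 2)
    power = (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)
    return 1 - power

for n in (10, 30, 100):
    beta = type_ii_error(n, effect_size=0.5)
    print(f"n = {n:3d} per group: Type II error rate is about {beta:.2f}")
```

Under these assumptions a study with 10 subjects per group would miss a real half-standard-deviation difference most of the time, while a study with 100 per group would rarely miss it.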

1. Stats: T-test (April 18, 1999). Dear Professor Mean: How do you analyze a t-test? I have a t-test value, and I know that I have to compare it to a t-distribution. I'm not sure how to do that.
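Comparing a t-test value to a t-distribution usually means finding the area in the tails beyond the observed value. The sketch below does this with the standard library only, by integrating the t density numerically with Simpson's rule. The function names and the example values (t = 2.228 with 10 degrees of freedom) are illustrative assumptions, and a statistics package would normally do this step for you.

```python
import math

def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_value, df, steps=2000):
    """Two-sided p-value: the area under the t density beyond +/- |t_value|.

    The area between 0 and |t_value| is found with Simpson's rule, so
    steps must be even. math.gamma overflows for very large df, so this
    sketch is meant for small to moderate degrees of freedom.
    """
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3  # area between 0 and |t_value|
    return 1 - 2 * central_half

p_value = two_sided_p(2.228, df=10)
print(f"t = 2.228 with 10 df: two-sided p = {p_value:.4f}")
```

The value 2.228 is the familiar two-sided 5% critical value for 10 degrees of freedom, so the computed p-value should come out very close to 0.05.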

Definitions:

Stats: What is an alpha level?

Stats: What is a beta level?

Stats: What is a decision rule?

Stats: What is a dependent variable?

Stats: What is an inferential statistic?

Stats: What is an independent variable?

Stats: What is a parameter?

Stats: What is a population?

Stats: What is a P-value?

Stats: What is a sample?

Stats: What is a statistic?

Stats: What is a T statistic?

Stats: What is a Type I error?

Stats: What is a Type II error?
