The power of a binary hypothesis test is the probability that the test rejects the null hypothesis (H_{0}) when a specific alternative hypothesis (H_{1}) is true. Statistical power ranges from 0 to 1; as power increases, the probability of making a type II error (wrongly failing to reject the null hypothesis) decreases. For a type II error probability of β, the corresponding statistical power is 1 − β. For example, if experiment 1 has a statistical power of 0.7 and experiment 2 has a statistical power of 0.95, then experiment 1 is more likely to produce a type II error than experiment 2, and experiment 2 is the more reliable for that reason. Power can equivalently be thought of as the probability of accepting the alternative hypothesis (H_{1}) when it is true: the ability of a test to detect a specific effect, if that specific effect actually exists. That is,

power = Pr(reject H_{0} | H_{1} is true) = 1 − β.
If H_{1} is not an equality but rather simply the negation of H_{0} (so, for example, with H_{0}: μ = 0 for some unobserved population parameter μ, we have simply H_{1}: μ ≠ 0), then power cannot be calculated unless probabilities are known for all possible values of the parameter that violate the null hypothesis. Thus one generally refers to a test's power against a specific alternative hypothesis.
As the power increases, there is a decreasing probability of a type II error, also referred to as the false negative rate (β) since the power is equal to 1 − β. A similar concept is the type I error probability, also referred to as the “false positive rate” or the level of a test under the null hypothesis.
Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. For example: “how many times do I need to toss a coin to conclude it is rigged by a certain amount?”^{[1]} Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric test and a nonparametric test of the same hypothesis.
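The coin-tossing question can be made concrete with a short calculation. The sketch below uses the normal approximation to the binomial test of a proportion; the function name and the illustrative bias of 0.6 are assumptions for this example, not taken from the source:

```python
from math import sqrt, ceil
from statistics import NormalDist

def coin_sample_size(p0=0.5, p1=0.6, alpha=0.05, power=0.80):
    """Approximate number of tosses needed so a one-sided test of
    H0: p = p0 detects a coin with true heads probability p1 at
    significance level alpha with the desired power
    (normal approximation to the binomial)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical quantile under H0
    z_beta = NormalDist().inv_cdf(power)       # quantile for the target power
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)

# Roughly 150 tosses are needed to detect a coin that lands heads
# 60% of the time; a more strongly biased coin needs far fewer.
n = coin_sample_size()
```

A larger assumed bias (say p1 = 0.7) is detectable with far fewer tosses, illustrating how power depends on the effect size one wishes to detect.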
In the context of binary classification, the power of a test is called its statistical sensitivity, its true positive rate, or its probability of detection.
Background
Statistical tests use data from samples to assess, or make inferences about, a statistical population. In the concrete setting of a two-sample comparison, the goal is to assess whether the mean values of some attribute obtained for individuals in two subpopulations differ. For example, to test the null hypothesis that the mean scores of men and women on a test do not differ, samples of men and women are drawn, the test is administered to them, and the mean score of one group is compared to that of the other group using a statistical test such as the two-sample z-test. The power of the test is the probability that the test will find a statistically significant difference between men and women, as a function of the size of the true difference between those two populations.
Factors influencing power
Statistical power may depend on a number of factors. Some factors may be particular to a specific testing situation, but at a minimum, power nearly always depends on the following three factors:
 the statistical significance criterion used in the test
 the magnitude of the effect of interest in the population
 the sample size used to detect the effect
A significance criterion is a statement of how unlikely a positive result must be, if the null hypothesis of no effect is true, for the null hypothesis to be rejected. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of the data implying an effect at least as large as the observed effect when the null hypothesis is true must be less than 0.05, for the null hypothesis of no effect to be rejected. One easy way to increase the power of a test is to carry out a less conservative test by using a larger significance criterion, for example 0.10 instead of 0.05. This increases the chance of rejecting the null hypothesis (i.e. obtaining a statistically significant result) when the null hypothesis is false; that is, it reduces the risk of a type II error (false negative regarding whether an effect exists). But it also increases the risk of obtaining a statistically significant result (i.e. rejecting the null hypothesis) when the null hypothesis is not false; that is, it increases the risk of a type I error (false positive).
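This trade-off can be sketched numerically. Assuming a one-sided z-test for a mean with a standardized effect size of 0.3 and 50 observations (illustrative numbers, not prescribed by the text), relaxing the criterion from 0.01 to 0.10 visibly raises the power:

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(alpha, effect, n):
    """Power of a one-sided z-test for a mean, where `effect` is the
    standardized effect size (true mean shift in standard-deviation
    units) and n is the sample size."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # rejection threshold under H0
    return 1 - NormalDist().cdf(z_crit - effect * sqrt(n))

strict = power_one_sided_z(alpha=0.01, effect=0.3, n=50)   # conservative test
lenient = power_one_sided_z(alpha=0.10, effect=0.3, n=50)  # less conservative
# lenient > strict: a larger significance criterion buys power
# at the cost of more type I errors.
```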
The magnitude of the effect of interest in the population can be quantified in terms of an effect size, where there is greater power to detect larger effects. An effect size can be a direct value of the quantity of interest, or it can be a standardized measure that also accounts for the variability in the population. For example, in an analysis comparing outcomes in a treated and control population, the difference of outcome means Y − X would be a direct estimate of the effect size, whereas (Y − X)/σ where σ is the common standard deviation of the outcomes in the treated and control groups, would be an estimated standardized effect size. If constructed appropriately, a standardized effect size, along with the sample size, will completely determine the power. An unstandardized (direct) effect size will rarely be sufficient to determine the power, as it does not contain information about the variability in the measurements.
The sample size determines the amount of sampling error inherent in a test result. Other things being equal, effects are harder to detect in smaller samples. Increasing sample size is often the easiest way to boost the statistical power of a test. How increased sample size translates to higher power is a measure of the efficiency of the test—for example, the sample size required for a given power.^{[2]}
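The last two factors can be sketched together. Again assuming a one-sided z-test with a standardized effect size (the specific numbers below are illustrative), power grows with both the effect size and the sample size:

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect, n, alpha=0.05):
    """Approximate power of a one-sided z-test: `effect` is the
    standardized effect size, n the sample size, alpha the level."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect * sqrt(n))

baseline = approx_power(effect=0.2, n=25)       # small effect, small sample
larger_effect = approx_power(effect=0.5, n=25)  # same n, bigger effect
larger_sample = approx_power(effect=0.2, n=100) # same effect, bigger n
# Both a larger effect and a larger sample raise the power above baseline.
```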
The precision with which the data are measured also influences statistical power. Consequently, power can often be improved by reducing the measurement error in the data. A related concept is to improve the “reliability” of the measure being assessed (as in psychometric reliability).
The design of an experiment or observational study often influences the power. For example, in a two-sample testing situation with a given total sample size n, it is optimal to have equal numbers of observations from the two populations being compared (as long as the variances in the two populations are the same). In regression analysis and analysis of variance, there are extensive theories and practical strategies for improving the power based on optimally setting the values of the independent variables in the model.
Interpretation
Although there are no formal standards for power (sometimes referred to as π), most researchers assess the power of their tests using π = 0.80 as a standard for adequacy. This convention implies a four-to-one trade-off between β-risk and α-risk (β is the probability of a type II error, and α is the probability of a type I error; 0.2 and 0.05 are conventional values for β and α). However, there will be times when this 4-to-1 weighting is inappropriate. In medicine, for example, tests are often designed in such a way that no false negatives (type II errors) will be produced. But this inevitably raises the risk of obtaining a false positive (a type I error). The rationale is that it is better to tell a healthy patient “we may have found something—let's test further,” than to tell a diseased patient “all is well.”^{[3]}
Power analysis is appropriate when the concern is with the correct rejection of a false null hypothesis. In many contexts, the issue is less about determining whether there is a difference and more about obtaining a refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around 0.50, a sample size of 20 will give us approximately 80% power (α = 0.05, two-tailed) to reject the null hypothesis of zero correlation. However, in doing this study we are probably more interested in knowing whether the correlation is 0.30, 0.60, or 0.50. In this context we would need a much larger sample size in order to reduce the confidence interval of our estimate to a range that is acceptable for our purposes. Techniques similar to those employed in a traditional power analysis can be used to determine the sample size required for the width of a confidence interval to be less than a given value.
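Planning for estimation precision rather than detection can be sketched in the same spirit. The fragment below uses a normal-theory confidence interval for a mean as a simplified stand-in for the correlation case (which would use the Fisher z-transformation); the names and numbers are illustrative:

```python
from math import ceil
from statistics import NormalDist

def n_for_ci_halfwidth(sigma, halfwidth, conf=0.95):
    """Sample size so that a normal-theory confidence interval for a
    mean has at most the given half-width, assuming known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided quantile
    return ceil((z * sigma / halfwidth) ** 2)

# Pinning the mean down to within 0.1 standard deviations takes
# several hundred observations; halving the width quadruples n.
n = n_for_ci_halfwidth(sigma=1.0, halfwidth=0.1)
```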
Many statistical analyses involve the estimation of several unknown quantities. In simple cases, all but one of these quantities are nuisance parameters. In this setting, the only relevant power pertains to the single quantity that will undergo formal statistical inference. In some settings, particularly if the goals are more "exploratory", there may be a number of quantities of interest in the analysis. For example, in a multiple regression analysis we may include several covariates of potential interest. In situations such as this where several hypotheses are under consideration, it is common that the powers associated with the different hypotheses differ. For instance, in multiple regression analysis, the power for detecting an effect of a given size is related to the variance of the covariate. Since different covariates will have different variances, their powers will differ as well.
Any statistical analysis involving multiple hypotheses is subject to inflation of the type I error rate if appropriate measures are not taken. Such measures typically involve applying a higher threshold of stringency to reject a hypothesis in order to compensate for the multiple comparisons being made (e.g. as in the Bonferroni method). In this situation, the power analysis should reflect the multiple testing approach to be used. Thus, for example, a given study may be well powered to detect a certain effect size when only one test is to be made, but the same effect size may have much lower power if several tests are to be performed.
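A minimal sketch of this power loss, assuming a one-sided z-test and a Bonferroni-corrected threshold for ten planned comparisons (an illustrative setup, not from the source):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect, n, alpha):
    """Approximate power of a one-sided z-test at level alpha for a
    standardized effect size `effect` and sample size n."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect * sqrt(n))

single_test = approx_power(effect=0.5, n=30, alpha=0.05)
# Bonferroni: divide the level by the number of comparisons (here 10).
one_of_ten = approx_power(effect=0.5, n=30, alpha=0.05 / 10)
# The same effect size is detected with noticeably lower power once
# the rejection threshold is made more stringent.
```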
It is also important to consider the statistical power of a hypothesis test when interpreting its results. A test's power is the probability of correctly rejecting the null hypothesis when it is false; a test's power is influenced by the choice of significance level for the test, the size of the effect being measured, and the amount of data available. A hypothesis test may fail to reject the null, for example, if a true difference exists between two populations being compared by a t-test but the effect is small and the sample size is too small to distinguish the effect from random chance.^{[4]} Many clinical trials, for instance, have low statistical power to detect differences in adverse effects of treatments, since such effects may be rare and the number of affected patients small.^{[5]}
A priori vs. post hoc analysis
Power analysis can be done either before (a priori or prospective power analysis) or after (post hoc or retrospective power analysis) data are collected. A priori power analysis is conducted prior to the research study, and is typically used in estimating sufficient sample sizes to achieve adequate power. Post-hoc analysis of "observed power" is conducted after a study has been completed, and uses the obtained sample size and effect size to determine what the power was in the study, assuming the effect size in the sample is equal to the effect size in the population. Whereas the utility of prospective power analysis in experimental design is universally accepted, post hoc power analysis is fundamentally flawed.^{[6]}^{[7]} Falling for the temptation to use the statistical analysis of the collected data to estimate the power will result in uninformative and misleading values. In particular, it has been shown that post-hoc "observed power" is a one-to-one function of the p-value attained.^{[6]} This has been extended to show that all post-hoc power analyses suffer from what is called the "power approach paradox" (PAP), in which a study with a null result is thought to show more evidence that the null hypothesis is actually true when the p-value is smaller, since the apparent power to detect an actual effect would be higher.^{[6]} In fact, a smaller p-value is properly understood to make the null hypothesis relatively less likely to be true.^{[citation needed]}
Application
Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis, for example to determine the minimum number of animal test subjects needed for an experiment to be informative. In frequentist statistics, an underpowered study is unlikely to allow one to choose between hypotheses at the desired significance level. In Bayesian statistics, hypothesis testing of the type used in classical power analysis is not done. In the Bayesian framework, one updates his or her prior beliefs using the data obtained in a given study. In principle, a study that would be deemed underpowered from the perspective of hypothesis testing could still be used in such an updating process. However, power remains a useful measure of how much a given experiment size can be expected to refine one's beliefs. A study with low power is unlikely to lead to a large change in beliefs.
Example
The following is an example that shows how to compute power for a randomized experiment: Suppose the goal of an experiment is to study the effect of a treatment on some quantity, and to compare research subjects by measuring the quantity before and after the treatment, analyzing the data using a paired t-test. Let A_{i} and B_{i} denote the pre-treatment and post-treatment measures on subject i, respectively. The possible effect of the treatment should be visible in the differences D_{i} = B_{i} − A_{i}, which are assumed to be independently distributed, all with the same expected value μ_{D} and variance σ_{D}^{2}.

The effect of the treatment can be analyzed using a one-sided t-test. The null hypothesis of no effect will be that the mean difference is zero, i.e. H_{0}: μ_{D} = 0. In this case, the alternative hypothesis states a positive effect, corresponding to H_{1}: μ_{D} > 0. The test statistic is:

T_{n} = D̄_{n} √n / σ̂_{D},

where n is the sample size, D̄_{n} is the sample mean of the differences, and σ̂_{D}/√n is the standard error. The test statistic under the null hypothesis follows a Student t-distribution with n − 1 degrees of freedom. Furthermore, assume that the null hypothesis will be rejected at the significance level of α = 0.05. Since n is large, one can approximate the t-distribution by a normal distribution and calculate the critical value using the quantile function Φ^{−1}, the inverse of the cumulative distribution function Φ of the standard normal distribution. It turns out that the null hypothesis will be rejected if

T_{n} > 1.64.

Now suppose that the alternative hypothesis is true and μ_{D} = θ > 0. Then, the power is

B(θ) = Pr(T_{n} > 1.64 | μ_{D} = θ) = Pr((D̄_{n} − θ)√n/σ̂_{D} > 1.64 − θ√n/σ̂_{D}).

For large n, (D̄_{n} − θ)√n/σ̂_{D} approximately follows a standard normal distribution when the alternative hypothesis is true, so the approximate power can be calculated as

B(θ) ≈ 1 − Φ(1.64 − θ√n/σ_{D}).

According to this formula, the power increases with the value of the parameter θ√n/σ_{D}. For a specific value of θ, a higher power may be obtained by increasing the sample size n.

It is not possible to guarantee a sufficiently large power for all values of θ, as θ may be very close to 0. The minimum (infimum) value of the power is equal to the size of the test, in this example 0.05. However, it is of no importance to distinguish between θ = 0 and small positive values. If it is desirable to have enough power, say at least 0.90, to detect values of θ > 1, the required sample size can be calculated approximately as follows. The requirement is

B(1) ≈ 1 − Φ(1.64 − √n/σ_{D}) ≥ 0.90,

from which it follows that

Φ(1.64 − √n/σ_{D}) ≤ 0.10.

Hence, using the quantile function,

1.64 − √n/σ_{D} ≤ Φ^{−1}(0.10) = −z_{0.90} = −1.28,

so that

√n/σ_{D} ≥ 1.64 + 1.28 = 2.92, i.e. n ≥ (2.92)^{2} σ_{D}^{2} ≈ 8.5 σ_{D}^{2},

where z_{0.90} = Φ^{−1}(0.90) = 1.28 is a standard normal quantile; see Probit for an explanation of the relationship between Φ and z-values.
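The closing calculation can be checked numerically. The sketch below implements a normal-approximation sample-size formula consistent with the derivation above; the function name and defaults are illustrative:

```python
from math import ceil
from statistics import NormalDist

def required_n(theta, sigma_d, alpha=0.05, power=0.90):
    """Sample size for the one-sided paired test sketched above, using
    the normal approximation: solves
    1 - Phi(z_{1-alpha} - theta*sqrt(n)/sigma_d) >= power for n."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # ~1.64
    z_power = NormalDist().inv_cdf(power)      # ~1.28
    return ceil(((z_alpha + z_power) * sigma_d / theta) ** 2)

# With sigma_D = 1 and theta = 1 this gives (1.64 + 1.28)^2, about 8.5,
# so 9 subjects suffice under the approximation.
n = required_n(theta=1.0, sigma_d=1.0)
```

Smaller effects to be detected drive the required sample size up quadratically, e.g. halving theta roughly quadruples n.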
Extension
Bayesian power
In the frequentist setting, the power calculation assumes a single specific value for the parameter, which is unlikely to be exactly true. This issue can be addressed by instead assuming that the parameter has a distribution. The resulting power is sometimes referred to as Bayesian power, and it is commonly used in clinical trial design.
Predictive probability of success
Both frequentist power and Bayesian power use statistical significance as the success criterion. However, statistical significance is often not enough to define success. To address this issue, the power concept can be extended to the concept of predictive probability of success (PPOS). The success criterion for PPOS is not restricted to statistical significance, and PPOS is commonly used in clinical trial design.
Software for power and sample size calculations
Numerous free and/or open source programs are available for performing power and sample size calculations. These include
 G*Power (http://www.gpower.hhu.de/)
 WebPower Free online statistical power analysis (http://webpower.psychstat.org)
 powerandsamplesize.com Free and open source online calculators
 PowerUp! provides convenient Excel-based functions to determine minimum detectable effect size and minimum required sample size for various experimental and quasi-experimental designs.
 PowerUpR is the R package version of PowerUp! and additionally includes functions to determine sample size for various multilevel randomized experiments with or without budgetary constraints.
 R package pwr
 R package WebPower
 Python package statsmodels (http://www.statsmodels.org/)
See also
Notes
 ^ http://www.statisticsdonewrong.com/power.html
 ^ Everitt 2002, p. 321.
 ^ Ellis, Paul D. (2010). The Essential Guide to Effect Sizes: An Introduction to Statistical Power, MetaAnalysis and the Interpretation of Research Results. United Kingdom: Cambridge University Press.
 ^ Ellis, Paul (2010). The Essential Guide to Effect Sizes: Statistical Power, MetaAnalysis, and the Interpretation of Research Results. Cambridge University Press. p. 52. ISBN 9780521142465.
 ^ Tsang, R.; Colley, L.; Lynd, L. D. (2009). "Inadequate statistical power to detect clinically significant differences in adverse event rates in randomized controlled trials". Journal of Clinical Epidemiology. 62 (6): 609–616. doi:10.1016/j.jclinepi.2008.08.005. PMID 19013761.
 ^ ^{a} ^{b} ^{c} Hoenig; Heisey (2001). "The Abuse of Power". The American Statistician. 55 (1): 19–24. doi:10.1198/000313001300339897.
 ^ Thomas, L. (1997) Retrospective power analysis. Conservation Biology 11(1):276–280
References
 Everitt, Brian S. (2002). The Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 052181099X.
 Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). ISBN 0805802835.
 Aberson, C. L. (2010). Applied Power Analysis for the Behavioral Science. ISBN 1848728352.