Statistical power

A set of results that will be measured by comparison to another set of results needs sufficient statistical power (the ability to detect a significant difference between groups when one exists). This is particularly important for a SIB, where the results are independently evaluated (and probably publicly announced) and trigger payments. Statistical power is a function of the effect size, the variance within each group, the required level of confidence, and the sample size.

To increase statistical power (each of the following levers is illustrated in the sketch after this list):

  • Increase the effect size, i.e. the difference in results between the two groups (this difference can only be estimated and remains a variable throughout the contract). Larger effects may be detectable earlier, which can bring payments forward, but changing payment dates may increase the complexity and cost of measurement.
  • Reduce the variance within each group (for example, by refining referral criteria to produce a more homogeneous cohort whose response to the services provided is more consistent).
  • Decrease the required level of confidence (usually set at 95 per cent, meaning that if neither group actually received any benefit from the intervention, a difference this large would arise by chance only 5 times out of 100).
  • Increase the sample size, and therefore the degrees of freedom (this could be achieved by delivering the service to a larger area or over a longer timeframe).
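
As a concrete illustration of these four levers, the sketch below (our own, not from the source; the effect size of 0.2, standard deviation of 1.0, group size of 200, and 95 per cent confidence are arbitrary assumptions) computes the power of a one-tailed two-sample z-test under the normal approximation, then moves each lever in turn.

```python
# A minimal sketch of the four power levers, using the normal approximation
# for a one-tailed two-sample z-test. All parameter values are illustrative.
from math import sqrt
from statistics import NormalDist

def power(effect, sd, n_per_group, alpha=0.05):
    """Power of a one-tailed two-sample z-test.

    effect      -- assumed true difference between the group means
    sd          -- common standard deviation within each group
    n_per_group -- participants in each of the two groups
    alpha       -- significance level (1 - required confidence)
    """
    se = sd * sqrt(2.0 / n_per_group)           # standard error of the difference
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # critical value at the chosen confidence
    return 1.0 - NormalDist().cdf(z_crit - effect / se)

print(f"baseline power:          {power(0.2, 1.0, 200):.2f}")
print(f"larger effect (0.3):     {power(0.3, 1.0, 200):.2f}")
print(f"lower variance (sd=0.8): {power(0.2, 0.8, 200):.2f}")
print(f"lower confidence (90%):  {power(0.2, 1.0, 200, alpha=0.10):.2f}")
print(f"larger sample (n=400):   {power(0.2, 1.0, 400):.2f}")
```

Each variation raises the power above the baseline of roughly 0.64, matching the four levers listed above.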

Mulgan et al. (2011) demonstrate that “statistical methods allow us to model the interplay between the following factors:

  • Government’s confidence level. The government’s confidence that it is not paying out for a scheme that has succeeded only by chance (that is, the probability of a ‘Type I error’);
  • Funder’s confidence levels. The funder’s confidence that a good scheme will not fail to show a result through bad luck (a ‘Type II error’ in statistical terms);
  • Sensitivity to effect size. The size that an effect of an intervention has to be to reliably enable detection of change;
  • Sample size. The number of participants in the control groups.”

The table below shows how this trade-off works. It sets the Government’s confidence level to 70% (equivalent to a one-tailed significance level of 30%, and lower than the 95% that would usually be associated with this type of test), assumes a baseline probability of 0.5, and makes the control group twice the size of the treatment group; it then shows how the required sample size changes with the funder’s confidence and the predicted effect size. The effect size is the percentage change from the 50% baseline, so a 10% reduction corresponds to an intervention result of 45%. The table demonstrates that it is possible to have a reasonable degree of confidence in a scheme, for a manageable sample size, and a high degree of sensitivity: measurement can be both fair and accurate. (A sketch that approximately reproduces these figures follows the table.)

                      Funder's Confidence
Predicted Effect    90%    85%    80%    75%    70%    65%
(% of baseline)
 5.0%              1980   1500   1150    900    690    520
10.0%               490    370    290    220    170    130
15.0%               220    170    125    100     75     60
20.0%               125     95     70     55     45     35

Notes: Using the methodology laid out in Cohen, J. (1988) ‘Statistical power analysis for the behavioral sciences’ (2nd ed.).

If the baseline probability is lower than 0.5, the sample sizes would need to be larger than those shown in the table. If it is higher than 0.5, the required sample sizes would be smaller.
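
The figures in the table can be approximately reproduced using the Cohen (1988) arcsine method cited in the notes. The sketch below is our reconstruction, not the report’s own calculation: it assumes a one-tailed test, applies the harmonic-mean adjustment for the 2:1 control-to-treatment allocation, and reports the control-group size (the quantity the table shows). The report’s rounding conventions are not stated, so the computed values come out slightly below the table’s rounded figures.

```python
# A sketch reconstructing the table above (our illustration, following the
# Cohen (1988) arcsine method cited in the notes; the report's exact
# rounding is not stated, so values land slightly below the table figures).
from math import asin, ceil, sqrt
from statistics import NormalDist

def control_group_size(baseline, effect, funder_confidence,
                       govt_confidence=0.70, ratio=2):
    """Control-group size for a one-tailed two-proportion test.

    baseline          -- outcome probability without the intervention (0.5 here)
    effect            -- predicted reduction as a fraction of the baseline
    funder_confidence -- desired power (1 - probability of a Type II error)
    govt_confidence   -- 1 - significance level (1 - probability of a Type I error)
    ratio             -- control-group size as a multiple of the treatment group
    """
    treated = baseline * (1 - effect)  # e.g. 0.5 -> 0.45 for a 10% effect
    h = 2 * asin(sqrt(baseline)) - 2 * asin(sqrt(treated))  # Cohen's effect size h
    z = NormalDist().inv_cdf
    n_equal = ((z(govt_confidence) + z(funder_confidence)) / h) ** 2
    n_treatment = n_equal * (1 + ratio) / (2 * ratio)  # harmonic-mean adjustment
    return ceil(ratio * n_treatment)

confidences = (0.90, 0.85, 0.80, 0.75, 0.70, 0.65)
print("effect  " + "  ".join(f"{c:>5.0%}" for c in confidences))
for effect in (0.05, 0.10, 0.15, 0.20):
    row = "  ".join(f"{control_group_size(0.5, effect, c):>5d}" for c in confidences)
    print(f"{effect:>5.1%}  {row}")
```

Running the same function with a baseline below 0.5 (say 0.4) yields larger required sample sizes, consistent with the note above.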

If a sample of sufficient size cannot be measured, the effect cannot be attributed to the intervention with a strong degree of confidence. Note the measurement mandates, quoted below, in the open tenders issued by the State of New York and the Ministry of Justice (UK), both of which address statistical power.

State of New York:

  • The outcomes are measured rigorously by an independent validator, and impact is assessed based on comparison to an estimate of the “counterfactual” outcomes that would have occurred in the absence of the services being delivered.
  • At least several hundred people are served each year and there is an identifiable path to expanding a successful initiative to a larger population.

Ministry of Justice (UK):
Proposals that:

  • are of sufficient size for payment and evaluation, and
  • create incentives to intervene with the entire cohort.

More information on statistical power can be found in HM Treasury’s Magenta Book (2011), in the section ‘The Power of Design’, p. 109.