
Sometimes, when the logic of the p-value is explained and motivated, it goes as follows:

  • The concept of a random sample is introduced, together with the need to consider the possibility that we got the results we did by chance alone.
  • One then reflects on the probability of the experiment's results, given the null hypothesis.
  • Since the probability of any specific experimental result (when there are many possibilities) is very small, it is suggested that we use the probability of getting the result we did or a "more extreme" one.

I have some doubts about the logic of the last step. Even though calculating the probability of our result or more extreme observations is one solution to the problem of relying only on the probability of the specific result under the null hypothesis, nothing necessitates that particular choice.

Have alternative statistics been suggested whose probability under the null hypothesis would be the value used to decide whether or not to reject it?

  • Given the null hypothesis, you want to know the probability of the null hypothesis? – Dave, commented Jul 5 at 16:55
  • $\begingroup$ "Have alternative statistics been suggested, the probability of which given the null hypothesis is the value to be considered for rejecting (or not) the null hypothesis?" Given what you acknowledge in your third dotpoint, the meaning of your question is unclear. Please clarify what you are requesting. $\endgroup$ Commented Jul 5 at 20:53
  • I think that all of your bulleted items are false. It is a statistical model that needs to be introduced for the first, not 'a random sample'. Think of resampling tests that yield p-values from an assumption of random treatment allocation instead of sampling from a population. Commented Jul 5 at 21:16
  • ...cont. For the second bullet, the probability of exactly the experimenter's results is often (in theory) exactly zero. If you are interested in the probability of the results then you should be interested in the likelihood function, not a p-value. Commented Jul 5 at 21:17
  • P-values are not difficult to understand, but most attempts to introduce them to students are almost unbelievably bad, and they are usually contaminated with non-p-value material about all-or-none significant/not-significant error-rate accounting. Commented Jul 5 at 21:22

2 Answers


Your last statement talks about the probability of a hypothesis being true. That can be approached with Bayesian statistics, but that is a whole topic I will not get into here (though it is a good one to learn).

Instead, I will address other ways of thinking about null hypothesis significance testing (NHST) besides p-values.

Back when I took my first statistics class (after the dinosaurs died, but before everyone carried powerful computers in their pockets; cell phones did exist then, but they were bigger than bricks), we learned about critical values and rejection regions before p-values. I sometimes lament that people's dependence on computers has promoted the p-value to prime status, so that we don't really think in terms of rejection regions any more. We have lost some of the basic intuitive understanding in exchange for the simplicity of looking at the p-value spit out by the computer.

An example of creating a rejection region:

I sometimes play backgammon on my phone or tablet against a computer player. However, it seems like the computer player rolls doubles more often than it should (especially in critical situations). Now, this could just be my memory tricking me by remembering the situations where the computer came from behind by rolling doubles better than the situations where it did not (psychologists have a name for this difference in memory), so I want to do a formal test.

I will play a game and count the number of times the computer rolls doubles on its next 10 turns. If the number of doubles is high, that indicates something more than random chance is happening; if it is low, then it is probably just me remembering selectively.

But how "High" is high enough to be convincing that the computer is cheating? Let's look at what the probability of getting different numbers of doubles would be if the computer is playing fair (rolls independent with constant probability of $\frac16$, computed by R):

> dbinom(0:10, 10, 1/6) |> zapsmall() |> setNames(0:10)
        0         1         2         3         4         5         6         7 
0.1615056 0.3230112 0.2907100 0.1550454 0.0542659 0.0130238 0.0021706 0.0002481 
        8         9        10 
0.0000186 0.0000008 0.0000000 

So the chance of 0 doubles is about 16%, 1 double is about 32%, and 10 doubles rounds to 0%. From this I can decide which values I will use to reject the idea that the computer is playing fair. I will only use large numbers, because if the computer is getting fewer doubles than it should, that is in my favor and I should not complain.

The common strategy is to start with the largest number of doubles (10) and count back, accumulating probability as long as it stays below my chosen limit. If I choose the traditional $\alpha=0.05$ as that limit, then I will reject the idea that the computer plays fair, in favor of the computer cheating, if there are more than 4 doubles (5 or more) in its next 10 rolls ($P(X>4)=0.015$, $P(X>3)=0.070$). So this is my rejection region, or I could call $4.5$ my critical value and reject if the number of doubles is greater than $4.5$. This limits my chance of falsely accusing the computer of cheating to be less than $\alpha$; in this case the probability of a Type I error is actually $0.015$.
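
To double-check those numbers, here is a quick sketch using base R's binomial functions (nothing here is specific to my backgammon app):

> 1 - pbinom(4, 10, 1/6)  # P(X > 4): chance of 5 or more doubles if the computer is fair
[1] 0.01546197
> 1 - pbinom(3, 10, 1/6)  # P(X > 3): letting 4 doubles into the region pushes past 0.05
[1] 0.06972784
> qbinom(0.95, 10, 1/6)   # largest count still outside the rejection region at alpha = 0.05
[1] 4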

Now, with a critical value or rejection region, I can run my test and make a decision without ever computing a p-value. And I think this is a lot more intuitive than worrying about the exact definition of a p-value.

Now, I could also have done the above by computing a p-value from the binomial distribution. For 4 doubles it would give $p=0.070 > \alpha = 0.05$, and for 5 doubles it would give $p=0.015 < \alpha = 0.05$; for fewer than 4 doubles the p-value would be higher still, and for more than 5 it would be even lower. The two strategies are exactly equivalent: I can find a critical value based on $\alpha$ and compare the observed number of doubles to that critical value, or I can convert the observed number (the test statistic) into a p-value and compare it to $\alpha$. Either approach gives the same decision about rejecting the null that the computer is playing fair. It is just like comparing two distances to see which is longer when one is given in miles and the other in kilometers: it does not matter whether I convert the miles to kilometers or the kilometers to miles, the conclusion about which is longer will be the same.
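
As a small sketch of that conversion (the helper name doubles_p_value is just for illustration, not a standard function):

# Illustrative helper, not part of any package: convert an observed number of
# doubles into a p-value, i.e. the probability under fair play of that count
# or a more extreme (larger) one.
doubles_p_value <- function(x, n = 10, p = 1/6) {
  1 - pbinom(x - 1, n, p)
}
doubles_p_value(4)  # about 0.070, above alpha = 0.05, so do not reject
doubles_p_value(5)  # about 0.015, below alpha = 0.05, so reject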

For this example I chose the rejection region to be 5–10 doubles. I could instead have chosen just 5–7 doubles, but would it make much sense to accuse the computer of cheating if I see 6 doubles, yet accept that it is fair if I see 9 doubles? It makes sense to define the rejection region to include everything "more extreme" than the critical value, and this is why, when computing a p-value, we include "or more extreme" rather than just the probability of the one observed value.

The advantage of computing p-values instead of critical values and rejection regions is the common scale. All p-values are between 0 and 1 and can be compared to a pre-specified $\alpha$, which is simple. The critical value for my example above would be different if I based the test on the next 20 rolls, or 100 rolls, etc. For a t-test, the critical value and rejection region depend on the hypothesized mean and the observed standard deviation (and sample size), so it is just simpler to look at the p-value.
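
As a rough illustration of that last point (again just base R's qbinom), the cutoff count changes with the number of rolls I watch, even though $\alpha$ stays at 0.05:

# Smallest k with P(X <= k) >= 0.95 under fair play, i.e. the last count that does NOT reject
qbinom(0.95, size = 10, prob = 1/6)   # 4, so reject at 5 or more doubles out of 10
qbinom(0.95, size = 20, prob = 1/6)   # a larger cutoff for 20 rolls
qbinom(0.95, size = 100, prob = 1/6)  # larger still for 100 rolls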

I hope that thinking in terms of rejection regions helps your understanding.

  • Rejection regions can only lead to all-or-none outcomes and are not relevant to p-values. It is hard to see how thinking in those terms can help. Commented Jul 5 at 21:27

To put it briefly, this is now an operational question. Your observations are sound; in fact, nearly every statistician has considered (or should consider) exactly the problem you describe. Closely tied to your question is skepticism about the magic 0.05 threshold for significance.

The p-value was never put forward as an omnibus tool for statistical reasoning. Rather, due to several contemporaneous factors, it was widely popular from the outset, and this popularity bent thinking so heavily toward its approach that it is now very hard to introduce any new "standard" inferential approach. Any attempt to do otherwise very quickly devolves into "back of the envelope" significance testing.

For instance, when I have submitted manuscripts that report a frequentist confidence interval in lieu of a p-value, the reviewers are very quick to inspect and comment on whether the interval does (or does not) cross 0 or 1 (for differences or ratios, respectively), even when no hypothesis is called into question. Even for credible intervals, the reviewers insist on a non-informative prior, where for them "non-informative" means the operating characteristics of the credible interval closely resemble those of the frequentist confidence interval, and thus we come back to square one.

Indeed, several (SEVERAL) alternative procedures have been put forward; Bayes factors are probably the most noteworthy. But once one starts inspecting the frequentist operating characteristics of Bayes factor testing, it does not stand up well against the p-value. We have to begin with the realization that a "null" hypothesis is usually a ridiculous presumption, but scientists are not ready for more nuanced thinking on this yet.
