Your last statement talks about the probability of a hypothesis being true. That can be approached by Bayesian statistics. But that is a whole topic that I will not get into here (but is a good topic to learn).
Instead, I will address some other ways to think about NHST besides p-values.
Back when I took my first statistics class (after the dinosaurs died, but before everyone carried powerful computers in their pockets (OK, cell phones did exist then, but they were bigger than bricks)), we learned about critical values and rejection regions before p-values. I sometimes lament that people's dependence on computers has promoted the p-value to prime status, so that we don't really think in terms of rejection regions any more. We have lost some of the basic intuitive understanding in exchange for the simplicity of reading the p-value spit out by the computer.
An example of creating a rejection region:
I sometimes play backgammon on my phone or tablet against a computer player. However, it seems like the computer player rolls doubles more often than it should (especially in critical situations). Now this could just be my memory tricking me, remembering the situations where the computer came from behind by rolling doubles better than the situations where it did not (psychologists have a name for this difference in memory), so I want to do a formal test.
I will play a game and count the number of times that the computer rolls doubles on its next 10 turns. If the number of doubles is high, that indicates something more than random chance is happening, but if the number of doubles is low, then it is probably just my memory playing tricks.
But how "high" is high enough to convince me that the computer is cheating? Let's look at the probability of getting each number of doubles if the computer is playing fair (rolls independent, with a constant probability of $\frac16$ of doubles on each roll; computed in R):
```r
> dbinom(0:10, 10, 1/6) |> zapsmall() |> setNames(0:10)
        0         1         2         3         4         5         6         7 
0.1615056 0.3230112 0.2907100 0.1550454 0.0542659 0.0130238 0.0021706 0.0002481 
        8         9        10 
0.0000186 0.0000008 0.0000000 
```
So the chance of 0 doubles is about 16%, the chance of 1 double is about 32%, and the chance of 10 doubles rounds to 0%. From this I can decide which values will lead me to reject the idea that the computer is playing fair. I will only use large numbers of doubles, because if the computer is getting fewer doubles than it should, that is in my favor and I should not complain. The common strategy is to start with the largest number of doubles (10) and count back, accumulating probability for as long as the accumulated probability stays small enough. If I choose a probability as a limit (say the traditional $\alpha=0.05$), then this means that I will reject the idea that the computer plays fair, in favor of the computer cheating, if there are more than 4 doubles (5 or more) in its next 10 rolls ($P(X>4)=0.015$, while $P(X>3)=0.070$ would exceed $\alpha$). So 5 through 10 doubles is my rejection region; or I could call $4.5$ my critical value and reject if the number of doubles is greater than $4.5$. This limits my chance of falsely accusing the computer of cheating to less than $\alpha$, though in this case the actual probability of a Type I error is $0.015$.
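The counting-back procedure can also be sketched in code. Here is a small Python version (using only the standard library, as an illustration alongside the R above) that accumulates probability from the most extreme outcome downward until adding the next value would push the total past $\alpha$:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass function: P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p, alpha = 10, 1/6, 0.05

# Start at the most extreme outcome (10 doubles) and count back,
# accumulating probability while the total stays at or below alpha.
tail = 0.0
critical = n + 1
for k in range(n, -1, -1):
    if tail + binom_pmf(k, n, p) > alpha:
        break
    tail += binom_pmf(k, n, p)
    critical = k

print(critical)        # smallest count inside the rejection region
print(round(tail, 3))  # the actual Type I error rate
```

This reproduces the rejection region of 5 through 10 doubles, with an actual Type I error rate of about $0.015$.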
Now with a critical value or rejection region I can run my test and make a decision without ever computing a P-value. And I think this is a lot more intuitive than worrying about the exact definition of a P-value.
Now, I could have also done the above by computing a P-value based on the binomial distribution. For 4 doubles it would give $p=0.070 > \alpha = 0.05$ and for 5 doubles it would give $p=0.015 < \alpha = 0.05$; for fewer than 4 doubles the P-value would be even higher, and for more than 5 doubles it would be even lower. The two strategies are exactly equivalent. I can find a critical value based on $\alpha$ and compare the observed number of doubles to the critical value, or I can convert the observed number (the test statistic) into a P-value and compare that to $\alpha$. Either approach gives the same decision about rejecting the null hypothesis that the computer is playing fair. It is just like comparing two distances to see which is longer when one is given in miles and the other in kilometers: it does not matter whether I convert the miles to kilometers or the kilometers to miles, the decision of which is longer will be the same.
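As a sketch of this equivalence, the following Python snippet checks, for every possible count of doubles, that comparing the upper-tail P-value to $\alpha$ and comparing the observed count to the critical value lead to the same decision (the critical count of 5 is taken from the rejection region worked out above):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass function: P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value(observed, n=10, p=1/6):
    """Upper-tail P-value: P(X >= observed) under the fair-dice null."""
    return sum(binom_pmf(k, n, p) for k in range(observed, n + 1))

alpha, critical = 0.05, 5  # critical count from the rejection-region approach

for observed in range(0, 11):
    reject_by_pvalue = p_value(observed) < alpha
    reject_by_critical = observed >= critical
    # the two routes always agree on the decision
    assert reject_by_pvalue == reject_by_critical
```

The loop passes for every count from 0 to 10: the miles-versus-kilometers conversion in either direction.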
For this example I chose the rejection region to be 5-10 doubles. I could instead have chosen the rejection region to be just 5-7 doubles, but would it make much sense to accuse the computer of cheating if I see 6 doubles, yet accept that it is fair if I see 9 doubles? It makes sense to define the rejection region to include everything "more extreme" than the critical value, and this is why, when computing the P-value, we include "or more extreme" rather than just computing the probability of the one observed value.
The advantage of computing P-values instead of critical values and rejection regions is the common scale: all P-values are between 0 and 1 and can be compared directly to a pre-specified $\alpha$. The critical value for my example above would be different if I based my test on the next 20 rolls, or 100 rolls, etc. For a t-test the critical value and rejection region depend on the hypothesized mean and the observed standard deviation (and sample size), so it is just simpler to look at the P-value.
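To illustrate how the critical value shifts with the number of rolls, here is a Python sketch (again stdlib only) that recomputes the critical count for 10, 20, and 100 rolls; a P-value, by contrast, would be compared to the same $\alpha=0.05$ in all three cases:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass function: P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_count(n, p=1/6, alpha=0.05):
    """Smallest count c with P(X >= c) <= alpha under the fair-dice null."""
    tail = 0.0
    c = n + 1
    for k in range(n, -1, -1):
        if tail + binom_pmf(k, n, p) > alpha:
            break
        tail += binom_pmf(k, n, p)
        c = k
    return c

for n in (10, 20, 100):
    # for n = 10 this reproduces the critical count of 5 from above;
    # larger n gives a different critical count, while alpha stays 0.05
    print(n, critical_count(n))
```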
I hope that thinking in terms of rejection regions helps your understanding.