5
$\begingroup$

Let's say I'm analyzing the mean number of students per class for a school district. The district has imposed a hard limit on class size: there can never be more than 30 students in a class. The district strictly enforces this rule and it is known to hold true.

I'd like to construct a confidence interval using a Student's t-distribution. For the sake of argument, we'll assume that all assumptions and conditions for this are met. The data are independent, randomly sampled, and appear approximately normally distributed on a histogram.

What happens if the sample yields a large-ish standard error and has a mean close to 30, resulting in a confidence interval like this?

We are 95% confident that the true mean students per class lies between 28 students and 31 students.

Obviously, a mean of 31 students per class is impossible, because none of the district's classes have more than 30 students. What should be done in cases like this? Should the confidence interval be left as-is, or should it be "capped" to the values that are actually possible (i.e. "between 28 students and 30 students")?

$\endgroup$
12
  • $\begingroup$ (1) Likert scale-items are ordinal not interval. They don't have a mean. (2) What are the proportions in each category, and what sample size is this for? (3) What assumptions were the CI calculation done under? (4) Are those assumptions appropriate? $\endgroup$
    – Glen_b
    Commented Apr 10, 2013 at 0:12
  • $\begingroup$ @Glen_b (1) I think a mean could still be useful in analyzing these data. In any case, the Likert scale is just an example. It could be something else that has a well-defined possible range of possible responses/values. (2) I'm not sure what you mean by "the proportions in each category." The sample size is small: around 100. (3) I'm assuming that the survey itself is unbiased and that responses are independent, and that it was distributed to random people within the population. We'll also say that the sampling distribution is nearly normal, though this is admittedly unlikely for Likert. $\endgroup$
    – Maxpm
    Commented Apr 10, 2013 at 16:06
  • $\begingroup$ A Likert scale item has ordered categories like "Strongly agree" and "Agree" etc. I was asking for the proportions of answers in each of those. [It's possible you actually meant Likert scale in its proper sense (a thing is truly only a Likert scale after you add the items up, either in a direct or weighted sum), in which case my question would not make sense.] $\endgroup$
    – Glen_b
    Commented Apr 10, 2013 at 22:46
  • $\begingroup$ (ctd) It looks like you missed the point of the last question. What do the data look like? How does the actual appearance of the data compare with the assumptions involved in the CI calculation you did? Why, if they're very different, would the CI do sensible things? $\endgroup$
    – Glen_b
    Commented Apr 10, 2013 at 22:47
  • 1
    $\begingroup$ Not exactly "impossible" since, among other things, it depends on a specific definition of 'approximately normal' and on the sample size. But if it happens, it's a pretty clear suggestion that the accuracy of the normal approximation on which the CI was based isn't sufficient for your purposes. In some cases truncating the interval is reasonable, but my instinct is usually to abandon it and do something better suited to the situation. It's akin to CIs for the binomial $p$ when $p$ is small: normality doesn't really apply. $\endgroup$
    – Glen_b
    Commented Apr 11, 2013 at 12:47

4 Answers

3
$\begingroup$

You could use resampling and order statistics (a percentile bootstrap). From your dataset of 100 points, draw, say, 100,000 samples of size 100 with replacement. Compute the average class size of each sample, then order the averages from smallest to largest. Remove the bottom 2,500 (2.5%) and the top 2,500 (2.5%). Your 95% confidence interval is then the range from the lowest remaining average to the highest remaining average.
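
The resampling steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the function name `bootstrap_mean_ci` is hypothetical, and `class_sizes` is made-up data standing in for the real sample of 100 classes.

```python
import numpy as np

def bootstrap_mean_ci(data, n_resamples=100_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size
    means = np.empty(n_resamples)
    batch = 10_000  # resample in batches to keep memory use modest
    for start in range(0, n_resamples, batch):
        size = min(batch, n_resamples - start)
        # Each row of idx is one resample of size n, drawn with replacement.
        idx = rng.integers(0, n, size=(size, n))
        means[start:start + size] = data[idx].mean(axis=1)
    alpha = 1.0 - level
    # Taking these quantiles is equivalent to sorting the averages and
    # dropping the bottom and top 2.5% (for level=0.95).
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Made-up data: 100 class sizes, none above the cap of 30.
class_sizes = np.random.default_rng(42).binomial(30, 0.95, size=100)
lo, hi = bootstrap_mean_ci(class_sizes)
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```

Because every resampled average is an average of observed class sizes, the upper end of this interval cannot exceed the largest observed value, and therefore cannot exceed 30.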

If you need a quick back-of-the-envelope answer, just order your 100 data points and remove the bottom 2 and top 2 points. Be aware that this gives the range covering roughly 96% of the individual class sizes, which is wider than a confidence interval for the mean.

You should also consider using a one-sided interval since you have a hard constraint on the upper end.

$\endgroup$
1
$\begingroup$

If the confidence interval for the mean runs into a physical boundary, you can instead calculate a Bayesian credible interval for the mean, using a prior that assigns zero probability to the impossible values.
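
As a rough sketch of what this could look like, the code below uses a grid approximation. Everything specific here is an assumption made for illustration, not part of the answer: a flat prior on [0, 30], a normal approximation for the likelihood of the sample mean with the standard error plugged in from the data, the hypothetical function name `truncated_prior_credible_interval`, and the made-up `class_sizes`.

```python
import numpy as np

def truncated_prior_credible_interval(data, lower=0.0, upper=30.0,
                                      level=0.95, grid_size=20_001):
    """Equal-tailed credible interval for the mean (hypothetical helper).

    Assumptions: flat prior on [lower, upper] and zero prior mass outside;
    normal likelihood for the sample mean with a plug-in standard error.
    """
    data = np.asarray(data, dtype=float)
    xbar = data.mean()
    se = data.std(ddof=1) / np.sqrt(data.size)
    # The grid only spans possible values, so impossible means get zero mass.
    mu = np.linspace(lower, upper, grid_size)
    log_post = -0.5 * ((xbar - mu) / se) ** 2  # flat prior adds a constant
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    cdf = np.cumsum(post)
    alpha = 1.0 - level
    lo = mu[np.searchsorted(cdf, alpha / 2)]
    hi = mu[np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi

# Made-up sample of 10 class sizes whose t-interval would extend past 30.
class_sizes = [30, 30, 29, 30, 22, 30, 30, 28, 30, 30]
lo, hi = truncated_prior_credible_interval(class_sizes)
print(f"95% credible interval for the mean: ({lo:.2f}, {hi:.2f})")
```

The interval is read off the posterior restricted to [0, 30], so neither endpoint can fall outside the possible range. A different prior or likelihood would give a different interval.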

$\endgroup$
0
$\begingroup$

You could predict a variable that is a transformation of the class size instead of the class sizes directly.

In the case of students per class, a variable that is restricted to values between 0 and 30, you could consider ordinal regression, which models a latent variable for the probabilities of class sizes rather than the class sizes themselves.


Other common transformations are the log transformation, which maps a positive variable to the entire real line, and the logit transformation, which maps a number between 0 and 1 to the entire real line.

$\endgroup$
0
$\begingroup$

I'd like to construct a confidence interval using a Student's t-distribution. For the sake of argument, we'll assume that all assumptions and conditions for this are met.

In my opinion, this is where the problem is. If there is a hard boundary, a defining assumption behind the t-distribution is violated: normally distributed data have no upper or lower limit.

It is very common to ignore this: when the mean is “far enough” from the boundaries, the inaccuracy is negligible. In your case, it is clearly not negligible.

In addition, you don’t have a continuous variable; there are no “fractional students” in a classroom. Again, ignoring this is something that we can often get away with.

I have little experience with this type of data. When I ran into it, I got away with converting to proportions and using inference methods for those (in this case that would be [number in class]/30). You may try this, but I suspect that the step size of 1/30 may be too coarse for this to work.
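
One way the proportion idea might be tried is sketched below. The choice of a Wilson score interval is an assumption made for illustration, not something specified above. It also treats each of the 30 seats per class as an independent trial, which is a strong assumption that may misstate the real variability. The function name `wilson_mean_class_size` and the data are made up.

```python
import math

def wilson_mean_class_size(class_sizes, cap=30, z=1.96):
    """Wilson score interval for the fill proportion, rescaled to students.

    Hypothetical helper. Assumes the total head count is binomial with
    cap * (number of classes) trials; z=1.96 gives roughly 95% coverage.
    """
    n_trials = cap * len(class_sizes)
    p_hat = sum(class_sizes) / n_trials
    denom = 1 + z ** 2 / n_trials
    centre = (p_hat + z ** 2 / (2 * n_trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_trials
                         + z ** 2 / (4 * n_trials ** 2)) / denom
    # Multiply by the cap to convert the proportion back to students per class.
    return cap * (centre - half), cap * (centre + half)

# Made-up sample of 10 class sizes.
class_sizes = [30, 30, 29, 30, 22, 30, 30, 28, 30, 30]
lo, hi = wilson_mean_class_size(class_sizes)
print(f"Approximate 95% interval for the mean: ({lo:.2f}, {hi:.2f})")
```

The Wilson interval for a proportion stays within 0 and 1, so the rescaled interval stays within 0 and 30 students.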

It is therefore probably best to read up on inference methods for discrete distributions. I personally never got around to that, and never had an application for it in my daily work, so I can’t do more than point you to Ecosia. (Google will probably also work.)

$\endgroup$
