1

I am attempting to write probability density function for 2x2 contingency table with given margins, using chi-squared statistics. I am aware there are easy ways to do this with native R functions, but I am interested in trying to get a better understanding of R, thus I want to try this the "hard way".

a   b     ab
c   d     cd
ac  bd   abcd

a, b, c, and d are the contingency table elements. In the parlance of statistical gene overrepresentation, ab (a + b) is the total number of genes in gene set, ac (a + c) is the total number of genes in the group of interest (e.g., all overexpressed genes), a is the overlap of the two sets, and abcd is the total number of all detected genes.

I want to create probability density function integrand() that, from arguments a, ab, ac, and abcd, yields value for the given a. Importantly, integrand() must be vectorized with respect to argument a as integrate() (see below) requires this for calculating the integral. Other arguments (ab, ac, and abcd are, and should be, vectors of length 1 as they are fixed margins). Typically, user has the values for a, ab, ac, and abcd easily available, hence this particular parameterization.

integrand <- function(a, ...)
{
  args <- list(...)
  sapply(a, \(x, ...)
  {
    args <- list(...)
    observed <- c(x, args[[1]] - x, args[[2]] - x, args[[3]] - args[[2]] - args[[1]] + x)
    expected <- c(args[[1]] * args[[2]], args[[1]] * (args[[3]] - args[[2]]), (args[[3]] - args[[1]]) * args[[2]], (args[[3]] - args[[1]]) * (args[[3]] - args[[2]])) / args[[3]]
    s <- sum((observed - expected)^2 / expected)
    (s^-0.5) * exp(-s / 2) / (2^0.5 * gamma(0.5))  #  df fixed to 1 appropriately for 2x2 contingency table.
  }
  , args[[1]], args[[2]], args[[3]])
}

ab <- 11
ac <- 10
abcd <- 25

# plot the pdf between 0 and 10
curve(integrand(x, ab, ac, abcd), 0, 10)
#  integrate between 1 and 5
integrate(integrand, lower = 1, upper = 5, subdivisions = 1000, ab = ab, ac = ac, abcd = abcd)

But integrate() doesn't always work, yielding an error message "roundoff error is detected in the extrapolation table". I suspect because there is a sharp transition or discontinuity in the pdf. I have banged my head to the wall two days now and I can't figure out why my function is not working robustly. The only reason I can think of is that (s^-0.5) * exp(-s / 2) / (2^0.5 * gamma(0.5)) is somehow wrong, but I can't see the mistake.

I also am aware that chi-squared statistics is not strictly very powerful when the cells in contingency tables have small values, left out Yates' continuity correction, and chose the integration limits somewhat arbitrarily, but but perhaps the numbers chosen here should still not crash the computation.

6
  • 2
    The value S computed above is the quantile value of the chi distribution. You do not again pass it into the chi-square pdf. What you did is incorrect
    – Onyambu
    Commented Jul 9 at 2:11
  • @Onyambu, appreciate the comment. Could you point me in the right direction to proceed! Specifically, what to do after calculating the chi-squared statistic (s), how to obtain a proper integrand that can be given to integrate(). Much appreciated!
    – yuppe
    Commented Jul 9 at 15:22
  • its unclear what you are looking for. the value s is not the density but rather the quantile. you do not need a function. the function pchisq can be used to determine the probability of observing s. What are you looking for exactly?
    – Onyambu
    Commented Jul 9 at 19:39
  • @Onyambu, thank you for your patience, I am not sure of the proper statistical terminology. Maybe the most concise description of what I am interested in achieving is in writing a function that takes contingency table as an input, and when numerically integrated by integrate(), yields a p-value corresponding to the (arbitrary) integration limits.
    – yuppe
    Commented Jul 10 at 2:53
  • To compute the p-value you do not need to integrate anything. Do you understand what a quantile value is? If the value was a density value, then you will have to integrate to obtain the p-value. A quantile value is directly related to the probability. its the inverse of the cdf. Unless you want to implement the pchisq using integration, you can directly use it as it is while passing the quantile value s
    – Onyambu
    Commented Jul 10 at 3:36

0

Browse other questions tagged or ask your own question.