I am attempting to write probability density function for 2x2 contingency table with given margins, using chi-squared statistics. I am aware there are easy ways to do this with native R functions, but I am interested in trying to get a better understanding of R, thus I want to try this the "hard way".
a b ab
c d cd
ac bd abcd
a, b, c, and d are the contingency table elements. In the parlance of statistical gene overrepresentation, ab (a + b) is the total number of genes in gene set, ac (a + c) is the total number of genes in the group of interest (e.g., all overexpressed genes), a is the overlap of the two sets, and abcd is the total number of all detected genes.
I want to create probability density function integrand() that, from arguments a, ab, ac, and abcd, yields value for the given a. Importantly, integrand() must be vectorized with respect to argument a as integrate() (see below) requires this for calculating the integral. Other arguments (ab, ac, and abcd are, and should be, vectors of length 1 as they are fixed margins). Typically, user has the values for a, ab, ac, and abcd easily available, hence this particular parameterization.
integrand <- function(a, ...)
{
args <- list(...)
sapply(a, \(x, ...)
{
args <- list(...)
observed <- c(x, args[[1]] - x, args[[2]] - x, args[[3]] - args[[2]] - args[[1]] + x)
expected <- c(args[[1]] * args[[2]], args[[1]] * (args[[3]] - args[[2]]), (args[[3]] - args[[1]]) * args[[2]], (args[[3]] - args[[1]]) * (args[[3]] - args[[2]])) / args[[3]]
s <- sum((observed - expected)^2 / expected)
(s^-0.5) * exp(-s / 2) / (2^0.5 * gamma(0.5)) # df fixed to 1 appropriately for 2x2 contingency table.
}
, args[[1]], args[[2]], args[[3]])
}
ab <- 11
ac <- 10
abcd <- 25
# plot the pdf between 0 and 10
curve(integrand(x, ab, ac, abcd), 0, 10)
# integrate between 1 and 5
integrate(integrand, lower = 1, upper = 5, subdivisions = 1000, ab = ab, ac = ac, abcd = abcd)
But integrate() doesn't always work, yielding an error message "roundoff error is detected in the extrapolation table". I suspect because there is a sharp transition or discontinuity in the pdf.
I have banged my head to the wall two days now and I can't figure out why my function is not working robustly. The only reason I can think of is that (s^-0.5) * exp(-s / 2) / (2^0.5 * gamma(0.5))
is somehow wrong, but I can't see the mistake.
I also am aware that chi-squared statistics is not strictly very powerful when the cells in contingency tables have small values, left out Yates' continuity correction, and chose the integration limits somewhat arbitrarily, but but perhaps the numbers chosen here should still not crash the computation.
S
computed above is the quantile value of the chi distribution. You do not again pass it into the chi-square pdf. What you did is incorrects
is not the density but rather the quantile. you do not need a function. the functionpchisq
can be used to determine the probability of observings
. What are you looking for exactly?pchisq
using integration, you can directly use it as it is while passing the quantile values