Nat Genet. 2023 Aug;55(8):1413-1420.
doi: 10.1038/s41588-023-01439-2. Epub 2023 Jul 13.

Studying the genetics of participation using footprints left on the ascertained genotypes

Stefania Benonisdottir et al. Nat Genet. 2023 Aug.

Abstract

The trait of participating in a genetic study probably has a genetic component. Identifying this component is difficult because the genetic information of participants cannot be compared directly with that of nonparticipants, the latter being unavailable. Here, we show that alleles that are more common in participants than in nonparticipants would be further enriched in genetic segments shared by two related participants. A genome-wide analysis was performed by comparing allele frequencies in shared and not-shared genetic segments of first-degree relative pairs in the UK Biobank. In nonoverlapping samples, a polygenic score constructed from that analysis is significantly associated with educational attainment, body mass index and being invited to a dietary study. The correlation between the genetic components underlying participation in UK Biobank and educational attainment is estimated to be 36.6%, which is substantial but far from total. Taking participation behaviour into account would improve the analyses of the study data, including those of health traits.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. The first principle of genetic induced participation bias: comparing shared and not-shared alleles.
a, General relative pairs sharing one allele IBD. b, Parent–offspring pairs. c, Sib-pairs sharing one allele IBD (IBD1). d, Sib-pairs sharing two alleles or no allele IBD (IBD2 and IBD0). Different shading denotes segments that are distinct by descent.
Fig. 2. Flowchart summarizing the procedure implemented to test genetic variants for association with primary participation using genotypes of participants only.
Core steps are shown as yellow rectangles. The procedure involves dividing the data into different data groups, shown here as blue rectangles. Gray rectangles show additional quality control steps, implemented to reduce or adjust for genotyping and data-processing errors. Pink rectangles show within-study validation steps.
Fig. 3. Sex-specific analysis of pPGS associations.
Centers of error bars correspond to effect estimates in Table 1 for the pPGS based on combined weights; here, the association analyses were performed for male (blue error bars) and female (red error bars) unrelateds separately. Error bars correspond to 95% CI (estimate ±1.96 s.e.). Quantitative phenotypes were standardized to have a variance of 1 in males and females separately (Supplementary Note section 5). For secondary participation traits, the sample size shown equals number of cases plus controls.
Fig. 4. Relative frequency differences as functions of sib-pairs enrichment in sample.
For $\alpha = 0.055$, the participation rate of UKBB, three frequency difference ratios of an SNP with participation effects (indicated in the figure) are displayed as functions of the sibling recurrence participation rate ratio, $\lambda_S$. The results are from simulations under a liability-threshold model where the participation of an individual is determined by its liability score and the participation rate $\alpha$. Given $\alpha$, $\lambda_S$ is a function of the correlation between the liability scores of two siblings (Methods). In particular, for $\alpha = 0.055$, a correlation of 0.193 between the siblings’ liability scores leads to $\lambda_S = 2$, the reported enrichment of sib-pairs in the UKBB data. We simulated 500 replications from a population of $5 \times 10^7$ sib-pairs. Allele 1 of the SNP is assumed to have a population frequency of 0.5 and the effect of the SNP is assumed to account for 0.1% of the variance of the liability score. The simulated averages of the first two ratios, $(F_{\mathrm{IBD1S}} - F_{\mathrm{IBD1NS}})/(F_{\mathrm{samp}} - f_{\mathrm{pop}})$ and $(F_{\mathrm{IBD2}} - F_{\mathrm{IBD0}})/(F_{\mathrm{samp}} - f_{\mathrm{pop}})$, shown with red hollow squares and blue solid triangles respectively, are virtually indistinguishable from each other; they are roughly 1 when $\lambda_S$ is close to 1 and decrease gradually as $\lambda_S$ increases, but are always positive. The simulated average of the third ratio, $(F_{\mathrm{SIBS}} - F_{\mathrm{SING}})/(F_{\mathrm{samp}} - f_{\mathrm{pop}})$, where $F_{\mathrm{SIBS}}$ is the allele frequency in the participating sibling pairs and $F_{\mathrm{SING}}$ is the allele frequency in the participating individuals whose sibling does not participate, is shown with gray solid circles. For $\lambda_S = 2$, the first two ratios are around 0.86 and the third ratio is around 0.32.
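The link between the sibling liability correlation and the recurrence ratio stated in this caption can be checked numerically. The sketch below is not the authors' simulation code; it assumes the two siblings' liabilities are standard bivariate normal with correlation rho, and that an individual participates when the liability exceeds the threshold that gives participation rate alpha. The function name and the Simpson integration settings are illustrative choices.

```python
from statistics import NormalDist


def sibling_recurrence_ratio(alpha, rho, steps=4000):
    """P(both siblings participate) / alpha**2 under a bivariate-normal liability model.

    Assumption: liabilities are standard normal with sibling correlation rho.
    `steps` must be even (composite Simpson's rule).
    """
    std = NormalDist()
    t = std.inv_cdf(1.0 - alpha)  # liability threshold giving participation rate alpha
    scale = (1.0 - rho * rho) ** 0.5

    def integrand(x):
        # Density of sibling 1 at x, times P(sibling 2 exceeds t | sibling 1 = x).
        return std.pdf(x) * (1.0 - std.cdf((t - rho * x) / scale))

    upper = t + 10.0  # the normal density is negligible beyond this point
    h = (upper - t) / steps
    total = integrand(t) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t + i * h)
    p_both = total * h / 3.0
    return p_both / alpha ** 2


lambda_s = sibling_recurrence_ratio(0.055, 0.193)
print(lambda_s)
```

With alpha = 0.055 and rho = 0.193 the printed value should be close to the ratio of 2 quoted in the caption, and with rho = 0 it reduces to 1, meaning no enrichment of sibling pairs.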
Extended Data Fig. 1. Parental transmissions to sibling pairs.
Displayed are the 16 possible equally likely combinations of transmissions of parental genetic segments to a sibling pair at a given locus. The father has two blue genetic segments, one solid and one striped, and the mother has two orange genetic segments, solid and striped. The different colors and fills indicate distinct origins of inheritance. That is, the four parental genetic segments could be identical by state but they are all distinct with regard to grandparental origin. The sibling pairs are shown as diamond shapes, each carrying one blue genetic segment (solid or striped) inherited from the father and one orange genetic segment (solid or striped) inherited from the mother. In the case the siblings share one segment IBD (IBD1), 8 out of the 16 combinations, the shared segment is paternal for 4 combinations and maternal for the other 4 combinations.
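The counts in this caption can be reproduced by enumerating the transmissions directly. This is a minimal sketch: the 0/1 labels for the solid and striped segments and the variable names are illustrative, not taken from the paper.

```python
from collections import Counter
from itertools import product

# Each parent carries two segments, labelled 0 (solid) and 1 (striped).
# A transmission is (paternal to sib 1, maternal to sib 1, paternal to sib 2, maternal to sib 2).
transmissions = list(product([0, 1], repeat=4))

ibd_counts = Counter()   # number of combinations per IBD state (0, 1 or 2 shared segments)
ibd1_origin = Counter()  # parental origin of the shared segment among IBD1 combinations

for pat1, mat1, pat2, mat2 in transmissions:
    share_pat = pat1 == pat2
    share_mat = mat1 == mat2
    n_shared = int(share_pat) + int(share_mat)
    ibd_counts[n_shared] += 1
    if n_shared == 1:
        ibd1_origin["paternal" if share_pat else "maternal"] += 1

print(len(transmissions), dict(ibd_counts), dict(ibd1_origin))
```

The enumeration gives 4, 8 and 4 combinations for IBD0, IBD1 and IBD2, which corresponds to the expected fractions 0.25, 0.5 and 0.25, and splits the 8 IBD1 combinations evenly between paternal and maternal sharing.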
Extended Data Fig. 2. Expected and called fractions of sibling pairs sharing 0, 1, and 2 alleles IBD.
The figure shows, for each SNP, chromosomal position (x-axis, build 37) and the estimated IBD fractions among the 16,668 white British sibling pairs in UKBB (y-axis). The black solid lines denote the theoretical expected fraction for each IBD state, which equals 0.25, 0.5, and 0.25 for IBD0, IBD1, and IBD2 respectively. The two black dashed lines indicate the theoretical 95% probability interval, that is, expectation ± 1.96 SD. Panel a shows the empirical sibling fractions for each of the three IBD states computed from the results of the program KING, and panel b shows the sibling fractions computed from the results of the program snipar.
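As a rough guide to how wide such an interval is, the sketch below computes expectation ± 1.96 SD for each IBD state. It assumes the SD is the binomial standard deviation of a proportion over 16,668 independent sibling pairs. The caption does not state how the SD was derived, so this is an assumption, and the names are illustrative.

```python
from math import sqrt

N_PAIRS = 16_668  # sibling pairs reported in the caption
EXPECTED = {"IBD0": 0.25, "IBD1": 0.5, "IBD2": 0.25}


def probability_interval(p, n, z=1.96):
    """Return (lower, upper) = p -/+ z * SD, with SD assumed binomial: sqrt(p(1-p)/n)."""
    sd = sqrt(p * (1.0 - p) / n)
    return p - z * sd, p + z * sd


intervals = {state: probability_interval(p, N_PAIRS) for state, p in EXPECTED.items()}
for state, (low, high) in intervals.items():
    print(f"{state}: {low:.4f} to {high:.4f}")
```

Under this assumption the band is narrow, less than one percentage point on either side of the expectation, so even small systematic deviations in the called IBD fractions would be visible.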
Extended Data Fig. 3. Trimming SNPs at the beginning and end of IBD regions.
Noting that the error-rate of inferring IBD state is higher in the beginning and end of IBD regions, we trimmed away 250 SNPs from the beginning and end of each called IBD segment.
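The trimming step can be sketched as follows, assuming each called IBD segment is represented by inclusive start and end indices into an ordered SNP array. That representation and the function name are hypothetical; only the 250-SNP trim comes from the caption.

```python
def trim_ibd_segment(start, end, trim=250):
    """Trim `trim` SNPs from both ends of a segment given by inclusive SNP indices.

    Returns the trimmed (start, end), or None when no SNP remains after trimming.
    """
    new_start = start + trim
    new_end = end - trim
    if new_start > new_end:
        return None
    return new_start, new_end


print(trim_ibd_segment(0, 999))  # a 1,000-SNP segment keeps its middle 500 SNPs
print(trim_ibd_segment(0, 400))  # a segment too short to survive trimming
```

One consequence of this rule is that any segment spanning 500 SNPs or fewer is discarded entirely.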
Extended Data Fig. 4. Inferring shared allele for IBD1 when both individuals are heterozygous for target SNP.
Within the IBD1 region, we search for a neighboring SNP for which one individual is heterozygous while the other is homozygous. If such a neighboring SNP exists, and is phased with the target SNP, the shared allele of the target SNP can be inferred through the shared haplotype. This method was also used by Young et al. to infer the IBD1 shared allele.
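The inference described here can be illustrated with a small sketch. It assumes phased haplotypes are given as (target allele, neighbor allele) pairs for the individual who is heterozygous at the neighboring SNP. This data layout and the function name are hypothetical and not the authors' implementation. Because the relative is homozygous at the neighbor, the shared haplotype must carry that allele there.

```python
def infer_shared_allele(het_haplotypes, hom_neighbor_allele):
    """Infer the IBD1-shared allele at the target SNP.

    het_haplotypes: two phased haplotypes, each (target_allele, neighbor_allele),
        of the individual who is heterozygous at the neighboring SNP.
    hom_neighbor_allele: the allele the relative carries twice at the neighboring SNP.
    Returns the target allele on the shared haplotype, or None if it is ambiguous.
    """
    matches = [target for target, neighbor in het_haplotypes
               if neighbor == hom_neighbor_allele]
    if len(matches) != 1:
        return None  # neighbor is not informative (not heterozygous, or inconsistent)
    return matches[0]


# Individual A: haplotypes (target=1, neighbor=0) and (target=0, neighbor=1).
# Relative B is homozygous for allele 1 at the neighbor, so the shared haplotype is A's second.
print(infer_shared_allele([(1, 0), (0, 1)], hom_neighbor_allele=1))
```

The result depends on the phase between the target and neighboring SNPs being correct, which is why the phasing error rate matters for this step.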
Extended Data Fig. 5. Phasing error rate as a function of allele frequency.
For each biallelic sequence variant, we estimate the phasing error rate (y-axis) from trios where the offspring and one parent are heterozygous while the other parent is homozygous. The latter allowed us to determine the shared allele without using phasing and is taken as the truth. Error is when the shared allele deduced through phasing for the double-heterozygotes parent–offspring pair differs from the ‘truth’ supported by the genotype of the homozygous parent. The error rate here is, for instances where the true shared allele is 1, the fraction of times that allele 0 is deduced as the shared allele through phasing. The solid line shows the fit from regressing the estimated error rate on allele frequency up to the third power in the set of 500,632 SNPs.
Extended Data Fig. 6. Estimated bias of the WSPC t-statistics as a function of allele frequency.
The solid line shows the fit from regressing the unadjusted WSPC t-statistics through the origin on centered allele frequency ($cf = f - 0.5$) and $cf^3$ in the set of 500,632 SNPs. The dashed line shows the fit for the estimated bias induced by miscalling the shared allele for the double-heterozygotes as a function of allele frequency. As described in Supplementary Note section 3, the estimated phasing-induced bias was computed as $2f(1-f)\,[f\,\epsilon_{1-f} - (1-f)\,\epsilon_f]/\mathrm{SE}_f$, with $f$ being the frequency of the allele coded as 1, $\epsilon_{1-f}$ and $\epsilon_f$ being the estimated phasing error rates for a given $f$ (see Extended Data Fig. 5) and $\mathrm{SE}_f$ being the standard error of the shared-not-shared allele frequency difference for a given $f$. We note that, mainly due to the variation in sample sizes, $\mathrm{SE}_f$ has modest variation among SNPs with very similar $f$. For the figure here, a fitted value of $\mathrm{SE}_f$ is used.
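The bias expression in this caption can be evaluated with a short function. The sketch assumes the two error-rate terms are the fitted phasing error rate evaluated at frequencies 1 − f and f. The `toy_error_rate` curve below is a made-up placeholder, not the fit from Extended Data Fig. 5, and the standard error is an arbitrary illustrative value.

```python
def phasing_bias(f, error_rate, se):
    """Phasing-induced bias: 2 f (1 - f) [f * eps(1 - f) - (1 - f) * eps(f)] / SE.

    f: frequency of the allele coded as 1.
    error_rate: callable giving the phasing error rate at a frequency (assumed form).
    se: standard error of the shared-not-shared allele frequency difference.
    """
    return 2.0 * f * (1.0 - f) * (f * error_rate(1.0 - f) - (1.0 - f) * error_rate(f)) / se


def toy_error_rate(freq):
    # Hypothetical error curve for illustration only: errors fall as frequency rises.
    return 0.01 * (1.0 - freq)


for f in (0.2, 0.5, 0.8):
    print(f, phasing_bias(f, toy_error_rate, se=0.001))
```

Whatever error curve is supplied, the expression is exactly zero at f = 0.5 and antisymmetric around it, which matches the regression on odd powers of the centered allele frequency described in the caption.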
Extended Data Fig. 7. $\chi^2$ values as a function of minor allele frequency (MAF).
The two solid lines show the fit from regressing the $\chi^2$ statistics, computed from the allele-frequency adjusted TNTC and WSPC t-statistics, on MAF up to the third power. The t-statistics are for the 500,632 SNPs. The fitted value for a particular MAF can be interpreted as the average $\chi^2$ value for SNPs with MAFs close to that. The broken line is the corresponding fit for the BSPC $\chi^2$ statistics. Given that BSPC and WSPC capture similar true effects with comparable power, the difference between the MAF-specific fitted/average $\chi^2$ values is a measure of the average inflation of the WSPC $\chi^2$ values. Notably, for WSPC, the fitted $\chi^2$ value is much higher for SNPs with low MAFs. By contrast, the fitted $\chi^2$ value for BSPC has an increasing trend as MAF gets bigger. When MAF is low, the WSPC fitted $\chi^2$ value is substantially higher than that of BSPC, indicating that data errors are inducing a higher inflation there. As MAF increases, the difference between the WSPC and BSPC fitted $\chi^2$ values decreases. The fitted $\chi^2$ value of BSPC actually becomes slightly bigger than that of WSPC for MAF > 0.46, although that difference is not statistically significant. This is consistent with the WSPC results being close to unbiased when MAF is close to 0.5, which makes sense as the difference between major and minor alleles is small, and so is the major allele effect, when MAF is close to 0.5. The TNTC fitted $\chi^2$ value is in general smaller than that of WSPC. That is mainly due to the smaller effective sample size of TNTC, which affects the contributions of both the true effect and the bias to the $\chi^2$ statistics.
Extended Data Fig. 8. Relative frequency differences as a function of enrichment of sibling pairs in sample.
Displayed are relative allele frequency differences for different groups and segments as functions of the sibling recurrence participation ratio, $\lambda_S$. These differences are estimated from the same simulations underlying Fig. 4 and are described in the main text and Methods. $F_{\mathrm{IBD2}}$ and $F_{\mathrm{IBD0}}$ denote the allele frequency among sibling pairs sharing the SNP IBD2 and IBD0 respectively, while $F_{\mathrm{IBD1S}}$ and $F_{\mathrm{IBD1NS}}$ denote the allele frequency among the shared and not-shared alleles among sibling pairs sharing the SNP IBD1. $F_{\mathrm{SIBS}}$ is the allele frequency in the participating sibling pairs, $F_{\mathrm{SING}}$ is the allele frequency in the participating individuals whose sibling does not participate and $f_{\mathrm{pop}}$ is the population allele frequency.
