DNA Methylation Signature From Buccal Swabs to Identify Tuberculosis Infection | The Journal of Infectious Diseases

Abstract

Background

Tuberculosis (TB) is among the largest infectious causes of death worldwide, and there is a need for time- and resource-effective diagnostic methods. In this novel and exploratory study, we show the potential of using buccal swabs to collect human DNA and investigate DNA methylation (DNAm) signatures as a diagnostic tool for TB.

Methods

Buccal swabs were collected from patients with pulmonary TB (n = 7), TB-exposed persons (n = 7), and controls (n = 9) in Sweden. DNAm status was determined using the Illumina MethylationEPIC array.

Results

We identified 5644 significant differentially methylated CpG sites between the patients and controls. Performing the analysis on a validation cohort of samples collected in Kenya and Peru (patients, n = 26; exposed, n = 9; control, n = 10) confirmed the DNAm signature. We identified a TB consensus disease module, significantly enriched in TB-associated genes. Last, we used machine learning to identify a panel of 7 CpG sites discriminative for TB and developed a TB classifier. In the validation cohort, the classifier performed with an area under the curve of 0.94, sensitivity of 0.92, and specificity of 1.

Conclusions

In summary, the results of this study show the clinical potential of using DNAm signatures from buccal swabs to explore new diagnostic strategies for TB.

Graphical Abstract

tuberculosis, DNA methylation, classifier, biosignature, buccal swabs

Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), was one of the most fatal infectious diseases worldwide in 2022 [1]. Mtb is spread via aerosol when an infected individual coughs or sneezes. The immunological events following exposure are heterogeneous and range from clearance of the bacteria with innate or adaptive immune responses, to latent infection or subclinical or clinical TB [2]. There are several challenges with diagnosing both latent and active TB, and the World Health Organization (WHO) has developed the End TB Strategy and issued a global priority in research for new diagnostic tools [3]. The current diagnostic methods for latent TB infection include the Mantoux tuberculin skin test (TST) and the interferon-gamma release assay (IGRA), both of which have several limitations. TST and IGRA are based on circulating adaptive immune memory but cannot distinguish between a latent, active, or eliminated infection [4–6]. Diagnosis of active TB requires detection of Mtb in sputum through microscopy, culture, or nucleic acid amplification tests. The most contagious patients can be detected by smear microscopy, but the time to diagnosis with culture can take up to 6 weeks [7]. In addition, diagnosis based on sputum is of limited use in children and in cases of extrapulmonary TB, which is common in people with human immunodeficiency virus (HIV) infection. The laboratory handling of sputum samples used for culture and microscopy requires experienced personnel and laboratory facilities with a high biosafety level. There is an urgent need for a diagnostic method that is resource-effective to enable rapid diagnosis in TB-endemic countries. Epigenetic signatures have become recognized as new promising tools for the diagnosis of different diseases, including cancer, neurodegenerative diseases, and cardiovascular disease (reviewed in [8–10]). Transcriptomic signatures from whole blood have been widely studied, and diagnostic signatures for TB have been proposed [11]. 
DNA methylation (DNAm) signatures of peripheral blood mononuclear cells, whole blood, and lung immune cells have also been shown to distinguish both active and latent TB [12–18]. The buccal mucosa is a part of the mucosal immunity and the first line of defense in respiratory infections [19]. Humoral immune responses against Mtb have previously been described in saliva of TB patients [20]. During active TB, the bacteria can be present in the oral cavity and oral TB infection can develop from a pulmonary infection [21]. Several studies have investigated the possibility to diagnose TB by detection of Mtb DNA from mouth swab samples, but the sensitivity varies largely between studies [22–25]. In this study, we aimed to investigate if TB infection and exposure could be reflected in the DNA methylome of buccal cells. We investigated the DNAm patterns using Illumina EPIC Array and identified differently methylated CpG sites (DMCs) between the patients and controls. Using a validation cohort of samples collected in Peru and Kenya, we confirmed DNAm changes in the buccal mucosa of patients with TB compared to controls. The results showed cross-continental DNAm differences in buccal swabs from TB patients and controls. We further used machine learning to identify a panel of 7 CpG sites with TB case/control discriminative potential.

METHODS

Patients (n = 7) and individuals with occupational- or household-related TB exposure (n = 7) were enrolled in the study at the Department of Infectious Diseases at Linköping University Hospital. All patients had active pulmonary TB, and 2 patients also had bacteria spreading to other organs including pancreas and lymph nodes. TB patients in Sweden were diagnosed with sputum microscopy, polymerase chain reaction (PCR), or culture and were HIV negative (Supplementary Table 1). When a new TB case is discovered in Sweden, routine contact tracing is performed around the index case to identify exposed individuals. Individuals with >24 hours of exposure to a contagious patient or >8 hours of exposure to a highly contagious patient are enrolled in contact tracing and are tested with IGRA. Highly contagious patients were defined by positive smear microscopy diagnosis. Individuals enrolled in the contact tracing were asked to participate in the study. Healthy controls (n = 9) were enrolled at Linköping University. Additional participants were recruited in Eldoret, Kenya (patients; n = 19) and in Lima, Peru (patients; n = 7, control; n = 10, exposed; n = 9). All patients donated a sample within 2 weeks from diagnosis. TB patients in Peru and Kenya were diagnosed with smear microscopy and GeneXpert PCR. In Kenya, urine lipoarabinomannan (LAM) or clinical diagnosis was also used. All TB patients in Peru had pulmonary TB; in Kenya, 15 patients had pulmonary and 4 had extrapulmonary TB. All TB patients in Peru were HIV negative; in Kenya, 5 TB patients were HIV positive (Supplementary Table 1). The exposed population in Peru was defined as household contacts of a TB patient (n = 5) or healthcare workers with occupational TB exposure (n = 4). Study participants donated buccal swab samples and blood samples for IGRA with the QuantiFERON TB-Gold Plus test (SSI Diagnostica, Hillerød, Denmark), which was analyzed according to the manufacturer's instructions. 
Buccal swab samples were collected using OmniSwab (Qiagen, Hilden, Germany). The swab was rubbed against the buccal mucosa 5 times up and down for 10 seconds and ejected into a 2-mL tube; 1 swab per cheek was collected from each participant. The buccal swab was stored at 4°C for a maximum of 4 hours before DNA isolation was performed. DNA was extracted from the buccal swabs using the QIAamp DNA mini kit (Qiagen) following the manufacturer's instructions for DNA isolation from buccal swabs. The DNA was analyzed using the Illumina Infinium MethylationEPIC BeadChip microarray (Illumina, California). Bioinformatic and statistical analyses are described in Supplementary File 1 (Supplementary Methods, Bioinformatics and Statistics).

RESULTS

Study Cohort and Design

We included study participants with active TB (patients; n = 7), persons with occupational- or household-related TB exposure (exposed; n = 7), and healthy controls (controls; n = 9) to investigate epigenetic patterns in buccal swabs in TB infection and exposure. The participants donated buccal swabs and blood samples for IGRA. The demographics of the included study participants are shown in Table 1. There were no significant differences regarding sex, age, weight, body mass index, or BCG vaccination status. We observed a significant difference in the IGRA status and height (P < .001 and P = .023, respectively). Furthermore, we included a validation cohort of participants recruited in Kenya (patients; n = 19) and Peru (patients; n = 7, control; n = 10, exposed; n = 9) to validate the results of the pilot cohort. The demographics of the validation cohort showed significant differences between the groups in weight and body mass index (BMI) (P < .001 and P < .001, respectively) (Table 2). The patients had significantly lower weight and BMI compared to exposed individuals (P < .001, P < .001) and controls (P < .001, P = .023). Malnutrition and underweight are both risk factors for developing TB and features of the disease [26]. The TB disease phenotype, diagnostic method used, and HIV status of all patients are shown in Supplementary Table 1.

Table 1.

Demographics of the Study Participants in Pilot Cohort

Characteristic	Patient (n = 7)	Exposed (n = 7)	Control (n = 9)	P Value	Post Hoc P value
Sex				.838
Male	4 (57.1)	3 (42.9)	4 (44.4)		…
Female	3 (42.9)	4 (57.1)	5 (55.6)		…
BCG vaccine				.551
Yes	2 (33.3)	5 (71.4)	4 (44.4)		…
No	3 (50)	2 (28.6)	5 (55.6)		…
NA	2 (28.6)	0 (0)	0 (0)		…
IGRA status				<.001
Positive	5 (71.4)	0 (0)	0 (0)		…
Negative	1 (14.3)	7 (100)	7 (77.8)		…
NA	1 (14.3)	…	…		…
Age, y				.857
Mean ± SD	37.5 ± 9.5	35.9 ± 14.9	33.44 ± 14.0		…
Min, max	25, 50	18, 61	18, 51		…
Weight, kg				.490
Mean ± SD	61 ± 11.7	66.3 ± 20.5	70.89 ± 16.51		…
Min, max	44, 80	49, 110	52, 100		…
Height, m				.023	Patient–Control .308
Mean ± SD	1.68 ± 9.3	1.63 ± 0.1	1.78 ± 0.11		Control–Exposed .021
Min, max	1.54, 1.83	1.55, 1.77	1.65, 1.98		Exposed–Patient .943
BMI, kg/m²				.427
Mean ± SD	21.49 ± 2.7	24.58 ± 5.2	22.4 ± 4.5		…
Min, max	18.6, 24.5	19.9, 35.1	17.6, 30.9		…

Categorical variables are shown as No. (%). Continuous variables are shown as mean ± SD, and min, max shows the range of data. Significance tested in SPSS with χ² test for categorical variables and independent sample Kruskal-Wallis test for continuous variables. For significant findings, post hoc testing with Bonferroni was applied.

Abbreviations: BMI, body mass index; IGRA, interferon-gamma release assay; NA, not applicable; SD, standard deviation.

Table 2.

Demographics of the Study Participants in Validation Cohort

Characteristic	Patient (n = 26)	Exposed (n = 9)	Control (n = 10)	P Value	Post Hoc P value
Sex				.297
Male	12 (46.2)	2 (22.2)	5 (50)		…
Female	14 (53.8)	7 (77.8)	5 (50)		…
Smoking				<.001
Yes	2 (7.7)	0 (0)	0 (0)		…
Age, y				.105
Mean ± SD	35.54 ± 16.597	41.56 ± 14.388	28.8 ± 10.433		…
Min, max	19, 72	25, 72	20, 54		…
Weight, kg				<.001	Patient–Control <.001 Patient–Exposed <.001
Mean ± SD	55.85 ± 11.979	73.67 ± 12.826	70.7 ± 10.914		…
Min, max	45, 90	57, 95	61, 95		…
Height, cm				.185
Mean ± SD	165.23 ± 9.02	159.11 ± 9.36	165.6 ± 10.617		…
Min, max	148, 185	150, 159	159, 183		…
BMI, kg/m²				<.001	Patient–Control .023
Mean ± SD	20.58 ± 5.139	29 ± 4.0	25.9 ± 3.843		Patient–Exposed .00
Min, max	15, 34	24, 34	21, 33		…
IGRA
Positive	0	5	1		…
Negative	0	4	9		…
Unknown	26	0	0		…
Country
Peru	7 (21.2)	9 (100)	10 (100)		…
Kenya	19 (57.6)	0 (0)	0 (0)		…

Abbreviations: BMI, body mass index; IGRA, interferon-gamma release assay; SD, standard deviation.

DNA Methylation Pattern in Buccal Swabs Separates Patients, Exposed Contacts, and Healthy Controls

The DNA methylation status in >800 000 CpG sites was assessed using Illumina Infinium MethylationEPIC array. A singular value decomposition (SVD) analysis of the factors known to influence DNAm was performed (Supplementary Figure 1A) and the data were batch corrected (Supplementary Figure 1B). We performed an unsupervised clustering analysis using multidimensional scaling (MDS) of the 1000 most variable CpG sites in the dataset and observed separation of the groups (Figure 1A). To investigate if there were any significant differences between the groups, we identified DMCs (mean methylation difference [MMD], >0.2 and false discovery rate [FDR]–adjusted P < .05). There were 5644 significant DMCs between the patients and controls, 413 between patients and exposed individuals, and 309 between exposed individuals and controls. Using all significant DMCs (n = 5865), we created a heatmap showing a spectrum of DNAm changes in the exposed and separation of the patients and controls (Figure 1B). The overlap of the DMCs between the groups was analyzed in a Venn analysis and showed 5153 significant DMCs unique to the patients and controls (Figure 1C). Together, these results indicate that the DNAm profiles obtained from buccal mucosa differentiate TB patients, exposed individuals, and healthy controls. We further investigated the cellular heterogeneity of the samples, since different cell types display distinct DNAm patterns, which can influence the DNA methylomes in a mixed sample [27]. We identified epithelial cell proportions of 75% (standard error of the mean [SEM], 6.2%) in patients, 79% (SEM, 6.1%) in exposed individuals, and 85% (SEM, 3.7%) in controls. The proportion of neutrophils was 13.8% (SEM, 5.5%) in patients, 10% (SEM, 5%) in exposed individuals, and 5.8% (SEM, 2.2%) in controls. The remaining cells consisted of other leukocytes including B cells, natural killer cells, CD4⁺ T cells, and monocytes. 
There were no significant differences in the cell proportions between the groups (P = .893; Figure 1D), and the cellular heterogeneity identified in the buccal swab samples was in line with previous findings based on DNAm data [28] and on microscopy characterization [29].
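The DMC selection criteria described above (MMD >0.2 and FDR-adjusted P < .05) can be illustrated with a minimal sketch. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, and the function names, CpG identifiers, beta values, and P values are hypothetical toy inputs rather than study data.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the FDR method is not named in the text; BH is a common choice.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_dmcs(cpg_ids, mean_beta_a, mean_beta_b, pvalues,
                mmd_threshold=0.2, fdr_threshold=0.05):
    """Keep CpG sites passing both the effect-size and the FDR criterion."""
    adjusted = benjamini_hochberg(pvalues)
    selected = []
    for cpg, beta_a, beta_b, q in zip(cpg_ids, mean_beta_a, mean_beta_b, adjusted):
        mean_methylation_difference = abs(beta_a - beta_b)
        if mean_methylation_difference > mmd_threshold and q < fdr_threshold:
            selected.append(cpg)
    return selected


# Hypothetical toy input: group mean beta values and unadjusted p-values.
cpg_ids = ["cg_a", "cg_b", "cg_c", "cg_d"]
patients = [0.80, 0.55, 0.30, 0.62]
controls = [0.45, 0.50, 0.65, 0.40]
pvalues = [0.001, 0.0005, 0.01, 0.20]

dmcs = select_dmcs(cpg_ids, patients, controls, pvalues)
print(dmcs)
```

In this toy input, cg_b fails the effect-size criterion despite a small P value, and cg_d fails the FDR criterion despite a difference above 0.2, which is why both thresholds are applied together.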

Figure 1.

DNA methylation patterns in buccal swabs distinguish patients with active tuberculosis (TB) (pink triangles), TB-exposed individuals (green squares), and healthy controls (blue circles). A, Multidimensional scaling plot of the 1000 most variable CpG sites within the dataset. B, Heatmap of beta values of the differently methylated CpG sites (DMCs) identified in pairwise comparison across all groups with stringency criteria of adjusted P < .05 and mean methylation difference >0.2. Dendrogram shows separation based on groups. C, Venn diagram of the DMCs identified in the pairwise comparisons showing the largest amount of DMCs between the TB patients and healthy controls. D, Hierarchical epigenetic dissection of intra-sample heterogeneity of the data showing proportions of different cell types within the mouth swab samples. No significant difference of the cell types between groups was identified (Kruskal-Wallis test, P = .893).

Validation of DNA Methylation Pattern in Buccal Swab Samples of TB Patients, Exposed Individuals, and Controls Using a Validation Cohort

To validate the robustness and generalizability of our findings, we incorporated additional participants from Kenya and Peru into a validation cohort. This validation cohort, notably larger than the initial pilot cohort, provided a more rigorous test of our findings, particularly given the perfect separation observed in the pilot study. An unsupervised clustering analysis using MDS of the 1000 most variable CpG sites in the dataset confirmed separation of the groups (Figure 2A). Furthermore, we identified 413 significant DMCs between the patients and controls, 32 between patients and exposed individuals, and 51 between exposed individuals and controls (MMD >0.2 and FDR-adjusted P < .05). The overlap in DMCs identified in the pilot and validation cohort was investigated in a Venn analysis and showed 22 overlapping DMCs between the cohorts (Figure 2B). In summary, these results confirm the findings from the pilot cohort and support that DNAm from the buccal mucosa can distinguish TB patients from healthy controls and exposed individuals.

Figure 2.

A validation cohort confirms differential methylation pattern from buccal swabs between patients with tuberculosis (TB) and healthy controls. A, Multidimensional scaling plot of the 1000 most variable CpG sites from buccal swab samples of patients with TB (pink triangles), TB-exposed individuals (green squares), and healthy controls (blue circles) with 85% confidence ellipses. The interferon-gamma release assay (IGRA) status of participants is indicated with a black outline. B, Differently methylated CpG sites (DMCs) between the patients and controls from the pilot cohort and validation cohort compared in a Venn analysis showing an overlap of 22 DMCs. C, Multidimensional scaling plot of the 1000 most variable CpG sites in DNA methylomes from buccal swab samples from TB patients (pink), TB-exposed individuals (green), and healthy controls (blue) from Kenya (triangles), Peru (squares), or Sweden (circles).

Cross-continental DNA Methylation Patterns in Buccal Swab Samples of TB Patients

We identified the signature in the pilot cohort and showed its general applicability across different subpopulations by replicating the results in the validation cohort. To investigate the similarities and dissimilarities of TB patients from the different geographical areas, we combined the data from the pilot and validation cohort. This allowed us to introduce and investigate population-based confounding factors. We did an SVD analysis of the data including all samples of the pilot and validation cohort (patients; n = 33, exposed; n = 16, controls; n = 19), and we identified significant contribution to variation in the data by the group, country, and slide (Supplementary Figure 2A). An SVD correction was performed to reduce technical batch effect from the slide (Supplementary Figure 2B). We performed an MDS analysis of the 1000 most variable CpG sites and identified separation between the groups regardless of the country (Figure 2C). In summary, the analysis shows that the patients with TB display a distinct DNAm pattern regardless of the population, suggesting a cross-continental DNAm signature in buccal swab samples for active TB.

Supervised Machine Learning Models Trained on a Panel of CpG Sites Achieve High Classification Performance of TB Patients and Controls

To investigate the potential of this DNAm signature as a diagnostic tool, we applied a machine learning approach to select a panel of CpG sites that can accurately distinguish TB patients from TB-exposed individuals and healthy controls. First, we trained L1-regularized multivariate logistic regression models on the pilot cohort (n = 23) using recursive feature elimination to promote model simplicity and interpretability, while mitigating potential overfitting and instability. Then, we validated the predictive accuracy of the selected CpG subsets by training supervised learning classifiers to estimate the probability of TB on samples from the pilot cohort and evaluating them on the left-out validation cohort. We observed that classifiers trained on the selected CpG subsets were able to achieve a high discriminatory performance for active TB (area under the curve [AUC] >0.90, sensitivity >0.70, specificity >0.95) among TB-exposed and healthy controls (Figure 3A and 3B). In particular, we found that a panel of 7 CpG sites optimized the balance between set size and model classification performance, showing an AUC of 0.94, sensitivity of 0.92, and specificity of 1, on the validation set (Figure 3C). The beta values for the 7 CpG sites for each group are shown in Figure 3D. Two CpG sites from the classifier were identified as DMCs in both the pilot and validation cohort independently, whereas the remaining 5 CpG sites were identified as DMCs in the pilot cohort (Supplementary Figure 3A). The model also performed within a satisfactory range (AUC, 0.82–0.92) when evaluated on the validation set without samples below the underweight BMI threshold (18.5 kg/m²), and on both the complete Peruvian cohort and the Peruvian cohort without low-BMI individuals (Supplementary Figure 3B). 
The average TB probability of the validation cohort samples estimated by the classifiers trained on these 7 CpG sites was significantly higher for active TB patients (prob_patients = .76) compared to both healthy controls (prob_controls = .13, adjusted Wilcoxon test P = 2.36e-8) and exposed individuals (prob_exposed = .52, adjusted Wilcoxon test P = 1.95e-4). Similarly, the TB probability predicted for exposed individuals was significantly higher than the estimations for controls (adjusted Wilcoxon test P = 1.30e-4) (Supplementary Figure 3C). Since Sweden is a low-incidence country for TB whereas Peru is a high-incidence country, we also investigated if this circumstance could have influenced the development of the classifier. We applied the same methodology as before to construct a new classifier trained on high-incidence settings, wherein we interchanged the training and validation sets (Peruvian cohort as training set, Swedish cohort as validation). The resulting model was built using 6 CpG sites, 2 of which overlapped with the previous set of 7 CpG sites (Fisher exact test P = 1.83e-5; odds ratio [OR], 548.54). Remarkably, this high-incidence classifier achieved an AUC of 0.98 on the Swedish cohort (Supplementary Figure 3D). These results suggest that DNAm levels from a small number of CpG sites suffice to accurately classify TB patients from exposed individuals and controls. Furthermore, the selected sites demonstrate consistency across different TB incidence settings.
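The performance measures reported above (sensitivity, specificity, and AUC) can be computed from a classifier's predicted TB probabilities as in the following minimal sketch. The labels and probabilities are hypothetical toy values, not study data, and the 0.5 decision threshold is an assumption, since the text does not state the cutoff used.

```python
def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Sensitivity and specificity at a probability cutoff.

    Assumption: the 0.5 threshold is illustrative; the text does not report one.
    """
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted_tb = prob >= threshold
        if label == 1 and predicted_tb:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_tb:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, y_prob):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_cases = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for case_prob in cases:
        for non_case_prob in non_cases:
            if case_prob > non_case_prob:
                wins += 1.0
            elif case_prob == non_case_prob:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


# Hypothetical toy data: 1 = TB patient, 0 = exposed individual or control.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.4, 0.2, 0.1, 0.05]

sens, spec = sensitivity_specificity(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(sens, spec, auc)  # 0.75 1.0 0.9375
```

Sensitivity and specificity depend on the chosen cutoff, whereas the AUC summarizes ranking quality across all cutoffs, which is why a model can show a specificity of 1 at one threshold while its AUC remains below 1.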

Figure 3.

Tuberculosis (TB) classifier based on DNA methylation (DNAm) in buccal swab samples accurately classifies patients with TB among healthy controls and exposed individuals. Using machine learning, 20 candidate CpG sites with discriminative features for TB were obtained. A classifier was trained on the pilot cohort (blue) and tested on the validation cohort (orange). A, Sensitivity (y-axis) of the classifier based on DNAm level in 1–20 CpG sites (x-axis). Sensitivity of 0.70–0.94 was reached depending on the number of CpG sites included. B, Specificity (y-axis) of the classifier in 1–20 CpG sites (x-axis). Specificity of 0.95–1.00 was reached depending on the number of CpG sites included. C, Receiver operating characteristic (ROC) curve of the classifier based on 7 CpG sites (sensitivity of 0.92 and specificity of 1) showing the true-positive rate (y-axis) and false-positive rate (x-axis) with an area under the curve (AUC) of 0.94. D, Beta values of the 7 classifier CpG sites for each group. The CpG sites are ordered by importance. All samples from pilot and validation cohort are represented in the plot.

Identification of a Consensus Disease Module Enriched in TB-Associated Genes and Pathways

To explore the biological context of the TB DNAm signatures from the pilot and validation cohorts, we applied a network analysis approach to identify modules of highly interconnected genes using the MODifieR pipeline [30]. We mapped the DMCs between TB patients and controls for each cohort, generating a pilot cohort TB module of 763 genes and a validation cohort TB module of 126 genes. KEGG pathway enrichment analysis showed significant enrichment in several pathways of infectious diseases and immune system (Supplementary Figures 4 and 5, respectively). Furthermore, the genes from the pilot and validation modules overlapped significantly (P < 2.2e-16; OR, 13.56), allowing us to retrieve a consensus TB disease module of 48 genes (Figure 4A). To further examine the TB consensus module, we performed pathway and gene ontology enrichment analyses. Notably, we found that the consensus genes were significantly enriched (adjusted P < .05) in pathways associated with bacterial infection, extracellular matrix (ECM) interactions, and immunoregulation (Figure 4B, complete pathway analysis in Supplementary Figure 6). The main component of the consensus module (n = 42) was significantly enriched in TB-associated genes from DisGeNET (P = .03; OR, 2.75). The TB-associated genes identified were WNT family member 5A (Wnt5a), growth factor receptor bound protein 2 (GRB2), mitogen-activated protein kinase 1 (MAPK1), epidermal growth factor receptor (EGFR), protein tyrosine phosphatase non-receptor type 6 (PTPN6), and protein tyrosine phosphatase receptor type C (PTPRC).
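The significance of an overlap between two gene modules, such as the one reported above, is typically assessed against a background gene universe. The sketch below shows one way to do this with a sample odds ratio and a one-sided hypergeometric tail probability. The function name is hypothetical, and the background size is an assumption, because the text does not report the universe used. The sample odds ratio computed here can also differ from the conditional estimate that some Fisher exact test implementations report, so the output is not expected to reproduce the published values.

```python
from math import comb


def overlap_enrichment(size_a, size_b, overlap, background):
    """Sample odds ratio and one-sided P(X >= overlap) for two gene sets.

    `background` is the number of genes that could have been selected;
    it is an assumption supplied by the caller.
    """
    in_both = overlap
    only_a = size_a - overlap
    only_b = size_b - overlap
    in_neither = background - size_a - size_b + overlap
    odds_ratio = (in_both * in_neither) / (only_a * only_b)

    # Hypergeometric upper tail: draw size_b genes, count hits in set A.
    total = comb(background, size_b)
    tail = sum(
        comb(size_a, k) * comb(background - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return odds_ratio, tail / total


# Small toy example that can be checked by hand.
toy_or, toy_p = overlap_enrichment(size_a=5, size_b=4, overlap=3, background=20)
print(toy_or, toy_p)

# Module sizes from the text with a HYPOTHETICAL background of 20 000 genes;
# the background used in the study is not reported here.
module_or, module_p = overlap_enrichment(763, 126, 48, background=20000)
print(module_or, module_p)
```

The choice of background matters: a larger universe makes the same overlap look more surprising, so any reproduction of the reported statistics would need the universe used by the original analysis.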

Figure 4.

Disease modules of patients with tuberculosis (TB) from pilot and validation cohort overlap and are enriched in TB-associated genes and pathways. A, Disease modules for the pilot and validation cohort identified based on the differentially methylated CpG sites from each cohort using MODifieR. The disease modules were compared in a Venn analysis showing significant overlap (P < 2.2e-16; odds ratio [OR], 13.56) and a consensus module of 48 genes. B, Network showing the genes in the consensus module and their connections. Hypermethylated CpG sites are shown in red, hypomethylated in blue, mixed methylation pattern in beige, and TB-associated genes with a black outline. There was a significant overlap of TB-associated genes in the interconnected module genes (n = 42) (P = .03; OR, 2.75). The module was explored using KEGG pathway enrichment analysis, and genes enriched in pathways of cell and extracellular matrix (ECM) interactions (light blue area) and genes enriched in immune system pathways (light red area) were identified.

DISCUSSION

To meet the needs for efficient TB diagnostics with reliable performance in low-resource settings, WHO is asking for non-sputum-based TB triage tests (minimum sensitivity of 0.90 and specificity of 0.70) and confirmatory tests (minimum sensitivity of 0.65 and specificity of 0.98). Urine LAM is clinically used in some settings but has suboptimal sensitivity [31]. Several blood transcriptomic classifiers have been suggested and performed with sensitivities of 0.83–0.91 at a specificity of 0.70 [11]. Another classifier based on 3 differentially methylated regions performed with AUC 0.84, sensitivity of 0.65, and specificity of 0.90 in a validation cohort of 31 TB patients and 31 controls [15]. Compared to blood samples, buccal swabs are less invasive and easy to collect and store, and DNAm is a stable epigenetic mark [32]. At the collection site, buccal swabs can be put in decontamination buffer to allow laboratory processing in Biosafety Level 2 laboratories. The cell proportions in buccal swab samples are also more homogeneous than in blood samples and can consequently be more suitable for diagnostic development of epigenetic signatures, since cellular heterogeneity contributes to variation in DNAm [33]. By narrowing down the number of PCR-addressable CpG sites that separate TB patients from persons without TB with precision, the technology can be aligned with existing high-throughput PCR protocols. Such a tool is not primarily a stand-alone tool but would have value in a clinical setting as a triaging tool, which needs validation with further confirmatory testing. In other fields of research, the buccal mucosa has been explored, and DNAm signatures of smoking [34], biological age [35], maternal stress during pregnancy [36], and in utero exposure to severe acute respiratory syndrome coronavirus 2 [37] have been described.

In the present study, we analyzed DNA methylomes of study participants in geographically distant locations and thereby introduced population-based confounding, since there are population-specific epigenetic differences due to ethnicity and environment [38, 39]. Sweden is a low-incidence country where most TB patients are foreign-born [40]. Investigating covariates in our datasets with singular value decomposition (SVD) showed that country contributed variation to the data but that TB status contributed more. We confirmed that TB patients have a distinct DNAm pattern and that this pattern is shared across continents. The results are in line with our own and others’ previous work demonstrating that DNAm patterns are changed in blood- and lung-derived immune cells during clinical or subclinical TB infection or after TB exposure [12–18]. To our knowledge, we are the first to report that DNAm changes are present in the buccal mucosa during active TB. We also identified DNAm differences in the TB-exposed individuals as compared with the healthy controls, suggesting that exposure to Mtb induces epigenetic changes. The signature appeared to be independent of involvement of the adaptive immune system, possibly reflecting a spectrum of disease severity that is not captured by IGRA status. The classifier assigned TB-exposed individuals a higher probability of being classified as TB patients than unexposed controls, suggesting that some of the exposed individuals have a subclinical infection. We have previously shown that recently TB-exposed individuals have altered DNAm of lung immune cells regardless of IGRA status [14], suggesting that DNAm signatures could be a measure of TB exposure that is independent of immunological tests. The relationship between DNAm and the heterogeneity of TB, which ranges from early clearance through latent and subclinical TB to active TB, needs further investigation.
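The covariate check described above can be sketched in general terms as follows. This is not the authors' pipeline: the data are simulated, and the effect sizes, sample counts, and variable names are all assumptions chosen for illustration. The sketch centers a samples-by-CpG matrix of beta values, decomposes it with SVD, and then asks how strongly each leading component correlates with TB status and with country.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design (assumption): 40 samples, 500 CpG sites, with TB status
# and country balanced against each other so the two covariates are uncorrelated.
n_samples, n_cpgs = 40, 500
tb_status = np.tile([0, 1], n_samples // 2)    # 0 = control, 1 = TB
country = np.repeat([0, 1], n_samples // 2)    # two sampling sites

# Simulated beta values: noise around 0.5, a larger TB effect on 100 sites,
# and a smaller country effect on 50 other sites.
beta = 0.5 + rng.normal(0.0, 0.03, size=(n_samples, n_cpgs))
beta[:, :100] += 0.20 * tb_status[:, None]
beta[:, 100:150] += 0.10 * country[:, None]
beta = np.clip(beta, 0.0, 1.0)

# Center each CpG site, then decompose.
centered = beta - beta.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component
scores = u * s                    # sample coordinates on each component


def abs_corr(x, y):
    """Absolute Pearson correlation between a component and a covariate."""
    return abs(np.corrcoef(x, y)[0, 1])


for k in range(2):
    print(
        f"component {k + 1}: variance explained = {explained[k]:.2f}, "
        f"|r| with TB = {abs_corr(scores[:, k], tb_status):.2f}, "
        f"|r| with country = {abs_corr(scores[:, k], country):.2f}"
    )
```

In this simulation the component explaining the most variance tracks TB status and the next one tracks country, which mirrors the pattern the text describes. In real data, covariates are rarely this cleanly balanced, so component-covariate associations need to be interpreted with more care.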

DNAm is intricately linked to the regulation of gene expression, with hypomethylation of promoters being associated with increased expression and hypermethylation with silencing of genes [41]. We identified a consensus disease module from the pilot and validation cohorts, with significant enrichment for TB-associated genes. Wnt5a was hypomethylated and is involved in the cellular processes following Mtb recognition through Toll-like receptor 2 [42]. Grb2, PTPN6 (both hypomethylated), and cdc-42 are involved in Mtb recognition through the mannose receptor [43]. Furthermore, we identified hypermethylation in EGFR, and mutations in this gene have been reported at an increased frequency in TB patients [44]. We also identified hypomethylation in PTPRC, which has been suggested as a biomarker for the diagnosis of active and latent TB [45]. Although macrophages are the primary target cell for Mtb, studies have provided evidence of infection in alveolar epithelial cells (AECs) [46, 47]. It has been proposed that Mtb translocates over the epithelial barrier both when internalized in AECs and through migration of infected macrophages [48]. Mtb adheres to ECM proteins such as collagens, fibronectins, and laminins [48]. We observed enrichment in the pathways of ECM receptor interaction and bacterial invasion of epithelial cells. We also identified enrichment in pathways connected to shigellosis and Salmonella infections; these are invasive intracellular infections in which the pathogen is phagocytosed by macrophages and manipulates the host to avoid digestion and extend intracellular survival, as in TB [49].
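An enrichment of TB-associated genes in a module, as reported above, is typically assessed with an overlap test. This section does not state which test the authors used, so the sketch below assumes a one-sided Fisher exact test, computed as the upper tail of the hypergeometric distribution. All counts in the example (background size, number of TB-associated genes, and overlap) are hypothetical, and only the module size of 42 interconnected genes comes from the text, so the output is not expected to reproduce the reported P value or odds ratio.

```python
from math import comb


def fisher_exact_greater(overlap: int, module_size: int,
                         annotated: int, background: int) -> float:
    """One-sided Fisher exact P value for enrichment.

    Probability of seeing at least `overlap` annotated genes when
    `module_size` genes are drawn at random, without replacement, from
    `background` genes of which `annotated` carry the annotation.
    """
    max_overlap = min(module_size, annotated)
    total = comb(background, module_size)
    tail = sum(
        comb(annotated, k) * comb(background - annotated, module_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def odds_ratio(overlap: int, module_size: int,
               annotated: int, background: int) -> float:
    """Sample odds ratio of the 2x2 table (module vs annotation)."""
    a = overlap                                            # in module, annotated
    b = module_size - overlap                              # in module, not annotated
    c = annotated - overlap                                # outside module, annotated
    d = background - module_size - annotated + overlap     # neither
    return (a * d) / (b * c)


# Hypothetical counts (assumptions): 20,000 background genes, 1,000 of them
# TB-associated, and 6 TB-associated genes among the 42 module genes.
p_value = fisher_exact_greater(overlap=6, module_size=42,
                               annotated=1000, background=20000)
ratio = odds_ratio(overlap=6, module_size=42, annotated=1000, background=20000)
print(f"P = {p_value:.3f}, OR = {ratio:.2f}")
```

The result depends strongly on the choice of background: restricting it to genes covered by the array or present in the interaction network will generally change both the P value and the odds ratio.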

Limitations of this study include the limited sample size, the lack of controls from Kenya, inconsistency in the diagnostic methods used, and potential model overfitting, given the high sensitivity achieved with a limited number of features. Temporal data and comparisons with other diseases would be required to further develop a DNAm signature with diagnostic properties.

CONCLUSIONS

We identified DNAm differences in buccal swab samples distinguishing TB patients from healthy controls. The signature was present in TB patients from 3 different populations, sampled in Sweden, Peru, and Kenya. Furthermore, we developed a TB-specific DNAm classifier that demonstrated promising performance in identifying TB patients within our limited-scale cohort. Our results suggest that buccal swabs can be used to identify TB patients, and they strengthen the clinical relevance of further developing DNAm signatures as a diagnostic tool for TB.

Supplementary Data

Supplementary materials are available at The Journal of Infectious Diseases online (http://jid.oxfordjournals.org/). Supplementary materials consist of data provided by the author that are published to benefit the reader. The posted materials are not copyedited. The contents of all supplementary data are the sole responsibility of the authors. Questions or messages regarding errors should be addressed to the author.

Notes

Acknowledgments. We would like to acknowledge the Core Facility for Bioinformatics and Expression Analysis at Karolinska Institute for the help with DNAm analysis of all samples collected in Sweden. We would like to acknowledge Nicholas Kiprotich and Mary Chepkwemoi at Moi University for their contribution in the coordination of the project and collection and processing of samples. We would like to acknowledge Clinical Genomics Linköping, Science for Life Laboratory, Linköping University, for DNAm analysis performed on samples collected in Peru and Kenya. The authors also acknowledge the contributions by Martina Sönnerbrandt, Department of Infectious Diseases, Linköping University Hospital, for help with contact tracing and sample collections during the study. The authors acknowledge the students Simona Lazarevic, John Berg, Gordon Spiegel, Danna Gutierrez, Sandra Dahling, and Remo Andersson for their work in the project in Peru, and Raynice Waker, Sam Widén, Frida Lindgärde, Anders Appeldahl, Malin Grönqvist, and Felicia Ollfors for their work in the project in Kenya.

Author contributions. L. K. coordinated the study in Sweden and designed and coordinated the studies in Kenya and Peru, optimized methods, prepared samples, did bioinformatic analysis, generated figures, and wrote the manuscript. I. Ö. included participants and prepared samples in Linköping, Sweden, and designed and coordinated the studies in Kenya and Peru. S. S. designed the bioinformatic analysis. D. M.-E. and M. G. performed MODifieR analysis and performed machine learning to develop the TB classifier. P. E. was responsible for the inclusion of participants and sample preparation in Lima, Peru. M. M.-A. was responsible for the coordination and supervision of laboratory activities. C. U.-G. had medical responsibility and supervised the study in Lima. L. D. and R. T. had medical responsibility and supervised the study in Eldoret. J. P. had medical responsibility of the studies and designed the study. M. L. designed and funded the study and wrote ethical application. All authors contributed to the manuscript.

Ethics approval. Ethical approval for the study in Sweden was obtained from the regional ethical review board in Linköping, No. 2016/237-31. In Kenya, ethical approval was obtained from Moi Teaching and Referral Hospital/Moi University Institutional Research and Ethics Committee, No. 0004260. In Peru, ethical approval was obtained from Universidad Peruana Cayetano Heredia Institutional Review Board, No. 209390. All participants provided signed informed consent.

Data availability. Participant-related data from this study are not available for sharing because Institutional Review Board rules currently limit the data release. Bioinformatic pipelines used to analyze the data and to generate graphs and figures will be available on the following GitHub account upon publication: https://github.com/Lerm-Lab/TB-BuccalSwabTB.

Financial support. This study was funded by the Heart and Lung Foundation (grant numbers 20180613 and 20220034 to M. L.) and by the Swedish Research Council (grant numbers 2018-02961 and 2018-04246 to M. L.).

References

World Health Organization. Global tuberculosis report. 2023. https://www.who.int/publications/i/item/9789240083851. Accessed 28 February 2024.

Simmons, Stein, Seshadri, et al. Immunological mechanisms of human resistance to persistent Mycobacterium tuberculosis infection. Nat Rev Immunol 2018; 575–
