I'm using RStudio. I have a directory of .xlsx files. Each filename is dot-separated (e.g. 218-8263.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx): the first part is the sample name, the second is the variant caller used, and the rest (i.e. .vcf.gz.hg19_multianno.txt_exonic.xlsx) is not relevant. All files have the same columns. I would like to create a Venn diagram for each sample (I have 9 samples, each processed with 4 variant callers) based on the AAChange.refGene column. My directory is /Samples/.
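To make the naming scheme concrete, here is a minimal sketch of how the sample name and the caller could be pulled out of a file name with base R. The variable names (archivos, partes, muestra, caller) are placeholders I made up for illustration, and the two paths are examples from this post.

```r
# Example paths following the naming scheme described above
archivos <- c(
  "/Samples/218-8263.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx",
  "/Samples/104-10017.mutect2.filtered.vcf.gz.hg19_multianno.txt_exonic.xlsx"
)

# Drop the directory, then split each file name on literal dots
partes <- strsplit(basename(archivos), ".", fixed = TRUE)

# First field is the sample name, second field is the variant caller
muestra <- vapply(partes, `[`, character(1), 1)
caller  <- vapply(partes, `[`, character(1), 2)
```

Note that splitting on dots yields "mutect2" rather than "mutect2.filtered" for the Mutect2 files, so the second field alone is not enough to rebuild the full file name.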
Some examples of the files (I have 36 files in total):
"/Samples/104-10017.bcftools.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-10017.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-10017.mutect2.filtered.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-10017.strelka.variants.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-8613.bcftools.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-8613.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-8613.mutect2.filtered.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-8613.strelka.variants.vcf.gz.hg19_multianno.txt_exonic.xlsx"
I have started coding this, but I do not know how to continue:
    library(readxl)
    library(VennDiagram)

    # Directory of the files
    directorio <- "/Samples"

    # Return the AAChange.refGene column of one file
    # (capital C, matching the column header shown in the example data)
    get_AAchange_data <- function(file_path) {
      data <- read_excel(file_path)
      return(data$AAChange.refGene)
    }

    # Sample names: the part of each file name before the first dot
    # (muestras was not defined in my first attempt)
    archivos <- list.files(directorio, pattern = "\\.xlsx$")
    muestras <- unique(sub("\\..*$", "", archivos))

    variant_callers <- c("bcftools", "freebayes", "mutect2.filtered", "strelka.variants")
    datos_muestras <- list()

    # Read the files and get relevant data: one list of callers per sample
    for (muestra in muestras) {
      datos_variant_callers <- list()
      for (caller in variant_callers) {
        file_name <- paste(muestra, caller, "vcf.gz.hg19_multianno.txt_exonic.xlsx", sep = ".")
        file_path <- file.path(directorio, file_name)
        datos_variant_callers[[caller]] <- get_AAchange_data(file_path)
      }
      datos_muestras[[muestra]] <- datos_variant_callers
    }
An example of the AAChange.refGene column for sample 104-10017, per variant caller:
bcftools
AAChange.refGene
BRCA2:NM_000059:exon10:c.A865C:p.N289H
BRCA2:NM_000059:exon10:c.A1114C:p.N372H
BRCA2:NM_000059:exon10:c.A1365G:p.S455S
BRCA2:NM_000059:exon11:c.T2229C:p.H743H
BRCA2:NM_000059:exon11:c.A2971G:p.N991D
BRCA2:NM_000059:exon11:c.A4563G:p.L1521L
strelka
AAChange.refGene
BRCA2:NM_000059:exon10:c.A865C:p.N289H
BRCA2:NM_000059:exon10:c.A1114C:p.N372H
BRCA2:NM_000059:exon10:c.A1365G:p.S455S
BRCA2:NM_000059:exon11:c.T2229C:p.H743H
BRCA2:NM_000059:exon11:c.A2971G:p.N991D
BRCA2:NM_000059:exon11:c.A4563G:p.L1521L
mutect
AAChange.refGene
BRCA2:NM_000059:exon11:c.A4563G:p.L1521L
BRCA2:NM_000059:exon11:c.G6513C:p.V2171V
BRCA2:NM_000059:exon14:c.T7397C:p.V2466A
BRCA2:NM_000059:exon10:c.A1114C:p.N372H
BRCA2:NM_000059:exon11:c.A2971G:p.N991D
BRCA2:NM_000059:exon10:c.A865C:p.N289H
freebayes
AAChange.refGene
BRCA2:NM_000059:exon10:c.A865C:p.N289H
BRCA2:NM_000059:exon10:c.A1114C:p.N372H
BRCA2:NM_000059:exon10:c.A1365G:p.S455S
BRCA2:NM_000059:exon11:c.T2229C:p.H743H
BRCA2:NM_000059:exon11:c.A2971G:p.N991D
BRCA2:NM_000059:exon11:c.A3429C:p.E1143D
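To show the kind of intersection data I am after, here is a minimal base R sketch on the example values above. It only computes the sets and does not draw anything. The names prefijo, llamadas, comunes and exclusivas are placeholders of mine, and the shared "BRCA2:NM_000059:" prefix is factored out only to keep the listing short.

```r
# Example calls for sample 104-10017, copied from the listing above
prefijo <- "BRCA2:NM_000059:"
llamadas <- list(
  bcftools = paste0(prefijo, c(
    "exon10:c.A865C:p.N289H", "exon10:c.A1114C:p.N372H", "exon10:c.A1365G:p.S455S",
    "exon11:c.T2229C:p.H743H", "exon11:c.A2971G:p.N991D", "exon11:c.A4563G:p.L1521L")),
  strelka = paste0(prefijo, c(
    "exon10:c.A865C:p.N289H", "exon10:c.A1114C:p.N372H", "exon10:c.A1365G:p.S455S",
    "exon11:c.T2229C:p.H743H", "exon11:c.A2971G:p.N991D", "exon11:c.A4563G:p.L1521L")),
  mutect = paste0(prefijo, c(
    "exon11:c.A4563G:p.L1521L", "exon11:c.G6513C:p.V2171V", "exon14:c.T7397C:p.V2466A",
    "exon10:c.A1114C:p.N372H", "exon11:c.A2971G:p.N991D", "exon10:c.A865C:p.N289H")),
  freebayes = paste0(prefijo, c(
    "exon10:c.A865C:p.N289H", "exon10:c.A1114C:p.N372H", "exon10:c.A1365G:p.S455S",
    "exon11:c.T2229C:p.H743H", "exon11:c.A2971G:p.N991D", "exon11:c.A3429C:p.E1143D"))
)

# Variants reported by all four callers
comunes <- Reduce(intersect, llamadas)

# Variants reported by exactly one caller
exclusivas <- lapply(names(llamadas), function(n) {
  otras <- unlist(llamadas[names(llamadas) != n], use.names = FALSE)
  setdiff(llamadas[[n]], otras)
})
names(exclusivas) <- names(llamadas)
```

On this example, comunes holds the three variants p.N289H, p.N372H and p.N991D, mutect has two exclusive variants and freebayes has one. A named list of character vectors like llamadas is also the shape that VennDiagram::venn.diagram takes as its x argument.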
I think I need some kind of loop over the samples, but I have not found a working solution.
Could you help me generate these Venn diagrams together with their intersection data? Thanks!