I'm using RStudio. I have a directory of files in xlsx format. Each file name consists of dot-separated parts (e.g. 218-8263.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx): the first part is the sample name and the second is the variant caller used; the rest (i.e. .vcf.gz.hg19_multianno.txt_exonic.xlsx) is not relevant. All files have the same columns. I would like to create a Venn diagram for each sample (I have 9 samples, each with 4 variant callers) based on the AAchange.refGene column. My directory is: /Samples/.

Some example file names (I have 36 files in total):

"/Samples/104-10017.bcftools.vcf.gz.hg19_multianno.txt_exonic.xlsx"        
"/Samples/104-10017.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx"       
"/Samples/104-10017.mutect2.filtered.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-10017.strelka.variants.vcf.gz.hg19_multianno.txt_exonic.xlsx"
"/Samples/104-8613.bcftools.vcf.gz.hg19_multianno.txt_exonic.xlsx"         
"/Samples/104-8613.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx"        
"/Samples/104-8613.mutect2.filtered.vcf.gz.hg19_multianno.txt_exonic.xlsx" 
"/Samples/104-8613.strelka.variants.vcf.gz.hg19_multianno.txt_exonic.xlsx"

I have started coding this, but I do not know how to continue:

    library(readxl)
    library(VennDiagram)

    # Directory of the files
    directorio <- "/Samples"

    # Sample names: the part of each file name before the first dot
    muestras <- unique(sub("\\..*$", "", list.files(directorio, pattern = "\\.xlsx$")))

    # Second part of each file name, one entry per variant caller
    variant_callers <- c("bcftools", "freebayes", "mutect2.filtered", "strelka.variants")

    # Function to obtain the data of AAchange.refGene for each file
    # NOTE: the example tables below spell this column "AAChange.refGene";
    # the name used here must match the spelling in the files
    get_AAchange_data <- function(file_path) {
      data <- read_excel(file_path)
      return(data$AAchange.refGene)
    }

    datos_muestras <- list()

    # Read the files and get relevant data
    for (muestra in muestras) {
      datos_variant_callers <- list()

      for (caller in variant_callers) {
        file_name <- paste(muestra, caller, "vcf.gz.hg19_multianno.txt_exonic.xlsx", sep = ".")
        file_path <- file.path(directorio, file_name)
        datos <- get_AAchange_data(file_path)
        datos_variant_callers[[caller]] <- datos
      }

      datos_muestras[[muestra]] <- datos_variant_callers
    }

An example of the columns for 104-10017:

bcftools
AAChange.refGene
BRCA2:NM_000059:exon10:c.A865C:p.N289H  
BRCA2:NM_000059:exon10:c.A1114C:p.N372H  
BRCA2:NM_000059:exon10:c.A1365G:p.S455S  
BRCA2:NM_000059:exon11:c.T2229C:p.H743H 
BRCA2:NM_000059:exon11:c.A2971G:p.N991D  
BRCA2:NM_000059:exon11:c.A4563G:p.L1521L

strelka
AAChange.refGene
BRCA2:NM_000059:exon10:c.A865C:p.N289H   
BRCA2:NM_000059:exon10:c.A1114C:p.N372H  
BRCA2:NM_000059:exon10:c.A1365G:p.S455S  
BRCA2:NM_000059:exon11:c.T2229C:p.H743H 
BRCA2:NM_000059:exon11:c.A2971G:p.N991D  
BRCA2:NM_000059:exon11:c.A4563G:p.L1521L

mutect
AAChange.refGene
BRCA2:NM_000059:exon11:c.A4563G:p.L1521L
BRCA2:NM_000059:exon11:c.G6513C:p.V2171V 
BRCA2:NM_000059:exon14:c.T7397C:p.V2466A 
BRCA2:NM_000059:exon10:c.A1114C:p.N372H 
BRCA2:NM_000059:exon11:c.A2971G:p.N991D  
BRCA2:NM_000059:exon10:c.A865C:p.N289H

freebayes
AAChange.refGene
BRCA2:NM_000059:exon10:c.A865C:p.N289H  
BRCA2:NM_000059:exon10:c.A1114C:p.N372H  
BRCA2:NM_000059:exon10:c.A1365G:p.S455S
BRCA2:NM_000059:exon11:c.T2229C:p.H743H
BRCA2:NM_000059:exon11:c.A2971G:p.N991D  
BRCA2:NM_000059:exon11:c.A3429C:p.E1143D

I think I need some kind of loop over the samples, but I have not found a working solution.

Could you help me generate these Venn diagrams along with their intersection data? Thanks!!

1 Answer

A generic way to map the sample names to some result with Map would be as follows. Put the payload (importing the file, manipulating the data, plotting, ...) into the function passed to Map, f = \(sample_name){...}. Start from the file names:

    file_names <- c(
      "/Samples/104-10017.bcftools.vcf.gz.hg19_multianno.txt_exonic.xlsx",
      "/Samples/104-10017.freebayes.vcf.gz.hg19_multianno.txt_exonic.xlsx" ## ...,
      )

Extract the variable part of each file name (sample and caller):

    sample_names <- gsub('/Samples/(.*)\\.vcf.*', '\\1', file_names)

    ## > sample_names
    ## [1] "104-10017.bcftools"  "104-10017.freebayes"
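
Since the question asks for one diagram per sample rather than one per file, the combined names can be split further. The following is a minimal sketch in base R, assuming the sample ID never contains a dot (true for the file names shown in the question). The names `sample_id`, `caller` and `by_sample` are illustrative and not part of the original answer:

```r
# Combined "sample.caller" names, as produced by the gsub() call above
sample_names <- c("104-10017.bcftools", "104-10017.freebayes",
                  "104-10017.mutect2.filtered", "104-8613.strelka.variants")

# Everything before the first dot is the sample ID (assumes no dot in the ID)
sample_id <- sub("\\..*$", "", sample_names)

# The field between the first and second dot is the variant caller
caller <- sub("^[^.]*\\.([^.]*).*$", "\\1", sample_names)

# Group the combined names by sample: one list element per sample
by_sample <- split(sample_names, sample_id)
```

Each element of `by_sample` then holds the names belonging to one sample, which is the unit a Venn diagram would be built from.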

Map the sample names to some result, e.g. a Venn diagram:

    sample_names |>
      Map(f = \(sample_name){
        file_name <- sprintf("/Samples/%s.vcf.gz.hg19_multianno.txt_exonic.xlsx", sample_name)
        ## d <- read_excel(file_name)
        ## d |>
        ## some_transformations() |>
        ## make_venn() |>
        ## plot()
        sprintf('a dummy result for sample %s', sample_name)
      })

    ## $`104-10017.bcftools`
    ## [1] "a dummy result for sample 104-10017.bcftools"
    ## 
    ## $`104-10017.freebayes`
    ## [1] "a dummy result for sample 104-10017.freebayes"
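
For the payload itself, one possible sketch for a single sample follows. It takes the AAChange.refGene values listed in the question for 104-10017 as in-memory vectors instead of reading the xlsx files, and factors out the common prefix through a small helper. The names `variant`, `sets`, `common`, `membership` and `private` are illustrative. The intersections are computed with base R; the drawing step assumes the VennDiagram package is installed and is skipped otherwise:

```r
# Helper (hypothetical) that restores the prefix shared by all example variants
variant <- function(x) paste0("BRCA2:NM_000059:", x)

# Variants reported for sample 104-10017, copied from the question.
# With real files, each vector would come from the AAChange.refGene column.
sets <- list(
  bcftools  = variant(c("exon10:c.A865C:p.N289H", "exon10:c.A1114C:p.N372H",
                        "exon10:c.A1365G:p.S455S", "exon11:c.T2229C:p.H743H",
                        "exon11:c.A2971G:p.N991D", "exon11:c.A4563G:p.L1521L")),
  strelka   = variant(c("exon10:c.A865C:p.N289H", "exon10:c.A1114C:p.N372H",
                        "exon10:c.A1365G:p.S455S", "exon11:c.T2229C:p.H743H",
                        "exon11:c.A2971G:p.N991D", "exon11:c.A4563G:p.L1521L")),
  mutect2   = variant(c("exon11:c.A4563G:p.L1521L", "exon11:c.G6513C:p.V2171V",
                        "exon14:c.T7397C:p.V2466A", "exon10:c.A1114C:p.N372H",
                        "exon11:c.A2971G:p.N991D", "exon10:c.A865C:p.N289H")),
  freebayes = variant(c("exon10:c.A865C:p.N289H", "exon10:c.A1114C:p.N372H",
                        "exon10:c.A1365G:p.S455S", "exon11:c.T2229C:p.H743H",
                        "exon11:c.A2971G:p.N991D", "exon11:c.A3429C:p.E1143D"))
)

# Variants reported by every caller
common <- Reduce(intersect, sets)

# Logical matrix: one row per variant, one column per caller
all_variants <- unique(unlist(sets, use.names = FALSE))
membership <- sapply(sets, function(s) all_variants %in% s)
rownames(membership) <- all_variants

# Variants reported by exactly one caller
private <- all_variants[rowSums(membership) == 1]

# Draw the four-set Venn diagram if VennDiagram is available
if (requireNamespace("VennDiagram", quietly = TRUE)) {
  venn <- VennDiagram::venn.diagram(sets, filename = NULL)
  grid::grid.newpage()
  grid::grid.draw(venn)
}
```

To cover all samples, the same steps could be wrapped in a function and applied to each group of files that share a sample ID, with each set read from the corresponding file via read_excel. The `membership` matrix gives the intersection data for any combination of callers, not only the four-way overlap.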
