Skip to content

CCRGeneticsBranch/khanlab_pipeline

 
 

Repository files navigation

khanlab_pipeline

This is the implementation of KhanLab NGS Pipeline using Snakemake.

Khanlab pipeline supports the following NGS data types:

  1. HiC
  2. RNAseq
  3. ChIPseq
  4. Pacbio
  5. DNAseq

Installation

The easiest way to get this pipeline is to clone the repository.

git clone git@github.com:hsienchao/khanlab_pipeline.git

This pipeline is available on NIH biowulf cluster, contact me if you would like to do a test run. The data from this pipeline could directly be ported in OncoGenomics-DB, an application created to visualize NGS data available to NIH users.

Requirements

snakemake 5.13.0
mutt
gnu parallel
SLURM resource management

HiC:

ChIPseq:

  • BWA
  • macs
  • rose
  • homer
  • coltron
  • samtools
  • bedtools

RNAseq:

  • STAR
  • RSEM
  • samtools
  • xengsort
  • HLAminer
  • seq2HLA

Pacbio:

  • Isoseq3
  • cDNA_cupcake
  • sqanti3

Conventions

  • Sample names cannot have "/" or "." in them
  • Fastq files end in ".fastq.gz"

Running pipeline

Input sample sheet

  • Sample sheet in YAML format
Required columns
  1. Genome (accepted values: hg19,hg38,mm10)
  2. SampleFiles ( usually Sample_ + Library_ID + _ + FCID )
  3. Other pipeline type specific columns
Example

See examples in HiC, RNAseq or ChIPseq

  • Sample sheet in YAML format
  • Sample sheet can be generated using script sampleToYaml.py. Usage:
python scripts/sampleToYaml.py -s [SAMPLE_ID] -o [OUTPUT_FILE]

Example:

python scripts/sampleToYaml.py -s RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7 -o RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7.hic.yaml

Launching the pipeline

1. General launch script:
launch [options]

required options:

        -type|t         <string>        Pipeline type (available options: hic,chipseq,ranseq,dnaseq)
        -workdir|w      <string>        Working directory where all the results will be stored.
        -sheet|s        <string>        Sample sheet in YAML format
        -genome|g       <string>        Genome version (default: hg38)

optional options: 
        -datadir|d <string> FASTQ file location (default: /data/khanlab/projects/DATA) 
        -dryrun Dryrun only 
        -dag Generate DAG SVG
  • Example (hg38)
      launch -type hic -w /data/khanlab/projects/HiC/processed_DATA -s /data/khanlab/projects/HiC/processed_DATA/sample_sheets/RH4_D6_H3K27ac_HiChIP_HKJ22BGX7.hic.yaml
  • Example (hg19)
      launch -type hic -w /data/khanlab/projects/HiC/processed_DATA -s /data/khanlab/projects/HiC/processed_DATA/sample_sheets/RH4_D6_H3K27ac_HiChIP_HKJ22BGX7.hic.yaml -g hg19
2. Launch by sample ID (Khanlab automation and regular Khanlab users):
    scripts/automate.sh [sampleID]
This script will parse the Khanlab master files to determine the sequencing type automatically. Then it will retrieve required columns from HiC/ChIPseq sample sheets and check if FASTQ files are ready. Then it will lauch the pipeline automatically.

For RNAseq samples, the genome is defined in "SampleRef" column in master file. For ChIPSeq/HiC samples, genome version is defined in "Genome" column. The values can be multiple seperated by comma (e.g. hg19,hg38)

The Khanlab data location:

-- FASTQ files: /data/khanlab/projects/DATA
-- Processed data: /data/khanlab/projects/pipeline_production/processed_DATA
-- Processed data: /data/khanlab/projects/pipeline_production/sample_sheets
-- ChIPSeq sample sheet: /data/khanlab/projects/ChIP_seq/manage_samples/ChIP_seq_samples.xlsx
-- HiC sample sheet: /data/khanlab/projects/HiC/manage_samples/HiC_sample_sheet.xlsx
  • Automation workflow alt tag

  • Example 1: process HiC sample

   scripts/automate.sh RH4_Ent6_H3K27ac_HiChIP_HH3JVBGX7
  • Example 2: process ChIPseq samples
   scripts/automate.sh RH4_D6_BRD4_024_C_HLFMLBGX3
  • Example 3: process RNAseq samples
   scripts/automate.sh NCI0215tumor_T_C2V4TACXX

HiC

HiC sample_sheet

columns
  1. Genome (hg19,hg38,mm10)
  2. SampleFiles
  3. SpikeIn (optional)
  4. SpikeInGenome (optional)

HiC Example:

samples:
  RH4_D6_H3K27ac_HiChIP_HKJ22BGX7:
    Genome: hg19
    SampleFiles: Sample_RH4_D6_H3K27ac_HiChIP_HKJ22BGX7
    SpikeIn: 'yes'
    SpikeInGenome: mm10

Output

  1. Juicebox hic file: [output dir]/[sample_id].allValidPairs.hic
  2. HiCpro pairs:
  • Pairs for the reference genome:
    • [output dir]/HiCproOUTPUT/hic_results/data/[sample_id]/[sample_id]/[sample_id].allValidPair
  • Pairs for the spikeIn genome:
    • [output dir]/HiCproAQuAOUTPUT/hic_results/data/[sample_id]/[sample_id]/[sample_id].allValidPair
  • Mergestat summary (reference and spike-In):
    • mergeStats.txt
  • Successful flag:
    • successful.txt

Dependency graph

DAG example with spike-In

alt tag

ChIPseq

ChIPseq sample_sheet

Required columns
  1. Genome (hg19,hg38,mm10)
  2. SampleFiles
  3. SpikeIn (yes,no)
  4. SpikeInGenome (optional)
  5. LibrarySize (optional)
  6. EnhancePipe (yes, no)
  7. PeakCalling (narrow, broad)
  8. Matched RNA-seq lib (optional)

ChIPseq Example:

ChIPSeq with enhancer pipeline (e.g. H3K27ac)

samples:
  RH4_D6_H3K27ac_018_C_HWC77BGXY:
    Matched normal: RH4_Input_001_C_H5TLGBGXX
    Genome: hg38
    SampleFiles: Sample_RH4_D6_H3K27ac_018_C_HWC77BGXY
    SpikeIn: 'yes'
    SpikeInGenome: dm6
    LibrarySize: 250
    EnhancePipe: 'yes'
    PeakCalling: narrow
    Matched RNA-seq lib: Rh4_dmso_6h_rz_T_H3YCHBGXX

ChIPSeq without enhancer pipeline (e.g. H3K27ac)

samples:
  RH4_D6_BRD4_024_C_HLFMLBGX3:
    Matched normal: RH4_Input_001_C_H5TLGBGXX
    Genome: hg38
    SampleFiles: Sample_RH4_D6_BRD4_024_C_HLFMLBGX3
    SpikeIn: 'yes'
    SpikeInGenome: dm6
    LibrarySize: 250
    EnhancePipe: 'no'
    PeakCalling: narrow
    Matched RNA-seq lib: RH4_D6_T_HVNVFBGX2

Output

BWA Output
  1. BAM: [sample_id].bam
  2. No duplicate BAM: [sample_id].dd.bam
  3. Bigwig/TDF: [sample_id].25.RPM.bw/tdf
  4. SpikeIn normalized bigwig/TDF: [sample_id].25.scaled.bw/tdf
MACS2 Output

Folder name:

MACS_OUT_q_[cutoff]: MACS_Out_q_0.01 or MACS_Out_q_0.05

  1. Peaks (no regions in blacklist): [sample_id]_peaks.narrowPeak.nobl.bed
  2. Peak annotation: [sample_id]_peaks.narrowPeak.nobl.bed.annotation.txt
  3. Peak annotation summary: [sample_id]_peaks.narrowPeak.nobl.bed.annotation.summary
  4. Peak summit file (narrow peaks only): [sample_id].narrow_summits.bed
  5. Peaks GREAT format: [sample_id]_peaks.narrowPeak.nobl.GREAT.bed
ROSE Output

Folder name:

MACS_OUT_q_[cutoff]/ROSE_out_[stitch distance]: MACS_Out_q_[cutoff]/ROSE_out_12500

  1. All enhancers: [sample_id]_peaks_AllEnhancers.table.txt
  2. Super enhancer BED file: [sample_id]_peaks_AllEnhancers.table.super.bed
  3. Regular enhancer BED file: [sample_id]_peaks_AllEnhancers.table.regular.bed
  4. Super enhancer summit file: [sample_id].super_summits.bed
  5. Regular enhancer summit file: [sample_id].regular_summits.bed
Motif Output

We use homer to predict motifs for:

  1. Peaks summit file (folder: motif_narrow)
  2. Super enhancer summit file (folder: motif_super)
  3. Regular enhancer summit file (folder: motif_regular)
EDEN output

We use EDEN to look for regulated genes in the same TAD for:

  1. Peaks summit file (folder: motif_narrow)
  2. Super enhancer summit file (folder: motif_super)
  3. Regular enhancer summit file (folder: motif_regular)

In the same folder, EDEN generates the following files:

  1. *_TPM[cutoff]_muti-genes.txt: Nearest genes for upstream, downstream and overlapped region of interest (TPM at least cutoff)
  2. *_TPM[cutoff]_max-genes.txt: Max expressed gene around region of interest (TPM at least cutoff)
  3. *_TPM[cutoff]_nearest-genes.txt: Max expressed gene around region of interest (TPM at least cuotff)
  4. *_TPM[cutoff]_300k.superloci.max.bed: Max expressed gene around stitched regions (TPM at least cutoff)
  5. *_TPM[cutoff]_300k.superloci.nearest.bed: Nearest gene around stitched regions (TPM at least cutoff)
Coltron output

Coltron output can be found in ROSE_out_12500/coltron

Dependency graph

DAG example with enhancer pipeline

alt tag

DAG example without enhancer pipeline

alt tag

RNAseq

RNAseq sample_sheet

Required columns
  1. Genome (hg19,hg38,mm10)
  2. SampleFiles
  3. SampleCaptures (polya, polya_stranded, ribozero, access)
  4. Xenograft (optional)
  5. XenograftGenome (optional)

RNAseq Example:

Regular RNAseq

samples:
  NCI0215tumor_T_C2V4TACXX:
    Genome: hg19
    SampleFiles: Sample_NCI0215tumor_T_C2V4TACXX
    SampleCaptures: ribozero

Xenograft RNAseq

samples:
  RH4_total_RNA_PA58_T_H37TWBGXC:
    Genome: hg19
    SampleFiles: Sample_RH4_total_RNA_PA58_T_H37TWBGXC
    Xenograft: 'yes'
    XenograftGenome: mm10
    SampleCaptures: ribozero

Output

STAR Output
  1. Gencode STAR BAM: STAR_hg19_gencode/[sample_id].star.genome.bam
  2. Gencode STAR BAM bigwig: STAR_hg19_gencode/[sample_id].star.genome.bw
  3. UCSC STAR BAM: STAR_hg19_gencode/[sample_id].star.genome.bam
  4. UCSC STAR BAM: STAR_hg19_gencode/[sample_id].star.genome.bw
RSEM Output
  1. Gencode RSME genes: RSEM_hg19_gencode/[sample_id].hg19.genocde.genes.results
  2. Gencode RSME isoforms: RSEM_hg19_gencode/[sample_id].hg19.genocde.isoforms.results
  3. Gencode RSME genes: RSEM_hg19_ucsc/[sample_id].hg19.ucsc.genes.results
  4. Gencode RSME isoforms: RSEM_hg19_ucsc/[sample_id].hg19.ucsc.genes.results
HLA Output
  1. Seq2HLA and HLAminer combined file: HLA/[sample_id].Calls.txt
Xenograft filtered FASTQ (for Xenograft samples)
  1. DATA/classification.tsv: Filtering summary
  2. DATA/[sample_id].filtered_R1.fastq.gz
  3. DATA/[sample_id].filtered_R2.fastq.gz

Dependency graph

DAG example

alt tag

DAG example for Xenograft samples

alt tag

Pacbio

Pacbio sample_sheet

columns
  1. Genome (hg19,hg38,mm10)
  2. SampleFiles
  3. Target. Format: chromosome:start-end. Empty if the library is whole transcriptome
  4. Primers. Primer fasta file if your primers are different from standard one.

Pacbio Example:

samples:
  RH30:
    Amplified_Sample_Library_Name: RH30
    SampleFiles: Sample_RH30/RH30.ccs.bam
    Target: chr5:177086905-177098144
	Genome: hg19
    Primers: /data/khanlab3/hsienchao/Adam/FGFR4/CS28601_R2/A01/primers.fasta

Output

  1. Sqanti classification file: [output dir]/[sample_id].sqanti_classification.filtered_lite_classification.txt
  2. Final GTF: [output dir]/[sample_id].sqanti_classification.filtered_lite.gtf

Dependency graph

DAG example with spike-In

alt tag

DNAseq

Not implemented yet.

Methylseq

Not implemented yet.

For questions or comments, please contact: Hsien-chao Chou (chouh@nih.gov)

Releases

No releases published

Packages

 
 
 

Languages

  • Python 65.1%
  • R 16.5%
  • Perl 14.7%
  • Shell 3.7%