agalitsyna/sc_dros


sc_dros is illustrative code for the paper in which we unravel the story of Drosophila chromatin in single cells:

Ulianov, Sergey V., Vlada V. Zakharova, Aleksandra A. Galitsyna, Pavel I. Kos, Kirill E. Polovnikov, Ilya M. Flyamer, Elena A. Mikhaleva et al. "Order and stochasticity in the folding of individual Drosophila genomes." Nature Communications 12, no. 1 (2021): 1-17. DOI: [10.1038/s41467-020-20292-z](https://doi.org/10.1038/s41467-020-20292-z)

This code focuses on the initial steps of data processing.

By default, it is intended to work with the Drosophila data from the paper (see the notes below, however). It may be adapted to other datasets (I have already tested it on the Flyamer et al. 2017 and Gassler et al. 2017 data). If you have questions, don't hesitate to contact the maintainer: Aleksandra Galitsyna (Aleksandra.Galitsyna at skoltech.ru).

Preparing the environment

We recommend setting up a conda environment named sc_dros with all the requirements.

List of dependencies:

Preparing the dataset

The files needed:

  • Files with information about the genome:

    • Reduced chromosome sizes,

      Path: data/GENOME

    • FASTA file with genome and bwa index,

      Path: data/GENOME

      Run: bash script scripts/00_prepare_data/001_prepare_genome.sh

    • FASTQ files with snHi-C data,

      Path: data/FASTQ

      Run: Python script scripts/00_prepare_data/002_download_data.py

      The result of this step should be a set of FASTQ files with names formatted as follows:

      {Cell_name}_{Replicate}_R1.fastq.gz
      {Cell_name}_{Replicate}_R2.fastq.gz
      

      Note 1. I recommend a specialized version of GEOparse with a parallel download option: https://github.com/agalitsyna/GEOparse.git. To install it, run:

      git clone https://github.com/agalitsyna/GEOparse.git
      cd GEOparse
      pip install -e .

      Note 2. GEOparse works for many GEO IDs but may fail for some SRA entries. This is because GEOparse downloads data from the SRA FTP, which was decommissioned at the end of 2019, so some files might not be available through GEOparse.

      Note 3. If GEOparse does not install in your environment, or the SRA files fail to download from FTP, you can download the data manually from GEO. The FASTQ files can be found in GEO entry GSE131811.
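Before starting the mapping, it can help to check that the downloaded files follow the naming convention described above and that every cell and replicate has both reads. The following is a minimal sketch, not part of the repository: the function name pair_fastq_files is hypothetical, and it assumes that the replicate label contains no underscore (the cell name may).

```python
import re
from collections import defaultdict

# Naming convention from this README:
#   {Cell_name}_{Replicate}_R1.fastq.gz and {Cell_name}_{Replicate}_R2.fastq.gz
# Assumption: the replicate label has no underscore; the cell name may have some.
FASTQ_PATTERN = re.compile(
    r"^(?P<cell>.+)_(?P<replicate>[^_]+)_R(?P<read>[12])\.fastq\.gz$"
)


def pair_fastq_files(filenames):
    """Group FASTQ file names by (cell, replicate).

    Returns a tuple (pairs, incomplete, unmatched):
      pairs      - dict mapping (cell, replicate) to (R1 name, R2 name)
      incomplete - sorted list of (cell, replicate) keys missing one read
      unmatched  - names that do not follow the convention
    """
    groups = defaultdict(dict)
    unmatched = []
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            unmatched.append(name)
            continue
        key = (match.group("cell"), match.group("replicate"))
        groups[key][match.group("read")] = name

    pairs = {
        key: (reads["1"], reads["2"])
        for key, reads in groups.items()
        if len(reads) == 2
    }
    incomplete = sorted(key for key, reads in groups.items() if len(reads) != 2)
    return pairs, incomplete, unmatched
```

To check a real download, pass the file names from data/FASTQ, for example `pair_fastq_files(p.name for p in pathlib.Path("data/FASTQ").iterdir())`, and inspect the incomplete and unmatched lists before running the mapping scripts.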

Raw data processing

Raw data processing includes the steps from raw FASTQ files to processed COOL files for both snHi-C and bulk Hi-C.

  • Read mapping and parsing of BAM files into PAIR files
bash scripts/01_data_mapping/010_run_mapping.sh
bash scripts/01_data_mapping/011_parse_population.sh
  • PAIR files to COOL conversion
bash 012_run_pairsam2cooler.sh

This should result in a set of COOL files in data/COOL/

Note. If there are problems with running the scripts or accessing the data, the Drosophila snHi-C COOL files can alternatively be found as supplementary files in GEO entry GSE131811.

TAD calling

TAD calling is performed with the lavaburst package. It includes scanning a wide range of values of the gamma parameter and selecting a single set of TADs and sub-TADs.

Working directory: scripts/02_tad_calling/

Script example: 021_run_FindOptimalGamma.sh
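The idea of scanning gamma and then keeping a single segmentation can be illustrated with a small sketch. This README does not state the selection criterion, so the sketch assumes a hypothetical one: keep the gamma whose segments have a mean size closest to a target size in bins. It does not call lavaburst and is not the code from 021_run_FindOptimalGamma.sh; the names select_gamma and mean_segment_size are illustrative, and the segmentations are assumed to be already computed for each gamma.

```python
def mean_segment_size(segments):
    """Mean length in bins of half-open segments given as (start, end) pairs."""
    sizes = [end - start for start, end in segments]
    return sum(sizes) / len(sizes)


def select_gamma(segmentations, target_size):
    """Pick the gamma whose segmentation has mean segment size closest to target_size.

    segmentations maps each scanned gamma value to a list of (start, end) pairs.
    Assumption (hypothetical criterion): closeness of the mean segment size to
    target_size decides the choice; ties go to the smaller gamma.
    Gamma values that produced no segments are skipped.
    """
    candidates = {g: segs for g, segs in segmentations.items() if segs}
    if not candidates:
        raise ValueError("no gamma value produced any segments")
    return min(
        candidates,
        key=lambda g: (abs(mean_segment_size(candidates[g]) - target_size), g),
    )


# Toy scan: three gamma values with made-up segment boundaries (in bins).
toy_scan = {
    0.5: [(0, 10), (10, 30)],
    1.0: [(0, 5), (5, 10), (10, 15)],
    2.0: [(0, 2), (2, 4)],
}
best_gamma = select_gamma(toy_scan, target_size=6)
```

In this toy scan the mean segment sizes are 15, 5 and 2 bins, so a target of 6 bins selects gamma 1.0. A second pass with a smaller target would give a finer segmentation, in the spirit of the sub-TADs mentioned above.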

Remarks

This repository is maintained to demonstrate the initial steps of snHi-C data processing.
