paleomix, Next-Generation Sequencing wrapper

this framework is open-source and available on GitHub and wraps all steps from fastq to bam files. Actually, this tool can do much more but the rest is out of scope. Its major drawback is that it is dedicated to one machine. For clusters, you are then limited to one node since memory are not shared by default. Actually, not entirely true since independent tasks can be spawn on a separate machine. Full documentation available here

check if paleomix is available

paleomix

In case it is not, your are certainly not using the singularity container, see instructions in the set-up

test your install

fetch the example, reference is the human mitochondrial genome


mkdir -p ~/paleomix/example
cp -r /scratch/users/aginolhac/chip-seq/bam_pipeline_example/000* ~/paleomix/example
cd ~/paleomix/example

run the example, start by a dry-run, adjust the number of threads accordingly.

paleomix bam_pipeline run --bwa-max-threads=1 --max-threads=8 --dry-run 000_makefile.yaml

If all fine, re-rerun the command without the --dry-run option

Of note, calling mapDamage and GATK were disabled to limit the computation time (~ 35 min when included). Anyway, this tool is not used for ChIP-seq analysis.

Generate a makefile

Trimming, mapping imply a lot of steps and it is hard to be sure that everything goes well. Paleomix works in temporary folders, check the data produced and then copy back files that are complete. Plus, you want to test different parameters, add a new reference without having to redo earlier steps while being sure that all files are up-to-date. This goes through a YAML makefile. The syntax is pretty straight-forward. What matters is, that you use SPACES and not TABS.

Create a generic makefile (extension, yml or yaml to get syntax highlights)

cd ~/chip-seq
paleomix bam_pipeline mkfile > mouse.yaml

Edit the makefile

using your favorite text editor (such as nano), edit the mouse.yaml. For example vim mouse.yaml or kate or nano.

Options

  • for the compression, change the default behavior from bz2 to gz
  CompressionFormat: gz
  • for duplicates, change the default behavior from filter to mark
  PCRDuplicates: mark

Features

Under the Features section, update the featurse that need to be performed. Change yes/no to match the following:

  Features:
    RawBAM: yes         # Generate BAM from the raw libraries (no indel realignment)
                        #   Location: {Destination}/{Target}.{Genome}.bam
    RealignedBAM: no    # Generate indel-realigned BAM using the GATK Indel realigner
                        #   Location: {Destination}/{Target}.{Genome}.realigned.bam
    mapDamage: no       # Generate mapDamage plot for each (unrealigned) library
                        #   Location: {Destination}/{Target}.{Genome}.mapDamage/{Library}/
    Coverage: yes       # Generate coverage information for the raw BAM (wo/ indel realignment)
                        #   Location: {Destination}/{Target}.{Genome}.coverage
    Depths: no          # Generate histogram of number of sites with a given read-depth
                        #   Location: {Destination}/{Target}.{Genome}.depths
    Summary: yes        # Generate summary table for each target
                        #   Location: {Destination}/{Target}.summary
    DuplicateHist: no   # Generate histogram of PCR duplicates, for use with PreSeq
                        #   Location: {Destination}/{Target}.{Genome}.duphist/{Library}/

In detail, the RealignedBAM are important for calling variants, we only need to RawBAM. Moreover, the Depths also help to define which upper limit could be used for variant calling. This is not in the scope of ChIP-seq analysis. Same for mapDamage, only relevant for ancient DNA.

Prefixes

These are the references to align read to. You could notice that we are going to use only one chromosome to save computational time.

Prefixes:
  mouse_19:
    Path: /scratch/users/aginolhac/chip-seq/references/chr19.fasta

Samples

enter at the end of the makefile, the following lines. Again, do use spaces and not tabs for the indentation. For those who are lazy and use copy/paste in vim use the trick to :set paste to avoid extra spaces, comment hashes etc to be automatically added.

The descriptions of the different hierachical names can be read here

TC1-I-A-D3:
  TC1-I-A-D3:
    TC1-I-A-D3:
      "14s006680-1-1": fastq/C53CYACXX_TC1-I-A-D3_14s006682-1-1_Sinkkonen_lane114s006682_sequence.txt.gz

TC1-H3K4-A-D3:
  TC1-H3K4-A-D3:
    TC1-H3K4-A-D3:
      "14s006647-1-1": fastq/C51C3ACXX_TC1-H3K4-A-D3_14s006647-1-1_Sinkkonen_lane514s006647_sequence.txt.gz

TC1-I-ST2-D0:
  TC1-I-ST2-D0:
    TC1-I-ST2-D0:
      "14s006677-1-1": fastq/C51C3ACXX_TC1-I-ST2-D0_14s006677-1-1_Sinkkonen_lane814s006677_sequence.txt.gz

TC1-H3K4-ST2-D0:
  TC1-H3K4-ST2-D0:
    TC1-H3K4-ST2-D0:
      "14s006644": fastq/C51C3ACXX_TC1-H3K4-ST2-D0_14s006644-1-1_Sinkkonen_lane514s006644_sequence.txt.gz

Perform the trimming / mapping

First use the option --dry-run to spot mistakes.

Please adapt the --max-threads option to the #cpus actually booked

paleomix bam_pipeline run --bwa-max-threads=2 --adapterremoval-max-threads=2 --max-threads=8 --dry-run mouse.yaml

when all green lights are on, remove the dry-run and perform the mapping.

correct the makefile

the trimming should go well, but an error arises because you cannot write to the reference folder.

  • symbolic links in your own folder, removing the file that should be created if the ref is correct
mkdir references
ln -s /scratch/users/aginolhac/chip-seq/references/chr19.fasta references/
  • correct the makefile, so the Path is relative now
    Path: references/chr19.fasta
  • enjoy paleomix by just re-running it, only necessary steps are done (validatating the ref, indexing it and mappings)
paleomix bam_pipeline run --bwa-max-threads=2 --adapterremoval-max-threads=2 --max-threads=8 mouse.yaml

the whole process should takes ~ 25 minutes with 8 cores

check trimming

First of all, check using fastqc that the trimming did remove the adapters that were contaminated the reads.

Again, with parallel specify the max number of jobs with the option -j to fit the #cpus booked

find . -name "reads.truncated.gz" | parallel -j 8 "fastqc {}" &

using the character & tells the shell that we want the processes to run in the background. Meaning that you can still run more things while the 4 tasks are running. Check them using htop.

check especially, the input for ST2, day0 before and after trimming. Did it solve the issue with adapters?

truncated read files are all named the same. And they are located into a deeper folder structure. To rename them all with their ids and move them at the ~/chip-seq/ level, you can run the following:

find -name "reads.truncated.gz" |  xargs ls -1 | awk '{split($1, path, "/");  system("mv "$0 " "path[3] "_" path[6])}'

filter for unique reads

Uniqueness of reads refers to mappability. The fewer locations a read has in a genome, the higher is mappability will be. A common filter is to use 30 as a threshold for filtering reads. Filter them in parallel

parallel "samtools view -b -q 30 {} > {.}.q30.bam" ::: *.bam

Since we are using only the chr19 for this tutorial, do you think the mappability score is correct? Why?

filter for duplicates?

Duplication is a bias that comes from PCR amplification. Reads then stack at the same location and create artificial high depth of coverage. Duplicates have an unclear definition in a mapped file. Usually, single-end reads that are mapped at the same 5' end are considered as duplicates. External coordinates are used for paired-end reads.
For regular NGS, filtering for duplicates is mandatory. However, for ChIP-seq since the reads are, by nature, clustered at one location this is not recommended. If duplication is observed at the reads level, such as in fastqc output, then filtering may be necessary. Marking duplicates allow keeping track of them without losing them.