From fastq to bam files

fastq to final valid pairs bam file - for the impatient!

If you just want to give it a shot and run all the alignment and filtering steps without going over the details, this shorter version is for you. Otherwise, move to the next section, fastq to final valid pairs bam file - step by step. The piped commands output two final bam files in different formats (position-sorted and read-name-sorted), a bam index file and a dup stats file. The example below is based on the human NSC dataset, replicate 1. You can find the fastq files and a link to the reference in the capture Data Sets section.

Command:

bwa mem -5SP -T0 -t<cores> <ref.fa> <capture.R1.fastq.gz> <capture.R2.fastq.gz>| \
pairtools parse --min-mapq 40 --walks-policy 5unique \
--max-inter-align-gap 30 --nproc-in <cores> --nproc-out <cores> --chroms-path <ref.genome> | \
pairtools sort --tmpdir=<full_path/to/tmpdir> --nproc <cores>|pairtools dedup --nproc-in <cores> \
--nproc-out <cores> --mark-dups --output-stats <stats.txt>|pairtools split --nproc-in <cores> \
--nproc-out <cores> --output-pairs <mapped.pairs> --output-sam -|samtools view -bS -@<cores> | \
samtools sort -@<cores> -o <mapped.PT.bam>;samtools index <mapped.PT.bam>;samtools view -@ <threads> -Shu -F 2048 <mapped.PT.bam>|samtools sort -n -T <path_to_temp_dir> --threads <threads> -o <chicago.bam> -

Example:

bwa mem -5SP -T0 -t16 hg38.fa NSC_rep1_R1.fastq.gz NSC_rep1_R2.fastq.gz| pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 --nproc-in 8 --nproc-out 8 --chroms-path hg38.genome | pairtools sort --tmpdir=/home/ubuntu/ebs/temp/ --nproc 16|pairtools dedup --nproc-in 8 --nproc-out 8 --mark-dups --output-stats stats.txt|pairtools split --nproc-in 8 --nproc-out 8 --output-pairs mapped.pairs --output-sam -|samtools view -bS -@16 | samtools sort -@16 -o NSC_rep1_PT.bam;samtools index NSC_rep1_PT.bam;samtools view -@ 16 -Shu -F 2048 NSC_rep1_PT.bam|samtools sort -n -T /home/ubuntu/ebs/temp/ --threads 16 -o NSC_rep1_chicago.bam -

The full pipeline, with 250M read pairs on an Ubuntu 18.04 machine with 16 CPUs, 1 TB storage and 64 GB memory takes about 8 hours to complete.
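Because the piped command runs for hours, it can be worth confirming up front that bwa, pairtools and samtools are all on your PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and not part of any of these tools.

```shell
# Print the name of every listed tool that is not found on PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the three tools used by the piped command before starting.
missing=$(missing_tools bwa pairtools samtools)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing >&2
fi
```

If nothing is printed, all three tools were found.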

fastq to final valid pairs bam file - step by step

Alignment

Now that you have a genome file, index file and a reference fasta file, you are all set to align your captured Micro-C® or Omni-C® library to the reference. Please note the specific settings that are needed to map mates independently and for optimal results with our proximity library reads.

To replicate the output files generated in this workflow, please use the NSC replicate 1 fastq files from our capture Data Sets section. For your convenience, the reference file is also included. If you are using your own fastq files, the results will be different from the example outputs displayed here.

Parameter	Alignment function
mem	Sets bwa to use the BWA-MEM algorithm, a fast and accurate alignment algorithm optimized for sequences in the range of 70 bp to 1 Mbp
-5	For split alignment, take the alignment with the smallest coordinate (5’ end) as primary. The mapq assignment of the primary alignment is calculated independent of the 3’ alignment
-S	Skip mate rescue
-P	Skip pairing; mate rescue is performed unless -S is also in use
-T0	The -T flag sets the minimum alignment score required for an alignment to be output (default 30). At this stage we want all the alignments to be recorded, so -T is set to 0; this will enable us to gather full stats for the library. At a later stage we will filter the alignments by mapping quality
-t	Number of threads - default is 1. Set the number of threads to no more than the number of cores that you have on your machine (if you don’t know the number of cores, use the command lscpu and multiply Thread(s) per core x Core(s) per socket x Socket(s))
.fasta or .fa	Path to a reference file, ending with .fa or .fasta (e.g. hg38.fasta)
.fastq or .fastq.gz	Path to two fastq files; path to the read 1 fastq file, followed by the read 2 fastq file (usually labeled as R1 and R2, respectively). Files can be compressed (.fastq.gz) or uncompressed (.fastq). In case your library sequence is divided into multiple fastq files, you can use process substitution, <( ), with the zcat (or cat) command (see example below)
-o	Sam file name to use for output results [stdout]. You can choose to skip the -o flag if you are piping the output to the next command using ‘|’
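The thread calculation described for -t can be sketched as a small helper. This assumes lscpu prints English field labels as quoted in the table; the function name count_threads and the sample values are made up for illustration.

```shell
# Multiply Thread(s) per core x Core(s) per socket x Socket(s)
# from lscpu-style text read on standard input.
count_threads() {
  awk -F: '
    /Thread\(s\) per core/ { t = $2 + 0 }
    /Core\(s\) per socket/ { c = $2 + 0 }
    /Socket\(s\)/          { s = $2 + 0 }
    END { print t * c * s }
  '
}

# On a Linux machine you would run: lscpu | count_threads
# Demo with sample lscpu-like lines (made-up values):
printf 'Thread(s) per core:  2\nCore(s) per socket:  8\nSocket(s):  1\n' | count_threads
```

On systems with GNU coreutils, the nproc command reports the number of available processing units directly and may be simpler.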

bwa mem will output a sam file that you can either pipe or save to a path using the -o option, as in the example below. Please note that version 0.7.17 or higher should be used; older versions do not support the -5 flag.

Command:

bwa mem -5SP -T0 -t<threads> <ref.fasta> <capture_R1.fastq> <capture_R2.fastq> -o <aligned.sam>

Example (one pair of fastq files):

bwa mem -5SP -T0 -t16 hg38.fasta NSC_rep1_R1.fastq.gz NSC_rep1_R2.fastq.gz -o NSC_rep1_aligned.sam

Example (multiple pairs of fastq files):

bwa mem -5SP -T0 -t16 hg38.fasta <(zcat file1.R1.fastq.gz file2.R1.fastq.gz file3.R1.fastq.gz) <(zcat file1.R2.fastq.gz file2.R2.fastq.gz file3.R2.fastq.gz) -o aligned.sam
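When several files are concatenated per mate, the R1 and R2 streams need to contain the same reads in the same order. A quick sanity check is to compare read counts before aligning. The sketch below assumes standard four-line fastq records; the function name count_reads and the file names in the comments are placeholders.

```shell
# Count reads in one or more fastq / fastq.gz files (4 lines per read).
# gzip -cdf decompresses .gz input and passes plain text through unchanged.
count_reads() {
  gzip -cdf "$@" | awk 'END { print NR / 4 }'
}

# Compare the two mates before alignment (file names are placeholders):
# r1=$(count_reads file1.R1.fastq.gz file2.R1.fastq.gz file3.R1.fastq.gz)
# r2=$(count_reads file1.R2.fastq.gz file2.R2.fastq.gz file3.R2.fastq.gz)
# [ "$r1" = "$r2" ] || echo "R1 and R2 read counts differ: $r1 vs $r2" >&2
```

Equal counts do not prove the files are listed in matching order, so keep the R1 and R2 file lists in the same sequence.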

Note

The bwa command will work on either fastq files or fastq.gz files
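The bwa version requirement (0.7.17 or higher for the -5 flag) can be checked with a small comparison helper. This is a sketch that assumes a sort command supporting the -V (version sort) option, as in GNU coreutils; the function name version_ge is hypothetical.

```shell
# Succeed when version $1 is at least version $2.
# A build suffix such as "-r1188" is stripped before comparing.
version_ge() {
  have=${1%%-*}
  want=${2%%-*}
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# bwa prints a "Version:" line in its usage message (on stderr):
# installed=$(bwa 2>&1 | awk '/^Version/ { print $2 }')
# version_ge "$installed" 0.7.17 || echo "bwa is too old for the -5 flag" >&2
```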

Recording valid ligation events

We use the parse module of the pairtools pipeline to find ligation junctions. When a ligation event is identified in the alignment file, the pairtools pipeline will record the outer-most (5’) aligned base pair and the strand of each one of the paired reads into a .pairsam file (pairsam format captures SAM entries together with the Hi-C pair information). In addition, it will assign a pair type for each event (e.g. if both reads aligned uniquely to only one region in the genome, the type UU (Unique-Unique) will be assigned to the pair). The following steps are necessary to identify the high-quality valid pairs over low quality events (e.g. due to low mapping quality):

pairtools parse options:

Parameter	Value	Function
min-mapq	40	Mapq threshold for defining an alignment as a multi-mapping alignment. Alignment with mapq < 40 will be marked as type M (multi)
walks-policy	5unique	Walks is the term used to describe multiple ligation events resulting in three alignments (instead of two) for a read pair. However, in cases where three alignments arise from a single ligation event, pairtools parse can rescue this event. Walks-policy defines the policy for reporting un-rescuable walks. The 5unique value is used to report the 5’-most unique alignment on each side, if present (one or both sides may map to different locations on the genome, producing more than two alignments per DNA molecule)
max-inter-align-gap	30	In cases where there is a gap between alignments, the algorithm will ignore the gap if it is 30 bp or smaller; gaps >30 bp will be marked as “null” alignments
nproc-in	integer, e.g. 16	Pairtools automatically “guesses” the format of the input file (compressed or not). When needed, the input is decompressed by bgzip/lz4c. The option nproc-in sets the number of processes used by the auto-guessed input decompressing command. If not specified, the default is 3
nproc-out	integer, e.g. 16	Pairtools automatically “guesses” the desired output file format (compressed or not compressed, based on file name extension). When needed, the output is compressed by bgzip/lz4c. The option nproc-out sets the number of processes used by the auto-guessed output compressing command. If not specified, the default is 8
chroms-path		Defines the path to your .genome file (e.g. hg38.genome)
*.sam		Defines the path to the sam file used as an input. If you are piping the input (stdin), skip this option
*.pairsam		Name of the pairsam file for writing output results. You can choose to skip it and pipe the output directly to the next command (pairtools sort)

pairtools parse command example for finding ligation events:

Command:

pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 --nproc-in <cores> \
--nproc-out <cores> --chroms-path <ref.genome> <aligned.sam> > <parsed.pairsam>

Example:

pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 --nproc-in 8 --nproc-out 8 --chroms-path hg38.genome NSC_rep1_aligned.sam > NSC_rep1_parsed.pairsam

At the parsing step, pairs will be flipped such that, regardless of which side came from read 1 and which from read 2, pairs are always recorded with the first side of the pair having the lower genomic coordinates.
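To illustrate the flipping rule, here is a simplified sketch and not the pairtools implementation. It assumes chromosome order is taken from the first column of the .genome file and uses a hypothetical six-column input layout (chrom1 pos1 chrom2 pos2 strand1 strand2); the function name flip_pairs is also made up.

```shell
# Reorder each pair so the side with the lower genomic coordinate comes first.
# $1: .genome file (chromosome name in column 1, listed in the desired order)
# stdin: chrom1 pos1 chrom2 pos2 strand1 strand2 (hypothetical layout)
flip_pairs() {
  awk -v OFS='\t' '
    NR == FNR { rank[$1] = FNR; next }   # first file: remember chromosome order
    {
      swap = (rank[$1] > rank[$3]) || ($1 == $3 && $2 + 0 > $4 + 0)
      if (swap) print $3, $4, $1, $2, $6, $5
      else      print $1, $2, $3, $4, $5, $6
    }
  ' "$1" -
}

# Usage (placeholder file names):
# flip_pairs hg38.genome < pairs_to_flip.txt
```

Strands travel with their side, so a flipped pair swaps both the coordinates and the strand columns.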

Sorting the pairsam file

The parsed pairs are then sorted using pairtools sort.

pairtools sort options:

Parameter	Function
--tmpdir	Provides a full path to a temp directory. A good rule of thumb is to have 3x the size of your fastq.gz files available for this directory. Using a temp directory will help avoid memory issues
--nproc	Number of processes to split the sorting work
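The 3x rule of thumb for the temp directory can be turned into a quick estimate. This is a rough sketch; the function name required_tmp_bytes is hypothetical and the paths in the comments are placeholders.

```shell
# Print 3x the combined size (in bytes) of the given fastq.gz files.
required_tmp_bytes() {
  total=0
  for f in "$@"; do
    size=$(wc -c < "$f")
    total=$((total + size))
  done
  echo $((total * 3))
}

# Compare with the free space in the temp directory (placeholder paths):
# need=$(required_tmp_bytes NSC_rep1_R1.fastq.gz NSC_rep1_R2.fastq.gz)
# free_kb=$(df -Pk /home/ubuntu/ebs/temp/ | awk 'NR == 2 { print $4 }')
# [ $((free_kb * 1024)) -ge "$need" ] || echo "temp directory may be too small" >&2
```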

Command:

pairtools sort --nproc <cores> --tmpdir=<path/to/tmpdir> <parsed.pairsam> > <sorted.pairsam>

Example:

pairtools sort --nproc 16 --tmpdir=/home/ubuntu/ebs/temp/ NSC_rep1_parsed.pairsam > NSC_rep1_sorted.pairsam

Important!

Please note that an absolute path for the temp directory is required for pairtools sort. A path such as ~/ebs/temp/ or ./temp/ will not work; use something like /home/user/ebs/temp/ instead.
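A small guard can catch a relative temp path before the sort starts. This is a minimal sketch; the function name is_absolute_path is hypothetical.

```shell
# Succeed only for absolute paths (those starting with "/").
is_absolute_path() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Example check before calling pairtools sort:
tmpdir=/home/user/ebs/temp/
is_absolute_path "$tmpdir" || echo "use an absolute path for --tmpdir" >&2
```

To convert an existing relative directory into an absolute path, something like `$(cd ./temp && pwd)` can be used.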

Removing PCR duplicates

pairtools dedup detects molecules that could be formed via PCR duplication and tags them as “DD” pair type. These pairs should be excluded from downstream analysis. Use the pairtools dedup command with the --output-stats option to save the dup stats into a text file.

pairtools dedup options:

Parameter	Function
--mark-dups	If specified, duplicate pairs are marked as DD in “pair_type” and as a duplicate in the sam entries
--output-stats	Creates an output file for duplicate statistics. Please note that if a file with the same name already exists, it will be opened in append mode

Command:

pairtools dedup --nproc-in <cores> --nproc-out <cores> --mark-dups --output-stats <stats.txt> \
--output <dedup.pairsam> <sorted.pairsam>

Example:

pairtools dedup --nproc-in 8 --nproc-out 8 --mark-dups --output-stats stats.txt --output NSC_rep1_dedup.pairsam NSC_rep1_sorted.pairsam
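Once the stats file exists, a duplication rate can be derived from it. The sketch below assumes the file holds tab-separated key/value lines that include the keys total_mapped and total_dups; verify this against your own stats.txt. The function name dup_rate is hypothetical.

```shell
# Print the percentage of mapped pairs flagged as duplicates.
# Assumes tab-separated "key<TAB>value" lines; if a key repeats
# (e.g. after an append), the last value wins.
dup_rate() {
  awk -F'\t' '
    $1 == "total_mapped" { mapped = $2 }
    $1 == "total_dups"   { dups = $2 }
    END {
      if (mapped > 0) printf "%.2f\n", 100 * dups / mapped
      else            print "NA"
    }
  ' "$1"
}

# Usage: dup_rate stats.txt
```

Because the stats file is opened in append mode when it already exists, consider removing or renaming an old stats.txt before rerunning dedup.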

Generating .pairs and bam files

The pairtools split command is used to split the final .pairsam file into two files: a .sam (or .bam) file and a .pairs file. Note that the .pairsam format has two extra columns containing the alignments from which the Omni-C or Micro-C pair was extracted (these two columns are not included in .pairs files).

pairtools split options:

Parameter	Function
--output-pairs	Output pairs file. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. If you wish to pipe the command and output the pairs file to stdout, use `-` instead of a file name
--output-sam	Output sam file. If the file name extension is .bam, the output will be written in bam format. If you wish to pipe the command, use - instead of a file name. Please note that, in this case, the sam format will be used (and can later be converted to a bam file with the command samtools view -bS -@16 -o temp.bam)

Command:

pairtools split --nproc-in <cores> --nproc-out <cores> --output-pairs <mapped.pairs> \
--output-sam <unsorted.bam> <dedup.pairsam>

Example:

pairtools split --nproc-in 8 --nproc-out 8 --output-pairs NSC_rep1_mapped.pairs --output-sam NSC_rep1_unsorted.bam NSC_rep1_dedup.pairsam

The .pairs file can be used for generating a contact matrix.
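Before building a contact matrix, it can be useful to tally the pair types present in the .pairs file. The sketch below assumes an uncompressed, tab-separated file whose header lines start with "#" and whose eighth column is pair_type; check the "#columns:" header line of your own file. The function name count_pair_types is hypothetical.

```shell
# Count pair types in a .pairs file: skip "#" header lines, tally column 8.
count_pair_types() {
  awk -F'\t' '
    /^#/ { next }
    { n[$8]++ }
    END { for (t in n) printf "%s\t%d\n", t, n[t] }
  ' "$1" | sort
}

# Usage: count_pair_types NSC_rep1_mapped.pairs
```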

Generating the dedup, sorted bam file

For downstream steps, the bam file should be sorted using the samtools sort command.

samtools sort options:

Parameter	Function
-@	Number of threads to use
-o	File name. Write final output to FILE rather than standard output
-T	Path to temp file. Using a temp file will help avoid memory issues

Command:

samtools sort -@<threads> -T <path/to/tmpdir/> -o <mapped.PT.bam> <unsorted.bam>

Example:

samtools sort -@16 -T /home/ubuntu/ebs/temp/ -o NSC_rep1_PT.bam NSC_rep1_unsorted.bam

For future steps, an index (.bai) of the bam file is also needed. Index the bam file:

Command:

samtools index <mapped.PT.bam>

Example:

samtools index NSC_rep1_PT.bam

The above steps result in multiple intermediate files. To simplify the process and avoid intermediate files, you can pipe the steps.

The *PT.bam (PT stands for pairtools) is a key bam file that will be used for library QC, generating contact maps and more. Additional processing of the bam file will be required for interaction calling.

CHiCAGO compatible bam file

As will be discussed in the interaction calling section, we will use the CHiCAGO tool for calling P-E interactions. CHiCAGO is designed to work with bam files produced by the HiCUP pipeline. To match the format of our bam file to that expected by CHiCAGO, we will remove alignments not used by CHiCAGO (e.g. supplementary alignments) from the bam file and change the sorting from position-based to read-name-based.

Samtools parameters for generating a CHiCAGO compatible bam format:

Samtools Utility	Parameter	Function
view	-@	Number of threads
view	-S	Ignored (input format is auto-detected)
view	-h	Include header in SAM output
view	-u	Output uncompressed bam file. Since this is not a final output, but piped to another samtools step the `u` option will save time
view	-F 2048	Remove supplementary alignments
sort	-T	Full path to a temp directory, or a prefix for temp files. Using temp files on disk will help avoid memory issues
sort	-n	Sort by read name
sort	-o	Output file
sort	--threads	Number of threads
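The value 2048 used with -F is the SAM FLAG bit (0x800) that marks supplementary alignments. The following sketch shows the bitwise test behind that filter; the function name is_supplementary and the example flag values are chosen for illustration only.

```shell
# Succeed when SAM FLAG $1 has the supplementary-alignment bit (2048, 0x800) set.
is_supplementary() {
  [ $(( $1 & 2048 )) -ne 0 ]
}

# samtools view -F 2048 drops the records for which this test succeeds.
for flag in 99 2113; do
  if is_supplementary "$flag"; then
    echo "$flag: supplementary (removed by -F 2048)"
  else
    echo "$flag: kept"
  fi
done
```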

Command:

samtools view -@ <threads> -Shu -F 2048 <input bam file>|samtools sort -n -T <path to temp dir> --threads <threads> -o <output bam file> -

Example:

samtools view -@ 16 -Shu -F 2048 NSC_rep1_PT.bam|samtools sort -n -T /home/ubuntu/ebs/temp/temp.bam --threads 16 -o NSC_rep1_chicago.bam -