DeepVariant Pipeline

Note: There is no GPU installed in FPGA server, and thus the following steps are using CPU DeepVariant. Some flags might be required to use GPU.

Prepare DeepVariant
Pull image and create a container

BIN_VERSION="1.3.0"
sudo docker pull google/deepvariant:"${BIN_VERSION}"

OUTPUT_DIR="${PWD}/DV_output"
INPUT_DIR="${PWD}/align"
CONTAINER_NAME="deepvariant0"
sudo docker run --name ${CONTAINER_NAME} -v "${INPUT_DIR}":"/input" -v "${OUTPUT_DIR}:/output" -it google/deepvariant:"${BIN_VERSION}" /bin/bash

Download dataset

https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-training-case-study.md

Fetch fastq from bam file

samtools fastq -@ 16 ./BGISEQ_PE100_NA12878.sorted.chr21.bam -o BGISEQ_PE100_NA12878.sorted.chr21.fataq
samtools faidx ./ucsc_hg19.fa -r chr21 -o ./ucsc_hg19.chr21.fa

Align

3.1 BWA:

 bwa index ./ucsc_hg19.chr21.fa
 bwa mem -t 16 -k 21 ./ucsc_hg19.chr21.fa ./BGISEQ_PE100_NA12878.sorted.chr21.fastq > ${INPUT_DIR}/aln_bwa.sam

3.2 FPGA:

 fpga_alg_dv -tables ./FOLDER_OF_TABLES/ -r ./BGISEQ_PE100_NA12878.sorted.chr21.fastq -o ${INPUT_DIR}/aln_fpga.sam

-tables: The folder storing tables. For the demo, maybe it will be easier to use pre-built tables

. Maybe we can just make user chooses a table from a drop-down list.

-r: input fastq file

-o: output sam file.

Sort and index

FILENAME="aln_bwa"
samtools view -@ 16 -b -S ./align/${FILENAME}.sam > ./align/${FILENAME}.bam
samtools sort -@ 16 ./align/${FILENAME}.bam -o ./align/${FILENAME}.bam
samtools index -@ 16 ./align/${FILENAME}.bam

Call DeepVariant

sudo docker exec ${CONTAINER_NAME} /opt/deepvariant/bin/run_deepvariant \
  --model_type="WGS" --ref=/input/ucsc_hg19.chr21.fa \
  --reads=/input/${FILENAME}.bam \
  --output_vcf=/output/output_${FILENAME}vcf.gz \
  --output_gvcf=/output/output_${FILENAME}.g.vcf.gz \
  --num_shards=15 \
  --logging_dir=/output/logs_${FILENAME} \
  --dry_run=false

Check accuracy

Some statistics results are visualized in ${OUTPUT_DIR}/output_$FILENAME$_visual_report.html There is a script hap.py supported by DeepVariant to evaluation the accuracy. But I never used it and just wrote a small script to do the similar thing. Maybe we can have a better way to visualize the results and accuracy.

Tn93 Pipeline

Following Niema's guide.

Align with FPGA

fpga_alg_covid -tables ./FOLDER_OF_TABLES/ -r ./input.fataq -o output.sam

Primer Trim with FPGA
```
fpga_ivar_trim -q 20 -w 4 -m 30 -x 5 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam 
```
-q: Minimum quality threshold for sliding window to pass, same as iVar.

-w: Width of sliding window, same as iVar.

-m: Minimum length of read to retain after trimming, same as iVar

-x : Primer position offset. Reads that occur at the specified offset positions relative to primer positions will also be trimmed.

-r: input fastq file

-o: output sam file.
TN93 with FPGA
```
fpga_tn93 -input ./input.aln -o tn93.dist
```
-r: input fasta file

-o: output sam file.

Unifrac Pipeline

Dataset

I am using ref107 as the reference genome database. I generated the input fastq files with CAMISIM with the CAMISIM_config/config.ini. As each fastq file contains sequence of all reference genome and making the unifrac distances are almost all 0, I randomly discarded 54 reference for each fastq file. The new distribution is distribution.tsv.

You may find what I am using here: https://drive.google.com/drive/folders/1KrYfJhjRgigpbYAMeWP4u8G5eSG3FtZv?usp=sharing
Align with FPGA
```
fpga_alg_mb -k 16 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam
```
-k: Same as bowtie2. -k 16 is the default setting of SHOGUN protocol.

Bowtie 2 searches for up to N distinct, valid alignments for each read, where N equals the integer specified with the -k parameter. That is, if -k 2 is specified, Bowtie 2 will search for at most 2 distinct alignments.

Bowtie 2 does not "find" alignments in any specific order, so for reads that have more than N distinct, valid alignments, Bowtie 2 does not guarantee that the N alignments reported are the best possible in terms of alignment score. Still, this mode can be effective and fast in situations where the user cares more about whether a read aligns (or aligns a certain number of times) than where exactly it originated.

I used the following bowtie2 command as benchmark following https://github.com/qiyunzhu/woltka/blob/26e1670ec7ffb42245a74e11c3a315ec3f3267bd/doc/align.md?plain=1#L90
```
bowtie2 -p 16  --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" -x db -U input.fastq | pigz -p 16 -c > output.sam.gz
```

Call woltka for OGU analysis

woltka classify --uniq -i bt2out -o table.biom

Call Unifrac

ssu -i ogu.biom -t fasttree.nwk -o unifrac.txt -m unweighted

GITLAB

0 / FPGA Demo

FPGA Command #33 by 0

DeepVariant Pipeline

Tn93 Pipeline

Unifrac Pipeline