FPGA Command #33 by 0
DeepVariant Pipeline
Note: There is no GPU installed in FPGA server, and thus the following steps are using CPU DeepVariant. Some flags might be required to use GPU.
Prepare DeepVariant
Pull image and create a containerBIN_VERSION="1.3.0" sudo docker pull google/deepvariant:"${BIN_VERSION}" OUTPUT_DIR="${PWD}/DV_output" INPUT_DIR="${PWD}/align" CONTAINER_NAME="deepvariant0" sudo docker run --name ${CONTAINER_NAME} -v "${INPUT_DIR}":"/input" -v "${OUTPUT_DIR}:/output" -it google/deepvariant:"${BIN_VERSION}" /bin/bash
Download dataset
https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-training-case-study.md
Fetch fastq from bam file
samtools fastq -@ 16 ./BGISEQ_PE100_NA12878.sorted.chr21.bam -o BGISEQ_PE100_NA12878.sorted.chr21.fataq samtools faidx ./ucsc_hg19.fa -r chr21 -o ./ucsc_hg19.chr21.fa
Align
3.1 BWA:
bwa index ./ucsc_hg19.chr21.fa bwa mem -t 16 -k 21 ./ucsc_hg19.chr21.fa ./BGISEQ_PE100_NA12878.sorted.chr21.fastq > ${INPUT_DIR}/aln_bwa.sam
3.2 FPGA:
fpga_alg_dv -tables ./FOLDER_OF_TABLES/ -r ./BGISEQ_PE100_NA12878.sorted.chr21.fastq -o ${INPUT_DIR}/aln_fpga.sam
-tables
: The folder storing tables. For the demo, maybe it will be easier to use pre-built tables. Maybe we can just make user chooses a table from a drop-down list.
-r
: input fastq file-o
: output sam file.Sort and index
FILENAME="aln_bwa" samtools view -@ 16 -b -S ./align/${FILENAME}.sam > ./align/${FILENAME}.bam samtools sort -@ 16 ./align/${FILENAME}.bam -o ./align/${FILENAME}.bam samtools index -@ 16 ./align/${FILENAME}.bam
Call DeepVariant
sudo docker exec ${CONTAINER_NAME} /opt/deepvariant/bin/run_deepvariant \
--model_type="WGS" --ref=/input/ucsc_hg19.chr21.fa \
--reads=/input/${FILENAME}.bam \
--output_vcf=/output/output_${FILENAME}vcf.gz \
--output_gvcf=/output/output_${FILENAME}.g.vcf.gz \
--num_shards=15 \
--logging_dir=/output/logs_${FILENAME} \
--dry_run=false
Check accuracy
Some statistics results are visualized in
${OUTPUT_DIR}/output_$FILENAME$_visual_report.html
There is a scripthap.py
supported by DeepVariant to evaluation the accuracy. But I never used it and just wrote a small script to do the similar thing. Maybe we can have a better way to visualize the results and accuracy.
Tn93 Pipeline
Following Niema's guide.
Align with FPGA
fpga_alg_covid -tables ./FOLDER_OF_TABLES/ -r ./input.fataq -o output.sam
Primer Trim with FPGA
fpga_ivar_trim -q 20 -w 4 -m 30 -x 5 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam
-q
: Minimum quality threshold for sliding window to pass, same as iVar.-w
: Width of sliding window, same as iVar.-m
: Minimum length of read to retain after trimming, same as iVar-x
: Primer position offset. Reads that occur at the specified offset positions relative to primer positions will also be trimmed.-r
: input fastq file-o
: output sam file.TN93 with FPGA
fpga_tn93 -input ./input.aln -o tn93.dist
-r
: input fasta file-o
: output sam file.
Unifrac Pipeline
Dataset
I am using ref107 as the reference genome database. I generated the input fastq files with CAMISIM with the
CAMISIM_config/config.ini
. As each fastq file contains sequence of all reference genome and making the unifrac distances are almost all 0, I randomly discarded 54 reference for each fastq file. The new distribution isdistribution.tsv
.You may find what I am using here: https://drive.google.com/drive/folders/1KrYfJhjRgigpbYAMeWP4u8G5eSG3FtZv?usp=sharing
Align with FPGA
fpga_alg_mb -k 16 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam
-k
: Same as bowtie2.-k 16
is the default setting of SHOGUN protocol.Bowtie 2 searches for up to N distinct, valid alignments for each read, where N equals the integer specified with the
-k
parameter. That is, if-k 2
is specified, Bowtie 2 will search for at most 2 distinct alignments.Bowtie 2 does not "find" alignments in any specific order, so for reads that have more than N distinct, valid alignments, Bowtie 2 does not guarantee that the N alignments reported are the best possible in terms of alignment score. Still, this mode can be effective and fast in situations where the user cares more about whether a read aligns (or aligns a certain number of times) than where exactly it originated.
I used the following bowtie2 command as benchmark following https://github.com/qiyunzhu/woltka/blob/26e1670ec7ffb42245a74e11c3a315ec3f3267bd/doc/align.md?plain=1#L90
bowtie2 -p 16 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" -x db -U input.fastq | pigz -p 16 -c > output.sam.gz
Call woltka for OGU analysis
woltka classify --uniq -i bt2out -o table.biom
Call Unifrac
ssu -i ogu.biom -t fasttree.nwk -o unifrac.txt -m unweighted