FPGA Command #33 by 21ec845a004c1ca2b8fe4bb6170c7bf7?s=40&d=mm 0

FPGA_command.md
Raw

DeepVariant Pipeline

Note: There is no GPU installed in FPGA server, and thus the following steps are using CPU DeepVariant. Some flags might be required to use GPU.

  1. Prepare DeepVariant
    Pull image and create a container

    BIN_VERSION="1.3.0"
    sudo docker pull google/deepvariant:"${BIN_VERSION}"
    
    OUTPUT_DIR="${PWD}/DV_output"
    INPUT_DIR="${PWD}/align"
    CONTAINER_NAME="deepvariant0"
    sudo docker run --name ${CONTAINER_NAME} -v "${INPUT_DIR}":"/input" -v "${OUTPUT_DIR}:/output" -it google/deepvariant:"${BIN_VERSION}" /bin/bash
    
  2. Download dataset

    https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-training-case-study.md

  3. Fetch fastq from bam file

    samtools fastq -@ 16 ./BGISEQ_PE100_NA12878.sorted.chr21.bam -o BGISEQ_PE100_NA12878.sorted.chr21.fataq
    samtools faidx ./ucsc_hg19.fa -r chr21 -o ./ucsc_hg19.chr21.fa
    
  4. Align

    3.1 BWA:

     bwa index ./ucsc_hg19.chr21.fa
     bwa mem -t 16 -k 21 ./ucsc_hg19.chr21.fa ./BGISEQ_PE100_NA12878.sorted.chr21.fastq > ${INPUT_DIR}/aln_bwa.sam
    

    3.2 FPGA:

     fpga_alg_dv -tables ./FOLDER_OF_TABLES/ -r ./BGISEQ_PE100_NA12878.sorted.chr21.fastq -o ${INPUT_DIR}/aln_fpga.sam
    

    -tables: The folder storing tables. For the demo, maybe it will be easier to use pre-built tables

    . Maybe we can just make user chooses a table from a drop-down list.

    -r: input fastq file

    -o: output sam file.

  5. Sort and index

    FILENAME="aln_bwa"
    samtools view -@ 16 -b -S ./align/${FILENAME}.sam > ./align/${FILENAME}.bam
    samtools sort -@ 16 ./align/${FILENAME}.bam -o ./align/${FILENAME}.bam
    samtools index -@ 16 ./align/${FILENAME}.bam
    
  6. Call DeepVariant

sudo docker exec ${CONTAINER_NAME} /opt/deepvariant/bin/run_deepvariant \
  --model_type="WGS" --ref=/input/ucsc_hg19.chr21.fa \
  --reads=/input/${FILENAME}.bam \
  --output_vcf=/output/output_${FILENAME}vcf.gz \
  --output_gvcf=/output/output_${FILENAME}.g.vcf.gz \
  --num_shards=15 \
  --logging_dir=/output/logs_${FILENAME} \
  --dry_run=false
  1. Check accuracy

    Some statistics results are visualized in ${OUTPUT_DIR}/output_$FILENAME$_visual_report.html There is a script hap.py supported by DeepVariant to evaluation the accuracy. But I never used it and just wrote a small script to do the similar thing. Maybe we can have a better way to visualize the results and accuracy.

Tn93 Pipeline

Following Niema's guide.

  1. Align with FPGA

    fpga_alg_covid -tables ./FOLDER_OF_TABLES/ -r ./input.fataq -o output.sam
    
  2. Primer Trim with FPGA

    fpga_ivar_trim -q 20 -w 4 -m 30 -x 5 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam 
    

    -q: Minimum quality threshold for sliding window to pass, same as iVar.

    -w: Width of sliding window, same as iVar.

    -m: Minimum length of read to retain after trimming, same as iVar

    -x : Primer position offset. Reads that occur at the specified offset positions relative to primer positions will also be trimmed.

    -r: input fastq file

    -o: output sam file.

  3. TN93 with FPGA

    fpga_tn93 -input ./input.aln -o tn93.dist
    

    -r: input fasta file

    -o: output sam file.

Unifrac Pipeline

  1. Dataset

    I am using ref107 as the reference genome database. I generated the input fastq files with CAMISIM with the CAMISIM_config/config.ini. As each fastq file contains sequence of all reference genome and making the unifrac distances are almost all 0, I randomly discarded 54 reference for each fastq file. The new distribution is distribution.tsv.

    You may find what I am using here: https://drive.google.com/drive/folders/1KrYfJhjRgigpbYAMeWP4u8G5eSG3FtZv?usp=sharing

  2. Align with FPGA

    fpga_alg_mb -k 16 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam
    

    -k: Same as bowtie2. -k 16 is the default setting of SHOGUN protocol.

    Bowtie 2 searches for up to N distinct, valid alignments for each read, where N equals the integer specified with the -k parameter. That is, if -k 2 is specified, Bowtie 2 will search for at most 2 distinct alignments.

    Bowtie 2 does not "find" alignments in any specific order, so for reads that have more than N distinct, valid alignments, Bowtie 2 does not guarantee that the N alignments reported are the best possible in terms of alignment score. Still, this mode can be effective and fast in situations where the user cares more about whether a read aligns (or aligns a certain number of times) than where exactly it originated.

    I used the following bowtie2 command as benchmark following https://github.com/qiyunzhu/woltka/blob/26e1670ec7ffb42245a74e11c3a315ec3f3267bd/doc/align.md?plain=1#L90

    bowtie2 -p 16  --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" -x db -U input.fastq | pigz -p 16 -c > output.sam.gz
    
  3. Call woltka for OGU analysis

    woltka classify --uniq -i bt2out -o table.biom
    
  4. Call Unifrac

    ssu -i ogu.biom -t fasttree.nwk -o unifrac.txt -m unweighted