# DeepVariant Pipeline

Note: There is no GPU installed in FPGA server, and thus the following steps are using CPU DeepVariant. Some flags might be required to use GPU. 

0. Prepare DeepVariant
   Pull image and create a container 

    ```bash
    BIN_VERSION="1.3.0"
    sudo docker pull google/deepvariant:"${BIN_VERSION}"
        
    OUTPUT_DIR="${PWD}/DV_output"
    INPUT_DIR="${PWD}/align"
    CONTAINER_NAME="deepvariant0"
    sudo docker run --name ${CONTAINER_NAME} -v "${INPUT_DIR}":"/input" -v "${OUTPUT_DIR}:/output" -it google/deepvariant:"${BIN_VERSION}" /bin/bash
    ```
      
1. Download dataset

   https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-training-case-study.md

2. Fetch fastq from bam file

   ```
   samtools fastq -@ 16 ./BGISEQ_PE100_NA12878.sorted.chr21.bam -o BGISEQ_PE100_NA12878.sorted.chr21.fataq
   samtools faidx ./ucsc_hg19.fa -r chr21 -o ./ucsc_hg19.chr21.fa
   ```

3. Align

   3.1 BWA:

     ```bash
     bwa index ./ucsc_hg19.chr21.fa
     bwa mem -t 16 -k 21 ./ucsc_hg19.chr21.fa ./BGISEQ_PE100_NA12878.sorted.chr21.fastq > ${INPUT_DIR}/aln_bwa.sam
     ```

   3.2 FPGA:

     ```bash
     fpga_alg_dv -tables ./FOLDER_OF_TABLES/ -r ./BGISEQ_PE100_NA12878.sorted.chr21.fastq -o ${INPUT_DIR}/aln_fpga.sam
     ```

     `-tables`: The folder storing tables. For the demo, maybe it will be easier to use pre-built tables
     
     . Maybe we can just make user chooses a table from a drop-down list. 
     
     `-r`: input fastq file
     
     `-o`: output sam file. 

4. Sort and index

   ```bash
   FILENAME="aln_bwa"
   samtools view -@ 16 -b -S ./align/${FILENAME}.sam > ./align/${FILENAME}.bam
   samtools sort -@ 16 ./align/${FILENAME}.bam -o ./align/${FILENAME}.bam
   samtools index -@ 16 ./align/${FILENAME}.bam
   ```

5. Call DeepVariant

```bash
sudo docker exec ${CONTAINER_NAME} /opt/deepvariant/bin/run_deepvariant \
  --model_type="WGS" --ref=/input/ucsc_hg19.chr21.fa \
  --reads=/input/${FILENAME}.bam \
  --output_vcf=/output/output_${FILENAME}vcf.gz \
  --output_gvcf=/output/output_${FILENAME}.g.vcf.gz \
  --num_shards=15 \
  --logging_dir=/output/logs_${FILENAME} \
  --dry_run=false
```

6. Check accuracy

   Some statistics results are visualized in `${OUTPUT_DIR}/output_$FILENAME$_visual_report.html` There is a script `hap.py` supported by DeepVariant to evaluation the accuracy. But I never used it and just wrote a small script to do the similar thing. Maybe we can have a better way to visualize the results and accuracy. 

# Tn93 Pipeline 

Following Niema's guide.  

1. Align with FPGA

   ```
   fpga_alg_covid -tables ./FOLDER_OF_TABLES/ -r ./input.fataq -o output.sam
   ```

2. Primer Trim with FPGA

   ```
   fpga_ivar_trim -q 20 -w 4 -m 30 -x 5 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam 
   ```

   `-q`: Minimum quality threshold for sliding window to pass, same as iVar.

   `-w`: Width of sliding window, same as iVar.

   `-m`: Minimum length of read to retain after trimming, same as iVar

   `-x` : Primer position offset. Reads that occur at the specified offset positions relative to primer positions will also be trimmed.

   `-r`: input fastq file

   `-o`: output sam file. 

3. TN93 with FPGA

   ```
   fpga_tn93 -input ./input.aln -o tn93.dist
   ```

   `-r`: input fasta file

   `-o`: output sam file. 

# Unifrac Pipeline

0. Dataset

   I am using ref107 as the reference genome database. I generated the input fastq files with CAMISIM with the `CAMISIM_config/config.ini`. As each fastq file contains sequence of all reference genome and making the unifrac distances are almost all 0, I randomly discarded 54 reference for each fastq file. The new distribution is `distribution.tsv`. 

   You may find what I am using here: https://drive.google.com/drive/folders/1KrYfJhjRgigpbYAMeWP4u8G5eSG3FtZv?usp=sharing

1. Align with FPGA

    ```
    fpga_alg_mb -k 16 -tables ./FOLDER_OF_TABLES/ -r ./input.fastq -o output.sam
    ```

	`-k`: Same as bowtie2.   `-k 16` is the default setting of SHOGUN protocol. 

    > Bowtie 2 searches for up to N distinct, valid alignments for each read, where N equals the integer specified with the `-k` parameter. That is, if `-k 2` is specified, Bowtie 2 will search for at most 2 distinct alignments. 
    >
    > Bowtie 2 does not "find" alignments in any specific order, so for reads that have more than N distinct, valid alignments, Bowtie 2 does not guarantee that the N alignments reported are the best possible in terms of alignment score. Still, this mode can be effective and fast in situations where the user cares more about whether a read aligns (or aligns a certain number of times) than where exactly it originated.
    
    I used the following bowtie2 command as benchmark following https://github.com/qiyunzhu/woltka/blob/26e1670ec7ffb42245a74e11c3a315ec3f3267bd/doc/align.md?plain=1#L90
    
    ```
    bowtie2 -p 16  --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" -x db -U input.fastq | pigz -p 16 -c > output.sam.gz
    ```
    
2. Call woltka for OGU analysis

    ```
    woltka classify --uniq -i bt2out -o table.biom
    ```

3. Call Unifrac

    ```
    ssu -i ogu.biom -t fasttree.nwk -o unifrac.txt -m unweighted
    ```