Questions:
- How can I make my workflow more efficient and less error-prone?
Objectives:
- Write a shell script with multiple variables.
- Incorporate a for loop into a shell script.
Setup
Check your directories on Pete for the files:
SRR2584863_1.trim.sub.fastq
SRR2584863_2.trim.sub.fastq
SRR2584866_1.trim.sub.fastq
SRR2584866_2.trim.sub.fastq
SRR2589044_1.trim.sub.fastq
SRR2589044_2.trim.sub.fastq
These should be in your /scratch/<username>/dc_workshop/data/trimmed_fastq_small/ directory. If you don’t have these files, create a /scratch/<username>/dc_workshop/data/trimmed_fastq_small/ directory (if you don’t have one) and change into that directory. Then download the files from CANVAS using the command:
curl -O https://canvas.okstate.edu/files/8837111/download?download_frd=1
What is a shell script?
You wrote a simple shell script in a previous lesson that we used to extract bad reads from our FASTQ files and put them into a new file.
Here’s the script you wrote:
grep -B1 -A2 NNNNNNNNNN *.fastq > scripted_bad_reads.txt
echo "Script finished!"
That script was only two lines long, but shell scripts can be much more sophisticated and can be used to perform a large number of operations on one or many files. This saves you the effort of having to re-type each of those commands for each of your data files and makes your work less error-prone and more reproducible. For example, the variant calling workflow we just carried out had about eight steps where we had to type a command into our terminal. Most of these commands were pretty long. If we wanted to do this for all six of our data files, that would be forty-eight steps. If we had 50 samples (a more realistic number), it would be 400 steps! You can see why we want to automate this.
We’ve also used for loops in previous lessons to iterate one or two commands over multiple input files. In these for loops you used variables to enable you to run the loop on multiple files. We will be using variable assignments like this in our new shell scripts.
Here’s the for loop you wrote for unzipping .zip files:
$ for filename in *.zip
> do
> unzip ${filename}
> done
And here’s the one you wrote for running Trimmomatic on all of our .fastq sample files:
$ for infile in *.fastq
> do
> outfile=${infile}_trim.fastq
> java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE ${infile} ${outfile} SLIDINGWINDOW:4:20 MINLEN:20
> done
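Notice that this loop produces output names like sample.fastq_trim.fastq. As a sketch of a small variation (using the basename command, which we will meet again later in this lesson), you could strip the .fastq extension first so the output names are cleaner:
$ for infile in *.fastq
> do
>     base=$(basename ${infile} .fastq)
>     java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE ${infile} ${base}_trim.fastq SLIDINGWINDOW:4:20 MINLEN:20
> done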
In this lesson, we will create two shell scripts that can be submitted to Pete.
First Script: Analyzing Quality with FastQC
We’ll combine each of the commands we used to run FastQC and process the output files into a single file with a .sh (shell script) extension. This script will also create our summary file.
Using Zoom Chat and Text Files
During a remote session, we may use a chat box to paste text for commands to place in script files. We found in some cases the script files had problems with “end-of-line” white space. To avoid this in the future, we can run the command dos2unix, which fixes all the line endings by converting them to Unix-style. This may or may not occur on your system, but we wanted you to know.
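For example, if a pasted script misbehaves, you could convert it in place (assuming dos2unix is installed on your system; the script name here is just an example):
$ dos2unix read_qc.sh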
Let’s use the command touch to create a new file where we will write our shell script. Remember, we used nano to create and open a new file, but the command touch allows us to create a new file without opening that file.
$ cd /scratch/<username>/dc_workshop
$ touch read_qc.sh
$ ls
read_qc.sh
We now have an empty file called read_qc.sh in our dc_workshop/ directory. Open read_qc.sh in nano and start building our script.
$ nano read_qc.sh
Use nano to place the following pieces of code into your shell script (not into your terminal prompt).
These next three lines will give us a status message telling us that we are currently running FastQC, then will run FastQC on all of the files in our data/untrimmed_fastq/ directory with a .fastq.gz extension. Remember that to run FastQC, we have to load the module first. Then we call the program as fastqc and give it the path and the file names.
echo "Running FastQC ..."
module load fastqc
fastqc data/untrimmed_fastq/*.fastq.gz
Our next two lines create new directories: one to hold our FastQC output files, which are currently in the data/untrimmed_fastq/ directory, and a docs/ directory for our summary file. Here we are using the -p option for mkdir. This option tells mkdir to create any missing parent directories, and not to complain if a directory already exists. It is a good idea to use this option in your shell scripts to avoid running into errors if you don’t have the directory structure you think you do.
mkdir -p results/fastqc_untrimmed_reads
mkdir -p docs
Our next three lines first give us a status message to tell us we are saving the results from FastQC, then move all of the files with a .zip or a .html extension to the directory we just created for storing our FastQC results.
echo "Saving FastQC results..."
mv data/untrimmed_fastq/*.zip results/fastqc_untrimmed_reads/
mv data/untrimmed_fastq/*.html results/fastqc_untrimmed_reads/
The next line moves us to the results directory where we’ve stored our output.
cd results/fastqc_untrimmed_reads/
The next five lines should look very familiar. First we give ourselves a status message to tell us that we’re unzipping our .zip files. Then we run our for loop to unzip all of the .zip files in this directory.
Remember that bash does not require the for loop’s internal lines to be indented, but indenting them with four spaces makes a script far easier to read, so get in the habit!
echo "Unzipping..."
for filename in *.zip
do
    unzip ${filename}
done
Next we concatenate all of our summary files into a single output file, with a status message to remind ourselves that this is what we’re doing. The cat command joins the contents of all its input files, one after another. By doing this, all the summaries of all the read files (fastq files) are combined into ONE file we can scan through (or even search for words like “FAIL”).
echo "Saving summary..."
cat */summary.txt > ../../docs/fastqc_summaries.txt
Finally, let’s add one more line to let us know the script completed the last steps successfully:
echo "Summary file completed"
Using echo statements
We’ve used echo commands to add progress statements to our script. Our script will print these statements as it is running, so we will be able to see how far our script has progressed. When using a .sbatch file to submit jobs, the outputs will go to the job output file (e.g. slurm-<job-number>.out). If our script fails, we can use the echo outputs to track how far the script got before an error message appeared (if any). We can then check for error messages in this file by opening it in nano or displaying it on the screen using the command less slurm-<job-number>.out. We will always want to check this output file!
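For example, once a job finishes you might scan its output file for problems (substituting your actual job number):
$ grep -i "error" slurm-<job-number>.out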
Your full shell script should now look like this (extra blank lines are used for clarity):
echo "Running FastQC..."
module load fastqc
fastqc data/untrimmed_fastq/*.fastq.gz
mkdir -p results/fastqc_untrimmed_reads
mkdir -p docs
echo "Saving FastQC results..."
mv data/untrimmed_fastq/*zip results/fastqc_untrimmed_reads/
mv data/untrimmed_fastq/*.html results/fastqc_untrimmed_reads/
cd results/fastqc_untrimmed_reads/
echo "Unzipping ..."
for filename in *.zip
do
unzip ${filename}
done
echo "Saving summary..."
cat */summary.txt > ../../docs/fastqc_summaries.txt
echo "Summary file completed"
Save your file and exit nano. We can now create a .sbatch file that will run our script. If we were operating “interactively” (not using .sbatch files) we could run the script by typing:
$ bash read_qc.sh
So our .sbatch file will be similar. First we use nano to create the .sbatch file, which contains the following:
#!/bin/bash
#SBATCH -p express
#SBATCH -t 1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mail-user=<your.email.address@univ.edu>
#SBATCH --mail-type=end
bash read_qc.sh
Notice the “shebang” (#!/bin/bash) at the top of your submission script. When we are running a bash script, even via a submission script, a rule applies:
Shebang: The shebang line tells the shell (which interprets UNIX commands) to interpret and run the Slurm script using the bash (Bourne-again shell) interpreter.
This line should always be added at the very top of your SBATCH/Slurm script when running a bash script.
Save this file as read-test.sbatch and exit nano.
Now submit the .sbatch file (yes, we are using a submission script to run our script):
sbatch read-test.sbatch
Make a note of the job number and check to see when your script completes.
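You can check on the job while it is waiting or running with the standard Slurm command squeue (substituting your username):
$ squeue -u <username>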
After the script completes, open the job output file using less slurm-<job-number>.out
. You should see something similar to this:
Running FastQC ...
Started analysis of SRR097977.fastq
Approx 5% complete for SRR097977.fastq
Approx 10% complete for SRR097977.fastq
Approx 15% complete for SRR097977.fastq
Approx 20% complete for SRR097977.fastq
Approx 25% complete for SRR097977.fastq
.
.
.
Because we have already run FastQC on these sample files and generated all of the outputs, our older files will be overwritten (if they are generated in the same directory). That is okay.
Automating the Rest of our Variant Calling Workflow
Now we will create a second shell script to complete the other steps of our variant calling workflow. To do this, we will take all of the individual commands that we wrote before, put them into a single file, add variables so that the script knows to iterate through our input files, and do a few other bits of formatting that we’ll explain as we go. This is very similar to what we did with our read_qc.sh script, but will be a bit more complex.
Our variant calling workflow will do the following steps:
- Index the reference genome for use by bwa and samtools
- Align reads to reference genome
- Convert the format of the alignment to sorted BAM, with some intermediate steps.
- Calculate the read coverage of positions in the genome
- Detect the single nucleotide polymorphisms (SNPs)
- Filter and report the SNP variants in VCF (variant calling format)
We will be creating a script together to do all of these steps.
First, we will create a new script in our dc_workshop directory using touch.
$ touch run_variant_calling.sh
We now have a new empty file called run_variant_calling.sh. We will open this file in nano and start building our script, like we did before.
$ nano run_variant_calling.sh
Enter the following pieces of code into your shell script (not into your terminal prompt).
Bash scripts should start with a Shebang! So begin by using:
#!/bin/bash
Next we tell our script where to find the reference genome by assigning the genome variable to the path to our reference genome:
genome=data/ref_genome/ecoli_rel606.fasta
Creating Variables
Within the Bash shell you can create variables at any time (as we did above, and during the for loop lesson). Assign any name and the value using the assignment operator = (with no spaces around it). You can check the current definition of your variable by including echo $variable_name in your script.
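Here is a minimal illustrative sketch (the variable name is just an example):
results_dir=results/vcf    # assign a value; note there are no spaces around the =
echo ${results_dir}        # prints: results/vcf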
Next we index our reference genome for BWA. Note that on Pete, and on other HPCs with modules, the commands are:
module load bwa
bwa index ${genome}
We will now create the directory structure to store our results:
mkdir -p results/sam results/bam results/bcf results/vcf
Then we use a loop to run the variant calling workflow on each of our samples. We’ve already used Trimmomatic to trim these files, and we could re-do that here, but to save time, we will use the smaller files we’ve already trimmed in the data/trimmed_fastq_small/ directory. (You might want to try this later on the bigger files in the data/trimmed_fastq/ directory.) We will run the script from the dc_workshop directory, so the relative paths in the loop will resolve correctly, and the commands within the loop will be executed once for each sample in the data/trimmed_fastq_small/ directory.
We will include a few echo statements to give us status updates on our progress.
The first thing we do is assign the name of the FASTQ file we’re currently working with to a variable called fq and tell the script to echo the filename back to us so we can check which file we’re on. Because our samples are paired-end, we loop over only the forward-read (_1) files; each pass through the loop will then handle both files of that pair.
for fq in data/trimmed_fastq_small/*_1.trim.sub.fastq
do
    echo "working with file ${fq}"
done
Indentation
All of the statements within your for loop (i.e. everything after the for line and before the done line) should be indented with four spaces. The shell does not require this, but the indentation makes it obvious to a human reader that these statements are all part of the for loop and will be run once for each pass through the loop.
This is a good time to check that our script is assigning the FASTQ filename variable correctly. Save your script and run it interactively; because we aren’t doing any big calculations, it only takes a few seconds.
$ bash run_variant_calling.sh
[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.10 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.02 sec
[bwa_index] Construct SA from BWT and Occ... 0.64 sec
[main] Version: 0.7.5a-r405
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 1.892 sec; CPU: 1.829 sec
working with file data/trimmed_fastq_small/SRR2584863_1.trim.sub.fastq
working with file data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq
working with file data/trimmed_fastq_small/SRR2589044_1.trim.sub.fastq
You should see “working with file . . .” for each of the three forward-read FASTQ files in our trimmed_fastq_small/ directory. If you don’t see this output, then you’ll need to troubleshoot your script. A common problem is that your directory might not be specified correctly. Ask for help if you get stuck here!
Now that we’ve tested the components of our loop so far, we will add our next few steps. Open the script with nano, remove the line done from the end of your script, and add the next two (indented) lines. These lines extract the base name of the file, excluding the path and the _1.trim.sub.fastq extension, and assign it to a new variable called base. Add done again at the end so we can test our script.
    base=$(basename ${fq} _1.trim.sub.fastq)
    echo "base name is ${base}"
done
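If you want to see what basename does before running the whole script, you can try it directly at the prompt:
$ basename data/trimmed_fastq_small/SRR2584863_1.trim.sub.fastq _1.trim.sub.fastq
SRR2584863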
Now if you save and run your script, the final lines of your output should look like this:
working with file data/trimmed_fastq_small/SRR2584863_1.trim.sub.fastq
base name is SRR2584863
working with file data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq
base name is SRR2584866
working with file data/trimmed_fastq_small/SRR2589044_1.trim.sub.fastq
base name is SRR2589044
For each sample, you see two statements printed to the terminal window because we have two echo statements. The first tells you which file the loop is currently working with. The second tells you the base name of the file. This base name is going to be used to create our output files.
Next we will create variables to store the names of our input and output files. This will make your script easier to read because you won’t need to type out the full name of each of the files. We’re using the base variable that we just defined, and adding different file name extensions to represent the different output files.
We can use the base variable to access both the ${base}_1.trim.sub.fastq and ${base}_2.trim.sub.fastq input files, and create variables to store the names of our output files. Remember to delete the done line from your script before adding these (indented) lines.
    # input files
    # For the first sample these will be 'SRR2584863_1.trim.sub.fastq'
    # and 'SRR2584863_2.trim.sub.fastq'
    fq1=data/trimmed_fastq_small/${base}_1.trim.sub.fastq
    fq2=data/trimmed_fastq_small/${base}_2.trim.sub.fastq

    # output files
    sam=results/sam/${base}.aligned.sam
    bam=results/bam/${base}.aligned.bam
    sorted_bam=results/bam/${base}.aligned.sorted.bam
    raw_bcf=results/bcf/${base}_raw.bcf
    variants=results/bcf/${base}_variants.vcf
    final_variants=results/vcf/${base}_final_variants.vcf
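For example, when base is SRR2584863, these variables expand to names like:
results/sam/SRR2584863.aligned.sam
results/bam/SRR2584863.aligned.sorted.bam
results/vcf/SRR2584863_final_variants.vcf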
Now that we’ve created our variables, we can start adding the steps of our workflow. Remember that on Pete, or on any HPC with modules, you need to load a program’s module before you can use it. (If you are on a cloud instance, you can ignore the module load commands.)
1) Align the reads to the reference genome and output a .sam file:
    module load bwa
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}
2) Convert the SAM file to BAM format:
    module load samtools/1.10
    samtools view -S -b ${sam} > ${bam}
3) Sort the BAM file:
    samtools sort -o ${sorted_bam} ${bam}
4) Index the BAM file for display purposes:
    samtools index ${sorted_bam}
5) Calculate the read coverage of positions in the genome:
    module load bcftools/1.8
    bcftools mpileup -O b -o ${raw_bcf} -f ${genome} ${sorted_bam}
6) Call SNPs with bcftools:
    bcftools call --ploidy 1 -m -v -o ${variants} ${raw_bcf}
7) Filter and report the SNP variants in variant calling format (VCF):
    vcfutils.pl varFilter ${variants} > ${final_variants}
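After the pipeline runs, you could peek at one of the final VCF files to confirm it contains variant records, skipping the ## header lines (the sample name here is one of ours):
$ grep -v "^##" results/vcf/SRR2584863_final_variants.vcf | head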
Your final script should look like this:
#!/bin/bash

genome=data/ref_genome/ecoli_rel606.fasta

module load bwa
bwa index ${genome}

mkdir -p results/sam results/bam results/bcf results/vcf

for fq in data/trimmed_fastq_small/*_1.trim.sub.fastq
do
    echo "working with file ${fq}"
    base=$(basename ${fq} _1.trim.sub.fastq)
    echo "base name is ${base}"

    # input files
    fq1=data/trimmed_fastq_small/${base}_1.trim.sub.fastq
    fq2=data/trimmed_fastq_small/${base}_2.trim.sub.fastq

    # output files
    sam=results/sam/${base}.aligned.sam
    bam=results/bam/${base}.aligned.bam
    sorted_bam=results/bam/${base}.aligned.sorted.bam
    raw_bcf=results/bcf/${base}_raw.bcf
    variants=results/bcf/${base}_variants.vcf
    final_variants=results/vcf/${base}_final_variants.vcf

    module load bwa
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}

    module load samtools/1.10
    samtools view -S -b ${sam} > ${bam}
    samtools sort -o ${sorted_bam} ${bam}
    samtools index ${sorted_bam}

    module load bcftools/1.8
    bcftools mpileup -O b -o ${raw_bcf} -f ${genome} ${sorted_bam}
    bcftools call --ploidy 1 -m -v -o ${variants} ${raw_bcf}
    vcfutils.pl varFilter ${variants} > ${final_variants}

    echo "done"
done
Exercise
It’s a good idea to add comments to your code so that you (or a collaborator) can make sense of what you did later. Look through your existing script. Discuss with a neighbor where you should add comments. Then add them (anything following a # character is interpreted as a comment; bash will not try to run these comments as code).
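For example, one way you might comment the alignment step inside the loop (the comment wording is just a suggestion):
    # Step 1: align the trimmed reads to the E. coli reference genome
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}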
Now we can run our script. With Pete, we will create a submission script as shown below. (On a cloud instance the simple command would be: bash run_variant_calling.sh)
Here is a Pete submission script made with nano, named bigscript.sbatch:
#!/bin/bash
#SBATCH -p express
#SBATCH -t 1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --mail-user=peter.r.hoyt@okstate.edu
#SBATCH --mail-type=end
bash run_variant_calling.sh
Submit it with:
sbatch bigscript.sbatch
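When the job finishes, you should find one final VCF per sample, something like:
$ ls results/vcf
SRR2584863_final_variants.vcf  SRR2584866_final_variants.vcf  SRR2589044_final_variants.vcf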
CONGRATULATIONS! You have created a powerful pipeline taking raw fastq data, and generating a file with all the sequence variations mapped to a reference genome! You can use this pipeline over and over again by swapping out the fastq files and the reference genome.
Nelle would be proud!
Keypoints:
- We can combine multiple commands into a shell script to automate a workflow.
- Use echo statements within your scripts to get an automated progress update.