Questions:
- How can I make my workflow more efficient and less error-prone?
Objectives:
- Write a shell script with multiple variables.
- Incorporate a for loop into a shell script.
Setup
Check your directories on Pete for the files:
SRR2584863_1.trim.sub.fastq
SRR2584863_2.trim.sub.fastq
SRR2584866_1.trim.sub.fastq
SRR2584866_2.trim.sub.fastq
SRR2589044_1.trim.sub.fastq
SRR2589044_2.trim.sub.fastq
These should be in your /scratch/<username>/dc_workshop/data/trimmed_fastq_small/ directory. If you don’t have these files, create a /scratch/<username>/dc_workshop/data/trimmed_fastq_small/ directory (if you don’t have one) and change into that directory. Then download the files from CANVAS using the command:
curl -O https://canvas.okstate.edu/files/8837111/download?download_frd=1
What is a shell script?
You wrote a simple shell script in a previous lesson that we used to extract bad reads from our FASTQ files and put them into a new file.
Here’s the script you wrote:
grep -B1 -A2 NNNNNNNNNN *.fastq > scripted_bad_reads.txt
echo "Script finished!"
That script was only two lines long, but shell scripts can be much more sophisticated and can be used to perform a large number of operations on one or many files. This saves you the effort of having to re-type each of those commands for each of your data files and makes your work less error-prone and more reproducible. For example, the variant calling workflow we just carried out had about eight steps where we had to type a command into our terminal. Most of these commands were pretty long. If we wanted to do this for all six of our data files, that would be forty-eight steps. If we had 50 samples (a more realistic number), it would be 400 steps! You can see why we want to automate this.
We’ve also used for loops in previous lessons to iterate one or two commands over multiple input files. In these for loops you used variables to enable you to run the loop on multiple files. We will be using variable assignments like this in our new shell scripts.
Here’s the for loop you wrote for unzipping .zip files:
$ for filename in *.zip
> do
> unzip ${filename}
> done
And here’s the one you wrote for running Trimmomatic on all of our .fastq sample files:
$ for infile in *.fastq
> do
> outfile=${infile}_trim.fastq
> java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE ${infile} ${outfile} SLIDINGWINDOW:4:20 MINLEN:20
> done
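Notice that this loop produces output names like sample.fastq_trim.fastq. As a sketch of a small variation (using the basename command, which we will meet again later in this lesson), you could strip the .fastq extension first so the output names are cleaner:
$ for infile in *.fastq
> do
>     base=$(basename ${infile} .fastq)
>     java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar SE ${infile} ${base}_trim.fastq SLIDINGWINDOW:4:20 MINLEN:20
> done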
In this lesson, we will create two shell scripts that can be submitted to Pete.
First Script: Analyzing Quality with FastQC
We’ll combine each of the commands we used to run FastQC and process the output files into a single file with a .sh (shell script) extension. This script will also create our summary file.
Using Zoom Chat and Text Files
During a remote session, we may use a chat box to paste text for commands to place in script files. We found in some cases the script files had problems with “end-of-line” white space. To avoid this in the future, we can run the command dos2unix, which fixes all the line endings by converting them to Unix-style. This may or may not occur on your system, but we wanted you to know.
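For example, if a pasted script misbehaves, you could convert it in place (assuming dos2unix is installed on your system; the script name here is just an example):
$ dos2unix read_qc.sh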
Let’s use the command touch to create a new file where we will write our shell script. Remember, we used nano to create and open a new file, but the command touch allows us to create a new file without opening that file.
$ cd /scratch/<username>/dc_workshop
$ touch read_qc.sh
$ ls
read_qc.sh
We now have an empty file called read_qc.sh in our dc_workshop/ directory. Open read_qc.sh in nano and start building our script.
$ nano read_qc.sh
Use nano to place the following pieces of code into your shell script (not into your terminal prompt).
These next three lines will give us a status message telling us that we are currently running FastQC, then will run FastQC on all of the files in our data/untrimmed_fastq/ directory with a .fastq.gz extension. Remember that to run FastQC, we have to load the module first. Then we call the program as fastqc and give it the path and the file names.
echo "Running FastQC ..."
module load fastqc
fastqc data/untrimmed_fastq/*.fastq.gz
Our next two lines create new directories: one to hold our FastQC output files, which are currently in the data/untrimmed_fastq/ directory, and a docs/ directory for our summary file. Here we are using the -p option for mkdir. This option tells mkdir to create any missing parent directories, and not to complain if a directory already exists. It is a good idea to use this option in your shell scripts to avoid running into errors if you don’t have the directory structure you think you do.
mkdir -p results/fastqc_untrimmed_reads
mkdir -p docs
Our next three lines first give us a status message to tell us we are saving the results from FastQC, then move all of the files with a .zip or a .html extension to the directory we just created for storing our FastQC results.
echo "Saving FastQC results..."
mv data/untrimmed_fastq/*.zip results/fastqc_untrimmed_reads/
mv data/untrimmed_fastq/*.html results/fastqc_untrimmed_reads/
The next line moves us to the results directory where we’ve stored our output.
cd results/fastqc_untrimmed_reads/
The next five lines should look very familiar. First we give ourselves a status message to tell us that we’re unzipping our .zip files. Then we run our for loop to unzip all of the .zip files in this directory.
Remember that bash does not require the for loop’s internal lines to be indented, but indenting them with four spaces makes a script far easier to read, so get in the habit!
echo "Unzipping..."
for filename in *.zip
do
    unzip ${filename}
done
Next we concatenate all of our summary files into a single output file, with a status message to remind ourselves that this is what we’re doing. The cat command joins the contents of all its input files, one after another. By doing this, all the summaries of all the read files (fastq files) are combined into ONE file we can scan through (or even search for words like “FAIL”).
echo "Saving summary..."
cat */summary.txt > ../../docs/fastqc_summaries.txt
Finally, let’s add one more line to let us know the script completed the last steps successfully:
echo "Summary file completed"
Using echo statements
We’ve used echo commands to add progress statements to our script. Our script will print these statements as it is running, so we will be able to see how far our script has progressed. When using a .sbatch file to submit jobs, the outputs will go to the job output file (e.g. slurm-<job-number>.out). If our script fails, we can use the echo outputs to track how far the script got before an error message appeared (if any). We can then check for error messages in this file by opening it in nano or displaying it on the screen using the command less slurm-<job-number>.out. We will always want to check this output file!
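For example, once a job finishes you might scan its output file for problems (substituting your actual job number):
$ grep -i "error" slurm-<job-number>.out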
Your full shell script should now look like this (extra blank lines are used for clarity):
echo "Running FastQC..."
module load fastqc
fastqc data/untrimmed_fastq/*.fastq.gz
mkdir -p results/fastqc_untrimmed_reads
mkdir -p docs
echo "Saving FastQC results..."
mv data/untrimmed_fastq/*zip results/fastqc_untrimmed_reads/
mv data/untrimmed_fastq/*.html results/fastqc_untrimmed_reads/
cd results/fastqc_untrimmed_reads/
echo "Unzipping ..."
for filename in *.zip
do
unzip ${filename}
done
echo "Saving summary..."
cat */summary.txt > ../../docs/fastqc_summaries.txt
echo "Summary file completed"
Save your file and exit nano. We can now create a .sbatch file that will run our script. If we were operating “interactively” (not using .sbatch files) we could run the script by typing:
$ bash read_qc.sh
So our .sbatch file will be similar. First we use nano to create the .sbatch file, which contains the following:
#!/bin/bash
#SBATCH -p express
#SBATCH -t 1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mail-user=<your.email.address@univ.edu>
#SBATCH --mail-type=end
bash read_qc.sh
Notice the “shebang” (#!/bin/bash) at the top of your submission script. When we are running a bash script, even via a submission script, a rule applies:
Shebang: The shebang line tells the shell (which interprets UNIX commands) to interpret and run the Slurm script using the bash (Bourne-again shell) interpreter.
This line should always be added at the very top of your SBATCH/Slurm script when running a bash script.
Save this file as read-test.sbatch and exit nano.
Now submit the .sbatch file (yes, we are using a submission script to run our script):
sbatch read-test.sbatch
Make a note of the job number and check to see when your script completes.
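You can check on the job while it is waiting or running with the standard Slurm command squeue (substituting your username):
$ squeue -u <username>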
After the script completes, open the job output file using less slurm-<job-number>.out
. You should see something similar to this:
Running FastQC ...
Started analysis of SRR097977.fastq
Approx 5% complete for SRR097977.fastq
Approx 10% complete for SRR097977.fastq
Approx 15% complete for SRR097977.fastq
Approx 20% complete for SRR097977.fastq
Approx 25% complete for SRR097977.fastq
.
.
.
Because we have already run FastQC on these sample files and generated all of the outputs, our older files will be overwritten (if they are generated in the same directory). That is okay.
Automating the Rest of our Variant Calling Workflow
Now we will create a second shell script to complete the other steps of our variant calling workflow. To do this, we will take all of the individual commands that we wrote before, put them into a single file, add variables so that the script knows to iterate through our input files, and do a few other bits of formatting that we’ll explain as we go. This is very similar to what we did with our read_qc.sh script, but will be a bit more complex.
Our variant calling workflow will do the following steps:
- Index the reference genome for use by bwa and samtools
- Align reads to reference genome
- Convert the format of the alignment to sorted BAM, with some intermediate steps.
- Calculate the read coverage of positions in the genome
- Detect the single nucleotide polymorphisms (SNPs)
- Filter and report the SNP variants in VCF (variant calling format)
We will be creating a script together to do all of these steps.
First, we will create a new script in our dc_workshop directory using touch.
$ touch run_variant_calling.sh
We now have a new empty file called run_variant_calling.sh. We will open this file in nano and start building our script, like we did before.
$ nano run_variant_calling.sh
Enter the following pieces of code into your shell script (not into your terminal prompt).
Bash scripts should start with a Shebang! So begin by using:
#!/bin/bash
Next we tell our script where to find the reference genome by assigning the genome variable to the path to our reference genome:
genome=data/ref_genome/ecoli_rel606.fasta
Creating Variables
Within the Bash shell you can create variables at any time (as we did above, and during the for loop lesson). Assign any name and the value using the assignment operator = (with no spaces around it). You can check the current definition of your variable by including echo $variable_name in your script.
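Here is a minimal illustrative sketch (the variable name is just an example):
results_dir=results/vcf    # assign a value; note there are no spaces around the =
echo ${results_dir}        # prints: results/vcf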
Next we index our reference genome for BWA. Note that on Pete, and on other HPCs with modules, the commands are:
module load bwa
bwa index ${genome}
We will now create the directory structure to store our results:
mkdir -p results/sam results/bam results/bcf results/vcf
Then we use a loop to run the variant calling workflow on each of our samples. We’ve already used Trimmomatic to trim these files, and we could re-do that here, but to save time, we will use the smaller files we’ve already trimmed in the data/trimmed_fastq_small/ directory. (You might want to try this later on the bigger files in the data/trimmed_fastq/ directory.) We will run the script from the dc_workshop directory, so the relative paths in the loop will resolve correctly, and the commands within the loop will be executed once for each sample in the data/trimmed_fastq_small/ directory.
We will include a few echo statements to give us status updates on our progress.
The first thing we do is assign the name of the FASTQ file we’re currently working with to a variable called fq and tell the script to echo the filename back to us so we can check which file we’re on. Because our samples are paired-end, we loop over only the forward-read (_1) files; each pass through the loop will then handle both files of that pair.
for fq in data/trimmed_fastq_small/*_1.trim.sub.fastq
do
    echo "working with file ${fq}"
done
Indentation
All of the statements within your for loop (i.e. everything after the for line and before the done line) should be indented with four spaces. The shell does not require this, but the indentation makes it obvious to a human reader that these statements are all part of the for loop and will be run once for each pass through the loop.
This is a good time to check that our script is assigning the FASTQ filename variable correctly. Save your script and run it interactively; because we aren’t doing any big calculations, it only takes a few seconds.
$ bash run_variant_calling.sh
[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.10 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.02 sec
[bwa_index] Construct SA from BWT and Occ... 0.64 sec
[main] Version: 0.7.5a-r405
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 1.892 sec; CPU: 1.829 sec
working with file data/trimmed_fastq_small/SRR2584863_1.trim.sub.fastq
working with file data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq
working with file data/trimmed_fastq_small/SRR2589044_1.trim.sub.fastq
You should see “working with file . . .” for each of the three forward-read FASTQ files in our trimmed_fastq_small/ directory. If you don’t see this output, then you’ll need to troubleshoot your script. A common problem is that your directory might not be specified correctly. Ask for help if you get stuck here!
Now that we’ve tested the components of our loop so far, we will add our next few steps. Open the script with nano, remove the line done from the end of your script, and add the next two (indented) lines. These lines extract the base name of the file, excluding the path and the _1.trim.sub.fastq extension, and assign it to a new variable called base. Add done again at the end so we can test our script.
    base=$(basename ${fq} _1.trim.sub.fastq)
    echo "base name is ${base}"
done
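If you want to see what basename does before running the whole script, you can try it directly at the prompt:
$ basename data/trimmed_fastq_small/SRR2584863_1.trim.sub.fastq _1.trim.sub.fastq
SRR2584863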
Now if you save and run your script, the final lines of your output should look like this:
working with file data/trimmed_fastq_small/SRR2584863_1.trim.sub.fastq
base name is SRR2584863
working with file data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq
base name is SRR2584866
working with file data/trimmed_fastq_small/SRR2589044_1.trim.sub.fastq
base name is SRR2589044
For each sample, you see two statements printed to the terminal window because we have two echo statements. The first tells you which file the loop is currently working with. The second tells you the base name of the file. This base name is going to be used to create our output files.
Next we will create variables to store the names of our input and output files. This will make your script easier to read because you won’t need to type out the full name of each of the files. We’re using the base variable that we just defined, and adding different file name extensions to represent the different output files.
We can use the base variable to access both the ${base}_1.trim.sub.fastq and ${base}_2.trim.sub.fastq input files, and create variables to store the names of our output files. Remember to delete the done line from your script before adding these (indented) lines.
    # input files
    # For the first sample these will be 'SRR2584863_1.trim.sub.fastq'
    # and 'SRR2584863_2.trim.sub.fastq'
    fq1=data/trimmed_fastq_small/${base}_1.trim.sub.fastq
    fq2=data/trimmed_fastq_small/${base}_2.trim.sub.fastq

    # output files
    sam=results/sam/${base}.aligned.sam
    bam=results/bam/${base}.aligned.bam
    sorted_bam=results/bam/${base}.aligned.sorted.bam
    raw_bcf=results/bcf/${base}_raw.bcf
    variants=results/bcf/${base}_variants.vcf
    final_variants=results/vcf/${base}_final_variants.vcf
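For example, when base is SRR2584863, these variables expand to names like:
results/sam/SRR2584863.aligned.sam
results/bam/SRR2584863.aligned.sorted.bam
results/vcf/SRR2584863_final_variants.vcf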
Now that we’ve created our variables, we can start adding the steps of our workflow. Remember that on Pete, or on any HPC with modules, you need to load a program’s module before you can use it. (If you are on a cloud instance, you can ignore the module load commands.)
1) Align the reads to the reference genome and output a .sam file:
    module load bwa
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}
2) Convert the SAM file to BAM format:
    module load samtools/1.10
    samtools view -S -b ${sam} > ${bam}
3) Sort the BAM file:
    samtools sort -o ${sorted_bam} ${bam}
4) Index the BAM file for display purposes:
    samtools index ${sorted_bam}
5) Calculate the read coverage of positions in the genome:
    module load bcftools/1.8
    bcftools mpileup -O b -o ${raw_bcf} -f ${genome} ${sorted_bam}
6) Call SNPs with bcftools:
    bcftools call --ploidy 1 -m -v -o ${variants} ${raw_bcf}
7) Filter and report the SNP variants in variant calling format (VCF):
    vcfutils.pl varFilter ${variants} > ${final_variants}
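After the pipeline runs, you could peek at one of the final VCF files to confirm it contains variant records, skipping the ## header lines (the sample name here is one of ours):
$ grep -v "^##" results/vcf/SRR2584863_final_variants.vcf | head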
Your final script should look like this:
#!/bin/bash

genome=data/ref_genome/ecoli_rel606.fasta

module load bwa
bwa index ${genome}

mkdir -p results/sam results/bam results/bcf results/vcf

for fq in data/trimmed_fastq_small/*_1.trim.sub.fastq
do
    echo "working with file ${fq}"
    base=$(basename ${fq} _1.trim.sub.fastq)
    echo "base name is ${base}"

    # input files
    fq1=data/trimmed_fastq_small/${base}_1.trim.sub.fastq
    fq2=data/trimmed_fastq_small/${base}_2.trim.sub.fastq

    # output files
    sam=results/sam/${base}.aligned.sam
    bam=results/bam/${base}.aligned.bam
    sorted_bam=results/bam/${base}.aligned.sorted.bam
    raw_bcf=results/bcf/${base}_raw.bcf
    variants=results/bcf/${base}_variants.vcf
    final_variants=results/vcf/${base}_final_variants.vcf

    module load bwa
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}

    module load samtools/1.10
    samtools view -S -b ${sam} > ${bam}
    samtools sort -o ${sorted_bam} ${bam}
    samtools index ${sorted_bam}

    module load bcftools/1.8
    bcftools mpileup -O b -o ${raw_bcf} -f ${genome} ${sorted_bam}
    bcftools call --ploidy 1 -m -v -o ${variants} ${raw_bcf}
    vcfutils.pl varFilter ${variants} > ${final_variants}

    echo "done"
done
Exercise
It’s a good idea to add comments to your code so that you (or a collaborator) can make sense of what you did later. Look through your existing script. Discuss with a neighbor where you should add comments. Then add them (anything following a # character is interpreted as a comment; bash will not try to run these comments as code).
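For example, one way you might comment the alignment step inside the loop (the comment wording is just a suggestion):
    # Step 1: align the trimmed reads to the E. coli reference genome
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}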
Now we can run our script. With Pete, we will create a submission script as shown below. (On a cloud instance the simple command would be: bash run_variant_calling.sh)
Here is a Pete submission script made with nano, named bigscript.sbatch:
#!/bin/bash
#SBATCH -p express
#SBATCH -t 1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --mail-user=peter.r.hoyt@okstate.edu
#SBATCH --mail-type=end
bash run_variant_calling.sh
Submit it with:
sbatch bigscript.sbatch
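When the job finishes, you should find one final VCF per sample, something like:
$ ls results/vcf
SRR2584863_final_variants.vcf  SRR2584866_final_variants.vcf  SRR2589044_final_variants.vcf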
CONGRATULATIONS! You have created a powerful pipeline taking raw fastq data, and generating a file with all the sequence variations mapped to a reference genome! You can use this pipeline over and over again by swapping out the fastq files and the reference genome.
Nelle would be proud!
Keypoints:
- We can combine multiple commands into a shell script to automate a workflow.
- Use echo statements within your scripts to get an automated progress update.