Hands-on Genome Assembly Read Processing

Goal: get familiar with genome assemblers, pre-processing, reporting and validation. Exercise will be based on chromosomes of a mutant genotype of bakers’ yeast, as practice of reference-based genome assembly. Here we have the benefits of a reference genome to validate our assembly. We must be sure to acknowledge those who helped design these lessons.

Data and directory structure

This exercise will likely use the local University supercomputing resources. (AKA: The HPCC) Normally, we would not be able to perform these assemblies “interactively” i.e. by entering commands at the command prompt. Instead we would enter commands by submitting our commands as a “job” in the computers “queue”. Most of the commands we enter will be identical to the commands you would place inside your “job submission script” which we’ll discuss later, but there’s plenty of information on the HPCC site for new users

Depending on which system is available, everyone should have an account on the “Pete” supercomputer by now. The complete manual for working on the Pete supercomputer is available at this download link.

Special note to curriculum developers: This course was reaching the data limit for Github, so files had to be moved. If you are teaching this course, please feel free to request any datafiles. This may not be easy to fix.

Log onto the Pete computer using Putty (Windows users) or your Terminal Program (Mac and Linux users). Then open your FTP software, and connect to your account on Pete. You should have been given the file mcbios.zip, for you to place on your desktop. Alternatively, registered students can download the file mcbios.zip from Canvas onto the Desktop of your local machine. Use FTP to transfer the file mcbios.zip to Pete.

Make sure the mcbios.zip file will be uploaded into your /scratch/username directory on Pete. NOTE that you should substitute username with your actual username for the computer. For example, if your username was phoyt, upload the file to /scratch/phoyt.

Then switch to your REMOTE terminal Window logged in to Cowboy (Windows users can use Putty if they want).

For mcbios.zip use:

$ pwd
/home/<username>
$ unzip mcbios.zip

The output should look similar to:

Archive:  mcbios.zip
   creating: mcbios/
   creating: mcbios/abyss/
   creating: mcbios/abyss/abyss31/
  inflating: mcbios/abyss/abyss31/abyssk31.pbs
   creating: mcbios/data/
   creating: mcbios/data/group1/
  inflating: mcbios/data/group1/PE-350.1.fastq
  inflating: mcbios/data/group1/PE-350.2.fastq
  inflating: mcbios/data/group1/ref.fasta
   creating: mcbios/data/group2/
  inflating: mcbios/data/group2/PE-350.1.fastq
  inflating: mcbios/data/group2/PE-350.2.fastq
  inflating: mcbios/data/group2/ref.fasta
   creating: mcbios/data/group3/
  inflating: mcbios/data/group3/PE-350.1.fastq
  inflating: mcbios/data/group3/PE-350.2.fastq
  inflating: mcbios/data/group3/ref.fasta
   creating: mcbios/data/group4/
  inflating: mcbios/data/group4/PE-350.1.fastq
  inflating: mcbios/data/group4/PE-350.2.fastq
  inflating: mcbios/data/group4/ref.fasta
   creating: mcbios/data/group5/
  inflating: mcbios/data/group5/PE-350.1.fastq
  inflating: mcbios/data/group5/PE-350.2.fastq
  inflating: mcbios/data/group5/ref.fasta
   creating: mcbios/results/
  inflating: mcbios/results/quast.pbs
   creating: mcbios/soap/
   creating: mcbios/soap/soap31/
  inflating: mcbios/soap/soap31/soap.config
  inflating: mcbios/soap/soap31/soapk31.pbs
   creating: mcbios/velvet/
  inflating: mcbios/velvet/velvetk31.pbs

Then change into your new mcbios directory and see what is there.

$ cd mcbios/
$ ls

The output should be:

data results soap velvet 

File Structure

What you should now have:


|-- data
|   |-- group1
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   |-- group2
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   |-- group3
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   |-- group4
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   `-- group5
|       |-- PE-350.1.fastq
|       |-- PE-350.2.fastq
|       `-- ref.fasta
|-- results
|   `-- quast.pbs
|-- spades
|   `-- spades31
|       |-- spadesk31.sbatch
|-- velvet
|    |-- velvetk31.pbs
|    `-- velvet-estimate-exp_cov.pl
|-- nucmer
|   `-- nucmer.sbatch

We will dive into each directory for each task: velvet, spades, etc. Most folders contain a submission script which includes the commands that we use for each task. It is always a good idea to use a script so you can modify parameters, and the script also serves as a note to your future self.

Important notes before hands-on

Since we are using the Pete cluster, only very small tasks can be done directly on the login nodes. For each longer activity, we will submit the jobs to the scheduler using “pbs scripts”. These .pbs files are text files that include information for the job scheduler as well as the commands to execute your job.

The data:

Move into your data directory:

$ cd data

Use ls to see what files or directories are present:

$ ls
group1  group2  group3  group4  group5

There are five group directories, so that up to five different groups of students can work on a different chromosome in yeast. You can then compare your results. As an example, everyone should look inside the group1 directory:

$ cd group1
$ ls
ref.fasta  PE-350.1.fastq  PE-350.2.fastq

To see the first read, remember that each read consists of four lines in a fastq file:

$ head -n 4 PE-350.1.fastq
@DRR001841.41/1
AAAAGAATGGAAATCTATGTTTTTATTATTACAAGTTTTGAAGATTGCCAAAGAAATCAAGAATTTCGTGAGATTGAAAGTCATCGGGTC
+
CCCCCCBBCCCCCCCCCCCCCCCBBBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCB@CCCCCCCCCCCBCCCCACCCCABA>@CCAB6<B

By now, you should all be familiar with .fastq sequencer files. Our assembly will start using one Illumina library, PE-350, which is a paired end reads library, divided into two datasets: 1.fastq and 2.fastq, which are the paired reads very close to 350 bp apart. Every read in the 1.fastq file should have a corresponding read in the 2.fastq file. Also, ref.fasta is the included reference sequence for comparison with your assembly.

Review FASTQ format

Each base has quality identifier called PHRED score, typically between value 0-40 (+33) for Illumina. The FASTQ file converts the numeric value to a character because they use less space (fewer bits). There is an older scoring system called PHRED+64, but PHRED+33 is used in almost all modern systems including Illumina protocols. PHRED+64 is used rarely, but be aware!

“Phred quality scores are defined as a property which is logarithmically related to the base-calling error probabilities.”

-Wikipedia.

The basics of PHRED scores are described in this chart:

Quality Score Probability of incorrect base call Base call accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10000 99.99%
50 1 in 100000 99.999%

Almost all modern sequence data uses the PHRED+33 (version 1.8 or newer) base call system.

If a base call has PHRED quality score of 20 (i.e. 1% error), the PHRED+33 system adds 33 to the quality score: PHRED(20)+33=53, which corresponds to and is given the character 5 using the reference chart below (the “S” range) which is what shows up on the fourth line of every read within a fastq file.

The older PHRED+64 system is rarely used anymore so ignore it because although it’s very, very uncommon to see any sequence data using PHRED+64. The table below shows how you can be sure sequence quality scores are using PHRED+33, because the PHRED+33 max score is “J”, and the PHRED+64 minimum score is “@”. If you ever see any quality scores above “J”, contact the person who gave you the sequence data for more information!

PhredScores

Exercise

The base quality in PE-350.1.fastq is encoded in Phred+33 or Phred+64 ?________

The paired end dataset PE-350.1.fastq + PE-350.2.fastq contains a total of ______ reads.

(big hint): Count # of lines using the terminal shell by typing:
wc -l PE-350.1.fastq
and divide by 4

or make the command line do all the arithmetic):

$ expr $(wc -l < PE-350.1.fastq) / 4

How good is your sequencing run? Use FASTQC

Fastqc runs very quickly, so we can enter these commands interactively without submitting them as a “job”.

If you want help with fastqc, you can type: $ fastqc -h

Now perform the fastqc quality control from within your group’s folder:

$ module load fastqc
$ fastqc PE-350.1.fastq

(protip: do this again to the PE-350.2.fastq file by using the up arrow to bring up the last command, then use arrow keys to move over to change 1 to a 2 and press enter.)

fastqc puts the results in the same folder as the data (in this case data/group1/)

Download your fastqc results using scp by reversing the “what you want to move” and “where you want to move it”.

$ scp phoyt@cowboy.hpc.okstate.edu/home/phoyt/mcbios/data/PE-350.1_fastqc.zip .
$ scp phoyt@cowboy.hpc.okstate.edu/home/phoyt/mcbios/data/PE-350.2_fastqc.zip .

Then use your Graphical User Interface to find these “ZIP” files on your Desktop and unzip them. You will find two new folders: PE-350.1_fastqc and PE-350.2_fastqc.

FASTQC generates a HTML report for each FASTQ file you run. In each folder, double-click on the file fastqc_report.html to open it in a browser. This shows a lot of summary information about the sequencing files quality (and the data are pretty good). For more information and to see examples of what bad data look like, read the FASTQC manual when you have time.

Assignment

Find the assignment and submit answers to Canvas