Assembly Tools

Hands-on Genome Assembly Tools

(This workshop-style-lesson is taught on the Pete supercomputer at OSU)

DISCLAIMER: Some of these software (not all) nearly obsolete.

We are using them to demonstrate some processes used in bioinformatics. We want you to know you can do this! But be aware that open source software stops working eventually if is not actively maintained. The software used in this beginner’s introduction are already suffering from “Software collapse” a term introduced in 2019.

Why are we using older software?

Bioinformatics is rapidly developing and most software used for many important bioinformatics processes like assembly, could be updated multiple times before the semester completed! (This is also why we don’t use a textbook)

Online learning repositories such as The Carpentries, including Software Carpentry and Data Carpentry (which this course uses) are needed for their carefully curated lessons, and also for their ability to remain current! Also consider free resources like Cyverse.

Do some research and find the best software for your project if you plan to perform assembly or RNA-Seq in your own experiments. There are many online tutorials but only some are well-maintained. In particular the Galaxy Project is a great place to find current bioinformatics software and tutorials.

For an interesting discussion about scientists using outdated software see this 2018 article Scientists Continue to Use Outdated Methods,

Assembly

Main Point!: Not all assemblers are the same or give the same results. Also note that there are assemblers that use a reference genome to assist in assembly (reference-guided assembly) as opposed to de novo assembly which does not use a genome as a reference.

Which genome assembler should I use?

As inefficient as it seems, it’s generally good to run several assemblers and compare results to find the one best for your data! Also, if there is a good-quality reference genome, you should use it. We are using reference-guided assembly with VELVET, and SPAdes in this lesson. All of these assemblers use de-bruijn (K-mer) graph assembly methods.
Some assembly software are optimized for specific organisms or types of data outputs. In these cases you can safely use the recommended assembler

Genome_size = Total_Kmers / peak_Kmer_coverage

For more information about K-mers and coverage, there is this EXTRA Page that explains why we want to estimate genome sizes. Briefly, K-mers represent a copy of all your sequencing data, broken into small fragments of an exact size.

This is useful because:

Assemblers can be confused by repeated sequences and end up very inaccurate. This gives you a second opinion.

You might not know anything about your genome size!

Imagine you had 30-billion base pairs of sequencing data, and you knew your coverage was “10-fold”, you could then estimate the size of your genome being sequenced was 3-billion base-pairs long (30,000,000,000 ÷ 10 = 3,000,000,000)

We are using several pre-made .sbatch submission scripts, so let’s briefly review the structure of a submission script using the velvetk29.sbatch we’ll use later in this lesson as an example.

VELVET

$ cd ../../velvet/

Understand velvetk31.sbatch - We will use the text-file editor nano to examine our script file, that submits our data to the assemblers. $ nano -w velvetk31.sbatch

Be sure to change the line:

export GROUPNUMBER=1

to your group number. You will need to do this for all the submission scripts today.

The first section of our velvetk31.sbatch script sets up the scheduler requests (the queueing system) and environment variables. The assembler commands are velveth and velvetg. What do these commands do? See the velvet website here.

If you understand the script and have changed the group number, press ctrl-x, save the file and submit it to the queue with the sbatch command as follows.

$ sbatch velvetk31.sbatch

What was printed on the screen after you pressed enter?

Submitted batch job <jobidnumber>

After the job is finished, the log file, your results will be in the velvet31 folder. velvetk31.sbatch.o<jobidnumbmer> is very useful.

The log file, velvetk31.pbs.o<jobidnumbmer> is very useful and contains a lot of the output data we want to look at.

Options to examine that output file:

$ tail velvetk31.sbatch.o<jobidnumber> (HINT: Try using TAB-completion to enter your commands!)

$ less velvetk31.sbatch.o<jobidnumber>

You can scroll through the less interface by pressing the enter key on your keyboard. Also, remember to press q to quit the less interface. But here’s a trick to using less for searching: At the bottom of the less interface, there is a “colon” character : to the left of your cursor. If you press the forward slash button / on your keyboard, you will see the line is now blank. It is waiting for a search pattern! Try entering something like “Paired end library” and press the Enter key.

Now use less tind these information from the log file: Paired end library insert size: _____________

Standard deviation ______________

contig stats:

n50 ________

max contig length _______

total number of contigs __________

reads used ____________ (did it use all the reads?)

Where are the contigs stored? Transfer the results to the results folder!

$ cp velvet31/contigs.fa ../results/velvet31.fasta

Does the assembly get better if I use a different K-mer size?

To try different kmers, first copy your .pbs script with a new name:

$ cp velvetk31.sbatch velvetk21.sbatch
$ nano -w velvetk21.sbatch

and change K to a different value, currently 31. Try 21 first, and then 25. then submit this new job:

$ sbatch velvetk21.sbatch

Your results will be in the velvet31 subfolder.

To see kmer coverage differences in a histogram format, we can use the included Perl script on the velvet stats.txt file in each of your results folders:

$ module load velvet
$ velvet-estimate-exp_cov.pl velvet31/stats.txt
    10 |      1 | **********
<snip>
Predicted expected coverage: 18
velvetg parameters: -exp_cov 18 -cov_cutoff 0

This is a K-mer histogram.

K-mer histogram from slides

You should learn that when using K-mer based assembly methods, “good” K-mers show a peak where coverage is the most accurate. K-mers at low coverage tend to be those with many errors, and so they are ignored (we’re going to ignore them too!). The coverage histogram displays “coverage” vs. the “Number of K-mers”. The number of K-mers at every useful level of coverage are shown.

Here’s another trick: You can use this K-mer histogram to estimate the size of the genome you sequenced using the following formula:

Genome_size = Total_Kmers / peak_Kmer_coverage

Why do we want to estimate genome sizes? Briefly, K-mers represent a copy of all your sequencing data, broken into small fragments of an exact size. Software can estimate your coverage of the genome based on the number of “good” K-mers.

This is useful because:

Assemblers can be confused by repeated sequences and end up very inaccurate. This gives you a second opinion.

You might not know anything about your genome size!

Imagine you had 30-billion base pairs of sequencing data, and you knew your coverage was “10-fold”, you could then estimate the size of your genome being sequenced was 3-billion base-pairs long (30,000,000,000 ÷ 10 = 3,000,000,000)

To calculate “Total_Kmers”, sum all the numbers of all your K-mers. This is (approximately) the total number of all your “good” K-mers.

The formula above estimates the genome size based on K-mer histogram. What is your Group’s genome size estimate? ________ bp.

CONGRATULATIONS on your first assembly!

SPAdes

SPAdes (SPAdes – St. Petersburg genome Assembler) – is a MODERN assembly toolkit. It is one of the most powerful asssembly software because it has several built-in pipelines. SPAdes works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. You can also provide additional contigs that will be used as long reads. SPAdes supports paired-end reads, mate-pairs and unpaired reads. SPAdes can take as input several paired-end and mate-pair libraries simultaneously. Note, that SPAdes was initially designed for small genomes, but is used for many larger genomes when computation power is available. In addition to genome asembly, SPAdes 3.12.0 includes the following additional pipelines:

metaSPAdes – a pipeline for metagenomic data sets (see metaSPAdes options).
plasmidSPAdes – a pipeline for extracting and assembling plasmids from WGS data sets (see plasmidSPAdes options).
rnaSPAdes – a de novo transcriptome assembler from RNA-Seq data (see rnaSPAdes manual).
truSPAdes – a module for TruSeq barcode assembly (see truSPAdes manual).

SPAdes also provides stand-alone programs with relatively simple command-line interfaces for: k-mer counting (spades-kmercounter), assembly graph construction (spades-gbuilder) and a long read to graph aligner (spades-gmapper). So let’s begin:

$ cd ../../spades/

Learn how we run SPAdes by viewing spadesk31.sbatch

$ nano -w spadesk31.sbatch

Be sure to change the line: export GROUPNUMBER=1 to your group number.

SPAdes has too many capabilities built-in to cover everything today, but be sure to read the SPAdes manual!

If you understand the script and have changed the group number, press ctrl-x, save the file and submit it to the queue with the sbatch command as follows.

$ sbatch spadesk31.sbatch

What was printed on the screen after you pressed enter?

Submitted batch job <jobidnumber>

When the job is complete, the log file, spades.log is found in the output directory named “K31” and is very useful. At the top it shows you the command you used. It then shows the software versions you loaded. This is very important for making your results reproducible. Spades will even give you recommendations if it thinks you could get better results using different settings.

To examine the output file:

less spades.log

Find these information from the log file (and remember press q to quit):

What is the Reads length of your sequences: _____________

What is the SPAdes version used? ______________

To find contig stats you need to use less to look at the contigs.paths file

What is the max contig length (“NODE_1” in contigs.paths) _______

Does the assembly get better if I use a different K-mer size?

To try different kmers, first copy your .pbs script with a new name:

$ cp spadesk31.sbatch spadesk21.sbatch
$ nano -w velvetk21.sbatch

and change K to a different value, currently 31. Try 21 first, and then 25. then submit this new job:

$ sbatch velvetk21.sbatch

Now remember to copy your output files to the results folder: $ cp spades31/contigs.fasta ../../results/spades31.fasta

SPAdes doesn’t output an N50 score but it outputs lots of other information. We will see the N50 of the SPAdes output later in this Workshop lesson.

Congratulations! This lesson has shown you how you can assemble genomes using different software, and using different parameters (K-mers, in this lesson). Now we should check our assemblies in a process called “validation” before sending the best assembly to our collaborators! We will learn how to compare the overall assembly of all our assemblers in our NEXT LESSON

Data Science in Omics Introduction

Notes

Hands-on Genome Assembly Tools

DISCLAIMER: Some of these software (not all) nearly obsolete.

Why are we using older software?

Assembly

Genome_size = Total_Kmers / peak_Kmer_coverage

VELVET

Genome_size = Total_Kmers / peak_Kmer_coverage

SPAdes