A colleague wants to do a comparitive analysis of the GC content (i.e., the percentage of bases in a DNA sequence that are either G or C) on data from the PATRIC bacterial phytogenomic database. They want to know the GC content of all of the bacteria in the database and have started working initially with a handful of archaea. They know a little R and wrote the following code to load a single sequence and caculate it’s GC content.
library(ShortRead)
library(Biostrings)
reads <- readFasta("archaea-dna/A-saccharovorans.fasta")
seq <- sread(reads)
base_freq <- alphabetFrequency(seq)
gc_content <- (base_freq[1, "G"] + base_freq[1, "C"]) / sum(base_freq) * 100
The first two lines load a .fasta file using the ShortRead package in Bioconductor. The second two lines determine the frequency of all of the bases in sequence and then calculate the GC content.
Start by installing Bioconductor with the following code (this may take a while, so be patient):
source("https://bioconductor.org/biocLite.R")
biocLite("ShortRead")
biocLite("Biostrings")
Each fasta file contains one long sequence of DNA for an archaea species. The
following code loads one sequence file, where seq
is the variable name for the data
file:
library(ShortRead)
reads <- readFasta("archaea-dna/A-saccharovorans.fasta")
seq <- sread(reads)
Help out your colleagues by downloading
the data and looping over the files
to determine the GC content for each file. Once downloaded, either unzip the
.zip file manually or use the unzip
function.
Use list.files()
, with full.names
set to true, to generate a list of the
names of all the sequence files. Then create a for loop that uses the above code
to read in each sequence file and calculate it’s GC content. Store the resulting
values in a data frame with one column with file names and a second column with
GC contents.