Loops

stringr

A colleague has produced a file with one DNA sequence on each line. Download the file. Load it into R withread.csv() and name the data frame sequences.

Your colleague wants to calculate the GC content of each DNA sequence (i.e., the percentage of bases that are either G or C) and knows just a little R. They sent you the following code, which will calculate the GC content for a single sequence using the stringr package:

sequence <- "attggc"
num_g <- str_count(sequence, "g")
num_c <- str_count(sequence, "c")
gc_content <- (num_g + num_c) / str_length(sequence) * 100 

Convert the last three lines of this code into a function to calculate the GC content of a DNA sequence. Name that function get_gc_content.
Use a for loop and your function to calculate the GC content of each sequence and store the results in a new data frame using an index. The following code will help you create this data frame:

# create an empty data frame with one row for each sequence
gc_contents <- data.frame(gc_content = numeric(nrow(_______)))

# loop over sequences using an index for the row and
# store the output in the new data frame
for (i in 1:nrow(__________)){
  ________[i,] <- get_gc_content(sequences$____[____])
}

[click here for output]

Data Science in Omics Introduction

Loops

stringr