Learning Objectives
Following this assignment students should be able to:
- connect to a remote database and execute simple queries
- integrate database and R workflow
- export output data from R to database
- tidy data table with redundant fields or overfilled cells
Reading
-
Topics
- SQL
- Using databases from inside of R
-
Readings
Lecture Notes
Exercises
Simple WHERE (10 pts)
A population biologist (Dr. Undomiel) who studies the population dynamics of Dipodomys spectabilis would like to use some data from the Portal Project, but she doesn’t know how to work with large datasets.
Write a query that returns the month, day, year, and mass of each individual for her species of interest Dipodomys spectabilis (
[click here for output] [click here for output] [click here for output]DSin thespecies_idcolumn).Since there is only one species do no include thespecies_idcolumn in the output. Save this query as a view with the namespectabilis_population_data.COUNT (10 pts)
Write a query that returns the number of individuals trapped in each year. Count the
[click here for output]species_idcolumn so that you only include cases where an individual was identified to species. Name the count columntotal_abundanceand sort it chronologically. Include the year in the output. Save it astotal_abundance_by_year. There should only be one value for each year since this is a count of the individuals across all species in that year.Basic Join (20 pts)
Write a query that returns the
[click here for output]year,month, anddayfor each individual captured as well as it’sgenusandspeciesnames. This can be accomplished by joining thespeciestable to thesurveystable using thespecies_idcolumn in both tables. Save this query asspecies_captures_by_date.Multi-table Join (20 pts)
The
plotstable in the Portal database can be joined to thesurveystable by joiningplot_idtoplot_idand thespeciestable can be joined to thesurveystable by joiningspecies_idtospecies_id.The Portal mammal data include data from a number of different experimental manipulations. You want to do a time-series analysis of the population dynamics of all of the species at the site, taking into account the different experimental manipulations. Write a query that returns the
[click here for output]year,month,day,genusandspeciesof every individual as well as theplot_idandplot_typeof the plot they are captured on. Save this query asspecies_plot_data.Link to Databases (20 pts)
Let’s access an SQL database directly from R using
dplyr.Either use an existing copy of the
portal_mammals.sqlitedatabase or download a new copy. You should be able to link to thesurveystable in the database using:library(DBI) portaldb <- dbConnect(RSQLite::SQLite(), "portal_mammals.sqlite") surveys <- tbl(portaldb, "surveys")surveysis actually a connection to the database, which means that the table remains external to the R environment. Also, we won’t need to worry about it printing out huge numbers of rows when we look at it.- Select the
year,month,day, andspecies_idcolumns in that order. - Create a new data frame with the
year,species_id, and weight in kilograms of each individual, with no null weights. - Use the
distinct()function to print thespecies_idfor each species in the dataset that has been weighed.
- Select the
Copy to Database (20 pts)
Dr. Undómiel has decided to focus on the change in size of a few target rodent species over the course of the experiment(1977-2002). She has chosen Dipodymys spectabilis, Onychomys torridus, Perymiscus erimicus, Chaetodipus penicillatus.
Write a script that uses
dplyrto:- Connect to the
portal_mammals.sqlite. - Generate a data frame with
year,species_id, and the average weight per year (avg_weight) for each target species. You may find the%in% c()construction useful for yourfilter(). - Use
copy_to()to include your new data frame inportal_mammals.sqlite. Call it something informative so that Dr. Undómiel can find it easily. Make sure it remains after the connection is terminated usingtemporary = FALSE.
- Connect to the
