Chapter 1 Loading and Formatting Data

Load required packages:

library(tidyverse)

Text files (csv, delimited, ) Excel Relational databases Stat software (SAS SPSS Stata) Spatial (shapefiles, raster, etc)

Reading data into R is generally the first step in any project workflow. The easiest format to get data into R is from a .csv or similar text file. However, there are functions to read data from Excel, relational databases, and some proprietary statistical software packages. There are also functions for readining in spatial data such as shapefiles or raster files, this is covered in …

1.1 Rectangular data

Rectangular data generally refers to data stored as a table, rows and columns, or matrix. This is probably the most typical format of data we are reading into R. This data most often comes by way of text files or Excel spreadsheets. The readr package includes several functions for handling text files:

  • read_csv: reads comma separated files,
  • read_tsv: reads tab seperated files,
  • read_delim: reads any formatted text file with a specified field separator.

The example below is reading in a TCEQ SWQMIS generated datafile. The fields in the file are pipe seperated (|). Notice you are not limited to local filenames, you can point to files with a valid URL address. By default, readr tries to guess the variable type for each column. It is tedious, but I suggest specifying the column type with the col_type argument to prevent any errors. You can also skip importing certain columns if you choose. This can help speed things up if you are reading in a really large dataset.

df <- read_delim(file = "https://gist.githubusercontent.com/mps9506/004624b5aa9bdf101c36278835cb38df/raw/46267d403bb450da4f7a0c726bd77d4fff1c5be5/1501_Bacteria.txt",
                 delim = "|",
                 col_types = cols(
                   `RFA(Sample Set ID)/Tag_id` = col_character(),
                   Segment = col_character(),
                   `Station ID` = col_character(),
                   `Station Description` = col_character(),
                   `Parameter Code` = col_character(),
                   `Parameter Description` = col_character(),
                   `Greater Than/Less Than` = col_skip(),
                   Value = col_double(),
                   `End Date` = col_date(format = "%m/%d/%Y"),
                   `End Time` = col_skip(),
                   `End Depth` = col_skip(),
                   `Start Date` = col_skip(),
                   `Start Time` = col_skip(),
                   `Start Depth` = col_skip(),
                   `Composite Category` = col_skip(),
                   `Composite Type` = col_skip(),
                   `Submitting Entity` = col_character(),
                   `Collecting Entity` = col_character(),
                   `Monitoring Type` = col_character(),
                   Comments = col_character())
                 )

df
## # A tibble: 122 x 12
##    `RFA(Sample Set… Segment `Station ID` `Station Descri… `Parameter Code`
##    <chr>            <chr>   <chr>        <chr>            <chr>           
##  1 1549521          1501    12515        TRES PALACIOS C… 31701           
##  2 R201391          1501    12515        TRES PALACIOS C… 31701           
##  3 R212883          1501    12515        TRES PALACIOS C… 31701           
##  4 L080888          1501    12515        TRES PALACIOS C… 31701           
##  5 L083338          1501    12515        TRES PALACIOS C… 31701           
##  6 L086380          1501    12515        TRES PALACIOS C… 31701           
##  7 L086923          1501    12515        TRES PALACIOS C… 31701           
##  8 L087695          1501    12515        TRES PALACIOS C… 31701           
##  9 L092047          1501    12515        TRES PALACIOS C… 31701           
## 10 10231848         1501    12515        TRES PALACIOS C… 31701           
## # … with 112 more rows, and 7 more variables: `Parameter Description` <chr>,
## #   Value <dbl>, `End Date` <date>, `Submitting Entity` <chr>, `Collecting
## #   Entity` <chr>, `Monitoring Type` <chr>, Comments <chr>

The read_delim and related functions will produce an object called a tibble. When you print a tibble to the console, it provides the column names, variable types, and an abbreviated snapshot of the first few rows of data.

1.2 Formatting Data

Most analysis and plotting functions in R prefer something called tidy data. The concept of tidy data can be difficult to grasp since we often don’t record data in a tidy format. Table ?? is an example of how data is often recorded in a spreadsheet. Notice there is a row for each date, then columns that have values for each water quality parameter and flow.

Table 1.1: Example of wide data format
Date Flow Nitrogen Phosphorus Ecoli
2020-01-01 130 0.31 0.03 189
2020-01-16 73 0.14 0.00 146
2020-01-31 212 0.08 0.01 893
2020-02-15 2259 0.15 0.02 77
2020-03-01 122 0.17 0.01 122

Tidy data is also called long data. The goal is to get one observation per row and a column for every covariate. In this case, we can tidy the data into four columns: Date, Flow Value, Parameter Name, and Parameter Value (Table ??. The observation of interest is the Parameter_Value which has a unique observation for each row. Date, Flow, and Parameter_name are not unique observations any longer.

## # A tibble: 15 x 4
##    Date        Flow Parameter_Name Parameter_Value
##    <date>     <dbl> <chr>                    <dbl>
##  1 2020-01-01   130 Nitrogen                  0.31
##  2 2020-01-01   130 Phosphorus                0.03
##  3 2020-01-01   130 Ecoli                   189   
##  4 2020-01-16    73 Nitrogen                  0.14
##  5 2020-01-16    73 Phosphorus                0   
##  6 2020-01-16    73 Ecoli                   146   
##  7 2020-01-31   212 Nitrogen                  0.08
##  8 2020-01-31   212 Phosphorus                0.01
##  9 2020-01-31   212 Ecoli                   893   
## 10 2020-02-15  2259 Nitrogen                  0.15
## 11 2020-02-15  2259 Phosphorus                0.02
## 12 2020-02-15  2259 Ecoli                    77   
## 13 2020-03-01   122 Nitrogen                  0.17
## 14 2020-03-01   122 Phosphorus                0.01
## 15 2020-03-01   122 Ecoli                   122

Work through real world example

1.3 USGS and EPA Data

USGS provides R functions to retrieve and load streamflow and water quality data directly to R through the dataRetrieval package (De Cicco et al. 2018).

1.4 Online Data Sources

Use APIs (TWDB example)

1.5 Project Workflows

Discuss project structure, storing raw data seperately, using scripts to download and save data locally one time,

References

De Cicco, Laura A., David Lorenz, Robert M. Hirsch, and William Watkins. 2018. DataRetrieval: R Packages for Discovering and Retrieving Water Data Available from U.s. Federal Hydrologic Web Services (version 2.7.6). Reston, VA: U.S. Geological Survey; U.S. Geological Survey. https://doi.org/10.5066/P9X4L3GE.