Chapter 1 Loading and Formatting Data
Load required packages:
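A minimal set for this chapter, assuming readr for reading delimited text files, tidyr for reshaping data, and dplyr for general manipulation (swap in whatever packages your own workflow needs):

library(readr)
library(tidyr)
library(dplyr)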
This chapter covers reading data from several common sources:
- Text files (csv, delimited, …)
- Excel
- Relational databases
- Statistical software (SAS, SPSS, Stata)
- Spatial data (shapefiles, rasters, etc.)
Reading data into R is generally the first step in any project workflow. The easiest format to get data into R is a .csv or similar text file. However, there are also functions to read data from Excel files, relational databases, and some proprietary statistical software packages. There are also functions for reading in spatial data such as shapefiles or raster files; this is covered in …
1.1 Rectangular data
Rectangular data refers to data stored as a table of rows and columns, like a matrix or spreadsheet. This is the most common format of data read into R, and it most often arrives as a text file or Excel spreadsheet. The readr package includes several functions for handling delimited text files:
- read_csv: reads comma-separated files,
- read_tsv: reads tab-separated files,
- read_delim: reads any formatted text file with a specified field separator.
The example below reads in a TCEQ SWQMIS generated data file. The fields in the file are pipe-separated (|). Notice that you are not limited to local file paths; you can point to a file with a valid URL. By default, readr tries to guess the variable type for each column. It is tedious, but I suggest specifying the column types with the col_types argument to prevent parsing errors. You can also skip importing certain columns if you choose, which can help speed things up if you are reading in a really large dataset.
# Read a pipe-delimited SWQMIS export directly from a URL.
# Column types are declared explicitly; columns we don't need are skipped.
df <- read_delim(
  file = "https://gist.githubusercontent.com/mps9506/004624b5aa9bdf101c36278835cb38df/raw/46267d403bb450da4f7a0c726bd77d4fff1c5be5/1501_Bacteria.txt",
  delim = "|",
  col_types = cols(
    `RFA(Sample Set ID)/Tag_id` = col_character(),
    Segment = col_character(),
    `Station ID` = col_character(),
    `Station Description` = col_character(),
    `Parameter Code` = col_character(),
    `Parameter Description` = col_character(),
    `Greater Than/Less Than` = col_skip(),
    Value = col_double(),
    `End Date` = col_date(format = "%m/%d/%Y"),
    `End Time` = col_skip(),
    `End Depth` = col_skip(),
    `Start Date` = col_skip(),
    `Start Time` = col_skip(),
    `Start Depth` = col_skip(),
    `Composite Category` = col_skip(),
    `Composite Type` = col_skip(),
    `Submitting Entity` = col_character(),
    `Collecting Entity` = col_character(),
    `Monitoring Type` = col_character(),
    Comments = col_character()
  )
)
df
## # A tibble: 122 x 12
## `RFA(Sample Set… Segment `Station ID` `Station Descri… `Parameter Code`
## <chr> <chr> <chr> <chr> <chr>
## 1 1549521 1501 12515 TRES PALACIOS C… 31701
## 2 R201391 1501 12515 TRES PALACIOS C… 31701
## 3 R212883 1501 12515 TRES PALACIOS C… 31701
## 4 L080888 1501 12515 TRES PALACIOS C… 31701
## 5 L083338 1501 12515 TRES PALACIOS C… 31701
## 6 L086380 1501 12515 TRES PALACIOS C… 31701
## 7 L086923 1501 12515 TRES PALACIOS C… 31701
## 8 L087695 1501 12515 TRES PALACIOS C… 31701
## 9 L092047 1501 12515 TRES PALACIOS C… 31701
## 10 10231848 1501 12515 TRES PALACIOS C… 31701
## # … with 112 more rows, and 7 more variables: `Parameter Description` <chr>,
## # Value <dbl>, `End Date` <date>, `Submitting Entity` <chr>, `Collecting
## # Entity` <chr>, `Monitoring Type` <chr>, Comments <chr>
The read_delim and related functions return an object called a tibble. When you print a tibble to the console, it displays the column names, variable types, and an abbreviated snapshot of the first few rows of data.
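The printed tibble truncates wide tables (note the "7 more variables" above). To see every column and its type at once, dplyr's glimpse() is handy:

glimpse(df)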
1.2 Formatting Data
Most analysis and plotting functions in R prefer something called tidy data. The concept of tidy data can be difficult to grasp because we often don't record data in a tidy format. The table below is an example of how data is often recorded in a spreadsheet: there is a row for each date, then a column of values for flow and for each water quality parameter.
| Date       | Flow | Nitrogen | Phosphorus | Ecoli |
|------------|------|----------|------------|-------|
| 2020-01-01 | 130  | 0.31     | 0.03       | 189   |
| 2020-01-16 | 73   | 0.14     | 0.00       | 146   |
| 2020-01-31 | 212  | 0.08     | 0.01       | 893   |
| 2020-02-15 | 2259 | 0.15     | 0.02       | 77    |
| 2020-03-01 | 122  | 0.17     | 0.01       | 122   |
Tidy data is also called long data. The goal is to get one observation per row and a column for every variable. In this case, we can tidy the data into four columns: Date, Flow, Parameter_Name, and Parameter_Value (shown in the table below). The observation of interest is Parameter_Value, which is unique to each row; Date, Flow, and Parameter_Name now repeat across rows as identifying variables. A sketch of this reshaping with tidyr follows the table.
## # A tibble: 15 x 4
## Date Flow Parameter_Name Parameter_Value
## <date> <dbl> <chr> <dbl>
## 1 2020-01-01 130 Nitrogen 0.31
## 2 2020-01-01 130 Phosphorus 0.03
## 3 2020-01-01 130 Ecoli 189
## 4 2020-01-16 73 Nitrogen 0.14
## 5 2020-01-16 73 Phosphorus 0
## 6 2020-01-16 73 Ecoli 146
## 7 2020-01-31 212 Nitrogen 0.08
## 8 2020-01-31 212 Phosphorus 0.01
## 9 2020-01-31 212 Ecoli 893
## 10 2020-02-15 2259 Nitrogen 0.15
## 11 2020-02-15 2259 Phosphorus 0.02
## 12 2020-02-15 2259 Ecoli 77
## 13 2020-03-01 122 Nitrogen 0.17
## 14 2020-03-01 122 Phosphorus 0.01
## 15 2020-03-01 122 Ecoli 122
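One way to get from the wide table to the long one is tidyr's pivot_longer(). This is a minimal sketch assuming the wide table is stored in an object named df_wide (a name used here only for illustration):

library(tidyr)

# Stack the three parameter columns into name/value pairs;
# Date and Flow are carried along as identifying columns.
df_tidy <- pivot_longer(df_wide,
                        cols = c(Nitrogen, Phosphorus, Ecoli),
                        names_to = "Parameter_Name",
                        values_to = "Parameter_Value")
df_tidy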
Work through real world example
1.3 USGS and EPA Data
The USGS provides functions to retrieve and load streamflow and water quality data directly into R through the dataRetrieval package (De Cicco et al. 2018).
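For example, here is a minimal sketch of pulling daily streamflow from NWIS with readNWISdv(); the gage number and date range below are placeholders, so substitute a site of interest. dataRetrieval also includes functions such as readWQPqw() for retrieving water quality samples from the Water Quality Portal.

library(dataRetrieval)

# Daily mean discharge (USGS parameter code 00060) for a single gage.
# Site number and dates are placeholders for illustration.
flow <- readNWISdv(siteNumbers = "08162500",
                   parameterCd = "00060",
                   startDate   = "2019-01-01",
                   endDate     = "2019-12-31")

head(flow)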
1.4 Online Data Sources
Use APIs (TWDB example)
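Many agencies, including the TWDB, expose data through web APIs that return JSON. Below is a minimal sketch of calling such an API with httr and jsonlite; the URL and query parameters are hypothetical placeholders, not a real TWDB endpoint.

library(httr)
library(jsonlite)

# Hypothetical endpoint and parameters, shown only to illustrate the pattern;
# replace with the documented API URL and query fields for your data source.
resp <- GET("https://example.org/api/v1/reservoir-storage",
            query = list(site = "example-site", format = "json"))

stop_for_status(resp)  # fail loudly if the request did not succeed
dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(dat)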
1.5 Project Workflows
Discuss project structure, storing raw data separately, and using scripts to download and save data locally one time (a sketch of that pattern is shown below).
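A minimal sketch of the one-time download pattern, assuming a data/raw/ folder in the project (the local path is just an example):

# Only download the raw file if it is not already saved locally, so
# re-running the script does not re-fetch the data from the server.
raw_file <- "data/raw/1501_Bacteria.txt"

if (!file.exists(raw_file)) {
  dir.create(dirname(raw_file), recursive = TRUE, showWarnings = FALSE)
  download.file(
    "https://gist.githubusercontent.com/mps9506/004624b5aa9bdf101c36278835cb38df/raw/46267d403bb450da4f7a0c726bd77d4fff1c5be5/1501_Bacteria.txt",
    destfile = raw_file
  )
}

df <- read_delim(raw_file, delim = "|")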
References
De Cicco, Laura A., David Lorenz, Robert M. Hirsch, and William Watkins. 2018. dataRetrieval: R Packages for Discovering and Retrieving Water Data Available from U.S. Federal Hydrologic Web Services (version 2.7.6). Reston, VA: U.S. Geological Survey. https://doi.org/10.5066/P9X4L3GE.