Empirical Streamflow Predictions at Thompsons Creek

##note, need development version of bookdown and gt to properly compile this document

## readr imports data
library(readr)
library(tidyr)
## tibbles are advanced fancy dataframes
library(tibble)
## dplyr for data handling and piping functions
library(dplyr)
## ggplot for plots
library(ggplot2)
## stringr to read and manipulate strings
library(stringr)
## here is a function to ensure file paths are correct
library(here)
## units and ggforce facilitate attaching units to data
library(units)
library(ggforce)
## hrbrtheme is optional, I use it to pretty my plots
library(hrbrthemes)
## patchwork and cowplot support arranging multiple ggplots into one plot
library(patchwork)
library(cowplot)
## lubridate provides functions for handling time and date
library(lubridate)
## purrr lets us use map_ functions as an alternative to loops
library(purrr)
## hydroGOF provide goodness of fit metrics (NSE, RMSE, etc.)
library(hydroGOF)
## tsibble and imputeTS will allow some simple time series interpolation
library(tsibble)
library(imputeTS)
## gt
library(gtsummary)
library(flextable)
## nls.multstart fits non-linear least squares using the Levenberg-Marquardt algorithm with multiple starting values.
library(nls.multstart)
##
library(mgcv)
library(gratia)
## apply drainage area ratio
library(dartx)

## set some options
update_geom_font_defaults(font_rc)
units_options(parse = FALSE)


## some custom functions
theme_ms <- function(...) {
  theme_ipsum_rc(plot_margin = margin(10,10,10,10),
              axis_title_just = "c") +
    theme(legend.position = "bottom",
          panel.background = element_rect(fill = "transparent", 
            colour = NA), 
          panel.border = element_rect(fill = NA, 
            colour = "grey20"),
          plot.title = element_text(size = 11,
                                    face = "plain"),
          ...)
}

exponent <- function(x, pow) {
  (abs(x)^pow)*sign(x)
}

About

This is an exploratory document investigating the feasibility of estimating mean daily streamflows in the Thompsons Creek watershed using empirical methods. In a previous document mean daily streamflows at three sites in the watershed were developed for March 2020 through March 20221 using measured flows, depths, and rating curves. The one-year record of streamflows will be used to fit and validate drainage area ratio, regression, and semi-parametric regression based methods for estimating streamflow using locally available data.

Introduction

Ungaged streamflow estimation

Statistical information transfer methods

Statistical information transfer and empirical regression are two relatively simple methods for estimating streamflows in poorly gaged watersheds using only measured stremaflow values from nearby gaged watersheds. Statistical transfer procedures simply transfer flow duration curves or daily streamflow values from a gaged watershed to the ungaged watershed using assumed relationships between area and runoff. The most common statistical transfer method is the drainage area ratio. With the drainage area ratio, daily streamflows are transferred from one basin to the other by multiplying the area ratio to daily streamflows:

\[\begin{equation} Q_y^t = Q_x^t\bigg(\frac{A_y}{A_x}\bigg)^\phi \tag{1} \end{equation}\]

Where \(Q_y^t\) is streamflow at ungaged basin \(y\) and time \(t\), \(Q_x^t\) is streamflow at gaged basin \(x\) and time \(t\), and \(\frac{A_y}{A_x}\) is the area ratio of the basins. Parameter \(\phi\) is typically equal to one (Asquith, Roussel, and Vrabel 2006). However, Asquith, Roussel, and Vrabel (2006) provides empirically estimated values of \(\phi\) for use in the drainage area ratio when applied in Texas. With an available short term streamflow record available, various stream gages can be assessed for performance using the drainage area ratio method. However, when that short-term streamflow record is available we extend the simple drainage area ratio to a linear regression for streamflow estimation using one or more gaged watersheds:

\[\begin{equation} Q_y = \beta_0 + \beta_n{Q_{xn}} + \varepsilon \tag{2} \end{equation}\]

where \(Q_y\) is the predicted mean daily streamflow at the ungaged or temporarily gaged site, \(\beta_0\) is the intercept, \(Q_{xn}\) are mean daily streamflows at gaged watershed \(n\), \(\beta_n\) is the regression coefficient, and \(varepsilon\) is the residual error term assumed normally distributed around mean zero. A linear regression of this form still acts as a streamflow transfer methods like the drainage area approach but allows for an easy incorporation of additional model terms such as lagged streamflows which might improve predictive performance.

Semi-parametric rainfall-runoff regression

If nearby gaged watersheds are unavailable or are not reflective of the streamflow responses in the ungaged watershed, locally available weather information can be used to empircally estimate streamflow response. A number of empirically based rainfall-runoff routing models are available that account for soil and land-use conditions to predict streamflow response (SIMHYD, IHACRES, and Sacramento rainfall-runoff models are examples). More complex mechanistic models that simulate hydrologic and water quality responses to landuse and precipitation are also available (SWAT and HSPF are two examples). The mechanistic models have a steep requirement for data and ability of the technician developing the model. The routing models and mechanistic models are outside the scope of work for this particular project. Here we focus on using a semi-parametric regression based approach to predict streamflow using locally available weather data. Empirical regression based approaches require a period of measured streamflow and some predictor variables to estimate the runoff response from. Typically, daily rainfall and temperature data are employed to fit a regression model to measured streamflow response:

\[\begin{equation} Q_i = \beta_0 + \beta_1x_i+ \varepsilon_i \tag{3} \end{equation}\]

where \(Q\) is predicted discharge on day \(i\), \(\beta_0\) is the intercept, \(\beta_1\) is the regression coefficient, \(x\) is the predictor variable value on the \(i\)th day. The error term, \(\varepsilon_i\) is assumed normally distributed around mean zero. Using simple linear or multiple linear regression \(Q\) is typically log transformed prior to estimating the regression coefficients. Generalized linear models (GLMs) can instead be used when the data distribution is skewed (more specifically, when the residuals are not expected to be normally distributed). This is accomplished through the inclusion of a link function that describes how the mean of the response depends on the linear predictor and a variance function the describes how variance depends on the mean. Using the GLM and appropriate link function, transformation of the response variable can be avoided. This approach is highly desirable due to biases in the variance structure that are introduced when estimating the means of log transformed data that is intended to be back transformed to the original scale.

If the relationship between predictor variables and the response are expected to be nonlinear, polynomial terms can be included. However, an extension of GLMs called generalized additive models (GAMs) allows relatively easy fitting of these nonlinear terms to the data. With GAMs, the response variable depends on the sum of smoothing functions applied to each predictor variable:

\[\begin{equation} Q_i = \beta_0 + f(x_1) + \varepsilon \tag{4} \end{equation}\]

where \(Q\) is predicted discharge on day \(i\), \(\beta_0\) is the intercept, and some function \(f(x_1)\) is the linear predictor. In the case of GAMs fit using the mgcv package in R, \(f\) is a smoothing function fit to the data using generalized cross validation or restricted maximum likelihood. Due to the smoothing function and link functions, GAMs are extremely flexible for fitting regression models to data of different distributions and responses. However, compared to linear regression and GLMs the inclusion of the smoothing functions limits interpretability because traditional regression coefficients are not part of the model structure. Therefore, the effect of each individual smoothing function on the response variable mean is shown graphically.

Method

Data

Two National Oceanic and Atmospheric Administration (NOAA) Global Historical Climatology Network (GHCN) locations provide daily precipitation data for the project area (Figure 1; Figure 2). GHCND daily summaries were downloaded for GHCND:USW00003904 (Easterwood Airport) using the NOAA API services and the rnoaa package in R (Chamberlain 2019).

easterwood_precip <- read_csv("Data/noaa_precip/easterwood_precip.csv",
                               col_types = cols(
                                 date = col_datetime(format = ""),
                                 datatype = col_character(),
                                 station = col_character(),
                                 value = col_double(),
                                 fl_m = col_character(),
                                 fl_q = col_character(),
                                 fl_so = col_character(),
                                 fl_t = col_character(),
                                 units = col_character())) %>%
  mutate(value =  set_units(value/10, "mm")) %>%
  select(date, datatype, station, value)
  


easterwood_precip %>%
  mutate(date = as.Date(date),
         value = as.numeric(value)) %>%
  ggplot() +
  geom_line(aes(date, value)) +
  geom_point(aes(date, value), alpha = 0) +
  labs(y = "Daily precipitation (mm)",
       x = "Date") +
  facet_wrap(~station, ncol = 1) +
  theme_ms(plot.margin = margin(10,0,10,10),
           panel.spacing = unit(0, "lines"),
           panel.spacing.x = unit(0, "lines"),
           panel.spacing.y = unit(2, "lines")) -> p1

  

easterwood_precip %>%
  mutate(date = as.Date(date),
         value = as.numeric(value)) %>%
  ggplot() +
  geom_histogram(aes(value), binwidth = 5) +
  facet_wrap(~station, ncol = 1) +
  coord_flip() +
  theme_ms() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        plot.background = element_blank(),
        plot.margin = margin(10,10,10,0),
        panel.spacing = unit(0, "lines"),
        panel.spacing.x = unit(0, "lines"),
        panel.spacing.y = unit(2, "lines"),
        strip.text = element_text(color="transparent")) -> p2


plot_grid(p1,p2,align = "h", axis = "bt", rel_widths = c(2, 1))

Figure 1: Daily precipitation and marginal histograms of daily precipitation.

easterwood_tmax <- read_csv("Data/noaa_precip/easterwood_tmax.csv",
                               col_types = cols(
                                 date = col_datetime(format = ""),
                                 datatype = col_character(),
                                 station = col_character(),
                                 value = col_double(),
                                 fl_m = col_character(),
                                 fl_q = col_character(),
                                 fl_so = col_character(),
                                 fl_t = col_character(),
                                 units = col_character())) %>%
  mutate(value =  set_units(value/10, "°C")) %>%
  select(date, datatype, station, value)

easterwood_tmax %>%
  mutate(date = as.Date(date),
         value = as.numeric(value)) %>%
  ggplot() +
  geom_step(aes(date, value), alpha = 0.5) +
  labs(y = "Daily Maximum Temperature [°C]",
       x = "Date") +
  theme_ms(plot.margin = margin(10,0,10,10),
           panel.spacing = unit(0, "lines"),
           panel.spacing.x = unit(0, "lines"),
           panel.spacing.y = unit(2, "lines")) -> p1

easterwood_tmax %>%
  mutate(date = as.Date(date),
         value = as.numeric(value)) %>%
  ggplot() +
  geom_density(aes(value), 
                 binwidth = 1.1,
                 alpha = 0.5) +
  coord_flip() +
  theme_ms() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        plot.background = element_blank(),
        plot.margin = margin(10,10,10,0),
        panel.spacing = unit(0, "lines"),
        panel.spacing.x = unit(0, "lines"),
        panel.spacing.y = unit(2, "lines"),
        strip.text = element_text(color="transparent")) -> p2

plot_grid(p1,p2,align = "h", axis = "bt", rel_widths = c(2, 1))

Figure 2: Daily maximum temperature and marginal histograms of daily maximum temperature.

We identified two wastewater treatment facilities located upstream of the lowest discharge point. The Still Creek WWTF (TPDES Permit No. WQ0010426002) is permitted to discharge 4.0 MGD to Still Creek and discharge flows past SWQM sites 16882 and 16396. The Sanderson Farm Inc. facility (TPDES Permit No. WQ0003821000) is permitted to discharge 1.678 MGD to a tributary of Cottonwood Branch and the discharge flows past SWQM site 16396. Mean daily wastewater facility discharges were downloaded from the EPA ECHO database (Figure 3) using the echor R package and subtracted from the measured mean daily streamflows to better represent naturalized mean daily flows (Schramm 2020).

wwtf <- read_csv("Data/EPA_WWTF/mean_daily_discharges.csv",
                 col_types = cols(
                   npdes_id = col_character(),
                   date = col_date(format = ""),
                   mgd = col_double(),
                   cfs = col_double()))
wwtf %>%
  ggplot() +
  geom_step(aes(date, cfs, color = npdes_id)) +
  scale_x_date(date_breaks = "month",
               date_labels = "%b-%y") +
  scale_color_discrete(labels = c("WQ0010426002, Still Creek WWTF",
                                "WQ0003821000, Sanderson Farm Inc." )) +
  labs(x = "Date", y = "Mean Daily Discharge [cfs]") + 
  theme_ms() +
  guides(x = guide_axis(angle = 90)) +
  theme(axis.line.x = element_blank(),
        axis.ticks.y = element_line(color = "black"),
        axis.ticks.x = element_line(color = "black"),
        axis.ticks.length = grid::unit(5, "pt"),
        legend.title = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank())

Figure 3: Mean daily WWTF discharges

Daily flow estimation

Mean daily streamflows from 2020-03-03 through 2021-01-14 were estimated using rating curves at SWQM stations 16396, 16397, and 16882 (Table 1). Mean daily wastewater discharges were subtracted from the flow record to represent naturalized flows. Mean daily temperature and total daily precipitation were joined to the data record by date. Figure 4 shows the hydrograph of naturalized flows and precipitation at each SWQM station.

Daily naturalized flow was estimated using the drainage area ratio equation (Equation (1)) at USGS stream gages 08065800, 08109800, and 08110100 (Table 1). Values of recommended in Asquith, Roussel, and Vrabel (2006) were used. Linear regression was also used to estimate log transformed flows at each SWQM station using mean daily flows and 1-day lagged mean daily flows from each USGS stream gage as a predictor variable.

dar_table <- tibble(Site = c("SWQM-16396", "SWQM-16397", "SWQM-16882", "USGS-08065800", "USGS-08109800", "USGS-08110100"),
       Description = c("Thompsons Creek at Silver Hill Rd",
                       "Thompsons Creek at Hwy 21",
                       "Still Creek at Hwy 21",
                       "Bedias Creek near Madisonville",
                       "East Yegua Creek near Dime Box",
                       "Davidson Creek near Lyons"),
       Area = c(42.33,24.21,10.03,321,244,195))

kable(dar_table,
      caption = "TCEQ SWQM stations and USGS stream gages used to develop flows with drainage area ratio and linear regression methods.")

Table 1: TCEQ SWQM stations and USGS stream gages used to develop flows with drainage area ratio and linear regression methods.
Site	Description	Area
SWQM-16396	Thompsons Creek at Silver Hill Rd	42.33
SWQM-16397	Thompsons Creek at Hwy 21	24.21
SWQM-16882	Still Creek at Hwy 21	10.03
USGS-08065800	Bedias Creek near Madisonville	321.00
USGS-08109800	East Yegua Creek near Dime Box	244.00
USGS-08110100	Davidson Creek near Lyons	195.00

##naturalize streamflows

df <- read_csv("Data/daily_streamflow/model_df.csv",
               col_types = cols(
                 Site = col_character(),
                 date = col_date(format = ""),
                 mean_daily = col_double()
                 ))


df %>%
  bind_rows(tibble(Site = c(rep("16396",5),rep("16397",5),rep("16882",5)),
                   date = rep(seq.Date(as.Date("2020-02-27"), as.Date("2020-03-02"), by = "day"),3),
                   mean_daily = NA)) %>%
  arrange(date) %>%
  left_join(easterwood_precip, by = c("date" = "date")) %>%
  mutate(value = as.numeric(value)) %>%
  dplyr::rename(ewood_precip = value) %>%
  left_join(easterwood_tmax, by = c("date" = "date")) %>%
  mutate(value = as.numeric(value)) %>%
  dplyr::rename(ewood_tmax = value) %>%
  dplyr::select(Site, date, mean_daily, ewood_precip, ewood_tmax) %>%
  left_join(wwtf %>% pivot_wider(id_cols = date, names_from = npdes_id, values_from = cfs),
            by = c("date" = "date")) %>%
  ## remove WWTF influence from discharge record
  mutate(adjusted_flow = case_when(
    Site == 16396 ~ mean_daily - TX0025071 - TX0113603, 
    Site == 16882 ~ mean_daily - TX0025071,
    Site == 16397 ~ mean_daily)) %>%
  mutate(adjusted_flow = case_when(
    adjusted_flow < 0 ~ 0,
    adjusted_flow >= 0 ~ adjusted_flow)) %>%
  mutate(doy = lubridate::yday(date))-> df

## plot hydrograph of Thompsons @ Silver Hill
p1 <- ggplot(df %>% filter(Site == "16396") %>% mutate(date = as.Date(date))) + 
  geom_line(aes(date, adjusted_flow, color = "Mean Daily Streamflow")) +
  scale_y_continuous(position = "left", 
                  limits = c(0,600),
                  expand = c(0,0)) +
  scale_x_date(date_breaks = "month",
               date_labels = "%b-%y") +
  labs(y = "Mean daily streamflow [cfs]", 
       x = "Date",
       caption = "16396, Thompsons Creek at Silver Hill Rd") +
  scale_color_manual(values = c("dodgerblue")) +
  theme_ms() +
  guides(x = guide_axis(angle = 90)) +
  theme(axis.text.x = element_text(size = 8),
        axis.title.y.left = element_text(hjust = 0),
        axis.ticks.y.left = element_line(color = "black"),
        axis.ticks.x = element_line(color = "black"),
        axis.ticks.length = grid::unit(5, "pt"),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.position = "none") 

p2 <- ggplot(df %>% filter(Site == "16396")) + 
  geom_line(aes(date, ewood_precip, color = "Total Daily Preciptitation")) +
  scale_y_reverse(position = "right", 
                  limits = c(350,0),
                  breaks = c(0,25,50),
                  labels = c(0,25,50),
                  expand = c(0,0)) +
  labs(y = "Total Daily Precipitaiton [mm]") +
  scale_color_manual(values = c("dodgerblue4")) +
  theme_ms() +
  guides(x = guide_axis(angle = 90)) +
  theme(axis.line.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.y.right = element_text(hjust = 1),
    axis.title.x = element_blank(),
    axis.ticks.y.right = element_line(color = "black"),
    axis.ticks.length = grid::unit(5, "pt"),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none"
    ) 
set_null_device("agg")
aligned_plots <- align_plots(p1, p2, align="hv", axis="tblr")
set_null_device("agg")
hg_plot1 <- ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])

## plot hydrograph of Thompsons @ Hwy21
p1 <- ggplot(df %>% filter(Site == "16397") %>% mutate(date = as.Date(date))) + 
  geom_line(aes(date, adjusted_flow, color = "Mean Daily Streamflow")) +
  scale_y_continuous(position = "left", 
                  limits = c(0,75),
                  expand = c(0,0)) +
  scale_x_date(date_breaks = "month",
               date_labels = "%b-%y") +
  labs(y = "Mean daily streamflow [cfs]", 
       x = "Date",
       caption = "16397, Thompsons Creek at Hwy 21") +
  scale_color_manual(values = c("dodgerblue")) +
  theme_ms() +
  guides(x = guide_axis(angle = 90)) +
  theme(axis.text.x = element_text(size = 8),
        axis.title.y.left = element_text(hjust = 0),
        axis.ticks.y.left = element_line(color = "black"),
        axis.ticks.x = element_line(color = "black"),
        axis.ticks.length = grid::unit(5, "pt"),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.position = "none") 

p2 <- ggplot(df %>% filter(Site == "16397")) + 
  geom_line(aes(date, ewood_precip, color = "Total Daily Preciptitation")) +
  scale_y_reverse(position = "right", 
                  limits = c(350,0),
                  breaks = c(0,25,50),
                  labels = c(0,25,50),
                  expand = c(0,0)) +
  labs(y = "Total Daily Precipitaiton [mm]") +
  scale_color_manual(values = c("dodgerblue4")) +
  theme_ms() +
  guides(x = guide_axis(angle = 90)) +
  theme(axis.line.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.y.right = element_text(hjust = 1),
    axis.title.x = element_blank(),
    axis.ticks.y.right = element_line(color = "black"),
    axis.ticks.length = grid::unit(5, "pt"),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none"
    ) 
set_null_device("agg")
aligned_plots <- align_plots(p1, p2, align="hv", axis="tblr")
set_null_device("agg")
hg_plot2 <- ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])


## plot hydrograph of Sill Creek @ Hwy21
p1 <- ggplot(df %>% filter(Site == "16882") %>% mutate(date = as.Date(date))) + 
  geom_line(aes(date, adjusted_flow, color = "Mean Daily Streamflow")) +
  scale_y_continuous(position = "left", 
                  limits = c(0,75),
                  expand = c(0,0)) +
  scale_x_date(date_breaks = "month",
               date_labels = "%b-%y") +
  labs(y = "Mean daily streamflow [cfs]", 
       x = "Date",
       caption = "16882, Sill Creek at Hwy 21") +
  scale_color_manual(values = c("dodgerblue")) +
  theme_ms() +
  guides(x = guide_axis(angle = 90)) +
  theme(axis.text.x = element_text(size = 8),
        axis.title.y.left = element_text(hjust = 0),
        axis.ticks.y.left = element_line(color = "black"),
        axis.ticks.x = element_line(color = "black"),
        axis.ticks.length = grid::unit(5, "pt"),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.position = "none") 

p2 <- ggplot(df %>% filter(Site == "16882")) + 
  geom_line(aes(date, ewood_precip, color = "Total Daily Preciptitation")) +
  scale_y_reverse(position = "right", 
                  limits = c(350,0),
                  breaks = c(0,25,50),
                  labels = c(0,25,50),
                  expand = c(0,0)) +
  labs(y = "Total Daily Precipitaiton [mm]") +
  scale_color_manual(values = c("dodgerblue4")) +
  theme_ms() +
  guides(x = guide_axis(angle = 90)) +
  theme(axis.line.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.y.right = element_text(hjust = 1),
    axis.title.x = element_blank(),
    axis.ticks.y.right = element_line(color = "black"),
    axis.ticks.length = grid::unit(5, "pt"),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none"
    ) 
set_null_device("agg")
aligned_plots <- align_plots(p1, p2, align="hv", axis="tblr")
set_null_device("agg")
hg_plot3 <- ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])


hg_plot1 / hg_plot2 / hg_plot3

Figure 4: Hydrograph of mean daily streamflows at SWQM 16396, 16397, and 16882 with reported total daily precipitation at Easterwood Airport.

GLMs and GAMs relied on local precipitation and temperature data to estimate streamflows at each SWQM station (Figure 1; Figure 2). The GLM was of form:

\[\begin{multline} Q = P + T + DOY + P_{sum,3} + T_{mean,5} + (P\times T) + (P \times P_{sum,3})\\ + (P \times T_{mean,5}) + (T \times P_{sum,3})\\ + (T \times T_{mean,5}) + (P_{sum,3} \times T_{mean,5}) \tag{5} \end{multline}\]

where \(P\) is total daily precipitation and \(T\) is mean daily temperature, and are assumed to be the main forcing variables influencing streamflow. \(DOY\) is day of the year and included as a seasonal predictor. \(P_{sum,3}\) is the rolling 3-day sum rainfall and included as an indicator of wetness in the watershed. \(T_{mean,5}\) is the rolling 5-day mean temperature and included as an indicator of seasonal temperature condition that is less daily variance than \(T\) and maybe an improved indicator of potential evapotransportation conditions. The GLM error structure was fit with a scaled t distribution and a log link function.

The GAM included the same variables but with smoothing functions:

\[\begin{multline} Q = f(P) + f(T) + f(DOY) + f(P_{sum,3}) + f(T_{mean,5})\\ + f(P,T) + f(P, P_{sum,3}) + f(P,T_{mean,5})\\ + f(T,P_{sum,3}) + (T,T_{mean,5}) + f(P_{sum,3},T_{mean,5}) \tag{6} \end{multline}\]

where \(f()\) is the smoothing function. Where a single covariate is smoothed, a thin plate regression spline was fit to the data using restricted maximum likelihood to estimated the automatically select the optimal smoothing parameters. For smoothing functions applied to two parameters, a tensor product smooth function was applied. The GAM error structure was fit using the scaled t distribution and a log link function.

Performance of predicted streamflows over the period of record for each method was assessed using Nash-Sutcliffe Efficiency (NSE) and Kling-Gupta Efficiency (KGE) goodness-of-fit metrics.

Results

## create a model df for 16396
## additional variables = 
## lagPrecip = 1 day lag precipitation
## wetness index = 3 day total precip
## ET index = 5 day avg tmax


df %>%
  dplyr::filter(Site == "16396") %>%
  arrange(date) -> df_16396

df_16396 %>%
  mutate(lagPrecip = lag(ewood_precip),
         wetness = map(row_number(.$date),
                    ~{if(.x - 1 <= 0) {df_16396$ewood_precip[.x]}
                      if(.x - 2 <= 0) {sum(df_16396$ewood_precip[.x],
                                           df_16396$ewood_precip[.x-1])}
                      if(.x - 2 > 0) {
                       sum(
                        df_16396$ewood_precip[.x],
                        df_16396$ewood_precip[.x-1],
                        df_16396$ewood_precip[.x-2],
                        na.rm = TRUE
                        ) 
                      }}),
         et = map(row_number(.$date),
                    ~{if(.x - 1 <= 0) {df_16396$ewood_tmax[.x]}
                      if(.x - 2 <= 0) {df_16396$ewood_tmax[.x-1]}
                      if(.x - 3 <= 0) { mean(c(df_16396$ewood_tmax[.x-1],
                                            df_16396$ewood_tmax[.x-2]),
                                            na.rm = TRUE)}
                      if(.x - 4 <= 0) { mean(c(df_16396$ewood_tmax[.x-1],
                                            df_16396$ewood_tmax[.x-2],
                                            df_16396$ewood_tmax[.x-3]),
                                            na.rm = TRUE)}
                      if(.x - 5 <= 0) { mean(c(df_16396$ewood_tmax[.x-1],
                                            df_16396$ewood_tmax[.x-2],
                                            df_16396$ewood_tmax[.x-3],
                                            df_16396$ewood_tmax[.x-4]),
                                            na.rm = TRUE)}
                      mean(c(df_16396$ewood_tmax[.x-1],
                        df_16396$ewood_tmax[.x-2],
                        df_16396$ewood_tmax[.x-3],
                        df_16396$ewood_tmax[.x-4],
                        df_16396$ewood_tmax[.x-5]),
                        na.rm = TRUE)})) %>%
  unnest(c(wetness, et)) %>%
  dplyr::filter(!is.na(mean_daily)) -> df_16396


## create a model df for 16397
## additional variable = 
## wetness index = 3 day total precip
## ET index = 5 day avg tmax

df %>%
  dplyr::filter(Site == "16397") %>%
  arrange(date) -> df_16397

df_16397 %>%
  mutate(lagPrecip = lag(ewood_precip),
         wetness = map(row_number(.$date),
                    ~{if(.x - 1 <= 0) {df_16397$ewood_precip[.x]}
                      if(.x - 2 <= 0) {sum(df_16397$ewood_precip[.x],
                                           df_16397$ewood_precip[.x-1])}
                      if(.x - 2 > 0) {
                       sum(
                        df_16397$ewood_precip[.x],
                        df_16397$ewood_precip[.x-1],
                        df_16397$ewood_precip[.x-2],
                        na.rm = TRUE
                        ) 
                      }}),
         et = map(row_number(.$date),
                    ~{if(.x - 1 <= 0) {df_16397$ewood_tmax[.x]}
                      if(.x - 2 <= 0) {df_16397$ewood_tmax[.x-1]}
                      if(.x - 3 <= 0) { mean(c(df_16397$ewood_tmax[.x-1],
                                            df_16397$ewood_tmax[.x-2]),
                                            na.rm = TRUE)}
                      if(.x - 4 <= 0) { mean(c(df_16397$ewood_tmax[.x-1],
                                            df_16397$ewood_tmax[.x-2],
                                            df_16397$ewood_tmax[.x-3]),
                                            na.rm = TRUE)}
                      if(.x - 5 <= 0) { mean(c(df_16397$ewood_tmax[.x-1],
                                            df_16397$ewood_tmax[.x-2],
                                            df_16397$ewood_tmax[.x-3],
                                            df_16397$ewood_tmax[.x-4]),
                                            na.rm = TRUE)}
                      mean(c(df_16397$ewood_tmax[.x-1],
                        df_16397$ewood_tmax[.x-2],
                        df_16397$ewood_tmax[.x-3],
                        df_16397$ewood_tmax[.x-4],
                        df_16397$ewood_tmax[.x-5]),
                        na.rm = TRUE)})) %>%
  unnest(c(wetness, et)) %>%
  dplyr::filter(!is.na(mean_daily)) -> df_16397

## create a model df for 16882
## additional variable = 
## wetness index = 3 day total precip
## ET index = 5 day avg tmax

df %>%
  dplyr::filter(Site == "16882") %>%
  arrange(date) -> df_16882

df_16882 %>%
  mutate(lagPrecip = lag(ewood_precip),
         wetness = map(row_number(.$date),
                    ~{if(.x - 1 <= 0) {df_16882$ewood_precip[.x]}
                      if(.x - 2 <= 0) {sum(df_16882$ewood_precip[.x],
                                           df_16882$ewood_precip[.x-1])}
                      if(.x - 2 > 0) {
                       sum(
                        df_16882$ewood_precip[.x],
                        df_16882$ewood_precip[.x-1],
                        df_16882$ewood_precip[.x-2],
                        na.rm = TRUE
                        ) 
                      }}),
         et = map(row_number(.$date),
                    ~{if(.x - 1 <= 0) {df_16882$ewood_tmax[.x]}
                      if(.x - 2 <= 0) {df_16882$ewood_tmax[.x-1]}
                      if(.x - 3 <= 0) { mean(c(df_16882$ewood_tmax[.x-1],
                                            df_16882$ewood_tmax[.x-2]),
                                            na.rm = TRUE)}
                      if(.x - 4 <= 0) { mean(c(df_16882$ewood_tmax[.x-1],
                                            df_16882$ewood_tmax[.x-2],
                                            df_16882$ewood_tmax[.x-3]),
                                            na.rm = TRUE)}
                      if(.x - 5 <= 0) { mean(c(df_16882$ewood_tmax[.x-1],
                                            df_16882$ewood_tmax[.x-2],
                                            df_16882$ewood_tmax[.x-3],
                                            df_16882$ewood_tmax[.x-4]),
                                            na.rm = TRUE)}
                      mean(c(df_16882$ewood_tmax[.x-1],
                        df_16882$ewood_tmax[.x-2],
                        df_16882$ewood_tmax[.x-3],
                        df_16882$ewood_tmax[.x-4],
                        df_16882$ewood_tmax[.x-5]),
                        na.rm = TRUE)})) %>%
  unnest(c(wetness, et)) %>%
  dplyr::filter(!is.na(mean_daily)) %>%
  mutate(non_zero = case_when(
    adjusted_flow == 0 ~ 0,
    adjusted_flow > 0 ~ 1
  ))-> df_16882

Statistical Information Transfer

DAR

Mean daily streamflows estimated using DAR applied to three different USGS streamgages are displayed in Figure @ref[fig:dar16396results]. All three USGS gages results in biased results at low streamflows (consistently underpredicted streamflows). The hydrographs indicate the timing and magnitude of stormflow events are routinely mismatched using any of the three USGS gages. This visual validation suggests DAR method is not suitable in the Thompsons Creek watershed.

## apply DAR to 16396 using each select USGS gage
usgs_08065800 <- readr::read_csv("Data/USGS_Streamflow/08065800.csv") %>%
  dplyr::rename(Flow_08065800 = "133946_00060_00003") %>%
  dplyr::select(datetime, Flow_08065800)

usgs_08109800 <- readr::read_csv("Data/USGS_Streamflow/08109800.csv") %>%
  dplyr::rename(Flow_08109800 = `135356_00060_00003`) %>%
  dplyr::select(datetime, Flow_08109800)

usgs_08110100 <- readr::read_csv("Data/USGS_Streamflow/08110100.csv") %>%
  dplyr::rename(Flow_08110100 = `135389_00060_00003`) %>%
  dplyr::select(datetime, Flow_08110100)


df_16396 %>%
  dplyr::select(Site, date, adjusted_flow) %>%
  left_join(usgs_08065800, by = c("date" = "datetime")) %>%
  dartx(Flow_08065800, 42.33/321) %>%
  dplyr::rename(DAR_Q_08065800 = Q,
                Q_percentile_08065800 = Q_percentile,
                exp_08065800 = exp) %>%
  left_join(usgs_08109800,  by = c("date" = "datetime")) %>%
  dartx(Flow_08109800, 42.33/244) %>%
  dplyr::rename(DAR_Q_08109800 = Q,
                Q_percentile_08109800 = Q_percentile,
                exp_08109800 = exp) %>%
  left_join(usgs_08110100,  by = c("date" = "datetime")) %>%
  dartx(Flow_08110100, 42.33/195) %>%
  dplyr::rename(DAR_Q_08110100 = Q,
                Q_percentile_08110100 = Q_percentile,
                exp_08110100 = exp) -> dar_results_16396

df_16882 %>%
  dplyr::select(Site, date, adjusted_flow) %>%
  left_join(usgs_08065800, by = c("date" = "datetime")) %>%
  dartx(Flow_08065800, 24.21/321) %>%
  dplyr::rename(DAR_Q_08065800 = Q,
                Q_percentile_08065800 = Q_percentile,
                exp_08065800 = exp) %>%
  left_join(usgs_08109800,  by = c("date" = "datetime")) %>%
  dartx(Flow_08109800, 24.21/244) %>%
  dplyr::rename(DAR_Q_08109800 = Q,
                Q_percentile_08109800 = Q_percentile,
                exp_08109800 = exp) %>%
  left_join(usgs_08110100,  by = c("date" = "datetime")) %>%
  dartx(Flow_08110100, 24.21/195) %>%
  dplyr::rename(DAR_Q_08110100 = Q,
                Q_percentile_08110100 = Q_percentile,
                exp_08110100 = exp) -> dar_results_16882

df_16397 %>%
  dplyr::select(Site, date, adjusted_flow) %>%
  left_join(usgs_08065800, by = c("date" = "datetime")) %>%
  dartx(Flow_08065800, 10.03/321) %>%
  dplyr::rename(DAR_Q_08065800 = Q,
                Q_percentile_08065800 = Q_percentile,
                exp_08065800 = exp) %>%
  left_join(usgs_08109800,  by = c("date" = "datetime")) %>%
  dartx(Flow_08109800, 10.03/244) %>%
  dplyr::rename(DAR_Q_08109800 = Q,
                Q_percentile_08109800 = Q_percentile,
                exp_08109800 = exp) %>%
  left_join(usgs_08110100,  by = c("date" = "datetime")) %>%
  dartx(Flow_08110100, 10.03/195) %>%
  dplyr::rename(DAR_Q_08110100 = Q,
                Q_percentile_08110100 = Q_percentile,
                exp_08110100 = exp) -> dar_results_16397

dar_results_16396 %>%
  dplyr::select(Site, date, adjusted_flow, DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100) %>%
  pivot_longer(c(DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100),
               names_to = "Source_Site",
               values_to = "Estimated_Flow") %>%
  mutate(Source_Site = case_when(
    Source_Site == "DAR_Q_08065800" ~ "Estimated flow using USGS-08065800",
    Source_Site == "DAR_Q_08109800" ~ "Estimated flow using USGS-08109800",
    Source_Site == "DAR_Q_08110100" ~ "Estimated flow using USGS-08110100"
  )) %>%
  ggplot() +
  geom_point(aes(Estimated_Flow, adjusted_flow, color = "DAR Estimates against Naturalized Flow"), alpha = 0.3) +
  geom_smooth(aes(Estimated_Flow, adjusted_flow, color = "DAR Estimates against Naturalized Flow"), 
              method = "lm", se = FALSE, alpha = 0.3) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  facet_wrap(~Source_Site, scales = "free", ncol = 1) +
  scale_x_continuous(trans = "pseudo_log") + scale_y_continuous(trans = "pseudo_log") +
  labs(x = "DAR Estimated Flow [cfs]", y = "Naturalized Flow [cfs]") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  theme_ms() + 
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical") -> p1

dar_results_16396 %>%
  dplyr::select(Site, date, DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100) %>%
  pivot_longer(c(DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100),
               names_to = "Source_Site",
               values_to = "Estimated_Flow") %>%
  mutate(Source_Site = case_when(
    Source_Site == "DAR_Q_08065800" ~ "Estimated flow using USGS-08065800",
    Source_Site == "DAR_Q_08109800" ~ "Estimated flow using USGS-08109800",
    Source_Site == "DAR_Q_08110100" ~ "Estimated flow using USGS-08110100"
  )) %>%
  ggplot() +
  geom_line(aes(date, Estimated_Flow, color = "DAR Estimated Flow")) +
  geom_line(data = dar_results_16396, aes(date, adjusted_flow, color = "Naturalized Flow SWQM 16396"), alpha = 0.4) +
  facet_wrap(~Source_Site, scales = "free_y", ncol = 1) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() + 
  theme(legend.title = element_blank(),
        legend.direction = "vertical") -> p2

p1 + p2

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Estimated streamflows at SWQM 16396 using DAR at three USGS gages plotted against measured streamflows and over time.

dar_results_16397 %>%
  dplyr::select(Site, date, adjusted_flow, DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100) %>%
  pivot_longer(c(DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100),
               names_to = "Source_Site",
               values_to = "Estimated_Flow") %>%
  mutate(Source_Site = case_when(
    Source_Site == "DAR_Q_08065800" ~ "Estimated flow using USGS-08065800",
    Source_Site == "DAR_Q_08109800" ~ "Estimated flow using USGS-08109800",
    Source_Site == "DAR_Q_08110100" ~ "Estimated flow using USGS-08110100"
  )) %>%
  ggplot() +
  geom_point(aes(Estimated_Flow, adjusted_flow, color = "DAR Estimates against Naturalized Flow"), alpha = 0.3) +
  geom_smooth(aes(Estimated_Flow, adjusted_flow, color = "DAR Estimates against Naturalized Flow"), 
              method = "lm", se = FALSE, alpha = 0.3) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  facet_wrap(~Source_Site, scales = "free", ncol = 1) +
  scale_x_continuous(trans = "pseudo_log") + scale_y_continuous(trans = "pseudo_log") +
  labs(x = "DAR Estimated Flow [cfs]", y = "Naturalized Flow [cfs]") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  theme_ms() + 
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical") -> p1

dar_results_16397 %>%
  dplyr::select(Site, date, DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100) %>%
  pivot_longer(c(DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100),
               names_to = "Source_Site",
               values_to = "Estimated_Flow") %>%
  mutate(Source_Site = case_when(
    Source_Site == "DAR_Q_08065800" ~ "Estimated flow using USGS-08065800",
    Source_Site == "DAR_Q_08109800" ~ "Estimated flow using USGS-08109800",
    Source_Site == "DAR_Q_08110100" ~ "Estimated flow using USGS-08110100"
  )) %>%
  ggplot() +
  geom_line(aes(date, Estimated_Flow, color = "DAR Estimated Flow")) +
  geom_line(data = dar_results_16397, aes(date, adjusted_flow, color = "Naturalized Flow SWQM 16397"), alpha = 0.4) +
  facet_wrap(~Source_Site, scales = "free_y", ncol = 1) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() + 
  theme(legend.title = element_blank(),
        legend.direction = "vertical") -> p2

p1 + p2

## `geom_smooth()` using formula 'y ~ x'

Figure 6: Estimated streamflows at SWQM 16397 using DAR at three USGS gages plotted against measured streamflows and over time.

dar_results_16882 %>%
  dplyr::select(Site, date, adjusted_flow, DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100) %>%
  pivot_longer(c(DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100),
               names_to = "Source_Site",
               values_to = "Estimated_Flow") %>%
  mutate(Source_Site = case_when(
    Source_Site == "DAR_Q_08065800" ~ "Estimated flow using USGS-08065800",
    Source_Site == "DAR_Q_08109800" ~ "Estimated flow using USGS-08109800",
    Source_Site == "DAR_Q_08110100" ~ "Estimated flow using USGS-08110100"
  )) %>%
  ggplot() +
  geom_point(aes(Estimated_Flow, adjusted_flow, color = "DAR Estimates against Naturalized Flow"), alpha = 0.3) +
  geom_smooth(aes(Estimated_Flow, adjusted_flow, color = "DAR Estimates against Naturalized Flow"), 
              method = "lm", se = FALSE, alpha = 0.3) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  facet_wrap(~Source_Site, scales = "free", ncol = 1) +
  scale_x_continuous(trans = "pseudo_log") + scale_y_continuous(trans = "pseudo_log") +
  labs(x = "DAR Estimated Flow [cfs]", y = "Naturalized Flow [cfs]") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  theme_ms() + 
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical") -> p1

dar_results_16882 %>%
  dplyr::select(Site, date, DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100) %>%
  pivot_longer(c(DAR_Q_08065800, DAR_Q_08109800, DAR_Q_08110100),
               names_to = "Source_Site",
               values_to = "Estimated_Flow") %>%
  mutate(Source_Site = case_when(
    Source_Site == "DAR_Q_08065800" ~ "Estimated flow using USGS-08065800",
    Source_Site == "DAR_Q_08109800" ~ "Estimated flow using USGS-08109800",
    Source_Site == "DAR_Q_08110100" ~ "Estimated flow using USGS-08110100"
  )) %>%
  ggplot() +
  geom_line(aes(date, Estimated_Flow, color = "DAR Estimated Flow")) +
  geom_line(data = dar_results_16882, aes(date, adjusted_flow, color = "Naturalized Flow SWQM 16882"), alpha = 0.4) +
  facet_wrap(~Source_Site, scales = "free_y", ncol = 1) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() + 
  theme(legend.title = element_blank(),
        legend.direction = "vertical") -> p2

p1 + p2

## `geom_smooth()` using formula 'y ~ x'

Figure 7: Estimated streamflows at SWQM 16882 using DAR at three USGS gages plotted against measured streamflows and over time.

Linear Regression

df_16396 %>%
  dplyr::select(Site, date, adjusted_flow) %>%
  left_join(usgs_08065800, by = c("date" = "datetime")) %>%
  left_join(usgs_08109800,  by = c("date" = "datetime")) %>%
  left_join(usgs_08110100,  by = c("date" = "datetime")) %>%
  mutate(lag_Flow_08065800 = lag(Flow_08065800),
         lag_Flow_08109800 = lag(Flow_08109800),
         lag_Flow_08110100 = lag(Flow_08110100),
         log_Q = log1p(adjusted_flow)) -> df_16396_lm

m1.lm <- lm(log_Q ~ Flow_08065800 + 
              Flow_08109800 + 
              Flow_08110100 + 
              lag_Flow_08065800 + 
              lag_Flow_08109800 + 
              lag_Flow_08110100,
            data = df_16396_lm)


flextable::as_flextable(m1.lm) %>%
  set_caption("Summary of linear regression coefficients at SWQM 16396")

Table 2: Summary of linear regression coefficients at SWQM 16396
	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	1.83e+00	4.43e-02	41.1913	2.052e-126	***
Flow_08065800	1.79e-04	9.58e-05	1.8656	6.305e-02	.
Flow_08109800	1.69e-02	2.25e-03	7.5366	5.578e-13	***
Flow_08110100	5.16e-03	8.31e-04	6.2130	1.708e-09	***
lag_Flow_08065800	7.85e-06	9.50e-05	0.0827	9.342e-01
lag_Flow_08109800	-4.97e-03	2.23e-03	-2.2311	2.640e-02	*
lag_Flow_08110100	-1.76e-03	8.41e-04	-2.0984	3.669e-02	*
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05 < '.' < 0.1 < '' < 1

Residual standard error: 0.7083 on 304 degrees of freedom
Multiple R-squared: 0.4088, Adjusted R-squared: 0.3972
F-statistic: 35.04 on 304 and 6 DF, p-value: 0.0000

df_16397 %>%
  dplyr::select(Site, date, adjusted_flow) %>%
  left_join(usgs_08065800, by = c("date" = "datetime")) %>%
  left_join(usgs_08109800,  by = c("date" = "datetime")) %>%
  left_join(usgs_08110100,  by = c("date" = "datetime")) %>%
  mutate(lag_Flow_08065800 = lag(Flow_08065800),
         lag_Flow_08109800 = lag(Flow_08109800),
         lag_Flow_08110100 = lag(Flow_08110100),
         log_Q = log1p(adjusted_flow)) -> df_16397_lm

m2.lm <- lm(log_Q ~ Flow_08065800 + 
              Flow_08109800 + 
              Flow_08110100 + 
              lag_Flow_08065800 + 
              lag_Flow_08109800 + 
              lag_Flow_08110100,
            data = df_16397_lm)


flextable::as_flextable(m2.lm) %>%
  set_caption("Summary of linear regression coefficients at SWQM 16397")

Table 3: Summary of linear regression coefficients at SWQM 16397
	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	8.46e-01	1.48e-02	57.074	1.691e-164	***
Flow_08065800	2.23e-05	3.20e-05	0.695	4.877e-01
Flow_08109800	5.64e-03	7.51e-04	7.509	6.650e-13	***
Flow_08110100	2.30e-03	2.78e-04	8.296	3.547e-15	***
lag_Flow_08065800	5.52e-05	3.17e-05	1.739	8.304e-02	.
lag_Flow_08109800	-2.96e-03	7.45e-04	-3.970	8.990e-05	***
lag_Flow_08110100	-8.21e-04	2.81e-04	-2.920	3.760e-03	**
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05 < '.' < 0.1 < '' < 1

Residual standard error: 0.2368 on 304 degrees of freedom
Multiple R-squared: 0.4337, Adjusted R-squared: 0.4226
F-statistic: 38.81 on 304 and 6 DF, p-value: 0.0000

df_16882 %>%
  dplyr::select(Site, date, adjusted_flow) %>%
  left_join(usgs_08065800, by = c("date" = "datetime")) %>%
  left_join(usgs_08109800,  by = c("date" = "datetime")) %>%
  left_join(usgs_08110100,  by = c("date" = "datetime")) %>%
  mutate(lag_Flow_08065800 = lag(Flow_08065800),
         lag_Flow_08109800 = lag(Flow_08109800),
         lag_Flow_08110100 = lag(Flow_08110100),
         log_Q = log1p(adjusted_flow)) -> df_16882_lm

m3.lm <- lm(log_Q ~ Flow_08065800 + 
              Flow_08109800 + 
              Flow_08110100 + 
              lag_Flow_08065800 + 
              lag_Flow_08109800 + 
              lag_Flow_08110100,
            data = df_16882_lm)


flextable::as_flextable(m3.lm) %>%
  set_caption("Summary of linear regression coefficients at SWQM 16882")

Table 4: Summary of linear regression coefficients at SWQM 16882
	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	3.12e-01	0.049030	6.3536	7.497e-10	***
Flow_08065800	-9.30e-06	0.000107	-0.0868	9.309e-01
Flow_08109800	1.81e-02	0.002513	7.1853	5.022e-12	***
Flow_08110100	3.65e-03	0.000929	3.9333	1.034e-04	***
lag_Flow_08065800	6.80e-05	0.000106	0.6405	5.223e-01
lag_Flow_08109800	-8.79e-03	0.002494	-3.5244	4.885e-04	***
lag_Flow_08110100	-2.54e-03	0.000940	-2.7016	7.280e-03	**
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05 < '.' < 0.1 < '' < 1

Residual standard error: 0.7923 on 310 degrees of freedom
Multiple R-squared: 0.2595, Adjusted R-squared: 0.2452
F-statistic: 18.11 on 310 and 6 DF, p-value: 0.0000

df_16396_lm %>%
  mutate(fits = as.numeric(predict(m1.lm, newdata = .))) %>%
  mutate(fits = expm1(fits)) -> df_16396_lm


df_16397_lm %>%
  mutate(fits = as.numeric(predict(m2.lm, newdata = .))) %>%
  mutate(fits = expm1(fits)) -> df_16397_lm

df_16882_lm %>%
  mutate(fits = as.numeric(predict(m3.lm, newdata = .))) %>%
  mutate(fits = expm1(fits)) -> df_16882_lm

df_16396_lm %>%
  ggplot() +
  geom_point(aes(fits, adjusted_flow, color = "Regression Estimates against Naturalized Flow"),
             alpha =0.4) +
  geom_smooth(aes(fits, adjusted_flow, color = "Regression Estimates against Naturalized Flow"),
              method = "lm", se = FALSE) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Regression Estimated Flow [cfs]", y = "Naturalized Flow [cfs]",
       title = "SWQM-16396") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  coord_equal() +
  theme_ms() +
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical",
        legend.position = "none")-> p1

df_16396_lm %>%
  ggplot() +
  geom_line(aes(date, fits, color = "Linear Regression Estimated Flow")) +
  geom_line(aes(date, adjusted_flow, color = "Naturalized Flow SWQM 16396"), alpha = 0.4) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() +
  theme(legend.title = element_blank(),
        legend.direction = "vertical",
        legend.position = "none") -> p2


df_16397_lm %>%
  ggplot() +
  geom_point(aes(fits, adjusted_flow, color = "Regression Estimates against Naturalized Flow"),
             alpha =0.4) +
  geom_smooth(aes(fits, adjusted_flow, color = "Regression Estimates against Naturalized Flow"),
              method = "lm", se = FALSE) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Regression Estimated Flow [cfs]", y = "Naturalized Flow [cfs]",
       title = "SWQM-16397") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  coord_equal() +
  theme_ms() +
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical",
        legend.position = "none")-> p3

df_16397_lm %>%
  ggplot() +
  geom_line(aes(date, fits, color = "Linear Regression Estimated Flow")) +
  geom_line(aes(date, adjusted_flow, color = "Naturalized Flow SWQM 16397"), alpha = 0.4) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() +
  theme(legend.title = element_blank(),
        legend.direction = "vertical",
        legend.position = "none") -> p4

df_16882_lm %>%
  ggplot() +
  geom_point(aes(fits, adjusted_flow, color = "Regression Estimates against Naturalized Flow"),
             alpha =0.4) +
  geom_smooth(aes(fits, adjusted_flow, color = "Regression Estimates against Naturalized Flow"),
              method = "lm", se = FALSE) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Regression Estimated Flow [cfs]", y = "Naturalized Flow [cfs]",
       title = "SWQM-16882") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  coord_equal() +
  theme_ms() +
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical")-> p5

df_16882_lm %>%
  ggplot() +
  geom_line(aes(date, fits, color = "Linear Regression Estimated Flow")) +
  geom_line(aes(date, adjusted_flow, color = "Naturalized Flow"), alpha = 0.4) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() +
  theme(legend.title = element_blank(),
        legend.direction = "vertical") -> p6

(p1 + p2)/(p3 + p4)/(p5 + p6)

Figure 8: Linear regression estimated mean daily streamflows plotted against measured mean daily streamflows and over time.

GAM

m1.gam <- gam(adjusted_flow ~
            s(ewood_precip, bs = "ts", k = 5) +
            s(ewood_tmax, bs = "ts", k = 5) +
            s(lagPrecip, bs = "ts", k = 5) +
            s(doy, bs = "cc", k = 8) +
            s(wetness, bs = "ts", k = 5) +
            s(et, bs = "ts", k = 5) +
              ti(ewood_precip, lagPrecip, k = 5) +
              ti(ewood_precip, ewood_tmax, k = 5) +
              ti(ewood_precip, wetness, k = 5) +
              ti(ewood_precip, et, k = 5) +
              ti(ewood_tmax, wetness, k = 5) +
              ti(ewood_tmax, et, k = 5) +
              ti(wetness, et, k = 5),
          data = df_16396,
          select = TRUE,
          family = scat(link = "log"),
          method = "REML",
          control = gam.control(nthreads = 2))
flextable::as_flextable(m1.gam) %>%
  flextable::set_caption("Summary of GAM coefficients at SWQM 16396")

Table 5: Summary of GAM coefficients at SWQM 16396
	Estimate	Standard Error	z value	Pr(>\|z\|)	Signif.
s(ewood_precip)			8.17	0.000e+00	***
s(ewood_tmax)			9.40	6.247e-06	***
s(lagPrecip)			342.94	0.000e+00	***
s(doy)			333.11	0.000e+00	***
s(wetness)			50.00	0.000e+00	***
s(et)			1.98	2.336e-02	*
ti(ewood_precip,lagPrecip)			19.35	7.282e-07	***
ti(ewood_precip,ewood_tmax)			130.02	0.000e+00	***
ti(ewood_precip,wetness)			139.59	0.000e+00	***
ti(ewood_precip,et)			779.63	0.000e+00	***
ti(ewood_tmax,wetness)			32.16	0.000e+00	***
ti(ewood_tmax,et)			40.48	0.000e+00	***
ti(wetness,et)			451.08	0.000e+00	***
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05 < '.' < 0.1 < '' < 1

(Dispersion parameter for Scaled t(3.001,1.175) family taken to be 1)
Null deviance: NULL on NULL degrees of freedom
Residual deviance: NULL on NULL degrees of freedom

m2.gam <- gam(adjusted_flow ~
            s(ewood_precip, bs = "ts", k = 5) +
            s(ewood_tmax, bs = "ts", k = 5) +
            s(lagPrecip, bs = "ts", k = 5) +
            s(doy, bs = "cc", k = 8) +
            s(wetness, bs = "ts", k = 5) +
            s(et, bs = "ts", k = 5) +
              ti(ewood_precip, lagPrecip, k = 5) +
              ti(ewood_precip, ewood_tmax, k = 5) +
              ti(ewood_precip, wetness, k = 5) +
              ti(ewood_precip, et, k = 5) +
              ti(ewood_tmax, wetness, k = 5) +
              ti(ewood_tmax, et, k = 5) +
              ti(wetness, et, k = 5),
          data = df_16397,
          select = TRUE,
          family = scat(link = "log"),
          method = "REML",
          control = gam.control(nthreads = 2))
flextable::as_flextable(m2.gam) %>%
  flextable::set_caption("Summary of GAM coefficients at SWQM 16397")

Table 6: Summary of GAM coefficients at SWQM 16397
	Estimate	Standard Error	z value	Pr(>\|z\|)	Signif.
s(ewood_precip)			11.93	1.229e-06	***
s(ewood_tmax)			1.56	4.402e-02	*
s(lagPrecip)			54.56	0.000e+00	***
s(doy)			285.47	0.000e+00	***
s(wetness)			32.63	0.000e+00	***
s(et)			26.11	0.000e+00	***
ti(ewood_precip,lagPrecip)			209.86	0.000e+00	***
ti(ewood_precip,ewood_tmax)			161.10	0.000e+00	***
ti(ewood_precip,wetness)			53.93	0.000e+00	***
ti(ewood_precip,et)			378.14	0.000e+00	***
ti(ewood_tmax,wetness)			375.35	0.000e+00	***
ti(ewood_tmax,et)			128.74	0.000e+00	***
ti(wetness,et)			436.57	0.000e+00	***
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05 < '.' < 0.1 < '' < 1

(Dispersion parameter for Scaled t(3.013,0.081) family taken to be 1)
Null deviance: NULL on NULL degrees of freedom
Residual deviance: NULL on NULL degrees of freedom

# we need a hurdle model here. We will fit logistic regression to predict 
# the probabilty of a non-zero flow
# and a scaled-t model with log link to predict the mean of the non-zero data 

m3.gam <- gam(non_zero ~
            s(ewood_precip, bs = "ts") +
            s(ewood_tmax, bs = "ts") +
            s(lagPrecip, bs = "ts") +
            s(doy, bs = "cc") +
            s(wetness, bs = "ts") +
            s(et, bs = "ts") +
            ti(ewood_precip, ewood_tmax) +
            ti(ewood_precip, wetness) +
            ti(ewood_precip, et) +
            ti(ewood_tmax, wetness) +
            ti(ewood_tmax, et) +
            ti(wetness, et),
          data = df_16882,
          select = TRUE,
          family = binomial(),
          method = "REML",
          control = gam.control(nthreads = 2))

m4.gam <- gam(adjusted_flow ~
            #s(ewood_precip, bs = "ts", k = 5) +
            #s(ewood_tmax, bs = "ts", k = 5) +
            s(doy, bs = "cs", k = 6) +
            s(lagPrecip, bs = "ts", k = 5) +
            #s(wetness, bs = "ts", k = 4) +
            #s(et, bs = "ts", k = 4) +
            #te(ewood_precip, ewood_tmax, k = 8),
            te(ewood_precip, wetness, k = 3)+
            te(ewood_precip, et, k = 3) +
            te(ewood_tmax, wetness, k = 3)+
            te(ewood_tmax, et, k = 3) +
            te(wetness, et, k = 3),
          data = subset(df_16882, non_zero == 1),
          select = TRUE,
          family = scat(link = "log"),
          method = "REML",
          control = gam.control(nthreads = 2))
#draw(m4.gam, n = 1000, dist = 2)

flextable::as_flextable(m3.gam) %>%
  flextable::set_caption("Summary of GAMcoefficients at SWQM 16882")

Table 7: Summary of GAMcoefficients at SWQM 16882
	Estimate	Standard Error	z value	Pr(>\|z\|)	Signif.
s(ewood_precip)			7.67e+00	2.431e-03	**
s(ewood_tmax)			2.18e+00	6.979e-02	.
s(lagPrecip)			4.37e+00	2.071e-02	*
s(doy)			3.31e+01	-2.211e-06	***
s(wetness)			1.46e+01	2.385e-04	***
s(et)			4.50e-01	2.212e-01
ti(ewood_precip,ewood_tmax)			1.89e-05	4.391e-01
ti(ewood_precip,wetness)			2.20e-06	8.499e-01
ti(ewood_precip,et)			5.52e+00	1.581e-02	*
ti(ewood_tmax,wetness)			6.19e-07	9.684e-01
ti(ewood_tmax,et)			8.37e-06	4.014e-01
ti(wetness,et)			4.32e+00	5.795e-02	.
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05 < '.' < 0.1 < '' < 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: NULL on NULL degrees of freedom
Residual deviance: NULL on NULL degrees of freedom

flextable::as_flextable(m4.gam) %>%
  flextable::set_caption("Summary of GAMcoefficients at SWQM 16882")

Table 7: Summary of GAMcoefficients at SWQM 16882
	Estimate	Standard Error	z value	Pr(>\|z\|)	Signif.
s(doy)			815.535	0.0000	***
s(lagPrecip)			65.089	0.0000	***
te(ewood_precip,wetness)			65.459	0.0000	***
te(ewood_precip,et)			454.991	0.0000	***
te(ewood_tmax,wetness)			91.155	0.0000	***
te(ewood_tmax,et)			272.787	0.0000	***
te(wetness,et)			0.251	0.1536
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05 < '.' < 0.1 < '' < 1

(Dispersion parameter for Scaled t(3,0.952) family taken to be 1)
Null deviance: NULL on NULL degrees of freedom
Residual deviance: NULL on NULL degrees of freedom

##get model fits and confidence intervals
gam_fam <- family(m1.gam)
ilink <- gam_fam$linkinv

df_16396 %>%
    bind_cols(
    as_tibble(predict(m1.gam, df_16396, se.fit = TRUE)) %>%
      mutate(response = ilink(fit),
             upr_ci = ilink(fit + (2*se.fit)),
             lwr_ci = ilink(fit - (2*se.fit))) %>%
      dplyr::select(response, upr_ci, lwr_ci)
    ) -> df_16396_gam

##get model fits and confidence intervals
gam_fam <- family(m2.gam)
ilink <- gam_fam$linkinv

df_16397 %>%
    bind_cols(
    as_tibble(predict(m2.gam, df_16397, se.fit = TRUE)) %>%
      mutate(response = ilink(fit),
             upr_ci = ilink(fit + (2*se.fit)),
             lwr_ci = ilink(fit - (2*se.fit))) %>%
      dplyr::select(response, upr_ci, lwr_ci)
    ) -> df_16397_gam


##get model fits and confidence intervals
gam_fam <- family(m3.gam)
ilink <- gam_fam$linkinv

df_16882 %>%
    bind_cols(
    as_tibble(predict(m3.gam, df_16882, se.fit = FALSE,
                      type = "response"))) %>%
  mutate(pred_zero = case_when(
    value < 0.5 ~ 0,
    value > 0.5 ~ 1)) %>%
  bind_cols(
    as_tibble(predict(m4.gam, df_16882, se.fit = FALSE,
                      type = "response")) %>%
      dplyr::rename(response = value)) %>%
  mutate(response = case_when(
    pred_zero == 0 ~ 0,
    pred_zero != 0 ~ response)) -> df_16882_gam



df_16396_gam %>%
  ggplot() +
  geom_point(aes(response, adjusted_flow,
                 color = "GAM Estimates against Naturalized Flow"),
             alpha = 0.4) +
  geom_smooth(aes(response, adjusted_flow,
                  color = "GAM Estimates against Naturalized Flow"),
              method = "lm", se = FALSE) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "GAM Estimated Flow [cfs]", y = "Naturalized Flow [cfs]") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  theme_ms() +
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical",
        legend.position = "none")-> p1

df_16396_gam %>%
  ggplot() +
  geom_line(aes(date, response, color =  "GAM Estimated Flow")) +
  geom_line(aes(date, adjusted_flow, color = "Naturalized Flow SWQM 16396"),
            alpha = 0.4) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() +
  theme(legend.title = element_blank(),
        legend.direction = "vertical",
        legend.position = "none") -> p2




df_16397_gam %>%
  ggplot() +
  geom_point(aes(response, adjusted_flow,
                 color = "GAM Estimates against Naturalized Flow"),
             alpha = 0.4) +
  geom_smooth(aes(response, adjusted_flow,
                  color = "GAM Estimates against Naturalized Flow"),
              method = "lm", se = FALSE) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "GAM Estimated Flow [cfs]", y = "Naturalized Flow [cfs]") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  theme_ms() +
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical",
        legend.position = "none")-> p3

df_16397_gam %>%
  ggplot() +
  geom_line(aes(date, response, color =  "GAM Estimated Flow")) +
  geom_line(aes(date, adjusted_flow, color = "Naturalized Flow SWQM 16397"),
            alpha = 0.4) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() +
  theme(legend.title = element_blank(),
        legend.direction = "vertical",
        legend.position = "none") -> p4


df_16882_gam %>%
  ggplot() +
  geom_point(aes(response, adjusted_flow,
                 color = "GAM Estimates against Naturalized Flow"),
             alpha = 0.4) +
  geom_smooth(aes(response, adjusted_flow,
                  color = "GAM Estimates against Naturalized Flow"),
              method = "lm", se = FALSE) +
  geom_abline(aes(linetype = "1:1 Line", intercept = 0, slope = 1)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "GAM Estimated Flow [cfs]", y = "Naturalized Flow [cfs]") +
  guides(color = guide_legend(""), linetype = guide_legend("")) +
  theme_ms() +
  theme(legend.margin = margin(0,0,0,0),
        legend.box = "vertical",
        legend.box.margin = margin(0,0,0,0),
        legend.direction = "vertical")-> p5

df_16882_gam %>%
  ggplot() +
  geom_line(aes(date, response, color =  "GAM Estimated Flow")) +
  geom_line(aes(date, adjusted_flow, color = "Naturalized Flow"),
            alpha = 0.4) +
  labs(x = "Date", y = "Mean Daily Flow [cfs]") +
  theme_ms() +
  theme(legend.title = element_blank(),
        legend.direction = "vertical") -> p6

(p1+p2)/(p3 +p4)/(p5+p6)

Figure 9: GAM estimated streamflows at SWQM 16396 plotted against measured streamflows and over time.

## calculate NSE and KGE as goodness of fit metrics
## for each method

##
tibble(model = c("DAR_08065800", "DAR_08109800", "DAR_08110100",
                 "Linear Regression",
                 "GAM"),
       'SWQM-16396' = c(
         hydroGOF::NSE(dar_results_16396$DAR_Q_08065800, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::NSE(dar_results_16396$DAR_Q_08109800, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::NSE(dar_results_16396$DAR_Q_08110100, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::NSE(df_16396_lm$fits, df_16396_lm$adjusted_flow),
         hydroGOF::NSE(as.numeric(df_16396_gam$response), df_16396_gam$adjusted_flow)),
       'SWQM-16397' = c(
         hydroGOF::NSE(dar_results_16397$DAR_Q_08065800, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::NSE(dar_results_16397$DAR_Q_08109800, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::NSE(dar_results_16397$DAR_Q_08110100, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::NSE(df_16397_lm$fits, df_16397_lm$adjusted_flow),
         hydroGOF::NSE(as.numeric(df_16397_gam$response), df_16397_gam$adjusted_flow)),
       'SWQM-16882' = c(
         hydroGOF::NSE(dar_results_16882$DAR_Q_08065800, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::NSE(dar_results_16882$DAR_Q_08109800, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::NSE(dar_results_16882$DAR_Q_08110100, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::NSE(df_16882_lm$fits, df_16882_lm$adjusted_flow),
         hydroGOF::NSE(as.numeric(df_16882_gam$response), df_16882_gam$adjusted_flow)))

## # A tibble: 5 x 4
##   model             `SWQM-16396` `SWQM-16397` `SWQM-16882`
##   <chr>                    <dbl>        <dbl>        <dbl>
## 1 DAR_08065800           -6.84       -407.         -91.5  
## 2 DAR_08109800            0.0293       -0.462        0.151
## 3 DAR_08110100            0.0996       -7.05        -1.35 
## 4 Linear Regression      -0.603         0.319       -0.237
## 5 GAM                     0.601         0.373       -0.109

tibble(model = c("DAR_08065800", "DAR_08109800", "DAR_08110100",
                 "Linear Regression",
                 "GAM"),
       'SWQM-16396' = c(
         hydroGOF::KGE(dar_results_16396$DAR_Q_08065800, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::KGE(dar_results_16396$DAR_Q_08109800, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::KGE(dar_results_16396$DAR_Q_08110100, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::KGE(df_16396_lm$fits, df_16396_lm$adjusted_flow),
         hydroGOF::KGE(as.numeric(df_16396_gam$response), df_16396_gam$adjusted_flow)),
       'SWQM-16397' = c(
         hydroGOF::KGE(dar_results_16397$DAR_Q_08065800, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::KGE(dar_results_16397$DAR_Q_08109800, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::KGE(dar_results_16397$DAR_Q_08110100, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::KGE(df_16397_lm$fits, df_16397_lm$adjusted_flow),
         hydroGOF::KGE(as.numeric(df_16397_gam$response), df_16397_gam$adjusted_flow)),
       'SWQM-16882' = c(
         hydroGOF::KGE(dar_results_16882$DAR_Q_08065800, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::KGE(dar_results_16882$DAR_Q_08109800, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::KGE(dar_results_16882$DAR_Q_08110100, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::KGE(df_16882_lm$fits, df_16882_lm$adjusted_flow),
         hydroGOF::KGE(as.numeric(df_16882_gam$response), df_16882_gam$adjusted_flow)))

## # A tibble: 5 x 4
##   model             `SWQM-16396` `SWQM-16397` `SWQM-16882`
##   <chr>                    <dbl>        <dbl>        <dbl>
## 1 DAR_08065800           -1.02        -18.4        -9.32  
## 2 DAR_08109800           -0.308         0.201       0.182 
## 3 DAR_08110100           -0.0809       -1.21        0.0916
## 4 Linear Regression       0.277         0.313       0.138 
## 5 GAM                     0.636         0.452       0.402

tibble(model = c("DAR_08065800", "DAR_08109800", "DAR_08110100",
                 "Linear Regression",
                 "GAM"),
       'SWQM-16396' = c(
         hydroGOF::pbias(dar_results_16396$DAR_Q_08065800, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::pbias(dar_results_16396$DAR_Q_08109800, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::pbias(dar_results_16396$DAR_Q_08110100, as.numeric(dar_results_16396$adjusted_flow)),
         hydroGOF::pbias(df_16396_lm$fits, df_16396_lm$adjusted_flow),
         hydroGOF::pbias(as.numeric(df_16396_gam$response), df_16396_gam$adjusted_flow)),
       'SWQM-16397' = c(
         hydroGOF::pbias(dar_results_16397$DAR_Q_08065800, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::pbias(dar_results_16397$DAR_Q_08109800, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::pbias(dar_results_16397$DAR_Q_08110100, as.numeric(dar_results_16397$adjusted_flow)),
         hydroGOF::pbias(df_16397_lm$fits, df_16397_lm$adjusted_flow),
         hydroGOF::pbias(as.numeric(df_16397_gam$response), df_16397_gam$adjusted_flow)),
       'SWQM-16882' = c(
         hydroGOF::pbias(dar_results_16882$DAR_Q_08065800, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::pbias(dar_results_16882$DAR_Q_08109800, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::pbias(dar_results_16882$DAR_Q_08110100, as.numeric(dar_results_16882$adjusted_flow)),
         hydroGOF::pbias(df_16882_lm$fits, df_16882_lm$adjusted_flow),
         hydroGOF::pbias(as.numeric(df_16882_gam$response), df_16882_gam$adjusted_flow)))

## # A tibble: 5 x 4
##   model             `SWQM-16396` `SWQM-16397` `SWQM-16882`
##   <chr>                    <dbl>        <dbl>        <dbl>
## 1 DAR_08065800              58.6        365          581. 
## 2 DAR_08109800             -84.1        -55.9        -33.1
## 3 DAR_08110100             -69.8        -12.1         29.3
## 4 Linear Regression        -22.7         -7.8        -48.6
## 5 GAM                      -15.8         -5.8        -12

Model validation

## use k-fold cross validation to estimate
## how well the approach fits to ujnknown data
## randomly split the data 80-20 to fit and validate the GAM.
library(modelr)

## 
## Attaching package: 'modelr'

## The following object is masked from 'package:gratia':
## 
##     add_residuals

## The following objects are masked from 'package:hydroGOF':
## 
##     mae, mse, rmse

set.seed(123)

df_16396 %>%
  ## create a list of data resamples.
  ## randomly splits the dataset into 80-20
  ## n number of times
  crossv_kfold() %>%
  mutate(model = map(train, ~gam(adjusted_flow~
                                   s(ewood_precip, bs = "ts", k = 5) +
                                   s(ewood_tmax, bs = "ts", k = 5) +
                                   s(lagPrecip, bs = "ts", k = 5) +
                                   s(doy, bs = "cc", k = 8) +
                                   s(wetness, bs = "ts", k = 5) +
                                   s(et, bs = "ts", k = 5) +
                                   ti(ewood_precip, lagPrecip, k = 5) +
                                   ti(ewood_precip, ewood_tmax, k = 5) +
                                   ti(ewood_precip, wetness, k = 5) +
                                   ti(ewood_precip, et, k = 5) +
                                   ti(ewood_tmax, wetness, k = 5) +
                                   ti(ewood_tmax, et, k = 5) +
                                   ti(wetness, et, k = 5),
                                  data = as.data.frame(.),
                                  family = scat(link = "log"),
                                  method = "REML",
                                  control = gam.control(nthreads = 2))),
         preds = map2(test,model, ~predict(.y, newdata = as.data.frame(.x),
                                           type = "response")),
         NSE = map2_dbl(test, preds, ~hydroGOF::NSE(as.numeric(as.data.frame(.x)$adjusted_flow),
                                                    as.numeric(.y))),
         KGE = map2_dbl(test, preds, ~hydroGOF::KGE(as.numeric(as.data.frame(.x)$adjusted_flow),
                                                    as.numeric(.y))),
         MAE = map2_dbl(test, preds, ~hydroGOF::mae(as.numeric(as.data.frame(.x)$adjusted_flow),
                                                    as.numeric(.y))),
         pbias = map2_dbl(test, preds, ~hydroGOF::pbias(as.numeric(as.data.frame(.x)$adjusted_flow),
                                                    as.numeric(.y)))) -> df_16396_kfold

## Warning: Problem with `mutate()` input `model`.
## i Fitting terminated with step failure - check results carefully
## i Input `model` is `map(...)`.

## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Fitting terminated with step failure - check results carefully

## Warning: Problem with `mutate()` input `model`.
## i Fitting terminated with step failure - check results carefully
## i Input `model` is `map(...)`.

## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Fitting terminated with step failure - check results carefully

## Warning: Problem with `mutate()` input `model`.
## i Fitting terminated with step failure - check results carefully
## i Input `model` is `map(...)`.

## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Fitting terminated with step failure - check results carefully

df_16396_kfold

## # A tibble: 5 x 9
##   train        test         .id   model    preds         NSE     KGE   MAE pbias
##   <named list> <named list> <chr> <named > <named >    <dbl>   <dbl> <dbl> <dbl>
## 1 <resample [~ <resample [~ 1     <gam>    <dbl [6~ -6.84e-3 -0.489  101.  -92.6
## 2 <resample [~ <resample [~ 2     <gam>    <dbl [6~ -1.57e-2 -0.647  829.  -97.4
## 3 <resample [~ <resample [~ 3     <gam>    <dbl [6~ -2.60e-2 -0.515   22.0 -63.3
## 4 <resample [~ <resample [~ 4     <gam>    <dbl [6~ -2.13e+1 -3.27    21.0 176. 
## 5 <resample [~ <resample [~ 5     <gam>    <dbl [6~ -9.27e-1  0.0170  10.4  55.8

# 
# m1.gam <- gam(adjusted_flow ~
#             s(ewood_precip, bs = "ts", k = 5) +
#             s(ewood_tmax, bs = "ts", k = 5) +
#             s(lagPrecip, bs = "ts", k = 5) +
#             s(doy, bs = "cc", k = 8) +
#             s(wetness, bs = "ts", k = 5) +
#             s(et, bs = "ts", k = 5) +
#               ti(ewood_precip, lagPrecip, k = 5) +
#               ti(ewood_precip, ewood_tmax, k = 5) +
#               ti(ewood_precip, wetness, k = 5) +
#               ti(ewood_precip, et, k = 5) +
#               ti(ewood_tmax, wetness, k = 5) +
#               ti(ewood_tmax, et, k = 5) +
#               ti(wetness, et, k = 5),
#           data = df_16396,
#           select = TRUE,
#           family = scat(link = "log"),
#           method = "REML",
#           control = gam.control(nthreads = 2))

References

Asquith, William H., Meghan C. Roussel, and Joseph Vrabel. 2006. “Statewide Analysis of the Drainage-Area Ratio Method for 34 Streamflow Percentile Ranges in Texas.” 2006-5286. U.S. Geological Survey. https://pubs.usgs.gov/sir/2006/5286/pdf/sir2006-5286.pdf.

Chamberlain, Scott. 2019. rnoaa: ’NOAA’ Weather Data from R. https://CRAN.R-project.org/package=rnoaa.

Schramm, Michael. 2020. Echor: Access EPA ’ECHO’ Data (version v0.1.5). https://doi.org/10.5281/zenodo.3988554.