Making a table out of summary data

11 months 2 weeks ago #48

I figured something out today that isn't perfect, but might be useful for others: getting quantiles for a big batch of data, combining multiple parameters into a single data frame, and spitting that all out as a table in CSV format.

Like so:

Start by opening up SWMPr, importing and QCing data:
path <- 'C:/etc-etc/data'
BHWQ_dat <- import_local(path, 'gndbhwq')
BHWQ_dat <- qaqc(BHWQ_dat, qaqc_keep = c(0,4), rem_cols = T)

Then decide what you want to look at. The basic quantile function can be modified - lots of people just want min, median, max, and the quartiles. I'm also interested in the 1st percentile and 99th percentile. And I'm removing NAs so it doesn't get screwed up. This is the basic way to get those percentiles, for the do_pct column in my data:
quantile(BHWQ_dat$do_pct, c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1), na.rm=T)

Because I want everything to be combined into a table, I'm going to name each of those quantile outputs:
q.do_pct <- quantile(BHWQ_dat$do_pct, c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1), na.rm=T)
q.do_mgl <- quantile(BHWQ_dat$do_mgl, c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1), na.rm=T)
q.sal <- quantile(BHWQ_dat$sal, c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1), na.rm=T) <- quantile(BHWQ_dat$ph, c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1), na.rm=T)

Then I'll combine them into a data frame and assign names to the columns:
BHWQ_sum <- data.frame(q.sal,, q.do_pct, q.do_mgl)
names(BHWQ_sum) <- c("Sal", "pH", "DO%", "DO mg/L")

For those of you who are strong in the data types of R, the output from those quantile commands is a numeric vector.

And finally export it to a csv:
write.table(BHWQ_sum, file="BHWQsum.csv", sep = ",")

The percentiles I used magically show up in the data frame (the numeric vector had them as a name attribute; I don't fully understand this background stuff that R is doing). That means they also show up in the table. Unfortunately I can't assign a name to that column, so the names I assigned are actually one column to the left of where I want them, and I had to move them to get what I really wanted.

Here's what it actually looked like:

So like I said, it's not perfect. But it got me close enough to what I needed. Suggestions on making this smoother and easier are welcome.


Please Log in to join the conversation.

11 months 2 weeks ago #49

Hi Kim,

Great post and thanks for sharing. Simple summaries like these can be really helpful for exploratory data analysis. As I'm sure you've discovered about R, there are multiple ways to get the same result. I wanted to contribute by providing an alternative to getting quantile summaries using the dplyr and tidyr packages. The syntax is a bit different from standard R programming but the idea is the same. RStudio has a great cheat sheet for the dplyr package that Todd showed during the workshop.

dplyr provides a family of functions for manipulating data frames and tidyr is a simple approach to converting between wide and long styles of data frames. Long formatting can be useful if you want to summarize columns against one or more categorical variables. For example, your post summarized the water quality variables separately but we can simplify this by changing data from a wide format (one column per variable) to long format (two columns, one for variable and one for value).

dplyr also uses a piping structure to pass results from one function to another. This is a more linear approach to programming as compared to combining functions from the inside out (e.g., mean(rnorm(10))) or reassigning data objects with the same name after each step. See this blog for more info.

This script loads the libraries (install first if needed), imports sample data that comes with SWMPr, manipulates the data, and saves as a csv file.

# load libraries
# load data
# summarize by quantiles
dat <- qaqc(apacpwq, qaqc_keep = c(0, 4), rem_cols = T) %>% 
  select(sal, ph, do_pct, do_mgl) %>%
  gather('var', 'val') %>% 
  group_by(var) %>% 
    qts <- quantile(.$val, c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1), na.rm = T)
# save 
write.csv(dat, 'dat.csv', quote = F, row.names = F)

The interesting part of this code is the creation of 'dat'. The pipe operator (%>%) lets us string multiple functions together, where the input for a function depends on the output from the previous. The steps are 1) remove qaqc flags, select columns of interest, convert from wide to long (gather), group the dataset by the new 'var' column for the water quality parameters, then summarize each by the specified quantiles. The advantage of this approach is better code readability because it flows linearly and you are not recreating the 'dat' object over and over with each step.

The .csv output is similar to what's shown above.

Anyway, that's just a different approach to getting the same result. In the end it boils down to your comfort level with different types of syntax and what makes sense to you. I wanted to present this because I've found pipes very intuitive and I use the dplyr package almost every day for data manipulation.



Please Log in to join the conversation.

11 months 2 weeks ago #50

I KNEW there was an easier way! Just didn't know how to get there and sometimes I don't have the vocabulary to google up what I'm looking for. Thanks so much Marcus for this explanation.

So if you want to go even bigger and do all of this on multiple sites, does piping get you there? Or is there something else to use?

Please Log in to join the conversation.

11 months 2 weeks ago #51

I slept on this and realized each site would be a different data frame... so I just googled "R how to perform the same functions on multiple data frames" and the most popular thing that's coming up in the results is the use of 'lapply'.

Seems like you'd just read in all the data files, then make a list of them, and then use lapply on the code you supplied above, right? Although write.csv might have to go inside parentheses somewhere so it would happen for each file?

I'm still stewing over details, but I know you're busy getting ready for CERF and I'm trying to avoid complete hand-holding. I'll update this post if/when I play with it more and figure it out. You're certainly welcome to walk me through it if you have time though.

Please Log in to join the conversation.

11 months 2 weeks ago #52

Yep, the 'apply' family of functions are really useful for iterating over multiple objects or pieces of data objects. lapply is by far the one I use the most, others include tapply, apply, mapply, vapply, and sapply. They all basically do the same thing but with different data structures. These functions all come with the base installation and newer tools have surfaced in recent years. For example, the dplyr package I used in my post can do similar operations on data frames by applying the same function to different groups with the group_by function. I think an advantage of these newer tools is that they tend to operate faster by using vector operations and compiled C code. The piping functionality is a separate aspect but it can be really useful for improving readability.

Please Log in to join the conversation.

Time to create page: 0.575 seconds
Powered by Kunena Forum