class: center middle main-title section-title-2 # Data Transformation <br>and Exploration .class-info[ **Rei Sanchez-Arias, Ph.D.** .light[Using the `dplyr` package] ] --- layout: false name: common-tasks class: center middle section-title section-title-2 animated fadeIn # Motivation --- layout: true class: title title-2 --- # Common tasks <small> - Often you will need to create some *new variables* or *summaries*, or maybe you just want to *rename *the variables or *reorder* the observations to make the data a little easier to work with. - We will focus on how to use the `dplyr` package, another core member of the `tidyverse`. **Prerequisites:** Install the `nycflights13` and `tidyverse` packages </small> ```r library(tidyverse) library(nycflights13) ``` --- # Data from `nycflights13` .pull-left[ This dataset contains flights departing New York City (NYC) in 2013. It contains all 336,776 flights that departed from NYC in 2013. ] .pull-right[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/flights13.png"><figcaption></figcaption></figure> ] The data comes from the [US Bureau of Transportation Statistics](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0), and is documented in `?flights` --- layout: false name: dplyr class: center middle section-title section-title-2 animated fadeIn # Introducing `dplyr` --- layout: true class: title title-2 --- # `dplyr` basics 1. Pick observations by their values: `filter()` 2. Reorder the rows: `arrange()` 3. Pick variables by their names: `select()` 4. Create new variables with functions of existing variables: `mutate()` 5. Collapse many values down to a single summary: `summarise()` 6. Operate on a group-by-group basis: `group_by()` --- # Filtering _rows_ .pull-left[ `filter()` allows you to subset observations based on their values. For example, we can select all flights on January 1st with: ```r filter(flights, month == 1, day == 1) ``` ] .pull-right[ Below is the `dep_time` and `sched_dep_time` for the first 4 rows ``` ## # A tibble: 4 x 2 ## dep_time sched_dep_time ## <int> <int> ## 1 517 515 ## 2 533 529 ## 3 542 540 ## 4 544 545 ``` ] --- # January 1st `dplyr` functions never modify their inputs, so if you want to save the result, you will need to use the assignment operator, `<-` ```r jan1 <- filter(flights, month == 1, day == 1) ``` --- # Comparisons .pull-left[ To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). ] .pull-right[ **Logical operators** <figure><img src="http://r4ds.had.co.nz/diagrams/transform-logical.png"><figcaption></figcaption></figure> ] --- # Flights in November OR December The following code finds all flights that departed in November OR December: ```r filter(flights, month == 11 | month == 12) ``` A useful short-hand is `x %in% y`. This will select every row where `x` is one of the values in `y`. We could use it to rewrite the code above: ```r nov_dec <- filter(flights, month %in% c(11, 12)) ``` --- # Arrange rows with `arrange()` `arrange()` works similarly to `filter()` except that instead of selecting rows, it **changes their order**. It takes a data frame and a set of column names to order by. ```r arrange(flights, year, month, day) ``` Use `desc()` to re-order by a column in descending order: ```r arrange(flights, desc(arr_delay)) ``` --- # Select columns with `select()` <small> `select()` allows you to rapidly zoom in on a useful subset using the names of the **variables**. </small> ```r # Select columns by name select(flights, year, month, day) ``` <small> There are a number of helper functions you can use within `select()`: `starts_with("abc")`: matches names that begin with "abc". `ends_with("xyz")`: matches names that end with "xyz". `contains("ijk")`: matches names that contain "ijk". </small> --- # Add new variables with `mutate()` To _add new columns_ that are functions of existing columns. That is the job of `mutate()`. `mutate()` always adds new columns at the end of your dataset. ```r # create a smaller dataset with less columns flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) ``` --- # Example .small-code[ ```r mutate(flights_sml, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) ``` ] Note that you can refer to columns that you have just created: .small-code[ ```r mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` ] --- # Grouped summaries with `summarise()` `summarise()` _collapses_ a data frame to a single row: ```r summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) ``` ``` ## # A tibble: 1 x 1 ## delay ## <dbl> ## 1 12.6 ``` The `summarise()` function is useful when we pair it with `group_by()`. This way, the analysis can be done for individual groups. --- layout: false name: pipe class: center middle section-title section-title-2 animated fadeIn # The pipe operator <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/pipe.png"><figcaption></figcaption></figure> --- layout: true class: title title-2 --- # The pipe .pull-left[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/mario.png"><figcaption></figcaption></figure> ] .pull-right[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/flappybird.png"><figcaption></figcaption></figure> ] .box-inv-2.medium.sp-after[ The pipe operator in R is given by `%>%` ] --- # Example Sends the output of the LHS function to the first argument of the RHS function. .pull-left[ ```r sum(1:8) %>% sqrt() %>% log() ``` ``` ## [1] 1.791759 ``` ] .pull-right[ is equivalent to ```r log(sqrt(sum(1:8))) ``` ``` ## [1] 1.791759 ``` ] --- # Example for data exploration Imagine that we want to explore the relationship between the distance and average delay for each location. There are three steps to prepare this data: 1. Group flights by destination. 2. Summarize to compute distance, average delay, and number of flights. 3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport. --- # Power of the pipe `%>%` operator .pull-left[ .small-code[ ```r delays <- flights %>% group_by(dest) %>% summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") ``` ] <small> If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac. </small> ] .pull-right[ <small> You can read it as a series of imperative statements: _group_, then _summarise_, then _filter_. A good way to pronounce `%>%` when reading code is "then". The `n()` function is implemented specifically for each data source and can be used from within `summarise()`, `mutate()` and `filter()`. It returns the number of observations in the current group. </small> ] --- # Transformations </small> **Why `na.rm = TRUE` ?** Aggregation functions obey the usual rule of missing values: if there is any missing value in the input, the output will be a missing value! Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation </small> ```r flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay, na.rm = TRUE)) ```