layout: true <div class="my-footer"><a href="https://www.reisanar.com/" target="_blank" rel="noopener"><span>@reisanar</span></a></div> --- class: center middle main-title section-title-2 # Data Transformation <br>and Exploration .class-info[ **Rei Sanchez-Arias, Ph.D.** .light[Using the `dplyr` package] ] --- layout: false name: common-tasks class: center middle section-title section-title-2 animated fadeIn <div class="my-footer"><a href="https://www.reisanar.com/" target="_blank" rel="noopener"><span>@reisanar</span></a></div> # Motivation --- layout: true class: title title-2 --- # Common tasks <small> - Often you will need to create some *new variables* or *summaries*, or maybe you just want to *rename *the variables or *reorder* the observations to make the data a little easier to work with. - We will focus on how to use the `dplyr` package, another core member of the `tidyverse`. **Prerequisites:** Install the `nycflights13` and `tidyverse` packages </small> ```r library(tidyverse) library(nycflights13) ``` --- # Data from `nycflights13` .pull-left[ This dataset contains flights departing New York City (NYC) in 2013. It contains all 336,776 flights that departed from NYC in 2013. ] .pull-right[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/flights13.png"><figcaption></figcaption></figure> ] The data comes from the [US Bureau of Transportation Statistics](https://www.bts.gov/browse-statistical-products-and-data), and is documented in `?flights` --- layout: false name: dplyr class: center middle section-title section-title-2 animated fadeIn background-image: url("https://raw.githubusercontent.com/reisanar/figs/master/dplyr-logo.png") background-position: right background-size: contain <div class="my-footer"><a href="https://www.reisanar.com/" target="_blank" rel="noopener"><span>@reisanar</span></a></div> # Introducing<br>`dplyr` --- layout: true class: title title-2 --- # The `tidyverse` The `tidyverse` is an opinionated [collection of R packages](https://www.tidyverse.org/packages/) designed for data science. All packages share an underlying design philosophy, grammar, and data structures. -- Definition of **tidy data**: (see [paper](https://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham) - Each variable is a column - Each observation is a row - Each type of observational unit is a table --- # Core `tidyverse` packages .pull-left[ .smaller[ | -----------------|-------------------------------------------------------- [`ggplot2`](https://ggplot2.tidyverse.org/) | data visualization [`dplyr`](https://dplyr.tidyverse.org/) | data manipulation [`tidyr`](https://tidyr.tidyverse.org/)| tidy up data [`readr`](https://readr.tidyverse.org/)| data import [`purrr`](https://purrr.tidyverse.org/)| functional programming [`tibble`](https://tibble.tidyverse.org/)| dataframes reimagined ] ] .pull-right[ .tiny[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/tidyverse-packages.png"><figcaption></figcaption></figure> ] .smaller[ | -----------------|-------------------------------------------------------- [`stringr`](https://stringr.tidyverse.org/) | working with strings [`forcats`](https://forcats.tidyverse.org/)| working with factors ] ] --- # `dplyr` basics .left-column[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/dplyr-new-old.png"><figcaption></figcaption></figure> ] .right-column[ .small[ 1. **Pick observations** by their values: `filter()` 2. **Reorder** the rows: `arrange()` 3. **Pick variables** by their names: `select()` 4. **Create new variables** with functions of existing variables: `mutate()` 5. **Collapse** many values down to a single summary: `summarize()` 6. Operate on a **group-by-group** basis: `group_by()` ] ] --- # Filtering _rows_ .small[ `filter()` allows you to subset observations based on their values. For example, we can select all flights on January 1st with: ] .small-code[ ```r filter(flights, month == 1, day == 1) ``` ] .small[ Below is the `dep_time` and `sched_dep_time` for the first 100 rows ] .tiny[
] --- # Comparisons .small[ `dplyr` functions never modify their inputs, so if you want to save the result, you will need to use the assignment operator, `<-` ] .small-code[ ```r jan1 <- filter(flights, month == 1, day == 1) ``` ] -- .pull-left[ .small[ <br> To use filtering effectively, you have to know how to select the observations that you want using comparison operators: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). ] ] .pull-right[ .small[**Logical operators**] <figure><img src="http://r4ds.had.co.nz/diagrams/transform-logical.png"><figcaption></figcaption></figure> ] --- # Flights in November _OR_ December The following code finds all flights that departed in November OR December: ```r filter(flights, month == 11 | month == 12) ``` -- A useful short-hand is `x %in% y`. This will select every row where `x` is one of the values in `y`. We could use it to rewrite the code above: ```r nov_dec <- filter(flights, month %in% c(11, 12)) ``` --- # Arrange rows with `arrange()` `arrange()` works similarly to `filter()` except that instead of selecting rows, it **changes their order**. It takes a data frame and a set of column names to order by. ```r arrange(flights, year, month, day) ``` -- Use `desc()` to re-order by a column in descending order: ```r arrange(flights, desc(arr_delay)) ``` --- # Select columns with `select()` `select()` allows you to rapidly zoom in on a useful subset using the names of the **variables**. .small-code[ ```r # Select columns by name select(flights, year, month, day) ``` ] <br> There are a number of helper functions you can use within `select()`: `starts_with("abc")`: matches names that begin with `"abc"`. `ends_with("xyz")`: matches names that end with `"xyz"`. `contains("ijk")`: matches names that contain `"ijk"`. --- # Add new variables with `mutate()` To _add new columns_ that are functions of existing columns. That is the job of `mutate()`. `mutate()` always adds new columns at the end of your dataset. ```r # create a smaller dataset with less columns flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) ``` --- # Example: using `mutate()` .pull-left[ .small-code[ ```r mutate(flights_sml, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) ``` ] .small[Note that you can refer to columns that you have just created:] .small-code[ ```r mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` ] ] .pull-right[ .small[_A random sample of 100 observations is shown below:_] .tiny[
] ] --- # Grouped summaries with `summarise()` `summarise()` _collapses_ a data frame to a single row: ```r summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## delay ## <dbl> ## 1 12.6 ``` The `summarise()` function is useful when we pair it with `group_by()`. This way, the analysis can be done for individual groups. --- layout: false name: pipe class: center middle section-title section-title-2 animated fadeIn background-image: url("https://raw.githubusercontent.com/reisanar/figs/master/pipe.png") background-position: right background-size: contain <div class="my-footer"><a href="https://www.reisanar.com/" target="_blank" rel="noopener"><span>@reisanar</span></a></div> # The pipe<br> operator --- layout: true class: title title-2 --- # The pipe .pull-left[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/mario.png"><figcaption></figcaption></figure> ] .pull-right[ <figure><img src="https://raw.githubusercontent.com/reisanar/figs/master/flappybird.png"><figcaption></figcaption></figure> ] .box-inv-2.medium.sp-after[ The pipe operator is given by **`%>%`** <br>(from the `magrittr` [package](https://magrittr.tidyverse.org/reference/pipe.html)) ] --- # Example The pipe sends the output of the LHS function to the first argument of the RHS function. .pull-left[ For example: ```r # pipe example sum(1:8) %>% sqrt() %>% log() ``` ``` ## [1] 1.791759 ``` ] .pull-right[ is equivalent to ```r # the log of the square root # of the sum of the elements # of the vector [1 ... 8] log(sqrt(sum(1:8))) ``` ``` ## [1] 1.791759 ``` ] --- # Example for data exploration Imagine that we want to explore the relationship between the distance and average delay for each location. -- There are three steps to prepare this data: 1. **Group** flights by destination. 2. **Summarize** to compute distance, average delay, and number of flights. 3. **Filter** to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport. --- # Power of the pipe `%>%` operator .pull-left[ .small-code[ ```r # create dataframe of "delays" delays <- flights %>% group_by(dest) %>% summarize( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") ``` ] <br> .smaller[ If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac. ] ] -- .pull-right[ .small[ You can read it as a series of imperative statements: **group**, then **summarize**, then **filter**. A good way to pronounce `%>%` when reading code is "then".<br><br> The `n()` function is implemented specifically for each data source and can be used from within `summarize()`, `mutate()` and `filter()`. It returns the number of observations in the current group. ] ] --- # Transformations .small[ **Why `na.rm = TRUE` ?** Aggregation functions obey the usual rule of missing values: if there is any missing value in the input, the output will be a missing value! ] .small-code[ ```r flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay, na.rm = TRUE)) ``` ] .tiny[ <br>
]