2020-04-21

Introduction

Introduction

Introduction

Introduction

This course would not be possible without the work done by Sono Shah! Much of this material was also taken from his Presidential Elections course for which I ran the lab component.

Goals

  • Open RStudio!
  • Tidyverse?
  • Basic grammar and organization!
    • Filtering
    • Summarising
    • Mutating
    • Arranging
  • Real world examples

Opening RStudio

Opening RStudio

Opening RStudio

Opening RStudio

Opening RStudio

  • The console
  • The environment
  • The files pane
  • The terminal
  • Script

Tidyverse

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Tidyverse

Tidyverse

Install the complete tidyverse with:

install.packages("tidyverse")

library(tidyverse)

The pipe (%>%)

The pipe (%>%)

Before we introduce the five main verbs, we first introduce the pipe operator (%>%).

The pipe operator allows us to chain together data wrangling functions.

The pipe operator can be read as “ then”.

The (%>%) operator allows us to go from one step to the next easily so we can, for example…

The pipe (%>%)

  • filter our data frame to only focus on a few rows then
  • group_by another variable to create groups then
  • summarize this grouped data to calculate the mean for each level of the group.

Five Main Verbs - The 5MV

Five Main Verbs - The 5MV

The five most commonly used functions that help wrangle and summarize data.

A description of these verbs follows, with each subsection devoted to an example of that verb, or a combination of a few verbs, in action.

Five Main Verbs - The 5MV

  • filter: Pick rows based on conditions about their values

  • summarise: Create summary measures of variables either over the entire data frame or over groups of observations on variables using group_by

  • mutate: Create a new variable in the data frame by mutating existing ones

  • arrange: Arrange/sort the rows based on one or more variables

Five Main Verbs - The 5MV

All of the 5MVs follow the same syntax, with the argument before the pipe %>% being the name of the data frame, then the name of the verb, followed with other arguments specifying which criteria you’d like the verb to work with in parentheses.

Filtering Observation

Filtering

Filtering

  • The filter function here works much like the “Filter” option in Microsoft Excel; it allows you to specify criteria about values of a variable in your dataset and then chooses only those rows that match that criteria.

Example 1: Using words

Example 1: Using words

Example 1: Using words

  • I want to view all of the rows (counties in this case) that have the state_name == Michigan. (All the counties in Michigan)

Example 1: Using words

electiondta %>% 
  head()
## # A tibble: 6 x 17
##   state_name county_name total_2012 dem_2012 pct_dem_2012 gop_2012
##   <chr>      <chr>            <dbl>    <dbl>        <dbl>    <dbl>
## 1 utah       Utah County     161278    15791         9.79   142337
## 2 utah       Cache Coun…      37099     5434        14.6     30818
## 3 idaho      Madison Co…      14412      832         5.77    13445
## 4 utah       Davis Coun…     112568    20417        18.1     90030
## 5 utah       Morgan Cou…       4370      386         8.83     3910
## 6 utah       Box Elder …      18092     1830        10.1     15969
## # … with 11 more variables: pct_gop_2012 <dbl>, oth_2012 <dbl>,
## #   pct_oth_2012 <dbl>, total_2016 <dbl>, dem_2016 <dbl>,
## #   pct_dem_2016 <dbl>, gop_2016 <dbl>, pct_gop_2016 <dbl>,
## #   oth_2016 <dbl>, pct_oth_2016 <dbl>, trump_county <chr>

Example 1: Using words

electiondta %>% 
  filter(state_name == "michigan")

Example 1: Using words

The ordering of the commands:

  • Take the data frame electiondta then
  • filter the data frame so that only those where the state_name equals “Michigan” are included.
  • The double equal sign == for testing for equality, and not=. You are almost guaranteed to make the mistake at least once of only including one equals sign

Example 1: Using words

electiondta %>%
  filter(state_name == "michigan") %>%
  head()
## # A tibble: 6 x 17
##   state_name county_name total_2012 dem_2012 pct_dem_2012 gop_2012
##   <chr>      <chr>            <dbl>    <dbl>        <dbl>    <dbl>
## 1 michigan   Kent County     295418   134560         45.6   157801
## 2 michigan   Washtenaw …     179016   120791         67.5    56401
## 3 michigan   Ottawa Cou…     132695    43068         32.5    88503
## 4 michigan   Leelanau C…      14164     6576         46.4     7483
## 5 michigan   Kalamazoo …     122880    69045         56.2    52660
## 6 michigan   Ingham Cou…     127579    80903         63.4    45365
## # … with 11 more variables: pct_gop_2012 <dbl>, oth_2012 <dbl>,
## #   pct_oth_2012 <dbl>, total_2016 <dbl>, dem_2016 <dbl>,
## #   pct_dem_2016 <dbl>, gop_2016 <dbl>, pct_gop_2016 <dbl>,
## #   oth_2016 <dbl>, pct_oth_2016 <dbl>, trump_county <chr>

Example 2: Using numbers

Example 2: Using numbers

  • I want to view all of the rows (counties) that cast more than 100,000 total votes in 2012
electiondta %>% 
   filter(total_2012 > 100000)

The ordering of the commands:

  • Take the data frame electiondta then
  • filter the data frame so that only those where the total_2012 is greater than 100000 are included.

Example 2: Using numbers

electiondta %>%
  filter(total_2012 > 100000) %>%
  head()
## # A tibble: 6 x 17
##   state_name county_name total_2012 dem_2012 pct_dem_2012 gop_2012
##   <chr>      <chr>            <dbl>    <dbl>        <dbl>    <dbl>
## 1 utah       Utah County     161278    15791         9.79   142337
## 2 utah       Davis Coun…     112568    20417        18.1     90030
## 3 utah       Salt Lake …     341664   132686        38.8    198739
## 4 virginia   Arlington …     117279    81178        69.2     34433
## 5 virginia   Fairfax Co…     440174   260835        59.3    173786
## 6 kansas     Johnson Co…     268145   107651        40.2    155442
## # … with 11 more variables: pct_gop_2012 <dbl>, oth_2012 <dbl>,
## #   pct_oth_2012 <dbl>, total_2016 <dbl>, dem_2016 <dbl>,
## #   pct_dem_2016 <dbl>, gop_2016 <dbl>, pct_gop_2016 <dbl>,
## #   oth_2016 <dbl>, pct_oth_2016 <dbl>, trump_county <chr>

Other ways to filter

You can combine multiple criteria together using operators that make comparisons:

| corresponds to “or” >= corresponds to “greater than or equal to”
& corresponds to “and” <= corresponds to “less than or equal to”
> corresponds to “greater than” !=corresponds to “not equal to”
< corresponds to “less than”

Summarize

Summarize

Summarize

Summarize

The next common task when working with data is to be able to summarize data: take a large number of values and summarize then with a single value. While this may seem like a very abstract idea, something has simple as the sum, the smallest value, and the largest values are all summaries of a large number of values.

Summarize

Example 1: We can calculate the and mean and minimun, and maximum of the total number of votes for the democratic candidate in 2012 in one step using the summarize:

electiondta %>% 
  summarize(mean = mean(dem_2012),
            min = min(dem_2012), 
            max = max(dem_2012)) 
## # A tibble: 1 x 3
##     mean   min     max
##    <dbl> <dbl>   <dbl>
## 1 20017.     5 1672164

Summarize

What did that just do?

  • Take the data frame electiondta then
  • summarise the data frame so that so that we get the mean (or average) of the variable dem2012, the minimum value of dem2012 and the maximum value of 2012.

Summarize

Other ways to use summarise

  • IRQ(): Interquartile range
  • sum(): the sum (or total)
  • n(): a count of the number of rows/observations in each group. This will be really useful when you use group_by

Group rows using group_by

Group rows using group_by

Group rows using group_by

It’s often more useful to summarize a variable based on the groupings of another variable.

Group rows using group_by

Example 1

Let’s say, we are interested in the mean of total votes cast in 2016 and total votes cast in 2012 but grouped by state. To be more specific: we want the mean and median votes cast in 2016

Group rows using group_by

What does the data look like now?

electiondta %>% 
  head()
## # A tibble: 6 x 17
##   state_name county_name total_2012 dem_2012 pct_dem_2012 gop_2012
##   <chr>      <chr>            <dbl>    <dbl>        <dbl>    <dbl>
## 1 utah       Utah County     161278    15791         9.79   142337
## 2 utah       Cache Coun…      37099     5434        14.6     30818
## 3 idaho      Madison Co…      14412      832         5.77    13445
## 4 utah       Davis Coun…     112568    20417        18.1     90030
## 5 utah       Morgan Cou…       4370      386         8.83     3910
## 6 utah       Box Elder …      18092     1830        10.1     15969
## # … with 11 more variables: pct_gop_2012 <dbl>, oth_2012 <dbl>,
## #   pct_oth_2012 <dbl>, total_2016 <dbl>, dem_2016 <dbl>,
## #   pct_dem_2016 <dbl>, gop_2016 <dbl>, pct_gop_2016 <dbl>,
## #   oth_2016 <dbl>, pct_oth_2016 <dbl>, trump_county <chr>

Group rows using group_by

Step by step:

  1. split by State
  2. aggregated by State
  3. collapsed over State

Group rows using group_by

electiondta %>% 
  group_by(state_name) %>% 
  summarize(mean_20165 = mean(total_2016), 
            mean_2012 = median(total_2012))
## # A tibble: 50 x 3
##    state_name           mean_20165 mean_2012
##    <chr>                     <dbl>     <dbl>
##  1 alabama                  31017.    14562 
##  2 arizona                 137521.    35775 
##  3 arkansas                 14782.     6919 
##  4 california              166068.    52330.
##  5 colorado                 40065.     6658.
##  6 connecticut             202943.   103224.
##  7 delaware                147178.    93215 
##  8 district of columbia    280272    243348 
##  9 florida                 139521.    65958 
## 10 georgia                  25343.     9205 
## # … with 40 more rows

Create new variables/change old variables using mutate

Create new variables using mutate

Create new variables using mutate

When looking at the electiondta dataset, there are some variables that were created using other variables in the dataset. For instance, the variable pct_dem_2012 refers to the percentage vote the Democratic candidate got in 2012. This variable was created by doing the following:

Create new variables using mutate

  1. taking dem_2012 and dividing it by total_2012 to get a proportion.
  2. Then, multiplying it by 100 to convert it to percentage points.

We will create this variable again using the mutate function.

Example 1: Creating pct_dem_2012

electiondta %>% 
  mutate(pct_dem_2012_copy = (dem_2012 / total_2012)*100) %>%
  select( state_name, county_name, pct_dem_2012_copy,pct_dem_2012)
## # A tibble: 3,122 x 4
##    state_name county_name      pct_dem_2012_copy pct_dem_2012
##    <chr>      <chr>                        <dbl>        <dbl>
##  1 utah       Utah County                   9.79         9.79
##  2 utah       Cache County                 14.6         14.6 
##  3 idaho      Madison County                5.77         5.77
##  4 utah       Davis County                 18.1         18.1 
##  5 utah       Morgan County                 8.83         8.83
##  6 utah       Box Elder County             10.1         10.1 
##  7 utah       Salt Lake County             38.8         38.8 
##  8 utah       Wasatch County               23.0         23.0 
##  9 utah       Weber County                 25.8         25.8 
## 10 utah       Tooele County                23.1         23.1 
## # … with 3,112 more rows

Create new variables using mutate

What did we just do?

  1. Take the data frame electiondta then
  2. create a variable that we are calling pct_dem_2012_copy that is equal to (dem_2012/total_2012)*100 then
  3. To make the results more easily viewable we are selecting state_name, county_name, pct_dem_2012_copy, and the original variable pct_dem_2012 for comparison

Using the original variable as a check, we can see that our pct_dem_2012_copy, the only difference being that our measure extends by a few decimal places.

Reorder the data frame using arrange

Reorder the data frame using arrange

One of the most common things people working with data would like to do is sort the data frames by a specific variable in a column.

Reorder the data frame using arrange

This requires you to put the data in order from smallest to highest in value.

The dplyr package has a function called arrange that we will use to sort/reorder our data according to the values of the specified variable.

This is often used after we have used the group_by and summarize functions as we will see.

Let’s suppose we are interested in determining the states with the largest numbers of counties that were won by Trump.

Reorder the data frame using arrange

electiondta %>%
  filter(trump_county == "Trump Won") %>%
  group_by(state_name, trump_county) %>%
  summarise(trump_counties = n())
## # A tibble: 48 x 3
## # Groups:   state_name [48]
##    state_name  trump_county trump_counties
##    <chr>       <chr>                 <int>
##  1 alabama     Trump Won                54
##  2 arizona     Trump Won                11
##  3 arkansas    Trump Won                67
##  4 california  Trump Won                26
##  5 colorado    Trump Won                41
##  6 connecticut Trump Won                 2
##  7 delaware    Trump Won                 2
##  8 florida     Trump Won                59
##  9 georgia     Trump Won               128
## 10 idaho       Trump Won                42
## # … with 38 more rows

Reorder the data frame using arrange

OK great! But it looks like these are all out of order. So, we can just use arrange() to get them sorted.

arrange() will automatically sort in ascending order (smallest to largest or A to Z) unless you tell it differently.

Since we do, we need to let it know we want it sorted in descending order to get the largest numbers on the top.

Reorder the data frame using arrange

electiondta %>%
  filter(trump_county == "Trump Won") %>%
  group_by(state_name, trump_county) %>%
  summarise(trump_counties = n()) %>%
  arrange(desc(trump_counties))
## # A tibble: 48 x 3
## # Groups:   state_name [48]
##    state_name trump_county trump_counties
##    <chr>      <chr>                 <int>
##  1 texas      Trump Won               228
##  2 georgia    Trump Won               128
##  3 kentucky   Trump Won               118
##  4 missouri   Trump Won               111
##  5 kansas     Trump Won               103
##  6 iowa       Trump Won                93
##  7 tennessee  Trump Won                92
##  8 illinois   Trump Won                91
##  9 nebraska   Trump Won                91
## 10 indiana    Trump Won                88
## # … with 38 more rows

Reorder the data frame using arrange

What did that code just do?

electiondta %>%
  filter(trump_county == "Trump Won") %>%
  group_by(state_name, trump_county) %>%
  summarise(trump_counties = n()) %>%
  arrange(desc(trump_counties))

Reorder the data frame using arrange

  1. Take the data frame electiondta then
  2. Filter the data, since we are only interested in the counties that Trump won (using trump_county) then
  3. Group the data by state (using state_name) and whether or not Trump won (using trump_county) then
  4. Summarise the data to get a count of the number of counties Trump won (using n()) then
  5. Arrange the data in descending order of trump_counties so we can see which states are the largest.