class: center, middle, inverse, title-slide # Tidying data ## Data Science for Cognitive (Neuro)Science ### Ina Bornkessel-Schlesewsky ### Jan 2021 (updated: 2021-01-25) --- # Recall ... .pull-left[ <img src="images/r4ds_workflow.png" width="100%" style="display: block; margin: auto auto auto 0;" /> * **import** data (into R) * **tidy** data - bring it into a consistent format that can be used for multiple purposes (each column = variable; each row = observation) - lets you focus on understanding the data rather than which format you need ] .pull-right[ <br> * **transform** data * e.g. focus on observations of interest (such as those from a particular location), create new variables (such as speed from distance and time), compute summary statistics * **visualise** data * essential for understanding * **model** data * use (statistical) models to answer your questions about the data * **communicate** insights ] <span style="font-variant:small-caps;">Note how you already know the basics of a number of these steps!</span> --- # Tidy data .pull-left[ ### Characteristics of tidy data: * Each variable has its own column * Each observation has its own row * Each value has its own cell ] <div class="figure"> <img src="images/tidy_data_r4ds.png" alt="figure from R4DS" width="2795" /> <p class="caption">figure from R4DS</p> </div> --- # Different data formats <br> * The three data frames on the next slide show example data provided in the `tidyr` package, which is part of the `tidyverse`. > "all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000" (from `?tidyr::table1`) * Which of these tables is tidy? .font80[ If you want to inspect the data yourself, you can access them via `table`, `table2` and `table3`. ] --- # Different data formats .pull-left[ ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ``` ## # A tibble: 6 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ``` ] .pull-right[ ``` ## # A tibble: 6 x 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ] --- # Different data formats <br> * If you said `table1`, you were right! <br> * Another possibility is for data to be spread across two data frames * See the next slide for an example (`table4a` and `table4b` from `tidyr`) --- # Different data formats .pull-left[ ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ] .pull-right[ ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` ] --- # Advantages to working with tidy data <br> 1. Having one consistent format for data makes it easier to learn the tools required for analysis (which can have a certain uniformity). The `tidyverse` packages, for example, are designed to work with tidy data (who would have thought! 🤣) 2. It is advantageous for variables to be placed in columns because this caters to R's vectorised nature. (Most R-functions work with vectors of values.) --- class: inverse, mline, center, middle # Your turn! ## Complete the first part of the *tidy data* tutorial --- class: inverse, mline, center, middle # Dealing with untidy data ## (you will need this more often than you might think) --- # Data are often untidy ### Common problems 1. Variables are spread across multiple columns 2. Observations are spread across multiple rows ### The solution * functions `pivot_longer()` and `pivot_wider()` in `tidyr`! <br> <br> *Note: this doesn't mean that non-tidy data are "bad". There can be many reasons for why a dataset is in a non-tidy format, e.g. easy of data entry if this is being done manually.* --- # Pivot to longer .pull-left[ **Common problem**: column names are values of a variable rather than variables **Example**: `table4a` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ] .pull-right[ **Solution**: *pivot* these columns to new variables, rendering the dataset longer We need: * the columns with values as names (`1999` and `2000`) * the name of the variable to move the column names to (`year`) * the name of the variable to move the column values to (`cases`) ] --- # Pivot to longer ```r table4a %>% pivot_longer(c(`1999`,`2000`), names_to = "year", values_to = "cases") ``` ``` ## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766 ``` --- # Pivot to longer <div class="figure"> <img src="images/pivot_longer_r4ds.png" alt="from R4DS" width="1664" /> <p class="caption">from R4DS</p> </div> *Exercise: try doing the same thing with `table4b`!* --- # Excursus: joining tables We can easily join the longer versions of `table4a` and `table4b` using `left_join()` (more on joining operations later): ```r tidy4a <- table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases") tidy4b <- table4b %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population") left_join(tidy4a, tidy4b) ``` ``` ## Joining, by = c("country", "year") ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <chr> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- # Pivot to wider * `pivot_wider()` is the counterpart of `pivot_longer()` which you need when observations are spread across multiple rows such as in `table2` * here, the table needs to be made wider ``` ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` --- # Pivot to wider To tidy `table2` we need: * the column to take variables from (`type`) * the column to take values from (`count`) ```r table2 %>% pivot_wider(names_from = type, values_from = count) ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- # Pivot to wider <div class="figure"> <img src="images/pivot_wider_r4ds.png" alt="from R4DS" width="1667" /> <p class="caption">from R4DS</p> </div> --- class: inverse, mline, center, middle # Try the pivoting exercises in the tutorial --- # What's up with table3? <br> ``` ## # A tibble: 6 x 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` --- # Multiple variables in one column * in `table3`, the `rate` column contains both cases and population * to deal with this problem, we can use the `separate()` function * it allows us to easily split a column according to a delimiting character (here, the "/") * note how `separate` is clever enough to correctly guess the delimiting character -- it looks for a non-alphanumeric character by default (to specify it manually, use `sep = "/"`) ```r table3 %>% separate(rate, into = c("cases","population")) ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <chr> <chr> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- # Multiple variables in one column <br> * by default, `separate` retains the original column type (character in this case) * we can ask it to try to convert to a more suitable type using the `convert` parameter ```r table3 %>% separate(rate, into = c("cases","population"), convert = TRUE) ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- class: inverse, mline, center, middle # Putting it all together ... ## Work through the case study + exercises in the tutorial