Tidying data

# Tidying data
## Data Science for Cognitive (Neuro)Science
### Ina Bornkessel-Schlesewsky
### Jan 2021 (updated: 2021-01-25)

---

# Recall ...

* **import** data (into R)
* **tidy** data
  - bring it into a consistent format that can be used for multiple purposes (each column = variable; each row = observation)
  - lets you focus on understanding the data rather than which format you need

]

<br>

* **transform** data
  * e.g. focus on observations of interest (such as those from a particular location), create new variables (such as speed from distance and time), compute summary statistics
* **visualise** data
  * essential for understanding
* **model** data
  * use (statistical) models to answer your questions about the data
* **communicate** insights
]

<span style="font-variant:small-caps;">Note how you already know the basics of a number of these steps!</span>

---

# Tidy data

.pull-left[
### Characteristics of tidy data:
* Each variable has its own column
* Each observation has its own row
* Each value has its own cell
]

<div class="figure">
<img src="images/tidy_data_r4ds.png" alt="figure from R4DS" width="2795" />
<p class="caption">figure from R4DS</p>
</div>

---

# Different data formats

<br>

* The three data frames on the next slide show example data provided in the `tidyr` package, which is part of the `tidyverse`.

> "all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000"

(from `?tidyr::table1`)

* Which of these tables is tidy?

.font80[
If you want to inspect the data yourself, you can access them via `table`, `table2` and `table3`. 
]

---

# Different data formats

```
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
```

```
## # A tibble: 6 x 4
##   country      year type           count
##   <chr>       <int> <chr>          <int>
## 1 Afghanistan  1999 cases            745
## 2 Afghanistan  1999 population  19987071
## 3 Afghanistan  2000 cases           2666
## 4 Afghanistan  2000 population  20595360
## 5 Brazil       1999 cases          37737
## 6 Brazil       1999 population 172006362
```
]

```
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
```
]

---

# Different data formats

<br>

* If you said `table1`, you were right!

<br>

* Another possibility is for data to be spread across two data frames
* See the next slide for an example (`table4a` and `table4b` from `tidyr`)

---

# Different data formats

```
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```
]

```
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
```
]

---

# Advantages to working with tidy data

<br>

1. Having one consistent format for data makes it easier to learn the tools required for analysis (which can have a certain uniformity). The `tidyverse` packages, for example, are designed to work with tidy data (who would have thought! 🤣)

2. It is advantageous for variables to be placed in columns because this caters to R's vectorised nature. (Most R-functions work with vectors of values.)

---

# Your turn!

## Complete the first part of the *tidy data* tutorial

---

# Dealing with untidy data

## (you will need this more often than you might think)

---

# Data are often untidy

### Common problems

1. Variables are spread across multiple columns
2. Observations are spread across multiple rows

### The solution

* functions `pivot_longer()` and `pivot_wider()` in `tidyr`!

*Note: this doesn't mean that non-tidy data are "bad". There can be many reasons for why a dataset is in a non-tidy format, e.g. easy of data entry if this is being done manually.*
---

# Pivot to longer

**Example**: `table4a`

```
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```
]

We need:

* the columns with values as names (`1999` and `2000`)
* the name of the variable to move the column names to (`year`)
* the name of the variable to move the column values to (`cases`)

]

---

# Pivot to longer

```r
table4a %>% 
  pivot_longer(c(`1999`,`2000`), names_to = "year", values_to = "cases")
```

```
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
```

---

# Pivot to longer

*Exercise: try doing the same thing with `table4b`!*

---

# Excursus: joining tables

We can easily join the longer versions of `table4a` and `table4b` using `left_join()` (more on joining operations later):

```r
tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

left_join(tidy4a, tidy4b)
```

```
## Joining, by = c("country", "year")
```

```
## # A tibble: 6 x 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583
```

---

# Pivot to wider

* `pivot_wider()` is the counterpart of `pivot_longer()` which you need when observations are spread across multiple rows such as in `table2`
* here, the table needs to be made wider

```
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
```

---

# Pivot to wider

To tidy `table2` we need:

* the column to take variables from (`type`)
* the column to take values from (`count`)

```r
table2 %>% 
  pivot_wider(names_from = type, values_from = count)
```

---

# Pivot to wider

---

# Try the pivoting exercises in the tutorial

---

# What's up with table3?

<br>

---

# Multiple variables in one column

* in `table3`, the `rate` column contains both cases and population
* to deal with this problem, we can use the `separate()` function
* it allows us to easily split a column according to a delimiting character (here, the "/")
* note how `separate` is clever enough to correctly guess the delimiting character -- it looks for a non-alphanumeric character by default (to specify it manually, use `sep = "/"`)

```r
table3 %>% 
  separate(rate, into = c("cases","population"))
```

```
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
```

---

# Multiple variables in one column

<br>

* by default, `separate` retains the original column type (character in this case)
* we can ask it to try to convert to a more suitable type using the `convert` parameter

```r
table3 %>% 
  separate(rate, into = c("cases","population"), convert = TRUE)
```

---

# Putting it all together ...

## Work through the case study + exercises in the tutorial