Loading and Cleaning Data in R

class: title-slide, nobar
![:img right: 25px, bottom: 25px, 40%, , ](figures/dataset.png)

## Workshop: Dealing with Data in R
# Loading and Cleaning Data in R
## I know the file exists, why doesn't R?

.footnote[Steffi LaZerte <https://steffilazerte.ca> | *Compiled: 2022-01-26*]

---
class: section

# First things first

Save previous script

Open New File .medium[(make sure you're in the RStudio Project)]

![:spacer 10px]()

Add `library(tidyverse)` to the top

Save this new script

.medium[consider names like `cleaning.R` or `3_loading_and_cleaning.R`]

---
# R base vs. `tidyverse`

## R base
- R base is the set of functions and packages that ship with R itself
- Its core packages are installed with R and loaded by default

## `tidyverse`
- Collection of 'new' packages developed by a team closely affiliated with RStudio
- Packages designed to work well together
- Use a slightly different syntax
- Among others, includes packages used for data transformations and visualizations:
    - e.g., `ggplot2`, `dplyr`, `tidyr`, `readr`

> Can be helpful to understand whether functions are `tidyverse` or R base functions
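
A minimal sketch of the difference in syntax, assuming `dplyr` is installed and using the built-in `mtcars` data set (chosen here only for illustration). Both versions keep the 4-cylinder cars and two columns:

```r
library(dplyr)

# R base: subset rows with a logical test, columns by name
base_result <- mtcars[mtcars$cyl == 4, c("mpg", "cyl")]

# tidyverse (dplyr): the same steps written as verbs joined by the pipe
tidy_result <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, cyl)
```

The two results hold the same values; only the way the steps are written differs.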

---
class: split-50
# Dealing with data

.columnl[
## 1. Loading data
  - Get your data into R

## 2. Looking for problems
  - Typos
  - Incorrectly loaded data

## 3. Fixing problems
  - Corrections
  - Renaming
]

.columnr[
## 4. Setting formats
  - Dates
  - Numbers
  - Factors

## 5. Saving your data
]

---
class: section

# 1. Loading Data

---
class: full-width
# Data types: What kind of data do you have?

## Specific program files

Type           | Extension         | R Package   | R function
-------------- | ----------------- | ----------- | ------------
Excel          | .xls, .xlsx       | `readxl`    | `read_excel()` 
Open Document  | .ods              | `readODS`   | `read_ods()`
SPSS           | .sav, .zsav, .por | `haven`     | `read_spss()`
SAS            | .sas7bdat         | `haven`     | `read_sas()`            
Stata          | .dta              | `haven`     | `read_dta()`
Database Files | .dbf              | `foreign`   | `read.dbf()`

### Convenient but...
- Can be unreliable
- Can take longer

![:box 35%, 83%, 50%](For files that don't change, better to save as a <code>*.csv</code> (Comma-separated values file&rpar;)
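
One possible way to do this, sketched with the example workbook that ships with `readxl` (`readxl_example("datasets.xlsx")`). The output file name and the temporary folder are placeholders for illustration, not part of the workshop data:

```r
library(readxl)
library(readr)

# Example workbook included with the readxl package
xlsx_file <- readxl_example("datasets.xlsx")
excel_data <- read_excel(xlsx_file)   # reads the first sheet by default

# Save a plain-text copy (hypothetical file name, temporary folder)
csv_file <- file.path(tempdir(), "datasets.csv")
write_csv(excel_data, csv_file)
```

Later scripts could then load the `.csv` copy instead of the Excel file.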

---
class: full-width
# Data types: What kind of data do you have?

## General text files

Type            | R base          | `readr` package (`tidyverse`)
--------------- | --------------- | --------------
Comma separated | `read.csv()`    | `read_csv()` (or `read_csv2()` for semicolon-separated files)
Tab separated   | `read.delim()`  | `read_tsv()`
Space separated | `read.table()`  | `read_table()`
Fixed-width     | `read.fwf()`    | `read_fwf()`

- **`readr` package** especially useful for big data sets (fast!)
- Errors/warnings from `readr` are a bit more helpful

> We'll focus on: 
> - `readxl` package - `read_excel()` 
> - `readr` package - `read_csv()`, `read_tsv()`
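
A small sketch comparing the R base and `readr` functions on the same file. The file contents below are made up for illustration and written to a temporary file so the example is self-contained:

```r
library(readr)

# Hypothetical example file, written to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("site,temp", "A,12.5", "B,14.1"), csv_file)

base_data  <- read.csv(csv_file)   # R base: returns a data.frame
readr_data <- read_csv(csv_file)   # readr: returns a tibble

class(base_data)
class(readr_data)
```

`read_csv()` also prints a message listing the column types it guessed, which can help spot incorrectly loaded data early.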

---
# Where is my data?

## Common error

```r
my_data <- read_csv("weather.csv")
```

```
## Error: 'weather.csv' does not exist in current working directory ('/home/steffi/Projects/R Workshop/Lessons').
```

With no folder (just a file name), R expects the file to be in the **working directory**

### Working directory is:
- Where your RStudio project is
- Your home directory (My Documents, etc.) [If not using RStudio Projects]
- Where you've set it (using `setwd()` or RStudio's Session > Set Working Directory)

> Using Projects in RStudio is a great idea; try to avoid `setwd()`
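
A quick way to check where R is looking, sketched with R base functions only. The variable names are placeholders, and `weather.csv` is the file name from the error above:

```r
wd <- getwd()                          # the folder R is currently working in
here_files <- list.files()             # the files R can see in that folder
found <- file.exists("weather.csv")    # TRUE only if the file is in that folder

wd
found
```

If `found` is `FALSE`, the file is likely in a different folder than the working directory.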

---
# Where is my data?