Error: 'weather.csv' does not exist in current working directory ('/home/steffi/Projects/Workshops/workshop-dealing-with-data').
Workshop: Dealing with Data in R
steffilazerte
@steffilazerte@fosstodon.org
@steffilazerte
steffilazerte.ca
Compiled: 2025-01-12
Save previous script
Open New File
(make sure you’re in the RStudio Project)
Write library(tidyverse) at the top
Save this new script
(consider names like cleaning.R or 3_loading_and_cleaning.R)
Click on “New Folder”
R base vs. tidyverse
tidyverse|> *
*We’ll cover pipes soon 😁
tidyverseggplot2, dplyr, tidyr, readr%>% or base pipe |> *
Useful to know if functions aretidyverse or R base
| Type | Extension | R Package | R function |
|---|---|---|---|
| Excel | .xls, .xlsx | readxl* |
read_excel() |
| Open Document | .ods | readODS |
read_ods() |
| SPSS | .sav, .zsav, .por | haven |
read_spss() |
| SAS | .sas7bdat | haven |
read_sas() |
| Stata | .dta | haven |
read_dta() |
| Database Files | .dbf | foreign |
read.dbf() |
For files that don’t change, better to save as a *.csv
(Comma-separated-variables file)
* part of the tidyverse
| Type | R base | readr package * |
|---|---|---|
| Comma separated | read.csv() |
read_csv(), read_csv2() |
| Tab separated | read.delim() |
read_tsv() |
| Space separated | read.table() |
read_table() |
| Fixed-width | read.fwf() |
read_fwf() |
* part of the tidyverse
readr package especially useful for big data sets (fast!)readr are a bit more helpfulWe’ll focus on
readxl package read_excel()readr package read_csv(), read_tsv()Error: 'weather.csv' does not exist in current working directory ('/home/steffi/Projects/Workshops/workshop-dealing-with-data').
With no folder (just file name) R expects file to be in Working directory
setwd() or RStudio’s Session > Set Working Directory)Don’t use setwd()
Do use Projects in RStudio
/home is a folder/home and steffi are folderssteffi is a folder inside of home/home, steffi, Documents, R Projects are folderssteffi is inside of home, Documents is inside of steffi, etc.mydata.csv is a data file inside R Projects folderRStudio Files Pane
| OS | Path |
|---|---|
| LINUX | /home/steffi/Documents/R Projects/mydata.csv |
| WINDOWS | C:/Users/steffi/My Documents/R Projects/mydata.csv |
| MAC | /users/steffi/Documents/R Projects/mydata.csv |
Full location, folders and filename
| Path | Where to look |
|---|---|
| mydata.csv | Here (current directory) |
| ../mydata.csv | Go up one directory (../) |
| data/mydata.csv | Stay here, go into “data” folder (data/) |
| ../data/mydata.csv | Go up one directory (../), then into “data” folder (data/) |
Only relative info
Use relative symbols (e.g., ../)
With RStudio ‘Projects’ only need to use relative paths
Working with water_cleaned.xlsx
waterUse head() and tail() functions to look at the data
e.g., head(water) and tail(water)
Click on the water object in your “Environment” pane to look at the whole data set

read_excel from readxl package)read_csv(), read_tsv(), etc.)water_cleaned.xlsx Excel file read_excel()water_raw.csv Comma-separated-variables read_csv()But sometimes not clear…
master_moch.txtID region hab freq freq.sd p.notes
MCB02 kam 0.5266879074 3.9806600009 3.9806600009 0.4592592593
MCB03 kam -0.9707703735 4.1090031783 4.1090031783 0.5
MCB04 kam -0.9707703735 4.2463067674 4.2463067674 0.5151515152
This does not read the file into R, but only shows you the contents as text.
Hmm, not comma-separated, maybe tab-separated?
read_csv() is a good start)n_max to read only first few rows# A tibble: 3 × 1
`ID\tregion\thab\tfreq\tfreq.sd\tp.notes`
<chr>
1 "MCB02\tkam\t0.5266879074\t3.9806600009\t3.9806600009\t0.4592592593"
2 "MCB03\tkam\t-0.9707703735\t4.1090031783\t4.1090031783\t0.5"
3 "MCB04\tkam\t-0.9707703735\t4.2463067674\t4.2463067674\t0.5151515152"
\t means tab, so this is tab-separated data
read_tsv()# A tibble: 3 × 6
ID region hab freq freq.sd p.notes
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 MCB02 kam 0.527 3.98 3.98 0.459
2 MCB03 kam -0.971 4.11 4.11 0.5
3 MCB04 kam -0.971 4.25 4.25 0.515
Excellent!
col_namesWorking with Geolocator data
# A tibble: 20 × 2
`02/05/11 22:29:59` `64`
<chr> <dbl>
1 02/05/11 22:31:59 64
2 02/05/11 22:33:59 38
3 02/05/11 22:35:59 38
4 02/05/11 22:37:59 34
5 02/05/11 22:39:59 30
6 02/05/11 22:41:59 34
7 02/05/11 22:43:59 40
8 02/05/11 22:45:59 46
9 02/05/11 22:47:59 48
10 02/05/11 22:49:59 46
# ℹ 10 more rows
Oops?
read_csv, read_tsv, etc. assume that the first row contains the column namescol_namesWorking with Geolocator data
# A tibble: 21 × 2
X1 X2
<chr> <dbl>
1 02/05/11 22:29:59 64
2 02/05/11 22:31:59 64
3 02/05/11 22:33:59 38
4 02/05/11 22:35:59 38
5 02/05/11 22:37:59 34
6 02/05/11 22:39:59 30
7 02/05/11 22:41:59 34
8 02/05/11 22:43:59 40
9 02/05/11 22:45:59 46
10 02/05/11 22:47:59 48
# ℹ 11 more rows
# A tibble: 21 × 2
date light
<chr> <dbl>
1 02/05/11 22:29:59 64
2 02/05/11 22:31:59 64
3 02/05/11 22:33:59 38
4 02/05/11 22:35:59 38
5 02/05/11 22:37:59 34
6 02/05/11 22:39:59 30
7 02/05/11 22:41:59 34
8 02/05/11 22:43:59 40
9 02/05/11 22:45:59 46
10 02/05/11 22:47:59 48
# ℹ 11 more rows
skip info rows before dataWorking with Grain size data
# A tibble: 36 × 7
`DATA DOWNLOAD: 2015-09-23` ...2 ...3 ...4 ...5 ...6 ...7
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 SYSTEM 001 <NA> <NA> <NA> <NA> <NA> <NA>
2 LOGGER X <NA> <NA> <NA> <NA> <NA> <NA>
3 lab_num CSP sample_num depth_lb csa msa fsa
4 3177 CSP01 CSP01-P-1-1 4 13.04 17.37 8.19
5 3178 CSP01 CSP01-P-1-2 12 10.74 16.9 7.92
6 3179 CSP01 CSP01-P-1-3 35 12.11 17.75 6.99
7 3180 CSP01 CSP01-P-1-4 53 17.61 18.16 6.29
8 3181 CSP01 CSP01-P-1-5 83 21.05 18.38 6.26
9 3182 CSP01 CSP01-P-1-6 105 19.02 18.43 6.28
10 3183 CSP08 CSP08-P-1-1 10 11.6 17.14 8.18
# ℹ 26 more rows
skip info rows before dataWorking with Grain size data
Look at the file:
DATA DOWNLOAD: 2015-09-23
SYSTEM 001
LOGGER X
lab_num CSP sample_num depth_lb csa msa fsa
3177 CSP01 CSP01-P-1-1 4 13.04 17.37 8.19
3178 CSP01 CSP01-P-1-2 12 10.74 16.9 7.92
3179 CSP01 CSP01-P-1-3 35 12.11 17.75 6.99
3180 CSP01 CSP01-P-1-4 53 17.61 18.16 6.29
3181 CSP01 CSP01-P-1-5 83 21.05 18.38 6.26
Ah ha!
Metadata was stored at the top of the file
skip info rows before dataWorking with Grain size data
skip = 3 to skip the first three rows# A tibble: 33 × 7
lab_num CSP sample_num depth_lb csa msa fsa
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 3177 CSP01 CSP01-P-1-1 4 13.0 17.4 8.19
2 3178 CSP01 CSP01-P-1-2 12 10.7 16.9 7.92
3 3179 CSP01 CSP01-P-1-3 35 12.1 17.8 6.99
4 3180 CSP01 CSP01-P-1-4 53 17.6 18.2 6.29
5 3181 CSP01 CSP01-P-1-5 83 21.0 18.4 6.26
6 3182 CSP01 CSP01-P-1-6 105 19.0 18.4 6.28
7 3183 CSP08 CSP08-P-1-1 10 11.6 17.1 8.18
8 3184 CSP08 CSP08-P-1-2 27 15.4 16.2 6.76
9 3185 CSP08 CSP08-P-1-3 90 14.9 15.8 7.12
10 3186 CSP02 CSP02-P-1-1 5 8.75 8.64 3.41
# ℹ 23 more rows
Much better!
Load Data: Sta A Data 2006-11-07.dmp
# A tibble: 19 × 7
StartDate Time Frequency `Rate/Temp` Pwr Ant SD
<dbl> <time> <dbl> <dbl> <dbl> <chr> <dbl>
1 39022 17:15:36 150. 34.8 175 M0 0
2 39022 17:19:14 148. 19.2 72 M0 0
3 39022 17:19:25 148. 19.7 194 M1 0
4 39022 17:20:04 149. 33.8 104 M0 0
5 39022 17:20:17 149. 33.7 152 M1 0
6 39022 17:20:57 150. 34.2 188 M0 0
7 39022 17:22:50 148. 9.8 188 M0 0
# ℹ 12 more rows
Too Easy?
Load some of your own tricky data OR
Try to load the second sheet of water_cleaned.xlsx and rename the columns in the same step
(see ?read_excel)
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# ℹ 334 more rows
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Chinstrap Dream 45.7 17 195 3650 female 2009
2 Chinstrap Dream 55.8 19.8 207 4000 male 2009
3 Chinstrap Dream 43.5 18.1 202 3400 female 2009
4 Chinstrap Dream 49.6 18.2 193 3775 male 2009
5 Chinstrap Dream 50.8 19 210 4100 male 2009
6 Chinstrap Dream 50.2 18.7 198 3775 female 2009
skim() the dataskim() is from skimrnumeric),character)date, POSIXct, datetime)factor)NAs?── Data Summary ────────────────────────
Values
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None
── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique top_counts
1 species 0 1 FALSE 3 Ade: 152, Gen: 124, Chi: 68
2 island 0 1 FALSE 3 Bis: 168, Dre: 124, Tor: 52
3 sex 11 0.968 FALSE 2 mal: 168, fem: 165
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 bill_length_mm 2 0.994 43.9 5.46 32.1 39.2 44.4 48.5 59.6 ▃▇▇▆▁
2 bill_depth_mm 2 0.994 17.2 1.97 13.1 15.6 17.3 18.7 21.5 ▅▅▇▇▂
3 flipper_length_mm 2 0.994 201. 14.1 172 190 197 213 231 ▂▇▃▅▂
4 body_mass_g 2 0.994 4202. 802. 2700 3550 4050 4750 6300 ▃▇▆▃▂
5 year 0 1 2008. 0.818 2007 2007 2008 2009 2009 ▇▁▇▁▇

count() categoriescount() is from dplyr*
* part of the tidyverse



* part of the tidyverse

* part of the tidyverse
Working with: water_raw.csv
# A tibble: 300 × 7
`River Name` Site Ele Amo `Temperature C°` Year Wea
<chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Grasse Up stream Al 0.606 10.9 2019 snowy
2 Grasse Mid stream Al 0.425 8.68 2020 cloudy
3 Grase Down stream Al 0.194 8.75 2021 cloudy
4 Oswegatchie Up stream Al 1 0.791 2022 sunny
5 Oswegatchie Mid stream Al 0.161 9.32 2023 snowy
6 Oswegatchie Down stream Al 0.0333 10.6 2019 wet
7 Raquette Up stream Al 0.292 4.01 2020 snowy
8 Raquette Mid stream Al 0.0389 5.96 2021 sunny
9 Raquette Down stream Al NA 6.21 2022 cloudy
10 St. Regis Up stream Al 0.681 8.02 2023 wet
# ℹ 290 more rows
River Name and Temperature C°) or obvious (what is Ele?)Grase should be Grasse)# A tibble: 300 × 7
`River Name` Site Ele Amo `Temperature C°` Year Wea
<chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Grasse Up stream Al 0.606 10.9 2019 snowy
2 Grasse Mid stream Al 0.425 8.68 2020 cloudy
3 Grase Down stream Al 0.194 8.75 2021 cloudy
4 Oswegatchie Up stream Al 1 0.791 2022 sunny
5 Oswegatchie Mid stream Al 0.161 9.32 2023 snowy
6 Oswegatchie Down stream Al 0.0333 10.6 2019 wet
7 Raquette Up stream Al 0.292 4.01 2020 snowy
8 Raquette Mid stream Al 0.0389 5.96 2021 sunny
9 Raquette Down stream Al NA 6.21 2022 cloudy
10 St. Regis Up stream Al 0.681 8.02 2023 wet
# ℹ 290 more rows
skim() the datacount() some columnsggplot()sFind any problems?
clean_names() is from janitor** not part of the tidyverse but tidyverse-orientated
# A tibble: 300 × 7
river_name site ele amo temperature_c year wea
<chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Grasse Up stream Al 0.606 10.9 2019 snowy
2 Grasse Mid stream Al 0.425 8.68 2020 cloudy
3 Grase Down stream Al 0.194 8.75 2021 cloudy
4 Oswegatchie Up stream Al 1 0.791 2022 sunny
5 Oswegatchie Mid stream Al 0.161 9.32 2023 snowy
6 Oswegatchie Down stream Al 0.0333 10.6 2019 wet
7 Raquette Up stream Al 0.292 4.01 2020 snowy
8 Raquette Mid stream Al 0.0389 5.96 2021 sunny
9 Raquette Down stream Al NA 6.21 2022 cloudy
10 St. Regis Up stream Al 0.681 8.02 2023 wet
# ℹ 290 more rows
rename() is from dplyr*rename() columns
# A tibble: 300 × 7
river_name site element amount temperature year wea
<chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Grasse Up stream Al 0.606 10.9 2019 snowy
2 Grasse Mid stream Al 0.425 8.68 2020 cloudy
3 Grase Down stream Al 0.194 8.75 2021 cloudy
4 Oswegatchie Up stream Al 1 0.791 2022 sunny
5 Oswegatchie Mid stream Al 0.161 9.32 2023 snowy
6 Oswegatchie Down stream Al 0.0333 10.6 2019 wet
7 Raquette Up stream Al 0.292 4.01 2020 snowy
8 Raquette Mid stream Al 0.0389 5.96 2021 sunny
9 Raquette Down stream Al NA 6.21 2022 cloudy
10 St. Regis Up stream Al 0.681 8.02 2023 wet
# ℹ 290 more rows
* part of the tidyverse
select() is from dplyr*select() columns you want
* part of the tidyverse
OR, unselect() columns you don’t want
# A tibble: 300 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Grasse Up stream Al 0.606 10.9 2019
2 Grasse Mid stream Al 0.425 8.68 2020
3 Grase Down stream Al 0.194 8.75 2021
4 Oswegatchie Up stream Al 1 0.791 2022
5 Oswegatchie Mid stream Al 0.161 9.32 2023
6 Oswegatchie Down stream Al 0.0333 10.6 2019
7 Raquette Up stream Al 0.292 4.01 2020
8 Raquette Mid stream Al 0.0389 5.96 2021
9 Raquette Down stream Al NA 6.21 2022
10 St. Regis Up stream Al 0.681 8.02 2023
# ℹ 290 more rows
water <- read_csv("data/water_raw.csv")
water <- clean_names(water)
water <- rename(water, element = ele, amount = amo, temperature = temperature_c)
water <- select(water, -wea)
water# A tibble: 300 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Grasse Up stream Al 0.606 10.9 2019
2 Grasse Mid stream Al 0.425 8.68 2020
3 Grase Down stream Al 0.194 8.75 2021
4 Oswegatchie Up stream Al 1 0.791 2022
5 Oswegatchie Mid stream Al 0.161 9.32 2023
6 Oswegatchie Down stream Al 0.0333 10.6 2019
7 Raquette Up stream Al 0.292 4.01 2020
8 Raquette Mid stream Al 0.0389 5.96 2021
9 Raquette Down stream Al NA 6.21 2022
10 St. Regis Up stream Al 0.681 8.02 2023
# ℹ 290 more rows
Note how code repeats data frame water…
Remember the typos…
Combine the if_else function with the mutate() function
if_else() and mutate() from dplyr package*if_else() tests for a condition, and returns one value if FALSE and another if TRUE
* part of the tidyverse
count()mutate() and if_else() to fix the typoToo Easy?
Examine and fix problems in your own data OR
Use case_when() to fix all the river name typos at once…
tidyverse functionstidyverse functionsrename(), select(), mutate()tidyverse functions always start with the data, followed by other argumentsrename() changes column namesselect() chooses columns to keep or to remove (with -)mutate() changes column contents
tidyverse functions?|>* Allow you to string commands togetherInstead of:
water <- read_csv("data/water_raw.csv")
water <- clean_names(water)
water <- rename(water, element = ele, amount = amo, temperature = temperature_c)
water <- select(water, -wea)
water <- mutate(water,
river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
river_name == "raquette" ~ "Raquette",
TRUE ~ river_name))* |> is the base pipe, %>% is the tidyverse pipe, you can use either

Take a moment to play with this code in your console
water <- read_csv("data/water_raw.csv")
water <- clean_names(water)
water <- rename(water, element = ele, amount = amo, temperature = temperature_c)
water <- select(water, -wea)
water <- mutate(water,
river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
river_name == "raquette" ~ "Raquette",
TRUE ~ river_name))
NAsData that is missing
Data that should be missing
NAsamountfilter() to take a closer look# A tibble: 39 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Raquette Down stream Al NA 6.21 2022
2 Raquette Up stream Ba NA 5.23 2022
3 Raquette Up stream Br NA -99 2019
4 Oswegatchie Up stream Ca NA 4.76 2023
5 Raquette Down stream Ce NA 13.9 2020
6 Grasse Up stream Cu NA 9.13 2019
7 Raquette Down stream Dy NA 4.98 2019
8 Raquette Down stream Er NA 3.07 2021
9 Raquette Down stream Fe NA 7.20 2023
10 Raquette Down stream Gd NA 4.73 2020
# ℹ 29 more rows
NAsdrop_na() is from tidyr*Omit NAs from the amount column only (drop those rows)
Omit all NAs from all columns (drop those rows)
Check…
# A tibble: 0 × 6
# ℹ 6 variables: river_name <chr>, site <chr>, element <chr>, amount <dbl>, temperature <dbl>, year <dbl>
[1] 261
No more NAs!
Fewer rows
* part of the tidyverse 
filter() also omits NAs 😱If we filter by the column with NAs, they are silently dropped
# A tibble: 15 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Oswegatchie Down stream Al 0.0333 10.6 2019
2 Raquette Mid stream Al 0.0389 5.96 2021
3 Grasse Mid stream Br 0.0357 12.4 2019
4 St. Regis Up stream Br 0.0357 3.52 2022
5 St. Regis Mid stream Br 0.0357 0.936 2023
6 Raquette Mid stream Ce 0.0116 6.61 2019
7 Raquette Mid stream Fe 0.00656 10.8 2022
8 Grasse Up stream K 0.0313 3.61 2021
9 Raquette Mid stream La 0.0275 2.50 2020
10 Oswegatchie Down stream Mn 0.00672 8.89 2019
# ℹ 5 more rows
We need to be explicit if we want to keep them
# A tibble: 54 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Oswegatchie Down stream Al 0.0333 10.6 2019
2 Raquette Mid stream Al 0.0389 5.96 2021
3 Raquette Down stream Al NA 6.21 2022
4 Raquette Up stream Ba NA 5.23 2022
5 Grasse Mid stream Br 0.0357 12.4 2019
6 Raquette Up stream Br NA -99 2019
7 St. Regis Up stream Br 0.0357 3.52 2022
8 St. Regis Mid stream Br 0.0357 0.936 2023
9 Oswegatchie Up stream Ca NA 4.76 2023
10 Raquette Mid stream Ce 0.0116 6.61 2019
# ℹ 44 more rows
NAsreplace_na() is from tidyr*Check…
# A tibble: 0 × 6
# ℹ 6 variables: river_name <chr>, site <chr>, element <chr>, amount <dbl>, temperature <dbl>, year <dbl>
[1] 300
No more NAs!
Same number of rows
(If you want to do a more complex replacement, you’ll have to use if_else() or case_when() like we did for typos.)
* part of the tidyverse 
NARemember the problem with temperature?
# A tibble: 3 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Raquette Up stream Br NA -99 2019
2 Oswegatchie Mid stream K 0.426 -99 2020
3 St. Regis Mid stream La 0.367 -99 2023
na_if() is from dplyr*
* part of the tidyverse
# A tibble: 300 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Grasse Up stream Al 0.606 10.9 2019
2 Grasse Mid stream Al 0.425 8.68 2020
3 Grasse Down stream Al 0.194 8.75 2021
4 Oswegatchie Up stream Al 1 0.791 2022
5 Oswegatchie Mid stream Al 0.161 9.32 2023
6 Oswegatchie Down stream Al 0.0333 10.6 2019
7 Raquette Up stream Al 0.292 4.01 2020
8 Raquette Mid stream Al 0.0389 5.96 2021
9 Raquette Down stream Al NA 6.21 2022
10 St. Regis Up stream Al 0.681 8.02 2023
# ℹ 290 more rows
Year could be categorical (factor)
Better for plotting!
(although it really depends)
# A tibble: 300 × 6
river_name site element amount temperature year
<chr> <chr> <chr> <dbl> <dbl> <fct>
1 Grasse Up stream Al 0.606 10.9 2019
2 Grasse Mid stream Al 0.425 8.68 2020
3 Grasse Down stream Al 0.194 8.75 2021
4 Oswegatchie Up stream Al 1 0.791 2022
5 Oswegatchie Mid stream Al 0.161 9.32 2023
6 Oswegatchie Down stream Al 0.0333 10.6 2019
7 Raquette Up stream Al 0.292 4.01 2020
8 Raquette Mid stream Al 0.0389 5.96 2021
9 Raquette Down stream Al NA 6.21 2022
10 St. Regis Up stream Al 0.681 8.02 2023
# ℹ 290 more rows
| Function | Input | Output |
|---|---|---|
as.character() |
Any vector | Text (Characters) |
as.numeric() |
Any vector (but returns NAs if not numbers) | Numbers |
as.logical() |
TRUE, FALSE, T, F, 0 (FALSE), any other number (all TRUE) | TRUE or FALSE |
as.factor() |
Any vector | Categories |
We’ll deal with dates and times later…
water <- read_csv("data/water_raw.csv") |>
clean_names() |>
rename(element = ele, amount = amo, temperature = temperature_c) |>
select(-wea) |>
mutate(river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
river_name == "raquette" ~ "Raquette",
TRUE ~ river_name),
amount = replace_na(amount, 0),
temperature = na_if(temperature, -99),
year = as.factor(year)) water <- read_csv("data/water_raw.csv") |>
# Fix column names
clean_names() |>
rename(element = ele, amount = amo, temperature = temperature_c) |>
select(-wea) |>
mutate(
# Correct typos
river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
river_name == "raquette" ~ "Raquette",
TRUE ~ river_name),
# Missing amounts should be 0
amount = replace_na(amount, 0),
# Problems with temperature logger, -99 is a mistake
temperature = na_if(temperature, -99),
# Convert for plotting
year = as.factor(year)) (Or why does R hate me?)
Working with Geolocator data
# A tibble: 21 × 2
time light
<chr> <dbl>
1 02/05/11 22:29:59 64
2 02/05/11 22:31:59 64
3 02/05/11 22:33:59 38
4 02/05/11 22:35:59 38
5 02/05/11 22:37:59 34
6 02/05/11 22:39:59 30
# ℹ 15 more rows
Here
timecolumn is consideredchr(character/text)You may know it’s a date, but R does not
Artwork by @allison_horst
lubridate package*tidyverse, but needs to be loaded separately# A tibble: 21 × 3
time light time_fixed
<chr> <dbl> <dttm>
1 02/05/11 22:29:59 64 2011-05-02 22:29:59
2 02/05/11 22:31:59 64 2011-05-02 22:31:59
3 02/05/11 22:33:59 38 2011-05-02 22:33:59
4 02/05/11 22:35:59 38 2011-05-02 22:35:59
5 02/05/11 22:37:59 34 2011-05-02 22:37:59
6 02/05/11 22:39:59 30 2011-05-02 22:39:59
# ℹ 15 more rows
Now
time_fixedcolumn is considereddttm(Date/Time)So You know it’s a Date/Time and now R knows too
* part of the tidyverse
lubridate package*
Generally, only the order of the year, month, day, hour, minute, or second matters.
For example
| date/time format | function | output class |
|---|---|---|
| 2018-01-01 13:09:11 | ymd_hms() |
dttm (POSIXct/POSIXt) |
| 12/20/2019 10:00 PM | mdy_hm() |
dttm (POSIXct/POSIXt) |
| 31/01/2000 10 AM | dmy_h() |
dttm (POSIXct/POSIXt) |
| 31-01/2000 | dmy() |
Date |
lubridateis smart enough to detect AMs and PMs
Note: R generally requires that times have dates (datetime/POSIXct), but dates don’t have to have times (Date)
* part of the tidyverse
(For the love of all that is good don’t lose that data!!!)*
* but if you’ve been paying attention, you know that you only need the script 😉
1_cleaned.csv2_summarized.csv3_graphing.csv
* I usually have a data folder and then both raw and datasets folders inside of that
skim() to explore the format of your dataskim(), count(), filter(), select(), ggplot() to explore the content of your data|> or %>%) pass the output from one function as input to the next function: