Error: 'weather.csv' does not exist in current working directory ('/home/steffi/Projects/Workshops/workshop-dealing-with-data').
Workshop: Dealing with Data in R
     steffilazerte
     @steffilazerte@fosstodon.org
     @steffilazerte
     steffilazerte.ca
    
  
Compiled: 2025-01-12
Save previous script
 Open New File 
 (make sure you’re in the RStudio Project)
 Write library(tidyverse) at the top
 Save this new script 
 (consider names like cleaning.R or 3_loading_and_cleaning.R)
Click on “New Folder”
R base vs. tidyverse
tidyverse|> *
*We’ll cover pipes soon 😁
tidyverseggplot2, dplyr, tidyr, readr%>% or base pipe |> *
Useful to know if functions aretidyverse or R base
| Type | Extension | R Package | R function | 
|---|---|---|---|
| Excel | .xls, .xlsx | readxl* | 
read_excel() | 
| Open Document | .ods | readODS | 
read_ods() | 
| SPSS | .sav, .zsav, .por | haven | 
read_spss() | 
| SAS | .sas7bdat | haven | 
read_sas() | 
| Stata | .dta | haven | 
read_dta() | 
| Database Files | .dbf | foreign | 
read.dbf() | 
For files that don’t change, better to save as a *.csv
 (Comma-separated-variables file)
 * part of the tidyverse
| Type | R base | readr package * | 
|---|---|---|
| Comma separated | read.csv() | 
read_csv(), read_csv2() | 
| Tab separated | read.delim() | 
read_tsv() | 
| Space separated | read.table() | 
read_table() | 
| Fixed-width | read.fwf() | 
read_fwf() | 
 * part of the tidyverse
readr package especially useful for big data sets (fast!)readr are a bit more helpfulWe’ll focus on
readxl package  read_excel()readr package  read_csv(), read_tsv()Error: 'weather.csv' does not exist in current working directory ('/home/steffi/Projects/Workshops/workshop-dealing-with-data').
With no folder (just file name) R expects file to be in Working directory
setwd() or RStudio’s Session > Set Working Directory)Don’t use setwd()
Do use Projects in RStudio
/home is a folder/home and steffi are folderssteffi is a folder inside of home/home, steffi, Documents, R Projects are folderssteffi is inside of home, Documents is inside of steffi, etc.mydata.csv is a data file inside R Projects folderRStudio Files Pane
| OS | Path | 
|---|---|
| LINUX | /home/steffi/Documents/R Projects/mydata.csv | 
| WINDOWS | C:/Users/steffi/My Documents/R Projects/mydata.csv | 
| MAC | /users/steffi/Documents/R Projects/mydata.csv | 
Full location, folders and filename
| Path | Where to look | 
|---|---|
| mydata.csv | Here (current directory) | 
| ../mydata.csv | Go up one directory (../) | 
| data/mydata.csv | Stay here, go into “data” folder (data/) | 
| ../data/mydata.csv | Go up one directory (../), then into “data” folder (data/) | 
Only relative info
Use relative symbols (e.g., ../)
With RStudio ‘Projects’ only need to use relative paths
Working with water_cleaned.xlsx
waterUse head() and tail() functions to look at the data
e.g., head(water) and tail(water)
Click on the water object in your “Environment” pane to look at the whole data set

read_excel from readxl package)read_csv(), read_tsv(), etc.)water_cleaned.xlsx  Excel file  read_excel()water_raw.csv  Comma-separated-variables  read_csv()But sometimes not clear…
master_moch.txtID  region  hab freq    freq.sd p.notes
MCB02   kam 0.5266879074    3.9806600009    3.9806600009    0.4592592593
MCB03   kam -0.9707703735   4.1090031783    4.1090031783    0.5
MCB04   kam -0.9707703735   4.2463067674    4.2463067674    0.5151515152
This does not read the file into R, but only shows you the contents as text.
Hmm, not comma-separated, maybe tab-separated?
read_csv() is a good start)n_max to read only first few rows# A tibble: 3 × 1
  `ID\tregion\thab\tfreq\tfreq.sd\tp.notes`                            
  <chr>                                                                
1 "MCB02\tkam\t0.5266879074\t3.9806600009\t3.9806600009\t0.4592592593" 
2 "MCB03\tkam\t-0.9707703735\t4.1090031783\t4.1090031783\t0.5"         
3 "MCB04\tkam\t-0.9707703735\t4.2463067674\t4.2463067674\t0.5151515152"
\t means tab, so this is tab-separated data
read_tsv()# A tibble: 3 × 6
  ID    region    hab  freq freq.sd p.notes
  <chr> <chr>   <dbl> <dbl>   <dbl>   <dbl>
1 MCB02 kam     0.527  3.98    3.98   0.459
2 MCB03 kam    -0.971  4.11    4.11   0.5  
3 MCB04 kam    -0.971  4.25    4.25   0.515
Excellent!
col_namesWorking with Geolocator data
# A tibble: 20 × 2
   `02/05/11 22:29:59`  `64`
   <chr>               <dbl>
 1 02/05/11 22:31:59      64
 2 02/05/11 22:33:59      38
 3 02/05/11 22:35:59      38
 4 02/05/11 22:37:59      34
 5 02/05/11 22:39:59      30
 6 02/05/11 22:41:59      34
 7 02/05/11 22:43:59      40
 8 02/05/11 22:45:59      46
 9 02/05/11 22:47:59      48
10 02/05/11 22:49:59      46
# ℹ 10 more rows
Oops?
read_csv, read_tsv, etc. assume that the first row contains the column namescol_namesWorking with Geolocator data
# A tibble: 21 × 2
   X1                   X2
   <chr>             <dbl>
 1 02/05/11 22:29:59    64
 2 02/05/11 22:31:59    64
 3 02/05/11 22:33:59    38
 4 02/05/11 22:35:59    38
 5 02/05/11 22:37:59    34
 6 02/05/11 22:39:59    30
 7 02/05/11 22:41:59    34
 8 02/05/11 22:43:59    40
 9 02/05/11 22:45:59    46
10 02/05/11 22:47:59    48
# ℹ 11 more rows
# A tibble: 21 × 2
   date              light
   <chr>             <dbl>
 1 02/05/11 22:29:59    64
 2 02/05/11 22:31:59    64
 3 02/05/11 22:33:59    38
 4 02/05/11 22:35:59    38
 5 02/05/11 22:37:59    34
 6 02/05/11 22:39:59    30
 7 02/05/11 22:41:59    34
 8 02/05/11 22:43:59    40
 9 02/05/11 22:45:59    46
10 02/05/11 22:47:59    48
# ℹ 11 more rows
skip info rows before dataWorking with Grain size data
# A tibble: 36 × 7
   `DATA DOWNLOAD: 2015-09-23` ...2  ...3        ...4     ...5  ...6  ...7 
   <chr>                       <chr> <chr>       <chr>    <chr> <chr> <chr>
 1 SYSTEM 001                  <NA>  <NA>        <NA>     <NA>  <NA>  <NA> 
 2 LOGGER X                    <NA>  <NA>        <NA>     <NA>  <NA>  <NA> 
 3 lab_num                     CSP   sample_num  depth_lb csa   msa   fsa  
 4 3177                        CSP01 CSP01-P-1-1 4        13.04 17.37 8.19 
 5 3178                        CSP01 CSP01-P-1-2 12       10.74 16.9  7.92 
 6 3179                        CSP01 CSP01-P-1-3 35       12.11 17.75 6.99 
 7 3180                        CSP01 CSP01-P-1-4 53       17.61 18.16 6.29 
 8 3181                        CSP01 CSP01-P-1-5 83       21.05 18.38 6.26 
 9 3182                        CSP01 CSP01-P-1-6 105      19.02 18.43 6.28 
10 3183                        CSP08 CSP08-P-1-1 10       11.6  17.14 8.18 
# ℹ 26 more rows
skip info rows before dataWorking with Grain size data
Look at the file:
DATA DOWNLOAD: 2015-09-23                       
SYSTEM 001                      
LOGGER X                        
lab_num CSP sample_num  depth_lb    csa msa fsa
3177    CSP01   CSP01-P-1-1 4   13.04   17.37   8.19
3178    CSP01   CSP01-P-1-2 12  10.74   16.9    7.92
3179    CSP01   CSP01-P-1-3 35  12.11   17.75   6.99
3180    CSP01   CSP01-P-1-4 53  17.61   18.16   6.29
3181    CSP01   CSP01-P-1-5 83  21.05   18.38   6.26
Ah ha!
Metadata was stored at the top of the file
skip info rows before dataWorking with Grain size data
skip = 3 to skip the first three rows# A tibble: 33 × 7
   lab_num CSP   sample_num  depth_lb   csa   msa   fsa
     <dbl> <chr> <chr>          <dbl> <dbl> <dbl> <dbl>
 1    3177 CSP01 CSP01-P-1-1        4 13.0  17.4   8.19
 2    3178 CSP01 CSP01-P-1-2       12 10.7  16.9   7.92
 3    3179 CSP01 CSP01-P-1-3       35 12.1  17.8   6.99
 4    3180 CSP01 CSP01-P-1-4       53 17.6  18.2   6.29
 5    3181 CSP01 CSP01-P-1-5       83 21.0  18.4   6.26
 6    3182 CSP01 CSP01-P-1-6      105 19.0  18.4   6.28
 7    3183 CSP08 CSP08-P-1-1       10 11.6  17.1   8.18
 8    3184 CSP08 CSP08-P-1-2       27 15.4  16.2   6.76
 9    3185 CSP08 CSP08-P-1-3       90 14.9  15.8   7.12
10    3186 CSP02 CSP02-P-1-1        5  8.75  8.64  3.41
# ℹ 23 more rows
Much better!
Load Data: Sta A Data 2006-11-07.dmp
# A tibble: 19 × 7
  StartDate Time     Frequency `Rate/Temp`   Pwr Ant      SD
      <dbl> <time>       <dbl>       <dbl> <dbl> <chr> <dbl>
1     39022 17:15:36      150.        34.8   175 M0        0
2     39022 17:19:14      148.        19.2    72 M0        0
3     39022 17:19:25      148.        19.7   194 M1        0
4     39022 17:20:04      149.        33.8   104 M0        0
5     39022 17:20:17      149.        33.7   152 M1        0
6     39022 17:20:57      150.        34.2   188 M0        0
7     39022 17:22:50      148.         9.8   188 M0        0
# ℹ 12 more rows
Too Easy?
Load some of your own tricky data OR
Try to load the second sheet of water_cleaned.xlsx and rename the columns in the same step
(see ?read_excel)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4 Adelie  Torgersen           NA            NA                  NA          NA <NA>    2007
 5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
 6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
 7 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
 8 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
 9 Adelie  Torgersen           34.1          18.1               193        3475 <NA>    2007
10 Adelie  Torgersen           42            20.2               190        4250 <NA>    2007
# ℹ 334 more rows
# A tibble: 6 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int> <fct>  <int>
1 Chinstrap Dream            45.7          17                 195        3650 female  2009
2 Chinstrap Dream            55.8          19.8               207        4000 male    2009
3 Chinstrap Dream            43.5          18.1               202        3400 female  2009
4 Chinstrap Dream            49.6          18.2               193        3775 male    2009
5 Chinstrap Dream            50.8          19                 210        4100 male    2009
6 Chinstrap Dream            50.2          18.7               198        3775 female  2009
skim() the dataskim() is from skimrnumeric),character)date, POSIXct, datetime)factor)NAs?── Data Summary ────────────────────────
                           Values  
Name                       penguins
Number of rows             344     
Number of columns          8       
_______________________            
Column type frequency:             
  factor                   3       
  numeric                  5       
________________________           
Group variables            None    
── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts                 
1 species               0         1     FALSE          3 Ade: 152, Gen: 124, Chi: 68
2 island                0         1     FALSE          3 Bis: 168, Dre: 124, Tor: 52
3 sex                  11         0.968 FALSE          2 mal: 168, fem: 165         
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate   mean      sd     p0    p25    p50    p75   p100 hist 
1 bill_length_mm            2         0.994   43.9   5.46    32.1   39.2   44.4   48.5   59.6 ▃▇▇▆▁
2 bill_depth_mm             2         0.994   17.2   1.97    13.1   15.6   17.3   18.7   21.5 ▅▅▇▇▂
3 flipper_length_mm         2         0.994  201.   14.1    172    190    197    213    231   ▂▇▃▅▂
4 body_mass_g               2         0.994 4202.  802.    2700   3550   4050   4750   6300   ▃▇▆▃▂
5 year                      0         1     2008.    0.818 2007   2007   2008   2009   2009   ▇▁▇▁▇

count() categoriescount() is from dplyr*
 * part of the tidyverse



 * part of the tidyverse

 * part of the tidyverse
Working with: water_raw.csv
# A tibble: 300 × 7
   `River Name` Site        Ele       Amo `Temperature C°`  Year Wea   
   <chr>        <chr>       <chr>   <dbl>            <dbl> <dbl> <chr> 
 1 Grasse       Up stream   Al     0.606            10.9    2019 snowy 
 2 Grasse       Mid stream  Al     0.425             8.68   2020 cloudy
 3 Grase        Down stream Al     0.194             8.75   2021 cloudy
 4 Oswegatchie  Up stream   Al     1                 0.791  2022 sunny 
 5 Oswegatchie  Mid stream  Al     0.161             9.32   2023 snowy 
 6 Oswegatchie  Down stream Al     0.0333           10.6    2019 wet   
 7 Raquette     Up stream   Al     0.292             4.01   2020 snowy 
 8 Raquette     Mid stream  Al     0.0389            5.96   2021 sunny 
 9 Raquette     Down stream Al    NA                 6.21   2022 cloudy
10 St. Regis    Up stream   Al     0.681             8.02   2023 wet   
# ℹ 290 more rows
River Name and Temperature C°) or obvious (what is Ele?)Grase should be Grasse)# A tibble: 300 × 7
   `River Name` Site        Ele       Amo `Temperature C°`  Year Wea   
   <chr>        <chr>       <chr>   <dbl>            <dbl> <dbl> <chr> 
 1 Grasse       Up stream   Al     0.606            10.9    2019 snowy 
 2 Grasse       Mid stream  Al     0.425             8.68   2020 cloudy
 3 Grase        Down stream Al     0.194             8.75   2021 cloudy
 4 Oswegatchie  Up stream   Al     1                 0.791  2022 sunny 
 5 Oswegatchie  Mid stream  Al     0.161             9.32   2023 snowy 
 6 Oswegatchie  Down stream Al     0.0333           10.6    2019 wet   
 7 Raquette     Up stream   Al     0.292             4.01   2020 snowy 
 8 Raquette     Mid stream  Al     0.0389            5.96   2021 sunny 
 9 Raquette     Down stream Al    NA                 6.21   2022 cloudy
10 St. Regis    Up stream   Al     0.681             8.02   2023 wet   
# ℹ 290 more rows
skim() the datacount() some columnsggplot()sFind any problems?
clean_names() is from janitor** not part of the tidyverse but tidyverse-orientated
# A tibble: 300 × 7
   river_name  site        ele       amo temperature_c  year wea   
   <chr>       <chr>       <chr>   <dbl>         <dbl> <dbl> <chr> 
 1 Grasse      Up stream   Al     0.606         10.9    2019 snowy 
 2 Grasse      Mid stream  Al     0.425          8.68   2020 cloudy
 3 Grase       Down stream Al     0.194          8.75   2021 cloudy
 4 Oswegatchie Up stream   Al     1              0.791  2022 sunny 
 5 Oswegatchie Mid stream  Al     0.161          9.32   2023 snowy 
 6 Oswegatchie Down stream Al     0.0333        10.6    2019 wet   
 7 Raquette    Up stream   Al     0.292          4.01   2020 snowy 
 8 Raquette    Mid stream  Al     0.0389         5.96   2021 sunny 
 9 Raquette    Down stream Al    NA              6.21   2022 cloudy
10 St. Regis   Up stream   Al     0.681          8.02   2023 wet   
# ℹ 290 more rows
 
rename() is from dplyr*rename() columns
# A tibble: 300 × 7
   river_name  site        element  amount temperature  year wea   
   <chr>       <chr>       <chr>     <dbl>       <dbl> <dbl> <chr> 
 1 Grasse      Up stream   Al       0.606       10.9    2019 snowy 
 2 Grasse      Mid stream  Al       0.425        8.68   2020 cloudy
 3 Grase       Down stream Al       0.194        8.75   2021 cloudy
 4 Oswegatchie Up stream   Al       1            0.791  2022 sunny 
 5 Oswegatchie Mid stream  Al       0.161        9.32   2023 snowy 
 6 Oswegatchie Down stream Al       0.0333      10.6    2019 wet   
 7 Raquette    Up stream   Al       0.292        4.01   2020 snowy 
 8 Raquette    Mid stream  Al       0.0389       5.96   2021 sunny 
 9 Raquette    Down stream Al      NA            6.21   2022 cloudy
10 St. Regis   Up stream   Al       0.681        8.02   2023 wet   
# ℹ 290 more rows
 * part of the tidyverse
select() is from dplyr*select() columns you want 
 * part of the tidyverse
OR, unselect() columns you don’t want
# A tibble: 300 × 6
   river_name  site        element  amount temperature  year
   <chr>       <chr>       <chr>     <dbl>       <dbl> <dbl>
 1 Grasse      Up stream   Al       0.606       10.9    2019
 2 Grasse      Mid stream  Al       0.425        8.68   2020
 3 Grase       Down stream Al       0.194        8.75   2021
 4 Oswegatchie Up stream   Al       1            0.791  2022
 5 Oswegatchie Mid stream  Al       0.161        9.32   2023
 6 Oswegatchie Down stream Al       0.0333      10.6    2019
 7 Raquette    Up stream   Al       0.292        4.01   2020
 8 Raquette    Mid stream  Al       0.0389       5.96   2021
 9 Raquette    Down stream Al      NA            6.21   2022
10 St. Regis   Up stream   Al       0.681        8.02   2023
# ℹ 290 more rows
water <- read_csv("data/water_raw.csv")
water <- clean_names(water)
water <- rename(water, element = ele, amount = amo, temperature = temperature_c)
water <- select(water, -wea)
water# A tibble: 300 × 6
   river_name  site        element  amount temperature  year
   <chr>       <chr>       <chr>     <dbl>       <dbl> <dbl>
 1 Grasse      Up stream   Al       0.606       10.9    2019
 2 Grasse      Mid stream  Al       0.425        8.68   2020
 3 Grase       Down stream Al       0.194        8.75   2021
 4 Oswegatchie Up stream   Al       1            0.791  2022
 5 Oswegatchie Mid stream  Al       0.161        9.32   2023
 6 Oswegatchie Down stream Al       0.0333      10.6    2019
 7 Raquette    Up stream   Al       0.292        4.01   2020
 8 Raquette    Mid stream  Al       0.0389       5.96   2021
 9 Raquette    Down stream Al      NA            6.21   2022
10 St. Regis   Up stream   Al       0.681        8.02   2023
# ℹ 290 more rows
Note how code repeats data frame water…
Remember the typos…
Combine the if_else function with the mutate() function
if_else() and mutate() from dplyr package*if_else() tests for a condition, and returns one value if FALSE and another if TRUE
 * part of the tidyverse
count()mutate() and if_else() to fix the typoToo Easy?
Examine and fix problems in your own data OR
Use case_when() to fix all the river name typos at once…
tidyverse functionstidyverse functionsrename(), select(), mutate()tidyverse functions always start with the data, followed by other argumentsrename() changes column namesselect() chooses columns to keep or to remove (with -)mutate() changes column contents
tidyverse functions?|>* Allow you to string commands togetherInstead of:
water <- read_csv("data/water_raw.csv")
water <- clean_names(water)
water <- rename(water, element = ele, amount = amo, temperature = temperature_c)
water <- select(water, -wea)
water <- mutate(water, 
                 river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
                                        river_name == "raquette" ~ "Raquette",
                                        TRUE ~ river_name))* |> is the base pipe, %>% is the tidyverse pipe, you can use either

Take a moment to play with this code in your console
water <- read_csv("data/water_raw.csv")
water <- clean_names(water)
water <- rename(water, element = ele, amount = amo, temperature = temperature_c)
water <- select(water, -wea)
water <- mutate(water, 
                 river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
                                        river_name == "raquette" ~ "Raquette",
                                        TRUE ~ river_name))
NAsData that is missing
Data that should be missing
NAsamountfilter() to take a closer look# A tibble: 39 × 6
   river_name  site        element amount temperature  year
   <chr>       <chr>       <chr>    <dbl>       <dbl> <dbl>
 1 Raquette    Down stream Al          NA        6.21  2022
 2 Raquette    Up stream   Ba          NA        5.23  2022
 3 Raquette    Up stream   Br          NA      -99     2019
 4 Oswegatchie Up stream   Ca          NA        4.76  2023
 5 Raquette    Down stream Ce          NA       13.9   2020
 6 Grasse      Up stream   Cu          NA        9.13  2019
 7 Raquette    Down stream Dy          NA        4.98  2019
 8 Raquette    Down stream Er          NA        3.07  2021
 9 Raquette    Down stream Fe          NA        7.20  2023
10 Raquette    Down stream Gd          NA        4.73  2020
# ℹ 29 more rows
NAsdrop_na() is from tidyr*Omit NAs from the amount column only (drop those rows)
Omit all NAs from all columns (drop those rows)
Check…
# A tibble: 0 × 6
# ℹ 6 variables: river_name <chr>, site <chr>, element <chr>, amount <dbl>, temperature <dbl>, year <dbl>
[1] 261
No more NAs!
Fewer rows
* part of the tidyverse 
filter() also omits NAs 😱If we filter by the column with NAs, they are silently dropped
# A tibble: 15 × 6
   river_name  site        element  amount temperature  year
   <chr>       <chr>       <chr>     <dbl>       <dbl> <dbl>
 1 Oswegatchie Down stream Al      0.0333       10.6    2019
 2 Raquette    Mid stream  Al      0.0389        5.96   2021
 3 Grasse      Mid stream  Br      0.0357       12.4    2019
 4 St. Regis   Up stream   Br      0.0357        3.52   2022
 5 St. Regis   Mid stream  Br      0.0357        0.936  2023
 6 Raquette    Mid stream  Ce      0.0116        6.61   2019
 7 Raquette    Mid stream  Fe      0.00656      10.8    2022
 8 Grasse      Up stream   K       0.0313        3.61   2021
 9 Raquette    Mid stream  La      0.0275        2.50   2020
10 Oswegatchie Down stream Mn      0.00672       8.89   2019
# ℹ 5 more rows
We need to be explicit if we want to keep them
# A tibble: 54 × 6
   river_name  site        element  amount temperature  year
   <chr>       <chr>       <chr>     <dbl>       <dbl> <dbl>
 1 Oswegatchie Down stream Al       0.0333      10.6    2019
 2 Raquette    Mid stream  Al       0.0389       5.96   2021
 3 Raquette    Down stream Al      NA            6.21   2022
 4 Raquette    Up stream   Ba      NA            5.23   2022
 5 Grasse      Mid stream  Br       0.0357      12.4    2019
 6 Raquette    Up stream   Br      NA          -99      2019
 7 St. Regis   Up stream   Br       0.0357       3.52   2022
 8 St. Regis   Mid stream  Br       0.0357       0.936  2023
 9 Oswegatchie Up stream   Ca      NA            4.76   2023
10 Raquette    Mid stream  Ce       0.0116       6.61   2019
# ℹ 44 more rows
NAsreplace_na() is from tidyr*Check…
# A tibble: 0 × 6
# ℹ 6 variables: river_name <chr>, site <chr>, element <chr>, amount <dbl>, temperature <dbl>, year <dbl>
[1] 300
No more NAs!
Same number of rows
(If you want to do a more complex replacement, you’ll have to use if_else() or case_when() like we did for typos.)
* part of the tidyverse 
NARemember the problem with temperature?
# A tibble: 3 × 6
  river_name  site       element amount temperature  year
  <chr>       <chr>      <chr>    <dbl>       <dbl> <dbl>
1 Raquette    Up stream  Br      NA             -99  2019
2 Oswegatchie Mid stream K        0.426         -99  2020
3 St. Regis   Mid stream La       0.367         -99  2023
na_if() is from dplyr*
 * part of the tidyverse
# A tibble: 300 × 6
   river_name  site        element  amount temperature  year
   <chr>       <chr>       <chr>     <dbl>       <dbl> <dbl>
 1 Grasse      Up stream   Al       0.606       10.9    2019
 2 Grasse      Mid stream  Al       0.425        8.68   2020
 3 Grasse      Down stream Al       0.194        8.75   2021
 4 Oswegatchie Up stream   Al       1            0.791  2022
 5 Oswegatchie Mid stream  Al       0.161        9.32   2023
 6 Oswegatchie Down stream Al       0.0333      10.6    2019
 7 Raquette    Up stream   Al       0.292        4.01   2020
 8 Raquette    Mid stream  Al       0.0389       5.96   2021
 9 Raquette    Down stream Al      NA            6.21   2022
10 St. Regis   Up stream   Al       0.681        8.02   2023
# ℹ 290 more rows
Year could be categorical (factor)
Better for plotting!
(although it really depends)
# A tibble: 300 × 6
   river_name  site        element  amount temperature year 
   <chr>       <chr>       <chr>     <dbl>       <dbl> <fct>
 1 Grasse      Up stream   Al       0.606       10.9   2019 
 2 Grasse      Mid stream  Al       0.425        8.68  2020 
 3 Grasse      Down stream Al       0.194        8.75  2021 
 4 Oswegatchie Up stream   Al       1            0.791 2022 
 5 Oswegatchie Mid stream  Al       0.161        9.32  2023 
 6 Oswegatchie Down stream Al       0.0333      10.6   2019 
 7 Raquette    Up stream   Al       0.292        4.01  2020 
 8 Raquette    Mid stream  Al       0.0389       5.96  2021 
 9 Raquette    Down stream Al      NA            6.21  2022 
10 St. Regis   Up stream   Al       0.681        8.02  2023 
# ℹ 290 more rows
| Function | Input | Output | 
|---|---|---|
as.character() | 
Any vector | Text (Characters) | 
as.numeric() | 
Any vector (but returns NAs if not numbers) | Numbers | 
as.logical() | 
TRUE, FALSE, T, F, 0 (FALSE), any other number (all TRUE) | TRUE or FALSE | 
as.factor() | 
Any vector | Categories | 
We’ll deal with dates and times later…
water <- read_csv("data/water_raw.csv") |>
  clean_names() |>
  rename(element = ele, amount = amo, temperature = temperature_c) |>
  select(-wea) |>
  mutate(river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
                                river_name == "raquette" ~ "Raquette",
                                TRUE ~ river_name),
         amount = replace_na(amount, 0),
         temperature = na_if(temperature, -99),
         year = as.factor(year)) water <- read_csv("data/water_raw.csv") |>
  # Fix column names
  clean_names() |>
  rename(element = ele, amount = amo, temperature = temperature_c) |>
  select(-wea) |>
  mutate(
    # Correct typos
    river_name = case_when(river_name %in% c("Grase", "grasse") ~ "Grasse",
                           river_name == "raquette" ~ "Raquette",
                           TRUE ~ river_name),
    # Missing amounts should be 0
    amount = replace_na(amount, 0),
    # Problems with temperature logger, -99 is a mistake
    temperature = na_if(temperature, -99),
    # Convert for plotting 
    year = as.factor(year)) (Or why does R hate me?)
Working with Geolocator data
# A tibble: 21 × 2
  time              light
  <chr>             <dbl>
1 02/05/11 22:29:59    64
2 02/05/11 22:31:59    64
3 02/05/11 22:33:59    38
4 02/05/11 22:35:59    38
5 02/05/11 22:37:59    34
6 02/05/11 22:39:59    30
# ℹ 15 more rows
Here
timecolumn is consideredchr(character/text)You may know it’s a date, but R does not
Artwork by @allison_horst
lubridate package*tidyverse, but needs to be loaded separately# A tibble: 21 × 3
  time              light time_fixed         
  <chr>             <dbl> <dttm>             
1 02/05/11 22:29:59    64 2011-05-02 22:29:59
2 02/05/11 22:31:59    64 2011-05-02 22:31:59
3 02/05/11 22:33:59    38 2011-05-02 22:33:59
4 02/05/11 22:35:59    38 2011-05-02 22:35:59
5 02/05/11 22:37:59    34 2011-05-02 22:37:59
6 02/05/11 22:39:59    30 2011-05-02 22:39:59
# ℹ 15 more rows
Now
time_fixedcolumn is considereddttm(Date/Time)So You know it’s a Date/Time and now R knows too
 * part of the tidyverse
lubridate package*
Generally, only the order of the year, month, day, hour, minute, or second matters.
For example
| date/time format | function | output class | 
|---|---|---|
| 2018-01-01 13:09:11 | ymd_hms() | 
dttm (POSIXct/POSIXt) | 
| 12/20/2019 10:00 PM | mdy_hm() | 
dttm (POSIXct/POSIXt) | 
| 31/01/2000 10 AM | dmy_h() | 
dttm (POSIXct/POSIXt) | 
| 31-01/2000 | dmy() | 
Date | 
lubridateis smart enough to detect AMs and PMs
Note: R generally requires that times have dates (datetime/POSIXct), but dates don’t have to have times (Date)
 * part of the tidyverse
(For the love of all that is good don’t lose that data!!!)*
* but if you’ve been paying attention, you know that you only need the script 😉
1_cleaned.csv2_summarized.csv3_graphing.csv
* I usually have a data folder and then both raw and datasets folders inside of that
skim() to explore the format of your dataskim(), count(), filter(), select(), ggplot() to explore the content of your data|> or %>%) pass the output from one function as input to the next function: