Download Data

Author

Steffi LaZerte

Published

July 4, 2024

In this step we download, or update, our Motus SQLite databases.

Setup

source("XX_setup.R")

This step can take a lot of time, and it isn't always necessary to keep the databases constantly updated. Set update <- TRUE to update, or update <- FALSE to just check the number of new observations.

Note that in addition to downloading new data, updating will also rerun all the processing of these data into Feather-format hits and runs tables (see Save to Arrow/Feather).

update <- FALSE

Status

Get the status of each project (i.e., how much data is left to download?)

  • This will create a new project-XXX.motus SQLite database for us if one doesn’t already exist (but will not download the data)
if(update) {
  status <- data.frame(proj_id = projects, 
                       file = paste0("Data/01_Raw/project-", projects, ".motus")) |>
    mutate(status = map2(
      proj_id, file, 
      \(x, y) tellme(x, dir = "Data/01_Raw", new = !file.exists(y)))) |>
    unnest(status) |>
    mutate(currentMB = file.size(file)/1024/1024,
           newMB = numBytes/1024/1024)
  
  select(status, -file, -numBytes) |>
    relocate(currentMB, newMB, .after = proj_id) |>
    arrange(proj_id) |>
    gt() |>
    fmt_number(decimals = 0) |>
    tab_spanner(label = "New Data", columns = -c(proj_id, currentMB)) |>
    gt_theme()
}  

Download data

If this is the first time running, it will take time!

  • tagme() without arguments will update all databases in the folder
  • we can run this intermittently to update the databases as new data arrives
if(update) tagme(dir = "Data/01_Raw")
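
If we only want to refresh a single project rather than everything in the folder, tagme() also accepts a project ID. A minimal sketch (project 607 is used purely as an example here):

if(update) tagme(607, dir = "Data/01_Raw")  # 607 is just an example project ID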

Clean up

Here we’ll deal with deprecated batches and metadata.

First we’ll load the database connections.

dbs <- map(projects, \(x) tagme(x, dir = "Data/01_Raw", update = FALSE))
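
Note that the code below indexes this list by project ID (e.g., dbs[["607"]]) and relies on element names in iwalk(), so the list needs to be named. A minimal sketch of making that explicit, assuming projects is a plain vector of project IDs from XX_setup.R:

# set_names() keeps the project IDs as element names (assumes `projects` comes from XX_setup.R)
dbs <- map(set_names(projects), \(x) tagme(x, dir = "Data/01_Raw", update = FALSE))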

Remove deprecated batches

Deprecated batches have been removed from the Motus server, but are still present in data that was previously downloaded. This step cleans them out of our local databases.

if(update) {
  iwalk(dbs, \(x, y) {
    message("\nProject ", y)
    deprecateBatches(x, ask = FALSE)
  })
}

Update metadata

Here we’ll update the metadata associated with all projects so that we have better information on the different tags, receivers, etc., that these projects may be interacting with.

if(update) {
  iwalk(dbs, \(x, y) {
    message("\nProject ", y)
    metadata(x)
  })
}
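
By default this fetches metadata for all projects and receivers in the network, which is generally what we want here. If it ever becomes slow, metadata() can be restricted to specific projects; a sketch, assuming the projectIDs argument applies to our project list:

if(update) {
  iwalk(dbs, \(x, y) {
    message("\nProject ", y)
    metadata(x, projectIDs = projects)  # narrower fetch; argument use assumed, see ?metadata
  })
}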

Save to Arrow/Feather

Now we’ll create some custom views and save to feather format (see also the R4DS introduction to Apache Arrow).

Feather files are very fast to work with, and I’ve found that they are a bit simpler to deal with when we’re mostly concerned with the hits/runs tables.

So we’ll spend a bit of time converting them here so the following steps are faster.
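
For reference, here is roughly how later steps can read these files back as an Arrow Dataset and only collect into R what they actually need (a sketch, assuming the output paths used in the Export sections below):

open_dataset("Data/02_Datasets/runs/", format = "arrow") |>
  filter(proj_id == 607) |>   # 607 is just an example filter
  count(speciesID) |>
  collect()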

Custom tables

Normally, we would use the allRuns view from our .motus databases.

However, this view includes a lot of data that we don’t need, and I find it faster to pull out the variables we’re interested in by hand using a custom function, custom_runs().

Caution

That being said, there are some complex joins going on in the allRuns view. I have replicated the relevant ones here, and while I am reasonably sure these yield the same data, if there is any hint that they’re not exactly the same, we should double check (a quick spot check is sketched after the timing comparison below).

Code
custom_runs <- function(db) {
  # Replacement for NULL tsEnd values (i.e. today plus change)
  max_ts <- round(as.numeric(Sys.time()) + 1000)
  
  # Get Receivers
  r <- tbl(db, "recvDeps") |> 
    select("deviceID" = "deviceID", "recvDeployID" = "deployID", "tsStartRecv" = "tsStart", "tsEndRecv" = "tsEnd",
           "recvType" = "receiverType", "recvDeployLat" = "latitude", "recvDeployLon" = "longitude") |>
    mutate(tsEndRecv = if_else(is.na(tsEndRecv), max_ts, tsEndRecv))
  
  # Get tags
  t <- tbl(db, "tagDeps") |> 
    select("tagID", "tagDeployID" = "deployID", "speciesID", "tsStartTag" = "tsStart", "tsEndTag" = "tsEnd", "test") |>
    mutate(tsEndTag = if_else(is.na(tsEndTag), max_ts, tsEndTag))
  
  # Get batches
  b <- tbl(db, "batchRuns") |>
    distinct()
  
  # Combine with runs
  tbl(db, "runs") |>
    rename("tagID" = "motusTagID") |>
    # Add in tags by tagID *and* overlap of start/end of tag deployment with the beginning of a run
    left_join(t, by = join_by(tagID, between(tsBegin, tsStartTag, tsEndTag))) |>
    slice_max(order_by = tsStartTag, by = "runID", with_ties = FALSE) |> # If multiple deps, take only the one with max start
    # Add in batchRuns by runID (to get the batchID)
    left_join(b, by = "runID") |>
    slice_max(order_by = batchID, by = "runID", with_ties = FALSE) |> # If multiple batches, take only the latest one
    # Add in batches by batchID (to get the deviceID)
    left_join(tbl(db, "batches") |> select("batchID", "motusDeviceID"), by = "batchID") |>
    # Add in receivers by deviceID *and* overlap of receiver deployment time with the beginning of a run
    left_join(r, by = join_by(motusDeviceID == deviceID, between(tsBegin, tsStartRecv, tsEndRecv))) |>
    slice_max(order_by = tsStartRecv, by = "runID", with_ties = FALSE) |> # If multiple deps, take only the one with max start
    rename("recvDeviceID" = "motusDeviceID") |>
    select(-"batchIDbegin") |>
    distinct()
}

This function is faster than collecting the allRuns view:

system.time(tbl(dbs[["607"]], "allRuns") |> collect())
   user  system elapsed 
  0.794   0.296   1.159 
system.time(custom_runs(dbs[["607"]]) |> collect())
   user  system elapsed 
  0.378   0.004   0.382 
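
As a spot check on the Caution above, we can also compare the two directly for one project. A rough sketch (project 607 again; this assumes both tables share the runID column):

a <- tbl(dbs[["607"]], "allRuns") |> collect()
b <- custom_runs(dbs[["607"]]) |> collect()

nrow(a) == nrow(b)                 # same number of rows?
length(setdiff(a$runID, b$runID))  # runs in the view but not in custom_runs()
length(setdiff(b$runID, a$runID))  # runs in custom_runs() but not in the view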

Get the custom run tables for all databases

if(update) runs <- map(dbs, custom_runs)

Export

Now we’ll save these in Arrow/Feather format (but first remove any previously saved data; since we’ll be splitting the data across multiple files when saving to Arrow, we don’t want any conflicts).

Along the way we’ll also create some additional date and time columns.

This should run reasonably fast.

if(update) {
  unlink("Data/02_Datasets/runs/*", recursive = TRUE)
  iwalk(runs, \(x, i) {
    message(i)
    x |>
      mutate(proj_id = as.integer(i)) |>
      collect() |>
      mutate(timeBegin = as_datetime(tsBegin),
             timeEnd = as_datetime(tsEnd),
             dateBegin = as_date(timeBegin),
             dateEnd = as_date(timeEnd),
             monthBegin = month(timeBegin),
             yearBegin = year(timeBegin)) |>
      group_by(proj_id) |>
      write_dataset(path = "Data/02_Datasets/runs/", format = "arrow") 
  })
}

The catch with splitting data across multiple files is that we ideally want each file to be >20 MB and <2 GB, while also avoiding too many files overall (<10,000).

So let’s double check that we have reasonable file sizes and numbers. Not perfect, but probably good enough.

tibble(files = list.files("Data/02_Datasets/runs/", recursive = TRUE, full.names = TRUE)) |>
  mutate(size_mb = file.size(files) * 1e-6) |>
  summarize(n = n(),
            min_mb = min(size_mb),
            median_mb = median(size_mb),
            max_mb = max(size_mb))
# A tibble: 1 × 4
      n min_mb median_mb max_mb
  <int>  <dbl>     <dbl>  <dbl>
1    11 0.0811      73.5   224.

Next we’ll save the hit-level data.

We’ll get the basic hits table from each database, join it with the runs data, and save it as Feather for future use.

This may take a while (~2–5 min per project, depending on size). It also takes a fair amount of disk space (much more than the SQLite databases).

if(update) {
  hits <- map(dbs, \(x) tbl(x, "hits"))
  
  unlink("Data/02_Datasets/hits/*", recursive = TRUE)
  iwalk(hits, \(x, i) {
    message(i) 
    r <- select(runs[[i]], "runID", "speciesID", "tagDeployID", "tagID", 
                "recvDeployID", "recvType", "motusFilter")
    x |>
      mutate(proj_id = as.integer(i)) |>
      left_join(r, by = "runID") |>
      collect() |>
      mutate(time = as_datetime(ts),
             date = as_date(time),
             month = month(date),
             year = year(date)) |>
      group_by(proj_id, speciesID, year) |>
      write_dataset(path = "Data/02_Datasets/hits/", format = "arrow") 
  })
}
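
Because the hits are partitioned by proj_id, speciesID, and year, a filtered read like the following only has to scan the matching files (a sketch; the project and year values are just examples):

open_dataset("Data/02_Datasets/hits/", format = "arrow") |>
  filter(proj_id == 607, year == 2023) |>  # only matching partition files are read
  count(speciesID, month) |>
  collect()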

As with the runs, we ideally want files to be >20 MB and <2 GB, and we want to avoid too many files (<10,000).

So let’s double check that we have reasonable file sizes and numbers.

tibble(files = list.files("Data/02_Datasets/hits/", recursive = TRUE, full.names = TRUE)) |>
  mutate(size_mb = file.size(files) * 1e-6) |>
  summarize(n = n(),
            min_mb = min(size_mb),
            median_mb = median(size_mb),
            max_mb = max(size_mb))
# A tibble: 1 × 4
      n  min_mb median_mb max_mb
  <int>   <dbl>     <dbl>  <dbl>
1   193 0.00363     0.995  5371.

Not fantastic, but we want the data to be split in a sensible way, so we’ll leave it as is.
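
If we ever want to know which partitions are driving that multi-gigabyte maximum, a quick follow-up reusing the listing above:

tibble(files = list.files("Data/02_Datasets/hits/", recursive = TRUE, full.names = TRUE)) |>
  mutate(size_mb = file.size(files) * 1e-6) |>
  slice_max(size_mb, n = 5)  # five largest files by size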

Wrap up

Disconnect from the databases

walk(dbs, dbDisconnect)

Reproducibility

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language en_CA:en
 collate  en_CA.UTF-8
 ctype    en_CA.UTF-8
 tz       America/Winnipeg
 date     2024-07-03
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package       * version  date (UTC) lib source
 P arrow         * 16.1.0   2024-05-25 [?] CRAN (R 4.4.0)
 P assertr       * 3.0.1    2023-11-23 [?] CRAN (R 4.4.0)
 P assertthat      0.2.1    2019-03-21 [?] CRAN (R 4.4.0)
 P bit             4.0.5    2022-11-15 [?] CRAN (R 4.4.0)
 P bit64           4.0.5    2020-08-30 [?] CRAN (R 4.4.0)
 P blob            1.2.4    2023-03-17 [?] CRAN (R 4.4.0)
 P cachem          1.1.0    2024-05-16 [?] CRAN (R 4.4.0)
 P class           7.3-22   2023-05-03 [?] CRAN (R 4.3.1)
 P classInt        0.4-10   2023-09-05 [?] CRAN (R 4.4.0)
 P cli             3.6.2    2023-12-11 [?] CRAN (R 4.4.0)
 P codetools       0.2-19   2023-02-01 [?] CRAN (R 4.2.2)
 P colorspace      2.1-0    2023-01-23 [?] CRAN (R 4.4.0)
 P DBI           * 1.2.3    2024-06-02 [?] CRAN (R 4.4.0)
 P dbplyr          2.5.0    2024-03-19 [?] CRAN (R 4.4.0)
 P devtools        2.4.5    2022-10-11 [?] CRAN (R 4.4.0)
 P digest          0.6.35   2024-03-11 [?] CRAN (R 4.4.0)
 P dplyr         * 1.1.4    2023-11-17 [?] CRAN (R 4.4.0)
 P e1071           1.7-14   2023-12-06 [?] CRAN (R 4.4.0)
 P ebirdst       * 3.2022.3 2024-03-05 [?] CRAN (R 4.4.0)
 P ellipsis        0.3.2    2021-04-29 [?] CRAN (R 4.4.0)
 P evaluate        0.23     2023-11-01 [?] CRAN (R 4.4.0)
 P fansi           1.0.6    2023-12-08 [?] CRAN (R 4.4.0)
 P fastmap         1.2.0    2024-05-15 [?] CRAN (R 4.4.0)
 P fs              1.6.4    2024-04-25 [?] CRAN (R 4.4.0)
 P furrr         * 0.3.1    2022-08-15 [?] CRAN (R 4.4.0)
 P future        * 1.33.2   2024-03-26 [?] CRAN (R 4.4.0)
 P generics        0.1.3    2022-07-05 [?] CRAN (R 4.4.0)
 P ggplot2       * 3.5.1    2024-04-23 [?] CRAN (R 4.4.0)
 P ggrepel       * 0.9.5    2024-01-10 [?] CRAN (R 4.4.0)
 P ggspatial     * 1.1.9    2023-08-17 [?] CRAN (R 4.4.0)
 P globals         0.16.3   2024-03-08 [?] CRAN (R 4.4.0)
 P glue            1.7.0    2024-01-09 [?] CRAN (R 4.4.0)
 P gt            * 0.10.1   2024-01-17 [?] CRAN (R 4.4.0)
 P gtable          0.3.5    2024-04-22 [?] CRAN (R 4.4.0)
 P hms             1.1.3    2023-03-21 [?] CRAN (R 4.4.0)
 P htmltools       0.5.8.1  2024-04-04 [?] CRAN (R 4.4.0)
 P htmlwidgets     1.6.4    2023-12-06 [?] CRAN (R 4.4.0)
 P httpuv          1.6.15   2024-03-26 [?] CRAN (R 4.4.0)
 P httr            1.4.7    2023-08-15 [?] CRAN (R 4.4.0)
 P jsonlite        1.8.8    2023-12-04 [?] CRAN (R 4.4.0)
 P KernSmooth      2.23-22  2023-07-10 [?] CRAN (R 4.3.1)
 P knitr           1.47     2024-05-29 [?] CRAN (R 4.4.0)
 P later           1.3.2    2023-12-06 [?] CRAN (R 4.4.0)
 P lifecycle       1.0.4    2023-11-07 [?] CRAN (R 4.4.0)
 P listenv         0.9.1    2024-01-29 [?] CRAN (R 4.4.0)
 P lubridate     * 1.9.3    2023-09-27 [?] CRAN (R 4.4.0)
 P lutz          * 0.3.2    2023-10-17 [?] CRAN (R 4.4.0)
 P magrittr        2.0.3    2022-03-30 [?] CRAN (R 4.4.0)
 P memoise         2.0.1    2021-11-26 [?] CRAN (R 4.4.0)
 P mime            0.12     2021-09-28 [?] CRAN (R 4.4.0)
 P miniUI          0.1.1.1  2018-05-18 [?] CRAN (R 4.4.0)
 P motus         * 6.1.0    2024-05-02 [?] Github (motuswts/motus@a53a8b8)
 P munsell         0.5.1    2024-04-01 [?] CRAN (R 4.4.0)
 P naturecounts    0.4.0    2024-05-02 [?] Github (birdscanada/naturecounts@a6e52da)
 P parallelly      1.37.1   2024-02-29 [?] CRAN (R 4.4.0)
 P patchwork     * 1.2.0    2024-01-08 [?] CRAN (R 4.4.0)
 P pillar          1.9.0    2023-03-22 [?] CRAN (R 4.4.0)
 P pkgbuild        1.4.4    2024-03-17 [?] CRAN (R 4.4.0)
 P pkgconfig       2.0.3    2019-09-22 [?] CRAN (R 4.4.0)
 P pkgload         1.3.4    2024-01-16 [?] CRAN (R 4.4.0)
 P profvis         0.3.8    2023-05-02 [?] CRAN (R 4.4.0)
 P promises        1.3.0    2024-04-05 [?] CRAN (R 4.4.0)
 P proxy           0.4-27   2022-06-09 [?] CRAN (R 4.4.0)
 P purrr         * 1.0.2    2023-08-10 [?] CRAN (R 4.4.0)
 P R6              2.5.1    2021-08-19 [?] CRAN (R 4.4.0)
 P Rcpp            1.0.12   2024-01-09 [?] CRAN (R 4.4.0)
 P readr         * 2.1.5    2024-01-10 [?] CRAN (R 4.4.0)
 P remotes         2.5.0    2024-03-17 [?] CRAN (R 4.4.0)
   renv            1.0.7    2024-04-11 [1] CRAN (R 4.4.0)
 P rlang           1.1.3    2024-01-10 [?] CRAN (R 4.4.0)
 P rmarkdown       2.27     2024-05-17 [?] CRAN (R 4.4.0)
 P rnaturalearth * 1.0.1    2023-12-15 [?] CRAN (R 4.4.0)
 P RSQLite         2.3.6    2024-03-31 [?] CRAN (R 4.4.0)
 P rstudioapi      0.16.0   2024-03-24 [?] CRAN (R 4.4.0)
 P scales          1.3.0    2023-11-28 [?] CRAN (R 4.4.0)
 P sessioninfo     1.2.2    2021-12-06 [?] CRAN (R 4.4.0)
 P sf            * 1.0-16   2024-03-24 [?] CRAN (R 4.4.0)
 P shiny           1.8.1.1  2024-04-02 [?] CRAN (R 4.4.0)
 P stringi         1.8.4    2024-05-06 [?] CRAN (R 4.4.0)
 P stringr       * 1.5.1    2023-11-14 [?] CRAN (R 4.4.0)
 P terra           1.7-71   2024-01-31 [?] CRAN (R 4.4.0)
 P tibble        * 3.2.1    2023-03-20 [?] CRAN (R 4.4.0)
 P tidyr         * 1.3.1    2024-01-24 [?] CRAN (R 4.4.0)
 P tidyselect      1.2.1    2024-03-11 [?] CRAN (R 4.4.0)
 P timechange      0.3.0    2024-01-18 [?] CRAN (R 4.4.0)
 P tzdb            0.4.0    2023-05-12 [?] CRAN (R 4.4.0)
 P units         * 0.8-5    2023-11-28 [?] CRAN (R 4.4.0)
 P urlchecker      1.0.1    2021-11-30 [?] CRAN (R 4.4.0)
 P usethis         2.2.3    2024-02-19 [?] CRAN (R 4.4.0)
 P utf8            1.2.4    2023-10-22 [?] CRAN (R 4.4.0)
 P vctrs           0.6.5    2023-12-01 [?] CRAN (R 4.4.0)
 P withr           3.0.0    2024-01-16 [?] CRAN (R 4.4.0)
 P xfun            0.44     2024-05-15 [?] CRAN (R 4.4.0)
 P xml2            1.3.6    2023-12-04 [?] CRAN (R 4.4.0)
 P xtable          1.8-4    2019-04-21 [?] CRAN (R 4.4.0)
 P yaml            2.3.8    2023-12-11 [?] CRAN (R 4.4.0)

 [1] /home/steffi/Projects/Business/Barbara Frei/urban_motus/renv/library/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu
 [2] /home/steffi/.cache/R/renv/sandbox/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/9a444a72

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────