source("XX_setup.R")
Download Data
In this step we download, or update, our Motus SQLite databases.
Setup
This step can take a long time, and it may not be necessary to constantly update the databases. Set update <- TRUE to update, or update <- FALSE to just check the number of new observations.
Note that, in addition to downloading new data, update will also rerun all the processing of these data into feather hits and runs tables (see Save to Arrow/Feather).
update <- FALSE
Status
Get the status of each project (i.e., how much data is left to download?)
- This will create new project-XXX.motus SQLite databases for us if they don't already exist (but will not download the data)
if(update) {
  status <- data.frame(proj_id = projects,
                       file = paste0("Data/01_Raw/project-", projects, ".motus")) |>
    mutate(status = map2(
      proj_id, file,
      \(x, y) tellme(x, dir = "Data/01_Raw", new = !file.exists(y)))) |>
    unnest(status) |>
    mutate(currentMB = file.size(file)/1024/1024,
           newMB = numBytes/1024/1024)

  select(status, -file, -numBytes) |>
    relocate(currentMB, newMB, .after = proj_id) |>
    arrange(proj_id) |>
    gt() |>
    fmt_number(decimals = 0) |>
    tab_spanner(label = "New Data", columns = -c(proj_id, currentMB)) |>
    gt_theme()
}
Download data
If this is the first time running, it will take time! tagme() without arguments will update all databases in the folder; we can run this intermittently to update the databases as new data arrive.
if(update) tagme(dir = "Data/01_Raw")
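tagme() also accepts a single project ID if you only want to refresh one database rather than everything in the folder. A minimal sketch, using project 607 (one of the projects referenced later in this document):

# Refresh a single project's .motus database (607 is just an example)
if(update) tagme(607, dir = "Data/01_Raw")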
Clean up
Here we’ll deal with deprecated batches and metadata.
First we’ll load the database connections.
dbs <- map(projects, \(x) tagme(x, dir = "Data/01_Raw", update = FALSE))
Remove deprecated batches
Deprecated batches are removed from the Motus server, but are still present in data that was previously downloaded. This step cleans up the database.
if(update) {
  iwalk(dbs, \(x, y) {
    message("\nProject ", y)
    deprecateBatches(x, ask = FALSE)
  })
}
Update metadata
Here we'll update the metadata associated with all projects so that we have better information on the different tags, receivers, etc. that these projects may be interacting with.
if(update) {
  iwalk(dbs, \(x, y) {
    message("\nProject ", y)
    metadata(x)
  })
}
Save to Arrow/Feather
Now we’ll create some custom views and save to feather format (see also the R4DS introduction to Apache Arrow).
Feather files are very fast to work with and I’ve found that they are a bit simpler to deal with when we’re mostly concerned with the hits/runs tables.
So we’ll spend a bit of time converting them here so the following steps are faster.
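Once written, these Arrow/Feather datasets can be opened lazily with the arrow package, so later steps never need to touch the SQLite databases directly. A minimal sketch, assuming the runs dataset created in the Export section below already exists on disk (arrow and dplyr loaded via the setup script):

# Open the partitioned runs dataset without reading it all into memory
runs_ds <- open_dataset("Data/02_Datasets/runs/", format = "arrow")

# Filters and summaries stay lazy until collect()
runs_ds |>
  filter(proj_id == 607) |>
  count(speciesID) |>
  collect()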
Custom tables
Normally, we would use the allruns view from our .motus databases. However, this view includes a lot of data that we don't need, and I find it faster to pull out the variables we're interested in by hand using a custom function, custom_runs().
custom_runs <- function(db) {
  # Replacement for NULL tsEnd values (i.e. today plus change)
  max_ts <- round(as.numeric(Sys.time()) + 1000)

  # Get Receivers
  r <- tbl(db, "recvDeps") |>
    select("deviceID" = "deviceID", "recvDeployID" = "deployID", "tsStartRecv" = "tsStart", "tsEndRecv" = "tsEnd",
           "recvType" = "receiverType", "recvDeployLat" = "latitude", "recvDeployLon" = "longitude") |>
    mutate(tsEndRecv = if_else(is.na(tsEndRecv), max_ts, tsEndRecv))

  # Get tags
  t <- tbl(db, "tagDeps") |>
    select("tagID", "tagDeployID" = "deployID", "speciesID", "tsStartTag" = "tsStart", "tsEndTag" = "tsEnd", "test") |>
    mutate(tsEndTag = if_else(is.na(tsEndTag), max_ts, tsEndTag))

  # Get batches
  b <- tbl(db, "batchRuns") |>
    distinct()

  # Combine with runs
  tbl(db, "runs") |>
    rename("tagID" = "motusTagID") |>
    # Add in tags by tagID *and* overlap of start/end of tag deployment with the beginning of a run
    left_join(t, by = join_by(tagID, between(tsBegin, tsStartTag, tsEndTag))) |>
    slice_max(order_by = tsStartTag, by = "runID", with_ties = FALSE) |> # If multiple deps, take only the one with max start
    # Add in batchRuns by runID (to get the batchID)
    left_join(b, by = "runID") |>
    slice_max(order_by = batchID, by = "runID", with_ties = FALSE) |> # If multiple batches, take only the latest one
    # Add in batches by batchID (to get the deviceID)
    left_join(tbl(db, "batches") |> select("batchID", "motusDeviceID"), by = "batchID") |>
    # Add in receivers by deviceID *and* overlap of receiver deployment time with the beginning of a run
    left_join(r, by = join_by(motusDeviceID == deviceID, between(tsBegin, tsStartRecv, tsEndRecv))) |>
    slice_max(order_by = tsStartRecv, by = "runID", with_ties = FALSE) |> # If multiple deps, take only the one with max start
    rename("recvDeviceID" = "motusDeviceID") |>
    select(-"batchIDbegin") |>
    distinct()
}
This function is faster than collecting the allruns view:
system.time(tbl(dbs[["607"]], "allRuns") |> collect())
user system elapsed
0.794 0.296 1.159
system.time(custom_runs(dbs[["607"]]) |> collect())
user system elapsed
0.378 0.004 0.382
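Like the allruns view, custom_runs() returns a lazy query rather than a data frame, so we can also filter on the database side before collecting. A small sketch, again with project 607:

# Keep only runs that pass the motus filter, then pull into memory
custom_runs(dbs[["607"]]) |>
  filter(motusFilter == 1) |>
  collect()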
Get the custom run tables for all databases
if(update) runs <- map(dbs, custom_runs)
Export
Now save these as Arrow/Feather format (but first remove any previously saved data: we'll be saving to Arrow by splitting the data across multiple files, so we don't want any conflicts). Along the way we'll also create some additional date and time columns. This should run reasonably fast.
if(update) {
  unlink("Data/02_Datasets/runs/*", recursive = TRUE)

  iwalk(runs, \(x, i) {
    message(i)
    x |>
      mutate(proj_id = as.integer(i)) |>
      collect() |>
      mutate(timeBegin = as_datetime(tsBegin),
             timeEnd = as_datetime(tsEnd),
             dateBegin = as_date(timeBegin),
             dateEnd = as_date(timeEnd),
             monthBegin = month(timeBegin),
             yearBegin = year(timeBegin)) |>
      group_by(proj_id) |>
      write_dataset(path = "Data/02_Datasets/runs/", format = "arrow")
  })
}
The thing with splitting data across multiple files is that we ideally want files to be >20 MB and <2 GB, and we want to avoid too many files (<10,000). So let's double-check that we have reasonable file sizes and numbers. Not perfect, but probably good enough.
tibble(files = list.files("Data/02_Datasets/runs/", recursive = TRUE, full.names = TRUE)) |>
  mutate(size_mb = file.size(files) * 1e-6) |>
  summarize(n = n(),
            min_mb = min(size_mb),
            median_mb = median(size_mb),
            max_mb = max(size_mb))
# A tibble: 1 × 4
n min_mb median_mb max_mb
<int> <dbl> <dbl> <dbl>
1 11 0.0811 73.5 224.
Next we’ll save the hit-level data.
Get the basic hits table for all databases, join it with the runs data, and save as feather for future use.
This may take a while (~2-5min per project, depending on the size). This also takes a fair amount of disk space (much more than the SQLite databases).
if(update) {
  hits <- map(dbs, \(x) tbl(x, "hits"))

  unlink("Data/02_Datasets/hits/*", recursive = TRUE)

  iwalk(hits, \(x, i) {
    message(i)
    r <- select(runs[[i]], "runID", "speciesID", "tagDeployID", "tagID",
                "recvDeployID", "recvType", "motusFilter")
    x |>
      mutate(proj_id = as.integer(i)) |>
      left_join(r, by = "runID") |>
      collect() |>
      mutate(time = as_datetime(ts),
             date = as_date(time),
             month = month(date),
             year = year(date)) |>
      group_by(proj_id, speciesID, year) |>
      write_dataset(path = "Data/02_Datasets/hits/", format = "arrow")
  })
}
As with the runs, we ideally want files to be >20 MB and <2 GB, and we want to avoid too many files (<10,000). So let's double-check that we have reasonable file sizes and numbers.
tibble(files = list.files("Data/02_Datasets/hits/", recursive = TRUE, full.names = TRUE)) |>
  mutate(size_mb = file.size(files) * 1e-6) |>
  summarize(n = n(),
            min_mb = min(size_mb),
            median_mb = median(size_mb),
            max_mb = max(size_mb))
# A tibble: 1 × 4
n min_mb median_mb max_mb
<int> <dbl> <dbl> <dbl>
1 193 0.00363 0.995 5371.
Not fantastic, but we want the data to be split in a sensible way, so we’ll leave it as is.
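That split is what makes the partitioning worthwhile downstream: because the hits were grouped by proj_id, speciesID, and year on disk, a query that filters on those columns only has to read the matching files. A minimal sketch of the idea (the year 2023 is just a placeholder value):

# Open the partitioned hits dataset and let arrow prune partitions on filter
hits_ds <- open_dataset("Data/02_Datasets/hits/", format = "arrow")

hits_ds |>
  filter(proj_id == 607, year == 2023) |>
  count(speciesID) |>
  collect()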
Wrap up
Disconnect from the databases
walk(dbs, dbDisconnect)