Basic Filtering

Author

Steffi LaZerte

Published

July 4, 2024

Here we apply broad filters for missing details, time, space, and species. We’ll omit runs missing deployment information for tags and receivers, runs which occur during times we’re not interested in, and runs which occur for species/receiver pairs which are not within a particular species’ range (ranges were determined in Range Maps), also omitting species we’re not interested in.

The idea is to reduce the size of the data we’re working with as much as possible before we get into the fine details of assessing data quality.

However, we don’t want to simply delete records in our original data because

So we will create lists of all the run/tagID combinations to be removed, and we’ll store these as a new dataset.

When we want to calculate our final summaries, we can apply these filters in the last step before collect()ing (flattening) the feather data files.

Setup

source("XX_setup.R")

recvs <- read_csv("Data/02_Datasets/receivers.csv")
runs <- open_dataset("Data/02_Datasets/runs", format = "feather")

Non-deployments

Both tags and receivers have device ids and deployment ids. Deployments are associated with a particular device (tag or receiver) being deployed at a certain time.

Many erroneous runs are associated with a tag at a time in which that tag was not recorded as deployed.

Funnily enough this can also occur with receivers where the run occurs at a time where the associated receiver is not recorded as deployed.

This could occur for several reasons:

  • The metadata is incorrect and the tag/receiver was, in fact deployed at that time
  • This is false positive recording of the tag
  • This is non-trustworthy receiver data, which may have been collecting during testing or installation

As we cannot be sure which is correct, for now we will omit those runs.

In our custom runs tables, any run/tag combination with missing a tagDeployID or a recvDeployID did not actually have a record of the tag/receiver being deployed at that time.

speciesID vs. tagDeployID

Technically we should omit those without a tagDeployID, however, looking at the records which have a tagDeployID but do not have a speciesID shows that they are all test tags, so we’ll go ahead and just remove all records missing a speciesID.

See the “Explore - Tags” panel for this exploration

no_deps <- filter(runs, is.na(speciesID) | is.na(recvDeployID)) |>
  collect()

To be sure that we’re not missing things, we’ll explore what is removed if we use tagDeployID rather than speciesID (as we could conceivably have tags with a deploy id but not species ids).

no_tag_dep_check <- filter(runs, is.na(tagDeployID))

For example, in this project for this tag, we have no species ids for a series of runs.

no_tag_dep_check |>
  select(tagID, tagDeployID, speciesID, tsBegin, tsEnd, motusFilter) |>
  filter(tagID == 63138) |>
  collect_ts()
# A tibble: 301 × 6
   tagID tagDeployID speciesID tsBegin             tsEnd              
   <int>       <int>     <int> <dttm>              <dttm>             
 1 63138          NA        NA 2022-01-15 10:27:31 2022-01-15 10:28:51
 2 63138          NA        NA 2022-01-16 04:26:10 2022-01-16 04:26:50
 3 63138          NA        NA 2022-01-27 01:22:06 2022-01-27 01:22:45
 4 63138          NA        NA 2022-04-11 20:25:47 2022-04-11 20:27:07
 5 63138          NA        NA 2022-04-02 21:10:25 2022-04-02 21:11:45
 6 63138          NA        NA 2022-05-30 21:39:44 2022-05-30 21:41:04
 7 63138          NA        NA 2022-06-02 22:34:33 2022-06-02 22:35:53
 8 63138          NA        NA 2022-02-14 19:55:01 2022-02-14 19:55:41
 9 63138          NA        NA 2022-05-01 01:58:15 2022-05-01 01:59:34
10 63138          NA        NA 2022-07-01 10:04:21 2022-07-01 10:05:41
# ℹ 291 more rows
# ℹ 1 more variable: motusFilter <dbl>

Note that the motusFilter is also 0, indicating that they are likely false positives based on other metrics as well.

Now if we look at the tag deployment dates for that tag, we see that these runs occurred even before the tag was deployed, further indicating that they are false positives.

runs |>
  filter(tagID == 63138) |>
  select(tagID, tagDeployID,  tsStartTag, tsEndTag, speciesID) |>
  distinct() |>
  collect_ts()
# A tibble: 2 × 5
  tagID tagDeployID tsStartTag          tsEndTag            speciesID
  <int>       <int> <dttm>              <dttm>                  <int>
1 63138          NA NA                  NA                         NA
2 63138       52117 2023-11-10 17:15:00 2025-08-10 17:15:00     19060

However, just because we have a tagDeployID, it doesn’t necessarily mean we have a speciesID.

But in all cases where there is a tagDeployID but there is not a speciesID, it’s because the tag is a test tag (test = 1).

runs |> 
  filter(!is.na(tagDeployID), is.na(speciesID)) |>
  select(tagID, tagDeployID, test) |>
  distinct() |>
  collect()
# A tibble: 13 × 3
   tagID tagDeployID  test
   <int>       <int> <int>
 1 52917       32346     1
 2 23608       12511     1
 3 26315       16538     1
 4 26448       13848     1
 5 68906       42671     1
 6 52918       33943     1
 7 64024       38571     1
 8 68914       42527     1
 9 74330       45876     1
10 74328       45833     1
11 74328       45874     1
12 74331       45877     1
13 74329       45875     1

Therefore, we can actually remove all records with missing species ids without worry.

For example, in this project (484) for this receiver device ID (3071), we have no deployment ids (recvDeployID) for a series of runs.

filter(runs, is.na(recvDeployID)) |>
  select(proj_id, recvDeployID, recvDeviceID) |>
  distinct() |>
  collect()
# A tibble: 501 × 3
   proj_id recvDeployID recvDeviceID
     <int>        <int>        <int>
 1     168           NA          660
 2     352           NA         2637
 3     352           NA         2635
 4     352           NA         1948
 5     352           NA         2082
 6     352           NA         1855
 7     352           NA         1825
 8     352           NA         2380
 9     352           NA         2173
10     352           NA         2439
# ℹ 491 more rows
filter(runs, is.na(recvDeployID)) |>
  filter(proj_id == 484, recvDeviceID == 3071) |>
  select(proj_id, runID, recvDeployID, recvDeviceID, tagDeployID) |>
  collect_ts()
# A tibble: 40 × 5
   proj_id     runID recvDeployID recvDeviceID tagDeployID
     <int>     <int>        <int>        <int>       <int>
 1     484 540827863           NA         3071       41246
 2     484 540827867           NA         3071       41246
 3     484 540827898           NA         3071       41246
 4     484 540827900           NA         3071       41246
 5     484 540827914           NA         3071       41246
 6     484 540827956           NA         3071       41246
 7     484 540827975           NA         3071       41243
 8     484 540827979           NA         3071       41246
 9     484 540827997           NA         3071       41243
10     484 540828021           NA         3071       41246
# ℹ 30 more rows

Now if we look at the receiver deployment dates for that device, we see that these runs occurred before the receiver was deployed, although not that much earlier.

Perhaps these are semi-legitimate runs, which were detected when testing and installing the receiver.

tbl(dbs[["484"]], "recvDeps") |>
  filter(deviceID == "3071") |>
  select(deviceID, deployID, projectID, tsStart, tsEnd, status) |>
  collect_ts()
# A tibble: 1 × 6
  deviceID deployID projectID tsStart             tsEnd  status
     <int>    <int>     <int> <dttm>              <dttm> <chr> 
1     3071     9150       354 2022-10-18 18:37:00 NA     active

Time

In this filter, we omit times of year that don’t apply to our study. This is the easiest filter to apply as it doesn’t rely on metadata or other variables.

Which dates do we keep?

  • Fall and Spring migration
  • August - December & February - July

Get all runs that are NOT within these dates (Jan, July, Dec)

noise_time <- runs |>
  filter(monthBegin %in% c(1, 7, 12)) |>
  collect()
d <- runs |>
  filter(proj_id == 464) |>
  mutate(omit = if_else(monthBegin %in% c(1, 7, 12), TRUE, FALSE)) |>
  select(runID, tagID, tsBegin, omit, timeBegin) |>
  collect()

ggplot(data = d, aes(x = timeBegin, y = factor(tagID), colour = omit)) +
  theme_bw() +
  geom_point() +
  scale_color_viridis_d(end = 0.8)

Space

Here we identify runs which are not associated with appropriate stations as determined by overlap with a species range map (Range Maps).

These are runs associated with a species which is unlikely to be found at that station and can therefore be assumed to be false positives.

Note that this step also omits species we’re not interested in, as they are not included in the list of species ranges and receivers.

First we’ll add the master list of whether or not a species is in range for each receiver deployment.

rs <- recvs |>
  mutate(speciesID = as.integer(id), 
         recvDeployID = as.integer(recvDeployID)) |>
  filter(in_range) |>
  select(speciesID, recvDeployID, in_range)

Now, get all runs which are NOT in range given the species and the receiver deployment.

noise_space <- runs |>
  left_join(rs, by = c("recvDeployID", "speciesID")) |>
  filter(is.na(in_range)) |>
  select(-"in_range") |>
  collect()

Many of these omitted are omitted because they are missing the speciesID or the recevDeployID (which is already accounted for the the Non-deployments section above.

However, let’s take a look at the remaining ones to check.

Hmm, most of these are poor quality runs (i.e. the motusFilter is 0). So this is a good sign that these are indeed false positives which should be omitted.

noise_space |> 
  filter(!is.na(speciesID), !is.na(recvDeployID)) |>
  count(proj_id, motusFilter) |>
  collect() |>
  gt() |>
  gt_theme()
proj_id motusFilter n
168 0 685
352 0 1574
352 1 11
364 0 1866
364 1 275
373 0 72
393 0 18134
393 1 68
417 0 10861
417 1 8
464 0 1536
464 1 3
484 0 28156
484 1 76
515 0 389
515 1 2
551 0 1700
551 1 5
607 0 95

Let’s take a closer look at one species in one project.

There is quite possibly a lot of noise (see the amount of scatter in the motusFilter = 0 category. And clearly this individual (Spotted Towhee) had a home base in southern BC.

ns <- noise_space |> 
  filter(proj_id == 484, speciesID == 18550) |>
  select("runID", "tagID") |>
  mutate(in_range = 0)

coords <- tbl(dbs[["484"]], "recvDeps") |> 
  select("deployID", "latitude", "longitude") |> 
  collect()

rn <- filter(runs, proj_id == 484, speciesID == 18550) |>
  left_join(coords, by = c("recvDeployID" = "deployID")) |>
  left_join(ns, by = c("runID", "tagID")) |>
  filter(!is.na(latitude), !is.na(longitude)) |>
  select(motusFilter, tagDeployID, tsBegin, tsEnd, in_range, latitude, longitude) |>
  collect() |>
  mutate(in_range = factor(replace_na(in_range, 1)))

map <- ne_countries(country = c("Canada", "United States of America"), returnclass = "sf")

rn_sf_cnt <- rn |>
  summarize(n = n(), .by = c("motusFilter", "in_range", "latitude", "longitude")) |> 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
  
ggplot(rn_sf_cnt) +
  geom_sf(data = map) +
  geom_sf(aes(colour = in_range, size = n)) +
  facet_wrap(~motusFilter)

Duplicate runIDs

Double check that there are no duplicate runIDs, which can happen when joining all the data together if we don’t account for duplicate/overlapping receiver/tag deployments.

Additionally there are some duplicates based on the same runID in different batches… This occurs in allruns too, so isn’t an artifact of the custom joins, but still not ideal.

These should have been dealt with in custom_runs() but just in case, we’ll verify here.

runs |>
  count(runID) |>
  filter(n > 1) |>
  collect()
# A tibble: 0 × 2
# ℹ 2 variables: runID <int>, n <int>

Okay, good, there are zero duplicates (zero rows).

Looking at the filters

First we’ll combine and collect the ‘bad’ runs.

noise <- bind_rows(no_deps, noise_time, noise_space) |>
  distinct()

Next we’ll take a look at how this compares to the motusFilter.

Original runs

count(runs, proj_id, motusFilter) |>
  filter(proj_id == 484) |>
  collect() |>
  arrange(motusFilter)
# A tibble: 2 × 3
  proj_id motusFilter      n
    <int>       <dbl>  <int>
1     484           0 492931
2     484           1 601907

Remaining (non-bad) runs

anti_join(runs, noise, by = c("runID", "tagDeployID", "recvDeployID")) |>
  filter(proj_id == 484) |>
  count(proj_id, motusFilter) |>
  collect()
# A tibble: 2 × 3
  proj_id motusFilter      n
    <int>       <dbl>  <int>
1     484           0 336867
2     484           1 429134

There are still many remaining ‘bad’ data (according to the motusFilter)… perhaps we should use that as well, or see what happens after we do the next stage of fine scale filtering.

Saving filters

We’ll save the ‘bad data’ for use in the next steps.

write_feather(noise, sink = "Data/02_Datasets/noise_runs.feather")

Wrap up

Disconnect from the databases

walk(dbs, dbDisconnect)

Reproducibility

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language en_CA:en
 collate  en_CA.UTF-8
 ctype    en_CA.UTF-8
 tz       America/Winnipeg
 date     2024-07-03
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package       * version  date (UTC) lib source
 P arrow         * 16.1.0   2024-05-25 [?] CRAN (R 4.4.0)
 P assertr       * 3.0.1    2023-11-23 [?] CRAN (R 4.4.0)
 P assertthat      0.2.1    2019-03-21 [?] CRAN (R 4.4.0)
 P bit             4.0.5    2022-11-15 [?] CRAN (R 4.4.0)
 P bit64           4.0.5    2020-08-30 [?] CRAN (R 4.4.0)
 P blob            1.2.4    2023-03-17 [?] CRAN (R 4.4.0)
 P cachem          1.1.0    2024-05-16 [?] CRAN (R 4.4.0)
 P class           7.3-22   2023-05-03 [?] CRAN (R 4.3.1)
 P classInt        0.4-10   2023-09-05 [?] CRAN (R 4.4.0)
 P cli             3.6.2    2023-12-11 [?] CRAN (R 4.4.0)
 P codetools       0.2-19   2023-02-01 [?] CRAN (R 4.2.2)
 P colorspace      2.1-0    2023-01-23 [?] CRAN (R 4.4.0)
 P crayon          1.5.2    2022-09-29 [?] CRAN (R 4.4.0)
 P DBI           * 1.2.3    2024-06-02 [?] CRAN (R 4.4.0)
 P dbplyr          2.5.0    2024-03-19 [?] CRAN (R 4.4.0)
 P devtools        2.4.5    2022-10-11 [?] CRAN (R 4.4.0)
 P digest          0.6.35   2024-03-11 [?] CRAN (R 4.4.0)
 P dplyr         * 1.1.4    2023-11-17 [?] CRAN (R 4.4.0)
 P e1071           1.7-14   2023-12-06 [?] CRAN (R 4.4.0)
 P ebirdst       * 3.2022.3 2024-03-05 [?] CRAN (R 4.4.0)
 P ellipsis        0.3.2    2021-04-29 [?] CRAN (R 4.4.0)
 P evaluate        0.23     2023-11-01 [?] CRAN (R 4.4.0)
 P fansi           1.0.6    2023-12-08 [?] CRAN (R 4.4.0)
 P farver          2.1.2    2024-05-13 [?] CRAN (R 4.4.0)
 P fastmap         1.2.0    2024-05-15 [?] CRAN (R 4.4.0)
 P fs              1.6.4    2024-04-25 [?] CRAN (R 4.4.0)
 P furrr         * 0.3.1    2022-08-15 [?] CRAN (R 4.4.0)
 P future        * 1.33.2   2024-03-26 [?] CRAN (R 4.4.0)
 P generics        0.1.3    2022-07-05 [?] CRAN (R 4.4.0)
 P ggplot2       * 3.5.1    2024-04-23 [?] CRAN (R 4.4.0)
 P ggrepel       * 0.9.5    2024-01-10 [?] CRAN (R 4.4.0)
 P ggspatial     * 1.1.9    2023-08-17 [?] CRAN (R 4.4.0)
 P globals         0.16.3   2024-03-08 [?] CRAN (R 4.4.0)
 P glue            1.7.0    2024-01-09 [?] CRAN (R 4.4.0)
 P gt            * 0.10.1   2024-01-17 [?] CRAN (R 4.4.0)
 P gtable          0.3.5    2024-04-22 [?] CRAN (R 4.4.0)
 P hms             1.1.3    2023-03-21 [?] CRAN (R 4.4.0)
 P htmltools       0.5.8.1  2024-04-04 [?] CRAN (R 4.4.0)
 P htmlwidgets     1.6.4    2023-12-06 [?] CRAN (R 4.4.0)
 P httpuv          1.6.15   2024-03-26 [?] CRAN (R 4.4.0)
 P httr            1.4.7    2023-08-15 [?] CRAN (R 4.4.0)
 P jsonlite        1.8.8    2023-12-04 [?] CRAN (R 4.4.0)
 P KernSmooth      2.23-22  2023-07-10 [?] CRAN (R 4.3.1)
 P knitr           1.47     2024-05-29 [?] CRAN (R 4.4.0)
 P labeling        0.4.3    2023-08-29 [?] CRAN (R 4.4.0)
 P later           1.3.2    2023-12-06 [?] CRAN (R 4.4.0)
 P lifecycle       1.0.4    2023-11-07 [?] CRAN (R 4.4.0)
 P listenv         0.9.1    2024-01-29 [?] CRAN (R 4.4.0)
 P lubridate     * 1.9.3    2023-09-27 [?] CRAN (R 4.4.0)
 P lutz          * 0.3.2    2023-10-17 [?] CRAN (R 4.4.0)
 P magrittr        2.0.3    2022-03-30 [?] CRAN (R 4.4.0)
 P memoise         2.0.1    2021-11-26 [?] CRAN (R 4.4.0)
 P mime            0.12     2021-09-28 [?] CRAN (R 4.4.0)
 P miniUI          0.1.1.1  2018-05-18 [?] CRAN (R 4.4.0)
 P motus         * 6.1.0    2024-05-02 [?] Github (motuswts/motus@a53a8b8)
 P munsell         0.5.1    2024-04-01 [?] CRAN (R 4.4.0)
 P naturecounts    0.4.0    2024-05-02 [?] Github (birdscanada/naturecounts@a6e52da)
 P parallelly      1.37.1   2024-02-29 [?] CRAN (R 4.4.0)
 P patchwork     * 1.2.0    2024-01-08 [?] CRAN (R 4.4.0)
 P pillar          1.9.0    2023-03-22 [?] CRAN (R 4.4.0)
 P pkgbuild        1.4.4    2024-03-17 [?] CRAN (R 4.4.0)
 P pkgconfig       2.0.3    2019-09-22 [?] CRAN (R 4.4.0)
 P pkgload         1.3.4    2024-01-16 [?] CRAN (R 4.4.0)
 P profvis         0.3.8    2023-05-02 [?] CRAN (R 4.4.0)
 P promises        1.3.0    2024-04-05 [?] CRAN (R 4.4.0)
 P proxy           0.4-27   2022-06-09 [?] CRAN (R 4.4.0)
 P purrr         * 1.0.2    2023-08-10 [?] CRAN (R 4.4.0)
 P R6              2.5.1    2021-08-19 [?] CRAN (R 4.4.0)
 P Rcpp            1.0.12   2024-01-09 [?] CRAN (R 4.4.0)
 P readr         * 2.1.5    2024-01-10 [?] CRAN (R 4.4.0)
 P remotes         2.5.0    2024-03-17 [?] CRAN (R 4.4.0)
   renv            1.0.7    2024-04-11 [1] CRAN (R 4.4.0)
 P rlang           1.1.3    2024-01-10 [?] CRAN (R 4.4.0)
 P rmarkdown       2.27     2024-05-17 [?] CRAN (R 4.4.0)
 P rnaturalearth * 1.0.1    2023-12-15 [?] CRAN (R 4.4.0)
 P RSQLite         2.3.6    2024-03-31 [?] CRAN (R 4.4.0)
 P rstudioapi      0.16.0   2024-03-24 [?] CRAN (R 4.4.0)
 P sass            0.4.9    2024-03-15 [?] CRAN (R 4.4.0)
 P scales          1.3.0    2023-11-28 [?] CRAN (R 4.4.0)
 P sessioninfo     1.2.2    2021-12-06 [?] CRAN (R 4.4.0)
 P sf            * 1.0-16   2024-03-24 [?] CRAN (R 4.4.0)
 P shiny           1.8.1.1  2024-04-02 [?] CRAN (R 4.4.0)
 P stringi         1.8.4    2024-05-06 [?] CRAN (R 4.4.0)
 P stringr       * 1.5.1    2023-11-14 [?] CRAN (R 4.4.0)
 P terra           1.7-71   2024-01-31 [?] CRAN (R 4.4.0)
 P tibble        * 3.2.1    2023-03-20 [?] CRAN (R 4.4.0)
 P tidyr         * 1.3.1    2024-01-24 [?] CRAN (R 4.4.0)
 P tidyselect      1.2.1    2024-03-11 [?] CRAN (R 4.4.0)
 P timechange      0.3.0    2024-01-18 [?] CRAN (R 4.4.0)
 P tzdb            0.4.0    2023-05-12 [?] CRAN (R 4.4.0)
 P units         * 0.8-5    2023-11-28 [?] CRAN (R 4.4.0)
 P urlchecker      1.0.1    2021-11-30 [?] CRAN (R 4.4.0)
 P usethis         2.2.3    2024-02-19 [?] CRAN (R 4.4.0)
 P utf8            1.2.4    2023-10-22 [?] CRAN (R 4.4.0)
 P vctrs           0.6.5    2023-12-01 [?] CRAN (R 4.4.0)
 P viridisLite     0.4.2    2023-05-02 [?] CRAN (R 4.4.0)
 P vroom           1.6.5    2023-12-05 [?] CRAN (R 4.4.0)
 P withr           3.0.0    2024-01-16 [?] CRAN (R 4.4.0)
 P xfun            0.44     2024-05-15 [?] CRAN (R 4.4.0)
 P xml2            1.3.6    2023-12-04 [?] CRAN (R 4.4.0)
 P xtable          1.8-4    2019-04-21 [?] CRAN (R 4.4.0)
 P yaml            2.3.8    2023-12-11 [?] CRAN (R 4.4.0)

 [1] /home/steffi/Projects/Business/Barbara Frei/urban_motus/renv/library/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu
 [2] /home/steffi/.cache/R/renv/sandbox/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/9a444a72

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────
Back to top