Fine-scale Filtering

Author

Steffi LaZerte

Published

July 4, 2024

Here we perform fine-scale filtering, which involves more detailed assessments of potential issues.

Many of these steps rely on hit-level data, as opposed to run-level data from the previous steps.

Setup

source("XX_setup.R")
noise_runs <- open_dataset("Data/02_Datasets/noise_runs.feather", format = "feather")

runs <- open_dataset("Data/02_Datasets/runs", format = "feather") |>
  anti_join(noise_runs)

hits <- load_hits() |>
  map(\(x) anti_join(x, noise_runs, by = c("runID", "recvDeployID", "tagDeployID")))

We’ll also use a modified version of the motusFilter [1], such that hits from SENSORGNOME stations with a freqSD > 0.1 will be considered ‘bad’ data (i.e., we’ll set the motusFilter value to 0 for these hits).

hits <- map(hits, \(x) {
  mutate(x, motusFilter = if_else(recvType == "SENSORGNOME" & freqSD > 0.1, 0, motusFilter))
})
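To make the effect of this adjustment concrete, here is a minimal sketch on made-up data. The toy_hits table and its values are hypothetical, and "SENSORSTATION" simply stands in for any receiver type other than SENSORGNOME; only the if_else() rule is taken from the code above.

```r
library(dplyr)

# Hypothetical hits: three from SENSORGNOME receivers, one from another type
toy_hits <- tibble(
  hitID       = 1:4,
  recvType    = c("SENSORGNOME", "SENSORGNOME", "SENSORSTATION", "SENSORGNOME"),
  freqSD      = c(0.05, 0.25, 0.30, 0.12),
  motusFilter = c(1, 1, 1, 1)
)

# Same rule as above: SENSORGNOME hits with freqSD > 0.1 are marked as 'bad'
toy_filtered <- toy_hits |>
  mutate(motusFilter = if_else(recvType == "SENSORGNOME" & freqSD > 0.1, 0, motusFilter))

toy_filtered
```

Only hits 2 and 4 are changed. Hit 3 keeps its original motusFilter value even though its freqSD is high, because it does not come from a SENSORGNOME receiver.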

Bad Tags [2]

First, we’ll collect individual tags that have only bad data (i.e., every hit has a motusFilter of 0).

noise_tags <- map(hits, \(x) {
  noise <- x |>
    select("tagID", "tagDeployID", "motusFilter") |>
    summarize(motusFilter = sum(motusFilter, na.rm = TRUE), .by = c("tagID", "tagDeployID")) |>
    filter(motusFilter == 0)
  
  semi_join(x, noise, by = c("tagID", "tagDeployID")) |>
    select("runID", "tagDeployID", "recvDeployID") |>
    collect()
}) |> list_rbind()
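The same summarize-then-semi_join pattern can be tried on a small in-memory table. This is only a sketch with hypothetical values (toy_hits and bad_tags are illustrative names), not the project data.

```r
library(dplyr)

# Hypothetical hits: tag 10 has only 'bad' hits, tag 20 has one 'good' hit
toy_hits <- tibble(
  runID        = c(1, 1, 2, 3, 3),
  tagID        = c(10, 10, 10, 20, 20),
  tagDeployID  = c(100, 100, 100, 200, 200),
  recvDeployID = c(5, 5, 6, 5, 5),
  motusFilter  = c(0, 0, 0, 1, 0)
)

# Tags whose motusFilter values sum to 0 have no good hits at all
bad_tags <- toy_hits |>
  summarize(motusFilter = sum(motusFilter, na.rm = TRUE), .by = c("tagID", "tagDeployID")) |>
  filter(motusFilter == 0)

# Keep the run identifiers for every hit belonging to those tags
toy_noise_tags <- semi_join(toy_hits, bad_tags, by = c("tagID", "tagDeployID")) |>
  select("runID", "tagDeployID", "recvDeployID")

toy_noise_tags
```

Tag 20 is kept because a single good hit is enough to make its sum non-zero; only the three hits from tag 10 are returned.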

Bad Runs [3]

Now we’ll calculate the proportion of good/bad data per tag, per receiver, per day.

  • We’ll omit all runs on a day for this tag/receiver combo where half or fewer of the hits are ‘good’ (p_good <= 0.5)

noise_quality <- map(hits, \(x) {
  noise <- x |>
    select("date", "runID", "tagID", "tagDeployID", "recvDeployID", "motusFilter") |>
    summarize(p_good = sum(motusFilter, na.rm = TRUE) / n(),
              .by = c("tagID", "tagDeployID", "recvDeployID", "date")) |>
    filter(p_good <= 0.5) |>
    distinct()
  
  semi_join(x, noise, by = c("tagID", "tagDeployID", "recvDeployID", "date")) |>
    select("runID", "tagDeployID", "recvDeployID") |>
    collect()
}) |> list_rbind()
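As a rough illustration of the daily proportion rule, the sketch below uses hypothetical hits for one tag at one receiver over two days. The first day sits exactly on the 0.5 boundary, which shows that the `<=` comparison flags it.

```r
library(dplyr)

# Hypothetical hits: day 1 has 2 of 4 good hits, day 2 has 2 of 3 good hits
toy_hits <- tibble(
  date         = as.Date(c("2023-05-01", "2023-05-01", "2023-05-01", "2023-05-01",
                           "2023-05-02", "2023-05-02", "2023-05-02")),
  runID        = c(1, 1, 2, 2, 3, 3, 3),
  tagID        = 10,
  tagDeployID  = 100,
  recvDeployID = 5,
  motusFilter  = c(1, 1, 0, 0, 1, 1, 0)
)

# Proportion of good hits per tag, receiver and day; flag days at or below 0.5
bad_days <- toy_hits |>
  summarize(p_good = sum(motusFilter, na.rm = TRUE) / n(),
            .by = c("tagID", "tagDeployID", "recvDeployID", "date")) |>
  filter(p_good <= 0.5)

# Collect the runs that fall on the flagged days
toy_noise_quality <- semi_join(toy_hits, bad_days,
                               by = c("tagID", "tagDeployID", "recvDeployID", "date")) |>
  select("runID", "tagDeployID", "recvDeployID")

toy_noise_quality
```

Here 2023-05-01 has p_good = 0.5 and is flagged, so runs 1 and 2 are returned (once per hit), while 2023-05-02 has p_good = 2/3 and run 3 is kept.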

Ambiguous detections

Let’s collect all runs where there is some ambiguity. We’ll look at the allruns table for this.

ambig_ids <- map(dbs, \(x) {
  t <- tbl(x, "allruns") |>
    filter(!is.na(ambigID)) |>
    select("runID", "tagID" = "motusTagID", "ambigID") |>
    distinct() |>
    collect() |>
    mutate(is_ambig = TRUE)
  if(nrow(t) == 0) t <- NULL
  t
})|> 
  list_rbind(names_to = "proj_id") |>
  mutate(proj_id = as.integer(proj_id))
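The step of replacing empty results with NULL and then combining with list_rbind() can be sketched without a database. The list below is hypothetical; it assumes, as the code above appears to, that the list is named by project ID.

```r
library(dplyr)
library(purrr)

# Hypothetical per-project results; the second project had no ambiguous runs
toy_results <- list(
  `168` = tibble(runID = 1:2, ambigID = c(-5L, -5L), is_ambig = TRUE),
  `352` = NULL,
  `364` = tibble(runID = 3L, ambigID = -7L, is_ambig = TRUE)
)

# NULL elements are dropped; list names become the proj_id column
toy_ambig <- toy_results |>
  list_rbind(names_to = "proj_id") |>
  mutate(proj_id = as.integer(proj_id))

toy_ambig
```

The names arrive as character strings, which is why the final mutate() converts proj_id to integer so that it can be matched against the proj_id column in runs.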

Now let’s see if any of these runs are even left in our data after filtering…

runs |>
  anti_join(noise_tags) |>
  anti_join(noise_quality) |>
  semi_join(ambig_ids) |>
  collect()
# A tibble: 0 × 29
# ℹ 29 variables: runID <int>, tsBegin <dbl>, tsEnd <dbl>, done <int>,
#   tagID <int>, ant <chr>, len <int>, nodeNum <chr>, motusFilter <dbl>,
#   tagDeployID <int>, speciesID <int>, tsStartTag <dbl>, tsEndTag <dbl>,
#   test <int>, batchID <int>, recvDeviceID <int>, recvDeployID <int>,
#   tsStartRecv <dbl>, tsEndRecv <dbl>, recvType <chr>, recvDeployLat <dbl>,
#   recvDeployLon <dbl>, timeBegin <dttm>, timeEnd <dttm>, dateBegin <date>,
#   dateEnd <date>, monthBegin <dbl>, yearBegin <dbl>, proj_id <int>

There are no ambiguous runs left in the data after cleaning, so we’ll just ignore them for now.

Looking at the filters

noise_hits <- bind_rows(noise_tags, noise_quality) |>
  select("runID", "tagDeployID", "recvDeployID") |>
  distinct()

noise_hits
# A tibble: 338,272 × 3
       runID tagDeployID recvDeployID
       <int>       <int>        <int>
 1 587803069       41241         8385
 2 587803894       41241         8385
 3 607721978       41241         7952
 4 607723494       41241         7952
 5 491015464       41214         8415
 6 491123194       41205         8415
 7 629411593       52135         8415
 8 631079364       44341         5417
 9 640346863       52140         9006
10 640347023       52140         9006
# ℹ 338,262 more rows
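Because the two filters work on hits, the same run can be flagged several times, and by both filters. A small sketch with hypothetical identifiers shows how bind_rows() followed by distinct() reduces this to one row per run.

```r
library(dplyr)

# Hypothetical output of the two filters; run 2 is flagged by both
toy_noise_tags <- tibble(
  runID = c(1L, 2L), tagDeployID = c(100L, 100L), recvDeployID = c(5L, 6L)
)
toy_noise_quality <- tibble(
  runID = c(2L, 2L, 3L), tagDeployID = c(100L, 100L, 200L), recvDeployID = c(6L, 6L, 5L)
)

# Stack both sets and keep each run/tag/receiver combination once
toy_noise_hits <- bind_rows(toy_noise_tags, toy_noise_quality) |>
  select("runID", "tagDeployID", "recvDeployID") |>
  distinct()

toy_noise_hits
```

Five flagged rows collapse to three unique runs, which is the form needed for the anti_join() against runs.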

Next we’ll take a look at how this compares to the motusFilter.

With only the runs filtering

count(runs, proj_id, motusFilter) |>
  collect() |>
  pivot_wider(names_from = motusFilter, values_from = n) |>
  arrange(proj_id)
# A tibble: 11 × 3
   proj_id    `1`    `0`
     <int>  <int>  <int>
 1     168 267523 148857
 2     352 543816 212388
 3     364   1617  14453
 4     373 477672 292145
 5     393  24927  49501
 6     417 604562 354220
 7     464    556  31519
 8     484 601907 492931
 9     515  29846 153426
10     551 625166 641404
11     607      4    427

With both the runs and hit filtering

anti_join(runs, noise_hits, by = c("runID", "tagDeployID", "recvDeployID")) |>
  count(proj_id, motusFilter) |>
  collect() |>
  pivot_wider(names_from = motusFilter, values_from = n) |>
  arrange(proj_id)
# A tibble: 11 × 3
   proj_id    `1`    `0`
     <int>  <int>  <int>
 1     168 267235 118672
 2     352 542725 174567
 3     364   1584   7799
 4     373 476839 267387
 5     393  24654  42336
 6     417 603126 315663
 7     464    472  16293
 8     484 601017 483676
 9     515  29753 149450
10     551 619025 489385
11     607      4    172
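The comparison above combines an anti_join() with a count and a reshape. The sketch below repeats that on hypothetical data (toy_runs and toy_noise_hits are made up) to show what the resulting table looks like.

```r
library(dplyr)
library(tidyr)

# Hypothetical runs from two projects
toy_runs <- tibble(
  runID        = 1:6,
  tagDeployID  = c(100, 100, 100, 200, 200, 200),
  recvDeployID = 5,
  proj_id      = c(168L, 168L, 168L, 352L, 352L, 352L),
  motusFilter  = c(1, 0, 0, 1, 1, 0)
)

# Hypothetical runs flagged by the hit-level filters
toy_noise_hits <- tibble(
  runID = c(2L, 6L), tagDeployID = c(100, 200), recvDeployID = 5
)

# Drop flagged runs, then count the remaining runs per project and motusFilter value
toy_counts <- anti_join(toy_runs, toy_noise_hits,
                        by = c("runID", "tagDeployID", "recvDeployID")) |>
  count(proj_id, motusFilter) |>
  pivot_wider(names_from = motusFilter, values_from = n) |>
  arrange(proj_id)

toy_counts
```

In this toy case project 352 has no remaining runs with a motusFilter of 0, so that cell is NA; adding `values_fill = 0` to pivot_wider() would show it as 0 instead.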

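There are still many ‘bad’ runs according to the motusFilter, but we are definitely getting closer. For example, in project 551 the number of runs with a motusFilter of 0 drops from 641,404 to 489,385.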

Saving filters

We’ll save the ‘bad data’ for use in the next steps.

write_feather(noise_hits, sink = "Data/02_Datasets/noise_hits.feather")

Wrap up

Disconnect from the databases

walk(dbs, dbDisconnect)

Reproducibility

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language en_CA:en
 collate  en_CA.UTF-8
 ctype    en_CA.UTF-8
 tz       America/Winnipeg
 date     2024-06-27
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package       * version  date (UTC) lib source
 P arrow         * 16.1.0   2024-05-25 [?] CRAN (R 4.4.0)
 P assertr       * 3.0.1    2023-11-23 [?] CRAN (R 4.4.0)
 P assertthat      0.2.1    2019-03-21 [?] CRAN (R 4.4.0)
 P bit             4.0.5    2022-11-15 [?] CRAN (R 4.4.0)
 P bit64           4.0.5    2020-08-30 [?] CRAN (R 4.4.0)
 P blob            1.2.4    2023-03-17 [?] CRAN (R 4.4.0)
 P cachem          1.1.0    2024-05-16 [?] CRAN (R 4.4.0)
 P class           7.3-22   2023-05-03 [?] CRAN (R 4.3.1)
 P classInt        0.4-10   2023-09-05 [?] CRAN (R 4.4.0)
 P cli             3.6.2    2023-12-11 [?] CRAN (R 4.4.0)
 P codetools       0.2-19   2023-02-01 [?] CRAN (R 4.2.2)
 P colorspace      2.1-0    2023-01-23 [?] CRAN (R 4.4.0)
 P DBI           * 1.2.3    2024-06-02 [?] CRAN (R 4.4.0)
 P dbplyr          2.5.0    2024-03-19 [?] CRAN (R 4.4.0)
 P devtools        2.4.5    2022-10-11 [?] CRAN (R 4.4.0)
 P digest          0.6.35   2024-03-11 [?] CRAN (R 4.4.0)
 P dplyr         * 1.1.4    2023-11-17 [?] CRAN (R 4.4.0)
 P e1071           1.7-14   2023-12-06 [?] CRAN (R 4.4.0)
 P ebirdst       * 3.2022.3 2024-03-05 [?] CRAN (R 4.4.0)
 P ellipsis        0.3.2    2021-04-29 [?] CRAN (R 4.4.0)
 P evaluate        0.23     2023-11-01 [?] CRAN (R 4.4.0)
 P fansi           1.0.6    2023-12-08 [?] CRAN (R 4.4.0)
 P fastmap         1.2.0    2024-05-15 [?] CRAN (R 4.4.0)
 P fs              1.6.4    2024-04-25 [?] CRAN (R 4.4.0)
 P furrr         * 0.3.1    2022-08-15 [?] CRAN (R 4.4.0)
 P future        * 1.33.2   2024-03-26 [?] CRAN (R 4.4.0)
 P generics        0.1.3    2022-07-05 [?] CRAN (R 4.4.0)
 P ggplot2       * 3.5.1    2024-04-23 [?] CRAN (R 4.4.0)
 P ggrepel       * 0.9.5    2024-01-10 [?] CRAN (R 4.4.0)
 P ggspatial     * 1.1.9    2023-08-17 [?] CRAN (R 4.4.0)
 P globals         0.16.3   2024-03-08 [?] CRAN (R 4.4.0)
 P glue            1.7.0    2024-01-09 [?] CRAN (R 4.4.0)
 P gt            * 0.10.1   2024-01-17 [?] CRAN (R 4.4.0)
 P gtable          0.3.5    2024-04-22 [?] CRAN (R 4.4.0)
 P hms             1.1.3    2023-03-21 [?] CRAN (R 4.4.0)
 P htmltools       0.5.8.1  2024-04-04 [?] CRAN (R 4.4.0)
 P htmlwidgets     1.6.4    2023-12-06 [?] CRAN (R 4.4.0)
 P httpuv          1.6.15   2024-03-26 [?] CRAN (R 4.4.0)
 P httr            1.4.7    2023-08-15 [?] CRAN (R 4.4.0)
 P jsonlite        1.8.8    2023-12-04 [?] CRAN (R 4.4.0)
 P KernSmooth      2.23-22  2023-07-10 [?] CRAN (R 4.3.1)
 P knitr           1.47     2024-05-29 [?] CRAN (R 4.4.0)
 P later           1.3.2    2023-12-06 [?] CRAN (R 4.4.0)
 P lifecycle       1.0.4    2023-11-07 [?] CRAN (R 4.4.0)
 P listenv         0.9.1    2024-01-29 [?] CRAN (R 4.4.0)
 P lubridate     * 1.9.3    2023-09-27 [?] CRAN (R 4.4.0)
 P lutz          * 0.3.2    2023-10-17 [?] CRAN (R 4.4.0)
 P magrittr        2.0.3    2022-03-30 [?] CRAN (R 4.4.0)
 P memoise         2.0.1    2021-11-26 [?] CRAN (R 4.4.0)
 P mime            0.12     2021-09-28 [?] CRAN (R 4.4.0)
 P miniUI          0.1.1.1  2018-05-18 [?] CRAN (R 4.4.0)
 P motus         * 6.1.0    2024-05-02 [?] Github (motuswts/motus@a53a8b8)
 P munsell         0.5.1    2024-04-01 [?] CRAN (R 4.4.0)
 P naturecounts    0.4.0    2024-05-02 [?] Github (birdscanada/naturecounts@a6e52da)
 P parallelly      1.37.1   2024-02-29 [?] CRAN (R 4.4.0)
 P patchwork     * 1.2.0    2024-01-08 [?] CRAN (R 4.4.0)
 P pillar          1.9.0    2023-03-22 [?] CRAN (R 4.4.0)
 P pkgbuild        1.4.4    2024-03-17 [?] CRAN (R 4.4.0)
 P pkgconfig       2.0.3    2019-09-22 [?] CRAN (R 4.4.0)
 P pkgload         1.3.4    2024-01-16 [?] CRAN (R 4.4.0)
 P profvis         0.3.8    2023-05-02 [?] CRAN (R 4.4.0)
 P promises        1.3.0    2024-04-05 [?] CRAN (R 4.4.0)
 P proxy           0.4-27   2022-06-09 [?] CRAN (R 4.4.0)
 P purrr         * 1.0.2    2023-08-10 [?] CRAN (R 4.4.0)
 P R6              2.5.1    2021-08-19 [?] CRAN (R 4.4.0)
 P Rcpp            1.0.12   2024-01-09 [?] CRAN (R 4.4.0)
 P readr         * 2.1.5    2024-01-10 [?] CRAN (R 4.4.0)
 P remotes         2.5.0    2024-03-17 [?] CRAN (R 4.4.0)
   renv            1.0.7    2024-04-11 [1] CRAN (R 4.4.0)
 P rlang           1.1.3    2024-01-10 [?] CRAN (R 4.4.0)
 P rmarkdown       2.27     2024-05-17 [?] CRAN (R 4.4.0)
 P rnaturalearth * 1.0.1    2023-12-15 [?] CRAN (R 4.4.0)
 P RSQLite         2.3.6    2024-03-31 [?] CRAN (R 4.4.0)
 P rstudioapi      0.16.0   2024-03-24 [?] CRAN (R 4.4.0)
 P scales          1.3.0    2023-11-28 [?] CRAN (R 4.4.0)
 P sessioninfo     1.2.2    2021-12-06 [?] CRAN (R 4.4.0)
 P sf            * 1.0-16   2024-03-24 [?] CRAN (R 4.4.0)
 P shiny           1.8.1.1  2024-04-02 [?] CRAN (R 4.4.0)
 P stringi         1.8.4    2024-05-06 [?] CRAN (R 4.4.0)
 P stringr       * 1.5.1    2023-11-14 [?] CRAN (R 4.4.0)
 P terra           1.7-71   2024-01-31 [?] CRAN (R 4.4.0)
 P tibble        * 3.2.1    2023-03-20 [?] CRAN (R 4.4.0)
 P tidyr         * 1.3.1    2024-01-24 [?] CRAN (R 4.4.0)
 P tidyselect      1.2.1    2024-03-11 [?] CRAN (R 4.4.0)
 P timechange      0.3.0    2024-01-18 [?] CRAN (R 4.4.0)
 P tzdb            0.4.0    2023-05-12 [?] CRAN (R 4.4.0)
 P units         * 0.8-5    2023-11-28 [?] CRAN (R 4.4.0)
 P urlchecker      1.0.1    2021-11-30 [?] CRAN (R 4.4.0)
 P usethis         2.2.3    2024-02-19 [?] CRAN (R 4.4.0)
 P utf8            1.2.4    2023-10-22 [?] CRAN (R 4.4.0)
 P vctrs           0.6.5    2023-12-01 [?] CRAN (R 4.4.0)
 P withr           3.0.0    2024-01-16 [?] CRAN (R 4.4.0)
 P xfun            0.44     2024-05-15 [?] CRAN (R 4.4.0)
 P xml2            1.3.6    2023-12-04 [?] CRAN (R 4.4.0)
 P xtable          1.8-4    2019-04-21 [?] CRAN (R 4.4.0)
 P yaml            2.3.8    2023-12-11 [?] CRAN (R 4.4.0)

 [1] /home/steffi/Projects/Business/Barbara Frei/urban_motus/renv/library/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu
 [2] /home/steffi/.cache/R/renv/sandbox/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/9a444a72

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. Based on notes from Amie MacDonald’s scripts

  2. Based on notes from Amie MacDonald’s scripts

  3. Based on notes from Amie MacDonald’s scripts