Calculate Metrics

Author

Steffi LaZerte

Published

June 6, 2025

Background

Having explored the data (Initial Exploration) and various ways of calculating metrics of migration timing, we will now calculate and explore these metrics for the entire data set.

We will use

a GAM approach to model the pattern of vulture counts

percentiles based on cumulative modelled counts to assess dates of passage

Option 3 to account for resident birds (subtract the predicted mean residents prior to calculating the cumulative counts)

Load Data

source("XX_functions.R")  # Custom functions and packages

set.seed(1234) # To make this reproducible

v <- read_csv("Data/Datasets/vultures_clean_2023.csv")
resident_date <- 240

Metrics to assess

To answer these questions we will summarize the counts into specific metrics representing the timing of migration.

Specifically, we would like to calculate the

dates of 5%, 25%, 50%, 75%, and 95% of the kettle numbers
duration of passage - No. days between 5% and 95%
duration of peak passage - No. days between 25% and 75%

Population size (no. vultures in aggregations)

maximum
cumulative
number at peak passage (mean, median, range)
number of locals (mean, median, range)

Of these, the most important starting metrics are the dates of 5%, 25%, 50%, 75%, and 95% of the kettle numbers. These dates will define migration phenology as well as local vs. migrating counts. All other calculations can be performed using these values and the raw data.

Procedure

The steps for calculating these metrics are as follows.

For each year we will calculate…

A GAM
The median number of residents, using day 240 as a cutoff
The cumulative migration counts
The dates of passage as percentiles of these cumulative counts (5%, 25%, 50%, 75%, 95%)
The duration of (peak) passage from these dates
The population size (max, cumulative, stats at peak passage, stats of locals)

We will also create figures outlining these metrics for each year and use them to assess whether anything needs to be tweaked (e.g., the day 240 cutoff).
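Steps 2 to 5 above can be sketched on toy numbers as follows. This is only an illustration under stated assumptions, not the code used in this analysis: the predicted counts are made up, residents are summarized with the median of days strictly before the cutoff, and negative values after subtraction are floored at zero. The names `pred_count`, `migrants`, `cum_prop`, and `passage` are hypothetical.

```r
# Toy predicted daily counts for one season (made-up values, not the vulture data)
doy <- 205:290
pred_count <- 20 + 300 * exp(-((doy - 270)^2) / (2 * 8^2))

resident_date <- 240  # cutoff used in this document

# Step 2: typical number of residents before the cutoff (assumption: strictly before)
residents <- median(pred_count[doy < resident_date])

# Step 3: subtract residents (assumption: floor at zero) and accumulate
migrants <- pmax(pred_count - residents, 0)
cum_prop <- cumsum(migrants) / sum(migrants)

# Step 4: first day on which each cumulative percentile is reached
probs <- c(p05 = 0.05, p25 = 0.25, p50 = 0.50, p75 = 0.75, p95 = 0.95)
passage <- sapply(probs, \(p) doy[which(cum_prop >= p)[1]])

# Step 5: durations of passage and of peak passage, in days
duration_passage <- passage[["p95"]] - passage[["p05"]]
duration_peak <- passage[["p75"]] - passage[["p25"]]

passage
```

Because the percentiles are taken from cumulative counts after the residents are removed, the early-season baseline contributes almost nothing to the passage dates.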

Calculate Metrics

samples <- v |>
  group_by(year) |>
  filter(!is.na(count)) |> # Omit missing dates
  summarize(
    date_min = min(date), date_max = max(date),
    # number of dates with a count
    n_dates_obs = n(),           
    # number of days between the first and last date (date_max - date_min)
    n_dates = as.numeric(difftime(date_max, date_min, units = "days")), 
    n_obs = sum(count))

gt(samples)

year	date_min	date_max	n_dates_obs	n_dates	n_obs
1999	1999-07-25	1999-10-20	64	87	7397
2000	2000-07-23	2000-10-18	76	87	2623
2001	2001-07-23	2001-10-07	64	76	3366
2002	2002-07-23	2002-10-21	83	90	4454
2003	2003-07-23	2003-10-18	83	87	6229
2004	2004-07-23	2004-10-18	80	87	9052
2005	2005-07-23	2005-10-18	83	87	5267
2006	2006-07-23	2006-10-17	73	86	5317
2008	2008-07-23	2008-10-17	59	86	3761
2009	2009-07-23	2009-10-18	62	87	5257
2010	2010-07-23	2010-10-18	75	87	7956
2011	2011-07-24	2011-10-20	69	88	3032
2012	2012-07-23	2012-10-18	78	87	5327
2013	2013-07-23	2013-10-18	69	87	4006
2014	2014-07-23	2014-10-18	74	87	6021
2015	2015-07-23	2015-10-18	77	87	5428
2016	2016-07-23	2016-10-18	68	87	8137
2017	2017-07-23	2017-10-17	62	86	4827
2018	2018-07-23	2018-10-18	75	87	6672
2019	2019-07-23	2019-10-18	87	87	6476
2020	2020-07-23	2020-10-18	81	87	12595
2021	2021-07-23	2021-10-18	82	87	9652
2022	2022-07-23	2022-10-18	86	87	19826
2023	2023-07-23	2023-10-15	84	84	12749

1. GAMs

As developed in our Initial Exploration we will use:

Negative binomial model to fit count data with overdispersion
Restricted Maximum Likelihood (REML) to estimate the smoothing parameters (“Most likely to give you reliable, stable results”¹)
A smoother (s()) over doy (day of year) to account for non-linear migration patterns
k = 20 (up to 20 basis functions; we want enough to make sure we capture the patterns, but too many will slow things down).
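As a minimal sketch of what these choices look like for a single season, the model can be fitted to simulated counts. The data frame `toy` and its values are invented for illustration. Only `gam()`, `s()`, `nb()`, and `predict()` from mgcv are used.

```r
library(mgcv)

set.seed(1)
# Simulated overdispersed counts with a late-season peak (hypothetical data)
toy <- data.frame(doy = 205:290)
toy$count <- rnbinom(nrow(toy),
                     mu = 20 + 300 * exp(-((toy$doy - 270)^2) / 128),
                     size = 5)

# Negative binomial GAM with a smoother over day of year, fitted by REML
m <- gam(count ~ s(doy, k = 20), data = toy, method = "REML", family = nb())

# Predicted counts (response scale) and standard errors for each day
p <- predict(m, newdata = data.frame(doy = 205:290),
             type = "response", se.fit = TRUE)

# Effective degrees of freedom actually used by the smooth
summary(m)$edf
```

The effective degrees of freedom are usually well below the upper limit set by k, since the REML penalty shrinks the smooth towards the wiggliness the data support.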

Run GAM on each year (except 2007)

gams <- v |>
  mutate(count = as.integer(count)) |>
  filter(year != 2007) |> # Can't model 2007 because no data
  nest(counts = -year) |>
  mutate(models = map(counts, \(x) gam(count ~ s(doy, k = 20), data = x, 
                                      method = "REML", family = "nb")))

Create model predictions

gams <- gams |>
  mutate(
    doy = map(counts, \(x) list(doy = min(x$doy):max(x$doy))),
    pred = map2(
      models, doy, 
      \(x, y) predict(x, newdata = y, type = "response", se.fit = TRUE)),
    pred = map2(
      pred, doy,
      \(x, y) data.frame(doy = y, count = x$fit, se = x$se.fit) |> # use the full name, not partial matching
        mutate(ci99_upper = count + se * 2.58,
               ci99_lower = count - se * 2.58)))

pred <- gams |>
  select(year, pred) |>
  unnest(pred)

Checks to ensure models are valid.

Here we look for two things:

first, that there is full convergence
second, that the residuals show no significant non-random pattern around the smoothing term (judged by the p-value, but be aware this is an approximation²)

If we have low p-values, we want to check and see

if the model doesn’t look like it fits the data (see the model plots at the end of this script)
if the k (number of basis functions) and edf (effective degrees of freedom) values are similar (if they are, this implies that we haven’t picked a large enough k)
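The comparison of k and edf can be sketched with mgcv's `k.check()`, which reports the same quantities as `gam.check()`. This is an illustration on simulated data, not the custom `gam_check()` helper used below, whose internals are not shown here. The renamed columns and the 0.8 ratio used to call edf "close to" k are assumptions made for the example.

```r
library(mgcv)

set.seed(1)
# Simulated season of counts (hypothetical data)
toy <- data.frame(doy = 205:290)
toy$count <- rnbinom(nrow(toy),
                     mu = 20 + 300 * exp(-((toy$doy - 270)^2) / 128),
                     size = 5)
m <- gam(count ~ s(doy, k = 20), data = toy, method = "REML", family = nb())

# k.check() returns one row per smooth with k', edf, k-index and p-value
kc <- as.data.frame(k.check(m))
names(kc) <- c("k", "edf", "k_index", "p_value")

# A low p-value matters mostly when edf is also close to k'
kc$low_p <- kc$p_value < 0.1
kc$edf_near_k <- kc$edf / kc$k > 0.8  # 0.8 is an arbitrary illustrative threshold
kc
```

Note that the reported k' is one less than the k passed to `s()`, which is why a model fitted with k = 20 shows 19 in the check output.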

checks <- gams |>
  mutate(checks = map2(models, year, gam_check)) |>
  invisible() |>
  mutate(plots = map(checks, \(x) pluck(x, "plot")),
         df = map(checks, \(x) pluck(x, "checks"))) |>
  unnest(df) |>
  mutate(low_k = p_value < 0.1)

c <- checks |>
  filter(low_k | !full_convergence) |>
  select(year, param, k, edf, k_index, p_value, convergence)

gt(c)

year	param	k	edf	k_index	p_value	convergence
2010	s(doy)	19.00	8.57	0.76	0.027	full convergence after 4 iterations.
2012	s(doy)	19.00	7.68	0.76	0.021	full convergence after 4 iterations.

These plots are two different ways of presenting model diagnostics.

gam.check() is the default check that produces both these plots as well as the diagnostics in the Model Evaluation tab.

DHARMa is a package for simulating residuals to allow model checking for all types of models (details).

Both these sets of plots can be interpreted similarly to general linear model plots. We want roughly normal residuals and constant variance.

I tend to put more weight on DHARMa as its plots are easier to interpret for non-Gaussian model residuals, but I have included the gam.check() plots for completeness.

DHARMa
gam.check()

Year: 1999