Butterfly Counts

Simulation of Butterfly Counts

We will use simulated data to demonstrate and evaluate methods for calculating butterfly abundance indices, population trends and multi-species indicators such as the European Grassland Butterfly Indicator. To do this, we need to generate realistic data sets with known parameters. Data simulation will allow us to apply and test the methods to data generated under different scenarios. This approach will allow rigorous sensitivity analysis and provide key insights into the methods and a deeper understanding of their performance and limitations.

Code

if(!require("data.table")) install.packages("data.table")
if(!require("ggplot2")) install.packages("ggplot2")
if(!require("devtools")) install.packages("devtools")
if(!require("rbms")) devtools::install_github("RetoSchmucki/rbms")

flc_col <- '#ff8c00'
cnt_col <- '#008b8b'
missing_col <- '#8b0000'
GAM_col <- '#483d8b'

Simulation Tool for Butterflies and Other Phenologies

As the life cycle of butterflies is highly structured in time, with species-specific phenologies, we need to take this ecological process into account when simulating individual counts over an entire season. The temporal pattern in the number of adult butterflies (imago) is determined by their emergence rate, the timing of emergence and the lifespan of the adult. These parameters often result in the number of adults increasing over a period of time to a peak, after which their numbers begin to decline as mortality exceeds emergence. If a species can produce more than one generation in a season, the number will show additional waves of emergence, each with its own start, peak and decline. For each generation, the hump-shaped temporal pattern in the number of adults can often be described by a function that has a mean (centre), variance (width) and some degree of skewness (asymmetry). When combined, the pattern resulting from the cumulative effect of partially overlapping generations can obscure the individual patterns.

To simulate butterfly count data with such phenological patterns, we will use the function timeseries_sim() from the R package butterflyGamSims developed by Collin Edwards (see Edwards et al. 2023). This package is freely available on GitHub and will allow us to generate realistic datasets under scenarios of varying complexity.

We illustrate how the timeseries_sim() function works with a simple case where we simulate butterfly counts for a univoltine species with a Gaussian pattern (i.e. one generation with a normal distribution), where the peak abundance is observed on day 175 with a standard deviation of 15 days.

Code

set.seed(13276)

if(!require("butterflyGamSims")) devtools::install_github("cbedwards/butterflyGamSims")

btfl_data <- timeseries_sim(nsims=1,
               year = c(2023),
               doy.samples = seq(from=1, to=365, by=1),
               abund.type = "exp",
               activity.type = "gauss",
               sample.type = "pois",
               sim.parms = list(growth.rate = 0,
                                init.size = 500, 
                                act.mean = 175,
                                act.sd = 15)
               )

The object returned by the timeseries_sim() function contains 1) a data.frame NAME$timeseries with the time series and 2) a data.frame NAME$parms with the parameters used for the simulation. The parameters are the population growth rate (growth.rate), the initial population size (init.size) measured in number of individuals expected over the season, the peak of the activity curve (act.mean) measured in days and the width of the activity curve (act.sd) measured in standard deviation.

Note

Note that not all the sampling parameters used for the simulation are included in the parms object; activity.type' (the distribution function used to define the activity curve),sample.type’ (the sampling procedure used to sample random counts along the activity curve) and `abund.type’ (the type of growth rate, deterministic or with a log-normal process error) are missing.

Simulate Data Sampling Process

With the timeseries_sim() function above we simulated a regular time series of butterfly counts, where the actual number of active butterflies of each day is drawn from a Poisson distribution with a given expectation, defined by the activity curve along a day-of-year vector $j$, following the probability density of a normal (Gaussian) distribution with a mean equal to the peak day $\mu$ (act.mean) and a standard deviation $\alpha$ (act.sd). Since the integral of the probability density distribution (area under the curve) is 1, we can multiply the density by the total abundance to obtain a vector of expected abundance for each day of the year, $\lambda_j$.

\[ \lambda_j = abundance * \frac{1}{\alpha \sqrt{2 \pi}} \exp \left(- \frac{1}{2} \left( \frac{\left(j - \mu\right)}{\alpha}\right)^2\right) \]

From the activity curve, we can use the rpois() function in R to draw a random value from a Poisson distribution, where $y_i$ is the count for day $j$ and the mean is given by the expected value $\lambda$ on day $j$.

\[ y_j = rpois(\lambda_j) \]

From this simulation, we have generated the ecological process for the butterfly counts, for a given abundance distributed over a specific phenology, defined by a Gaussian curve with a peak (mean) and a breath (standard deviation).

Code

library(knitr)
kable(btfl_data$timeseries[c(1:3,160:163,250:253),])

Butterfly Simmulation Data
	years	doy	count	act	abund.true	onset.true	median.true	end.true	fp.true	sim.id
1	2023	1	0	0.0000000	500	155.8	175	194.2	38.4	1
2	2023	2	0	0.0000000	500	155.8	175	194.2	38.4	1
3	2023	3	0	0.0000000	500	155.8	175	194.2	38.4	1
160	2023	160	9	8.0656908	500	155.8	175	194.2	38.4	1
161	2023	161	11	8.6025942	500	155.8	175	194.2	38.4	1
162	2023	162	10	9.1345489	500	155.8	175	194.2	38.4	1
163	2023	163	5	9.6563851	500	155.8	175	194.2	38.4	1
250	2023	250	0	0.0000496	500	155.8	175	194.2	38.4	1
251	2023	251	0	0.0000354	500	155.8	175	194.2	38.4	1
252	2023	252	0	0.0000252	500	155.8	175	194.2	38.4	1
253	2023	253	0	0.0000179	500	155.8	175	194.2	38.4	1

The additional structure resulting from the monitoring protocol (observation process) can now be added to the simulated time series to reproduce a specific protocol. In this first case, we will simulate a protocol with weekly visits and include some missing counts for weeks when the minimum monitoring conditions were not met or the observer was absent. To do this, we will write some new R functions. The first function will define the start and end of the monitoring season and sample one monitoring day per week throughout the season. Then we will write functions to simulate a certain level of missing weekly visits within the season. The probability of missing weeks tends to be higher at the beginning and end of the season and lower in the middle. Let’s start with the first function, which defines the monitoring season and resamples one day of the simulated time series each week. We will call the function sim2bms() as it adapts the simulated time series to the protocol of a specific Butterfly Monitoring Scheme (BMS). The function takes the time series and additional arguments to define the year we want to extract yearKeep, if the time series needs to be resampled weekly weeklySample, which day to use for the weekly resampling weekdayKeep (this can be a vector of days, e.g. c(2,3,4,5), or a specific day), and finally the monitoring season monitoringSeason, which is a vector of months, e.g. c(4,5,6,7,8,9) represents a season starting in April and ending in September. Note that this function also adds some new variables such as the date, the ISO week number and the day of the week.

Code

# data: Time series resulting from the simulation generated for 365 days
# weeklySample: TRUE or FALSE; should the daily count in the time series be resampled weekly?
# weekdayKeep: vector of days c(1,2,..., 7) to be sampled from for the weekly count. If the vector
# contains c(2,3,4), the sampling process will be restricted to Tuesday, Wednesday or Thursday.
# monitoringSeason: vector of months that define the monitoring season (e.g. April to September is c(4:9)). 

sim2bms <- function(data, yearKeep = NULL, weeklySample = FALSE, weekdayKeep = NULL, monitoringSeason = NULL){
                  
                  btfl_ts <- data.table::data.table(data)[, site_id := paste0("site_",sim.id)]
                  if(!is.null(yearKeep)){
                        btfl_ts <- btfl_ts[years %in% yearKeep, ]
                  }
                  btfl_ts[ , date := as.Date(doy, origin = paste0(years, "-01-01"))-1]
                  btfl_ts[ , week := isoweek(date)]
                  btfl_ts[month(date) != 1 | week < 50, weekday := rowid(week), by = .(site_id, years)]
                  if(isTRUE(weeklySample)){
                        if(!is.null(weekdayKeep)){
                              btfl_ts <- btfl_ts[weekday %in% weekdayKeep, ]
                        }
                        btfl_ts <- btfl_ts[btfl_ts[,.I[sample(.N, 1)], by = .(week, site_id, years)][["V1"]],]
                  }
                  if(!is.null(monitoringSeason)){
                  btfl_ts <- btfl_ts[month(date) %in% monitoringSeason, ]
                  }
            return(btfl_ts)
            }

We can use this function to retrieve a particular year of the simulated time series and add some new variables, leaving all other parameters empty. The same function can also be used to resample weekly counts (e.g. a day from c(2:5)) and to restrict the time series to a particular monitoring season (e.g. c(4:9)).

Code

set.seed(13276)
y <- c(2023)

btfl_ts <- sim2bms(data = btfl_data$timeseries, yearKeep = y)

btfl_fig1 <- ggplot() +
                geom_point(data=btfl_ts, aes(x=doy, y=count, colour = "count")) + 
                geom_line(data = btfl_ts,
                aes(x = doy, y = act, colour = "activity")) +
                xlim(1,365) + ylim(0, max(btfl_ts$count, btfl_ts$act)) + 
                scale_colour_manual("", 
                      breaks = c("count", "activity"),
                      values = c(cnt_col, flc_col)) +
                theme_light() + 
                theme(legend.position = "inside", legend.position.inside = c(0.9, 0.8)) +
                labs(title = paste0("Simulated butterfly counts (", y,")"),
                     subtitle = "- daily visit",
                     x = "Day of Year",
                     y = "Count")

btfl_week_smpl <- sim2bms(data = btfl_data$timeseries, yearKeep = y, 
                  weeklySample = TRUE,
                  weekdayKeep = c(2:5),
                  monitoringSeason = c(4:9))
btfl_fig2 <- ggplot() +
                geom_point(data=btfl_week_smpl, aes(x=doy, y=count, colour = "count")) + 
                geom_line(data = btfl_ts,
                aes(x = doy, y = act, colour = "activity")) +
                xlim(1,365) + ylim(0, max(btfl_ts$count, btfl_ts$act)) + 
                scale_colour_manual("", 
                      breaks = c("count", "activity"),
                      values = c(cnt_col, flc_col)) +
                theme_light() + 
                theme(legend.position = "inside", legend.position.inside = c(0.9, 0.8)) +
                labs(title = paste0("Simulated butterfly counts (", y,")"),
                     subtitle = "- weekly visit (random resampled)",
                     x = "Day of Year",
                     y = "Count")

btfl_fig1
btfl_fig2

In the example above, the activity curve represented by the line has a Gaussian shape and the counts represented by the points along the curve are independent samples from a Poisson distribution. Because we sampled a count value for 365 days (day-of-year; doy), the counts are representative of the population of active adult butterflies as if the site were visited every day. This means that a proportion of butterflies will be counted on more than one day because their lifespan is longer than one day. On Pollard transects, butterfly counts are likely to be counted and reported in this way, but at a different frequency, as visits may be weekly, fortnightly or even monthly. We can replicate this value by resampling the daily count on a weekly basis.

Weekly visits will not include counts outside the monitoring period, in many cases these will be ‘zeros’ as we do not expect adult butterfly activity outside the monitoring season. Some other weeks may be missing from the time series, possibly due to unsuitable weather conditions for monitoring or recorder availability. We can inform and exclude the missing visits by resampling a subset of the weekly counts.

Code

missing_prob <- function(data, mu=NULL, alpha = 5, theta = 0.3){
                              x_ <- seq_len(nrow(data))
                              mu_ <- ifelse(is.null(mu), length(x_) / 2, mu)
                              std_ <- sqrt(mu_ / theta)
                              y_ <- abs((alpha * exp((-(x_ - mu_)^2) / std_^2)) - alpha) + alpha
                              yn_ <- y_ / (sum(y_))
                        return(yn_)
                  }

sample_missing <- function(data, propMissing = 0.25){
            
            missing.prob <- data.table::data.table()
            for(i in data[, unique(years)]){
                  for(j in data[, unique(site_id)]){
                  missing.prob <- rbind(missing.prob, missing_prob(data[years == i & site_id == j, ]))
                  }
            }
            
            missing.week <- data[sample(seq_len(.N), round(propMissing * .N), prob = unlist(missing.prob)), ]  
      
      return(missing.week)
}

btfl_week_missing <- sample_missing(data = btfl_week_smpl, propMissing = 0.25)

btfl_fig3 <- ggplot() +
                geom_point(data=btfl_week_smpl, aes(x=doy, y=count, colour = "count")) + 
                geom_point(data=btfl_week_missing, aes(x=doy, y=count, colour = "missing"), 
                            shape=4, size=2, stroke=2) + 
                geom_line(data = btfl_ts,
                aes(x = doy, y = act, colour = "activity")) +
                xlim(1,365) + ylim(0, max(btfl_ts$count, btfl_ts$act)) + 
                scale_colour_manual("", 
                      breaks = c("count", "activity", "missing"),
                      values = c(cnt_col, flc_col, missing_col)) +
                theme_light() + 
                theme(legend.position = "inside", legend.position.inside = c(0.9, 0.8)) +
                labs(title = paste0("Simulated butterfly counts (", y,")"),
                     subtitle = "- weekly visit (random resampled)",
                     x = "Day of Year",
                     y = "Count")
btfl_fig3

Generalized Additive Models with rbms

We will use the simulation to test the GAM method implemented in the R package rbms (Schmucki, Harrower A., and Dennis B. 2022). As recorders only report the number of butterflies observed, zeros are generally not reported, but can be inferred from the visit dates (e.g. we can infer zeros for the dates when monitoring took place but no individuals were reported).

Organising BMS count data

Code

visit_sim <- btfl_week_smpl[!date %in% btfl_week_missing$date, .(site_id, date, count)]
count_sim <- visit_sim[count>=1,][, species := "sp1"]

names(visit_sim) <- toupper(names(visit_sim))
names(count_sim) <- toupper(names(count_sim))

ts_date <- rbms::ts_dwmy_table(InitYear = 2023, LastYear = 2023, WeekDay1 = 'monday')

ts_season <- rbms::ts_monit_season(ts_date,
                       StartMonth = 4,
                       EndMonth = 9, 
                       StartDay = 1,
                       EndDay = NULL,
                       CompltSeason = TRUE,
                       Anchor = TRUE,
                       AnchorLength = 2,
                       AnchorLag = 2,
                       TimeUnit = 'd')

ts_season_visit <- rbms::ts_monit_site(ts_season, visit_sim)

ts_season_count <- rbms::ts_monit_count_site(ts_season_visit, count_sim, sp = "sp1")

Fitting a GAM to Butterfly Counts

Code

ts_flight_curve <- rbms::flight_curve(ts_season_count, 
                       NbrSample = 300,
                       MinVisit = 5,
                       MinOccur = 3,
                       MinNbrSite = 1,
                       MaxTrial = 4,
                       GamFamily = 'nb',
                       SpeedGam = FALSE,
                       CompltSeason = TRUE,
                       SelectYear = NULL,
                       TimeUnit = 'd')

The flight curve computed by the rbms::flight_curve() function is stored in the …$pheno object, where the days of the year are stored in the variable trimDAYNO and the standardised flight curve in the variable NM. The NM variable is scaled to an area under the curve (AUC) that sums to 1. To compare the flight curve derived from the GAM with the activity curve used for the simulation, we must rescale them to the same AUC, in other words we must rescale the activity curve to have an AUC of 1 or rescale the NM to the population size used for the simulation. Here we will rescale the NM to match the population size of the simulation, this will allow us to display the curves and counts on the same graph with the correct scale.

Code

pheno <- ts_flight_curve$pheno

btfl_fig4 <- ggplot() +
                geom_point(data=ts_season_count[ANCHOR == 0 & !is.na(COUNT), ], aes(x=DAY_SINCE, y=COUNT, colour = "count")) +
                geom_point(data=btfl_week_missing, aes(x=doy, y=count, colour = "missing"), 
                            shape=4, size=2, stroke=2) + 
                geom_line(data = btfl_ts, aes(x = doy, y = act, colour = "activity")) +
                geom_line(data = pheno,
                    aes(x = trimDAYNO, y = btfl_ts[,unique(abund.true)]*NM, colour = "GAM_fit")) +
                xlim(1,365) + ylim(0, max(btfl_ts$act, 
                                          pheno$NM*btfl_ts[,unique(abund.true)], 
                                          btfl_week_missing$count, 
                                          ts_season_count[!is.na(COUNT), COUNT] )) + 
                scale_colour_manual("", 
                      breaks = c("count", "activity", "missing", "GAM_fit"),
                      values = c(cnt_col, flc_col, missing_col, GAM_col)) +
                theme_light() + 
                theme(legend.position = "inside", legend.position.inside = c(0.9, 0.8)) +
                labs(title = paste0("Simulated butterfly counts (", y,")"),
                     subtitle = "- Fitting GAM model with rbms",
                     x = "Day of Year",
                     y = "Count")
btfl_fig4

To compare the fitted curve with the activity curve, we should use a standard AUC of 1 to allow a fair comparison between models fitted to different population sizes. Using the standardised activity curve and the GAM generated flight curve (NM), we can calculate the Root Mean Squared Error (RMSE) to estimate the goodness of fit of the flight curve generated by the rbms package.

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n \left( y_i - \tilde{y}_i\right)^2} \]

where $y_i$ is the NM value at time $i$ and $\tilde{y}_i$ the value from the standardized activity curve at time $i$, from day $1$ to $n$ of the monitoring season.

Non Gaussian flight curve

The same procedure can be applied to flight curves with more complex shapes. Here we generate a time series of butterfly counts from a known flight curve (adult activity) using a simulation of a Zonneveld model.

Code

btfl_data_zn <- timeseries_sim(nsims=1,
               year = c(2023),
               doy.samples = seq(from=1, to=365, by=1),
               abund.type = "exp",
               activity.type = "zon",
               sample.type = "pois",
               sim.parms = list(growth.rate = 0,
                                init.size = 500, 
                                act.mean = 175,
                                act.sd = 15,
                                #theta = 5,
                                zon.theta = 50,
                                t0 = 100,
                                beta = 5,
                                alpha = 0.05)
               )


set.seed(13276)
btfl_ts <- sim2bms(data = btfl_data_zn$timeseries, yearKeep = y)

btfl_week_smpl <- sim2bms(data = btfl_data_zn$timeseries, yearKeep = y, 
                  weeklySample = TRUE,
                  weekdayKeep = c(2:5),
                  monitoringSeason = c(4:9))

btfl_week_missing <- sample_missing(data = btfl_week_smpl, propMissing = 0.25)

visit_sim <- btfl_week_smpl[!date %in% btfl_week_missing$date, .(site_id, date, count)]
count_sim <- visit_sim[count>=1,][, species := "sp1"]

names(visit_sim) <- toupper(names(visit_sim))
names(count_sim) <- toupper(names(count_sim))

ts_date <- rbms::ts_dwmy_table(InitYear = 2023, LastYear = 2023, WeekDay1 = 'monday')

ts_season <- rbms::ts_monit_season(ts_date,
                       StartMonth = 4,
                       EndMonth = 9, 
                       StartDay = 1,
                       EndDay = NULL,
                       CompltSeason = TRUE,
                       Anchor = TRUE,
                       AnchorLength = 2,
                       AnchorLag = 2,
                       TimeUnit = 'd')

ts_season_visit <- rbms::ts_monit_site(ts_season, visit_sim)

ts_season_count <- rbms::ts_monit_count_site(ts_season_visit, count_sim, sp = "sp1")

# mod_k <- "COUNT ~ s(DAY_SINCE, bs =\"cr\", k = 5) + factor(SITE_ID)"

ts_flight_curve <- rbms::flight_curve(ts_season_count, 
                       NbrSample = 300,
                       MinVisit = 5,
                       MinOccur = 3,
                       MinNbrSite = 1,
                       MaxTrial = 4,
                       GamFamily = 'nb',
                       SpeedGam = FALSE,
                       CompltSeason = TRUE,
                       SelectYear = NULL,
                       #mod_form = mod_k,
                       TimeUnit = 'd')

pheno <- ts_flight_curve$pheno

btfl_fig5 <- ggplot() +
                geom_point(data=ts_season_count[ANCHOR == 0 & !is.na(COUNT), ], aes(x=DAY_SINCE, y=COUNT, colour = "count")) +
                geom_point(data=btfl_week_missing, aes(x=doy, y=count, colour = "missing"), 
                            shape=4, size=2, stroke=2) + 
                geom_line(data = btfl_ts, aes(x = doy, y = act, colour = "activity")) +
                geom_line(data = pheno, aes(x = trimDAYNO, y = btfl_ts[,unique(abund.true)]*NM, colour = "GAM_fit")) +
                xlim(1,365) + ylim(0, max(btfl_ts$act, 
                                          pheno$NM*btfl_ts[,unique(abund.true)], 
                                          btfl_week_missing$count, 
                                          ts_season_count[!is.na(COUNT), COUNT] )) + 
                scale_colour_manual("", 
                      breaks = c("count", "activity", "missing", "GAM_fit"),
                      values = c(cnt_col, flc_col, missing_col, GAM_col)) +
                theme_light() + 
                theme(legend.position = "inside", legend.position.inside = c(0.9, 0.8)) +
                labs(title = paste0("Simulated butterfly counts - Zonneveld Model (", y,")"),
                     subtitle = "- Fitting GAM model with rbms",
                     x = "Day of Year",
                     y = "Count")
btfl_fig5

Simple trend case

In our first scenario, we apply the method to a simple case study of a single univoltine species monitored over 15 years at 100 sites. In this case, the observed population trends follow a consistent trajectory with a known growth rate.