Correcting Delayed Data

For each approach below, we describe what it does, what data it needs, and when it is most useful. For mathematical details and computational properties, see the companion technical overview.

At a glance

Approach | Best when | Watch out for
Simple approaches | Delays are short and stable; you care about trends not levels | No uncertainty; throws away information
Multiplicative | Stable delay pattern; moderate to large counts | Assumes fixed delays; no weekday effects
Regression | Delays change over time; covariates matter | Delay not constrained; hard to add mechanisms
Generative | You need delay constraints, parametric delays, pooling, or mechanistic structure | More work to set up
Other statistical | Quick prototyping; large datasets | Hard to diagnose when things go wrong

Data requirements

Any nowcasting method that accounts for reporting delays needs data organised by both reference date (when the event occurred) and reporting delay (or equivalently, report date). This is often represented as a reporting triangle, where rows are reference dates, columns are delays, and cells contain the number of events first reported at each delay. The same information can equally be stored as a data frame with one row per reference-date-by-delay combination.

If you have individual line list records with both a reference date and a report date, this structure is built by counting events within each combination. If instead you receive periodic snapshots of aggregate counts (for example, a daily download of cumulative totals by event date), it is obtained by differencing successive snapshots (Wolffram et al. 2023; Johnson et al. 2025). Following a fixed data extraction schedule makes modelling substantially easier, because irregular download times introduce artificial variation in what appears as a new report.
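The counting step for line list data can be sketched as follows. This is a minimal illustration rather than the implementation used by any package mentioned here: the function name `build_reporting_triangle`, the `(reference_date, report_date)` record layout, and the choice to drop records outside the delay range are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def build_reporting_triangle(line_list, max_delay):
    """Count events by reference date and reporting delay (in days).

    line_list: iterable of (reference_date, report_date) pairs.
    Returns {reference_date: [count at delay 0, ..., count at max_delay]}.
    Records with negative delays or delays beyond max_delay are dropped
    here (an assumption; real systems may need to handle them differently).
    """
    counts = Counter()
    for reference_date, report_date in line_list:
        delay = (report_date - reference_date).days
        if 0 <= delay <= max_delay:
            counts[(reference_date, delay)] += 1
    reference_dates = sorted({ref for ref, _ in counts})
    return {
        ref: [counts[(ref, d)] for d in range(max_delay + 1)]
        for ref in reference_dates
    }


# Hypothetical line list of (reference date, report date) records.
records = [
    (date(2024, 1, 1), date(2024, 1, 1)),
    (date(2024, 1, 1), date(2024, 1, 2)),
    (date(2024, 1, 1), date(2024, 1, 2)),
    (date(2024, 1, 2), date(2024, 1, 4)),
]
triangle = build_reporting_triangle(records, max_delay=2)
```

Each row of the result corresponds to one reference date and each position in the list to one delay, so the same structure can be reshaped into the long data frame form if a package expects that instead.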

There are several common data cleaning issues. Differencing snapshots can produce negative cell values when earlier counts are revised downward, for example through deduplication or reclassification. Some surveillance systems report rolling sums (for example, 7-day totals) rather than daily counts. These typically need to be decomposed back into daily or weekly values before the reporting triangle can be constructed, but some methods can target rolling sums directly, which has the advantage of avoiding the need to model correlations between the constituent counts (Johnson et al. 2025; Wolffram et al. 2023).
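As a rough sketch of the differencing step and of how negative cells can be surfaced for review, consider the example below. The snapshot layout (a mapping from snapshot date to cumulative counts by reference date) and the function names are hypothetical, and the sketch only flags negative increments; how to correct them is left to the analyst.

```python
from datetime import date


def difference_snapshots(snapshots):
    """Turn cumulative snapshots into incremental counts.

    snapshots: {snapshot_date: {reference_date: cumulative_count}}.
    Returns a list of (reference_date, delay_days, new_count) rows, where
    the delay is measured to the snapshot in which the reports first appear.
    Note: the earliest snapshot has nothing to difference against, so all
    reports it contains are attributed to that snapshot's delay.
    """
    rows = []
    previous = {}
    for snapshot_date in sorted(snapshots):
        current = snapshots[snapshot_date]
        for reference_date in sorted(current):
            new_count = current[reference_date] - previous.get(reference_date, 0)
            delay = (snapshot_date - reference_date).days
            rows.append((reference_date, delay, new_count))
        previous = current
    return rows


def negative_cells(rows):
    """Return rows with negative increments (downward revisions)."""
    return [row for row in rows if row[2] < 0]


# Hypothetical daily downloads of cumulative totals by event date.
snapshots = {
    date(2024, 1, 2): {date(2024, 1, 1): 5, date(2024, 1, 2): 2},
    date(2024, 1, 3): {
        date(2024, 1, 1): 8,
        date(2024, 1, 2): 1,  # revised downward, e.g. after deduplication
        date(2024, 1, 3): 4,
    },
}
rows = difference_snapshots(snapshots)
flagged = negative_cells(rows)
```

Flagging rather than silently truncating negative values keeps the decision about how to treat revisions explicit.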

Methods

Not all situations require statistical correction. When delays are short relative to the decision timescale, or when the primary concern is monitoring trends rather than estimating levels, simpler strategies may be enough. One option is to flag recent dates as provisional and exclude them from interpretation until enough reports have accumulated. Another is to show only the counts available at each snapshot in time, removing any later additions, so that all dates are treated consistently even though each underestimates the true level. A third is to work with report dates instead of event dates, accepting a lagged and smoothed picture of the underlying signal. Each of these trades timeliness or completeness for simplicity, and each avoids the modelling choices that statistical methods require. They become problematic when delays are long or changing, when the gap between reported and actual counts matters for decisions, or when decision makers need to understand how uncertain current estimates are.

Tip: Best when

Delays are short and stable, recent data does not drive time-sensitive decisions, or resources for statistical modelling are unavailable.

Warning: Watch out for

These approaches discard information that statistical methods could use, and they do not provide uncertainty estimates.

Multiplicative methods estimate current counts by scaling up partially reported values according to the proportion of cases historically observed at each delay. The approach originates from actuarial claims reserving, where it is known as the chain ladder method, and remains one of the most widely used approaches to nowcasting because it is fast, transparent, and easy to explain to decision makers. It works well when the delay pattern is reasonably stable over the fitting window and counts are large enough that the scaling factors are well estimated. The main limitations are that the basic form assumes a fixed delay distribution, cannot produce meaningful estimates when event-date counts are zero, and does not account for systematic day-of-the-week effects in reporting (Wolffram et al. 2023). Variants that address some of these issues exist, and uncertainty can be added either through distributional assumptions or by evaluating past prediction errors (Johnson et al. 2025).

Tip: Best when

Delays are reasonably stable, counts are large enough for stable scaling factors, and speed and simplicity are priorities.

Warning: Watch out for

Fixed delay assumption breaks down when reporting practices change. Zero counts produce undefined scaling factors. Day-of-week effects are not handled natively.
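The multiplicative scaling described above can be illustrated with a basic chain ladder calculation. This is a minimal sketch assuming a daily reporting triangle stored as a list of rows with `None` for cells not yet observed; the function name and data layout are hypothetical, and it includes none of the refinements (uncertainty, day-of-week adjustment, zero handling) offered by dedicated packages.

```python
def chain_ladder_nowcast(triangle):
    """Estimate final totals from a reporting triangle.

    triangle: list of rows, oldest reference date first. Each row holds
    incremental counts at delays 0..D, with None for trailing cells that
    have not been observed yet. Delay 0 is assumed observed in every row.
    Returns a list of estimated final totals, one per reference date.
    """
    max_delay = len(triangle[0]) - 1

    # Convert incremental counts to cumulative counts along each row.
    cumulative = []
    for row in triangle:
        running, cum_row = 0, []
        for cell in row:
            if cell is None:
                cum_row.append(None)
            else:
                running += cell
                cum_row.append(running)
        cumulative.append(cum_row)

    # Development factor for delay d -> d + 1, pooled over all rows in
    # which both delays are observed (this is the fixed-delay assumption).
    factors = []
    for d in range(max_delay):
        usable = [r for r in cumulative if r[d + 1] is not None]
        denominator = sum(r[d] for r in usable)
        if denominator == 0:
            raise ValueError(f"no counts to estimate the factor at delay {d}")
        factors.append(sum(r[d + 1] for r in usable) / denominator)

    # Scale the latest observed cumulative count up to the maximum delay.
    # A row whose observed count is zero stays at zero however it is scaled.
    totals = []
    for cum_row in cumulative:
        observed = [c for c in cum_row if c is not None]
        estimate = float(observed[-1])
        for d in range(len(observed) - 1, max_delay):
            estimate *= factors[d]
        totals.append(estimate)
    return totals


# Hypothetical triangle with a maximum delay of two days.
example = [
    [10, 6, 4],
    [20, 12, 8],
    [15, 9, None],
    [12, None, None],
]
nowcast = chain_ladder_nowcast(example)
```

In this example the two oldest rows are complete and are returned unchanged, while the two most recent rows are scaled up by the pooled development factors.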

Regression approaches go further by jointly modelling the epidemic curve and the delay distribution within a single model, treating the reporting triangle as a smooth surface over event time and delay (Kassteele, Eilers, and Wallinga 2019; Schneble et al. 2021; Bastos et al. 2019; Mellor et al. 2025). Models can therefore borrow information from neighbouring time points to stabilise estimates where counts are low, allow the delay pattern to change over time, account for systematic day-of-week effects, and incorporate additional covariates such as age group or geography. Uncertainty can be estimated as part of the model rather than as a postprocessing step. These are marginal approaches, treating each cell of the reporting triangle as independent (Stoner and Economou 2020). The regression structure makes it difficult to encode mechanistic components such as laboratory capacity or test-seeking behaviour. In addition, the delay component is not constrained to produce reporting proportions that sum to one and parametric delay distributions are not supported, which in practice means that nowcasts can behave unpredictably at long delays where data are sparse.

Tip: Best when

Delays change over time, day-of-week or other covariate effects matter, and you want uncertainty as a direct model output.

Warning: Watch out for

Unconstrained delay distribution can produce odd behaviour at long delays. Difficult to add mechanistic components. Each cell is treated independently, which can affect uncertainty calibration.
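One generic way to write down a regression nowcast of this kind is sketched below. It is an illustration only and not the specification of any cited paper: the symbols, the negative binomial choice, and the particular additive terms are assumptions. Here $n_{t,d}$ is the count for event time $t$ first reported at delay $d$, $T$ is the current date, $D$ is the maximum delay, $f$, $g$, and $h$ are smooth functions, and $w(t+d)$ is the weekday of the report date.

```latex
\begin{aligned}
n_{t,d} &\sim \operatorname{NegBin}(\mu_{t,d}, \phi)
  \quad \text{independently for each cell } (t, d), \\
\log \mu_{t,d} &= \alpha + f(t) + g(d) + h(t, d) + \beta_{w(t+d)}, \\
\hat{N}_t &= \sum_{d=0}^{T-t} n_{t,d} + \sum_{d=T-t+1}^{D} \hat{\mu}_{t,d}.
\end{aligned}
```

Nothing in this form forces the delay terms to behave like proportions that sum to one, which is the source of the unconstrained behaviour at long delays noted above.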

Generative models are a more general class than regression approaches, specifying an explicit model for both the expected number of events at each event time and how those events distribute across reporting delays (Höhle and Heiden 2014; McGough et al. 2020; Günther et al. 2021; Lison et al. 2024; Sam Abbott et al. 2025). This structure makes it straightforward to incorporate mechanistic components, such as a renewal process linking expected counts to a reproduction number, or to bring in auxiliary data sources like leading epidemiological indicators (Lison et al. 2024; Bergström et al. 2022). It also makes it straightforward to add constraints to the delay distribution, such as requiring reporting proportions to sum to one or specifying a parametric form. Two main variants exist: conditional and marginal. Conditional generative models separate variability in incidence from variability in reporting by modelling total counts directly and then distributing them across delays, producing well-calibrated uncertainty for the quantities decision makers most care about (Höhle and Heiden 2014; Stoner and Economou 2020; Stoner, Halliday, and Economou 2023; Seaman et al. 2022). Marginal generative models instead treat each cell of the reporting triangle as an independent draw (McGough et al. 2020; Günther et al. 2021; Lison et al. 2024; Sam Abbott et al. 2025). Both variants can support parametric delay distributions, time-varying delays through covariates such as day-of-week effects, hierarchical pooling across regions or age groups, joint estimation of the reproduction number, and incorporation of leading indicators (Stoner, Halliday, and Economou 2023; Seaman et al. 2022; Bergström et al. 2022; Sam Abbott et al. 2025).

Tip: Best when

You need mechanistic structure, constrained delay distributions, pooling across strata, joint Rt estimation, or integration of auxiliary data sources.

Warning: Watch out for

More complex to specify and fit. Conditional variants can be slow. Marginal variants may underestimate uncertainty when delays are highly variable.
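The difference between the conditional and marginal variants can be made concrete with a schematic formulation. This is a sketch under assumed notation rather than the model of any cited paper: $N_t$ is the final total for event time $t$, $n_{t,d}$ the count first reported at delay $d$, $\lambda_t$ the expected total, $p_{t,d}$ the reporting proportions, and the negative binomial and multinomial choices are illustrative.

```latex
\begin{aligned}
\text{Conditional:}\quad
  & N_t \sim \operatorname{NegBin}(\lambda_t, \phi), \\
  & (n_{t,0}, \dots, n_{t,D}) \mid N_t \sim
    \operatorname{Multinomial}\bigl(N_t, (p_{t,0}, \dots, p_{t,D})\bigr), \\[4pt]
\text{Marginal:}\quad
  & n_{t,d} \sim \operatorname{NegBin}(\lambda_t \, p_{t,d}, \phi)
    \quad \text{independently for each cell } (t, d), \\[4pt]
\text{Both:}\quad
  & \sum_{d=0}^{D} p_{t,d} = 1 .
\end{aligned}
```

In both variants $\lambda_t$ can be given further structure, for example a renewal process, and $p_{t,d}$ can follow a parametric delay distribution.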

Because the nowcasting problem can be expressed as a regression model, any generalisation of regression can in principle be applied, including ARIMA models with covariates, gradient-boosted trees, neural networks, and other machine learning approaches. These methods can be easier to set up and may learn flexible structure from the data without requiring the analyst to specify it. The downside is that they typically lack the constraints and mechanistic components that purpose-built nowcasting models provide, and they carry the common limitations of black-box approaches.

Tip: Best when

Quick prototyping, flexible pattern learning from large datasets, or when existing ML infrastructure is already in place.

Warning: Watch out for

Predictions are difficult to interpret, failure modes are hard to diagnose, and behaviour in novel situations is unpredictable.

Implementation

Topic | What it covers
Software | Available packages, what they support, and how they differ
Starting simple | Look at your data first, fit a baseline, then build up
Model specification | Maximum delay, training window, and stratification
Practical considerations | Fitting times, installation constraints, and checking outputs

Note: At a glance

Multiple packages exist for each method class, differing in flexibility, inference method, and installation requirements. Use the table and descriptions below to identify which packages support the features you need.

The table below compares available software packages across key features. The Method column links to the corresponding approach description above.

For multiplicative methods, baselinenowcast (Johnson et al. 2025) provides a straightforward implementation with options for Poisson or negative binomial count models, separate day-of-week adjustment, and a correction for zero counts that ad hoc baseline methods cannot handle. ChainLadder implements the classical actuarial chain ladder and its variants. EpiNow2 (Abbott et al. 2020) implements a truncation model that can be used as input to its more flexible epidemic model, which can also accept nowcasts from any method expressed as a delay distribution.

For regression approaches, nowcaster (Bastos et al. 2019) uses smooth functions of event time and delay, with user-specified count distributions, support for day-of-week effects, and stratification by age or geography. A benefit of the regression framework is that it builds on widely used statistical software. Kassteele, Eilers, and Wallinga (2019) provide accompanying scripts for their constrained P-spline approach, and the UK Health Security Agency (UKHSA) nowcasting pipelines are available as public code repositories (Overton et al. 2023; Mellor et al. 2025; Tang et al. 2025).

For generative models, NobBS (McGough et al. 2020) implements a simple marginal generative model with a random walk expectation model. EpiLPS (Sumalinab et al. 2024) uses a marginal generative approach with smooth functions of event time and delay and user-specified count distributions. surveillance (Meyer, Held, and Höhle 2017) implements the conditional approach of Höhle and Heiden (2014) with a log-linear expectation model and support for additional data sources. epinowcast (Sam Abbott et al. 2025) implements a marginal generative approach with a flexible expectation model, multiple count distributions, parametric delays, hierarchical pooling across strata, joint effective reproduction number estimation, missing reference date imputation, user-defined reporting schedules, missing data handling, forecasting, and support for additional data sources.

Table 1

Tip: Key message

Visualise your data first, start with a simple baseline adapted to your problem, and add complexity only when you can see that it is needed.

Before choosing a method, visualise the reporting triangle to understand the structure of your data: how long delays typically are, whether they change over time, and whether there are systematic day-of-week patterns. This informs which features a model actually needs. A good strategy is to start simple and build up complexity only when you can see that it is needed. Having a baseline nowcast adapted to your specific problem provides a reference point for judging whether additional model features add value (Johnson et al. 2025); a multiplicative method is a sensible choice for this. From there, you can assess whether the residual errors point to a specific deficiency and choose a more flexible method that addresses it. Using a framework that allows the model to be built up step by step, such as epinowcast (Sam Abbott et al. 2025), makes this incremental approach easier to manage.

Note: Key decisions

Maximum delay, data requirements, training window length, and stratification strategy.

Regardless of which method you choose, several specification decisions will likely need to be made.

  • Maximum delay. This should be set at the point that captures the vast majority of the target reporting, which you can estimate from historical data. Unless computational performance is a priority, there is generally little reason to vary this as part of the model development.
  • Data requirements. More complex models need more data to fit their additional flexibility. Most methods need at least as many snapshots of data as the target delay length, though some Bayesian methods can work with less data by relying on prior models or parametric delay distributions.
  • Training window. Most methods fit to a window of recent data rather than the full history. A shorter window lets the model adapt quickly to changes in reporting behaviour but provides fewer data points for estimating delay proportions, which matters when counts are low. The appropriate window length depends on data frequency and the stability of reporting patterns (Johnson et al. 2025; Wolffram et al. 2023). Some methods support time-varying delays directly (see the Software tab), which makes optimising the training window less important because the model itself can adapt. In settings where the delay distribution is expected to change, these methods are likely to perform better than those that assume a fixed delay within each window.
  • Stratification. When nowcasts are needed by age group, geography, or another variable, you can either fit separate models to each stratum or fit a single model that pools information across strata. Separate models are simpler but can be unstable when individual strata have low counts. Pooled or hierarchical models borrow strength across strata to stabilise estimates, but require software that supports this feature (see the Software tab) and take longer to fit (Stoner, Halliday, and Economou 2023). A middle ground is to assume that the delay distribution is shared across strata while allowing the epidemic curve to vary (Seaman et al. 2022), which reduces the number of parameters without forcing strata to have identical trajectories.
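One simple way to estimate a maximum delay from historical data, as suggested in the first bullet above, is to take the smallest delay by which a chosen share of reports has arrived. The sketch below assumes the counts come from reference dates old enough to be fully reported; the function name and the 95% threshold are illustrative choices, not recommendations from the cited papers.

```python
def choose_max_delay(delay_counts, coverage=0.95):
    """Return the smallest delay whose cumulative share reaches `coverage`.

    delay_counts: historical report counts indexed by delay (position 0 is
    delay 0), taken from reference dates assumed to be fully reported.
    """
    total = sum(delay_counts)
    if total <= 0:
        raise ValueError("delay_counts must contain at least one report")
    running = 0
    for delay, count in enumerate(delay_counts):
        running += count
        if running / total >= coverage:
            return delay
    # Only reached if coverage > 1; fall back to the longest delay seen.
    return len(delay_counts) - 1


# Hypothetical historical counts of reports by delay in days.
historical = [500, 300, 120, 50, 20, 7, 2, 1]
max_delay = choose_max_delay(historical, coverage=0.95)
```

With these hypothetical counts, 92% of reports arrive within two days and 97% within three, so a 95% threshold selects a maximum delay of three days.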
Warning: Watch out for

Software installation constraints in public health settings can limit which packages are available. Check that your computing environment supports the inference backend a package requires before committing to it.

  • Computation. Multiplicative methods typically run in seconds. Regression and generative models fitted with Bayesian inference take longer, and fitting time depends heavily on the inference method. Full MCMC sampling via Stan (Stan Development Team 2021) is the most flexible but slowest option, with run times ranging from minutes to days depending on model complexity, the number of strata, and the length of the time series (Stoner, Halliday, and Economou 2023; Sam Abbott et al. 2025). Approximate inference methods can be substantially faster: nowcaster uses INLA (Rue, Martino, and Chopin 2009), EpiLPS uses Laplacian P-splines (Sumalinab et al. 2024), and epinowcast supports variational inference via Pathfinder (Zhang et al. 2022) as an alternative to MCMC. Plan for this when deciding how to integrate nowcasting into an operational workflow, for instance by scheduling model runs overnight.
  • Software availability. Public health departments often have limited ability to install software, with access restricted to packages available as pre-compiled binaries from CRAN. This rules out tools that depend on cmdstanr (Gabry and Češnovar 2021), which requires a local C++ toolchain: for example, epinowcast is cmdstanr-based and cannot be installed in such environments, whereas EpiNow2 uses rstan and is available from CRAN. nowcaster depends on INLA (Rue, Martino, and Chopin 2009), which may face similar installation barriers depending on the computing environment.
  • Monitoring. Once a pipeline is running, inspect the nowcast visually against the raw data after each run to confirm that estimates are plausible and uncertainty intervals are reasonable. Watch for estimates that swing sharply between runs, which can indicate changes in reporting practice, data quality events, or a model that is poorly suited to the current data. Compare the nowcast against what was eventually observed for recent past dates; persistent over- or under-prediction suggests the model needs recalibration (Wolffram et al. 2023).
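The retrospective comparison described under Monitoring can be sketched as a small summary of past nowcasts against the counts eventually observed. The record layout (`median`, `lower`, `upper`, `final`) and the function name are hypothetical; a production pipeline would likely use proper scoring rules as well.

```python
def evaluate_nowcasts(records):
    """Summarise past nowcasts against eventually observed counts.

    records: list of dicts with keys 'median', 'lower', 'upper' (the
    nowcast and its interval) and 'final' (the count eventually observed).
    Returns the mean error (negative means under-prediction) and the
    share of final counts that fell inside the interval.
    """
    errors = [r["median"] - r["final"] for r in records]
    covered = [r["lower"] <= r["final"] <= r["upper"] for r in records]
    return {
        "mean_error": sum(errors) / len(errors),
        "coverage": sum(covered) / len(covered),
    }


# Hypothetical nowcasts for four past reference dates.
past = [
    {"median": 90, "lower": 80, "upper": 110, "final": 100},
    {"median": 95, "lower": 85, "upper": 105, "final": 110},
    {"median": 120, "lower": 100, "upper": 140, "final": 125},
    {"median": 70, "lower": 60, "upper": 85, "final": 80},
]
summary = evaluate_nowcasts(past)
```

A mean error that stays well away from zero across runs, or coverage far below the nominal interval level, is the kind of persistent pattern that would prompt recalibration.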

References

Abbott, Sam, Joel Hellewell, Katharine Sherratt, Katelyn Gostic, Joe Hickson, Hamada S. Badr, Michael DeWitt, Robin Thompson, EpiForecasts, and Sebastian Funk. 2020. EpiNow2: Estimate Real-Time Case Counts and Time-Varying Epidemiological Parameters. https://doi.org/10.5281/zenodo.3957489.
Bastos, Leonardo S, Theodoros Economou, Marcelo F C Gomes, Daniel A M Villela, Flavio C Coelho, Oswaldo G Cruz, Oliver Stoner, Trevor Bailey, and Claudia T Codeço. 2019. “A Modelling Approach for Correcting Reporting Delays in Disease Surveillance Data.” Statistics in Medicine 38 (22): 4363–77. https://doi.org/10.1002/sim.8303.
Bergström, Fanny, Felix Günther, Michael Höhle, and Tom Britton. 2022. “Bayesian Nowcasting with Leading Indicators Applied to COVID-19 Fatalities in Sweden.” PLOS Computational Biology 18 (12): e1010767. https://doi.org/10.1371/journal.pcbi.1010767.
Gabry, Jonah, and Rok Češnovar. 2021. Cmdstanr: R Interface to 'CmdStan'.
Günther, Felix, Andreas Bender, Katharina Katz, Helmut Küchenhoff, and Michael Höhle. 2021. “Nowcasting the COVID-19 Pandemic in Bavaria.” Biometrical Journal 63 (3): 490–502. https://doi.org/10.1002/bimj.202000112.
Höhle, Michael, and Matthias an der Heiden. 2014. “Bayesian Nowcasting During the STEC O104:H4 Outbreak in Germany, 2011.” Biometrics 70 (4): 993–1002. https://doi.org/10.1111/biom.12194.
Johnson, Kaitlyn E, Maria L Tang, Emily Tyszka, Laura Jones, Barbora Nemcova, Daniel Wolffram, Rosa Ergas, et al. 2025. “Baseline Nowcasting Methods for Handling Delays in Epidemiological Data.” Wellcome Open Research 10: 614. https://wellcomeopenresearch.org/articles/10-614.
Kassteele, Jan van de, Paul H C Eilers, and Jacco Wallinga. 2019. “Nowcasting the Number of New Symptomatic Cases During Infectious Disease Outbreaks Using Constrained P-Spline Smoothing.” Epidemiology 30 (5): 737–45. https://doi.org/10.1097/EDE.0000000000001050.
Lison, Adrian, Sam Abbott, Jana Huisman, and Tanja Stadler. 2024. “Generative Bayesian Modeling to Nowcast the Effective Reproduction Number from Line List Data with Missing Symptom Onset Dates.” Edited by Tom Britton. PLOS Computational Biology 20 (4): e1012021. https://doi.org/10.1371/journal.pcbi.1012021.
McGough, Sarah F, Michael A Johansson, Marc Lipsitch, and Nicolas A Menzies. 2020. “Nowcasting by Bayesian Smoothing: A Flexible, Generalizable Model for Real-Time Epidemic Tracking.” PLOS Computational Biology 16 (4): e1007735. https://doi.org/10.1371/journal.pcbi.1007735.
Mellor, Jonathon, Maria L Tang, Emilie Finch, Rachel Christie, Oliver Polhill, Christopher E Overton, Ann Hoban, Amy Douglas, Sarah R Deeny, and Thomas Ward. 2025. “An Application of Nowcasting Methods: Cases of Norovirus During the Winter 2023/2024 in England.” PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1012849.
Meyer, Sebastian, Leonhard Held, and Michael Höhle. 2017. “Spatio-Temporal Analysis of Epidemic Phenomena Using the R Package surveillance.” Journal of Statistical Software 77 (11): 1–55. https://doi.org/10.18637/jss.v077.i11.
Overton, Christopher E, Sam Abbott, Rachel Christie, Fergus Cumming, Julie Day, Owen Jones, Rob Sherlock Paton, Charlie Turner, and Thomas Ward. 2023. “Nowcasting the 2022 Mpox Outbreak in England.” PLOS Computational Biology 19 (9): e1011463. https://doi.org/10.1371/journal.pcbi.1011463.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92. https://doi.org/10.1111/j.1467-9868.2008.00700.x.
Sam Abbott, Adrian Lison, Sebastian Funk, Carl Pearson, Hugo Gruson, Felix Guenther, Michael DeWitt, James Mba Azam, and Jessalyn Sebastian. 2025. Epinowcast: A Bayesian Framework for Real-Time Infectious Disease Surveillance. https://doi.org/10.5281/zenodo.5637165.
Schneble, Marc, Giacomo De Nicola, Göran Kauermann, and Ursula Berger. 2021. “Nowcasting Fatal COVID-19 Infections on a Regional Level in Germany.” Biometrical Journal 63 (3): 471–89. https://doi.org/10.1002/bimj.202000143.
Seaman, Shaun R, Pantelis Samartsidis, Meaghan Kall, and Daniela De Angelis. 2022. “Nowcasting COVID-19 Deaths in England by Age and Region.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 71 (5): 1266–81. https://doi.org/10.1111/rssc.12576.
Stan Development Team. 2021. Stan Modeling Language Users Guide and Reference Manual, 2.28.1.
Stoner, Oliver, and Theo Economou. 2020. “Multivariate Hierarchical Frameworks for Modeling Delayed Reporting in Count Data.” Biometrics 76 (3): 789–98. https://doi.org/10.1111/biom.13188.
Stoner, Oliver, Allison Halliday, and Theo Economou. 2023. “Correcting Delayed Reporting of COVID-19 Using the Generalized-Dirichlet-Multinomial Method.” Biometrics 79 (3): 2537–50. https://doi.org/10.1111/biom.13810.
Sumalinab, Bryan, Oswaldo Gressani, Niel Hens, and Christel Faes. 2024. “Bayesian Nowcasting with Laplacian-P-Splines.” Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2024.2395414.
Tang, Maria L, Ian S McFarlane, Christopher E Overton, Erjola Hani, Vanessa Saliba, Gareth J Hughes, Paul Crook, Thomas Ward, and Jonathon Mellor. 2025. “Nowcasting Cases and Trends During the Measles 2023/24 Outbreak in England.” Journal of Infection. https://doi.org/10.1016/j.jinf.2025.106473.
Wolffram, Daniel, Sam Abbott, Matthias an der Heiden, Sebastian Funk, Felix Günther, Davide Haase, Stefan Heyder, et al. 2023. “Collaborative Nowcasting of COVID-19 Hospitalization Incidences in Germany.” PLOS Computational Biology 19 (8): e1011394. https://doi.org/10.1371/journal.pcbi.1011394.
Zhang, Lu, Bob Carpenter, Andrew Gelman, and Aki Vehtari. 2022. “Pathfinder: Parallel Quasi-Newton Variational Inference.” Journal of Machine Learning Research 23 (306): 1–49.