SARIMA

This project focuses on predicting dengue outbreaks at the administrative subdistrict level (known as khwaengs) in Bangkok. At such a fine spatial scale, case counts are subject to high stochasticity. Therefore, additional processing and modeling steps are required to extract meaningful signals from noisy data.

To address these challenges, we use the SARIMA (Seasonal Autoregressive Integrated Moving Average) model as the statistical backbone. SARIMA is well suited to this high-resolution task because it explicitly accounts for the non-stationarity and strong seasonality inherent in dengue transmission. By expressing each observation in terms of autoregressive terms (past cases), moving-average terms (past random shocks), and their seasonal counterparts, the model separates the systematic structure of the series from the ‘noise’ of stochastic fluctuations at the khwaeng level.

Furthermore, the SARIMA framework allows us to capture the cyclical patterns driven by monsoon seasons and local climate variations that dictate mosquito population dynamics. By fitting specific seasonal orders $(P, D, Q)_s$, the model learns the historical rhythm of outbreaks in each subdistrict, providing a stable baseline. This helps ensure that the ‘velocity’ signals detected by the model reflect true epidemiological shifts rather than artifacts of reporting lags or small-sample variability.

Subdistrict Selection and Clustering

We first identify the 55 khwaengs (out of 180 total) that account for 67% of all reported dengue cases in Bangkok. Focusing on these higher-incidence areas helps stabilize the signal and reduces noise.
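
The selection step can be sketched as a cumulative-share cut-off. This is a minimal illustration that assumes khwaengs are ranked by total reported cases; the function name, the input dictionary, and the toy counts are hypothetical and not taken from the project.

```python
def select_high_burden(case_totals, target_share=0.67):
    """Pick top-ranked units until their cumulative share of cases reaches target_share.

    case_totals: dict mapping unit name -> total reported cases (hypothetical input).
    Returns (selected names, share of all cases they cover).
    """
    total = sum(case_totals.values())
    if total <= 0:
        raise ValueError("case_totals must contain at least one positive count")

    # Rank units from highest to lowest burden.
    ranked = sorted(case_totals.items(), key=lambda kv: kv[1], reverse=True)

    selected, running = [], 0
    for name, count in ranked:
        if running / total >= target_share:
            break  # target coverage already reached
        selected.append(name)
        running += count
    return selected, running / total


# Toy example with made-up counts.
names, share = select_high_burden({"A": 50, "B": 30, "C": 15, "D": 5})
print(names, round(share, 2))
```

With real data, the same cut-off would be applied to the 180 khwaeng totals; the threshold of 0.67 mirrors the share quoted above.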

Next, we apply a mixed clustering approach based on inter-khwaeng correlation statistics, assigning these 55 khwaengs to 3 main clusters. By design, khwaengs within the same cluster are more strongly correlated with one another than with khwaengs in other clusters, strongly enough that we treat the epidemic pattern as consistent within each cluster.
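
The text does not spell out the mixed clustering procedure, so the sketch below swaps in one common stand-in: average-linkage agglomerative clustering on the distance 1 - r, where r is the Pearson correlation between two case series. The function names and the toy series are assumptions made for illustration only.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_clusters(series, n_clusters=3):
    """Average-linkage agglomerative clustering with distance 1 - correlation."""
    names = list(series)
    dist = {(a, b): 1.0 - pearson(series[a], series[b]) for a in names for b in names}
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


# Toy series: two rising, two falling, two oscillating.
toy = {
    "A1": [1, 2, 3, 4, 5, 6], "A2": [2, 4, 6, 8, 10, 12.5],
    "B1": [6, 5, 4, 3, 2, 1], "B2": [12, 10, 8, 6, 4, 2.5],
    "C1": [1, 5, 1, 5, 1, 5], "C2": [2, 9, 2, 10, 2, 9],
}
print(correlation_clusters(toy, n_clusters=3))
```

The exhaustive pair search is quadratic per merge, which is acceptable for 55 series but would need a library implementation at larger scale.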

Technically, the model is specified as $\mathrm{SARIMA}(p, d, q) \times (P, D, Q)_s$, where the seasonal period $s$ is set to 52 weeks to align with the annual dengue cycle. The integration terms ($d$, $D$) are critical in this high-resolution context: $d$ rounds of ordinary differencing and $D$ rounds of seasonal differencing (at lag $s$) stabilize the mean and remove the non-stationary stochastic trends common in subdistrict-level reporting. Combined with suitable autoregressive and moving-average orders, this should leave residuals that are approximately white noise, allowing for more reliable detection of true epidemic signals.
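
For reference, the specification above can be written out in standard backshift-operator form. The symbols $y_t$ (the weekly case series, possibly transformed), $\varepsilon_t$ (the white-noise innovation), and $B$ (the backshift operator, $B y_t = y_{t-1}$) are notation introduced here for illustration, following the usual Box-Jenkins sign convention:

$$
\phi(B)\,\Phi(B^{s})\,(1 - B)^{d}\,(1 - B^{s})^{D}\, y_t \;=\; \theta(B)\,\Theta(B^{s})\,\varepsilon_t
$$

$$
\phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^{i}, \quad
\theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^{j}, \quad
\Phi(B^{s}) = 1 - \sum_{i=1}^{P} \Phi_i B^{is}, \quad
\Theta(B^{s}) = 1 + \sum_{j=1}^{Q} \Theta_j B^{js}
$$

With $s = 52$, a single seasonal difference ($D = 1$) is simply the year-on-year change, $(1 - B^{52})\, y_t = y_t - y_{t-52}$, and a single ordinary difference ($d = 1$) is the week-on-week change, $(1 - B)\, y_t = y_t - y_{t-1}$.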

The selection of autoregressive (p, P) and moving average (q, Q) orders is driven by the minimization of the Akaike Information Criterion (AIC) and the analysis of Autocorrelation Functions (ACF). This rigorous optimization allows the model to differentiate between ‘momentum’ (the autoregressive component of the disease spread) and ‘shocks’ (short-term spikes due to local environmental anomalies or reporting artifacts). By capturing these dependencies, SARIMA provides a robust counterfactual against which we can measure the ‘velocity’ of an emerging outbreak.

Furthermore, to handle the heteroscedasticity often found in count data at the khwaeng level, the model can be coupled with a Box-Cox transformation. Because Box-Cox is defined only for strictly positive values, weeks with zero cases need a small positive offset before the transformation is applied. The transformation stabilizes the variance across different transmission intensities, so that the model’s sensitivity remains comparable whether a subdistrict is in a low-endemic phase or a high-transmission peak. This turns raw surveillance data into a calibrated baseline, essential for validating the performance of more complex Machine Learning or Agent-Based architectures.
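
As an illustration, the Box-Cox transform and its inverse can be written in a few lines. The offset of 0.5 added to every count (so that zero-case weeks stay in the transform's domain) and the function names are assumptions made for this sketch; the transformation parameter lambda would normally be estimated from the data.

```python
import math


def boxcox(values, lam, offset=0.5):
    """Box-Cox transform of (value + offset); lam == 0 gives the log transform."""
    out = []
    for y in values:
        z = y + offset
        if z <= 0:
            raise ValueError("Box-Cox needs strictly positive inputs after the offset")
        out.append(math.log(z) if lam == 0 else (z ** lam - 1.0) / lam)
    return out


def inverse_boxcox(values, lam, offset=0.5):
    """Map transformed values back to the original count scale."""
    out = []
    for w in values:
        z = math.exp(w) if lam == 0 else (lam * w + 1.0) ** (1.0 / lam)
        out.append(z - offset)
    return out


counts = [0, 1, 5, 40]                 # made-up weekly counts
transformed = boxcox(counts, lam=0.5)
print(transformed)
print(inverse_boxcox(transformed, lam=0.5))
```

Forecasts produced on the transformed scale are passed through the inverse before being reported as case counts.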

Modeling Seasonality & Peak Intensity

Within each identified cluster, we perform a time-series renormalization by scaling the data relative to its annual mean or peak. This crucial step allows us to isolate the seasonal signatures—the ‘shape’ of the outbreak—from its absolute magnitude. By decoupling these factors, we can identify clusters where the dengue season consistently starts early (e.g., coastal or high-density areas) versus those with a delayed peak, regardless of whether a specific year was a ‘high’ or ‘low’ case year.
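
The renormalization can be illustrated with a small helper that divides each year's series by its own annual mean or peak. The function names and the toy numbers are hypothetical; the point is only that two years with very different magnitudes collapse onto the same seasonal shape.

```python
def normalize_by_year(counts_by_year, mode="mean"):
    """Scale each year's counts by that year's mean (mode='mean') or peak (mode='peak')."""
    if mode not in ("mean", "peak"):
        raise ValueError("mode must be 'mean' or 'peak'")
    normalized = {}
    for year, counts in counts_by_year.items():
        scale = max(counts) if mode == "peak" else sum(counts) / len(counts)
        if scale == 0:
            continue  # a year with no cases carries no shape information
        normalized[year] = [c / scale for c in counts]
    return normalized


def peak_position(profile):
    """Index of the time step at which the profile reaches its maximum."""
    return max(range(len(profile)), key=lambda i: profile[i])


# Toy data: a 'low' year and a 'high' year with the same seasonal shape.
toy = {2019: [1, 2, 4, 2, 1], 2020: [10, 20, 40, 20, 10]}
shapes = normalize_by_year(toy, mode="peak")
print(shapes)
print({year: peak_position(p) for year, p in shapes.items()})
```

Comparing `peak_position` across clusters is one simple way to see which clusters tend to peak earlier in the season.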

To forecast the upcoming year’s dynamics, we apply a SARIMA model to the aggregated counts of each cluster. SARIMA is highly effective at capturing intra-annual dynamics, such as the regular rise and fall of cases linked to the monsoon cycle. However, SARIMA is inherently ‘mean-reverting’ and lacks the mechanism to predict interannual variability—the massive differences in peak intensity caused by multi-year climate drivers like the El Niño Southern Oscillation (ENSO) or shifting immunity levels.

To overcome this limitation and recover the lost variability, we integrate a probabilistic layer. We approximate the distribution of historical peak sizes using a Poisson distribution or, when the peaks are overdispersed (variance exceeding the mean, as is typical of epidemic data), a Negative Binomial distribution. We then execute Monte Carlo simulations, running thousands of potential scenarios for the next 12 months. Each simulation samples from the historical peak distribution while following the seasonal ‘envelope’ defined by the SARIMA model. This ensemble approach provides not just a single forecast, but a range of probabilistic outcomes, allowing us to quantify the likelihood of a ‘high-intensity’ year versus a ‘standard’ year.
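
A minimal sketch of this probabilistic layer is given below, under several assumptions: the peak distribution is fitted by the method of moments, Negative Binomial draws are generated as a Gamma-Poisson mixture, and each scenario is obtained by rescaling the seasonal envelope so that its maximum equals the sampled peak. All names and the toy inputs are hypothetical.

```python
import math
import random
import statistics


def fit_peak_distribution(peaks):
    """Method-of-moments fit. Returns (mean, r); r is None when Poisson suffices."""
    mean = statistics.fmean(peaks)
    var = statistics.variance(peaks)
    if var <= mean:                      # no overdispersion detected
        return mean, None
    return mean, mean * mean / (var - mean)   # Negative Binomial dispersion r


def sample_poisson(lam, rng):
    """Knuth's method, applied in chunks so exp(-lam) never underflows."""
    total = 0
    while lam > 0:
        chunk = min(lam, 500.0)
        limit = math.exp(-chunk)
        k, prod = 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        total += k
        lam -= chunk
    return total


def simulate_scenarios(envelope, peaks, n_sims=1000, seed=0):
    """Draw n_sims peak sizes and rescale the seasonal envelope to each of them."""
    rng = random.Random(seed)
    mean, r = fit_peak_distribution(peaks)
    top = max(envelope)
    shape = [x / top for x in envelope]          # envelope scaled to a peak of 1
    scenarios = []
    for _ in range(n_sims):
        # Gamma-Poisson mixture yields a Negative Binomial draw with the fitted mean.
        lam = rng.gammavariate(r, mean / r) if r is not None else mean
        peak = sample_poisson(lam, rng)
        scenarios.append([peak * s for s in shape])
    return scenarios


# Toy inputs: a five-step envelope and five made-up historical peak sizes.
sims = simulate_scenarios([1.0, 2.0, 4.0, 2.0, 1.0], [100, 300, 150, 500, 220], n_sims=1000)
peaks = sorted(max(s) for s in sims)
print("median peak:", peaks[len(peaks) // 2], "95th percentile:", peaks[int(0.95 * len(peaks))])
```

The share of simulated peaks above a chosen threshold then serves as an estimate of the probability of a ‘high-intensity’ year.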

Disaggregating to Subdistricts

The predicted cluster-level totals are downscaled to individual khwaengs (subdistricts) using a proportional allocation framework. This process is governed by a Gaussian distribution fitted to each subdistrict’s historical contribution to its cluster’s total burden. By calculating the mean relative intensity and its associated variance for each locality, the model accounts for the fact that certain subdistricts consistently act as ‘hotspots’ while others maintain a lower baseline. This stochastic weighting ensures that the disaggregated forecasts reflect the unique historical footprint of each khwaeng while remaining constrained by the broader regional cluster trend.
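
The allocation step can be sketched as follows. The sketch assumes one share observation per khwaeng per historical year, clips negative Gaussian draws to zero, and renormalizes the sampled shares so that the khwaeng forecasts sum exactly to the cluster total; these choices, the function names, and the toy history are illustrative assumptions rather than the project's documented procedure.

```python
import random
import statistics


def fit_share_distributions(history):
    """history: list of {khwaeng: cases} dicts, one per year, for a single cluster.

    Returns {khwaeng: (mean share, standard deviation of share)}.
    """
    shares = {k: [] for k in history[0]}
    for year in history:
        total = sum(year.values())
        if total == 0:
            continue  # a year without cases gives no information on shares
        for k in shares:
            shares[k].append(year[k] / total)
    return {k: (statistics.fmean(v), statistics.stdev(v)) for k, v in shares.items()}


def disaggregate(cluster_total, share_params, rng):
    """Sample one Gaussian share per khwaeng and split cluster_total accordingly."""
    draws = {k: max(rng.gauss(mu, sd), 0.0) for k, (mu, sd) in share_params.items()}
    norm = sum(draws.values())
    if norm == 0:  # extremely unlikely; fall back to the mean shares
        draws = {k: mu for k, (mu, _) in share_params.items()}
        norm = sum(draws.values())
    return {k: cluster_total * w / norm for k, w in draws.items()}


# Toy history for a three-khwaeng cluster (made-up counts).
history = [
    {"A": 60, "B": 30, "C": 10},
    {"A": 50, "B": 35, "C": 15},
    {"A": 55, "B": 30, "C": 15},
]
params = fit_share_distributions(history)
print(disaggregate(1000, params, random.Random(0)))
```

Repeating `disaggregate` for every Monte Carlo scenario propagates both the cluster-level and the allocation uncertainty to the khwaeng forecasts.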

For the remaining 125 khwaengs—characterized by sparse data and high stochasticity—a different analytical strategy is required. In these locations, traditional time-series models often fail due to the high frequency of zero-case weeks. To address this, we implement independent Poisson models for each month, effectively treating the incidence as a series of discrete counts driven by historical seasonal averages.
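
One way to read this step is as a separate Poisson rate for each calendar month, estimated as the historical mean count for that month (the maximum-likelihood estimate of a Poisson rate). The sketch below follows that reading, which is an assumption, and adds a quantile helper for an upper forecast bound; the names and the toy records are hypothetical.

```python
import math


def fit_monthly_rates(records):
    """records: iterable of (month 1..12, case count). Returns {month: mean count}."""
    sums, counts = {}, {}
    for month, cases in records:
        sums[month] = sums.get(month, 0) + cases
        counts[month] = counts.get(month, 0) + 1
    return {m: sums[m] / counts[m] for m in sums}


def poisson_quantile(q, lam):
    """Smallest k such that P(X <= k) >= q for X ~ Poisson(lam), with 0 < q < 1."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    k, term = 0, math.exp(-lam)
    cdf = term
    while cdf < q:
        k += 1
        term *= lam / k      # P(X = k) from P(X = k - 1)
        cdf += term
    return k


# Toy history for one sparse khwaeng: (month, cases) pairs from several years.
records = [(1, 0), (1, 2), (1, 1), (6, 4), (6, 6)]
rates = fit_monthly_rates(records)
for month, lam in sorted(rates.items()):
    print(month, "expected:", lam, "97.5% upper bound:", poisson_quantile(0.975, lam))
```

Because each month's forecast depends only on that month's historical mean, a single unusual week cannot produce an erratic spike in later forecasts.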

This dual-pathway approach ensures that the model remains robust across different scales: the high-volume areas benefit from the predictive power of cluster-level dynamics, while the sparse-data areas are modeled using probability-based baselines that prevent erratic spikes in the forecast. By combining these methods, the platform maintains 100% spatial coverage across Bangkok without sacrificing the mathematical integrity of the subdistrict-level outputs.