Machine Learning (ML) Dengue Forecasting

The predictive system developed for the platform is a robust Machine Learning (ML) framework designed to forecast the velocity of dengue transmission across diverse climatic and demographic settings. Beyond its direct predictive capabilities, this system serves as a foundational benchmark model for the platform. By providing a high-performance baseline, it allows for a rigorous comparison with more complex architectures, such as Agent-Based Models (ABM), ensuring that any increase in computational complexity translates into a measurable gain in predictive accuracy and biological insight.

1. Model Architecture: Sequential Ensemble Learning

The core of the forecasting engine is Histogram-based Gradient Boosting Regression (HGBR). Rather than relying on a single complex equation, the model uses an Ensemble Learning technique that sequentially constructs a “committee” of hundreds of decision trees.

The Concept of “Weak Learners”: As illustrated in the model diagram, the system does not use a single “super-brain.” Instead, it uses hundreds of Weak Learners (here, weak regressors, since the target is a continuous quantity).

A “Weak Learner” is a simple, shallow decision tree that, on its own, is only slightly better than a random guess.
However, in our architecture, each weak learner is trained on the errors left by the learners that came before it. By aggregating these hundreds of simple corrections, the system creates a Strong Predictor capable of capturing highly complex patterns that a single shallow tree would miss.

The Boosting Mechanism (How it works): The training process is iterative:

Weak Learner 1 makes an initial, rough prediction based on climate and history.
The system calculates the Residuals (the errors made by Learner 1).
Weak Learner 2 is trained specifically to predict and correct those errors.
This process repeats sequentially, with each new learner fitting the residuals of the ensemble built so far. The final forecast is the initial prediction plus the sum of all these small corrections.
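
The loop described above can be sketched in a few lines. This is a minimal illustration rather than the platform's implementation: it uses one-split "stumps" as weak learners on a single made-up feature, and the names fit_stump and boost, the learning rate, and the toy data are all assumptions introduced here.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split decision stump (a weak learner) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def boost(xs, ys, n_learners=50, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the ensemble built so far."""
    base = mean(ys)  # initial rough prediction
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_learners):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each learner adds a small, damped correction to the running forecast.
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy data (made up): weekly rainfall index vs. change in cases.
rain = [1, 2, 3, 4, 5, 6, 7, 8]
change = [0.0, 0.1, 0.1, 0.3, 0.8, 0.9, 1.5, 1.6]

model = boost(rain, change)
training_error = mean(abs(model(x) - y) for x, y in zip(rain, change))
baseline_error = mean(abs(mean(change) - y) for y in change)
```

Comparing training_error with baseline_error shows how much the accumulated corrections improve on the initial constant guess. A production model would use deeper trees, many features, and a held-out set rather than training error.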

Efficiency: By grouping continuous data into discrete bins, the model reduces the split search for each feature from one candidate threshold per distinct value to one per bin (O(b) for b bins), allowing for rapid retraining on nationwide datasets.
Robustness: The architecture natively handles missing data (NaNs), ensuring resilience against gaps in surveillance reporting without requiring artificial imputation.
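
A small sketch of the binning idea, under stated assumptions: quantile-style edges, four bins, and a dedicated extra bin for missing values. The helper names bin_edges and to_bins and the rainfall numbers are hypothetical and only illustrate the principle.

```python
import math
from bisect import bisect_right


def bin_edges(values, n_bins=4):
    """Quantile-style bin edges computed from the non-missing values."""
    clean = sorted(v for v in values if not math.isnan(v))
    return [clean[len(clean) * k // n_bins] for k in range(1, n_bins)]


def to_bins(values, edges):
    """Map each value to a bin index; missing values get their own bin."""
    missing_bin = len(edges) + 1
    return [missing_bin if math.isnan(v) else bisect_right(edges, v) for v in values]


# Toy weekly rainfall in mm (made-up numbers), with one reporting gap.
rainfall = [0.0, 2.5, 12.0, 48.0, float("nan"), 7.5, 30.0, 110.0, 5.0]
edges = bin_edges(rainfall, n_bins=4)
binned = to_bins(rainfall, edges)
```

After binning, a tree only needs to evaluate splits at the bin edges, and the missing-value bin can be routed to whichever side of a split reduces the error more, which is why no imputation step is required.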

2. Feature Engineering: Capturing the “Pulse” of the Outbreak

To forecast the situation in Bangkok and the provinces, raw data is transformed into “Bio-Climatic Features” that represent the biological reality of the vector (Aedes aegypti):

Auto-regressive Dynamics (Momentum): The model calculates the “velocity” and “acceleration” of cases over the last 4 to 8 weeks. This allows it to detect if an outbreak is gaining momentum (inertial growth) or slowing down.
Climate Lags (Biological Delays): Rainfall and temperature are processed using Rolling Window Averages with time lags (e.g., precipitation from 4 weeks ago) to account for the mosquito breeding cycle and virus incubation period.
Spatial Interdependence: By evaluating the outbreak status of neighboring districts (“Neighboring Pressure”), the model captures the spatial diffusion of the virus across provincial borders.
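
The three feature families above might be computed roughly as follows. The exact definitions are not given in this description, so the sketch assumes velocity is the relative week-over-week change, acceleration is the change in velocity, the climate lag is a 3-week mean ending 4 weeks back, and neighboring pressure is the mean case count in adjacent districts. All function names and data are illustrative.

```python
def velocity(cases, t):
    """Relative week-over-week change in cases at week t (assumed definition)."""
    previous = cases[t - 1]
    return (cases[t] - previous) / max(previous, 1)


def acceleration(cases, t):
    """Change in velocity between consecutive weeks."""
    return velocity(cases, t) - velocity(cases, t - 1)


def lagged_rolling_mean(series, t, lag=4, window=3):
    """Mean of `window` weeks ending `lag` weeks before week t."""
    end = t - lag + 1
    values = series[end - window:end]
    return sum(values) / len(values)


def neighboring_pressure(cases_by_district, neighbors, district, t):
    """Mean case count at week t across the districts adjacent to `district`."""
    adjacent = neighbors[district]
    return sum(cases_by_district[d][t] for d in adjacent) / len(adjacent)


# Toy data (made up): eight weeks of cases and rainfall for three districts.
cases_by_district = {
    "A": [10, 10, 12, 15, 20, 30, 45, 60],
    "B": [5, 5, 6, 6, 8, 12, 20, 35],
    "C": [2, 2, 2, 3, 3, 4, 4, 6],
}
rain_mm = [80, 120, 150, 90, 40, 30, 20, 10]
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

t = 7  # most recent week
features = {
    "velocity": velocity(cases_by_district["A"], t),
    "acceleration": acceleration(cases_by_district["A"], t),
    "rain_lag4_mean3": lagged_rolling_mean(rain_mm, t),
    "neighboring_pressure": neighboring_pressure(cases_by_district, neighbors, "A", t),
}
```

In this toy example district A is still growing (positive velocity) but more slowly than the week before (negative acceleration), which is the kind of distinction the momentum features are meant to expose.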

This feature engineering approach allows the model to go beyond static data points and instead interpret the epidemic system’s memory. By integrating auto-regressive dynamics (momentum), the algorithm can distinguish between typical seasonal fluctuations and the onset of an exponential growth phase. This is achieved by detecting changes in the ‘velocity’ of transmission before they result in a massive accumulation of cases. Such proactive detection provides a critical window of opportunity for public health authorities to mobilize resources and implement preventive measures before the healthcare system reaches a breaking point.

Furthermore, the integration of biological delays and spatial interdependence transforms the model from a purely statistical exercise into a biologically grounded framework. By synchronizing the Aedes aegypti life cycle with specific climatic anomalies, the system simulates future risk based on current breeding and incubation conditions. This is complemented by the ‘Neighboring Pressure’ analysis, which recognizes that provinces are not isolated entities but interconnected nodes in a complex mobility network. Capturing these cross-border dynamics ensures that the model accounts for the spatial diffusion of the virus, identifying when an outbreak in a neighboring district poses an imminent threat of importation.

3. Urban Focus: Modeling Bangkok

Forecasting for Bangkok requires handling extreme population density and mobility. The model addresses this through specific demographic integration:

Density-Weighted Transmission: Integrating Census data, the model distinguishes the slower transmission of less populated areas from the explosive transmission potential of high-density districts (e.g., Din Daeng).
Mobility Proxy: The “Neighboring Pressure” feature implicitly captures the high-frequency flow of the virus between the interconnected districts of the Greater Bangkok Metropolitan Area.

The integration of density-weighted transmission allows the model to move beyond uniform assumptions and instead capture the heterogeneous landscape of urban risk. By incorporating high-resolution census data, the framework adjusts its transmission coefficients to account for the heightened contact rates characteristic of densely populated districts like Din Daeng. This differentiation is vital for urban planning, as it enables the simulation of ‘explosive’ outbreaks in hyper-urbanized zones while maintaining realistic, slower growth projections for suburban or less crowded areas, ensuring that resource allocation is prioritized where the risk of rapid escalation is greatest.

Furthermore, the model’s use of ‘Neighboring Pressure’ as a mobility proxy provides a robust solution to the challenge of tracking high-frequency human movement within the Greater Bangkok Metropolitan Area. In a megacity where administrative boundaries are often bypassed by daily commuting patterns, this feature implicitly captures the invisible flow of the virus across districts. By treating the metropolitan area as a highly interconnected system rather than a collection of independent units, the model can predict the spatial ‘spillover’ effect, identifying how an increase in cases in one central hub will inevitably exert infectious pressure on the surrounding residential belts.

4. Strategic Optimization: Dynamic Velocity Weighting

A common flaw in time-series forecasting is “lag,” where the model reacts too late to a sudden spike. To prevent this, we implemented a Velocity-Based Training Strategy.

The Logic: Standard models treat a week with 0% change and a week with 50% growth as equally important. Our model uses a custom sample weight formula: W=1+(∣Velocity∣×2).
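The Effect: The weight grows linearly with the absolute velocity, so errors made during periods of rapid change cost more. With velocity expressed as a fraction, a stable week (0% change) keeps a weight of 1, while a week with 50% growth receives a weight of 2, doubling the penalty for errors made in that week.
Result: The algorithm is mathematically forced to pay more attention to instability (sudden spikes or drops) than to stability. This sharpens the model’s reflexes, allowing it to anticipate the onset of an outbreak earlier than a traditional baseline.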
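
As a worked example of the formula W=1+(∣Velocity∣×2), the snippet below computes weights for a few hypothetical weekly velocities expressed as fractions. The function name sample_weight and the example values are assumptions for illustration.

```python
def sample_weight(velocity, scale=2.0):
    """W = 1 + (|velocity| * 2), the weighting formula quoted above."""
    return 1.0 + abs(velocity) * scale


# Hypothetical weekly velocities: stable, mild growth, 50% growth, sharp drop.
velocities = [0.0, 0.1, 0.5, -0.8]
weights = [sample_weight(v) for v in velocities]
```

The stable week keeps a weight of 1, the 50% growth week gets a weight of 2, and the sharp drop is weighted most heavily because the formula uses the absolute value. In gradient boosting libraries that accept per-sample weights, such as the sample_weight argument of fit in scikit-learn, a list like this is passed alongside the training data.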

By prioritizing these high-velocity periods, the model overcomes the ‘inertial lag’ typical of traditional time-series forecasting. Instead of merely smoothing over historical trends, the strategy ensures that the system is hypersensitive to the early signals of a surge. Consequently, the platform provides public health officials with a more reactive and reliable early-warning tool, capable of identifying the precise moment an endemic situation shifts into an active outbreak.

5. Output and Performance

The model outputs the Predicted Velocity of Change, which is mathematically transformed back into “Expected Total Cases.”
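
The exact back-transformation is not specified here. A minimal sketch, assuming velocity is the relative week-over-week change in cases, would be the following; the function name expected_cases and the numbers are illustrative.

```python
def expected_cases(current_cases, predicted_velocity):
    """Convert a predicted relative change into an expected case count."""
    # Clamp at zero so a large predicted decline never yields negative cases.
    return max(0.0, current_cases * (1.0 + predicted_velocity))


# 40 cases this week and a predicted velocity of +25%.
forecast = expected_cases(current_cases=40, predicted_velocity=0.25)
```

If velocity were instead defined on a log scale or as an absolute difference, the inverse transform would change accordingly.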

Validation against Baseline: To prove real-world utility, the model was rigorously tested against a “Persistence Baseline” (a reference model assuming “next week will be the same as this week”).

The Challenge: Baselines perform well during stable periods but fail systematically during outbreak onsets.
Performance: On the holdout validation set, the ML model achieved a ~11% global error reduction and a ~24.9% error reduction specifically during outbreaks, supporting its value as an Early Warning System.
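
The comparison against a persistence baseline can be sketched as below. The error metric and the outbreak criterion are not stated in this description, so the sketch assumes mean absolute error and defines an outbreak week as growth above 50% week over week. The data and model predictions are made up and do not reproduce the reported figures.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def error_reduction(actual, model_pred, baseline_pred):
    """Fractional reduction in MAE relative to the baseline (positive = better)."""
    baseline_mae = mean_absolute_error(actual, baseline_pred)
    model_mae = mean_absolute_error(actual, model_pred)
    return 1.0 - model_mae / baseline_mae


# Toy series (made up). Persistence predicts week t+1 as the value at week t.
cases = [10, 10, 11, 10, 20, 40, 70, 60]
actual = cases[1:]
persistence = cases[:-1]
model_pred = [10, 10, 11, 16, 34, 62, 66]  # hypothetical model output

# Flag outbreak weeks, here assumed to mean growth above 50% week over week.
outbreak = [a > 1.5 * p for a, p in zip(actual, persistence)]

overall = error_reduction(actual, model_pred, persistence)
during_outbreaks = error_reduction(
    [a for a, o in zip(actual, outbreak) if o],
    [m for m, o in zip(model_pred, outbreak) if o],
    [p for p, o in zip(persistence, outbreak) if o],
)
```

Reporting the reduction separately for outbreak weeks matters because persistence is hard to beat when cases are flat, so a global figure alone can understate the gain during the weeks that matter most.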

Why this approach?

The dengue forecasting engine is designed to bridge the gap between complex epidemiological modeling and real-time public health action. By leveraging a high-performance Gradient Boosting framework, the system processes vast amounts of bioclimatic and demographic data to identify hidden transmission patterns. Unlike traditional models that focus on long-term averages, our approach is built for responsiveness and operational utility, characterized by four core technical pillars:

Adaptability: Retrains automatically to learn from shifting climate patterns (e.g., El Niño or changes in monsoon intensity).
Granularity: Provides district-level precision for targeted resource allocation.
Speed: Histogram-based architecture enables real-time weekly updates.
Safety First: The weighted loss function ensures the model prioritizes high-velocity events (crises) over stable trends.