Effective time series forecasting helps organizations use historical data patterns to predict future events. This capability is essential across sectors such as asset risk management, trading, weather forecasting, energy demand analysis, vital sign monitoring, and traffic studies, where accurate forecasts are critical to success.
In many of these scenarios, time series data can exhibit heavy-tailed distributions, where the tails represent extreme values. Accurate forecasts in these regions are essential for estimating the likelihood of extreme events and deciding whether to raise alerts. However, outliers heavily skew the estimation of the base distribution, which makes robust forecasting challenging. Financial institutions, for example, rely on reliable models to predict extreme events such as market crashes. Similarly, in the energy, weather, and healthcare sectors, accurate forecasts of infrequent but high-impact events such as natural disasters or pandemics are crucial for planning and resource allocation. Ignoring tail behavior can result in losses, missed opportunities, and threats to safety, so accuracy at the tails is essential for dependable, actionable forecasts. In this post, we show how to train a robust time series forecasting model that can capture such extreme events using Amazon SageMaker.
To successfully train this model, we implement an MLOps framework that streamlines the model development process through automation of data preprocessing, feature engineering, hyperparameter tuning, and model selection. This automation minimizes human error, enhances reproducibility, and accelerates the development cycle. With a training pipeline in place, organizations can seamlessly integrate new data and adjust their models to meet changing conditions, ensuring that forecasts remain timely and reliable.
Once the time series forecasting model is trained, deploying it to an endpoint facilitates real-time prediction capabilities. This empowers decision-makers to act quickly based on the most current data. Additionally, deploying the model to an endpoint enhances scalability, allowing multiple users and applications to access and utilize the model concurrently. By following these steps, organizations can leverage the power of robust time series forecasting to make informed decisions and maintain a competitive edge in a fast-paced environment.
Overview of the Solution
This solution demonstrates how to train a time series forecasting model designed to handle outliers and data variability using a Temporal Convolutional Network (TCN) combined with a Spliced Binned Pareto (SBP) distribution. A multimodal variant of this solution is covered in a separate blog post. To demonstrate the effectiveness of the SBP distribution, we also compare it against the same TCN model using a Gaussian distribution.
Our approach benefits from the MLOps capabilities of SageMaker, which streamline the data science workflow on AWS cloud infrastructure. In this solution, we use Amazon SageMaker Automatic Model Tuning for the hyperparameter search, Amazon SageMaker Experiments for experiment tracking, Amazon SageMaker Model Registry for model versioning, and Amazon SageMaker Pipelines to orchestrate the entire process. The trained model is then deployed to a SageMaker endpoint to serve real-time predictions.
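As an illustration of that final deployment step, the following is a minimal sketch of deploying a trained model artifact to a real-time SageMaker endpoint with the SageMaker Python SDK. The S3 model path, execution role, and inference.py entry point are placeholders, not artifacts produced by this solution.

```python
import numpy as np
from sagemaker.pytorch import PyTorchModel

# Placeholders: model artifact path, execution role, and inference script.
model = PyTorchModel(
    model_data="s3://<your-bucket>/<training-job>/output/model.tar.gz",
    role="<your-sagemaker-execution-role>",
    entry_point="inference.py",
    framework_version="1.13",
    py_version="py39",
)

# Create a real-time HTTPS endpoint backed by one ml.m5.xlarge instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Send a window of recent observations; the response format depends on how
# inference.py assembles the forecast (for example, distribution parameters or quantiles).
forecast = predictor.predict(np.array([0.12, 0.08, -0.03, 0.25], dtype=np.float32))
```

Because the endpoint is a managed HTTPS service, multiple users and applications can invoke it concurrently, and capacity can be adjusted by changing the instance count or attaching auto scaling policies.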
The following diagram illustrates the architecture of the training pipeline.
Data Generation
The first step in our pipeline generates a synthetic dataset consisting of a sinusoidal waveform with asymmetric, heavy-tailed noise. The dataset is constructed from several parameters, including the degrees of freedom, a noise multiplier, and a scale parameter, which respectively shape the distribution, control the magnitude of the random variation, and adjust the spread of the data.
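The exact generation script belongs to the pipeline code; the following is a minimal sketch, assuming Student's t noise for the heavy tail, of how such a series could be produced with NumPy. The function name and the skewing of positive noise values are illustrative choices, not the pipeline's implementation.

```python
import numpy as np

def generate_series(n=2000, degrees_of_freedom=3, noise_multiplier=0.3, scale=1.0, seed=0):
    """Sinusoid plus asymmetric, heavy-tailed (Student's t) noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    signal = np.sin(2 * np.pi * t / 100)                 # base sinusoidal waveform
    noise = rng.standard_t(degrees_of_freedom, n)         # heavy-tailed noise
    noise = np.where(noise > 0, noise * 2.0, noise)       # skew the positive side to make it asymmetric
    return scale * (signal + noise_multiplier * noise)

series = generate_series()
```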
This data processing task employs a PyTorchProcessor, which executes the PyTorch code (generate_data.py) within a container managed by SageMaker. Data and relevant artifacts for debugging are stored in the default Amazon Simple Storage Service (Amazon S3) bucket associated with the SageMaker account. Logs for each step in the pipeline can be accessed through Amazon CloudWatch.
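As a sketch of how this step might be wired up with the SageMaker Python SDK (the role, instance settings, and output path inside the container are placeholders), the processing job runs generate_data.py and uploads its output to Amazon S3:

```python
from sagemaker.processing import ProcessingOutput
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

# Placeholders: replace the role and instance settings with your own.
processor = PyTorchProcessor(
    framework_version="1.13",
    py_version="py39",
    role="<your-sagemaker-execution-role>",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="generate-data",
    sagemaker_session=pipeline_session,
)

# The script writes the synthetic series to /opt/ml/processing/output,
# which SageMaker uploads to the default S3 bucket for the account.
step_args = processor.run(
    code="generate_data.py",
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output")],
)

data_step = ProcessingStep(name="GenerateData", step_args=step_args)
```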
The following figure showcases a sample of the data generated by the pipeline.
You can substitute the input with various types of time series data, including symmetric, asymmetric, light-tailed, heavy-tailed, or multimodal distributions. The model’s robustness allows it to be applied to a wide array of time series challenges, provided there are enough observations available.
Model Training
After the data generation phase, we train two TCNs: one based on the SBP distribution and the other on a Gaussian distribution. The SBP distribution uses a discrete binned distribution as its predictive base, dividing the real axis into discrete bins and having the model predict the probability of an observation falling within each bin. This makes it possible to capture asymmetries and multiple modes, because the probability of each bin is estimated independently. An example of the binned distribution is depicted in the following figure.
The predictive binned distribution on the left is resilient to extreme events because the log-likelihood remains unaffected by the distance between the predicted mean and the observed point, unlike parametric distributions such as Gaussian or Student’s t. Thus, the extreme event represented by the red dot will not skew the learned mean of the distribution. However, the extreme event will have a zero probability. To accurately capture extreme events, we construct an SBP distribution by establishing the lower tail at the 5th quantile and the upper tail at the 95th quantile, replacing both tails with weighted Generalized Pareto Distributions (GPD), which can quantify the likelihood of such events. The TCN outputs the parameters for both the binned distribution base and the GPD tails.
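To make the robustness argument concrete, the following is a simplified PyTorch illustration of the binned base distribution's log-likelihood; the bin edges and network outputs are placeholders, and the GPD tail splicing is omitted. Because the loss depends only on which bin the observation falls into, and not on its distance from the bulk of the distribution, an extreme value cannot drag the learned distribution toward it.

```python
import torch

def binned_log_likelihood(logits, bin_edges, y):
    """Log-probability of observation y under a discrete binned distribution.

    logits:    (num_bins,) unnormalized scores produced by the TCN for one time step
    bin_edges: (num_bins + 1,) edges that partition the real axis into bins
    y:         scalar observed value (0-dim tensor)
    """
    log_probs = torch.log_softmax(logits, dim=-1)  # per-bin log-probabilities
    # Index of the bin containing y (clamped so values beyond the outer edges map to the end bins).
    idx = torch.clamp(torch.bucketize(y, bin_edges) - 1, 0, logits.shape[-1] - 1)
    # The likelihood depends only on the probability of the containing bin,
    # not on how far y lies from the rest of the distribution.
    return log_probs[idx]

logits = torch.randn(100)                    # placeholder network output
bin_edges = torch.linspace(-5.0, 5.0, 101)   # 100 equally spaced bins on [-5, 5]
print(binned_log_likelihood(logits, bin_edges, torch.tensor(0.7)))
```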
Hyperparameter Search
To achieve optimal performance, we use automatic model tuning to find the best version of our model through hyperparameter tuning. This step is integrated into SageMaker Pipelines and runs multiple training jobs in parallel over predefined hyperparameter ranges, which makes the tuning process more efficient.
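The following sketch shows how such a tuning job could be defined with the SageMaker Python SDK and wrapped in a pipeline tuning step. The training script name, metric regex, hyperparameter ranges, and S3 input path are hypothetical placeholders, not the values used in this solution.

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import TuningStep

pipeline_session = PipelineSession()

# Hypothetical training script and placeholder role/instance settings.
estimator = PyTorch(
    entry_point="train_sbp.py",
    role="<your-sagemaker-execution-role>",
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    sagemaker_session=pipeline_session,
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    # The regex must match however the training script logs its validation loss.
    metric_definitions=[{"Name": "validation:loss", "Regex": "validation loss: ([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning-rate": ContinuousParameter(1e-4, 1e-2),
        "context-length": IntegerParameter(30, 120),
    },
    max_jobs=10,          # total training jobs to launch
    max_parallel_jobs=2,  # jobs run concurrently
)

# The S3 input path is a placeholder; in the pipeline it would typically come
# from the data generation step's output properties.
tuning_step = TuningStep(
    name="TuneSBPModel",
    step_args=tuner.fit({"train": "s3://<your-bucket>/generated-data/"}),
)
```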
In conclusion, the entire process allows organizations to seamlessly implement robust time series forecasting models, ensuring they are well equipped to respond effectively to dynamic environments.