Time series forecasting plays a pivotal role in data-driven decision-making, helping businesses use historical trends to predict future events. It is particularly vital across sectors such as asset risk management, trading, meteorology, energy consumption forecasting, health monitoring, and traffic analysis. The capacity to forecast with precision is essential for success in these fields.
Time series data often exhibits heavy-tailed distributions, in which extreme values occur far more often than a Gaussian model would suggest. Accurately predicting outcomes in these tails is crucial for assessing the likelihood of rare events and responding accordingly. However, the presence of outliers can significantly distort the estimation of the underlying distribution, making robust forecasting a challenging task. Financial institutions depend on resilient models to foresee catastrophic events like market crashes. Similarly, accurate predictions of sporadic yet impactful occurrences, such as natural disasters and pandemics, are vital for effective planning and resource management in sectors like energy, weather, and healthcare. Ignoring tail behavior can lead to financial losses, missed opportunities, and safety issues. Prioritizing accuracy at the extremes is therefore essential for generating reliable and actionable forecasts. In this post, we develop a robust time series forecasting model that captures extreme events using Amazon SageMaker.
To effectively train this model, we establish an MLOps framework that simplifies the model development process by automating data preprocessing, feature engineering, hyperparameter tuning, and model selection. This automation minimizes human error, enhances reproducibility, and accelerates the model development cycle. By implementing a training pipeline, businesses can seamlessly integrate new data and adapt their models to changing conditions, ensuring that forecasts remain trustworthy and current.
Once the time series forecasting model is trained, deploying it to an endpoint enables real-time prediction capabilities. This allows for informed and swift decision-making based on the most recent data. Moreover, deploying the model in an endpoint provides scalability, allowing multiple users and applications to access and utilize the model concurrently. By following these outlined steps, organizations can leverage robust time series forecasting to make informed decisions and maintain a competitive edge in a rapidly evolving landscape.
Solution Overview
This solution illustrates the training of a time series forecasting model specifically designed to manage outliers and data variability using a Temporal Convolutional Network (TCN) with a Spliced Binned Pareto (SBP) distribution. For additional insights on a multimodal variant of this solution, refer to this blog post. To demonstrate the effectiveness of the SBP distribution, we will compare it against the same TCN model utilizing a Gaussian distribution.
The MLOps capabilities of SageMaker significantly enhance the data science workflow by leveraging the robust cloud infrastructure of AWS. Our solution employs features such as Amazon SageMaker Automatic Model Tuning for hyperparameter optimization, Amazon SageMaker Experiments for managing experimental runs, Amazon SageMaker Model Registry for overseeing model versions, and Amazon SageMaker Pipelines for orchestrating the entire process. The model is then deployed to a SageMaker endpoint to facilitate real-time predictions.
The following diagram provides an overview of the training pipeline.
The following diagram depicts the inference pipeline.
You can find the complete code in the GitHub repository. To implement the solution, run the cells in SBP_main.ipynb.
SageMaker Pipeline
SageMaker Pipelines presents an intuitive Python SDK for creating integrated machine learning (ML) workflows. These workflows are represented as Directed Acyclic Graphs (DAGs), consisting of various steps and dependencies. With SageMaker Pipelines, you can streamline the comprehensive process of training and evaluating models, thereby enhancing efficiency and reproducibility in your ML workflows.
The training pipeline commences with the generation of a synthetic dataset, which is divided into training, validation, and test sets. The training set is utilized to train two TCN models, one employing the Spliced Binned Pareto distribution and the other using a Gaussian distribution. Both models undergo hyperparameter tuning with the validation set to optimize performance. Subsequently, an evaluation is conducted against the test set to identify the model with the lowest root mean squared error (RMSE). The model demonstrating the best accuracy is then uploaded to the model registry.
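The split-and-select logic at the end of the pipeline can be sketched in plain Python. This is a minimal illustration of the idea (chronological splitting, RMSE scoring, lowest-RMSE wins); the function names and fractions are illustrative, not taken from the repository:

```python
import math

def chronological_split(series, train_frac=0.7, val_frac=0.15):
    """Split a time series chronologically (no shuffling) into train/val/test."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def select_best(candidates):
    """Pick the candidate with the lowest RMSE, mirroring the pipeline's final step."""
    return min(candidates, key=lambda m: m["rmse"])

series = list(range(100))
train, val, test = chronological_split(series)
print(len(train), len(val), len(test))  # 70 15 15

candidates = [
    {"name": "tcn-sbp", "rmse": rmse([1.0, 2.0], [1.1, 2.1])},
    {"name": "tcn-gaussian", "rmse": rmse([1.0, 2.0], [1.5, 2.5])},
]
print(select_best(candidates)["name"])  # tcn-sbp
```

Splitting chronologically rather than randomly matters for time series: it prevents the model from being validated on observations that precede its training data.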
The following diagram illustrates the steps involved in the pipeline.
Let’s delve deeper into the individual steps.
Data Generation
The initial step in our pipeline involves generating a synthetic dataset characterized by a sinusoidal waveform and asymmetric heavy-tailed noise. This data is created using several parameters, including degrees of freedom, a noise multiplier, and a scale parameter. These factors influence the shape of the data distribution, modulate the random variability, and adjust the overall spread of the data distribution.
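As a rough, self-contained sketch of this kind of generator, the following combines a sinusoid with skewed Student's t noise. The parameter names and values are illustrative only; the actual generate_data.py in the repository may differ:

```python
import math
import random

def student_t(df, rng):
    """Draw one Student's t sample as N(0,1) / sqrt(chi^2_df / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

def generate_series(n=500, df=3, noise_multiplier=1.5, scale=10.0, period=50, seed=0):
    """Sinusoidal signal plus asymmetric heavy-tailed noise.

    df               -> degrees of freedom: lower values give heavier tails
    noise_multiplier -> modulates the random variability
    scale            -> adjusts the overall spread of the data
    """
    rng = random.Random(seed)
    series = []
    for t in range(n):
        signal = math.sin(2 * math.pi * t / period)
        noise = noise_multiplier * student_t(df, rng)
        # Amplify positive noise only, making the tails asymmetric
        if noise > 0:
            noise *= 2.0
        series.append(scale * (signal + noise))
    return series

data = generate_series()
print(len(data))  # 500
```

With df=3 the t distribution has heavy tails (infinite third moment), so occasional extreme spikes appear on top of the sinusoid, which is exactly the behavior the SBP model is built to handle.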
This data processing task is executed with a PyTorchProcessor
, which runs PyTorch code (generate_data.py
) within a container managed by SageMaker. Data and other pertinent artifacts for debugging are stored in the default Amazon Simple Storage Service (Amazon S3) bucket linked to the SageMaker account. Logs for each step of the pipeline are accessible via Amazon CloudWatch.
The following figure shows a sample of the generated data.
You can replace the input with a diverse array of time series datasets, including symmetric, asymmetric, light-tailed, heavy-tailed, or multimodal distributions. The model’s robustness allows it to be applicable to numerous time series challenges, provided there are sufficient observations available.
Model Training
Following data generation, we train two TCNs: one utilizing SBP distribution and the other employing Gaussian distribution. The SBP distribution utilizes a discrete binned distribution as its predictive base, where the real axis is segmented into discrete bins, and the model predicts the likelihood of an observation falling within each bin. This approach enables capturing asymmetries and multiple modes, as the probability for each bin is independent. An example of the binned distribution is illustrated in the following figure.
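A minimal sketch of how a binned distribution scores an observation follows; this is a conceptual illustration, not the model's actual implementation. The key property is that the likelihood depends only on which bin the observation falls in:

```python
import bisect
import math

def binned_log_likelihood(edges, probs, x):
    """Log-likelihood of x under a binned distribution.

    edges: sorted bin boundaries, length len(probs) + 1
    probs: probability mass assigned to each bin (sums to 1)
    Returns -inf for observations outside the binned support.
    """
    if x < edges[0] or x >= edges[-1]:
        return float("-inf")
    i = bisect.bisect_right(edges, x) - 1
    width = edges[i + 1] - edges[i]
    # Density inside a bin is mass / width; the likelihood does not
    # depend on how far x sits from the distribution's mean.
    return math.log(probs[i] / width)

edges = [0.0, 1.0, 2.0, 3.0, 4.0]
probs = [0.1, 0.4, 0.4, 0.1]
print(binned_log_likelihood(edges, probs, 1.5))   # log(0.4), about -0.916
print(binned_log_likelihood(edges, probs, 10.0))  # -inf (outside the bins)
```

Because each bin's probability is free to take any value, the model can represent skewed or multimodal shapes that a single Gaussian cannot.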
The predictive binned distribution on the left is resilient to extreme events because its log-likelihood does not depend on the distance between the predicted mean and the observed point, in contrast to parametric distributions like the Gaussian or Student's t. Therefore, the extreme event represented by the red dot will not skew the learned mean of the distribution. However, the binned base assigns zero probability to any observation that falls outside its bin range. To capture these extreme occurrences, we construct an SBP distribution by defining the lower tail at the 5th quantile and the upper tail at the 95th quantile, replacing both tails with weighted Generalized Pareto Distributions (GPD), which can quantify the likelihood of such events. The TCN outputs the parameters for both the binned distribution base and the GPD tails.
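To make the tail construction concrete, here is a small sketch of how a weighted GPD assigns probability beyond the upper threshold. The parameter values are illustrative; in the actual solution the TCN emits these parameters, and the tail weight corresponds to the mass beyond the 95th quantile:

```python
def gpd_survival(y, xi, sigma):
    """P(Y > y) for a Generalized Pareto exceedance Y >= 0 (xi > 0 case)."""
    return (1.0 + xi * y / sigma) ** (-1.0 / xi)

def sbp_upper_tail_prob(x, upper_threshold, tail_weight, xi, sigma):
    """P(X > x) in the SBP upper tail: the mass beyond the upper
    threshold is redistributed by a GPD scaled by tail_weight."""
    if x <= upper_threshold:
        raise ValueError("x must lie beyond the upper threshold")
    return tail_weight * gpd_survival(x - upper_threshold, xi, sigma)

# Illustrative values: threshold at 3.0 (the 95th quantile), 5% tail mass
p = sbp_upper_tail_prob(5.0, upper_threshold=3.0, tail_weight=0.05, xi=0.2, sigma=1.0)
print(p)  # about 0.0093
```

Unlike the binned base, this tail assigns a small but nonzero probability to arbitrarily extreme observations, which is what lets the model quantify the likelihood of rare events rather than dismissing them.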
Hyperparameter Search
To achieve optimal performance, we utilize automatic model tuning to identify the best version of the model through hyperparameter tuning. This phase is integrated into SageMaker Pipelines and allows for the parallel execution of multiple training jobs, employing various methods and predefined hyperparameter configurations.
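The idea behind this phase can be sketched as a random search over predefined ranges. This is a conceptual stand-in only: SageMaker Automatic Model Tuning runs real training jobs in parallel and also supports Bayesian optimization, and the objective function and hyperparameter names below are invented for illustration:

```python
import random

def objective(hparams):
    """Stand-in for a training job returning validation RMSE (lower is better).
    A real run would train the TCN with these hyperparameters on SageMaker."""
    return (hparams["learning_rate"] - 0.01) ** 2 + 0.1 / hparams["num_layers"]

def random_search(n_trials, seed=0):
    """Sample hyperparameters from predefined ranges and keep the best trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        hparams = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform range
            "num_layers": rng.randint(1, 5),
        }
        score = objective(hparams)
        if best is None or score < best[0]:
            best = (score, hparams)
    return best

score, best_hparams = random_search(n_trials=25)
print(best_hparams)
```

Sampling the learning rate log-uniformly is a common choice because its useful values span several orders of magnitude.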