Develop a Model to Forecast the Influence of Weather on Urban Air Quality Utilizing Amazon SageMaker


Air pollution in urban areas poses a significant threat, adversely affecting the health of humans, animals, vegetation, and infrastructure. As urban populations continue to grow, this issue has garnered increased focus. Notably, it was a key theme in the 2018 KDD Cup, an annual data mining and knowledge discovery competition organized by ACM SIGKDD.

A major contributor to urban air pollution is the combustion of fossil fuels for transportation and heating. Combustion emits nitric oxide (NO), which oxidizes in the atmosphere to nitrogen dioxide (NO2), a secondary pollutant known to contribute significantly to respiratory illness. The European Union’s Cleaner Air For Europe (CAFE) Directive 2008/50/EC sets an hourly NO2 limit of 200 μg/m³, which may be exceeded no more than 18 times per calendar year, and an annual mean limit of 40 μg/m³.
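To make the limit values concrete, here is a minimal sketch of how an hourly NO2 series could be checked against them. It is not part of the original analysis: the function name `no2_compliance`, the constant names, and the synthetic readings are illustrative assumptions, and the series is assumed to be in μg/m³ with a datetime index.

```python
import pandas as pd

# Limit values quoted above (Directive 2008/50/EC), in µg/m3.
HOURLY_LIMIT = 200.0
ANNUAL_LIMIT = 40.0
MAX_EXCEEDANCES = 18  # permitted hourly exceedances per calendar year


def no2_compliance(hourly_no2):
    """Summarise each calendar year of an hourly NO2 series (hypothetical helper)."""
    years = hourly_no2.index.year
    summary = pd.DataFrame({
        # Number of hours above the hourly limit in each year.
        "exceedances": (hourly_no2 > HOURLY_LIMIT).groupby(years).sum(),
        # Mean concentration in each year (missing hours are skipped).
        "annual_mean": hourly_no2.groupby(years).mean(),
    })
    summary["hourly_ok"] = summary["exceedances"] <= MAX_EXCEEDANCES
    summary["annual_ok"] = summary["annual_mean"] <= ANNUAL_LIMIT
    return summary


# Synthetic data for illustration only: 45 hours at 20 and 3 hours at 250.
index = pd.date_range("2016-01-01", periods=48, freq="h")
values = [20.0] * 45 + [250.0] * 3
summary = no2_compliance(pd.Series(values, index=index))
print(summary)
```

With real data, a full year of hourly readings would be needed before the annual figures mean anything; the two-day sample only shows the mechanics.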

Across the globe, numerous cities provide daily reports on air quality levels. We chose to analyze air quality data using Amazon SageMaker, a fully managed service that allows developers and data scientists to efficiently build, train, and deploy machine learning models at scale.

The Scenario

This blog post explores the relationship between air pollution (specifically NO2) and weather conditions in Dublin, Ireland. The air quality data is sourced from a well-established monitoring station operated by the Irish Environmental Protection Agency (EPA) in Rathmines, an inner suburb about 3 kilometers south of Dublin’s city center. Dublin, the capital of the Republic of Ireland, has a population of around one million people and is bounded by the sea to the east, mountains to the south, and flat terrain to the north and west. The southern mountains influence wind speed and direction over the city, often redirecting southerly winds towards the southwest or southeast.

Weather data is obtained from a long-established station at Dublin Airport, located roughly 12 kilometers north of the city center on flat terrain.

The Tools

  • Amazon SageMaker for exploratory data analysis and machine learning
  • Amazon Simple Storage Service (Amazon S3) for staging data for analysis

The Data

Hourly air pollution datasets from the Rathmines monitoring station, covering the years 2011 to 2016, are publicly available from the Irish EPA. A daily weather dataset from Dublin Airport, dating back to 1942, is published by the Irish Meteorological Service (Met Éireann) under a Creative Commons License.

For further global studies, OpenAQ offers a comprehensive repository of air quality data, which is also accessible on the Registry of Open Data on AWS.

Preparing the Data for Analysis and Loading Data from Amazon S3

The data is initially in CSV format. Before uploading it to our Amazon S3 bucket, we performed necessary data wrangling:

Weather Data

The initial dataset contained excessive information. We streamlined the weather data by:

  1. Removing the header, which occupied the first 25 rows.
  2. Converting wind speed measurements from knots to meters per second.
  3. Selecting relevant parameters based on scientific literature.
  4. Renaming parameters for clarity:
    • ‘rain’ to ‘rain_mm’ (precipitation in mm)
    • ‘maxtp’ to ‘maxtemp’ (maximum air temperature in Celsius)
    • ‘mintp’ to ‘mintemp’ (minimum air temperature in Celsius)
    • ‘cbl’ to ‘pressure_hpa’ (mean air pressure in hectopascals)
    • ‘wdsp’ to ‘wd_speed_m_per_s’ (wind speed, converted from knots to meters per second)
    • ‘ddhm’ to ‘winddirection’
    • ‘sun’ to ‘sun_hours’ (sunshine duration)
    • ‘evap’ to ‘evap_mm’ (evaporation in mm)
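The weather wrangling steps above can be sketched with pandas roughly as follows. This is an illustrative reconstruction, not the authors' code: the helper `prepare_weather`, the presence of a `date` column, and the sample rows are assumptions, and the example uses a two-line header instead of 25 to stay short.

```python
import io

import pandas as pd

KNOTS_TO_M_PER_S = 0.514444  # 1 knot = 1852 m / 3600 s

# Original column name -> clearer name, as listed above.
RENAMES = {
    "rain": "rain_mm",
    "maxtp": "maxtemp",
    "mintp": "mintemp",
    "cbl": "pressure_hpa",
    "wdsp": "wd_speed_m_per_s",
    "ddhm": "winddirection",
    "sun": "sun_hours",
    "evap": "evap_mm",
}


def prepare_weather(source, header_rows=25):
    """Skip the descriptive header, keep selected columns, convert wind speed."""
    raw = pd.read_csv(source, skiprows=header_rows)
    raw["date"] = pd.to_datetime(raw["date"])  # assumes a 'date' column exists
    weather = raw[["date", *RENAMES]].rename(columns=RENAMES)
    weather["wd_speed_m_per_s"] = weather["wd_speed_m_per_s"] * KNOTS_TO_M_PER_S
    return weather.set_index("date")


# Made-up sample with a 2-line header and one unwanted column ('extra').
sample = io.StringIO(
    "Station: Dublin Airport\n"
    "Units: see column key\n"
    "date,rain,maxtp,mintp,cbl,wdsp,ddhm,sun,evap,extra\n"
    "2016-01-01,1.2,9.5,3.1,1002.4,10.0,240,2.5,0.6,x\n"
)
weather = prepare_weather(sample, header_rows=2)
print(weather)
```

For the real file, `header_rows` would stay at its default of 25, and any blank or flagged values in the source would need handling before the conversion.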

Air Quality Data

The air quality data came as one file per year, with units of measurement that varied between files. We limited our analysis to 2011-2016 and merged the yearly files into a single dataset. The weather observations are daily values, while the air quality data is captured hourly. We therefore resampled the air quality data to 24-hour averages and renamed the columns accordingly; for example, NO2 became NO2_avg.
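A minimal sketch of the merge-and-resample step is shown below. It assumes each yearly file has already been loaded into a DataFrame with an hourly datetime index and converted to common units; the function name `daily_average` and the sample values are hypothetical.

```python
import pandas as pd


def daily_average(yearly_frames):
    """Stack hourly frames, average per calendar day, and suffix the columns."""
    hourly = pd.concat(yearly_frames).sort_index()
    daily = hourly.resample("D").mean()  # 24-hour mean per calendar day
    return daily.add_suffix("_avg")      # e.g. NO2 -> NO2_avg


# Two tiny synthetic "yearly" frames, one day each, for illustration only.
frame_2015 = pd.DataFrame(
    {"NO2": [10.0] * 12 + [30.0] * 12},
    index=pd.date_range("2015-12-31", periods=24, freq="h"),
)
frame_2016 = pd.DataFrame(
    {"NO2": [40.0] * 24},
    index=pd.date_range("2016-01-01", periods=24, freq="h"),
)
daily = daily_average([frame_2015, frame_2016])
print(daily)
```

Resampling to calendar days gives the air quality series the same daily resolution as the weather data, so the two can later be aligned on date.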

Following these preparations, we uploaded the refined data to our S3 bucket, ready to explore with Amazon SageMaker’s notebook-hosting capabilities.

Exploring the Data Using an Amazon SageMaker Notebook

To analyze our data, we utilized Amazon SageMaker’s notebook hosting feature, which provides a Jupyter notebook environment with essential data analysis and machine learning libraries pre-installed, along with access to the Amazon SageMaker Python SDK.

We initiated a notebook instance in Amazon SageMaker by navigating to the console and selecting “Create notebook instance.” After naming the instance and creating a new IAM role granting Amazon SageMaker access to our S3 data, we waited a few minutes for the notebook to become available. Once ready, we opened it to access the Jupyter environment for our analysis.

Next, we downloaded the companion notebook and uploaded it via the “Upload” option in the Jupyter console. Upon opening the notebook, we imported the necessary libraries:

%matplotlib inline
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

These libraries are pivotal for data analysis, with pandas serving as a robust tool for data manipulation, numpy as a standard scientific library in Python, and seaborn and matplotlib for visualizations.

Loading Prepared Data into the Amazon SageMaker Notebook from S3

With the notebook set up and the libraries imported, we can now load the data. We used pandas, which excels at exploring and manipulating tabular data directly in Python. We passed the S3 locations of the air pollution and weather files to the pandas.read_csv function and made sure the column names were consistent and clear.
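As a rough sketch of this loading step, the snippet below reads two prepared CSV sources and aligns them on date. The helper names, the `date` column, and the sample rows are assumptions for illustration. In-memory buffers stand in for the S3 objects so the example runs anywhere; the date join is one plausible follow-up rather than something the text above spells out.

```python
import io

import pandas as pd


def load_daily(source):
    """Read a prepared daily CSV, using its 'date' column as the index."""
    return pd.read_csv(source, parse_dates=["date"], index_col="date")


def load_and_join(air_source, weather_source):
    """Load both datasets and keep only the dates present in both."""
    air = load_daily(air_source)
    weather = load_daily(weather_source)
    return air.join(weather, how="inner")


# Stand-ins for the S3 objects; the values are made up.
air_csv = io.StringIO("date,NO2_avg\n2016-01-01,21.5\n2016-01-02,18.0\n")
weather_csv = io.StringIO(
    "date,rain_mm,wd_speed_m_per_s\n2016-01-02,0.4,5.1\n2016-01-03,2.0,7.3\n"
)
combined = load_and_join(air_csv, weather_csv)
print(combined)
```

In the notebook, the same functions could be given S3 URIs of the form `s3://<bucket>/<key>` (placeholders, not the real locations), which pandas can read when the `s3fs` package is installed.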

By engaging with this analysis, we can gain valuable insights into how weather impacts urban air quality, which is crucial for crafting effective environmental policies and raising awareness about air pollution’s implications.
