Microsoft SharePoint serves as a robust document management system, facilitating the storage, organization, and collaborative editing of files. Organizations often seek to ingest SharePoint data into their data lakes, integrating it with other available data for comprehensive reporting and analytics. AWS Glue, a serverless data integration service, simplifies the process of discovering, preparing, and combining data for analytics, machine learning, and application development. With AWS Glue, you can start analyzing your data and putting it to use in minutes rather than months.
Data on SharePoint is frequently managed as files and lists, making it ideal for enhanced discovery, auditing, and compliance. However, since SharePoint is not a traditional relational database and typically stores semi-structured data, joining SharePoint data with other relational sources can be challenging. This article outlines how to ingest and process SharePoint data using AWS Glue and Amazon EventBridge, enabling you to combine this data with what exists in your data lake. We utilize SharePoint REST APIs with OData syntax, which provides a standardized method for implementing REST APIs and allows for SQL-like querying capabilities. OData streamlines the development of RESTful APIs by abstracting complexities regarding request and response headers and query options.
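To make the SQL-like querying concrete, here is a minimal sketch of how an OData query against a SharePoint list could be assembled. The site URL, list title, and column names are hypothetical and chosen only to match the wine example; the `getbytitle` endpoint and the `$select`, `$filter`, and `$top` options are standard SharePoint REST and OData features.

```python
from urllib.parse import quote, urlencode


def build_list_items_url(site_url, list_title, select=None, filter_expr=None, top=None):
    """Build a SharePoint REST URL that reads list items using OData query options."""
    # Inside an OData string literal, a single quote is escaped by doubling it.
    safe_title = quote(list_title.replace("'", "''"))
    base = f"{site_url.rstrip('/')}/_api/web/lists/getbytitle('{safe_title}')/items"

    options = {}
    if select:
        options["$select"] = ",".join(select)   # columns to return (like SELECT)
    if filter_expr:
        options["$filter"] = filter_expr        # row predicate (like WHERE)
    if top is not None:
        options["$top"] = str(top)              # row limit (like LIMIT)

    if not options:
        return base
    # Leave "$", "," and "'" unescaped; spaces become %20.
    return base + "?" + urlencode(options, safe="$,'", quote_via=quote)


if __name__ == "__main__":
    # Hypothetical site, list, and columns.
    print(build_list_items_url(
        "https://contoso.sharepoint.com/sites/wine",
        "Wine Quality",
        select=["Title", "Comment"],
        filter_expr="Rating lt 3",
        top=100,
    ))
```

The resulting URL would be sent with an `Accept: application/json` header and a bearer token; only the query construction is shown here.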
AWS Glue Event-Driven Workflows
SharePoint data may change unpredictably, complicating efforts to schedule data processing pipelines effectively. Running these pipelines too often can incur significant costs, while processing them less frequently may result in outdated data. Triggering pipelines from external processes adds unnecessary complexity and increases job startup times.
AWS Glue supports event-driven workflows, which let developers start Glue workflows based on events delivered by EventBridge. EventBridge suits this use case because it delivers events as they occur, so target tables can be updated and the information made available soon after a change. Because SharePoint data changes unpredictably, capturing those events with EventBridge lets the data processing pipeline run only when new data is present.
To start, create a new AWS Glue trigger of type EVENT, which serves as the initial trigger in your workflow. You can also set a batching condition, allowing you to control the number of events to buffer or the maximum time elapsed before triggering the workflow. For instance, you might configure the workflow to activate when 100 files are uploaded to Amazon S3 or five minutes after the first upload. It is advisable to configure event batching to prevent excessive concurrent workflows, thereby optimizing resource use and costs.
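As a rough illustration of the batching condition described above, the following sketch builds the arguments for the Glue `CreateTrigger` API with a batch size of 100 events and a batch window of 300 seconds (five minutes). The trigger, workflow, and job names are placeholders, not names taken from the solution's template.

```python
def build_event_trigger_request(trigger_name, workflow_name, job_name,
                                batch_size=100, batch_window_seconds=300):
    """Return keyword arguments for glue.create_trigger with an EVENT trigger."""
    return {
        "Name": trigger_name,
        "WorkflowName": workflow_name,
        "Type": "EVENT",
        # Fire when batch_size events have arrived, or when the window
        # (in seconds, counted from the first event) expires.
        "EventBatchingCondition": {
            "BatchSize": batch_size,
            "BatchWindow": batch_window_seconds,
        },
        # The job the trigger starts once the batching condition is met.
        "Actions": [{"JobName": job_name}],
    }


def create_event_trigger(request):
    """Send the request to AWS Glue; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("glue").create_trigger(**request)


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(build_event_trigger_request(
        "sharepoint-event-trigger", "sharepoint-workflow", "sharepoint-transform-job"
    ))
```

Keeping the request builder separate from the API call makes the batching values easy to review before anything is created in the account.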
To exemplify this solution, consider a wine manufacturing and distribution company that operates across multiple regions. They maintain their transactional data in an Amazon S3 data lake and utilize SharePoint lists to gather feedback on wine quality from suppliers and stakeholders. The supply chain team aims to merge their transactional data with the wine quality comments captured in SharePoint to enhance product quality and address production issues. They require that the comments from SharePoint be collected within an hour and published to a wine quality dashboard in Amazon QuickSight. By adopting an event-driven approach, the supply chain team can access the data in less than an hour.
Overview of the Solution
This article details a solution for setting up an AWS Glue job to ingest SharePoint lists and files into an S3 bucket, along with a Glue workflow that listens for S3 PutObject data events recorded by AWS CloudTrail. This workflow is configured with an event-based trigger that activates when an AWS Glue ingest job uploads new files to the S3 bucket. The accompanying diagram illustrates the architecture.
To simplify deployment, we have encapsulated the entire solution within an AWS CloudFormation template, which automates the ingestion of SharePoint data into Amazon S3. SharePoint identifies the calling application by its ClientID and TenantID and uses OAuth 2.0 for authorization.
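The article does not say which token endpoint the template's job calls. As one possible shape, the sketch below assembles a client-credentials token request for the Azure Access Control Service (ACS) app-only flow commonly used with SharePoint Online; treat the choice of ACS as an assumption, since an on-premises SharePoint Server may be configured differently. All argument values are placeholders.

```python
from urllib.parse import urlparse

# Well-known application principal ID for SharePoint in the ACS flow.
SHAREPOINT_PRINCIPAL = "00000003-0000-0ff1-ce00-000000000000"


def build_token_request(tenant_id, client_id, client_secret, site_url):
    """Return (token_url, form_fields) for an OAuth 2.0 client-credentials request.

    Assumes the ACS app-only flow; not confirmed by the article.
    """
    host = urlparse(site_url).netloc  # e.g. contoso.sharepoint.com
    token_url = f"https://accounts.accesscontrol.windows.net/{tenant_id}/tokens/OAuth/2"
    form_fields = {
        "grant_type": "client_credentials",
        "client_id": f"{client_id}@{tenant_id}",
        "client_secret": client_secret,
        "resource": f"{SHAREPOINT_PRINCIPAL}/{host}@{tenant_id}",
    }
    return token_url, form_fields


if __name__ == "__main__":
    url, form = build_token_request(
        "tenant-guid", "client-guid", "example-secret",
        "https://contoso.sharepoint.com/sites/wine",
    )
    # Print the fields without exposing the secret.
    print(url, {k: v for k, v in form.items() if k != "client_secret"})
```

The form fields would be POSTed to the token URL, and the `access_token` in the JSON response sent as a `Bearer` token on subsequent REST calls.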
The template performs the following steps:
- Create an AWS Glue Python shell job to make REST API calls to the SharePoint server and ingest files or lists into Amazon S3.
- Establish an AWS Glue workflow with an initial EVENT type trigger.
- Configure CloudTrail to log data events, including PutObject API calls.
- Create an EventBridge rule to forward the PutObject API events from CloudTrail to AWS Glue.
- Add the AWS Glue event-driven workflow as the target of the EventBridge rule, so that the workflow starts whenever the SharePoint ingest job adds new files to the S3 bucket.
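The last two steps can be sketched with boto3 as follows. The event pattern matches S3 `PutObject` calls that CloudTrail records for a single bucket; the rule name, bucket name, and ARNs are placeholders, and the template may define its rule differently.

```python
import json


def build_put_object_pattern(bucket_name):
    """Event pattern for S3 PutObject data events delivered through CloudTrail."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": [bucket_name]},
        },
    }


def create_rule_with_workflow_target(rule_name, bucket_name, workflow_arn, role_arn):
    """Create the rule and attach the Glue workflow; needs boto3 and credentials."""
    import boto3  # imported lazily so the pattern builder works without boto3

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_put_object_pattern(bucket_name)),
        State="ENABLED",
    )
    # role_arn is assumed to be a role EventBridge can use to notify the workflow.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "glue-workflow", "Arn": workflow_arn, "RoleArn": role_arn}],
    )


if __name__ == "__main__":
    # Placeholder bucket name.
    print(json.dumps(build_put_object_pattern("example-sharepoint-landing"), indent=2))
```

The rule only receives these events if the CloudTrail trail logs S3 data events for the bucket, which is why the template configures CloudTrail first.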
Prerequisites
Before proceeding, ensure you meet the following prerequisites:
- An AWS account
- SharePoint Server 2013 or later
Configuring SharePoint Server Authentication Details
Prior to launching the CloudFormation stack, set up your SharePoint server authentication details, including TenantID, ClientID, ClientSecret, and SharePoint URL in the AWS Systems Manager Parameter Store. This approach ensures authentication details are not hardcoded and are retrieved dynamically during execution.
To create the necessary AWS Systems Manager parameters, follow these steps:
- Navigate to the Systems Manager console and select Parameter Store.
- Choose Create Parameter.
- Enter the parameter name /DATALAKE/GlueIngest/SharePoint/tenant.
- Keep the type as string and input your SharePoint tenant detail in the value field.
- Click Create parameter.
Repeat these steps to create the following parameters:
- /DataLake/GlueIngest/SharePoint/tenant
- /DataLake/GlueIngest/SharePoint/tenant_id
- /DataLake/GlueIngest/SharePoint/client_id/list
- /DataLake/GlueIngest/SharePoint/client_secret/list
- /DataLake/GlueIngest/SharePoint/client_id/file
- /DataLake/GlueIngest/SharePoint/client_secret/file
- /DataLake/GlueIngest/SharePoint/url/list
- /DataLake/GlueIngest/SharePoint/url/file
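Once these parameters exist, a job can read them at run time instead of hardcoding them. The sketch below shows one way to do that with the Systems Manager `get_parameter` call; the function name and the returned dictionary keys are illustrative choices, not part of the template.

```python
PARAMETER_PREFIX = "/DataLake/GlueIngest/SharePoint"


def load_sharepoint_settings(ssm_client, source_type):
    """Read SharePoint settings for 'list' or 'file' ingestion from Parameter Store."""
    if source_type not in ("list", "file"):
        raise ValueError("source_type must be 'list' or 'file'")

    names = {
        "tenant": f"{PARAMETER_PREFIX}/tenant",
        "tenant_id": f"{PARAMETER_PREFIX}/tenant_id",
        "client_id": f"{PARAMETER_PREFIX}/client_id/{source_type}",
        "client_secret": f"{PARAMETER_PREFIX}/client_secret/{source_type}",
        "url": f"{PARAMETER_PREFIX}/url/{source_type}",
    }
    settings = {}
    for key, name in names.items():
        # WithDecryption is ignored for plain String parameters and
        # decrypts the value if the parameter is a SecureString.
        response = ssm_client.get_parameter(Name=name, WithDecryption=True)
        settings[key] = response["Parameter"]["Value"]
    return settings


if __name__ == "__main__":
    import boto3  # requires AWS credentials and the parameters listed above

    config = load_sharepoint_settings(boto3.client("ssm"), "list")
    print(sorted(config))  # print the keys only, never the secret values
```

Passing the client in as an argument keeps the function easy to exercise with a stub when AWS access is not available.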