Dynamic Video Content Moderation and Policy Evaluation with AWS Generative AI Services


Organizations in media and entertainment, advertising, social media, education, and various other fields are increasingly seeking efficient methods to extract insights from videos and conduct flexible evaluations based on their specific policies. Generative artificial intelligence (AI) has opened up new avenues for these applications. In this article, we present the Media Analysis and Policy Evaluation solution, which leverages AWS AI and generative AI services to streamline video extraction and evaluation processes.

Key Use Cases

Advertising technology firms manage large volumes of video content such as advertisements. Their key priorities in video analysis include brand safety, regulatory compliance, and audience engagement. This solution, powered by AWS AI and generative AI services, addresses these requirements. Advanced content moderation ensures that ads are displayed alongside safe and compliant content, fostering consumer trust. The solution also evaluates videos against compliance policies and helps generate compelling headlines and summaries that improve user engagement and ad performance.

Educational technology companies handle extensive libraries of training videos. An effective video analysis tool enables them to assess content according to industry standards, index videos for efficient searching, and perform dynamic tasks such as blurring student faces in Zoom recordings.

The solution is readily available on GitHub and can be deployed into your AWS account via an AWS Cloud Development Kit (AWS CDK) package.

Solution Overview

  • Media Extraction: Once a video is uploaded, the application begins preprocessing by extracting image frames. Each frame is then analyzed with Amazon Rekognition and Amazon Bedrock for metadata extraction. Simultaneously, audio transcription is performed on the uploaded content using Amazon Transcribe.
  • Policy Evaluation: Leveraging the extracted metadata, the system conducts evaluations with large language models (LLMs). This allows videos to be assessed against dynamic policies, taking full advantage of LLM flexibility, as sketched in the example that follows.
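To make the evaluation step concrete, the following is a minimal sketch of how extracted metadata and a policy can be combined into a single LLM call on Amazon Bedrock. The model ID, prompt wording, and function signature are illustrative, not the solution's exact implementation:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def evaluate_policy(transcript: str, frame_summaries: list[str], policy: str) -> str:
    """Combine extracted metadata and a policy into one evaluation prompt."""
    prompt = (
        f"Policy:\n{policy}\n\n"
        f"Audio transcript:\n{transcript}\n\n"
        "Frame summaries:\n" + "\n".join(frame_summaries) + "\n\n"
        "Does this video comply with the policy? Give a verdict and a short justification."
    )
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative model choice
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```

Because the policy is simply part of the prompt, changing it requires no model retraining or redeployment, which is what makes the evaluation dynamic.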

The following diagram illustrates the solution’s workflow and architecture.

The solution employs microservice design principles, with loosely coupled components that can be deployed together for video analysis and policy evaluation or independently for integration into existing workflows. The accompanying diagram depicts the microservice architecture.

Microservice Workflow Steps

  1. Users access the frontend static website through an Amazon CloudFront distribution. The static content is hosted on Amazon Simple Storage Service (Amazon S3).
  2. Users log in to the frontend web application and are authenticated via an Amazon Cognito user pool.
  3. Users upload videos directly to Amazon S3 from their browsers using multipart pre-signed URLs (see the first sketch after these steps).
  4. The frontend UI interacts with the extraction microservice through a RESTful interface provided by Amazon API Gateway, enabling CRUD (create, read, update, delete) functionalities for video task extraction management.
  5. An AWS Step Functions state machine manages the analysis process. It transcribes audio using Amazon Transcribe, samples image frames from videos using MoviePy (see the second sketch after these steps), and analyzes each image with Anthropic Claude Sonnet for summarization. It also generates text and multimodal embeddings at the frame level using Amazon Titan models.
  6. An Amazon OpenSearch Service cluster stores the extracted metadata, facilitating video search and discovery.
  7. Through the UI, users select existing template prompts, customize them, and initiate policy evaluations. The solution constructs the evaluation prompts, sends them to Amazon Bedrock LLMs, retrieves the results synchronously, and displays them back to the user.
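Step 3 relies on S3 multipart pre-signed URLs so large videos never transit the backend. The following is a minimal server-side sketch; the bucket, key, and part count are illustrative:

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "media-uploads", "videos/video-1234.mp4"  # illustrative bucket and key

# 1. Start the multipart upload server side and keep the upload ID.
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]

# 2. Hand the browser one pre-signed URL per part; it PUTs each chunk straight to S3.
part_urls = [
    s3.generate_presigned_url(
        "upload_part",
        Params={"Bucket": BUCKET, "Key": KEY, "UploadId": upload_id, "PartNumber": part},
        ExpiresIn=3600,
    )
    for part in range(1, 4)  # three parts, for example
]

# 3. Once the browser reports each part's ETag, close out the upload:
# s3.complete_multipart_upload(
#     Bucket=BUCKET, Key=KEY, UploadId=upload_id,
#     MultipartUpload={"Parts": [{"ETag": etag, "PartNumber": part}, ...]},
# )
```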
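For step 5, fixed-interval frame sampling with MoviePy can be as simple as saving a frame every few seconds. A sketch, assuming MoviePy 1.x and an existing frames/ directory:

```python
from moviepy.editor import VideoFileClip  # MoviePy 1.x import path

def sample_frames(video_path: str, interval_s: int = 5) -> list[str]:
    """Save one frame every interval_s seconds and return the file paths."""
    clip = VideoFileClip(video_path)
    paths = []
    for t in range(0, int(clip.duration), interval_s):
        path = f"frames/frame_{t:06d}.jpg"
        clip.save_frame(path, t=t)  # write the frame at time t to disk
        paths.append(path)
    clip.close()
    return paths
```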

In the subsequent sections, we will delve into the key components and microservices of the solution in greater detail.

Website UI

The solution includes a website that allows users to browse videos and manage the uploading process through an intuitive interface. It provides details on extracted video information and features a lightweight analytics UI for dynamic LLM analysis.

Extracting Information from Videos

The solution has a backend extraction service designed to manage asynchronous video metadata extraction. This encompasses gathering information from both the visual and audio components, which includes identifying objects, scenes, text, and human faces. The audio component is particularly crucial for videos that contain active narratives, as it often holds valuable information.

Creating a robust extraction solution presents challenges from both machine learning (ML) and engineering perspectives. From an ML viewpoint, the aim is to achieve generic information extraction that serves as factual data for subsequent analysis. On the engineering side, handling video sampling with concurrency and providing high availability, flexible configurations, and an extendable architecture to accommodate additional ML model plugins requires significant effort.

The extraction service utilizes Amazon Transcribe to convert the video's audio track into text in subtitle formats.
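A minimal sketch of starting such a transcription job with boto3 follows; the job name, bucket, and chosen subtitle formats are illustrative:

```python
import boto3

transcribe = boto3.client("transcribe")

# Start an asynchronous transcription job that also emits subtitle files.
transcribe.start_transcription_job(
    TranscriptionJobName="video-1234-audio",        # illustrative job name
    Media={"MediaFileUri": "s3://media-uploads/videos/video-1234.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    OutputBucketName="media-uploads",               # where the transcript lands
    Subtitles={"Formats": ["vtt", "srt"]},          # subtitle outputs
)

# Poll get_transcription_job(...) until TranscriptionJobStatus is COMPLETED.
```

For visual extraction, several techniques are employed: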

  • Frame Sampling: A traditional way to analyze video visuals is to sample frames, capturing screenshots at set intervals and applying ML models to each frame. Our solution incorporates sampling with the following considerations:
    • The solution supports a configurable interval for fixed-rate sampling.
    • It also offers an advanced smart sampling option that uses the Amazon Titan Multimodal Embeddings model to run similarity searches against frames sampled from the same video, identifying and discarding near-duplicate frames to optimize performance and cost (see the first sketch after this list).
  • Extracting Information from Image Frames: The solution iterates through the sampled images and processes them concurrently (see the second sketch after this list). For each image, it applies the following ML features to extract information:
    • Recognizes celebrity faces using the Amazon Rekognition celebrity API.
    • Detects generic objects and labels via the Amazon Rekognition label detection API.
    • Identifies text using the Amazon Rekognition text detection API.
    • Flags inappropriate content leveraging the Amazon Rekognition moderation API.
    • Uses the Anthropic Claude 3 Haiku model to summarize image frames.
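The smart sampling option can be approximated with a sketch like the following, which embeds each frame with Titan Multimodal Embeddings and drops a frame when it is too similar to the last one kept. The similarity threshold and the compare-to-previous strategy are simplifications of the solution's similarity search:

```python
import base64
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed_frame(path: str) -> np.ndarray:
    """Get a Titan Multimodal embedding for one image frame."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({"inputImage": image_b64}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe_frames(paths: list[str], threshold: float = 0.95) -> list[str]:
    """Drop a frame when it is too similar to the last frame we kept."""
    kept, last = [], None
    for path in paths:
        vec = embed_frame(path)
        if last is None or cosine(vec, last) < threshold:
            kept.append(path)
            last = vec
    return kept
```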
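The per-frame extraction maps directly onto the Amazon Rekognition image APIs. A condensed sketch follows; the confidence threshold is illustrative, and the Claude 3 Haiku summarization would be a Bedrock call similar to the earlier evaluation sketch:

```python
import boto3

rekognition = boto3.client("rekognition")

def analyze_frame(image_bytes: bytes) -> dict:
    """Run the four Rekognition image APIs the solution uses on one frame."""
    image = {"Bytes": image_bytes}
    return {
        "celebrities": rekognition.recognize_celebrities(Image=image)["CelebrityFaces"],
        "labels": rekognition.detect_labels(Image=image, MinConfidence=80)["Labels"],
        "text": rekognition.detect_text(Image=image)["TextDetections"],
        "moderation": rekognition.detect_moderation_labels(Image=image)["ModerationLabels"],
    }
```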

The extraction service is implemented using Amazon Simple Queue Service (Amazon SQS) and Step Functions to manage concurrent video processing. You can configure how many videos are processed in parallel and how many frames of each video are handled concurrently, based on your account's service quota limits and performance requirements.

Searching for Videos

Effectively locating videos within your inventory is vital, and efficient search capabilities are essential. Because the extracted metadata and frame-level embeddings are indexed in Amazon OpenSearch Service, the solution supports both keyword search across transcripts, labels, and detected text and semantic search against the embeddings, so users can easily navigate their content libraries.
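With frame-level embeddings stored alongside the metadata, semantic search reduces to a k-NN query. A sketch using opensearch-py, with the endpoint, index, and field names assumed rather than taken from the solution:

```python
import json

import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
client = OpenSearch(hosts=["https://search-my-domain.example.com"])  # auth omitted

def semantic_search(query: str, k: int = 5) -> list[dict]:
    """Embed a text query and run a k-NN search over frame embeddings."""
    # Use the same Titan Multimodal model that embedded the frames.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({"inputText": query}),
    )
    vector = json.loads(response["body"].read())["embedding"]
    result = client.search(
        index="video-frames",  # assumed index name
        body={"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}},
    )
    return result["hits"]["hits"]
```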


