Speed Up Video Clip Discovery with Amazon Rekognition and AWS Elemental MediaConvert


In the fast-paced world of news, entertainment, and daytime shows, the demand for quick access to video clips is paramount. Whether crafting stories, introducing guests, or showcasing highlights, producers often face tight deadlines when searching for the right footage. This article shows how to efficiently locate video clips in your archive using Amazon Rekognition and AWS Elemental MediaConvert.

Searching through content archives can be a tedious process, primarily because indexing is typically done at the file level, where each file represents an entire show. To pinpoint the ideal clip, one has to sift through candidate files that may contain the desired content, reviewing each file in full for suitable segments. This task becomes even more challenging when looking for specific combinations, such as two celebrities appearing together. Under time constraints, it’s not uncommon for customers to license clips from other organizations even when they own the rights to the footage themselves, simply because they can’t find it quickly enough.

In this article, we will explain how to create a searchable index of video clips using Amazon Rekognition to identify segments and metadata, alongside AWS Elemental MediaConvert to generate clips from the source file. This searchable index allows for faster retrieval of the right clips.

Solution Overview

Establishing a searchable index of clips involves three fundamental steps:

  1. Utilize Amazon Rekognition Video to detect segments, labels, and people
  2. Index the metadata for each clip using Amazon Elasticsearch Service
  3. Create individual proxy video clips from the source file utilizing AWS Elemental MediaConvert

The initial step employs Amazon Rekognition Video to detect labels, people, and segments. This powerful tool leverages highly scalable deep learning technology that doesn’t require machine learning expertise. Amazon Rekognition allows you to identify objects, people, text, scenes, and activities in videos, as well as flag inappropriate content. For this solution, we utilize the Celebrity Recognition, Face Search, and Label Detection API calls in Amazon Rekognition to detect celebrities, labels, and faces within videos asynchronously.
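
For reference, here is a minimal Python (boto3) sketch of starting these asynchronous jobs. The bucket name, file name, and face collection ID are placeholders, and in practice you would also pass a NotificationChannel so Amazon Rekognition can publish completion status to an SNS topic.

import boto3

rekognition = boto3.client('rekognition')
video = {'S3Object': {'Bucket': 'my-media-archive', 'Name': 'tears_of_steel.mp4'}}  # placeholder bucket/file

# Each call starts an asynchronous job and returns a JobId used to fetch results later.
celebrity_job = rekognition.start_celebrity_recognition(Video=video)
label_job = rekognition.start_label_detection(Video=video, MinConfidence=80.0)
face_job = rekognition.start_face_search(Video=video, CollectionId='my-face-collection')  # hypothetical face collection

print(celebrity_job['JobId'], label_job['JobId'], face_job['JobId'])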

The Amazon Rekognition Segment API employs Machine Learning (ML) to identify shot boundaries (changes in camera shots) and technical cues like end credits and black frames in videos stored in an Amazon S3 bucket. With segment detection, you receive frame-accurate timecodes that comply with SMPTE (both Drop Frame and Non-Drop Frame). This means you can obtain precise start and end timecodes, along with the duration of each shot boundary and technical cue event. For a more in-depth exploration of video segmentation features, check out this additional blog post.

The second step involves the use of Amazon Elasticsearch Service (Amazon ES), a fully managed service that simplifies deploying, securing, and running Elasticsearch efficiently at scale. By writing metadata to Amazon ES for each clip—including face and label data from Amazon Rekognition—you create a searchable index. This allows you to search for cast members, visible objects, or combinations of terms.
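
As an illustration, the following sketch indexes one clip document into Amazon ES using SigV4-signed HTTP requests (via the requests and requests-aws4auth libraries). The domain endpoint, index name, and document fields shown here are assumptions rather than a prescribed schema.

import boto3
import requests
from requests_aws4auth import AWS4Auth

region = 'us-east-1'  # assumed Region
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, 'es',
                   session_token=credentials.token)

es_endpoint = 'https://my-clip-index.us-east-1.es.amazonaws.com'  # hypothetical domain endpoint

# One document per clip, combining segment timing with Rekognition face and label metadata.
clip_doc = {
    'SourceFile': 's3://my-media-archive/tears_of_steel.mp4',
    'ShotIndex': 4,
    'StartTimecodeSMPTE': '00:01:05:12',
    'EndTimecodeSMPTE': '00:01:12:03',
    'Celebrities': ['Celebrity A', 'Celebrity B'],
    'Labels': ['Bridge', 'Person', 'Outdoors'],
}

response = requests.put(f'{es_endpoint}/clips/_doc/tears_of_steel-shot-0004',
                        auth=awsauth, json=clip_doc)
response.raise_for_status()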

The final optional component utilizes AWS Elemental MediaConvert, a file-based video transcoding service with broadcast-quality features. It’s widely adopted for content preparation and generating video-on-demand (VOD) content for broad distribution. In this context, AWS Elemental MediaConvert can be used to create proxies for each clip, facilitating fast browsing. This service includes features for clipping and stitching, allowing you to transcode specific segments from longer video files into new clips.
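
A minimal sketch of such a clipping job with boto3 is shown below. The bucket paths, role ARN, and proxy output settings are assumptions, and the StartTimecode/EndTimecode values would come from the segment metadata returned by Amazon Rekognition.

import boto3

# MediaConvert requires an account-specific endpoint, discovered via DescribeEndpoints.
mc_generic = boto3.client('mediaconvert', region_name='us-east-1')
endpoint = mc_generic.describe_endpoints()['Endpoints'][0]['Url']
mediaconvert = boto3.client('mediaconvert', region_name='us-east-1', endpoint_url=endpoint)

job = mediaconvert.create_job(
    Role='arn:aws:iam::123456789012:role/MediaConvertRole',  # hypothetical service role
    Settings={
        'Inputs': [{
            'FileInput': 's3://my-media-archive/tears_of_steel.mp4',
            'TimecodeSource': 'ZEROBASED',  # aligns with Rekognition's zero-based SMPTE timecodes
            'InputClippings': [{
                'StartTimecode': '00:01:05:12',
                'EndTimecode': '00:01:12:03',
            }],
            'AudioSelectors': {'Audio Selector 1': {'DefaultSelection': 'DEFAULT'}},
            'VideoSelector': {},
        }],
        'OutputGroups': [{
            'Name': 'Proxy clips',
            'OutputGroupSettings': {
                'Type': 'FILE_GROUP_SETTINGS',
                'FileGroupSettings': {'Destination': 's3://my-media-archive/proxies/'},
            },
            'Outputs': [{
                'ContainerSettings': {'Container': 'MP4', 'Mp4Settings': {}},
                'VideoDescription': {
                    'Width': 640,
                    'Height': 360,
                    'CodecSettings': {
                        'Codec': 'H_264',
                        'H264Settings': {'RateControlMode': 'CBR', 'Bitrate': 1000000},
                    },
                },
                'AudioDescriptions': [{
                    'CodecSettings': {
                        'Codec': 'AAC',
                        'AacSettings': {'Bitrate': 96000, 'CodingMode': 'CODING_MODE_2_0', 'SampleRate': 48000},
                    },
                }],
            }],
        }],
    },
)
print(job['Job']['Id'])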

Below is a diagram illustrating the high-level workflow.

We will now delve into each step in detail.

1. Detecting Segments, Labels, and People

The Amazon Rekognition Segment API is an asynchronous operation that can be invoked for stored videos. For this solution, you can utilize the Amazon Rekognition Shot Detection Demo’s web interface, but you can also employ the CLI or SDK for programming languages such as Java and Python. You can initiate shot detection with the StartSegmentDetection API and later retrieve the results using the GetSegmentDetection API. The Amazon Rekognition Segment API is a composite API, providing both technical cues and shot detection in one call; you can specify whether to run one, the other, or both. Below is an example request for StartSegmentDetection, which requests both shot and technical cue detection, publishes the completion status to an SNS topic, and sets the minimum confidence for each segment type to 80%.

{
  "Video": {
    "S3Object": {
      "Bucket": "{s3BucketName}",
      "Name": "{filenameAndExtension}"
    }
  },
  "NotificationChannel": {
    "RoleArn": "arn:aws:iam::{accountId}:role/{roleName}",
    "SNSTopicArn": "arn:aws:sns:{region}:{accountNumber}:{topicName}"
  },
  "SegmentTypes": [
    "SHOT",
    "TECHNICAL_CUE"
  ],
  "Filters": {
    "ShotFilter": {
      "MinSegmentConfidence": 80.0
    },
    "TechnicalCueFilter": {
      "MinSegmentConfidence": 80.0
    }
  }
}
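
The equivalent request with the Python SDK might look like the following sketch; the bucket, file, role, and topic names are placeholders.

import boto3

rekognition = boto3.client('rekognition')

response = rekognition.start_segment_detection(
    Video={'S3Object': {'Bucket': 'my-media-archive', 'Name': 'tears_of_steel.mp4'}},
    NotificationChannel={
        'RoleArn': 'arn:aws:iam::123456789012:role/RekognitionSNSRole',
        'SNSTopicArn': 'arn:aws:sns:us-east-1:123456789012:SegmentDetectionComplete',
    },
    SegmentTypes=['SHOT', 'TECHNICAL_CUE'],
    Filters={
        'ShotFilter': {'MinSegmentConfidence': 80.0},
        'TechnicalCueFilter': {'MinSegmentConfidence': 80.0},
    },
)
job_id = response['JobId']  # used later with GetSegmentDetection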

Upon initiating StartSegmentDetection, you will receive a JobId value that can be used to fetch the results. Once the video analysis is complete, Amazon Rekognition Video will publish the completion status to the SNS topic, allowing you to call the GetSegmentDetection API using that JobId. Here’s an example request:

{
  "JobId": "1234456789d0ea2fbc59d97cb69a72a5495da75851976b14a1784ca12345678",
  "MaxResults": 10,
  "NextToken": "wxyzWXYZMOGDhzBzYUhS5puM+g1IgezqFeYpv/H/+5noP/LmM57FitUAwSQ5D6G4AB/PNwolrw=="
}

The response will include a Segments section detailing technical cues and shots. Here’s a snippet of the response:

{
  "JobStatus": "SUCCEEDED",
  "Segments": [
    { 
      "Type": "SHOT",
      "StartTimestampMillis": 0,
      "EndTimestampMillis": 29041,
      "DurationMillis": 29041,
      "StartTimecodeSMPTE": "00:00:00:00",
      "EndTimecodeSMPTE": "00:00:29:01",
      "DurationSMPTE": "00:00:29:01",
      "ShotSegment": {
        "Index": 0, 
        "Confidence": 87.50452423095703
      }
    }
  ]
}
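
Because GetSegmentDetection responses are paginated, a small helper like the sketch below (the job ID is a placeholder) can collect every segment before separating shots from technical cues.

import boto3

rekognition = boto3.client('rekognition')

def collect_segments(job_id):
    """Page through GetSegmentDetection results and return every segment."""
    segments = []
    next_token = None
    while True:
        kwargs = {'JobId': job_id, 'MaxResults': 1000}
        if next_token:
            kwargs['NextToken'] = next_token
        response = rekognition.get_segment_detection(**kwargs)
        segments.extend(response['Segments'])
        next_token = response.get('NextToken')
        if not next_token:
            break
    return segments

shots = [s for s in collect_segments('{jobId}') if s['Type'] == 'SHOT']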

The accuracy of segment boundaries is crucial, and in the example response above, you can see both millisecond Timestamp and SMPTE Timecode formats for the start and end of each segment. Other Rekognition Video APIs return timestamps in milliseconds as well.
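
If a downstream system works in milliseconds, or you want to sanity-check the returned timecodes, a non-drop-frame conversion is straightforward. The sketch below assumes a 24 fps source; drop-frame rates such as 29.97 fps require the SMPTE drop-frame rules instead.

def millis_to_smpte(millis, fps=24):
    """Convert a millisecond timestamp to a non-drop-frame HH:MM:SS:FF timecode."""
    total_frames = int(round(millis / 1000.0 * fps))
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours, minutes, seconds = total_seconds // 3600, (total_seconds // 60) % 60, total_seconds % 60
    return f'{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}'

# 29041 ms at 24 fps reproduces the "00:00:29:01" end timecode from the sample response above.
print(millis_to_smpte(29041))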

To illustrate the process, we will reference the open-source film Tears of Steel (IMDB: Tears of Steel). Images and videos utilized in this article are courtesy of the Blender Foundation, shared under Creative Commons Attribution 3.0 license. Using Segment Detection, you can uncover all shots and technical cues in the video file. For instance, shot 4 displays two characters arguing on a bridge in Amsterdam, with start and end times shown in both Timestamp and Timecode formats in the JSON output.
