Speed Up Your Video Clip Searches with Amazon Rekognition and AWS Elemental MediaConvert


In the fast-paced world of news, entertainment, and daytime shows, clips are frequently utilized to craft narratives, introduce guests, and highlight key moments. Time is often of the essence when searching for specific clips, as stories unfold or personalities are in the limelight. This article explores how to quickly locate video clips in your archives using Amazon Rekognition and AWS Elemental MediaConvert.

Searching through content archives to find clips can be labor-intensive and challenging, primarily because indexing typically occurs at the file level, where each file represents a complete show. To pinpoint the ideal clip, you often need to sift through various files that may contain the desired person or visual, examining the entire file to find suitable segments. This becomes even more complicated when you’re looking for combinations, such as two celebrities appearing simultaneously. Under time constraints, many clients have reported opting to acquire a clip from another organization—a clip they likely own but couldn’t locate in time.

In this article, we present a method to create a searchable index of video clips using Amazon Rekognition to identify segments and metadata, coupled with AWS Elemental MediaConvert to extract the source file clips. With this searchable index at your disposal, finding the right clips becomes a much quicker process.

Solution

Creating the searchable index involves three essential steps:

  1. Detect segments, labels, and people using Amazon Rekognition Video
  2. Index the metadata for each clip utilizing Amazon Elasticsearch Service
  3. Generate individual proxy video clips from the main file using AWS Elemental MediaConvert

The first step employs Amazon Rekognition Video to identify labels, individuals, and segments. This service simplifies the process of incorporating image and video analysis into your applications, leveraging proven, scalable deep learning technology that requires no specialized machine learning knowledge. With Amazon Rekognition, you can recognize objects, people, text, scenes, and activities within images and videos, in addition to detecting inappropriate content. This solution employs the Celebrity Recognition, Face Search, and Label Detection API calls to asynchronously identify celebrities and other elements in videos.
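To give a sense of what kicking off this analysis looks like, here is a minimal sketch using the Python SDK (boto3). The bucket, key, IAM role, SNS topic, and face collection names are placeholders, not values prescribed by the solution:

import boto3

rekognition = boto3.client("rekognition")

video = {"S3Object": {"Bucket": "my-archive-bucket", "Name": "shows/episode-001.mp4"}}
# SNS topic and IAM role that Amazon Rekognition uses to publish completion
# notifications (placeholder ARNs).
notification = {
    "RoleArn": "arn:aws:iam::111122223333:role/RekognitionServiceRole",
    "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:rekognition-jobs",
}

# Each call is asynchronous and returns a JobId; results are fetched later
# with the matching Get* API once the completion notification arrives.
celeb_job = rekognition.start_celebrity_recognition(
    Video=video, NotificationChannel=notification
)
label_job = rekognition.start_label_detection(
    Video=video, NotificationChannel=notification, MinConfidence=80.0
)
face_job = rekognition.start_face_search(
    Video=video,
    NotificationChannel=notification,
    CollectionId="my-face-collection",  # an existing face collection (placeholder)
)

print(celeb_job["JobId"], label_job["JobId"], face_job["JobId"])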

The Amazon Rekognition Segment API uses machine learning to identify shot boundaries (changes in camera angle) and technical cues like end credits and black frames from videos stored in an Amazon S3 bucket. This segment detection provides frame-accurate timecodes and is compatible with SMPTE (Drop Frame and Non-Drop Frame) timecodes. You can obtain the start and end timecodes, as well as the duration for each shot boundary and technical cue event. For more information about video segmentation, check out this blog post.

Next, we utilize Amazon Elasticsearch Service (Amazon ES), a fully managed service that allows you to deploy, secure, and operate Elasticsearch effectively at scale. By writing metadata for each clip—including face and labels data from Amazon Rekognition—you establish a searchable index of clips. This enables searches for cast members, visible objects, or combinations of terms.
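As a rough sketch of that indexing step, the snippet below writes one document per shot to an Amazon ES domain using the elasticsearch Python client with SigV4 request signing. The domain endpoint, index name, and document fields are illustrative assumptions:

import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

region = "us-east-1"  # assumption
host = "search-clip-index-xxxx.us-east-1.es.amazonaws.com"  # Amazon ES endpoint (placeholder)

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, "es", session_token=credentials.token)

es = Elasticsearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=awsauth, use_ssl=True, verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# One document per detected shot, combining segment timing with the labels
# and celebrities Rekognition found inside that time range.
clip_doc = {
    "sourceFile": "s3://my-archive-bucket/shows/episode-001.mp4",
    "shotIndex": 4,
    "startTimecode": "00:00:14:22",
    "endTimecode": "00:00:29:01",
    "celebrities": ["Celebrity A", "Celebrity B"],
    "labels": ["Bridge", "Outdoors", "Person"],
}
es.index(index="clips", id="episode-001-shot-4", body=clip_doc)

# Example search: shots where two specific people appear together.
results = es.search(index="clips", body={
    "query": {"bool": {"must": [
        {"match": {"celebrities": "Celebrity A"}},
        {"match": {"celebrities": "Celebrity B"}},
    ]}}
})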

The final optional step involves using AWS Elemental MediaConvert, a file-based video transcoding service with broadcasting-grade capabilities. This service is widely used for content preparation and creating video-on-demand (VOD) content for broadcast and multiscreen delivery. In this solution, AWS Elemental MediaConvert allows you to produce proxies for each clip for quick browsing. Its clipping and stitching features enable you to transcode specific segments from longer video files to create new clips.
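To illustrate the clipping feature, here is a hedged boto3 sketch of a MediaConvert job that transcodes a single shot into a small MP4 proxy. The IAM role, bucket paths, and output settings are assumptions you would adapt to your environment:

import boto3

# MediaConvert requires an account-specific endpoint.
mc = boto3.client("mediaconvert", region_name="us-east-1")
endpoint = mc.describe_endpoints()["Endpoints"][0]["Url"]
mediaconvert = boto3.client("mediaconvert", region_name="us-east-1", endpoint_url=endpoint)

job = mediaconvert.create_job(
    Role="arn:aws:iam::111122223333:role/MediaConvertRole",  # placeholder IAM role
    Settings={
        "Inputs": [{
            "FileInput": "s3://my-archive-bucket/shows/episode-001.mp4",
            "TimecodeSource": "ZEROBASED",
            # Clip out one shot using the segment's frame-accurate timecodes.
            "InputClippings": [{
                "StartTimecode": "00:00:14:22",
                "EndTimecode": "00:00:29:01",
            }],
            "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}},
            "VideoSelector": {},
        }],
        "OutputGroups": [{
            "Name": "Proxies",
            "OutputGroupSettings": {
                "Type": "FILE_GROUP_SETTINGS",
                "FileGroupSettings": {"Destination": "s3://my-proxy-bucket/clips/"},
            },
            "Outputs": [{
                "NameModifier": "-shot4-proxy",
                "ContainerSettings": {"Container": "MP4", "Mp4Settings": {}},
                "VideoDescription": {
                    "Width": 640, "Height": 360,
                    "CodecSettings": {
                        "Codec": "H_264",
                        "H264Settings": {"RateControlMode": "QVBR", "MaxBitrate": 1000000},
                    },
                },
                "AudioDescriptions": [{
                    "CodecSettings": {
                        "Codec": "AAC",
                        "AacSettings": {"Bitrate": 96000, "CodingMode": "CODING_MODE_2_0", "SampleRate": 48000},
                    },
                }],
            }],
        }],
    },
)
print(job["Job"]["Id"])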

Now, let’s delve into each step in detail.

1. Detecting Segments, Labels, and People

The Amazon Rekognition Segment API operates asynchronously on stored videos. You can use the Amazon Rekognition Shot Detection Demo's web interface for this solution, or the AWS CLI or an SDK in a language such as Java or Python. Initiate shot detection with the StartSegmentDetection API call, then retrieve the results with the GetSegmentDetection API call. The Segment API provides technical cues and shot detection within the same request, allowing you to choose which features to run.

An example request for StartSegmentDetection to commence shot detection, notify an SNS topic, and set minimum confidence values to 80% might look like this:

{
  "Video": {
    "S3Object": {
      "Bucket": "{s3BucketName}",
      "Name": "{filenameAndExtension}"
    }
  },
  "NotificationChannel": {
    "RoleArn": "arn:aws:iam::{accountId}:role/{roleName}",
    "SNSTopicArn": "arn:aws:sns:{region}:{accountNumber}:{topicName}"
  },
  "SegmentTypes": [
    "SHOT",
    "TECHNICAL_CUE"
  ],
  "Filters": {
    "ShotFilter": {
      "MinSegmentConfidence": 80.0
    },
    "TechnicalCueFilter": {
      "MinSegmentConfidence": 80.0
    }
  }
}

The response from StartSegmentDetection includes a JobId value for retrieving results. When the video analysis finishes, Amazon Rekognition Video sends the completion status to the SNS topic, after which you can call the GetSegmentDetection API with that JobId.
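A brief sketch of retrieving the results with boto3, paginating through NextToken until all segments are collected (the JobId is whatever StartSegmentDetection returned; the job ID string below is a placeholder):

import boto3

rekognition = boto3.client("rekognition")

def get_all_segments(job_id):
    """Collect every SHOT and TECHNICAL_CUE segment for a finished job."""
    segments, next_token = [], None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if next_token:
            kwargs["NextToken"] = next_token
        response = rekognition.get_segment_detection(**kwargs)
        segments.extend(response["Segments"])
        next_token = response.get("NextToken")
        if not next_token:
            return segments

shots = [s for s in get_all_segments("job-id-from-start-call")
         if s["Type"] == "SHOT"]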

In the response, you’ll find Segments sections that detail technical cues and shots. Here’s a brief example of what that might look like:

{
  "JobStatus": "SUCCEEDED",
  "Segments": [
    {
      "Type": "SHOT",
      "StartTimestampMillis": 0,
      "EndTimestampMillis": 29041,
      "DurationMillis": 29041,
      "StartTimecodeSMPTE": "00:00:00:00",
      "EndTimecodeSMPTE": "00:00:29:01",
      "DurationSMPTE": "00:00:29:01",
      "ShotSegment": {
        "Index": 0, 
        "Confidence": 87.50452423095703
      }
    }
  ]
}

Crucially, the segment boundaries are frame accurate. In the example, start and end times are provided both as Timestamps (milliseconds) and as Timecodes (HH:MM:SS:FF). Other Amazon Rekognition APIs, such as label and celebrity detection, report their results as millisecond timestamps, which makes it straightforward to match those detections against the shot boundaries.
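If you also want those millisecond timestamps in timecode form, a small helper like the following can convert them, assuming a known constant integer frame rate (non-drop-frame math only; drop-frame rates such as 29.97 need extra handling):

def ms_to_timecode(milliseconds, fps=24):
    """Convert a millisecond timestamp to an HH:MM:SS:FF timecode.

    Non-drop-frame only; assumes an integer frame rate.
    """
    total_frames = int(round(milliseconds / 1000 * fps))
    frames = total_frames % fps
    total_seconds = total_frames // fps
    seconds = total_seconds % 60
    minutes = (total_seconds // 60) % 60
    hours = total_seconds // 3600
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"

# 29041 ms at 24 fps comes out as "00:00:29:01", matching the example above.
print(ms_to_timecode(29041, fps=24))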

Let’s take the open-source movie Tears of Steel (IMDb: Tears of Steel) as an illustration of this process. Images and videos in this article are credited to the Blender Foundation and shared under the Creative Commons Attribution 3.0 license. Using segment detection, you can identify all the shots and technical cues in the video file. For example, in shot 4 two characters argue on a bridge in Amsterdam; its start and end times appear in both Timestamp and Timecode formats in the JSON output.

