Introducing Enhanced Model Capabilities and Reduced Annotation Requirements for Amazon Comprehend Custom Entity Recognition

Update August 3, 2022: The minimum requirements for training entity recognizers have been significantly lowered. You can now create a custom entity recognition model with just three documents and 25 annotations per entity type. For further details, please refer to the Amazon Comprehend Guidelines and quotas webpage, as well as our blog post discussing the limit reduction.

Amazon Comprehend is a natural language processing (NLP) service that offers APIs for extracting key phrases, contextual entities, events, and sentiments from unstructured text, among other functionalities. Entities in your documents can include people, places, organizations, credit card numbers, and more. If you need to recognize entity types specific to your business—like proprietary part codes or industry-specific terms—custom entity recognition (CER) in Amazon Comprehend allows you to train models tailored to your unique needs in just a few simple steps. By providing an adequate amount of data, you can effectively train your model to identify virtually any kind of entity.

Creating an entity recognizer from scratch demands extensive knowledge of machine learning (ML) and a complex optimization process. However, Amazon Comprehend simplifies this through a method called transfer learning, utilizing foundational models that have been trained on data collected by Amazon Comprehend and optimized for entity recognition tasks. With this framework in place, all you need to do is supply your data. The accuracy of ML models generally hinges on both the volume and quality of the data used. Obtaining high-quality annotated data can be a tedious process.

Previously, training an Amazon Comprehend custom entity recognizer required a minimum of 1,000 documents and 200 annotations per entity. However, we’re excited to announce that we have enhanced the underlying models for the Amazon Comprehend custom entity API, reducing the minimum training requirements to just 250 documents and 100 annotations per entity (also known as shots). This means you can now train Amazon Comprehend CER models to predict entities with improved accuracy. To leverage the enhanced performance of the new CER model framework, you simply need to retrain and deploy your existing models.
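As an illustration, training a new recognizer version on your existing data and deploying it behind an endpoint with the AWS SDK for Python (Boto3) looks roughly like the following sketch. The S3 locations, IAM role ARN, recognizer name, and entity types are placeholders rather than values used in this post:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Start training a custom entity recognizer; the bucket paths, role ARN, and
# entity types below are placeholders for your own resources.
response = comprehend.create_entity_recognizer(
    RecognizerName="my-cer-model-v2",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "EntityTypes": [
            {"Type": "PERSON"},
            {"Type": "LOCATION"},
            {"Type": "ORGANIZATION"},
        ],
        "Documents": {"S3Uri": "s3://my-bucket/cer/train/documents.txt"},
        "Annotations": {"S3Uri": "s3://my-bucket/cer/train/annotations.csv"},
    },
)
recognizer_arn = response["EntityRecognizerArn"]

# After the recognizer reaches the TRAINED status, deploy it behind a
# real-time endpoint for synchronous inference.
endpoint = comprehend.create_endpoint(
    EndpointName="my-cer-endpoint",
    ModelArn=recognizer_arn,
    DesiredInferenceUnits=1,
)
```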

To demonstrate the model enhancements, we compared the results of the previous models with those of the latest release. We selected a diverse set of open-source entity recognition datasets spanning multiple domains and languages to highlight the improvements. In this post, we walk through the training and inference results for both the previous and the new CER model versions.

Datasets

When training an Amazon Comprehend CER model, you provide the entities you wish the custom model to recognize along with the documents that contain these entities. You can use either entity lists or annotations for training. Entity lists are CSV files that include the text (a word or words) of an entity example from the training document, along with a label denoting the entity type. Annotations allow you to provide the positional offset of entities within a sentence, alongside the corresponding entity type. By including the entire sentence, you provide contextual reference for the entities, which enhances the model’s accuracy.
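To make the two formats concrete, the following Python sketch writes a one-row example of each file. The column headers follow our reading of the Amazon Comprehend CER documentation, the sample values (file name, offsets, entity) are purely illustrative, and the offsets use a zero-based, end-exclusive convention; verify both against the documentation for your own dataset.

```python
import csv

# Illustrative entity list: the entity text and a label for its type.
with open("entity_list.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Text", "Type"])
    writer.writerow(["John Smith", "PERSON"])

# Illustrative annotations file: the training document the entity appears in,
# the line number within that document, the character offsets of the entity on
# that line, and its type. Offsets assume line 0 starts with "John Smith".
with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    writer.writerow(["documents.txt", 0, 0, 10, "PERSON"])
```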

For our training, we opted for the annotations method to label our entities since the datasets we selected already included annotations for each entity type represented. Below, we discuss the datasets chosen and their descriptions.

CoNLL

The Conference on Computational Natural Language Learning (CoNLL) provides datasets for language-independent named entity recognition in English, Spanish, and German. These datasets include four types of named entities: persons, locations, organizations, and miscellaneous entities that do not fit into the other three categories.

We used the CoNLL-2003 dataset for English and the CoNLL-2002 dataset for Spanish in our entity recognition training. We applied some basic transformations to convert the annotations into the format Amazon Comprehend CER expects, mapping the entity tags from their abbreviated notation (PER, ORG, LOC, MISC) to the words they represent: person, organization, location, and miscellaneous.
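The exact preprocessing scripts aren't reproduced here, but the transformation is conceptually simple. The sketch below assumes the token-per-line IOB tagging used by the CoNLL files, that each sentence is written to the training documents file with tokens joined by single spaces, and an end-exclusive offset convention; the function name and tag-to-type mapping are our own illustration, not the exact code used.

```python
# Map abbreviated CoNLL tags to the full entity type names used for training.
TAG_TO_TYPE = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION", "MISC": "MISCELLANEOUS"}

def conll_sentence_to_annotations(tokens, tags, doc_file, line_number):
    """Convert one IOB-tagged sentence into (File, Line, Begin Offset, End Offset, Type) rows."""
    annotations = []
    offset = 0          # character offset of the current token in the space-joined sentence
    start = end = None  # span of the entity currently being built
    current = None      # its mapped entity type
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and current == TAG_TO_TYPE.get(tag[2:])
        if current is not None and not continues:
            # Flush the entity we were building before starting a new one.
            annotations.append((doc_file, line_number, start, end, current))
            start = current = None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = offset, TAG_TO_TYPE[tag[2:]]
        if current is not None:
            end = offset + len(token)
        offset += len(token) + 1  # +1 for the space joining tokens
    if current is not None:
        annotations.append((doc_file, line_number, start, end, current))
    return annotations

# Example: "John Smith visited Berlin" yields a PERSON span (0, 10) and a LOCATION span (19, 25).
rows = conll_sentence_to_annotations(
    ["John", "Smith", "visited", "Berlin"],
    ["B-PER", "I-PER", "O", "B-LOC"],
    "documents.txt",
    0,
)
```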

SNIPS

The SNIPS dataset was created in 2017 as part of benchmarking tests for natural language understanding (NLU) by Snips. The outcomes from these tests are documented in the 2018 paper “Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces” by Coucke et al. We focused on the GetWeather and AddToPlaylist datasets for our experiments. For the GetWeather dataset, the entities we considered included timerange, city, state, condition_description, country, and condition_temperature. For AddToPlaylist, we examined the entities artist, playlist_owner, playlist, music_item, and entity_name.

Sampling Configuration

The table below summarizes the dataset configuration for our tests. Each row reflects an Amazon Comprehend CER model that was trained, deployed, and used for entity prediction with our test dataset.

| Dataset | Published Year | Language | Number of Documents Sampled for Training | Number of Entities Sampled | Number of Annotations per Entity (Shots) | Number of Documents Sampled for Blind Test Inference (Unseen During Training) |
|---|---|---|---|---|---|---|
| SNIPS-AddToPlaylist | 2017 | English | 254 | 5 | artist: 101, playlist_owner: 148, playlist: 254, music_item: 100, entity_name: 100 | 100 |
| SNIPS-GetWeather | 2017 | English | 600 | 6 | timeRange: 281, city: 211, state: 111, condition_description: 121, country: 117, condition_temperature: 115 | 200 |
| SNIPS-GetWeather | 2017 | English | 1000 | 6 | timeRange: 544, city: 428, state: 248, condition_description: 241, country: 230, condition_temperature: 228 | 200 |
| SNIPS-GetWeather | 2017 | English | 2000 | 6 | timeRange: 939, city: 770, state: 436, condition_description: 401, country: 451, condition_temperature: 431 | 200 |
| CoNLL | 2003 | English | 350 | 3 | Location: 183, Organization: 111, Person: 229 | 200 |
| CoNLL | 2003 | English | 600 | 3 | Location: 384, Organization: 210, Person: 422 | 200 |
| CoNLL | 2003 | English | 1000 | 4 | Location: 581, Miscellaneous: 185, Organization: 375, Person: 658 | 200 |
| CoNLL | 2003 | English | 2000 | 4 | Location: 1133, Miscellaneous: 499, Organization: 696, Person: 1131 | 200 |
| CoNLL | 2002 | Spanish | 380 | 4 | Location: 208, Miscellaneous: 103, Organization: 404, Person: 207 | 200 |
