Introducing Enhanced Model Capabilities and Reduced Annotation Requirements for Amazon Comprehend Custom Entity Recognition

Update August 3, 2022: The minimum requirements for training entity recognizers have been significantly lowered. You can now create a custom entity recognition model with just three documents and 25 annotations per entity type. For further details, please refer to the Amazon Comprehend Guidelines and quotas webpage, as well as our blog post discussing the limit reduction.

Amazon Comprehend is a natural language processing (NLP) service that offers APIs for extracting key phrases, contextual entities, events, and sentiments from unstructured text, among other functionalities. Entities in your documents can include people, places, organizations, credit card numbers, and more. If you need to recognize entity types specific to your business—like proprietary part codes or industry-specific terms—custom entity recognition (CER) in Amazon Comprehend allows you to train models tailored to your unique needs in just a few simple steps. By providing an adequate amount of data, you can effectively train your model to identify virtually any kind of entity.

Creating an entity recognizer from scratch demands extensive knowledge of machine learning (ML) and a complex optimization process. However, Amazon Comprehend simplifies this through a method called transfer learning, utilizing foundational models that have been trained on data collected by Amazon Comprehend and optimized for entity recognition tasks. With this framework in place, all you need to do is supply your data. The accuracy of ML models generally hinges on both the volume and quality of the data used. Obtaining high-quality annotated data can be a tedious process.

Previously, training an Amazon Comprehend custom entity recognizer required a minimum of 1,000 documents and 200 annotations per entity. However, we’re excited to announce that we have enhanced the underlying models for the Amazon Comprehend custom entity API, reducing the minimum training requirements to just 250 documents and 100 annotations per entity (also known as shots). This means you can now train Amazon Comprehend CER models to predict entities with improved accuracy. To leverage the enhanced performance of the new CER model framework, you simply need to retrain and deploy your existing models.
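As an illustration, training a new recognizer version on your existing data and deploying it behind an endpoint with the AWS SDK for Python (Boto3) looks roughly like the following sketch. The S3 locations, IAM role ARN, recognizer name, and entity types are placeholders rather than values used in this post:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Start training a custom entity recognizer; the bucket paths, role ARN, and
# entity types below are placeholders for your own resources.
response = comprehend.create_entity_recognizer(
    RecognizerName="my-cer-model-v2",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "EntityTypes": [
            {"Type": "PERSON"},
            {"Type": "LOCATION"},
            {"Type": "ORGANIZATION"},
        ],
        "Documents": {"S3Uri": "s3://my-bucket/cer/train/documents.txt"},
        "Annotations": {"S3Uri": "s3://my-bucket/cer/train/annotations.csv"},
    },
)
recognizer_arn = response["EntityRecognizerArn"]

# After the recognizer reaches the TRAINED status, deploy it behind a
# real-time endpoint for synchronous inference.
endpoint = comprehend.create_endpoint(
    EndpointName="my-cer-endpoint",
    ModelArn=recognizer_arn,
    DesiredInferenceUnits=1,
)
```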

To demonstrate the model enhancements, we compared the results of the previous models with those of the latest release. We selected a diverse set of open-source entity recognition datasets spanning multiple domains and languages to highlight the improvements. In this post, we walk through the training and inference results for both the previous and the new CER model versions.

Datasets

When training an Amazon Comprehend CER model, you provide the entities you wish the custom model to recognize along with the documents that contain these entities. You can use either entity lists or annotations for training. Entity lists are CSV files that include the text (a word or words) of an entity example from the training document, along with a label denoting the entity type. Annotations allow you to provide the positional offset of entities within a sentence, alongside the corresponding entity type. By including the entire sentence, you provide contextual reference for the entities, which enhances the model’s accuracy.
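To make the two formats concrete, the following Python sketch writes a one-row example of each file. The column headers follow our reading of the Amazon Comprehend CER documentation, the sample values (file name, offsets, entity) are purely illustrative, and the offsets use a zero-based, end-exclusive convention; verify both against the documentation for your own dataset.

```python
import csv

# Illustrative entity list: the entity text and a label for its type.
with open("entity_list.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Text", "Type"])
    writer.writerow(["John Smith", "PERSON"])

# Illustrative annotations file: the training document the entity appears in,
# the line number within that document, the character offsets of the entity on
# that line, and its type. Offsets assume line 0 starts with "John Smith".
with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    writer.writerow(["documents.txt", 0, 0, 10, "PERSON"])
```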

For our training, we opted for the annotations method to label our entities since the datasets we selected already included annotations for each entity type represented. Below, we discuss the datasets chosen and their descriptions.

CoNLL

The Conference on Computational Natural Language Learning (CoNLL) provides datasets for language-independent named entity recognition in English, Spanish, and German. These datasets include four types of named entities: persons, locations, organizations, and miscellaneous entities that do not fit into the other three categories.

We used the CoNLL-2003 dataset for English and the CoNLL-2002 dataset for Spanish in our entity recognition training. We applied some basic transformations to convert the annotations into the format Amazon Comprehend CER expects, mapping the entity tags from their abbreviated notation (PER, ORG, LOC, MISC) to the words they represent: person, organization, location, and miscellaneous.
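The exact preprocessing scripts aren't reproduced here, but the transformation is conceptually simple. The sketch below assumes the token-per-line IOB tagging used by the CoNLL files, that each sentence is written to the training documents file with tokens joined by single spaces, and an end-exclusive offset convention; the function name and tag-to-type mapping are our own illustration, not the exact code used.

```python
# Map abbreviated CoNLL tags to the full entity type names used for training.
TAG_TO_TYPE = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION", "MISC": "MISCELLANEOUS"}

def conll_sentence_to_annotations(tokens, tags, doc_file, line_number):
    """Convert one IOB-tagged sentence into (File, Line, Begin Offset, End Offset, Type) rows."""
    annotations = []
    offset = 0          # character offset of the current token in the space-joined sentence
    start = end = None  # span of the entity currently being built
    current = None      # its mapped entity type
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and current == TAG_TO_TYPE.get(tag[2:])
        if current is not None and not continues:
            # Flush the entity we were building before starting a new one.
            annotations.append((doc_file, line_number, start, end, current))
            start = current = None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = offset, TAG_TO_TYPE[tag[2:]]
        if current is not None:
            end = offset + len(token)
        offset += len(token) + 1  # +1 for the space joining tokens
    if current is not None:
        annotations.append((doc_file, line_number, start, end, current))
    return annotations

# Example: "John Smith visited Berlin" yields a PERSON span (0, 10) and a LOCATION span (19, 25).
rows = conll_sentence_to_annotations(
    ["John", "Smith", "visited", "Berlin"],
    ["B-PER", "I-PER", "O", "B-LOC"],
    "documents.txt",
    0,
)
```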

SNIPS

The SNIPS dataset was created in 2017 as part of benchmarking tests for natural language understanding (NLU) by Snips. The outcomes from these tests are documented in the 2018 paper “Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces” by Coucke et al. We focused on the GetWeather and AddToPlaylist datasets for our experiments. For the GetWeather dataset, the entities we considered included timerange, city, state, condition_description, country, and condition_temperature. For AddToPlaylist, we examined the entities artist, playlist_owner, playlist, music_item, and entity_name.

Sampling Configuration

The table below summarizes the dataset configuration for our tests. Each row reflects an Amazon Comprehend CER model that was trained, deployed, and used for entity prediction with our test dataset.

| Dataset | Published Year | Language | Number of Documents Sampled for Training | Number of Entities Sampled | Number of Annotations per Entity (Shots) | Number of Documents Sampled for Blind Test Inference (Unseen During Training) |
|---|---|---|---|---|---|---|
| SNIPS-AddToPlaylist | 2017 | English | 254 | 5 | artist: 101, playlist_owner: 148, playlist: 254, music_item: 100, entity_name: 100 | 100 |
| SNIPS-GetWeather | 2017 | English | 600 | 6 | timeRange: 281, city: 211, state: 111, condition_description: 121, country: 117, condition_temperature: 115 | 200 |
| SNIPS-GetWeather | 2017 | English | 1000 | 6 | timeRange: 544, city: 428, state: 248, condition_description: 241, country: 230, condition_temperature: 228 | 200 |
| SNIPS-GetWeather | 2017 | English | 2000 | 6 | timeRange: 939, city: 770, state: 436, condition_description: 401, country: 451, condition_temperature: 431 | 200 |
| CoNLL | 2003 | English | 350 | 3 | Location: 183, Organization: 111, Person: 229 | 200 |
| CoNLL | 2003 | English | 600 | 3 | Location: 384, Organization: 210, Person: 422 | 200 |
| CoNLL | 2003 | English | 1000 | 4 | Location: 581, Miscellaneous: 185, Organization: 375, Person: 658 | 200 |
| CoNLL | 2003 | English | 2000 | 4 | Location: 1133, Miscellaneous: 499, Organization: 696, Person: 1131 | 200 |
| CoNLL | 2002 | Spanish | 380 | 4 | Location: 208, Miscellaneous: 103, Organization: 404, Person: 207 | 200 |
