
João Moura for AWS


AWS - NLP newsletter September 2021


Hello world. This is the second monthly Natural Language Processing (NLP) newsletter, covering everything related to NLP at AWS, and more. Feel free to leave comments, or share on your social network. Let's dive in!


AWS NLP Services

Feature Releases

Amazon Textract announces price reductions, up to 50% faster asynchronous processing worldwide, and US FedRAMP authorization
Usage of the AnalyzeDocument and DetectDocumentText APIs in eight AWS Regions will now be billed at the same rates as the US East (N. Virginia) Region (not including the recently launched AnalyzeExpense API), a price reduction of up to 32%. Based on customer feedback, enhancements to Textract's asynchronous operations reduced latency by as much as 50 percent worldwide. Finally, Textract achieved US FedRAMP authorization and added IRAP compliance support. What’s New, AWS News Blog, Documentation.

Amazon Transcribe adds support for 6 new languages, Amazon Lex adds support for Korean
Amazon Transcribe now supports batch transcription in six new languages - Afrikaans, Danish, Mandarin Chinese (Taiwan), Thai, New Zealand English, and South African English. Additionally, Amazon Lex has just added support for Korean. What’s New (Transcribe), What’s New (Lex), Transcribe Documentation, Lex Documentation.

Amazon Transcribe can now generate subtitles for your video files
Amazon Transcribe now supports the generation of WebVTT (.vtt) and SubRip (.srt) output for use as video subtitles during a batch transcription job. You can select one or both formats when you submit the job, and the resulting subtitle files are written to the same destination as the underlying transcription output file. Find more details in the title link above.
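As a rough sketch, enabling subtitles means adding a `Subtitles` field to the batch job request. The job name, media URI, and bucket below are placeholders, and the parameter names follow the StartTranscriptionJob API as documented at the time of writing:

```python
# Sketch of a StartTranscriptionJob request with subtitle output enabled.
# Job name, media URI, and output bucket are placeholders.
def build_subtitle_job_request(job_name, media_uri, output_bucket):
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_uri},
        "OutputBucketName": output_bucket,
        # Request WebVTT and/or SubRip subtitle files alongside the transcript.
        "Subtitles": {"Formats": ["vtt", "srt"]},
    }

request = build_subtitle_job_request(
    "lecture-job", "s3://my-bucket/lecture.mp4", "my-output-bucket")
# A boto3 client would then submit it, e.g.:
# boto3.client("transcribe").start_transcription_job(**request)
```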

Amazon Transcribe now supports redaction of personally identifiable information (PII) for streaming transcriptions
You can now use Amazon Transcribe to automatically identify and redact PII - such as Social Security numbers, credit card/bank account information, and contact information (e.g., name, email address, phone number, and mailing address) - from your streaming transcription results. In addition, granular PII categories are now provided, instead of the single [PII] tag used when redacting PII in a batch transcription job. With this new feature, companies can provide their contact center agents with valuable transcripts of ongoing conversations while maintaining privacy standards. What’s New, AWS ML Blog.
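As an illustration, redaction on a stream is controlled by two request parameters. The sketch below simply assembles them; the names mirror the StartStreamTranscription API as documented at the time of writing, and the specific entity list is an assumption for the example:

```python
def build_streaming_redaction_params(language_code="en-US", sample_rate_hz=16000):
    # ContentRedactionType switches PII redaction on for the stream;
    # PiiEntityTypes narrows redaction to specific categories.
    return {
        "LanguageCode": language_code,
        "MediaSampleRateHertz": sample_rate_hz,
        "MediaEncoding": "pcm",
        "ContentRedactionType": "PII",
        "PiiEntityTypes": "NAME,EMAIL,SSN,CREDIT_DEBIT_NUMBER",
    }

params = build_streaming_redaction_params()
```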

Extract custom entities from documents in their native format with Amazon Comprehend
Amazon Comprehend now allows you to extract custom entities from documents in a variety of formats (PDF, Word, plain text) and layouts (e.g., bullets, lists). Prior to this announcement, you could only use Comprehend on plain text documents, which required you to flatten documents into machine-readable text; this feature combines the power of NLP and Optical Character Recognition (OCR) to extract custom entities from your documents using the same API and with no preprocessing required. What’s New, Getting Started (blog), Document Annotation for new feature (blog).
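A minimal sketch of what the corresponding job request might look like - the `DocumentReaderConfig` block is what tells Comprehend to OCR native-format documents. All ARNs and S3 URIs are placeholders, and the field names follow the StartEntitiesDetectionJob API as documented at the time of writing:

```python
def build_custom_entity_job(input_s3, output_s3, recognizer_arn, role_arn):
    return {
        "EntityRecognizerArn": recognizer_arn,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
        "InputDataConfig": {
            "S3Uri": input_s3,
            "InputFormat": "ONE_DOC_PER_FILE",
            # Has Comprehend run OCR on PDFs/images before entity extraction.
            "DocumentReaderConfig": {
                "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
                "DocumentReadMode": "SERVICE_DEFAULT",
            },
        },
        "OutputDataConfig": {"S3Uri": output_s3},
    }

job = build_custom_entity_job(
    "s3://docs-bucket/input/", "s3://docs-bucket/output/",
    "arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/demo",
    "arn:aws:iam::123456789012:role/ComprehendDataAccess")
```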

Blog posts/demos

Boost transcription accuracy of class lectures with custom language models for Amazon Transcribe
Practical example of how training a custom language model in Amazon Transcribe can help improve transcription accuracy on difficult specialized topics, such as biology lectures.


Read more about how to leverage custom language models in the Transcribe documentation.
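Concretely, once a custom language model has been trained, pointing a batch job at it comes down to a `ModelSettings` entry in the job request (per the StartTranscriptionJob API at the time of writing; names below are placeholders):

```python
def build_clm_job_request(job_name, media_uri, model_name):
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",  # must match the language of the custom model
        "Media": {"MediaFileUri": media_uri},
        # Point the job at the trained custom language model.
        "ModelSettings": {"LanguageModelName": model_name},
    }

request = build_clm_job_request(
    "biology-lecture-1", "s3://lectures/biology-1.mp3", "biology-clm")
```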


NLP on Amazon SageMaker

Feature Releases

Amazon SageMaker now supports inference endpoint testing from SageMaker Studio
Once a model is deployed to a SageMaker real-time endpoint, customers can get predictions from it. Previously, customers used third-party tooling such as curl, or wrote code in Jupyter notebooks, to invoke the endpoint for inference. Now, customers can provide a JSON payload, send the inference request to the endpoint, and receive results directly in SageMaker Studio, where they can also be downloaded for further analysis.
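Under the hood this is the same InvokeEndpoint call you could make yourself from code. A hypothetical sketch - the endpoint name and payload shape are assumptions, since the expected JSON depends on your model's inference handler:

```python
import json

def build_inference_request(endpoint_name, features):
    # Payload shape depends on the model's inference handler; this one
    # assumes a simple {"instances": [...]} convention.
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps({"instances": [features]}),
    }

req = build_inference_request("my-nlp-endpoint", [0.2, 0.7, 0.1])
# A boto3 runtime client would then send it:
# boto3.client("sagemaker-runtime").invoke_endpoint(**req)
```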

Amazon S3 plugin for PyTorch
This is an open-source library, built to stream data from Amazon S3 into the deep learning framework PyTorch. It is also available in the PyTorch Deep Learning Containers, and it lets you use data in S3 buckets directly with the PyTorch dataset and dataloader APIs without downloading it to local storage first. AWS ML Blog, Plugin GitHub.

Blog posts/demos

Detecting Data Drift in NLP using SageMaker Custom Model Monitor
Detecting data drift in NLP is a challenging task. Model monitoring is an important part of MLOps, because a shift in data distribution from the training corpus to real-world data at inference time can cause model performance to decay. This distribution shift is called data drift. This demo focuses on detecting that drift, making use of the custom monitoring capabilities of SageMaker Model Monitor.
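To make the idea concrete, here is a minimal, framework-free sketch of one way to quantify text drift: comparing token frequency distributions with Jensen-Shannon divergence. This is an illustrative metric, not necessarily the method used in the linked demo:

```python
import math
from collections import Counter

def token_distribution(texts):
    """Normalized token frequencies over a list of strings."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.
    0 = identical; 1 = completely disjoint vocabularies."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    def kl(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

train = token_distribution(["the cat sat", "the dog ran"])
live  = token_distribution(["stock prices fell", "markets rallied today"])
drift = js_divergence(train, live)  # 1.0 here: no shared vocabulary
```

A monitoring job could compute this statistic over each batch of captured inference text and alert when it crosses a threshold.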


Upcoming events

NLP Summit 2021
Oct 05-07, 2021
Join the NLP Summit: two weeks of immersive, industry-focused content. Week one will include over 30 unique sessions, with a special track on NLP in Healthcare. Week two will feature beginner to advanced training workshops with certifications. Attendees can also participate in coffee chats with speakers, committers, and industry experts. Registration is free.

AWS Startup Accelerate: Start your NLP journey on AWS
Oct 11, 2021
AWS will be running a technical talk, "Starting your NLP journey with AWS". Based on feedback from leading NLP startups, we see that developing NLP models is a complex and costly process, which is why we’d like to engage with data scientists and ML engineers to help them in their adoption journey. We would love to have you there! Register here.


Miscellaneous

🤗 HuggingFace: Hardware Partner Program, Optimum, and Infinity
A trio of announcements for HuggingFace this month:

  • Hugging Face has launched a Hardware Partner Program, partnering with AI hardware accelerator makers to make state-of-the-art production performance accessible with Transformers.
  • In this context, Hugging Face has released Optimum, an ML optimization toolkit that enables maximum efficiency when training and running models on specific hardware. As of today, you can use it to easily prune and/or quantize Transformer models for Intel Xeon CPUs using the Intel Low Precision Optimization Tool (LPOT), and later this year the first models optimized for Graphcore’s Intelligence Processing Unit (IPU) will be added.
  • Finally, Infinity - Hugging Face’s enterprise-scale inference solution - was officially announced on September 28th: a containerized solution that promises Transformers’ accuracy at 1 ms latency.
