Using the AWS Transcribe serverless service to create text from speech
Hey people!
I want to start a new series covering the most used AWS AI services today. For the first service, we’ll discuss AWS Transcribe.
Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. It is an excellent service for extracting text from audio or video files or for adding speech-to-text capabilities to an application. It supports multiple languages, and you can:
Use the general model provided by AWS
Train your model with custom data
Create vocabularies to enrich the selected model
Create vocabulary filters to remove/redact selected data
Demo Architecture
For our demo architecture, we will build a simple app with an S3 bucket and two folders: media and transcription. We’ll then configure S3 notifications so that when a file is added to the media
folder, it triggers a Lambda function to start a new job in Transcribe. The job will then generate and add a transcription to the transcription
folder.
We'll be creating our infrastructure using Terraform as our Infrastructure as Code.
General Model
The quickest way to start with Transcribe is to use its General Model. It is suited for a broad audience but has no specialized capabilities.
Let’s start by creating our infrastructure. Create an iac
folder and then add a terraform.tf
file to initialize our Terraform configuration:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.84"
}
}
backend "s3" {
bucket = "YOUR_BUCKET"
key = "state.tfstate"
}
}
provider "aws" {}
Don’t forget to replace YOUR_BUCKET with the bucket you want to use for your backend configuration.
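If the backend bucket doesn’t exist yet, one way to create it up front is with the AWS CLI (a quick sketch, assuming your AWS credentials and default region are already configured):
# Create the S3 bucket that will hold the Terraform state
aws s3 mb s3://YOUR_BUCKET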
Let’s now create the S3 buckets and folders. Create a s3.tf
file:
resource "aws_s3_bucket" "bucket" {
bucket = "aws-ai-transcribe"
}
resource "aws_s3_object" "media" {
bucket = aws_s3_bucket.bucket.id
key = "media/"
}
resource "aws_s3_object" "transcription" {
bucket = aws_s3_bucket.bucket.id
key = "transcription/"
}
resource "aws_lambda_permission" "allow_bucket" {
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.transcribe.arn
source_arn = aws_s3_bucket.bucket.arn
principal = "s3.amazonaws.com"
}
resource "aws_s3_bucket_notification" "bucket" {
bucket = aws_s3_bucket.bucket.id
lambda_function {
filter_prefix = aws_s3_object.media.key
events = ["s3:ObjectCreated:*"]
lambda_function_arn = aws_lambda_function.transcribe.arn
}
depends_on = [aws_lambda_permission.allow_bucket]
}
Here, we are creating our aws-ai-transcribe
bucket, adding the two folders, media
and transcription
, and then creating an S3 notification to the lambda function for every file added to the media
folder.
Note that we added an explicit depends_on = [aws_lambda_permission.allow_bucket] because we want this notification to be created only after the permission for S3 to invoke our lambda function exists.
Now, let’s create our lambda function in a lambdas.tf
file:
resource "aws_lambda_function" "transcribe" {
function_name = "transcribe"
runtime = "nodejs22.x"
handler = "index.handler"
filename = data.archive_file.file.output_path
source_code_hash = data.archive_file.file.output_base64sha256
role = aws_iam_role.role.arn
environment {
variables = {
JOB_ROLE_ARN = "${aws_iam_role.job_role.arn}"
OUTPUT_KEY = "${aws_s3_object.transcription.key}"
}
}
}
resource "aws_iam_role" "role" {
name = "transcribe-lambda-role"
assume_role_policy = data.aws_iam_policy_document.assume_role.json
}
resource "aws_iam_role_policy" "policies" {
role = aws_iam_role.role.name
policy = data.aws_iam_policy_document.policies.json
}
resource "aws_iam_role" "job_role" {
name = "transcribe-job-role"
assume_role_policy = data.aws_iam_policy_document.assume_job_role.json
}
resource "aws_iam_role_policy" "job_policies" {
role = aws_iam_role.job_role.name
policy = data.aws_iam_policy_document.job_policies.json
}
data "archive_file" "file" {
source_dir = "${path.root}/init_code"
output_path = "lambda_payload.zip"
type = "zip"
}
Here, we are creating our lambda function, its execution role, and the role that our Transcribe jobs will assume to download the media file and upload the transcription to the S3 bucket.
This lambda requires initial placeholder code from the init_code folder. So, under iac, create an init_code folder with an index.js file:
// Default handler generated in AWS
export const handler = async (event) => {
const response = {
statusCode: 200,
body: JSON.stringify({ message: 'Hello from Lambda!' }),
};
return response;
};
And a package.json
file:
{
"name": "second-lambda",
"version": "1.0.0",
"main": "index.js",
"type": "module",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"keywords": [],
"author": "",
"license": "ISC"
}
Last, let’s create the policies for our lambda and transcribe job IAM role in a policies.tf
file:
data "aws_iam_policy_document" "assume_role" {
statement {
effect = "Allow"
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
actions = ["sts:AssumeRole"]
}
}
data "aws_iam_policy_document" "assume_job_role" {
statement {
effect = "Allow"
principals {
type = "Service"
identifiers = ["transcribe.amazonaws.com"]
}
actions = ["sts:AssumeRole"]
}
}
data "aws_iam_policy_document" "policies" {
statement {
effect = "Allow"
actions = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
resources = ["arn:aws:logs:*:*:*"]
}
statement {
effect = "Allow"
actions = ["iam:PassRole"]
resources = [aws_iam_role.job_role.arn]
}
statement {
effect = "Allow"
actions = ["transcribe:StartTranscriptionJob"]
resources = ["arn:aws:transcribe:*:*:transcription-job/*"]
}
}
data "aws_iam_policy_document" "job_policies" {
statement {
effect = "Allow"
actions = ["s3:GetObject"]
resources = ["${aws_s3_object.media.arn}*"]
}
statement {
effect = "Allow"
actions = ["s3:PutObject"]
resources = ["${aws_s3_object.transcription.arn}*"]
}
}
Our lambda role must have the transcribe:StartTranscriptionJob
permission, or else it cannot start a job in Transcribe. The job role needs permission to get the media file from the S3 bucket folder and put the transcription into the S3 bucket folder.
Now that we have our infrastructure files, let’s set up our GitHub workflow to create them. Create a .github/workflows folder and add a deploy-infrastructure.yml file:
name: Deploy Transcribe Infra
on:
workflow_dispatch:
push:
branches:
- main
paths:
- transcribe/iac/**/*
defaults:
run:
working-directory: transcribe/iac
jobs:
deploy:
name: 'Deploy'
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Configure AWS Credentials Action For GitHub Actions
uses: aws-actions/configure-aws-credentials@v1
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: YOUR_REGION
# Install the latest version of the Terraform CLI
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
# Initialize a new or existing Terraform working directory by creating initial files, loading any remote state, downloading modules, etc.
- name: Terraform Init
run: terraform init
# Checks that all Terraform configuration files adhere to a canonical format
- name: Terraform Format
run: terraform fmt -check
# Generates an execution plan for Terraform
- name: Terraform Plan
run: |
terraform plan -out=plan -input=false
# On push to "main", build or change infrastructure according to Terraform configuration files
- name: Terraform Apply
run: terraform apply -auto-approve -input=false plan
Make sure you have AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY added to your GitHub Actions secrets and that these credentials have permission to create resources in your account. Also, don’t forget to replace YOUR_REGION with the region where you want these resources created.
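If you use the GitHub CLI, one way to add the secrets is sketched below (assuming gh is installed and authenticated against your repository; each command prompts for the secret value):
gh secret set AWS_ACCESS_KEY
gh secret set AWS_SECRET_ACCESS_KEY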
Now, push the code to GitHub, and you should have the basic infrastructure created after it finishes.
Now, let’s create the lambda function that will start a transcription job whenever a file is added to the S3 media folder.
Create an app
folder and run the following commands to start our TypeScript project:
npm init -y
npm i @aws-sdk/client-transcribe
npm i -D typescript copyfiles @types/aws-sdk @types/aws-lambda
Now, we need to add a tsc and a build script to your package.json:
{
"name": "app",
"version": "1.0.0",
"description": "",
"main": "index.js",
"type": "module",
"scripts": {
"tsc": "tsc",
"build": "tsc && copyfiles package.json build/",
"test": "echo \"Error: no test specified\" && exit 1"
},
"author": "",
"license": "ISC",
"dependencies": {
"@aws-sdk/client-transcribe": "^3.760.0"
},
"devDependencies": {
"@types/aws-lambda": "^8.10.147",
"@types/aws-sdk": "^0.0.42",
"copyfiles": "^2.4.1",
"typescript": "^5.8.2"
}
}
Now, let’s initialize our TypeScript project with:
npm run tsc -- --init --target esnext --module nodenext \
--moduleResolution nodenext --rootDir src \
--outDir build --noImplicitAny --noImplicitThis --newLine lf \
--resolveJsonModule
Or, for Windows:
npm run tsc -- --init --target esnext --module nodenext `
--moduleResolution nodenext --rootDir src `
--outDir build --noImplicitAny --noImplicitThis --newLine lf `
--resolveJsonModule
You should have a tsconfig.json
file:
{
"compilerOptions": {
"target": "esnext",
"module": "nodenext",
"rootDir": "src",
"moduleResolution": "nodenext",
"resolveJsonModule": true,
"outDir": "build",
"newLine": "lf",
"esModuleInterop": true,
"forceConsistentCasingInFileNames": true,
"strict": true,
"noImplicitAny": true,
"noImplicitThis": true,
"skipLibCheck": true
}
}
Now, on to our actual lambda function code. Create a src
folder, and an index.ts
file:
import {
MediaFormat,
StartTranscriptionJobCommand,
TranscribeClient,
type StartTranscriptionJobRequest,
} from '@aws-sdk/client-transcribe';
import type { S3Event } from 'aws-lambda';
const JOB_ROLE_ARN = process.env.JOB_ROLE_ARN;
const OUTPUT_KEY = process.env.OUTPUT_KEY; // transcription/
const transcribeClient = new TranscribeClient({});
export const handler = async (event: S3Event) => {
for (let record of event.Records) {
const bucket = record.s3.bucket.name;
const key = record.s3.object.key;
const fileInput = `s3://${bucket}/${key}`;
const mediaFormat = fileInput.split('.').at(-1);
if (
!mediaFormat ||
!Object.values(MediaFormat).includes(mediaFormat as MediaFormat)
) {
console.warn('No media format for this file');
return;
}
const jobName = key.replace('/', '_') + Date.now();
const jobRequest: StartTranscriptionJobRequest = {
TranscriptionJobName: jobName,
Media: { MediaFileUri: fileInput },
MediaFormat: mediaFormat as MediaFormat,
LanguageCode: 'en-US',
OutputBucketName: bucket,
OutputKey: `${OUTPUT_KEY}${jobName}.json`,
JobExecutionSettings: {
DataAccessRoleArn: JOB_ROLE_ARN,
},
};
const job = new StartTranscriptionJobCommand(jobRequest);
try {
const response = await transcribeClient.send(job);
console.log(
'Started job %s. Data %s',
jobName,
response.TranscriptionJob
);
} catch (error: any) {
console.error(
"Couldn't start transcription job %s. Error: %s",
jobName,
error
);
throw error;
}
}
};
Notice the LanguageCode we are passing to our StartTranscriptionJobRequest object. We know the language will be en-US and have set it explicitly for a better transcription (you can find all the supported languages here). If you do not know the languages spoken in your media, use either IdentifyLanguage or IdentifyMultipleLanguages and let Amazon Transcribe identify the languages for you. Just remember that exactly one of these three language configurations is required.
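As a minimal sketch of that alternative, the same job request could let Transcribe detect the language (reusing the jobName, fileInput, and mediaFormat variables from the handler above; LanguageOptions is optional and narrows the candidate languages):
// Sketch: language auto-detection instead of a fixed LanguageCode.
// Exactly one of LanguageCode, IdentifyLanguage, or
// IdentifyMultipleLanguages must be set on the request.
const autoDetectJobRequest: StartTranscriptionJobRequest = {
  TranscriptionJobName: jobName,
  Media: { MediaFileUri: fileInput },
  MediaFormat: mediaFormat as MediaFormat,
  IdentifyLanguage: true,
  // Optional: restrict detection to a set of likely languages
  LanguageOptions: ['en-US', 'pt-BR'],
  OutputBucketName: bucket,
  OutputKey: `transcription/${jobName}.json`,
};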
Now, to deploy the lambda with GitHub Actions, create a new deploy-lambda.yml
workflow under .github/workflows
:
name: Deploy Transcribe Lambda
on:
workflow_dispatch:
push:
branches:
- main
paths:
- app/**/*
defaults:
run:
working-directory: app/
jobs:
deploy:
name: 'Deploy Lambda'
runs-on: ubuntu-latest
steps:
# Checkout the repository to the GitHub Actions runner
- name: Checkout
uses: actions/checkout@v3
- name: Setup NodeJS
uses: actions/setup-node@v4
with:
node-version: 22
- name: Configure AWS Credentials Action For GitHub Actions
uses: aws-actions/configure-aws-credentials@v1
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: YOUR_REGION
- name: Install packages
run: npm install
- name: Build
run: npm run build
- name: Zip build
run: cd build && zip -r ../main.zip .
- name: Update Lambda code
run: aws lambda update-function-code --function-name=transcribe --zip-file=fileb://main.zip
After the workflow finishes running, you can upload a media file to the media/ folder, and you should get a transcription in the transcription/ folder. For example, I uploaded a file of me saying, “Hello, how are you?”. The file you get, in JSON format, contains the whole transcript, broken down into multiple items:
{
"jobName": "audio_hello.m4a17393009178780818",
"accountId": "044256433832",
"status": "COMPLETED",
"results": {
"transcripts": [{ "transcript": "Hello, how are you?" }],
"items": [
{
"id": 0,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "Hello" }],
"start_time": "0.72",
"end_time": "1.419"
},
{
"id": 1,
"type": "punctuation",
"alternatives": [{ "confidence": "0.0", "content": "," }]
},
{
"id": 2,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.996", "content": "how" }],
"start_time": "1.629",
"end_time": "1.809"
},
{
"id": 3,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "are" }],
"start_time": "1.809",
"end_time": "1.99"
},
{
"id": 4,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "you" }],
"start_time": "1.99",
"end_time": "2.259"
},
{
"id": 5,
"type": "punctuation",
"alternatives": [{ "confidence": "0.0", "content": "?" }]
}
],
"audio_segments": [
{
"id": 0,
"transcript": "Hello, how are you?",
"start_time": "0.709",
"end_time": "2.349",
"items": [0, 1, 2, 3, 4, 5]
}
]
}
}
Custom Vocabulary
You can create a Custom Vocabulary to improve transcription accuracy for domain-specific terms: https://docs.aws.amazon.com/transcribe/latest/dg/custom-vocabulary.html
There are some considerations when using custom vocabularies:
Each AWS account can have up to 100 vocabularies.
The file size limit is 50 KB.
When creating one via the API, the file must be a plain text (txt) file. In the AWS Console, it can also be a CSV file.
Each entry must have a maximum of 256 characters.
Vocabulary is region-locked, meaning you can only use it in the same region in which it was created.
Entries with multiple words must have the words separated with a hyphen (-). For example: Event-Driven-Architecture
Entries that represent acronyms must have a trailing period (.) after each letter. For example: A.W.S., A.I., C.L.I.
For entries with both words and acronyms, the words should be separated from the acronyms with a hyphen, and the acronyms should follow the rule above. Example: Dynamo-D.B.
We can create a new vocabulary in two ways:
Directly in the Terraform configuration
Through a configuration text file
To do it with the Terraform configuration, we need to add them to the phrases
property:
resource "aws_transcribe_vocabulary" "vocabulary" {
vocabulary_name = "example"
phrases = ["A.W.S.", "A.I.", "Serverless"]
language_code = "en-US"
}
To create a vocabulary with a file, we need to upload the file to S3 before referencing it in the vocabulary resource. This is my preferred approach, as it gives you better control over the vocabulary entries and also lets you set how each entry should be displayed in the transcription.
Let’s create a vocabulary file. It should be in the following structure:
Phrase[TAB]SoundsLike[TAB]IPA[TAB]DisplayAs
Entry[TAB][TAB][TAB]Entry
Another-Entry[TAB][TAB][TAB]Another Entry
X.Y.Z.[TAB][TAB][TAB]XYZ
Amazon-dot-com[TAB][TAB][TAB]Amazon.com
A.B.C.-s[TAB][TAB][TAB]ABCs
Dynamo-D.B.[TAB][TAB][TAB]DynamoDB
zero-one-two-A.B.[TAB][TAB][TAB]012AB
Where [TAB]
is the tab character.
So, let’s create a vocabulary.txt file under an iac/transcribe folder (the path our Terraform code will reference):
Phrase[TAB]SoundsLike[TAB]IPA[TAB]DisplayAs
A.W.S.[TAB][TAB][TAB]AWS
A.I.[TAB][TAB][TAB]AI
Serverless[TAB][TAB][TAB]serverless
Now, let’s create the object in S3. In the s3.tf
file, add the following:
resource "aws_s3_object" "vocabulary_folder" {
bucket = aws_s3_bucket.bucket.id
key = "vocabularies/"
}
resource "aws_s3_object" "vocabulary" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.vocabulary_folder.key}vocabulary.txt"
source = "${path.module}/transcribe/vocabulary.txt"
source_hash = filemd5("${path.module}/transcribe/vocabulary.txt")
}
Then, let’s create a transcribe.tf
file:
resource "aws_transcribe_vocabulary" "vocabulary" {
vocabulary_name = "example"
vocabulary_file_uri = "s3://${aws_s3_object.vocabulary.bucket}/${aws_s3_object.vocabulary.key}"
language_code = "en-US"
}
We now add the vocabulary name to our lambda environment variables:
resource "aws_lambda_function" "transcribe" {
function_name = "transcribe"
runtime = "nodejs22.x"
handler = "index.handler"
filename = data.archive_file.file.output_path
source_code_hash = data.archive_file.file.output_base64sha256
role = aws_iam_role.role.arn
environment {
variables = {
JOB_ROLE_ARN = "${aws_iam_role.job_role.arn}"
OUTPUT_KEY = "${aws_s3_object.transcription.key}"
VOCABULARY_NAME = "${aws_transcribe_vocabulary.vocabulary.vocabulary_name}"
}
}
}
And lastly, add the vocabulary ARN to the transcribe:StartTranscriptionJob policy statement we created earlier. It should look like this:
statement {
effect = "Allow"
actions = ["transcribe:StartTranscriptionJob"]
resources = [
"arn:aws:transcribe:*:*:transcription-job/*",
aws_transcribe_vocabulary.vocabulary.arn
]
}
You can now push the code to GitHub, wait for the action to finish, and then see the vocabulary in Transcribe.
Then, when you upload a media file containing these words, Transcribe will give you a better transcription with better word matches.
Vocabulary Filter
Sometimes, you might want to filter some words from your transcription, like removing offensive language or redacting/masking sensitive data.
In these cases, you can create and use a custom vocabulary filter, a list of words you want to filter from your transcription.
Things to note when creating your custom vocabulary filter:
Words are case-insensitive. So, “offensive” and “OFFENSIVE” are considered the same.
Word matching works with exact matches. You must include all variations of the words you want to filter from your transcription. Example: “hate”, “hating”, “hated”.
As mentioned previously, filters work with exact matches, so you don’t need to worry about variant words being filtered if you don’t want them. Example: You have the filter for “car”, so “scar” won’t be filtered.
Each entry can only contain one word (no spaces).
Text files must be in plain text with UTF-8 encoding.
Each AWS account can have up to 100 custom vocabulary filters.
Vocabulary filter files can be up to 50 KB in size.
Only characters supported for the selected language can be used. (See documentation here for supported characters)
Just as with the custom vocabulary, we can create our vocabulary filter in two ways:
Inline words, through the words property in Terraform
A text file in S3
For the text file option, it must be a txt file with each word on a new line:
hate
offensive
masked
We can follow the same approach as for the custom vocabulary. Add a vocabulary_filter.txt file (under iac/transcribe) with the content above, then create the folder and object:
resource "aws_s3_object" "vocabulary_filters_folder" {
bucket = aws_s3_bucket.bucket.id
key = "vocabulary_filters/"
}
resource "aws_s3_object" "vocabulary_filter" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.vocabulary_filters_folder.key}vocabulary_filter.txt"
source = "${path.module}/transcribe/vocabulary_filter.txt"
source_hash = filemd5("${path.module}/transcribe/vocabulary_filter.txt")
}
Then, we can create a new vocabulary filter with:
resource "aws_transcribe_vocabulary_filter" "filter" {
vocabulary_filter_name = "example"
vocabulary_filter_file_uri = "s3://${aws_s3_object.vocabulary_filter.bucket}/${aws_s3_object.vocabulary_filter.key}"
language_code = "en-US"
}
Alternatively, if you want a vocabulary filter with inline words, you can add them through the words property, as in the sketch below:
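resource "aws_transcribe_vocabulary_filter" "inline_filter" {
  # A sketch of the inline alternative: same filter content as the
  # file-based example above (the "example-inline" name is illustrative)
  vocabulary_filter_name = "example-inline"
  words                  = ["hate", "offensive", "masked"]
  language_code          = "en-US"
}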
After you push to GitHub and the workflow completes, you can find the filter in the Transcribe vocabulary filtering console.
Now, to use it, let’s add the vocabulary filter name to our lambda environment variables:
resource "aws_lambda_function" "transcribe" {
function_name = "transcribe"
runtime = "nodejs22.x"
handler = "index.handler"
filename = data.archive_file.file.output_path
source_code_hash = data.archive_file.file.output_base64sha256
role = aws_iam_role.role.arn
environment {
variables = {
JOB_ROLE_ARN = "${aws_iam_role.job_role.arn}"
OUTPUT_KEY = "${aws_s3_object.transcription.key}"
VOCABULARY_NAME = "${aws_transcribe_vocabulary.vocabulary.vocabulary_name}"
VOCABULARY_FILTER_NAME = "${aws_transcribe_vocabulary_filter.filter.vocabulary_filter_name}"
}
}
}
Then, let’s give permission to our lambda role to use the filter when calling StartTranscriptionJob
in the policies.tf
file:
statement {
effect = "Allow"
actions = ["transcribe:StartTranscriptionJob"]
resources = [
"arn:aws:transcribe:*:*:transcription-job/*",
aws_transcribe_vocabulary.vocabulary.arn,
aws_transcribe_vocabulary_filter.filter.arn
]
}
And finally, let’s modify our lambda code to use the vocabulary filter when starting a transcription job:
import {
MediaFormat,
StartTranscriptionJobCommand,
TranscribeClient,
VocabularyFilterMethod,
type StartTranscriptionJobRequest,
} from '@aws-sdk/client-transcribe';
import type { S3Event } from 'aws-lambda';
const JOB_ROLE_ARN = process.env.JOB_ROLE_ARN;
const OUTPUT_KEY = process.env.OUTPUT_KEY; // transcription/
const vocabularyName = process.env.VOCABULARY_NAME;
const vocabularyFilterName = process.env.VOCABULARY_FILTER_NAME;
const transcribeClient = new TranscribeClient({});
export const handler = async (event: S3Event) => {
for (let record of event.Records) {
const bucket = record.s3.bucket.name;
const key = record.s3.object.key;
const fileInput = `s3://${bucket}/${key}`;
const mediaFormat = fileInput.split('.').at(-1);
if (
!mediaFormat ||
!Object.values(MediaFormat).includes(mediaFormat as MediaFormat)
) {
console.warn('No media format for this file');
return;
}
const jobName = key.replace('/', '_') + Date.now();
const jobRequest: StartTranscriptionJobRequest = {
TranscriptionJobName: jobName,
Media: { MediaFileUri: fileInput },
MediaFormat: mediaFormat as MediaFormat,
LanguageCode: 'en-US',
OutputBucketName: bucket,
OutputKey: `${OUTPUT_KEY}${jobName}.json`,
JobExecutionSettings: {
DataAccessRoleArn: JOB_ROLE_ARN,
},
Settings: {
VocabularyName: vocabularyName,
VocabularyFilterMethod: VocabularyFilterMethod.MASK,
VocabularyFilterName: vocabularyFilterName,
},
};
const job = new StartTranscriptionJobCommand(jobRequest);
try {
const response = await transcribeClient.send(job);
console.log(
'Started job %s. Data %s',
jobName,
response.TranscriptionJob
);
} catch (error: any) {
console.error(
"Couldn't start transcription job %s. Error: %s",
jobName,
error
);
throw error;
}
}
};
Notice the VocabularyFilterMethod property; it accepts one of three values. Let’s say we have the sentence This is an offensive content:
MASK — The filtered content will be masked. For example: { "transcript": "This is an *** content." }
REMOVE — The filtered content will be removed. Example: { "transcript": "This is an content." }
TAG — Adds a vocabularyFilterMatch: true field to the filtered item but doesn’t remove it from the transcript. Example:
{
...
"results": {
"transcripts": [
{ "transcript": "This is an offensive content." }
],
"items": [
...
{
"id": 3,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.758", "content": "offensive" }],
"start_time": "1.549",
"end_time": "2.589",
"vocabularyFilterMatch": true
}
]
}
}
Content Redaction
Transcribe also has a content redaction capability: it can identify PII (personally identifiable information) and automatically redact or flag it. To use it, we need to update our StartTranscriptionJobRequest to tell our job to do it for us:
import {
MediaFormat,
PiiEntityType,
RedactionOutput,
StartTranscriptionJobCommand,
TranscribeClient,
VocabularyFilterMethod,
type StartTranscriptionJobRequest,
} from '@aws-sdk/client-transcribe';
import type { S3Event } from 'aws-lambda';
const JOB_ROLE_ARN = process.env.JOB_ROLE_ARN;
const OUTPUT_KEY = process.env.OUTPUT_KEY; // transcription/
const vocabularyName = process.env.VOCABULARY_NAME;
const vocabularyFilterName = process.env.VOCABULARY_FILTER_NAME;
const transcribeClient = new TranscribeClient({});
export const handler = async (event: S3Event) => {
for (let record of event.Records) {
const bucket = record.s3.bucket.name;
const key = record.s3.object.key;
const fileInput = `s3://${bucket}/${key}`;
const mediaFormat = fileInput.split('.').at(-1);
if (
!mediaFormat ||
!Object.values(MediaFormat).includes(mediaFormat as MediaFormat)
) {
console.warn('No media format for this file');
return;
}
const jobName = key.replace('/', '_') + Date.now();
const jobRequest: StartTranscriptionJobRequest = {
TranscriptionJobName: jobName,
Media: { MediaFileUri: fileInput },
MediaFormat: mediaFormat as MediaFormat,
LanguageCode: 'en-US',
OutputBucketName: bucket,
OutputKey: `${OUTPUT_KEY}${jobName}.json`,
JobExecutionSettings: {
DataAccessRoleArn: JOB_ROLE_ARN,
},
Settings: {
VocabularyName: vocabularyName,
VocabularyFilterMethod: VocabularyFilterMethod.MASK,
VocabularyFilterName: vocabularyFilterName,
},
ContentRedaction: {
RedactionOutput: RedactionOutput.REDACTED,
RedactionType: 'PII', // Only value allowed
// If PiiEntityTypes is not provided, all PII data is redacted
PiiEntityTypes: [
PiiEntityType.CREDIT_DEBIT_NUMBER,
PiiEntityType.BANK_ACCOUNT_NUMBER,
],
},
};
const job = new StartTranscriptionJobCommand(jobRequest);
try {
const response = await transcribeClient.send(job);
console.log(
'Started job %s. Data %s',
jobName,
response.TranscriptionJob
);
} catch (error: any) {
console.error(
"Couldn't start transcription job %s. Error: %s",
jobName,
error
);
throw error;
}
}
};
Here, we added the ContentRedaction property to our job request with the following fields:
RedactionOutput — Tells our job how to handle the redaction output: REDACTED or REDACTED_AND_UNREDACTED.
RedactionType — Right now, the only valid value is PII.
PiiEntityTypes — A list of the PII types you want to redact: CREDIT_DEBIT_NUMBER, ADDRESS, BANK_ACCOUNT_NUMBER, … If this property is not provided, all PII data will be redacted.
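For illustration, with REDACTED output the flagged content is replaced with a [PII] tag in the transcript, roughly like this (a made-up snippet, not real output):
{ "transcript": "My card number is [PII]." }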
Custom Model
If your application has extensive data that represents the media that will be transcribed, you can use a custom model trained with that data.
This type of model increases the reliability of your transcriptions by using a model trained on the exact data your application uses. In many cases, you can skip the custom vocabulary because your data already covers all the words that will be transcribed.
There are two types of data we can use:
Training data — This is the primary dataset used to train a new model from scratch. It typically represents a broad and diverse set of language or domain-specific data. For example, it can be text from your website, training manuals, or application documentation.
Tuning data — This data is used to refine and optimize an already trained model. Tuning data is typically more specialized or narrower in focus than training data, and it’s meant to improve the model’s accuracy on a particular subset of data or domain. For example: audio transcripts of phone calls, media content directly relevant to your use case, or data containing slang or specialized vocabulary.
Your training and tuning datasets must meet the following requirements (a sample appears after the list):
Plain text (.txt) file. (Word, CSV, and PDF are not accepted)
Single sentence per line.
Encoded in UTF-8.
Doesn’t contain any formatting characters, such as HTML tags.
Less than 2 GB for the total size of all training data.
Less than 200 MB for the total size of optional tuning data.
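For reference, a training file is just plain sentences, one per line (illustrative lines, not from the actual dataset):
The Nintendo Switch is a hybrid video game console.
The PlayStation 5 supports ray tracing and fast load times.
The Xbox Series X is the most powerful console in the Xbox line.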
So, let’s start by uploading our training and tuning data to S3. You can use the data I have here. It is just a simple training dataset from Wikipedia to train the model to be more specialized in games. In the s3.tf
file:
resource "aws_s3_object" "clm" {
bucket = aws_s3_bucket.bucket.id
key = "clm/"
}
resource "aws_s3_object" "training_data" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.clm.key}training_data/"
}
resource "aws_s3_object" "tune_data" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.clm.key}tune_data/"
}
resource "aws_s3_object" "nintendo_switch" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.training_data.key}NintendoSwitch.txt"
source = "${path.module}/transcribe/training_data/NintendoSwitch.txt"
source_hash = filemd5("${path.module}/transcribe/training_data/NintendoSwitch.txt")
}
resource "aws_s3_object" "ps5" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.training_data.key}PlayStation5.txt"
source = "${path.module}/transcribe/training_data/PlayStation5.txt"
source_hash = filemd5("${path.module}/transcribe/training_data/PlayStation5.txt")
}
resource "aws_s3_object" "xbox" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.training_data.key}XboxSeries.txt"
source = "${path.module}/transcribe/training_data/XboxSeries.txt"
source_hash = filemd5("${path.module}/transcribe/training_data/XboxSeries.txt")
}
resource "aws_s3_object" "tune_data_file" {
bucket = aws_s3_bucket.bucket.id
key = "${aws_s3_object.tune_data.key}tune_data.txt"
source = "${path.module}/transcribe/tune_data/tune_data.txt"
source_hash = filemd5("${path.module}/transcribe/tune_data/tune_data.txt")
}
In the policies.tf
, let’s create a new policy document with access to list our bucket and access our training data:
data "aws_iam_policy_document" "transcribe_s3" {
statement {
effect = "Allow"
actions = [
"s3:GetObject",
]
resources = [
"${aws_s3_object.clm.arn}*",
]
}
statement {
effect = "Allow"
actions = [
"s3:ListBucket"
]
resources = [
"${aws_s3_bucket.bucket.arn}",
]
}
}
Also modify the policies
IAM Policy document to add the language model ARN to the resources:
data "aws_iam_policy_document" "policies" {
statement {
effect = "Allow"
actions = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
resources = ["arn:aws:logs:*:*:*"]
}
statement {
effect = "Allow"
actions = ["iam:PassRole"]
resources = [aws_iam_role.job_role.arn]
}
statement {
effect = "Allow"
actions = ["transcribe:StartTranscriptionJob"]
resources = [
"arn:aws:transcribe:*:*:transcription-job/*",
aws_transcribe_vocabulary.vocabulary.arn,
aws_transcribe_vocabulary_filter.filter.arn,
aws_transcribe_language_model.model.arn
]
}
}
In the transcribe.tf
file, let’s create our custom language model. Here, we’ll provision the model and assign a role with access to our datasets so it can train:
resource "aws_transcribe_language_model" "model" {
model_name = "example"
// NarrowBand: Use this option for audio with a sample rate of less than 16,000 Hz. This model type is typically used for telephone conversations recorded at 8,000 Hz.
// WideBand: Use this option for audio with a sample rate greater than or equal to 16,000 Hz.
base_model_name = "WideBand"
language_code = "en-US"
input_data_config {
s3_uri = "s3://${aws_s3_object.training_data.bucket}/${aws_s3_object.training_data.key}"
tuning_data_s3_uri = "s3://${aws_s3_object.tune_data.bucket}/${aws_s3_object.tune_data.key}"
data_access_role_arn = aws_iam_role.transcribe_clm.arn
}
depends_on = [aws_iam_role_policy.transcribe_clm_policy]
}
resource "aws_iam_role" "transcribe_clm" {
name = "transcribe_clm"
assume_role_policy = data.aws_iam_policy_document.transcribe_assume_role.json
}
resource "aws_iam_role_policy" "transcribe_clm_policy" {
name = "transcribe_clm"
role = aws_iam_role.transcribe_clm.id
policy = data.aws_iam_policy_document.transcribe_s3.json
}
Then, let’s add the model name to our lambda’s environment variables in the lambdas.tf
file:
resource "aws_lambda_function" "transcribe" {
function_name = "transcribe"
runtime = "nodejs22.x"
handler = "index.handler"
filename = data.archive_file.file.output_path
source_code_hash = data.archive_file.file.output_base64sha256
role = aws_iam_role.role.arn
environment {
variables = {
JOB_ROLE_ARN = "${aws_iam_role.job_role.arn}"
OUTPUT_KEY = "${aws_s3_object.transcription.key}"
VOCABULARY_NAME = "${aws_transcribe_vocabulary.vocabulary.vocabulary_name}"
VOCABULARY_FILTER_NAME = "${aws_transcribe_vocabulary_filter.filter.vocabulary_filter_name}"
CUSTOM_MODEL = "${aws_transcribe_language_model.model.model_name}"
}
}
}
You can then push the code to GitHub and wait for the workflow to finish. According to AWS, training a Custom Language Model can take 6 to 10 hours, depending on the size of your training data. Mine took around 3 hours. Terraform will wait for the training to finish before completing the apply run.
If you are working on a distributed team, I recommend creating the IAM role and the training and tuning data in S3 through Terraform, and then creating the Custom Language Model in the AWS console. Once it finishes training, you can reference it in your Terraform configuration with a data source. This way, the model training won’t block the delivery of other Terraform resources.
While it is being trained, you can see the status in the Transcribe Custom language models console; after it finishes, it will show as completed there.
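If you prefer the terminal, you can also check the training status with the AWS CLI (a quick sketch; the model name matches what we set in Terraform):
aws transcribe describe-language-model --model-name example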
We need to modify our Lambda code to pass the CLM name to the job request: get the model name from the CUSTOM_MODEL environment variable and pass it in the ModelSettings.LanguageModelName property:
import {
MediaFormat,
PiiEntityType,
RedactionOutput,
StartTranscriptionJobCommand,
TranscribeClient,
VocabularyFilterMethod,
type StartTranscriptionJobRequest,
} from '@aws-sdk/client-transcribe';
import type { S3Event } from 'aws-lambda';
const JOB_ROLE_ARN = process.env.JOB_ROLE_ARN;
const OUTPUT_KEY = process.env.OUTPUT_KEY; // transcription/
const vocabularyName = process.env.VOCABULARY_NAME;
const vocabularyFilterName = process.env.VOCABULARY_FILTER_NAME;
const customModel = process.env.CUSTOM_MODEL;
const transcribeClient = new TranscribeClient({});
export const handler = async (event: S3Event) => {
for (let record of event.Records) {
const bucket = record.s3.bucket.name;
const key = record.s3.object.key;
const fileInput = `s3://${bucket}/${key}`;
const mediaFormat = fileInput.split('.').at(-1);
if (
!mediaFormat ||
!Object.values(MediaFormat).includes(mediaFormat as MediaFormat)
) {
console.warn('No media format for this file');
return;
}
const jobName = key.replace('/', '_') + Date.now();
const jobRequest: StartTranscriptionJobRequest = {
TranscriptionJobName: jobName,
Media: { MediaFileUri: fileInput },
MediaFormat: mediaFormat as MediaFormat,
LanguageCode: 'en-US',
OutputBucketName: bucket,
OutputKey: `${OUTPUT_KEY}${jobName}.json`,
JobExecutionSettings: {
DataAccessRoleArn: JOB_ROLE_ARN,
},
Settings: {
VocabularyName: vocabularyName,
VocabularyFilterMethod: VocabularyFilterMethod.MASK,
VocabularyFilterName: vocabularyFilterName,
},
ModelSettings: {
LanguageModelName: customModel,
},
ContentRedaction: {
RedactionOutput: RedactionOutput.REDACTED,
RedactionType: 'PII', // Only value allowed
// If PiiEntityTypes is not provided, all PII data is redacted
PiiEntityTypes: [
PiiEntityType.CREDIT_DEBIT_NUMBER,
PiiEntityType.BANK_ACCOUNT_NUMBER,
],
},
};
const job = new StartTranscriptionJobCommand(jobRequest);
try {
const response = await transcribeClient.send(job);
console.log(
'Started job %s. Data %s',
jobName,
response.TranscriptionJob
);
} catch (error: any) {
console.error(
"Couldn't start transcription job %s. Error: %s",
jobName,
error
);
throw error;
}
}
};
Once you deploy your lambda code, you can test it by adding a media file with language more specific to your domain, and you should start getting better transcriptions.
Testing in AWS Console
You can also test your Transcribe vocabularies, filters, and custom language models in the Transcribe Real-time transcription console. There, you can specify the vocabulary, filter, and Custom Language Model under the Customizations options. After configuring, click Start streaming to test your Transcribe configurations.
Conclusion
In this story, we learned about AWS Transcribe, the serverless service for converting speech to text, and how to quickly get started with a simple serverless architecture using S3, Lambda, and Transcribe to convert media files into text.
We covered the difference between the General and Custom language models, when you’d train your own model, and how to enrich the model you are using with a vocabulary of words related to your domain.
We also saw how to filter and redact data by using custom vocabulary filters to match words we don’t want showing in our transcriptions, and by using the native content redaction feature to automatically redact or flag PII data.
Finally, we saw how to quickly train and tune a CLM with our own data. The only downside of doing that with Terraform is that the apply will hang until the model is trained, which can take a few hours depending on the size of your data.
I hope you had fun with this story.
Happy coding! 💻
The code for this project can be found here.