Voices in the Cloud: Amazon Transcribe and Polly

Master the audio domain. Learn how to convert spoken words to text with Transcribe and give your apps a voice with Polly.

Hearing and Speaking

In the world of AWS AI, audio processing is split into two distinct paths.

  1. Speech-to-Text: Listening to a recording and writing it down (Amazon Transcribe).
  2. Text-to-Speech: Taking a text string and turning it into a natural-sounding voice (Amazon Polly).

For the AWS Certified AI Practitioner exam, you must understand the specialized features of both and, most importantly, never confuse the two names.


1. Amazon Transcribe (The Listener)

Amazon Transcribe uses deep learning to provide high-quality speech-to-text. It doesn't just write down words; it can also capture the structure of a conversation, such as who spoke and when.

Key Features:

  • Speaker Identification (Diarization): It can tell when "Speaker 1" stops and "Speaker 2" starts, which is crucial for meeting notes.
  • Custom Vocabulary: You can teach it your company's product names or technical jargon so it doesn't get them wrong.
  • Redaction: It can automatically mask sensitive data (like a credit card number spoken over a phone call) in the transcript.
  • Subtitling: It can output a time-coded subtitle file (.srt or .vtt) for video.
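The features above map onto options of a single transcription job. The following is a minimal sketch, assuming the boto3 SDK and its `start_transcription_job` call; the job name, S3 URI, and vocabulary name are hypothetical placeholders, and the custom vocabulary would have to exist in your account already.

```python
def build_transcription_request(job_name, media_uri, max_speakers=2,
                                vocabulary_name=None):
    """Build keyword arguments for Transcribe's StartTranscriptionJob."""
    settings = {
        "ShowSpeakerLabels": True,         # speaker identification (diarization)
        "MaxSpeakerLabels": max_speakers,  # upper bound on distinct speakers
    }
    if vocabulary_name:
        # Name of a custom vocabulary created beforehand (hypothetical here).
        settings["VocabularyName"] = vocabulary_name
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_uri},
        "Settings": settings,
        # Mask personally identifiable information in the transcript.
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        },
        # Also produce a time-coded subtitle file.
        "Subtitles": {"Formats": ["srt"]},
    }


def start_job(request):
    """Submit the job. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3
    return boto3.client("transcribe").start_transcription_job(**request)


# Hypothetical job name, bucket, and vocabulary, for illustration only.
request = build_transcription_request(
    "weekly-sync-001",
    "s3://example-bucket/meetings/weekly-sync.mp3",
    vocabulary_name="product-names",
)
```

Keeping the request builder separate from the network call makes it easy to inspect the options before anything is sent to AWS.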

Industry Variation: Amazon Transcribe Medical

Just like Comprehend Medical, this version is trained on medical terminology, including drug names like "Metformin" or "Clarithromycin," allowing doctors to dictate notes hands-free.


2. Amazon Polly (The Speaker)

Amazon Polly turns text into lifelike speech. It is the "Voice" for your applications.

Key Features:

  • Neural TTS (NTTS): High-fidelity voices that sound close to a human speaker, with natural intonation and rhythm.
  • SSML (Speech Synthesis Markup Language): You can "code" the voice. You can tell Polly to whisper, to <emphasis>stress a word</emphasis>, or to take a specific pause. Tag support varies by voice engine, so not every effect is available on every voice.
  • Speech Marks: It can tell your app exactly when it is pronouncing a specific word—perfect for syncing an avatar's lips or highlighting text as it's read.
  • Lexicons: You can teach Polly how to pronounce a specific acronym (e.g., "AWS") so it says "Amazon Web Services" or "A-W-S" depending on your choice.
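As a rough illustration of SSML and Speech Marks, the sketch below builds an SSML string with a timed pause, prepares arguments for boto3's `synthesize_speech`, and parses word-level speech marks, which Polly returns as one JSON object per line. The voice "Joanna", the sentences, and the sample marks are assumptions chosen for illustration; the sample marks are hand-written, not real Polly output.

```python
import json
from xml.sax.saxutils import escape


def build_ssml(first, second, pause_ms=500):
    """Wrap two sentences in SSML with a timed pause between them."""
    return (
        "<speak>"
        f"{escape(first)}"
        f'<break time="{pause_ms}ms"/>'
        f"{escape(second)}"
        "</speak>"
    )


def build_polly_request(ssml, voice_id="Joanna", want_speech_marks=False):
    """Build keyword arguments for Polly's SynthesizeSpeech."""
    request = {
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,   # assumed voice; pick any voice you have access to
        "Engine": "neural",
        "OutputFormat": "mp3",
    }
    if want_speech_marks:
        # Speech marks come back as JSON lines instead of audio.
        request["OutputFormat"] = "json"
        request["SpeechMarkTypes"] = ["word"]
    return request


def parse_word_marks(raw):
    """Return (time_ms, word) pairs from a speech-marks payload."""
    marks = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        mark = json.loads(line)
        if mark.get("type") == "word":
            marks.append((mark["time"], mark["value"]))
    return marks


ssml = build_ssml("Your order has shipped.", "Thank you for shopping with us.")
audio_request = build_polly_request(ssml)
marks_request = build_polly_request(ssml, want_speech_marks=True)

# Hand-written sample in the speech-marks line format, for illustration only.
sample_marks = (
    '{"time":6,"type":"word","start":7,"end":11,"value":"Your"}\n'
    '{"time":180,"type":"word","start":12,"end":17,"value":"order"}\n'
)
word_timings = parse_word_marks(sample_marks)
```

To get real output you would pass either request to `boto3.client("polly").synthesize_speech(**request)` and read the returned `AudioStream`. An app could then use the word timings to highlight each word as the audio plays.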

3. The Functional Distinction

graph LR
    subgraph Audio_Processing
    A[Raw Audio File / Stream] -->|TRANSCRIBE| B[Text Document]
    B -->|POLLY| C[Digital Voice .mp3]
    end
    
    subgraph Use_Cases
    D[Call Center Analytics]
    E[Video Subtitles]
    F[Handbooks for the Blind]
    G[Podcast Generation]
    end
    
    B --> D
    B --> E
    C --> F
    C --> G

The "A-ha!" Moment for the Exam:

  • If the question asks to "Support users with visual impairments" -> Use Amazon Polly (to read to them).
  • If the question asks to "Analyze customer complaints on phone calls" -> Use Amazon Transcribe first (to get the text) and then Amazon Comprehend (to find the sentiment).
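The second scenario, Transcribe followed by Comprehend, can be sketched as follows. This assumes the transcript has already been downloaded as Transcribe's JSON output (with the text under `results.transcripts`) and that `comprehend_client` is a boto3 Comprehend client; the function names are hypothetical.

```python
def transcript_text(transcribe_output):
    """Join the transcript strings from a Transcribe result document."""
    transcripts = transcribe_output["results"]["transcripts"]
    return " ".join(item["transcript"] for item in transcripts)


def call_sentiment(transcribe_output, comprehend_client, language_code="en"):
    """Return Comprehend's overall sentiment label for a transcribed call.

    Note: DetectSentiment limits the size of the input text, so a long
    call would need to be split into chunks first (not shown here).
    """
    text = transcript_text(transcribe_output)
    response = comprehend_client.detect_sentiment(
        Text=text, LanguageCode=language_code
    )
    return response["Sentiment"]  # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
```

With real credentials, `comprehend_client` would be `boto3.client("comprehend")`. Passing the client in as a parameter keeps the function easy to exercise with a stand-in object.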

4. Summary: The Audio Loop

  • Polly is a Synthetic Voice.
  • Transcribe is an Automatic Scribe.
  • Together, they allow for high-level automated phone systems (IVRs) and accessibility tools.

Exercise: The Strategic Chain

A hospital wants to build a system where a doctor can:

  1. Record a voice memo of a patient's symptoms.
  2. Have that memo turned into a text note in the system.
  3. Have the text note specifically highlight any "Dangerous" medical conditions.

Which services are needed, in this order?

  • A. Amazon Polly -> Amazon Transcribe -> Amazon Bedrock.
  • B. Amazon Transcribe Medical -> Amazon Comprehend Medical.
  • C. Amazon Rekognition -> Amazon Textract -> Amazon Translate.
  • D. Amazon Transcribe -> Amazon Polly -> Amazon QuickSight.

The answer is B! Transcribe Medical converts the voice memo to text. Comprehend Medical then "highlights" the specific medical conditions by extracting them as entities.
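A minimal sketch of this chain, assuming boto3's `start_medical_transcription_job` and Comprehend Medical's `detect_entities_v2`: the bucket, job name, and score threshold are hypothetical. Comprehend Medical only extracts the conditions; deciding which ones count as "Dangerous" is left to the application.

```python
def build_medical_transcription_request(job_name, media_uri, output_bucket):
    """Build keyword arguments for StartMedicalTranscriptionJob."""
    return {
        "MedicalTranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_uri},
        "OutputBucketName": output_bucket,  # where the transcript is written
        "Specialty": "PRIMARYCARE",
        "Type": "DICTATION",                # a single doctor dictating notes
    }


def find_medical_conditions(note_text, comprehend_medical_client, min_score=0.8):
    """Return the medical conditions Comprehend Medical finds in a note."""
    response = comprehend_medical_client.detect_entities_v2(Text=note_text)
    return [
        entity["Text"]
        for entity in response["Entities"]
        if entity["Category"] == "MEDICAL_CONDITION"
        and entity["Score"] >= min_score  # assumed confidence threshold
    ]
```

With real credentials, the client would be `boto3.client("comprehendmedical")`, and the note text would come from the transcript that the medical transcription job writes to the output bucket.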


Knowledge Check

Which AWS service is responsible for providing the 'Voice' for a smart home speaker by converting text into natural-sounding speech?

What's Next?

We’ve covered Audio. But what about the millions of paper documents and messy PDFs in the world? In our next lesson, we master Amazon Textract.
