Google Cloud Text to Speech API: The Future of AI Voice Synthesis

Are you tired of reading lengthy articles or books but still need to study or enjoy them? Google has an answer for you. Google Cloud Text to Speech converts text into natural-sounding speech. With the help of Google Cloud Voice, you can listen to your favorite articles, books, or even your own website content without straining your eyes. In this blog, let’s learn about Google Cloud Text to Speech and its API in detail.

What is Google Text to Speech?

Google Cloud Text to Speech is a cloud-based text-to-speech (TTS) service that allows developers to integrate natural-sounding speech into their projects. It is part of the Google Cloud AI Platform, which offers a collection of machine learning and artificial intelligence services.

Using Google Cloud Text to Speech, developers can convert written text into natural-sounding speech in a variety of languages and voices. The service uses advanced deep-learning techniques to generate speech that comes remarkably close to a human voice.

Google Cloud Text to Speech offers a wide range of customization options, including the ability to adjust the speed, pitch, and volume of the resulting audio. It also provides multiple voice options, including male and female voices in different languages and accents.

The service is easy to integrate into applications, with APIs available for multiple programming languages, including Java, Python, and Node.js. It also offers integration with other Google Cloud services, such as Google Cloud Storage and Google Cloud Functions.

How does Google Cloud Voice Work?

We just saw what Google Text-to-Speech is, but you might be wondering how it actually works. Allow me to shed some light on the subject. Google Text-to-Speech makes use of an advanced AI voice synthesis technology known as WaveNet, which was developed by DeepMind. Now, you might be wondering if DeepMind is a separate company that developed WaveNet. Well, it used to be, until Google acquired DeepMind in 2014.

To understand the proper working of Google Cloud Text-to-Speech, we must first understand how WaveNet operates. But before diving into WaveNet itself, let’s explore why it was developed and the problems it aims to solve.

Why WaveNet?

One of the earliest virtual assistants, Apple’s Siri (released in October 2011), employed Text-to-Speech technology, but it relied on a technique called concatenation synthesis. In this method, individual phonemes (the smallest speech units that differentiate words) are stored and then combined to form words and sentences.

Let’s illustrate this with an example. Imagine you said, “Hey Siri, good morning!” In concatenation synthesis, the stored recording for each unit would be retrieved and concatenated to construct the complete sentence. The output would be a sequence like this:

<voice "Hey"> + <voice "Siri"> + <voice ","> + <voice "good"> + <voice "morning"> + <voice "!">
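The idea can be sketched in a few lines of Python: each unit's waveform is looked up in a pre-recorded bank and the samples are simply joined end to end. The sample values below are made-up placeholders, not real audio:

```python
# Toy sketch of concatenation synthesis: look up each unit's
# pre-recorded samples and splice them together in order.
# The waveform values are made-up placeholders, not real audio.
UNIT_BANK = {
    "Hey": [0.1, 0.3, -0.2],
    "Siri": [0.4, -0.1],
    "good": [0.2, 0.2, -0.3],
    "morning": [-0.1, 0.5],
}

def concatenate(units):
    """Join the stored samples for each unit end to end."""
    waveform = []
    for unit in units:
        waveform.extend(UNIT_BANK[unit])
    return waveform

speech = concatenate(["Hey", "Siri", "good", "morning"])
```

Real systems store far more context (diphones, prosody variants) and smooth the joins, but the seams between units are exactly why this approach sounds less fluent than generated audio.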

How does WaveNet work?

While the concatenation synthesis was groundbreaking, it lacked the natural fluency of human speech. This is where WaveNet took a different path. Instead of piecing together pre-recorded elements, WaveNet generates raw audio waveforms from scratch.

WaveNet’s backbone is its neural network, which has undergone extensive training using a vast collection of speech samples. Throughout the training process, the network extracts the fundamental structure of speech, encompassing the sequencing of tones and the representation of realistic speech waveforms.

With WaveNet, Google has set a new standard for TTS technology, making it easier than ever to integrate natural-sounding speech into your projects.

Key Features of Google Cloud Text to Speech

Google Cloud Text to Speech (TTS) API provides a wide range of features that allow us to create a rich and natural-sounding speech for our application. Let’s dive into the key features and explore how Google TTS can enhance our speech synthesis experience:

  1. Custom Voice (beta): Train a unique and personalized speech synthesis model using your own recordings.
  2. Voice and Language Selection: Choose from over 220 voices in 40 languages to create a localized and engaging experience.
  3. Google WaveNet Voices: Access over 90 WaveNet voices that bring human-like performance and authenticity to your applications.
  4. Text and SSML Support: Customize speech output with SSML tags for fine-grained control, including pauses, numbers, and pronunciation instructions.
  5. Pitch Tuning: Personalize the pitch of the voice to match character traits or convey emotions effectively.
  6. Speaking Rate Tuning: Adjust the speaking rate up to four times faster or slower to align with the desired context.
  7. Volume Gain Control: Amplify or reduce the volume of the speech output for intelligibility in various playback environments.
  8. Integrated REST and gRPC APIs: Seamlessly integrate with applications or devices using REST or gRPC requests.
  9. Audio Format Flexibility: Convert text into MP3, Linear16, OGG Opus, and other formats for compatibility and easy integration.
  10. Audio Profiles: Optimize speech output for specific playback scenarios, enhancing quality and user experience.

Google Cloud Text-to-Speech (TTS) Pricing

Now, let’s talk about everyone’s favorite topic: pricing! We understand that cost plays a crucial role in decision-making, so let’s break down the pricing structure for the Google Cloud Text-to-Speech API. Remember, always check Google Cloud TTS’s official documentation for the most up-to-date pricing information.

The pricing for Google Text-to-Speech API service revolves around the number of characters you send to the service to be synthesized into audio each month. It’s simple and straightforward.

However, before you start using the service, Google Cloud Platform requires you to enable billing. Once billing is enabled, you will be charged automatically if your usage exceeds the number of free characters allowed per month.

So, keep an eye on your usage to avoid any surprises. Google Cloud provides tools to help you monitor your API usage and keep track of your character totals.

Google TTS API Service Pricing Breakdown

Now, let’s get into the specifics of pricing based on the different voice types:

| Voice Type | Free Usage Limit | Price after Free Limit Exceeded |
| --- | --- | --- |
| Neural2 voices | 0 to 1 million bytes | $0.000016 USD per byte |
| Studio (Preview) voices | 0 to 100K bytes | $0.00016 USD per byte |
| Standard voices | 0 to 4 million characters | $0.000004 USD per character |
| WaveNet voices | 0 to 1 million characters | $0.000016 USD per character |

Google Cloud Text to Speech (TTS) Voice Pricing
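Using the free limits above, a rough monthly cost estimate can be sketched in Python. The rates are copied from the pricing table and may change, so always confirm against the official pricing page:

```python
# Free monthly limit and per-unit price for each voice type,
# copied from the pricing table above (rates may change).
PRICING = {
    "standard": (4_000_000, 0.000004),  # billed per character
    "wavenet":  (1_000_000, 0.000016),  # billed per character
    "neural2":  (1_000_000, 0.000016),  # billed per byte
    "studio":   (100_000,   0.00016),   # billed per byte
}

def estimate_cost(voice_type, units_used):
    """Return the estimated USD cost for one month's usage."""
    free_limit, rate = PRICING[voice_type]
    billable = max(0, units_used - free_limit)
    return billable * rate

# 1.5M WaveNet characters -> 0.5M billable at $0.000016 each
print(round(estimate_cost("wavenet", 1_500_000), 2))  # 8.0
```

Usage entirely inside the free tier, such as `estimate_cost("standard", 3_000_000)`, comes out to zero.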

📝Note: For WaveNet and Standard voices, the number of characters will be equal to or less than the number of bytes represented by the text. This includes alphanumeric characters, punctuation, and white spaces. Some character sets use more than one byte for a character. For example, Japanese (ja-JP) characters in UTF-8 typically require more than one byte each. In this case, you are only charged for one character, not multiple bytes.
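The character-versus-byte distinction is easy to see in Python: a Japanese string has more UTF-8 bytes than characters, but for WaveNet and Standard voices you are billed per character:

```python
text = "こんにちは"  # "hello" in Japanese: 5 characters

chars = len(text)                       # what you are billed for
byte_count = len(text.encode("utf-8"))  # what the text occupies on the wire

print(chars)       # 5
print(byte_count)  # 15 (each of these characters is 3 bytes in UTF-8)
```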

Additionally, if you use other Google Cloud Platform resources alongside Text-to-Speech, such as Google App Engine instances, you’ll be billed for the use of those services as well.

To get a comprehensive view of your potential costs, you can use the Google Cloud Platform Pricing Calculator, which takes into account the current rates for various services.

Remember to always check the official pricing documentation for the most accurate and up-to-date information. Understanding the pricing structure helps you plan and manage your budget effectively while making the most of the powerful Google Cloud Text-to-Speech API. So go ahead, and unleash the magic of speech synthesis without any financial surprises!

Project Setup for Google Cloud Text to Speech (TTS) API

Now that we have a basic understanding of the Google Cloud Text-to-Speech API, let’s dive into the project setup process.

Steps to set up a project for Google Cloud Text to Speech (TTS) API:

  1. Sign in to Google Cloud Console.
  2. Select or Create a project.
  3. Enable the Text-to-Speech API.
    • Make sure billing is enabled for Text-to-Speech.
    • Make sure your project has at least one service account.
    • Download a service account credential key.
  4. Link a service account to the Text-to-Speech API (Optional, if you have already linked).
  5. Set the authentication environment variable.

Let’s follow these steps to get everything up and running smoothly.

Step 1: Sign in to the Google Cloud Console

To begin, sign in to the Google Cloud Console using your Google account credentials. If you don’t have an account, you can create one for free.

Step 2: Select or create a project

Once you’re logged in, navigate to the project selector page. You can either choose an existing project or create a new one.

If you decide to create a new project, you’ll be prompted to link a billing account to it. If you’re using a pre-existing project, ensure that billing is enabled.

📝Note: You must enable billing to use the Text-to-Speech API; however, you will not be charged unless you exceed the free quota. See the pricing page for more details.

Step 3: Enable the Text-to-Speech API

To start using the Text-to-Speech API, we need to enable it for our project.

In the Cloud Console:

  1. Go to the Search Products and Resources bar at the top of the page
  2. Search “Speech”
  3. Select Cloud Text-to-Speech API from the list of results.

At this point, you have two options:

  1. Try the API without linking it to your project by selecting the “TRY THIS API” option, or
  2. Enable it for your project by clicking the “ENABLE” button.

Step 4: Link a service account to the Text-to-Speech API

To use the Text-to-Speech API, you need to link one or more service accounts to it. Service accounts provide the necessary authentication credentials. On the left side of the Text-to-Speech API page, click on the “Credentials” link.

If you don’t have any service accounts associated with your project, follow the instructions provided to create a new one.

Fill in the required details, such as the service account name and description, and click “CREATE AND CONTINUE“. We recommend assigning one of the basic IAM roles to the service account.

If you already have a service account and its JSON key, you can proceed to Step 6: Set the authentication environment variable.

Step 5: Create a JSON key for your service account

The JSON key associated with your service account is required for authentication purposes when making requests to the Text-to-Speech API.

To create a JSON key for the service account:

  1. Click on the service account you want to use and,
  2. Select the “KEYS” tab.
  3. Click on the “ADD KEY” button.
  4. Choose “Create new key”.

It’s recommended to select the JSON format for the key.

Once you’ve created the key, it will be automatically downloaded. Make sure to store the JSON file in a secure location and take note of the file path. You’ll need to reference this file by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable during the authentication process.

Step 6: Set the authentication environment variable

To authenticate your application code, you need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable. This variable specifies the path to your service account’s JSON key.

On Linux or macOS, you can set the variable for your current shell session using the following command:



export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

Note: Replace KEY_PATH with the actual path to your JSON file, for example /home/user/Downloads/service-account-file.json.

On Windows, the process depends on your shell. For PowerShell, use the following command:

$env:GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

For the command prompt, use:

set GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

Again, remember to replace KEY_PATH with the actual path to your JSON file, for example C:\Users\username\Downloads\service-account-file.json.
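For quick local experiments, you can also set the variable from inside a Python script before any client code runs; shell configuration remains the usual approach, and the path below is just a placeholder:

```python
import os

# Placeholder path -- point this at your downloaded JSON key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = (
    "/home/user/Downloads/service-account-file.json"
)

# Client libraries read this variable when they authenticate.
print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```

Note that this only affects the current process and its children, not your shell session.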

That’s it! You have now successfully set up your project for the Google Cloud Text-to-Speech API. You’re ready to explore the fascinating world of speech synthesis and leverage the power of this API to enhance your applications.

Disable GCP Text to Speech API

If you ever need to disable the Text-to-Speech API, you can do so by:

  1. Navigate to the Google Cloud Platform (GCP) dashboard.
  2. Click the “Go to APIs overview” link in the APIs box.
  3. Locate the Text-to-Speech API.
  4. Select the “DISABLE API” button at the top of the page.

Now that we have the foundation in place, let’s dive deeper into the functionalities and explore how to make the most out of the Google Cloud Text-to-Speech API!

How to Interact with the Google Cloud Text-to-Speech API?

There are several ways to interact with the Google Cloud Text-to-Speech API:

  1. REST API: You can use HTTP requests (POST or GET) to communicate with the API. Send requests to specific endpoints with the required parameters and authentication credentials.
  2. Client Libraries: Google provides client libraries for popular languages like Python, Java, Node.js, Ruby, C#, and Go. These libraries offer convenient methods and objects to interact with the API, handling low-level details such as HTTP requests and authentication.
  3. Command-Line Interface (CLI): The Google Cloud SDK includes a CLI that allows you to use command-line tools for API tasks. You can synthesize speech, manage voices, and perform other API operations from the command line.
  4. Cloud Client Libraries: Google Cloud offers Cloud Client Libraries, available in various programming languages. These libraries provide a consistent way to interact with multiple services, including the Text-to-Speech API, by handling authentication, request serialization, and error handling.

Each method has its advantages and disadvantages, so choose the one that suits your preferences and project requirements. Whether you prefer REST API calls, client libraries, CLI, or Cloud Client Libraries, you have options to integrate Google Cloud Text-to-Speech API seamlessly into your applications.

Object Types in Google Cloud TTS API

Before diving into the endpoints, let’s take a look at some of the object types that will be used in Google Cloud Text to Speech API:


SynthesisInput

The SynthesisInput object specifies the input for speech synthesis. It contains the text (plain or SSML) that will be synthesized into speech.

It’s important to note that either the text or ssml field must be supplied. Providing both or neither will result in an INVALID_ARGUMENT error. The input size is limited to 5000 bytes, so keep that in mind when working with larger text inputs.

SynthesisInput JSON Representation

Let’s explore the structure of the SynthesisInput object in JSON representation:

{
  "text": string,
  "ssml": string
}

The SynthesisInput object has the following fields:

| Field | Type | Description |
| --- | --- | --- |
| text | string | The raw text to be synthesized. |
| ssml | string | The SSML document to be synthesized. The SSML document must be valid and well-formed for successful synthesis. |

Google Cloud Voice SynthesisInput Fields

The text field is used when you want to synthesize speech from plain text. Simply provide the raw text you want to convert into speech.

The ssml field is used when you want to utilize the power of Speech Synthesis Markup Language (SSML). SSML allows you to add various speech features, such as pauses, emphasis, and pitch changes, to enhance the expressiveness of the synthesized speech. Ensure that the provided SSML document is valid and well-formed to avoid any INVALID_ARGUMENT errors.

When using the SynthesisInput object, you can choose between plain text or SSML to suit your specific requirements.

Let’s look at an example where the text field is used:

{
  "text": "Hello, world! How are you today?"
}

And here’s an example where the ssml field is utilized:

{
  "ssml": "<speak>Hello, <emphasis level=\"strong\">world!</emphasis> How are <prosody rate=\"fast\">you</prosody> today?</speak>"
}

Remember, the SynthesisInput object is a crucial part of the API request, allowing you to provide the desired text or SSML input for speech synthesis. Use this input wisely to create an engaging and lifelike synthesized speech that meets your application’s needs.
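The "exactly one of `text` or `ssml`" rule and the 5000-byte limit can both be checked client-side before a request is sent. Here is a small hypothetical helper (the function name and validation are ours, not part of the API):

```python
def build_synthesis_input(text=None, ssml=None):
    """Return a SynthesisInput dict, enforcing the exactly-one rule."""
    if (text is None) == (ssml is None):
        # Supplying both or neither triggers INVALID_ARGUMENT server-side.
        raise ValueError("Provide exactly one of 'text' or 'ssml'.")
    payload = {"text": text} if text is not None else {"ssml": ssml}
    # The API limits the input size to 5000 bytes.
    if len(next(iter(payload.values())).encode("utf-8")) > 5000:
        raise ValueError("Input exceeds the 5000-byte limit.")
    return payload

print(build_synthesis_input(text="Hello, world!"))  # {'text': 'Hello, world!'}
```

Failing fast locally is cheaper than waiting for the server to reject the request.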


VoiceSelectionParams

The VoiceSelectionParams object allows you to specify the voice to be used for the synthesis request.

It provides details such as the language, name, gender preference, and custom voice configuration.

{
  "languageCode": string,
  "name": string,
  "ssmlGender": enum (SsmlVoiceGender),
  "customVoice": {
    object (CustomVoiceParams)
  }
}
Let’s break down the fields of the VoiceSelectionParams object:

| Field | Type | Description |
| --- | --- | --- |
| languageCode | string | Required. The language (and potentially the region) of the voice, expressed as a BCP-47 language tag, such as en-US. Note: the TTS service may choose a voice with a slightly different language code or region based on availability; the script tag should not be included. |
| name | string | The name of the voice. If not set, the service will choose a voice based on the other parameters, such as languageCode and ssmlGender. |
| ssmlGender | enum (SsmlVoiceGender) | The preferred gender of the voice. If not set, the service will choose a voice based on the other parameters, such as languageCode and name. |
| customVoice | object (CustomVoiceParams) | The configuration for a custom voice. If CustomVoiceParams.model is set, the service will select the custom voice that matches the specified configuration. |

VoiceSelectionParams Fields


CustomVoiceParams

The CustomVoiceParams object describes the details of a custom voice to be synthesized.

{
  "model": string,
  "reportedUsage": enum (ReportedUsage)
}

The fields of the CustomVoiceParams object are as follows:

| Field | Type | Description |
| --- | --- | --- |
| model | string | Required. The name of the AutoML model that will synthesize the voice. |
| reportedUsage | enum (ReportedUsage) | Optional. The usage of the synthesized audio that is to be reported. |

CustomVoiceParams Fields

ReportedUsage

The ReportedUsage enum specifies the usage category for the synthesized audio. It is important to report your usage honestly and accurately as it affects the billing and compliance with the service contract.

| Value | Description |
| --- | --- |
| REPORTED_USAGE_UNSPECIFIED | Requests with unspecified reported usage will be rejected. |
| REALTIME | For scenarios where the synthesized audio is not downloadable and can only be used once, such as real-time requests in an IVR (Interactive Voice Response) system. |
| OFFLINE | For scenarios where the synthesized audio is downloadable and can be reused, such as audio that is downloaded, stored, and played repeatedly in a customer service system. |

ReportedUsage Enums

By utilizing the VoiceSelectionParams object and its fields, you can customize the voice for your speech synthesis requests, ensuring the generated audio meets your specific requirements.


AudioConfig

The AudioConfig object in the Google Cloud Text-to-Speech API describes the audio data to be synthesized and lets you specify various parameters that control the generated audio.

It provides options to customize the speaking rate, pitch, volume gain, sample rate, and effects profiles.

{
  "audioEncoding": enum (AudioEncoding),
  "speakingRate": number,
  "pitch": number,
  "volumeGainDb": number,
  "sampleRateHertz": integer,
  "effectsProfileId": [
    string
  ]
}

Let’s explore each field in detail.

1. audioEncoding

The audioEncoding field is required and determines the format of the audio byte stream. Here are the supported audio encodings in the API:

| Value | Description |
| --- | --- |
| AUDIO_ENCODING_UNSPECIFIED | Not specified. Will return a result of google.rpc.Code.INVALID_ARGUMENT. |
| LINEAR16 | Uncompressed 16-bit signed little-endian samples (Linear PCM). Audio content returned as LINEAR16 also contains a WAV header. |
| MP3 | MP3 audio at 32kbps. |
| OGG_OPUS | Opus-encoded audio wrapped in an Ogg container. The quality of the encoding is higher than MP3 while using approximately the same bitrate. |
| MULAW | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. Audio content returned as MULAW also contains a WAV header. |
| ALAW | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/A-law. Audio content returned as ALAW also contains a WAV header. |

Google Text to Speech – audioEncoding Enums

It’s essential to choose the appropriate audio encoding based on your specific requirements, considering factors such as audio quality and compatibility.

2. speakingRate

The speakingRate field, which is optional, allows you to control the speaking rate or speed of the synthesized speech. The value range is between 0.25 and 4.0, where 1.0 represents the normal native speed supported by the chosen voice.

Setting the value to 2.0 doubles the speed, while 0.5 halves it. By default, if unset or set to 0.0, it defaults to the native 1.0 speed. Please note that values outside the range of 0.25 to 4.0 will result in an error.

3. pitch

The pitch field, also optional, enables you to adjust the speaking pitch of the synthesized audio. The range is -20.0 to 20.0, where 20 means increasing the pitch by 20 semitones from the original, and -20 means decreasing the pitch by 20 semitones. This parameter provides flexibility in modifying the pitch according to your desired effect.

4. volumeGainDb

The volumeGainDb field, again optional, allows you to control the volume gain of the synthesized speech. It represents the gain in decibels (dB) and ranges from -96.0 to 16.0. A value of 0.0 (dB) plays the audio at the normal native signal amplitude.

Setting it to -6.0 (dB) plays the audio at approximately half the amplitude, while +6.0 (dB) plays it at approximately twice the amplitude. It’s generally recommended not to exceed +10.0 (dB) as there is usually no effective increase in loudness beyond that point.
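The "about half" and "about twice" figures come from the standard decibel formula, amplitude_ratio = 10^(dB / 20). This arithmetic is ours, not from the API docs, but it shows why those values behave as described:

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a decibel gain to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)

print(round(db_to_amplitude_ratio(-6.0), 2))  # 0.5  (about half the amplitude)
print(round(db_to_amplitude_ratio(6.0), 2))   # 2.0  (about twice the amplitude)
print(db_to_amplitude_ratio(0.0))             # 1.0  (native amplitude)
```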

5. sampleRateHertz

The sampleRateHertz field is optional and specifies the synthesis sample rate in hertz (Hz) for the audio. If the specified sample rate is different from the voice’s natural sample rate, the synthesizer will honor the request by converting the audio to the desired sample rate.

However, this conversion might result in reduced audio quality. If the specified sample rate is not supported for the chosen encoding, the request will fail, returning google.rpc.Code.INVALID_ARGUMENT.

6. effectsProfileId

The effectsProfileId field is optional and allows you to select audio effects profiles to be applied to the synthesized speech. Multiple effects profiles can be applied on top of each other by providing their respective identifiers. You can refer to the audio profiles documentation to find the currently supported profile IDs and explore the available effects.

It’s important to note that not all effects profiles are compatible with every voice or language. Therefore, you should verify the compatibility before applying specific profiles.
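The documented ranges for speakingRate, pitch, and volumeGainDb can be validated before a request leaves your application. A hypothetical helper (the function and dict names are ours):

```python
# Documented (min, max) ranges for the tunable AudioConfig fields.
RANGES = {
    "speakingRate": (0.25, 4.0),
    "pitch": (-20.0, 20.0),
    "volumeGainDb": (-96.0, 16.0),
}

def validate_audio_config(config):
    """Raise ValueError if any tunable field is outside its documented range."""
    for field, (lo, hi) in RANGES.items():
        value = config.get(field)
        if value is not None and not (lo <= value <= hi):
            raise ValueError(f"{field}={value} is outside [{lo}, {hi}]")
    return config

config = validate_audio_config({
    "audioEncoding": "MP3",
    "speakingRate": 1.25,
    "pitch": -2.0,
})
```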


SsmlVoiceGender

The SsmlVoiceGender parameter specifies the gender of the voice when using the SSML (Speech Synthesis Markup Language) voice element. This parameter allows you to tailor the synthesized speech to match the desired gender expression. Let’s explore the available options for the SsmlVoiceGender enum:

| Value | Description |
| --- | --- |
| SSML_VOICE_GENDER_UNSPECIFIED | The gender of the voice is unspecified. In VoiceSelectionParams, it means the client has no preference regarding the gender of the selected voice. In the Voice field of ListVoicesResponse, it may indicate that the voice does not fit any other gender category or that its gender is unknown. |
| MALE | A male voice. Selecting this gender will result in synthesized speech with masculine characteristics. |
| FEMALE | A female voice. Selecting this gender will result in synthesized speech with feminine characteristics. |
| NEUTRAL | A gender-neutral voice. This voice is not yet supported. |

SsmlVoiceGender Enums

When utilizing the Google Cloud Text-to-Speech API, you can select the appropriate gender to align the synthesized speech with your project’s specific requirements. Whether you’re aiming for a commanding male voice, a soothing female voice, or a gender-neutral option (when supported), the SsmlVoiceGender parameter empowers you to shape the auditory experience to suit your application’s needs.

Google Cloud Text to Speech REST API

Now, let’s learn about the different endpoints in the text-to-speech REST API:

Retrieve the List of Supported Cloud Voices

To retrieve the list of voices supported by the Text to Speech API, we’ll make a GET request to the following endpoint:


Google Cloud Text to Speech Voices Endpoint

The API uses gRPC Transcoding syntax, which allows us to communicate with it using HTTP/JSON. Now, let’s break down the details of the API request:

Query Parameters:

  • languageCode (optional): By specifying a language code in the query parameter, you can filter the list of voices based on a particular language. For example, languageCode=en-US will return voices that support English (United States). Feel free to experiment with different language codes to find the perfect voice for your application.

Request Body:

The request body should be empty for this particular API request.

Making Voices List API Call Using Python

Here’s an example of how you can make an HTTP call to the voices.list endpoint of the Google Cloud Text-to-Speech API using Python:

import requests

# The voices.list endpoint; authenticate with an API key query
# parameter (replace YOUR_API_KEY with your own key).
url = ""
params = {"key": "YOUR_API_KEY"}

# Optional: filter the voice list by language code
language_code = "en-US"
if language_code:
    params["languageCode"] = language_code

# Make the GET request
response = requests.get(url, params=params)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the response content (JSON data)
    print(response.json())
else:
    # If the request failed, print the status code and error message
    print(f"Request failed with status code {response.status_code}: {response.text}")

Response Body:

Upon a successful request, the response body will contain data structured as follows:

{
  "voices": [
    {
      "languageCodes": [
        "ar-XA"
      ],
      "name": "ar-XA-Wavenet-A",
      "ssmlGender": "FEMALE",
      "naturalSampleRateHertz": 24000
    },
    {
      "languageCodes": [
        "ar-XA"
      ],
      "name": "ar-XA-Wavenet-B",
      "ssmlGender": "MALE",
      "naturalSampleRateHertz": 24000
    }
  ]
}
In the above sample output, we can see two voices that support the Arabic language (ar-XA). The first voice, named ar-XA-Wavenet-A, has a female gender (FEMALE), while the second voice, ar-XA-Wavenet-B, has a male gender (MALE). Both voices share a natural sample rate of 24000 Hz.

The voices field contains an array of voice objects, each representing a unique voice supported by the Text to Speech API.
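Once you have the parsed response, filtering the voices array by language is ordinary dict work. The sample data below mirrors the response shape shown above (abbreviated):

```python
# Sample data shaped like a voices.list response (abbreviated).
response = {
    "voices": [
        {"name": "ar-XA-Wavenet-A", "languageCodes": ["ar-XA"], "ssmlGender": "FEMALE"},
        {"name": "ar-XA-Wavenet-B", "languageCodes": ["ar-XA"], "ssmlGender": "MALE"},
        {"name": "en-US-Wavenet-A", "languageCodes": ["en-US"], "ssmlGender": "MALE"},
    ]
}

def voices_for_language(data, language_code):
    """Return the names of voices that support the given BCP-47 tag."""
    return [
        voice["name"]
        for voice in data["voices"]
        if language_code in voice["languageCodes"]
    ]

print(voices_for_language(response, "ar-XA"))
# ['ar-XA-Wavenet-A', 'ar-XA-Wavenet-B']
```

The same filtering is available server-side via the languageCode query parameter, but local filtering is handy when you cache the full list.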

Understanding the Voice Object

The voice object provides essential details about each voice supported by the Text to Speech service. Let’s take a closer look at the structure and its fields:

{
  "languageCodes": [
    string
  ],
  "name": string,
  "ssmlGender": enum (SsmlVoiceGender),
  "naturalSampleRateHertz": integer
}

Voice Object Fields:

| Field | Type | Description |
| --- | --- | --- |
| languageCodes[] | string | An array of BCP-47 language tags that the voice supports (e.g., en-US, es-419, cmn-tw). |
| name | string | The unique name of the voice. |
| ssmlGender | enum (SsmlVoiceGender) | The gender of the voice. Possible values are FEMALE, MALE, NEUTRAL, and SSML_VOICE_GENDER_UNSPECIFIED. |
| naturalSampleRateHertz | integer | The natural sample rate (in hertz) for the voice. |

Google Cloud Voice Object Fields

Google Cloud Voices

Google Cloud Text-to-Speech offers a range of voices to choose from, including male and female voices in various languages. This makes it easy to find a voice that fits your project’s needs.

Let’s explore some of the voices available in Google Cloud Text-to-Speech.

Google Cloud Voices for the English Language

Google Cloud Text-to-Speech offers a variety of voices for the English language. Some of the popular English voices include:

  1. en-US-Wavenet-A – This voice is a female voice that sounds like a young adult. It has a natural-sounding intonation and is suitable for a wide range of applications.
  2. en-US-Wavenet-B – This voice is a male voice that sounds like a middle-aged adult. It has a smooth, clear tone and is suitable for presentations and narrations.
  3. en-GB-Wavenet-A – This voice is a female voice that sounds like a young adult from the United Kingdom. It has a British accent and is suitable for applications that require a British voice.
  4. en-AU-Wavenet-A – This voice is a female voice that sounds like a young adult from Australia. It has an Australian accent and is suitable for applications that require an Australian voice.

Google Cloud Voices for Non-English Languages

Google Cloud Text-to-Speech also offers a variety of voices for non-English languages. Some of the popular non-English voices include:

  1. fr-FR-Wavenet-A – This voice is a female voice that sounds like a young adult from France. It has a natural-sounding intonation and is suitable for a wide range of applications.
  2. de-DE-Wavenet-A – This voice is a female voice that sounds like a young adult from Germany. It has a clear, crisp tone and is suitable for presentations and narrations.
  3. es-ES-Wavenet-A – This voice is a female voice that sounds like a young adult from Spain. It has a smooth, melodic tone and is suitable for applications that require a Spanish voice.
  4. ja-JP-Wavenet-A – This voice is a female voice that sounds like a young adult from Japan. It has a natural-sounding intonation and is suitable for a wide range of applications.
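Voice names follow a consistent `<language>-<REGION>-<Type>-<Variant>` pattern, which makes them easy to pull apart programmatically. The breakdown below is our reading of the examples above, not an official grammar:

```python
def parse_voice_name(name):
    """Split a voice name like 'en-US-Wavenet-A' into its parts.

    Assumes the common four-part naming pattern seen in the
    examples above; exotic names may not fit this shape.
    """
    language, region, voice_type, variant = name.split("-")
    return {
        "languageCode": f"{language}-{region}",
        "type": voice_type,   # e.g. Wavenet or Standard
        "variant": variant,   # e.g. A, B, C ...
    }

print(parse_voice_name("en-AU-Wavenet-A"))
# {'languageCode': 'en-AU', 'type': 'Wavenet', 'variant': 'A'}
```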

Synthesize Text to Speech

When working with the Google Cloud Text-to-Speech API, one of the key methods available is the text.synthesize method. This method allows you to synthesize speech synchronously, meaning that you receive the results after all the text input has been processed.

HTTP Request:

To synthesize speech using the REST API, you need to send a POST request to the following endpoint:


Google Cloud Text Synthesize Endpoint

The request body should include the necessary data in JSON format. Let’s take a closer look at the structure of the request body:

{
  "input": {
    object (SynthesisInput)
  },
  "voice": {
    object (VoiceSelectionParams)
  },
  "audioConfig": {
    object (AudioConfig)
  }
}


For example:

{
  "input": {
    "text": "Hello, This is hack the developer speaking!"
  },
  "voice": {
    "languageCode": "en-US",
    "name": "en-US-Wavenet-A",
    "ssmlGender": "MALE"
  },
  "audioConfig": {
    "audioEncoding": "MP3",
    "pitch": 0,
    "speakingRate": 0,
    "sampleRateHertz": 0,
    "volumeGainDb": 0
  }
}

The request body contains three fields:

  1. input: SynthesisInput (required)
  2. voice: VoiceSelectionParams (required)
  3. audioConfig: AudioConfig (required)

Response Structure

Upon a successful request, the response body will contain the synthesized audio data. Let’s take a look at the structure of the response:

{
  "audioContent": string
}

The audioContent field contains the audio data bytes encoded as specified in the request. Depending on the chosen encoding, such as MP3 or OGG_OPUS, the audio data may be wrapped in containers. For LINEAR16 audio, a WAV header is included.

📝 Note: The audioContent field is presented as a base64-encoded string.
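Because audioContent is base64-encoded, it must be decoded before the bytes are written to an audio file. A minimal sketch, where the encoded string is a stand-in for real MP3 data:

```python
import base64

# Stand-in for the base64 string returned in "audioContent".
audio_content = base64.b64encode(b"fake-mp3-bytes").decode("ascii")

# Decode back to raw bytes before writing the audio file.
audio_bytes = base64.b64decode(audio_content)

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

print(len(audio_bytes))  # 14
```

Writing the base64 string to disk without decoding it is a common mistake that produces an unplayable file.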

Making Text Synthesize API Calls in Python

To demonstrate how to make the API call to the text synthesize endpoint using Python, we can utilize the requests library:

import requests

# The text:synthesize endpoint
url = ""
headers = {
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
    "Accept": "application/json",
    "Content-Type": "application/json",
}
payload = {
    "audioConfig": {
        "audioEncoding": "MP3",
        "pitch": 0,
        "speakingRate": 0,
        "sampleRateHertz": 0,
        "volumeGainDb": 0
    },
    "input": {
        "text": "Hello, This is hack the developer speaking!"
    },
    "voice": {
        "languageCode": "en-US",
        "name": "en-US-Wavenet-A",
        "ssmlGender": "MALE"
    }
}

response =, headers=headers, json=payload)
if response.status_code == 200:
    audio_data = response.json()["audioContent"]
else:
    print("Error:", response.status_code, response.text)

📝Note: Don’t forget to replace YOUR_ACCESS_TOKEN with your actual access token obtained from the authentication process.


{
  "audioContent": "//NExAARMGFcA...."
}

Learn how you can use Text-to-Speech in your Python application using the Pyttsx3 library.

Wrapping Up

In conclusion, the Google Cloud Text-to-Speech API provides a powerful platform for synthesizing speech from text input. By leveraging its features and functionality, you can enhance your applications with natural-sounding voices and bring your content to life.

Throughout this blog, we explored the key aspects of the Google Cloud Text-to-Speech API, including its REST API and the text.synthesize method. We discussed how to make requests, provided code examples for making API calls, and explained the necessary parameters and response structures.

Remember, the Google Cloud Text-to-Speech API offers a wide range of languages, voices, and customization options, allowing you to create engaging and dynamic applications. Whether you’re building interactive voice response systems, chatbots, or any other voice-enabled applications, this API empowers you to deliver an exceptional user experience.

So, dive into the world of speech synthesis with the Google Cloud Text-to-Speech API, and let your applications speak volumes with lifelike voices and seamless integration. Happy coding!

Frequently Asked Questions (FAQs)

Is Google text-to-speech API free?

The Google Text-to-Speech API is a paid service offered as part of the Google Cloud Platform, but it includes a free monthly usage tier (for example, up to 4 million Standard-voice characters per month at the time of writing). Beyond the free tier, you are charged based on the number of characters converted from text to speech.

How do I get Google text-to-speech API?

To access the Google text-to-speech API, you need to follow these steps:
1. Create a project on the Google Cloud Platform (GCP) console.
2. Enable the Text-to-Speech API for your project (Ensure billing is enabled).
3. Set up authentication and obtain the necessary API credentials, such as an API key or service account key.
4. Install the necessary client libraries or use RESTful API calls to interact with the Google text-to-speech service.

How long is Google Cloud free trial?

Google Cloud offers a free trial period of 90 days. This trial period allows users to explore various Google Cloud services, including the text-to-speech API, within certain usage limits, without incurring charges.

How many languages does Google Cloud Text-to-Speech support?

Google Cloud Text-to-Speech supports a wide range of languages to cater to diverse global audiences. As noted earlier, the API provides over 220 voices across more than 40 languages and variants, including popular options such as English, Hindi, Spanish, French, German, Italian, Japanese, Korean, Chinese, and many others.

Can I use Google text-to-speech for YouTube?

Yes, you can use the Google Cloud text-to-speech (TTS) API to create voiceovers or narration for your YouTube videos. However, it is crucial to adhere strictly to YouTube’s policies and guidelines when employing TTS or any other methods for generating content.

