Speech to text

Last updated 2023/03/01 10:41:19

Agora provides a speech-to-text service for live streaming scenarios, which takes the audio content of a host's media stream and transcribes it into written words in real time.

This page shows you how to implement speech-to-text.

Understand the tech

Send an HTTP request to the Agora server through your business server to convert speech to text in real time. The speech-to-text service supports two modes:

Convert speech to text in real time.
Convert speech to text in real time, store the text in WebVTT format and upload the file to third-party cloud storage.

Prerequisites

Contact sales@agora.io to enable the speech-to-text service for your project.
You have a computer with internet access. For the required firewall ports, see Firewall Requirements.
To record and store speech-to-text videos and texts, make sure you have activated a third-party cloud storage service. The following third-party cloud storage service providers are currently supported:
Agora SDK has been integrated into your project, and you are able to join channels and receive media streams. For specific steps, refer to Get Started with Interactive Live Streaming Premium.
Make sure you have joined the RTC channel and that users in the channel are publishing streams.

Call sequence

Follow the steps below to call the RESTful API to convert speech to text:

Call the acquire method to request a builderToken for the speech-to-text task. A builderToken is valid for 5 minutes.
Call the start method within 5 minutes after getting the builderToken to start the speech-to-text task.
Call the stop method to stop the speech-to-text task.

After you start a speech-to-text task, you can call query to check its status.
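
The call sequence above maps onto three endpoint URLs. The following is a minimal Python sketch that only builds those URLs from the paths shown on this page; the function names and the `CALL_SEQUENCE` list are illustrative and not part of any Agora SDK.

```python
from urllib.parse import quote, urlencode

# Base path taken from the HTTP requests documented on this page.
BASE = "https://api.agora.io/v1/projects"


def builder_tokens_url(app_id: str) -> str:
    """URL for the acquire step (POST)."""
    return f"{BASE}/{quote(app_id, safe='')}/rtsc/speech-to-text/builderTokens"


def tasks_url(app_id: str, token_name: str) -> str:
    """URL for the start step (POST)."""
    query = urlencode({"builderToken": token_name})
    return f"{BASE}/{quote(app_id, safe='')}/rtsc/speech-to-text/tasks?{query}"


def task_url(app_id: str, task_id: str, token_name: str) -> str:
    """URL shared by the query (GET) and stop (DELETE) steps."""
    query = urlencode({"builderToken": token_name})
    return (
        f"{BASE}/{quote(app_id, safe='')}/rtsc/speech-to-text/tasks/"
        f"{quote(task_id, safe='')}?{query}"
    )


# The documented order of calls as (step, HTTP method) pairs.
CALL_SEQUENCE = [
    ("acquire", "POST"),
    ("start", "POST"),
    ("query", "GET"),
    ("stop", "DELETE"),
]
```

Percent-encoding the path and query values is a defensive choice in this sketch; the page does not state which characters an App ID or task ID may contain.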

Generate a builderToken

A builderToken secures your speech-to-text tasks. Call this method to generate a builderToken before starting a speech-to-text task.

Use this builderToken to send a request within 5 minutes. After the time has expired, you need to generate a new builderToken; otherwise, other methods cannot be called.

HTTP request

POST https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/builderTokens

Path parameter

appId: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.

Request header

Content-Type: application/json
Authorization: The value of this field needs to refer to the authentication instructions.

Request body

The following parameters need to be passed in the request body:

Field	Type	Description
`instanceId`	String	(Required) The instance ID set by the developer. The maximum length is 64 characters. The following characters are supported: lowercase English letters (a-z), uppercase English letters (A-Z), digits (0-9), "-", and "_". One instanceId can generate multiple builderTokens, but only one builderToken can be used to send a request in a task.

Request body example:

{
    "instanceId": "XXXX"
}
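
It can help to check the `instanceId` constraints on your business server before sending the request. The following is a minimal Python sketch of such a check; the function name is hypothetical, and rejecting an empty string is an assumption, since this page only states a maximum length.

```python
import re

# Allowed characters per the table above: a-z, A-Z, 0-9, "-", "_".
# Requiring at least one character is an assumption; the page only
# documents the 64-character maximum.
_INSTANCE_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_instance_id(instance_id: str) -> bool:
    """Return True if instance_id satisfies the documented constraints."""
    return _INSTANCE_ID.fullmatch(instance_id) is not None
```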

HTTP response

Response body

If the status code is 2XX, the request is successful. The response body contains the following fields:

Field	Type	Description
`tokenName`	String	The value of the dynamic key builderToken. This value needs to be passed in when calling other methods.
`createTs`	Number	The Unix timestamp (seconds) when the builderToken was generated.
`instanceId`	String	The instance ID set in the request body.

Response body example:

{
    "tokenName": "XXXXXX",
    "createTs": 1550024508,
    "instanceId": "XXXXXX"
}

If the status code is not 2XX, the request fails. The response body contains the message field, reporting the reason for the failure of the request.
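
Because a builderToken is valid for 5 minutes, a business server may want to check its age before calling start. The following Python sketch derives that check from the `createTs` field of the response; the function name and the `safety_margin` parameter are assumptions made for illustration, not part of the service.

```python
import time
from typing import Optional

# Documented validity period of a builderToken.
TOKEN_TTL_SECONDS = 5 * 60


def token_is_usable(
    create_ts: int,
    now: Optional[float] = None,
    safety_margin: float = 10.0,
) -> bool:
    """Return True if a token created at create_ts (Unix seconds) is still fresh.

    safety_margin is a hypothetical buffer for clock skew and network
    latency; set it to 0 to compare against the full 5 minutes.
    """
    if now is None:
        now = time.time()
    return now < create_ts + TOKEN_TTL_SECONDS - safety_margin
```

If the check fails, generate a new builderToken before sending further requests.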

Start: Start a speech-to-text task

A builderToken can guarantee the security of your request. You should generate a builderToken before calling the start method.

HTTP request

POST https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/tasks?builderToken=<tokenName>

Path parameter

appId: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.

Query parameter

builderToken: The dynamic key obtained through the generate a builderToken method. Used to ensure the security of speech-to-text tasks.

Request header

Content-Type: application/json
Authorization: The value of this field needs to refer to the authentication instructions.

Request body

The following parameters need to be passed in the request body:

Field name	Type	Description
`audio`	JSON Object	(Required) The configuration for input audio.
`audio.subscribeSource`	String	(Required) The source type of the audio input. Supports the audio stream published in the Agora RTC channel only. The type is `AGORARTC`.
`audio.agoraRtcConfig`	JSON Object	(Required) Information required to enter the Agora RTC channel.
`audio.agoraRtcConfig.channelName`	String	(Required) The RTC channel name.
`audio.agoraRtcConfig.uid`	String	(Required) The RTC user ID. Every user ID in the RTC channel must be unique.
`audio.agoraRtcConfig.token`	String	(Optional) The token required to enter the RTC channel. Used to ensure channel security.
`audio.agoraRtcConfig.channelType`	String	(Required) The channel profile. For a speech-to-text task, you need to set the channel profile to live streaming (`LIVE_TYPE`).
`audio.agoraRtcConfig.subscribeConfig`	JSON Object	(Required) The subscription configuration.
`audio.agoraRtcConfig.maxIdleTime`	Number	(Optional) The maximum idle channel time (seconds). The default value is 30. The value range is [5, 2,592,000]. If there is no user in the channel for longer than this time, the task automatically stops.
`config`	JSON Object	(Required) The feature configuration.
`config.features`	JSON Array	(Required) The service type. Only speech to text (`RECOGNIZE`) is supported.
`config.recognizeConfig`	JSON Object	(Required) The speech-to-text conversion configuration.

Subscription configuration

Field name	Type	Description
`subscribeConfig.subscribeMode`	String	(Required) The subscription mode. To use speech to text, you need to set this field to `CHANNEL_MODE`.

Conversion configuration

Field name	Type	Description
`recognizeConfig.language`	String	(Required) The conversion language. You can set the field to one of the following: `en-US`: English. `fr-FR`: French.
`recognizeConfig.model`	String	(Required) The conversion mode. To use speech to text, you need to set this field to `Model`.
`recognizeConfig.profanityFilter`	Boolean	(Optional) Whether to enable the sensitive word filtering function. When enabled, it detects offensive language within a segment and replaces the insulting word with asterisks, leaving only the first letter visible (e.g. "f***"). `true`: Enable the sensitive word filtering function. `false`: Disable the sensitive word filtering function.
`recognizeConfig.output.destinations`	JSON Array	(Required) The target channel type for the output streams. Set this field as `AgoraRTCDataStream` to push speech-to-text conversion output to RTC channel. If you need to store speech-to-text output text in WebVTT format to third-party cloud storage, you also need to add `Storage` to the field.
`recognizeConfig.output.agoraRTCDataStream`	JSON Object	(Required) The target channel configuration.
`recognizeConfig.output.cloudStorage`	JSON Array	(Optional) The third-party cloud storage configuration. This field must be set to store speech-to-text output text and subtitle file.

Target channel configuration

Field name	Type	Description
`agoraRTCDataStream.channelName`	String	(Required) The target channel name.
`agoraRTCDataStream.uid`	String	(Required) The RTC user ID in target channel.
`agoraRTCDataStream.token`	String	(Required) The token required to enter the RTC channel. Used to ensure channel security.

Cloud storage configuration

Field name	Type	Description
`cloudStorage.format`	String	(Required) Reserved field. You need to set this field to `HLS`.
`cloudStorage.storageConfig`	JSON Object	(Required) The third-party cloud storage information configuration, for storing speech-to-text output text and subtitle file.

cloudStorage.storageConfig

When speech-to-text process ends, an output text file (txt format) and a subtitle file (WebVTT format) are generated. This field is used to store the two files.

Field name	Type	Description
`accessKey`	String	(Required) The access key of the third-party cloud storage. In general, Agora recommends providing write-only access keys. For delayed transcoding, the access key must have both read and write permissions.
`secretKey`	String	(Required) The secret key of the third-party cloud storage.
`bucket`	String	(Required) The bucket of the third-party cloud storage. The bucket name must conform to the naming rules of the corresponding third-party cloud storage service.
`vendor`	Number	(Required) The third-party cloud storage platform.
`region`	Number	(Required) The region information specified for the third-party cloud storage.
`fileNamePrefix`	JSON Array	(Optional) An array of strings specifying where the recorded files are stored in the third-party cloud storage. For example, if `fileNamePrefix` = `["directory1","directory2"]`, Agora speech-to-text adds the prefix "`directory1/directory2/`" before the name of the recorded file, that is, `directory1/directory2/xxx.m3u8`. The maximum length of the prefix, including the slashes, is 128 characters. The string itself cannot contain symbols such as slash, underscore, or parenthesis. The following character sets are supported: lowercase English letters (a-z), uppercase English letters (A-Z), digits (0-9), "-", "_".
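
To illustrate how `fileNamePrefix` turns into a storage path, the following Python sketch joins the array the way the example above describes and enforces the 128-character limit. The function name is hypothetical, and the sketch checks only the slash and length rules; it does not validate the rest of the character set.

```python
from typing import List

# Maximum prefix length, including the slashes, per the table above.
MAX_PREFIX_LENGTH = 128


def storage_prefix(file_name_prefix: List[str]) -> str:
    """Join fileNamePrefix entries into the path prefix added to file names."""
    for part in file_name_prefix:
        if "/" in part:
            raise ValueError("fileNamePrefix entries cannot contain a slash")
    # Each entry becomes one directory level followed by a slash.
    prefix = "".join(part + "/" for part in file_name_prefix)
    if len(prefix) > MAX_PREFIX_LENGTH:
        raise ValueError("prefix exceeds 128 characters including slashes")
    return prefix
```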

Request body example:

{
    "audio": {
        "subscribeSource": "AGORARTC",
        "agoraRtcConfig": {
            "channelName": "<YourChannelName>",
            "uid": "<YourUid>",
            "token": "<YourToken>",
            "channelType": "<YourChannelType>",
            "subscribeConfig": {
                "subscribeMode": "CHANNEL_MODE"
            },
            "maxIdleTime": 60
        }
    },
    "config": {
        "features": [
            "RECOGNIZE"
        ],
        "recognizeConfig": {
            "language": "en-US",
            "model": "Model",
            "output": {
                "destinations": [
                    "AgoraRTCDataStream",
                    "Storage"
                ],
                "agoraRTCDataStream": {
                    "channelName": "<YourChannelName>",
                    "uid": "<YourUid>",
                    "token": "<YourToken>"
                },
                "cloudStorage": [
                    {
                        "format": "HLS",
                        "storageConfig": {
                            "accessKey": "<YourOssAccessKey>",
                            "secretKey": "<YourOssSecretKey>",
                            "bucket": "<YourOssBucketName>",
                            "vendor": "<YourOssVendor>",
                            "region": "<YourOssRegion>",
                            "fileNamePrefix": "<YourOssPrefix>"
                        }
                    }
                ]
            }
        }
    }
}
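
The request body above can also be assembled programmatically. The following Python sketch builds it from the fields documented in the tables, adding `Storage` and `cloudStorage` only when a storage configuration is supplied. The function name and its parameters are hypothetical; the fixed values (`AGORARTC`, `LIVE_TYPE`, `CHANNEL_MODE`, `RECOGNIZE`, `Model`, `HLS`) come from this page.

```python
from typing import Any, Dict, Optional

# Documented value range for maxIdleTime, in seconds.
MIN_IDLE_SECONDS = 5
MAX_IDLE_SECONDS = 2_592_000


def build_start_body(
    channel_name: str,
    uid: str,
    output_channel_name: str,
    output_uid: str,
    output_token: str,
    language: str = "en-US",
    token: Optional[str] = None,
    max_idle_time: int = 30,
    storage_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the JSON body for the start method as a Python dict."""
    if not MIN_IDLE_SECONDS <= max_idle_time <= MAX_IDLE_SECONDS:
        raise ValueError("maxIdleTime must be within [5, 2592000] seconds")

    rtc_config: Dict[str, Any] = {
        "channelName": channel_name,
        "uid": uid,
        "channelType": "LIVE_TYPE",
        "subscribeConfig": {"subscribeMode": "CHANNEL_MODE"},
        "maxIdleTime": max_idle_time,
    }
    if token is not None:
        # The source channel token is documented as optional.
        rtc_config["token"] = token

    output: Dict[str, Any] = {
        "destinations": ["AgoraRTCDataStream"],
        "agoraRTCDataStream": {
            "channelName": output_channel_name,
            "uid": output_uid,
            "token": output_token,
        },
    }
    if storage_config is not None:
        # Second mode: also store the WebVTT output in cloud storage.
        output["destinations"].append("Storage")
        output["cloudStorage"] = [
            {"format": "HLS", "storageConfig": storage_config}
        ]

    return {
        "audio": {"subscribeSource": "AGORARTC", "agoraRtcConfig": rtc_config},
        "config": {
            "features": ["RECOGNIZE"],
            "recognizeConfig": {
                "language": language,
                "model": "Model",
                "output": output,
            },
        },
    }
```

Serialize the returned dict with `json.dumps` to obtain the request body. Keeping the output channel name as a separate parameter avoids assuming that it must match the source channel.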

HTTP response

Response body

If the status code is 2XX, the request is successful. The response body contains the following fields:

Field	Type	Description
`taskId`	String	The task ID, a UUID (Universal Unique Identifier) generated by the Agora server to identify a speech-to-text task that has been created.
`createTs`	Number	The Unix timestamp (seconds) when the task was created.
`status`	String	The running status of the task: `IDLE`: The task has not started or has ended. `PREPARING`: The task has received a start request. `IN_PROGRESS`: The task is in progress. `STOPPING`: The task is stopping. `STOPPED`: The task has been stopped. `RECONNECTING`: The task is being reestablished.

You can refer to the following example:

{
   "taskId": "XXXX",
   "createTs": 1550024508,
   "status": "IN_PROGRESS"
}

If the status code is not 2XX, the request fails. The response body contains the message field, reporting the reason for the failure of the request.

Query: Query the status of speech to text

HTTP request

GET https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/tasks/<taskId>?builderToken=<tokenName>

Path parameter

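appId: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.
taskId: (Required) String. The task ID generated by calling the start method. A task ID is a UUID (Universal Unique Identifier) generated by the Agora server to identify a speech-to-text task that has been created.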

Query parameter

builderToken: (Required) String. The tokenName value returned by the generate a builderToken method.

Request header

Content-Type: application/json
Authorization: The value of this field needs to refer to the authentication instructions.

HTTP response

Response body

If the status code is 2XX, the request is successful. The response body contains the following fields:

Field	Type	Description
`taskId`	String	The task ID, a UUID (Universal Unique Identifier) generated by the Agora server to identify a speech-to-text task that has been created.
`createTs`	Number	The Unix timestamp (seconds) when the task was created.
`status`	String	The running status of the task: `IDLE`: The task has not started or has ended. `PREPARING`: The task has received a start request. `IN_PROGRESS`: The task is in progress. `STOPPING`: The task is stopping. `STOPPED`: The task has been stopped. `RECONNECTING`: The task is being reestablished.

You can refer to the following example:

{
   "taskId": "XXXX",
   "createTs": 1550024508,
   "status": "IN_PROGRESS"
}

If the status code is not 2XX, the request fails. The response body contains the message field, reporting the reason for the failure of the request.
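
A common use of the query method is to poll until a task reaches a given status. The following Python sketch shows one way to do this. `fetch_status` is a hypothetical callable that you supply; it is assumed to send the GET request above and return the `status` string from the response. The attempt count and delay are illustrative defaults, not values recommended by Agora.

```python
import time
from typing import Callable

# Status values listed in the response table above.
KNOWN_STATUSES = {
    "IDLE", "PREPARING", "IN_PROGRESS",
    "STOPPING", "STOPPED", "RECONNECTING",
}


def wait_for_status(
    fetch_status: Callable[[], str],
    wanted: str = "IN_PROGRESS",
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it returns `wanted` or attempts run out.

    Returns the last status seen, so the caller can tell whether the
    wanted status was reached.
    """
    if wanted not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {wanted}")
    status = fetch_status()
    for _ in range(attempts - 1):
        if status == wanted:
            break
        sleep(delay_seconds)
        status = fetch_status()
    return status
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.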

Stop: Stop speech-to-text

HTTP request

DELETE https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/tasks/<taskId>?builderToken=<tokenName>

Path parameter

appId: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.
taskId: (Required) String. The task ID generated by calling the start method. A task ID is a UUID (Universal Unique Identifier) generated by the Agora server to identify a speech-to-text task that has been created.

Query parameter

builderToken: The dynamic key obtained through the generate a builderToken method (the tokenName value in its response). Used to ensure the security of speech-to-text tasks.

HTTP response

If the status code is 2XX, the request is successful; if the status code is not 2XX, the request fails.
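
As an illustration of the stop call, the following Python sketch builds the DELETE request with the standard library without sending it. The function name is hypothetical. The `authorization` value is passed in as-is, because this page defers to the authentication instructions for its format, and sending the same headers as the other methods is an assumption, since the stop section does not list request headers.

```python
import urllib.request
from urllib.parse import quote, urlencode

BASE = "https://api.agora.io/v1/projects"


def build_stop_request(
    app_id: str,
    task_id: str,
    token_name: str,
    authorization: str,
) -> urllib.request.Request:
    """Build (but do not send) the DELETE request that stops a task."""
    query = urlencode({"builderToken": token_name})
    url = (
        f"{BASE}/{quote(app_id, safe='')}/rtsc/speech-to-text/tasks/"
        f"{quote(task_id, safe='')}?{query}"
    )
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={
            # Assumed to match the headers documented for the other methods.
            "Content-Type": "application/json",
            "Authorization": authorization,
        },
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`, which raises `urllib.error.HTTPError` for 4XX and 5XX status codes.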

Next steps

Display subtitles in the video stream in real time by listening to the relevant callback method of the Agora SDK.

Platform	API
Android	onStreamMessage
iOS	receiveStreamMessageFromUid
Web	`client.on("stream-message", (uid: UID, payload: Uint8Array) => {})`

Reference

Consideration

To record audio or video, call the cloud recording methods.

Sample project

Agora provides a sample project for the speech-to-text function. You can contact sales@agora.io to get it, and refer to the source code to implement speech to text.

Online demo

Agora provides an online demo for the speech-to-text function. You can download and experience it.
