Agora provides a speech-to-text service for live streaming scenarios, which takes the audio content of a host's media stream and transcribes it into written words in real time.
This page shows you how to implement speech-to-text.
Send an HTTP request to the Agora server through your business server to convert speech to text in real time. The speech-to-text service supports two modes:
Follow the steps below to call the RESTful API to convert speech to text:
1. Call the acquire method to request a builderToken for the speech-to-text task. A builderToken is valid for 5 minutes.
2. Call the start method within 5 minutes after getting the builderToken to start the speech-to-text task.
3. Call the stop method to stop the speech-to-text task.
After you start a speech-to-text task, you can call the query method to check its status.
Use this builderToken to send a request within 5 minutes. After the time has expired, you need to generate a new builderToken; otherwise, other methods cannot be called.
POST https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/builderTokens
appId
: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.
Content-Type
: application/json
Authorization
: For the value of this field, refer to the authentication instructions.
The following parameters need to be passed in the request body:
Field | Type | Description |
---|---|---|
instanceId | String | (Required) The instance ID set by the developer. The maximum length is 64 characters. The following character sets are supported: One instanceId can generate multiple builderTokens, but only one builderToken can be used to send a request in a task. |
Request body example:
{
  "instanceId": "XXXX"
}
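The request above can be sketched in Python as follows. This is a minimal sketch, not an official client. It assumes that the authentication instructions call for HTTP Basic authentication, and `customer_id` and `customer_secret` are hypothetical credential names. The function only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import base64
import json
import urllib.request

BASE_URL = "https://api.agora.io/v1/projects"


def build_acquire_request(app_id, instance_id, customer_id, customer_secret):
    """Build (but do not send) the POST request that asks for a builderToken."""
    if not 0 < len(instance_id) <= 64:
        raise ValueError("instanceId must be 1 to 64 characters long")
    # Assumption: HTTP Basic authentication; check the authentication
    # instructions for the scheme your project actually uses.
    credentials = f"{customer_id}:{customer_secret}".encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
    }
    url = f"{BASE_URL}/{app_id}/rtsc/speech-to-text/builderTokens"
    body = json.dumps({"instanceId": instance_id}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

To send the request, call `urllib.request.urlopen(build_acquire_request(...))` and parse the response body as JSON.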
If the status code is 2XX, the request is successful. The response body contains the following fields:
Field | Type | Description |
---|---|---|
tokenName | String | The value of the dynamic key builderToken. This value needs to be passed in when calling other methods. |
createTs | Number | The Unix timestamp (seconds) when the builderToken was generated. |
instanceId | String | The instance ID set in the request body. |
Response body example:
{
  "tokenName": "XXXXXX",
  "createTs": 1550024508,
  "instanceId": "XXXXXX"
}
If the status code is not 2XX, the request fails. The response body contains the message field, reporting the reason for the failure of the request.
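Because a builderToken is only valid for 5 minutes, it can help to check its age before reusing it. The sketch below assumes that the validity period is counted from createTs and that the local clock is roughly in sync with the server. The helper name `builder_token_expired` is hypothetical.

```python
import time

BUILDER_TOKEN_TTL_SECONDS = 5 * 60  # 5-minute validity stated on this page


def builder_token_expired(acquire_response, now=None):
    """Return True if the builderToken in the acquire response is too old to use."""
    # Assumption: the 5 minutes are measured from the createTs field.
    now = time.time() if now is None else now
    return now - acquire_response["createTs"] >= BUILDER_TOKEN_TTL_SECONDS
```

If the helper returns True, request a new builderToken before calling any other method.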
A builderToken can guarantee the security of your request. You should generate a builderToken before calling the start method.
POST https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/tasks?builderToken=<tokenName>
appId
: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.
builderToken
: (Required) String. The dynamic key (the tokenName value) obtained by generating a builderToken. It is used to ensure the security of speech-to-text tasks.
Content-Type
: application/json
Authorization
: For the value of this field, refer to the authentication instructions.
The following parameters need to be passed in the request body:
Field name | Type | Description |
---|---|---|
audio | JSON Object | (Required) The configuration for input audio. |
audio.subscribeSource | String | (Required) The source type of the audio input. Supports the audio stream published in the Agora RTC channel only. The type is AGORARTC. |
audio.agoraRtcConfig | JSON Object | (Required) Information required to enter the Agora RTC channel. |
audio.agoraRtcConfig.channelName | String | (Required) The RTC channel name. |
audio.agoraRtcConfig.uid | String | (Required) The RTC user ID. Every user ID in the RTC channel must be unique. |
audio.agoraRtcConfig.token | String | (Optional) The token required to enter the RTC channel. Used to ensure channel security. |
audio.agoraRtcConfig.channelType | String | (Required) The channel profile. For a speech-to-text task, you need to set the channel profile to live streaming (LIVE_TYPE). |
audio.agoraRtcConfig.subscribeConfig | JSON Object | (Required) The subscription configuration. |
audio.agoraRtcConfig.maxIdleTime | Number | (Optional) The maximum idle channel time (seconds). The default value is 30. The value range is [5, 2,592,000]. If there is no user in the channel for longer than this time, the task automatically stops. |
config | JSON Object | (Required) The feature configuration. |
config.features | JSON Array | (Required) The service type. Only speech to text (RECOGNIZE) is supported. |
config.recognizeConfig | JSON Object | (Required) The speech-to-text conversion configuration. |
Subscription configuration
Field name | Type | Description |
---|---|---|
subscribeConfig.subscribeMode | String | (Required) The subscription mode. To use speech to text, you need to set this field to CHANNEL_MODE. |
Conversion configuration
Field name | Type | Description |
---|---|---|
recognizeConfig.language | String | (Required) The conversion language. You can set the field to one of the following: en-US: English; fr-FR: French. |
recognizeConfig.model | String | (Required) The conversion mode. To use speech to text, you need to set this field to Model. |
recognizeConfig.profanityFilter | Boolean | (Optional) Whether to enable the sensitive word filtering function. When enabled, it detects offensive language within a segment and replaces each offensive word with asterisks, leaving only the first letter visible (e.g. "f***"). true: Enable the sensitive word filtering function. false: Disable the sensitive word filtering function. |
recognizeConfig.output.destinations | JSON Array | (Required) The target channel type for the output streams. Set this field to AgoraRTCDataStream to push the speech-to-text output to the RTC channel. If you also need to store the output text in WebVTT format in third-party cloud storage, add Storage to the field. |
recognizeConfig.output.agoraRTCDataStream | JSON Object | (Required) The target channel configuration. |
recognizeConfig.output.cloudStorage | JSON Array | (Optional) The third-party cloud storage configuration. This field must be set to store the speech-to-text output text and subtitle file. |
Target channel configuration
Field name | Type | Description |
---|---|---|
agoraRTCDataStream.channelName | String | (Required) The target channel name. |
agoraRTCDataStream.uid | String | (Required) The RTC user ID in the target channel. |
agoraRTCDataStream.token | String | (Required) The token required to enter the RTC channel. Used to ensure channel security. |
Cloud storage configuration
Field name | Type | Description |
---|---|---|
cloudStorage.format | String | (Required) Reserved field. You need to set this field to HLS. |
cloudStorage.storageConfig | JSON Object | (Required) The third-party cloud storage information configuration, for storing the speech-to-text output text and subtitle file. |
cloudStorage.storageConfig
When speech-to-text process ends, an output text file (txt format) and a subtitle file (WebVTT format) are generated. This field is used to store the two files.
Field name | Type | Description |
---|---|---|
accessKey | String | (Required) The access key of the third-party cloud storage. In general, Agora recommends providing write-only access keys. For delayed transcoding, the access key must have both read and write permissions. |
secretKey | String | (Required) The secret key of the third-party cloud storage. |
bucket | String | (Required) The bucket of the third-party cloud storage. The bucket name must conform to the naming rules of the corresponding third-party cloud storage service. |
vendor | Number | (Required) The third-party cloud storage platform. |
region | Number | (Required) The region information specified for the third-party cloud storage. |
fileNamePrefix | JSON Array | (Optional) An array of strings specifying where the recorded files are stored in the third-party cloud storage. For example, if fileNamePrefix = ["directory1","directory2"], Agora speech-to-text adds the prefix "directory1/directory2/" before the name of the recorded file, that is, directory1/directory2/xxx.m3u8. The maximum length of the prefix, including the slashes, is 128 characters. The string itself cannot contain symbols such as slash, underscore, or parenthesis. The following character sets are supported: |
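The fileNamePrefix rules can be checked before a request is sent. The sketch below is only an illustration: the helper name `build_file_name_prefix` is hypothetical, and it rejects only the symbols named explicitly in the table (slash, underscore, parentheses) rather than enforcing the full supported character set.

```python
MAX_PREFIX_LENGTH = 128
FORBIDDEN_CHARS = set("/_()")  # symbols the table names as not allowed


def build_file_name_prefix(parts):
    """Join fileNamePrefix entries into the stored prefix, e.g. 'a/b/'."""
    for part in parts:
        if not part or FORBIDDEN_CHARS.intersection(part):
            raise ValueError(f"invalid fileNamePrefix entry: {part!r}")
    # Each entry becomes one directory level followed by a slash.
    prefix = "".join(part + "/" for part in parts)
    if len(prefix) > MAX_PREFIX_LENGTH:
        raise ValueError("prefix longer than 128 characters, slashes included")
    return prefix
```

For example, `build_file_name_prefix(["directory1", "directory2"])` returns `"directory1/directory2/"`, matching the prefix described in the table.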
Request body example:
{
  "audio": {
    "subscribeSource": "AGORARTC",
    "agoraRtcConfig": {
      "channelName": "<YourChannelName>",
      "uid": "<YourUid>",
      "token": "<YourToken>",
      "channelType": "<YourChannelType>",
      "subscribeConfig": {
        "subscribeMode": "CHANNEL_MODE"
      },
      "maxIdleTime": 60
    }
  },
  "config": {
    "features": [
      "RECOGNIZE"
    ],
    "recognizeConfig": {
      "language": "en-US",
      "model": "Model",
      "output": {
        "destinations": [
          "AgoraRTCDataStream",
          "Storage"
        ],
        "agoraRTCDataStream": {
          "channelName": "<YourChannelName>",
          "uid": "<YourUid>",
          "token": "<YourToken>"
        },
        "cloudStorage": [
          {
            "format": "HLS",
            "storageConfig": {
              "accessKey": "<YourOssAccessKey>",
              "secretKey": "<YourOssSecretKey>",
              "bucket": "<YourOssBucketName>",
              "vendor": "<YourOssVendor>",
              "region": "<YourOssRegion>",
              "fileNamePrefix": ["<YourOssPrefix>"]
            }
          }
        ]
      }
    }
  }
}
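As an illustration of how the fields fit together, the following sketch assembles a start request body in Python for the simplest case, where output goes to the RTC data stream only and no cloud storage is configured. The function name `build_start_body` and its parameters are hypothetical. The validation mirrors the value range and language options listed in the tables above.

```python
def build_start_body(channel_name, uid, out_channel_name, out_uid, out_token,
                     token=None, language="en-US", max_idle_time=30):
    """Return the start request body as a dict (RTC data stream output only)."""
    if not 5 <= max_idle_time <= 2_592_000:
        raise ValueError("maxIdleTime must be within [5, 2592000] seconds")
    if language not in ("en-US", "fr-FR"):
        raise ValueError("language must be en-US or fr-FR")
    rtc_config = {
        "channelName": channel_name,
        "uid": uid,
        "channelType": "LIVE_TYPE",
        "subscribeConfig": {"subscribeMode": "CHANNEL_MODE"},
        "maxIdleTime": max_idle_time,
    }
    if token is not None:  # the input channel token is optional
        rtc_config["token"] = token
    return {
        "audio": {"subscribeSource": "AGORARTC", "agoraRtcConfig": rtc_config},
        "config": {
            "features": ["RECOGNIZE"],
            "recognizeConfig": {
                "language": language,
                "model": "Model",
                "output": {
                    "destinations": ["AgoraRTCDataStream"],
                    "agoraRTCDataStream": {
                        "channelName": out_channel_name,
                        "uid": out_uid,
                        "token": out_token,
                    },
                },
            },
        },
    }
```

Serialize the returned dict with `json.dumps` and send it as the body of the POST request. To store the output as well, add Storage to destinations and supply a cloudStorage array as shown in the request body example.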
If the status code is 2XX, the request is successful. The response body contains the following fields:
Field | Type | Description |
---|---|---|
taskId | String | The task ID, a UUID (Universally Unique Identifier) generated by the Agora server to identify a speech-to-text task that has been created. |
createTs | Number | The Unix timestamp (seconds) when the task was created. |
status | String | The running status of the task: IDLE: The task has not started or has ended. PREPARING: The task has received a start request. IN_PROGRESS: The task is in progress. STOPPING: The task is stopping. STOPPED: The task has been stopped. RECONNECTING: The task is being reestablished. |
You can refer to the following example:
{
  "taskId": "XXXX",
  "createTs": 1550024508,
  "status": "IN_PROGRESS"
}
If the status code is not 2XX, the request fails. The response body contains the message field, reporting the reason for the failure of the request.
GET https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/tasks/<taskId>?builderToken=<tokenName>
appId
: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.
builderToken
: (Required) String. The tokenName value obtained by generating a builderToken.
Content-Type
: application/json
Authorization
: For the value of this field, refer to the authentication instructions.
If the status code is 2XX, the request is successful. The response body contains the following fields:
Field | Type | Description |
---|---|---|
taskId | String | The task ID, a UUID (Universally Unique Identifier) generated by the Agora server to identify a speech-to-text task that has been created. |
createTs | Number | The Unix timestamp (seconds) when the task was created. |
status | String | The running status of the task: IDLE: The task has not started or has ended. PREPARING: The task has received a start request. IN_PROGRESS: The task is in progress. STOPPING: The task is stopping. STOPPED: The task has been stopped. RECONNECTING: The task is being reestablished. |
You can refer to the following example:
{
  "taskId": "XXXX",
  "createTs": 1550024508,
  "status": "IN_PROGRESS"
}
If the status code is not 2XX, the request fails. The response body contains the message field, reporting the reason for the failure of the request.
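A common use of the query method is to poll until the task reaches a given status, for example IN_PROGRESS after a start request. The sketch below is one possible approach, not a prescribed one. `fetch_status` is a hypothetical callable that performs the GET request with a usable builderToken and returns the parsed response body, and the polling interval and limit are arbitrary choices.

```python
import time


def wait_for_status(fetch_status, wanted=("IN_PROGRESS",),
                    interval=2.0, max_polls=30, sleep=time.sleep):
    """Poll fetch_status() until the task reports one of the wanted statuses.

    fetch_status is a hypothetical callable that sends the query request
    and returns the parsed JSON response body as a dict.
    """
    status = None
    for _ in range(max_polls):
        status = fetch_status()["status"]
        if status in wanted:
            return status
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"task still {status} after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can supply a function that returns immediately.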
DELETE https://api.agora.io/v1/projects/<appId>/rtsc/speech-to-text/tasks/<taskId>?builderToken=<tokenName>
appId
: (Required) String. The App ID provided by Agora. An App ID is the unique identification of a project. You can get an App ID after creating a project in Agora console.
taskId
: (Required) String. The task ID generated by calling the start method. A task ID is a UUID (Universally Unique Identifier) generated by the Agora server to identify a speech-to-text task that has been created.
tokenName
: The value of the dynamic key builderToken.
builderToken
: The dynamic key obtained by generating a builderToken. Used to ensure the security of speech-to-text tasks.
If the status code is 2XX, the request is successful; if the status code is not 2XX, the request fails.
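The query and stop requests share the same URL and differ only in the HTTP method, so one helper can build both. This is a minimal sketch: the name `build_task_request` is hypothetical, and the caller is expected to supply the Authorization header described in the authentication instructions through `headers`.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.agora.io/v1/projects"


def build_task_request(method, app_id, task_id, token_name, headers=None):
    """Build (but do not send) a GET (query) or DELETE (stop) request for a task."""
    if method not in ("GET", "DELETE"):
        raise ValueError("method must be GET (query) or DELETE (stop)")
    # Percent-encode the path segments and the builderToken query parameter.
    app = urllib.parse.quote(app_id, safe="")
    task = urllib.parse.quote(task_id, safe="")
    query = urllib.parse.urlencode({"builderToken": token_name})
    url = f"{BASE_URL}/{app}/rtsc/speech-to-text/tasks/{task}?{query}"
    return urllib.request.Request(url, headers=headers or {}, method=method)
```

For example, `build_task_request("DELETE", app_id, task_id, token_name)` produces the stop request, which can then be sent with `urllib.request.urlopen`.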
Display subtitles in the video stream in real time by listening to the relevant callback method of the Agora SDK.
Platform | API |
---|---|
Android | onStreamMessage |
iOS | receiveStreamMessageFromUid |
Web | client.on("stream-message", (uid: UID, payload: Uint8Array) => {}) |
To record audio or video, call the cloud recording methods.
Agora provides a sample project for the speech-to-text function. You can contact sales@agora.io to get it, and refer to the source code to implement speech to text.
Agora provides an online demo for the speech-to-text function. You can download it and try it out.