Actions
Overview
The "Create Transcript" operation of the Speech resource in this node converts audio or video files into text transcripts using an external speech-to-text API. This is useful for automating transcription tasks such as generating subtitles, creating searchable text from meetings or interviews, and enabling voice command processing.
Typical use cases include:
- Transcribing recorded podcasts or webinars.
- Converting customer support calls to text for analysis.
- Creating captions for videos automatically.
- Extracting text from voice notes or audio messages.
Users provide an audio/video file as binary input, and the node returns a detailed transcription with optional features like speaker diarization, timestamps, and tagging of audio events.
Properties
| Name | Meaning |
|---|---|
| Binary Input Field | The name of the binary property containing the audio/video file to transcribe. |
| Transcript Model ID | The identifier of the transcription model to use. Currently, only "scribe_v1" is available. |
| Language Code | ISO-639-1 or ISO-639-3 language code to specify the language of the audio. If left empty, the language will be auto-detected. |
| Tag Audio Events | Whether to tag audio events (e.g., laughter, footsteps) in the transcription output. Boolean value (true/false). |
| Number of Speakers | Maximum number of speakers expected in the audio. Helps improve speaker prediction accuracy. Accepts values between 1 and 32. |
| Timestamps Granularity | Level of detail for timestamps in the transcript. Options: none, word, or character. Default is word. |
| Speaker Diarization | Whether to annotate which speaker is talking throughout the audio. Boolean value (true/false). |
| Enable Logging | Whether to enable logging on the API side. When disabled, the API runs in zero-retention mode and history features are unavailable. Boolean value (true/false). |
Note: There is also a hidden property used internally for request configuration and query string parameters.
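To make the property-to-request mapping concrete, below is a minimal sketch of the HTTP call the node performs under the hood. The request field names (`model_id`, `language_code`, `tag_audio_events`, `num_speakers`, `timestamps_granularity`, `diarize`) and the `enable_logging` query parameter are assumptions about the API's exact naming; they mirror the properties above but should be checked against the ElevenLabs API reference.

```typescript
// Illustrative sketch only; field names are assumed, not confirmed.
import { readFile } from "node:fs/promises";

async function createTranscript(apiKey: string, filePath: string) {
  const audio = await readFile(filePath);

  const form = new FormData();
  form.append("file", new Blob([audio]), "recording.mp3");
  form.append("model_id", "scribe_v1");          // only model currently available
  form.append("language_code", "en");            // omit to auto-detect
  form.append("tag_audio_events", "true");       // tag laughter, footsteps, etc.
  form.append("num_speakers", "2");              // 1-32; aids speaker prediction
  form.append("timestamps_granularity", "word"); // none | word | character
  form.append("diarize", "true");                // annotate who is speaking

  // enable_logging=false would request zero-retention mode (no history).
  const url = "https://api.elevenlabs.io/v1/speech-to-text?enable_logging=true";
  const res = await fetch(url, {
    method: "POST",
    headers: { "xi-api-key": apiKey },
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
  return res.json();
}
```

Within the node itself, these values come from the properties table above rather than being hard-coded.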
Output
The node outputs JSON data representing the transcription result. This typically includes:
- The full transcribed text.
- Optional speaker labels if diarization is enabled.
- Timestamps for words or characters depending on granularity settings.
- Tags for audio events if enabled.
- Metadata about the transcription process.
The node reads the binary audio/video input and returns the transcription in the JSON output field; no binary output is generated by this operation.
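As a rough guide to what consumers of this output can expect, here is a sketch of the result shape with diarization and word-level timestamps enabled. The field names (`language_code`, `text`, `words`, `speaker_id`, and so on) are illustrative assumptions; the actual structure depends on the model and API version.

```typescript
// Illustrative shape only; actual field names depend on the API version.
interface TranscriptWord {
  text: string;
  start: number;       // seconds; present when granularity is word/character
  end: number;
  speaker_id?: string; // present when speaker diarization is enabled
  type?: "word" | "spacing" | "audio_event"; // audio_event when tagging is on
}

interface TranscriptResult {
  language_code: string;         // detected or user-supplied language
  language_probability?: number; // confidence of language detection
  text: string;                  // full transcribed text
  words?: TranscriptWord[];      // per-word detail, if requested
}
```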
Dependencies
- Requires an active API key credential for the ElevenLabs API.
- The node sends requests to the ElevenLabs speech-to-text endpoint (/speech-to-text).
- The user must supply the audio/video file as binary data within the workflow.
- Optional parameters depend on supported transcription models and API capabilities.
Troubleshooting
Common issues:
- An incorrect or missing binary input field name prevents the node from finding the audio data (see the check sketched after this list).
- Unsupported language codes or transcription model IDs may cause errors or a fallback to default settings.
- Enabling speaker diarization without specifying a realistic number of speakers can reduce accuracy.
- Network or authentication errors occur if the API key is invalid or the quota has been exceeded.
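For the binary-field issue, a quick way to debug is a Code node placed before this node. The snippet below is a hypothetical check, assuming the default binary field name "data"; adjust `field` to match your Binary Input Field setting.

```typescript
// Hypothetical n8n Code node: verify the binary property exists before
// the Speech node runs, and report what is actually available if not.
for (const item of $input.all()) {
  const field = "data"; // must match the node's Binary Input Field setting
  if (!item.binary?.[field]) {
    throw new Error(
      `Binary property "${field}" not found; available: ` +
        Object.keys(item.binary ?? {}).join(", "),
    );
  }
}
return $input.all();
```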
Error messages:
- Errors related to missing binary data: Ensure the binary property name matches the actual input.
- API authentication errors: Verify that the API key credential is correctly configured.
- Model not found or unsupported language: Check the model ID and language code inputs.
- Rate limit or quota exceeded: Wait or upgrade your API plan.