Actions
Overview
The "Create Transcript" operation of the Speech resource in this node converts audio or video files into text transcripts. It is designed to transcribe spoken content from binary audio/video input, supporting features like speaker diarization, language specification, and timestamp granularity. This node is useful for automating transcription workflows such as generating meeting notes, creating subtitles, or analyzing audio content.
Practical examples:
- Transcribing recorded interviews or podcasts into searchable text.
- Generating captions for videos automatically.
- Extracting dialogue from customer support calls for quality analysis.
Properties
| Name | Meaning |
|---|---|
| Binary Input Field | The name of the binary property containing the audio or video file to be transcribed. |
| Additional Fields | A collection of optional parameters to customize transcription: |
| - Transcript Model ID | The transcription model to use; currently only "scribe_v1" is available. |
| - Language Code | ISO-639-1 or ISO-639-3 code specifying the language of the audio. If omitted, the language will be auto-detected. |
| - Tag Audio Events | Whether to tag audio events (e.g., laughter, footsteps) in the transcript. Defaults to true. |
| - Number of Speakers | Maximum number of speakers expected in the audio, aiding speaker identification. Range: 1 to 32. |
| - Timestamps Granularity | Level of detail for timestamps in the transcript: None, Word, or Character. Default is Word. |
| - Speaker Diarization | Whether to annotate which speaker is talking throughout the audio. Defaults to false. |
| - Enable Logging | Enables logging of the transcription process. When false, zero retention mode is used (history features unavailable). Defaults to true. |
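The mapping from these properties to the underlying API request can be sketched as follows. The snake_case field names (`model_id`, `diarize`, `num_speakers`, `timestamps_granularity`, `tag_audio_events`, `language_code`) follow the public ElevenLabs speech-to-text API; the helper itself is a hypothetical illustration, not part of the node's source:

```typescript
// Hypothetical helper: maps the node's "Additional Fields" onto the form
// fields expected by the ElevenLabs speech-to-text API. Field names are
// based on the public API docs; verify against the current reference.
interface AdditionalFields {
  languageCode?: string;        // ISO-639-1 / ISO-639-3; omit to auto-detect
  tagAudioEvents?: boolean;     // default true
  numSpeakers?: number;         // 1..32
  timestampsGranularity?: "none" | "word" | "character"; // default "word"
  diarize?: boolean;            // default false
}

function buildTranscriptParams(fields: AdditionalFields): Record<string, string> {
  const params: Record<string, string> = {
    model_id: "scribe_v1", // currently the only available model
  };
  if (fields.languageCode) params.language_code = fields.languageCode;
  params.tag_audio_events = String(fields.tagAudioEvents ?? true);
  if (fields.numSpeakers !== undefined) {
    if (fields.numSpeakers < 1 || fields.numSpeakers > 32) {
      throw new Error("Number of Speakers must be between 1 and 32");
    }
    params.num_speakers = String(fields.numSpeakers);
  }
  params.timestamps_granularity = fields.timestampsGranularity ?? "word";
  params.diarize = String(fields.diarize ?? false);
  return params;
}
```

Leaving a field unset falls back to the defaults listed in the table above, so an empty `AdditionalFields` object yields a valid request.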
Output
The node outputs a JSON object containing the transcription result: the transcribed text plus metadata such as timestamps (at the chosen granularity), speaker labels when diarization is enabled, and tagged audio events when that option is selected. The output contains no binary data, only the textual result and its metadata.
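As an illustration, a diarized, word-level result might look like the following. This shape is an assumption based on the public ElevenLabs API, not a guaranteed contract; inspect an actual node execution for the exact keys:

```typescript
// Illustrative result shape -- an assumption, not a guaranteed contract.
interface TranscriptWord {
  text: string;
  start: number;       // start time in seconds
  end: number;         // end time in seconds
  speaker_id?: string; // present when speaker diarization is enabled
}

interface TranscriptResult {
  language_code: string;
  text: string;             // full transcript
  words?: TranscriptWord[]; // present when granularity is Word or Character
}

const example: TranscriptResult = {
  language_code: "en",
  text: "Hello world",
  words: [
    { text: "Hello", start: 0.0, end: 0.4, speaker_id: "speaker_0" },
    { text: "world", start: 0.5, end: 0.9, speaker_id: "speaker_0" },
  ],
};
```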
Dependencies
- Requires an API key credential for authentication with the ElevenLabs API.
- The node sends requests to the ElevenLabs speech-to-text endpoint (/speech-to-text).
- Proper configuration of the API key credential in n8n is necessary.
- The input audio/video must be provided as binary data within the specified binary input field.
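Outside n8n, the same call can be reproduced directly against the API. A minimal sketch of the request metadata, assuming the standard `https://api.elevenlabs.io` base URL and `xi-api-key` authentication header from the public ElevenLabs documentation:

```typescript
// Minimal sketch of the underlying HTTP call's metadata. Base URL and
// header name follow the public ElevenLabs API documentation.
const BASE_URL = "https://api.elevenlabs.io/v1";

function buildTranscriptRequest(
  apiKey: string,
): { url: string; headers: Record<string, string> } {
  return {
    url: `${BASE_URL}/speech-to-text`,
    // The body is multipart/form-data carrying the audio file, so no
    // explicit Content-Type header is set here.
    headers: { "xi-api-key": apiKey },
  };
}

// Usage (not executed here): attach the binary as the "file" form field,
// e.g. form.append("file", audioBlob, "meeting.mp3"), then POST with fetch.
```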
Troubleshooting
Common issues:
- Providing an incorrect or empty binary input field name will cause the node to fail to find the audio data.
- Using unsupported audio formats or corrupted files may lead to transcription errors.
- Specifying an invalid language code might cause fallback to auto-detection or errors.
- Enabling speaker diarization without multiple speakers may produce unexpected results.
Error messages:
- Authentication errors indicate missing or invalid API credentials; verify the API key setup.
- Request failures due to network issues or API limits should be retried or checked against service status.
- Validation errors on parameters (e.g., number of speakers out of range) require correcting input values.
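A small classifier along these lines can turn raw HTTP status codes into the troubleshooting hints above. The status-to-cause mapping uses conventional HTTP semantics and is a sketch, not the node's actual error handling:

```typescript
// Map an HTTP status code to a human-readable troubleshooting hint.
// Based on conventional HTTP semantics; a sketch, not the node's real logic.
function transcriptErrorHint(status: number): string {
  if (status === 401 || status === 403) {
    return "Authentication error: verify the ElevenLabs API key credential.";
  }
  if (status === 400 || status === 422) {
    return "Validation error: check parameter values (e.g. Number of Speakers 1-32).";
  }
  if (status === 429) {
    return "Rate limited: retry later or check your plan's API limits.";
  }
  if (status >= 500) {
    return "Service error: retry the request or check the ElevenLabs service status.";
  }
  return `Unexpected status ${status}: inspect the response body for details.`;
}
```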