ElevenLabs

Overview

The "Create Transcript" operation of the Speech resource in this node converts audio or video files into text transcripts using an external speech-to-text API. This is useful for automating transcription tasks such as generating subtitles, creating searchable text from meetings or interviews, and enabling voice command processing.

Typical use cases include:

  • Transcribing recorded podcasts or webinars.
  • Converting customer support calls to text for analysis.
  • Creating captions for videos automatically.
  • Extracting text from voice notes or audio messages.

Users provide an audio/video file as binary input, and the node returns a detailed transcription with optional speaker diarization, timestamps, and audio event tags.
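For orientation, here is a minimal sketch of the kind of request this operation performs. It assumes Node.js 18+ globals (fetch, FormData, Blob), the /speech-to-text endpoint listed under Dependencies below, and the standard ElevenLabs xi-api-key header; the node handles all of this internally through its stored credential.

```typescript
// Minimal sketch of a direct call to the ElevenLabs speech-to-text endpoint.
// The node performs the equivalent internally using its configured credential.
import { readFile } from "node:fs/promises";

async function createTranscript(filePath: string): Promise<unknown> {
  const audio = await readFile(filePath);
  const form = new FormData();
  form.append("file", new Blob([audio]), "recording.mp3");
  form.append("model_id", "scribe_v1"); // currently the only available model

  const res = await fetch("https://api.elevenlabs.io/v1/speech-to-text", {
    method: "POST",
    headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY! },
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: HTTP ${res.status}`);
  return res.json();
}
```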

Properties

  • Binary Input Field: The name of the binary property containing the audio/video file to transcribe.
  • Transcript Model ID: The identifier of the transcription model to use. Currently, only "scribe_v1" is available.
  • Language Code: ISO-639-1 or ISO-639-3 code for the language of the audio. If left empty, the language is auto-detected.
  • Tag Audio Events: Whether to tag audio events (e.g., laughter, footsteps) in the transcription output. Boolean value (true/false).
  • Number of Speakers: Maximum number of speakers expected in the audio; helps improve speaker prediction accuracy. Accepts values between 1 and 32.
  • Timestamps Granularity: Level of detail for timestamps in the transcript. Options: none, word, or character. Default is word.
  • Speaker Diarization: Whether to annotate which speaker is talking throughout the audio. Boolean value (true/false).
  • Enable Logging: Whether to enable logging on the API side. Disabling logging enables zero-retention mode, which also disables history features. Boolean value (true/false).

Note: There is also a hidden property used internally for request configuration and query string parameters.
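The optional properties above translate into request fields. In the sketch below, the snake_case field names (language_code, tag_audio_events, num_speakers, timestamps_granularity, diarize) and the enable_logging query string parameter are assumptions inferred from the property list and the note about query string parameters, not read from the node's source.

```typescript
// Sketch: building the optional request fields from the node's properties.
// All snake_case names here are assumptions; verify against the API docs.
interface TranscriptOptions {
  languageCode?: string; // ISO-639-1/ISO-639-3; omit for auto-detection
  tagAudioEvents?: boolean;
  numSpeakers?: number; // 1-32
  timestampsGranularity?: "none" | "word" | "character"; // default "word"
  diarize?: boolean; // speaker diarization
  enableLogging?: boolean; // false = zero-retention mode
}

function buildRequestUrl(form: FormData, opts: TranscriptOptions): string {
  if (opts.languageCode) form.append("language_code", opts.languageCode);
  if (opts.tagAudioEvents !== undefined)
    form.append("tag_audio_events", String(opts.tagAudioEvents));
  if (opts.numSpeakers) form.append("num_speakers", String(opts.numSpeakers));
  if (opts.timestampsGranularity)
    form.append("timestamps_granularity", opts.timestampsGranularity);
  if (opts.diarize !== undefined) form.append("diarize", String(opts.diarize));
  // Logging is toggled via the query string rather than the form body.
  const query = opts.enableLogging === false ? "?enable_logging=false" : "";
  return `https://api.elevenlabs.io/v1/speech-to-text${query}`;
}
```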

Output

The node outputs JSON data representing the transcription result. This typically includes:

  • The full transcribed text.
  • Optional speaker labels if diarization is enabled.
  • Timestamps for words or characters depending on granularity settings.
  • Tags for audio events if enabled.
  • Metadata about the transcription process.

The node consumes the binary audio/video input and returns the transcription in the item's JSON output field. No binary output is generated by this operation.
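As a rough illustration, the result can be modelled with types like these; the key names are assumptions based on the fields described above and may differ across API versions.

```typescript
// Hedged sketch of the transcription result shape; key names are assumed.
interface TranscriptWord {
  text: string;
  start?: number; // seconds; present when timestamp granularity is not "none"
  end?: number;
  type?: "word" | "spacing" | "audio_event"; // "audio_event" when tagging is on
  speaker_id?: string; // present when speaker diarization is enabled
}

interface TranscriptResult {
  language_code?: string; // detected or supplied language
  text: string; // the full transcribed text
  words?: TranscriptWord[]; // per-word or per-character detail
}
```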

Dependencies

  • Requires an active API key credential for the ElevenLabs API.
  • The node sends requests to the ElevenLabs speech-to-text endpoint (/speech-to-text).
  • The user must supply the audio/video file as binary data within the workflow.
  • Optional parameters depend on supported transcription models and API capabilities.

Troubleshooting

  • Common issues:

    • Providing an incorrect or missing binary input field name will cause the node to fail to find the audio data.
    • Using unsupported language codes or transcription models may lead to errors or fallback to defaults.
    • Enabling speaker diarization without specifying a reasonable number of speakers might reduce accuracy.
    • Network or authentication errors can occur if the API key is invalid or the quota is exceeded.
  • Error messages:

    • Errors related to missing binary data: Ensure the binary property name matches the actual input (see the validation sketch after this list).
    • API authentication errors: Verify that the API key credential is correctly configured.
    • Model not found or unsupported language: Check the model ID and language code inputs.
    • Rate limit or quota exceeded: Wait or upgrade your API plan.
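
One way to catch the missing-binary case early is a small Code node placed before the ElevenLabs node. The check below uses standard n8n Code-node APIs ($input.all() and the per-item binary object); the property name "data" is only the common default and should match your Binary Input Field setting.

```typescript
// n8n Code node ("Run Once for All Items"): fail fast with a clear message
// when the expected binary property is missing from any incoming item.
const binaryField = "data"; // must match the Binary Input Field property

for (const item of $input.all()) {
  if (!item.binary || !item.binary[binaryField]) {
    throw new Error(
      `Missing binary property "${binaryField}" on an input item; ` +
      "check the Binary Input Field name on the ElevenLabs node.",
    );
  }
}

return $input.all(); // pass items through unchanged
```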
