Audio to Text (Whisper)

Transcribes audio to text using a local Whisper model via Transformers.js.

Overview

This node runs a Whisper model locally through Transformers.js to convert speech in audio files into text. Audio can come from a URL or from binary data produced by a previous node, and several Whisper models are available, trading off speed, accuracy, and language support. Users can set the source language, choose between transcription and translation to English, and opt into timestamps or the full raw model output. Typical applications include generating subtitles, transcribing interviews, and recognizing voice commands for further processing.
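The parameter-to-pipeline mapping described above can be sketched as follows. The option names (`language`, `task`, `return_timestamps`) follow the Transformers.js automatic-speech-recognition pipeline; the helper function itself and its argument names are illustrative assumptions, not the node's actual code.

```javascript
// Sketch: map the node's parameters onto Transformers.js transcription
// options. Only options the user actually set are forwarded, so an empty
// Language field leaves auto-detection enabled.
function buildTranscribeOptions({ language, task, returnTimestamps }) {
  const options = {};
  if (language) options.language = language;         // e.g. 'es'; omit for auto-detect
  if (task) options.task = task;                     // 'transcribe' or 'translate'
  if (returnTimestamps) options.return_timestamps = true;
  return options;
}

// These options are then passed alongside the audio samples, roughly:
//   const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-base');
//   const result = await transcriber(float32Samples, options);
```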

Use Case Examples

  1. Transcribing an English podcast episode to text using the recommended Whisper base model.
  2. Translating a Spanish audio interview to English text using the multilingual Whisper medium model.
  3. Extracting timestamps along with transcription for creating subtitles from a video audio track.

Properties

  • Audio Input: URL of the audio file to process, or the name of the binary property from a previous node that contains the audio data.
  • Model: The Whisper model to use for transcription, balancing speed, accuracy, and language support.
  • Language: Optional language code for transcription; if empty, the model auto-detects the language. Ignored by English-only models.
  • Task: Whether to transcribe in the source language or translate the speech to English. Ignored by English-only models.
  • Return Timestamps: Whether to include timestamps in the transcription output, which may affect performance and output format.
  • Output Field: The field name under which the transcribed text is stored in the output JSON.
  • Include Full Output: Whether to include the full raw model output (timestamps, chunks, and other details) under a separate field.
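The Audio Input property above accepts either a URL or a binary property name. A minimal sketch of how that dual interpretation could work, assuming strings beginning with `http(s)://` are treated as URLs and everything else as a property name on the incoming item's `binary` object (the helper name and item shape are assumptions):

```javascript
// Sketch: decide whether Audio Input names a URL to download or a binary
// property on the incoming item, and fail loudly when the property is missing.
function resolveAudioSource(audioInput, item) {
  if (/^https?:\/\//i.test(audioInput)) {
    return { kind: 'url', url: audioInput };
  }
  const binary = item.binary && item.binary[audioInput];
  if (!binary) {
    throw new Error(`No binary data found under property "${audioInput}"`);
  }
  return { kind: 'binary', data: binary };
}
```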

Output

JSON

  • <Output Field>: The transcribed text, stored under the configured field name.
  • <Output Field>_full: The full raw output object from the model, including timestamps and chunks, present only when Include Full Output is enabled.
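An illustrative output item, assuming the Output Field is set to `text` and Include Full Output is enabled; the transcript content and chunk values are made up for illustration:

```json
{
  "text": "Hello, and welcome to the show.",
  "text_full": {
    "text": "Hello, and welcome to the show.",
    "chunks": [
      { "timestamp": [0.0, 2.5], "text": "Hello, and welcome to the show." }
    ]
  }
}
```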

Dependencies

  • @huggingface/transformers for local Whisper model inference
  • axios for downloading audio files from URLs
  • wav for decoding WAV audio data
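Whisper expects mono Float32 samples in [-1, 1], while a decoded 16-bit WAV (e.g. from the `wav` package) yields signed 16-bit integers. A sketch of the normalization and a simple stereo downmix, under those assumptions; resampling to the 16 kHz rate Whisper also needs is omitted:

```javascript
// Scale signed 16-bit PCM samples into the [-1, 1) range Whisper expects.
function int16ToFloat32(samples) {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] / 32768;
  }
  return out;
}

// Average the two channels of a stereo recording into one mono signal.
function downmixStereo(left, right) {
  const out = new Float32Array(left.length);
  for (let i = 0; i < left.length; i++) {
    out[i] = (left[i] + right[i]) / 2;
  }
  return out;
}
```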

Troubleshooting

  • Ensure the audio input is a valid URL or a valid binary property name containing audio data.
  • Timeout errors may occur if the audio format is incompatible or the file is corrupted; verify the audio file format and integrity.
  • If no binary data is found for the specified property, check that the previous node outputs binary data with the correct property name.
  • Model loading may take time; ensure sufficient memory and processing power for larger models.
  • English-only models ignore the Language and Task parameters; non-English speech cannot be translated with them, so choose a multilingual model for non-English audio.
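The English-only behavior above can be made concrete: Whisper checkpoints with an `.en` suffix (`whisper-tiny.en`, `whisper-base.en`, and so on) are English-only, so Language and Task can be dropped for them. The suffix convention is real Whisper naming; the helper itself is an illustrative sketch:

```javascript
// Whisper model IDs ending in ".en" denote English-only checkpoints.
function isEnglishOnlyModel(modelId) {
  return /\.en$/.test(modelId);
}

// Drop language/task for English-only models; forward them otherwise.
function effectiveOptions(modelId, { language, task }) {
  if (isEnglishOnlyModel(modelId)) return {};
  const opts = {};
  if (language) opts.language = language;
  if (task) opts.task = task;
  return opts;
}
```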
