Transcribe Audio

Overview

This node transcribes audio into text using pre-trained speech recognition models. It reads audio supplied as binary data and converts the spoken words into a written transcript. Typical uses include:

- Converting recorded meetings, interviews, or podcasts into searchable text.
- Automating subtitle generation for videos.
- Extracting information from voice notes or customer support calls.

For example, a user can input an audio recording of a customer service call, select a suitable speech-to-text model, and receive the full transcript as output for further analysis or storage.

Properties

Name	Meaning
Audio Input Type	How the audio is supplied. The currently supported option is "Binary File", meaning the audio is read from binary data on the input item.
Binary Property Name	The name of the binary property on the input item that contains the audio file to be transcribed. Default is "data".
Model	The speech recognition model to use for transcription. All listed models are English-only Whisper variants, ordered from smallest to largest. Options include:
	- Xenova/whisper-tiny.en
	- Xenova/whisper-base.en
	- Xenova/whisper-small.en
	- Xenova/whisper-medium.en
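
The three properties can be pictured as a small typed parameter object. The sketch below is illustrative only: the names TranscribeAudioParams and resolveParams, and the field names, are hypothetical and not taken from the node's source. Only the model identifiers and the "data" default come from the table above.

```typescript
// Model identifiers listed in the Properties table.
const SUPPORTED_MODELS = [
  "Xenova/whisper-tiny.en",
  "Xenova/whisper-base.en",
  "Xenova/whisper-small.en",
  "Xenova/whisper-medium.en",
] as const;

type SupportedModel = (typeof SUPPORTED_MODELS)[number];

// Hypothetical parameter shape; field names are assumptions for illustration.
interface TranscribeAudioParams {
  audioInputType: "Binary File";
  binaryPropertyName: string;
  model: SupportedModel;
}

// Type guard: true only for one of the listed model identifiers.
function isSupportedModel(name: string): name is SupportedModel {
  return SUPPORTED_MODELS.some((m) => m === name);
}

// Validates the model and falls back to "data" when no property name is given.
function resolveParams(raw: {
  binaryPropertyName?: string;
  model: string;
}): TranscribeAudioParams {
  if (!isSupportedModel(raw.model)) {
    throw new Error(`Unsupported model: ${raw.model}`);
  }
  const name = (raw.binaryPropertyName || "").trim();
  return {
    audioInputType: "Binary File",
    binaryPropertyName: name.length > 0 ? name : "data",
    model: raw.model,
  };
}
```

Rejecting an unknown model name up front gives a clearer error than a failed model download later in the run.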

Output

Each output item has a JSON field with a "transcription" property, which holds the text transcribed from the audio input.

Example output JSON structure:

{
  "transcription": "Transcribed text from the audio input."
}

No binary output is produced by this node; it only returns textual transcription results.
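
Downstream steps usually only need to read the transcription property from each item. As a minimal sketch, assuming each item is shaped like { json: { transcription: string } } as in the example above (TranscriptionItem and collectTranscripts are hypothetical names used for illustration):

```typescript
// Assumed item shape, mirroring the example output JSON.
interface TranscriptionItem {
  json: { transcription: string };
}

// Joins the non-empty transcripts of several items, one per line.
function collectTranscripts(items: TranscriptionItem[]): string {
  return items
    .map((item) => item.json.transcription.trim())
    .filter((text) => text.length > 0)
    .join("\n");
}
```

Items whose transcription is empty, for example from a silent recording, are skipped rather than producing blank lines.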

Dependencies

- Uses the @huggingface/transformers library to load and run the speech recognition models.
- Uses the wavefile package to parse and preprocess WAV audio (it converts the bit depth and sample rate).
- Requires the audio input to be in a compatible WAV format, available as binary data within the workflow.
- Internally uses the ONNX Runtime Web backend, configured to load its WASM binaries from a specific directory.

No external API keys or credentials are required, since the models run locally via ONNX Runtime.
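
To make the preprocessing step more concrete, the sketch below shows what a bit-depth conversion amounts to for 16-bit PCM audio: each integer sample is rescaled to a floating-point value between -1 and 1. This illustrates the idea only. It is not the wavefile implementation, int16ToFloat32 is a hypothetical helper name, and in the node itself the wavefile package handles this conversion along with the resampling.

```typescript
// Rescales signed 16-bit PCM samples to 32-bit floats in the range [-1, 1).
function int16ToFloat32(samples: Int16Array): Float32Array {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // 32768 = 2^15, the magnitude of the most negative 16-bit sample.
    out[i] = samples[i] / 32768;
  }
  return out;
}
```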

Troubleshooting

"No binary data found on item!"
This error is raised when the specified binary property does not exist on the input item or contains no data. Check that Binary Property Name matches the property that actually holds the audio, and that the property contains valid audio data.
Model loading issues
If the selected model fails to load, verify that the model name is correctly chosen from the available options and that the environment supports ONNX Runtime Web with WASM.
Audio format problems
The node expects WAV audio data. Other formats may cause errors or incorrect transcription. Convert audio to WAV before input if necessary.
Performance considerations
Transcription time depends on audio length and model size. Larger models generally provide better accuracy but need more processing time and memory.
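
The missing-data and audio-format problems above can often be caught before the node runs. The sketch below is a hypothetical pre-flight check, not part of the node. It assumes the audio bytes are available as a Uint8Array and relies on the standard RIFF/WAVE container layout (ASCII "RIFF" at byte 0 and "WAVE" at byte 8) to reject input that is clearly not WAV. All function names are made up for illustration.

```typescript
// Reads `length` bytes starting at `start` as an ASCII string.
function readAscii(bytes: Uint8Array, start: number, length: number): string {
  let text = "";
  for (let i = start; i < start + length; i++) {
    text += String.fromCharCode(bytes[i]);
  }
  return text;
}

// True when the buffer starts with a RIFF header whose form type is WAVE.
function looksLikeWav(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  return readAscii(bytes, 0, 4) === "RIFF" && readAscii(bytes, 8, 4) === "WAVE";
}

// Returns a problem description, or null when the input passes both checks.
function checkAudioInput(bytes: Uint8Array | undefined): string | null {
  if (!bytes || bytes.length === 0) {
    return "No binary data found on item!";
  }
  if (!looksLikeWav(bytes)) {
    return "Input does not look like a WAV file; convert it to WAV first.";
  }
  return null;
}
```

The header check only confirms the container type. A file can pass it and still use an encoding the node cannot process, so treat it as a quick filter rather than full validation.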

Links and References

- Hugging Face Transformers – Documentation for the transformers library used for speech recognition models.
- WaveFile npm package – Used for reading and manipulating WAV audio files.
- ONNX Runtime Web – Runtime environment for running ONNX models in web environments.