Gemini AI Converter

Convert speech, images, and YouTube videos to text using the Gemini API.

Overview

The Gemini AI Converter node enables converting various media types into text using the Gemini API. It supports three main conversion operations:

Speech to Text: Transcribe audio content from speech.
Image to Text: Extract text content from images.
Video to Text: Extract text content from YouTube videos.

This node is useful in scenarios where you need to automate transcription or text extraction from multimedia sources, such as generating captions for videos, digitizing printed text from images, or transcribing recorded meetings and podcasts.

For example, you can supply the URL of an audio file to get its transcript, pass an image containing text (by URL or as binary data) to extract that text, or provide a YouTube video URL to obtain a summary or transcription, depending on the prompt.

Properties

Name	Meaning
Model Name or ID	The specific AI model used to generate the text output. Choose from a list of available models or specify one via expression. Models are fetched dynamically from the Gemini API and typically include version info (e.g., "2.0").
Convert	The type of conversion to perform. Options: "Speech To Text", "Image To Text", or "Video To Text".
Source Type	For Speech To Text and Image To Text conversions, specifies whether the source media is provided via a URL or as binary data within the workflow. Options: "From URL" or "From Binary".
URL	The direct URL of the audio or image file to convert. Shown only when Convert is "Speech To Text" or "Image To Text" and Source Type is "From URL".
Binary Field	The name of the binary property in the input data that contains the media file. Used when Convert is "Speech To Text" or "Image To Text" and Source Type is "From Binary". Default: "data".
Video URL	The URL of the YouTube video to convert to text. Only shown when conversion type is "Video To Text".
Prompt	A text prompt guiding the AI on how to process the input media. For example, "Extract content from the audio file". This helps tailor the output. Required field.
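
The visibility rules in the table above can be summarized in a short sketch. This is not the node's source code; the function name visible_fields is hypothetical, and the logic only restates what the table says about which properties apply to each Convert and Source Type combination.

```python
def visible_fields(convert: str, source_type: str = "From URL") -> list:
    """Return the property names that apply to a given configuration.

    Hypothetical helper: it mirrors the property table, not the node's code.
    """
    fields = ["Model Name or ID", "Convert"]

    if convert in ("Speech To Text", "Image To Text"):
        # Audio and image conversions let you pick where the media comes from.
        fields.append("Source Type")
        if source_type == "From URL":
            fields.append("URL")
        elif source_type == "From Binary":
            fields.append("Binary Field")
        else:
            raise ValueError(f"Unknown Source Type: {source_type!r}")
    elif convert == "Video To Text":
        # Video conversion always takes a YouTube URL; Source Type does not apply.
        fields.append("Video URL")
    else:
        raise ValueError(f"Unknown Convert option: {convert!r}")

    # Prompt is required for every conversion type.
    fields.append("Prompt")
    return fields
```

Calling visible_fields("Image To Text", "From Binary") lists Binary Field but not URL, while visible_fields("Video To Text") lists Video URL and omits Source Type.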

Output

The node outputs an array of JSON objects, each containing a text field with the extracted or transcribed content as plain text.

Example output structure:

[
  {
    "text": "Transcribed or extracted text content here"
  }
]

No binary data is output by this node; all results are returned as text in JSON format.
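
As a minimal sketch of consuming this output downstream, the snippet below joins the text fields of several output items into one string. It assumes the items have exactly the shape shown in the example above; the function name collect_text and the separator default are illustrative, not part of the node.

```python
import json


def collect_text(items: list, separator: str = "\n\n") -> str:
    """Join the "text" field of each output item, skipping empty or missing ones."""
    return separator.join(item["text"] for item in items if item.get("text"))


# Sample input copied from the example output structure above.
sample_items = json.loads('[{"text": "Transcribed or extracted text content here"}]')
combined = collect_text(sample_items)
```

Skipping items without a text value keeps a single failed or empty conversion from inserting blank sections into the combined result.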

Dependencies

An active connection to the Gemini API service.
A Gemini API key credential configured in n8n for authentication.
Access to the Gemini API models endpoint, from which the node fetches the list of available models dynamically.
Network access to the URLs provided for media files or YouTube videos.
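
To illustrate the dynamic model list, here is a hedged sketch of turning a Gemini API models list response into dropdown options. The response fields used (models, name, displayName, supportedGenerationMethods) follow the public models list endpoint; the filter on generateContent, the function name model_options, and the sample payload are assumptions made for illustration and may differ from what the node actually does.

```python
def model_options(list_response: dict) -> list:
    """Build {"name", "value"} options from a models list response.

    Assumption: only models that support "generateContent" are useful here.
    """
    options = []
    for model in list_response.get("models", []):
        if "generateContent" not in model.get("supportedGenerationMethods", []):
            continue
        # Resource names look like "models/<id>"; keep only the ID part.
        model_id = model["name"].removeprefix("models/")
        options.append({"name": model.get("displayName", model_id), "value": model_id})
    return options


# Hypothetical payload, trimmed to the fields the sketch reads.
sample_response = {
    "models": [
        {
            "name": "models/gemini-2.0-flash",
            "displayName": "Gemini 2.0 Flash",
            "supportedGenerationMethods": ["generateContent", "countTokens"],
        },
        {
            "name": "models/example-embedding-model",
            "displayName": "Example Embedding Model",
            "supportedGenerationMethods": ["embedContent"],
        },
    ]
}
options = model_options(sample_response)
```

With the sample payload, only the first entry is kept, because the second one does not list generateContent among its supported methods.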

Troubleshooting

Conversion Failure: If the node fails to convert the media, it throws an error unless "Continue On Fail" is enabled. Common causes include invalid URLs, unsupported media formats, or API errors.
Invalid Model: Selecting an unsupported or unavailable model may cause errors. Choose the model from the dynamically loaded list, or make sure any expression resolves to a model ID that appears in that list.
Missing Credentials: The node requires valid API credentials; missing or incorrect credentials will prevent execution.
Binary Data Issues: When using binary input, ensure the specified binary field exists and contains valid media data.
API Rate Limits: Excessive requests may be throttled by the Gemini API, causing failures.

To resolve errors, verify the input parameters, check the API credentials, confirm that the media URL or binary field is accessible, and compare your request volume against the Gemini API rate limits.
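
Several of the checks above can be run before the node executes. The sketch below is one way to do that, under stated assumptions: the parameter keys (convert, sourceType, url, binaryField, videoUrl, prompt) and the function name preflight are hypothetical and do not come from the node's code. It only checks that inputs are present and well-formed; it does not contact the Gemini API or download media.

```python
from typing import Optional
from urllib.parse import urlparse


def _is_http_url(value: str) -> bool:
    """True if value parses as an http(s) URL with a host."""
    parsed = urlparse(value or "")
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def preflight(params: dict, binary: Optional[dict] = None) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    binary = binary or {}

    if not str(params.get("prompt", "")).strip():
        problems.append("Prompt is required.")

    convert = params.get("convert")
    if convert == "Video To Text":
        if not _is_http_url(params.get("videoUrl", "")):
            problems.append("Video URL is missing or not a valid http(s) URL.")
    elif convert in ("Speech To Text", "Image To Text"):
        if params.get("sourceType") == "From Binary":
            # "data" is the documented default for Binary Field.
            field = params.get("binaryField", "data")
            if field not in binary:
                problems.append(f'Binary field "{field}" not found in input data.')
        elif not _is_http_url(params.get("url", "")):
            problems.append("URL is missing or not a valid http(s) URL.")
    else:
        problems.append(f"Unknown conversion type: {convert!r}.")

    return problems
```

Returning a list rather than raising on the first problem lets a workflow report every misconfiguration at once, which is convenient when "Continue On Fail" is enabled and errors would otherwise surface one item at a time.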

Links and References

Gemini API Models List: https://developers.generativeai.google/api/rest/generativelanguage/models/list
n8n Expressions Documentation: https://docs.n8n.io/code-examples/expressions/
Gemini API Documentation (general): https://developers.generativeai.google/