Image to Text (Captioning)

Generates a textual description (caption) for an image using Transformers.js.

Overview

This node generates a textual description (caption) for an input image using a pre-trained image-to-text model from the Transformers.js library. It is useful in scenarios where you want to automatically create captions or descriptions for images, such as enhancing accessibility, generating metadata for media libraries, or automating content tagging.

For example, given an image URL or binary image data, the node produces a natural language caption describing the image content. This can be used to enrich datasets, assist visually impaired users, or support AI-driven content management workflows.
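As a rough sketch, the captioning step behind this node looks like the following Transformers.js call (the image URL is a placeholder):

```ts
import { pipeline } from '@huggingface/transformers';

// Create an image-to-text pipeline; the model is downloaded on first use.
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');

// Caption an image by URL; the result is an array of generated texts.
const result = await captioner('https://example.com/photo.jpg');
console.log(result[0].generated_text); // e.g. "a cat sitting on a wooden bench"
```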

Properties

  • Image Input: URL of the image to process, or the name of the binary property from a previous node containing image data (e.g., 'data' if the input is {{$binary.data}}).
  • Output Caption Field: The field name where the generated image caption (text) will be stored in the output JSON.
  • Max New Tokens: Optional. Maximum number of new tokens (roughly words) to generate for the caption; controls caption length. Defaults to 50.
  • Include Full Output: Whether to include the full raw output object from the model under a field named [Output Caption Field]_full. This may contain multiple generated texts or additional details.
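A sketch of how these properties map onto the underlying pipeline call; the file name and values are illustrative, and in n8n the buffer would come from the named binary property rather than from disk:

```ts
import { readFile } from 'node:fs/promises';
import { pipeline, RawImage } from '@huggingface/transformers';

const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');

// "Image Input" as a URL string:
const byUrl = await captioner('https://example.com/photo.jpg', { max_new_tokens: 50 });

// "Image Input" as binary data (read from disk here for illustration):
const buffer = await readFile('photo.jpg');
const image = await RawImage.fromBlob(new Blob([buffer]));

// "Max New Tokens" becomes the max_new_tokens generation option:
const byBinary = await captioner(image, { max_new_tokens: 50 });

// "Output Caption Field" names where this first generated text is stored:
const caption = byBinary[0].generated_text;
```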

Output

The node outputs an array with one item per input. Each item's json contains all original fields plus:

  • A new field (name defined by the "Output Caption Field" property) containing the generated caption text.
  • Optionally, if "Include Full Output" is enabled, an additional field named [Output Caption Field]_full containing the full raw output object from the model, which includes detailed generation results.

If the input was binary image data, it is read for inference internally, but the binary property is not passed through to the output item.
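For instance, with "Output Caption Field" set to caption and "Include Full Output" enabled, a single output item could look like this (field names and values illustrative):

```ts
// One output item, shown as a plain object. "source" stands in for whatever
// fields the input item already carried; they are passed through unchanged.
const outputItem = {
  json: {
    source: 'upload-42',
    caption: 'a dog running through a grassy field',
    caption_full: [{ generated_text: 'a dog running through a grassy field' }],
  },
};
```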

Dependencies

  • Uses the @huggingface/transformers package (Transformers.js) to load and run the "Xenova/vit-gpt2-image-captioning" model.
  • Requires internet access to download the model on first use; the model is cached locally afterwards.
  • No explicit API keys or credentials are needed.
  • Runs inference locally on CPU with 32-bit floating point precision.
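Under those assumptions, the model-loading step presumably resembles the following; device and dtype are standard Transformers.js v3 pipeline options, though the exact values used by the node are inferred from the notes above:

```ts
import { pipeline } from '@huggingface/transformers';

// Downloaded from the Hugging Face Hub on first run, then loaded from the local cache.
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning', {
  device: 'cpu', // local CPU inference
  dtype: 'fp32', // 32-bit floating point weights
});
```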

Troubleshooting

  • Model loading failure: Usually caused by network issues, or by the model being private or restricted on Hugging Face. Check your internet connection and that the model is accessible.
  • Empty or invalid image input: Ensure the "Image Input" property is set to a valid URL or to the name of a binary property that actually contains image data.
  • Unexpected output format: If the model output does not contain the expected generated_text field, the model response was malformed. Retry, or verify the model version being loaded.
  • Error handling: If an error occurs while processing an item, the node either stops execution or continues, depending on the "Continue On Fail" setting; when continuing, the error message is attached to that item's output.
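The checks described above could be implemented along these lines; this fragment is assumed to live inside the node's execute() loop, so captioner, item, imageInput, maxNewTokens, and outputField come from that (hypothetical) surrounding context:

```ts
try {
  const output = await captioner(imageInput, { max_new_tokens: maxNewTokens });

  // Guard against unexpected output shapes before reading generated_text.
  const caption = Array.isArray(output) ? output[0]?.generated_text : undefined;
  if (typeof caption !== 'string') {
    throw new Error('Model response did not contain a generated_text field');
  }
  item.json[outputField] = caption;
} catch (error) {
  // With "Continue On Fail" enabled, record the error on the item and move on;
  // otherwise rethrow so the workflow stops here.
  if (this.continueOnFail()) {
    item.json.error = (error as Error).message;
  } else {
    throw error;
  }
}
```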

Links and References

  • Model card: https://huggingface.co/Xenova/vit-gpt2-image-captioning
  • Transformers.js documentation: https://huggingface.co/docs/transformers.js
