ComfyUI Video to Text

Extract text descriptions or subtitles from videos using a ComfyUI workflow.

Overview

This node sends a video to ComfyUI and extracts text from it using a user-defined ComfyUI workflow. The video can be supplied as a URL, a base64 string, or binary data, and the node can return a description, subtitles, or both.

Common scenarios include:

  • Automatically generating video descriptions for accessibility or metadata enrichment.
  • Extracting subtitles or captions from videos for translation or indexing.
  • Combining both descriptions and subtitles for comprehensive video text analysis.

Practical example:

  • A marketing team uploads product demo videos (via URL) and uses this node to generate textual summaries and subtitles automatically, which are then used for SEO and social media posts.

Properties

Name | Meaning
Workflow JSON | The ComfyUI workflow in JSON format that defines how the video will be analyzed.
Input Type | The format of the input video: URL, Base64 encoded string, or Binary data.
Input Video | The video input as a URL or base64 string (required if Input Type is URL or Base64).
Binary Property | The name of the binary property containing the video data (required if Input Type is Binary).
Output Type | The type of text output to extract: Description, Subtitles, or Both.
Timeout | Maximum time, in minutes, to wait for the video analysis to complete before timing out.
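
The dependency between Input Type and the two input fields can be sketched as a small check. This is an illustration only, not the node's actual code: the function names `check_input_params` and `timeout_seconds` are hypothetical, and the accepted spellings of the input types are assumed.

```python
from typing import Optional


def check_input_params(input_type: str,
                       input_video: Optional[str] = None,
                       binary_property: Optional[str] = None) -> None:
    """Raise ValueError when the field required by input_type is missing.

    Hypothetical helper: mirrors the "required if" notes in the table above.
    """
    kind = input_type.lower()
    if kind in ("url", "base64"):
        # URL and Base64 inputs both travel in the Input Video field.
        if not input_video:
            raise ValueError("Input Video is required for URL or Base64 input")
    elif kind == "binary":
        # Binary input is looked up by property name instead.
        if not binary_property:
            raise ValueError("Binary Property is required for Binary input")
    else:
        raise ValueError(f"Unknown Input Type: {input_type!r}")


def timeout_seconds(timeout_minutes: float) -> float:
    """The Timeout property is expressed in minutes; convert to seconds."""
    return timeout_minutes * 60.0
```

For example, `check_input_params("URL", "https://example.com/demo.mp4")` passes silently, while `check_input_params("Binary")` raises because no binary property name was given.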

Output

The node outputs JSON data with the following structure:

  • text: The extracted text content, which can be:
    • a plain-text description,
    • subtitles in SRT format (used when several subtitle entries are extracted), or
    • both combined, depending on the selected Output Type.
  • outputType: Indicates the chosen output type (description, subtitles, or both).
  • outputFormat: The format of the text output, either "txt" for plain text or "srt" for subtitle format.
  • textCount: Number of text segments extracted.
  • status: Status object returned by ComfyUI indicating execution details.

For binary input, the node reads the video from the specified binary property. It never emits binary data; the output is purely textual.
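
As an illustration of how a downstream step might consume these fields, the sketch below writes the text to a file whose extension follows outputFormat. The helper name `save_text_output` and the sample values in `example_item` are hypothetical; the real contents of `status` depend on the ComfyUI server and are left empty here.

```python
from pathlib import Path

# Hypothetical output item with the fields described above.
example_item = {
    "text": "1\n00:00:00,000 --> 00:00:02,000\nHello\n\n"
            "2\n00:00:02,000 --> 00:00:04,000\nWorld\n",
    "outputType": "subtitles",
    "outputFormat": "srt",
    "textCount": 2,
    "status": {},  # placeholder; actual shape comes from ComfyUI
}


def save_text_output(item: dict, stem: str, directory: str = ".") -> Path:
    """Write item["text"] to <directory>/<stem>.<outputFormat> and return the path."""
    fmt = item.get("outputFormat", "txt")
    if fmt not in ("txt", "srt"):
        raise ValueError(f"Unexpected outputFormat: {fmt!r}")
    path = Path(directory) / f"{stem}.{fmt}"
    path.write_text(item["text"], encoding="utf-8")
    return path
```

Calling `save_text_output(example_item, "demo")` would create `demo.srt` in the current directory.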

Dependencies

  • Requires an API key credential for authenticating with the ComfyUI API.
  • Needs access to the ComfyUI API endpoint URL configured in credentials.
  • Uses HTTP requests to upload videos, submit workflows, and poll for results.
  • Relies on the ComfyUI server supporting specific endpoints like /upload/image, /prompt, /history/{id}, and /system_stats.
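
The polling step can be sketched as follows. This is a minimal sketch under stated assumptions, not the node's implementation: it assumes that `GET /history/{id}` returns an empty object until the prompt has finished and then an object keyed by the prompt id, and that the API key is sent as a Bearer token. The function names and the header scheme are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable


def http_history_fetcher(base_url: str, api_key: str) -> Callable[[str], dict]:
    """Build a fetch function for GET /history/{id}.

    Assumption: the server accepts "Authorization: Bearer <key>".
    """
    def fetch(prompt_id: str) -> dict:
        request = urllib.request.Request(
            f"{base_url.rstrip('/')}/history/{prompt_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch


def wait_for_history(fetch_history: Callable[[str], dict],
                     prompt_id: str,
                     timeout_minutes: float,
                     poll_interval: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the history entry for prompt_id appears or the timeout passes."""
    deadline = time.monotonic() + timeout_minutes * 60
    while True:
        history = fetch_history(prompt_id)
        entry = history.get(prompt_id)
        if entry is not None:
            return entry  # assumed to carry the outputs and status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Analysis did not finish within {timeout_minutes} minute(s)")
        sleep(poll_interval)
```

Passing the fetch function in as a parameter keeps the loop testable without a running server; in real use it would be `http_history_fetcher(api_url, api_key)`.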

Troubleshooting

  • Invalid workflow JSON: If the workflow JSON is malformed or does not parse to an object, the node throws an error. Check the JSON syntax and make sure the top-level value is an object.
  • No video input node found: The workflow must contain at least one LoadVideo or LoadImage node to receive the input; otherwise, the node throws an error.
  • Binary property missing or invalid: When using binary input, if the specified binary property is missing or does not contain video data, the node attempts to find an alternative video property. If none is found, it throws an error.
  • Unsupported media type: Only video MIME types are supported for binary input. Other media types cause an error.
  • Timeout exceeded: If the video analysis does not complete within the specified timeout, the node throws a timeout error.
  • API connection issues: Failure to connect to or authenticate with the ComfyUI API results in an error. Verify that the API URL and API key are correct.
  • No text outputs found: If the analysis completes but no textual outputs are detected, an error is thrown.
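
Several of the checks above can be reproduced before running the node. The sketch below is hypothetical and makes three assumptions: the workflow uses ComfyUI's API format (an object mapping node ids to entries with a `class_type` field), loader nodes are matched by the exact names LoadVideo and LoadImage, and binary items carry a `mimeType` field as n8n binary data does.

```python
import json

LOADER_TYPES = ("LoadVideo", "LoadImage")  # assumed exact class_type names


def parse_workflow(workflow_json: str) -> dict:
    """Parse the workflow and require a JSON object at the top level."""
    try:
        workflow = json.loads(workflow_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid workflow JSON: {exc}") from exc
    if not isinstance(workflow, dict):
        raise ValueError("Invalid workflow JSON: expected an object")
    return workflow


def find_input_node(workflow: dict) -> str:
    """Return the id of the first LoadVideo or LoadImage node."""
    for node_id, node in workflow.items():
        if isinstance(node, dict) and node.get("class_type") in LOADER_TYPES:
            return node_id
    raise ValueError("No video input node found in workflow")


def pick_video_property(binary: dict, preferred: str) -> str:
    """Use the preferred property if it holds video, else the first one that does."""
    def is_video(name: str) -> bool:
        entry = binary.get(name)
        return (isinstance(entry, dict)
                and str(entry.get("mimeType", "")).startswith("video/"))

    if is_video(preferred):
        return preferred
    for name in binary:
        if is_video(name):
            return name  # fallback to an alternative video property
    raise ValueError("No binary property with a video MIME type was found")
```

Running `find_input_node(parse_workflow(text))` on a workflow export before pasting it into the node catches the first two errors in the list early.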
