omniparser

n8n node for OmniParser - AI-powered UI screenshot analysis for desktop automation

Package Information

Downloads: 2 weekly / 13 monthly
Latest Version: 0.1.0
Author: Alf-David Heermann

Documentation

n8n-nodes-omniparser

This is an n8n community node that integrates Microsoft's OmniParser for AI-powered UI screenshot analysis and desktop automation.

What is OmniParser?

OmniParser is a comprehensive method for parsing user interface screenshots into structured, interpretable elements. It uses computer vision and AI to:

  • Detect interactive UI elements (buttons, text fields, icons, etc.)
  • Generate functional descriptions for each element
  • Provide precise coordinates for automation
  • Enable vision-based GUI automation

Features

  • 🎯 Accurate Element Detection: Identifies clickable and interactive regions with high precision
  • 📊 Structured Output: Returns indexed elements with coordinates, descriptions, and metadata
  • 🖼️ Annotated Images: Optionally returns the screenshot with bounding boxes drawn
  • ⚙️ Configurable Thresholds: Adjust detection sensitivity and box merging
  • 🔄 Multiple Input Types: Supports binary data and base64-encoded images

Installation

In n8n

  1. Go to Settings > Community Nodes
  2. Click Install a community node
  3. Enter n8n-nodes-omniparser
  4. Click Install

Manual Installation

npm install n8n-nodes-omniparser

Prerequisites

You need a running OmniParser API instance. Two options:

Option 1: Docker (Recommended)

# Clone the omniparser-api repository
git clone https://github.com/addy999/omniparser-api.git
cd omniparser-api

# Build the Docker image
docker build -t omni-parser-app .

# Run with GPU support
docker run --gpus all -p 7860:7860 omni-parser-app

Requirements:

  • NVIDIA GPU with CUDA support
  • 16GB RAM minimum
  • Docker with nvidia-docker2

Option 2: Docker Compose with n8n

Add to your docker-compose.yml (this uses the omni-parser-app image built in Option 1):

services:
  n8n:
    image: n8nio/n8n
    ports:
      - "5678:5678"
    networks:
      - n8n-network

  omniparser:
    image: omni-parser-app
    ports:
      - "7860:7860"
    networks:
      - n8n-network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

networks:
  n8n-network:
    driver: bridge

Then in n8n credentials, use: http://omniparser:7860

Credentials

Configure the OmniParser API credentials:

  • API Base URL: Your OmniParser API endpoint (e.g., http://omniparser:7860 for Docker, or http://localhost:7860 for local)

Operations

Parse Screenshot

Analyzes a UI screenshot to detect interactive elements.

Input Parameters:

  • Input Type: How to provide the screenshot
    • Binary Data: Use the binary output of a previous node
    • Base64 String: Paste a base64-encoded image
  • Box Threshold (0.0-1.0): Detection confidence threshold. Lower values detect more elements.
  • IOU Threshold (0.0-1.0): Controls merging of overlapping bounding boxes
  • Output Format:
    • Structured: Array of parsed elements with coordinates
    • Raw: Original API response
  • Include Annotated Image: Whether to include a base64-encoded annotated image with bounding boxes
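
To make the IOU Threshold parameter more concrete, the following sketch computes intersection over union (IoU) for two boxes using the standard definition. It assumes boxes are given as [x1, y1, x2, y2] pixel coordinates, as the sample output of this node suggests. The function name iou is illustrative only and is not part of the node or of OmniParser; how OmniParser applies the threshold internally is not covered here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in [x1, y1, x2, y2] form (assumed format)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes.
    return intersection / union if union > 0 else 0.0


# Two 200x50 boxes, the second shifted 100 px to the right of the first.
overlap = iou([100, 200, 300, 250], [200, 200, 400, 250])
print(f"IoU = {overlap:.3f}")
```

An IoU of 0 means the boxes do not overlap at all, and 1 means they are identical, which is why the parameter is bounded to the 0.0-1.0 range.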

Output (Structured Format):

{
  "elementsCount": 3,
  "elements": [
    {
      "index": 0,
      "description": "Username text field",
      "coordinates": [100, 200, 300, 250],
      "centerX": 200,
      "centerY": 225,
      "width": 200,
      "height": 50
    },
    {
      "index": 1,
      "description": "Password text field",
      "coordinates": [100, 300, 300, 350],
      "centerX": 200,
      "centerY": 325,
      "width": 200,
      "height": 50
    },
    {
      "index": 2,
      "description": "Login button",
      "coordinates": [150, 400, 250, 450],
      "centerX": 200,
      "centerY": 425,
      "width": 100,
      "height": 50
    }
  ],
  "annotatedImageBase64": "iVBORw0KGgoAAAANSUhEUg..."
}
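
The sample values above are consistent with coordinates being [x1, y1, x2, y2] in pixels, with the remaining fields derived from them. The sketch below shows that relationship under this assumption; describe_box is a hypothetical helper, not a function exposed by the node, and the node's own rounding behavior is not documented here.

```python
def describe_box(coordinates):
    """Derive center and size from an [x1, y1, x2, y2] box (assumed layout)."""
    x1, y1, x2, y2 = coordinates
    return {
        "centerX": (x1 + x2) / 2,  # horizontal midpoint
        "centerY": (y1 + y2) / 2,  # vertical midpoint
        "width": x2 - x1,
        "height": y2 - y1,
    }


# Coordinates of the "Username text field" element from the sample output.
print(describe_box([100, 200, 300, 250]))
```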

Example Workflow

Desktop Automation with OmniParser

  1. Screenshot Node → Captures desktop screenshot
  2. OmniParser → Analyzes screenshot
  3. Filter → Find element by description (e.g., "Login button")
  4. Desktop Control Node → Click at coordinates
  5. Desktop Control Node → Type text

Screenshot → OmniParser → Filter ("Username field") → Click → Type "myusername"
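
The filter step in this workflow could look roughly like the sketch below, which searches the structured output for an element by description and returns its click point. It assumes the structured output format shown under Parse Screenshot; find_element is a hypothetical helper, and the click itself is left to whatever desktop-control tool the workflow uses.

```python
def find_element(elements, query):
    """Return the first element whose description contains query (case-insensitive)."""
    needle = query.lower()
    for element in elements:
        if needle in element.get("description", "").lower():
            return element
    return None


# Trimmed copy of the structured output sample from this README.
parsed = {
    "elementsCount": 3,
    "elements": [
        {"index": 0, "description": "Username text field", "centerX": 200, "centerY": 225},
        {"index": 1, "description": "Password text field", "centerX": 200, "centerY": 325},
        {"index": 2, "description": "Login button", "centerX": 200, "centerY": 425},
    ],
}

target = find_element(parsed["elements"], "login button")
if target is not None:
    # These are the coordinates a desktop-control step would click.
    click_point = (target["centerX"], target["centerY"])
    print(click_point)
```

Substring matching is used here because the generated descriptions may not match a fixed label exactly; a stricter or fuzzier match can be swapped in as needed.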

Use Cases

  • ๐Ÿ–ฅ๏ธ Desktop Automation: Automate any GUI application
  • ๐Ÿค– RPA (Robotic Process Automation): Automate repetitive desktop tasks
  • ๐Ÿงช UI Testing: Automatically test application interfaces
  • ๐Ÿ“ธ Screen Analysis: Extract structured data from screenshots
  • ๐ŸŽฎ Game Automation: Detect and interact with game UI elements
  • ๐Ÿ“Š Data Entry: Fill forms across any application

Companion Nodes

For complete desktop automation, combine with:

  • n8n-nodes-desktop-control (coming soon): Mouse/keyboard control with PyAutoGUI
  • n8n-nodes-screenshot (coming soon): Cross-platform screenshot capture

Troubleshooting

"Failed to connect to OmniParser API"

  • Check that the OmniParser container is running: docker ps
  • Verify the API URL in credentials matches your setup
  • Test API manually: curl http://localhost:7860/docs
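
The same check can be scripted. The sketch below requests the /docs path mentioned above and reports whether the API answered; docs_url and api_is_reachable are hypothetical helper names, and the code assumes only that /docs returns HTTP 200 when the service is up.

```python
import urllib.request


def docs_url(base_url):
    """Build the /docs URL from an API base URL, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/docs"


def api_is_reachable(base_url, timeout=5.0):
    """Return True if GET <base_url>/docs answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(docs_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # OSError covers connection failures, timeouts, and HTTP error statuses;
        # ValueError covers malformed URLs.
        return False
```

For example, api_is_reachable("http://localhost:7860") can be run on the Docker host, while http://omniparser:7860 only resolves from inside the Compose network.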

"Detection not finding elements"

  • Lower the Box Threshold (try 0.03 or 0.02)
  • Ensure the screenshot is clear and high resolution
  • Check that UI elements are visible and not obscured

"Out of memory errors"

  • OmniParser requires 16GB RAM minimum
  • Enable swap if needed
  • Close other GPU-intensive applications

License

MIT

Author

Alf-David Heermann (ad.heermann@flexcon-it.de)