omniparser

n8n node for OmniParser - AI-powered UI screenshot analysis for desktop automation

Package Information

Downloads: 2 weekly / 13 monthly

Latest Version: 0.1.0

Author: Alf-David Heermann

Available Nodes

OmniParser

Analyze UI screenshots to detect interactive elements with AI vision

Documentation

n8n-nodes-omniparser

This is an n8n community node that integrates Microsoft's OmniParser for AI-powered UI screenshot analysis and desktop automation.

What is OmniParser?

OmniParser is a comprehensive method for parsing user interface screenshots into structured, interpretable elements. It uses computer vision and AI to:

Detect interactive UI elements (buttons, text fields, icons, etc.)
Generate functional descriptions for each element
Provide precise coordinates for automation
Enable vision-based GUI automation

Features

🎯 Accurate Element Detection: Identifies clickable and interactive regions with high precision
📊 Structured Output: Returns indexed elements with coordinates, descriptions, and metadata
🖼️ Annotated Images: Optionally returns the screenshot with bounding boxes drawn
⚙️ Configurable Thresholds: Adjust detection sensitivity and box merging
🔄 Multiple Input Types: Supports binary data and base64-encoded images
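
As an illustration of the base64 input type, the following minimal Python sketch encodes image bytes as a base64 string. The helper names and the file path argument are hypothetical, and it assumes the node accepts a plain base64 string without a data-URI prefix, which this README does not state.

```python
import base64
from pathlib import Path


def encode_image_base64(image_bytes: bytes) -> str:
    """Return raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def encode_image_file(path: str) -> str:
    """Read an image file (hypothetical path) and base64-encode its contents."""
    return encode_image_base64(Path(path).read_bytes())


# The 8-byte PNG file signature encodes to the "iVBORw0KGgo" prefix
# that every base64-encoded PNG starts with.
png_signature = b"\x89PNG\r\n\x1a\n"
print(encode_image_base64(png_signature))  # iVBORw0KGgo=
```

A string produced this way could then be pasted into the Base64 String input, or passed along from a previous node.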

Installation

In n8n

Go to Settings > Community Nodes
Click Install a community node
Enter n8n-nodes-omniparser
Click Install

Manual Installation

npm install n8n-nodes-omniparser

Prerequisites

You need a running OmniParser API instance. Two options:

Option 1: Docker (Recommended)

# Clone the omniparser-api repository
git clone https://github.com/addy999/omniparser-api.git
cd omniparser-api

# Build the Docker image
docker build -t omni-parser-app .

# Run with GPU support
docker run --gpus all -p 7860:7860 omni-parser-app

Requirements:

NVIDIA GPU with CUDA support
16GB RAM minimum
Docker with NVIDIA GPU support (nvidia-docker2 or its successor, the NVIDIA Container Toolkit)

Option 2: Docker Compose with n8n

Add to your docker-compose.yml (the omniparser service uses the omni-parser-app image built in Option 1):

services:
  n8n:
    image: n8nio/n8n
    ports:
      - "5678:5678"
    networks:
      - n8n-network

  omniparser:
    image: omni-parser-app
    ports:
      - "7860:7860"
    networks:
      - n8n-network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

networks:
  n8n-network:
    driver: bridge

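Then, in the n8n credentials, set the API Base URL to http://omniparser:7860. The service name omniparser resolves as a hostname because both containers are on n8n-network.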

Credentials

Configure the OmniParser API credentials:

API Base URL: Your OmniParser API endpoint (e.g., http://omniparser:7860 when n8n and OmniParser share a Docker network as in Option 2, or http://localhost:7860 when OmniParser runs on the same host as n8n)

Operations

Parse Screenshot

Analyzes a UI screenshot to detect interactive elements.

Input Parameters:

Input Type: How to provide the screenshot
  - Binary Data: Use the binary output of a previous node
  - Base64 String: Paste a base64-encoded image
Box Threshold (0.0-1.0): Detection confidence threshold. Lower values detect more elements but may include more false positives.
IOU Threshold (0.0-1.0): Intersection-over-union threshold that controls merging of overlapping bounding boxes
Output Format:
  - Structured: Array of parsed elements with coordinates
  - Raw: Original API response
Include Annotated Image: Whether to include the annotated screenshot (base64-encoded, with bounding boxes drawn) in the output
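
To make the IOU Threshold more concrete, here is a minimal Python sketch of how intersection over union (IoU) is computed for two boxes. It assumes boxes are given as [x1, y1, x2, y2]; the function names are hypothetical, and how OmniParser applies the threshold internally is not described in this README.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two [x1, y1, x2, y2] boxes (assumed layout)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def exceeds_threshold(box_a, box_b, iou_threshold: float) -> bool:
    """True when the overlap of two boxes is above the given IoU threshold."""
    return iou(box_a, box_b) > iou_threshold


a = [100, 200, 300, 250]
b = [150, 200, 350, 250]
print(iou(a, b))                     # 0.6
print(exceeds_threshold(a, b, 0.5))  # True
```

An IoU of 0 means the boxes do not overlap at all, and 1 means they are identical, so the threshold sets how much overlap is tolerated before two detections are treated as overlapping.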

Output (Structured Format):

{
  "elementsCount": 3,
  "elements": [
    {
      "index": 0,
      "description": "Username text field",
      "coordinates": [100, 200, 300, 250],
      "centerX": 200,
      "centerY": 225,
      "width": 200,
      "height": 50
    },
    {
      "index": 1,
      "description": "Password text field",
      "coordinates": [100, 300, 300, 350],
      "centerX": 200,
      "centerY": 325,
      "width": 200,
      "height": 50
    },
    {
      "index": 2,
      "description": "Login button",
      "coordinates": [150, 400, 250, 450],
      "centerX": 200,
      "centerY": 425,
      "width": 100,
      "height": 50
    }
  ],
  "annotatedImageBase64": "iVBORw0KGgoAAAANSUhEUg..."
}
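
The sample values above are consistent with coordinates being [x1, y1, x2, y2] (top-left and bottom-right corners), although this README does not state the layout explicitly. Under that assumption, the following Python sketch shows how the center and size fields relate to the coordinates; the function name is hypothetical.

```python
def derive_geometry(coordinates):
    """Compute center and size from an assumed [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = coordinates
    return {
        "centerX": (x1 + x2) / 2,
        "centerY": (y1 + y2) / 2,
        "width": x2 - x1,
        "height": y2 - y1,
    }


# The "Login button" element from the sample output above.
login_button = {
    "coordinates": [150, 400, 250, 450],
    "centerX": 200,
    "centerY": 425,
    "width": 100,
    "height": 50,
}

geometry = derive_geometry(login_button["coordinates"])
# Every derived value matches the field reported in the sample output.
print(all(geometry[key] == login_button[key] for key in geometry))  # True
```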

Example Workflow

Desktop Automation with OmniParser

Screenshot Node → Captures desktop screenshot
OmniParser → Analyzes screenshot
Filter → Find element by description (e.g., "Login button")
Desktop Control Node → Click at coordinates
Desktop Control Node → Type text

Screenshot → OmniParser → Filter ("Username field") → Click → Type "myusername"
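
The filter step can be sketched in Python as below, operating on the structured output format. The helper names are hypothetical, and matching by case-insensitive substring is only one possible strategy; the actual click would be performed by a separate desktop control node.

```python
from typing import Optional, Tuple


def find_element(parsed: dict, query: str) -> Optional[dict]:
    """Return the first element whose description contains query (case-insensitive)."""
    needle = query.lower()
    for element in parsed.get("elements", []):
        if needle in element.get("description", "").lower():
            return element
    return None


def click_point(element: dict) -> Tuple[float, float]:
    """Return the center of an element as the point to click."""
    return element["centerX"], element["centerY"]


# Trimmed copy of the structured sample output.
parsed = {
    "elementsCount": 2,
    "elements": [
        {"index": 0, "description": "Username text field", "centerX": 200, "centerY": 225},
        {"index": 2, "description": "Login button", "centerX": 200, "centerY": 425},
    ],
}

target = find_element(parsed, "login button")
if target is not None:
    print(click_point(target))  # (200, 425)
```

Returning None when nothing matches lets the workflow branch explicitly, for example to retry with a lower Box Threshold.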

Use Cases

🖥️ Desktop Automation: Automate any GUI application
🤖 RPA (Robotic Process Automation): Automate repetitive desktop tasks
🧪 UI Testing: Automatically test application interfaces
📸 Screen Analysis: Extract structured data from screenshots
🎮 Game Automation: Detect and interact with game UI elements
📊 Data Entry: Fill forms across any application

Companion Nodes

For complete desktop automation, combine with:

n8n-nodes-desktop-control (coming soon): Mouse/keyboard control with PyAutoGUI
n8n-nodes-screenshot (coming soon): Cross-platform screenshot capture

Troubleshooting

"Failed to connect to OmniParser API"

Check that the OmniParser container is running: docker ps
Verify that the API URL in the credentials matches your setup
Test the API manually: curl http://localhost:7860/docs
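
The same reachability check can be scripted. The Python sketch below requests the /docs path used in the curl command above and reports whether it responded successfully. The function names are hypothetical and only the standard library is used.

```python
import http.client
import urllib.request


def docs_url(base_url: str) -> str:
    """Build the /docs URL from an API base URL, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/docs"


def api_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the /docs page answers with a 2xx status, else False."""
    try:
        with urllib.request.urlopen(docs_url(base_url), timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeouts, HTTP errors, and malformed URLs all land here.
        return False


if __name__ == "__main__":
    # Use the same base URL that is configured in the n8n credentials.
    print(api_reachable("http://localhost:7860"))
```

If this prints False while docker ps shows the container running, the base URL or port mapping is the most likely mismatch.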

"Detection not finding elements"

Lower the Box Threshold (try 0.03 or 0.02)
Ensure the screenshot is clear and high-resolution
Check that UI elements are visible and not obscured

"Out of memory errors"

OmniParser requires 16GB RAM minimum
Enable swap if needed
Close other GPU-intensive applications

License

MIT

Author

Alf-David Heermann (ad.heermann@flexcon-it.de)