datagait

Scrape, extract, and crawl data from JavaScript-heavy websites using DataGait — the world's fastest headless DOM engine. Full JS rendering, structured data extraction, and multi-page crawling.

Package Information

Downloads: 48 weekly / 974 monthly

Latest Version: 1.1.7

Author: Centicore Technologies

Available Nodes

DataGait

Scrape, crawl, and extract structured data from JavaScript-heavy websites using DataGait API

Documentation

n8n-nodes-datagait

n8n community node for DataGait — scrape, extract, and crawl data from JavaScript-heavy websites with the world's fastest headless DOM engine.

npm | Documentation | Report Issue

Features

Full JavaScript Rendering — Renders SPAs, React, Angular, Vue, and dynamic content before extraction
Structured Data Extraction — JSON-LD, Open Graph, Twitter Cards, microdata in a single call
Multi-Page Crawling — Crawl entire sites with intelligent link following and SSE streaming
Proxy Support — Smart proxy routing with X-Proxy-Mode (off / smart / always)
Blazing Fast — Powered by a Rust-based headless DOM engine, 10x faster than traditional headless browsers
Clean Text Output — CSS-aware text extraction ideal for LLM and AI data pipelines
Real-Time Credit Tracking — Per-page credit usage and remaining balance in crawl events

Installation

From n8n Community Nodes (recommended)

Open your n8n instance
Go to Settings → Community Nodes
Click Install a community node
Search for n8n-nodes-datagait
Click Install

Manual Installation

cd ~/.n8n
npm install n8n-nodes-datagait
# Restart n8n

Setup

Sign up at datagait.com
Go to Dashboard → Settings → API Keys → Create API Key
In n8n, add DataGait API credentials with your dg_live_xxx token
Click Test to verify the connection

Actions (6)

Scraping Actions (`GET /extract`)

Action	Description
Extract Page	Extract content from a webpage with full JavaScript rendering. Choose which fields to return: HTML, text, links, media, meta tags, and structured data.
Scrape Text	Extract clean, readable text content from a URL. Ideal for feeding into LLMs, AI agents, and NLP pipelines.
Scrape Links	Extract all links from a webpage with resolved absolute URLs and link count.
Scrape Metadata	Extract meta tags, Open Graph, Twitter Cards, and JSON-LD structured data from any page.

Crawling Actions (`POST /crawl` — SSE streaming)

Action	Description
Crawl Site	Crawl multiple pages from a site with intelligent extraction. Configure output fields (text, html, links, media, meta, structured_data), max pages (up to 100), and parallel workers. Results streamed via Server-Sent Events with real-time credit tracking.

Account Actions

Action	Description
Health Check	Check DataGait API availability and measure response latency.

Output Fields

Extract Page

Field	Description
`url`	The final URL after redirects
`title`	Page title
`html`	Full rendered HTML after JavaScript execution
`text`	CSS-aware innerText — clean, readable content
`links`	All `<a>` and `<link>` elements with resolved absolute URLs
`media`	Images, videos, and audio elements with src and alt
`meta`	All `<meta>` tags (name, property, content)
`structured_data`	JSON-LD, microdata, Open Graph, Twitter Cards
`timing_ms`	Extraction time in milliseconds
`proxy_used`	Whether a proxy was used for the request
`proxy_provider`	Proxy provider: `datagait`, `byop`, or `none`

Crawl Site

Field	Description
`url`	Starting URL
`pages`	Array of page events (each with `url`, `page_num`, `text`, `title`, `links`, `credits_used`, `credits_remaining`)
`page_count`	Number of pages crawled
`total_ms`	Total crawl time in milliseconds
`total_credits_used`	Total credits consumed
`credit_exhausted`	`true` if crawl stopped due to credit exhaustion

Crawl SSE Events

Event Type	Description
`started`	First event — confirms crawl started with `max_pages` and `credits_available`
`page`	One per page — includes `success`, content fields, `credits_used`, `credits_remaining`
`done`	Final event — `total_pages`, `total_content`, `total_links`, `total_ms`, `total_credits_used`
`credit_exhausted`	Only if credits run out mid-crawl — `pages_completed`, `credits_used`

API Details

Extract: GET /extract?url=<url>&html=true&text=true&...
Crawl: POST /crawl with JSON body { url, config: { max_pages, text, html, links, workers, ... } }
Auth: X-API-Key: dg_live_xxx header
Base URL: https://ingest.datagait.com (configurable for self-hosted)
Proxy: Optional X-Proxy-Mode header (off / smart / always)
Crawl Timeout: Up to 300s — SSE stream stays open until done or credit_exhausted

Why DataGait?

Feature	DataGait	Firecrawl	Apify
JS Rendering	Rust-based headless DOM (fastest)	Chrome-based	Chrome-based
Latency	Sub-second for most pages	2-10s typical	5-30s typical
Structured Data	JSON-LD, OG, Twitter, microdata	Limited	Varies by actor
Crawl Streaming	Real-time SSE with per-page credits	Webhook/polling	Polling
Proxy Support	Built-in smart proxy routing	Separate config	Separate config
Pricing	1 credit/page (10 via proxy)	Pay per extraction	Pay per compute

Development

Build

cd integrations/n8n
npm install
npm run build

Test Locally with n8n

cd integrations/n8n
npm link

cd ~/.n8n/custom
npm link n8n-nodes-datagait

# Restart n8n — the DataGait node will appear in the editor

Publish to npm

# 1. Bump version in package.json
# 2. Build and publish
npm publish

Once published to npm with the n8n-community-node-package keyword, it will automatically appear in the n8n Community Nodes marketplace.

Example Workflows

See the examples/ directory for importable n8n workflow JSON files:

price-monitoring.json — Monitor product prices on a schedule, alert on drops
job-board-scraper.json — Scrape job listings and save to Airtable

Resources

License

MIT

datagaitInstall