sitemap-parser

n8n community node for parsing sitemaps with recursive sitemap index traversal


n8n-nodes-sitemap-parser

An n8n community node for parsing XML sitemaps with recursive sitemap index traversal.

Features

  • Recursive Sitemap Index Parsing — Automatically traverses nested <sitemapindex> structures, up to a configurable maximum depth (10 by default)
  • Domain Auto-Discovery — Given a domain, discovers sitemaps from robots.txt and common paths
  • Direct Sitemap URL — Parse any sitemap URL directly
  • Gzip Support — Handles .xml.gz compressed sitemaps
  • Concurrency Control — Configurable parallel request limits
  • URL Filtering — Include/exclude URLs with regex patterns
  • Rich Output — Extracts lastmod, changefreq, priority metadata
  • Loop Detection — Prevents infinite recursion by tracking sitemap URLs that have already been visited
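
As an illustration of the Concurrency Control feature, here is a minimal sketch of a bounded worker pool. The helper name mapWithLimit and its signature are hypothetical and are not taken from the node's source; the sketch only shows one common way to cap the number of requests in flight.

```typescript
// Hypothetical helper: run `task` over `items` with at most `limit` tasks in flight.
// This is an assumption about how a concurrency cap could work, not the node's code.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker claims the next index, awaits the task, then repeats.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await task(items[i]);
    }
  }

  // Start at least one worker and never more workers than items.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

With limit set to 5 (the documented default for Concurrency), a call such as mapWithLimit(sitemapUrls, 5, fetchXml) would keep at most five fetches running at once, where fetchXml stands for whatever fetch function the caller supplies.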

Installation

In n8n, go to Settings → Community Nodes and install:

n8n-nodes-sitemap-parser

Or install via npm:

npm install n8n-nodes-sitemap-parser

Usage

Mode 1: Sitemap URL (Direct)

Provide a direct sitemap URL:

https://rothys.com/sitemap.xml

The node will:

  1. Fetch the sitemap
  2. If it's a <sitemapindex>, recursively fetch all child sitemaps
  3. Extract all <url> entries with metadata
  4. Output each URL as a separate n8n item
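
The steps above can be sketched as a small recursive function. This is an assumption-laden illustration, not the node's implementation: the names crawl, extractLocs, and FetchXml are hypothetical, fetching is passed in as a function, the root sitemap is assumed to be depth 0, and metadata extraction (lastmod, changefreq, priority) is left out for brevity.

```typescript
// Hypothetical types and names; the node's real implementation may differ.
type FetchXml = (url: string) => Promise<string>;

interface Found {
  url: string;
  depth: number;  // assumed: 0 for the root sitemap
  source: string; // sitemap the URL was found in
}

// Collect the text of every <loc> element.
// A regex is enough for a sketch; a real parser should handle CDATA and entities.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) locs.push(m[1]);
  return locs;
}

async function crawl(
  url: string,
  fetchXml: FetchXml,
  maxDepth = 10,
  depth = 0,
  visited: Set<string> = new Set(),
): Promise<Found[]> {
  // Loop detection and depth limit.
  if (visited.has(url) || depth > maxDepth) return [];
  visited.add(url);

  const xml = await fetchXml(url); // step 1: fetch the sitemap
  const locs = extractLocs(xml);

  // Step 2: a sitemap index lists child sitemaps, so recurse into each one.
  if (/<sitemapindex[\s>]/.test(xml)) {
    const results: Found[] = [];
    for (const child of locs) {
      results.push(...(await crawl(child, fetchXml, maxDepth, depth + 1, visited)));
    }
    return results;
  }

  // Steps 3 and 4: a plain sitemap yields one result per <url> entry.
  return locs.map((loc) => ({ url: loc, depth, source: url }));
}
```

Sharing one visited set across the whole recursion is what keeps a sitemap index that references itself (directly or through a child) from being fetched twice.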

Mode 2: Domain (Auto-Discovery)

Provide a domain:

rothys.com

The node will:

  1. Check robots.txt for Sitemap: directives
  2. Try common sitemap paths (/sitemap.xml, /sitemap_index.xml, etc.)
  3. Parse all discovered sitemaps recursively
  4. Output all URLs
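
The discovery steps can be pictured with two small pure functions. Both names (sitemapsFromRobots, candidateSitemapUrls) are hypothetical, the default to https:// is an assumption, and the path list contains only the two paths named above; the node may probe others.

```typescript
// Extract sitemap URLs from the text of a robots.txt file.
// The "Sitemap:" field name is matched case-insensitively.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const m = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (m) urls.push(m[1]);
  }
  return urls;
}

// Build candidate sitemap URLs to probe for a domain.
// Assumption: a bare domain is treated as https://.
function candidateSitemapUrls(
  domain: string,
  paths: string[] = ["/sitemap.xml", "/sitemap_index.xml"],
): string[] {
  const bare = domain.replace(/\/+$/, ""); // drop trailing slashes
  const origin = /^https?:\/\//i.test(bare) ? bare : "https://" + bare;
  return paths.map((p) => origin + p);
}

// Example inputs for the two helpers.
const fromRobots = sitemapsFromRobots(
  "User-agent: *\nDisallow: /cart\nSitemap: https://rothys.com/sitemap.xml\n",
);
const candidates = candidateSitemapUrls("rothys.com");
```

Each discovered or candidate URL would then be handed to the same recursive parsing used in direct mode.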

Options

Option               Default                  Description
Max Recursion Depth  10                       Maximum depth for nested sitemap indexes
Concurrency          5                        Max parallel HTTP requests
Request Timeout      30s                      Timeout per request
Custom User Agent    n8n-sitemap-parser/1.0   User-Agent header
URL Filter Pattern   (none)                   Regex to include only matching URLs
Exclude Pattern      (none)                   Regex to exclude matching URLs
Include Metadata     true                     Include lastmod, changefreq, priority
Flatten Output       true                     One item per URL (false = single array)
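
To make the two pattern options concrete, here is a minimal sketch of how an include pattern and an exclude pattern might combine. The option keys (urlFilterPattern, excludePattern) and the function name are hypothetical, and the rule that a URL must pass both checks is an assumption rather than documented behavior.

```typescript
// Hypothetical option keys; the node's internal parameter names may differ.
interface FilterOptions {
  urlFilterPattern?: string; // keep only URLs matching this regex
  excludePattern?: string;   // drop URLs matching this regex
}

// Assumed semantics: a URL is kept when it matches the include pattern
// (if one is set) and does not match the exclude pattern (if one is set).
function filterUrls(urls: string[], opts: FilterOptions): string[] {
  const include = opts.urlFilterPattern ? new RegExp(opts.urlFilterPattern) : null;
  const exclude = opts.excludePattern ? new RegExp(opts.excludePattern) : null;
  return urls.filter(
    (u) => (!include || include.test(u)) && (!exclude || !exclude.test(u)),
  );
}

// Example using patterns of the kind shown elsewhere in this README.
const kept = filterUrls(
  [
    "https://store.com/products/widget",
    "https://store.com/products/widget.jpg",
    "https://store.com/about",
  ],
  {
    urlFilterPattern: ".*\\/products\\/.*",
    excludePattern: ".*\\.(jpg|png|gif|css|js)$",
  },
);
```

Under these assumptions only the first URL survives: the second is dropped by the exclude pattern and the third never matches the include pattern.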

Output Schema

Each URL item contains:

{
  "url": "https://example.com/products/widget",
  "lastmod": "2024-01-15",
  "changefreq": "weekly",
  "priority": "0.8",
  "depth": 2,
  "source": "https://example.com/sitemap-products.xml"
}
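
For downstream code that consumes these items, the schema can be mirrored as a TypeScript type. This is a sketch inferred from the sample above, not a type exported by the package: the interface name is hypothetical, and the metadata fields are marked optional on the assumption that they are absent when Include Metadata is off or the sitemap omits them. The helper modifiedSince is likewise only an illustration of filtering on lastmod.

```typescript
// Hypothetical type inferred from the sample item; not exported by the package.
interface SitemapUrlItem {
  url: string;
  lastmod?: string;    // e.g. "2024-01-15"
  changefreq?: string; // e.g. "weekly"
  priority?: string;   // a string in the sample, e.g. "0.8"
  depth: number;
  source: string;      // sitemap the URL was found in
}

// Keep only items whose lastmod parses to a date on or after `since`.
// Items without a parseable lastmod are dropped.
function modifiedSince(items: SitemapUrlItem[], since: string): SitemapUrlItem[] {
  const cutoff = Date.parse(since);
  return items.filter((item) => {
    if (!item.lastmod) return false;
    const t = Date.parse(item.lastmod);
    return !Number.isNaN(t) && t >= cutoff;
  });
}
```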

Example Workflows

Crawl all product pages from a store

[Sitemap Parser] → [HTTP Request] → [Extract Content]
  url: store.com
  filter: .*\/products\/.*

Get all blog post URLs

[Sitemap Parser] → [Filter] → [Next Steps]
  url: https://blog.example.com/sitemap.xml
  exclude: .*\.(jpg|png|gif|css|js)$

Development

# Install dependencies
npm install

# Build
npm run build

# Development with hot reload
npm run dev

# Lint
npm run lint

License

MIT
