sitemap-parser

n8n community node for parsing sitemaps with recursive sitemap index traversal

Package Information

Downloads: 6 weekly / 349 monthly

Latest Version: 0.1.2

Author: flakz

Available Nodes

Recursively parse XML sitemaps and extract all URLs. Supports sitemap indexes, child sitemaps, gzip-compressed sitemaps, and automatic sitemap discovery from domains.

Documentation

n8n-nodes-sitemap-parser

Features

Recursive Sitemap Index Parsing: Automatically traverses nested <sitemapindex> structures up to the configured Max Recursion Depth (default 10)
Domain Auto-Discovery: Given a domain, discovers sitemaps from robots.txt and common paths
Direct Sitemap URL: Parse any sitemap URL directly
Gzip Support: Handles .xml.gz compressed sitemaps
Concurrency Control: Configurable limit on parallel requests
URL Filtering: Include or exclude URLs with regex patterns
Rich Output: Extracts lastmod, changefreq, and priority metadata
Loop Detection: Prevents infinite recursion by tracking already-visited sitemap URLs
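
The Concurrency Control feature can be pictured as a small worker pool that never runs more than a fixed number of requests at once. The following is a minimal sketch under that assumption, not the node's actual implementation; `mapWithLimit` and its parameters are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner repeatedly claims the next index until the list is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }

  // Start at most `limit` runners (and at least one when there is work to do).
  const runnerCount = Math.min(Math.max(1, limit), items.length);
  const runners: Promise<void>[] = [];
  for (let n = 0; n < runnerCount; n++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results; // same order as `items`
}
```

With the documented default, a caller would pass `5` as `limit` and a function that fetches one sitemap as `worker`.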

Installation

In n8n, go to Settings → Community Nodes and install:

n8n-nodes-sitemap-parser

Or install via npm:

npm install n8n-nodes-sitemap-parser

Usage

Mode 1: Sitemap URL (Direct)

Provide a direct sitemap URL:

https://rothys.com/sitemap.xml

The node will:

Fetch the sitemap
If it's a <sitemapindex>, recursively fetch all child sitemaps
Extract all <url> entries with metadata
Output each URL as a separate n8n item
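
The steps above can be sketched as a recursive function with a visited set and a depth limit. This is an illustrative sketch only: `crawlSitemap`, `extractLocs`, and the injected `fetchXml` function are hypothetical names, the regex-based XML handling is a simplification (a real parser would also decode entities such as `&amp;`), and the exact meaning of `depth` in the node's output is an assumption.

```typescript
// Hypothetical types and helpers; not the node's actual API.
type Fetcher = (url: string) => Promise<string>;

type SitemapEntry = { url: string; depth: number; source: string };

// Collect the text of every <loc> element (naive regex, adequate for a sketch).
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    locs.push(m[1]);
  }
  return locs;
}

async function crawlSitemap(
  url: string,
  fetchXml: Fetcher,
  maxDepth = 10,
  depth = 0,
  visited: Set<string> = new Set<string>(),
): Promise<SitemapEntry[]> {
  // Loop detection and depth limit: skip anything already seen or too deep.
  if (visited.has(url) || depth > maxDepth) return [];
  visited.add(url);

  const xml = await fetchXml(url);
  const locs = extractLocs(xml);

  // A sitemap index lists child sitemaps; recurse into each one.
  if (/<sitemapindex[\s>]/.test(xml)) {
    const results: SitemapEntry[] = [];
    for (const child of locs) {
      const found = await crawlSitemap(child, fetchXml, maxDepth, depth + 1, visited);
      results.push(...found);
    }
    return results;
  }

  // A regular sitemap: every <loc> is a page URL.
  return locs.map((loc) => ({ url: loc, depth, source: url }));
}
```

Passing the fetch function in as a parameter keeps the traversal logic independent of how HTTP requests, timeouts, and gzip decoding are handled.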

Mode 2: Domain (Auto-Discovery)

Provide a domain:

rothys.com

The node will:

Check robots.txt for Sitemap: directives
Try common sitemap paths (/sitemap.xml, /sitemap_index.xml, etc.)
Parse all discovered sitemaps recursively
Output all URLs
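
The discovery steps above might look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, the fallback list contains only the two paths this README names (the full list behind "etc." is not documented here), and defaulting to `https://` when the input has no scheme is a guess.

```typescript
// Extract sitemap URLs from the body of a robots.txt file.
function parseRobotsSitemaps(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    // Match "Sitemap: <url>" in any letter case; the value is the first non-space token.
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Hypothetical fallback list: only the paths named in this README.
const COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml"];

// Combine robots.txt directives with fallback paths, without duplicates.
function candidateSitemapUrls(domain: string, robotsTxt: string): string[] {
  // Assumption: bare domains are treated as HTTPS origins.
  const origin = /^https?:\/\//i.test(domain) ? domain : `https://${domain}`;
  const base = origin.replace(/\/+$/, "");
  const fromRobots = parseRobotsSitemaps(robotsTxt);
  const fallbacks = COMMON_PATHS.map((p) => base + p);
  return Array.from(new Set([...fromRobots, ...fallbacks]));
}
```

Each candidate would then be fetched and parsed recursively, with candidates that do not exist simply skipped.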

Options

Option	Default	Description
Max Recursion Depth	10	Maximum depth for nested sitemap indexes
Concurrency	5	Max parallel HTTP requests
Request Timeout	30s	Timeout per request
Custom User Agent	n8n-sitemap-parser/1.0	User-Agent header
URL Filter Pattern	—	Regex to include only matching URLs
Exclude Pattern	—	Regex to exclude matching URLs
Include Metadata	true	Include lastmod, changefreq, priority
Flatten Output	true	One item per URL; when false, all URLs are returned in a single array
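
To illustrate how the URL Filter Pattern and Exclude Pattern options could combine, here is a small sketch. `filterUrls` is a hypothetical helper, and the order of evaluation (include first, then exclude) is an assumption rather than documented behavior.

```typescript
// Hypothetical helper: keep URLs that match `includePattern` (if given)
// and do not match `excludePattern` (if given).
function filterUrls(
  urls: string[],
  includePattern?: string,
  excludePattern?: string,
): string[] {
  // Build each regex once; no "g" flag, so test() carries no state between calls.
  const include = includePattern ? new RegExp(includePattern) : null;
  const exclude = excludePattern ? new RegExp(excludePattern) : null;
  return urls.filter(
    (u) => (!include || include.test(u)) && !(exclude && exclude.test(u)),
  );
}

const sampleUrls = [
  "https://store.com/products/widget",
  "https://store.com/assets/logo.png",
];
// Keeps only the /products/ URL.
const productUrls = filterUrls(sampleUrls, ".*\\/products\\/.*");
```

Leaving both patterns empty returns the list unchanged, which matches the table's "no default" entries.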

Output Schema

Each URL item contains:

{
  "url": "https://example.com/products/widget",
  "lastmod": "2024-01-15",
  "changefreq": "weekly",
  "priority": "0.8",
  "depth": 2,
  "source": "https://example.com/sitemap-products.xml"
}
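
For downstream Code nodes, the item above can be described with a TypeScript type, and the Include Metadata and Flatten Output options can be pictured as a final shaping step. This is a sketch with assumptions: `SitemapUrlItem`, `shapeOutput`, and the `urls` key used for the non-flattened case are hypothetical names, and it assumes `depth` and `source` are kept when metadata is turned off. The `{ json: ... }` wrapper is the standard n8n item shape.

```typescript
// Hypothetical type mirroring the JSON example above.
type SitemapUrlItem = {
  url: string;
  lastmod?: string;
  changefreq?: string;
  priority?: string; // shown as a string in the example, not a number
  depth: number;
  source: string;
};

// n8n passes data between nodes as items with a `json` property.
type N8nItem = { json: Record<string, unknown> };

function shapeOutput(
  entries: SitemapUrlItem[],
  includeMetadata = true,
  flatten = true,
): N8nItem[] {
  // Optionally drop lastmod, changefreq, and priority.
  const shaped: Record<string, unknown>[] = entries.map((e) =>
    includeMetadata ? { ...e } : { url: e.url, depth: e.depth, source: e.source },
  );
  // Flattened: one item per URL. Otherwise: one item holding the whole array.
  return flatten ? shaped.map((json) => ({ json })) : [{ json: { urls: shaped } }];
}
```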

Example Workflows

Crawl all product pages from a store

[Sitemap Parser] → [HTTP Request] → [Extract Content]
  url: store.com
  filter: .*\/products\/.*

Get all blog post URLs

[Sitemap Parser] → [Filter] → [Next Steps]
  url: https://blog.example.com/sitemap.xml
  exclude: .*\.(jpg|png|gif|css|js)$

Development

# Install dependencies
npm install

# Build
npm run build

# Development with hot reload
npm run dev

# Lint
npm run lint

License

MIT
