ai-training-scraper

Scrape and chunk websites into AI-ready training data for RAG, LLM fine-tuning, and vector databases

Package Information

Downloads: 41 weekly / 83 monthly
Latest Version: 1.0.2
Author: Blukaze Automations

Documentation

n8n-nodes-ai-training-scraper

This is an n8n community node that lets you scrape websites and convert them into AI-ready training data using the Apify AI Training Data Scraper. It intelligently chunks content for RAG (Retrieval-Augmented Generation), LLM fine-tuning, and vector databases.

n8n is a fair-code licensed workflow automation platform.

Features

  • Smart Scraping: Choose between Cheerio (fast, static) or Playwright (headless browser for JS-heavy sites).
  • Intelligent Chunking:
    • Semantic: Splits by meaning (recommended for RAG).
    • Fixed Token: Strict token limits.
    • Sentence Based: Preserves sentence structure.
    • Markdown Section: Splits by headers.
  • AI-Ready Output: Formats data specifically for vector databases (Pinecone, Weaviate, etc.) or fine-tuning datasets.
  • Advanced Control:
    • Remove CSS selectors (ads, navbars).
    • Respect robots.txt.
    • Recursively follow links with depth control.
    • Extract metadata (author, date, keywords).
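To illustrate the Markdown Section strategy, here is a minimal sketch of header-based splitting. This is an illustrative approximation only, not the node's actual chunker:

```python
import re

def split_by_headers(markdown: str) -> list[str]:
    """Split a markdown document into one chunk per header section.

    Illustrative sketch; the node's real chunker is more sophisticated.
    """
    # Split immediately before any line that starts with 1-6 '#' characters.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    # Drop empty fragments (e.g. when the document begins with a header).
    return [p.strip() for p in parts if p.strip()]

doc = """# Intro
Welcome.

## Install
Run npm install.

## Usage
Scrape away."""

chunks = split_by_headers(doc)
# Produces three chunks, each beginning with its own header line.
```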

Installation

Follow the instructions for installing a community node in your n8n instance.

  1. Go to Settings > Community Nodes.
  2. Select Install.
  3. Enter the package name: n8n-nodes-ai-training-scraper.

Alternatively, if running via npm:

npm install n8n-nodes-ai-training-scraper

Configuration

You need an Apify API Token to use this node.

  1. Log in to your Apify Console.
  2. Go to Settings > Integrations.
  3. Copy your API Token.
  4. In n8n, add a new Credential for Apify API and paste the token.

Usage Examples

1. Basic Documentation Scraping

Scrape a documentation site and prepare it for a vector store.

  • Operation: Scrape and Chunk
  • Start URLs: https://docs.python.org/3/
  • Chunking: Semantic
  • Output Format: Vector Ready
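The configuration above can be written out as a plain mapping. The field names here are assumptions based on the UI labels, not the node's internal parameter schema:

```python
# Illustrative mapping of the parameters listed above; key names are
# guesses derived from the UI labels, not the node's internal JSON.
basic_docs_scrape = {
    "operation": "Scrape and Chunk",
    "startUrls": ["https://docs.python.org/3/"],
    "chunking": "Semantic",
    "outputFormat": "Vector Ready",
}
```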

2. RAG Pipeline

Build a chatbot that answers questions based on your website.

  1. AI Training Scraper: Scrapes your blog.
  2. OpenAI Embeddings: Converts chunks to vectors.
  3. Pinecone: Stores the vectors.
  4. LangChain: Queries Pinecone for context.
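The retrieval step of such a pipeline can be sketched with a toy bag-of-words vector in place of real embeddings. This is purely illustrative: a production pipeline would use OpenAI embeddings and Pinecone as listed above.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: lowercase word-count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunks as produced by the scraper; their vectors stand in for a vector DB.
chunks = [
    "n8n is a workflow automation platform",
    "Playwright renders JavaScript-heavy sites",
    "Semantic chunking keeps related text together",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query: str) -> str:
    # Return the chunk most similar to the query (the "context" for the LLM).
    qv = embed(query)
    return max(index, key=lambda item: cosine(qv, item[1]))[0]

best = retrieve("which crawler handles JavaScript sites?")
# best is the Playwright chunk, the only one sharing query terms.
```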

3. Multi-site Knowledge Base

Combine multiple sources into one dataset.

  • Start URLs: https://docs.example.com, https://blog.example.com
  • Max Pages: 500
  • Crawler Type: Playwright (to handle dynamic content)

Parameters Guide

Essential

  • Start URLs: Where the crawler begins. Can be multiple comma-separated URLs.
  • Crawler Type:
    • Cheerio: Much faster, cheaper, but acts like curl. Good for static HTML.
    • Playwright: Uses a real browser. Essential for React/Vue/Angular sites but slower.

Chunking

  • Strategy: How to split the text. Semantic uses basic NLP to keep related text together.
  • Chunk Size: Target size in tokens (approx. 4 chars per token).
  • Chunk Overlap: How many tokens to repeat between chunks to preserve context at boundaries.
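How Chunk Size and Chunk Overlap interact can be sketched with the rough 4-characters-per-token rule mentioned above. This is a simplified illustration; the node's actual chunker works on real token counts:

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Fixed-size chunking with overlap, approximating tokens as 4 chars each.

    Illustrative sketch only; real tokenizers count tokens, not characters.
    """
    chars_per_chunk = chunk_size * 4   # approx. 4 characters per token
    chars_overlap = overlap * 4
    step = chars_per_chunk - chars_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chars_per_chunk])
        if start + chars_per_chunk >= len(text):
            break
    return chunks

text = "".join(chr(97 + i % 26) for i in range(3000))
chunks = chunk_text(text, chunk_size=256, overlap=32)
# Consecutive chunks share 32 tokens (~128 characters) at their boundary,
# so context that straddles a chunk edge appears in both chunks.
```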

Advanced

  • Max Crawl Depth: How many link hops (clicks) away from the start URL the crawler will follow.
  • Remove Elements: CSS selectors to strip out before processing (e.g., nav, .footer, .ad-banner).
  • URL Patterns: Only scrape URLs matching these globs (e.g., **/blog/**).
  • Exclude URL Patterns: Skip URLs matching these globs (e.g., **/login).
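A minimal sketch of how such globs can be matched against URLs, where `**` matches any characters (including `/`) and `*` matches anything except `/`. The actor's exact matching rules may differ:

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Convert a simple URL glob to a regex.

    '**' matches anything including '/', '*' matches anything except '/'.
    Illustrative only; the actor's matcher may behave differently.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

blog = glob_to_regex("**/blog/**")
login = glob_to_regex("**/login")
# "https://example.com/blog/post-1" matches the blog pattern;
# "https://example.com/login" matches the login pattern.
```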

Troubleshooting

  • Rate Limits: If you see 429 errors or timeouts, reduce Max Concurrency in Advanced Options.
  • Empty Results: Check your Start URLs and ensure Crawler Type matches the site technology. If the site uses JavaScript to render content, you MUST use Playwright.
  • Garbage Content: Use Remove Elements to strip out headers, footers, and sidebars that clutter the training data.

Compatibility

Tested with n8n v1.0.0+.

License

MIT
