Smart Scraper

Extracts content from web pages using a filtered text extraction mode. Compatible with Windows and Android.

Overview

The Smart Scraper node extracts content from web pages using a filtered text extraction mode. It can fetch pages directly over HTTP or through a JavaScript render service. It is suited to scraping structured data such as articles, products, SEO metadata, and reviews, and offers options for pagination, custom headers, authentication, and respecting robots.txt rules. Practical applications include content aggregation, market research, SEO analysis, and review collection.

Use Case Examples

Extract product details from an e-commerce site using a CSS selector and direct HTTP fetch.
Scrape article content from a news website using a render service to handle JavaScript.
Collect SEO metadata and internal links from a webpage for SEO auditing.
Extract multiple pages of reviews from a product page with pagination enabled.
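
The pagination use case can be pictured as a loop that follows a "next page" link until none is found. The sketch below is only an illustration in Python's standard library, not the node's implementation: the `crawl_pages` helper, the `next` class name, the `max_pages` limit, and the in-memory `FAKE_SITE` are all hypothetical, and a class-name match stands in for the CSS selector the node accepts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Finds the href of the first <a> whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.href = attr_map["href"]


def crawl_pages(start_url, fetch, next_class="next", max_pages=10):
    """Follow next-page links until none is found or max_pages is reached."""
    visited, url = [], start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        finder = NextLinkFinder(next_class)
        finder.feed(fetch(url))
        # Resolve relative hrefs against the current page URL.
        url = urljoin(url, finder.href) if finder.href else None
    return visited


# Hypothetical in-memory site standing in for real HTTP responses.
FAKE_SITE = {
    "https://shop.example/reviews?page=1": '<a class="next" href="?page=2">Next</a>',
    "https://shop.example/reviews?page=2": '<a class="btn next" href="/reviews?page=3">Next</a>',
    "https://shop.example/reviews?page=3": "<p>No more pages</p>",
}

pages = crawl_pages("https://shop.example/reviews?page=1", FAKE_SITE.__getitem__)
print(pages)
```

If the class name does not match any link, the loop ends after the first page, which mirrors the "pagination stops early" symptom of an incorrect next-page selector.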

Properties

Name	Meaning
URL	The web page URL to extract content from.
Modo de Extracción (Extraction Mode)	The content extraction mode; fixed to filtered text.
Estrategia de Fetch (Fetch Strategy)	How the page content is obtained: direct HTTP or render service.
URL del Servicio de Renderizado (Render Service URL)	URL of the render service; used when the render fetch strategy is selected.
Selector CSS (opcional) (CSS Selector, optional)	Optional CSS selector that limits extraction to a specific part of the page.
Opciones Avanzadas (Advanced Options)	Advanced options: custom headers, user-agent, timeout, retries, robots.txt respect, pagination, binary HTML output, and authentication method.
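
To make the timeout and retries options more concrete, here is a minimal sketch of retrying a failed fetch with exponential backoff. It assumes that "retries" means additional attempts after the first failure; the source does not say how the node counts them. The `fetch_with_retries` helper, the `backoff_seconds` parameter, and the simulated `flaky_fetch` are hypothetical names introduced for illustration.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff_seconds=0.0):
    """Call fetch(url), retrying on OSError up to `retries` extra times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError as error:  # TimeoutError and ConnectionError are OSErrors
            last_error = error
            if attempt < retries:
                # Wait longer after each failed attempt.
                time.sleep(backoff_seconds * (2 ** attempt))
    raise last_error


# Hypothetical flaky fetcher: times out twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


html = fetch_with_retries(flaky_fetch, "https://example.com", retries=3)
print(html, calls["count"])
```

With too few retries, the last error is re-raised, which is the kind of premature failure that larger timeout and retry values are meant to avoid on slow networks.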

Output

Binary

If the binary HTML output option (under Opciones Avanzadas) is enabled, the node returns the full HTML content as a binary file with MIME type text/html.

JSON

results
  * html - Raw or filtered HTML content extracted from the page or CSS selector.
  * text - Extracted plain text content with whitespace normalized.
  * title - Title of the article or product extracted.
  * author - Author metadata for articles or reviews.
  * datePublished - Publication date for articles or reviews.
  * content - Main content HTML for articles.
  * excerpt - Excerpt or summary of article content.
  * name - Product name extracted.
  * price - Product price extracted.
  * description - Description metadata for products or articles.
  * images - Array of product image URLs.
  * url - URL of the extracted page.
  * basic - Basic SEO metadata including title, description, and keywords.
  * openGraph - Open Graph metadata extracted from the page.
  * twitterCard - Twitter card metadata extracted from the page.
  * headings - Headings (h1, h2, h3) extracted from the page.
  * links - Array of links with href, text, and internal/external flag.
  * reviews - Array of review objects with author, rating, content, and date.
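
As a rough illustration of the `text` and `links` fields, the sketch below normalizes whitespace and flags each link as internal or external by comparing hosts. It uses Python's standard library rather than the node's own dependencies, and the `internal` key name, the `extract` helper, and the sample HTML are assumptions made for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def normalize(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


class TextAndLinks(HTMLParser):
    """Collects all text nodes plus one record per <a href=...> element."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.text_parts = []
        self.links = []
        self._current = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            absolute = urljoin(self.page_url, href)
            # A link is internal when it points at the same host as the page.
            internal = urlparse(absolute).netloc == urlparse(self.page_url).netloc
            self._current = {"href": absolute, "text": "", "internal": internal}

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = normalize(self._current["text"])
            self.links.append(self._current)
            self._current = None


def extract(html, page_url):
    parser = TextAndLinks(page_url)
    parser.feed(html)
    parser.close()
    return {
        "url": page_url,
        "text": normalize("".join(parser.text_parts)),
        "links": parser.links,
    }


SAMPLE_HTML = (
    "<h1>Blue   Mug</h1>\n"
    '<p>Holds <a href="/specs">350 ml\n of tea</a>. '
    'See <a href="https://other.example/review">a review</a>.</p>'
)

result = extract(SAMPLE_HTML, "https://shop.example/products/mug")
print(result["text"])
print(result["links"])
```

Relative hrefs are resolved against the page URL before the host comparison, so `/specs` counts as internal while the link to `other.example` counts as external.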

Dependencies

axios
cheerio
@mozilla/readability
robots-parser
jsdom

Troubleshooting

Ensure the URL is correct and accessible; invalid URLs or network issues will cause errors.
If using the render fetch strategy, the render service URL must be provided and reachable.
When robots.txt respect is enabled, URLs that the site disallows are not scraped; disable the option if necessary, but be mindful of legal and ethical considerations.
Custom headers and authentication must be correctly configured to access protected resources.
Pagination requires a correct CSS selector for the next-page link; an incorrect selector will stop pagination early.
Timeout and retry settings should be adjusted based on network conditions to avoid premature failures.
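
When a URL seems to be blocked by robots.txt, it can help to check the rules outside the node. The sketch below uses Python's standard `urllib.robotparser` as a stand-in for the `robots-parser` dependency listed above; the `is_allowed` helper, the `SmartScraperBot` user-agent string, and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; in practice these come from https://<host>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "SmartScraperBot", "https://example.com/products/1"))
print(is_allowed(ROBOTS_TXT, "SmartScraperBot", "https://example.com/private/report"))
```

A `False` result for the target URL suggests that the robots.txt setting, not the network or the selector, is what prevents the scrape.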
