Smart Scraper

Extracts content from web pages using a filtered text extraction mode. Compatible with Windows and Android.

Overview

The Smart Scraper node extracts content from web pages using a filtered text extraction mode. It can fetch pages directly over HTTP or through a JavaScript render service. It is suited to scraping structured data such as articles, products, SEO metadata, and reviews, with options for pagination, custom headers, authentication, and respecting robots.txt rules. Practical applications include content aggregation, market research, SEO analysis, and review collection.

Use Case Examples

  1. Extract product details from an e-commerce site using a CSS selector and direct HTTP fetch.
  2. Scrape article content from a news website using a render service to handle JavaScript.
  3. Collect SEO metadata and internal links from a webpage for SEO auditing.
  4. Extract multiple pages of reviews from a product page with pagination enabled.

Properties

Name | Meaning
URL | The web page URL to extract content from.
Modo de Extracción | The type of content extraction mode, fixed as filtered text.
Estrategia de Fetch | Method to obtain the page content, either direct HTTP or render service.
URL del Servicio de Renderizado | URL of the render service used when the render fetch strategy is selected.
Selector CSS (opcional) | Optional CSS selector to extract a specific part of the page content.
Opciones Avanzadas | Advanced options including custom headers, user-agent, timeout, retries, robots.txt respect, pagination, binary HTML output, and authentication method.
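
The relationship between these properties can be sketched as a small validation step. This is only an illustration: the type and field names (ScraperParams, fetchStrategy, renderServiceUrl, and so on) are hypothetical and are not taken from the node's source. It assumes the render service URL is only required when the render strategy is selected.

```typescript
// Hypothetical parameter shape; field names are illustrative assumptions,
// not the node's real internal identifiers.
type FetchStrategy = "direct" | "render";

interface ScraperParams {
  url: string;
  fetchStrategy: FetchStrategy;
  renderServiceUrl?: string; // only meaningful when fetchStrategy is "render"
  cssSelector?: string;
  timeoutMs?: number;
  retries?: number;
}

// True only for absolute http(s) URLs.
function isHttpUrl(value: string): boolean {
  try {
    const parsed = new URL(value);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false;
  }
}

// Returns a list of problems; an empty list means the parameters look usable.
function validateParams(p: ScraperParams): string[] {
  const problems: string[] = [];
  if (!isHttpUrl(p.url)) {
    problems.push("url must be an absolute http(s) URL");
  }
  if (p.fetchStrategy === "render" && !isHttpUrl(p.renderServiceUrl ?? "")) {
    problems.push("renderServiceUrl is required for the render strategy");
  }
  if (p.timeoutMs !== undefined && p.timeoutMs <= 0) {
    problems.push("timeoutMs must be positive");
  }
  if (p.retries !== undefined && (!Number.isInteger(p.retries) || p.retries < 0)) {
    problems.push("retries must be a non-negative integer");
  }
  return problems;
}
```

Collecting every problem in a list, rather than throwing on the first one, lets a caller report all misconfigured fields at once.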

Output

Binary

If the binary HTML output option (under Opciones Avanzadas) is enabled, the node returns the full HTML content as a binary file with MIME type text/html.

JSON

  • results
    • (each result item)
      * html - Raw or filtered HTML content extracted from the page or CSS selector.
      * text - Extracted plain text content with whitespace normalized.
      * title - Title of the article or product extracted.
      * author - Author metadata for articles or reviews.
      * datePublished - Publication date for articles or reviews.
      * content - Main content HTML for articles.
      * excerpt - Excerpt or summary of article content.
      * name - Product name extracted.
      * price - Product price extracted.
      * description - Description metadata for products or articles.
      * images - Array of product image URLs.
      * url - URL of the extracted page.
      * basic - Basic SEO metadata including title, description, and keywords.
      * openGraph - Open Graph metadata extracted from the page.
      * twitterCard - Twitter card metadata extracted from the page.
      * headings - Headings (h1, h2, h3) extracted from the page.
      * links - Array of links with href, text, and internal/external flag.
      * reviews - Array of review objects with author, rating, content, and date.
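
Two of the fields above, text (whitespace normalized) and links (href, text, internal/external flag), can be illustrated with a short sketch. The helper names and the boolean field name internal are assumptions made for this example. The sketch treats a link as internal when its hostname matches the page's hostname, which may differ from the rule the node actually applies.

```typescript
// Collapse every run of whitespace (spaces, tabs, newlines, non-breaking
// spaces) into a single space and trim both ends.
function normalizeWhitespace(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// Hypothetical link shape mirroring the "links" output field.
interface LinkInfo {
  href: string;
  text: string;
  internal: boolean;
}

// Resolve a possibly relative href against the page URL and flag it as
// internal when the hostnames match. Returns null for unparsable or
// non-http(s) links such as mailto: or javascript:.
function classifyLink(rawHref: string, text: string, pageUrl: string): LinkInfo | null {
  let resolved: URL;
  let page: URL;
  try {
    page = new URL(pageUrl);
    resolved = new URL(rawHref, page);
  } catch {
    return null;
  }
  if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
    return null;
  }
  return {
    href: resolved.href,
    text: normalizeWhitespace(text),
    internal: resolved.hostname === page.hostname,
  };
}
```

For example, classifyLink("/about", "About us", "https://example.com/blog/post") resolves the relative path against the page URL and marks the link as internal.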

Dependencies

  • axios
  • cheerio
  • @mozilla/readability
  • robots-parser
  • jsdom

Troubleshooting

  • Ensure the URL is correct and accessible; invalid URLs or network issues will cause errors.
  • If using the render fetch strategy, the render service URL must be provided and reachable.
  • When robots.txt respect is enabled, pages the site disallows are not scraped; disable the option only if necessary, and be mindful of legal and ethical considerations.
  • Custom headers and authentication must be correctly configured to access protected resources.
  • Pagination requires a correct CSS selector for the next-page link; an incorrect selector will stop pagination early.
  • Timeout and retry settings should be adjusted based on network conditions to avoid premature failures.
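
The retry and pagination points above can be sketched as follows. This is a sketch under stated assumptions, not the node's implementation: the source does not say how retries are spaced, so the exponential backoff, the default delays, and the helper names (backoffDelays, withRetry, nextPageUrl) are all hypothetical.

```typescript
// Delay before each retry: baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
// Exponential backoff is an assumption; the node may space retries differently.
function backoffDelays(retries: number, baseMs = 500, capMs = 8000): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return delays;
}

// Run an async task, retrying up to `retries` times after a failure.
// Rethrows the last error once all attempts are exhausted.
async function withRetry<T>(task: () => Promise<T>, retries: number, baseMs = 500): Promise<T> {
  const delays = backoffDelays(retries, baseMs);
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await new Promise<void>((resolve) => setTimeout(resolve, delays[attempt]));
      }
    }
  }
  throw lastError;
}

// Decide where pagination goes next. `nextHref` is whatever the next-page
// selector matched; undefined means it matched nothing, which ends pagination.
// Already visited URLs also end pagination, guarding against loops.
function nextPageUrl(currentUrl: string, nextHref: string | undefined, visited: Set<string>): string | null {
  if (!nextHref) return null;
  let resolved: string;
  try {
    resolved = new URL(nextHref, currentUrl).href;
  } catch {
    return null;
  }
  return visited.has(resolved) ? null : resolved;
}
```

A selector that matches nothing yields an undefined nextHref, so nextPageUrl returns null and the crawl stops after the current page. This is the early stop described in the pagination item.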
