Overview
This node, named "Gimme Dat," is designed to scrape data from any webpage by fetching its HTML content and extracting information based on user preferences. It supports two extraction modes:
- Simple mode: Automatically extracts the main content of a page such as articles, blog posts, or documentation by targeting common content containers.
- Advanced mode: Allows users to define specific fields to extract using CSS selectors, enabling precise scraping of elements like product titles, prices, images, links, etc.
Typical use cases include:
- Extracting article text or blog content for further processing.
- Scraping product details from e-commerce pages.
- Collecting metadata or custom attributes from web pages.
- Converting HTML content into Markdown format for easier readability or storage.
Properties
| Name | Meaning |
|---|---|
| URL | The full URL of the webpage to scrape. Must be a valid HTTP/HTTPS address. |
| Extraction Mode | Method of data extraction. Simple (Auto-Extract Content): automatically extracts the main textual content. Advanced (CSS Selectors): define multiple fields with CSS selectors and specify what to extract from each. |
| Fields to Extract | (Visible only in Advanced mode.) A collection of fields, each with: CSS Selector (the selector used to locate elements); Extract (what to extract: text, html, href, src, alt, value, or a custom attribute); Custom Attribute Name (the attribute to read when a custom attribute is extracted); Field Name (the output key name); Return Multiple (whether to return all matches as an array or only the first match). |
| Options | Additional options. User Agent: the user agent string for the request (Default, Desktop Chrome, Mobile Safari, Bot). Timeout: request timeout in milliseconds. Include Metadata: include page metadata (title, description, author, timestamp) in the output. Convert to Markdown: convert extracted HTML content to Markdown, preserving structure and formatting. |
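The Advanced-mode field settings in the table can be pictured as a small typed structure. The TypeScript sketch below is illustrative only: the property names (`fieldName`, `selector`, `extract`, `customAttribute`, `returnMultiple`), the `"attribute"` value standing in for the custom-attribute option, the example selectors, and the `validateFields` helper are all assumptions, not the node's actual internal parameter names.

```typescript
// Hypothetical shape of one entry in "Fields to Extract" (names are assumptions).
type ExtractKind = "text" | "html" | "href" | "src" | "alt" | "value" | "attribute";

interface FieldConfig {
  fieldName: string;        // output key name
  selector: string;         // CSS selector used to locate elements
  extract: ExtractKind;     // what to extract from each matched element
  customAttribute?: string; // only meaningful when extract === "attribute"
  returnMultiple: boolean;  // true: all matches as an array; false: first match only
}

// Example configuration for a product page (selectors are made up for illustration).
const productFields: FieldConfig[] = [
  { fieldName: "title", selector: "h1.product-title", extract: "text", returnMultiple: false },
  { fieldName: "price", selector: ".price", extract: "text", returnMultiple: false },
  { fieldName: "images", selector: ".gallery img", extract: "src", returnMultiple: true },
  { fieldName: "sku", selector: "[data-sku]", extract: "attribute", customAttribute: "data-sku", returnMultiple: false },
];

// Returns a list of problems; an empty list means the configuration is consistent.
function validateFields(fields: FieldConfig[]): string[] {
  const problems: string[] = [];
  const seen: Record<string, boolean> = {};
  for (const f of fields) {
    if (f.selector.trim() === "") problems.push(`${f.fieldName}: empty selector`);
    if (f.extract === "attribute" && !f.customAttribute) {
      problems.push(`${f.fieldName}: custom attribute name missing`);
    }
    if (seen[f.fieldName]) problems.push(`${f.fieldName}: duplicate field name`);
    seen[f.fieldName] = true;
  }
  return problems;
}
```

Checking a configuration like this before running the node can catch an empty selector or a duplicate output key early, rather than discovering it as a null value in the output.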
Output
The node outputs an array of JSON objects, one per input item, containing the scraped data under the json property.
In Simple mode, the output contains:
- `content`: The main extracted content as plain text, or as Markdown if conversion is enabled.
- Optionally, if metadata inclusion is enabled, a `metadata` object with:
  - `url`: The scraped URL.
  - `title`: Page title.
  - `description`: Meta description content.
  - `author`: Meta author content.
  - `timestamp`: ISO timestamp of extraction.
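As a rough illustration of the Simple-mode output shape, the following TypeScript sketch types the item and assembles it. The field names follow the description above; the `buildSimpleOutput` helper, its parameters, and the choice of plain strings for every metadata value are assumptions made for the example, not the node's own code.

```typescript
// Shape of the Simple-mode output as described above.
interface PageMetadata {
  url: string;
  title: string;
  description: string;
  author: string;
  timestamp: string; // ISO timestamp of extraction
}

interface SimpleOutput {
  content: string;         // plain text, or Markdown when conversion is enabled
  metadata?: PageMetadata; // present only when "Include Metadata" is enabled
}

// Hypothetical helper that assembles one output item.
function buildSimpleOutput(
  content: string,
  page: { url: string; title: string; description: string; author: string },
  includeMetadata: boolean,
  now: Date = new Date(),
): SimpleOutput {
  const output: SimpleOutput = { content };
  if (includeMetadata) {
    output.metadata = { ...page, timestamp: now.toISOString() };
  }
  return output;
}

// Example usage with made-up page values.
const exampleItem = buildSimpleOutput(
  "# Hello\n\nFirst paragraph.",
  { url: "https://example.com/post", title: "Hello", description: "", author: "" },
  true,
);
console.log(JSON.stringify(exampleItem, null, 2));
```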
In Advanced mode, the output contains keys corresponding to each defined field name, with values extracted according to the specified CSS selector and attribute. Values can be strings, arrays (if multiple matches requested), or null if no match found.
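The three possible value shapes can be sketched as follows. This is a minimal TypeScript illustration under stated assumptions: `shapeFieldValue` and `buildAdvancedOutput` are hypothetical helpers, the `matches` arrays stand in for values already pulled from the page, and returning an empty array when Return Multiple is on and nothing matched is a guess, since the text above only states that unmatched fields can be null.

```typescript
type FieldValue = string | string[] | null;

// matches: the values already extracted from every element the selector matched.
function shapeFieldValue(matches: string[], returnMultiple: boolean): FieldValue {
  if (returnMultiple) {
    return matches; // assumption: an empty array when nothing matched
  }
  return matches.length > 0 ? matches[0] : null; // first match, or null if none
}

// Assemble one output item keyed by field name.
function buildAdvancedOutput(
  fields: { fieldName: string; returnMultiple: boolean }[],
  matchesByField: Record<string, string[]>,
): Record<string, FieldValue> {
  const output: Record<string, FieldValue> = {};
  for (const field of fields) {
    const matches = matchesByField[field.fieldName] || [];
    output[field.fieldName] = shapeFieldValue(matches, field.returnMultiple);
  }
  return output;
}
```

Downstream nodes that consume these values should therefore handle all three cases, for example by checking `Array.isArray(value)` and `value === null` before treating a field as a string.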
No binary data output is produced by this node.
Dependencies
- Requires internet access to fetch webpages.
- Uses standard HTTP(S) requests with configurable user agents.
- Depends on these npm packages bundled within the node:
  - `cheerio` for parsing and querying HTML content.
  - `turndown` for converting HTML to Markdown (optional).
- No external API keys or credentials are required.
- Supports request timeout configuration to avoid hanging requests.
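A request with a configurable user agent and timeout might look roughly like the sketch below. It assumes a runtime with a global `fetch` and `AbortController` (for example Node.js 18 or later); `buildHeaders` and `fetchHtml` are hypothetical names, and the node's actual request code may differ.

```typescript
// Build request headers; an undefined user agent leaves the runtime default in place.
function buildHeaders(userAgent?: string): Record<string, string> {
  return userAgent ? { "User-Agent": userAgent } : {};
}

// Fetch a page's HTML, aborting if the request (including reading the body)
// takes longer than timeoutMs milliseconds.
async function fetchHtml(url: string, timeoutMs: number, userAgent?: string): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, {
      headers: buildHeaders(userAgent),
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`Failed to fetch URL: ${response.status} ${response.statusText}`);
    }
    return await response.text();
  } finally {
    clearTimeout(timer); // always clear the timer so it cannot fire after completion
  }
}
```

Aborting through the controller rejects the pending `fetch` call, so a slow server surfaces as an error instead of a hung execution.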
Troubleshooting
Common issues:
- An invalid or missing URL parameter causes an error.
- Network errors or unreachable URLs result in fetch failures.
- Incorrect CSS selectors in Advanced mode may yield empty or null results.
- A timeout that is too short may abort requests prematurely.
- Some websites block scraping attempts based on the user agent, and some require authentication, which this node does not support.
Error messages:
"URL is required": Ensure the URL property is set and valid."Failed to fetch URL: <status> <statusText>": Indicates HTTP error response; verify URL accessibility.- Parsing errors are unlikely but malformed HTML could affect extraction accuracy.
Resolutions:
- Double-check URLs and network connectivity.
- Adjust timeout settings for slow-loading pages.
- Use appropriate user agent strings to mimic browsers.
- Verify CSS selectors via browser developer tools before inputting.
- Enable "Include Metadata" to check which metadata the page actually exposes.
Links and References
- Cheerio GitHub Repository – HTML parsing and manipulation library used internally.
- Turndown GitHub Repository – Converts HTML to Markdown.
- MDN Web Docs: CSS Selectors – Reference for writing CSS selectors.
- User-Agent Strings – Information about user agent headers.