Overview
The Webpage Content Extractor node takes the raw HTML of a web page and extracts its main readable content, similar to the "Reader" mode in modern browsers. It strips extraneous elements such as headers, footers, banners, and advertisements, leaving only the core article. This node is particularly useful for workflows that need to analyze, summarize, or repurpose web articles, blog posts, or news stories.
Practical examples:
- Automatically summarizing news articles for newsletters.
- Extracting the main text from blog posts for sentiment analysis.
- Archiving clean versions of web content without clutter.
Properties
| Name | Type | Meaning |
|---|---|---|
| HTML Code | String | The full HTML source code of the webpage to extract content from. Typically obtained using an HTTP Request node. |
Output
The node outputs a JSON object with the following fields:
| Field | Type | Description |
|---|---|---|
| excerpt | String | A short summary or excerpt of the main content. |
| siteName | String | The name of the website, if available. |
| length | Number | The character count of the extracted main content. |
| textContent | String | The plain text version of the main content, with all HTML tags removed. |
| content | String | The HTML-formatted main content (cleaned up, suitable for display). |
| title | String | The title of the article or main content. |
| language | String | The detected language code of the content (e.g., "en"). |
| byline | String | The author or byline information, if available. |
| publishedTime | String | The publication date/time of the content, if available. |
Example output:
{
  "excerpt": "This is a summary of the article...",
  "siteName": "Example News",
  "length": 1234,
  "textContent": "Full plain text of the article...",
  "content": "<div><p>Full HTML content...</p></div>",
  "title": "Breaking News: Example Event",
  "language": "en",
  "byline": "By Jane Doe",
  "publishedTime": "2024-06-01T12:00:00Z"
}
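For workflows that consume this output in later steps, it can help to pin the shape down as a type. The sketch below mirrors the fields in the table above. The interface name `ExtractedArticle`, the helper `toNewsletterEntry`, and the choice to treat the "if available" fields as optional or null are assumptions made for illustration, not part of the node.

```typescript
// Shape of the node's JSON output, following the field table above.
// Assumption: fields documented as "if available" may be null or missing.
interface ExtractedArticle {
  excerpt: string;
  siteName?: string | null;
  length: number;
  textContent: string;
  content: string;
  title: string;
  language?: string | null;
  byline?: string | null;
  publishedTime?: string | null;
}

// Hypothetical helper: builds a one-line newsletter entry from the output.
function toNewsletterEntry(article: ExtractedArticle): string {
  // Keep only the attribution parts that are actually present.
  const source = [article.byline, article.siteName]
    .filter((part): part is string => typeof part === "string" && part.length > 0)
    .join(", ");
  const header = source ? `${article.title} (${source})` : article.title;
  return `${header}: ${article.excerpt}`;
}

// The example output from above, typed.
const sampleArticle: ExtractedArticle = {
  excerpt: "This is a summary of the article...",
  siteName: "Example News",
  length: 1234,
  textContent: "Full plain text of the article...",
  content: "<div><p>Full HTML content...</p></div>",
  title: "Breaking News: Example Event",
  language: "en",
  byline: "By Jane Doe",
  publishedTime: "2024-06-01T12:00:00Z",
};

const newsletterEntry = toNewsletterEntry(sampleArticle);
// "Breaking News: Example Event (By Jane Doe, Example News): This is a summary of the article..."
```

When `byline` and `siteName` are both absent, the helper falls back to the title and excerpt alone, so it stays usable on pages with sparse metadata.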
Dependencies
- External Libraries:
  - @mozilla/readability: Used for extracting the main content from HTML.
  - jsdom: Used to parse and simulate the DOM environment for Readability.
- n8n Configuration:
  - No special API keys or environment variables are required.
  - The node expects valid HTML input, typically fetched using n8n's HTTP Request node.
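As a rough picture of how the two libraries usually fit together: jsdom turns the HTML string into a DOM document, and Readability's `parse()` returns an article object, or `null` when it finds no main content. The sketch below shows that general pattern and is an assumption, not the node's actual source. The parser is injected as a function (the hypothetical `ArticleParser` type) so the block runs without either package installed, and a comment shows the call that would replace the stub in a real project.

```typescript
// Subset of the article object, limited to fields documented above.
interface ParsedArticle {
  title: string;
  textContent: string;
  content: string;
}

// Hypothetical seam: any function that turns HTML into an article,
// or null when no main content is found.
type ArticleParser = (html: string) => ParsedArticle | null;

// With the real libraries installed, the parser would look roughly like:
//   import { JSDOM } from "jsdom";
//   import { Readability } from "@mozilla/readability";
//   const parse = (html) => new Readability(new JSDOM(html).window.document).parse();

function extractMainContent(html: string, parse: ArticleParser): ParsedArticle {
  const article = parse(html);
  if (article === null) {
    // Same message as the one listed under Troubleshooting.
    throw new Error("Could not extract main contents of webpage.");
  }
  return article;
}

// Stand-in parser for demonstration only: treats the first <article> element
// as the main content. Readability's real heuristics are far more involved.
const stubParser: ArticleParser = (html) => {
  const match = /<article[^>]*>([\s\S]*?)<\/article>/i.exec(html);
  if (match === null) return null;
  const inner = match[1] ?? "";
  return {
    title: "",
    content: inner,
    textContent: inner.replace(/<[^>]+>/g, "").trim(),
  };
};

const extracted = extractMainContent(
  "<html><body><nav>Menu</nav><article><p>Hello, reader.</p></article></body></html>",
  stubParser,
);
// extracted.textContent is "Hello, reader."
```

Keeping the parser behind a function type also makes the null case easy to exercise in tests, since a stub can return `null` on demand.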
Troubleshooting
- Common Issues:
  - Invalid or incomplete HTML: If the provided HTML is malformed or missing key elements, extraction may fail or return empty results.
  - Non-article pages: Pages without clear main content (e.g., homepages, search results) may not yield meaningful output.
- Error Messages:
  - "Could not extract main contents of webpage."
    - Cause: The Readability library could not identify the main content in the provided HTML.
    - Resolution: Ensure you are passing the full HTML of an article or content-rich page. Try fetching the page again or check if the URL points to a valid article.
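Some of these problems can be caught before the node runs, for example in a step placed right after the HTTP request. The check below is a minimal sketch under stated assumptions: the function name `checkHtmlInput` and the heuristics it uses (non-empty string, presence of an `<html` or `<body` tag, a minimum length that defaults to 200 characters) are illustrative choices, not rules the node itself applies.

```typescript
// Hypothetical pre-flight check for the HTML Code input.
// Returns a list of human-readable problems. An empty list means the input
// passed these heuristics; it does not guarantee that extraction succeeds.
function checkHtmlInput(html: unknown, minLength = 200): string[] {
  const problems: string[] = [];
  if (typeof html !== "string" || html.trim().length === 0) {
    problems.push("Input is empty or not a string.");
    return problems;
  }
  if (!/<(html|body)[\s>]/i.test(html)) {
    problems.push("No <html> or <body> tag found; the input may be a fragment or a non-HTML response.");
  }
  if (html.length < minLength) {
    problems.push(`Input is shorter than ${minLength} characters; the page may be truncated.`);
  }
  return problems;
}

// A bare fragment trips both heuristics.
const fragmentProblems = checkHtmlInput("<p>Just a snippet</p>");
// An empty string is rejected immediately.
const emptyProblems = checkHtmlInput("");
// A complete (if tiny) document passes when the length threshold is lowered.
const pageProblems = checkHtmlInput("<html><body><p>Hi</p></body></html>", 10);
```

A non-empty result is a hint to re-fetch the page or check the URL before passing the HTML on, in line with the resolution steps above.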