Overview
This node extracts the main readable content from HTML input using Mozilla's Readability library. It is useful for cleaning up raw HTML pages to obtain just the article or primary text content, removing ads, navigation, and other clutter. Common scenarios include processing web-scraped HTML to get clean article text, summarizing blog posts, or extracting excerpts for previews.
For example, if you have a dataset of full HTML pages from news websites, this node can parse each page and output only the article title, main content, and excerpt, making downstream processing or analysis easier.
Properties
| Name | Meaning |
|---|---|
| HTML Field | The name of the field containing the HTML content to parse. This should be a string field holding raw HTML. |
| Error Handling | How to handle invalid or unparsable HTML content. Options: Pass Through (output the original input unchanged), Output Empty (output empty content fields), Throw Error (stop execution with an error; default). |
| Return Full Response | Whether to return the full response object from Readability, which includes metadata like author, siteName, length, etc. If false, only content, title, and excerpt are returned. |
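
The three Error Handling options can be pictured as a small branch that runs only when extraction fails. The sketch below is illustrative and is not the node's actual source: the mode identifiers, the `handleItem` function, and the `extract` callback (a stand-in for the Readability call that returns `null` on failure) are all assumed names.

```typescript
// Hypothetical names throughout; this is a sketch, not the node's source.
type ErrorHandling = "passThrough" | "outputEmpty" | "throwError";

type Extracted = { content: string; title: string; excerpt: string };
type Json = Record<string, unknown>;

// `extract` stands in for the Readability call and returns null on failure.
function handleItem(
  json: Json,
  htmlField: string,
  mode: ErrorHandling,
  extract: (html: string) => Extracted | null,
): Json {
  const html = json[htmlField];
  const result = typeof html === "string" ? extract(html) : null;

  if (result !== null) {
    // Extraction succeeded: the mode is irrelevant.
    return { content: result.content, title: result.title, excerpt: result.excerpt };
  }
  if (mode === "passThrough") {
    return json; // original input, unchanged
  }
  if (mode === "outputEmpty") {
    return { content: "", title: "", excerpt: "" }; // empty content fields
  }
  // Default mode: stop execution with an error.
  throw new Error("Error processing HTML: Failed to extract content");
}
```

Keeping the extractor as a parameter makes each mode easy to exercise with a stub that always returns `null`.
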
Output
The node outputs JSON data with the extracted readable content. By default, the output JSON contains:
- content: The main HTML content extracted from the input.
- title: The title of the article or main content.
- excerpt: A short summary or excerpt of the content.
If "Return Full Response" is enabled, the output JSON will include all metadata provided by Mozilla Readability, such as author, siteName, length, direction, and more, alongside the above fields.
No binary data is produced by this node.
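As a rough illustration of the two output shapes, the following sketch trims a parsed article down to the three default fields unless the full response is requested. The `FullArticle` type and `shapeOutput` function are hypothetical, and the metadata keys the node actually emits may differ from the ones assumed here.

```typescript
// Hypothetical types and names; the node's real metadata keys are an assumption.
type FullArticle = {
  content: string;
  title: string;
  excerpt: string;
  [metadata: string]: unknown; // e.g. author, siteName, length
};

function shapeOutput(
  article: FullArticle,
  returnFullResponse: boolean,
): Record<string, unknown> {
  if (returnFullResponse) {
    // Keep every field the parser produced.
    return { ...article };
  }
  // Default: only the three core fields.
  const { content, title, excerpt } = article;
  return { content, title, excerpt };
}

// Example input with two extra metadata fields.
const sampleArticle: FullArticle = {
  content: "<p>Hello</p>",
  title: "Greeting",
  excerpt: "Hello",
  siteName: "Example",
  length: 5,
};
const defaultOutput = shapeOutput(sampleArticle, false); // content, title, excerpt only
const fullOutput = shapeOutput(sampleArticle, true); // also siteName and length
```
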
Dependencies
- Mozilla Readability: The core library used to parse and extract readable content from HTML.
- jsdom: Used to create a DOM environment from the input HTML string so that Readability can operate on it.
- No external API keys or services are required.
- Requires the node to receive input items with a JSON field containing valid HTML content.
Troubleshooting
- Error: Item X has no JSON data
  This occurs if the input item at index X does not contain any JSON data. Ensure your input data includes JSON objects with the specified HTML field.
- Error processing HTML: Failed to extract content
  Indicates that the HTML could not be parsed into readable content. This might happen if the HTML is malformed or empty. Adjust the "Error Handling" property to either pass through the input or output empty content instead of throwing an error.
- Invalid HTML field name
  Make sure the "HTML Field" property matches exactly the name of the field in your input data that contains the HTML string.
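
When debugging these cases, it can help to check items before they reach the node. The following is a minimal sketch of such a pre-check, assuming items have an optional `json` object; `checkItem` and the `Item` type are hypothetical helpers, and only the first message mirrors the wording above (the other two are invented for illustration).

```typescript
// Hypothetical pre-check; not part of the node itself.
type Item = { json?: Record<string, unknown> };

// Returns a description of the problem, or null if the item looks usable.
function checkItem(item: Item, index: number, htmlField: string): string | null {
  if (item.json === undefined) {
    return `Item ${index} has no JSON data`;
  }
  const value = item.json[htmlField];
  if (typeof value !== "string") {
    // Field names are matched exactly, including case.
    return `Field "${htmlField}" is missing or not a string`;
  }
  if (value.trim() === "") {
    return `Field "${htmlField}" is empty`;
  }
  return null;
}
```
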
Links and References
- Mozilla Readability GitHub – The underlying library used for content extraction.
- jsdom GitHub – JavaScript implementation of the DOM used to parse HTML strings.