Readability Content Extractor
Extracts clean, readable content from HTML using Mozilla's Readability algorithm. Removes clutter like ads, navigation, and sidebars.
Overview
The Readability Content Extractor node takes raw HTML and isolates the main article or content body, stripping clutter such as advertisements, navigation menus, and sidebars. It uses Mozilla's Readability algorithm to parse the HTML and can optionally return article metadata alongside the content.
This node is useful when you want to:
- Extract the main textual content from web pages for further analysis or processing.
- Clean up HTML fetched from websites to get a simplified version of the content.
- Obtain metadata like title, author, excerpt, language, and publication date from articles.
Practical examples:
- Fetching news articles using an HTTP Request node and then extracting just the readable article text for sentiment analysis.
- Cleaning blog post HTML to generate summaries or plain text versions.
- Aggregating content from multiple sources while ignoring ads and sidebars.
Properties
| Name | Meaning |
|---|---|
| HTML Code | The raw HTML content to extract readable text from. Typically obtained via an HTTP Request node. |
| Include Full Content | Whether to include the full cleaned HTML content (with tags) in the output. |
| Include Text Content | Whether to include the plain text content extracted from the HTML. |
| Include Metadata | Whether to include article metadata such as title, excerpt, site name, author (byline), language, publication time, and text direction. |
Output
The node outputs JSON objects containing the extracted content based on the selected options:
- textContent (string): The plain text extracted from the HTML, trimmed of whitespace.
- length (number): The length of the extracted text content.
- content (string): The cleaned HTML content representing the main article body (included if "Include Full Content" is true).
- Metadata fields (included if "Include Metadata" is true):
  - title: Article title.
  - excerpt: A short summary or excerpt of the article.
  - siteName: The name of the website.
  - byline: Author or byline information.
  - language: Language code of the article.
  - publishedTime: Publication date/time.
  - textDirection: Text direction (e.g., "ltr" or "rtl").
If multiple input items are processed, the output is an array of such JSON objects.
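As a rough illustration of how the three Include options might shape each output item, the sketch below assembles an output object from an already-parsed article. It is not the node's actual source: `ParsedArticle`, `ExtractOptions`, `OutputItem`, and `buildOutput` are hypothetical names, the mapping of `lang` to `language` and `dir` to `textDirection` is an assumption, and so is the choice to gate both `textContent` and `length` on "Include Text Content".

```typescript
// Hypothetical shape of a parsed article; not taken from the node's source.
interface ParsedArticle {
  title?: string | null;
  excerpt?: string | null;
  siteName?: string | null;
  byline?: string | null;
  lang?: string | null;
  publishedTime?: string | null;
  dir?: string | null;
  content?: string | null;
  textContent?: string | null;
}

// Hypothetical option flags mirroring the three "Include ..." properties.
interface ExtractOptions {
  includeFullContent: boolean;
  includeTextContent: boolean;
  includeMetadata: boolean;
}

// Output fields as named in this document; all optional.
interface OutputItem {
  textContent?: string;
  length?: number;
  content?: string;
  title?: string;
  excerpt?: string;
  siteName?: string;
  byline?: string;
  language?: string;
  publishedTime?: string;
  textDirection?: string;
}

function buildOutput(article: ParsedArticle, opts: ExtractOptions): OutputItem {
  const out: OutputItem = {};

  if (opts.includeTextContent) {
    // Assumption: length is reported for the trimmed text.
    const text = (article.textContent ?? "").trim();
    out.textContent = text;
    out.length = text.length;
  }

  if (opts.includeFullContent && article.content) {
    out.content = article.content;
  }

  if (opts.includeMetadata) {
    // Fields the page does not provide are left out of the output.
    if (article.title) out.title = article.title;
    if (article.excerpt) out.excerpt = article.excerpt;
    if (article.siteName) out.siteName = article.siteName;
    if (article.byline) out.byline = article.byline;
    if (article.lang) out.language = article.lang;
    if (article.publishedTime) out.publishedTime = article.publishedTime;
    if (article.dir) out.textDirection = article.dir;
  }

  return out;
}
```

With only "Include Text Content" and "Include Metadata" enabled, an article that has a title and language but no byline would yield an item with `textContent`, `length`, `title`, and `language`, and no `content` or `byline` keys.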
Dependencies
- Uses the Mozilla Readability library to parse and extract readable content.
- Uses jsdom to create a DOM environment from the input HTML string.
- Requires no external API keys or credentials.
- No special n8n environment variables needed.
Troubleshooting
- Empty HTML input error: If the provided HTML string is empty or only whitespace, the node throws an error stating "HTML input cannot be empty." Ensure that the input HTML is correctly fetched or passed.
- Extraction failure: If the Readability parser fails to extract content (returns null), the node throws "Could not extract readable content from the provided HTML." This can happen if the HTML is malformed or does not contain a recognizable article structure.
- Malformed HTML: Since the node relies on jsdom, severely broken HTML might cause parsing issues. Preprocessing or validating the HTML before passing it to the node may help.
- Missing expected metadata: Some pages may not have all metadata fields; these will simply be omitted from the output.
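The first two error cases can be pictured with a small sketch. This is only an approximation under stated assumptions, not the node's implementation: `ArticleParser` and `extractOrThrow` are hypothetical names, and the parser is passed in as a function so the example stays self-contained (in the node itself, that role is played by Readability running on a jsdom document). The error messages are the ones quoted above.

```typescript
// Hypothetical parser signature: returns null when no article can be extracted.
type ArticleParser = (html: string) => { textContent: string } | null;

function extractOrThrow(
  html: string,
  parse: ArticleParser,
): { textContent: string } {
  // Empty or whitespace-only input is rejected before any parsing happens.
  if (html.trim().length === 0) {
    throw new Error("HTML input cannot be empty.");
  }

  // A null result means the parser found no recognizable article structure.
  const article = parse(html);
  if (article === null) {
    throw new Error("Could not extract readable content from the provided HTML.");
  }

  return article;
}
```

Calling `extractOrThrow("   ", parser)` fails on the first check regardless of the parser, while a parser that returns null for a given page triggers the second error. Checking the upstream HTTP Request node's output first is usually the quickest way to tell the two cases apart.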