Overview
This node uses Mozilla's Readability library to extract the main readable content from a given HTML input. It processes raw HTML and returns structured information such as the article title, main content, excerpt, author byline, site name, text direction, content length, plain text content, and images found within the content.
Common scenarios where this node is beneficial include:
- Extracting clean article content from web pages for further processing or analysis.
- Summarizing web articles by retrieving excerpts and key metadata.
- Cleaning up cluttered HTML to get only the meaningful text and images.
- Preparing web content for display in simplified readers or mobile apps.
Practical example: You have an HTML snapshot of a news article and want to extract just the article text and images without ads or navigation elements. This node will parse the HTML and output the cleaned content and metadata.
Properties
| Name | Meaning |
|---|---|
| HTML | The raw HTML string to be parsed by the Readability library to extract readable content. |
Output
The node outputs JSON objects with the following fields:
- `title`: The extracted title of the article (string).
- `content`: The main HTML content of the article (string).
- `excerpt`: A short summary or excerpt of the article (string).
- `siteName`: The name of the website or source (string).
- `byline`: Author or byline information (string).
- `dir`: Text direction, e.g., "ltr" or "rtl" (string).
- `length`: Length of the content in characters (number).
- `textContent`: Plain text version of the content without HTML tags (string).
- `images`: An array of image tags (as strings) extracted from the content's HTML.
If parsing fails, the output JSON contains an error field describing the issue.
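For downstream code that consumes this output, the field list above can be written down as types. The following is a minimal sketch, not the node's own source: the names `ReadabilityResult`, `ReadabilityError`, `NodeOutput`, `isReadabilityError`, and `imageSources` are hypothetical, and it assumes the `error` field is a string.

```typescript
// Successful result, mirroring the documented output fields.
interface ReadabilityResult {
  title: string;
  content: string;
  excerpt: string;
  siteName: string;
  byline: string;
  dir: string;
  length: number;
  textContent: string;
  images: string[]; // image tags as strings, e.g. '<img src="a.png">'
}

// Failed result. Assumption: the error description is a plain string.
interface ReadabilityError {
  error: string;
}

type NodeOutput = ReadabilityResult | ReadabilityError;

// Type guard: true when the item carries an error field.
function isReadabilityError(item: NodeOutput): item is ReadabilityError {
  return "error" in item;
}

// Hypothetical helper: collect the src attribute of each image tag string.
// The regex is a heuristic for simple quoted attributes, not an HTML parser.
function imageSources(result: ReadabilityResult): string[] {
  const sources: string[] = [];
  for (const tag of result.images) {
    const src = /\ssrc\s*=\s*["']([^"']+)["']/i.exec(tag)?.[1];
    if (src !== undefined) {
      sources.push(src);
    }
  }
  return sources;
}
```

Checking `isReadabilityError` first lets later steps read `title`, `content`, and the other fields without guarding each one individually.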
Dependencies
- Requires the `jsdom` package to create a DOM environment from the input HTML.
- Uses the `@mozilla/readability` package to parse and extract readable content.
- No external API keys or credentials are needed.
- Runs entirely within n8n's environment.
Troubleshooting
- Error parsing HTML: This error occurs if the input HTML is invalid or cannot be parsed by the Readability library. Ensure the HTML string is well-formed and complete.
- If output fields such as `title` or `content` are empty or hold default values, the input HTML may not contain a recognizable article structure.
- Large or malformed HTML inputs might cause performance issues or parsing failures.
- Make sure the input HTML includes the full document structure (`<html>`, `<body>`) for best results.
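One way to act on the document-structure advice is to check the HTML string before it reaches the node, for example in a preceding Code node. The sketch below is an illustration under assumptions, not part of this node: `ensureDocument` is a hypothetical helper, and its regex checks are heuristics that a tag name inside a comment or script could fool.

```typescript
// Hypothetical pre-check: wrap an HTML fragment so it has <html> and <body>.
function ensureDocument(html: string): string {
  const trimmed = html.trim();
  if (trimmed === "") {
    throw new Error("HTML input is empty");
  }

  // Heuristic detection of opening tags; this is not a full HTML parse.
  const hasHtml = /<html[\s>]/i.test(trimmed);
  const hasBody = /<body[\s>]/i.test(trimmed);

  // An <html> element is already present: pass the input through untouched.
  if (hasHtml) {
    return trimmed;
  }

  // Otherwise add whichever wrappers are missing.
  const body = hasBody ? trimmed : `<body>${trimmed}</body>`;
  return `<!DOCTYPE html><html>${body}</html>`;
}
```

For instance, `ensureDocument("<p>Hi</p>")` returns the fragment wrapped in `<html>` and `<body>`, while a string that already contains an `<html>` tag is returned trimmed but otherwise unchanged.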