Article Extractor

Extract article content from web pages using Mozilla Readability

Overview

This node extracts the main article content from HTML input using Mozilla's Readability library. It is useful when you need clean, readable text, metadata, and optionally images from web pages or raw HTML. For example, it can process scraped web pages, RSS feed content, or any other HTML source to produce a simplified article view with title, text, and other metadata.

Properties

Name	Meaning
Operation	The action to perform; here only "Extract Article" is supported, which extracts article content from HTML.
HTML Input Field	The name of the input field containing the HTML string or URL response data to extract from.
URL	Optional URL of the page, used to resolve relative links in the HTML content.
Options	Collection of optional settings:
- Character Threshold	Minimum number of characters required for an article to be considered valid (default 500).
- Include Images	Whether to include images in the extracted article content (true/false).
- Keep Classes	Whether to preserve CSS class names in the extracted HTML content (true/false).
- Max Elements to Parse	Maximum number of HTML elements to parse (0 means no limit).
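
As a rough illustration of how these settings might be handed to Readability, the sketch below maps them onto the charThreshold, keepClasses, and maxElemsToParse options that the Readability constructor accepts. The NodeOptions field names and the toReadabilityOptions helper are hypothetical, and the false default for Keep Classes is an assumption taken from Readability's own default, not from this node.

```typescript
// Node-level options as a user might set them (field names are hypothetical).
interface NodeOptions {
  characterThreshold?: number;
  includeImages?: boolean;
  keepClasses?: boolean;
  maxElementsToParse?: number;
}

// Subset of the options object accepted by the Readability constructor.
interface ReadabilityOptions {
  charThreshold: number;
  keepClasses: boolean;
  maxElemsToParse: number;
}

// Translate node options into Readability options, filling in defaults.
function toReadabilityOptions(options: NodeOptions = {}): ReadabilityOptions {
  return {
    // Default of 500 characters, as documented in the table above.
    charThreshold: options.characterThreshold ?? 500,
    // Assumed default: Readability itself strips classes unless told otherwise.
    keepClasses: options.keepClasses ?? false,
    // 0 means "no limit".
    maxElemsToParse: options.maxElementsToParse ?? 0,
  };
}
```

Include Images has no direct Readability counterpart, so in this sketch it is left out of the mapping and would have to be applied to the extracted content afterwards.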

Output

The node outputs JSON objects with the following fields extracted from the article:

title: The article's title.
content: The HTML content of the article, cleaned and simplified.
textContent: The plain text content of the article.
length: Number of characters in the text content.
excerpt: A short excerpt or summary of the article.
byline: Author or byline information if available.
dir: Text direction (e.g., "ltr" or "rtl").
siteName: The site name if detected.
lang: Language code of the article.
publishedTime: Publication date/time if available.

If images are excluded via options, the content field will have all <img> tags removed.
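
One simple way to picture the image removal is a string-level filter over the content field. The sketch below is only an approximation: stripImages is a hypothetical helper, the node may well remove images through the DOM instead, and a regular expression like this one can be fooled by a ">" character inside an attribute value.

```typescript
// Remove every <img ...> tag from an HTML string (case-insensitive).
// Regex-based, so it is a sketch and not a full HTML parser.
function stripImages(html: string): string {
  return html.replace(/<img\b[^>]*>/gi, "");
}
```

For example, stripImages('<p>Hi <img src="a.png" alt="x">there</p>') returns '<p>Hi there</p>'.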

Each output item is paired with its corresponding input item index.

Dependencies

Uses the @mozilla/readability package for article extraction logic.
Uses jsdom to parse and manipulate HTML content as a DOM.
Requires HTML content as input, supplied either directly or as the response data from a URL request.
No external API keys or credentials are needed.
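
To show how the pieces could fit together, here is a minimal sketch of the extraction step with the parsing library hidden behind a function parameter, so the example runs without any packages installed. The Article shape follows the Output section; the nullable types, the ParseFn type, and the extractArticle helper are assumptions made for illustration and are not taken from the node's source.

```typescript
// Fields of the extracted article, as listed in the Output section.
// Which fields may be null is an assumption.
interface Article {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string | null;
  dir: string | null;
  siteName: string | null;
  lang: string | null;
  publishedTime: string | null;
}

// Hypothetical seam: turns HTML (plus an optional base URL) into an article,
// or returns null when no article could be identified.
type ParseFn = (html: string, url?: string) => Article | null;

function extractArticle(html: unknown, parse: ParseFn, url?: string): Article {
  // Reject missing or empty input before any parsing is attempted.
  if (typeof html !== "string" || html.trim() === "") {
    throw new Error("No HTML content found");
  }
  const article = parse(html, url);
  // A null result means the parser found nothing that looks like an article.
  if (article === null) {
    throw new Error("Could not extract article");
  }
  return article;
}
```

With the real libraries, the parse argument would roughly be (html, url) => new Readability(new JSDOM(html, { url }).window.document).parse(), where JSDOM comes from jsdom and Readability from @mozilla/readability. Passing the page URL to JSDOM is what lets relative links resolve.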

Troubleshooting

No HTML content found: If the specified HTML input field does not contain any HTML string, the node throws an error. Ensure the correct field name is provided and contains valid HTML.
Could not extract article: Raised when Readability cannot identify a meaningful article in the input. Try lowering the character threshold or supplying more complete HTML.
Malformed HTML input: Invalid or incomplete HTML may cause parsing issues. Validate or sanitize input HTML before processing.
To continue processing despite errors on some items, enable the node’s "Continue On Fail" option.
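
The effect of "Continue On Fail" can be sketched as a per-item try/catch. The item shapes below loosely follow the n8n convention of a json payload plus a pairedItem index; processItems and its parameters are hypothetical names, and the real node may report errors in a different shape.

```typescript
interface InputItem {
  json: Record<string, unknown>;
}

interface OutputItem {
  json: Record<string, unknown>;
  // Index of the input item this output was produced from.
  pairedItem: { item: number };
}

function processItems(
  items: InputItem[],
  extract: (item: InputItem) => Record<string, unknown>,
  continueOnFail: boolean,
): OutputItem[] {
  const results: OutputItem[] = [];
  for (let index = 0; index < items.length; index++) {
    try {
      results.push({ json: extract(items[index]), pairedItem: { item: index } });
    } catch (error) {
      // Without Continue On Fail, the first failing item aborts the whole run.
      if (!continueOnFail) throw error;
      // Otherwise record the error for this item and move on to the next one.
      const message = error instanceof Error ? error.message : String(error);
      results.push({ json: { error: message }, pairedItem: { item: index } });
    }
  }
  return results;
}
```

With the option enabled, a failing item yields an output carrying an error message at the same index, so downstream nodes can still tell which input it came from.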
