
Article Extractor

Extract article content from web pages using Mozilla Readability

Overview

This node extracts the main article content from HTML input using Mozilla's Readability library. Use it to pull readable text, metadata, and optionally images out of web pages or raw HTML. Typical inputs are scraped web pages, RSS feed content, or any other HTML source; the result is a simplified article view with title, text, and other metadata.

Properties

  • Operation: The action to perform; here only "Extract Article" is supported, which extracts article content from HTML.
  • HTML Input Field: The name of the input field containing the HTML string or URL response data to extract from.
  • URL: Optional URL of the page, used to resolve relative links in the HTML content.
  • Options: Collection of optional settings:
      - Character Threshold: Minimum number of characters required for an article to be considered valid (default 500).
      - Include Images: Whether to include images in the extracted article content (true/false).
      - Keep Classes: Whether to preserve CSS class names in the extracted HTML content (true/false).
      - Max Elements to Parse: Maximum number of HTML elements to parse (0 means no limit).
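
The options above correspond closely to settings that Readability itself accepts (charThreshold, keepClasses, maxElemsToParse). The following is a minimal sketch of how such a mapping might look. The NodeOptions property names and the toReadabilityOptions helper are hypothetical and are not taken from the node's source. The keepClasses fallback of false follows Readability's own default and is an assumption about the node.

```typescript
// Hypothetical shape of the node's Options collection; property names are assumed.
interface NodeOptions {
  characterThreshold?: number;
  includeImages?: boolean;
  keepClasses?: boolean;
  maxElementsToParse?: number;
}

// Subset of the options object accepted by the Readability constructor.
interface ReadabilityOptions {
  charThreshold: number;
  keepClasses: boolean;
  maxElemsToParse: number;
}

// Translate node options into Readability options, applying fallbacks.
function toReadabilityOptions(options: NodeOptions = {}): ReadabilityOptions {
  return {
    charThreshold: options.characterThreshold ?? 500, // default stated in this document
    keepClasses: options.keepClasses ?? false,        // assumed default
    maxElemsToParse: options.maxElementsToParse ?? 0, // 0 means no limit
  };
}

const readabilityOptions = toReadabilityOptions({
  characterThreshold: 200,
  keepClasses: true,
});
```

An object like this could be passed as the second argument to the Readability constructor. Include Images has no direct Readability counterpart, so it is presumably applied to the extracted content afterwards.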

Output

The node outputs JSON objects with the following fields extracted from the article:

  • title: The article's title.
  • content: The HTML content of the article, cleaned and simplified.
  • textContent: The plain text content of the article.
  • length: Number of characters in the text content.
  • excerpt: A short excerpt or summary of the article.
  • byline: Author or byline information if available.
  • dir: Text direction (e.g., "ltr" or "rtl").
  • siteName: The site name if detected.
  • lang: Language code of the article.
  • publishedTime: Publication date/time if available.

If images are excluded via options, the content field will have all <img> tags removed.

Each output item is paired with the index of the input item it was produced from.
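
To make the output shape concrete, here is a minimal sketch of one output item as a TypeScript type, together with a simplified image-removal step. The nullability of each field, the stripImages and toOutputItem helpers, and the sample values are assumptions for illustration. The node itself most likely removes images through the DOM rather than with a regular expression.

```typescript
// Field names follow the list above; allowing null for each is an assumption.
interface ExtractedArticle {
  title: string | null;
  content: string | null;
  textContent: string | null;
  length: number | null;
  excerpt: string | null;
  byline: string | null;
  dir: string | null;
  siteName: string | null;
  lang: string | null;
  publishedTime: string | null;
}

// Simplified stand-in for image removal: drops every <img ...> tag.
// A regex keeps the sketch dependency-free; it would mishandle
// attribute values that contain ">".
function stripImages(html: string): string {
  return html.replace(/<img\b[^>]*>/gi, "");
}

// Wrap an article as an output item linked to input item `index`.
function toOutputItem(article: ExtractedArticle, index: number) {
  return { json: article, pairedItem: { item: index } };
}

// Made-up sample values, for illustration only.
const sample: ExtractedArticle = {
  title: "Hello",
  content: '<p>Hello <img src="a.png" alt="">world</p>',
  textContent: "Hello world",
  length: 11,
  excerpt: "Hello world",
  byline: null,
  dir: "ltr",
  siteName: null,
  lang: "en",
  publishedTime: null,
};

const outputItem = toOutputItem(
  { ...sample, content: stripImages(sample.content ?? "") },
  0,
);
```

The pairedItem property mirrors the way n8n links an output item to its source input item.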

Dependencies

  • Uses the @mozilla/readability package for article extraction logic.
  • Uses jsdom to parse and manipulate HTML content as a DOM.
  • Requires input HTML content either directly or as a URL response.
  • No external API keys or credentials are needed.

Troubleshooting

  • No HTML content found: If the specified HTML input field does not contain any HTML string, the node throws an error. Ensure the correct field name is provided and contains valid HTML.
  • Could not extract article: If Readability fails to parse a meaningful article, this error occurs. Try adjusting options like lowering the character threshold or providing a more complete HTML input.
  • Malformed HTML input: Invalid or incomplete HTML may cause parsing issues. Validate or sanitize input HTML before processing.
  • To continue processing despite errors on some items, enable the node’s "Continue On Fail" option.
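
The error cases above, and the effect of "Continue On Fail", can be sketched as a per-item loop. This is an illustration, not the node's actual code: the Item and Extractor types, processItems, and demoExtract are hypothetical names, and the extractor is a stand-in that does not call Readability.

```typescript
// Shapes loosely modeled on n8n items; simplified for illustration.
type Item = { json: Record<string, unknown>; pairedItem?: { item: number } };

// Hypothetical extractor signature: returns null when no article is found.
type Extractor = (html: string) => Record<string, unknown> | null;

function processItems(
  items: Item[],
  htmlField: string,
  extract: Extractor,
  continueOnFail: boolean,
): Item[] {
  const out: Item[] = [];
  items.forEach((item, index) => {
    try {
      const html = item.json[htmlField];
      if (typeof html !== "string" || html.trim() === "") {
        throw new Error("No HTML content found");
      }
      const article = extract(html);
      if (article === null) {
        throw new Error("Could not extract article");
      }
      out.push({ json: article, pairedItem: { item: index } });
    } catch (err) {
      // Without Continue On Fail, the first failing item aborts the run.
      if (!continueOnFail) throw err;
      const message = err instanceof Error ? err.message : String(err);
      out.push({ json: { error: message }, pairedItem: { item: index } });
    }
  });
  return out;
}

// Stand-in extractor: treats inputs shorter than 20 characters as "no article".
const demoExtract: Extractor = (html) =>
  html.length < 20 ? null : { title: "Demo", length: html.length };

const demoItems: Item[] = [
  { json: { html: "<p>too short</p>" } },
  { json: {} },
  { json: { html: "<article>" + "x".repeat(40) + "</article>" } },
];

const results = processItems(demoItems, "html", demoExtract, true);
```

With continueOnFail set to true, the two failing items become error items and the third is still processed. With false, the first failure throws and stops the run.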
