Readability Reader

Basic Mozilla Readability Reader

Overview

This node uses Mozilla's Readability library to extract the main readable content from a given HTML input. It processes raw HTML and returns structured information such as the article title, main content, excerpt, author byline, site name, text direction, content length, plain text content, and images found within the content.

Common scenarios where this node is beneficial include:

  • Extracting clean article content from web pages for further processing or analysis.
  • Summarizing web articles by retrieving excerpts and key metadata.
  • Cleaning up cluttered HTML to get only the meaningful text and images.
  • Preparing web content for display in simplified readers or mobile apps.

Practical example: You have an HTML snapshot of a news article and want to extract just the article text and images without ads or navigation elements. This node will parse the HTML and output the cleaned content and metadata.

Properties

Name    Meaning
HTML    The raw HTML string that the Readability library parses to extract readable content.

Output

The node outputs JSON objects with the following fields:

  • title: The extracted title of the article (string).
  • content: The main HTML content of the article (string).
  • excerpt: A short summary or excerpt of the article (string).
  • siteName: The name of the website or source (string).
  • byline: Author or byline information (string).
  • dir: Text direction, e.g., "ltr" or "rtl" (string).
  • length: Length of the content in characters (number).
  • textContent: Plain text version of the content without HTML tags (string).
  • images: An array of image tags (as strings) extracted from the content's HTML.
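To illustrate what the images field holds, here is a minimal sketch that collects image tags from an HTML string. This section does not document how the node performs the extraction, so the regular-expression scan and the helper name extractImageTags are assumptions made for illustration only.

```typescript
// Hypothetical helper (not part of the node's documented API):
// collect every <img ...> tag found in an HTML string, as raw strings.
// A regular expression is a simplification; it would cut a tag short
// if an attribute value contained a literal ">" character.
function extractImageTags(contentHtml: string): string[] {
  return contentHtml.match(/<img\b[^>]*>/gi) ?? [];
}

const sampleContent =
  '<div><p>Intro</p><img src="a.png" alt="A"><p>More</p><img src="b.jpg"/></div>';

// Two tags are found, in document order.
const imageTags = extractImageTags(sampleContent);
```

Each array entry is the full tag text, so a downstream node would still need to read the src attribute out of the string if it wants the image URL.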

If parsing fails, the output JSON contains an error field describing the issue.
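The success and failure shapes described above can be modeled as a small union type, which lets downstream code check for the error field before reading anything else. This is a sketch under stated assumptions: the error value is assumed to be a string, the "Error parsing HTML" prefix is borrowed from the Troubleshooting section, and the names ReaderOutput, safeExtract, and isFailure are hypothetical.

```typescript
// Success shape: the fields listed in the Output section.
interface ReaderSuccess {
  title: string;
  content: string;
  excerpt: string;
  siteName: string;
  byline: string;
  dir: string;
  length: number;
  textContent: string;
  images: string[];
}

// Failure shape: assumed to carry only a string `error` field.
interface ReaderFailure {
  error: string;
}

type ReaderOutput = ReaderSuccess | ReaderFailure;

// Run a parsing callback and convert any thrown exception into an error object.
// `extract` is a hypothetical stand-in for the node's real parsing step
// (in practice, something that builds a jsdom document and calls Readability).
function safeExtract(extract: () => ReaderSuccess): ReaderOutput {
  try {
    return extract();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { error: `Error parsing HTML: ${reason}` };
  }
}

// Type guard so callers can branch on failure before reading article fields.
function isFailure(output: ReaderOutput): output is ReaderFailure {
  return "error" in output;
}

// Simulate a parser that throws, then summarize the outcome.
const extraction = safeExtract(() => {
  throw new Error("unexpected end of input");
});
const outcome = isFailure(extraction)
  ? `failed: ${extraction.error}`
  : `ok: ${extraction.title}`;
```

Branching on the presence of error first keeps later steps from reading title or content on an item that never had them.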

Dependencies

  • Requires the jsdom package to create a DOM environment from the input HTML.
  • Uses the @mozilla/readability package to parse and extract readable content.
  • No external API keys or credentials are needed.
  • Runs entirely within n8n's environment.

Troubleshooting

  • Error parsing HTML: This error occurs if the input HTML is invalid or cannot be parsed by the Readability library. Ensure the HTML string is well-formed and complete.
  • If output fields such as title or content are empty or hold default values, the input HTML may not contain a recognizable article structure.
  • Large or malformed HTML inputs might cause performance issues or parsing failures.
  • Make sure the input HTML includes the full document structure (<html>, <body>) for best results.
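Two of the checks above can be scripted before and after the node runs, for example in a preceding or following code step. The sketch below is illustrative only: ensureFullDocument and hasReadableContent are hypothetical helper names, and the tag detection is a simple pattern match rather than a real HTML parse.

```typescript
// Hypothetical pre-check: wrap an HTML fragment in a minimal document skeleton.
// Input that already contains an <html> tag is returned unchanged.
function ensureFullDocument(html: string): string {
  if (/<html[\s>]/i.test(html)) {
    return html;
  }
  // Keep an existing <body> element; otherwise wrap the fragment in one.
  const body = /<body[\s>]/i.test(html) ? html : `<body>${html}</body>`;
  return `<!DOCTYPE html><html><head></head>${body}</html>`;
}

// Hypothetical post-check: true when title or content holds non-whitespace text.
function hasReadableContent(output: { title?: string; content?: string }): boolean {
  return Boolean(output.title?.trim()) || Boolean(output.content?.trim());
}

// A bare fragment gains the <html> and <body> wrappers.
const wrappedFragment = ensureFullDocument("<p>Hello</p>");

// An item with a blank title and content is flagged as not readable.
const blankResult = hasReadableContent({ title: "", content: "   " });
```

Wrapping a fragment only supplies the document skeleton. It cannot add the metadata or article markup that the extraction relies on, so empty fields may still appear for pages without a clear article body.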
