HTML Cleaner

Cleans HTML content

Overview

The HTML Cleaner node processes and cleans HTML content by removing unwanted elements, attributes, comments, scripts, styles, and empty tags. It also supports extracting the main readable content from the HTML using Mozilla's Readability library, which is useful for isolating article text or main page content. Additionally, it can convert the cleaned HTML output into Markdown format.

This node is beneficial in scenarios where you need to sanitize or simplify HTML input before further processing, such as:

Extracting clean article content from web pages.
Removing potentially harmful or unnecessary HTML elements and attributes.
Preparing HTML content for display in environments that require simplified markup.
Converting HTML content to Markdown for easier editing or storage.

Practical examples:

Cleaning up scraped web page HTML to extract only the main article text.
Removing all inline event handlers and scripts from user-submitted HTML for security.
Converting blog post HTML content into Markdown for use in static site generators.

Properties

Name	Meaning
HTML Content	The raw HTML content string to be cleaned.
Clean Options	Collection of options controlling what to remove or exclude during cleaning: - Excluded Attributes: Comma-separated list of HTML attributes to remove. - Excluded Selectors: CSS selectors whose matching elements will be removed. - Excluded Tags: HTML tags to remove. - Remove Attributes: Remove all attributes from tags. - Remove Comments: Remove HTML comments. - Remove Empty Tags: Remove tags with no content. - Remove Scripts: Remove `<script>` tags and their content. - Remove Styles: Remove `<style>` tags and their content.
Readability Options	Options for parsing the HTML with the Readability library: - Char Threshold: Minimum number of characters an article must contain to be returned as a result. - Classes To Preserve: CSS classes to keep on elements when Keep Classes is disabled. - Disable JSON-LD: Skip JSON-LD metadata extraction. - Keep Classes: Preserve all CSS classes on elements. - NB Top Candidates: Number of top candidate elements to consider when analyzing content.
Markdown Output	Boolean flag indicating whether to convert the cleaned HTML content to Markdown format using the `turndown` library.
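
As a rough illustration of how the properties above fit together, the following TypeScript sketch models them as a typed object. The `cleanOptions` key names, the wrapper type `HtmlCleanerParams`, and all sample values are assumptions made for this example; they mirror the display names in the table and are not confirmed internal parameter names of the node. The `readabilityOptions` keys follow the option names used by the Readability library.

```typescript
// Hypothetical parameter shape for illustration only.
// Key names under cleanOptions are assumed from the display names in the table.
interface CleanOptions {
  excludedAttributes?: string; // comma-separated, e.g. "onclick,style"
  excludedSelectors?: string;  // comma-separated CSS selectors
  excludedTags?: string;       // comma-separated tag names
  removeAttributes?: boolean;
  removeComments?: boolean;
  removeEmptyTags?: boolean;
  removeScripts?: boolean;
  removeStyles?: boolean;
}

// These keys follow the Readability library's option names.
interface ReadabilityOptions {
  charThreshold?: number;
  classesToPreserve?: string[];
  disableJSONLD?: boolean;
  keepClasses?: boolean;
  nbTopCandidates?: number;
}

// Hypothetical wrapper grouping the four properties from the table.
interface HtmlCleanerParams {
  html: string;
  cleanOptions: CleanOptions;
  readabilityOptions: ReadabilityOptions;
  markdownOutput: boolean;
}

// Sample configuration: strip scripts, styles, comments, and inline handlers,
// then request Markdown output. All values are illustrative.
const exampleParams: HtmlCleanerParams = {
  html: '<article><h1>Title</h1><p onclick="track()">Body text.</p></article>',
  cleanOptions: {
    excludedAttributes: "onclick,style",
    excludedSelectors: ".ad,#sidebar",
    excludedTags: "iframe,form",
    removeComments: true,
    removeScripts: true,
    removeStyles: true,
  },
  readabilityOptions: { charThreshold: 200, keepClasses: false },
  markdownOutput: true,
};
```

Options left unset in this sketch (for example `removeEmptyTags`) would fall back to whatever defaults the node defines.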

Output

The node outputs a JSON object per input item containing the following fields:

html: The cleaned HTML content after applying all cleaning options.
title: The title extracted by the Readability parser (if available).
lang: The language code detected in the HTML document.
content: The main content HTML extracted by Readability.
textContent: The plain text content extracted by Readability.
length: The length (number of characters) of the extracted content.
excerpt: A short excerpt or summary of the content.
markdown (optional): The Markdown representation of the cleaned content if Markdown output is enabled.

If an error occurs during processing and "Continue On Fail" is enabled, the output JSON will contain an error field with the error message.
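
Downstream code that consumes these items may want to tell a successful result from a "Continue On Fail" error item. The sketch below shows one way to do that in TypeScript. The type names and the helpers `isFailure` and `bestText` are hypothetical, and marking every field except `html` as optional is an assumption made here to stay safe when Readability finds nothing.

```typescript
// Field names follow the output list above; optionality is an assumption.
interface HtmlCleanerSuccess {
  html: string;
  title?: string;
  lang?: string;
  content?: string;
  textContent?: string;
  length?: number;
  excerpt?: string;
  markdown?: string; // present only when Markdown output is enabled
}

// Shape of an item produced when processing failed with "Continue On Fail".
interface HtmlCleanerFailure {
  error: string;
}

type HtmlCleanerItem = HtmlCleanerSuccess | HtmlCleanerFailure;

// Type guard: an item is a failure when it carries an `error` field.
function isFailure(item: HtmlCleanerItem): item is HtmlCleanerFailure {
  return "error" in item;
}

// Pick the most convenient text form: Markdown first, then plain text,
// then the cleaned HTML. Returns an empty string for failed items.
function bestText(item: HtmlCleanerItem): string {
  if (isFailure(item)) {
    return "";
  }
  return item.markdown ?? item.textContent ?? item.html;
}
```

The fallback order in `bestText` is a design choice for this example; a workflow that needs markup would read `content` or `html` directly.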

The node does not output binary data.

Dependencies

Uses the cheerio library for HTML parsing and manipulation.
Uses jsdom and @mozilla/readability for extracting readable content from HTML.
Uses turndown for converting HTML to Markdown if enabled.

No external API keys or services are required. All processing is done locally within the node.

Troubleshooting

Missing or empty HTML content: The node requires non-empty HTML content. If the input is missing or empty, the node throws an error; with "Continue On Fail" enabled, the error message is written to the item's error field instead.
Invalid HTML input: Malformed HTML may cause parsing errors or unexpected results.
Readability extraction returns null: If the content is shorter than the Char Threshold, Readability may return no result. For short pages, try lowering Char Threshold.
Incorrect removal of elements: Ensure that excluded attributes, selectors, and tags are correctly specified as comma-separated strings without extra spaces.
Markdown output issues: If the Markdown output looks incorrect, verify that the HTML content is well-formed and that the turndown conversion is appropriate for your HTML structure.
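
Two of the issues above (empty input and stray spaces in comma-separated lists) can be caught before the data reaches the node. The following TypeScript sketch shows one possible pre-check; `normalizeList` and `hasHtml` are hypothetical helper names and are not part of the node.

```typescript
// Normalize a comma-separated option string: trim each entry and drop
// empty ones. Only surrounding whitespace is removed, so a selector such
// as "div .ad" keeps its internal space.
// Caveat: selectors that contain commas themselves, such as ":is(a, b)",
// would be split incorrectly by this simple approach.
function normalizeList(raw: string): string {
  return raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(",");
}

// Return true only when the value is a string with non-whitespace content.
function hasHtml(raw: string | null | undefined): boolean {
  return typeof raw === "string" && raw.trim().length > 0;
}
```

For example, `normalizeList(" onclick , style ,, ")` yields `"onclick,style"`, and `hasHtml("   ")` yields `false`.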

Links and References

Cheerio GitHub – jQuery-like HTML parser.
Mozilla Readability – Library for extracting main content from web pages.
jsdom GitHub – JavaScript implementation of DOM and HTML standards.
Turndown GitHub – HTML to Markdown converter.
