ScrapeNinja

Consume ScrapeNinja Web Scraping API - See full documentation at https://scrapeninja.net/docs/

Overview

The Clean up HTML operation of the ScrapeNinja node processes and sanitizes raw HTML content. It removes unnecessary or redundant elements, compresses the HTML, and optionally trims text nodes and URLs. This is particularly useful when you need to prepare HTML for further processing, such as passing it to Large Language Models (LLMs), storing it in a database, or extracting clean text for analysis.

Common scenarios:

  • Preprocessing web-scraped HTML before feeding it into AI models.
  • Reducing the size of HTML documents for storage or transmission.
  • Removing extraneous markup to focus on main content.
  • Ensuring only the body content is retained for downstream tasks.

Practical example:
You scrape a web page and want to remove scripts, styles, and limit the length of text nodes before sending the cleaned HTML to an LLM for summarization.
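
The scenario above can be approximated locally to get a feel for what the cleanup does. The following is a minimal sketch built on Python's standard `html.parser`, not the node's actual implementation: the class name `CleanupSketch`, the function `clean_html_sketch`, and the choice of dropped tags are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Tags whose entire content is dropped. This set is an assumption for
# illustration; the node's real removal rules may be broader.
DROPPED_TAGS = {"script", "style"}


class CleanupSketch(HTMLParser):
    """Rebuilds HTML while skipping dropped tags and trimming text nodes."""

    def __init__(self, max_text_node_length=0):
        super().__init__()
        self.max_text_node_length = max_text_node_length
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Re-emit the start tag exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.skip_depth == 0 and tag not in DROPPED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return  # text inside <script>/<style> is discarded
        limit = self.max_text_node_length
        if limit > 0 and len(data) > limit:
            data = data[:limit]  # 0 means no trimming
        self.parts.append(escape(data, quote=False))


def clean_html_sketch(html, max_text_node_length=0):
    parser = CleanupSketch(max_text_node_length)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


raw = "<div><script>alert(1)</script><style>p{}</style><p>Hello, wonderful world</p></div>"
print(clean_html_sketch(raw, max_text_node_length=5))
```

The sketch ignores comments, doctype declarations, and whitespace compression, all of which a real cleanup step would likely handle.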


Properties

| Display Name | Type | Description |
| --- | --- | --- |
| HTML | String | The HTML content to clean up. This is the main input and is required. |
| XML Mode | Boolean | Whether to use XML parser mode instead of standard HTML parsing. Useful for XHTML/XML docs. |
| Max Text Node Length | Number | Maximum allowed length for text nodes. 0 means no trimming. |
| URL Max Length | Number | Maximum allowed length for URL attributes. 0 means no trimming. |
| Only Body Content | Boolean | If enabled, only the contents inside the <body> tag are kept. |
| Max Output Length | Number | Maximum length of the output HTML. 0 means no limit. |
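
The three numeric properties share the same convention: 0 disables the limit. A small sketch of that convention follows. The helper `apply_limit` and the option keys are hypothetical names chosen for this example, and plain character truncation is an assumption, since the node may cut or mark trimmed values differently.

```python
def apply_limit(value: str, limit: int) -> str:
    """Truncate value to at most `limit` characters; 0 disables trimming."""
    if limit <= 0:
        return value
    return value[:limit]


# Hypothetical option keys mirroring the properties listed above;
# the node's internal parameter names may differ.
options = {
    "max_text_node_length": 0,  # keep text nodes untouched
    "url_max_length": 20,       # shorten long URL attributes
    "max_output_length": 0,     # no cap on the final HTML
}

url = "https://example.com/products?utm_source=newsletter&utm_campaign=spring"
print(apply_limit(url, options["url_max_length"]))
print(apply_limit(url, options["max_output_length"]) == url)
```

Starting with all limits at 0 and tightening one at a time makes it easier to see which setting is responsible for any lost content.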

Output

The node outputs a single field:

{
  "html": "<cleaned HTML string>"
}
  • html: Contains the cleaned and possibly trimmed HTML string according to the specified options.

No binary data is produced by this operation.


Dependencies

  • No external API keys or credentials are required for the Clean up HTML operation.
  • No special n8n configuration is needed for this operation.

Troubleshooting

Common issues:

  • Malformed HTML input: If the provided HTML is not well-formed, the cleaning process may not work as expected or could result in loss of content.
  • Excessive trimming: Setting very low values for "Max Text Node Length", "URL Max Length", or "Max Output Length" may remove important content.
  • Empty output: Enabling "Only Body Content" when the input HTML lacks a <body> tag will result in empty output.

Error messages:

  • "error": "Some error message": When "Continue On Fail" is enabled, errors are returned in the error field instead of stopping the workflow. Check the details field for more information.
  • "details": "No additional details available": No further error context was provided.
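
Downstream code that consumes the node's output can check for these fields before using the HTML. The sketch below assumes a failed item is a flat object carrying `error` and optionally `details`, while a successful one carries `html`. The function name `extract_html` is hypothetical, and the exact item shape on failure is an assumption.

```python
def extract_html(item: dict) -> str:
    """Return the cleaned HTML, or raise if the item reports an error."""
    if "error" in item:
        # Fall back to the documented placeholder when no details are present.
        details = item.get("details", "No additional details available")
        raise RuntimeError(f"{item['error']} ({details})")
    return item["html"]


ok_item = {"html": "<p>Hello</p>"}
failed_item = {"error": "Some error message"}

print(extract_html(ok_item))
try:
    extract_html(failed_item)
except RuntimeError as exc:
    print(f"cleanup failed: {exc}")
```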

How to resolve:

  • Double-check your input HTML for correctness.
  • Adjust trimming parameters to avoid removing too much content.
  • Ensure the input contains a <body> tag if using "Only Body Content".
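
The last check can be automated before enabling "Only Body Content". This is a minimal sketch using Python's standard `html.parser`; the names `BodyDetector` and `has_body_tag` are hypothetical and not part of the node.

```python
from html.parser import HTMLParser


class BodyDetector(HTMLParser):
    """Records whether a <body> start tag appears in the input."""

    def __init__(self):
        super().__init__()
        self.has_body = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <BODY> is matched as well.
        if tag == "body":
            self.has_body = True


def has_body_tag(html: str) -> bool:
    detector = BodyDetector()
    detector.feed(html)
    detector.close()
    return detector.has_body


print(has_body_tag("<html><body><p>Hi</p></body></html>"))
print(has_body_tag("<div>fragment without a body</div>"))
```

For HTML fragments that fail this check, leaving "Only Body Content" disabled avoids the empty-output case described above.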
