Overview
This node extracts structured information from a raw HTML string. It parses the provided HTML content and retrieves various metadata, links, images, videos, iframes, headings, and textual content statistics. This is useful for scenarios such as web scraping, SEO analysis, content summarization, or extracting key elements from HTML documents without needing to write custom parsing code.
For example, you can input the HTML source of a webpage and get back its title, meta description, Open Graph tags, all image URLs, heading structure, and plain text content with word and character counts.
Properties
| Name | Meaning |
|---|---|
| Raw HTML | The raw HTML string to extract structure from. Example: `<html><head></head><body>hello world</body></html>` |
Output
The output JSON contains a single `data` object with the following fields extracted from the HTML:
- `title`: The content of the `<title>` tag.
- `description`: The content of the meta tag with name "description".
- `keywords`: The content of the meta tag with name "keywords".
- `canonical`: The href attribute of the canonical link tag.
- Open Graph (og) properties: `ogTitle`, `ogDescription`, `ogImage`, `ogUrl`, `ogSiteName`, `ogType`, `ogLocale`
- Article-specific Open Graph properties: `ogPublishedTime`, `ogModifiedTime`, `ogAuthor`, `ogSection`, `ogTag`
- `meta`: Array of content attributes from all meta tags.
- `links`: Array of href attributes from all link tags.
- `images`: Array of src attributes from all img tags.
- `videos`: Array of src attributes from all video tags.
- `iframes`: Array of src attributes from all iframe tags.
- `headings`: Array of objects representing headings (`h1` to `h6`) with:
  - `level`: heading tag name in lowercase (e.g., "h1")
  - `text`: trimmed text content of the heading
- `text`: The plain text content of the HTML after removing style and script tags.
- `wordCount`: Number of words in the plain text.
- `characterCount`: Number of characters in the plain text.
- `mainContent`: The inner HTML of the main container element found (`<main>`, `<article>`, or `<body>`), cleaned of whitespace characters.
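As an illustration of consuming this output downstream, the sketch below turns the `headings` array into an indented outline, which can help when reviewing heading structure for SEO. The `HeadingEntry` type and the `buildOutline` helper are hypothetical names introduced here, not part of the node, and the sample headings are made up.

```typescript
// Shape of one entry in the `headings` array, as described in the field list.
interface HeadingEntry {
  level: string; // "h1" through "h6"
  text: string;
}

// Hypothetical helper: indents each heading by two spaces per level below h1.
function buildOutline(headings: HeadingEntry[]): string {
  return headings
    .map((h) => {
      const depth = Number(h.level.slice(1)); // "h2" -> 2
      let indent = "";
      // If `level` is not in the expected form, depth is NaN and no indent is added.
      for (let i = 1; i < depth; i++) {
        indent += "  ";
      }
      return indent + h.text;
    })
    .join("\n");
}

// Made-up sample data in the documented shape.
const sampleHeadings: HeadingEntry[] = [
  { level: "h1", text: "Guide" },
  { level: "h2", text: "Setup" },
  { level: "h3", text: "Install" },
  { level: "h2", text: "Usage" },
];

const outline = buildOutline(sampleHeadings);
console.log(outline);
```

An outline like this makes skipped levels (for example an `h3` directly under an `h1`) easy to spot.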
If the node encounters an error while processing an item and is set to continue on failure, it outputs the error details alongside the original input data.
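The continue-on-failure behavior can be pictured with a minimal sketch. This is not the node's source code: `Item`, `processItems`, the `html` input field, and the `error` output key are assumptions made for illustration, and `extract` stands in for the real parsing logic.

```typescript
// Hypothetical item shape: n8n items carry their data under a `json` key.
interface Item {
  json: Record<string, unknown>;
}

// Hypothetical driver loop; `extract` is a placeholder for the node's parser.
function processItems(
  items: Item[],
  extract: (html: string) => Record<string, unknown>,
  continueOnFail: boolean,
): Item[] {
  const results: Item[] = [];
  for (const item of items) {
    try {
      const html = item.json.html; // assumed input field name
      if (typeof html !== "string") {
        throw new Error("Raw HTML must be a string");
      }
      results.push({ json: { data: extract(html) } });
    } catch (error) {
      // Without continue-on-fail, the first error stops processing.
      if (!continueOnFail) throw error;
      const message = error instanceof Error ? error.message : String(error);
      // Keep the original input and attach the error details.
      results.push({ json: { ...item.json, error: message } });
    }
  }
  return results;
}

const processed = processItems(
  [{ json: { html: "<p>ok</p>" } }, { json: { html: 42 } }],
  (html) => ({ characterCount: html.length }),
  true,
);
console.log(JSON.stringify(processed));
```

In this sketch the second item fails validation, so its result contains the original `html` value plus an `error` message, while the first item is processed normally.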
Dependencies
- Uses the `cheerio` library for HTML parsing and querying.
- No external API calls or credentials are required.
- Runs entirely within the n8n environment.
Troubleshooting
- Empty or invalid HTML input: If the provided HTML string is empty or malformed, the node may return empty fields or fail to extract meaningful data. Ensure valid HTML is passed.
- Missing expected tags: If certain tags like `<main>`, `<article>`, or `<body>` are missing, the node falls back gracefully but some fields like `mainContent` might be empty.
- Continue on Fail behavior: When enabled, errors during processing individual items do not stop the workflow but output error info instead.
- Common errors: Parsing errors due to invalid HTML or unexpected input types. Validate input before passing to the node.
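One way to validate input before it reaches the node is a small pre-check, for example in a preceding Code node. The sketch below is a heuristic under stated assumptions: `checkRawHtml` is a hypothetical helper, and the node itself may accept inputs that this check rejects.

```typescript
// Hypothetical pre-check; it only looks for obviously unusable input.
function checkRawHtml(input: unknown): { ok: boolean; reason?: string } {
  if (typeof input !== "string") {
    return { ok: false, reason: "expected a string, got " + typeof input };
  }
  if (input.trim() === "") {
    return { ok: false, reason: "HTML string is empty" };
  }
  // Heuristic: require at least one thing that looks like an opening tag.
  if (!/<[a-zA-Z][^>]*>/.test(input)) {
    return { ok: false, reason: "no HTML tags found" };
  }
  return { ok: true };
}

console.log(checkRawHtml("<html><head></head><body>hello world</body></html>"));
console.log(checkRawHtml("hello world"));
console.log(checkRawHtml(42));
```

Items that fail the check can be filtered out or routed to an error branch instead of being passed to the node.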
Links and References
- Cheerio GitHub Repository – Used for HTML parsing and manipulation.
- HTML Meta Tags Reference
- Open Graph Protocol – Explanation of Open Graph meta tags.