Extract from HTML

Extract Structure from HTML

Overview

This node extracts structured information from a raw HTML string. It parses the provided HTML content and retrieves various metadata, links, images, videos, iframes, headings, and textual content statistics. This is useful for scenarios such as web scraping, SEO analysis, content summarization, or extracting key elements from HTML documents without needing to write custom parsing code.

For example, you can input the HTML source of a webpage and get back its title, meta description, Open Graph tags, all image URLs, heading structure, and plain text content with word and character counts.

Properties

Name	Meaning
Raw HTML	The raw HTML string to extract structure from. Example: `<html><head></head><body>hello world</body></html>`

Output

The output JSON contains a single data object with the following fields extracted from the HTML:

title: The content of the <title> tag.
description: The content of the meta tag with name "description".
keywords: The content of the meta tag with name "keywords".
canonical: The href attribute of the canonical link tag.
Open Graph (og) properties:
- ogTitle, ogDescription, ogImage, ogUrl, ogSiteName, ogType, ogLocale
Article-specific Open Graph properties:
- ogPublishedTime, ogModifiedTime, ogAuthor, ogSection, ogTag
meta: Array of content attributes from all meta tags.
links: Array of href attributes from all link tags.
images: Array of src attributes from all img tags.
videos: Array of src attributes from all video tags.
iframes: Array of src attributes from all iframe tags.
headings: Array of objects representing headings (h1 to h6) with:
- level: heading tag name in lowercase (e.g., "h1")
- text: trimmed text content of the heading
text: The plain text content of the HTML after removing style and script tags.
wordCount: Number of words in the plain text.
characterCount: Number of characters in the plain text.
mainContent: The inner HTML of the main container element found (<main>, <article>, or <body>), cleaned of whitespace characters.
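
As a rough illustration of how a downstream step might consume this output, the TypeScript sketch below types a small subset of the fields listed above and turns the headings array into an indented outline. The names Heading, ExtractedSubset, and headingOutline are hypothetical and not part of the node, the field types are assumptions inferred from the descriptions above, and the sample data is invented.

```typescript
// Hypothetical types: names and field types are assumptions based on the field list above.
interface Heading {
  level: string; // lowercase tag name, e.g. "h1"
  text: string; // trimmed heading text
}

interface ExtractedSubset {
  title: string;
  headings: Heading[];
  wordCount: number;
}

// Render headings as an outline, indented two spaces per level below h1.
function headingOutline(headings: Heading[]): string {
  return headings
    .map((h) => {
      const depth = parseInt(h.level.slice(1), 10); // "h2" -> 2
      const indent = isNaN(depth) ? "" : "  ".repeat(Math.max(0, depth - 1));
      return indent + h.text;
    })
    .join("\n");
}

// Sample data invented for illustration only.
const sample: ExtractedSubset = {
  title: "Example page",
  headings: [
    { level: "h1", text: "Guide" },
    { level: "h2", text: "Setup" },
    { level: "h3", text: "Install" },
    { level: "h2", text: "Usage" },
  ],
  wordCount: 120,
};

const outline = headingOutline(sample.headings);
console.log(outline);
```

An outline like this can be handy for SEO checks, for example to spot a page that skips heading levels.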

If the node encounters an error while processing an item and is set to continue on failure, it outputs the error details alongside the original input data.
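
A downstream step may want to separate failed items from successful ones. The sketch below shows one way this could be done in TypeScript. It assumes that a failed item carries a string field named error inside its json payload; that field name, the NodeItem type, and the partitionByError helper are assumptions for illustration, so check the actual output of the node before relying on them.

```typescript
// Minimal item shape for this sketch; the node's real items may carry more fields.
type NodeItem = { json: Record<string, unknown> };

// Assumption: failed items expose a string `error` field. The real field name may differ.
function partitionByError(items: NodeItem[]): { succeeded: NodeItem[]; failed: NodeItem[] } {
  const succeeded: NodeItem[] = [];
  const failed: NodeItem[] = [];
  for (const item of items) {
    if (typeof item.json.error === "string") {
      failed.push(item);
    } else {
      succeeded.push(item);
    }
  }
  return { succeeded, failed };
}

// Sample items invented for illustration only.
const sampleItems: NodeItem[] = [
  { json: { title: "Example page", wordCount: 2 } },
  { json: { error: "Something went wrong", html: 42 } },
];

const parts = partitionByError(sampleItems);
console.log(parts.succeeded.length, parts.failed.length);
```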

Dependencies

Uses the cheerio library for HTML parsing and querying.
No external API calls or credentials are required.
Runs entirely within the n8n environment.

Troubleshooting

Empty or invalid HTML input: If the provided HTML string is empty or malformed, the node may return empty fields or fail to extract meaningful data. Ensure valid HTML is passed.
Missing expected tags: If none of <main>, <article>, or <body> is present, the node falls back gracefully, but fields such as mainContent may be empty.
Continue on Fail behavior: When enabled, an error while processing an individual item does not stop the workflow; the node outputs the error information for that item instead.
Common errors: Parsing errors caused by invalid HTML or unexpected input types. Validate the input before passing it to the node.
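
The validation advice above could be implemented along these lines, for instance in a step that runs before this node. This is a minimal TypeScript sketch, not the node's own logic: checkHtmlInput is a hypothetical helper, and the tag-detection regular expression is only a rough heuristic that confirms at least one tag-like token is present. It does not prove the HTML is well formed.

```typescript
type HtmlCheck = { ok: true; html: string } | { ok: false; reason: string };

// Hypothetical pre-check: rejects non-strings, blank strings, and text with no tags.
function checkHtmlInput(input: unknown): HtmlCheck {
  if (typeof input !== "string") {
    return { ok: false, reason: "expected a string, got " + typeof input };
  }
  const trimmed = input.trim();
  if (trimmed.length === 0) {
    return { ok: false, reason: "HTML string is empty" };
  }
  // Heuristic only: looks for something shaped like <tag ...> or <!DOCTYPE ...>.
  if (!/<[a-z!][^>]*>/i.test(trimmed)) {
    return { ok: false, reason: "no HTML tags found" };
  }
  return { ok: true, html: trimmed };
}

const checked = checkHtmlInput("<html><head></head><body>hello world</body></html>");
console.log(checked.ok ? "input looks like HTML" : "rejected: " + checked.reason);
```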

Links and References

Cheerio GitHub Repository – Used for HTML parsing and manipulation.
HTML Meta Tags Reference
Open Graph Protocol – Explanation of Open Graph meta tags.
