DuckDuckGo Fetcher

Search DuckDuckGo and fetch clean text from top N sites

Overview

This node runs a DuckDuckGo web search and retrieves clean text content from the top N result pages. It is useful when you want to gather plain-text content from multiple relevant web pages for a single search query, for example for market research, content aggregation, or competitive analysis.

For example, you can input a search term like "OpenAI GPT-4" and specify the number of results to fetch (e.g., 5). The node will then:

  • Search DuckDuckGo for that query.
  • Extract the URLs of the top results.
  • Fetch each webpage.
  • Clean the HTML by removing scripts, styles, navigation, footers, forms, and other non-content elements.
  • Return the plain text content of each page.

This allows downstream nodes to process or analyze the extracted textual data without dealing with raw HTML.
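
The cleaning step can be illustrated with a minimal, dependency-free sketch. The node itself relies on cheerio for this; the regular-expression approach below is a swapped-in simplification, and the function name `cleanHtml` and the exact tag list are assumptions made for illustration, not the node's source code.

```typescript
// Hypothetical sketch: strip non-content elements and markup from an HTML string.
// The real node uses cheerio; regexes are used here only to avoid dependencies.
const NON_CONTENT_TAGS = ["script", "style", "nav", "footer", "form"];

function cleanHtml(html: string): string {
  let out = html;
  for (const tag of NON_CONTENT_TAGS) {
    // Remove the element together with everything inside it.
    const re = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi");
    out = out.replace(re, " ");
  }
  return out
    .replace(/<[^>]+>/g, " ") // drop remaining tags, keep their text
    .replace(/\s+/g, " ")     // collapse runs of whitespace
    .trim();
}

const sample =
  "<html><head><style>p{color:red}</style></head><body>" +
  "<nav>Home | About</nav><h1>Title</h1><p>Hello world.</p>" +
  "<script>alert(1)</script><footer>(c) Example</footer></body></html>";

console.log(cleanHtml(sample)); // prints "Title Hello world."
```

This sketch does not decode HTML entities or handle nested elements of the same type, which is one reason a real parser such as cheerio is the better choice in practice.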

Properties

Name                 Meaning
Search Query         The search term or phrase to look up on DuckDuckGo.
Number of Results    The number of top search results to retrieve and fetch text from. Allowed values: 1–10.
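
As a rough illustration of these constraints, the following sketch validates the two properties before a search is attempted. The interface, the field names `searchQuery` and `numberOfResults`, and the choice to reject an empty query are assumptions; the node may name and enforce them differently.

```typescript
// Hypothetical parameter shape; field names are illustrative only.
interface FetcherParams {
  searchQuery: string;
  numberOfResults: number;
}

function validateParams(p: FetcherParams): FetcherParams {
  const query = p.searchQuery.trim();
  // Assumption: an empty query is treated as invalid.
  if (query.length === 0) {
    throw new Error("Search Query must not be empty");
  }
  // The documented allowed range is 1 to 10 (inclusive).
  const n = p.numberOfResults;
  if (!Number.isInteger(n) || n < 1 || n > 10) {
    throw new RangeError("Number of Results must be an integer from 1 to 10");
  }
  return { searchQuery: query, numberOfResults: n };
}
```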

Output

The node emits one output item per input item. Each output item contains a json object with the following fields:

  • query: The original search query string.
  • links: An array of URLs representing the top search results retrieved from DuckDuckGo.
  • texts: An array of strings, each containing the cleaned plain text content fetched from the corresponding URL in links. If fetching or parsing a page fails, the corresponding entry contains an error message string describing the failure.

No binary data is output by this node.

Example output JSON structure:

{
  "query": "OpenAI GPT-4",
  "links": [
    "https://example.com/article1",
    "https://example.com/article2"
  ],
  "texts": [
    "Cleaned text content from article 1...",
    "Failed to fetch or parse https://example.com/article2: timeout"
  ]
}
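
Because links and texts are parallel arrays, a downstream step usually needs to pair them up and separate real content from error entries. The sketch below shows one way to do that. The type and function names are hypothetical, and detecting failures by the "Failed to fetch or parse" prefix is a heuristic based on the example above, not a documented contract.

```typescript
// Shape of the json object described in the Output section.
interface FetcherOutput {
  query: string;
  links: string[];
  texts: string[];
}

// Hypothetical per-page view for downstream processing.
interface PageResult {
  url: string;
  ok: boolean;
  text: string;
}

// Prefix taken from the example error entry in the output JSON.
const FAILURE_PREFIX = "Failed to fetch or parse ";

function pairResults(item: FetcherOutput): PageResult[] {
  return item.links.map((url, i) => {
    const text = item.texts[i] ?? ""; // tolerate a shorter texts array
    return { url, ok: !text.startsWith(FAILURE_PREFIX), text };
  });
}

const example: FetcherOutput = {
  query: "OpenAI GPT-4",
  links: ["https://example.com/article1", "https://example.com/article2"],
  texts: [
    "Cleaned text content from article 1...",
    "Failed to fetch or parse https://example.com/article2: timeout",
  ],
};

const usable = pairResults(example).filter((r) => r.ok);
console.log(usable.map((r) => r.url)); // only article1 remains
```

A page whose genuine text happens to start with the same phrase would be misclassified, so treat the check as a convenience rather than a guarantee.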

Dependencies

  • Requires internet access to perform HTTP requests.
  • Uses the DuckDuckGo HTML search endpoint (https://html.duckduckgo.com/html) via POST request.
  • Fetches external webpages using HTTP GET requests.
  • Relies on the following npm packages bundled with the node:
    • axios for HTTP requests.
    • cheerio for HTML parsing and manipulation.
  • No special API keys or authentication tokens are required.
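
To make the search step more concrete, here is a minimal sketch of how the POST request to the HTML endpoint could be assembled. It only builds the request description and does not send it. The `q` form field, the Content-Type header, and the names `SearchRequest` and `buildSearchRequest` are assumptions; the node's actual request may differ.

```typescript
// Hypothetical description of the search request; not the node's internal type.
interface SearchRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildSearchRequest(query: string): SearchRequest {
  // Assumption: the query is sent as the URL-encoded form field `q`.
  const body = new URLSearchParams({ q: query }).toString();
  return {
    url: "https://html.duckduckgo.com/html",
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body,
  };
}

const req = buildSearchRequest("OpenAI GPT-4");
console.log(req.body); // prints "q=OpenAI+GPT-4"
```

The returned object can be passed to whichever HTTP client is in use; keeping request construction separate from sending makes it easy to inspect or test without network access.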

Troubleshooting

  • No search results returned: This may indicate that DuckDuckGo blocked the request (possibly due to bot detection) or that the CSS selector used to extract links no longer matches the page structure. Check network connectivity and consider updating the node if DuckDuckGo changes its HTML layout.
  • Search request failed with status [code]: Indicates an HTTP error from DuckDuckGo. Verify network access and try again later.
  • Failed to fetch or parse [URL]: [error message]: Could be caused by network timeouts, invalid URLs, or unexpected page structures. The node uses a 7-second timeout per page fetch; increasing this timeout would require code modification.
  • Node error: [message]: General errors during execution. If "Continue On Fail" is enabled, the node will output the error message instead of stopping the workflow.
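
The per-page timeout and error-entry behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the node's code: `fetchPage` is a hypothetical stand-in for the real HTTP call, the pages are fetched in parallel here although the node may fetch them sequentially, and the "timeout" message text is taken from the example output.

```typescript
const PAGE_TIMEOUT_MS = 7000; // the documented 7-second limit per page

// Turn any thrown value into the documented per-page error string.
function describeFailure(url: string, err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  return `Failed to fetch or parse ${url}: ${message}`;
}

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Fetch every URL; a failure becomes an error string instead of aborting the batch.
// `fetchPage` is a hypothetical callback supplied by the caller.
async function fetchAll(
  urls: string[],
  fetchPage: (url: string) => Promise<string>,
  ms: number = PAGE_TIMEOUT_MS,
): Promise<string[]> {
  return Promise.all(
    urls.map((url) =>
      withTimeout(fetchPage(url), ms).catch((err) => describeFailure(url, err)),
    ),
  );
}
```

Because each failure is converted into a string at the same index as its URL, the texts array stays aligned with links even when some pages time out.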
