FetchFox AI Scraper

Scrape public web data with FetchFox

Overview

This node integrates with the FetchFox AI Scraper service to crawl web pages and find URLs that match a specified URL pattern. It is useful when you need to discover many URLs under a directory or path on a website, such as all product pages in a category or all blog posts under a tag.

For example, given the pattern https://www.example.com/directory/*, the node crawls the site, visiting at most the configured maximum number of pages, and returns the URLs that match the pattern.
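
To make the wildcard idea concrete, here is a minimal Python sketch of how such a pattern could be matched against candidate URLs. It assumes that * matches any run of characters, including "/"; the FetchFox service may apply different rules. The helpers pattern_to_regex and matching_urls are hypothetical names used only for illustration and are not part of the node.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL pattern with * wildcards into a compiled regex.

    Assumption: * matches any run of characters, including "/".
    """
    # Escape each literal piece, then join the pieces with ".*".
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def matching_urls(pattern: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that match the whole pattern."""
    regex = pattern_to_regex(pattern)
    return [url for url in urls if regex.fullmatch(url)]


# Hypothetical candidate URLs, as a crawler might encounter them.
candidates = [
    "https://www.example.com/directory/item-1",
    "https://www.example.com/directory/sub/item-2",
    "https://www.example.com/about",
]
matched = matching_urls("https://www.example.com/directory/*", candidates)
print(matched)
```

Under this assumption, the two URLs below /directory/ are kept and the /about page is dropped.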

Properties

Name | Meaning
URL Pattern to Find. Include at Least One * Wildcard | The URL pattern to search for, which must include at least one * wildcard. For example: https://www.example.com/directory/*. This tells the crawler which URLs to look for.
Max Visits | The maximum number of pages the crawler will visit during the operation. Defaults to 50.
Proxy | The type of proxy to use when loading pages. Options: None ($0.01 per GB); Datacenter ($0.01 per GB); Residential ($8.00 per GB); Residential, Load Images, Fonts, Etc ($8.50 per GB).
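
Because the proxy options are priced per gigabyte, a rough cost comparison is simple arithmetic. The sketch below uses the per-GB prices listed in the Proxy options; the traffic volume (0.5 GB) is a made-up figure, since the amount of data a crawl transfers depends on the target site, and estimate_cost is a hypothetical helper.

```python
# Per-GB prices in USD, copied from the Proxy options listed above.
PRICE_PER_GB = {
    "None": 0.01,
    "Datacenter": 0.01,
    "Residential": 8.00,
    "Residential, Load Images, Fonts, Etc": 8.50,
}


def estimate_cost(proxy: str, gigabytes: float) -> float:
    """Estimate the proxy cost in USD for a given amount of traffic."""
    return round(PRICE_PER_GB[proxy] * gigabytes, 4)


# Hypothetical crawl that transfers 0.5 GB in total.
traffic_gb = 0.5
for proxy in PRICE_PER_GB:
    print(f"{proxy}: ${estimate_cost(proxy, traffic_gb):.4f}")
```

At these prices the residential options cost several hundred times more than the datacenter option for the same traffic, so they are worth selecting only when the cheaper options fail to load the target pages.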

Output

The output is an array of JSON objects, each representing a URL found by the crawler that matches the specified pattern. Each object has the following structure:

{
  "url": "https://matched-url.com/page"
}

The first item in the output array also includes a _metrics field with metrics about the crawl operation (such as performance data). It is intended mainly for informational and debugging purposes.
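
A downstream step usually wants the plain list of URLs, with the metrics handled separately. The following Python sketch shows one way to split the two, assuming the output has the shape described above. The sample data and the key inside _metrics are invented for illustration, and split_crawl_output is a hypothetical helper.

```python
from typing import Any


def split_crawl_output(
    items: list[dict[str, Any]],
) -> tuple[list[str], dict[str, Any]]:
    """Return (urls, metrics) from the node's output items."""
    urls = [item["url"] for item in items if "url" in item]
    # Only the first item is described as carrying _metrics.
    metrics = items[0].get("_metrics", {}) if items else {}
    return urls, metrics


# Hypothetical sample output; the key inside _metrics is made up.
sample = [
    {"url": "https://www.example.com/directory/a", "_metrics": {"example_key": 1}},
    {"url": "https://www.example.com/directory/b"},
]
urls, metrics = split_crawl_output(sample)
print(urls)
print(metrics)
```

Reading _metrics with a default keeps the helper safe when the field is absent or the crawl returns no items.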

No binary data is output by this operation.

Dependencies

  • Requires an API key credential for the FetchFox AI Scraper service.
  • The node makes authenticated HTTP POST requests to the FetchFox API endpoint at https://api.fetchfox.ai/api/crawl.
  • Proxy usage depends on the selected option and may incur additional costs.
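
For readers who want to picture the underlying call, here is a hedged Python sketch that builds (but does not send) an authenticated POST request to the endpoint named above. Only the URL and the HTTP method come from this document. The payload field names (pattern, max_visits) and the bearer-token Authorization header are assumptions made for illustration; check the FetchFox API documentation for the real request format.

```python
import json
import urllib.request

# Endpoint named in the Dependencies section.
API_URL = "https://api.fetchfox.ai/api/crawl"


def build_crawl_request(
    api_key: str, pattern: str, max_visits: int = 50
) -> urllib.request.Request:
    """Build a POST request object without sending it."""
    # ASSUMPTION: "pattern" and "max_visits" are hypothetical field names.
    payload = {"pattern": pattern, "max_visits": max_visits}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_crawl_request(
    "YOUR_API_KEY", "https://www.example.com/directory/*"
)
print(request.get_method(), request.full_url)
```

The request is deliberately left unsent; passing it to urllib.request.urlopen would perform the actual call and may incur usage costs.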

Troubleshooting

  • Common Issues:

    • If the URL pattern does not contain at least one * wildcard, the node will likely fail or return no results because the pattern is invalid.
    • The crawl stops once it reaches the Max Visits limit, so URL discovery may be incomplete on larger sites; raise the limit if expected URLs are missing.
    • Using proxies incorrectly or without proper configuration might cause request failures or increased latency.
  • Error Messages:

    • Authentication errors indicate missing or invalid API credentials; ensure the API key is correctly configured.
    • Network or timeout errors may occur if the target website is unreachable or slow; consider adjusting proxy settings or max visits.
    • Invalid pattern errors occur when the pattern format is incorrect; verify that the pattern includes at least one *.
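
Several of the issues above can be caught before the node runs. The sketch below is a minimal pre-flight check in Python. The wildcard rule comes from this document; the requirement that the pattern be an absolute http(s) URL is an assumption based on the examples shown here. validate_pattern is a hypothetical helper, not part of the node.

```python
from urllib.parse import urlsplit


def validate_pattern(pattern: str) -> list[str]:
    """Return a list of problems; an empty list means the pattern looks usable."""
    problems = []
    # Documented rule: at least one * wildcard is required.
    if "*" not in pattern:
        problems.append("pattern must include at least one * wildcard")
    # ASSUMPTION: patterns are absolute http(s) URLs, as in the examples.
    parts = urlsplit(pattern)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("pattern should be an absolute http(s) URL")
    return problems


print(validate_pattern("https://www.example.com/directory/*"))
print(validate_pattern("https://www.example.com/directory/"))
```

Returning a list of problems rather than raising on the first one lets a workflow report every issue with the pattern at once.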
