Tale’s crawler visits pages on a domain you point it at, extracts the text content, and indexes it into the knowledge base alongside your uploaded documents. The AI agent can then answer questions grounded in that content — “What’s our current pricing on the website?”, “Which features changed in the v3 release notes?”. This page covers the Editor/Developer side. For the end-user workflow of simply adding a website, see Knowledge base.

What the crawler does

  1. Fetches the URL you provide and parses the HTML.
  2. Discovers linked pages on the same domain.
  3. Fetches each discovered page and repeats the process until the domain’s discovered-URL limit is reached.
  4. Converts every page to clean text (strips navigation, footers, and ads).
  5. Indexes the text into the shared knowledge store with page URL as the source.
Non-HTML documents (PDF, DOCX) linked from crawled pages are fetched, converted, and indexed too.
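The fetch-discover-extract loop above can be sketched as a small breadth-first crawler. This is a minimal illustration, not Tale’s actual implementation: the `fetch` callable, the `max_pages` cap, and the text-extraction details are all assumptions for the sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _LinkAndTextParser(HTMLParser):
    """Collects href links and visible text, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to start_url's domain.

    `fetch` is a callable url -> HTML string (injected so the sketch
    stays testable).  Returns {url: extracted_text}.
    """
    domain = urlparse(start_url).netloc
    queue, seen, index = deque([start_url]), {start_url}, {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        parser = _LinkAndTextParser()
        parser.feed(fetch(url))
        index[url] = " ".join(parser.text_parts)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Only follow links on the same domain, each at most once.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index
```

Injecting `fetch` as a parameter keeps the sketch runnable against an in-memory page map; a real crawler would pass an HTTP client there.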

Scan intervals

The crawler revisits the site on a schedule you pick per site:
  Scan interval             Best for
  Every hour                Sites with frequent content changes.
  Every 6 hours (default)   Documentation sites and company wikis.
  Every 12 hours            Semi-active sites.
  Every day                 Marketing sites and blogs.
  Every 5 days              Moderately static content.
  Every 7 days              Reference sites with infrequent updates.
  Every 30 days             Rarely changing reference material.
Each rescan diffs against the last fetch. Unchanged pages are not re-indexed — only new, changed, or deleted pages trigger work.
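One common way to implement this kind of diffing is to keep a content hash per URL from the last scan and compare against a fresh fetch. The sketch below assumes that approach; the function name and return shape are illustrative, not Tale’s API.

```python
import hashlib

def plan_rescan(previous_hashes, current_pages):
    """Decide what work a rescan needs by diffing content hashes.

    previous_hashes: {url: sha256 hexdigest} recorded by the prior scan.
    current_pages:   {url: extracted page text} from this scan.
    Returns (to_index, to_delete, unchanged) as sets of URLs.
    """
    current_hashes = {
        url: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for url, text in current_pages.items()
    }
    # New or changed pages: hash differs from (or is absent in) the last scan.
    to_index = {u for u, h in current_hashes.items() if previous_hashes.get(u) != h}
    # Pages that disappeared since the last scan get removed from the index.
    to_delete = set(previous_hashes) - set(current_hashes)
    unchanged = set(current_hashes) - to_index
    return to_index, to_delete, unchanged
```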

Respecting the target site

  • The crawler honours robots.txt. Disallowed paths are skipped.
  • Requests are rate-limited (one fetch per 2 seconds per domain by default) to avoid hammering the target.
  • The user agent is TaleCrawler/1.0 (+https://tale.dev/crawler) so site owners can identify traffic.
For crawling sites behind auth or requiring a custom user agent, configure a REST API integration instead — see Integrations overview.
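The robots.txt and rate-limiting behaviour above can be sketched with the standard library. This is an illustration of the general technique, assuming `urllib.robotparser` semantics and a simple per-domain delay; it is not Tale’s crawler code.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "TaleCrawler/1.0 (+https://tale.dev/crawler)"

def make_robots_checker(robots_txt):
    """Build a can_fetch(url) predicate from raw robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(USER_AGENT, url)

class RateLimiter:
    """Enforce a minimum delay between consecutive fetches to one domain."""
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to keep min_interval between fetches.
        remaining = self._last + self.min_interval - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

A crawler would call `limiter.wait()` before each fetch and skip any URL for which `can_fetch(url)` is false.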

Debugging a crawl

If a crawl isn’t picking up pages you expect:
  • Open the site’s detail page under Knowledge > Websites. The Discovered pages list shows what the crawler has found.
  • The Errors tab lists pages that failed to fetch or parse, with the HTTP status and error message.
  • Check that the expected pages are linked from the homepage or sitemap. The crawler only finds what it can reach via links.
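When checking whether an expected page is reachable, the sitemap is often the quickest place to look. A small helper like the one below (a sketch, not part of Tale) extracts the `<loc>` URLs from a fetched `sitemap.xml` so you can confirm the page is listed.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_in_sitemap(sitemap_xml):
    """Return the set of <loc> URLs declared in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")}
```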

Removing a site

Deleting a tracked website from Knowledge > Websites removes all indexed content from that site. This is immediate — the AI will no longer find those pages.
Last modified on April 19, 2026