What Robots.txt Actually Does
Robots.txt is a plain-text file placed at the root of a domain (https://example.com/robots.txt) that tells web crawlers which URLs they are allowed to request. The file follows the Robots Exclusion Protocol, a standard that Googlebot, Bingbot, and AI crawlers like GPTBot and ClaudeBot voluntarily obey. When a crawler arrives at a site, it reads robots.txt first and uses the rules to decide what to fetch.
The most important thing to understand: robots.txt controls crawling, not indexing. A blocked page can still appear in Google search results (as a bare URL with no snippet) if other pages link to it. To keep a page out of the index, use a noindex meta tag or HTTP header instead — and that page must stay crawlable for Google to see the noindex.
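The two noindex mechanisms mentioned above usually take the form `<meta name="robots" content="noindex">` in the page HTML or an `X-Robots-Tag: noindex` response header. As a rough sketch of how you might check a fetched page for either one, the Python below inspects a headers mapping and an HTML string. The function name `has_noindex` and its inputs are hypothetical, and the sketch ignores bot-specific variants such as a `googlebot` meta name.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots" ...> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the X-Robots-Tag header or a robots meta tag says noindex."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

A check like this only makes sense for pages that robots.txt leaves crawlable, since a blocked page is never fetched and its noindex is never seen.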
Robots.txt is also a public file. Anyone can read yoursite.com/robots.txt, so it is never a security tool. Listing Disallow: /admin/ just advertises where your admin panel lives. Use authentication for anything sensitive, not crawl rules.
Robots.txt Syntax: The Core Directives
Robots.txt syntax is built from a small set of directives grouped into blocks. Each block starts with a User-agent line naming the crawler, followed by Allow and Disallow rules. A typical file looks like this:
User-agent: *
Disallow: /cart/
Disallow: /search
Allow: /search/help

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

Here is what each piece means:
- `User-agent: *`: the rules that follow apply to every crawler that does not have a more specific block of its own.
- `Disallow: /cart/`: crawlers should not request any URL starting with `/cart/`.
- `Allow: /search/help`: carves an exception out of a broader Disallow.
- `User-agent: GPTBot` + `Disallow: /`: blocks OpenAI's crawler from the entire site.
- `Sitemap:`: points crawlers to your XML sitemap (an absolute URL that sits outside the User-agent blocks and applies site-wide).
Two rules trip people up. First, path matching is prefix-based and case-sensitive — Disallow: /Blog does not block /blog. Second, the most specific matching block wins, not the first one. If you have both a * block and a Googlebot block, Googlebot reads only its own block and ignores the wildcard entirely. Repeat shared rules in each named block, or Google may crawl paths you thought were closed.
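To make the block-selection and prefix-matching rules concrete, here is a minimal Python sketch of how a crawler might evaluate the example file. It is an illustration under stated assumptions, not any crawler's actual implementation: the names `parse_groups` and `is_allowed` are hypothetical, user-agent tokens are compared exactly (case-insensitively), `*` and `$` path wildcards are not supported, and when several rules in a block match, the longest path wins, with Allow winning a tie, following the precedence described in RFC 9309.

```python
EXAMPLE = """\
User-agent: *
Disallow: /cart/
Disallow: /search
Allow: /search/help

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_groups(text):
    """Parse robots.txt text into {lowercased user-agent: [(kind, path), ...]}."""
    groups = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines are read
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new block starts here
            reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(groups, user_agent, path):
    """Pick the named block if one exists, else '*', then apply longest match."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best_kind, best_len = "allow", -1  # no matching rule means allowed
    for kind, prefix in rules:
        if not prefix:  # an empty path matches nothing
            continue
        if path.startswith(prefix):  # prefix-based and case-sensitive
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and kind == "allow"
            if longer or tie_for_allow:
                best_kind, best_len = kind, len(prefix)
    return best_kind == "allow"
```

With `groups = parse_groups(EXAMPLE)`, a call such as `is_allowed(groups, "Googlebot", "/search/help")` falls back to the `*` block and finds the longer Allow rule, while `is_allowed(groups, "GPTBot", "/search/help")` reads only the GPTBot block and is refused.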
Common Mistakes (Including Blocking Google by Accident)
The single most damaging robots.txt mistake is shipping Disallow: / to production. That one line tells every crawler to skip the entire site. It usually happens when a staging robots.txt — which legitimately blocks everything — gets deployed to the live domain. Traffic then quietly collapses over the following weeks as Google drops uncrawlable pages.
Other frequent errors:
- Using robots.txt to deindex: blocking a URL does not remove it from the index. Use `noindex` instead, and leave the page crawlable.
- Wrong file location: robots.txt only works at the domain root. `example.com/blog/robots.txt` is ignored.
- Trailing-slash confusion: `Disallow: /news` blocks both `/news` and `/news/article`; `Disallow: /news/` blocks only paths under the folder.
- Trusting it as a wall: well-behaved bots obey robots.txt; scrapers and malicious bots ignore it.
If your audit ever flags a site-wide block, treat it as a five-alarm fire. The check tool surfaces this kind of crawl-blocking instantly, and it is the first thing to confirm during technical SEO work.
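A site-wide block is simple enough to detect mechanically. The sketch below is one possible check, not the audit tool mentioned above: the function name `site_wide_blocks` is hypothetical, and it only looks for a bare `Disallow: /` inside each block without weighing any Allow exceptions in the same block.

```python
def site_wide_blocks(robots_txt):
    """Return the user-agents whose block contains a bare 'Disallow: /'."""
    blocked = []
    agents = []
    reading_agents = False  # True while consecutive User-agent lines are read
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                agents = []  # a new block starts here
            reading_agents = True
            agents.append(value)
        elif field in ("allow", "disallow"):
            reading_agents = False
            if field == "disallow" and value == "/":
                blocked.extend(a for a in agents if a not in blocked)
    return blocked
```

In a deploy pipeline, you might fail the release whenever `"*"` appears in the returned list for the production file, while still permitting deliberate entries such as `GPTBot`.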
AI-Crawler Directives in 2026
Robots.txt is now the front line for controlling AI crawlers, not just search engines. In 2026, the major AI bots respect the Robots Exclusion Protocol, so you can allow or block them by name. The key user-agents to know are GPTBot (OpenAI training), OAI-SearchBot (ChatGPT search), ClaudeBot (Anthropic), PerplexityBot, and Google-Extended (Gemini/AI training, separate from regular Googlebot).
The strategic question is which bots help you. Blocking GPTBot or ClaudeBot can keep your content out of AI answers entirely — and if you want citations in ChatGPT, Claude, or Perplexity, that is the opposite of your goal. Many sites block training crawlers (GPTBot, Google-Extended) but allow retrieval crawlers (OAI-SearchBot, PerplexityBot) so they can still be cited in real-time AI answers.
Decide deliberately rather than copy-pasting a block-everything snippet. If you want to be cited in AI answers, allow the retrieval bots and learn how to rank in ChatGPT. Getting this wrong is one of the most common GEO mistakes the audit catches.
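One way to keep that decision explicit is to generate the AI-crawler blocks from two named lists instead of pasting a snippet. The Python sketch below uses the training and retrieval groupings from this section; the function name `render_ai_policy` is hypothetical, and ClaudeBot is left out so you can place it in whichever list matches your own decision.

```python
# Groupings taken from the section above; adjust them to your own policy.
TRAINING_BOTS = ["GPTBot", "Google-Extended"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot"]


def render_ai_policy(block, allow):
    """Render one robots.txt block per bot: blocked bots first, then allowed."""
    lines = []
    for bot in block:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    for bot in allow:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    # Collapse the trailing blank line into a single final newline.
    return "\n".join(lines).rstrip() + "\n"
```

Calling `render_ai_policy(TRAINING_BOTS, RETRIEVAL_BOTS)` yields text you can append to the file. Keep in mind that a bot with its own named block ignores the `*` block, so any shared Disallow rules need to be repeated inside the allowed bots' blocks.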
How to Set Up and Test Your Robots.txt
Setting up robots.txt is a six-step loop: write the file, place it at the root, fetch it in a browser, test specific URLs, confirm your AI-bot directives, and re-check after every deploy. The steps below map the whole process. The testing step matters most, because the cost of a bad rule is your entire organic and AI-search footprint.
- Write the rules: Draft User-agent blocks with Disallow/Allow paths and add your Sitemap line.
- Place it at the root: Upload the file so it resolves at yoursite.com/robots.txt; subfolders are ignored.
- Fetch it live: Open the URL in a private window to confirm the production file matches your draft.
- Test specific URLs: Use Google Search Console URL inspection to verify key pages are crawlable and blocked pages are blocked.
- Confirm AI bots: Check that GPTBot, ClaudeBot, and PerplexityBot directives match your allow/block intent.
- Re-check after deploys: Re-fetch robots.txt after every release so a staging Disallow: / never reaches production.
After deploying, always do two manual checks. Open https://yoursite.com/robots.txt in a private browser window to confirm the live file matches what you wrote — not a cached staging version. Then use the URL inspection tool in Google Search Console to confirm Googlebot can actually fetch a key page. The directives differ by bot and by intent, so it helps to know which ones to use when:
| Directive | What it does | Use it when |
|---|---|---|
| Disallow: /path | Stops crawlers from requesting matching URLs | Hiding low-value pages (cart, internal search) from crawl |
| Allow: /path | Carves an exception inside a broader Disallow | Re-opening one folder beneath a blocked parent |
| User-agent: GPTBot | Targets a specific named crawler | Allowing or blocking one AI or search bot precisely |
| Sitemap: URL | Points crawlers to your XML sitemap | Always — helps every crawler discover your URLs |
| noindex (meta tag, NOT robots.txt) | Removes a crawlable page from the index | Keeping a page out of results — leave it crawlable |
Make robots.txt a release checklist item. The most expensive outages are silent — nothing errors, traffic just fades. A quick re-fetch after each deploy, plus an automated audit, turns a five-minute habit into cheap insurance.
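The post-deploy re-fetch can be scripted. The sketch below is a minimal example, assuming you keep the intended file in version control: it compares that draft against the live file and reports any drift. The function names `fetch_live`, `normalize`, and `robots_drift` are hypothetical, and the comparison deliberately ignores comments, blank lines, and surrounding whitespace.

```python
import difflib
import urllib.request


def fetch_live(url):
    """Download the live robots.txt, e.g. https://yoursite.com/robots.txt."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def normalize(text):
    """Drop comments, blank lines, and surrounding whitespace."""
    cleaned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            cleaned.append(line)
    return cleaned


def robots_drift(draft, live):
    """Return unified-diff lines between the draft and the live file."""
    return list(
        difflib.unified_diff(
            normalize(draft),
            normalize(live),
            fromfile="draft",
            tofile="live",
            lineterm="",
        )
    )
```

An empty list from `robots_drift(draft, fetch_live(url))` means production matches what you wrote. Any output, especially an added `Disallow: /` line, is a reason to stop the release and investigate.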