robots.txt Best Practices: How to Control Google's Crawler

The robots.txt file is one of the most misunderstood files in SEO. It is not a security tool — it is a set of polite instructions that tell search engine crawlers which parts of your site they should and should not visit. Get it right, and you help search engines crawl your site more efficiently. Get it wrong, and you can accidentally block your most important pages from appearing in search results.

This guide covers robots.txt syntax, common patterns, AI bot blocking, testing procedures, and the mistakes that can tank your SEO overnight.

What Is robots.txt?

robots.txt is a plain text file that lives at the root of your domain: yourdomain.com/robots.txt. It uses the Robots Exclusion Protocol (REP), standardized as RFC 9309, which all major search engines respect. Before crawling your site, Googlebot fetches robots.txt to learn which URLs it is allowed to crawl. Google generally caches the file for up to 24 hours, so changes may not take effect immediately.

Important: robots.txt is advisory, not enforced. Reputable crawlers such as Googlebot honor it, but malicious crawlers can ignore it. Never use robots.txt to hide sensitive content: protect that content with authentication, and use noindex to keep non-sensitive pages out of search results.

robots.txt Syntax Explained

The file uses a simple structure built around three directives: User-agent, Disallow, and Allow.

User-agent: Specifies which crawler the rules apply to. Use an asterisk (*) to target all crawlers, or name a specific bot like Googlebot, Googlebot-Image, or Bingbot.

Disallow: Specifies which paths should not be crawled. Use a forward slash (/) to block the entire site, or specific paths like /admin/ to block particular directories.

Allow: Explicitly permits crawling of a path, even if a broader disallow rule would block it. Useful for making exceptions within blocked directories.
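When an Allow rule and a Disallow rule both match a URL, Google's documentation and RFC 9309 resolve the conflict by taking the most specific (longest) matching path, with Allow winning a tie. The following is a minimal Python sketch of that precedence for plain path prefixes only. The function name is_allowed, the rule tuples, and the example paths are illustrative assumptions, and wildcards are not handled here.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) pairs for one user-agent group,
    where directive is "allow" or "disallow". The longest matching prefix
    wins; on a tie, "allow" wins. Assumption: no wildcards in the prefixes.
    """
    best_len = -1
    allowed = True  # no matching rule means crawling is allowed
    for directive, prefix in rules:
        # An empty value (e.g. "Disallow:") matches nothing.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical group: block /private/ but make an exception for one folder.
rules = [("disallow", "/private/"), ("allow", "/private/press-kit/")]

print(is_allowed("/private/notes.html", rules))          # False
print(is_allowed("/private/press-kit/logo.png", rules))  # True
print(is_allowed("/blog/", rules))                       # True
```

Not every parser follows this rule. Some older implementations apply the first matching line in file order, so placing Allow exceptions before the broader Disallow keeps behavior consistent across crawlers.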

Essential robots.txt Examples

Allow All Crawling (Default)

The simplest robots.txt allows everything:

User-agent: *
Disallow:

An empty Disallow means nothing is blocked. Every crawler can access every page.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

This blocks all crawlers from your admin panel, private directory, and temporary files while allowing access to everything else.
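One way to sanity-check rules like these before deploying them is Python's standard-library urllib.robotparser. The sketch below parses the rules above and asks whether a few hypothetical paths may be fetched; the example paths are assumptions for illustration. Note that this parser uses simple prefix matching and first-match ordering and does not understand wildcards, so treat it as an approximation of how Google evaluates the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths to check against the rules.
for path in ("/admin/login", "/private/report.html", "/admin", "/blog/hello-world"):
    url = "https://yourdomain.com" + path
    print(path, parser.can_fetch("Googlebot", url))
```

The /admin path (without a trailing slash) is reported as allowed, because Disallow: /admin/ only matches URLs that begin with /admin/.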

Block a Specific Bot

User-agent: BadBot
Disallow: /

This blocks a specific crawler from your entire site while allowing all other bots full access.

Sitemap Declaration

Always include your sitemap URL in robots.txt:

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml

This helps search engines discover your sitemap quickly, even if they do not find it through other means.
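To confirm that a Sitemap line is being picked up, a short check with Python's standard-library parser can help. This is a sketch that assumes Python 3.8 or later (where site_maps() is available) and reuses the example file above.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "",
    "Sitemap: https://yourdomain.com/sitemap.xml",
])

# site_maps() returns the list of declared sitemap URLs, or None if there are none.
print(parser.site_maps())
```

The Sitemap line is independent of any User-agent group, so it can appear anywhere in the file, and it should be an absolute URL.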

Blocking AI Bots in robots.txt

With the rise of AI training crawlers, many website owners want to block their content from being scraped for AI model training. You can add specific directives for known AI bots:

User-agent: GPTBot            # OpenAI's crawler for model training
User-agent: ChatGPT-User      # OpenAI's agent for user-initiated requests in ChatGPT
User-agent: Google-Extended   # Google's control token for AI training use; not a separate crawler from Googlebot
User-agent: CCBot             # Common Crawl
User-agent: anthropic-ai      # Anthropic (its current crawler identifies as ClaudeBot)
User-agent: Bytespider        # ByteDance/TikTok crawler

Add a Disallow: / under each User-agent to block them. Note that this relies on the bot respecting robots.txt — which major AI companies have generally committed to, but which cannot be guaranteed for all crawlers.
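If you maintain this list in code, the blocking groups can be generated rather than typed by hand. The sketch below is one possible approach: the helper name build_ai_block and the AI_BOT_TOKENS list are assumptions for illustration, and the tokens are simply the ones listed above. It then uses Python's standard-library parser to confirm that a listed bot is blocked while Googlebot is unaffected.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the list above; adjust to your own policy.
AI_BOT_TOKENS = [
    "GPTBot",
    "ChatGPT-User",
    "Google-Extended",
    "CCBot",
    "anthropic-ai",
    "Bytespider",
]


def build_ai_block(tokens):
    """Return robots.txt text with one 'Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in tokens]
    return "\n".join(groups)  # blank line between groups


robots_txt = build_ai_block(AI_BOT_TOKENS)
print(robots_txt)

# Verify the generated text with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://yourdomain.com/"))     # False
print(parser.can_fetch("Googlebot", "https://yourdomain.com/"))  # True
```

Append the generated text to your existing robots.txt rather than replacing it, so your rules for regular search crawlers stay in place.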

robots.txt vs. Meta Robots vs. X-Robots-Tag

These three methods control crawling and indexing differently, and understanding the distinction is critical.

robots.txt: Controls crawling. Tells bots whether they may request a URL. Does not remove already-indexed pages from search results.
Meta robots (noindex): Controls indexing. The page is crawled but not added to the search index. This is the correct way to remove a page from search results.
X-Robots-Tag: An HTTP header that works like meta robots but can be applied to non-HTML files like PDFs and images.

Common mistake: Using robots.txt to Disallow a page you want deindexed. If Google cannot crawl the page, it cannot see the noindex tag either. The page may remain in search results with only the URL visible. Use noindex in HTML or the X-Robots-Tag header instead.
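For non-HTML files, the X-Robots-Tag header is normally set in your web server or application configuration. As a rough illustration of the idea, here is a minimal Python WSGI middleware sketch that adds X-Robots-Tag: noindex to responses for .pdf paths. The names add_noindex_for_pdfs and demo_app are hypothetical, and a real deployment would more likely set the header in the server configuration.

```python
def add_noindex_for_pdfs(app):
    """Wrap a WSGI app so responses for .pdf paths carry X-Robots-Tag: noindex."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Only touch responses whose request path ends in .pdf.
            if environ.get("PATH_INFO", "").lower().endswith(".pdf"):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped


def demo_app(environ, start_response):
    # Hypothetical stand-in for a real application that serves files.
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"file contents"]


application = add_noindex_for_pdfs(demo_app)
```

For the header to be seen, the PDF URLs must not be disallowed in robots.txt, for the same reason given above for the noindex meta tag.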

Testing Your robots.txt

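Google Search Console robots.txt Report

Google has retired the standalone robots.txt Tester. Search Console now provides a robots.txt report (under Settings) that shows which robots.txt files Google found for your site, when each was last crawled, and any warnings or errors it encountered. To check whether a specific URL is allowed or blocked, use the URL Inspection tool described below.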

Manual Testing

Visit yourdomain.com/robots.txt in a browser to view the file directly. Check that it is accessible, properly formatted, and does not contain unintended Disallow rules.
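Part of this manual review can be automated. The following Python sketch flags two common problems in a robots.txt file's text: lines with unrecognized directives (often typos) and a site-wide Disallow: / for all crawlers. The function name lint_robots and the set of recognized directives are assumptions made for this example, not an official validator.

```python
# Directives this sketch recognizes; anything else is reported as a possible typo.
KNOWN = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text):
    """Return a list of warning strings for the given robots.txt text."""
    warnings = []
    agents = []       # user-agents of the group currently being read
    in_rules = False  # True once the current group has an Allow/Disallow line
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN:
            warnings.append(f"line {number}: unknown directive '{field}'")
        elif field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if not agents:
                warnings.append(f"line {number}: rule before any User-agent")
            if field == "disallow" and value == "/" and "*" in agents:
                warnings.append(f"line {number}: blocks the whole site for all crawlers")
    return warnings


print(lint_robots("User-agent: *\nDisalow: /admin/\nDisallow: /\n"))
```

An empty list means the sketch found nothing suspicious. It does not prove the file does what you intend, so still spot-check important URLs.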

URL Inspection Tool

Use the URL Inspection tool in Google Search Console to check if a specific URL is being blocked by robots.txt. This is the fastest way to diagnose why a page is not appearing in search results.

Common robots.txt Mistakes

Blocking CSS and JS files: Google needs to render your pages properly. Disallowing /assets/ or /wp-includes/ can prevent Google from seeing your page as users do.
Using robots.txt for security: Anyone can view your robots.txt file, which means listing sensitive directories in Disallow actually exposes their existence.
Forgetting the sitemap: Not including your Sitemap URL means crawlers have to discover it through other means, which is slower.
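Wildcards in the wrong place: In Google's implementation, * matches any sequence of characters and $ anchors the end of the URL. Rules are also prefix matches, so Disallow: /blog blocks /blog-archive/ as well as /blog/. A misplaced wildcard or a missing trailing slash can block unintended URLs, so test every pattern.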
Not updating after site changes: If you restructure your site, update robots.txt to match the new URL structure.
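To see how wildcard patterns can match more than intended, it can help to try them against sample URLs. The sketch below translates a robots.txt path pattern into a regular expression using the semantics Google documents: * matches any sequence of characters, a trailing $ anchors the end of the URL, and everything else is a prefix match. The helper names and sample paths are assumptions for illustration, and other crawlers may interpret wildcards differently.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and turn each * into ".*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """True if the rule pattern matches the URL path (matched from the start)."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/files/report.pdf"))             # True
print(matches("/*.pdf$", "/files/report.pdf?download=1"))  # False: $ requires the URL to end in .pdf
print(matches("/*.pdf", "/files/report.pdf?download=1"))   # True: no $ means prefix match
print(matches("/blog", "/blog-archive/2020"))              # True: probably unintended
print(matches("/blog/", "/blog-archive/2020"))             # False
```

Running candidate rules against a list of URLs you care about, before publishing the file, is a cheap way to catch patterns that are broader than you meant.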

robots.txt for Next.js Sites

Next.js provides built-in support for generating robots.txt. In the App Router, you can add a robots.ts (or robots.js) file to your app directory that exports a default function returning a MetadataRoute.Robots object, and Next.js serves the result at /robots.txt. This approach lets you reference your site URL from environment variables and automatically include your sitemap path.

For static sites, a simple robots.txt file in your public directory works just as well. The key is ensuring it exists, is properly formatted, and includes your sitemap.

Generate Your robots.txt

Rather than writing robots.txt from scratch, use our free robots.txt generator. Select which directories to block, add your sitemap URL, toggle AI bot blocking, and download a properly formatted robots.txt file in seconds. No signup required.
