1) Robots.txt fundamentals (syntax & placement)
● File location: must live at the site root (e.g., https://example.com/robots.txt). Putting it in a subfolder (e.g., /userpages/yourname/robots.txt) does not work.
● Scope: a robots.txt file applies only to the host/subdomain it’s served from. If you have multiple subdomains, each needs its own robots.txt.
● Blocks crawling, not indexing: disallowed URLs might still be indexed if discovered via external links; use noindex (meta or HTTP header) for true de-indexing.
● Minimum structure: every ruleset must begin with a User-agent: line, or nothing is enforced. Paths in Allow/Disallow must start with /.
Minimal skeleton
User-agent: *
Disallow:
(Empty Disallow means “crawl everything.”)
2) Matching rules you’ll actually use (with wildcards)
Robots.txt supports limited wildcards:
● * = matches zero or more characters
● $ = matches the end of the URL (end-anchor)
Common, safe patterns
# Block entire site (staging only!)
User-agent: *
Disallow: /
(Useful for staging; be sure to remove on production.)
# Block specific folders
User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/
(Append a trailing / to block a whole directory.)
# Allow only Google News; block everyone else
User-agent: Googlebot-News
Allow: /
User-agent: *
Disallow: /
# Allow all but one bot
User-agent: Unnecessarybot
Disallow: /
User-agent: *
Allow: /
# Block single pages
User-agent: *
Disallow: /useless_file.html
Disallow: /junk/other_useless_file.html
# Block entire site but allow one public section
User-agent: *
Disallow: /
Allow: /public/
# Block all images from Google Images
User-agent: Googlebot-Image
Disallow: /
(Or block a single image: Disallow: /images/dogs.jpg.)
# Block a file type everywhere (end-anchor)
User-agent: Googlebot
Disallow: /*.gif$
($ ensures you only block URLs ending in .gif.)
# Block URL patterns with query strings
User-agent: *
Disallow: /*?
(Blocks any URL containing ?.)
# Block PHP pages (anywhere in path) vs only those ending with .php
User-agent: *
Disallow: /*.php      # blocks any URL containing .php
Disallow: /*.php$     # blocks only URLs ending with .php
(Understand the difference between /*.php and /*.php$.)
3) Fine-grained control with Allow + Disallow
● When rules conflict, Google chooses the most specific path (the longest matching string). If the lengths tie, Allow wins.
Classic session-ID pattern
Block duplicates that contain ?, but allow a canonical “?-only” version (ends with ?):
User-agent: *
Allow: /*?$
Disallow: /*?
(Blocks any URL that contains ?, but allows URLs that end with ?.)
Unblock a “good” page inside a blocked folder
User-agent: *
Allow: /baddir/goodpage
Disallow: /baddir/
(The longer Allow path beats the shorter Disallow.)
Be careful with overlapping patterns
User-agent: *
Allow: /some
Disallow: /*page
/somepage is blocked (the /*page pattern is the longer match).
4) Query strings & special characters (do it the safe way)
● If you need to block a specific parameter (e.g., id=), you often need two lines to cover “first param” vs “later param”:
Disallow: /*?id=
Disallow: /*&id=
● For URLs with unsafe characters (like < or spaces), block the URL-encoded form (copy it from the browser’s address bar, or see the encoding sketch at the end of this section).
User-agent: *
Disallow: /search?q=%3C%%20var_name%20%%3E
● To match a literal dollar sign ($) in a URL (e.g., ?price=$10), don’t use Disallow: /*$ (that means “end of URL” and would block everything). Use:
Disallow: /*$*
(The trailing * removes the end-anchor meaning from $.)
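If you’d rather compute the encoded form than copy it from the address bar, here is a minimal sketch using Python’s standard library (the example path is hypothetical):

# Percent-encode a path so it can be pasted into a robots.txt rule (sketch)
from urllib.parse import quote

path = "/files/annual report.pdf"        # hypothetical path containing a space
print("Disallow: " + quote(path))        # prints: Disallow: /files/annual%20report.pdf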
5) Order of precedence & conflict resolution (how crawlers decide)
● Most specific path wins (longest match).
● If equally specific, Google uses the least restrictive rule (i.e., Allow).
Examples demonstrated in the source PDF include /p vs / and /page vs /*.htm; a worked sketch follows below.
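A worked sketch of those two cases under the longest-match rule above (each group is shown as if it were its own robots.txt file; the Allow/Disallow pairing and results assume Google’s documented matching behavior):

# Case 1: crawler requests /page
User-agent: *
Allow: /p
Disallow: /
# Result: /page is crawlable (Allow: /p is the longer, more specific match than Disallow: /)

# Case 2: crawler requests /page.htm
User-agent: *
Allow: /page
Disallow: /*.htm
# Result: /page.htm is blocked (Disallow: /*.htm matches more of the URL than Allow: /page)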
6) High-risk mistakes to avoid (with safer alternatives)
- Leaving “Disallow: /” on production after launch
○ Keep staging behind a password (e.g., HTTP auth) so you can safely ship the same robots.txt to production.
- Trying to block hostile scrapers via robots.txt
○ Bad actors ignore robots.txt. Use firewalls, IP/user-agent blocking, rate limiting, or bot management.
- Listing secret directories in robots.txt
○ This advertises where your private content lives. Use authentication. noindex meta tags or X-Robots-Tag headers are band-aids, not a substitute for security.
- Accidental over-blocking with broad prefixes
○ Disallow: /admin also blocks /administer-medication…. Safer pair:
Disallow: /admin$
Disallow: /admin/
○ ($ blocks exactly /admin, while the second line blocks the folder.)
- Forgetting the User-agent line
○ Rules won’t apply without it. Also note that a crawler obeys only the most specific User-agent group that matches it, so if you add a group for a specific bot, repeat any shared rules inside that bot’s block.
- Case sensitivity
○ Paths are case sensitive. To block all variants, list each case explicitly.
- Trying to control other subdomains from one robots.txt
○ Each subdomain needs its own robots.txt at its own root.
- Using robots.txt as “noindex”
○ Disallow ≠ de-index. Use a meta noindex tag or an X-Robots-Tag header for reliable removal from search; a minimal sketch follows below. The source PDF clarifies this explicitly: Google may still index a disallowed URL if it is discovered elsewhere.
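A minimal sketch of the two noindex forms mentioned above. Remember that the crawler must be able to fetch the page to see either signal, so don’t also disallow that URL in robots.txt.

<!-- In the page's <head> -->
<meta name="robots" content="noindex">

# Or sent as an HTTP response header (useful for PDFs and other non-HTML files):
X-Robots-Tag: noindex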
7) Copy-paste templates for real-world scenarios
A) Staging site (block all crawling)
User-agent: *
Disallow: /
(Protect the staging host with a password too, as sketched below, and remove the block before go-live.)
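One way to do the password protection, assuming an nginx-served staging host (the hostname and file paths are placeholders):

# nginx sketch: require HTTP Basic Auth on the whole staging vhost
server {
    server_name staging.example.com;

    auth_basic           "Staging";
    auth_basic_user_file /etc/nginx/.htpasswd;   # created with htpasswd or openssl

    location / {
        # ... normal root/proxy configuration ...
    }
}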
B) Live site: block internal search pages & session IDs
User-agent: *
Disallow: /search?s=
Disallow: /*?
Allow: /*?$
(Blocks parameterized duplicates but allows the rare “ends-with-?” canonical.)
C) Allow everything except an admin area (with exact page preserved)
User-agent: *
Disallow: /admin/
Disallow: /admin$
Allow: /admin/healthcheck.html
(Blocks the folder and the exact /admin URL, but lets a single page through.)
D) Images: remove .jpg from Google Images only
User-agent: Googlebot-Image
Disallow: /images/*.jpg$
(Prevents their appearance in Image Search.)
E) Multi-subdomain setup
● https://example.com/robots.txt → rules for example.com
● https://blog.example.com/robots.txt → rules for blog.example.com
● https://store.example.com/robots.txt → rules for store.example.com
(You cannot control subdomains from the main host’s robots.txt; a serving sketch follows below.)
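If several subdomains are served from the same machine, you can still give each host its own file; a hedged nginx sketch (file paths are placeholders):

# nginx sketch: serve a different robots.txt per subdomain
server {
    server_name blog.example.com;
    location = /robots.txt { alias /var/www/robots/blog.robots.txt; }
    # ... rest of the blog vhost ...
}

server {
    server_name store.example.com;
    location = /robots.txt { alias /var/www/robots/store.robots.txt; }
    # ... rest of the store vhost ...
}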
F) Blocking specific parameter “id=” (first vs later param)
User-agent: *
Disallow: /*?id=
Disallow: /*&id=
(Use both to be safe.)
G) Block literal $ anywhere
User-agent: *
Disallow: /*$*
(Do not use /*$ unless you intend to block everything.)
8) Debugging & auditing checklist
● Confirm location: https://{host}/robots.txt loads and is publicly readable.
● Validate syntax: every group begins with User-agent. All paths start with /.
● Scan for risky patterns: overly broad prefixes (e.g., /adm) that might catch unrelated pages (e.g., /administer…). Use $ and explicit folder slashes.
● Check wildcards: remember * is greedy; $ pins “end of URL.” Avoid a trailing * after bare paths because /fish and /fish* behave the same.
● Conflict resolution: if a URL matches multiple rules, the longest path wins; if they tie, Allow wins. Test suspect URLs against your rule set (see the sketch after this list).
● Don’t rely on robots.txt to hide content: for sensitive assets, use authentication; for de-indexing, use noindex (meta or HTTP header).
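For quick local checks, URLs can also be tested against the published file with Python’s standard library. A sketch (the host and URLs are hypothetical); note that urllib.robotparser does not implement the * and $ wildcard extensions, so double-check wildcard rules in Search Console as well:

# Check whether URLs are crawlable under the live robots.txt (sketch)
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # hypothetical host
rp.read()

for url in (
    "https://example.com/junk/other_useless_file.html",
    "https://example.com/public/index.html",
):
    verdict = "allowed" if rp.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)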
9) Quick “Do / Don’t” recap
Do
● Put robots.txt in the root of every host/subdomain you control.
● Use * and $ deliberately for precise matching.
● Use paired rules for tricky params (?id= and &id=).
● Prefer meta/X-Robots-Tag noindex for removal from search results.
Don’t
● Put secrets in robots.txt (it advertises them).
● Expect bad crawlers to obey robots.txt (see the server-side sketch after this list).
● Forget the User-agent line or the leading / in paths.
● Try to control other subdomains from one robots.txt.
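A sketch of the server-side alternative for bad crawlers, assuming nginx (the user-agent names and limits are illustrative only):

# nginx sketch: refuse a known-bad User-Agent and rate-limit everyone else
# (the limit_req_zone directive belongs in the http {} context)
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    if ($http_user_agent ~* "BadBot|EvilScraper") {
        return 403;
    }

    location / {
        limit_req zone=perip burst=10 nodelay;
        # ... normal site configuration ...
    }
}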
In Details
What is robots.txt?
● It’s a simple text file that lives at the root of your site (https://example.com/robots.txt).
● Purpose: tells search engine crawlers which parts of your site they can/can’t crawl.
● Important: it only controls crawling, not indexing.
○ Example: if someone links to a blocked page, Google may still index the URL, but won’t know what’s inside (no title/snippet).
Checklist
● Place robots.txt only at the site root.
● One robots.txt file per domain or subdomain.
● Use it for crawl management, not for security.
Structure of robots.txt
Every robots.txt file is built from two parts:
- User-agent: → which bot(s) the rules apply to (e.g., Googlebot, Bingbot, or * for all).
- Allow: / Disallow: → which URL paths are allowed or blocked.
Checklist
● Always start rules with User-agent.
● Paths must begin with /.
● Use wildcards (* and $) for flexible rules.
Basic examples
1. Allow everything
User-agent: *
Disallow:
(Empty disallow means: “crawl it all.”)
2. Block entire site
User-agent: *
Disallow: /
(Useful on staging sites – don’t forget to remove it before going live!)
3. Block one folder
User-agent: *
Disallow: /private/
4. Block one page
User-agent: *
Disallow: /secret.html
Checklist
● / means everything.
● /folder/ means that folder and everything inside it.
● /file.html means just that file.
Advanced rules with wildcards
● * = matches anything.
● $ = matches the end of the URL.
Example: Block all GIFs
User-agent: *
Disallow: /*.gif$
Example: Block URLs with query strings
User-agent: *
Disallow: /*?
Example: Block id= parameter everywhere
User-agent: *
Disallow: /*?id=
Disallow: /*&id=
Checklist
● Use $ when you want to block only URLs that end a certain way.
● Use paired rules (?id= and &id=) for parameters.
Allow vs Disallow conflicts
When both apply:
- The longest rule wins (most specific).
- If the lengths tie, Allow wins.
Example:
User-agent: *
Allow: /goodpage.html
Disallow: /goodpage
Result: /goodpage.html is allowed (longer path match).
Checklist
● Always test overlapping rules.
● Use Allow to “open” exceptions inside blocked folders.
Common mistakes to avoid
- Using robots.txt as a “noindex” tool
○ Wrong: Disallow: /private/ doesn’t guarantee removal from Google.
○ Right: use <meta name="robots" content="noindex"> or the HTTP X-Robots-Tag header.
- Blocking your whole live site by accident
○ Many devs forget to remove Disallow: / after staging → the site disappears from Google.
- Listing sensitive URLs
○ Hackers read robots.txt to find /admin/ or /private/.
○ Protect with authentication, not robots.txt.
- Forgetting case sensitivity
○ /Admin/ ≠ /admin/. To cover every variant, list each one (see the snippet after this list).
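For example, to cover the common case variants of an admin folder (list only the variants that actually exist on your site):

User-agent: *
Disallow: /admin/
Disallow: /Admin/
Disallow: /ADMIN/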
Checklist
● Never rely on robots.txt for hiding content.
● Double-check Disallow: / isn’t live.
● Handle sensitive data with passwords or server restrictions.
Debugging & Testing
● Use Google Search Console’s robots.txt report/tester (or the Bing Webmaster Tools equivalent).
● Paste suspicious URLs to see if they’re blocked.
● Always validate syntax:
○ Paths start with /.
○ No missing User-agent.
○ Wildcards used correctly.
Checklist
● Test after every change.
● Monitor crawl stats in Search Console.
● Keep robots.txt simple → fewer errors.
Robots.txt Master Checklist
| Step | Action |
| 1 | Place robots.txt at root (one per domain/subdomain) |
| 2 | Start with User-agent lines |
| 3 | Use /, /folder/, /file.html correctly |
| 4 | Apply wildcards * and $ only when needed |
| 5 | Use Allow to make exceptions |
| 6 | Don’t block sensitive areas you want secure |
| 7 | Don’t use robots.txt for de-indexing (use noindex) |
| 8 | Test in Google Search Console |
| 9 | Remove staging Disallow: / before go-live |
Real-world robots.txt templates
1. E-Commerce Site Robots.txt
Goal: Allow products & categories; block duplicate filters, cart, checkout, and search pages.
User-agent: *
# Block cart, checkout, account
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
# Block search and filters (avoid crawl waste)
Disallow: /search?
Disallow: /*?filter=
Disallow: /*?sort=
# Block staging/preview URLs
Disallow: /staging/
# Allow all product & category pages
Allow: /products/
Allow: /category/
Why:
● Prevents crawling of duplicate query parameters (filter, sort).
● Protects private areas (cart, checkout).
● Keeps product & category pages indexable.
2. Blog / News Site Robots.txt
Goal: Allow articles, block duplicate tags/search, control archives.
User-agent: *
# Block admin login area
Disallow: /wp-admin/
# Block search pages
Disallow: /?s=
# Block tags and archives to avoid duplicate content
Disallow: /tag/
Disallow: /archive/
# Allow everything else
Allow: /
Why:
● Avoids duplicate tag/archive pages in SERPs.
● Focuses crawler attention on posts and categories.
3. Local Business Site Robots.txt
Goal: Simple – allow service pages, block junk.
User-agent: *
# Block backend and private pages
Disallow: /admin/
Disallow: /login/
# Allow service pages, blog, contact, etc.
Allow: /
Why:
● Local sites are small, so keep it minimal.
● Blocks admin & login areas but leaves everything else crawlable.
4. Multi-Subdomain Setup
Goal: Each subdomain needs its own robots.txt.
example.com/robots.txt (main site)
User-agent: *
Disallow: /private/
Allow: /
blog.example.com/robots.txt
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/
Disallow: /archive/
Allow: /
store.example.com/robots.txt
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?sort=
Allow: /products/
Why:
● Google treats each subdomain separately.
● You can tailor rules per subdomain for crawl efficiency.