Robots.txt – Complete Checklist

 1) Robots.txt fundamentals (syntax & placement)

      File location: must live at the site root (e.g., https://example.com/robots.txt). Putting it in a subfolder (e.g., /userpages/yourname/robots.txt) does not work.

      Scope: a robots.txt file applies only to the host/subdomain it’s served from. If you have multiple subdomains, each needs its own robots.txt.

      Blocks crawling, not indexing: disallowed URLs might still be indexed if discovered via external links; use noindex (meta or HTTP header) for true de-indexing.

      Minimum structure: every ruleset must begin with a User-agent: line, or nothing is enforced. Paths in Allow/Disallow must start with /.

Minimal skeleton

User-agent: *

Disallow:

(Empty Disallow means “crawl everything.”)

2) Matching rules you’ll actually use (with wildcards)

Robots.txt supports limited wildcards:

      * = matches zero or more characters

      $ = end of URL (end-anchor)

Common, safe patterns

# Block entire site (staging only!)

User-agent: *

Disallow: /

 (Useful for staging; be sure to remove on production.)

# Block specific folders

User-agent: *

Disallow: /calendar/

Disallow: /junk/

Disallow: /books/fiction/contemporary/

(Append / to block a whole directory.)

# Allow only Google News; block everyone else

User-agent: Googlebot-news

Allow: /

 

User-agent: *

Disallow: /

 

# Allow all but one bot

User-agent: Unnecessarybot

Disallow: /

 

User-agent: *

Allow: /

 

# Block single pages

User-agent: *

Disallow: /useless_file.html

Disallow: /junk/other_useless_file.html

 

# Block entire site but allow one public section

User-agent: *

Disallow: /

Allow: /public/

 

# Block all images from Google Images

User-agent: Googlebot-Image

Disallow: /

(Or block a single image: Disallow: /images/dogs.jpg.)

# Block a file type everywhere (end-anchor)

User-agent: Googlebot

Disallow: /*.gif$

($ ensures you only block URLs ending with .gif.)

# Block URL patterns with query strings

User-agent: *

Disallow: /*?

(Blocks any URL containing ?.)

# Block PHP pages (anywhere in path) vs only those ending with .php

User-agent: *

Disallow: /*.php        # contains

Disallow: /*.php$       # ends with .php

(Understand the difference between /*.php and /*.php$.)
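
If you want to sanity-check which URLs a wildcard rule catches before deploying it, a small script can translate a rule path into a regular expression following the matching behaviour described above. This is a minimal Python sketch for local experimentation (the helper name rule_matches is just illustrative), not how any particular crawler is implemented:

import re

def rule_matches(rule_path, url):
    """Return True if a robots.txt rule path matches a URL path (plus query).

    Minimal sketch of the matching described above: * matches any run of
    characters, $ anchors the end of the URL, everything else is literal,
    and rules always match from the start of the path.
    """
    anchored = rule_path.endswith("$")            # $ is special only at the end
    body = rule_path[:-1] if anchored else rule_path
    pattern = re.escape(body).replace(r"\*", ".*")
    if anchored:
        pattern += "$"
    return re.match(pattern, url) is not None

# Quick checks against the patterns above (hypothetical URLs)
print(rule_matches("/*.gif$", "/images/cat.gif"))       # True: ends with .gif
print(rule_matches("/*.gif$", "/images/cat.gif?v=2"))   # False: does not end with .gif
print(rule_matches("/*?", "/page?id=7"))                # True: contains ?
print(rule_matches("/*.php", "/blog/post.php5"))        # True: .php anywhere
print(rule_matches("/*.php$", "/blog/post.php5"))       # False: must end with .php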

3) Fine-grained control with Allow + Disallow

      When rules conflict, Google chooses the most specific path (longest matching string). If lengths tie, Allow wins.

Classic session-ID pattern
Block duplicate URLs that contain ?, but allow the canonical version whose URL ends with a bare ?:

User-agent: *

Allow: /*?$

Disallow: /*?

(Blocks any URL that contains ?, but allows URLs that end with ?.)

Unblock a “good” page inside a blocked folder

User-agent: *

Allow: /baddir/goodpage

Disallow: /baddir/

(The longer Allow path beats the shorter Disallow.)

Be careful with overlapping patterns

User-agent: *

Allow: /some

Disallow: /*page

/somepage is blocked (the /*page path is longer).
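
To see how the longest-match / Allow-wins-ties logic plays out on your own rules, here is a minimal Python sketch that reuses the rule_matches helper from the previous section. It only illustrates the precedence described here; real crawlers have their own implementations:

def effective_rule(url, allows, disallows):
    """Decide whether a URL may be crawled: the longest matching path wins,
    and on a length tie Allow (the less restrictive rule) wins.
    Sketch only; uses rule_matches() from the sketch above."""
    candidates = []                                  # (path length, is_allow)
    for path in allows:
        if rule_matches(path, url):
            candidates.append((len(path), True))
    for path in disallows:
        if rule_matches(path, url):
            candidates.append((len(path), False))
    if not candidates:
        return True                                  # nothing matches: crawling allowed
    # Longest path first; on a tie, True (Allow) sorts above False (Disallow).
    candidates.sort(key=lambda c: (c[0], c[1]), reverse=True)
    return candidates[0][1]

print(effective_rule("/baddir/goodpage", ["/baddir/goodpage"], ["/baddir/"]))  # True: longer Allow wins
print(effective_rule("/somepage", ["/some"], ["/*page"]))                      # False: /*page is longer
print(effective_rule("/page?sid=123", ["/*?$"], ["/*?"]))                      # False: contains ? but doesn't end with it
print(effective_rule("/page?", ["/*?$"], ["/*?"]))                             # True: ends with ?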

4) Query strings & special characters (do it the safe way)

      If you need to block a specific parameter (e.g., id=), you often need two lines to cover “first param” vs “later param”:

Disallow: /*?id=

Disallow: /*&id=

      For URLs with unsafe characters (like < or spaces), block the URL-encoded form (copy from the browser’s address bar).

User-agent: *

Disallow: /search?q=%3C%%20var_name%20%%3E

      To match a literal dollar sign ($) in a URL (e.g., ?price=$10), don’t use Disallow: /*$ (that means “end of URL” and will block everything). Use:

Disallow: /*$*

(The trailing * removes the end-anchor meaning from $.)
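
If you would rather compute the encoded form than copy it from the address bar, Python's urllib.parse.quote gives a quick approximation. Note that quote() also encodes % itself (as %25), so the authoritative source is still the URL exactly as your browser or server logs show it; the strings below are made-up examples:

from urllib.parse import quote

# Percent-encode an unsafe query value before writing the Disallow rule.
# Caveat: quote() also encodes "%" itself (as %25), so double-check against
# the URL exactly as your browser or server shows it.
print(quote("<dogs & cats>"))                  # %3Cdogs%20%26%20cats%3E
print("/search?q=" + quote("<secret page>"))   # /search?q=%3Csecret%20page%3E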

5) Order of precedence & conflict resolution (how crawlers decide)

      Most specific path wins (longest match).

      If equally specific, Google uses the least restrictive rule (i.e., Allow).
      Examples: for /page, Allow: /p beats Disallow: / (the longer match); for /page.htm, Disallow: /*.htm beats Allow: /page (it matches more of the URL).

6) High-risk mistakes to avoid (with safer alternatives)

  1. Leaving “Disallow: /” on production after launch.

      Keep staging behind a password (e.g., HTTP auth) so you can ship the same robots.txt to production safely.

  2. Trying to block hostile scrapers via robots.txt

      Bad actors ignore robots.txt. Use firewalls, IP/user-agent blocking, rate limiting, or bot management.

  3. Listing secret directories in robots.txt

      This advertises where your private content lives. Use authentication. Band-aids: noindex meta or X-Robots-Tag (but still not a substitute for security).

  4. Accidental over-blocking with broad prefixes

      Disallow: /admin also blocks /administer-medication….

Safer pair:

 Disallow: /admin$

Disallow: /admin/

       ($ blocks exactly /admin, while the second blocks the folder.)

  5. Forgetting the User-agent line

      Rules won’t apply without it. Also remember that a bot matching a specific User-agent group ignores the general * group, so repeat any shared rules under each bot’s group.

  6. Case sensitivity

      Paths are case sensitive. To block all variants, list each case explicitly.

  7. Trying to control other subdomains from one robots.txt

      Each subdomain needs its own robots.txt at its own root.

  8. Using robots.txt as “noindex”

      Disallow ≠ De-index. Use meta noindex/X-Robots-Tag for reliable removal from search; Google may still index a disallowed URL if it is discovered via links elsewhere.
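
For reference, the two de-indexing signals mentioned above look like this (the meta tag goes in the page’s <head>; the HTTP header works for non-HTML files such as PDFs). The crawler must be able to fetch the page to see either signal, so don’t combine them with a Disallow for the same URL:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex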

7) Copy-paste templates for real-world scenarios

A) Staging site (block all crawling)

User-agent: *

Disallow: /

(Protect with password too; remove before go-live.)

B) Live site: block internal search pages & session IDs

User-agent: *

Disallow: /search?s=*

Disallow: /*?

Allow: /*?$

(Blocks parameterized duplicates but allows the rare “ends-with-?” canonical.)

C) Allow everything except an admin area (with exact page preserved)

User-agent: *

Disallow: /admin/

Disallow: /admin$

Allow: /admin/healthcheck.html

(Blocks folder & exact /admin, but lets a single page through.)

D) Images: remove .jpg from Google Images only

User-agent: Googlebot-Image

Disallow: /images/*.jpg$

(Prevents those images from appearing, even as cropped thumbnails, in Google Images.)

E) Multi-subdomain setup

      https://example.com/robots.txt → rules for example.com

      https://blog.example.com/robots.txt → rules for blog.example.com

      https://store.example.com/robots.txt → rules for store.example.com
 (You cannot control subdomains from the main host’s robots.txt.)

F) Blocking specific parameter “id=” (first vs later param)

User-agent: *

Disallow: /*?id=

Disallow: /*&id=

(Use both to be safe.)

G) Block literal $ anywhere

User-agent: *

Disallow: /*$*

(Do not use /*$ unless you intend to block everything.)

8) Debugging & auditing checklist

      Confirm location: https://{host}/robots.txt loads and is publicly readable.

      Validate syntax: every group begins with User-agent. All paths start with /.

      Scan for risky patterns: overly broad prefixes (e.g., /adm) that might catch unrelated pages (e.g., /administer…). Use $ and explicit folder slashes.

      Check wildcards: remember * is greedy; $ pins “end of URL.” Avoid trailing * after bare paths because /fish and /fish* behave the same.

      Conflict resolution: if a URL matches multiple rules, the longest path wins; if tie, Allow wins. Test suspect URLs against your rule set.

      Don’t rely on robots.txt to hide content: for sensitive assets, use authentication; for de-indexing, use noindex (meta or HTTP header).
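
For a quick local spot-check, Python's standard-library urllib.robotparser can fetch a live robots.txt and answer can-fetch questions. Caveat: it implements the basic exclusion standard, and its handling of Google-style * and $ wildcards may differ from Google's, so treat it as a sanity check and confirm important URLs in Search Console:

from urllib.robotparser import RobotFileParser

# Spot-check a live robots.txt with the standard library. Caveat: this parser
# implements the basic exclusion standard; its handling of Google-style * and
# $ wildcards may differ from Google's, so confirm important URLs in Search
# Console as well.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")    # replace with your own host
rp.read()                                       # fetch and parse the file

for url in [
    "https://example.com/",
    "https://example.com/admin/",
    "https://example.com/public/index.html",
]:
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "blocked")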

9) Quick “Do / Don’t” recap

Do

      Put robots.txt in the root of every host/subdomain you control.

      Use * and $ deliberately for precise matching.

      Use paired rules for tricky params (?id= and &id=).

      Prefer meta/X-Robots-Tag noindex for removal from search results.

Don’t

      Put secrets in robots.txt (it advertises them).

      Expect bad crawlers to obey robots.txt.

      Forget User-agent or the leading / in paths.

      Try to control other subdomains from one robots.txt.

In Detail

What is robots.txt?

      It’s a simple text file that lives at the root of your site (https://example.com/robots.txt).

      Purpose: tells search engine crawlers which parts of your site they can/can’t crawl.

      Important: it only controls crawling, not indexing.

      Example: if someone links to a blocked page, Google may still index the URL, but won’t know what’s inside (no title/snippet).

Checklist

      Place robots.txt only at site root.

      One robots.txt file per domain or subdomain.

      Use it for crawl management, not for security.

Structure of robots.txt

Every robots.txt is built from two parts:

  1. User-agent: → which bot(s) the rule applies to (e.g. Googlebot, Bingbot, or * for all).

  2. Allow / Disallow: → which URL paths are allowed or blocked.

Checklist

      Always start rules with User-agent.

      Paths must begin with /.

      Use wildcards (* and $) for flexible rules.

Basic examples

1. Allow everything

User-agent: *

Disallow:

(Empty disallow means: “crawl it all.”)

2. Block entire site

User-agent: *

Disallow: /

(Useful on staging sites – don’t forget to remove before going live!)

3. Block one folder

User-agent: *

Disallow: /private/

4. Block one page

User-agent: *

Disallow: /secret.html

Checklist

      / means everything.

      /folder/ means that folder and all inside.

      /file.html means just that file.

Advanced rules with wildcards

      * = matches anything.

      $ = end of URL.

Example: Block all GIFs

User-agent: *

Disallow: /*.gif$

Example: Block URLs with query strings

User-agent: *

Disallow: /*?

Example: Block id= parameter everywhere

User-agent: *

Disallow: /*?id=

Disallow: /*&id=

Checklist

      Use $ when you want to block only URLs that end a certain way.

      Use paired rules (?id= and &id=) for parameters.

Allow vs Disallow conflicts

When both apply:

  1. The longest rule wins (most specific).

  2. If length ties, Allow wins.

Example:

User-agent: *

Allow: /goodpage.html

Disallow: /goodpage

Result: /goodpage.html is allowed (longer path match).

Checklist

      Always test overlapping rules.

      Use Allow to “open” exceptions inside blocked folders.

Common mistakes to avoid

  1. Using robots.txt as a “noindex” tool

      Wrong: Disallow: /private/ doesn’t guarantee removal from Google.

      Right: use <meta name="robots" content="noindex"> or HTTP X-Robots-Tag.

  2. Blocking your whole live site by accident

      Many devs forget to remove Disallow: / after staging → site disappears from Google.

  3. Listing sensitive URLs

      Hackers read robots.txt to find /admin/ or /private/.

      Protect with authentication, not robots.txt.

  4. Forgetting case sensitivity

      /Admin/ ≠ /admin/.

Checklist

      Never rely on robots.txt for hiding content.

      Double-check Disallow: / isn’t live.

      Handle sensitive data with passwords or server restrictions.

Debugging & Testing

      Use Google Search Console robots.txt tester (or Bing equivalent).

      Paste suspicious URLs to see if they’re blocked.

      Always validate syntax:

      Paths start with /.

      No missing User-agent.

      Wildcards used correctly.

Checklist

      Test after every change.

      Monitor crawl stats in Search Console.

      Keep robots.txt simple → fewer errors.
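
A small linter can automate the syntax checks above (groups starting with User-agent, paths starting with /). This is a minimal Python sketch, assuming the file is served over HTTPS at the standard location; it is not a full validator:

import urllib.request

def lint_robots(host):
    """Fetch https://{host}/robots.txt and flag the basic syntax issues from
    the checklist above. Minimal sketch, not a full validator."""
    url = f"https://{host}/robots.txt"
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    seen_user_agent = False
    problems = []
    for n, raw in enumerate(lines, start=1):
        line = raw.split("#", 1)[0].strip()      # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: not a 'field: value' directive -> {raw!r}")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            seen_user_agent = True
        elif field.lower() in ("allow", "disallow"):
            if not seen_user_agent:
                problems.append(f"line {n}: rule appears before any User-agent line")
            if value and not value.startswith("/"):
                problems.append(f"line {n}: path should start with '/' -> {raw!r}")
    return problems

for problem in lint_robots("example.com"):
    print(problem)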

Robots.txt Master Checklist

  1. Place robots.txt at root (one per domain/subdomain)

  2. Start with User-agent lines

  3. Use /, /folder/, /file.html correctly

  4. Apply wildcards * and $ only when needed

  5. Use Allow to make exceptions

  6. Don’t block sensitive areas you want secure

  7. Don’t use robots.txt for de-indexing (use noindex)

  8. Test in Google Search Console

  9. Remove staging Disallow: / before go-live

Real-world robots.txt templates

1. E-Commerce Site Robots.txt

Goal: Allow products & categories, block duplicate filters, cart, checkout, and search pages.

User-agent: *

# Block cart, checkout, account

Disallow: /cart/

Disallow: /checkout/

Disallow: /account/

 

# Block search and filters (avoid crawl waste)

Disallow: /search?

Disallow: /*?filter=

Disallow: /*?sort=

 

# Block staging/preview URLs

Disallow: /staging/

 

# Allow all product & category pages

Allow: /products/

Allow: /category/

Why:

       Prevents crawling duplicate query parameters (filter, sort).

       Protects private areas (cart, checkout).

       Keeps product & category pages indexable.

2. Blog / News Site Robots.txt

Goal: Allow articles, block duplicate tags/search, control archives.

User-agent: *

# Block admin login area

Disallow: /wp-admin/

 

# Block search pages

Disallow: /?s=

 

# Block tags and archives to avoid duplicate content

Disallow: /tag/

Disallow: /archive/

 

# Allow everything else

Allow: /

Why:

       Avoids duplicate tag/archive pages in SERPs.

       Focuses crawler attention on posts and categories.

3. Local Business Site Robots.txt

Goal: Simple – allow service pages, block junk.

User-agent: *

# Block backend and private pages

Disallow: /admin/

Disallow: /login/

 

# Allow service pages, blog, contact, etc.

Allow: /

Why:

       Local sites are small, so keep it minimal.

       Blocks admin & login areas but leaves everything else crawlable.

4. Multi-Subdomain Setup

Goal: Each subdomain needs its own robots.txt.

example.com/robots.txt (main site)

User-agent: *

Disallow: /private/

Allow: /

 

blog.example.com/robots.txt

User-agent: *

Disallow: /wp-admin/

Disallow: /tag/

Disallow: /archive/

Allow: /

 

store.example.com/robots.txt

User-agent: *

Disallow: /cart/

Disallow: /checkout/

Disallow: /*?sort=

Allow: /products/

Why:

       Google treats each subdomain separately.

       You can tailor rules per subdomain for crawl efficiency.
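
If several subdomains are served by a single application, one way to give each its own robots.txt is to branch on the request's Host header. A minimal sketch using Flask (the hosts and rule bodies are just the templates above; adapt to your own stack):

from flask import Flask, Response, request

app = Flask(__name__)

# One robots.txt body per host, mirroring the templates above.
ROBOTS = {
    "example.com": "User-agent: *\nDisallow: /private/\nAllow: /\n",
    "blog.example.com": "User-agent: *\nDisallow: /wp-admin/\nDisallow: /tag/\nDisallow: /archive/\nAllow: /\n",
    "store.example.com": "User-agent: *\nDisallow: /cart/\nDisallow: /checkout/\nDisallow: /*?sort=\nAllow: /products/\n",
}

@app.route("/robots.txt")
def robots_txt():
    host = request.host.split(":")[0]                        # strip any port
    body = ROBOTS.get(host, "User-agent: *\nDisallow:\n")    # default: allow everything
    return Response(body, mimetype="text/plain")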

Implementation Checklist (for all cases)

  1. Put file at https://domain.com/robots.txt (not in subfolders).

  2. Test in Google Search Console → Robots.txt Tester.

  3. Confirm your critical pages (products, services, blog posts) are crawlable.

  4. Monitor crawl stats → adjust if crawlers waste time on junk URLs.
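
For step 4, one low-tech way to spot crawl waste is to count which top-level paths Googlebot actually requests in your access logs. This sketch assumes a combined-format log file named access.log; adapt the file name and parsing to your server:

import re
from collections import Counter

# Count which top-level paths Googlebot requests most, from a combined-format
# access log. The file name and log format are assumptions; adapt to your server.
LOG_FILE = "access.log"
request_re = re.compile(r'"[A-Z]+ (\S+) HTTP/[^"]*"')

hits = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "Googlebot" not in line:
            continue
        match = request_re.search(line)
        if match:
            path = match.group(1)
            # Bucket by the first path segment, e.g. /cart, /products
            top = "/" + path.lstrip("/").split("/", 1)[0].split("?", 1)[0]
            hits[top] += 1

for prefix, count in hits.most_common(10):
    print(f"{count:6d}  {prefix}")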
