1) Robots.txt fundamentals (syntax & placement)
● File location: must live at the site root (e.g., https://example.com/robots.txt). Putting it in a subfolder (e.g., /userpages/yourname/robots.txt) does not work.
● Scope: a robots.txt file applies only to the host/subdomain it’s served from. If you have multiple subdomains, each needs its own robots.txt.
● Blocks crawling, not indexing: disallowed URLs might still be indexed if discovered via external links; use noindex (meta or HTTP header) for true de-indexing.
● Minimum structure: every ruleset must begin with a User-agent: line, or nothing is enforced. Paths in Allow/Disallow must start with /.
Minimal skeleton
User-agent: *
Disallow:
(Empty Disallow means “crawl everything.”)
2) Matching rules you’ll actually use (with wildcards)
Robots.txt supports limited wildcards:
● * = matches zero or more characters
● $ = matches the end of the URL (end-anchor)
Common, safe patterns
# Block entire site (staging only!)
User-agent: *
Disallow: /
(Useful for staging; be sure to remove on production.)
# Block specific folders
User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/
(Append a trailing / to block a whole directory.)
# Allow only Google News; block everyone else
User-agent: Googlebot-News
Allow: /
User-agent: *
Disallow: /
# Allow all but one bot
User-agent: Unnecessarybot
Disallow: /
User-agent: *
Allow: /
# Block single pages
User-agent: *
Disallow: /useless_file.html
Disallow: /junk/other_useless_file.html
# Block entire site but allow one public section
User-agent: *
Disallow: /
Allow: /public/
# Block all images from Google Images
User-agent: Googlebot-Image
Disallow: /
(Or block a single image: Disallow: /images/dogs.jpg.)
# Block a file type everywhere (end-anchor)
User-agent: Googlebot
Disallow: /*.gif$
($ ensures you only block URLs ending in .gif.)
# Block URL patterns with query strings
User-agent: *
Disallow: /*?
(Blocks any URL containing ?.)
# Block PHP pages (anywhere in path) vs only those ending with .php
User-agent: *
Disallow: /*.php      # blocks any URL containing .php
Disallow: /*.php$     # blocks only URLs ending with .php
(Understand the difference between /*.php and /*.php$.)
3) Fine-grained control with Allow + Disallow
● When rules conflict, Google chooses the most specific path (the longest matching string). If the lengths tie, Allow wins.
Classic session-ID pattern
Block duplicates that contain ?, but allow a canonical “?-only” version (ends with ?):
User-agent: *
Allow: /*?$
Disallow: /*?
(Blocks any URL that contains ?, but allows URLs that end with ?.)
Unblock a “good” page inside a blocked folder
User-agent: *
Allow: /baddir/goodpage
Disallow: /baddir/
(The longer Allow path beats the shorter Disallow.)
Be careful with overlapping patterns
User-agent: *
Allow: /some
Disallow: /*page
/somepage is blocked (the /*page pattern is the longer match).
4) Query strings & special characters (do it the safe way)
● If you need to block a specific parameter (e.g., id=), you often need two lines to cover “first param” vs “later param”:
Disallow: /*?id=
Disallow: /*&id=
● For URLs with unsafe characters (like < or spaces), block the URL-encoded form (copy it from the browser’s address bar, or see the encoding sketch at the end of this section).
User-agent: *
Disallow: /search?q=%3C%%20var_name%20%%3E
● To match a literal dollar sign ($) in a URL (e.g., ?price=$10), don’t use Disallow: /*$ (that means “end of URL” and would block everything). Use:
Disallow: /*$*
(The trailing * removes the end-anchor meaning from $.)
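If you’d rather compute the encoded form than copy it from the address bar, here is a minimal sketch using Python’s standard library (the example path is hypothetical):

# Percent-encode a path so it can be pasted into a robots.txt rule (sketch)
from urllib.parse import quote

path = "/files/annual report.pdf"        # hypothetical path containing a space
print("Disallow: " + quote(path))        # prints: Disallow: /files/annual%20report.pdf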
5) Order of precedence & conflict resolution (how crawlers decide)
● Most specific path wins (longest match).
● If equally specific, Google uses the least restrictive rule (i.e., Allow).
Examples demonstrated in the source PDF include /p vs / and /page vs /*.htm; a worked sketch follows below.
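A worked sketch of those two cases under the longest-match rule above (each group is shown as if it were its own robots.txt file; the Allow/Disallow pairing and results assume Google’s documented matching behavior):

# Case 1: crawler requests /page
User-agent: *
Allow: /p
Disallow: /
# Result: /page is crawlable (Allow: /p is the longer, more specific match than Disallow: /)

# Case 2: crawler requests /page.htm
User-agent: *
Allow: /page
Disallow: /*.htm
# Result: /page.htm is blocked (Disallow: /*.htm matches more of the URL than Allow: /page)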
6) High-risk mistakes to avoid (with safer alternatives)
- Leaving “Disallow: /” on production after launch
○ Keep staging behind a password (e.g., HTTP auth) so you can safely ship the same robots.txt to production.
- Trying to block hostile scrapers via robots.txt
○ Bad actors ignore robots.txt. Use firewalls, IP/user-agent blocking, rate limiting, or bot management.
- Listing secret directories in robots.txt
○ This advertises where your private content lives. Use authentication. noindex meta tags or X-Robots-Tag headers are band-aids, not a substitute for security.
- Accidental over-blocking with broad prefixes
○ Disallow: /admin also blocks /administer-medication…. Safer pair:
Disallow: /admin$
Disallow: /admin/
○ ($ blocks exactly /admin, while the second line blocks the folder.)
- Forgetting the User-agent line
○ Rules won’t apply without it. Also note that a crawler obeys only the most specific User-agent group that matches it, so if you add a group for a specific bot, repeat any shared rules inside that bot’s block.
- Case sensitivity
○ Paths are case sensitive. To block all variants, list each case explicitly.
- Trying to control other subdomains from one robots.txt
○ Each subdomain needs its own robots.txt at its own root.
- Using robots.txt as “noindex”
○ Disallow ≠ de-index. Use a meta noindex tag or an X-Robots-Tag header for reliable removal from search; a minimal sketch follows below. The source PDF clarifies this explicitly: Google may still index a disallowed URL if it is discovered elsewhere.
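A minimal sketch of the two noindex forms mentioned above. Remember that the crawler must be able to fetch the page to see either signal, so don’t also disallow that URL in robots.txt.

<!-- In the page's <head> -->
<meta name="robots" content="noindex">

# Or sent as an HTTP response header (useful for PDFs and other non-HTML files):
X-Robots-Tag: noindex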
7) Copy-paste templates for real-world scenarios
A) Staging site (block all crawling)
User-agent: *
Disallow: /
(Protect the staging host with a password too, as sketched below, and remove the block before go-live.)
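One way to do the password protection, assuming an nginx-served staging host (the hostname and file paths are placeholders):

# nginx sketch: require HTTP Basic Auth on the whole staging vhost
server {
    server_name staging.example.com;

    auth_basic           "Staging";
    auth_basic_user_file /etc/nginx/.htpasswd;   # created with htpasswd or openssl

    location / {
        # ... normal root/proxy configuration ...
    }
}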
B) Live site: block internal search pages & session IDs
User-agent: *
Disallow: /search?s=
Disallow: /*?
Allow: /*?$
(Blocks parameterized duplicates but allows the rare “ends-with-?” canonical.)
C) Allow everything except an admin area (with exact page preserved)
User-agent: *
Disallow: /admin/
Disallow: /admin$
Allow: /admin/healthcheck.html
(Blocks the folder and the exact /admin URL, but lets a single page through.)
D) Images: remove .jpg from Google Images only
User-agent: Googlebot-Image
Disallow: /images/*.jpg$
(Prevents their appearance in Image Search.)
E) Multi-subdomain setup
● https://example.com/robots.txt → rules for example.com
● https://blog.example.com/robots.txt → rules for blog.example.com
● https://store.example.com/robots.txt → rules for store.example.com
(You cannot control subdomains from the main host’s robots.txt; a serving sketch follows below.)
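If several subdomains are served from the same machine, you can still give each host its own file; a hedged nginx sketch (file paths are placeholders):

# nginx sketch: serve a different robots.txt per subdomain
server {
    server_name blog.example.com;
    location = /robots.txt { alias /var/www/robots/blog.robots.txt; }
    # ... rest of the blog vhost ...
}

server {
    server_name store.example.com;
    location = /robots.txt { alias /var/www/robots/store.robots.txt; }
    # ... rest of the store vhost ...
}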
F) Blocking specific parameter “id=” (first vs later param)
User-agent: *
Disallow: /*?id=
Disallow: /*&id=
(Use both to be safe.)
G) Block literal $ anywhere
User-agent: *
Disallow: /*$*
(Do not use /*$ unless you intend to block everything.)
8) Debugging & auditing checklist
● Confirm location: https://{host}/robots.txt loads and is publicly readable.
● Validate syntax: every group begins with User-agent. All paths start with /.
● Scan for risky patterns: overly broad prefixes (e.g., /adm) that might catch unrelated pages (e.g., /administer…). Use $ and explicit folder slashes.
● Check wildcards: remember * is greedy; $ pins “end of URL.” Avoid a trailing * after bare paths because /fish and /fish* behave the same.
● Conflict resolution: if a URL matches multiple rules, the longest path wins; if they tie, Allow wins. Test suspect URLs against your rule set (see the sketch after this list).
● Don’t rely on robots.txt to hide content: for sensitive assets, use authentication; for de-indexing, use noindex (meta or HTTP header).
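For quick local checks, URLs can also be tested against the published file with Python’s standard library. A sketch (the host and URLs are hypothetical); note that urllib.robotparser does not implement the * and $ wildcard extensions, so double-check wildcard rules in Search Console as well:

# Check whether URLs are crawlable under the live robots.txt (sketch)
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # hypothetical host
rp.read()

for url in (
    "https://example.com/junk/other_useless_file.html",
    "https://example.com/public/index.html",
):
    verdict = "allowed" if rp.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)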
9) Quick “Do / Don’t” recap
Do
● Put robots.txt in the root of every host/subdomain you control.
● Use * and $ deliberately for precise matching.
● Use paired rules for tricky params (?id= and &id=).
● Prefer meta/X-Robots-Tag noindex for removal from search results.
Don’t
● Put secrets in robots.txt (it advertises them).
● Expect bad crawlers to obey robots.txt (see the server-side sketch after this list).
● Forget the User-agent line or the leading / in paths.
● Try to control other subdomains from one robots.txt.
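A sketch of the server-side alternative for bad crawlers, assuming nginx (the user-agent names and limits are illustrative only):

# nginx sketch: refuse a known-bad User-Agent and rate-limit everyone else
# (the limit_req_zone directive belongs in the http {} context)
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    if ($http_user_agent ~* "BadBot|EvilScraper") {
        return 403;
    }

    location / {
        limit_req zone=perip burst=10 nodelay;
        # ... normal site configuration ...
    }
}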
In Details
What is robots.txt?
● It’s a simple text file that lives at the root of your site (https://example.com/robots.txt).
● Purpose: tells search engine crawlers which parts of your site they can/can’t crawl.
● Important: it only controls crawling, not indexing.
○ Example: if someone links to a blocked page, Google may still index the URL, but won’t know what’s inside (no title/snippet).
Checklist
● Place robots.txt only at the site root.
● One robots.txt file per domain or subdomain.
● Use it for crawl management, not for security.
Structure of robots.txt
Every robots.txt file is built from two parts:
- User-agent: → which bot(s) the rules apply to (e.g., Googlebot, Bingbot, or * for all).
- Allow: / Disallow: → which URL paths are allowed or blocked.
Checklist
● Always start rules with User-agent.
● Paths must begin with /.
● Use wildcards (* and $) for flexible rules.
Basic examples
1. Allow everything
User-agent: *
Disallow:
(Empty disallow means: “crawl it all.”)
2. Block entire site
User-agent: *
Disallow: /
(Useful on staging sites – don’t forget to remove it before going live!)
3. Block one folder
User-agent: *
Disallow: /private/
4. Block one page
User-agent: *
Disallow: /secret.html
Checklist
● / means everything.
● /folder/ means that folder and everything inside it.
● /file.html means just that file.
Advanced rules with wildcards
● * = matches anything.
● $ = matches the end of the URL.
Example: Block all GIFs
User-agent: *
Disallow: /*.gif$
Example: Block URLs with query strings
User-agent: *
Disallow: /*?
Example: Block id= parameter everywhere
User-agent: *
Disallow: /*?id=
Disallow: /*&id=
Checklist
● Use $ when you want to block only URLs that end a certain way.
● Use paired rules (?id= and &id=) for parameters.
Allow vs Disallow conflicts
When both apply:
- The longest rule wins (most specific).
- If the lengths tie, Allow wins.
Example:
User-agent: *
Allow: /goodpage.html
Disallow: /goodpage
Result: /goodpage.html is allowed (longer path match).
Checklist
● Always test overlapping rules.
● Use Allow to “open” exceptions inside blocked folders.
Common mistakes to avoid
- Using robots.txt as a “noindex” tool
○ Wrong: Disallow: /private/ doesn’t guarantee removal from Google.
○ Right: use <meta name="robots" content="noindex"> or the HTTP X-Robots-Tag header.
- Blocking your whole live site by accident
○ Many devs forget to remove Disallow: / after staging → the site disappears from Google.
- Listing sensitive URLs
○ Hackers read robots.txt to find /admin/ or /private/.
○ Protect with authentication, not robots.txt.
- Forgetting case sensitivity
○ /Admin/ ≠ /admin/. To cover every variant, list each one (see the snippet after this list).
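For example, to cover the common case variants of an admin folder (list only the variants that actually exist on your site):

User-agent: *
Disallow: /admin/
Disallow: /Admin/
Disallow: /ADMIN/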
Checklist
● Never rely on robots.txt for hiding content.
● Double-check Disallow: / isn’t live.
● Handle sensitive data with passwords or server restrictions.
Debugging & Testing
● Use Google Search Console’s robots.txt report/tester (or the Bing Webmaster Tools equivalent).
● Paste suspicious URLs to see if they’re blocked.
● Always validate syntax:
○ Paths start with /.
○ No missing User-agent.
○ Wildcards used correctly.
Checklist
● Test after every change.
● Monitor crawl stats in Search Console.
● Keep robots.txt simple → fewer errors.
Robots.txt Master Checklist
| Step | Action |
| 1 | Place robots.txt at root (one per domain/subdomain) |
| 2 | Start with User-agent lines |
| 3 | Use /, /folder/, /file.html correctly |
| 4 | Apply wildcards * and $ only when needed |
| 5 | Use Allow to make exceptions |
| 6 | Don’t block sensitive areas you want secure |
| 7 | Don’t use robots.txt for de-indexing (use noindex) |
| 8 | Test in Google Search Console |
| 9 | Remove staging Disallow: / before go-live |
Real-world robots.txt templates
1. E-Commerce Site Robots.txt
Goal: Allow products & categories; block duplicate filters, cart, checkout, and search pages.
User-agent: *
# Block cart, checkout, account
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
# Block search and filters (avoid crawl waste)
Disallow: /search?
Disallow: /*?filter=
Disallow: /*?sort=
# Block staging/preview URLs
Disallow: /staging/
# Allow all product & category pages
Allow: /products/
Allow: /category/
Why:
● Prevents crawling of duplicate query parameters (filter, sort).
● Protects private areas (cart, checkout).
● Keeps product & category pages indexable.
2. Blog / News Site Robots.txt
Goal: Allow articles, block duplicate tags/search, control archives.
User-agent: *
# Block admin login area
Disallow: /wp-admin/
# Block search pages
Disallow: /?s=
# Block tags and archives to avoid duplicate content
Disallow: /tag/
Disallow: /archive/
# Allow everything else
Allow: /
Why:
● Avoids duplicate tag/archive pages in SERPs.
● Focuses crawler attention on posts and categories.
3. Local Business Site Robots.txt
Goal: Simple – allow service pages, block junk.
User-agent: *
# Block backend and private pages
Disallow: /admin/
Disallow: /login/
# Allow service pages, blog, contact, etc.
Allow: /
Why:
● Local sites are small, so keep it minimal.
● Blocks admin & login areas but leaves everything else crawlable.
4. Multi-Subdomain Setup
Goal: Each subdomain needs its own robots.txt.
example.com/robots.txt (main site)
User-agent: *
Disallow: /private/
Allow: /
blog.example.com/robots.txt
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/
Disallow: /archive/
Allow: /
store.example.com/robots.txt
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?sort=
Allow: /products/
Why:
● Google treats each subdomain separately.
● You can tailor rules per subdomain for crawl efficiency.