Robots.txt files (a.k.a. the robots exclusion protocol or standard) provide a means of communicating with the various bots that crawl your site and its pages.
These bots are typically web crawlers, such as Googlebot, which look at the robots.txt file to learn what they should and shouldn’t crawl on your site.
Robots.txt files aren’t just for Googlebot, though; they’ll be ‘read’ by many other web crawlers from other search engines and services, as well as by web scrapers hunting your site for information.
It’s worth bearing in mind that the instructions contained in the robots.txt file aren’t enforced; they’re not legally binding (though disregarding them could still result in breaking the law or violating copyright, etc.).
Robots.txt files simply nudge bots in the right direction and politely ask them not to crawl certain parts of your site.
A robots.txt file is created by default on most WordPress sites, as well as on sites built with Wix and other site builders, and will automatically be set to allow web crawlers to crawl everything on your site.
With many site builders and platforms like Shopify, you can’t edit the robots.txt file directly (but you can use other techniques to control the crawling of your site).
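For reference, a fully permissive robots.txt (effectively what these defaults amount to) is as simple as:
User-agent: *
Disallow:
An empty Disallow value blocks nothing, so every crawler is free to crawl the whole site.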
When To Adjust the Robots.txt File
If you’re an SEO beginner or the owner/designer of a small personal site or blog (that has already been published), you likely won’t need to adjust the robots.txt file for some time, but checking it is still good practice.
One potential use some advocate for the robots.txt file is preventing your site from being crawled whilst it’s still under construction.
This is ineffective if links point to the page(s), since blocked pages can still end up indexed; Google recommends using a noindex meta tag or password protection instead if you want to keep a page out of the index during construction/migration.
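For reference, a noindex directive is just a single meta tag placed in the page’s <head>:
<meta name="robots" content="noindex">
Note that crawlers need to be able to fetch a page in order to see this tag, which is another reason not to block the same page via robots.txt.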
You can also use a plugin that covers your site with an ‘under construction’ notice.
There are many other legitimate reasons to check and edit the robots.txt file, though:
Crawl Budget and Robots.txt
There are legitimate motives for editing the robots.txt file, and these can confer SEO benefits.
Most of these benefits are linked to a site’s crawl budget. Crawl budget refers to the resources Google, Bing, Yahoo, and other search engines allocate to crawling and indexing your site.
For example, whilst you might assume Google has an infinite army of bots ready to crawl each and every intricate detail of the internet at will, around the clock, this isn’t strictly true.
Even Google has finite resources (for now!), and they allocate these resources to sites based on their size/reputation/authority and other factors (many of which are not totally known or understood).
SEOs analyze crawl budgets by checking their site’s log files, which show how many pages crawlers are actually crawling on the site.
- Say a site has 100,000 pages, which is not particularly rare, even amongst low authority domains.
- For argument’s sake, say Googlebot or other crawlers allocate this site a crawl budget of 1,000 pages a day.
- That could mean 100 days passing before Googlebot crawls certain pages on this site. If you write a series of ten new, incredible blog posts for this site, they could be ignored for weeks, or even months! Googlebot simply isn’t allocating enough resources to crawling this (likely bloated) site to uncover new content quickly.
- This issue would be compounded by the presence of dynamic web pages that display different content every time they’re viewed.
In short, editing the robots.txt file directly influences how crawlers interact with your site, and therefore how your crawl budget gets spent.
Case Example: Robots.txt and an eCommerce Store
Before moving on to some robots.txt tips and tools, let’s run through a simple case study of when you should consider editing your robots.txt file.
eCommerce stores typically use dynamic pages and contain a lot of duplicate content.
Dynamic pages typically change in response to user interaction, such as when a user filters product categories or customizes a product. When a user filters a group of products, the resulting page is largely duplicate content; there’s nothing useful there for anyone apart from that user.
You can disallow bots from crawling duplicate and filtered content, or any other content that is practically useless for SEO and would otherwise occupy your crawl budget.
Content to potentially disallow from crawlers includes (see the example snippet after this list):
- Pages with duplicate content (often printer-friendly content)
- Dynamic products and service pages
- Pagination pages
- Admin pages and logins (e.g. wp-login)
- Shopping cart and user account pages
- Thank you pages
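As a sketch, the relevant rules for a store might look something like this (the paths are placeholders and will vary by platform):
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /thank-you/
Disallow: /wp-login.php
Filtered and other dynamic URLs usually need wildcard patterns (e.g. blocking anything containing ?filter=), which are covered under the advanced directives below.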
Whilst you might feel your site is far too small to be affected by crawl budget issues (and you’d probably be right), being aware of how you can edit the robots.txt file is still very important for the future.
If your site becomes bloated with content that is useless to Google, retrospectively fixing that is far more laborious than preparing your robots.txt file early on in your web design and SEO journey!
Additionally, you may have pages with sensitive information, copyrighted material, and other files you don’t want to be crawled. Editing the robots.txt file can help keep these files out of search engines.
Mostly, though, editing the robots.txt file is useful for managing crawl budget.
The Five Main Robots.txt Directives
There are five main robots.txt directives.
When you go to edit or generate your robots.txt, you’ll see/use some of the following (a combined example follows the list):
- User-Agent: This refers to the web crawler(s) you’re instructing. These typically include the major search engines, but there are actually hundreds of user agents, which you can look up on the robots.txt website.
- Allow: Aimed mainly at Googlebot, allow tells the bot that it can access a page or subfolder even if its parent directory is disallowed.
- Disallow: This command tells a user-agent not to crawl a URL or directory. You’ll need a separate Disallow line for each URL or directory.
- Crawl-Delay: Crawl-delay tells crawlers to wait a number of seconds between requests. This can stop the host from being overloaded during heavy crawling and is typically only useful on sites with many pages. Note that Googlebot ignores this directive, although crawl rate can be managed via Search Console.
- Sitemap: This directive points crawlers to your sitemap (you can submit your sitemap to Google through Search Console, but other search engines won’t have that information, so it’s worth listing it here).
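To make the syntax concrete, here’s a minimal sketch that uses all five directives (the paths and sitemap URL are placeholders):
User-agent: *
Crawl-delay: 10
Disallow: /account/
Allow: /account/help.html
Sitemap: https://www.example.com/sitemap.xml
Here the Allow line carves a single page out of an otherwise disallowed directory, and the Crawl-delay will only be respected by the bots that pay attention to it.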
When you edit your robots.txt, you’ll be providing it with directives on URLs and directories.
The robots.txt file will sit in the root of your site, e.g. for site www.example.com, the robots.txt file lives at www.example.com/robots.txt.
A simple example of a robots.txt directive would be:
User-agent: Googlebot
Disallow: /thank-you/
This tells Googlebot to disallow crawling at: www.example.com/thank-you/
Note, whilst some tools ask you to input ‘directories’ and others ‘URLs’, you can input either.
Directories will be useful if you want to tell bots to ignore specific folders or files within your site’s root, e.g. /wp-content/plugins/ (which contains likely useless data on WordPress plugins).
URLs are best for singling out specific pages that you don’t want bots to crawl (e.g. thank-you [for ordering, buying, your interest, etc] pages as above).
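As a quick illustration of the two approaches (both paths are just examples):
User-agent: *
Disallow: /wp-content/plugins/
Disallow: /thank-you.html
The first Disallow line blocks an entire directory; the second blocks one specific page.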
Advanced Robots.txt Controls/Directives
There are some advanced robots.txt directives that go beyond simply allowing and disallowing.
One key example is the wildcard, used to bulk-block files.
So, Disallow: /copyright-material/*.jpg would block all .jpg images located in the copyright-material directory.
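A couple of further sketches (the * wildcard and the end-of-URL marker $ aren’t part of the original standard, but major crawlers such as Googlebot and Bingbot support them; the paths are purely illustrative):
User-agent: *
Disallow: /*?filter=
Disallow: /*.pdf$
The first rule blocks any URL containing ?filter= (handy for faceted navigation), and the second blocks every URL ending in .pdf.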
Where To Put the Robots.txt File?
The robots.txt is placed at the root of your domain.
So, if your site is example.com, the robots.txt would be placed at: http://www.example.com/robots.txt.
You can locate this using the cPanel file manager (common for WordPress websites).
Once you open the file (literally a text file), you can write your new directives straight into it and save it.
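On a WordPress site, for example, the file typically starts out looking something like the default below; editing it is just a matter of adding your own lines (here, the /thank-you/ and Sitemap lines are manual additions, and the sitemap URL is a placeholder):
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /thank-you/
Sitemap: https://www.example.com/sitemap_index.xml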
Site builders like Wix have their own instructions for this, while Shopify and Squarespace don’t allow you to edit the file (though there are other options for hiding your site or its pages and preventing them from being indexed).
Some SEO suites and plugins such as Better Robots.txt and AIO SEO semi-automate the process of creating robots.txt files for your site.
Robots.txt Generator Tools
Small SEO Tools Robots.txt Generator
Small SEO Tools provides a fantastic array of free SEO tools, and its robots.txt generator does what it says on the tin (and is, of course, 100% free).
This is a simple robots.txt generator with a selection of 15 bots and crawlers that can be set to be allowed or refused from your site. You can point it to your site’s sitemap.
You can also set the default for all bots and choose crawl delay settings, which will apply to the bots that take notice of this directive (i.e. not Googlebot).
You’ll then be able to add restricted directories and/or URLs using the form at the bottom. The /cgi-bin is pre-filled (a commonly blocked directory that does not need to be crawled), but you can edit that field if you want to.
Features
- Allow/refuse 15 common bots/crawlers
- Add sitemap
- Block directories/URLs
- Crawl-delay
Ryte Robots.txt Generator
An intuitive robots.txt generator with step-by-step instructions. The Ryte Robots.txt Generator is an excellent little tool for quickly creating robots.txt files with a selection of 11 bots. It’s a bit strange not to see other search engines like Baidu or Yandex in the bot selection, but you can add these user agents yourself.
To block crawling of URLs or directories, simply input them into the fields. You can also allow or disallow all bots from crawling your site.
Features
- Good interface
- Add sitemap
- Block directories/URLs
SEOBook Robots.txt File Generator
Another simple robots.txt generator with allow/disallow all and per-bot control of basic robots.txt directives. There are 9 bots here and it’s straightforward to add URLs or directories to block.
A super-simple tool that lets you copy and paste your new robots.txt straight into the old one.
Features
- Simple interface
- Add sitemap
- Block directories/URLs
SEOptimer Free Robots.txt Generator
With a selection of 15 bots, this is one of the more complete robots.txt generators available. It also has crawl delay settings. You can allow/refuse certain bots or set all bots to allow/disallow by default.
The fields allow you to add the URLs or directories you don’t want to be crawled.
Features
- Allow/refuse 15 common bots/crawlers
- Add sitemap
- Block directories/URLs
- Crawl-delay
Internet Marketing Ninjas Robots.txt Generator Tool
With a comprehensive selection of 22 user agents, this is a quality tool for allowing/refusing crawling from many major bots. You can add URLs/directories to allow/disallow per bot.
A very simple, easy-to-use robots.txt generator tool with plenty of bots. Unfortunately, there are no crawl delay settings.
Features
- Allow/refuse 22 common bots/crawlers
- Add sitemap
- Block directories/URLs
LinkGraph Robots.txt Generator
With over 40 bots built into the tool, this is a comprehensive robots.txt generator. However, it only lets you add 5 URLs/directories to disallow, which is a minor letdown.
Still, though, with crawl delay and plenty of user agents in the list, this is yet another solid free robots.txt generator.
Features
- Allow/refuse 40+ common bots/crawlers
- Add sitemap
- Block directories/URLs
- Crawl-delay
Summary
Editing the robots.txt file is actually pretty straightforward!
Even as an SEO novice, it’s very handy to know what can and can’t be done with the robots.txt file.
As Google points out, using the robots.txt file to completely hide a site and its pages from the SERPs is ineffective compared to using a noindex meta tag.
The primary reason you’ll want to edit the robots.txt file is to prevent the crawling of site content that isn’t meant for public consumption and potentially eats into your crawl budget, or to control the activity of the individual user agents that crawl your site.