Using robots.txt files

What is a robots.txt file?

A robots.txt file lets you tell search engine crawlers and bots, via the Robots Exclusion Protocol, which URLs of your website should or should not be accessed. You can use it in combination with a sitemap.xml file and Robots meta tags for more granular control over which parts of your website get crawled. The robots.txt file should be located in the root directory of the website, so that it is reachable at the top level of your domain (for example, https://example.com/robots.txt).

Important: The rules in the robots.txt file rely on the voluntary compliance of the crawlers and bots visiting your website. If you wish to fully block access to specific pages or files on your website or prevent specific bots from accessing your website, you should consider using an .htaccess file instead. Various examples of applying such restrictions are available in our How to use .htaccess files article.

How to create a robots.txt file?

Some applications, like Joomla, come with a robots.txt file by default, while others, like WordPress, may generate the robots.txt file dynamically. Dynamically generated robots.txt files do not exist as physical files on the server, so how you edit them depends on the specific software application you are using. For WordPress, you can use a plugin that handles the default robots.txt file or manually create a new robots.txt file.

You can create a robots.txt file for your website via the File Manager section of the hosting Control Panel. Alternatively, you can create a robots.txt file locally in a text editor of your choice, and after that, you can upload the file via an FTP client. You can find step-by-step instructions on how to set up the most popular FTP clients in the Uploading files category from our online documentation.

What can you use in the robots.txt file?

The rules in the robots.txt file are defined by directives. The following directives are supported for use in a robots.txt file (a sample file that combines them is shown after the list):

  • User-agent - defines the User-Agent of the bot/crawler that the rule applies to. Multiple User-Agents can be listed one after the other if they should be allowed or disallowed to access the same directories. The * (asterisk) character can be used as a wildcard character.
  • Allow/Disallow - defines whether the above User-Agent(s) are allowed or disallowed to access a specific directory of the domain. Access limitations for multiple directories can be set by adding a separate line for each directory. The * (asterisk) character can be used as a wildcard character. The $ (dollar sign) character can be used to mark the end of a directory/URL.
  • Crawl-delay (optional) - instructs crawlers and robots to slow down their crawl rate. The directive accepts whole numbers from 1 to 30 as values, indicating the delay in seconds. This optional directive can help prevent aggressive crawling of the website, as long as the bot obeys the Crawl-delay directive (some major crawlers, such as Googlebot, ignore it).
  • Sitemap (optional) - provides the location of the sitemap file to search engine crawlers. If there are separate websites located in subdirectories of the main domain, a separate Sitemap directive can be provided for each subdirectory.
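
For illustration, here is a sample robots.txt file that combines all four directives. The bot name ExampleBot, the /private/ directory, and the example.com sitemap URLs are placeholders - replace them with values that match your own website:

# Slow down one specific bot and keep it out of /private/
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 10

# All other bots may access everything
User-agent: *
Disallow:

# Sitemaps for the main site and for a site in a subdirectory
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog/sitemap.xml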

Path values in the Allow and Disallow directives are case-sensitive, so you need to enter them with the correct capitalization - for example, /Directory/ and /directory/ are treated as two different paths. User-Agent values, on the other hand, are matched case-insensitively by crawlers that follow the Robots Exclusion Protocol, so "Googlebot" and "GoogleBot" refer to the same bot.
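
As a brief illustration of the path case-sensitivity, here is a minimal sketch assuming your website has a hypothetical /Private/ directory:

User-agent: *
# Blocks /Private/page.html but does NOT block /private/page.html
Disallow: /Private/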

Comments in the robots.txt file

You can use the # (hashtag) character to add comments for better readability by humans. Everything after a # character is treated as a comment when the # is at the start of a line or follows a properly defined directive and is separated from it by a space. Examples of valid and invalid comments can be found below:

# This is comment on a new line.
User-agent: * # This is a comment after the User-agent directive.
Disallow: / # This is a comment after the Disallow directive.

// This is not a valid comment.
User-agent: Bot# This is not a valid comment, as it is not separated by a space from the User-agent value. Depending on the parser, the matched User-Agent may be treated as "Bot#" instead of "Bot".


The robots.txt file should be in UTF-8 encoding. The Robots Exclusion Protocol requires crawlers to parse at least 500 KiB of a robots.txt file. Google, for example, enforces a 500 KiB limit and ignores any content beyond it, so you should keep the size of your robots.txt file below that limit.

What is the default content of the robots.txt file?

The content of the robots.txt file will depend on your website and the applications/scripts you are using on it. By default, all User-Agents are allowed to access all pages of your website unless there is a custom robots.txt file with other instructions.

Joomla

You can find the default content of the robots.txt for Joomla in its official documentation. It looks like this:

User-agent: *
Disallow: /administrator/
Disallow: /api/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/


WordPress

The default WordPress robots.txt file has the following content:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://YOUR_DOMAIN.com/wp-sitemap.xml


Examples

You can find sample uses of robots.txt files listed below:

How to inform robots they can fully access your website?

If there is no robots.txt file available for your website, all User-Agents are allowed to access all pages of your domain by default. You can state this explicitly with the following rule:

User-agent: *
Allow: /

The same can be achieved by specifying an empty Disallow directive:

User-agent: *
Disallow:


How to inform all robots they should not access your website?

To disallow all robots access to your website, you can use the following content in your website's robots.txt file:

User-agent: *
Disallow: /


How to inform robots they can fully access your website except for a specific directory/file?

You can inform bots to crawl your website with the exception of a specific directory or a specific file by using one of the following rules in your robots.txt file:

# Disallow crawling of a whole directory
User-agent: *
Disallow: /directory/

# Disallow crawling of a single file
User-agent: *
Disallow: /directory/file.html


How to allow full access to your website to all robots except for a specific one?

If you would like to allow all robots except one to crawl your website, you can use this code block in your website's robots.txt file:

User-agent: NotAllowedBot
Disallow: /

Make sure you replace the NotAllowedBot string with the name of the actual bot you want the rule to apply to.

How to allow full access to your website only to a single robot?

To instruct all bots but one to not crawl the website, use this code block:

User-agent: AllowedBot
Allow: /

User-agent: *
Disallow: /

Make sure you replace the AllowedBot string with the name of the actual bot you want the rule to apply to.

How to disallow robots from accessing specific file types?

Adding the following code block to your robots.txt file will instruct compliant robots to not crawl .pdf files:

User-agent: *
Disallow: /*.pdf$


How to disallow robots from accessing URLs with a dollar sign?

The $ (dollar sign) character has a special meaning in robots.txt files - when it is the last character of an Allow/Disallow value, it marks the end of the matched URL. Anywhere else in the value, it is treated as a literal character. To add a rule that takes effect for URLs containing the $ character, you can therefore use the following pattern in the Allow/Disallow directive:

User-agent: *
Disallow: /*$*

If you add the value without the ending wildcard character:

User-agent: *
Disallow: /*$

the $ will be the last character of the value and will match the end of every URL, so all compliant bots will be prevented from accessing the entire website, and this could seriously affect your website's SEO ranking.