Oct 23, 2020 | 13 min read

A robots.txt file is a text document that’s located in the root directory of a site that contains information intended for search engine crawlers about which URLs—that house pages, files, folders, etc.—should be crawled and which ones shouldn’t. The presence of this file is not compulsory for the operation of the website, but at the same time, its correct setup lies at the core of SEO.

The decision to use robots.txt was adopted back in 1994 as part of the Robots Exclusion Standard. According to the Google Help Center, the main purpose of the file is not to keep web pages out of search results, but to limit the number of requests that robots make to a site and thereby reduce the server load.

Generally speaking, the content of the robots.txt file should be viewed as a recommendation for search crawlers that defines the rules for website crawling. In order to access the content of any site’s robots.txt file, all you have to do is type “/robots.txt” after the domain name in the browser.
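As a small illustration of that rule, the Python sketch below derives the robots.txt address from any page URL. The function name robots_txt_url and the example.com address are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=7"))
```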

What is robots.txt used for?

The primary function of the document is to prevent the scanning of pages and resource files so that the crawl budget is allocated more efficiently. In the vast majority of cases, the robots.txt file hides information that does not provide website visitors with any value and does not affect SERP rankings.

Note: The crawl budget is the number of web pages that a search robot can crawl on a site within a given period. To use it more frugally, search robots should only be directed to the most important content of websites and blocked from accessing unhelpful information.

What pages and files are usually closed off via robots.txt

1. Pages containing personal data.

Personal data can include names and phone numbers that visitors indicate during registration, personal dashboards and profile pages, payment card numbers. For security reasons, access to such information should be additionally protected with a password.

2. Auxiliary pages that only appear after certain user actions.

Such pages typically include messages that clients see after successfully completing an order, client forms, and authorization or password recovery pages.

3. Admin dashboard and system files.

Internal and service files that website administrators or webmasters interact with.

4. Search and category sorting pages.

Pages that are displayed after a website visitor enters a query into the site’s search box are usually closed off from search engine crawlers. The same goes for the results users get when sorting products by price, rating and other criteria. Aggregator sites may be an exception.

5. Filter pages.

Results that are displayed with an applied filter (size, color, manufacturer, etc.) are separate pages and can be looked at as duplicate content. As a rule of thumb, SEO experts also prevent them from getting crawled, except for cases when they drive traffic for brand keywords or other target queries.

6. Files of a certain format.

Such files can include photos, videos, .PDF documents, JS files. With the help of robots.txt, you can restrict the scanning of individual or extension-specific files.
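The categories above can be sketched as a handful of Disallow rules and checked with the robots.txt parser from the Python standard library. All folder names and the example.com host are hypothetical. Note that urllib.robotparser does plain prefix matching, so extension-wide wildcard rules are left out of this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, roughly one per category from the list above.
RULES = """\
User-agent: *
Disallow: /account/
Disallow: /order-complete/
Disallow: /admin/
Disallow: /search/
Disallow: /filter/
Disallow: /downloads/price-list.pdf
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for path in ("/account/profile", "/search/?q=shoes", "/blog/robots-guide"):
    allowed = parser.can_fetch("*", "https://example.com" + path)
    print(path, "->", "crawl" if allowed else "skip")
```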

How to create a robots.txt file and where to put it?

Tools for setting up robots.txt

Since the document has a .txt extension, any text editor that supports UTF-8 encoding will be suitable. The easiest option is Notepad (Windows) or TextEdit (Mac).

You can also use a robots.txt generator tool that will generate a robots.txt file based on the specified information.


Document title and size

The name of the robots.txt file should look exactly like this, without the use of any capital letters. According to Google guidelines, the permitted document size is 500 KiB. Exceeding this limit can result in the search robot partially processing the document, not crawling the website at all, or, conversely, scanning the content of a website in its entirety.
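A minimal sketch of both requirements as a pre-upload check. The function name check_robots_file and the constant MAX_ROBOTS_BYTES are made up for this example; the 500 KiB figure is the limit quoted above.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB, the limit quoted above


def check_robots_file(name: str, content: bytes) -> list[str]:
    """Return a list of problems with the file name or size (empty if fine)."""
    problems = []
    if name != "robots.txt":
        problems.append(f"file must be named 'robots.txt', got {name!r}")
    if len(content) > MAX_ROBOTS_BYTES:
        problems.append(f"file is {len(content)} bytes, limit is {MAX_ROBOTS_BYTES}")
    return problems


print(check_robots_file("Robots.TXT", b"User-agent: *\n"))
```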

Where to place the file


The document must be located at the root directory of the website host and can be accessed via FTP. Before making any changes, it is recommended you download the robots.txt file in its original form.

robots.txt syntax and directives 

Now let’s take a closer look at the syntax of a robots.txt file that consists of directives (rules), parameters (pages, files, directories) and special characters, as well as the functions they perform.

General file contents requirements

1. Each directive must start on a new line and be formed according to the principle: one line = one directive + one parameter.

Wrong:
User-agent: * Disallow: /folder-1/ Disallow: /folder-2/

Correct:
User-agent: *
Disallow: /folder-1/
Disallow: /folder-2/

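2. File names that use alphabets other than Latin should be converted to ASCII, for example with a Punycode converter. Keep in mind that Punycode is formally defined for domain names, and that RFC 9309 compares paths in percent-encoded form, so it is worth testing how your target crawlers treat such rules.

Wrong:
Disallow: /φάκελος-με-επαφές/

Correct:
Disallow: /xn-----v8bgtvbb4blm8as0bi7an/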

3. Parameters are case-sensitive. If a folder name starts with a capital letter, writing it with a lowercase letter will misdirect the robot, and vice versa.

Wrong:
Disallow: /folder/

Correct:
Disallow: /Folder/

4. The use of a space at the beginning of a line, quotation marks or semicolons for directives is strictly prohibited.

Wrong:
Disallow: /folder-1/;
Disallow: /“folder-2”/

Correct:
Disallow: /folder-1/
Disallow: /folder-2/

5. An empty or inaccessible robots.txt file may be perceived by search engines as permission to crawl the entire site. In order to be processed successfully, the robots.txt file must return the 200 OK HTTP response status code.
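Rules 1 and 4 from the list above lend themselves to an automated check. The following is a rough sketch rather than a complete validator; the function name lint_robots_line and the set of recognized directive names are assumptions made for this example.

```python
import re

# Directive names this sketch knows about (an assumption, not a full list).
DIRECTIVE = re.compile(
    r"\b(user-agent|disallow|allow|sitemap|crawl-delay|clean-param)\s*:",
    re.IGNORECASE,
)


def lint_robots_line(line: str) -> list[str]:
    """Return the formatting problems found in one robots.txt line."""
    problems = []
    if not line.strip() or line.lstrip().startswith("#"):
        return problems  # blank lines and comments are fine
    if line[0] in " \t":
        problems.append("line starts with whitespace")
    if len(DIRECTIVE.findall(line)) > 1:
        problems.append("more than one directive on the line")
    if any(ch in line for ch in "\"'“”;"):
        problems.append("quotation mark or semicolon in the line")
    return problems


print(lint_robots_line("User-agent: * Disallow: /folder-1/ Disallow: /folder-2/"))
```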


robots.txt file symbols

Let’s dissect the main symbols contained in the file and find out what each one means.

The slash (/) is added after the command, before the name of the file or directory (folder, section). If you want to close the entire directory, you need to put another “/” after its name.

Disallow: /search/

Disallow: /standarts.pdf

The asterisk (*) is a wildcard that stands for any sequence of characters. In the User-agent line, it indicates that the rules apply to all search engine robots that visit the site.

User-agent: * means that the rules and conditions apply to all robots.

Disallow: /*videos/ means that all website links containing /videos/ will not be crawled.

The dollar sign ($) marks the end of a URL. A rule that ends with $ applies only to addresses that end exactly with the specified string, so the folder or file itself is inaccessible, while longer links that merely contain the specified name remain available.

Disallow: /*folder-1/$

The hash (#) marks any text after it as a comment and means that it won’t be taken into account by search robots.

#search robots will not see this information.
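To make the meaning of * and $ concrete, here is a minimal Python sketch that translates a rule path into a regular expression. It illustrates the matching idea only and is not the code any search engine runs; the function names rule_to_regex and rule_matches and the sample paths are assumptions.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path with * and $ into a compiled regex."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then let every * stand for "any characters".
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path: str, url_path: str) -> bool:
    # Rules are compared against the URL path from its first character.
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/*videos/", "/media/videos/cat.mp4"))
print(rule_matches("/*folder-1/$", "/a/folder-1/page.html"))
```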

robots.txt file directives

Differences in directives for different search engines

Let’s take a look at which directives are recognized by Google, Bing, Yahoo! and Yandex robots. You never know when it will come in handy.

DIRECTIVE     GOOGLE  BING  YAHOO!  YANDEX
User-agent    +       +     +       +
Disallow      +       +     +       +
Allow         +       +     +       +
Sitemap       +       +     +       +
Crawl-delay   -       +     +       +
Clean-param   -       -     -       +

As you can see, the main robots.txt directives are recognized by Google, Bing, Yahoo! and Yandex robots alike. The exceptions are crawl-delay, which Google does not support, and clean-param, which is only recognized by Yandex.

The user-agent is a mandatory directive that defines the search robot for which the defined rules apply. If there are several bots, each rule group starts with this command.

Example

User-agent: * means that the instructions apply to all existing robots.

User-agent: Googlebot means that the rules in the group apply to Google’s robot.

User-agent: Bingbot means that the rules in the group apply to Bing’s robot.

User-agent: Slurp means that the rules in the group apply to Yahoo!’s robot.
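How rule groups are picked per robot can be tried out with Python’s built-in parser. In this sketch, the rules, the SomeOtherBot name and the example.com host are hypothetical. Each group is written without blank lines between its User-agent line and its rules, since this particular parser treats a blank line as the end of a group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for Googlebot, one for every other robot.
RULES = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot gets its own group; any other robot falls back to the * group.
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))
print(parser.can_fetch("Googlebot", "https://example.com/drafts/a"))
print(parser.can_fetch("SomeOtherBot", "https://example.com/blog/"))
```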

Disallow is a key command that instructs search engine bots not to scan a page, file or folder. The names of the files and folders that you want to restrict access to are indicated after the “/” symbol.

Example 1. Specifying different parameters after Disallow.

disallow: /link to page disallows access to a specific URL.

disallow: /folder name/ closes access to the folder.

disallow: /image/ closes access to the image.

disallow: /. The absence of any instructions after the “/” symbol indicates that the site is completely closed off from scanning which can come in handy during website development.

Example 2. Disabling scanning of all .PDF files on the site.

User-agent: Googlebot

Disallow: /*.pdf

In the robots.txt file, Allow performs the opposite function of Disallow, granting access to website content. Both commands are usually used in conjunction, for example, when you need to open access to a certain piece of information like a photo in a hidden media file directory.

Example. Using Allow to scan one image in a closed album.

Specify the Allow directive with the image URL, and on another line the Disallow directive with the name of the folder where the file is located.

Allow: /album/picture1.jpg

Disallow: /album/
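The same pair of rules can be verified with Python’s standard robots.txt parser. This is a sketch under assumptions: the example.com host and picture2.jpg are hypothetical. The Python parser applies the first rule that matches, which is why Allow is listed first here; Google, by contrast, documents that it picks the most specific (longest) matching rule.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Allow: /album/picture1.jpg
Disallow: /album/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Only picture1.jpg is open; everything else in /album/ stays closed.
print(parser.can_fetch("*", "https://example.com/album/picture1.jpg"))
print(parser.can_fetch("*", "https://example.com/album/picture2.jpg"))
```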

The sitemap command in robots.txt shows the path to the sitemap. The directive can be omitted if the sitemap has a standard name, is located in the root directory and is accessible via the link “site name”/sitemap.xml, similar to the robots.txt file.

Example

Sitemap: https://website.com/sitemap2020.xml
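For reference, a crawler written in Python could read this directive with the standard library. A minimal sketch, assuming Python 3.8 or later (where RobotFileParser.site_maps() is available); the Disallow rule is a hypothetical filler.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /search/
Sitemap: https://website.com/sitemap2020.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the listed sitemap URLs, or None when there are none.
print(parser.site_maps())
```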

The Crawl-delay directive helps avoid overloading the server: it tells search robots the recommended number of seconds to wait between page requests. However, nowadays search engines typically crawl pages with a 1 or 2 second delay anyway. It should be stressed that Google does not take this directive into account.

Example

User-agent: Bingbot

Crawl-delay: 2
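A polite crawler can read this value before scheduling its requests. The sketch below uses Python’s standard parser and assumes the Bingbot user-agent token; the Googlebot lookup is there only to show what happens when no group applies.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Bingbot
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds, or None if no group applies.
print(parser.crawl_delay("Bingbot"))
print(parser.crawl_delay("Googlebot"))
```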

When should the robots meta tag be used

If you want to hide site content from search results, using the robots.txt file won’t be enough. Robots are instructed not to index pages with the help of the robots meta tag, which is added to the <head> section of a page’s HTML code. The noindex directive indicates that the page content must not be indexed. Another way of limiting page indexing is to send the X-Robots-Tag HTTP response header, which is set in the server’s configuration file. In both cases the page must remain crawlable, because a robot that is blocked by robots.txt never sees the noindex instruction.

Example for closing at the page level

<head>
<meta name="robots" content="noindex">
</head>
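To confirm that a page really carries the tag, its HTML can be scanned with Python’s built-in HTML parser. This is a minimal sketch; the class name RobotsMetaParser and the function has_noindex are made up for the example, and it only looks at the generic "robots" meta tag, not at bot-specific ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # The content attribute may hold several comma-separated values.
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


print(has_noindex('<head><meta name="robots" content="noindex"></head>'))
```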

What types of search crawlers are there?

A search crawler is a special type of program that scans web pages and adds them to a search engine’s database. Google has several bots that are responsible for different types of content.

  • Googlebot: crawls websites for desktop and mobile devices
  • Googlebot Image: displays site images in the “Images” section
  • Googlebot Video: scans and displays videos
  • Googlebot News: selects the most useful and high-quality articles for the “News” section
  • Adsense: ranks a site as an ad platform in terms of ad relevance

The full list of Google robots (user agents) is available in the official Help documentation.

The following robots are relevant for other search engines: Bingbot for Bing, Slurp for Yahoo!, Baiduspider for Baidu, and the list does not end there. There are over 300 various search engine bots.

In addition to search robots, the site can be crawled by crawlers of analytical resources, like Ahrefs or Screaming Frog. The work of their software solutions is based on the same principle as search engines: parse URLs to add them to their own database.

Bots that should be blocked from accessing sites:

  • Malicious parsers (spambots that collect customer email addresses, viruses, DoS and DDoS attacks, and others);
  • Bots of other companies that monitor information for further use for their own purposes (prices, content, SEO methods, etc.).

If you decide to close the site from the aforementioned robots, it is better to use the .htaccess file instead of robots.txt. The .htaccess method is safer, since it restricts access at the server level rather than as a recommendation.

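SetEnvIfNoCase User-Agent "bot name-1" search_bot

SetEnvIfNoCase User-Agent "bot name-2" search_bot

The commands must be specified at the bottom of the .htaccess file, with each robot on a separate line. Note that SetEnvIfNoCase only marks matching requests with the search_bot variable; a deny rule that refers to this variable (for example, Deny from env=search_bot in Apache 2.2 syntax) is needed to actually refuse them.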
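To illustrate what server-level matching means, here is a minimal Python sketch of the case-insensitive User-Agent test that the lines above describe. It is not Apache configuration and not a replacement for it; the function name is_blocked is an assumption, and the bot names are the same placeholders as above.

```python
import re

# Placeholder patterns, matched case-insensitively like SetEnvIfNoCase.
BLOCKED_AGENTS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in ("bot name-1", "bot name-2")
]


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a blocked pattern."""
    return any(p.search(user_agent) for p in BLOCKED_AGENTS)


print(is_blocked("Mozilla/5.0 (compatible; Bot Name-1/2.0)"))
```

Unlike a robots.txt rule, a check of this kind is enforced by the server, so the robot cannot simply ignore it.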

Example of robots.txt content

A template of a file with up-to-date directives will help you create the robots.txt file properly, pointing out the required robots and restricting access to the relevant site files.

User-agent: [bot name]

Disallow: /[path to file or folder]/

Disallow: /[path to file or folder]/

Disallow: /[path to file or folder]/

Sitemap: [Sitemap URL]
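If the file is generated rather than typed by hand, the template can be filled in programmatically. A minimal Python sketch, in which the function name build_robots_txt, the paths and the sitemap URL are all hypothetical:

```python
def build_robots_txt(user_agent: str, disallow: list[str], sitemap: str) -> str:
    """Fill in the template: one group of Disallow rules plus a sitemap."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow]
    lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


print(build_robots_txt("*", ["/admin/", "/search/"], "https://example.com/sitemap.xml"))
```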

Now let’s see several examples of what the robots.txt file looks like on different websites.

Here’s a minimalistic version:

WizzAir robots.txt

In the following example, we see a list of website directories, which are closed for scanning. For some bots, separate groups have been created that generally prohibit crawling the website (Adsbot-Google, Mediapartners-Google):

Walmart robots.txt

How to check your robots.txt file

Sometimes errors in the robots.txt file can lead not only to the exclusion of important pages from the index, but also to the entire site becoming practically invisible to search engines.

The robots.txt file check option is missing in the new Google Search Console interface. Now you can check the indexing of pages individually (Check URL) or send requests to delete URLs (Index – Removals). The Robots.txt Tester tool can be accessed directly.

robots.txt Tester
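A quick local check is also possible before the file goes live. The sketch below uses Python’s standard parser to list which important URLs a draft robots.txt would block; the function name blocked_urls and the sample rules and URLs are assumptions. Since this parser only does prefix matching, rules with * or $ are better verified with the search engine’s own tools.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs from the list that the given robot may not crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


draft = "User-agent: *\nDisallow: /private/\n"
important = ["https://example.com/", "https://example.com/private/report"]
print(blocked_urls(draft, important))
```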

How else can robots.txt be used?

The contents of the robots.txt file can include more than just a list of directives for search engines. Since the file is publicly available, some companies are creative and humorous in their creation. Sometimes you can find a picture, a brand logo and even a job offer. A custom robots.txt file is implemented with the help of # comments and other symbols.

This is what you’ll find in the robots.txt file of Nike:

Nike robots.txt

Users who are interested in a website’s robots.txt are most likely good at optimization. Therefore, the document can be an additional way of finding SEO specialists.

And here’s what you’ll find on TripAdvisor:

TripAdvisor robots.txt

And here’s a little doodle that was added to the website of the Etsy marketplace:

Etsy robots.txt doodle

Conclusions

To recap, here are a few important takeaways from this blog post that will help you cement your knowledge on robots.txt files:

  • The robots.txt file is a guideline for robots that tells them which pages should and shouldn’t be crawled.
  • The robots.txt file cannot be configured to prevent indexing, but you can increase the chances of a robot crawling or ignoring certain documents or files.
  • Hiding unhelpful website content with the disallow directive saves the crawl budget. This is true for both multi-page and small websites.
  • A simple text editor is enough to create a robots.txt file, and Google Search Console is enough to run a check.
  • The name of the robots.txt file must be in lowercase letters, and the file itself must not exceed 500 KiB.

Be sure to reach out to us via the comments section if you have any questions or feedback!
