Crawl statistics

The Yandex indexing robot regularly crawls site pages and loads them into the search database. The robot can fail to download a page if it is unavailable.

Yandex.Webmaster lets you know which pages of your site are crawled by the robot. You can view the URLs of the pages the robot failed to download because the hosting server was unavailable or because of errors in the page content.

Information about pages is available on the Indexing → Crawl statistics page in Yandex.Webmaster. The information is updated daily within six hours after the robot visits the page.

By default, the service provides data on the site as a whole. To view the information about a certain section, choose it from the list in the site URL field. Available sections reflect the site structure as known to Yandex (except for the manually added sections).

If the list doesn't contain any pages that should be included in the search results, use the Reindex pages tool to let Yandex know about them.

You can download the information about pages in XLS or CSV format. The file contains the data selected with the filters.

Note. The data is available starting from February 20, 2017.
  1. Page status dynamics
  2. Page changes in the search database
  3. List of pages crawled by the robot
  4. Data filtering

Page status dynamics

Page information is presented as follows:

  • New and changed — The number of pages the robot crawled for the first time and pages that changed their status after they were crawled by the robot.
  • Crawl statistics — The number of pages crawled by the robot, with the server response code.

Page changes in the search database

Changes are displayed if the HTTP response code changed when the robot accessed the page again. For example, 200 OK becomes 404 Not Found. If only the page content changed, this isn't shown in Yandex.Webmaster.
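To make the Was → Currently comparison concrete, here is a minimal sketch of how such a status change could be detected from outside the service. It uses only Python's standard library; the URL and the stored previous status are placeholder assumptions, and this is not the Yandex robot's actual logic.

    # Sketch: detect a change in a page's HTTP response code between
    # two checks (the idea behind the Was / Currently columns).
    import urllib.request
    import urllib.error

    def fetch_status(url: str) -> int:
        """Return the HTTP status code for url, following redirects."""
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.status
        except urllib.error.HTTPError as err:
            return err.code  # 4xx/5xx responses raise HTTPError

    # Placeholder: the status recorded during an earlier check.
    previous = {"https://example.com/page": 200}

    for url, was in previous.items():
        currently = fetch_status(url)
        if was != currently:
            print(f"{url}: {was} -> {currently}")  # e.g. 200 -> 404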

To view the changes, set the option to Recent changes. Up to 50,000 changes can be displayed.

Yandex.Webmaster shows the following information about the pages:

  • The date when the page was last visited by the robot (the crawl date).
  • The page path from the root directory of the site.
  • The server response code received during the crawl.

Based on this information, you can find out how often the robot crawls the site pages. You can also see which pages were just added to the database and which ones were re-crawled.

Pages added to the search database

If a page is crawled for the first time, the Was column displays the N/a status, and the Currently column displays the server response (for example, 200 OK).

After the page is successfully loaded into the search database, it can appear in the search results once the database is updated. Information about it is shown on the Pages in search page.

Pages reindexed by the robot

If the robot crawled the page before, its status can change when it is re-crawled: the Was column shows the server response received during the previous visit, and the Currently column shows the response received during the last crawl.

Assume that a page included in the search became unavailable to the robot. In this case, it is excluded from the search. After some time, you can find it in the list of excluded pages on the Pages in search page.

A page excluded from the search can stay in the search database so that the robot can check its availability. Usually the robot keeps requesting the page as long as other pages link to it and it isn't prohibited in the robots.txt file.

List of pages crawled by the robot

To view the list of pages, set the option to All pages. The list can contain up to 50,000 pages.

You can view the list of site pages crawled by the robot and the following information about them:

  • The date when the page was last visited by the robot (the crawl date).
  • The page path from the root directory of the site.
  • The server response code received when the page was last downloaded by the robot.
Tip. If the list shows pages that have already been removed from the site or never existed, the robot probably finds links to them when visiting other resources. To stop the robot from accessing unnecessary pages, prohibit indexing with the Disallow directive in the robots.txt file.
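For example, a robots.txt file along these lines tells the robot to stop requesting a removed section (the /old-section/ path is just an illustration):

    User-agent: Yandex
    # Prohibit crawling of a section that no longer exists on the site
    Disallow: /old-section/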

Data filtering

You can filter the information about the pages and changes in the search database by any parameter (the crawl date, the page URL, the server response code) using the filter icon. Here are a few examples:

By the server response

You can create a list of pages that the robot crawled but failed to download because of the 404 Not Found server response.

You can display only the new pages that were unavailable to the robot. To do this, set the radio button to Recent changes.

Also, you can get the full list of pages that were unavailable to the robot. To do this, set the radio button to All pages.
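Because the same data can be downloaded in CSV format, such a selection can also be reproduced offline. Here is a minimal sketch in Python, assuming the export is saved as crawl_statistics.csv with columns named URL and Code (the file name and column names are assumptions; check the real export's header):

    # Sketch: list pages from a downloaded CSV export that returned 404.
    # The column names "URL" and "Code" are assumptions, not necessarily
    # the actual header of the Yandex.Webmaster export.
    import csv

    with open("crawl_statistics.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["Code"].strip().startswith("404"):
                print(row["URL"])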

By the URL fragment

You can create a list of pages with the URL containing a certain fragment. To do this, choose Contains from the list and enter the fragment in the field.

By the URL using special characters

You can use special characters to match the beginning of a string or a substring, and set more complex conditions using regular expressions. To do this, choose URL matches from the list and enter the condition in the field. You can add multiple conditions by putting each one on a new line. (A sketch after the character reference below shows how these conditions behave.)

For conditions, the following rules are available:

  • Match any of the conditions (corresponds to the “OR” operator).
  • Match all conditions (corresponds to the “AND” operator).
Characters used for filtering:

  • * — Matches any number of any characters. Example: display data for all pages that start with https://example.com/tariff/, including the specified page itself: /tariff/*
  • @ — The filtered results contain the specified string (but don't necessarily strictly match it). Example: display information for all pages with URLs containing the specified string: @tariff
  • ~ — The condition is a regular expression. Example: display data for pages with URLs that match a regular expression, such as all pages whose address contains the fragment table, sofa, or bed once or several times: ~table|sofa|bed
  • ! — Negative condition. Example: exclude pages with URLs starting with https://example.com/tariff/: !/tariff/*

Note. The * character can be useful when searching for URLs that contain two or more specific elements. For example, you can find news or announcements for a certain year: /news/*/2017/.

Filtering with these characters isn't case sensitive.

The @, !, and ~ characters can be used only at the beginning of the string. The following combinations are available:

  • !@ — Exclude pages with URLs containing the specified string. Example: exclude pages with URLs containing tariff: !@tariff
  • !~ — Exclude pages with URLs that match the regular expression.
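As promised above, here is a small sketch for trying out such conditions against your own URLs before entering them in the service. The filter's exact matching rules aren't published, so this is only an approximation built on Python's re and fnmatch modules, and it matches against the URL path only:

    # Sketch: approximate the filter conditions (*, @, ~, !) locally.
    import re
    from fnmatch import fnmatch
    from urllib.parse import urlparse

    def matches(condition: str, url: str) -> bool:
        path = urlparse(url).path
        negate = condition.startswith("!")
        if negate:
            condition = condition[1:]
        if condition.startswith("@"):       # @ : path contains the string
            hit = condition[1:] in path
        elif condition.startswith("~"):     # ~ : regular expression
            hit = re.search(condition[1:], path) is not None
        else:                               # plain pattern, * = any characters
            hit = fnmatch(path, condition)
        return not hit if negate else hit

    print(matches("/tariff/*", "https://example.com/tariff/basic"))       # True
    print(matches("!@tariff", "https://example.com/news/sport/2017/"))    # True
    print(matches("~table|sofa|bed", "https://example.com/shop/sofa-2"))  # True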

FAQ

I created a site, but it is still not indexed.

Perhaps too little time has passed since you created the website. To inform the robot about the website, add the website to Yandex.Webmaster and verify your rights to manage it. Also check whether there were any server failures. If a server error occurs, the Yandex robot stops indexing and makes another attempt the next time it crawls the website.

Yandex employees can't speed up the addition of pages to the search database.

How long do I have to wait until a site enters the search?

We don't forecast the website indexing timeframe and we can't guarantee that a website will be indexed. Usually, it takes from several days to two weeks from when the robot finds the website until the pages are shown in search results.

The number of requests on the “Crawl history” chart decreased or increased

The number of pages crawled by the Yandex robot may be higher or lower on different days. These changes don't affect site indexing or ranking in search results.

You are trying to download confidential information from my server. What should I do?

The robot follows links from other pages. This means that some other page contains links to the confidential sections of your website. You can either protect them with a password or prohibit the Yandex robot from indexing them in the robots.txt file. In either case, the robot won't download confidential information.
