Speed and accuracy. Sampling

Calculating indicators for a large number of sessions (over a million) can take a considerable amount of time. To reduce the calculation time, you can use sampling: indicators are calculated for only part of the sessions instead of all of them.

For example, suppose you want to know the number of direct hits to the site. If you count the direct hits in 1/10 of all sessions and multiply the result by 10, you get the approximate total number of direct hits. You get the response 10 times faster, but the result is an approximation. Sampling lets you choose the balance between speed and accuracy.
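The scaling described above can be sketched in a few lines of Python. This is only an illustration on synthetic data: the function name estimate_total, the session format, and the 30% share of direct hits are assumptions made for the example, not part of Yandex Metrica.

```python
import random


def estimate_total(sessions, rate, predicate, seed=0):
    """Estimate how many sessions match `predicate` from a random sample.

    Each session is kept with probability `rate`; the count of matching
    sessions in the sample is then scaled up by 1 / rate.
    """
    rng = random.Random(seed)
    sampled = [s for s in sessions if rng.random() < rate]
    hits = sum(1 for s in sampled if predicate(s))
    return hits / rate


# Synthetic data (assumption): 100,000 sessions, roughly 30% direct hits.
data_rng = random.Random(42)
sessions = [
    {"source": "direct" if data_rng.random() < 0.3 else "other"}
    for _ in range(100_000)
]


def is_direct(session):
    return session["source"] == "direct"


exact = sum(1 for s in sessions if is_direct(s))
approx = estimate_total(sessions, rate=0.1, predicate=is_direct)

print(f"exact: {exact}, estimated from 1/10 of sessions: {approx:.0f}")
```

The estimate only has to scan about a tenth of the data, and it typically lands within a few percent of the exact count for a segment this large.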

However, randomly sampling a certain percentage of sessions has a number of disadvantages. Yandex Metrica accumulates the history of every user's actions, and when sessions are sampled, the relationship between the user and the session is lost. For example, this makes segmentation by user characteristics impossible. In addition, the number of unique users is almost always overestimated, because a user with several sessions is likely to have at least one of them in the sample.
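The overestimation of unique users is easy to reproduce on synthetic data. The sketch below assumes 1,000 hypothetical users with 10 sessions each and a 10% session sample; these numbers are chosen only for illustration.

```python
import random

rng = random.Random(1)

# Synthetic data (assumption): 1,000 users with 10 sessions each.
true_users = 1000
sessions = [(user_id, n) for user_id in range(true_users) for n in range(10)]

# Sample 10% of sessions at random, ignoring which user they belong to.
rate = 0.1
sampled = [s for s in sessions if rng.random() < rate]

# Count the distinct users seen in the sample and scale up by 1 / rate.
users_in_sample = len({user_id for user_id, _ in sampled})
naive_estimate = users_in_sample / rate

print(f"true users: {true_users}, naive estimate: {naive_estimate:.0f}")
```

Each user has a 1 - 0.9^10 ≈ 0.65 chance of appearing in the sample at least once, so the scaled-up count comes out several times larger than the true number of users.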

To avoid these problems, Yandex Metrica first creates a sample containing the set percentage of unique users and then calculates indicators based on the parameters of their sessions, which are distributed evenly over time.
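One common way to sample by user is to hash the user identifier and keep the users whose hash falls below the sampling rate. The sketch below shows that general technique; it is an assumption made for illustration, not a description of how Yandex Metrica selects users, and the in_sample function and the "user-N" identifiers are hypothetical.

```python
import hashlib


def in_sample(user_id: str, rate: float) -> bool:
    """Decide deterministically whether a user belongs to the sample.

    The user ID is hashed to a number in [0, 1); the user is kept when
    that number is below `rate`. All sessions of a kept user stay together.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Synthetic data (assumption): 1,000 users with 10 sessions each.
sessions = [(f"user-{u}", n) for u in range(1000) for n in range(10)]

rate = 0.1
sampled = [s for s in sessions if in_sample(s[0], rate)]
sampled_users = {user_id for user_id, _ in sampled}

# Both users and sessions can now be scaled up by the same factor.
estimated_users = len(sampled_users) / rate
estimated_sessions = len(sampled) / rate

print(f"estimated users: {estimated_users:.0f}")
print(f"estimated sessions: {estimated_sessions:.0f}")
```

Because a user is either fully in the sample or fully out of it, the link between users and their sessions is preserved, and the scaled-up user count is no longer systematically inflated.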

Sampling becomes available if the selected time period has more than 1.5 million sessions. In this case, the report shows an additional Accuracy control.



When you load the report or change the time period, Yandex Metrica automatically selects the accuracy so that the report is built in no more than a few seconds. After the report has loaded, you can change the ratio of speed to accuracy.



If you reduce the accuracy, the sample may end up containing none of the sessions, in which case the report will be empty. This can happen if, for example, you select a very narrow segment and/or a very long time period. In this case, it is a good idea to set the accuracy to 100% and check the indicators.
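To get a feel for why narrow segments produce empty reports, you can estimate the chance that no matching user is sampled. The sketch below uses a simplified model that is an assumption for illustration only: every user in the segment is included independently with a probability equal to the accuracy. The function name prob_empty_sample is hypothetical.

```python
def prob_empty_sample(matching_users: int, rate: float) -> float:
    """Probability that none of the matching users is in the sample.

    Simplified model (assumption): each user is included independently
    with probability `rate`.
    """
    return (1.0 - rate) ** matching_users


# A narrow segment with 20 matching users at 1% accuracy.
narrow = prob_empty_sample(20, 0.01)

# The same segment at 100% accuracy can never come back empty.
full = prob_empty_sample(20, 1.0)

print(f"1% accuracy: {narrow:.2f}, 100% accuracy: {full:.2f}")
```

Under this model, a segment of 20 users viewed at 1% accuracy comes back empty about 82% of the time, while at 100% accuracy it never does.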