Regular expressions

You can use regular expressions to filter URL data in Yandex.Webmaster.

Expressions are parsed according to the RE2 syntax and the following rules:

  • The regular expression is applied to the entire page URL, including the protocol and domain. For example, the expression ^http:// matches URLs that start with the HTTP protocol.
  • The regular expression is applied twice: to the URL with the www prefix and to the URL without it. Whether the domain has the www prefix therefore doesn't affect the result.
  • The regular expression is applied to the decoded URL, in which URL codes (% sequences) are replaced with the characters they represent. Exception: the codes for the /, &, =, ?, and # characters aren't replaced. For example, %2F isn't replaced with /. Note that the + character is replaced with a space. For example, the regular expression text=слон will match, but text=%D1%81%D0%BB%D0%BE%D0%BD and text=%\w\w won't.
  • Cyrillic URLs are not converted to Punycode. For example, the regular expression ^http://ввв\.сайт\.рф/ will match, but ^http://xn--b1aaa\.xn--80aswg\.xn--p1ai/ won't.
  • Some characters are removed from the end of the URL before the regular expression is checked: ?, #, &, and the period (.). For example, the URLs http://example.com/?, http://example.com/#, and http://example.com/?var=1& are compared as http://example.com/, http://example.com/, and http://example.com/?var=1 respectively. If the URL is entered as http://example.com./, the regular expression \./$ won't match.
  • Quantifiers are greedy by default: they match as many characters as possible.
  • URLs are matched case-sensitively.
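
The decoding and trimming rules above can be sketched in Python. This is only an illustration: the helper names (decode_url, www_variants, url_matches) are hypothetical, Python's re module stands in for an RE2 engine, and the order of the steps (decode first, then trim the ending) is an assumption rather than documented behavior.

```python
import re
from urllib.parse import unquote

# Percent codes that stay encoded according to the rules: / & = ? #
KEEP_ENCODED = {"2F", "26", "3D", "3F", "23"}


def decode_url(url: str) -> str:
    """Decode % sequences except the protected ones; turn + into a space."""
    # Shield protected codes (%2F becomes %252F) so unquote() restores
    # them to %2F instead of decoding them to the raw character.
    shielded = re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda m: "%25" + m.group(1)
        if m.group(1).upper() in KEEP_ENCODED
        else m.group(0),
        url,
    )
    return unquote(shielded.replace("+", " "))


def www_variants(url: str) -> list[str]:
    """Return the URL without and with the www prefix."""
    m = re.match(r"^(https?://)(www\.)?(.*)$", url)
    if not m:
        return [url]
    scheme, _, rest = m.groups()
    return [scheme + rest, scheme + "www." + rest]


def url_matches(pattern: str, url: str) -> bool:
    """Apply the pattern to the normalized URL, with and without www."""
    # Assumption: trailing ?, #, & and . are trimmed after decoding.
    normalized = decode_url(url).rstrip("?#&.")
    return any(re.search(pattern, v) for v in www_variants(normalized))


# %D1%81%D0%BB%D0%BE%D0%BD is the percent-encoded form of "слон".
encoded = "http://example.com/?text=%D1%81%D0%BB%D0%BE%D0%BD"
print(url_matches(r"text=слон", encoded))
print(url_matches(r"text=%\w\w", encoded))
print(url_matches(r"^http://www\.example\.com/$", "http://example.com/?"))
```

In this sketch the first and third checks succeed and the second fails, because the expression only ever sees the decoded, trimmed URL.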

Regular expression cheat sheet

In the table below, a, b, c, d, and e stand for any characters, and n and m are positive integers.

Possible options
abc|de Matches one of the options: abc or de.
Classes of characters
[abc] or [a-c] Matches any (one) character of the list (or from the range).
[^abc] or [^a-c] Matches any (one) character except those listed (or those from the range).
\d Matches a digit character. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\s Matches a whitespace character. Equivalent to [ \t\n\f\r].
\S Matches a non-whitespace character. Equivalent to [^ \t\n\f\r].
\pL Matches any Unicode letter.
\w Matches any Latin letter of any case, a digit, or the underscore character. When working with Unicode characters, use the \pL class instead of \w.
\W Matches any character other than a Latin letter of any case, a digit, or the underscore character. When working with Unicode characters, use the \PL class instead of \W.

Number of occurrences (quantifiers)
a* Matches the a character repeated 0 or more times (the longest possible sequence).
a+ Matches the a character repeated 1 or more times (the longest possible sequence).
a? Matches the a character repeated 0 or 1 time (the presence of the character is a priority).
a{n,m} Matches the a character repeated at least n times and not more than m times (the longest possible sequence).
a{n,} Matches the a character repeated at least n times (the longest possible sequence).
a{n} Matches the a character repeated n times.
a*? Matches the a character repeated 0 or more times (the shortest possible sequence).
a+? Matches the a character repeated 1 or more times (the shortest possible sequence).
a?? Matches the a character repeated 0 or 1 time (the absence of the character is a priority).
a{n,m}? Matches the a character repeated at least n times and not more than m times (the shortest possible sequence).
a{n,}? Matches the a character repeated at least n times (the shortest possible sequence).
Position in the string
^ Matches the beginning of a string.
$ Matches the end of a string.
\b Matches a word boundary: the position between an alphanumeric character (\w) and a non-alphanumeric character (\W).
\B Matches any position that is not a word boundary, as defined by the \w and \W classes.

Escaping
\ A backslash before a special character ([ ] \ ^ $ . | ? * + ( ) { }) indicates that the character is not special and should be interpreted literally. Example: \$ matches the dollar sign.
\Q...\E All special characters between \Q and \E are interpreted as literal characters.
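
A few of the table entries can be tried out in Python. Treat this as a rough sketch: Python's re module is not RE2, so re.escape() is used here as an approximate counterpart of \Q...\E, the re.ASCII flag is used to mimic the Latin-only \w, and \pL is left out because the re module does not support it. The sample URLs are made up for illustration.

```python
import re

# Greedy quantifiers take the longest match; a trailing ? makes them lazy.
greedy_path = re.search(r"/.*/", "/a/b/c").group()         # "/a/b/"
lazy_path = re.search(r"/.*?/", "/a/b/c").group()          # "/a/"
bounded_greedy = re.search(r"a{2,4}", "aaaaa").group()     # "aaaa"
bounded_lazy = re.search(r"a{2,4}?", "aaaaa").group()      # "aa"
optional_greedy = re.search(r"a?", "a").group()            # "a"
optional_lazy = re.search(r"a??", "a").group()             # ""

# An unescaped period matches any character; an escaped one is literal.
loose = re.search(r"example.com", "http://exampleXcom/") is not None
strict = re.search(r"example\.com", "http://exampleXcom/") is not None

# Escaping a whole literal fragment at once (stand-in for \Q...\E).
fragment = "/search?q=a+b"
escaped_hit = re.search(
    re.escape(fragment), "http://example.com/search?q=a+b"
) is not None

# \b requires a boundary between a \w and a \W character.
word_hit = re.search(r"\bcat\b", "http://example.com/cat/") is not None
word_miss = re.search(r"\bcat\b", "http://example.com/catalog/") is not None

# With Latin-only \w, Cyrillic letters are not word characters.
ascii_word = re.search(r"\w+", "слон", re.ASCII)

print(greedy_path, lazy_path, bounded_greedy, bounded_lazy)
print(loose, strict, escaped_hit, word_hit, word_miss, ascii_word)
```

The lazy forms stop as early as the rest of the expression allows, which is why a?? prefers the empty match while a? prefers to consume the character.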