Drupal 7 has a couple of problems that allow infinitely recursive URIs, which in turn add unnecessary load to web servers when they are being crawled. The two main causes are:
1) Recursion in panel pages, which allow it by default. (Note that I'm not an expert on Drupal functionality, so if someone has an update on this please chime in. This may also be a misconfiguration issue, but I don't manage the actual Drupal sites, only the servers.)
2) Subsite symbolic links. Drupal requires symbolic links for subsites that use the same core code. (https://www.drupal.org/docs/7/multisite-drupal/multi-site-in-subdirectories) For example:
www.url.ca = /var/www/html/url/ (core drupal site for "www.url.ca")
www.url.ca/subsite1 = /var/www/html/url/subsite1 -> /var/www/html/url/ (symlink)
Drupal decides which settings file to use, based on the URL and URI. It gets more complicated, but the above is a legitimate configuration scenario.
An unfortunate side effect is that certain URIs can then be recursively requested, for example:
www.url.ca/subsite1
www.url.ca/subsite1/subsite1 : this may request the exact same page as previously loaded.
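To see why the nested URI can land on the same content, here is a minimal Python sketch that rebuilds the symlink layout in a throwaway temporary directory and resolves increasingly nested paths. The function name and the temporary directory are my own illustration and have nothing to do with Drupal itself, and it assumes a filesystem where creating symlinks is permitted:

```python
import os
import tempfile


def nested_paths_resolve_to_docroot(depth=3):
    """Return one boolean per nesting level: True when the nested
    path resolves to the same directory as the docroot."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for /var/www/html/url/ (hypothetical location).
        docroot = os.path.join(tmp, "url")
        os.mkdir(docroot)
        # subsite1 -> docroot, as in the multisite symlink setup.
        os.symlink(docroot, os.path.join(docroot, "subsite1"))

        real_root = os.path.realpath(docroot)
        results = []
        path = docroot
        for _ in range(depth):
            # url/subsite1, url/subsite1/subsite1, ...
            path = os.path.join(path, "subsite1")
            results.append(os.path.realpath(path) == real_root)
        return results


if __name__ == "__main__":
    print(nested_paths_resolve_to_docroot())
```

Every nesting level resolves back to the docroot, which is why a crawler can keep appending the subsite name and still get a page.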
The problem is intensified when crawlers start following these loops. By default, Drupal caches much of the content on the first request of a URI, and other caching layers do too. But when a recursive page is hit, Drupal does not recognize the address and treats it as never cached, so it regenerates the entire page. The same problem occurs even if you are using Varnish or Squid. This can put quite a bit of extra load on the backend Apache servers, depending on their configuration and available resources.
A solution provided by Drupal is to add a RedirectMatch directive. See https://www.drupal.org/docs/7/multisite-drupal/multi-site-in-subdirectories (scroll down to the bottom of the page.) However, I found that it failed to match most scenarios that I had to work with. Here is my modified directive:
RedirectMatch 301 "(.*/\w*\b)*(?P<word>(\/+[a-zA-Z0-9\-_]+))((?P=word)(?!(\.).*))+" $1$3
(Note that this directive is to be added to the .htaccess file of the core site only.)
Here is the breakdown of the regexp:
- match anything or nothing before the repeating pattern
# (.*/\w*\b)*
- match a term beginning with / and track with keyword "word"
# (?P<word>(\/+[a-zA-Z0-9\-_]+))
- recall the previous term, combined with above line this means the term is repeating
# ((?P=word)(?!(\.).*))+
- Breaking the previous rule into chunks:
- (?P=word) - recall of the term
- (?!(\.).*) - negative lookahead = the repeated term must not be immediately followed by a "." and any further characters. This means a repeated term followed by something like a file extension is not treated as a repeat.
- The final plus means that the repeating keyword should be matched if it exists one or more times.
# +
Examples tested:
1) http://suburl.url.ca/subsite/en/en => http://suburl.url.ca/subsite/en
2) http://suburl.url.ca/subsite/subsite => http://suburl.url.ca/subsite/
(This next one is special. Is it acceptable? Probably, because the URL is not legitimate. Only the last repeated term is removed, but the URL then gets requested again and the first repeated term is removed.)
3) http://suburl.url.ca/subsite/subsite/en/programs => http://suburl.url.ca/subsite/
4) http://suburl.url.ca/subsite/en/programs/programs => http://suburl.url.ca/subsite/en/programs
5) http://suburl.url.ca/subsite/en/programs/programs/programs.jpg => http://suburl.url.ca/subsite/en/programs (the programs.jpg here is not a legitimate request)
6) http://suburl.url.ca/subsite/en/programs/programs.jpg => <NO REDIRECT> (the programs.jpg might be a legitimate request)
Legitimate resource files are often named with the exact same term as the folder containing them. As seen in test #6, this does not redirect, and therefore the legitimate file is requested and returned to the client.
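The pattern can also be tried outside Apache. The sketch below is only an approximation: it feeds the same expression to Python's re module, which is not the regex engine Apache uses, and the helper name redirect_target is mine. It returns the raw $1$3 substitution for the path part of the URL, so the trailing slash shown in some of the tested examples above is not reproduced here.

```python
import re

# The pattern from the RedirectMatch directive, copied unchanged.
PATTERN = re.compile(
    r"(.*/\w*\b)*(?P<word>(\/+[a-zA-Z0-9\-_]+))((?P=word)(?!(\.).*))+"
)


def redirect_target(path):
    """Return the $1$3 substitution for a URL path, or None when
    the pattern does not match (no redirect)."""
    # search() rather than fullmatch(): the pattern has no ^ or $ anchors.
    match = PATTERN.search(path)
    if match is None:
        return None
    # Group 1 is unset when nothing precedes the repeated term.
    prefix = match.group(1) or ""
    return prefix + match.group(3)


if __name__ == "__main__":
    for path in [
        "/subsite/en/en",
        "/subsite/en/programs/programs",
        "/subsite/en/programs/programs/programs.jpg",
        "/subsite/en/programs/programs.jpg",
    ]:
        print(path, "=>", redirect_target(path))
```

A result of None corresponds to the <NO REDIRECT> case, for example when the repeated term is followed by a file extension.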
I've tested the above regex with multiple scenarios and it seems to work in MOST situations that I've experienced, but it is not yet perfect. Note that when it does NOT match, it does not affect the behavior in a negative way. Scenarios that are not yet recognized are patterns where the repeating terms do not immediately follow each other, or where the repeat consists of several groups of paths:
http://suburl.url.ca/subsite/en/program/subsite/en/program : this is not recognized as a repeating pattern. This particular example may not actually cause recursion in Drupal, but it is to be noted that I have encountered similar patterns in my logs.
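For spotting such multi-segment repeats in logs, something like the following Python sketch might help. This is a hypothetical check of my own for offline log analysis, not part of the directive above, and I have not tried it as a RedirectMatch pattern. It flags a path when a group of one or more segments is immediately followed by an identical group, and, like the directive, it skips the case where the repeat is followed by a "." (or more word characters).

```python
import re

# Hypothetical pattern: a run of one or more /segments, immediately
# repeated, and not followed by ".", "-" or a word character.
REPEATED_GROUP = re.compile(
    r"(?P<group>(?:/+[a-zA-Z0-9_-]+)+)(?P=group)(?![.\w-])"
)


def has_repeated_group(path):
    """Return True when the path contains an immediately repeated
    group of one or more segments."""
    return REPEATED_GROUP.search(path) is not None


if __name__ == "__main__":
    for path in [
        "/subsite/en/program/subsite/en/program",
        "/subsite/en/programs/programs.jpg",
        "/subsite/en/programs",
    ]:
        print(path, "=>", has_repeated_group(path))
```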