Monday, July 31, 2017

Drupal 7 recursion issue

Drupal 7 has a couple of problems that cause infinite recursion in URIs, which in turn adds unnecessary load to web servers when they are being crawled.  The two main causes are:

1) Recursion caused by panel pages, which allow it by default.  (Note that I'm not an expert on Drupal functionality, so if someone has an update on this, please chime in. This may also be a misconfiguration issue, but I don't manage the actual Drupal sites, only the servers.)

2) Subsite symbolic links.  Drupal requires symbolic links for subsites that use the same core code.  (https://www.drupal.org/docs/7/multisite-drupal/multi-site-in-subdirectories) For example:

www.url.ca = /var/www/html/url/    (core drupal site for "www.url.ca")
www.url.ca/subsite1 = /var/www/html/url/subsite1 -> /var/www/html/url/ (symlink)


Drupal decides which settings file to use based on the URL and URI.  It gets more complicated than that, but the above is a legitimate configuration scenario.

An unfortunate side effect is that certain URIs can then be recursively requested, for example:

www.url.ca/subsite1       
www.url.ca/subsite1/subsite1    : this may request the exact same page as previously loaded.


The problem is intensified when crawlers start following these loops.  Drupal by default caches much of the content on the first request of a URI, and other caching layers do too.  But, when a recursive page is hit, Drupal doesn't recognize the address and thinks it has never been cached, forcing it to regenerate the entire page.  The same problem occurs even if you are using Varnish or Squid.  It can cause quite a bit of extra load on the backend Apache servers depending on their configuration and available resources.

A solution provided by Drupal is to add a RedirectMatch directive.  See https://www.drupal.org/docs/7/multisite-drupal/multi-site-in-subdirectories (scroll down to the bottom of the page.) However, I found that it failed to match most scenarios that I had to work with.  Here is my modified directive:

RedirectMatch 301 "(.*/\w*\b)*(?P<word>(\/+[a-zA-Z0-9\-_]+))((?P=word)(?!(\.).*))+" $1$3

(Note that this directive is to be added to the .htaccess file of the core site only.)


Here is the breakdown of the regexp:

- match anything or nothing before the repeating pattern
  # (.*/\w*\b)*

- match a term beginning with / and track it with the keyword "word"
  # (?P<word>(\/+[a-zA-Z0-9\-_]+))

- recall the previous term; combined with the line above, this means the term is repeating
  # ((?P=word)(?!(\.).*))+

  Breaking this rule into chunks:
  - (?P=word)    - recall of the term
  - (?!(\.).*)   - negative lookahead: do not match anything that has a "."
                   followed by any characters.  This means the repeating term
                   should only match when it is not followed by a "." and any
                   combination of characters.
  - the final +  - the repeating term must occur one or more times for the
                   rule to match.


Examples tested:

1) http://suburl.url.ca/subsite/en/en => http://suburl.url.ca/subsite/en

2) http://suburl.url.ca/subsite/subsite => http://suburl.url.ca/subsite/

(This next one is special - is this acceptable? Probably, because the URL is not legitimate.
Only the last repeated term is removed, but it gets requested again and the first repeated term is then removed.)
3) http://suburl.url.ca/subsite/subsite/en/programs => http://suburl.url.ca/subsite/

4) http://suburl.url.ca/subsite/en/programs/programs => http://suburl.url.ca/subsite/en/programs

5) http://suburl.url.ca/subsite/en/programs/programs/programs.jpg => http://suburl.url.ca/subsite/en/programs      (the programs.jpg here is not a legitimate request)

6) http://suburl.url.ca/subsite/en/programs/programs.jpg => <NO REDIRECT>  (the programs.jpg might be a legitimate request)


Legitimate resource files are often named with the exact same term as the folder containing them.  As seen in test #6, this does not redirect, and therefore the legitimate file is requested and returned to the client.

I’ve tested the above regex with multiple scenarios and it seems to work in MOST situations that I've experienced, but it is not yet perfect.  Note that when it does NOT work, it does not affect behavior in a negative way.  Scenarios that are not yet recognized are repeating patterns where the repeating terms do not immediately follow each other, or where several groups of path segments repeat together:

http://suburl.url.ca/subsite/en/program/subsite/en/program : this is not recognized as a repeating pattern.  This particular example may not actually cause recursion in Drupal, but I have encountered similar patterns in my logs.

Parsing large log files quickly

Timegrep is a fantastic utility to parse through massive log files quickly.  It does a binary search for a time range based on a specified time format.

The utility is available on GitHub: https://github.com/linux-wizard/timegrep

Here is an example of how I can use it to grep through dozens of log files, each of which can be several GB in size:

This example is for an NGINX server's errors.

find /var/log/nginx/ -type f -name '*.log-20170730' -exec ~/bin/timegrep.py -d 2017-07-29 --start-time=19:30:00 --end-time=19:45:00 '{}' \; | grep '\[error\]' > ./errors-list.txt

Another example, to get some stats from Apache, combines timegrep with some piping and grepping from: https://blog.nexcess.net/2011/01/21/one-liners-for-apache-log-files/

Run this command from /var/log/httpd on a CentOS system:


find . -type f -name '*.access.log' -exec /root/bin/timegrep.py -d 2017-07-31 --start-time=10:05:00 --end-time=10:06:00 '{}' \; | awk '{print $1}' | sort | uniq -c | sort -rn | head -20

This will go through all of the .access.log files in /var/log/httpd, parse all of the entries logged between 10:05 and 10:06, and print the top 20 client IPs.
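As an aside, the awk/sort/uniq portion of that pipeline is easy to reproduce in a few lines of Python if you need to post-process the results further.  A small sketch (the sample log lines are made up for illustration):

```python
from collections import Counter

# A rough Python equivalent of awk '{print $1}' | sort | uniq -c | sort -rn | head -20:
# count the first field (client IP) of each log line and return the N most frequent.
def top_ips(lines, n=20):
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return counts.most_common(n)

sample = [
    '10.0.0.1 - - [31/Jul/2017:10:05:01 -0400] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [31/Jul/2017:10:05:02 -0400] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [31/Jul/2017:10:05:03 -0400] "GET /b HTTP/1.1" 404 64',
]
for ip, hits in top_ips(sample):
    print(hits, ip)
# 2 10.0.0.1
# 1 10.0.0.2
```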

Basically, if you combine timegrep with the find command, you've got yourself some serious log parsing firepower.

Of course, if you've got this quantity of logs to parse through, sometimes tools like splunk are a bit more appropriate.  However, as they are not always available, the above technique can get you out of a serious bind.

No thumbnails for video previews in Nautilus

As Totem, the default CentOS 7 / Gnome 3 video player, does not support many of the video formats in use today, thumbnails are not being generated by Nautilus.  (Nautilus uses Totem to generate these.)  There are two causes of this problem:

1) The mp4, mkv, and other formats are not supported by Totem, and specific codecs need to be installed.  As posted in the Fedora Forums, here are some of the codecs that need to be installed for this to work properly: https://ask.fedoraproject.org/en/question/9267/thumbnail-for-videos-in-nautilus/

yum -y install gstreamer1-libav gstreamer1-plugins-bad-free-extras gstreamer1-plugins-bad-freeworld gstreamer1-plugins-base-tools gstreamer1-plugins-good-extras gstreamer1-plugins-ugly gstreamer1-plugins-bad-free gstreamer1-plugins-good gstreamer1-plugins-base gstreamer1

Delete the following directory:

rm -r ~/.cache/thumbnails/fail
 
Log out and log back in, just to make sure Gnome picks up the new plugins.  This step may not be necessary, but it may help.

2) By default, Nautilus only generates thumbnails for files smaller than 1MB, so this limit needs to be increased.  (NOTE: Gnome does not recommend increasing this size too much due to the impact it will have on performance.  However, the speed at which the thumbnails are generated is largely dependent on the file format of the videos: MP4s are done very quickly, while FLVs take much longer.)  To increase the limit, navigate to "File"->"Preferences"->"Preview" and change "Only for files smaller than:" to whichever size you prefer.  I've set mine to 4GB and performance seems fine.

Saturday, July 29, 2017

Bootloader bug with system-config-kickstart

Here is another minor bug with system-config-kickstart:

When opening an existing kickstart configuration file, the application fails to read or load the bootloader configuration.  If this isn't reconfigured within system-config-kickstart, saving the file will set the bootloader directive to:

bootloader --location=none --boot-drive=<disk device>


The location option should have been populated with the value from the original file when it was read.

The best thing to do is to double-check all of the basic settings once the kickstart config file is created (and saved), and manually edit whatever needs to be adjusted.

Tuesday, July 25, 2017

system-config-kickstart error in CentOS 7

Using system-config-kickstart version 2.9.6-1.el7 in CentOS 7 yields the following error when attempting to select packages:

"Package selection is disabled due to problems downloading package information."

Screenshot of the message in the "Package Selection" menu.


It seems someone filed a bug with CentOS regarding this problem:  See https://bugs.centos.org/view.php?id=9611

As stated by the bug poster, the issue can be fixed by modifying line 161 of the file: /usr/share/system-config-kickstart/packages.py

156         # If we're on a release, we want to try the base repo first.  Otherwise,
157         # try development.  If neither of those works, we have a problem.
158         if "fedora" in map(lambda repo: repo.id, self.repos.listEnabled()):
159             repoorder = ["fedora", "rawhide", "development"]
160         else:
161             repoorder = ["rawhide", "development", "fedora"]


Becomes:

161             repoorder = ["rawhide", "development", "fedora", "base"]

Restart system-config-kickstart and packages can now be read from the local yum repositories.