Google Analytics Referral Spam Bots

Submitted by allan on Mon, 12/28/2015 - 01:05

I recently discovered that my analytics were lying to me. Apparently it's fairly well known that there are bots out there that maliciously pollute your analytics simply to get their referral address into your reports. Exactly what they gain from it, I don't know. I'll walk through the solutions I've used and built to get fairly solid analytics from now on.

Updated with links to Solid Console App

Fail2Ban

My first thought was that, like just about everything else on my server, my site had tons of bots scanning it and hitting it with well-crafted HTTP requests. That's easy enough to do with curl, so why not? With that in mind, I took a look at Referer Spam Domains Blacklist, a repository with a continually updated blacklist and some configuration for Fail2Ban. With some well-crafted configuration files, I was able to send myself an email every time someone was banned by Fail2Ban.

This is what really threw me for a loop: I was catching no one. I even took the regex created for Fail2Ban and deconstructed it so I could grep through my logs and check whether there was a bug in the regex. It turned out that no one was visiting my site with those referrers at all.
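For anyone who wants to repeat that log check, here is a minimal sketch in Python rather than grep. It assumes the access log uses the common "combined" format, where the referer is the second-to-last quoted field. The function name `spam_referers` and the domain `spam-example.test` are made up for illustration; the real blacklist would come from the Referer Spam Domains Blacklist repository.

```python
import re
from urllib.parse import urlparse

# Assumption: combined log format, ending in: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$')


def spam_referers(lines, blacklist):
    """Return the log lines whose referer host is a blacklisted domain or a subdomain of one."""
    blocked = {domain.lower() for domain in blacklist}
    hits = []
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue  # not a combined-format line, skip it
        host = (urlparse(match.group("referer")).hostname or "").lower()
        parts = host.split(".")
        # "www.spam-example.test" is also checked as "spam-example.test" and "test".
        candidates = {".".join(parts[i:]) for i in range(len(parts))}
        if candidates & blocked:
            hits.append(line)
    return hits
```

An empty result on a real log is exactly what I saw: the spam never touched the web server.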

My Custom Solution

So, I did some serious research. It turns out that bots are actually hitting Google's Measurement Protocol directly. You can see this explained over at optimizesmart.com in the article "Geek guide to removing referrer spam in Google Analytics".

It turns out that all you need to do to spoof analytics is scrape the GA ID from the web page and then make a call to Google Analytics, via JavaScript or some other language, just like your web page does! All the bots have done is take your web page (and, with it, your web server) out of the picture.
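To make that concrete, here is a rough sketch of what such a hit looks like. It only builds the request URL and never sends anything. The parameter names (`v`, `tid`, `cid`, `t`, `dp`, `dr`) and the `/collect` endpoint are the Measurement Protocol ones as I understand them; the function name, the property ID `UA-XXXXX-Y`, and the referrer are placeholders.

```python
import uuid
from urllib.parse import urlencode

# Assumption: the classic Measurement Protocol (v1) collection endpoint.
COLLECT_URL = "https://www.google-analytics.com/collect"


def build_pageview_hit(tracking_id, referrer, page="/", client_id=None):
    """Build (but do not send) a pageview hit URL carrying an arbitrary referrer."""
    params = {
        "v": "1",                                # protocol version
        "tid": tracking_id,                      # the GA ID scraped from the page
        "cid": client_id or str(uuid.uuid4()),   # any anonymous client ID will do
        "t": "pageview",                         # hit type
        "dp": page,                              # page path being "viewed"
        "dr": referrer,                          # the referrer the bot wants in your reports
    }
    return COLLECT_URL + "?" + urlencode(params)


print(build_pageview_hit("UA-XXXXX-Y", "http://spam-example.test/", client_id="555"))
```

Nothing in that request involves your own server, which is why the access logs stay clean.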

How to fix the problem

Google Analytics gives you two ways to filter these hits out of your results:

  1. Add domains to the Referral Exclusions list
  2. Create filters that exclude hits from specific referral domains

I chose the second route, as I already had the regex to exclude all the domains I wanted. However, I hit a snag: Google Analytics only allows a filter regex up to 255 characters long. So I did what any programmer would do and started looking into how to programmatically split my domain list into 255-character chunks so that I could create a whole bunch of filters. I also checked whether I could add to the Referral Exclusions list instead, but that doesn't seem to be accessible via the Google Analytics API. So I had to settle for filters.

Enter General Redneck's cure: "GA Referrer Spam Filters"

Console Application

Now I grab the list from Referer Spam Domains Blacklist with all the domains I want to remove. I then combine them into regular expressions that each fit within the 255-character limit. This gives us over 100 different filters, but that's still fewer than one filter per domain. I then assign all of these to the view of my choosing. You can follow my progress over at the GeneralRedneck/GaReferrerSpamFilters repository. I'll have a readme with some help on how to use it soon, but as of tonight it's still kinda basic.
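The chunking step can be sketched in a few lines of Python. This is not the code from the repository, just an illustration of the idea: escape each domain, then greedily join domains with `|` until adding one more would exceed the limit. The function name `build_filter_patterns` is made up, and escaping only the dots assumes the list contains plain domain names.

```python
MAX_PATTERN_LENGTH = 255  # the filter regex limit mentioned above


def build_filter_patterns(domains, max_length=MAX_PATTERN_LENGTH):
    """Greedily pack domains into '|'-joined regexes of at most max_length characters."""
    patterns = []
    current = ""
    for domain in domains:
        # Assumption: entries are plain domain names, so only dots need escaping.
        escaped = domain.replace(".", r"\.")
        if len(escaped) > max_length:
            raise ValueError("domain too long for a single filter: " + domain)
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) > max_length:
            # The current pattern is full; start a new one with this domain.
            patterns.append(current)
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns


# Hypothetical domains, standing in for the real blacklist.
example = ["spam%d.example.test" % i for i in range(50)]
for pattern in build_filter_patterns(example):
    print(len(pattern), pattern[:40] + "...")
```

Each returned pattern would become one exclude filter on the referral field.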

If you use the 0.1 version, it will take the first account, first web property, and first view associated with an API service account you set up, and apply all the filters to it. I'm currently porting that over to a Symfony Console application where you can manage and update the filters as well as specify exactly where you would like the filters to be created. In theory, you could put a couple of these commands in a cron job and always have an up-to-date referral spam bot filter set.

Final Thoughts

I think I've succeeded in building a tool that will be handy for everyone to use (particularly if you are like me and have so little traffic that every hit counts). Stay tuned for an updated readme as well as some more robust commands to help you get where you need to be.

Update

I've built out a pretty solid version of this script. I've tested it to make sure it works on both Windows and Linux using PHP 5.5 and 5.6. Take a look at the updated readme. You can either use Composer to install it, or use one of the builds I put together using Jenkins. Both install methods work well. Curious about the Jenkins script?

# Let "git describe" fail without aborting the build.
set +e
STABLE=N
PUBLISH_DIR="path/to/my/site/sites/default/files/static-content/garefspam"
# Development version: latest tag plus the short commit hash.
DEV_VERSION="$(git describe --abbrev=0 --tags)-dev#$(echo "$GIT_COMMIT" | cut -b 1-7)"
# An exact tag match on HEAD means this is a stable release.
TAG="$(git describe --exact-match --tags HEAD)"
if [ "$?" = "0" ]; then
  STABLE=Y
  echo "$TAG" > version
else
  echo "$DEV_VERSION" > version
fi

# From here on, any failure aborts the build.
set -e
composer install
./garefspam updatespamlist
ZIPNAME="garefspam-$DEV_VERSION.zip"
zip -r -9 "$PUBLISH_DIR/development/$ZIPNAME" *

# Stable builds are also published under the tag name.
if [ "$STABLE" = "Y" ]; then
  cp "$PUBLISH_DIR/development/$ZIPNAME" "$PUBLISH_DIR/stable/garefspam-$TAG.zip"
fi