
This post is designed to give you a few basic commands to help you pull a quick report showing how often Googlebot has hit your site, without having to load old data into a log analyzer.
So you show up at a new job, land a new client, or just remember that you had a guitar tab site on your old college account that you forgot about 10 years ago. Regardless of the reason, you have decided it is time to look back and figure out what level of love the old Googlebot has been giving your site lately.
Now there are tons of great log analyzer tools out there, web-based, free, etc., but we all know what a pain in the ass those can be to load and deal with. Well, there is a pretty easy way to see what is going on if you learn a few commands, assuming you are on some sort of Linux box (see a future rant on IIS/Windows web servers). It is time to put your War Games hat on and run sweet filters through the command line.
1) Log into your server using some form of SSH. I always use PuTTY. It's easy, it's free, it's secure, and you can save the link to putty.exe on Delicious and fire it up on the fly from anywhere.
2) Figure out where the heck your Apache log files are located. Often this will be somewhere like /usr/local/apache/logs/; if not, ask your host, IT guy, etc. Then go to that directory:
$cd /usr/local/apache/logs
3) Now you are sitting in your logs directory, and you can list out its contents:
$ls
Figure out which file you want to dig around in. For simplicity's sake let's use the current log, which is usually called simply access_log.
4) The first command you are going to run is called grep. It basically searches through a file for all lines that contain a character string and then outputs them to the screen. Since we are hoping that Googlebot comes to your site many, many times a day, we don't want to see a million lines scroll by in front of your eyes (although that would seem very War Games-esque), so we will use something called a pipe, which is denoted with a vertical bar like this: |. We want to "pipe" the output to a command called "more". More sends one page of output to the screen at a time, and each press of your space bar sends another screenful of results. So, here is what you do:
$grep Googlebot access_log | more
If you run that command you will see the full log line for every request that contains the string Googlebot, not just the URLs.
Check it out, hit the space bar a couple of times and see how it looks (hit the letter q to quit and get back to your command line). As you can see, we have now grabbed all the times Googlebot hit our site (at least in the life of this file).
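If you just want the headline number before paging through anything, grep can count the matching lines for you with its -c option. Here is a minimal sketch you can try anywhere; the log lines are made up for illustration (standard Apache "combined" format is assumed) and the temp file stands in for your real access_log.

```shell
# Build a tiny sample log (made-up lines, Apache "combined" format assumed)
# so the command can be tried without touching a real server.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
66.249.66.1 - - [10/Jan/2008:06:25:01 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.10 - - [10/Jan/2008:06:25:07 -0500] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows; U)"
66.249.66.1 - - [10/Jan/2008:07:02:44 -0500] "GET /tabs/song.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# -c prints the number of matching lines instead of the lines themselves.
googlebot_hits=$(grep -c Googlebot "$sample_log")
echo "Googlebot hits: $googlebot_hits"
```

On your own server you would point the same grep -c at access_log instead of the sample file.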
5) Next we want to make this data more usable. Let's take the output and list it by URL to see which URLs are getting crawled the most. So we are going to grep for Googlebot and then pipe it to a command called awk. Awk can do all kinds of cool stuff, but in this case we are just going to pull the URL out of each line of our log file. By default awk splits a line on whitespace. Since the fields in our log file are separated by spaces, we can count from the left and determine that the URL is the 7th field (there are 6 spaces to the left of it). Basically you use '{}' and drop in print, a dollar sign, and then the number of the field you want, in this case 7.
$grep Googlebot access_log | awk '{print $7}' | more
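To get a feel for how awk numbers the fields, here is a small sketch that pulls a few of them out of a single log line. The line itself is made up, and the field positions assume the usual Apache common/combined log layout; if your server uses a custom log format, the numbers may differ. Note that awk needs the print keyword inside the braces to output anything.

```shell
# One made-up log line in Apache "combined" format (hypothetical data).
line='66.249.66.1 - - [10/Jan/2008:06:25:01 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"'

# awk splits on whitespace; $1 is the first field, $7 the seventh, and so on.
client_ip=$(printf '%s\n' "$line" | awk '{print $1}')
url=$(printf '%s\n' "$line" | awk '{print $7}')
status=$(printf '%s\n' "$line" | awk '{print $9}')

echo "IP: $client_ip"
echo "URL: $url"
echo "Status: $status"
```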
OK, so we have the URLs, but how do we get the duplicates and stuff figured out? No, we are not going to port this to Excel to do that. There is a command called 'sort'. Sort just takes everything you send it and sorts it alphabetically. Sweet.
$grep Googlebot access_log | awk '{print $7}' | sort | more
As you can see, they are all nicely sorted. The next step is to get rid of the duplicates and replace them with a count of how many times each URL showed up. There is an easy command to do this called 'uniq', and if you add the -c option it will give us a unique list including a count. Uniq only collapses duplicate lines that sit next to each other, which is why we sorted first.
$grep Googlebot access_log | awk '{print $7}' | sort | uniq -c | more
Now we have our list, but it is somewhat out of order. It sure would be nice to have the URLs that were crawled the most at the top of the list. No problem, we can just use 'sort' again and add two options: -n sorts numerically and -r does a reverse sort (puts the highest numbers first).
$grep Googlebot access_log | awk '{print $7}' | sort | uniq -c | sort -rn | more
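Here is the whole chain run against a throwaway sample so you can see the shape of the report before trying it on real data. The log lines are invented for illustration, the user agent strings are shortened, and the trailing more is left off so the result can be captured in a variable.

```shell
# Made-up sample log (hypothetical data, Apache "combined" format assumed).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
66.249.66.1 - - [10/Jan/2008:06:25:01 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Jan/2008:06:40:12 -0500] "GET /tabs/song.html HTTP/1.1" 200 4096 "-" "Googlebot/2.1"
192.0.2.10 - - [10/Jan/2008:06:41:00 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
66.249.66.1 - - [10/Jan/2008:07:02:44 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Jan/2008:07:15:30 -0500] "GET /about.html HTTP/1.1" 200 2048 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Jan/2008:08:01:09 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
EOF

# Filter to Googlebot, keep the URL field, count duplicates, biggest first.
report=$(grep Googlebot "$sample_log" | awk '{print $7}' | sort | uniq -c | sort -rn)
printf '%s\n' "$report"
```

Each line of the report is a count followed by a URL, with the most-requested URL on top. The visitor with the Mozilla user agent is filtered out by grep and never reaches the count.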
There you have it, now you can easily see how many times Googlebot has hit your site in the timeframe of your access log. Piece of cake, right?
Well, you may tell me, "But Jim, my server seems to gzip up all the log files and roll them off every day." OK, no problem. Let's say you want to go through all of the log files from the whole month of January and see what the Googlebot traffic looks like, and you are too lazy to unzip the files (or heck, maybe if you unzip them your server will run out of disk space).
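In comes gzip with the decompress option (-d), plus -c to write the decompressed data to standard output instead of to a file. Combine that with wildcards (in this case match everything between Jan and gz in the file name) and put it at the very front of your command, then pipe it to your grep for Googlebot. Note that grep no longer gets a file name, because it now reads whatever comes through the pipe.
$gzip -dc access_log.Jan.*.gz | grep Googlebot | awk '{print $7}' | sort | uniq -c | sort -rn | more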
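If you want to convince yourself that the wildcard really does stitch several compressed files into one stream, here is a small sketch that builds two gzipped logs in a temp directory and runs the report across both. The file contents are made up, and the file names simply follow the access_log.Jan.*.gz pattern used above; your host's rotation scheme may name things differently.

```shell
# Create two small, made-up daily logs in a scratch directory and gzip them.
log_dir=$(mktemp -d)
cat > "$log_dir/access_log.Jan.01" <<'EOF'
66.249.66.1 - - [01/Jan/2008:06:25:01 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
192.0.2.10 - - [01/Jan/2008:09:10:00 -0500] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
EOF
cat > "$log_dir/access_log.Jan.02" <<'EOF'
66.249.66.1 - - [02/Jan/2008:06:30:45 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [02/Jan/2008:06:31:02 -0500] "GET /tabs/song.html HTTP/1.1" 200 4096 "-" "Googlebot/2.1"
EOF
gzip "$log_dir/access_log.Jan.01" "$log_dir/access_log.Jan.02"

# gzip -dc streams every matching file to the pipe; nothing is unzipped on disk.
january_report=$(gzip -dc "$log_dir"/access_log.Jan.*.gz | grep Googlebot | awk '{print $7}' | sort | uniq -c | sort -rn)
printf '%s\n' "$january_report"
```

The .gz files stay compressed the whole time, which is the point if disk space is tight.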
Last but not least, we probably want to save this so we can check it out later. Instead of piping to more, you can redirect the output to a file using the > operator, which drops the file in your current directory.
$gzip -dc access_log.Jan.*.gz | grep Googlebot | awk '{print $7}' | sort | uniq -c | sort -rn > googlebot-log.dat
Even better, assuming your web server has a mail server on it, you can go ahead and send the output to yourself (or your boss) in an email. With the -s option you can even put a pretty subject on it.
$gzip -dc access_log.Jan.*.gz | grep Googlebot | awk '{print $7}' | sort | uniq -c | sort -rn | mail -s "Googlebot loves us, check it out" youremail@yoursite.com
Once you get the hang of this you can leverage your skills to do all sorts of stuff, like tracking how many times Googlebot hit your site in an hour, or which hours of the day Googlebot came by the most over the last 2 weeks. Or, instead of grepping for Googlebot, you could have awk print the IP address (field $1) and figure out which IPs are hitting your site the most if you want to find out who is scraping your content, etc.
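Here is a rough sketch of two of those ideas: Googlebot hits per hour and the busiest IPs. It assumes the timestamp is the 4th whitespace-separated field and looks like [10/Jan/2008:06:25:01, as in the standard Apache log formats, so splitting it on the colon leaves the hour in the second piece. The sample data is made up.

```shell
# Made-up sample log (hypothetical data, Apache "combined" format assumed).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
66.249.66.1 - - [10/Jan/2008:06:25:01 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Jan/2008:06:40:12 -0500] "GET /tabs/song.html HTTP/1.1" 200 4096 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Jan/2008:07:02:44 -0500] "GET /about.html HTTP/1.1" 200 2048 "-" "Googlebot/2.1"
192.0.2.10 - - [10/Jan/2008:07:10:00 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
192.0.2.10 - - [10/Jan/2008:07:11:00 -0500] "GET /tabs/song.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0"
192.0.2.10 - - [10/Jan/2008:07:12:00 -0500] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
192.0.2.10 - - [10/Jan/2008:07:13:00 -0500] "GET /tabs/index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF

# Googlebot hits per hour: split the timestamp field on ":" and keep the hour.
hits_per_hour=$(grep Googlebot "$sample_log" | awk '{split($4, t, ":"); print t[2]}' | sort | uniq -c)
printf '%s\n' "$hits_per_hour"

# Busiest client IPs across all traffic: no grep, just count field 1.
top_ips=$(awk '{print $1}' "$sample_log" | sort | uniq -c | sort -rn)
printf '%s\n' "$top_ips"
```

The first report lists each hour of the day with its Googlebot hit count; the second puts the heaviest-hitting IP at the top, which is where a scraper would show up.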
Now fire up PuTTY and start hacking some commands in real time. Who needs to twiddle their thumbs through a 24-hour lag just to get pretty graphs and fancy charts anyway?