Save your SEO efforts. Preserve that link juice. Don’t let users end up at a 404 error page in frustration on your website.

In the past I’ve written on how to 301 redirect URLs, but today we’ll find 404 page errors as they occur and fix them in real time!

We recently launched our new site, and in the process ended up changing a bunch of page URLs around. A good number of page and media URLs also changed without us realizing, due to differences in the way our old and new systems handled special characters in URLs (oh joy). Unfortunately, while the excellent Xenu will tell you about broken links on the site it crawls, it won’t tell you about old-site URLs that no longer resolve after the migration (pages that need to be 301 redirected).

On top of that, it’s nearly impossible to catch all loose ends in a practical amount of time when you’re migrating a website to a new CMS. You could check Webmaster Tools for a list of 404 pages, but that list is not real time.

So for all the webmasters out there I’ve put together a few handy Linux shell commands that let you easily view 404 pages and other media in real time as visitors crash into those dreaded “not founds” so you can fix them quickly!

tail -f access_log

This command is pretty generic and lets you view all items being requested and served up by Apache (I’m assuming your log file is named “access_log”).  Note that it includes pages, images, CSS, JavaScript, etc. That’s okay except that it’s information overload; each page that is requested typically means all media referenced in the page (images, CSS, and JavaScript) will be spit out by the command as well. Good if you want to eyeball macro trends but not so good to find actual 404 error pages.

But let’s get to the really useful stuff…


One quick note: the commands below assume your Apache server writes its log in a format where the whitespace-separated fields fall at the positions listed here (I think it’s the default for most Apache servers; I’m leaving out the positions I don’t find relevant):

# 1 – IP

# 4 – date

# 7 – requested URL alias

# 9 – status code

# 11 – referring web page (shows a “-” character when the request has no referrer)
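
To check that your own log matches these positions, you can run a single line through awk and print the fields by number. This is only a sketch: the log line below is made up (the IP, URL, and user agent are hypothetical), and the function name is just for illustration. Swap in a real line from your own access_log.

```shell
# A made-up line in Apache "combined" format (all values are hypothetical).
sample='203.0.113.7 - - [12/Mar/2013:10:15:32 -0700] "GET /old-page.html HTTP/1.1" 404 512 "-" "Mozilla/5.0"'

# awk splits on whitespace by default, so each position is a numbered field.
show_fields() {
  awk '{print "ip=" $1 " date=" $4 " url=" $7 " status=" $9 " referrer=" $11}'
}

printf '%s\n' "$sample" | show_fields
```

Note that field 4 keeps the leading “[” from the timestamp and field 11 keeps its surrounding quotes. If the status code does not show up in position 9, your log format differs and the field numbers in the commands need adjusting.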

tail -n 10000 -f access_log | awk '$9 ~ /(404|500)/ && $7 !~ /(secars\.dll|livezilla|\.css|\.js|\.jpg|\.gif|\.png|\.eot|\.ico)/ {print $4 " " $7 " " $9 " " $11 " " $1}'

This command removes common media elements (JPG, GIF, PNG, CSS) so things are a little easier to weed through. I’m also removing “secars.dll” and “livezilla” since those are two special cases of URLs ending up in 404 I often had that I did not want printed.

Real quick: I’m using -n 10000 just to pull in the latest batch of lines from the log file; it isn’t really needed for real-time monitoring.

Also, don’t try throwing a grep command on the end because you will get delayed results… this was a head scratcher for me at first. The likely cause is that awk buffers its output when it writes to a pipe instead of the terminal. Just use the “~” (matches) or “!~” (does not match) operators in AWK if you want to include or exclude more stuff.
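
If you really do need another command after awk, one workaround is to flush awk’s output after every line so nothing sits in a buffer. A minimal sketch, assuming your awk supports fflush() (gawk and mawk do); the function name is made up for illustration:

```shell
# Print 404/500 hits and flush after every line so a downstream
# command (grep, tee, ...) sees each hit right away.
errors_unbuffered() {
  awk '$9 ~ /^(404|500)$/ {print $4 " " $7 " " $9 " " $11 " " $1; fflush()}'
}

# Live usage (assumes the log file is named access_log):
#   tail -f access_log | errors_unbuffered | grep -v 'livezilla'
```

Filtering inside awk is still the simpler option; this is only for cases where the extra command is hard to avoid.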

One more useful command to show all pages being accessed in real time:

tail -n 1000 -f access_log | awk '$7 !~ /(secars\.dll|livezilla|\.css|\.js|\.jpg|\.gif|\.png|\.eot|\.ico)/ {print $4 " " $7 " " $9 " " $11 " " $1}'

This one is just leaving out the check on $9 (the status code as logged by Apache) and so all web pages 200 (ok), 404 (not found), etc. end up displaying, but we still leave out common media elements like the previous example.

Keep in mind your setup may differ and you may want to exclude other patterns if your site serves them up so some tweaking may be required.
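
One way to make that tweaking easier is to keep the exclusion pattern in a shell variable and hand it to awk with -v. This is only a sketch: the variable and function names are made up, and the pattern is an example to adapt to whatever your site serves.

```shell
# Pattern of requests to hide; adjust for your own site.
# [.] matches a literal dot without needing backslash escapes,
# and ([?]|$) lets an extension be followed by a query string.
exclude='[.](css|js|jpg|gif|png|eot|ico)([?]|$)|livezilla'

# Print every request whose URL (field 7) does not match the pattern.
pages_only() {
  awk -v skip="$exclude" '$7 !~ skip {print $4 " " $7 " " $9 " " $11 " " $1}'
}

# Live usage (assumes the log file is named access_log):
#   tail -f access_log | pages_only
```

Anchoring the extension this way hides /style.css?v=2 but still shows something like /data.json, which a bare “.js” match would swallow.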

But being able to know about and fix 404 pages in real time can be pretty handy indeed for SEO purposes.

Oh, and by the way…

On the subject of handy commands, here’s how to see who is hotlinking your files (using your bandwidth without you knowing) and who might have scraped your site and stolen your code:

awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.bestrank\.com/){print $4}' access_log | sort | uniq -c | sort

Make sure to replace “bestrank” and “com” with your own domain name.
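
If you would rather not edit the regex by hand, the same idea can be sketched with the domain in a variable. Everything named here is an assumption for illustration: the example.com value, the function name, the added css extension, and the choice to skip requests that have no referrer at all.

```shell
# Your own domain (hypothetical value; replace it).
own_domain='example.com'

# Splitting on double quotes makes $2 the request line and $4 the referrer.
hotlinkers() {
  awk -F'"' -v dom="$own_domain" '
    # Escape the dots in the domain, then build a pattern for "our own site".
    BEGIN { gsub(/[.]/, "[.]", dom); own = "^https?://([^/]*[.])?" dom "(/|$)" }
    # Media requests whose referrer is present and is not our own site.
    $2 ~ /[.](jpg|gif|css)/ && $4 != "-" && $4 !~ own { print $4 }
  ' | sort | uniq -c | sort -n
}

# Usage (assumes the log file is named access_log):
#   hotlinkers < access_log
```

The output is a count per referring page, with the heaviest hotlinkers at the bottom.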

In my case, I actually found one site that completely stole my old site’s theme code (CSS files at the very least) because they still were hotlinking to my CSS files… a bunch of idiots I tell you! And their site is still broken as of the time I’m writing this post one week after I removed access to the file.

