+++ S P E C I A L R E P O R T +++
"Understanding Log Files"
javilk
javilk(at)mall-net.com 
IMAGINEERING 
 August 29, 2001
Log files contain a wealth of data. But if you don't know how to read them, they seem like chicken scratchings. This tutorial will show you what's in your log files, and how to see where your visitors came from and re-run their queries. 

LOG FILES
What's in a log file? Well, your standard who, what, when and where. And if you work your way through this little tutorial, you'll be able to re-run your users' web searches to see who your competition really is. Really!

"Who" --- the first item is usually an IP number or fully qualified domain name.

"What" -- What they saw, which web page, and what they are using to view it, giving you some demographics information

"When" -- When they came. And with that, more information about their habits.

"Where" -- (Ok, this is a stretch.) Where in the sequence of pages they saw your page, or where they came from. (Not all servers provide this information.)

Here's a sample record from one of my log files. (It's really all on one line, but we had to fold it for publication.)

cache-mtc-af06.proxy.aol.com - - [24/Aug/2001:23:13:23 -0700] "GET
/free/magazine.html HTTP/1.0" 200 63855
"http://www.google.com/search?q=subscribe+free+magazine+and+playboy&hl=en&safe=off"
"Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98; Win 9x 4.90)"

WHO: cache-mtc-af06.proxy.aol.com - - 

This is a proxy from AOL. It represents one viewer, but others likely see the same page you served, because the proxy caches it and re-serves it. How many more will depend on how popular your pages are. The hostname is also how you identify many (not all) search engine crawlers.
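
(Once you know the grep command introduced below, spotting one well-known crawler by its hostname is a one-liner. A sketch only: googlebot.com is the hostname Google's crawler resolves to, and your own mix of crawlers will differ.)

grep googlebot.com your-log-file | more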

WHEN: [24/Aug/2001:23:13:23 -0700] 

Time of day. When do the bulk of your visitors come? Is it the home browsing crowd, or the guys at work? Or might they be from another part of the world? (Check the "who" field for something other than .com.)
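
By the end of this tutorial you'll be able to answer that with a one-liner like this sketch, which counts hits per hour of the day using the cut, sort and uniq commands explained below (the field number after -d : may differ with your log layout):

cat your-log-file | cut -d : -f 2 | sort | uniq -c | more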

WHAT: "GET /free/magazine.html HTTP/1.0" 200 63855

They saw the free magazine page; the 200 means the server delivered it. (A 404 would mean it was not there; a 304 would mean it wasn't modified, so they got it from their browser's cache, having viewed it before.) The next number tells us the server sent 63855 bytes. You can count the number of times a page was hit, and see the pages they asked for which you DIDN'T have -- yet! Those are the 404's. (I hope /default.ida is one of your 404's, not a 200! That's the Code Red worm trying to crawl into your system. I've gotten 156 of those probes so far.)
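
Here is a quick sketch of how you might count those, using the grep and wc commands introduced below. (The " 404 " match is rough; it could occasionally catch a 404-byte page, so treat the number as approximate.)

grep ' 404 ' your-log-file | wc -l
grep /default.ida your-log-file | wc -l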

WHERE:
"http://www.google.com/search?q=subscribe+free+magazine+and+playboy&hl=en&safe=off" 

This is the page he was looking at when he clicked on my link. (Not all ISPs provide it in your logs. You should demand it.) And since it was a search engine, I can see he was searching for "free magazine" and "playboy". (Must not have read the title of my page, or he was doing market research.) This information is called psychographics -- the psychology of the visitor. Note that this HTTP_REFERER information is not provided by every browser, and is not always accurate.

You can get this data from the HTTP_REFERER variable with scripts, then have the script feed your visitor exactly what he or she is looking for. You can also re-run his searches, or the more common searches, and see how you rank, who your competition is, etc. By the end of this tutorial, you'll know how to do that.
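
The script side can be as simple as a CGI that looks at HTTP_REFERER. Here's a minimal sketch in plain shell; the page text is made up, your server has to be set up to run CGI scripts, and a real script would match more carefully:

#!/bin/sh
# Minimal CGI sketch: greet the visitor based on what they searched for.
echo "Content-type: text/html"
echo ""
case "$HTTP_REFERER" in
  *playboy*)        echo "<P>Looking for men's magazines? Our free list starts here.</P>" ;;
  *free*magazine*)  echo "<P>Welcome! Here is the full free magazine list.</P>" ;;
  *)                echo "<P>Welcome to the magazine page.</P>" ;;
esac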

WHAT: "Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98; Win 9x 4.90)"

This is what he was browsing with, and what operating system he was running. More hints at his or her demographics. Mac user? Probably more graphics-oriented. Linux user? Probably more technical. Netscape? Could be a power user, or just an MS hater. Win95? A legacy system; either not rich, or doesn't care much about computers.

Ok, so there are too many entries in your log file to do this on a one-by-one basis. So maybe you just want page counts. Well, there are plenty of tools for that. But... if you learn these simple things, you can do what the common tools can't do. 

GETTING YOUR HANDS ON IT

If you are able to log in to a UNIX shell via Telnet, a few simple tools can let you do a LOT of analysis, including re-running your visitors' queries. 

Computers are interactive, like swimming; you have to thrash the water to get anywhere. I don't get a feel for how things work unless I get on the computer and actually type the commands as I read about them. That's the way I felt when I first cracked the manuals back in college, learning to program in two weeks on my own before the classes got started; and that's the way I still feel, decades later. So log on to your server, and try these examples as we go. No cutting and pasting! Print this file and type the commands yourself, as you have to teach your fingers the words.

The utility "grep" and "egrep" are like grappling hooks, letting you grasp select records and extract them. For example, if you wanted to see all the references to /portfolio.html was referenced in your logs, you would say: 

grep /portfolio.html your-log-file

(Come on, try this, the water is fine! All right, so maybe it feels freezing cold. You'll get used to it! Come on, try it! Call your tech support line if you don't know where your log file is, or if you don't know how to log on.)

To see what was fetched by visitors coming through aol.com, you would type:

grep aol.com your-log-file

To see which AOL users got your portfolio page, you would combine the two requests this way, feeding the output of one command into the other using the vertical bar as a pipe:

grep /portfolio.html your-log-file | grep aol.com 

Ok, let's say you are running a promo banner at Octopus Search, that new engine you heard about last month. How many visitors come from there? To find out, you created a special landing page to link the banner to, octofree.html. So we grep for /octofree.html. Then, to count the lines, we use "wc", which stands for word count, with the -l option limiting the output to show only the number of lines:

grep /octofree.html your-log-file | wc -l
grep /octofree.html your-log-file | grep aol.com | wc -l

Or, to see how many visitors are using a Macintosh (grepping for "Mac" catches Macintosh, Mac_PowerPC, and others; you may also get some false matches, but not that many):

grep Mac your-log-file | wc -l

THE BIG EXAMPLE

What if you wanted a count on all of your pages, sorted by popularity? That's a little more complicated, but still in the range of a one-line command. 

To count, we need to group like things together, and there is a lot in each log record. We will cut the log down to just the file names, then sort it so identical records sit together and can be counted, giving a number for each page. Once that is done, we will re-sort that list so the most visited pages come out on top.

Let's build the command step by step, watching that we get the right output each time. That's the way I started years ago, and that method still serves me fine. Oh, and the up arrow lets you recall and edit previous command lines on many UNIX / Linux systems.

To get it started, we use "cat", the catenate (concatenate) command. It's like DOS's "type" command. ("Unix is a dog, so use cat.") So ask your systems person where your log file is, and use that instead of my "your-log-file" label.

cat your-log-file

Runs off the screen! Well, we can stop that by using "more", to get less. (Some systems also have a "less" command, which does more.) We pipe the output of "cat" into the "more" command using the vertical bar, which we sometimes call the "or" symbol. (I always call it the or symbol, confusing people.)

cat your-log-file | more

Now, we add the cut command to extract only the file name. We need to specify a field delimiter, which will be a blank. That's -d " ". Then we have to tell it which field we want to look at, which in _my_ case is field 7 (given as the range 7-7). So the command becomes:

cat your-log-file | cut -d " " -f 7-7 | more

You may need to adjust the 7-7 to match your particular system's log file layout; that's why we are building the command up step-by-step. That, and so you will understand how to modify all this for other purposes. When this runs right, you should just see page names, like /portfolio.html, /index.html, etc. Play with the numbers till you see only the file names.
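
If you are not sure which field number to use, a peek at the first few raw lines helps. ("head" is a standard utility that shows just the top of a file; the -n option says how many lines.)

head -n 3 your-log-file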

Next, we want to count each file name. But to count, we need to group all the same file names together, so we have to sort before we count. 

cat your-log-file | cut -d " " -f 7-7 | sort | more

Verify everything is in order, or at least that the first few pages are. The command uniq collapses adjacent duplicate lines, which is why we sorted first. We add the count option, -c, to tell us how many of each we have:

cat your-log-file | cut -d " " -f 7-7 | sort | uniq -c | more

Because we sorted the file, all the file names are in alphabetical order. We can re-sort them into popularity order with another sort, using -n to sort numerically (so that 21 comes before 101) and -r to reverse the order so the most popular pages are on top:

cat your-log-file | cut -d " " -f 7-7 | sort | uniq -c | sort -nr | more

And there we have it! Sure, you CAN use WebTrends, but what if you just want the page counts that come from AOL, or just from a specific client of yours? Or how many hits you got from Google? Remember the grep for aol.com?

grep aol.com your-log-file | cut -d " " -f 7-7 | sort | uniq -c | sort -nr | more
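
And the Google question from above is just another grep. (A sketch: this matches google.com anywhere in the record, which in most log layouts means the referrer field.)

grep google.com your-log-file | wc -l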

Or... what do the Macintosh users prefer to look at?

grep Mac your-log-file | cut -d " " -f 7-7 | sort | uniq -c | sort -nr | more

THE SPY

Want to re-run other people's queries and see where they came from? That's easy! Try this:

cat your-log-file | cut -d " " -f 11-11 | sort | uniq -c | sort -nr | sed 's/./<BR>&/;s/".*"/<A HREF=&>&<\/A>/' > tmp.html

and if 11-11 is the right field number for your web logs, you would get a bunch of lines with counts and clickable URLs like this:

<BR> 4 <A
HREF="http://google.yahoo.com/bin/query?p=freebigmovie&hc=0&hs=0">
"http://google.yahoo.com/bin/query?p=freebigmovie&hc=0&hs=0"</A>
<BR> 4 <A
HREF="http://google.yahoo.com/bin/query?p=artis+melayu+bogel&rs=1&hc=0&hs=0">
"http://google.yahoo.com/bin/query?p=artis+melayu+bogel&rs=1&hc=0&hs=0"</A>

Step by step, we do the following:
1. cat the log file, cut it at field 11 (which you may need to change, so start building up the command as we did above, inspecting each step for reasonable output),
2. sort it so we can count it, 
3. use uniq -c to count how many times each reference is used,
4. use sed, the Stream Editor, to substitute (s/old/new/) the first character of each line (the "." matches any single character) with "<BR>" followed by that same character (the "&" stands for whatever matched), then use a second substitution to wrap an <A HREF=...> link around the quoted referring URL,
5. use the ">" to put the output into a file, so you can look at it with a web browser and click on the links to see what they saw.

Simple, once you know how. And if you play with it a little, you WILL know how.
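
If you want just the Google searches, counted and ranked, a variation cuts on the double quote instead of the blank. (A sketch: in my log layout the quoted referrer is the fourth quote-delimited field, so adjust the number as before.)

grep google.com/search your-log-file | cut -d '"' -f 4 | sort | uniq -c | sort -nr | more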

Two other UNIX / LINUX commands you really need to know are "man" and "apropos". Man gives you the manual pages on the specified command:

man grep
man wc
man sed

Apropos tells you which commands might be appropriate:

apropos edit

Some UNIX systems also use info instead of man.
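
On such a system, the equivalent of the man examples above would be, for example:

info grep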

Depending on the system you use, you might want to pipe apropos to more so you get less on the screen at one time. 

apropos edit | more

This kind of play with your log data can be a lot of fun! Unix isn't just an operating system, it's a language you can use to describe and extract all kinds of fascinating information from your web logs and other files.

In some of my consulting, I use these types of commands to get a bit of information quickly; then, if there is enough there to make it worthwhile, I write Perl scripts to generate more elaborate reports for marketing, etc.

If you have a quick question or two, e-mail me or call me at 408-779-9842. (I am in the process of moving, in part due to connectivity issues, so I will be in and out, and the number will change soon.) If you want something more elaborate, then let's talk about your needs and your budget.

javilk(at)mall-net.com ------------------- webmaster(at)Mall-Net.com 
------------------------- IMAGINEERING -------------------------
------------------ Every mouse click, a Vote ------------------
----------- Do they vote For, or Against your pages? -----------
------- What people want: http://www.SitePsych.com/free/ -------
-- Check your page: http://www.SitePsych.com/sanitycheck.html --
- We have the reports, products and services to help you Grow! -
--- Web Imagineering -- Architecture to Programming CGI-BIN ----
---------------------------------------------------------------- 

