Custom Website Analytics Built with Qrimp

Posted: 7/6/2009 12:04:08 PM
I was frustrated with a lot of the one-size-fits-all website analytics programs out there, so I rolled my own on the Qrimp platform.

Why analytics are important


If you want to better understand your website visitors, your customers, or your potential partners, you need to know what they know about you. You learn what they know partly by examining what they are reading at your site. Which types of articles get the most attention? Which products are the most popular? Which pages are your potential customers looking at that convince them to sign up?

It's really about data and information and turning those into knowledge you can use to improve your company -- or yourself. Sometimes you don't know how you are going to use additional knowledge, but I do know that the more I know, the better I can react, and that's why site logs are important. If your site starts to perform slowly, or you get a lot of traffic and don't know where it's coming from, your ignorance is hurting you in ways you may not even realize.

Why One-size-fits-all doesn't work


You may be wondering, "Why waste your time building your own analytics when there are lots of options out there?" Well, the simple answer is that with Qrimp it really doesn't take that long. There was information I wanted access to that I couldn't get out of Urchin or Google Analytics. Some of these packages are good, but when I used them, I always felt constrained. I wanted more, and I knew there was more knowledge to be gained from the logs than I was getting.

For example, Google Analytics really limited my ability to drill into details about each visitor. When someone signs up for Qrimp, I want to know where they came from and what they looked at. Are they coming from a university? Really, there were a lot of questions I had that were left unanswered by the statistics packages out there designed for all websites. I wanted a statistics package designed for my website.

When someone signs up, I log the IP address for the computer they are using. Since the sign-up system is also built in Qrimp, I knew I'd be able to create a link from that IP address directly to the analytics system and immediately know everything that customer looked at before and after signing up. This kind of knowledge would help me better understand what is on the mind of that particular customer.
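To make that concrete, here's a rough sketch of the lookup that link performs: given the IP address recorded at signup, pull every request that address made, in order. The table and column names are illustrative rather than Qrimp's actual schema, and I'm using SQLite here just to keep the example self-contained:

```python
# Illustrative only: everything one signup's IP address looked at, in order.
import sqlite3

def history_for_signup(db_path, signup_ip):
    with sqlite3.connect(db_path) as conn:
        return conn.execute("""
            SELECT logdate, servername, page, referer
            FROM weblog
            WHERE ip = ?
            ORDER BY logdate
        """, (signup_ip,)).fetchall()
```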

Problems with JavaScript-based solutions


Google Analytics is also JavaScript-based, and I found the statistics produced by GA were really quite different from the Urchin stats generated from the raw logs. The discrepancy made me uneasy. I knew that to get the best information, I had to go to the raw logs.

Rather than go into detail about why JavaScript-based solutions aren't up to my standards, I'll refer to a post I read today called "10 reasons why web log analyzers are better than JavaScript based analytics."

Problems with Urchin


One of the biggest issues with Urchin was that the navigation just didn't feel right. For example, in the chart below, I wanted to click on a bar and examine the sessions, but Urchin doesn't let me do that. That kind of functionality is built into Qrimp, so I knew it would be very easy to present the data more intuitively.



Also, the Urchin statistics are presented on a site-by-site basis, but because Qrimp is a multi-tenant system, I wanted to be able to break down statistics by Qrimp instance. Were visitors coming to www.qrimp.com also visiting The Developer Network? Did they start out at the Cloud Computing Portal, or maybe they were techies looking at the Tech Jobs Charts? Using Urchin, answering that question would be almost impossible.

Another issue was with website referrals; here's an image:

See how some of the information is disregarded? If you read Hacker News, you'll know that from that link there's no way for me to see the comments on whatever story was submitted to Hacker News. That information is just lost. But when someone refers to a page at Qrimp, I want to be able to go to that page and see what they are saying.

File based analytics can't be queried


Most analytics programs aren't database driven. They look at your logs and then build a bunch of reports or static files that present your data to you in the ways they think you'll want to see it. But without a comprehensive querying system, they limit the visibility of the data, and visibility is really what I was after -- that's what Qrimp is all about.

What I wanted to see


Whenever I would use one of the standard analytics packages, I'd always have a lot of questions. I've been using website analytics tools for over 10 years now, going back to when I had to examine logs manually. I first tried awstats back in the day, which I loved, and then Microsoft had this great tool called Site Server which had some built-in analytics processing. Site Server's analytics was really nice, but it "went away..."

On a side note, that's part of why I built Qrimp -- I was tired of my favorite software going away. I knew if I wanted software done right, I was going to have to do it myself, but there was so much software I wanted to build that I decided to start with the platform. It's my mission to make sure Qrimp never goes away.

I wanted Individual Visitor Logs


Individual visitor logs were something I really wanted to see. Not only for a single site, but across all sites. Was someone using their Qrimp app, then clicking on the Help link and finding more information on the Developer Network? Which help topics were they reading, and when? I need to know so I can make Qrimp better.

New Domains


I also wanted detailed reports about which new domains were coming to the site. Lots of smaller companies come to check out Qrimp, and maybe they'd like to buy it. These new domains are hot leads I can examine and build a case on, then call that company and ask if they'd like to talk or see a web demo.

I've also been working with different universities who want to get their students up and running on Qrimp. I love the idea, so I wanted to see which new .edu domains were coming to Qrimp; then perhaps I could email the Computer Science department and see if they were interested.
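Once the reverse DNS lookups are done (more on those below), spotting a new .edu visitor is just a matter of comparing resolved hostnames against the domains I've already seen. A quick sketch of the idea, with made-up names:

```python
# Rough sketch of the ".edu watch" idea, assuming hostnames have already been
# resolved via reverse DNS. The inputs (hits, known_domains) are hypothetical.
def new_edu_domains(hits, known_domains):
    """hits: iterable of resolved hostnames, e.g. 'lab-12.cs.someuniversity.edu'."""
    found = set()
    for host in hits:
        parts = host.lower().rstrip(".").split(".")
        if len(parts) >= 2 and parts[-1] == "edu":
            domain = ".".join(parts[-2:])   # e.g. 'someuniversity.edu'
            if domain not in known_domains:
                found.add(domain)
    return found
```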

What I did NOT want to see: Log Spam and Robots


I found that as I continued to use the Urchin statistics, I was getting more and more referrals from websites peddling drugs, porn, and myriad other types of spam. There was no way for me to get these hits out of my logs unless I filtered them out myself.

I also didn't want to see a lot of details about the robots. I developed a few algorithms to figure out which hits were robots, then flag them and remove them from the big list. This allowed me to filter the logs and drill into only the human beings. I might have a couple hundred sessions one day and then find out that half of those are robots. It's bad data and I don't care about the robots -- I care about customers.
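The exact rules aren't that interesting, but the flavor is something like the sketch below -- and to be clear, this is an illustration, not the actual algorithms I run. A hit gets flagged if the user agent names a known crawler, if the session asked for robots.txt, or if it never pulled any page assets the way a real browser would:

```python
# Illustrative robot heuristics -- not the exact rules in my system.
ROBOT_UA_HINTS = ("bot", "crawler", "spider", "slurp", "curl", "wget")

def looks_like_robot(session):
    """session: dict with 'user_agent' and 'paths' (the URLs it requested)."""
    ua = (session.get("user_agent") or "").lower()
    if any(hint in ua for hint in ROBOT_UA_HINTS):
        return True
    paths = session.get("paths", [])
    if any(p.endswith("robots.txt") for p in paths):
        return True
    # Real browsers pull images, CSS, and scripts; most crawlers don't.
    if paths and not any(p.endswith((".css", ".js", ".gif", ".png", ".jpg")) for p in paths):
        return True
    return False
```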

Customizing Qrimp


So I had to get the raw logs into Qrimp. In the raw logs, there is one request on each line. I needed each request to be a record in the database, so I had to write a little bit of customized code.
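Roughly speaking, each line of the log becomes one row in a table shaped something like this. The column names are illustrative, not the actual table in my Qrimp app, and I'm using SQLite here only to keep the sketch self-contained:

```python
# Sketch of the log table: one row per request. Hypothetical column names.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS weblog (
    id          INTEGER PRIMARY KEY,
    logdate     TEXT,               -- timestamp of the request
    servername  TEXT,               -- which Qrimp instance was hit
    ip          TEXT,
    page        TEXT,
    querystring TEXT,
    referer     TEXT,
    useragent   TEXT,
    status      INTEGER,
    bytes       INTEGER,
    isrobot     INTEGER DEFAULT 0,  -- set later by the cleanup pass
    isspam      INTEGER DEFAULT 0
);
"""

def create_log_table(db_path="weblogs.db"):
    with sqlite3.connect(db_path) as conn:
        conn.executescript(SCHEMA)
```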

Downloading the Raw Logs


I run Qrimp on a local instance, so all I had to do was create a new database and install it. The hosting provider makes the logs available through FTP in ZIP files, so I wrote a script, run once a day, that connects to the server via SFTP and downloads any new log files. I use WinSCP, which has an awesome scripting language that I use to automate this. I created a scheduled task (a "cron" job) to run the script each morning.
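In spirit, the daily job is just "pull anything new from the logs folder." Here's a hedged sketch of how it could be wired up -- the host, paths, and credentials are placeholders, and the WinSCP commands shown are a generic synchronize-style script, not my actual script file:

```python
# Sketch of the daily download job: write a WinSCP script, run it, and let the
# scheduled task call this each morning. All names here are placeholders.
import os
import subprocess
import tempfile

WINSCP = r"C:\Program Files\WinSCP\winscp.com"   # adjust to your install

SCRIPT = """
option batch abort
option confirm off
open sftp://username:password@logs.hostingprovider.example/
synchronize local C:\\weblogs\\incoming /logs
exit
"""

def download_new_logs():
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(SCRIPT)
        script_path = f.name
    try:
        # "synchronize local" only pulls files we don't already have locally.
        subprocess.run([WINSCP, f"/script={script_path}"], check=True)
    finally:
        os.remove(script_path)
```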

Parsing the logs


Then I wrote a little code that unzips the raw logs, opens the text file contained within, and goes line by line, extracting the logdate, servername, referer, and other details about each request. I added a little bit of security to it so that the database wouldn't get corrupted by script tags via cross-site scripting and that sort of thing. I've parsed lots of files like this, so it didn't take long, maybe an hour or two.
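Mine is custom code inside the Qrimp import pages, but the gist fits in a few lines. This sketch assumes W3C extended log format (where a "#Fields:" header names the columns), and "security" here just means neutralizing angle brackets so a hostile querystring or referer can't smuggle markup into the reports:

```python
# Condensed sketch of the log import: unzip, read line by line, build one
# record per request. Assumes W3C extended log format; field names will vary.
import zipfile

def sanitize(value):
    # Strip the teeth out of any embedded markup before it reaches the database.
    return value.replace("<", "&lt;").replace(">", "&gt;")

def parse_log_zip(zip_path):
    """Yield one dict per request, keyed by the names in the #Fields: header."""
    fields = []
    with zipfile.ZipFile(zip_path) as z:
        for name in z.namelist():
            with z.open(name) as log:
                for raw in log:
                    line = raw.decode("utf-8", errors="replace").strip()
                    if line.startswith("#Fields:"):
                        fields = line[len("#Fields:"):].split()
                        continue
                    if not line or line.startswith("#") or not fields:
                        continue
                    yield {f: sanitize(v) for f, v in zip(fields, line.split(" "))}
```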

I thought having detailed analytics like this would be interesting for our customers as well, so I built the rest of the import process into web pages. At some point in the future, I may automate this process too, or make it a commercial add-on, but for now it's part of my daily routine.

The first thing I do is click on the Qrimp server I want to import logs for, from a list something like this:


When I click on one of the servers (I usually pogo-stick through them), I scroll to the daily import and click it:


Then I see a simple page with a progress bar letting me know how long it'll take.


Back into browser-based development


Everything beyond this point was created using only the Qrimp development platform. Once the data was in the system, custom queries, field templates, portals, and everything else were super easy.

After importing all the new logs, I reindex the logs table so queries are fast. Then I do a reverse DNS lookup on the IP addresses so I know a little bit more about each visitor. Qrimp has a reverse DNS function built in, so I created an external data source to extract this information. I know this isn't the most efficient way to do it, but it looks cool and it's fun to watch. Here's a video of that. You'll notice in the video that I use Opera to reload the page automatically every 5 seconds.
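Boiled down, the lookup behind that external data source is just a reverse DNS call per IP address. The Qrimp side is a built-in function, so this is only an illustration of the underlying step in plain Python:

```python
# The reverse DNS step, shown in plain Python for illustration.
import socket

def reverse_dns(ip):
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None   # no PTR record, or the address couldn't be looked up
```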



When it's complete, I run one more query that examines the logs, flags robots and spammers, and does a bit of denormalization and truncation of old entries to keep the system running fast. This is running on my laptop, and I only have 2GB of RAM.
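For what it's worth, the cleanup pass amounts to a few statements against the log table -- something like the sketch below, using the same made-up schema as earlier. The user-agent patterns, the spammers table, and the 90-day cutoff are illustrative, not my exact queries:

```python
# Sketch of the nightly cleanup: flag robots and spam referrals, then trim
# raw entries older than the cutoff. Table/column names are hypothetical.
import sqlite3

def cleanup(db_path="weblogs.db", keep_days=90):
    with sqlite3.connect(db_path) as conn:
        # Keep a simple list of referers I've marked as spam.
        conn.execute("CREATE TABLE IF NOT EXISTS spammers (referer TEXT PRIMARY KEY)")
        conn.execute("""
            UPDATE weblog SET isrobot = 1
            WHERE lower(useragent) LIKE '%bot%'
               OR lower(useragent) LIKE '%spider%'
               OR lower(useragent) LIKE '%crawler%'
        """)
        conn.execute("""
            UPDATE weblog SET isspam = 1
            WHERE referer IN (SELECT referer FROM spammers)
        """)
        conn.execute(
            "DELETE FROM weblog WHERE logdate < date('now', ?)",
            (f"-{keep_days} days",),
        )
```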

In the future, I plan on storing aggregate information for the older logs, but for now this is fine. I keep the log zip files, so it'll be easy to keep all the data on a larger server or in aggregate in the future.

Screen shots


So now that I have my data in the Qrimp database and processed, I can examine it. I have a portal page; here are my settings for that:


One of the charts on the reports is requests by day for the past month (click for full size):


Another report is "Popular pages at Qrimp," which shows me a page like this (click for full size):


I like the calendar view too. I created a custom template that allows me to see sessions, bandwidth, and unique referers by day. I can click on the links to drill in.


If I click on Sessions in the link above, I see the image below. I can click on the numbers under Sessions and Referers and get more details about the numbers behind them.


Also, notice the icons are linked to the server names in this chart too. It's just a field template in Qrimp, so anywhere there is a servername column in a grid report, I can quickly get details for the server; I don't have to customize every single report.

If I click on the number under referers, I'll see a list of sites sending traffic there.


Notice in the image above there's the word "spam" by the link to the referer. If I click it, it takes me to a form where I can add it to the list of spammers, and all hits from that referer will be excluded from future reports. The spam referer link is pre-populated in the form, so all I do is press TAB TAB Space (to hit the create button) and I've added the spammer. I set up the workflow so that as soon as I add the spammer, I go right back to the report where I found it.
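Behind the scenes the exclusion is nothing magic: the referer reports just skip anything that matches the spammers list. In terms of the made-up schema from earlier, the query is roughly this:

```python
# Sketch of a referer report that skips known spammers. Same hypothetical schema.
import sqlite3

def referers_excluding_spam(db_path, servername, day):
    """Return (referer, hits) pairs for one server and one day, spam excluded."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute("""
            SELECT referer, COUNT(*) AS hits
            FROM weblog
            WHERE servername = ?
              AND date(logdate) = ?
              AND referer NOT IN (SELECT referer FROM spammers)
            GROUP BY referer
            ORDER BY hits DESC
        """, (servername, day)).fetchall()
```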

Integrating with other sites


Now the really cool part starts. Because I'm using Qrimp for all our data systems, anywhere we have an IP address, I can create a Field Template for that item that will link it to our website analytics program. For example, when someone signs up for Qrimp, their IP address is logged. I added a little icon that, when clicked, shows me exactly what that user viewed before signing up. This will help me improve my site so that we get more customers in the future. Here's what that looks like:


Integrating the sites like this is very simple; here's the HTML behind the field template for the IP Address data:


Another cool part is that our website analytics application is running behind our firewall, but because Qrimp links can go deep into data, we can structure the querystring to link directly to particular pieces of information, even if the systems aren't running on the same server.

Conclusion


All told, I've spent probably 10 to 20 hours on this website analytics system. There is a lot to the system I haven't shown you, but the amount of power I can get out of Qrimp is really amazing for the amount of work put into it. The amount of time I'd lose learning how to use Urchin or Google Analytics, or even just maintaining the JavaScript tags for GA, is probably more than that over the life of the application.

But now I have unlimited access to my data. I can query it any way I like, and I can integrate it very easily with my other systems.

I love Qrimp so much I'm almost about to cry right now! It's so powerful I can barely contain myself.