Loading up the Database
To perform the analysis we'll use a relational database running on a cluster. Relational databases are the most prevalent database architecture. Data is presented to the user in table format. There are a number of relational databases available for Linux, including DB2(r) Universal Database and IBM-Informix XPS Developers Kit. My examples use the XPS Developers Kit. XPS has implemented its own message-passing layer rather than using the more typical Beowulf implementations (for example, PVM, MPI). Feel free to use the database of your choice.
If you want to test a cluster using the XPS Developers Kit, you can get a copy by contacting Mohan Natraj (World Wide Marketing Manager for XPS) at mogs@us.ibm.com. Mohan's phone number is (503) 525-7419.
The installation procedure is well-documented in the Installation Guide and the release notes/README file. To make life easier, download the easyinst.tar file which automates the installation process. By reading the shell script you will see the exact steps to run to do the installation. By reading the installation documentation, you'll understand why the steps are needed.
Relational databases use Structured Query Language (SQL) to work with the data. A tutorial on Informix SQL is available if you need it. We need both data definition language (DDL) to create tables and data manipulation language (DML) to query the data in the tables.
First let's create a table to store the data. The SQL to do this looks like:
|  | 
To run this SQL from a file:
|  | 
Data loaders come in many different flavors. The XPS loader takes the relational table concept and maps it to Linux files and pipes. After the mapping has been defined with an external table you can use SQL to load, query, and unload data. The DDL for the external table is:
|  | 
Loading from the flat file of data defined in the "datafiles" clause is as simple as an insert-select. This is our first example of a DML statement.
|  | 
With data in the database we can now run queries to answer questions about the Web site.
Using SQL to answer questions about the Web site
Let's answer the following questions about the Web site with SQL:
- What are the most popular objects?
- Which objects consume the greatest site bandwidth?
- How many errors are generated and what is their nature?
- What is the distribution of activity over time?
- What links are bookmarked, bypassing the home page?
This is only a sample of the questions that can be answered. The logs used for the examples are samples from www.informix.com before the latest web site revisions.
1) What are the most popular objects?
The most popular objects belong on the home page, or a link to them belongs on the home page, to make navigation and access easier. The query is an easy one:
|  | 
What is the most popular link at Informix for one sample log file? The envelope please:
|  | 
2) Which objects consume the most bandwidth?
The objects with the largest value of hits * object size consume the most site bandwidth. Why do we care? If a large 24-bit color image is eating most of the bandwidth, you might want to reduce the size or quality of the image to improve performance.
The query we use to find out which objects are consuming the most bandwidth is more complex and creates a temporary table on the fly:
|  | 
A sample from the Informix site yields:
|  | 
3) How many errors are generated and what is their nature?
The first query below is a good quality-check of the site. The second query provides a list of urls and their errors in descending order, which gives a prioritized list of what to fix.
|  | 
| http_code | (count(*)) | |
| 200 | 1478884 | ok | 
| 304 | 452508 | use cached copy | 
| 404 | 163149 | not found | 
| 302 | 55476 | moved temporarily | 
| 206 | 42576 | partial content | 
| 500 | 7157 | internal server err | 
| 401 | 4166 | unauthorized | 
| 123 | ? | |
| 403 | 25 | forbidden | 
| 503 | 17 | service unavailable | 
|  | 
I'll spare you the output from this one. It is only interesting if it comes from your own site.
4) What is the distribution of activity over time?
The international community requires that this site is available 7x24. The distribution of hits over time illustrates the demand for high availability in the Web environment.
|  | 
A graphical front end can make all the difference when presenting information. This graph, based on the above query, shows that the site is 7x24:
Figure 1. HTTP Hits by Time5) What links are bookmarked, bypassing the home page?
If you build it they will come. And maybe even return for another visit. Browser bookmarks (or "favorites" if you use "that" browser) are your visitors' votes about the best content on the site. This content should be easily accessible. Putting links to the best content on the home page makes the home page more of a portal than an annoyance to avoid. Any customization or new information that is presented on the home page should be propagated to the popular pages to reach the returning bookmark audience.
The bookmark SQL is more complicated. We want the first object selected for a unique session. The UserTrack directive would be very helpful, but since this isn't available in the data we'll do the best we can. The comments in the code explain the stages of the query.
|  | 
All the analysis so far can be performed on any database. Now we'll look at the server used for these examples and how it can be clustered.
View Web Click Stream Analysis using Linux Clusters Discussion
Page: 1 2 3 Next Page: Linux Clusters