Take Web data analysis to the next level with PHP
By Paul Meagher - 2004-04-12

Test the hypothesis

Assuming that the sample of Nova Scotia beer drinkers is not biased, can you now conclude that Keith's is the most popular brand?

To answer this question, consider a related question: If you were to obtain another sample of Nova Scotia beer drinkers, would you expect to see exactly the same results? Realistically, you would expect some variability in the observed outcomes from sample to sample.

Given this expected sampling variability, you might wonder whether the observed brand preferences are better accounted for by random sampling variability than by real differences in the population of interest. In statistical terminology, this sampling variability explanation is called the null hypothesis. (The null hypothesis is designated by the symbol Ho.) In this instance, formulate it as the statement that the expected number of responses is the same across all categories of response:

Ho: # Keiths = # Olands = # Schooner = # Other
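
To get a feel for what the null hypothesis predicts, you could simulate it. The following is only a rough sketch: simulate_poll() is a hypothetical helper (not part of the article's code) that draws respondents from an imaginary population in which all four options are equally popular. Because the draws are random, the counts differ from run to run and from sample to sample.

```php
<?php
// Hypothetical helper: draw one simulated poll of $n respondents from a
// population in which every option is equally popular (the null hypothesis).
function simulate_poll(int $n, array $options): array
{
    $counts = array_fill_keys($options, 0);
    for ($i = 0; $i < $n; $i++) {
        // mt_rand() picks each option with equal probability.
        $choice = $options[mt_rand(0, count($options) - 1)];
        $counts[$choice]++;
    }
    return $counts;
}

$options = ['Keiths', 'Olands', 'Schooner', 'Other'];

// Two independent samples of 1000 simulated beer drinkers.
$sampleA = simulate_poll(1000, $options);
$sampleB = simulate_poll(1000, $options);

print_r($sampleA);
print_r($sampleB);
```

Even though no brand is truly preferred in this simulated population, the two samples will rarely show exactly 250 responses per brand. That is the sampling variability the null hypothesis has to account for.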

If you can rule out the null hypothesis, then you will make some progress towards answering the original question about whether Keith's is the most popular brand. The alternative hypothesis that you can then entertain is that the proportions are different in the population of interest.

This test-the-null-hypothesis-first logic applies at multiple stages in the analysis of poll data. Having ruled out the null hypothesis that no overall differences exist in your data, you may then proceed to test a more specific null hypothesis, for example that no difference exists between Keith's and Schooner or between Keith's and all other brands.

You proceed by testing the null hypothesis rather than directly assessing the alternative hypothesis because it is easier to statistically model what one would expect to observe under the null hypothesis. Next, I'll demonstrate how to build that model so that I can compare the observed results to the expected ones.

Model the null hypothesis: The Chi Square statistic

So far you have summarized the results of your Web poll using a table that reports frequency counts (and percentages) for each response option. To test the null hypothesis (that no difference exists between table cell frequencies), it helps to compute an overall measure of how much the table cells deviate from the values you would expect under the null hypothesis.

In the case of this beer poll, the expected frequency under the null hypothesis is the following:


Expected Frequency = Number of Observations / Number of Response Options 
Expected Frequency = 1000 / 4 
Expected Frequency = 250

To compute an overall measure of how much the observed frequencies deviate from the expected frequency per cell, you might simply sum the differences: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you find that the sum is 0 because deviations from a mean always sum to 0. To get around this problem, square all the difference scores (hence the square in Chi Square). Finally, to make the score comparable across samples with different numbers of observations (in other words, to standardize it), divide each squared difference by the expected frequency. So, the formula for the Chi Square statistic looks like this ("O" means "observed frequency" and "E" means "expected frequency"):
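
The cancellation problem is easy to check numerically. The snippet below is a minimal sketch that uses the four observed counts from the sum above; the variable names are illustrative only.

```php
<?php
// Observed counts, in the order they appear in the sum above.
$observed = [285, 250, 215, 250];

// Expected count per cell under the null hypothesis: 1000 / 4 = 250.
$expected = array_sum($observed) / count($observed);

$rawSum = 0;
$squaredSum = 0;
foreach ($observed as $o) {
    $rawSum += $o - $expected;            // positive and negative deviations cancel
    $squaredSum += ($o - $expected) ** 2; // squaring removes the sign
}

echo "Sum of raw deviations: $rawSum\n";         // 0
echo "Sum of squared deviations: $squaredSum\n"; // 2450
```

The raw deviations cancel out, while the squared deviations produce a positive total that grows as the observed counts move away from the expected count.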

Figure 1. The formula for the Chi Square statistic

Chi Square = Sum of [ (O - E)^2 / E ] over all response categories

If you calculate the Chi Square statistic for the beer poll data, you obtain a value of 9.80. To test your null hypothesis, you want to know the probability of obtaining a value this extreme under the assumption that it is due to random sampling variability. To find this probability, you need to understand what the sampling distribution for Chi Square looks like.
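
You can reproduce this value with a few lines of PHP. This is a sketch under the assumption that every cell shares the same expected frequency, as in the beer poll; chi_square() is a hypothetical helper name, not a built-in PHP function and not necessarily the code the author uses.

```php
<?php
// Hypothetical helper: Chi Square for observed counts compared against a
// single expected frequency that applies to every cell.
function chi_square(array $observed, float $expected): float
{
    $chiSquare = 0.0;
    foreach ($observed as $o) {
        // Squared deviation, standardized by the expected frequency.
        $chiSquare += (($o - $expected) ** 2) / $expected;
    }
    return $chiSquare;
}

$observed = [285, 250, 215, 250];
$expected = array_sum($observed) / count($observed); // 250

printf("Chi Square = %.2f\n", chi_square($observed, $expected)); // Chi Square = 9.80
```

Only the first and third cells contribute to the total (4.9 each), because the other two cells match the expected frequency exactly.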




First published by IBM developerWorks


Article copyright and all rights retained by the author.