Apply probability models to Web data using PHP
By Paul Meagher - 2004-04-14

Building a probability model

Now that you understand the mechanics of how to use the Exponential distribution class, you're ready to see how it can be used to gain insight into a real-world problem.

And this will be a fun problem: developing a model for when, and how many, goals are likely to occur in World Cup soccer games. What I want to focus on here is showing you how to use the Exponential distribution class I just discussed to derive some of the results reported in this article.

As you know, the Exponential distribution accepts a rate parameter, also referred to as lambda. To use the Exponential distribution to develop a probability model for World Cup soccer goals, you need to be able to derive this rate parameter from your measurements.

A rate is defined as the number of occurrences of some phenomenon over a unit of time or space. The rate of soccer goals in the World Cup tournaments from 1990 through 2002 is equal to 575/232, or about 2.48. This can be thought of as the mean number of goals scored in a 90-minute regulation game. It was computed as follows:

average goal rate = Total Number of Goals / Total Number of Games

In PHP, you represent this concept by setting up variables called $num_goals and $num_games that are placeholders for the evolving quantities that are needed to compute the goal rate. The code snippet that follows shows a fragment of the PHP-based probability model for World Cup soccer goals.


<?php

// world_cup_soccer.php

/**
* @package PHPMath_ProbabilityDistribution
*/

require_once 'PHPMath/ProbabilityDistribution/Exponential.php'; 

$num_goals = 575; 
$num_games = 232;
$rate = $num_goals / $num_games;
echo "Value of rate parameter = $rate<br>";
$exp = new PHPMath_ProbabilityDistribution_Exponential($rate); 

?>

This fragment produces the following output:


     Value of rate parameter = 2.4784482758621

One question you might be curious about is the probability that a goal will be scored in the first 10 minutes of a soccer game. I am going to skip some prior modeling steps in which you would have developed code to plot inter-goal intervals and to test whether the exponential distribution is the best-fitting distribution; assume that this code has been developed in a separate script. Instead, I'll proceed to the stage where you use the theoretical exponential distribution to calculate some probabilities of interest. The next code snippet, for example, is added to the previous code and computes the probability of a goal in the first 10 minutes of play. Because the rate is expressed in goals per 90-minute game, the 10-minute interval is first converted to a fraction of a game.


<?php

$mins = 10;
$mins_per_game = 90;
$interval = $mins / $mins_per_game;
echo "Probability of goal in t <= 10 mins: ". $exp->CDF($interval);

?>

The output this fragment generates is:


     Probability of goal in t <= 10 mins: 0.24071884483247

In other words, in about 24 percent of games a goal is scored in the first 10 minutes of play. Of the next 100 games played, you can expect roughly 24 to have a goal in the first 10 minutes.
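
As a cross-check on the figure above, the Exponential CDF has the closed form 1 - exp(-rate * t). The following is a minimal standalone sketch of that formula, not the PHPMath implementation; exponential_cdf() is a hypothetical helper name introduced only for illustration. It assumes the rate is in goals per game and t is a fraction of a game.

```php
<?php

// Standalone sketch: closed-form CDF of the Exponential distribution.
// exponential_cdf() is a hypothetical helper, not part of the PHPMath package.
function exponential_cdf(float $rate, float $t): float
{
    if ($rate <= 0) {
        throw new InvalidArgumentException('rate must be positive');
    }
    // A waiting time cannot be negative, so the CDF is 0 at or below zero.
    return $t <= 0 ? 0.0 : 1.0 - exp(-$rate * $t);
}

$rate     = 575 / 232;   // goals per 90-minute game, as in the article
$interval = 10 / 90;     // 10 minutes expressed as a fraction of a game

printf("P(goal within 10 mins) = %.4f\n", exponential_cdf($rate, $interval));
```

Run on its own, this should print a value of about 0.2407, in line with the output of the CDF() call shown earlier.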

You might also be interested in the inverse question: In P percent of games, a goal occurs within how many minutes of play? You can answer this question for three different P values using the following code snippet:


<?php

$prob = 0.25;
$num_mins = $exp->InverseCDF($prob) * $mins_per_game;
echo "In 25 percent of games, a goal will occur within $num_mins 
     minutes of play.<br>";

$prob = 0.50;
$num_mins = $exp->InverseCDF($prob) * $mins_per_game;
echo "In 50 percent of games, a goal will occur within $num_mins 
     minutes of play.<br>";

$prob = 0.75;
$num_mins = $exp->InverseCDF($prob) * $mins_per_game;
echo "In 75 percent of games, a goal will occur within $num_mins 
     minutes of play.<br>";

?>

The output this fragment generates is:


In 25 percent of games, a goal will occur within 10.446611604858 
     minutes of play.
In 50 percent of games, a goal will occur within 25.170283704507 
     minutes of play.
In 75 percent of games, a goal will occur within 50.340567409014 
     minutes of play.
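
The inverse question can also be answered directly from the formula: solving p = 1 - exp(-rate * t) for t gives t = -ln(1 - p) / rate. The sketch below applies that formula on its own, without the PHPMath package; exponential_inverse_cdf() is a hypothetical helper name, and the result is again in game units, so it is multiplied by 90 to get minutes.

```php
<?php

// Standalone sketch: closed-form inverse CDF (quantile function) of the
// Exponential distribution. Hypothetical helper, not the PHPMath method.
function exponential_inverse_cdf(float $rate, float $p): float
{
    if ($rate <= 0 || $p <= 0 || $p >= 1) {
        throw new InvalidArgumentException('need rate > 0 and 0 < p < 1');
    }
    // Solve p = 1 - exp(-rate * t) for t.
    return -log(1.0 - $p) / $rate;
}

$rate          = 575 / 232;   // goals per 90-minute game, as in the article
$mins_per_game = 90;

foreach ([0.25, 0.50, 0.75] as $p) {
    $mins = exponential_inverse_cdf($rate, $p) * $mins_per_game;
    printf("In %d percent of games, a goal occurs within %.2f minutes.\n",
        (int) round($p * 100), $mins);
}
```

The three values should agree, after rounding, with the InverseCDF() output listed above (about 10.45, 25.17, and 50.34 minutes).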

You might also want to know how likely it is that exactly X goals are scored in a game. The easiest way to compute this probability is to use a mathematically related distribution called the Poisson distribution, which is useful for answering such discrete counting problems.

I have also implemented a PHP-based version of the Poisson distribution functions. The following code fragment shows how to use the Poisson distribution to compute the probability of scoring various numbers of goals in a game.


<?php

require_once 'PHPMath/ProbabilityDistribution/Poisson.php'; 
$lambda = $rate;
$pois = new PHPMath_ProbabilityDistribution_Poisson($lambda); 
echo "Probability that goal count = 0: ". $pois->PDF(0) ."<br>";
echo "Probability that goal count = 1: ". $pois->PDF(1) ."<br>";
echo "Probability that goal count = 2: ". $pois->PDF(2) ."<br>";
echo "Probability that goal count = 3: ". $pois->PDF(3) ."<br>";
echo "Probability that goal count = 4: ". $pois->PDF(4) ."<br>";
echo "Probability that goal count = 5: ". $pois->PDF(5) ."<br>";
echo "Probability that goal count = 6: ". $pois->PDF(6) ."<br>";

?>

The output this fragment generates is:


Probability that goal count = 0: 0.083873272849375
Probability that goal count = 1: 0.20787556848444
Probability that goal count = 2: 0.25760442215206
Probability that goal count = 3: 0.2128197453124
Probability that goal count = 4: 0.13186568270973
Probability that goal count = 5: 0.065364454791462
Probability that goal count = 6: 0.027000403380094
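
For reference, the Poisson probability of exactly k events is exp(-lambda) * lambda^k / k!. The following standalone sketch evaluates that formula without the PHPMath package; poisson_pmf() is a hypothetical helper name. Computing in log space is a design choice of this sketch (it avoids overflow in lambda^k and k! for large k), not something the article's class is known to do.

```php
<?php

// Standalone sketch: Poisson probability of exactly $k events when the
// mean count is $lambda. Hypothetical helper, not the PHPMath method.
function poisson_pmf(float $lambda, int $k): float
{
    if ($lambda <= 0 || $k < 0) {
        throw new InvalidArgumentException('need lambda > 0 and k >= 0');
    }
    // Accumulate log(k!) as a sum of logs to avoid a huge factorial.
    $log_factorial = 0.0;
    for ($i = 2; $i <= $k; $i++) {
        $log_factorial += log($i);
    }
    // exp(k*ln(lambda) - lambda - ln(k!)) equals exp(-lambda)*lambda^k/k!.
    return exp($k * log($lambda) - $lambda - $log_factorial);
}

$lambda = 575 / 232;   // mean goals per game, as in the article
$total  = 0.0;

for ($k = 0; $k <= 6; $k++) {
    $p = poisson_pmf($lambda, $k);
    $total += $p;
    printf("Probability that goal count = %d: %.4f\n", $k, $p);
}

// Whatever probability is left over belongs to games with 7 or more goals.
printf("Probability that goal count > 6: %.4f\n", 1.0 - $total);
```

The last line is an addition to what the article prints: it shows how much probability remains for games with more than six goals.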

The Poisson distribution differs from the Exponential distribution in that Poisson models discrete random variables and Exponential models continuous random variables. I used Exponential to calculate waiting-time probabilities because inter-arrival time is a continuous random variable, and the Exponential distribution is often a good candidate for describing how waiting times are distributed.

The Poisson distribution models discrete random variables that arise from a counting process (such as the number of times a certain event occurs in some period of time). A count takes one of a discrete list of values: 0, 1, 2, and so on. In the case of World Cup soccer, the Poisson distribution can be used to compute the probability of a game ending with different goal counts.
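
One way to see how the two distributions are related: if goals arrive at a constant rate, the Poisson probability of zero goals in a game and the Exponential probability that the first goal takes longer than one full game are the same quantity, exp(-rate). The sketch below checks this numerically; both helper names are hypothetical and the code does not use the PHPMath package.

```php
<?php

// Standalone sketch linking the two distributions. Both helper names are
// hypothetical and do not come from the PHPMath package.

// Poisson probability of exactly zero events when the mean count is $lambda.
function poisson_prob_zero(float $lambda): float
{
    return exp(-$lambda);        // exp(-lambda) * lambda^0 / 0!
}

// Exponential probability that the first event takes longer than $t.
function exponential_survival(float $rate, float $t): float
{
    return exp(-$rate * $t);     // 1 - CDF(t)
}

$rate = 575 / 232;   // goals per 90-minute game, as in the article

printf("Poisson: P(no goals in a game)           = %.4f\n",
    poisson_prob_zero($rate));
printf("Exponential: P(first goal after 90 mins) = %.4f\n",
    exponential_survival($rate, 1.0));
```

Both lines should print the same value, about 0.0839, which matches the goal count = 0 entry in the Poisson output above.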

Space precludes a more complete discussion of this important probability distribution. I hope, however, that this brief discussion has raised your awareness of the distinction between discrete and continuous random variables and their probability distributions, and of how different distributions can be applied to the same data to build a more detailed probability model that answers different types of questions.




First published by IBM developerWorks


Article copyright and all rights retained by the author.