This article gives the source code for a nice, clean implementation of a class to compile the discrete, cumulative distribution function of a sample data set. The CDF forms the basis for basic descriptive statistics.
(This article originally appeared in The Unofficial Newsletter of Delphi Users)
TDistribution Class
TDistribution automatically adds 'bins' to the CDF as data elements are accumulated, or you can pre-define the bin boundaries before analyzing a data set. Auto-expansion of the bins works best with data sorted in ascending order (for sorting components, see http://www.connix.com/~btober/sorting.htm). In most cases it works reasonably well without pre-sorting, but if a large value arrives early, fewer bins are created and the CDF is coarser, so pre-define the bins when that matters.
The class descends from TStringList, so you can use the Strings property to define a text string describing each 'bin', e.g.
Strings[0]:='Jan';
Strings[1]:='Feb';
.
.
.
Strings[11]:='Dec';
This lets the class fit into a variety of user interfaces. Note that it is not a user interface component: it is "merely" a class that provides a specific statistical function, and it is up to you to implement the user interface appropriate for your application.
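As a rough illustration of pre-defined, labeled bins, the sketch below assumes the Cdf unit listed later in this article compiles as intended. The program name, labels, bin limits and scores are made up for illustration, and {$APPTYPE CONSOLE} assumes a 32-bit Delphi console project (on Delphi 1 you would add WinCRT to the uses clause instead, as the example project does).

```pascal
program LabeledBins;
{$APPTYPE CONSOLE} { assumption: 32-bit Delphi console project }
uses
  Cdf;
const
  { Hypothetical sample data, deliberately unsorted }
  Scores: array[1..6] of Double = (85, 66, 73, 89, 81, 73);
var
  D: TDistribution;
  i: Integer;
begin
  D := TDistribution.Create;
  try
    { Pre-define the bins: each item holds the bin's upper limit
      and a starting count of zero; the string is the bin's label }
    D.AddObject('70 or less', TDistributionItem.Create(70, 0));
    D.AddObject('80 or less', TDistributionItem.Create(80, 0));
    D.AddObject('90 or less', TDistributionItem.Create(90, 0));

    for i := 1 to 6 do
      D.Accumulate(Scores[i]);

    { Counts are cumulative: each bin counts every value that is
      less than or equal to its upper limit }
    for i := 0 to D.Count - 1 do
      WriteLn(D.Strings[i], ': ', D.Objects[i].Count);
  finally
    D.Free; { the destructor also frees the TDistributionItem objects }
  end;
end.
```

With these six values the cumulative counts should come out as 1, 3 and 6, because every bin counts all values at or below its upper limit, regardless of the order in which the data arrives.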
The class provides properties for estimating the sample data total and mean (based on the cdf).
An example project is also included.
{ ****************************************************************** }
{ Class for developing discrete, cumulative distribution function }
{ Copyright © 2000, Berend M. Tober. All rights reserved. }
{ Author's E-mail - mailto:btober@computer.org }
{ Other components at }
{ http://www.connix.com/~btober/delphi.htm}
{ ****************************************************************** }
unit Cdf;
{
The cumulative distribution function (cdf) for continuous, real
random variable X is defined as a function F(x) where
F(x) = P(X <= x), i.e., the probability that X <= x.
The discrete case is similarly defined.
The TDistribution class is used to generate an empirical CDF for a
given data set by counting the number of values from the sample
data set that fall into one of a set of discrete "bins".
This way you can quickly get a quantitative picture of a data set.
The class descends from TStringList, so you can use the Strings
property to define a text string describing each 'bin', e.g.
Strings[0]:='Jan';
Strings[1]:='Feb';
.
.
.
Strings[11]:='Dec';
}
interface
uses classes;
type
TDistributionItem = class(TObject)
private
FBin: Double; {Upper limit of bin}
FCount: LongInt;
public
constructor Create(Value: Double;Count:LongInt);
function Accumulate(Value: Double):LongInt;
property Bin: Double Read FBin;
property Count: LongInt Read FCount;
end;
TDistribution = class(TStringList)
private
function GetMean: double;
function GetTotal: Double;
public
constructor Create;
destructor Destroy;override;
procedure Clear;override; {Override so inherited code that calls Clear also frees the bin objects}
function Accumulate(Value: Double):LongInt;
function AddObject(const S: string; AObject: TObject):Integer;override;
procedure FreeObjects;
procedure Put(Index:Integer; const Value:TDistributionItem);
function FreeObject(Index:Integer):Integer;
function Get(Index:Integer):TDistributionItem;
property Mean: Double Read GetMean;
property Objects[Index:Integer]:TDistributionItem read Get write Put;
property Total: Double Read GetTotal;
end;
implementation
constructor TDistributionItem.Create(Value: Double;Count:LongInt);
begin
inherited Create;
FBin := Value;
FCount:=Count;
end;
function TDistributionItem.Accumulate(Value: Double):LongInt;
begin
Result:=-1;
if Value<=FBin then {Increment count of bin when Value<=x}
begin
inc(FCount);
Result:=FCount;
end;
end;
constructor TDistribution.Create;
begin
inherited Create;
end;
destructor TDistribution.Destroy;
begin
Clear;
inherited Destroy;
end;
function TDistribution.AddObject(const S: string; AObject:TObject):Integer;
{Add a 'bin' to the CDF, in proper order}
var i:Integer;
begin
{
Find where to insert new 'bin'. This is just before the smallest 'bin'
which exceeds size of new 'bin'.
}
Result:=Count;
if Count>0 then
for i:=pred(Count) downto 0 do
if TDistributionItem(AObject).Bin < Objects[i].Bin then Result:=i;
if Result>=Count then {Result starts at Count, so it is never negative}
Result:=inherited AddObject(S,AObject) {If no such 'bin', append new one}
else
InsertObject(Result,S,AObject) {Insert new bin before next biggest}
end;
function TDistribution.Accumulate(Value:Double):LongInt;
{Count this data value into the cdf}
var
i: LongInt;
begin
if Count=0 then
AddObject('',TDistributionItem.Create(Value,0)) {Must have at least one 'bin'}
else if Value>Objects[pred(Count)].Bin then
{If Value exceeds largest 'bin', then add new one that IS big enough}
AddObject('',TDistributionItem.Create(Value,Objects[pred(Count)].Count));
for i:=0 to pred(Count) do
Result:=Objects[i].Accumulate(Value);
end;
function TDistribution.Get(Index:Integer):TDistributionItem;
begin
Result:=TDistributionItem(inherited Objects[Index]);
end;
function TDistribution.FreeObject(Index:Integer):Integer;
begin
Result:=-1;
if (Index < 0) or (Index >= Count) then Exit; {Ignore out-of-range indexes}
if Objects[Index] <> nil then
begin
Objects[Index].Free;
Objects[Index]:=nil;
end;
Delete(Index);
if Index>=Count then
Result:=pred(Count)
else if Count=0 then
Result:=-1
else
Result:=Index;
end;
procedure TDistribution.FreeObjects;
var i:Integer;
begin
if Count > 0 then
for i:=pred(Count) downto 0 do
FreeObject(i);
end;
procedure TDistribution.Put(Index:Integer; const Value:TDistributionItem);
begin
inherited Objects[Index]:=Value;
end;
procedure TDistribution.Clear;
begin
FreeObjects;
inherited Clear;
end;
function TDistribution.GetTotal:Double;
{This is an ESTIMATE of the actual sample total}
var i: integer;
begin
Result:=0;
if Count = 0 then Exit;
Result:=Objects[0].Bin*Objects[0].Count;
for i:=1 to pred(Count) do
Result:=Result+(Objects[i].Bin*(Objects[i].Count-Objects[pred(i)].Count));
{
*** This is an alternative way to compute total ***
Result:=Objects[0].Bin*Objects[0].Count;
for i:=1 to pred(Count) do
Result:=Result
+(Objects[i].Bin + Objects[pred(i)].Bin)
*(Objects[i].Count - Objects[pred(i)].Count);
Result:=Result/2.0;
}
end;
function TDistribution.GetMean: double;
{This is an ESTIMATE of the actual sample mean}
begin
Result:=0.0;
if (Count>0) and (Objects[pred(Count)].Count>0) then {Avoid dividing by zero when no data has been counted}
Result:=GetTotal/Objects[pred(Count)].Count;
end;
end.
Example implementation
program Example;
uses
WinCRT,cdf;
const
Data1:Array[1..10] of Real=(66,73,73,81,81,81,81,85,85,89);
{
Note: This data set is pre-sorted. TDistribution will still
work with unsorted data, but you might not get a good cdf
unless you predefine the cdf bins. This is especially true
if the first element of the dataset happens to be the largest
(try it by re-arranging the above data set!) because automatic
addition of bins will add a single bin into which ALL data
will be counted.
}
var
i,j:Word;
begin
with TDistribution.Create do
begin
for i:=1 to 10 do
Accumulate(Data1[i]);
writeln('Mean = ', Mean:6:4);
writeln('Total = ', Total:6:4);
{Print out quantitative summary}
for i := 0 to pred(Count) do
writeln(Objects[i].Bin:6:2,#44,Objects[i].Count,#44,Objects[i].Count/Objects[pred(Count)].Count:6:3);
{Print out crude 'histogram'}
for i := 0 to pred(Count) do
begin
for j:=1 to trunc(40*Objects[i].Count/Objects[pred(Count)].Count) do
write('*');
writeln;
end;
Free;
end;
end.
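Because Total and Mean are estimates based on the upper limit of each bin, it can be instructive to compare them with an exact sum. The following is only a sketch under the same assumptions as the example above (the Cdf unit compiles as intended). The names AutoBins and CoarseBins and the limits 70, 80 and 90 are hypothetical, and {$APPTYPE CONSOLE} assumes a 32-bit Delphi console project rather than WinCRT.

```pascal
program EstimateCheck;
{$APPTYPE CONSOLE} { assumption: 32-bit Delphi console project }
uses
  Cdf;
const
  Data1: array[1..10] of Double = (66,73,73,81,81,81,81,85,85,89);
var
  AutoBins, CoarseBins: TDistribution;
  Exact: Double;
  i: Integer;
begin
  AutoBins := TDistribution.Create;   { bins are created automatically }
  CoarseBins := TDistribution.Create; { bins are pre-defined below }
  try
    CoarseBins.AddObject('', TDistributionItem.Create(70, 0));
    CoarseBins.AddObject('', TDistributionItem.Create(80, 0));
    CoarseBins.AddObject('', TDistributionItem.Create(90, 0));

    Exact := 0;
    for i := 1 to 10 do
    begin
      Exact := Exact + Data1[i];      { exact sum, for comparison }
      AutoBins.Accumulate(Data1[i]);
      CoarseBins.Accumulate(Data1[i]);
    end;

    WriteLn('Exact sum        = ', Exact:8:2);
    WriteLn('Auto-bin Total   = ', AutoBins.Total:8:2);
    WriteLn('Coarse-bin Total = ', CoarseBins.Total:8:2);
  finally
    CoarseBins.Free;
    AutoBins.Free;
  end;
end.
```

Working it by hand: with auto-created bins on this sorted data, every bin limit is an actual data value, so the estimate matches the exact sum of 795. With the three coarse bins the cumulative counts are 1, 3 and 10, so the estimate is 70*1 + 80*2 + 90*7 = 860, an overestimate because each value is counted at the upper limit of its bin.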