
This article gives the source code for a nice, clean implementation of linear regression by the linear least-squares error algorithm.
(This article originally appeared in the March 17, 2000 issue of The Unofficial Newsletter of Delphi Users.)
TLinearRegression Class
The class provides properties returning the average for both the independent and dependent variables, as well as the variance, covariance, and correlation. A function method is provided to interpolate along the line defined by the computed slope and intercept.
The code is well documented, giving the basic theoretical outline and some notes as to what the various methods are doing.
An example project is also included.
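To give a feel for the interface before the source itself, here is a minimal usage sketch (the program name and variable name are illustrative only, not part of the article's example project); like the sample project at the end of the article, it assumes a WinCRT-style console program:
program UsageSketch;
uses
  WinCRT, linregr;
var
  LR: TLinearRegression;
begin
  LR := TLinearRegression.Create;
  try
    {Accumulate (x,y) data points; the statistics are updated on the fly}
    LR.Add(1.0, 2.1);
    LR.Add(2.0, 3.9);
    LR.Add(3.0, 6.1);
    {Slope, Intercept and Correlation are derived from the running averages}
    writeln('y = ', LR.Slope:0:3, '*x + ', LR.Intercept:0:3);
    writeln('correlation  = ', LR.Correlation:0:3);
    writeln('y at x = 2.5 : ', LR.Interpolate(2.5):0:3);
  finally
    LR.Free;
  end;
end.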
{ ****************************************************************** }
{ }
{ Class implementing linear regression by linear-least-squares }
{ }
{ Copyright © 2000, Berend M. Tober. All rights reserved. }
{ Author's E-mail - mailto:btober@computer.org }
{ Other components at }
{ http://www.connix.com/~btober/delphi.htm }
{ }
{ ****************************************************************** }
unit linregr;
{Linear Regression Class}
{
Note Regarding the Problem Statement
------------------------------------
This class is used for finding the best-fit slope (m) and y-intercept (b), as in the equation for a straight line
y := m*x + b
in the case where you have more than two (x,y) data point pairs, i.e., you have (Xi, Yi) for i:=1,2,...,n, with n>2. In this situation you can write n equations:
Y1 := m*X1 + b + E1
Y2 := m*X2 + b + E2
.
.
.
Yn := m*Xn + b + En
giving a system of n linear equations that you need to solve for two unknowns, namely, m and b, where the Ei are error terms. The linear least-squares error algorithm employed here finds the best m and b such that the sum of the squared error terms is minimized.
The linear algebra behind the solution is to write an n-by-2 matrix A whose rows look like [1, Xi] for i=1,..,n and a 2-by-1 vector u of unknowns holding the intercept and slope, i.e., u = [b,m]', where (') indicates "take the transpose of". There's a bunch of other technical detail like "finding an orthogonal basis for a subspace", "guaranteed unique solutions", etc., that I vaguely recall from my course in real vector spaces, but it boils down to writing the squared error as
e'e = (y - Au)'(y - Au) = |y - Au|^2
and showing that the best (least-squares) solution of the overdetermined system Au = y is the u satisfying the "normal equations"
A'Au = A'y
where, to summarize:
y = [Y1,Y2,...,Yn]'
u = [b,m]'
A = [[1,X1]', [1,X2]',...[1,Xn]']'
Solving the normal equations for u gives the explicit solution
u = inv(A'A)(A'y)
where inv() means "take the inverse of". This is the equation implemented in this class. It calculates the statistics "on the fly", so it works for any number of two or more data points.
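Written out for this two-parameter case, with n data points the normal equations A'Au = A'y are just the pair
  n*b + Sum(Xi)*m = Sum(Yi)
  Sum(Xi)*b + Sum(Xi^2)*m = Sum(Xi*Yi)
Dividing both equations by n and eliminating b gives
  m = (AvgXY - AvgX*AvgY)/(AvgXX - AvgX*AvgX) = CovarianceXY/CovarianceXX
  b = AvgY - m*AvgX = (AvgY*AvgXX - AvgX*AvgXY)/CovarianceXX
which are exactly the expressions computed by GetSlope and GetIntercept below from the running averages accumulated in Add.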
This is all well-known, straightforward stuff out of a textbook. I don't claim any credit for the math...I'm not inventing anything new here, just making a slick implementation.
Note Regarding Exceptions
-------------------------------
This class does no exception handling of its own: your controlling application must trap the 'divide by zero' errors that can occur in
function GetCorrelation
function GetIntercept
function GetSlope
I didn't build in this exception handling because it would entail a USES clause. As it is, this unit has NO explicit dependencies on ANY other units, and I couldn't bear to destroy that purity!
The 'divide by zero' circumstance can occur when accessing any of the derived, covariance-related properties if:
1) you have accumulated fewer than two data points, or
2) you have "degenerate" data, i.e., all the x's are the same (this corresponds to a slope of infinity), in which case you will find that CovarianceXX will be zero (or within some small rounding error of zero).
Note Regarding Normalizing Data
-------------------------------
It is recommended that you normalize your data before applying this linear least-squares algorithm, if you can.
For purely multiplicative scaling, apply the following: if you scale the data input as (x',y') := (A*x, B*y), then you must
1) adjust the computed intercept by b := b'/B, and
2) adjust the computed slope by m := m'*A/B (see the example below).
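Example: if the x values are on the order of millions, you might accumulate x' := x/1.0E6 (so that A = 1.0E-6 and B = 1); the fitted slope and intercept in the original units are then recovered as m := m'*1.0E-6 and b := b'.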
}
interface

type
  TRegressionRec = record
    N     : Integer;   { Count of number of data points accumulated }
    AvgX  : Double;    { Average of independent variables }
    AvgY  : Double;    { Average of dependent variables }
    AvgXX : Double;    { Average of squared independent variables }
    AvgYY : Double;    { Average of squared dependent variables }
    AvgXY : Double;    { Average of cross product }
  end;

  TLinearRegression = class(TObject)
  private
    FRegressionRec: TRegressionRec;
    function GetCorrelation: Double;
    function GetCovarianceXX: Double;
    function GetCovarianceYY: Double;
    function GetCovarianceXY: Double;
    function GetIntercept: Double;
    function GetSlope: Double;
  public
    constructor Create;
    procedure Clear;
    function Add(X, Y: Double): Integer;
    function Interpolate(x: Double): Double;
    property AverageX: Double read FRegressionRec.AvgX;
    property AverageY: Double read FRegressionRec.AvgY;
    property Count: Integer read FRegressionRec.N;
    property Correlation: Double read GetCorrelation;
    property CovarianceXX: Double read GetCovarianceXX;
    property CovarianceYY: Double read GetCovarianceYY;
    property CovarianceXY: Double read GetCovarianceXY;
    property Intercept: Double read GetIntercept;
    property Slope: Double read GetSlope;
  end;
implementation
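{ The helper function Accumulate below implements the standard
  running-average recurrence
      avg(N) = ((N-1)*avg(N-1) + x)/N
  evaluated as avg(N-1)*(N-1)/N + x/N. For example, averaging the values
  2, 4, 9 one at a time gives 2, then 2*(1/2) + 4/2 = 3, then
  3*(2/3) + 9/3 = 5, which is indeed (2+4+9)/3. This lets Add keep AvgX,
  AvgY, AvgXX, AvgYY and AvgXY up to date without storing the individual
  data points. }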
function Accumulate(
  N: Integer;   {Sequence number of this new data point}
  a,            {The existing average (of N-1 data points)}
  x: Double     {The new data value to be included in running average}
  ): Double;
{Returns the running average}
begin
  Result := (N-1)/N;
  Result := a*Result + x/N;
end;
constructor TLinearRegression.Create;
begin
  inherited Create;
  Clear;
end;

function TLinearRegression.Add(X, Y: Double): Integer;
{This function 'accumulates' another (x,y) data point}
begin
  with FRegressionRec do
  begin
    Inc(N);
    AvgX  := Accumulate(N, AvgX, X);
    AvgY  := Accumulate(N, AvgY, Y);
    AvgXX := Accumulate(N, AvgXX, X*X);
    AvgYY := Accumulate(N, AvgYY, Y*Y);
    AvgXY := Accumulate(N, AvgXY, X*Y);
    Result := N;
  end;
end;

procedure TLinearRegression.Clear;
begin
  with FRegressionRec do
  begin
    N     := 0;
    AvgX  := 0;
    AvgY  := 0;
    AvgXX := 0;
    AvgYY := 0;
    AvgXY := 0;
  end;
end;
function TLinearRegression.GetCovarianceXX: Double;
{Variance of the x data: E[x^2] - E[x]^2}
begin
  with FRegressionRec do
    Result := AvgXX - AvgX*AvgX;
end;

function TLinearRegression.GetCovarianceYY: Double;
{Variance of the y data: E[y^2] - E[y]^2}
begin
  with FRegressionRec do
    Result := AvgYY - AvgY*AvgY;
end;

function TLinearRegression.GetCovarianceXY: Double;
{Covariance of x and y: E[xy] - E[x]*E[y]}
begin
  with FRegressionRec do
    Result := AvgXY - AvgX*AvgY;
end;
function TLinearRegression.GetCorrelation: Double;
var
  CovXX, CovYY: Double;
begin
  with FRegressionRec do
  begin
    CovXX := GetCovarianceXX;
    CovYY := GetCovarianceYY;
    Result := 0;
    if CovXX = 0.0 then
      Result := 1.0   {Degenerate case with infinite slope}
    else if CovYY = 0.0 then
      Result := 1.0   {Data has zero slope}
    else
      Result := GetCovarianceXY/Sqrt(CovXX*CovYY);
  end;
end;
function TLinearRegression.GetIntercept: Double;
begin
  {Note that if CovXX is zero, then no unique y-intercept exists}
  with FRegressionRec do
    Result := (AvgY*AvgXX - AvgX*AvgXY)/GetCovarianceXX;
end;

function TLinearRegression.GetSlope: Double;
begin
  {Note that if CovXX is zero, then slope is infinite}
  with FRegressionRec do
    Result := GetCovarianceXY/GetCovarianceXX;
end;

function TLinearRegression.Interpolate(x: Double): Double;
begin
  with FRegressionRec do
    Result := x*Slope + Intercept;
end;

end.
Sample project file follows:
program Example;

uses
  WinCRT, linregr;

type
  TDataArray = array[1..10, 1..2] of Real;

const
  Data1: TDataArray =
    (
    (75,81),
    (78,73),
    (88,85),
    (92,85),
    (95,89),
    (67,73),
    (55,66),
    (73,81),
    (74,81),
    (80,81)
    );
  {
  These are the "correct" results for the data set above:
  r=0.891, m=0.513, b=39.658
  }
  Data2: TDataArray =
    ( {Zero slope}
    (75,1),
    (78,1),
    (88,1),
    (92,1),
    (95,1),
    (67,1),
    (55,1),
    (73,1),
    (74,1),
    (80,1)
    );
  Data3: TDataArray =
    ( {Infinite slope}
    (2,81),
    (2,73),
    (2,85),
    (2,85),
    (2,89),
    (2,73),
    (2,66),
    (2,81),
    (2,81),
    (2,81)
    );
procedure TestRegression(Data: TDataArray);
var
  i: Word;
begin
  with TLinearRegression.Create do
  begin
    for i := 1 to 10 do
      Add(Data[i,1], Data[i,2]);
    try
      write(Correlation:12:3);
      write(Slope:12:3);
      writeln(Intercept:12:3);
      writeln('Average of X''s', AverageX:12:3);
      writeln('Average of Y''s', AverageY:12:3);
      writeln('Std Dev of X''s', sqrt(CovarianceXX):12:3);
      writeln('Std Dev of Y''s', sqrt(CovarianceYY):12:3);
    except
      {Silently skip properties that raise 'divide by zero' for degenerate data}
    end;
    writeln;
    Free;
  end;
end;
begin
  writeln('This is what the correlation, slope and intercept values should be:');
  writeln(0.891:12:3, 0.513:12:3, 39.658:12:3, #13);
  writeln('Sample results:');
  TestRegression(Data1);
  writeln(#13'Data with zero slope');
  TestRegression(Data2);
  writeln(#13'Degenerate data with infinite slope');
  TestRegression(Data3);
end.
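As the notes in the linregr unit point out, the covariance-related properties can raise 'divide by zero' errors for degenerate data, which the empty except block above simply swallows. If you prefer not to trigger the error at all, one possible pattern (a sketch only; the program name, the Eps tolerance and the ReportFit helper are illustrative and not part of the original article) is to guard the property reads with the Count and CovarianceXX checks described in the unit's comments:
program GuardedFit;
uses
  WinCRT, linregr;

const
  Eps = 1.0E-12;  {tolerance for treating CovarianceXX as effectively zero; tune for your data}

procedure ReportFit(LR: TLinearRegression);
{Prints the fitted line only when it is well defined, avoiding the 'divide by zero' path}
begin
  if (LR.Count < 2) or (Abs(LR.CovarianceXX) < Eps) then
    writeln('No unique fit: need two or more points with differing x values')
  else
  begin
    writeln('Slope     = ', LR.Slope:12:3);
    writeln('Intercept = ', LR.Intercept:12:3);
  end;
end;

var
  LR: TLinearRegression;
begin
  LR := TLinearRegression.Create;
  try
    LR.Add(2, 81);  {all x values equal: degenerate data with infinite slope}
    LR.Add(2, 73);
    LR.Add(2, 85);
    ReportFit(LR);
  finally
    LR.Free;
  end;
end.
Comparing CovarianceXX against a small tolerance rather than exactly zero follows the unit's own note that degenerate data may leave CovarianceXX only within rounding error of zero.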