Title: Building an Easy-to-Use Parser/Parsing Framework (Part I)
Question: How to create a simple parsing framework to parse any kind of data?
Answer:
A second article was released on 29.01.2002 with a more detailed example:
Building an
Easy-to-Use Parser/Parsing Framework (Part II)
Today, we wonna speak about "how to create a simple parser framework" in
Delphi. Our goal will be a class solutions which helps up to parse any
kind of data and store all relevant informations in an easy-to- access
object model.
At the end of this article, we've developed a small utility, which
generates a simple object model of a .dtd file and output it's xml pendant
from a given starting node. In other words, we're using the parsing
framework to create a parser, which is able to parse a .dtd file, extract
all neccessary tags, store them in the object model and generates the xml
output. Note: This utility uses a simply dtd- parser model, which don't
include all routines to parse all kinds of dtd datas - it's up to you to
include those features.
Our claims to the framework and object model are:
- easy to use.
- save/loadable object trees.
- integrated error reporting.
- expandable.
Okay, now let's start to develope the main parsing engine. Delphi comes
with a unit called CopyPrsr which includes the simple stream parser
object TCopyParser. Try to take a look into that file to understand
how it works - it's located under
$(DELPHI)\Source\Internet\CopyPrsr.pas. Our framework parser is
derived from that idea, but uses a simple string instead of the stream and
includes some additional functions:
The boiler plate:
=========================================================
unit StringParser;
interface
uses
Classes;
const
{ Additional Parser special tokens }
toEOL = char(6);
toBOF = char(7);
type
{ TSysCharSet }
TSysCharSet = set of Char;
{ TStringParser }
TStringParser = class
private
{ Private declarations }
FParseString: string;
FLineTokens: Integer;
FSourceLine: Integer;
FSourcePos: Integer;
FTokenPos: Integer;
FToken: Char;
procedure SkipBlanks;
function GetParseString: string;
function GetSourcePos: Integer;
function GetTokenString: string;
protected
{ Protected declarations }
public
{ Public declarations }
constructor Create;
function LoadFromFile(const FileName: string): Boolean;
function LoadFromStream(const Stream: TStream): Boolean;
function SkipToEOF: string;
function SkipToEOL: string;
function SkipToken: Char;
function SkipTokenString: string;
function SkipToToken(const AToken: Char): string; overload;
function SkipToToken(const AToken: TSysCharSet): string; overload;
function SkipToTokenString(const ATokenString: string): string;
property ParseString: string read GetParseString;
property SourceLine: Integer read FSourceLine;
property SourcePos: Integer read GetSourcePos;
property Token: Char read FToken;
property TokenString: string read GetTokenString;
end;
As you can see, there are many public helper functions which you can use
to parse the data. The main functions are LoadFromFile and
LoadFromStream, which needs the name of the file to be parsed as
the only parameter. Both functions loads the content of the file and store
it to the string FParseString which can be accessed through the
denominator property:
LoadFromFile/LoadFromStream:
=========================================================
function TStringParser.LoadFromFile(const FileName: string): Boolean;
var
Stream: TMemoryStream;
begin
Result := False;
if not FileExists(FileName) then
Exit;
Stream := TMemoryStream.Create;
try
Stream.LoadFromFile(FileName);
Result := LoadFromStream(Stream);
finally
Stream.Free;
end;
end;
function TStringParser.LoadFromStream(const Stream: TStream): Boolean;
var
MemStream: TMemoryStream;
begin
Result := False;
if not(assigned(Stream)) then
Exit;
MemStream := TMemoryStream.Create;
try
Stream.Seek(0, soFromBeginning);
MemStream.CopyFrom(Stream, Stream.Size);
FParseString := StrPas(MemStream.Memory);
SetLength(FParseString, MemStream.Size);
FParseString := Concat(FParseString, toEOF);
FToken := toBOF;
Result := True;
finally
MemStream.Free;
end;
end;
The main functionality of the parsing engine is the extraction of so-
called tokens. A token can be a seperator (like CR, LF or EOF) or a
symbol, which can be a keyword if you plan to parse a programing language.
The following functions are used to skip blank characters (which are used
to seperate symbols and aren't relevant) and to extract/skip the next
token symbol:
Token related functions (pullout only):
=========================================================
procedure TStringParser.SkipBlanks;
begin
while True do
begin
FToken := FParseString[FTokenPos];
case FToken of
#10:
begin
Inc(FSourceLine);
FLineTokens := FTokenPos;
end;
toEOF, #33..#255:
Exit;
end;
Inc(FTokenPos);
end;
end;
function TStringParser.SkipToken: Char;
const
KeySet = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
begin
SkipBlanks;
FSourcePos := FTokenPos;
if FParseString[FTokenPos] = toEOF then
FToken := toEOF
else
if FParseString[FTokenPos] in KeySet then
begin
while FParseString[FTokenPos] in KeySet do
Inc(FTokenPos);
FToken := toSymbol;
end
else
begin
FToken := FParseString[FTokenPos];
Inc(FTokenPos);
end;
Result := FToken;
end;
function TStringParser.SkipToToken(const AToken: TSysCharSet): string;
begin
FSourcePos := FTokenPos;
while not ((FParseString[FTokenPos] = toEOF) or (FParseString[FTokenPos] in AToken)) do
begin
if FParseString[FTokenPos] = #10 then
begin
Inc(FSourceLine);
FLineTokens := FTokenPos;
end;
Inc(FTokenPos);
end;
if FParseString[FTokenPos] = toEOF then
FToken := toEOF
else
FToken := FParseString[FTokenPos];
Result := GetTokenString;
if FToken toEOF then
SkipToken;
end;
The absent functions includes alternativ possibilities to extract or skip
the tokens, like SkipToTokenString or SkipToEof. Well, the
next step is to create the object model, which holds all our parsed
informations. As I mentioned earlier, the goal of this article it to
create a simple dtd parser, so we'll create an object model to store dtd
informations.
A dtd file is used to descripe the syntax rules of a xml file, like:
DTD example:
=========================================================
key CDATA #REQUIRED
value CDATA #REQUIRED
This example demonstrated the simplest way of a dtd structure. It's not
the purpose of this article to develope a highly flexible dtd parser which
spots all dtd grammas, so we only include the weightly ones. Root of each
object model is the document, which holds all other objects as
collections:
The Root Document:
=========================================================
{ TDTDDocument }
TDTDDocument = class(TPersistent)
private
{ Private declarations }
FEntities: TDTDEntities;
FElements: TDTDElements;
procedure SetEntities(Value: TDTDEntities);
procedure SetElements(Value: TDTDElements);
public
{ Public declarations }
constructor Create;
destructor Destroy; override;
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property Entities: TDTDEntities read FEntities write SetEntities;
property Elements: TDTDElements read FElements write SetElements;
end;
As you can see, our document gives us the access of some other types of
data: Entities and Elements. Entities are very hard to
parse, so it's a good lesson for you to include that feature. Parsing
elements is quite easier, so this type of data is better to explain here.
Look at the dtd example some rows above this. You can see, that a dtd
element is descripted as followed:
Our object model needs some extra fields to store such element
informations. If you are not familiar with dtd or xml, look at
W3CSchools - it's a good starting
point to learn more about that topic. So, take a look at the following
object structure:
TDTDDocument
|
o--TDTDEntities
|
o--TDTElements
|
o--TDTDElementTyp
|
o--TDTDAttributes
|
o--TDTDAttributeTyp
o--TDTDAttributeStatus
o--Default: string
o--TDTDEnums
o--TDTDChild
|
o--TDTDTyp
o--TDTDChilds
I've tried to "pack" the dtd grammars into an easy object model as you can
see above:
Each document contains a collection of elements. Each
element is descripted by an elementtype and containes in turn a
collection of attributes and childs. Each attribute again
is descripted by an attributetype and contains a collection of
enum(erations) and so forth. Followed a code cantle from the
element implementation, it's not suggestive to show you the whole code
here - it's quit long and a little bit more complex:
TDTDElement:
=========================================================
unit DTD_Document;
interface
uses
Classes;
type
...
{ TDTDElementTyp }
TDTDElementTyp =
(etAny, etEmpty, etData, etContainer);
{ TDTDElementStatus }
TDTDElementStatus =
(esRequired, esRequiredSeq, esOptional, esOptionalSeq);
{ TDTDItem }
TDTDItem = class(TCollectionItem)
private
{ Private declarations }
FName: string;
public
{ Public declarations }
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property Name: string read FName write FName;
end;
{ TDTDItems }
TDTDItems = class(TCollection)
private
{ Private declarations }
function GetItem(Index: Integer): TDTDItem;
procedure SetItem(Index: Integer; Value: TDTDItem);
public
{ Public declarations }
function Add: TDTDItem;
function Find(const Name: string): TDTDItem;
property Items[Index: Integer]: TDTDItem read GetItem write SetItem; default;
end;
...
{ TDTDElement }
TDTDElement = class(TDTDProperty)
private
{ Private declarations }
FTyp: TDTDElementTyp;
FAttributes: TDTDAttributes;
FChilds: TDTDChilds;
procedure SetAttributes(Value: TDTDAttributes);
procedure SetChilds(Value: TDTDChilds);
public
{ Public declarations }
constructor Create(Collection: TCollection); override;
destructor Destroy; override;
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property Typ: TDTDElementTyp read FTyp write FTyp;
property Attributes: TDTDAttributes read FAttributes write SetAttributes;
property Childs: TDTDChilds read FChilds write SetChilds;
end;
{ TDTDElements }
TDTDElements = class(TDTDProperties)
private
{ Private declarations }
function GetItem(Index: Integer): TDTDElement;
procedure SetItem(Index: Integer; Value: TDTDElement);
public
{ Public declarations }
function Add: TDTDElement;
function Find(const Name: string): TDTDElement;
property Items[Index: Integer]: TDTDElement read GetItem write SetItem; default;
end;
...
implementation
...
{ TDTDItem }
procedure TDTDItem.Assign(Source: TPersistent);
begin
if Source is TDTDItem then
begin
Name := TDTDItem(Source).Name;
Exit;
end;
inherited Assign(Source);
end;
{ TDTDItems }
function TDTDItems.Add: TDTDItem;
begin
Result := TDTDItem(inherited Add);
end;
function TDTDItems.Find(const Name: string): TDTDItem;
var
i: Integer;
begin
Result := nil;
for i := 0 to Count - 1 do
if CompareStr(Items[i].Name, Name) = 0 then
begin
Result := Items[i];
Break;
end;
end;
function TDTDItems.GetItem(Index: Integer): TDTDItem;
begin
Result := TDTDItem(inherited GetItem(Index));
end;
procedure TDTDItems.SetItem(Index: Integer; Value: TDTDItem);
begin
inherited SetItem(Index, Value);
end;
...
{ TDTDElement }
constructor TDTDElement.Create(Collection: TCollection);
begin
inherited Create(Collection);
FAttributes := TDTDAttributes.Create(TDTDAttribute);
FChilds := TDTDChilds.Create(TDTDChild);
end;
destructor TDTDElement.Destroy;
begin
FAttributes.Free;
FChilds.Free;
inherited Destroy;
end;
procedure TDTDElement.Assign(Source: TPersistent);
begin
if Source is TDTDElement then
begin
Typ := TDTDElement(Source).Typ;
Attributes.Assign(TDTDElement(Source).Attributes);
Childs.Assign(TDTDElement(Source).Childs);
end;
inherited Assign(Source);
end;
procedure TDTDElement.SetAttributes(Value: TDTDAttributes);
begin
FAttributes.Assign(Value);
end;
procedure TDTDElement.SetChilds(Value: TDTDChilds);
begin
FChilds.Assign(Value);
end;
{ TDTDElements }
function TDTDElements.Add: TDTDElement;
begin
Result := TDTDElement(inherited Add);
end;
function TDTDElements.Find(const Name: string): TDTDElement;
begin
Result := TDTDElement(inherited Find(Name));
end;
function TDTDElements.GetItem(Index: Integer): TDTDElement;
begin
Result := TDTDElement(inherited GetItem(Index));
end;
procedure TDTDElements.SetItem(Index: Integer; Value: TDTDElement);
begin
inherited SetItem(Index, Value);
end;
...
The advantage of this object model is, that you're able to easily add the
document to a standard component and use Delphi's internal streaming
technology to load and save the object contents of a parsed file.
The next step will be the developing of the real dtd parser. Do you
remember the TStringParser descripted at the top of this article?
We'll using this class to build up our parser. But, we don't want to
create a parser from scratch each time we're about to parse a new kind of
data - it's not mind of a framework. So first, we'll develope a small
parser class from which we will inherit our dtd parser. This parent class
should include the error reporting and accessable fields to some other
informations:
The Private Parser class:
=========================================================
unit PrivateParser;
interface
uses
Classes, SysUtils, StringParser;
type
{ TParserError }
TParserError = class(TCollectionItem)
private
{ Private declarations }
FFileName: string;
FLine: Integer;
FMessage: string;
FPosition: Integer;
public
{ Public declarations }
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property FileName: string read FFileName write FFileName;
property Line: Integer read FLine write FLine;
property Message: string read FMessage write FMessage;
property Position: Integer read FPosition write FPosition;
end;
{ TParserErrors }
TParserErrors = class(TCollection)
private
{ Private declarations }
function GetItem(Index: Integer): TParserError;
procedure SetItem(Index: Integer; Value: TParserError);
public
{ Public declarations }
function Add: TParserError;
property Items[Index: Integer]: TParserError read GetItem write SetItem; default;
end;
{ TValidationParser }
TValidationParser = class
private
{ Private declarations }
FErrors: TParserErrors;
procedure SetErrors(const Value: TParserErrors);
public
{ Public declarations }
constructor Create;
destructor Destroy; override;
procedure AddError(const AMessage: string; Parser: TStringParser; const AFileName: string = '');
procedure AddErrorFmt(const AMessage: string; Params: array of const; Parser: TStringParser; const AFileName: string = '');
property Errors: TParserErrors read FErrors write SetErrors;
end;
implementation
{ TParserError }
procedure TParserError.Assign(Source: TPersistent);
begin
if Source is TParserError then
begin
Line := TParserError(Source).Line;
Message := TParserError(Source).Message;
Position := TParserError(Source).Position;
Exit;
end;
inherited Assign(Source);
end;
{ TParserErrors }
function TParserErrors.Add: TParserError;
begin
Result := TParserError(inherited Add);
end;
function TParserErrors.GetItem(Index: Integer): TParserError;
begin
Result := TParserError(inherited GetItem(Index));
end;
procedure TParserErrors.SetItem(Index: Integer; Value: TParserError);
begin
inherited SetItem(Index, Value);
end;
{ TValidationParser }
constructor TValidationParser.Create;
begin
inherited Create;
FErrors := TParserErrors.Create(TParserError);
end;
destructor TValidationParser.Destroy;
begin
FErrors.Free;
inherited Destroy;
end;
procedure TValidationParser.SetErrors(const Value: TParserErrors);
begin
FErrors.Assign(Value);
end;
procedure TValidationParser.AddErrorFmt(const AMessage: string; Params: array of const; Parser: TStringParser; const AFileName: string = '');
begin
with FErrors.Add do
begin
FileName := AFileName;
Line := Parser.SourceLine;
Message := Format(AMessage, Params);
Position := Parser.SourcePos;
end;
end;
procedure TValidationParser.AddError(const AMessage: string; Parser: TStringParser; const AFileName: string = '');
begin
AddErrorFmt(AMessage, [], Parser, AFileName);
end;
end.
Now we can start developing the real parser by inheriting it from the
TValidationParser. Again, I don't want to show you the whole
sourcecode here, so I pick up only the sapid one. Our dtd parser is a so-
called two-way parser, i.e. it uses the first pass to parse the
elements and the second pass to parse the attributes. This is useful,
because an attibute can refer to an element which is placed below it and
otherwise we'll get some unneeded errors. The main method of our parse is
Parse, which needs the name of the file to be parsed as the first
parameter, and a pre-initialized TDTDDocument as the second
parameter. A sample call should looks like:
Sample Call:
=========================================================
// Create DTDDocument.
DTDDocument := TDTDDocument.Create;
try
// Create DTDParser.
DTDParser := TDTDParser.Create;
try
// Parse File.
DTDParser.Parse(FileName, DTDDocument);
// Display possible Errors.
if DTDParser.Errors.Count 0 then
begin
for i := 0 to DTDParser.Errors.Count - 1 do
with DTDParser.Errors[i] do
WriteLn(Format('Error in Line %d, Pos %d: %s...', [Line, Position, Message]));
Exit;
end;
...
// Free DTDParser.
finally
DTDParser.Free;
end;
// Free DTDDocument.
finally
DTDDocument.Free;
end;
But now, let's take a look at some sourcecode lines of the parser
implementation. The first think we had to do is to inherited our parser
from the parent class:
Parser Implementation (Snippet):
=========================================================
type
{ EDTDParser }
EDTDParser = class(Exception);
{ TDTDParser }
TDTDParser = class(TValidationParser)
private
{ Private declarations }
procedure ParseElement(Parser: TStringParser; Document: TDTDDocument; const Pass: Integer);
procedure ParseAttlist(Parser: TStringParser; Document: TDTDDocument);
procedure ParseFile(const FileName: string; Document: TDTDDocument; const Pass: Integer = 0);
public
{ Public declarations }
procedure Parse(const FileName: string; var Document: TDTDDocument);
end;
Afterwards we implement the Parse method which calls the internal
method ParseFile on her part:
Method "Parse":
=========================================================
procedure TDTDParser.Parse(const FileName: string; var Document: TDTDDocument);
var
TmpDocument: TDTDDocument;
begin
if not assigned(Document) then
raise EDTDParser.Create('Document not assigned!');
TmpDocument := TDTDDocument.Create;
try
ParseFile(FileName, TmpDocument);
if Errors.Count = 0 then
Document.Assign(TmpDocument);
finally
TmpDocument.Free;
end;
end;
As you can see, we create a special temporar document to store the parsed
objects in. I've done this because I don't want to return the document if
it is full of errors - I assign a exact copy of the objects only, if no
errors occured. The method ParseFile implements the proper parsing
calls to the StringParser and creates the real objects. Followed a code
snippet of the method body:
Method "ParseFile":
=========================================================
procedure TDTDParser.ParseFile(const FileName: string;
Document: TDTDDocument; const Pass: Integer = 0);
var
Parser: TStringParser;
begin
Parser := TStringParser.Create;
try
if not Parser.LoadFromFile(FileName) then
begin
AddErrorFmt('File "%s" not found', [FileName], Parser);
Exit;
end;
while True do
begin
while not (Parser.Token in [toEOF, ' Parser.SkipToken;
if Parser.Token = toEOF then
Break;
Parser.SkipToken;
if Parser.Token '!' then
begin
if not(Parser.Token in ['?']) and (Pass = 1) then
AddError('InvalidToken', Parser);
Continue;
end;
if Parser.SkipToken toSymbol then
begin
if (Parser.Token '-') and (Pass = 1) then
AddError('InvalidToken', Parser);
Continue;
end;
if UpperCase(Parser.TokenString) = 'ENTITY' then
Continue;
if UpperCase(Parser.TokenString) = 'ELEMENT' then
ParseElement(Parser, Document, Pass)
else
if UpperCase(Parser.TokenString) = 'ATTLIST' then
begin
if Pass = 1 then
ParseAttlist(Parser, Document);
end
else
if Pass = 1 then
AddErrorFmt('Invalid Symbol "%s"', [Parser.TokenString], Parser);
end;
if Pass = 0 then
ParseFile(FileName, Document, 1);
finally
Parser.Free;
end;
end;
This method calls some other functions (ParseElement and
ParseAttlist) which parses the internal structures of an element or
an attribute. Look at the whole sourceode to understand.
What's next??
Well, this article has shown you how easy it is to write a customizeable
parser which can parse any kind of data - it's up to you, how complex it
should be. The main benefit in using this kind of parsing is, that you
don't need to incorporate in complex systems like LexParser.
Continue reading my second article:
Building an Easy-to-Use Parser/Parsing Framework (Part II)
Thank you very much for you regard.
M. Hoffmann