Forms Delphi

Title: Extract HTML Tag information, Such as Links, Images..
Question: How can i extract HTML's tag information such as Links/Images/Frames in a certain page?
Answer:
Extracting tag information from html was always a tricky thing to do, as you have to predict most errors/broken-text or mispelling of the coder.
I always looked for a precise and accurate function to extract links from html page, but all the functions i found had their Disadvantages concerning broken lines or misplaced spaces or even slow routines.
So, as most programmers say, after a long search i decidec to write MY OWN function to handle everything in the best way i know, and hopefully make an end to all those silly/bad-written functions out-there who claim to do the job the best way.
Please mind that the function i came up with may NOT be the fastest out there, but it sure does the job on the best side and can be expanded to extract even more information in the future.
Here is the function, i noted some comments, and will not elaborate on what the function acctually does as its written in a simple manner so even the beginner can understand and expand it for its own use:
function ExtractHtmlTagValues(const HtmlText: string; TagName, AttribName: string; var Values: TStrings): integer;
function FindFirstCharAfterSpace(const Line: string; StartPos: integer): Integer;
var i: integer;
begin
Result := -1;
for i := StartPos to Length(Line) do
begin
if (Line[i] ' ') then
begin
Result := i;
exit;
end;
end;
end;
function FindFirstSpaceAfterChars(const Line: string; StartPos: integer): Integer;
begin
Result := PosEx(' ', Line, StartPos);
end;
function FindFirstSpaceBeforeChars(const Line: string; StartPos: integer): Integer;
var i: integer;
begin
Result := 1;
for i := StartPos downto 1 do
begin
if (Line[i] = ' ') then
begin
Result := i;
exit;
end;
end;
end;
var InnerTag: string;
LastPos, LastInnerPos: Integer;
SPos, LPos, RPos: Integer;
AttribValue: string;
ClosingChar: char;
TempAttribName: string;
begin
Result := 0;
LastPos := 1;
while (true) do
begin
// find outer tags ''
LPos := PosEx(' if (LPos RPos := PosEx('', HtmlText, LPos+1);
if (RPos LastPos := LPos + 1
else
LastPos := RPos + 1;
// get inner tag
InnerTag := Copy(HtmlText, LPos+1, RPos-LPos-1);
InnerTag := Trim(InnerTag); // remove spaces
if (Length(InnerTag)
// check tag name
if (SameText(Copy(InnerTag, 1, Length(TagName)), TagName)) then
begin
// found tag
AttribValue := '';
LastInnerPos := Length(TagName)+1;
while (LastInnerPos begin
// find first '=' after LastInnerPos
RPos := PosEx('=', InnerTag, LastInnerPos);
if (RPos
// this way you can check for multiple attrib names and not a specific attrib
SPos := FindFirstSpaceBeforeChars(InnerTag, RPos);
TempAttribName := Trim(Copy(InnerTag, SPos, RPos-SPos));
if (true) then
begin
// found correct tag
LPos := FindFirstCharAfterSpace(InnerTag, RPos+1);
if (LPos begin
LastInnerPos := RPos + 1;
continue;
end;
LPos := FindFirstCharAfterSpace(InnerTag, LPos); // get to first char after '='
if (LPos if ((InnerTag[LPos] '"') and (InnerTag[LPos] '''')) then
begin
// AttribValue is not between '"' or ''' so get it
RPos := FindFirstSpaceAfterChars(InnerTag, LPos+1);
if (RPos AttribValue := Copy(InnerTag, LPos, Length(InnerTag)-LPos+1)
else
AttribValue := Copy(InnerTag, LPos, RPos-LPos+1);
end
else
begin
// get url between '"' or '''
ClosingChar := InnerTag[LPos];
RPos := PosEx(ClosingChar, InnerTag, LPos+1);
if (RPos AttribValue := Copy(InnerTag, LPos+1, Length(InnerTag)-LPos-1)
else
AttribValue := Copy(InnerTag, LPos+1, RPos-LPos-1)
end;
if (SameText(TempAttribName, AttribName)) and (AttribValue '') then
begin
Values.Add(AttribValue);
inc(Result);
end;
end;
if (RPos LastInnerPos := Length(InnerTag)
else
LastInnerPos := RPos+1;
end;
end;
end;
end;
For eg. you want to extract all links in a page, just do:
var Links: TStrings;
begin
Links := TStrings.Create;
try
LinksFound := ExtractHtmlTagValues(HtmlText, 'A', 'HREF', Links);
Showmessage(Links.Text);
finally
Links.Free;
end;
end;
If you find any bugs or comments, or even improvements please drop an email.
thanks
Yaniv