BorlandTalk.com Forum Index BorlandTalk.com
Borland discussion newsgroups
 

HTML text extraction

 
BorlandTalk.com Forum Index -> Delphi VCL Components Using

Finn Tolderlund
Guest
Posted: Tue Nov 04, 2003 6:57 pm    Post subject: HTML text extraction

Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund


Ignacio Vazquez
Guest
Posted: Tue Nov 04, 2003 7:12 pm    Post subject: Re: HTML text extraction

"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
news:3fa7f696 (AT) newsgroups (DOT) borland.com...
Quote:
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?

You could use regexes to extract the body then replace out the tags:

1) Find the first group in '<body[ \n\t]?[^>]*>([.\n]*)</body>'.

2) Replace '<[^>]+>' with ''.

Cheers,
Ignacio

--
The strange part isn't so much that he had an accent. No accent was
detectable. It was just sounds and burbs and gurgles coming from him. He
was like a chubby, old R2-D2.
- La Üter
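
The two regex steps above can be sketched in Delphi roughly as follows. This assumes a recent Delphi that ships the System.RegularExpressions unit (TRegEx), which was not available when this thread was written. The program name and the StripTags helper are made up for illustration, and the body pattern used here, '<body\b[^>]*>(.*?)</body>' with the single-line option, is a substitute for the pattern in the post, since '.' inside a character class only matches a literal dot.

```pascal
program StripTagsDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: returns the body contents of Html with all tags removed.
function StripTags(const Html: string): string;
var
  M: TMatch;
begin
  Result := Html;
  // Step 1: keep only what sits between <body ...> and </body>, if present.
  // roSingleLine lets '.' match line breaks as well.
  M := TRegEx.Match(Html, '<body\b[^>]*>(.*?)</body>',
    [roIgnoreCase, roSingleLine]);
  if M.Success then
    Result := M.Groups[1].Value;
  // Step 2: delete everything that looks like a tag.
  Result := TRegEx.Replace(Result, '<[^>]+>', '');
end;

begin
  Writeln(StripTags(
    '<html><body class="x"><p>Hello <b>world</b></p></body></html>'));
end.
```

Note that this leaves entities such as &amp; untouched and does not insert any whitespace where block tags like <p> were removed, so it is only a starting point.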



Finn Tolderlund
Guest
Posted: Tue Nov 04, 2003 7:29 pm    Post subject: Re: HTML text extraction

"Ignacio Vazquez" <ivazquezATorioncommunications.com> wrote in message
news:3fa7fa0c (AT) newsgroups (DOT) borland.com...
Quote:
You could use regexes to extract the body then replace out the tags:

What are regexes?

Quote:
1) Find the first group in '<body[ \n\t]?[^>]*>([.\n]*)</body>'.

Huh?

Quote:
2) Replace '<[^>]+>' with ''.

What if the visible text contains these characters?
I need to do this in Delphi code, of course.
--
Finn Tolderlund



Ignacio Vazquez
Guest
Posted: Tue Nov 04, 2003 7:34 pm    Post subject: Re: HTML text extraction

"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
news:3fa7fe36 (AT) newsgroups (DOT) borland.com...
Quote:
What are regexes?

Regular expressions.

http://www.latiumsoftware.com/en/articles/00009.php

Quote:
2) Replace '<[^>]+>' with ''.

What if the visible text contains these characters?

HTML text won't contain the sequence given; it matches tags, and won't match
anything else in 99.999% of the HTML out there.

Cheers,
Ignacio

--
The strange part isn't so much that he had an accent. No accent was
detectable. It was just sounds and burbs and gurgles coming from him. He
was like a chubby, old R2-D2.
- La Üter
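
A small illustration of why the tag pattern usually leaves visible text alone: in well-formed HTML a visible "<" or ">" is written as the entity &lt; or &gt;, so '<[^>]+>' never sees it. The sketch below assumes a Delphi version with System.RegularExpressions; the program and variable names are made up. The rare cases where the pattern does go wrong include comments, script and style contents, and attribute values that contain '>'.

```pascal
program EntityDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

var
  Html, Plain: string;
begin
  // In the markup, the visible "<" is written as &lt;
  Html := '<p>if a &lt; b then <i>swap</i></p>';

  // Only the real tags match; the entity is left untouched.
  Plain := TRegEx.Replace(Html, '<[^>]+>', '');

  // Decode the entities after the tags are gone; &amp; goes last so that
  // text such as "&amp;lt;" is not decoded twice.
  Plain := StringReplace(Plain, '&lt;', '<', [rfReplaceAll]);
  Plain := StringReplace(Plain, '&gt;', '>', [rfReplaceAll]);
  Plain := StringReplace(Plain, '&amp;', '&', [rfReplaceAll]);

  Writeln(Plain);
end.
```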



David Knaack
Guest
Posted: Tue Nov 04, 2003 7:54 pm    Post subject: Re: HTML text extraction

Finn Tolderlund wrote:
Quote:
Is there a simple way to extract the text from a html file?

FastStrings library (http://www.droopyeyes.com) has a strip html
function that might do it for you. Just a function call, very easy to
test to see if it will meet your needs.

DK


Finn Tolderlund
Guest
Posted: Tue Nov 04, 2003 7:57 pm    Post subject: Re: HTML text extraction

Looks good, thanks for the link.
However I found DIHtmlParser from
http://www.zeitungsjunge.de/delphi/htmlparser/ which seems to be just what
the doctor ordered.
--
Finn Tolderlund


"Ignacio Vazquez" <ivazquezATorioncommunications.com> wrote in message
Quote:
Regular expressions.
http://www.latiumsoftware.com/en/articles/00009.php



eshipman
Guest
Posted: Tue Nov 04, 2003 8:48 pm    Post subject: Re: HTML text extraction

In article <3fa7f696 (AT) newsgroups (DOT) borland.com>, no (AT) spam (DOT) dk says...
Quote:
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?


First, use something like this to get an IHTMLDocument2 object:

public
FDoc: IHTMLDocument2;
end;
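
The code in the post above breaks off after the field declaration. The following is not eshipman's code, only a minimal sketch of one commonly used way to get an IHTMLDocument2 without a visible browser: create an MSHTML document object, write the HTML into it, and read body.innerText. It assumes Windows with MSHTML available and the ActiveX, MSHTML and Variants units that ship with Delphi (newer versions may need the Winapi. and System. unit prefixes); HtmlToPlainText and the program name are hypothetical.

```pascal
program HtmlTextDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Variants, ActiveX, MSHTML;

// Hypothetical helper: lets MSHTML parse Html and returns the rendered text.
function HtmlToPlainText(const Html: string): string;
var
  Doc: IHTMLDocument2;
  V: OleVariant;
begin
  Result := '';
  Doc := CoHTMLDocument.Create as IHTMLDocument2;
  // IHTMLDocument2.write expects a SAFEARRAY of variants.
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := Html;
  Doc.write(PSafeArray(TVarData(V).VArray));
  Doc.close;
  if Doc.body <> nil then
    Result := Doc.body.innerText;
end;

begin
  // COM must be initialised on this thread; a VCL forms application
  // normally does this already.
  CoInitialize(nil);
  try
    Writeln(HtmlToPlainText(
      '<html><body><p>Hello <b>world</b></p></body></html>'));
  finally
    CoUninitialize;
  end;
end.
```

Because the document is parsed by the same engine as Internet Explorer, entities are decoded and script and style contents are left out of innerText, which the regex approach does not handle.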


Finn Tolderlund
Guest
Posted: Tue Nov 04, 2003 8:50 pm    Post subject: Re: HTML text extraction

Thank you very much, looks promising.
--
Finn Tolderlund


"David Knaack" <davidknaack (AT) cox (DOT) net> wrote in message
news:3fa803da$1 (AT) newsgroups (DOT) borland.com...
Quote:
Finn Tolderlund wrote:
Is there a simple way to extract the text from a html file?

FastStrings library (http://www.droopyeyes.com) has a strip html
function that might do it for you. Just a function call, very easy to
test to see if it will meet your needs.



Finn Tolderlund
Guest
Posted: Tue Nov 04, 2003 8:51 pm    Post subject: Re: HTML text extraction

Thank you very much, I'll try it.
--
Finn Tolderlund


"eshipman" <eshipman@yahoo!!!.com> wrote in message
news:MPG.1a11cc8ce4dab31c9896dc (AT) forums (DOT) borland.com...
Quote:
First, use something like this to get an IHTMLDocument2 object:



Finn Tolderlund
Guest
Posted: Tue Nov 04, 2003 8:51 pm    Post subject: Re: HTML text extraction

Not so good; it requires that I buy a license to get the source.
--
Finn Tolderlund


"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
news:3fa804be (AT) newsgroups (DOT) borland.com...
Quote:
Looks good, thanks for the link.
However I found DIHtmlParser from
http://www.zeitungsjunge.de/delphi/htmlparser/ which seems to be just what
the doctor ordered.



Fauschti
Guest
Posted: Tue Nov 04, 2003 8:59 pm    Post subject: Re: HTML text extraction

Hi Finn,

If you load the page in a TWebBrowser, you can do this:

// WebBrowser1 is your TWebBrowser instance; IHTMLDocument2 comes from the MSHTML unit
S := (WebBrowser1.Document as IHTMLDocument2).body.innerText;
ShowMessage( S );

I admit it's similar to eshipman's solution, but you do not need his
workaround if you already have the page in a web browser. (This also works
with TEmbeddedWB.)

regards,
Michael

"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
news:3fa7f696 (AT) newsgroups (DOT) borland.com...
Quote:
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund





Ed
Guest
Posted: Wed Nov 05, 2003 4:34 pm    Post subject: Re: HTML text extraction

Here's how to remove HTML:

function removeHTML(const htmlString: string; preserveLineBreaks: boolean = false): string;
var
  i: integer;
  tagMode: boolean;      // true while inside <...>
  Special: string;       // entity name collected after '&'
  SpecialMode: boolean;  // true while inside &...;
  ch: Char;
  LastCh: Char;          // last character written; used to collapse runs of spaces
  htmlTag: string;       // lower-cased contents of the current tag
begin
  Result := '';
  Special := '';
  tagMode := false;
  SpecialMode := false;
  LastCh := ' ';
  htmlTag := '';
  for i := 1 to Length(htmlString) do
  begin
    ch := htmlString[i];
    if ch < ' ' then ch := ' ';  // control characters (CR, LF, tab) count as spaces
    if ch = '<' then
    begin
      tagMode := true;
      htmlTag := '';
    end
    else if ch = '>' then
    begin
      tagMode := false;
      // accept <br>, <br/> and <br ...>
      if (htmlTag = 'br') or (Copy(htmlTag, 1, 3) = 'br/') or
         (Copy(htmlTag, 1, 3) = 'br ') then
      begin
        if preserveLineBreaks then
          Result := Result + #13#10
        else if LastCh <> ' ' then
          Result := Result + ' ';
        LastCh := ' ';
      end;
    end
    else if tagMode then
      htmlTag := htmlTag + LowerCase(ch)
    else if ch = '&' then
      SpecialMode := true
    else if SpecialMode then
    begin
      if ch = ';' then
      begin
        SpecialMode := false;
        Special := LowerCase(Special);
        if Special = 'nbsp' then ch := ' '
        else if Special = 'lt' then ch := '<'
        else if Special = 'gt' then ch := '>'
        else if Special = 'copy' then ch := #169  // copyright sign
        else if Special = 'reg' then ch := #174   // registered sign
        else if Special = 'amp' then ch := '&'
        else ch := ' ';                           // unknown entity
        Result := Result + ch;
        Special := '';
        LastCh := '&';  // remember that a special character (not a space) was output
      end
      else if (ch = ' ') or (Length(Special) >= 8) then
      begin
        // no ';' in sight, so this was a bare '&' (as in "AT&T rocks"):
        // emit it literally instead of swallowing the rest of the text
        Result := Result + '&' + Special + ch;
        SpecialMode := false;
        Special := '';
        LastCh := ch;
      end
      else
        Special := Special + ch;
    end
    else
    begin
      // ordinary text: copy it, collapsing consecutive spaces into one
      if (ch <> ' ') or (LastCh <> ' ') then
        Result := Result + ch;
      LastCh := ch;
    end;
  end;
end;


HTH

Ed Wilson
Training Technologies, Inc.
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote

Quote:
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund





Finn Tolderlund
Guest
Posted: Wed Nov 05, 2003 5:47 pm    Post subject: Re: HTML text extraction

Thanks Ed and to you all.
I think I have enough now to get the job done.
--
Finn Tolderlund


peter
Guest
Posted: Thu Nov 27, 2003 4:18 am    Post subject: Re: HTML text extraction

I'm not too sure if this will help, but you could get the free HTML
component from pbears.com and use it to turn HTML into text. I know I've done it.


"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote

Quote:
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund




