Finn Tolderlund Guest
Posted: Tue Nov 04, 2003 6:57 pm Post subject: HTML text extraction
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund

Ignacio Vazquez Guest
Posted: Tue Nov 04, 2003 7:12 pm Post subject: Re: HTML text extraction
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
3fa7f696 (AT) newsgroups (DOT) borland.com...
| Quote: | Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
|
You could use regexes to extract the body then replace out the tags:
1) Find the first group in '<body[ \n\t]?[^>]*>([.\n]*)</body>'.
2) Replace '<[^>]+>' with ''.
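A minimal sketch of these two steps in Delphi, assuming a version that ships the System.RegularExpressions unit (Delphi XE or later; it did not exist in 2003). ExtractBodyText is a made-up name. The body pattern is a variation on the one in the post, not the same expression: inside a character class '.' is a literal dot, so this sketch uses '.*?' with the single-line option to span line breaks.

```pascal
program StripTagsDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: returns the contents of <body>...</body> with all
// tags removed. If no body element is found, the whole input is used.
function ExtractBodyText(const Html: string): string;
var
  M: TMatch;
  Body: string;
begin
  // Step 1: capture everything between <body ...> and the first </body>.
  // roSingleLine lets '.' match line breaks as well.
  M := TRegEx.Match(Html, '<body\b[^>]*>(.*?)</body>',
    [roIgnoreCase, roSingleLine]);
  if M.Success then
    Body := M.Groups[1].Value
  else
    Body := Html;
  // Step 2: replace every '<...>' run with an empty string.
  Result := TRegEx.Replace(Body, '<[^>]+>', '');
end;

begin
  // Prints: Hello world
  Writeln(ExtractBodyText(
    '<html><body class="x"><p>Hello <b>world</b></p></body></html>'));
end.
```

This leaves entities such as &amp; untouched and does not treat comments, scripts or style blocks specially, so it is only a starting point.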
Cheers,
Ignacio
--
The strange part isn't so much that he had an accent. No accent was
detectable. It was just sounds and burbs and gurgles coming from him. He
was a like a chubby, old R2-D2.
- La Üter

Finn Tolderlund Guest
Posted: Tue Nov 04, 2003 7:29 pm Post subject: Re: HTML text extraction
"Ignacio Vazquez" <ivazquezATorioncommunications.com> wrote in message
news:3fa7fa0c (AT) newsgroups (DOT) borland.com...
| Quote: | "Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
You could use regexes to extract the body then replace out the tags:
|
What are regexes?
| Quote: | 1) Find the first group in '<body[ \n\t]?[^>]*>([.\n]*)</body>'.
|
Huh?
| Quote: | 2) Replace '<[^>]+>' with ''.
|
What if the visible text contains these characters?
I need to do this in Delphi code of course.
--
Finn Tolderlund

Ignacio Vazquez Guest
Posted: Tue Nov 04, 2003 7:34 pm Post subject: Re: HTML text extraction
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
3fa7fe36 (AT) newsgroups (DOT) borland.com...
| Quote: | "Ignacio Vazquez" <ivazquezATorioncommunications.com> wrote in
message news:3fa7fa0c (AT) newsgroups (DOT) borland.com...
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
You could use regexes to extract the body then replace out the tags:
What is regexes?
|
Regular expressions.
http://www.latiumsoftware.com/en/articles/00009.php
| Quote: | 2) Replace '<[^>]+>' with ''.
What if the visible text contains these characters?
|
Visible text in valid HTML won't contain the sequence given, because a
literal '<' in the text has to be written as the entity &lt;. The pattern
matches tags, and won't match anything else in 99.999% of the HTML out there.
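To illustrate the point, here is a small sketch, assuming a Delphi version that has both System.RegularExpressions and System.NetEncoding (XE7 or later). VisibleText is a made-up name. It strips the tags first and decodes entities afterwards, so an escaped '<' in the text is never mistaken for the start of a tag.

```pascal
program EntityDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions, System.NetEncoding;

// Hypothetical helper: removes tags, then turns entities such as &lt;
// and &amp; back into the characters they stand for.
function VisibleText(const Html: string): string;
begin
  // Tags go first, while '<' and '>' in the text are still escaped.
  Result := TRegEx.Replace(Html, '<[^>]+>', '');
  // Only now is it safe to decode the entities.
  Result := TNetEncoding.HTML.Decode(Result);
end;

begin
  // The escaped comparison operators survive the tag stripping.
  Writeln(VisibleText('<p>if a &lt; b &amp;&amp; b &gt; c then <b>ok</b></p>'));
end.
```

Decoding before stripping would be the wrong order: the decoded '<' and '>' would then look like markup to the second pattern.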
Cheers,
Ignacio
--
The strange part isn't so much that he had an accent. No accent was
detectable. It was just sounds and burbs and gurgles coming from him. He
was a like a chubby, old R2-D2.
- La Üter

David Knaack Guest
Posted: Tue Nov 04, 2003 7:54 pm Post subject: Re: HTML text extraction
Finn Tolderlund wrote:
| Quote: | Is there a simple way to extract the text from a html file?
|
FastStrings library (http://www.droopyeyes.com) has a strip html
function that might do it for you. Just a function call, very easy to
test to see if it will meet your needs.
DK

Finn Tolderlund Guest
Posted: Tue Nov 04, 2003 7:57 pm Post subject: Re: HTML text extraction
Looks good, thanks for the link.
However I found DIHtmlParser from
http://www.zeitungsjunge.de/delphi/htmlparser/ which seems to be just what
the doctor ordered.
--
Finn Tolderlund
"Ignacio Vazquez" <ivazquezATorioncommunications.com> wrote in message

eshipman Guest
Posted: Tue Nov 04, 2003 8:48 pm Post subject: Re: HTML text extraction
In article <3fa7f696 (AT) newsgroups (DOT) borland.com>, no (AT) spam (DOT) dk says...
| Quote: | Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
|
First, use something like this to get an IHTMLDocument2 object:
public
FDoc: IHTMLDocument2;
end;
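One commonly used way to get an IHTMLDocument2 without a visible browser control is sketched below. It is not necessarily the approach meant in this post. It assumes Windows with MSHTML installed and that the unscoped unit names resolve (as in Delphi 7, or with the default unit scope names of later versions). HtmlToText is a made-up name.

```pascal
program InnerTextDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Variants, ActiveX, MSHTML;

// Hypothetical helper: lets MSHTML parse the markup and returns the text
// the browser engine would display for the body.
function HtmlToText(const Html: string): string;
var
  Doc: IHTMLDocument2;
  V: OleVariant;
begin
  // Create an in-memory HTML document object.
  Doc := CoHTMLDocument.Create as IHTMLDocument2;
  // IHTMLDocument2.write expects a SAFEARRAY of variants.
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := Html;
  Doc.write(PSafeArray(TVarData(V).VArray));
  Doc.close;
  Result := Doc.body.innerText;
end;

begin
  // COM must be initialised on this thread before MSHTML can be used.
  CoInitialize(nil);
  try
    Writeln(HtmlToText('<html><body><p>Hello <b>world</b></p></body></html>'));
  finally
    CoUninitialize;
  end;
end.
```

Because the engine does the parsing, entities and nested markup are handled for you. Pages that pull in scripts or other external resources may need more care than this sketch takes.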

Finn Tolderlund Guest
Posted: Tue Nov 04, 2003 8:50 pm Post subject: Re: HTML text extraction
Thank you very much, looks promising.
--
Finn Tolderlund
"David Knaack" <davidknaack (AT) cox (DOT) net> wrote in message
news:3fa803da$1 (AT) newsgroups (DOT) borland.com...
| Quote: | Finn Tolderlund wrote:
Is there a simple way to extract the text from a html file?
FastStrings library (http://www.droopyeyes.com) has a strip html
function that might do it for you. Just a function call, very easy to
test to see if it will meet your needs.
|

Finn Tolderlund Guest
Posted: Tue Nov 04, 2003 8:51 pm Post subject: Re: HTML text extraction
Thank you very much, I'll try it.
--
Finn Tolderlund
"eshipman" <eshipman@yahoo!!!.com> wrote in message
news:MPG.1a11cc8ce4dab31c9896dc (AT) forums (DOT) borland.com...
| Quote: | First, use something like this to get an IHTMLDocument2 object:
|

Finn Tolderlund Guest
Posted: Tue Nov 04, 2003 8:51 pm Post subject: Re: HTML text extraction
Not so good, it requires that I buy a license to get the source.
--
Finn Tolderlund
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
news:3fa804be (AT) newsgroups (DOT) borland.com...

Fauschti Guest
Posted: Tue Nov 04, 2003 8:59 pm Post subject: Re: HTML text extraction
Hi Finn,
If you load the page in a TWebBrowser, you can do this (WebBrowser1 is the
component on your form, S is a string):
  S := (WebBrowser1.Document as IHTMLDocument2).body.innerText;
  ShowMessage( S );
I admit, it's similar to eshipman's solution, but you do not need his
workaround if you already have the page in a web browser. (This also works
with TEmbeddedWB.)
regards,
Michael
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote in message
news:3fa7f696 (AT) newsgroups (DOT) borland.com...
| Quote: | Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund
|

Ed Guest
Posted: Wed Nov 05, 2003 4:34 pm Post subject: Re: HTML text extraction
Here's how to remove HTML:
function removeHTML(htmlString: string;
  preserveLineBreaks: boolean = false): string;
var
  i: integer;
  tagMode: boolean;       // true while inside <...>
  Special: string;        // entity name collected after '&'
  SpecialMode: boolean;   // true while inside &...;
  ch: AnsiChar;
  LastCh: AnsiChar;       // last character written, used to collapse spaces
  len: integer;
  htmlTag: string;        // lower-cased contents of the current tag
begin
  Result := '';
  Special := '';
  tagMode := false;
  SpecialMode := false;
  LastCh := ' ';
  htmlTag := '';
  len := Length(htmlString);
  for i := 1 to len do
  begin
    ch := htmlString[i];
    // treat control characters (tabs, line breaks) as spaces
    if (ch < ' ') then ch := ' ';
    if ch = '<' then
    begin
      tagMode := true;
      htmlTag := '';
    end
    else if ch = '>' then
    begin
      tagMode := false;
      // only a plain <br> is recognised as a line break
      if (htmlTag = 'br') then
      begin
        if preserveLineBreaks then
          Result := Result + #13#10
        else if (LastCh <> ' ') then
          Result := Result + ' ';
        LastCh := ' ';
      end;
    end
    else if tagMode then
    begin
      htmlTag := htmlTag + LowerCase(ch);
      Continue;
    end
    else if ch = '&' then
      SpecialMode := true
    else if SpecialMode then
    begin
      if (ch = ';') and (SpecialMode) then
      begin
        SpecialMode := false;
        if Special = 'nbsp' then ch := ' '
        else if Special = 'lt' then ch := '<'
        else if Special = 'gt' then ch := '>'
        else if Special = 'copy' then ch := '©'
        else if Special = 'reg' then ch := '®'
        else if Special = 'amp' then ch := '&'
        else ch := ' ';
        Result := Result + ch;
        Special := '';
        // remember that a Special character (not a space) was output
        LastCh := '&';
      end
      else
      begin
        Special := Special + LowerCase(ch);
      end;
    end
    else
    begin
      // copy the character, collapsing runs of spaces into one
      if (ch <> ' ') or (LastCh <> ' ') then
        Result := Result + ch;
      LastCh := ch;
    end;
  end;
end;
HTH
Ed Wilson
Training Technologies, Inc.
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote
| Quote: | Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund
|

Finn Tolderlund Guest
Posted: Wed Nov 05, 2003 5:47 pm Post subject: Re: HTML text extraction
Thanks Ed and to you all.
I think I have enough now to get the job done.
--
Finn Tolderlund

peter Guest
Posted: Thu Nov 27, 2003 4:18 am Post subject: Re: HTML text extraction
I'm not too sure if this will help, but you could get the free HTML
component from pbears.com and use it to turn HTML into text. I know I've done it.
"Finn Tolderlund" <no (AT) spam (DOT) dk> wrote
| Quote: | Is there a simple way to extract the text from a html file?
The text I need to get is the text which is displayed by a browser when
viewing the html file.
All the HTML tags should be stripped away.
Any ideas?
--
Finn Tolderlund
|