 |
BorlandTalk.com Borland discussion newsgroups
|
BGK Guest
|
Posted: Fri Dec 19, 2003 9:08 am Post subject: How to determine keywords of a document (*.txt, *.doc) |
|
|
How can I use a script to automatically find the keywords of a document?
|
|
|
 |
Maarten Wiltink Guest
|
Posted: Fri Dec 19, 2003 12:21 pm Post subject: Re: How to determine keywords of a document (*.txt, *.doc) |
|
|
"BGK" <jiekuan (AT) pmail (DOT) ntu.edu.sg> wrote
| Quote: | How can I use a script to automatically find the keywords of a
document?
|
Which of the ten thousand or so word processors that store their documents
with a .doc extension do you have in mind? Anyway, the algorithm is the
same for all of them: find out the format; if it supports annotation
with keywords, extract them; otherwise, treat the file as plain text.
For plain text files, you may conjecture RFC-822 format and look for
the "Keywords: " header line. Otherwise, you may look for underlined
words (by means of a backspace-underscore combination, either per
letter or per word), or simply select the longest words from the text,
optionally performing some statistics on them to select the most
relevant ones. You know what they say about statistics? It's true,
especially when you start with data that may be anything (including
Cyrillic (You may simplify your problem by making some well-chosen
assumptions. You know what they say about assumptions?)) and mean
anything.
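
For what it's worth, the plain-text heuristics above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not a definitive implementation: the function names, the minimum word length of 6, and the two overstrike encodings recognised (underscore-backspace-letter per letter, and word-backspaces-underscores per word) are all choices made for illustration. The header lookup relies on the standard library's email.parser, and assumes the Keywords value is a comma-separated list.

```python
import re
from collections import Counter
from email.parser import HeaderParser

LETTER = r"[^\W\d_]"  # any Unicode letter, so Cyrillic text works too

# Assumed overstrike encodings (other tools may emit different ones):
#   per letter: "_\bf_\bo_\bo"      per word: "foo\b\b\b___"
PER_LETTER = re.compile(rf"(?:_\x08{LETTER})+")
PER_WORD = re.compile(rf"({LETTER}+)(\x08+)(_+)")


def header_keywords(text):
    """Return the entries of an RFC 822-style 'Keywords:' header, if any."""
    # HeaderParser reads the header block only and leaves the body unparsed.
    value = HeaderParser().parsestr(text).get("Keywords")
    if not value:
        return []
    return [k.strip() for k in value.split(",") if k.strip()]


def underlined_words(text):
    """Return words underlined with backspace/underscore overstrikes."""
    words = [m.group(0).replace("_\x08", "") for m in PER_LETTER.finditer(text)]
    for m in PER_WORD.finditer(text):
        word, backspaces, underscores = m.groups()
        # Only accept a word whose underline covers it exactly.
        if len(word) == len(backspaces) == len(underscores):
            words.append(word)
    return words


def frequent_long_words(text, min_length=6, top_n=5):
    """Crude statistics: the most frequent words of at least min_length letters."""
    words = re.findall(rf"{LETTER}+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    # Most frequent first; ties go to the longer word, then alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], -len(w), w))
    return ranked[:top_n]


def find_keywords(text):
    """Try each heuristic in turn and return the first non-empty result."""
    return header_keywords(text) or underlined_words(text) or frequent_long_words(text)
```

Calling find_keywords() on the decoded text of a file tries the header first, then underlining, then the word statistics. The length cut-off stands in for a proper stop-word list, which is exactly the kind of assumption that deserves a second look for any real corpus.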
Groetjes,
Maarten Wiltink
|
|
|
 |
Steven Guest
|
Posted: Tue Dec 23, 2003 10:15 pm Post subject: Re: How to determine keywords of a document (*.txt, *.doc) |
|
|
"Maarten Wiltink" <maarten (AT) kittensandcats (DOT) net> wrote
| Quote: | "BGK" <jiekuan (AT) pmail (DOT) ntu.edu.sg> wrote in message
news:b852cfd5.0312190108.6ea41f10 (AT) posting (DOT) google.com...
How can I use a script to automatically find the keywords of a
document?
Which of the ten thousand or so word processors that store their documents
with a .doc extension do you have in mind? Anyway, the algorithm is the
same for all of them: find out the format; if it supports annotation
with keywords, extract them; otherwise, treat the file as plain text.
For plain text files, you may conjecture RFC-822 format and look for
the "Keywords: " header line. Otherwise, you may look for underlined
words (by means of a backspace-underscore combination, either per
letter or per word), or simply select the longest words from the text,
optionally performing some statistics on them to select the most
relevant ones. You know what they say about statistics? It's true,
especially when you start with data that may be anything (including
Cyrillic (You may simplify your problem by making some well-chosen
assumptions. You know what they say about assumptions?)) and mean
anything.
Groetjes,
Maarten Wiltink
|
Maybe slightly OT:
I would like to build indices for several types of files, and esp. PDF
remains a problem.
I downloaded Adobe's SDK, but I was hoping to find a way to extract
plaintext without going through 10k pages of documentation first.
I know this question has been asked before, but AFAIK with zero replies.
Better result this time?
happy solstice and things
steven
|
|
|
 |
|
|
|
|