BorlandTalk.com
Borland discussion newsgroups
 

How to determine keywords of a document (*.txt, *.doc)

 
Forum: comp.lang.pascal.delphi.misc
BGK (Guest)
Posted: Fri Dec 19, 2003 9:08 am    Post subject: How to determine keywords of a document (*.txt, *.doc)

How can I use a script to automatically find the keywords of a document?

Maarten Wiltink (Guest)
Posted: Fri Dec 19, 2003 12:21 pm    Post subject: Re: How to determine keywords of a document (*.txt, *.doc)

"BGK" <jiekuan (AT) pmail (DOT) ntu.edu.sg> wrote

Quote:
How can I use a script to automatically find the keywords of a
document?

Which of the roughly ten thousand word processors that store their
documents with a .doc extension are you considering? Anyway, the
algorithm is the same for all of them: find out the format; if it
supports annotation with keywords, extract them; otherwise, treat the
file as plain text.
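
As a rough illustration of the "find out the format" step, here is a minimal Python sketch (Python rather than Delphi, purely for brevity) that guesses the container format from a file's leading bytes. The function name `sniff_format` and the returned labels are made up for this example. The byte signatures are the commonly documented ones for OLE2 compound files (used by legacy Microsoft Word .doc), RTF, ZIP-based formats, and PDF. Anything unrecognized falls through to plain text, as suggested above.

```python
# Hypothetical helper: guess a document's container format from its first bytes.
# The labels returned here are illustrative, not part of any standard.

# Signature of an OLE2 compound file (legacy MS Office documents).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(data: bytes) -> str:
    """Return a coarse format label for the given leading bytes."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    if data.startswith(b"PK\x03\x04"):
        return "zip"  # could be .docx, .odt, or any other ZIP archive
    if data.startswith(b"%PDF-"):
        return "pdf"
    return "text"  # fallback: treat as plain text
```

Pulling embedded keywords out of a recognized format is a separate, format-specific job that this sketch does not attempt; it only decides which branch of the algorithm applies.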

For plain text files, you may conjecture RFC-822 format and look for
the "Keywords: " header line. Otherwise, you may look for underlined
words (by means of a backspace-underscore combination, either per
letter or per word), or simply select the longest words from the text,
optionally performing some statistics on them to select the most
relevant ones. You know what they say about statistics? It's true,
especially when you start with data that may be anything (including
Cyrillic) and mean anything. You may simplify your problem by making
some well-chosen assumptions. You know what they say about assumptions?
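
The three plain-text heuristics above could be sketched roughly as follows, again in Python for brevity. Everything here is an assumption made for illustration: the function names, the comma as keyword separator in the header, the two overstrike layouts recognized (underscore and letter paired around a backspace, or a whole word followed by backspaces and underscores), and the thresholds `min_len` and `top`. The "statistics" are nothing more than word counts.

```python
import re
from collections import Counter
from email.parser import HeaderParser

# Overstrike per letter: "_\bX" or "X\b_", repeated for each letter.
_PER_LETTER = re.compile(r"(?:_\x08[^\W_]|[^\W_]\x08_)+")
# Overstrike per word: the word, then backspaces, then underscores.
_PER_WORD = re.compile(r"([^\W_]{2,})\x08{2,}_{2,}")


def header_keywords(text):
    """Return entries of an RFC-822 style 'Keywords:' header, or []."""
    value = HeaderParser().parsestr(text).get("Keywords")
    if not value:
        return []
    # Assumption: keywords are comma-separated.
    return [k.strip() for k in value.split(",") if k.strip()]


def underlined_words(text):
    """Return words underlined with backspace-underscore overstrikes."""
    words = [re.sub(r"[_\x08]", "", m.group(0))
             for m in _PER_LETTER.finditer(text)]
    words += [m.group(1) for m in _PER_WORD.finditer(text)]
    return words


def frequent_long_words(text, min_len=6, top=5):
    """Return the most frequent words of at least min_len letters."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_len)
    # Most frequent first; ties go to the longer word, then alphabetically.
    ranked = sorted(counts.items(),
                    key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))
    return [w for w, _ in ranked[:top]]


def guess_keywords(text):
    """Try each heuristic in turn and return the first non-empty result."""
    return (header_keywords(text)
            or underlined_words(text)
            or frequent_long_words(text))
```

Calling `guess_keywords(open("notes.txt", encoding="utf-8").read())` (with a hypothetical file name) tries the header first and only falls back to word counting when nothing better is found. The character classes are Unicode-aware, so Cyrillic text is tokenized too, but whether long, frequent words are meaningful keywords in a given language is exactly the kind of assumption the post warns about.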

Groetjes,
Maarten Wiltink



Steven (Guest)
Posted: Tue Dec 23, 2003 10:15 pm    Post subject: Re: How to determine keywords of a document (*.txt, *.doc)


Maybe slightly OT:
I would like to build indices for several types of files, and PDF in
particular remains a problem.
I downloaded Adobe's SDK, but I was hoping to find a way to extract
plain text without going through 10k pages of documentation first.
I know this question has been asked before, but AFAIK with zero replies.
Better results this time?

happy solstice and things

steven



All times are GMT