BorlandTalk.com
Borland discussion newsgroups
 

Wanted: HTTP robot/spider

 
BorlandTalk.com Forum Index -> Delphi Thirdparty Tools (General)

John McTaggart
Posted: Mon Oct 16, 2006 10:15 pm    Post subject: Wanted: HTTP robot/spider

Hi,

I'm looking for a multi-threaded robot/spider that allows
for the optional downloading (get/head) of web pages but does
no parsing beyond that needed to get the links/URLs.

Must be robust.

I've already seen HTTPScan, but at $250+ it's a non-starter.

Any ideas are appreciated.

John McTaggart
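
The requirements above (several worker threads, an optional HEAD before the GET, and no parsing beyond link extraction) can be sketched roughly as follows. This is only an illustration in Python's standard library, not a Delphi component and not any product named in this thread. The names `fetch`, `fetch_html`, `crawl`, and the `example-spider/0.1` user agent are made up for the sketch, and the link extractor is passed in as a parameter so the crawl loop does not care how links are found.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, head_only=False, timeout=10):
    """Issue a HEAD or GET request; return (content_type, body_text).

    The body is an empty string for HEAD requests.
    """
    request = urllib.request.Request(
        url,
        method="HEAD" if head_only else "GET",
        headers={"User-Agent": "example-spider/0.1"},  # hypothetical agent name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get_content_type()
        if head_only:
            return content_type, ""
        charset = response.headers.get_content_charset() or "utf-8"
        return content_type, response.read().decode(charset, errors="replace")


def fetch_html(url):
    """HEAD first, then GET the body only when the server reports HTML.

    Assumption: the server answers HEAD sensibly; some servers do not.
    """
    content_type, _ = fetch(url, head_only=True)
    if content_type != "text/html":
        return ""
    return fetch(url)[1]


def crawl(seeds, fetch_page, extract_links, max_pages=50, workers=4):
    """Crawl level by level, fetching each level's URLs on a thread pool.

    fetch_page(url) returns page text; extract_links(url, text) returns URLs.
    A failed fetch is skipped so one bad URL cannot stop the crawl.
    """
    def attempt(url):
        try:
            return fetch_page(url)
        except Exception:
            return None  # treat any failure as "page unavailable"

    seen = set(seeds)
    frontier = list(seeds)
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            room = max_pages - len(pages)
            batch, frontier = frontier[:room], frontier[room:]
            # Workers only download; all bookkeeping stays on this thread.
            for url, text in zip(batch, pool.map(attempt, batch)):
                if text is None:
                    continue
                pages[url] = text
                for link in extract_links(url, text):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages
```

With a network connection this would be driven as `crawl(["http://example.com/"], fetch_html, my_link_extractor)`, where `my_link_extractor(url, html)` is whatever link-extraction routine you plug in. Processing one batch at a time keeps the shared sets and lists on the main thread, so no locks are needed. A production spider would also want robots.txt handling and per-host rate limiting, which are left out here.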

Johnnie Norsworthy
Posted: Mon Oct 16, 2006 11:42 pm    Post subject: Re: Wanted: HTTP robot/spider

"John McTaggart" <_john_ (AT) _compnet101_ (DOT) _com> wrote in message
news:4533be47$1 (AT) newsgroups (DOT) borland.com...
> I'm looking for a multi-threaded robot/spider that allows
> for the optional downloading (get/head) of web pages but does
> no parsing beyond that needed to get the links/URLs.
>
> Must be robust.
>
> I've already seen HTTPScan, but at $250+ it's a non-starter.
>
> Any ideas are appreciated.

I have been wanting to write one. I'd use ICS internet components.

Contact me at Johnnie.Norsworthy (AT) gmail (DOT) com and we can talk.

-Johnnie

Brian Moelk
Posted: Tue Oct 17, 2006 8:01 am    Post subject: Re: Wanted: HTTP robot/spider

John McTaggart wrote:
> I'm looking for a multi-threaded robot/spider that allows
> for the optional downloading (get/head) of web pages but does
> no parsing beyond that needed to get the links/URLs.

I've written one for my previous employer.

> Must be robust.
>
> I've already seen HTTPScan, but at $250+ it's a non-starter.
>
> Any ideas are appreciated.

I used Indy and Delphi Inspiration's HTML Parser. It was pretty
straightforward in its initial cut, but in order to get more of the
links and to avoid honeypots, etc., it became more complex.

I'm pretty certain you can use any socket/parser components really, but
I have a feeling you'll be using ATagParser. ;)
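
As a rough illustration of the "socket plus parser" approach described above, here is a minimal link extractor. It uses Python's standard `html.parser` as a stand-in for the Delphi parser components mentioned in this thread and does not reproduce any of them. The names `LinkExtractor` and `extract_links`, and the choice of which tags count as links, are assumptions made for the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Tags treated as carrying a link, mapped to the attribute holding the URL.
# This selection is an assumption; a real spider may want more or fewer.
LINK_ATTRIBUTES = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs and ignores everything else in the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRIBUTES.get(tag)  # tag names arrive lowercased
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Resolving every URL against the page address and stripping fragments keeps the same page from being queued under several spellings. Non-HTTP schemes such as `mailto:` are dropped because a spider cannot fetch them.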

There's some simple JavaScript parsing that I ended up doing to get more
links, and I was going to hook in a scripting engine to rip the links for
specific sites.

This wasn't for eval purposes (that would have been quite difficult),
but so our customer support people could have written some regexes and
simple/specific JavaScript to parse out the more difficult links.
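
To make the regex idea concrete, here is a hedged sketch of pulling URLs out of inline script without evaluating it. The two patterns are hypothetical examples of the kind of site-specific rule a support person might write, not the rules from the project described here. `SCRIPT_LINK_PATTERNS` and `extract_script_links` are names invented for the sketch, shown in Python.

```python
import re
from urllib.parse import urljoin

# Hypothetical patterns: each captures the quoted URL in group 1.
SCRIPT_LINK_PATTERNS = [
    # location = "..." / location.href = '...' / window.location = "..."
    re.compile(r"""(?:window\.)?location(?:\.href)?\s*=\s*['"]([^'"]+)['"]"""),
    # window.open('...')
    re.compile(r"""window\.open\(\s*['"]([^'"]+)['"]"""),
]


def extract_script_links(base_url, text, patterns=SCRIPT_LINK_PATTERNS):
    """Return URLs matched by any pattern, resolved against base_url.

    The script is only pattern-matched, never executed, so URLs that are
    assembled at run time will be missed.
    """
    links = []
    for pattern in patterns:
        for match in pattern.finditer(text):
            url = urljoin(base_url, match.group(1))
            if url not in links:  # keep first occurrence, preserve order
                links.append(url)
    return links
```

Keeping the patterns in a plain list means new site-specific rules can be added as data rather than code, which fits the idea of letting non-developers maintain them.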

Although it was a good general-purpose spider, we did extraction from
specific sites. Regardless, I left the company before getting around to
doing that. It would have been pretty fun to do though.

Anyway, feel free to email me if you want to talk about the project in
more detail. I'm not sure if I remember all the details, but I have
some ideas about what I would do differently if I had to do it over
again. But of course, they might not work out either. ;)

--
Brian Moelk
Brain Endeavor LLC
bmoelk (AT) NObrainSPAMendeavorFOR (DOT) MEcom

Walter Matte
Posted: Tue Oct 17, 2006 5:07 pm    Post subject: Re: Wanted: HTTP robot/spider

http://www.felix-colibri.com/papers/web/web_spider/web_spider.html

Walter

"John McTaggart" <_john_ (AT) _compnet101_ (DOT) _com> wrote in message
news:4533be47$1 (AT) newsgroups (DOT) borland.com...
> Hi,
>
> I'm looking for a multi-threaded robot/spider that allows
> for the optional downloading (get/head) of web pages but does
> no parsing beyond that needed to get the links/URLs.
>
> Must be robust.
>
> I've already seen HTTPScan, but at $250+ it's a non-starter.
>
> Any ideas are appreciated.
>
> John McTaggart

All times are GMT