BorlandTalk.com Borland discussion newsgroups
John McTaggart Guest
|
Posted: Mon Oct 16, 2006 10:15 pm Post subject: Wanted: HTTP robot/spider |
|
|
Hi,
I'm looking for a multi-threaded robot/spider that can optionally
download web pages (GET/HEAD) but does no parsing beyond what is
needed to extract the links/URLs.
It must be robust.
I've already seen HTTPScan, but at $250+ it's a non-starter.
Any ideas are appreciated.
John McTaggart |
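For readers landing on this thread, the request above (worker threads, optional GET or HEAD, link-only parsing, robustness) can be sketched roughly as follows. This is a minimal illustration in Python rather than Delphi, not a drop-in component: `crawl`, `http_fetch`, and the `fetch` / `extract_links` callables are hypothetical names introduced here, and the level-by-level strategy is only one possible design.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_fetch(url, method="GET", timeout=10):
    """Fetch a page body with GET, or only probe it with HEAD (returns None)."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if method == "HEAD":
            return None  # HEAD has no body; the caller only learns the URL is reachable
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch, extract_links, max_pages=100, workers=8):
    """Crawl level by level, fetching each level in parallel.

    fetch(url) returns the page text, or None when the page is skipped.
    extract_links(url, text) returns an iterable of absolute URLs.
    Both callables are supplied by the caller (hypothetical interfaces).
    """

    def safe_fetch(url):
        # A failing page must not take the whole crawl down.
        try:
            return fetch(url)
        except Exception:
            return None

    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            bodies = list(pool.map(safe_fetch, frontier))
            next_frontier = []
            for url, body in zip(frontier, bodies):
                if body is None:
                    continue
                for link in extract_links(url, body):
                    if link not in seen and len(seen) < max_pages:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen
```

A real run would look like `crawl("http://example.com/", http_fetch, my_link_extractor)`, where `my_link_extractor` is whatever link parser is chosen. Politeness rules such as robots.txt handling and rate limiting are deliberately left out of this sketch.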
|
|
 |
Johnnie Norsworthy Guest
|
Posted: Mon Oct 16, 2006 11:42 pm Post subject: Re: Wanted: HTTP robot/spider |
|
|
"John McTaggart" <_john_ (AT) _compnet101_ (DOT) _com> wrote in message
news:4533be47$1 (AT) newsgroups (DOT) borland.com...
| Quote: |
I'm looking for a multi-threaded robot/spider that allows
for the optional downloading (get/head) of web pages but does
no parsing beyond that needed to get the links/URLs.
Must be robust.
I've already seen HTTPScan, but at $250+ it's a non-starter..
Any ideas are appreciated.
|
I have been wanting to write one. I'd use ICS internet components.
Contact me at Johnnie.Norsworthy (AT) gmail (DOT) com and we can talk.
-Johnnie |
|
|
 |
Brian Moelk Guest
|
Posted: Tue Oct 17, 2006 8:01 am Post subject: Re: Wanted: HTTP robot/spider |
|
|
John McTaggart wrote:
| Quote: | I'm looking for a multi-threaded robot/spider that allows
for the optional downloading (get/head) of web pages but does
no parsing beyond that needed to get the links/URLs.
|
I've written one for my previous employer.
| Quote: | Must be robust.
I've already seen HTTPScan, but at $250+ it's a non-starter..
Any ideas are appreciated.
|
I used Indy and Delphi Inspiration's HTML Parser. It was pretty
straightforward in its initial cut, but it became more complex in order
to get more of the links, avoid honeypots, etc.
I'm pretty certain you can use any socket/parser components really, but
I have a feeling you'll be using ATagParser. ;)
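As an illustration of the "initial cut" described above (fetch a page, pull out only the links), here is a minimal sketch using Python's standard `html.parser` in place of the Delphi components named in the post. `LinkCollector`, `LINK_ATTRS`, and `extract_links` are hypothetical names, and the set of tags inspected is an assumption; honeypot avoidance is not addressed here.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Tags that commonly carry a followable URL, and the attribute holding it.
# This selection is an assumption; a real spider may want more (or fewer).
LINK_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs and ignores everything else in the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)  # the parser lowercases tag and attribute names
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated links found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Keeping the parser limited to start tags is what keeps this close to "no parsing beyond that needed to get the links"; text content, scripts, and styles are never inspected.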
I ended up doing some simple JavaScript parsing to get more links, and
was going to hook in a scripting engine to rip the links for specific
sites.
The scripting engine wasn't meant for evaluating the pages' JavaScript
(that would have been quite difficult), but so our customer support
people could write some regexes and simple, site-specific JavaScript to
parse out the more difficult links.
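The regex idea mentioned above might look something like the following sketch: a small, editable list of patterns applied to the raw page source to catch URLs buried in script code. The two patterns and the names `SCRIPT_LINK_PATTERNS` and `extract_script_links` are assumptions made for illustration, not the rules used in the project described.

```python
import re
from urllib.parse import urljoin

# Hypothetical rules of the kind a support person could add per site.
# Each pattern must capture the URL in group 1.
SCRIPT_LINK_PATTERNS = [
    re.compile(r"""window\.open\(\s*['"]([^'"]+)['"]"""),
    re.compile(r"""location\.href\s*=\s*['"]([^'"]+)['"]"""),
]


def extract_script_links(base_url, source, patterns=SCRIPT_LINK_PATTERNS):
    """Return absolute URLs matched by any pattern, without evaluating scripts."""
    links = []
    for pattern in patterns:
        for match in pattern.finditer(source):
            absolute = urljoin(base_url, match.group(1))
            if absolute not in links:
                links.append(absolute)
    return links
```

Because nothing is executed, this only finds URLs that appear literally in the source; links assembled at run time by string concatenation would still be missed.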
Although it was a good general purpose spider, we did extraction from
specific sites. Regardless, I left the company before getting around to
doing that. It would have been pretty fun to do though.
Anyway, feel free to email me if you want to talk about the project in
more detail. I'm not sure if I remember all the details, but I have
some ideas about what I would do differently if I had to do it over
again. But of course, they might not work out either. ;)
--
Brian Moelk
Brain Endeavor LLC
bmoelk (AT) NObrainSPAMendeavorFOR (DOT) MEcom |
|
|
 |
Walter Matte Guest
|
Posted: Tue Oct 17, 2006 5:07 pm Post subject: Re: Wanted: HTTP robot/spider |
|
|
http://www.felix-colibri.com/papers/web/web_spider/web_spider.html
Walter
"John McTaggart" <_john_ (AT) _compnet101_ (DOT) _com> wrote in message
news:4533be47$1 (AT) newsgroups (DOT) borland.com...
| Quote: | Hi,
I'm looking for a multi-threaded robot/spider that allows
for the optional downloading (get/head) of web pages but does
no parsing beyond that needed to get the links/URLs.
Must be robust.
I've already seen HTTPScan, but at $250+ it's a non-starter..
Any ideas are appreciated.
John McTaggart
|
|
|
|
 |
|
|
|
|