Website Bots / Scrappers


jklondon
29th December 2005, 21:57
Has anyone got any experience with website scrapers, i.e. tools that automatically scan sites, rip the HTML and, based on pre-defined settings, dump the data into a DB? Are there any apps that do this off the shelf?

crus
29th December 2005, 22:07
Yep, PM me for details as it's a grey area.

DuaneJackson
30th December 2005, 02:01
Yep, I've written a few of these for various reasons.

I don't know of anything off-the-shelf though.

The problem with scraping is that if the site you are scraping gets redesigned, you end up having to spend a lot of time tweaking your program. If you are only doing a quick one-off pass to grab data, though, it's not such a big deal.

And of course, I wouldn't use a scraper on any site that expressly forbids automated devices, etc.
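
One machine-readable place a site can say this is its robots.txt file. As a rough sketch only (the rules, bot name and URLs below are made up for illustration, and a site's terms of use still need reading by a human), Python's standard urllib.robotparser can check a URL against those rules before the scraper touches it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real run would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "mybot" and the example.com URLs are placeholders.
    print(is_allowed(ROBOTS_TXT, "mybot", "http://example.com/listings"))
    print(is_allowed(ROBOTS_TXT, "mybot", "http://example.com/private/data"))
```

This only covers what the site publishes in robots.txt; anything stated in its terms and conditions is outside what the check can see.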

Rob Holmes
30th December 2005, 05:41
We use something like this for our web rescue service: http://www.matrixxhosting.com/web-rescue.shtml

PM me if you think I can help.

Rob

jklondon
30th December 2005, 11:51
Thanks for the comments. The sites I am looking to scrape have obviously agreed to this in advance, but they do not have a feed as yet (which is obviously the best solution), so the scrape would run overnight and dump the data into a DB using our pre-defined structure.
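
To make the idea concrete, here is a minimal sketch of that kind of job in Python, using only the standard library. Everything specific in it is an assumption for illustration: the sample markup, the "item" row class, and the items table with name and price columns. A real job would download the page first and would need error handling for rows that don't match the expected shape.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page markup; a real job would download this from the partner site.
SAMPLE_HTML = """
<table>
  <tr class="item"><td>Widget</td><td>9.99</td></tr>
  <tr class="item"><td>Gadget</td><td>24.50</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collects the cell text of every <tr class="item"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td> of such a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and ("class", "item") in attrs:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(cell.strip() for cell in self._row))
            self._row = None


def scrape_to_db(html, conn):
    """Parse the rows out of html and insert them into the items table."""
    parser = RowParser()
    parser.feed(html)
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO items (name, price) VALUES (?, ?)",
        [(name, float(price)) for name, price in parser.rows],
    )
    conn.commit()
    return len(parser.rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(scrape_to_db(SAMPLE_HTML, connection), "rows stored")
```

All the knowledge of the page layout sits in RowParser, so when a site is redesigned that is the only part that should need changing; the database side stays the same.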

Duane, I totally agree with your point: it's messy and expensive work.

A follow-on question: does anyone have experience of, or comments on, NQL (Network Query Language), which can, in theory, simplify a lot of this type of work?

Enigma121
30th December 2005, 13:54
We have produced similar types of code in the past.

In addition, an alternative approach might be to encourage the other sites to present their data in a standard XML format for syndication and processing by your site. RSS is one popular example.
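
To show how little code the feed route needs compared with scraping, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The feed content, titles and example.com links are invented for illustration; a real feed would be fetched from the other site and may carry more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; a real one would be downloaded from the other site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example listings</title>
    <item><title>Widget</title><link>http://example.com/widget</link></item>
    <item><title>Gadget</title><link>http://example.com/gadget</link></item>
  </channel>
</rss>
"""


def read_items(rss_text):
    """Return a (title, link) pair for every <item> element in the feed."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link in read_items(SAMPLE_RSS):
        print(title, link)
```

Because the element names are part of the feed format rather than the page design, this keeps working when the other site changes its look.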

We haven't dealt with NQL yet, but have done several web services implementations in Java, which is another way to achieve the same result.

PM me if you wish to discuss the possibilities further.