Website Scraping - Where do I start?

The Carfisher · Sep 25, 2023

I need to scrape data from various websites but don't know where to start, and after a little research I still don't know who I should be searching for. If I want a website, I search for a website designer or developer. I was wondering if there are any technical people on here who could advise where to start, or who even specialise in this and can help.

I can provide further information on what exactly I need of course. Look forward to seeing if anyone here could help.

fisicx · Sep 25, 2023

Use a freelancer site; there are loads of people who can do this for you. Prices have increased but £500 will get an excellent service.

Kerwin · Sep 25, 2023

It very much depends on the website. If the website uses frontend rendering of complex JavaScript it can get pretty hard. On the other hand if you have another site that does server side rendering with well written HTML it'll be a lot easier.

Also bear in mind that if the website owner finds out they may well block you and your IP address. I'm not sure what the law is but I'm pretty sure that copying copyrighted material without consent is illegal.

The Carfisher · Sep 25, 2023

fisicx said:
Use a freelancer site; there are loads of people who can do this for you. Prices have increased but £500 will get an excellent service.

What I'm trying to achieve will be a lot more than £500.

The Carfisher · Sep 25, 2023

Kerwin said:
It very much depends on the website. If the website uses frontend rendering of complex JavaScript it can get pretty hard. On the other hand if you have another site that does server side rendering with well written HTML it'll be a lot easier.

Also bear in mind that if the website owner finds out they may well block you and your IP address. I'm not sure what the law is but I'm pretty sure that copying copyrighted material without consent is illegal.

What I need it to achieve won't infringe any copyrighted material. I guess you could compare it to a price comparison site. In the motor trade, if you put your registration and mileage into the website, it pulls through prices from all the car buying companies. It's more complex than this but in essence similar. The companies won't block me as it will put their price in front of my customers.

AlanJ1 · Sep 25, 2023

The Carfisher said:
What I need it to achieve won't infringe any copyrighted material. I guess you could compare it to a price comparison site. In the motor trade, if you put your registration and mileage into the website, it pulls through prices from all the car buying companies. It's more complex than this but in essence similar. The companies won't block me as it will put their price in front of my customers.

I am unsure if it is a scraper you need. Are you looking for real-time lookups against a registration?

The Carfisher · Sep 25, 2023

AlanJ1 said:
I am unsure if it is a scraper you need. Are you looking for real-time lookups against a registration?

Yeah, that's why I was hoping someone on here might specialise in this to guide me as to what I need.

AlanJ1 · Sep 25, 2023

The Carfisher said:
Yeah, that's why I was hoping someone on here might specialise in this to guide me as to what I need.

I mean that's possible but will be mega slow to return results and massive potential risks / ways it can fail.

Could you get the data you need populated into a document and look it up there when a customer searches a registration?

May be other ways to do it as I am not a developer.

BusterBloodvessel · Sep 25, 2023

Still slightly unclear what you're looking to do... enter a vehicle registration to get information back about that - e.g. the vehicle info, or the MOT status, or corresponding spare parts?

Or are you looking (as I suspect) for something to return listings/info - for example somebody wants to buy an Audi A1 and you return them listings from AutoTrader, Gumtree, eBay etc all in one location?

fisicx · Sep 25, 2023

The Carfisher said:
What I'm trying to achieve will be a lot more than £500.

How do you know? Once the script has been written it just keeps running.

Properly describe the project on a freelancer site and see what gets offered.

The good ones will use IP spoofing so you don’t get blocked.

The alternative is to use an API and get the data legitimately.

Maybe describe what you are trying to achieve so we can better advise.

Calvin Crane · Sep 25, 2023

Crudely, if you choose PHP, the cURL libraries can do this. It WILL be a lot more than a script, as you are building a web app and need to store the scraped data, like MoneySuperMarket, Skyscanner etc.
You need the views designing, and yes, you could use WordPress or Joomla, for example, and build a plugin. You will need a lot of admin-based screens for GDPR etc. too. If you are lucky you might find an agency with previous experience, but very fluid interfaces will need building which can alert you when 'patterns' change on a source website. Crawling the web is a challenge, one which Google itself is struggling with as millions of pages come online daily through ChatGPT content farms.

Birmingham · Nov 25, 2023

i may be able to help. i do a lot of scraping. you might first want to check if the data sources (car buying companies) have Feeds or APIs, and if they're not free are you happy to pay to use them. if any of the necessary data sources lack a feed or api, or if you're not keen on paying for such things, then you may be looking at screen scraping, eg with curl as calvin suggested above. and there are legal considerations to this however there are also loopholes and mitigation measures, for example google scrapes every website it finds except those that tell it not to for example via a robots.txt file in the web root directory.
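As a rough illustration of the robots.txt point, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `demo-bot` user agent and the example.com URLs are all made up for the example; a real scraper would download the file from the site it wants to visit. Passing this check says nothing about the legal position, it only shows what the site has asked crawlers to avoid.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
# A real scraper would fetch it from https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/valuation",
                "https://example.com/account/settings"):
        print(url, allowed(ROBOTS_TXT, "demo-bot", url))
```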

premium api feeds and standardised free rss feeds for example tend to follow protocols that give your app plenty of reliability about what data to expect. but when screen scraping, you're at the mercy of the source site changing their design at any moment and causing your app to crash, so a good scraping app will account for this by being very selective which bits of source code to look for, and running checks to make sure everything is in the expected format before attempting to process it further, and having a timeout feature where if the data is offline it continues some other way after a few seconds before the end user gives up waiting or the server falls over. you'll also want to think about how fresh the data needs to be, because repeatedly scraping the same data in realtime makes things very slow, especially on a busy site, so caching is needed, and then it's a tradeoff between how fresh you want the data to be before re-fetching it from the source, vs how fast you want the app to run. many things to consider.
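The freshness versus speed tradeoff described above can be sketched as a small time-based cache. This is only an illustration in Python: `PriceCache`, its parameters and the idea of keying on a registration are assumptions made for the example, and the `fetch` function is expected to enforce its own network timeout. If the source is offline or too slow, the cache falls back to the last known value instead of making the end user wait.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PriceCache:
    """Serve cached prices while they are fresh; re-fetch once they go stale."""

    def __init__(self, fetch: Callable[[str], float], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # hypothetical scraper call; should apply its own timeout
        self._max_age = max_age_seconds  # how stale a price may be before re-fetching
        self._clock = clock              # injectable so the cache can be tested without sleeping
        self._store: Dict[str, Tuple[float, float]] = {}  # key -> (fetched_at, price)

    def get(self, key: str) -> Optional[float]:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]             # fresh enough: no network round trip
        try:
            price = self._fetch(key)
        except Exception:
            # Source offline, too slow or changed: serve the stale value if there is one.
            return cached[1] if cached is not None else None
        self._store[key] = (now, price)
        return price
```

A longer `max_age_seconds` makes the app faster and lighter on the source site; a shorter one keeps prices fresher at the cost of more fetches.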

if you want some further free advice, tell me a bit more about your project? like what's being taken care of already, what you still need help with, what your budget is, any urgent timescales, etc.

gpietersz · Dec 13, 2023

I have done quite a lot of scraping (including large scale for a niche search engine). A lot depends on:

1. What data you want to scrape
2. Whether the sites you are scraping are happy to be scraped, or whether they actively try to stop scraping
3. How structured the data is on the sites you scrape

If the data is available in another way such as a supported feed, it will be a lot more reliable.

There are also some more DIY options that let you visually choose what you want to be scraped.

The Carfisher said:
The companies won't block me as it will put their price in front of my customers.

Do not count on that. I have even had to deal with sites that wanted us to scrape them (for the search engine) but did not know how to selectively turn off the anti-scraping protection.

Even the assumption that they want their prices in front of customers is not necessarily correct: they may feel price comparison forces them to compete with the cheapest alternative, cutting margins. Tends to be the bigger sites that do this.

On another project where we were scraping product data a number of sites did block, so do not assume product/price information will not be blocked. I would check with a lawyer whether there are any copyright or legal problems too. I think just price data is OK, but descriptions and so on are definitely covered by copyright.

Calvin Crane said:
You need the views designing and yes you could use wordpress or Joomla for example and build a plugin.

That sounds like a really bad approach. I would start from a scraping framework, not a web framework. Then use a separate web framework to display the data.

Nick@Daydot · Dec 13, 2023

Do you know if the sites that you want to scrape have APIs that you can use? That's a data feed, typically fed from the same backend systems that the owner's website uses and so not subject to the vagaries of website change. That's what comparison sites will typically use.

If you adopt a scraping solution then you'll potentially need some different code for each site, which will need to be maintained as each site changes something at the front-end, and you'll have to figure out how to be alerted to such a change, and react quickly. Maybe there are solutions for this, I wouldn't know.
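One simple way to notice that a source site has changed is to validate every scraped record before using it and raise an alert when validation starts failing. The Python sketch below is only an illustration: the field names (`company`, `registration`, `price`) and the price pattern are assumptions, not anything a particular site provides.

```python
import re
from typing import Dict, List

# Assumed price shape for this sketch: optional pound sign, digits with
# optional thousands separators, optional pence.
PRICE_PATTERN = re.compile(r"£?[0-9][0-9,]*(\.[0-9]{2})?")

REQUIRED_FIELDS = ("company", "registration", "price")  # hypothetical record layout


def validate_quote(record: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the record looks as expected."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")
    price = record.get("price", "")
    if price and not PRICE_PATTERN.fullmatch(price):
        problems.append(f"unexpected price format: {price!r}")
    return problems
```

If a site that normally validates cleanly suddenly returns problems for every record, its layout has probably changed and the scraper for that site needs attention.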

I'm assuming you are wanting products and prices.

Birmingham · Dec 13, 2023

gpietersz said:
That sounds like a really bad approach. I would start from a scraping framework, not a web framework. Then use a separate web framework to display the data.

Why? There is absolutely no limit to what can be achieved by writing the scraping framework as a WordPress plugin, with cURL fetching data from URLs and preg_match_all() parsing & interpreting the data. Depending how it's built, it can be an extremely reliable and user-friendly setup. I have several live websites doing it passively as we speak, and am having zero problems with it.

gpietersz · Dec 13, 2023

Birmingham said:
Why? There is absolutely no limit to what can be achieved by writing the scraping framework as a WordPress plugin, with cURL fetching data from URLs and preg_match_all() parsing & interpreting the data. Depending how it's built, it can be an extremely reliable and user-friendly setup. I have several live websites doing it passively as we speak, and am having zero problems with it.

It is the wrong tool for the job. Yes, it can work, but it is not efficient, reliable or productive. You can hammer a nail in with a screwdriver.

1. Using a more sophisticated approach such as a framework is a lot easier to both build and maintain than a pure DIY approach.
2. Async is clunky in PHP. I know it has green threads, but they are new and look a bit clunky for this use to me. Event driven async matches scraping a lot better because
3. Using regular expressions to extract data from HTML as you suggest is a lot less robust than using a proper HTML parser and extracting the data from the tree.
4. There are websites curl + preg is not going to work on, for example if you need to run JavaScript to get the content. You are going to need to integrate a headless browser at that point. Much easier if it's a matter of changing a setting than replacing your code.
5. WordPress does not do anything to help scraping. It is a CMS, not a scraping (or even general purpose) platform. You might as well write plain PHP, store the data in a database, and use WordPress to display it. That does not solve the other problems though. The only useful thing writing it as a plugin might do is let you more easily create a settings panel, which is not hugely useful for a scraper because so much is hardwired.

All this is made worse by scrapers needing maintenance.

Unless you are only scraping a small number of pages it will not scale.
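To illustrate point 3, here is a minimal sketch of tree-style extraction using Python's standard library `html.parser`. The class name `price`, the sample markup and the `PriceExtractor` helper are assumptions made for the example. The parser keys on the element's class list, so the result stays the same when attributes are reordered or the text is wrapped in an extra tag, changes that would break a regex tied to the exact markup.

```python
from html.parser import HTMLParser
from typing import List

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PriceExtractor(HTMLParser):
    """Collect the text of every element whose class list contains 'price'."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0               # > 0 while inside a matching element
        self._buffer: List[str] = []
        self.prices: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1          # nested tag inside a matching element
            return
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:          # the matching element has just closed
            self.prices.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_prices(html: str) -> List[str]:
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.prices
```

Sites that build the page with JavaScript would still need a headless browser or a direct call to the underlying data source; this sketch only covers HTML that is already in the response.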

Birmingham · Dec 13, 2023

gpietersz said:
It is the wrong tool for the job. Yes, it can work, but it is not efficient, reliable or productive. You can hammer a nail in with a screwdriver.

1. Using a more sophisticated approach such as a framework is a lot easier to both build and maintain than a pure DIY approach.
2. Async is clunky in PHP. I know it has green threads, but they are new and look a bit clunky for this use to me. Event driven async matches scraping a lot better because
3. Using regular expressions to extract data from HTML as you suggest is a lot less robust than using a proper HTML parser and extracting the data from the tree.
4. There are websites curl + preg is not going to work on, for example if you need to run JavaScript to get the content. You are going to need to integrate a headless browser at that point. Much easier if it's a matter of changing a setting than replacing your code.
5. WordPress does not do anything to help scraping. It is a CMS, not a scraping (or even general purpose) platform. You might as well write plain PHP, store the data in a database, and use WordPress to display it. That does not solve the other problems though. The only useful thing writing it as a plugin might do is let you more easily create a settings panel, which is not hugely useful for a scraper because so much is hardwired.

All this is made worse by scrapers needing maintenance.

Unless you are only scraping a small number of pages it will not scale.

Sorry but I have to disagree with you on practically everything you just said.

In terms of efficiency, if I just write the code for what I want done, rather than having a specialised framework bloating the code and doing things I didn't ask for, it will be more efficient and easier for me to make and manage. WordPress powers a quarter of all the sites on the web today, it's plenty reliable, and I would say any specialised scraping framework you're using is likely to have been developed far less finely.

Using a HTML parser framework can be done in a WordPress plugin but direct to regex is lighter weight processing and more bespoke in terms of selecting exactly what I need which also makes it more reliable because I'm not making so many assumptions, and it's less likely to fail due to parsing errors in case of malformatted source code. I've never met a website where I need to run javascript within my app in order to grab data - I've always been able to parse javascript to get to data sources instead, and PHP has good JSON functions. I highly doubt the original poster's business would require running javascript on their server in order to facilitate their scraping.

WordPress does not need to do anything to help scraping, the cURL does that, the WordPress is just the user-friendly interface for the admin parts and the public facing parts of the app wherever they may be useful - it doesn't cause any problems, like I said it's the most popular website framework in the world so it's hardly unreliable. Within the WordPress plugin there is plain PHP forming the meat of the app, I'm not using WordPress functions unnecessarily, only where they add value. WordPress scales plenty well for all my scraping needs but obviously there could be some industrial scraping applications where it would be a resource gain to omit the WordPress from the setup - this is almost certainly not relevant to the original poster's business though.

NickZ · Dec 16, 2023

Even though there are scrapers fetching entire sites that does not make it more legal.

A scraped site still has a domain hosted by a company and those don't like copied sites on their servers. Archives such as webarchive show ownership quite accurately.

MarketingQuote · Dec 18, 2023

We used an internal developer in the past, but the amount of data crashed our hosting company's servers. That was some years ago; not sure what the legal factors are these days regarding website scraping.

fisicx · Dec 18, 2023

Many sites have controls to detect scraping. If you use a single IP address or VPN then you will soon get identified. The trick is to use IP switching and not try to scrape the site in one pass.
