g5media
9th March 2009, 13:19
like google, and indeed.co.uk
If I wanted to come up with a site which aggregated data from several sources and allowed users to vote on search results, based on tags, to refine them, what would the logic for this need to be written in?
I realise this may be a piece of string question.....
A friend has found out about my site and suggested one of his own, which is a good idea but I think too complex to do without some serious funding. The reason being, I think we'd need several sub-systems: ingestion, publish, user home, tracking etc.
Are we looking at the right or the wrong side of £10k for this kind of thing? (Assume all events to be ingested are delivered in CSV format.)
edmondscommerce
9th March 2009, 13:24
Not necessarily.
Can you give us a bit more info - or contact me direct if you want to discuss in depth and don't want to flood the forum.
FireFleur
9th March 2009, 14:20
Big search engines are written with highly tuned code and data structures. For that you want a language that compiles to native code, so C, C++ or Erlang (HiPE).
PHP, for example, is awful with sorting (things may have changed), but what you would normally do there is write a PHP extension in C for that element of the system.
Python and Perl have some fairly good built-in data structures, but again it would be tempting to write an extension.
That sort of answers the question in your title, but your post asks for something different. For that I would say go with whatever you are comfortable in, profile the system and then optimise the bottleneck.
Large systems are normally written in many languages (well, not too many) and go through a chain. If something proves itself to be consistent, then it gets optimised into more performant code, so the first thing to do is work out the flow of what you are doing.
If you think you are going to have load, then perhaps drop Apache for Lighttpd straight away. If the system is evolving and you want to be able to make tweaks fast, then move Python in. If you want to distribute the system quickly, then add an application server early or go the Erlang route. You can also run with optimised kernels; a generic kernel is slower than one tuned to one task.
Lots of planning or lots of trial, mixing it up tends to work best. I tend to write a lot of logic in Python and then profile it with a test load.
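To give an idea of what "write it in Python, then profile it with a test load" can look like, here is a minimal sketch using the standard library's cProfile. The listing format, the tag weights and the function names (score_listing, rank_listings, make_test_load, profile_ranking) are all made up for illustration; they are not from any real system.

```python
import cProfile
import io
import pstats
import random

# Hypothetical tag weights, purely for the test load.
WEIGHTS = {"python": 3, "c": 2, "erlang": 1}


def score_listing(listing, tag_weights):
    """Made-up scoring step: sum the weights of the listing's tags."""
    return sum(tag_weights.get(tag, 0) for tag in listing["tags"])


def rank_listings(listings, tag_weights):
    """Return the listings sorted from highest to lowest score."""
    return sorted(
        listings,
        key=lambda item: score_listing(item, tag_weights),
        reverse=True,
    )


def make_test_load(n, seed=0):
    """Build n fake listings, each with two random tags."""
    rng = random.Random(seed)
    tags = ["php", "python", "perl", "erlang", "c"]
    return [{"id": i, "tags": rng.sample(tags, 2)} for i in range(n)]


def profile_ranking(n=10000):
    """Rank a test load under cProfile and return (ranked, report text)."""
    listings = make_test_load(n)
    profiler = cProfile.Profile()
    profiler.enable()
    ranked = rank_listings(listings, WEIGHTS)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return ranked, out.getvalue()


if __name__ == "__main__":
    ranked, report = profile_ranking()
    print(report)
```

Whatever shows up at the top of the report is the candidate for rewriting or moving into an extension; everything else can stay in the comfortable language.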
You can find some crazy stuff out in assembly though. I am writing some tutorials at the moment, looking at looping. It would appear that accessing memory off the heap is faster than pushing and popping off the stack, or even a rather dubious use of an internal register. We are only talking a 100th of a second over 10K iterations though, and I have Mylo blaring away on Audacious, so not too scientific :)
A good book on the matter is Write Great Code. In there you will find that accessing memory at an even address is faster than at an odd address because of data bus design. That holds true for the 8086, the 80286 and some ARM processors; it comes down to how the memory banks are laid out. The problem persists in 32-bit to a degree and starts to lessen in 64-bit.
Most of the time the optimisation is about algorithm choice, avoiding algorithms that scale badly or recognising when exponential style problems are being looked at. A general solution is to throw more hardware at it first, whilst you quickly go work on tuning the code up.
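As a small illustration of the algorithm-choice point, here is a sketch of the same job done two ways: removing duplicate URLs from aggregated results. The function names and the sample data are made up for the example. The first version scans a list for every item, so the work grows roughly with the square of the input; the second uses a set, so the work grows roughly in line with the input.

```python
import timeit


def dedupe_quadratic(urls):
    """Keep the first occurrence of each URL, checking against a list."""
    seen = []
    for url in urls:
        if url not in seen:  # linear scan of the list on every item
            seen.append(url)
    return seen


def dedupe_linear(urls):
    """Keep the first occurrence of each URL, checking against a set."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:  # hash lookup, roughly constant time
            seen.add(url)
            unique.append(url)
    return unique


if __name__ == "__main__":
    # Made-up test data: 5000 URLs, every one of them duplicated once.
    urls = ["http://example.com/%d" % (i % 5000) for i in range(10000)]
    slow = timeit.timeit(lambda: dedupe_quadratic(urls), number=1)
    fast = timeit.timeit(lambda: dedupe_linear(urls), number=1)
    print("list scan: %.3fs, set lookup: %.3fs" % (slow, fast))
```

Both return the same result; the difference only shows up as the input grows, which is why it tends to get missed until there is real load.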
Oh, I just reread your post and you are looking to buy a system in. Well, if you became popular then you would need to spend cash quickly, so the real trick is to get the revenue stream in early. A search engine where you vote on results would probably get gamed quickly, so you would need to sort out how to respond to that, and using content from other sources is probably not wise if you want to compete. Google and Yahoo both offer developer keys, so you can access their information and do what you like, but you are limited in the amount of access and not guaranteed anything, so you sort of act as their test bed :)
g5media
9th March 2009, 14:30
Thanks for the reply - the voting only applies to your own listings, kinda like on Facebook where you can select "See more stories about xxxxx" etc.
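A rough sketch of that idea, assuming each listing carries a list of tags and each user's votes are kept separately so they only affect that user's own view. The class name UserTagVotes and the listing format are invented for illustration.

```python
from collections import defaultdict


class UserTagVotes:
    """Per-user tag votes; one user's votes never touch another's results."""

    def __init__(self):
        # user -> tag -> running vote total
        self._votes = defaultdict(lambda: defaultdict(int))

    def vote(self, user, tag, delta=1):
        """Record 'more like this' (positive) or 'less' (negative) for a tag."""
        self._votes[user][tag] += delta

    def rerank(self, user, listings):
        """Sort listings by the user's own tag votes, highest first."""
        weights = self._votes.get(user, {})

        def score(listing):
            return sum(weights.get(tag, 0) for tag in listing["tags"])

        # sorted() is stable, so listings with equal scores keep their order.
        return sorted(listings, key=score, reverse=True)


if __name__ == "__main__":
    votes = UserTagVotes()
    votes.vote("alice", "jazz")
    votes.vote("alice", "rock", -1)
    listings = [
        {"id": 1, "tags": ["rock"]},
        {"id": 2, "tags": ["jazz"]},
        {"id": 3, "tags": ["folk"]},
    ]
    print([item["id"] for item in votes.rerank("alice", listings)])
```

Because the votes are stored per user, someone gaming their own votes only changes what they see themselves.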
dave_n
9th March 2009, 14:44
I would look into 'federated search' and 'cross search stemming'.
That should give you a better idea of what's available before you decide on bespoke code.
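For a feel of what federated search means in practice, here is a minimal sketch: send the same query to several sources, then merge the ranked lists into one. The sources here are plain Python callables standing in for real back ends, and the round-robin merge is just one simple choice, not how any particular product does it.

```python
from itertools import zip_longest


def federated_search(query, sources):
    """Query every source and interleave their results, dropping duplicates.

    sources maps a source name to a callable that takes the query and
    returns a ranked list of dicts, each with at least a "url" key.
    """
    ranked_lists = []
    for name, search in sources.items():
        try:
            results = search(query)
        except Exception:
            results = []  # one failing source should not sink the whole search
        # Tag each result with the source it came from.
        ranked_lists.append([dict(result, source=name) for result in results])

    merged = []
    seen = set()
    # Round-robin: every source's first result, then every second, and so on.
    for tier in zip_longest(*ranked_lists):
        for result in tier:
            if result is not None and result["url"] not in seen:
                seen.add(result["url"])
                merged.append(result)
    return merged


if __name__ == "__main__":
    # Stand-in sources with made-up data.
    sources = {
        "jobs_a": lambda q: [{"url": "http://a.example/1"}, {"url": "http://shared.example/x"}],
        "jobs_b": lambda q: [{"url": "http://shared.example/x"}, {"url": "http://b.example/2"}],
    }
    for hit in federated_search("php developer", sources):
        print(hit["source"], hit["url"])
```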