A project to spider a small number of news sites and keep an indexed database of their text content. Each page will be assigned a rank dependent on several factors. The database will be queried by keyword through a simple web front-end.
This project comprises three parts:
* The back-end server will continuously spider a number of news sources. New pages will be indexed for full-text search, and custom relevancy data will be computed and stored.
* The web front-end is a very simple website which will allow end users to view search results across a number of pages. An options page will allow users to customise their search parameters.
* There will also be an administration section for site admins to view users, news sources and indexing stats.
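As a rough illustration of the back-end indexing and keyword query described above, here is a minimal Python sketch. It is not Lucene and it is not the ranking method from the attached spec: the names `TinyIndex`, `add_page` and `search`, the single `rank` multiplier, and the "all keywords must match" rule are assumptions made only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """In-memory inverted index: term -> {page_url: term_count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.rank = {}  # page_url -> custom rank (hypothetical single factor)

    def add_page(self, url, text, rank=1.0):
        """Index a page's text and store its precomputed rank."""
        self.rank[url] = rank
        for term in tokenize(text):
            self.postings[term][url] = self.postings[term].get(url, 0) + 1

    def search(self, query):
        """Return URLs containing every query keyword, best score first."""
        terms = tokenize(query)
        if not terms:
            return []
        # Assumption: a page must contain all keywords (AND semantics).
        matches = set(self.postings.get(terms[0], {}))
        for term in terms[1:]:
            matches &= set(self.postings.get(term, {}))

        # Score = summed keyword counts weighted by the stored page rank.
        def score(url):
            return self.rank[url] * sum(self.postings[t][url] for t in terms)

        # Ties are broken alphabetically by URL to keep results stable.
        return sorted(matches, key=lambda u: (-score(u), u))


index = TinyIndex()
index.add_page("a", "Election results: election night coverage", rank=1.0)
index.add_page("b", "Election preview", rank=3.0)
index.add_page("c", "Sports results", rank=2.0)
print(index.search("election"))          # ['b', 'a']
print(index.search("election results"))  # ['a']
```

In the real system, Lucene (or Solr) would replace the hand-rolled index, and the rank would be computed from whatever factors the functional spec defines.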
The project should be coded in Java, Python or Perl and hosted on either Google App Engine or Amazon EC2. Full-text indexing and searching should be carried out using Lucene (with Solr if needed).
Full functional spec and web interface wireframes are attached.
All prospective candidates should have experience in deploying Lucene.
Hello!
I am an experienced developer with Google App Engine and Python, and I am interested in completing this task. As far as I can see, we can't integrate Lucene with Python on Google App Engine.
I have found the gae-search library, but it has many restrictions. We can discuss the search mechanism and find an easy solution for this task.
Please contact me so we can continue discussing this interesting task.
P.S. Your TechSpec is very cool!
Hi, I'm a developer in Sydney and very interested in this project! Please check the PM for my CV and more details on how I'd approach the project. Thanks, Will.