I'm working on a project that creates fingerprints for images and videos. The code for this is available at [login to view URL] and is under constant development. What I'm looking for now is to use this blockhash algorithm to create a database of hashes for images and videos available from Wikipedia (especially Wikimedia Commons), and to keep this database updated as works are added to, removed from, or updated on Wikipedia. This project covers the Python scripts that can run in the background to keep the local database in sync with Wikipedia.
We have previously built a similar program, available here: [login to view URL], but it has not been updated and had some bugs. It also had no real support for removing or updating works as they changed on Wikipedia. The contractor who continues this work may choose to build on the existing code base or start anew (starting from scratch might be preferable).
The program should consist of two parts, a server and a client, which interact with each other in a way the contractor can define. Previously we have used RabbitMQ, or coordination through the PostgreSQL database, with almost equal success.
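If the PostgreSQL route is chosen, the queue can simply be a table that clients claim rows from; `FOR UPDATE SKIP LOCKED` lets several clients poll concurrently without blocking each other. A minimal sketch, with table and column names purely illustrative:

```python
# Sketch of a PostgreSQL-backed work queue (all names are illustrative).
# The server inserts rows; clients claim one pending row at a time.

QUEUE_DDL = """
CREATE TABLE IF NOT EXISTS work_queue (
    id          BIGSERIAL PRIMARY KEY,
    page_title  TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending/processing/done/error
    attempts    INTEGER NOT NULL DEFAULT 0,
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

CLAIM_SQL = """
UPDATE work_queue
SET status = 'processing', attempts = attempts + 1, updated_at = now()
WHERE id = (
    SELECT id FROM work_queue
    WHERE status = 'pending'
    ORDER BY id
    FOR UPDATE SKIP LOCKED   -- concurrent clients skip rows already claimed
    LIMIT 1
)
RETURNING id, page_title;
"""
```

A client would run `CLAIM_SQL` inside a transaction (e.g. via psycopg2) and afterwards set the row's status to 'done', or back to 'pending' if processing failed.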
The server is responsible for:
- Interacting with the Wikimedia API, finding works that have been newly added, removed, or updated
- Adding any such works to a "queue" for further processing
- Monitoring the clients' work (for instance, if a work cannot be processed after clients have retried it 3-4 times with increasing delays in between, marking the work as "error" so that it is not processed again)
The client is responsible for:
- Retrieving works from the queue to process
- Getting information from the Wikimedia API about:
-- The title
-- The copyright statement (Creative Commons or similar)
-- The author (name)
-- The available media files (image or video files)
-- For each media file:
--- The URL of the media file
--- The blockhash of the media file (calculated with the blockhash tool mentioned above)
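Most of the metadata listed above can come from a single MediaWiki API request (`prop=imageinfo` with `iiprop=url|extmetadata`; the `extmetadata` block carries author and license fields, though the exact field names returned by Commons should be verified). A sketch of building that request, plus a hedged guess at invoking the external blockhash tool — its actual output format must be checked against the build being used:

```python
import subprocess
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def imageinfo_url(title: str) -> str:
    """Build a MediaWiki API URL requesting the direct file URL and
    extended metadata (author, license, ...) for one file page."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url|extmetadata",  # extmetadata holds author/license fields
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)

def blockhash_of(path: str) -> str:
    """Run the external blockhash command on a downloaded media file.
    ASSUMPTION: the tool prints the hash as the first whitespace-separated
    token; adjust the parsing to the real output."""
    out = subprocess.run(["blockhash", path],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[0]
```

A client would download each media file URL returned by the query, pass the local path to `blockhash_of`, and store the result.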
Only basic information about each work should be retrieved. See for instance [login to view URL] for examples from a previous project of what information we stored about each work.
The server and client should both be constructed so that other sources of information, such as Flickr, can easily be added later (the logic would be the same, but the exact API calls etc. would change).
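One way to keep sources pluggable is a small interface that the Wikimedia implementation satisfies now and a Flickr implementation could satisfy later. This is only a sketch of the shape, not a prescribed design:

```python
# Sketch of a pluggable source interface (names are illustrative).
from abc import ABC, abstractmethod
from typing import Iterator

class Source(ABC):
    """What any media source must provide. A WikimediaSource would wrap
    the MediaWiki API; a FlickrSource would use the same shape with
    different API calls."""

    @abstractmethod
    def changed_works(self, since: str) -> Iterator[str]:
        """Yield identifiers of works added, removed, or updated since
        a checkpoint (timestamp or continuation token)."""

    @abstractmethod
    def work_metadata(self, work_id: str) -> dict:
        """Return title, author, license, and media file URLs."""

class DummySource(Source):
    """Tiny stand-in showing how an implementation plugs in."""
    def changed_works(self, since):
        yield "File:Example.jpg"
    def work_metadata(self, work_id):
        return {"title": work_id, "author": "unknown",
                "license": "CC-BY-SA", "files": []}
```

The server and client would be written against `Source`, so adding Flickr means adding one new class rather than touching the queue or database logic.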
The retrieved information should be stored in a PostgreSQL database.
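A minimal PostgreSQL layout for that information might look as follows; the column names are illustrative guesses, not the previous project's actual schema:

```python
# Illustrative PostgreSQL schema: one row per work, one row per media file.
WORKS_DDL = """
CREATE TABLE works (
    id       BIGSERIAL PRIMARY KEY,
    source   TEXT NOT NULL,         -- 'wikimedia' now, 'flickr' etc. later
    title    TEXT NOT NULL,
    author   TEXT,
    license  TEXT                   -- e.g. a Creative Commons identifier
);

CREATE TABLE media_files (
    id        BIGSERIAL PRIMARY KEY,
    work_id   BIGINT NOT NULL REFERENCES works(id) ON DELETE CASCADE,
    url       TEXT NOT NULL,
    blockhash TEXT NOT NULL         -- hash string from the blockhash tool
);
"""
```

`ON DELETE CASCADE` means that when the server removes a work that disappeared from Wikipedia, its media-file rows (and hashes) go with it.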