The job is to create a list of all blogs (~1.2 million) contained in the Technorati blog directory.
Each website must be categorised into the correct Technorati categories. For each website, scrape the contact page URL and find an email address where one is available.
The data will have the following fields:
1) Blog homepage URL
2) Technorati top level category, e.g. "entertainment"
3) Technorati 2nd level category, e.g. "celeb"
4) Technorati "Auth" score, e.g. 937
5) Contact us page URL (scraped via search engine or site crawl)
6) Contact email address, if available
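As a rough illustration of the email field, the sketch below pulls addresses out of the HTML of a contact page that has already been downloaded. The function name `extract_emails` and the pattern are assumptions made for this example, not part of the brief, and the pattern is deliberately simple: it will miss obfuscated addresses such as "name [at] example.com".

```python
import re

# Simple illustrative pattern; not a full RFC 5322 address matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html: str) -> list[str]:
    """Return unique, lower-cased addresses in order of first appearance."""
    found: list[str] = []
    for match in EMAIL_RE.findall(html):
        email = match.lower()
        if email not in found:
            found.append(email)
    return found
```

If a page yields several addresses, which one goes into the email field is a decision for whoever takes the job; this sketch only collects the candidates.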
The data must use the following columns:
HOMEPAGE | TOP_LEVEL | 2ND_LEVEL | AUTH | CONTACT | EMAIL
Each row must contain homepage, top level, 2nd level, auth and contact. We accept that not every website has a contact page or visible email address, though with the right search engine queries these fields should be reasonably well populated and error free.
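One way to check the rule above before delivery is sketched below. The helper `missing_fields` and the `REQUIRED` tuple are hypothetical names for this example. The tuple lists the five fields the brief says each row must contain, and a row with a blank contact is reported for review rather than dropped, since some sites have no contact page.

```python
# Fields the brief says every row must contain (EMAIL is optional).
REQUIRED = ("HOMEPAGE", "TOP_LEVEL", "2ND_LEVEL", "AUTH", "CONTACT")


def missing_fields(row: dict, required=REQUIRED) -> list[str]:
    """Return the names of required fields that are absent or blank in row."""
    return [name for name in required if not str(row.get(name) or "").strip()]
```

A row that returns an empty list passes; anything else names the columns that still need attention.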
All data must be lower case, please. URLs must include the http:// prefix.
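The formatting rules could be applied along these lines. This is a minimal sketch, not a required implementation: the function names are made up for the example, the pipe-delimited output is an assumption based on how the column list is written, and leaving an existing https:// prefix untouched is also an assumption. Lower-casing whole URLs follows the brief literally, although paths on some servers are case-sensitive.

```python
import csv

COLUMNS = ["HOMEPAGE", "TOP_LEVEL", "2ND_LEVEL", "AUTH", "CONTACT", "EMAIL"]
URL_COLUMNS = {"HOMEPAGE", "CONTACT"}


def normalise_url(url: str) -> str:
    """Lower-case a URL and add http:// when no scheme is present."""
    url = url.strip().lower()
    if not url or url.startswith(("http://", "https://")):
        return url  # blank stays blank; https:// is kept as-is (assumption)
    return "http://" + url


def normalise_row(row: dict) -> dict:
    """Lower-case every field and fix up the URL columns."""
    out = {}
    for col in COLUMNS:
        value = str(row.get(col, "")).strip().lower()
        out[col] = normalise_url(value) if col in URL_COLUMNS else value
    return out


def write_rows(rows, fh) -> None:
    """Write a header line and the normalised rows, pipe-delimited (assumed)."""
    writer = csv.DictWriter(fh, fieldnames=COLUMNS, delimiter="|", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(normalise_row(row))
```

To use it, open the output file with `newline=""` and pass the file object to `write_rows` together with an iterable of row dictionaries.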