Mar 15, 2009

10 cloud datasets that I’d like to mashup

Cloud computing is being sold as a hosting architecture to provide instantly scalable on-demand computing power, storage and bandwidth.

“The cloud’s resources scale with user demands. Pay only for what you use”, says Rackspace, the latest to join the cloud gang.

One problem for the cloud gang, however, is that hosting has always struggled as a low-margin commodity business.

Rackspace has just hired Robert Scoble to help spread the message, so we should expect this space to soon get hotter than a Sun SPARC with a loose heatsink.

But where exactly can some value be added in cloud computing, to increase the margins and keep Scoble funded so he can continue to filter the signal from the noise on FriendFeed? Okay, that’s slightly selfish but it’s an interesting question.

The interesting answer IMHO is cloud datasets.

Having useful datasets available in the cloud will unlock value from the data by allowing a new generation of mashup. These aren’t mashups that simply use data from remote web services, like plotting Craigslist ads onto a Google Map. This involves the mashup (joining) of datasets in the cloud using the power and speed of a relational database.
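
To make the difference concrete, here is a minimal sketch of a database-side mashup using Python's built-in sqlite3 module. The table names, columns and rows are all made up for illustration: `destinations` stands in for a travel site's own data, and `wikipedia_articles` stands in for a shared dataset hosted alongside it. The point is only that the join happens inside the database rather than through per-row calls to a remote web service.

```python
import sqlite3

# Hypothetical schema: "destinations" is the site's own table, and
# "wikipedia_articles" stands in for a shared dataset in the same database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE destinations (name TEXT PRIMARY KEY, user_rating REAL);
    CREATE TABLE wikipedia_articles (title TEXT PRIMARY KEY, summary TEXT);
    INSERT INTO destinations VALUES ('Hanoi', 4.5), ('Lima', 4.1);
    INSERT INTO wikipedia_articles VALUES
        ('Hanoi', 'Capital of Vietnam'),
        ('Lima', 'Capital of Peru'),
        ('Oslo', 'Capital of Norway');
""")


def destinations_with_summaries(conn):
    """Join the site's own rows to the shared dataset in a single query."""
    return conn.execute("""
        SELECT d.name, d.user_rating, w.summary
        FROM destinations AS d
        JOIN wikipedia_articles AS w ON w.title = d.name
        ORDER BY d.user_rating DESC
    """).fetchall()


rows = destinations_with_summaries(conn)
for row in rows:
    print(row)
```

Only destinations that have a matching article come back, sorted by the site's own rating. The same query shape would apply to any of the datasets in the list that follows.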

This cloud database approach might also provide Twitter and other owners of valuable data with a revenue model that doesn’t depend on advertising.

To help explain, here are 10 cloud datasets that I’d personally like to mashup:

1. Wikipedia. Funnily enough, Amazon Web Services has just announced that it now offers a 66GB dataset of Wikipedia. “The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form.” One example: imagine the opportunities for a start-up social travel site to mashup its content with the wealth of travel information now available on Wikipedia. Massive.

2. Geonames. It bugs me that everyone who wants to use the geonames database needs to duplicate 800MB of data. Move it into the cloud! Example: the travel site can now analyse reams of user-generated content (or Wikipedia content) for up-to-date categorization and geo-coding onto a map. Another example: most websites need a simple (but updated-more-often-than-you-would-think) list of countries on the rego form. Wouldn’t it be good if everyone used the same (geonames) list?

3. MaxMind IP address lookup. Turn an IP address into an always accurate city location. Example: targeted ad serving and traffic analysis.

4. Google PageRank. For any URL, what’s the PageRank measure of quality? If this is relational data (rather than from a remote web service), it can be combined with other measures of quality at database speeds.

5. Real-time stock market data.

6. Real-time sports data.

7. Dodgy credit card numbers.

8. Dodgy email addresses.

9. Twitter. Some of the above might be considered proprietary rather than public data, which brings me to Twitter and a potential revenue model for them and the cloud gang. If you’ve got valuable proprietary data like Twitter has got (some would say that’s all they’ve got), then replicating it into a relational cloud database will unlock more value than could ever be extracted (or sold) via a remote web API.

Example: when visiting an e-commerce site, it would be nice to see only the product reviews submitted by people I am following on Twitter, sequenced by a measure of quality based on how often those people have been retweeted. Of course, the cloud gang already have the billing infrastructure and monitoring in place to work out exactly how much proprietary data you have used, and what to charge you for it.

The advertising pie is not big enough to fund the whole of the interweb, so perhaps paid data consumption is the revenue model for Twitter and others. Businesses are happy to pay hosting providers for commodity services like CPU cycles and disk space, so why not pay Twitter (via a hosting provider) for valuable information? Did I mention yet that Jeff Bezos is an investor in Twitter?

10. This one is further out there: private foreign keys. Imagine the Twitter dataset including the email address of users, joined using that email address to a Facebook or Digg dataset, but not revealing that email address in the result set. That’s number 10 on my list. It would need to work in a similar way to Facebook’s FQL or Yahoo’s YQL or Google’s GQL, exposing enough information to be useful without exposing anything that would violate privacy concerns. I hope to write some more about this and the privacy implications in another post.
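
The private foreign key idea in item 10 can be roughed out with a database view. This is only a sketch under assumptions: the `twitter_users` and `digg_users` tables, their columns and the sample rows are invented for illustration, and SQLite (used here via Python's sqlite3 module) has no permission system of its own. In a real deployment the consumer would have to be granted access to the view only, never to the underlying tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical datasets; both carry an email column used only for joining.
    CREATE TABLE twitter_users (handle TEXT, email TEXT);
    CREATE TABLE digg_users (username TEXT, email TEXT);
    INSERT INTO twitter_users VALUES
        ('@alice', 'alice@example.com'),
        ('@bob', 'bob@example.com');
    INSERT INTO digg_users VALUES
        ('alice_d', 'alice@example.com'),
        ('carol_d', 'carol@example.com');

    -- The view joins on email but never selects it, so the key stays private.
    CREATE VIEW linked_accounts AS
        SELECT t.handle AS twitter_handle, d.username AS digg_username
        FROM twitter_users AS t
        JOIN digg_users AS d ON d.email = t.email;
""")

cursor = conn.execute("SELECT * FROM linked_accounts")
columns = [col[0] for col in cursor.description]  # column names exposed by the view
linked = cursor.fetchall()
print(columns, linked)
```

The result set links the two accounts without ever containing the email address. Whether that is enough to satisfy privacy concerns is a separate question, since the existence of a match is itself information.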

So, who’s in the cloud gang? Google is well placed with AppEngine and plenty of valuable datasets to get started with. Amazon has all the billing machinery in place to sell proprietary data from Twitter and others. Sun now has MySQL, which already supports remote replication and column-level permissions to enforce private foreign keys. And now Rackspace has Robert Scoble. This will be an interesting one.

Jan 7, 2009

What’s the difference between user generated content and user generated rubbish? Comments please…

Some user generated content (UGC) is genuine, honest, credible, reputable, trustworthy, valuable, quality information. But some is rubbish (let’s call that UGR), including deliberately misleading propaganda, biased blog comments, bogus product reviews, spam, veiled advertising, and bad poetry (or is it just my blog that attracts poetry bots?)

Google’s PageRank algorithm does a good job of measuring the quality of a simple web page, based on the number of incoming links to that page, recursively weighted by the quality of those linking pages. However, web2.0 has given us blogs, wikis, forums, media sharing, customer product reviews and ratings, social bookmarking, and more recently aggregation of all of the above. The result is web pages that contain an increasingly complex array of UGC and UGR, making it increasingly difficult for algorithms, site visitors and site owners to filter the signal from the noise, the UGC from the UGR.
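
For readers who haven’t seen the recursive weighting spelled out, here is a toy sketch of the simplified textbook form of PageRank, computed by repeated iteration. It is not Google’s production algorithm; the damping factor of 0.85 is the commonly cited default, and the four-page link graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

Page "c" comes out on top because it has the most incoming links, and "a" beats "b" and "d" because its single incoming link comes from the highly ranked "c". That second effect is the recursive part.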

So I wanted to write a post about some of the emerging technology innovations attempting to solve this problem. Readers are kindly asked to add a comment at the bottom of the post. All comments will be shown, even bad poetry, for purposes of research and experimentation.

Measuring quality is relatively easy for eBay. Its Feedback Ratings provide an excellent indicator of trustworthiness, because online auctions involve measurable user actions such as ‘Was the product description accurate?’ and ‘Did the buyer pay up?’ Such actions speak louder than the mere words of a blog comment or product review.

Amazon now owns a valuable database of customer product reviews to help people through their purchasing decisions. Innovation by Amazon in this area has included the ability to provide feedback on the usefulness of other users’ comments, and a Reviewer Rank algorithm which provides a measure of reviewer quality (interestingly, this algorithm was recently improved to include some PageRank-like recursiveness).

In a past life I had the pleasure of working for Lonely Planet, a travel publisher whose credibility and quality has been built upon the independence of its authors and their unbiased travel reviews. Lonely Planet and its peers have long struggled with the opportunity to harvest UGC from loyal and passionate travelers, because it is just so difficult to measure the independence and quality of contributing users.

TripAdvisor was allowed to emerge as a disruptive force in the market for travel advice, allowing anybody to review any hotel or restaurant. That created a lot of quality content for a while, but ever since hotel owners found out about TripAdvisor and began to review their own hotels, it’s been difficult to tell the UGC and UGR apart. TripAdvisor still desperately needs a reliable measure of user generated quality to restore its credibility.

Perhaps social networking can help TripAdvisor: being able to filter your travel advice to that written only by your friends would eliminate biased reviews (unless you are friends with a bunch of hotel owners, in which case you’re probably going to stay in their hotel anyway). But until the internet settles on a standard for social data portability, not many of us will have enough online friends who have traveled enough and generated enough online travel content for such a social filter to work reliably, even allowing for recursive algorithms.

If it’s just travel advice and inspiration you’re looking for, you could wait for Lonely Planet’s upcoming blog syndication feature, which promises a novel solution to the problem.

But more generally, I think we all need a universal reputation system, one which aggregates lots of measures of quality from lots of different sites. Imagine if you could easily see a summary of my quality metrics from eBay and Amazon and Yahoo Answers and LinkedIn Answers and GetSatisfaction, perhaps even my Bugzilla and Basecamp metrics too; would that be enough for you to trust my travel advice and any other content that I generate?
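
One naive way to picture such an aggregate is a weighted average of per-site scores. The sketch below is purely illustrative: the site keys, the weights and the assumption that every site's metric has already been normalised to a 0 to 1 scale are all made up, and none of these sites is known to expose its data this way. eBay gets the largest weight here only because its ratings are based on measurable actions.

```python
def universal_reputation(scores, weights):
    """Weighted average of per-site scores, each already normalised to 0..1.

    Sites where the user has no score are skipped rather than counted as zero.
    Returns None when there is nothing to aggregate.
    """
    known = [site for site in scores if site in weights]
    total_weight = sum(weights[site] for site in known)
    if total_weight == 0:
        return None
    weighted_sum = sum(scores[site] * weights[site] for site in known)
    return weighted_sum / total_weight


# Hypothetical weights and scores.
weights = {"ebay": 3.0, "amazon": 2.0, "yahoo_answers": 1.0}
my_scores = {"ebay": 0.99, "amazon": 0.80}
score = universal_reputation(my_scores, weights)
print(score)
```

Skipping missing sites avoids punishing someone for not having an account somewhere, but it also means a single good score can stand in for a whole reputation, which is one of the ways a scheme like this could be gamed.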

Site visitors would benefit from increased visibility of users who generate content. Genuine contributors would be encouraged by being able to build a universal reputation for quality UGC, and discouraged from creating UGR by the risk to that reputation. And site owners would benefit from data to filter the UGC from the UGR.

A universal reputation system could also help to eliminate online vote rigging, astro-turfing (all those reviews of iPhone apps posted by the developers themselves), and space-faking (setting up false identities on social networking sites).

Who are the players?

  • SezWho provides a plugin for blog commentary which presents a useful summary of UGC history for each contributor, and allows customizable 5-point rating scales for site owners.
  • Intense Debate has a great interface design. It’s recently been acquired by Automattic, the owners of the WordPress blogging platform, which will provide some valuable distribution, perhaps critical mass. But will the other blogging platforms want to adopt or integrate with a standard controlled by a competitor?
  • Google Friend Connect allows any site to embed a comments or ratings gadget onto any page. The universal view of previous UGC is not there yet; however, this will become powerful when integrated fully with Google’s other stuff: Blogger, SearchWiki, the Social Graph API and YouTube (arguably the site most in need of a UGR filter!)
  • Disqus is getting lots of press for its prompt Facebook Connect integration, which takes the hassle out of commenting. Video comments can be posted, powered by Seesmic. Readers can nudge comments up and down the list by voting on them. Try it out below.

If you have a view on who will win the race to become the universal reputation system, please comment below. Are there any other players that I have missed out? (Yes I know that is exposing me to some comments on the quality of this post!)

Also, here are some further questions to inspire some commentary:

  • Should we settle on a word for what is being measured here? Quality, importance, value, trust, reputation, credibility, honesty, transparency? Or will the winner of the race provide a web2.0 brand name to describe this concept of a universal measure of user generated content?
  • Is it even possible to determine an objective universal score? The success of PageRank would suggest yes. Or is quality in the eye of the beholder? Is one person’s signal another person’s noise?
  • Would a universal metric destroy the democratic level playing field that is UGC / UGR?
  • What are the consequences of such a universal reputation system being gamed?
  • How likely are eBay and Amazon to open up their reputation data? What are the privacy implications?

Thoughts please. Don’t be shy!
