10 cloud datasets that I’d like to mashup

Cloud computing is being sold as a hosting architecture to provide instantly scalable on-demand computing power, storage and bandwidth.

“The cloud’s resources scale with user demands. Pay only for what you use,” says Rackspace, the latest to join the cloud gang.

One problem for the cloud gang, however, is that hosting has always struggled as a low-margin commodity business.

Rackspace has just hired Robert Scoble to help spread the message, so we should expect this space to soon get hotter than a Sun SPARC with a loose heatsink.

But where exactly can some value be added in cloud computing, to increase the margins and keep Scoble funded so he can continue to filter the signal from the noise on FriendFeed? Okay, that’s slightly selfish, but it’s an interesting question.

The interesting answer, IMHO, is cloud datasets.

Having useful datasets available in the cloud will unlock value from the data by allowing a new generation of mashups. These aren’t mashups that simply use data from remote web services, like plotting Craigslist ads onto a Google Map. This means the mashup (joining) of datasets inside the cloud, using the power and speed of a relational database.
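To make that concrete, here’s a minimal sketch of what such a join could look like, assuming both datasets were replicated into the same cloud database (all table and column names are invented for illustration):

    -- Hypothetical: Craigslist ads and geonames places living side by
    -- side in one cloud database, joined in a single query instead of
    -- being stitched together through two remote web APIs.
    SELECT a.title, a.price, g.latitude, g.longitude
    FROM   craigslist_ads  AS a
    JOIN   geonames_places AS g
           ON  g.name = a.city
           AND g.country_code = a.country_code
    WHERE  a.category = 'apartments';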

This cloud database approach might also provide Twitter and other owners of valuable data with a revenue model that doesn’t depend on advertising.

Here are 10 cloud datasets that I’d personally like to mashup, to help explain:

1. Wikipedia. Funnily enough, Amazon Web Services has just announced that it now offers a 66GB dataset of Wikipedia. “The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form.” One example: imagine the opportunities for a start-up social travel site to mashup its content with the wealth of travel information now available on Wikipedia. Massive.

2. Geonames. It bugs me that everyone who wants to use the geonames database needs to duplicate 800MB of data. Move it into the cloud! Example: the travel site can now analyse reams of user-generated content (or Wikipedia content) for up-to-date categorization and geo-coding onto a map. Another example: most websites need a simple (but updated-more-often-than-you-would-think) list of countries on the registration form. Wouldn’t it be good if everyone used the same (geonames) list?
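As a sketch, assuming geonames’ country data were loaded into a shared cloud table (the table and column names here are my guesses, not geonames’ actual schema):

    -- One shared, always-current country list, instead of a stale
    -- copy pasted into every codebase. Table and column names assumed.
    SELECT iso_code, country_name
    FROM   geonames_country_info
    ORDER  BY country_name;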

3. MaxMind IP address lookup. Turn an IP address into a city-level location. Example: targeted ad serving and traffic analysis.
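A sketch of that lookup, loosely modelled on MaxMind’s published CSV layout of integer IP ranges mapped to locations (the table and column names are assumptions; INET_ATON is standard MySQL):

    -- Map a visitor's IP to a city by finding the range it falls in.
    SELECT l.city, l.country_code
    FROM   ip_blocks    AS b
    JOIN   ip_locations AS l ON l.location_id = b.location_id
    WHERE  INET_ATON('203.0.113.9') BETWEEN b.start_ip AND b.end_ip;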

4. Google PageRank. For any URL, what’s the PageRank measure of quality? If this is relational data (rather than from a remote web service), it can be combined with other measures of quality at database speeds.
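For instance, a purely hypothetical pagerank table keyed by URL could be blended with a site’s own quality signals in a single query (all names and the weighting below are illustrative):

    -- Combine two measures of quality at database speed.
    SELECT r.url, r.star_rating, p.pagerank,
           r.star_rating * p.pagerank AS blended_quality
    FROM   product_reviews AS r
    JOIN   pagerank        AS p ON p.url = r.url
    ORDER  BY blended_quality DESC
    LIMIT  20;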

5. Real-time stock market data.

6. Real-time sports data.

7. Dodgy credit card numbers.

8. Dodgy email addresses.

9. Twitter. Some of the above might be considered proprietary rather than public data, which brings me to Twitter and a potential revenue model for them and the cloud gang. If you’ve got valuable proprietary data like Twitter has got (some would say that’s all they’ve got), then replicating it into a relational cloud database will unlock more value than could ever be extracted (or sold) via a remote web API.

Example: when visiting an e-commerce site, it would be nice to see only the product reviews submitted by people I am following on Twitter, ranked by a measure of quality based on how often those people have been retweeted. Of course, the cloud gang already have the billing infrastructure and monitoring in place to work out exactly how much proprietary data you have used, and what to charge you for it.
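As a sketch, assuming Twitter’s follow graph and retweet counts were replicated into the same database as the product reviews (every table and column name here is invented):

    -- Reviews by people I follow, ranked by how often the reviewer
    -- gets retweeted. @me and @product are MySQL user variables
    -- standing in for the visitor and the product being viewed.
    SELECT rv.review_text, u.screen_name, u.retweet_count
    FROM   product_reviews AS rv
    JOIN   twitter_follows AS f ON f.followed_id = rv.author_twitter_id
    JOIN   twitter_users   AS u ON u.user_id     = rv.author_twitter_id
    WHERE  f.follower_id = @me
      AND  rv.product_id = @product
    ORDER  BY u.retweet_count DESC;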

The advertising pie is not big enough to fund the whole of the interweb, so perhaps paid data consumption is the revenue model for Twitter and others. Businesses are happy to pay hosting providers for commodity services like CPU cycles and disk space, so why not pay Twitter (via a hosting provider) for valuable information? Did I mention yet that Jeff Bezos is an investor in Twitter?

10. This one is further out there: private foreign keys. Imagine the Twitter dataset including the email address of users, joined using that email address to a Facebook or Digg dataset, but not revealing that email address in the result set. It would need to work in a similar way to Facebook’s FQL, Yahoo’s YQL or Google’s GQL: exposing enough information to be useful without exposing anything that would violate users’ privacy. I hope to write some more about this and the privacy implications in another post.
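A rough sketch of the idea, again with invented names: the email column participates in the join but never appears in the result set. Enforcement might lean on something like MySQL’s column-level GRANTs, though whether today’s permission semantics can allow joining on a column you aren’t allowed to read is exactly the open question:

    -- Join Twitter and Facebook users on email without revealing it.
    -- The private foreign key (email) never reaches the result set.
    SELECT t.screen_name, f.profile_url
    FROM   twitter_users  AS t
    JOIN   facebook_users AS f ON f.email = t.email;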

So, who’s in the cloud gang? Google is well placed with App Engine and plenty of valuable datasets to get started with. Amazon has all the billing machinery in place to sell proprietary data from Twitter and others. Sun now has MySQL, which already supports remote replication and column-level permissions to enforce private foreign keys. And now Rackspace has Robert Scoble. This will be an interesting one.
