In 2003, when I had been heading up AOL Search for a while, we began to bang structured data into search results pages on a query-by-query basis. We used pre-formatted JavaScript templates that were selected based on keywords and filled from the freshest, most relevant data we could find.
In today’s parlance we had widgetized search. We called them widgets back then, too. The entire system had to be built from scratch, and it was both costly and time consuming to build.
It was worth it. For the user, this meant that a search for “Turkey Recipe” would pull up – amazingly – a turkey recipe right at the top of the page. A search for “Austin Powers” would take your zipcode plus Moviefone data and present you with reviews and showtimes with a single click to purchasing tickets at the theaters near you.
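To make the mechanics concrete, here is a minimal sketch of keyword-triggered widget selection. It is written in Python for brevity rather than the JavaScript we used, and every name in it (`WIDGETS`, `TRIGGERS`, `pick_widget`, `render`) and every data value is made up for illustration. It is not the AOL code, just the general shape of the idea.

```python
from string import Template

# Hypothetical widget registry: widget name mapped to a display template.
WIDGETS = {
    "recipe": Template("Recipe: $title ($minutes min)"),
    "showtimes": Template("$movie near $zipcode: $times"),
}

# Hypothetical trigger table: phrases in the query that select a widget.
TRIGGERS = {
    "recipe": "recipe",
    "austin powers": "showtimes",
}


def pick_widget(query):
    """Return the name of the first widget whose trigger phrase appears in the query."""
    normalized = query.lower()
    for phrase, widget_name in TRIGGERS.items():
        if phrase in normalized:
            return widget_name
    return None


def render(query, data):
    """Fill the selected template with data, or return None to fall back to plain links."""
    widget_name = pick_widget(query)
    if widget_name is None:
        return None
    return WIDGETS[widget_name].substitute(data)


if __name__ == "__main__":
    # Illustrative data only; a real system would fetch this from a content feed.
    print(render("Turkey Recipe", {"title": "Roast Turkey", "minutes": 180}))
```

A query with no matching trigger returns `None`, which stands in for "just show the bluelinks."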
It really was a big deal. In fact, we had this “search programming” on 20% of all queries, across all known categories – sports, autos, entertainment… This was the “Google and More” plan that allowed AOL to go into a relationship with the future juggernaut with confidence – we were going to use real content and clever editors to build an experience well beyond what the bluelinks could provide. It was great being Google’s biggest partner as they entered the world of paid search.
This search editorial program wasn’t an accident. Several of my contemporaries and I had been working on the opportunity for several years by then. (In fact, you could claim that this all began when Ram Shriram’s Junglee announced “the Internet Is the Database.”)
It grew out of a few different streams of activity that had been going on. At AltaVista, the Web Search team had been doing a small version of this, and I was working on getting shopping data into results completely structured and pre-widgetized. Lycos had been trying things out as well. (AltaVista and AOL people may remember our friend Tim Robinson, who was a major visionary behind this.)
This program is exactly why I went to AOL. AOL had just merged with TimeWarner and an entirely new range of content – content from a REAL media company – would be available for search enrichment. What an amazing opportunity. My biggest frustration at AltaVista was the lack of content resources to enrich the experience – to keep people from flowing straight through the system and out to the Web without adding real value.
To make an already too long story shorter, we (AOL) made a boatload of cash on search with Google, and the TW merger cratered. I moved on to implement a similar vision within the bounded vertical of Finance and News. That left the world’s most evolved and enriched version of search – AOL’s Fullview – at the mercy of aggressive cost-cutting, and Fullview was determined to be off strategy. No sour grapes at all. Just a missed opportunity.
Since then, former colleague Jason Calacanis has gone on to create Mahalo under a similar premise, Wikipedia has evolved into an amazing resource, and Google is still pumping out bluelinks, plus just a little more.
Anyway, the title promises that this post is about semantics, and it is. This is just necessary background.
The opportunity is still there, whether in search or in the online world at large, to create a virtual fabric of content that can be experienced (browsed and searched) – and, even more importantly, assembled on the fly – based on its relatedness.
What do I mean by that? Whether in search or in socially-relevant widgets or in feed aggregation, we need to link and connect content based on its MEANING and not keyword similarity. Specifically – When I am looking at a Microsoft earnings report and I see related links to “gates” I want it to be about the person, not the thing. The same applies to java, apple and about six million other things. We live in an ambiguous world.
For this to become reality, two things need to happen:
- Content needs to be accessible in a format that is native to its type of data. For example, the fundamental information about Microsoft (P/E, market cap, etc) will be in fields, just like a spreadsheet. MSFT’s price history is going to be formatted into a huge list of Bid, Ask and transaction prices (gross generalization). News about the latest earnings will be in text blobs. You can’t index this data with a traditional crawler, and it can’t all be mushed into a single format without losing the unique value.
- There needs to be a consistent way, across formats of data, to call out and associate similar items. MSFT is related to Microsoft is related to Steve Ballmer is related to Bill Gates. With this type of linking, we can then understand the interrelatedness of things. In its simplest form, this is semantics.
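The two requirements above can be sketched together in a few lines of Python. This is only an illustration under my own assumptions: the alias table, the entity IDs like `company:microsoft`, the relationship map and the sample items are all invented, and the numeric fields are left empty on purpose. The point is that each item keeps its native shape (fields, a list, a text blob) while sharing one set of entity tags.

```python
# Hypothetical alias table: surface forms mapped to one canonical entity ID.
ALIASES = {
    "msft": "company:microsoft",
    "microsoft": "company:microsoft",
    "bill gates": "person:bill_gates",
    "steve ballmer": "person:steve_ballmer",
}

# Hypothetical relationships between canonical entities.
RELATED = {
    "company:microsoft": {"person:bill_gates", "person:steve_ballmer"},
}

# Each item stays in its native format; only the "entities" tags are shared.
ITEMS = [
    {"kind": "fundamentals", "entities": {"company:microsoft"},
     "data": {"ticker": "MSFT", "pe": None, "market_cap": None}},  # values left empty
    {"kind": "price_history", "entities": {"company:microsoft"},
     "data": []},  # a list of (bid, ask, trade) rows would go here
    {"kind": "news", "entities": {"person:bill_gates"},
     "data": "Illustrative text about Bill Gates and the latest earnings."},
    {"kind": "news", "entities": {"thing:gate"},
     "data": "Illustrative text about repairing garden gates."},
]


def resolve(name):
    """Map a surface form such as 'MSFT' to its canonical entity ID, if known."""
    return ALIASES.get(name.strip().lower())


def related_items(name):
    """Return items tagged with the entity or with anything directly related to it."""
    entity = resolve(name)
    if entity is None:
        return []
    wanted = {entity} | RELATED.get(entity, set())
    return [item for item in ITEMS if item["entities"] & wanted]


if __name__ == "__main__":
    for item in related_items("MSFT"):
        print(item["kind"], sorted(item["entities"]))
```

Because the lookup runs on entity IDs rather than on the string "gates", the garden-gate story never shows up next to Microsoft, which is the whole point of linking on meaning.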
Now, to the point of this post.
If you look above, there are two things that need to happen for the content/information experience on the Web to be dramatically improved: we need content in a universally accessible repository (or repositories) and we need a technique for connecting it all together. Also, the web is evolving and we now have the challenge of making that all socially aware and realtime.
Let’s take those two chunks separately.
Big Honking Database of Content - In the last 18 months we have seen an amazing set of resources applied against this.
- Freebase is the first company of note. It promises to be a huge content stash in the sky and is funded to do it. Very very promising.
- Amazon released public datasets this week. So now if you want economics and scientific data it’s there. And you can make your own data available via AWS if you allow it to be freely accessible. This is a HUGE step.
- Fluidinfo and Terry Jones. It seems fashionable to say “I know Terry Jones” these days. Here’s why: Terry has quite possibly created the database to handle structured, semistructured, tagged and attributed, social and realtime data. In its native format. That’s why pundits like Tim O’Reilly and Robert Scoble are openly excited about Terry and Fluidinfo. Terry and I have spent many hours together contemplating this. (see, I know Terry Jones too!)
Semantic technologies – There is more work to be done here, but it’s on the way. First of all, to fit into the broad model I have laid out, the semantic tagging technology needs to be at the tool, or platform level. This rules out most of the activity in the Semantic space.
Here’s what I mean: If you want to create, say, a music fan-site application that pulls together artist bios, discography, tour reviews from the Web, user generated content and the ability to purchase both tickets and CDs, you would assemble the content and then you’d need a semantic tool to tag and generate the connections between bands, releases, tour dates and purchasing.
You can’t do that with a semantic application that only provides related links or delivers search results on only the information in its own index/database. You need a tool you can run on all of the sources to generate consistent metadata. Not that Hakia, Twine, Powerset (now MSFT) and Zemanta aren’t useful, but they’re individual applications on top of a semantic engine. Builders need access to the engine itself in order to build a wide range of products and open up the power of the technology.
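As a rough sketch of what "a tool you can run on all of the sources" means, here is a toy dictionary-based tagger in Python. It is a stand-in of my own, not the API of OpenCalais or any other engine, and the band, album and source texts are invented. What matters is that the same `tag` function is applied to every source, so the metadata that comes out shares one vocabulary.

```python
import re

# Hypothetical gazetteer: phrase mapped to (entity type, canonical ID).
GAZETTEER = {
    "the example band": ("Band", "band:example_band"),
    "first album": ("Release", "release:first_album"),
}


def tag(text):
    """Return sorted (type, ID) tags for every gazetteer phrase found in the text."""
    lowered = text.lower()
    found = set()
    for phrase, (entity_type, entity_id) in GAZETTEER.items():
        # Whole-phrase match so partial words do not trigger a tag.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add((entity_type, entity_id))
    return sorted(found)


def tag_sources(sources):
    """Apply the same tagger to every source so all metadata is consistent."""
    return {name: tag(text) for name, text in sources.items()}


def build_index(sources):
    """Invert the tags: for each entity ID, list the sources that mention it."""
    index = {}
    for name, tags in tag_sources(sources).items():
        for _, entity_id in tags:
            index.setdefault(entity_id, []).append(name)
    return index


if __name__ == "__main__":
    # Invented sample content standing in for bios, reviews and a store feed.
    sources = {
        "bio": "The Example Band formed in a garage.",
        "review": "First Album by The Example Band is loud.",
        "store": "Buy First Album on CD.",
    }
    print(build_index(sources))
```

With the index in hand, the fan site can hang the bio, the review and the purchase link off the same band or release, regardless of where each piece of content came from.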
Unsurprisingly, I am highly in favor of the OpenCalais approach by ThomsonReuters.
The best part is that v 4.0 of Calais will be releasing the “Linked Data Cloud.” It goes after this in a truly powerful way, providing users not only the ability to get their own tags, but to see how those tags relate conceptually to other things in the OpenCalais data model. Rocket Science.
But this post isn’t about OpenCalais either.
This post is about finding opportunity in the world that is evolving.
If I haven’t lost you yet, and you can agree that content+semantic tagging is useful, you can see that there are some problems and opportunities.
- Completeness of data – DMOZ (aka The Open Directory) used to be under my domain at AOL. I never could invest in it because the community management model was flawed: communities only want to curate the things they’re interested in (thanks to Andrew Cohen for the analogy). Investment in the infrastructure was only going to feed the weediness and patchiness of the garden. Freebase is showing signs of content spottiness and AWS will too. It’s an issue of primary importance. So I think there is an emerging opportunity to provide curation on top of these open services. Like Red Hat to Linux.
- Quality of data – As with completeness, if anyone can publish to the datasets, there’s a risk of problematic information. This is where branding, and the associated quality control, comes in. The opportunity is for companies who create content to establish and promote their brand as a sign of quality. Quality wins out over crap time and time again.
- Universality of tags – Zemanta is admirably trying to get a tagging standard adopted across semantic engines. Whether by agreement (standard) or market leadership (default) the emerging content world will benefit from consistent tags to operate on. More things will be “connectable.”
- Applications Applications Applications! – This is where I get excited. Really excited. After 15+ years of helping people find what they’re looking for using technology, I scan the horizon and see the building blocks to finally get it done. We (the tech community at large) now have raw content feeds, open and free databases, functionality APIs, open source platforms and development methodologies that free up our minds to think about how users really want their content. We are right on the edge of being able to build what we can imagine – quickly and cheaply. We’ve got the tools to measure it and the social context to present it in with personal relevance.
Jason Calacanis recently posted about the responsibility we all have to push forward through this downturn with 120% effort… The part I found specifically valuable was where he calls on entrepreneurs and those with the resources to get out there and start something. I can agree with that.
So, the tools are there and hopefully I’ve given you at least one way to think about it… I am definitely making my bets on where this is going and will probably join in on the app-building side soon.
And yes, this ties in with the Splintering of Media. I’ll get to that soon…
(* Photo “Sound and Vision” copyright Rogiro from Flickr)


Illustration and design by Kurt Aspland
Trackbacks
December 5, 2008 at 8:31 pm
[...] Semantics, Search and Big Honking Databases | luckyrobot.com [...]
March 6, 2009 at 1:30 am
[...] Lucky Robot - Semantics, Search & Big Honking Databases December 5, 2008 http://luckyrobot.com/2008/12/05/semantics-search-and-big-honking-databases/ [...]
April 13, 2009 at 4:18 pm
[...] a weekend reading everything I could find on the net ((@terrycojones, his blog, notable articles by Gerry Campbell, Paul ...