luckyrobot.com - Gerry Campbell

Semantics, Search and Big Honking Databases

December 5th, 2008

*

In 2003, when I had been heading up AOL Search for a while, we began to bang structured data into search results pages on a query-by-query basis. We used pre-formatted JavaScript templates that were selected based on keywords, and filled from the freshest, most relevant data we could find.

In today’s parlance we had widgetized search. We called them widgets back then, too. The entire system had to be built from scratch, and it was both costly and time-consuming to build.

It was worth it. For the user, this meant that a search for “Turkey Recipe” would pull up – amazingly – a turkey recipe right at the top of the page. A search for “Austin Powers” would take your ZIP code plus Moviefone data and present you with reviews and showtimes, with a single click to purchase tickets at the theaters near you.

Users and press raved.

It really was a big deal. In fact, we had this “search programming” on 20% of all queries, across all known categories – sports, autos, entertainment… This was the “Google and More” plan that allowed AOL to go into a relationship with the future juggernaut with confidence – we were going to use real content and clever editors to build an experience well beyond what the bluelinks could provide. It was great being Google’s biggest partner as they entered the world of paid search.

This search editorial program wasn’t an accident. I, as well as several of my contemporaries, had been working on the opportunity for several years by then. (In fact, you could claim that when Ram Shriram’s Junglee announced “the Internet Is the Database” this all began.)

It grew out of a few different streams of activity that had been going on – at AltaVista the Web Search team had been doing a small version of this, and I was working on getting shopping data into results, completely structured and pre-widgetized. Lycos had been trying things out as well. (AltaVista and AOL people may remember our friend Tim Robinson, who was a major visionary behind this.)

This program is exactly why I went to AOL. AOL had just merged with Time Warner and an entirely new range of content – content from a REAL media company – would be available for search enrichment. What an amazing opportunity. My biggest frustration at AltaVista was the lack of content resources to enrich the experience – to keep people from flowing straight through the system and out to the Web without adding real value.

To make an already too long story shorter, we (AOL) made a boatload of cash on search with Google and the TW merger cratered. I moved on to implement a similar vision within the bounded vertical of Finance and News. That left the world’s most evolved and enriched version of search – AOL’s Fullview – at the mercy of aggressive cost-cutting, and Fullview was determined to be off strategy. No sour grapes at all. Just a missed opportunity.

Since then, former colleague Jason Calacanis has gone on to create Mahalo under a similar premise, Wikipedia has evolved into an amazing resource, and Google is still pumping out bluelinks, plus just a little more.

Anyway, the title promises that this post is about semantics, and it is. This is just necessary background.

The opportunity is still there, whether in search or in the online world at large, to create a virtual fabric of content that can be experienced (browsed and searched) – and even more importantly assembled on the fly – based on its relatedness.

What do I mean by that? Whether in search or in socially-relevant widgets or in feed aggregation, we need to link and connect content based on its MEANING and not keyword similarity. Specifically – When I am looking at a Microsoft earnings report and I see related links to “gates” I want it to be about the person, not the thing. The same applies to java, apple and about six million other things. We live in an ambiguous world.
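
To make the “gates” example concrete, here is a minimal Python sketch of picking a sense from surrounding words. Everything in it is an assumption for illustration: the two sense profiles, their vocabularies, and the function names are made up, and a real system would learn such profiles from data rather than hard-code them.

```python
from collections import Counter

# Hypothetical context profiles: words assumed to co-occur with each sense.
SENSE_PROFILES = {
    "Bill Gates (person)": {"microsoft", "msft", "earnings", "ballmer", "software"},
    "gate (thing)": {"fence", "garden", "hinge", "latch", "door"},
}

def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]

def disambiguate(text, profiles=SENSE_PROFILES):
    """Return (best_sense, scores); best_sense is None when no context word matches."""
    counts = Counter(tokenize(text))
    # Score each sense by how many of its context words appear in the text.
    scores = {sense: sum(counts[w] for w in vocab) for sense, vocab in profiles.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, scores
    return best, scores

if __name__ == "__main__":
    print(disambiguate("Microsoft earnings beat estimates; Gates and Ballmer commented."))
    print(disambiguate("The garden gates need a new latch and hinge."))
```

The point of the sketch is only that the decision comes from the company a word keeps, not from the keyword itself. It accepts being wrong some of the time, and it declines to guess when there is no context at all.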

For this to become reality, two things need to happen:

  1. Content needs to be accessible in a format that is native to its type of data. For example, the fundamental information about Microsoft (P/E, market cap, etc) will be in fields, just like a spreadsheet. MSFT’s price history is going to be formatted into a huge list of Bid, Ask and transaction prices (gross generalization). News about the latest earnings will be in text blobs. You can’t index this data with a traditional crawler, and it can’t all be mushed into a single format without losing the unique value.
  2. There needs to be a consistent way, across formats of data, to call out and associate similar items. MSFT is related to Microsoft is related to Steve Ballmer is related to Bill Gates. With this type of linking, we can then understand the interrelatedness of things. In its simplest form, this is semantics.
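
Those two requirements can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a description of any real system: the alias table, the entity ids, the relations, and all numeric values are made up. Each source keeps its native shape (fields, a price list, a text blob), and a shared set of entity ids is the only thing that connects them.

```python
from collections import defaultdict

# Hypothetical alias table: surface forms mapped to one canonical entity id.
ALIASES = {
    "msft": "company:microsoft",
    "microsoft": "company:microsoft",
    "steve ballmer": "person:steve_ballmer",
    "bill gates": "person:bill_gates",
}

# Hypothetical relations between canonical entities.
RELATED = {
    ("person:steve_ballmer", "company:microsoft"),
    ("person:bill_gates", "company:microsoft"),
}

# Three sources in their native formats. All values are placeholders, not real data.
fundamentals = {"ticker": "MSFT", "pe_ratio": 10.0, "market_cap": 1.0e9}
price_history = {"symbol": "MSFT", "trades": [("day-1", 20.0), ("day-2", 21.0)]}
news_item = {"text": "Steve Ballmer discussed Microsoft earnings with analysts."}

def tag_field(value):
    """Tag a fielded value (for example a ticker) via the alias table."""
    entity = ALIASES.get(value.lower())
    return {entity} if entity else set()

def tag_text(text):
    """Tag free text by naive substring match against known aliases."""
    lowered = text.lower()
    return {entity for alias, entity in ALIASES.items() if alias in lowered}

def build_index(tagged_sources):
    """Map each entity id to the names of the sources that mention it."""
    index = defaultdict(list)
    for name, tags in tagged_sources.items():
        for entity in sorted(tags):
            index[entity].append(name)
    return dict(index)

def neighbors(entity):
    """Return the entities directly related to the given one."""
    return {b if a == entity else a for a, b in RELATED if entity in (a, b)}

index = build_index({
    "fundamentals": tag_field(fundamentals["ticker"]),
    "price_history": tag_field(price_history["symbol"]),
    "news": tag_text(news_item["text"]),
})

if __name__ == "__main__":
    print(index)
    print(neighbors("company:microsoft"))
```

None of the three sources was flattened into a common format. The consistent tags alone are enough to ask which sources mention Microsoft, and the relations say who else is connected to it.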

Now, to the point of this post.

If you look above, there are two things that need to happen for the content/information experience on the Web to be dramatically improved: we need content in a universally accessible repository (or repositories) and we need a technique for connecting it all together. Also, the web is evolving and we now have the challenge of making that all socially aware and realtime.

Let’s take those two chunks separately.

Big Honking Database of Content - In the last 18 months we have seen an amazing set of resources applied against this.

  • Freebase is the first company of note. It promises to be a huge content stash in the sky and is funded to do it. Very very promising.
  • Amazon released public datasets this week. So now if you want economics and scientific data it’s there. And you can make your own data available via AWS if you allow it to be freely accessible. This is a HUGE step.
  • Fluidinfo and Terry Jones. It seems fashionable to say “I know Terry Jones” these days. Here’s why: Terry has quite possibly created the database to handle structured, semistructured, tagged and attributed, social and realtime data. In its native format. That’s why pundits like Tim O’Reilly and Robert Scoble are openly excited about Terry and Fluidinfo. Terry and I have spent many hours together contemplating this. (see, I know Terry Jones too!)

Semantic technologies – There is more work to be done here, but it’s on the way. First of all, to fit into the broad model I have laid out, the semantic tagging technology needs to be at the tool, or platform level. This rules out most of the activity in the Semantic space.

Here’s what I mean: If you want to create, say, a music fan-site application that pulls together artist bios, discography, tour reviews from the Web, user-generated content and the ability to purchase both tickets and CDs, you would assemble the content and then you’d need a semantic tool to tag and generate the connections between bands, releases, tour dates and purchasing.

You can’t do that with a semantic application that only provides related links or delivers search results on only the information in its own index/database. You need a tool you can run on all of the sources to generate consistent metadata. Not that Hakia, Twine, Powerset (now MSFT) and Zemanta aren’t useful, but they’re individual applications on top of a semantic engine. Builders need access to the engine itself in order to build a wide range of products and open up the power of the technology.

Unsurprisingly, I am highly in favor of the OpenCalais approach by Thomson Reuters.

The best part is that v 4.0 of Calais will be releasing the “Linked Data Cloud.” It goes after this in a truly powerful way, providing users not only the ability to get their own tags, but to see how those tags relate conceptually to other things in the OpenCalais data model. Rocket Science.

But this post isn’t about OpenCalais either.

This post is about finding opportunity in the world that is evolving.

If I haven’t lost you yet, and you can agree that content+semantic tagging is useful, you can see that there are some problems and opportunities.

  • Completeness of data – DMOZ (aka The Open Directory) used to be under my domain at AOL. I never could invest in it because the community management model was flawed: communities only want to curate the things they’re interested in (thanks to Andrew Cohen for the analogy). Investment in the infrastructure was only going to feed the weediness and patchiness of the garden. Freebase is showing signs of content spottiness and AWS will too. It’s an issue of primary importance. So I think there is an emerging opportunity to provide curation on top of these open services. Like Red Hat to Linux.
  • Quality of data – As with completeness, if anyone can publish to the datasets, there’s a risk of problematic information. This is where branding, and the associated quality control, comes in. The opportunity is for companies who create content to establish and promote their brand as a sign of quality. Quality wins out over crap time and time again.
  • Universality of tags – Zemanta is admirably trying to get a tagging standard adopted across semantic engines. Whether by agreement (standard) or market leadership (default) the emerging content world will benefit from consistent tags to operate on. More things will be “connectable.”
  • Applications Applications Applications! – This is where I get excited. Really excited. After 15+ years of helping people find what they’re looking for using technology, I scan the horizon and see the building blocks to finally get it done. We (the tech community at large) now have raw content feeds, open and free databases, functionality APIs, open source platforms and development methodologies that free up our minds to think about how users really want their content. We are right on the edge of being able to build what we can imagine – quickly and cheaply. We’ve got the tools to measure it and the social context to present it in with personal relevance.

Jason Calacanis recently posted about the responsibility we all have to push forward through this downturn with 120% effort… The part I found specifically valuable was where he calls on entrepreneurs and those with the resources to get out there and start something. I can agree with that.

So, the tools are there and hopefully I’ve given you at least one way to think about it… I am definitely making my bets on where this is going and will probably join in on the app-building side soon.

And yes, this ties in with the Splintering of Media. I’ll get to that soon…

(* Photo “Sound and Vision” copyright Rogiro from Flickr)

Tags: Uncategorized · search

Viewing 12 Comments

    • Oh Gerry, I just flashed through the above before heading off to see my kids. Now I'm going to have to read it at length!

    I'll comment properly tonight. You make me smile & I hope vice-versa.

    • Hi again Gerry

    I think you probably know most of my thoughts on all this, but I'll summarize quickly.

    I guess I hate semantics. In fact I don't think there's any such thing as meaning, or understanding. Those are just words. We poor humans take comfort from imagining that they correspond to some underlying "thing" (if I were Husserl or Heidegger, I'd use another word than "thing"), but they do not. As you might put it, we live in an ambiguous world. Deeply ambiguous. I also don't think you can ever answer any question that starts with the word "Why", but that's another (closely related) subject.

    Ahem.

    What I do believe is that if you want to try to build applications that give the illusion of semantics :-) then you should build them on the most flexible architecture possible. Because as your application becomes increasingly heavily used, or as you increasingly realize that you didn't really know what you were doing in the first place, you're really just starting to plumb the depths of ambiguity and if your underlying architecture runs out of flexibility, then you have a problem.

    I also think the base architecture has to be dead simple. Even if it's dead simple it's going to be very hard to build it properly and have it scale. Freebase and SimpleDB didn't come into existence overnight. SimpleDB is certainly not the answer to any of what you're imagining. Freebase is much more interesting. Fluidinfo's FluidDB has some pretty striking departures from Freebase (which I'll save for now). It's safe to say that we're going after different regions of the same space. And it's a very big space. One difference is that Freebase are really into big honking datasets, whereas that's not my initial interest at all.

    I would also add Google's BigTable to the mix, as well as Neo4j (http://neo4j.org/) - let me know if you want an intro. Again, there are differences in emphasis, again within a big honkin' space of possibilities and value (except in the case of my company, which is apparently unfundable :-)).

    Thanks for taking all the time to write this up. Your experience is bigger than N for almost all values of N. I hope Fluidinfo can move quickly enough on the tech side that we'll find a way to do something together before you're off into some other irresistible project.

    Terry

    • I love a vigorous debate, but actually there isn't one here. I pretty much agree with you.

    We choose to state it differently (I think).

    The word "semantics" has been used to cover so many things that it may have lost some precision in its definition. I may be guilty of using a broadened version here.

    I think of it this way: Google solved one of the vexing problems of search: Out of millions of results, which one comes first in the rankings? They created PageRank to approximate what *most humans* would find to be the best result. That was a hard problem and they stepped up to OWN the solution, even creating the "I'm Feeling Lucky" to emphasize that they had a solution to the problem.

    But it's still wrong a portion of the time for a large percentage of users... So they leave the other 999,999,999 results just in case.

    That simple approximation, and the willingness to accept some error while committing to improvement has revolutionized search.

    The exact thing applies to understanding meaning. If we can accept error, and use words like "semantics" or otherwise to describe what we're trying to do, we can make progress.

    If you want to coin a new term for this I'll gladly use it and give you attribution. ;-)

    What we seem to agree on is that there's no room for purity and absolutism here...

    • Hi again Gerry

    I wasn't being very nuanced in my original comments. That's partly due to lack of time, partly due to liking a more colorful debate. So here are a few more thoughts, and some pointers.

    Consider Artificial Intelligence and its pursuit of intelligence. We once thought it took real intelligence to play chess (for example). But as we got better and better at engineering, and we thought up smarter (but completely mechanical and non-mysterious) algorithms, we moved the goalposts. I.e., we decided that actually you didn't need to be "intelligent" to play chess after all.

    I don't believe that "intelligence" corresponds to any "thing" either, just like I think "meaning" and "understanding" are also just words. What I do believe however is in engineering and tool-building. We're primates, and primates are pretty good tool builders. So I often suggest to people that they spend less time (and investment monies :-)) on chasing abstract words and more time on building tools.

    The lesson of AI seems clear. If your tools are good enough, you can give the illusion of intelligence up to and beyond (i.e., beyond grandmaster) where it matters in any practical sense. The computer plays chess so well that you might as well say it's intelligent, or not - it just doesn't matter anymore.

    And I believe the same is true of semantics, and going after meaning and understanding. Those things can perfectly well not really exist while at the same time we can practically achieve them (i.e., the convenient and practical illusion, as with intelligence for the purposes of chess playing) by just focusing on engineering and tools.

    Make sense?

    From that POV, I argue that huge strides can be made by improving representation. If you get representation right, things that look like problems can simply go away. If you get the representation right, you may not even need a clever algorithm. Can you do an end-run around Google's armies of PhDs by changing representation? I.e., don't challenge them on the algorithm front, where you're bound to lose, but change the ground under them. You won't be surprised to hear that I think the answer is yes. I'm not talking about "beating" Google as a company, but of taking search - and how we work with information in general - to a new level.

    I wrote about this at some length, back before it was so fashionable to be me :-)

    The main posting is http://www.fluidinfo.com/terry/2007/03/19/why-d...

    And there are several others, including some that give very simple examples of why representation is so important, at http://www.fluidinfo.com/terry/category/represe...

    In summary, I don't think the words matter much. I think we can achieve amazing results (things that look like real intelligence, real understanding, that somehow capture meaning, etc) simply by focusing on engineering. My best bet about where to focus is on representation. What are the implications of the various new ways of representing information that we're exploring? I've been pondering that for over a decade! :-) My own bet, via Fluidinfo, definitely has some strong advantages and some strong weaknesses. It's a tradeoff, like so many things in computer science. Other approaches represent different tradeoffs. It's far from clear what will "win". But as I said in my earlier comment, it's a vast space we're starting to explore, and, I like to imagine, there's plenty of value to go around.

    I hope that's a clearer and a more useful answer.

    • Great post Gerry and I'm dealing w/some of these issues today in the video space. Before I forget, you should list DBpedia as one of the publicly accessible databases. It is an RDF-normalized version of Wikipedia, which now enables programmatic use of Wikipedia information.

    Combining your comments w/some of what Terry said below about our ambiguous world, I'm reminded of a magazine story a long time ago describing Microsoft's poor behavior which was titled "The Gates of Hell". Now if you consider the use of "gates" here, it's a double entendre. It's both a play on the doorway and on the person. Not very clean semantics ;)

    Great points you're making and now the business questions have to be satisfied as well as the financial incentives for the participants (content creators or curators) to do all of this work. When I think of all the work that sites have done in the past 5 yrs for SEO purposes, it's been all about findability. Making themselves more clearly indexed by the search engines and in turn more findable by people using search engines.

    Yesterday I spoke w/a stealth startup that is facilitating companies' ability to more easily make their data accessible to apps but w/business rules and metering services layered into their platform. I love that because companies have data that they s/b making more easily accessible for various uses, but also need to have an ability to monetize that and control to whom and how it is made available. Enabling easy access to it, but still having a spigot to open, filter and close its access to apps seems like the balance needed to open things up faster.

    In a world where data access is opened up because the right business rules can be put into place, and findability continues to be important to companies for the content, products and services they offer, the justification for properly marking up their content does exist. The challenge, which as you're pointing out is slowly being remedied, is a "chicken and the egg" one. Until the apps exist that make use of this marked up content in rich and useful ways, publishers and merchants do not want to go through the trouble of doing all of this work. On the other side of the house, app developers are reluctant to do a lot of work since they don't have easy access to rich content sources for cool apps. An interesting company that is creating some good justifications for doing the mark-up work is Dapper in the semantic advertising space.

    It is getting easier, and I know that in my company's case, we're increasingly finding open access data sources to incorporate into our processes and apps, but there's still a long way to go. One of the best data sources we make use of for company and product information, does not make their stuff accessible in an RDF or even XML formatted way. Hence, we have to convert their data and upload it into our databases to then make it useful. It would have been nicer if they played a more open game, but they're small and have a hard time justifying the work to do this.

    I'm a lil' all over the place in this comment, but in essence, I agree w/Terry's comments which echo aspects of yours. Making a light weight technology to make content easily accessible at an organizational level, not so much in a big honking database, is really the way to go. It reminds me of the late '99/'00 time frame when RSS was still very early in its use. The company I was working with had chosen to develop syndication tools for the ICE (Information and Content Exchange) syndication protocol which Vignette was supporting. It was more secure and reliable than RSS, and for the pro content providers of the time (ie. Reuters), this was very important to them as they warmed up to distributing their content to online publishers. ICE was a bulky technology however. RSS by contrast was light weight and was easy for almost anyone to use. While it didn't offer much in the way of security, it turns out that this didn't matter. As well, its growth was secured the same way that eBay's and YouTube's was, by individuals w/a need (ie. bloggers and their readers). After garnering so much attention, the professional content providers realized that they needed to make their content available via RSS if they were to remain relevant (as has happened w/pro merchants on eBay and pro video content providers on YouTube). I believe the same could happen w/a light weight technology that helps content providers make their content available quickly and easily in these marked up ways.

    However, where publishers and merchants won't do the work at all, then a big honking database (a la Freebase) could take the work out of their hands and enable someone else to benefit fm the value of aggregating and structuring all of this information for meaningful uses. The big search engines have an obvious advantage here since they could theoretically start doing work to make their aggregated info available in interesting formats. Microsoft's acquisition of Powerset may have aspects of that to come, but maybe not.

    Anyway, I'll concur w/your thesis that immense biz opportunities do exist to those who figure out how to pull all of this together.

    • Gerry's right about 'semantics' being an overly used expression.

    Terry's argument regarding the objective definition of meaning refers to the term's traditional philosophical usage, whereas Gerry and direwolf are talking about contextual ambiguity.

    I believe pursuing semantics (philosophical) in computing is a futile endeavor until machines are able to feel the wind on their faces.

    Resolving contextual ambiguity, however, is a much more attainable and in many ways more useful goal. How many times have you Googled something only to be returned hundreds of pages with the 'other' use of your key word?

    Whether progress is made via changes in representation, better algorithms or even some sort of stochastic analysis is largely irrelevant (to me).

    The key point is that whoever makes progress in this space will, as the VCs like to say, take away a lot of pain.

    • Does the nature of the task (and this discussion) change if we talk about it as codifying *relationships*? That's really where I am going.

    I am not sure it makes any difference at all WHAT the thing is, it's more about the interrelatedness of one word to other words. In that case, the ambiguity is represented in a set of linkages that are more or less exclusive.

    For example - the linkages to gates the thing vs gates the person would be different. Even in the case of that double entendre, the two sets could be statistically separable.

    • and can't we use co-occurrence, etc to establish that relatedness...

    • Gerry, I would say the goal is to *infer* context rather than codify it.

    For example, in a document that is tagged as about MSFT, references to Gates are statistically more likely to refer to the person rather than the object. So, instead of tagging (codifying) each individual reference to Gates in the document, context can be inferred from one single tag, and hence the ambiguity resolved.

    • With the 2010 Census on the horizon, I'm thinking about applying for a job with the Census just to see if I can help make sense of it all. Wouldn't it be great if the Census data actually provided us with data that all Americans could actually benefit from? Your discussion of tagging data is extremely important in this regard.

    As it relates to tagging words, isn't this just XML? And don't we also need to tag whole phrases and not just words?

    • How about the burgeoning cloud of RDF based Linked Data?

    Links:
    1. http://virtuoso.openlinksw.com/images/dbpedia-l...
    2. http://esw.w3.org/topic/SweoIG/TaskForces/Commu...
    3. http://dbpedia.org/resource/Linked_Data - cross linked with Freebase and many other structured data spaces

    • I read your post and was amazed at how close our work is to what you describe here as semantic technology. We have developed a technology for semantic search and text analysis which leverages Wikipedia knowledge to derive concept meaning and relationships. By now, Wikipedia has grown into a massive, up-to-date database of such relationships. We would like to show our technology to you, as it implements nearly everything that you described in your post: disambiguation, semantic tagging, semantic similarity to find related content/concepts, and more. Could you please email me at maxim@grinev.net and I will reply with more details. Thank you.
 
