Search 2.0 in the Enterprise: Moving Beyond "Single Shot" Relevancy
By Mark Bennett, Chief Technology Officer, New Idea Engineering, Inc.
Originally published in Enterprise Search Newsletter - Spring 2006
Exponential Growth Swamps Results Lists
We've all heard the hype about the exponential growth of the Internet, but there are a few related developments, not so widely covered, that Enterprise Search people need to keep in mind. First, as the overall number of documents grows, so does the average number of documents returned by each search. Think about it. If a broad search matches 10% of the documents, then it would match 100 documents out of 1,000 total. If the same search is run later, when there are 2,000 total documents, it would return on the order of 200 documents in the results. This may not be the case for every search; the overall makeup of the documents may change over time, or specific terms may come and go, but on average results list sizes will tend to grow exponentially, generally mirroring overall data growth.
Second, enterprise data is also growing exponentially. Of course the growth curve may not be as steep, and will vary from company to company or agency to agency, but exponential growth at any reasonable rate adds up surprisingly fast. You'll recall that in the 1990s, the public Internet crossed certain boundaries. The headlines reported when the web surpassed 10 million pages, 100 million pages, 1 billion pages, and more. We are now at the point where many private networks are as large as the entire public Internet was in the late 1990s. In fact, there are now private datasets that have crossed, or will soon cross, the 100 million and 1 billion document marks. This means that the Internet search engine problems of the 1990s are now hitting corporations and other institutions; this is a "big" problem.
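To make the "adds up surprisingly fast" point concrete, here is a minimal sketch of compound growth. The starting size and the growth rates are purely hypothetical illustrations (the article gives no growth figures), and the function name `years_to_reach` is invented for this example.

```python
def years_to_reach(start_docs: int, target_docs: int, annual_growth: float) -> int:
    """Whole years of compound growth needed for a corpus to reach target_docs.

    annual_growth is a fraction: 0.5 means the corpus grows 50% per year.
    All inputs are hypothetical; the article does not state growth rates.
    """
    if start_docs <= 0 or annual_growth <= 0:
        raise ValueError("start_docs and annual_growth must be positive")
    years = 0
    docs = float(start_docs)
    while docs < target_docs:
        docs *= 1.0 + annual_growth  # one more year of compounding
        years += 1
    return years


if __name__ == "__main__":
    # Assumed scenario: a 1 million document corpus heading for 100 million.
    for rate in (0.25, 0.5, 1.0):
        years = years_to_reach(1_000_000, 100_000_000, rate)
        print(f"{rate:.0%} annual growth: about {years} years to reach 100 million")
```

Under these assumed rates, even a modest growth curve carries a mid-sized corpus across the 100 million document mark within a planning horizon that most organizations would recognize.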
The Problem with "Single Shot Relevancy" - The Problem Even Google Can't Fix
"Single Shot Relevancy" is the idea that when a human types in a search, the search engine returns the "best," most relevant documents on the first page of results. One query goes in, and one amazing results list comes back. Early vendors' relevancy was based solely on the content of the documents themselves; later vendors added external techniques such as Google's link ranking. Now, when vendors talk about relevancy, this is usually what they are referring to, but their engines will all eventually fail because of exponential data growth. It's not that there aren't any good relevancy algorithms around – there certainly are. You can even employ "query cooking" to improve them further.
But imagine two fictitious search engines, "Average Engine" and "Good Engine." We give each engine 1,000 documents to index, then run a reasonable search. The search matches 5% of the content, returning 50 documents. Both engines are probably going to put a few decent documents on the first page. Now imagine increasing the docset from 1,000 to 10,000 – this would raise the result set for the previous search to 500 documents – and at that point "Average Engine" has a good chance of not putting reasonable documents on the first page. Meanwhile, "Good Engine," through better ranking algorithms, still manages to get decent results on the first page.
But let's keep going with this. Now think of a similar dataset that is two orders of magnitude (100 times) larger, and we're now indexing 1 million documents. The results set has mushroomed to 50,000 hits. At this point "Good Engine" is going to have some problems populating that first page with relevant results – or at least what the user would think of as relevant. We are unhappy with the results, so now for this 1 million document dataset we consider a couple of new vendors, "Cool Engine" and "Great Engine." Let's assume that both of these engines do a better job than "Good Engine," and even with 50,000 hits, they can both usually put the correct documents at the top.
One more time – this is important – consider another dataset, 100 times larger again; that puts the raw data at 100 million documents, and a results list on the order of 5 million documents. At this lofty number "Great Engine" and "Cool Engine" are going to be straining, and even slight problems with ranking will be greatly amplified. Even if one of the vendors could miraculously improve its engine's relevancy, those gains would likely be wiped out by an even larger result set.
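The arithmetic behind this walk-through can be sketched in a few lines. The 5% match rate and the four corpus sizes come from the example above; the page size of 10 results and all function names are assumptions made only for illustration.

```python
MATCH_RATE_PERCENT = 5   # the example search matches 5% of the content
RESULTS_PER_PAGE = 10    # assumed page size; the article does not specify one


def result_set_size(total_docs: int, match_rate_percent: int = MATCH_RATE_PERCENT) -> int:
    """Number of hits when a query matches a fixed percentage of the corpus."""
    return total_docs * match_rate_percent // 100


def first_page_share(hits: int, per_page: int = RESULTS_PER_PAGE) -> float:
    """Fraction of the result set that fits on the first page of results."""
    if hits == 0:
        return 0.0
    return min(per_page, hits) / hits


if __name__ == "__main__":
    # Corpus sizes used in the walk-through: 1,000 up to 100 million documents.
    for total in (1_000, 10_000, 1_000_000, 100_000_000):
        hits = result_set_size(total)
        share = first_page_share(hits)
        print(f"{total:>11,} docs -> {hits:>9,} hits; first page shows {share:.4%} of them")
```

The first page stays the same size while the result set grows with the corpus, so the share of hits a user actually sees shrinks by the same factor that the data grows. This is why a ranking improvement only buys time.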
Don't Be Confused: We're Not Talking About Simple "Performance"
A note to the reader: discussions of this type usually wander down one of two paths. Both are distractions, and neither is pertinent to the discussion of "Search 2.0."
Distraction 1: "OK, so you're talking about search engine performance – how long it takes to index and search documents?"
No. While that is certainly an important factor, the big players in the industry can scale up, using multiple servers to handle vast quantities of data. If you’ve got 100 million documents and a big enough budget, you will be able to get them indexed and searched.
Distraction 2: "Ah, so you're talking about Relevancy – getting the right documents to show up at the top of the results list?"
No, not exactly. Better relevancy, or better "document ranking algorithms," will certainly buy you some time, but as the number of matching documents keeps increasing, even the best algorithms eventually lose their effectiveness. This type of relevancy, which we call "Single Shot Relevancy," will fail when the dataset gets large enough; it's only a matter of time.
The Mistaken "HAL9000" Assumptions about Users and Questions
The assumption that a human, given enough time, could identify the most relevant document out of 5 million hits is suspect. We do, however, expect a computer to be able to do it. Many early computer scientists were inspired by the computer "HAL9000" in the science fiction epic "2001: A Space Odyssey" from the late 1960s. In the movie, this highly evolved intelligent machine, created in the 1990s, could converse, reason, and question like a human being. Many people still hope and assume that computer technology will ultimately get there; a system that could truly understand what a human is asking, then research the question and come back with a concise answer would be wonderful to have and would satisfy the "HAL9000" benchmark. Sadly though, 10 years after this fictional technology was supposed to be available, it doesn't exist. Still, many users subconsciously assume that their search engine knows the difference between a baseball diamond and the diamond on a wedding ring, or the difference between President Bush and a "bush" that you plant in your front yard. We are not there yet; our search engines are not "HAL9000" capable.
In the movie, the human operators asked well thought-out questions, and those questions do have "correct" answers. By contrast, a number of articles cite the average query as only 1 to 2 words in length; humans don't usually ask well thought-out and complete questions. Even with a longer, more specific query, the "correctness" of answers is still debatable. Imagine the discussion that would ensue when grading the results of the lengthier query "effectiveness of tax cuts to stimulate the economy." As you can imagine, opinions would vary widely over whether the answers from the Republican National Committee or MoveOn.org would be more relevant.