Tree of knowledge - Taxonomies

In this age of information overload the time we spend searching is growing. Campbell McCracken asks: what can we do about it? And are taxonomies a possible route to take?

Information overload

Organisations are facing information overload. The amount of information that employees have to search through to find what they are looking for is continually growing. "It has been estimated that there are 2.5 billion documents directly accessible on the internet," says Alan Gilchrist, Senior Associate Consultant at TFPL Ltd. "And there are a further 520 billion documents publicly accessible in other databases."

The problem is exacerbated by the fact that an estimated 50 per cent of users are not sufficiently skilled in formulating good search engine queries. It's not surprising, therefore, that one solution that is finding favour with a number of large corporations is to move away from simple keyword searches and to use a more organised approach. This is where taxonomies come in.

What is a taxonomy?

A taxonomy can be defined as a classification of elements within a domain. It's a scheme for creating a set of labelled boxes, into which you can consistently categorise things. In simple cases, a taxonomy can be visualised as an inverted tree, with the root at the top. Each node at a given level is connected to a node in the level above it. The nodes at the lowermost levels are called leaves. The path from the root to a leaf is called a branch and is generally unique. The category boxes sit at the leaves. In an ideal world, the taxonomy should cover paper and electronic documents, audio-visual material, people, and all intellectual capital within an enterprise.
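The inverted tree described above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `branches` helper and the category labels are assumptions made for the example, not part of any product mentioned in this article.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One labelled box in the taxonomy (hypothetical structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str) -> "Node":
        """Attach a child node one level below this one and return it."""
        child = Node(label)
        self.children.append(child)
        return child


def branches(node: Node, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every root-to-leaf path (a 'branch') as a tuple of labels."""
    path = prefix + (node.label,)
    if not node.children:          # a node with no children is a leaf
        yield path
    for child in node.children:
        yield from branches(child, path)


# Illustrative labels only, loosely based on the kinds of material listed above.
root = Node("Enterprise knowledge")
documents = root.add("Documents")
documents.add("Paper")
documents.add("Electronic")
root.add("Audio-visual material")
root.add("People")

all_branches = list(branches(root))
for branch in all_branches:
    print(" > ".join(branch))
```

Because every node has exactly one parent, each leaf is reached by exactly one branch, which is what makes consistent categorisation possible.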

One benefit of the taxonomy approach over a straight search is that you can find accurate information much more quickly. This is because at each level in the tree you are presented with a number of choices which help to refine your initial query, guiding you to the category that most accurately matches the one you want. It's all about helping you to get to the right information quickly.

"AT&T disclosed last year in the B2B conference in San Diego that using our taxonomies instead of a basic search saved them $2 million in one year," said Semio's founder and CEO Claude Vogel. "This equates to a saving of roughly half an hour a day for every employee using this kind of technology intensively."

Taxonomies can be used in different ways to make the searching of information more efficient. Wordmap Ltd offers taxonomies as a broker between researchers and the information they are searching for. "A study of one billion search queries on the AltaVista web search engine showed that they contained an average of 2.35 words," says Wordmap's Chief Executive Bill Hutchison. In fact, 70 per cent of the queries submitted used two keywords or fewer. "Obviously using only a small number of words can lead to ambiguities." For example, if you use the word "Lotus" in a query you could be referring to a yoga position, flower, car, or software company.

When Wordmap is confronted by an ambiguous word it presents the user with a list of all the possible meanings of that word. Once the exact meaning has been determined, Wordmap can then generate a 'super' query by using synonyms and related words joined together in a large Boolean expression. This new query can then be sent off to as few or as many search engines as you like. This means that not only is the search query more accurate than if you had only used simple keywords but it also brings together information from a number of disparate sources.
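The two steps just described, listing the possible meanings and then expanding the chosen one into a Boolean expression, can be sketched as follows. This is not Wordmap's implementation: the `SENSES` table, the related terms in it, and the function names are all assumptions made for illustration.

```python
# Hypothetical sense inventory for one ambiguous word. A real system would
# draw this from a large taxonomy and thesaurus rather than a hand-made dict.
SENSES = {
    "lotus": {
        "yoga position": ["lotus position", "padmasana"],
        "flower": ["lotus flower", "Nelumbo"],
        "car": ["Lotus Cars", "Lotus Elise"],
        "software company": ["Lotus Software", "Lotus Notes"],
    }
}


def possible_meanings(word):
    """Return the meanings to offer the user for an ambiguous word."""
    return sorted(SENSES.get(word.lower(), {}))


def super_query(word, meaning):
    """Join the related terms for the chosen meaning into one OR expression."""
    terms = SENSES[word.lower()][meaning]
    # Multi-word terms are quoted so a search engine treats them as phrases.
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


print(possible_meanings("Lotus"))
print(super_query("Lotus", "yoga position"))
```

The resulting string could then be submitted to whichever search engines accept Boolean syntax; the exact syntax differs between engines, so a real broker would need one formatter per engine.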

Specialised thesauri

You can start off a taxonomy using a standard classification scheme, such as the North American Industry Classification System (NAICS) or the Library of Congress subject headings. Specialist organisations such as pharmaceutical or law firms can obtain technical taxonomies specific to their work areas. Other organisations find that they benefit most by setting up a taxonomy built around their existing processes. (See Case Study - "Taxonomies for Business".)

A taxonomy can be augmented by adding a thesaurus. This could be a plain English thesaurus or it could be a more specialised technical one. The thesaurus can be used to supply synonyms and to help determine the next step in navigating the taxonomy, by suggesting a more precise meaning of the node that you have already reached. "They provide you with a possibility of looking at a word and saying 'Can I find a better way of describing that?'" says TFPL's Associate Consultant Barry Mahon.

However, although a large taxonomy can help refine a query, it has its downside too. "The biggest problem that most people seem to have with taxonomies is 'Where am I?'" says Mahon. It is very important for a good website or portal to show you where you are in relation to where you started. "Most people get lost quite quickly. They end up in backwaters and wonder how they got there."
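One common way for a site to answer "Where am I?" is a trail of links leading back to the starting point. The sketch below assumes a small, made-up set of categories and two hypothetical helpers, `breadcrumb` and `next_options`; it is not taken from any of the products discussed here.

```python
# Hypothetical taxonomy fragment: each category maps to its sub-categories.
CHILDREN = {
    "Science": ["Medicine", "Chemistry"],
    "Medicine": ["Pharmacology", "Surgery"],
    "Pharmacology": ["Antibiotics"],
}

# Invert the mapping so every category knows its single parent.
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def breadcrumb(category):
    """Walk up from the current category to the root and format the trail."""
    trail = [category]
    while trail[-1] in PARENT:       # the root has no parent, so the loop stops there
        trail.append(PARENT[trail[-1]])
    return " > ".join(reversed(trail))


def next_options(category):
    """List the choices available one level below the current category."""
    return CHILDREN.get(category, [])


print(breadcrumb("Antibiotics"))
print(next_options("Medicine"))
```

Showing the trail alongside the next options gives the user both their position and their possible moves on one screen.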

Similarly, as the number of options is kept to a minimum to simplify the interface, people often can't decide where to go next. "A good taxonomy helps you to understand where you are, and where you should go now, as opposed to where you can go now," says Mahon. But it's not just a question of disorientation. "It's also about giving people a sense of their surroundings and a sense of the context of the subject they're interested in," says Wordmap's Hutchison. One of the key things in a successful taxonomy is matching its granularity to the application. The granularity is determined by the techniques used in splitting the target domain into separate categories.

The right granularity

Two contrasting techniques are neural networks and the use of standard taxonomies, thesauri, and key concept extraction. "Neural networks try to understand similarities and discrepancies between the training sets you provide," says Semio's Vogel. "Using training sets you can handle 150 to 200 categories. But for any specific application such as pharmaceutical or legal, we are talking about 20,000 categories." So, for example, if you are in publishing, with broad news coverage, and want to differentiate between politics and the weather, neural networks will be successful. If you have more specific news, for example chemical news, or the British Medical Journal, you're going to need a finer granularity.

Case Study - TFPL surveys the big boys

TFPL made a study of several world class organisations to see what they were doing about information overload. Some, such as the UK Ministry of Defence and the Defence Evaluation and Research Agency, did not have a taxonomy as such but were considering compiling one to front end the knowledge stores that they were building. At the other extreme, Glaxo Wellcome (now GSK) took advantage of several mergers to reconcile the different taxonomies within the component organisations to produce a "superthesaurus" (which also carries information) of 53,000 concepts, 200,000 lexical equivalents and 440,000 assertions (concepts joined by relationships).
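The three ingredients named above (concepts, lexical equivalents, and assertions that join concepts by relationships) map naturally onto a small data structure. The sketch below is purely illustrative: the identifiers, the sample entries and the `is_a` relationship name are assumptions, and nothing here reflects how the GSK superthesaurus is actually stored.

```python
# Concepts: an identifier mapped to a preferred label (sample data only).
concepts = {"C1": "aspirin", "C2": "analgesic"}

# Lexical equivalents: other words or phrases that denote the same concept.
lexical_equivalents = {
    "C1": {"acetylsalicylic acid", "ASA"},
    "C2": {"painkiller"},
}

# Assertions: (subject concept, relationship, object concept) triples.
assertions = [("C1", "is_a", "C2")]


def concept_for(term):
    """Resolve any known label or equivalent to its concept identifier."""
    needle = term.lower()
    for concept_id, preferred in concepts.items():
        names = {preferred.lower()}
        names |= {e.lower() for e in lexical_equivalents.get(concept_id, ())}
        if needle in names:
            return concept_id
    return None


def related(concept_id):
    """Return the (relationship, label) pairs asserted about a concept."""
    return [(rel, concepts[obj]) for subj, rel, obj in assertions if subj == concept_id]


print(concept_for("ASA"))
print(related("C1"))
```

Keeping the equivalents separate from the concepts is what lets merged organisations keep their own vocabulary while still pointing at one shared concept.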

TFPL's study showed that while a wide range of software was being used to process information, the building of taxonomies was made "more effective by human information management experts, working directly with information users".

To find out how you can obtain a copy of TFPL's report of its study, visit www.tfpl.com

Case Study - Semio at HighWire Press

Stanford University Library's HighWire Press has been using Semio Tagger to organise articles in the 280+ electronic journals it produces online for the journals' publishers. It found that its search engine, from Verity, helped researchers who knew precisely what they were looking for, but it needed something to help those researchers who didn't know exactly what they were looking for.

Stanford also needed an automated way of categorising these articles. With almost 12 million articles online and several hundred more coming in every day, it didn't want to do this manually. So it installed Semio to help with the classification and taxonomy creation. It also added some categories that were unique to its organisation by describing the rules and using Semio to recategorise new documents.
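The idea of adding a category by describing a rule, then running documents through it, can be illustrated with a deliberately simple keyword matcher. This is not how Semio works; the `RULES` table, its category names and keywords, and the `categorise` function are assumptions chosen only to show the general shape of rule-based categorisation.

```python
import re

# Hypothetical rules: a document joins a category if it contains any keyword.
RULES = {
    "Cardiology": {"heart", "cardiac", "artery"},
    "Oncology": {"tumour", "cancer", "chemotherapy"},
}


def categorise(text):
    """Return every category whose rule matches at least one word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(category for category, keywords in RULES.items() if words & keywords)


print(categorise("Cardiac output after chemotherapy"))
print(categorise("Tomorrow's weather forecast"))
```

Adding a new category is then a matter of adding one entry to the rule table and re-running the function over incoming articles, with no manual sorting of individual documents.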