A. Campbell McCracken's Article - Information Retrieval

Information retrieval

Campbell McCracken navigates through the search engine maze

Thanks to the widespread availability of access to information brought about by low-cost computing, users have unprecedented amounts of data at their fingertips. Sadly, in most cases, it takes longer than desirable to sift through all this data to find the information needed.

The two main search methods used for information retrieval are full-text search (e.g. using a standard search engine to look for documents containing specific keywords) and hypertext systems (e.g. browsing the Internet). One of the main disadvantages of the former is the number of unrelated documents that a search can unearth. Hypertext systems are not without their problems either: following links on the Internet relies on large doses of serendipity.

Both of these systems are known as ‘pull’ systems: you have to initiate the gathering of the data. An alternative is to have information pushed towards you. One common way to do this is to register your name with a list server, perhaps filling in an application form with a set of the topics that interest you. As new information becomes available, the server sends it to you. The disadvantage of this approach is that the topic categories can be broad. Also, if you want to modify the topic you are interested in, you have to contact the server to get your original selection changed.

Alternative search tools

Several companies have developed products to try to make information retrieval more efficient. Verity’s K2 integrated suite uses word stemming, a thesaurus and linguistic analysis to expand search queries (see the Verity case study below). It can also analyse a document by themes and contextual relationships rather than keywords.
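To make the idea of query expansion concrete, here is a minimal Python sketch that stems each query word and then adds synonyms from a thesaurus. The crude suffix-stripping stemmer, the tiny THESAURUS table and the expand_query function are all hypothetical stand-ins chosen for illustration; nothing here reflects how K2 itself is implemented.

```python
# Toy query expansion: stem each query word, then add thesaurus synonyms.
# The stemmer and thesaurus are illustrative assumptions, not Verity's K2.

SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")

# Hypothetical thesaurus keyed by stem.
THESAURUS = {
    "search": {"retrieval", "lookup"},
    "document": {"file", "record"},
}


def stem(word: str) -> str:
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(query: str) -> set[str]:
    """Return the stems of the query words plus any thesaurus synonyms."""
    terms = set()
    for word in query.split():
        root = stem(word)
        terms.add(root)
        terms |= THESAURUS.get(root, set())
    return terms


print(sorted(expand_query("searching documents")))
# ['document', 'file', 'lookup', 'record', 'retrieval', 'search']
```

A query for ‘searching documents’ would then also match text indexed under ‘retrieval’ or ‘record’, which is the effect the stemming and thesaurus steps are meant to achieve.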

Another feature of the tool set is its scalability, which is of particular relevance in Internet trading. "You have to be careful to make sure that you take your growth figures into account," says Martin Black, ecommerce and retail sales manager at Verity. Similarly, the robustness of your system can affect your site’s availability. K2 can operate under a 24x7 fault-tolerant regime in which individual failing nodes or servers can be isolated from the rest of the system.

K2 is also able to perform searches in parallel across multiple servers. Stanford University’s Highwire Press needed just this technology to help it deal with the quarter of a million academic and scientific users who access more than half a million online documents on Highwire each week.
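Searching in parallel across several servers can be pictured as fanning one query out to a number of index partitions and merging the partial results. The sketch below assumes three small in-memory ‘shards’ in place of real servers and uses Python’s standard thread pool; the shard contents and function names are invented for illustration and are not taken from K2 or Highwire Press.

```python
# Fan a query out to several index partitions in parallel and merge the hits.
# The in-memory "shards" stand in for separate search servers (an assumption).
from concurrent.futures import ThreadPoolExecutor

SHARDS = [
    {1: "gene expression in yeast", 2: "protein folding pathways"},
    {3: "yeast cell cycle control", 4: "clinical trial design"},
    {5: "folding kinetics of small proteins"},
]


def search_shard(shard: dict[int, str], keyword: str) -> list[int]:
    """Return the ids of documents in one shard that contain the keyword."""
    return [doc_id for doc_id, text in shard.items() if keyword in text.split()]


def parallel_search(keyword: str) -> list[int]:
    """Query every shard concurrently and return the merged, sorted ids."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, keyword), SHARDS)
        return sorted(doc_id for hits in partial_results for doc_id in hits)


print(parallel_search("yeast"))  # [1, 3]
```

Because each shard is searched independently, adding capacity is a matter of adding shards, which is one reason this layout suits a growing document collection.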

Semantic networking

Excalibur Technologies’ RetrievalWare tool set uses semantic networking, backed by a powerful dictionary set, to extract the concepts and contexts from documents (see the RetrievalWare box below). This makes RetrievalWare one of the most accurate retrieval tools available. The basic language set can also be augmented by the integration of additional reference works for specialist areas such as legal, medical, financial and engineering.

Instead of keyword searches, Excalibur uses natural language processing. It also uses fuzzy search techniques that are sufficiently advanced to cope with damaged, misspelled or poorly scanned data. RetrievalWare’s adaptive pattern recognition processing removes the need to define keywords or topic trees manually, or to sort and label information, when processing documents, and it avoids the inherent subjectivity of indexes.
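A much simpler relative of fuzzy searching can be sketched with the approximate string matching in Python’s standard difflib module. This is only a stand-in for the general idea of tolerating misspellings; it is not adaptive pattern recognition, and the names in KNOWN_NAMES are invented for the example.

```python
# Approximate name lookup that tolerates misspellings, using difflib.
# A simple stand-in for fuzzy search; the names below are hypothetical.
import difflib

KNOWN_NAMES = ["Catherine Fairweather", "Jonathan Aldridge", "Siobhan Gallagher"]


def fuzzy_lookup(query: str, candidates: list[str], cutoff: float = 0.7):
    """Return the candidate closest to the query, or None if nothing is close."""
    lowered = {name.lower(): name for name in candidates}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(fuzzy_lookup("Katherine Fairwether", KNOWN_NAMES))  # Catherine Fairweather
print(fuzzy_lookup("zzzz", KNOWN_NAMES))                  # None
```

The cutoff controls how forgiving the match is: lowering it catches worse misspellings at the cost of more false matches.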

Granada Media uses the RetrievalWare tool set to help it search its archives to identify suitable clips for inclusion in new programmes. Each archived programme has a synopsis containing a summary of the content, the people included and so on.

"However, the problem is that the spelling on the synopsis may not be accurate," says Granada’s head of IT development, Ian Whitfield. As well as checking spelling, RetrievalWare also uses word association to improve its searches.

Portals and profiling

Portals – applications that provide users with a single gateway to personalised information – are becoming a common aid to information retrieval.

The Institute of Chartered Accountants in England and Wales (ICAEW) set up a members-only portal based on Autonomy’s adaptive pattern recognition technology. This software can analyse a piece of text (independent of language) and identify and rank the main ideas contained in the text.

"We wanted our site to have something different from other sites," says ICAEW’s head of business information Bill Wilson. "And that was the personalisation element."

By profiling users based on the ideas in the documents that they read or write, the portal is able to bring them information that contains similar ideas. As software agents and spiders bring back new documents to the portal server, these too are analysed to see whether they share a context with any of the individual’s documents. If so, they are ranked higher than those with less relevance. Profiling users in this way ensures that each member sees the documents that are of most interest to them.
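One simple way to picture profile-based ranking is to treat the documents a user has read as a bag of words and score each new document by its cosine similarity to that bag. The sketch below does exactly that. It is an assumption-laden simplification: plain word counts stand in for the ‘ideas’ that Autonomy’s software extracts, and all of the sample text is invented.

```python
# Rank incoming documents against a user profile built from past reading.
# Word counts and cosine similarity are simplifying assumptions, not
# Autonomy's actual technique.
import math
from collections import Counter


def vectorise(text: str) -> Counter:
    """Turn text into a word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_profile(read_documents: list[str]) -> Counter:
    """Sum the word counts of everything the user has read."""
    profile = Counter()
    for document in read_documents:
        profile.update(vectorise(document))
    return profile


def rank(profile: Counter, new_documents: list[str]) -> list[str]:
    """Order new documents from most to least similar to the profile."""
    return sorted(
        new_documents, key=lambda doc: cosine(profile, vectorise(doc)), reverse=True
    )


profile = build_profile([
    "audit rules for small company accounts",
    "new audit standards for company reporting",
])
incoming = [
    "football results from the weekend",
    "company audit standards updated",
    "tax changes for small firms",
]
for document in rank(profile, incoming):
    print(document)
```

With this sample data the audit document is listed first and the football report last, since it shares no words with the profile.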

Autonomy was also the choice of business IT analyst Butler Group. The group has a 30,000-article database that covers 11 technical categories and includes in-depth information on vendors and products. Originally, articles could only be viewed by product selection. Now they can be searched by the theme of the query, not simply its words.

Although it currently only uses the portal for accessing articles in the database, Butler Group plans to extend it to use roaming agents to collect documents from other sources and offer profiled information to its members. "Users have to drop the search engine mindset and get rid of their infatuation with keywords," says Autonomy’s PR manager, Simon Fletcher. "We don’t believe in searches, we prefer to push information out."

As the user’s interest in a topic wanes, the software adjusts their profile accordingly. "Analysing the ideas in the documents that a user reads is a better method of determining areas of interest than having the user fill in forms," explained Fletcher, "because users don’t need to do anything extra to keep their profile up to date."
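A profile that follows waning interest can be modelled, very roughly, by letting every topic weight decay each time the user reads something and then reinforcing only the topics just read. The decay factor, the cut-off threshold and the update_profile function below are assumptions made for this sketch; the article does not describe how Autonomy’s software actually adjusts profiles.

```python
# Let topic weights fade unless the user keeps reading about them.
# The decay and threshold values are arbitrary illustrative choices.


def update_profile(
    profile: dict[str, float],
    topics_read: list[str],
    decay: float = 0.5,
    threshold: float = 0.1,
) -> dict[str, float]:
    """Decay every existing weight, drop faded topics, reinforce topics just read."""
    updated = {}
    for topic, weight in profile.items():
        weight *= decay
        if weight >= threshold:
            updated[topic] = weight
    for topic in topics_read:
        updated[topic] = updated.get(topic, 0.0) + 1.0
    return updated


profile = {"audit": 1.0, "tax": 1.0}
for _ in range(4):
    # The user keeps reading about audit and never returns to tax.
    profile = update_profile(profile, ["audit"])

print(profile)  # {'audit': 1.9375}
```

The user never fills in a form: the ‘tax’ topic simply disappears once its weight has halved below the threshold.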

The future

Autonomy recently joined forces with mobile phone manufacturer Ericsson on a project to use its portal technology to make information available to users inside Ericsson, no matter where they are in the world. Whenever information that matches an individual’s profile becomes available, the information is transmitted to that person’s mobile phone using the new WAP (wireless application protocol) standard.

Excalibur’s vision for the future is to perform cross-media searches (see the Swedish National Crime Intelligence Unit case study below). At the moment search engines use dictionaries and thesauri. In the future this could be expanded to use encyclopædias, which contain pictures as well as words. So, for example, if someone is performing a search for information on an Alsatian dog, a text cross-reference would widen the search to include documents containing the term ‘police dog’ and an image cross-reference would widen the search to include documents with pictures of Alsatians.

Whatever the methods used, the most obvious way that information retrieval is likely to develop in the future is in improved contextual awareness. Software will be able to ‘understand’ your information needs better and search engines will have the most relevant documents waiting for you when you need them.

Case study: Excalibur at the Swedish National Crime Intelligence Unit

The Swedish National Crime Intelligence Unit (Child Protection Division) wasn't having much success in identifying suspects in its quest to stop the producers of child pornography material. It needed an automated tool to help it with the indexing of pornographic material and decided to try Visual RetrievalWare from Excalibur.

Visual RetrievalWare uses feature-extraction tools to break down a digitised picture into different 'feature vectors'. Each tool is designed to measure one particular feature such as texture (i.e. repeating patterns), colour content (how much of each colour), structure (where the colours are in the picture), brightness (sources of bright areas), shape, aspect ratio, any text, and so on. Each tool provides a vector indicating how much of its feature was found in the picture. By combining the vectors, Visual RetrievalWare comes up with a single index for the whole image that it then feeds into a fuzzy comparison system modelled on the way biological neural networks process information. The system compares the image index with indexes from previously processed pictures to try to find links.
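The general shape of this approach, several feature extractors whose outputs are combined into one index that is then compared with stored indexes, can be sketched in a few lines of Python. Everything specific below is an assumption: the three toy extractors, the tiny hand-made ‘images’ and the plain Euclidean distance used in place of Excalibur’s neural-network-style fuzzy comparison.

```python
# Combine simple feature vectors into one image index and find the nearest
# stored image. Extractors, images and distance measure are illustrative only.
import math

Image = list[list[tuple[int, int, int]]]  # rows of (r, g, b) pixels, 0-255


def colour_content(image: Image) -> list[float]:
    """Share of the total intensity contributed by red, green and blue."""
    pixels = [pixel for row in image for pixel in row]
    totals = [sum(pixel[channel] for pixel in pixels) for channel in range(3)]
    grand_total = sum(totals) or 1
    return [total / grand_total for total in totals]


def brightness(image: Image) -> list[float]:
    """Mean pixel brightness scaled to the range 0-1."""
    pixels = [pixel for row in image for pixel in row]
    return [sum(sum(pixel) for pixel in pixels) / (3 * 255 * len(pixels))]


def aspect_ratio(image: Image) -> list[float]:
    """Width divided by height."""
    return [len(image[0]) / len(image)]


EXTRACTORS = (colour_content, brightness, aspect_ratio)


def index_image(image: Image) -> list[float]:
    """Concatenate every extractor's vector into a single index."""
    return [value for extract in EXTRACTORS for value in extract(image)]


def closest(sample: Image, library: dict[str, Image]) -> str:
    """Name of the library image whose index is nearest to the sample's."""
    sample_index = index_image(sample)
    return min(
        library, key=lambda name: math.dist(sample_index, index_image(library[name]))
    )


red_wall = [[(200, 30, 30)] * 4] * 2
library = {
    "dark_red_wall": [[(180, 40, 40)] * 4] * 2,
    "blue_carpet": [[(20, 30, 220)] * 2] * 2,
}
print(closest(red_wall, library))  # dark_red_wall
```

A real system would precompute and store the index for each library image rather than recomputing it for every comparison.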

The Intelligence Unit hopes to be able to use the software to analyse the background clues (such as walls, carpets and beds) in the images, cross-refer them to clues from other images and connect enough bits of information to eventually determine where the pictures were taken. By doing so, it hopes to be able to identify suspects more quickly.

Case study: Verity at CDNOW

Internet CD ecommerce operator CDNOW’s web site used a text search engine to let customers find the CD they were looking for. However, 20 per cent of the time the customers were disappointed. "We knew from our database queries we were getting a high number of misses," says CDNOW’s VP of Internet technology, Russ Cherry. "Analysing the reasons showed that the majority of them were caused by customers misspelling the names of their favourite artists."

CDNOW enlisted the services of Verity. Said Cherry: "If customers can’t find it, they can’t buy it." Verity’s Martin Black agreed, adding: "Customers want to be able to find their information quickly. They want the information to be no more than two clicks away."

Verity’s K2 software gives CDNOW a better search capability than it had previously. It works as a series of modules in a pipeline. By arranging the modules in a different order you can optimise the search results. Since using K2, CDNOW’s nil-returns have dropped from 20 per cent to less than 1 per cent. And sales have gone up by 33 per cent.
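Why module order matters in a pipeline can be shown with a small Python sketch in which each module is a function that rewrites the query and passes it on. The three modules and the one-entry CORRECTIONS table are hypothetical and far simpler than anything in K2; they are there only to show the same modules giving different results in a different order.

```python
# A query pipeline built from interchangeable modules. The modules and the
# spelling table are invented for illustration and are not K2 components.
from functools import reduce

CORRECTIONS = {"beatels": "beatles"}  # hypothetical misspelling table


def lowercase(query: str) -> str:
    return query.lower()


def collapse_spaces(query: str) -> str:
    return " ".join(query.split())


def correct_spelling(query: str) -> str:
    """Replace known misspellings (the table holds lower-case entries only)."""
    return " ".join(CORRECTIONS.get(word, word) for word in query.split())


def run_pipeline(query: str, modules) -> str:
    """Feed the query through each module in turn."""
    return reduce(lambda value, module: module(value), modules, query)


good_order = [lowercase, collapse_spaces, correct_spelling]
poor_order = [correct_spelling, lowercase, collapse_spaces]

print(run_pipeline("The  BEATELS", good_order))  # the beatles
print(run_pipeline("The  BEATELS", poor_order))  # the beatels
```

In the second ordering the spelling module sees the query before it has been lower-cased, so the misspelling slips through and the search would miss.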

In depth: RetrievalWare

Excalibur’s information retrieval tool, RetrievalWare, combines adaptive pattern recognition technology with semantic networking and natural language processing technologies.

"We see ourselves as bringing everyone up to the same level of retrieval," says Excalibur’s technical director Phil Lewis. "If a doctor was searching for information on myocardial infarction, he’d know he should also search for ‘heart attacks’. A non-expert looking for information on heart attacks wouldn’t know to look for ‘myocardial infarction’."

This is where RetrievalWare comes in. Users don’t have to define their own word association trees – the software has a 400,000-word dictionary along with 1.6 million word associations and 50,000 language idioms. This ‘context awareness’ makes Excalibur one of the most accurate information retrieval tools available. Another area where retrieval tools can fall down is in the elimination of ‘stop words’ (small words such as ‘the’, ‘of’, ‘a’, and so on). "If you’re searching for references to ‘vitamin A’ you don’t want to be overloaded with all documents that contain ‘vitamin’ on its own because the search engine doesn’t think the ‘A’ is significant. Similarly you don’t want references to documents that contain ‘vitamin’ and the letter ‘a’ somewhere else in the text."
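The ‘vitamin A’ problem can be illustrated with a short Python sketch that removes stop words but first checks whether a word belongs to a phrase that must be kept whole. The STOP_WORDS and PROTECTED_PHRASES tables are small hypothetical examples, and the approach is a generic one rather than a description of RetrievalWare’s internals.

```python
# Remove stop words without destroying phrases such as "vitamin A".
# Both tables are tiny illustrative assumptions.

STOP_WORDS = {"the", "of", "a", "and", "in", "on"}
PROTECTED_PHRASES = {("vitamin", "a")}


def index_terms(text: str) -> list[str]:
    """Return the terms to index, keeping protected two-word phrases intact."""
    words = text.lower().split()
    terms = []
    position = 0
    while position < len(words):
        pair = tuple(words[position:position + 2])
        if pair in PROTECTED_PHRASES:
            terms.append(" ".join(pair))  # keep the phrase as a single term
            position += 2
        elif words[position] in STOP_WORDS:
            position += 1  # discard the stop word
        else:
            terms.append(words[position])
            position += 1
    return terms


print(index_terms("the benefits of vitamin A"))  # ['benefits', 'vitamin a']
print(index_terms("a vitamin in the diet"))      # ['vitamin', 'diet']
```

A search for the term ‘vitamin a’ then matches only the first text, not every document that mentions a vitamin and happens to contain the letter ‘a’.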

Excalibur has also extended its horizons to include image-based information with the introduction of Visual RetrievalWare and Screening Room. Visual RetrievalWare automatically generates an index ‘vector’ from digital images depending on the presence of certain features in the image. By comparing the index vector generated from a sample image with indexes previously generated from other images, it can perform fuzzy searches to determine whether or not any of the other images have features in common with the sample. This has applications in image libraries where it can be used to supplement the potentially erroneous human-generated meta-information that might be stored with each picture.

Screening Room is a video retrieval tool that is used for re-purposing video content for training, marketing communications, indexing newsfeeds, and so on. Its scene-change recognition software not only detects abrupt cuts between one frame and the next but can also detect fades, dissolves and so on. Once the video ‘footage’ has been dissected, each scene can be indexed by processing one of its frames using Visual RetrievalWare. Screening Room can also extract closed-caption text to produce an additional textual reference, or pass speech through a voice recogniser to convert it to text.
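The simplest form of scene-change detection, spotting an abrupt cut, can be sketched by comparing each frame with the one before it and flagging large jumps. The Python below assumes each frame has already been reduced to a short list of brightness values, and the threshold is an arbitrary choice. It handles abrupt cuts only: fades and dissolves change gradually over many frames and would need a more elaborate test than this. None of it is drawn from Screening Room itself.

```python
# Flag abrupt cuts by comparing consecutive frames.
# Frames are simplified to lists of brightness samples (an assumption).


def mean_abs_diff(frame_a: list[int], frame_b: list[int]) -> float:
    """Average absolute difference between corresponding samples."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def find_cuts(frames: list[list[int]], threshold: float = 50.0) -> list[int]:
    """Return the indexes of frames that start a new scene."""
    return [
        index
        for index in range(1, len(frames))
        if mean_abs_diff(frames[index - 1], frames[index]) > threshold
    ]


frames = [
    [10, 12, 11, 10],      # dark scene
    [11, 12, 10, 10],
    [200, 190, 210, 205],  # abrupt cut to a bright scene
    [201, 191, 209, 204],
]
print(find_cuts(frames))  # [2]
```

Each index returned marks the first frame of a new scene, which is the frame a tool could then pass on for image indexing.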

Contact the author

freelance@madscotsman.com

Contacts

Autonomy	www.autonomy.com
Butler Group	www.itexplorer.com
ICAEW	www.icaew.com
Verity	www.verity.com
Excalibur	www.excalib.com
CDNOW	www.cdnow.com
Highwire Press	highwire.stanford.edu