Campbell McCracken
navigates through the search engine maze
Thanks to the widespread availability of access
to information brought about by low-cost
computing, users have unprecedented amounts of
data at their fingertips. Sadly, in most cases, it
takes longer than desirable to sift through all
this data to find the information needed.
The two main search methods that are used for
information retrieval are the full-text search
(e.g. using a standard search engine to look for
documents containing specific keywords) and
hypertext systems (e.g. browsing the Internet).
One of the main disadvantages of the former is the
number of unrelated documents that the search can
unearth. Hypertext systems are not without their
problems either: following links on the
Internet relies on large doses of
serendipity.
Both of these systems are known as ‘pull’
systems: you have to initiate the gathering of the
data. An alternative is to have information pushed
towards you. One common way to do this is to
register your name with a list server, perhaps
filling in an application form with a set of the
topics that interest you. As new information
becomes available, the server sends it to you. The
disadvantage of this approach is that the topic
categories can be broad. Also, if you want to
modify the topic you are interested in, you have
to contact the server to get your original
selection changed.
Alternative search tools
Several companies have developed products to
try to make information retrieval more efficient.
Verity’s K2 integrated suite uses word stemming, a
thesaurus and linguistic analysis to expand search
queries (see lower box on page 29). It can also
analyse a document by themes and contextual
relationships rather than keywords.
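As a rough illustration of query expansion by stemming and thesaurus lookup, here is a minimal Python sketch. It is not Verity's implementation: the suffix list, the tiny thesaurus and the names stem and expand_query are all assumptions made for the example, and a real product would use a proper stemmer and a far larger vocabulary.

```python
# Hypothetical thesaurus, keyed on word stems (illustrative only).
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "buy": {"purchase"},
}

# Crude suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(query):
    """Return the stems of the query words plus any thesaurus synonyms."""
    terms = set()
    for word in query.lower().split():
        root = stem(word)
        terms.add(root)
        terms |= THESAURUS.get(root, set())
    return terms


print(expand_query("buying cars"))
```

A search for "buying cars" would then also match documents that mention "purchase" or "automobile", which is the effect the expansion is meant to achieve.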
Another feature of the tool set is its
scalability, which is of particular relevance in
Internet trading. "You have to be careful to make
sure that you take your growth figures into
account," says Martin Black, ecommerce and retail
sales manager at Verity. Similarly, the robustness
of your system can affect your site’s
availability. K2 can operate under a 24x7
fault-tolerant regime where individual failing
nodes or servers can be isolated from the rest of
the system.
K2 is also able to perform searches in parallel
across multiple servers. Stanford University’s
Highwire Press needed just this technology to help
it deal with the quarter of a million academic and
scientific users who access more than half a
million online documents on Highwire each
week.
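The idea of fanning one query out to several servers at once can be sketched as follows. This is only an illustration under stated assumptions: the "servers" are in-memory dictionaries invented for the example, the names SHARDS, search_shard and parallel_search are hypothetical, and nothing here reflects how K2 itself distributes work.

```python
from concurrent.futures import ThreadPoolExecutor

# Each hypothetical "server" holds its own slice of the document collection.
SHARDS = [
    {"doc1": "protein folding in yeast", "doc2": "galaxy formation"},
    {"doc3": "yeast genome sequencing"},
    {"doc4": "market trends"},
]


def search_shard(shard, term):
    """Return the ids of documents in one shard that contain the term."""
    return [doc_id for doc_id, text in shard.items() if term in text.split()]


def parallel_search(term, shards=SHARDS):
    """Query every shard concurrently and merge the hits into one sorted list."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda shard: search_shard(shard, term), shards))
    return sorted(doc_id for hits in per_shard for doc_id in hits)


print(parallel_search("yeast"))
```

Because each shard is searched independently, adding servers lets the collection grow without every query having to scan everything in sequence.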
Semantic networking
Excalibur Technologies’ RetrievalWare tool set
uses semantic networking, backed by a powerful
dictionary set, to extract the concepts and
contexts from documents (see box on page 30). This
makes RetrievalWare one of the most accurate
retrieval tools available. The basic language set
can also be augmented by the integration of
additional reference works for specialist areas
such as legal, medical, financial and engineering.
Instead of using keyword searches, Excalibur
uses natural language processing. It also uses
fuzzy search techniques that are sufficiently
advanced to cope with damaged, misspelled or
poorly scanned data. RetrievalWare’s adaptive
pattern recognition processing negates the need
for manually defined keywords, topic trees or
sorting and labelling information when processing
documents, and it avoids the inherent
subjectiveness of indexes.
Granada Media uses the RetrievalWare tool set
to help it search its archives to identify
suitable clips for inclusion in new programmes.
Each archived programme has a synopsis containing
a summary of the content, the people included and
so on.
"However, the problem is that the spelling on
the synopsis may not be accurate," says Granada’s
head of IT development, Ian Whitfield. As well as
checking spelling, RetrievalWare also uses word
association to improve its searches.
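To show what tolerance to misspelling looks like in practice, here is a small Python sketch using the standard library's difflib module. It swaps in plain string similarity for RetrievalWare's adaptive pattern recognition, which the article does not describe in detail, and the synopsis terms and the name fuzzy_lookup are made up for the example.

```python
import difflib

# Hypothetical terms taken from programme synopses.
SYNOPSIS_TERMS = ["harbour regatta", "mountain rescue", "steam railway"]


def fuzzy_lookup(query, candidates=SYNOPSIS_TERMS, cutoff=0.6):
    """Return candidates that resemble the query, best match first."""
    lowered = {name.lower(): name for name in candidates}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=3, cutoff=cutoff
    )
    return [lowered[match] for match in matches]


print(fuzzy_lookup("steem railwey"))
```

Raising the cutoff makes the lookup stricter, while lowering it returns more candidates at the cost of more false matches.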
Portals and profiling
Portals – applications that provide users with
a single gateway to personalised information – are
becoming a common aid to information retrieval.
The Institute of Chartered Accountants in
England and Wales (ICAEW) set up a members-only
portal based on Autonomy’s adaptive pattern
recognition technology. This software can analyse
a piece of text (independent of language) and
identify and rank the main ideas contained in the
text.
"We wanted our site to have something different
from other sites," says ICAEW’s head of business
information Bill Wilson. "And that was the
personalisation element."
By profiling users based on the ideas in the
documents that they read or write, the portal is
able to bring information to them that contains
similar ideas. As other software agents and
spiders bring back new documents to the portal
server, these too are monitored to see if they
have the same context as any of the individual’s documents.
If so, they are ranked higher than those with less
relevance. By profiling users in this way the
portal ensures that each individual member sees
the documents that are of most interest to
them.
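Profile-based ranking of this kind can be approximated with a simple word-count comparison. The sketch below is an assumption-laden stand-in, not Autonomy's technology: it builds a profile from the words in documents a user has read and ranks incoming documents by cosine similarity to that profile. The function names are hypothetical.

```python
import math
from collections import Counter


def vectorise(text):
    """Count how often each word occurs in a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_for_user(read_documents, incoming):
    """Order incoming documents by similarity to what the user has already read."""
    profile = Counter()
    for text in read_documents:
        profile.update(vectorise(text))
    scored = [(cosine(profile, vectorise(doc)), doc) for doc in incoming]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]


read = ["audit rules for small firms", "tax audit guidance"]
new_items = ["football results", "new audit rules published"]
print(rank_for_user(read, new_items))
```

Counting words is a much cruder signal than analysing ideas, but the ranking step works the same way: whatever scores closest to the profile rises to the top.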
Autonomy was also the choice of business IT
analysts, Butler Group. The Group has a
30,000-article database covering 11 technical
categories and includes in-depth information on
vendors and products. Originally, articles could
only be viewed by product selection. Now they can
be searched by the theme in the query, not simply
the words.
Although it currently only uses the portal for
accessing articles in the database, Butler Group
plans to extend it to use roaming agents to
collect documents from other sources and offer
profiled information to its members. "Users have
to drop the search engine mindset and get rid of
their infatuation with keywords," says Autonomy’s
PR manager, Simon Fletcher. "We don’t believe in
searches, we prefer to push information out."
As the user’s interest in a topic wanes, the
software adjusts their profile accordingly.
"Analysing the ideas in the documents that a user
reads is a better method of determining areas of
interest than having the user fill in forms,"
explained Fletcher, "because users don’t need to
do anything extra to keep their profile up to
date."
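One plausible way to let a profile fade as interest wanes is to decay every existing weight each time the user reads something new. The sketch below assumes exactly that; the decay factor of 0.9 and the name update_profile are illustrative choices, not details taken from Autonomy's software.

```python
def update_profile(profile, document_terms, decay=0.9):
    """Fade old interests slightly, then reinforce the terms just read."""
    updated = {term: weight * decay for term, weight in profile.items()}
    for term in document_terms:
        updated[term] = updated.get(term, 0.0) + 1.0
    return updated


profile = update_profile({}, ["tax"])
profile = update_profile(profile, ["audit"])
print(profile)
```

Topics the user keeps reading about are topped up faster than they decay, while neglected topics drift towards zero without anyone filling in a form.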
The future
Autonomy recently joined forces with mobile
phone manufacturer Ericsson on a project to use
its portal technology to make information
available to users inside Ericsson, no matter
where they are in the world. Whenever information
that matches an individual’s profile becomes
available, the information is transmitted to that
person’s mobile phone using the new WAP (wireless
application protocol) standard.
Excalibur’s vision for the future is to perform
cross-media searches (see below and case study at
top of page 39). At the moment search engines use
dictionaries and thesauri. In the future this
could be expanded to use encyclopædias, which
contain pictures as well as words. So, for
example, if someone is performing a search for
information on an Alsatian dog, a text
cross-reference would widen the search to include
documents containing the term ‘police dog’ and an
image cross-reference would widen the search to
include documents with pictures of Alsatians.
Whatever the methods used, the most obvious way
that information retrieval is likely to develop in
the future is in improved contextual awareness.
Software will be able to ‘understand’ your
information needs better and search engines will
have the most relevant documents waiting for you
for when you need them.
Case study: Excalibur at the Swedish National Crime Intelligence Unit
The Swedish National Crime Intelligence Unit
(Child Protection Division) wasn't having much
success in identifying suspects in its quest to
stop the producers of child pornography material.
It needed some automated tool to help it with the
indexing of pornographic material and decided to
try Visual RetrievalWare from Excalibur.
Visual RetrievalWare uses feature-extraction
tools to break down a digitised picture into
different 'feature vectors'. Each tool is designed
to measure one particular feature such as texture
(i.e. repeating patterns), colour content (how
much of each colour), structure (where the colours
are in the picture), brightness (sources of bright
areas), shape, aspect ratio, any text, and so on.
It provides a vector indicating how much of the
feature that it is looking for was found in the
picture. By combining the vectors, Visual
RetrievalWare comes up with a single index for the
whole image that it then feeds into a fuzzy
comparison system modelled on the way biological
neural networks process information. The system
compares the image index with indexes from
previously processed pictures to try to find
links.
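The feature-vector approach can be illustrated with a toy example. The sketch below is not Visual RetrievalWare: it measures only two of the features mentioned (colour content and brightness), treats an image as a plain list of RGB tuples, and substitutes a simple distance measure for the neural-network-style fuzzy comparison. All function names are assumptions.

```python
import math


def colour_content(pixels):
    """Share of total intensity contributed by the red, green and blue channels."""
    totals = [sum(pixel[channel] for pixel in pixels) for channel in range(3)]
    grand_total = sum(totals) or 1  # avoid dividing by zero for an all-black image
    return [total / grand_total for total in totals]


def brightness(pixels):
    """Average intensity over all channels, scaled to the range 0..1."""
    return [sum(sum(pixel) for pixel in pixels) / (len(pixels) * 3 * 255)]


def image_index(pixels):
    """Combine the individual feature vectors into one index for the image."""
    return colour_content(pixels) + brightness(pixels)


def similarity(index_a, index_b):
    """Map Euclidean distance to a score: 1.0 for identical indexes, lower otherwise."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(index_a, index_b)))
    return 1.0 / (1.0 + distance)


red = [(255, 0, 0)] * 4
dark_red = [(200, 10, 10)] * 4
blue = [(0, 0, 255)] * 4
print(similarity(image_index(red), image_index(dark_red)))
print(similarity(image_index(red), image_index(blue)))
```

The two red images score as more alike than the red and blue ones, which is the kind of graded, non-exact match the comparison stage relies on.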
The Intelligence Unit hopes to be able to use
the software to analyse the background clues (such
as walls, carpets and beds) in the images,
cross-refer them to clues from other images and
connect enough bits of information to eventually
determine where the pictures were taken. By doing
so, it hopes to be able to identify suspects more
quickly.
Case study: Verity at CDNOW
Internet CD ecommerce operator CDNOW’s web site
used to use a text search engine to allow
customers to find the CD they were looking for.
However, 20 per cent of the time the customers were
disappointed. "We knew our database queries
were getting a high number of misses," says
CDNOW’s VP of Internet technology, Russ Cherry.
"Analysing the reasons showed that the majority of
them were caused by customers misspelling the
names of their favourite artists."
CDNOW enlisted the services of Verity. Said
Cherry: "If customers can’t find it, they can’t
buy it." Verity’s Martin Black agreed, adding:
"Customers want to be able to find their
information quickly. They want the information to
be no more than two clicks away."
Verity’s K2 software gives CDNOW a better
search capability than they had previously. It
works as a series of modules in a pipeline. By
arranging the modules in a different order you can
optimise the search results. Since using K2,
CDNOW’s nil-returns have dropped from 20 per cent
to less than 1 per cent. And sales have gone up by
33 per cent.
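Why the order of modules in a pipeline matters can be seen in a small sketch. The three modules below, the correction table and the name run_pipeline are all hypothetical and are not K2's actual modules; the point is only that each stage transforms the query for the next.

```python
# Hypothetical table of known misspellings, keyed in lower case.
CORRECTIONS = {"beetles": "beatles"}


def lowercase(query):
    return query.lower()


def strip_punctuation(query):
    return "".join(ch for ch in query if ch.isalnum() or ch.isspace())


def correct_spelling(query):
    """Replace whole words that appear in the correction table."""
    return " ".join(CORRECTIONS.get(word, word) for word in query.split())


def run_pipeline(query, modules):
    """Feed the query through each module in turn."""
    for module in modules:
        query = module(query)
    return query


print(run_pipeline("The Beetles!", [lowercase, strip_punctuation, correct_spelling]))
print(run_pipeline("The Beetles!", [correct_spelling, lowercase, strip_punctuation]))
```

In the first ordering the spelling module sees a clean, lower-case word and corrects it; in the second it sees "Beetles!" and lets the mistake through.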
In depth: RetrievalWare
Excalibur’s Information Retrieval tool
RetrievalWare combines adaptive pattern
recognition technology with semantic networking
and natural language processing technologies.
"We see ourselves as bringing everyone up to
the same level of retrieval," says Excalibur’s
technical director Phil Lewis. "If a doctor was
searching for information on myocardial
infarction, he’d know he should also search for
‘heart attacks’. A non-expert looking for
information on heart attacks wouldn’t know to
look for ‘myocardial infarction’."
This is where RetrievalWare comes in. Users
don’t have to define their own word association
trees – the software has a 400,000 word dictionary
along with 1.6 million word associations and
50,000 language idioms. This ‘context awareness’
makes RetrievalWare one of the most accurate
information retrieval tools available. Another
area where retrieval tools can fall down is in the
elimination of ‘stop words’ (small words such as
‘the’, ‘of’, ‘a’, and so on). "If you’re searching
for references to ‘vitamin A’ you don’t want to be
overloaded with all documents that contain
‘vitamin’ on its own because the search engine
doesn’t think the ‘A’ is significant. Similarly
you don’t want references to documents that
contain ‘vitamin’ and the letter ‘a’ somewhere
else in the text."
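One way to avoid the ‘vitamin A’ problem is to check for known phrases before discarding stop words. The sketch below assumes a hand-made list of protected phrases and simple whitespace splitting with no punctuation handling; it illustrates the issue and is not how RetrievalWare is implemented. The name index_terms is hypothetical.

```python
STOP_WORDS = {"the", "of", "a", "and"}

# Hypothetical two-word phrases in which a stop word carries meaning.
PROTECTED_PHRASES = {("vitamin", "a")}


def index_terms(text):
    """Drop stop words, except where they form part of a protected phrase."""
    words = text.lower().split()
    terms = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PROTECTED_PHRASES:
            terms.append(" ".join(pair))  # keep the phrase as a single term
            i += 2
        elif words[i] in STOP_WORDS:
            i += 1
        else:
            terms.append(words[i])
            i += 1
    return terms


print(index_terms("The benefits of vitamin A"))
print(index_terms("a vitamin supplement"))
```

Because ‘vitamin a’ is indexed as a single term, a search for it no longer matches documents that merely contain ‘vitamin’ and a stray ‘a’ elsewhere.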
Excalibur has also extended its horizons to
include image-based information with the
introduction of Visual RetrievalWare and Screening
Room. Visual RetrievalWare automatically generates
an index ‘vector’ from digital images depending on
the presence of certain features in the image. By
comparing the index vector generated from a sample
image with indexes previously generated from other
images, it can perform fuzzy searches to determine
whether or not any of the other images have
features in common with the sample. This has
applications in image libraries where it can be
used to supplement the potentially erroneous
human-generated meta-information that might be
stored with each picture. Screening Room is a
video retrieval tool that is used for re-purposing
video content for training, marketing
communications, indexing newsfeeds, and so on. Its
scene-change recognition software not only detects
abrupt cuts between one frame and the next but it
can also detect fades, dissolves and so on. Once
the video ‘footage’ has been dissected, each scene
can be indexed by processing one of its frames
using Visual RetrievalWare. Screening Room can
also extract closed-caption text to produce an
additional textual reference or pass speech
through a voice recogniser to convert it to
text.
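The simplest form of scene-change detection compares each frame with the one before it. The sketch below handles only abrupt cuts, not the fades and dissolves that Screening Room is said to detect, and it assumes frames are equal-length lists of greyscale values. The threshold of 50 and the name find_cuts are illustrative.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute difference between corresponding pixel values."""
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)


def find_cuts(frames, threshold=50):
    """Return the indexes of frames that differ sharply from their predecessor."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]


frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 190, 205],  # a new scene starts here
    [201, 209, 191, 204],
]
print(find_cuts(frames))
```

Each detected cut marks the start of a scene, and one frame from that scene could then be passed to an image indexer of the kind described above.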