Surfing the Sea of Chaos

By Mary Eisenhart

Photo: Terry Lorant

Super-Searcher Reva Basch on finding the information you need online

In the early days of the Web, two or three years ago, finding what was on it was relatively easy. Since then, growth in both the number of Web pages and the number of users trying to find them has made the quest much harder. Wandering from search engine to search engine, typing in queries, the frustrated would-be researcher often finds radically different results generated by different sites--results which may or may not have anything to do with the object of the search. When it's a matter of settling a bet, this may be merely annoying; when it comes to finding out whether somebody in Switzerland has already patented your great idea, the stakes are higher.

Her career as a research professional has given Reva Basch considerable expertise at separating the wheat from the chaff in the world of online information. After three books aimed at an audience of colleagues, her most recent opus is the popular Researching Online For Dummies, in which she offers a wealth of useful advice to non-professionals--i.e., most of us who are all too ready to exploit the riches of online information, but often find ourselves wandering in confusion through a morass of useless material instead.

In the book, Basch provides an overview not only of the Web and its myriad resources, but of the proprietary databases that, until recently, comprised the online-information universe. She offers tips for deciding what information you're actually looking for, and also for determining which sites or databases are likely to prove most useful in the quest.

In a recent conversation, she reflected on the Web-driven changes in her profession and had some advice for the average net user trying to find needed information.

When you started in online research, it was a very specialized professional niche. Since the Web came along, it's more like a vast sea of chaos which wasn't there before -- and lots of non-professionals trying to find what they need there.

You also have a set of assumptions that go with that, like the assumption that that vast sea of chaos is all there is to online information, and that anyone can go out there and throw a search engine against it and, by brute force, come up with the information they want.

Often people try to do this and come up with garbage, and discount the Web completely as an information-gathering method.

Before the great Web explosion, what did online information consist of, and how much of that is still alive?

Pre-Web, what you had was a very small collection of proprietary database services--Dialog, LEXIS-NEXIS, Dow Jones News Retrieval--and some boutiquey ones, some of which have fallen by the wayside, like Newsnet, which used to be an excellent source for specialized industry newsletters.

The advantage that all of these had, which we're only beginning to see emerge on the Web, is aggregation: you could put in one search strategy, cover hundreds, or thousands, of publications at a time, and cover them deeply. You had highly structured databases with word indexing, higher-level concept indexing, the ability to search by author, by date range-- a lot of sophisticated parameters that we only see in very limited, kind of brain-damaged form on the Web.

And they vary wildly from engine to engine--this one wants its Boolean queries formatted this way, and that one doesn't do Boolean searches at all...

It's not just the Boolean aspect, but the fact that, for instance, if you do a date search on HotBot or another Web search engine that supports date searching, what you get is the date the page was put up, not the date of the original publication. Unless they found some way of coding it into the metadata, you don't get the year a conference paper was presented; you get the year and month that it, and probably another eleven years' worth of proceedings of the same conference, were finally put up on the Web. So it's really a very blunt-instrument approach compared with the old traditional online providers.

The environment I grew up in was very precise, very exacting. Within an online service, say Dialog, every database or family of databases had its own protocols. It's the same thing on the Web, but the difference is that the old-style database protocols were very, very well documented. You knew exactly what fields every one had; you knew how to search by date, how to search by word, how to search by a range of values, that kind of thing. You could search very, very precisely.

That, combined with field searching and the power and flexibility and precision of Boolean searching, really meant that you could zero right in on what you wanted.
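
To make that kind of precision concrete, here is a minimal Python sketch of a fielded Boolean search. The records, the field names and the search function are all invented for illustration; none of this is the schema or command language of Dialog or any other service.

```python
from datetime import date

# Hypothetical records; the field names are assumptions for illustration,
# not the schema of any real database service.
RECORDS = [
    {"title": "Fiber optic sensors for bridges", "author": "Kovacs, A.",
     "published": date(1991, 5, 1), "descriptors": {"fiber optics", "sensors"}},
    {"title": "Sensors in automotive design", "author": "Tanaka, H.",
     "published": date(1994, 3, 15), "descriptors": {"sensors", "automotive"}},
    {"title": "Fiber optic networks", "author": "Kovacs, A.",
     "published": date(1996, 9, 30), "descriptors": {"fiber optics", "networks"}},
]

def search(records, all_of=(), none_of=(), author=None, start=None, end=None):
    """Return titles matching AND/NOT descriptor terms, an author, and a date range."""
    hits = []
    for rec in records:
        if not set(all_of) <= rec["descriptors"]:
            continue  # Boolean AND: every required descriptor must be present
        if set(none_of) & rec["descriptors"]:
            continue  # Boolean NOT: no excluded descriptor may be present
        if author is not None and rec["author"] != author:
            continue  # field search on the author field
        if start is not None and rec["published"] < start:
            continue  # date range uses the publication date itself
        if end is not None and rec["published"] > end:
            continue
        hits.append(rec["title"])
    return hits

# "fiber optics" AND NOT "networks"
print(search(RECORDS, all_of=["fiber optics"], none_of=["networks"]))
# prints ['Fiber optic sensors for bridges']
```

Because every record carries a real publication date and controlled descriptors, each condition either matches or it doesn't, and the searcher can see exactly why a record was returned.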

A good thing, because you'd be paying through the nose for the privilege.

Pretty much. Back when I started, the prevalent model was connect-time searching, which meant that you paid for every hundredth of a minute you were online.

The thing to do was, don't think online. Do a logoff-hold--which on Dialog would save your search temporarily, put in a placeholder, and let you resume, as long as you logged back in within 10 or 15 minutes. If you had to rethink your strategy, there was no way you were going to do it online.

Fortunately, that connect-time pricing is almost dead. All of the major proprietary services have essentially done away with it, or modified it greatly, in the face of the Web. They've all got to attract end-users, and end-users are certainly not used to paying by connect time.

The cost factor is definitely a difference. The structured quality of the databases, relative to the comparatively unstructured nature of the Web, is another.

A third area of real difference is in data quality, reliability, authority--whatever you want to call it. Most of what was in the traditional databases was juried in some way. It was either secondary, in that it was originally published in a trade journal or scholarly periodical or something like that, or it went through some kind of editorial process along the way. So you didn't have that problem that you run into on the Web-- "Whose is this?" "Is this legitimate?" "Is this a spoof?" "Is this the latest version of the data?" "Where did it really come from?" "Who really put it up?" You run into these same problems with the New York Times, but at least with the traditional database services you have some sort of responsibility, if not authority, for the information you pull out.

There are also things that the Web offers that you couldn't possibly find on those conventional services, like currency. You've got some near-real-time news and financial services through the traditional databases, but the Web has them beat all to hell when it comes to breaking news.

To what extent have the traditional services adapted to the Web, or otherwise managed to continue to exist and prosper?

If you look at what I think of as the Big Three--Dow Jones, LEXIS-NEXIS, and Dialog--they've all approached the Web in different ways.

Within six months of each other, Dialog and Dow Jones put up Web interfaces. Dow Jones did a very good job of it, and in fact announced plans to go Web-only by sometime this spring. A bunch of their heavy corporate users screamed bloody murder, because the Web still doesn't have the functionality, especially on the output end--printing selectively and things like that.

Just TRY printing some of those pages...

To their credit, Dow Jones kind of backed off and said okay, we'll maintain the Windows-based product for as long as it takes the Web to catch up, basically. They're running in parallel, and doing very well. They've made very few missteps; the Web product is good; they really seem to understand this new medium that they've been thrown into.

I can't be that optimistic about Dialog. They've stumbled. They were sold a year ago in November, and the new management has made some real missteps in terms of pricing. Users have beaten on them very heavily.

They're available through the Web in a number of different forms. There are some end-user products that are sliced-and-diced; there is what they call Dialog Web, which is the main Web interface. You can still get to them through direct dialup; you can still get to them by telnet; and you can also call up something called Classic Dialog On The Web, which looks like a straight passthrough from your browser into the ASCII world of native-mode Dialog. I don't know why they're doing this...(laughs)

Dialog is a company in real transition. They're getting into some areas that you wouldn't think of as their core business--they're getting into e-commerce and a number of other areas. Quite frankly, I think they're desperate to show their investors some return. They seem pretty unfocused. I think the company is going to continue to exist; I have no question that Dow Jones will continue to exist. I just don't know what form Dialog is going to take in the next few years.

I'm really kind of distressed about it, because I cut my searching teeth on Dialog, and it's such an enormously powerful service, with such a rich array of content.

What might you find there?

Where Dialog really shines, and has shone in the past, is in sci-tech information, engineering and technology. The IEEE databases are there; Medline and other medical databases, pharmaceuticals; intellectual property, a lot of patents, copyrights, trademarks--plus a full range of business, news, financial and current events like you'd find in both LEXIS-NEXIS and Dow Jones. Dialog really has had it all, and the data on Dialog is so structured, and so tightly indexed, that it's possible to do things that you can't do on either of its competitors in terms of real precision searching.

I'm pulling for them, but right now I'd say the Dialog user is pretty concerned.

LEXIS-NEXIS has had a number of market-specific Web products up for a while, but it just put up something called LEXIS-NEXIS Universe, which is essentially an interface to the entire service, both the LEXIS side, which is legal information, and NEXIS, which is more general news, business and everything else. Based on what content bundles you've contracted for access to with LEXIS-NEXIS, you can get all of them through their Web interface, or through their Windows software, which they say they're going to continue to maintain. They're pouring most of their development money now into the Web, but they say they are going to maintain their dialup.

I think LEXIS-NEXIS is on a pretty good track, though I've heard rumors that they're an acquisition target as well.

Just looking at it generally, I'd say that "old online," the conventional online services, is making sort of an unsteady transition to the Web. The services all realize that they've got to be there in some way or another, and they are accomplishing it to one or another degree of success, and with varying levels of grace.

So how has the job of online research changed with the technology?

What's clearly happening is that clients' expectations have risen. They expect you to search everything. We've got to cover the Web along with whichever of the conventional services we search.

In fact, in many cases, clients say, "Will you search the Internet for me?" and they don't expect to see a charge for online connectivity or for documents that cost money. You have to explain to them, as I have with you, that there are services and data that will not turn up automatically in a Web search engine. They're what I call gated--you need a password and the account set up if you want to search them directly.

It's much more of a challenge because expectations are higher. There's been a sort of standard disclaimer in use in the professional research community -- "I've covered all the database sources to which I have access, and searched them to the best of my ability. I'm not responsible for errors of omission"-- that kind of thing (laughs). We felt pretty confident signing off on that kind of thing before the Web.

Now you really do have to think about it, because it's very much harder to document what you've done. How can you possibly say that you've searched the entire net?

Especially when every single search engine gives you a different result.

Different results, and remarkably little overlap.

Greg Notess has a number of retrieval studies for search engines; he discovered an amazingly low number of duplicate hits. Even AltaVista, which is supposedly the deepest, covers something like just 30% of the Web. That's not a lot.
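
For readers who want to check overlap on their own searches, here is a minimal Python sketch. It assumes you have already collected the URLs each engine returned for the same query. The engine names and URLs below are made up, and the measure used (shared hits divided by total distinct hits) is an assumption for illustration, not the method used in the studies mentioned above.

```python
from itertools import combinations

def overlap_report(results):
    """Map each pair of engines to the fraction of their combined hits that both returned."""
    sets = {name: set(urls) for name, urls in results.items()}
    report = {}
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        shared = sets[a] & sets[b]
        report[(a, b)] = len(shared) / len(union) if union else 0.0
    return report

# Made-up result sets for one query on two hypothetical engines.
hits = {
    "engine_a": ["http://example.com/1", "http://example.com/2", "http://example.com/3"],
    "engine_b": ["http://example.com/3", "http://example.com/4"],
}
print(overlap_report(hits))
# prints {('engine_a', 'engine_b'): 0.25}
```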

And then there's the question, okay, what do you mean by the Web? But no way can you guarantee that you're going to do a thorough job.

At the same time, I think people are beginning to realize that you don't have to cover the entire Web.

Patterns do emerge.

Patterns emerge, and a lot of people will take a different tack than using a search engine. For instance, if you're looking for a government document, it makes sense to start at Fedworld, a site that you know is a gateway to government information, rather than keying in the name of the report in AltaVista or HotBot or Infoseek or the search engine du jour.

If you're just Joe User having to search for information on the Internet, what can you do to minimize your grief and increase the reward?

There are certain general rules of thumb that I think apply regardless of the search engine you use.

Use the most specific terminology possible-- one of the things I like to tell people is "try to think of what the ideal article to address your question might be called, and put in that fictional title."

Use phrase searching--if you can express your concept as a phrase, that'll improve your results.

Use mandatory terms-- pick what terms must appear as opposed to might appear, or should appear, or could appear, in your search.

Put the most significant concepts first. Most search engines default to relevance ranking, and I'd say try that first. If you're not comfortable with Boolean logic, let relevance ranking work for you.
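
These rules of thumb can be sketched as a small Python helper that assembles a query string. The syntax is an assumption: double quotes for a phrase and a leading plus sign for a mandatory term are common conventions, but each engine has its own rules, so check its search tips before relying on them. The function name and its parameters are hypothetical.

```python
def build_query(phrases=(), required=(), optional=()):
    """Assemble a query: quoted phrases first, then +mandatory terms, then optional terms.

    The caller lists items within each group in order of importance,
    so the most significant concepts end up at the front of the query.
    """
    parts = [f'"{phrase}"' for phrase in phrases]   # phrase searching
    parts += [f"+{term}" for term in required]      # mandatory terms
    parts += list(optional)                         # terms that merely might appear
    return " ".join(parts)

print(build_query(phrases=["prior art"], required=["patent"], optional=["search"]))
# prints "prior art" +patent search
```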

Relevance ranking is number of occurrences of the word in the document?

Well, no. I wish it were! (laughs)

In some search engines, it is. In others, they have an internal thesaurus that gives greater weight to capitalized words, words that they recognize as proper names, company names, whatever. Concepts that are unique, as opposed to most numerous-- in other words, almost the opposite of what you suggest. They all use different algorithms.
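
A toy Python sketch shows how two plausible scoring schemes can order the same documents differently. Neither scheme is any real engine's algorithm; the documents, the function names and the rarity formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Three tiny made-up documents.
DOCS = {
    "a": "search search search engine tips",
    "b": "dialog patent search",
    "c": "engine repair tips",
}

def count_score(query_terms, text):
    """Score by how many times the query terms occur in the document."""
    words = Counter(text.split())
    return sum(words[term] for term in query_terms)

def rarity_score(query_terms, text, docs):
    """Score each matched term by how rare it is across the collection."""
    words = set(text.split())
    score = 0.0
    for term in query_terms:
        if term in words:
            containing = sum(1 for d in docs.values() if term in d.split())
            score += math.log(len(docs) / containing) + 1.0
    return score

def rank(query_terms, docs, scheme):
    """Return document names, best first, under the 'count' or 'rarity' scheme."""
    if scheme == "count":
        score = lambda name: count_score(query_terms, docs[name])
    else:
        score = lambda name: rarity_score(query_terms, docs[name], docs)
    return sorted(docs, key=score, reverse=True)

print(rank(["search", "patent"], DOCS, "count"))   # prints ['a', 'b', 'c']
print(rank(["search", "patent"], DOCS, "rarity"))  # prints ['b', 'a', 'c']
```

Document "a" wins on sheer repetition of "search", while document "b" wins once the rarer term "patent" is weighted more heavily.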

It's not like you learn one system and you're done.

Or as if you have a printed thesaurus for all the databases you use most often, and as long as you consult that, you're going to pull out everything on your topic.

When relevance ranking, which goes hand in hand with natural-language searching, first came out in the early '90s-- and Dow Jones was a pioneer-- professional searchers, librarians and information professionals hated it, because it was a black box. Unlike with Boolean, you couldn't tell why you were getting the results you were getting.

It's the same thing, many-manyfold, on the Web. One of the complicating factors is that not only are you dealing with a black box, you're dealing with a different black box with every search engine you turn to.

So I think the attitude that's evolving among most of my research colleagues is don't try to outsmart the search engine. By all means look at the documentation, check the link that says "Search Tips" or "Advanced Search" or whatever. See how that particular search engine handles phrase searching or mandatory terms. Check the documentation, but don't drive yourself nuts trying to figure out why something did or did not turn up. If it doesn't show up in one search engine, try another one.

Or use a meta engine that covers half a dozen or more search engines at a time. Use something like Inference Find or Dogpile or ProFusion or Savvysearch or MetaCrawler, especially if you're not finding much at all.

In terms of relevance ranking, so what if you get 14,000 hits? Relevance ranking means that the engine is showing you what it thinks are the most useful hits first. It may not be entirely right, it might miss something -- but if what you want doesn't show up in the first 50 or 100 hits, rephrase your search or use a different engine or do a metasearch.
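
The metasearch idea can be sketched in a few lines of Python: take the ranked lists from several engines, drop the duplicates, and look only at the top of the merged list. This is not how any of the meta engines named above actually works. The merge rule (hits found by more engines first, then best position) and the engine names are assumptions for illustration.

```python
def merge_results(ranked_lists, limit=50):
    """Merge per-engine ranked URL lists into one de-duplicated list of at most `limit` URLs."""
    best_rank = {}  # best (lowest) position each URL reached in any engine
    votes = {}      # number of engines that returned each URL
    for urls in ranked_lists.values():
        seen = set()
        for position, url in enumerate(urls):
            if url in seen:
                continue  # ignore repeats within a single engine's list
            seen.add(url)
            votes[url] = votes.get(url, 0) + 1
            if url not in best_rank or position < best_rank[url]:
                best_rank[url] = position
    # URLs found by more engines come first, then better position, then name.
    ordered = sorted(best_rank, key=lambda u: (-votes[u], best_rank[u], u))
    return ordered[:limit]

# Made-up ranked results from two hypothetical engines.
lists = {"engine_a": ["x", "y", "z"], "engine_b": ["y", "w"]}
print(merge_results(lists))
# prints ['y', 'x', 'w', 'z']
```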

How concerned should you be about missing the holy grail of your quest because it's in a document that's not in English?

That goes back to a very fundamental consideration I've had to deal with over the years.

My background is in engineering research. I was an engineering librarian right out of library school, and one of the things you have to ask yourself is "Where's the important work being done? Am I making a false assumption in assuming I'm going to find it on an English-language site?"

Chances are for most general searches, the kinds of searches most people are going to be doing, you will find what you want, or pointers to it, in English.

For a highly technical search, like the kind I used to do in databases on Dialog, where a lot of the important technology was being developed in Japan or Hungary or Russia or someplace like that, I always had to remind myself not to limit it to English, which was very easy to do in those databases, and was sometimes appropriate and sometimes not.

So I think it's something to be aware of, but if you're looking for general knowledge, unless you have reason to believe that there was important work done in another language, it's probably not going to be your first concern. There are search engines that allow you to search in particular languages, or to limit the results to English-only. I think it's AltaVista that gives you about 14 language choices.

Obviously, if you're looking for a site in a particular language, use the geographic restrictor on the domains for that country, and confine your searches to that geographic domain. There's lots of ways to do it, depending on whether you do or do not want to focus on particular countries or areas of the world other than the US.

So when does it make sense to do a search yourself, and when does it make sense to hire a pro?

One basic issue is, how much is your time worth? Everyone knows it's a lot of fun to hang out on the Web, but you know, you're probably being paid to do something else. It's an out-of-pocket cost to hire a professional researcher, but what's your time worth?

You have to do that time-versus-money equation for yourself, and the answer's going to be different depending on the situation.

Another way of looking at it is, is the information you're hoping to find going to be available on the Web, using the tools that you feel confident using?

The Web has a very short memory. If you're looking for something from a scholarly publication, or you're doing a prior art search for patents, and you've got a potentially zillion-dollar patent suit resting on this thing, you'd be an idiot to confine your search to the Web; you really want to go back to some of those proprietary databases. If you have access to them and know how to search them yourself, great, but you're not going to find that historic archival stuff on the Web, not consistently enough to have a whole lot of confidence in it.

There is material you can get to through the Web--you can get to Dialog and so on--but you're going to have to pay for it, and it's not going to turn up in a Web search. For instance, multi-client market studies that normally sell for thousands of dollars.

SRI is just not going to put all its stuff up on the Web for free...

You can get to a lot of this stuff fine, but you're going to have to sign up at the site or however, and pay $4.50 a page, or ten bucks a page, or several hundred dollars for the section of the report you want.

If it cost somebody good money to assemble a study, you just can't assume that you're going to find it for free on the Web, or that your basic lackadaisical Web search is going to turn up a pointer to it.

If you've got a business decision or something, say, involving quality-of-life issues of a medical course of treatment, where it's life or death, or money, or jail time or something, you should really think about hiring a professional.

