Finding Information in the World-Wide Web

dr. P.M.E. De Bra, debra@win.tue.nl
drs. R.D.J. Post, reinpost@win.tue.nl
tel. +31 40 472733
Eindhoven University of Technology
Dept. of Computing Science
PO Box 513
NL-5600 MB Eindhoven
the Netherlands

The World-Wide Web is growing at a phenomenal rate. We estimate that the Web currently consists of around 10,000 servers and a total of around 1,000,000 documents. Attempts to capture some of the information on the Web and make it available for querying have not been very successful. Well-known robot-based index databases such as RBSE's URL database and the WebCrawler cover less than 5% of the documents in the Web. Other databases that contain only header and title information, such as the JumpStation and the WWW Worm, reach 10 to 40% of the Web, but they are not very useful because the headers and titles of many documents on the Web are meaningless.

There are several reasons why the current generation of search engines is inadequate:

  1. The WWW does not keep an official directory (list) of all WWW servers, so robots have to spend a lot of effort just finding new servers.
  2. Creating a complete index database of 1,000,000 documents (and not just their headers) requires a lot of disk space, which no organization is likely to provide, at least not for free.
  3. Finding all documents (on the servers one knows about) takes a very long time. Furthermore, documents hidden behind clickable maps and forms cannot be found by robots.

The first problem has already been recognized by the WWW organization, which is planning to organize a registration service for WWW servers.

The second problem suggests that complete index databases are unlikely to become freely available. Distributing the information retrieval services would not only make them affordable, but would also help increase the performance of search operations.

The third problem is the most important one. Putting index databases in the hands of people who have no relationship with those providing the information to be indexed is fundamentally wrong. Robots constantly roam the Web, searching for added or updated information; it would be much more efficient if the maintainers of Web servers could somehow provide "update operations" for robots.
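
As a rough illustration of what such an "update operation" could look like (the interface, paths and names below are hypothetical, not part of any existing proposal), a server maintainer could publish the list of documents added or changed since a given date, so that a robot only needs to fetch those documents instead of re-crawling the whole server:

    #!/usr/bin/env python3
    # Hypothetical "update list" generator for a server maintainer: it prints
    # the URLs of documents modified since a given date, so a robot can fetch
    # only those documents instead of re-crawling the entire server.
    import os
    import sys
    import time

    DOCUMENT_ROOT = "/var/www/htdocs"      # assumed location of the documents
    BASE_URL = "http://www.example.org"    # assumed public address of this server

    def updated_since(since):
        """Yield URLs of documents whose files changed after `since` (epoch seconds)."""
        for dirpath, _dirnames, filenames in os.walk(DOCUMENT_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) > since:
                    yield BASE_URL + "/" + os.path.relpath(path, DOCUMENT_ROOT)

    if __name__ == "__main__":
        # usage: update_list.py 1994-06-01
        since = time.mktime(time.strptime(sys.argv[1], "%Y-%m-%d"))
        for url in updated_since(since):
            print(url)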

The architectural change to the WWW we propose is the addition of a search engine (with an associated index database) on every WWW server. Using a tool such as GlimpseHTTP, every server can provide the URLs of its documents that contain certain words, regular expressions, or approximations thereof. The WWW community should agree on a minimal query language supported by this search engine. (Research on more general or more efficient information retrieval services of course remains equally valuable.)
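
The following is a minimal sketch of such a server-side search script, written as a CGI program. The argument name "query", the document root, and the plain list of URLs returned are assumptions made for the sake of illustration; a real installation would delegate the matching to an indexer such as Glimpse rather than scanning every file on each request:

    #!/usr/bin/env python3
    # Sketch of a per-server search CGI script.  The argument name "query" and
    # the plain-text list of URLs in the reply are illustrative assumptions,
    # not an agreed standard; a real server would use an index (e.g. Glimpse)
    # instead of scanning every file on each request.
    import os
    from urllib.parse import parse_qs

    DOCUMENT_ROOT = "/var/www/htdocs"      # assumed location of the documents
    BASE_URL = "http://www.example.org"    # assumed public address of this server

    def matching_urls(word):
        """Return URLs of documents containing `word` (case-insensitive)."""
        hits = []
        for dirpath, _dirnames, filenames in os.walk(DOCUMENT_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, errors="ignore") as f:
                        if word.lower() in f.read().lower():
                            hits.append(BASE_URL + "/" + os.path.relpath(path, DOCUMENT_ROOT))
                except OSError:
                    pass
        return hits

    if __name__ == "__main__":
        params = parse_qs(os.environ.get("QUERY_STRING", ""))
        word = params.get("query", [""])[0]
        print("Content-Type: text/plain\r\n\r\n", end="")
        for url in matching_urls(word):
            print(url)

A client would then reach the engine through an ordinary URL such as http://www.example.org/cgi-bin/search?query=hypertext (the script name and argument are, again, only examples).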

Using these server-coupled search engines, databases could be built that retrieve information from the entire WWW in an efficient way. Organized databases such as the EINet Galaxy, Yahoo, or the Global Network Navigator could maintain, for each of the topics they cover, a cache of the answers to queries assembled from all WWW servers.
On-line search engines like the fish-search could surf along WWW servers instead of individual documents.
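
To sketch how a topic database or an on-line search tool could use these per-server engines, the fragment below asks the (hypothetical) standardized search script of a list of known servers for the URLs matching a query and merges the answers; the server names and the /cgi-bin/search path are assumptions:

    #!/usr/bin/env python3
    # Sketch of a client-side aggregator: instead of crawling individual
    # documents, it asks each known server's (hypothetical) search script for
    # the URLs matching a query and merges the answers.
    from urllib.parse import quote
    from urllib.request import urlopen

    # Assumed list of servers that offer the standardized search script.
    SERVERS = [
        "http://www.example.org",
        "http://www.example.com",
    ]

    def search_all(word, script="/cgi-bin/search"):
        """Collect matching URLs from every server's search engine."""
        results = []
        for server in SERVERS:
            url = server + script + "?query=" + quote(word)
            try:
                with urlopen(url, timeout=10) as reply:
                    results.extend(line.decode().strip() for line in reply if line.strip())
            except OSError:
                pass   # skip servers that are down or lack the script
        return results

    if __name__ == "__main__":
        for hit in search_all("hypertext"):
            print(hit)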

This architectural change does not require any modification to the WWW servers or clients. The proposed search engines can be installed as CGI scripts. We do suggest that the WWW organization require registered servers to include such a search engine, offering a minimal query language through standardized arguments to the scripts.