The World-Wide Web is growing at a phenomenal rate. We estimate that the Web currently consists of around 10,000 servers and a total of around 1,000,000 documents. Attempts to capture some of the information on the Web and make it available for querying have not been very successful. Well-known robot-based index databases like RBSE's URL database and WebCrawler cover less than 5% of the documents on the Web. Other databases that contain only header and title information, like the JumpStation and the WWW Worm, reach 10 to 40% of the Web, but are not very useful because the headers and titles of many documents on the Web are meaningless.
There are several reasons why the current generation of search engines is inadequate:
The second problem indicates that it is unlikely that complete index databases will be freely available. Distributing the information retrieval services would not only make them affordable, but would also help increase the performance of search operations.
The third problem is the most important one. The concept of putting people in charge of index databases who have no relationship with the people providing the information being indexed is fundamentally wrong. Robots are constantly roaming the Web, searching for added or updated information. It would be much more efficient if the maintainers of Web servers could somehow provide "update operations" for robots.
The architectural change to the WWW we propose is the addition of a search engine (with an accompanying index database) on every WWW server. Using a tool somewhat like GlimpseHTTP, every server can provide the URLs of its documents containing certain words, regular expressions, or approximations thereof. The WWW community should agree on a minimal query language supported by this search engine. (Research on more general or more efficient information retrieval services remains equally valuable, of course.)
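To illustrate, a per-server search engine of this kind maps a query (a word or regular expression) to the URLs of matching local documents. The following is a minimal sketch; the document table and function name are our own invented stand-ins for a real server's index, not part of any existing tool:

```python
import re

# Hypothetical in-memory view of one server's document collection:
# maps each local document's URL path to its plain-text contents.
DOCUMENTS = {
    "/index.html": "welcome to our server",
    "/papers/robots.html": "robots roaming the web for updated pages",
    "/papers/search.html": "distributed search engines for the web",
}

def search(pattern):
    """Return the URL paths of local documents whose text matches
    the given word or regular expression (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(url for url, text in DOCUMENTS.items()
                  if regex.search(text))
```

Because every server answers only for its own documents, the index stays small and can be kept up to date by the server's own maintainer.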
Using these server-coupled search engines, databases could be built that retrieve information from the entire WWW in an efficient way. Organized databases, such as the EINet Galaxy, Yahoo or the Global Network Navigator, could contain a set of caches of answers to queries (assembled from all WWW servers) for each of the topics they refer to.
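Such a topic cache might be assembled by sending the same query to every cooperating server's search engine and merging the returned URL lists. A minimal sketch, in which `query_server` is a canned stand-in for an actual request to a server's search engine (server names and answers are hypothetical):

```python
def query_server(server, topic):
    # Stand-in for a real query to the server's search engine;
    # returns the URL paths that server reports for the topic.
    CANNED = {
        "www.example.edu": {"hypertext": ["/ht/intro.html"]},
        "www.example.org": {"hypertext": ["/docs/ht.html",
                                          "/docs/www.html"]},
    }
    return CANNED.get(server, {}).get(topic, [])

def build_topic_cache(servers, topics):
    """Assemble, for each topic, the full URLs answering that topic
    across all cooperating servers."""
    cache = {}
    for topic in topics:
        cache[topic] = ["http://%s%s" % (server, path)
                        for server in servers
                        for path in query_server(server, topic)]
    return cache
```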
On-line search engines like the fish-search could surf along WWW servers instead of individual documents.
This architectural change does not require any modification to the WWW servers or clients. The proposed search engines can be installed as CGI scripts. We do suggest that the WWW organization require registered servers to include such a search engine, offering a minimal query language through standardized arguments to the scripts.
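The standardized arguments could take the form of an agreed-upon query string passed to the CGI script. As a sketch of what such a convention might look like (the argument names `query` and `type` are our own invention, not part of any standard), a script could parse requests like this:

```python
from urllib.parse import parse_qs

def parse_search_request(query_string):
    """Parse a hypothetical standardized query string for the
    per-server search script. Recognized arguments:
      query - the word or regular expression to search for
      type  - 'word' (default), 'regexp', or 'approximate'
    """
    args = parse_qs(query_string)
    return {
        "query": args.get("query", [""])[0],
        "type": args.get("type", ["word"])[0],
    }
```

A client or robot would then issue, for example, a request for `search?query=hypertext&type=regexp` against any registered server and receive a list of matching URLs in a predictable form.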