SEARCHING THE WEB
These pages are devoted to the tools and methodologies available for the extraction and manipulation of web data. Such automatic systems are suitable for cybermetric (and other quantitative) tasks, although theoretical and comparative studies are also covered.
Because a great deal of information has been published about search tools, this section is mainly a directory of the more powerful engines for retrieving data from the Web. Other sections cover analytical and comparative reviews on the best use of these engines. Finally, projects and pilot applications in scientometric research also deserve their own pages.
The classification scheme presented here is provisional and open to criticism. In many cases the original aim of the software's authors is quite different from the usage suggested in the following taxonomy; the editors had scientometric applications in mind when the choices were made. Obviously, the evolution of these tools could render the proposed classification and nomenclature obsolete within a very short time.
As usual, suggestions are welcome.
THE NATURE OF INFORMATION IN THE WWW
The World Wide Web has revolutionized the way that researchers access information, and has opened up new possibilities in areas such as digital libraries, academic and scientific information diffusion and retrieval, education, and medicine. The revolution that the Web has brought to scholarly communication includes both the availability of huge amounts of information (probably exceeding 1,500 million webpages at the end of 1999) and the improved efficiency of access to that information.
Web search engines allow this large amount of information to be searched efficiently. The latest statistical reports on the size of the search engine databases are very illustrative:
However, Web search engines are limited in terms of coverage, freshness, query interface options, and how well they rank results by relevance, as shown by Steve Lawrence and C. Lee Giles in their recent reviews:
Volatility of the information
The "changing-time" of a webpage is very short, probably between about 44 days (Michael Lesk, 1996) and 70 days (Brewster Kahle, 1998). Wallace Koehler (1999) shows in recent studies that the half-life of a webpage is somewhat less than two years and the half-life of a website is somewhat more than two years, with some of them changing content very rapidly, while others do so infrequently.
In some cases the rate of updating is very high, with some sites such as CNN changing even more often than hourly.
Shortcomings of the search engines
— Altavista and Hotbot, the two largest engines at the beginning of 1998, have shown both increases and decreases in coverage, and the size of Hotbot is now smaller than it was before. The situation is getting worse, as Hotbot (now part of the Lycos group) has reduced its coverage from more than 110 million to about 34 million records (January 2000).
— Even Altavista, with its large database, does not index all the contents of a web site already located by its robot. Melee's Indexing Coverage Analysis was set up after the discovery that the AltaVista index retains only a sample of the pages on medium to large sites. Although it is no longer maintained, this analysis showed the strange behaviour of both engines (Altavista and Hotbot) during 1998 and 1999.
— Other large engines such as Fast Search (stabilized during the last months of 1999 at over 200 million records), Northern Light (the second largest, over 170 million), Google and Infoseek exhibit a slow but continuous increase. We will try to track the evolution of these databases as they increase their coverage (January 2000).
— Excite changes its coverage every week without any real increase in the number of Web pages indexed (over a 15-month period, personal data). In fact, like Lycos and Hotbot, it does not seem useful for quantitative analysis, as it restricts its results to a certain subgroup (in this case delimited by language).
— During 1998 Altavista gave different results depending on whether the Simple or the Advanced search was used, although both are now equal. Moreover, the results can also differ at different times of day and when the "count" option is activated (usually decreasing by about 10%, although increases, a less frequent situation, also occur). In some cases there is evidence that the number of hits changes from the first results screen to the second or third one, with the results running out well before the count provided.
— The geographical databases of Altavista differ from one another, but this is probably because the option of adding new URLs affects each database separately.
(Advanced Search, October 10th 1998, *Canada database)
(Advanced Search, June 18th 1999, *Canada database)
(Advanced Search, January 11th 2000)
— Delimiting results by date in Altavista, Hotbot and Northern Light shows that there are no clear criteria about the time period for refreshing the databases (i.e. the behaviour of the agents when collecting resources).
— The number of dead or invalid links (in absolute terms and as a percentage) among the search results is increasing (the same is true for indexes such as Yahoo); a sketch of how this proportion can be measured is given after this list.
— The method for evaluating the popularity of websites (link: in Altavista and Infoseek, linkdomain: in Hotbot and SearchMSN) is not very comprehensive. Its somewhat erratic behaviour is probably due to the huge server resources that this option requires. It is also possible that the strange results obtained with complex strategies (Boolean searches) are due to bugs in the engine programming.
— Google is becoming an exception, as it ranks results using popularity derived from linking information. Besides direct estimation with link:, it also offers a powerful delimiter called related:, based on the information provided by hypertext links. The composition of such inlink queries is also sketched after this list.
— The multisearchers, which provide a useful alternative given the low overlap even among the largest engines, are unfortunately not well developed yet. We recommend using the alternative second-generation multisearchers.
— In addition, some engines have displayed peculiar criteria for producing results and rankings that appear to be driven more by commercial motives.
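The growing proportion of dead links mentioned above can be estimated automatically. The following is a minimal sketch of ours in Python (the URL list is hypothetical; a real sample would be taken from the engine's results pages):

    import urllib.request
    import urllib.error

    def is_alive(url, timeout=10):
        # A URL is counted as alive if it answers a HEAD request
        # with an HTTP status below 400.
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.status < 400
        except (urllib.error.URLError, ValueError):
            return False

    # Hypothetical sample of URLs copied from a results page
    results = ["http://www.example.org/", "http://www.example.org/old-page.html"]
    dead = [url for url in results if not is_alive(url)]
    print("Dead links: %d of %d (%.0f%%)"
          % (len(dead), len(results), 100.0 * len(dead) / len(results)))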
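The inlink ("popularity") searches discussed above can likewise be composed automatically for counting purposes. The small helper below is only our illustration of the operators mentioned in the text, not official documentation of any engine's syntax:

    def inlink_query(target, engine):
        # Operator names as discussed above: "link:" for Altavista, Infoseek
        # and Google, "linkdomain:" for Hotbot and SearchMSN. The mapping is
        # illustrative only and may not match the engines' current syntax.
        operators = {
            "altavista": "link:",
            "infoseek": "link:",
            "google": "link:",
            "hotbot": "linkdomain:",
            "searchmsn": "linkdomain:",
        }
        return operators[engine] + target

    print(inlink_query("www.example.org", "altavista"))   # link:www.example.org
    print(inlink_query("example.org", "hotbot"))          # linkdomain:example.org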
Heterogeneity of information (format)
The Web is sometimes referred to as the world's largest database because of its huge volume of information. Unfortunately, that assertion is not accurate, as there is no standard format or universal agreement on the structure and presentation of the "records" of the Web.
In some cases there is no title available, or it is provided only for the home page, or it does not describe the actual contents at all. The author or institution may not even be mentioned, making contact impossible because no postal address or e-mail is given.
Another source of heterogeneity is the information provided about updates. Often no date is given, so it is very difficult to determine whether the data are current or how frequently they are updated; in some cases the update frequency is very low or clearly irregular.
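A partial workaround, sketched below under our own assumptions, is to ask the server directly: many HTTP servers return a Last-Modified header even when the page itself displays no date, although dynamic pages often omit or misreport it.

    import urllib.request
    import urllib.error

    def last_modified(url, timeout=10):
        # Return the Last-Modified header of a page, or None if the header
        # is absent or the page cannot be reached.
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.headers.get("Last-Modified")
        except urllib.error.URLError:
            return None

    print(last_modified("http://www.example.org/"))   # hypothetical example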
Particularly frustrating are the conflicts between the type of information provided and the browsers' capabilities (frames, JavaScript, colour combinations, unsupported HTML tags), so that many resources have special software requirements in order to be retrieved.
Some proposals suggest an approach similar to the cataloguing-in-publication data that appears in printed books, using the META tag for this purpose and developing sometimes complex metadata schemes. Even on this point there is no universal agreement among the different proposals for arranging the meta-resources embedded in the sites. As this could eventually provide a good tool for sampling the Web, we provide a list of the main initiatives:
Another formal problem with practical consequences is the existence of very "deep" sites, with more than twenty levels ("deep degree"). Some of this information is not easy to retrieve, as many robots are configured to explore only a limited number of levels.
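This limitation can be illustrated with a minimal sketch of ours (not the configuration of any particular robot): a crawler typically records the level of each URL and simply never expands pages beyond a maximum depth, so deeper contents remain invisible to it.

    from urllib.parse import urljoin

    def crawl(start_url, fetch_links, max_depth=3):
        # fetch_links(url) is assumed to return the outgoing links of a page;
        # it stands in for the downloading and parsing done by a real robot.
        seen = {start_url}
        frontier = [(start_url, 0)]
        while frontier:
            url, depth = frontier.pop(0)
            if depth >= max_depth:
                continue          # pages beyond this level are never visited
            for link in fetch_links(url):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append((absolute, depth + 1))
        return seen

A site more than twenty levels deep would therefore contribute only its first few levels to the index unless the maximum depth is raised accordingly.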
A large fraction of the contents is "invisible" to the global search engines. This "infranet" or invisible Internet is especially important in certain disciplines, as it covers library catalogues, several kinds of databases (bibliographical, factual and textual), some electronic journals, other password-protected material, and information provided only after registration.
Heterogeneity of information (content quality)
As mentioned above, some of the formal shortcomings have consequences for quality evaluation:
Nevertheless, the main problem is the doubtful origin of some data, with unverified quality or no supporting evidence. Regarding R&D information, this has led some authors to suggest applying a peer-review process to web publications, since strong control of the contents is required.
We are elaborating this section in greater detail, but relevant information can be found in this webliography on the evaluation of quality:
Especially interesting are the efforts of some scientific groups, such as the medical ones, as described on several websites:
As a summary, the Internet is not:
— a digital virtual library,
— nor a multidisciplinary database.