International Journal of Scientometrics, Informetrics and Bibliometrics
ISSN 1137-5019

SEARCHING THE WEB

These pages are devoted to the tools and methodologies available for extracting and manipulating web data. Such automatic systems are suitable for cybermetric (and other quantitative) tasks; theoretical and comparative studies are also covered.

Because of the great amount of information published about search tools, this section is mainly a directory of the more powerful engines for retrieving data from the Web. Other sections cover analytical and comparative reviews of how best to use these engines. Finally, projects and pilot applications in scientometric research also have their own pages.

First Generation

Second Generation

Classification and Directory
Clients Z39.50
Downloaders
Metasearchers
Indexers
Tracers
Linkcheckers
Mappers
Other Agents
including "autonomous" agents
New approaches: Visualization
Hypertext links and Self-Organised Maps

The classification scheme presented here is provisional and open to criticism. In many cases the original aim of the software's author(s) is quite different from the usage suggested in the following taxonomy. The editors had scientometric applications in mind when making these choices. Obviously, the evolution of these tools could make the proposed classification and nomenclature obsolete in a very short time.

As usual, suggestions are welcome.

THE NATURE OF INFORMATION IN THE WWW

The World Wide Web has revolutionized the way that researchers access information, and has opened up new possibilities in areas such as digital libraries, academic and scientific information diffusion and retrieval, education, and medicine. The revolution that the Web has brought to scholarly communication includes both the availability of huge amounts of information (probably exceeding 1500 million web pages at the end of 1999) and the improved efficiency of accessing such information.

Web search engines allow a large amount of information to be searched efficiently. The latest statistical reports on the size of the search engines' databases are very informative:

• Search Engines Statistics by Greg R. Notess
• Steve Lawrence and C. Lee Giles Studies (Apr. 98; Sept. 98; Feb. 99)
• Search Engine Sizes by Danny Sullivan

However, Web search engines are limited in terms of coverage, freshness, query interface options, and how well they rank the relevance of results, as shown by Steve Lawrence and C. Lee Giles in their recent reviews:

• Accessibility and Distribution of Information on the Web. Nature, 400(6740): 107-109, July 8, 1999
• Searching the World Wide Web. Science, 280:98, April 3, 1998

Volatility of the information

The "changing-time" of a webpage is very short, probably between about 44 days (Michael Lesk, 1996) and 70 days (Brewster Kahle, 1998). Wallace Koehler (1999) shows in recent studies that the half-life of a webpage is somewhat less than two years and the half-life of a website is somewhat more than two years, with some of them changing content very rapidly, while others do so infrequently.
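To make the half-life figures concrete, the share of pages expected to survive after a given time can be sketched under a simple exponential-decay assumption. The decay model, the function name and the round two-year half-life are illustrative assumptions, not results taken from the studies cited above.

```python
def surviving_fraction(elapsed_years: float, half_life_years: float) -> float:
    """Fraction of pages still alive after elapsed_years.

    Assumes plain exponential decay with the given half-life
    (an illustrative model, not one taken from the cited studies).
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (elapsed_years / half_life_years)


if __name__ == "__main__":
    # Hypothetical half-life of two years, close to the estimate quoted above.
    for years in (1, 2, 4):
        print(years, round(surviving_fraction(years, 2.0), 3))
```

Under this assumption about 71% of pages would remain after one year and 25% after four years, which gives a rough idea of how quickly a sample of URLs collected for a quantitative study goes stale.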

In some cases the update rate is very high, with intervals of less than an hour, as at CNN.

Shortcomings of the search engines

Irregular behaviour:

Altavista and Hotbot, the two largest engines at the beginning of 1998, showed both increases and decreases in coverage, so that Hotbot is now smaller than before. The situation is getting worse: Hotbot (now part of the Lycos group) has reduced its coverage from more than 110 million to about 34 million records (January 2000).

Even Altavista, with its large database, does not index all the contents of a website already located by its robot. Melee's Indexing Coverage Analysis was set up following the discovery that the AltaVista index retains only a sample of the pages on medium to large sites. It is no longer maintained, but during 1998 and 1999 this index showed the strange behaviour of both engines (Altavista and Hotbot).

Other large engines such as Fast Search (stabilized at over 200 million during the last months of 1999), Northern Light (second largest, over 170 million), Google and Infoseek exhibit a slow but continuous increase. We will try to show the evolution of these databases in the future as they increase their coverage (Jan. 2000).

Excite changes its coverage every week without showing a true increase in Web pages (over a 15-month period, personal data). Like Lycos and Hotbot, it does not currently seem useful for quantitative analysis, as it has focused its results on a certain subgroup (in this case delimited by language).

During 1998 Altavista returned different results depending on whether the Simple or Advanced option was used, although both now give the same results. Moreover, the results can also differ at different times of day and when the "count" option is activated (normally decreasing by about 10%, and less frequently increasing). In some situations there is evidence that the number of hits changes from the first results screen to the second or third, with the results exhausted well before the count provided.

The geographical databases of Altavista differ, but this is probably because the option of adding new URLs affects each database separately.

MIRROR/domain   us         ca          es         au         my
CALIFORNIA      2.151.680  2.554.962   987.029    2.159.309  112.588
CANADA          2.151.680  2.554.962   987.029    2.159.309  112.588
                           7.382.495*
SPAIN           2.918.410  4.094.308   1.211.073  2.790.777  140.585
AUSTRALIA       2.151.680  2.554.962   987.029    2.159.309  112.588
MALAYSIA        2.151.680  2.554.962   987.029    2.159.309  112.588

(Advanced Search, October 10th 1998, *Canada database)

MIRROR/domain   us         ca           es       au         my       se         ch          de
CALIFORNIA      1.602.912  3.105.343    644.060  5.106.158  116.432  4.212.159  12.649.355  14.091.439
CANADA          1.606.592  3.103.449    644.866  5.114.478  115.692  4.212.186  12.641.728  14.073.123
                           39.010.887*
SPAIN           2.151.745  2.630.314    959.978  2.165.785  110.028  2.378.405  1.285.338   5.645.332
AUSTRALIA       1.602.912  3.105.343    644.060  5.106.158  116.432  4.212.159  12.649.355  14.091.439
MALAYSIA        1.606.592  3.103.449    644.866  5.114.478  115.692  4.212.186  12.641.728  14.073.123
GERMANY         1.602.912  3.105.343    644.060  5.106.158  116.432  4.212.159  12.649.355  14.091.439

(Advanced Search, June 18th 1999, *Canada database)
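In the June 1999 figures above, several mirrors return identical counts for every domain, which suggests that they are querying the same underlying database. A minimal sketch of how such groups could be detected mechanically follows; the function name is hypothetical, the counts are copied from the table (the separate Canada-database figure is left out), and identical counts are only indirect evidence of a shared database.

```python
from collections import defaultdict

# Hit counts for the domains us, ca, es, au, my, se, ch, de
# (Advanced Search, June 18th 1999), written without thousands separators.
ROW_A = (1602912, 3105343, 644060, 5106158, 116432, 4212159, 12649355, 14091439)
ROW_B = (1606592, 3103449, 644866, 5114478, 115692, 4212186, 12641728, 14073123)
ROW_SPAIN = (2151745, 2630314, 959978, 2165785, 110028, 2378405, 1285338, 5645332)

COUNTS = {
    "CALIFORNIA": ROW_A,
    "CANADA": ROW_B,
    "SPAIN": ROW_SPAIN,
    "AUSTRALIA": ROW_A,
    "MALAYSIA": ROW_B,
    "GERMANY": ROW_A,
}


def group_identical(counts):
    """Group mirror names whose count vectors are exactly equal."""
    groups = defaultdict(list)
    for mirror, row in counts.items():
        groups[row].append(mirror)
    # Sort names inside each group and the groups themselves for stable output.
    return sorted(sorted(names) for names in groups.values())


if __name__ == "__main__":
    for names in group_identical(COUNTS):
        print(", ".join(names))
```

On these figures the mirrors fall into three groups: California, Australia and Germany; Canada and Malaysia; and Spain on its own.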

MIRROR/domain        us         ca              es         au             my       se             ch             de             uk
CALIFORNIA           3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
www.altavista.com
CANADA               3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
altavistacanada.com             11.940.561 (1)
SPAIN                2.151.745  1.324.763       485.073    1.060.669      54.396   1.213.617      642.092        2.883.868      2.7222.239
magallanes.net
AUSTRALIA            3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
yellowpages.com.au                                         6.956.942 (2)
MALAYSIA             3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
skali.com.my
GERMANY              3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
altavista.de                                                                                                     6.418.556 (3)
UNITED-KINGDOM       3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
altavista.co.uk                                                                                                                 66.341 (4)
SWEDEN               3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
altavista.se                                                                       8.537.867 (5)
SWITZERLAND          3.955.829  5.706.616       1.769.721  6.762.257      402.992  5.333.954      4.694.606      24.708.182     17.796.048
sear.ch                                                                                           5.445.848 (6)

(Advanced Search, January 11th 2000)
(1) Canada; (2) Australia; (3) Deutschsprachigen Web; (4) the UK Web; (5) Sverige; (6) Schweiz

See also Notess, Greg (1999). AltaVista's International Mirrors. EContent (formerly DATABASE), August 1999, Volume 22, Number 4 <http://www.ecmag.net/EC1999/net8.html>

Delimiting by date the results offered by Altavista, Hotbot and Northern Light shows that there are no clear criteria for how often the databases are refreshed (the behaviour of the agents collecting resources).

Some other delimiters are not very effective (language in Altavista) or are even misleading (site in Infoseek matches the string anywhere in the URL). More inconsistencies of Altavista are documented in a web page specially maintained by Greg Notess.

The number of dead or invalid links (absolute and percentage) among the search answers is increasing (the same is true for indexes such as Yahoo).

The method for evaluating the popularity of websites (link: in Altavista and Infoseek, linkdomain: in Hotbot and SearchMSN) is not very comprehensive. Its somewhat erratic behaviour is probably due mainly to the huge server resources this option requires. It is also possible that the strange results obtained when it is applied in complex strategies (Boolean searches) are due to bugs in the engine programming.

Google is becoming an exception, as it ranks results using popularity derived from link information. Besides direct estimation with link:, it also offers a powerful delimiter called related:, based on the information provided by the hypertext links.

Multisearchers, which offer a useful alternative now that there is little overlap even among the largest engines, are unfortunately not yet well developed. We recommend using the alternative second-generation multisearchers.
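The overlap between engines mentioned above can be quantified once the result URLs of each engine have been collected. The sketch below uses the Jaccard coefficient, which is one possible measure chosen here for illustration and not necessarily the one used in the studies cited; the engine names and URLs are hypothetical placeholders.

```python
from itertools import combinations


def overlap(a: set, b: set) -> float:
    """Jaccard overlap: shared URLs divided by all distinct URLs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical result sets for the same query on three engines.
    results = {
        "engine_a": {"http://example.org/1", "http://example.org/2", "http://example.org/3"},
        "engine_b": {"http://example.org/3", "http://example.org/4"},
        "engine_c": {"http://example.org/5"},
    }
    for (name1, urls1), (name2, urls2) in combinations(results.items(), 2):
        print(name1, name2, round(overlap(urls1, urls2), 2))
```

A value near 0 for most pairs means that each engine contributes largely unique results, which is the situation in which combining several engines pays off.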

Besides, some engines have applied rather peculiar criteria for providing results and rankings, apparently driven more by commercial reasons.

Heterogeneity of information (format)

The Web is sometimes referred to as the world's largest database because of its huge volume of information. Unfortunately, that assertion is not true, as there is no standard format or universal agreement on the structure and presentation of the "records" of the Web.

In some cases no title is available, or it is provided only for the home page, or it does not describe the exact contents at all. Often the author or institution is not even mentioned, and there is no possibility of contacting them because no postal address or email is provided.

Another source of heterogeneity is the information provided about updates. Often no date is given, so it is very difficult to determine whether the data is current or how frequently it is updated. In some cases updating is very infrequent or clearly irregular.

Particularly frustrating are the conflicts between the type of information provided and browser capabilities (frames, JavaScript, colour combinations, unsupported HTML tags), so that many resources require special software in order to be retrieved.

Some proposals suggest an approach similar to the "cataloguing in print" data that appear in books, using the META tag for this purpose and developing a sometimes complex metadata scheme. Even here there is no universal agreement among the different proposals for arranging the meta-resources embedded in sites. As this could eventually provide a good tool for sampling the Web, we provide a list of the main initiatives:

PICS

Dublin Core

GILS

SOIF (discontinued)

CDF

MCF

RDF

SGML

TEI

EAD

DOI
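As an illustration of how metadata embedded with the META tag could be harvested for sampling purposes, the following minimal sketch collects name/content pairs from a page using Python's standard html.parser module. The sample page, its values and the helper names are hypothetical; the "DC." prefix follows the usual convention for Dublin Core elements in HTML, and real pages may use other schemes or none at all.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name, content = fields.get("name"), fields.get("content")
        if name and content is not None:
            # Names are lower-cased so that DC.Title and dc.title match.
            self.meta[name.lower()] = content


# Hypothetical page used only to exercise the parser.
SAMPLE = """<html><head>
<title>Sample page</title>
<meta name="DC.Title" content="Searching the Web">
<meta name="DC.Date" content="2000-01-11">
<meta name="keywords" content="search engines, cybermetrics">
</head><body></body></html>"""


def extract_meta(html: str) -> dict:
    """Return a dict of meta names (lower-cased) to their content values."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta


if __name__ == "__main__":
    for name, content in extract_meta(SAMPLE).items():
        print(name, "=", content)
```

Pages without such tags simply yield an empty dictionary, which is itself a useful signal when estimating how widely a metadata scheme has been adopted.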

 

Another formal problem with practical consequences is the existence of very "deep" sites, with more than twenty levels ("deep degree"). Some of this information is not easy to recover, as the configuration of many robots is restricted to a limited number of levels.
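The effect of a depth limit can be sketched by approximating the level of a page with the number of path segments in its URL. This is only a proxy, since the level as seen by a robot is really the number of links followed from the home page and measuring that would require a crawl; the function names, URLs and limit below are hypothetical.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Number of non-empty path segments; the site root has depth 0."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def within_depth(urls, max_depth: int):
    """Keep only the URLs a robot limited to max_depth levels would reach."""
    return [url for url in urls if url_depth(url) <= max_depth]


if __name__ == "__main__":
    # Hypothetical URLs from one site.
    pages = [
        "http://example.org/",
        "http://example.org/a/index.html",
        "http://example.org/a/b/c/d/e/report.html",
    ]
    for page in pages:
        print(url_depth(page), page)
    print(within_depth(pages, 3))
```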

A large fraction of content is "invisible" to the global search engines. The "infranet", or invisible Internet, is especially important in certain disciplines, as it covers library catalogues, several kinds of databases (bibliographical, factual and textual), some electronic magazines, other password-protected material and information available only after registration.

Heterogeneity of information (quality content)

As mentioned above, some formal shortcomings have consequences for quality evaluation:

• No indication of author or responsibility
• No exact date of creation or modification of the page

Nevertheless, the main problem is the doubtful origin of some data, whose quality has not been verified or which lack supporting evidence. For R&D information, some authors have pointed to this fact to suggest applying a peer review process to web publications, as strong control of the contents is required.

We are developing this section in greater detail, but you can find relevant information in this webliography on quality evaluation:

Beyond Surfing: Tools and Techniques for Searching the Web
Building Quality non-WWW Resources
Critically Analyzing Information
Evaluating Internet Resources
Evaluating Quality on the Net
Evaluating Web Resources
Evaluating Web Sites
Evaluating Web Sites for Educational Uses: Bibliography and Checklist
Evaluating Web-based Resources
Evaluating World Wide Web Information
Evaluation of information sources
Karen Campbell: Understanding and comparing web search tools
Library Selection Criteria for WWW Resources
Selection Criteria for Quality Controlled Information Gateways (DESIRE)
Six Quests for The Electronic Grail
The Good, The Bad and The Ugly, or, Why It's a Good Idea to Evaluate Web Sources

Especially interesting is the effort of some scientific groups, such as medical ones, as described on some websites:

• Net Scoring : critères de qualité de l'information de santé sur l'Internet, including a proposal for using "sitations" for quality evaluation
• Eysenbach G, Diepgen TL, Muir Gray JA, Bonati M, Impicciatore P, Pandolfini C, and Arunachalam S. (1998). "Towards quality management of medical information on the Internet: evaluation, labelling, and filtering of information". British Medical Journal, 317: 1496-1502. <http://bmj.bmjjournals.com/cgi/content/full/317/7171/1496>
• Kim P, Eng TR, Deering MJ, Maxfield A. (1999). "Published criteria for evaluating health related web sites: review". British Medical Journal, 318: 647-649. <http://bmj.bmjjournals.com/cgi/content/full/318/7184/647>
• Ambre, John et al. (1997). "Criteria for Assessing the Quality of Health Information on the Internet". Health Information Technology Institute of Mitretek Systems, Inc. <http://www.mlanet.org/tech_is/meb/criteria.pdf> (May 4, 1999)
• Web Quality Bibliography (for medical information)
• Evaluation of Information
• Pyun, Chong Soo, Lee, Sa Young, Nam, Kiseok. Volatility and information flows in emerging equity market: A case of the Korean Stock Exchange
• Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine
• The Structure of the Web (Science) 2001
• Kumar SR (1999). Trawling the web for emerging cyber-communities
• An interactive tutorial on evaluating the quality of Internet resources
• HON Code of Conduct (HONcode) for medical and health Web sites
• Evaluating Information Found on the Internet
• Smith, Alastair G. Testing the Surf: Criteria for Evaluating Internet Information Resources. The Public-Access Computer Systems Review 8, no. 3 (1997)
• Tillman, Hope (2003). Evaluating Quality on the Net.
• Resource evaluation for BIOME
• Brandt, S (1996). Evaluating Information on the Internet
• Smith, Alastair (1997). Criteria for evaluation of Internet Information Resources
• Jana Liebermann (2000). Evaluating Health Web Sites
• Humphries, LaJean (2002). How To Evaluate A Web Site
• How to Evaluate Medical Information Found on the Internet (1999)
• OASIS: Student Evaluation Methods for World Wide Web Resources
• WWW CyberGuides for Web Evaluation
• Evaluating Information Found on the World Wide Web
• Web Quality Bibliography
• Evaluating Web Sites > Overview - Key Ideas


In summary,

The Internet is not:

A digital virtual library, nor a multidisciplinary database