
Methodology

PRESENTATION

The Webometrics Ranking formally and explicitly adheres to the Berlin Principles of Higher
Education Institutions. The ultimate aim is the continuous improvement and refinement of the
methodologies according to a set of agreed principles of good practices.

During the last year several of the signatories of the Code of Good Practices known as the Berlin
Principles on Ranking of Higher Education Institutions became private for-profit companies,
and the biases of some of the Rankings are now increasingly evident. Although the Webometrics
Ranking still formally and explicitly adheres to the Berlin Principles, we would like to add some
points to these principles:

• A World Ranking is ONE ranking: publishing a series of completely different classifications
with exactly the same data is useless and confusing.
• A World Universities Ranking is a ranking of universities from all over the world, covering
thousands of them, not only a few hundred institutions from the developed world.
• A Ranking backed by a for-profit company exploiting rank-related business should be
checked with care.
• The unexpected presence of certain universities in top positions is a good indicator of the (lack
of) quality of a Ranking, regardless of how supposedly sound its methodologies are.
• Rankings that favor stability between editions and do not explicitly publish individual
changes and the reasons for them (correcting errors, adding or deleting entries, changing
indicators) are violating the code of good practices.
• Rankings based only on research (bibliometrics) are biased against technologies, computer
science, social sciences and humanities, disciplines that usually account for more than half
of the scholars in a standard comprehensive university.
• Rankings should include indicators, even indirect ones, about the teaching mission and the
so-called third mission, considering not only the scientific impact of university activities but
also their economic, social, cultural and political impact.
• World-class universities are not small, very specialized institutions.
• Surveys are not a suitable tool for World Rankings, as there is not even a single individual
with deep (several semesters per institution), multi-institutional (several dozen),
multidisciplinary (hard sciences, biomedicine, social sciences, technologies) experience in a
representative sample (different continents) of universities worldwide.

Link analysis is a far more powerful tool for quality evaluation than citation analysis, which only
counts formal recognition between peers, whereas links include not only bibliographic citations but
also third parties' involvement with university activities.

0) Background of the project.

The “World Universities' ranking on the Web” is an initiative of the Cybermetrics Lab, a
research group of the Centro de Ciencias Humanas y Sociales (CCHS), part of the National
Research Council (CSIC), the largest public research body in Spain.
The Cybermetrics Lab is devoted to the quantitative analysis of Internet and Web contents, especially
those related to the processes of generation and scholarly communication of scientific knowledge.
This is a new emerging discipline that has been called Cybermetrics (our team developed and has
published the free electronic journal Cybermetrics since 1997) or Webometrics.

With these rankings we intend to provide extra motivation to researchers worldwide for publishing
more and better scientific content on the Web, making it available to colleagues and people
wherever they are located.

The "Webometrics Ranking of World Universities" was officially launched in 2004, and it is
updated every 6 months (data collected in January and July and published one month later). The
Web indicators used are based on and correlated with traditional scientometric and bibliometric
indicators, and the goal of the project is to convince academic and political communities of the
importance of web publication, not only for the dissemination of academic knowledge but also for
measuring scientific activities, performance and impact.

A) Purposes and Goals of Rankings

1. Assessment of higher education (processes, and outputs) in the Web. We provide Web indicators
and are already publishing comparative analyses with similar initiatives. But the current objective
of the Webometrics Ranking is to promote Web publication by universities, evaluating the commitment
of these organizations to electronic distribution, and to fight a very concerning academic digital
divide which is evident even among world universities from developed countries. However, even
though we do not intend to assess university performance solely on the basis of web output, the
Webometrics Ranking measures a wider range of activities than the current generation of
bibliometric indicators, which focus only on the activities of the scientific elite.

2. Ranking purpose and target groups. The Webometrics Ranking measures the volume, visibility
and impact of the web pages published by universities, with special emphasis on scientific output
(refereed papers, conference contributions, pre-prints, monographs, theses, reports, …) but also
taking into account other materials (courseware, seminar or workshop documentation, digital
libraries, databases, multimedia, personal pages, …) and the general information on the institution,
its departments, research groups or supporting services, and the people working or attending courses.
The direct target group for the Ranking is the university authorities. If the web performance of an
institution is below the position expected from its academic excellence, the authorities should
reconsider their web policy, promoting substantial increases in the volume and quality of their
electronic publications.

Faculty members are an indirect target group, as we expect that in the near future web information
could be as important as other bibliometric and scientometric indicators for the evaluation of the
scientific performance of scholars and their research groups. Finally, candidate students should not
use this data as the sole guide for choosing a university, although a top position means that the
institution has a policy that encourages new technologies and has resources for their adoption.

3. Diversity of institutions: Missions and goals of the institutions. Quality measures for research-
oriented institutions, for example, are quite different from those that are appropriate for institutions
that provide broad access to underserved communities. Institutions that are being ranked and the
experts that inform the ranking process should be consulted often.

4. Information sources and interpretation of the data provided. Access to Web information is
done mainly through search engines. These intermediaries are free, universal, and very powerful
even when considering their shortcomings (coverage limitations and biases, lack of transparency,
commercial secrets and strategies, irregular behaviour). Search engines are key for measuring the
visibility and impact of university websites.

There is a limited number of sources that can be useful for webometric purposes: 7 general search
engines (Google*, Yahoo Search*, Live (MSN) Search*, Exalead*, Ask (Teoma), Gigablast and
Alexa) and 2 specialised scientific databases (Google Scholar* and Live Academic). All of them
have very large independent databases, but because of the availability of their data collection
procedures (APIs), only those marked with an asterisk are used in compiling the Webometrics
Ranking.

5. Linguistic, cultural, economic, and historical contexts. The project intends to have true global
coverage, not narrowing the analysis to a few hundred institutions (world-class universities) but
including as many organizations as possible. The only requirement in our international rankings is
having an autonomous web presence with an independent web domain. This approach allows a
larger number of institutions to monitor their current ranking and the evolution of this position after
adopting specific policies and initiatives. Universities in developing countries have the opportunity
to know precisely the indicators' threshold that marks the limit of the elite.

Currently identified biases of the Webometrics Ranking include the traditional linguistic one (more
than half of internet users are English-speaking people) and a new disciplinary one (technology
instead of biomedicine is at the moment the hot topic). Since in most cases the infrastructure (web
space) and the connectivity to the Internet already exist, the economic factor is not considered a
major limitation (at least for the top 3,000 universities).

B) Design and Weighting of Indicators

6. Methodology used to create the rankings. The unit for analysis is the institutional domain, so
only universities and research centres with an independent web domain are considered. If an
institution has more than one main domain, two or more entries are used with the different
addresses. About 5-10% of the institutions have no independent web presence, most of them located
in developing countries. Our catalogue of institutions includes not only universities but also other
Higher Education institutions following the recommendations of UNESCO. Names and addresses
were collected from both national and international sources including among others:

Universities Worldwide              univ.cc
All Universities around the World   www.bulter.nl/universities/
Braintrack University Index         www.braintrack.com
Canadian Universities               www.uwaterloo.ca/canu
UK Universities                     www.scit.wlv.ac.uk/ukinfo
US Universities                     www.utexas.edu/world/univ/state

University activity is multi-dimensional and this is reflected in its web presence. So the best way to
build the ranking is to combine a group of indicators that measure these different aspects. Almind
& Ingwersen proposed the first Web indicator, the Web Impact Factor (WIF), based on link analysis,
which combines the number of external inlinks and the number of pages of the website, a ratio of 1:1
between visibility and size. This ratio is used for the ranking, but two new indicators are added to
the size component: the number of documents, measured from the number of rich files in a web domain,
and the number of publications collected by the Google Scholar database. As already noted, the four
indicators were obtained from the quantitative results provided by the main search engines as
follows:

Size (S). Number of pages recovered from four engines: Google, Yahoo, Live Search and Exalead.
For each engine, results are log-normalised to 1 for the highest value. Then for each domain,
maximum and minimum results are excluded and every institution is assigned a rank according to
the combined sum.
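The Size procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact log-normalisation is not spelled out in the text, so the sketch assumes log(1 + count) divided by log(1 + highest count) per engine, and the domain names and page counts are hypothetical.

```python
import math

def log_normalise(counts):
    """Scale one engine's raw counts so that the highest value maps to 1.0.

    Assumption: normalised = log(1 + count) / log(1 + highest count).
    """
    top = max(counts.values())
    if top <= 0:
        return {domain: 0.0 for domain in counts}
    return {d: math.log1p(c) / math.log1p(top) for d, c in counts.items()}

def size_ranks(counts_by_engine):
    """Return {domain: rank} from {engine: {domain: page_count}}.

    For each domain the highest and lowest normalised values are excluded,
    the remaining values are summed, and domains are ranked by that sum.
    """
    normalised = {e: log_normalise(c) for e, c in counts_by_engine.items()}
    domains = next(iter(counts_by_engine.values())).keys()
    combined = {}
    for domain in domains:
        values = sorted(normalised[e][domain] for e in normalised)
        combined[domain] = sum(values[1:-1])  # drop minimum and maximum
    ordered = sorted(combined, key=combined.get, reverse=True)
    return {domain: position + 1 for position, domain in enumerate(ordered)}

# Hypothetical page counts for three made-up domains.
EXAMPLE_COUNTS = {
    "google":  {"a.example": 900000, "b.example": 50000, "c.example": 1200},
    "yahoo":   {"a.example": 700000, "b.example": 80000, "c.example": 900},
    "live":    {"a.example": 300000, "b.example": 20000, "c.example": 2000},
    "exalead": {"a.example": 100000, "b.example": 10000, "c.example": 500},
}

if __name__ == "__main__":
    print(size_ranks(EXAMPLE_COUNTS))
```

Dropping the extreme values per domain is what makes the combined figure less sensitive to a single engine that over- or under-covers a given site.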

Visibility (V). The total number of unique external links received (inlinks) by a site can only be
confidently obtained from Yahoo Search. Results are log-normalised to 1 for the highest value and
then combined to generate the rank.

Rich Files (R). After evaluation of their relevance to academic and publication activities and
considering the volume of the different file formats, the following were selected: Adobe Acrobat
(.pdf), Adobe PostScript (.ps), Microsoft Word (.doc) and Microsoft Powerpoint (.ppt). These data
were extracted using Google and merging the results for each filetype after log-normalising in the
same way as described before.

Scholar (Sc). Google Scholar provides the number of papers and citations for each academic
domain. These results from the Scholar database represent papers, reports and other academic items.

The four ranks were combined according to a formula where each one has a different weight:
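As a minimal sketch of such a weighted combination of ranks, in Python: the weights below are hypothetical placeholders, not the published formula. The text states only that visibility and the size-related indicators are balanced 1:1, so the split among Size, Rich Files and Scholar is an assumption made for illustration, as are the domain names and rank values.

```python
# Hypothetical weights. Only the 1:1 balance between visibility (V) and the
# size component (S + R + Sc) comes from the text; the inner split is assumed.
ASSUMED_WEIGHTS = {"V": 0.50, "S": 0.20, "R": 0.15, "Sc": 0.15}

def combined_score(ranks, weights=ASSUMED_WEIGHTS):
    """Weighted sum of the four indicator ranks (lower is better)."""
    return sum(weights[name] * ranks[name] for name in weights)

def final_ranking(ranks_by_domain, weights=ASSUMED_WEIGHTS):
    """Order domains from best to worst by their weighted rank sum."""
    scores = {d: combined_score(r, weights) for d, r in ranks_by_domain.items()}
    return sorted(scores, key=scores.get)

# Made-up indicator ranks for two made-up domains.
EXAMPLE_RANKS = {
    "a.example": {"V": 1, "S": 2, "R": 1, "Sc": 3},
    "b.example": {"V": 2, "S": 1, "R": 2, "Sc": 1},
}

if __name__ == "__main__":
    print(final_ranking(EXAMPLE_RANKS))
```

Because ranks rather than raw counts are combined, a domain with a better visibility rank can stay ahead even when it trails on two of the three size-related indicators.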

7. Relevance and validity of the indicators. The choice of the indicators was done according to
several criteria (see note), some of them trying to catch quality and academic and institutional
strengths but others intending to promote web publication and Open Access initiatives. The
inclusion of the total number of pages is based on the recognition of a new global market for
academic information, so the web is the adequate platform for the internationalization of the
institutions. A strong and detailed web presence providing exact descriptions of the structure and
activities of the university can attract new students and scholars worldwide. The number of external
inlinks received by a domain is a measure that represents the visibility and impact of the published
material, and although there is a great diversity of motivations for linking, a significant fraction
works in a similar way to bibliographic citation. The success of self-archiving and other
repository-related initiatives can be roughly represented by rich file and Scholar data. The huge
numbers involved with the pdf and doc formats mean that more than administrative reports and
bureaucratic forms are involved. PostScript and Powerpoint files are clearly related to academic
activities.

8. Measure outcomes in preference to inputs whenever possible. Data on inputs are relevant as they
reflect the general condition of a given establishment and are more frequently available. Measures
of outcomes provide a more accurate assessment of the standing and/or quality of a given institution
or program. We expect to offer a better balance in the future, but the current edition intends to
draw attention to incomplete strategies, inadequate policies and bad practices in web publication
before attempting a more complete scenario.

9. Weighting the different indicators: Current and future evolution. The current rules for ranking
indicators, including the described weighting model, have been tested and published in scientific
papers. Further research on this topic is under way, but the final aim is to develop a model that
includes additional quantitative data, especially bibliometric and scientometric indicators.

C) Collection and Processing of Data

10. Ethical standards. We identified some relevant biases in the search engines data including
under-representation of some countries and languages. As the behaviour is different for each
engine, a good practice consists of combining results from several sources. Any other mistake or
error is unintentional and it should not affect the credibility of the ranking. Please contact us if you
think the ranking is not objective and impartial in any way.

11. Audited and verifiable data. The only source for the data of the Webometrics Ranking is a small
set of globally available, free access search engines. All the results can be duplicated according to
the described methodologies, taking into account the explosive growth of web contents, their
volatility and the irregular behaviour of the commercial engines.

12. Data collection. Data are collected during the same week, in two consecutive rounds for each
strategy, with the higher value being selected. Every website under a common institutional domain is
explored, but no attempt has been made to combine contents or links from different domains.
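The two-round rule can be sketched in Python as below. The function name, the treatment of a domain missing from one round (counted as 0) and the sample figures are assumptions made for illustration only.

```python
def best_of_rounds(first_round, second_round):
    """Keep, for every domain, the higher of the two collected values.

    Assumption: a domain missing from one round counts as 0 in that round.
    """
    domains = first_round.keys() | second_round.keys()
    return {
        domain: max(first_round.get(domain, 0), second_round.get(domain, 0))
        for domain in domains
    }

if __name__ == "__main__":
    # Made-up counts from two consecutive collection rounds.
    round_one = {"a.example": 100, "b.example": 40}
    round_two = {"a.example": 90, "b.example": 55}
    print(best_of_rounds(round_one, round_two))
```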

13. Quality of the ranking processes. After automatic collection of data, positions are checked
manually and compared with previous editions. Some of the processes are duplicated and new
expertise is added from a variety of sources. Pages that linked to the Webometrics Ranking are
explored and comments from blogs and other fora are taken into account. Finally, our mailbox
receives a lot of requests and suggestions that are acknowledged individually.

14. Organizational measures to enhance credibility. The ranking results and methodologies are
discussed in scientific journals, and presented in international conferences. We expect international
advisory or even supervisory bodies to take part in future developments of the ranking.

D) Presentation of Ranking Results

15. Display of data and factors involved. The published tables show all the Web indicators used in a
very synthetic and visual way. Rankings are provided not only from a central Top 4000
classification but also considering several regional rankings for comparative purposes.

16. Updating and error reducing. The listings are offered from dynamic asp pages built on several
databases that can be corrected when errors or typos are detected.

Glossary

This section is really a hybrid between a real Glossary and a FAQ that intends to explain some of
the terms and the meanings as used in the building of this ranking.

Database size. The number of records in the search engine databases that is publicly accessible from
external sources. Not all the robots crawl the Web at the same time or with identical procedures;
in addition, post-crawling processes and other commercial requirements result in really different
databases. The current size, composition and evolution of the figures are a relevant point in
webometric analysis.

Delimited search. A key characteristic of the search engines that allows cybermetric analysis. A
delimiter operator has a specific syntax and meaning that can differ among engines. It provides the
number of records (web pages) that satisfy a certain condition, filtering the results according to
strings in the address (URL) or other characteristics (language, format) of the page. Of special
relevance is the link delimiter, which can be used in combination with site or similar delimiters
to calculate inlinks.

Discipline differences. The ranking does not provide any kind of thematic assignation to the units,
so a formal thematic analysis is not possible at the moment. But there are important differences
regarding academic focus on our universities database that should be taken into account. Research
focused universities are mixed with learning institutions and a group of discipline oriented (mainly
pedagogy, medicine and theology) organizations are also present.

Formal characteristics. As there is neither universal document control nor formal guidelines for web
page building, there is a huge diversity of formal aspects in the Webspace, including obvious
malpractices. Some authors have focused on these to provide new indicators such as link density,
link quality, expressed as ratios of non working links, missing tags, including those so relevant as
title or metadata, or updating frequency. None of these characteristics are taken into account in our
rankings, but they should be taken into consideration for micro-analysis.

Geographical biases. The use of several search engines in our ranking is due to the geographical
bias observed in some of them. We do not know whether this is due to topological or traffic problems
in the network (some eastern Asian countries are usually poorly covered) or to the crawlers'
behaviour, or whether the biases are constant over time. Alexa's biases preclude us from adding its
popularity data to our rankings.

Institutional domains. The basic unit of our analysis is the common URL domain shared by
all the web sites of an institution. Unfortunately some organizations maintain two or more
equivalent domains without marking one as preferred. Also of concern is the fact that some
second-level departments maintain completely different domains. Usually we maintain two entries for
those institutions with two equivalent top-level domains. We intend to merge the results of smaller
domains with those of the main one in the near future, but it is a difficult task.

Invocation. The presence of the name of an institution or a researcher in a Web page. The global
presence is the number of times the name appears in the Web and can be calculated easily using
quotation marks around the name in the search engines. Sometimes this figure is referred to as the
number of times this name is cited in the Web. Some authors refer to this as Web visibility, although
we prefer to reserve this word for link visibility. This indicator usually favours large, well-known,
old institutions independently of their real effort to have a relevant Web presence.

No invocation measure was used in our ranking, mainly because it is not possible to assign a
unique, unambiguous universal name for every institution.

Invisible Web. Traditionally refers to the information available through gateways or search
interfaces that is not accessible by the search engines’ robots. It is a huge part of the Internet
content, including library catalogues, bibliographic and alphanumeric databases or even some
repositories of documents. In recent years some engines, especially Google, have made a great
effort to index these records, and in fact several databases are more or less covered in their systems
(e.g. PubMed is partially indexed by Google). Our ranking does not consider the Invisible or Deep
Web, and we encourage transforming it into crawler-friendly information.

Language. English is the “lingua franca” for scientific communication and it is also the language of
a significant fraction of internet users. Non-English institutions publishing only in their mother
tongue achieve lower visibility than those with multilingual websites.

Link motivation. A major concern in link analysis is the motivation behind the creation of a link.
Previous studies suggest that “sitations”, the hypertextual equivalent of bibliographic citations,
are still rare. We think this situation will improve when more papers become available on the Web,
but we consider other reasons for linking very useful for describing scholarly communication.
Informal linking is a powerful source of information about the intellectual, economic and political
connections of academic and scientific activities.

CATEGORY            CASE                                      COMMENTS
Sitation            Link to paper or document                 Generally in pdf/ps/doc format
Teaching/learning   Link to course materials                  Mainly html pages but also pdf, doc or ppt
Research oriented   Resources index                           Portal type
                    Software repository
                    Research projects sites
                    Conferences, seminars or meetings pages
                    Raw data                                  Including media files if applicable
                    Self archive                              Pre or post prints, but also unpublished material
Personal            Team or colleagues pages
                    Blog
Institutional       Third parties (non-research)
                    Parent institution                        And related ones
                    Funding organization

Link popularity. Another term to refer to link visibility that has been used extensively. We prefer to
reserve popularity for the measure of number of visits. Although not yet implemented on the
Ranking, we intend to consider number of visits or popularity as a relevant factor for our rankings
in the future.

Open access. The movement to distribute openly the scientific production of, at least, publicly
funded researchers is facing tougher opposition than expected. A strong commitment to open access
initiatives will be clearly reflected in our rankings.

Personal pages. A frequently heard statement about web contents quality is related to the
information provided by the personal pages of students or staff members. There is a lot of free space
hosted by the university web servers that is used for personal purposes, and it is generally thought
to be filled with low-quality or non-academic information. Data suggest that a large number of
small websites are crowding the institutional domains, but most of them are interesting enough to
merit consideration. Some “personal” pages are in fact research group sites, while others are
institutional (scientific societies, electronic bulletins, conference sites). True personal pages cover
both extremes of the content range, from people offering only CVs to others providing very large
collections of information on their academic or research topics, with links to personal repositories
of documents. A striking pattern is the absence of links to colleagues' websites or other institutions.

Quality. We advise against the use of the rankings as global or partial indicator of quality. Impact or
visibility describes better our aims, but in the particular context of promotion of open and universal
access to the scientific activities and results through the Web.

Ranking. As their main objective is purely commercial, current search engines do not offer
stable, reliable, or trustworthy results for webometric purposes. The situation has improved in
recent years but there are still important biases and a worrisome instability. This is the reason we
are not using absolute values but relative positions for our analysis.

Rich files. A general term comprising a rather heterogeneous group of file types, mainly those
devoted to represent unitary enriched documents, such as MS Word doc, Adobe Acrobat pdf or
PostScript ps. In our analysis we also included MS Powerpoint ppt and excluded xls or latex or tex.
Rich files are relevant because they are used for scholarly communication, as authors usually
distribute their papers and presentations in these formats. Certainly some of these types are used
extensively for bureaucratic purposes (forms, administrative documents, internal reports) but these
can only explain a small percentage of large numbers observed in domains with extensive
repositories.

There are several other file types that can be considered as rich files, and even raw formats like txt
are being used for distributing academic content. But their individual contribution is too low to be
considered.

Rounding. Google and Yahoo offer rounded results, ending in ,000, which means an error rate in
the order of 2 to 5%. Moreover, the number provided by Yahoo on the first page is about another
4-5% higher than the one shown on the following pages, which show a trend towards the “correct”
number.

Search Engine. The software that searches an index and returns matches. Search engine is often
used synonymously with spider and index, although these are separate components that work with
the engine. There are only four engines useful for quantitative analysis purposes, as they have a
large and independent self-crawled database and their retrieval systems allow filtering of results
according to url-related delimiters:

Google         www.google.com
Yahoo Search   search.yahoo.com
Bing           www.bing.com
Exalead        www.exalead.com/search

Self archiving. Self-archiving involves depositing a free copy of a digital document on the World
Wide Web in order to provide open access to it. The term usually refers to the self-archiving of peer
reviewed research journal and conference articles as well as theses, deposited in the author's own
institutional repository or open archive for the purpose of maximizing its accessibility, usage and
citation impact. This practice is common among the most prolific authors and in certain disciplines.
Globally, however, only a minority of authors support this option. As many of these papers
are published as rich files (pdf, ps or doc), this practice notably increases the performance of an
institution in our rankings.

Size. The size of an institutional domain is the combined number of pages of all the websites with
that domain, including html and non html formats that can be assimilated. From a practical point of
view, size refers to the number provided by a search engine when a search like site:domain is done.
This indicator is central for our rankings and it is used also as denominator for Web Impact Factor
calculations by other authors. However, pages vary widely according to different criteria,
including content size measured in bytes. For example, one page may contain a pdf document that is
a monograph of several hundred pages totalling several Mb of text and images, while another page
consists only of the phrase “page under construction”. Global size could be an
interesting indicator and we expect to provide it for selected websites.

Stability. From the early times instability of the search results in general, and of the number that
represents results in particular has been a subject of special concern. Certainly the Web is a highly
dynamic system, growing at an incredible pace, but also the crawlers change their specifications and
schedule unexpectedly. A world crawling round can last from 15 to 45 days and in this meantime.

Visibility. In the context of this ranking, the term refers to link visibility: The number of external
inlinks received by an institutional domain. The most used syntax for this request in search engines
is:

linkdomain:webometrics.info -site:webometrics.info

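The delimited queries mentioned in this glossary can be assembled with a few lines of Python. This sketch only builds the query strings; it does not call any search engine. The function names are hypothetical, and the use of a `filetype:` delimiter for rich files is an assumption, since the text names the formats but not the operator used to count them.

```python
# File formats counted as rich files in the ranking.
RICH_FILE_TYPES = ("pdf", "ps", "doc", "ppt")

def size_query(domain):
    """Query whose result count approximates the size of a domain."""
    return f"site:{domain}"

def visibility_query(domain):
    """Query for external inlinks: links to the domain, excluding self-links."""
    return f"linkdomain:{domain} -site:{domain}"

def rich_file_queries(domain):
    """One query per rich file format (assumed `filetype:` delimiter)."""
    return {ext: f"site:{domain} filetype:{ext}" for ext in RICH_FILE_TYPES}

if __name__ == "__main__":
    print(size_query("webometrics.info"))
    print(visibility_query("webometrics.info"))
    for query in rich_file_queries("webometrics.info").values():
        print(query)
```

Note that the exclusion operator is a plain ASCII hyphen placed directly before `site:`; a typographic dash would be treated as ordinary text.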
Web cost. Maintaining a very large presence on the Web can be quite costly, requiring specific
funding and human resources, but the total cost is far below that of any other publication method
and the potential audience is truly global. One way to undertake large projects is distributed
effort, so that individual graduate students, professors or researchers, scientific teams and other
administrative units have an autonomous web presence. A rich content page should include a large
diversity of objects including images and other media files, a certain amount of navigational links
and a selected group of external outlinks. That can require a huge effort that can only be faced if
these tasks are subject to evaluation like other academic and scientific activities.

Web Impact Factor. The most cited cybermetric indicator, although its usage is not universal due to
several shortcomings. It is defined as the ratio between the external inlinks received by a
website and the number of webpages comprising that website. Some authors have suggested
modifications to the denominator, using alternative measures for the size of the institution
based on non-internet data such as the number of potential authors (staff, professors, graduate
students), economic wealth (funding, projects) or bibliometric data (papers in journals).

Our ranking is derived from WIF in which a ratio 1:1 is established between visibility and size.
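The ratio in the definition above can be written as a short Python function. The function name and the sample figures are hypothetical, and the sketch covers only the basic WIF, not the rank-based combination used in the ranking itself.

```python
def web_impact_factor(external_inlinks, pages):
    """Basic WIF: external inlinks received divided by pages in the website."""
    if pages <= 0:
        raise ValueError("a website needs at least one page to compute WIF")
    return external_inlinks / pages

if __name__ == "__main__":
    # Made-up figures: 12,000 external inlinks to a site of 48,000 pages.
    print(web_impact_factor(12000, 48000))
```

Because the denominator is the page count, a very large site needs proportionally more inlinks to reach the same WIF as a small one.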