Mapping Scientific Frontiers: The Quest for Knowledge Visualization
Second Edition
Chaomei Chen
Chaomei Chen
College of Information Science and Technology
Drexel University
Philadelphia, Pennsylvania
USA
Foreword
Chaomei Chen’s book is important because it builds on many of the concepts and
findings of history, sociology and philosophy of science, but at the same time adds a
new dimension. As an example of the power of the new methods the skeptical reader
should consult chapter eight presenting a case study on recent work on induced
pluripotent stem-cells which shows how mapping can inform historical studies as
well as assist medical researchers to get an overview of a research area. Here we see
the strength of the new methods for exploring and tracking the internal structure of
revolutionary developments in contemporary science.
His book also draws on an even broader disciplinary framework from computer
to information science and particularly information visualization. In the first edition
Chaomei Chen commented on the disciplines that contribute ideas to science
mapping. “Different approaches to mapping scientific frontiers over recent years
are like streams running from several different sources … . A lot of work needs
to be done to cultivate knowledge visualization as a unifying subject matter that
can join several disciplines.” (Chen 2003, p. vii) This remains true even today
when scientometrics, computer science, and network science continue to evolve in a
strangely independent manner, yet often deal with the same underlying data and
issues. This may be an inevitable side effect of the barriers between disciplines, but
hopefully this book will help bridge these various streams.
As an example of the relevance of history of science, Chaomei Chen comments
that the work of Thomas Kuhn was an important backdrop to mapping because
one could think of the unfolding of a revolution in science as a series of cross-
sectional maps that at some point undergoes a radical structural transformation.
Cross sectional thinking is also very much encouraged in the history of science
because historians are exhorted to understand the ideas of a historical period by
entering its mind-set, “to think as they did” (Kuhn 1977, p. 110), and not interpret
older science in terms of our “current” understanding. This is a difficult requirement
because once we know that a new discovery or finding has occurred it is extremely
difficult for us not to be influenced by it, and our first impulse is to find precursors
and antecedents. As analysts we need to take care not to allow the present to distort
the past.
As an example of how various cross-currents converge in science mapping we
could point out the tension between psychological factors, as exemplified by Kuhn’s
gestalt switching as a way of looking at conceptual change, and social forces such
as collegial networks and invisible colleges. Do social relations determine cognitive
relations, or vice versa? In Stanley Milgram’s early work (1967) on social networks,
human subjects were required to think about what acquaintances their acquaintances
had several steps removed. In Don Swanson’s work (1987) on undiscovered public
knowledge, discoveries are made by seeking concepts that are indirectly related
through other concepts that are currently unconnected. Thus the same type of
thinking is involved in both the social and intellectual tasks. If we are dealing with
words or references as our mapping units, then psychology clearly enters the picture
because an author’s memory and recall are involved in the associative process.
But that memory and recall are also influenced by what authors have seen other
authors or colleagues say. If we map individual scientists in their co-author relations,
then social factors must come into play but psychological factors also contribute to
the selection of co-authors. Thus social and psychological factors are inexorably
intertwined in both the social and intellectual structure of science.
The competition in science mapping between the various choices for unit
of analysis such as words, references, authors, journals, etc. and the means of
associating them such as co-word, co-citation, co-authorship, direct citation, etc.
seems to boil down to the types of structures and level of relations we want to
observe. To better understand the role of discovery in specialty development we
might turn to co-citations because many discoveries are associated with specific
papers and authors. On the other hand, if we want to include broader societal, non-
scholarly factors then we might turn to co-words which can more readily capture
public or political sentiments external to science. Journals, a yet broader unit of
analysis, might best represent whole fields or disciplines. Choice of a unit of analysis
also depends on the historical period under investigation. Document co-citation
is probably not feasible prior to 1900 due to the absence of standard referencing
practice. However, name co-mention within the texts of scientific papers and books
is still very feasible for earlier periods. It is instructive to try to imagine how we
would carry out a co-mention or other kind of mapping for some earlier era, say
for scientific literature in the eighteenth century and whether we would be able to
identify the schools of thought and rival paradigms active during the period.
Another important issue is the interpretation of maps. We know that the network
of associations that underlies maps is hyperdimensional, and that projection in two
dimensions is inevitably an approximation and can place weakly related units close
together. This argues for the need to pay close attention to the links themselves
which give rise to the two-dimensional solution in the first place, and which we can
think of as the neurons of the World Brain (Garfield 1968) we are trying to visualize.
Only by knowing what the links signify can we gain a better understanding of what
the maps represent. This will involve looking more deeply at the context in which
the linking takes place, and seeking new ways of representing and categorizing
those relationships, for example, by function or type such as logical, causal, social,
hypothetical, metaphorical, etc. One positive development in this direction, as
described in the final chapter, is the advent of systems for “visual analytics” that
allow us to more deeply probe the underpinnings of maps with the ultimate goal of
supporting decision making.
Part of what is exciting about science mapping is that the landscape is continually
changing: every year there is a new crop of papers and the structure changes as new
areas emerge and existing areas evolve or die off. Some will find such a picture
unsettling and would prefer to see science as a stable and predictable enterprise, but
as Merton has argued (2004), serendipity is endemic to science, and thus also to
science maps. We do not yet know if discovery is in any way predictable, if there
are recognizable antecedents or conditions, or whether discovery or creativity can be
engineered to happen at a quicker pace. But because discoveries are readily apparent
in maps after they occur, we also have the possibility of studying maps for previous
time periods to look for structural antecedents.
Preface for the 2nd Edition
The first edition of Mapping Scientific Frontiers (MSF) was published over 10 years
ago in 2002. Since then, a lot has changed. Social media has flourished to an extent
that we have never seen before. News, debates, hoaxes, and scholarly blogs all fight
for attention on Facebook (launched in 2004), YouTube (2005), and Twitter (2006),
which are made ubiquitously accessible by popular mobile devices such as iPhone
(2007) and iPad (2010).
Over the past 10 years, remarkable scientific breakthroughs have been made,
for example, Grigori Perelman’s proof of the century-old Poincaré Conjecture in
2002, the Nobel Prize winning research on induced pluripotent stem cells (iPSCs)
by Shinya Yamanaka and his colleagues since 2006, and the recent discovery of the
Higgs Boson in 2012 at CERN.
The big sciences continue to get bigger. Large-scale data collection efforts for
scientific research such as the Sloan Digital Sky Survey (SDSS) (2000–2014) in
astronomy represent one of many sources of big data. As old scientific fields
transform themselves, new ones emerge. Visual analytics entered our horizon in
2005 as a new field and has played a critical role ever since in advancing the science
and technology for solving practical issues, especially when we deal with situations
that are full of complex, uncertain, incomplete, and potentially conflicting data.
A representative case is concerned with maintaining the integrity of scientific
literature itself. The increasing number of publications has overshadowed the
increase of retractions. What can be done to maintain a trustworthy body of scientific
knowledge?
What is the role that Mapping Scientific Frontiers has played? According to
Google Scholar, it has been cited by 235 scientific publications on the web. These
publications are in turn cited by an even broader range of articles. These articles
allow us to glimpse the context in which research in science mapping has
been evolving. Interestingly, the citation profile appears to show two stages. The
first one ranges from 2002 to 2008 and the second one from 2009 to the present
(Fig. 1). Citations in the first stage peaked in 2007, whereas citations in the second
stage were evenly distributed for the first 3 years. A study of citation data in the Web
of Science revealed a similar pattern.
Fig. 1 The citation profile of Mapping Scientific Frontiers (Source: Google Scholar)
What is the citation pattern telling us? The nature of the set of articles that cited
Mapping Scientific Frontiers as a whole can be analyzed in terms of how they are
in turn cited by subsequently published articles. In particular, we turn to articles
that have strong citation bursts, or abruptly increased citation rates, during the time
span of 2002–2013. Figure 2 shows 25 articles of this type. Articles in the first
stage shared a unique focus on information visualization and citation analysis. The
original motivation of Mapping Scientific Frontiers was indeed to bridge together
the two fields across the boundaries of different disciplines.
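The idea of a citation burst can be made concrete with a toy detector that flags years whose citation counts rise abruptly above a recent baseline. Burst detection in practice typically relies on Kleinberg's algorithm (as in CiteSpace); the moving-baseline heuristic and the yearly counts in this sketch are simplified assumptions for illustration only.

```python
# A toy burst detector: flag years whose citation count exceeds the
# mean of a preceding window by a chosen factor. The yearly counts
# below are invented for illustration.

def burst_years(counts, window=3, factor=2.0):
    """Return indices whose count is at least `factor` times the
    mean of the previous `window` counts."""
    bursts = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if baseline > 0 and counts[i] >= factor * baseline:
            bursts.append(i)
    return bursts

# Hypothetical yearly citation counts for one article, 2002-2013.
counts = [1, 2, 2, 3, 2, 12, 15, 4, 3, 3, 2, 2]
print([2002 + i for i in burst_years(counts)])  # -> [2007, 2008]
```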
The second stage is dominated by a series of publications dedicated to global
science maps at disciplinary levels as opposed to the document level in the first
stage. The most influential work in the second stage in terms of citation burst is
a 2009 Scientometrics article by Alan L. Porter and Ismael Rafols on the
interdisciplinarity of science. The second highest citation burst is attributed to a
2010 article published in the Journal of the American Society for Information Science
and Technology by Ismael Rafols, Alan L. Porter, and Loet Leydesdorff on science
overlay maps. We are still in the second stage. In terms of the scale and the unit
of analysis, the study of interdisciplinary interactions is a profound and potentially
fruitful way to better understand the dynamics of scientific frontiers.
In addition to the conceptual and theoretical development, researchers today
have a much wider range of choice than before in terms of computational tools
for analyzing, visualizing, and exploring patterns and trends in scientific liter-
ature. Notable examples include CiteSpace, HistCite, VOSViewer, and Sci2 for
scientometric studies and science mapping; GeoTime, Jigsaw, and Tableau for
visual analytics; and Gephi, Alluvial Maps, D3, and WebGL for more generic
information visualization. Today, a critical mass is taking shape and gathering
strength as visual analytic tools, data sources, and exemplars of in-depth and
longitudinal studies become increasingly accessible and inter-operable. Mapping
Scientific Frontiers has reached a new level with a broad range of unprecedented
opportunities to impact scientific activity across so many disciplines.
The second edition of Mapping Scientific Frontiers brings you some of the most
profound discoveries and advances in the study of scientific knowledge and the
dynamics of its evolution. Some of the new additions are highlighted as follows:
Fig. 2 A citation analysis of Mapping Scientific Frontiers reveals two stages of relevant research.
Red bars indicate intervals of citation burst
• The Sloan Digital Sky Survey (SDSS) is featured in Chap. 2 in the context of
what a map of the Universe may reveal.
• In Chap. 3, a series of new examples of visualizing thematic evolution over time
are illustrated, including the widely known ThemeRiver, the elegant TextFlow,
and the versatile Alluvial Maps.
• Chapter 8 is a new chapter. It introduces a framework for predictive analysis
and demonstrates how it can be applied to a fast-advancing field such as regenerative
medicine, which highlights the work that was awarded the 2012 Nobel Prize
in medicine on induced pluripotent stem cells (iPSCs). Chapter 8 also addresses
practical implications of the retraction of a scientific publication. The second half
of Chap. 8 is devoted to the design, construction, and analysis of global science
maps, including our own new design of dual-map overlays.
The second edition in part reflects the result of a continuous research effort
that I have been engaged in since the publication of the first edition. I’d like
to acknowledge the support and contributions of my colleagues, students, and
collaborators in various joint projects and publications, in particular my
current and former students Timothy Schultz (Drexel University, USA), Jian Zhang
(IBM Shanghai, China), and Donald A. Pellegrino (The Dow Chemical Company,
USA), collaborators such as Pak Chung Wong (PNNL, USA), Michael S. Vogeley
(Drexel University, USA), Alan MacEachren (Penn State University, USA), Jared
Milbank (Pfizer, USA), Loet Leydesdorff (The Netherlands), Richard Klavans,
Kevin Boyack, and Henry Small (SciTech Strategies, USA), and Hong Tseng (NIH,
USA).
As a Chang Jiang Scholar, I have had the opportunity to work with the WISELab at
Dalian University of Technology, China, since 2008. I’d like to acknowledge the
collaboration with Zeyuan Liu, Yue Chen, Zhigang Hu, and Shengbo Liu. Yue Chen
is currently leading an ambitious effort to translate the second edition into Chinese.
I particularly appreciate the opportunity to work with Rod Miller, Chief Strategy
Officer at the iSchool of Drexel, and Paul Dougherty, Licensing Manager at the
Office of Technology Commercialization of Drexel University, through numerous
fruitful discussions of various research topics.
I’d like to acknowledge the support of sponsored research and grants from the
National Science Foundation (IIS-0612129, NSFDACS-10P1303, IIP 1160960),
Department of Homeland Security, Pfizer, and IMS Health. I’d also like to take the
opportunity to express my gratitude and appreciation to the hosts of my talks and
keynote speeches, including Michael Dietrich, History of Biology, Woods Hole,
MA; Stephanie Shipp, Institute for Defense Analyses (IDA), Washington D.C.;
Paula Fearon, NIH, Bethesda, MD; David Chavalarias, Mining the Digital Traces
of Science (MDTS), Paris, France; and Josiane Mothe, Institut de Recherche en
Informatique de Toulouse, France.
1 The Dynamics of Scientific Knowledge
Scientific knowledge changes all the time. Most of the changes are incremental, but
some are revolutionary and fundamental. There are two kinds of contributions to
the body of scientific knowledge: persistent and long-lasting ones versus transient
and fast-moving ones. Once widely known theories and interpretations may be
replaced by new theories and new interpretations. Scientific frontiers consist of
the current understanding of the world and the current set of questions that the
scientific community is addressing. Scientific frontiers are not only where one
would expect to find the cutting-edge knowledge and technology of humankind,
but also unsolved mysteries, controversies, battles and debates, and revolutions.
For example, a bimonthly newsletter, Science Frontiers,1 digests scientific reports
of scientific anomalies – observations and facts that do not quite fit into prevailing
scientific theories. This is where the unknown manifests itself in all sorts
of ways.
In this book, we will start with what is known about the structure and dynamics
of scientific knowledge and how information and computational approaches can
help us develop a good understanding of the complex and evolving system. We will
also trace the origin of some of the most fundamental assumptions that underlie
the state of the art in science mapping, interactive visual analytics and quantitative
studies of science. This is not a technical tutorial; instead, our focus is on principles
of visual thinking and the ways that may vividly reveal the dynamics of scientific
frontiers at various levels of abstraction.
1 http://www.science-frontiers.com/
The pioneering innovators in the study of invisible colleges were Nick Mullins,
Susan Crawford, and other sociologists of science. In 1972, Diana Crane argued that
scientific knowledge is diffused through invisible colleges (Crane 1972). The prob-
lems of scientific communication can be understood in terms of interaction between
a complex and volatile research front and a stable and much less flexible information
communication system. The research front creates new knowledge; the formal
communication system evaluates it and disseminates it beyond the boundaries of
the research area that produced it. The research front is continually evolving and
updating its own directions. This dynamic makes it challenging for anyone to
keep abreast of the current state of a research area solely through scholarly articles
circulated in the formal communication system. Research in information science
and scholarly communication has shown that when scientists experience difficulties
in finding information through formal communication channels, a common reason
is the lack of a broader context of where a particular piece of information belongs
in a relatively unfamiliar area.
Philosophy of science and sociology of science, two long-established fields of
study, provide high-level theories and interpretations of the dynamics of science
and scientific frontiers. In contrast, scientometrics is the quantitative study of
science. Its goal is to identify and make sense of empirical patterns that can shed
light on how science functions. Typically, scientometric studies have relied on
scientific literature, notably Thomson Reuters’ Web of Science, Elsevier’s Scopus,
and Google Scholar, patents, awards made by federal government agencies, and,
more recently, social media sources such as Twitter.
Mapping scientific frontiers aims to externalize the big picture of science. Its
origin can be easily traced back to the pioneering work of Eugene Garfield on
the historiography of citations, Belver Griffith and Henry Small on document co-citation
analysis, and Howard White on author co-citation analysis. Today, researchers have
many more options for science mapping software than just 5 years ago. Many of
the major science mapping software applications are freely accessible. Notable
examples include our own software CiteSpace (2003), the Science of Science Tool
(SCI2) (2009) from Indiana University, VOSViewer from the Netherlands (2010),
and SciMAT (2012) from Spain. If we could pick only one software system that has made
the most substantial contribution to the widespread interest in network visualization, I
would choose Pajek. It was probably the first freely available software system for
visualizing large-scale networks. It has inspired many subsequent efforts towards
the development and maintenance of science mapping software tools. Although a new
generation of systems such as Gephi offers various new features, Pajek has earned
a unique position in giving many researchers the first taste of visualizing a large
network.
Mapping scientific frontiers takes more than presenting an intuitively designed
and spectacularly rendered big picture of science. A key question is how one
can identify information that is not only meaningful, but also actionable.
A popular design metaphor that has been adopted for science mapping is the
notion of an abstract landscape with possible contours to highlight virtual valleys
and peaks. Similar landscape metaphors appeared in many earlier designs of
information visualization. What comes naturally with such metaphors is the notion
of exploration and navigation. Landmarks such as peaks of mountains are used
to attract an explorer’s attention. If the shape of the landscape matches the
salient properties of the system that underlies the landscape, then exploring the
system becomes an intuitive and enjoyable navigation through the landmarks that
can be found effortlessly. Many of the earlier information visualization systems
capitalized on the assumption that the more likely an event is to occur,
the more important it is for the user to find that event easily. In contrast, users are
less motivated to visit valleys, or pay attention to events that tend to be associated
with low probabilities. For example, main-stream systems often emphasize high-
frequency topics and highlight prominent authors as opposed to low-frequency
outliers.
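A minimal sketch can make this assumption concrete: treat the landscape's elevation as a frequency-weighted sum of Gaussian kernels centered on topic positions, so that high-frequency topics rise into prominent peaks while low-frequency outliers remain shallow bumps. The topic names, coordinates, and frequencies below are invented for illustration.

```python
import math

# Invented topics: 2-D positions (e.g., from an ordination of a
# similarity network) and occurrence frequencies.
topics = {
    "visualization": ((2.0, 3.0), 120),  # high frequency -> tall peak
    "co-citation":   ((5.0, 1.0), 80),
    "outlier topic": ((8.0, 4.0), 5),    # low frequency -> shallow bump
}

def height(x, y, bandwidth=1.0):
    """Landscape elevation at (x, y): a frequency-weighted sum of
    Gaussian kernels centered on the topic positions."""
    z = 0.0
    for (tx, ty), freq in topics.values():
        d2 = (x - tx) ** 2 + (y - ty) ** 2
        z += freq * math.exp(-d2 / (2 * bandwidth ** 2))
    return z

print(round(height(2.0, 3.0), 1))  # ~120: the dominant peak
print(round(height(8.0, 4.0), 1))  # ~5: barely a bump for the outlier
```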
A different but probably equally thought-provoking analogy may come from
evolutionary biology. Charles Darwin’s natural selection is now a household term
that describes the profound connection between fitness and survival.
The so-called visualism in science holds that what contemporary scientists have
been doing in their daily work is, in essence, to visualize, to interpret, and to explain
(Ihde 1998). What is the metaphor that we can use to visualize scientific frontiers?
Our quest for knowledge domain visualization moves from the mapping of terrestrial and
celestial phenomena in the physical world, through the cartography of conceptual maps and
intellectual structures of scientific literature, to static snapshots and longitudinal
maps featuring the dynamics of scientific frontiers.
There are three simplistic models of how scientific knowledge grows. The most
common one is a cumulative progression of new ideas developing from antecedent
ideas in a logical sequence. Hypotheses derived from theory are tested against
empirical evidence and either accepted or rejected. There is no ambiguity in the
evidence and consequently no disagreement among scientists about the extent to
which a hypothesis has been verified. Many discussions of the nature of scientific
method are based on this model of scientific growth.
An alternative model is that the origins of new ideas come not from the most
recent developments but from any previous development whatever in the history of
the field. In this model, there is a kind of random selection across the entire history
of a cultural area. Price (1965) argues that this kind of highly unstructured growth
is characteristic of the humanities.
The first of these models stresses continuous cumulative growth, the second its
absence. Another type of model includes periods of continuous cumulative growth
interspersed with periods of discontinuity. A notable representative is Kuhn’s theory
of scientific revolutions. In Kuhn’s terminology, periods of cumulative growth are
normal science. The disruption of such cumulative growth is characterized by crisis
or revolution.
One of the most influential works in the twentieth century is the theory of the
structure of scientific revolutions by Thomas Kuhn (1922–1996) (1962). Before
Kuhn’s Structure, philosophy of science had been dominated by what is known as
the logical empirical approach. Logical empiricism uses modern formal logic
to investigate how scientific knowledge could be connected to sense experience.
It emphasizes the logical structure of science rather than its psychological and
historical development.
Kuhn argued that logical empiricism cannot adequately explain the history
of science. He claimed that the growth of scientific knowledge is characterized by
revolutionary changes in scientific theories. According to Kuhn, most of the time
scientists are engaged in a stage of an iterative process – normal science. The stage
of normal science is marked by the dominance of an established framework, or
paradigm. The majority of scientists would work on specific hypotheses within
such paradigms. The foundation of a paradigm largely remains unchallenged until
new discoveries cast more and more doubts, or, anomalies, over the foundation.
As more and more anomalies build up, scientists begin to examine basic assump-
tions that have been taken for granted. This re-examination marks a period of crises.
To resolve such crises, radically new theories with greater explanatory power may
replace the current paradigms that are in trouble. Such replacements are often
view-changing, revolutionary, and transformative in nature. As the
new paradigm is accepted by the scientific community, science enters another period
of normal science. Scientific revolutions, as Kuhn claimed, are an integral part of
science and science progresses through such revolutionary changes. Although the
most common perception of the paradigm shift theory implies the rarity and severity
of such change, such view-changing events are much more commonly found at
almost all levels of science, from topics and fields of study to entire disciplines.
Kuhn characterized the structure of scientific revolutions in terms of the
dynamics of competing scientific paradigms. His theory provides deep insights into
the mechanisms that operate at macroscopic levels and offers ways to explain the
history of science in terms of the tension between radical changes and incremental
extensions. The revolutionary transformation of science from one paradigm to
another – a paradigm shift – is one of the most widely known concepts not only in
scientific communities but also to the general public. The Copernican revolution is
a classic example of a paradigm shift. It marked the change from the geo-centric
to the solar-centric view of our solar system. Another classic example is Einstein’s
general relativity, which took over the authoritative place of Newtonian mechanics
and became the new predominant paradigm in physics.
Stephen Toulmin (1922–2009), a British philosopher of science, suggested a
“Darwinian” model of scientific disciplines: the more disciplines there are in which
a given theory is applicable, the more likely the theory will survive. A similar point
is made by a recent study of the value of ideas in a quite different context by Kornish
and Ulrich (2011). It found that more valuable ideas tend to connect many different
topics.
Although Kuhn’s theory has been broadly received, philosophers criticized it in
several ways. In particular, the notion of incommensurability between competing
paradigms was heavily criticized. Incommensurability refers to the communicative
barrier between different paradigms; it can be taken as a challenge to the possibility
of a rational evaluation of competing paradigms using external standards. If that were
the case, the argument would lead to the irrationality of science.
Margaret Masterman (1970) examined Kuhn’s discussion of the concept of
paradigms and found that Kuhn’s definitions of a paradigm can be separated into
three categories:
1. Metaphysical paradigms, in which the crucial cognitive event is a new way of
seeing, a myth, a metaphysical speculation
2. Sociological paradigms, in which the event is a universally recognized scientific
achievement
3. Artifact or construct paradigms, in which the paradigm supplies a set of tools
or instrumentation, a means for conducting research on a particular problem, a
problem-solving device.
She emphasized that the third category is most suitable to Kuhn’s view of
scientific development. Scientific knowledge grows as a result of the invention of
a puzzle-solving device that can be applied to a set of problems producing what
Kuhn has described as “normal science.”
In this book, we will focus on puzzle-solving examples in this category. For
example, numerous theories have been proposed to explain what caused the
extinction of dinosaurs 65 million years ago; scientists are still debating on this
topic. Similarly, scientists are still skeptical about what causes brain diseases in
sheep, cattle, and humans. These topics share some common characteristics:
• interpretations of available evidence are controversial
• conclusive evidence is missing
• the current instruments are limited
Mapping the dynamics of competing paradigms is an integral part of our quest
for mapping scientific frontiers. We will demonstrate some intriguing connections
between Kuhn’s view on paradigm shifts and patterns identified from scholarly
publications.
Information scientists are concerned with patterns of scientific communications
and intellectual structures of scientific disciplines. Since the 1970s, information
scientists have looked for signs of competing paradigms in scientific literature,
for example, a rapid change of research focus within a short period of time. In
1974, Henry Small and Belver Griffith were among the first to address issues
concerning identifying and mapping specialties from the structure of scientific
literature by tapping into co-citation patterns as a grouping mechanism (Small and
Griffith 1974). In a longitudinal study of collagen research published in 1977,
Small demonstrated how collagen research underwent some rapid changes of its
focus at a macroscopic level (Small 1977). He used data from the Science Citation
Index (SCI) to group documents together based on how tightly they were co-
cited in subsequently published articles. Groupings of co-cited documents were
considered as a representation of leading specialties, or paradigms. Small used the
multidimensional scaling technique to map highly cited articles each year in clusters
on a two-dimensional plane. An abrupt disappearance of a few key documents in
the leading cluster in 1 year and the rapidly increased number of documents in the
leading cluster in the following year indicate an important type of specialty change –
a rapid shift in research focus – which is an indicator of “revolutionary” changes.
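The core of this procedure can be illustrated with a short sketch: count how often pairs of documents appear together in the reference lists of citing papers, convert co-citation strength into distances, and project the documents onto a plane with multidimensional scaling. The reference lists are fabricated and scikit-learn's MDS serves as a convenient modern stand-in, so this illustrates the idea rather than reconstructing the 1977 study.

```python
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.manifold import MDS

# Fabricated reference lists of five citing papers.
citing_papers = [
    ["A", "B", "C"], ["A", "B"], ["B", "C", "D"], ["C", "D"], ["A", "C"],
]

# Two documents are co-cited whenever one paper cites both of them.
cocite = Counter()
for refs in citing_papers:
    for pair in combinations(sorted(set(refs)), 2):
        cocite[pair] += 1

docs = sorted({d for refs in citing_papers for d in refs})
idx = {d: i for i, d in enumerate(docs)}

# Convert counts into distances: strongly co-cited pairs end up close.
dist = np.ones((len(docs), len(docs)))
np.fill_diagonal(dist, 0.0)
max_count = max(cocite.values())
for (a, b), c in cocite.items():
    dist[idx[a], idx[b]] = dist[idx[b], idx[a]] = 1.0 - c / (max_count + 1)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
for doc, (x, y) in zip(docs, coords):
    print(f"{doc}: ({x:+.2f}, {y:+.2f})")
```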
We can draw some useful insights from studies of thematic maps of geographic
information. For example, if people study a geographic map first and read relevant
text later, they can remember more information from the text (Rittschof et al.
1994). Traditionally, a geographic map shows two important types of information:
structural and feature information. Structural information helps us to locate
individual landmarks on the map and determine spatial relations among them. Feature
information refers to detail, shape, size, color, and other visual properties used to
depict particular items on a map. When people study a map, they first construct
a mental image of the map’s general spatial framework and add the landmarks
into the image subsequently (Rittschof et al. 1994).
In Thagard’s framework of conceptual change, adding a link between two concepts
establishes a new but possibly weak connection. Removing an existing link can
be seen as a result of a decay of its strength; such links no longer have a strong enough
presence in the system to be taken into account. Figure 1.1 illustrates how an old
system #1 is replaced by a new system #2 in this manner. Using this framework,
Thagard identified nine steps to make conceptual changes (the last two are sketched in code after the list):
1. Adding a new instance, for example that the blob in the distance is a whale.
2. Adding a new weak rule, for example that whales can be found in the Arctic
Ocean.
3. Adding a strong rule that plays a frequent role in problem solving and explana-
tion, for example that whales eat sardines.
4. Adding a new part-relation, also called decomposition.
5. Adding a new kind-relation, for example that a dolphin is a kind of whale.
6. Adding a new concept, for example narwhale.
7. Collapsing part of a kind-hierarchy, abandoning a previous distinction.
8. Reorganizing hierarchies by branch jumping, that is, shifting a concept from one
branch of a hierarchical tree to another.
9. Tree switching, that is, changing the organizing principle of a hierarchical tree.
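As a minimal sketch of the two most radical operations, a kind-hierarchy can be represented as a concept-to-parent map: branch jumping moves a concept to a different branch, while tree switching would discard the map and rebuild it around a new organizing principle. The representation and the whale example are illustrative assumptions, not Thagard's own formalism.

```python
# A kind-hierarchy as a mapping from concept to parent concept.
hierarchy = {
    "whale": "fish",     # a pre-revolutionary classification
    "fish": "animal",
    "mammal": "animal",
    "dolphin": "whale",  # "a dolphin is a kind of whale" (step 5 above)
}

def ancestors(h, concept):
    """Walk the kind-relations from a concept up to the root."""
    chain = []
    while concept in h:
        concept = h[concept]
        chain.append(concept)
    return chain

def branch_jump(h, concept, new_parent):
    """Step 8: shift a concept from one branch of the tree to another."""
    revised = dict(h)
    revised[concept] = new_parent
    return revised

print(ancestors(hierarchy, "whale"))                 # ['fish', 'animal']
revised = branch_jump(hierarchy, "whale", "mammal")
print(ancestors(revised, "whale"))                   # ['mammal', 'animal']
# Step 9, tree switching, would be more drastic still: discarding this
# parent map and rebuilding it around a new organizing principle.
```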
Branch jumping and tree switching are much rarer events associated with
conceptual revolutions. Thagard examined seven scientific revolutions:
1. Copernicus’ solar-centric system of the planets replacing the earth-centric theory
of Ptolemy
Fig. 1.2 Computer-generated “best fit” of the continents. There are several versions of this type
of fit maps credited to the British geophysicists E.C. Bullard, J.E. Everett, and A.G. Smith
Fig. 1.3 Wegener’s conceptual system (top) and the contemporary one (bottom)
Thagard contrasted Wegener’s conceptual system with that of his opponents (Fig. 1.5).
Making conceptual structures explicit helps us understand the central issues
concerning how paradigms compete with each other.
Continental drift, along with polar wandering and seafloor spreading, is the
consequence of plate movements. Continental drift is the movement of one continent
relative to another continent. Polar wandering is the movement of a continent
relative to the rotational poles or spin axis of the Earth. Seafloor spreading is the
movement of one block of seafloor relative to another block of seafloor. Evidence
for both polar wandering and continental drift comes from matching continental
coastlines, paleoclimatology, paleontology, stratigraphy, structural geology, and
paleomagnetism. The concept of seafloor spreading is supported by evidence of the
age of volcanic islands and the age of the oldest sediments on the seafloor. It is also
supported by discoveries of the magnetism of the seafloor.
It is obviously a remarkable accomplishment to be able to extract and summarize
the conceptual structure of a scientific theory. Representing it with such a high level
of clarity enables us to focus on conceptual differences between conceptual structures
and pinpoint their merits and potentials. On the other hand, the distilling process
clearly demands the highest level of intellectual analysis and reasoning. It requires
the ability to tease out the most critical information from a vast and growing body of
relevant scientific knowledge. Today’s information and computing techniques still
have a long way to go to be able to turn a body of scientific knowledge into this
type of conceptual structures. Examples demonstrated by Thagard provide a good
reference for us to consider and reflect on what opportunities are opened up by
new generations of science mapping and visual analytics tools and what challenges
remain for us to overcome.
1.1.4 TRACES
How long does it take for society to fully recognize the value of scientific
breakthroughs or technological innovations? The U.S. Department of Defense
(DoD) commissioned a study in the 1960s to address this question. The study,
Project Hindsight, was set to search for lessons learned from the development of
some of the most revolutionary weapon systems. A preliminary report of Project
Hindsight was published in 1966. A team of scientists and engineers analyzed
retrospectively how 20 important military weapons came about, including the Polaris
and Minuteman missiles, nuclear warheads, the C-141 aircraft, the Mark 46 torpedo,
and the M102 Howitzer. The team of experts identified 686 “research or exploratory
development events” that were essential for the development of the weapons. Only
9 % were regarded as “scientific research” and 0.3 % as basic research. 9 %
of the research was conducted in universities. One of the preliminary conclusions of
Project Hindsight was that basic research commonly found in universities didn’t
seem to matter very much in these highly creative developments. In contrast,
projects with specific objectives appeared to be much more fruitful.
Project Hindsight concluded that projects funded with specific defense purposes
were about one order of magnitude more efficient than projects with the same
amount of funding but without specific defense goals. Project Hindsight further
concluded that:
1. The contributions of university research were minimal.
2. Scientists contributed most effectively when their effort was mission oriented.
3. The lag between initial discovery and final application was shortest when the
scientist worked in areas targeted by his sponsor.
Project Hindsight emphasized mission-oriented research, contract research, and
commission-initiated research. Although these conclusions were drawn from the
study of military weapon development, some of these conclusions found their way
to the evaluation of scientific fields such as biomedical research.
In response to the findings of Project Hindsight, the National Science Foundation
(NSF) commissioned a study TRACES – Technology in Retrospect and Critical
Events in Science. Project Hindsight looked back 20 years, but TRACES traced
the history of five inventions, with origins dating back as early as the 1850s. The
five inventions are the contraceptive pill, matrix isolation, the video tape recorder,
ferrites, and the electron microscope.
TRACES identified 340 critical research events associated with these inventions
and classified them into three major categories: non-mission research, mission-
oriented research, and development and application. 70 % of the critical events
belonged to non-mission research, i.e. basic research. 20 % were mission oriented,
and 10 % were development and application. Universities were responsible for 70 %
of non-mission and one third of mission oriented research. For most inventions,
75 % of the critical events occurred before the conception of the ultimate inventions.
Critical research events are not evenly distributed over time. Events in the early
stages are separated by longer periods of time than events in later stages.
The video tape recorder, for example, was invented in the mid-1950s. It took almost
100 years to complete the first 75 % of all relevant milestones, i.e. the critical
research events, but it took only 10 years for the remaining 25 % of the critical
events to converge rapidly. In particular, the innovation was conceived in the final
5 years.
The invention of the video tape recorder involves six areas: control theory,
magnetic and recording materials, magnetic theory, magnetic recording, electronics,
and frequency modulation (Fig. 1.6). The earliest non-mission research event
appeared in magnetic theory. It was Weber’s early ferromagnetic theory in 1852.
Fig. 1.6 Pathways to the invention of the video tape recorder (© Illinois Institute of Technology)
The earliest mission-oriented research appeared in 1898 when Poulsen used steel
wire for the first time for recording. According to TRACES, the technique was
“readily available but had many basic limitations, including twisting and single track
restrictions.” Following Poulsen’s work, Mix & Genest developed steel
tape with several tracks around the 1900s, but it was limited by the lack of flexibility
and increased weight.
This line of invention continued as homogeneous plastic tape on the Magnetophon
tape recorder was first introduced in 1935 by AEG. A two-layer tape was developed
by the 1940s. The development of reliable wideband tapes was intensive in
the early 1950s. The first commercial video tape recorder appeared in the late 1950s.
The invention of the electron microscope went through similar stages. The first 75 %
of the research was completed before the point of invention and the translational period
from conception to innovation. The invention of the electron microscope relied on
five major areas, namely, cathode ray tube development, electron optics, electron
sources, wave nature of electrons, and wave nature of light. Each area may trace
several decades back to the initial non-mission discoveries. For instance, Maxwell’s
electromagnetic wave theory of light in 1864, Roentgen’s discovery of emission
of X-ray radiation in 1893, and Schrödinger’s foundation of wave mechanics in
1926 all belong to non-mission research that ultimately led to the invention of the
electron microscope. As a TRACES diagram shows, between 1860 and 1900 there
was no connection across these areas of non-mission research. While the invention
of the electron microscope was dominated by many earlier non-mission activities,
the invention of the video tape recorder revealed more diverse interactions among non-
mission research, mission-oriented research, and development activities.
1.2 Visual Thinking
Vision is a unique source for thinking. We often talk about hindsight, insight,
foresight, and oversight. Our attention is first drawn to the big picture, the Gestalt,
before we attend to details (McKim 1980). Visual thinking actively operates on
structural information, not only to see what is inside, but also to figure out how the
parts are connected to form the whole.
1.2.1 Gestalt
The history of science and technology is full of discoveries in which visual thinking
played a critical role. Visual thinking from the abstract to the concrete is a powerful
strategy. In abstraction, the thinker can readily restructure or even transform a concept.
Then the resulting abstraction can be represented in a concrete form and tested in
reality. When abstract and concrete ideas are expressed in graphic form, the abstract-
to-concrete thinking strategy becomes visible.
As everyone looking at Leonardo da Vinci’s Mona Lisa probably sees a
“Mona Lisa” quite different from what others see, individual perceptual ability
can be vital in science: not only does it often distinguish an expert from a
novice, but it also determines whether one can catch a passing chance of discovery. The
English novelist and essayist Aldous Huxley (1894–1963) wrote: “The experienced
microscopist will see certain details on a slide; the novice will fail to see them.
Walking through a wood, a city dweller will be blind to a multitude of things which
the trained naturalist will see without difficulty. At sea, the sailor will detect distant
objects which, for the landsman, are simply not there at all.” A knowledgeable
observer sees more than a less knowledgeable companion because he or she has
a richer stock of memories and expectations to draw upon to make sense of what is
perceived.
Many of us are familiar with the story of the Tower of Babel in the Bible.2 Ancient
Mesopotamians believed that mountains were holy places where gods dwelt, and
that such mountains were contact points between heaven and earth,
for example, Zeus on Mount Olympus, Baal on Mount Saphon, and Yahweh on
Mount Sinai. But there were no natural mountains on the Mesopotamian plain, so
people built ziggurats instead. The word ziggurat means a “tower with its top in the
heavens.” A ziggurat is a pyramid-shaped structure that typically had a temple at
the top. Remains of ziggurats have been found at the sites of ancient Mesopotamian
cities, including Ur and Babylon.
The story of the Tower of Babel is in the Bible, Genesis 11: 1–9. The name
Babylon literally means “gate of the gods.” It describes how the people used brick
and lime to construct a tower that would reach up to heaven. According to the story,
the whole earth used to have only one language and a few words. People migrated
from the east and settled on a plain. They said to each other, “Come, let us build
ourselves a city, and a tower with its top in the heavens, and let us make a name
for ourselves, lest we be scattered abroad upon the face of the whole earth.” They
baked bricks and used bitumen as mortar. When the Lord came down to see the city
and the tower, the Lord said, “Behold, they are one people, and they have all one
2 http://www.christiananswers.net/godstory/babel1.html
Fig. 1.9 Map of Cholera deaths and locations of water pumps (Courtesy of National Geographic)
language; and this is only the beginning of what they will do; and nothing that they
propose to do will now be impossible for them. Come, let us go down, and there
confuse their language, that they may not understand one another’s speech.” So the
Lord scattered them abroad from there all over the earth, and they left off building
the city. Therefore its name was called Babel, because there the Lord confused the
language of all on the earth; and from there the Lord scattered them abroad over the
face of the earth. Archaeologists examined the remains of the city of Babylon and
found a square of earthen embankments some 300 ft on each side, which appears to
be the foundation of the tower. Although the Tower of Babel is gone, a few ziggurats
survived. The largest surviving temple, built in 1250 BC, is found in western Iran.
The Tower of Babel has been a popular topic for artists. Pieter Bruegel
(1525–1569) painted the Tower of Babel in 1563; the painting is now in Vienna’s
Kunsthistorisches Museum (See Fig. 1.10).
Fig. 1.10 The Tower of Babel (1563) by Pieter Bruegel. Kunsthistorisches Museum, Vienna.
(Copyright free, image is in the public domain)
He painted the tower as an immense structure occupying almost the entire picture, with microscopic figures,
rendered in perfect detail. The top floors of the tower are in bright red, whereas
the rest of the brickwork has already started to weather. Maurits Cornelis Escher
(1898–1972) was also intrigued by the story. In his painting in 1928, people were
building the tower when they started to experience the confusion and frustration of
the communication breakdown caused by the language barrier (See Fig. 1.11).
The moral of the Tower of Babel story in this book is the vital role of our language.
Consider the following examples and examine the basis of our communication that
we have been taking for granted. Space probes Pioneer and Voyager are travelling
into deep space with messages designed to reach some intelligent forms in a few
million years. If aliens do exist and eventually find the messages on the spacecraft,
will they be able to understand? What are the assumptions we make when we
communicate our ideas to others?
Pioneers 10 and 11 both carried small metal plaques identifying their time and
place of origin for whatever intelligent forms might find them in the distant future.
NASA placed a more ambitious message aboard Voyager 1 and 2 – a kind of time
capsule – to communicate a story of our world to extraterrestrials.
Pioneer 10 was launched in 1972. It is now one of the most remote man-made
objects. Communication was lost on January 23, 2003, when it was 80 AU3
from the Sun – about 12 billion kilometers, or 7.5 billion miles, away. Pioneer
10 was headed towards the constellation of Taurus (The Bull). It will take Pioneer
over 2 million years to pass by one of the stars in the constellation. Pioneer 11
was launched in 1973. It is headed toward the constellation of Aquila (The Eagle),
Northwest of the constellation of Sagittarius. Pioneer 11 may pass near one of the
stars in the constellation in about 4 million years.
According to “First to Jupiter, Saturn, and Beyond” (Fimmel et al. 1980), a group
of science correspondents from the national press were invited to see the spacecraft
3 Astronomical Unit: one AU is the distance between the Earth and the Sun, which is about
150 million kilometers (93 million miles).
Fig. 1.12 The gold-plated aluminum plaque on Pioneer spacecraft, showing the figures of a man
and a woman to scale next to a line silhouette of the spacecraft
This plate was attached to the antenna support struts of the spacecraft in a position
where it would be shielded from erosion by interstellar dust. The bracketing bars on
the far right represent the number 8 in binary form (1000), where
one unit is indicated by the spin-flip radiation transition of a hydrogen atom from
electron state spin up to spin down, which gives a characteristic radio wavelength
of 21 cm (8.3 in.). Therefore, the woman is 8 × 21 cm = 168 cm, or about 5′ 6″ tall.
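The decoding chain can be checked with a few lines of arithmetic; this is merely a worked verification of the numbers above, not part of the plaque.

```python
# Decode the height mark on the plaque: binary 1000 -> 8 units of the
# 21 cm hydrogen wavelength, then convert to feet and inches.
units = int("1000", 2)      # the bracketing bars encode 8 in binary
height_cm = units * 21      # hydrogen's spin-flip line defines the unit
inches = height_cm / 2.54
feet, rem = divmod(inches, 12)
print(height_cm, "cm")                       # 168 cm
print(f"about {int(feet)} ft {rem:.0f} in")  # about 5 ft 6 in
```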
The bottom of the plaque shows schematically the path that Pioneers 10 and 11 took
to escape the solar system – starting at the third planet from the Sun accelerating
with a gravity assist from Jupiter out of the solar system. Also shown to help identify
the origin of the spacecraft is a radial pattern etched on the plaque that represents
the position of our Sun relative to 14 nearby pulsars (i.e., spinning neutron stars)
and a line directed to the center of our Galaxy. The plaque may be considered as
the cosmic equivalent to a message in a bottle cast into the sea. Sometime in the
far distant future, perhaps billions of years from now, Pioneer may pass through
a planetary system of a remote stellar neighbor, one of whose planets may have
evolved intelligent life. If that life possesses the technical ability and curiosity,
it may detect and pick up the spacecraft and inspect it. Then the plaque with its
message from Earth may be found and deciphered.
Pioneer 10 will be out there in interstellar space for billions of years. Due to the loss
of communication, we may never hear from it again unless one day it is picked up
by intelligent aliens in deep space.
Voyagers 1 and 2 were launched in the summer of 1977. They have become the third and fourth human-built artifacts to escape our solar system. The two spacecraft will not make a close approach to another planetary system for at least 40,000 years.
Each Voyager carries sounds and images portraying the diversity of life and culture on Earth, recorded on a 12-in. gold-plated copper disk. Carl Sagan was responsible for selecting the contents of the record for NASA (See Fig. 1.13). His team assembled 115 images and a variety of natural sounds, such as those made by surf, wind and thunder, birds, whales, and other animals. They also included musical selections from different cultures and eras, spoken greetings from Earth-people in fifty-five languages, and printed messages from President Carter of the United States and United Nations Secretary-General Waldheim. Each record is encased in a protective aluminum jacket, together with a cartridge and a needle. Instructions, in symbolic language, explain the origin of the spacecraft and indicate how the record is to be played. The 115 images are encoded in analog form. The remainder of the record is in audio, designed to be played at 16⅔ revolutions per minute. It contains the spoken greetings, beginning with Akkadian, which was
spoken in Sumer about 6,000 years ago, and ending with Wu, a modern Chinese
dialect. Following the section on the sounds of Earth, there is an eclectic 90-min selection of music, including both Eastern and Western classics and a variety of ethnic music. In Carl Sagan's words, “The spacecraft will be encountered and
the record played only if there are advanced space-faring civilizations in interstellar
space. But the launching of this bottle into the cosmic ocean says something very
hopeful about life on this planet.”
Affixed to each Voyager, the disk is a message in a bottle cast into the cosmic sea. Along with the cartridge and needle, the record's cover carries some simple diagrams that symbolically represent the spacecraft's origin and give instructions for playing the disk. Figure 1.14 shows these instructions; consider whether you would be able to understand them if you were an alien.
The Voyager record is detailed in “Murmurs of Earth” (1978) by Sagan, Drake, Lomberg et al., which tells the story behind the creation of the record and includes a full list of everything on it. Warner New Media reissued “Murmurs of Earth” in 1992, together with a CD-ROM that replicates the Voyager record; the CD-ROM is available for purchase (http://math.cd-rom-directory.com/cdrom-2.cdprod1/007/419.Murmurs.of.Earth.-.The.Voyager.Interstellar.Record.shtml).
“Ceci n'est pas une pipe” is a famous statement made by Belgian surrealist René Magritte (1898–1967) in his 1929 oil painting “The Treachery of Images.” The picture of a pipe, Fig. 1.15, is underlined by the thought-provoking caption in French – “This is not a pipe.”
Obviously, the painted pipe is not a real pipe; it shares none of the physical properties or functions of a real pipe. On the other hand, this surrealistic painting certainly makes us think more deeply about the role of our language. The apparent contradiction between the visual message conveyed by the picture of a pipe and the statement made in words underlines the nature of language and the interrelationships
between what we see, what we think, and what we say. Philosophers study such questions under the name of hermeneutics. Hermeneutics can be traced back to the Greeks and to the rise of Greek philosophy. Hermes is the messenger of the gods; he brings word from the realm of the wordless, as the hermeios brings the word from the Oracle. The root of hermeneutics is the Greek verb hermeneuein, which means to interpret.
Don Ihde’s book Expanding Hermeneutics – Visualism in Science (Ihde 1998)
provides a series of examples from the history of science and technology in
an attempt to establish that visualist hermeneutics is essential to science and
technology. According to Ihde, “This hermeneutics, not unlike all forms of writing,
is technologically embedded in the instrumentation of contemporary science, in
particular, in its development of visual machines or imaging technologies.”
Ihde argues that what we see is mediated by enabling devices. We see through,
with, and by means of instruments (Ihde 1998). Science has found ways to enhance,
magnify, and modify its perceptions. From this perspective, Kuhn’s philosophy in
essence emphasizes that science is a way of “seeing.” We will return to Kuhn's paradigm theory later with the goal of visualizing the development of a paradigm. Ihde refers to this approach as perceptual hermeneutics, whose key features are repeatable Gestalts, visualizability, and isomorphism.
Ihde noted that Leonardo da Vinci's depictions of human anatomy – showing musculature, organs, and the like – and his depictions of imagined machines in his technical diaries were in the same style: both exteriors and interiors were visualized. Ihde also found similar examples from astronomy and medicine, such as Galileo's telescope and the discovery of X-rays in 1895 by German physicist Wilhelm Conrad Röntgen (1845–1923) (See Fig. 1.16). What had been invisible or occluded became observable. These imaging technologies have effects similar to da Vinci's exploded-diagram style: they transform non-visual information into visual representations. Two types of imaging technologies are significant: translation technologies, which transform non-visual dimensions into visual ones, and isomorphic technologies, which preserve the visual form of what they depict. Imaging
technologies increasingly dominate contemporary scientific hermeneutics.
Ihde describes Röntgen's experience in the following words: “Röntgen had never seen a transparent hand as in the case of his wife's ringed fingers, but it was obvious from the first glimpse what was seen.” (Röntgen, the discoverer of X-rays, made copies of the X-ray image of his wife's hand and sent them to colleagues across Europe as evidence of his discovery.) On the other hand, more and more visual techniques are moving away from visual isomorphism. For example, transparent and translucent microorganisms in “true color” were difficult to see; it was false coloring that turned microscopic imaging into a standard technique within scientific visual hermeneutics.
Hermeneutics brings a word from the wordless. Information visualization, similarly, aims to bring insights into abstract information to the viewer – in particular, information that may not readily lend itself to geometric or spatial representations. The subject of this book is ways to depict and interpret a gigantic “pipe” of scientific frontiers, with attention to how visualized scientific frontiers and real ones are interrelated.
As shown by the history of the continental drift theory, a common feature of a research front is constant debate between competing theories and competing interpretations of the same evidence. These debates at a
disciplinary scale will be used to illustrate the central theme of this book – mapping
scientific frontiers. How can we take snapshots of a “battle ground” in scientific
literature? How can we track the development of competing schools of thought
over time? From a hermeneutic point of view, what are the relationships between
“images” of science and science itself? How do we differentiate the footprints of
science and scientific frontiers? Would René Magritte point to a visualization of a scientific frontier and say, “This is not a scientific frontier”?
In the rest of this chapter, we will visit a few more examples and explore
profound connections between language, perception, and cognition. Some examples illustrate language barriers, not only between natural languages but also across scientific and technological disciplines.
Some show the power of visual languages throughout the history of mankind. Some
underline limitations of visual languages. Through these examples, we will be able
to form an overview of the most fundamental issues in grasping the dynamics of the
forefront of science and technology.
We can only see what we want to see. In other words, our vision is biased and selective. Magritte's pipe looks so realistic that people feel puzzled when they read the caption “This is not a pipe.” Towards the end of the nineteenth century, a group of Austrian and German psychologists found that human beings tend to perceive coherent patterns in visual imagery. Gestalt is a German word that essentially denotes the tendency to recognize a pattern – a holistic image – out of individual parts, even when the holistic image is illusory.
The study of such pattern-seeking behavior is a branch of psychology called Gestalt psychology. Human perception has a tendency to seek patterns in what we see, or in what we expect to see. A widely known example is the “face on Mars,” which reminds us how our perceptual system can sometimes deceive us.
Gestalt psychology emphasizes the importance of organizational processes in perception, learning, and problem solving. Gestalt psychologists believe that individuals are predisposed to organize information in particular ways. The basic ideas of Gestalt psychology are:
• Perception is often different from reality. This includes optical illusions.
• The whole is more than the sum of its parts. Human experience cannot be explained by examining individual parts in isolation; the overall experience must be examined.
• The organism structures and organizes experience. The word Gestalt in German means structured whole. An organism structures experience even though structure is not necessarily inherent in the stimuli.
• The organism is predisposed to organize experience in particular ways. For example, according to the law of proximity, people tend to perceive as a unit those things that are close together in space. Similarly, according to the law of similarity, people tend to perceive as a unit those things that resemble one another.
• Problem solving involves restructuring and insight. Problem solving involves
mentally combining and recombining the various elements of a problem until
a structure that solves the problem is achieved.
Human beings have a tendency to seek patterns. Gestalt psychology considers perception an active force: we perceive a holistic image that means more than the sum of its parts, first seeing an overall pattern and then going on to analyze its details, with personal needs and interests driving the detailed analysis. Like a magnetic field, perception draws sensory imagery together into holistic patterns. According to Gestalt theory, perception obeys an innate urge towards simplification, organizing complex stimuli into simpler groups. Grouping effects include proximity, similarity, continuity, and line of direction. Gestalt psychology highlights the ambiguity of humans' pattern-seeing abilities. Figure 1.18 shows a famous drawing by Maurits Escher. See if you can see the two figures alternately, or even simultaneously.
In information visualization, abstract structures play the role of the geographic features that provide the base map of a thematic map. Indeed, thematic maps provide a fertile metaphor for a class of information visualization known as information landscapes. Notable examples include ThemeView (Wise et al. 1995) and Bead (Chalmers 1992).
ManyEyes is a more recent example – a “social kind of data analysis,” in the words of its designers at what was formerly IBM's Visual Communication Laboratory. ManyEyes gives many people a taste of what it is like to create their own information visualizations, an opportunity they would otherwise not have at all. Its public-oriented design significantly simplifies the entire process of information visualization. Furthermore, ManyEyes is a community-building environment in which one can view visualizations made by other users, comment on them, and create one's own. These reasons alone would be enough to earn ManyEyes a unique position in the development of information visualization. ManyEyes and Wikipedia share some interesting characteristics – both tap into social construction, and both demonstrate emergent properties of a self-organizing underlying system.
Modeling and visualizing intellectual structures from scientific literature have
reached a new level in terms of the number of computer applications available,
the number of researchers actively engaged in relevant areas, and the number of
relevant publications. Traditionally, the scientific discipline that has most actively addressed issues of science mapping and intellectual-structure mapping is information science, which comprises two sub-fields: information retrieval and citation analysis. Both take the widely accessible scientific literature as their input, yet they concentrate on disjoint sections of a document. Information retrieval focuses on the bibliographic record of a document, such as its title and keyword list, and/or its full text, whereas citation analysis focuses on the referential links embedded in the document or appended at its end. The ultimate challenge for information visualization is to
invent and adapt powerful visual-spatial metaphors that can convey the underlying
semantics.
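To make the contrast concrete, here is a minimal sketch of the two disjoint parts of a document that the two sub-fields draw on; the field names and the record itself are hypothetical, not taken from any particular system:

```python
# Illustrative sketch: the parts of a document record used by information
# retrieval versus citation analysis. All field names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Document:
    # Used by information retrieval: what the document says.
    title: str
    keywords: List[str]
    abstract: str
    # Used by citation analysis: what the document cites.
    references: List[str] = field(default_factory=list)

doc = Document(
    title="Mapping knowledge domains",
    keywords=["citation analysis", "visualization"],
    abstract="...",
    references=["Small 1977", "Price 1965"],
)
# An IR system indexes title, keywords, and abstract; a citation analyst
# builds networks from the references list alone.
```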
Information retrieval has brought many fundamental inspirations and challenges
to the field of information visualization. Our quest aims to demonstrate that
science mapping goes beyond information retrieval, information visualization, and
scientometrics. It is becoming a unique field of study in its own right, with the potential to be applicable to a wide range of scientific domains. Our focus is on the growth of scientific knowledge: what are the key problems to solve, and what are the central tasks to support? Instead of focusing on locating specific items in
scientific literature, we turn to higher levels of granularity – scientific paradigms
and their movements in scientific frontiers.
Visual analytics can be seen as the second generation of information visualiza-
tion. It has transformed not only how we visualize complex and dynamic phenomena
in the new information age, but also how we may optimize analytical reasoning and
make sound decisions with incomplete and uncertain information (Keim et al. 2008).
Today’s widespread recognition of the indispensable value of visual analytics as a
field and the rapid growth of an energetic and interdisciplinary scientific community
would be simply impossible without the remarkable vision and tireless efforts of Jim
Thomas (1946–2010), his colleagues of the National Visualization and Analytics
Center (NVAC) at Pacific Northwest National Laboratory (PNNL), and the growing
community in visual analytics science and technology.
In 2004, Jim Thomas founded NVAC and initiated a new research area, visual
analytics. Visual analytics is the science of analytical reasoning facilitated by interactive visual interfaces (Thomas and Cook 2005; Wong and Thomas 2004).
Visual analytics is a multidisciplinary field. It brings together several scientific and
technical communities from computer science, information visualization, cognitive
and perceptual sciences, interactive design, graphic design, and social sciences.
It addresses challenges involving analytical reasoning, data representations and
transformations, visual representations and interaction techniques, and techniques to
support production, presentation, and dissemination of the results. Although visual
analytics has some overlapping goals and techniques with information visualization
and scientific visualization, it is especially concerned with sense-making and
reasoning and it is strongly motivated by solving problems and making sound
decisions.
Visual analytics integrates new computational and theory-based tools with
innovative interactive techniques and visual representations based on cognitive,
design, and perceptual principles. This science of analytical reasoning is central
to the analyst’s task of applying human judgments to reach conclusions from a
combination of evidence and assumptions (Thomas and Cook 2005). Today, visual
analytics centers are found in several countries, including Canada, Germany, the United Kingdom, and the United States, and universities have integrated visual analytics into their core information science curricula, making the new field a recognized and promising outgrowth of information visualization and scientific visualization (Wong 2010).
The key contribution of visual analytics is that it is motivated by the needs of analytic reasoning and decision making with highly uncertain data. Visual analytics emphasizes the role of evidence in analytic reasoning and in making informed decisions. This is precisely what is needed for mapping scientific frontiers: evidence-based reasoning. In this second edition of the book, we introduce the latest developments in visual analytics that support analytic tasks pertinent to mapping scientific frontiers.
This book is written with several groups of readers in mind, for example, researchers
and students in information science, computer science, history of science, philoso-
phy of science, and sociology of science. The book is also suitable for readers who
are interested in scientometrics, information visualization, and visual analytics as
well as science of science policy and research evaluation.
“Three Blind Men and an Elephant” is a widely told folktale in China. The story probably started in the Han Dynasty (202 BC–220 AD) (Kou and Kou 1976) and was later expanded to six blind men in India. As the folktale goes, six blind men went to figure out what the elephant looks like. The first approached the elephant and felt its body. He claimed: “The elephant is very like a wall!” The second, feeling the tusk, said, “It is like a spear!” The third took the elephant's trunk and said, “It is like a snake!” The fourth touched the knee and shouted, “It is like a tree!” The fifth touched the ear and thought it was like a fan. The sixth, seizing the swinging tail, was convinced that the elephant must be like a rope. They could not agree on what an elephant is really like.
The moral of this folktale applies to scientists, who receive all sorts of partial messages about scientific frontiers. Actor-Network Theory (ANT) was originally proposed as a sociological model of science (Latour 2005; Callon et al. 1986). According to this model, the work of scientists consists of the enrolment and juxtaposition of heterogeneous elements – rats, test tubes, colleagues, journal articles, grants, papers at scientific conferences, and so on – which need continual management. In doing so, scientists simultaneously reconstruct social contexts: laboratories rebuild and link the social and natural contexts upon which they act.
Examining inscriptions is one key approach in ANT; the other is to “follow the actor” via interviews and ethnographic research. Inscriptions include journal articles, conference papers, presentations, grant proposals, and patents; they are the major products of scientific work (Latour 2005; Callon et al. 1986). In Chap. 3, we will describe co-word analysis, which was originally developed for analyzing inscriptions. Different genres of inscriptions send different messages to scientists. On the one hand, messages from each genre of inscriptions form a
snapshot of scientific frontiers. For example, journal publications may provide a
snapshot of the “head” of the elephant; conference proceedings may provide the
“legs”; and textbooks may provide the “trunk”. On the other hand, messages in
different “bottles” must be integrated at a higher level, i.e. the “elephant” level, to
be useful as guidance to scientists and engineers.
Mapping scientific frontiers involves several disciplines, from philosophy of science and sociology of science to information science, scientometrics, and information visualization. Each discipline has its own research agenda and practices, its own theories and methods. Yet mapping scientific frontiers is by its very nature interdisciplinary: one must transcend disciplinary boundaries so that each contributing approach can fit into the overall context. Otherwise, the Tower of Babel is not only a story in the Bible; it could also be a valid summary of the fate of each new generation's efforts to achieve the “holy grail” of standing on the shoulders of giants.
Science maps depict the spatial relations between research fronts, which are areas
of significant activity. Such maps can also simply be used as a convenient means of
depicting the way research areas are distributed and conveying added meaning to
their relationships.
Even with a database that is completely up-to-date, we are still only able to create
maps that show where research fronts have been. These maps may reveal a fresh
view of where the action is and give a hint where it may be going. However, as we
expand the size of the database from 1 year to a decade or more, the map created
through citation analysis provides a historical, indeed historiographical, window on
the field that we are investigating.
From a global viewpoint, these maps show relationships among fields or
disciplines. The labels attached or embedded in the graphics reveal their semantic
connections and may hint at why they are linked to one another. Furthermore, the
maps reveal which realms of science or scholarship are being investigated today and
the individuals, publications, institutions, regions, or nations currently pre-eminent
in these areas.
By using a series of chronologically sequential maps, one can see how knowledge
advances. While maps of current data alone cannot predict where research will
go, they can be useful indicators in the hands of informed analysts. By observing
changes from year to year, trends can be detected. Thus, the maps become
forecasting tools. And since some co-citation maps include core works, even a
novice can instantly identify those articles and books used most often by members
of the “invisible college.”
The creation of maps by co-citation clustering is a largely algorithmic process. This stands in contrast to the relatively simple but arduous manual method we used over 30 years ago to create a historical map of DNA research from the time of Mendel up to the work of Nirenberg and others.
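To see why this lends itself to automation, consider a minimal co-citation counting sketch; the reference lists below are hypothetical, and real systems add thresholding, normalization, and clustering on top of such counts:

```python
# Minimal sketch of co-citation counting: two references are co-cited
# whenever they appear together in the same paper's reference list.
# The input data are hypothetical.
from itertools import combinations
from collections import Counter

reference_lists = [
    ["Watson & Crick 1953", "Nirenberg 1961", "Mendel 1866"],
    ["Watson & Crick 1953", "Nirenberg 1961"],
    ["Mendel 1866", "Watson & Crick 1953"],
]

cocitation_counts = Counter()
for refs in reference_lists:
    for pair in combinations(sorted(set(refs)), 2):
        cocitation_counts[pair] += 1

# Strongly co-cited pairs form the clusters from which a map is drawn.
for pair, count in cocitation_counts.most_common():
    print(pair, count)
```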
Samuel Bradford (1878–1948) referred to “a picture of the universe of discourse as a globe, on which are scattered, in promiscuous confusion, the mutually related, separate things we see or think about.” John Bernal (1901–1971), a prominent X-ray crystallographer, was a pioneer in the social studies of science, or “science of science.” His book The Social Function of Science (Bernal 1939) has been regarded as a classic in this field. To Bernal, science is
the very basis of philosophy. There was no sharp distinction between the natural
sciences and the social sciences for Bernal, and the scientific analysis of society was
an enterprise continuous with the scientific analysis of nature. For Bernal, there was
no philosophy, no social theory, and no knowledge independent of science. Science
was the foundation of it all.
Bernal, among others, created by laborious manual methods what we would
today describe as historiographs. However, dynamic longitudinal mapping was
made uniquely possible by the development of the ISI® database. Indeed, it gave
birth to scientometrics and new life to bibliometrics.
It is not uncommon for a new theory in science to meet resistance. A newborn theory may grow stronger and become dominant over time. On the other hand, it might well be killed in its cradle. What are the factors that determine the fate of a
new theory? Is there any conclusive evidence? Are there in fact patterns in the world
of science and technology that can make us wiser? Let us take a look at some of the
widely known and long-lasting debates in the history of science.
Remember, Kuhn’s paradigm theory focuses on puzzle-solving problems. In this
book, we aim to describe a broad range of theories, methodologies, and examples
that can contribute to our knowledge of how to better capture the dynamics of the
creation of scientific knowledge. We will demonstrate our work in citation-based
approaches to knowledge domain visualization and present in-depth analyses of several puzzle-solving cases, in particular the debates between competing theories on the cause of the dinosaurs' extinction, the power sources of active galactic nuclei, and the connections between mad cow disease and a new variant of human brain disease.
Five mass extinctions have occurred on Earth in the past 500 million years, including the greatest of all, the Permian–Triassic extinction 248 million years ago, and
the Cretaceous-Tertiary extinction 65 million years ago, which wiped out the
dinosaurs among many other species. The Cretaceous-Tertiary extinction, also
known as the KT extinction, has been the topic of intensive debates over the last
20 years, involving over 80 theories of what caused the mass extinction of dinosaurs.
Paleontologists, geologists, physicists, astronomers, nuclear chemists, and many
others are all involved. We will use our visualization techniques to reveal the process
of this debate.
Albert Einstein's general theory of relativity predicted the existence of black holes. By their very nature, we cannot see black holes directly, even if a real one falls within the field of view of our telescopes. Astronomers are puzzled by the gravitational power emanating from the centers of galaxies. If our theories are correct, the existence of supermassive black holes is among the few viable explanations. Astronomers have been collecting evidence with
increasingly powerful telescopes. In this case, we will analyze the impact of such
evidence on the acceptance of a particular paradigm.
The 1997 Nobel Prize in Physiology or Medicine was awarded to Stanley Prusiner, professor of neurology, virology, and biochemistry, for his discovery of prions – an
abnormal form of a protein responsible for diseases such as scrapie in sheep, Bovine
Spongiform Encephalopathy (BSE) in cattle – also known as mad cow disease,
and Creutzfeldt-Jakob disease (CJD) in humans. While CJD is most often found among people over 55, vCJD patients have an average age of 27. In the middle of the UK's BSE crisis, the public was concerned about whether it was safe to eat beef products at all. This concern led to the question of whether eating contaminated food can cause vCJD.
References
Bernal JD (1939) The social function of science. The Macmillan Co., New York
Callon M, Law J, Rip A (eds) (1986) Mapping the dynamics of Science and technology: sociology
of science in the real world. Macmillan Press, London
Card SK (1996) Visualizing retrieved information: a survey. IEEE Comput Graph Appl 16(2):
63–67
Card S, Mackinlay J, Shneiderman B (eds) (1999) Readings in information visualization: using
vision to think. Morgan Kaufmann, San Francisco
Chalmers M (1992) BEAD: explorations in information visualisation. Paper presented at the
SIGIR’92, Copenhagen, Denmark, June 1992
Chen C (1999) Information visualisation and virtual environments. Springer, London
Chen C (2010) Information visualization. Wiley Interdiscip Rev Comput Stat 2(4):387–403
Crane D (1972) Invisible colleges: diffusion of knowledge in scientific communities. University of
Chicago Press, Chicago
Fimmel RO, Allen JV, Burgess E (1980) Pioneer: first to Jupiter, Saturn, and beyond (U.S. Govern-
ment Printing Office No. NASA SP-446). Scientific and Technical Information Office/NASA,
Washington, DC
Hearst MA (1999) User interfaces and visualization. In: Baeza-Yates R, Ribeiro-Neto B (eds)
Modern information retrieval. Addison-Wesley, Harlow, pp 257–323
Herman I, Melançon G, Marshall MS (2000) Graph visualization and navigation in information
visualization: a survey. IEEE Trans Vis Comput Graph 6(1):24–44
Hollan JD, Bederson BB, Helfman J (1997) Information visualization. In: Helander MG,
Landauer TK, Prabhu P (eds) The handbook of human computer interaction. Elsevier Science,
Amsterdam, pp 33–48
Ihde D (1998) Expanding hermeneutics: visualism in science. Northwestern University Press,
Evanston
Inselberg A (1997) Multidimensional detective. Paper presented at the IEEE InfoVis’97, Phoenix,
AZ, October 1997
Keim D, Mansmann F, Schneidewind J, Thomas J, Ziegler H (2008) Visual analytics: scope and
challenges. Vis Data Min 4404:76–90
Kochen M (1984) Toward a paradigm for information science: the influence of Derek de Solla
Price. J Am Soc Inf Sci Technol 35(3):147–148
Kornish LJ, Ulrich KT (2011) Opportunity spaces in innovation: empirical analysis of large
samples of ideas. Manag Sci 57(1):107–128
Kou L, Kou YH (1976) Chinese folktales. Celestial Arts, Millbrae, pp 83–85
Kuhn TS (1962) The structure of scientific revolutions. University of Chicago Press, Chicago
Latour B (2005) Reassembling the social – an introduction to actor-network-theory. Oxford
University Press, Oxford
Masterman M (1970) The nature of the paradigm. In: Lakatos I, Musgrave A (eds) Criticism and
the growth of knowledge. Cambridge University Press, Cambridge, pp 59–89
McGrath JE, Altman I (1966) Small group research: a synthesis and critique of the field. Holt,
Rinehart & Winston, New York
McKim RH (1980) Experiences in visual thinking, 2nd edn. PWS Publishing Company, Boston
Mukherjea S (1999) Information visualization for hypermedia systems. ACM Comput Surv
31(4):U24–U29
Hanson NR (1958) Patterns of discovery. Cambridge University Press, Cambridge
Price DD (1963) Little science, big science. Columbia University Press, New York
Price DD (1965) Networks of scientific papers. Science 149:510–515
Rittschof KA, Stock WA, Kulhavy RW, Verdi MP, Doran JM (1994) Thematic maps improve
memory for facts and inferences: a test of the stimulus order hypothesis. Contemp Educ Psychol
19:129–142
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Small HG, Griffith BC (1974) The structure of scientific literatures I: identifying and graphing
specialties. Sci Stud 4:17–40
Spence B (2000) Information visualization. Addison-Wesley, New York
Thagard P (1992) Conceptual revolutions. Princeton University Press, Princeton
Thomas JJ, Cook K (2005) Illuminating the path: the R&D agenda for visual analytics. IEEE
Computer Society, Los Alamitos
Tufte ER (1983) The visual display of quantitative information. Graphics Press, Cheshire
Tufte ER (1990) Envisioning information. Graphics Press, Cheshire
Tufte ER (1997) Visual explanations. Graphics Press, Cheshire
Ware C (2000) Information visualization: perception for design. Morgan Kaufmann Publishers,
San Francisco
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A, et al (1995) Visualizing the
non-visual: spatial analysis and interaction with information from text documents. Paper
presented at the IEEE symposium on information visualization’95, Atlanta, Georgia, USA,
30–31 October 1995
Wong PC (2010) The four roads less traveled – a tribute to Jim Thomas (1946–2010). http://vgtc.org/JimThomas.html
Wong P, Thomas J (2004) Visual analytics. IEEE Comput Graph Appl 24(5):20–21
Chapter 2
Mapping the Universe
Powers of Ten is a short documentary film written and directed by Ray Eames and
her husband, Charles Eames. It was rereleased in 1977. Starting from a one-meter
wide scene, the film moves 10 times farther away every 10 s. By the 7th move,
we have already moved far enough to see the entire Earth (Fig. 2.1). In 1998, the
Library of Congress selected the film for preservation in the United States National
Film Registry because it is “culturally, historically, or aesthetically significant.” In
this chapter, we will review principles and techniques that have been developed for
drawing maps at three very different scales, namely, geographical maps, maps of the
universe, and maps of protein sequences and compounds.
This chapter focuses on a variety of organizing models behind a variety of maps,
and in particular their role in making visual thinking and visual communication
effective. These models are also known as metaphors. The fundamental value of a
metaphor is its affordance. The central theme in this chapter is the design of thematic
maps that represent phenomena in the physical world across terrestrial mapping and
celestial mapping. The key question is: what are the roles of various metaphors in
mapping macrocosmic phenomena and microcosmic ones?
2.1 Cartography
Maps are graphic representations of the cultural and physical environment. Maps
appeared as early as the fifth or sixth century BC. Cartography is the art, science,
and technology of making maps. There are two types of maps: general-purpose
maps and thematic maps. General-purpose maps are also known as reference maps.
Examples of reference maps include topographic maps and atlas maps. These maps
display objects from the geographical environment with emphasis on location.
Fig. 2.1 Scenes in the film Powers of Ten (Reprinted from http://www.powersof10.com/film © 2010 Eames Office)
Thematic maps, in contrast, combine a base map with a thematic overlay; the base map provides the frame of location to which the thematic overlay can be related. Thematic maps must be well designed and include only necessary information. Simplicity and clarity are important design features of the thematic overlay.
Researchers are still debating the roles of communication and visualization within modern cartography. David DiBiase's model of visualization in scientific research includes visual communication in its public-realm portion. The model suggests that visualization takes place along a continuum, with exploration and confirmation in the private realm and synthesis and presentation in the public realm: the private realm constitutes visual thinking, and the public realm visual communication. The traditional view of cartographic communication is thereby incorporated into more complex descriptions of cartography as an important component.
The distinction between cartographic communication and cartographic visualiza-
tion is that the former deals with an optimal map whose purpose is to communicate
a specific message, and the latter concerns a message that is unknown and for which
there is no optimal map (Hearnshaw and Unwin 1994). This idea follows the thinking that distinguishes deterministic from probabilistic thinking, which characterizes much of twentieth-century science. The latest view of visualization in cartography and communication recognizes the importance of the map user in the communication process – a participant often overlooked in the traditional view.
Cartographers have recognized that map readers differ: they are not simple, mechanical, unthinking parts of the process, but bring their own experiences and cognition to the map-reading activity. Map communication is the component
of thematic mapping whose purpose is to present one of many possible results of a
geographical inquiry. Maps are seen as tools for the researcher in finding patterns
and relationships among mapped data, not simply for the communication of ideas
to others. Cartographic communication requires that the cartographer knows what
a map reader needs so as to send the right message to the map reader, although a
cartographer may never be certain that the intended message is conveyed precisely.
Fig. 2.3 The visual hierarchy. Objects on the map that are most important intellectually are
rendered with the greatest contrast to their surroundings. Less important elements are placed lower
in the hierarchy by reducing their edge contrasts. The side view in this drawing further illustrates
this hierarchical concept
forms of perceptual organization. Objects that stand out against their backgrounds
are referred to as figures in perception, and their formless backgrounds as grounds.
The segregation of the visual field into figures and grounds is a kind of automatic
perceptual mechanism. With careful attention to graphic detail, all the elements can
be organized in the map space so that the emerging figure and ground segregation
produces a totally harmonious design. Later chapters in the book include examples
of how figure-ground perception plays a role in describing scientific paradigms.
Cartographers have developed several techniques to represent the spherical sur-
face of the Earth. These techniques are known as map projections. Map projections
commonly use three types of geometric surfaces: cylinder, cone, and plane. A few
projections, however, cannot be categorized as such, or are combinations of these.
The three classifications are used for a wide variety of projections, including some
that are not geometrically constructed.
All thematic maps consist of a base map and a thematic overlay that depicts the
distribution pattern of a specific phenomenon. Different types of phenomena or
data require different mapping techniques. Qualitative and quantitative maps can
be distinguished as follows.
Qualitative maps show a variety of different phenomena across different regions.
For example, an agriculture map of Virginia would show that tobacco is the
dominant commercial product of Southside, beef cattle the dominant commercial
product of the Valley of Virginia, and so forth. Quantitative maps, on the other
hand, focus on a particular phenomenon and display numerical data associated
with the phenomenon. The nature of the phenomena, either continuous or discrete,
determines the best mapping method. For example, spatially continuous phenomena
like rainfall amounts are mapped using isolines; total counts of population may be
mapped using dots or graduated symbols; mean income on a county-by-county basis
would use area symbols.
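The paragraph above amounts to a simple decision rule; the sketch below makes it explicit, with simplified and hypothetical category names rather than any standard cartographic taxonomy:

```python
# Illustrative decision rule for choosing a thematic mapping technique,
# following the examples in the text. Category names are hypothetical.
def mapping_technique(data_kind: str) -> str:
    rules = {
        "continuous":     "isolines (e.g., rainfall amounts)",
        "discrete_count": "dot or graduated-symbol map (e.g., population)",
        "area_aggregate": "area symbols (e.g., mean income by county)",
    }
    return rules.get(data_kind, "qualitative map of regional categories")

print(mapping_technique("continuous"))      # isolines (e.g., rainfall amounts)
print(mapping_technique("area_aggregate"))  # area symbols (e.g., mean income by county)
```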
Fig. 2.4 Four types of relief map: (a) contours, (b) contours with hill shading, (c) layer tints, and
(d) digits (Reprinted from http://www.nottingham.ac.uk/education/maps/relief.html#r5)
Relief maps are used to represent a three-dimensional surface, such as hills, valleys, and other features of a place. Techniques such as contour lines, shading, and layer tints are commonly used. Reasoning in three dimensions requires skill, and many people find relief features harder to interpret than most other information on a map. There are more than a dozen distinct methods for showing relief, so the map designer has a wide choice (See Fig. 2.4).
Information visualization has adapted many techniques from relief maps to represent abstract structures and volatile phenomena. Notable examples include self-organizing maps (SOMs) (Lin 1997) and ThemeScape models (Wise et al. 1995). See Chap. 4 for more details.
In Chap. 1, we introduced the view of visualism in science, which emphasizes the instrumental role of technologies in scientific discovery. Earlier cartography relied on craftsmen's measuring and drawing skills. Today, photographic cartography relies on new technologies. For example, the powerful Hubble Space Telescope
(HST) took high-quality photographs of stars and galaxies for celestial mapping.
Fig. 2.5 A Landsat photograph of Britain (left); Central London (right) is shown as the blue area near the lower right corner. The Landsat satellite took the photo on May 23, 2001 (Reprinted from http://GloVis.usgs.gov/ImgViewer.jsp?path=201&row=24&pixelSize=1000)
The Greek astronomer Claudius Ptolemy (c.85–163 AD) generated one of the most
famous world maps in about 150 AD. Unfortunately, none of his maps survived.
Scholars in the Renaissance in the fifteenth century reconstructed Ptolemy’s map
following his instructions (See Fig. 2.6).
Ptolemy's map represented his knowledge of the world. The map was most detailed around the Mediterranean because he worked in Alexandria. It showed only three continents – Europe, Asia, and Africa – with the sea colored in light brown, the rivers in blue, and the mountains in dark brown. The surrounding heads represent the major winds.
Fig. 2.6 Ptolemy’s world map, re-constructed based on his work Geography c. 150 (© The British
Library http://www.bl.uk/)
Fig. 2.7 A road map and an aerial photograph of the Westminster Bridge in London
Constellations are the imaginative work of our ancestors. Their real purpose is to help us locate stars by dividing the sky into more manageable regions that serve as memory aids.
Historians believe that many of the myths associated with the constellations were invented to help farmers remember them. When farmers saw certain constellations rise, they knew it was time to begin planting or reaping – just like a visual calendar.
Fig. 2.9 The London Underground map does not conform to the geographical configuration
The ancient Babylonians and Egyptians had constellation figures before the
Greeks. In some cases, these may correspond with later Greek constellations; in
other cases, there is no correspondence; and in yet other cases an earlier figure might
be represented in a different part of the sky. The constellation figures of the Northern
Hemisphere are over 2,000 years old.
Peter Whitfield describes the history of celestial cartography in The Topography
of the Sky (Whitfield 1999). One of the hallmarks of ancient astronomy was that
precise observation coexisted with a sense of mystery and transcendence. The
Babylonians, in particular, devised powerful mathematical systems for predicting
the positions of celestial bodies, while still considering those bodies to be “gods” of
the night. The practice of early civilizations was crucial for the development of star
mapping.
Early astronomers grouped stars in patterns to identify and to memorize regions
of the sky. Different cultures perceived different star patterns. By about 2000 BC,
both Egyptians and Babylonians had identified star groups, which typically took the
form of animal or mythic-human figures. Since the purpose was for everyone to
remember, there was hardly anything more suitable than animals or mythic-human
figures. The main point was to recognize an area of the sky.
The use of animals and mythic-human figures in constellations raises a deeper
question about the nature of their significance. From cave paintings, to constellation
figures, and to the message plaques on Pioneer and Voyager space probes, what is
the most suitable carrier of our intended message?
The Egyptians and Babylonians did not produce models of the cosmos that could account for the observed motions of the heavens or reveal the true shape of the earth. A rational, theoretical approach to these problems began with the Greek philosophers of the fifth century BC. People believed that the Sun, the Moon, the planets, and the stars were embedded in the surfaces of several concentric spheres centered on the Earth, and that these spheres constantly revolved about the Earth. This spherical model became a cornerstone of Greek astronomy and Greek civilization. The Greeks also developed skills in spherical geometry that enabled them to measure and map star positions.
We now know that the stars are not set in one sphere, but for the purposes of observation and mapmaking this model works quite well. The Greek celestial spherical model enabled astronomers and cartographers to construct globes and armillary spheres showing the stars, the poles, the equator, the ecliptic, and the tropics.
Eudoxus of Cnidus first described many of the constellations we still use today. Some constellations came from the Babylonians, such as the Scorpion, the Lion, and the Bull; Perseus, Andromeda, and Hercules, on the other hand, are Greek mythic figures. These figures marked different regions of the sky. The
earliest representation of the classical constellations is the Farnese Atlas. The Museo
Nazionale in Naples houses a marble statue of the mythological character, Atlas,
who supports the heavens on his shoulders (See Fig. 2.10). Figure 2.11 shows some
constellation figures on the celestial globe. The hands on either side are the hands
of Atlas.
Figures 2.12 and 2.13 are star maps of the 48 classical constellations in the Northern and Southern Hemispheres, respectively, published in the 1795 edition of The Constellations of Eratosthenes by Schaubach.
Celestial mapping relies on two fundamental inventions by Greek astronomers: a spherical model of the heavens and the star constellations in the sky. The emblematic figure of ancient Greek astronomy is Ptolemy of Alexandria, who compiled a text in the second century AD that remained fundamental to astronomy until the sixteenth century. Known by its Arabic name, Almagest (“the greatest”), it contains a catalogue identifying 1,022 of the brightest stars with their celestial coordinates, grouped into 48 constellations. Ptolemy compiled this catalogue with the aid of naked-eye sighting devices, but he was indebted to earlier catalogues such as that of the Greek astronomer Hipparchus (fl. 146–127 BC). While Ptolemy specified how to draw the
stars and constellation figures on a globe, there is nothing in Almagest to suggest
that he made two-dimensional star maps (Whitfield 1999).
In order to draw accurate star maps in two dimensions, astronomers needed a
means of projecting a sphere of the sky onto a flat surface while still preserving
correct star positions. A star chart cannot be simply a picture of what is seen in the
sky because, at any given time of night, only about 40 % of the sky is visible.
Fig. 2.11 Most of the 48 classical constellation figures are shown, but not the stars comprising
each constellation. The Farnese Atlas, 200 BC from the National Maritime Museum, London
Fig. 2.12 Constellations in the Northern Hemisphere, from the 1795 edition of The Constellations of Eratosthenes
Ptolemy was familiar with the science of map projection through his work
in terrestrial geography. In Planisphaerium, he described the polar stereographic
projection that is ideal for star charts. This projection divides the heavens into
northern and southern hemispheres and spreads each onto planes centered on the
celestial poles. Celestial latitude is stretched progressively away from the poles
toward the equator, and all the stars in one hemisphere can be positioned correctly.
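In modern notation (not Ptolemy's own construction), the polar stereographic projection can be sketched in a few lines; a star at right ascension ra_deg and declination dec_deg maps to a point on a plane centered on the celestial pole, with the radius growing away from the pole exactly as described above:

```python
# Minimal sketch of a polar stereographic star-chart projection in modern
# notation; this is illustrative, not Ptolemy's own construction.
import math

def polar_stereographic(ra_deg: float, dec_deg: float, R: float = 1.0):
    """Project a star onto a plane tangent to the north celestial pole."""
    ra = math.radians(ra_deg)
    colat = math.radians(90.0 - dec_deg)   # angular distance from the pole
    r = 2 * R * math.tan(colat / 2)        # stretched toward the equator
    return r * math.cos(ra), r * math.sin(ra)

# Polaris (dec about +89.3 degrees) lands very near the chart's center:
print(polar_stereographic(37.95, 89.26))
```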
Islamic scholars picked up Ptolemy's science between the eighth and twelfth centuries. They described the brightest stars, modeled on Ptolemy's Almagest, and illustrated each constellation with charts. They also made beautiful, precise celestial globes. Islamic astronomers perfected a sophisticated scientific instrument called the astrolabe, which remained an essential tool of astronomers until the seventeenth century. An astrolabe shows how the sky looks at a specific place at a given time.
Only two kinds of star maps survive from the centuries of classical and medieval astronomy: the map embodied by the astrolabe and the image of a single constellation. Until the fifteenth century, European scientists and scholars
Fig. 2.13 Constellations in the Southern Hemisphere, from the 1795 edition of The Constellations of Eratosthenes
did not draw charts of the entire northern or southern heavens for purposes of
study and demonstration. From 1440, star maps began to feature the 48 classical
constellations.
The Renaissance in Europe revived the need for celestial maps as counterparts of the terrestrial explorers' world maps. The fascination with constellations as artistic subjects influenced astronomical imagery for four centuries, and constellations became the subject of a number of Renaissance paintings (See Fig. 2.14).
During the European Renaissance, the celestial globe was imported from the
Islamic world to Europe. The celestial globe had a significant impact on celestial
cartography. Most star charts drawn during the sixteenth to eighteenth centuries
mapped the constellations in reverse, as shown on a globe. Globes are models
of the celestial sphere with viewers standing on the outside. Some cartographers
followed this convention when they made star maps. However, some chose to show
the constellations as they appeared from earth.
Fig. 2.14 The painting of constellations by an unknown artist in 1575 on the ceiling of the Sala
del Mappamondo of the Palazzo Farnese in Caprarola, Italy. Orion the Hunter and Andromeda are
both located to the right of the painting (Reprinted from Sesti 1991)
The first star atlases, commonly showing the 48 constellations, appeared during the sixteenth century. One of the finest was Giovanni Gallucci's Theatrum Mundi of 1588, in which Gallucci positioned the principal stars within vigorously drawn pictures of the constellations. Ptolemy's star catalogue remained the source for comprehensive star charts throughout the sixteenth century; no one had undertaken a new sky survey. At the end of the century, however, two revolutionary changes occurred: Tycho Brahe completely re-measured all of Ptolemy's star positions with unprecedented accuracy, and the Dutch navigator Pieter Keyser organized the southern stars into twelve new constellations – the first additions to the topography of the sky in 2,000 years.
These new southern constellations took the form of exotic animals: the Toucan,
the Bird of Paradise, and the Flying Fish, along with a figure of an Indian. The
star groups first appeared on globes by the Dutch mapmaker Willem Blaeu and
in the atlas Uranometria, published in 1603 by Johann Bayer. Bayer used Brahe's star catalogue, grading the stars by magnitude. The German-Polish astronomer
Johannes Hevelius added seven more constellations in 1690 in his collection of
charts Uranographia. He grouped the stars between existing constellations into new
constellations.
The arms and insignia of most of the royal houses of Europe were once
used to model new constellations, but they were not accepted by the scientific
world and did not last. John Flamsteed catalogued almost 4,000 stars visible from
the Royal Observatory in Greenwich between 1700 and 1720. The atlas drawn
from Flamsteed’s catalogue, elegantly engraved by the artist James Thornhill, was
published after Flamsteed’s death. As telescopes became more and more powerful,
astronomers began including more and more stars in their catalogues. Eventually,
scientists agreed upon a total of 88 constellations. The last hand-drawn star maps
were made by Friedrich Argelander in 1863, containing a staggering total of 324,189
stars with no decorative constellation figures.
2.3.2 Constellations
Fig. 2.15 Left: M-31 (NGC-224), the Andromeda Galaxy; Right: the mythic figure Andromeda (See http://www.astr.ua.edu/gallery2t.html)
Andromeda is a spiral galaxy with about twice as many stars as our Milky Way. It is the most distant object visible to the naked eye. According to Greek mythology, Andromeda was King Cepheus' daughter, and Poseidon was the god of the sea. One day her mother Cassiopeia boasted that she and Andromeda were more beautiful than Poseidon's daughters. Poseidon was angry and sent floods to the lands ruled by Cassiopeia and her husband. King Cepheus learned from an oracle that the only way to appease Poseidon was to sacrifice his daughter. Andromeda was chained to a rock, waiting to be sacrificed to a sea monster, when Perseus arrived just in time, killed the sea monster, and saved the princess. Not surprisingly,
the Andromeda constellation is next to the Perseus constellation as well as the
Cassiopeia constellation (See Fig. 2.16).
The Orion constellation is one of the most recognizable constellations in the
Northern Hemisphere. Orion the Hunter is accompanied by his faithful dogs, Canis
Major and Canis Minor. They hunt various celestial animals, including Lepus (the
rabbit) and Taurus (the bull). According to Greek mythology, Orion once boasted
that he could kill all wild beasts. The goddess of the earth Gaea wanted to punish
Orion for his arrogance and she sent the scorpion to kill him. The scorpion stung
Orion on the heel. So in the night sky, as Scorpio (the scorpion) rises from the eastern horizon, Orion sets in the west. However, Asclepius, associated with the constellation Ophiuchus, healed Orion and crushed the scorpion; Orion rises again in the east as Asclepius (Ophiuchus) crushes Scorpio into the earth in the west. To the Greeks, Orion in the sky waves his club in his right hand and holds a lion's skin trophy aloft in his left hand (See Fig. 2.17). There are several other versions of the story.
For example, the scorpion did kill Orion and the gods put them on the opposite side
of the sky so that the scorpion would never hurt Orion again.
Hanging down from Orion's belt is his sword, which is made up of three fainter stars; the red glow in the middle of the sword is the Orion Nebula. The central “star” of the sword is the Great Orion Nebula (M-42), one of the most studied regions in the whole sky. Nearby is the Horsehead Nebula (IC-434), a swirl of dark dust in front of a bright nebula. Figure 2.18 is another illustration of
Orion the Hunter.
Fig. 2.16 Perseus and Andromeda constellations in John Flamsteed’s Atlas Coelestis (1729)
(Courtesy of http://mahler.brera.mi.astro.it/)
Fig. 2.17 Taurus and Orion in John Flamsteed’s Atlas Coelestis (1729) (Courtesy of http://mahler.
brera.mi.astro.it/)
Why do scientists map the Universe? Stephen Landy gave an informative overview of the history of mapping the universe (Landy 1999). Astronomers study galaxies; cosmologists study nature on its very largest scales, where a galaxy is the basic unit of matter. There are billions of galaxies in the observable universe, and they form clusters three million or more light-years across. Figure 2.19 is an illustration from Scientific American (June 1999) showing the scales in the universe. Modern cosmology rests on a fundamental assumption about the distribution of matter in the universe – the cosmological principle, which says that the universe is, overall, homogeneous. On large scales, the distributions of galactic bodies should approach
Fig. 2.19 Large-scale structures in the Universe (Reprinted from Scientific American, June 1999)
uniformity. But scientists face a paradox: how can the uniformity on the ultimate
scale be reconciled with the clumpy distributions on smaller scales? Mapping the
universe may provide vital clues.
In the late 1970s and early 1980s, cosmologists began to systematically map
galaxies (Gregory and Thompson 1982). Cosmo-cartographers discovered that on
scales of up to 100 million light-years, galaxies are distributed as a fractal with a
dimension of between one and two. The fractal distribution of matter would be a
severe problem for the cosmological principle because a fractal distribution is never
homogeneous and uniform. However, subsequent surveys indicated that on scales of
hundreds of millions of light-years, the fractal nature broke down. The distributions
of galaxies appeared to be random on these scales. The cosmological principle was
saved just before it ran into its next challenge.
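To give a concrete sense of what a fractal dimension “between one and two” means, here is a minimal box-counting sketch; the point set is synthetic, and real surveys estimate dimensions with far more robust correlation-function methods:

```python
# Minimal box-counting sketch: estimate a fractal dimension from the slope
# of log(occupied boxes) versus log(grid resolution). Data are synthetic.
import math
import random

random.seed(1)
# Synthetic clustered points in the unit square: clumps around 50 centers.
centers = [(random.random(), random.random()) for _ in range(50)]
points = [(cx + random.gauss(0, 0.01), cy + random.gauss(0, 0.01))
          for cx, cy in centers for _ in range(40)]

def box_count(pts, n):
    """Count occupied cells in an n-by-n grid over the unit square."""
    return len({(int(x * n), int(y * n)) for x, y in pts})

n_small, n_large = 4, 64
slope = (math.log(box_count(points, n_large)) -
         math.log(box_count(points, n_small))) / math.log(n_large / n_small)
print(f"estimated fractal dimension: {slope:.2f}")  # between 1 and 2 here
```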
Astronomer John Huchra at the Harvard-Smithsonian Center for Astrophysics
(CfA) is well known for his work on mapping the Universe. Between 1985 and 1995,
John Huchra, Margaret Geller and others measured relative distances via redshifts
for about 18,000 bright galaxies in the northern sky to make maps of the distribution
of galaxies around us. The CfA survey used redshift as the measure of the radial
coordinate in a spherical coordinate system centered on the Milky Way. This initial
map was quite surprising: the distribution of galaxies in space was not random;
galaxies actually appeared to be distributed on surfaces, almost bubble-like,
surrounding large empty regions, or "voids." Great voids and elongated structures
clearly indicate an organized structure of matter on large scales. Any cosmological
theory must explain how these structures evolved from an almost uniform universe.
CfA’s redshift survey revealed a “Great Wall” of galaxies 750 million light-years
long, more than 250 million light-years wide and 20 million light-years thick (See
Fig. 2.20). This Great Wall is now called the CfA Great Wall to differentiate it from
the even bigger Sloan Great Wall discovered a few years later in 2003. The CfA
Great Wall is like a giant quilt of galaxies across the sky (Geller and Huchra 1989).
A random distribution cannot readily explain such a coherent structure. Even larger
mapping and surveying projects were undertaken. Stephen Landy (1999) explained
the Las Campanas Redshift Survey, which took place between 1988 and 1994. It
would take a lengthy exposure time to photograph the most distant galaxies because
they were faint. The Las Campanas survey chose to slice through the universe and
concentrated on a very deep and wide but thin slice (See Fig. 2.21).

Fig. 2.20 The CfA Great Wall – the structure is 500 million light-years across. The Harvard-Smithsonian Center for Astrophysics redshift survey of galaxies in the northern celestial hemisphere of the universe has revealed filaments, bubbles, and, arching across the middle of the sample, the Great Wall
Astronomers have begun to catalogue 100 million of the brightest galaxies
and 100,000 quasars, which are the exploding hearts of galaxies, using a device
called the two-degree field spectrograph (2dF). The 2dF Galaxy Redshift Survey is an
international collaboration involving more than 30 scientists from 11 institutions. It
is due to be completed in 2006. The survey aims to learn more about the structure of the
Universe, how galaxies are made and how they form into larger structures.
The 2dF instrument is one of the most complex astronomical "cameras" ever built.
It uses 400 optical fibers, all of which can be positioned by an incredibly
accurate robotic arm in about one hour. The 2dF instrument allows astronomers to
observe and analyze 400 objects at once, and on a long clear night, they can log
the positions of more than 2,000 galaxies. It has taken less than 2 years to measure
the distances for 100,000 galaxies. Without the 2dF instrument, this project would
have taken decades. Figure 2.22 shows a virtual scene of flying through a three-
dimensional model of the universe.
The Sloan Digital Sky Survey (SDSS) is one of the most ambitious and influential
surveys in the history of astronomy.2 It is designed to collect astronomical data for
the study of the origin and evolution of the Universe, mapping large-scale structures
in the universe, and the study of quasars and their evolution. According to the
official site of SDSS, SDSS-I (2000–2005) and SDSS-II (2005–2008) covered more
than a quarter of the sky and created three-dimensional maps containing more than
930,000 galaxies and more than 120,000 quasars. SDSS-III is currently in operation
(2008–2014).
Some of the recent discoveries were only possible with the large amount of
data collected by the SDSS survey. For example, astronomers were able to detect
2 http://www.sdss.org/
3 http://www.astro.princeton.edu/universe/
Fig. 2.23 Part of the rectangular logarithmic map of the universe depicting major astronomical objects beyond 100 Mpc from the Earth (The full map is available at http://www.astro.princeton.edu/universe/all100.gif. Reprinted from Gott et al. 2005)
the design. The scale is logarithmically transformed to compress the vast voids of
space into a compact map. The Earth is at the center of the map because the SDSS
telescope, located on the Earth, measures the distance of an astronomical object from
us. Quasars, for example, formed in the early stages of the universe and appear near
the outer rim of the circular map. Each red dot in the outer belt depicts a quasar
found by the SDSS survey.
The map conveys 14 types of information, including various astronomical objects
such as high redshift quasars found by SDSS, extrasolar planets, stars, and space
probes. In addition to astronomical objects found by the SDSS survey, the circular
map of the universe contains the positions of several other types of objects such
as galaxies found by the CfA2 survey, galaxies on the Messier catalog, and the
brightest stars in the sky. Some of the objects are associated with information about
when they were discovered and the time periods in which articles about these objects
attracted bursts of citations. Figure 2.26 shows the types of objects on the map, the
subtotal of each type of object, and examples.
Fig. 2.24 A map of the universe based on the SDSS survey data and relevant literature data from the Web of Science. The map depicts 618,223 astronomical objects, mostly identified by the SDSS survey, including 4 space probes (A high-resolution version of the map can be found at http://cluster.cis.drexel.edu/~cchen/projects/sdss/images/2007/poster.jpg)
Figure 2.27 shows the center of the circular map of the universe. The Earth
is at the center of the depicted universe – we are of course aware of what the
Copernican revolution was all about. The logarithmic scale shown on the map, along
the Northeast direction, gives us a rough idea of how far away an object is from the
Earth. For example, artificial satellites orbit 10,000–100,000 km above the Earth.
The distance between the Sun and the Earth is one astronomical unit (AU). The
space probe Pioneer 10 was about 100 AU away from the Earth at the time of the
SDSS survey. Sirius, the brightest star in the sky, is slightly over one parsec (pc)
away from us; according to the Hipparcos astrometry satellite, it is 2.4 pc, or 8.48
light-years, away. About 100 pc away, there are over 8,000 objects identified by the
Hipparcos satellite.

Fig. 2.26 The types of objects shown in the circular map of the universe

Fig. 2.28 Major discoveries in the west region of the map. The 2003 Sloan Great Wall is much farther away from us than the 1989 CfA2 Great Wall
Major discoveries in astronomy are also marked on the map, for example, the
discovery of Neptune in 1846 and the discovery of the first quasar 3C 273 in 1963,
the CfA2 Great Wall in 1989, and the Sloan Great Wall in 2003 (See Fig. 2.28).
As of 2012, the Sloan Great Wall is still the largest known cosmic structure in the
universe. It was discovered by J. Richard Gott III, Mario Juric, and their colleagues
in 2003 based on the SDSS data. The Sloan Great Wall is gigantic. It is 1.38 billion
light years in length. This is approximately 1/60 of the diameter of the observable
universe. It is about one billion light-years from the Earth. The Sloan Great Wall
is 2.74 times longer than the CfA2 Great Wall of galaxies discovered in 1989 by
Margaret Geller and John Huchra.
J. Richard Gott III and Mario Juric generously shared with us the data and the
source code they used to generate their rectangular logarithmic map of the universe,
which is the first scientifically accurate map of the universe; our circular map is the
second. One of the lessons we learned is the invaluable role of a computer
programming language in facilitating interdisciplinary research: the shared code
provided a firm common ground for astrophysicists and information scientists.
It is easy to tell from Fig. 2.28 that the scope of the SDSS survey reaches much
closer to the beginning of the universe than the scope of the CfA2 survey, marked
by the galaxies in yellow. The red dots are high-redshift quasars found by the SDSS
survey. The blue dots are galaxies found by the SDSS. The yellow dots are galaxies
found by the 1989 CfA2 survey. The SDSS survey also reaches deeper than the
Hubble Deep Field (HDF) did in 1995. The HDF is an image of a small region
assembled from 342 separate Hubble exposures taken over 10 consecutive days,
December 18–28, 1995 – almost exactly 17 years before this writing, on
December 27, 2012. Because the HDF image reveals some of the youngest and most
distant galaxies ever known, it has become a landmark image in the study of the
early universe.
Fig. 2.29 The Hubble Ultra Deep Field (HUDF) is featured on the map of the universe
Figure 2.29 shows the Northeast quadrant of the map. The Hubble Ultra Deep
Field (HUDF), located near the upper left corner of the image, reached even deeper
than the 1995 HDF. In other words, the HUDF reveals an even earlier stage of
the universe. It looks back approximately 13 billion years, to about 400–800 million
years after the Big Bang. Its marker approaches the 10-gigaparsec (Gpc) mark on
the distance scale. One gigaparsec (Gpc) is 3.0857 × 10^25 m, or 3.26 billion
light-years. The HUDF's record was recently updated by the eXtreme Deep Field
(XDF), released on September 25, 2012. The XDF reveals galaxies formed only
450 million years after the Big Bang.
In addition to the depiction of astronomical objects such as galaxies and quasars,
the circular map of the universe also presents information about which astronomical
objects have attracted the attention of astronomers in terms of citation bursts.
We will explain the concept of citation burst in detail in later chapters of the
book. Simply speaking, a citation burst of a scientific publication measures the
acceleration of the citations it has received. A strong citation burst is a sign that
the article has generated a significant level of interest in the scientific community,
in this case, astronomers. Figure 2.30 shows that a citation burst was found for
the object QSO J1030+0524 between 2003 and 2004. This object, as it turns out,
was the most distant quasar known when it was discovered. Astronomers measure
the redshift of an object with a metric z, the change in the wavelength of the object's
light divided by the rest wavelength. The quasar was discovered to have a z of 6.28
at the time, which was very high. The next quasar labeled below QSO J1030+0524
is QSO J1044-0125, which has a citation burst between 2000 and 2004. It is a
high-redshift quasar as well (z = 5.73). The third labeled quasar, QSO J1048+4637,
also has a high redshift (z = 6.23).
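As a rough intuition for what a citation burst captures, consider a toy detector that flags years in which an article's annual citation count accelerates sharply. This is only a simplified proxy sketched for illustration; the burst detection actually used in this book, Kleinberg's algorithm, is explained in later chapters.

```python
def burst_years(annual_citations, threshold=2.0):
    """Flag years in which citations accelerate sharply: a crude proxy for a
    citation burst, marking a year whose count grows by at least `threshold`
    times over the previous year. Kleinberg's algorithm models the same
    intuition with a two-state automaton instead of a fixed ratio."""
    years = sorted(annual_citations)
    return [curr for prev, curr in zip(years, years[1:])
            if annual_citations[prev] > 0
            and annual_citations[curr] / annual_citations[prev] >= threshold]

# e.g. burst_years({2000: 3, 2001: 4, 2002: 21, 2003: 25}) -> [2002]
```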
Fig. 2.31 A network of co-cited publications based on the SDSS survey. The arrow points to an
article published in 2003 on a survey of high redshift quasars in SDSS II. A citation burst was
detected for the article
Figure 2.31 shows the literature resulting from the SDSS survey. Each dot
represents a published article. The size of the tree-ring indicates the citations
received by the corresponding article. The yellow arrow points to an article by Fan
et al. in 2003 on a survey of high redshift quasars in SDSS II, the second stage of
the SDSS project. The article was found to have a burst of citations, indicating the
attention it attracted from the scientific community. In later chapters in this book,
we will discuss this type of literature visualization in more detail.
The SDSS example has practical implications for science mapping. First,
astronomy provides a natural framework to organize and display a large number
of astronomical objects. The distance between a high-redshift quasar and the Earth
is meaningful. It can be precisely explained in scientific terms. The mission of the
scientific frontier in this context is to understand the early universe. The attention
of the research frontier is clearly directed to the high redshift quasars because they
were formed soon after the Big Bang. An intuitive indicator of the progression of the
frontier is the look-back time, i.e. how closely objects formed after the Big Bang can
be observed. The structure of the universe in this case provides an intuitive reference
to represent where the current scientific frontier is and where its next move might
be. Just imagine for a moment what it would be like if we did not have such an
organizing structure to work with.
Second, the structure of the universe provides an intellectual playground for
astronomers. It is clear that astronomers, as expected, do not devote their attention
evenly across the universe. Once we have developed a good understanding of our
local environment, the search is extended to other parts of the universe, literally far
and wide. The organizing metaphor in astronomy coincides with the universe. The
isomorphic relation raises a new question: is there a situation in which the nice and
intuitive structure may limit our creativity? Would theories proven valuable in one
part of the universe be potentially valuable if applied elsewhere in the universe?
The visualization of the relevant literature
shows a different structure. In other words, the physical world and the conceptual
world have different structures. Things are connected not simply because they are
in proximity. Likewise, things separated by a vast space of void in the universe may
be close to each other in the conceptual world.
It is likely to be common, rather than exceptional, that we deal with multiple
perspectives of the same phenomena, each of which may lead to a
unique picture. What do we need to do to reconcile multiple perspectives? Do we
need to reconcile at all? What can we gain from having multiple views and what do
we have to lose?
2.4 Biological Maps

The history of deoxyribonucleic acid (DNA) research began with the Swiss biologist
Friedrich Miescher. In 1868 he carried out the first chemical studies on the nuclei
of cells. Miescher detected a substance that he called nuclein and showed that
nuclein consisted of an acidic portion, which included what we now know as DNA
and other things. Later he found a similar substance in the heads of salmon sperm
cells. Although he separated the nucleic acid fraction and studied its properties,
the covalent structure of DNA did not become known with certainty until the late
1940s. Miescher suspected that nuclein or nucleic acid might play a key role in cell
inheritance, but others ruled out such a possibility. It was not until 1943 that the first
direct evidence emerged for DNA as the bearer of genetic information. In that year,
Oswald Avery, Colin MacLeod, and Maclyn McCarty, working at the Rockefeller
Institute, provided early evidence that DNA is the carrier of genetic information
in all living cells. In the 1950s biologists still did not know what the DNA molecule
looked like or how the parts of it were arranged.
At King's College London, the English physicist Maurice Wilkins, together
with the English scientist Rosalind Franklin, spent most of 1951 using
X-ray photography to work out the structural shape and nitrogenous
base arrangements of DNA. Rosalind Franklin was an expert in using X-ray
crystallography to study imperfectly crystalline matter, such as coal. She discovered
the two forms of DNA. The easily photographed A form was dry, while the B
form was wet. Though much harder to photograph, her pictures of the B form showed
a helix. Since the water would be attracted to the phosphates in the backbone, and
the DNA was easily hydrated and dehydrated, she guessed that the backbone of the
DNA was on the outside and the bases were therefore on the inside. This was a
major step forward in the search for the structure of DNA.
In May 1952, Franklin got her first good photograph of the B form of DNA,
showing a double helix. This was another major breakthrough, but Franklin missed
it and continued working on the A form. James Watson and Francis Crick started
their work together on DNA in 1951 at Cambridge University. By the end of
1952, Watson approached Maurice Wilkins, who gave him one of Franklin's X-ray
photographs. Watson started to build a new model of DNA revealing its
structure as a double helix, or spiral. In 1962, together with Maurice Wilkins,
Watson and Crick were awarded the Nobel Prize for their discovery of the structure of DNA.
Figure 2.32 shows the original structure of DNA’s double helix.
Despite proof that DNA carries genetic information from one generation to the
next, the structure of DNA and the mechanism by which genetic information is
passed on to the next generation remained the single greatest unanswered question
in biology until 1953. It was in that year that James Watson, an American geneticist,
and Francis Crick, an English physicist, working at the University of Cambridge
in England proposed a double helical structure for DNA (Watson and Crick
1953). This was a key discovery for molecular biology and modern biotechnology.
Using information derived from a number of other scientists working on various
aspects of the chemistry and structure of DNA, Watson and Crick were able to
assemble the information like pieces of a jigsaw puzzle to produce their model
of the structure of DNA. Watson (1991) gave a personal account of the discovery.
Acupuncture began with the original Chinese medical text, the Yellow Emperor’s
Classic of Internal Medicine (475 BC). In this text, all six Yang Meridians were
said to be directly connected to the Auricle, whereas the six Yin meridians were
indirectly connected to the ear. These ancient Chinese Ear Points were arranged as
a scattered array of points on the ear. Figure 2.33 is an ear acupuncture point map.
What is the best organizing metaphor?
In auriculotherapy, the auricle of the ear is viewed as a complete miniature of the
human body, with over 200 specific acupuncture points, and the auricle of the
external ear is utilized to alleviate pain, dysfunction and disease as represented and
manifested throughout the body. All vertebrae, sympathetic and parasympathetic
nerves, spinal nerves, visceral organs, the central nervous system, and many
anatomical sites and functional points are represented on the ear. While
originally based upon the ancient Chinese practice of acupuncture, the somatotopic
correspondence of specific parts of the body to specific parts of the ear
was first developed by Paul Nogier, a French doctor of medicine, in the late 1950s.
According to Nogier, the auricle mirrors the internal organs and auricular points can
be mapped to an inverted projection of an embryo.
Nogier developed a somatotopic map of the ear based upon the inverted-fetus
concept. His work was first presented in France, then published by a German
acupuncture society, and finally translated into Chinese. In 1958, a massive
study was initiated in China to verify the clinical value of his inverted-embryo
model. In 1980, a study at UCLA by Richard Kroening and Terry Oleson verified
the scientific accuracy of auricular diagnosis. A statistically significant accuracy
of 75 % was achieved in diagnosing musculoskeletal pain problems in 40
pain patients. Figure 2.34 is a map showing musculoskeletal points.

Fig. 2.33 Ear acupuncture point map. What is the best organizing metaphor? (Courtesy of http://www.auriculotherapy-intl.com/)
Auricular therapy has numerous applications. A lot of work has been done to
establish the relationship between the auricle and the body as a whole – the location
and distribution of auricular points, and the function and specificity of those
points – as well as to verify Nogier's theory.
Fig. 2.34 Musculoskeletal points (© 1996 Terry Oleson, UCLA School of Medicine. http://www.americanwholehealth.com/images/earms.gif)
Due to the publicity of the Human Genome Project, genomic maps, gene expression
visualization, and bioinformatics have become buzzwords in the mass media.
Traditionally, expression data has been analyzed in a single dimension.
Single-dimensional analysis places genes in a total ordering, limiting the
ability to see important relationships.
Kim et al. (2001) visualized the C. elegans expression data in three dimensions.
In this three-dimensional approach, groups of related genes appear as mountains,
and the entire transcriptome appears as a mountain range. Distances in this synthetic
geography are related to gene similarity, and mountain heights are related to the
density of observed genes in a similar location. Expression visualization allows us
to hypothesize potential gene–gene relationships that can be experimentally tested.
To find out which genes are co-expressed, Kim et al. first assembled a gene
expression matrix in which each row represents a different gene (17,817 genes) and
each column corresponds to a different microarray experiment (553 experiments).
The matrix contains the relative expression level for each gene in each experiment
(expressed as log2 of the normalized Cy3/Cy5 ratios). They calculated the Pearson
correlation coefficient between every pair of genes. For each gene, the similarities
between it and the 20 genes with the strongest positive correlations were used to
assign that gene to an x–y coordinate in a two-dimensional scatter plot with the use
of force-directed placement. Each gene is thus placed near other genes with similar
expression patterns. Figure 2.35 shows a terrain map of Caenorhabditis elegans
gene expression.

Fig. 2.35 Caenorhabditis elegans gene expression terrain map created by VxInsight, showing a three-dimensional representation of 44 gene mountains derived from 553 microarray hybridizations and consisting of 17,661 genes (representing 98.6 % of the genes present on the DNA microarrays) (Reprinted from Kim et al. 2001)
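The similarity computation described above can be sketched in a few lines of numpy. This is a minimal reconstruction under stated assumptions, not the pipeline actually used by Kim et al.: random data stands in for the 17,817-gene by 553-experiment matrix of log2 ratios.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 50))   # stand-in for the genes x experiments matrix

# Pearson correlation between every pair of genes (rows).
corr = np.corrcoef(expr)
np.fill_diagonal(corr, -np.inf)     # ignore each gene's self-correlation

# For each gene, keep the 20 genes with the strongest positive correlations;
# these similarities drive the force-directed placement of genes on the map.
top20 = np.argsort(corr, axis=1)[:, -20:]
```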
In May 2009, as H1N1 was rapidly spreading across many countries, there was a
rich body of knowledge about influenza pandemics in the literature. Figure 2.36
shows a similarity map of 114,996 influenza virus protein sequences. Each dot is
an individual influenza virus protein sequence. Two sequences are connected if they
are similar in terms of protein structure. Structural similarity is one way to organize
protein sequences. There could be other ways, for example, based on similarities
of biological properties. Once again, multiple perspectives are applicable. The
question is what combination of the information provided by the various views
would best solve the problem at hand.
Fig. 2.36 114,996 influenza virus protein sequences (Reprinted from Pellegrino and Chen 2011)
References
Fan XH, Strauss MA, Schneider DP, Becker RH, White RL, Haiman Z, Gregg M, Pentericci L,
Grebel EK, Narayanan VK, Loh YS, Richards GT, Gunn JE, Lupton RH, Knapp GR, Ivezic
Z, Brandt WN, Collinge M, Hao L, Harbeck D, Prada F, Schaye J, Strateva I, Zakamska N,
Anderson S, Brinkmann J, Bahcall NA, Lamb DQ, Okamura S, Szalay A, York DG (2003) A
survey of z > 5.7 quasars in the Sloan Digital Sky Survey. II. Discovery of three additional
quasars at z > 6. Astron J 125(4):1649–1659. doi:10.1086/368246
Geller MJ, Huchra JP (1989) Mapping the universe. Science 246:897
Gregory SA, Thompson LA (1982) Superclusters and voids in the distributions of galaxies. Sci
Am 246(3):106–114
Hearnshaw HM, Unwin DJ (1994) Visualization in geographical information systems. John Wiley & Sons, New York
Kim S, Lund J, Kiraly M, Duke K, Jiang M, Stuart J et al (2001) A gene expression map for
Caenorhabditis elegans. Science 293:2087–2092
Landy SD (1999) Mapping the universe. Sci Am 280(6):38–45
Lin X (1997) Map displays for information retrieval. J Am Soc Inf Sci 48(1):40–54
Pellegrino DA, Chen C (2011) Data repository mapping for influenza protein sequence analysis.
Paper presented at the 2011 Visualization and Data Analysis (VDA)
Gott JR III, Juric M, Schlegel D, Hoyle F, Vogeley M, Tegmark M, Bahcall N, Brinkmann J (2005) A map of the universe. Astrophys J 624:463–484
Sesti GM (1991) The glorious constellations: history and mythology. Harry N. Abrams, Inc.,
New York
Watson J (1991) The double helix: a personal account of the discovery of the structure of DNA.
Mass Market Paperback
Watson JD (1968) The double helix. Atheneum, New York
Watson JD, Crick FHC (1953) A structure for deoxyribose nucleic acid. Nature 171:737–738
Whitefield P (1999) The topography of the sky: Celestial Maps gave order to the universe.
Mercator’s World, Eugene
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A et al (1995) Visualizing the
non-visual: spatial analysis and interaction with information from text documents. Paper
presented at the IEEE symposium on information visualization’95, Atlanta, Georgia, USA,
30–31 October 1995
Chapter 3
Mapping Associations
The eyes are not responsible when the mind does the seeing.
Publilius Syrus (circa 85–43 BC)
In this chapter, we focus on the most basic requirements for representing abstract,
dynamic, and often elusive structures that have no inherent connections between
their content and a concrete, tangible form. We are particularly looking
for possible extensions and adaptations of cartographic techniques, for example,
terrain relief maps, landscape views, and constellation maps. But terrain maps
and constellation maps now acquire new meanings and transcend the boundaries
of geometry, topology, and appearance. The semantics of geometric features,
topological patterns, and temporal rhythms now need to be conveyed effectively.

In Memex, the idea is to make such connections accessible to other
people. Connections made in this way are called trails. Bush referred to people who
are making such trails as trailblazers. Trailblazers are builders of an ever-growing
information space. Memex itself has never materialized, but it has a gigantic nearest
kin – the World-Wide Web.
We know that the Web relies on hypertext reference links to pull millions of
documents together. In fact, studies of small-world networks have found that the Web
has many features of a small-world network. We will return to small-world networks
in later chapters, but an interesting thing to know here is that the Web has a diameter
of about 16, which means that given an arbitrary pair of documents on the Web, we
can reach one from the other by following a chain of, on average, 16 hyperlinks.
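On a crawled sample of the Web graph, this average separation can be estimated directly. The sketch below uses a synthetic scale-free graph as a stand-in for real crawl data; the library calls are standard networkx, but the graph and its parameters are illustrative assumptions.

```python
import networkx as nx

# A synthetic scale-free graph as a toy stand-in for a Web crawl.
web = nx.DiGraph(nx.scale_free_graph(1000, seed=42))

# Average shortest-path length over all reachable ordered pairs, analogous
# to the "about 16 clicks" separation reported for the Web.
total, pairs = 0, 0
for source, lengths in nx.shortest_path_length(web):
    for target, hops in lengths.items():
        if source != target:
            total += hops
            pairs += 1
print(total / pairs)
```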
A central issue for the Web is how to make sure that users can find their way in
this gigantic structure. The predecessors of the Web were a group of hyper-referencing-enabled
information systems – hypertext systems. Research in hypertext in
the late 1980s was marked by a number of classic hypertext systems such as Apple's
HyperCard and NoteCards from Xerox PARC. Navigation has been a central
research issue for hypertext over the last two decades. For example, Canter and his
colleagues distinguished five types of search in hyperspace (Canter et al. 1985):
• Scanning: covering a large area without depth.
• Browsing: following a path until a goal is achieved.
• Searching: striving to find an explicit goal.
• Exploring: finding out the extent of the information given.
• Wandering: purposeless and unstructured globetrotting.
An overview map is a commonly used solution to the notorious lost-in-
hyperspace problem first identified by Jeff Conklin (1987). A wide variety of
techniques have been developed over the last two decades for generating overview
maps automatically. The sheer size of the Web poses a tremendous challenge. Many
algorithms developed prior to the Web need to be scaled up before they can handle
the Web. New strategies have been developed to avoid brute-force approaches.
The origin of cognitive maps can be traced back to Edward Tolman’s famous study
published in 1948 on the behavior of rats in a maze1 (Tolman 1948). He studied the
behavior of those rats that managed to find the food placed in a maze and realized
that his rats had obviously managed to remember the layout of the maze. Prior to
Tolman’s study, it was thought that rats in a maze were only learning at particular
turning points to make left or right turns. Tolman called this internalized layout a
cognitive map. He further proposed that rats and other organisms develop cognitive
maps of their environments.
1 http://psychclassics.yorku.ca/Tolman/Maps/maps.htm
placed at the center of such triangles. Edges are drawn to show the boundaries of
large districts. Features such as signposts, history and backtracking mechanisms
were also considered in their city image metaphor, but they were not fully
implemented.
Legibility of a city helps people traveling in the city. The more spatial knowledge
we have of a city, the more easily we can find our way around it. Thorndyke and Hayes-Roth
distinguished three levels of such spatial knowledge (Thorndyke and Hayes-Roth
1982) as landmark knowledge, procedural knowledge, and survey knowledge.
Landmark knowledge is the most basic awareness of specific locations in a city
or a way-finding environment. If all we know about London is Big Ben and
Trafalgar Square, then our ability to navigate through London would be rather
limited. Procedural knowledge, also known as route knowledge, allows a traveler to
follow a particular route between a source and a destination. Procedural knowledge
connects isolated landmark knowledge into larger, more complex structures. Now
we would know at least one route leading from Big Ben to Trafalgar Square.
At the level of survey knowledge, we have fully connected topological information
about a city. Survey knowledge is essential in performing way-finding tasks. A good
example of survey knowledge is the Knowledge of London examination that
everyone applying for a taxi license must pass. Transport for London
says to each applicant:
You must have a thorough knowledge of London, including the location of
streets, squares, clubs, hospitals, hotels, theatres, government and public buildings,
railway stations, police stations, courts, diplomatic buildings, important places of
worship, cemeteries, crematoria, parks and open spaces, sports and leisure centers,
places of learning, restaurants and historic buildings; in fact everything you need to
know to be able to take passengers to their destinations by the most direct routes.
You may be licensed either for the whole of London or for one or more
of the 16 suburban sectors. The “All London” license requires you to have
a detailed knowledge of the 25,000 streets within a six-mile radius of Charing Cross
with a more general knowledge of the major arterial routes throughout the rest of
London. If you wish to work as a taxi driver in central London or at Heathrow
Airport you need an “All London” license.
We will briefly introduce the famous traveling salesman problem (TSP) in
Chap. 4. The salesman needs to figure out a tour of a number of cities such that he
visits each city exactly once and the overall distance of the tour is minimal.
If the salesman is in London, it looks like his best bet is to take a taxi. Figure 3.2 shows
the coverage of London taxi drivers' survey knowledge.

Fig. 3.2 The scope of the Knowledge of London, within which London taxi drivers are supposed to know the most direct route by heart, that is, without resorting to the A–Z street map
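For concreteness, the greedy nearest-neighbor heuristic gives a feel for the problem; this is a toy sketch, not an optimal solver, and the TSP is taken up properly in Chap. 4.

```python
import math

def nearest_neighbor_tour(cities):
    """Greedy heuristic for the traveling salesman problem: from the current
    city, always visit the closest unvisited city. Fast but generally not
    optimal - unlike a London cabbie's survey knowledge."""
    unvisited = set(cities)
    tour = [unvisited.pop()]          # start from an arbitrary city
    while unvisited:
        nxt = min(unvisited, key=lambda c: math.dist(tour[-1], c))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

tour = nearest_neighbor_tour([(0, 0), (2, 1), (1, 5), (4, 3)])
```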
The soundest survey knowledge is acquired directly from first-hand navigation
experience in an environment – London's taxi drivers have certainly demonstrated
their first-hand navigation experience in London. Alternatively, we can develop our
survey knowledge by reading maps. However, survey knowledge acquired in this
way tends to be orientation-specific, which means that the navigator may need
to rotate the mental representation of the space to match the environment. This
concern led Marvin Levine to explore how this phenomenon should be taken into
account by map designers (Levine et al. 1982, 1984). Levine stressed that maps
should be congruent with the environment so that we can quickly locate our current
position and orientation both on the map and in the environment. Levine laid down three
principles for map design:
The two-point theorem – a map reader must be able to relate two points on the map
to their corresponding two points in the environment.
The alignment principle – the map should be aligned with the terrain. A line between
any two points in space should be parallel to the line between those two points
on the map.
The forward-up principle – the upward direction on a map must always show what
is in front of the viewer.
Researchers have adapted many of the real-world way-finding strategies for
way-finding tasks in virtual environments. For example, Rudolph Darken and others
provide an informative summary in their article on way-finding behavior in virtual
environments (Darken et al. 1998).
3.2 Identifying Structures
Information visualization emerged as a field of study in the 1990s. There has
been widespread interest across research institutions and the commercial
market. Applications of information visualization range from dynamic maps of the
stock market to the latest visualization-empowered patent analysis laboratories. It is
one of the most active research areas that can bring technical advances into a new
generation of science mapping.

The goal of information visualization is to reveal invisible patterns in abstract
data. Information visualization is meant to bring people new insights, not merely pretty
pictures. The greatest challenge is to capture something abstract and invisible with
something concrete, tangible, and visually meaningful. The design of an effective
information visualization system is more of an art than a science. Two fundamental
components of information visualization are structuring and displaying.
has a term-independence assumption, which says that the occurrences of one term can be
regarded as independent of the occurrences of another term. However, this may not
be the case.
When dealing with text documents, a commonly encountered problem is known
as the vocabulary mismatch problem. In essence, people may choose different
vocabulary to describe the same thing.
There are two aspects to the problem. First, there is a tremendous diversity in the
words people use to describe the same object or concept; this is called synonymy.
Users in different contexts, or with different needs, knowledge or linguistic habits
will describe the same information using different terms. For example, it has been
demonstrated that any two people choose the same main keyword for a single, well-
known object less than 20 % of the time on average. Indeed, this variability is much
greater than commonly believed, and it places strict, low limits on the expected
performance of word-matching systems.
The second aspect relates to polysemy, a word having more than one distinct
meaning. In different contexts or when used by different people the same word
takes on varying referential significance (e.g., “bank” in river bank versus “bank” in
a savings bank). Thus the use of a term in a search query does not necessarily mean
that a text object containing or labeled by the same term is of interest. Because
human word use is characterized by extensive synonymy and polysemy, straightforward
term-matching schemes have serious shortcomings: relevant materials
will be missed because different people describe the same topic using different
words, and irrelevant material will be retrieved because the same word can have
different meanings. The basic problem is that people want to access information
based on meaning, but the words they select do not adequately express intended
meaning. Previous attempts to improve standard word searching and overcome the
diversity in human word usage have involved: restricting the allowable vocabulary
and training intermediaries to generate indexing and search keys; hand-crafting
thesauri to provide synonyms; or constructing explicit models of the relevant domain
knowledge. Not only are these methods expert-labor intensive, but they are also
often not very successful.
Latent Semantic Indexing (LSI) is designed to overcome the vocabulary mis-
match problem faced by information retrieval systems (Deerwester et al. 1990;
Dumais 1995). Online services of LSI are available, for example, at
http://lsa.colorado.edu/. Individual words in natural language provide unreliable evidence about
the conceptual topic or meaning of a document. LSI assumes the existence of
some underlying semantic structure in the data that is partially obscured by the
randomness of word choice in a retrieval process, and that the latent semantic
structure can be more accurately estimated with statistical techniques.
In LSI, a semantic space is constructed based on a large matrix of term-document
association observations. LSI uses a mathematical technique called Singular Value
Decomposition (SVD). One can approximate the original, usually very large,
term-by-document matrix with a truncated SVD matrix. A proper truncation can remove
noise from the original data as well as improve the recall and precision of
information retrieval.
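The core of LSI can be sketched in a few lines of numpy. This is a minimal illustration with a toy term-by-document matrix; a real system would use a large sparse matrix and typically retain a few hundred dimensions.

```python
import numpy as np

# Toy term-by-document matrix A (rows: terms, columns: documents).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

# Truncated SVD: keep only the k largest singular values to filter the
# "noise" introduced by the randomness of word choice.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A

# Documents as k-dimensional vectors in the latent semantic space; queries
# can be folded into the same space and compared by cosine similarity.
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T
```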
Perhaps the most compelling claim for LSI is that it allows an information
retrieval system to retrieve documents that share no words with the query
(Deerwester et al. 1990; Dumais 1995). Another potentially appealing feature is
that the underlying semantic space can be subject to geometric representations. For
example, one can project the semantic space into a Euclidean space for a 2D or 3D
visualization. On the other hand, large complex semantic spaces in practice may not
always fit comfortably into low-dimensional spaces.
The Minkowski distance (geodetic) depends on the value of the r-metric. For
r = 1, the path weight is the sum of the link weights along the path; for r = 2, the
path weight is computed as the Euclidean distance; and for r = ∞, the path weight is
the maximum weight associated with any link along the path:

$$
W(P) = \left( \sum_{i=1}^{k} w_i^r \right)^{1/r} =
\begin{cases}
\sum_{i=1}^{k} w_i & r = 1 \\
\left( \sum_{i=1}^{k} w_i^2 \right)^{1/2} & r = 2 \\
\max_i w_i & r = \infty
\end{cases}
$$
The q-parameter specifies that triangle inequalities must be satisfied for paths
with k ≤ q links:

$$
w_{n_1 n_k} \le \left( \sum_{i=1}^{k-1} w_{n_i n_{i+1}}^r \right)^{1/r} \quad \forall k \le q
$$
When a PFNET satisfies the following three conditions, the distance of a path is
the same as the weight of the path:
1. The distance from a document to itself is zero.
2. The proximity matrix for the documents is symmetric; thus the distance is
independent of direction.
3. The triangle inequality is satisfied for all paths with up to q links.
If q is set to the total number of nodes less one, then the triangle inequality is
universally satisfied over the entire network. Increasing the value of parameter r or
q can reduce the number of links in a network. The geodesic distance between two
nodes in a network is the length of the minimum-cost path connecting the nodes. A
minimum-cost network (MCN), PFNET(r = ∞, q = n − 1), has the least number of
links. Figure 3.3 illustrates how a link is removed if it violates the triangle inequality.
See Chen (1999a, b), Chen and Paul (2001), and Schvaneveldt et al. (1989) for further
details.

Fig. 3.3 Nodes a and c are connected by two paths. If r = ∞, Path 2 is longer than Path 1, violating the triangle inequality, so it needs to be removed
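A minimal sketch of computing the minimum-cost network PFNET(r = ∞, q = n − 1) follows. With r = ∞ the Minkowski path weight reduces to the largest link weight along the path, so a Floyd–Warshall-style minimax pass suffices; the function name and the toy weights are assumptions made for illustration, not the original Pathfinder implementation.

```python
import numpy as np

def pfnet_mcn(w):
    """PFNET(r = infinity, q = n - 1): keep a link only if no alternative
    path has a smaller maximum link weight. `w` is a symmetric matrix of
    link weights (distances); np.inf marks absent links and the diagonal."""
    d = w.copy()
    for k in range(len(w)):
        # Minimax relaxation: the best path i -> j via k costs the larger
        # of its two legs; keep whichever is smaller, the old or the new.
        d = np.minimum(d, np.maximum(d[:, k:k + 1], d[k:k + 1, :]))
    return w <= d   # boolean adjacency matrix of the surviving links

w = np.array([[np.inf, 1.0, 3.0],
              [1.0, np.inf, 1.5],
              [3.0, 1.5, np.inf]])
keep = pfnet_mcn(w)   # the 3.0 link violates the triangle inequality
```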
The spatial layout of a Pathfinder network is determined by a force-directed
graph-drawing algorithm (Kamada and Kawai 1989). Because of its simplicity and
intuitive appeal, force-directed graph drawing has become increasingly popular in
information visualization.
Typical applications of Pathfinder networks include modeling a network of
concepts based on similarity ratings given by human experts, constructing proce-
dural and protocol analysis models of complex activities such as air-traffic control,
and comparing learners’ Pathfinder networks at various stages of their learning
(Schvaneveldt 1990).
Pathfinder networks display links between objects explicitly, making structural
patterns easy for our perceptual system to detect. In addition, Pathfinder network scaling is an
effective link-reduction mechanism, which prevents a network from being cluttered
by too many links. Figure 3.4 shows a Pathfinder network of 20 cities in the US.
The colors of nodes indicate the partition of the network based on the degree of
each node: white nodes have the degree of 3, blue nodes 2, and green nodes 1. The
size of each node indicates the centrality of the node. In this case, the Pathfinder
network turns out to be the unique minimum spanning tree. Figure 3.5 shows the
partition of the Pathfinder network by the degree of each node. The larger the size
of a node, the closer it is to the center.
If two images have the same size in pixels, we can compare them
pixel by pixel. If we have 100 images of the size of 64 × 64
pixels, the structure of these images can be represented as a so-called manifold in a
diagrams and visualization displays. Figures 3.7 and 3.8 show the screenshots of
two visualization models of the InfoViz image database by layout and by texture,
respectively. Both layout and texture similarities were computed by the QBIC
system.
The overall structure of the layout-based visualization is different from the color-
based visualization shown in Fig. 3.6. This is expected due to the self-organizing
nature of the spring-embedder model. On the other hand, visualizations based
on the two schemes share some local structures. Several clusters appear in both
visualizations. The spring-embedder algorithm tends to work well with networks of
fewer than a few hundred nodes.
Unlike the layout-based version, the texture-based visualization has a completely
different visual appearance from the color-based visualization. In part, this is
because the color-histogram and color-layout schemes share some commonality in
the way they deal with color, whereas texture does not.
Now we compare the Pathfinder networks generated by different features
extracted from images. The number of links in each network and the number
of links in common are used as the basis for network comparisons. The degree
of similarity between two networks is determined by the likelihood that the number of
common links would be expected given the total number of links in the networks involved.
Alternatively, one may consider using the INDSCAL method outlined later in this
chapter.
Information visualization has a long history of using terrain models and relief maps
to represent abstract structures. Information visualization based on word frequencies
and distribution patterns has been a distinctive research branch, originating especially
from information retrieval applications.
3.2.4.1 ThemeView
The changing patterns at the lexical level have been used to detect topical themes.
Some intriguing visualization technologies have been developed over the past few
years (Hetzler et al. 1998).
The most widely known example in this category is ThemeView, developed at
Pacific Northwest National Laboratory (Wise et al. 1995). James Wise described
an ecological approach to text visualization and how they used the relief map as a
model of a thematic space (Wise 1999). ThemeView enables the user to establish
connections easily between the construction and the final visualization. Figure 3.9
is a screenshot of PNNL’s ThemeView, showing word frequency distributions as
peaks and valleys in a virtual landscape.
3.2.4.2 VxInsight
Fig. 3.11 A virtual landscape of patent class 360 for a period between 1980 and 1984 in
VxInsight. Companies’ names are color-coded: Seagate-red, Hitachi-green, Olympus-blue, Sony-
yellow, IBM-cyan, and Philips-magenta (Courtesy of Kevin Boyack)
Figure 3.11 shows a virtual landscape of patent class 360 for a period of 4 years between 1980
and 1984. Further issues concerning patent analysis and visualization are discussed
in Chap. 5.
Fig. 3.12 A SOM-derived base map of the literature of geography (Reprinted from Skupin 2009)
The map is based on abstracts submitted to the annual meetings of the Association
of American Geographers (AAG) between 1993 and 2002. Each abstract was first
represented as a document in a 2,586-dimensional vector space. Then a two-dimensional
model of the document space was generated using SOM. Finally, the SOM
configuration was visualized in standard GIS software.
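The essence of SOM training can be sketched in plain numpy. This is a minimal on-line SOM for illustration only, not the software used in Skupin's study; the grid size, document vectors, and decay schedules are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.random((500, 50))            # toy documents in a 50-d term space
rows, cols, dim = 10, 10, docs.shape[1]
weights = rng.random((rows, cols, dim)) # one prototype vector per grid cell
gy, gx = np.mgrid[0:rows, 0:cols]

for t in range(2000):                   # on-line training
    x = docs[rng.integers(len(docs))]
    dist = ((weights - x) ** 2).sum(axis=2)
    bi, bj = np.unravel_index(dist.argmin(), dist.shape)  # best-matching unit
    sigma = 3.0 * np.exp(-t / 1000)     # shrinking neighborhood radius
    lr = 0.5 * np.exp(-t / 1000)        # decaying learning rate
    h = np.exp(-((gy - bi) ** 2 + (gx - bj) ** 2) / (2 * sigma ** 2))
    weights += lr * h[:, :, None] * (x - weights)
```

After training, each document is assigned to the grid cell whose prototype it matches best, and the resulting density surface can be rendered in GIS software as in Fig. 3.12.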
Fig. 3.13 The process of visualizing citation impact in the context of co-citation networks (© 2001
IEEE)
Now we give a brief introduction to the use of these techniques for mapping. A more
detailed analysis from the co-citation point of view is given in the next chapter. Figure 3.13
illustrates the process of structuring and visualizing citation impact in the context
of co-citation networks. Indeed, the process is very generic, applicable to a wide
spectrum of phenomena.
First, select authors who have received citations above a threshold. Intellectual
groupings of these authors represent snapshots of the underlying knowledge do-
main. Co-citation frequencies between these authors are computed from a citation
database, such as ISI’s SCI and SSCI. ACA uses a matrix of co-citation frequencies
to compute a correlation matrix of Pearson correlation coefficients. According to
White and McCain (1998), such correlation coefficients best capture the citation
profile of an author.
The standardized scores $z_X$ and $z_Y$ are used to calculate the correlation
coefficient $r_{XY}$, which in turn forms the correlation matrix:

$$
z_X = \frac{X - \bar{X}}{\sigma_X}, \qquad
z_Y = \frac{Y - \bar{Y}}{\sigma_Y}, \qquad
r_{XY} = \frac{\sum_{i=1}^{N} z_X z_Y}{N - 1}
$$
Second, apply Pathfinder network scaling to the network defined by the correla-
tion matrix. Factor analysis is a standard practice in ACA. However, in traditional
ACA, MDS and factor analysis rarely appear in the same graphical representations.
In order to make knowledge visualizations clear and easy to interpret, we overlay
the intellectual groupings identified by factor analysis on the interconnectivity
structure modeled by Pathfinder network scaling. Authors with similar colors
essentially belong to the same specialty and should appear as a closely
connected group in the Pathfinder network. Therefore, one can expect to see the
two perspectives converge in the visualization. This is the third step.
Finally, display the citation impact of each author on top of the intellectual
groupings. The magnitude of the impact is represented by the height of a citation bar,
which in turn consists of a stack of color-coded annual citation sections. Figure 3.14
illustrates the construction of a three-dimensional knowledge landscape.
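The first two steps can be sketched as follows, with a toy co-citation matrix standing in for real citation data. Note that np.corrcoef performs internally the standardization written out in the z-score formulas above, and pfnet_mcn refers to the illustrative Pathfinder sketch given earlier in this chapter.

```python
import numpy as np

# cocite[i, j]: how often authors i and j are cited together (toy data).
rng = np.random.default_rng(2)
cocite = rng.integers(0, 20, size=(50, 50))
cocite = (cocite + cocite.T) // 2        # co-citation counts are symmetric

# Step 1: Pearson correlations between author co-citation profiles.
r = np.corrcoef(cocite)

# Step 2: turn correlations into distances and apply Pathfinder scaling.
d = 1.0 - r
np.fill_diagonal(d, np.inf)
links = pfnet_mcn(d)                     # see the earlier Pathfinder sketch
```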
Figure 3.15 shows virtual landscape views of three different subject domains, the
upper middle one for computer graphics and applications (Chen and Paul 2001),
the lower left for hypertext (Chen 1999b; Chen and Carr 1999a, b), and the lower
right one for virtual reality. In the computer graphics example, we visualized author
co-citation patterns found in the journal IEEE Computer Graphics and Applications
(CG&A) between 1982 and 1999. The CG&A citation data include articles written
by 1,820 authors and co-authors. These authors cited a total of 10,292 unique
articles, written by 5,312 authors (first author only). Among them, 353 authors
who had received more than five citations in CG&A were entered into the author
co-citation analysis. Intellectual groupings of these 353 authors provide the basis for visualizing
The map of the universe conveys two types of information simultaneously – both
spatial and temporal. While spatial information specifies the distance between
galaxies and how astronomical objects are related to each other in the universe,
temporal information provides an equivalent interpretation of the spatial property in
terms of time: a high-redshift quasar that is far away from us could have formed a
dozen billion years ago in the early universe. The notion of a timeline is widely
adopted in many visualizations of abstract information. Most notably, the evolution
of events can be organized along a timeline. Although a timeline design usually
tends to move spatial patterns to the background or even remove them completely,
there are visual
designs that intend to preserve and convey both spatial and temporal patterns in the
same display. A visual analytic system called GeoTime, for instance, successfully
accommodates spatial and temporal patterns. We will discuss visual analytics in
detail in the next few chapters.
Fig. 3.16 Streams of topics in Fidel Castro’s speeches and other documents (Reprinted from
Havre et al. 2000)
ThemeRiver visualizes thematic changes as currents in a river, in which topics
of interest flow along a dimension of time, usually placed horizontally and pointing
from left to right. The timeline provides an organizing framework so that a wide
variety of information can be organized according to its state at a particular point
in time. Figure 3.16 shows a ThemeRiver visualization of topics found in a
collection of Fidel Castro's speeches, interviews, articles, and other texts. The
visualization represents the variations of topics from the beginning of 1960 through
the middle of 1961. The famous Bay of Pigs invasion took place in about the same
period of time. The topics are represented by the frequencies of relevant terms
appearing in each month. Major events are annotated at the top, with dashed lines
drawn vertically at the time of the events. For example, the timeline visualization
shows that Cuba and the Soviet Union resumed diplomatic relations in May 1960
and that Castro confiscated American refineries around the end of June 1960.
Castro mentioned the Soviet Union 49 times in September. The continuity of a topic
is shown as a continuous stream of varying width across time.
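A ThemeRiver-style display can be approximated with a stacked stream chart. The sketch below uses matplotlib's stackplot with randomly generated monthly term frequencies; the topic names and counts are purely illustrative and have nothing to do with the actual Castro data.

```python
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(18)                       # Jan 1960 - Jun 1961
rng = np.random.default_rng(3)
topics = {"soviet": rng.poisson(20, 18),     # toy monthly term frequencies
          "oil": rng.poisson(10, 18),
          "invasion": rng.poisson(5, 18)}

# Each topic's width at a given month is its term frequency; the "wiggle"
# baseline centers the streams around a river-like middle axis.
plt.stackplot(months, list(topics.values()), labels=topics.keys(),
              baseline="wiggle")
plt.legend()
plt.xlabel("month")
plt.show()
```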
The ThemeRiver-style timeline visualization is suited to a wide range of
applications. The New York Times, for example, featured an interactive visualization of
popular movies in terms of their box office revenue.2 The streams of movies tend
to be short-lived, which is understandable because our attention span for a particular
movie won't last forever. The appearance of the streams makes them look more like the
peaks of mountains. Perhaps, to be more consistent with the timeline metaphor,
we should consider them the tips of icebergs floating from the past to the future.
2 http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html
Fig. 3.17 The evolution of topics is visualized in TextFlow (Reprinted from Cui et al. 2011)
The height of a layer of a movie, or the width of its stream, indicates the weekly box
office revenue of the movie. Its color indicates its level of total domestic
gross up to February 21, 2008, the cut-off date of the data. According to the visualization,
the most popular movies in 2007 included Transformers, Harry Potter and the Order
of the Phoenix, I Am Legend, and National Treasure: Book of Secrets. The color of
National Treasure indicates that its total domestic gross is one level below that of
I Am Legend.
The idea of using the variation of a thematic stream to reveal topic evolution
is appealing because it is intuitive and easy to understand. On the other hand,
computationally identifying a topic in unstructured text remains a challenge, and
identifying the evolution of a topic is even more challenging. Research
in topic modeling has advanced considerably over the past decade. A good example
of integrating topic modeling and information visualization to track the
evolution of topics over time is TextFlow (Fig. 3.17).
TextFlow follows the thematic river metaphor but extends the metaphor with
features that are more suitable for analyzing the evolution of topics. TextFlow adopts
the use of the width of a stream to indicate its strength. It addresses some of the
most common dynamics between streams, for example, how a stream splits into
multiple streams and how several streams merge into a single one. These patterns
of dynamics are of particular interest in analyzing the development of a subject
domain. Another remarkable aspect of TextFlow is how it presents a seemingly
simple display of a complex issue.
3 http://www.mapequation.org/alluvialgenerator/index.html
3.3 Dimensionality Reduction
Fig. 3.18 Alluvial map of scientific change (Reprinted from Rosvall and Bergstrom 2010)
Fig. 3.19 Load a network in .net format to the alluvial map generator
Fig. 3.20 An alluvial map generated from networks of co-occurring terms in publications related to regenerative medicine. The top 300 most frequently occurring terms are chosen each year

Fig. 3.21 An alluvial map of popular tweet topics identified as Hurricane Sandy was approaching
PCA finds a low-dimensional embedding of the data points that best preserves their
variance as measured in the high-dimensional input space. Classic MDS finds an
embedding that preserves the pairwise point distances (Steyvers 2000). PCA and
MDS are equivalent if Euclidean distances are used.
Let us illustrate what MDS does with some real-world examples, including
distances between cities and similarities between concepts. In general, there are two
types of MDS: metric and non-metric. Metric MDS assumes that the input
data is either ratio or interval data, while non-metric MDS requires simply that
the data be in the form of ranks.
A metric space is defined by three basic axioms, which are assumed by a
geometric model:

1. Metric minimality: for the distance function d and any point x, the equation d(x, x) = 0 holds.
2. Metric symmetry: for any data points x and y, the equation d(x, y) = d(y, x) holds.
3. Metric triangle inequality: for any data points x, y, and z, the inequality d(x, y) + d(y, z) ≥ d(x, z) holds.
Multidimensional scaling (MDS) is a standard statistical method for multivariate
data (See Fig. 3.23). In MDS, N objects are represented as d-dimensional
vectors, with all pairwise similarities or dissimilarities (distances) defined between
the N objects. The goal is to find a new representation of the N objects as
k-dimensional vectors, where k < d, such that the inter-point proximities nearly match
the original similarities or dissimilarities. Stress is the most common measure of
how well a particular configuration reproduces the observed distance matrix.
Given a matrix of distances between a number of major cities from the back of
a road atlas or an airline flight chart, we can use these distances as the input data
to derive an MDS solution. Figure 3.6 shows the procedure of generating an MDS
map. When the results are mapped in two dimensions, the configuration should look
very close to a conventional map, except that you might need to rotate the MDS
configuration.
Fig. 3.24 A geographic map showing 20 cities in the US (Copyright © 1998–2012 USATourist.com, LLC, http://www.usatourist.com/english/tips/distances.html)
Fig. 3.25 An MDS configuration according to the mileage chart for 20 cities in the US
Fig. 3.26 The mirror image of the original MDS configuration, showing an overall match to the
geographic map, although Orlando, Miami should be placed further down to the South
We input the city distance data to MDS, in this case using ALSCAL in SPSS,
with Euclidean distance for the model. Figure 3.25 shows the resultant MDS
configuration, which is like a mirror image of the usual geographic map, with east
and west reversed. We can legitimately rotate and flip an MDS configuration
to suit our custom. If we take the mirror image of the MDS configuration, the result
is indeed very close to the US map (Fig. 3.26).
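Classic (Torgerson) MDS is compact enough to sketch directly. The code below illustrates the method itself, not the ALSCAL procedure used above, and the four-city distance matrix is toy input.

```python
import numpy as np

def classic_mds(D, k=2):
    """Classic (Torgerson) MDS: double-center the squared distance matrix
    and embed the points with the top-k eigenvectors. Equivalent to PCA
    when the input distances are Euclidean."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J             # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]        # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

D = np.array([[0, 2451, 713, 1018],         # toy mileage chart, 4 cities
              [2451, 0, 1745, 1524],
              [713, 1745, 0, 355],
              [1018, 1524, 355, 0]], dtype=float)
coords = classic_mds(D)   # 2-d layout, defined up to rotation and reflection
```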
Now let us look at a more abstract example, in which each data point represents
a car and the distance between two different cars is measured by a number of per-
formance indicators. This example is based on a widely available multidimensional
data set, the CRCARS data set, prepared by David Donoho and Ernesto Ramos
(1982).
The CRCARS data set includes 406 cases of cars. Each case consists of
information from 8 variables: miles per gallon (MPG), the number of cylinders,
engine displacement in cubic inches, horsepower, vehicle weight in pounds, 0–
60 mph acceleration time in seconds, the last two digits of the model year, and the origin of the car (USA = 1, Europe = 2, Japan = 3). For example, a record of a BMW 2002 shows that it was made in Europe in 1970, with a fuel consumption of 26 miles per gallon, 4 cylinders, 0–60 mph acceleration in 12.5 s, and so on. The procedure of combining MDS and MST is shown in Fig. 3.27 (Basalaj 2001). The resultant MDS configuration of 406 cars in the CRCARS data set is reproduced in Fig. 3.28.
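A rough sketch of the MDS-plus-MST combination follows, using SciPy's minimum spanning tree over the same pairwise distances. This approximates the idea rather than Basalaj's actual implementation, and random data stands in for the car records:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.manifold import MDS

X = np.random.rand(50, 8)                 # stand-in for the 8-variable car records
D = squareform(pdist(X))                  # pairwise Euclidean distances

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

mst = minimum_spanning_tree(D).toarray()  # MST over the same distance matrix
for i, j in np.argwhere(mst > 0):         # edges to overlay on the MDS layout
    print(f"edge {i}-{j}, distance {mst[i, j]:.3f}")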
Figure 3.29 is a procedural diagram of a journal co-citation study (Morris and
McCain 1998). More examples of co-citation analysis are provided in Chap. 5. This
example illustrates the use of MDS to map more abstract relationships. It is also a good example showing that clustering and MDS may produce different groupings; when that happens, analysts need to investigate further and identify the nature of the discrepancies. Figure 3.30 shows the cluster solution. Each data point is a
journal. Note that the journal “Comput Biol Med” belongs to cluster BIOMEDICAL
COMPUTING, whereas the journal “Int J Clin Monit Comput” belongs to cluster
COMPUTING IN BIOMEDICAL ENGINEERING. In Fig. 3.31, the results of
clustering are superimposed on top of the MDS configuration. Note how close together the two journals are located. This example indicates that one should be aware of the limitations of applying clustering algorithms directly to MDS configurations.
In this example, both MDS and clustering took input directly from the similarity matrix. This approach has some advantages: by comparing MDS and clustering, we might identify patterns that would be overlooked by either method alone. We will also present an example in which MDS and clustering are done sequentially. In that case, we need to bear in mind that we are relying on MDS alone, because the subsequent clustering does not bring additional information into the process.
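The difference between the two workflows can be made concrete with a sketch (hypothetical data; SciPy and scikit-learn stand in for the tools used in the original study):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.manifold import MDS
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(30, 10)        # stand-in for journal co-citation profiles
d = pdist(X)                      # condensed pairwise distances
D = squareform(d)                 # the same distances as a square matrix

# Workflow 1: cluster the distance matrix directly
labels_direct = fcluster(linkage(d, method="average"), t=4, criterion="maxclust")

# Workflow 2: cluster the 2-D MDS coordinates (relies entirely on the MDS fit)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
labels_mds = fcluster(linkage(coords, method="ward"), t=4, criterion="maxclust")

# Agreement between the two groupings is often imperfect
print(adjusted_rand_score(labels_direct, labels_mds))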
Fig. 3.28 An MDS configuration of the 406 cars in the CRCARS data, including an MST overlay.
The edge connecting a pair of cars is coded in grayscale to indicate the strength of similarity: the
darker, the stronger the similarity. The MST structure provides a reference framework for assessing
the accuracy of the MDS configuration (Courtesy of http://www.pavis.org/)
Fig. 3.30 Cluster solution for SCI co-citation data (Reproduced from Morris and McCain (1998).
Note that “Comput Biol Med” and “Int J Clin Monit Comput” belong to different clusters)
The difficulty of finding a good fit increases exponentially as the stress value decreases. After all, if the original data
is of high-dimension in nature, it is not always possible to find a perfect fit in a
lower-dimensional space. For example, it is almost certain that we have to settle on
a higher stress value when mapping N statements on a general topic than mapping
the distances of N cities. Furthermore, if the distances among cities were measured
by something of higher dimension in nature, such as the perceived quality of life,
it would be equally unlikely for MDS to maintain the same level of goodness of
fit. Indeed, Trochim (1993) reported that the average stress value across 33 concept
map projects was 0.285 with a range from 0.155 to 0.352. After all, the goal of MDS
mapping is not merely to minimize the stress value; rather, we want to produce a
meaningful and informative map that can reveal hidden structures in the original
data.
INDSCAL was developed by John Carroll and J. Chang of Bell Telephone Laboratories in the 1970s to explain the relationship between subjects' differential cognition of a set of stimuli, or objects. For N subjects and p objects, INDSCAL takes a set of N matrices as its input. Each matrix is a symmetric p × p matrix of similarity
Fig. 3.31 SCI multidimensional scaling display with cluster boundaries (Reproduced from Morris
and McCain (1998). Note the distance between “Comput Biol Med” and “Int J Clin Monit Comput”
to the left of this MDS configuration)
measures between the p objects. The model explains differences between subjects’
cognition by a variant of the distance model. The p objects are represented as points
in a space known as a master space, a shared space, or a group space. The subjects perceive this space differently because each individual accords a different salience, or weight, to each dimension of the space. The INDSCAL model assumes that subjects
are systematically distorting the group space and it seeks to reconstruct both the
individual private, distorted spaces and the aggregate “group” space. Similarity
measures can be derived from aggregated groups as well as from individuals’
ratings. For example, in judging the differences between two houses an architect
might primarily concentrate on style and structure, whereas a buyer might be more
concerned with the difference in price.
Carroll and Chang illustrated INDSCAL with an example of analyzing how
people perceive the distances between six different areas of a city. They asked three
subjects to estimate the distance between each of the pairs of areas. Each subject
estimated a total of 15 such pairs, (6 × 5)/2 = 15.
The INDSCAL model interprets individual differences in terms of subjects applying individual sets of weights to the dimensions of a common "group" space.
Fig. 3.33 SCI weighted individual differences scaling display (Reproduced from Morris and
McCain 1998)
PCA and MDS have been routinely used to reduce the dimensionality of linear
data. Euclidean distances provide reliable measures of a linear structure in a
high-dimensional space. The problem is that when we deal with a non-linear
structure, Euclidean distances may not be able to detect the true structure.
Fig. 3.34 SSCI weighted individual differences scaling display (Reproduced from Morris and
McCain 1998)
Fig. 3.35 The Swiss-roll data set, illustrating how Isomap exploits geodesic paths for nonlinear dimensionality reduction. Straight lines in the embedding (the blue line in part a) now represent simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (the red line in part b) (Reproduced from Tenenbaum et al. (2000) Fig. 3. http://www.sciencemag.org/cgi/content/full/290/5500/2319/F3)
A common approximation is to decompose non-linear data into many small, locally linear patches and then to reconstruct a global solution from the local solutions of linear structures. Both algorithms explained below were tested on a Swiss-roll-like non-linear data structure of 20,000 data points.
The Isomap algorithm extracts meaningful dimensions by measuring the distance
between data points along the surface (Tenenbaum et al. 2000). Isomap works best
for shapes that can be flattened out, like cylinders or Swiss rolls. Isomap measures
the distance between any two points on the shape, then uses these geodesic distances
in combination with the classic MDS algorithm in order to make a low dimensional
representation of that data. Figure 3.35 demonstrates how Isomap unfolds data
shaped like a Swiss roll.
In the Isomap algorithm, the local quantities computed are the distances between
neighboring data points. For each pair of non-neighboring data points, Isomap finds
the shortest path through the data set connecting them, subject to the constraint
that the path must hop from neighbor to neighbor. The length of this path is
an approximation to the distance between its end points, as measured within the
underlying manifold. Finally, the classical method of MDS is used to find a set
of low-dimensional points with similar pairwise distances. The Isomap algorithm
worked well on several test data sets, notably face images with three degrees of freedom (up-down pose, left-right pose, and lighting direction; Fig. 3.36) and hand images with wrist rotation and finger extension as two degrees of freedom (Fig. 3.37). In other words, the intrinsic dimensionality of the face image data is 3 and that of the hand data is 2. The residual variance of Isomap drops faster than that of PCA and MDS, which means that PCA and MDS tend to overestimate the dimensionality, in contrast to Isomap (Tenenbaum et al. 2000).
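This experiment is easy to repeat with scikit-learn. The sketch below is not the authors' original code, and the neighborhood size is an illustrative choice:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=2000, random_state=0)

# Unroll the manifold: geodesic distances over a k-nearest-neighbor graph,
# followed by classical MDS on those distances
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)
print(Y.shape)   # (2000, 2): the flattened Swiss roll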
The Locally Linear Embedding (LLE) algorithm uses linear approximation to model a non-linear manifold (Roweis and Saul 2000). It is like using many small linear patches to approximate a curved surface.
Fig. 3.36 Face images varying in pose and illumination (Fig. 1A) (Reprinted from Tenenbaum
et al. 2000)
Fig. 3.37 Isomap (K = 6) applied to 2,000 images of a hand in different configurations (Reproduced from Supplemental Figure 1 of Tenenbaum et al. (2000) http://isomap.stanford.edu/handfig.html)
Fig. 3.38 The color-coding illustrates the neighborhood-preserving mapping discovered by LLE
(Reprinted from Roweis and Saul 2000)
Isomap, which relies on estimating and preserving global geometry, may distort the local structure of the data. LLE, based only on local geometry, may distort the global structure.
Given the role of classic PCA and MDS in mapping concepts, the interest in manifold scaling algorithms is likely to increase in the near future. It is largely unknown whether typical data structures arising from concept mapping and co-citation analysis are intrinsically non-linear.
3.4 Concept Mapping
Card sorting is one of the earliest methods used for concept mapping. Earlier works
on card sorting include George Miller’s “A psychological method to investigate
verbal concepts” (Miller 1969) and Anthony Biglan’s “The characteristics of subject
matter in different academic areas” (Biglan 1973).
We illustrate the process of concept mapping with an example drawn from the work of William Trochim and his colleagues at Cornell University; see, for example, Trochim (1989). They follow a process similar to the one we saw in Chap. 2 for creating a thematic map: a base map is superimposed with a thematic overlay (see Fig. 3.39). In particular, the process utilizes MDS and clustering algorithms.
The process started with a brainstorming session, in which individual participants were asked to sort a large set of N statements on a chosen topic into piles. They were to put statements into the same pile if they considered them similar. The results of each participant's sorting were represented as an N × N similarity matrix: if a participant put statement i and statement j into the same pile, the value of eij in the matrix was set to 1; otherwise it was set to 0. The matrices of all the participants were then aggregated into a matrix (Eij). The value of Eij is therefore the number of participants who put statement i and statement j into the same pile. Because a statement is always sorted into the same pile as itself, each diagonal entry of the aggregated matrix equals the number of participants.
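In code, the aggregation step might look like the following sketch (the pile data structure here is hypothetical):

import numpy as np

def aggregate_sorts(sorts, n_statements):
    # Aggregate individual card sorts into the matrix E: E[i, j] counts the
    # participants who placed statements i and j in the same pile
    E = np.zeros((n_statements, n_statements), dtype=int)
    for piles in sorts:
        for pile in piles:
            for i in pile:
                for j in pile:
                    E[i, j] += 1
    return E

# Two participants sorting five statements into piles
sorts = [
    [[0, 1], [2, 3, 4]],
    [[0, 1, 2], [3, 4]],
]
E = aggregate_sorts(sorts, 5)
print(np.diag(E))   # each diagonal entry equals the number of participants (2)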
The structure of the similarity matrix was depicted through a two-dimensional
non-metric MDS configuration, which was followed by a hierarchical cluster
analysis of the MDS coordinates to divide the spatial configuration into district-
like groups. Finally, participants were led through a structured interpretation session
designed to help them understand the maps and label them in a meaningful way.
When participants sorted statements into piles, they also rated each statement
on one or more variables. Most typically, each statement was rated for its relative
importance on a 5-point scale, from 1 for unimportant through 5 for extremely
important. The results of such rating were subsequently used as a thematic overlay
on top of the base map (See Fig. 3.40).
3.4.2 Clustering
There are two broad types of approaches to hierarchical cluster analysis: agglomerative and divisive. In the agglomerative approach, the procedure starts with each point as its own branch end-point and decides which two points to merge first. In each step, the algorithm determines which two points and/or clusters to combine next. Thus, the procedure agglomerates the points until they are all in one cluster. Divisive hierarchical cluster analysis works in the opposite manner, beginning with all points together and subsequently dividing them into groups until each point is its own group. Ward's method is an agglomerative approach.
Three methods of analysis are closely related to MDS. These are principal
component analysis (PCA), correspondence analysis (CA) and cluster analysis.
Principal component analysis (PCA) is performed on a matrix A of N entities observed on p variables. The aim is to find new variables, called principal components, which are linear combinations of the original variables and account for most of the variation in the original variables. When these distances are Euclidean, the coordinates contained in the MDS configuration X represent
Fig. 3.40 An MDS-configured base map of topical statements and ratings of importance shown
as stacked bars
the principal coordinates, which would be obtained when doing PCA on A. This
approach is called principal coordinates analysis, or classical scaling. A more
detailed account of this correspondence can be found in Everitt and Rabe-Hesketh
(1997).
Correspondence analysis is classically used on a two-way contingency table
in order to visualize the relations between the row and column categories. The
unfolding models do the same: subjects (row-categories) and objects (column-
categories) are visualized in a way that the order of the distances between a
subject-point and the object-points reflects the preference ranking of the subject.
The measure of “proximity” used in CA is the Chi-square distance between the
profiles. A short description of CA and its relation to MDS can be found in Borg
and Groenen (1997).
Cluster analysis models are equally applicable to proximity data, including two-way (asymmetric) square and rectangular data as well as three-way two-mode data. The main difference from the MDS models is that most models for cluster analysis lead to a hierarchical structure, in which path distances, under a number of restrictions, approximate the dissimilarities.
Fig. 3.41 Hierarchical cluster analysis divided MDS coordinates into nine clusters
Just as in geographic mapping, the cartographer makes decisions about scale and detail depending on the intended uses of the map. There is no hard and fast rule for determining the best number of clusters.
In Trochim's concept mapping, rating data was used to provide a third dimension on the two-dimensional map: a vertical overlay that depicts the height of various regions. In a cluster map, the layers of a cluster depicted the average importance rating of all statements within the cluster.
Meaningful text labels are essential for identifying the nature of point groupings and for keeping clusters simple and clear. Automatically generating meaningful labels is still a challenge. The most straightforward way to generate labels is to ask people to do it; if individuals give different labels, simply choose the label that makes the most sense.
3.5 Network Models

Graph theory is a branch of mathematics that studies graphs and networks. A graph
consists of vertices and edges. A network consists of nodes and links. Many impor-
tant phenomena can be formulated as a graph problem, such as telecommunication
networks, club membership networks, integrated electric circuits, and scientific
networks. Social networks, for example, are graphs in which vertices represent
people and edges represent interrelationships between people. Acquaintanceship
graphs, co-author graphs, and collaboration graphs are examples of social networks.
To a mathematician, they are essentially the same thing. In graph theory, the focus
is on the connectivity of a graph – the topology, rather than the geometry. One of
the earliest graph-theoretical studies dates back to 1736, when Leonhard Euler (1707–1783) published his paper on the solution of the Königsberg bridge problem. Another classical problem in graph theory is the famous Traveling Salesman Problem (TSP). In the twentieth century, graph theory became more statistical and algorithmic, partly because we now deal with some very large graphs, such as the Web, telephone call graphs, and collaboration graphs. In this section,
two types of graphs are of particular interest to us: random graphs and small-world
networks.
Fig. 3.42 A structural hole between groups a, b and c (Reprinted from Burt 2002)
In the 1950s and 1960s Anatol Rapoport studied social networks as random
graphs (Rapoport and Horvath 1961). He showed that if the placement of edges was
not completely random, it could produce a graph with a lower overall connectivity
and a larger diameter. Sociologist Mark Granovetter (1973) argued that it is through
casual acquaintances, or weak ties, that we obtain new information, rather than
through strong ties, or close personal friends. The weak ties across different groups
are crucial in helping communities mobilize quickly and organize for common
goals easily. In this vein, Ronald Burt (1992) extended the strength-of-weak-ties argument: it is not so much the strength or weakness of a tie that determines its information potential, but rather whether there is a structural hole in someone's social network. A person who spans a structural hole has strong between-cluster connections but weak within-cluster connections in a social network. Figure 3.42 illustrates two persons' connections in a social network (Burt 2002). While both Robert and James have six strong ties and one weak tie, Robert is in a more informed position than James, because much of the information reaching James would be redundant. Robert, on the other hand, serves as a bridge to clusters A and C.
Therefore, the number of connections in a social network is important, but the value
of each connection depends on how important it is for maintaining the connectivity
of a social network.
The degree of separation between two people is defined as the minimum length of a chain of acquaintances between them; the largest such separation over all pairs corresponds to the diameter of the graph. You may have heard that everyone on Earth is separated from anyone else by no more than six degrees of separation. Normally, the social world we know is confined to a group of our immediate acquaintances, most of whom know each other. Our average number of acquaintances is very much less than the size of the global population, so the claim that any two people in the world are just six degrees apart does seem mysterious.
Stanley Milgram conducted a study in 1967 to test the small-world phenomenon (Milgram 1967). He asked volunteers in Nebraska and Kansas to deliver packets addressed to a person in Boston by passing them through people they knew who might get the packets closer to the intended recipient. Milgram kept track of the letters and the demographic characteristics of their handlers. He found a median chain length of about six, 5.5 to be precise. However, two-thirds of the packets were never delivered at all, and the reported path length of 5.5 was an average, not a maximum.
Over the past few years, there has been a surge of revived interest in this topic among mathematicians, statisticians, physicists, and psychologists (Watts 1999; Watts and Strogatz 1998). Brian Hayes wrote a two-part feature in American Scientist introducing some of the latest studies of the far-reaching implications of the small-world phenomenon (Hayes 2000a, b).
Random graphs are among the most intensively studied graphs. The Hungarian mathematician Paul Erdös (1913–1996) and his colleague Alfred Renyi found that a random graph has an important property: when the number of edges exceeds half the number of vertices, a "giant component" emerges suddenly, so that most of the vertices become connected in a single piece. This is known as the Erdös-Renyi theory.
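The threshold is easy to observe empirically. A NetworkX sketch (with illustrative sizes) shows the largest component jumping once the number of edges passes half the number of vertices:

import networkx as nx

n = 10000
for m in [3000, 5000, 6000, 10000]:               # edges below and above n/2
    G = nx.gnm_random_graph(n, m, seed=0)
    giant = max(nx.connected_components(G), key=len)
    print(f"edges={m:6d}  largest component={len(giant):6d}")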
Given that many huge graphs in the real world are not random graphs, it is
particularly interesting if there are such giant components in these graphs. For
example, a giant component in a citation network would indicate some mainstream
literature of a particular discipline. A giant component in the cyber-graph of the
World-Wide Web would identify the core users of the Web and the core customers
of e-commerce. James Abello of the AT&T Shannon Laboratories in New Jersey
studied the evolution of call graphs, in which the vertices are telephone numbers,
and the edges are calls made from one number to another (Abello et al. 1999).
Within 20 days, the graph grew to a gigantic network of 290 million vertices and 4
billion edges. This is simply too big to analyze with current computing resources.
Abello and his colleagues analyzed a one-day call graph, containing 53,767,087
vertices and 170 million edges. Among 3.7 million components, most of them
tiny, they found one giant connected component that connects 44,989,297 vertices
together, which is more than 80 % of the total number of vertices. The gigantic
component has a diameter of 20, which implies that any telephone number in the
component can be linked to any other through a chain of no more than 20 calls. The
emergence of a giant component is characteristic of Erdös-Renyi random graphs,
but the pattern of connections in the call graph is certainly not random.
A well-known example is the Hollywood graph, which represents movie stars as vertices, with edges connecting two stars if they ever starred in a movie together. A version of the Hollywood graph in 2001 represents 355,848 actors and actresses from 170,479 movies. In this graph, the focus is on
the centrality of Hollywood actor Kevin Bacon in the film industry. This Hollywood
graph has gained widespread publicity, partly because researchers have found a way
to replicate some key characteristics of this graph (Watts and Strogatz 1998). Brett
Tjaden and Glenn Wasson of the University of Virginia maintain The Oracle of
Bacon on the Web that calculates Bacon numbers.
Small-world networks are defined by three properties: sparseness, clustering, and
small diameter (Watts 1999). Sparseness means that the graph has relatively few
edges. Clustering means that edges are not uniformly distributed among vertices;
instead there tend to be clumps in the graph. Small diameter means that the longest
shortest path across the graph is small.
In 1998, Duncan Watts and Steven Strogatz of Cornell University found these properties in the Hollywood graph and several other huge graphs (Watts and Strogatz 1998). Watts and Strogatz used a rewiring strategy to produce a graph
somewhere between a random graph and a regular graph. The rewiring process
started with a regular lattice and then rewired some of the edges according to a
probability p, ranging from 0 to 1. If p is equal to 0, then everything remains
unchanged. If p is equal to 1, then every edge is re-arranged randomly and the lattice
becomes a random graph. They calculated the shortest path length L, averaged over all pairs of vertices, and found that L dropped dramatically when just a few of the edges were rewired. Watts and Strogatz also measured the degree of clustering in their hybrid graphs using a clustering coefficient C. They found that C remained high until the rewiring probability became rather large. The Hollywood graph demonstrated a good match to their model.
Semantic networks are useful tools as representations for semantic knowledge and
inference systems. Historically, semantic networks refer to the classic network the-
ory of Collins and Quillian (1969) in which concepts are represented as hierarchies
of interconnected nodes with nodes linked to certain attributes. It is important to
understand the organization of large-scale semantic networks. By applying graph-
theoretic analyses, the large-scale structure of semantic networks can be specified
by distributions over a few variables, such as the length of the shortest path between
two words and the number of connections per word. Researchers have shown that
the large-scale organization of semantic networks reveals a small-world structure
that is very similar to the structure of several other real-life networks such as the
neural network of the worm C. elegans, the collaboration network of film actors
and the WWW. We have seen examples of Erdös numbers and Bacon numbers. We return to C. elegans in later chapters for an example of gene expression visualization.
Mark Steyvers and Josh Tenenbaum analyzed three types of semantic networks:
associative networks, WordNet, and Roget’s thesaurus (Steyvers and Tenenbaum
2001). They found that these semantic networks demonstrate typical features of a small-world structure: sparseness, short average path lengths between words, and strong local clustering. In these semantic networks it was also found that the
distributions of the number of connections follow power laws, suggesting a hub
structure similar to the WWW. They built a network model that acquires new
concepts over time and integrates them into the existing network. If new concepts
grow from well-connected concepts and their neighbors in the network, this network
model demonstrates the small-world characteristics of semantic networks and the
power-law distributions in the number of connections. An interesting prediction
of their model is that concepts that are learned early in the network acquire more
connections over time than concepts learned late.
For an example of a shortest pathway running through major scientific disciplines
instead of concepts, see Henry Small’s work on charting the pathways in scientific
literature (Small 2000), although he did not study these pathways as a small-
world phenomenon. In Chap. 5, we will introduce another trailblazing example
from Small's work on specialty narratives (Small 1986). The small-world model of semantic networks predicts that the earlier a concept is learned in the network, the more connections it will acquire. This does not sound surprising: sociologist Robert Merton's Matthew Effect, the rich get richer, leads us to think about the characteristics of scientific networks. After all, the small-world phenomenon originated in society. The practical implications of these small-world studies perhaps lie in how one can find strong local clusters and build shortest paths to connect to these clusters. These findings may also influence the way we see citation networks.
3.5.5.1 Pajek
In Slovene, the word pajek means spider. The computer program Pajek is designed for the analysis of large networks with several thousand vertices (Batagelj and Mrvar 1998). It is freely available for noncommercial use.4 Conventionally, a network with more than a few hundred vertices can be regarded as large. There are even larger networks: the Web, with its estimated billions of pages, forms a super-large network.
Réka Albert, Hawoong Jeong, and Albert-László Barabási analyzed the error and attack tolerance of complex networks (Albert et al. 2000); the tool they used was Pajek.
4 http://vlado.fmf.uni-lj.si/pub/networks/pajek/
3.5.5.2 Gephi
The layout of our map of influenza virus protein sequences was generated by LGL.
It first determines the layout of a large graph from a minimum spanning tree of the
graph. LGL is one of the computer programs openly available for visualizing large
graphs. It is written in C and the source code is available.5 It has mostly been used in biomedical studies. More details are available on the web, but the project is no longer actively maintained; to compile it, one has to download legacy libraries such as boost 1.33.1.6
LGL has been used to generate some of the most intriguing maps of the Internet
in 2003–2005. Examples can be found at http://www.opte.org/maps/.
3.6 Summary
In summary, in this chapter we have explored some of the most popular ideas and techniques for mapping the mind. A good strategy for working with abstract data is to first apply the same technique to concrete data, or data with which we are familiar.
5 http://lgl.sourceforge.net/
6 http://sourceforge.net/projects/lgl/forums/forum/584294/topic/3507979
References
Abello J, Pardalos PM, Resende MGC (1999) On maximum clique problems in very large graphs.
In: Abello J, Vitter J (eds) External memory algorithms. American Mathematical Society,
Providence, pp 119–130
Albert R, Jeong H, Barabási A-L (1999) Diameter of the World Wide Web. Nature 401:130–131
Albert R, Jeong H, Barabási A-L (2000) Attack and error tolerance in complex networks. Nature
406:378–382
Barabási A-L, Albert R, Jeong H, Bianconi G (2000) Power-law distribution of the World Wide
Web. Science 287:2115a
Basalaj W (2001) Proximity visualization of abstract data. Retrieved November 5, 2001, from
http://www.pavis.org/essay/index.html
Batagelj V, Mrvar A (1998) Pajek: a program for large network analysis. Connections 21(2):47–57
Biglan A (1973) The characteristics of subject matter in different academic areas. J Appl Psychol
57:195–203
Borg I, Groenen P (1997) Modern multidimensional scaling. Springer, New York
Boyack KW, Wylie BN, Davidson GS, Johnson DK (2000) Analysis of patent databases using
Vxinsight (No. SAND2000-2266C). Sandia National Laboratories, Albuquerque
Burt RS (1992) Structural holes: the social structure of competition. Harvard University Press,
Cambridge, MA
Burt RS (2002) The social capital of structural holes. In: Guillen NF et al (eds) New directions in
economic sociology. Russell Sage Foundation, New York
Bush V (1945) As we may think. Atl Mon 176(1):101–108
Canter D, Rivers R, Storrs G (1985) Characterizing user navigation through complex data
structures. Behav Info Technol 4(2):93–102
Chen C (1999a) Information visualisation and virtual environments. Springer, London
Chen C (1999b) Visualising semantic spaces and author co-citation networks in digital libraries.
Info Process Manag 35(2):401–420
Chen C, Carr L (1999a) Trailblazing the literature of hypertext: author co-citation analysis (1989–
1998). Paper presented at the 10th ACM conference on hypertext (Hypertext’99), Darmstadt,
February 1999
Chen C, Carr L (1999b) Visualizing the evolution of a subject domain: a case study. Paper presented
at the IEEE visualization’99, San Francisco, 24–29 October 1999
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. IEE Comput
34(3):65–71
Chen H, Houston AL, Sewell RR, Schatz BR (1998) Internet browsing and searching: user
evaluations of category map and concept space techniques. J Am Soc Inf Sci 49(7):582–608
Chen C, Gagaudakis G, Rosin P (2000) Content-based image visualisation. Paper presented at
the IEEE international conference on information visualisation (IV 2000), London, 19–21 July
2000
Collins AM, Quillian MR (1969) Retrieval time from semantic memory. J Verbal Learn Verbal Behav 8:240–248
Conklin J (1987) Hypertext: an introduction and survey. Computer 20(9):17–41
Cui W, Liu S, Tan L, Shi C, Song Y, Gao Z et al (2011) TextFlow: towards better understanding of
evolving topics in text. IEEE Trans Vis Comput Graph 17(12):2412–2421
Darken RP, Allard T, Achille LB (1998) Spatial orientation and wayfinding in large-scale virtual
spaces: an introduction. Presence 7(2):101–107
Deerwester S, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent
semantic analysis. J Am Soc Info Sci 41(6):391–407
Donoho D, Ramos E (1982) PRIMDATA: data sets for use with PRIM-H. Retrieved November 5,
2001, from http://lib.stat.cmu.edu/data-expo/1983.html
Dumais ST (1995) Using LSI for information filtering: TREC-3 experiments. In Harman D (ed)
The 3rd text REtrieval conference (TREC3), National Institute of Standards and Technology
Special Publication, pp 219–230
Everitt BS, Rabe-Hesketh S (1997) The analysis of proximity data. Arnold, London
Everitt B (1980) Cluster analysis. Halsted Press, New York
Flickner M, Sawhney H, Niblack W, Sahley J, Huang Q, Dom B et al (1995) Query by image and
video content: the QBIC system. IEEE Comput 28(9):23–32
Granovetter M (1973) The strength of weak ties. Am J Sociol 78:1360–1380
Greenacre MJ (1993) Correspondence analysis in practice. Academic, San Diego
Havre S, Hetzler B, Nowell L (2000) ThemeRiver: visualizing theme change over time. In:
Proceedings of IEEE symposium on information visualization, Salt Lake City, 9–10 October
2000, pp 115–123
Hayes B (2000a) Graph theory in practice: part I. Am Sci 88(1):9–13
Hayes B (2000b) Graph theory in practice: part II. Am Sci 88(2):104–109
He DC, Wang L (1990) Texture unit, texture spectrum, and texture analysis. IEEE Trans Geosci
Remote Sens 28(4):509–512
Helm CE (1964) Multidimensional ratio scaling analysis of perceived color relations. J Opt Soc
Am 54:256–262
Hetzler B, Whitney P, Martucci L, Thomas J (1998) Multi-faceted insight through interoperable vi-
sual information analysis paradigms. Paper presented at the IEEE information visualization’98,
Los Alamitos, 19–20 October 1998
Ingram R, Benford S (1995). Legibility enhancement for information visualisation. Paper presented
at the 6th annual IEEE computer society conference on visualization, Atlanta, October 1995
Kamada T, Kawai S (1989) An algorithm for drawing general undirected graphs. Info Process Lett
31(1):7–15
Kleinberg J (1998) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632
Kochen M (ed) (1989) The small world: a volume of recent research advances commemorating
Ithiel de Sola Pool, Stanley Milgram, Theodore Newcomb. Ablex Publishing Corporations,
Norwood
Kohonen T (1989) Self-organization and associate memory, 3rd edn. Springer, New York
Krumhansl CL (1978) Concerning the applicability of geometric models to similar data: the
interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463
Kruskal JB (1977) The relationship between multidimensional scaling and clustering. In: van
Ryzin J (ed) Classification and clustering. Academic, New York, pp 17–44
Kruskal JB, Wish M (1978) Multidimensional scaling, Sage university paper series on quantitative
applications in the social sciences. SAGE Publications, Beverly Hills
Levine M, Jankovic IN, Palij M (1982) Principles of spatial problem solving. J Exp Psychol Gen
111(2):157–175
Levine M, Marchon I, Hanley G (1984) The placement and misplacement of You-Are-Here maps.
Environ Behav 16(2):139–157
Lynch K (1960) The image of the city. The MIT Press, Cambridge, MA
McCallum RC (1974) Relations between factor analysis and multidimensional scaling. Psychol
Bull 81(8):505–516
Milgram S (1967) The small world problem. Psychol Today 2:60–67
Miller GA (1969) A psychological method to investigate verbal concepts. J Math Psychol
6:169–191
Morris TA, McCain K (1998) The structure of medical informatics journal literature. J Am Med
Inform Assoc 5(5):448–466
Rapoport A, Horvath WJ (1961) A study of a large sociogram. Behav Sci 6(4):279–291
Rosvall M, Bergstrom CT (2010) Mapping change in large networks. PLoS One 5(1):e8694
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear
embedding. Sci Mag 290(5500): 2323–2326, DOI: 10.1126/science.290.5500.2323.
http://www.sciencemag.org/content/290/5500/2323
Schvaneveldt RW (ed) (1990) Pathfinder associative networks: studies in knowledge organization.
Ablex Publishing Corporations, Norwood
Schvaneveldt RW, Durso FT, Dearholt DW (1989) Network structures in proximity data. In: Bower
G (ed) The psychology of learning and motivation, 24. Academic, New York, pp 249–284
Shneiderman B (1996) The eyes have it: a task by data type taxonomy for information visualization.
Paper presented at the IEEE workshop on visual language, Boulder, 3–6 September 1996
Skupin A (2009) Discrete and continuous conceptualizations of science: implications for knowl-
edge domain visualization. J Informetr 3(3):233–245
Small H (1986) The synthesis of specialty narratives from co-citation clusters. J Am Soc Inf Sci
37(3):97–110
Small H (2000) Charting pathways through science: exploring Garfield’s vision of a unified index
to science web of knowledge – a Festschrift in Honor of Eugene Garfield. Information Today
Inc., New York, pp 449–473
Steyvers M (2000) Multidimensional scaling encyclopedia of cognitive science. Macmillan
Reference Ltd., London
Steyvers M, Tenenbaum J (2001) Small worlds in semantic networks. Retrieved December 2001,
from http://www-psych.stanford.edu/~msteyver/small worlds.htm
Swain MJ, Ballard DH (1991) Color indexing. Int J Comput Vis 7:11–32
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500):2319–2323
Thorndyke P, Hayes-Roth B (1982) Differences in spatial knowledge acquired from maps and
navigation. Cogn Psychol 14:560–589
Tolman EC (1948) Cognitive maps in rats and men. Psychol Rev 55:189–208
Trochim W (1989) Concept mapping: soft science or hard art? Eval Program Plann 12:87–110
Trochim W (1993) The reliability of concept mapping. In: Annual conference of the American
Evaluation Association, Dallas
Trochim W, Linton R (1986) Conceptualization for evaluation and planning. Eval Program Plann
9:289–308
Trochim W, Cook J, Setze R (1994) Using concept mapping to develop a conceptual framework
of staff’s views of a supported employment program for persons with severe mental illness.
Consult Clin Psychol 62(4):766–775
Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352
Watts DJ (1999) Small worlds: the dynamics of networks between order and randomness. Princeton
University Press, Princeton
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature
393(6684):440–442
White HD, McCain KW (1998) Visualizing a discipline: an author co-citation analysis of
information science, 1972–1995. J Am Soc Inf Sci 49(4):327–356
Wise JA (1999) The ecological approach to text visualization. J Am Soc Inf Sci 50(13):1224–1233
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A et al (1995). Visualizing the non-
visual: spatial analysis and interaction with information from text documents. Paper presented
at the IEEE symposium on information visualization’95, Atlanta, Georgia, 30–31 October 1995
Zahn CT (1971) Graph-theoretical methods for detecting and describing Gestalt clusters. IEEE Trans Comput C-20:68–86
Chapter 4
Trajectories of Search
4.1 Footprints in Information Space

Pirolli and his colleagues modeled users' flows among documents according to the likelihood that users would visit one document from another via hyperlinks (Pirolli et al. 1996). In the following examples, we first introduce a travel planning problem in the real world and then discuss real-world navigation strategies in a virtual world.
Fig. 4.1 Three Traveling Salesman tours in German cities: the 45-city Alten Commis-Voyageur
tour (green), the Groetschel’s 120-city tour (blue), and by far the latest 15,112-city tour (red)
(Courtesy of http://www.math.princeton.edu/)
Fig. 4.3 A scene in StarWalker in which two users explore the semantically organized virtual space
Fig. 4.5 A site map produced by see POWER. The colored contours represent the hit rate of a
web page. The home page is the node in the center (Courtesy of http://www.compudigm.com/)
individual page has received. The home page is the node in the center, and the lines
linked to this represent navigation paths. Navigation issues can be quickly identified,
as can the effect of content changes.
Searching for information is in many ways like how human beings and animals
hunt for food. Research in biological evolution and optimal foraging identifies some
of the most profound factors that may influence our course of action. Whenever
possible, we prefer to minimize the consumption of our energy in searching for
information. We may also consider reducing other forms of cost. The bottom line
is that we want to maximize the returns by giving away the minimum amount of
resources. The perceived risk and expected gains will affect where we search and
how long we keep searching in the same area. A theory adapted from anthropology,
the optimal information foraging theory (Pirolli and Card 1995), can explain why
this type of information can be useful.
Sandstrom (1999) analyzed scholars' information-searching behavior as if they were hunting for food, based on the optimal foraging theory developed in anthropology. She focused on author co-citation relationships as a means of tracing scholars in their information seeking. Sandstrom derived a novelty-redundancy continuum on which information foragers gauged the costs and benefits of their course of search. She found three types of center-periphery zones in the mind map of scholars: one's home zone, the core groupings of others, and the remaining clusters of scholars.
Sandstrom’s study showed that scholars’ searching and handling mechanisms
varied by zone, and the optimal foraging theory does explain the variations.
For example, regular reading, browsing, or relatively solitary information seeking
activities often yielded resources belonging mostly to the peripheral zones of
scholars’ information environments. Peripheral resources tended to be first-time
references and previously unfamiliar to citing authors, whereas core resources
emerged from routine monitoring of key sources and the cited authors are very
familiar with such resources.
Sandstrom’s work draws our attention from the strongest and most salient
intellectual links in traditional author co-citation analysis to the weak bibliographic
connections and less salient intellectual links. Weak links that could lead to the
establishment of an overlooked connection between two specialties are particularly
significant for information foragers and scholars.
In order to understand users' navigation strategies in information foraging, the profitability of a given document can be defined according to this cost-benefit principle. For example, one can estimate the profitability as the proportion of relevant documents in a specific area of an information space divided by the time it will take to read all the documents within this area. In their study of the Scatter/Gather system, Pirolli and Card found that even a much simplified model of information foraging shows how users' search strategies can be influenced. For
4.1 Footprints in Information Space 149
example, users are likely to search widely in an information space if the query is simple, and to search in a more focused manner if the query is harder (Pirolli and Card 1995). According to the profitability principle, harder queries entail a higher cost to resolve, and the profitability of each document is relatively low. In general, users must decide whether or not to pursue a given document in the course of navigation based on the likely profitability of the document.
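A literal rendering of this definition as a function (a hypothetical helper, purely illustrative):

def profitability(n_relevant, n_total, minutes_per_doc):
    # Proportion of relevant documents in an area, divided by the time
    # needed to read all documents in that area (the definition above)
    proportion = n_relevant / n_total
    total_reading_time = n_total * minutes_per_doc
    return proportion / total_reading_time

# A cluster of 10 documents, 6 of them relevant, 2 minutes per document
print(profitability(6, 10, 2.0))   # 0.03 "relevance per minute"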
In order to study sequential patterns in users' trails, we decided to visualize the documents visited by users in sequence. One would expect the trail of a successful information forager to lead to the target area, where the forager would spend a considerable amount of time. The success of one user may provide insightful information to another user to overcome the weakest-link problem.
Hidden Markov Models (HMMs) are widely used in signal processing and speech
recognition. If we conceptualize users’ navigation as a sequence of observable
actions, such as clicking on a node or marking a node, we would expect that
behavioral patterns of navigation are likely to be governed by a latent cognitive
process, which is opaque to observers. For example, cognitive processes behind the
scene may include estimating the profitability of a document cluster and assessing
the relevance of a particular document. HMMs provide a potentially useful tool
to model such dual-process sequences. Given a sequence of observed actions, one
may want to know the dynamics of the underlying process. Given a model of an
underlying process, one may like to see what sequence is most likely to be observed.
Thus an HMM-based approach provides a suitable way to study users’ navigation
strategies as an information foraging process.
Hidden Markov Models are defined in terms of states and observations. States
are not observable, whereas observations are observable and they are probabilistic
functions of states. A stochastic process governs state transitions, which means at
each step the process of change is controlled by probabilities. Observations are also
a stochastic process. An HMM can be defined as follows:
N denotes the number of hidden states
Q denotes the set of states Q = {1, 2, …, N}
M denotes the number of symbols, or observations
V denotes the set of symbols V = {1, 2, …, M}
A denotes the state-transition probability matrix
A = | a11  a12  …  a1N |
    | …    …   aij  …  |
    | aN1  aN2  …  aNN |

B denotes the observation probability distribution, with entries

bi(k) = ok(di) / di
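In matrix form, a toy model with N = 2 states and M = 3 symbols might be written as follows (the numbers are made up for illustration and are not taken from the study):

import numpy as np

A = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # state-transition probabilities a_ij
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])     # emission probabilities b_i(k)
pi = np.array([0.6, 0.4])           # initial state distribution

# Rows of A, rows of B, and pi must each sum to 1
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)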
We chose the profitability function of a document space as the state space and the three types of events as the observation symbols because the sequence of activities, such as node mouse-over, node click, and node mark, is a stochastic process. This observable process is a function of a latent stochastic process: the user's ongoing estimation of the profitability of documents in the thematic space, because which document the user will move to next is largely opaque to observers.
We constructed HMMs based on the actual trails recorded from sessions of the experiment. HMMs are both descriptive and normative: not only can one describe what happened in information foraging sessions, but one can also predict what might happen in similar situations. HMMs provide insights into how users would behave as they are exposed to the same type of structural and navigational cues in the same thematic space.
We defined the basic problems as follows. The first basic question states that, given an observation sequence O = (o1, o2, …, oT), which is a sequence of information foraging actions of a user, and a model λ = (A, B, π), efficiently compute P(O|λ). Given two models λ1 and λ2, this can be used to choose the better one. We first derived an HMM from the log files of two users: one had the best performance score, but no node click events; the other had all types of events. This model is denoted as λlog. Given an observation sequence, it is possible to estimate model parameters λ = (A, B, π) that maximize P(O|λ), denoted as λseq. The navigation sequence of the most successful user provided the input to the modeling process.
The second basic question states that, given an observation sequence O = (o1, o2, …, oT) and a model λ, find the optimal state sequence q = (q1, q2, …, qT). In this case, we submitted the navigation sequences of users to the model λlog and animated the optimal state sequences within the thematic space. In this way, we can compare the prevalent navigation strategies. Such animation will provide additional navigational cues to other users.
Finally, the third basic question states that, given an observation sequence O = (o1, o2, …, oT), estimate the model parameters λ = (A, B, π) that maximize P(O|λ). We focused on the most successful user in searching a given thematic space. If a user clicks and marks documents frequently, it is likely that the user has found a highly profitable set of documents.
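The first problem is conventionally solved with the forward algorithm. A self-contained numpy sketch with a made-up two-state model follows; it is generic, not the model estimated in the study:

import numpy as np

def forward(O, A, B, pi):
    # Forward algorithm: returns P(O | lambda) for observation sequence O
    alpha = pi * B[:, O[0]]              # alpha_1(i) = pi_i * b_i(o_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction step
    return alpha.sum()

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # illustrative parameters
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward([0, 2, 1, 2], A, B, pi))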
If the user marks a document as relevant in a search session, this document will be colored in blue. When the user visits a document, a dark circle is drawn around the current document. The time spent on a document is denoted by a growing green belt until the user leaves the document. If the user comes back to a previously visited document, a new layer of dark circle is added and an additional layer of green belt starts to be drawn. One can choose to carry the discs grown in one task into the next task; a red disc indicates how long the user spent on the document in the previous task.
We expect to observe the following patterns concerning users' navigation strategies:
1. Spatial-semantic models may reduce the time spent on examining a cluster of documents if the spatial-semantic mapping preserves the latent semantic structure.
2. Spatial-semantic models may mislead information foragers into over-estimating the profitability of a cluster of documents if the quality of clustering is low.
3. Once users locate a relevant document in a spatial-semantic model, they tend to switch to local search.
4. If we use the radius of a disc to denote the time spent on a document, the majority of large discs should fall in the target area of the thematic spaces. Discs of subsequent tasks are likely to be embedded in discs of preceding tasks.
Fig. 4.8 Relevant documents for Task A in the ALCOHOL space (MST)
Fig. 4.9 Overview first: user jbr’s trails in searching the alcohol space (Task A)
and number 21. Another special node in the map is number 57. Three out of four
users we studied chose this node as the starting point for their navigation.
Each trajectory map shows the course of visual navigation of a particular user. Figure 4.9 shows user jbr's navigation trail for Task A in the alcohol space; this user performed the best in the group. Task A corresponds to the initial overview task in Shneiderman's taxonomy. Users must locate clusters of relevant documents in the map. Subsequent tasks are increasingly focused.
As shown in the trajectory map, user jbr started from the node 57 and moved
downwards along the branch. Then the trajectory jumped to node 105 and followed
the long spine of the graph. Finally, the user reached the area where relevant
documents are located. We found an interesting trajectory pattern – once the user
When the user moves the mouse over a document in the thematic space, its title flashes up on the screen. When the user clicks on the document, the content of the document becomes available. When the user has decided that the current document is relevant to the task, he or she can mark the document.
First, we use two users' trails as the training set to build the first Hidden Markov model, λstate. We chose users jbr and nol because one marked the most documents and the other clicked the largest number of times.
The third parameter of a Hidden Markov model is the initial distribution, denoted as π. Intuitively, this is the likelihood that users will start their information foraging with a given document.
Fig. 4.12 Overview first, zoom in, filter, details on demand. Cumulative trajectory maps of user jbr in four consecutive sessions of tasks. The activated areas in each session reflect the changes of scope (clockwise: Task A to Task D)
Table 4.1 The state sequence generated by the HMM for user jbr. Relevant documents are
in bold type
67 57 120 199 65 61 61 61 73 73 73 87 170 134 105 170 142 172 156 112 192 77 47 138
128 114 186 30 13 13 18 114 135 50 161 50 43 50 66 50 50 66 161 66 66 169 66 66 169
169 123 123 83 149 169 169 123 123 149 149 83 11 138 159 121 123 149 149 100 100 91
91 83 83 119 83 83 119 119 83 41 162 162 82 50 82 82 82 82 161 122 31 43 135 81 161 43
43 135 81 81 135 14 135 135 14 14 20 20 80 80 189 189 152 56 189 189 64 64 158
In addition to the above approach, one can derive an HMM by applying the Baum-Welch algorithm to a given sequence of observed actions. We used user jbr's action sequence as the input to generate an HMM.
Using the Hidden Markov model derived from user jbr's and user nol's actual sequences, we can verify the internal structure of the model using the well-known Viterbi algorithm. Given a Hidden Markov model λ and a sequence of observed symbols, the Viterbi algorithm can be used to generate a sequence of states. One can examine this state sequence and compare it with the original event log of the user.
Table 4.1 shows the state sequence generated by the Viterbi algorithm based on the HMM λstate, which returns the sequence of states that is most likely to emit the observed symbols, i.e. the information foraging sequence. Relevant documents in the state sequence are highlighted in bold. This sequence is, of course, identical to the original sequence recorded in the session.
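For readers who want to experiment, a compact numpy implementation of Viterbi decoding follows. This is a generic sketch with an illustrative model, not the parameters estimated in the study:

import numpy as np

def viterbi(O, A, B, pi):
    # Most likely hidden-state sequence for observations O
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))            # best path probability so far
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):           # backtrack along psi
        states[t] = psi[t + 1, states[t + 1]]
    return states

A = np.array([[0.7, 0.3], [0.4, 0.6]])       # illustrative parameters
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(viterbi([0, 2, 1, 2], A, B, pi))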
We then took user jbr's observed information foraging action sequence as the input to the HMM λstate and applied the Viterbi algorithm to generate the optimal state transition path. Figure 4.13 shows the path of the sequence generated by the
Fig. 4.13 Synthesized trails. The trajectory of the optimal path over the original path of user jbr
Viterbi algorithm. The path started from the left-hand side of the thematic space
and traced the horizontal spine across the map and reached the target area. The path
finished in the target area with several extended visits to relevant documents in this
area. The optimal path is drawn on top of the original trail of the same user. By
showing the two versions of the trails on the same thematic map, it becomes clear
where the discrepancies are and where the conformance is. Since this is a novel way to represent paths in a Hidden Markov model, many characteristics are yet to be fully investigated. Even so, the synthesized path appears promising: it moves straight to the target area, and some of the wandering in the original trail has been filtered out. For social navigation, the optimal path is likely to provide an enhanced profile for this group of users.
Our study of behavioral semantics focused on the alcohol space in the MST-based interface. The thematic space was exposed to users for the first time in Task A. Apart from the structural model, no navigation cues were readily available to users. Users must first locate areas in the thematic space where they can find documents relevant to the task. The optimal information foraging theory provides an appropriate description of this type of process.
We have assumed that this is an information foraging process, and that it is also a stochastic process, because many of the judgments and decisions made by users in their exploration and foraging of relevant documents are implicit and difficult to externalize. The introduction of Hidden Markov models allows us to build descriptive and normative models so that we can characterize the sequential behavior of users in the context of information foraging. The visual inspection of information foraging trails is encouraging. Animated trails and optimal paths generated by HMMs have revealed many insights into how users were dealing with the tasks and what the prevailing characteristics and patterns are. Replaying and animating HMM paths over actual trails allows us to compare transition patterns in the same context.
4.2 Summary
References
Chapter 5
The Structure and Dynamics of Scientific Knowledge

In a letter to Robert Hooke in 1675, Isaac Newton made his most famous statement:
“If I have seen further it is by standing on the shoulders of Giants.” This statement
is now often quoted to symbolize scientific progress. Robert Merton examined the
origin of this metaphor in his On the Shoulders of Giants (Merton 1965). The
shoulders-of-giants metaphor can be traced to the French philosopher Bernard of
Chartres, who said that we are like dwarfs on the shoulders of giants, so that we can
see more than they, and things at a greater distance, not by virtue of any sharpness
of sight on our part, or any physical distinction, but because we are carried high and
raised up by their giant size.
In a presentation at the Conference on The History and Heritage of Science
Information Systems at Pittsburgh in 1998, Eugene Garfield used “On the Shoulders
of Giants” as the title of his tributes to an array of people who had made tremendous
contributions to citation indexing and science mapping, including Robert King
Merton, Derek John de Solla Price (1922–1983), Manfred Kochen (1928–1989),
Henry Small, and many others (Garfield 1998). In 1999, Henry Small used On the
Shoulders of Giants to entitle his ASIS Award Speech (Small 1999). He explained
that if a citation can be seen as standing on the shoulder of a giant, then co-
citation is straddling the shoulders of two giants, a pyramid of straddled giants
is a specialty, and a pathway through science is playing leapfrog from one giant
to another. Henry Small particularly mentioned Belver Griffith (1931–1999) and
Derek Price as the giants who shared the vision of mapping science with co-citation.
Griffith introduced the idea of using multidimensional scaling to create a spatial
representation of documents. According to Small, the work of Derek Price in
modeling of the research front (Price 1965) had a major impact on his thinking.
The goal of this chapter is to introduce some landmark works of giants in
quantitative studies of science, especially groundbreaking theories, techniques, and
applications of science mapping. Henry Small highly praised the profound impact of Thomas Kuhn on visualizing the entire body of scientific knowledge. He suggested that if Kuhn's paradigms are snapshots of the structure of science at specific points in time, examining a sequence of such snapshots might reveal the growth of science. Kuhn (1970) speculated that citation linkages might hold the key to solving the problem.
In this chapter, we start with general descriptions of science in action as reflected
through indicators such as productivity and authority. We follow the development of
a number of key methods of science mapping developed over the last few decades, including
co-word analysis and co-citation analysis. These theories and methods have been
an invaluable source of inspiration for generations of researchers across a variety of
disciplines. And we are standing on the shoulders of giants.
viewed as statements relating concepts.” They are the technical foundations of the
contemporary quantitative studies of science. Each offers a unique perspective on
the structure of scientific frontiers. Researchers have found that a combination of
co-word and co-citation analysis could lead to a clearer picture of the cognitive
content of publications (Braam et al. 1991a, b).
The history of co-word analysis has some interesting philosophical and sociological
implications for what we will see in later chapters. First, one of the key arguments
of the proponents of co-word analysis is that scientific knowledge is not merely pro-
duced within “specialist communities” which independently define their research
problems and delimit clearly the cognitive and methodological resources to be
used in their solution. The attention given to “specialist communities” is due to
the influence of the work done by Thomas Kuhn, particularly in his Postscript
to the second edition of The Structure of Scientific Revolutions. There are some
well-known examples of this approach, notably the invisible college by Diana
Crane (1972). The specialty areas are often identified by an analysis of citations in
scientific literature (Garfield et al. 1978). Co-citation analysis has been developed
in this context (Small 1977; Small and Greenlee 1980). A general criticism of the sociology of specialist communities was made by Knorr-Cetina (1999). Edge (1979) offered critical comments on delimiting specialty areas by citations. In 1981, issue 11(1) of Social Studies of Science was devoted to the analysis of scientific controversies. We will return to Kuhn's theory when we explain its role in visualizing scientific frontiers in later chapters of the book.
In 1976, Henry Small raised the question of social-cognitive structures in science and underlined the difficulty of using experts to help identify them, since expert judgments are inevitably biased. Co-word analysis was developed to provide an "objective" approach that does not rely on domain experts.
The term leximappe was used to refer to this type of concept map. More specific types of such maps are inclusion maps and proximity maps. Subsequent developments in relation to co-word analysis have incorporated artificial neural networks.
Inclusion maps and proximity maps are the two types of concept maps resulting from co-word analysis. Co-word analysis measures the degree of inclusion and proximity between keywords in scientific documents and draws maps of scientific areas automatically, as inclusion maps and proximity maps, respectively.
Metrics for co-word analysis have been extensively studied. Given a corpus of $N$ documents, each document is indexed by a set of unique terms that can occur in multiple documents. If two terms, $t_i$ and $t_j$, appear together in a single document, it counts as a co-occurrence. Let $c_k$ be the number of occurrences of term $t_k$ in the corpus, and let $c_{ij}$ be the number of co-occurrences of terms $t_i$ and $t_j$, that is, the number of documents indexed by both terms. The inclusion index $I_{ij}$ is essentially a conditional probability: given the occurrence of one term, it measures the likelihood of finding the other term in documents of the corpus:

$$I_{ij} = \frac{c_{ij}}{\min(c_i, c_j)}$$
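The computation is straightforward; the sketch below derives the inclusion index from a toy set of indexed documents (the documents and terms are invented for illustration).

from itertools import combinations
from collections import Counter

# Toy corpus: each document is the set of its index terms (illustrative data).
docs = [{"extinction", "iridium", "impact"},
        {"extinction", "impact", "crater"},
        {"extinction", "volcanism"}]

c = Counter()    # c[t]: number of documents indexed by term t
cc = Counter()   # cc[{ti, tj}]: number of documents indexed by both terms
for d in docs:
    c.update(d)
    cc.update(frozenset(p) for p in combinations(sorted(d), 2))

def inclusion(ti, tj):
    """I_ij = c_ij / min(c_i, c_j): chance of seeing one term given the rarer one."""
    return cc[frozenset((ti, tj))] / min(c[ti], c[tj])

print(inclusion("impact", "extinction"))   # 2 / min(2, 3) = 1.0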
Fig. 5.1 An inclusion map of research in mass extinction based on index terms of articles on mass extinction published in 1990. The size of a node is proportional to the total number of occurrences of the word. Links that violate the first-order triangle inequality are removed ($\varepsilon = 0.75$)
The original co-word analysis prunes a concept graph using a triangle inequality rule on conditional probabilities. Suppose we have a total of $N$ words in the analysis. For $1 \le i, j, k \le N$, let $\omega_{ij}$, $\omega_{ik}$, and $\omega_{kj}$ denote the weights of links in the network, where $\omega_{ij}$ is defined as $1 - I_{ij}$. Given a predefined small threshold $\varepsilon$, if there exists an index $k$ such that $\omega_{ij} > \omega_{ik}\omega_{kj} + \varepsilon$, then the link between $t_i$ and $t_j$ should be removed. Because $\omega_{ik}\omega_{kj}$ defines the weight of a path from term $t_i$ to term $t_j$, this operation means that if we can find a shorter path from $t_i$ to $t_j$ than the direct link, we choose the shorter one. In other words, if a link violates the triangle inequality, it must be invalid and should therefore be removed. By raising or lowering the threshold $\varepsilon$, we can decrease or increase the number of valid links in the network. The algorithm is simple to implement. In co-word analysis, we usually compare only a one-step path with a two-step path. However, as the size of the network increases, this simple algorithm tends to admit too many links and the resultant co-word map tends to lose its clarity. In the next chapter, we will introduce Pathfinder network scaling as a generalized form of the triangle inequality condition, which enables us to compare much longer paths connecting two points and to detect more subtle association patterns in the data.
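A direct transcription of this pruning rule, assuming a symmetric weight matrix with entries $\omega_{ij} = 1 - I_{ij}$, might look as follows (a sketch, not the original implementation):

import numpy as np

def prune_coword_links(omega, eps=0.75):
    """Drop a direct link whenever some two-step path undercuts it:
    w_ij > w_ik * w_kj + eps, with omega a symmetric matrix of link weights."""
    n = omega.shape[0]
    keep = np.ones_like(omega, dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                if omega[i, j] > omega[i, k] * omega[k, j] + eps:
                    keep[i, j] = keep[j, i] = False
                    break
    return keep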
Figure 5.1 shows a co-word map based on the inclusion index. The co-word
analysis was conducted on index terms of articles published in 1990 from a search in
the Web of Science with the query “mass extinction”. The meaning of this particular
co-word map should become clear when you complete Chap. 7, which contains a
detailed account of the background and key issues in the study of mass extinction.
We postpone the explanation here because it involves theories and examples of competing paradigms, a unique characteristic of a scientific frontier.
Steve Steinberg (1994) addressed several questions regarding the use of a quanti-
tative approach to identify paradigm shifts in the real world. He chose to examine
Reduced Instruction Set Computing (RISC). The idea behind RISC was that a pro-
cessor with only a minimal set of simple instructions could outperform a processor
that included instructions for complex high-level tasks. RISC marked a clear shift in computer architecture and had reached some degree of consensus.
Steinberg searched for quantitative techniques that could help his investigation.
Eventually he found the co-word analysis technique that could produce a map of
the field, a visualization of the mechanisms, and a battle chart of the debate. He
wrote (Steinberg 1994): “If I could see the dynamics of a technical debate, I thought,
perhaps I could understand them.”
He collected all abstracts with the keyword RISC for the years 1980–1993 from
the INSPEC database, filtered out the 200 most common English words, and ranked
the remaining words by frequency. The top 300 most frequently occurring words were given to three RISC experts, who chose the words central to the field. Finally, the words chosen by the experts were aggregated by synonym into 45 keyword clusters.
The inclusion index was used to construct a similarity matrix. This matrix was
mapped by MDS with ALSCAL. The font size of a keyword was proportional to
the word’s frequency. Strongly linked keywords were connected by straight lines.
Figure 5.2 shows the co-word map of the period of 1980–1985. The first papers
to explicitly examine and define RISC appeared within this period. The design
philosophy of RISC was so opposed to the traditional computer architecture paradigm that every paper in this period was written to defend and justify RISC. The
map shows two main clusters. One is on the left, surrounding keywords such
as register, memory, simple, and pipeline. These are the architectural terms that
uniquely define RISC. The other cluster is on the right, centered on keywords such as
language and CISC. These are the words that identify the debate between the RISC
and CISC camps. Language is the most frequent keyword on the map. According
to Steinberg, the term language most clearly captures the key to the debate between
RISC and CISC. While CISC proponents believed that a processor’s instruction
set should closely correspond to high-level languages such as FORTRAN and
COBOL, RISC proponents argued that simple instructions were better than high-
level instructions. This debate is shown in the co-word map with the connections
between language, CISC, compiler, and programming.
To illustrate the paradigm shift, we also include the co-word map of another
period: 1986–1987 (Fig. 5.3). During this period, Sun introduced the first SPARC-based systems.
5.3 Co-Citation Analysis
Citation analysis takes into account one of the most crucial indicators of scholar-
ship – citations. Citation analysis has a unique position in the history of science
mapping because several widely used analytical methods have been developed to
extract citation patterns from scientific literature and these citation patterns can
provide insightful knowledge of an invisible college. Traditionally, both the philosophy of science and the sociology of knowledge have had a strong impact on citation analysis. Opponents of citation analysis criticize its reliance on the idea of invisible colleges and scientific communities, arguing that the way science operates goes far beyond the scope of citation practices (Callon et al. 1986). However, this issue cannot be settled simply by theoretical arguments. Longitudinal studies and large-scale domain analysis can provide insightful answers, but they tend to be very time-consuming and resource-demanding. In practice, researchers have been exploring
frameworks that can accommodate both co-word analysis and co-citation analysis
(Braam et al. 1991a, b). These efforts may provide additional insights into the philo-
sophical and sociological debates. Document co-citation analysis (DCA) and author
co-citation analysis (ACA) represent the two most prolific mainstream approaches
to co-citation analysis. Here we first introduce DCA and then explain ACA.
Citation indexing provides a device for researchers to track the history of advances
in science and technology. One can trace a network of citations to find out the history
and evolution of a chain of articles on a particular topic. The goal of citation analysis
is to make the structure of such a network more recognizable and more accessible.
Fig. 5.4 A document co-citation network of publications in Data and Knowledge Engineering
5.3.1.1 Specialties
Fig. 5.6 A global map of science based on document co-citation patterns in 1996, showing
a linked structure of nested clusters of documents in various disciplines and research areas
(Reproduced from Garfield 1998)
Creating a science map is the first step towards exploring and understanding
scientific frontiers. Science maps should guide us from one topic or specialty to
related topics or specialties. Once we have a global map in our hands, the next
logical step is to find out how we can make a journey from one place to another
based on the information provided by the map. Small introduced the concept of
passage through science. Passages are chains of articles in scientific literature.
Chains running across the literature of different disciplines are likely to carry a
method established in one discipline into another. Such chains are vehicles for
cross-disciplinary fertilization. Traditionally, a cross-disciplinary journey would
require scientists to make a variety of connections, translations, and adaptations.
Small demonstrated his powerful algorithms by blazing a magnificent trail of
more than 300 articles across the literatures of different scientific disciplines.
This development of trailblazing mechanisms has brought Bush's (1945) concept of information trailblazing to life.
Henry Small described what he called the synthesis of specialty narratives from
co-citation clusters (Small 1986). This paper won the JASIS best-paper award in
1986. Small first chose a citation frequency threshold to select the most cited
documents in SCI. The second step was to determine the frequency of co-citation
between all pairs of cited documents above the threshold. Co-citation counts
were normalized by Salton’s cosine formula. Documents were clustered using the
single-link clustering method, which was believed to be more suitable than the
complete-link clustering algorithm because the number of co-citation links can be
as many as tens of thousands. Single-link clusters tend to form a mixture of densely
and weakly linked regions in contrast to more densely packed and narrowly focused
complete-link clusters. MDS was used to configure the layout of a global map.
Further, Small investigated how to blaze trails in the knowledge space represented
by the global map. He called this type of trail the specialty narrative.
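The normalization and clustering steps can be sketched as follows; the citation and co-citation counts are invented, and SciPy's single-link routine stands in for the clustering software Small actually used.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy co-citation data: counts[i, j] = co-citations of documents i and j,
# cites[i] = citation count of document i (illustrative numbers).
cites = np.array([50, 40, 30, 20])
counts = np.array([[0, 30, 5, 1],
                   [30, 0, 4, 2],
                   [5, 4, 0, 15],
                   [1, 2, 15, 0]])

# Salton's cosine normalization: cos_ij = c_ij / sqrt(c_i * c_j).
cos = counts / np.sqrt(np.outer(cites, cites))
np.fill_diagonal(cos, 1.0)

# Single-link clustering on the cosine dissimilarity (1 - cos).
Z = linkage(squareform(1.0 - cos, checks=False), method="single")
print(fcluster(Z, t=2, criterion="maxclust"))   # e.g. [1 1 2 2]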
Small addressed how to transform a co-citation network into a flow of ideas. The
goal for specialty narrative construction is to find a path through such networks so as
to track the trajectory of scientists who had encountered these ideas. Recall that the
traveling salesman problem (TSP) requires the salesman to visit each city exactly
once along a route optimized against a given criterion. We are in a similar situation
with specialty narrative construction or, more precisely, the reconstruction of narrative trails, when we retrace the possible sequence of thought by following trails of co-citation links.
Fig. 5.8 Zooming in even further to examine the structure of immunology (Reproduced from Garfield 1998)
TSP is a hard problem to solve. Fortunately, there are very efficient algorithms for traversing a network, namely breadth-first search (BFS) and depth-first search (DFS); both produce a spanning tree, which for an unweighted network is a minimum spanning tree (MST). Small
considered several possible heuristics for the traversal in his study. For example,
when we survey the literature, we tend to start with some old articles so as to form
a historical context. A reasonable approach is to start from the oldest article in
the co-citation network. In this example, DFS was used to generate the spanning tree. The longest path through the tree was chosen as the main sequence of the specialty narrative (see Fig. 5.9).
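A sketch of this traversal heuristic is given below, assuming an unweighted co-citation network stored as an adjacency dictionary; the longest path in the tree is found with the standard double-BFS trick.

from collections import deque

def dfs_tree(adj, root):
    """Spanning tree of an undirected graph by depth-first traversal from root."""
    tree, seen, stack = {v: [] for v in adj}, {root}, [root]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree[u].append(v); tree[v].append(u)
                stack.append(v)
    return tree

def farthest(tree, src):
    """Return the path from src to the node farthest from it (BFS on the tree)."""
    parent, q, last = {src: None}, deque([src]), src
    while q:
        u = q.popleft(); last = u
        for v in tree[u]:
            if v not in parent:
                parent[v] = u; q.append(v)
    path = [last]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

def main_sequence(adj, oldest):
    """Longest path through the DFS spanning tree rooted at the oldest article."""
    tree = dfs_tree(adj, oldest)
    end_a = farthest(tree, oldest)[-1]   # one endpoint of the tree's longest path
    return farthest(tree, end_a)         # the longest path itself

adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
print(main_sequence(adj, oldest=1))      # [5, 4, 2, 1, 3]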
The context of citing provides first-hand information on the nature of citation.
A specialty narrative is only meaningful and tangible if sufficient contextual
information of citation is attached to the narrative. The citation context of a given
article consists of sentences that explicitly cite the article. Such sentences may come
from different citing articles. Different authors may cite the same article for different
reasons. On the other hand, researchers may cite several articles within one sentence.
Small took all these circumstances into account in his study. In the foreseeable
future, we will still have to rely on human intervention to make such selections, as opposed to automated algorithmic devices. Nevertheless, NEC's ResearchIndex
has shown some promising signs of how much we might benefit from citation
contexts automatically extracted from documents on the Web. In his 1986 specialty narrative study, Small had to examine passages from citing papers, code them, and key them before running a program to compute the occurrence frequencies.
Fig. 5.9 The specialty narrative of leukemia viruses. Specialty narrative links are labeled by citation-context categories (Reproduced from Small 1986)
This
specialty narrative was rigorously planned, carefully carried out, and thoroughly
explained. Henry Small’s JASIS award-wining paper has many inspiring ideas and
technical solutions that predated the boom of information visualization in the 1990s.
Over the last 15 years, this paper has been a source of inspiration for citation
analysis; we expect it will also influence information visualization and knowledge
visualization in a fundamental way.
Robert Braam, Henk Moed and Anthony van Raan investigated whether co-
citation analysis indeed provided a useful tool for mapping subject-matter spe-
cialties of scientific research (Braam et al. 1991a, b). Most interestingly, the
cross-examination method they used was co-word analysis. Their work clarified a
number of issues concerning co-citation analysis.
The cluster of co-cited documents is considered to represent the knowledge
base of a specialty (Small 1977). In a review of bibliometric indicators, Jean King (1987) summed up the objections against co-citation analysis: loss of relevant papers, inclusion of non-relevant papers, overrepresentation of theoretical papers, time lag, and subjectivity in threshold setting. There were even more skeptical claims that co-citation clusters were mainly artifacts of the applied technique, with no further identifiable significance.
identifiable significance. Braam and his co-workers addressed several issues in their
investigation in response to such concerns. For example, does a co-citation cluster indeed correspond to an identifiable research specialty?
The 1980s saw the beginning of what turned out to be a second fruitful line of development in the use of citations to map science – author co-citation analysis (ACA). Howard White and Belver Griffith introduced ACA in 1981 as a way to map intellectual structures (White and Griffith 1981). The unit of analysis in ACA is the author, together with authors' intellectual relationships as reflected in the scientific literature.
The author-centered perspective of ACA led to a new approach to the discovery
of knowledge structures in parallel to approaches used by document-centered co-
citation analysis (DCA).
An author co-citation network offers a useful alternative starting point for co-citation
analysis, especially when we encounter a complex document co-citation network,
and vice versa. Katherine McCain (1990) gave a comprehensive technical review of
mapping authors in intellectual spaces. ACA reached a significant turning point in
1998 when White and McCain (1998) applied ACA to information science in their
thorough study of the field. Since then ACA has flourished and has been adopted
by researchers across a number of disciplines beyond the field of citation analysis
itself. Their paper won the best JASIS paper award. With both ACA and DCA at hand, we find ourselves in a position to compare and contrast the messages conveyed by different co-citation networks of the same topic, as if we had two pairs of glasses.
Typically, the first step is to identify the scope and the focus of ACA. The raw
data are either analyzed directly or, more commonly, converted into a correlation
matrix of co-citation. Presentations often combine MDS with cluster analysis or
PCA. Groupings are often produced by hierarchical cluster analysis. Figure 5.10 illustrates a generic procedure for standard co-citation analysis. For example, node placement can be done with MDS; clustering can be done with single- or complete-link clustering; PCA might replace clustering. In practice, some researchers choose to work on raw co-citation data directly, whereas others prefer to convert the raw counts into correlation matrices first.
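For illustration, the correlation-based route can be sketched as follows, with an invented co-citation matrix; note that the treatment of the matrix diagonal varies across ACA studies.

import numpy as np
from sklearn.manifold import MDS

# counts[i, j]: co-citation count of authors i and j (invented numbers).
counts = np.array([[0, 12, 8, 1],
                   [12, 0, 9, 2],
                   [8, 9, 0, 3],
                   [1, 2, 3, 0]], dtype=float)

# Correlate each author's co-citation profile with every other author's profile.
r = np.corrcoef(counts)

# MDS node placement on the correlation distance 1 - r.
D = 1.0 - r
np.fill_diagonal(D, 0.0)
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)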
Fig. 5.11 The first map of author co-citation analysis, featuring specialties in information science
(1972–1979) (Reproduced from White and Griffith 1981)
White and Griffith generated maps of the top 100 authors in the field. Major specialties in the field were identified using factor analysis. The resultant map showed that the field of information science consisted of two major specialties with little overlap in their memberships, namely experimental retrieval and scientific communication. Citation analysis belongs to the same camp as scientific communication. One of the remarkable findings was that the new map preserved some of the basic structure of the 1981 map: scientific communication on the right and information retrieval on the left.
White and McCain demonstrated that authors might simultaneously belong to several specialties. Instead of clustering authors into mutually exclusive specialties, they used PCA to accommodate multiple specialty memberships for each author. First, the raw co-citation counts were transformed into Pearson's correlation coefficients as a measure of similarity between pairs of authors (White and McCain 1998). They generated an MDS-based author co-citation map of 100 authors in information science for the period of 1972–1995. It is clear from the map that information science was made up of two major camps: the experimental retrieval camp on the right and the citation analysis camp on the left. The experimental retrieval camp includes names such as Vannevar Bush (1890–1974), Gerard Salton (1927–1995), and Don Swanson, whereas the citation camp includes Derek Price (1922–1983), Eugene Garfield, Henry Small, and Howard White. Thomas Kuhn (1922–1996) appears at about the coordinates of (1.3, 0.8).
Fig. 5.12 A two-dimensional Pathfinder network integrated with information on term frequencies
as the third dimension (Reproduced from Chen 1998)
Fig. 5.13 A Pathfinder network of SIGCHI papers based on their content similarity. The interac-
tive interface allows users to view the abstract of a paper seamlessly as they navigate through the
network (Reproduced from Chen 1998)
Fig. 5.14 A Pathfinder network of co-cited authors of the ACM Hypertext conference series
(1989–1998) (Reproduced from Chen and Carr 1999)
Fig. 5.15 A Pathfinder network of 121 information science authors based on raw co-citation
counts (Reproduced from White 2003)
Having mapped a knowledge domain, our focus turned to the question of the functionality of such visualizations and maps. It became clear that a more focused perspective is the key to a more fruitful use of such visualizations. This is why we will turn to Thomas Kuhn's puzzle-solving paradigms and focus on scenarios of competing paradigms in scientific frontiers in the next chapter. Henry Small's specialty narrative also provides an excellent example of how domain visualization can guide us to greater access to the core knowledge of scientific frontiers.
Multidimensional scaling (MDS) maps are among the most widely used means of depicting intellectual groupings. MDS-based maps are consistent with Gestalt principles – our perceived groupings are largely determined by proximity, similarity, and continuity. MDS is designed to optimize the match between the pairwise proximities in the data and the distances in a low-dimensional configuration. In principle, MDS should place similar objects next to each other in a two- or three-dimensional map and keep dissimilar ones farther apart.
MDS is easily accessible in most statistical packages such as SPSS, SAS, and
Matlab. However, MDS provides no explicit grouping information. We have to judge
proximity patterns carefully in order to identify the underlying structure. Proximity-
based pattern recognition is not easy and sometimes can be misleading. For
example, one-dimensional MDS may not necessarily preserve a linear relationship.
A two-dimensional MDS configuration may not be consistent with the results of
hierarchical clustering algorithms – two points next to each other in an MDS
configuration may belong to different clusters. Finally, three-dimensional MDS may
become so visually complex that it is hard to make sense of it without rotating
the model in a 3D space and studying it from different angles. Because of these
limitations, researchers often choose to superimpose additional information over
an MDS configuration so as to clarify groupings of data points, for example, by
drawing explicit boundaries of point clusters in an MDS map. Most weaknesses of
MDS boil down to the lack of local details. If we treat an MDS configuration as a graph, we can easily compare the number of links across various network solutions and the MDS configuration (see Table 5.1).
Figure 5.16 shows a minimum spanning tree (MST) of an author co-citation
network of 367 prominent authors in the field of hypertext. The original author
co-citation network consisted of 61,175 links among these authors. A fully connected symmetric matrix of this size would have a maximum of 67,161 links, excluding self-citations. In other words, the co-citation patterns covered about 91 % of the maximum possible connectivity. The MST solution retained the 366 strongest links, producing a much-simplified picture of the patterns.
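The MST step itself is routine; for example, with SciPy one might write the following (a toy distance matrix stands in for the 367 × 367 one):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# w[i, j]: co-citation distance between authors (e.g. 1 - normalized co-citation).
# Note that SciPy treats explicit zeros in a dense input as missing edges.
w = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.3, 0.7],
              [0.9, 0.3, 0.0, 0.4],
              [0.8, 0.7, 0.4, 0.0]])

mst = minimum_spanning_tree(w)                  # keeps the n - 1 shortest links
rows, cols = mst.nonzero()
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 1), (1, 2), (2, 3)]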
MST provides explicit links to display a more detailed picture of the underlying
network. If the network contains equally weighted edges, one can arbitrarily choose
any one of the MSTs. However, an arbitrarily chosen MST destroys the semantic
integrity of the original network because the selection of an MST is not based on
semantic judgments. Pathfinder network scaling resolves this problem by preserving
the semantic integrity of the original network. When geodesic distances are used, a
Pathfinder network is the set union of all possible MSTs. Pathfinder selects links by
ensuring that selected links do not violate the triangle inequality condition.
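For these parameter settings, a Pathfinder network can be computed with a Floyd-Warshall-style pass that replaces path length by the maximum link weight along a path; the following sketch assumes a symmetric distance matrix with a zero diagonal.

import numpy as np

def pathfinder(w):
    """PFNET(r = infinity, q = n - 1): keep a link only if no alternative path
    has a smaller maximum link weight (the generalized triangle inequality)."""
    d = w.astype(float).copy()
    n = w.shape[0]
    for k in range(n):  # Floyd-Warshall with (min, max) composition
        d = np.minimum(d, np.maximum(d[:, k:k + 1], d[k:k + 1, :]))
    return np.isclose(w, d) & (w > 0)   # True where the direct link survives

# Example: the weak direct link (0, 2) is dropped in favor of the path 0-1-2.
w = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.3],
              [0.9, 0.3, 0.0]])
print(pathfinder(w))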
Figure 5.17 is a Pathfinder network solution of the same author co-citation matrix. Red circles mark the extra links compared to the MST solution. A total of 398 links were included in the network – 32 more than in the MST counterpart. These extra links would be rejected by an MST because they form cycles, but ruling out links simply because they form cycles may overlook potentially important connections.
In order to incorporate multiple aspects of author co-citation networks, we
emphasize the significance of the following aspects of ACA (See Fig. 5.18):
• Represent an author co-citation network as a Pathfinder network;
• Determine specialty memberships directly from the co-citation matrix using
PCA;
• Depict citation counts as segmented bars, corresponding to citation counts over
several consecutive years.
The results from the three sources, namely, Pathfinder network scaling, PCA, and
annual citation counts, are triangulated to provide the maximum clarity. Figure 5.19
shows an author co-citation map produced by this method. This is an author
co-citation map of 367 authors in hypertext (1989–1998), for which PCA identified 39 factors.
Fig. 5.20 A landscape view of the hypertext author co-citation network (1989–1998). The height of each vertical bar represents the periodical citation index for each author (© 1999 IEEE)
Nodes located in the densely connected central area tend to represent more generic and generally applicable topics, whereas those located in peripheral areas of the Pathfinder network tend to represent more specific topics.
5.4 HistCite
1 http://garfield.library.upenn.edu/histcomp/
2 http://garfield.library.upenn.edu/histcomp/cocitation small-griffith/graph/2.html
5.5 Patent Analysis
Patent analysis has a long history in information science, but recently there has been a surge of interest from the commercial sector. Numerous newly formed companies aim specifically at the patent analysis market. Apart from historical driving forces, such as monitoring knowledge and technology transfer and staying competitive, the rising commercial interest in patent analysis is partly due to publicly accessible patent databases, notably the huge number of patent applications and grants from the United States Patent and Trademark Office (USPTO). The public can search patents and trademarks at USPTO's website http://www.uspto.gov/ and download bibliographic data from ftp://ftp.uspto.gov/pub/patdata/. Figure 5.22 shows a visualization of a network of 1,726 co-cited patents.
The availability of the abundant patent data, the increasingly widespread aware-
ness of information visualization, and the maturity of search engines on the Web
are among the most influential factors behind the emerging trend of patent analysis.
Fig. 5.22 A minimum spanning tree of a network of 1,726 co-cited patents related to cancer
research
Fig. 5.23 Landscapes of patent class 360 for four 5-year periods. Olympus’s patents are shown
in blue; Sony in green; Hitachi in green; Philips in magenta; IBM in cyan; and Seagate in red
(Reproduced from Figure 1 of Boyack et al. 2000)
Many patent search interfaces allow users to search specific sections of patent databases, for example, by claims. Statistical analysis and intuitive visualization functions are by far the most common selling points of patent analysis products. The term visualization has become so fashionable in the patent analysis industry that from time to time we come across visualization software tools that turn out to be little more than standard displays of statistics.
A particularly interesting example comes from Sandia National Laboratories. Kevin Boyack and his colleagues (2000) used their landscape-like visualization tool VxInsight to analyze patent bibliographic files from the USPTO in order to answer a number of questions. For example, where are competitors placing their efforts? Who is citing our patents, and what types of things have they developed? Are there emerging competitors or collaborators working in related areas? The analysis was based on 15,782 patents retrieved from a single primary classification class of the US patent database: class 360, Dynamic Magnetic Information Storage or Retrieval. A similarity measure was calculated using the direct and co-citation link types of Small (1997). Direct citations were given a weight five times that of each co-citation link. These patents were clustered and displayed in a landscape view (see Figs. 5.23 and 5.24).
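The combined similarity is a simple weighted sum; a sketch with invented toy matrices:

import numpy as np

# direct[i, j] = 1 if one patent cites the other; cocit[i, j] is the co-citation
# count. Both matrices are invented; the 5:1 weighting follows the description above.
direct = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 0]], dtype=float)
cocit = np.array([[0, 2, 3],
                  [2, 0, 1],
                  [3, 1, 0]], dtype=float)

similarity = 5.0 * direct + cocit   # a direct citation counts as five co-citations
print(similarity)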
Fig. 5.24 Map of all patents issued by the US Patent Office in January 2000. Design patents are
shown in magenta; patents granted to universities in green; and IBM’s patents in red (Reproduced
from Figure 5 of Boyack et al. 2000)
5.6 Summary
In this chapter, we have introduced factors that influence the perceived impact of scientific works, such as the Matthew Effect. We focused on two mainstream approaches to science mapping, namely co-word analysis and co-citation analysis. Within co-citation analysis, we distinguished document co-citation analysis from author co-citation analysis. Key techniques used in and developed along with these approaches were described, although our focus was on the fundamental requirements and strategies rather than on detailed implementations. A more fundamental issue was also identified: where should we go next from a global map of a field of study viewed from 60,000 ft above the ground? The central theme of this chapter is on the shoulders of giants, which implies that knowledge of the structure of scientific frontiers in the immediate past holds the key to a fruitful exploration of humanity's intellectual assets. Henry Small's specialty narrative provided an excellent example of the transition from admiring a global map to a more detailed knowledge acquisition process. We conclude this chapter with a visualization of the literature of co-citation analysis. The visualization in Fig. 5.25 shows a network of co-cited references from articles that cited either Henry Small or Belver Griffith, the two pioneers of co-citation research. The visualization was generated by CiteSpace based on citations made between 1973 and 2011. The age of an area of concentration
is indicated by the colors of co-citation links. Earlier works are shown in colder colors, i.e. blue; more recent works are in warmer colors, i.e. orange. The upper half of the network formed first, whereas the lower left area is the youngest.
The network is divided into clusters of co-cited references based on how tightly
they were coupled. Each cluster is automatically labeled with words from the titles
of articles that are responsible for the formation of the cluster. For example, clusters
such as #86 scientific specialty, #76 co-citation indicator, and #67 author co-citation
structure are found in the region with many areas in blue color. The few clusters in
the middle of the map connect the upper and lower parts, including #21 cocitation
map and #26 information science. Clusters in the lower left areas are relatively new,
including #37 interdisciplinarity and #56 visualization. Technical advances in the
past 10 years have made such visual analytics more accessible than before.
Researchers began to realize that to capture the dynamics of science in action,
science mapping needs to bring in different perspectives and metaphors. Loet
Leydesdorff of the University of Amsterdam argued that evolutionary perspectives are more appropriate for mapping science than the historical perspective commonly taken by citation analysts (Leydesdorff and Wouters 2000). Leydesdorff suggested that
the metaphor of geometrical mappings of multidimensional spaces is gradually
being superseded by evolutionary metaphors. Animations, movies, and simulations
are replacing snapshots. Science is no longer perceived as a solid body of unified
knowledge in a single cognitive dimension. Instead, science may be better repre-
sented as a network in a multi-dimensional space that develops not only within the boundaries of this space but also through co-evolutionary processes that create new dimensions of the space. Now it is time to zoom closer to the map and find trails that can lead us to the discovery of what happened in some of the most severe and long-lasting puzzle-solving cases in modern science. In the next chapter, we will focus on the role of Kuhn's paradigm shift theory in mapping scientific frontiers.
References
Boyack KW, Wylie BN, Davidson GS, Johnson DK (2000) Analysis of patent databases using
Vxinsight (No. SAND2000-2266C). Sandia National Laboratories, Albuquerque
Braam RR, Moed HF, Raan AFJv (1991a) Mapping of science by combined co-citation and word
analysis II: dynamical aspects. J Am Soc Inf Sci 42(4):252–266
Braam RR, Moed HF, Raan AFJv (1991b) Mapping of science by combined co-citation and word
analysis. I: structural aspects. J Am Soc Inf Sci 42(4):233–251
Bush V (1945) As we may think. Atl Mon 176(1):101–108
Callon M, Courtial JP, Turner WA, Bauin S (1983) From translations to problematic networks – an
introduction to co-word analysis. Soc Sci Inf Sur Les Sci Soc 22(2):191–235
Callon M, Law J, Rip A (eds) (1986) Mapping the dynamics of science and technology: sociology
of science in the real world. Macmillan Press, London
Chen C (1997a) Structuring and visualising the WWW with generalised similarity analysis. Paper
presented at the 8th ACM conference on hypertext (Hypertext’97), Southampton, UK, April
1997
Chen C (1997b) Tracking latent domain structures: an integration of pathfinder and latent semantic
analysis. AI Soc 11(1–2):48–62
Chen C (1998) Generalised similarity analysis and pathfinder network scaling. Interact Comput
10(2):107–128
Chen C (1999) Visualising semantic spaces and author co-citation networks in digital libraries. Inf
Process Manag 35(2):401–420
Chen C, Carr L (1999) Trailblazing the literature of hypertext: author co-citation analysis (1989–
1998). Paper presented at the 10th ACM conference on hypertext (Hypertext’99), Darmstadt,
Germany, February 1999
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. Computer
34(3):65–71
Crane D (1972) Invisible colleges: diffusion of knowledge in scientific communities. University of
Chicago Press, Chicago
Edge D (1979) Quantitative measures of communication in science: a critical overview. Hist Sci
17:102–134
Garfield E (1955) Citation indexes for science: a new dimension in documentation through association of ideas. Science 122:108–111
Garfield E (1975) The “Obliteration Phenomenon” in science and the advantage of being
obliterated! Curr Content 51(52):5–7
Garfield E (1996) When to cite. Libr Q 66(4):449–458
Garfield E (1998) On the shoulders of giants. Paper presented at the conference on the history and
heritage of science information systems, Pittsburgh, PA, October 24 1998
Garfield E, Small H (1989) Identifying the changing frontiers of science. Paper presented at Innovation: At the Crossroads Between Science & Technology
Garfield E, Sher IH, Torpie RJ (1964) The use of citation data in writing the history of science.
Institute for Scientific Information, Philadelphia
Garfield E, Malin MV, Small H (1978) Citation data as science indicators. In: Elkana Y (ed) Toward
a metric of science. Wiley, New York
King J (1987) A review of bibliometric and other science indicators and their role in research
evaluation. J Inf Sci 13(5):261–276
Knorr-Cetina KD (1999) Epistemic cultures: how the sciences make knowledge. Harvard
University Press, Cambridge, MA
Koenig M, Harrell T (1995) Lotka's law, Price's urn, and electronic publishing. J Am Soc Inf Sci 46(5):386–388
Kuhn TS (1962) The structure of scientific revolutions. University of Chicago Press, Chicago
Kuhn TS (1970) The structure of scientific revolutions, 2nd edn. University of Chicago Press,
Chicago
Lawrence S (2001) Online or invisible? Nature 411(6837):521
Leydesdorff L, Wouters P (2000) Between texts and contexts: advances in theories of citation.
Retrieved June 26 2000, from http://www.chem.uva.nl/sts/loet/citation/rejoin.htm
Lin X (1997) Map displays for information retrieval. J Am Soc Inf Sci 48(1):40–54
Lotka AJ (1926) The frequency distribution of scientific productivity. J Wash Acad Sci 16:317–323
McCain KW (1990) Mapping authors in intellectual space: a technical overview. J Am Soc Inf Sci
41(6):433–443
Merton RK (1965) On the shoulders of giants: a Shandean postscript. University of Chicago Press,
Chicago
Merton RK (1968) The Matthew effect in science. Science 159(3810):56–63
Mullins N, Snizek W, Oehler K (1988) The structural analysis of a scientific paper. In: Raan AFJv
(ed) Handbook of quantitative studies of science & technology. Elsevier Science Publishers,
Amsterdam, pp 85–101
Noyons ECM, van Raan AFJ (1998) Monitoring scientific developments from a dynamic perspec-
tive: self-organized structuring to map neural network research. J Am Soc Inf Sci 49(1):68–81
Price D (1961) Science since Babylon. Yale University Press, New Haven
Price D (1965) Networks of scientific papers. Science 149:510–515
Price D (1976) A general theory of bibliometric and other cumulative advantage processes. J Am
Soc Inf Sci 27:292–306
Sher I, Garfield E (1966) New tools for improving and evaluating the effectiveness of research.
Paper presented at the research program effectiveness, Washington, DC, 27–29 1965
Small H (1973) Co-citation in scientific literature: a new measure of the relationship between
publications. J Am Soc Inf Sci 24:265–269
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Small H (1986) The synthesis of specialty narratives from co-citation clusters. J Am Soc Inf Sci
37(3):97–110
Small HS (1988) Book review of Callon et al. Scientometrics 14(1–2):165–168
Small H (1994) A SCI-MAP case study: building a map of AIDS research. Scientometrics
30(1):229–241
Small H (1997) Update on science mapping: creating large document spaces. Scientometrics
38(2):275–293
Small H (1999) On the shoulders of giants. Bull Am Soc Inf Sci 25(2):23–25
Small H, Greenlee E (1980) Citation context analysis and the structure of paradigms. J Doc
36:183–196
Small HG, Griffith BC (1974) The structure of scientific literatures I: identifying and graphing
specialties. Sci Stud 4:17–40
Steinberg SG (1994) The ontogeny of RISC. Intertek 3(5):1–10
White HD (2003) Pathfinder networks and author cocitation analysis: a remapping of paradigmatic
information scientists. J Am Soc Inf Sci Tech 54(5):423–434
White HD, Griffith BC (1981) Author co-citation: a literature measure of intellectual structure.
J Am Soc Inf Sci 32:163–172
White HD, McCain KW (1998) Visualizing a discipline: an author co-citation analysis of
information science, 1972–1995. J Am Soc Inf Sci 49(4):327–356
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A et al (1995) Visualizing the
non-visual: spatial analysis and interaction with information from text documents. Paper
presented at the IEEE symposium on information visualization’95, Atlanta, Georgia, USA,
30–31 October 1995
Chapter 6
Tracing Competing Paradigms
Hjorland has been a key figure in promoting domain analysis in information science
(Hjorland 1997; Hjorland and Albrechtsen 1995). The unit of domain analysis is
a specialty, a discipline, or a subject matter. In contrast to existing approaches to
domain analysis, Hjorland emphasized the essential role of a social perspective
instead of the more conventional psychological perspective.
Table 6.1 Differences between cognitivism and the domain-specific viewpoint (Hjorland and
Albrechtsen 1995)
Cognitivism: Priority is given to the understanding of isolated user needs and intrapsychological analysis; intermediating between producers and users emphasizes psychological understanding.
The domain-specific view: Priority is given to the understanding of user needs from a social perspective and the functions of information systems in trades or disciplines.

Cognitivism: Focus on the single user; typically looks at the disciplinary context as a part of the cognitive structure of an individual – if at all.
The domain-specific view: Focus on either one knowledge domain or the comparative study of different knowledge domains; looks at the single user in the context of the discipline.

Cognitivism: Mainly inspired by artificial intelligence and cognitive psychology.
The domain-specific view: Mainly inspired by knowledge about the information structures in domains, by the sociology of knowledge, and the theory of knowledge.

Cognitivism: The psychological theory emphasizes the role of cognitive strategies in performance.
The domain-specific view: The psychological theory emphasizes the interaction among aptitudes, strategies, and knowledge in cognitive performance.

Cognitivism: Central concepts are individual knowledge structures, individual information processing, short- and long-term memory, categorical versus situational classification.
The domain-specific view: Central concepts are scientific and professional communication, documents (including bibliographies), disciplines, subjects, information structures, paradigms, etc.

Cognitivism: Methodology characterized by an individualistic approach.
The domain-specific view: Methodology characterized by a collectivistic approach.

Cognitivism: Methodological individualism has some connection to a general individualistic view, but the difference between cognitivism and the domain-specific view is not a different political perception of the role of information systems, but a different theoretical and methodological approach to the study and optimization of information systems.
The domain-specific view: Methodological collectivism has some connection to a general collectivistic view, but the difference between cognitivism and the domain-specific view is not a different political perception of the role of information systems, but a different theoretical and methodological approach to the study and optimization of information systems.

Cognitivism: Best examples of applications: user interfaces (the outer side of information systems).
The domain-specific view: Best examples of applications: subject representation/classification (the inner side of information systems).

Cognitivism: Implicit theory of knowledge: mainly rationalistic/positivistic, with tendencies toward hermeneutics.
The domain-specific view: Theory of knowledge: scientific realism/forms of social constructivism, with tendencies towards hermeneutics.

Cognitivism: Implicit ontological position: subjective idealism.
The domain-specific view: Ontological position: realism.
Fig. 6.1 Paradigm shift in collagen research (Reproduced from Small 1977)
a particular article was visible so long as it remained within the scope of these diagrams. Once an article moved out of sight, there would be no way to follow it any further; the chase would be over. A wider field of view would provide more contextual information, so that we can follow the trajectory of a rising paradigm as well as a falling one.
Researchers have found that thematic maps of geographic information can help to
improve memory for facts and inferences (1994). If people study a geographic map
first and read relevant text later, they can remember more information from the text.
If we visualize the intellectual structure of a knowledge domain, such knowledge
maps may help researchers in a similar way.
Traditionally, a geographic map shows two important types of information:
structural and feature information. Structural information helps us to locate individual landmarks on the map and determine spatial relations among them. Feature
information refers to detail, shape, size, color, and other visual properties used to
depict particular items on a map. One can distinguish landmarks from one another
based on feature information without relying on the structural relations among
these landmarks. When people study a map, they first construct a mental image of the map's general spatial framework and subsequently add the landmarks into the image (1994). Once a mental image is in place, it becomes a powerful tool for retrieving information. The mental image integrates information about individual landmarks into a single, relatively intact piece, which allows rapid and easy access to the embedded landmarks. In addition, the greater the integration of structural and feature information in the image, the more intact the image is; and the more intact the image, the more easily landmark information can be located, helping to retrieve further details.
These findings about thematic maps provide useful design guidelines for in-
formation visualization. In a previous study of visualizing a knowledge domain’s
intellectual structure (Chen and Paul 2001), we developed a four-step procedure
to construct a landscape of a knowledge domain based on citation and co-citation
data. Our method extracts structural information from a variety of association
measures, such as co-citation, co-word, or co-descriptor. The structural information
is represented as a Pathfinder network, which essentially consists of shortest paths
connecting the network components. The feature information in our visualization
mainly corresponds to citation impact and specialty memberships. Citation impact
of an article is depicted by the height of its citation bar. The color of each year’s
citation bar indicates the recentness of citations. Identifying a landmark in such a knowledge landscape becomes a simple task: a tall citation bar with a large number of brightly colored segments is likely to be a landmark article in the given knowledge domain. In our approach, specialty membership, sometimes also known as a sub-domain or a theme, is colored according to the results of factor analysis.
In the following two case studies, we intend to highlight structural and feature
information associated with debates between competing paradigms. We also want
to highlight the movement of a paradigm in terms of the movement of landmark
articles in the global structure. We focus on matching structural and feature
information to what we know about scientific debates involved. A comprehensive
validation with domain experts is a separate topic in its own right.
Kuhn’s notion of scientific paradigm indeed provides a framework for us to
match visual-spatial patterns to the movement of an underlying paradigm. If there
exists a predominant paradigm within a scientific discipline, citation patterns should reflect this phenomenon, allowing for the usual delay in publication cycles. A predominant paradigm should acquire the most citations, at least over a certain period of time. Citation peaks are likely to become visible in a landscape view. Two competing paradigms would show as twin peaks locked in contest in a landscape. Furthermore, such clusters should be located towards the center of the domain structure. During a period of normal science, the overall landscape would demonstrate continuous increases in the citations of such clusters. However, if the particular scientific discipline is in crisis, one or more clusters outside the predominant one will rapidly appear on the horizon of the virtual landscape. A paradigm shift takes place at the moment when the citations of the new clusters of articles overtake those of the original clusters: the peak of the old paradigm drops, while the valley of the new paradigm rises. Figure 6.2 illustrates the relationship between these citation patterns and the rise and fall of competing paradigms.
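In its simplest form, such a crossover can be read off from two citation time series; the numbers below are hypothetical.

# Yearly citation totals for two co-citation clusters (hypothetical numbers).
incumbent  = [120, 130, 125, 110, 90, 70]
challenger = [5, 15, 40, 80, 115, 140]
years = range(1980, 1986)

# A paradigm shift registers in the first year the challenger out-cites the incumbent.
shift = next((y for y, old, new in zip(years, incumbent, challenger) if new > old), None)
print(shift)   # 1984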
6.3 The Mass Extinction Debates
Five mass extinctions occurred on Earth in the past 570 million years. Geologists divided this vast time span into eras and periods on the geological timescale (see Table 6.2). The Permian-Triassic extinction, 248 million years ago, was the greatest of all the mass extinctions. However, the Cretaceous-Tertiary extinction, which wiped out the dinosaurs 65 million years ago within a short period of time, along with many other species, has been the most mysterious and hotly debated topic over the last two decades.
The dinosaurs' extinction occurred at the end of the Mesozoic. Many other organisms became extinct or were greatly reduced in abundance and diversity. Among these were the flying reptiles, sea reptiles, and ichthyosaurs, the last disappearing slightly earlier.
6.3.1.1 Catastrophism
In their 1980 Science article (Alvarez et al. 1980), Alvarez and his colleagues, a
team of a physicist, a geologist, and two nuclear chemists, proposed an impact
theory to explain what happened in the Cretaceous and Tertiary extinction. In
contrast to the widely held view at the time, especially by paleontologists, the impact
theory suggests that the extinction happened within a much shorter period of time
and that it was caused by an asteroid or a comet.
In the 1970s, Walter Alvarez found a layer of iridium sediment in rocks at the
Cretaceous-Tertiary (K-T) boundary at Gubbio, Italy. Similar discoveries were made
subsequently in Denmark and elsewhere, both in rocks on land and in core samples
drilled from ocean floors. Iridium is normally a rare substance in rocks of the Earth's crust (about 0.3 parts per billion). At Gubbio, the iridium concentration was found to be more than 20 times greater than the normal level (6.3 parts per billion), and it was even greater at other sites.
Such high concentrations of iridium occur in only two places: the Earth's mantle and extraterrestrial objects such as meteors and comets. Scientists could not find other layers of iridium like this above or below the KT boundary. This layer of iridium provided the crucial evidence for the impact theory. However, the impact theory triggered some of the most intense debates between gradualism and catastrophism: the high iridium concentration did not by itself rule out a terrestrial source.
6.3.1.2 Gradualism
Fig. 6.3 An artist’s illustration of the impact theory: before the impact, seconds to impact, moment
of impact, the impact crater, and the impact winter (© Walter Alvarez)
million years. The periodicity hypothesis challenged both the impact theory and volcanism to extend the explanatory power of their theories to cover not only the KT extinction but also other mass extinctions, such as the Permian-Triassic mass extinction and other major extinctions. Some researchers in the impact camp were indeed searching for theories and evidence that could explain why the Earth would be hit by asteroids or comets every 26 million years.
A watershed for the KT impact debate was 1991 when the Chicxulub crater was
identified as the impact site on the Yucatan Peninsula in Mexico (Hildebrand et al.
1991). The Signor-Lipps effect was another milestone for the impact theory. Phil Signor and Jere Lipps demonstrated in 1982 that, even for a truly abrupt extinction, the poor fossil record would make it look like a gradual extinction (Signor and Lipps 1982). This work in effect weakened gradualism's argument.
In 1994, proponents of the impact theory were particularly excited to witness the spectacular scene of comet Shoemaker-Levy 9 colliding with Jupiter, because an event of this type could happen to the Earth and might have happened to the dinosaurs 65 million years ago. The comet impacts on Jupiter's atmosphere were
spectacular and breathtaking. Figure 6.3 shows an artist’s impression of the KT
impact. Figure 6.4 shows the impact of Shoemaker-Levy 9 on Jupiter in 1994.
In the controversy between the gradualist and catastrophist explanations of the dinosaurs' extinction, one phenomenon might not exclude the other. It was the interpretation of the highly concentrated layer of iridium that distinguished the two competing paradigms (see Fig. 6.5).
Fig. 6.4 Shoemaker-Levy 9 colliding into Jupiter in 1994. Eight impact sites are visible. From left
to right are the E/F complex (barely visible on the edge of the planet), the star-shaped H site, the
impact sites for tiny N, Q1, small Q2, and R, and on the far right limb the D/G complex. The D/G
complex also shows extended haze at the edge of the planet. The features are rapidly evolving on
timescales of days. The smallest features in this image are less than 200 km across. This image is
a color composite from three filters at 9,530, 5,550, and 4,100 Å (Copyright free, image released
into the public domain by NASA)
Each cluster is colored by factor loadings obtained from PCA. The KT Impact cluster is in red, implying its predominance in the field. The green color of Periodicity and Gradualism indicates their secondary position in the field. Of course, this classification is based purely on co-citation groupings. Similarly, the blue Permian Extinction zone marks its relative importance in mass extinction research. The KT Impact cluster is the most predominant specialty of mass extinction research revealed by the citation landscape. The most highly cited article in the entire network of articles
was the one by Luis Alvarez, Walter Alvarez, Frank Asaro, and Helen Michel,
published in Science in 1980 (Alvarez et al. 1980). It was this article that laid down
the foundation for the impact paradigm.
Alvarez and his colleagues argued that an asteroid hit the earth and the impact
was the direct cause of the KT extinction, and that the discovery of the abnormally
concentrated layer of iridium provided crucial evidence. This is the essence of the
KT impact paradigm. Such layers of iridium were found in deep-sea limestone
exposed in several places, including Italy, Denmark, and New Zealand. The
excessive amount of iridium, found at precisely the time of the Cretaceous-Tertiary
extinctions, ranged from 20 to 160 times higher than the background level.
If the impact theory is correct, then there should be a crater left on the earth.
They estimated that the size of the impact asteroid was about 6 miles (10 km) in
diameter, so the size of the crater must be between 90 and 120 miles (150–200 km)
in diameter. By 1980, scientists had discovered only three craters with a diameter of 60 miles (100 km) or more: Sudbury, Vredefort, and Popigai. The first two were dated to the Precambrian age, which would be too old for the KT impact; the Popigai Crater in Siberia, dated at only 28.8 million years old, would be too young. Alvarez and his
colleagues suggested that there was a 2/3 probability that the impact site was in the
ocean. If that were the case, we would not be able to find the crater, because the ocean floor of that age was long gone. Nevertheless, searching for the impact
crater had become a crucial line of research. A breakthrough came in 1991 when
Alan Hildebrand linked the Chicxulub crater to the KT impact.
The Chicxulub crater is a 110-mile (180-km) structure, completely buried under
the Yucatan Peninsula in Mexico (see Fig. 6.7). In the 1950s, the gravity anomaly
of the Chicxulub crater attracted the Mexican national oil company (PEMEX) in its
search for oil fields, but the crater kept a low profile in the community of
mass extinction research until Alan Hildebrand's discovery.
Hildebrand’s 1991 paper is one of the most highly cited articles in the KT impact
cluster (Hildebrand et al. 1991). Figure 6.8 shows the gravity field and magnetic
field of the Chicxulub crater.
Since the impact theory was conceived, its catastrophist point of view has
met strong resistance, especially from paleontologists who held a gradualist
viewpoint. The impact theory, its interpretations of evidence, and the validity of
that evidence have all been under scrutiny.
In Walter Alvarez's book, Gerta Keller was regarded as the number one opponent
of the impact theory (Alvarez 1997). Several of Keller's papers appeared
in the KT impact cluster, including her 1993 paper, which challenged the
available evidence of impact-generated tsunami deposits.
The presence of articles from a leading opponent to the impact theory right
in the center of this cluster has led to new insights into visualizing competing
paradigms. Co-citations not only brought supportive articles together into the same
Fig. 6.8 Chicxulub’s gravity field (left) and its magnetic anomaly field (right) (© Mark Pilkington
of the Geological Survey of Canada)
cluster, but also ones that challenged the paradigm. This is a desirable
feature, because scientists can access a balanced collection of articles representing
different perspectives of a debate. Indeed, evidence strongly supporting the impact
theory, such as Hildebrand's 1991 paper on the Chicxulub crater, and Keller's 1993
paper questioning the conclusiveness of the available evidence (Keller 1993), were
found in the same cluster. After all, when we debate a topic, we are likely to cite
the arguments of both sides.
The KT impact cluster also included an article labeled Signor. This is the article
by Signor and Lipps on what is now known as the Signor-Lipps effect: if few fossils
are preserved, an abrupt extinction can look like a gradual one in the fossil record.
Because whether the KT event was a gradual extinction or a catastrophic one is
crucial to the debate, the high citation profile of Signor and Lipps' article
indicates its significance in this debate.
Table 6.3 shows the most representative articles of the KT impact cluster in terms
of their factor loadings. In his book, Walter Alvarez (1997) spoke highly of Smit's
contribution to the impact theory: Alvarez found the iridium anomaly in Italy,
whereas Smit confirmed the iridium anomaly in Spain. Smit's 1980 article in
Nature, which topped the list, is located immediately next to the 1980 Science paper
by Alvarez et al. The two articles are connected via a strong Pathfinder network
link. The table also includes Glen's 1994 book Mass Extinction Debates.
Articles from the Gradualism camp are located between the KT Impact cluster
and the Periodicity cluster. Landmark articles in this cluster include ones from
Charles Officer, a key opponent of the impact theory. The article by another anti-
impact researcher, Dewey McLean, is also in this cluster, but falls below the
50-citation landmark threshold. McLean proposed that prolonged volcanic eruptions
from the Deccan Traps in India were the cause of the KT mass extinction.
Piet Hut’s 1987 Nature article on comet showers, with co-authors such as Alvarez
and Keller, marked a transition from the KT impact paradigm to the periodicity
hypothesis. This article was seeking an explanation of the periodicity of mass
extinctions within the impact paradigm.
Table 6.3 Landmark articles in the top three specialties of mass extinctions (Citations ≥ 50)
Factor loadings Name Year Source Volume Page
KT impact
0.964 Smit J 1980 Nature 285 198
0.918 Hildebrand AR 1991 Geology 19 867
0.917 Keller G 1993 Geology 21 776
0.887 Glen W 1994 Mass Extinction Debates
0.879 Sharpton VL 1992 Nature 359 819
0.877 Alvarez LW 1980 Science 208 1,095
Periodicity
0.898 Patterson C 1987 Nature 330 248
0.873 Raup DM 1986 Science 231 833
0.859 Raup DM 1984 P Natl Acad Sci-Biol 81 801
0.720 Jablonski D 1986 Dynamics Extinction 183
0.679 Benton MJ 1985 Nature 316 811
0.629 Davis M 1984 Nature 308 715
0.608 Jablonski D 1986 Science 231 129
Permian extinction
0.812 Magaritz M 1989 Geology 17 337
0.444 Renne PR 1995 Science 269 1,413
0.436 Stanley SM 1994 Science 266 1,340
0.426 Erwin DH 1994 Nature 367 231
0.425 Wignall PB 1996 Science 272 1,155
The second largest area in the visualization landscape highlights the theme of
the periodicity of mass extinctions. The periodicity frame in Fig. 6.9 shows two
predominant landmarks, both from David Raup and John Sepkoski. The one on the
left is their 1984 article published in the Proceedings of the National Academy of
Sciences of the United States of America – Biological Sciences, entitled Periodicity
of extinctions in the geologic past. They showed a graph of incidences of extinction
of marine families through time, in which peaks coincided with the times of most
major extinction events, and suggested that mass extinctions occurred every 26
million years. The one on the right is their 1982 article in Science, entitled Mass
extinctions in the marine fossil record.
Catastrophism was one of the major beneficiaries of the periodicity paradigm,
because only astronomical forces are known to be capable of producing such a
precise periodic cycle. There were also hypotheses that attempted to incorporate
various terrestrial extinction-making events such as volcanism, global climatic
change, and glaciation. There was even a theory that each impact triggered
volcanic plumes, but the supporting evidence was rather limited. A few landmark
articles in the periodicity frame addressed the causes of the periodicity of mass
extinctions within the impact paradigm, hypothesizing that asteroids or comets
strike the earth catastrophically every 26 million years.
The initial reaction from the impact camp was that the periodicity hypothesis
conflicted outright with the impact theory: what could possibly make asteroids
hit the earth at such a regular pace? The impact camp subsequently came up with the
hypothesis that an invisible death star could account for it, but the hypothesis
remained essentially theoretical. Landmark articles labeled Alvarez and Davis in
the visualization address such extensions of the impact paradigm.
Since the periodicity hypothesis required a theory that could explain not just
one but several mass extinctions, both gradualists and catastrophists sought to
extend their theories beyond the KT boundary. Patterson and Smith's 1987 Nature
article (Patterson and Smith 1987) questioned whether the periodicity really existed.
Its high factor loading (0.898) reflects the uniqueness of the work. The landmark
article by Davis et al. in Nature has a factor loading of 0.629.
The third cluster features articles from Erwin, Wignall, and Knoll. Erwin
is the leading scientist on the Permian mass extinction, the greatest of all
five major mass extinctions. The Permian-Triassic (PT) mass extinction was much
more severe than the KT extinction. Because it happened 248 million years ago, it is
extremely hard to find evidence in general, and for an impact theory in particular.
Fig. 6.10 A year-by-year animation shows the growing impact of articles in the context of relevant
paradigms. The top-row snapshots show the citations gained by the KT impact articles (center),
whereas the bottom-row snapshots highlight the periodicity cluster (left) and the Permian extinction
cluster (right)
Fig. 6.11 Citation peaks of three clusters of articles indicate potential paradigms
6.4 Supermassive Black Holes

A large number of galaxies have extremely bright galactic centers. These luminous
nuclei of galaxies are known as quasars. Astronomers and cosmologists have long
suspected that black holes are the source of their power. The concept of a black hole
derives from Einstein's general relativity. Recent evidence indicates the existence
of supermassive black holes at the centers of most galaxies (Richstone et al. 1998).
In the mass extinction case, searching for conclusive evidence forged some
of the most significant developments for each competing paradigm. Because those
extinction events happened at least tens of millions of years ago, it is a real
challenge to establish what really happened. In our second case study, astronomers
faced a similar challenge. Black holes are by definition invisible. Searching for
evidence that can support theories about the formation of galaxies and the universe
has been a central line of research concerning supermassive black holes. We apply
the same visualization method to the dynamics of citation patterns associated with
this topic. BBC2 broadcast a 50-min TV program on supermassive black holes in 2000.
The transcripts are available on the Internet.1
1 http://www.bbc.co.uk/science/horizon/massivebholes.shtml
According to Kormendy and Ho (2000), the AGN paradigm still had an outstanding
problem: there was no dynamical evidence that black holes exist. Searching for
conclusive evidence had become the Holy Grail of the AGN paradigm (Ho and
Kormendy 2000).
Kormendy and Richstone (1995) divided the search for black holes into three stages:
1. Look for dynamical evidence of central dark masses with high mass-to-light
ratios. A massive dark object is necessary but not sufficient evidence.
2. Narrow down the plausible explanations of the identified massive dark objects.
3. Derive the mass function and frequency of incidence of black holes in various
types of galaxies.
According to the 1995 review, the search was then near the end of stage
one (Kormendy and Richstone 1995). Progress in the black hole search comes from
improvements in analysis as well as in observations. In 1995, M31, M32, and NGC
3115 were regarded as strong black hole cases (Kormendy and Richstone 1995). By
2000, the most compelling case for a black hole in any galaxy was our own Milky Way
(Ho and Kormendy 2000). Richstone, Kormendy, and a dozen other astronomers
have worked on surveying supermassive black holes. They called themselves the
"Nuker team."
In 1997, the Nuker team announced the discovery of three black holes in three
normal galaxies. They suggested that nearly all galaxies may have supermassive black
holes that once powered quasars but are now dormant. Their conclusion was based
on a survey of 27 nearby galaxies carried out with NASA's Hubble Space Telescope
(HST) and ground-based telescopes in Hawaii.
Although this picture of active galaxies powered by supermassive black holes
is attractive, skeptics point out that such a concentration of mass can be
explained without invoking black holes. For example, they suggested that the
mass concentration in M87 could be a cluster of a billion or so dim stars, such as
neutron stars or white dwarfs, instead of a supermassive black hole.
Skeptics in this case are in the minority with their attacks on the AGN paradigm.
Even so, the enthusiasts are expected to provide far stronger evidence than they have
managed to date. So what would constitute the definitive evidence for the existence
of a black hole?
We apply the same visualization method to reveal the dynamics of citation patterns
associated with the AGN paradigm over the last two decades, aiming to identify
patterns in how the paradigm has evolved.
Collecting citation data was straightforward in this case. Since a substantial body
of the astronomy and astrophysics literature appears in journals indexed by the Web
of Science (WoS), its bibliographic data provide a good basis for the visualization
of this particular topic. Citation data were drawn with a complex query on black
holes and galaxies (see Table 6.4). The search retrieved 1,416
articles in English from the SCI Expanded database dated between 1981 and 2000,
each matching the query in at least one of the fields: title, abstract, or keywords.
Altogether these articles cited 58,315 publications written by 58,148 authors.
We conducted both author co-citation analysis (ACA) and document co-citation
analysis (DCA) in order to detect the dynamics of prevailing paradigms. We chose
30 citations as the entry threshold for ACA and 20 citations for DCA. Ultimately,
373 authors and 221 publications were identified. We then generated three models
for the periods 1981–1990, 1991–1995, and 1996–2000. The co-citation networks were
based on the entire range of citation data (1981–2000). The citation landscape in
each period reflects how often each article was cited within that particular sampling
window. In this book, we describe only the results of the document co-citation
analysis for this case study.
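To make the entry threshold concrete, here is a minimal sketch of document co-citation counting, assuming citation records are available as one reference list per citing article; the function and variable names are illustrative, not those of any actual tool used in this study.

```python
# Illustrative sketch: document co-citation counting with an entry threshold,
# as in the DCA above (threshold 20). `records` is assumed to be a list of
# reference lists, one per citing article; names are not from any real tool.
from collections import Counter
from itertools import combinations

def cocitation_network(records, min_citations=20):
    citation_counts = Counter(ref for refs in records for ref in set(refs))
    # Keep only references cited at least `min_citations` times.
    kept = {r for r, n in citation_counts.items() if n >= min_citations}
    cocite = Counter()
    for refs in records:
        for a, b in combinations(sorted(set(refs) & kept), 2):
            cocite[(a, b)] += 1  # the pair appears in the same reference list
    return kept, cocite
```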
In the document co-citation analysis (DCA), we visualized a co-citation network of
the 221 top-cited publications, and examined citation profiles in the context
of the co-citation structure. Articles with more than 20 citations were automatically
labeled on semi-transparent panels in the scene; these panels always face the
viewer. The landscape of the 1981–1990 period appears as a flat plane – this
landscape obviously pre-dated the existence of the majority of the 221 publications.
The visualization landscape of the 1991–1995 period shows an interesting
pattern – three distinct clusters are clearly visible in peripheral areas of the co-
citation network. M31 has been regarded as one of the strongest supportive cases
for the AGN paradigm. Alan Dressler and John Kormendy are known for their
work within the AGN paradigm, and one of the clusters included articles from both
of them regarding the evidence for supermassive black holes in M31. Another
cluster is more theoretically oriented, including articles from Martin Rees, a
pioneer of the theory that giant black holes may power quasars' energetic centers.
In addition, Rees's nearest neighbor in the document co-citation network is
Lynden-Bell's article. Lynden-Bell provided the most convincing argument for the
AGN paradigm, showing that nuclear reactions alone could not possibly power
quasars. The cluster at the far end includes Shakura's article on black holes in
binary systems, whereas the large area in the center of the co-citation network
remains unpopulated in this period.
A useful feature of a Pathfinder network is that the most cited articles tend to be located
Fig. 6.12 Supermassive black holes search between 1991 and 1995. The visualization of the
document co-citation network is based on co-citation data from 1981 through 2000. Three
paradigmatic clusters highlight new evidence (the cluster near to the front) as well as theoretical
origins of the AGN paradigm
in the central area. Once these highly cited articles arrive, they dominate the
overall citation profile of the entire co-citation network (see Fig. 6.12).
Citations in the central area remained very quiet, partly because some of the
documents located there were either newly published or not yet published. However,
the visualization of the third period, 1996–2000, clearly shows dramatic drops in the
overall citation profiles of the once citation-rich clusters in the peripheral areas.
Two of the three distinct clusters have hardly been cited, whereas citations at the
center of the network have become predominant (see Fig. 6.13). Pathfinder-based
citation and co-citation visualizations can thus outline the movement of the AGN
paradigm in terms of which articles researchers cite during a particular period of
time.
The AGN paradigm prevails, but conclusive evidence is still missing, and some
astronomers have suggested alternative explanations. For example, could the mass
concentration in M87 be due to a cluster of a billion or so dim stars, such as neutron
stars or white dwarfs, instead of a supermassive black hole? Opponents of the AGN
paradigm, such as Terlevich and colleagues, have made strong arguments in their
articles; some of these articles are located in a remote area towards the far end of
the co-citation network. In order to study how alternative theories have competed
Fig. 6.13 The visualization of the final period of the AGN case study (1996–2000). The cluster
near the front has almost vanished and the cluster to the right has also shrunk considerably.
In contrast, citations of articles in the center of the co-citation network rocketed, led by two
evidence articles published in Nature: one about NGC-4258 and the other about MCG-6-30-15
with the AGN paradigm directly, it is necessary to re-focus the visualization so that
both the AGN paradigm and its competitors fall within the scope of the initial
citation data. The current AGN visualization is a first step towards understanding
the fundamental works in this paradigm, because we used the terms black holes
and galaxies explicitly in data sampling. In the mass extinction case, gradualism
and catastrophism debated for more than a decade, from the first conception of the
impact theory to the identification of the Chicxulub crater. In the supermassive
black hole case, the AGN paradigm is so strong that its counterparts were likely
to be under-represented in the initial visualizations. This observation highlights an
issue concerning the use of such tools: the user may want to start with a simple
visualization, learn more about a set of related topics, and gradually expand the
coverage of the visualization.
In Fig. 6.13, the visualization of the latest period (1996–2000), the predominant
positions of two 1988 evidence articles in the front cluster have been replaced by
two 1995 evidence articles. Makoto Miyoshi's team at the National Astronomical
Observatory in Japan found evidence supporting the AGN paradigm based on their
study of the nearby galaxy NGC-4258. They used a network of radio telescopes
Fig. 6.14 The rises and falls of citation profiles of 221 articles across three periods of the AGN
paradigm
called the Very Long Baseline Array, stretching from Hawaii to Puerto Rico. A
few highly cited articles in this period are located in the center of the co-citation
network, including a review article and a demographic article on supermassive black
holes. According to the three-stage agenda for the study of supermassive black holes
(Kormendy and Richstone 1995), a demographic article corresponds to the
third stage. The 1998 article by Magorrian and his collaborators is located between
the 1995 agenda article in the center and Rees's article to the right.
It is clear from Fig. 6.14 that the peaks of citation have moved from one period
to another. There was no discernible paradigm in the first period (1981–1990); in
other words, the core literature on this topic is no more than 10 years old. Three
strands of articles appeared in the second period, representing the first generation of
theories and evidence. The fall of two groups of citations in the third period, and the
rise of a new landmark article in the center of the co-citation network, indicate
significant changes in the field. Visualizing such changes in the scientific literature
may provide new insights into scientific frontiers.
6.5 Summary
In this chapter, we have presented two case studies. Our visualizations have shown
the potential of the citation-based approach to knowledge discovery and to tracking
scientific paradigms. We do not expect such visualizations to replace review
articles and surveys carefully prepared by domain experts. Instead, such visualizations,
if done properly, may lead to a more sensible literature search methodology than
the currently fashionable but somewhat piecemeal retrieval-oriented approaches. By
taking into account the values perceived by those who have domain expertise, our
generic approach has shown the potential of such visualizations as an alternative
"camera" for taking snapshots of scientific frontiers.
We have drawn a great deal of valuable background information from Kormendy
and Richstone's article Inward Bound (Kormendy and Richstone 1995). It was this
article that dominated the visualization landscape of the latest period. Kuhn later
suggested that specialization was more common than outright replacement: instead
of killing off a traditional rival line of research immediately, a new branch of
research may run in parallel. The search for supermassive black holes is rapidly
advancing, and the media is full of news of the latest discoveries. Indeed, the latest
news announced at the winter 2001 American Astronomical Society meeting suggested
that HST and the Chandra X-ray Observatory had found evidence for an event horizon
in Cygnus X-1, the first object identified as a black hole candidate. Scientific
visualism is increasingly finding its way into modern science.
There are several possible research avenues for further developing this generic
approach to visualizing competing paradigms, for example:
1. Apply this approach to classic paradigm shifts identified by Kuhn and others.
2. Refine the philosophical and sociological foundations of this approach.
3. Combine citation analysis with other modeling and analysis techniques, such as
automatic citation context indexing and latent semantic indexing (LSI), so as to
provide a more balanced view of scientific frontiers.
4. Extend the scope of applications to a wider range of disciplines.
5. Track the development of the two case studies with follow-up studies.
6. Track the development of scientific frontiers, working closely with domain
experts to evaluate and improve science mapping.
In the next chapter, we continue to explore issues concerning mapping scientific
frontiers, with special focus on the discovery of latent domain knowledge. How do
scientists detect new and significant developments in knowledge? What does it take
for a visualization metaphor to capture and predict the growth of knowledge? How
do we match the visualized intellectual structure to what scientists have in mind?
References
Alvarez W (1997) T. rex and the crater of doom. Vintage Books, New York
Alvarez LW, Alvarez W, Asaro F, Michel HV (1980) Extraterrestrial cause for the Cretaceous-
Tertiary extinction. Science 208(4448):1095–1098
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. Computer
34(3):65–71
Erwin DH (1994) The Permo-Triassic extinction. Nature 367:231–236
Hildebrand AR, Penfield GT, Kring DA, Pilkington M, Camargo ZA, Jacobsen SB et al (1991)
Chicxulub crater: a possible Cretaceous-Tertiary boundary impact crater on the Yucatan
Peninsula, Mexico. Geology 19(9):867–871
Hjorland B (1997) Information seeking and subject representation: an activity-theoretical approach
to information science. Greenwood Press, Westport
Hjorland B, Albrechtsen H (1995) Toward a new horizon in information science: domain analysis.
J Am Soc Inf Sci 46(6):400–425
Ho LC, Kormendy J (2000) Supermassive black holes in active galactic nuclei. In: Murdin P
(ed) Encyclopedia of astronomy and astrophysics. Institute of Physics Publishing, Bristol.
http://eaa.crcpress.com/default.asp?action=summary&articleId=2365
Keller G (1993) Is there evidence for Cretaceous-Tertiary boundary age deep-water deposits in the
Caribbean and Gulf of Mexico? Geology 21(9):776–780
Knoll AH, Bambach RK, Canfield DE, Grotzinger JP (1996) Comparative earth history and Late
Permian mass extinction. Science 273(5274):452–457
Kormendy J, Ho LC (2000) Supermassive black holes in inactive galaxies. In: Encyclopedia of
astronomy and astrophysics. Institute of Physics Publishing, Bristol
Kormendy J, Richstone D (1995) Inward bound: the search for supermassive black-holes in galactic
nuclei. Annu Rev Astron Astrophys 33:581–624
Patterson C, Smith AB (1987) Is the periodicity of extinctions a taxonomic artifact? Nature
330(6145):248–251
Renne P, Zhang Z, Richards MA, Black MT, Basu A (1995) Synchrony and causal relations be-
tween Permian-Triassic boundary crises and Siberian flood volcanism. Science 269:1413–1416
Richstone D, Ajhar EA, Bender R, Bower G, Dressler A, Faber SM et al (1998) Supermassive
black holes and the evolution of galaxies. Nature 395(6701):A14–A19
Saracevic T (1975) Relevance: a review of and a framework for the thinking on the notion in
information science. J Am Soc Inf Sci 26:321–343
Signor PW, Lipps JH (1982) Sampling bias, gradual extinction patterns, and catastrophes in the
fossil record. Geol Soc Am Spec Pap 190:291–296
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Small H (1994) A SCI-MAP case study: building a map of AIDS research. Scientometrics
30(1):229–241
Wignall PB, Twitchett RJ (1996) Oceanic anoxia and the end Permian mass extinction. Science
272:1155–1158
Wilson P (1993) Communication efficiency in research and development. J Am Soc Inf Sci
44:376–382
Chapter 7
Tracking Latent Domain Knowledge
Knowledge is power.
Francis Bacon (1561–1626)
Fig. 7.1 An evolving landscape of research pertinent to BSE and CJD. The next hot topic may
emerge in an area that is currently not populated
There may be many reasons why a particular line of research falls outside the
body of mainstream domain knowledge and becomes latent to a knowledge
domain. In a cross-disciplinary research program, researchers face an entirely
unfamiliar scientific discipline, and tracking the latest developments in a different
discipline can be rather challenging. One example of such problems is the cross-
disciplinary use of Pathfinder networks, a structural and procedural modeling
method developed by cognitive psychologists in the 1980s (Schvaneveldt 1990;
Schvaneveldt et al. 1989). Pathfinder is a generic tool that has been adopted by
several fields of study, including some in adaptations quite different from its original
cognitive applications. For example, we have adapted Pathfinder network scaling
as an integral component of our generic structuring and visualization framework
(Chen 1999a, b; Chen and Paul 2001). It is a challenging task to track down how
applications of Pathfinder networks have evolved over the past two decades across
a number of apparently unconnected disciplines.
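To give a rough sense of what Pathfinder network scaling does, the sketch below implements the PFNET(q = n − 1, r = ∞) variant often used in such visualizations: a direct link survives only if no indirect path has a smaller maximum link weight. This is a simplified reading of the method under the assumption of a complete symmetric distance matrix, not Schvaneveldt's original code.

```python
# Illustrative sketch of Pathfinder network scaling, PFNET(q = n-1, r = inf).
# `dist` is assumed to be a symmetric matrix of link weights (smaller means
# more similar), with zeros on the diagonal; names are illustrative.
import numpy as np

def pathfinder(dist: np.ndarray) -> np.ndarray:
    n = dist.shape[0]
    # With r = infinity, the weight of a path is its single largest link;
    # minimize that quantity over all paths (a minimax distance), using a
    # Floyd-Warshall-style relaxation.
    minimax = dist.copy()
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i, k], minimax[k, j])
                if via_k < minimax[i, j]:
                    minimax[i, j] = via_k
    # A direct link survives only if no indirect path beats it.
    return (dist <= minimax) & (dist > 0)
```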
Another type of latent domain knowledge can be explained in terms of
scientific paradigms. Thomas Kuhn (1962) described the development of science as
a succession of periods of normal science punctuated by scientific revolutions.
In Chap. 5, we mentioned that Donald Swanson was the recipient of the 2000 Award
of Merit from ASIS&T for his work on undiscovered public knowledge. In his
Award of Merit acceptance speech, Swanson (2001) stressed the enormous and
fast-growing gap between the entire body of recorded knowledge and the limited
human capacity to make sense of it. He also pointed to knowledge fragmentation as
a consequence of inadequate cross-specialty communication, as specialties
are divided into ever more narrowly focused subspecialties in response to the
information explosion.
Swanson has been pursuing his paradigm since 1986 when he began to realize
that there were two sizeable but bibliographically unrelated biomedical literatures:
one on the circulatory effects of dietary fish oil and the other on the peripheral
circulatory disorder – Raynaud’s disease. Prior to Swanson’s research, no medical
researcher had noticed this connection, and the indexing of these two literatures was
unlikely to facilitate the discovery of any such connections.
Swanson’s paradigm focuses on the possibility that information in one specialty
might be of value in another without anyone becoming aware of the fact. Specialized
literatures that do not intercommunicate by citing one another may nonetheless have
many implicit textual interconnections based on meaning. The number of latent,
unintended or implicit, connections within the literature of science may greatly
exceed the number of explicit connections.
Swanson and Smalheiser (1997) defined noninteractive literatures as two lit-
eratures that have not been connected by a significant citation tie. In other
words, scientists in both camps have not recognized the existence of a meaningful
connection between them.
Table 7.1 Seven discoveries of undiscovered public knowledge, all published in the biomedical
literature

Year  Example                           A (potential causal factor)           C (disease)
1986  Swanson (1986a)                   Fish oil                              Raynaud's syndrome
1988  Swanson (1988)                    Magnesium                             Migraine
1990  Swanson (1990)                    Somatomedin C                         Arginine
1994  Smalheiser and Swanson (1994)     Magnesium deficiency                  Neurologic disease
1996  Smalheiser and Swanson (1996a)    Indomethacin                          Alzheimer's disease
1996  Smalheiser and Swanson (1996b)    Estrogen                              Alzheimer's disease
1998  Smalheiser and Swanson (1998)     Calcium-independent phospholipase A2  Schizophrenia
As noted above, Swanson began pursuing this paradigm in 1986, when he found two
sizeable biomedical literatures: one on the circulatory effects of dietary fish oil and
the other on the peripheral circulatory disorder Raynaud's disease. Swanson noticed
that these two literatures were not bibliographically related: no one from one camp
cited works in the other (Swanson 1986a, b). Meanwhile, he was pondering a
question that apparently no one had asked before: is there a connection between
dietary fish oil and Raynaud's disease?
Swanson's approach can be represented in a generic form. Given two premises
that A causes B (A → B) and that B causes C (B → C), the question to ask is whether
A causes C (A → C). If the answer is positive, the causal relation has the transitive
property. In the biological world, such transitive properties cannot be taken for
granted, so scientists must establish them explicitly. Swanson suggested that once
information scientists identify such candidate connections, they should refer them
to domain experts for validation (Swanson 2001).
Swanson’s approach focuses on the discovery of such hypotheses from the vast
amount of implicit, or latent, connections. Swanson and Smalheiser (1997) defined
the concept of non-interactive literatures. If two literatures have never been cited
together at a notable level, they are non-interactive – scientists have not considered
both literatures together. In the past 15 years, Swanson identified several missing
links of the same pattern, notably migraine and magnesium (Swanson 1988),
and arginine and somatomedin C (Swanson 1990). Since 1994, the collaboration
between neurologist Neil Smalheiser and Swanson led to a few more such cases
(Smalheiser and Swanson 1994, 1996). Table 7.1 is a summary of various case
studies. They also made their software Arrowsmith available on the Internet
(Swanson 1999).
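As a minimal illustration of the non-interactivity test, the sketch below counts direct citation ties between two literatures; the representation of citation records and the zero-tie default are assumptions for illustration only.

```python
# Illustrative sketch: are two literatures "non-interactive" in Swanson and
# Smalheiser's sense, i.e. connected by no significant citation tie?
# `cites` maps each article to the set of articles it cites; all names and
# the default threshold are assumptions.
def noninteractive(lit_a: set[str], lit_c: set[str],
                   cites: dict[str, set[str]], max_ties: int = 0) -> bool:
    # Count citations crossing the two literatures in either direction.
    ties = sum(1 for a in lit_a for ref in cites.get(a, ()) if ref in lit_c)
    ties += sum(1 for c in lit_c for ref in cites.get(c, ()) if ref in lit_a)
    return ties <= max_ties
```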
Swanson’s approach relies on the identification of the two premises A ! B and
B ! C. In a large knowledge domain, it is crucial for analysts to have sufficient
domain knowledge. Otherwise, to find two such premises is like searching for nee-
dles in a haystack. Knowledge domain visualization (KDViz) can narrow down the
search space and increase the chance of finding a fruitful line of scientific inquiry.
Fig. 7.2 A Venn diagram showing potential links between bibliographically unconnected
literatures (Figure 1 reprinted from Swanson and Smalheiser (1997))
In parallel, Swanson also published his work in the library and information
science literature, notably Swanson (1986a, b, 1987, 1988, 1990). The Venn
diagram in Fig. 7.2, adapted from Swanson and Smalheiser (1997), shows sets
of articles, or bodies of literature: the target literature A and the source literature C.
Set A and set C have no articles in common, but they are linked through the
intermediate literatures B1, B2, B3, and B4. Undiscovered links between A and C
may be found through these intermediate literatures: there may exist an intermediate
literature Bi such that a transitive relation can be established from A → Bi and
Bi → C.
Figure 7.3 shows a schematic diagram of title-word pathways from a source
literature on the right (C terms), through intermediate title words (B terms), to the
title words of promising target literatures on the left (A terms) (Swanson and
Smalheiser 1997). A ranking algorithm then ranks the discovered A-terms: the more
B-pathways an A-term has, the higher it ranks. Term A3, magnesium, is the
highest-ranked title word, with a total of seven pathways from B-terms. In this way,
the pathway from migraine to magnesium appears to be the most promising.
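The ranking rule lends itself to a very small sketch; the data structure below, mapping each B-term to the A-terms it reaches, is an illustrative assumption rather than Swanson's actual code.

```python
# Illustrative sketch of the ranking rule behind Procedure I: a candidate
# A-term scores one point for every distinct B-term that links it back to
# the source literature C. All names here are assumptions.
from collections import Counter

def rank_a_terms(b_to_a: dict[str, set[str]]) -> list[tuple[str, int]]:
    pathways = Counter()
    for a_terms in b_to_a.values():
        for a_term in a_terms:
            pathways[a_term] += 1  # one more B-pathway for this A-term
    return pathways.most_common()

# With seven B-terms all pointing at "magnesium", it would top the ranking,
# mirroring the seven pathways of term A3 in Fig. 7.3.
```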
Swanson called this algorithm Procedure I. He also developed what he called
Procedure II, in which titles from literatures A and C are downloaded first
in order to find the words and phrases the two literatures have in common. These
common words and phrases form the so-called B-list. An output display
is then produced to help the human user compare A-titles and C-titles against the
B-terms.
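The core step of Procedure II can likewise be sketched in a few lines; the function below is a simplified illustration under the assumption that titles are available as plain strings, and it omits the phrase extraction and article counting a real system would need.

```python
# Illustrative sketch of Procedure II's core step: the B-list is formed from
# words that the titles of literatures A and C have in common. A real system
# would also report the AB and BC article counts shown in Fig. 7.4; every
# name here is an assumption.
def b_list(a_titles, c_titles, stopwords=()):
    def vocabulary(titles):
        return {word for title in titles
                for word in title.lower().split()
                if word not in stopwords}
    return vocabulary(a_titles) & vocabulary(c_titles)
```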
Figure 7.4 shows B-terms selected by Swanson’s Procedure II for magnesium
and migraine, and for fish-oil and Raynaud’s disease. The two numbers in front of
Fig. 7.3 A schematic diagram showing the most promising pathway linking migraine in the
source literature to magnesium in the target literatures (C to A3) (Courtesy of
http://kiwi.uchicago.edu/)
B-terms are the number of articles within the BC and AB intersections, respectively.
The asterisks mark entries identified in the original studies (Swanson 1986a, 1988).
Table 7.2 lists B-term entries selected by Procedure II.
Fig. 7.4 A schematic flowchart of Swanson’s Procedure II (Figure 4 reprinted from Swanson and
Smalheiser (1997), available at http://kiwi.uchicago.edu/webwork/fig4.xbm)
Fig. 7.5 Mainstream domain knowledge is typically high in both relevance and citation, whereas
latent domain knowledge can be characterized as high in relevance but low in citation
Fig. 7.6 The strategy of visualizing latent domain knowledge. The global context is derived from
co-citation networks of highly cited works. An “exit” landmark is chosen from the global context
to serve as the seeding article in the process of domain expansion. The expanded domain consists
of articles connecting to the seeding article by citation chains of no more than two citation links.
Latent domain knowledge is represented through a citation network of these articles
The domain expansion process shifts our field of view from mainstream domain
knowledge to latent domain knowledge. A key component in this process is the
selection of a so-called "exit" landmark from the citation landscape. This "exit"
landmark plays a pivotal role in tracking latent knowledge by "pulling" highly
relevant but relatively rarely cited documents into the scene. The "exit" landmark
is selected based on
both structural and topical characteristics. Structurally important documents in
the citation landscape include branching points, from which one can reach more
documents along citation paths preserved by the network. Topically important
documents are the ones closely related to the subject in question. Ideally, a good
"exit" landmark should be a classic work in its field that links by citation to a
cluster of closely related documents. We will explain in more detail through the
case studies how we choose "exit" landmarks. Once an "exit" landmark is
chosen from the citation landscape, the four-step procedure can be applied again to
all the documents within a citation chain of up to two citation links. The resultant
citation network represents the latent domain knowledge. Finally, we embed this
local structure back into the global context by providing a reference from the “exit”
landmark in the global context to the latent knowledge structure.
In this chapter, we describe how we applied this approach to three case studies,
namely, Swanson’s work, cross-domain applications of Pathfinder network scaling
techniques, and the perceived connection between BSE and vCJD in contemporary
literature. We use the Web of Science, a Web-based interface to citation databases
compiled by the Institute for Scientific Information (ISI). We start with a search in
the Web of Science using some broad search terms in order to generate a global
context for subsequent visualization. For example, in the Pathfinder case, we chose
to use search terms such as knowledge discovery, knowledge acquisition, knowledge
modeling, and Pathfinder. Once the global context is visualized, it is straightforward
to identify an “exit” landmark. In the Pathfinder case, a classic citation of Pathfinder
networks is chosen as an “exit” landmark. This “exit” landmark article serves as the
seed in a citation search within the Web of Science. The citing space of the seeding
article s contains articles that either cite the seeding article directly or cite an article
that in turn cites the seeding article:
$$C_{\mathrm{OneStep}}(s) = \{\, c \mid c \rightarrow s \,\}$$
$$C_{\mathrm{TwoStep}}(s) = \{\, c \mid \exists\, c' : c \rightarrow c' \wedge c' \rightarrow s \,\}$$
$$\mathrm{CitingSpace}_{\mathrm{Theme}}(s) = C_{\mathrm{OneStep}}(s) \cup C_{\mathrm{TwoStep}}(s)$$
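These definitions translate directly into a few lines of code. The sketch below assumes citation records are available as a mapping from each article to the set of articles it cites; the names are illustrative.

```python
# Illustrative sketch of the citing-space definitions above. `cites` maps
# each article to the set of articles it cites.
def citing_space(cites: dict[str, set[str]], s: str) -> set[str]:
    # One-step: articles that cite the seeding article s directly.
    one_step = {c for c, refs in cites.items() if s in refs}
    # Two-step: articles that cite at least one article in the one-step set.
    two_step = {c for c, refs in cites.items() if refs & one_step}
    return one_step | two_step
```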
Such citing spaces may contain articles beyond the boundary of mainstream
domain knowledge. One can apply this method repeatedly by identifying another
"exit" landmark. Articles connected to the landmark by citation chains of up to two
steps are gathered to represent latent domain knowledge. By varying how citing
articles are selected, we can visualize latent knowledge structures with reference to
well-established and frequently cited knowledge structures. In the following case
studies, we apply this same spiral methodology to illustrate our approach.
7.3 Swanson’s Impact 239
The following example is based on citation records retrieved from the Web of
Science as of 17 April 2001. First, a search was conducted across all databases
between 1981 and 2001, the entire coverage available in the version we accessed.
This search aimed to locate as many of Swanson's articles as possible within these
citation databases; the AUTHOR field was set to "Swanson DR" and the ADDRESS
field to "Chicago". The search returned 30 records, and these 30 articles served as
a seeding set. In the second step, we expanded this initial set by including articles
that cited at least one article in the seeding set. All the citations from the expanded
set of articles form the population for the subsequent document co-citation analysis.
We applied a threshold of 65 citations to select the top slice of this all-citation set.
A total of 246 articles met this criterion and were analyzed to form a series of
document co-citation maps as snapshots of the impact of Swanson's work.
Figure 7.7 shows an overview of the document co-citation map. The entire
network is divided into three focused areas, which are colored by factor loadings.
Fig. 7.7 An overview of the document co-citation map. Lit-up articles in the scene are Swanson’s
publications. Four of Swanson’s articles are embedded in the largest branch – information science,
including information retrieval and citation indexing. A dozen of his articles are gathered in the
green specialty – the second largest grouping, ranging from scientometrics and neurology to artificial
intelligence. The third largest branch – headache and magnesium – contains only one of Swanson's
articles
Figure 7.11 shows the latent knowledge structure derived from the citing space of
the "exit" landmark article. This structure is not overshadowed by the high citation
counts of mainstream articles.
7.4 Pathfinder Networks’ Impact 243
Fig. 7.10 A landscape view of the Pathfinder case. Applications of Pathfinder networks are found
in a broader context of knowledge management technologies, such as knowledge acquisition,
knowledge discovery, and artificial intelligence. A majority of Pathfinder network users are
cognitive psychologists
Table 7.3 Leading articles in the three largest specialties, ranked by the strength of factor
loading (specialties: F1 Pathfinder networks; F2 Artificial intelligence; F3 Expert systems)

Publication                                                F1     F2     F3
Elstein AS, 1978, Med Problem Solving 0.872
Card SK, 1983, Psychol Human Comput 0.872
Johnsonlaird PN, 1983, Mental Models 0.858
Nisbett RE, 1977, Psychol Rev, v84, p231 0.855
Glaser R, 1988, Nature Expertise, PR15 0.850
Gammack JG, 1985, Res Dev Expert Syste, p105 0.841
Chi MTH, 1981, Cognitive Sci, v5, p121 0.841
Cooke NM, 1986, P IEEE, v74, p1422 0.836
Cooke NM, 1987, Int J Man Mach Stud, v26, p533 0.830
Anderson JR, 1982, Psychol Rev, v89, p369 0.814
Anderson JR, 1987, Psychol Rev, v94, p192 0.813
Mckeithen KB, 1981, Cognitive Psychol, v13, p307 0.811
Chi MTH, 1989, Cognitive Sci, v13, p145 0.810
Anderson JR, 1983, Architecture Cogniti 0.807
Cordingley ES, 1989, Knowledge Elicitatio, p89 0.804
Cooke NJ, 1994, Int J Hum-comput St, v41, p801 0.798
Hoffman RR, 1987, Ai Mag, v8, p53 0.797 0.528
Chase WG, 1973, Cognitive Psychol, v4, p55 0.794
Klein GA, 1989, IEEE T Syst Man Cyb, v19, p462 0.792 0.508
Schvaneveldt RW, 1985, Int J Man Mach Stud, v23, p699 0.789 0.532
Marcus S, 1988, Automating Knowledge 0.951
Musen MA, 1987, Int J Man Mach Stud, v26, p105 0.949
Bennett JS, 1985, J Automated Reasonin, v1, p49 0.947
Clancey WJ, 1989, Mach Learn, v4, p285 0.942
Newell A, 1982, Artif Intell, v18, p87 0.942
Musen MA, 1989, Knowl Acquis, v1, p73 0.941
Clancey WJ, 1985, Artif Intell, v27, p289 0.940
Ford KM, 1993, Int J Intell Syst, v8, p9 0.933
Kahn G, 1985, 9TH P Int Joint C Ar, p581 0.933
Musen MA, 1989, Automated Generation 0.930
Neches R, 1991, Ai Mag, v12, p36 0.929
Marcus S, 1989, Artif Intell, v39, p1 0.926
Chandrasekaran B, 1986, IEEE Expert, v1, p23 0.925
Lenat DB, 1990, Building Large Knowl 0.923
Chandrasekaran B, 1983, Ai Mag, v4, p9 0.921
Davis R, 1982, Knowledge Based Syst 0.920
Davis R, 1979, Artif Intell, v12, p121 0.918
Gruber TR, 1987, Int J Man Mach Stud, v26, p143 0.914
Shadbolt N, 1990, Current Trends Knowl, p313 0.912
Dekleer J, 1984, Artif Intell, v24, p7 0.910
Holland JH, 1986, Induction Processes 0.771
Oleary DE, 1987, Decision Sci, v18, p468 0.713
Waterman DA, 1986, Guide Expert Systems 0.526 0.712
(continued)
7.4 Pathfinder Networks’ Impact 245
Fig. 7.11 This citation map shows that the most prolific themes of Pathfinder network applications
include measuring the structure of expertise, eliciting knowledge, measuring the organization of
memory, and comparing mental models. No threshold is imposed
Table 7.4 Leading articles in the three most prominent specialties, ranked by the strength of
factor loading (specialties: F1 Pathfinder, cognitive psychology; F2 Educational psychology;
F3 Knowledge acquisition)

Publication                                               F1     F2     F3
Schvaneveldt RW, 1985, Int J Man Mach Stud, v23, p699     0.916
Anderson JR, 1983, Architecture Cogniti                   0.906
Reitman JS, 1980, Cognitive Psychol, v12, p554            0.874
Friendly ML, 1977, Cognitive Psychol, v9, p188            0.861
Mckeithen KB, 1981, Cognitive Psychol, v13, p307          0.848
Ericsson KA, 1984, Protocol Anal                          0.845
Cooke NM, 1987, Int J Man Mach Stud, v26, p533            0.837
Chi MTH, 1981, Cognitive Sci, v5, p121                    0.825
Kruskal JB, 1977, Statistical Methods                     0.822
Cooke NM, 1986, P IEEE, v74, p1422                        0.822
Hayesroth F, 1983, Building Expert Syst                   0.807
Murphy GL, 1984, J Exp Psychol Learn, v10, p144           0.806
Roskehofstrand RJ, 1986, Ergonomics, v29, p1301           0.803
Anderson JR, 1982, Psychol Rev, v89, p369                 0.801
Cooke NJ, 1988, Int J Man Mach Stud, v29, p407            0.800  0.514
Tversky A, 1977, Psychol Rev, v84, p327                   0.798
Kelly GA, 1955, Psychol Personal Con                      0.790
Butler KA, 1986, Artificial Intellige                     0.789
Collins AM, 1969, J Verb Learn Verb Be, v8, p240          0.784
Schvaneveldt RW, 1985, MCCS859 New Mex Stat               0.777
Goldsmith TE, 1991, J Educ Psychol, v83, p88                     0.840
Gonzalvo P, 1994, J Educ Psychol, v86, p601                      0.789
Acton WH, 1994, J Educ Psychol, v86, p303                        0.777
Gomez RL, 1996, J Educ Psychol, v88, p572                        0.754
Johnson PJ, 1994, J Educ Psychol, v86, p617                      0.747
Novak JD, 1990, J Res Sci Teach, v27, p937                       0.747
Novak JD, 1984, Learning Learn                                   0.744
(continued)
7.4 Pathfinder Networks’ Impact 247
Fig. 7.12 This branch represents a new paradigm of incorporating Pathfinder networks into
Generalized Similarity Analysis (GSA), a generic framework for structuring and visualization, and
its applications especially in strengthening traditional citation analysis
7.5 BSE and vCJD

BSE was first found in England in 1986, when a sponge-like malformation was
found in the brain tissue of affected cattle. It was identified as a new prion disease,
a new transmissible spongiform encephalopathy (TSE). The BSE epidemic in Britain
reached its peak in 1992 and has since steadily declined. CJD was first described in
the 1920s by two German neurologists.
Fig. 7.13 Schvaneveldtl’s “exit” landmark in the landscape of the thematic visualization
While no definitive link between prion disease in cattle and vCJD in humans
has been proven, the conditions are so similar that most scientists are convinced that
infection by a BSE prion leads to vCJD in humans. The emergence of vCJD came
after the biggest ever epidemic of BSE in cattle, and the fact that the epidemic was
in the UK, where most vCJD victims lived, added to the evidence of a link. A brief
timeline of relevant events is shown in Table 7.6.
The British government had assured the public that beef was safe, but in 1996 it
announced that there was possibly a link between BSE and vCJD. The central
question in this case study is what the scientific literature tells us about this possible
link.
First, we generated a mainstream-driven thematic landscape of the topic of BSE
and CJD by searching the Web of Science with the term "BSE or CJD" (see
Fig. 7.14). The strongest specialty, Prion Protein, is colored in red; the BSE specialty
is in green; and the CJD specialty is in blue. GSS is in purple, next to the prion
protein specialty. Notably, the very light color of the vCJD specialty indicates
that this is an area where other specialties overlap.
Fig. 7.14 An overview of 379 articles in the mainstream of BSE and vCJD research
In the Prion specialty, Prusiner's 1982 article in Science and Gajdusek's 1966
article in Nature are located next to each other. Gajdusek received the 1976 Nobel
Prize for his work on kuru, a prion-related brain disease. The Prion specialty also
includes radiation biologist Tikvah Alper's 1967 article in Nature. Alper studied
scrapie in sheep and found that brain tissue remained infectious even after she
subjected it to radiation that would destroy any DNA or RNA. In 1967, J. S. Griffith
of Bedford College, London, suggested in an article published in Nature that an
infectious agent lacking nucleic acid could cause disease. Griffith suggested in
a separate paper that perhaps a protein, which would usually prefer one folding
pattern, could somehow misfold and then catalyze other proteins to do so. Such an
Fig. 7.15 A year-by-year animation shows the growing impact of research in the connections
between BSE and vCJD. Top-left: 1991–1993; Top-right: 1994–1996; Bottom-left: 1997–1999;
Bottom-right: 2000–2001
idea seemed to threaten the very foundations of molecular biology, which held that
nucleic acids were the only way to transmit information from one generation to the
next.
Fifteen years later, in 1982, Prusiner followed up this idea of self-replication
proposed in the 1960s and described the “proteinaceous infectious particles” as the
cause of scrapie in sheep and hamsters. He suggested that scrapie and a collection of
other wasting brain diseases, some inherited, some infectious, and some sporadic,
were all due to a common process: a misfolded protein that propagates and kills
brain cells.
Prusiner and his colleagues reported in Science in 1982 that they had found an
unusual protein in the brains of scrapie-infected hamsters that did not seem to be
present in healthy animals. Their article, entitled "Novel proteinaceous infectious
particles cause scrapie," had been cited 941 times by March 2001. A year later,
they identified the protein and called it prion protein (PrP). Prusiner led a series of
experiments demonstrating that PrP is actually present in healthy animals too, but in
a different form from the one found in diseased brains. The studies also showed that
mice lacking PrP are resistant to prion diseases. Taken together, the results have
convinced many scientists that the protein is indeed the agent behind CJD, scrapie,
mad cow disease, and others. Figure 7.15 shows four frames from an animation
sequence of the year-by-year citation growth. Figure 7.16 shows the four most cited
articles over the period 1995–2000, listed below.
• Will, R. G., Ironside, J. W., Zeidler, M., Cousens, S. N., Estibeiro, K.,
Alperovitch, A., Poser, S., Pocchiari, M., Hofman, A., & Smith, P. G. (1996).
A new variant of Creutzfeldt-Jakob disease in the UK. Lancet, 347, 921–925.
Fig. 7.16 Articles cited more than 50 times during this period are labeled. Articles labeled 1–3
directly address the BSE-CJD connection. Article 4 is Prusiner's original article on prions, which
has broad implications for brain diseases in sheep, cattle, and humans
• Collinge, J., Sidle, K., Meads, J., Ironside, J., & Hill, A. (1996). Molecular
analysis of prion strain variation and the aetiology of ‘new variant’ CJD. Nature,
383, 685–691.
• Bruce, M. E., Will, R. G., Ironside, J. W., McConnell, I., Drummond, D., Suttie,
A., McCardle, L., Chree, A., Hope, J., Birkett, C., Cousens, S., Fraser, H., &
Bostock, C. J. (1997). Transmissions to mice indicate that ‘new variant’ CJD is
caused by the BSE agent. Nature, 389(6650), 498–501.
• Prusiner, S. B. (1982). Novel Proteinaceous Infectious Particles Cause Scrapie.
Science, 216(4542), 136–144.
Research by Moira Bruce at the Neuropathogenesis Unit in Edinburgh has
confirmed that sheep can produce a range of prion particles, but finding the one
that causes BSE has so far eluded researchers. There is no evidence that people
can catch BSE directly from eating sheep, but most research has focused on cattle,
so the possibility cannot be ruled out. Such a discovery would also devastate
consumer confidence.
According to Bruce et al. “Twenty cases of a clinically and pathologically
atypical form of Creutzfeldt-Jakob disease (CJD), referred to as ‘new variant’ CJD
(vCJD), have been recognized in unusually young people in the United Kingdom,
and a further case has been reported in France. This has raised serious concerns that
BSE may have spread to humans, putatively by dietary exposure.”
The mainstream view on BSE has focused on the food chain: cows got BSE by
eating feed made from sheep infected with scrapie, and, similarly, humans get vCJD
by eating BSE-infected beef. However, Mark Purdey, a British organic dairy farmer,
believed that an imbalance of manganese and copper in the brain is the real cause of
BSE and vCJD (Stourton 2001). He studied the environment in areas known to have
spongiform diseases, such as Colorado in the United States, Iceland, Italy, and
Slovakia, and found a high level of manganese and low levels of copper in all of
them.
Purdey’s research on the manganese-copper hypothesis shows the sign of latent
domain knowledge. He has published in scientific journals, but they are not highly
cited by other researchers. We need to find a gateway from which we can expand
the global landscape of mainstream research in BSE and vCJD and place Purdey’s
research into the big picture of this issue. Recall that we need to find an “exit”
landmark in the global landscape to conduct the domain expansion, but none of
Purdey’s publications was featured in the scene. To solve this problem, we need to
find someone who is active in the manganese-copper paradigm and also included in
the mainstream visualization view.
David R. Brown, a biochemist at Cambridge University, is among the scientists
who did cite Purdey's publications, which makes him a good candidate for an "exit"
landmark. On the one hand, Brown is interested in the role of the manganese-copper
balance in prion diseases (Brown et al. 2000) and has cited Purdey's articles; on
the other hand, he is interested in Prusiner's prion theory and has published about 50
articles on prion diseases. Indeed, two of his articles are featured in the mainstream
visualization of this case study. We chose his 1997 article published in
Experimental Neurology as the "exit" landmark. Because of the relatively low
citations of Purdey's articles, conventional citation analysis is unlikely to take
them into account. The predominant articles in the expanded cluster all address the
possible link between BSE and vCJD, which suggests how Purdey's articles might
fit into the mainstream domain knowledge.
The moral of this story is that citation networks can pull in articles that would
be excluded by conventional citation analysis, so that researchers can explore the
development of a knowledge domain across a wider variety of works. This approach
provides a promising tool for finding weak connections in the scientific literature
that would otherwise be overshadowed by the cream of the crop. The BSE case
study has demonstrated that our approach can be successfully applied to find
connections that would otherwise be obscured: Purdey's theory feeds into the
mainstream research on BSE and CJD through Brown and his group.
7.6 Summary
References
Brown DR, Hafiz F, Glasssmith LL, Wong BS, Jones IM, Clive C et al (2000) Consequences of
manganese replacement of copper for prion protein function and proteinase resistance. EMBO
J 19(6):1180–1186
Chen C (1998a) Bridging the gap: the use of pathfinder networks in visual navigation. J Vis Lang
Comput 9(3):267–286
Chen C (1998b) Generalised similarity analysis and pathfinder network scaling. Interact Comput
10(2):107–128
Chen C (1999a) Information visualisation and virtual environments. Springer, London
Chen C (1999b) Visualising semantic spaces and author co-citation networks in digital libraries.
Inf Process Manag 35(2):401–420
Chen C (2002) Visualization of knowledge structures. In: Chang SK (ed) Handbook of software
engineering and knowledge engineering, vol 2. World Scientific Publishing Co, River Edge,
p 700
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. Computer
34(3):65–71
Chen C, Paul RJ, O’Keefe B (2001) Fitting the jigsaw of citation: information visualization in
domain analysis. J Am Soc Inf Sci 52(4):315–330
Chen C, Cribbin T, Macredie R, Morar S (2002) Visualizing and tracking the growth of competing
paradigms: two case studies. J Am Soc Inf Sci Technol 53(8):678–689
Kuhn TS (1962) The structure of scientific revolutions. University of Chicago Press, Chicago
Prusiner SB (1982) Novel proteinaceous infectious particles cause scrapie. Science
216(4542):136–144
Schvaneveldt RW (ed) (1990) Pathfinder associative networks: studies in knowledge organization.
Ablex Publishing Corporations, Norwood
Schvaneveldt RW, Durso FT, Dearholt DW (1989) Network structures in proximity data. In: Bower
G (ed) The psychology of learning and motivation, 24. Academic Press, New York, pp 249–284
Smalheiser NR, Swanson DR (1994) Assessing a gap in the biomedical literature – magnesium-
deficiency and neurologic disease. Neurosci Res Commun 15(1):1–9
Smalheiser NR, Swanson DR (1996a) Indomethacin and Alzheimer’s disease. Neurology 46:583
Smalheiser NR, Swanson DR (1996b) Linking estrogen to Alzheimer’s disease: an informatics
approach. Neurology 47:809–810
Smalheiser NR, Swanson DR (1998) Calcium-independent phospholipase A2 and schizophrenia.
Arch Gen Psychiatry 55:752–753
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Stourton E (2001) Mad cows and an Englishman [TV broadcast]. Telling L (Producer). BBC2, London
Swanson DR (1986a) Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect
Biol Med 30(1):7–18
Swanson DR (1986b) Undiscovered public knowledge. Libr Q 56(2):103–118
Swanson DR (1987) Two medical literatures that are logically but not bibliographically connected.
J Am Soc Inf Sci 38:228–233
Swanson DR (1988) Migraine and magnesium: eleven neglected connections. Perspect Biol Med
31:526–557
Swanson DR (1990) Somatomedin C and arginine: implicit connections between mutually-isolated
literatures. Perspect Biol Med 33:157–186
Swanson DR (1999) Computer-assisted search for novel implicit connections in text databases.
Abstracts of Papers of the American Chemical Society, 217, 010-CINF
Swanson DR (2001) On the fragmentation of knowledge, the connection explosion, and assembling
other people’s ideas. Bull Am Soc Inf Sci Technol 27(3):12–14
Swanson DR, Smalheiser NR (1997) An interactive system for finding complementary literatures:
a stimulus to scientific discovery. Artif Intell 91(2):183–203
White HD, McCain KW (1998) Visualizing a discipline: an author co-citation analysis of
information science, 1972–1995. J Am Soc Inf Sci 49(4):327–356
Chapter 8
Mapping Science
A critical part of scientific activity is to discern how a new idea is related to what we know and to what may become possible. As new scientific publications arrive at a rate that rapidly outpaces our capacity for reading, analyzing, and synthesizing scientific knowledge, we need to augment ourselves with information that can guide us through the rapidly growing intellectual space effectively. In this chapter, we address some fundamental issues concerning what information may serve as early signs of potentially valuable ideas. In particular, we are interested in information that is routinely available and derivable upon the publication of a scientific paper, without assuming the availability of additional information such as its usage and citations.
8.1 System Perturbation and Structural Variation
Detecting early signs of potentially valuable ideas has theoretical and practical implications. For instance, peer reviews of new manuscripts and new grant proposals are under growing pressure of accountability for safeguarding the integrity of scientific knowledge and optimizing the allocation of limited resources (Chubin 1994; Chubin and Hackett 1990; Häyrynen 2007; Hettich and Pazzani 2006). Long-term strategic science and technology policies require visionary thinking and evidence-based foresight into the future (Cuhls 2001; Martin 2010; Miles 2010). In foresight exercises on identifying future technology, experts' opinions were found, in hindsight, to be overly optimistic (Tichy 2004). The increasing specialization of today's scientific community makes it unrealistic to expect an expert to have a comprehensive body of knowledge concerning multiple key aspects of a subject matter, especially in interdisciplinary research areas.
The value, or perceived value, of an idea can be quantified in many ways. For example, the value of a good idea can be measured by the number of lives it has saved, the number of jobs it has created, or the amount of revenue it has generated. In the intellectual world, the value of a good idea can be measured by the number of other ideas it has inspired or the amount of attention it has drawn. In this chapter, we are concerned with identifying patterns and properties of information that can tell us something about the potential value of ideas expressed and embodied in scientific publications. The citation count of a scientific publication is the number of times other scientific publications have referenced it. Using citations to guide the search for relevant scientific ideas by way of association, known as citation indexing, was pioneered by Eugene Garfield in the 1950s (Garfield 1955). There is a general consensus that citation behavior can be motivated by both scientific and non-scientific reasons (Bornmann and Daniel 2006). Citation counts have been used as an indicator of intellectual impact on subsequent research. There have been debates over the nature of citations and whether positive, negative, and self-citations should all be treated equally. Nevertheless, even a negative citation makes it clear that the referenced work cannot simply be ignored.
Researchers have searched for other clues that may inform us about the potential impact of a newly published scientific paper, especially clues that can be readily extracted from routinely available information at the time of publication, instead of waiting for download and citation patterns to build up over time. Factors such as the track record of authors, the prestige of authors' institutions, and the prestige of the journal in which an article is published are among the most promising ones that can provide some assurance of the quality of the article (Boyack et al. 2005; Hirsch 2007; Kostoff 2007; van Dalen and Henkens 2005; Walters 2006). The common assumption central to approaches in this category is that great researchers tend to continuously deliver great work and, in a similar vein, that an article published in a high-impact journal is also likely to be of high quality itself. On the one hand, these approaches avoid the reliance on data that may not be readily available upon the publication of an article and thus free analysts from constraints due to the lack of download and citation data. On the other hand, the sources of information used in these approaches are indirect to the new ideas reported in scientific publications. By analogy, we extend credit to an individual based on his or her credit history instead of assessing the risk of the current transaction directly. With such approaches, we will not be able to know precisely where the novelty of an idea comes from, nor whether similar ideas have been proposed in the past.
Many studies have addressed factors that could explain or even predict the future citations of a scientific publication (Aksnes 2003; Hirsch 2007; Levitt and Thelwall 2008; Persson 2010). For example, is a paper's citation count last year a good predictor of new citations this year? Are download counts a good predictor of citations? Is it true that the more references a paper cites, the more citations it will receive later on? Similarly, the potential role of prestige, or the Matthew Effect coined by Robert Merton, has been commonly investigated, ranging from the prestige of authors to the prestige of the journals in which articles are published (Dewett and Denisi 2004). However, many of these factors are loosely and indirectly coupled with the conceptual and semantic nature of the underlying subject matter of concern. We refer to them as extrinsic factors. In contrast, intrinsic factors have direct and profound connections with the intellectual content and structure. One example of an intrinsic factor is the structural variation of a field of study. A notable example is the work by Swanson on linking previously disjoint bodies of knowledge, such as the connection between fish oil and Raynaud's syndrome (Swanson 1986a).
Researchers have made various attempts to characterize future citations and identify emerging core articles (Shibata et al. 2007; Walters 2006). Shibata et al., for example, studied citation networks in two subject areas, gallium nitride and complex networks, and found that while past citations are a good predictor of near-future citations, betweenness centrality is correlated with citations over the longer term.
Upham et al. (2010) studied the role of cohesive intellectual communities – schools of thought – in promoting and constraining knowledge creation. They analyzed publications on management and concluded that it is significantly beneficial for new knowledge to be part of a school of thought and that the most influential position within a school of thought is at its semi-periphery. In particular, boundary-spanning research positioned at the semi-periphery of a school attracts attention from other schools of thought and receives the most citations overall. Their study used a zero-inflated negative binomial regression (ZINB). Negative binomial regression models have also been used to predict expected mean patent citations (Fleming and Bromiley 2000). Hsieh (2011) studied inventions as combinations of technological features. In particular, the closeness of features plays an interesting role: neither overly related nor loosely related features are good candidates for new inventions. Useful inventions arise from rightly positioned features, where the cost of synthesis is minimized.
Takeda and Kajikawa (2010) reported three stages of clustering in citation networks. In the first stage, core clusters are formed, followed by the formation of peripheral clusters and the continuous growth of the core clusters. Finally, the core clusters' growth becomes predominant again. Buter et al. (2011) studied the emergence of an interdisciplinary research area from fields that did not show interdisciplinary connections before. They used journal subject categories as a proxy for fields and citations as a measure of interdisciplinary connection.
Lahiri et al. addressed how structural changes of a network may influence the spread of information over the network (Lahiri et al. 2008). Although they did not study bibliographic networks per se, their study indicates that predictions about how information spreads over a network are sensitive to structural changes of the network. This observation underlines the importance of taking structural change into account in the development of metrics based on topological properties of networks.
Leydesdorff (2001) raised questions (p. 146) that are closely related to what we
are addressing: “How does the new text link up to the literature, and what is its
impact on the network of previously existing relations?” He took a quite different
approach and analyzed word occurrences in scientific papers from an information-
theoretic perspective. In his approach, the publication of a paper is perceived as an
event that may lead to the reduction of uncertainty involved in the current state of
knowledge. He devised diagrams that depict pathways of how a particular paper
improves the efficiency of communication. Although the information-theoretic
approach and our structural variation approach currently operate on different units of
analysis with distinct theoretical underpinnings, both share the fundamental concern
of changes introduced by newly published scientific papers on the existing body of
knowledge.
As shown above, many studies in the literature have addressed factors that may influence citations. The value of our work is the introduction of the structural variation paradigm, along with computational metrics that can be integrated into interactive exploration systems to better understand precisely the impact of individual links made by a new article.
The basic assumption of the structural variation approach is that a departure from the current intellectual structure is a necessary condition for a potentially transformative idea in science. In other words, a potentially transformative idea needs to bring changes to the existing structure of knowledge in the first place. In order to measure the degree of structural variation introduced by a scientific article, the intellectual structure at a particular moment in time needs to be represented in such a way that structural changes can be computationally detected and manually verified. Bibliographic networks can be computationally derived from scientific publications. Research in scientometrics and citation analysis routinely uses citation and co-citation networks as a proxy of the underlying intellectual structure. Here we will focus on several types of co-citation and co-occurrence networks as the representation of a baseline network.
A network represents how a set of entities are connected. Entities are represented as nodes, or vertices, in the network. Their connections are represented as links, or edges. Relevant entities in our context include several types of information that can be computationally extracted from a scientific article, such as the references cited by the article, the authors and their affiliations, the journal in which the article is published, and keywords in the article. We will limit our discussion to networks formed from a single type of entity, although networks of multiple types of entities are worth considering once we establish a basic understanding of structural variations in networks of a single type.
Once the type of entity is chosen, the nature of the interconnectivity between entities needs to be specified to form a network. Networks of co-occurring entities represent a wide variety of types of connectivity. A network of co-occurring words represents how words are related in terms of whether and how often they appear in the vicinity of each other. Co-citation networks of entities such as references, authors, and journals can be seen as a special case of co-occurrence networks. For example, co-citation networks of references are networks of references that appear together in the bodies of scientific papers – these references are co-cited.
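As a concrete illustration, the following minimal Python sketch builds a weighted co-citation network from reference lists; the `papers` variable is a hypothetical stand-in for parsed bibliographic records, not data from this study.

```python
from itertools import combinations

import networkx as nx

# Hypothetical reference lists, one per citing paper.
papers = [
    ["Watts1998", "Barabasi1999", "Newman2001"],
    ["Watts1998", "Barabasi1999"],
    ["Barabasi1999", "Newman2001", "Milgram1967"],
]

G = nx.Graph()
for refs in papers:
    # Every pair of references cited together in one paper is co-cited once.
    for a, b in combinations(sorted(set(refs)), 2):
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

print(sorted(G.edges(data="weight")))
```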
Networks of co-cited references represent more specific information than networks of co-cited authors, because references to different articles by the same author are lumped together in a network of co-cited authors. Similarly, networks of co-cited references are more specific than networks of co-cited journals. We refer to such differences in specificity as the granularity of networks. Measurements of structural variation need to take the granularity factor into account because it is reasonable to expect that networks at different levels of granularity will lead to different measures of structural variation.
Another decision to be made about a baseline network concerns sampling. Taking a particular year as a vantage point, how far back into the past should we look when constructing a baseline network that adequately represents the underlying intellectual structure? Does the network become more accurate if we go further back into the past? Would it be more efficient to limit it to the most recent years that matter the most? Given articles published in a particular year Y, the baseline network represents the intellectual structure using information from articles published up to year Y–1. Two types of baseline networks are investigated here: ones using a moving window of a fixed size [Y–k, Y–1] and ones using the entire history [Y0, Y–1], where Y0 is the earliest year of publication for records in the given dataset.
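The two sampling choices can be sketched as follows; `records` is a hypothetical list of (year, data) pairs, and the function name is illustrative rather than part of any actual tool.

```python
def baseline_years(records, y, k=None):
    """Select records for the baseline network of year y.

    With k=None, use the entire history [Y0, y-1]; otherwise use a
    moving window [y-k, y-1] of fixed size k.
    """
    y0 = min(year for year, _ in records)
    lower = y0 if k is None else y - k
    return [(year, data) for year, data in records if lower <= year <= y - 1]

# Example: a full-history baseline vs. a 3-year moving window for Y = 2005.
records = [(1998, "a"), (2001, "b"), (2003, "c"), (2004, "d"), (2005, "e")]
print(baseline_years(records, 2005))        # entire history: 1998-2004
print(baseline_years(records, 2005, k=3))   # moving window: 2002-2004
```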
We expect that the degree of structural variation introduced by a new article can offer prospective information because of the boundary-spanning mechanism. If an article introduces novel links that span the boundaries of different topics, then we expect this to signal its potential to take the intellectual structure in a new direction.
Given a baseline network, structural variations can be measured based on
information provided by a particular article. We will introduce three metrics of
structural variation. Each metric quantifies the degree of change in the baseline
network introduced by information provided by an article. No usage data is involved
in the measurement. The three metrics are modularity change rate, inter-cluster
linkage, and centrality divergence. The definitions of the first two metrics depend
on a partition of the baseline network, but the third one does not. A partition
of a network decomposes the network into non-overlapping groups of nodes. For
example, clustering algorithms such as spectral clustering can be used to partition a
network.
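A partition can be obtained with standard community detection. The sketch below uses networkx's greedy modularity communities as a stand-in for the spectral clustering mentioned above, applied to a built-in toy graph rather than a real co-citation network.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # placeholder for a baseline co-citation network

# Non-overlapping groups of nodes, i.e. a partition of the network.
partition = greedy_modularity_communities(G)
for i, cluster in enumerate(partition):
    print(f"cluster {i}: {sorted(cluster)}")
```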
The theoretical underpinning of the structural variation is that scientific discov-
eries, at least a subset of them, can be explained in terms of boundary spanning,
brokerage, and synthesis mechanisms in an intellectual space (Chen et al. 2009).
This conceptualization generalizes the principle of literature-based discovery pio-
neered by Swanson (1986a, b), which assumes that connections between previously
disparate bodies of knowledge are potentially valuable. In Swanson’s famous ABC
model, the relationships AB and BC are known in the literature. The potential
relationship AC becomes a candidate that is subject to further scientific investigation
(Weeber 2003). Our conceptualization is more generic in several ways. First,
in the ABC model, the AC relation changes an indirect connection to a direct
connection, whereas our structural variation model makes no assumption about
any prior relations at all. Second, in the ABC model, the scope of consideration is
limited to relationships involving three entities. In contrast, our structural variation
model takes a wider context into consideration and addresses the novelty of a
connection that links groups of entities as well as connections linking individual
entities. Because of the broadened scope of consideration, it becomes possible to
search for candidate connections more effectively. In other words, given a set of
entities, the size of the search space of potential connections can be substantially
reduced if additional constraints are applicable for the selection of candidate
connections. For example, the structural hole theory developed in social network
analysis emphasizes the special potential of nodes that are strategically positioned
to form brokerage, or boundary spanning, links and create good ideas (Burt 2004;
Chen et al. 2009).
266 8 Mapping Science
The first metric, the Modularity Change Rate (MCR) of an article a with respect to the baseline network and its partition C, is defined as:

$$\mathit{MCR}(a) = \frac{Q(G_{baseline}, C) - Q(G_{baseline} \oplus G_a, C)}{Q(G_{baseline}, C)} \times 100$$
where G_baseline ⊕ G_a denotes the baseline network updated with information from article a. For example, suppose reference nodes n_i and n_j are not connected in a baseline network of co-cited references but are co-cited by article a; a new link between n_i and n_j is then added to the baseline network. In this way, the article changes the structure of the baseline network.
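For reference, the modularity Q used above is the standard Newman–Girvan measure; the term-by-term discussion that follows refers to its components (the adjacency matrix A, node degrees, the total number of edges m, and the Kronecker delta):

$$Q(G, C) = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{\deg(i)\,\deg(j)}{2m} \right] \delta(c_i, c_j)$$

where δ(c_i, c_j) = 1 if nodes i and j belong to the same cluster of the partition C and 0 otherwise.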
Intuitively, adding a new link anywhere in a network should not increase the modularity of the network; it should either reduce it or leave it intact. However, the change of modularity is not a monotonic function, as one might initially expect. In fact, it depends on where the new link is added and how the network is structured. Adding a link may reduce the modularity terms contributed by some clusters but increase the terms contributed by other clusters in the network. Thus, the overall modularity change is not monotonic.
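The non-monotonicity is easy to reproduce. In the minimal sketch below (a toy graph, not data from this chapter), adding a within-cluster link raises the modularity of the fixed partition, yielding a negative MCR, while adding a between-cluster link lowers it.

```python
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)])
C = [{0, 1, 2}, {3, 4, 5}]      # a fixed baseline partition
q0 = modularity(G, C)

for edge in [(3, 5), (0, 3)]:   # a within-cluster and a between-cluster link
    H = G.copy()
    H.add_edge(*edge)
    mcr = 100 * (q0 - modularity(H, C)) / q0
    print(edge, round(mcr, 1))  # (3, 5) gives a negative MCR: Q increased
```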
Without loss of generality, assume that an article adds one link at a time to a given baseline network. If the new link connects two distinct clusters, then it has no effect on the corresponding term in the updated modularity because, by definition, δ_ij = 0 and the corresponding term becomes 0. Such a link is illustrated

Fig. 8.2 Scenarios that may increase or decrease individual terms in the modularity metric

by the dashed link e_{5,10} in the top diagram in Fig. 8.2. The new link e_ij will increase the degrees of nodes i and j by one, i.e. deg(i) will become deg(i) + 1. The total number of edges m will increase to m + 1. A simple calculation at the bottom of Fig. 8.2 shows that terms in the modularity formula involving blue links will decrease from their previous values. However, if the network has clusters, such as C_A, with no changes in node degrees, then the corresponding values of the terms shown as red lines will increase from their previous values as the denominator increases from 2m to 2(m + 1). In summary, the updated modularity may increase as well as decrease, depending on the structure of the network and where the new link is added. With this particular definition of modularity, between-cluster links are always associated with a zero-valued term in the overall modularity formula due to the Kronecker delta. What we see in the change of modularity is a combination of results from several scenarios that are indirectly affected by the newly added link. We will introduce our next metric to reflect changes in between-cluster links directly.
The Cluster Linkage (CL) measures the overall structural change introduced by an article a in terms of new connections added between clusters. Its definition assumes a partition of the network. We introduce a function λ(c_i, c_j) over edges, which is the opposite of the δ_ij used in the modularity definition. The value of λ_ij is 1 for an edge across distinct clusters c_i and c_j, and 0 for edges within a cluster. λ_ij allows us to concentrate on between-cluster links and ignore within-cluster links, which is the opposite of how the modularity metric is defined. The Linkage metric is the sum of the weights of all between-cluster links e_ij divided by K – the total number of clusters in the network. Linking a node to itself is not allowed, i.e. we assume e_ii = 0 for all nodes. Using link weights makes the metric sensitive to links that strengthen existing connections between clusters, in addition to novel links that make unprecedented connections between clusters.
It is possible to take into account the sizes of the clusters that a link connects, so that connections between larger clusters become more prominent in the measurement. For example, one option is to multiply each e_ij by $\sqrt{size(c_i) \cdot size(c_j)}\,/\,\max_k(size(c_k))$. Here we define the metric without such modifications for the sake of simplicity. Suppose C is a partition of G; the Linkage metric is defined as follows:
$$\mathrm{Linkage}(G, C) = \frac{\sum_{i \neq j} \lambda_{ij}\, e_{ij}}{K}, \qquad
\lambda_{ij} = \begin{cases} 0, & n_i \in c_j \\ 1, & n_i \notin c_j \end{cases}$$

where c_j denotes the cluster containing node n_j.
The Cluster Linkage is then defined as the difference in Linkage before and after the new between-cluster links added by an article a.

The third metric, centrality divergence (C_KL), measures the structural variation caused by an article a as the divergence between the distributions of betweenness centrality C_B over the nodes v_i of the baseline and updated networks:

$$C_{KL}(G_{baseline}, a) = \sum_{i=0}^{n} p_i \log \frac{p_i}{q_i}, \qquad
p_i = C_B(v_i, G_{baseline}), \quad q_i = C_B(v_i, G_{updated})$$
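Both remaining metrics are straightforward to sketch. The helpers below work on networkx graphs; as an assumption not in the text, the centrality distributions are smoothed with a small epsilon so the logarithm is always defined.

```python
import math

import networkx as nx

def linkage(G, clusters):
    """Sum of weights of between-cluster links divided by K clusters."""
    member = {n: i for i, c in enumerate(clusters) for n in c}
    between = sum(d.get("weight", 1.0)
                  for u, v, d in G.edges(data=True)
                  if member[u] != member[v])
    return between / len(clusters)

def cluster_linkage(G_base, G_upd, clusters):
    """CL: the change in Linkage after an article adds its links."""
    return linkage(G_upd, clusters) - linkage(G_base, clusters)

def centrality_divergence(G_base, G_upd, eps=1e-9):
    """C_KL over betweenness centrality, with epsilon smoothing."""
    p = nx.betweenness_centrality(G_base)
    q = nx.betweenness_centrality(G_upd)
    return sum((p[v] + eps) * math.log((p[v] + eps) / (q.get(v, 0.0) + eps))
               for v in G_base)
```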
$$X \sim \mathrm{NB}(r, p)$$
One can adapt this definition to describe a wide variety of count events. Citation counts belong to a type of count event with over-dispersion, i.e. the variance is greater than the mean. NB models are commonly used in the literature to study this type of count event. Two types of dispersion parameters are used in the literature, θ and α, where θ·α = 1.
Zero-inflated count models are commonly used to account for excessive zero counts (Hilbe 2011; Lambert 1992). Zero-inflated models include two sources of zero citations: the point mass at zero, I_{0}(y), and a count component with a count distribution f_count, such as negative binomial or Poisson (Zeileis et al. 2011). The probability of observing a zero count is inflated with probability π = f_zero(zero citations).
ZINB models are increasingly used in the literature to model excessive occur-
rences of zero citations (Fleming and Bromiley 2000; Upham et al. 2010). The report
of a ZINB model consists of two parts: the count model and the zero-inflated model.
One way to test whether a ZINB model is superior to a corresponding NB model is
known as the Vuong test. The Vuong test is designed to test the null hypothesis that
the two models are indistinguishable. Akaike’s Information Criterion (AIC) is also
commonly used to evaluate the goodness of a model. Models with lower AIC scores
are regarded as better models.
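A hedged sketch of this modeling step is shown below, using statsmodels on synthetic data; the variable names are illustrative, not the chapter's actual covariates. Since statsmodels does not ship a Vuong test, the sketch compares the NB and ZINB fits by AIC, as the text also suggests.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))   # e.g., coauthors, linkage
mu = np.exp(X @ np.array([0.2, 0.3, 0.8]))     # expected citation counts
r = 1.5                                        # NB shape (over-dispersion)
y = rng.negative_binomial(r, r / (r + mu)).astype(float)
y[rng.random(n) < 0.3] = 0                     # inflate the zero counts

nb = sm.NegativeBinomial(y, X).fit(disp=False)
zinb = ZeroInflatedNegativeBinomialP(y, X, exog_infl=np.ones((n, 1))).fit(disp=False)

print(nb.aic, zinb.aic)          # the lower AIC indicates the better model
print(np.exp(nb.params[:3]))     # IRRs are the exponentiated coefficients
```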
Fig. 8.3 The structure of the system before the publication of the groundbreaking paper by Watts and Strogatz (1998)
Fig. 8.4 The structure of the system after the publication of Watts and Strogatz (1998)
Figures 8.3 and 8.4 illustrate how the system adapted to the publication of the groundbreaking paper by Watts and Strogatz in 1998. The network was derived from 5,135 articles published on small-world networks between 1990 and 2010. The network of 205 references and 1,164 co-citation links is divided into 12 clusters, with a modularity of 0.6537 and a mean silhouette of 0.811. The red lines are links made by the top 15 articles as measured by the centrality variation rate. Only the labels of major clusters are shown in the figure. Dashed lines in red are novel connections made by Watts and Strogatz (1998) at the time of its publication. The article has the highest Cluster Linkage and C_KL scores, 5.43 and 1.14, respectively. The figure offers visual confirmation that the article was indeed making boundary-spanning connections. Recall that the dataset was constructed by expanding the seed article along forward citation links. These boundary-spanning links provide empirical evidence that the groundbreaking paper was connecting two groups of clusters. The emergence of Cluster #8, complex network, was a consequence of this impact.
Table 8.1 summarizes the results of five NB regression models with different types of networks. They have an average dispersion parameter θ of 0.5270, which is equivalent to an α of 1.8975. Coauthors has an average IRR of 1.3278, References an average IRR of 1.0126, and Pages an average IRR of 0.9714. The effects of these three variables are consistent and stable across the five types of networks. In contrast, the effects of structural variations are less stable. On the other hand, structural variations appear to have a stronger impact on global citations than other, more commonly studied measures such as Coauthors and References. For example, CL has an IRR of 3.160 in networks of co-cited references and an IRR of 1.33 × 10^8 in networks of noun phrases. IRRs greater than 1.0 predict an increase in global citations.
We have found statistical evidence of the boundary-spanning mechanism. An article that introduces novel connections between clusters of co-cited references is likely to become highly cited subsequently. In addition, we have found that the IRRs of Cluster Linkage are more than twice those of Coauthors and References. This finding provides a more fundamental explanation of why the number of references cited by an article appears to be a good predictor of its future citations, as found in many previous studies. As a result, the structural variation paradigm clarifies why a number of extrinsic features appear to be associated with high citations.
A distinct characteristic of the structural variation approach is the focus on the
potential connection between the degree of structural variation introduced by an
article and its future impact. The analytic and modeling procedure demonstrated
here is expected to serve as an exemplar for subsequent studies along this line of
research. More importantly, the focus on the underlying mechanisms of scientific
activity is expected to provide additional insights and practical guidance for
scientists, sociologists, historians, and philosophers of scientific knowledge.
There are many new challenges and opportunities ahead. For example, how
common is the boundary-spanning mechanism in scientific discoveries overall?
What are the other major mechanisms and how do they interact with the boundary-
spanning mechanism? There are other potentially valuable techniques that we have not utilized in the present study, including topic modeling, citation context analysis, survival analysis, and burst detection. In short, much work remains to be done, and this is an encouraging start.
Figure 8.5 shows the structural variation approach applied to a study of the potential of patents. The patent US6537746 ranks high on the structural variation scale. Its position is marked by a star. The areas where the patent made
Table 8.1 Negative binomial regression models (NBs) of Complex Network Analysis (1996–2004) at five different levels of granularity of units of analysis. Data source: Complex Network Analysis (1996–2004), top 100 records per time slice, 2-year sliding window.

| Unit of analysis | Reference | Keyword | Noun phrase | Author | Journal |
| Relation | Co-citation | Co-occurrence | Co-occurrence | Co-citation | Co-citation |
| Offset (exposure) | log2(Year) | log2(Year) | log2(Year) | log2(Year) | log2(Year) |
| Number of citing articles | 3,515 | 3,072 | 3,254 | 3,271 | 3,271 |

Global citations: Incidence Rate Ratios (IRRs) in NB models; each cell shows the IRR followed by its p-value in parentheses.

| Coauthors | 1.306 (0.000) | 1.298 (0.000) | 1.326 (0.000) | 1.359 (0.000) | 1.350 (0.000) |
| Modularity change rate | 1.083 (0.025) | 1.038 (0.086) | 1.047 (0.305) | 1.055 (0.276) | 1.060 (0.180) |
| Weighted cluster linkage | 3.160 (0.000) | 0.205 (0.095) | 1.33 × 10^8 (0.000) | 2.879 (0.000) | 1.204 (0.049) |
| Centrality divergence | 0.343 (0.184) | 3.679 (0.023) | 1.534 (0.665) | 23.400 (0.000) | 7.620 (0.000) |
| Number of references | 1.013 (0.000) | 1.013 (0.000) | 1.013 (0.000) | 1.012 (0.000) | 1.012 (0.000) |
| Number of pages | 0.970 (0.000) | 0.971 (0.000) | 0.971 (0.000) | 0.973 (0.000) | 0.972 (0.000) |
| Dispersion parameter (θ) | 0.5284 | 0.5258 | 0.5150 | 0.5282 | 0.5375 |
| −2 log-likelihood | 31,771 | 28,331 | 29,491 | 29,506 | 29,613 |
| Akaike's Information Criterion (AIC) | 31,787 | 28,347 | 29,508 | 29,522 | 29,629 |

Notes: References involve the least ambiguity, with the finest granularity, whereas the other four types of units introduce ambiguity at various levels. Models constructed with units of higher ambiguity are slightly improved in terms of Akaike's Information Criterion (AIC).
Fig. 8.5 The structural variation method is applied to a set of patents related to cancer research.
The star marks the position of a patent (US6537746). The red lines show where the boundary-
spanning connections were made by the patent. Interestingly, the impacted clusters are about
recombination
boundary-spanning links are clusters #88 and #83, both labeled as recombination.
The map shows that multiple streams of innovation have moved away from the
course of older streams.
We conclude that structural variation is an essential aspect of the development of scientific knowledge and that it has the potential to reveal the underlying mechanisms of the growth of scientific knowledge. The focus on the underlying mechanisms of knowledge creation is the key to the predictive potential of the structural variation approach. The theory-driven explanatory and computational approach sets up an extensible framework for detecting and tracking potentially creative ideas and for gaining insights into challenges and opportunities in light of the collective wisdom.
8.2 Regenerative Medicine

Simply speaking, a differentiation process refers to how a cell divides into new cells. Cells in the next generation generally become more specialized than their parent generation. Cells with the broadest range of potential can produce all kinds of cells in an organism. This potential is called totipotency. The next level of potency is called pluripotency, a term derived from the Latin plurimus, meaning very many. A pluripotent cell can differentiate into more specialized cells. In contrast, a unipotent cell can differentiate into only one cell type.
Prior to the work of Gurdon and Yamanaka, it was generally believed that the
path of cell differentiation is irreversible in that the potency of a cell becomes more
and more limited in generations of differentiated cells. Induced pluripotent stem
cells (iPS cells) result from a reprogramming of the natural differentiation. Starting
with a non-pluripotent cell, human intervention can reverse the process so that the
non-pluripotent cell could regain a more generic potency.
John B. Gurdon discovered in 1962 that the DNA of a mature cell may still have
all the information needed to develop all cells in a frog. He modified an egg cell of
a frog by replacing its immature nucleus with the nucleus from a mature intestinal
cell. The modified egg cell developed into a normal tadpole. His work demonstrated
that the specialization of cells is reversible. Shinya Yamanaka’s discovery was made
more than 40 years later. He found out how mature cells in mice could be artificially
reprogrammed to become induced pluripotent stem cells.
On August 25, 2011, more than a year before the 2012 Nobel Prize was announced, I received an email from Emma Pettengale, the Editor of the peer-reviewed journal Expert Opinion on Biological Therapy (EOBT). The journal provides expert reviews of recent research on emerging biotherapeutic drugs and technologies. She asked if I would be interested in preparing a review of emerging trends in regenerative medicine using CiteSpace, and she would give me 3 months to complete the review.
EOBT is a reputable journal with an impact factor of 3.505, according to the Journal Citation Reports (JCR) compiled by Thomson Reuters in 2011. Emma's invitation was an unusual one. The journal is a forum for experts to express their opinions on emerging trends, but I am not a specialist in regenerative medicine at all. Although CiteSpace has been used in a variety of retrospective case studies, including terrorism, mass extinctions, string theory, and complex network analysis, in most of those cases we were able to find independent reviews to cross-validate our results, or to contact domain experts to verify specific patterns. The invitation was both challenging and stimulating. We would be able to analyze emerging trends in a rapidly advancing field with CiteSpace. Most importantly, we wanted to find out whether we could limit our source of information exclusively to patterns identified by CiteSpace.
The 3,875 records do not include relevant publications in which the term "regenerative medicine" does not explicitly appear in the titles, abstracts, or index terms. We therefore expanded the dataset by citation indexing: if an article cites at least one of the 3,875 records, then the article is included in the expanded dataset, based on the assumption that citing a regenerative medicine article makes the citing article relevant to the topic. The citation index-based expansion resulted in 35,963 records, consisting of 28,252 (78.6 %) original articles and 7,711 (21.4 %) review articles. The range of the expanded set remains 2000–2011. Thus the analysis focuses on the development of regenerative medicine over the last decade. The 35,963-article dataset is used in the subsequent analysis. Incorrect citation variants of two highly visible references, a 1998 landmark article by Thomson et al. (1998) and a 1999 article by Pittenger (Pittenger et al. 1999), were corrected prior to the analysis.
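The expansion step itself is simple to express. The sketch below uses hypothetical (record_id, cited_ids) pairs rather than the actual Web of Science records.

```python
def expand_by_citation(records, seed_ids):
    """Keep seeds plus every record that cites at least one seed.

    records: iterable of (record_id, list_of_cited_ids) pairs.
    """
    seeds = set(seed_ids)
    expanded = set(seeds)
    for record_id, cited_ids in records:
        if seeds.intersection(cited_ids):
            expanded.add(record_id)
    return expanded

# Example: r2 cites a seed and is pulled in; r3 does not and is left out.
records = [("r2", ["s1", "x9"]), ("r3", ["x7"])]
print(expand_by_citation(records, ["s1"]))   # {'s1', 'r2'}
```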
followed by #20 human embryonic stem cell, and then by the latest and currently active cluster, #32 induced pluripotent stem cell. The patches of red rings in #32 indicate that this area is rapidly expanding, as suggested by citation bursts.
Table 8.2 lists eight major clusters by their size, i.e. the number of members in each cluster. Clusters with few members tend to be less representative than larger clusters, because small clusters are likely to be formed by the citing behavior of a small number of publications. The quality of a cluster is also reflected in its silhouette score, which is an indicator of its homogeneity or consistency. Silhouette values of homogeneous clusters tend to be close to 1. Most of the clusters are highly homogeneous, except Cluster #19, with a low silhouette score of 0.119. Each cluster is labeled by noun phrases from the titles of the articles citing the cluster (Chen et al. 2010).
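The silhouette score is a standard measure; the generic scikit-learn sketch below, on synthetic two-dimensional points rather than co-citation data, shows how per-cluster means near 1 indicate homogeneous clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),    # one tight cluster
               rng.normal(3.0, 0.3, (20, 2))])   # another tight cluster
labels = np.array([0] * 20 + [1] * 20)

s = silhouette_samples(X, labels)                # per-point values in [-1, 1]
for c in (0, 1):
    print(c, round(float(s[labels == c].mean()), 3))  # close to 1 here
```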
The average year of publication of a cluster indicates its recency. For example, Cluster #9, on mesenchymal stem cells (MSCs), has an average year of 1999. The most recently formed cluster, Cluster #7 on induced pluripotent stem cells (iPSCs), has an average year of 2008.
Cluster #7 contains numerous nodes with red rings of citation bursts. The visualized network also shows strongly bursting terms found in the titles and abstracts of the articles citing the major clusters. For example, the terms stem-cell-renewal and germ-line-stem-cells are not only used when articles cite references in Cluster #17, drosophila spermatogenesis, but their usage also shows a period of rapid increase. Similarly, the term induced-pluripotent-stem-cells is a burst term associated with Cluster #7, which is consistently labeled as induced pluripotent stem cell by a different selection mechanism, the log-likelihood ratio test (LLR). We will particularly focus on Cluster #7 in order to identify emerging trends in regenerative medicine.
Cluster #7 is the most recently formed cluster. We selected the ten most cited references in this cluster and ten citing articles (see Table 8.3).
Table 8.3 Cited references and citing articles of Cluster #7 on iPSCs (induced pluripotent stem cell)

Cited references:
| Citations | Reference |
| 1,841 | Takahashi K (2006) Cell, v126, p663 |
| 1,583 | Takahashi K (2007) Cell, v131, p861 |

Citing articles:
| Coverage | Article |
| 95 % | Stadtfeld, Matthias (2010) Induced pluripotency: history, mechanisms, and applications |
| 80 % | Kiskinis, Evangelos (2010) Progress toward the clinical application of patient-specific pluripotent stem cells |
The most cited article in this cluster, Takahashi 2006 (Takahashi and Yamanaka
2006), demonstrated how pluripotent stem cells can be directly generated from
mouse somatic cells by introducing only a few defined factors as opposed to
transferring nuclear contents to oocytes, or egg cells. Their work is a major
milestone. The second most cited reference (Takahashi et al. 2007), from the same
group of researchers, further advanced the state-of-the-art by demonstrating how
differentiated human somatic cells can be reprogrammed into pluripotent stem cells
using the same factors identified in their previous work. As it turns out, the work
represented by the two highly ranked papers was awarded the 2012 Nobel Prize in
Medicine.
Cluster #7 consists of 40 co-cited references. The ten selected citing articles were all published in 2010. They cited 65–95 % of these references. The one with the highest citation coverage, 95 %, is an article by Stadtfeld et al. Unlike works that aim to refine and improve the ways to produce iPSCs, their primary concern was whether iPSCs are equivalent, molecularly and functionally, to blastocyst-derived embryonic stem cells. The Stadtfeld article itself belongs to the cluster. Other citing articles also seem to question some of the fundamental assumptions or call for more research before further clinical development in regenerative medicine.
The most cited articles are usually regarded as landmarks due to their groundbreaking contributions (see Table 8.4). Cluster #7 has three articles among the top ten landmark articles. Clusters #9, #12, and #15 each have two. The most cited article in our dataset is Pittenger MF (1999), with 2,486 citations, followed by Thomson JA (1998), with 2,223 citations. The third is a review article by Reya T (2001). The articles in the 4th–6th positions are all from Cluster #7, namely Takahashi K (2006), Takahashi K (2007), and Yu JY (2007). These three are also the most recent articles on the list, suggesting that they have inspired intense interest in induced pluripotent stem cells.
A citation burst has two attributes: the intensity of the burst and how long the burst status lasts. Table 8.5 lists the references with the strongest citation bursts across the entire dataset during the period of 2000–2011. The first four articles with strong citation bursts are from Cluster #7 on iPSCs. Interestingly, one 2009 article (again in Cluster #7) and one 2010 article (in Cluster #8, a small cluster) were also detected to have considerable citation bursts. The leader of the group that authored the top two references was awarded the 2012 Nobel Prize in Medicine.
The Sigma metric measures both the structural centrality and the citation burstness of a cited reference. If a reference is strong on both measures, it will have a higher Sigma value than a reference that is strong on only one of the two.
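For reference, in related work on CiteSpace (Chen et al. 2010) the Sigma of a node v combines the two measures in the exponent form below; we restate it here since the chapter uses the metric without repeating its definition:

$$\Sigma(v) = (\mathit{centrality}(v) + 1)^{\mathit{burstness}(v)}$$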
As shown in Table 8.6, the pioneering iPSC article by Takahashi (2006) has the highest Sigma of 377,340.46, which means that it is structurally essential and, judging by its strong citation burst, highly inspirational. The second highest work by this measure is a 1999 article in Science by Bjornson et al. (1999). They reported an experiment in which neural stem cells were found to have a wider differentiation potential than previously thought, because they evidently produced a variety of blood cell types.
The modularity of a network measures the degree to which the nodes of the network can be divided into a number of groups such that nodes within the same group are more tightly connected than nodes in different groups. The collective intellectual structure of the knowledge of a scientific field can be represented as associated networks of co-cited references. Such networks evolve over time. Newly published articles may introduce profound structural variation or have little or no impact on the structure.
Fig. 8.7 The modularity of the network dropped considerably in 2007 and even more in 2009,
suggesting that some major structural changes took place in these 2 years in particular
Figure 8.7 shows the change of modularity of the networks over time. Each network is constructed based on a 2-year sliding window. The number of publications per year increased considerably. It is noticeable that the modularity dipped in 2007 and bounced back to its previous level before dropping even deeper in 2009. Based on this observation, it is plausible that groundbreaking works appeared in 2007 and 2009. We will therefore specifically investigate potential emerging trends in these 2 years.
Which publications in 2007 would explain the significant decrease in the modularity of the network formed from publications prior to 2007? If a 2007 publication has a subsequent citation burst, then we expect that this publication played an important role in changing the overall intellectual structure. Eleven publications in 2007 are found to have subsequent citation bursts (Table 8.7). Notably, Takahashi 2007 and Yu 2007 top the list. Both represent pioneering investigations of reprogramming human body cells into iPSCs, and both have citation bursts that began in 2009 and are still ongoing. Other articles on the list address the pluripotency of stem cells related to human cancers, including colon cancer and pancreatic cancer. Two review articles on regenerative medicine and tissue repair were published in 2007, with citation bursts since 2010. These observations suggest that the modularity change in 2007 is an indication of an emerging trend in human induced pluripotent stem cell research. The trend is current and active, as shown by the number of citation bursts associated with publications in 2007 alone.
If the modularity change in 2007 indicates an emerging trend in human iPSCs
research, what caused the even more profound modularity change in 2009? The
Table 8.7 Articles published in 2007 with subsequent citation bursts, in descending order of local citation counts

| References | Local citations | Title | Burst strength | Burst duration (2000–2011) |
| Takahashi et al. (2007) | 1,583 | Induction of pluripotent stem cells from adult human fibroblasts by defined factors | 121.36 | 2009–2011 |
| Yu et al. (2007) | 1,273 | Induced pluripotent stem cell lines derived from human somatic cells | 81.37 | 2009–2011 |
| Wernig et al. (2007) | 640 | In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state | 26.70 | 2008–2009 |
| O'Brien et al. (2007) | 438 | A human colon cancer cell capable of initiating tumour growth in immunodeficient mice | 18.13 | 2008–2009 |
| Ricci-Vitiani et al. (2007) | 427 | Identification and expansion of human colon-cancer-initiating cells | 8.83 | 2008–2009 |
| Li et al. (2007) | 299 | Identification of pancreatic cancer stem cells | 9.78 | 2008–2008 |
| Mikkelsen et al. (2007) | 283 | Genome-wide maps of chromatin state in pluripotent and lineage-committed cells | 19.59 | 2010–2011 |
| Laflamme et al. (2007) | 265 | Cardiomyocytes derived from human embryonic stem cells in pro-survival factors enhance function of infarcted rat hearts | 16.48 | 2010–2011 |
| Gimble et al. (2007) [R] | 247 | Adipose-derived stem cells for regenerative medicine | 25.19 | 2010–2011 |
| Phinney and Prockop (2007) [R] | 229 | Concise review: mesenchymal stem/multipotent stromal cells: the state of transdifferentiation and modes of tissue repair—current views | 16.52 | 2010–2011 |
| Khang et al. (2007) [In Korean] | 90 | Recent and future directions of stem cells for the application of regenerative medicine | 35.25 | 2008–2009 |
cluster that is responsible for the 2009 modularity change is Cluster #7, induced pluripotent stem cell (iPSC). On the one hand, the cluster contains Takahashi 2006 and Takahashi 2007, which pioneered the human iPSC trend. On the other hand, the cluster contains many recent publications: the average year of publication of the articles in this cluster is 2008. We therefore examine the members of this cluster closely, focusing especially on 2009 publications.
The impact of Takahashi 2006 and Takahashi 2007 is so profound that their citation rings would overshadow all other members of Cluster #7. After excluding the display of their overshadowing citation rings, it becomes apparent that this cluster is full of articles with citation bursts, which are shown as citation rings in red. We labeled the ones published in 2009, as well as two 2008 articles and one 2010 article (Fig. 8.8 and Table 8.8).
The pioneering reprogramming methods introduced by Takahashi 2006 and Takahashi 2007 modify adult cells to obtain properties similar to embryonic stem cells, using the cancer-causing oncogene c-Myc as one of the defined factors and a virus to deliver the genes into target cells (Nakagawa et al. 2008). It was shown later that c-Myc is not needed. The use of viruses as the delivery vehicle raised safety concerns about clinical implications in regenerative medicine, because viral integration into target cells' genomes might activate or inactivate critical host genes. The search for virus-free techniques motivated a series of studies, led by an article (Okita et al. 2008) that appeared on October 9, 2008.
What many of these 2009 articles have in common appears to be a focus on improving previous techniques for reprogramming human somatic cells to regain a pluripotent state. It was realized that the original method used to induce pluripotent stem cells has a number of possible drawbacks associated with the use of viral reprogramming factors. Several subsequent studies investigated alternative ways to induce pluripotent stem cells with lower risks or improved certainty. These articles were published within a short period of time. For instance, Woltjen 2009 demonstrated a virus-independent simplification of induced pluripotent stem cell production. On March 26, 2009, Yu et al.'s article demonstrated that reprogramming human somatic cells can be done without genomic integration or the continued presence of exogenous reprogramming factors. On April 23, 2009, Zhou et al.'s article demonstrated how to avoid exogenous genetic modifications by delivering recombinant cell-penetrating reprogramming proteins directly into target cells. Soldner 2009 reported a method that does not use viral reprogramming factors. Kaji et al. reported a virus-free pluripotency induction method. On May 28, 2009, Kim et al.'s article introduced a method of direct delivery of reprogramming proteins.
Vierbuchen 2010 is one of the few recent articles found to have citation bursts. The majority of the 2009 articles with citation bursts focused on reprogramming human somatic cells to an undifferentiated state. In contrast, Vierbuchen 2010 expanded the scope of reprogramming by demonstrating the possibility of converting fibroblasts directly into functional neurons (Fig. 8.8).
Table 8.8 Articles published in 2009 with citation bursts

| References | Local citations | Title | Burst strength | Burst duration (2000–2011) |
| Woltjen et al. (2009) | 320 | piggyBac transposition reprograms fibroblasts to induced pluripotent stem cells | 52.65 | 2009–2011 |
| Yu et al. (2009) | 300 | Human induced pluripotent stem cells free of vector and transgene sequences | 59.97 | 2010–2011 |
| Zhou et al. (2009) | 293 | Generation of induced pluripotent stem cells using recombinant proteins | 62.54 | 2010–2011 |
| Soldner et al. (2009) | 288 | Parkinson's disease patient-derived induced pluripotent stem cells free of viral reprogramming factors | 53.94 | 2010–2011 |
| Kaji et al. (2009) | 284 | Virus-free induction of pluripotency and subsequent excision of reprogramming factors | 46.71 | 2009–2011 |
| Kim et al. (2009a) | 235 | Generation of human induced pluripotent stem cells by direct delivery of reprogramming proteins | 56.03 | 2010–2011 |
| Ebert et al. (2009) | 211 | Induced pluripotent stem cells from a spinal muscular atrophy patient | 41.91 | 2010–2011 |
| Kim et al. (2009b) | 194 | Oct4-induced pluripotency in adult neural stem cells | 31.87 | 2009–2011 |
| Vierbuchen et al. (2010) | 193 | Direct conversion of fibroblasts to functional neurons by defined factors | 63.12 | 2010–2011 |
| Lister et al. (2009) | 161 | Human DNA methylomes at base resolution show widespread epigenomic differences | 51.93 | 2010–2011 |
| Chin et al. (2009) | 158 | Induced pluripotent stem cells and embryonic stem cells are distinguished by gene expression signatures | 45.39 | 2010–2011 |
| Discher et al. (2009) | 149 | Growth factors, matrices, and forces combine and control stem cells | 43.14 | 2010–2011 |
| Hong et al. (2009) | 138 | Suppression of induced pluripotent stem cell generation by the p53–p21 pathway | 43.71 | 2010–2011 |
| Slaughter et al. (2009) | 97 | Hydrogels in regenerative medicine | 31.68 | 2010–2011 |
Fig. 8.8 Many members of Cluster #7 are found to have citation bursts, shown as citation rings in
red. Chin MH 2009 and Stadtfeld M 2010 at the bottom area of the cluster represent a theme that
differs from other themes of the cluster
Two articles of particular interest appear at the lower end of Cluster #7: Chin et al. (2009) and Stadtfeld et al. (2010). Chin et al.'s article has 158 citations within the dataset, with a citation burst detected since 2010. Chin et al. questioned whether induced pluripotent stem cells (iPSCs) are indistinguishable from embryonic stem cells (ESCs). Their investigation suggested that iPSCs should be considered a unique subtype of pluripotent cell.
The co-citation network analysis has identified several articles that cite the work by Chin et al. In order to establish whether Chin et al. represents the beginning of a new emerging trend, we inspect these citing articles, listed in Table 8.9. Stadtfeld 2010 is itself the most cited of these citing articles, with 134 citations. Like Chin et al., Stadtfeld 2010 addresses the question of whether iPSCs are molecularly and functionally equivalent to blastocyst-derived embryonic stem cells. Their work identified the role of the Dlk1-Dio3 gene cluster in association with the level of induced pluripotency. In other words, these studies focus on the mechanisms that govern induced pluripotency, which can be seen as a trend distinct from the earlier trend of improving reprogramming techniques. Table 8.9 includes two review articles cited by Stadtfeld 2010.
Table 8.9 Articles that cite Chin et al.'s 2009 article (Chin et al. 2009) and their citation counts as of November 2011

| Article | Citations | Title |
| Stadtfeld et al. (2010) | 134 | Aberrant silencing of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells |
| Boland et al. (2009) | 109 | Adult mice generated from induced pluripotent stem cells |
| Feng et al. (2010) | 72 | Hemangioblastic derivatives from human induced pluripotent stem cells exhibit limited expansion and early senescence |
| Kiskinis and Eggan (2010) [R] | 59 | Progress toward the clinical application of patient-specific pluripotent stem cells |
| Laurent et al. (2011) | 48 | Dynamic changes in the copy number of pluripotency and cell proliferation genes in human ESCs and iPSCs during reprogramming and time in culture |
| Bock et al. (2011) | 31 | Reference maps of human ES and iPS cell variation enable high-throughput characterization of pluripotent cell lines |
| Zhao et al. (2011) | 22 | Immunogenicity of induced pluripotent stem cells |
| Boulting et al. (2011) | 17 | A functionally characterized test set of human induced pluripotent stem cells |
| Young (2011) [R]a | 16 | Control of the embryonic stem cell state |
| Ben-David and Benvenisty (2011) [R]a | 11 | The tumorigenicity of human embryonic and induced pluripotent stem cells |

[R] Review articles; a Cited by Stadtfeld et al. (2010)
The new emerging trend is concerned with the equivalence of iPSCs and their human embryonic stem cell counterparts in terms of their short- and long-term functions. The new trend has critical implications for the therapeutic potential of iPSCs. In addition to the works by Chin et al. and Stadtfeld et al., an article published on August 2, 2009 by Boland et al. (2009) reported an investigation of mice derived entirely from iPSCs. Another article (Feng et al. 2010), which appeared on February 12, 2010, investigated abnormalities such as limited expansion and early senescence found in human iPSCs. The Stadtfeld 2010 article (Stadtfeld et al. 2010) we discussed earlier appeared on May 13, 2010.
Some of the more recent citing articles of Chin et al. focused on providing resources for more stringent evaluative and comparative studies of iPSCs. On January 7, 2011, an article (Laurent et al. 2011) reported a study of genomic stability and abnormalities in pluripotent stem cells and called for frequent genomic monitoring to ensure phenotypic stability and clinical safety. On February 4, 2011, Bock et al. (2011) published genome-wide reference maps of DNA methylation and gene expression for 20 previously derived human ES lines and 12 human iPS cell lines. In a more recent article (Boulting et al. 2011), published on February 11, 2011, Boulting et al. established a robust resource consisting of 16 iPSC lines and a stringent test of differentiation capacity.
iPSCs are characterized by their capacity for self-renewal and their versatile ability to differentiate into a wide variety of cell types. These properties are invaluable for regenerative medicine. However, the same properties also make iPSCs tumorigenic, or cancer
Fig. 8.9 A network of the regenerative medicine literature shows 2,507 co-cited references cited
by top 500 publications per year between 2000 and 2011. The work associated with the two labelled
references was awarded the 2012 Nobel Prize in Medicine
prone. In a review article published in April 2011, Ben-David and Benvenisty (2011) reviewed the tumorigenicity of human embryonic and induced pluripotent stem cells. Zhao et al. challenged a generally held assumption concerning the immunogenicity of iPSCs in an article (Zhao et al. 2011) published on May 13, 2011. The immunogenicity of iPSCs has clinical implications for therapeutically valuable cells derived from patient-specific iPSCs.
In summary, a series of more recent articles have re-examined several fundamental assumptions and properties of iPSCs, with deeper consideration of the clinical and therapeutic implications for regenerative medicine (Patterson et al. 2012) (Fig. 8.9).
Our analysis has traced the development of regenerative medicine over the last decade and highlighted the areas of active pursuit. The emerging trends and patterns identified in the analysis are based on computational properties selected by CiteSpace, which is designed to facilitate sense-making tasks about scientific frontiers based on the relevant domain literature.
Regenerative medicine is a fascinating and fast-moving subject. As information scientists, we have demonstrated a scientometric approach to tracking the advance of the collective knowledge of a dynamic scientific community by tapping into what experts in the domain have published in the literature, and we have shown how information and computational techniques can help us discern patterns and trends at various levels of abstraction, namely cited references and clusters of co-cited references.
Based on the analysis of structural and temporal patterns of citations and co-citations, we have identified two major emerging trends. The first started in 2007 with pioneering works on human induced pluripotent stem cells (iPSCs), including subsequently refined and alternative techniques for reprogramming. The second started in 2009 with an increasingly broad range of examinations and re-examinations of previously unchallenged assumptions with clinical and therapeutic implications for regenerative medicine, including the tumorigenicity and immunogenicity of iPSCs. It is worth noting that this expert opinion is based solely on scientometric patterns revealed by CiteSpace, without prior working experience in the regenerative medicine field.
The referential expansion of the original topic search of regenerative medicine
has revealed a much wider spectrum of intellectual dynamics. The visual analysis of
the broader domain outlines the major milestones throughout the extensive period
of 2000–2011. Several indicators and observations converge on the critical and active role of Cluster #7 on iPSCs. By tracing interrelationships along citation links and citation bursts, visual analytic techniques of scientometrics are able to guide our attention to some of the most vibrant and rapidly advancing research fronts and to identify the strategic significance of various challenges addressed by highly specialized technical articles. The number of review articles on relevant topics is rapidly increasing, which is also a sign that the knowledge of regenerative medicine has been advancing rapidly. We expect that visual analytic tools such as those utilized in this review will play an increasingly active role in supplementing traditional review and survey articles. Visual analytic tools can be valuable in finding critical developments in the vast amount of newly published studies.
The key findings of regenerative medicine and related research over the last decade have shown that regenerative medicine has become increasingly feasible in many areas and that it will ultimately revolutionize clinical and healthcare
practice and many aspects of our society. On the other hand, the challenges ahead are enormous. The biggest challenge is probably that a human being is a complex system, in which a local perturbation may lead to unpredictable consequences in other parts of the system, which in turn may affect the entire system. The state of the art in science and medicine has a long way to go to handle such complex systems in a holistic way. Suppressing or activating a seemingly isolated factor may have unforeseen consequences.
The two major trends identified in this review have distinct research agendas as well as different perspectives and assumptions. In our opinion, the independence of such trends at a strategic level is desirable at the initial stages of these emerging trends, so as to maximize a knowledge gain that is unlikely to be achieved by a single line of research alone. In the long run, more trends are expected to emerge, probably from the least expected perspectives. Existing trends may be accommodated by new levels of integration. We expect that safety and uncertainty will remain the central concerns of regenerative medicine.
8.3 Retraction
new articles per year. The rate of retracted articles is calculated as the number of eventually retracted articles published in a year divided by the total number of articles published in the same year in PubMed. The rate of retraction is the number of retraction notices issued each year divided by the total number of publications in PubMed in the same year. The retraction rate in 2001 was 0.00005. It has doubled three times since then, in 2003, 2006, and 2011, respectively. The retraction rate in 2011 was 0.00046. Figure 8.10 shows that the number of retracted articles per year peaked in 2006. The blue line is the retraction rate, which is growing fast. The red line is the actual number of retracted articles. Although fewer recent articles have been retracted than the 2006 peak number, we expect that this is in part due to a delay in recognizing potential flaws in newly published articles. We will quantify the extent of such delays later in a survival analysis.
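To make the two rate definitions concrete, here is a minimal Python sketch that computes yearly retraction rates from hypothetical counts; the numbers are placeholders for illustration, not the actual PubMed figures.

```python
# Hypothetical yearly counts; real values would come from PubMed queries.
papers_published = {2001: 530_000, 2006: 680_000, 2011: 960_000}
retraction_notices = {2001: 27, 2006: 120, 2011: 440}

# Rate of retraction: notices issued in a year / papers published that year.
retraction_rate = {year: retraction_notices[year] / papers_published[year]
                   for year in papers_published}

for year, rate in sorted(retraction_rate.items()):
    print(f"{year}: {rate:.5f}")
```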
On the one hand, the increasing awareness of mistakes in scientific studies (Naik 2011), especially due to the publicity of high-profile cases of retraction and fraud (Kakuk 2009; Service 2002), has led to a growing body of studies of retractions. On the other hand, the study of retracted articles, the potential risk that these articles may pose to the scientific literature in the long run, and the actions that could be taken to reduce such risks remain relatively underrepresented, given the urgency, possible consequences, and policy implications of the issue. We will
address some common questions concerning retracted articles. In particular, we
introduce a visual analytic framework and a set of tools that can be used to facilitate
situation awareness tasks at macroscopic and microscopic levels.
At the macroscopic level, we will focus on questions concerning retracted articles in the broader context of the rest of the scientific literature. Given a retracted article, which areas of the scientific literature are affected? Where are the articles that directly cited the retracted article? Where are the articles that may be related to the retracted article indirectly?
Table 8.10 The number of retractions found in major sources of scientific publications (as of 3/29/2012)

Sources                                   Items   Document type       Search criteria
PubMed (a)                                2,073   Retracted article   "Retracted publication" [pt]
                                          2,187   Retraction notice   "Retraction of publication" [pt]
Web of Science (1980–present)             1,775   Retracted article   Title contains "(Retracted article.)"
                                          1,734   Retraction notice   Title contains "(Retraction of vol)"
Google Scholar                              219   Retracted article   Allintitle: "retracted article"
Elsevier Content Syndication (CONSYN)       659   Retracted article   Title: Retracted article (full text)

(a) http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=DetailsSearch&Term=%22retracted+publication%22%5Bpublication+type%5D

1 http://www.ncbi.nlm.nih.gov/PubMed?term=retracted+publication+[pt]
2 Correction: correction of errors found in articles that were previously published and which have been made known after that article was published. Includes additions, errata, and retractions. http://images.webofknowledge.com/WOKRS51B6/help/WOS/hs_document_type.html
A retraction sends a strong signal to the scientific community that retracted articles are not trustworthy and should be effectively purged from the literature. Studies of retraction are often limited to formally retracted articles. It is a common belief that many more articles should have been retracted (Steen 2011). On the other hand, it has been noted that retraction should be reserved for scientific misconduct, whereas correction is the more appropriate term for withdrawing articles with technical errors (Sox and Rennie 2006). We outline some representative studies of retraction below in terms of how they addressed several common questions.
Time to retraction – How long does it take on average for a scientific publication
to be retracted? Does the time to retraction differ between senior and junior
researchers?
Post-retraction citations – Does the retraction of an article influence how the article
is cited, quantitatively and qualitatively? How soon can one detect the decrease
of citations after retraction?
Cause of concern – How was an eventually retracted article noticed in the first place? Are there any early signs that one can watch for to safeguard the integrity of scientific publications?
Reasons for retraction – What are the most common reasons for retraction? How are these common causes distributed? Should they be treated equally or differently as far as retraction is concerned?
Deliberate or accidental – Do scientists simply make mistakes in good faith, or do some of them deliberately cheat through misconduct?
Table 8.11 outlines some of the most representative and commonly studied aspects of retraction, including corresponding references of individual studies. Several studies found that on average it took about 2 years to retract a scientific publication, and even longer for articles for which senior researchers were responsible. Time to retraction was specifically studied in a survival analysis (Trikalinos et al. 2008). Based on retractions made in top-cited high-impact journals, it was found that the median survival time of eventually retracted articles was 28 months. In addition, it was found that it took much longer to retract articles authored by senior researchers, i.e. professors, lab directors, or researchers with more than 5 years of publication records, than those by junior ones.
Post-retraction citations were studied at different time points after retraction, ranging from the next calendar year, to 1 year after retraction, to 3 years after retraction. In general, citation counts tend to decrease after a retraction, but there are outliers in which citers were apparently unaware of a retraction even 23 years later.
Irreproducibility and unusually high levels of productivity are among the most common causes of initial concern. For example, Jan Hendrik Schön fabricated 17 papers in 2 years in Science and in Nature. He produced a new paper every 8 days at his peak (Steen 2011). Irreproducibility can be further explained in terms of an array of specific types of reasons, including types of errors and deliberate misconduct. It has been argued that, pragmatically speaking, fabricating data and results is perceived to be much more harmful than plagiarizing a description or an expression. For example, some researchers distinguish data plagiarism from text plagiarism and treat data plagiarism as scientific misconduct (Steen 2011). A sign that may differentiate deliberately fraudulent behavior from a good-faith mistake is whether it happens repeatedly with the same researcher. A higher rate of repeat offenders was indeed found among fraudulent papers than among erroneous papers (Steen 2011).
Studies of retraction have focused almost exclusively on the literature of medicine, where the stakes are high in terms of the safety of patients. PubMed and the Web of Science are the major resources used in these studies. Analysts in these studies typically searched for retracted articles and analyzed the content of retraction notices as well as other types of information. Most of these studies appear to rely on labor-intensive procedures with limited or no support for visual analytic tasks. Several potentially important questions have not been adequately addressed due to such constraints.
An article may cite a retracted article without realizing that it has been retracted. This type of citing article may compromise the integrity of the scientific literature. Studies of retraction have so far focused essentially on first-degree citing articles, i.e. articles that directly cited a retracted article. Citation counts, and whether it is evident that the citers were aware of the status of retracted articles, are the most commonly studied topics.
Given a published article $a_{t_0}$, retracted or not, a citation path between a subsequently published article $a_{t_k}$ and the original article can be defined in terms of pairwise citation relations as follows: $a_{t_0} \leftarrow a_{t_1} \leftarrow \cdots \leftarrow a_{t_k}$, where $a \leftarrow b$ denotes a direct citation reference from article $b$ to article $a$, $t_i < t_j$ if $i < j$, and the length of each segment of the path is minimized. In other words, $a_{t_i} \leftarrow a_{t_{i+1}}$ means that $a_{t_{i+1}}$ has no direct citation to any of the articles on the path prior to $a_{t_i}$. The length of a citation path is the number of direct citation links included in the path. Existing studies of citations to retracted articles are essentially limited to citation paths that contain one step only. Longer citation paths originating from a retracted article have not been studied. It is clear that the retraction of the first article is equivalent to the removal of the first article from a potentially still growing path such as $a_{t_0} \leftarrow a_{t_1} \leftarrow \cdots \leftarrow a_{t_k}$, because newly published articles may unknowingly cite the last article $a_{t_k}$ without questioning the validity of the potentially risky path. By k-degree post-retraction citation analysis, we introduce a study of such paths formed by $k$ pairwise direct citation links, as in $a_{t_0} \leftarrow a_{t_1} \leftarrow \cdots \leftarrow a_{t_k}$.
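As an illustration of this definition, the following minimal Python sketch enumerates such minimal citation paths of up to k steps from an origin article over a toy citation graph; the article identifiers and the graph itself are invented for illustration.

```python
from collections import defaultdict

# cites[a] = the set of articles that article a directly cites (its references).
cites = {
    "B": {"A"},        # B cites the retracted article A
    "C": {"B"},        # C cites B but not A: a minimal 2-step path A <- B <- C
    "D": {"A", "B"},   # D cites A directly, so A <- B <- D is not minimal
}

def k_degree_paths(origin, cites, max_k):
    """Enumerate minimal citation paths origin <- a1 <- ... <- ak, k <= max_k."""
    cited_by = defaultdict(set)              # invert the citation graph
    for citer, refs in cites.items():
        for ref in refs:
            cited_by[ref].add(citer)
    paths, frontier = [], [[origin]]
    for _ in range(max_k):
        next_frontier = []
        for path in frontier:
            for citer in cited_by[path[-1]]:
                # Minimality: the new citer must not directly cite any article
                # on the path prior to the one it extends.
                if not (cites.get(citer, set()) & set(path[:-1])):
                    next_frontier.append(path + [citer])
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths

for p in k_degree_paths("A", cites, max_k=3):
    print(" <- ".join(p))   # A <- B, A <- D, A <- B <- C
```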
In recent years, tremendous advances have been made in scientometrics (Boyack and Klavans 2010; Leydesdorff 2001; Shibata et al. 2007; Upham et al. 2010), science mapping (Chen 2006; Cobo et al. 2011; Small 1999; van Eck and Waltman 2010), and visual analytics (Pirolli 2007; Thomas and Cook 2005). Existing studies of citations to retracted articles have not yet incorporated these relatively new and more powerful techniques. Conversely, researchers who have access to the new generation of analytic tools have not applied these tools to the analysis of citation networks involving retracted articles.
It is important to find out how much a citing article's authors know about the current status of a retracted article when they refer to it. Previous studies have shown that this is not always clear from the text. A retracted article may have
been cited by hundreds of subsequently published articles. Manually examining
individual citation instances is time consuming and cognitively demanding. It is
an even more challenging task for analysts to synthesize emergent patterns from
individual citation instances and discern changes in terms of how a retracted article
has been cited over an extensive period of time because it is known that retracted
articles can be cited continuously for a long time after the retraction.
The provision of full text articles would make it possible to study the context
of citations to a retracted article with computational tools. It would also make it
possible to study higher-level patterns of citations and how they change over time
with reference to retraction events.
We address these three questions and demonstrate how visual analytic methods
and tools can be developed and applied to the study of citation networks and citation
contexts involving retracted articles. There are many other issues that are important to study, but we decided to focus on those that are relatively fundamental.
In the Web of Science, the title of a retracted article includes the suffix "Retracted article." As of 3/30/2012, there were 1,775 records of retracted articles. The distribution of the 1,775 retracted articles since 1980 shows that retractions appear to have peaked in 2007, with 254 retracted articles recorded in the Web of Science alone. On the other hand, it might still be too soon to rule out the possibility of more retrospective retractions.
It is relatively straightforward to calculate, on average, how long it takes for an article to be retracted after its publication. The time of retraction of an article is commonly retrievable from the amended title of the article. For example, if the title of an article published in 2010 is followed by a clause of the form (Retracted article. See vol. 194, pg. 447, 2011), then we know that the article was retracted in 2011. We loaded the data into a built-in relational database of CiteSpace and used the substring function in SQL to extract the year of retraction from the title by counting backwards, i.e. substring(title, 5, 4). We found that the mean time to retraction is 2.57 years, or 30 months, based on the retraction time of the 1,721 retracted articles, excluding 54 records with no retraction date. The median time to retraction is 2 years, i.e. 24 months (see Table 8.12).
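The chapter performs this extraction with SQL's substring function; as a rough equivalent, a minimal Python sketch using a regular expression is shown below. The amended titles here are made up for illustration.

```python
import re

# Titles amended by the Web of Science; these examples are invented.
titles = [
    "Some result in cell biology (Retracted article. See vol. 194, pg. 447, 2011)",
    "Another finding (Retracted article. See vol 375, pg 445, 2010)",
]

# The retraction year is the last 4-digit number before the closing parenthesis,
# which is what extracting a fixed-width substring from the end relies on.
for title in titles:
    match = re.search(r"(\d{4})\)\s*$", title)
    print(match.group(1) if match else "no retraction date")
```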
Figure 8.11 shows a plot of the survival function of retraction. The probability of surviving retraction decreases rapidly over the first few years after publication. In other words, the majority of retractions took place within the first few years. The probability of survival is below 0.2 for an eventually retracted article that is 4 years old.
Fig. 8.11 The survival function of retraction. The probability of surviving retraction for 4 years
or more is below 0.2
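As a sketch of the underlying computation, the following Python snippet estimates an empirical survival function from hypothetical times to retraction; with no censoring, this coincides with the Kaplan-Meier estimate. The times below are placeholders, not the 1,721 real observations.

```python
import numpy as np

# Hypothetical times to retraction in years; the real data would be the
# differences between retraction year and publication year.
times = np.array([0, 1, 1, 2, 2, 2, 3, 4, 5, 7])

# Empirical survival function: S(t) = P(time to retraction > t).
for t in range(0, int(times.max()) + 1):
    s = (times > t).mean()
    print(f"S({t}) = {s:.2f}")
```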
Table 8.13 lists the ten most highly cited retracted articles in the Web of Science. The 1998 Lancet paper by Wakefield et al. has the highest citation count, 740. The least cited of the ten has 366 citations. Three papers on the list were published in Science and two in Lancet. In the rest of this section, we will primarily focus on these high-profile retractions in terms of their citation contexts at both macroscopic and microscopic levels.
We are interested in depicting the context of retracted articles in a co-citation network of a broadly defined and relevant set of scientific publications. First, we retrieved 29,756 articles that cited 1,584 retracted articles in the Web of Science. We then used CiteSpace to generate a co-citation network based on the collective citation behavior of the 29,756 articles between 1998 and 2011. The top 50 % most cited references were included in the formation of the co-citation network, with an upper limit of 3,000 references per year. The resultant network contains 7,217 references and 155,391 co-citation links. A visualization of the co-citation network was generated
Fig. 8.12 An overview of co-citation contexts of retracted articles. Each dot represents a cited article. Red dots indicate retracted articles. The numbers in front of labels indicate their citation ranking. Potentially damaging retracted articles sit in the middle of an area that is otherwise free of red dots
and overlaid with the top-ten most cited retracted articles as well as other highly cited articles without retractions (see Fig. 8.12). Each dot in the visualization represents an article cited by the set of 29,756 citing articles. The dots in red are retracted articles. Lines between dots are co-citation links. The color of a co-citation link indicates the earliest time a co-citation between two articles was made: the earliest times are in blue; more recent times are in yellow and orange. The size of a dot, or a disc, is proportional to the citation count of the corresponding cited article. The top ten most cited retracted articles are labeled in the visualization. Retracted articles are potentially more damaging if they are located in the middle of densely co-cited articles. In contrast, isolated red dots are relatively less damaging. This type of visualization is valuable for highlighting how deeply a retracted article is embedded in the scientific literature.
Figure 8.13 shows a close-up view of the visualization shown in Fig. 8.12. The retracted article by Nakao N et al. on the left, for example, has a sizable red disc, indicating its numerous citations. Its position on a densely connected island of other articles indicates its relevance to a significant topic. Hwang WS (slightly to the right) and Potti A, at the lower right corner of the image, have similar citation context profiles. More profound impacts are likely to be found in interconnected citation contexts of multiple retracted articles.
Figure 8.14 shows an extensive representation of the citation context of the
retracted 2003 article by Nakao et al. First, 609 articles that cited the Nakao paper
were identified in the Web of Science. Next, 9,656 articles were retrieved because they have at least one common reference with the 609 direct citing articles.

Fig. 8.13 Red dots are retracted articles. Labeled ones are highly cited. Clusters are formed by co-citation strengths

Fig. 8.14 An extensive citation context of a retracted 2003 article by Nakao et al. The co-citation network contains 27,905 cited articles between 2003 and 2011. The black dot in the middle of the dense network represents the Nakao paper. Red dots represent 340 articles that directly cited the Nakao paper (there are 609 such articles in the Web of Science). Cyan dots represent 2,130 of the 9,656 articles that bibliographically coupled with the direct citers

The top 6,000 most cited references per year between 2003 and 2011 were chosen to form a co-citation network of 27,905 references and 2,162,018 co-citation links. The retracted Nakao paper is shown as the black dot in the middle of the map. The red dots are 340 direct citers of the total of 609 available in the Web of Science. The cyan dots share common references with the direct citers, not necessarily with the retracted article. The labels mark the most cited articles in this topic area, which are not themselves retracted articles.
The most cited of all the retracted articles in the Web of Science is the 1998 Lancet article by Wakefield et al. A citation burst of 0.05 was detected for this article. The article was partially retracted in 2004 and fully retracted in 2010. The Lancet's retraction notice in February 2010 noted that several elements of the 1998 paper were incorrect, contrary to the findings of an earlier investigation, and that the paper made false claims of an "approval" by the local ethics committee.
In order to find out what exactly was said when researchers cited the controversial article, we studied citation sentences, i.e. the sentences that contain references to the Wakefield paper. A set of full-text articles was obtained from Elsevier's Content Syndication (ConSyn), which contains 3,359 scholarly journal titles and 6,643 non-serial titles. Since the Wakefield paper is concerned with a claimed causal relation between the combined MMR vaccine and autism, we searched for full-text journal articles on autism and vaccine in ConSyn and found 1,250 relevant articles. The Wakefield paper was cited by 156 of the 1,250 full-text articles from the ConSyn collection. A total of 706 citation sentences were found in the 156 citing articles. We used the Lingo clustering method provided by Carrot2, an open source framework for building search clustering engines,3 to cluster these citation sentences into 69 clusters.
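Carrot2's Lingo algorithm is a Java component; as a rough analogue rather than a reimplementation of Lingo, the following Python sketch clusters a handful of invented citation sentences using TF-IDF vectors and k-means.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# A few invented citation sentences standing in for the 706 real ones.
sentences = [
    "The association between MMR vaccine and autism was postulated by Wakefield.",
    "Inflammatory bowel disease was linked to measles virus in the retracted study.",
    "MMR uptake declined after publication of the 1998 Lancet paper.",
    "No sound scientific evidence supported the claimed vaccine-autism link.",
]

# Vectorize each sentence, then group similar sentences into clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```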
Figure 8.15 is a visualization of the 69 clusters formed by the 706 sentences that cited the 1998 Lancet paper. The visualization is called FoamTree in Carrot2. See Chap. 9 for more details on Carrot2. Clusters with the largest areas represent the
most prominent clusters of phrases used when researchers cited the 1998 paper. For
example, inflammatory bowel disease, mumps and rubella, and association between
MMR vaccine and autism are the central topics of the citations. These topics indeed
characterize the role of the retracted Lancet paper, although in this study we did
not differentiate positive and negative citations. Identifying the orientation of an
instance of citation from a citation context, for example, the citing sentence and
its surrounding sentences, is a very challenging task even for an intelligent reader
because the position of the argument becomes clear only when a broader context is
taken into account, for example, after reading the entire paragraph in many cases.
In addition to aggregating citation sentences into clusters at a higher level of abstraction, we further developed a timeline visualization that depicts year-by-year flows of topics, helping analysts discern changes associated with
3 http://project.carrot2.org/
Fig. 8.15 69 clusters formed by 706 sentences that cited the 1998 Wakefield paper
Fig. 8.16 Divergent topics in a topic-transition visualization of the 1998 Wakefield et al. article
Table 8.14 Specific sentences that cite the eventually retracted 1998 Lancet paper by Wakefield
et al.
Year of citation Ref Sentence
1998 1 The report by Andrew Wakefield and colleagues confirms the clinical
observations of several paediatricians, including myself, who have
noted an association between the onset of the autistic spectrum and
the development of disturbed bowel habit
1998 1 Looking at the ages of the children in Wakefield’s study, it seems that
most of them would have been at an age when they could well have
been vaccinated with the vaccine that has since been withdrawn
1998 1 We are concerned about the potential loss of confidence in the mumps,
measles, and rubella (MMR) vaccine after publication of Andrew
Wakefield and colleagues’ report (Feb 28, p 637), in which these
workers postulate adverse effects of measles-containing vaccines
1998 1 We were surprised and concerned that the Lancet published the paper
by Andrew Wakefield and colleagues in which they alluded to an
association between MMR vaccine and a nonspecific syndrome,
yet provided no sound scientific evidence
2001 34 In 1998, Wakefield et al. [34] have published a second paper including
two ideas: that autism may be linked to a form of inflammatory
bowel disease and that this new syndrome is associated with
measles–mumps–rubella (MMR) immunization
2007 5 Vaccine scares in recent years have linked MMR vaccination with
autism and a variety of bowel conditions, and this has had an
adverse impact on MMR uptake [5]
2007 5 When comparing MMR uptake rates before (1994–1997) and after
(1999–2000) the 1998 Wakefield et al. article [5] it is seen that
prior to 1998 Asian children had the highest uptake
2010 2 This addresses a concern raised by a now-retracted article by
Wakefield et al. and adds to the body of evidence that has failed to
show a relationship between measles vaccination and autism (1, 2)
in terms of the number of related topics in the previous year. A convergent topic sums up elements from multiple previously separate topics. In 1999, the topic of Rubella MMR Vaccination is highlighted by an explicit label because it is associated with several distinct topics in 1998. In 2004, the year the Lancet partially retracted the Wakefield paper, the prominent convergent topic was Developmental Disorders. The visualization shows that numerous distinct topics in 2003 merged into this convergent topic in 2004. We expect that this type of topic-flow visualization can enable new ways of analyzing and studying the dynamics of topic transitions in citations to a particular article.
Table 8.14 lists examples of sentences that cited the 1998 Lancet paper by
Wakefield et al. For example, as early as 1998, researchers were concerned about the
lack of sound scientific evidence to support the claimed association between MMR
vaccine and inflammatory bowel disease. The adverse impact on MMR uptake is
also evident in these citation sentences. Many more analytic tasks may become
feasible with this type of text and pattern-driven analyses at multiple levels of
granularity.
8.3.5 Summary
The perceived risk introduced by retracted articles alone is the tip of an iceberg. Many high-profile retracted articles are interwoven deeply with the scientific literature, and in many cases they are embedded in fast-moving, significant lines of research. It is essential to raise awareness that much of the potential damage introduced by a retracted article is hidden and likely to grow quietly long after the retraction, via indirect citations. The original awareness of the invalidity of a retracted article may be lost in subsequent citations. New tools and services are needed so that researchers and analysts can easily verify the status of a citation genealogy and ensure that the current status of its origin is clearly understood. Such tools should become part of the workflow of journal editors and publishers.
From a visual analytic point of view, it is essential to bring in more techniques and tools that can support analytic and sense-making tasks over dynamic and unstructured information, and that allow analysts and researchers to move back and forth freely across multiple levels of analytic and decision-making tasks. The ability to blaze trails of evidence and arguments through an evolving space of knowledge is a critical step for the creation of scientific knowledge and for maintaining a trustworthy documentation of the collective intelligence.
8.4 Global Science Maps and Overlays

Science mapping has made remarkable advances in the past decade. Powerful
techniques have become increasingly accessible to researchers and analysts. In this
chapter, we present some of the most representative efforts towards generating maps
of science. At the highest level, the goal is to identify how scientific disciplines
are interrelated, for example, how medicine and physics are connected, what topics are shared by chemistry and geology, and how federal funding is distributed across the landscape of disciplines. Drawing a boundary line for a discipline is challenging; drawing a boundary line for a constantly evolving discipline is even more so. We will highlight some recent examples of how researchers deal with such challenges.
Derek de Solla Price was probably the first person to anticipate that the Science Citation Index (SCI) might contain the information needed to reveal the structure of science. Price suggested that the appropriate units of analysis would be journals, and that aggregating journals by journal-journal citations would reveal the disciplinary structure of science. An estimation mentioned in (Leydesdorff and Rafols 2009) sheds light on the density of a science map at the journal level. Among the 6,164 unique journals in the 2006 SCI, there were only 1,201,562 pairs of journal citation relations out of the 37,994,896 possible connections. In other words, the density of the global science structure is 3.16 %.4 How stable is such a structure at the journal level? How volatile is the structure of science at the document level or a topic level? Where are activities concentrated or distributed with reference to a discipline, an institution, or an individual?
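The arithmetic behind this density estimate can be verified directly; the following snippet assumes, as the footnote does, a directed graph in which self-links are counted among the possible connections.

```python
# Verifying the journal-level density figure from Leydesdorff and Rafols (2009).
journals = 6164
observed_links = 1_201_562
possible_links = journals ** 2          # 37,994,896 possible directed links
print(f"density = {observed_links / possible_links:.2%}")  # ~3.16%
```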
A widely seen global map of science is the UCSD map, depicting 554 clusters of journals and how they are interconnected as sub-disciplines of science (see Fig. 8.17). The history of the UCSD map is described in (Borner et al. 2012).
The map was first created by Richard Klavans and Kevin Boyack in 2007 for the
University of California San Diego (UCSD). The source data for the map was
a combination of Thomson Reuters Web of Science (2001–2004) and Elsevier’s
Scopus (2001–2005). Similarities between journals were computed in 18 different
ways to form matrices of journal-journal connections. These matrices were then
combined to form a single network of 554 sub-disciplines in terms of clusters of
journals. The layout of the map was generated using the 3D Fruchterman-Reingold
layout function in Pajek. The spherical map was then unfolded to a 2D map on a
flat surface with a Mercator projection. Each cluster was manually labeled based on
journal titles in the cluster. The 2D version of the map was further simplified to a
1D circular map – the circle map. The 13 labeled regions were ordered using factor
analysis. The circle map is used in Elsevier’s SciVal Spotlight.
The goal of the UCSD map was to provide a base map for research evaluation.
With 554 clusters, it provides more categories than the subject categories of the Web
of Science. While the original goal was research evaluation, the map is being
used as a base map to superimpose overlays of additional information in systems
such as Sci2 and VIVO.5 Soon after the creation of the UCSD map, Richard Klavans
4 Assuming a directed graph of 6,164 journals (6,164 × 6,164 = 37,994,896 possible connections).
5 http://ivl.cns.iu.edu/km/pres/2012-borner-portfolio-analysis-nih.pdf
Fig. 8.17 The UCSD map of science. Each node in the map is a cluster of journals. The clustering
was based on a combination of bibliographic couplings between journals and between keywords.
Thirteen regions are manually labeled (Reproduced with permission)
and Kevin Boyack came to the conclusion that research evaluation requires maps
with clusters at the article level rather than at the journal level.
The UCSD map was generated for UCSD to show its research strengths and competencies. Although the discipline-level map characterizes the global structure of the scientific literature, much more detail is necessary to quantify research strengths at UCSD. A similar procedure was applied to generate an article-level map as opposed to a journal-level map. Clusters of articles were calculated based on co-citations. In addition to the discipline-level circle map, the paper-level clustering provides much more detailed classification information. In contrast to the 554 journal clusters, the paper-level clustering of co-cited references identified over 84,000 clusters, which are called paradigms (Fig. 8.18).
In a 2009 Scientometrics paper (Boyack 2009), Boyack described how a discipline-level map can be used to identify potential collaborations. He collected 1.35 million
papers from 7,506 journals and 1,206 conference proceedings. These papers contain
29.23 million references. Similarities between references were calculated in terms
of bibliographic coupling. These reference-level similarities were then aggregated to
obtain similarities between journals. For each journal, the top 15 most similar jour-
nals in terms of bibliographic coupling were retained for generating the final map.
The map layout step served two purposes: one was to optimize the arrangement of the journals so that the distance between journals on the map is proportional to their dissimilarity; the other was to group individual journals into clusters based on the distances generated by the layout process.

Fig. 8.18 Areas of research leadership for China. Left: A discipline-level circle map. Right: A paper-level circle map embedded in a discipline circle map. Areas of research leadership are located at the average position of corresponding disciplines or paradigms. The intensity of the nodes indicates the number of leadership types found: Relative Publication Share (RPS), Relative Reference Share (RRS), or state-of-the-art (SOA) (Reprinted from Klavans and Boyack 2010 with permission)
The map layout was made using the VxOrd algorithm, which ignores long-range links in its layout process. The proximity of nodes in the resultant layout was used to identify clusters using a modified single-linkage clustering algorithm. In single linkage, the distance between two clusters is computed as the distance between the two closest elements of the two clusters. The resultant map contains 812 clusters of journals and conference proceedings (see Fig. 8.19). The map was used as a base map for a variety of overlays. In particular, the presence of an institution can be depicted with this map. A cluster with a clear circle contains journal papers only. In contrast, a cluster with a shaded circle contains proceedings papers. As shown in the map, the majority of proceedings papers are located between computer science (CS) and Physics. Disciplines such as Virology are almost entirely dominated by journal papers.
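The single-linkage criterion itself is compact enough to state in code. Here is a minimal sketch over hypothetical 2D layout coordinates, mirroring the idea of clustering nodes by their proximity in the map layout.

```python
from math import dist

def single_linkage_distance(cluster_a, cluster_b):
    """Smallest pairwise distance between members of two clusters."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

cluster_a = [(0.0, 0.0), (1.0, 0.5)]     # hypothetical journal coordinates
cluster_b = [(1.5, 0.5), (4.0, 3.0)]
print(single_linkage_distance(cluster_a, cluster_b))  # 0.5
```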
More recently, Klavans and Boyack created a new global map of science based on Scopus 2010 data. The new Scopus 2010 map is a paper-level map, representing 116,000 clusters of 1.7 million papers (see Fig. 8.20). The Scopus 2010 map is a hybrid in that the clusters were generated from citations while the layout was based on text similarity. The similarities between clusters were calculated based on words from the titles and abstracts of the papers in each cluster, using the Okapi BM25 text similarity measure. The clustering step did not use a hybrid similarity based on both text and citations simultaneously. For each cluster, the 5–15 clusters with the strongest connections were retained. Labels of clusters were added manually.
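Okapi BM25 is a standard ranking function; the following minimal sketch implements a common variant of it (with a smoothed, non-negative IDF) over a toy corpus in which each "document" stands in for the pooled titles and abstracts of one cluster. This illustrates the measure itself, not the Klavans-Boyack pipeline.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of tokenized doc for tokenized query over a corpus."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)   # average doc length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)             # document frequency
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]                                     # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["stem", "cell", "reprogramming"], ["citation", "network", "analysis"]]
print(bm25_score(["stem", "cell"], corpus[0], corpus))
```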
Fig. 8.19 A discipline-level map of 812 clusters of journals and proceedings. Each node is a
cluster. The size of a node represents the number of papers in the cluster (Reprinted from Boyack
2009 with permission)
Just as we described earlier in the book with regard to geographic base maps and thematic overlays, global maps of scientific disciplines provide a convenient base map on which to depict additional thematic features. Figure 8.21 shows an example of adding a thematic overlay to the Scopus 2010 base map. The overlay superimposes a layer of orange dots on clusters in the Scopus 2010 map. The orange dots mark the papers that acknowledged the support of grants from the National Cancer Institute (NCI). The overlay provides an intuitive overview of the scope of NCI grants in the context of research areas.
Fig. 8.20 The Scopus 2010 global map of 116,000 clusters of 1.7 million articles (Courtesy of
Richard Klavans and Kevin Boyack, reproduced with permission)
Fig. 8.21 An overlay on the Scopus 2010 map shows papers that acknowledge NCI grants
(Courtesy of Kevin Boyack, reproduced with permission)
The work that introduced science overlay maps, a paper published in February 2009 (Leydesdorff and Rafols 2009), was featured as a fast breaking paper by Thomson Reuters' ScienceWatch in December 2009.6 Fast breaking papers are publications that have the largest percentage increase in citations in their field from one bimonthly update to the next.
The overlay method has two steps: (1) creating a global map of science as the base map, and (2) superimposing a specific set of publications, for example, from a given institution or topic. Along with the method, the researchers have made a set of tools available so that anyone can generate his or her own science overlay maps. The toolkit is freely available.7
6 http://archive.sciencewatch.com/dr/fbp/2009/09decfbp/09decfbpLeydET/
7 http://www.leydesdorff.net/overlaytoolkit
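Step (2) of the overlay method can be reduced to counting how a record set distributes over the base map's units. A minimal sketch follows, with invented records and an invented category set standing in for the Web of Science categories of a base map.

```python
from collections import Counter

# Categories of the base map (a hypothetical subset for illustration).
base_map_categories = {"Clinical Medicine", "Chemistry", "Physics", "Materials Sci"}

# Invented bibliographic records for one institution.
records = [
    {"title": "Paper 1", "categories": ["Physics", "Materials Sci"]},
    {"title": "Paper 2", "categories": ["Chemistry"]},
    {"title": "Paper 3", "categories": ["Physics"]},
]

# Count category occurrences; node sizes on the overlay would be
# proportional to these counts.
overlay = Counter(c for r in records for c in r["categories"]
                  if c in base_map_categories)
print(overlay.most_common())
```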
[Fig. 8.22 region labels: Infectious Diseases, Environ Sci & Tech, Clinical Med, Mech Eng, Chemistry, Materials Sci, Biomed Sci, Psychological Sci., Physics, Math Methods]
Fig. 8.22 A global science overlay base map. Nodes represent Web of Science Categories. Grey
links represent degree of cognitive similarity (Reprinted from Rafols et al. 2010 with permission)
8 http://idr.gatech.edu/maps.php
Fig. 8.23 An interactive science overlay map of GlaxoSmithKline's publications between 2000 and 2009. The red circles are GSK's publications in clinical medicine (shown on mouse-over of the Clinical Medicine label) (Reprinted from Rafols et al. 2010 with permission, available at http://idr.gatech.edu/usermapsdetail.php?id=61)
an overlay. The strength of this overlay approach is that one can easily identify both an institution whose references spread over multiple disciplinary regions and an institution with a much more focused disciplinary profile.
The flexibility of the science overlay maps has been demonstrated in studies
of interdisciplinarity of fields over time (Porter and Rafols 2009), comparing
departments, universities and R&D bases of large corporations (Rafols et al. 2010),
and tracing the diffusion of research topics over science (Leydesdorff and Rafols
2011). Figure 8.24 shows a more recent base map generated by Loet Leydesdorff in VOSviewer.
Many citation maps are designed to show either the sources or the targets of citations in a single display, but not both. The primary reason is that a representation with a mix of citing and cited articles may considerably increase the complexity of its structure and dynamics. There does not seem to be a clear gain in combining them in a single view. Although it is conceivable that a combined structure may be desirable in situations such as a heated debate, researchers are in general more concerned with differentiating various arguments before considering how to combine them.
The Butterfly, designed by Jock Mackinlay and his colleagues at Xerox, shows both ends in the same view, but the focus is at the individual paper level rather than at a macroscopic level of thousands of journals (Mackinlay et al. 1995). Eugene Garfield's HistCite depicts direct citations in the literature. However, as the number of citations increases, the network tends to become cluttered, which is a common problem for network representations.
We introduce a dual-map overlay design that depicts both the citing overlay and the cited overlay maps in the same view. The dual-map overlay has several advantages over a single overlay map. First, it represents a citation instance completely: one can see at a glance where it originates and where it points. Second, it makes it easy to compare patterns of citations made by distinct groups of authors, for example, authors from different organizations, or authors from the same organization at different points in time. Third, it opens up more research questions that can be addressed in new ways. For example, it becomes possible to study interdisciplinarity on both the source and target sides, and to track the movements of scientific frontiers in terms of their footprints in both base maps.
The construction of a dual-map base shares the initial steps of a single-map base but differs in later steps. Once the coordinates are available for both the citing and cited matrices of journals, a dual-map overlay can be constructed. It is not necessary to have cluster information, but additional functions become possible if cluster information is available. In the rest of the description, we assume that at least one set of clusters is available for each matrix.

Fig. 8.25 The Blondel clusters in the citing journal map (left) and the cited journal map (right). The overlapping polygons suggest that the spatial layout and the membership of clusters still contain a considerable amount of uncertainty. Metrics calculated based on the coordinates need to take the uncertainty into account

In this example, clusters are obtained by applying the Blondel
clustering algorithm. Figure 8.25 is a screenshot of the dual-map display, containing
a base map of citing journals (left) and a base map of cited journals (right).
For each journal in the citing network, its cluster membership is stored with
the journal along with its coordinates. The coordinates may be obtained from a
network visualization program such as VOSviewer, Gephi, or Pajek. Members of
each cluster are painted in the map with the same color.
A number of overlays can be added to the dual-map base. Each overlay requires a set of bibliographic records that contain citation information, such as records retrieved from the Web of Science. The smallest set may contain a single article; there is no limit to the size of the largest set. With journal overlay maps, each citation instance is represented by an arc from its source journal in the citing base map to its target journal in the cited base map. Arcs from the same set are displayed in the same color, chosen by the user, so that citation patterns from distinct sets can be distinguished by their unique colors.
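A minimal rendering sketch conveys the idea: with invented journal coordinates for the two base maps placed side by side, each citation instance becomes a line from its source journal to its target journal. The journal names and positions below are placeholders.

```python
import matplotlib.pyplot as plt

# Citing base map on the left (x < 0), cited base map on the right (x > 0).
citing_coords = {"J_SOURCE_1": (-3, 2), "J_SOURCE_2": (-3, -1)}
cited_coords = {"J_TARGET_1": (3, 1), "J_TARGET_2": (3, -2)}
citations = [("J_SOURCE_1", "J_TARGET_1"), ("J_SOURCE_2", "J_TARGET_1"),
             ("J_SOURCE_2", "J_TARGET_2")]

fig, ax = plt.subplots()
for name, (x, y) in {**citing_coords, **cited_coords}.items():
    ax.plot(x, y, "o")
    ax.annotate(name, (x, y))
for src, dst in citations:                  # one arc per citation instance
    (x1, y1), (x2, y2) = citing_coords[src], cited_coords[dst]
    ax.plot([x1, x2], [y1, y2], color="blue", alpha=0.5)
plt.show()
```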
Figure 8.26 shows a dual-map display of citations found in publications of two
iSchools between 2003 and 2012. The citation arcs made by the iSchool at Drexel
University are colored in blue, whereas the arcs made by the School of Information
Studies at Syracuse are in magenta. At a glance, the blue arcs on the upper part of
the map suggest that Drexel researchers published in these areas, whereas Syracuse
researchers made few publications in these areas. The dual-map overlay shows that
Drexel researchers not only published in the areas that correspond to mathematics
and systems journals, Drexel researchers’ publications in journals in other areas are
also influenced by journals related to systems, computing, and mathematics. The
overlapping arcs in the lower half of the map indicate that the two institutions share
their core journals in terms of where they publish.
Fig. 8.26 Citation arcs from the publications of Drexel’s iSchool (blue arcs) and Syracuse School
of Information Studies (magenta arcs) reveal where they differ in terms of both intellectual bases
and research frontiers
As one more example, Fig. 8.27 shows a comparison between two sets of records. One is a set of papers on the h-index (green, mostly appearing in the upper half), and the other is a set of papers citing the 2006 JASIST paper on CiteSpace II, mostly originating from the lower right part of the base map of citing journals. This image shows that h-index research is widespread, published especially in physics journals (Blondel cluster #5) and citing journals in similar categories. In contrast, papers citing CiteSpace II are concentrated in a few journals, but they cite journals across a wide range of clusters.
In summary, global science maps provide base maps that enable interactive
overlays. Dual-map overlays display the citing and cited journals in the same view,
which makes it easier to compare the citation behaviors of different groups in terms
of their source journals and target journals.
References
Chin MH, Mason MJ, Xie W, Volinia S, Singer M, Peterson C et al (2009) Induced pluripotent stem
cells and embryonic stem cells are distinguished by gene expression signatures. Cell Stem Cell
5(1):111–123
Chubin DE (1994) Grants peer-review in theory and practice. Eval Rev 18(1):20–30
Chubin DE, Hackett EJ (1990) Paperless science: peer review and U.S. science policy. State
University of New York Press, Albany
Cobo MJ, Lopez-Herrera AG, Herrera-Viedma E, Herrera F (2011) Science mapping software
tools: review, analysis, and cooperative study among tools. [Review]. J Am Soc Info Sci
Technol 62(7):1382–1402
Cuhls K (2001) Foresight with Delphi surveys in Japan. [Article]. Technol Anal Strateg Manag
13(4):555–569
Dewett T, Denisi AS (2004) Exploring scholarly reputation: it’s more than just productivity.
[Article]. Scientometrics 60(2):249–272
Discher DE, Mooney DJ, Zandstra PW (2009) Growth factors, matrices, and forces combine and
control stem cells. Science 324(5935):1673–1677
Ebert AD, Yu J, Rose FF, Mattis VB, Lorson CL, Thomson JA et al (2009) Induced pluripotent
stem cells from a spinal muscular atrophy patient. Nature 457(7227):277–280. doi:10.1038/
nature07677
Fauconnier G, Turner M (1998) Conceptual integration networks. Cognit Sci 22(2):133–187
Feng Q, Lu S-J, Klimanskaya I, Gomes I, Kim D, Chung Y et al (2010) Hemangioblastic
derivatives from human induced pluripotent stem cells exhibit limited expansion and early
senescence. Stem Cells 28(4):704–712
Fleming L, Bromiley P (2000) A variable risk propensity model of technological risk taking. Paper
presented at the applied statistics workshop. Retrieved from http://courses.gov.harvard.edu/
gov3009/fall00/fleming.pdf
Garfield E (1955) Citation indexes for science: a new dimension in documentation through
association of ideas. Science 122(3159):108–111
Gimble JM, Katz AJ, Bunnell BA (2007) Adipose-derived stem cells for regenerative medicine.
Circ Res 100(9):1249–1260
Glotzbach JP, Wong VW, Gurtner GC, Longaker MT (2011) Regenerative medicine. Curr Probl
Surg 48(3):148–212
Häyrynen M (2007) Breakthrough research: funding for high-risk research at the Academy of
Finland. The Academy of Finland, Helsinki
Hettich S, Pazzani MJ (2006) Mining for proposal reviewers: lessons learned at the National
Science Foundation. Paper presented at the KDD’06
Hilbe JM (2011) Negative binomial regression, 2nd edn. Cambridge University Press, Cambridge
Hirsch JE (2007) Does the h index have predictive power? Proc Natl Acad Sci
104(49):19193–19198
Hong H, Takahashi K, Ichisaka T, Aoi T, Kanagawa O, Nakagawa M et al (2009) Suppression of in-
duced pluripotent stem cell generation by the p53–p21 pathway. Nature 460(7259):1132–1135.
doi:10.1038/nature08235
Hsieh C (2011) Explicitly searching for useful inventions: dynamic relatedness and the costs of
connecting versus synthesizing. Scientometrics 86(2):381–404
Kaji K, Norrby K, Paca A, Mileikovsky M, Mohseni P, Woltjen K (2009) Virus-free induction of
pluripotency and subsequent excision of reprogramming factors. Nature 458(7239):771–775.
doi:10.1038/nature07864
Kakuk P (2009) The legacy of the Hwang case: research misconduct in biosciences. Sci Eng Ethics
15:545–562
Khang G, Kim SH, Kim MS, Rhee JM, Lee HB (2007) Recent and future directions of stem cells
for the application of regenerative medicine. Tissue Eng Regen Med 4(4):441–470
Kim D, Kim C-H, Moon J-I, Chung Y-G, Chang M-Y, Han B-S et al (2009a) Generation of human
induced pluripotent stem cells by direct delivery of reprogramming proteins. Cell Stem Cell
4(6):472–476
Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA
103(23):8577–8582
O’Brien CA, Pollett A, Gallinger S, Dick JE (2007) A human colon cancer cell capable of
initiating tumour growth in immunodeficient mice. Nature 445(7123):106–110.
doi:10.1038/nature05372
Okita K, Nakagawa M, Hyenjong H, Ichisaka T, Yamanaka S (2008) Generation of mouse induced
pluripotent stem cells without viral vectors. Science 322(5903):949–953
Patterson M, Chan DN, Ha I, Case D, Cui Y, Handel BV et al (2012) Defining the nature of human
pluripotent stem cell progeny. Cell Res 22(1):178–193
Persson O (2010) Are highly cited papers more international? Scientometrics 83(2):397–401
Pfeifer MP, Snodgrass GL (1990) The continued use of retracted, invalid scientific literature. J Am
Med Assoc 263:1420–1423
Phinney DG, Prockop DJ (2007) Concise review: mesenchymal stem/multipotent stromal cells:
the state of transdifferentiation and modes of tissue repair—current views. Stem Cells
25(11):2896–2902
Pirolli P (2007) Information foraging theory: adaptive interaction with information. Oxford
University Press, Oxford
Pittenger MF, Mackay AM, Beck SC, Jaiswal RK, Douglas R, Mosca JD et al (1999) Multilineage
potential of adult human mesenchymal stem cells. Science 284(5411):143–147
Polak DJ (2010) Regenerative medicine. Opportunities and challenges: a brief overview. J R Soc
Interface 7:S777–S781
Polykandriotis E, Popescu LM, Horch RE (2010) Regenerative medicine: then and now – an update
of recent history into future possibilities. J Cell Mol Med 14(10):2350–2358
Porter AL, Rafols I (2009) Is science becoming more interdisciplinary? Measuring and mapping
six research fields over time. Scientometrics 81(3):719–745
Price DD (1965) Networks of scientific papers. Science 149:510–515
Rafols I, Porter AL, Leydesdorff L (2010) Science overlay maps: a new tool for research policy
and library management. J Am Soc Info Sci Technol 61(9):1871–1887
Ricci-Vitiani L, Lombardi DG, Pilozzi E, Biffoni M, Todaro M, Peschle C et al (2007) Iden-
tification and expansion of human colon-cancer-initiating cells. Nature 445(7123):111–115.
doi:10.1038/nature05384
Service RF (2002) Bell Labs fires star physicist found guilty of forging data. Science 298:30–31
Shibata N, Kajikawa Y, Matsushima K (2007) Topological analysis of citation networks to discover
the future core articles. J Am Soc Info Sci Technol 58(6):872–882
Shibata N, Kajikawa Y, Takeda Y, Sakata I, Matsushima K (2011) Detecting emerging research
fronts in regenerative medicine by the citation network analysis of scientific publications.
Technol Forecast Soc Change 78:274–282
Slaughter BV, Khurshid SS, Fisher OZ, Khademhosseini A, Peppas NA (2009) Hydrogels in
regenerative medicine. Adv Mater 21(32–33):3307–3329
Small H (1999) Visualizing science by citation mapping. J Am Soc Inf Sci 50(9):799–813
Soldner F, Hockemeyer D, Beard C, Gao Q, Bell GW, Cook EG et al (2009) Parkinson’s
disease patient-derived induced pluripotent stem cells free of viral reprogramming factors. Cell
136(5):964–977
Sox HC, Rennie D (2006) Research misconduct, retraction, and cleansing the medical literature: lessons from the Poehlman case. Ann Intern Med 144:609–613
Stadtfeld M, Apostolou E, Akutsu H, Fukuda A, Follett P, Natesan S et al (2010) Aberrant silencing
of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells. Nature
465(7295):175–181. doi:10.1038/nature09017
Steen RG (2011) Retractions in the scientific literature: do authors deliberately commit research
fraud? J Med Ethics 37:113–117
Swanson DR (1986a) Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect
Biol Med 30:7–18
Swanson DR (1986b) Undiscovered public knowledge. Libr Q 56(2):103–118
Takahashi K, Yamanaka S (2006) Induction of pluripotent stem cells from mouse embryonic and
adult fibroblast cultures by defined factors. Cell 126(4):663–676
Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K et al (2007) Induction of
pluripotent stem cells from adult human fibroblasts by defined factors. Cell 131(5):861–872
Takeda Y, Kajikawa Y (2010) Tracking modularity in citation networks. Scientometrics 83(3):783
Thomas J, Cook K (2005) Illuminating the path, the research and development agenda for visual
analytics. IEEE CS Press, Los Alamitos
Thomson JA, Itskovitz-Eldor J, Shapiro SS, Waknitz MA, Swiergiel JJ, Marshall VS et al (1998)
Embryonic stem cell lines derived from human blastocysts. Science 282(5391):1145–1147
Tichy G (2004) The over-optimism among experts in assessment and foresight. [Article]. Technol
Forecast Soc Change 71(4):341–363
Trikalinos NA, Evangelou E, Ioannidis JPA (2008) Falsified papers in high-impact journals were
slow to retract and indistinguishable from nonfraudulent papers. J Clin Epidemiol 61:464–470
Upham SP, Rosenkopf L, Ungar LH (2010) Positioning knowledge: schools of thought and new
knowledge creation. Scientometrics 83:555–581
van Dalen HP, Henkens K (2005) Signals in science: on the importance of signaling in gaining attention in science. Scientometrics 64(2):209–233
van Eck NJ, Waltman L (2010) Software survey: VOSviewer, a computer program for bibliometric
mapping. [Article]. Scientometrics 84(2):523–538
Vierbuchen T, Ostermeier A, Pang ZP, Kokubu Y, Südhof TC, Wernig M (2010) Direct conver-
sion of fibroblasts to functional neurons by defined factors. Nature 463(7284):1035–1041.
doi:10.1038/nature08797
von Luxburg U (2006) A tutorial on spectral clustering. From http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Luxburg07_tutorial_4488%5b0%5d.pdf
Wager E, Williams P (2011) Why and how do journals retract articles? An analysis of Medline
retractions 1988–2008. J Med Ethics 37:567–570
Wakefield AJ, Murch SH, Anthony A, Linnell J, Casson DM, Malik M et al (1998) Ileal-lymphoid-
nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children
(Retracted article. See vol 375, pg 445, 2010). Lancet 351(9103):637–641
Walters GD (2006) Predicting subsequent citations to articles published in twelve crime-
psychology journals: author impact versus journal impact. Scientometrics 69(3):499–510
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature
393(6684):440–442
Weeber M (2003) Advances in literature-based discovery. J Am Soc Info Sci Technol
54(10):913–925
Wernig M, Meissner A, Foreman R, Brambrink T, Ku M, Hochedlinger K et al (2007) In vitro
reprogramming of fibroblasts into a pluripotent ES-cell-like state. Nature 448(7151):318–324.
doi:10.1038/nature05944
Woltjen K, Michael IP, Mohseni P, Desai R, Mileikovsky M, Hamalainen R et al (2009)
piggyBac transposition reprograms fibroblasts to induced pluripotent stem cells. Nature
458(7239):766–770. doi:10.1038/nature07863
Young RA (2011) Control of the embryonic stem cell state. Cell 144(6):940–954
Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S et al (2007) Induced
pluripotent stem cell lines derived from human somatic cells. Science 318(5858):1917–1920
Yu J, Hu K, Smuga-Otto K, Tian S, Stewart R, Slukvin II et al (2009) Human induced pluripotent
stem cells free of vector and transgene sequences. Science 324(5928):797–801
Zeileis A, Kleiber C, Jackman S (2011) Regression models for count data in R. from http://cran.r-
project.org/web/packages/pscl/vignettes/countreg.pdf
Zhao T, Zhang Z-N, Rong Z, Xu Y (2011) Immunogenicity of induced pluripotent stem cells.
Nature 474(7350):212–215. doi:10.1038/nature10135
Zhou H, Wu S, Joo JY, Zhu S, Han DW, Lin T et al (2009) Generation of induced pluripotent stem
cells using recombinant proteins. Cell Stem Cell 4(5):381–384
Chapter 9
Visual Analytics
9.1 CiteSpace
CiteSpace is a Java application for visualizing and analyzing emerging trends and patterns in the scientific literature. The design of CiteSpace is motivated by two ambitious goals. One is to provide a computational alternative that supplements traditional systematic reviews and surveys of a body of scientific literature. The other is to provide an analytic tool with which one can study the structure and dynamics
of scientific paradigms in the sense defined by Thomas Kuhn. The primary source
of input for CiteSpace is a body of scientific literature, namely bibliographic records
from the Web of Science or full-text versions of publications.
The general assumption is that the study of such input data will allow us to address two fundamental questions that systematic reviews and surveys intend to address:
1. What is the persistent core of the literature?
2. What are the transient trends that have appeared or are emerging in the literature?
The persistent core of a body of literature corresponds to the intellectual base of
a field of study. The transient trends correspond to scientific frontiers. Researchers
have long recognized that scientific knowledge can be seen as the constant movement
of scientific frontiers. The state of the art today may or may not survive into the
future; only time can tell whether an exciting new theory will earn its place in the
history of science.
We use co-citations of references as the basic organizing mechanism. In other
words, we construct a global structure from local details. Each individual scientist
or domain expert provides input by publishing work in the literature. As they cite
previously published works, authors leave footprints that carry information about
their preferences, intents, criticisms, and interpretations. In this way, citations
provide a valuable source of information for identifying and measuring the value of
a scientific idea, a discovery, or a theory.
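To make this organizing mechanism concrete, the following minimal sketch in Python counts how often pairs of references are cited together. It is a sketch only: the record format is a hypothetical stand-in loosely modeled on a bibliographic export, and CiteSpace's actual pipeline is far more elaborate.

from collections import defaultdict
from itertools import combinations

def cocitation_counts(records):
    """Count how often each pair of references is cited together.

    `records` is a hypothetical list of bibliographic records; each
    record's "references" field lists the works it cites.
    """
    counts = defaultdict(int)
    for record in records:
        # Every pair of references cited by the same paper is co-cited once.
        for a, b in combinations(sorted(set(record["references"])), 2):
            counts[(a, b)] += 1
    return counts

records = [
    {"references": ["Kuhn1962", "Price1965", "Small1973"]},
    {"references": ["Price1965", "Small1973"]},
]
print(cocitation_counts(records)[("Price1965", "Small1973")])  # prints 2

The pairwise counts then serve as edge weights of the co-citation network from which the global structure is derived.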
Fig. 9.2 CiteSpace labels clusters with title terms of the articles that cite the corresponding clusters
Fig. 9.3 Citations over time are shown as tree rings. Tree rings in red depict the years in which an
accelerated citation rate was detected (a citation burst). Three areas emerged from the visualization
Cluster labels, drawn from the titles of citing articles, indicate what the citing articles are currently concerned with, which may or may not be consistent with the direction
of the cluster. These clusters represent the intellectual base of a paradigm, whereas
the citing articles associated with a cluster represent the research fronts. It is possible
that the same intellectual base may sustain more than one research front.
CiteSpace identifies noteworthy patterns in terms of structural and temporal
properties. Structural properties include the betweenness centrality of a cited
reference, both at the level of individual articles and at the aggregated level of
clusters. Temporal properties include citation bursts, which measure the acceleration
of citations within a short period of time. These indicators have been shown to
capture the research focus of the underlying scientific community (Chen 2012; Chen
et al. 2010; Small 1973).
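As a minimal sketch of how such indicators can be computed, consider the following Python fragment, which assumes the networkx library and toy data; the burst test here is a crude, invented stand-in for the burst-detection algorithm CiteSpace actually employs.

import networkx as nx

# A toy co-citation network; edge weights are co-citation counts.
G = nx.Graph()
G.add_weighted_edges_from(
    [("A", "B", 5), ("B", "C", 3), ("C", "D", 4), ("B", "D", 1)]
)

# Structural property: betweenness centrality of each cited reference.
centrality = nx.betweenness_centrality(G)

def has_burst(yearly_citations, factor=2.0):
    """Flag a citation burst when one year's citations exceed `factor`
    times the running mean; a crude illustrative test, not the real
    burst-detection algorithm."""
    for i in range(1, len(yearly_citations)):
        mean_so_far = sum(yearly_citations[:i]) / i
        if yearly_citations[i] > factor * max(mean_so_far, 1.0):
            return True
    return False

print(centrality["B"])              # "B" bridges "A" and the C-D side
print(has_burst([2, 3, 2, 11, 4]))  # True: the jump to 11 is a burst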
CiteSpace characterizes emerging trends and patterns of change in such co-citation
networks in terms of a variety of visual attributes. The size of a node indicates how
many citations the associated reference has received. Each node is depicted as a series
of citation tree rings across the series of time slices. The structural properties of
a node are displayed as a purple ring, whose thickness indicates its degree of
betweenness centrality, a measure associated with the transformative potential of a
scientific contribution. Nodes with high betweenness centrality tend to bridge
different stages in the development of a scientific field. Citation rings in red
indicate the time slices in which citation bursts, or abrupt increases in citations,
are detected. Citation bursts provide a useful means of tracing the development of a
research focus. Figure 9.3 shows an example of the distribution of topic areas with
strong citation bursts in terrorism research.
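The encoding itself can be read as a simple mapping from metrics to drawing attributes. The sketch below is purely illustrative: the attribute names and scaling constants are invented here and are not CiteSpace's actual values.

import math

def node_style(citations, centrality, burst_years):
    # Node size grows with citation count; the purple ring grows with
    # betweenness centrality; red rings mark slices with citation bursts.
    return {
        "radius": 2 + math.sqrt(citations),
        "purple_ring_width": 1 + 10 * centrality,
        "red_rings": list(burst_years),
    }

print(node_style(citations=64, centrality=0.35, burst_years=[2003, 2004]))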
Fig. 9.4 A network of 12,691 co-cited references. For each year, the top 2,000 most-cited references
were selected to form the network. The same three-cluster structure persists at various levels
9.2 Jigsaw
Jigsaw was developed at Georgia Tech under the leadership of John Stasko, who has
been active in software visualization, information visualization, and visual analytics.
Jigsaw integrates a variety of views for studying a collection of text documents. The
software is available at http://www.cc.gatech.edu/gvu/ii/jigsaw. Prospective users are
strongly encouraged to start with the tutorial videos.1
1 http://www.cc.gatech.edu/gvu/ii/jigsaw/tutorial/
Fig. 9.6 The list view of Jigsaw, showing a list of authors, a list of concepts, and a list of index
terms. The input documents are papers from the InfoVis and VAST conferences
In Fig. 9.6, the highlighted concepts on the left include network, text, animation,
usability, and matrix. The index terms highlighted on the right are much more detailed,
including citation analysis, astronomical surveys, PFNET, and SDSS. The flexibility to
browse entities and relations across multiple lists is convenient, and it supports a
very common task in exploring a dataset.
Other views in Jigsaw include a Circular Graph View, a Calendar View, a
Document Cluster View, a Document Grid View, and a Word Tree View. Jigsaw
also provides functions for computing the sentiment of a document and displaying the
result in the Document Grid View. Jigsaw uses lists of "positive" and "negative" words
and counts their occurrences in each document. The Document Grid View renders
positive documents in blue and negative documents in red. Figure 9.7
shows a Word Tree View in Jigsaw.
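In essence this is a word-list tally. A minimal sketch, with illustrative word lists rather than Jigsaw's own, might look like this:

POSITIVE = {"novel", "robust", "effective", "improve", "success"}
NEGATIVE = {"fail", "error", "poor", "bias", "limitation"}

def sentiment_color(text):
    """Count positive and negative words and map the score to the colors
    used in the Document Grid View (illustrative word lists, not Jigsaw's)."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "blue"   # positive document
    if score < 0:
        return "red"    # negative document
    return "neutral"

print(sentiment_color("A novel and robust method with one limitation"))  # blue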
Perhaps the function most useful to an analyst is the Tablet View in
Jigsaw (see Fig. 9.8). The Tablet functions like a sandbox in which the analyst can
organize various views of the same set of documents side by side. The Tablet also
allows the analyst to create a timeline and place evidence and other information
along it.
Fig. 9.8 Tablet in Jigsaw provides a flexible workspace to organize evidence and information
9.3 Carrot
Fig. 9.9 Carrot’s visualizations of clusters of text documents. Top right: Aduna cluster map
visualization; lower middle: circles visualization; lower right: Foam Tree visualization
2 http://project.carrot2.org/
9.4 Power Grid Analysis
Fig. 9.10 Left: The geographic layout of the Western Power Grid (WECC) with 230 kV or higher
voltage. Right: a GreenGrid layout with additional weights applied to both nodes (using voltage
phase angle) and links (using impedance) (Reprinted from Wong et al. 2009 with permission)
9.5 Action Science Explorer
The Action Science Explorer (ASE) is a tool developed at the University of
Maryland (Dunne et al. 2012). It is designed to present the scientific literature
of a field through many different modalities: lists of articles, their full texts,
automatic text summaries, and visualizations of the structure of the citation network.
Action Science Explorer integrates a variety of functions to support rapid
understanding of a scientific literature. Users can analyze the network of citations
between papers, identify key papers and research clusters, automatically summarize
them, dig into the full text of articles to extract context, make annotations, write
reviews, and finally export their findings in a variety of document-authoring formats.
Action Science Explorer is partly an integration of two existing tools: the
SocialAction network analysis tool and the JabRef reference manager. SocialAction
provides network analysis capabilities, including force-directed citation network
visualization, ranking and filtering of papers by statistical measures, and automatic
cluster detection. JabRef supplies features for managing references, including
searching with simple regular expressions, automatic and manual grouping of
papers, DOI and URL links, PDF full texts with annotations, abstracts, user-generated
reviews and text annotations, and many ways of exporting. It integrates with
Microsoft Word, OpenOffice.org, and LaTeX/BibTeX, which allows citations to
discovered articles to be added quickly when writing survey papers.
These tools are linked together to form multiple coordinated views of the data.
Clicking on a node in the citation network selects it and its corresponding paper
in the reference manager, displaying its abstract, review, and other associated
data. Moreover, when a cluster of nodes is selected, its papers float to
the top of the reference manager. When any node or cluster is selected, the In-
Cite Text window displays the text of all incoming citations to the paper(s), i.e.,
the whole sentences from the citing papers that include the citation to the selected
paper(s). These are displayed in a hyperlinked list that allows the user to select
any one of them and show its surrounding context in the Out-Cite Text window.
This window shows the full text of the paper citing one of the selected papers, with
Fig. 9.11 A screenshot of ASE (Reprinted from Dunne et al. 2012 with permission)
highlighting that shows the selected citation sentence as well as any other sentences
containing hyperlinked citations to other papers. The last view is the summary
window, which can contain various multi-document summaries of a selected cluster.
Using automatic summarization techniques, one can summarize all of the incoming
citations to papers within that cluster, hopefully providing key insights into that
research community.
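As a rough illustration of the idea, the sketch below scores each incoming citation sentence by average word frequency and keeps the top few; it is a generic frequency-based extractive summarizer invented for this example, not ASE's actual one.

import re
from collections import Counter

def summarize_citations(citation_sentences, k=2):
    """Return the k citation sentences whose words are most frequent
    across the whole set: a crude extractive multi-document summary."""
    all_words = Counter(
        re.findall(r"[a-z']+", " ".join(citation_sentences).lower())
    )
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average corpus frequency of the sentence's words.
        return sum(all_words[t] for t in tokens) / (len(tokens) or 1)
    return sorted(citation_sentences, key=score, reverse=True)[:k]

sentences = [
    "Their clustering method improves citation analysis.",
    "The method scales to large citation networks.",
    "Related work focuses on text summarization.",
]
print(summarize_citations(sentences, k=1))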
According to the ASE website,5 the tool is currently available only to the
developers' collaborators, not to the general public. Figure 9.11 shows a
screenshot of ASE.
5 http://www.cs.umd.edu/hcil/ase/
9.6 Revisit the Ten Challenges Identified in 2002
In 2002, when I wrote the first edition of this book, I identified the following
top-ten challenges for the subject, along with predictions for the near future. Now,
in 2012, what has changed, and what new challenges have emerged since?
Between 2005 and 2010, knowledge-oriented search tools and exploration tools
will become widely available. Users' major search tasks will probably shift from
data-oriented search to comprehension and interpretation tasks. Intelligent software
agents will begin to mature for citation-context extraction and summarization.
Genomic maps will play more substantial roles in linking scientific data and the
scientific literature. A synergy of data mining in genomic map data and the
scientific literature will attract increasing interest.
Beyond 2010, mapping scientific frontiers should reach a point where science
maps can begin to make forecasts and run simulations. Powerful simulations will
allow scientists to see the potential impact of a new technology. Beyond that, we
will have to wait and see.
Many techniques have matured over the last 10 years, including automatic
summarization of multiple documents, automatic construction of ontologies, and
recommendation of relevant references. Research has begun to touch on the issues
of predictive analysis and how to deal with unanticipated situations. To what extent
is scientific advance predictable? What can we learn from the past so that we will
be better able to recognize early signs of something potentially significant?
I envisage the following two milestones ahead for mapping scientific frontiers.
First, recall the clarity of the conceptual structures demonstrated by Paul Thagard,
which we saw in Chap. 1. The requirements are these: at any point in time, the first
part of the input is the entire body of knowledge ever conceived by human beings,
and the second part is a newly proposed idea; the future system will be able to
tell us very quickly to what extent the new idea has already been addressed in the
past and, if it has, what areas of our knowledge will be affected. This process is,
in essence, what scientists go through many times in their research. The key question
is how much of the retrieval, sense making, differentiation, and other analytic tasks
can be performed with considerably more external help.
Figure 9.12 illustrates how the publication of an article by Galea et al. in 2002
altered the holistic system of our knowledge of post-traumatic stress disorder. The
Galea article is six pages long and cites 32 references. On the one hand, it requires
a substantial amount of domain knowledge to understand its validity and significance.
On the other hand, co-citation patterns indicate its special position in the landscape
of the domain knowledge. The diagrams show how the key contribution of the
work can be summarized at a conceptual level, so that sense-making tasks become
much easier and more efficient.
The second milestone may build on the first to a great extent. It is to externalize
all the activities associated with scientific inquiry in a form that can integrate
them and inform scientists of their current situations and of paths that may lead to
their goals. Figure 9.13 shows an illustrative sketch of a fitness landscape of
scientific inquiry. Each point of the landscape indicates the fitness value of the
corresponding point on the base of the landscape. Many scientific inquiries can be
conceptualized as explorations of such a landscape. In some areas explorers find
consistent information; in other areas they may expect to find contradictions. Some
areas may be well defined, whereas other areas may
Fig. 9.12 An ultimate ability to reduce the vast volume of scientific knowledge in the past and a
stream of new knowledge to a clear and precise representation of a conceptual structure
a long way to go. The role of visual thinking and reasoning in science is clear: we
draw inspiration from what we see. Advances in our ability to obtain a wide
variety of visual images allow us to reach what was impossible before; with modern
telescopes we can see much farther away. Our mind does a large part of the work
of scientific reasoning. One day, our minds will be augmented further by information
and computational tools that extend our vision beyond its own limits.
References
Chen C (2011) Turning points: the nature of creativity. Springer, New York
Chen C (2012) Predictive effects of structural variation on citation counts. J Am Soc Inf Sci
Technol 63(3):431–449
Chen C, Ibekwe-SanJuan F, Hou J (2010) The structure and dynamics of co-citation clusters: a
multiple-perspective co-citation analysis. J Am Soc Inf Sci Technol 61(7):1386–1409
Dunne C, Shneiderman B, Gove R, Klavans J, Dorr B (2012) Rapid understanding of scientific
paper collections: integrating statistics, text analytics, and visualization. J Am Soc Inf Sci
Technol 63(12):2351–2369
Eccles R, Kapler T, Harper R, Wright W (2008) Stories in GeoTime. Inf Vis 7(1):3–17
Small H (1973) Co-citation in the scientific literature: a new measure of the relationship between
two documents. J Am Soc Inf Sci 24:265–269
Wong PC, Schneider K, Mackey P, Foote H, Chin G Jr, Guttromson R et al (2009) A novel
visualization technique for electric power grid analytics. IEEE Trans Vis Comput Graph
15(3):410–423
Wong PC, Foote H, Mackey P, Chin G Jr, Huang Z, Thomas J (2012) A space-filling visualization
technique for multivariate small-world graphs. IEEE Trans Vis Comput Graph 18(5):797–809
Index
R
Retraction, 290–304

S
Scale-free networks, 136–137
Science mapping, 1–4, 15, 38, 40–41, 43, 44, 76, 91, 127, 161, 163, 164, 166, 172, 174, 180, 195, 196, 224, 295, 304, 335
Scientific debates, 5, 9, 44, 205, 230, 334, 338
Scientific frontiers, 1–20, 22, 33, 37–44, 76, 77, 83, 139, 144, 167, 170, 176, 186, 189, 197, 203, 223, 224, 227, 263, 289, 313, 322, 336–338
Scientific inscriptions, 5
Scientific literature, 2–6, 8, 38, 45, 70, 91, 105, 143, 144, 166, 167, 172, 174, 175, 180, 203, 223, 227, 229, 234, 250, 254, 276, 290–292, 294, 299, 304, 306, 321, 322, 331, 333, 336, 338

T
TextFlow, 110
Thematic maps, 8, 38, 43, 47–49, 52–54, 204, 205
Thematic overlay, 43, 48–50, 52, 85, 127, 128, 144, 308
ThemeRiver, 108, 109
ThemeView, 38, 101, 102
Topic evolution, 110
Topic variations, 109, 110
Tower of Babel, 23–26
TRACES, 16–20
Trajectories of search, 44, 143–161
Transformative ideas, 259, 264
Traveling salesman, 89, 131, 144–146, 177
Triangular inequality, 93, 94, 150, 183

U
UCSD map, 305, 306
Undiscovered public knowledge, 166, 227, 230–234