
The Tree of Life:
Challenges for Discrete Mathematics and Theoretical Computer Science

Fred S. Roberts
DIMACS
Rutgers University

The tree of life problem raises new challenges for mathematics and computer science just as it does for biological science.

For math and CS to become more effectively utilized, we need to:
develop new tools;
establish working partnerships between mathematical scientists and biological scientists;
introduce the two communities to each other's problems, language, and tools;
introduce outstanding junior researchers from both sides to the issues and challenges of problems arising from the tree of life;
involve biological and mathematical scientists together to define the agenda and develop the tools of this field.

These are some of the motivations for this meeting.
I will lay out some of the challenges for
math and CS, with emphasis on discrete
math and theoretical CS.

What are DM and TCS?

Discrete mathematics (DM) deals with:
arrangements
designs
codes
patterns
schedules
assignments

Theoretical computer science (TCS) deals with the theory of computer algorithms.
During the first 30-40 years of the computer
age, TCS, aided by powerful mathematical
methods, had a direct impact on technology,
by developing models, data structures,
algorithms, and lower bounds that are now at
the core of computing.

DM and TCS have found extensive use in many areas of science and public policy, for example in molecular biology.

These tools seem especially relevant to problems of the tree of life.

DM and TCS Continued
These tools are made especially relevant to the tree of life problem because of:
Geographic Information Systems

DM and TCS Continued
Availability of large and disparate computerized databases on subjects relating to species and the relevance of modern methods of data mining.

Outline

Phylogenetic Tree Reconstruction
Database Issues
Nomenclature
Setting up a Species Bank
Digitization of Natural History Collections
Interoperability
The Many Applications of Research on the Tree of Life

Phylogenetic Tree Reconstruction

Phylogeny (continued)
New methods of phylogenetic tree
reconstruction owe a significant amount to
modern methods of DM/TCS.
Trees, supertrees, and consensus trees will all be discussed at length in this meeting.
I will make only a few brief remarks about them.
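
As a small illustration of one of these notions: a strict consensus tree keeps only the clusters (clades) that appear in every input tree. The sketch below assumes each rooted tree is supplied as a set of clusters, with each cluster a frozenset of leaf names. The function name and the toy trees are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, Set

Cluster = FrozenSet[str]


def strict_consensus(trees: Iterable[Set[Cluster]]) -> Set[Cluster]:
    """Return the clusters shared by every input tree."""
    tree_list = list(trees)
    if not tree_list:
        return set()
    common = set(tree_list[0])
    for tree in tree_list[1:]:
        common &= tree  # keep only clusters also present in this tree
    return common


# Toy rooted trees on leaves a, b, c, d, written as their non-trivial clusters.
t1 = {frozenset("ab"), frozenset("abc")}  # (((a,b),c),d)
t2 = {frozenset("ab"), frozenset("abd")}  # (((a,b),d),c)
consensus = strict_consensus([t1, t2])    # only the cluster {a, b} survives
```

Other consensus rules (for example majority rule) relax the "every tree" requirement. This sketch covers the strict case only.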

Phylogenetic Challenges for DM/TCS
Tailoring phylogenetic methods to describe the idiosyncrasies of viral evolution, going beyond a binary tree with a small number of contemporaneous species appearing as leaves.
Dealing with trees of thousands of vertices,
many of high degree.
Making use of data about species at internal
vertices (e.g., when data comes from serial
sampling of patients).

Phylogenetic Challenges for DM/TCS: Continued
Network representations of evolutionary
history - if recombination has taken place.
Modeling viral evolution by a collection of
trees -- to recognize the quasispecies nature
of viruses.
Devising fast methods to average the
quantities of interest over all likely trees.
Thanks to Eddie Holmes and Mike Steel for ideas.
DIMACS Working Group on Phylogenetic Trees and Rapidly
Evolving Diseases, Sept. 3-6, 2003

Database Issues
Assembling the tree of life requires collecting massive amounts of data about the world's species.
Making it a collaborative project
requires making such data universally
available.
There are great challenges for Math and
CS, specifically DM and TCS.
Thanks to the Global Biodiversity Information Facility
(GBIF) for many of the following ideas.

Complexity of Data
In many ways, data about the world's species are far more complex than genetic or protein sequence data. (GBIF)

Complexity of Data (cont'd)
There are databases of images,
databases in numerous forms, etc.
Data is heterogeneous.
Data has errors and inconsistencies.

Nomenclature

There are some 1.75M named species.
By some estimates, there are up to 10M actual species.

Nomenclature (cont'd)
The same species is often named more than once.
On average, each species has two additional names (synonyms) besides its own name. (GBIF)

Nomenclature (cont'd)
Thus, there is a need to assemble names in an electronic catalogue, with synonyms and common misspellings.
This would be of fundamental importance in aiding research on biodiversity.
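
A minimal sketch of such a catalogue, assuming it only needs to map each synonym or misspelling to one accepted name. The class and method names are hypothetical, and the misspelled variant is invented for illustration.

```python
from typing import Dict, Iterable, Optional


class NameCatalogue:
    """Maps accepted names, synonyms, and misspellings to one accepted name."""

    def __init__(self) -> None:
        self._accepted: Dict[str, str] = {}

    @staticmethod
    def _key(name: str) -> str:
        # Normalize case and whitespace so trivial variants collapse together.
        return " ".join(name.lower().split())

    def add(self, accepted: str, variants: Iterable[str] = ()) -> None:
        for name in (accepted, *variants):
            self._accepted[self._key(name)] = accepted

    def resolve(self, name: str) -> Optional[str]:
        """Return the accepted name, or None if the name is not catalogued."""
        return self._accepted.get(self._key(name))


catalogue = NameCatalogue()
# "Felis concolor" is an older synonym; "Puma concolour" is an illustrative misspelling.
catalogue.add("Puma concolor", variants=["Felis concolor", "Puma concolour"])
resolved = catalogue.resolve("felis  concolor")
```

An exact-match table like this only handles variants that have already been recorded. Unrecorded misspellings need approximate matching.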

Nomenclature (cont'd)
Because of errors, one major challenge for TCS is data cleaning.

Nomenclature (cont'd)
Another challenge is to search a database to see if two entries are similar.
This is a standard problem in database theory.
TCS algorithms involving k-nearest neighbor and other methods are very helpful here.
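
One way to make "similar entries" concrete is a k-nearest-neighbor query under edit distance. The sketch below is a brute-force version that scans every name. It assumes Levenshtein distance is an acceptable similarity measure, and the function names and sample names are illustrative only.

```python
import heapq
from typing import Iterable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def k_nearest_names(query: str, names: Iterable[str], k: int = 3) -> List[Tuple[int, str]]:
    """Return the k catalogued names closest to the query as (distance, name)."""
    scored = ((edit_distance(query.lower(), n.lower()), n) for n in names)
    return heapq.nsmallest(k, scored)


names = ["Puma concolor", "Panthera leo", "Panthera onca", "Lynx rufus"]
matches = k_nearest_names("Puma concolour", names, k=2)
```

A linear scan is fine for a few thousand names. At the scale of millions of names, an index structure for approximate string matching would be needed instead.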

Setting up a Species Bank

Setting up a Species Bank (cont'd)
A species bank would provide not only
names, but also data about a species:
Type
Distribution
Ecological role
Phylogenetic history
Physiology
Genomics

This involves issues about huge datasets.

Setting up a Species Bank (cont'd)
NASA earth science satellites alone beam home image data at the rate of 1.2 terabytes a day.
By 2010, this is expected to grow to 10 petabytes a day. (Kathleen Bergen, U. Michigan)

Name        Equal to:          Size in Bytes
Bit         1 bit              1/8
Nibble      4 bits             1/2 (rare)
Byte        8 bits             1
Kilobyte    1,024 bytes        1,024
Megabyte    1,024 kilobytes    1,048,576
Gigabyte    1,024 megabytes    1,073,741,824
Terabyte    1,024 gigabytes    1,099,511,627,776
Petabyte    1,024 terabytes    1,125,899,906,842,624
Exabyte     1,024 petabytes    1,152,921,504,606,846,976
Zettabyte   1,024 exabytes     1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes   1,208,925,819,614,629,174,706,176
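
The sizes in the table are successive powers of 1,024. The short sketch below regenerates them and works out how large the projected jump from 1.2 terabytes a day to 10 petabytes a day is. The helper name `bytes_in` is hypothetical.

```python
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def bytes_in(prefix: str) -> int:
    """Number of bytes in one unit with the given binary prefix."""
    return 1024 ** (PREFIXES.index(prefix) + 1)


for prefix in PREFIXES:
    print(f"{prefix}byte: {bytes_in(prefix):,}")

# Ratio of the projected daily rate (10 petabytes) to the current one (1.2 terabytes).
growth_factor = (10 * bytes_in("peta")) / (1.2 * bytes_in("tera"))
```

Under these units the projected rate is roughly 8,500 times the current one.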

Setting up a Species Bank (cont'd)
The problem is even worse: We need to combine information from many databases.
There is no known way to catalogue all species of plants in one place given current database systems techniques. (Jessie Kennedy, Napier University, Edinburgh)

Setting up a Species Bank (cont'd)
One possible approach: Tree and graph methods to support overlapping classifications as directed acyclic graphs or with complex objects (taxa or specimens) as nodes. (Jessie Kennedy)
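
A rough sketch of the directed-acyclic-graph idea, assuming each node is a taxon label and an edge records one classification's placement of a child under a parent. Unlike a tree, a node may have several parents, which is how overlapping classifications coexist. The class name and the taxon labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class ClassificationDAG:
    """Taxa as nodes; a taxon may sit under several parents at once."""

    def __init__(self) -> None:
        self.parents: Dict[str, Set[str]] = defaultdict(set)

    def ancestors(self, node: str) -> Set[str]:
        """All taxa reachable by following parent links upward."""
        seen: Set[str] = set()
        stack = list(self.parents.get(node, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def add_placement(self, child: str, parent: str) -> None:
        # Reject any placement that would make the graph cyclic.
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"placing {child!r} under {parent!r} creates a cycle")
        self.parents[child].add(parent)


dag = ClassificationDAG()
# Two classifications disagree about the genus of the same species.
dag.add_placement("Species X", "Genus A (classification 1)")
dag.add_placement("Species X", "Genus B (classification 2)")
dag.add_placement("Genus A (classification 1)", "Family F")
dag.add_placement("Genus B (classification 2)", "Family F")
lineage = dag.ancestors("Species X")
```

Plain string labels are used here for brevity. Complex objects such as specimens could serve as nodes in the same structure as long as they are hashable.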

Digitizing Natural History Collections
It has been estimated that there are between 1.5 and 3 billion specimens in the world's natural history collections, including herbaria, living microorganism stock centers, and other repositories. (GBIF)

Digitizing Natural History Collections (cont'd)
If we could digitize information about these specimens and make it available, we would have a treasure trove of information about the world's biota. (GBIF)
Pilot projects have shown that utilizing digitized data from several institutions' databases can be a powerful tool. (GBIF)

Digitizing Natural History Collections (cont'd)
Challenge: digitization and reference of nonstandard data (photos, sonograms, field notes)

Digitizing Natural History Collections (cont'd)
Challenge: Develop methods for visualizing the data (e.g., species distributions)

Digitizing Natural History Collections (cont'd)
Challenge: Develop search engines for real-time searching of such extremely large data sets.

Digitizing Natural History Collections (cont'd)
Challenge: Make information access on the web more knowledge-based so humans and intelligent software can work together. (Susan Gauch, U. Kansas)

Digitizing Natural History Collections (cont'd)
Challenge: Use intelligent agents to organize and present relevant information on the web. (Susan Gauch)

Digitizing Natural History Collections (cont'd)
Challenge: Use partial information as training data for classification algorithms. (Susan Gauch)
One approach: Use training data and classification algorithms with learning capabilities. (See: DIMACS project on Monitoring Message Streams)

Digitizing Natural History Collections (cont'd)
Another approach to problems posed by digitization: Use tools of knowledge inferencing. (Yannis Ioannidis, University of Wisconsin)
Still another approach: Use methods of spatio-temporal data mining. (Ioannidis; see work of Muthukrishnan at Rutgers)

Interoperability
Goal: Devise standards for datasets that allow researchers to collaborate across datasets, leading to database interoperability. (GBIF)

Interoperability
Challenge: How do we develop ways to
more accurately represent observational
or experimental data so that others may
use them? (Jessie Kennedy)
Challenge: Deal with issues of
inconsistency and scalability.
Challenge: Formalize issues of policy with regard to others' databases.
Challenge: Interoperability over a diversity
of users and types of equipment.

Interoperability
One approach: the Semantic Web, the idea used to express the growing desire to make information access on the Web more knowledge-based so humans and intelligent software can work together. (Susan Gauch)

Interoperability
Another approach: Make use of languages such as XML developed to aid interoperability in business and military collaborations.
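
To suggest what an XML-based exchange might look like, the sketch below parses a made-up specimen record with Python's standard library. The element and attribute names are assumptions for illustration only and do not follow any particular biodiversity standard.

```python
import xml.etree.ElementTree as ET

# A hypothetical specimen record; the schema is invented for this example.
record_xml = """
<specimen id="example-0001">
  <scientificName>Puma concolor</scientificName>
  <institution>Example Museum</institution>
  <locality country="Example country">Example locality</locality>
</specimen>
"""


def parse_specimen(xml_text: str) -> dict:
    """Extract a flat dictionary of fields from one specimen record."""
    root = ET.fromstring(xml_text.strip())
    locality = root.find("locality")
    return {
        "id": root.get("id"),
        "scientific_name": root.findtext("scientificName"),
        "institution": root.findtext("institution"),
        "country": locality.get("country") if locality is not None else None,
    }


record = parse_specimen(record_xml)
```

Two institutions that agree on such a schema can exchange records without knowing anything about each other's internal database layout.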

The Many Applications of Research on the Tree of Life
Side benefits in many fields:
Agriculture
Biomedicine
Biotechnology
Natural resource management
Pest control
Control of emergent diseases
Sustainable use of biodiversity resources
Global climate change

The Many Applications of Research on the Tree of Life
Let's say you're importing bananas from South America.

The Many Applications of Research on the Tree of Life
A camera in the hold of the ship sees a spider.
What kind of spider is it?
Is it safe to unload your cargo of bananas?

The Many Applications of Research on the Tree of Life
Luckily, you have a digitized natural history database with an efficient search feature.
(Thanks to Diana Lipscomb for this example)

