TreeofLife3 11 03

The Tree of Life:
Challenges for Discrete Mathematics and Theoretical Computer Science
Fred S. Roberts
DIMACS
Rutgers University
The tree of life problem raises new challenges for mathematics and computer science, just as it does for biological science.
For math and CS to be used more effectively, we need to:
develop new tools;
establish working partnerships between mathematical scientists and biological scientists;
introduce the two communities to each other's problems, language, and tools;
introduce outstanding junior researchers from both sides to the issues and challenges arising from the tree of life;
involve biological and mathematical scientists together in defining the agenda and developing the tools of this field.
These are some of the motivations for this meeting.
I will lay out some of the challenges for
math and CS, with emphasis on discrete
math and theoretical CS.
What are DM and TCS?

DM deals with:
arrangements
designs
codes
patterns
schedules
assignments
TCS deals with the theory of computer algorithms.
During the first 30-40 years of the computer
age, TCS, aided by powerful mathematical
methods, had a direct impact on technology,
by developing models, data structures,
algorithms, and lower bounds that are now at
the core of computing.
DM and TCS have found extensive use in many areas of science and public policy, for example in molecular biology.
These tools seem especially relevant to problems of the tree of life.
DM and TCS Continued
These tools are made especially relevant to the tree of life problem because of:
Geographic Information Systems
DM and TCS Continued
Availability of large and disparate computerized databases on subjects relating to species, and the relevance of modern methods of data mining.
Outline
Phylogenetic Tree Reconstruction
Database Issues
Nomenclature
Setting up a Species Bank
Digitization of Natural History Collections
Interoperability
The Many Applications of Research on the Tree of Life
Phylogenetic Tree Reconstruction
Phylogeny (continued)
New methods of phylogenetic tree reconstruction owe a significant amount to modern methods of DM/TCS.
Trees, supertrees, and consensus trees will all be discussed at length in this meeting.
I will only make a few brief remarks about them.
Phylogenetic Challenges for DM/TCS
Tailoring phylogenetic methods to describe the idiosyncrasies of viral evolution, going beyond a binary tree with a small number of contemporaneous species appearing as leaves.
Dealing with trees of thousands of vertices, many of high degree.
Making use of data about species at internal vertices (e.g., when data comes from serial sampling of patients).
Phylogenetic Challenges for DM/TCS: Continued
Network representations of evolutionary history, if recombination has taken place.
Modeling viral evolution by a collection of trees, to recognize the quasispecies nature of viruses.
Devising fast methods to average the quantities of interest over all likely trees.
Thanks to Eddie Holmes and Mike Steel for ideas.
DIMACS Working Group on Phylogenetic Trees and Rapidly
Evolving Diseases, Sept. 3-6, 2003
Database Issues
Assembling the tree of life requires collecting massive amounts of data about the world's species.
Making it a collaborative project requires making such data universally available.
There are great challenges for math and CS, specifically DM and TCS.
Thanks to the Global Biodiversity Information Facility
(GBIF) for many of the following ideas.
Complexity of Data
In many ways, data about the world's species are far more complex than genetic or protein sequence data. (GBIF)
Complexity of Data (cont'd)
There are databases of images,
databases in numerous forms, etc.
Data is heterogeneous.
Data has errors and inconsistencies.
Nomenclature
There are some 1.75M named species.
By some estimates, there are up to 10M actual species.
Nomenclature (cont'd)
The same species is often named more than once.
On average, each species has two additional names (synonyms) besides its own name. (GBIF)
Thus, there is a need to assemble names in an electronic catalogue, with synonyms and common misspellings.
This would be of fundamental importance in aiding research on biodiversity.
Because of errors, one major challenge for TCS is data cleaning.
Another challenge is to search a database to see if two entries are similar.
This is a standard problem in database theory.
TCS algorithms involving k-nearest neighbor and other methods are very helpful here.
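As a rough illustration of the catalogue and similarity-search ideas above, the following sketch indexes accepted names together with their synonyms and misspellings, then looks up the nearest known spellings for a query. The catalogue entries, the function names, and the similarity cutoff are assumptions made for this example; the string similarity comes from Python's standard `difflib` module and merely stands in for whatever nearest-neighbor method a real system would use.

```python
import difflib

# Hypothetical mini-catalogue: accepted name -> known synonyms and misspellings.
CATALOGUE = {
    "Puma concolor": ["Felis concolor", "Puma concolour"],
    "Panthera leo": ["Felis leo"],
}


def build_index(catalogue):
    """Map every known spelling (lower-cased) to its accepted name."""
    index = {}
    for accepted, variants in catalogue.items():
        index[accepted.lower()] = accepted
        for variant in variants:
            index[variant.lower()] = accepted
    return index


def nearest_names(query, index, k=3, cutoff=0.6):
    """Return accepted names for the (up to) k known spellings closest to query."""
    matches = difflib.get_close_matches(query.lower(), list(index), n=k, cutoff=cutoff)
    accepted_names = []
    for spelling in matches:
        accepted = index[spelling]
        if accepted not in accepted_names:  # several spellings may share one name
            accepted_names.append(accepted)
    return accepted_names


INDEX = build_index(CATALOGUE)
print(nearest_names("Puma concolr", INDEX))    # query containing a typo
print(nearest_names("Felis concolor", INDEX))  # query that is a known synonym
```

Pairwise string comparison like this does not scale to millions of names; it is meant only to show the shape of the lookup.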

Setting up a Species Bank (cont'd)
A species bank would provide not only names, but also data about a species:
Type
Distribution
Ecological role
Phylogenetic history
Physiology
Genomics
This involves issues about huge datasets.

Setting up a Species Bank (cont'd)
NASA earth science satellites alone beam home image data at the rate of 1.2 terabytes a day.
By 2010, this is expected to grow to 10 petabytes a day. (Kathleen Bergen, U. Michigan)
Name        Equal to:          Size in Bytes
Bit         1 bit              1/8
Nibble      4 bits             1/2 (rare)
Byte        8 bits             1
Kilobyte    1,024 bytes        1,024
Megabyte    1,024 kilobytes    1,048,576
Gigabyte    1,024 megabytes    1,073,741,824
Terabyte    1,024 gigabytes    1,099,511,627,776
Petabyte    1,024 terabytes    1,125,899,906,842,624
Exabyte     1,024 petabytes    1,152,921,504,606,846,976
Zettabyte   1,024 exabytes     1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes   1,208,925,819,614,629,174,706,176
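To make the scale concrete, here is a small sketch that recomputes the unit sizes in the table (each unit is 1,024 times the previous one) and compares the two satellite data rates quoted earlier. The function and variable names are illustrative assumptions; the only inputs taken from the slides are the figures of 1.2 terabytes and 10 petabytes a day.

```python
# Binary unit sizes as used in the table: each unit is 1,024 times the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def unit_size(name):
    """Size of a unit in bytes: 1024 raised to the unit's position in UNITS."""
    return 1024 ** UNITS.index(name)


def growth_factor(old_amount, old_unit, new_amount, new_unit):
    """How many times larger the new data rate is than the old one."""
    return (new_amount * unit_size(new_unit)) / (old_amount * unit_size(old_unit))


# Rates quoted in the slides: 1.2 terabytes a day, projected to reach 10 petabytes a day.
factor = growth_factor(1.2, "terabyte", 10, "petabyte")
print(f"{unit_size('petabyte'):,} bytes per petabyte")
print(f"projected growth: roughly {factor:,.0f} times the current rate")
```

With 1,024 terabytes to the petabyte, the projected rate is several thousand times the quoted one.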

Setting up a Species Bank (cont'd)
The problem is even worse: We need to combine information from many databases.
There is no known way to catalogue all species of plants in one place given current database systems techniques. (Jessie Kennedy, Napier University, Edinburgh)

Setting up a Species Bank (cont'd)
One possible approach: Tree and graph methods to support overlapping classifications as directed acyclic graphs or with complex objects (taxa or specimens) as nodes. (Jessie Kennedy)
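One way to picture the directed-acyclic-graph idea is the sketch below, in which a taxon may have more than one parent because two classifications place it differently. The class, its methods, and the taxon names ("Order O", "Family A", and so on) are hypothetical and chosen only for illustration; this is not a description of any existing taxonomic database.

```python
from collections import defaultdict


class ClassificationGraph:
    """Taxa and specimens as nodes; an edge parent -> child records one placement."""

    def __init__(self):
        self.children = defaultdict(set)
        self.parents = defaultdict(set)

    def descendants(self, taxon):
        """All nodes reachable from taxon by following child edges."""
        found, stack = set(), [taxon]
        while stack:
            node = stack.pop()
            for child in self.children.get(node, ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found

    def add_placement(self, parent, child):
        """Add parent -> child, refusing any edge that would create a cycle."""
        if parent == child or parent in self.descendants(child):
            raise ValueError(f"{parent} -> {child} would create a cycle")
        self.children[parent].add(child)
        self.parents[child].add(parent)


# Two hypothetical classifications disagree about which family contains "Genus X".
graph = ClassificationGraph()
graph.add_placement("Order O", "Family A")
graph.add_placement("Order O", "Family B")
graph.add_placement("Family A", "Genus X")   # placement under classification 1
graph.add_placement("Family B", "Genus X")   # placement under classification 2
graph.add_placement("Genus X", "Specimen 1")

print(sorted(graph.parents["Genus X"]))
print(sorted(graph.descendants("Order O")))
```

Keeping both placements in one graph lets a query such as "everything under Order O" return the genus once, while still recording that the two sources disagree about its family.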
Digitizing Natural History Collections
It has been estimated that there are between 1.5 and 3 billion specimens in the world's natural history collections, including herbaria, living microorganism stock centers, and other repositories. (GBIF)

Digitizing Natural History Collections (cont'd)
If we could digitize information about these specimens and make it available, we would have a treasure trove of information about the world's biota. (GBIF)
Pilot projects have shown that utilizing digitized data from several institutions' databases can be a powerful tool. (GBIF)

Digitizing Natural History Collections (cont'd)
Challenge: digitization and reference of nonstandard data (photos, sonograms, field notes)

Digitizing Natural History Collections (cont'd)
Challenge: Develop methods for visualizing the data (e.g., species distributions)

Digitizing Natural History Collections (cont'd)
Challenge: Develop search engines for real-time searching of such extremely large data sets.

Digitizing Natural History Collections (cont'd)
Challenge: Make information access on the web more knowledge-based so humans and intelligent software can work together. (Susan Gauch, U. Kansas)

Digitizing Natural History Collections (cont'd)
Challenge: Use intelligent agents to organize and present relevant information on the web. (Susan Gauch)

Digitizing Natural History Collections (cont'd)
Challenge: Use partial information as training data for classification algorithms. (Susan Gauch)
One approach: Use training data and classification algorithms with learning capabilities.
(See: DIMACS project on Monitoring Message Streams)

Digitizing Natural History Collections (cont'd)
Another approach to problems posed by digitization: Use tools of knowledge inferencing (Yannis Ioannidis, University of Wisconsin)
Still another approach: Use methods of spatio-temporal data mining (Ioannidis; see work of Muthukrishnan at Rutgers)
Interoperability
Goal: Devise standards for datasets so as to allow researchers to collaborate across datasets; develop standards leading to database interoperability. (GBIF)
Interoperability
Challenge: How do we develop ways to more accurately represent observational or experimental data so that others may use them? (Jessie Kennedy)
Challenge: Deal with issues of inconsistency and scalability.
Challenge: Formalize issues of policy with regard to others' databases.
Challenge: Interoperability over a diversity of users and types of equipment.
Interoperability
One approach: Semantic Web, the idea used to express the growing desire to make information access on the Web more knowledge-based so humans and intelligent software can work together. (Susan Gauch)
Interoperability
Another approach: Make use of languages such as XML developed to aid interoperability in business and military collaborations.
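As a minimal sketch of what XML-based exchange might look like, the code below writes one specimen record to XML and reads it back using Python's standard `xml.etree.ElementTree` module. The element names, the record fields, and the sample values are made up for this illustration and do not follow any actual biodiversity data standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; a real exchange format would fix these in a shared schema.
FIELDS = ("scientificName", "collection", "locality")


def specimen_to_xml(record):
    """Serialize a specimen record (a dict of strings) to an XML string."""
    root = ET.Element("specimen", id=record["id"])
    for field in FIELDS:
        ET.SubElement(root, field).text = record[field]
    return ET.tostring(root, encoding="unicode")


def specimen_from_xml(text):
    """Parse the XML produced by specimen_to_xml back into a dict."""
    root = ET.fromstring(text)
    record = {"id": root.get("id")}
    for child in root:
        record[child.tag] = child.text
    return record


# Made-up example record.
record = {
    "id": "example-0001",
    "scientificName": "Puma concolor",
    "collection": "Example Museum",
    "locality": "South America",
}
xml_text = specimen_to_xml(record)
print(xml_text)
print(specimen_from_xml(xml_text) == record)
```

Two institutions that agree on the element names can exchange records this way regardless of how each stores its data internally, which is the sense of interoperability the slide describes.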
The Many Applications of Research on the Tree of Life
Side benefits in many fields:
Agriculture
Biomedicine
Biotechnology
Natural resource management
Pest control
Control of emergent diseases
Sustainable use of biodiversity resources
Global climate change

Let's say you're importing bananas from South America.
A camera in the hold of the ship sees a spider.
What kind of spider is it?
Is it safe to unload your cargo of bananas?
Luckily, you have a digitized natural history database with an efficient search feature.
(Thanks to Diana Lipscomb for this example)

