Dendrograms & PFGE Analysis

Paul Vauterin Applied Maths BVBA
Outline of this talk:
- Simple explanation of mainstream hierarchical clustering (UPGMA)
- Interesting alternatives to UPGMA
- How to interpret a dendrogram?
- Problem of degenerate (equivalent) solutions
Bottom line:
- Be careful in interpreting dendrograms!
- Consider alternatives to UPGMA (i.e. single & complete linkage)
Relevance of cluster analysis

Cluster analysis is the mathematical study of methods for recognizing natural groups within a set of entities.
- Simply a tool that groups together related entities, based on the observed similarities between them
- Used as a data exploration/mining tool in virtually every field (psychology, economy, finance, astronomy, ...)
- Applicable to virtually any type of data: only a similarity matrix is needed
- Applicable to large data sets (>10 000 entities)
- Easy to interpret (simple & intuitive mathematical principle), so weak points are easier to anticipate
UPGMA algorithm
Organisms A, B, C, D
-> Biological characterisation technique
-> Data set (fragment sizes): 127.3kb, 125.3kb, 140.9kb, 128.6kb, 83.6kb, 56.4kb, ... 101.6kb, 66.8kb, ... 129.6kb, 58.0kb, ... 101.3kb, 98.2kb, ...
-> Matrix of pairwise similarities

     A    B    C    D
A  100
B   68  100
C   76   96  100
D   95   85   71  100
1. Find & merge the two best-matching entries: B + C (similarity 96)
2. Update the similarities (averaging):

      BC    A    D
BC    96
A     72  100
D     78   95  100

3. Find & merge the two best-matching entries: A + D (similarity 95)
4. Update the similarities:

      BC   AD
BC    96
AD    75   95

5. Final merge: BC + AD (similarity 75)

Resulting leaf order in the dendrogram: B C A D
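The merge steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular software package: the function name `upgma` and the dictionary layout of the similarity matrix are chosen here for convenience, and ties are simply resolved by taking the first best pair.

```python
from itertools import combinations

def upgma(labels, sim):
    """Repeatedly merge the most similar pair of clusters.

    sim maps frozenset({a, b}) -> similarity for every pair of labels.
    Returns a list of (cluster_1, cluster_2, merge_similarity) tuples.
    """
    clusters = [(label,) for label in labels]
    merges = []

    def group_similarity(c1, c2):
        # UPGMA: plain average over all cross-pairs of members.
        values = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(values) / len(values)

    while len(clusters) > 1:
        # max() keeps the first pair on ties (no tie detection in this sketch).
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: group_similarity(*pair))
        merges.append((c1, c2, group_similarity(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Similarity matrix from the slides (organisms A, B, C, D).
labels = ["A", "B", "C", "D"]
sim = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("BC"): 96,
    frozenset("AD"): 95, frozenset("BD"): 85, frozenset("CD"): 71,
}
merges = upgma(labels, sim)
for c1, c2, level in merges:
    print("+".join(c1), "with", "+".join(c2), "at", level)
```

With the matrix from the slides, the merges come out as B with C at 96, A with D at 95, and BC with AD at 75, matching the worked example.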
Crucial step: determine the similarity between two groups
- UPGMA: average of all similarities
- Single linkage: highest similarity (best-case scenario)
- Complete linkage: lowest similarity (worst-case scenario)
Other alternative schemes have been developed.
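To make the difference between the three rules concrete, the sketch below computes the similarity between the groups {B, C} and {A, D} from the earlier matrix under each rule. The function name `group_similarity` and the rule strings are illustrative choices, not part of any tool.

```python
def group_similarity(group1, group2, sim, rule):
    """Similarity between two groups under a given linkage rule."""
    values = [sim[frozenset((a, b))] for a in group1 for b in group2]
    if rule == "upgma":
        return sum(values) / len(values)   # average of all similarities
    if rule == "single":
        return max(values)                 # best-case scenario
    if rule == "complete":
        return min(values)                 # worst-case scenario
    raise ValueError(f"unknown rule: {rule}")

sim = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("BC"): 96,
    frozenset("AD"): 95, frozenset("BD"): 85, frozenset("CD"): 71,
}
for rule in ("upgma", "single", "complete"):
    print(rule, group_similarity("BC", "AD", sim, rule))
```

For these two groups the rules give 75 (UPGMA), 85 (single linkage) and 68 (complete linkage), so the level at which the final branch is drawn depends on the rule chosen.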
How to interpret a dendrogram?

UPGMA tree (figure): leaves A, B, C
What does this tell you? A & B are closer to each other than to C? Not necessarily true!
Fundamental problem: potential alternative solutions
- Equally valid
- Hidden
- Might give another view
This is not restricted to UPGMA or PFGE, but is a major problem for most methods that summarise the original data.
Degenerate dendrogram solutions

A simple example: PFGE, 3 organisms (A, B, C)
Similarities (from the band patterns):

     A    B    C
A  100   50   50
B       100    0
C            100

UPGMA rule: join the highest similarity first. Here A+B and A+C tie at 50, so there are two equally valid solutions: first A+B, or first A+C (figure: alternative trees with leaf orders A B C and A C B).
How to solve this? Detect the tie and visualise it in a special way.
Happens very often with discrete data with few degrees of freedom (bands on PFGE, but also MLST, MLVA, Spa typing, ...)
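Detecting such a tie is straightforward: instead of taking one best pair, collect all pairs that share the highest similarity. The sketch below does this for the three-organism example; the function name `best_pairs` is hypothetical, and how a tool then visualises the tie is not covered here.

```python
from itertools import combinations

def best_pairs(labels, sim):
    """Return every pair that shares the highest similarity."""
    pairs = list(combinations(labels, 2))
    best = max(sim[frozenset(p)] for p in pairs)
    return [p for p in pairs if sim[frozenset(p)] == best]

# Three-organism example: A-B and A-C are both 50, B-C is 0.
sim = {frozenset("AB"): 50, frozenset("AC"): 50, frozenset("BC"): 0}
tied = best_pairs(["A", "B", "C"], sim)
if len(tied) > 1:
    print("Degenerate step, equally valid first merges:", tied)
```

More than one pair in the result means the dendrogram is not unique at that step.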

PFGE + band matching: even worse!

     A    B    C
A  100  100    0
B       100  100
C            100

A = B and B = C, but A ≠ C
Compromises the concept of a cluster of identical fingerprints:
- Relaxed view (single linkage): each member is identical to at least one other in the cluster
- Strict view (complete linkage): each member is identical to all other members of the cluster
Human inspection is ALWAYS needed anyway!
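The two views of an "identical" cluster can be written down as two small checks. This is a sketch that follows the wording of the slide literally; the function names and the representation of identity as a set of pairs are assumptions made for illustration.

```python
from itertools import combinations

def is_strict_cluster(members, identical):
    """Strict view: each member is identical to all other members."""
    return all(frozenset(p) in identical for p in combinations(members, 2))

def is_relaxed_cluster(members, identical):
    """Relaxed view: each member is identical to at least one other member."""
    return all(
        any(frozenset((m, other)) in identical for other in members if other != m)
        for m in members
    )

# Band-matching example: A matches B and B matches C, but A does not match C.
identical = {frozenset("AB"), frozenset("BC")}
group = ["A", "B", "C"]
print("relaxed:", is_relaxed_cluster(group, identical))
print("strict: ", is_strict_cluster(group, identical))
```

For the example, {A, B, C} qualifies as a cluster under the relaxed view but not under the strict view.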
Case Study
PFGE fingerprints; (dis)similarity: number of different bands (dendrogram scale: 6 down to 0 different bands)
- Complete linkage clustering. Result: groups whose members have no more than n bands different from any other member. Good starting point for pattern naming.
- Single linkage clustering. Result: groups whose members have no more than n bands different from some other member. Good starting point for finding clusters of related patterns.
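The contrast between the two results can be reproduced on a toy data set. In the sketch below, each fingerprint is a set of band classes, the dissimilarity is the number of bands present in one pattern but not the other, and clustering stops once the next merge would exceed n different bands. The pattern names, band values, and the function `cluster_at` are all hypothetical.

```python
from itertools import combinations

def band_difference(bands1, bands2):
    """Number of bands present in exactly one of the two patterns."""
    return len(set(bands1) ^ set(bands2))

def cluster_at(patterns, max_diff, rule):
    """Merge clusters while their distance is <= max_diff.

    rule is 'single' (closest pair counts) or 'complete' (farthest pair counts).
    """
    pick = min if rule == "single" else max
    clusters = [[name] for name in patterns]

    def distance(c1, c2):
        return pick(band_difference(patterns[a], patterns[b])
                    for a in c1 for b in c2)

    while len(clusters) > 1:
        # Ties are resolved by taking the first closest pair.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: distance(*p))
        if distance(c1, c2) > max_diff:
            break
        clusters = [c for c in clusters if c is not c1 and c is not c2] + [c1 + c2]
    return sorted(sorted(c) for c in clusters)

# Hypothetical band classes: P1-P2 and P2-P3 differ by 2 bands, P1-P3 by 4.
patterns = {"P1": {1, 2, 3, 4}, "P2": {1, 2, 3, 5}, "P3": {1, 2, 5, 6}}
print("single:  ", cluster_at(patterns, 2, "single"))
print("complete:", cluster_at(patterns, 2, "complete"))
```

With n = 2, single linkage chains all three patterns into one group, while complete linkage keeps P3 apart because it differs from P1 by 4 bands. Note that the complete linkage result here depends on how the tie between P1+P2 and P2+P3 is broken.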

Dendrogram of A, B, C: suppose the solution is unique. What does this tell you? Still not necessarily anything! Garbage in, garbage out.
A cluster algorithm will always produce a tree.
Need for methods to address the reliability of a dendrogram:
- Phylogenetics: standard tool = Felsenstein's bootstrap
- Not (well) suited to most typing data sets (PFGE, MLST, VNTR)
Back to less sophisticated methods:
- E.g. error flags on cluster levels. Principle: each branch is an average representative of a variety of similarities -> show standard deviation
- Visual inspection
- Cross-validation
Large data sets are your friends!
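The error-flag idea can be illustrated with the four-organism matrix from the UPGMA example: the branch joining {B, C} and {A, D} summarises four similarities, and their spread indicates how well the single branch level represents them. The sketch uses the population standard deviation, which is an assumption; the slides do not say which estimator is meant.

```python
from statistics import mean, pstdev

def branch_summary(group1, group2, sim):
    """Mean and spread of the similarities that a branch averages over."""
    values = [sim[frozenset((a, b))] for a in group1 for b in group2]
    # pstdev = population standard deviation (assumed choice of estimator).
    return mean(values), pstdev(values)

sim = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("BC"): 96,
    frozenset("AD"): 95, frozenset("BD"): 85, frozenset("CD"): 71,
}
level, spread = branch_summary("BC", "AD", sim)
print(f"branch level {level}, standard deviation {spread:.1f}")
```

A branch with a large spread averages over very different similarities and deserves a closer look before it is interpreted.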
Recipe 1: finding seed groups for pattern naming
1. Make sure you have a temporary field.
2. Install the plugin Dendrogram tools.
3. Select Complete Linkage and Different bands.
4. Use Fill field with cluster number.
5. Use 100% similarity, specify the minimum group size and choose the destination field. Will overwrite any content!
Results: the resulting groups are guaranteed to consist of all identical fingerprints and have at least 5 members.
Warning: numbering is not persistent: another data set might give different values.
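Outside the GUI, the effect of filling a field with cluster numbers can be mimicked as follows. This is only a sketch of the idea, not the plugin's actual behaviour: the function `fill_cluster_numbers`, the entry keys, and the rule that numbers are handed out in the order clusters are encountered are all assumptions.

```python
def fill_cluster_numbers(clusters, min_size):
    """Return {entry: cluster_number} for clusters with at least min_size members."""
    field = {}
    number = 0
    for members in clusters:
        if len(members) >= min_size:
            number += 1  # numbers depend on the order and content of the data set
            for entry in members:
                field[entry] = number
    return field

# Hypothetical clusters of identical fingerprints.
clusters = [
    ["e1", "e2", "e3", "e4", "e5"],
    ["e6", "e7"],
    ["e8", "e9", "e10", "e11", "e12", "e13"],
]
temp_field = fill_cluster_numbers(clusters, min_size=5)
print(temp_field)
```

Because the numbers are assigned per run, adding or removing entries can shift them, which is why the numbering should not be treated as a persistent pattern name.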
Recipe 2: find largest clusters in data set
1. Select Single Linkage and Different bands.
2. Use 100% similarity (or 99% for 1 band difference), specify the minimum group size and choose the destination field.
3. Use the Chart & Statistics tool.
4. Add the Temp field.
5. Use sort by frequency.
Results: clusters ranked by size, plus the fingerprints not associated with any (large) cluster. Use CTRL+click to select entries.
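The sort-by-frequency step amounts to counting how often each cluster number occurs in the temporary field. A minimal sketch with the standard library, assuming a hypothetical mapping from entry keys to cluster numbers in which `None` marks entries that were not assigned to any cluster:

```python
from collections import Counter

# Hypothetical content of the temporary field after clustering.
temp_field = {"e1": 1, "e2": 1, "e3": 2, "e4": 1, "e5": None, "e6": 2}

# Count members per cluster number, ignoring unassigned entries.
counts = Counter(v for v in temp_field.values() if v is not None)
ranked = counts.most_common()  # largest cluster first
unassigned = [key for key, value in temp_field.items() if value is None]

print("clusters ranked by size:", ranked)
print("not in any (large) cluster:", unassigned)
```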
