
Dendrograms & PFGE analysis

Paul Vauterin, Applied Maths BVBA

Outline of this talk:
- Simple explanation of mainstream hierarchical clustering (UPGMA)
- Interesting alternatives to UPGMA
- How to interpret a dendrogram?
- Problem of degenerate (equivalent) solutions

Bottom line:
- Be careful in interpreting dendrograms!
- Consider alternatives to UPGMA (i.e. single & complete linkage)

Relevance of cluster analysis


Cluster analysis is the mathematical study of methods for recognizing natural groups within a set of entities
- Simply a tool that groups together related entities, based on the observed similarities between them
- Used as a data exploration/mining tool in virtually every field (psychology, economy, finance, astronomy, ...)
- Applicable to virtually any type of data: only a similarity matrix is needed
- Applicable to large data sets (>10 000 entities)
- Easy to interpret (simple & intuitive mathematical principle), so weak points are easier to anticipate

UPGMA algorithm
Organisms A, B, C, D

Biological characterisation technique


127.3kb, 125.3kb, 140.9kb, 128.6kb, 83.6kb, 56.4kb, ... 101.6kb, 66.8kb, ... 129.6kb, 58.0kb, ... 101.3kb, 98.2kb, ...

Data set

Matrix of pairwise similarities

     A    B    C    D
A  100
B   68  100
C   76   96  100
D   95   85   71  100

UPGMA algorithm

Dendrogram leaves: A, B, C, D (similarity scale in %: 80, 90, 100)

     A    B    C    D
A  100
B   68  100
C   76   96  100
D   95   85   71  100

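1. Find & merge the two best-matching entries: B + C (similarity 96)
2. Update the similarities (averaging)

      BC    A    D
BC    96
A     72  100
D     78   95  100

3. Find & merge the two best-matching entries: A + D (similarity 95)
4. Update the similarities

      BC   AD
BC    96
AD    75   95

5. Final merge: BC + AD (similarity 75)

Resulting leaf order: B C A D (similarity scale in %: 80, 90, 100)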
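The merge sequence of this worked example can be reproduced with a few lines of Python. This is only a minimal sketch: the names `SIM`, `group_similarity` and `upgma` are made up for illustration, and ties are simply broken by order of appearance, which is an assumption and not something the slides prescribe.

```python
from itertools import combinations

# Pairwise similarities (%) taken from the example matrix for A, B, C, D.
SIM = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("BC"): 96,
    frozenset("AD"): 95, frozenset("BD"): 85, frozenset("CD"): 71,
}

def group_similarity(g1, g2):
    """UPGMA rule: average of all similarities between members of g1 and g2."""
    values = [SIM[frozenset((a, b))] for a in g1 for b in g2]
    return sum(values) / len(values)

def upgma(labels):
    """Return the merges as a list of (group1, group2, similarity)."""
    groups = [(label,) for label in labels]
    merges = []
    while len(groups) > 1:
        # Pick the best-matching pair of groups.
        # Assumption: ties are broken by order of appearance.
        g1, g2 = max(combinations(groups, 2),
                     key=lambda pair: group_similarity(*pair))
        merges.append((g1, g2, group_similarity(g1, g2)))
        # Replace the two merged groups by their union.
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return merges

merges = upgma("ABCD")
for g1, g2, similarity in merges:
    print("".join(g1), "+", "".join(g2), "at", similarity)
```

The three merges come out as B + C at 96, A + D at 95 and BC + AD at 75, matching the steps above.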

UPGMA algorithm
Crucial step: determine similarities between two groups

- UPGMA: average of all similarities
- Single linkage: highest similarity (best case scenario)
- Complete linkage: lowest similarity (worst case scenario)
- ... Other alternative schemes have been developed ...
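To make the difference between the three rules concrete, the sketch below applies each of them to the same pair of groups, BC and AD, using the similarities of the A, B, C, D example. The names `between` and `LINKAGE` are illustrative only.

```python
from statistics import mean

# Similarities (%) between the members of group BC and group AD,
# read from the example matrix.
between = {("B", "A"): 68, ("B", "D"): 85, ("C", "A"): 76, ("C", "D"): 71}

# Each linkage rule reduces the list of between-group similarities to one value.
LINKAGE = {
    "UPGMA (average)": mean,
    "single linkage": max,      # best case: highest similarity
    "complete linkage": min,    # worst case: lowest similarity
}

values = list(between.values())
results = {name: rule(values) for name, rule in LINKAGE.items()}
for name, value in results.items():
    print(f"{name}: {value}")
```

The same two groups join at 75 under UPGMA, at 85 under single linkage and at 68 under complete linkage, so the choice of rule alone changes the branch level in the dendrogram.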

How to interpret a dendrogram?


UPGMA tree with leaves A, B, C:

What does this tell you? Are A & B closer to each other than to C? Not necessarily true!

Fundamental problem: potential alternative solutions
- Equally valid
- Hidden
- Might give another view
= not restricted to UPGMA or PFGE, but a major problem for most methods that summarise the original data

Degenerate dendrogram solutions


A simple example: PFGE, 3 organisms (A, B, C)

Similarities:

     A    B    C
A  100   50   50
B       100    0
C            100

UPGMA rule: join the highest similarities. Two merges are tied: first A+B, or first A+C.

How to solve this? Detect and visualise in a special way

[Figure: equivalent dendrograms with leaf orders A B C and A C B]

Happens very often with discrete data with few degrees of freedom (bands on PFGE, but also MLST, MLVA, Spa typing, ...)
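Detecting such a degenerate step amounts to checking whether more than one pair shares the highest similarity. The sketch below does this for the three-organism example; the function name `best_pairs` is hypothetical, and a real implementation would have to repeat the check at every merge, not only the first one.

```python
from itertools import combinations

# Similarities (%) from the three-organism example.
sim = {frozenset("AB"): 50, frozenset("AC"): 50, frozenset("BC"): 0}

def best_pairs(labels, sim):
    """Return every pair that shares the highest similarity."""
    pairs = list(combinations(labels, 2))
    best = max(sim[frozenset(pair)] for pair in pairs)
    return [pair for pair in pairs if sim[frozenset(pair)] == best]

ties = best_pairs("ABC", sim)
if len(ties) > 1:
    # More than one candidate: the dendrogram is not unique.
    print("Degenerate merge, equally valid choices:", ties)
```

Here both (A, B) and (A, C) are returned, so either could be merged first and the two resulting trees are equally valid.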

Degenerate dendrogram solutions


PFGE + band matching: even worse!

     A    B    C
A  100  100    0
B       100  100
C            100

A=B and B=C, but A≠C

Compromises the concept of a cluster of identical fingerprints
- Relaxed view: each member is identical to at least one other in the cluster (single linkage)
- Strict view: each member is identical to all other members of the cluster (complete linkage)
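The two views can be written down as two small predicates and applied to the A, B, C band-matching example. This is a sketch with hypothetical function names (`identical`, `relaxed`, `strict`), not the way any particular software package implements it.

```python
from itertools import combinations

# Band-matching similarities (%) from the example:
# A matches B and B matches C, but A and C do not match.
sim = {frozenset("AB"): 100, frozenset("BC"): 100, frozenset("AC"): 0}

def identical(a, b):
    return sim[frozenset((a, b))] == 100

def relaxed(members):
    """Relaxed view: each member is identical to at least one other member."""
    return all(any(identical(m, other) for other in members if other != m)
               for m in members)

def strict(members):
    """Strict view: each member is identical to all other members."""
    return all(identical(a, b) for a, b in combinations(members, 2))

print("ABC relaxed:", relaxed("ABC"))
print("ABC strict: ", strict("ABC"))
print("AB strict:  ", strict("AB"), "| BC strict:", strict("BC"))
```

Under the relaxed view A, B and C form one cluster; under the strict view they do not, and the cluster has to be split into either AB or BC plus a singleton, which is itself a degenerate choice.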

Human inspection is ALWAYS needed anyway!

Case Study
PFGE fingerprints
- (Dis)similarity: # of different bands (dendrogram scale: 6 5 4 3 2 1 0)
- Complete linkage clustering
- Result = groups with members that have no more than n bands different from any other member
= Good starting point for pattern naming

Case Study
PFGE fingerprints
- (Dis)similarity: # of different bands (dendrogram scale: 6 5 4 3 2 1 0)
- Single linkage clustering
- Result = groups with members that have no more than n bands different from some other member
= Good starting point for finding clusters of related patterns
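The two case-study criteria can be checked directly on a group of fingerprints. The sketch below uses made-up fingerprints, each represented as a set of band classes, and counts the bands present in one pattern but not the other. The single-linkage check is interpreted as connectivity: every member must be reachable through a chain of steps of at most n bands. All names and data are hypothetical.

```python
from itertools import combinations

# Hypothetical fingerprints: the set of band classes present in each pattern.
fingerprints = {
    "P1": {1, 2, 3, 4, 5},
    "P2": {1, 2, 3, 4, 6},   # 2 bands different from P1
    "P3": {1, 2, 3, 7, 6},   # 2 bands different from P2, 4 from P1
}

def different_bands(a, b):
    """Number of bands present in one fingerprint but not in the other."""
    return len(fingerprints[a] ^ fingerprints[b])

def fits_complete_linkage(group, n):
    """Every pair of members differs by at most n bands."""
    return all(different_bands(a, b) <= n for a, b in combinations(group, 2))

def fits_single_linkage(group, n):
    """All members are connected by a chain of steps of at most n bands."""
    group = list(group)
    reached, frontier = {group[0]}, [group[0]]
    while frontier:
        current = frontier.pop()
        for other in group:
            if other not in reached and different_bands(current, other) <= n:
                reached.add(other)
                frontier.append(other)
    return len(reached) == len(group)

group = ["P1", "P2", "P3"]
print("single linkage, n=2:  ", fits_single_linkage(group, 2))
print("complete linkage, n=2:", fits_complete_linkage(group, 2))
```

With n = 2 the three patterns form one single-linkage group, because P2 bridges P1 and P3, but not a complete-linkage group, since P1 and P3 differ by 4 bands.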

How to interpret a dendrogram?


Dendrogram with leaves A, B, C: suppose the solution is unique. What does this tell you? Still not necessarily anything! Garbage In, Garbage Out.

A cluster algorithm will always produce a tree

Need for methods to address the reliability of a dendrogram
- Phylogenetics: standard tool = Felsenstein's bootstrap
- Not (well) suited to most typing data sets (PFGE, MLST, VNTR)

How to interpret a dendrogram?


Back to less sophisticated methods
- E.g. error flags on cluster levels. Principle: each branch is an average representative of a variety of similarities -> show standard deviation
- Visual inspection
- Cross-validation
- Large data sets are your friends!
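As an illustration of the error-flag idea, the sketch below computes the level and the spread of the final BC + AD branch from the earlier A, B, C, D example. Using the population standard deviation (`pstdev`) is an assumption made here; the slides do not say which variant is intended.

```python
from statistics import mean, pstdev

# Similarities (%) that are averaged by the final BC + AD branch
# of the UPGMA example: B-A, B-D, C-A, C-D.
branch_values = [68, 85, 76, 71]

level = mean(branch_values)       # position of the branch on the scale
spread = pstdev(branch_values)    # assumption: population standard deviation

print(f"branch level {level}%, error flag +/- {spread:.1f}")
```

A wide flag signals that the branch summarises rather heterogeneous similarities, so its exact position should not be over-interpreted.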

Recipe 1: finding seed groups for pattern naming

- Make sure you have a temporary field
- Install the plugin Dendrogram tools
- Select Complete Linkage and Different bands
- Use Fill field with cluster number
- Use 100% similarity
- Specify minimum group size
- Choose destination field (will overwrite any content!)

Results: resulting groups are guaranteed to consist of all identical fingerprints and have at least 5 members

Warning: numbering is not persistent: another data set might give different values
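Outside the software, the outcome of this recipe can be imitated with a short sketch. It assumes that bands have already been assigned to band classes, so that "identical" simply means an equal set of classes. The function `seed_groups` and the example data are hypothetical and do not represent the plugin's actual implementation.

```python
from collections import defaultdict

def seed_groups(fingerprints, min_size=5):
    """Number the groups of identical fingerprints with at least min_size members."""
    by_pattern = defaultdict(list)
    for entry, bands in fingerprints.items():
        by_pattern[frozenset(bands)].append(entry)
    large = [members for members in by_pattern.values() if len(members) >= min_size]
    # Largest group first. The numbers depend on the data set at hand,
    # so they are not persistent between analyses.
    large.sort(key=len, reverse=True)
    return {entry: number
            for number, members in enumerate(large, start=1)
            for entry in members}

# Made-up data: five identical patterns plus a pair that is too small.
example = {f"iso{i}": {1, 2, 3} for i in range(5)}
example.update({"iso5": {1, 2, 4}, "iso6": {1, 2, 4}})

numbers = seed_groups(example, min_size=5)
print(numbers)
```

Entries that do not belong to a sufficiently large group get no number, and adding or removing entries can change which number a group receives.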

Recipe 2: find largest clusters in data set

- Select Single Linkage and Different bands
- Use 100% similarity (or 99% for 1 band difference)
- Specify minimum group size
- Choose destination field
- Use Chart & Statistics tool
- Add Temp field
- Use sort by frequency

Clusters ranked by size: use CTRL+click to select entries
Fingerprints not associated with any (large) cluster
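The frequency sort at the end of this recipe corresponds to counting how often each cluster number occurs in the temporary field. A minimal sketch, with a made-up `temp_field` mapping in which `None` stands for entries that were not assigned to any cluster:

```python
from collections import Counter

# Hypothetical content of the temporary field: cluster number per entry,
# None for fingerprints not associated with any (large) cluster.
temp_field = {"e1": 1, "e2": 1, "e3": 2, "e4": None, "e5": 1, "e6": 2, "e7": 3}

# Count entries per cluster and rank the clusters by size.
sizes = Counter(number for number in temp_field.values() if number is not None)
ranked = sizes.most_common()

# Entries that did not end up in any cluster.
unassigned = [entry for entry, number in temp_field.items() if number is None]

print("clusters ranked by size:", ranked)
print("not in any cluster:", unassigned)
```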
