Contents
1 Introduction 1
1.1 Abstract data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Defining an abstract data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.4 Advantages of abstract data typing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.5 Typical operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.8 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.9 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.10 Citations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.12 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.13 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4 Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Analysis of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Cost models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.2 Run-time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.4 Constant factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Sequences 18
2.1 Array data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.2 Abstract arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.4 Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.7 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Array data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Element identifier and addressing formulas . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.5 Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Dynamic array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Bounded-size dynamic arrays and capacity . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Geometric expansion and amortized cost . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Growth factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.5 Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.6 Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Linked list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.4 Basic concepts and nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.5 Tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.6 Linked list operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.7 Linked lists using arrays of nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.8 Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.9 Internal and external storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.10 Related data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.11 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.12 Footnotes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.13 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.14 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Doubly linked list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5.1 Nomenclature and implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5.2 Basic algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5.3 Advanced concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6 Stack (abstract data type) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.2 Non-essential operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6.3 Software stacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6.4 Hardware stacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6.6 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6.7 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6.9 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6.10 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7 Queue (abstract data type) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7.1 Queue implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7.2 Purely functional implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7.3 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7.5 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8 Double-ended queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3 Dictionaries 53
3.1 Associative array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1.4 Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.5 Permanent storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Association list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.3 Applications and software libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Hash table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.2 Key statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.3 Collision resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.4 Dynamic resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.5 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.6 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.7 Uses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.8 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.9 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.10 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.12 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.13 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4 Linear probing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.4 Choice of hash function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4.5 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5 Quadratic probing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.1 Quadratic function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.2 Quadratic probing insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.3 Quadratic probing search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5.7 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Double hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.1 Classical applied data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.2 Implementation details for caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.3 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.5 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.7 Cuckoo hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.7.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.7.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.7.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7.5 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7.6 Comparison with related structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.7.7 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.7.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.7.9 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.8 Hopscotch hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.8.1 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.8.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.8.3 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.13.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.14 Cryptographic hash function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.14.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.14.2 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.14.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.14.4 Hash functions based on block ciphers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.14.5 Merkle–Damgård construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.14.6 Use in building other cryptographic primitives . . . . . . . . . . . . . . . . . . . . . . . . 100
3.14.7 Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.14.8 Cryptographic hash algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.14.9 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.14.10 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.14.11 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4 Sets 103
4.1 Set (abstract data type) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1.1 Type theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.1.4 Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.1.5 Multiset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.1.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.1.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.1.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2 Bit array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2.2 Basic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2.3 More complex operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.2.4 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.2.5 Advantages and disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.2.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2.7 Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2.8 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.2.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.2.10 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 Bloom filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3.1 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3.2 Space and time advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3.3 Probability of false positives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.3.4 Approximating the number of items in a Bloom filter . . . . . . . . . . . . . . . . . . . . 114
4.3.5 The union and intersection of sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.3.6 Interesting properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
1 Introduction
Abstract data types are purely theoretical entities, used (among other things) to simplify the description of abstract algorithms, to classify and evaluate data structures, and to formally describe the type systems of programming languages. However, an ADT may be implemented by specific data types or data structures, in many ways and in many programming languages; or described in a formal specification language. ADTs are often implemented as modules: the module's interface declares procedures that correspond to the ADT operations, sometimes with comments that describe the constraints. This information hiding strategy allows the implementation of the module to be changed without disturbing the client programs.

The term abstract data type can also be regarded as a generalized approach of a number of algebraic structures, such as lattices, groups, and rings.[4] The notion of abstract data types is related to the concept of data abstraction, important in object-oriented programming and design by contract methodologies for software development.[5]

1.1.3 Defining an abstract data type

An abstract data type is defined as a mathematical model of the data objects that make up a data type as well as the functions that operate on these objects. There are no standard conventions for defining them. A broad division may be drawn between "imperative" and "functional" definition styles.

Imperative-style definition

In the philosophy of imperative programming languages, an abstract data structure is conceived as an entity that is mutable, meaning that it may be in different states at different times. Some operations may change the state of the ADT; therefore, the order in which operations are evaluated is important, and the same operation on the same entities may have different effects if executed at different times, just like the instructions of a computer, or the commands and procedures of an imperative language. To underscore this view, it is customary to say that the operations are executed or applied, rather than evaluated. The imperative style is often used when describing abstract algorithms. (See The Art of Computer Programming by Donald Knuth for more details.)

Abstract variable

Imperative-style definitions of ADT often depend on the concept of an abstract variable, which may be regarded as the simplest non-trivial ADT. An abstract variable V is a mutable entity that admits two operations:

• store(V, x), where x is a value of unspecified nature;
• fetch(V), that yields a value,

with the constraint that

• fetch(V) always returns the value x used in the most recent store(V, x) operation on the same variable V.

As in many programming languages, the operation store(V, x) is often written V ← x (or some similar notation), and fetch(V) is implied whenever a variable V is used in a context where a value is required. Thus, for example, V ← V + 1 is commonly understood to be a shorthand for store(V, fetch(V) + 1).

In this definition, it is implicitly assumed that storing a value into a variable U has no effect on the state of a distinct variable V. To make this assumption explicit, one could add the constraint that

• if U and V are distinct variables, the sequence { store(U, x); store(V, y) } is equivalent to { store(V, y); store(U, x) }.

More generally, ADT definitions often assume that any operation that changes the state of one ADT instance has no effect on the state of any other instance (including other instances of the same ADT), unless the ADT axioms imply that the two instances are connected (aliased) in that sense. For example, when extending the definition of abstract variable to include abstract records, the operation that selects a field from a record variable R must yield a variable V that is aliased to that part of R.

The definition of an abstract variable V may also restrict the stored values x to members of a specific set X, called the range or type of V. As in programming languages, such restrictions may simplify the description and analysis of algorithms, and improve their readability.

Note that this definition does not imply anything about the result of evaluating fetch(V) when V is un-initialized, that is, before performing any store operation on V. An algorithm that does so is usually considered invalid, because its effect is not defined. (However, there are some important algorithms whose efficiency strongly depends on the assumption that such a fetch is legal, and returns some arbitrary value in the variable's range.)

Instance creation

Some algorithms need to create new instances of some ADT (such as new variables, or new stacks). To describe such algorithms, one usually includes in the ADT definition a create() operation that yields an instance of the ADT, usually with axioms equivalent to

• the result of create() is distinct from any instance in use by the algorithm.
This axiom may be strengthened to exclude also partial aliasing with other instances. On the other hand, this axiom still allows implementations of create() to yield a previously created instance that has become inaccessible to the program.

Example: abstract stack (imperative)

As another example, an imperative-style definition of an abstract stack could specify that the state of a stack S can be modified only by the operations

• push(S, x), where x is some value of unspecified nature;
• pop(S), that yields a value as a result,

with the constraint that

• for any value x and any abstract variable V, the sequence of operations { push(S, x); V ← pop(S) } is equivalent to V ← x.

(Since the assignment V ← x cannot change the state of S, this constraint implies that V ← pop(S) restores S to the state it had before the push(S, x).)

An abstract stack definition usually includes also a Boolean-valued function empty(S) and a create() operation that returns a stack instance, with axioms equivalent to

• create() ≠ S for any stack S (a newly created stack is distinct from all previous stacks);
• empty(create()) (a newly created stack is empty);
• not empty(push(S, x)) (pushing something into a stack makes it non-empty).

Single-instance style

Sometimes an ADT is defined as if only one instance of it existed during the execution of the algorithm, and all operations were applied to that instance, which is not explicitly notated. For example, the abstract stack above could have been defined with operations push(x) and pop(), that operate on the only existing stack. ADT definitions in this style can be easily rewritten to admit multiple coexisting instances of the ADT, by adding an explicit instance parameter (like S in the previous example) to every operation that uses or modifies the implicit instance.

On the other hand, some ADTs cannot be meaningfully defined without assuming multiple instances. This is the case when a single operation takes two distinct instances of the ADT as parameters. For an example, consider augmenting the definition of the abstract stack with an operation compare(S, T) that checks whether the stacks S and T contain the same items in the same order.

Functional-style definition

In the functional view, each state of the stack is treated as a value in its own right, and the operations are pure functions on those states:

• push: takes a stack state and a value, returns a stack state;
• top: takes a stack state, returns a value;
• pop: takes a stack state, returns a stack state.

In a functional-style definition there is no need for a create operation. Indeed, there is no notion of "stack instance". The stack states can be thought of as being potential states of a single stack structure, and two stack states that contain the same values in the same order are considered to be identical states. This view actually mirrors the behavior of some concrete implementations, such as linked lists with hash cons.
Instead of create(), a functional-style definition of an abstract stack may assume the existence of a special stack state, the empty stack, designated by a special symbol like Λ or "()"; or define a bottom() operation that takes no arguments and returns this special stack state. Note that the axioms imply that

• push(Λ, x) ≠ Λ.

In a functional-style definition of a stack one does not need an empty predicate: instead, one can test whether a stack is empty by testing whether it is equal to Λ.

Note that these axioms do not define the effect of top(s) or pop(s), unless s is a stack state returned by a push. Since push leaves the stack non-empty, those two operations are undefined (hence invalid) when s = Λ. On the other hand, the axioms (and the lack of side effects) imply that push(s, x) = push(t, y) if and only if x = y and s = t.

As in some other branches of mathematics, it is customary to assume also that the stack states are only those whose existence can be proved from the axioms in a finite number of steps. In the abstract stack example above, this rule means that every stack is a finite sequence of values, that becomes the empty stack (Λ) after a finite number of pops. By themselves, the axioms above do not exclude the existence of infinite stacks (that can be popped forever, each time yielding a different state) or circular stacks (that return to the same state after a finite number of pops). In particular, they do not exclude states s such that pop(s) = s or push(s, x) = s for some x. However, since one cannot obtain such stack states with the given operations, they are assumed "not to exist".

1.1.4 Advantages of abstract data typing

Encapsulation

Abstraction provides a promise that any implementation of the ADT has certain properties and abilities; knowing these is all that is required to make use of an ADT object. The user does not need any technical knowledge of how the implementation works to use the ADT. In this way, the implementation may be complex but will be encapsulated in a simple interface when it is actually used.

Localization of change

Code that uses an ADT object will not need to be edited if the implementation of the ADT is changed. Since any changes to the implementation must still comply with the interface, and since code using an ADT object may only refer to properties and abilities specified in the interface, changes may be made to the implementation without requiring any changes in code where the ADT is used.

Flexibility

Different implementations of the ADT, having all the same properties and abilities, are equivalent and may be used somewhat interchangeably in code that uses the ADT. This gives a great deal of flexibility when using ADT objects in different situations. For example, different implementations of the ADT may be more efficient in different situations; it is possible to use each in the situation where they are preferable, thus increasing overall efficiency.
An extension of ADT for computer graphics was proposed in 1979:[7] the abstract graphical data type (AGDT). It was introduced by Nadia Magnenat Thalmann and Daniel Thalmann. AGDTs provide the advantages of ADTs with facilities to build graphical objects in a structured way.

Imperative-style interface

An imperative-style interface might be:

typedef struct stack_Rep stack_Rep;       // type: stack instance representation (opaque record)
typedef stack_Rep* stack_T;               // type: handle to a stack instance (opaque pointer)
typedef void* stack_Item;                 // type: value stored in stack instance (arbitrary address)

stack_T stack_create(void);               // creates a new empty stack instance
void stack_push(stack_T s, stack_Item x); // adds an item at the top of the stack
stack_Item stack_pop(stack_T s);          // removes the top item from the stack and returns it
bool stack_empty(stack_T s);              // checks whether stack is empty

This interface could be used in the following manner:

#include <stack.h>          // includes the stack interface

stack_T s = stack_create(); // creates a new empty stack instance
int x = 17;
stack_push(s, &x);          // adds the address of x at the top of the stack
void* y = stack_pop(s);     // removes the address of x from the stack and returns it
if (stack_empty(s)) { }     // does something if stack is empty

This interface can be implemented in many ways. The implementation may be arbitrarily inefficient, since the formal definition of the ADT, above, does not specify how much space the stack may use, nor how long each operation should take. It also does not specify whether the stack state s continues to exist after a call x ← pop(s). In practice the formal definition should specify that the space is proportional to the number of items pushed and not yet popped, and that every one of the operations above must finish in a constant amount of time, independently of that number. To comply with these additional specifications, the implementation could use a linked list, or an array (with dynamic resizing) together with two integers (an item count and the array size).

Functional-style interface

Functional-style ADT definitions are more appropriate for functional programming languages, and vice versa. However, one can provide a functional-style interface even in an imperative language like C. For example:

typedef struct stack_Rep stack_Rep;          // type: stack state representation (opaque record)
typedef stack_Rep* stack_T;                  // type: handle to a stack state (opaque pointer)
typedef void* stack_Item;                    // type: value of a stack state (arbitrary address)

stack_T stack_empty(void);                   // returns the empty stack state
stack_T stack_push(stack_T s, stack_Item x); // adds an item at the top of the stack state and returns the resulting stack state
stack_T stack_pop(stack_T s);                // removes the top item from the stack state and returns the resulting stack state
stack_Item stack_top(stack_T s);             // returns the top item of the stack state

Built-in abstract data types

The specification of some programming languages is intentionally vague about the representation of certain built-in data types, defining only the operations that can be done on them. Therefore, those types can be viewed as “built-in ADTs”. Examples are the arrays in many scripting languages, such as Awk, Lua, and Perl, which can be regarded as an implementation of the abstract list.

1.1.8 See also

• Concept (generic programming)
• Formal methods
• Functional specification
• Generalized algebraic data type
• Initial algebra
• Liskov substitution principle
• Type theory
• Walls and Mirrors

1.1.9 Notes

[1] Compare to the characterization of integers in abstract algebra.

1.1.10 Citations

[1] Dale & Walker 1996, p. 3.
[2] Dale & Walker 1996, p. 4.
[3] Liskov & Zilles 1974.
[4] Rudolf Lidl (2004). Abstract Algebra. Springer. ISBN 81-8128-149-7. Chapter 7, section 40.
[5] “What Is Object-Oriented Programming?”. Hiring | Upwork. 2015-05-05. Retrieved 2016-10-28.
[6] Stevens, Al (March 1995). “Al Stevens Interviews Alex Stepanov”. Dr. Dobb’s Journal. Retrieved 31 January 2015.
1.1.11 References

• Liskov, Barbara; Zilles, Stephen (1974). “Programming with abstract data types”. Proceedings of the ACM SIGPLAN Symposium on Very High Level Languages. SIGPLAN Notices. 9. pp. 50–59. doi:10.1145/800233.807045.
• Dale, Nell; Walker, Henry M. (1996). Abstract Data Types: Specifications, Implementations, and Applications. Jones & Bartlett Learning. ISBN 978-0-66940000-7.

1.1.12 Further reading

• Mitchell, John C.; Plotkin, Gordon (July 1988). “Abstract Types Have Existential Type” (PDF). ACM Transactions on Programming Languages and Systems. 10 (3). doi:10.1145/44501.45065.

…operations that can be performed on a data structure and the computational complexity of those operations. In comparison, a data structure is a concrete implementation of the specification provided by an ADT.

Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, relational databases commonly use B-tree indexes for data retrieval,[3] while compiler implementations usually use hash tables to look up identifiers.

Data structures provide a means to manage large amounts of data efficiently for uses such as large databases and internet indexing services. Usually, efficient data structures are key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. Data structures can be used to organize the storage and retrieval of information stored in both main memory and secondary memory.
…or may be of almost any type). Typical implementations allocate contiguous memory words for the elements of arrays (but this is not always a necessity). Arrays may be fixed-length or resizable.

• A linked list (also just called list) is a linear collection of data elements of any type, called nodes, where each node has itself a value, and points to the next node in the linked list. The principal advantage of a linked list over an array is that values can always be efficiently inserted and removed without relocating the rest of the list. Certain other operations, such as random access to a certain element, are however slower on lists than on arrays.

…the C++ Standard Template Library, the Java Collections Framework, and the Microsoft .NET Framework.

Modern languages also generally support modular programming, the separation between the interface of a library module and its implementation. Some provide opaque data types that allow clients to hide implementation details. Object-oriented programming languages, such as C++, Java, and Smalltalk, typically use classes for this purpose.

Many known data structures have concurrent versions which allow multiple computing threads to access a single concrete instance of a data structure simultaneously.
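The node-and-pointer representation described in the linked-list item above can be sketched in C (the type and function names are illustrative, not from the original text):

```c
#include <stdlib.h>

/* A node holds a value and a pointer to the next node in the list. */
struct list_node {
    int value;
    struct list_node *next;
};

/* Insertion at the front is O(1): no other node is moved or copied. */
struct list_node *push_front(struct list_node *head, int value) {
    struct list_node *n = malloc(sizeof *n);
    n->value = value;
    n->next = head;
    return n;
}

/* Random access to the k-th element is O(k): the list must be walked
   node by node, which is why it is slower than array indexing. */
int nth(const struct list_node *head, int k) {
    while (k-- > 0)
        head = head->next;
    return head->value;
}
```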
• Dinesh Mehta and Sartaj Sahni, Handbook of Data Structures and Applications, Chapman and Hall/CRC Press, 2007.
• Niklaus Wirth, Algorithms and Data Structures, Prentice Hall, 1985.

1.2.8 External links

• course on data structures
• Data structures Programs Examples in c,java
• UC Berkeley video course on data structures
• Descriptions from the Dictionary of Algorithms and Data Structures
• Data structures course
• An Examination of Data Structures from .NET perspective
• Schaffer, C. Data Structures and Algorithm Analysis

1.3 Analysis of algorithms

[Figure: number of operations N versus input size n for common complexity functions: log₂ n, √n, n, n log₂ n, n², 2ⁿ, and n!]

…broader computational complexity theory, which provides theoretical estimates for the resources needed by any algorithm which solves a given computational problem. These estimates provide an insight into reasonable directions of search for efficient algorithms.

In theoretical analysis of algorithms it is common to estimate their complexity in the asymptotic sense, i.e., to estimate the complexity function for arbitrarily large input. Big O notation, Big-omega notation and Big-theta notation are used to this end. For instance, binary search is said to run in a number of steps proportional to the logarithm of the length of the sorted list being searched, or in O(log(n)), colloquially “in logarithmic time”. Usually asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency. However, the efficiencies of any two “reasonable” implementations of a given algorithm are related by a constant multiplicative factor called a hidden constant.

Exact (not asymptotic) measures of efficiency can sometimes be computed, but they usually require certain assumptions concerning the particular implementation of the algorithm, called a model of computation. A model of computation may be defined in terms of an abstract computer, e.g., a Turing machine, and/or by postulating that certain operations are executed in unit time. For example, if the sorted list to which we apply binary search has n elements, and we can guarantee that each lookup of an element in the list can be done in unit time, then at most log₂ n + 1 time units are needed to return an answer.

1.3.1 Cost models

Time efficiency estimates depend on what we define to be a step. For the analysis to correspond usefully to the actual execution time, the time required to perform a step must be guaranteed to be bounded above by a constant. One must be careful here; for instance, some analyses count an addition of two numbers as one step. This assumption may not be warranted in certain contexts. For example, if the numbers involved in a computation may be arbitrarily large, the time required by a single addition can no longer be assumed to be constant.
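The unit-cost bound for binary search quoted above, at most log₂ n + 1 time units for a sorted list of n elements, can be illustrated with a sketch that counts the lookups (names and the step counter are illustrative):

```c
/* Binary search over a sorted array a[0..n-1]; *steps counts the unit-time
   lookups of a[mid], which is bounded by log2(n) + 1.
   Returns the index of key, or -1 if it is absent. */
int binary_search(const int *a, int n, int key, int *steps) {
    int lo = 0, hi = n - 1;
    *steps = 0;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        ++*steps;                   /* one lookup/comparison per halving */
        if (a[mid] == key)
            return mid;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

Each iteration halves the remaining range, so for n = 8 the counter never exceeds log₂ 8 + 1 = 4.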
…execute each of the instructions involved with carrying out this algorithm. The specific amount of time to carry out a given instruction will vary depending on which instruction is being executed and which computer is executing it, but on a conventional computer, this amount will be deterministic.[9] Say that the actions carried out in step 1 are considered to consume time T1, step 2 uses time T2, and so forth.

In the algorithm above, steps 1, 2 and 7 will only be run once. For a worst-case evaluation, it should be assumed that step 3 will be run as well. Thus the total amount of time to run steps 1–3 and step 7 is:

T1 + T2 + T3 + T7.

The loops in steps 4, 5 and 6 are trickier to evaluate. The outer loop test in step 4 will execute (n + 1) times (note that an extra step is required to terminate the for loop, hence n + 1 and not n executions), which will consume T4(n + 1) time. The inner loop, on the other hand, is governed by the value of j, which iterates from 1 to i. On the first pass through the outer loop, j iterates from 1 to 1: the inner loop makes one pass, so running the inner loop body (step 6) consumes T6 time, and the inner loop test (step 5) consumes 2T5 time. During the next pass through the outer loop, j iterates from 1 to 2: the inner loop makes two passes, so running the inner loop body (step 6) consumes 2T6 time, and the inner loop test (step 5) consumes 3T5 time.

Altogether, the total time required to run the inner loop body can be expressed as an arithmetic progression:

T6 + 2T6 + 3T6 + ··· + (n − 1)T6 + nT6

which can be factored[10] as

T6 [1 + 2 + 3 + ··· + (n − 1) + n] = [½(n² + n)] T6

The total time required to run the outer loop test can be evaluated similarly:

2T5 + 3T5 + 4T5 + ··· + (n − 1)T5 + nT5 + (n + 1)T5
= T5 + 2T5 + 3T5 + 4T5 + ··· + (n − 1)T5 + nT5 + (n + 1)T5 − T5

which can be factored as

T5 [1 + 2 + 3 + ··· + (n − 1) + n + (n + 1)] − T5
= [½(n² + n)] T5 + (n + 1)T5 − T5
= [½(n² + n)] T5 + nT5
= [½(n² + 3n)] T5

Therefore, the total running time for this algorithm is:

f(n) = T1 + T2 + T3 + T7 + (n + 1)T4 + [½(n² + n)] T6 + [½(n² + 3n)] T5

which reduces to

f(n) = [½(n² + n)] T6 + [½(n² + 3n)] T5 + (n + 1)T4 + T1 + T2 + T3 + T7

As a rule-of-thumb, one can assume that the highest-order term in any given function dominates its rate of growth and thus defines its run-time order. In this example, n² is the highest-order term, so one can conclude that f(n) = O(n²). Formally this can be proven as follows:

Prove that [½(n² + n)] T6 + [½(n² + 3n)] T5 + (n + 1)T4 + T1 + T2 + T3 + T7 ≤ cn², for n ≥ n0:

[½(n² + n)] T6 + [½(n² + 3n)] T5 + (n + 1)T4 + T1 + T2 + T3 + T7
≤ (n² + n)T6 + (n² + 3n)T5 + (n + 1)T4 + T1 + T2 + T3 + T7

Let k be a constant greater than or equal to [T1..T7]. Then

T6(n² + n) + T5(n² + 3n) + (n + 1)T4 + T1 + T2 + T3 + T7 ≤ k(n² + n) + k(n² + 3n) + kn + 5k
= 2kn² + 5kn + 5k ≤ 2kn² + 5kn² + 5kn² (for n ≥ 1) = 12kn²

Therefore [½(n² + n)] T6 + [½(n² + 3n)] T5 + (n + 1)T4 + T1 + T2 + T3 + T7 ≤ cn², n ≥ n0 for c = 12k, n0 = 1.

A more elegant approach to analyzing this algorithm would be to declare that [T1..T7] are all equal to one unit of time, in a system of units chosen so that one unit is greater than or equal to the actual times for these steps. This would mean that the algorithm’s running time breaks down as follows:[11]

4 + Σᵢ₌₁ⁿ i ≤ 4 + Σᵢ₌₁ⁿ n = 4 + n² ≤ 5n² (for n ≥ 1) = O(n²).

Growth rate analysis of other resources

The methodology of run-time analysis can also be utilized for predicting other growth rates, such as consumption of memory space. As an example, consider the following pseudocode, which manages and reallocates memory usage by a program based on the size of a file which that program manages:

while (file still open)
    let n = size of file
    for every 100,000 kilobytes of increase in file size
        double the amount of memory reserved

In this instance, as the file size n increases, memory will be consumed at an exponential growth rate, which is order O(2ⁿ). This is an extremely rapid and most likely unmanageable growth rate for consumption of memory resources.
1.4.1 History
…array would have to create a new array of double the current size (8), copy the old elements onto the new array, and then add the new element. The next three push operations would similarly take constant time, and then the subsequent addition would require another slow doubling of the array size.

In general, if we consider an arbitrary number of pushes n + 1 to an array of size n, we notice that push operations take constant time except for the last one, which takes O(n) time to perform the size-doubling operation. Since there were n + 1 operations total, we can take the average and find that pushing elements onto the dynamic array takes O((n + 1)/n) = O(1), constant time.[3]

1.4.5 References

• Allan Borodin and Ran El-Yaniv (1998). Online Computation and Competitive Analysis. Cambridge University Press. pp. 20, 141.

[1] “Lecture 7: Amortized Analysis” (PDF). https://www.cs.cmu.edu/. Retrieved 14 March 2015.
[2] Rebecca Fiebrink (2007), Amortized Analysis Explained (PDF), retrieved 2011-05-03.
[3] “Lecture 20: Amortized Analysis”. http://www.cs.cornell.edu/. Cornell University. Retrieved 14 March 2015.
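The averaging argument from the dynamic-array discussion above can be checked by simulation; this C sketch counts the element copies performed by a doubling array and confirms that the total work stays linear in the number of pushes (the growth policy and counters are assumptions for illustration):

```c
#include <stdlib.h>

/* Push n elements into an initially empty array that doubles its capacity
   when full; return how many element copies the resizes performed. */
long doubling_push_copies(long n) {
    long size = 0, cap = 0, copies = 0;
    int *data = NULL;
    for (long i = 0; i < n; ++i) {
        if (size == cap) {                 /* full: allocate double and copy */
            long newcap = cap ? 2 * cap : 1;
            int *p = malloc(newcap * sizeof *p);
            for (long k = 0; k < size; ++k)
                p[k] = data[k];            /* the occasional O(n) work */
            copies += size;
            free(data);
            data = p;
            cap = newcap;
        }
        data[size++] = (int)i;             /* the O(1) part of every push */
    }
    free(data);
    return copies;
}
```

The copies form the geometric series 1 + 2 + 4 + ..., which stays below 2n, so the amortized cost per push is O(1).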
…payment will be enough to cover the worst-case cost of an operation, in and of itself. With proper selection of payment, however, this is no longer a difficulty; the expensive operations will only occur when there is sufficient payment in the pool to cover their costs.

1.5.2 Examples

A few examples will help to illustrate the use of the accounting method.

Table expansion

It is often necessary to create a table before it is known how much space is needed. One possible strategy is to double the size of the table when it is full. Here we will use the accounting method to show that the amortized cost of an insertion operation in such a table is O(1).

Before looking at the procedure in detail, we need some definitions. Let T be a table, E an element to insert, num(T) the number of elements in T, and size(T) the allocated size of T. We assume the existence of operations create_table(n), which creates an empty table of size n, for now assumed to be free, and elementary_insert(T, E), which inserts element E into a table T that already has space allocated, with a cost of 1.

The following pseudocode illustrates the table insertion procedure:

function table_insert(T, E)
    if num(T) = size(T)
        U := create_table(2 × size(T))
        for each F in T
            elementary_insert(U, F)
        T := U
    elementary_insert(T, E)

Without amortized analysis, the best bound we can show for n insert operations is O(n²); this is due to the loop at line 4 that performs num(T) elementary insertions.

For analysis using the accounting method, we assign a payment of 3 to each table insertion. Although the reason for this is not clear now, it will become clear during the course of the analysis.

Assume that initially the table is empty with size(T) = m. The first m insertions therefore do not require reallocation and only have cost 1 (for the elementary insert). Therefore, when num(T) = m, the pool has (3 − 1) × m = 2m.

Inserting element m + 1 requires reallocation of the table. Creating the new table on line 3 is free (for now). The loop on line 4 requires m elementary insertions, for a cost of m. Including the insertion on the last line, the total cost for this operation is m + 1. After this operation, the pool therefore has 2m + 3 − (m + 1) = m + 2.

Next, we add another m − 1 elements to the table. At this point the pool has m + 2 + 2 × (m − 1) = 3m. Inserting an additional element (that is, element 2m + 1) can be seen to have cost 2m + 1 and a payment of 3. After this operation, the pool has 3m + 3 − (2m + 1) = m + 2. Note that this is the same amount as after inserting element m + 1. In fact, we can show that this will be the case for any number of reallocations.

It can now be made clear why the payment for an insertion is 3. 1 pays for the first insertion of the element, 1 pays for moving the element the next time the table is expanded, and 1 pays for moving an older element the next time the table is expanded. Intuitively, this explains why an element’s contribution never “runs out” regardless of how many times the table is expanded: since the table is always doubled, the newest half always covers the cost of moving the oldest half.

We initially assumed that creating a table was free. In reality, creating a table of size n may be as expensive as O(n). Let us say that the cost of creating a table of size n is n. Does this new cost present a difficulty? Not really; it turns out we use the same method to show the amortized O(1) bounds. All we have to do is change the payment. When a new table is created, there is an old table with m entries. The new table will be of size 2m. As long as the entries currently in the table have added enough to the pool to pay for creating the new table, we will be all right. We cannot expect the first m/2 entries to help pay for the new table. Those entries already paid for the current table. We must then rely on the last m/2 entries to pay the cost 2m. This means we must add 2m/(m/2) = 4 to the payment for each entry, for a total payment of 3 + 4 = 7.

1.5.3 References

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 17.2: The accounting method, pp. 410–412.

1.6 Potential method

In computational complexity theory, the potential method is a method used to analyze the amortized time and space complexity of a data structure, a measure of its performance over sequences of operations that smooths out the cost of infrequent but expensive operations.[1][2]
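Returning to the accounting-method analysis of table expansion above: the claim that a payment of 3 keeps the pool from ever going negative (with table creation assumed free) can be checked by simulation; a C sketch using the cost model from the text:

```c
/* Simulate `inserts` table insertions with the accounting scheme above:
   payment 3 per insertion, cost 1 per elementary insert, cost num(T) for
   the copying loop on reallocation, table creation free.
   Returns the minimum pool balance ever observed. */
long accounting_min_pool(long inserts, long initial_size) {
    long pool = 0, num = 0, cap = initial_size, min = 0;
    for (long i = 0; i < inserts; ++i) {
        pool += 3;              /* payment charged for this insertion */
        if (num == cap) {       /* table full: reallocate and copy */
            pool -= num;        /* num elementary inserts into the new table */
            cap *= 2;
        }
        pool -= 1;              /* elementary insert of the new element */
        ++num;
        if (pool < min)
            min = pool;
    }
    return min;
}
```

A non-negative result confirms that the pool always covers the actual cost, which is the invariant the accounting method relies on.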
…operation of initializing a data structure is defined to be zero.

Let o be any individual operation within a sequence of operations on some data structure, with S_before denoting the state of the data structure prior to operation o and S_after denoting its state after operation o has completed. Then, once Φ has been chosen, the amortized time for operation o is defined to be

T_amortized(o) = T_actual(o) + C · (Φ(S_after) − Φ(S_before)),

where C is a non-negative constant of proportionality (in units of time) that must remain fixed throughout the analysis. That is, the amortized time is defined to be the actual time taken by the operation plus C times the difference in potential caused by the operation.[1][2]

…this assumption, if X is a type of operation that may be performed by the data structure, and n is an integer defining the size of the given data structure (for instance, the number of items that it contains), then the amortized time for operations of type X is defined to be the maximum, among all possible sequences of operations on data structures of size n and all operations oᵢ of type X within the sequence, of the amortized time for operation oᵢ.

With this definition, the time to perform a sequence of operations may be estimated by multiplying the amortized time for each type of operation in the sequence by the number of operations of that type.

1.6.4 Examples

Dynamic array
…writing array cells without changing the array size) do not cause the potential function to change and have the same constant amortized time as their actual time.[2]

Therefore, with this choice of resizing strategy and potential function, the potential method shows that all dynamic array operations take constant amortized time. Combining this with the inequality relating amortized time and actual time over sequences of operations, this shows that any sequence of n dynamic array operations takes O(n) actual time in the worst case, despite the fact that some of the individual operations may themselves take a linear amount of time.[2]

Multi-Pop Stack

Consider a stack which supports the following operations:

• Initialize - create an empty stack.
• Push - add a single element on top of the stack.
• Pop(k) - remove the top k elements of the stack.

Take the potential Φ to be the number of elements on the stack. A Push operation takes constant time and increases Φ by 1, so its amortized time is constant. A Pop operation takes time O(k) but also reduces Φ by k, so its amortized time is also constant.

This proves that any sequence of m operations takes O(m) actual time in the worst case.

Binary counter

Take the potential Φ to be the number of bits of the counter that are equal to 1. This number is always non-negative and starts with 0, as required.

An Inc operation flips the least significant bit. Then, if the LSB were flipped from 1 to 0, then the next bit should be flipped. This goes on until finally a bit is flipped from 0 to 1, in which case the flipping stops. If the number of bits flipped from 1 to 0 is k, then the actual time is k + 1 and the potential is reduced by k − 1, so the amortized time is 2. Hence, the actual time for running m Inc operations is O(m).

1.6.5 Applications

The potential function method is commonly used to analyze Fibonacci heaps, a form of priority queue in which removing an item takes logarithmic amortized time, and all other operations take constant amortized time.[4] It may also be used to analyze splay trees, a self-adjusting form of binary search tree with logarithmic amortized time per operation.[5]

[3] Goodrich and Tamassia, 1.5.2 Analyzing an Extendable Array Implementation, pp. 139–141; Cormen et al., 17.4 Dynamic tables, pp. 416–424.
[4] Cormen et al., Chapter 20, “Fibonacci Heaps”, pp. 476–497.
[5] Goodrich and Tamassia, Section 3.4, “Splay Trees”, pp. 185–194.
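The binary-counter bookkeeping above can be verified directly; this C sketch implements Inc so that its actual cost is the number of bits flipped, and the actual cost plus the change in Φ comes out to exactly 2 per operation:

```c
/* Potential Φ of the counter: the number of bits equal to 1. */
int phi(unsigned x) {
    int ones = 0;
    while (x) {
        ones += x & 1u;
        x >>= 1;
    }
    return ones;
}

/* Inc flips k trailing 1 bits to 0 and one 0 bit to 1.
   Returns the actual cost k + 1 (total bits flipped). */
int inc(unsigned *counter) {
    int k = 0;
    while ((*counter >> k) & 1u)
        ++k;                  /* count the trailing 1s that will flip to 0 */
    *counter += 1;
    return k + 1;
}
```

For every Inc, Φ changes by 1 − k, so actual cost plus change in potential is (k + 1) + (1 − k) = 2, and m increments cost at most 2m in total.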
Chapter 2
Sequences
2.1 Array data type
2.1.3 Implementations

In order to effectively implement variables of such types as array structures (with indexing done by pointer arithmetic), many languages restrict the indices to integer data types (or other types that can be interpreted as integers, such as bytes and enumerated types), and require that all elements have the same data type and storage size. Most of those languages also restrict each index to a finite interval of integers that remains fixed throughout the lifetime of the array variable. In some compiled languages, in fact, the index ranges may have to be known at compile time.

On the other hand, some programming languages provide more liberal array types that allow indexing by arbitrary values, such as floating-point numbers, strings, objects, references, etc. Such index values cannot be restricted to an interval, much less a fixed interval. So, these languages usually allow arbitrary new elements to be created at any time. This choice precludes the implementation of array types as array data structures. That is, those languages use array-like syntax to implement a more general associative array semantics, and must therefore be implemented by a hash table or some other search data structure.

2.1.4 Language support

Multi-dimensional arrays

The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array type. (This nomenclature conflicts with the concept of dimension in linear algebra,[5] where it is the number of elements. Thus, an array of numbers with 5 rows and 4 columns, hence 20 elements, is said to have dimension 2 in computing contexts, but represents a matrix with dimension 4-by-5 or 20 in mathematics. Also, the computer science meaning of “rank” is similar to its meaning in tensor algebra but not to the linear algebra concept of rank of a matrix.)

Many languages support only one-dimensional arrays. In those languages, a multi-dimensional array is typically represented by an Iliffe vector, a one-dimensional array of references to arrays of one dimension less. A two-dimensional array, in particular, would be implemented as a vector of pointers to its rows. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation). This way of emulating multi-dimensional arrays allows the creation of jagged arrays, where each row may have a different size, or, in general, where the valid range of each index depends on the values of all preceding indices.

[Figure: a two-dimensional array stored as a one-dimensional array of one-dimensional arrays (rows).]

This representation for multi-dimensional arrays is quite prevalent in C and C++ software. However, C and C++ will use a linear indexing formula for multi-dimensional arrays that are declared with compile-time constant size, e.g. by int A[10][20] or int A[m][n], instead of the traditional int **A.[6]:p.81

Indexing notation

Most programming languages that support arrays support the store and select operations, and have special syntax for indexing. Early languages used parentheses, e.g. A(i,j), as in FORTRAN; others choose square brackets, e.g. A[i,j] or A[i][j], as in Algol 60 and Pascal (to distinguish from the use of parentheses for function calls).

Index types

Array data types are most often implemented as array structures: with the indices restricted to integer (or totally ordered) values, index ranges fixed at array creation time, and multilinear element addressing. This was the case in most “third generation” languages, and is still the case of most systems programming languages such as Ada, C, and C++. In some languages, however, array data types have the semantics of associative arrays, with indices of arbitrary type and dynamic element creation. This is the case in some scripting languages such as Awk and Lua, and of some array types provided by standard C++ libraries.
Some languages allow dynamic arrays (also called resizable, growable, or extensible): array variables whose index ranges may be expanded at any time after creation, without changing the values of the current elements.

For one-dimensional arrays, this facility may be provided as an operation “append(A, x)” that increases the size of the array A by one and then sets the value of the last element to x. Other array types (such as Pascal strings) provide a concatenation operator, which can be used together with slicing to achieve that effect and more. In some languages, assigning a value to an element of an array automatically extends the array, if necessary, to include that element. In other array types, a slice can be replaced by an array of a different size, with subsequent elements being renumbered accordingly, as in Python’s list assignment “A[5:5] = [10,20,30]”, which inserts three new elements (10, 20, and 30) before element “A[5]”. Resizable arrays are conceptually similar to lists, and the two concepts are synonymous in some languages.

An extensible array can be implemented as a fixed-size array with a counter that records how many elements are actually in use. The append operation merely increments the counter, until the whole array is used, when the append operation may be defined to fail. This is an implementation of a dynamic array with a fixed capacity, as in the string type of Pascal. Alternatively, the append operation may re-allocate the underlying array with a larger size, and copy the old elements to the new area.

2.1.5 See also

2.1.6 References

[1] Robert W. Sebesta (2001). Concepts of Programming Languages. Addison-Wesley. 4th edition (1998), 5th edition (2001). ISBN 9780201385960.
[2] K. Jensen and Niklaus Wirth, PASCAL User Manual and Report. Springer. Paperback edition (2007), 184 pages. ISBN 978-3540069508.
[3] John Mitchell, Concepts of Programming Languages. Cambridge University Press.
[4] Luckham, Suzuki (1979), “Verification of array, record, and pointer operations in Pascal”. ACM Transactions on Programming Languages and Systems 1(2), 226–244.
[5] See the definition of a matrix.
[6] Brian W. Kernighan and Dennis M. Ritchie (1988), The C Programming Language. Prentice-Hall, 205 pages.
[7] Edsger W. Dijkstra, Why numbering should start at zero.

2.1.7 External links

• NIST’s Dictionary of Algorithms and Data Structures: Array

2.2 Array data structure

This article is about the byte-layout-level structure. For the abstract data type, see Array data type. For other uses, see Array.
Arrays are among the oldest and most important data structures, and are used by almost every program. They are also used to implement many other data structures, such as lists and strings. They effectively exploit the addressing logic of computers. In most modern computers and many external storage devices, the memory is a one-dimensional array of words, whose indices are their addresses. Processors, especially vector processors, are often optimized for array operations.

Arrays are useful mostly because the element indices can be computed at run time. Among other things, this feature allows a single iterative statement to process arbitrarily many elements of an array. For that reason, the elements of an array data structure are required to have the same size and should use the same data representation. The set of valid index tuples and the addresses of the elements (and hence the element addressing formula) are usually,[3][5] but not always,[2] fixed while the array is in use.

The term array is often used to mean array data type, a kind of data type provided by most high-level programming languages that consists of a collection of values or variables that can be selected by one or more indices computed at run-time. Array types are often implemented by array structures; however, in some languages they may be implemented by hash tables, linked lists, search trees, or other data structures.

The term is also used, especially in the description of algorithms, to mean associative array or “abstract array”, a theoretical computer science model (an abstract data type or ADT) intended to capture the essential properties of arrays.

2.2.1 History

The first digital computers used machine-language programming to set up and access array structures for data tables, vector and matrix computations, and for many other purposes. John von Neumann wrote the first array-sorting program (merge sort) in 1945, during the building of the first stored-program computer.[6]:p. 159 Array indexing was originally done by self-modifying code, and later using index registers and indirect addressing. Some mainframes designed in the 1960s, such as the Burroughs B5000 and its successors, used memory segmentation to…

2.2.2 Applications

Arrays are used to implement mathematical vectors and matrices, as well as other kinds of rectangular tables. Many databases, small and large, consist of (or include) one-dimensional arrays whose elements are records.

Arrays are used to implement other data structures, such as lists, heaps, hash tables, deques, queues, stacks, strings, and VLists. Array-based implementations of other data structures are frequently simple and space-efficient (implicit data structures), requiring little space overhead, but may have poor space complexity, particularly when modified, compared to tree-based data structures (compare a sorted array to a search tree).

One or more large arrays are sometimes used to emulate in-program dynamic memory allocation, particularly memory pool allocation. Historically, this has sometimes been the only way to allocate “dynamic memory” portably.

Arrays can be used to determine partial or complete control flow in programs, as a compact alternative to (otherwise repetitive) multiple IF statements. They are known in this context as control tables and are used in conjunction with a purpose-built interpreter whose control flow is altered according to values contained in the array. The array may contain subroutine pointers (or relative subroutine numbers that can be acted upon by SWITCH statements) that direct the path of the execution.

2.2.3 Element identifier and addressing formulas

When data objects are stored in an array, individual objects are selected by an index that is usually a non-negative scalar integer. Indexes are also called subscripts. An index maps the array value to a stored object.

There are three ways in which the elements of an array can be indexed:

• 0 (zero-based indexing): The first element of the array is indexed by subscript of 0.[8]
• 1 (one-based indexing): The first element of the array is indexed by subscript of 1.[9]
perform index-bounds checking in hardware.[7] • n (n-based indexing): The base index of an array can
Assembly languages generally have no special support be freely chosen. Usually programming languages
for arrays, other than what the machine itself provides. allowing n-based indexing also allow negative index
The earliest high-level programming languages, including values and other scalar data types like enumerations,
FORTRAN (1957), Lisp (1958), COBOL (1960), and or characters may be used as an array index.
ALGOL 60 (1960), had support for multi-dimensional
arrays, and so has C (1972). In C++ (1983), class tem- Arrays can have multiple dimensions, thus it is not un-
plates exist for multi-dimensional arrays whose dimen- common to access an array using multiple indices. For
sion is fixed at runtime[3][5] as well as for runtime-flexible example, a two-dimensional array A with three rows and
arrays.[2] four columns might provide access to the element at the
2.2. ARRAY DATA STRUCTURE 23
The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array.

In standard arrays, each index is restricted to a certain range of consecutive integers (or consecutive values of some enumerated type), and the address of an element is computed by a "linear" formula on the indices.

One-dimensional arrays

A one-dimensional array (or single dimension array) is a type of linear array. Accessing its elements involves a single subscript which can either represent a row or column index.

As an example consider the C declaration int anArrayName[10];

Syntax: datatype anArrayname[sizeofArray];

In the given example the array can contain 10 elements of any value available to the int type. In C, the array element indices are 0-9 inclusive in this case. For example, the expressions anArrayName[0] and anArrayName[9] are the first and last elements respectively.

For a vector with linear addressing, the element with index i is located at the address B + c × i, where B is a fixed base address and c a fixed constant, sometimes called the address increment or stride.

If the valid element indices begin at 0, the constant B is simply the address of the first element of the array. For this reason, the C programming language specifies that array indices always begin at 0, and many programmers will call that element "zeroth" rather than "first".

However, one can choose the index of the first element by an appropriate choice of the base address B. For example, if the array has five elements, indexed 1 through 5, and the base address B is replaced by B − 30c, then the indices of those same elements will be 31 to 35. If the numbering does not start at 0, the constant B may not be the address of any element.

Multidimensional arrays

For a multidimensional array, the element with indices i, j would have address B + c · i + d · j, where the coefficients c and d are the row and column address increments, respectively.

More generally, in a k-dimensional array, the address of an element with indices i1, i2, ..., ik is

B + c1 · i1 + c2 · i2 + ... + ck · ik.

For example: int a[2][3];

This means that array a has 2 rows and 3 columns, and the array is of integer type. Here we can store 6 elements; they are stored linearly, starting with the first row and continuing with the second row. The above array will be stored as a11, a12, a13, a21, a22, a23.

This formula requires only k multiplications and k additions, for any array that can fit in memory. Moreover, if any coefficient is a fixed power of 2, the multiplication can be replaced by bit shifting.

The coefficients ck must be chosen so that every valid index tuple maps to the address of a distinct element.

If the minimum legal value for every index is 0, then B is the address of the element whose indices are all zero. As in the one-dimensional case, the element indices may be changed by changing the base address B. Thus, if a two-dimensional array has rows and columns indexed from 1 to 10 and 1 to 20, respectively, then replacing B by B + c1 − 3c2 will cause them to be renumbered from 0 through 9 and 4 through 23, respectively. Taking advantage of this feature, some languages (like FORTRAN 77) specify that array indices begin at 1, as in mathematical tradition, while other languages (like Fortran 90, Pascal and Algol) let the user choose the minimum value for each index.

Dope vectors

The addressing formula is completely defined by the dimension d, the base address B, and the increments c1, c2, ..., ck. It is often useful to pack these parameters into a record called the array's descriptor or stride vector or dope vector.[2][3] The size of each element, and the minimum and maximum values allowed for each index, may also be included in the dope vector. The dope vector is a complete handle for the array, and is a convenient way to pass arrays as arguments to procedures. Many useful array slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector.[2]

Compact layouts

Often the coefficients are chosen so that the elements occupy a contiguous area of memory. However, that is not necessary. Even if arrays are always created with contiguous elements, some array slicing operations may create non-contiguous sub-arrays from them.

There are two systematic compact layouts for a two-dimensional array. For example, consider the matrix

    1 2 3
A = 4 5 6
    7 8 9

In the row-major order layout (adopted by C for statically declared arrays), the elements in each row are stored in consecutive positions and all of the elements of a row have a lower address than any of the elements of a consecutive row.
In column-major order (traditionally used by Fortran), the elements in each column are consecutive in memory and all of the elements of a column have a lower address than any of the elements of a consecutive column.

For arrays with three or more indices, "row major order" puts in consecutive positions any two elements whose index tuples differ only by one in the last index. "Column major order" is analogous with respect to the first index.

In systems which use processor cache or virtual memory, scanning an array is much faster if successive elements are stored in consecutive positions in memory, rather than sparsely scattered. Many algorithms that use multidimensional arrays will scan them in a predictable order. A programmer (or a sophisticated compiler) may use this information to choose between row- or column-major layout for each array. For example, when computing the product A·B of two matrices, it would be best to have A stored in row-major order, and B in column-major order.

Resizing

Main article: Dynamic array

Static arrays have a size that is fixed when they are created and consequently do not allow elements to be inserted or removed. However, by allocating a new array and copying the contents of the old array to it, it is possible to effectively implement a dynamic version of an array; see dynamic array. If this operation is done infrequently, insertions at the end of the array require only amortized constant time.

Some array data structures do not reallocate storage, but do store a count of the number of elements of the array in use, called the count or size. This effectively makes the array a dynamic array with a fixed maximum size or capacity; Pascal strings are examples of this.

Non-linear formulas

More complicated (non-linear) formulas are occasionally used. For a compact two-dimensional triangular array, for instance, the addressing formula is a polynomial of degree 2.

2.2.4 Efficiency

Both store and select take (deterministic worst case) constant time. Arrays take linear (O(n)) space in the number of elements n that they hold.

In an array with element size k and on a machine with a cache line size of B bytes, iterating through an array of n elements requires a minimum of ceiling(nk/B) cache misses, because its elements occupy contiguous memory locations. This is roughly a factor of B/k better than the number of cache misses needed to access n elements at random memory locations. As a consequence, sequential iteration over an array is noticeably faster in practice than iteration over many other data structures, a property called locality of reference (this does not mean, however, that using a perfect hash or trivial hash within the same (local) array will not be even faster, and achievable in constant time). Libraries provide low-level optimized facilities for copying ranges of memory (such as memcpy) which can be used to move contiguous blocks of array elements significantly faster than can be achieved through individual element access. The speedup of such optimized routines varies by array element size, architecture, and implementation.

Memory-wise, arrays are compact data structures with no per-element overhead. There may be a per-array overhead, e.g. to store index bounds, but this is language-dependent. It can also happen that elements stored in an array require less memory than the same elements stored in individual variables, because several array elements can be stored in a single word; such arrays are often called packed arrays. An extreme (but commonly used) case is the bit array, where every bit represents a single element. A single octet can thus hold up to 256 different combinations of up to 8 different conditions, in the most compact form.

Array accesses with statically predictable access patterns are a major source of data parallelism.

Comparison with other data structures

Growable arrays are similar to arrays but add the ability to insert and delete elements; adding and deleting at the end is particularly efficient. However, they reserve linear (Θ(n)) additional storage, whereas arrays do not reserve additional storage.

Associative arrays provide a mechanism for array-like functionality without huge storage overheads when the index values are sparse. For example, an array that contains values only at indexes 1 and 2 billion may benefit from using such a structure. Specialized associative arrays with integer keys include Patricia tries, Judy arrays, and van Emde Boas trees.

Balanced trees require O(log n) time for indexed access, but also permit inserting or deleting elements in O(log n) time,[15] whereas growable arrays require linear (Θ(n)) time to insert or delete elements at an arbitrary position.

Linked lists allow constant time removal and insertion in the middle but take linear time for indexed access. Their memory use is typically worse than arrays, but is still linear.
[Figure: A two-dimensional array stored as a one-dimensional array of one-dimensional arrays (rows).]

An Iliffe vector is an alternative to a multidimensional array structure. It uses a one-dimensional array of references to arrays of one dimension less. For two dimensions, in particular, this alternative structure would be a vector of pointers to vectors, one for each row. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation). This alternative structure allows jagged arrays, where each row may have a different size, or, in general, where the valid range of each index depends on the values of all preceding indices. It also saves one multiplication (by the column address increment), replacing it by a bit shift (to index the vector of row pointers) and one extra memory access (fetching the row address), which may be worthwhile in some architectures.

• Array slicing

• Offset (computer science)

• Row-major order

• Stride of an array

2.2.7 References

[1] Black, Paul E. (13 November 2008). "array". Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Retrieved 22 August 2010.

[2] Bjoern Andres; Ullrich Koethe; Thorben Kroeger; Hamprecht (2010). "Runtime-Flexible Multi-dimensional Arrays and Views for C++98 and C++0x". arXiv:1008.2909 [cs.DS].

[3] Garcia, Ronald; Lumsdaine, Andrew (2005). "MultiArray: a C++ library for generic programming with arrays". Software: Practice and Experience. 35 (2): 159–188. doi:10.1002/spe.630. ISSN 0038-0644.

[4] David R. Richardson (2002), The Book on Data Structures. iUniverse, 112 pages. ISBN 0-595-24039-9, ISBN 978-0-595-24039-5.

[5] Veldhuizen, Todd L. (December 1998). Arrays in Blitz++ (PDF). Computing in Object-Oriented Parallel Environments. Lecture Notes in Computer Science. 1505. Springer Berlin Heidelberg. pp. 223–230. doi:10.1007/3-540-49372-7_24. ISBN 978-3-540-65387-5.
[11] Gerald Kruse. CS 240 Lecture Notes: Linked Lists Plus: Complexity Trade-offs. Juniata College. Spring 2008.

[12] Day 1 Keynote - Bjarne Stroustrup: C++11 Style at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44

[13] Number crunching: Why you should never, ever, EVER use linked-list in your code again at kjellkod.wordpress.com

[14] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999), Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF), Department of Computer Science, University of Waterloo

[15] "Counted B-Trees".

2.3 Dynamic array

[Figure: Several values are inserted at the end of a dynamic array using geometric expansion. Grey cells indicate space reserved for expansion. Most insertions are fast (constant time), while some are slow due to the need for reallocation (Θ(n) time, labelled with turtles). The logical size and capacity of the final array are shown.]

In computer science, a dynamic array, growable array, resizable array, dynamic table, mutable array, or array list is a random access, variable-size list data structure that allows elements to be added or removed. It is supplied with standard libraries in many modern mainstream programming languages.

A dynamic array is not the same thing as a dynamically allocated array, which is an array whose size is fixed when the array is allocated, although a dynamic array may use such a fixed-size array as a back end.[1]

2.3.1 Bounded-size dynamic arrays and capacity

A simple dynamic array can be constructed by allocating an array of fixed size, typically larger than the number of elements immediately required. The elements of the dynamic array are stored contiguously at the start of the underlying array, and the remaining positions towards the end of the underlying array are reserved, or unused. Elements can be added at the end of a dynamic array in constant time by using the reserved space, until this space is completely consumed. When all space is consumed, and an additional element is to be added, the underlying fixed-size array needs to be increased in size. Typically resizing is expensive because it involves allocating a new underlying array and copying each element from the original array. Elements can be removed from the end of a dynamic array in constant time, as no resizing is required. The number of elements used by the dynamic array contents is its logical size or size, while the size of the underlying array is called the dynamic array's capacity or physical size, which is the maximum possible size without relocating data.[2]

A fixed-size array will suffice in applications where the maximum logical size is fixed (e.g. by specification), or can be calculated before the array is allocated. A dynamic array might be preferred if the maximum logical size is unknown, or hard to calculate, before the array is allocated.

2.3.2 Geometric expansion and amortized cost

To avoid incurring the cost of resizing many times, dynamic arrays resize by a large amount, such as doubling in size, and use the reserved space for future expansion. The operation of adding an element to the end might work as follows:

function insertEnd(dynarray a, element e)
    if (a.size = a.capacity)
        // resize a to twice its current capacity:
        a.capacity ← a.capacity * 2
        // (copy the contents to the new memory location here)
    a[a.size] ← e
    a.size ← a.size + 1

As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion a ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time.
Many dynamic arrays also deallocate some of the underlying storage if its size drops below a certain threshold, such as 30% of the capacity. This threshold must be strictly smaller than 1/a in order to provide hysteresis (provide a stable band to avoid repeatedly growing and shrinking) and support mixed sequences of insertions and removals with amortized constant cost.

Dynamic arrays are a common example when teaching amortized analysis.[3][4]

2.3.3 Growth factor

The growth factor for the dynamic array depends on several factors including a space-time trade-off and algorithms used in the memory allocator itself. For growth factor a, the average time per insertion operation is about a/(a−1), while the number of wasted cells is bounded above by (a−1)n. If the memory allocator uses a first-fit allocation algorithm, then growth factor values such as a=2 can cause dynamic array expansion to run out of memory even though a significant amount of memory may still be available.[5] There have been various discussions on ideal growth factor values, including proposals for the Golden Ratio as well as the value 1.5.[6] Many textbooks, however, use a = 2 for simplicity and analysis purposes.[3][4]

Below are growth factors used by several popular implementations:

2.3.4 Performance

The dynamic array has performance similar to an array, with the addition of new operations to add and remove elements:

• Getting or setting the value at a particular index (constant time)

• Iterating over the elements in order (linear time, good cache performance)

• Inserting or deleting an element in the middle of the array (linear time)

• Inserting or deleting an element at the end of the array (constant amortized time)

Dynamic arrays benefit from many of the advantages of arrays, including good locality of reference and data cache utilization, compactness (low memory use), and random access. They usually have only a small fixed additional overhead for storing information about the size and capacity. This makes dynamic arrays an attractive tool for building cache-friendly data structures. However, in languages like Python or Java that enforce reference semantics, the dynamic array generally will not store the actual data, but rather it will store references to the data that resides in other areas of memory. In this case, accessing items in the array sequentially will actually involve accessing multiple non-contiguous areas of memory, so the many advantages of the cache-friendliness of this data structure are lost.

Compared to linked lists, dynamic arrays have faster indexing (constant time versus linear time) and typically faster iteration due to improved locality of reference; however, dynamic arrays require linear time to insert or delete at an arbitrary location, since all following elements must be moved, while linked lists can do this in constant time. This disadvantage is mitigated by the gap buffer and tiered vector variants discussed under Variants below. Also, in a highly fragmented memory region, it may be expensive or impossible to find contiguous space for a large dynamic array, whereas linked lists do not require the whole data structure to be stored contiguously.

A balanced tree can store a list while providing all operations of both dynamic arrays and linked lists reasonably efficiently, but both insertion at the end and iteration over the list are slower than for a dynamic array, in theory and in practice, due to non-contiguous storage and tree traversal/manipulation overhead.

2.3.5 Variants

Gap buffers are similar to dynamic arrays but allow efficient insertion and deletion operations clustered near the same arbitrary location. Some deque implementations use array deques, which allow amortized constant time insertion/removal at both ends, instead of just one end.

Goodrich[15] presented a dynamic array algorithm called Tiered Vectors that provided O(n^1/2) performance for order-preserving insertions or deletions from the middle of the array.

Hashed Array Tree (HAT) is a dynamic array algorithm published by Sitarski in 1996.[16] Hashed Array Tree wastes order n^1/2 amount of storage space, where n is the number of elements in the array. The algorithm has O(1) amortized performance when appending a series of objects to the end of a Hashed Array Tree.

In a 1999 paper,[14] Brodnik et al. describe a tiered dynamic array data structure, which wastes only n^1/2 space for n elements at any point in time, and they prove a lower bound showing that any dynamic array must waste this much space if the operations are to remain amortized constant time. Additionally, they present a variant where growing and shrinking the buffer has not only amortized but worst-case constant time.

Bagwell (2002)[17] presented the VList algorithm, which can be adapted to implement a dynamic array.
2.3.6 Language support

C++'s std::vector is an implementation of dynamic arrays, as are the ArrayList classes supplied with the Java API[18] and the .NET Framework.[19] The generic List<> class supplied with version 2.0 of the .NET Framework is also implemented with dynamic arrays. Smalltalk's OrderedCollection is a dynamic array with dynamic start and end-index, making the removal of the first element also O(1). Python's list datatype implementation is a dynamic array. Delphi and D implement dynamic arrays at the language's core. Ada's Ada.Containers.Vectors generic package provides dynamic array implementation for a given subtype. Many scripting languages such as Perl and Ruby offer dynamic arrays as a built-in primitive data type. Several cross-platform frameworks provide dynamic array implementations for C, including CFArray and CFMutableArray in Core Foundation, and GArray and GPtrArray in GLib.

2.3.7 References

[1] See, for example, the source code of java.util.ArrayList class from OpenJDK 6.

[7] List object implementation from python.org, retrieved 2011-09-27.

[8] Brais, Hadi. "Dissecting the C++ STL Vector: Part 3 - Capacity & Size". Micromysteries. Retrieved 2015-08-05.

[9] "facebook/folly". GitHub. Retrieved 2015-08-05.

[10] Chris Okasaki (1995). "Purely Functional Random-Access Lists". Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture: 86–95. doi:10.1145/224164.224187.

[12] Day 1 Keynote - Bjarne Stroustrup: C++11 Style at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44

[13] Number crunching: Why you should never, ever, EVER use linked-list in your code again at kjellkod.wordpress.com

[14] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999), Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF), Department of Computer Science, University of Waterloo

[15] Goodrich, Michael T.; Kloss II, John G. (1999), "Tiered Vectors: Efficient Dynamic Arrays for Rank-Based Sequences", Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 1663: 205–216, doi:10.1007/3-540-48447-7_21, ISBN 978-3-540-66279-2

[16] Sitarski, Edward (September 1996), "HATs: Hashed array trees", Algorithm Alley, Dr. Dobb's Journal, 21 (11)

[17] Bagwell, Phil (2002), Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays, EPFL

[18] Javadoc on ArrayList

[19] ArrayList Class

2.4 Linked list

In computer science, a linked list is a linear collection of data elements, called nodes, each pointing to the next node by means of a pointer. It is a data structure consisting of a group of nodes which together represent a sequence. Under the simplest form, each node is composed of data and a reference (in other words, a link) to the next node in the sequence. This structure allows for efficient insertion or removal of elements from any position in the sequence during iteration. More complex variants add additional links, allowing efficient insertion or removal from arbitrary element references.
[Figure: A linked list whose nodes contain two fields: an integer value and a link to the next node. The last node is linked to a terminator used to signify the end of the list.]

Linked lists are among the simplest and most common data structures. They can be used to implement several other common abstract data types, including lists (the abstract data type), stacks, queues, associative arrays, and S-expressions, though it is not uncommon to implement the other data structures directly without using a list as the basis of implementation.

The principal benefit of a linked list over a conventional array is that the list elements can easily be inserted or removed without reallocation or reorganization of the entire structure because the data items need not be stored contiguously in memory or on disk, while an array has to be declared in the source code before compiling and running the program. Linked lists allow insertion and removal of nodes at any point in the list, and can do so with a constant number of operations if the link previous to the link being added or removed is maintained during list traversal.

On the other hand, simple linked lists by themselves do not allow random access to the data or any form of efficient indexing. Thus, many basic operations, such as obtaining the last node of the list (assuming that the last node is not maintained as a separate node reference in the list structure), finding a node that contains a given datum, or locating the place where a new node should be inserted, may require sequential scanning of most or all of the list elements. The advantages and disadvantages of using linked lists are given below.

2.4.1 Advantages

• Linked lists are a dynamic data structure, which can grow and be pruned, allocating and deallocating memory while the program is running.

• Insertion and deletion node operations are easily implemented in a linked list.

• Dynamic data structures such as stacks and queues can be implemented using a linked list.

• There is no need to define an initial size for a linked list.

• Items can be added or removed from the middle of the list.

• Backtracking is possible in a doubly linked list.

2.4.2 Disadvantages

• They use more memory than arrays because of the storage used by their pointers.

• Nodes in a linked list must be read in order from the beginning, as linked lists are inherently sequential access.

• Nodes are stored non-contiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.

• Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards,[1] and while doubly linked lists are somewhat easier to read, memory is consumed in allocating space for a back-pointer.

2.4.3 History

Linked lists were developed in 1955–1956 by Allen Newell, Cliff Shaw and Herbert A. Simon at RAND Corporation as the primary data structure for their Information Processing Language. IPL was used by the authors to develop several early artificial intelligence programs, including the Logic Theory Machine, the General Problem Solver, and a computer chess program. Reports on their work appeared in IRE Transactions on Information Theory in 1956, and in several conference proceedings from 1957 to 1959, including Proceedings of the Western Joint Computer Conference in 1957 and 1958, and Information Processing (Proceedings of the first UNESCO International Conference on Information Processing) in 1959. The now-classic diagram consisting of blocks representing list nodes with arrows pointing to successive list nodes appears in "Programming the Logic Theory Machine" by Newell and Shaw in Proc. WJCC, February 1957. Newell and Simon were recognized with the ACM Turing Award in 1975 for having "made basic contributions to artificial intelligence, the psychology of human cognition, and list processing". The problem of machine translation for natural language processing led Victor Yngve at Massachusetts Institute of Technology (MIT) to use linked lists as data structures in his COMIT programming language for computer research in the field of linguistics. A report on this language entitled "A programming language for mechanical translation" appeared in Mechanical Translation in 1958.

LISP, standing for list processor, was created by John McCarthy in 1958 while he was at MIT, and in 1960 he published its design in a paper in the Communications of the ACM, entitled "Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I". One of LISP's major data structures is the linked list.

By the early 1960s, the utility of both linked lists and languages which use these structures as their primary data representation was well established. Bert Green of the MIT Lincoln Laboratory published a review article entitled "Computer languages for symbol manipulation" in IRE Transactions on Human Factors in Electronics in March 1961, which summarized the advantages of the linked list approach.
March 1961 which summarized the advantages of the linked list approach. A later review article, “A Comparison of list-processing computer languages” by Bobrow and Raphael, appeared in Communications of the ACM in April 1964.

Several operating systems developed by Technical Systems Consultants (originally of West Lafayette, Indiana, and later of Chapel Hill, North Carolina) used singly linked lists as file structures. A directory entry pointed to the first sector of a file, and succeeding portions of the file were located by traversing pointers. Systems using this technique included Flex (for the Motorola 6800 CPU), mini-Flex (same CPU), and Flex9 (for the Motorola 6809 CPU). A variant developed by TSC and marketed by Smoke Signal Broadcasting in California used doubly linked lists in the same manner.

The TSS/360 operating system, developed by IBM for the System 360/370 machines, used a doubly linked list for its file system catalog. The directory structure was similar to Unix, where a directory could contain files and other directories and extend to any depth.

2.4.4 Basic concepts and nomenclature

Each record of a linked list is often called an 'element' or 'node'.

The field of each node that contains the address of the next node is usually called the 'next link' or 'next pointer'. The remaining fields are known as the 'data', 'information', 'value', 'cargo', or 'payload' fields.

The 'head' of a list is its first node. The 'tail' of a list may refer either to the rest of the list after the head, or to the last node in the list. In Lisp and some derived languages, the next node may be called the 'cdr' (pronounced "could-er") of the list, while the payload of the head node may be called the 'car'.

Singly linked lists contain nodes which have a data field as well as a 'next' field, which points to the next node in the line of nodes. Operations that can be performed on singly linked lists include insertion, deletion and traversal. In other terms: each link of a singly linked list can store a data item, called an element; each link also contains a reference to the next link, called Next; and the list itself holds a reference to its first link, called First.

Doubly linked list

Main article: Doubly linked list

In a 'doubly linked list', each node contains, besides the next-node link, a second link field pointing to the 'previous' node in the sequence. The two links may be called 'forward(s)' and 'backward(s)', or 'next' and 'prev' ('previous').

A doubly linked list whose nodes contain three fields: an integer value (here 12 and 99), the link forward to the next node, and the link backward to the previous node.

A technique known as XOR-linking allows a doubly linked list to be implemented using a single link field in each node. However, this technique requires the ability to do bit operations on addresses, and therefore may not be available in some high-level languages.

Many modern operating systems use doubly linked lists to maintain references to active processes, threads, and other dynamic objects.[2] A common strategy for rootkits to evade detection is to unlink themselves from these lists.[3]

Multiply linked list

In a 'multiply linked list', each node contains two or more link fields, each field being used to connect the same set of data records in a different order (e.g., by name, by department, by date of birth, etc.). While doubly linked lists can be seen as special cases of multiply linked lists, the fact that the two orders are opposite to each other leads to simpler and more efficient algorithms, so they are usually treated as a separate case.

In the last node of a list, the link field often contains a null reference, a special value used to indicate the lack of further nodes. A less common convention is to make it point to the first node of the list; in that case the list is said to be 'circular' or 'circularly linked'; otherwise it is said to be 'open' or 'linear'.

A circular linked list containing the values 12, 99, and 37.
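The node-and-link vocabulary above can be made concrete with a short sketch. Python is used here purely for illustration; the class and variable names are ours, not the text's:

```python
class Node:
    """A list 'element' or 'node': a payload field plus a 'next' link."""
    def __init__(self, data, next=None):
        self.data = data   # the 'data' / 'payload' field
        self.next = next   # the 'next link'; None plays the role of null

# The three-node singly linked list from the figures: 12 -> 99 -> 37.
head = Node(12, Node(99, Node(37)))

# Traversal: start at the head and follow 'next' links until null.
values = []
node = head
while node is not None:
    values.append(node.data)
    node = node.next
assert values == [12, 99, 37]
```

Following each next link until the null marker (None in Python) is exactly the traversal described above; the reference held in `head` is what the text later calls the list's handle.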
In a circular list, the last node is linked back to the front, or “head”, of the list, and vice versa.

Sentinel nodes

Main article: Sentinel node

In some implementations an extra 'sentinel' or 'dummy' node may be added before the first data record or after the last one. This convention simplifies and accelerates some list-handling algorithms, by ensuring that all links can be safely dereferenced and that every list (even one that contains no data elements) always has a “first” and “last” node.

Empty lists

An empty list is a list that contains no data records. This is usually the same as saying that it has zero nodes. If sentinel nodes are being used, the list is usually said to be empty when it has only sentinel nodes.

Hash linking

The link fields need not be physically part of the nodes. If the data records are stored in an array and referenced by their indices, the link field may be stored in a separate array with the same indices as the data records.

List handles

Since a reference to the first node gives access to the whole list, that reference is often called the 'address', 'pointer', or 'handle' of the list. Algorithms that manipulate linked lists usually get such handles to the input lists and return the handles to the resulting lists. In fact, in the context of such algorithms, the word “list” often means “list handle”. In some situations, however, it may be convenient to refer to a list by a handle that consists of two links, pointing to its first and last nodes.

Combining alternatives

The alternatives listed above may be arbitrarily combined in almost every way, so one may have circular doubly linked lists without sentinels, circular singly linked lists with sentinels, etc.

2.4.5 Tradeoffs

As with most choices in computer programming and design, no method is well suited to all circumstances. A linked list data structure might work well in one case, but cause problems in another. This is a list of some of the common tradeoffs involving linked list structures.

A dynamic array is a data structure that allocates all elements contiguously in memory, and keeps a count of the current number of elements. If the space reserved for the dynamic array is exceeded, it is reallocated and (possibly) copied, which is an expensive operation.

Linked lists have several advantages over dynamic arrays. Insertion or deletion of an element at a specific point of a list, assuming that we have already obtained a pointer to the node before the one to be removed, or before the insertion point, is a constant-time operation (otherwise, without this reference, it is O(n)), whereas insertion in a dynamic array at a random location will require moving half of the elements on average, and all the elements in the worst case. While one can “delete” an element from an array in constant time by somehow marking its slot as “vacant”, this causes fragmentation that impedes the performance of iteration.

Moreover, arbitrarily many elements may be inserted into a linked list, limited only by the total memory available; a dynamic array, by contrast, will eventually fill up its underlying array data structure and will have to reallocate, an expensive operation that may not even be possible if memory is fragmented, although the cost of reallocation can be averaged over insertions, so the cost of an insertion due to reallocation is still amortized O(1). This helps with appending elements at the array’s end, but inserting into (or removing from) middle positions still carries a prohibitive cost, due to the data movement needed to maintain contiguity. An array from which many elements are removed may also have to be resized in order to avoid wasting too much space.

On the other hand, dynamic arrays (as well as fixed-size array data structures) allow constant-time random access, while linked lists allow only sequential access to elements. Singly linked lists, in fact, can be easily traversed in only one direction. This makes linked lists unsuitable for applications where it is useful to look up an element by its index quickly, such as heapsort. Sequential access on arrays and dynamic arrays is also faster than on linked lists on many machines, because arrays have optimal locality of reference and thus make good use of data caching.

Another disadvantage of linked lists is the extra storage needed for references, which often makes them impractical for lists of small data items such as characters or boolean values, because the storage overhead for the links may exceed by a factor of two or more the size of the data. In contrast, a dynamic array requires only the space for the data itself (and a very small amount of control data).[note 1] It can also be slow, and with a naïve allocator wasteful, to allocate memory separately for each new element, a problem generally solved using memory pools.
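The constant-time insertion argument can be illustrated with a minimal sketch (Python, with illustrative names): given a reference to the node before the insertion point, a linked list rewires two links, while a dynamic array must shift every later element.

```python
class Node:
    """Singly linked node: payload plus a link to the next node."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def insert_after(node, data):
    # Constant time: rewire two links; no other element moves.
    node.next = Node(data, node.next)

# Linked list 1 -> 3, with a reference to the node before the
# insertion point already in hand.
head = Node(1, Node(3))
insert_after(head, 2)
assert [head.data, head.next.data, head.next.next.data] == [1, 2, 3]

# The dynamic array must shift every element after the opened slot.
arr = [1, 3]
arr.insert(1, 2)  # O(n) in the worst case
assert arr == [1, 2, 3]
```

Both end in the same logical sequence; the difference is purely in how much data moves, which is the tradeoff this section describes.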
Some hybrid solutions try to combine the advantages of the two representations. Unrolled linked lists store several elements in each list node, increasing cache performance while decreasing memory overhead for references. CDR coding does both of these as well, by replacing references with the actual data referenced, which extends off the end of the referencing record.

A good example that highlights the pros and cons of using dynamic arrays vs. linked lists is the implementation of a program that resolves the Josephus problem. The Josephus problem is an election method that works by having a group of people stand in a circle. Starting at a predetermined person, you count around the circle n times. Once you reach the nth person, take them out of the circle and have the members close the circle. Then count around the circle the same n times and repeat the process, until only one person is left. That person wins the election. This shows the strengths and weaknesses of a linked list vs. a dynamic array: if you view the people as connected nodes in a circular linked list, it is easy for the linked list to delete nodes (as it only has to rearrange the links between nodes). However, the linked list will be poor at finding the next person to remove, and will need to search through the list until it finds that person. A dynamic array, on the other hand, will be poor at deleting nodes (or elements), as it cannot remove one node without individually shifting all the following elements up the list by one. However, it is exceptionally easy to find the nth person in the circle by directly referencing them by their position in the array.

The list ranking problem concerns the efficient conversion of a linked list representation into an array. Although trivial for a conventional computer, solving this problem by a parallel algorithm is complicated and has been the subject of much research.

A balanced tree has similar memory access patterns and space overhead to a linked list while permitting much more efficient indexing, taking O(log n) time instead of O(n) for a random access. However, insertion and deletion operations are more expensive due to the overhead of tree manipulations to maintain balance. Schemes exist for trees to automatically maintain themselves in a balanced state: AVL trees or red-black trees.

Singly linked linear lists vs. other lists

While such procedures can be adapted for doubly linked and circularly linked lists, they generally need extra arguments and more complicated base cases.

Linear singly linked lists also allow tail-sharing, the use of a common final portion of sub-list as the terminal portion of two different lists. In particular, if a new node is added at the beginning of a list, the former list remains available as the tail of the new one, a simple example of a persistent data structure. Again, this is not true with the other variants: a node may never belong to two different circular or doubly linked lists.

In particular, end-sentinel nodes can be shared among singly linked non-circular lists; the same end-sentinel node may be used for every such list. In Lisp, for example, every proper list ends with a link to a special node, denoted by nil or (), whose CAR and CDR links point to itself. Thus a Lisp procedure can safely take the CAR or CDR of any list.

The advantages of the fancy variants are often limited to the complexity of the algorithms, not their efficiency. A circular list, in particular, can usually be emulated by a linear list together with two variables that point to the first and last nodes, at no extra cost.

Doubly linked vs. singly linked

Doubly linked lists require more space per node (unless one uses XOR-linking), and their elementary operations are more expensive; but they are often easier to manipulate because they allow fast and easy sequential access to the list in both directions. In a doubly linked list, one can insert or delete a node in a constant number of operations given only that node’s address. To do the same in a singly linked list, one must have the address of the pointer to that node, which is either the handle for the whole list (in the case of the first node) or the link field in the previous node. Some algorithms require access in both directions. On the other hand, doubly linked lists do not allow tail-sharing and cannot be used as persistent data structures.

Circularly linked vs. linearly linked
A circular list can be split into two circular lists, in constant time, by giving the addresses of the last node of each piece. The operation consists in swapping the contents of the link fields of those two nodes. Applying the same operation to any two nodes in two distinct lists joins the two lists into one. This property greatly simplifies some algorithms and data structures, such as the quad-edge and face-edge.

The simplest representation for an empty circular list (when such a thing makes sense) is a null pointer, indicating that the list has no nodes. Without this choice, many algorithms have to test for this special case, and handle it separately. By contrast, the use of null to denote an empty linear list is more natural and often creates fewer special cases.

2.4.6 Linked list operations

This section gives pseudocode for adding or removing nodes from singly, doubly, and circularly linked lists in-place. Throughout, we will use null to refer to an end-of-list marker or sentinel, which may be implemented in a number of ways.

Linearly linked lists

Singly linked lists

Our node data structure will have two fields. We also keep a variable firstNode which always points to the first node in the list, or is null for an empty list.

record Node {
  data // The data being stored in the node
  Node next // A reference to the next node; null for last node
}
record List {
  Node firstNode // points to first node of list; null for empty list
}

Traversal of a singly linked list is simple, beginning at the first node and following each next link until we come to the end:

node := list.firstNode
while node not null
  (do something with node.data)
  node := node.next

The following code inserts a node after an existing node in a singly linked list. The diagram shows how it works. Inserting a node before an existing one cannot be done directly; instead, one must keep track of the previous node and insert a node after it.

Diagram: inserting newNode (value 37) after an existing node.

function insertAfter(Node node, Node newNode)
  newNode.next := node.next
  node.next := newNode

Since we can't iterate backwards, efficient insertBefore or removeBefore operations are not possible. Inserting into a list before a specific node requires traversing the list, which has a worst-case running time of O(n).

Appending one linked list to another can be inefficient unless a reference to the tail is kept as part of the List structure, because we must traverse the entire first list in order to find the tail, and then append the second list to it. Thus, if two linearly linked lists are each of length n, list appending has asymptotic time complexity of O(n). In the Lisp family of languages, list appending is provided by the append procedure.

Many of the special cases of linked list operations can be eliminated by including a dummy element at the front of the list. This ensures that there are no special cases for the beginning of the list and renders both insertBeginning() and removeBeginning() unnecessary. In this case, the first useful data in the list will be found at list.firstNode.next.

Using sentinel nodes

Sentinel nodes may simplify certain list operations, by ensuring that the next or previous nodes exist for every element, and that even empty lists have at least one node. One may also use a sentinel node at the end of the list, with an appropriate data field, to eliminate some end-of-list tests. For example, when scanning the list looking for a node with a given value x, setting the sentinel's data field to x makes it unnecessary to test for end-of-list inside the loop. Another example is the merging of two sorted lists: if their sentinels have data fields set to +∞, the choice of the next output node does not need special handling for empty lists.

Circularly linked list

In a circularly linked list, all nodes are linked in a continuous circle, without using null. For lists with a front and a back (such as a queue), one stores a reference to the last node in the list. The next node after the last node is the first node. Elements can be added to the back of the list and removed from the front in constant time.

Circularly linked lists can be either singly or doubly linked.

Both types of circularly linked lists benefit from the ability to traverse the full list beginning at any given node. This often allows us to avoid storing firstNode and lastNode, although if the list may be empty we need a special representation for the empty list, such as a lastNode variable which points to some node in the list or is null if it's empty; we use such a lastNode here. This representation significantly simplifies adding and removing nodes with a non-empty list, but empty lists are then a special case.

Algorithms

Assuming that someNode is some node in a non-empty circular singly linked list, this code iterates through that list starting with someNode:

function iterate(someNode)
  if someNode ≠ null
    node := someNode
    do
      (do something with node.value)
      node := node.next
    while node ≠ someNode

The following function inserts a node into a circular singly linked list after a given node, creating a one-node circle if the list is empty:

function insertAfter(Node node, Node newNode)
  if node = null // assume the list is empty
    newNode.next := newNode
  else
    newNode.next := node.next
    node.next := newNode

Suppose that "L" is a variable pointing to the last node of a circular linked list (or null if the list is empty). To append "newNode" to the end of the list, one may do

insertAfter(L, newNode)
L := newNode

To insert "newNode" at the beginning of the list, one may do

insertAfter(L, newNode)
if L = null
  L := newNode

2.4.7 Linked lists using arrays of nodes

Languages that do not support any type of reference can still create links by replacing pointers with array indices. The approach is to keep an array of records, where each record has integer fields indicating the index of the next (and possibly previous) node in the array. Not all nodes in the array need be used. If records are also not supported, parallel arrays can often be used instead.

As an example, consider the following linked list record that uses arrays instead of pointers:

record Entry {
  integer next // index of next entry in array
  integer prev // previous entry (if doubly linked)
  string name
  real balance
}

A linked list can be built by creating an array of these structures, and an integer variable to store the index of the first element.

integer listHead
Entry Records[1000]

Links between elements are formed by placing the array index of the next (or previous) cell into the Next or Prev field within a given element. For example:

In the above example, listHead would be set to 2, the location of the first entry in the list. Notice that entries 3 and 5 through 7 are not part of the list. These cells are available for any additions to the list. By creating a listFree integer variable, a free list could be created to keep track of which cells are available. If all entries are in use, the size of the array would have to be increased or some elements would have to be deleted before new entries could be stored in the list.

The following code would traverse the list and display names and account balances:
i := listHead
while i ≥ 0 // loop through the list
  print i, Records[i].name, Records[i].balance // print entry
  i := Records[i].next

When faced with a choice, the advantages of this approach include:

• The linked list is relocatable, meaning it can be moved about in memory at will, and it can also be quickly and directly serialized for storage on disk or transfer over a network.

• Especially for a small list, array indexes can occupy significantly less space than a full pointer on many architectures.

• Locality of reference can be improved by keeping the nodes together in memory and by periodically rearranging them, although this can also be done in a general store.

• Naïve dynamic memory allocators can produce an excessive amount of overhead storage for each node allocated; almost no allocation overhead is incurred per node in this approach.

• Seizing an entry from a pre-allocated array is faster than using dynamic memory allocation for each node, since dynamic memory allocation typically requires a search for a free memory block of the desired size.

This approach has one main disadvantage, however: it creates and manages a private memory space for its nodes. This leads to the following issues:

• It increases the complexity of the implementation.

• Growing a large array when it is full may be difficult or impossible, whereas finding space for a new linked list node in a large, general memory pool may be easier.

• Adding elements to a dynamic array will occasionally (when it is full) unexpectedly take linear (O(n)) instead of constant time (although it is still an amortized constant).

• Using a general memory pool leaves more memory for other data if the list is smaller than expected or if many nodes are freed.

For these reasons, this approach is mainly used for languages that do not support dynamic memory allocation. These disadvantages are also mitigated if the maximum size of the list is known at the time the array is created.

2.4.8 Language support

Many programming languages such as Lisp and Scheme have singly linked lists built in. In many functional languages, these lists are constructed from nodes, each called a cons or cons cell. The cons has two fields: the car, a reference to the data for that node, and the cdr, a reference to the next node. Although cons cells can be used to build other data structures, this is their primary purpose.

In languages that support abstract data types or templates, linked list ADTs or templates are available for building linked lists. In other languages, linked lists are typically built using references together with records.

2.4.9 Internal and external storage

When constructing a linked list, one is faced with the choice of whether to store the data of the list directly in the linked list nodes, called internal storage, or merely to store a reference to the data, called external storage. Internal storage has the advantage of making access to the data more efficient, requiring less storage overall, having better locality of reference, and simplifying memory management for the list (its data is allocated and deallocated at the same time as the list nodes).

External storage, on the other hand, has the advantage of being more generic, in that the same data structure and machine code can be used for a linked list no matter what the size of the data is. It also makes it easy to place the same data in multiple linked lists. Although with internal storage the same data can be placed in multiple lists by including multiple next references in the node data structure, it would then be necessary to create separate routines to add or delete cells based on each field. It is possible to create additional linked lists of elements that use internal storage by using external storage, and having the cells of the additional linked lists store references to the nodes of the linked list containing the data.

In general, if a set of data structures needs to be included in linked lists, external storage is the best approach. If a set of data structures needs to be included in only one linked list, then internal storage is slightly better, unless a generic linked list package using external storage is available. Likewise, if different sets of data that can be stored in the same data structure are to be included in a single linked list, then internal storage would be fine.

Another approach that can be used with some languages involves having different data structures, but all having the initial fields, including the next (and prev, if doubly linked) references, in the same location. After defining separate structures for each type of data, a generic structure can be defined that contains the minimum amount of data shared by all the other structures, contained at the top (beginning) of the structures. Then generic routines can be created that use the minimal structure to perform linked list type operations, but separate routines can then handle the specific data. This approach is often used in message parsing routines, where several types of messages are received, but all start with the same set of fields, usually including a field for message type. The generic routines are used to add new messages to a queue when they are received, and remove them from the queue in order to process the message. The message type field is then used to call the correct routine to process the specific type of message.

famNode := Families // start at head of families list
while famNode ≠ null // loop through list of families
  aFamily := (family) famNode.data // extract family from node
  print information about family
  memNode := aFamily.members // get list of family members
  while memNode ≠ null // loop through list of members
    aMember := (member) memNode.data // extract member from node
    print information about member
    memNode := memNode.next
  famNode := famNode.next

Notice that when using external storage, an extra step is needed to extract the record from the node and cast it into the proper data type. This is because both the list of families and the list of members within the family are stored in two linked lists using the same data structure (node), and this language does not have parametric types.

As long as the number of families that a member can belong to is known at compile time, internal storage works fine. If, however, a member needed to be included in an arbitrary number of families, with the specific number known only at run time, external storage would be necessary.

Speeding up search

Random access lists can be viewed as immutable linked lists in that they likewise support the same O(1) head and tail operations.[10]

A simple extension to random access lists is the min-list, which provides an additional operation that yields the minimum element in the entire list in constant time (without mutation complexities).[10]

2.4.10 Related data structures

Both stacks and queues are often implemented using linked lists, and simply restrict the type of operations which are supported.

The skip list is a linked list augmented with layers of
pointers for quickly jumping over large numbers of elements, and then descending to the next layer. This process continues down to the bottom layer, which is the actual list.

A binary tree can be seen as a type of linked list where the elements are themselves linked lists of the same nature. The result is that each node may include a reference to the first node of one or two other linked lists, which, together with their contents, form the subtrees below that node.

An unrolled linked list is a linked list in which each node contains an array of data values. This leads to improved cache performance, since more list elements are contiguous in memory, and reduced memory overhead, because less metadata needs to be stored for each element of the list.

A hash table may use linked lists to store the chains of items that hash to the same position in the hash table.

A heap shares some of the ordering properties of a linked list, but is almost always implemented using an array. Instead of references from node to node, the next and previous data indexes are calculated using the current data's index.

A self-organizing list rearranges its nodes based on some heuristic which reduces search times for data retrieval by keeping commonly accessed nodes at the head of the list.

2.4.11 Notes

[1] The amount of control data required for a dynamic array is usually of the form K + B·n, where K is a per-array constant, B is a per-dimension constant, and n is the number of dimensions. K and B are typically on the order of 10 bytes.

2.4.12 Citations

[6] Day 1 Keynote - Bjarne Stroustrup: C++11 Style, at GoingNative 2012 on channel9.msdn.com, from minute 45 or foil 44.

[7] Number crunching: Why you should never, ever, EVER use linked-list in your code again, at kjellkod.wordpress.com.

[8] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999). Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF). Department of Computer Science, University of Waterloo.

[9] Ford, William; Topp, William (2002). Data Structures with C++ using STL (Second ed.). Prentice-Hall. pp. 466–467. ISBN 0-13-085850-1.

[10] Okasaki, Chris (1995). Purely Functional Random-Access Lists (PS). In Functional Programming Languages and Computer Architecture. ACM Press. pp. 86–95. Retrieved May 7, 2015.

2.4.13 References

• Juan, Angel (2006). “Ch20 – Data Structures; ID06 - PROGRAMMING with JAVA (slide part of the book 'Big Java', by Cay S. Horstmann)” (PDF). p. 3.

• Black, Paul E. (2004-08-16). Pieterse, Vreda; Black, Paul E., eds. “linked list”. Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Retrieved 2004-12-14.

• Antonakos, James L.; Mansfield, Kenneth C., Jr. (1999). Practical Data Structures Using C/C++. Prentice-Hall. pp. 165–190. ISBN 0-13-280843-9.

• Collins, William J. (2005) [2002]. Data Structures and the Java Collections Framework. New York: McGraw Hill. pp. 239–303. ISBN 0-07-282379-8.

• Knuth, Donald (1997). “2.2.3–2.2.5”. Fundamental Algorithms (3rd ed.). Addison-Wesley. pp. 254–298. ISBN 0-201-89683-4.
• Newell, Allen; Shaw, F. C. (1957). “Programming lists formed from the same data items, but in opposite
the Logic Theory Machine”. Proceedings of the Western Joint Computer Conference: 230–240.

• Parlante, Nick (2001). “Linked list basics” (PDF). Stanford University. Retrieved 2009-09-21.

• Sedgewick, Robert (1998). Algorithms in C. Addison Wesley. pp. 90–109. ISBN 0-201-31452-5.

• Shaffer, Clifford A. (1998). A Practical Introduction to Data Structures and Algorithm Analysis. New Jersey: Prentice Hall. pp. 77–102. ISBN 0-13-660911-2.

• Wilkes, Maurice Vincent (1964). “An Experiment with a Self-compiling Compiler for a Simple List-Processing Language”. Annual Review in Automatic Programming. Pergamon Press. 4 (1): 1. doi:10.1016/0066-4138(64)90013-8.

• Wilkes, Maurice Vincent (1964). “Lists and Why They are Useful”. Proceedings of the ACM National Conference, Philadelphia 1964. ACM (P–64): F1–1.

• Shanmugasundaram, Kulesh (2005-04-04). “Linux Kernel Linked List Explained”. Retrieved 2009-09-21.

2.4.14 External links

• Description from the Dictionary of Algorithms and Data Structures
• Introduction to Linked Lists, Stanford University Computer Science Library
• Linked List Problems, Stanford University Computer Science Library
• Open Data Structures - Chapter 3 - Linked Lists
• Patent for the idea of having nodes which are in several linked lists simultaneously (note that this technique was widely used for many decades before the patent was granted)

A doubly linked list whose nodes contain three fields: an integer value (here 12, 99, and 37), the link to the next node, and the link to the previous node.

The two node links allow traversal of the list in either direction. While adding or removing a node in a doubly linked list requires changing more links than the same operations on a singly linked list, the operations are simpler and potentially more efficient (for nodes other than first nodes) because there is no need to keep track of the previous node during traversal or no need to traverse the list to find the previous node, so that its link can be modified. The concept is also the basis for the mnemonic link system memorization technique.

2.5.1 Nomenclature and implementation

The first and last nodes of a doubly linked list are immediately accessible (i.e., accessible without traversal, and usually called head and tail) and therefore allow traversal of the list from the beginning or end of the list, respectively: e.g., traversing the list from beginning to end, or from end to beginning, in a search of the list for a node with a specific data value. Any node of a doubly linked list, once obtained, can be used to begin a new traversal of the list, in either direction (towards beginning or end), from the given node.

The link fields of a doubly linked list node are often called next and previous or forward and backward. The references stored in the link fields are usually implemented as pointers, but (as in any linked data structure) they may also be address offsets or indices into an array where the nodes live.

2.5.2 Basic algorithms

Consider the following basic algorithms written in Ada:
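A sketch of node insertion and removal in a doubly linked list, given here in Python rather than Ada (the Node class and the next/prev field names are illustrative, not taken from the original listings):

```python
class Node:
    """A doubly linked list node with a value and two links."""
    def __init__(self, data):
        self.data = data
        self.next = None  # link to the following node
        self.prev = None  # link to the preceding node

def insert_after(node, new_node):
    """Splice new_node into the list immediately after node."""
    new_node.prev = node
    new_node.next = node.next
    if node.next is not None:
        node.next.prev = new_node
    node.next = new_node

def remove(node):
    """Unlink node by bypassing it in both directions."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
```

Note that removal needs no traversal: given only the node itself, both neighbouring links can be updated directly, which is the efficiency advantage described above.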
called iteration, but that choice of terminology is unfortunate, for iteration has well-defined semantics (e.g., in mathematics) which are not analogous to traversal.

Forwards

node := list.firstNode
while node ≠ null
    <do something with node.data>
    node := node.next

Circular doubly linked lists

Traversing the list  Assuming that someNode is some node in a non-empty list, this code traverses through that list starting with someNode (any node will do):
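The forward traversal pseudocode above can be sketched in Python (a minimal sketch with an illustrative Node class; a backwards pass would follow a prev link symmetrically):

```python
class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next  # link to the following node; None marks the end

def traverse_forwards(first_node):
    """Visit each node from the head onward, collecting node.data."""
    visited = []
    node = first_node                # node := list.firstNode
    while node is not None:          # while node ≠ null
        visited.append(node.data)    # <do something with node.data>
        node = node.next             # node := node.next
    return visited
```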
2.5.5 References

[1] http://www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists

2.6 Stack (abstract data type)

2.6.1 History
2.6.2 Non-essential operations

In many implementations, a stack has more operations than “push” and “pop”. An example is “top of stack”, or "peek", which observes the top-most element without removing it from the stack.[7] Since this can be done with a “pop” and a “push” with the same data, it is not essential. An underflow condition can occur in the “stack top” operation if the stack is empty, the same as “pop”. Also, implementations often have a function which just returns whether the stack is empty.

2.6.3 Software stacks

Implementation

A stack can be easily implemented either through an array or a linked list. What identifies the data structure as a stack in either case is not the implementation but the interface: the user is only allowed to pop or push items onto the array or linked list, with few other helper operations. The following will demonstrate both implementations, using pseudocode.

Adding items to or removing items from the end of a dynamic array requires amortized O(1) time.

Linked list  Another option for implementing stacks is to use a singly linked list. A stack is then a pointer to the “head” of the list, with perhaps a counter to keep track of the size of the list:

structure frame:
    data : item
    next : frame or nil

structure stack:
    head : frame or nil
    size : integer

procedure initialize(stk : stack):
    stk.head ← nil
    stk.size ← 0

Pushing and popping items happens at the head of the list; overflow is not possible in this implementation (unless memory is exhausted):

procedure push(stk : stack, x : item):
    newhead ← new frame
    newhead.data ← x
    newhead.next ← stk.head
    stk.head ← newhead
    stk.size ← stk.size + 1

procedure pop(stk : stack):
    if stk.head = nil:
        report underflow error
    r ← stk.head.data
    stk.head ← stk.head.next
    stk.size ← stk.size - 1
    return r

Stacks and programming languages
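As a concrete illustration in an actual language, the linked-list pseudocode above can be transcribed into Python (a sketch; the Frame and Stack names mirror the pseudocode rather than any particular library):

```python
class Frame:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next  # link to the frame below

class Stack:
    def __init__(self):
        self.head = None  # top of the stack
        self.size = 0

    def push(self, x):
        # The new item becomes the new head of the list.
        self.head = Frame(x, self.head)
        self.size += 1

    def pop(self):
        if self.head is None:
            raise IndexError("stack underflow")
        r = self.head.data
        self.head = self.head.next
        self.size -= 1
        return r

    def peek(self):
        # "Top of stack" observes without removing (see 2.6.2).
        if self.head is None:
            raise IndexError("stack underflow")
        return self.head.data
```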
stack.pop(); // removing the next top (“C”)
}
}

apple banana cucumber ===right rotate==> banana cucumber apple
apple banana cucumber ===left rotate==> cucumber apple banana

A stack is usually represented in computers by a block of memory cells, with the “bottom” at a fixed location, and the stack pointer holding the address of the current “top” cell in the stack. The top and bottom terminology is used irrespective of whether the stack actually grows towards lower memory addresses or towards higher memory addresses.

Pushing an item on to the stack adjusts the stack pointer by the size of the item (either decrementing or incrementing it, depending on the direction in which the stack grows in memory), pointing it to the next cell, and copies the new top item to the stack area. Depending again on the exact implementation, at the end of a push operation, the stack pointer may point to the next unused location in the stack, or it may point to the topmost item in the stack. If the stack points to the current topmost item, the stack pointer will be updated before a new item is pushed onto the stack; if it points to the next available location in the stack, it will be updated after the new item is pushed onto the stack.

Popping the stack is simply the inverse of pushing. The topmost item in the stack is removed and the stack pointer is updated, in the opposite order of that used in the push operation.

Hardware support

Stack in main memory  Many CPU families, including the x86, Z80 and 6502, have a dedicated register reserved for use as (call) stack pointers and special push and pop instructions that manipulate this specific register, conserving opcode space. Some processors, like the PDP-11 and the 68000, also have special addressing modes for implementation of stacks, typically with a semi-dedicated stack pointer as well (such as A7 in the 68000). However, in most processors, several different registers may be used as additional stack pointers as needed (whether updated via addressing modes or via add/sub instructions).

Stack in registers or dedicated memory  Main article: Stack machine

The x87 floating point architecture is an example of a set of registers organised as a stack where direct access to individual registers (relative to the current top) is also possible. As with stack-based machines in general, having the top-of-stack as an implicit argument allows for a small machine code footprint with a good usage of bus bandwidth and code caches, but it also prevents some types of optimizations possible on processors permitting random access to the register file for all (two or three) operands. A stack structure also makes superscalar implementations with register renaming (for speculative execution) somewhat more complex to implement, although it is still feasible, as exemplified by modern x87 implementations.

Sun SPARC, AMD Am29000, and Intel i960 are all examples of architectures using register windows within a register-stack as another strategy to avoid the use of slow main memory for function arguments and return values.

There are also a number of small microprocessors that implement a stack directly in hardware, and some microcontrollers have a fixed-depth stack that is not directly accessible. Examples are the PIC microcontrollers, the Computer Cowboys MuP21, the Harris RTX line, and the Novix NC4016. Many stack-based microprocessors were used to implement the programming language Forth at the microcode level. Stacks were also used as a basis of a number of mainframes and minicomputers. Such machines were called stack machines, the most famous being the Burroughs B5000.

2.6.5 Applications

Expression evaluation and syntax parsing

Calculators employing reverse Polish notation use a stack structure to hold values. Expressions can be represented in prefix, postfix or infix notations, and conversion from one form to another may be accomplished using a stack. Many compilers use a stack for parsing the syntax of expressions, program blocks etc. before translating into low-level code. Most programming languages are context-free languages, allowing them to be parsed with stack-based machines.

Backtracking

Main article: Backtracking

Another important application of stacks is backtracking. Consider a simple example of finding the correct path in a maze. There are a series of points, from the starting point to the destination. We start from one point. To reach the final destination, there are several paths. Suppose we choose a random path. After following a certain path, we realise that the path we have chosen is wrong. So we need to find a way by which we can return to the beginning of that path. This can be done with the use of stacks. With the help of stacks, we remember the point where we have reached. This is done by pushing that point onto the stack. In case we end up on the wrong path, we can pop the last point from the stack and thus return to the last point and continue our quest to find the right path. This is called backtracking.

The prototypical example of a backtracking algorithm is
depth-first search, which finds all vertices of a graph that can be reached from a specified starting vertex. Other applications of backtracking involve searching through spaces that represent potential solutions to an optimization problem. Branch and bound is a technique for performing such backtracking searches without exhaustively searching all of the potential solutions in such a space.

Runtime memory management

Main articles: Stack-based memory allocation and Stack machine

A number of programming languages are stack-oriented, meaning they define most basic operations (adding two numbers, printing a character) as taking their arguments from the stack, and placing any return values back on the stack. For example, PostScript has a return stack and an operand stack, and also has a graphics state stack and a dictionary stack. Many virtual machines are also stack-oriented, including the p-code machine and the Java Virtual Machine.

Almost all calling conventions—the ways in which subroutines receive their parameters and return results—use a special stack (the "call stack") to hold information about procedure/function calling and nesting in order to switch to the context of the called function and restore to the caller function when the calling finishes. The functions follow a runtime protocol between caller and callee to save arguments and return values on the stack. Stacks are an important way of supporting nested or recursive function calls. This type of stack is used implicitly by the compiler to support CALL and RETURN statements (or their equivalents) and is not manipulated directly by the programmer.

Some programming languages use the stack to store data that is local to a procedure. Space for local data items is allocated from the stack when the procedure is entered, and is deallocated when the procedure exits. The C programming language is typically implemented in this way. Using the same stack for both data and procedure calls has important security implications (see below) of which a programmer must be aware in order to avoid introducing serious security bugs into a program.

Efficient algorithms

Several algorithms use a stack (separate from the usual function call stack of most programming languages) as the principal data structure with which they organize their information. These include:

• Graham scan, an algorithm for the convex hull of a two-dimensional system of points. A convex hull of a subset of the input is maintained in a stack, which is used to find and remove concavities in the boundary when a new point is added to the hull.[8]

• Part of the SMAWK algorithm for finding the row minima of a monotone matrix uses stacks in a similar way to Graham scan.[9]

• All nearest smaller values, the problem of finding, for each number in an array, the closest preceding number that is smaller than it. One algorithm for this problem uses a stack to maintain a collection of candidates for the nearest smaller value. For each position in the array, the stack is popped until a smaller value is found on its top, and then the value in the new position is pushed onto the stack.[10]

• The nearest-neighbor chain algorithm, a method for agglomerative hierarchical clustering based on maintaining a stack of clusters, each of which is the nearest neighbor of its predecessor on the stack. When this method finds a pair of clusters that are mutual nearest neighbors, they are popped and merged.[11]

2.6.6 Security

Some computing environments use stacks in ways that may make them vulnerable to security breaches and attacks. Programmers working in such environments must take special care to avoid the pitfalls of these implementations.

For example, some programming languages use a common stack to store both data local to a called procedure and the linking information that allows the procedure to return to its caller. This means that the program moves data into and out of the same stack that contains critical return addresses for the procedure calls. If data is moved to the wrong location on the stack, or an oversized data item is moved to a stack location that is not large enough to contain it, return information for procedure calls may be corrupted, causing the program to fail.

Malicious parties may attempt a stack smashing attack that takes advantage of this type of implementation by providing oversized data input to a program that does not check the length of input. Such a program may copy the data in its entirety to a location on the stack, and in so doing it may change the return addresses for procedures that have called it. An attacker can experiment to find a specific type of data that can be provided to such a program such that the return address of the current procedure is reset to point to an area within the stack itself (and within the data provided by the attacker), which in turn contains instructions that carry out unauthorized operations.

This type of attack is a variation on the buffer overflow attack and is an extremely frequent source of security breaches in software, mainly because some of the most popular compilers use a shared stack for both data and
procedure calls, and do not verify the length of data items. Frequently programmers do not write code to verify the size of data items, either, and when an oversized or undersized data item is copied to the stack, a security breach may occur.

[10] Berkman, Omer; Schieber, Baruch; Vishkin, Uzi (1993), “Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values”, Journal of Algorithms, 14 (3): 344–370, doi:10.1006/jagm.1993.1018.

[11] Murtagh, Fionn (1983), “A survey of recent advances in hierarchical clustering algorithms” (PDF), The Computer Journal, 26 (4): 354–359, doi:10.1093/comjnl/26.4.354.
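The all nearest smaller values procedure described in the list above can be sketched in Python (a minimal sketch; here None marks positions with no smaller preceding value):

```python
def all_nearest_smaller_values(values):
    """For each element, find the closest preceding element that is smaller.

    A stack holds candidates for the nearest smaller value: at each
    position, pop until a smaller value appears on top, record it
    (or None if the stack empties), then push the current value.
    """
    stack = []
    result = []
    for x in values:
        while stack and stack[-1] >= x:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(x)
    return result
```

Each element is pushed and popped at most once, so the whole pass runs in O(n) time despite the inner loop.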
2.6.7 See also

• List of data structures
• Queue
• Double-ended queue
• Call stack
• FIFO (computing and electronics)
• Stack-based memory allocation
• Stack overflow
• Stack-oriented programming language

2.6.8 References

[2] Newton, David E. (2003). Alan Turing: a study in light and shadow. Philadelphia: Xlibris. p. 82. ISBN 9781401090791. Retrieved 28 January 2015.

[3] Dr. Friedrich Ludwig Bauer and Dr. Klaus Samelson (30 March 1957). “Verfahren zur automatischen Verarbeitung von kodierten Daten und Rechenmaschine zur Ausübung des Verfahrens” (in German). Germany, Munich: Deutsches Patentamt. Retrieved 2010-10-01.

[5] Ball, John A. (1978). Algorithms for RPN calculators (1 ed.). Cambridge, Massachusetts, USA: Wiley-Interscience, John Wiley & Sons, Inc. ISBN 0-471-03070-8.

[6] Godse, A. P.; Godse, D. A. (2010-01-01). Computer Architecture. Technical Publications. pp. 1–56. ISBN 9788184315349. Retrieved 2015-01-30.

• This article incorporates public domain material from the NIST document: Black, Paul E. “Bounded stack”. Dictionary of Algorithms and Data Structures.

2.6.9 Further reading

• Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.

• Bounding stack depth

• Stack Size Analysis for Interrupt-driven Programs (322 KB)

2.7 Queue (abstract data type)

[Figure: a FIFO queue, with elements enqueued at the back and dequeued from the front.]
known as dequeue. This makes the queue a First-In-First-Out (FIFO) data structure. In a FIFO data structure, the first element added to the queue will be the first one to be removed. This is equivalent to the requirement that once a new element is added, all elements that were added before have to be removed before the new element can be removed. Often a peek or front operation is also included, returning the value of the front element without dequeuing it. A queue is an example of a linear data structure, or more abstractly a sequential collection.

Queues provide services in computer science, transport, and operations research where various entities such as data, objects, persons, or events are stored and held to be processed later. In these contexts, the queue performs the function of a buffer.

Queues are common in computer programs, where they are implemented as data structures coupled with access routines, as an abstract data structure or in object-oriented languages as classes. Common implementations are circular buffers and linked lists.

There are several efficient implementations of FIFO queues. An efficient implementation is one that can perform the operations—enqueuing and dequeuing—in O(1) time.

• Linked list
  • A doubly linked list has O(1) insertion and deletion at both ends, so it is a natural choice for queues.
  • A regular singly linked list only has efficient insertion and deletion at one end. However, a small modification—keeping a pointer to the last node in addition to the first one—will enable it to implement an efficient queue.

• A deque implemented using a modified dynamic array

Queues and programming languages
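The singly-linked-list variant described above (a head pointer plus a pointer to the last node) can be sketched in Python; the class and field names here are illustrative:

```python
class _Node:
    def __init__(self, data):
        self.data = data
        self.next = None

class Queue:
    """FIFO queue: enqueue at the tail, dequeue at the head, both O(1)."""
    def __init__(self):
        self.head = None  # front of the queue
        self.tail = None  # last node, kept so enqueue needs no traversal

    def enqueue(self, x):
        node = _Node(x)
        if self.tail is None:      # queue was empty
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node

    def dequeue(self):
        if self.head is None:
            raise IndexError("dequeue from empty queue")
        x = self.head.data
        self.head = self.head.next
        if self.head is None:      # queue became empty
            self.tail = None
        return x
```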
its worst-time complexity is O(n) where n is the number of elements in the queue.

Let us recall that, for l a list, |l| denotes its length, that NIL represents an empty list and CONS(h, t) represents the list whose head is h and whose tail is t.

Real-time queue

The data structure used to implement our queues consists of three linked lists (f, r, s) where f is the front of the queue and r is the rear of the queue in reverse order. The invariant of the structure is that s is the rear of f without its |r| first elements, that is |s| = |f| − |r|. The tail of the queue (CONS(x, f), r, s) is then almost (f, r, s) and inserting an element x into (f, r, s) is almost (f, CONS(x, r), s). It is said almost, because in both of those results, |s| = |f| − |r| + 1. An auxiliary function aux must then be called for the invariant to be satisfied. Two cases must be considered, depending on whether s is the empty list, in which case |r| = |f| + 1, or not. The formal definition is aux(f, r, CONS(_, s)) = (f, r, s) and aux(f, r, NIL) = (f′, NIL, f′) where f′ is f followed by r reversed.

Let us call reverse(f, r) the function which returns f followed by r reversed. Let us furthermore assume that |r| = |f| + 1, since it is the case when this function is called. More precisely, we define a lazy function rotate(f, r, a) which takes as input three lists such that |r| = |f| + 1, and returns the concatenation of f, of r reversed and of a. Then reverse(f, r) = rotate(f, r, NIL). The inductive definition of rotate is rotate(NIL, CONS(y, NIL), a) = CONS(y, a) and rotate(CONS(x, f), CONS(y, r), a) = CONS(x, rotate(f, r, CONS(y, a))). Its running time is O(r), but, since lazy evaluation is used, the computation is delayed until the result is forced by the computation.

The list s in the data structure has two purposes. This list serves as a counter for |f| − |r|; indeed, |f| = |r| if and only if s is the empty list. This counter allows us to ensure that the rear is never longer than the front list. Furthermore, using s, which is a tail of f, forces the computation of a part of the (lazy) list f during each tail and insert operation. Therefore, when |f| = |r|, the list f is totally forced. If it was not the case, the internal representation of f could be some append of append of... of append, and forcing would not be a constant-time operation anymore.

2.7.3 See also

• Circular buffer
• Double-ended queue (deque)
• Priority queue
• Queueing theory
• Stack (abstract data type) – the “opposite” of a queue: LIFO (Last In First Out)

2.7.4 References

[1] “Queue (Java Platform SE 7)". Docs.oracle.com. 2014-03-26. Retrieved 2014-05-22.

[2] Okasaki, Chris. “Purely Functional Data Structures” (PDF).

[3] Hood, Robert; Melville, Robert (November 1981). “Real-time queue operations in pure Lisp”. Information Processing Letters. 13 (2).

• Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 10.1: Stacks and queues, pp. 200–204.

• William Ford, William Topp. Data Structures with C++ and STL, Second Edition. Prentice Hall, 2002. ISBN 0-13-085850-1. Chapter 8: Queues and Priority Queues, pp. 386–390.

• Adam Drozdek. Data Structures and Algorithms in C++, Third Edition. Thomson Course Technology, 2005. ISBN 0-534-49182-0. Chapter 4: Stacks and Queues, pp. 137–169.

2.7.5 External links

• Queue Data Structure and Algorithm
• Queues with algo and 'c' programme
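For contrast with the lazy real-time structure above, the simpler non-lazy variant (a front list plus a reversed rear list, O(1) amortized but O(n) worst case) can be sketched in Python; the class name is a choice of this sketch:

```python
class AmortizedQueue:
    """Two-list FIFO queue: enqueue onto a reversed rear list; dequeue
    from the front list, reversing the rear into the front when needed."""
    def __init__(self):
        self.front = []  # front of the queue, next element to leave at the end
        self.rear = []   # rear of the queue in reverse order

    def enqueue(self, x):
        self.rear.append(x)

    def dequeue(self):
        if not self.front:
            # O(n) reversal, but each element is moved between the lists
            # at most once, so the amortized cost per operation stays O(1).
            self.front = self.rear[::-1]
            self.rear = []
        if not self.front:
            raise IndexError("dequeue from empty queue")
        return self.front.pop()
```

Because a single dequeue may trigger a full reversal, persistent reuse of an old version can repeat that O(n) work, which is exactly the problem the lazy, memoized scheme above is designed to avoid.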
“Deque” redirects here. It is not to be confused with dequeueing, a queue operation.
Not to be confused with Double-ended priority queue.

In computer science, a double-ended queue (dequeue, often abbreviated to deque) is an abstract data type that generalizes a queue, for which elements can be added to or removed from either the front (head) or back (tail).[1] It is also often called a head-tail linked list, though properly this refers to a specific data structure implementation of a deque (see below).

2.8.1 Naming conventions

Deque is sometimes written dequeue, but this use is generally deprecated in technical literature or technical writing because dequeue is also a verb meaning “to remove from a queue”. Nevertheless, several libraries and some writers, such as Aho, Hopcroft, and Ullman in their textbook Data Structures and Algorithms, spell it dequeue. John Mitchell, author of Concepts in Programming Languages, also uses this terminology.

• An input-restricted deque is one where deletion can be made from both ends, but insertion can be made at one end only.

• An output-restricted deque is one where insertion can be made at both ends, but deletion can be made from one end only.

Both the basic and most common list types in computing, queues and stacks, can be considered specializations of deques, and can be implemented using deques.

2.8.3 Operations

The basic operations on a deque are enqueue and dequeue on either end. Also generally implemented are peek operations, which return the value at that end without dequeuing it.

Names vary between languages; major implementations include:

2.8.4 Implementations

There are at least two common ways to efficiently implement a deque: with a modified dynamic array or with a doubly linked list.

The dynamic array approach uses a variant of a dynamic array that can grow from both ends, sometimes called array deques. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end. Three common implementations include:

• Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.

• Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.

Double-ended queues can also be implemented as a purely functional data structure.[2] Two versions of the implementation exist. The first one, called 'real-time deque', is presented below. It allows the queue to be persistent with operations in O(1) worst-case time, but requires lazy lists with memoization. The second one, with no lazy lists nor memoization, is presented at the end of the section. Its amortized time is O(1) if the persistency is not used; but the worst-time complexity of an operation is O(n) where n is the number of elements in the double-ended queue.

Let us recall that, for a list l, |l| denotes its length, that NIL represents an empty list and CONS(h,t) represents the list whose head is h and whose tail is t. The functions drop(i,l) and take(i,l) return the list l without its first i elements, and the first i elements of l, respectively. Or, if |l| < i, they return the empty list and l respectively.

A double-ended queue is represented as a sextuple (lenf, f, sf, lenr, r, sr) where f is a linked list which contains the front of the queue of length lenf. Similarly, r is a linked list which represents the reverse of the rear of the queue, of length lenr. Furthermore, it is assured that |f|
<= 2|r|+1 and |r| <= 2|f|+1 - intuitively, it means that neither the front nor the rear contains more than a third of the list plus one element. Finally, sf and sr are tails of f and of r; they allow to schedule the moment where some lazy operations are forced. Note that, when a double-ended queue contains n elements in the front list and n elements in the rear list, then the inequality invariant remains satisfied after i insertions and d deletions when (i+d)/2 <= n. That is, at most n/2 operations can happen between each rebalancing.

Intuitively, inserting an element x in front of the double-ended queue (lenf, f, sf, lenr, r, sr) leads almost to the double-ended queue (lenf+1, CONS(x,f), drop(2,sf), lenr, r, drop(2,sr)), the head and the tail of the double-ended queue (lenf, CONS(x,f), sf, lenr, r, sr) are x and almost (lenf-1, f, drop(2,sf), lenr, r, drop(2,sr)) respectively, and the head and the tail of (lenf, NIL, NIL, lenr, CONS(x,NIL), drop(2,sr)) are x and (0, NIL, NIL, 0, NIL, NIL) respectively. The functions to insert an element in the rear, or to drop the last element of the double-ended queue, are similar to the above functions which deal with the front of the double-ended queue. It is said almost because, after insertion and after an application of tail, the invariant |r| <= 2|f|+1 may not be satisfied anymore. In this case it is required to rebalance the double-ended queue.

In order to avoid an operation with an O(n) cost, the algorithm uses laziness with memoization, and forces the rebalancing to be partly done during the following (|f| + |r|)/2 operations, that is, before the following rebalancing. In order to create the scheduling, some auxiliary lazy functions are required. The function rotateRev(f,r,a) returns the list f, followed by the list r reversed, and followed by the list a. It is required in this function that |r| − 2|f| is 2 or 3. This function is defined by induction as rotateRev(NIL, r, a) = reverse(r ++ a), where ++ is the concatenation operation, and by rotateRev(CONS(x,f), r, a) = CONS(x, rotateRev(f, drop(2,r), reverse(take(2,r)) ++ a)). It should be noted that rotateRev(f, r, NIL) returns the list f followed by the list r reversed. The function rotateDrop(f, j, r), which returns f followed by (r without its first j elements) reversed, is also required, for j < |f|. It is defined by rotateDrop(f, 0, r) = rotateRev(f, r, NIL), rotateDrop(f, 1, r) = rotateRev(f, drop(1,r), NIL) and rotateDrop(CONS(x,f), j, r) = CONS(x, rotateDrop(f, j-2, drop(2,r))).

The balancing function can now be defined with:

fun balance(q as (lenf, f, sf, lenr, r, sr)) =
  if lenf > 2*lenr+1 then
    let val i = (lenf+lenr) div 2
        val j = lenf + lenr - i
        val f' = take(i, f)
        val r' = rotateDrop(r, i, f)
    in (i, f', f', j, r', r') end
  else if lenr > 2*lenf+1 then
    let val j = (lenf+lenr) div 2
        val i = lenf + lenr - j
        val r' = take(j, r)
        val f' = rotateDrop(f, j, r)
    in (i, f', f', j, r', r') end
  else q

Note that, without the lazy part of the implementation, this would be a non-persistent implementation of queue in O(1) amortized time. In this case, the lists sf and sr can be removed from the representation of the double-ended queue.

2.8.5 Language support

Ada's containers provides the generic packages Ada.Containers.Vectors and Ada.Containers.Doubly_Linked_Lists, for the dynamic array and linked list implementations, respectively.

C++'s Standard Template Library provides the class templates std::deque and std::list, for the multiple array and linked list implementations, respectively.

As of Java 6, Java’s Collections Framework provides a new Deque interface that provides the functionality of insertion and removal at both ends. It is implemented by classes such as ArrayDeque (also new in Java 6) and LinkedList, providing the dynamic array and linked list implementations, respectively. However, the ArrayDeque, contrary to its name, does not support random access.

Perl's arrays have native support for both removing (shift and pop) and adding (unshift and push) elements on both ends.

Python 2.4 introduced the collections module with support for deque objects. It is implemented using a doubly linked list of fixed-length subarrays.

As of PHP 5.3, PHP’s SPL extension contains the 'SplDoublyLinkedList' class that can be used to implement Deque datastructures. Previously to make a Deque structure the array functions array_shift/unshift/pop/push had to be used instead.

GHC's Data.Sequence module implements an efficient, functional deque structure in Haskell. The implementation uses 2–3 finger trees annotated with sizes. There are other (fast) possibilities to implement purely functional (thus also persistent) double queues (most using heavily lazy evaluation).[3][4] Kaplan and Tarjan were the first to implement optimal confluently persistent catenable deques.[5] Their implementation was strictly purely functional in the sense that it did not use lazy evaluation. Okasaki simplified the data structure by using lazy evaluation with a bootstrapped data structure and degrading the performance bounds from worst-case to amortized. Kaplan, Okasaki, and Tarjan produced a simpler, non-bootstrapped, amortized version that can be implemented either using lazy evaluation or more efficiently using mutation in a broader but still restricted fashion. Mihaesau and Tarjan created a simpler (but still highly complex) strictly purely functional implementation of catenable deques, and also a much simpler implementation of strictly purely functional non-catenable deques, both of which have optimal worst-case bounds.
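The Python support mentioned above can be exercised directly; collections.deque offers O(1) appends and pops at both ends:

```python
from collections import deque

d = deque([2, 3, 4])
d.appendleft(1)      # insert at the front
d.append(5)          # insert at the back
front = d.popleft()  # remove from the front
back = d.pop()       # remove from the back
```

After these operations, front holds 1, back holds 5, and the deque again contains 2, 3, 4.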
2.8.6 Complexity

• In a doubly-linked list implementation and assuming no allocation/deallocation overhead, the time complexity of all deque operations is O(1). Additionally, the time complexity of insertion or deletion in the middle, given an iterator, is O(1); however, the time complexity of random access by index is O(n).

• In a growing array, the amortized time complexity of all deque operations is O(1). Additionally, the time complexity of random access by index is O(1); but the time complexity of insertion or deletion in the middle is O(n).

• Queue
• Priority queue

2.8.9 References

[1] Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.

[4] Adam L. Buchsbaum and Robert E. Tarjan. Confluently persistent deques via data structural bootstrapping. Journal of Algorithms, 18(3):513–547, May 1995. (pp. 58, 101, 125)

[5] Haim Kaplan and Robert E. Tarjan. Purely functional representations of catenable sorted lists. In ACM Symposium on Theory of Computing, pages 202–211, May 1996. (pp. 4, 82, 84, 124)

[6] Eitan Frachtenberg, Uwe Schwiegelshohn (2007). Job Scheduling Strategies for Parallel Processing: 12th International Workshop, JSSPP 2006. Springer. ISBN 3-540-71034-5. See p. 22.

2.8.10 External links

• Type-safe open source deque implementation at Comprehensive C Archive Network

2.9 Circular buffer

A circular buffer, circular queue, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams.
2.9. CIRCULAR BUFFER 51
2.9.1 Uses

The useful property of a circular buffer is that it does not need to have its elements shuffled around when one is consumed. (If a non-circular buffer were used then it would be necessary to shift all elements when one is consumed.)

Assume that a 1 is written into the middle of the buffer (exact starting location does not matter in a circular buffer):

(diagram of buffer contents omitted)
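The read/write behavior described above can be sketched in a few lines of Python. The RingBuffer class below is invented for illustration; real implementations track read and write pointers in various ways:

```python
class RingBuffer:
    """Fixed-size circular buffer; writing past capacity overwrites the oldest item."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.start = 0   # index of the oldest element
        self.count = 0   # number of stored elements

    def write(self, item):
        end = (self.start + self.count) % self.capacity
        self.buf[end] = item
        if self.count < self.capacity:
            self.count += 1
        else:
            # buffer full: overwrite the oldest element and advance start
            self.start = (self.start + 1) % self.capacity

    def read(self):
        if self.count == 0:
            raise IndexError("buffer is empty")
        item = self.buf[self.start]
        # consuming the oldest element only moves an index; nothing is shuffled
        self.start = (self.start + 1) % self.capacity
        self.count -= 1
        return item

rb = RingBuffer(3)
for x in [1, 2, 3, 4]:   # the fourth write overwrites the 1
    rb.write(x)
print(rb.read(), rb.read(), rb.read())  # 2 3 4
```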
This image shows a full buffer with four elements (numbers 1 through 4) having been overwritten:

(diagram of buffer contents omitted)

Fixed-sized compressed circular buffers use an alternative indexing strategy based on elementary number theory to maintain a fixed-sized compressed representation of the entire data sequence.[3]

• http://www.dspguide.com/ch28/2.htm

2.9.4 Optimization
Dictionaries
3.1 Associative array

“Dictionary (data structure)” redirects here. It is not to be confused with data dictionary.

“Associative container” redirects here. For the implementation of ordered associative arrays in the standard library of the C++ programming language, see associative containers.

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

Operations associated with this data type allow:[1][2]

• the addition of a pair to the collection

The dictionary problem is a classic computer science problem: the task of designing a data structure that maintains a set of data during 'search', 'delete', and 'insert' operations.[3] The two major solutions to the dictionary problem are a hash table or a search tree.[1][2][4][5] In some cases it is also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.

Many programming languages include associative arrays as primitive data types, and they are available in software libraries for many others. Content-addressable memory is a form of direct hardware-level support for associative arrays.

Associative arrays have many applications including such fundamental programming patterns as memoization and the decorator pattern.[6]

3.1.1 Operations

In an associative array, the association between a key and a value is often known as a “binding”, and the same word “binding” may also be used to refer to the process of creating a new association.

The operations that are usually defined for an associative array are:[1][2]

• Add or insert: add a new (key, value) pair to the collection, binding the new key to its new value. The arguments to this operation are the key and the value.

• Reassign: replace the value in one of the (key, value) pairs that are already in the collection, binding an old key to a new value. As with an insertion, the arguments to this operation are the key and the value.

• Lookup: find the value (if any) that is bound to a given key. The argument to this operation is the key, and the value is returned from the operation. If no value is found, some associative array implementations raise an exception.

Often then instead of add or reassign there is a single set operation that adds a new (key, value) pair if one does not already exist, and otherwise reassigns it.

In addition, associative arrays may also include other operations such as determining the number of bindings or constructing an iterator to loop over all the bindings. Usually, for such an operation, the order in which the bindings are returned may be arbitrary.

A multimap generalizes an associative array by allowing multiple values to be associated with a single key.[7] A bidirectional map is a related abstract data type in which the bindings operate in both directions: each value must be associated with a unique key, and a second lookup operation takes a value as argument and looks up the key associated with that value.
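These operations map directly onto Python's built-in dict; the keys and values below are invented for illustration:

```python
phone_book = {}                    # an empty associative array

phone_book["alice"] = 5551234      # add/insert: bind a new key to a value
phone_book["alice"] = 5559999      # reassign: the same syntax rebinds the key
value = phone_book["alice"]        # lookup: raises KeyError if the key is absent

assert value == 5559999
assert len(phone_book) == 1        # number of bindings
for key, val in phone_book.items():  # iterate over all bindings
    print(key, val)
```

Here "add", "reassign", and "set" collapse into one subscript-assignment operation, exactly the combined set behavior described above.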
3.1.3 Implementation

For dictionaries with very small numbers of bindings, it may make sense to implement the dictionary using an association list, a linked list of bindings. With this implementation, the time to perform the basic dictionary operations is linear in the total number of bindings; however, it is easy to implement and the constant factors in its running time are small.[1][8]

Another very simple implementation technique, usable when the keys are restricted to a narrow range of integers, is direct addressing into an array: the value for a given key k is stored at the array cell A[k], or if there is no binding for k then the cell stores a special sentinel value that indicates the absence of a binding. As well as being simple, this technique is fast: each dictionary operation takes constant time. However, the space requirement for this structure is the size of the entire keyspace, making it impractical unless the keyspace is small.[4]

The two major approaches to implementing dictionaries are a hash table or a search tree.[1][2][4][5]

Hash table implementations

The most frequently used general purpose implementation of an associative array is with a hash table: an array combined with a hash function that separates each key into a separate “bucket” of the array. The basic idea behind a hash table is that accessing an element of an array via its index is a simple, constant-time operation. Therefore, the average overhead of an operation for a hash table is only the computation of the key’s hash, combined with accessing the corresponding bucket within the array.

This graph compares the average number of cache misses required to look up elements in tables with separate chaining and open addressing.

Open addressing has a lower cache miss ratio than separate chaining when the table is mostly empty. However, as the table becomes filled with more elements, open addressing’s performance degrades exponentially. Additionally, separate chaining uses less memory in most cases, unless the entries are very small (less than four times the size of a pointer).

Tree implementations

Main article: Search tree

Self-balancing binary search trees

Another common approach is to implement an associative array with a self-balancing binary search tree, such as an AVL tree or a red-black tree.[10]

Compared to hash tables, these structures have both advantages and weaknesses. The worst-case performance of self-balancing binary search trees is significantly better than that of a hash table, with a time complexity in big O notation of O(log n). This is in contrast to hash tables, whose worst-case performance involves all elements sharing a single bucket, resulting in O(n) time complexity. In
addition, and like all binary search trees, self-balancing binary search trees keep their elements in order. Thus, traversing its elements follows a least-to-greatest pattern, whereas traversing a hash table can result in elements being in seemingly random order. However, hash tables have a much better average-case time complexity than self-balancing binary search trees of O(1), and their worst-case performance is highly unlikely when a good hash function is used.

It is worth noting that a self-balancing binary search tree can be used to implement the buckets for a hash table that uses separate chaining. This allows for average-case constant lookup, but assures a worst-case performance of O(log n). However, this introduces extra complexity into the implementation, and may cause even worse performance for smaller hash tables, where the time spent inserting into and balancing the tree is greater than the time needed to perform a linear search on all of the elements of a linked list or similar data structure.[11][12]

Other trees

Associative arrays may also be stored in unbalanced binary search trees or in data structures specialized to a particular type of keys such as radix trees, tries, Judy arrays, or van Emde Boas trees, but these implementation methods are less efficient than hash tables as well as placing greater restrictions on the types of data that they can handle. The advantages of these alternative structures come from their ability to handle operations beyond the basic ones of an associative array, such as finding the binding whose key is the closest to a queried key, when the query is not itself present in the set of bindings.

Comparison

3.1.4 Language support

Main article: Comparison of programming languages (mapping)

Associative arrays can be implemented in any programming language as a package and many language systems provide them as part of their standard library. In some languages, they are not only built into the standard system, but have special syntax, often using array-like subscripting.

Built-in syntactic support for associative arrays was introduced by SNOBOL4, under the name “table”. MUMPS made multi-dimensional associative arrays, optionally persistent, its key data structure. SETL supported them as one possible implementation of sets and maps. Most modern scripting languages, starting with AWK and including Rexx, Perl, Tcl, JavaScript, Wolfram Language, Python, Ruby, Go, and Lua, support associative arrays as a primary container type. In many more languages, they are available as library functions without special syntax.

In Smalltalk, Objective-C, .NET,[13] Python, REALbasic, Swift, and VBA they are called dictionaries; in Perl, Ruby and Seed7 they are called hashes; in C++, Java, Go, Clojure, Scala, OCaml, Haskell they are called maps (see map (C++), unordered_map (C++), and Map); in Common Lisp and Windows PowerShell, they are called hash tables (since both typically use this implementation). In PHP, all arrays can be associative, except that the keys are limited to integers and strings. In JavaScript (see also JSON), all objects behave as associative arrays with string-valued keys, while the Map and WeakMap types take arbitrary objects as keys. In Lua, they are called tables, and are used as the primitive building block for all data structures. In Visual FoxPro, they are called Collections. The D language also has support for associative arrays.[14]

3.1.5 Permanent storage

Main article: Key-value store

Most programs using associative arrays will at some point need to store that data in a more permanent form, like in a computer file. A common solution to this problem is a generalized concept known as archiving or serialization, which produces a text or binary representation of the original objects that can be written directly to a file. This is most commonly implemented in the underlying object model, like .NET or Cocoa, which include standard functions that convert the internal data into text form. The program can create a complete text representation of any group of objects by calling these methods, which are almost always already implemented in the base associative array class.[15]

For programs that use very large data sets, this sort of individual file storage is not appropriate, and a database management system (DB) is required. Some DB systems natively store associative arrays by serializing the data and then storing that serialized data and the key. Individual arrays can then be loaded or saved from the database using the key to refer to them. These key-value stores have been used for many years and have a history as long as that of the more common relational databases (RDBs), but a lack of standardization, among other reasons, limited their use to certain niche roles. RDBs were used for these roles in most cases, although saving objects to an RDB can be complicated, a problem known as object-relational impedance mismatch.

After c. 2010, the need for high-performance databases suitable for cloud computing and more closely matching the internal structure of the programs using them led to a renaissance in the key-value store market. These systems can store and retrieve associative arrays in a native fashion, which can greatly improve performance in common web-related workflows.
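As a minimal sketch of such serialization, Python's json module can write an associative array to a file and restore it; the file name and settings below are invented for the example, and real object models add type information on top of this:

```python
import json
import os
import tempfile

settings = {"theme": "dark", "font_size": 12}

path = os.path.join(tempfile.mkdtemp(), "settings.json")
with open(path, "w") as f:
    json.dump(settings, f)     # serialize the associative array to text

with open(path) as f:
    restored = json.load(f)    # deserialize the text back into a dict

assert restored == settings
```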
increase the size of the list, and thus the time to search, without providing any compensatory advantage.

One advantage of association lists is that a new element can be added in constant time. Additionally, when the number of keys is very small, searching an association list may be more efficient than searching a binary search tree or hash table, because of the greater simplicity of their implementation.[4]

3.2.3 Applications and software libraries

In the early development of Lisp, association lists were used to resolve references to free variables in procedures.[5][6] In this application, it is convenient to augment association lists with an additional operation, that reverses the addition of a key–value pair without scanning the list for other copies of the same key. In this way, the association list can function as a stack, allowing local variables to temporarily shadow other variables with the same names, without destroying the values of those other variables.[7]

Many programming languages, including Lisp,[5] Scheme,[8] OCaml,[9] and Haskell[10] have functions for handling association lists in their standard libraries.

3.2.4 See also

• Self-organizing list, a strategy for re-ordering the keys in an association list to speed up searches for frequently-accessed keys

3.2.5 References

[1] Marriott, Kim; Stuckey, Peter J. (1998). Programming with Constraints: An Introduction. MIT Press. pp. 193–195. ISBN 9780262133418.

[2] Frické, Martin (2012). “2.8.3 Association Lists”. Logic and the Organization of Information. Springer. pp. 44–45. ISBN 9781461430872.

[3] Knuth, Donald. “6.1 Sequential Searching”. The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 396–405. ISBN 0-201-89685-0.

[4] Janes, Calvin (2011). “Using Association Lists for Associative Arrays”. Developer’s Guide to Collections in Microsoft .NET. Pearson Education. p. 191. ISBN 9780735665279.

[5] McCarthy, John; Abrahams, Paul W.; Edwards, Daniel J.; Hart, Timothy P.; Levin, Michael I. (1985). LISP 1.5 Programmer’s Manual (PDF). MIT Press. ISBN 0-262-13011-4. See in particular p. 12 for functions that search an association list and use it to substitute symbols in another expression, and p. 103 for the application of association lists in maintaining variable bindings.

[6] van de Snepscheut, Jan L. A. (1993). What Computing Is All About. Monographs in Computer Science. Springer. p. 201. ISBN 9781461227106.

[7] Scott, Michael Lee (2000). “3.3.4 Association Lists and Central Reference Tables”. Programming Language Pragmatics. Morgan Kaufmann. p. 137. ISBN 9781558604421.

[8] Pearce, Jon (2012). Programming and Meta-Programming in Scheme. Undergraduate Texts in Computer Science. Springer. p. 214. ISBN 9781461216827.

[9] Minsky, Yaron; Madhavapeddy, Anil; Hickey, Jason (2013). Real World OCaml: Functional Programming for the Masses. O'Reilly Media. p. 253. ISBN 9781449324766.

[10] O'Sullivan, Bryan; Goerzen, John; Stewart, Donald Bruce (2008). Real World Haskell: Code You Can Believe In. O'Reilly Media. p. 299. ISBN 9780596554309.

3.3 Hash table

Not to be confused with Hash list or Hash tree.

“Rehash” redirects here. For the South Park episode, see Rehash (South Park). For the IRC command, see List of Internet Relay Chat commands § REHASH.

A small phone book as a hash table

In computing, a hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions where the hash function generates the same index for more than one key. Such collisions must be accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the
number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at (amortized[2]) constant average cost per operation.[3][4]

In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

3.3.1 Hashing

Main article: Hash function

The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found:

index = f(key, array_size)

Often this is done in two steps:

hash = hashfunc(key)
index = hash % array_size

In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%).

In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function.

consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]

Cryptographic hash functions are believed to provide good hash functions for any table size s, either by modulo reduction or by bit masking. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server’s hash tables. However, the risk of sabotage can also be avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash function). A drawback of cryptographic hashing functions is that they are often slower to compute, which means that in cases where the uniformity for any s is not necessary, a non-cryptographic hashing function might be preferable.

Perfect hash function

If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash table that has no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

Perfect hashing allows for constant time lookups in all cases. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large, O(n), for instance when all the keys hash to a few values.
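The two-step index computation from the Hashing section can be written out directly; this sketch uses Python's built-in hash as the hash function, an arbitrary choice for illustration:

```python
def bucket_index(key, array_size):
    h = hash(key)           # step 1: the hash is independent of the table size
    return h % array_size   # step 2: reduce to an index in 0 .. array_size - 1

array_size = 16
i = bucket_index("John Smith", array_size)
assert 0 <= i < array_size

# For a power-of-two array size, the modulo reduction is equivalent to masking:
assert hash("John Smith") % array_size == hash("John Smith") & (array_size - 1)
```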
bucket. Clearly the hashing is not working in the second one.

A low load factor is not especially beneficial. As the load factor approaches 0, the proportion of unused areas in the hash table increases, but there is not necessarily any reduction in search cost. This results in wasted memory.

3.3.3 Collision resolution

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,450 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is approximately a 95% chance of at least two of the keys being hashed to the same slot.

Therefore, almost all hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.

Separate chaining

Hash collision resolved by separate chaining.

In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation.

In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that. Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient for a fairly large number of entries per bucket are not needed or desirable. If these cases happen often, the hashing function needs to be fixed.

Separate chaining with linked lists

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—that is, it is roughly proportional to the load factor.

For this reason, chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list.

For separate-chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.

The bucket chains are often searched sequentially using the order the entries were added to the bucket. If the load factor is large and some keys are more likely to come up than others, then rearranging the chain with a move-to-front heuristic may be effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in a worst-case scenario. However, using a larger table and/or a better hash function may be even more effective in those cases.

Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance, making the processor cache ineffective.

Hash collision by separate chaining with head records in the bucket array.
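A minimal separate-chaining table along the lines just described, using Python lists as the per-bucket structure (the class and its names are invented for illustration, not production code):

```python
class ChainedHashTable:
    def __init__(self, num_slots=8):
        self.buckets = [[] for _ in range(num_slots)]

    def _bucket(self, key):
        # constant-time step: hash the key and select its bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already bound: reassign
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key (or collision): extend the chain

    def get(self, key):
        for k, v in self._bucket(key):    # scan only the selected bucket's chain
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("John Smith", "521-1234")
t.put("Sandra Dee", "521-9655")
print(t.get("Sandra Dee"))  # 521-9655
```

The lookup cost is the constant bucket selection plus a scan whose expected length is the load factor, as the text above explains.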
tables often have about as many slots as stored entries, meaning that many slots have two or more entries.

Separate chaining with other structures

Instead of a list, one can use any other data structure that supports the required operations. For example, by using a self-balancing binary search tree, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n). However, this introduces extra complexity into the implementation, and may cause even worse performance for smaller hash tables, where the time spent inserting into and balancing the tree is greater than the time needed to perform a linear search on all of the elements of a list.[3][8] A real world example of a hash table that uses a self-balancing binary search tree for buckets is the HashMap class in Java version 8.[9]

The variant called array hash table uses a dynamic array to store all the entries that hash to the same slot.[10][11][12] Each newly inserted entry gets appended to the end of the dynamic array that is assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown only by as many bytes as needed. Alternative techniques such as growing the array by block sizes or pages were found to improve insertion performance, but at a cost in space. This variation makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions. It also dispenses with the next pointers that are required by linked lists, which saves space. Despite frequent array resizing, space overheads incurred by the operating system such as memory fragmentation were found to be small.

An elaboration on this approach is the so-called dynamic perfect hashing,[13] where a bucket that contains k entries is organized as a perfect hash table with k² slots. While it uses more memory (n² slots for n entries, in the worst case and n × k slots in the average case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion. It is also possible to use a fusion tree for each bucket, achieving constant time for all operations with high probability.[14]

Open addressing

Main article: Open addressing

Hash collision resolved by open addressing with linear probing (interval=1). Note that “Ted Baker” has a unique hash, but nevertheless collided with “Sandra Dee”, that had previously collided with “John Smith”.

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.[15] The name “open addressing” refers to the fact that the location (“address”) of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with “open hashing” or “closed addressing” that usually mean separate chaining.)

Well-known probe sequences include:

• Linear probing, in which the interval between probes is fixed (usually 1)

• Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation

• Double hashing, in which the interval between probes is computed by a second hash function

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.

Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order. Using separate chaining, the only
concern is that too many objects map to the same hash value; whether they are adjacent or nearby is completely irrelevant.

Open addressing only saves memory if the entries are small (less than four times the size of a pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are far more buckets than stored entries), open addressing is wasteful even if each entry is just two words.

other considerations typically come into play.

Coalesced hashing

A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.[15] Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.
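Open addressing with linear probing (interval 1) can be sketched as follows; this is illustrative Python with invented names, and deletion is omitted because it requires tombstone markers:

```python
class LinearProbingTable:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots   # each slot holds (key, value) or None

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):           # probe sequence: i, i+1, i+2, ...
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + 1) % len(self.slots)
        raise RuntimeError("table full")           # entries cannot exceed slots

    def get(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is None:              # unused slot: key is absent
                raise KeyError(key)
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)

t = LinearProbingTable()
t.put("John Smith", "521-1234")
t.put("Sandra Dee", "521-9655")
print(t.get("John Smith"))  # 521-1234
```

Note how a failed lookup stops at the first unused slot, mirroring the search rule described above, and how insertion fails outright once every slot is occupied.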
without invalidating the neighborhood property of any of the buckets along the way. In the end, the open slot has been moved into the neighborhood, and the entry being inserted can be added to it.

Robin Hood hashing

One interesting variation on double-hashing collision resolution is Robin Hood hashing.[17][18] The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to ordered hash tables[19] except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes is reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.[20] External Robin Hood hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records.[21]

2-choice hashing

2-choice hashing employs two different hash functions, h1(x) and h2(x), for the hash table. Both hash functions are used to compute two table locations. When an object is inserted in the table, then it is placed in the table location that contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2-choice hashing employs the principle of the power of two choices.[22]

3.3.4 Dynamic resizing

The good functioning of a hash table depends on the fact that the table size is proportional to the number of entries. With a fixed size, and the common structures, it is similar to linear search, except with a better constant factor. In some cases, the number of entries may be definitely known in advance, for example keywords in a language. More commonly, this is not known for sure, if only due to later changes in code and data. It is one serious, although common, mistake to not provide any way for the table to resize. A general-purpose hash table “class” will almost always have some way to resize, and it is good practice even for simple “custom” tables. An implementation should check the load factor, and do something if it becomes too large (this needs to be done only on inserts, since that is the only thing that would increase it).

To keep the load factor under a certain limit, e.g., under 3/4, many table implementations expand the table when items are inserted. For example, in Java’s HashMap class the default load factor threshold for table expansion is 0.75 and in Python’s dict, table size is resized when load factor is greater than 2/3.

Since buckets are usually implemented on top of a dynamic array and any constant proportion for resizing greater than 1 will keep the load factor under the desired limit, the exact choice of the constant is determined by the same space-time tradeoff as for dynamic arrays.

Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket locations.

To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table—followed by a rehash—when items are deleted. From the point of space-time tradeoffs, this operation is similar to the deallocation in dynamic arrays.

Resizing by copying all entries

A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold rmax. Then a new larger table is allocated, each entry is removed from the old table, and inserted into the new table. When all entries have been removed from the old table then the old table is returned to the free storage pool. Symmetrically, when the load factor falls below a second threshold rmin, all entries are moved to a new smaller table.

For hash tables that shrink and grow frequently, the resizing downward can be skipped entirely. In this case, the table size is proportional to the maximum number of entries that ever were in the hash table at one time, rather than the current number. The disadvantage is that memory usage will be higher, and thus cache behavior may be worse. For best control, a “shrink-to-fit” operation can be provided that does this only on request.

If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.

For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m − 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.

Incremental resizing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical opera-
the default load factor threshold for table expansion is 3/4 tions. If one cannot avoid dynamic resizing, a solution is
3.3. HASH TABLE 63
• During the resize, allocate the new hash table, but keep the old table unchanged.

• In each lookup or delete operation, check both tables.

• Perform insertion operations only in the new table.

• At each insertion also move r elements from the old table to the new table.

• When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.

Disk-based hash tables almost always use some scheme of incremental resizing, since the cost of rebuilding the entire table on disk would be too high.

Monotonic keys

If it is known that key values will always increase (or decrease) monotonically, then a variation of consistent hashing can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash function—and indeed hash table—both of which can be different for each range. Since it is common to grow the overall number of entries by doubling, there will only be O(log(N)) ranges to check, and binary search time for the redirection would be O(log(log(N))). As with consistent hashing, this approach guarantees that any key's hash, once issued, will never change, even when the hash table is later grown.

Other solutions

Linear hashing[23] is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible lookup functions.

Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. Such hash functions are prevalent in disk-based and distributed hash tables, where rehashing is prohibitively costly. The problem of designing a hash such that most values do not change when the table is resized is known as the distributed hash table problem. The four most popular approaches are rendezvous hashing, consistent hashing, the content addressable network algorithm, and Kademlia distance.

3.3.5 Performance analysis

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size k with open addressing has no collisions and holds up to k elements, with a single comparison for successful lookup, and a table of size k with chaining and n keys has the minimum max(0, n − k) collisions and O(1 + n/k) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(n) amortized comparisons per insertion and up to n comparisons for a successful lookup.

Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies that only n/b^i keys are inserted i or more times, so that the total number of insertions is bounded above by bn/(b − 1), which is O(n). By using rehashing to maintain n < k, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.

In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called "simple uniform hashing" and it can be shown that hashing with chaining requires Θ(1 + n/k) comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 − n/k)).[24] Both these bounds are constant if we maintain n/k < c using table resizing, where c is a fixed constant less than 1.

3.3.6 Features

Advantages

The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized.

If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect. In this case the keys need not be stored in the table.

Drawbacks

Although operations on a hash table take constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)

For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if there are not too many possible keys to store—that is, if each key can be represented by a small enough number of bits—then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.

The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some pseudo-random order. Therefore, there is no efficient way to locate an entry whose key is nearest to a given key. Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to log(n) per entry. In comparison, ordered search trees have lookup and insertion cost proportional to log(n), but allow finding the nearest key at about the same cost, and ordered enumeration of all entries at constant cost per entry.

If the keys are not stored (because the hash function is collision-free), there may be no easy way to enumerate the keys that are present in the table at any given moment.

Although the average cost per operation is constant and fairly small, the cost of a single operation may be quite high. In particular, if the hash table uses dynamic resizing, an insertion or deletion operation may occasionally take time proportional to the number of entries. This may be a serious drawback in real-time or interactive applications.

Hash tables in general exhibit poor locality of reference—that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache misses that cause long delays. Compact data structures such as arrays searched with linear search may be faster if the table is relatively small and keys are compact. The optimal performance point varies from system to system.

Hash tables become quite inefficient when there are many collisions. While extremely uneven hash distributions are extremely unlikely to arise by chance, a malicious adversary with knowledge of the hash function may be able to supply information to a hash that creates worst-case behavior by causing excessive collisions, resulting in very poor performance, e.g., a denial-of-service attack.[25][26][27] In critical applications, a data structure with better worst-case guarantees can be used; however, universal hashing—a randomized algorithm that prevents the attacker from predicting which inputs cause worst-case behavior—may be preferable.[28] The hash function used by the hash table in the Linux routing table cache was changed with Linux version 2.4.2 as a countermeasure against such attacks.[29]

3.3.7 Uses

Associative arrays

Main article: associative array

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Perl, Ruby, Python, and PHP.

When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both items.

When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves are different, the associative array likewise stores both items. However, if the key of the new item exactly matches the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so every item in the table has a unique key.

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications. In multi-node database systems, hash tables are commonly used to distribute rows amongst nodes, reducing network traffic for hash joins.

Caches

Main article: cache (computing)

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.

Sets

Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not.
Those structures can therefore be used to implement a set data structure, which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.

Object representation

Several dynamic languages, such as Perl, Python, JavaScript, Lua, and Ruby, use hash tables to implement objects. In this representation, the keys are the names of the members and methods of the object, and the values are pointers to the corresponding member or method.

Unique data representation

Main article: String interning

Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single string pool implemented as a hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.).

Transposition table

Main article: Transposition table

3.3.8 Implementations

In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In C++11, for example, the unordered_map class provides hash tables for keys and values of arbitrary type.

The Java programming language (including the variant which is used on Android) includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.[30]

In PHP 5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein to generate the hash values used in managing the mappings of data pointers stored in a hash table. In the PHP source code, it is labelled as DJBX33A (Daniel J. Bernstein, Times 33 with Addition).

Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%), are used internally to implement namespaces and therefore need to pay more attention to security, i.e., collision attacks. Python sets also use hashes internally, for fast lookup (though they store only keys, not values).[31]

In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and generic Dictionary classes, which store key-value pairs, and the generic HashSet class, which stores only values.

In Rust's standard library, the generic HashMap and HashSet structs use linear probing with Robin Hood bucket stealing.

3.3.9 History

The idea of hashing arose independently in different places. In January 1953, H. P. Luhn wrote an internal IBM memorandum that used hashing with chaining.[32] Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.[32]

3.3.10 See also

• Rabin–Karp string search algorithm
• Stable hashing
• Consistent hashing
• Extendible hashing
• Lazy deletion
• Pearson hashing
• PhotoDNA
• Search data structure

Related data structures

There are several data structures that use hash functions but cannot be considered special cases of hash tables:

• Bloom filter, memory-efficient data structure designed for constant-time approximate lookups; uses hash function(s) and can be seen as an approximate hash table.

• Distributed hash table (DHT), a resilient dynamic table spread over several nodes of a network.

• Hash array mapped trie, a trie structure, similar to the array mapped trie, but where each key is hashed first.
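Of the related structures listed above, the Bloom filter is compact enough to sketch. The version below derives its k bit positions by hashing the item together with a small seed using Python's built-in hash; that seeding scheme is an illustrative assumption for the sketch, not a recommended construction:

```python
class BloomFilter:
    """Approximate set: membership tests may report false positives
    (items never added), but never false negatives."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0   # bit array packed into one Python integer

    def _positions(self, item):
        # k bit positions per item; (seed, item) hashing is illustrative only.
        return [hash((seed, item)) % self.nbits
                for seed in range(self.nhashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # Present only if every one of the item's k bits is set.
        return all((self.bits >> p) & 1 for p in self._positions(item))
```

Unlike a hash table, the filter never stores the keys themselves, so it cannot enumerate its members or ever delete one; it trades that away for very small memory use.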
3.3.12 Further reading

• Tamassia, Roberto; Goodrich, Michael T. (2006). "Chapter Nine: Maps and Dictionaries". Data Structures and Algorithms in Java (4th ed., updated for Java 5.0). Hoboken, NJ: Wiley. pp. 369–418. ISBN 0-471-73884-0.

• McKenzie, B. J.; Harries, R.; Bell, T. (Feb 1990). "Selecting a hashing algorithm". Software Practice & Experience. 20 (2): 209–224. doi:10.1002/spe.4380200207.

3.3.13 External links

• A Hash Function for Hash Table Lookup by Bob Jenkins
• Hash Tables by SparkNotes—explanation using C
• Hash functions by Paul Hsieh
• Design of Compact and Efficient Hash Tables for Java
• NIST entry on hash tables
• Lecture on Hash Tables
• Open Data Structures – Chapter 5 – Hash Tables

3.4 Linear probing

(Figure: The collision between John Smith and Sandra Dee, both hashing to cell 873, is resolved by placing Sandra Dee at the next free location, cell 874.)

Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene Amdahl, Elaine M. McGraw, and Arthur Samuel and first analyzed in 1963 by Donald Knuth.

Along with quadratic probing and double hashing, linear probing is a form of open addressing. In these schemes, each cell of a hash table stores a single key–value pair. When the hash function causes a collision by mapping a new key to a cell of the hash table that is already occupied by another key, linear probing searches the table for the closest following free location and inserts the new key there. Lookups are performed in the same way, by searching the table sequentially starting at the position given by the hash function, until finding a cell with a matching key or an empty cell.

As Thorup & Zhang (2012) write, "Hash tables are the most commonly used nontrivial data structures, and the most popular implementation on standard hardware uses linear probing, which is both fast and simple."[1] Linear probing can provide high performance because of its good locality of reference, but is more sensitive to the quality of its hash function than some other collision resolution schemes. It takes constant expected time per search, insertion, or deletion when implemented using a random hash function, a 5-independent hash function, or tabulation hashing. However, good results can be achieved in practice with other hash functions such as MurmurHash.[2]

3.4.1 Operations

Search

To search for a given key x, the cells of T are examined, beginning with the cell at index h(x) (where h is the hash function) and continuing to the adjacent cells h(x) + 1, h(x) + 2, ..., until finding either an empty cell or a cell whose stored key is x. If a cell containing the key is found,
the search returns the value from that cell. Otherwise, if an empty cell is found, the key cannot be in the table, because it would have been placed in that cell in preference to any later cell that has not yet been searched. In this case, the search returns as its result that the key is not present in the dictionary.[3][4]

Insertion

To insert a key–value pair (x,v) into the table (possibly replacing any existing pair with the same key), the insertion algorithm follows the same sequence of cells that would be followed for a search, until finding either an empty cell or a cell whose stored key is x. The new key–value pair is then placed into that cell.[3][4]

If the insertion would cause the load factor of the table (its fraction of occupied cells) to grow above some preset threshold, the whole table may be replaced by a new table, larger by a constant factor, with a new hash function, as in a dynamic array. Setting this threshold close to zero and using a high growth rate for the table size leads to faster hash table operations but greater memory usage than threshold values close to one and low growth rates. A common choice would be to double the table size when the load factor would exceed 1/2, causing the load factor to stay between 1/4 and 1/2.[5]

Deletion

When a key–value pair is deleted, the emptied cell cannot simply be left blank, because later searches for keys stored beyond it would stop early at the empty cell; it may therefore be necessary to move another pair backwards into the emptied cell. Moving that pair empties its own cell, and the search for a movable key continues for the new emptied cell, in the same way, until it terminates by reaching a cell that was already empty. In this process of moving keys to earlier cells, each key is examined only once. Therefore, the time to complete the whole process is proportional to the length of the block of occupied cells containing the deleted key, matching the running time of the other hash table operations.[3]

Alternatively, it is possible to use a lazy deletion strategy in which a key–value pair is removed by replacing the value by a special flag value indicating a deleted key. However, these flag values will contribute to the load factor of the hash table. With this strategy, it may become necessary to clean the flag values out of the array and rehash all the remaining key–value pairs once too large a fraction of the array becomes occupied by deleted keys.[3][4]

3.4.2 Properties

Linear probing provides good locality of reference, which causes it to require few uncached memory accesses per operation. Because of this, for low to moderate load factors, it can provide very high performance. However, compared to some other open addressing strategies, its performance degrades more quickly at high load factors because of primary clustering, a tendency for one collision to cause more nearby collisions.[3] Additionally, achieving good performance with this method requires a higher-quality hash function than for some other collision resolution schemes.[6] When used with low-quality hash functions that fail to eliminate nonuniformities in the input distribution, linear probing can be slower than other open-addressing strategies such as double hashing, which probes a sequence of cells whose separation is determined by a second hash function, or quadratic probing, where the size of each step varies depending on its position within the probe sequence.[7]
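The search, insertion, and deletion procedures described above can be sketched as follows. This is an illustrative fixed-size Python table (no dynamic resizing, and the caller must keep it less than full); its deletion moves later keys of the block backwards, as described above, rather than using lazy flag values:

```python
class LinearProbingTable:
    """Minimal linear-probing hash table sketch (fixed size, no resizing)."""

    def __init__(self, size=16):
        self.size = size
        self.cells = [None] * size      # each cell holds (key, value) or None

    def _home(self, key):
        return hash(key) % self.size    # h(x): preferred cell for the key

    def _find_slot(self, key):
        # Probe h(x), h(x)+1, ... until the key or an empty cell is found.
        i = self._home(key)
        while self.cells[i] is not None and self.cells[i][0] != key:
            i = (i + 1) % self.size
        return i

    def get(self, key):
        cell = self.cells[self._find_slot(key)]
        return cell[1] if cell is not None else None

    def put(self, key, value):
        # Insertion follows the same probe sequence as search.
        self.cells[self._find_slot(key)] = (key, value)

    def delete(self, key):
        i = self._find_slot(key)
        if self.cells[i] is None:
            return                      # key was not present
        self.cells[i] = None
        # Backward-shift deletion: refill the hole at i with a later key
        # that would otherwise become unreachable to searches.
        j = i
        while True:
            j = (j + 1) % self.size
            if self.cells[j] is None:
                return                  # end of the occupied block
            h = self._home(self.cells[j][0])
            # cells[j] stays put only if its home h lies cyclically in
            # (i, j]; otherwise searches for it would stop at the hole.
            if i < j:
                reachable = i < h <= j
            else:                       # the probe wrapped past the end
                reachable = h > i or h <= j
            if not reachable:
                self.cells[i] = self.cells[j]
                self.cells[j] = None
                i = j                   # continue with the new hole
```

Each operation touches only the contiguous block of occupied cells around the key's home position, which is the locality-of-reference advantage noted in the Properties section.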
3.4.3 Analysis

…the maximal blocks of contiguous cells in the table. A similar sum of squared block lengths gives the expected time bound for a random hash function (rather than for a random starting location into a specific state of the hash table), by summing over all the blocks that could exist (rather than the ones that actually exist in a given state of the table), and multiplying the term for each potential block by the probability that the block is actually occupied. That is, defining Block(i,k) to be the event that there is a maximal contiguous block of occupied cells of length k beginning at index i, the expected time per operation is

E[T] = O(1) + ∑_{i=1}^{N} ∑_{k=1}^{n} O(k²/N) · Pr[Block(i, k)].

This formula can be simplified by replacing Block(i,k) by a simpler necessary condition Full(k), the event that at least k elements have hash values that lie within a block of cells of length k. After this replacement, the value within the sum no longer depends on i, and the 1/N factor cancels the N terms of the outer summation. These simplifications lead to the bound

E[T] ≤ O(1) + ∑_{k=1}^{n} O(k²) · Pr[Full(k)].

But by the multiplicative form of the Chernoff bound, when the load factor is bounded away from one, the probability that a block of length k contains at least k hashed values is exponentially small as a function of k, causing this sum to be bounded by a constant independent of n.[3] It is also possible to perform the same analysis using Stirling's approximation instead of the Chernoff bound to estimate the probability that a block contains exactly k hashed values.[4][9]

In terms of the load factor α, the expected time for a successful search is O(1 + 1/(1 − α)), and the expected time for an unsuccessful search (or the insertion of a new key) is O(1 + 1/(1 − α)²).[10] For constant load factors, with high probability, the longest probe sequence (among the probe sequences for all keys stored in the table) has logarithmic length.[11]

3.4.4 Choice of hash function

Because linear probing is especially sensitive to unevenly distributed hash values,[7] it is important to combine it with a high-quality hash function that does not produce such irregularities.

The analysis above assumes that each key's hash is a random number independent of the hashes of all the other keys. This assumption is unrealistic for most applications of hashing. However, random or pseudorandom hash values may be used when hashing objects by their identity rather than by their value. For instance, this is done using linear probing by the IdentityHashMap class of the Java collections framework.[12] The hash value that this class associates with each object, its identityHashCode, is guaranteed to remain fixed for the lifetime of an object but is otherwise arbitrary.[13] Because the identityHashCode is constructed only once per object, and is not required to be related to the object's address or value, its construction may involve slower computations such as the call to a random or pseudorandom number generator. For instance, Java 8 uses an Xorshift pseudorandom number generator to construct these values.[14]

For most applications of hashing, it is necessary to compute the hash function for each value every time that it is hashed, rather than once when its object is created. In such applications, random or pseudorandom numbers cannot be used as hash values, because then different objects with the same value would have different hashes. And cryptographic hash functions (which are designed to be computationally indistinguishable from truly random functions) are usually too slow to be used in hash tables.[15] Instead, other methods for constructing hash functions have been devised. These methods compute the hash function quickly, and can be proven to work well with linear probing. In particular, linear probing has been analyzed from the framework of k-independent hashing, a class of hash functions that are initialized from a small random seed and that are equally likely to map any k-tuple of distinct keys to any k-tuple of indexes. The parameter k can be thought of as a measure of hash function quality: the larger k is, the more time it will take to compute the hash function, but it will behave more similarly to completely random functions. For linear probing, 5-independence is enough to guarantee constant expected time per operation,[16] while some 4-independent hash functions perform badly, taking up to logarithmic time per operation.[6]

Another method of constructing hash functions with both high quality and practical speed is tabulation hashing. In this method, the hash value for a key is computed by using each byte of the key as an index into a table of random numbers (with a different table for each byte position). The numbers from those table cells are then combined by a bitwise exclusive or operation. Hash functions constructed this way are only 3-independent. Nevertheless, linear probing using these hash functions takes constant expected time per operation.[4][17] Both tabulation hashing and standard methods for generating 5-independent hash functions are limited to keys that have a fixed number of bits. To handle strings or other types of variable-length keys, it is possible to compose a simpler universal hashing technique that maps the keys to intermediate values and a higher quality (5-independent or tabulation) hash function that maps the intermediate values to hash table indices.[1][18]

In an experimental comparison, Richter et al. found that the Multiply-Shift family of hash functions (defined
as h_z(x) = (x · z mod 2^w) ÷ 2^(w−d)) was "the fastest hash function when integrated with all hashing schemes, i.e., producing the highest throughputs and also of good quality" whereas tabulation hashing produced "the lowest throughput".[2] They point out that each table look-up requires several cycles, being more expensive than simple arithmetic operations. They also found MurmurHash to be superior to tabulation hashing: "By studying the results provided by Mult and Murmur, we think that the trade-off for by tabulation (...) is less attractive in practice".

3.4.5 History

The idea of an associative array that allows data to be accessed by its value rather than by its address dates back to the mid-1940s in the work of Konrad Zuse and Vannevar Bush,[19] but hash tables were not described until 1953, in an IBM memorandum by Hans Peter Luhn. Luhn used a different collision resolution method, chaining, rather than linear probing.[20]

Knuth (1963) summarizes the early history of linear probing. It was the first open addressing method, and was originally synonymous with open addressing. According to Knuth, it was first used by Gene Amdahl, Elaine M. McGraw (née Boehme), and Arthur Samuel in 1954, in an assembler program for the IBM 701 computer.[8] The first published description of linear probing is by Peterson (1957),[8] who also credits Samuel, Amdahl, and Boehme but adds that "the system is so natural, that it very likely may have been conceived independently by others either before or since that time".[21] Another early publication of this method was by Soviet researcher Andrey Ershov, in 1958.[22]

The first theoretical analysis of linear probing, showing that it takes constant expected time per operation with random hash functions, was given by Knuth.[8] Sedgewick calls Knuth's work "a landmark in the analysis of algorithms".[10] Significant later developments include a more detailed analysis of the probability distribution of the running time,[23][24] and the proof that linear probing runs in constant time per operation with practically usable hash functions rather than with the idealized random functions assumed by earlier analysis.[16][17]

3.4.6 References

[1] Thorup, Mikkel; Zhang, Yin (2012), "Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation", SIAM Journal on Computing, 41 (2): 293–331, doi:10.1137/100800774, MR 2914329.

[2] Richter, Stefan; Alvarez, Victor; Dittrich, Jens (2015), "A seven-dimensional analysis of hashing methods and its implications on query processing", Proceedings of the VLDB Endowment, 9 (3): 293–331.

[3] Goodrich, Michael T.; Tamassia, Roberto (2015), "Section 6.3.3: Linear Probing", Algorithm Design and Applications, Wiley, pp. 200–203.

[4] Morin, Pat (February 22, 2014), "Section 5.2: LinearHashTable: Linear Probing", Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 108–116, retrieved 2016-01-15.

[5] Sedgewick, Robert; Wayne, Kevin (2011), Algorithms (4th ed.), Addison-Wesley Professional, p. 471, ISBN 9780321573513. Sedgewick and Wayne also halve the table size when a deletion would cause the load factor to become too low, causing them to use a wider range [1/8, 1/2] in the possible values of the load factor.

[6] Pătraşcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (PDF), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6–10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60.

[7] Heileman, Gregory L.; Luo, Wenbin (2005), "How caching affects hashing" (PDF), Seventh Workshop on Algorithm Engineering and Experiments (ALENEX 2005), pp. 141–154.

[8] Knuth, Donald (1963), Notes on "Open" Addressing.

[9] Eppstein, David (October 13, 2011), "Linear probing made easy", 0xDE.

[10] Sedgewick, Robert (2003), "Section 14.3: Linear Probing", Algorithms in Java, Parts 1–4: Fundamentals, Data Structures, Sorting, Searching (3rd ed.), Addison Wesley, pp. 615–620, ISBN 9780321623973.

[11] Pittel, B. (1987), "Linear probing: the probable largest search time grows logarithmically with the number of records", Journal of Algorithms, 8 (2): 236–249, doi:10.1016/0196-6774(87)90040-X, MR 890874.

[12] "IdentityHashMap", Java SE 7 Documentation, Oracle, retrieved 2016-01-15.

[13] Friesen, Jeff (2012), Beginning Java 7, Expert's Voice in Java, Apress, p. 376, ISBN 9781430239109.

[14] Kabutz, Heinz M. (September 9, 2014), "Identity Crisis", The Java Specialists' Newsletter, 222.

[15] Weiss, Mark Allen (2014), "Chapter 3: Data Structures", in Gonzalez, Teofilo; Diaz-Herrera, Jorge; Tucker, Allen, Computing Handbook, 1 (3rd ed.), CRC Press, p. 3-11, ISBN 9781439898536.

[16] Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852.

[17] Pătraşcu, Mihai; Thorup, Mikkel (2011), "The power of simple tabulation hashing", Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC '11), pp. 1–10, arXiv:1011.5200, doi:10.1145/1993636.1993638.
[18] Thorup, Mikkel (2009), “String hashing for linear probing”, Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA: SIAM, pp. 655–664, doi:10.1137/1.9781611973068.72, MR 2809270.

[19] Parhami, Behrooz (2006), Introduction to Parallel Processing: Algorithms and Architectures, Series in Computer Science, Springer, 4.1 Development of early models, p. 67, ISBN 9780306469640.

[20] Morin, Pat (2004), “Hash tables”, in Mehta, Dinesh P.; Sahni, Sartaj, Handbook of Data Structures and Applications, Chapman & Hall / CRC, p. 9-15, ISBN 9781420035179.

[21] Peterson, W. W. (April 1957), “Addressing for random-access storage”, IBM Journal of Research and Development, Riverton, NJ, USA: IBM Corp., 1 (2): 130–146, doi:10.1147/rd.12.0130.

[22] Ershov, A. P. (1958), “On Programming of Arithmetic Operations”, Communications of the ACM, 1 (8): 3–6, doi:10.1145/368892.368907. Translated from Doklady AN USSR 118 (3): 427–430, 1958, by Morris D. Friedman. Linear probing is described as algorithm A2.

[23] Flajolet, P.; Poblete, P.; Viola, A. (1998), “On the analysis of linear probing hashing”, Algorithmica, 22 (4): 490–515, doi:10.1007/PL00009236, MR 1701625.

[24] Knuth, D. E. (1998), “Linear probing and graphs”, Algorithmica, 22 (4): 561–568, doi:10.1007/PL00009240, MR 1701629.

3.5 Quadratic probing

Quadratic probing is an open addressing scheme in computer programming for resolving collisions in hash tables—that is, for the case when an incoming item’s hash value indicates it should be stored in an already-occupied slot or bucket. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.

For a given hash value H, the indices generated by linear probing are as follows:

H + 1, H + 2, H + 3, H + 4, ..., H + k

This method results in primary clustering, and as a cluster grows larger, the search for those items hashing within the cluster becomes less efficient.

An example sequence using quadratic probing is:

H + 1², H + 2², H + 3², H + 4², ..., H + k²

Quadratic probing can be a more efficient algorithm in a closed hash table, since it better avoids the clustering problem that can occur with linear probing, although it is not immune. It also provides good memory caching because it preserves some locality of reference; however, linear probing has greater locality and, thus, better cache performance.

Quadratic probing is used in the Berkeley Fast File System to allocate free blocks. The allocation routine chooses a new cylinder group when the current one is nearly full, using quadratic probing because of the speed it shows in finding unused cylinder groups.

3.5.1 Quadratic function

Let h(k) be a hash function that maps an element k to an integer in [0, m−1], where m is the size of the table. Let the ith probe position for a value k be given by the function

h(k, i) = (h(k) + c1·i + c2·i²) mod m

where c2 ≠ 0. If c2 = 0, then h(k, i) degrades to a linear probe. For a given hash table, the values of c1 and c2 remain constant.

Examples:

• If h(k, i) = (h(k) + i + i²) mod m, then the probe sequence will be h(k), h(k) + 2, h(k) + 6, ...

• For m = 2ⁿ, a good choice for the constants is c1 = c2 = 1/2, as the values of h(k, i) for i in [0, m−1] are all distinct. This leads to a probe sequence of h(k), h(k) + 1, h(k) + 3, h(k) + 6, ..., where the values increase by 1, 2, 3, ...

• For prime m > 2, most choices of c1 and c2 will make h(k, i) distinct for i in [0, (m−1)/2]. Such choices include c1 = c2 = 1/2; c1 = c2 = 1; and c1 = 0, c2 = 1. Because there are only about m/2 distinct probes for a given element, it is difficult to guarantee that insertions will succeed when the load factor is above 1/2.

3.5.2 Quadratic probing insertion

The problem here is to insert a key at an available key space in a given hash table using quadratic probing.[1]

Algorithm to insert key in hash table

1. Get the key k
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If hashtable[h[k]] is empty
   (4.1) Insert key k at hashtable[h[k]]
   (4.2) Stop
   Else
   (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
   (4.4) Increment j
   (4.5) Compute new hash function h[k] = (k + j * j) % SIZE
   (4.6) Repeat Step 4 till j is equal to the SIZE of the hash table
5. The hash table is full
6. Stop

C function for key insertion

int quadratic_probing_insert(int *hashtable, int key, int *empty)
{
    /* hashtable[] is an integer hash table; empty[] is another array
       which indicates whether the key space is occupied.  If an empty
       key space is found, the function returns the index of the bucket
       where the key is inserted; otherwise it returns -1. */
    int i, index;
    for (i = 0; i < SIZE; i++) {
        index = (key + i * i) % SIZE;
        if (empty[index]) {
            hashtable[index] = key;
            empty[index] = 0;
            return index;
        }
    }
    return -1;
}

3.5.3 Quadratic probing search

Algorithm to search element in hash table

1. Get the key k to be searched
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If the key space at hashtable[h[k]] is occupied
   (4.1) Compare the element at hashtable[h[k]] with the key k
   (4.2) If they are equal
   (4.2.1) The key is found at the bucket h[k]
   (4.2.2) Stop
   Else
   (4.3) The element might be placed at the next location given by the quadratic function
   (4.4) Increment j
   (4.5) Set h[k] = (k + (j * j)) % SIZE, so that we can probe the bucket at a new slot, h[k]
   (4.6) Repeat Step 4 till j is greater than the SIZE of the hash table
5. The key was not found in the hash table
6. Stop

C function for key searching

int quadratic_probing_search(int *hashtable, int key, int *empty)
{
    /* If the key is found in the hash table, the function returns the
       index of the bucket where the key is stored; otherwise it
       returns -1. */
    int i, index;
    for (i = 0; i < SIZE; i++) {
        index = (key + i * i) % SIZE;
        if (!empty[index] && hashtable[index] == key)
            return index;
    }
    return -1;
}

For a table of prime size b, the first b/2 probe locations are distinct. Suppose two probe numbers x and y, with 0 ≤ x, y ≤ b/2 and x ≠ y, gave the same location; then

h(k) + x² ≡ h(k) + y² (mod b)
x² ≡ y² (mod b)
x² − y² ≡ 0 (mod b)
(x − y)(x + y) ≡ 0 (mod b)

As b (the table size) is a prime greater than 3, either (x − y) or (x + y) has to be congruent to zero modulo b. Since x and y are distinct, (x − y) cannot be zero modulo b; and since 0 ≤ x, y ≤ b/2, (x + y) cannot be zero modulo b either.

Thus, by contradiction, the first b/2 alternative locations after h(k) are unique. So an empty key space can always be found as long as at most b/2 locations are filled, i.e., the hash table is not more than half full.

Alternating sign

If the sign of the offset is alternated (e.g. +1, −4, +9, −16, etc.), and if the number of buckets is a prime number p congruent to 3 modulo 4 (i.e. one of 3, 7, 11, 19, 23, 31, and so on), then the first p offsets will be unique modulo p.

In other words, a permutation of 0 through p−1 is obtained, and, consequently, a free bucket will always be found as long as at least one exists.

The insertion algorithm receives only a minor modification (but note that SIZE has to be a suitable prime number, as explained above):

1. Get the key k
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If hashtable[h[k]] is empty
   (4.1) Insert key k at hashtable[h[k]]
   (4.2) Stop
   Else
   (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
   (4.4) Increment j
   (4.5) Compute the new hash function: if j is odd, h[k] = (k + j * j) % SIZE, else h[k] = (k − j * j) % SIZE
   (4.6) Repeat Step 4 till j is equal to the SIZE of the hash table
5. The hash table is full
6. Stop

The search algorithm is modified likewise.
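The power-of-two case described above (m = 2ⁿ with c1 = c2 = 1/2) can be checked directly: the probe offsets are the triangular numbers (i + i²)/2, which visit every slot of the table exactly once. A minimal sketch, with illustrative function names not taken from any particular library:

```c
#include <assert.h>
#include <stdbool.h>

/* Probe position for c1 = c2 = 1/2: the offset (i + i*i)/2 is the i-th
   triangular number, which is always an integer. */
static unsigned probe(unsigned h, unsigned i, unsigned m) {
    return (h + (i + i * i) / 2) % m;
}

/* True if the first m probe positions for hash value h are pairwise
   distinct, i.e. the sequence visits every slot of a size-m table.
   This sketch supports m up to 1024. */
static bool probes_distinct(unsigned h, unsigned m) {
    bool seen[1024] = { false };
    for (unsigned i = 0; i < m; i++) {
        unsigned p = probe(h, i, m);
        if (seen[p])
            return false;
        seen[p] = true;
    }
    return true;
}
```

For a table size such as m = 12, which is neither a power of two nor an odd prime, the offsets repeat (both i = 2 and i = 5 give offset 3 modulo 12), matching the constraints on m stated above.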
3.5.7 External links

• Tutorial/quadratic probing

3.6 Double hashing

Double hashing is a computer programming technique used in hash tables to resolve hash collisions, in cases when two different values to be searched for produce the same hash key. It is a popular collision-resolution technique in open-addressed hash tables. Double hashing is implemented in many popular libraries.

Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is decided by a second, independent hash function (hence the name double hashing). Unlike linear probing and quadratic probing, the interval depends on the data, so that even values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.

Given two randomly, uniformly, and independently selected hash functions h1 and h2, the ith location in the bucket sequence for value k in a hash table T is

h(i, k) = (h1(k) + i · h2(k)) mod |T|

Generally, h1 and h2 are selected from a set of universal hash functions.

Double hashing with open addressing is a classical data structure on a table T. Let n be the number of elements stored in T; then T's load factor is α = n/|T|.

Double hashing approximates uniform open address hashing. That is, start by randomly, uniformly and independently selecting two universal hash functions h1 and h2 to build a double hashing table T. All elements are put in T by double hashing using h1 and h2. Given a key k, the (i + 1)-st hash location is computed by

h(i, k) = (h1(k) + i · h2(k)) mod |T|

Let T have a fixed load factor α, with 0 < α < 1. Bradford and Katehakis[1] showed that the expected number of probes for an unsuccessful search in T, still using these initially chosen hash functions, is 1/(1 − α), regardless of the distribution of the inputs. More precisely, these two uniformly, randomly and independently chosen hash functions are chosen from a set of universal hash functions where pairwise independence suffices.

Previous results include: Guibas and Szemerédi[2] showed that 1/(1 − α) holds for unsuccessful search for load factors α < 0.319. Also, Lueker and Molodowitch[3] showed that this held assuming ideal randomized functions. Schmidt and Siegel[4] showed this with k-wise independent and uniform functions (for k = c log n, and a suitable constant c).

Linear probing and, to a lesser extent, quadratic probing are able to take advantage of the data cache by accessing locations that are close together. Double hashing has, on average, larger intervals and is not able to achieve this advantage.

Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The only solution to this is to rehash to a larger size, as with all other open addressing schemes.

On top of that, it is possible for the secondary hash function to evaluate to zero. For example, if we choose k = 5 with the following function:

h2(k) = 5 − (k mod 7)

the resulting sequence will always remain at the initial hash value. One possible solution is to change the secondary hash function to

h2(k) = (k mod 7) + 1

This ensures that the secondary hash function will always be nonzero.

• Collision resolution in hash tables
• Hash function
• Linear probing
• Cuckoo hashing

3.6.4 Notes

[1] Bradford, Phillip G.; Katehakis, Michael N. (2007), “A probabilistic study on combinatorial expanders and hashing” (PDF), SIAM Journal on Computing, 37 (1): 83–111, doi:10.1137/S009753970444630X, MR 2306284.

[2] L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226-274.

[3] G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83-96.

[4] J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.
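The bucket sequence h(i, k) = (h1(k) + i · h2(k)) mod |T| defined in this section can be sketched as follows. The two hash functions here are toy choices for illustration only, with the “+ 1” guard discussed above keeping the step size nonzero:

```c
#include <assert.h>

#define TABLE_SIZE 11   /* |T|; prime, so any nonzero step visits every slot */

/* Primary hash: starting point of the bucket sequence. */
static unsigned h1(unsigned k) { return k % TABLE_SIZE; }

/* Secondary hash: step size between probes.  The "+ 1" keeps the
   interval nonzero, avoiding the degenerate sequence discussed above. */
static unsigned h2(unsigned k) { return (k % 7) + 1; }

/* i-th location in the bucket sequence for key k:
   h(i, k) = (h1(k) + i * h2(k)) mod |T| */
static unsigned probe(unsigned i, unsigned k) {
    return (h1(k) + i * h2(k)) % TABLE_SIZE;
}
```

Note how two keys that collide at the same starting bucket (here 25 and 14, both starting at bucket 3) follow different sequences because their step sizes differ, which is exactly what distinguishes double hashing from linear and quadratic probing.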
74 CHAPTER 3. DICTIONARIES
3.7 Cuckoo hashing

3.7.1 History

Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001.[1]

3.7.2 Operation

Cuckoo hashing is a form of open addressing in which each non-empty cell of a hash table contains a key or key–value pair. A hash function is used to determine the location for each key, and its presence in the table (or the value associated with it) can be found by examining that cell of the table. However, open addressing suffers from collisions, which happen when more than one key is mapped to the same cell. The basic idea of cuckoo hashing is to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables. It is also possible for both hash functions to provide indexes into a single table.

[Figure: Cuckoo hashing example. The arrows show the alternative location of each key. A new item would be inserted in the location of A by moving A to its alternative location, currently occupied by B, and moving B to its alternative location which is currently vacant. Insertion of a new item in the location of H would not succeed: since H is part of a cycle (together with W), the new item would get kicked out again.]

Lookup requires inspection of just two locations in the hash table, which takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms, which may not have a constant worst-case bound on the time to do a lookup. Deletions, also, may be performed by blanking the cell containing a key, in constant worst-case time, more simply than some other schemes such as linear probing.

When a new key is inserted, and one of its two cells is empty, it may be placed in that cell. However, when both cells are already full, it will be necessary to move other keys to their second locations (or back to their first locations) to make room for the new key. A greedy algorithm is used: the new key is inserted in one of its two possible locations, “kicking out”, that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found.

3.7.3 Theory

Insertions succeed in expected constant time, even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%.

One method of proving this uses the theory of random graphs: one may form an undirected graph called the “cuckoo graph” that has a vertex for each hash table location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a pseudoforest, a graph with at most one cycle in each of its connected components. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a random graph in the Erdős–Rényi model. With high probability, for a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2, the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. Moreover, the same theory also proves that the expected size of a connected component of the cuckoo graph is small, ensuring that each insertion takes constant expected time.[2]

3.7.4 Example

The following hash functions are given:

h(k) = k mod 11
h′(k) = ⌊k/11⌋ mod 11

Columns in the following two tables show the state of the hash tables over time as the elements are inserted.

Cycle

If you now wish to insert the element 6, then you get into a cycle. In the last row of the table we find the same initial situation as at the beginning again.

h(6) = 6 mod 11 = 6
h′(6) = ⌊6/11⌋ mod 11 = 0

3.7.5 Variations

A variation of cuckoo hashing uses buckets that hold more than one key per bucket.[4] Using just 2 keys per bucket permits a load factor above 80%.

Another variation of cuckoo hashing that has been studied is cuckoo hashing with a stash. The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the main hash table of the structure. This modification reduces the failure rate of cuckoo hashing to an inverse-polynomial function with an exponent that can be made arbitrarily large by increasing the stash size. However, larger stashes also mean slower searches for keys that are not present or are in the stash. A stash can be used in combination with more than two hash functions or with blocked cuckoo hashing to achieve both high load factors and small failure rates.[5] The analysis of cuckoo hashing with a stash extends to practical hash functions, not just to the random hash function model commonly used in theoretical analysis of hashing.[6]

Some CPU caches use a simplified generalization of cuckoo hashing called a skewed-associative cache.[7]

Another variation of a cuckoo hash table, called a cuckoo filter, replaces the stored keys of a cuckoo hash table with much shorter fingerprints, computed by applying another hash function to the keys. In order to allow these fingerprints to be moved around within the cuckoo filter, without knowing the keys that they came from, the two locations of each fingerprint may be computed from each other by a bitwise exclusive or operation with the fingerprint, or with a hash of the fingerprint. This data structure forms an approximate set membership data structure with much the same properties as a Bloom filter: it can store the members of a set of keys, and test whether a query key is a member, with some chance of false positives (queries that are incorrectly reported as being part of the set) but no false negatives. However, it improves on a Bloom filter in multiple respects: its memory usage is smaller by a constant factor, it has better locality of reference, and (unlike Bloom filters) it allows for fast deletion of set elements with no additional storage penalty.[8]
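The two-table variant, with the example hash functions h(k) = k mod 11 and h′(k) = ⌊k/11⌋ mod 11 from this section, can be sketched as follows. MAX_KICKS is an illustrative cutoff standing in for the table rebuild described above, and keys are assumed non-negative:

```c
#include <assert.h>
#include <stdbool.h>

#define M 11           /* size of each of the two tables */
#define EMPTY (-1)     /* keys are assumed non-negative */
#define MAX_KICKS 50   /* give up after this many displacements (a real
                          table would then rehash with new functions) */

static int t1[M], t2[M];

static int h1(int k) { return k % M; }          /* h(k)  = k mod 11 */
static int h2(int k) { return (k / M) % M; }    /* h'(k) = (k/11) mod 11 */

static void init(void) {
    for (int i = 0; i < M; i++)
        t1[i] = t2[i] = EMPTY;
}

/* Lookup inspects exactly two cells: constant worst-case time. */
static bool lookup(int k) {
    return t1[h1(k)] == k || t2[h2(k)] == k;
}

/* Greedy insertion: place the key in its first table, displacing any
   occupant to its alternative table, and so on; a long chain of kicks
   signals a cycle, and insertion fails. */
static bool insert(int k) {
    if (lookup(k))
        return true;
    for (int n = 0; n < MAX_KICKS; n++) {
        int tmp = t1[h1(k)];
        t1[h1(k)] = k;
        if (tmp == EMPTY)
            return true;
        k = tmp;                 /* displaced key moves to table 2 */
        tmp = t2[h2(k)];
        t2[h2(k)] = k;
        if (tmp == EMPTY)
            return true;
        k = tmp;                 /* displaced again: back to table 1 */
    }
    return false;
}
```

For example, inserting 53 after 20 (both hash to bucket 9 of the first table) kicks 20 into its alternative location in the second table, after which both keys are still found in two probes.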
A study by Zukowski et al.[9] has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross[10] has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods, also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis,[11] with its performance compared against alternative hashing schemes.

A survey by Mitzenmacher[3] presents open problems related to cuckoo hashing as of 2009.

3.7.7 See also

• Perfect hashing
• Linear probing
• Double hashing
• Hash collision
• Hash function
• Quadratic probing
• Hopscotch hashing

3.7.8 References

[1] Pagh, Rasmus; Rodler, Flemming Friche (2001). “Cuckoo Hashing”. Algorithms — ESA 2001. Lecture Notes in Computer Science. 2161. pp. 121–133. doi:10.1007/3-540-44676-1_10. ISBN 978-3-540-42493-2.

[2] Kutzelnigg, Reinhard (2006). Bipartite random graphs and cuckoo hashing (PDF). Fourth Colloquium on Mathematics and Computer Science. Discrete Mathematics and Theoretical Computer Science. AG. pp. 403–406.

[3] Mitzenmacher, Michael (2009-09-09). “Some Open Questions Related to Cuckoo Hashing | Proceedings of ESA 2009” (PDF). Retrieved 2010-11-10.

[6] Aumüller, Martin; Dietzfelbinger, Martin; Woelfel, Philipp (2014), “Explicit and efficient hash families suffice for cuckoo hashing with a stash”, Algorithmica, 70 (3): 428–456, doi:10.1007/s00453-013-9840-x, MR 3247374.

[8] Fan, Bin; Andersen, Dave G.; Kaminsky, Michael; Mitzenmacher, Michael D. (2014), “Cuckoo filter: Practically better than Bloom”, Proc. 10th ACM Int. Conf. Emerging Networking Experiments and Technologies (CoNEXT '14), pp. 75–88, doi:10.1145/2674005.2674994.

[9] Zukowski, Marcin; Heman, Sandor; Boncz, Peter (June 2006). “Architecture-Conscious Hashing” (PDF). Proceedings of the International Workshop on Data Management on New Hardware (DaMoN). Retrieved 2008-10-16.

[10] Ross, Kenneth (2006-11-08). “Efficient Hash Probes on Modern Processors” (PDF). IBM Research Report RC24100. RC24100. Retrieved 2008-10-16.

[11] Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (PDF). Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009). 91. pp. 113–122. ISBN 978-1-920682-72-9.

3.7.9 External links

• A cool and practical alternative to traditional hash tables, U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
• Cuckoo Hashing for Undergraduates, 2006, R. Pagh, 2006.
• Cuckoo Hashing, Theory and Practice (Part 1, Part 2 and Part 3), Michael Mitzenmacher, 2007.
• Naor, Moni; Segev, Gil; Wieder, Udi (2008). “History-Independent Cuckoo Hashing”. International Colloquium on Automata, Languages and Programming (ICALP). Reykjavik, Iceland. Retrieved 2008-07-21.
• Algorithmic Improvements for Fast Concurrent Cuckoo Hashing, X. Li, D. Andersen, M. Kaminsky, M. Freedman. EuroSys 2014.
• Generic Cuckoo hashmap in Java
• Cuckoo hash table written in Haskell
• Cuckoo hashing for Go
3.8 Hopscotch hashing

Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by Maurice Herlihy, Nir Shavit and Moran Tzafrir in 2008.[1] The name is derived from the sequence of hops that characterize the table’s insertion algorithm.

The algorithm uses a single array of n buckets. For each bucket, its neighborhood is a small collection of nearby consecutive buckets (i.e. ones with close indices to the original hashed bucket). The desired property of the neighborhood is that the cost of finding an item in the buckets of the neighborhood is close to the cost of finding it in the bucket itself (for example, by having buckets in the neighborhood fall within the same cache line). The size of the neighborhood must be sufficient to accommodate a logarithmic number of items in the worst case (i.e. it must accommodate log(n) items), but only a constant number on average. If some bucket’s neighborhood is filled, the table is resized.

In hopscotch hashing, as in cuckoo hashing, and unlike in linear probing, a given item will always be inserted into and found in the neighborhood of its hashed bucket. In other words, it will always be found either in its original hashed array entry, or in one of the next H−1 neighboring entries. H could, for example, be 32, a common machine word size.

The idea is that hopscotch hashing “moves the empty slot towards the desired bucket”. This distinguishes it from linear probing, which leaves the empty slot where it was found, possibly far away from the original bucket, and from cuckoo hashing, which, in order to create a free bucket, moves an item out of one of the desired buckets in the target arrays, and only then tries to find the displaced item a new place.

To remove an item from the table, one simply removes it from the table entry. If the neighborhood buckets are cache aligned, then one could apply a reorganization operation in which items are moved into the now vacant location in order to improve alignment.

One advantage of hopscotch hashing is that it provides good performance at very high table load factors, even ones exceeding 0.9. Part of this efficiency is due to using a linear probe only to find an empty slot during insertion, not for every lookup as in the original linear probing hash table algorithm. Another advantage is that one can use any hash function, in particular simple ones that are close-to-universal.

3.8.1 See also

• Cuckoo hashing
• Hash collision
• Hash function
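The neighborhood invariant described above — a key always lives within the H consecutive buckets starting at its hashed bucket, so a lookup scans at most H entries — can be sketched as follows. The `place` helper is a toy stand-in for the real insertion algorithm, which finds an empty slot by linear probing and then hops it back into the neighborhood; all names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define N 64   /* number of buckets (illustrative) */
#define H 4    /* neighborhood size; the text suggests e.g. 32 in practice */

static int table[N];   /* 0 denotes an empty bucket; keys are nonzero */

static unsigned home(int k) { return (unsigned)k % N; }

/* Toy stand-in for insertion: put key k at a chosen offset inside its
   neighborhood.  (The real algorithm linearly probes for an empty slot
   and moves it toward the home bucket.) */
static bool place(int k, unsigned offset) {
    if (offset >= H || table[(home(k) + offset) % N] != 0)
        return false;
    table[(home(k) + offset) % N] = k;
    return true;
}

/* The hopscotch invariant: a key is stored within the H consecutive
   buckets starting at its home bucket, so lookup inspects at most H
   entries regardless of the table's load factor. */
static bool lookup(int k) {
    for (unsigned i = 0; i < H; i++)
        if (table[(home(k) + i) % N] == k)
            return true;
    return false;
}
```

With H = 4, keys 69 and 133 both hash to bucket 5; storing the second at offset 2 keeps both inside the neighborhood, so each is found in at most four probes.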
Main article: Hash table

When storing records in a large unsorted file, one may use a hash function to map each record to an index into a table T, and to collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).

The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.

The Rabin–Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Protecting data

Main article: Security of cryptographic hash functions

A hash value can be used to uniquely identify secret information. This requires that the hash function is collision-resistant, which means that it is very hard to find data that will generate the same hash value. These functions are categorized into cryptographic hash functions and provably secure hash functions. Functions in the second category are the most secure but also too slow for most practical purposes. Collision resistance is accomplished in part by generating very large hash values. For example, SHA-1, one of the most widely used cryptographic hash functions, generates 160-bit values.

Geometric hashing

This principle is widely used in computer graphics, computational geometry and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization) to encode and compress multi-dimensional signals.

Standard uses of hashing in cryptography

… and their probability distribution in the intended application.

A hash function that will relocate the minimum number of records when the table is resized – where z is the key being hashed and n is the number of allowed hash values – is one such that H(z, n + 1) = H(z, n) with probability close to n/(n + 1).
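One well-known construction with this minimal-movement property — not discussed in the text above, and named here only as an example — is the “jump consistent hash” of Lamping and Veach. A sketch, for which H(z, n + 1) = H(z, n) holds with probability n/(n + 1):

```c
#include <assert.h>
#include <stdint.h>

/* Jump consistent hash (Lamping & Veach): maps key z to one of
   num_buckets buckets such that growing the table from n to n + 1
   buckets relocates a key with probability only 1/(n + 1), and a
   relocated key always moves to the newly added bucket. */
static int32_t jump_hash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;   /* 64-bit LCG step */
        j = (int64_t)((double)(b + 1) *
                      ((double)(1LL << 31) / (double)((key >> 33) + 1)));
    }
    return (int32_t)b;
}
```

Because the sequence of candidate buckets is a fixed function of the key, H(z, n + 1) is either H(z, n) (the key stays put) or n (the key moves to the new bucket), which is exactly the minimal-movement behavior described above.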
Linear hashing and spiral storage are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property. Extendible hashing uses a dynamic hash function that requires space proportional to n to compute the hash function, and it becomes a function of the previous keys that have been inserted. Several algorithms that preserve the uniformity property but require time proportional to n to compute the value of H(z,n) have been invented.

Data normalization

In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.

Continuity

“A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.”[6]

Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables used in Nearest neighbor search.

Non-invertible

Trivial hash function

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this “trivial” (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.

The meaning of “small enough” depends on the size of the type that is used as the hashed value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer Integer and 32-bit floating-point Float objects can simply use the value directly; whereas the 64-bit integer Long and 64-bit floating-point Double cannot use this method.

Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character (“A” for “a”, “8” for “8”, etc.).[7] If each character is stored in 8 bits (as in extended ASCII or ISO Latin 1), the table has only 2⁸ = 256 entries; in the case of Unicode characters, the table would have 17×2¹⁶ = 1114112 entries.

The same technique can be used to map two-letter country codes like “us” or “za” to country names (26² = 676 table entries), 5-digit zip codes like 13083 to city names (100000 entries), etc. Invalid data values (such as the country code “xx” or the zip code 00000) may be left undefined in the table or mapped to some appropriate “null” value.

Perfect hashing

Main article: Perfect hash function

A hash function that is injective—that is, maps each valid input to a different hash value—is said to be perfect. With such a function one can directly locate the desired entry in a hash table, without any additional searching.
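The table-lookup case mapping described under “Trivial hash function” above can be sketched as follows; the table and function names are illustrative:

```c
#include <assert.h>

/* Case mapping via a "trivial" (identity) hash: each 8-bit character,
   reinterpreted as an integer, indexes a 256-entry table holding its
   alternative form ("A" for "a", "8" for "8", and so on). */
static unsigned char upper_of[256];

static void build_upper_table(void) {
    for (int c = 0; c < 256; c++)
        upper_of[c] = (c >= 'a' && c <= 'z')
                          ? (unsigned char)(c - 'a' + 'A') /* lower-case letter: map up */
                          : (unsigned char)c;              /* anything else: map to itself */
}
```

Once built, each mapping is a single array access: the character itself is the hash value, so the “hash function” costs effectively nothing, and since every input indexes a distinct entry, the scheme is perfect in the sense defined below.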
Rolling hash

Main article: Rolling hash

In some applications, such as substring search, one must compute a hash function h for every k-character substring of a given n-character string t, where k is a fixed integer and n > k. The straightforward solution, which is to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k·n. However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an effort proportional to k + n.

Universal hashing

A universal hashing scheme is a randomized algorithm that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function. See also unique permutation hashing.[9]

Hashing with checksum functions

One can adapt certain checksum or fingerprinting algorithms for use as hash functions. Some of those algorithms will map arbitrarily long string data z, with any typical real-world distribution—no matter how non-uniform and dependent—to a 32-bit or 64-bit string, from which one can extract a hash value in 0 through n − 1.
This method may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. However, some checksums fare poorly in the avalanche test, which may be a concern in some applications. In particular, the popular CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing. Moreover, each bit of the input has a deterministic effect on each bit of the CRC32; that is, one can tell, without looking at the rest of the input, which bits of the output will flip if the input bit is flipped. So care must be taken to use all 32 bits when computing the hash from the checksum.[10]

Multiplicative hashing

Multiplicative hashing is a simple type of hash function often used by teachers introducing students to hash tables.[11] Multiplicative hash functions are simple and fast, but have higher collision rates in hash tables than more sophisticated hash functions.[12]
In many applications, such as hash tables, collisions make the system a little slower but are otherwise harmless. In such systems, it is often better to use hash functions based on multiplication—such as MurmurHash and the SBoxHash—or even simpler hash functions such as CRC32—and tolerate more collisions, rather than use a more complex hash function that avoids many of those collisions but takes longer to compute.[12] Multiplicative hashing is susceptible to a "common mistake" that leads to poor diffusion—higher-value input bits do not affect lower-value output bits.[13]

Hashing with cryptographic hash functions

Some cryptographic hash functions, such as SHA-1, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions.
In ordinary applications, this advantage may be too small to offset their much higher cost.[14] However, this method can provide uniformly distributed hashes even when the keys are chosen by a malicious agent. This feature may help to protect services against denial-of-service attacks.

Hashing by nonlinear table lookup

Tables of random numbers (such as 256 random 32-bit integers) can provide high-quality nonlinear functions to be used as hash functions or for other purposes such as cryptography. The key to be hashed is split into 8-bit (one-byte) parts, and each part is used as an index for the nonlinear table. The table values are then added by arithmetic or XOR addition to the hash output value. Because the table is just 1024 bytes in size, it fits into the cache of modern microprocessors and allows very fast execution of the hashing algorithm. As the table value is on average much longer than 8 bits, one bit of input affects nearly all output bits.
This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys).

Efficient hashing of strings

See also: Universal hashing § Hashing strings

Modern microprocessors will allow for much faster processing if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32-bit or 64-bit integers and hashing/accumulating these "wide word" integer values by means of arithmetic operations (e.g. multiplication by constant and bit-shifting). The remaining characters of the string, which are smaller than the word length of the CPU, must be handled differently (e.g. being processed one character at a time).
This approach has proven to speed up hash code generation by a factor of five or more on modern microprocessors with a word size of 64 bits.
Another approach[15] is to convert strings to a 32- or 64-bit numeric value and then apply a hash function. One method that avoids the problem of strings having great similarity ("Aaaaaaaaaa" and "Aaaaaaaaab") is to use a cyclic redundancy check (CRC) of the string to compute a 32- or 64-bit value. While it is possible that two different strings will have the same CRC, the likelihood is very small and only requires that one check the actual string found to determine whether one has an exact match. CRCs will be different for strings such as "Aaaaaaaaaa" and "Aaaaaaaaab". Although CRC codes can be used as hash values,[16] they are not cryptographically secure since they are not collision-resistant.[17]

Locality-sensitive hashing

Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This is different from conventional hash functions, such as those used in cryptography, as in this case the goal is to maximize the probability of a "collision" of similar items rather than to avoid collisions.[18]
One example of LSH is the MinHash algorithm used for finding similar documents (such as web pages): Let h be a hash function that maps the members of A and B to distinct integers, and for any set S define h_min(S) to be the member x of S with the minimum value of h(x). Then h_min(A) = h_min(B) exactly when the minimum hash value of the union A ∪ B lies in the intersection A ∩ B. Therefore, the probability that h_min(A) = h_min(B) is the Jaccard similarity |A ∩ B| / |A ∪ B|.

3.9.5 Origins of the term

The term "hash" offers a natural analogy with its non-technical meaning (to "chop" or "make a mess" out of something), given how hash functions scramble their input data to derive their output.[19] In his research for the precise origin of the term, Donald Knuth notes that, while Hans Peter Luhn of IBM appears to have been the first to use the concept of a hash function in a memo dated January 1953, the term itself would only appear in published literature in the late 1960s, in Herbert Hellerman's Digital Computer System Principles, even though it was already widespread jargon by then.[20]

3.9.6 List of hash functions

Main article: List of hash functions

• Coalesced hashing
• Cuckoo hashing
• Hopscotch hashing
• MD5
• Bernstein hash[21]
• Fowler-Noll-Vo hash function (32, 64, 128, 256, 512, or 1024 bits)
• Jenkins hash function (32 bits)
• Pearson hashing (64 bits)
• Zobrist hashing

3.9.7 See also

• Comparison of cryptographic hash functions
• Distributed hash table
• Identicon
• Low-discrepancy sequence
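The rolling-hash technique described in §3.9 above can be sketched as a polynomial hash (the scheme behind Rabin-Karp search); the base and modulus below are illustrative choices, not prescribed by the text.

```python
# Polynomial rolling hash: hashes of all k-character windows of an
# n-character string in O(n + k) total work, each window updated from
# the previous one in O(1).  Base B and prime modulus P are illustrative.
B, P = 257, (1 << 61) - 1

def hash_window(s: str) -> int:
    """Direct O(k) hash of a single window."""
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % P
    return h

def all_window_hashes(t: str, k: int):
    """Hash every k-character substring of t via rolling updates."""
    top = pow(B, k - 1, P)                # weight of the outgoing character
    h = hash_window(t[:k])
    out = [h]
    for i in range(1, len(t) - k + 1):
        # drop t[i-1], shift the window, append t[i+k-1]
        h = ((h - ord(t[i - 1]) * top) * B + ord(t[i + k - 1])) % P
        out.append(h)
    return out
```

Each update does O(1) work, so hashing all n − k + 1 windows costs O(n + k) rather than the naive O(k·n).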
[2] Menezes, Alfred J.; van Oorschot, Paul C.; Vanstone, Scott A. (1996). Handbook of Applied Cryptography. CRC Press. ISBN 0849385237.

[3] "Robust Audio Hashing for Content Identification" by Jaap Haitsma, Ton Kalker and Job Oostveen.

[4] "3. Data model — Python 3.6.1 documentation". docs.python.org. Retrieved 2017-03-24.

[5] Sedgewick, Robert (2002). "14. Hashing". Algorithms in Java (3rd ed.). Addison Wesley. ISBN 978-0201361209.

[6] "Fundamental Data Structures – Josiang p.132". Retrieved May 19, 2014.

[7] Plain ASCII is a 7-bit character encoding, although it is often stored in 8-bit bytes with the highest-order bit always clear (zero). Therefore, for plain ASCII, the bytes have only 2⁷ = 128 valid values, and the character translation table has only this many entries.

[8] Broder, A. Z. (1993). "Some applications of Rabin's fingerprinting method". Sequences II: Methods in Communications, Security, and Computer Science. Springer-Verlag. pp. 143–152.

[9] Shlomi Dolev, Limor Lahiani, Yinnon Haviv, "Unique permutation hashing", Theoretical Computer Science, Volume 475, 4 March 2013, pp. 59–65.

[10] Bret Mulvey, Evaluation of CRC32 for Hash Tables, in Hash Functions. Accessed April 10, 2009.

[11] Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Section 6.4: Hashing.

[12] Peter Kankowski. "Hash functions: An empirical comparison".

[13] "CS 3110 Lecture 21: Hash functions". Section "Multiplicative hashing".

[14] Bret Mulvey, Evaluation of SHA-1 for Hash Tables, in Hash Functions. Accessed April 10, 2009.

[15] http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.7520 Performance in Practice of String Hashing Functions.

[16] Peter Kankowski. "Hash functions: An empirical comparison".

[21] "Hash Functions". cse.yorku.ca. September 22, 2003. Retrieved November 1, 2012. "the djb2 algorithm (k=33) was first reported by dan bernstein many years ago in comp.lang.c."

3.9.9 External links

• Calculate hash of a given value by Timo Denk
• Hash Functions and Block Ciphers by Bob Jenkins
• The Goulburn Hashing Function (PDF) by Mayur Patel
• Hash Function Construction for Textual and Geometrical Data Retrieval. Latest Trends on Computers, Vol. 2, pp. 483–489, CSCC conference, Corfu, 2010.

3.10 Perfect hash function

In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. In mathematical terms, it is a total injective function.
Perfect hash functions may be used to implement a lookup table with constant worst-case access time. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.

3.10.1 Application

A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function. One can then test whether a key is present in S, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes constant time in the worst case.[1]
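A minimal sketch of the idea: brute-force search for a multiplier that makes a simple modular hash injective on a fixed key set. Real perfect-hash generators such as gperf use far more sophisticated constructions; the prime, the keys, and the table size below are illustrative.

```python
# Brute-force construction of a perfect hash for a small fixed key set S:
# find a multiplier a such that h(x) = ((a * x) % p) % m is injective on S.
# Illustrative only; practical generators are much smarter than this search.
def make_perfect_hash(S, m, p=100_003):
    for a in range(1, p):
        def h(x, a=a):
            return ((a * x) % p) % m
        if len({h(x) for x in S}) == len(S):   # injective on S: perfect
            return h
    raise ValueError("no suitable multiplier found; try a larger m")

S = [10, 42, 57, 1234, 88005]
h = make_perfect_hash(S, m=8)

table = [None] * 8
for key in S:
    table[h(key)] = key    # each key gets its own cell; no collision handling
```

Once built, a lookup is a single hash evaluation plus one table access, giving the constant worst-case time described above.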
"lower bounds", SIAM Journal on Computing, 23 (4): 738–761, doi:10.1137/S0097539791194094, MR 1283572.

[4] Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin (2009), "Hash, displace, and compress" (PDF), Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7–9, 2009, Proceedings, Lecture Notes in Computer Science, 5757, Berlin: Springer, pp. 682–693, doi:10.1007/978-3-642-04128-0_61, MR 2557794.

[5] Baeza-Yates, Ricardo; Poblete, Patricio V. (2010), "Searching", in Atallah, Mikhail J.; Blanton, Marina, Algorithms and Theory of Computation Handbook: General Concepts and Techniques (2nd ed.), CRC Press, ISBN 9781584888239. See in particular p. 2-10.

[6] Jenkins, Bob (14 April 2009), "order-preserving minimal perfect hashing", in Black, Paul E., Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology, retrieved 2013-03-05.

[7] Belazzougui, Djamal; Boldi, Paolo; Pagh, Rasmus; Vigna, Sebastiano (November 2008), "Theory and practice of monotone minimal perfect hashing", Journal of Experimental Algorithmics, 16, Art. no. 3.2, 26pp, doi:10.1145/1963190.2025378.

• "… accesses". In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), New York, 2009. ACM Press.

• Douglas C. Schmidt, GPERF: A Perfect Hash Function Generator, C++ Report, SIGS, Vol. 10, No. 10, November/December, 1998.

3.10.8 External links

• Minimal Perfect Hashing by Bob Jenkins
• gperf is an Open Source C and C++ perfect hash generator
• cmph is Open Source implementing many perfect hashing methods
• Sux4J is Open Source implementing perfect hashing, including monotone minimal perfect hashing, in Java
• MPHSharp is Open Source implementing many perfect hashing methods in C#
are too many collisions), so one would like to change the hash function.
The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions H = {h : U → [m]} is called a universal family if, for all x, y ∈ U with x ≠ y:

Pr_{h∈H}[h(x) = h(y)] ≤ 1/m.

In other words, any two keys of the universe collide with probability at most 1/m when the hash function h is drawn randomly from H. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. Sometimes, the definition is relaxed to allow collision probability O(1/m). This concept was introduced by Carter and Wegman in 1977,[1] and has found numerous applications in computer science (see, for example,[2]). If we have an upper bound of ε < 1 on the collision probability, we say that we have ε-almost universality.
Many, but not all, universal families have the following stronger uniform difference property:

For all x, y ∈ U with x ≠ y, when h is drawn randomly from the family H, the difference h(x) − h(y) mod m is uniformly distributed in [m].

Note that the definition of universality is only concerned with whether h(x) − h(y) = 0, which counts collisions. The uniform difference property is stronger.
(Similarly, a universal family can be XOR universal if, for all x, y ∈ U with x ≠ y, the value h(x) ⊕ h(y) mod m is uniformly distributed in [m], where ⊕ is the bitwise exclusive or operation. This is only possible if m is a power of two.)
An even stronger condition is pairwise independence: we have this property when, for all x, y ∈ U with x ≠ y, the probability that x, y will hash to any pair of hash values z₁, z₂ is as if they were perfectly random: P(h(x) = z₁ ∧ h(y) = z₂) = 1/m². Pairwise independence is sometimes called strong universality.
Another property is uniformity. We say that a family is uniform if all hash values are equally likely: P(h(x) = z) = 1/m for any hash value z. Universality does not imply uniformity. However, strong universality does imply uniformity.
Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in [m] to the hash functions. (Similarly, if m is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform distance property and pairwise independence is sometimes not made.[3]
For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if H is a strongly universal family with m = 2^L, then the family made of the functions h mod 2^L′ for all h ∈ H is also strongly universal for L′ ≤ L. Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function h(x) = x is clearly universal, but the family made of the functions h(x) = x mod 2^L′ fails to be universal.
UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.[4][5] In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.
Several hash table implementations are based on universal hashing. In such applications, typically the software chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hashing and 2-choice hashing, allow a number of collisions before picking a new hash function.) A survey of the fastest known universal and strongly universal hash functions for integers, vectors, and strings is found in.[6]

3.11.2 Mathematical guarantees

For any fixed set S of n keys, using a universal family guarantees the following properties.

1. For any fixed x in S, the expected number of keys in the bin h(x) is n/m. When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key x (for example a query, insertion or deletion).

2. The expected number of pairs of keys x, y in S with x ≠ y that collide (h(x) = h(y)) is bounded above by n(n − 1)/2m, which is of order O(n²/m). When the number of bins, m, is O(n), the expected number of collisions is O(n). When hashing into n² bins, there are no collisions at all with probability at least a half.

3. The expected number of keys in bins with at least t keys in them is bounded above by 2n/(t − 2(n/m) + 1).[7] Thus, if the capacity of each bin is capped at three times the average size (t = 3n/m), the total number of keys in overflowing bins is at most O(m). This only holds with a hash family whose collision probability is bounded above by 1/m. If a weaker definition is used, bounding it by O(1/m), this result is no longer true.[7]
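For small parameters, the universal-family bound can be checked exhaustively. The sketch below uses the classic construction h_{a,b}(x) = ((a·x + b) mod p) mod m with p prime, a standard example of a universal family rather than one prescribed by this section; p and m are illustrative, chosen small enough to enumerate every (a, b) pair.

```python
# Exhaustive check of the universal-family property for the classic
# construction h_{a,b}(x) = ((a*x + b) mod p) mod m, with p prime,
# a in [1, p) and b in [0, p).  Illustrative parameters.
p, m = 101, 10

def h(a, b, x):
    return ((a * x + b) % p) % m

def collision_fraction(x, y):
    """Fraction of functions in the family under which x and y collide."""
    hits = sum(h(a, b, x) == h(a, b, y)
               for a in range(1, p)
               for b in range(p))
    return hits / ((p - 1) * p)
```

For any fixed pair of distinct keys, the measured fraction stays at or below 1/m, matching the definition of universality above.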
3.11. UNIVERSAL HASHING 89
ax mod 2^w or ay mod 2^w is larger. Assume that the least significant set bit of x − y appears on position w − c. Since a is a random odd integer and odd integers have inverses in the ring Z_{2^w}, it follows that a(x − y) mod 2^w will be uniformly distributed among w-bit integers with the least significant set bit on position w − c. The probability that these bits are all 0's or all 1's is therefore at most 2/2^M = 2/m. On the other hand, if c < M, then the higher-order M bits of a(x − y) mod 2^w contain both 0's and 1's, so it is certain that h(x) ≠ h(y). Finally, if c = M then bit w − M of a(x − y) mod 2^w is 1 and h_a(x) = h_a(y) if and only if bits w − 1, …, w − M + 1 are also 1, which happens with probability 1/2^(M−1) = 2/m.
This analysis is tight, as can be shown with the example x = 2^(w−M−2) and y = 3x. To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme

h_{a,b}(x) = ((ax + b) mod 2^w) div 2^(w−M)

which can be implemented in C-like programming languages by

h_{a,b}(x) = (unsigned) (a*x + b) >> (w - M)

where a is a random odd positive integer with a < 2^w and b is a random non-negative integer with b < 2^(w−M). With these choices of a and b, Pr{h_{a,b}(x) = h_{a,b}(y)} ≤ 1/m for all x ≢ y (mod 2^w).[10] This differs slightly but importantly from the mistranslation in the English paper.[11]

Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector x̄ = (x_0, …, x_{k−1}) of k machine words (integers of w bits each). If H is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman[1]) also has the uniform difference property (and hence is universal):

h(x̄) = (∑_{i=0}^{k−1} h_i(x_i)) mod m, where each h_i ∈ H is chosen independently at random.

If m is a power of two, one may replace summation by exclusive or.[12]
In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of.[13] Initialize the hash function with a vector ā = (a_0, …, a_{k−1}) of random odd integers on 2w bits each. Then if the number of bins is m = 2^M for M ≤ w:

h_ā(x̄) = ((∑_{i=0}^{k−1} x_i · a_i) mod 2^(2w)) div 2^(2w−M)

It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.[12] Initialize the hash function with a vector ā = (a_0, …, a_{k−1}) of random odd integers on 2w bits each. The following hash family is universal:[14]

h_ā(x̄) = ((∑_{i=0}^{⌈k/2⌉} (x_{2i} + a_{2i}) · (x_{2i+1} + a_{2i+1})) mod 2^(2w)) div 2^(2w−M)

If double-precision operations are not available, one can interpret the input as a vector of half-words (w/2-bit integers). The algorithm will then use ⌈k/2⌉ multiplications, where k was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.
The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.[15]
Strong universality at high speed is also possible.[16] Initialize the hash function with a vector ā = (a_0, …, a_k) of random integers on 2w bits. Compute

h_ā(x̄)_strong = ((a_0 + ∑_{i=0}^{k−1} a_{i+1} · x_i) mod 2^(2w)) div 2^w

The result is strongly universal on w bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for w = 32.

Hashing strings

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate h(s) is just the length of s. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality.[12] Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.[16]
Now assume we want to hash x̄ = (x_0, …, x_ℓ), where a good bound on ℓ is not known a priori. A universal family proposed by [13] treats the string x as the coefficients of a polynomial modulo a large prime. If x_i ∈ [u], let p ≥ max{u, m} be a prime and define:

h_a(x̄) = h_int((∑_{i=0}^{ℓ} x_i · a^i) mod p),

where a ∈ [p] is uniformly random and h_int is chosen randomly from a universal family mapping the integer domain [p] to [m].
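The multiply-add-shift scheme above translates directly into code; this is a minimal sketch with illustrative parameters (w = 64, M = 10) and a fixed seed so the sketch is reproducible.

```python
# Multiply-add-shift hashing, h_{a,b}(x) = ((a*x + b) mod 2^w) div 2^(w-M),
# with a a random odd w-bit integer and b a random (w-M)-bit integer.
# w, M and the seed are illustrative choices.
import random

w, M = 64, 10                        # word size; the table has 2^M bins
MASK = (1 << w) - 1                  # reduces products modulo 2^w

def make_multiply_add_shift(rng):
    a = rng.randrange(1 << w) | 1    # random odd multiplier, a < 2^w
    b = rng.randrange(1 << (w - M))  # random non-negative b < 2^(w-M)
    def h(x):
        return ((a * x + b) & MASK) >> (w - M)
    return h

h = make_multiply_add_shift(random.Random(42))
```

In C the body is the one-liner quoted in the text; in Python the explicit mask plays the role of unsigned w-bit overflow.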
2. One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the ⌈k/16⌉ results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.

3. One chooses a power-of-two as the divisor, allowing arithmetic modulo 2^w to be implemented without division (using faster operations of bit masking). The NH hash-function family takes this approach.

[9] Thorup, Mikkel. "Text-book algorithms at SODA".

[10] Woelfel, Philipp (2003). Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen (PDF) (Ph.D.). Universität Dortmund. Retrieved 18 September 2012.

[11] Woelfel, Philipp (1999). Efficient Strongly Universal and Optimally Universal Hashing (PDF). Mathematical Foundations of Computer Science 1999. LNCS. 1672. pp. 262–272. doi:10.1007/3-540-48340-3_24. Retrieved 17 May 2011.
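The pair-multiply family shown earlier (the idea behind NH) can be sketched as follows; the word size, output width, vector length, and seed are illustrative.

```python
# Pair-multiply ("NH-style") hashing of a vector of k w-bit words:
# pairs of words are combined with one multiplication each, summed
# modulo 2^(2w), and the top M bits are kept.  Illustrative parameters.
import random

w, M = 32, 16
rng = random.Random(7)

def make_pair_multiply_hash(k):
    """Return a hash function for vectors of k words (k even)."""
    assert k % 2 == 0
    a = [rng.randrange(1 << (2 * w)) | 1 for _ in range(k)]  # odd 2w-bit
    mod = 1 << (2 * w)
    def h(x):
        s = 0
        for i in range(0, k, 2):
            s = (s + (x[i] + a[i]) * (x[i + 1] + a[i + 1])) % mod
        return s >> (2 * w - M)       # keep the M most significant bits
    return h

h = make_pair_multiply_hash(4)
```

One multiplication covers two input words, which is the halving of work the text describes.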
[12] Thorup, Mikkel (2009). String hashing for linear probing. Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664. doi:10.1137/1.9781611973068.72. Archived (PDF) from the original on 2013-10-12. Section 5.3.

[13] Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). Polynomial Hash Functions Are Reliable (Extended Abstract). Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.

In computer science, a family of hash functions is said to be k-independent if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables (see precise mathematical definitions below). Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.

More precisely, a family H = {h : U → [m]} is k-independent if the following two conditions hold:

1. for any fixed x ∈ U, as h is drawn randomly from H, h(x) is uniformly distributed in [m].

2. for any fixed, distinct keys x_1, …, x_k ∈ U, as h is drawn randomly from H, h(x_1), …, h(x_k) are independent random variables.

Often it is inconvenient to achieve the perfect joint probability of m^(−k) due to rounding issues. Following [3], one may define a (µ, k)-independent family to satisfy:

for all distinct (x_1, …, x_k) ∈ U^k and all (y_1, …, y_k) ∈ [m]^k,
Pr_{h∈H}[h(x_1) = y_1 ∧ ⋯ ∧ h(x_k) = y_k] ≤ µ/m^k

Observe that, even if µ is close to 1, the h(x_i) are no longer independent random variables, which is often a problem in the analysis of randomized algorithms.[2] Therefore, a more common alternative to dealing with rounding issues is to prove that the hash family is close in statistical distance to a k-independent family, which allows black-box use of the independence properties.

3.12.3 Techniques

Polynomials with random coefficients

The original technique for constructing k-independent hash functions, given by Carter and Wegman, was to select a large prime number p, choose k random numbers modulo p, and use these numbers as the coefficients of a polynomial of degree k − 1 whose values modulo p are used as the value of the hash function. All polynomials of the given degree modulo p are equally likely, and any polynomial is uniquely determined by any k-tuple of argument-value pairs with distinct arguments, from which it follows that any k-tuple of distinct arguments is equally likely to be mapped to any k-tuple of hash values.[2]

3.12.4 Independence needed by different hashing methods

The notion of k-independence can be used to differentiate between different hashing methods, according to the level of independence required to guarantee constant expected time per operation.
For instance, hash chaining takes constant expected time even with a 2-independent hash function, because the expected time to perform a search for a given key is bounded by the expected number of collisions that key is involved in. By linearity of expectation, this expected number equals the sum, over all other keys in the hash table, of the probability that the given key and the other key collide. Because the terms of this sum only involve probabilistic events involving two keys, 2-independence is sufficient to ensure that this sum has the same value that it would for a truly random hash function.
Double hashing is another method of hashing that requires a low degree of independence. It is a form of open addressing that uses two hash functions: one to determine the start of a probe sequence, and the other to determine the step size between positions in the probe sequence. As long as both of these are 2-independent, this method gives constant expected time per operation.[7]
On the other hand, linear probing, a simpler form of open addressing where the step size is always one, requires 5-independence. It can be guaranteed to work in constant expected time per operation with a 5-independent hash function,[8] and there exist 4-independent hash functions for which it takes logarithmic time per operation.[9]

3.12.5 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.
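The polynomial construction of §3.12.3 can be sketched as follows; the prime, the degree, and the fixed seed are illustrative choices.

```python
# Carter-Wegman polynomial hashing: k random coefficients modulo a
# prime p define a k-independent family.  p, k and the seed are
# illustrative.
import random

p, k = 1_000_003, 5
rng = random.Random(0)
coeffs = [rng.randrange(p) for _ in range(k)]   # a degree-(k-1) polynomial

def h(x):
    """Evaluate the random polynomial at x modulo p (Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc
```

Horner's rule evaluates the degree-(k − 1) polynomial with k multiplications, so higher independence costs proportionally more time per key, which is the trade-off discussed above.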
Lemire (2012) studies variations of tabulation hashing suitable for variable-length keys such as character strings. The general type of hashing scheme studied by Lemire uses a single table T indexed by the value of a block, regardless of its position within the key. However, the values from this table may be combined by a more complicated function than bitwise exclusive or. Lemire shows that no scheme of this type can be 3-independent. Nevertheless, he shows that it is still possible to achieve 2-independence. In particular, a tabulation scheme that interprets the values T[x_i] (where x_i is, as before, the ith block of the input) as the coefficients of a polynomial over a finite field and then takes the remainder of the resulting polynomial modulo another polynomial, gives a 2-independent hash function.

3.13.6 Notes

[1] Morin (2014); Mitzenmacher & Upfal (2014).
[2] Mitzenmacher & Upfal (2014).
[3] Thorup (2013).
[4] Zobrist (1970).
[5] Pătraşcu & Thorup (2012); Mitzenmacher & Upfal (2014).
[6] Carter & Wegman (1979).
[7] For the sufficiency of 5-independent hashing for linear probing, see Pagh, Pagh & Ružić (2009). For examples of weaker hashing schemes that fail, see Pătraşcu & Thorup (2010).
[8] Pătraşcu & Thorup (2012).

3.13.7 References

Secondary sources

• Morin, Pat (February 22, 2014), "Section 5.2.3: Tabulation hashing", Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 115–116, retrieved 2016-01-08.

• Mitzenmacher, Michael; Upfal, Eli (2014), "Some practical randomized algorithms and data structures", in Tucker, Allen; Gonzalez, Teofilo; Diaz-Herrera, Jorge, Computing Handbook: Computer Science and Software Engineering (3rd ed.), CRC Press, pp. 11-1 – 11-23, ISBN 9781439898529. See in particular Section 11.1.1: Tabulation hashing, pp. 11-3 – 11-4.

Primary sources

• Carter, J. Lawrence; Wegman, Mark N. (1979), "Universal classes of hash functions", Journal of Computer and System Sciences, 18 (2): 143–154, doi:10.1016/0022-0000(79)90044-8, MR 532173.

• Lemire, Daniel (2012), "The universality of iterated hashing over variable-length strings", Discrete Applied Mathematics, 160: 604–617, arXiv:1008.1715, doi:10.1016/j.dam.2011.11.009, MR 2876344.

• Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852.

• Pătraşcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (PDF), Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP 2010), Bordeaux, France, July 6–10, 2010, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60, MR 2734626.

• Pătraşcu, Mihai; Thorup, Mikkel (2012), "The power of simple tabulation hashing", Journal of the ACM, 59 (3): Art. 14, arXiv:1011.5200, doi:10.1145/2220357.2220361, MR 2946218.

• Siegel, Alan (2004), "On universal classes of extremely random constant-time hash functions", SIAM Journal on Computing, 33 (3): 505–543, doi:10.1137/S0097539701386216, MR 2066640.

• Thorup, M. (2013), "Simple tabulation, fast expanders, double tabulation, and high independence", Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013), pp. 90–99, doi:10.1109/FOCS.2013.18, MR 3246210.

• Wegman, Mark N.; Carter, J. Lawrence (1981), "New hash functions and their use in authentication and set equality", Journal of Computer and System Sciences, 22 (3): 265–279, doi:10.1016/0022-0000(81)90033-7, MR 633535.

• Zobrist, Albert L. (April 1970), A New Hashing Method with Application for Game Playing (PDF), Tech. Rep. 88, Madison, Wisconsin: Computer Sciences Department, University of Wisconsin.

3.14 Cryptographic hash function

A cryptographic hash function is a special class of hash function that has certain properties which make it suitable for use in cryptography. It is a mathematical algorithm that maps data of arbitrary size to a bit string of a fixed size (a hash function) which is designed to also be a one-way function, that is, a function which is infeasible to invert. The only way to recreate the input data from an ideal cryptographic hash function's output is to attempt a brute-force search of possible inputs to see if they produce a match.
Cryptographic hash functions have many information-security applications, notably in digital signatures, message authentication codes (MACs), and other forms of authentication. They can also be used as ordinary hash functions, to index data in hash tables, for fingerprinting, to detect duplicate data or uniquely identify files, and as checksums to detect accidental data corruption. Indeed, in information-security contexts, cryptographic hash values are sometimes called (digital) fingerprints, checksums, or just hash values, even though all these terms stand for more general functions with rather different properties and purposes.

Informally, these properties mean that a malicious adversary cannot replace or modify the input data without changing its digest. Thus, if two strings have the same digest, one can be very confident that they are identical. A function meeting these criteria may still have undesirable properties. Currently popular cryptographic hash functions are vulnerable to length-extension attacks: given hash(m) and len(m) but not m, by choosing a suitable m' an attacker can calculate hash(m || m'), where || denotes concatenation.[4] This property can be used to break naive authentication schemes based on hash functions. The HMAC construction works around these problems.
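A sketch of the contrast in Python, using the standard hashlib and hmac modules. The key, the message, and the naive hash(key || message) scheme are illustrative assumptions; HMAC is the construction the text names:

```python
import hashlib
import hmac

key = b"secret-key"              # illustrative key
msg = b"amount=100&to=alice"     # illustrative message

# Naive MAC: hash(key || message). With a Merkle-Damgard hash such as
# SHA-256, an attacker who knows this tag and the length of key || msg
# can compute a valid tag for an extended message without knowing the
# key (a length-extension attack).
naive_tag = hashlib.sha256(key + msg).hexdigest()

# HMAC runs the hash twice with differently padded copies of the key,
# which blocks length extension.
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

# Verification should use a constant-time comparison.
print(hmac.compare_digest(tag, hmac.new(key, msg, hashlib.sha256).hexdigest()))  # True
```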
98 CHAPTER 3. DICTIONARIES
Because users have different salts, it is not feasible to store tables of precomputed hash values for common passwords. Key stretching functions, such as PBKDF2, Bcrypt or Scrypt, typically use repeated invocations of a cryptographic hash to increase the time required to perform brute-force attacks on stored password digests.

In 2013 a long-term Password Hashing Competition was announced to choose a new, standard algorithm for password hashing.[7]

Proof-of-work

Main article: Proof-of-work system

A proof-of-work system (or protocol, or function) is an economic measure to deter denial-of-service attacks and other service abuses such as spam on a network by requiring some work from the service requester, usually meaning processing time by a computer. A key feature of these schemes is their asymmetry: the work must be moderately hard (but feasible) on the requester side but easy to check for the service provider. One popular system – used in Bitcoin mining and Hashcash – uses partial hash inversions to prove that work was done, as a good-will token to send an e-mail. The sender is required to find a message whose hash value begins with a number of zero bits. The average work that the sender needs to perform in order to find a valid message is exponential in the number of zero bits required in the hash value, while the recipient can verify the validity of the message by executing a single hash function. For instance, in Hashcash, a sender is asked to generate a header whose 160-bit SHA-1 hash value has the first 20 bits as zeros. The sender will on average have to try 2¹⁹ times to find a valid header.

File or data identifier

A message digest can also serve as a means of reliably identifying a file; several source code management systems, including Git, Mercurial and Monotone, use the sha1sum of various types of content (file content, directory trees, ancestry information, etc.) to uniquely identify them. Hashes are used to identify files on peer-to-peer filesharing networks. For example, in an ed2k link, an MD4-variant hash is combined with the file size, providing sufficient information for locating file sources, downloading the file and verifying its contents. Magnet links are another example. Such file hashes are often the top hash of a hash list or a hash tree, which allows for additional benefits.

One of the main applications of a hash function is to allow the fast look-up of data in a hash table. Being hash functions of a particular kind, cryptographic hash functions lend themselves well to this application too.

However, compared with standard hash functions, cryptographic hash functions tend to be much more expensive computationally. For this reason, they tend to be used in contexts where it is necessary for users to protect themselves against the possibility of forgery (the creation of data with the same digest as the expected data) by potentially malicious participants.

Pseudorandom generation and key derivation

Hash functions can also be used in the generation of pseudorandom bits, or to derive new keys or passwords from a single secure key or password.

3.14.4 Hash functions based on block ciphers

There are several methods to use a block cipher to build a cryptographic hash function, specifically a one-way compression function.

The methods resemble the block cipher modes of operation usually used for encryption. Many well-known hash functions, including MD4, MD5, SHA-1 and SHA-2, are built from block-cipher-like components designed for the purpose, with feedback to ensure that the resulting function is not invertible. SHA-3 finalists included functions with block-cipher-like components (e.g., Skein, BLAKE), though the function finally selected, Keccak, was built on a cryptographic sponge instead.

A standard block cipher such as AES can be used in place of these custom block ciphers; that might be useful when an embedded system needs to implement both encryption and hashing with minimal code size or hardware area. However, that approach can have costs in efficiency and security. The ciphers in hash functions are built for hashing: they use large keys and blocks, can efficiently change keys every block, and have been designed and vetted for resistance to related-key attacks. General-purpose ciphers tend to have different design goals. In particular, AES has key and block sizes that make it nontrivial to use to generate long hash values; AES encryption becomes less efficient when the key changes each block; and related-key attacks make it potentially less secure for use in a hash function than for encryption.

3.14.5 Merkle–Damgård construction

Main article: Merkle–Damgård construction

A hash function must be able to process an arbitrary-length message into a fixed-length output. This can be achieved by breaking the input up into a series of equal-sized blocks, and operating on them in sequence using a one-way compression function. The compression function can either be specially designed for hashing or be built from a block cipher. A hash function built with the Merkle–Damgård construction is as resistant to collisions as is its compression function.
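A toy version of this construction may make the chaining and the length padding concrete. The block size, initial value, and compression function below are arbitrary illustrative choices, and the mixer is in no way cryptographically secure; the sketch only shows the skeleton that MD5, SHA-1 and SHA-2 share:

```python
BLOCK = 8                     # bytes per block (illustrative)
IV = 0x6A09E667F3BCC908       # arbitrary 64-bit initial value

def compress(state: int, block: bytes) -> int:
    # Toy compression function: NOT secure, for illustration only.
    x = state ^ int.from_bytes(block, "big")
    for _ in range(4):
        x = (x * 0x100000001B3) & (2**64 - 1)   # multiply-and-mask mixing
        x ^= x >> 29
    return x

def md_hash(message: bytes) -> int:
    # Unambiguous length padding: a 0x80 marker byte, zeros up to a block
    # boundary, then the original length, so distinct messages never pad
    # to the same block sequence.
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % BLOCK)
    padded += len(message).to_bytes(8, "big")
    state = IV
    for i in range(0, len(padded), BLOCK):      # chain block by block
        state = compress(state, padded[i:i + BLOCK])
    return state

print(md_hash(b"hello") != md_hash(b"hello!"))  # True
```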
[Figure: the Merkle–Damgård hash construction.]

Any collision for the full hash function can be traced back to a collision in the compression function.

The last block processed should also be unambiguously length-padded; this is crucial to the security of this construction. This construction is called the Merkle–Damgård construction. Most widely used hash functions, including SHA-1 and MD5, take this form.

The construction has certain inherent flaws, including length-extension and generate-and-paste attacks, and cannot be parallelized. As a result, many entrants in the recent NIST hash function competition were built on different, sometimes novel, constructions.

3.14.6 Use in building other cryptographic primitives

Hash functions can be used to build other cryptographic primitives. For these other primitives to be cryptographically secure, care must be taken to build them correctly.

Message authentication codes (MACs) (also called keyed hash functions) are often built from hash functions. HMAC is such a MAC.

Just as block ciphers can be used to build hash functions, hash functions can be used to build block ciphers. Luby–Rackoff constructions using hash functions can be provably secure if the underlying hash function is secure. Also, many hash functions (including SHA-1 and SHA-2) are built by using a special-purpose block cipher in a Davies–Meyer or other construction. That cipher can also be used in a conventional mode of operation, without the same security guarantees. See SHACAL, BEAR and LION.

Pseudorandom number generators (PRNGs) can be built using hash functions. This is done by combining a (secret) random seed with a counter and hashing it.

Some hash functions, such as Skein, Keccak, and RadioGatún, output an arbitrarily long stream and can be used as a stream cipher, and stream ciphers can also be built from fixed-length digest hash functions. Often this is done by first building a cryptographically secure pseudorandom number generator and then using its stream of random bytes as keystream. SEAL is a stream cipher that uses SHA-1 to generate its internal tables.

3.14.7 Concatenation

Concatenating outputs from multiple hash functions provides collision resistance as good as the strongest of the algorithms included in the concatenated result. For example, older versions of Transport Layer Security (TLS) and Secure Sockets Layer (SSL) use concatenated MD5 and SHA-1 sums.[8][9] This ensures that a method to find collisions in one of the hash functions does not defeat data protected by both hash functions.

For Merkle–Damgård construction hash functions, the concatenated function is as collision-resistant as its strongest component, but not more collision-resistant. Antoine Joux observed that 2-collisions lead to n-collisions: if it is feasible for an attacker to find two messages with the same MD5 hash, the attacker can find as many messages as the attacker desires with identical MD5 hashes with no greater difficulty.[10] Among the n messages with the same MD5 hash, there is likely to be a collision in SHA-1. The additional work needed to find the SHA-1 collision (beyond the exponential birthday search) requires only polynomial time.[11][12]

3.14.8 Cryptographic hash algorithms

There is a long list of cryptographic hash functions, although many have been found to be vulnerable and should not be used. Even if a hash function has never been broken, a successful attack against a weakened variant may undermine the experts’ confidence and lead to its abandonment. For instance, in August 2004 weaknesses were found in several then-popular hash functions, including SHA-0, RIPEMD, and MD5. These weaknesses called into question the security of stronger algorithms derived from the weak hash functions, in particular SHA-1 (a strengthened version of SHA-0), RIPEMD-128, and RIPEMD-160 (both strengthened versions of RIPEMD). Neither SHA-0 nor RIPEMD are widely used, since they were replaced by their strengthened versions.

As of 2009, the two most commonly used cryptographic hash functions were MD5 and SHA-1. However, a successful attack on MD5 broke Transport Layer Security in 2008.[13]

The United States National Security Agency (NSA) developed SHA-0 and SHA-1.

On 12 August 2004, Joux, Carribault, Lemuet, and Jalby announced a collision for the full SHA-0 algorithm. Joux et al. accomplished this using a generalization of the Chabaud and Joux attack.
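The concatenation combiner discussed in Section 3.14.7 (MD5 || SHA-1, in the style of SSL 3.0/TLS 1.0) can be sketched in a few lines of Python; hashlib provides both primitives, and the helper name is illustrative:

```python
import hashlib

def combined_digest(data: bytes) -> str:
    # MD5 || SHA-1, as in older SSL/TLS: defeating the combiner requires
    # a simultaneous collision in both hash functions.
    return hashlib.md5(data).hexdigest() + hashlib.sha1(data).hexdigest()

tag = combined_digest(b"example")
print(len(tag))  # 32 hex digits (MD5) + 40 hex digits (SHA-1) = 72
```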
They found that the collision had complexity 2⁵¹ and took about 80,000 CPU hours on a supercomputer with 256 Itanium 2 processors, equivalent to 13 days of full-time use of the supercomputer.

In February 2005, an attack on SHA-1 was reported that would find collisions in about 2⁶⁹ hashing operations, rather than the 2⁸⁰ expected for a 160-bit hash function. In August 2005, another attack on SHA-1 was reported that would find collisions in 2⁶³ operations. Theoretical weaknesses of SHA-1 exist,[14][15] and in February 2017 Google announced a collision in SHA-1.[16] Security researchers recommend that new applications avoid these problems by using later members of the SHA family, such as SHA-2, or by using techniques such as randomized hashing[17][18] that do not require collision resistance.

However, to ensure the long-term robustness of applications that use hash functions, there was a competition to design a replacement for SHA-2. On October 2, 2012, Keccak was selected as the winner of the NIST hash function competition. A version of this algorithm became a FIPS standard on August 5, 2015 under the name SHA-3.[19]

Another finalist from the NIST hash function competition, BLAKE, was optimized to produce BLAKE2, which is notable for being faster than SHA-3, SHA-2, SHA-1, or MD5, and is used in numerous applications and libraries.

[7] “Password Hashing Competition”. Retrieved March 3, 2013.

[8] Florian Mendel; Christian Rechberger; Martin Schläffer. “MD5 is Weaker than Weak: Attacks on Concatenated Combiners”. Advances in Cryptology - ASIACRYPT 2009. p. 145. Quote: 'Concatenating ... is often used by implementors to “hedge bets” on hash functions. A combiner of the form MD5||SHA-1 as used in SSL3.0/TLS1.0 ... is an example of such a strategy.'

[9] Danny Harnik; Joe Kilian; Moni Naor; Omer Reingold; Alon Rosen. “On Robust Combiners for Oblivious Transfer and Other Primitives”. Advances in Cryptology - EUROCRYPT 2005. p. 99. Quote: “the concatenation of hash functions as suggested in the TLS... is guaranteed to be as secure as the candidate that remains secure.”

[10] Antoine Joux. “Multicollisions in Iterated Hash Functions. Application to Cascaded Constructions”. LNCS 3152/2004, pages 306–316.

[11] Finney, Hal (August 20, 2004). “More Problems with Hash Functions”. The Cryptography Mailing List. Retrieved May 25, 2016.

[12] Hoch, Jonathan J.; Shamir, Adi (2008). “On the Strength of the Concatenated Hash Combiner when All the Hash Functions Are Weak” (PDF). Retrieved May 25, 2016.

[13] Alexander Sotirov; Marc Stevens; Jacob Appelbaum; Arjen Lenstra; David Molnar; Dag Arne Osvik; Benne de Weger. “MD5 considered harmful today: Creating a rogue CA certificate”. Accessed March 29, 2009.

[14] Xiaoyun Wang; Yiqun Lisa Yin; Hongbo Yu. “Finding Collisions in the Full SHA-1”.

• Buldas, A. (2011). “Series of mini-lectures about cryptographic hash functions”.
Sets
4.1 Set (abstract data type)

In computer science, a set is an abstract data type that can store certain values, without any particular order, and with no repeated values. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

Some set data structures are designed for static or frozen sets that do not change after they are constructed. Static sets allow only query operations on their elements, such as checking whether a given value is in the set, or enumerating the values in some arbitrary order. Other variants, called dynamic or mutable sets, also allow the insertion and deletion of elements from the set.

An abstract data structure is a collection, or aggregate, of data. The data may be booleans, numbers, characters, or other data structures. If one considers the structure yielded by packaging[lower-alpha 1] or indexing,[lower-alpha 2] there are four basic data structures:[1][2]

1. unpackaged, unindexed: bunch
2. packaged, unindexed: set
3. unpackaged, indexed: string (sequence)
4. packaged, indexed: list (array)

In this view, the contents of a set are a bunch, and isolated data items are elementary bunches (elements). Whereas sets contain elements, bunches consist of elements.

Further structuring may be achieved by considering the multiplicity of elements (sets become multisets, bunches become hyperbunches)[3] or their homogeneity (a record is a set of fields, not necessarily all of the same type).

(Subtypes and subsets may be modeled by refinement types, and quotient sets may be replaced by setoids.) The characteristic function F of a set S is defined as:

F(x) = 1 if x ∈ S, and F(x) = 0 if x ∉ S.

In theory, many other abstract data structures can be viewed as set structures with additional operations and/or additional axioms imposed on the standard operations. For example, an abstract heap can be viewed as a set structure with a min(S) operation that returns the element of smallest value.

4.1.2 Operations

Core set-theoretical operations

One may define the operations of the algebra of sets:

• union(S,T): returns the union of sets S and T.
• intersection(S,T): returns the intersection of sets S and T.
• difference(S,T): returns the difference of sets S and T.
• subset(S,T): a predicate that tests whether the set S is a subset of set T.

Static sets

Typical operations that may be provided by a static set structure S are:

• is_element_of(x,S): checks whether the value x is in the set S.
• enumerate(S): returns a list containing the elements of S in some arbitrary order.
• build(x1,x2,…,xn): creates a set structure with values x1,x2,…,xn.
• create_from(collection): creates a new set structure containing all the elements of the given collection or all the elements returned by the given iterator.

Dynamic sets

Dynamic set structures typically add:

• create(): creates a new, initially empty set structure.
• create_with_capacity(n): creates a new set structure, initially empty but capable of holding up to n elements.
• add(S,x): adds the element x to S, if it is not present already.
• remove(S,x): removes the element x from S, if it is present.
• capacity(S): returns the maximum number of values that S can hold.

Some set structures may allow only some of these operations. The cost of each operation will depend on the implementation, and possibly also on the particular values stored in the set and the order in which they are inserted.

Additional operations

There are many other operations that can (in principle) be defined in terms of the above, such as:

• pop(S): returns an arbitrary element of S, deleting it from S.[4]
• pick(S): returns an arbitrary element of S.[5][6][7] Functionally, the mutator pop can be interpreted as the pair of selectors (pick, rest), where rest returns the set consisting of all elements except for the arbitrary element.[8] Can be interpreted in terms of iterate.[lower-alpha 3]
• map(F,S): returns the set of distinct values resulting from applying function F to each element of S.
• filter(P,S): returns the subset containing all elements of S that satisfy a given predicate P.
• fold(A0,F,S): returns the value A_|S| after applying A_(i+1) := F(A_i, e) for each element e of S, for some binary operation F. F must be associative and commutative for this to be well-defined.
• clear(S): delete all elements of S.
• equal(S1, S2): checks whether the two given sets are equal (i.e. contain all and only the same elements).
• hash(S): returns a hash value for the static set S such that if equal(S1, S2) then hash(S1) = hash(S2).

Other operations can be defined for sets with elements of a special type:

• sum(S): returns the sum of all elements of S for some definition of “sum”. For example, over integers or reals, it may be defined as fold(0, add, S).
• collapse(S): given a set of sets, return the union.[9] For example, collapse({{1}, {2, 3}}) == {1, 2, 3}. May be considered a kind of sum.
• flatten(S): given a set consisting of sets and atomic elements (elements that are not sets), returns a set whose elements are the atomic elements of the original top-level set or elements of the sets it contains. In other words, remove a level of nesting, like collapse, but allowing atoms. This can be done a single time, or recursively flattening to obtain a set of only atomic elements.[10] For example, flatten({1, {2, 3}}) == {1, 2, 3}.
• nearest(S,x): returns the element of S that is closest in value to x (by some metric).
• min(S), max(S): returns the minimum/maximum element of S.

4.1.3 Implementations

Sets can be implemented using various data structures, which provide different time and space trade-offs for various operations. Some implementations are designed to improve the efficiency of very specialized operations, such as nearest or union. Implementations described as “general use” typically strive to optimize the element_of, add, and delete operations. A simple implementation is to use a list, ignoring the order of the elements and taking care to avoid repeated values. This is simple but inefficient, as operations like set membership or element deletion are O(n), as they require scanning the entire list.[lower-alpha 4] Sets are often instead implemented using more efficient data structures, particularly various flavors of trees, tries, or hash tables.

As sets can be interpreted as a kind of map (by the indicator function), sets are commonly implemented in the same way as (partial) maps (associative arrays), in this case with the value of each key-value pair having the unit type or a sentinel value (like 1): namely, a self-balancing binary search tree for sorted sets (which has O(log n) for most operations), or a hash table for unsorted sets (which has O(1) average-case, but O(n) worst-case, for most operations). A sorted linear hash table[11] may be used to provide deterministically ordered sets.
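Several of the operations listed above map directly onto Python's built-in set type; a brief sketch (the operation names follow the list above, not any standard library API):

```python
from functools import reduce

S, T = {1, 2, 3, 4}, {3, 4, 5}

# Core set-theoretical operations.
print(S | T, S & T, S - T, S <= T)    # union, intersection, difference, subset

# map(F,S): duplicates collapse, so the result may be smaller than S.
mapped = {x % 3 for x in S}           # {0, 1, 2}

# filter(P,S): the subset satisfying predicate P.
evens = {x for x in S if x % 2 == 0}  # {2, 4}

# fold(A0,F,S): only well-defined when F is associative and commutative,
# because set iteration order is arbitrary.
total = reduce(lambda acc, e: acc + e, S, 0)  # 10
```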
Further, in languages that support maps but not sets, sets can be implemented in terms of maps. For example, a common programming idiom in Perl that converts an array to a hash whose values are the sentinel value 1, for use as a set, is:

my %elements = map { $_ => 1 } @elements;

Other popular methods include arrays. In particular, a subset of the integers 1..n can be implemented efficiently as an n-bit bit array, which also supports very efficient union and intersection operations. A Bloom map implements a set probabilistically, using a very compact representation but risking a small chance of false positives on queries.

The Boolean set operations can be implemented in terms of more elementary operations (pop, clear, and add), but specialized algorithms may yield lower asymptotic time bounds. If sets are implemented as sorted lists, for example, the naive algorithm for union(S,T) will take time proportional to the length m of S times the length n of T; whereas a variant of the list merging algorithm will do the job in time proportional to m+n. Moreover, there are specialized set data structures (such as the union-find data structure) that are optimized for one or more of these operations, at the expense of others.

4.1.4 Language support

One of the earliest languages to support sets was Pascal; many languages now include it, whether in the core language or in a standard library.

• In C++, the Standard Template Library (STL) provides the set template class, which is typically implemented using a binary search tree (e.g. red-black tree); SGI's STL also provides the hash_set template class, which implements a set using a hash table. C++11 has support for the unordered_set template class, which is implemented using a hash table. In sets, the elements themselves are the keys, in contrast to sequenced containers, where elements are accessed using their (relative or absolute) position. Set elements must have a strict weak ordering.

• Java offers the Set interface to support sets (with the HashSet class implementing it using a hash table), and the SortedSet sub-interface to support sorted sets (with the TreeSet class implementing it using a binary search tree).

• Apple's Foundation framework (part of Cocoa) provides the Objective-C classes NSSet, NSMutableSet, NSCountedSet, NSOrderedSet, and NSMutableOrderedSet. The CoreFoundation APIs provide the CFSet and CFMutableSet types for use in C.

• Python has built-in set and frozenset types since 2.4, and since Python 3.0 and 2.7, supports non-empty set literals using a curly-bracket syntax, e.g.: {x, y, z}.

• The .NET Framework provides the generic HashSet and SortedSet classes that implement the generic ISet interface.

• Smalltalk's class library includes Set and IdentitySet, using equality and identity for inclusion test respectively. Many dialects provide variations for compressed storage (NumberSet, CharacterSet), for ordering (OrderedSet, SortedSet, etc.) or for weak references (WeakIdentitySet).

• Ruby's standard library includes a set module which contains Set and SortedSet classes that implement sets using hash tables, the latter allowing iteration in sorted order.

• OCaml's standard library contains a Set module, which implements a functional set data structure using binary search trees.

• The GHC implementation of Haskell provides a Data.Set module, which implements immutable sets using binary search trees.[12]

• The Tcl Tcllib package provides a set module which implements a set data structure based upon TCL lists.

• The Swift standard library contains a Set type, since Swift 1.2.

As noted in the previous section, in languages which do not directly support sets but do support associative arrays, sets can be emulated using associative arrays, by using the elements as keys, and using a dummy value as the values, which are ignored.

4.1.5 Multiset

A generalization of the notion of a set is that of a multiset or bag, which is similar to a set but allows repeated (“equal”) values (duplicates). This is used in two distinct senses: either equal values are considered identical, and are simply counted, or equal values are considered equivalent, and are stored as distinct items. For example, given a list of people (by name) and ages (in years), one could construct a multiset of ages, which simply counts the number of people of a given age. Alternatively, one can construct a multiset of people, where two people are considered equivalent if their ages are the same (but may be different people and have different names), in which case each pair (name, age) must be stored, and selecting on a given age gives all the people of a given age.

Formally, it is possible for objects in computer science to be considered “equal” under some equivalence relation
but still distinct under another relation. Some types of multiset implementations will store distinct equal objects as separate items in the data structure, while others will collapse them down to one version (the first one encountered) and keep a positive integer count of the multiplicity of the element.

As with sets, multisets can naturally be implemented using hash tables or trees, which yield different performance characteristics.

The set of all bags over type T is given by the expression bag T. If by multiset one considers equal items identical and simply counts them, then a multiset can be interpreted as a function from the input domain to the non-negative integers (natural numbers), generalizing the identification of a set with its indicator function. In some cases a multiset in this counting sense may be generalized to allow negative values, as in Python.

• C++'s Standard Template Library implements both sorted and unsorted multisets. It provides the multiset class for the sorted multiset, as a kind of associative container, which implements this multiset using a self-balancing binary search tree. It provides the unordered_multiset class for the unsorted multiset, as a kind of unordered associative container, which implements this multiset using a hash table. The unsorted multiset is standard as of C++11; previously SGI's STL provided the hash_multiset class, which was copied and eventually standardized.

• For Java, third-party libraries provide multiset functionality:
  • Apache Commons Collections provides the Bag and SortedBag interfaces, with implementing classes like HashBag and TreeBag.
  • Google Guava provides the Multiset interface, with implementing classes like HashMultiset and TreeMultiset.

• Apple provides the NSCountedSet class as part of Cocoa, and the CFBag and CFMutableBag types as part of CoreFoundation.

• Python's standard library includes collections.Counter, which is similar to a multiset.

• Smalltalk includes the Bag class, which can be instantiated to use either identity or equality as predicate for inclusion test.

Where a multiset data structure is not available, a workaround is to use a regular set, but override the equality predicate of its items to always return “not equal” on distinct objects (however, such a set will still not be able to store multiple occurrences of the same object), or to use an associative array mapping the values to their integer multiplicities (this will not be able to distinguish between equal elements at all).

Typical operations on bags:

• contains(B, x): checks whether the element x is present (at least once) in the bag B.
• is_sub_bag(B1, B2): checks whether each element in the bag B1 occurs in B1 no more often than it occurs in the bag B2; sometimes denoted as B1 ⊑ B2.
• count(B, x): returns the number of times that the element x occurs in the bag B; sometimes denoted as B # x.
• scaled_by(B, n): given a natural number n, returns a bag which contains the same elements as the bag B, except that every element that occurs m times in B occurs n * m times in the resulting bag; sometimes denoted as n ⊗ B.
• union(B1, B2): returns a bag containing just those values that occur in either the bag B1 or the bag B2, except that the number of times a value x occurs in the resulting bag is equal to (B1 # x) + (B2 # x); sometimes denoted as B1 ⊎ B2.

Multisets in SQL

In relational databases, a table can be a (mathematical) set or a multiset, depending on the presence of unicity constraints on some columns (which turns them into a candidate key).

SQL allows the selection of rows from a relational table: this operation will in general yield a multiset, unless the keyword DISTINCT is used to force the rows to be all different, or the selection includes the primary (or a candidate) key.

In ANSI SQL the MULTISET keyword can be used to transform a subquery into a collection expression:

SELECT expression1, expression2... FROM table_name...

is a general select that can be used as subquery expression of another more general query, while

MULTISET(SELECT expression1, expression2... FROM table_name...)

transforms the subquery into a collection expression that can be used in another query, or in assignment to a column of appropriate collection type.

4.1.6 See also

• Bloom filter
• Disjoint set
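The bag operations above can be sketched with Python's collections.Counter, which stores elements with integer multiplicities. One caveat: Counter's own | operator takes the elementwise maximum, so the ⊎ union defined above corresponds to + instead; the helper functions are illustrative, not library API:

```python
from collections import Counter

B1 = Counter({"x": 2, "y": 1})
B2 = Counter({"x": 3, "z": 1})

print(B1["x"], B1["w"])        # count(B1, x) -> 2; absent elements count 0

# is_sub_bag(B1, B2): each element occurs in B1 no more often than in B2.
def is_sub_bag(b1, b2):
    return all(b1[e] <= b2[e] for e in b1)

# scaled_by(B, n): every multiplicity m becomes n * m.
def scaled_by(b, n):
    return Counter({e: n * m for e, m in b.items()})

print(is_sub_bag(B1, B2))      # False: "y" occurs in B1 but not in B2
print(scaled_by(B1, 3))        # multiplicities become 6 and 3
print((B1 + B2)["x"])          # bag union adds multiplicities: 2 + 3 = 5
```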
4.1.7 Notes

[1] “Packaging” consists in supplying a container for an aggregation of objects in order to turn them into a single object. Consider a function call: without packaging, a function can be called to act upon a bunch only by passing each bunch element as a separate argument, which complicates the function's signature considerably (and is just not possible in some programming languages). By packaging the bunch's elements into a set, the function may now be called upon a single, elementary argument: the set object (the bunch's package).

[2] Indexing is possible when the elements being considered are totally ordered. Being without order, the elements of a multiset (for example) do not have lesser/greater or preceding/succeeding relationships: they can only be compared in absolute terms (same/different).

[3] For example, in Python pick can be implemented on a derived class of the built-in set as follows:

class Set(set):
    def pick(self):
        return next(iter(self))

[6] Python Issue7212: Retrieve an arbitrary element from a set without removing it; see msg106593 regarding standard name.

[7] Ruby Feature #4553: Add Set#pick and Set#pop.

[8] Inductive Synthesis of Functional Programs: Universal Planning, Folding of Finite Programs, and Schema Abstraction by Analogical Reasoning, Ute Schmid, Springer, Aug 21, 2003, p. 240.

[9] Recent Trends in Data Type Specification: 10th Workshop on Specification of Abstract Data Types Joint with the 5th COMPASS Workshop, S. Margherita, Italy, May 30 – June 3, 1994. Selected Papers, Volume 10, ed. Egidio Astesiano, Gianna Reggio, Andrzej Tarlecki, p. 38.

[10] Ruby: flatten().

[11] Wang, Thomas (1997), Sorted Linear Hash Table.

[12] Stephen Adams, “Efficient sets: a balancing act”, Journal of Functional Programming 3(4):553–562, October 1993. Retrieved on 2015-03-11.

4.2 Bit array

A bit array (also known as bitmap, bitset, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.

Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:

• OR can be used to set a bit to one: 11101010 OR 00000100 = 11101110
• AND can be used to set a bit to zero: 11101010 AND 11111101 = 11101000
• AND together with zero-testing can be used to determine if a bit is set:
  11101010 AND 00000001 = 00000000 = 0
108 CHAPTER 4. SETS

11101010 AND 00000010 = 00000010 ≠ 0

• XOR can be used to invert or toggle a bit:

11101010 XOR 00000100 = 11101110
11101110 XOR 00000100 = 11101010

• NOT can be used to invert all bits:

NOT 10110010 = 01001101

To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by the appropriate number of places, as well as bitwise negation if necessary.

To count the number of one bits in a bit array (its population count, or Hamming weight), one can run an efficient bit-counting algorithm over each word and keep a running total. Counting zeros is similar. See the Hamming weight article for examples of an efficient implementation.

Inversion

Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words (so b31 b30 ... b0 becomes b0 ... b30 b31). When this operation is not available on the processor, it is still possible to proceed by successive passes, in this example on 32 bits:

    exchange the two 16-bit halfwords
    exchange bytes by pairs (0xddccbbaa -> 0xccddaabb)
    ...
    swap bits by pairs
    swap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1)

The last operation can be written ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1).
Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either:

    for i from 0 to n/w-1
        complement_a[i] := not a[i]
        union[i] := a[i] or b[i]
        intersection[i] := a[i] and b[i]
        difference[i] := a[i] and (not b[i])

If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Only n/w memory accesses are required:

    index := 0                          // if needed
    for i from 0 to n/w-1
        word := a[i]
        for b from 0 to w-1
            value := word and 1 ≠ 0
            word := word shift right 1
            // do something with value
            index := index + 1          // if needed

Find first one

The find first set or find first one operation identifies the index or position of the 1-bit with the smallest index in an array, and has widespread hardware support (for arrays not larger than a word) and efficient algorithms for its computation. When a priority queue is stored in a bit array, find first one can be used to identify the highest priority element in the queue. To expand a word-size find first one to longer arrays, one can find the first nonzero word and then run find first one on that word. The related operations find first zero, count leading zeros, count leading ones, count trailing zeros, count trailing ones, and log base 2 (see find first set) can also be extended to a bit array in a straightforward manner.
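The word-wise set operations, the bit iteration loop, and the array-wide find first one can be sketched in Python as follows. The layout (a list of w-bit words) and the helper names are illustrative assumptions; the expression (word & -word).bit_length() - 1 stands in for a hardware find-first-set instruction:

```python
W = 8                  # bits per storage word; real code uses the machine word
MASK = (1 << W) - 1    # keeps Python's unbounded ints within one word

def union(a, b):        return [x | y for x, y in zip(a, b)]
def intersection(a, b): return [x & y for x, y in zip(a, b)]
def difference(a, b):   return [x & (~y & MASK) for x, y in zip(a, b)]
def complement(a):      return [~x & MASK for x in a]

def iterate_bits(a):
    """Yield each bit of the array using only n/w memory accesses."""
    for word in a:
        for _ in range(W):
            yield word & 1 != 0
            word >>= 1

def find_first_one(a):
    """Index of the lowest set bit: scan for the first nonzero word,
    then isolate its lowest set bit."""
    for i, word in enumerate(a):
        if word:
            return i * W + (word & -word).bit_length() - 1
    return None
```

Each set operation touches every word exactly once, matching the n/w operation count given above.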
In C++, the [] operator does not return a reference to an element, but instead returns a proxy reference. This might seem a minor point, but it means that vector<bool> is not a standard STL container, which is why the use of vector<bool> is generally discouraged. Another unique STL class, bitset,[1] creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. The Boost C++ Libraries provide a dynamic_bitset class[2] whose size is specified at run-time.

The D programming language provides bit arrays in its standard library, Phobos, in std.bitmanip. As in C++, the [] operator does not return a reference, since individual bits are not directly addressable on most hardware; it instead returns a bool.

In Java, the class BitSet creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike the bitset in C++, the Java BitSet does not have a "size" state (it has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a class EnumSet, which represents a Set of values of an enumerated type internally as a bit vector, as a safer alternative to bit fields.

The .NET Framework supplies a BitArray collection class. It stores boolean values, supports random access and bitwise operators, can be iterated over, and its Length property can be changed to grow or truncate it.

Although Standard ML has no support for bit arrays, Standard ML of New Jersey has an extension, the BitArray structure, in its SML/NJ Library. It is not fixed in size and supports set operations and bit operations, including, unusually, shift operations.

Haskell likewise currently lacks standard support for bitwise operations, but both GHC and Hugs provide a Data.Bits module with assorted bitwise functions and operators, including shift and rotate operations, and an "unboxed" array over boolean values may be used to model a bit array, although this lacks support from the former module.

In Perl, strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (~ | & ^),[3] and individual bits can be tested and set using the vec function.[4]

In Ruby, you can access (but not set) a bit of an integer (Fixnum or Bignum) using the bracket operator ([]), as if it were an array of bits.

Apple's Core Foundation library contains CFBitVector and CFMutableBitVector structures.

PL/I supports arrays of bit strings of arbitrary length, which may be either fixed-length or varying. The array elements may be aligned (each element begins on a byte or word boundary) or unaligned (elements immediately follow each other with no padding).

Hardware description languages such as VHDL, Verilog, and SystemVerilog natively support bit vectors, as these are used to model storage elements like flip-flops, hardware buses, and hardware signals in general. In hardware verification languages such as OpenVera, e, and SystemVerilog, bit vectors are used to sample values from the hardware models and to represent data that is transferred to hardware during simulations.

4.2.8 See also

• Bit field
• Arithmetic logic unit
• Bitboard (chess and similar games)
• Bitmap index
• Binary numeral system
• Bitstream
• Judy array

4.2.9 References

[1] std::bitset
[2] boost::dynamic_bitset
[3] http://perldoc.perl.org/perlop.html#Bitwise-String-Operators
[4] http://perldoc.perl.org/functions/vec.html

4.2.10 External links

• mathematical bases by Pr. D.E.Knuth
• vector<bool> Is Nonconforming, and Forces Optimization Choice
• vector<bool>: More Problems, Better Solutions

4.3 Bloom filter

Not to be confused with Bloom shader effect.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be
added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses – an 85–15 form of the Pareto principle.[1]

More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.[2]

4.3.1 Algorithm description

[Figure: an example of a Bloom filter representing the set {x, y, z}, with m = 18 and k = 3. The colored arrows show the positions in the bit array to which each set element is mapped. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0.]

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. Typically, k is a constant, much smaller than m, which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter.

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set; if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with a negligible increase in the false positive rate.[3] Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.

Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. An element maps to k bits, and although setting any one of those k bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits for the element to be removed, clearing any of the bits would introduce the possibility of false negatives.

One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter.

It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.

4.3.2 Space and time advantages

While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes).
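The add and query procedures described in the algorithm section can be sketched as follows. The class, the m-bit array stored in a Python integer, and the use of double hashing over an MD5 digest to derive the k indices (one of the techniques the text discusses for avoiding k independent hash functions) are illustrative assumptions, not a definitive implementation:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = 0   # the m-bit array, all zero when the filter is empty

    def _positions(self, item):
        # Derive k indices from two base hashes (double hashing), instead
        # of designing k independent hash functions.
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set the bit at each of the k array positions.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # If any probed bit is 0, the item is definitely not in the set;
        # if all are 1, it is possibly in the set (false positives happen).
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

A filter like the one in the figure would be BloomFilter(m=18, k=3); after f.add("x"), the query "x" in f is guaranteed true, while a query for an item never added is usually, but not always, false.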
[Figure: a Bloom filter used to avoid unnecessary accesses to a slow storage system. Values are stored on a disk which has slow access times; Bloom filter decisions are much faster, although some unnecessary disk accesses are made when the filter reports a positive.]

A Bloom filter with k = 1 requires that only a small fraction of the bits be set, which means the array must be very large and contain long runs of zeros. The information content of the array relative to its size is low. The generalized Bloom filter (k greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set,[5] and these will be apparently random, minimizing redundancy and maximizing information content.

The expected fraction q of bits still set to zero after n elements have been inserted into a filter of m bits using k hash functions is

E[q] = (1 − 1/m)^(kn)

It is possible to prove, without the independence assumption, using the Azuma–Hoeffding inequality,[7] that q is very strongly concentrated around its expected value.

The number of hash functions, k, must be a positive integer. Putting this constraint aside, for a given m and n, the value of k that minimizes the false positive probability is

k = (m/n) ln 2

This means that for a given false positive probability p, the length of a Bloom filter m is proportionate to the number of elements being filtered n, and the required number of hash functions only depends on the target false positive probability p.[8]

The formula m = −n ln p / (ln 2)² is approximate for three reasons. First, and of least concern, it approximates 1 − 1/m by e^(−1/m).

This bound can be interpreted as saying that the approximate formula (1 − e^(−kn/m))^k can be applied at a penalty of at most half an extra element and at most one fewer bit.
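The relationships above can be checked numerically. A small sketch (the function names are ours):

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive probability: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Number of hash functions minimizing false positives: (m/n) ln 2."""
    return (m / n) * math.log(2)

def required_bits(n, p):
    """Filter length for n elements at target rate p: m = -n ln p / (ln 2)^2."""
    return -n * math.log(p) / math.log(2) ** 2

# For n = 1000 elements at a 1% target rate, about 9.6 bits per element
# suffice, matching the "fewer than 10 bits per element" figure in the text.
m = required_bits(1000, 0.01)        # about 9585 bits
k = round(optimal_k(m, 1000))        # about 7 hash functions
p = false_positive_rate(m, 1000, k)  # about 0.01
```

Plugging the resulting m and k back into the false positive formula recovers the requested 1% rate, which is the consistency the text's derivation asserts.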
4.3.4 Approximating the number of items in a Bloom filter

Swamidass & Baldi (2007) showed that the number of items in a Bloom filter can be approximated with the following formula:

n* = −(m/k) ln(1 − X/m),

where n* is an estimate of the number of items in the filter, m is the length (size) of the filter, k is the number of hash functions, and X is the number of bits set to one.

4.3.5 The union and intersection of sets

Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union between two sets. Bloom filters can be used to approximate the size of the intersection and union of two sets. Swamidass & Baldi (2007) showed that for two Bloom filters of length m, their respective counts can be estimated as

n(A*) = −(m/k) ln(1 − n(A)/m)

and

n(B*) = −(m/k) ln(1 − n(B)/m),

where n(A) and n(B) are the numbers of bits set to one in filters A and B respectively. The size of their union can be estimated as

n(A* ∪ B*) = −(m/k) ln(1 − n(A ∪ B)/m),

where n(A ∪ B) is the number of bits set to one in either of the two Bloom filters. Finally, the intersection can be estimated as

n(A* ∩ B*) = n(A*) + n(B*) − n(A* ∪ B*),

using the three formulas together.

4.3.6 Interesting properties

• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements; adding an element never fails due to the data structure "filling up". However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.

• Union and intersection of Bloom filters with the same size and set of hash functions can be implemented with bitwise OR and AND operations, respectively. The union operation on Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The intersection operation satisfies a weaker property: the false positive probability in the resulting Bloom filter is at most the false positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets.

• Some kinds of superimposed code can be seen as a Bloom filter implemented with physical edge-notched cards. An example is Zatocoding, invented by Calvin Mooers in 1947, in which the set of categories associated with a piece of information is represented by notches on a card, with a random pattern of four notches for each category.

4.3.7 Examples

• Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.[10]

• Google BigTable, Apache HBase, Apache Cassandra, and PostgreSQL[11] use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.[12]

• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).[13][14]

• The Squid Web Proxy Cache uses Bloom filters for cache digests.[15]

• Bitcoin uses Bloom filters to speed up wallet synchronization.[16][17]

• The Venti archival storage system uses Bloom filters to detect previously stored data.[18]

• The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.[19]
• The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called a Bloom join in the database literature).[20]

• The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.[21]

• Medium uses Bloom filters to avoid recommending articles a user has previously read.[22]

4.3.8 Alternatives

Classic Bloom filters use 1.44 log₂(1/ε) bits of space per inserted key, where ε is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only log₂(1/ε) per key.[23] Hence Bloom filters use 44% more space than an equivalent optimal data structure. Instead, Pagh et al. provide an optimal-space data structure. Moreover, their data structure has constant locality of reference independent of the false positive rate, unlike Bloom filters, where a smaller false positive rate ε leads to a greater number of memory accesses per query, log(1/ε). Also, it allows elements to be deleted without a space penalty, unlike Bloom filters.

The same improved properties of optimal space usage, constant locality of reference, and the ability to delete elements are also provided by the cuckoo filter of Fan et al. (2014), an open source implementation of which is available.

Stern & Dill (1996) describe a probabilistic structure based on hash tables, hash compaction, which Dillinger & Manolios (2004b) identify as significantly more accurate than a Bloom filter when each is configured optimally. Dillinger and Manolios, however, point out that the reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is, therefore, attractive when the number of additions can be predicted accurately; however, despite being very fast in software, hash compaction is poorly suited for hardware because of worst-case linear access time.

Putze, Sanders & Singler (2007) have studied some variants of Bloom filters that are either faster or use less space than classic Bloom filters. The basic idea of the fast variant is to locate the k hash values associated with each key into one or two blocks having the same size as the processor's memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory cache misses. The proposed variants have, however, the drawback of using about 32% more space than classic Bloom filters.

The space-efficient variant relies on using a single hash function that generates for each key a value in the range [0, n/ε], where ε is the requested false positive rate. The sequence of values is then sorted and compressed using Golomb coding (or some other compression technique) to occupy a space close to n log₂(1/ε) bits. To query the Bloom filter for a given key, it suffices to check whether its corresponding value is stored in the Bloom filter. Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this problem, the sequence of values is divided into small blocks of equal size that are compressed separately. At query time, only half a block will need to be decompressed on average. Because of decompression overhead, this variant may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to be computed.

Another alternative to the classic Bloom filter is the one based on space-efficient variants of cuckoo hashing. In this case, once the hash table is constructed, the keys stored in the hash table are replaced with short signatures of the keys. Those signatures are strings of bits computed using a hash function applied to the keys.

4.3.9 Extensions and applications

Cache filtering

[Figure: using a Bloom filter to prevent one-hit-wonders from being stored in a web cache decreased the rate of disk writes by nearly one half, reducing the load on the disks and potentially increasing disk performance.[10]]

Content delivery networks deploy web caches around the world to cache and serve web content to users with greater performance and reliability. A key application of Bloom filters is their use in efficiently determining which web objects to store in these web caches. Nearly three-quarters of the URLs accessed from a typical web cache are "one-hit-wonders" that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again. To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request. The use of a Bloom filter in this fashion significantly reduces the disk write workload, since one-hit-wonders are never written to the disk cache. Further, filtering out the one-hit-wonders also saves cache space on disk, increasing the cache hit rates.[10]
…that a tight upper bound of false positive rates is guaranteed, and the method is superior to standard Bloom filters in terms of false positive rates and time efficiency when a small space and an acceptable false positive rate are given.

Scalable Bloom filters

Almeida et al. (2007) proposed a variant of Bloom filters that can adapt dynamically to the number of elements stored, while assuring a minimum false positive probability. The technique is based on sequences of standard Bloom filters with increasing capacity and tighter false positive probabilities, so as to ensure that a maximum false positive probability can be set beforehand, regardless of the number of elements to be inserted.

Layered Bloom filters

A layered Bloom filter consists of multiple Bloom filter layers. Layered Bloom filters allow keeping track of how many times an item was added to the Bloom filter by checking how many layers contain the item. With a layered Bloom filter, a check operation will normally return the deepest layer number the item was found in.[25]

Attenuated Bloom filters

[Figure: attenuated Bloom filter example, searching for pattern 11010 starting from node n1.]

An attenuated Bloom filter of depth D can be viewed as an array of D normal Bloom filters. In the context of service discovery in a network, each node stores regular and attenuated Bloom filters locally. The regular or local Bloom filter indicates which services are offered by the node itself. The attenuated filter of level i indicates which services can be found on nodes that are i hops away from the current node. The i-th value is constructed by taking a union of the local Bloom filters for nodes i hops away from the node.[26]

Take the small network shown in the figure as an example. Say we are searching for a service A whose id hashes to bits 0, 1, and 3 (pattern 11010). Let the n1 node be the starting point. First, we check whether service A is offered by n1 by checking its local filter. Since the patterns don't match, we check the attenuated Bloom filter in order to determine which node should be the next hop. We see that n2 doesn't offer service A but lies on the path to nodes that do. Hence, we move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence the destination is located.[27]

By using attenuated Bloom filters consisting of multiple layers, services at more than one hop distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting out) bits set by sources further away.[26]

Chemical structure searching

Bloom filters are often used to search large chemical structure databases (see chemical similarity). In the simplest case, the elements added to the filter (called a fingerprint in this field) are just the atomic numbers present in the molecule, or a hash based on the atomic number of each atom and the number and type of its bonds. This case is too simple to be useful. More advanced filters also encode atom counts, larger substructure features like carboxyl groups, and graph properties like the number of rings. In hash-based fingerprints, a hash function based on atom and bond properties is used to turn a subgraph into a PRNG seed, and the first output values are used to set bits in the Bloom filter.

Molecular fingerprints started in the late 1940s as a way to search for chemical structures on punched cards. However, it wasn't until around 1990 that Daylight introduced a hash-based method to generate the bits, rather than use a precomputed table. Unlike the dictionary approach, the hash method can assign bits for substructures which hadn't previously been seen. In the early 1990s, the term "fingerprint" was considered different from "structural keys", but the term has since grown to encompass most molecular characteristics which can be used for a similarity comparison, including structural keys, sparse count fingerprints, and 3D fingerprints. Unlike Bloom filters, the Daylight hash method allows the number of bits assigned per feature to be a function of the feature size, but most implementations of Daylight-like fingerprints use a fixed number of bits per feature, which makes them a Bloom filter. The original Daylight fingerprints could be used for both similarity and screening purposes. Many other fingerprint types, like the popular ECFP2, can be used for similarity but not for screening, because they include local environmental characteristics that introduce false negatives when used as a screen. Even if these are constructed with the same mechanism, they are not Bloom filters because they cannot be used to filter.

4.3.10 See also

• Count–min sketch
• Feature hashing
118 CHAPTER 4. SETS
[12] Chang et al. (2006); Apache Software Foundation (2012). • Blustein, James; El-Maazawi, Amal (2002), “opti-
mal case for general Bloom filters”, Bloom Filters
[13] Yakunin, Alex (2010-03-25). “Alex Yakunin’s blog: Nice — A Tutorial, Analysis, and Survey, Dalhousie Uni-
Bloom filter application”. Blog.alexyakunin.com. Re-
versity Faculty of Computer Science, pp. 1–31
trieved 2014-05-31.
[23] Pagh, Pagh & Rao (2005). • Byers, John W.; Considine, Jeffrey; Mitzenmacher,
Michael; Rost, Stanislav (2004), “Informed con-
[24] Pournaras, Warnier & Brazier (2013). tent delivery across adaptive overlay networks”,
IEEE/ACM Transactions on Networking, 12 (5):
[25] Zhiwang, Jungang & Jian (2010).
767, doi:10.1109/TNET.2004.836103
[26] Koucheryavy et al. (2009).
• Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay;
[27] Kubiatowicz et al. (2000). Hsieh, Wilson; Wallach, Deborah; Burrows, Mike;
4.3. BLOOM FILTER 119
Chandra, Tushar; Fikes, Andrew; Gruber, Robert • Eppstein, David; Goodrich, Michael T. (2007),
(2006), “Bigtable: A Distributed Storage System for “Space-efficient straggler identification in round-
Structured Data”, Seventh Symposium on Operating trip data streams via Newton’s identities and in-
System Design and Implementation vertible Bloom filters”, Algorithms and Data Struc-
tures, 10th International Workshop, WADS 2007,
• Charles, Denis; Chellapilla, Kumar (2008), Springer-Verlag, Lecture Notes in Computer Sci-
“Bloomier Filters: A second look”, The Computing ence 4619, pp. 637–648, arXiv:0704.3313
Research Repository (CoRR), arXiv:0807.0928
120 CHAPTER 4. SETS
4.4 MinHash
sets A and B. In other words, if r is the random variable that is one when hᵢ(A) = hᵢ(B) and zero otherwise, then r is an unbiased estimator of J(A,B). However, r has too high a variance to be a useful estimator for the Jaccard similarity on its own, since it is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.
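The averaging idea can be sketched as follows. This is an illustrative Python sketch, not code from the article: the choice of k = 128 signature entries and the use of salted SHA-1 digests as stand-ins for the random hash functions hᵢ are assumptions made for the example.

```python
import hashlib

def minhash_signature(s, k=128):
    """Compute a MinHash signature of set s using k salted hash functions."""
    sig = []
    for i in range(k):
        # h_i(x): a cheap stand-in for the i-th random hash function
        h = lambda x, i=i: int(hashlib.sha1(f"{i}:{x}".encode()).hexdigest(), 16)
        sig.append(min(h(x) for x in s))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Average of the 0/1 indicators r_i = [h_i(A) == h_i(B)]."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "brown", "dog"}
est = estimate_jaccard(minhash_signature(A), minhash_signature(B))
# The true Jaccard index is |A ∩ B| / |A ∪ B| = 3/5, so est should lie near 0.6.
```

Each indicator has expectation J(A,B), so the average over k of them is still unbiased but has standard deviation proportional to 1/√k.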
of sampled elements due to the possibility that two different hash functions may have the same minima. However, when k is small relative to the sizes of the sets, this difference is negligible.

By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/√k), matching the performance of the multiple-hash-function scheme.

Time analysis

The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme. Therefore, when ε and k are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in linear time on the size of the set, so when many pairwise similarities need to be estimated this method can lead to a substantial savings in running time compared to doing a full comparison of the members of each set. Specifically, for set size n the many-hash variant takes O(nk) time. The single-hash variant is generally faster, requiring O(n) time to maintain the queue of minimum hash values assuming n >> k.[1]

4.4.3 Min-wise independent permutations

In order to implement the MinHash scheme as described above, one needs the hash function h to define a random permutation on n elements, where n is the total number of distinct elements in the union of all of the sets to be compared. But because there are n! different permutations, it would require Ω(n log n) bits just to specify a truly random permutation, an infeasibly large number for even moderate values of n. Because of this fact, by analogy to the theory of universal hashing, there has been significant work on finding a family of permutations that is "min-wise independent", meaning that for any subset of the domain, any element is equally likely to be the minimum. It has been established that a min-wise independent family of permutations must include at least lcm(1, 2, ..., n) ≥ e^(n−o(n)) different permutations.

4.4.4 Applications

The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents.[1][2][6] Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.[7]

In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning. Given a database in which each entry has multiple attributes (viewed as a 0–1 matrix with a row per database entry and a column per attribute) they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those pairs to determine the ones whose frequencies of co-occurrence are below a given strict threshold.[8]

4.4.5 Other uses

The MinHash scheme may be seen as an instance of locality sensitive hashing, a collection of techniques for using hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing has important applications in nearest neighbor search algorithms.[9] For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.[10]

4.4.6 Evaluation and benchmarks
[8] Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C. (2001), "Finding interesting associations without support pruning", IEEE Transactions on Knowledge and Data Engineering, 13 (1): 64–78, doi:10.1109/69.908981.

[9] Andoni, Alexandr; Indyk, Piotr (2008), "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", Communications of the ACM, 51 (1): 117–122, doi:10.1145/1327452.1327494.

[10] Zadeh, Reza; Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082.

[11] Henzinger, Monika (2006), "Finding near-duplicate web pages: a large-scale evaluation of algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (PDF), doi:10.1145/1148170.1148222.

4.5 Disjoint-set data structure

MakeSet creates 8 singletons.

After some operations of Union, some sets are grouped together.

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that keeps track of a set of elements partitioned into a number of disjoint (nonoverlapping) subsets. It supports two useful operations:

• Find: Determine which subset a particular element is in. Find typically returns an item from this set that serves as its "representative"; by comparing the result of two Find operations, one can determine whether two elements are in the same subset.
• Union: Join two subsets into a single subset.

The other important operation, MakeSet, which makes a set containing only a given element (a singleton), is generally trivial. With these three operations, many practical partitioning problems can be solved (see the Applications section).

In order to define these operations more precisely, some way of representing the sets is needed. One common approach is to select a fixed element of each set, called its representative, to represent the set as a whole. Then, Find(x) returns the representative of the set that x belongs to, and Union takes two set representatives as its arguments.

4.5.1 Disjoint-set linked lists

A simple disjoint-set data structure uses a linked list for each set. The element at the head of each list is chosen as its representative.

MakeSet creates a list of one element. Union appends the two lists, a constant-time operation if the list carries a pointer to its tail. The drawback of this implementation is that Find requires O(n) or linear time to traverse the list backwards from a given element to the head of the list.

This can be avoided by including in each linked list node a pointer to the head of the list; then Find takes constant time, since this pointer refers directly to the set representative. However, Union now has to update each element of the list being appended to make it point to the head of the new combined list, requiring O(n) time.

When the length of each list is tracked, the required time can be improved by always appending the smaller list to the longer. Using this weighted-union heuristic, a sequence of m MakeSet, Union, and Find operations on n elements requires O(m + n log n) time.[2] For asymptotically faster operations, a different data structure is needed.

Analysis of the naive approach

We now explain the bound O(n log(n)) above.

Suppose you have a collection of lists, and each node of each list contains an object, the name of the list to which it belongs, and the number of elements in that list. Also assume that the total number of elements in all lists is n (i.e. there are n elements overall). We wish to be able to merge any two of these lists, and update all of their nodes so that they still contain the name of the list to which they belong. The rule for merging the lists A and B is that if A is larger than B then merge the elements of B into A and update the elements that used to belong to B, and vice versa.

Choose an arbitrary element of list L, say x. We wish to count how many times in the worst case x will need to have the name of the list to which it belongs updated. The element x will only have its name updated when the list it belongs to is merged with another list of the same size or of greater size. Each time that happens, the size of the list to which x belongs at least doubles. So finally, the question is "how many times can a number double before it is the size of n?" (then the list containing x will contain all n elements). The answer is exactly log2(n). So for any given element of any given list in the structure described, it will need to be updated log2(n) times in the worst case. Therefore, updating a list of n elements stored in this way takes O(n log(n)) time in the worst case. A find operation can be done in O(1) for this structure, because each node contains the name of the list to which it belongs.

A similar argument holds for merging the trees in the data structures discussed below. Additionally, it helps explain the time analysis of some operations in the binomial heap and Fibonacci heap data structures.

4.5.2 Disjoint-set forests

Disjoint-set forests are data structures where each set is represented by a tree data structure, in which each node holds a reference to its parent node (see parent pointer tree). They were first described by Bernard A. Galler and Michael J. Fischer in 1964,[3] although their precise analysis took years.

In a disjoint-set forest, the representative of each set is the root of that set's tree. Find follows parent nodes until it reaches the root. Union combines two trees into one by attaching the root of one to the root of the other.

Implementation

Naive One way of implementing these might be:

function MakeSet(x)
    x.parent := x

function Find(x)
    if x.parent == x
        return x
    else
        return Find(x.parent)

function Union(x, y)
    xRoot := Find(x)
    yRoot := Find(y)
    xRoot.parent := yRoot

In this naive form, this approach is no better than the linked-list approach, because the tree it creates can be highly unbalanced.

Union by rank The previous implementation can be enhanced in two ways.

The first way, called union by rank, is to always attach the smaller tree to the root of the larger tree. Since it is the depth of the tree that affects the running time, the tree with smaller depth gets added under the root of the deeper tree, which only increases the depth if the depths were equal. In the context of this algorithm, the term rank is used instead of depth, since it stops being equal to the depth if path compression (described below) is also used. One-element trees are defined to have a rank of zero, and whenever two trees of the same rank r are united, the rank of the result is r + 1.
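For comparison with the pseudocode above, union by rank can be sketched in Python together with the path compression heuristic just mentioned. This is an illustrative sketch, not code from the article; the dictionary-based representation and the method names are conventions chosen for the example.

```python
class DisjointSet:
    """Disjoint-set forest with union by rank and path compression."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x
        self.rank[x] = 0          # one-element trees have rank zero

    def find(self, x):
        # Path compression: point x directly at the root on the way back up.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        x_root, y_root = self.find(x), self.find(y)
        if x_root == y_root:
            return
        # Union by rank: attach the shallower tree under the deeper root.
        if self.rank[x_root] < self.rank[y_root]:
            x_root, y_root = y_root, x_root
        self.parent[y_root] = x_root
        if self.rank[x_root] == self.rank[y_root]:
            self.rank[x_root] += 1

ds = DisjointSet()
for v in range(1, 9):
    ds.make_set(v)            # MakeSet creates 8 singletons, as in the figure
ds.union(1, 2)
ds.union(3, 4)
ds.union(2, 5)                # {1, 2, 5} and {3, 4} are now grouped together
```

Comparing the results of two `find` calls answers the "same subset?" question, exactly as described for Find above.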
4.5.6 References

[1] Tarjan, Robert Endre (1975). "Efficiency of a Good But Not Linear Set Union Algorithm". Journal of the ACM. 22 (2): 215–225. doi:10.1145/321879.321884.

[2] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), "Chapter 21: Data structures for Disjoint Sets", Introduction to Algorithms (Second ed.), MIT Press, pp. 498–524, ISBN 0-262-03293-7

[3] Galler, Bernard A.; Fischer, Michael J. (May 1964), "An improved equivalence algorithm", Communications of the ACM, 7 (5): 301–303, doi:10.1145/364099.364331. The paper originating disjoint-set forests.

[4] Fredman, M.; Saks, M. (May 1989), "The cell probe complexity of dynamic data structures", Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing: 345–354, Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Find's and n−1 Union's, beginning with n singleton sets.

[5] Knight, Kevin (1989). "Unification: A multidisciplinary survey". ACM Computing Surveys. 21: 93–124. doi:10.1145/62029.62030.

[6] Tarjan, Robert E.; van Leeuwen, Jan (1984), "Worst-case analysis of set union algorithms", Journal of the ACM, 31 (2): 245–281, doi:10.1145/62.2160

[7] Hopcroft, J. E.; Ullman, J. D. (1973). "Set Merging Algorithms". SIAM Journal on Computing. 2 (4): 294–303. doi:10.1137/0202024.

[8] Conchon, Sylvain; Filliâtre, Jean-Christophe (October 2007), "A Persistent Union-Find Data Structure", ACM SIGPLAN Workshop on ML, Freiburg, Germany

4.5.7 External links

• C++ implementation, part of the Boost C++ libraries

• A Java implementation with an application to color image segmentation, Statistical Region Merging (SRM), IEEE Trans. Pattern Anal. Mach. Intell. 26(11): 1452–1458 (2004)

• Java applet: A Graphical Union–Find Implementation, by Rory L. P. McGuire

4.6 Partition refinement

In the design of algorithms, partition refinement is a technique for representing a partition of a set as a data structure that allows the partition to be refined by splitting its sets into a larger number of smaller sets. In that sense it is dual to the union-find data structure, which also maintains a partition into disjoint sets but in which the operations merge pairs of sets together.

Partition refinement forms a key component of several efficient algorithms on graphs and finite automata, including DFA minimization, the Coffman–Graham algorithm for parallel scheduling, and lexicographic breadth-first search of graphs.[1][2][3]

4.6.1 Data structure

A partition refinement algorithm maintains a family of disjoint sets Si. At the start of the algorithm, this family contains a single set of all the elements in the data structure. At each step of the algorithm, a set X is presented to the algorithm, and each set Si in the family that contains members of X is split into two sets, the intersection Si ∩ X and the difference Si \ X.

Such an algorithm may be implemented efficiently by maintaining data structures representing the following information:[4][5]

• The ordered sequence of the sets Si in the family, in a form such as a doubly linked list that allows new sets to be inserted into the middle of the sequence

• Associated with each set Si, a collection of its elements, in a form such as a doubly linked list or array data structure that allows for rapid deletion of individual elements from the collection. Alternatively, this component of the data structure may be represented by storing all of the elements of all of the sets in a single array, sorted by the identity of the set they belong to, and by representing the collection of elements in any set Si by its starting and ending positions in this array.

• Associated with each element, the set it belongs to.

To perform a refinement operation, the algorithm loops through the elements of the given set X. For each such
element x, it finds the set Si that contains x, and checks whether a second set for Si ∩ X has already been started. If not, it creates the second set and adds Si to a list L of the sets that are split by the operation. Then, regardless of whether a new set was formed, the algorithm removes x from Si and adds it to Si ∩ X. In the representation in which all elements are stored in a single array, moving x from one set to another may be performed by swapping x with the final element of Si and then decrementing the end index of Si and the start index of the new set. Finally, after all elements of X have been processed in this way, the algorithm loops through L, separating each current set Si from the second set that has been split from it, and reports both of these sets as being newly formed by the refinement operation.

The time to perform a single refinement operation in this way is O(|X|), independent of the number of elements in the family of sets and also independent of the total number of sets in the data structure. Thus, the time for a sequence of refinements is proportional to the total size of the sets given to the algorithm in each refinement step.

4.6.2 Applications

An early application of partition refinement was in an algorithm by Hopcroft (1971) for DFA minimization. In this problem, one is given as input a deterministic finite automaton, and must find an equivalent automaton with as few states as possible. Hopcroft's algorithm maintains a partition of the states of the input automaton into subsets, with the property that any two states in different subsets must be mapped to different states of the output automaton. Initially, there are two subsets, one containing all the accepting states of the automaton and one containing the remaining states. At each step one of the subsets Si and one of the input symbols x of the automaton are chosen, and the subsets of states are refined into states for which a transition labeled x would lead to Si, and states for which an x-transition would lead somewhere else. When a set Si that has already been chosen is split by a refinement, only one of the two resulting sets (the smaller of the two) needs to be chosen again; in this way, each state participates in the sets X for O(s log n) refinement steps and the overall algorithm takes time O(ns log n), where n is the number of initial states and s is the size of the alphabet.[6]

Partition refinement was applied by Sethi (1976) in an efficient implementation of the Coffman–Graham algorithm for parallel scheduling. Sethi showed that it could be used to construct a lexicographically ordered topological sort of a given directed acyclic graph in linear time; this lexicographic topological ordering is one of the key steps of the Coffman–Graham algorithm. In this application, the elements of the disjoint sets are vertices of the input graph and the sets X used to refine the partition are sets of neighbors of vertices. Since the total number of neighbors of all vertices is just the number of edges in the graph, the algorithm takes time linear in the number of edges, its input size.[7]

Partition refinement also forms a key step in lexicographic breadth-first search, a graph search algorithm with applications in the recognition of chordal graphs and several other important classes of graphs. Again, the disjoint set elements are vertices and the sets X represent sets of neighbors, so the algorithm takes linear time.[8][9]

4.6.3 See also

• Refinement (sigma algebra)

4.6.4 References

[1] Paige, Robert; Tarjan, Robert E. (1987), "Three partition refinement algorithms", SIAM Journal on Computing, 16 (6): 973–989, doi:10.1137/0216062, MR 917035.

[2] Habib, Michel; Paul, Christophe; Viennot, Laurent (1999), "Partition refinement techniques: an interesting algorithmic tool kit", International Journal of Foundations of Computer Science, 10 (2): 147–170, doi:10.1142/S0129054199000125, MR 1759929.

[3] Habib, Michel; Paul, Christophe; Viennot, Laurent (1998), "A synthesis on partition refinement: a useful routine for strings, graphs, Boolean matrices and automata", STACS 98 (Paris, 1998), Lecture Notes in Computer Science, 1373, Springer-Verlag, pp. 25–38, doi:10.1007/BFb0028546, MR 1650757.

[4] Valmari, Antti; Lehtinen, Petri (2008). "Efficient minimization of DFAs with partial transition functions". In Albers, Susanne; Weil, Pascal. 25th International Symposium on Theoretical Aspects of Computer Science (STACS 2008). Leibniz International Proceedings in Informatics (LIPIcs). 1. Dagstuhl, Germany: Schloss Dagstuhl: Leibniz-Zentrum fuer Informatik. pp. 645–656. doi:10.4230/LIPIcs.STACS.2008.1328. ISBN 978-3-939897-06-4. ISSN 1868-8969.

[5] Knuutila, Timo (2001). "Re-describing an algorithm by Hopcroft". Theoretical Computer Science. 250 (1–2): 333–363. doi:10.1016/S0304-3975(99)00150-4. ISSN 0304-3975.

[6] Hopcroft, John (1971), "An n log n algorithm for minimizing states in a finite automaton", Theory of machines and computations (Proc. Internat. Sympos., Technion, Haifa, 1971), New York: Academic Press, pp. 189–196, MR 0403320.

[7] Sethi, Ravi (1976), "Scheduling graphs on two processors", SIAM Journal on Computing, 5 (1): 73–82, doi:10.1137/0205005, MR 0398156.

[8] Rose, D. J.; Tarjan, R. E.; Lueker, G. S. (1976), "Algorithmic aspects of vertex elimination on graphs", SIAM Journal on Computing, 5 (2): 266–283, doi:10.1137/0205021.
Priority queues
5.1 Priority queue

In computer science, a priority queue is an abstract data type which is like a regular queue or stack data structure, but where additionally each element has a "priority" associated with it. In a priority queue, an element with high priority is served before an element with low priority. If two elements have the same priority, they are served according to their order in the queue.

While priority queues are often implemented with heaps, they are conceptually distinct from heaps. A priority queue is an abstract concept like "a list" or "a map"; just as a list can be implemented with a linked list or an array, a priority queue can be implemented with a heap or a variety of other methods such as an unordered array.

5.1.1 Operations

A priority queue must at least support the following operations:

• insert_with_priority: add an element to the queue with an associated priority.

• pull_highest_priority_element: remove the element from the queue that has the highest priority, and return it.

This is also known as "pop_element(Off)", "get_maximum_element" or "get_front(most)_element".

Some conventions reverse the order of priorities, considering lower values to be higher priority, so this may also be known as "get_minimum_element", and is often referred to as "get-min" in the literature.

This may instead be specified as separate "peek_at_highest_priority_element" and "delete_element" functions, which can be combined to produce "pull_highest_priority_element".

In addition, peek (in this context often called find-max or find-min), which returns the highest-priority element but does not modify the queue, is very frequently implemented, and nearly always executes in O(1) time. This operation and its O(1) performance are crucial to many applications of priority queues.

More advanced implementations may support more complicated operations, such as pull_lowest_priority_element, inspecting the first few highest- or lowest-priority elements, clearing the queue, clearing subsets of the queue, performing a batch insert, merging two or more queues into one, incrementing the priority of any element, etc.

5.1.2 Similarity to queues

One can imagine a priority queue as a modified queue, but when one would get the next element off the queue, the highest-priority element is retrieved first.

Stacks and queues may be modeled as particular kinds of priority queues. As a reminder, here is how stacks and queues behave:

• stack – elements are pulled in last-in first-out order (e.g., a stack of papers)

• queue – elements are pulled in first-in first-out order (e.g., a line in a cafeteria)

In a stack, the priority of each inserted element is monotonically increasing; thus, the last element inserted is always the first retrieved. In a queue, the priority of each inserted element is monotonically decreasing; thus, the first element inserted is always the first retrieved.

5.1.3 Implementation

Naive implementations

There are a variety of simple, usually inefficient, ways to implement a priority queue. They provide an analogy to help one understand what a priority queue is. For instance, one can keep all the elements in an unsorted list. Whenever the highest-priority element is requested,
search through all elements for the one with the highest For applications that do many "peek" operations for ev-
priority. (In big O notation: O(1) insertion time, O(n) pull ery “extract-min” operation, the time complexity for peek
time due to search.) actions can be reduced to O(1) in all tree and heap imple-
mentations by caching the highest priority element after
every insertion and removal. For insertion, this adds at
Usual implementation most a constant cost, since the newly inserted element is
compared only to the previously cached minimum ele-
To improve performance, priority queues typically use ment. For deletion, this at most adds an additional “peek”
a heap as their backbone, giving O(log n) performance cost, which is typically cheaper than the deletion cost, so
for inserts and removals, and O(n log n) to build initially. overall time complexity is not significantly impacted.
Variants of the basic heap data structure such as pairing
Monotone priority queues are specialized queues that are
heaps or Fibonacci heaps can provide better bounds for some operations.[1]

Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time, although building trees from existing sequences of elements takes O(n log n) time; this is typical where one might already have access to these data structures, such as with third-party or standard libraries.

From a computational-complexity standpoint, priority queues are congruent to sorting algorithms. See the next section for how efficient sorting algorithms can create efficient priority queues.

Specialized heaps

There are several specialized heap data structures that either supply additional operations or outperform heap-based implementations for specific types of keys, specifically integer keys.

• When the set of keys is {1, 2, ..., C}, and only insert, find-min and extract-min are needed, a bucket queue can be constructed as an array of C linked lists plus a pointer top, initially C. Inserting an item with key k appends the item to the k'th list and updates top ← min(top, k), both in constant time. Extract-min deletes and returns one item from the list with index top, then increments top if needed until it again points to a non-empty list; this takes O(C) time in the worst case. These queues are useful for sorting the vertices of a graph by their degree.[2]:374

• For the set of keys {1, 2, ..., C}, a van Emde Boas tree would support the minimum, maximum, insert, delete, search, extract-min, extract-max, predecessor and successor operations in O(log log C) time, but has a space cost for small queues of about O(2^(m/2)), where m is the number of bits in the priority value.[3]

• The fusion tree algorithm by Fredman and Willard implements the minimum operation in O(1) time and the insert and extract-min operations in O(√(log n)) time. However, the authors state that, “Our algorithms have theoretical interest only; the constant factors involved in the execution times preclude practicality.”[4]

Monotone priority queues are specialized queues that are optimized for the case where no item is ever inserted that has a lower priority (in the case of a min-heap) than any item previously extracted. This restriction is met by several practical applications of priority queues.

Summary of running times

In the following time complexities,[5] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[9]

[2] Amortized time.

[3] Lower bound of Ω(log log n),[12] upper bound of O(2^(2√(log log n))).[13]

[4] n is the size of the larger heap.

5.1.4 Equivalence of priority queues and sorting algorithms

Using a priority queue to sort

The semantics of priority queues naturally suggest a sorting method: insert all the elements to be sorted into a priority queue, and sequentially remove them; they will come out in sorted order. This is actually the procedure used by several sorting algorithms, once the layer of abstraction provided by the priority queue is removed. This sorting method is equivalent to the following sorting algorithms:

Using a sorting algorithm to make a priority queue

A sorting algorithm can also be used to implement a priority queue. Specifically, Thorup says:[14]

We present a general deterministic linear space reduction from priority queues to sorting implying that if we can sort up to n keys in
S(n) time per key, then there is a priority queue supporting delete and insert in O(S(n)) time and find-min in constant time.

That is, if there is a sorting algorithm which can sort in O(S) time per key, where S is some function of n and word size,[15] then one can use the given procedure to create a priority queue where pulling the highest-priority element takes O(1) time, and inserting new elements (and deleting elements) takes O(S) time. For example, if one has an O(n log log n) sorting algorithm, one can create a priority queue with O(1) pulling and O(log log n) insertion.

5.1.5 Libraries

A priority queue is often considered to be a “container data structure”.

The Standard Template Library (STL), and the C++ 1998 standard, specifies priority_queue as one of the STL container adaptor class templates. However, it does not specify how two elements with the same priority should be served, and indeed, common implementations will not return them according to their order in the queue. It implements a max-priority-queue, and has three parameters: a comparison object for sorting such as a function object (defaults to less<T> if unspecified), the underlying container for storing the data structures (defaults to std::vector<T>), and two iterators to the beginning and end of a sequence. Unlike actual STL containers, it does not allow iteration of its elements (it strictly adheres to its abstract data type definition). The STL also has utility functions for manipulating another random-access container as a binary max-heap. The Boost C++ libraries also have an implementation in the library heap.

Python's heapq module implements a binary min-heap on top of a list.

Java's library contains a PriorityQueue class, which implements a min-priority-queue.

Go's library contains a container/heap module, which implements a min-heap on top of any compatible data structure.

The Standard PHP Library extension contains the class SplPriorityQueue.

Apple's Core Foundation framework contains a CFBinaryHeap structure, which implements a min-heap.

5.1.6 Applications

Bandwidth management

Priority queuing can be used to manage limited resources such as bandwidth on a transmission line from a network router. In the event of outgoing traffic queuing due to insufficient bandwidth, all other queues can be halted to send the traffic from the highest priority queue upon arrival. This ensures that the prioritized traffic (such as real-time traffic, e.g. an RTP stream of a VoIP connection) is forwarded with the least delay and the least likelihood of being rejected due to a queue reaching its maximum capacity. All other traffic can be handled when the highest priority queue is empty. Another approach used is to send disproportionately more traffic from higher priority queues.

Many modern protocols for local area networks also include the concept of priority queues at the media access control (MAC) sub-layer to ensure that high-priority applications (such as VoIP or IPTV) experience lower latency than other applications which can be served with best-effort service. Examples include IEEE 802.11e (an amendment to IEEE 802.11 which provides quality of service) and ITU-T G.hn (a standard for high-speed local area networking using existing home wiring (power lines, phone lines and coaxial cables)).

Usually a limitation (policer) is set to limit the bandwidth that traffic from the highest priority queue can take, in order to prevent high-priority packets from choking off all other traffic. This limit is usually never reached due to high-level control instances such as the Cisco Callmanager, which can be programmed to inhibit calls which would exceed the programmed bandwidth limit.

Discrete event simulation

Another use of a priority queue is to manage the events in a discrete event simulation. The events are added to the queue with their simulation time used as the priority. The execution of the simulation proceeds by repeatedly pulling the top of the queue and executing the event thereon.

See also: Scheduling (computing), queueing theory

Dijkstra's algorithm

When the graph is stored in the form of an adjacency list or matrix, a priority queue can be used to extract the minimum efficiently when implementing Dijkstra's algorithm, although one also needs the ability to alter the priority of a particular vertex in the priority queue efficiently.

Huffman coding

Best-first search algorithms

Best-first search algorithms, like the A* search algorithm, find the shortest path between two vertices or nodes of a weighted graph, trying out the most promising routes first. A priority queue (also known as the fringe) is used to keep track of unexplored routes; the one for which the estimate (a lower bound in the case of A*) of the total path length is smallest is given highest priority. If memory limitations make best-first search impractical, variants like the SMA* algorithm can be used instead, with a double-ended priority queue to allow removal of low-priority items.

ROAM triangulation algorithm

The Real-time Optimally Adapting Meshes (ROAM) algorithm computes a dynamically changing triangulation of a terrain. It works by splitting triangles where more detail is needed and merging them where less detail is needed. The algorithm assigns each triangle in the terrain a priority, usually related to the error decrease if that triangle would be split. The algorithm uses two priority queues, one for triangles that can be split and another for triangles that can be merged. In each step the triangle from the split queue with the highest priority is split, or the triangle from the merge queue with the lowest priority is merged with its neighbours.

Prim's algorithm for minimum spanning tree

Using a min heap priority queue in Prim's algorithm to find the minimum spanning tree of a connected and undirected graph, one can achieve a good running time. This min heap priority queue uses the min heap data structure, which supports operations such as insert, minimum, extract-min and decrease-key.[16] In this implementation, the weight of the edges is used to decide the priority of the vertices: the lower the weight, the higher the priority, and the higher the weight, the lower the priority.[17]

5.1.7 See also

• Batch queue

• Command queue

• Job scheduler

5.1.8 References

[3] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proceedings of the 16th Annual Symposium on Foundations of Computer Science, pages 75–84. IEEE Computer Society, 1975.

[4] Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. Journal of Computer and System Sciences, 48(3):533–551, 1994.

[5] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.

[6] Fredman, Michael Lawrence; Tarjan, Robert E. (July 1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[7] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, arXiv:1110.4428, doi:10.1007/3-540-44985-X_5, ISBN 3-540-67690-2.

[8] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues” (PDF), Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 52–58.

[9] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341. ISBN 0-471-46983-1.

[10] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.

[11] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th Symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.

[12] Fredman, Michael Lawrence (July 1999). “On the Efficiency of Pairing Heaps and Related Data Structures” (PDF). Journal of the Association for Computing Machinery. 46 (4): 473–501. doi:10.1145/320211.320214.

[13] Pettie, Seth (2005). Towards a Final Analysis of Pairing Heaps (PDF). FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science. pp. 174–183. CiteSeerX 10.1.1.549.471. doi:10.1109/SFCS.2005.75. ISBN 0-7695-2468-0.
[17] “Prim's Algorithm”. GeeksforGeeks. Retrieved 12 September 2014.

5.1.9 Further reading

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 6.5: Priority queues, pp. 138–142.

5.1.10 External links

• C++ reference for std::priority_queue

• Descriptions by Lee Killough

• PQlib - Open source Priority Queue library for C

• libpqueue is a generic priority queue (heap) implementation (in C) used by the Apache HTTP Server project.

• Survey of known priority queue structures by Stefan Xenos

• UC Berkeley - Computer Science 61B - Lecture 24: Priority Queues (video) - introduction to priority queues using binary heap

5.2 Bucket queue

In the design and analysis of data structures, a bucket queue[1] (also called a bucket priority queue[2] or bounded-height priority queue[3]) is a priority queue for prioritizing elements whose priorities are small integers. It has the form of an array of buckets: an array data structure, indexed by the priorities, whose cells contain buckets of items with the same priority as each other.

The bucket queue is the priority-queue analogue of pigeonhole sort (also called bucket sort), a sorting algorithm that places elements into buckets indexed by their priorities and then concatenates the buckets. Using a bucket queue as the priority queue in a selection sort gives a form of the pigeonhole sort algorithm.

Applications of the bucket queue include computation of the degeneracy of a graph as well as fast algorithms for shortest paths and widest paths for graphs with weights that are small integers or are already sorted. Its first use[2] was in a shortest path algorithm by Dial (1969).[4]

5.2.1 Basic data structure

This structure can handle the insertions and deletions of elements with integer priorities in the range from 0 to some known bound C, as well as operations that find the element with minimum (or maximum) priority. It consists of an array A of container data structures, where array cell A[p] stores the collection of elements with priority p. It can handle the following operations:

• To insert an element x with priority p, add x to the container at A[p].

• To remove an element x with priority p, remove x from the container at A[p].

• To find an element with the minimum priority, perform a sequential search to find the first non-empty container, and then choose an arbitrary element from this container.

In this way, insertions and deletions take constant time, while finding the minimum priority element takes time O(C).[1][3]

5.2.2 Optimizations

As an optimization, the data structure can also maintain an index L that lower-bounds the minimum priority of an element. When inserting a new element, L should be updated to the minimum of its old value and the new element's priority. When searching for the minimum priority element, the search can start at L instead of at zero, and after the search L should be left equal to the priority that was found in the search.[3] In this way the time for a search is reduced to the difference between the previous lower bound and its next value; this difference could be significantly smaller than C. For applications of monotone priority queues such as Dijkstra's algorithm, in which the minimum priorities form a monotonic sequence, the sum of these differences is at most C, so the total time for a sequence of n operations is O(n + C), rather than the slower O(nC) time bound that would result without this optimization.

Another optimization (already given by Dial 1969) can be used to save space when the priorities are monotonic and, at any point in time, fall within a range of r values rather than extending over the whole range from 0 to C. In this case, one can index the array by the priorities modulo r rather than by their actual values. The search for the minimum priority element should always begin at the previous minimum, to avoid priorities that are higher than the minimum but have lower moduli.[1]

5.2.3 Applications

A bucket queue can be used to maintain the vertices of an undirected graph, prioritized by their degrees, and repeatedly find and remove the vertex of minimum degree.[3] This greedy algorithm can be used to calculate the degeneracy of a given graph. It takes linear time, with or without the optimization that maintains a lower bound on the minimum priority.
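The basic operations and the lower-bound optimization described above can be sketched as follows. This is an illustrative Python sketch under the stated assumptions (integer priorities in the range 0..C, one set per bucket); the class and method names are mine, not from any particular library:

```python
class BucketQueue:
    """Illustrative bucket queue: priorities are integers in 0..C."""

    def __init__(self, C):
        # A[p] holds the bucket of elements with priority p
        self.buckets = [set() for _ in range(C + 1)]
        self.L = 0  # index that lower-bounds the minimum priority

    def insert(self, x, p):
        self.buckets[p].add(x)      # O(1)
        self.L = min(self.L, p)     # maintain the lower bound

    def remove(self, x, p):
        self.buckets[p].discard(x)  # O(1)

    def find_min(self):
        # sequential search starting at L rather than at zero;
        # L is left equal to the priority that was found
        while not self.buckets[self.L]:
            self.L += 1
        return next(iter(self.buckets[self.L])), self.L

q = BucketQueue(10)
q.insert("a", 5)
q.insert("b", 3)
x, p = q.find_min()   # ("b", 3)
q.remove(x, p)
x, p = q.find_min()   # ("a", 5)
```

Over a monotone sequence of operations the pointer L only moves forward, which is what gives the O(n + C) total bound discussed above.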
5.2.4 References

[1] Mehlhorn, Kurt; Sanders, Peter (2008), “10.5.1 Bucket Queues”, Algorithms and Data Structures: The Basic Toolbox, Springer, p. 201, ISBN 9783540779773.

[2] Edelkamp, Stefan; Schroedl, Stefan (2011), “3.1.1 Bucket Data Structures”, Heuristic Search: Theory and Applications, Elsevier, pp. 90–92, ISBN 9780080919737. See also p. 157 for the history and naming of this structure.

[3] Skiena, Steven S. (1998), The Algorithm Design Manual, Springer, p. 181, ISBN 9780387948607.

[4] Dial, Robert B. (1969), “Algorithm 360: Shortest-path forest with topological ordering [H]”, Communications of the ACM, 12 (11): 632–633, doi:10.1145/363269.363610.

[5] Matula, D. W.; Beck, L. L. (1983), “Smallest-last ordering and clustering and graph coloring algorithms”, Journal of the ACM, 30 (3): 417–427, doi:10.1145/2402.322385, MR 0709826.

[6] Varghese, George (2005), Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices, Morgan Kaufmann, ISBN 9780120884773.

[7] Dial, Robert B. (1969), “Algorithm 360: Shortest-path forest with topological ordering [H]”, Communications of the ACM, 12 (11): 632–633, doi:10.1145/363269.363610.

[8] Gabow, Harold N.; Tarjan, Robert E. (1988), “Algorithms for two bottleneck optimization problems”, Journal of Algorithms, 9 (3): 411–417, doi:10.1016/0196-6774(88)90031-4, MR 955149.

5.3 Heap (data structure)

In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: if A is a parent node of B, then the key (the value) of node A is ordered with respect to the key of node B, with the same ordering applying across the heap. A heap can be classified further as either a “max heap” or a “min heap”. In a max heap, the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node. In a min heap, the keys of parent nodes are less than or equal to those of the children and the lowest key is in the root node.

The heap is one maximally efficient implementation of an abstract data type called a priority queue, and in fact priority queues are often referred to as “heaps”, regardless of how they may be implemented. A common implementation of a heap is the binary heap, in which the tree is a complete binary tree (see figure). The heap data structure, specifically the binary heap, was introduced by J. W. J. Williams in 1964, as a data structure for the heapsort sorting algorithm.[1] Heaps are also crucial in several efficient graph algorithms such as Dijkstra's algorithm. In a heap, the highest (or lowest) priority element is always stored at the root. A heap is not a sorted structure and can be regarded as partially ordered. As visible from the heap diagram, there is no particular relationship among nodes on any given level, even among the siblings. When a heap is a complete binary tree, it has the smallest possible height: a heap with N nodes always has O(log N) height. A heap is a useful data structure when you need to remove the object with the highest (or lowest) priority.

Note that, as shown in the graphic, there is no implied ordering between siblings or cousins and no implied sequence for an in-order traversal (as there would be in, e.g., a binary search tree). The heap relation mentioned above applies only between nodes and their parents, grandparents, etc. The maximum number of children each node can have depends on the type of heap, but in many types it is at most two, which is known as a binary heap.
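The partial order described above is easy to observe on the array representation used later in this chapter: every parent is ordered with respect to its children, while siblings are mutually unordered. A minimal Python sketch using the standard heapq module mentioned in the Libraries section (the helper name is_min_heap is mine, not part of heapq):

```python
import heapq

def is_min_heap(a):
    """Check the heap property for an array-backed binary min-heap
    (root at index 0, children of index i at 2*i+1 and 2*i+2)."""
    return all(a[(i - 1) // 2] <= a[i] for i in range(1, len(a)))

h = []
for key in [9, 4, 7, 1, 8]:
    heapq.heappush(h, key)

assert h[0] == 1       # the lowest key is always at the root
assert is_min_heap(h)  # every parent is <= its children
# Note: siblings are not ordered relative to each other; the
# structure is only partially ordered, not sorted.
```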
• Binary heap

• Binomial heap

• PHP has both max-heap (SplMaxHeap) and min-heap (SplMinHeap) as of version 5.3 in the Standard PHP Library.

• Perl has implementations of binary, binomial, and Fibonacci heaps in the Heap distribution available on CPAN.

• The Go language contains a heap package with heap algorithms that operate on an arbitrary type that satisfies a given interface.

• Apple's Core Foundation library contains a CFBinaryHeap structure.

• Pharo has an implementation in the Collections-Sequenceable package along with a set of test cases. A heap is used in the implementation of the timer event loop.

• Heapsort: One of the best sorting methods, being in-place and with no quadratic worst-case scenarios.

[4] The Python Standard Library, 8.4. heapq — Heap queue algorithm, heapq.heapreplace.

[5] Suchenek, Marek A. (2012), “Elementary Yet Precise Worst-Case Analysis of Floyd's Heap-Construction Program”, Fundamenta Informaticae, IOS Press, 120 (1): 75–92, doi:10.3233/FI-2012-751.

[6] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.

[7] Fredman, Michael Lawrence; Tarjan, Robert E. (July 1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[8] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, arXiv:1110.4428, doi:10.1007/3-540-44985-X_5, ISBN 3-540-67690-2.

[9] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues” (PDF), Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 52–58.

[10] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341. ISBN 0-471-46983-1.

5.4 Binary heap

A binary heap is a heap data structure that takes the form of a binary tree. Binary heaps are a common way of implementing priority queues.[1]:162–163 The binary heap was introduced by J. W. J. Williams in 1964, as a data structure for the heapsort.[2]

A binary heap is defined as a binary tree with two additional constraints:[3]

• Shape property: a binary heap is a complete binary tree; that is, all levels of the tree, except possibly the last one (deepest), are fully filled, and, if the last level of the tree is not complete, the nodes of that level are filled from left to right.

• Heap property: the key stored in each node is either greater than or equal to (≥) or less than or equal to (≤) the keys in the node's children, according to some total order.
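The two constraints can be maintained together during insertion: a new key is appended at the first free position of the last level (preserving the shape property) and then moved up until the heap property holds again. A sketch of this up-heap insertion for a max-heap stored in a 0-indexed array (the function name heap_push is mine, for illustration only):

```python
def heap_push(a, key):
    """Insert into an array-backed binary max-heap (root at index 0):
    append at the first free slot of the last level, then sift the
    new key up while it exceeds its parent."""
    a.append(key)               # shape property: fill left to right
    i = len(a) - 1
    while i > 0:
        parent = (i - 1) // 2
        if a[parent] >= a[i]:   # heap property restored
            break
        a[i], a[parent] = a[parent], a[i]
        i = parent

heap = []
for key in [4, 5, 11, 3, 8]:
    heap_push(heap, key)

assert heap[0] == max(heap)  # the highest key ends up at the root
```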
which is a valid max-heap. There is no need to check the left child after this final step: at the start, the max-heap was valid, meaning 11 > 5; if 15 > 11, and 11 > 5, then 15 > 5, because of the transitive relation.

Extract

The procedure for deleting the root from the heap (effectively extracting the maximum element in a max-heap or the minimum element in a min-heap) and restoring the properties is called down-heap (also known as bubble-down, percolate-down, sift-down, trickle-down, heapify-down, cascade-down, and extract-min/max).

1. Replace the root of the heap with the last element on the last level.

2. Compare the new root with its children; if they are in the correct order, stop.

3. If not, swap the element with one of its children and return to the previous step. (Swap with its smaller child in a min-heap and its larger child in a max-heap.)

So, if we have the same max-heap as before

        11
       /  \
      5    8
     / \
    3   4

we remove the 11 and replace it with the 4.

        4
       / \
      5   8
     /
    3

The downward-moving node is swapped with the larger of its children in a max-heap (in a min-heap it would be swapped with its smaller child), until it satisfies the heap property in its new position. This functionality is achieved by the Max-Heapify function, as defined below in pseudocode for an array-backed heap A of length heap_length[A]. Note that A is indexed starting at 1, not 0 as is common in many real programming languages, and that anything starting with // is a comment.

Max-Heapify(A, i):
    left ← 2*i            // ← means “assignment”
    right ← 2*i + 1
    largest ← i
    if left ≤ heap_length[A] and A[left] > A[largest] then:
        largest ← left
    if right ≤ heap_length[A] and A[right] > A[largest] then:
        largest ← right
    if largest ≠ i then:
        swap A[i] and A[largest]
        Max-Heapify(A, largest)

For the above algorithm to correctly re-heapify the array, no nodes besides the node at index i and its two direct children can violate the heap property. If the heap property already holds there, the algorithm falls through with no change to the array. The down-heap operation (without the preceding swap) can also be used to modify the value of the root, even when an element is not being deleted.

In the worst case, the new root has to be swapped with its child on each level until it reaches the bottom level of the heap, meaning that the delete operation has a time complexity relative to the height of the tree, or O(log n).

5.4.2 Building a heap

Building a heap from an array of n input elements can be done by starting with an empty heap, then successively inserting each element. This approach, called Williams' method after the inventor of binary heaps, is easily seen to run in O(n log n) time: it performs n insertions at O(log n) cost each.[lower-alpha 1]
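For comparison with the 1-indexed pseudocode above, here is a 0-indexed Python rendering of the same down-heap logic, applied to the example heap from this section; the function names are mine, and the tail recursion is expressed as a loop:

```python
def max_heapify(a, i, size):
    """0-indexed equivalent of the Max-Heapify pseudocode: sift a[i]
    down until it is >= both of its children."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < size and a[left] > a[largest]:
            largest = left
        if right < size and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest  # continue one level down

def extract_max(a):
    """Delete the root: replace it with the last element, shrink the
    array by one, then down-heap from the root."""
    top = a[0]
    a[0] = a[-1]
    a.pop()
    if a:
        max_heapify(a, 0, len(a))
    return top

heap = [11, 5, 8, 3, 4]        # the example max-heap from the text
assert extract_max(heap) == 11
assert heap[0] == 8            # 4 moved to the root, then sank below 8
```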
Therefore, the cost of heapifying all subtrees is:

    Σ_{h=0}^{⌊log n⌋} (n / 2^h) O(h) = O( n · Σ_{h=0}^{⌊log n⌋} h / 2^h )
                                     = O( n · Σ_{h=0}^{∞} h / 2^h )
                                     = O(n)

This uses the fact that the given infinite series Σ_{i=0}^{∞} i/2^i converges.[lower-alpha 2]

The exact value of the above (the worst-case number of comparisons during the heap construction) is known to be equal to:

    2n − 2s₂(n) − e₂(n),[6]

where s₂(n) is the sum of all digits of the binary representation of n and e₂(n) is the exponent of 2 in the prime factorization of n.

The Build-Max-Heap function that follows converts an array A which stores a complete binary tree with n nodes to a max-heap by repeatedly using Max-Heapify in a bottom-up manner. It is based on the observation that the array elements indexed by floor(n/2) + 1, floor(n/2) + 2, ..., n are all leaves of the tree (assuming that indices start at 1), thus each is a one-element heap. Build-Max-Heap runs Max-Heapify on each of the remaining tree nodes.

Build-Max-Heap(A):
    heap_length[A] ← length[A]
    for each index i from floor(length[A]/2) downto 1 do:
        Max-Heapify(A, i)

5.4.3 Heap implementation

[Figure: a small complete binary tree stored in an array (cells 0 through 6).]

[Figure: comparison between a binary heap and an array implementation.]

Heaps are commonly implemented with an array. Any binary tree can be stored in an array, but because a binary heap is always a complete binary tree, it can be stored compactly. No space is required for pointers; instead, the parent and children of each node can be found by arithmetic on array indices. These properties make this heap implementation a simple example of an implicit data structure or Ahnentafel list. Details depend on the root position, which in turn may depend on constraints of the programming language used for implementation, or programmer preference. Specifically, sometimes the root is placed at index 1, sacrificing space in order to simplify arithmetic.

Let n be the number of elements in the heap and i be an arbitrary valid index of the array storing the heap. If the tree root is at index 0, with valid indices 0 through n − 1, then each element a at index i has:

• children at indices 2i + 1 and 2i + 2

• its parent at index floor((i − 1)/2).

Alternatively, if the tree root is at index 1, with valid indices 1 through n, then each element a at index i has:

• children at indices 2i and 2i + 1

• its parent at index floor(i/2).

This implementation is used in the heapsort algorithm, where it allows the space in the input array to be reused to store the heap (i.e. the algorithm is done in-place). The implementation is also useful as a priority queue, where use of a dynamic array allows insertion of an unbounded number of items.

The upheap/downheap operations can then be stated in terms of an array as follows: suppose that the heap property holds for the indices b, b+1, ..., e. The sift-down function extends the heap property to b−1, b, b+1, ..., e. Only index i = b−1 can violate the heap property. Let j be the index of the largest child of a[i] (for a max-heap, or the smallest child for a min-heap) within the range b, ..., e. (If no such index exists because 2i > e, then the heap property holds for the newly extended range and nothing needs to be done.) By swapping the values a[i] and a[j] the heap property for position i is established. At this point, the only problem is that the heap property might not hold for index j. The sift-down function is applied tail-recursively to index j until the heap property is established for all elements.

The sift-down function is fast. In each step it only needs two comparisons and one swap. The index value where it is working doubles in each iteration, so that at most log₂ e steps are required.

For big heaps and using virtual memory, storing elements in an array according to the above scheme is inefficient: (almost) every level is in a different page. B-heaps are binary heaps that keep subtrees in a single page, reducing the number of pages accessed by up to a factor of ten.[7]
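The bottom-up construction discussed in the previous section can be written directly against this 0-based array layout, using the index formulas above; a Python sketch (names mine, for illustration):

```python
def build_max_heap(a):
    """Bottom-up heap construction, 0-indexed: the elements at
    indices len(a)//2 .. len(a)-1 are all leaves (one-element
    heaps), so only the remaining nodes are sifted down.
    Total work is O(n), per the analysis above."""
    def sift_down(i):
        size = len(a)
        while True:
            left, right = 2 * i + 1, 2 * i + 2   # children of i
            largest = i
            if left < size and a[left] > a[largest]:
                largest = left
            if right < size and a[right] > a[largest]:
                largest = right
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    for i in range(len(a) // 2 - 1, -1, -1):     # non-leaves, bottom up
        sift_down(i)

a = [3, 1, 11, 4, 8, 5]
build_max_heap(a)
assert a[0] == 11  # the maximum reaches the root
```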
The operation of merging two binary heaps takes Θ(n) for equal-sized heaps. The best you can do is (in case of an array implementation) simply concatenate the two heap arrays and build a heap of the result.[8] A heap on n elements can be merged with a heap on k elements using O(log n log k) key comparisons, or, in case of a pointer-based implementation, in O(log n log k) time.[9] An algorithm for splitting a heap on n elements into two heaps on k and n−k elements, respectively, based on a new view of heaps as an ordered collection of subheaps was presented in [10]. The algorithm requires O(log n · log n) comparisons. The view also presents a new and conceptually simple algorithm for merging heaps. When merging is a common task, a different heap implementation is recommended, such as binomial heaps, which can be merged in O(log n).

Additionally, a binary heap can be implemented with a traditional binary tree data structure, but there is an issue with finding the adjacent element on the last level of the binary heap when adding an element. This element can be determined algorithmically or by adding extra data to the nodes, called “threading” the tree: instead of merely storing references to the children, we store the inorder successor of the node as well.

It is possible to modify the heap structure to allow extraction of both the smallest and largest element in O(log n) time.[11] To do this, the rows alternate between min heap and max heap. The algorithms are roughly the same, but, in each step, one must consider the alternating rows with alternating comparisons. The performance is roughly the same as a normal single-direction heap. This idea can be generalised to a min-max-median heap.

5.4.4 Derivation of index equations

In an array-based heap, the children and parent of a node can be located via simple arithmetic on the node's index. This section derives the relevant equations for heaps with their root at index 0, with additional notes on heaps with their root at index 1.

To avoid confusion, we'll define the level of a node as its distance from the root, such that the root itself occupies level 0.

Child nodes

For a general node located at index i (beginning from 0), we will first derive the index of its right child, right = 2i + 2.

Let node i be located in level L, and note that any level l contains exactly 2^l nodes. Furthermore, there are exactly 2^(l+1) − 1 nodes contained in the layers up to and including layer l (think of binary arithmetic; 0111...111 = 1000...000 − 1). Because the root is stored at 0, the k-th node will be stored at index (k − 1). Putting these observations together yields the following expression for the index of the last node in layer l:

    last(l) = (2^(l+1) − 1) − 1 = 2^(l+1) − 2

Let there be j nodes after node i in layer L, such that

    i = last(L) − j
      = (2^(L+1) − 2) − j

Each of these j nodes must have exactly 2 children, so there must be 2j nodes separating i's right child from the end of its layer (L + 1):

    right = last(L + 1) − 2j
          = (2^(L+2) − 2) − 2j
          = 2(2^(L+1) − 2 − j) + 2
          = 2i + 2

as required.

Noting that the left child of any node is always 1 place before its right child, we get left = 2i + 1.

If the root is located at index 1 instead of 0, the last node in each level is instead at index 2^(l+1) − 1. Using this throughout yields left = 2i and right = 2i + 1 for heaps with their root at 1.

Parent node

Every node is either the left or right child of its parent, so we know that either of the following is true:

1. i = 2 × (parent) + 1

2. i = 2 × (parent) + 2

Hence,

    parent = (i − 1)/2  or  (i − 2)/2

Now consider the expression ⌊(i − 1)/2⌋. If node i is a left child, this gives the result immediately; however, it also gives the correct result if node i is a right child. In this case, (i − 2) must be even, and hence (i − 1) must be odd.
Therefore, irrespective of whether a node is a left or right [1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest,
Ronald L.; Stein, Clifford (2009) [1990]. Introduction to
child, its parent can be found by the expression:
Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN
0-262-03384-4.
5.4.9 External links

• Binary Heap Applet by Kubo Kovac

• Open Data Structures - Section 10.1 - BinaryHeap: An Implicit Binary Tree

• Implementation of binary max heap in C by Robin Thomas

• Implementation of binary min heap in C by Robin Thomas

5.5 d-ary heap

The d-ary heap or d-heap is a priority queue data structure, a generalization of the binary heap in which the nodes have d children instead of 2.[1][2][3] Thus, a binary heap is a 2-heap, and a ternary heap is a 3-heap. According to Tarjan[2] and Jensen et al.,[4] d-ary heaps were invented by Donald B. Johnson in 1975.[1]

This data structure allows decrease priority operations to be performed more quickly than binary heaps, at the expense of slower delete minimum operations. This trade-off leads to better running times for algorithms such as Dijkstra's algorithm, in which decrease priority operations are more common than delete min operations.[1][5] Additionally, d-ary heaps have better memory cache behavior than binary heaps, allowing them to run more quickly in practice despite having a theoretically larger worst-case running time.[6][7] Like binary heaps, d-ary heaps are an in-place data structure that uses no additional storage beyond that needed to store the array of items in the heap.[2][8]

5.5.1 Data structure

The d-ary heap consists of an array of n items, each of which has a priority associated with it. These items may be viewed as the nodes in a complete d-ary tree, listed in breadth first traversal order: the item at position 0 of the array forms the root of the tree, the items at positions 1 through d are its children, the next d² items are its grandchildren, etc. Thus, the parent of the item at position i (for any i > 0) is the item at position floor((i − 1)/d) and its children are the items at positions di + 1 through di + d. According to the heap property, in a min-heap, each item has a priority that is at least as large as its parent; in a max-heap, each item has a priority that is no larger than its parent.[2][3]

The minimum priority item in a min-heap (or the maximum priority item in a max-heap) may always be found at position 0 of the array. To remove this item from the priority queue, the last item x in the array is moved into its place, and the length of the array is decreased by one. Then, while item x and its children do not satisfy the heap property, item x is swapped with one of its children (the one with the smallest priority in a min-heap, or the one with the largest priority in a max-heap), moving it downward in the tree and later in the array, until eventually the heap property is satisfied. The same downward swapping procedure may be used to increase the priority of an item in a min-heap, or to decrease the priority of an item in a max-heap.[2][3]

To insert a new item into the heap, the item is appended to the end of the array, and then while the heap property is violated it is swapped with its parent, moving it upward in the tree and earlier in the array, until eventually the heap property is satisfied. The same upward-swapping procedure may be used to decrease the priority of an item in a min-heap, or to increase the priority of an item in a max-heap.[2][3]

To create a new heap from an array of n items, one may loop over the items in reverse order, starting from the item at position n − 1 and ending at the item at position 0, applying the downward-swapping procedure for each item.[2][3]

5.5.2 Analysis

In a d-ary heap with n items in it, both the upward-swapping procedure and the downward-swapping procedure may perform as many as log_d n = log n / log d swaps. In the upward-swapping procedure, each swap involves a
single comparison of an item with its parent, and takes constant time. Therefore, the time to insert a new item into the heap, to decrease the priority of an item in a min-heap, or to increase the priority of an item in a max-heap, is O(log n / log d). In the downward-swapping procedure, each swap involves d comparisons and takes O(d) time: it takes d − 1 comparisons to determine the minimum or maximum of the children and then one more comparison against the parent to determine whether a swap is needed. Therefore, the time to delete the root item, to increase the priority of an item in a min-heap, or to decrease the priority of an item in a max-heap, is O(d log n / log d).[2][3]

When creating a d-ary heap from a set of n items, most of the items are in positions that will eventually hold leaves of the d-ary tree, and no downward swapping is performed for those items. At most n/d + 1 items are non-leaves, and may be swapped downwards at least once, at a cost of O(d) time to find the child to swap them with. At most n/d² + 1 nodes may be swapped downward two times, incurring an additional O(d) cost for the second swap beyond the cost already counted in the first term, etc. Therefore, the total amount of time to create a heap in this way is

∑_{i=1}^{log_d n} (n/d^i + 1) O(d) = O(n).[2][3]

The exact value of the above (the worst-case number of comparisons during the construction of a d-ary heap) is known to be equal to

(d/(d − 1))(n − s_d(n)) − (d − 1 − (n mod d))(e_d(⌊n/d⌋) + 1),[9]

where s_d(n) is the sum of all digits of the standard base-d representation of n and e_d(n) is the exponent of d in the factorization of n. This reduces to

2n − 2s_2(n) − e_2(n),[9]

for d = 2, and to

(3/2)(n − s_3(n)) − 2e_3(n) − e_3(n − 1),[9]

for d = 3.

The space usage of the d-ary heap, with insert and delete-min operations, is linear, as it uses no extra storage other than an array containing a list of the items in the heap.[2][8] If changes to the priorities of existing items need to be supported, then one must also maintain pointers from the items to their positions in the heap, which again uses only linear storage.[2]

5.5.3 Applications

Dijkstra's algorithm for shortest paths in graphs and Prim's algorithm for minimum spanning trees both use a min-heap in which there are n delete-min operations and as many as m decrease-priority operations, where n is the number of vertices in the graph and m is the number of edges. By using a d-ary heap with d = m/n, the total times for these two types of operations may be balanced against each other, leading to a total time of O(m log_{m/n} n) for the algorithm, an improvement over the O(m log n) running time of binary heap versions of these algorithms whenever the number of edges is significantly larger than the number of vertices.[1][5] An alternative priority queue data structure, the Fibonacci heap, gives an even better theoretical running time of O(m + n log n), but in practice d-ary heaps are generally at least as fast, and often faster, than Fibonacci heaps for this application.[10]

4-heaps may perform better than binary heaps in practice, even for delete-min operations.[2][3] Additionally, a d-ary heap typically runs much faster than a binary heap for heap sizes that exceed the size of the computer's cache memory: a binary heap typically requires more cache misses and virtual memory page faults than a d-ary heap, each one taking far more time than the extra work incurred by the additional comparisons a d-ary heap makes compared to a binary heap.[6][7]

5.5.4 References

[1] Johnson, D. B. (1975), "Priority queues with update and finding minimum spanning trees", Information Processing Letters, 4 (3): 53–57, doi:10.1016/0020-0190(75)90001-0.

[2] Tarjan, R. E. (1983), "3.2. d-heaps", Data Structures and Network Algorithms, CBMS-NSF Regional Conference Series in Applied Mathematics, 44, Society for Industrial and Applied Mathematics, pp. 34–38.

[3] Weiss, M. A. (2007), "d-heaps", Data Structures and Algorithm Analysis (2nd ed.), Addison-Wesley, p. 216, ISBN 0-321-37013-9.

[4] Jensen, C.; Katajainen, J.; Vitale, F. (2004), An extended truth about heaps (PDF).

[5] Tarjan (1983), pp. 77 and 91.

[6] Naor, D.; Martel, C. U.; Matloff, N. S. (1991), "Performance of priority queue structures in a virtual memory environment", Computer Journal, 34 (5): 428–437, doi:10.1093/comjnl/34.5.428.

[7] Kamp, Poul-Henning (2010), "You're doing it wrong", ACM Queue, 8 (6).

[8] Mortensen, C. W.; Pettie, S. (2005), "The complexity of implicit and space efficient priority queues", Algorithms and Data Structures: 9th International Workshop, WADS 2005, Waterloo, Canada, August 15–17, 2005, Proceedings, Lecture Notes in Computer Science, 3608, Springer-Verlag, pp. 49–60, doi:10.1007/11534273_6, ISBN 978-3-540-28101-6.
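The array layout and the downward-swapping procedure described in this section can be sketched as follows; a minimal Python illustration (names such as `sift_down` and `pop_min` are ours, not from the text):

```python
def parent(i: int, d: int) -> int:
    """Parent of node i in a d-ary heap: floor((i - 1) / d)."""
    return (i - 1) // d

def sift_down(a, i, d):
    """Downward swapping: repeatedly swap a[i] with its smallest child."""
    n = len(a)
    while True:
        kids = range(d * i + 1, min(d * i + d + 1, n))
        if not kids:
            return
        c = min(kids, key=lambda j: a[j])  # d - 1 comparisons among children
        if a[c] < a[i]:                    # one more comparison with the parent
            a[i], a[c] = a[c], a[i]
            i = c
        else:
            return

def make_heap(a, d):
    """Heapify in reverse index order, as described above; O(n) total."""
    for i in reversed(range(len(a))):
        sift_down(a, i, d)

def pop_min(a, d):
    """Move the last item into the root's place, then sift it down."""
    a[0], a[-1] = a[-1], a[0]
    m = a.pop()
    if a:
        sift_down(a, 0, d)
    return m
```

For example, `a = [5, 9, 1, 7]; make_heap(a, 4)` leaves the minimum 1 at position 0, and `pop_min(a, 4)` returns it.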
5.6. BINOMIAL HEAP 145
5.6.3 Implementation

… of the tree.

Merge

[Figure: merging two binomial trees of the same order; the roots are compared (7 > 3), and the tree with the larger root becomes a subtree of the tree with the smaller root.]

As mentioned above, the simplest and most important operation is the merging of two binomial trees of the same order within a binomial heap. Due to the structure of binomial trees, they can be merged trivially. As their root node is the smallest element within the tree, by comparing the two keys, the smaller of them is the minimum key, and becomes the new root node. Then the other tree becomes a subtree of the combined tree. This operation is basic to the complete merging of two binomial heaps.

function mergeTree(p, q)
  if p.root.key <= q.root.key
    return p.addSubTree(q)
  else
    return q.addSubTree(p)

The operation of merging two heaps is perhaps the most interesting and can be used as a subroutine in most other operations. The lists of roots of both heaps are traversed simultaneously in a manner similar to that of the merge algorithm.

function merge(p, q)
  while not (p.end() and q.end())
    tree = mergeTree(p.currentTree(), q.currentTree())
    if not heap.currentTree().empty()
      tree = mergeTree(tree, heap.currentTree())
    heap.addTree(tree)
    heap.next(); p.next(); q.next()

Insert

Inserting a new element to a heap can be done by simply creating a new heap containing only this element and then merging it with the original heap. Due to the merge, insert takes O(log n) time. However, across a series of n consecutive insertions, insert has an amortized time of O(1) (i.e. constant).
… by a non-constant factor. It is also possible to merge two Fibonacci heaps in constant amortized time, improving on the logarithmic merge time of a binomial heap, and improving on binary heaps, which cannot handle merges efficiently.

Using Fibonacci heaps for priority queues improves the asymptotic running time of important algorithms, such as Dijkstra's algorithm for computing the shortest path between two nodes in a graph, compared to the same algorithm using other slower priority queue data structures.

5.7.1 Structure
… together.

As a result of a relaxed structure, some operations can take a long time while others are done very quickly. For the amortized running time analysis we use the potential method, in that we pretend that very fast operations take a little bit longer than they actually do. This additional time is then later combined and subtracted from the actual running time of slow operations. The amount of time saved for later use is measured at any given moment by a potential function. The potential of a Fibonacci heap is given by

Potential = t + 2m,

where t is the number of trees in the Fibonacci heap, and m is the number of marked nodes. A node is marked if at least one of its children was cut since this node was made a child of another node (all roots are unmarked). The amortized time for an operation is given by the sum of the actual time and c times the difference in potential, where c is a constant (chosen to match the constant factors in the O notation for the actual time).

Thus, the root of each tree in a heap has one unit of time stored. This unit of time can be used later to link this tree with another tree at amortized time 0. Also, each marked node has two units of time stored. One can be used to cut the node from its parent. If this happens, the node becomes a root and the second unit of time will remain stored in it as in any other root.

[Figure: Fibonacci heap from Figure 1 after the first phase of extract minimum. The node with key 1 (the minimum) was deleted and its children were added as separate trees.]

… it takes time O(d) to process all new roots, and the potential increases by d − 1. Therefore, the amortized running time of this phase is O(d) = O(log n).
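The potential accounting described above can be illustrated numerically; a small sketch (the helper names and the example numbers are ours):

```python
def potential(t: int, m: int) -> int:
    """Potential of a Fibonacci heap with t trees and m marked nodes."""
    return t + 2 * m

def amortized(actual, before, after, c=1):
    """Amortized time = actual time + c * (potential difference)."""
    return actual + c * (potential(*after) - potential(*before))

# Linking two roots costs one unit of actual time, but the number of
# trees drops by one, so the amortized cost is 0 -- the unit of time
# "stored" at the root pays for the link.
assert amortized(1, before=(5, 2), after=(4, 2)) == 0
```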
… the array is updated. The actual running time is O(log n + m), where m is the number of roots at the beginning of the second phase. At the end we will have at most O(log n) roots (because each has a different degree). Therefore, the difference in the potential function from before this phase to after it is O(log n) − m, and the amortized running time is then at most O(log n + m) + c(O(log n) − m). With a sufficiently large choice of c, this simplifies to O(log n).

In the third phase we check each of the remaining roots and find the minimum. This takes O(log n) time and the potential does not change. The overall amortized running time of extract minimum is therefore O(log n).

It can be shown (by induction) that F_{d+2} ≥ φ^d for all integers d ≥ 0, where φ = (1 + √5)/2 ≈ 1.618. (We then have n ≥ F_{d+2} ≥ φ^d, and taking the log to base φ of both sides gives d ≤ log_φ n as required.)

Consider any node x somewhere in the heap (x need not be the root of one of the main trees). Define size(x) to be the size of the tree rooted at x (the number of descendants of x, including x itself). We prove by induction on the height of x (the length of a longest simple path from x to a descendant leaf) that size(x) ≥ F_{d+2}, where d is the degree of x.

Base case: If x has height 0, then d = 0, and size(x) = 1 = F_2.
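The bound used above, F_{d+2} ≥ φ^d, is easy to check numerically; a quick sketch:

```python
from math import sqrt

def fib(n: int) -> int:
    """Iterative Fibonacci numbers with F(0) = 0, F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

phi = (1 + sqrt(5)) / 2  # the golden ratio, ~1.618

# F(d+2) >= phi**d for all d >= 0, the inequality used in the proof.
for d in range(30):
    assert fib(d + 2) >= phi ** d
```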
… sequence can take very long to complete (in particular, delete and delete minimum have linear running time in the worst case). For this reason Fibonacci heaps and other amortized data structures may not be appropriate for real-time systems. It is possible to create a data structure which has the same worst-case performance as the Fibonacci heap has amortized performance. One such structure, the Brodal queue,[4] is, in the words of the creator, "quite complicated" and "[not] applicable in practice." Created in 2012, the strict Fibonacci heap[5] is a simpler (compared to Brodal's) structure with the same worst-case bounds. It is unknown whether the strict Fibonacci heap is efficient in practice. The run-relaxed heaps of Driscoll et al. give good worst-case performance for all Fibonacci heap operations except merge.

5.7.5 Summary of running times

In the following time complexities[6] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

Notes:

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[9]

[2] Amortized time.

[3] Lower bound of Ω(log log n),[12] upper bound of O(2^{2√(log log n)}).[13]

[4] n is the size of the larger heap.

5.7.6 Practical considerations

Fibonacci heaps have a reputation for being slow in practice[14] due to large memory consumption per node and high constant factors on all operations.[15] Recent experimental results suggest that Fibonacci heaps are more efficient in practice than most of their later derivatives, including quake heaps, violation heaps, strict Fibonacci heaps, and rank-pairing heaps, but less efficient than either pairing heaps or array-based heaps.[16]

5.7.7 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. "Chapter 20: Fibonacci Heaps". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 476–497. ISBN 0-262-03293-7. Third edition p. 518.

[2] Fredman, Michael Lawrence; Tarjan, Robert E. (July 1987). "Fibonacci heaps and their uses in improved network optimization algorithms" (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[3] Fredman, Michael L.; Sedgewick, Robert; Sleator, Daniel D.; Tarjan, Robert E. (1986). "The pairing heap: a new form of self-adjusting heap" (PDF). Algorithmica. 1 (1): 111–129. doi:10.1007/BF01840439.

[4] Brodal, Gerth Stølting (1996), "Worst-Case Efficient Priority Queues", Proc. 7th ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics: 52–58, CiteSeerX 10.1.1.43.8133, doi:10.1145/313852.313883, ISBN 0-89871-366-8.

[5] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 978-1-4503-1245-5.

[6] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.

[7] Iacono, John (2000), "Improved upper bounds for pairing heaps", Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, arXiv:1110.4428, doi:10.1007/3-540-44985-X_5, ISBN 3-540-67690-2.

[8] Brodal, Gerth S. (1996), "Worst-Case Efficient Priority Queues" (PDF), Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 52–58.

[9] Goodrich, Michael T.; Tamassia, Roberto (2004). "7.3.6. Bottom-Up Heap Construction". Data Structures and Algorithms in Java (3rd ed.). pp. 338–341. ISBN 0-471-46983-1.

[10] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). "Rank-pairing heaps" (PDF). SIAM J. Computing: 1463–1485.

[11] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.

[12] Fredman, Michael Lawrence (July 1999). "On the Efficiency of Pairing Heaps and Related Data Structures" (PDF). Journal of the Association for Computing Machinery. 46 (4): 473–501. doi:10.1145/320211.320214.

[13] Pettie, Seth (2005). Towards a Final Analysis of Pairing Heaps (PDF). FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science. pp. 174–183. CiteSeerX 10.1.1.549.471. doi:10.1109/SFCS.2005.75. ISBN 0-7695-2468-0.

[14] http://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/FibonacciHeaps.pdf, p. 79.

[15] http://web.stanford.edu/class/cs166/lectures/07/Small07.pdf, p. 72.

[16] Larkin, Daniel; Sen, Siddhartha; Tarjan, Robert (2014). "A Back-to-Basics Empirical Study of Priority Queues". Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments: 61–72. arXiv:1403.0252. doi:10.1137/1.9781611973198.7.
5.7.8 External links

• Java applet simulation of a Fibonacci heap

• MATLAB implementation of Fibonacci heap

• De-recursived and memory efficient C implementation of Fibonacci heap (free/libre software, CeCILL-B license)

• Ruby implementation of the Fibonacci heap (with tests)

• Pseudocode of the Fibonacci heap algorithm

• Various Java Implementations for Fibonacci heap

5.8 Pairing heap

A pairing heap is a type of heap data structure with relatively simple implementation and excellent practical amortized performance, introduced by Michael Fredman, Robert Sedgewick, Daniel Sleator, and Robert Tarjan in 1986.[1] Pairing heaps are heap-ordered multiway tree structures, and can be considered simplified Fibonacci heaps. They are considered a "robust choice" for implementing such algorithms as Prim's MST algorithm,[2] and support the following operations (assuming a min-heap):

• find-min: simply return the top element of the heap.

• merge: compare the two root elements; the smaller remains the root of the result, and the larger element and its subtree are appended as a child of this root.

• insert: create a new heap for the inserted element and merge it into the original heap.

• decrease-key (optional): remove the subtree rooted at the key to be decreased, replace the key with a smaller key, then merge the result back into the heap.

• delete-min: remove the root and merge its subtrees. Various strategies are employed.

The analysis of pairing heaps' time complexity was initially inspired by that of splay trees.[1] The amortized time per delete-min is O(log n), and the operations find-min, merge, and insert run in O(1) amortized time.[3]

Determining the precise asymptotic running time of pairing heaps when a decrease-key operation is needed has turned out to be difficult. Initially, the time complexity of this operation was conjectured on empirical grounds to be O(1),[4] but Fredman proved that the amortized time per decrease-key is at least Ω(log log n) for some sequences of operations.[5] Using a different amortization argument, Pettie then proved that insert, meld, and decrease-key all run in O(2^{2√(log log n)}) amortized time, which is o(log n).[6] Elmasry later introduced a variant of pairing heaps for which decrease-key runs in O(log log n) amortized time and with all other operations matching Fibonacci heaps,[7] but no tight Θ(log log n) bound is known for the original data structure.[6][3] Moreover, it is an open question whether a o(log n) amortized time bound for decrease-key and a O(1) amortized time bound for insert can be achieved simultaneously.[8]

Although this is worse than other priority queue algorithms such as Fibonacci heaps, which perform decrease-key in O(1) amortized time, the performance in practice is excellent. Stasko and Vitter,[4] Moret and Shapiro,[9] and Larkin, Sen, and Tarjan[8] conducted experiments on pairing heaps and other heap data structures. They concluded that pairing heaps are often faster in practice than array-based binary heaps and d-ary heaps, and almost always faster in practice than other pointer-based heaps, including data structures like Fibonacci heaps that are theoretically more efficient.

5.8.1 Structure

A pairing heap is either an empty heap, or a pair consisting of a root element and a possibly empty list of pairing heaps. The heap ordering property requires that all the root elements of the subheaps in the list are not smaller than the root element of the heap. The following description assumes a purely functional heap that does not support the decrease-key operation.

type PairingHeap[Elem] = Empty | Heap(elem: Elem, subheaps: List[PairingHeap[Elem]])

A pointer-based implementation for RAM machines, supporting decrease-key, can be achieved using three pointers per node, by representing the children of a node by a singly-linked list: a pointer to the node's first child, one to its next sibling, and one to its previous sibling (or, for the leftmost sibling, to its parent). Alternatively, the previous-pointer can be omitted by letting the last child point back to the parent, if a single boolean flag is added to indicate "end of list". This achieves a more compact structure at the expense of a constant overhead factor per operation.[1]

5.8.2 Operations

find-min

The function find-min simply returns the root element of the heap:

function find-min(heap: PairingHeap[Elem]) -> Elem
  if heap == Empty
    error
  else
    return heap.elem
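The purely functional definition above can be transcribed into Python, representing Empty as None (a sketch; the class layout is our assumption, not part of the original pseudocode):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Heap:
    elem: int
    subheaps: List["Heap"] = field(default_factory=list)

# None plays the role of Empty in the pseudocode's type definition.
PairingHeap = Optional[Heap]

def find_min(heap: PairingHeap) -> int:
    """Return the root element; error on the empty heap."""
    if heap is None:
        raise ValueError("empty heap")
    return heap.elem

assert find_min(Heap(3, [Heap(9), Heap(4)])) == 3
```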
5.8. PAIRING HEAP 153
merge

Merging with an empty heap returns the other heap; otherwise a new heap is returned that has the minimum of the two root elements as its root element and just adds the heap with the larger root to the list of subheaps:

function merge(heap1, heap2: PairingHeap[Elem]) -> PairingHeap[Elem]
  if heap1 == Empty
    return heap2
  elsif heap2 == Empty
    return heap1
  elsif heap1.elem < heap2.elem
    return Heap(heap1.elem, heap2 :: heap1.subheaps)
  else
    return Heap(heap2.elem, heap1 :: heap2.subheaps)

insert

The easiest way to insert an element into a heap is to merge the heap with a new heap containing just this element and an empty list of subheaps:

function insert(elem: Elem, heap: PairingHeap[Elem]) -> PairingHeap[Elem]
  return merge(Heap(elem, []), heap)

delete-min

The subheaps left behind by removing the root are combined with the two-pass merge-pairs function, illustrated here on seven heaps:

merge-pairs([H1, H2, H3, H4, H5, H6, H7])
=> merge(merge(H1, H2), merge-pairs([H3, H4, H5, H6, H7]))
   # merge H1 and H2 to H12, then the rest of the list
=> merge(H12, merge(merge(H3, H4), merge-pairs([H5, H6, H7])))
   # merge H3 and H4 to H34, then the rest of the list
=> merge(H12, merge(H34, merge(merge(H5, H6), merge-pairs([H7]))))
   # merge H5 and H6 to H56, then the rest of the list
=> merge(H12, merge(H34, merge(H56, H7)))
   # switch direction, merge the last two resulting heaps, giving H567
=> merge(H12, merge(H34, H567))
   # merge the last two resulting heaps, giving H34567
=> merge(H12, H34567)
   # finally, merge the first merged pair with the result of merging the rest
=> H1234567

5.8.3 Summary of running times

In the following time complexities[10] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

Notes:

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[14]

[2] Amortized time.

[3] Lower bound of Ω(log log n),[17] upper bound of O(2^{2√(log log n)}).[18]

[4] n is the size of the larger heap.

5.8.4 References

[1] Fredman, Michael L.; Sedgewick, Robert; Sleator, Daniel D.; Tarjan, Robert E. (1986). "The pairing heap: a new form of self-adjusting heap" (PDF). Algorithmica. 1 (1): 111–129. doi:10.1007/BF01840439.

[6] Pettie, Seth (2005), "Towards a final analysis of pairing heaps" (PDF), Proc. 46th Annual IEEE Symposium on Foundations of Computer Science, pp. 174–183, doi:10.1109/SFCS.2005.75, ISBN 0-7695-2468-0.

[7] Elmasry, Amr (2009), "Pairing heaps with O(log log n) decrease cost" (PDF), Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 471–476, doi:10.1137/1.9781611973068.52.

[8] Larkin, Daniel H.; Sen, Siddhartha; Tarjan, Robert E. (2014), "A back-to-basics empirical study of priority queues", Proceedings of the 16th Workshop on Algorithm Engineering and Experiments, pp. 61–72, arXiv:1403.0252, doi:10.1137/1.9781611973198.7.
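The merge, insert, and two-pass merge-pairs operations above can be sketched in Python on a minimal `(elem, subheaps)` tuple representation (ours, for illustration; None is the empty heap):

```python
def merge(h1, h2):
    """The smaller root wins; the other heap joins its subheap list."""
    if h1 is None:
        return h2
    if h2 is None:
        return h1
    if h1[0] < h2[0]:
        return (h1[0], [h2] + h1[1])
    return (h2[0], [h1] + h2[1])

def insert(elem, heap):
    """Insert by merging a singleton heap with the existing heap."""
    return merge((elem, []), heap)

def merge_pairs(heaps):
    """Two-pass pairing: merge pairs left to right, then fold right to left
    (the second pass is expressed here by the recursion)."""
    if not heaps:
        return None
    if len(heaps) == 1:
        return heaps[0]
    return merge(merge(heaps[0], heaps[1]), merge_pairs(heaps[2:]))

def delete_min(heap):
    """Remove the root and recombine its subheaps."""
    return merge_pairs(heap[1])

h = None
for x in [5, 1, 4, 2, 3]:
    h = insert(x, h)
assert h[0] == 1               # find-min
assert delete_min(h)[0] == 2   # next minimum after deleting the root
```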
[9] Moret, Bernard M. E.; Shapiro, Henry D. (1991), "An empirical analysis of algorithms for constructing a minimum spanning tree", Proc. 2nd Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 519, Springer-Verlag, pp. 400–411, CiteSeerX 10.1.1.53.5960, doi:10.1007/BFb0028279, ISBN 3-540-54343-0.

[10] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.

[11] Fredman, Michael Lawrence; Tarjan, Robert E. (July 1987). "Fibonacci heaps and their uses in improved network optimization algorithms" (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[12] Iacono, John (2000), "Improved upper bounds for pairing heaps", Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, arXiv:1110.4428, doi:10.1007/3-540-44985-X_5, ISBN 3-540-67690-2.

[13] Brodal, Gerth S. (1996), "Worst-Case Efficient Priority Queues" (PDF), Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 52–58.

[14] Goodrich, Michael T.; Tamassia, Roberto (2004). "7.3.6. Bottom-Up Heap Construction". Data Structures and Algorithms in Java (3rd ed.). pp. 338–341. ISBN 0-471-46983-1.

[15] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). "Rank-pairing heaps" (PDF). SIAM J. Computing: 1463–1485.

[16] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.

[17] Fredman, Michael Lawrence (July 1999). "On the Efficiency of Pairing Heaps and Related Data Structures" (PDF). Journal of the Association for Computing Machinery. 46 (4): 473–501. doi:10.1145/320211.320214.

[18] Pettie, Seth (2005). Towards a Final Analysis of Pairing Heaps (PDF). FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science. pp. 174–183. CiteSeerX 10.1.1.549.471. doi:10.1109/SFCS.2005.75. ISBN 0-7695-2468-0.

5.8.5 External links

• Louis Wasserman discusses pairing heaps and their implementation in Haskell in The Monad Reader, Issue 16 (pp. 37–52).

• pairing heaps, Sartaj Sahni

5.9 Double-ended priority queue

Not to be confused with Double-ended queue.

In computer science, a double-ended priority queue (DEPQ)[1] or double-ended heap[2] is a data structure similar to a priority queue or heap, but allows for efficient removal of both the maximum and minimum, according to some ordering on the keys (items) stored in the structure. Every element in a DEPQ has a priority or value. In a DEPQ, it is possible to remove the elements in both ascending as well as descending order.[3]

5.9.1 Operations

A double-ended priority queue features the following operations:

isEmpty() Checks if the DEPQ is empty and returns true if empty.

size() Returns the total number of elements present in the DEPQ.

getMin() Returns the element having least priority.

getMax() Returns the element having highest priority.

put(x) Inserts the element x in the DEPQ.

removeMin() Removes an element with minimum priority and returns this element.

removeMax() Removes an element with maximum priority and returns this element.

If an operation is to be performed on two elements having the same priority, then the element inserted first is chosen. Also, the priority of any element can be changed once it has been inserted in the DEPQ.[4]

5.9.2 Implementation

Double-ended priority queues can be built from balanced binary search trees (where the minimum and maximum elements are the leftmost and rightmost leaves, respectively), or using specialized data structures like the min-max heap and the pairing heap.

Generic methods of arriving at double-ended priority queues from normal priority queues are:[5]

Dual structure method

In this method two different priority queues, for min and for max, are maintained. The same elements in both the PQs are shown with the help of correspondence pointers. Here, the minimum and maximum elements are values …
5.9. DOUBLE-ENDED PRIORITY QUEUE 155
• Removing the min element: Perform removemin() on the min heap and remove(node value) on the max heap, where node value is the value in the corresponding node in the max heap.

• Removing the max element: Perform removemax() on the max heap and remove(node value) on the min heap, where node value is the value in the corresponding node in the min heap.

Total correspondence

(Figure: A total correspondence heap for the elements 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, with element 11 as the buffer.[1])

Half the elements are in the min PQ and the other half in the max PQ. Each element in the min PQ has a one-to-one correspondence with an element in the max PQ. If the number of elements in the DEPQ is odd, one of the elements is retained in a buffer.[1] The priority of every element in the min PQ is less than or equal to that of the corresponding element in the max PQ.

Leaf correspondence

In this method, only the leaf elements of the min PQ and max PQ form one-to-one correspondence pairs; it is not necessary for non-leaf elements to be in a one-to-one correspondence pair.[1]

Interval heaps

An interval heap stores two elements per node (except possibly the last node) and maintains the following properties:

• The left element is less than or equal to the right element.

• The two elements define a closed interval.

• The interval represented by any node except the root is a sub-interval of its parent node's interval.

• The elements on the left-hand side define a min heap.

• The elements on the right-hand side define a max heap.
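These invariants can be checked mechanically. Below is a small sketch; the array-embedded layout (node i parented at (i − 1) // 2, each node a (left, right) pair) and the function name are assumptions for illustration, and a last node holding a single element p can be passed as (p, p):

```python
def is_interval_heap(nodes):
    """Check the interval-heap properties for a list of (left, right) pairs."""
    for i, (lo, hi) in enumerate(nodes):
        if lo > hi:  # left element must not exceed the right element
            return False
        if i > 0:
            plo, phi = nodes[(i - 1) // 2]
            # Each interval must be a sub-interval of its parent's interval;
            # this simultaneously enforces the min heap on left elements and
            # the max heap on right elements.
            if lo < plo or hi > phi:
                return False
    return True
```

For example, [(2, 30), (3, 17), (4, 15)] satisfies all five properties, while [(2, 30), (1, 17)] violates the min-heap condition on the left elements.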
The structure of the heap depends on the number of elements:

1. Even number of elements: In this case, each node contains two elements, represented by the interval [p, q].

2. Odd number of elements: In this case, each node except the last contains two elements represented by the interval [p, q], whereas the last node contains a single element and is represented by the interval [p, p].

Inserting an element

Depending on the number of elements already present in the interval heap, the following cases are possible:

• Odd number of elements: If the number of elements in the interval heap is odd, the new element is first inserted in the last node. Then it is successively compared with the elements of the previous nodes and tested against the criteria essential for an interval heap as stated above. If the element does not satisfy any of the criteria, it is moved from the last node towards the root until all the conditions are satisfied.[6]

• Even number of elements: If the number of elements is even, then for the insertion of a new element an additional node is created. If the element falls to the left of the parent interval, it is considered to be in the min heap, and if it falls to the right of the parent interval, it is considered to be in the max heap. It is then compared successively and moved from the last node towards the root until all the conditions for an interval heap are satisfied. If the element lies within the interval of the parent node itself, the process stops immediately and no elements are moved.[6]

The time required for inserting an element depends on the number of movements required to meet all the conditions and is O(log n).

Deleting an element

• Min element: In an interval heap, the minimum element is the element on the left-hand side of the root node. This element is removed and returned. To fill the vacancy created on the left-hand side of the root node, an element from the last node is removed and reinserted into the root node. This element is then compared successively with the left-hand elements of the descending nodes, and the process stops when all the conditions for an interval heap are satisfied. If at any stage the left-hand element in a node becomes greater than the right-hand element, the two elements are swapped[6] and the comparisons continue. Finally, the root node again contains the minimum element on the left-hand side.

• Max element: In an interval heap, the maximum element is the element on the right-hand side of the root node. This element is removed and returned. To fill the vacancy created on the right-hand side of the root node, an element from the last node is removed and reinserted into the root node. Further comparisons are carried out on a basis similar to that discussed above. Finally, the root node again contains the maximum element on the right-hand side.

Thus, with interval heaps, both the minimum and maximum elements can be removed efficiently by traversing from root to leaf. A DEPQ can therefore be obtained[6] from an interval heap where the elements of the interval heap are the priorities of elements in the DEPQ.

5.9.3 Time complexity

Interval heaps

When DEPQs are implemented using interval heaps consisting of n elements, the time complexities for the various functions are given in the table below.[1]

Pairing heaps

When DEPQs are implemented using heaps or pairing heaps consisting of n elements, the time complexities for the various functions are given in the table below.[1] For pairing heaps, the complexity is amortized.

5.9.4 Applications

External sorting

One example application of the double-ended priority queue is external sorting. In an external sort, there are more elements than can be held in the computer's memory. The elements to be sorted are initially on a disk and the sorted sequence is to be left on the disk. The external quicksort is implemented using the DEPQ as follows:

1. Read in as many elements as will fit into an internal DEPQ. The elements in the DEPQ will eventually be the middle group (pivot) of elements.

2. Read in the remaining elements. If the next element is ≤ the smallest element in the DEPQ, output this next element as part of the left group. If the next element is ≥ the largest element in the DEPQ, output this next element as part of the right group. Otherwise, remove either the max or min element from the DEPQ (the choice may be made randomly or alternately); if the max element is removed, output it as part of the right group; otherwise, output the removed element as part of the left group; then insert the newly read element into the DEPQ.
3. Output the elements in the DEPQ, in sorted order, as the middle group.

4. Sort the left and right groups recursively.

5.9.5 See also

• Queue (abstract data type)

• Priority queue

• Double-ended queue

5.9.6 References

[1] Sahni, Sartaj (1999). Data Structures, Algorithms, & Applications in Java: Double-Ended Priority Queues.

[2] Brass, Peter (2008). Advanced Data Structures. Cambridge University Press. p. 211. ISBN 9780521880374.

[3] "Depq - Double-Ended Priority Queue".

[4] "depq".

[5] Horowitz, Ellis; Sahni, Sartaj; Mehta, Dinesh. Fundamentals of Data Structures in C++.

[6] http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf

5.10 Soft heap

In computer science, a soft heap is a variant of the simple heap data structure that achieves constant amortized time for its operations at the cost of "corrupting" (increasing) the keys of up to a fixed fraction of the inserted elements. More precisely, the guarantee offered by the soft heap is the following: for a fixed value ε between 0 and 1/2, at any point in time there will be at most εn corrupted keys in the heap, where n is the number of elements inserted so far. Note that this does not guarantee that only a fixed percentage of the keys currently in the heap are corrupted: in an unlucky sequence of insertions and deletions, it can happen that all elements in the heap will have corrupted keys. Similarly, we have no guarantee that in a sequence of elements extracted from the heap with findmin and delete, only a fixed percentage will have corrupted keys: in an unlucky scenario only corrupted elements are extracted from the heap.

The soft heap was designed by Bernard Chazelle in 2000. The term "corruption" in the structure is the result of what Chazelle called "carpooling" in a soft heap. Each node in the soft heap contains a linked list of keys and one common key. The common key is an upper bound on the values of the keys in the linked list. Once a key is added to the linked list, it is considered corrupted because its value is never again relevant in any of the soft heap operations: only the common keys are compared. This is what makes soft heaps "soft"; you can't be sure whether or not any particular value you put into it will be corrupted. The purpose of these corruptions is effectively to lower the information entropy of the data, enabling the data structure to break through information-theoretic barriers regarding heaps.
5.10.2 References
• Chazelle, Bernard (November 2000). "The soft heap: an approximate priority queue with optimal error rate" (PDF). J. ACM. 47 (6): 1012–1027. CiteSeerX 10.1.1.5.9705. doi:10.1145/355541.355554.

• Kaplan, Haim; Zwick, Uri (2009). "A simpler implementation and analysis of Chazelle's soft heaps". Proceedings of the Nineteenth Annual ACM–SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics. pp. 477–485. CiteSeerX 10.1.1.215.6250. doi:10.1137/1.9781611973068.53. ISBN 978-0-89871-680-1.
Chapter 6

Successors and Neighbors
6.1 Binary search algorithm

This article is about searching a finite sorted array. For searching continuous function values, see bisection method.

In computer science, binary search, also known as half-interval search,[1] logarithmic search,[2] or binary chop,[3] is a search algorithm that finds the position of a target value within a sorted array.[4][5] Binary search compares the target value to the middle element of the array; if they are unequal, the half in which the target cannot lie is eliminated and the search continues on the remaining half until it is successful or the remaining half is empty.

Binary search runs in at worst logarithmic time, making O(log n) comparisons, where n is the number of elements in the array, O is Big O notation, and log is the logarithm. Binary search takes only constant (O(1)) space, meaning that the space taken by the algorithm is the same for any number of elements in the array.[6] Although specialized data structures designed for fast searching, such as hash tables, can be searched more efficiently, binary search applies to a wider range of search problems.

Although the idea is simple, implementing binary search correctly requires attention to some subtleties about its exit conditions and midpoint calculation.

There exist numerous variations of binary search. In particular, fractional cascading speeds up binary searches for the same value in multiple arrays, efficiently solving a series of search problems in computational geometry and numerous other fields. Exponential search extends binary search to unbounded lists. The binary search tree and B-tree data structures are based on binary search.

6.1.1 Algorithm

Binary search works on sorted arrays. Binary search begins by comparing the middle element of the array with the target value. If the target value matches the middle element, its position in the array is returned. If the target value is less than or greater than the middle element, the search continues in the lower or upper half of the array, respectively,[7] eliminating the other half from consideration.

Procedure

Given an array A of n elements with values or records A0, ..., An₋₁, sorted such that A0 ≤ ... ≤ An₋₁, and target value T, the following subroutine uses binary search to find the index of T in A.[7]

1. Set L to 0 and R to n − 1.

2. If L > R, the search terminates as unsuccessful.

3. Set m (the position of the middle element) to the floor (the greatest integer less than or equal) of (L + R) / 2.

4. If Am < T, set L to m + 1 and go to step 2.

5. If Am > T, set R to m − 1 and go to step 2.

6. Now Am = T; the search is done; return m.

This iterative procedure keeps track of the search boundaries via two variables. Some implementations may place the comparison for equality at the end of the algorithm, resulting in a faster comparison loop but costing one more iteration on average.[8]

Approximate matches

The above procedure only performs exact matches, finding the position of a target value. However, due to the ordered nature of sorted arrays, it is trivial to extend binary search to perform approximate matches. For example, binary search can be used to compute, for a given value, its rank (the number of smaller elements), predecessor (next-smallest element), successor (next-largest element), and nearest neighbor. Range queries seeking the number of elements between two values can be performed with two rank queries.[9]
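As an illustration, the numbered procedure and the rank-query variant mentioned above can be written in Python (the function names and the −1 convention for an unsuccessful exact search are illustrative):

```python
def binary_search(A, T):
    """Steps 1-6 of the procedure: index of T in sorted A, or -1 if absent."""
    L, R = 0, len(A) - 1
    while L <= R:          # step 2: L > R means the search is unsuccessful
        m = (L + R) // 2   # step 3: floor of (L + R) / 2
        if A[m] < T:
            L = m + 1      # step 4
        elif A[m] > T:
            R = m - 1      # step 5
        else:
            return m       # step 6: Am = T
    return -1

def rank(A, T):
    """Number of elements of sorted A less than T: return m on a successful
    search and L on an unsuccessful one, as described for rank queries."""
    L, R = 0, len(A) - 1
    while L <= R:
        m = (L + R) // 2
        if A[m] < T:
            L = m + 1
        elif A[m] > T:
            R = m - 1
        else:
            return m
    return L
```

For example, on A = [1, 3, 5, 7, 9], binary_search(A, 7) returns 3, while rank(A, 4) returns 2, the count of elements smaller than 4.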
• Rank queries can be performed using a modified version of binary search. By returning m on a successful search, and L on an unsuccessful search, the number of elements less than the target value is returned instead.[9]

• Predecessor and successor queries can be performed with rank queries. Once the rank of the target value is known, its predecessor is the element at the position given by its rank (as it is the largest element that is smaller than the target value). Its successor is the element after it (if it is present in the array) or at the next position after the predecessor (otherwise).[10] The nearest neighbor of the target value is either its predecessor or successor, whichever is closer.

• Range queries are also straightforward. Once the ranks of the two values are known, the number of elements greater than or equal to the first value and less than the second is the difference of the two ranks. This count can be adjusted up or down by one according to whether the endpoints of the range should be considered to be part of the range and whether the array contains keys matching those endpoints.[11]

6.1.2 Performance

(Figure: A tree representing binary search. The array being searched here is [20, 30, 40, 50, 90, 100], and the target value is 40.)

The performance of binary search can be analyzed by reducing the procedure to a binary comparison tree, where the root node is the middle element of the array; the middle element of the lower half is left of the root, and the middle element of the upper half is right of the root. The rest of the tree is built in a similar fashion. This model represents binary search: starting from the root node, the left or right subtrees are traversed depending on whether the target value is less or more than the node under consideration, representing the successive elimination of elements.[6][12]

The worst case is ⌊log2 n + 1⌋ iterations (of the comparison loop), where the ⌊⌋ notation denotes the floor function that rounds its argument down to an integer and log2 is the binary logarithm. This is reached when the search reaches the deepest level of the tree, equivalent to a binary search that has reduced to one element and, in each iteration, always eliminates the smaller subarray out of the two if they are not of equal size.[lower-alpha 1][12]

On average, assuming that each element is equally likely to be searched, by the time the search completes, the target value will most likely be found at the second-deepest level of the tree. This is equivalent to a binary search that completes one iteration before the worst case, reached after log2 n − 1 iterations. However, the tree may be unbalanced, with the deepest level partially filled, and equivalently, the array may not be divided perfectly by the search in some iterations, half of the time resulting in the smaller subarray being eliminated. The actual number of average iterations is slightly higher, at log2 n − (n − log2 n − 1)/n iterations.[6] In the best case, where the first middle element selected is equal to the target value, its position is returned after one iteration.[13] In terms of iterations, no search algorithm that is based solely on comparisons can exhibit better average and worst-case performance than binary search.[12]

Each iteration of the binary search algorithm defined above makes one or two comparisons, checking if the middle element is equal to the target value in each iteration. Again assuming that each element is equally likely to be searched, each iteration makes 1.5 comparisons on average. A variation of the algorithm instead checks for equality at the very end of the search, eliminating on average half a comparison from each iteration. This decreases the time taken per iteration very slightly on most computers, while guaranteeing that the search takes the maximum number of iterations, on average adding one iteration to the search. Because the comparison loop is performed only ⌊log2 n + 1⌋ times in the worst case, for all but enormous n, the slight increase in comparison loop efficiency does not compensate for the extra iteration. Knuth 1998 gives a value of 2⁶⁶ (more than 73 quintillion)[14] elements for this variation to be faster.[lower-alpha 2][15][16]

Fractional cascading can be used to speed up searches of the same value in multiple arrays. Where k is the number of arrays, searching each array for the target value takes O(k log n) time; fractional cascading reduces this to O(k + log n).[17]

6.1.3 Binary search versus other schemes

Sorted arrays with binary search are a very inefficient solution when insertion and deletion operations are interleaved with retrieval, taking O(n) time for each such operation, and complicating memory use.[18] Other data structures support much more efficient insertion and deletion, and also fast exact matching. However, binary search applies to a wide range of search problems, usually solving them in O(log n) time regardless of the type or structure of the values themselves.
Linear search

Linear search is a simple search algorithm that checks every record until it finds the target value. Linear search can be done on a linked list, which allows for faster insertion and deletion than an array. Binary search is faster than linear search for sorted arrays except if the array is short.[lower-alpha 5][30] If the array must first be sorted, that cost must be amortized over any searches. Sorting the array also enables efficient approximate matches and other operations.[31]

Uniform binary search stores, instead of the lower and upper bounds, the index of the middle element and the number of elements around the middle element that were not eliminated yet. Each step reduces the width by about half. This variation is uniform because the difference between the indices of middle elements and the preceding middle elements chosen remains constant between searches of arrays of the same length.[34]
Boundary search

The following subroutine finds the right boundary index of T in A:

1. Set L to 0 and R to n.

2. While R − L > 1:

   • Set m (the position of the middle element) to the floor of L + (R − L) / 2.

   • If Am ≤ T, set L to m; otherwise, set R to m.

Exponential search extends binary search to unbounded lists. It starts by finding the first element with an index that is both a power of two and greater than the target value. Afterwards, it sets that index as the upper bound and switches to binary search. A search takes ⌊log2 x + 1⌋ iterations of the exponential search and at most ⌊log2 x⌋ iterations of the binary search, where x is the position of the target value. Exponential search works on bounded lists, but becomes an improvement over binary search only if the target value lies near the beginning of the array.[37]
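A sketch of exponential search on a bounded sorted array (the function name and the −1 convention are illustrative): the doubling loop finds a power-of-two upper bound, then a standard binary search runs on the bracketed range:

```python
def exponential_search(A, T):
    """Find the index of T in sorted A, or -1 if T is absent."""
    if not A:
        return -1
    # double the bound until it passes T or the end of the array
    bound = 1
    while bound < len(A) and A[bound] < T:
        bound *= 2
    # binary search on the bracketed range A[bound//2 .. min(bound, n-1)]
    lo, hi = bound // 2, min(bound, len(A) - 1)
    while lo <= hi:
        m = (lo + hi) // 2
        if A[m] < T:
            lo = m + 1
        elif A[m] > T:
            hi = m - 1
        else:
            return m
    return -1
```

For A = [1, 3, 5, 7, 9, 11] and T = 9, the bound doubles 1 → 2 → 4, stops at A[4] = 9, and the binary search over indices 2..4 returns 4.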
• Java offers a set of overloaded binarySearch() static methods in the classes Arrays and Collections in the standard java.util package for performing binary searches on Java arrays and on Lists, respectively.[56][57]

• Microsoft's .NET Framework 2.0 offers static generic versions of the binary search algorithm in its collection base classes. An example would be System.Array's method BinarySearch<T>(T[] array, T value).[58]

• Python provides the bisect module.[59]

• Ruby's Array class includes a bsearch method with built-in approximate matching.[60]

• Go's sort standard library package contains the functions Search, SearchInts, SearchFloat64s, and SearchStrings, which implement general binary search, as well as specific implementations for searching slices of integers, floating-point numbers, and strings, respectively.[61]

• For Objective-C, the Cocoa framework provides the NSArray -indexOfObject:inSortedRange:options:usingComparator: method in Mac OS X 10.6+.[62] Apple's Core Foundation C framework also contains a CFArrayBSearchValues() function.[63]

6.1.8 See also

• Bisection method – the same idea used to solve equations in the real numbers

Notes

[1] This happens as binary search will not always divide the array perfectly. Take for example the array [1, 2 ... 16]. The first iteration will select the midpoint of 8. On the left subarray are eight elements, but on the right are nine. If the search takes the right path, there is a higher chance that the search will make the maximum number of comparisons.[12]

[3] It is possible to perform hashing in guaranteed constant time.[20]

[4] The worst binary search tree for searching can be produced by inserting the values in sorted or near-sorted order or in an alternating lowest-highest record pattern.[25]

[5] Knuth 1998 performed a formal time performance analysis of both of these search algorithms. On Knuth's hypothetical MIX computer, intended to represent an ordinary computer, binary search takes on average 18 log n − 16 units of time for a successful search, while linear search with a sentinel node at the end of the list takes 1.75n + 8.5 − (n mod 2)/(4n) units. Linear search has lower initial complexity because it requires minimal computation, but it quickly outgrows binary search in complexity. On the MIX computer, binary search only outperforms linear search with a sentinel if n > 44.[12][29]

[6] As simply setting all of the bits which the hash functions point to for a specific key can affect queries for other keys which have a common hash location for one or more of the functions.[32]

[7] There exist improvements of the Bloom filter which improve on its complexity or support deletion; for example, the cuckoo filter exploits cuckoo hashing to gain these advantages.[32]

[8] That is, arrays of length 1, 3, 7, 15, 31, ...[42]

Citations

[1] Williams, Jr., Louis F. (1975). A modification to the half-interval search (binary search) method. Proceedings of the 14th ACM Southeast Conference. pp. 95–101. doi:10.1145/503561.503582.

[2] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "Binary search".

[7] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "Algorithm B".

[8] Bottenbruch, Hermann (1962). "Structure and Use of ALGOL 60". Journal of the ACM. 9 (2): 161–211. Procedure is described at p. 214 (§43), titled "Program for Binary Search".

[12] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "Further analysis of binary search".
[13] Chang 2003, p. 169.

[14] Sloane, Neil. Table of n, 2^n for n = 0..1000. Part of OEIS A000079. Retrieved 30 April 2016.

[15] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "Exercise 23".

[16] Rolfe, Timothy J. (1997). "Analytic derivation of comparisons in binary search". ACM SIGNUM Newsletter. 32 (4): 15–19. doi:10.1145/289251.289255.

[17] Chazelle, Bernard; Liu, Ding (2001). Lower bounds for intersection searching and fractional cascading in higher dimension. 33rd ACM Symposium on Theory of Computing. pp. 322–329. doi:10.1145/380752.380818.

[18] Knuth 1997, §2.2.2 ("Sequential Allocation").

[19] Knuth 1998, §6.4 ("Hashing").

[20] Knuth 1998, §6.4 ("Hashing"), subsection "History".

[21] Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (August 1994). "Dynamic Perfect Hashing: Upper and Lower Bounds". SIAM Journal on Computing. 23 (4): 738–761. doi:10.1137/S0097539791194094.

[22] Morin, Pat. "Hash Tables" (PDF). p. 1. Retrieved 28 March 2016.

[33] Bloom, Burton H. (1970). "Space/time Trade-offs in Hash Coding with Allowable Errors". CACM. 13 (7): 422–426. doi:10.1145/362686.362692.

[34] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "An important variation".

[35] Kiefer, J. (1953). "Sequential Minimax Search for a Maximum". Proceedings of the American Mathematical Society. 4 (3): 502–506. doi:10.2307/2032161. JSTOR 2032161.

[36] Hassin, Refael (1981). "On Maximizing Functions by Fibonacci Search". Fibonacci Quarterly. 19: 347–351.

[37] Moffat & Turpin 2002, p. 33.

[38] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "Interpolation search".

[39] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "Exercise 22".

[40] Perl, Yehoshua; Itai, Alon; Avni, Haim (1978). "Interpolation search—a log log n search". CACM. 21 (7): 550–553. doi:10.1145/359545.359557.

[41] Knuth 1998, §6.2.1 ("Searching an ordered table"), subsection "History and bibliography".

[42] "2^n − 1". OEIS A000225. Retrieved 7 May 2016.

[43] Lehmer, Derrick (1960). Teaching combinatorial tricks to a computer. Proceedings of Symposia in Applied Mathematics. 10. pp. 180–181. doi:10.1090/psapm/010.

[53] "bsearch – binary search a sorted table". The Open Group Base Specifications (7th ed.). The Open Group. 2013. Retrieved 28 March 2016.
[54] Stroustrup 2013, §32.6.1 ("Binary Search").

[55] "The Binary Search in COBOL". The American Programmer. Retrieved 7 November 2016.

[56] "java.util.Arrays". Java Platform Standard Edition 8 Documentation. Oracle Corporation. Retrieved 1 May 2016.

[57] "java.util.Collections". Java Platform Standard Edition 8 Documentation. Oracle Corporation. Retrieved 1 May 2016.

[58] "List<T>.BinarySearch Method (T)". Microsoft Developer Network. Retrieved 10 April 2016.

[59] "8.5. bisect — Array bisection algorithm". The Python Standard Library. Python Software Foundation. Retrieved 10 April 2016.

[60] Fitzgerald 2007, p. 152.

[61] "Package sort". The Go Programming Language. Retrieved 28 April 2016.

[62] "NSArray". Mac Developer Library. Apple Inc. Retrieved 1 May 2016.

[63] "CFArray". Mac Developer Library. Apple Inc. Retrieved 1 May 2016.

References

• Leiss, Ernst (2007). A Programmer's Companion to Algorithm Analysis. Boca Raton, FL: CRC Press. ISBN 1-58488-673-0.

• Moffat, Alistair; Turpin, Andrew (2002). Compression and Coding Algorithms. Hamburg, Germany: Kluwer Academic Publishers. doi:10.1007/978-1-4615-0935-6. ISBN 978-0-7923-7668-2.

• Sedgewick, Robert; Wayne, Kevin (2011). Algorithms (4th ed.). Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 978-0-321-57351-3.

• Stroustrup, Bjarne (2013). The C++ Programming Language (4th ed.). Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 978-0-321-56384-2.

6.1.10 External links

• NIST Dictionary of Algorithms and Data Structures: binary search
fast lookup, addition and removal of items, and can be used to implement either dynamic sets of items, or lookup tables that allow finding an item by its key (e.g., finding the phone number of a person by name).

Binary search trees keep their keys in sorted order, so that lookup and other operations can use the principle of binary search: when looking for a key in a tree (or a place to insert a new key), they traverse the tree from root to leaf, making comparisons to keys stored in the nodes of the tree and deciding, based on the comparison, to continue searching in the left or right subtrees. On average, this means that each comparison allows the operations to skip about half of the tree, so that each lookup, insertion or deletion takes time proportional to the logarithm of the number of items stored in the tree. This is much better than the linear time required to find items by key in an (unsorted) array, but slower than the corresponding operations on hash tables.

Several variants of the binary search tree have been studied in computer science; this article deals primarily with the basic type, making references to more advanced types when appropriate.

6.2.1 Definition

A binary search tree is a rooted binary tree, whose internal nodes each store a key (and optionally, an associated value) and each have two distinguished sub-trees, commonly denoted left and right. The tree additionally satisfies the binary search tree property, which states that the key in each node must be greater than or equal to any key stored in the left sub-tree, and less than or equal to any key stored in the right sub-tree.[1]:287 (The leaves (final nodes) of the tree contain no key and have no structure to distinguish them from one another. Leaves are commonly represented by a special leaf or nil symbol, a NULL pointer, etc.)

Generally, the information represented by each node is a record rather than a single data element. However, for sequencing purposes, nodes are compared according to their keys rather than any part of their associated records.

The major advantage of binary search trees over other data structures is that the related sorting algorithms and search algorithms such as in-order traversal can be very efficient; they are also easy to code.

Binary search trees are a fundamental data structure used to construct more abstract data structures such as sets, multisets, and associative arrays. Some of their disadvantages are as follows:

• The shape of the binary search tree depends entirely on the order of insertions and deletions, and can become degenerate.

• When inserting or searching for an element in a binary search tree, the key of each visited node has to be compared with the key of the element to be inserted or found.

• The keys in the binary search tree may be long and the run time may increase.

• After a long intermixed sequence of random insertion and deletion, the expected height of the tree approaches the square root of the number of keys, √n, which grows much faster than log n.

Order relation

Binary search requires an order relation by which every element (item) can be compared with every other element in the sense of a total preorder. The part of the element which effectively takes place in the comparison is called its key. Whether duplicates (different elements with the same key) shall be allowed in the tree or not does not depend on the order relation, but on the application only.

In the context of binary search trees, a total preorder is realized most flexibly by means of a three-way comparison subroutine.

6.2.2 Operations

Binary search trees support three main operations: insertion of elements, deletion of elements, and lookup (checking whether a key is present).

Searching

Searching a binary search tree for a specific key can be programmed recursively or iteratively.

We begin by examining the root node. If the tree is null, the key we are searching for does not exist in the tree. Otherwise, if the key equals that of the root, the search is successful and we return the node. If the key is less than that of the root, we search the left subtree. Similarly, if the key is greater than that of the root, we search the right subtree. This process is repeated until the key is found or the remaining subtree is null. If the searched key is not found after a null subtree is reached, then the key is not present in the tree. This is easily expressed as a recursive algorithm (implemented in Python):

    def search_recursively(key, node):
        if node is None or node.key == key:
            return node
        elif key < node.key:
            return search_recursively(key, node.left)
        else:  # key > node.key
            return search_recursively(key, node.right)

The same algorithm can be implemented iteratively:

    def search_iteratively(key, node):
        current_node = node
        while current_node is not None:
            if key == current_node.key:
                return current_node
            elif key < current_node.key:
                current_node = current_node.left
            else:  # key > current_node.key
                current_node = current_node.right
        return None

These two examples rely on the order relation being a total order.

If the order relation is only a total preorder, a reasonable extension of the functionality is the following: also in case of equality search down to the leaves in a direction specifiable by the user. A binary tree sort equipped with such a comparison function becomes stable.

Because in the worst case this algorithm must search from the root of the tree to the leaf farthest from the root, the search operation takes time proportional to the tree's height (see tree terminology). On average, binary search trees with n nodes have O(log n) height.[note 1] However, in the worst case, binary search trees can have O(n) height, when the unbalanced tree resembles a linked list (degenerate tree).

Insertion

Insertion begins as a search would begin; if the key is not equal to that of the root, we search the left or right subtrees as before. Eventually, we will reach an external node and add the new key-value pair (here encoded as a record 'newNode') as its right or left child, depending on the node's key. In other words, we examine the root and recursively insert the new node to the left subtree if its key is less than that of the root, or the right subtree if its key is greater than or equal to the root.

Here's how a typical binary search tree insertion might be performed in C++:

    Node* insert(Node* root, int key, int value) {
        if (!root)
            return new Node(key, value);
        if (key < root->key)
            root->left = insert(root->left, key, value);
        else  // key >= root->key
            root->right = insert(root->right, key, value);
        return root;
    }

The part that is rebuilt uses O(log n) space in the average case and O(n) in the worst case.

In either version, this operation requires time proportional to the height of the tree in the worst case, which is O(log n) time in the average case over all trees, but O(n) time in the worst case.

Another way to explain insertion is that in order to insert a new node in the tree, its key is first compared with that of the root. If its key is less than the root's, it is then compared with the key of the root's left child. If its key is greater, it is compared with the root's right child. This process continues, until the new node is compared with a leaf node, and then it is added as this node's right or left child, depending on its key: if the key is less than the leaf's key, then it is inserted as the leaf's left child, otherwise as the leaf's right child.

There are other ways of inserting nodes into a binary tree, but this is the only way of inserting nodes at the leaves and at the same time preserving the BST structure.

Deletion

When removing a node from a binary search tree it is mandatory to maintain the in-order sequence of the nodes. There are many possibilities to do this. However, the following method, which was proposed by T. Hibbard in 1962,[2] guarantees that the heights of the subject subtrees are changed by at most one. There are three possible cases to consider:

• Deleting a node with no children: simply remove the node from the tree.

• Deleting a node with one child: remove the node and replace it with its child.

• Deleting a node with two children: call the node to be deleted D. Do not delete D. Instead, choose either its in-order predecessor node or its in-order successor node as replacement node E (see figure). Copy
The above destructive procedural variant modifies the tree the user values of E to D.[note 2] If E does not have
in place. It uses only constant heap space (and the iter- a child simply remove E from its previous parent G.
ative version uses constant stack space as well), but the If E has a child, say F, it is a right child. Replace E
prior version of the tree is lost. Alternatively, as in the with F at E's parent.
following Python example, we can reconstruct all ances-
tors of the inserted node; any reference to the original
tree root remains valid, making the tree a persistent data In all cases, when D happens to be the root, make the
structure: replacement node root again.
def binary_tree_insert(node, key, value): if node is Broadly speaking, nodes with children are harder to
None: return NodeTree(None, key, value, None) if delete. As with all binary trees, a node’s in-order suc-
key == node.key: return NodeTree(node.left, key, cessor is its right subtree’s left-most child, and a node’s
value, node.right) if key < node.key: return Node- in-order predecessor is the left subtree’s right-most child.
Tree(binary_tree_insert(node.left, key, value), node.key, In either case, this node will have only one or no child at
node.value, node.right) else: return NodeTree(node.left, all. Delete it according to one of the two simpler cases
node.key, node.value, binary_tree_insert(node.right, above.
key, value)) Consistently using the in-order successor or the in-order
predecessor for every instance of the two-child case can
lead to an unbalanced tree, so some implementations select one or the other at different times.

Runtime analysis: Although this operation does not always traverse the tree down to a leaf, this is always a possibility; thus in the worst case it requires time proportional to the height of the tree. It does not require more even when the node has two children, since it still follows a single path and does not visit any node twice.

    def find_min(self):
        # Gets minimum node in a subtree
        current_node = self
        while current_node.left_child:
            current_node = current_node.left_child
        return current_node

    def replace_node_in_parent(self, new_value=None):
        if self.parent:
            if self == self.parent.left_child:
                self.parent.left_child = new_value
            else:
                self.parent.right_child = new_value
        if new_value:
            new_value.parent = self.parent

    def binary_tree_delete(self, key):
        if key < self.key:
            self.left_child.binary_tree_delete(key)
        elif key > self.key:
            self.right_child.binary_tree_delete(key)
        else:  # delete the key here
            if self.left_child and self.right_child:  # if both children are present
                successor = self.right_child.find_min()
                self.key = successor.key
                successor.binary_tree_delete(successor.key)
            elif self.left_child:  # if the node has only a *left* child
                self.replace_node_in_parent(self.left_child)
            elif self.right_child:  # if the node has only a *right* child
                self.replace_node_in_parent(self.right_child)
            else:  # this node has no children
                self.replace_node_in_parent(None)

Traversal

Main article: Tree traversal

Once the binary search tree has been created, its elements can be retrieved in-order by recursively traversing the left subtree of the root node, accessing the node itself, then recursively traversing the right subtree of the node, continuing this pattern with each node in the tree as it's recursively accessed. As with all binary trees, one may conduct a pre-order traversal or a post-order traversal, but neither is likely to be useful for binary search trees. An in-order traversal of a binary search tree will always result in a sorted list of items.

Traversal requires O(n) time, since it must visit every node. This algorithm is also O(n), so it is asymptotically optimal.

Traversal can also be implemented iteratively. For certain applications, e.g. greater-equal search or approximate search, an operation for single-step (iterative) traversal can be very useful. This is, of course, implemented without the callback construct and takes O(1) time on average and O(log n) in the worst case.

Verification

Sometimes we already have a binary tree, and we need to determine whether it is a BST. This problem has a simple recursive solution.

The BST property—every node in the right subtree has to be larger than the current node, and every node in the left subtree has to be smaller than the current node (or equal to it, if duplicate keys are allowed; with unique keys such nodes should not occur at all)—is the key to figuring out whether a tree is a BST or not. The greedy algorithm—simply traverse the tree and at every node check whether the node contains a value larger than the value at the left child and smaller than the value at the right child—does not work for all cases. Consider the following tree:

      20
     /  \
    10   30
        /  \
       5    40

In the tree above, each node meets the condition that it contains a value larger than its left child and smaller than its right child, and yet the tree is not a BST: the value 5 is in the right subtree of the node containing 20, a violation of the BST property.

Instead of making a decision based solely on the values of a node and its children, we also need information flowing down from the parent as well. In the case of the tree above, if we could remember about the node containing the value 20, we would see that the node with value 5 is violating the BST property contract.

So the condition we need to check at each node is:

• if the node is the left child of its parent, then it must be smaller than (or equal to) the parent and it must
pass down the value from its parent to its right subtree to make sure none of the nodes in that subtree is greater than the parent.

• if the node is the right child of its parent, then it must be larger than the parent and it must pass down the value from its parent to its left subtree to make sure none of the nodes in that subtree is lesser than the parent.

A recursive solution in C can explain this further:

    struct TreeNode {
        int key;
        int value;
        struct TreeNode *left;
        struct TreeNode *right;
    };

    bool isBST(struct TreeNode *node, int minKey, int maxKey) {
        if (node == NULL)
            return true;
        if (node->key < minKey || node->key > maxKey)
            return false;
        return isBST(node->left, minKey, node->key - 1) &&
               isBST(node->right, node->key + 1, maxKey);
    }

will essentially behave like a linked list data structure.

into a linked list with no left subtrees. For example, build_binary_tree([1, 2, 3, 4, 5]) yields the tree (1 (2 (3 (4 (5))))).

There are several schemes for overcoming this flaw with simple binary trees; the most common is the self-balancing binary search tree. If this same procedure is done using such a tree, the overall worst-case time is O(n log n), which is asymptotically optimal for a comparison sort. In practice, the added overhead in time and space for a tree-based sort (particularly for node allocation) makes it inferior to other asymptotically optimal sorts such as heapsort for static list sorting. On the other hand, it is one of the most efficient methods of incremental sorting, adding items to a list over time while keeping the list sorted at all times.

Priority queue operations

Performance comparisons

D. A. Heger (2004)[4] presented a performance comparison of binary search trees. Treap was found to have the best average performance, while red-black tree was found to have the smallest amount of performance variations.

If we know exactly how often each item will be accessed, we can construct[5] an optimal binary search tree, which is a search tree where the average cost of looking up an item (the expected search cost) is minimized.

Even if we only have estimates of the search costs, such a system can considerably speed up lookups on average. For example, if you have a BST of English words used in a spell checker, you might balance the tree based on word frequency in text corpora, placing words like "the" near the root and words like "agerasia" near the leaves. Such a tree might be compared with Huffman trees, which similarly seek to place frequently used items near the root in order to produce a dense information encoding; however, Huffman trees store data elements only in leaves, and these elements need not be ordered.

If we do not know the sequence in which the elements in the tree will be accessed in advance, we can use splay trees, which are asymptotically as good as any static search tree we can construct for any particular sequence of lookup operations.

Alphabetic trees are Huffman trees with the additional constraint on order, or, equivalently, search trees with the modification that all elements are stored in the leaves. Faster algorithms exist for optimal alphabetic binary trees (OABTs).

6.2.5 See also

• Search tree

• Binary search algorithm

• Randomized binary search tree

• Tango tree

• Day–Stout–Warren algorithm

6.2.6 Notes

[2] Of course, a generic software package has to work the other way around: it has to leave the user data untouched and to furnish E with all the BST links to and from D.

6.2.7 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.

[2] See Robert Sedgewick, Kevin Wayne: Algorithms, Fourth Edition. Pearson Education, 2011, ISBN 978-0-321-57351-3, p. 410.

[3] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer.

[4] Heger, Dominique A. (2004), "A Disquisition on The Performance Behavior of Binary Search Tree Data Structures" (PDF), European Journal for the Informatics Professional, 5 (5): 67–75.

[5] Gonnet, Gaston. "Optimal Binary Search Trees". Scientific Computation. ETH Zürich. Retrieved 1 December 2013.
6.2.8 Further reading

• This article incorporates public domain material from the NIST document: Black, Paul E. "Binary Search Tree". Dictionary of Algorithms and Data Structures.

• Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "12: Binary search trees, 15.5: Optimal binary search trees". Introduction to Algorithms (2nd ed.). MIT Press & McGraw-Hill. pp. 253–272, 356–363. ISBN 0-262-03293-7.

• Jarc, Duane J. (3 December 2005). "Binary Tree Traversals". Interactive Data Structure Visualizations. University of Maryland.

• Knuth, Donald (1997). "6.2.2: Binary Tree Searching". The Art of Computer Programming. 3: "Sorting and Searching" (3rd ed.). Addison-Wesley. pp. 426–458. ISBN 0-201-89685-0.

• Long, Sean. "Binary Search Tree" (PPT). Data Structures and Algorithms Visualization—A PowerPoint Slides Based Approach. SUNY Oneonta.

• Parlante, Nick (2001). "Binary Trees". CS Education Library. Stanford University.

6.2.9 External links

• Literate implementations of binary search trees in various languages on LiteratePrograms

• Binary Tree Visualizer (JavaScript animation of various BT-based data structures)

• Kovac, Kubo. "Binary Search Trees" (Java applet). Korešpondenčný seminár z programovania.

• Madru, Justin (18 August 2009). "Binary Search Tree". JDServer. C++ implementation.

• Binary Search Tree Example in Python

• "References to Pointers (C++)". MSDN. Microsoft. 2005. Gives an example binary tree implementation.

6.3 Random binary tree

In computer science and probability theory, a random binary tree is a binary tree selected at random from some probability distribution on binary trees. Two different distributions are commonly used: binary trees formed by inserting nodes one at a time according to a random permutation, and binary trees chosen from a uniform discrete distribution in which all distinct trees are equally likely. It is also possible to form other distributions, for instance by repeated splitting. Adding and removing nodes directly in a random binary tree will in general disrupt its random structure, but the treap and related randomized binary search tree data structures use the principle of binary trees formed from a random permutation in order to maintain a balanced binary search tree dynamically as nodes are inserted and deleted.

For random trees that are not necessarily binary, see random tree.

6.3.1 Binary trees from random permutations

For any set of numbers (or, more generally, values from some total order), one may form a binary search tree in which each number is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted numbers. The position into which each number should be inserted is uniquely determined by a binary search in the tree formed by the previous numbers. For instance, if the three numbers (1,3,2) are inserted into a tree in that sequence, the number 1 will sit at the root of the tree, the number 3 will be placed as its right child, and the number 2 as the left child of the number 3. There are six different permutations of the numbers (1,2,3), but only five trees may be constructed from them. That is because the permutations (2,1,3) and (2,3,1) form the same tree.

Expected depth of a node

For any fixed choice of a value x in a given set of n numbers, if one randomly permutes the numbers and forms a binary tree from them as described above, the expected value of the length of the path from the root of the tree to x is at most 2 log n + O(1), where "log" denotes the natural logarithm function and the O introduces big O notation. For, the expected number of ancestors of x is by linearity of expectation equal to the sum, over all other values y in the set, of the probability that y is an ancestor of x. And a value y is an ancestor of x exactly when y is the first element to be inserted from the elements in the interval [x,y]. Thus, the values that are adjacent to x in the sorted sequence of values have probability 1/2 of being an ancestor of x, the values one step away have probability 1/3, etc. Adding these probabilities for all positions in the sorted sequence gives twice a Harmonic number, leading to the bound above. A bound of this form holds also for the expected search length of a path to a fixed value x that is not part of the given set.[1]

The longest path

Although not as easy to analyze as the average path length, there has also been much research on determining the expectation (or high probability bounds) of the length of the
longest path in a binary search tree generated from a random insertion order. It is now known that this length, for a tree with n nodes, is almost surely

    (1/β) log n ≈ 4.311 log n,

where β is the unique number in the range 0 < β < 1 satisfying the equation

    2β e^(1−β) = 1.[2]

Expected number of leaves

In the random permutation model, each of the numbers from the set of numbers used to form the tree, except for the smallest and largest of the numbers, has probability 1/3 of being a leaf in the tree, for it is a leaf when it is inserted after its two neighbors, and any of the six permutations of these two neighbors and it are equally likely. By similar reasoning, the smallest and largest of the numbers have probability 1/2 of being a leaf. Therefore, the expected number of leaves is the sum of these probabilities, which for n ≥ 2 is exactly (n + 1)/3.

6.3.2 Uniformly random binary trees

The number of binary trees with n nodes is a Catalan number: for n = 1, 2, 3, ... these numbers of trees are 1, 2, 5, 14, 42, 132, ... Thus, if one of these trees is selected uniformly at random, its probability is the reciprocal of a Catalan number. Trees in this model have expected depth proportional to the square root of n, rather than to the logarithm;[4] however, the Strahler number of a uniformly random binary tree, a more sensitive measure of the distance from a leaf in which a node has Strahler number i whenever it has either a child with that number or two children with number i − 1, is with high probability logarithmic.[5]

Due to their large heights, this model of equiprobable random trees is not generally used for binary search trees, but it has been applied to problems of modeling the parse trees of algebraic expressions in compiler design[6] (where the above-mentioned bound on Strahler number translates into the number of registers needed to evaluate an expression[7]) and for modeling evolutionary trees.[8] In some cases the analysis of random binary trees under the random permutation model can be automatically transferred to the uniform model.[9]
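These counts can be checked exhaustively for small n. The following sketch (the helper names are mine, not from the text) builds one search tree per insertion order, compares trees by shape alone, and matches the result against the Catalan numbers:

```python
from itertools import permutations
from math import comb

def insert(node, key):
    # Plain binary-search-tree insertion, returning the subtree root.
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def shape(node):
    # Encode the tree's shape as nested tuples, discarding the keys.
    return None if node is None else (shape(node["left"]), shape(node["right"]))

def shape_of(order):
    root = None
    for key in order:
        root = insert(root, key)
    return shape(root)

def distinct_shapes(n):
    # Number of distinct shapes over all n! insertion orders of 1..n.
    return len({shape_of(p) for p in permutations(range(1, n + 1))})

def catalan(n):
    # n-th Catalan number: C(2n, n) / (n + 1).
    return comb(2 * n, n) // (n + 1)
```

For n = 3 this yields 5 distinct shapes from 6 permutations, reflecting the observation above that the insertion orders (2,1,3) and (2,3,1) produce the same tree.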
[3] Martinez & Roura (1998); Seidel & Aragon (1996).

[4] Knuth (2005), p. 15.

[5] Devroye & Kruszewski (1995). That it is at most logarithmic is trivial, because the Strahler number of every tree is bounded by the logarithm of the number of its nodes.

[6] Mahmoud (1992), p. 63.

[7] Flajolet, Raoult & Vuillemin (1979).

[8] Aldous (1996).

[9] Mahmoud (1992), p. 70.

6.3.5 References

• Aldous, David (1996), "Probability distributions on cladograms", in Aldous, David; Pemantle, Robin, Random Discrete Structures, The IMA Volumes in Mathematics and its Applications, 76, Springer-Verlag, pp. 1–18.

• Devroye, Luc (1986), "A note on the height of binary search trees", Journal of the ACM, 33 (3): 489–498, doi:10.1145/5925.5930.

• Devroye, Luc; Kruszewski, Paul (1995), "A note on the Horton-Strahler number for random trees", Information Processing Letters, 56 (2): 95–99, doi:10.1016/0020-0190(95)00114-R.

• Devroye, Luc; Kruszewski, Paul (1996), "The botanical beauty of random binary trees", in Brandenburg, Franz J., Graph Drawing: 3rd Int. Symp., GD'95, Passau, Germany, September 20-22, 1995, Lecture Notes in Computer Science, 1027, Springer-Verlag, pp. 166–177, doi:10.1007/BFb0021801, ISBN 3-540-60723-4.

• Flajolet, P.; Raoult, J. C.; Vuillemin, J. (1979), "The number of registers required for evaluating arithmetic expressions", Theoretical Computer Science, 9 (1): 99–125, doi:10.1016/0304-3975(79)90009-4.

• Hibbard, Thomas N. (1962), "Some combinatorial properties of certain trees with applications to searching and sorting", Journal of the ACM, 9 (1): 13–28, doi:10.1145/321105.321108.

• Knuth, Donald E. (1973), "6.2.2 Binary Tree Searching", The Art of Computer Programming, III, Addison-Wesley, pp. 422–451.

• Knuth, Donald E. (2005), "Draft of Section 7.2.1.6: Generating All Trees", The Art of Computer Programming, IV.

• Mahmoud, Hosam M. (1992), Evolution of Random Search Trees, John Wiley & Sons.

• Martinez, Conrado; Roura, Salvador (1998), "Randomized binary search trees", Journal of the ACM, ACM Press, 45 (2): 288–323, doi:10.1145/274787.274812.

• Pittel, B. (1985), "Asymptotical growth of a class of random trees", Annals of Probability, 13 (2): 414–427, doi:10.1214/aop/1176993000.

• Reed, Bruce (2003), "The height of a random binary search tree", Journal of the ACM, 50 (3): 306–332, doi:10.1145/765568.765571.

• Robson, J. M. (1979), "The height of binary search trees", Australian Computer Journal, 11: 151–153.

• Seidel, Raimund; Aragon, Cecilia R. (1996), "Randomized Search Trees", Algorithmica, 16 (4/5): 464–497, doi:10.1007/s004539900061.

6.3.6 External links

• Open Data Structures - Chapter 7 - Random Binary Search Trees

6.4 Tree rotation

Generic tree rotations.

In discrete mathematics, tree rotation is an operation on a binary tree that changes the structure without interfering with the order of the elements. A tree rotation moves one node up in the tree and one node down. It is used to change the shape of the tree, and in particular to decrease its height by moving smaller subtrees down and larger subtrees up, resulting in improved performance of many tree operations.

There exists an inconsistency in different descriptions as to the definition of the direction of rotations. Some say that the direction of rotation reflects the direction that a node is moving upon rotation (a left child rotating into its parent's location is a right rotation) while others say that the direction of rotation reflects which subtree is rotating
(a left subtree rotating into its parent's location is a left rotation, the opposite of the former). This article takes the approach of the directional movement of the rotating node.

6.4.1 Illustration

you can see in the diagram, the order of the leaves doesn't change. The opposite operation also preserves the order and is the second kind of rotation.

Assuming this is a binary search tree, as stated above, the elements must be interpreted as variables that can be compared to each other. The alphabetic characters to the left are used as placeholders for these variables. In the animation to the right, capital alphabetic characters are used as variable placeholders while lowercase Greek letters are placeholders for an entire set of variables. The circles represent individual nodes and the triangles represent subtrees. Each subtree could be empty, consist of a single node, or consist of any number of nodes.

ent points to the pivot after the rotation. Also, the programmer should note that this operation may result in a new root for the entire tree and take care to update pointers accordingly.

Tree rotations are used in a number of tree data structures such as AVL trees, red-black trees, splay trees, and treaps. They require only constant time because they are local transformations: they only operate on 5 nodes, and need not examine the rest of the tree.

• Associativity of a binary operation means that performing a tree rotation on it does not change the final result.

• The Day–Stout–Warren algorithm balances an unbalanced BST.
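In code, the two rotations can be sketched without parent pointers (a minimal sketch; the names are mine), with the caller responsible for re-attaching the returned subtree root. Following the convention adopted above, rotate_right moves the left child up into its parent's place:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(root):
    # The left child (the pivot) moves up into root's place;
    # the pivot's inner (right) subtree is handed across to root.
    pivot = root.left
    root.left = pivot.right
    pivot.right = root
    return pivot  # new subtree root

def rotate_left(root):
    # Mirror image: the right child moves up.
    pivot = root.right
    root.right = pivot.left
    pivot.left = root
    return pivot

def inorder(node):
    # The in-order sequence is unchanged by either rotation.
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)
```

Because each rotation rewrites only two child pointers, it runs in constant time, and rotate_left exactly undoes rotate_right.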
[2] Pournin, Lionel (2014), "The diameter of associahedra", Advances in Mathematics, 259: 13–42, arXiv:1207.6296, doi:10.1016/j.aim.2014.02.035, MR 3197650.

6.4.8 External links

• Java applets demonstrating tree rotations

• The AVL Tree Rotations Tutorial (RTF) by John Hargrove

6.5 Self-balancing binary search tree

The same tree after being height-balanced; the average path effort decreased to 3.00 node accesses.

In computer science, a self-balancing (or height-balanced) binary search tree is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.[1] These structures provide efficient implementations for mutable ordered lists, and can be used for other abstract data structures such as associative arrays, priority queues and sets.

The red–black tree, which is a type of self-balancing binary search tree, was called symmetric binary B-tree[2] and was renamed, but can still be confused with the generic concept of self-balancing binary search tree because of the initials.

6.5.1 Overview

If the data items are known ahead of time, the height can be kept small, in the average sense, by adding values in a random order, resulting in a random binary search tree. However, there are many situations (such as online algorithms) where this randomization is not viable.

Self-balancing binary trees solve this problem by performing transformations on the tree (such as tree rotations) at key insertion times, in order to keep the height proportional to log2(n). Although a certain overhead is involved, it may be justified in the long run by ensuring fast execution of later operations.

Maintaining the height always at its minimum value
⌊log2(n)⌋ is not always viable; it can be proven that any insertion algorithm which did so would have an excessive overhead. Therefore, most self-balanced BST algorithms keep the height within a constant factor of this lower bound.

In the asymptotic ("Big-O") sense, a self-balancing BST structure containing n items allows the lookup, insertion, and removal of an item in O(log n) worst-case time, and ordered enumeration of all items in O(n) time. For some implementations these are per-operation time bounds, while for others they are amortized bounds over a sequence of operations. These times are asymptotically optimal among all data structures that manipulate the key only through comparisons.

6.5.2 Implementations

Popular data structures implementing this type of tree include:

we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than merge sort, quicksort, or heapsort, because of the tree-balancing overhead as well as cache access patterns.)

Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.
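As a concrete sketch of such an augmentation (the names are mine, and plain unbalanced insertion stands in for a self-balancing tree, which would also have to update the counts during rotations), each node can store its subtree size, from which range-counting queries follow in time proportional to the height:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None
        self.size = 1  # number of nodes in this subtree (the augmented data)

def size(node):
    return node.size if node else 0

def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)  # maintain the count
    return node

def count_less(node, key):
    # Number of stored keys strictly less than `key`; the subtree sizes
    # let us skip whole subtrees, so only one root-to-leaf path is walked.
    if node is None:
        return 0
    if key <= node.key:
        return count_less(node.left, key)
    return 1 + size(node.left) + count_less(node.right, key)

def count_in_range(node, lo, hi):
    # Number of keys k with lo <= k < hi.
    return count_less(node, hi) - count_less(node, lo)
```

On a balanced tree both counting queries take O(log n) time, matching the bound quoted above.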
search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

6.6.1 Description

to have the same priority) then the shape of a treap has the same probability distribution as the shape of a random binary search tree, a search tree formed by inserting the nodes without rebalancing in a randomly chosen insertion order. Because random binary search trees are known to have logarithmic height with high probability, the same is true for treaps.

Aragon and Seidel also suggest assigning higher priorities to frequently accessed nodes, for instance by a process that, on each access, chooses a random number and replaces the priority of the node with that number if it is higher than the previous priority. This modification would cause the tree to lose its random shape; instead, frequently accessed nodes would be more likely to be near

6.6.2 Operations

Treaps support the following basic operations:

• To search for a given key value, apply a standard binary search algorithm in a binary search tree, ignoring the priorities.

• To split a treap into two smaller treaps, those smaller than key x and those larger than key x, insert x into the treap with maximum priority—larger than the priority of any node in the treap. After this insertion, x will be the root node of the treap, all values less than x will be found in the left subtreap, and all values greater than x will be found in the right subtreap. This costs as much as a single insertion into the treap.

• Merging two treaps that are the product of a former split, one can safely assume that the greatest value in the first treap is less than the smallest value in the second treap. Create a new node with value x, such that x is larger than this max-value in the first treap and smaller than the min-value in the second treap, assign it the minimum priority, then set its left child to the first heap and its right child to the second heap. Rotate as necessary to fix the heap order. After that it will be a leaf node, and can easily be deleted. The result is one treap merged from the two original treaps. This is effectively "undoing" a split, and costs the same.

root of the tree, and otherwise it calls the insertion procedure recursively to insert x within the left or right subtree (depending on whether its key is less than or greater than the root). The numbers of descendants are used by the algorithm to calculate the necessary probabilities for the random choices at each step. Placing x at the root of a subtree may be performed either as in the treap by inserting it at a leaf and then rotating it upwards, or by an alternative algorithm described by Martínez and Roura that splits the subtree into two pieces to be used as the left and right children of the new node.

The deletion procedure for a randomized binary search tree uses the same information per node as the insertion procedure, and like the insertion procedure it makes a sequence of O(log n) random decisions in order to join the two subtrees descending from the left and right children of the deleted node into a single tree. If the left or right subtree of the node to be deleted is empty, the join operation is trivial; otherwise, the left or right child of the deleted node is selected as the new subtree root with probability proportional to its number of descendants, and the join proceeds recursively.
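The leaf-insertion-then-rotation scheme mentioned above can be sketched as follows (a sketch, not the article's code; the names are mine, and a priority parameter is exposed only so the example is deterministic, whereas a real treap would always draw the priority at random):

```python
import random

class Node:
    def __init__(self, key, priority):
        self.key, self.priority = key, priority
        self.left = self.right = None

def rotate_right(root):
    pivot, root.left = root.left, root.left.right
    pivot.right = root
    return pivot

def rotate_left(root):
    pivot, root.right = root.right, root.right.left
    pivot.left = root
    return pivot

def treap_insert(node, key, priority=None):
    # Insert as in a plain BST, then rotate the new node upward while it
    # violates the max-heap ordering on priorities.
    if priority is None:
        priority = random.random()
    if node is None:
        return Node(key, priority)
    if key < node.key:
        node.left = treap_insert(node.left, key, priority)
        if node.left.priority > node.priority:
            node = rotate_right(node)
    else:
        node.right = treap_insert(node.right, key, priority)
        if node.right.priority > node.priority:
            node = rotate_left(node)
    return node

def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)
```

Regardless of the priorities drawn, the in-order sequence of keys stays sorted; the priorities only decide which of the possible tree shapes results.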
The union of two treaps t1 and t2, representing sets A and B, is a treap t that represents A ∪ B. The following recursive algorithm computes the union:

    function union(t1, t2):
        if t1 = nil:
            return t2
        if t2 = nil:
            return t1
        if priority(t1) < priority(t2):
            swap t1 and t2
        t<, t> ← split t2 on key(t1)
        return new node(key(t1), union(left(t1), t<), union(right(t1), t>))

Here, split is presumed to return two trees: one holding the keys less than its input key, one holding the greater keys. (The algorithm is non-destructive, but an in-place destructive version exists as well.)

The algorithm for intersection is similar, but requires the join helper routine. The complexity of each of union, intersection and difference is O(m log(n/m)) for treaps of sizes m and n, with m ≤ n. Moreover, since the recursive calls to union are independent of each other, they can be executed in parallel.[4]

6.6.3 Randomized binary search tree

The randomized binary search tree, introduced by Martínez and Roura subsequently to the work of Aragon and Seidel on treaps,[5] stores the same nodes with the same random distribution of tree shape, but maintains different information within the nodes of the tree in order to maintain its randomized structure.

Rather than storing random priorities on each node, the randomized binary search tree stores a small integer at each node, the number of its descendants (counting itself as one); these numbers may be maintained during tree rotation operations at only a constant additional amount of time per rotation. When a key x is to be inserted into

6.6.4 Comparison

The information stored per node in the randomized binary tree is simpler than in a treap (a small integer rather than a high-precision random number), but it makes a greater number of calls to the random number generator (O(log n) calls per insertion or deletion rather than one call per insertion) and the insertion procedure is slightly more complicated due to the need to update the numbers of descendants per node. A minor technical difference is that, in a treap, there is a small probability of a collision (two keys getting the same priority), and in both cases there will be statistical differences between a true random number generator and the pseudo-random number generator typically used on digital computers. However, in any case the differences between the theoretical model of perfect random choices used to design the algorithm and the capabilities of actual random number generators are vanishingly small.

Although the treap and the randomized binary search tree both have the same random distribution of tree shapes after each update, the history of modifications to the trees performed by these two data structures over a sequence of insertion and deletion operations may be different. For instance, in a treap, if the three numbers 1, 2, and 3 are inserted in the order 1, 3, 2, and then the number 2 is deleted, the remaining two nodes will have the same parent-child relationship that they did prior to the insertion of the middle number. In a randomized binary search tree, the tree after the deletion is equally likely to be either of the two possible trees on its two nodes, independently
a tree that already has n nodes, the insertion algorithm of what the tree looked like prior to the insertion of the
chooses with probability 1/(n + 1) to place x as the new middle number.
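The union algorithm can be sketched directly in runnable form. The following Python sketch (the class and helper names are illustrative, not from the text) keeps the larger priority on top, as in a max-heap, and makes split drop a duplicate of the pivot key so that union behaves as a set union:

```python
import random

class Node:
    def __init__(self, key, priority=None, left=None, right=None):
        self.key = key
        # Assumption: priorities are drawn uniformly at random, as in the text.
        self.priority = random.random() if priority is None else priority
        self.left = left
        self.right = right

def split(t, key):
    """Split treap t into (keys < key, keys > key); non-destructive variant."""
    if t is None:
        return None, None
    if key < t.key:
        l, r = split(t.left, key)
        return l, Node(t.key, t.priority, r, t.right)
    elif key > t.key:
        l, r = split(t.right, key)
        return Node(t.key, t.priority, t.left, l), r
    else:  # drop a duplicate of the pivot so union acts as a set union
        return t.left, t.right

def union(t1, t2):
    """Union of two treaps, following the recursive algorithm in the text."""
    if t1 is None:
        return t2
    if t2 is None:
        return t1
    if t1.priority < t2.priority:
        t1, t2 = t2, t1            # keep the higher-priority root on top
    lo, hi = split(t2, t1.key)
    return Node(t1.key, t1.priority,
                union(t1.left, lo), union(t1.right, hi))

def keys(t):
    """In-order key list, i.e. the sorted set contents."""
    return [] if t is None else keys(t.left) + [t.key] + keys(t.right)
```

Because the result's in-order sequence is determined by the keys alone, the output below is the same for any random choice of priorities.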
6.7. AVL TREE 181
holds for every node N in the tree.

A node N with BalanceFactor(N) < 0 is called “left-heavy”, one with BalanceFactor(N) > 0 is called “right-heavy”, and one with BalanceFactor(N) = 0 is sometimes simply called “balanced”.

Remark

In the sequel, because there is a one-to-one correspondence between nodes and the subtrees rooted by them, we sometimes leave it to the context whether the name of an object stands for the node or the subtree.

6.7.2 Operations

Read-only operations of an AVL tree involve carrying out the same actions as would be carried out on an unbalanced binary search tree, but modifications have to observe and restore the height balance of the subtrees.

Searching

Searching for a specific key in an AVL tree can be done the same way as in a normal unbalanced binary search tree. In order for search to work effectively it has to employ a comparison function which establishes a total order (or at least a total preorder) on the set of keys. The number of comparisons required for a successful search is limited by the height h, and for an unsuccessful search is very close to h, so both are in O(log n).

Traversal

Once a node has been found in an AVL tree, the next or previous node can be accessed in amortized constant time. Some instances of exploring these “nearby” nodes require traversing up to h ∝ log(n) links (particularly when navigating from the rightmost leaf of the root’s left subtree to the root, or from the root to the leftmost leaf of the root’s right subtree; in the AVL tree of figure 1, moving from node P to the next-but-one node Q takes 3 steps). However, exploring all n nodes of the tree in this manner would visit each link exactly twice: one downward visit to enter the subtree rooted by that node, another visit upward to leave that node’s subtree after having explored it. And since there are n−1 links in any tree, the amortized cost is 2×(n−1)/n, or approximately 2.

Insert

Since with a single insertion the height of an AVL subtree cannot increase by more than one, the temporary balance factor of a node after an insertion will be in the range [−2, +2]. For each node checked, if the temporary balance factor remains in the range from −1 to +1 then only an update of the balance factor and no rotation is necessary. However, if the temporary balance factor becomes less than −1 or greater than +1, the subtree rooted at this node is AVL unbalanced, and a rotation is needed. The various cases of rotations are described in section Rebalancing.

By inserting the new node Z as a child of node X, the height of that subtree Z increases from 0 to 1.

Invariant of the retracing loop for an insertion

The height of the subtree rooted by Z has increased by 1. It is already in AVL shape.

    for (X = parent(Z); X != null; X = parent(Z)) { // Loop (possibly up to the root)
        // BalanceFactor(X) has to be updated:
        if (Z == right_child(X)) { // The right subtree increases
            if (BalanceFactor(X) > 0) { // X is right-heavy
                // ===> the temporary BalanceFactor(X) == +2
                // ===> rebalancing is required.
                G = parent(X); // Save parent of X around rotations
                if (BalanceFactor(Z) < 0) // Right Left Case (see figure 5)
                    N = rotate_RightLeft(X, Z); // Double rotation: Right(Z) then Left(X)
                else // Right Right Case (see figure 4)
                    N = rotate_Left(X, Z); // Single rotation Left(X)
                // After rotation adapt parent link
            } else {
                if (BalanceFactor(X) < 0) {
                    BalanceFactor(X) = 0; // Z's height increase is absorbed at X.
                    break; // Leave the loop
                }
                BalanceFactor(X) = +1;
                Z = X; // Height(Z) increases by 1
                continue;
            }
        } else { // Z == left_child(X): the left subtree increases
            if (BalanceFactor(X) < 0) { // X is left-heavy
                // ===> the temporary BalanceFactor(X) == -2
                // ===> rebalancing is required.
                G = parent(X); // Save parent of X around rotations
                if (BalanceFactor(Z) > 0) // Left Right Case
                    N = rotate_LeftRight(X, Z); // Double rotation: Left(Z) then Right(X)
                else // Left Left Case
                    N = rotate_Right(X, Z); // Single rotation Right(X)
                // After rotation adapt parent link
            } else {
                if (BalanceFactor(X) > 0) {
                    BalanceFactor(X) = 0; // Z's height increase is absorbed at X.
                    break; // Leave the loop
                }
                BalanceFactor(X) = -1;
                Z = X; // Height(Z) increases by 1
                continue;
            }
        }
        // After a rotation adapt parent link:
        // N is the new root of the rotated subtree
        // Height does not change: Height(N) == old Height(X)
        parent(N) = G;
        if (G != null) {
            if (X == left_child(G)) left_child(G) = N;
            else right_child(G) = N;
            break;
        } else {
            tree->root = N; // N is the new root of the total tree
            break;
        }
        // There is no fall thru, only break; or continue;
    }
    // Unless the loop is left via break, the height of the total tree increases by 1.

In order to update the balance factors of all nodes, first observe that all nodes requiring correction lie from child to parent along the path of the inserted leaf. If the above procedure is applied to nodes along this path, starting from the leaf, then every node in the tree will again have a balance factor of −1, 0, or 1.

The retracing can stop if the balance factor becomes 0, implying that the height of that subtree remains unchanged.

If the balance factor becomes ±1 then the height of the subtree increases by one and the retracing needs to continue.

If the balance factor temporarily becomes ±2, this has to be repaired by an appropriate rotation, after which the subtree has the same height as before (and its root the balance factor 0).

The time required is O(log n) for lookup, plus a maximum of O(log n) retracing levels (O(1) on average) on the way back to the root, so the operation can be completed in O(log n) time.

Delete

The preliminary steps for deleting a node are described in section Binary search tree#Deletion. There, the effective deletion of the subject node or the replacement node decreases the height of the corresponding child tree either from 1 to 0 or from 2 to 1, if that node had a child.

Starting at this subtree, it is necessary to check each of the ancestors for consistency with the invariants of AVL trees. This is called “retracing”.

Since with a single deletion the height of an AVL subtree cannot decrease by more than one, the temporary balance factor of a node will be in the range from −2 to +2. If the balance factor remains in the range from −1 to +1 it can be adjusted in accord with the AVL rules. If it becomes ±2 then the subtree is unbalanced and needs to be rotated. The various cases of rotations are described in section Rebalancing.

Invariant of the retracing loop for a deletion

The height of the subtree rooted by N has decreased by 1. It is already in AVL shape.

    for (X = parent(N); X != null; X = G) { // Loop (possibly up to the root)
        G = parent(X); // Save parent of X around rotations
        // BalanceFactor(X) has not yet been updated!
        if (N == left_child(X)) { // the left subtree decreases
            if (BalanceFactor(X) > 0) { // X is right-heavy
                // ===> the temporary BalanceFactor(X) == +2
                // ===> rebalancing is required.
                Z = right_child(X); // Sibling of N (higher by 2)
                b = BalanceFactor(Z);
                if (b < 0) // Right Left Case (see figure 5)
                    N = rotate_RightLeft(X, Z); // Double rotation: Right(Z) then Left(X)
                else // Right Right Case (see figure 4)
                    N = rotate_Left(X, Z); // Single rotation Left(X)
                // After rotation adapt parent link
            } else {
                if (BalanceFactor(X) == 0) {
                    BalanceFactor(X) = +1; // N's height decrease is absorbed at X.
                    break; // Leave the loop
                }
                N = X;
                BalanceFactor(N) = 0; // Height(N) decreases by 1
                continue;
            }
        } else { // (N == right_child(X)): The right subtree decreases
            if (BalanceFactor(X) < 0) { // X is left-heavy
                // ===> the temporary BalanceFactor(X) == -2
                // ===> rebalancing is required.
                Z = left_child(X); // Sibling of N (higher by 2)
                b = BalanceFactor(Z);
                if (b > 0) // Left Right Case
                    N = rotate_LeftRight(X, Z); // Double rotation: Left(Z) then Right(X)
                else // Left Left Case
                    N = rotate_Right(X, Z); // Single rotation Right(X)
                // After rotation adapt parent link
            } else {
                if (BalanceFactor(X) == 0) {
                    BalanceFactor(X) = -1; // N's height decrease is absorbed at X.
                    break; // Leave the loop
                }
                N = X;
                BalanceFactor(N) = 0; // Height(N) decreases by 1
                continue;
            }
        }
        // After a rotation adapt parent link:
        // N is the new root of the rotated subtree
        parent(N) = G;
        if (G != null) {
            if (X == left_child(G)) left_child(G) = N;
            else right_child(G) = N;
            if (b == 0) break; // Height does not change: Leave the loop
        } else {
            tree->root = N; // N is the new root of the total tree
            continue;
        }
        // Height(N) decreases by 1 (== old Height(X)-1)
    }
    // Unless the loop is left via break, the height of the total tree decreases by 1.

The retracing can stop if the balance factor becomes ±1, meaning that the height of that subtree remains unchanged.
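The insertion procedure with retracing can be condensed into a short runnable sketch. For brevity, this Python version (all names are illustrative) recomputes subtree heights instead of maintaining per-node balance factors as the loop above does; the rebalancing cases map onto the same Right Right / Right Left / Left Left / Left Right rotations, and the upward retracing corresponds to rebalancing as the recursion unwinds:

```python
class AVLNode:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def height(n): return n.height if n else 0

def balance(n): return height(n.right) - height(n.left)   # > 0 means right-heavy

def fix_height(n): n.height = 1 + max(height(n.left), height(n.right))

def rotate_left(x):            # Right Right case: single rotation Left(x)
    z = x.right
    x.right, z.left = z.left, x
    fix_height(x); fix_height(z)
    return z

def rotate_right(x):           # Left Left case: single rotation Right(x)
    z = x.left
    x.left, z.right = z.right, x
    fix_height(x); fix_height(z)
    return z

def rebalance(x):
    fix_height(x)
    if balance(x) == +2:                  # temporary balance factor +2
        if balance(x.right) < 0:          # Right Left case: double rotation
            x.right = rotate_right(x.right)
        return rotate_left(x)
    if balance(x) == -2:                  # temporary balance factor -2
        if balance(x.left) > 0:           # Left Right case: double rotation
            x.left = rotate_left(x.left)
        return rotate_right(x)
    return x

def insert(root, key):
    if root is None:
        return AVLNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return rebalance(root)    # "retracing" happens as the recursion unwinds

def inorder(n):
    return [] if n is None else inorder(n.left) + [n.key] + inorder(n.right)
```

Inserting an ascending run exercises the Right Right case repeatedly and yields a perfectly balanced tree.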
184 CHAPTER 6. SUCCESSORS AND NEIGHBORS
If the balance factor becomes 0 then the height of the subtree decreases by one and the retracing needs to continue.

If the balance factor temporarily becomes ±2, this has to be repaired by an appropriate rotation. It depends on the balance factor of the sibling Z (the higher child tree) whether the height of the subtree decreases by one or does not change (the latter, if Z has the balance factor 0).

The time required is O(log n) for lookup, plus a maximum of O(log n) retracing levels (O(1) on average) on the way back to the root, so the operation can be completed in O(log n) time.

Set operations and bulk operations

In addition to the single-element insert, delete and lookup operations, several set operations have been defined on AVL trees: union, intersection and set difference. Fast bulk operations on insertions or deletions can then be implemented based on these set functions. These set operations rely on two helper operations, Split and Join. With the new operations, the implementation of AVL trees can be more efficient and highly parallelizable.[11]

• Join: The function Join is on two AVL trees t1 and t2 and a key k, and will return a tree containing all elements in t1 and t2 as well as k. It requires k to be greater than all keys in t1 and smaller than all keys in t2. If the two trees differ in height by at most one, Join simply creates a new node with left subtree t1, root k and right subtree t2. Otherwise, suppose that t1 is higher than t2 by more than one (the other case is symmetric). Join follows the right spine of t1 until a node c which is balanced with t2. At this point a new node with left child c, root k and right child t2 is created to replace c. The new node satisfies the AVL invariant, and its height is one greater than c. The increase in height can increase the height of its ancestors, possibly invalidating the AVL invariant of those nodes. This can be fixed either with a double rotation if invalid at the parent or a single left rotation if invalid higher in the tree, in both cases restoring the height for any further ancestor nodes. Join will therefore require at most two rotations. The cost of this function is the difference of the heights between the two input trees.

• Split: To split an AVL tree into two smaller trees, those smaller than key x and those larger than key x, first draw a path from the root by inserting x into the AVL tree. After this insertion, all values less than x will be found on the left of the path, and all values greater than x will be found on the right. By applying Join, all the subtrees on the left side are merged bottom-up, using keys on the path as intermediate nodes from bottom to top, to form the left tree; the right tree is formed symmetrically. The cost of Split is O(log n), the order of the height of the tree.

The union of two AVLs t1 and t2, representing sets A and B, is an AVL t that represents A ∪ B. The following recursive function computes this union:

    function union(t1, t2):
        if t1 = nil: return t2
        if t2 = nil: return t1
        t<, t> ← split t2 on t1.root
        return join(t1.root, union(left(t1), t<), union(right(t1), t>))

Here, Split is presumed to return two trees: one holding the keys less than its input key, one holding the greater keys. (The algorithm is non-destructive, but an in-place destructive version exists as well.)

The algorithm for intersection or difference is similar, but requires the Join2 helper routine, which is the same as Join but without the middle key. Based on the new functions for union, intersection or difference, either one key or multiple keys can be inserted into or deleted from the AVL tree. Since Split calls Join but does not deal with the balancing criteria of AVL trees directly, such an implementation is usually called the “join-based” implementation.

The complexity of each of union, intersection and difference is O(m log(n/m + 1)) for AVLs of sizes m and n (≥ m). More importantly, since the recursive calls to union, intersection or difference are independent of each other, they can be executed in parallel with a parallel depth O(log m log n).[11] When m = 1, the join-based implementation has the same computational DAG as single-element insertion and deletion.

6.7.3 Rebalancing

If during a modifying operation (e.g. insert, delete) a (temporary) height difference of more than one arises between two child subtrees, the parent subtree has to be “rebalanced”. The given repair tools are the so-called tree rotations, because they move the keys only “vertically”, so that the (“horizontal”) in-order sequence of the keys is fully preserved (which is essential for a binary-search tree).[12][13]

Let Z be the child higher by 2 (see figures 4 and 5). Two flavors of rotations are required: simple and double. Rebalancing can be accomplished by a simple rotation (see figure 4) if the inner child of Z, that is, the child with a child direction opposite to that of Z (t23 in figure 4, Y in figure 5), is not higher than its sibling, the outer child t4 in both figures. This situation is called “Right Right” or “Left Left” in the literature.

On the other hand, if the inner child (t23 in figure 4, Y in figure 5) of Z is higher than t4, then rebalancing can be accomplished by a double rotation (see figure 5). This situation is called “Right Left” because X is right- and Z left-heavy (or “Left Right” if X is left- and Z is right-heavy).

From a mere graph-theoretic point of view, the two rotations of a double are just single rotations. But they encounter and have to maintain other configurations of balance factors. So, in effect, it is simpler – and more efficient – to specialize, just as in the original paper, where
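The join-based approach can be sketched in runnable form under the same height-based simplification used for plain insertion (all names are illustrative, and the rebalancing here is a simplified stand-in for the at-most-two rotations described in the Join bullet): Join attaches the shorter tree along the taller tree's spine and rebalances on the way back up, Split peels the search path apart with Join, and union follows the recursive function above.

```python
class AVLNode:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def h(t): return t.height if t else 0
def fix(t): t.height = 1 + max(h(t.left), h(t.right)); return t

def rot_left(x):
    z = x.right; x.right, z.left = z.left, x; fix(x); return fix(z)

def rot_right(x):
    z = x.left; x.left, z.right = z.right, x; fix(x); return fix(z)

def rebalance(x):
    fix(x)
    if h(x.right) - h(x.left) == 2:
        if h(x.right.left) > h(x.right.right):   # Right Left case
            x.right = rot_right(x.right)
        return rot_left(x)
    if h(x.left) - h(x.right) == 2:
        if h(x.left.right) > h(x.left.left):     # Left Right case
            x.left = rot_left(x.left)
        return rot_right(x)
    return x

def join(tl, k, tr):
    """Requires all keys in tl < k < all keys in tr."""
    if abs(h(tl) - h(tr)) <= 1:                  # balanced: make k the root
        n = AVLNode(k); n.left, n.right = tl, tr
        return fix(n)
    if h(tl) > h(tr):                            # descend the right spine of tl
        tl.right = join(tl.right, k, tr)
        return rebalance(tl)
    tr.left = join(tl, k, tr.left)               # symmetric case
    return rebalance(tr)

def split(t, key):
    """Return (keys < key, keys > key); drops key itself if present."""
    if t is None:
        return None, None
    if key < t.key:
        l, r = split(t.left, key)
        return l, join(r, t.key, t.right)
    if key > t.key:
        l, r = split(t.right, key)
        return join(t.left, t.key, l), r
    return t.left, t.right

def union(t1, t2):
    if t1 is None: return t2
    if t2 is None: return t1
    l, r = split(t2, t1.key)
    return join(union(t1.left, l), t1.key, union(t1.right, r))

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)
```

Building both inputs by repeated union of singletons keeps the sketch self-contained; the in-order traversal of the result is the sorted union of the key sets.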
[Figures 4 and 5: simple rotation and double rotation; only scattered labels survive here (nodes X, Y, Z; balance factors +2, +1, 0, −1; subtree heights h, h+1; subtrees t1, t2, t3, t4; caption “Simple rotation”).]

The balance-factor updates at the end of the double rotation, which returns Y as the new root of the rotated subtree:

    else
        // 2nd case, BalanceFactor(Y) == 0, only happens with deletion, not insertion:
        if (BalanceFactor(Y) == 0) {
            BalanceFactor(X) = 0;
            BalanceFactor(Z) = 0;
        } else {
            // 3rd case happens with insertion or deletion: t2 was higher
            BalanceFactor(X) = 0;
            BalanceFactor(Z) = +1; // t4 now higher
        }
    BalanceFactor(Y) = 0;
    return Y; // return new root of rotated subtree
    }

6.7.4 Comparison to other structures

… ≈ −0.328, and d := 1 + 1/(φ⁴√5) ≈ 1.065.

• an RB tree’s height is at most …
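The balance-factor updates shown for the double rotation can be completed into a runnable sketch. The following Python version of rotate_RightLeft (the node class is illustrative; the sign convention — positive means right-heavy — matches the pseudocode) performs Right(Z) then Left(X) and then applies the three balance-factor cases:

```python
class N:
    def __init__(self, key, bf=0, left=None, right=None):
        self.key, self.bf, self.left, self.right = key, bf, left, right

def rotate_RightLeft(X, Z):
    """Double rotation for the Right Left case: X is right-heavy (+2) and
    Z = X.right is left-heavy. Returns Y, the new root of the rotated subtree."""
    Y = Z.left              # inner child of Z, higher than its sibling
    # first rotation: Right(Z)
    Z.left = Y.right
    Y.right = Z
    # second rotation: Left(X)
    X.right = Y.left
    Y.left = X
    # balance-factor updates, mirroring the three cases of the pseudocode:
    if Y.bf > 0:            # t3 was higher
        X.bf = -1           # t1 now higher
        Z.bf = 0
    elif Y.bf == 0:         # 2nd case, only happens with deletion, not insertion
        X.bf = 0
        Z.bf = 0
    else:                   # 3rd case: t2 was higher
        X.bf = 0
        Z.bf = +1           # t4 now higher
    Y.bf = 0
    return Y                # new root of the rotated subtree
```

With empty subtrees t1..t4, the configuration X(10) → Z(30) → Y(20) rotates into a balanced subtree rooted at Y.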
• Tree rotation
• Red–black tree
• Splay tree
• Scapegoat tree
• B-tree

[1] Eric Alexander. “AVL Trees”.

[2] Robert Sedgewick, Algorithms, Addison-Wesley, 1983, ISBN 0-201-06672-6, page 199, chapter 15: Balanced Trees.

[3] Adelson-Velsky, Georgy; Landis, Evgenii (1962). “An algorithm for the organization of information”. Proceedings of the USSR Academy of Sciences (in Russian). 146: 263–266. English translation by Myron J. Ricci in Soviet Math. Doklady, 3:1259–1263, 1962.

[4] Pfaff, Ben (June 2004). “Performance Analysis of BSTs in System Software” (PDF). Stanford University.

[5] AVL trees are not weight-balanced? (meaning: AVL trees are not μ-balanced?) Thereby: A binary tree is called μ-balanced, with 0 ≤ μ ≤ 1/2, if for every node N the inequality 1/2 − μ ≤ |Nl|/(|N| + 1) ≤ 1/2 + μ holds and μ is minimal with this property. |N| is the number of nodes below the tree with N as root (including the root) and Nl is the left child node of N.

[6] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. p. 459. ISBN 0-201-89685-0.

[7] More precisely: if the AVL balance information is kept in the child nodes – with the meaning “when going upward there is an additional increment in height” – this can be done with one bit. Nevertheless, the modifying operations can be programmed more efficiently if the balance information can be checked with one test.

[8] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. p. 460. ISBN 0-201-89685-0.

[9] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. pp. 458–481. ISBN 0-201-89685-0.

[10] Pfaff, Ben (2004). An Introduction to Binary Search Trees and Balanced Trees. Free Software Foundation, Inc. pp. 107–138.

[11] Blelloch, Guy E.; Ferizovic, Daniel; Sun, Yihan (2016), “Just Join for Parallel Ordered Sets”, Proc. 28th ACM Symp. Parallel Algorithms and Architectures (SPAA 2016), ACM, pp. 253–264, doi:10.1145/2935764.2935768, ISBN 978-1-4503-4210-0.

[12] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. pp. 458–481. ISBN 0-201-89685-0.

[15] Mehlhorn & Sanders 2008, pp. 165, 158.

[16] Dinesh P. Mehta, Sartaj Sahni (Ed.). Handbook of Data Structures and Applications, 10.4.2.

[17] Red–black tree#Proof of asymptotic bounds.

[18] Ben Pfaff: Performance Analysis of BSTs in System Software. Stanford University, 2004.

6.7.7 Further reading

• Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Pages 458–475 of section 6.2.3: Balanced Trees.

6.7.8 External links

• This article incorporates public domain material from the NIST document: Black, Paul E. “AVL Tree”. Dictionary of Algorithms and Data Structures.
• AVL tree demonstration (HTML5/Canvas)
• AVL tree demonstration (requires Flash)
• AVL tree demonstration (requires Java)

6.8 Red–black tree

A red–black tree is a kind of self-balancing binary search tree. Each node of the binary tree has an extra bit, and that bit is often interpreted as the color (red or black) of the node. These color bits are used to ensure the tree remains approximately balanced during insertions and deletions.[2]

Balance is preserved by painting each node of the tree with one of two colors (typically called 'red' and 'black') in a way that satisfies certain properties, which collectively constrain how unbalanced the tree can become in
the worst case. When the tree is modified, the new tree is subsequently rearranged and repainted to restore the coloring properties. The properties are designed in such a way that this rearranging and recoloring can be performed efficiently.

The balancing of the tree is not perfect, but it is good enough to guarantee searching in O(log n) time, where n is the total number of elements in the tree. The insertion and deletion operations, along with the tree rearrangement and recoloring, are also performed in O(log n) time.[3]

Tracking the color of each node requires only 1 bit of information per node because there are only two colors. The tree does not contain any other data specific to its being a red–black tree, so its memory footprint is almost identical to that of a classic (uncolored) binary search tree. In many cases, the additional bit of information can be stored at no additional memory cost.

6.8.1 History

In 1972, Rudolf Bayer[4] invented a data structure that was a special order-4 case of a B-tree. These trees maintained all paths from root to leaf with the same number of nodes, creating perfectly balanced trees. However, they were not binary search trees. Bayer called them a “symmetric binary B-tree” in his paper and later they became popular as 2-3-4 trees or just 2-4 trees.[5]

6.8.2 Terminology

A red–black tree is a special type of binary tree, used in computer science to organize pieces of comparable data, such as text fragments or numbers.

The leaf nodes of red–black trees do not contain data. These leaves need not be explicit in computer memory—a null child pointer can encode the fact that this child is a leaf—but it simplifies some algorithms for operating on red–black trees if the leaves really are explicit nodes. To save memory, sometimes a single sentinel node performs the role of all leaf nodes; all references from internal nodes to leaf nodes then point to the sentinel node.

Red–black trees, like all binary search trees, allow efficient in-order traversal (that is: in the order Left–Root–Right) of their elements. The search time results from the traversal from root to leaf, and therefore a balanced tree of n nodes, having the least possible tree height, results in O(log n) search time.

6.8.3 Properties

[Figure: an example red–black tree with root 13; 8 and 17 on the second level; 1, 11, 15 and 25 on the third.]

In addition to the requirements imposed on a binary search tree, the following must be satisfied by a red–black tree:

1. Each node is either red or black.
2. The root is black.
3. All leaves (NIL) are black.
4. If a node is red, then both its children are black.
5. Every path from a given node to any of its descendant NIL leaves contains the same number of black nodes.
These constraints enforce a critical property of red–black trees: the path from the root to the farthest leaf is no more than twice as long as the path from the root to the nearest leaf. The result is that the tree is roughly height-balanced. Since operations such as inserting, deleting, and finding values require worst-case time proportional to the height of the tree, this theoretical upper bound on the height allows red–black trees to be efficient in the worst case, unlike ordinary binary search trees.

To see why this is guaranteed, it suffices to consider the effect of properties 4 and 5 together. For a red–black tree T, let B be the number of black nodes in property 5. Let the shortest possible path from the root of T to any leaf consist of B black nodes. Longer possible paths may be constructed by inserting red nodes. However, property 4 makes it impossible to insert more than one consecutive red node. Therefore, ignoring any black NIL leaves, the longest possible path consists of 2B nodes, alternating black and red (this is the worst case). Counting the black NIL leaves, the longest possible path consists of 2B − 1 nodes.

The shortest possible path has all black nodes, and the longest possible path alternates between red and black nodes. Since all maximal paths have the same number of black nodes, by property 5, this shows that no path is more than twice as long as any other path.

6.8.4 Analogy to B-trees of order 4

[Figure: the same red–black tree as in the example above, seen as a B-tree — cluster 8 13 17 at the root, with child clusters 1 6, 11, 15 and 22 25 27 above rows of NIL leaves.]

A red–black tree is similar in structure to a B-tree of order[note 1] 4, where each node can contain between 1 and 3 values and (accordingly) between 2 and 4 child pointers. In such a B-tree, each node will contain only one value matching the value in a black node of the red–black tree, with an optional value before and/or after it in the same node, both matching an equivalent red node of the red–black tree.

One way to see this equivalence is to “move up” the red nodes in a graphical representation of the red–black tree, so that they align horizontally with their parent black node, creating together a horizontal cluster. In the B-tree, or in the modified graphical representation of the red–black tree, all leaf nodes are at the same depth.

The red–black tree is then structurally equivalent to a B-tree of order 4, with a minimum fill factor of 33% of values per cluster and a maximum capacity of 3 values.

This B-tree type is still more general than a red–black tree, though, as it allows ambiguity in a red–black tree conversion—multiple red–black trees can be produced from an equivalent B-tree of order 4. If a B-tree cluster contains only 1 value, it is the minimum, black, and has two child pointers. If a cluster contains 3 values, then the central value will be black and each value stored on its sides will be red. If the cluster contains two values, however, either one can become the black node in the red–black tree (and the other one will be red).

So the order-4 B-tree does not maintain which of the values contained in each cluster is the root black node for the whole cluster and the parent of the other values in the same cluster. Despite this, the operations on red–black trees are more economical in time because you don't have to maintain the vector of values.[18] It may be costly if values are stored directly in each node rather than being stored by reference. B-tree nodes, however, are more economical in space because you don't need to store the color attribute for each node. Instead, you have to know which slot in the cluster vector is used. If values are stored by reference, e.g. objects, null references can be used, and so the cluster can be represented by a vector containing 3 slots for value pointers plus 4 slots for child references in the tree. In that case, the B-tree can be more compact in memory, improving data locality.

The same analogy can be made with B-trees of larger orders that can be structurally equivalent to a colored binary tree: you just need more colors. Suppose that you add blue; then the blue–red–black tree, defined like red–black trees but with the additional constraint that no two successive nodes in the hierarchy will be blue and all blue nodes will be children of a red node, becomes equivalent to a B-tree whose clusters will have at most 7 values in the following colors: blue, red, blue, black, blue, red, blue (for each cluster, there will be at most 1 black node, 2 red nodes, and 4 blue nodes).

For moderate volumes of values, insertions and deletions in a colored binary tree are faster compared to B-trees because colored trees don't attempt to maximize the fill factor of each horizontal cluster of nodes (only the minimum fill factor is guaranteed in colored binary trees, limiting the number of splits or junctions of clusters). B-trees will be faster for performing rotations (because rotations will frequently occur within the same cluster rather than with multiple separate nodes in a colored binary tree). For storing large volumes, however, B-trees will be much faster as they will be more compact by grouping several children in the same cluster, where they can be accessed locally.

All optimizations possible in B-trees to increase the average fill factors of clusters are possible in the equivalent multicolored binary tree. Notably, maximizing the average fill factor in a structurally equivalent B-tree is the same as reducing the total height of the multicolored tree,
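The red and black properties can be checked mechanically. The following Python sketch (names are illustrative) validates properties 4 and 5 while computing the black height of a subtree; the coloring of the example tree is an assumption reconstructed from the order-4 B-tree clusters (8 13 17; 1 6; 11; 15; 22 25 27), not stated explicitly in the text:

```python
# Each node is a tuple (key, color, left, right); None is a NIL leaf and counts as black.
def check_rb(node):
    """Return the black height of the subtree, asserting the red and black properties."""
    if node is None:
        return 1                               # property 3: NIL leaves are black
    key, color, left, right = node
    assert color in ("red", "black")           # property 1
    if color == "red":                         # property 4: a red node has black children
        for child in (left, right):
            assert child is None or child[1] == "black"
    bl, br = check_rb(left), check_rb(right)
    assert bl == br                            # property 5: equal black count on every path
    return bl + (1 if color == "black" else 0)

# A coloring consistent with the B-tree clusters above (assumed, for illustration):
example = (13, "black",
           (8, "red",
            (1, "black", None, (6, "red", None, None)),
            (11, "black", None, None)),
           (17, "red",
            (15, "black", None, None),
            (25, "black", (22, "red", None, None), (27, "red", None, None))))
```

Property 2 (a black root) can be checked separately with `example[1] == "black"`; the validator returns the black height B used in the path-length argument above.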
by increasing the number of non-black nodes. The worst by a “color flip,” corresponding to a split, in which the
case occurs when all nodes in a colored binary tree are red color of two children nodes leaves the children and
black, the best case occurs when only a third of them are moves to the parent node. The tango tree, a type of tree
black (and the other two thirds are red nodes). optimized for fast searches, usually uses red–black trees
Notes as part of its data structure.
In the version 8 of Java, the Collection HashMap has been
modified such that instead of using a LinkedList to store
[1] Using Knuth’s definition of order: the maximum number different elements with colliding hashcodes, a Red-Black
of children tree is used. This results in the improvement of time com-
plexity of searching such an element from O(n) to O(log
n).[21]
6.8.5 Applications and related data struc-
tures 6.8.6 Operations
Red–black trees offer worst-case guarantees for insertion Read-only operations on a red–black tree require no mod-
time, deletion time, and search time. Not only does this make them valuable in time-sensitive applications such as real-time applications, but it makes them valuable building blocks in other data structures which provide worst-case guarantees; for example, many data structures used in computational geometry can be based on red–black trees, and the Completely Fair Scheduler used in current Linux kernels uses red–black trees.

The AVL tree is another structure supporting O(log n) search, insertion, and removal. It is more rigidly balanced than red–black trees, leading to slower insertion and removal but faster retrieval. This makes it attractive for data structures that may be built once and loaded without reconstruction, such as language dictionaries (or program dictionaries, such as the opcodes of an assembler or interpreter).

Red–black trees are also particularly valuable in functional programming, where they are one of the most common persistent data structures, used to construct associative arrays and sets which can retain previous versions after mutations. The persistent version of red–black trees requires O(log n) space for each insertion or deletion, in addition to time.

For every 2-4 tree, there are corresponding red–black trees with data elements in the same order. The insertion and deletion operations on 2-4 trees are also equivalent to color-flipping and rotations in red–black trees. This makes 2-4 trees an important tool for understanding the logic behind red–black trees, and this is why many introductory algorithm texts introduce 2-4 trees just before red–black trees, even though 2-4 trees are not often used in practice.

In 2008, Sedgewick introduced a simpler version of the red–black tree called the left-leaning red–black tree[19] by eliminating a previously unspecified degree of freedom in the implementation. The LLRB maintains an additional invariant that all red links must lean left except during inserts and deletes. Red–black trees can be made isometric to either 2-3 trees[20] or 2-4 trees[19] for any sequence of operations. The 2-4 tree isometry was described in 1978 by Sedgewick. With 2-4 trees, the isometry is resolved

ification from those used for binary search trees, because every red–black tree is a special case of a simple binary search tree. However, the immediate result of an insertion or removal may violate the properties of a red–black tree. Restoring the red–black properties requires a small number (O(log n) or amortized O(1)) of color changes (which are very quick in practice) and no more than three tree rotations (two for insertion). Although insert and delete operations are complicated, their times remain O(log n).

Insertion

Insertion begins by adding the node as any binary search tree insertion does and by coloring it red. Whereas in the binary search tree we always add a leaf, in the red–black tree leaves contain no information, so instead we add a red interior node, with two black leaves, in place of an existing black leaf.

What happens next depends on the color of other nearby nodes. The term uncle node will be used to refer to the sibling of a node's parent, as in human family trees. Note that:

• property 3 (all leaves are black) always holds.

• property 4 (both children of every red node are black) is threatened only by adding a red node, repainting a black node red, or a rotation.

• property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is threatened only by adding a black node, repainting a red node black (or vice versa), or a rotation.

Notes

1. The label N will be used to denote the current node (colored red). In the diagrams N carries a blue contour. At the beginning, this is the new node being inserted, but the entire procedure may also be applied recursively to other nodes (see case 3). P will
6.8. RED–BLACK TREE 191
denote N's parent node, G will denote N's grandparent, and U will denote N's uncle. In between some cases, the roles and labels of the nodes are exchanged, but in each case, every label continues to represent the same node it represented at the beginning of the case.

2. If a node in the right (target) half of a diagram carries a blue contour, it will become the current node in the next iteration, and there the other nodes will be newly assigned relative to it. Any color shown in the diagram is either assumed in its case or implied by those assumptions.

3. A numbered triangle represents a subtree of unspecified depth. A black circle atop a triangle means that the black-height of the subtree is greater by one compared to a subtree without this circle.

There are several cases of red–black tree insertion to handle:

• N is the root node, i.e., the first node of the red–black tree.

• N's parent (P) is black.

• N is added to the right of the left child of the grandparent, or N is added to the left of the right child of the grandparent (P is red and U is black).

• N is added to the left of the left child of the grandparent, or N is added to the right of the right child of the grandparent (P is red and U is black).

Each case will be demonstrated with example C code. The uncle and grandparent nodes can be found by these functions:

    struct node *grandparent(struct node *n)
    {
        if ((n != NULL) && (n->parent != NULL))
            return n->parent->parent;
        else
            return NULL;
    }

    struct node *uncle(struct node *n)
    {
        struct node *g = grandparent(n);

        if (g == NULL)
            return NULL; /* No grandparent means no uncle */
        if (n->parent == g->left)
            return g->right;
        else
            return g->left;
    }

Case 1: The current node N is at the root of the tree. In this case, it is repainted black to satisfy property 2 (the root is black). Since this adds one black node to every path at once, property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is not violated.

    void insert_case1(struct node *n)
    {
        if (n->parent == NULL)
            n->color = BLACK;
        else
            insert_case2(n);
    }

Case 2: The current node's parent P is black, so property 4 (both children of every red node are black) is not invalidated. In this case, the tree is still valid. Property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is not threatened, because the current node N has two black leaf children; but because N is red, the paths through each of its children have the same number of black nodes as the path through the leaf it replaced, which was black, and so this property remains satisfied.

    void insert_case2(struct node *n)
    {
        if (n->parent->color == BLACK)
            return; /* Tree is still valid */
        else
            insert_case3(n);
    }

Note: In the following cases it can be assumed that N has a grandparent node G, because its parent P is red, and if it were the root, it would be black. Thus, N also has an uncle node U, although it may be a leaf in cases 4 and 5.

    void insert_case3(struct node *n)
    {
        struct node *u = uncle(n), *g;

        if ((u != NULL) && (u->color == RED)) {
            n->parent->color = BLACK;
            u->color = BLACK;
            g = grandparent(n);
            g->color = RED;
            insert_case1(g);
        } else {
            insert_case4(n);
        }
    }

Note: In the remaining cases, it is assumed that the parent node P is the left child of its parent. If it is the right child, left and right should be reversed throughout cases 4 and 5. The code samples take care of this.

    void insert_case4(struct node *n)
    {
        struct node *g = grandparent(n);

        if ((n == n->parent->right) && (n->parent == g->left)) {
            rotate_left(n->parent);
            /*
             * rotate_left can be replaced by the following, taking
             * advantage of already having *g = grandparent(n):
             *
             *   struct node *saved_p = g->left, *saved_left_n = n->left;
             *   g->left = n;
             *   n->left = saved_p;
             *   saved_p->right = saved_left_n;
             *
             * (the parent pointers must also be updated properly)
             */
            n = n->left;
        } else if ((n == n->parent->left) && (n->parent == g->right)) {
            rotate_right(n->parent);
            /*
             * rotate_right can be replaced by the following, taking
             * advantage of already having *g = grandparent(n):
             *
             *   struct node *saved_p = g->right, *saved_right_n = n->right;
             *   g->right = n;
             *   n->right = saved_p;
             *   saved_p->left = saved_right_n;
             */
            n = n->right;
        }
        insert_case5(n);
    }

    void insert_case5(struct node *n)
    {
        struct node *g = grandparent(n);

        n->parent->color = BLACK;
        g->color = RED;
        if (n == n->parent->left)
            rotate_right(g);
        else
            rotate_left(g);
    }

Note that inserting is actually in-place, since all the calls above use tail recursion.

In the algorithm above, all cases are chained in order, except in insert case 3, where it can recurse to case 1 back at the grandparent node: this is the only case where an
192 CHAPTER 6. SUCCESSORS AND NEIGHBORS
iterative implementation will effectively loop. Because the problem of repair is escalated to the next higher level but one, it takes maximally h/2 iterations to repair the tree (where h is the height of the tree). Because the probability of escalation decreases exponentially with each iteration, the average insertion cost is constant.

Mehlhorn & Sanders (2008) point out: "AVL trees do not support constant amortized update costs", but red–black trees do.[22]

Removal

In a regular binary search tree, when deleting a node with two non-leaf children, we find either the maximum element in its left subtree (which is the in-order predecessor) or the minimum element in its right subtree (which is the in-order successor) and move its value into the node being deleted (as shown here). We then delete the node we copied the value from, which must have fewer than two non-leaf children. (Non-leaf children, rather than all children, are specified here because unlike normal binary search trees, red–black trees can have leaf nodes anywhere, so that all nodes are either internal nodes with two children or leaf nodes with, by definition, zero children. In effect, internal nodes having two leaf children in a red–black tree are like the leaf nodes in a regular binary search tree.) Because merely copying a value does not violate any red–black properties, this reduces to the problem of deleting a node with at most one non-leaf child. Once we have solved that problem, the solution applies equally to the case where the node we originally want to delete has at most one non-leaf child as to the case just considered where it has two non-leaf children.

Therefore, for the remainder of this discussion we address the deletion of a node with at most one non-leaf child. We use the label M to denote the node to be deleted; C will denote a selected child of M, which we will also call "its child". If M does have a non-leaf child, call that its child, C; otherwise, choose either leaf as its child, C.

If M is a red node, we simply replace it with its child C, which must be black by property 4. (This can only occur when M has two leaf children, because if the red node M had a black non-leaf child on one side but just a leaf child on the other side, then the count of black nodes on both sides would be different; thus the tree would violate property 5.) All paths through the deleted node will simply pass through one fewer red node, and both the deleted node's parent and child must be black, so property 3 (all leaves are black) and property 4 (both children of every red node are black) still hold.

Another simple case is when M is black and C is red. Simply removing a black node could break properties 4 ("Both children of every red node are black") and 5 ("All paths from any given node to its leaf nodes contain the same number of black nodes"), but if we repaint C black, both of these properties are preserved.

The complex case is when both M and C are black. (This can only occur when deleting a black node which has two leaf children, because if the black node M had a black non-leaf child on one side but just a leaf child on the other side, then the count of black nodes on both sides would be different; thus the tree would have been an invalid red–black tree by violation of property 5.) We begin by replacing M with its child C. We will relabel this child C (in its new position) N, and its sibling (its new parent's other child) S. (S was previously the sibling of M.) In the diagrams below, we will also use P for N's new parent (M's old parent), SL for S's left child, and SR for S's right child. (S cannot be a leaf, because if M and C were black, then P's one subtree which included M counted two black-height, and thus P's other subtree which includes S must also count two black-height, which cannot be the case if S is a leaf node.)

Notes

1. The label N will be used to denote the current node (colored black). In the diagrams N carries a blue contour. At the beginning, this is the replacement node and a leaf, but the entire procedure may also be applied recursively to other nodes (see case 3). In between some cases, the roles and labels of the nodes are exchanged, but in each case, every label continues to represent the same node it represented at the beginning of the case.

2. If a node in the right (target) half of a diagram carries a blue contour, it will become the current node in the next iteration, and there the other nodes will be newly assigned relative to it. Any color shown in the diagram is either assumed in its case or implied by those assumptions. White represents an arbitrary color (either red or black), but the same in both halves of the diagram.

3. A numbered triangle represents a subtree of unspecified depth. A black circle atop a triangle means that the black-height of the subtree is greater by one compared to a subtree without this circle.

We will find the sibling using this function:

    struct node *sibling(struct node *n)
    {
        if ((n == NULL) || (n->parent == NULL))
            return NULL; /* no parent means no sibling */
        if (n == n->parent->left)
            return n->parent->right;
        else
            return n->parent->left;
    }

Note: In order that the tree remains well-defined, we need every null leaf to remain a leaf after all transformations (that it will not have any children). If the node we are deleting has a non-leaf (non-null) child N, it is easy to see that the property is satisfied. If, on the other hand, N would be a null leaf, it can be
verified from the diagrams (or code) for all the cases that the property is satisfied as well.

We can perform the steps outlined above with the following code, where the function replace_node substitutes child into n's place in the tree. For convenience, code in this section will assume that null leaves are represented by actual node objects rather than NULL (the code in the Insertion section works with either representation).

    void delete_one_child(struct node *n)
    {
        /*
         * Precondition: n has at most one non-leaf child.
         */
        struct node *child = is_leaf(n->right) ? n->left : n->right;

        replace_node(n, child);
        if (n->color == BLACK) {
            if (child->color == RED)
                child->color = BLACK;
            else
                delete_case1(child);
        }
        free(n);
    }

Note: If N is a null leaf and we do not want to represent null leaves as actual node objects, we can modify the algorithm by first calling delete_case1() on its parent (the node that we delete, n in the code above) and deleting it afterwards. We do this if the parent is black (red is trivial), so it behaves in the same way as a null leaf (and is sometimes called a 'phantom' leaf). And we can safely delete it at the end, as n will remain a leaf after all operations, as shown above. In addition, the sibling tests in cases 2 and 3 require updating, as it is no longer true that the sibling will have children represented as objects.

If both N and its original parent are black, then deleting this original parent causes paths which proceed through N to have one fewer black node than paths that do not. As this violates property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes), the tree must be rebalanced. There are several cases to consider:

Case 1: N is the new root. In this case, we are done. We removed one black node from every path, and the new root is black, so the properties are preserved.

    void delete_case1(struct node *n)
    {
        if (n->parent != NULL)
            delete_case2(n);
    }

Note: In cases 2, 5, and 6, we assume N is the left child of its parent P. If it is the right child, left and right should be reversed throughout these three cases. Again, the code examples take both cases into account.

Case 2: N's sibling S is red. In this case, the colors of P and S are exchanged and the tree is rotated at P, so that S becomes N's grandparent; N then has a black sibling and a red parent, and the remaining cases apply.

    void delete_case2(struct node *n)
    {
        struct node *s = sibling(n);

        if (s->color == RED) {
            n->parent->color = RED;
            s->color = BLACK;
            if (n == n->parent->left)
                rotate_left(n->parent);
            else
                rotate_right(n->parent);
        }
        delete_case3(n);
    }

    void delete_case3(struct node *n)
    {
        struct node *s = sibling(n);

        if ((n->parent->color == BLACK) &&
            (s->color == BLACK) &&
            (s->left->color == BLACK) &&
            (s->right->color == BLACK)) {
            s->color = RED;
            delete_case1(n->parent);
        } else {
            delete_case4(n);
        }
    }

    void delete_case4(struct node *n)
    {
        struct node *s = sibling(n);

        if ((n->parent->color == RED) &&
            (s->color == BLACK) &&
            (s->left->color == BLACK) &&
            (s->right->color == BLACK)) {
            s->color = RED;
            n->parent->color = BLACK;
        } else {
            delete_case5(n);
        }
    }

    void delete_case5(struct node *n)
    {
        struct node *s = sibling(n);

        if (s->color == BLACK) {
            /*
             * This if statement is trivial, due to case 2 (even though
             * case 2 changed the sibling to a sibling's child, the
             * sibling's child can't be red, since no red parent can
             * have a red child).
             */
            /*
             * The following statements just force the red to be on the
             * left of the left of the parent, or right of the right, so
             * case 6 will rotate correctly.
             */
            if ((n == n->parent->left) &&
                (s->right->color == BLACK) &&
                (s->left->color == RED)) {
                /* This last test is trivial too due to cases 2-4. */
                s->color = RED;
                s->left->color = BLACK;
                rotate_right(s);
            } else if ((n == n->parent->right) &&
                       (s->left->color == BLACK) &&
                       (s->right->color == RED)) {
                /* This last test is trivial too due to cases 2-4. */
                s->color = RED;
                s->right->color = BLACK;
                rotate_left(s);
            }
        }
        delete_case6(n);
    }

    void delete_case6(struct node *n)
    {
        struct node *s = sibling(n);

        s->color = n->parent->color;
        n->parent->color = BLACK;
        if (n == n->parent->left) {
            s->right->color = BLACK;
            rotate_left(n->parent);
        } else {
            s->left->color = BLACK;
            rotate_right(n->parent);
        }
    }

Again, the function calls all use tail recursion, so the algorithm is in-place.

In the algorithm above, all cases are chained in order, except in delete case 3, where it can recurse to case 1 back at the parent node: this is the only case where an iterative implementation will effectively loop. No more than h loops back to case 1 will occur (where h is the height of the tree). And because the probability of escalation decreases exponentially with each iteration, the average removal cost is constant.

Additionally, no tail recursion ever occurs on a child node, so the tail recursion loop can only move from a child back to its successive ancestors. If a rotation occurs in case 2 (which is the only possibility of rotation within the loop of cases 1–3), then the parent of the node N becomes red after the rotation and we will exit the loop. Therefore, at most one rotation will occur within this loop. Since no more than two additional rotations will occur after exiting the loop, at most three rotations occur in total.
ence are independent of each other, they can be executed in parallel with a parallel depth O(log m log n).[23] When m = 1, the join-based implementation has the same computational DAG as single-element insertion and deletion if the root of the larger tree is used to split the smaller tree.

Parallel algorithms for constructing red–black trees from sorted lists of items can run in constant time or O(log log n) time, depending on the computer model, if the number of processors available is asymptotically proportional to the number n of items, where n→∞. Fast search, insertion, and deletion parallel algorithms are also known.[24]

6.8.10 Popular culture

A red–black tree was referenced correctly in an episode of Missing (Canadian TV series),[25] as noted by Robert Sedgewick in one of his lectures:[26]

Jess: "It was the red door again."
Pollock: "I thought the red door was the storage container."
Jess: "But it wasn't red anymore, it was black."
Antonio: "So red turning to black means what?"
Pollock: "Budget deficits, red ink, black ink."
Antonio: "It could be from a binary search tree. The red–black tree tracks every simple path from a node to a descendant leaf that has the same number of black nodes."
Jess: "Does that help you with the ladies?"

6.8.11 See also

• List of data structures
• Tree data structure
• Tree rotation
• AA tree, a variation of the red-black tree
• AVL tree
• B-tree (2-3 tree, 2-3-4 tree, B+ tree, B*-tree, UB-tree)
• Scapegoat tree
• Splay tree
• T-tree
• WAVL tree

6.8.12 References

[1] James Paton. "Red-Black Trees".

[2] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "Red–Black Trees". Introduction to Algorithms (second ed.). MIT Press. pp. 273–301. ISBN 0-262-03293-7.

[4] Rudolf Bayer (1972). "Symmetric binary B-Trees: Data structure and maintenance algorithms". Acta Informatica. 1 (4): 290–306. doi:10.1007/BF00289509.

[5] Drozdek, Adam. Data Structures and Algorithms in Java (2 ed.). Sams Publishing. p. 323. ISBN 0534376681.

[6] Leonidas J. Guibas and Robert Sedgewick (1978). "A Dichromatic Framework for Balanced Trees". Proceedings of the 19th Annual Symposium on Foundations of Computer Science. pp. 8–21. doi:10.1109/SFCS.1978.3.

[7] "Red Black Trees". eternallyconfuzzled.com. Retrieved 2015-09-02.

[8] Robert Sedgewick (2012). Red-Black BSTs. Coursera. "A lot of people ask why did we use the name red–black. Well, we invented this data structure, this way of looking at balanced trees, at Xerox PARC which was the home of the personal computer and many other innovations that we live with today entering[sic] graphic user interfaces, ethernet and object-oriented programmings[sic] and many other things. But one of the things that was invented there was laser printing and we were very excited to have nearby color laser printer that could print things out in color and out of the colors the red looked the best. So, that's why we picked the color red to distinguish red links, the types of links, in three nodes. So, that's an answer to the question for people that have been asking."

[9] "Where does the term 'Red/Black Tree' come from?". programmers.stackexchange.com. Retrieved 2015-09-02.

[10] Andersson, Arne (1993-08-11). Dehne, Frank; Sack, Jörg-Rüdiger; Santoro, Nicola; Whitesides, Sue, eds. "Balanced search trees made simple" (PDF). Algorithms and Data Structures (Proceedings). Lecture Notes in Computer Science. Springer-Verlag Berlin Heidelberg. 709: 60–71. doi:10.1007/3-540-57155-8_236. ISBN 978-3-540-57155-1. Archived from the original on 2000-03-17.

[11] Okasaki, Chris (1999-01-01). "Red-black trees in a functional setting" (PS). Journal of Functional Programming. 9 (4): 471–477. doi:10.1017/S0956796899003494. ISSN 1469-7653.

[12] Sedgewick, Robert (1983). Algorithms (1st ed.). Addison-Wesley. ISBN 0-201-06672-6.

[13] RedBlackBST code in Java

[14] Sedgewick, Robert (2008). "Left-leaning Red-Black Trees" (PDF).
[15] Sedgewick, Robert; Wayne, Kevin (2011). Algorithms (4th ed.). Addison-Wesley Professional. ISBN 978-0-321-57351-3.

[16] Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2009). "13". Introduction to Algorithms (3rd ed.). MIT Press. pp. 308–309. ISBN 978-0-262-03384-8.

[17] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer, Berlin/Heidelberg. pp. 154–165. doi:10.1007/978-3-540-77978-0. ISBN 978-3-540-77977-3. p. 155.

[18] Sedgewick, Robert (1998). Algorithms in C++. Addison-Wesley Professional. pp. 565–575. ISBN 978-0201350883.

[19] http://www.cs.princeton.edu/~rs/talks/LLRB/RedBlack.pdf

[20] http://www.cs.princeton.edu/courses/archive/fall08/cos226/lectures/10BalancedTrees-2x2.pdf

[21] "How does a HashMap work in JAVA". coding-geek.com.

[22] Mehlhorn & Sanders 2008, pp. 165, 158

[23] Blelloch, Guy E.; Ferizovic, Daniel; Sun, Yihan (2016), "Just Join for Parallel Ordered Sets", Proc. 28th ACM Symp. Parallel Algorithms and Architectures (SPAA 2016), ACM, pp. 253–264, doi:10.1145/2935764.2935768, ISBN 978-1-4503-4210-0.

[24] Park, Heejin; Park, Kunsoo (2001). "Parallel algorithms for red–black trees". Theoretical Computer Science. Elsevier. 262 (1–2): 415–435. doi:10.1016/S0304-3975(00)00287-5. "Our parallel algorithm for constructing a red–black tree from a sorted list of n items runs in O(1) time with n processors on the CRCW PRAM and runs in O(log log n) time with n / log log n processors on the EREW PRAM."

[25] Missing (Canadian TV series). A, W Network (Canada); Lifetime (United States).

[26] Robert Sedgewick (2012). B-Trees. Coursera. 10:37 minutes in. "So not only is there some excitement in that dialogue but it's also technically correct which you don't often find with math in popular culture of computer science. A red black tree tracks every simple path from a node to a descendant leaf with the same number of black nodes they got that right."

6.8.13 Further reading

• Mathworld: Red–Black Tree

6.8.14 External links

• A complete and working implementation in C
• Red–Black Tree Demonstration
• OCW MIT Lecture by Prof. Erik Demaine on Red Black Trees
• Binary Search Tree Insertion Visualization on YouTube – Visualization of random and pre-sorted data insertions, in elementary binary search trees, and left-leaning red–black trees
• An intrusive red-black tree written in C++
• Red-black BSTs in 3.3 Balanced Search Trees
• Red–black BST Demo

6.9 WAVL tree

In computer science, a WAVL tree or weak AVL tree is a self-balancing binary search tree. WAVL trees are named after AVL trees, another type of balanced search tree, and are closely related both to AVL trees and red–black trees, which all fall into a common framework of rank-balanced trees. Like other balanced binary search trees, WAVL trees can handle insertion, deletion, and search operations in time O(log n) per operation.[1][2]

WAVL trees are designed to combine some of the best properties of both AVL trees and red–black trees. One advantage of AVL trees over red–black trees is that they are more balanced: they have height at most logφ n ≈ 1.44 log2 n (for a tree with n data items, where φ is the golden ratio), while red–black trees have larger maximum height, 2 log2 n. If a WAVL tree is created using only insertions, without deletions, then it has the same small height bound that an AVL tree has. On the other hand, red–black trees have the advantage over AVL trees that they perform less restructuring of their trees. In AVL trees, each deletion may require a logarithmic number of tree rotation operations, while red–black trees have simpler deletion operations that use only a constant number of tree rotations. WAVL trees, like red–black trees, use only a constant number of tree rotations, and the constant is even better than for red–black trees.[1][2]

WAVL trees were introduced by Haeupler, Sen & Tarjan (2015). The same authors also provided a common view of AVL trees, WAVL trees, and red–black trees as all being a type of rank-balanced tree.[2]
item, and is linked to its parent (except for a designated root node that has no parent) and to exactly two children in the tree, the left child and the right child. An external node carries no data, and has a link only to its parent in the tree. These nodes are arranged to form a binary tree, so that for any internal node x the parents of the left and right children of x are x itself. The external nodes form the leaves of the tree.[3] The data items are arranged in the tree in such a way that an inorder traversal of the tree lists the data items in sorted order.[4]

What distinguishes WAVL trees from other types of binary search tree is their use of ranks. These are numbers, stored with each node, that provide an approximation to the distance from the node to its farthest leaf descendant. The ranks are required to obey the following properties:[1][2]

• Every external node has rank 0.[5]

• If a non-root node has rank r, then the rank of its parent must be either r + 1 or r + 2.

• An internal node with two external children must have rank exactly 1.

from each node to its parent, incrementing the rank of each parent node if necessary to make it greater than the new rank of its child, until one of three stopping conditions is reached.

• If the path of incremented ranks reaches the root of the tree, then the rebalancing procedure stops, without changing the structure of the tree.

• If the path of incremented ranks reaches a node whose parent's rank previously differed by two, and (after incrementing the rank of the node) still differs by one, then again the rebalancing procedure stops without changing the structure of the tree.

• If the procedure increases the rank of a node x, so that it becomes equal to the rank of the parent y of x, but the other child of y has a rank that is smaller by two (so that the rank of y cannot be increased), then again the rebalancing procedure stops. In this case, by performing at most two tree rotations, it is always possible to rearrange the tree nodes near x and y in such a way that the ranks obey the constraints of a WAVL tree, leaving the rank of the root of the rotated subtree unchanged.
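The rank rules above translate directly into a checker over an explicit tree. The sketch below is illustrative and self-contained (the text defines no node layout); external nodes are represented as NULL with rank 0, following the convention of note [5]:

```c
#include <stddef.h>

struct wavl_node {
    int rank;
    struct wavl_node *left, *right;
};

/* Rank of a subtree root, treating NULL external nodes as rank 0. */
int wavl_rank(const struct wavl_node *n)
{
    return n != NULL ? n->rank : 0;
}

/* Checks the rank rules for the subtree rooted at n:
 *  - every external (NULL) node has rank 0;
 *  - a node's rank exceeds each child's rank by 1 or 2
 *    (equivalently, a parent's rank is r + 1 or r + 2);
 *  - an internal node with two external children has rank exactly 1.
 * Returns 1 if every node satisfies the rules, 0 otherwise. */
int wavl_ok(const struct wavl_node *n)
{
    int dl, dr;

    if (n == NULL)
        return 1;
    dl = n->rank - wavl_rank(n->left);
    dr = n->rank - wavl_rank(n->right);
    if (dl < 1 || dl > 2 || dr < 1 || dr > 2)
        return 0;
    if (n->left == NULL && n->right == NULL && n->rank != 1)
        return 0;
    return wavl_ok(n->left) && wavl_ok(n->right);
}
```

The same traversal pattern is a convenient test harness for the insertion rebalancing just described: after each insertion, wavl_ok should still hold at the root.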
Overall, as with the insertion procedure, a deletion consists of a search downward through the tree (to find the node to be deleted), a continuation of the search farther downward (to find a node with an external child), the removal of a constant number of nodes, a logarithmic number of rank changes, and a constant number of tree rotations.[1][2]

WAVL trees are closely related to both AVL trees and red–black trees. Every AVL tree can have ranks assigned to its nodes in a way that makes it into a WAVL tree. And every WAVL tree can have its nodes colored red and black (and its ranks reassigned) in a way that makes it into a red–black tree. However, some WAVL trees do not come from AVL trees in this way, and some red–black trees do not come from WAVL trees in this way.

AVL trees

An AVL tree is a kind of balanced binary search tree in which the two children of each internal node must have heights that differ by at most one.[8] The height of an external node is zero, and the height of any internal node is always one plus the maximum of the heights of its two children. Thus, the height function of an AVL tree obeys the constraints of a WAVL tree, and we may convert any AVL tree into a WAVL tree by using the height of each node as its rank.[1][2]

The key difference between an AVL tree and a WAVL tree arises when a node has two children with the same rank or height. In an AVL tree, if a node x has two children of the same height h as each other, then the height of x must be exactly h + 1. In contrast, in a WAVL tree, if a node x has two children of the same rank r as each other, then the rank of x can be either r + 1 or r + 2. This greater flexibility in ranks also leads to a greater flexibility in structures: some WAVL trees cannot be made into AVL trees even by modifying their ranks, because they include nodes whose children's heights differ by more than one.[2]

If a WAVL tree is created only using insertion operations, then its structure will be the same as the structure of an AVL tree created by the same insertion sequence, and its ranks will be the same as the ranks of the corresponding AVL tree. It is only through deletion operations that a WAVL tree can become different from an AVL tree. In particular, this implies that a WAVL tree created only through insertions has height at most logφ n ≈ 1.44 log2 n.[2]

red–black trees can equivalently be defined in terms of a system of ranks, stored at the nodes, satisfying the following requirements (different from the requirements for ranks in WAVL trees):

• The rank of an external node is always 0 and its parent's rank is always 1.

• The rank of any non-root node equals either its parent's rank or its parent's rank minus 1.

• No two consecutive edges on any root-leaf path have rank difference 0.

The equivalence between the color-based and rank-based definitions can be seen, in one direction, by coloring a node black if its parent has greater rank and red if its parent has equal rank. In the other direction, colors can be converted to ranks by making the rank of a black node equal to the number of black nodes on any path to an external node, and by making the rank of a red node equal to the rank of its parent.[9]

The ranks of the nodes in a WAVL tree can be converted to a system of ranks of nodes, obeying the requirements for red–black trees, by dividing each rank by two and rounding up to the nearest integer.[10] Because of this conversion, for every WAVL tree there exists a valid red–black tree with the same structure. Because red–black trees have maximum height 2 log2 n, the same is true for WAVL trees.[1][2] However, there exist red–black trees that cannot be given a valid WAVL tree rank function.[2]

Despite the fact that, in terms of their tree structures, WAVL trees are special cases of red–black trees, their update operations are different. The tree rotations used
6.10. SCAPEGOAT TREE 199
in WAVL tree update operations may make changes that would not be permitted in a red–black tree, because they would in effect cause the recoloring of large subtrees of the red–black tree rather than making color changes only on a single path in the tree.[2] This allows WAVL trees to perform fewer tree rotations per deletion, in the worst case, than red–black trees.[1][2]

6.9.5 References

[1] Goodrich, Michael T.; Tamassia, Roberto (2015), "4.4 Weak AVL Trees", Algorithm Design and Applications, Wiley, pp. 130–138.

[2] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2015), "Rank-balanced trees" (PDF), ACM Transactions on Algorithms, 11 (4): Art. 30, 26, doi:10.1145/2689412, MR 3361215.

[3] Goodrich & Tamassia (2015), Section 2.3 Trees, pp. 68–83.

[4] Goodrich & Tamassia (2015), Chapter 3 Binary Search Trees, pp. 89–114.

[5] In this we follow Goodrich & Tamassia (2015). In the version described by Haeupler, Sen & Tarjan (2015), the external nodes have rank −1. This variation makes very little difference in the operations of WAVL trees, but it causes some minor changes to the formula for converting WAVL trees to red–black trees.

[6] Goodrich & Tamassia (2015), Section 3.1.2 Searching in a Binary Search Tree, pp. 95–96.

[7] Goodrich & Tamassia (2015), Section 3.1.4 Deletion in a Binary Search Tree, pp. 98–99.

[8] Goodrich & Tamassia (2015), Section 4.2 AVL Trees, pp. 120–125.

[9] Goodrich & Tamassia (2015), Section 4.3 Red–black Trees, pp. 126–129.

to a regular binary search tree: a node stores only a key and two pointers to the child nodes. This makes scapegoat trees easier to implement and, due to data structure alignment, can reduce node overhead by up to one-third.

6.10.1 Theory

A binary search tree is said to be weight-balanced if half the nodes are on the left of the root, and half on the right. An α-weight-balanced node is defined as meeting a relaxed weight-balance criterion:

    size(left) <= α*size(node)
    size(right) <= α*size(node)

where size can be defined recursively as:

    function size(node)
        if node = nil
            return 0
        else
            return size(node->left) + size(node->right) + 1
        end

An α of 1 therefore would describe a linked list as balanced, whereas an α of 0.5 would only match almost complete binary trees.

A binary search tree that is α-weight-balanced must also be α-height-balanced, that is

    height(tree) <= log₁/α(NodeCount) + 1

Scapegoat trees are not guaranteed to keep α-weight-balance at all times, but are always loosely α-height-balanced, in that

    height(scapegoat tree) <= log₁/α(NodeCount) + 1

This makes scapegoat trees similar to red-black trees in that they both have restrictions on their height. They differ greatly, though, in their implementations of determining where the rotations (or, in the case of scapegoat trees, rebalances) take place. Whereas red-black trees store additional 'color' information in each node to determine the location, scapegoat trees find a scapegoat which isn't α-weight-balanced to perform the rebalance operation on. This is loosely similar to AVL trees, in that the actual ro-
[10] In Haeupler, Sen & Tarjan (2015) the conversion is done tations depend on 'balances’ of nodes, but the means of
by rounding down, because the ranks of external nodes determining the balance differs greatly. Since AVL trees
are −1 rather than 0. Goodrich & Tamassia (2015) give a check the balance value on every insertion/deletion, it is
formula that also rounds down, but because they use rank typically stored in each node; scapegoat trees are able to
0 for external nodes their formula incorrectly assigns red–
calculate it only as needed, which is only when a scape-
black rank 0 to internal nodes with WAVL rank 1.
goat needs to be found.
Unlike most other self-balancing search trees, scapegoat
6.10 Scapegoat tree trees are entirely flexible as to their balancing. They sup-
port any α such that 0.5 < α < 1. A high α value results in
fewer balances, making insertion quicker but lookups and
In computer science, a scapegoat tree is a self-balancing deletions slower, and vice versa for a low α. Therefore in
binary search tree, invented by Arne Andersson[1] and practical applications, an α can be chosen depending on
again by Igal Galperin and Ronald L. Rivest.[2] It provides how frequently these actions should be performed.
worst-case O(log n) lookup time, and O(log n) amortized
insertion and deletion time.
Unlike most other self-balancing binary search trees that
provide worst case O(log n) lookup time, scapegoat trees 6.10.2 Operations
have no additional per-node memory overhead compared
(the amount of time to search for the element and flag it as deleted). The (n/2)-th deletion causes the tree to be rebuilt and takes O(log n) + O(n) (or just O(n)) time. Using aggregate analysis it becomes clear that the amortized cost of a deletion is O(log n):

    ( Σ₁^{n/2} O(log n) + O(n) ) / (n/2) = ( (n/2)·O(log n) + O(n) ) / (n/2) = O(log n)

Lookup

Lookup is not modified from a standard binary search tree, and has a worst-case time of O(log n). This is in contrast to splay trees, which have a worst-case time of O(n). The reduced node memory overhead compared to other self-balancing binary search trees can further improve locality of reference and caching.

6.10.3 See also

• Splay tree

6.11 Splay tree

A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again. It performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For many sequences of non-random operations, splay trees perform better than other search trees, even when the specific pattern of the sequence is unknown. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.[1]

All normal operations on a binary search tree are combined with one basic operation, called splaying. Splaying the tree for a certain element rearranges the tree so that the element is placed at the root of the tree. One way to do this is to first perform a standard binary tree search for the element in question, and then use tree rotations in a specific fashion to bring the element to the top. Alternatively, a top-down algorithm can combine the search and the tree reorganization into a single phase.

Because even read-only accesses restructure the tree, splay trees require extra care in a multi-threaded environment. Specifically, extra management is needed if multiple threads are allowed to perform find operations concurrently. This also makes them unsuitable for general use in purely functional programming, although even there they can be used in limited ways to implement priority queues.

6.11.3 Operations

Splaying

Note that zig-zig steps are the only thing that differentiates splay trees from the rotate-to-root method introduced by Allen and Munro[4] prior to the introduction of splay trees.

Split

• Splay x. Now it is at the root, so the tree to its left contains all elements smaller than x and the tree to its right contains all elements larger than x.

Insertion

• Insert x as with a normal binary search tree.
• When an item is inserted, a splay is performed.
• As a result, the newly inserted node x becomes the root of the tree.

ALTERNATIVE:

• Use the split operation to split the tree at the value of x into two sub-trees: S and T.
• Create a new tree in which x is the root, S is its left sub-tree and T is its right sub-tree.

Deletion

To delete a node x, use the same method as with a binary search tree: if x has two children, swap its value with that of either the rightmost node of its left subtree (its in-order predecessor) or the leftmost node of its right subtree (its in-order successor). Then remove that node instead. In this way, deletion is reduced to the problem of removing a node with 0 or 1 children. Unlike a binary search tree, in a splay tree after deletion we splay the parent of the removed node to the top of the tree.

ALTERNATIVE:

• The node to be deleted is first splayed, i.e. brought to the root of the tree, and then deleted. This leaves the tree with two sub-trees.
• The two sub-trees are then joined using a "join" operation.

6.11.4 Implementation and variants

Splaying, as mentioned above, is performed during a second, bottom-up pass over the access path of a node. It is possible to record the access path during the first pass for use during the second, but that requires extra space during the access operation. Another alternative is to keep a parent pointer in every node, which avoids the need for extra space during access operations but may reduce overall time efficiency because of the need to update those pointers.[1]

Another method is based on the observation that we can restructure the tree on our way down the access path instead of making a second pass. This top-down splaying routine uses three sets of nodes: a left tree, a right tree and a middle tree. The first two contain all items of the original tree known to be less than or greater than the current item, respectively. The middle tree consists of the subtree rooted at the current node. These three sets are updated down the access path while keeping the splay operations in check. Another method, semisplaying, modifies the zig-zig case to reduce the amount of restructuring done in all operations.[1][5]

Below is an implementation of splay trees in C++, which uses pointers to represent each node of the tree. This implementation is based on the bottom-up splaying version and uses the second method of deletion on a splay tree. Also, unlike the definition above, this C++ version does not splay the tree on finds; it only splays on insertions and deletions.

    #include <functional>

    #ifndef SPLAY_TREE
    #define SPLAY_TREE

    template<typename T, typename Comp = std::less<T>>
    class splay_tree {
    private:
        Comp comp;
        unsigned long p_size;

        struct node {
            node *left, *right, *parent;
            T key;
            node(const T& init = T())
                : left(nullptr), right(nullptr), parent(nullptr), key(init) { }
            ~node() {
                // Delete the children only; deleting the parent here would
                // recursively destroy the rest of the tree.
                delete left;
                delete right;
            }
        } *root;

        void left_rotate(node *x) {
            node *y = x->right;
            if (y) {
                x->right = y->left;
                if (y->left) y->left->parent = x;
                y->parent = x->parent;
            }
            if (!x->parent) root = y;
            else if (x == x->parent->left) x->parent->left = y;
            else x->parent->right = y;
            if (y) y->left = x;
            x->parent = y;
        }

        void right_rotate(node *x) {
            node *y = x->left;
            if (y) {
                x->left = y->right;
                if (y->right) y->right->parent = x;
                y->parent = x->parent;
            }
            if (!x->parent) root = y;
            else if (x == x->parent->left) x->parent->left = y;
            else x->parent->right = y;
            if (y) y->right = x;
            x->parent = y;
        }

        void splay(node *x) {
            while (x->parent) {
                if (!x->parent->parent) {
                    // zig step
                    if (x->parent->left == x) right_rotate(x->parent);
                    else left_rotate(x->parent);
                } else if (x->parent->left == x && x->parent->parent->left == x->parent) {
                    // zig-zig step
                    right_rotate(x->parent->parent);
                    right_rotate(x->parent);
                } else if (x->parent->right == x && x->parent->parent->right == x->parent) {
                    // zig-zig step
                    left_rotate(x->parent->parent);
                    left_rotate(x->parent);
                } else if (x->parent->left == x && x->parent->parent->right == x->parent) {
                    // zig-zag step
                    right_rotate(x->parent);
                    left_rotate(x->parent);
                } else {
                    // zig-zag step
                    left_rotate(x->parent);
                    right_rotate(x->parent);
                }
            }
        }

        void replace(node *u, node *v) {
            if (!u->parent) root = v;
            else if (u == u->parent->left) u->parent->left = v;
            else u->parent->right = v;
            if (v) v->parent = u->parent;
        }

        node* subtree_minimum(node *u) { while (u->left) u = u->left; return u; }
        node* subtree_maximum(node *u) { while (u->right) u = u->right; return u; }

    public:
        splay_tree() : p_size(0), root(nullptr) { }

        void insert(const T &key) {
            node *z = root;
            node *p = nullptr;
            while (z) {
                p = z;
                if (comp(z->key, key)) z = z->right;
                else z = z->left;
            }
            z = new node(key);
            z->parent = p;
            if (!p) root = z;
            else if (comp(p->key, z->key)) p->right = z;
            else p->left = z;
            splay(z);
            p_size++;
        }

        node* find(const T &key) {
            node *z = root;
            while (z) {
                if (comp(z->key, key)) z = z->right;
                else if (comp(key, z->key)) z = z->left;
                else return z;
            }
            return nullptr;
        }

        void erase(const T &key) {
            node *z = find(key);
            if (!z) return;
            splay(z);
            if (!z->left) replace(z, z->right);
            else if (!z->right) replace(z, z->left);
            else {
                node *y = subtree_minimum(z->right);
                if (y->parent != z) {
                    replace(y, y->right);
                    y->right = z->right;
                    y->right->parent = y;
                }
                replace(z, y);
                y->left = z->left;
                y->left->parent = y;
            }
            // Detach z's children before deleting it, so that the destructor
            // does not free nodes still linked into the tree.
            z->left = z->right = nullptr;
            delete z;
            p_size--;
        }

        const T& minimum() { return subtree_minimum(root)->key; }
        const T& maximum() { return subtree_maximum(root)->key; }
        bool empty() const { return root == nullptr; }
        unsigned long size() const { return p_size; }
    };

    #endif // SPLAY_TREE

6.11.5 Analysis

A simple amortized analysis of static splay trees can be carried out using the potential method. Define:

• w(x) — a fixed positive weight assigned to each node x
• W = Σx w(x) — the total weight
• s(x) — the sum of the weights of all nodes in the subtree rooted at x
• rank(x) = log₂(s(x))
• Φ = Σx rank(x) — the potential of the tree

Φ will tend to be high for poorly balanced trees and low for well-balanced trees. To account for the net change in potential over a whole access sequence, compare the initial state before any operation is done (Φi) to the final state after all operations are completed (Φf):

    Φi − Φf = Σx (rank_i(x) − rank_f(x)) = O(n log n)

where the last inequality comes from the fact that for every node x, the minimum rank is 0 and the maximum rank is log(n). Now we can finally bound the actual time using the amortized costs together with this net potential drop. The above analysis can be generalized in the following way.

Balance Theorem — The cost of performing the sequence S is O[m log n + n log n].

Proof: Take a constant weight, e.g. w(x) = 1 for every node x. Then W = n.

This theorem implies that splay trees perform as well as static balanced binary search trees on sequences of at least n accesses.[1]

Static Optimality Theorem — Let qx be the number of times element x is accessed in S. If every element is accessed at least once, then the cost of performing S is O[m + Σ_{x∈tree} qx log(m/qx)].

Proof: Let w(x) = qx. Then W = m.

This theorem implies that splay trees perform as well as an optimum static binary search tree on sequences of at least n accesses. They spend less time on the more frequent items.[1]

Static Finger Theorem — Assume that the items are numbered from 1 through n in ascending order. Let f be any fixed element (the 'finger'). Then the cost of performing S is O[m + n log n + Σ_{x∈sequence} log(|x − f| + 1)].

Proof: Let w(x) = 1/(|x − f| + 1)². Then W = O(1). The net potential drop is O(n log n), since the weight of any item is at least 1/n².[1]

Dynamic Finger Theorem — Assume that the 'finger' for each step accessing an element y is the element accessed in the previous step, x. The cost of performing S is O[m + n + Σ_{x,y∈sequence} log(|y − x| + 1)].[6][7]

Working Set Theorem — At any time during the sequence, let t(x) be the number of distinct elements accessed before the previous time element x was accessed. The cost of performing S is O[m + n log n + Σ_{x∈sequence} log(t(x) + 1)].

Scanning Theorem — Also known as the Sequential Access Theorem or the Queue Theorem: accessing the n elements of a splay tree in symmetric order takes O(n) time, regardless of the initial structure of the splay tree.[8] The tightest upper bound proven so far is 4.5n.[9]

6.11.7 Dynamic optimality conjecture

Main article: Optimal binary search tree

In addition to the proven performance guarantees for splay trees, there is an unproven conjecture of great interest from the original Sleator and Tarjan paper. This conjecture is known as the dynamic optimality conjecture, and it basically claims that splay trees perform as well as any other binary search tree algorithm up to a constant factor.

Dynamic Optimality Conjecture:[1] Let A be any binary search tree algorithm that accesses an element x by traversing the path from the root to x at a cost of d(x) + 1, and that between accesses can make any rotations in the tree at a cost of 1 per rotation. Let A(S) be the cost for A to perform the sequence S of accesses. Then the cost for a splay tree to perform the same accesses is O[n + A(S)].

There are several corollaries of the dynamic optimality conjecture that remain unproven:

Traversal Conjecture:[1] Let T1 and T2 be two splay trees containing the same elements. Let S be the sequence obtained by visiting the elements in T2 in preorder (i.e., depth-first search order). The total cost of performing the sequence S of accesses on T1 is O(n).

Deque Conjecture:[8][10][11] Let S be a sequence of m double-ended queue operations (push, pop, inject, eject). Then the cost of performing S on a splay tree is O(m + n).

Split Conjecture:[5] Let S be any permutation of the elements of the splay tree. Then the cost of deleting the elements in the order S is O(n).
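The Balance Theorem proof above stops after choosing w(x) = 1; the remaining accounting is the standard access-lemma argument from Sleator and Tarjan, restated here in display form using the definitions of this section (our restatement, not text from this excerpt):

```latex
% With w(x) = 1 for all x we have W = n, so the access lemma bounds each
% splay by 3(\log_2 W - \log_2 s(x)) + 1 = O(\log n) amortized, and the
% net potential drop over the sequence is O(n \log n), giving
\operatorname{cost}(S) \;\le\; \sum_{j=1}^{m} O(\log n) \;+\; (\Phi_i - \Phi_f)
\;=\; O(m \log n) + O(n \log n).
```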
[7] Cole 2000.

[8] Tarjan 1985.

[9] Elmasry 2004.

[10] Pettie 2008.

[11] Sundar 1992.

[12] Brinkmann, Degraer & De Loof 2009.

6.11.11 References

• Albers, Susanne; Karpinski, Marek (28 February 2002). "Randomized Splay Trees: Theoretical and Experimental Results" (PDF). Information Processing Letters. 81 (4): 213–221. doi:10.1016/s0020-0190(01)00230-7.

• Allen, Brian; Munro, Ian (October 1978). "Self-organizing search trees". Journal of the ACM. 25 (4): 526–535. doi:10.1145/322092.322094.

• Knuth, Donald (1997). The Art of Computer Programming. 3: Sorting and Searching (3rd ed.). Addison-Wesley. p. 478. ISBN 0-201-89685-0.

• Lucas, Joan M. (1991). "On the Competitiveness of Splay Trees: Relations to the Union-Find Problem". On-line Algorithms: Proceedings of a DIMACS Workshop, February 11–13, 1991. Series in Discrete Mathematics and Theoretical Computer Science. 7. Center for Discrete Mathematics and Theoretical Computer Science. pp. 95–124. ISBN 0-8218-7111-0.

• Pettie, Seth (2008), "Splay Trees, Davenport-Schinzel Sequences, and the Deque Conjecture" (PDF), Proc. 19th ACM-SIAM Symposium on Discrete Algorithms, pp. 1115–1124, arXiv:0707.2160, Bibcode:2007arXiv0707.2160P.

• Sleator, Daniel D.; Tarjan, Robert E. (1985). "Self-Adjusting Binary Search Trees" (PDF).
Join Our join operation will combine two auxiliary trees as long as they have the property that the top node of one (in the reference tree) is a child of the bottom node of the other (essentially, that the corresponding preferred paths can be concatenated). This will work based on the concatenate operation of red-black trees, which combines two trees as long as they have the property that all elements of one are less than all elements of the other, and split, which does the reverse. In the reference tree, note that there exist two nodes in the top path such that a node is in the bottom path if and only if its key-value is between them. Now, to join the bottom path to the top path, we simply split the top path between those two nodes, then concatenate the two resulting auxiliary trees on either side of the bottom path's auxiliary tree, and we have our final, joined auxiliary tree.

Cut Our cut operation will break a preferred path into two parts at a given node, a top part and a bottom part. More formally, it will partition an auxiliary tree into two auxiliary trees, such that one contains all nodes at or above a certain depth in the reference tree, and the other contains all nodes below that depth. As in join, note that the top part has two nodes that bracket the bottom part. Thus, we can simply split on each of these two nodes to divide the path into three parts, then concatenate the two outer ones so we end up with two parts, the top and bottom, as desired.

Interleave Bound

Main article: Interleave lower bound

To find a lower bound on the work done by the optimal offline binary search tree, we again use the notion of preferred children. When considering an access sequence (a sequence of searches), we keep track of how many times a reference tree node's preferred child switches. The total number of switches (summed over all nodes) gives an asymptotic lower bound on the work done by any binary search tree algorithm on the given access sequence. This is called the interleave lower bound.[1]

Tango Tree

In order to connect this to tango trees, we will find an upper bound on the work done by the tango tree for a given access sequence. Our upper bound will be (k + 1)O(log log n), where k is the number of interleaves. The total cost is divided into two parts: searching for the element, and updating the structure of the tango tree to maintain the proper invariants (switching preferred children and re-arranging preferred paths).

Searching To see that the searching (not updating) fits in this bound, simply note that every time an auxiliary tree search is unsuccessful and we have to move to the next auxiliary tree, that results in a preferred child switch (since the parent preferred path now switches directions to join the child preferred path). Since all auxiliary tree searches are unsuccessful except the last one (we stop once a search is successful, naturally), we search k + 1 auxiliary trees. Each search takes O(log log n), because an auxiliary tree's size is bounded by log n, the height of the reference tree.

Updating The update cost fits within this bound as well, because we only have to perform one cut and one join for every visited auxiliary tree. A single cut or join operation takes only a constant number of searches, splits, and concatenates, each of which takes time logarithmic in the size of the auxiliary tree, so our update cost is (k + 1)O(log log n).

6.12.4 See also

• Splay tree
• Optimal binary search tree
• Red-black tree
• Tree (data structure)

6.12.5 References

[1] Demaine, E. D.; Harmon, D.; Iacono, J.; Pătraşcu, M. (2007). "Dynamic Optimality—Almost". SIAM Journal on Computing. 37 (1): 240. doi:10.1137/S0097539705447347.
even ones or only the odd ones. Instead of O(n log n) coin flips, there would only be O(log n) of them. Unfortunately, this gives the adversarial user a 50/50 chance of being correct upon guessing that all of the even numbered nodes (among the ones at level 1 or higher) are higher than level one. This is despite the property that he has a very low probability of guessing that a particular node is at level N for some integer N.

A skip list does not provide the same absolute worst-case performance guarantees as more traditional balanced tree data structures, because it is always possible (though with very low probability) that the coin-flips used to build the skip list will produce a badly balanced structure. However, they work well in practice, and the randomized balancing scheme has been argued to be easier to implement than the deterministic balancing schemes used in balanced binary search trees. Skip lists are also useful in parallel computing, where insertions can be done in different parts of the skip list in parallel without any global rebalancing of the data structure. Such parallelism can be especially advantageous for resource discovery in an ad-hoc wireless network, because a randomized skip list can be made robust to the loss of any single node.[5]

Indexable skiplist

As described above, a skiplist is capable of fast O(log n) insertion and removal of values from a sorted sequence, but it has only slow O(n) lookups of values at a given position in the sequence (i.e. return the 500th value); however, with a minor modification the speed of random access indexed lookups can be improved to O(log n).

For every link, also store the width of the link. The width is defined as the number of bottom-layer links being traversed by each of the higher-layer "express lane" links. For example, here are the widths of the links in the example at the top of the page:

     1                                               10
    o---> o--------------------------------------------------------------> o    Top level
     1           3            2                      5
    o---> o--------------> o--------> o-----------------------------------> o   Level 3
     1        2        1       2               3              2
    o---> o--------> o---> o--------> o---------------> o----------------> o    Level 2
     1     1     1     1     1     1     1     1     1     1     1
    o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o         Bottom level
    Head  1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  NIL
          Node  Node  Node  Node  Node  Node  Node  Node  Node  Node

Notice that the width of a higher-level link is the sum of the component links below it (i.e. the width 10 link spans the links of widths 3, 2 and 5 immediately below it). Consequently, the sum of all widths is the same on every level (1 + 10 = 1 + 3 + 2 + 5 = 1 + 2 + 1 + 2 + 3 + 2 = 11).

To index the skiplist and find the i'th value, traverse the skiplist while counting down the widths of each traversed link. Descend a level whenever the upcoming width would be too large.

For example, to find the node in the fifth position (Node 5), traverse a link of width 1 at the top level. Now four more steps are needed, but the next width on this level is ten, which is too large, so drop one level. Traverse one link of width 3. Since another step of width 2 would be too far, drop down to the bottom level. Now traverse the final link of width 1 to reach the target running total of 5 (1 + 3 + 1).

    function lookupByPositionIndex(i)
        node ← head
        i ← i + 1                           # don't count the head as a step
        for level from top to bottom do
            while i ≥ node.width[level] do  # if next step is not too far
                i ← i - node.width[level]   # subtract the current width
                node ← node.next[level]     # traverse forward at the current level
            repeat
        repeat
        return node.value
    end function

This method of implementing indexing is detailed in Section 3.4 Linear List Operations in "A skip list cookbook" by William Pugh.

6.13.2 History

Skip lists were first described in 1989 by William Pugh.[6] To quote the author:

    Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.

6.13.3 Usages

List of applications and frameworks that use skip lists:

• MemSQL uses skiplists as its prime indexing structure for its database technology.
• Cyrus IMAP server offers a "skiplist" backend DB implementation (source file)
• Lucene uses skip lists to search delta-encoded posting lists in logarithmic time.
• QMap (up to Qt 4) template class of Qt that provides a dictionary.
• Redis, an ANSI-C open-source persistent key/value store for Posix systems, uses skip lists in its implementation of ordered sets.[7]
• nessDB, a very fast key-value embedded Database Storage Engine (using log-structured merge (LSM) trees), uses skip lists for its memtable.
• skipdb is an open-source database format using ordered key/value pairs.
• ConcurrentSkipListSet and ConcurrentSkipListMap in the Java 1.6 API.
• Speed Tables are a fast key-value datastore for Tcl that use skiplists for indexes and lockless shared memory.
• leveldb, a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values

[9] Sundell, H.; Tsigas, P. (2003). "Fast and lock-free concurrent priority queues for multi-thread systems". Proceedings International Parallel and Distributed Processing Symposium. p. 11. doi:10.1109/IPDPS.2003.1213189. ISBN 0-7695-1926-1.

[10] Fomitchev, Mikhail; Ruppert, Eric (2004). Lock-free linked lists and skip lists (PDF). Proc. Annual ACM Symp. on Principles of Distributed Computing (PODC). pp. 50–59. doi:10.1145/1011767.1011776. ISBN 1581138024.
nodes are full, then the two nodes are split into three. Deleting nodes is somewhat more complex than inserting, however.

• B-trees can be turned into order statistic trees to allow rapid searches for the Nth record in key order, or counting the number of records between any two records, and various other related operations.[1]

Etymology

Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972), but they did not explain what, if anything, the B stands for. Douglas Comer explains:

    The origin of "B-tree" has never been explained by the authors. As we shall see, "balanced," "broad," or "bushy" might apply. Others suggest that the "B" stands for Boeing. Because of his contributions, however, it seems appropriate to think of B-trees as "Bayer"-trees. (Comer 1979, p. 123 footnote 1)

Donald Knuth speculates on the etymology of B-trees in his May 1980 lecture on the topic "CS144C classroom lecture about disk storage and B-trees", suggesting the "B" may have originated from Boeing or from Bayer's name.[2]

Ed McCreight answered a question on the B-tree's name in 2013:

    Bayer and I were in a lunch time where we get to think a name. And we were, so, B, we were thinking… B is, you know… We were working for Boeing at the time, we couldn't use the name without talking to lawyers. So, there is a B. It has to do with balance, another B. Bayer was the senior author, who did have several years older than I am and had many more publications than I did. So there is another B. And so, at the lunch table we never did resolve whether there was one of those that made more sense than the rest. What really lives to say is: the more you think about what the B in B-trees means, the better you understand B-trees."[3]

6.14.2 B-tree usage in databases

Time to search a sorted file

Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed using order notation. A binary search of a sorted table with N records, for example, can be done in roughly ⌈log₂ N⌉ comparisons. If the table had 1,000,000 records, then a specific record could be located with at most 20 comparisons: ⌈log₂(1,000,000)⌉ = 20.

Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.[4] For simplicity, assume reading from disk takes about 10 milliseconds.

Naively, then, the time to locate one record out of a million would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds.

The time won't be that bad because individual records are grouped together in a disk block. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads: the comparisons are all within the last disk block read.

To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.

An index speeds the search

A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly. Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database, at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.

The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block.

Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in the aux index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.

The auxiliary indices have turned the search problem from a binary search requiring roughly log₂ N disk reads to one requiring only log_b N disk reads, where b is the blocking factor (the number of entries per block: b = 100 entries per block; log₁₀₀ 1,000,000 = 3 reads).

In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.

Insertions and deletions

• uses partially full blocks to speed insertions and deletions
• keeps the index balanced with a recursive algorithm

In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions.

Disadvantages of B-trees

• maximum key length cannot be changed without completely rebuilding the database. This led to many database systems truncating full human names to 70 characters.

(Other implementations of associative array, such as a ternary search tree or a separate-chaining hash table, dynamically adapt to arbitrarily long key lengths.)
1. Every node has at most m children.

2. Every non-leaf node (except root) has at least ⌈m/2⌉ children.

3. The root has at least two children if it is not a leaf node.

4. A non-leaf node with k children contains k−1 keys.

5. All leaves appear in the same level.

Each internal node’s keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a₁ and a₂. All values in the leftmost subtree will be less than a₁, all values in the middle subtree will be between a₁ and a₂, and all values in the rightmost subtree will be greater than a₂.

Internal nodes  Internal nodes are all nodes except for leaf nodes and the root node. They are usually represented as an ordered set of elements and child pointers. Every internal node contains a maximum of U children and a minimum of L children. Thus, the number of elements is always 1 less than the number of child pointers (the number of elements is between L−1 and U−1). U must be either 2L or 2L−1; therefore each internal node is at least half full. The relationship between U and L implies that two half-full nodes can be joined to make a legal node, and one full node can be split into two legal nodes (if there’s room to push one element up into the parent). These properties make it possible to delete and insert new values into a B-tree and adjust the tree to preserve the B-tree properties.

The root node  The root node’s number of children has the same upper limit as internal nodes, but has no lower limit. For example, when there are fewer than L−1 elements in the entire tree, the root will be the only node in the tree, with no children at all.

Leaf nodes  Leaf nodes have the same restriction on the number of elements, but have no children and no child pointers.

A B-tree of depth n+1 can hold about U times as many items as a B-tree of depth n, but the cost of search, insert, and delete operations grows with the depth of the tree. As with any balanced tree, the cost grows much more slowly than the number of elements. Some balanced trees store values only at leaf nodes, and use different kinds of nodes for leaf nodes and internal nodes. B-trees keep values in every node in the tree, and may use the same structure for all nodes. However, since leaf nodes never have children, B-trees benefit from improved performance if they use a specialized structure for them.

6.14.4 Best case and worst case heights

Let h be the height of the classic B-tree. Let n > 0 be the number of entries in the tree.[6] Let m be the maximum number of children a node can have. Each node can have at most m−1 keys.

It can be shown (by induction, for example) that a B-tree of height h with all its nodes completely filled has n = m^(h+1) − 1 entries. Hence, the best case height of a B-tree is:

⌈log_m (n + 1)⌉ − 1

Let d be the minimum number of children an internal (non-root) node can have. For an ordinary B-tree, d = ⌈m/2⌉.

Comer (1979, p. 127) and Cormen et al. (2001, pp. 383–384) give the worst case height of a B-tree (where the root node is considered to have height 0) as

h ≤ ⌊log_d ((n + 1)/2)⌋

6.14.5 Algorithms

Search

Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search reduces its field of view to the child pointer (subtree) whose range includes the search value. A subtree’s range is defined by the values, or keys, contained in its parent node. These limiting values are also known as separation values.

Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.

Insertion

All insertions start at a leaf node. To insert a new element, search the tree to find the leaf node where the new element should be added. Insert the new element into that node with the following steps:

1. If the node contains fewer than the maximum legal number of elements, then there is room for the new element. Insert the new element in the node, keeping the node’s elements ordered.

2. Otherwise the node is full; evenly split it into two nodes so:

(a) A single median is chosen from among the leaf’s elements and the new element.
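The overflow-and-split case above can be sketched in Python. This is a toy illustration with assumed names (MAX_KEYS, insert_into_leaf), not code from the text; a real implementation would also maintain child pointers and push the median into the parent node:

```python
import bisect

MAX_KEYS = 4  # hypothetical maximum number of elements per node

def insert_into_leaf(keys, new_key):
    """Insert new_key into the sorted list keys; split if it overflows.

    Returns (left, median, right) on a split, or (keys, None, None)
    if the new element simply fits.
    """
    bisect.insort(keys, new_key)
    if len(keys) <= MAX_KEYS:
        return keys, None, None
    mid = len(keys) // 2
    # The median is pushed up into the parent; the two halves become
    # two legal nodes, each at least half full.
    return keys[:mid], keys[mid], keys[mid + 1:]

print(insert_into_leaf([1, 3, 5], 2))     # ([1, 2, 3, 5], None, None)
print(insert_into_leaf([1, 3, 5, 7], 4))  # ([1, 3], 4, [5, 7])
```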
Deletion

Deletion from an internal node  Each element in an internal node acts as a separation value for two subtrees, therefore we need to find a replacement for separation. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically described below:

1. Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.

2. The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has fewer than the required number of elements), then rebalance the tree starting from the leaf node.

Rebalancing after deletion  Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the minimum number of elements. That redistribution operation is called a rotation. If no sibling can spare an element, then the deficient node must be merged with a sibling. The merge causes the parent to lose a separator element, so the parent may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root. Since the minimum element count doesn't apply to the root, making the root be the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:

• If the deficient node’s left sibling exists and has more than the minimum number of elements, then rotate right:

1. Copy the separator from the parent to the start of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)

2. Replace the separator in the parent with the last element of the left sibling (the left sibling loses one element but still has at least the minimum number of elements)

3. The tree is now balanced

• Otherwise, if both immediate siblings have only the minimum number of elements, then merge with a sibling, sandwiching the separator taken from their parent between them:

1. Copy the separator to the end of the left node (the left node may be the deficient node or it may be the sibling with the minimum number of elements)

2. Move all elements from the right node to the left node (the left node now has the maximum number of elements, and the right node is empty)

3. Remove the separator from the parent along with its empty right child (the parent loses an element)

• If the parent is the root and now has no elements, then free it and make the merged node the new root (the tree becomes shallower)

• Otherwise, if the parent has fewer than the required number of elements, then rebalance the parent

Note: The rebalancing operations are different for B+ trees (e.g., rotation is different because the parent has a copy of the key) and B*-trees (e.g., three siblings are merged into two siblings).
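The rotation steps above, which borrow one element from a sibling through the parent's separator, can be sketched as follows. The names here are assumptions for illustration only, and the nodes are bare key lists; a real implementation would move a child pointer along with the borrowed element:

```python
def rotate_from_left(parent_keys, sep_index, left_sibling, deficient):
    """Borrow one element from the left sibling through the parent."""
    # 1. Copy the separator from the parent down into the deficient node.
    deficient.insert(0, parent_keys[sep_index])
    # 2. Replace the separator with the left sibling's last element.
    parent_keys[sep_index] = left_sibling.pop()
    # 3. The tree is now balanced: every node's key count stays legal.

parent = [20]
left, right = [5, 10, 15], [25]   # right node is deficient
rotate_from_left(parent, 0, left, right)
print(parent, left, right)  # [15] [5, 10] [20, 25]
```

Note how the ordering invariant is preserved: everything in the left node is still less than the new separator, and everything in the (formerly deficient) right node is still greater.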
has one extra element, which will be used to build the internal nodes.

For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would initially construct 4 leaf nodes containing 5 values each and 1 which contains 4 values:

We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again, each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2 values (3 child pointers). Then the next level up of internal nodes would be:

This process is continued until we reach a level with only one node and it is not overfilled. In the example only the root level remains:

6.14.6 In filesystems

Most modern filesystems use B-trees (or § Variants); alternatives such as extendible hashing are less common.[8]

In addition to its use in databases, the B-tree is also used in filesystems to allow quick random access to an arbitrary block in a particular file. The basic problem is turning the file block i address into a disk block (or perhaps a cylinder-head-sector) address.

Some operating systems require the user to allocate the maximum size of the file when the file is created. The file can then be allocated as contiguous disk blocks. When converting to a disk block address, the operating system just adds the file block address to the starting disk block of the file. The scheme is simple, but the file cannot exceed its created size.

Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical blocks to physical blocks is more involved.

MS-DOS, for example, used a simple File Allocation Table (FAT). The FAT has an entry for each disk block,[note 1] and that entry identifies whether its block is used by a file and if so, which block (if any) is the next disk block of the same file. So, the allocation of each file is represented as a linked list in the table. In order to find the disk address of file block i, the operating system (or disk utility) must sequentially follow the file’s linked list in the FAT. Worse, to find a free disk block, it must sequentially scan the FAT. For MS-DOS, that was not a huge penalty because the disks and files were small and the FAT had few entries and relatively short file chains. In the FAT12 filesystem (used on floppy disks and early hard disks), there were no more than 4,080[note 2] entries, and the FAT would usually be resident in memory. As disks got bigger, the FAT architecture began to confront penalties. On a large disk using FAT, it may be necessary to perform disk reads to learn the disk location of a file block to be read or written.

TOPS-20 (and possibly TENEX) used a 0 to 2 level tree that has similarities to a B-tree. A disk block was 512 36-bit words. If the file fit in a 512 (2⁹) word block, then the file directory would point to that physical disk block. If the file fit in 2¹⁸ words, then the directory would point to an aux index; the 512 words of that index would either be NULL (the block isn't allocated) or point to the physical address of the block. If the file fit in 2²⁷ words, then the directory would point to a block holding an aux-aux index; each entry would either be NULL or point to an aux index. Consequently, the physical disk block for a 2²⁷ word file could be located in two disk reads and read on the third.

Apple’s filesystem HFS+, Microsoft’s NTFS,[9] AIX (jfs2) and some Linux filesystems, such as btrfs and Ext4, use B-trees.

B*-trees are used in the HFS and Reiser4 file systems.

6.14.7 Variations

Access concurrency

Lehman and Yao[10] showed that all the read locks could be avoided (and thus concurrent access greatly improved) by linking the tree blocks at each level together with a “next” pointer. This results in a tree structure where both insertion and search operations descend from the root to the leaf. Write locks are only required as a tree block is modified. This maximizes access concurrency by multiple users, an important consideration for databases and other B-tree based ISAM storage methods. The cost associated with this improvement is that empty pages cannot be removed from the btree during normal operations. (However, see [11] for various strategies to implement node merging, and source code at.[12])

United States Patent 5283894, granted in 1994, appears to show a way to use a 'Meta Access Method'[13] to allow concurrent B+ tree access and modification without locks. The technique accesses the tree 'upwards' for both searches and updates by means of additional in-memory indexes that point at the blocks in each level in the block cache. No reorganization for deletes is needed and there are no 'next' pointers in each block as in Lehman and Yao.

6.14.8 See also

• B+tree

• R-tree

• Red–black tree

• 2–3 tree

• 2–3–4 tree
[13] Lockless Concurrent B+Tree

General

• Bayer, R.; McCreight, E. (1972), “Organization and Maintenance of Large Ordered Indexes” (PDF), Acta Informatica, 1 (3): 173–189, doi:10.1007/bf00288683

• Dictionary of Algorithms and Data Structures entry for B*-tree

• Open Data Structures - Section 14.2 - B-Trees

• Counted B-Trees

• B-Tree .Net, a modern, virtualized RAM & Disk implementation
6.15 B+ tree

node, which is a leaf node. (The root is also the single leaf, in this case.) This node is permitted to have as few as one key if necessary, and at most b.
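The point of such wide nodes is that the number of levels grows only logarithmically in the number of entries. A back-of-the-envelope sketch (an illustration, not code from the text; b here stands for the branching order discussed above):

```python
def levels_needed(n, b):
    """Levels required so that b-way branching can reach n entries."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= b   # each extra level multiplies reachable entries by b
        levels += 1
    return levels

print(levels_needed(1_000_000, 100))  # 3
```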
6.15.2 Algorithms

Search

Bulk-loading

Given a collection of data records, we want to create a B+ tree index on some key field. One approach is to insert each record into an empty tree. However, it is quite expensive, because each entry requires us to start from the root and go down to the appropriate leaf page. An efficient alternative is to use bulk-loading.

• The first step is to sort the data entries according to a search key in ascending order.

• We allocate an empty page to serve as the root, and insert a pointer to the first page of entries into it.

• When the root is full, we split the root, and create a new root page.

• Keep inserting entries to the rightmost index page just above the leaf level, until all entries are indexed.

Note:

• when the rightmost index page above the leaf level fills up, it is split;

• this action may, in turn, cause a split of the rightmost index page one step closer to the root;

• splits only occur on the rightmost path from the root to the leaf level.

6.15.3 Characteristics

For a b-order B+ tree with h levels of index:

• The maximum number of records stored is n_max = b^h − b^(h−1)

6.15.4 Implementation

The leaves (the bottom-most index blocks) of the B+ tree are often linked to one another in a linked list; this makes range queries or an (ordered) iteration through the blocks simpler and more efficient (though the aforementioned upper bound can be achieved even without this addition). This does not substantially increase space consumption or maintenance on the tree. This illustrates one of the significant advantages of a B+tree over a B-tree; in a B-tree, since not all keys are present in the leaves, such an ordered linked list cannot be constructed. A B+tree is thus particularly useful as a database system index, where the data typically resides on disk, as it allows the B+tree to actually provide an efficient structure for housing the data itself (this is described in [4]:238 as index structure “Alternative 1”).

If a storage system has a block size of B bytes, and the keys to be stored have a size of k, arguably the most efficient B+ tree is one where b = (B/k) − 1. Although theoretically the one-off is unnecessary, in practice there is often a little extra space taken up by the index blocks (for example, the linked list references in the leaf blocks). Having an index block which is slightly larger than the storage system’s actual block represents a significant performance decrease; therefore erring on the side of caution is preferable.

If nodes of the B+ tree are organized as arrays of elements, then it may take a considerable time to insert or delete an element, as half of the array will need to be shifted on average. To overcome this problem, elements inside a node can be organized in a binary tree or a B+ tree instead of an array.

B+ trees can also be used for data stored in RAM. In this case a reasonable choice for block size would be the size of the processor’s cache line.

Space efficiency of B+ trees can be improved by using some compression techniques. One possibility is to use
delta encoding to compress keys stored into each block. For internal blocks, space saving can be achieved by either compressing keys or pointers. For string keys, space can be saved by using the following technique: Normally the i-th entry of an internal block contains the first key of block i+1. Instead of storing the full key, we could store the shortest prefix of the first key of block i+1 that is strictly greater (in lexicographic order) than the last key of block i. There is also a simple way to compress pointers: if we suppose that some consecutive blocks i, i+1, ..., i+k are stored contiguously, then it will suffice to store only a pointer to the first block and the count of consecutive blocks.

All the above compression techniques have some drawbacks. First, a full block must be decompressed to extract a single element. One technique to overcome this problem is to divide each block into sub-blocks and compress them separately. In this case searching or inserting an element will only need to decompress or compress a sub-block instead of a full block. Another drawback of compression techniques is that the number of stored elements may vary considerably from one block to another depending on how well the elements are compressed inside each block.

6.15.5 History

The B tree was first described in the paper Organization and Maintenance of Large Ordered Indices. Acta Informatica 1: 173–189 (1972) by Rudolf Bayer and Edward M. McCreight. There is no single paper introducing the B+ tree concept. Instead, the notion of maintaining all data in leaf nodes is repeatedly brought up as an interesting variant. An early survey of B trees also covering B+ trees is Douglas Comer.[8] Comer notes that the B+ tree was used in IBM’s VSAM data access software, and he refers to an IBM published article from 1973.

[4] Ramakrishnan Raghu, Gehrke Johannes – Database Management Systems, McGraw-Hill Higher Education (2000), 2nd edition (en) page 267

[5] SQLite Version 3 Overview

[6] CouchDB Guide (see note after 3rd paragraph)

[7] Tokyo Cabinet reference Archived September 12, 2009, at the Wayback Machine.

[8] "The Ubiquitous B-Tree", ACM Computing Surveys 11(2): 121–137 (1979).

6.15.8 External links

• B+ tree in Python, used to implement a list

• Dr. Monge’s B+ Tree index notes

• Evaluating the performance of CSB+-trees on Multithreaded Architectures

• Effect of node size on the performance of cache conscious B+-trees

• Fractal Prefetching B+-trees

• Towards pB+-trees in the field: implementations Choices and performance

• Cache-Conscious Index Structures for Main-Memory Databases

• Cache Oblivious B(+)-trees

• The Power of B-Trees: CouchDB B+ Tree Implementation

• B+ Tree Visualization

Implementations
224 CHAPTER 7. INTEGER AND STRING SEARCHING
• There is no need to provide a hash function or to change hash functions as more keys are added to a trie.

• A trie can provide an alphabetical ordering of the entries by key.

Tries do have some drawbacks as well:

• Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.[7]

• Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single and double format floating point numbers.

• Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.

Dictionary representation

We can look up a value in the trie as follows:

    find :: String -> Trie a -> Maybe a
    find []     t = value t
    find (k:ks) t = do
        ct <- Data.Map.lookup k (children t)
        find ks ct

In an imperative style, and assuming an appropriate data type in place, we can describe the same algorithm in Python (here, specifically for testing membership). Note that children is a dict of a node’s children, keyed by character; and we say that a “terminal” node is one which contains a valid word.

    def find(node, key):
        for char in key:
            if char in node.children:
                node = node.children[char]
            else:
                return None
        return node

Insertion proceeds by walking the trie according to the string to be inserted, then appending new nodes for the suffix of the string that is not contained in the trie. In imperative Pascal pseudocode:

    algorithm insert(root : node, s : string, value : any):
        node = root
        i = 0
        n = length(s)
        while i < n:
            if node.child(s[i]) != nil:
                node = node.child(s[i])
                i = i + 1
            else:
                break
        (* append new nodes, if necessary *)
        while i < n:
            node.child(s[i]) = new node
            node = node.child(s[i])
            i = i + 1
        node.value = value
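For comparison, the same insertion can be written in Python against the node shape assumed by the find function above (children as a dict keyed by character). This is a sketch for illustration, not code from the original article:

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.value = None    # payload; non-None marks a terminal node

def insert(root, s, value):
    """Walk the trie along s, appending nodes for the missing suffix."""
    node = root
    for char in s:
        if char not in node.children:
            node.children[char] = Node()
        node = node.children[char]
    node.value = value

root = Node()
insert(root, "tea", 1)
insert(root, "ten", 2)
# The shared prefix "te" is stored once; the branch happens at the third char.
print(sorted(root.children["t"].children["e"].children))  # ['a', 'n']
```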
Bitwise tries
nodes (bit strings) can be stored and compressed easily, reducing the overall size of the trie.

Such compression is also used in the implementation of the various fast lookup tables for retrieving Unicode character properties. These could include case-mapping tables (e.g. for the Greek letter pi, from Π to π), or lookup tables normalizing the combination of base and combining characters (like the a-umlaut in German, ä, or the dalet-patah-dagesh-ole in Biblical Hebrew, דַּ֫). For such applications, the representation is similar to transforming a very large, unidimensional, sparse table (e.g. Unicode code points) into a multidimensional matrix of their combinations, and then using the coordinates in the hyper-matrix as the string key of an uncompressed trie to represent the resulting character. The compression will then consist of detecting and merging the common columns within the hyper-matrix to compress the last dimension in the key. For example, to avoid storing the full, multi-byte Unicode code point of each element forming a matrix column, the groupings of similar code points can be exploited. Each dimension of the hyper-matrix stores the start position of the next dimension, so that only the offset (typically a single byte) need be stored. The resulting vector is itself compressible when it is also sparse, so each dimension (associated to a layer level in the trie) can be compressed separately.

Some implementations do support such data compression within dynamic sparse tries and allow insertions and deletions in compressed tries. However, this usually has a significant cost when compressed segments need to be split or merged. Some tradeoff has to be made between data compression and update speed. A typical strategy is to limit the range of global lookups for comparing the common branches in the sparse trie.

The result of such compression may look similar to trying to transform the trie into a directed acyclic graph (DAG), because the reverse transform from a DAG to a trie is obvious and always possible. However, the shape of the DAG is determined by the form of the key chosen to index the nodes, in turn constraining the compression possible.

Another compression strategy is to “unravel” the data structure into a single byte array.[16] This approach eliminates the need for node pointers, substantially reducing the memory requirements. This in turn permits memory mapping and the use of virtual memory to efficiently load the data from disk.

One more approach is to “pack” the trie.[4] Liang describes a space-efficient implementation of a sparse packed trie applied to automatic hyphenation, in which the descendants of each node may be interleaved in memory.

External memory tries

Several trie variants are suitable for maintaining sets of strings in external memory, including suffix trees. A combination of trie and B-tree, called the B-trie, has also been suggested for this task; compared to suffix trees, they are limited in the supported operations but also more compact, while performing update operations faster.[17]

7.1.5 See also

• Suffix tree

• Radix tree

• Directed acyclic word graph (aka DAWG)

• Acyclic deterministic finite automata

• Hash trie

• Deterministic finite automata

• Judy array

• Search algorithm

• Extendible hashing

• Hash array mapped trie

• Prefix Hash Tree

• Burstsort

• Luleå algorithm

• Huffman coding

• Ctrie

• HAT-trie

7.1.6 References

[1] de la Briandais, René (1959). File searching using variable length keys. Proc. Western J. Computer Conf. pp. 295–298. Cited by Brass.

[2] Brass, Peter (2008). Advanced Data Structures. Cambridge University Press.

[3] Black, Paul E. (2009-11-16). “trie”. Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Archived from the original on 2010-05-19.

[4] Franklin Mark Liang (1983). Word Hy-phen-a-tion By Com-put-er (Doctor of Philosophy thesis). Stanford University. Archived from the original (PDF) on 2010-05-19. Retrieved 2010-03-28.
[6] Bentley, Jon; Sedgewick, Robert (1998-04-01). “Ternary Search Trees”. Dr. Dobb’s Journal. Dr Dobb’s. Archived from the original on 2008-06-23.

[7] Edward Fredkin (1960). “Trie Memory”. Communications of the ACM. 3 (9): 490–499. doi:10.1145/367390.367400.

[8] Aho, Alfred V.; Corasick, Margaret J. (Jun 1975). “Efficient String Matching: An Aid to Bibliographic Search” (PDF). Communications of the ACM. 18 (6): 333–340. doi:10.1145/360825.360855.

[9] John W. Wheeler; Guarionex Jordan. “An Empirical Study of Term Indexing in the Darwin Implementation of the Model Evolution Calculus”. 2004. p. 5.

[10] “Cache-Efficient String Sorting Using Copying” (PDF). Retrieved 2008-11-15.

[11] “Engineering Radix Sort for Strings”. Lecture Notes in Computer Science: 3–14. doi:10.1007/978-3-540-89097-3_3.

[12] Allison, Lloyd. “Tries”. Retrieved 18 February 2014.

[13] Sahni, Sartaj. “Tries”. Data Structures, Algorithms, & Applications in Java. University of Florida. Retrieved 18 February 2014.

[14] Bellekens, Xavier (2014). A Highly-Efficient Memory-Compression Scheme for GPU-Accelerated Intrusion Detection Systems. Glasgow, Scotland, UK: ACM. pp. 302:302–302:309. ISBN 978-1-4503-3033-6. Retrieved 21 October 2015.

[15] Jan Daciuk; Stoyan Mihov; Bruce W. Watson; Richard E. Watson (2000). “Incremental Construction of Minimal Acyclic Finite-State Automata”. Computational Linguistics. Association for Computational Linguistics. 26: 3. doi:10.1162/089120100561601. Archived from the original on 2006-03-13. Retrieved 2009-05-28. This paper presents a method for direct building of minimal acyclic finite states automaton which recognizes a given finite list of words in lexicographical order. Our approach is to construct a minimal automaton in a single phase by adding new strings one by one and minimizing the resulting automaton on-the-fly

[16] Ulrich Germann; Eric Joanis; Samuel Larkin (2009). “Tightly packed tries: how to fit large models into memory, and make them load fast, too” (PDF). ACL Workshops: Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing. Association for Computational Linguistics. pp. 31–39. We present Tightly Packed Tries (TPTs), a compact implementation of read-only, compressed trie structures with fast on-demand paging and short load times. We demonstrate the benefits of TPTs for storing n-gram back-off language models and phrase tables for statistical machine translation. Encoded as TPTs, these databases require less space than flat text file representations of the same data compressed with the gzip utility. At the same time, they can be mapped into memory quickly and be searched directly in time linear in the length of the key, without the need to decompress the entire file. The overhead for local decompression during search is marginal.

[17] Askitis, Nikolas; Zobel, Justin (2008). “B-tries for Disk-based String Management” (PDF). VLDB Journal: 1–26. ISSN 1066-8888.

7.1.7 External links

• NIST’s Dictionary of Algorithms and Data Structures: Trie

7.2 Radix tree

An example of a radix tree

In computer science, a radix tree (also radix trie or compact prefix tree) is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at least the radix r of the radix tree, where r is a positive integer and a power x of 2, having x ≥ 1. Unlike in regular tries, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.

Unlike regular trees (where whole keys are compared en masse from their beginning up to the point of inequality), the key at each node is compared chunk-of-bits by chunk-of-bits, where the quantity of bits in that chunk at that node is the radix r of the radix trie. When r is 2, the radix trie is binary (i.e., compare that node’s 1-bit portion of the key), which minimizes sparseness at the expense of maximizing trie depth, i.e., maximizing up to conflation of nondiverging bit-strings in the key. When r is an integer power of 2 greater than or equal to 4, then the radix trie is an r-ary trie, which lessens the depth of the radix trie at the expense of potential sparseness.

As an optimization, edge labels can be stored in constant size by using two pointers to a string (for the first and last elements).[1]

Note that although the examples in this article show strings as sequences of characters, the type of the string elements can be chosen arbitrarily; for example, as a bit
Lookup

Insertion

• Insert 'toast' while splitting 'te' and moving previous strings a level lower

• Insert 'test' which is a prefix of 'tester'

Deletion

7.2.3 History

Donald R. Morrison first described what he called “Patricia trees” in 1968;[4] the name comes from the acronym PATRICIA, which stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric". Gernot Gwehenberger independently invented and described the data structure at about the same time.[5] PATRICIA tries are radix tries with a radix of 2, which means that each bit of the key is compared individually and each node is a two-way (i.e., left versus right) branch.
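Since a PATRICIA trie branches two ways on a single bit of the key at each node, its core primitive is extracting the i-th bit. A minimal sketch of that bit test (an illustration with assumed names, keys taken as byte strings, most significant bit first):

```python
def bit_at(key: bytes, i: int) -> int:
    """Return the i-th bit of key (MSB-first), the value a radix-2
    trie would branch on (left = 0, right = 1) at depth i."""
    byte, offset = divmod(i, 8)
    return (key[byte] >> (7 - offset)) & 1

# 'A' is 0x41 = 0b01000001: only the second and last bits are set.
print([bit_at(b"A", i) for i in range(8)])  # [0, 1, 0, 0, 0, 0, 0, 1]
```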
are slow in practice due to long common prefixes (in the case where comparisons begin at the start of the string). In a trie, all comparisons require constant time, but it takes m comparisons to look up a string of length m. Radix trees can perform these operations with fewer comparisons, and require many fewer nodes.

Radix trees also share the disadvantages of tries, however: as they can only be applied to strings of elements, or to elements with an efficiently reversible mapping to strings, they lack the full generality of balanced search trees, which apply to any data type with a total ordering. A reversible mapping to strings can be used to produce the required total ordering for balanced search trees, but not the other way around. This can also be problematic if a data type provides only a comparison operation, but not a (de)serialization operation.

Hash tables are commonly said to have expected O(1) insertion and deletion times, but this is only true when computation of the hash of the key is considered a constant-time operation. When hashing the key is taken into account, hash tables have expected O(k) insertion and deletion times, but may take longer in the worst case depending on how collisions are handled. Radix trees have worst-case O(k) insertion and deletion. The successor/predecessor operations of radix trees are also not implemented by hash tables.

7.2.5 Variants

A common extension of radix trees uses two colors of nodes, 'black' and 'white'. To check whether a given string is stored in the tree, the search starts from the top and follows the edges of the input string until no further progress can be made. If the search string is consumed and the final node is a black node, the search has failed; if it is white, the search has succeeded. This enables us to add a large range of strings with a common prefix to the tree using white nodes, then remove a small set of "exceptions" in a space-efficient manner by inserting them using black nodes.

The HAT-trie is a cache-conscious data structure based on radix trees that offers efficient string storage and retrieval, and ordered iteration. Performance, with respect to both time and space, is comparable to a cache-conscious hash table.[6][7]

The adaptive radix tree is a radix tree variant that integrates adaptive node sizes into the radix tree. One major drawback of the usual radix tree is its use of space, because a constant node size is used at every level. The major difference between the radix tree and the adaptive radix tree is the latter's variable node size, based on the number of child elements, which grows as new entries are added. Hence, the adaptive radix tree leads to better use of space without reducing speed.[8][9][10]

7.2.6 See also

• Prefix tree (also known as a trie)
• Deterministic acyclic finite state automaton (DAFSA)
• Ternary search tries
• Acyclic deterministic finite automata
• Hash trie
• Deterministic finite automata
• Judy array
• Search algorithm
• Extendible hashing
• Hash array mapped trie
• Prefix hash tree
• Burstsort
• Luleå algorithm
• Huffman coding

7.2.7 References

[1] Morin, Patrick. "Data Structures for Strings" (PDF). Retrieved 15 April 2012.
[2] "rtfree(9)". www.freebsd.org. Retrieved 2016-10-23.
[3] Knizhnik, Konstantin. "Patricia Tries: A Better Index For Prefix Searches", Dr. Dobb's Journal, June 2008.
[4] Morrison, Donald R. Practical Algorithm to Retrieve Information Coded in Alphanumeric.
[5] Gwehenberger, G. Anwendung einer binären Verweiskettenmethode beim Aufbau von Listen [Use of a binary chaining method when building lists]. Elektronische Rechenanlagen 10 (1968), pp. 223–226.
[6] Askitis, Nikolas; Sinha, Ranjan (2007). HAT-trie: A Cache-conscious Trie-based Data Structure for Strings. Proceedings of the 30th Australasian Conference on Computer Science. 62. pp. 97–105. ISBN 1-920682-43-0.
[7] Askitis, Nikolas; Sinha, Ranjan (October 2010). "Engineering scalable, cache and space efficient tries for strings". The VLDB Journal. 19 (5): 633–660. doi:10.1007/s00778-010-0183-9.
[8] Kemper, Alfons; Eickler, André (2013). Datenbanksysteme, Eine Einführung. 9. pp. 604–605. ISBN 978-3-486-72139-3.
[9] "armon/libart · GitHub". GitHub. Retrieved 17 September 2014.
[10] http://www-db.in.tum.de/~leis/papers/ART.pdf
Implementations

• FreeBSD implementation, used for paging, forwarding and other things.
• Linux kernel implementation, used for the page cache, among other things.

[Figure: Suffix tree for the text BANANA. Each substring is terminated with the special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction.]
suffix tree is seen with a Fibonacci word, giving the full 2n nodes.

• Find the longest common prefix between the suffixes Si[p..ni] and Sj[q..nj] in Θ(1) time.[12]
• Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.[13]
• Find all z maximal palindromes in Θ(n) time,[14] or Θ(gn) time if gaps of length g are allowed, or Θ(kn) time if k mismatches are allowed.[15]
• Find all z tandem repeats in O(n log n + z) time, and k-mismatch tandem repeats in O(kn log(n/k) + z) time.[16]
• Find the longest common substrings to at least k strings in D, for k = 2, ..., K, in Θ(n) time.[17]
• Find the longest palindromic substring of a given string (using the generalized suffix tree of the string and its reverse) in linear time.[18]

An important choice when making a suffix tree implementation is the parent-child relationship between nodes. The most common approach uses linked lists called sibling lists: each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running-time properties use hash maps, sorted or unsorted arrays (with array doubling), or balanced search trees. We are interested in:

• The cost of finding the child on a given character.
• The cost of inserting a child.
• The cost of enlisting all children of a node (divided by the number of children).

Let σ be the size of the alphabet; the costs of these operations depend on the chosen representation.
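The sibling-list representation described above can be sketched as follows (a minimal illustration with hypothetical names, not a complete suffix tree implementation):

```python
# Sketch of the sibling-list representation: each node stores only its first
# child and its next sibling, so finding the child for a character is a linear
# scan over at most sigma siblings, while insertion is O(1).
class Node:
    def __init__(self, char=None):
        self.char = char            # first character of the incoming edge
        self.first_child = None
        self.next_sibling = None

def find_child(node, c):
    """O(sigma) scan of the sibling list for the child starting with c."""
    child = node.first_child
    while child is not None and child.char != c:
        child = child.next_sibling
    return child

def insert_child(node, c):
    """O(1) insertion at the head of the sibling list."""
    child = Node(c)
    child.next_sibling = node.first_child
    node.first_child = child
    return child

root = Node()
insert_child(root, 'a')
insert_child(root, 'b')
assert find_child(root, 'a').char == 'a'
assert find_child(root, 'x') is None
```

Hash maps or sorted arrays trade this O(1) insertion for faster child lookup, which is the trade-off the cost comparison above is about.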
Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection. For a large text, construction may require external-memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However, the overall intricacy of this algorithm has prevented, so far, its practical implementation.[27]

On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (a few) GB/hours. The state-of-the-art methods are TDD,[28] TRELLIS,[29] DiGeST,[30] and B2ST.[31]

TDD and TRELLIS scale up to the entire human genome – approximately 3 GB – resulting in a disk-based suffix tree of a size in the tens of gigabytes.[28][29] However, these methods cannot handle efficiently collections of sequences exceeding 3 GB.[30] DiGeST performs significantly better and is able to handle collections of sequences on the order of 6 GB in about 6 hours.[30] All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST,[31] scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16 GB RAM. On a simple Linux cluster with 16 nodes (4 GB RAM per node), ERA can index the entire human genome in less than 9 minutes.[32]

[5] Gusfield (1999), p. 123.
[6] Baeza-Yates & Gonnet (1996).
[7] Gusfield (1999), p. 132.
[8] Gusfield (1999), p. 125.
[10] Gusfield (1999), p. 166.
[11] Gusfield (1999), Chapter 8.
[12] Gusfield (1999), p. 196.
[13] Gusfield (1999), p. 200.
[14] Gusfield (1999), p. 198.
[15] Gusfield (1999), p. 201.
[16] Gusfield (1999), p. 204.
[17] Gusfield (1999), p. 205.
[18] Gusfield (1999), pp. 197–199.
[19] Allison, L. "Suffix Trees". Retrieved 2008-10-14.
[20] First introduced by Zamir & Etzioni (1998).
[21] Apostolico et al. (Vishkin).
[22] Hariharan (1994).
[23] Sahinalp & Vishkin (1994).
[24] Farach & Muthukrishnan (1996).
[25] Iliopoulos & Rytter (2004).
[26] Shun & Blelloch (2014).
[27] Smyth (2003).
[28] Tata, Hankins & Patel (2003).
[29] Phoophakdee & Zaki (2007).
[30] Barsky et al. (2008).

• Barsky, Marina; Stege, Ulrike; Thomo, Alex; Upton, Chris (2008), "A new method for indexing genomes using on-disk suffix trees", CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, New York, NY, USA: ACM, pp. 649–658.
• Barsky, Marina; Stege, Ulrike; Thomo, Alex; Upton, Chris (2009), "Suffix trees for very large genomic sequences", CIKM '09: Proceedings of the 18th ACM Conference on Information and Knowledge Management, New York, NY, USA: ACM.

• Farach, Martin (1997), "Optimal Suffix Tree Construction with Large Alphabets" (PDF), 38th IEEE Symposium on Foundations of Computer Science (FOCS '97), pp. 137–143.

• Farach, Martin; Muthukrishnan, S. (1996), "Optimal Logarithmic Time Randomized Suffix Tree Construction", International Colloquium on Automata, Languages and Programming.

• Farach-Colton, Martin; Ferragina, Paolo; Muthukrishnan, S. (2000), "On the sorting-complexity of suffix tree construction", Journal of the ACM, 47 (6): 987–1011, doi:10.1145/355541.355547.

• Giegerich, R.; Kurtz, S. (1997), "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction" (PDF), Algorithmica, 19 (3): 331–353, doi:10.1007/PL00009177.

• Gusfield, Dan (1999), Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, ISBN 0-521-58519-8.

• Hariharan, Ramesh (1994), "Optimal Parallel Suffix Tree Construction", ACM Symposium on Theory of Computing.

• Iliopoulos, Costas; Rytter, Wojciech (2004), "On Parallel Transformations of Suffix Arrays into Suffix Trees", 15th Australasian Workshop on Combinatorial Algorithms.

• Mansour, Essam; Allam, Amin; Skiadopoulos, Spiros; Kalnis, Panos (2011), "ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings" (PDF), PVLDB, 5 (1): 49–60, doi:10.14778/2047485.2047490.

• McCreight, Edward M. (1976), "A Space-Economical Suffix Tree Construction Algorithm", Journal of the ACM, 23 (2): 262–272, CiteSeerX 10.1.1.130.8022, doi:10.1145/321941.321946.

• Phoophakdee, Benjarath; Zaki, Mohammed J. (2007), "Genome-scale disk-based suffix tree indexing", SIGMOD '07: Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA: ACM, pp. 833–844.

• Sahinalp, Cenk; Vishkin, Uzi (1994), "Symmetry breaking for suffix tree construction", ACM Symposium on Theory of Computing.

• Smyth, William (2003), Computing Patterns in Strings, Addison-Wesley.

• Shun, Julian; Blelloch, Guy E. (2014), "A Simple Parallel Cartesian Tree Algorithm and its Application to Parallel Suffix Tree Construction", ACM Transactions on Parallel Computing.

• Tata, Sandeep; Hankins, Richard A.; Patel, Jignesh M. (2003), "Practical Suffix Tree Construction", VLDB '03: Proceedings of the 30th International Conference on Very Large Data Bases, Morgan Kaufmann, pp. 36–47.

• Ukkonen, E. (1995), "On-line construction of suffix trees" (PDF), Algorithmica, 14 (3): 249–260, doi:10.1007/BF01206331.

• Weiner, P. (1973), "Linear pattern matching algorithms" (PDF), 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11, doi:10.1109/SWAT.1973.13.

• Zamir, Oren; Etzioni, Oren (1998), "Web document clustering: a feasibility demonstration", SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA: ACM, pp. 46–54.

7.3.12 External links

• Suffix Trees by Sartaj Sahni
• NIST's Dictionary of Algorithms and Data Structures: Suffix Tree
• Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice, application of suffix trees in the BWT
• Theory and Practice of Succinct Data Structures, C++ implementation of a compressed suffix tree
• Ukkonen's Suffix Tree Implementation in C: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6

7.4 Suffix array

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full-text indices, data compression algorithms and within the field of bioinformatics.[1]

Suffix arrays were introduced by Manber & Myers (1990) as a simple, space-efficient alternative to suffix trees. They had been independently discovered by Gaston Gonnet in 1987 under the name PAT array (Gonnet, Baeza-Yates & Snider 1992).
Most suffix array construction algorithms are based on one of the following approaches:[4]

• Prefix doubling algorithms are based on a strategy of Karp, Miller & Rosenberg (1972). The idea is to find prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of the algorithm until a prefix is unique and provides the rank of the associated suffix.

• Recursive algorithms follow the approach of the suffix tree construction algorithm by Farach (1997) to recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both of these suffix arrays are then merged to compute the final suffix array.

• Induced copying algorithms are similar to recursive algorithms in the sense that they use an already sorted subset to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together by Puglisi, Smyth & Turpin (2007).

A well-known recursive algorithm for integer alphabets is the DC3/skew algorithm of Kärkkäinen & Sanders (2003). It runs in linear time and has successfully been used as the basis for parallel[7] and external-memory[8] suffix array construction algorithms.

Recent work by Salson et al. (2009) proposes an algorithm for updating the suffix array of a text that has been edited, instead of rebuilding a new suffix array from scratch. Even though the theoretical worst-case time complexity is O(n log n), it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters into the original text.

Finding the substring pattern P of length m in the string S of length n takes O(m log n) time, given that a single suffix comparison needs to compare m characters. Manber & Myers (1990) describe how this bound can be improved to O(m + log n) time using LCP information. The idea is that a pattern comparison does not need to re-compare certain characters when it is already known that these are part of the longest common prefix of the pattern and the current search interval. Abouelhoda, Kurtz & Ohlebusch (2004) improve the bound even further and achieve a search time of O(m), as known from suffix trees.

Suffix sorting algorithms can be used to compute the Burrows–Wheeler transform (BWT). The BWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other characters (e.g., $), then the order of the sorted rotations in the BWT matrix corresponds to the order of suffixes in a suffix array. The BWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the BWT string: BWT[i] = S[A[i] − 1].

Suffix arrays can also be used to look up substrings in example-based machine translation, demanding much less storage than a full phrase table as used in statistical machine translation.

Many additional applications of the suffix array require the LCP array. Some of these are detailed in the application section of the latter.

7.4.7 Notes

[1] Abouelhoda, Kurtz & Ohlebusch 2002.
[2] Abouelhoda, Kurtz & Ohlebusch 2004.
[3] Kurtz 1999.
[4] Puglisi, Smyth & Turpin 2007.
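The O(m log n) lookup and the BWT relation described above can be sketched as follows (helper names are illustrative; this is the plain binary search, not the O(m + log n) LCP-based refinement):

```python
# Sketch of the O(m log n) lookup: two binary searches over the suffix array
# find the interval of suffixes beginning with the pattern. Comparing a
# length-m prefix makes each of the O(log n) steps cost O(m).
def sa_search(text, sa, p):
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:                              # first suffix with prefix >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                              # first suffix with prefix > p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])                 # all occurrence positions

text = "banana$"                                # $ is the end-of-string sentinel
sa = sorted(range(len(text)), key=lambda i: text[i:])
assert sa_search(text, sa, "ana") == [1, 3]

# The BWT then falls out of the suffix array via BWT[i] = S[A[i] - 1]
# (Python's index -1 conveniently wraps to the last character, the sentinel).
bwt = "".join(text[i - 1] for i in sa)
assert bwt == "annb$aa"
```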
• Manber, Udi; Myers, Gene (1990). Suffix arrays: a new method for on-line string searches. First Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 319–327.

• Manber, Udi; Myers, Gene (1993). "Suffix arrays: a new method for on-line string searches". SIAM Journal on Computing. 22: 935–948. doi:10.1137/0222058.

• Gonnet, G. H.; Baeza-Yates, R. A.; Snider, T. (1992). "New indices for text: PAT trees and PAT arrays". Information Retrieval: Data Structures and Algorithms.

• Kurtz, S. (1999). "Reducing the space requirement of suffix trees". Software: Practice and Experience. 29 (13): 1149. doi:10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O.

• Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno (2002). The Enhanced Suffix Array and Its Applications to Genome Analysis. Algorithms in Bioinformatics. Lecture Notes in Computer Science. 2452. p. 449. doi:10.1007/3-540-45784-4_35. ISBN 978-3-540-44211-0.

• Puglisi, Simon J.; Smyth, W. F.; Turpin, Andrew H. (2007). "A taxonomy of suffix array construction algorithms". ACM Computing Surveys. 39 (2): 4. doi:10.1145/1242471.1242472.

• Nong, Ge; Zhang, Sen; Chan, Wai Hong (2009). Linear Suffix Array Construction by Almost Pure Induced-Sorting. 2009 Data Compression Conference. p. 193. doi:10.1109/DCC.2009.42. ISBN 978-0-7695-3592-0.

• Fischer, Johannes (2011). Inducing the LCP-Array. Algorithms and Data Structures. Lecture Notes in Computer Science. 6844. p. 374. doi:10.1007/978-3-642-22300-6_32. ISBN 978-3-642-22299-3.

• Farach, M. (1997). Optimal suffix tree construction with large alphabets. Proceedings 38th Annual Symposium on Foundations of Computer Science. p. 137. doi:10.1109/SFCS.1997.646102. ISBN 0-8186-8197-7.

• Kärkkäinen, Juha; Sanders, Peter (2003). Simple Linear Work Suffix Array Construction. Automata, Languages and Programming. Lecture Notes in Computer Science. 2719. p. 943. doi:10.1007/3-540-45061-0_73. ISBN 978-3-540-40493-4.

• Dementiev, Roman; Kärkkäinen, Juha; Mehnert, Jens; Sanders, Peter (2008). "Better external memory suffix array construction". Journal of Experimental Algorithmics. 12: 1. doi:10.1145/1227161.1402296.

• Kulla, Fabian; Sanders, Peter (2007). "Scalable parallel suffix array construction". Parallel Computing. 33 (9): 605. doi:10.1016/j.parco.2007.06.004.

7.4.9 External links

• Suffix Array in Java
• Suffix sorting module for BWT in C code
• Suffix Array Implementation in Ruby
• Suffix array library and tools
• Project containing various Suffix Array c/c++ Implementations with a unified interface
• A fast, lightweight, and robust C API library to construct the suffix array
• Suffix Array implementation in Python
• Linear Time Suffix Array implementation in C using suffix tree
• Salson, M.; Lecroq, T.; Léonard, M.; Mouchard, L. (2010). "Dynamic extended suffix arrays". Journal of Discrete Algorithms. 8 (2): 241. doi:10.1016/j.jda.2009.02.007.

7.5 Suffix automaton

[Figure: a suffix automaton, with initial state q0.]

For example, a suffix automaton for the string "suffix" can be queried for other strings; it will report "true" for any of the strings "suffix", "uffix", "ffix", "fix", "ix" and "x", and "false" for any other string.[1]

The suffix automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a prefix tree representing the strings in U.[2]

Suffix automata have applications in approximate string matching.[1]

7.5.1 See also

• GADDAG
• Suffix array

7.5.2 References

[1] Navarro, Gonzalo (2001), "A guided tour to approximate string matching" (PDF), ACM Computing Surveys, 33 (1): 31–88, doi:10.1145/375360.375365

[2] Mohri, Mehryar; Moreno, Pedro; Weinstein, Eugene (September 2009), "General suffix automaton construction algorithm and space bounds", Theoretical Computer Science, 410 (37): 3553–3562, doi:10.1016/j.tcs.2009.03.034

7.5.3 Additional reading

• Inenaga, S.; Hoshino, H.; Shinohara, A.; Takeda, M.; Arikawa, S. (2001), "On-line construction of symmetric compact directed acyclic word graphs", Proc. 8th Int. Symp. String Processing and Information Retrieval, 2001. SPIRE 2001, pp. 96–110, doi:10.1109/SPIRE.2001.989743, ISBN 0-7695-1192-9.

• Crochemore, Maxime; Vérin, Renaud (1997), "Direct construction of compact directed acyclic word graphs", Combinatorial Pattern Matching, Lecture Notes in Computer Science, Springer-Verlag, pp. 116–129, doi:10.1007/3-540-63220-4_55.

• Epifanio, Chiara; Mignosi, Filippo; Shallit, Jeffrey; Venturini, Ilaria (2004), "Sturmian graphs and a conjecture of Moser", in Calude, Cristian S.; Calude, Elena; Dineen, Michael J., Developments in language theory. Proceedings, 8th international conference (DLT 2004), Auckland, New Zealand, December 2004, Lecture Notes in Computer Science, 3340, Springer-Verlag, pp. 175–187, ISBN 3-540-24014-4, Zbl 1117.68454.

• Do, H. H.; Sung, W. K. (2011), "Compressed Directed Acyclic Word Graph with Application in Local Alignment", Computing and Combinatorics, Lecture Notes in Computer Science, 6842, Springer-Verlag, pp. 503–518, doi:10.1007/978-3-642-22685-4_44, ISBN 978-3-642-22684-7.

7.6 Van Emde Boas tree

A Van Emde Boas tree (or Van Emde Boas priority queue; Dutch pronunciation: [vɑn 'ɛmdə 'boːɑs]), also known as a vEB tree, is a tree data structure which implements an associative array with m-bit integer keys. It performs all operations in O(log m) time, or equivalently in O(log log M) time, where M = 2^m is the maximum number of elements that can be stored in the tree. M is not to be confused with the actual number of elements stored in the tree, by which the performance of other tree data structures is often measured. The vEB tree has good space efficiency when it contains a large number of elements, as discussed below. It was invented by a team led by Dutch computer scientist Peter van Emde Boas in 1975.[1]

7.6.1 Supported operations

A vEB tree supports the operations of an ordered associative array, which includes the usual associative array operations along with two more order operations, FindNext and FindPrevious:[2]

• Insert: insert a key/value pair with an m-bit key
• Delete: remove the key/value pair with a given key
• Lookup: find the value associated with a given key
• FindNext: find the key/value pair with the smallest key at least a given k
• FindPrevious: find the key/value pair with the largest key at most a given k

A vEB tree also supports the operations Minimum and Maximum, which return the minimum and maximum element stored in the tree, respectively.[3] These both run in O(1) time, since the minimum and maximum elements are stored as attributes in each tree.

7.6.2 How it works

For the sake of simplicity, let log2 m = k for some integer k. Define M = 2^m. A vEB tree T over the universe {0, ..., M−1} has a root node that stores an array T.children of length √M. T.children[i] is a pointer to a vEB tree that is responsible for the values {i√M, ..., (i+1)√M−1}. Additionally, T stores two values T.min and T.max, as well as an auxiliary vEB tree T.aux.

Data is stored in a vEB tree as follows: the smallest value currently in the tree is stored in T.min and the largest value is stored in T.max.
[Figure: An example Van Emde Boas tree with dimension 5 and the root's aux structure after 1, 2, 3, 5, 8 and 10 have been inserted.]

4. Otherwise, T.min < x < T.max, so we insert x into the subtree i responsible for x. If T.children[i] was previously empty, then we also insert i into T.aux.

5. In any of the above cases, if we delete the last element x or y from any subtree T.children[i], then we also delete i from T.aux.

In code:

function Delete(T, x)
    if T.min == T.max == x then
        T.min = M
        T.max = −1
        return
    if x == T.min then
        x = T.children[T.aux.min].min
        T.min = x
    i = floor(x / √M)
    Delete(T.children[i], x mod √M)
    if T.children[i] is empty then
        Delete(T.aux, i)
    if x == T.max then
        if T.aux is empty then
            T.max = T.min
        else
            T.max = T.children[T.aux.max].max
end

Again, the efficiency of this procedure hinges on the fact that deleting from a vEB tree that contains only one element takes only constant time. In particular, the last line of code only executes if x was the only element in T.children[i] prior to the deletion.

Discussion

The assumption that log m is an integer is unnecessary. The operations x/√M and x mod √M can be replaced by taking only the higher-order ⌈m/2⌉ and the lower-order ⌊m/2⌋ bits of x, respectively. On any existing machine, this is more efficient than division or remainder computations.

The implementation described above uses pointers and occupies a total space of O(M) = O(2^m). This can be seen as follows. The recurrence is S(M) = O(√M) + (√M + 1) · S(√M). Resolving this would lead to S(M) ∈ (1 + √M)^(log log M) + log log M · O(√M). One can, fortunately, also show that S(M) = M − 2 by induction.[4]

In practical implementations, especially on machines with shift-by-k and find-first-zero instructions, performance can be further improved by switching to a bit array once m equal to the word size (or a small multiple thereof) is reached. Since all operations on a single word are constant time, this does not affect the asymptotic performance, but it does avoid the majority of the pointer storage and several pointer dereferences, achieving a significant practical saving in time and space with this trick.

An obvious optimization of vEB trees is to discard empty subtrees. This makes vEB trees quite compact when they contain many elements, because no subtrees are created until something needs to be added to them. Initially, each element added creates about log(m) new trees containing about m/2 pointers all together. As the tree grows, more and more subtrees are reused, especially the larger ones. In a full tree of 2^m elements, only O(2^m) space is used. Moreover, unlike a binary search tree, most of this space is being used to store data: even for billions of elements, the pointers in a full vEB tree number in the thousands.

However, for small trees the overhead associated with vEB trees is enormous: on the order of √M. This is one reason why they are not popular in practice. One way of addressing this limitation is to use only a fixed number of bits per level, which results in a trie. Alternatively, each table may be replaced by a hash table, reducing the space to O(n) (where n is the number of elements stored in the data structure) at the expense of making the data structure randomized. Other structures, including y-fast tries and x-fast tries, have been proposed that have comparable update and query times and also use randomized hash tables to reduce the space to O(n) or O(n log M).

7.6.3 References

[1] Peter van Emde Boas: Preserving order in a forest in less than logarithmic time (Proceedings of the 16th Annual Symposium on Foundations of Computer Science 10: 75–84, 1975)

[2] Gudmund Skovbjerg Frandsen: Dynamic algorithms: Course notes on van Emde Boas trees (PDF) (University of Aarhus, Department of Computer Science)

[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. MIT Press, 2009. ISBN 978-0-262-53305-8. Chapter 20: The van Emde Boas tree, pp. 531–560.

[4] Rex, A. "Determining the space complexity of van Emde Boas trees". Retrieved 2011-05-27.

Further reading

• Erik Demaine, Sam Fingeret, Shravas Rao, Paul Christiano. Massachusetts Institute of Technology. 6.851: Advanced Data Structures (Spring 2012). Lecture 11 notes. March 22, 2012.

• Van Emde Boas, P.; Kaas, R.; Zijlstra, E. (1976). "Design and implementation of an efficient priority queue". Mathematical Systems Theory. 10: 99–127. doi:10.1007/BF01683268.

7.7 Fusion tree

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(log_w n) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by exploiting certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.[1]

Several advances have been made since Fredman and Willard's original 1990 paper. In 1999[2] it was shown
242 CHAPTER 7. INTEGER AND STRING SEARCHING
how to implement fusion trees under a model of com- Approximating the sketch
putation in which all of the underlying operations of the
algorithm belong to AC0 , a model of circuit complexity If the locations of the sketch bits are b1 < b2 < ··· < br,
that allows addition and bitwise Boolean operations but then the sketch of the key xw−₁···x1 x0 is the r-bit integer
disallows the multiplication operations used in the origi- xbr xbr−1 · · · xb1 .
nal fusion tree algorithm. A dynamic version of fusion With only standard word operations, such as those of the
trees using hash tables was proposed in 1996[3] which C programming language, it is difficult to directly com-
matched the original structure’s O(logw n) runtime in ex- pute the sketch of a key in constant time. Instead, the
pectation. Another dynamic version using exponential sketch bits can be packed into a range of size at most r4 ,
tree was proposed in 2007[4] which yields worst-case run- using bitwise AND and multiplication. The bitwise AND
times of O(logw n + log log u) per operation, where u is operation serves to clear all non-sketch bits from the key,
the size of the largest key. It remains open whether dy- while the multiplication shifts the sketch bits into a small
namic fusion trees can achieve O(logw n) per operation range. Like the “perfect” sketch, the approximate sketch
with high probability. preserves the order of the keys.
Some preprocessing is needed to determine the correct
7.7.1 How it works multiplication constant. Each sketch bit in location ∑ bi will
r
get shifted to bi + mi via a multiplication by m = i=1
mi
A fusion tree is essentially a B-tree with branching factor 2 . For the approximate sketch to work, the following
of w1/5 (any small exponent is also possible), which gives three properties must hold:
it a height of O(logw n). To achieve the desired runtimes
for updates and queries, the fusion tree must be able to 1. bi + mj are distinct for all pairs (i, j). This will ensure
1/5
search a node containing up to w keys in constant time. This is done by compressing (“sketching”) the keys so that all can fit into one machine word, which in turn allows comparisons to be done in parallel.

Sketching

Sketching is the method by which each w-bit key at a node containing k keys is compressed into only k − 1 bits. Each key x may be thought of as a path in the full binary tree of height w, starting at the root and ending at the leaf corresponding to x. To distinguish two paths, it suffices to look at their branching point (the first bit where the two keys differ). All k paths together have k − 1 branching points, so at most k − 1 bits are needed to distinguish any two of the k keys.

[Figure: visualization of the sketch function.]

An important property of the sketch function is that it preserves the order of the keys. That is, sketch(x) < sketch(y) for any two keys x < y.

The bit positions m_1, ..., m_r of the multiplication constant m are chosen so that:

1. The sums b_i + m_j are all distinct, for 1 ≤ i, j ≤ r. That is, the sketch bits are uncorrupted by the multiplication.

2. b_i + m_i is a strictly increasing function of i. That is, the order of the sketch bits is preserved.

3. (b_r + m_r) − (b_1 + m_1) ≤ r^4. That is, the sketch bits are packed into a range of size at most r^4.

An inductive argument shows how the m_i can be constructed. Let m_1 = w − b_1. Suppose that 1 < t ≤ r and that m_1, m_2, ..., m_{t−1} have already been chosen. Then pick the smallest integer m_t such that both properties (1) and (2) are satisfied. Property (1) requires that m_t ≠ b_i − b_j + m_l for all 1 ≤ i, j ≤ r and 1 ≤ l ≤ t − 1. Thus, there are fewer than tr^2 ≤ r^3 values that m_t must avoid. Since m_t is chosen to be minimal, (b_t + m_t) ≤ (b_{t−1} + m_{t−1}) + r^3. This implies property (3).

The approximate sketch is thus computed as follows:

1. Mask out all but the sketch bits with a bitwise AND.

2. Multiply the key by the predetermined constant m. This operation actually requires two machine words, but can still be done in constant time.

3. Mask out all but the shifted sketch bits. These are now contained in a contiguous block of at most r^4 < w^{4/5} bits.

Parallel comparison

The purpose of the compression achieved by sketching is to allow all of the keys to be stored in one w-bit word. Let the node sketch of a node be the bit string

1 sketch(x_1) 1 sketch(x_2) ... 1 sketch(x_k)
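The mask–multiply–mask computation of the approximate sketch can be illustrated in Python, with arbitrary-precision integers standing in for machine words. This is a minimal sketch, not a real fusion-tree implementation: the function names and the greedy choice of the shifts m_i (which follows the inductive argument above, starting from 0 rather than w − b_1 for simplicity) are illustrative assumptions.

```python
def choose_shifts(b):
    """Greedily pick shifts m_i for the multiplication constant so that
    all sums b_i + m_j are distinct (the multiplication generates no
    carries that could corrupt sketch bits) and b_i + m_i is strictly
    increasing (the order of the sketch bits is preserved)."""
    diffs = {bi - bj for bi in b for bj in b}  # values m_t - m_l must avoid
    m = []
    for t, bt in enumerate(b):
        # keep b_t + m_t strictly above b_{t-1} + m_{t-1}
        cand = 0 if t == 0 else max(0, m[-1] + b[t - 1] - bt + 1)
        while any(cand - ml in diffs for ml in m):
            cand += 1  # fewer than t * r**2 values ever need skipping
        m.append(cand)
    return m

def approx_sketch(x, b, m):
    """Steps 1-3: mask the sketch bits, multiply by the constant m,
    then mask (and align) the shifted sketch bits."""
    mask_in = sum(1 << bi for bi in b)
    mult = sum(1 << mi for mi in m)                       # the constant m
    mask_out = sum(1 << (bi + mi) for bi, mi in zip(b, m))
    return (((x & mask_in) * mult) & mask_out) >> (b[0] + m[0])
```

Because every product bit b_i + m_j lands in a distinct position, the multiplication cannot generate carries, and the bits kept by the final mask appear in their original order; the approximate sketch is therefore order-preserving over the keys, just like the exact sketch.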
We can assume that the sketch function uses exactly b ≤ r^4 bits. Then each block uses 1 + b ≤ w^{4/5} bits, and since k ≤ w^{1/5}, the total number of bits in the node sketch is at most w.

A brief notational aside: for a bit string s and nonnegative integer m, let s^m denote the concatenation of s to itself m times. If t is also a bit string, st denotes the concatenation of t to s.

The node sketch makes it possible to search the keys for any b-bit integer y. Let z = (0y)^k, which can be computed in constant time (multiply y by the constant (0^b 1)^k). Note that 1sketch(x_i) − 0y is always positive, but preserves its leading 1 iff sketch(x_i) ≥ y. We can thus compute the smallest index i such that sketch(x_i) ≥ y as follows:

1. Subtract z from the node sketch.

2. Take the bitwise AND of the difference and the constant (10^b)^k. This clears all but the leading bit of each block.

3. Find the most significant bit of the result.

4. Compute i, using the fact that the leading bit of the i-th block has index i(b + 1).

Desketching

For an arbitrary query q, parallel comparison computes the index i such that

sketch(x_{i−1}) ≤ sketch(q) ≤ sketch(x_i)

Unfortunately, the sketch function is not in general order-preserving outside the set of keys, so it is not necessarily the case that x_{i−1} ≤ q ≤ x_i. What is true is that, among all of the keys, either x_{i−1} or x_i has the longest common prefix with q. This is because any key y with a longer common prefix with q would also have more sketch bits in common with q, and thus sketch(y) would be closer to sketch(q) than any sketch(x_j).

The length of the longest common prefix between two w-bit integers a and b can be computed in constant time by finding the most significant bit of the bitwise XOR between a and b. This can then be used to mask out all but the longest common prefix.

Note that p identifies exactly where q branches off from the set of keys. If the next bit of q is 0, then the successor of q is contained in the p1 subtree, and if the next bit of q is 1, then the predecessor of q is contained in the p0 subtree. This suggests the following algorithm:

1. Use parallel comparison to find the index i such that sketch(x_{i−1}) ≤ sketch(q) ≤ sketch(x_i).

2. Compute the longest common prefix p of q and either x_{i−1} or x_i (taking the longer of the two).

3. Let l − 1 be the length of the longest common prefix p.

(a) If the l-th bit of q is 0, let e = p10^{w−l}. Use parallel comparison to search for the successor of sketch(e). This is the actual predecessor of q.

(b) If the l-th bit of q is 1, let e = p01^{w−l}. Use parallel comparison to search for the predecessor of sketch(e). This is the actual successor of q.

4. Once either the predecessor or successor of q is found, the exact position of q among the set of keys is determined.

An application of fusion trees to hash tables was given by Willard, who describes a data structure for hashing in which an outer-level hash table with hash chaining is combined with a fusion tree representing each hash chain. In hash chaining, in a hash table with a constant load factor, the average size of a chain is constant, but additionally with high probability all chains have size O(log n / log log n), where n is the number of hashed items. This chain size is small enough that a fusion tree can handle searches and updates within it in constant time per operation. Therefore, the time for all operations in the data structure is constant with high probability. More precisely, with this data structure, for every inverse-quasipolynomial probability p(n) = exp(−(log n)^{O(1)}), there is a constant C such that the probability that there exists an operation that exceeds time C is at most p(n).[5]

7.7.3 References

[1] Fredman, M. L.; Willard, D. E. (1990), “BLASTING Through the Information Theoretic Barrier with FUSION TREES”, Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing (STOC '90), New York, NY, USA: ACM, pp. 1–7, doi:10.1145/100216.100217, ISBN 0-89791-361-2.

[2] Andersson, Arne; Miltersen, Peter Bro; Thorup, Mikkel (1999), “Fusion trees can be implemented with AC0 instructions only”, Theoretical Computer Science, 215 (1–2): 337–344, doi:10.1016/S0304-3975(98)00172-8, MR 1678804.

[3] Raman, Rajeev (1996), “Priority queues: small, monotone and trans-dichotomous”, Fourth Annual European Symposium on Algorithms (ESA '96), Barcelona, Spain, September 25–27, 1996, Lecture Notes in Computer Science, 1136, Berlin: Springer-Verlag, pp. 121–137, doi:10.1007/3-540-61680-2_51, MR 1469229.

[4] Andersson, Arne; Thorup, Mikkel (2007), “Dynamic ordered sets with exponential search trees”, Journal of the ACM, 54 (3): A13, doi:10.1145/1236457.1236460, MR 2314255.
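The parallel comparison used throughout this section (subtract z from the node sketch, mask the leading bit of each block, then find the most significant set bit) can be sketched in Python, with an unbounded integer standing in for the w-bit word. This is an illustrative sketch under simplifying assumptions: the function name is made up, and a real fusion tree would find the most significant bit in constant time rather than via bit_length.

```python
def smallest_index_geq(sketches, y, b):
    """Return the smallest 1-based index i with sketches[i-1] >= y, where
    the sketches are sorted ascending and each fits in b bits, by comparing
    y against all of them in one arithmetic pass over the node sketch."""
    k = len(sketches)
    repeat = sum(1 << ((b + 1) * j) for j in range(k))   # the constant (0^b 1)^k
    node = 0
    for s in sketches:                                   # x_1 ends up most significant
        node = (node << (b + 1)) | (1 << b) | s          # blocks 1 sketch(x_i)
    z = y * repeat                                       # (0 y)^k, one multiplication
    masked = (node - z) & (repeat << b)                  # keep each block's leading bit
    if masked == 0:
        return k + 1                                     # every sketch is < y
    msb = masked.bit_length() - 1                        # most significant surviving bit
    return k - (msb - b) // (b + 1)
```

Each block 1sketch(x_i) keeps its leading 1 after the subtraction exactly when sketch(x_i) ≥ y, and since 1sketch(x_i) − 0y stays positive, no block borrows from its neighbor; the most significant surviving bit therefore identifies the smallest such i.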
Matt Crypto, Utcursch, Knutux, OverlordQ, Kusunose, Watcher, Karl-Henner, Talrias, Peter bertok, Quota, Eisnel, Shiftchange, Mormegil,
Jonmcauliffe, Rich Farmbrough, Antaeus Feldspar, Bender235, Chalst, Evand, PhilHibbs, Haxwell, Bobo192, Sklender, Davidgothberg,
Boredzo, Helix84, CyberSkull, Atlant, Jeltz, Mmmready, Apoc2400, InShaneee, Velella, Jopxton, ShawnVW, Kurivaim, MIT Trekkie,
Redvers, Blaxthos, Kazvorpal, Brookie, Linas, Mindmatrix, GVOLTT, LOL, TheNightFly, Drostie, Pfunk42, Graham87, Qwertyus,
Toolan, Rjwilmsi, Seraphimblade, Pabix, LjL, Ttwaring, Utuado, Nguyen Thanh Quang, FlaBot, Harmil, Gurch, Thenowhereman, Math-
rick, Intgr, M7bot, Chobot, Roboto de Ajvol, YurikBot, Wavelength, RattusMaximus, RobotE, CesarB’s unpriviledged account, Stephenb,
Pseudomonas, Andipi, Zeno of Elea, EngineerScotty, CecilWard, Mikeblas, Fender123, Bota47, Tachyon01, Ms2ger, Eurosong, Dfinkel,
Lt-wiki-bot, Ninly, Gulliveig, CharlesHBennett, StealthFox, Claygate, Snoops~enwiki, QmunkE, Emc2, Appleseed, Tobi Kellner, That
Guy, From That Show!, Jbalint, SmackBot, InverseHypercube, Bomac, KocjoBot~enwiki, BiT, Yamaguchi , Gilliam, Raghaw, Schmit-
eye, Mnbf9rca, JesseStone, Oli Filth, EncMstr, Octahedron80, Nbarth, Kmag~enwiki, Malbrain, Chlewbot, Shingra, Midnightcomm,
Lansey, Andrei Stroe, MegaHasher, Lambiam, Kuru, Alexcollins, Paulschou, RomanSpa, Chuck Simmons, KHAAAAAAAAAAN, Er-
win, Peyre, Vstarre, Pagh, MathStuf, ShakingSpirit, Iridescent, Agent X2, BrianRice, Courcelles, Owen214, Juhachi, Neelix, Mblumber,
SavantEdge, Adolphus79, Sytelus, Epbr123, Ultimus, Leedeth, Stualden, Folic Acid, AntiVandalBot, Xenophon (bot), JakeD409, Da-
vorian, Powerdesi, Dhrm77, JAnDbot, Epeefleche, Hamsterlopithecus, Kirrages, Stangaa, Steveprutz, Wikilolo, Coffee2theorems, Ma-
gioladitis, Pndfam05, Patelm, Tedickey, Nyttend, Kgfleischmann, Dappawit, Applrpn, STBot, GimliDotNet, R'n'B, Jfroelich, Francis
Tyers, Demosta, Thirdright, J.delanoy, Maurice Carbonaro, Svnsvn, Wjaguar, L337 kybldmstr, Globbet, Ontarioboy, Doug4, Meiskam,
Jrmcdaniel, VolkovBot, Sjones23, Boute, TXiKiBoT, Christofpaar, GroveGuy, A4bot, Nxavar, Noformation, Cuddlyable3, Crashthatch,
Wiae, Jediknil, Tastyllama, Skarz, LittleBenW, SieBot, WereSpielChequers, Laoris, KrizzyB, Xelgen, Flyer22 Reborn, Iamhigh, Dhb101,
BrightRoundCircle, OKBot, Svick, FusionNow, BitCrazed, ClueBot, Cab.jones, Ggia, Unbuttered Parsnip, Garyzx, Mild Bill Hiccup, शिव,
Dkf11, SamHartman, Rob Bednark, Alexbot, Erebus Morgaine, Diaa abdelmoneim, Wordsputtogether, Tonysan, Rishi.bedi, XLinkBot,
Kotha arun2005, Dthomsen8, MystBot, Karuthedam, SteveJothen, Addbot, Butterwell, TutterMouse, Dranorter, MrOllie, CarsracBot, An-
dersBot, Jeaise, Lightbot, Luckas-bot, Fraggle81, AnomieBOT, Erel Segal, Materialscientist, Citation bot, Twri, ArthurBot, Xqbot, Capri-
corn42, Matttoothman, M2millenium, Theclapp, RibotBOT, Alvin Seville, MerlLinkBot, FrescoBot, Nageh, MichealH, TruthIIPower,
Haeinous, Geoffreybernardo, Pinethicket, 10metreh, Cnwilliams, Mghgtg, Dinamik-bot, Vrenator, Keith Cascio, Phil Spectre, Jeffrd10,
Updatehelper, Kastchei, EmausBot, Timtempleton, Gfoley4, Mayazcherquoi, Timde, MarkWegman, Dewritech, Jachto, John Cline, White
Trillium, Fæ, Akerans, Paul Kube, Music Sorter, Donner60, Senator2029, Teapeat, Sven Manguard, Shi Hou, Mikhail Ryazanov, Re-
memberway, ClueBot NG, Incompetence, Neuroneutron, Monchoman45, Cntras, Widr, Mtking, Bluechimera0, HMSSolent, Wikisian,
BG19bot, JamesNZ, GarbledLecture933, Harpreet Osahan, Glacialfox, Winston Chuen-Shih Yang, ChrisGualtieri, Tech77, Jeff Erickson,
Jonahugh, Lindsaywinkler, Tmferrara, Cattycat95, Tolcso, Frogger48, Eddiearin123, Philnap, Kanterme, Laberinto15, MatthewBuch-
walder, Wkudrle, Computilizer, Mark22207, GlennLawyer, Gcarvelli, BlueFenixReborn, Some Gadget Geek, Siddharthgondhi, Surlycy-
borg, Ollie314, GSS-1987, Entranced98, KGirlTrucker81, John “Hannibal” Smith, WójcikBartosz, Hennerhubel, Lekkio and Anonymous:
497
• Perfect hash function Source: https://en.wikipedia.org/wiki/Perfect_hash_function?oldid=755582863 Contributors: Edward, Cimon
Avaro, Dcoetzee, Fredrik, Giftlite, Neilc, E David Moyer, Burschik, Bender235, LOL, Ruud Koot, JMCorey, ScottJ, Mathbot, Spl, Ce-
sarB’s unpriviledged account, Dtrebbien, Długosz, Gareth Jones, Salrizvy, Johndburger, SmackBot, Nbarth, Srchvrs, Otus, 4hodmt, Mega-
Hasher, Pagh, Mudd1, Krauss, Headbomb, Wikilolo, David Eppstein, Glrx, Cobi, Drkarger, Gajeam, PixelBot, Addbot, G121, Bbb23,
AnomieBOT, FrescoBot, Daoudamjad, John of Reading, Prvák, Maysak, Voomoo, Arka sett, BG19bot, SteveT84, Walrus068, Mcichelli,
Dexbot, Pintoch, Latin.ufmg and Anonymous: 34
• Universal hashing Source: https://en.wikipedia.org/wiki/Universal_hashing?oldid=768108046 Contributors: Mattflaschen, DavidCary,
Neilc, ArnoldReinhold, EmilJ, Pol098, Rjwilmsi, Sdornan, SeanMack, Chobot, Dmharvey, Gareth Jones, Guruparan18, Johndburger,
Twintop, CharlesHBennett, SmackBot, Cybercobra, Copysan, DanielLemire, Pagh, Dwmalone, Winxa, Jafet, Arnstein87, Marc W.
Abel, Sytelus, Francois.boutines, Headbomb, Golgofrinchian, David Eppstein, Copland Stalker, Danadocus, Cyberjoac, Ulamgamer, Ben-
der2k14, Rswarbrick, Addbot, RPHv, Yobot, Mpatrascu, Citation bot, LilHelpa, Citation bot 1, TPReal, Patmorin, RjwilmsiBot, Emaus-
Bot, Dewritech, ClueBot NG, Helpful Pixie Bot, Cleo, BG19bot, Walrus068, BattyBot, ChrisGualtieri, Zolgharnein, Jeff Erickson, Zen-
guine, Mikkel2thorup and Anonymous: 43
• K-independent hashing Source: https://en.wikipedia.org/wiki/K-independent_hashing?oldid=744752555 Contributors: Nandhp,
Rjwilmsi, CBM, David Eppstein, Iohannes Animosus, Mpatrascu, Mr Sheep Measham, BattyBot and Anonymous: 3
• Tabulation hashing Source: https://en.wikipedia.org/wiki/Tabulation_hashing?oldid=744532063 Contributors: RJFJR, DanielLemire,
Johnwbyrd, David Eppstein, Thomasda, Oranav, Thore Husfeldt, Tom.Reding, BG19bot, Cleanelephant, Eehcyl, Kbulgakov and Anony-
mous: 3
• Cryptographic hash function Source: https://en.wikipedia.org/wiki/Cryptographic_hash_function?oldid=769618249 Contributors:
Damian Yerrick, Bryan Derksen, Zundark, Arvindn, Imran, Paul Ebermann, Michael Hardy, Dan Koehl, Vacilandois, Dcljr, CesarB,
Ciphergoth, Feedmecereal, Charles Matthews, Ww, Amol kulkarni, Mrand, Taxman, Phil Boswell, Chuunen Baka, Robbot, Paranoid, As-
tronautics~enwiki, Fredrik, Lowellian, Pingveno, Aetheling, Mattflaschen, Javidjamae, Giftlite, Lunkwill, DavidCary, ShaunMacPherson,
Inkling, BenFrantzDale, Ianhowlett, Leonard G., Jorge Stolfi, Cloud200, Matt Crypto, Utcursch, CryptoDerk, Lightst, Antandrus, Tjwood,
Anirvan, Imjustmatthew, Rich Farmbrough, FT2, ArnoldReinhold, YUL89YYZ, Samboy, Mykhal, Chalst, Kyz, Sietse Snel, Schneier,
Bobo192, Myria, VBGFscJUn3, Davidgothberg, Boredzo, Quintus~enwiki, Sligocki, Ciphergoth2, Danhash, Pgimeno~enwiki, H2g2bob,
BDD, MIT Trekkie, PseudonympH, Simetrical, CygnusPius, Mindmatrix, Apokrif, Jok2000, Mandarax, Alienus, Ej, SMC, AndyKali,
Ruptor, Mathbot, Harmil, Maxal, Intgr, Fresheneesz, Wolfmankurd, Wigie, FrenchIsAwesome, CesarB’s unpriviledged account, Ted-
dyb, Gaius Cornelius, Rsrikanth05, Bachrach44, Froth, Guruparan18, Dbfirs, Ott2, Analoguedragon, Appleseed, Finell, DaishiHarada,
SmackBot, Mmernex, Tom Lougheed, Michaelfavor, Mdd4696, C4chandu, BiT, Yamaguchi , Ohnoitsjamie, Oli Filth, Nbarth, DHN-
bot~enwiki, Colonies Chris, Zsinj, Kotra, Deeb, Fuzzypeg, Lambiam, Twotwotwo, Twredfish, Brian Gunderson, Oswald Glinkmeyer, Dick-
lyon, Lee Carre, OnBeyondZebrax, Paul Foxworthy, Fils du Soleil, MoleculeUpload, Jafet, Chris55, Mellery, CmdrObot, Jesse Viviano,
Penbat, NormHardy, Cydebot, ST47, Optimist on the run, Bsmntbombdood, Bdragon, Jm3, N5iln, Strongriley, Dawnseeker2000, AntiVan-
dalBot, Nipisiquit, JAnDbot, BenjaminGittins, Instinct, Jimbobl, Coolhandscot, Gavia immer, Extropian314, VoABot II, NoDepositNoRe-
turn, Twsx, Firealwaysworks, David Eppstein, Vssun, WLU, Ratsbane, Gwern, JensAlfke, Maurice Carbonaro, Eliz81, Cpiral, Osndok,
83d40m, Robertgreer, SmallPotatoes, TreasuryTag, Sroc, TooTallSid, Oconnor663, Nxavar, Wordsmith, Jamelan, Enviroboy, Fltnsplr,
AP61, Arjun024, SieBot, Tehsha, Caltas, JuanPechiar, ArchiSchmedes, Jasonsewall, Wahrmund, Bpeps, ClueBot, JWilk, Ggia, Arakunem,
Avinava, CounterVandalismBot, Niceguyedc, DragonBot, Infomade, Cenarium, Leobold1, Erodium, Thinking Stone, DumZiBoT, Cmc-
queen1975, Pierzz, Mitch Ames, SteveJothen, Addbot, Non-dropframe, Laurinavicius, Cube444, Leszek Jańczuk, Wikipedian314, Down-
load, Maslen, Yobot, MarioS, Wurfmaul, Doctorhook, SwisterTwister, AnomieBOT, DemocraticLuntz, Materialscientist, Are you ready for
IPv6?, Xvsn, Clark89, Rabbler, Capricorn42, Oxwil, Marios.agathangelou, Sylletka, BrianWren, Daemorris, Amit 9b, Tsihonglau, Hymek,
MerlLinkBot, Maxiwheat, Bonev, FreeKnowledgeCreator, FrescoBot, Jsaenznoval, תומר א., Haeinous, Doremo, Blotowij, Jandalhandler,
RobinK, Salvidrim!, LiberatorG, קול ציון, Lotje, ATBS, Wedgefish, January, Eatnumber1, Plfernandez, Whisky drinker, Patriot8790, Trac-
erneo, RistoLaanoja, Mjd95, EmausBot, WikitanvirBot, AvicBot, ZéroBot, Quelrod, A930913, Erianna, Bomazi, ClueBot NG, Wcherowi,
MelbourneStar, Champloo11, Rezabot, Widr, Danwix, BG19bot, Lichtspiel, Garsd, ZipoBibrok5x10^8, Manoguru, RavelTwig, Luzm-
costa, David.moreno72, Darts123, Basisplan0815, JYBot, Pintoch, CuriousMind01, مونا بشيري, Connorr89, Epicgenius, Tentinator,
Jianhui67, Musko47, TAKUMI YAMAWAKI, Claw of Slime, Monkbot, Maciej Czyżewski, TimMagee, Maths314, Chouhartem, Touror-
ist, Lover amethyst, Onlinetvnet, TheExceptionCloaker, Axlesoft, ז62 and Anonymous: 305
• Set (abstract data type) Source: https://en.wikipedia.org/wiki/Set_(abstract_data_type)?oldid=757372624 Contributors: Damian Yer-
rick, William Avery, Mintguy, Patrick, Modster, TakuyaMurata, EdH, Mxn, Dcoetzee, Fredrik, Jorge Stolfi, Lvr, Urhixidur, Andreas
Kaufmann, CanisRufus, Spoon!, RJFJR, Ruud Koot, Pfunk42, Bgwhite, Roboto de Ajvol, Mlc, Cedar101, QmunkE, Incnis Mrsi, Blue-
bot, MartinPoulter, Nbarth, Gracenotes, Otus, Cybercobra, Dreadstar, Wizardman, MegaHasher, Hetar, Amniarix, CBM, Polaris408,
Peterdjones, Hosamaly, Hut 8.5, Wikilolo, Lt basketball, Gwern, Raise exception, Fylwind, Davecrosby uk, BotKung, Rhanekom, SieBot,
Oxymoron83, Casablanca2000in, Classicalecon, Linforest, Niceguyedc, UKoch, Quinntaylor, Addbot, SoSaysChappy, Loupeter, Legobot,
Luckas-bot, Denispir, Pcap, AnomieBOT, Citation bot, Twri, DSisyphBot, GrouchoBot, FrescoBot, Spindocter123, Tyamath, EmausBot,
Wikipelli, Elaz85, Mentibot, Nullzero, Helpful Pixie Bot, Poonam7393, Umasoni30, Vimalwatwani, Chmarkine, Irene31, Mark viking,
FriendlyCaribou, Brandon.heck, Aristiden7o, Bender the Bot and Anonymous: 46
• Bit array Source: https://en.wikipedia.org/wiki/Bit_array?oldid=772721016 Contributors: Awaterl, Boud, Pnm, Dcoetzee, Furrykef,
JesseW, AJim, Bovlb, Vadmium, Karol Langner, Sam Hocevar, Andreas Kaufmann, Notinasnaid, Paul August, CanisRufus, Spoon!,
R. S. Shaw, Rgrig, Forderud, Jacobolus, Bluemoose, Qwertyus, Hack-Man, StuartBrady, Intgr, RussBot, Cedar101, TomJF, JLaTondre,
Chris the speller, Bluebot, Doug Bell, Archimerged, DanielLemire, Glen Pepicelli, CRGreathouse, Gyopi, Neelix, Davnor, Kubanczyk,
Izyt, Gwern, Themania, R'n'B, Sudleyplace, TheChrisD, Cobi, Pcordes, Bvds, RomainThibaux, Psychless, Skwa, Onomou, MystBot, Ad-
dbot, IOLJeff, Tide rolls, Bluebusy, Peter Flass, AnomieBOT, Rubinbot, JnRouvignac, ZéroBot, Nomen4Omen, Cocciasik, ClueBot NG,
Snotbot, Minakshinajardhane, Chmarkine, Chip123456, BattyBot, Mogism, Thajdog10, User85734, François Robere, Carlos R Castro G,
Chadha.varun, Francisco Bajumuzi, Ushkin N and Anonymous: 54
• Bloom filter Source: https://en.wikipedia.org/wiki/Bloom_filter?oldid=772348327 Contributors: Damian Yerrick, The Anome, Edward,
Michael Hardy, Pnm, Wwwwolf, Thebramp, Charles Matthews, Dcoetzee, Doradus, Furrykef, Phil Boswell, Fredrik, Chocolateboy, Bab-
bage, Alan Liefting, Giftlite, DavidCary, ShaunMacPherson, Rchandra, Macrakis, Neilc, EvilGrin, James A. Donald, Mahemoff, Two
Bananas, Urhixidur, Andreas Kaufmann, Anders94, Subrabbit, Smyth, Agl~enwiki, CanisRufus, Susvolans, Giraffedata, Drangon, Ter-
rycojones, Mbloore, Yinotaurus, Dzhim, GiovanniS, Galaxiaad, Mindmatrix, Shreevatsa, RzR~enwiki, Tabletop, Payrard, Ryan Reich,
Pfunk42, Qwertyus, Ses4j, Rjwilmsi, Sdornan, Brighterorange, Vsriram, Quuxplusone, Chobot, Wavelength, Argav, Taejo, CesarB’s un-
priviledged account, Msikma, E123, Dtrebbien, Wirthi, Cconnett, Cedar101, HereToHelp, Rubicantoto, Sbassi, Daivox, SmackBot, Stev0,
MalafayaBot, Nbarth, Cybercobra, Xiphoris, Wikidrone, Drae, Galaad2, Jeremy Banks, Shakeelmahate, Requestion, Krauss, Farzaneh,
Hilgerdenaar, Lindsay658, Hanche, Headbomb, NavenduJain, QuiteUnusual, Marokwitz, Labongo, Bblfish, Igodard, ARSHA, Magiola-
ditis, Alexmadon, David Eppstein, STBot, Flexdream, Willpwillp, Osndok, Coolg49964, Jjldj, Hammersoft, VolkovBot, Ferzkopp, Lo-
kiClock, Trachten, Rlaufer, SieBot, Emorrissey, Sswamida, Nhahmada, Abbasgadhia, Svick, Justin W Smith, Gtoal, HowardBGolden,
Rhubbarb, Quanstro, Pointillist, Shabbychef, Bender2k14, Sun Creator, Jakouye, AndreasBWagner, Sharma337, Dsimic, SteveJothen,
Addbot, Mortense, Jerz4835, FrankAndProust, MrOllie, Lightbot, Legobot, Russianspy3, Luckas-bot, Yobot, Ptbotgourou, Amirobot,
Gharb, AnomieBOT, Materialscientist, Citation bot, Naufraghi, Tjayrush, Krj373, Osloom, X7q, Citation bot 1, Chenopodiaceous,
HRoestBot, Jonesey95, Kronos04, Trappist the monk, Chronulator, Mavam, Buddeyp, RjwilmsiBot, Liorithiel, Lesshaste, John of Read-
ing, Drafiei, GoingBatty, HiW-Bot, ZéroBot, Meng6, AManWithNoPlan, Ashish goel public, Jar354, Mikhail Ryazanov, ClueBot NG,
Gareth Griffith-Jones, Bpodgursky, Rezabot, Helpful Pixie Bot, BG19bot, DivineTraube, ErikDubbelboer, Solomon7968, Exercisephys,
Chmarkine, Williamdemeo, Akryzhn, Pintoch, Faizan, Lsmll, Everymorning, BloomFilterEditor, OriRottenstreich, Monkbot, Reddish-
mariposa, Queelius, Epournaras, InternetArchiveBot, Ushkin N, Satokoala, Bender the Bot, Aagorilla, Bruce Maggs and Anonymous:
198
• MinHash Source: https://en.wikipedia.org/wiki/MinHash?oldid=772804906 Contributors: AxelBoldt, Kku, Qwertyus, Rjwilmsi, Gareth
Jones, Johndburger, Cedar101, Ma8thew, Ebrahim, David Eppstein, SchreiberBike, XLinkBot, Yobot, Citation bot, JonDePlume, Foo-
barnix, Trappist the monk, EmausBot, Nomadz, Chire, Chirag101192, Frietjes, Leopd, RWMajeed, Xmutangzk, Linuxjava, NickGrattan,
Srednuas Lenoroc, ElizaLepine and Anonymous: 27
• Disjoint-set data structure Source: https://en.wikipedia.org/wiki/Disjoint-set_data_structure?oldid=767826900 Contributors: The
Anome, Michael Hardy, Dominus, LittleDan, Charles Matthews, Dcoetzee, Grendelkhan, Pakaran, Giftlite, Pgan002, Jonel, Deewiant,
Finog, Andreas Kaufmann, Qutezuce, SamRushing, Nyenyec, Beige Tangerine, Msh210, Bigaln2, ReyBrujo, LOL, Bkkbrad, Ruud Koot,
Qwertyus, Kasei-jin~enwiki, Rjwilmsi, Salix alba, Intgr, Fresheneesz, Wavelength, Sceptre, NawlinWiki, Spike Wilbury, Kevtrice, Spirko,
Ripper234, Cedar101, Tevildo, SmackBot, Izzynn, Oli Filth, Nikaustr, Lambiam, Archimerged, SpyMagician, IanLiu, Dr Greg, Super-
joe30, Edward Vielmetti, Gfonsecabr, Headbomb, Kenahoo, Stellmach, David Eppstein, Chkno, Glrx, Rbrewer42, Kyle the bot, Oshwah,
Jamelan, AHMartin, Oaf2, MasterAchilles, Boydski, Alksentrs, Adrianwn, Vanisheduser12a67, DumZiBoT, XLinkBot, Cldoyle, Dekart,
Addbot, Shmilymann, Lightbot, Chipchap, Tonycao, Yobot, Erel Segal, Rubinbot, Sz-iwbot, Citation bot, Fantasticfears, Backpackadam,
Williacb, HRoestBot, MathijsM, Akim Demaille, Rednas1234, EmausBot, Zhouji2010, ZéroBot, Wmayner, ChuispastonBot, Mankarse,
Nullzero, Aleskotnik, Andreschulz, FutureTrillionaire, Josef Kufner, Qunwangcs157, Andyhowlett, Faizan, Simonemainardi, William Di
Luigi, Kimi91, Sharma.illusion, Kbhat95, Kennysong, R.J.C.vanHaaften, Nbro, Ahg simon, Michaelovertolli, Shiyu Ji, TaerimKim, Zha-
haoyu, AYUSHI, Pranavr93, Refat khan pathan and Anonymous: 82
• Partition refinement Source: https://en.wikipedia.org/wiki/Partition_refinement?oldid=771055545 Contributors: Tea2min, Linas, Qw-
ertyus, Matt Cook, Chris the speller, Headbomb, David Eppstein, Watchduck, Noamz, RjwilmsiBot, Xsoameix, David N. Jansen and
Anonymous: 2
• Priority queue Source: https://en.wikipedia.org/wiki/Priority_queue?oldid=772679819 Contributors: Frecklefoot, Michael Hardy, Nix-
dorf, Bdonlan, Strebe, Dcoetzee, Sanxiyn, Robbot, Fredrik, Kowey, Bkell, Tea2min, Decrypt3, Giftlite, Zigger, Vadmium, Andreas Kauf-
mann, Byrial, BACbKA, El C, Spoon!, Bobo192, Nyenyec, Dbeardsl, Jeltz, Mbloore, Forderud, RyanGerbil10, Kenyon, Woohookitty,
Oliphaunt, Ruud Koot, Hdante, Pete142, Graham87, Qwertyus, Pdelong, Ckelloug, Vegaswikian, StuartBrady, Jeff02, Spl, Anders.Warga,
Stephenb, Gareth Jones, Lt-wiki-bot, PaulWright, SmackBot, Emeraldemon, Stux, Gilliam, Riedl, Oli Filth, Silly rabbit, Nbarth, Kostmo,
Zvar, Calbaer, Cybercobra, BlackFingolfin, A5b, Clicketyclack, Ninjagecko, Robbins, Rory O'Kane, Sabik, John Reed Riley, ShelfSkewed,
Chrisahn, Corpx, Omicronpersei8, Thijs!bot, LeeG, Mentifisto, AntiVandalBot, Wayiran, CosineKitty, Ilingod, VoABot II, David Eppstein,
Jutiphan, Umpteee, Squids and Chips, TXiKiBoT, Coder Dan, Red Act, RHaden, Rhanekom, SieBot, ThomasTenCate, EnOreg, Volkan
YAZICI, ClueBot, Niceguyedc, Thejoshwolfe, SchreiberBike, BOTarate, Krungie factor, DumZiBoT, XLinkBot, Ghettoblaster, Vield,
Jncraton, Lightbot, Legobot, Yobot, FUZxxl, Bestiasonica, AnomieBOT, 1exec1, Kimsey0, Xqbot, Redroof, Thore Husfeldt, FrescoBot,
Hobsonlane, Itusg15q4user, Arthur MILCHIOR, Orangeroof, ElNuevoEinstein, HenryAyoola, EmausBot, LastKingpin, Moswento, Arken-
flame, Meng6, GabKBel, ChuispastonBot, Highway Hitchhiker, ClueBot NG, Carandraug, Ztothefifth, Widr, FutureTrillionaire, Happyuk,
Chmarkine, J.C. Labbrev, Dexbot, Kushalbiswas777, MeekMelange, Lone boatman, Sriharsh1234, Theemathas, Dough34, Mydog333,
Luckysud4, Sammydre, Bladeshade2, Mtnorthpoplar, Kdhanas, GreenC bot, Bender the Bot and Anonymous: 153
• Bucket queue Source: https://en.wikipedia.org/wiki/Bucket_queue?oldid=766064665 Contributors: David Eppstein
• Heap (data structure) Source: https://en.wikipedia.org/wiki/Heap_(data_structure)?oldid=764292574 Contributors: Derek Ross,
LC~enwiki, Christian List, Boleslav Bobcik, DrBob, B4hand, Frecklefoot, Paddu, Jimfbleak, Notheruser, Kragen, Jll, Aragorn2, Charles
Matthews, Timwi, Dcoetzee, Dfeuer, Dysprosia, Doradus, Jogloran, Shizhao, Cannona, Robbot, Noldoaran, Fredrik, Sbisolo, Vikingstad,
Giftlite, DavidCary, Wolfkeeper, Mellum, Tristanreid, Pgan002, Beland, Two Bananas, Pinguin.tk~enwiki, Andreas Kaufmann, Abdull,
Oskar Sigvardsson, Wiesmann, Yuval madar, Qutezuce, Tristan Schmelcher, Ascánder, Mwm126, Iron Wallaby, Spoon!, Mdd, Musiphil,
Guy Harris, Sligocki, Suruena, Derbeth, Wsloand, Oleg Alexandrov, Mahanga, Mindmatrix, LOL, Prophile, Daira Hopwood, Ruud Koot,
Apokrif, Tom W.M., Graham87, Qwertyus, Drpaule, Psyphen, Mathbot, Quuxplusone, Krun, Fresheneesz, Chobot, YurikBot, Wave-
length, RobotE, Vecter, NawlinWiki, DarkPhoenix, B7j0c, Moe Epsilon, Mlouns, LeoNerd, Bota47, Schellhammer, Lt-wiki-bot, Abu
adam~enwiki, Ketil3, HereToHelp, Daivox, SmackBot, Reedy, Tgdwyer, Eskimbot, Took, Thumperward, Oli Filth, Silly rabbit, Nbarth,
Ilyathemuromets, Jmnbatista, Cybercobra, Mlpkr, Prasi90, Itmozart, Atkinson 291, Ninjagecko, SabbeRubbish, Loadmaster, Hiiiiiiiiiiiii-
iiiiiiii, Jurohi, Jafet, Ahy1, Eric Le Bigot, Flamholz, Cydebot, Max sang, Christian75, Grubbiv, Thijs!bot, OverLeg, Ablonus, Anka.213,
BMB, Plaga701, Jirka6, Magioladitis, 28421u2232nfenfcenc, David Eppstein, Inhumandecency, Kibiru, Bradgib, Andre.holzner, Jfroelich,
Theo Mark, Cobi, STBotD, Cool 1 love, VolkovBot, JhsBot, Wingedsubmariner, Billinghurst, Rhanekom, Quietbritishjim, SieBot, Ham
Pastrami, Flyer22 Reborn, Svick, Jonlandrum, Ken123BOT, AncientPC, VanishedUser sdu9aya9fs787sads, ClueBot, Garyzx, Uncle Milty,
Bender2k14, Kukolar, Xcez-be, Addbot, Psyced, Nate Wessel, Chzz, Jasper Deng, Numbo3-bot, Konryd, Chipchap, Bluebusy, Luckas-bot,
Timeroot, KamikazeBot, DavidHarkness, AnomieBOT, Alwar.sumit, Jim1138, Burakov, ArthurBot, DannyAsher, Xqbot, Control.valve,
GrouchoBot, Лев Дубовой, Mcmlxxxi, Kxx, C7protal, Mark Renier, Wikitamino, Sae1962, Gruntler, AaronEmi, ImPerfection, Patmorin,
CobraBot, Akim Demaille, Stryder29, RjwilmsiBot, EmausBot, John of Reading, Tuankiet65, WikitanvirBot, Sergio91pt, Hari6389, Maxi-
antor, Kirelagin, Ermishin, Jaseemabid, Chris857, ClueBot NG, Manizzle23, Incompetence, Softsundude, Joel B. Lewis, Samuel Marks,
Mediator Scientiae, BG19bot, Racerdogjack, Chmarkine, Hadi Payami, PatheticCopyEditor, Hupili, ChrisGualtieri, Rarkenin, Frosty,
DJB3.14, Clevera, FenixFeather, P.t-the.g, Theemathas, Sunny1304, Tim.sebring, Ginsuloft, Azx0987, Chaticramos, KCAuXy4p, Evo-
hunz, Nbro, Sequoia 42, Ougarcia, Danmejia1, CLCStudent, Deacon Vorbis, Hellotherespellbound, Maldosari and Anonymous: 204
• Binary heap Source: https://en.wikipedia.org/wiki/Binary_heap?oldid=768914237 Contributors: Derek Ross, Taw, Shd~enwiki, B4hand,
Pit~enwiki, Nixdorf, Snoyes, Notheruser, Kragen, Kyokpae~enwiki, Dcoetzee, Dfeuer, Dysprosia, Kbk, Espertus, Fredrik, Altenmann,
DHN, Vikingstad, Tea2min, DavidCary, Laurens~enwiki, Levin, Alexf, Bryanlharris, Sam Hocevar, Andreas Kaufmann, Rich Farm-
brough, Sladen, Hydrox, Antaeus Feldspar, CanisRufus, Iron Wallaby, Liao, Wsloand, Bsdlogical, Kenyon, Oleg Alexandrov, Mahanga,
LOL, Ruud Koot, Qwertyus, Pdelong, Brighterorange, Drpaule, Platyk, VKokielov, Fresheneesz, Mdouze, Tofergregg, CiaPan, Daev,
MonoNexo, Htonl, Schellhammer, HereToHelp, Ilmari Karonen, DomQ, Theone256, Oli Filth, Nbarth, Matt77, Cybercobra, Djcmackay,
Danielcer, Ohconfucius, Doug Bell, J Crow, Catphive, Dicklyon, Inquisitus, Hu12, Velle~enwiki, Cydebot, Codetiger, Headbomb, Win-
Bot, Kba, Alfchung~enwiki, JAnDbot, MSBOT, R27182818, Magioladitis, Seshu pv, Jessicapierce, Japo, David Eppstein, Scott tucker,
Pgn674, Applegrew, Foober, Phishman3579, Funandtrvl, Rozmichelle, Vektor330, Tdeoras, Nuttycoconut, Lourakis, Ctxppc, Cpflames,
Anchor Link Bot, ClueBot, Miquelmartin, Jaded-view, Kukolar, Amossin, XLinkBot, Addbot, Bluebusy, Luckas-bot, Yobot, Amirobot,
Davidshen84, AnomieBOT, DemocraticLuntz, Jim1138, Baliame, Xqbot, Surturpain, Smk65536, GrouchoBot, Speakus, Okras, Fres-
coBot, Tom.Reding, Trappist the monk, Indy256, Patmorin, Duoduoduo, Loftpo, Tim-J.Swan, Superlaza, JosephCatrambone, Racerx11,
Dcirovic, Chris857, EdoBot, Dakaminski, Rezabot, Ciro.santilli, O12, Helpful Pixie Bot, BG19bot, Crocodilesareforwimps, Chmarkine,
MiquelMartin, IgushevEdward, Harsh 2580, Drjackstraw, 22990atinesh, Msproul, Billyisyoung, Lilalas, Cbcomp, Aswincweety, Erro-
hitagg, Nbro, Missingdays, Stevenxiaoxiong, Dilettantest, Wattitude, PhilipWelch and Anonymous: 175
• D-ary heap Source: https://en.wikipedia.org/wiki/D-ary_heap?oldid=752476068 Contributors: Derek Ross, Greenrd, Phil Boswell, Rich
Farmbrough, Qwertyus, Fresheneesz, SmackBot, Shalom Yechiel, Cydebot, Alaibot, David Eppstein, Skier Dude, Slemm, M2Ys4U, LeaW,
Addbot, DOI bot, Yobot, Miyagawa, Citation bot 1, JanniePieters, DrilBot, Dude1818, RjwilmsiBot, ChuispastonBot, Helpful Pixie Bot,
Fragapanagos, Angelababy00, Deacon Vorbis and Anonymous: 19
• Binomial heap Source: https://en.wikipedia.org/wiki/Binomial_heap?oldid=759725052 Contributors: Michael Hardy, Poor Yorick,
Dcoetzee, Dysprosia, Doradus, Maximus Rex, Cdang, Fredrik, Brona, MarkSweep, TonyW, Creidieki, Klemen Kocjancic, Martin TB,
Lemontea, Bo Lindbergh, Karlheg, Arthena, Wsloand, Oleg Alexandrov, LOL, Qwertyus, NeonMerlin, Fragglet, Fresheneesz, CiaPan,
YurikBot, Hairy Dude, Vecter, Googl, SmackBot, Theone256, Peterwhy, Yuide, Nviladkar, Stebulus, Cydebot, Marqueed, Thijs!bot,
Magioladitis, Matt.smart, Gwern, Funandtrvl, VolkovBot, Wingedsubmariner, Biscuittin, YonaBot, Volkan YAZICI, OOo.Rax, Alexbot,
Npansare, Addbot, Alquantor, Alex.mccarthy, Download, Sapeur, LinkFA-Bot, ماني, Aham1234, Materialscientist, Vmanor, DARTH
SIDIOUS 2, Josve05a, Templatetypedef, ClueBot NG, BG19bot, Dexbot, Mark L MacDonald, Boza s6, Oleksandr Shturmov and Anony-
mous: 65
• Fibonacci heap Source: https://en.wikipedia.org/wiki/Fibonacci_heap?oldid=771211390 Contributors: Michael Hardy, Zeno Gantner,
Poor Yorick, Charles Matthews, Dcoetzee, Dysprosia, Wik, Hao2lian, Phil Boswell, Fredrik, Eliashedberg, P0nc, Brona, Creidieki,
Qutezuce, Bender235, Aquiel~enwiki, Mkorpela, Wsloand, Oleg Alexandrov, Japanese Searobin, LOL, Ruud Koot, Rjwilmsi, Ravik, Fresh-
eneesz, Antiuser, YurikBot, SmackBot, Arkitus, Droll, MrBananaGrabber, Ninjagecko, Jrouquie, Hiiiiiiiiiiiiiiiiiiiii, Vanisaac, Myasuda,
AnnedeKoning, Cydebot, Gimmetrow, Headbomb, DekuDekuplex, Jirka6, JAnDbot, David Eppstein, The Real Marauder, DerHexer, An-
dre.holzner, Adam Zivner, Yecril, Funandtrvl, Aaron Rotenberg, Wingedsubmariner, Wbrenna36, Crashie, Bporopat, Arjun024, Thw1309,
ClueBot, Gene91, Mild Bill Hiccup, Nanobear~enwiki, RobinMessage, Peatar, Kaba3, Safenner1, Addbot, LatitudeBot, Mdk wiki~enwiki,
Luckas-bot, Yobot, Vonehrenheim, AnomieBOT, Erel Segal, Citation bot, Miym, Kxx, Novamo, Arthur MILCHIOR, MorganGreen,
Pinethicket, Lars Washington, Ereiniona, EmausBot, Coliso, Wikipelli, Trimutius, Lexusuns, Templatetypedef, ClueBot NG, Softsun-
dude, O.Koslowski, BG19bot, PatheticCopyEditor, ChrisGualtieri, Martin.carames, Dexbot, Jochen Burghardt, Faizan, Alexwho314,
Theemathas, Nvmbs, Oleksandr Shturmov, Mtnorthpoplar, Aayushdhir and Anonymous: 110
• Pairing heap Source: https://en.wikipedia.org/wiki/Pairing_heap?oldid=772018183 Contributors: Phil Boswell, Pgan002, Wsloand, Ruud
Koot, Qwertyus, Quale, Drdisque, Cedar101, Sneftel, Tgdwyer, Bluebot, SAMJAM, Jrouquie, Cydebot, Alaibot, Magioladitis, David
Eppstein, Wingedsubmariner, Celique, Geoffrey.foster, Yobot, Gilo1969, Kxx, Citation bot 1, Breaddawson, Hoofinasia, Dexbot, Pintoch,
Jeff Erickson, CV9933 and Anonymous: 14
• Double-ended priority queue Source: https://en.wikipedia.org/wiki/Double-ended_priority_queue?oldid=762669115 Contributors:
Dremora, Ruud Koot, Qwertyus, Quuxplusone, Wavelength, Sneftel, Racklever, Henning Makholm, PamD, David Eppstein, Julianhyde,
AvicAWB, Templatetypedef, Shire Reeve, 0milch0, BG19bot, Ramesh Ramaiah, Vibhave, BPositive, Mark Arsten, Loriendrew, Concep-
tualizing and Anonymous: 6
• Soft heap Source: https://en.wikipedia.org/wiki/Soft_heap?oldid=766017531 Contributors: Denny, Dcoetzee, Doradus, Fredrik, Just An-
other Dan, Pgan002, Wsloand, Ruud Koot, Agthorr, Eubot, Boticario, Bondegezou, SmackBot, Bluebot, Cydebot, Alaibot, Headbomb,
Cobi, AHMartin, Bender2k14, Addbot, LilHelpa, Ita140188, Agentex, FrescoBot, Lunae and Anonymous: 13
• Binary search algorithm Source: https://en.wikipedia.org/wiki/Binary_search_algorithm?oldid=771938724 Contributors: Peter
Winnberg, Taw, Dze27, Ed Poor, LA2, M~enwiki, Hannes Hirzel, Edward, Patrick, Robert Dober, Nixdorf, Pnm, Zeno Gantner, Takuya-
Murata, Loisel, Stan Shebs, EdH, Mxn, Hashar, Charles Matthews, Dcoetzee, Fuzheado, SirJective, McKay, Pakaran, Phil Boswell,
Fredrik, Altenmann, Tea2min, Giftlite, The Cave Troll, BenFrantzDale, Mboverload, Macrakis, Pne, DevilsAdvocate, Beland, Over-
lordQ, Maximaximax, Two Bananas, Pm215, Ukexpat, Sleepyrobot, Ericamick, Bfjf, Harriv, Shlomif, ESkog, Plugwash, El C, Diomidis
Spinellis, EmilJ, Baruneju, Spoon!, BrokenSegue, Photonique, Musiphil, Alansohn, Liao, Caesura, Andrewmu, Mr flea, Gpvos, HenryLi,
Forderud, Ericl234, Nuno Tavares, Pol098, Tabletop, Palica, Gerbrant, Ryajinor, Arjarj, Zzedar, GrundyCamellia, Coemgenus, Scandum,
Quale, XP1, Ligulem, R.e.b., FlaBot, Quuxplusone, Sioux.cz, CiaPan, Chobot, DVdm, Drtom, The Rambling Man, YurikBot, Wave-
length, Stephenb, Ewx, Hv, ColdFusion650, Kcrca, Black Falcon, Googl, Nikkimaria, Zachwlewis, Cedar101, Htmnssn, Messy Thinking,
SigmaEpsilon, JLaTondre, Fsiler, SmackBot, NickyMcLean, WilliamThweatt, TestPilot, KocjoBot~enwiki, Ieopo, BiT, Gene Thomas,
Amux, J4 james, Iain.dalton, Oli Filth, Jonny Diamond, TripleF, Oylenshpeegul, Sephiroth BCR, Mlpkr, Agcala~enwiki, Doug Bell,
Breno, Beetstra, Wstomv, Mr Stephen, David Souther, TwistOfCain, Lavaka, Devourer09, Fabian Steeg~enwiki, David Cooke, Svivian,
Ironmagma, Mike Christie, Solidpoint, Verdy p, Boemanneke, Ardnew, FrancoGG, Tmdean, Heineman, AntiVandalBot, Kylemcinnes,
Seaphoto, Donbraffitt, Kdakin, JAnDbot, FactoidCow, SiobhanHansa, Magioladitis, Soulbot, Chutzpan, Allstarecho, David Eppstein, Tod-
dcs, Gwern, MartinBot, Glrx, Userabc, Trusilver, Fylwind, Dodno, WhiteOak2006, Izno, SoCalSuperEagle, Mariolj, Oshwah, Vipinhari,
Kinkydarkbird, Merritt.alex, Swanyboy2, Don4of4, CanOfWorms, Dirkbb, Meters, Df747jet, Brianga, ICrann15, Scarian, Comp123,
Jan Winnicki, Psherm85, Jerryobject, Flyer22 Reborn, Joshgilkerson, Lourakis, Macy, Dillard421, Svick, Hariva, Rdhettinger, Vanishe-
dUser sdu9aya9fs787sads, ClueBot, Justin W Smith, Syhon, Garyzx, Arunsingh16, Tim32, JeffDonner, Dasboe, Predator106, Hasanadnan-
taha, Hkleinnl, Neuralwarp, XLinkBot, Muffincorp, Mitch Ames, Bob1312, Briandamgaard, NjardarBot, Balabiot, Legobot, Luckas-bot,
Yobot, MarioS, AnomieBOT, Andrewrp, 1exec1, Jim1138, Mangarah, Gankro, Materialscientist, Citation bot, Taeshadow, Lacis alfredo,
Melmann, SPTWriter, Jeffrey Mall, Mononomic, Pmlineditor, Shirik, Harry0xBd, WithWhich, FrescoBot, CarminPolitano, Ninaddb, At-
lantia, Biker Biker, BigDwiki, AANaimi, Nnarasimhakaushik, Aperisic, MoreNet, Jfmantis, RjwilmsiBot, JustAHappyCamper, EmausBot,
Msswp, Robrohan, Wikipelli, Dcirovic, John Cline, Checkingfax, ChaosCon, Midas02, Staszek Lem, DOwenWilliams, L Kensington, Bill
william compton, Ranching, Peter Karlsen, Mark Martinec, TYelliot, 28bot, Rocketrod1960, Haigee2007, ClueBot NG, MelbourneStar,
Gilderien, Imjooseo, Widr, Nullzero, Sangameshh, Jk2q3jrklse, Helpful Pixie Bot, Curb Chain, Wbm1058, BG19bot, Streaver91, Su-
peramin, Rodion Gork, Lambin~enwiki, Rynishere, Chmarkine, Njanani, BattyBot, Nithin.A.P, Timothy Gu, ChrisGualtieri, Daiyuda,
Wullschj, Aj8uppal, Codethinkers, Pintoch, Lugia2453, AlwaysAngry, Jamesx12345, Nero hu, NC4PK, Mark viking, I am One of Many,
Tentinator, IRockStone, DavidLeighEllis, Pappu0007, Bloghog23, Alex.koturanov, Rulnick, Benjohnbarnes, Peturb, Dalton Quinn, Kjer-
ish, KH-1, Esquivalience, CruiserAbhi, PJ Cabral, JJMC89, BiomolecularGraphics4All, Atlantic306, JindalApoorv, Sunflower42, Ro-
hit0303, ThePlatypusofDoom, Chrissymad, Divyanshj.16, Zaffy806, Fresal, SingSighSep and Anonymous: 432
• Binary search tree Source: https://en.wikipedia.org/wiki/Binary_search_tree?oldid=772628414 Contributors: Damian Yerrick, Bryan
Derksen, Taw, Mrwojo, Spiff~enwiki, PhilipMW, Michael Hardy, Chris-martin, Nixdorf, Ixfd64, Minesweeper, Darkwind, LittleDan,
Glenn, BAxelrod, Timwi, MatrixFrog, Dcoetzee, Havardk, Dysprosia, Doradus, Maximus Rex, Phil Boswell, Fredrik, Postdlf, Bkell,
Hadal, Tea2min, Enochlau, Awu, Giftlite, DavidCary, P0nc, Ezhiki, Maximaximax, Qleem, Karl-Henner, Qiq~enwiki, Shen, An-
dreas Kaufmann, Jin~enwiki, Grunt, Kate, Oskar Sigvardsson, D6, Ilana, Kulp, ZeroOne, Damotclese, Vdm, Func, LeonardoGre-
gianin, Runner1928, Nicolasbock, HasharBot~enwiki, Alansohn, Liao, RoySmith, Rudo.Thomas, Pion, Wtmitchell, Evil Monkey,
4c27f8e656bb34703d936fc59ede9a, Oleg Alexandrov, Mindmatrix, LOL, Oliphaunt, Ruud Koot, Trevor Andersen, GregorB, Mb1000,
MrSomeone, Qwertyus, Nneonneo, Hathawayc, VKokielov, Ecb29, Mathbot, BananaLanguage, DevastatorIIC, Quuxplusone, Sketch-The-
Fox, Butros, Banaticus, Roboto de Ajvol, YurikBot, Wavelength, Personman, Michael Slone, Hyad, Taejo, Gaius Cornelius, Oni Lukos,
TheMandarin, Salrizvy, Moe Epsilon, BOT-Superzerocool, Googl, Regnaron~enwiki, Abu adam~enwiki, Chery, Cedar101, Jogers, Leonar-
doRob0t, Richardj311, WikiWizard, SmackBot, Bernard François, Gilliam, Ohnoitsjamie, Theone256, Oli Filth, Neurodivergent, DHN-
bot~enwiki, Alexsh, Garoth, Mweber~enwiki, Allan McInnes, Calbaer, NitishP, Cybercobra, Underbar dk, Hcethatsme, MegaHasher,
Breno, Nux, Tachyon77, Beetstra, Dicklyon, Hu12, Vocaro, Konnetikut, JForget, James pic, CRGreathouse, Ahy1, WeggeBot, Mikeput-
nam, TrainUnderwater, Jdm64, AntiVandalBot, Jirka6, Lanov, Huttarl, Eapache, JAnDbot, Anoopjohnson, Magioladitis, Abednigo, All-
starecho, Tomt22, Gwern, S3000, MartinBot, Anaxial, Leyo, Mike.lifeguard, Phishman3579, Skier Dude, Joshua Issac, Mgius, Kewlito,
Danadocus, Vectorpaladin13, Labalius, BotKung, One half 3544, Spadgos, MclareN212, Nerdgerl, Rdemar, Davekaminski, Rhanekom,
SieBot, YonaBot, Xpavlic4, Casted, VVVBot, Ham Pastrami, Jerryobject, Flyer22 Reborn, Swapsy, Djcollom, Svick, Anchor Link Bot,
GRHooked, Loren.wilton, Xevior, ClueBot, ChandlerMapBot, Madhan virgo, Theta4, Splttingatms, Shailen.sobhee, AgentSnoop, Onomou,
XLinkBot, WikHead, Metalmax, MrOllie, Jdurham6, Nate Wessel, LinkFA-Bot, ماني, Matekm, Legobot, Luckas-bot, Yobot, Dimchord,
AnomieBOT, The Parting Glass, Burakov, Ivan Kuckir, Tbvdm, LilHelpa, Shashi20008, Capricorn42, SPTWriter, Doctordiehard, Wtar-
reau, Shmomuffin, Dzikasosna, Smallman12q, Kurapix, Adamuu, FrescoBot, 4get, Citation bot 1, Golle95, Aniskhan001, Frankrod44,
Cochito~enwiki, MastiBot, Thesevenseas, Sss41, Vromascanu, Shuri org, Rolpa, Jayaneethatj, Avermapub, MladenWiki, Konstantin Pest,
Akim Demaille, Cyc115, WillNess, Nils schmidt hamburg, RjwilmsiBot, Ripchip Bot, X1024, Chibby0ne, Albmedina, Your Lord and Mas-
ter, Nomen4Omen, Meng6, Wmayner, Tolly4bolly, Snehalshekatkar, Dan Wang, ClueBot NG, SteveAyre, Jms49, Frietjes, Ontariolot, Sol-
san88, Nakarumaka, BG19bot, AlanSherlock, Rafikamal, BPositive, RJK1984, Phc1, WhiteNebula, IgushevEdward, Hdanak, JingguoYao,
Yaderbh, RachulAdmas, TwoTwoHello, Frosty, Josell2, SchumacherTechnologies, Farazbhinder, Wulfskin, Embanner, Mtahmed, Jihlim,
Kaidul, Cybdestroyer, Jabanabba, Gokayhuz, Mathgorges, Jianhui67, Paul2520, Super fish2, Ryuunoshounen, Dk1027, Azx0987, KH-1,
Tshubham, HarshalVTripathi, ChaseKR, Nbro, Filip Euler, Koolnik90, K-evariste, Enzoferber, Selecsosi, DaBrown95, Jonnypurgatory,
Peterc26, Mezhaka, SimoneBrigante, Cwowo, HarshKhatore and Anonymous: 379
• Random binary tree Source: https://en.wikipedia.org/wiki/Random_binary_tree?oldid=753780752 Contributors: Michael Hardy, Cyber-
cobra, David Eppstein, Cobi, Addbot, Cardel, Gilo1969, Citation bot 1, Patmorin, RjwilmsiBot, Helpful Pixie Bot, Marcocapelle, Dsp de
and Anonymous: 7
• Tree rotation Source: https://en.wikipedia.org/wiki/Tree_rotation?oldid=750774864 Contributors: Mav, BlckKnght, B4hand, Michael
Hardy, Kragen, Dcoetzee, Dysprosia, Altenmann, Michael Devore, Leonard G., Neilc, Andreas Kaufmann, Mr Bound, Chub~enwiki,
BRW, Oleg Alexandrov, Joriki, Graham87, Qwertyus, Wizzar, Pako, Mathbot, Peterl, Abarry, Trainra, Cedar101, SmackBot, DHN-
bot~enwiki, Ramasamy, Kjkjava, Hyperionred, Thijs!bot, Headbomb, Waylonflinn, Swpb, David Eppstein, Gwern, Vegasprof, STBotD,
Skaraoke, Mtanti, SCriBu, Castorvx, Salvar, SieBot, Woblosch, Svick, Xevior, Boykobb, LaaknorBot, ماني, Legobot, Mangarah, Lil-
Helpa, GrouchoBot, Adamuu, Citation bot 1, Britannic124, Nomen4Omen, Alexey.kudinkin, ClueBot NG, Knowledgeofthekrell, Josell2,
Explorer512, Tar-Elessar, Javier Borrego Fernandez C-512, Fmadd, AdamBignell and Anonymous: 40
• Self-balancing binary search tree Source: https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree?oldid=751986508 Contrib-
utors: Michael Hardy, Angela, Dcoetzee, Dysprosia, DJ Clayworth, Noldoaran, Fredrik, Diberri, Enochlau, Wolfkeeper, Jorge Stolfi, Neilc,
Pgan002, Jacob grace, Andreas Kaufmann, Shlomif, Baluba, Mdd, Alansohn, Jeltz, ABCD, Kdau, RJFJR, Japanese Searobin, Jacobolus,
Chochopk, Qwertyus, Moskvax, Intgr, YurikBot, Light current, Plyd, Daivox, MrDrBob, Cybercobra, Jon Awbrey, Ripe, Momet, Jafet,
CRGreathouse, Cydebot, Widefox, David Eppstein, Funandtrvl, VolkovBot, Sriganeshs, Lamro, Jruderman, Plastikspork, SteveJothen,
Addbot, Bluebusy, Yobot, Larrycz, Xqbot, Drilnoth, Steaphan Greene, FrescoBot, DrilBot, ActuallyRationalThinker, EmausBot, RA0808,
Larkinzhang1993, Azuris, ClueBot NG, Andreas4965, Solomon7968, Wolfgang42, Pintoch, Josell2, Jochen Burghardt, G PViB, Ollie314
and Anonymous: 51
• Treap Source: https://en.wikipedia.org/wiki/Treap?oldid=751556289 Contributors: Edward, Poor Yorick, Jogloran, Itai, Jleedev, Eequor,
Andreas Kaufmann, Qef, Milkmandan, Saccade, Wsloand, Oleg Alexandrov, Jörg Knappen~enwiki, Ruud Koot, Hdante, Behdad, Qw-
ertyus, Arbor, Gustavb, Regnaron~enwiki, James.nvc, SmackBot, KnowledgeOfSelf, Chris the speller, Cybercobra, MegaHasher, Pfh, J.
Finkelstein, Yzt, Jsaxton86, Cydebot, Blaisorblade, Escarbot, RainbowCrane, David Eppstein, AHMartin, Bajsejohannes, Justin W Smith,
Kukolar, Hans Adler, Addbot, Luckas-bot, Yobot, Erel Segal, Rubinbot, Citation bot, Bencmq, Gilo1969, Miym, Brutaldeluxe, Cshinyee,
C02134, ICEAGE, MaxDel, Patmorin, Cdb273, MoreNet, Allforrous, ChuispastonBot, BG19bot, Chmarkine, Naxik, Lsmll and Anony-
mous: 30
• AVL tree Source: https://en.wikipedia.org/wiki/AVL_tree?oldid=772933114 Contributors: Damian Yerrick, BlckKnght, M~enwiki, Ede-
maine, FvdP, Infrogmation, Michael Hardy, Nixdorf, Minesweeper, Jll, Poor Yorick, Dcoetzee, Dysprosia, Doradus, Greenrd, Topbanana,
Noldoaran, Fredrik, Altenmann, Merovingian, Tea2min, Andrew Weintraub, Mckaysalisbury, Neilc, Pgan002, Tsemii, Andreas Kauf-
mann, Safety Cap, Mike Rosoft, Guanabot, Byrial, Pavel Vozenilek, Shlomif, Lankiveil, Rockslave, Smalljim, Geek84, Axe-Lander,
Darangho, Kjkolb, Larryv, Obradovic Goran, HasharBot~enwiki, Orimosenzon, Kdau, Docboat, Evil Monkey, Tphyahoo, RJFJR, Kenyon,
Oleg Alexandrov, LOL, Ruud Koot, Gruu, Seyen, Graham87, Qwertyus, ErikHaugen, Toby Douglass, Mikm, Alex Kapranoff, Jeff02,
Gurch, Intgr, Chobot, YurikBot, Gaius Cornelius, NawlinWiki, Astral, Dtrebbien, Kain2396, Bkil, Pnorcks, Blackllotus, Bota47, Lt-wiki-
bot, Arthur Rubin, Cedar101, KGasso, Gulliveig, Danielx, LeonardoRob0t, Paul D. Anderson, SmackBot, Apanag, InverseHypercube,
David.Mestel, KocjoBot~enwiki, Gilliam, Tsoft, DHN-bot~enwiki, ChrisMP1, Tamfang, Cybercobra, Flyingspuds, Epachamo, Philvarner,
Dcamp314, Kuru, Euchiasmus, Michael miceli, Caviare, Babbling.Brook, Dicklyon, Yksyksyks, Momet, Nysin, Jac16888, Daewoollama,
Cyhawk, ST47, Zian, Joeyadams, Msanchez1978, Eleuther, AntiVandalBot, Ste4k, Jirka6, Gökhan, JAnDbot, Leuko, Magioladitis, Anant
sogani, Avicennasis, David Eppstein, Nguyễn Hữu Dung, MartinBot, J.delanoy, Pedrito, Phishman3579, Jeepday, Michael M Clarke, Un-
washedMeme, Binnacle, Adamd1008, DorganBot, Hwbehrens, Funandtrvl, BenBac, VolkovBot, Indubitably, Mtanti, Castorvx, AlexGreat,
Uw.Antony, Enviroboy, Srivesh, SieBot, Aent, Vektor330, Flyer22 Reborn, Hello71, Svick, Mauritsmaartendejong, Denisarona, Xevior,
ClueBot, Nnemo, CounterVandalismBot, Auntof6, Kukolar, Ksulli10, Moberg, Mellerho, XLinkBot, Gnowor, Njvinod, Resper~enwiki,
DOI bot, Dawynn, Ommiy-Pangaeus~enwiki, Leszek Jańczuk, Mr.Berna, West.andrew.g, Tide rolls, Matěj Grabovský, Bluebusy, MattyIX,
Legobot, Luckas-bot, Yobot, Fx4m, II MusLiM HyBRiD II, Agrawalyogesh, AnomieBOT, Jim1138, Royote, Kingpin13, Materialscien-
tist, Xqbot, Drilnoth, Oliversisson, VladimirReshetnikov, Greg Tyler, Shmomuffin, Adamuu, Mjkoo, FrescoBot, MarkHeily, Moham-
mad ahad, Ichimonji10, Maggyero, DrilBot, Sebculture, RedBot, Trappist the monk, MladenWiki, EmausBot, Benoit fraikin, Mzruya,
Iamnitin, AvicBot, Vlad.c.manea, Nomen4Omen, Geoff55, Mnogo, Chire, Compusense, ClueBot NG, MelbourneStar, Bulldog73, Mac-
donjo, G0gogcsc300, Codingrecipes, Helpful Pixie Bot, Titodutta, BG19bot, Northamerica1000, Solomon7968, Ravitkhurana, Crh23,
Proxyma, DmitriyVilkov, Zhaofeng Li, ChrisGualtieri, Eta Aquariids, Dexbot, Kushalbiswas777, CostinulAT, Akerbos, Josell2, Jochen
Burghardt, G PViB, Elfbw, Ppkhoa, Yelnatz, Dough34, Hibbarnt, Jasonchan1994, Jpopesculian, Skr15081997, Devsathish, Aviggiano,
Eeb379, Monkbot, Teetooan, HexTree, Henryy321, Badidipedia, Dankocevski, Esquivalience, StudentOfStones, Jmonty42, NNcNannara,
Ankitagrawalvit, Mhush12, Nirbhay c, NathanBierema, Saqibwahid and Anonymous: 359
• Red–black tree Source: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree?oldid=768957805 Contributors: Dreamyshade, Jz-
cool, Ghakko, FvdP, Michael Hardy, Blow~enwiki, Minesweeper, Ahoerstemeier, Cyp, Strebe, Jerome.Abela, Notheruser, Kragen, Julesd,
Ghewgill, Timwi, MatrixFrog, Dcoetzee, Dfeuer, Dysprosia, Hao2lian, Shizhao, Phil Boswell, Robbot, Fredrik, Altenmann, Hump-
back~enwiki, Jleedev, Tea2min, Enochlau, Connelly, Giftlite, Sepreece, BenFrantzDale, Brona, Dratman, Leonard G., Pgan002, Li-
Daobing, Sebbe, Karl-Henner, Andreas Kaufmann, Tristero~enwiki, Perey, Spundun, Will2k, Haxwell, Aplusbi, SickTwist, Giraffedata,
Ryan Stone, Zetawoof, Iav, Hawke666, Cjcollier, Fawcett5, Denniss, Cburnett, RJFJR, H2g2bob, Kenyon, Silverdirk, Joriki, Mindma-
trix, Merlinme, Ruud Koot, Urod, Gimboid13, Jtsiomb, Marudubshinki, Graham87, Qwertyus, OMouse, Drebs~enwiki, Rjwilmsi, Hgka-
math, ErikHaugen, Toby Douglass, SLi, FlaBot, Margosbot~enwiki, Fragglet, Jameshfisher, Kri, Loading, SGreen~enwiki, YurikBot,
Wavelength, Jengelh, Rsrikanth05, Bovineone, Sesquiannual, Jaxl, Długosz, Coderzombie, Mikeblas, Blackllotus, Schellhammer, Reg-
naron~enwiki, Ripper234, JMBucknall, Lt-wiki-bot, Abu adam~enwiki, Smilindog2000, SmackBot, Pgk, Gilliam, Thumperward, Silly rab-
bit, DHN-bot~enwiki, Sct72, Khalil Sawant, Xiteer, Cybercobra, Philvarner, TheWarlock, Alexandr.Kara, SashatoBot, Mgrand, N3bulous,
Bezenek, Caviare, Dicklyon, Otac0n, Belfry, Pqrstuv, Pranith, Supertigerman, Ahy1, Jodawi, Pmussler, Linuxrocks123, Dantiston, Sytelus,
Epbr123, Ultimus, Abloomfi, Headbomb, AntiVandalBot, Widefox, Hermel, Roleplayer, .anacondabot, Stdazi, David Eppstein, Luna-
keet, Gwern, MartinBot, Glrx, Themania, IDogbert, Madhurtanwani, Phishman3579, Warut, Smangano, Binnacle, Lukax, Potatoswatter,
KylieTastic, Bonadea, Funandtrvl, DoorsAjar, Jozue, Simoncropp, Laurier12, Bioskope, Yakov1122~enwiki, YonaBot, Sdenn, Stone628,
Stanislav Nowak~enwiki, AlanUS, Hariva, Shyammurarka, Xevior, Uncle Milty, Nanobear~enwiki, Xmarios, Karlhendrikse, Kukolar,
MiniStephan, Uniwalk, Versus22, Johnuniq, XLinkBot, Consed, C. A. Russell, Addbot, Joshhibschman, Fcp2007, AgadaUrbanit, Tide
rolls, Lightbot, Luckas-bot, Yobot, Fraggle81, AnomieBOT, Narlami, Cababunga, Maxis ftw, ChrisCPearson, Storabled, Zehntor, Tb-
vdm, Xqbot, Nishantjr, RibotBOT, Kyle Hardgrave, Adamuu, FrescoBot, AstaBOTh15, Karakak, Kmels, Banej, Userask, Hnn79, Xsanda,
Trappist the monk, Gnathan87, MladenWiki, Pellucide, Belovedeagle, Patmorin, Sreeakshay, EmausBot, John of Reading, Dem1995, Hugh
Aguilar, K6ka, Nomen4Omen, Mnogo, Awakenrz, Card Zero, Grandphuba, KYLEMONGER, Kapil.xerox, Donner60, Wikipedian to the
max, 28bot, ClueBot NG, Xjianz, Spencer greg, Wittjeff, Ontariolot, Widr, Hagoth, BG19bot, Pratyya Ghosh, Deepakabhyankar, Naxik,
Dexbot, JingguoYao, Akerbos, Epicgenius, Mimibar, Kahtar, Kojikawano, Weishi Zeng, Suelru, Monkbot, Henryy321, Spasticcodemon-
key, Aureooms, HMSLavender, Freitafr, Nbro, Demagur, Equinox, Rubydragons, Jmonty42, Codedgeass, JamesBWatson3, Frankbryce,
Mar10dejong, Asgowrisankar, Jhnam88, Cristophercalo, Linkadvitch, Taozhijiang and Anonymous: 326
• WAVL tree Source: https://en.wikipedia.org/wiki/WAVL_tree?oldid=685567411 Contributors: David Eppstein and I dream of horses
• Scapegoat tree Source: https://en.wikipedia.org/wiki/Scapegoat_tree?oldid=753128418 Contributors: FvdP, Edward, Dcoetzee, Ruakh,
Dbenbenn, Tweenk, Sam Hocevar, Andreas Kaufmann, Rich Farmbrough, Jarsyl, Aplusbi, Oleg Alexandrov, Firsfron, Slike2, Qwertyus,
Mathbot, Wknight94, SmackBot, Chris the speller, Cybercobra, MegaHasher, Vanisaac, AbsolutBildung, Thijs!bot, Robert Ullmann, The-
mania, Danadocus, Joey Parrish, WillUther, Kukolar, SteveJothen, Addbot, Yobot, Citation bot, C.hahn, Patmorin, WikitanvirBot, Hankjo,
Mnogo, ClueBot NG, AlecTaylor, Tomer adar, Theemathas, Hqztrue and Anonymous: 37
• Splay tree Source: https://en.wikipedia.org/wiki/Splay_tree?oldid=771203455 Contributors: Mav, BlckKnght, Xaonon, Christopher Ma-
han, FvdP, Edward, Michael Hardy, Nixdorf, Pnm, Drz~enwiki, Dcoetzee, Dfeuer, Dysprosia, Silvonen, Tjdw, Phil Boswell, Fredrik,
Stephan Schulz, Giftlite, Wolfkeeper, CyborgTosser, Lqs, Wiml, Gscshoyru, Urhixidur, Karl Dickman, Andreas Kaufmann, Yonkel-
tron, Rich Farmbrough, Qutezuce, Bender235, Sietse Snel, Aplusbi, Chbarts, Phdye, Tabletop, VsevolodSipakov, Graham87, Qwertyus,
Rjwilmsi, Pako, Ligulem, Jameshfisher, Fresheneesz, Wavelength, Vecter, Romanc19s, Długosz, Abu adam~enwiki, Cedar101, Terber,
HereToHelp, That Guy, From That Show!, SmackBot, Honza Záruba, Unyoyega, Apankrat, Silly rabbit, Octahedron80, Axlape, Orphan-
Bot, Cybercobra, Philvarner, Just plain Bill, Ohconfucius, MegaHasher, Vanished user 9i39j3, Lim Wei Quan, Jamie King, Dicklyon,
Freeside3, Martlau, Momet, Ahy1, VTBassMatt, Escarbot, Atavi, Coldzero1120, Eapache, KConWiki, David Eppstein, Ahmad87, Gw-
ern, HPRappaport, Foober, Phishman3579, Dodno, Funandtrvl, Anna Lincoln, Rhanekom, Zuphilip, Russelj9, Svick, AlanUS, JP.Martin-
Flatin, Nanobear~enwiki, Pointillist, Safek, Kukolar, XLinkBot, Dekart, Maverickwoo, Addbot, דוד שי, Legobot, Yobot, Roman Mu-
nich, AnomieBOT, Erel Segal, 1exec1, Josh Guffin, Citation bot, Winniehell, Shmomuffin, Dzikasosna, FrescoBot, Snietfeld, Citation bot
1, Jwillia3, Zetifree, Sss41, MladenWiki, Sihag.deepak, Ybungalobill, Crimer, Wyverald, Const86, EmausBot, Hannan1212, Dcirovic,
SlowByte, Mnogo, P2004a, Petrb, ClueBot NG, Wiki.ajaygautam, SteveAyre, Ontariolot, Antiqueight, Vagobot, Arunshankarbk, Harijec,
HueSatLum, FokkoDriesprong, Makecat-bot, Pintoch, Arunkumar nonascii, B.pradeep143, MazinIssa, Abc00786, Lfbarba, Craftbond-
pro, Mdburns, Fabio.pakk, BethNaught, Efortanely, BenedictEggers, Admodi, Havewish, Bender the Bot, Happyspace4ever, Haleal and
Anonymous: 138
• Tango tree Source: https://en.wikipedia.org/wiki/Tango_tree?oldid=766152766 Contributors: AnonMoos, Giraffedata, RHaworth, Qw-
ertyus, Rjwilmsi, Vecter, Jengelh, Grafen, Malcolma, Rayhe, SmackBot, C.Fred, Chris the speller, Iridescent, Alaibot, Headbomb, Nick
Number, Acroterion, Nyttend, Philg88, Inomyabcs, ImageRemovalBot, Sfan00 IMG, Nathan Johnson, Jasper Deng, Yobot, AnomieBOT,
Erel Segal, Anand Oza, FrescoBot, Σ, RenamedUser01302013, Card Zero, Ontariolot, Do not want, Tango tree, DoctorKubla, Dexbot,
Faizan, Pqqwetiqe and Anonymous: 17
• Skip list Source: https://en.wikipedia.org/wiki/Skip_list?oldid=765080073 Contributors: Mrwojo, Stevenj, Charles Matthews, Dcoet-
zee, Dysprosia, Doradus, Populus, Noldoaran, Fredrik, Jrockway, Altenmann, Jorge Stolfi, Two Bananas, Andreas Kaufmann, Antaeus
Feldspar, R. S. Shaw, Davetcoleman, Nkour, Ruud Koot, Qwertyus, MarSch, Drpaule, Intgr, YurikBot, Wavelength, Pi Delport, Bovineone,
Gareth Jones, Zr2d2, Cedar101, AchimP, SmackBot, Gilliam, Chadmcdaniel, Silly rabbit, Cybercobra, Viebel, Almkglor, Laurienne Bell,
Nsfmc, CRGreathouse, Nczempin, Thijs!bot, Dougher, Bondolo, Sanchom, Magioladitis, A3nm, JaGa, STBotD, Musically ut, Funandtrvl,
VolkovBot, Rhanekom, SieBot, Ivan Štambuk, MinorContributor, Menahem.fuchs, Cereblio, OKBot, Svick, Rdhettinger, Denisarona,
PuercoPop, Gene91, Jurassicstrain, Kukolar, Resuna, Xcez-be, Braddunbar, Addbot, DOI bot, Jim10701, Luckas-bot, Yobot, Wojciech
mula, AnomieBOT, SvartMan, Citation bot, Carlsotr, Alan Dawrst, RibotBOT, FrescoBot, Jamesooders, MastiBot, Devynci, Patmorin,
EmausBot, Pet3ris, Allforrous, Jaspervdg, Overred~enwiki, ClueBot NG, Vishalvishnoi, Rpk512, BG19bot, ChrisGualtieri, Dexbot, Mark
viking, Purealtruism, Dmx2010, Monkbot, דובק1, Mtnorthpoplar, Corka94, Deacon Vorbis and Anonymous: 116
• B-tree Source: https://en.wikipedia.org/wiki/B-tree?oldid=767638022 Contributors: Kpjas, Bryan Derksen, FvdP, Mrwojo, Spiff~enwiki,
Edward, Michael Hardy, Rp, Chadloder, Minesweeper, JWSchmidt, Ciphergoth, BAxelrod, Alaric, Charles Matthews, Dcoetzee, Dys-
prosia, Evgeni Sergeev, Greenrd, Hao2lian, Ed g2s, Tjdw, AaronSw, Carbuncle, Wtanaka, Fredrik, Altenmann, Liotier, Bkell, Dmn,
Tea2min, Giftlite, DavidCary, Uday, Wolfkeeper, Lee J Haywood, Levin, Curps, Joconnor, Ketil, Jorge Stolfi, AlistairMcMillan, Nayuki,
Neilc, Pgan002, Gdr, Cbraga, Knutux, Stephan Leclercq, Peter bertok, Andreas Kaufmann, Chmod007, Kate, Ta bu shi da yu, Slady,
Rich Farmbrough, Guanabot, Leibniz, Qutezuce, Talldean, Slike, Dpotter, Mrnaz, SickTwist, Wipe, R. S. Shaw, HasharBot~enwiki, Alan-
sohn, Anders Kaseorg, ABCD, Wtmitchell, Wsloand, MIT Trekkie, Voxadam, Postrach, Mindmatrix, Decrease789, Ruud Koot, Qwertyus,
FreplySpang, Rjwilmsi, Kinu, Strake, Sandman@llgp.org, FlaBot, Psyphen, Ysangkok, Fragglet, Joe07734, Makkuro, Fresheneesz, Kri,
Antimatter15, CiaPan, Daev, Chobot, Vyroglyph, YurikBot, Bovineone, Ethan, PrologFan, Mikeblas, EEMIV, Cedar101, LeonardoRob0t,
SmackBot, Cutter, Ssbohio, Btwied, Danyluis, Mhss, Chris the speller, Bluebot, Oli Filth, Malbrain, Stevemidgley, Cybercobra, AlyM, Jeff
Wheeler, Battamer, Ck lostsword, Zearin, Bezenek, Flying Bishop, Loadmaster, Dicklyon, P199, Inquisitus, Norm mit, Noodlez84, Lamdk,
Amniarix, FatalError, Ahy1, Aubrey Jaffer, Beeson, Cydebot, PKT, ContivityGoddess, Headbomb, I do not exist, Alfalfahotshots, AntiVan-
dalBot, Luna Santin, Widefox, Jirka6, Lfstevens, Lklundin, The Fifth Horseman, MER-C, .anacondabot, Nyq, Yakushima, David Eppstein,
Hbent, MoA)gnome, Ptheoch, CarlFeynman, Glrx, Trusilver, Altes, Phishman3579, Jy00912345, Priyank bolia, GoodPeriodGal, Dorgan-
Bot, MartinRinehart, Michael Angelkovich, VolkovBot, Oshwah, Appoose, Kovianyo, Don4of4, Dlae, Jesin, Billinghurst, Uw.Antony,
Aednichols, Joahnnes, Ham Pastrami, JCLately, Jojalozzo, Ctxppc, Dravecky, Anakin101, Hariva, Wantnot, ClueBot, Rpajares, Simon04,
Junk98df, Abrech, Kukolar, Iohannes Animosus, Doprendek, XLinkBot, Paushali, Addbot, CanadianLinuxUser, AnnaFrance, LinkFA-
Bot, Jjdawson7, Verbal, Lightbot, Krano, Teles, Twimoki, Luckas-bot, Quadrescence, Yobot, AnomieBOT, Gptelles, Materialscientist,
MorgothX, Xtremejames183, Xqbot, Nishantjr, Matttoothman, Sandeep.a.v, Merit 07, Almabot, GrouchoBot, Eddvella, January2009,
Jacosi, SirSeal, Hobsonlane, Bladefistx2, Mfwitten, Redrose64, Fgdafsdgfdsagfd, Trappist the monk, Patmorin, Hjasud, RjwilmsiBot, Ma-
chineRebel, John lindgren, DASHBot, Wkailey, John of Reading, Wout.mertens, John ch fr, Pyschobbens, Ctail, Fabriciodosanjossilva,
TomYHChan, Mnogo, NGPriest, Tuolumne0, ClueBot NG, Betzaar, Oldsharp, Widr, DanielKlein24, Bor4kip, RMcPhillip, Meurondb,
BG19bot, WinampLlama, Erik.Bjareholt, Cp3149, Andytwigg, David.moreno72, JoshuSasori, Jimw338, YFdyh-bot, Dexbot, Pintoch,
Seanhalle, Lsmll, Enock4seth, Tentinator, TheWisestOfFools, DavidLeighEllis, M Murphy1993, JaconaFrere, Skr15081997, Audreyme-
ows, Utsavullas33, Nbro, IvayloS, CAPTAIN RAJU, Grecinto, SundeepBhuvan, GreenC bot, Bender the Bot and Anonymous: 387
• B+ tree Source: https://en.wikipedia.org/wiki/B%2B_tree?oldid=771110786 Contributors: Bryan Derksen, Cherezov, Tim Starling, Pnm,
Eurleif, CesarB, Cherkash, Marc omorain, Josh Cherry, Vikreykja, Lupo, Dmn, Giftlite, Inkling, WorldsApart, Neilc, Lightst, Ar-
row~enwiki, WhiteDragon, Two Bananas, Scrool, Leibniz, Zenohockey, Nyenyec, Cmdrjameson, TheProject, Obradovic Goran, Hap-
pyvalley, Mdd, Arthena, Yamla, TZOTZIOY, Stevestrange, Knutties, Oleg Alexandrov, RHaworth, LrdChaos, LOL, Decrease789, Gre-
gorB, PhilippWeissenbacher, Ash211, Penumbra2000, Gurch, Degeberg, Intgr, Fresheneesz, Chobot, Bornhj, Encyclops, Bovineone, Capi,
Luc4~enwiki, Mikeblas, Foeckler, Snarius, Cedar101, LeonardoRob0t, Jbalint, Jsnx, Arny, DomQ, Mhss, Hongooi, Rrburke, Cyber-
cobra, Itmozart, Nat2, Leksey, Tlesher, Julthep, Cychoi, UncleDouggie, Yellowstone6, Ahy1, Unixguy, CmdrObot, Leujohn, Jwang01,
Ubuntu2, I do not exist, Nuworld, Widefox, Ste4k, JAnDbot, Txomin, CommonsDelinker, Garciada5658, Afaviram, Mfedyk, Priyank bo-
lia, Mqchen, Mrcowden, VolkovBot, OliviaGuest, Mdmkolbe, Muro de Aguas, Singaldhruv, Highlandsun, Wiae, SheffieldSteel, MRLacey,
S.Örvarr.S, SieBot, Tresiden, YonaBot, Yungoe, Amarvashishth, Mogentianae, Imachuchu, ClueBot, Kl4m, Boing! said Zebedee, Tux-
thepenguin933, SchreiberBike, Max613, Raatikka, Addbot, TutterMouse, Thunderpenguin, Favonian, AgadaUrbanit, Matěj Grabovský,
Bluebusy, Twimoki, Luckas-bot, Matthew D Dillon, Yobot, ColinTempler, Vevek, AnomieBOT, Materialscientist, LilHelpa, Nishantjr,
Makeswell, Nqzero, Ajarov, Mydimle, Pinethicket, Eddie595, Reaper Eternal, MikeDierken, Holy-foek, Kastauyra, Igor Yalovecky, Gf
uip, EmausBot, Immunize, Wout.mertens, Tommy2010, K6ka, Entalpia2, James.walmsley, Bad Romance, Fabrictramp(public), QEDK,
Ysoroka, Grundprinzip, ClueBot NG, Vedantkumar, MaximalIdeal, Anchor89, Giovanni Kock Bonetti, BG19bot, Lowercase Sigma,
Chmarkine, BattyBot, NorthernSilencer, Cyberbot II, Michaelcomella, AshishMbm2012, Perkinsb1024, EvergreenFir, Alexjlockwood,
Graham477, Andylamp, Cowprophet, Kaartic, Ngkaho1234, Shubh-i sparkx, GreenC bot, Victor.scherbakov, Kushgrover, Linfeng371
and Anonymous: 250
• Trie Source: https://en.wikipedia.org/wiki/Trie?oldid=771766500 Contributors: Bryan Derksen, Taral, Bignose, Edward, Chris-martin, Rl,
Denny, Dcoetzee, Dysprosia, Evgeni Sergeev, Doradus, Fredrik, Altenmann, Mattflaschen, Tea2min, Matt Gies, Giftlite, Dbenbenn, David-
Cary, Sepreece, Wolfkeeper, Pgan002, Gdr, LiDaobing, Danny Rathjens, Teacup, Watcher, Andreas Kaufmann, Kate, Antaeus Feldspar,
BACbKA, JustinWick, Kwamikagami, Diomidis Spinellis, EmilJ, Shoujun, Giraffedata, BlueNovember, Hugowolf, CyberSkull, Diego
Moya, Loreto~enwiki, Stillnotelf, Velella, Blahedo, Runtime, Tr00st, Gmaxwell, Simetrical, MattGiuca, Gerbrant, Graham87, BD2412,
Qwertyus, Rjwilmsi, Drpaule, Sperxios, Hairy Dude, Me and, Pi Delport, Dantheox, Gaius Cornelius, Nad, Mikeblas, Danielx, TMott,
SmackBot, Slamb, Honza Záruba, InverseHypercube, Karl Stroetmann, Jim baker, BiT, Ennorehling, Eug, Chris the speller, Neurodiver-
gent, MalafayaBot, Drewnoakes, Otus, Malbrain, Kaimiddleton, Cybercobra, Leaflord, ThePianoGuy, Musashiaharon, Denshade, Edlee,
Johnny Zoo, MichaelPloujnikov, Cydebot, Electrum, Farzaneh, Bsdaemon, Deborahjay, Headbomb, Widefox, Maged918, KMeyer, Nos-
big, Deflective, Raanoo, Ned14, David Eppstein, FuzziusMaximus, Micahcowan, Francis Tyers, Pavel Fusu, 97198, Dankogai, Funandtrvl,
Bse3, Kyle the bot, Nissenbenyitskhak, Jmacglashan, C0dergirl, Sergio01, Ham Pastrami, Enrique.benimeli, Svick, AlanUS, Jludwig,
VanishedUser sdu9aya9fs787sads, Anupchowdary, Para15000, Niceguyedc, Pombredanne, JeffDonner, Estirabot, Mindstalk, Stepheng-
matthews, Johnuniq, Dscholte, XLinkBot, Dsimic, Deineka, Addbot, Cowgod14, MrOllie, Yaframa, OlEnglish, ماني, Legobot, Luckas-bot,
Yobot, Nashn, AnomieBOT, AmritasyaPutra, Royote, Citation bot, Ivan Kuckir, Coding.mike, GrouchoBot, Modiashutosh, RibotBOT,
Shadowjams, Pauldinhqd, FrescoBot, Mostafa.vafi, X7q, Jonasbn, Citation bot 1, Chenopodiaceous, Base698, GeypycGn, Miracle Pen,
Pmdusso, Diannaa, Cutelyaware, WillNess, RjwilmsiBot, EmausBot, DanielWaterworth, Dcirovic, Bleakgadfly, Midas02, HolyCookie,
Let4time, ClueBot NG, Jbragadeesh, Adityasinghhhhhh, Atthaphong, אנונימי17, Helpful Pixie Bot, Sangdol, Sboosali, Dvanatta, Dexbot,
Pintoch, Junkyardsparkle, Jochen Burghardt, Kirpo, Vsethuooo, RealFoxX, Averruncus, AntonDevil, Painted Fox, Ramiyam, *thing goes,
Bwegs14, Iokevins, Angelababy00, Tylerbittner, GreenC bot and Anonymous: 179
8.2 Images
• File:8bit-dynamiclist_(reversed).gif Source: https://upload.wikimedia.org/wikipedia/commons/c/cc/8bit-dynamiclist_%28reversed%
29.gif License: CC-BY-SA-3.0 Contributors: This file was derived from: 8bit-dynamiclist.gif
Original artist: Seahen, User:Rezonansowy
• File:AVL-double-rl_K.svg Source: https://upload.wikimedia.org/wikipedia/commons/f/f9/AVL-double-rl_K.svg License: CC BY-SA
4.0 Contributors: This vector image was created with Inkscape. Original artist: Nomen4Omen
• File:AVL-simple-left_K.svg Source: https://upload.wikimedia.org/wikipedia/commons/7/76/AVL-simple-left_K.svg License: CC BY-
SA 4.0 Contributors: This vector image was created with Inkscape. Original artist: Nomen4Omen
• File:AVL-tree-delete.svg Source: https://upload.wikimedia.org/wikipedia/commons/3/36/AVL-tree-delete.svg License: CC BY-SA 3.0
de Contributors: commons Original artist: Nomen4Omen
• File:AVL-tree-wBalance_K.svg Source: https://upload.wikimedia.org/wikipedia/commons/a/ad/AVL-tree-wBalance_K.svg License:
CC BY-SA 4.0 Contributors: This vector image was created with Inkscape. Original artist: Nomen4Omen
• File:AVLtreef.svg Source: https://upload.wikimedia.org/wikipedia/commons/0/06/AVLtreef.svg License: Public domain Contributors:
Own work Original artist: User:Mikm
• File:Ambox_important.svg Source: https://upload.wikimedia.org/wikipedia/commons/b/b4/Ambox_important.svg License: Public do-
main Contributors: Own work, based off of Image:Ambox scales.svg Original artist: Dsmurat (talk · contribs)
• File:AmortizedPush.png Source: https://upload.wikimedia.org/wikipedia/commons/e/e5/AmortizedPush.png License: CC BY-SA 4.0
Contributors: Own work Original artist: ScottDNelson
• File:An_example_of_how_to_find_a_string_in_a_Patricia_trie.png Source: https://upload.wikimedia.org/wikipedia/commons/6/
63/An_example_of_how_to_find_a_string_in_a_Patricia_trie.png License: CC BY-SA 3.0 Contributors: Microsoft Visio Original artist:
Saffles
• File:Array_of_array_storage.svg Source: https://upload.wikimedia.org/wikipedia/commons/0/01/Array_of_array_storage.svg License:
Public domain Contributors: No machine-readable source provided. Own work assumed (based on copyright claims). Original artist: No
machine-readable author provided. Dcoetzee assumed (based on copyright claims).
• File:AttenuatedBloomFilter2.png Source: https://upload.wikimedia.org/wikipedia/commons/d/d8/AttenuatedBloomFilter2.png Li-
cense: CC BY-SA 4.0 Contributors: Own work Original artist: Satokoala