
Data Structures of JDK

Stack
The JDK has had a stack since its origins: the Stack class. However, it is not recommended for use:
it is complex and strange, since it inherits from Vector and is therefore based on a dynamic array and synchronized.
Why a simple stack needs all of that, and why it is not just an interface, is unclear (discussed many
times: 1, 2); it seems to be simply an architectural mistake, the same as Vector itself. By the way, the JDK
authors themselves suggest using Deque instead.
Deque is an interface (API) of a double-ended queue (LIFO + FIFO in O(1)) which includes the stack
operations (push, pop, isEmpty, size) and has been available in the JDK only relatively recently (1.6+). Of course it would be
more logical to put these stack operations into a Stack interface and, for example, let Deque inherit it, but since
Stack was already present, and backward compatibility is the holy grail of Java, a clean design had to be
sacrificed. The implementations of Deque are ArrayDeque and LinkedList, which also
implement the ordinary queue, so we will discuss them later.
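For illustration, here is a minimal sketch of using Deque as a stack, as the JDK documentation itself recommends instead of the legacy Stack class (the class name StackDemo is just for the example):

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class StackDemo {
        public static void main(String[] args) {
            // ArrayDeque used as a LIFO stack: push/pop work on the head in O(1)
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(1);
            stack.push(2);
            stack.push(3);

            System.out.println(stack.pop());     // 3 (last in, first out)
            System.out.println(stack.peek());    // 2 (head without removal)
            System.out.println(stack.isEmpty()); // false
        }
    }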
Queue
Next, let's look at the queue data type. Everything here is good and well designed. Queue is the interface
(API) of a FIFO queue: adding to the tail and removing from the head run in constant time O(1).
There are two main implementations: ArrayDeque, a circular buffer based on a dynamically extensible
array (it doubles when full), and LinkedList, a classic doubly-linked list (its size is limited only by memory).
Surprisingly, the first does not support random access (add/remove/get by index) at all, while the second does,
but in O(n) time, iterating through the linked list. Both classes also implement the above-mentioned
Deque, so they support removal and insertion at both ends in constant time.
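A minimal sketch of the Queue API with the circular-buffer implementation (offer/poll/peek are the non-throwing counterparts of add/remove/element):

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class QueueDemo {
        public static void main(String[] args) {
            // Circular buffer backed by an array; doubles its capacity when full
            Queue<String> queue = new ArrayDeque<>();

            queue.offer("first");             // enqueue at the tail, O(1)
            queue.offer("second");
            System.out.println(queue.peek()); // "first" -- head without removing it
            System.out.println(queue.poll()); // "first" -- dequeue from the head, O(1)
            System.out.println(queue.size()); // 1
        }
    }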
Next, in JDK 1.5 the PriorityQueue was added, which in fact bends the contract: its elements are not
retrieved in FIFO order but according to
their priorities. PriorityQueue is based on an extensible binary heap with the minimum on the top
(according to its comparator), and it grows by roughly 1.5 times when full. Accordingly, the key characteristics
are: element addition/removal costs O(log N), and a reference to the minimum (the head) is available in O(1).
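A minimal example with an explicit comparator (here ordering strings by length, just for illustration):

    import java.util.Comparator;
    import java.util.PriorityQueue;

    public class PriorityQueueDemo {
        public static void main(String[] args) {
            // Binary min-heap; the comparator defines what "minimum" means
            PriorityQueue<String> pq = new PriorityQueue<>(Comparator.comparingInt(String::length));

            pq.offer("banana");  // O(log N) sift-up
            pq.offer("fig");
            pq.offer("apple");

            System.out.println(pq.peek()); // "fig"    -- the minimum, O(1)
            System.out.println(pq.poll()); // "fig"    -- removes the head, O(log N)
            System.out.println(pq.poll()); // "apple"
        }
    }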
The other queue types are designed for multithreaded use: BlockingQueue,
TransferQueue, ConcurrentLinkedQueue and ConcurrentLinkedDeque.
The implementations of BlockingQueue (ArrayBlockingQueue, LinkedBlockingQueue,
PriorityBlockingQueue) are essentially synchronized versions of their single-threaded originals, i.e. nearly every
operation is performed serially (under an exclusive lock). The same is true for DelayQueue: it is synchronized
as well and uses a PriorityQueue internally.
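A minimal producer/consumer sketch with ArrayBlockingQueue (put blocks when the queue is full, take blocks when it is empty; the class name is just for the example):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BlockingQueueDemo {
        public static void main(String[] args) throws InterruptedException {
            // Bounded queue with capacity 2; put() blocks when full, take() blocks when empty
            BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);

            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < 5; i++) {
                        queue.put(i); // blocks if the consumer falls behind
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.start();
            for (int i = 0; i < 5; i++) {
                System.out.println("consumed " + queue.take()); // blocks until an element arrives
            }
            producer.join();
        }
    }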
The other implementations (SynchronousQueue, TransferQueue with its LinkedTransferQueue,
ConcurrentLinkedQueue, ConcurrentLinkedDeque) are based on a different approach: they use
non-blocking queue algorithms built on linked lists and CAS instructions, which parallelize well
in a multiprocessor environment. A detailed description can be found in the source code.
The theory of such algorithms is a large and still-developing topic, not yet well standardized or
structured, so it is beyond the scope of this review and a subject for another article.
Priority Queue
As already said, since JDK 1.5 there is a general-purpose PriorityQueue that orders elements according to their
comparator. Moreover, there is one more heap implementation in the JDK: the good old
Timer with its internal class TaskQueue (the task with the minimal delay is kept on top). Of course it is a
private class and cannot be used outside of Timer.
List
As you know, lists come in sequential-access and random-access flavors. In Java this is the List interface, and there are two
main implementations. The first is ArrayList: it supports random access and is based on a dynamically
extensible array (it grows by half when full), but it does not shrink after elements are removed;
you need to call a special method (trimToSize) for that.
The second is, again, LinkedList: a sequential-access doubly-linked list whose size is limited only by
the memory of the JVM. The random-access (indexed) methods are also present but, as already said,
they require O(n) time.
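A minimal sketch of both points in practice (trimToSize to release unused capacity, and the O(n) cost of indexed access on a LinkedList):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ListDemo {
        public static void main(String[] args) {
            ArrayList<Integer> arrayList = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                arrayList.add(i);          // backing array grows by ~50% when full
            }
            arrayList.subList(10, arrayList.size()).clear();
            arrayList.trimToSize();        // the backing array does not shrink on its own

            List<Integer> linkedList = new LinkedList<>();
            for (int i = 0; i < 100_000; i++) {
                linkedList.add(i);
            }
            // get(i) on a LinkedList walks the nodes one by one: O(n) per call
            int middle = linkedList.get(linkedList.size() / 2);
            System.out.println(arrayList.size() + " " + middle);
        }
    }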
So the Java collections contain neither the simplest singly-linked list (although it would be a
good idea: half the overhead for the links), nor a plain stack.
For using a list in a multithreaded environment there are several options: CopyOnWriteArrayList
(modification operations cost O(n)), the synchronizedList wrapper, as well as the outdated Vector.
Symbol Tables
Symbol tables are represented in the JDK as binary search trees and hash tables. The binary tree is TreeMap (and
TreeSet), the implementation of SortedMap (and SortedSet): it is based on a classic red-black tree,
i.e. it is balanced, its basic operations are guaranteed to run in O(log N), and its size is unbounded.
Other kinds of trees are not present in the JDK.
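A minimal example of what the sorted structure buys you (firstKey, ceilingKey and range views, all in O(log N)):

    import java.util.TreeMap;

    public class TreeMapDemo {
        public static void main(String[] args) {
            // Red-black tree: keys are kept sorted, basic operations are O(log N)
            TreeMap<Integer, String> map = new TreeMap<>();
            map.put(30, "thirty");
            map.put(10, "ten");
            map.put(20, "twenty");

            System.out.println(map.firstKey());     // 10  -- smallest key
            System.out.println(map.ceilingKey(15)); // 20  -- smallest key >= 15
            System.out.println(map.headMap(25));    // {10=ten, 20=twenty} -- sorted range view
        }
    }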
The hash table is HashMap (and HashSet), probably the most used structure in Java. It is based on a
dynamically extensible hash table with separate chaining (linked lists in the buckets), with all the usual consequences: the
performance depends on the quality of the hash function, so it is O(N) in the worst case. When the size
reaches a predetermined loadFactor, the HashMap doubles in size. It is also worth noting that, to protect
against bad hash functions, a supplemental hash is computed: some tricky bit arithmetic is applied to
the result of the hashCode() call.
There are also open-addressing (linear-probing) hash tables in the JDK. One of
them is IdentityHashMap, which uses an optimized version of classical linear probing where
both the keys and the values are stored next to each other in the same array for better data caching
(javadoc: better locality for large tables than does using separate arrays).
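A minimal sketch of how IdentityHashMap differs from HashMap in lookup semantics (it compares keys with == rather than equals()):

    import java.util.HashMap;
    import java.util.IdentityHashMap;
    import java.util.Map;

    public class IdentityHashMapDemo {
        public static void main(String[] args) {
            String a = new String("key");
            String b = new String("key"); // equal by equals(), distinct by ==

            Map<String, Integer> regular = new HashMap<>();
            regular.put(a, 1);
            regular.put(b, 2);            // overwrites: equals()-based lookup

            Map<String, Integer> identity = new IdentityHashMap<>();
            identity.put(a, 1);
            identity.put(b, 2);           // two entries: reference-based (==) lookup

            System.out.println(regular.size());  // 1
            System.out.println(identity.size()); // 2
        }
    }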
The second open-addressing implementation is very specific: it serves only to store ThreadLocal
elements (the internal hidden class ThreadLocalMap) and, of course, is not available for general use.
There are also multithreaded versions: ConcurrentHashMap, the synchronizedMap wrapper, Hashtable and
ConcurrentSkipListMap. The wrapper is naturally just a blocking version of the regular HashMap;
Hashtable is the same thing (and it is legacy and not recommended); ConcurrentHashMap is a lock-striping
version that shrinks the critical sections (you'd better read about it in JCiP, here is
an excerpt); ConcurrentSkipListMap is a non-blocking adaptation of the skip list it is named after
(see the source code for the details).
Sets (collections without duplicates) are represented by Set in Java and are implemented on top of HashMap, so everything
said about hash tables applies to HashSet as well.
Graphs
Graph structures and algorithms are not represented in the JDK at all, so you can only use third-party
libraries for that.

Strings
Strings in Java have the usual implementation: they are based on an array of Unicode (UTF-16) characters. It is worth
mentioning that starting from version 1.7_17 the performance of substring is O(n), since the
substring characters are copied instead of sharing the parent array.
Interestingly, this implementation uses a simple (brute-force) substring search algorithm that runs
in O(N*M) in the worst case; the more efficient algorithms (built on finite-state
machines: Knuth-Morris-Pratt, etc.) are not in the JDK.
There are reasons for that. First, the alphabet of UTF characters is large (~65K symbols), and hence a
state machine carries a large storage overhead, while the brute-force algorithm works in place (it uses no
extra memory). Second, statistically, on average input strings this algorithm is not far behind the
others.
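For illustration, here is a sketch of the kind of input that triggers the O(N*M) worst case of the brute-force String.indexOf (a long run of 'a' characters and a pattern that almost matches at every position; the sizes are arbitrary):

    import java.util.Arrays;

    public class BruteForceSearchDemo {
        public static void main(String[] args) {
            // Text of N 'a' characters and a pattern of M 'a's followed by 'b':
            // the naive search compares ~M characters at almost every one of the N positions.
            char[] textChars = new char[200_000];
            Arrays.fill(textChars, 'a');
            char[] patternChars = new char[2_001];
            Arrays.fill(patternChars, 'a');
            patternChars[2_000] = 'b';

            String text = new String(textChars);
            String pattern = new String(patternChars);

            long start = System.nanoTime();
            int index = text.indexOf(pattern);   // brute force: ~N*M character comparisons
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("found at: " + index + " (expected -1)");
            System.out.println("took " + elapsedMs + " ms");
        }
    }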
The same goes for sorting: there are efficient radix sorts for strings (LSD, MSD, etc.), but the JDK
uses the standard object sort for strings, which runs in O(M*N*log N) when most of the strings
do not differ much (M is the string length). The reason is the same: the fast counting-based string sorts require
auxiliary arrays of the size of the UTF alphabet, which makes them very inefficient on average inputs.
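For illustration only, here is a minimal LSD radix sort sketch for fixed-length keys over a small alphabet (this is not part of the JDK); the per-pass count array of alphabet size R is exactly the overhead that becomes prohibitive at R of about 65K for full UTF-16:

    import java.util.Arrays;

    public class LsdRadixSortSketch {
        // Sorts fixed-width strings over a small alphabet (here: extended ASCII, R = 256).
        static void lsdSort(String[] a, int width) {
            final int R = 256;                     // alphabet size; 65536 for full UTF-16
            String[] aux = new String[a.length];

            for (int d = width - 1; d >= 0; d--) { // one counting-sort pass per character, right to left
                int[] count = new int[R + 1];      // this auxiliary array is the memory overhead
                for (String s : a) count[s.charAt(d) + 1]++;
                for (int r = 0; r < R; r++) count[r + 1] += count[r];
                for (String s : a) aux[count[s.charAt(d)]++] = s;
                System.arraycopy(aux, 0, a, 0, a.length);
            }
        }

        public static void main(String[] args) {
            String[] words = {"dab", "cab", "fad", "bad", "dad", "ebb", "ace"};
            lsdSort(words, 3);
            System.out.println(Arrays.toString(words)); // [ace, bad, cab, dab, dad, ebb, fad]
        }
    }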

JDK Algorithms
Sorting
Since JDK 7 there have been many changes around the sorting options; this has been discussed many times,
and there is plenty of information and articles on the subject, which you can easily google.
In short, here is the current list of sorting implementations in the JDK: TimSort is the default for sorting
objects; the old mergesort is still available for objects (enabled through a system property);
dual-pivot quicksort is used for primitives; counting sort is used for byte/char arrays;
and finally insertion sort is used for small arrays in any case.
Collection contents are sorted with Collections.sort(List), and it is still done by copying the
collection into an array, sorting the array, and then writing the result back into the collection.
Therefore it is still impossible to sort a collection without this overhead, although I think an in-place
sort of linked lists would be a good idea. Strings in Java are also sorted with the object
version, for the reasons already mentioned.
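A minimal sketch of the entry points mentioned above (dual-pivot quicksort for primitives, TimSort for objects, and the copy-sort-write-back path of Collections.sort):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;

    public class SortDemo {
        public static void main(String[] args) {
            int[] primitives = {5, 3, 1, 4, 2};
            Arrays.sort(primitives);              // dual-pivot quicksort

            Integer[] objects = {5, 3, 1, 4, 2};
            Arrays.sort(objects);                 // TimSort (stable, merge-based)

            List<Integer> list = new LinkedList<>(Arrays.asList(5, 3, 1, 4, 2));
            Collections.sort(list);               // copies to an array, sorts it, writes back

            System.out.println(Arrays.toString(primitives));
            System.out.println(Arrays.toString(objects));
            System.out.println(list);
        }
    }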
Searching
A traditional binary search is available for all arrays of primitives and objects, as well as for random-access
list implementations (ArrayList, etc.). In addition, there is a binary search variant for
linked lists in the JDK! That is quite surprising: the JDK has no sort for linked lists, but it has a
binary search for them, although it does not make much sense because this variant
runs in O(N).
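A minimal example of both forms (Arrays.binarySearch on a sorted array and Collections.binarySearch on a list):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;

    public class SearchDemo {
        public static void main(String[] args) {
            int[] sorted = {1, 3, 5, 7, 9};
            System.out.println(Arrays.binarySearch(sorted, 7)); // 3  -- O(log N)
            System.out.println(Arrays.binarySearch(sorted, 4)); // negative: -(insertion point) - 1

            List<Integer> linked = new LinkedList<>(Arrays.asList(1, 3, 5, 7, 9));
            // LinkedList is not RandomAccess: for large lists the JDK falls back
            // to an iterator-based O(N) binary search
            System.out.println(Collections.binarySearch(linked, 7)); // 3
        }
    }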
Regular Expressions
There are the Pattern and Matcher classes for that. Java uses the traditional implementation based on a
non-deterministic finite automaton (NFA) with backtracking, so it has exponential
complexity, O(m^N), in the worst case on degenerate inputs; for example, run this expression in Java and try
to add/remove 'a' characters: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".matches("(a|aa)*b").
There is also so-called ordered alternation: a feature that stops the search after the first alternative that
matches rather than the most specific (longest) result (example).
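A minimal sketch of both effects (ordered alternation picking the first rather than the longest alternative, and catastrophic backtracking on a degenerate input):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexDemo {
        public static void main(String[] args) {
            // Ordered alternation: "a" is tried before "aa", so the shorter match wins
            Matcher m = Pattern.compile("a|aa").matcher("aa");
            if (m.find()) {
                System.out.println(m.group()); // "a", not the longest possible "aa"
            }

            // Catastrophic backtracking: each extra 'a' multiplies the running time by ~1.6
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 35; i++) sb.append('a'); // no trailing 'b', so the match must fail
            long start = System.nanoTime();
            boolean matches = sb.toString().matches("(a|aa)*b");
            System.out.println(matches + " in "
                    + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }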

Hash functions, checksums


There are six variants of the default hashCode implementation in the JDK, with the
Park-Miller random number generator as the default; the others are trivial, such as a constant or the object's
memory address, and do not really use algorithms. You can find them in the C++ JDK sources. There are also
industry-standard hashing algorithms (SHA-*, MD5 and variations) in the MessageDigest class.
For checksums there are implementations of the Adler-32 (javadoc) and CRC-32 (javadoc) algorithms.
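A minimal example of both APIs (a SHA-256 digest via MessageDigest and a CRC-32 checksum):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.zip.CRC32;

    public class DigestDemo {
        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] data = "hello".getBytes(StandardCharsets.UTF_8);

            // Cryptographic digest through the MessageDigest SPI
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            System.out.println("SHA-256: " + hex);

            // Fast non-cryptographic checksum
            CRC32 crc = new CRC32();
            crc.update(data);
            System.out.println("CRC-32:  " + Long.toHexString(crc.getValue()));
        }
    }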
Compression
There is an implementation of the standard DEFLATE compression algorithm (the Deflater class) in the JDK, and
the JDK's zip/gzip utilities use it. All of them live in the java.util.zip package.
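A minimal sketch of compressing and decompressing a byte array with Deflater/Inflater (buffer sizes are arbitrary and chosen for the example):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class DeflateDemo {
        public static void main(String[] args) throws DataFormatException {
            byte[] input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa bbbb".getBytes(StandardCharsets.UTF_8);

            // Compress with DEFLATE
            Deflater deflater = new Deflater();
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[256];
            int compressedLength = deflater.deflate(buffer);
            deflater.end();

            // Decompress back
            Inflater inflater = new Inflater();
            inflater.setInput(buffer, 0, compressedLength);
            byte[] restored = new byte[input.length];
            int restoredLength = inflater.inflate(restored);
            inflater.end();

            System.out.println(input.length + " -> " + compressedLength + " bytes");
            System.out.println(new String(restored, 0, restoredLength, StandardCharsets.UTF_8));
        }
    }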

Summary
As you can see, the classic data structures are not fully represented in Java, but at the same time
thread-safe versions are available for nearly all the data structures the JDK does have. What exactly is missing
is an open question. For example, you can argue whether the JDK needs something like Union-Find, but the complete
absence of graph structures and algorithms in the platform, in the age of social networks, is very
surprising and in practice generates a lot of bugs and reinvented wheels.
