
A Text Editor as a Single String: The Scalable Functionality of the Rope Abstract Data Structure


Molly Q. Feldman Swarthmore College 12 December 2012
Abstract
Programming languages usually represent a sequence of characters with a standard library string implementation. In most applications, strings work efficiently and without major space or time deficiencies. However, strings that are significantly longer than words or sentences become expensive to modify. The rope data structure offers a scalable alternative to strings and provides amortized O(log n) methods for Insertion, Deletion, Indexing, and Splitting. Concatenation, the method for which ropes are best known, is amortized O(1). Although over the past seventeen years computer science has created more efficient methods of representing long sequences, ropes are still advantageous for an academic study of data structures because of their use of trees/graphs and the easily available comparison to strings.

Motivation and Method Comparison


In December 1995, three researchers from the Xerox Palo Alto Research Center (PARC) published an article entitled "Ropes: An Alternative to Strings," which provided the general interface of the rope data type as well as some basic implementation ideas. The PARC researchers' objective was to make a string representation with four fundamental characteristics: the sequences must be immutable, they must have arbitrarily scalable operations, the most frequently occurring operations must be as efficient as possible, and the data structure must be abstract and adaptable (Boehm et al., 1995, 1315-6). In their discussion, they note that strings violate three of these four properties whereas ropes were designed according to these specifications (Boehm et al., 1995). Ropes are fully implemented in the Cedar programming language and used as Cedar's only standard string representation. According to the creators, Cedar programmers reported that ropes were a functional and efficient substitute for strings (Boehm et al., 1995). Other common implementations are C Cords as well as Java and C++ Ropes.
The functionality and structure of ropes versus that of strings is an additional motivating factor for their creation. In general, the use of strings can range anywhere from obtaining user input to Natural Language Processing to communicating between different files and elements of the machine. Many programming languages, such as C++ and Python, store strings as contiguous arrays of chars in memory, which allows for quick access. Although methods such as insert and delete are O(n) for strings, they function just as well as constant or logarithmic time algorithms when n is small, such as for a word or a sentence. This simplicity and their availability in most languages make strings the overwhelming favorite representation for sequences of chars. However, real-world applications of strings do not normally involve single static words. In fact, the fundamental purpose of basic word processors (such as Notepad in Windows or TextEdit in Mac OS X) is to edit text, which requires countless insertions and deletions. Typically a user inserts a number of characters onto the end of the text in rapid succession. With strings, such an addition appears as follows, utilizing a time- and space-consuming O(n) operation:
"abcdefg" + "h" = "abcdefgh"
where "abcdefg" is generally stored as
[a, b, c, d, e, f, g].
In C++, the machine must then allocate a
new array of size 8 and copy over each
entry one at a time,
[a, , , , , , , ]
[a,b, , , , , , ]
[a,b,c, , , , , ]
[a,b,c,d, , , , ]
[a,b,c,d,e, , , ]
[a,b,c,d,e,f, , ]
[a,b,c,d,e,f,g, ]
[a,b,c,d,e,f,g,h]
which results in the string "abcdefgh".
Ropes, on the other hand, have a special case of concatenation that allows this type of addition to occur in constant time, regardless of the rope's total size. These factors point to ropes as an efficient implementation of a rudimentary word processor because they allow efficient, scalable concatenation and keep the entire contents of the document in one complete data structure.

Interface

Additional key methods for the implementation of a rope are Deletion, Insertion, Indexing, and finding a specific substring, similar to those utilized in the string structure. Deletion is the exact opposite of concatenation for strings, but is slightly more time consuming for ropes. In this case, the rope deletion method splits the rope into three pieces: a subrope to the left of the section to be deleted, a subrope to the right, and the subrope (which can be a single node) to be deleted. It then concatenates the left and right subropes together to re-form the rope. This is generally an amortized O(log n) method as it is based on the height of the tree, in contrast to the O(n) time of deletion in strings (SGI, 1999). Rope indexing is an O(log n) method as well, as a specific index is reached by traversing one path of the tree from the root to a leaf node, i.e. in a number of steps equal to the height (we can assume the height is O(log n) given that ropes contain a balancing feature). Finding a substring has two main implementations. One utilizes a stored list of pointers to subtrees, which the creators denote the lazy method. The second utilizes an indexing-like approach to find specific substrings. Thus the general find algorithm is also an amortized O(log n) method (Boehm et al., 1995). Insertion is discussed in the Analysis section. The advantage of this interface is that it leaves room for specific implementations to make some operations faster at the expense of others, based on which functionality they would like to highlight. In addition, ropes are notable for the general trend of O(log n) algorithms.
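The deletion procedure described in the Interface section (split the rope into three pieces, discard the middle one, and rejoin the outer two) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the Cedar or SGI implementation: the names Node, leaf, length, concat, split, delete, insert, and to_string are hypothetical, positions are given as character counts rather than base-1 indices, and no balancing is attempted, so the O(log n) bound only holds if the tree happens to be balanced.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    value: int                      # leaf: len(data); internal: length of left subtree
    data: Optional[str] = None      # only leaves carry text
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(text: str) -> Node:
    return Node(len(text), data=text)


def length(node: Optional[Node]) -> int:
    """Total characters under node, found by walking the right spine."""
    if node is None:
        return 0
    if node.data is not None:
        return node.value
    return node.value + length(node.right)


def concat(a: Optional[Node], b: Optional[Node]) -> Optional[Node]:
    """Join two ropes under a new concatenation node (no balancing)."""
    if a is None:
        return b
    if b is None:
        return a
    return Node(length(a), left=a, right=b)


def split(node: Optional[Node], k: int) -> Tuple[Optional[Node], Optional[Node]]:
    """Return (first k characters, remainder) as two ropes."""
    if node is None:
        return None, None
    if node.data is not None:
        head, tail = node.data[:k], node.data[k:]
        return (leaf(head) if head else None, leaf(tail) if tail else None)
    if k <= node.value:
        head, tail = split(node.left, k)
        return head, concat(tail, node.right)
    head, tail = split(node.right, k - node.value)
    return concat(node.left, head), tail


def delete(rope: Optional[Node], start: int, count: int) -> Optional[Node]:
    """Drop count characters after the first start characters."""
    left, rest = split(rope, start)
    _, right = split(rest, count)
    return concat(left, right)


def insert(rope: Optional[Node], k: int, text: str) -> Optional[Node]:
    """Insert text after the first k characters: one split, two concatenations."""
    left, right = split(rope, k)
    return concat(concat(left, leaf(text)), right)


def to_string(node: Optional[Node]) -> str:
    if node is None:
        return ""
    if node.data is not None:
        return node.data
    return to_string(node.left) + to_string(node.right)


rope = concat(concat(leaf("How "), leaf("grey ")),
              concat(leaf("is "), leaf("my dragon")))
print(to_string(delete(rope, 4, 5)))        # How is my dragon
print(to_string(insert(rope, 4, "very ")))  # How very grey is my dragon
```

Because the nodes in this sketch are never modified, the original rope is still intact after both calls, which matches the immutability requirement stated by Boehm et al.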

Related Work
In modern implementations of ropes, such as C Cords or the SGI C++ implementation, ropes are defined as balanced binary trees (Boehm et al., 1995, SGI, 1999). Depending on the programmer's choice, they can be implemented using specific forms such as B Trees or AVL Trees. Yet ropes may contain nodes that are shared by more than one parent and are thus more accurately defined as directed acyclic graphs (Boehm et al., 1995). For an AVL Tree implementation, concatenation would be slower than constant time, as AVL trees rebalance on insertion, which is not the case in the general rope class. However, indexing would be faster, as the tree would be guaranteed to have a height of O(log n) by definition. B Trees would be helpful in situations in which portions of the rope would have to be saved on the disk. Since B Trees are optimized for disk access, they would allow indexing and searching through the rope to be faster than normal (Neubauer, 1999). However, B Trees also have space concerns that can increase runtime overall (Semaphore, n.d.). In general, weighing an abstract data structure against the pros and cons of its implementations is one of the main components of theoretical computer science.

Illustration
The following is a discussion of the construction of a rope, which showcases some of the data structure's unique qualities and benefits. Our example string will be "How grey is my dragon". First of all, large input strings are traditionally separated at spaces into small substrings, one for each node of the tree. After that division is complete, longer substrings can be divided again (e.g. "dragon" has been split into two substrings, "dra" and "gon"). One of the unique elements of the general constructor for ropes is that the rope is built from the bottom level (the leaves) upwards. These leaf nodes contain both values and data. The data are the small split substrings and the values are the int representations of the number of characters in each substring (Boehm et al., 1995). For an example, see Figure 1.

Figure 1: First level of construction

Once this level of the rope has been created, the constructor then randomly picks nodes to receive parents (Wikipedia, 2012). In general, there are three different types of nodes in a rope: leaf nodes, concatenation (internal) nodes, and root nodes. Leaf nodes, as mentioned above, contain both data and values. Their parent nodes are concatenation nodes, which contain only values equal to the sum of the values of the leaves in their left subtree (Boehm et al., 1995). They do not have data but rather represent the concatenation of their subtrees. For example, the parent of "How " in Figure 2 has a value of 4 because its left subtree holds the single leaf "How " (three letters plus the trailing space) and it represents the concatenation of "How " with no other node.

Figure 2: Random addition of parent nodes

Once parent nodes have been randomly assigned, the nodes without parents are then connected to the leaf nodes immediately to their left (Wikipedia, 2012). This then creates two different levels of nodes that contain data: leaf nodes and nodes with a single right child. For the sake of simplicity, any node that contains data will be herein referred to as a leaf node. Because of the addition of the parentless nodes, the value of each parent node must be updated to express the sum of the values in its left subtree; see Figure 3.

Figure 3: Connecting parentless nodes and updating parent values

Once this process has occurred for every node on the leaf level and all parent nodes have been updated, the constructor continues to utilize this process to build the rest of the tree. The third type of node, the root node, has the entire rest of the rope structure as its left subtree, which allows for some simpler algorithms, such as indexing, to be utilized. Its value, see Figure 4, is equal to the sum of the values of all of the leaf nodes in the tree.

Figure 4: A completed rope representation of the string "How grey is my dragon"
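The bottom-up construction described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the constructor from the cited sources: instead of picking nodes at random to receive parents, it pairs neighbouring nodes level by level, which yields the same kinds of nodes (leaves, concatenation nodes, and a root whose left subtree is the whole rope). The names Node, MAX_LEAF, chop, make_leaves, total, and build are hypothetical, and the leaf size limit of 5 is chosen only so that "dragon" splits into "dra" and "gon".

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_LEAF = 5  # hypothetical limit on characters per leaf


@dataclass
class Node:
    value: int                      # leaf: len(data); otherwise: length of left subtree
    data: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def chop(chunk: str, max_leaf: int) -> List[str]:
    """Halve a substring until every piece fits in a leaf."""
    if len(chunk) <= max_leaf:
        return [chunk]
    mid = len(chunk) // 2
    return chop(chunk[:mid], max_leaf) + chop(chunk[mid:], max_leaf)


def make_leaves(text: str, max_leaf: int = MAX_LEAF) -> List[Node]:
    """Split at spaces (each word keeps its trailing space), then chop long words."""
    words = text.split(" ")
    pieces: List[str] = []
    for i, word in enumerate(words):
        chunk = word + (" " if i < len(words) - 1 else "")
        if chunk:
            pieces.extend(chop(chunk, max_leaf))
    return [Node(len(p), data=p) for p in pieces]


def total(node: Optional[Node]) -> int:
    """Number of characters stored under node."""
    if node is None:
        return 0
    if node.data is not None:
        return node.value
    return node.value + total(node.right)


def build(text: str) -> Node:
    level = make_leaves(text)
    while len(level) > 1:
        parents: List[Node] = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # A concatenation node stores the length of its left subtree.
                parents.append(Node(total(level[i]), left=level[i], right=level[i + 1]))
            else:
                parents.append(level[i])    # odd node out moves up unchanged
        level = parents
    body = level[0] if level else None
    # The root keeps the whole rope as its left subtree and stores the total length.
    return Node(total(body), left=body)


root = build("How grey is my dragon")
print([leaf.data for leaf in make_leaves("How grey is my dragon")])
print(root.value)   # 21
```

The root value of 21 is the sum of the leaf values 4, 5, 3, 3, 3, and 3, as the text describes for Figure 4.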

Analysis
Indexing is an O(log n) method that showcases rope tree traversal. The pseudocode is as follows:

char index(i):
    current = root
    while current is not a leaf:
        if i <= current.value:
            current = current->left
        else:
            i = i - current.value
            current = current->right
    return current.data[i - 1]    // i is base 1; data is a base 0 array

The general idea is that if the index is no larger than the current node's value, the character lies in the left subtree and current becomes current's left child. Otherwise, the index is reduced by the current node's value and current becomes current's right child. Once current is a leaf, the current node contains the searched-for index, and the algorithm returns the character at the indexed position in the substring (for example, searching for 11 in the above rope results in the char 's' from the "is " node). It is important to note that rope indexing as presented here is base 1, not base 0 like most other indexing. That means 's' is index 2 rather than index 1 in the "is " node (Wikipedia, 2012). The runtime of this algorithm is solely based on the number of iterations of the while loop, which is determined by the height of the tree. Given that ropes are kept balanced, like AVL Trees or B Trees, which both have O(log n) search, we can conclude that ropes will have a similar runtime. The innate balancing algorithm of the rope structure is complex; a full description can be seen in "Ropes: An Alternative to Strings." Balancing in ropes, unlike in other structures, is called explicitly when the structure exceeds a certain depth determined by the implementation (SGI, 1999). This form of balancing may lead to an O(n) worst-case runtime. However, balancing does not occur frequently enough to impact general runtimes. Thus we interpret runtimes dealing with tree traversals/searches as amortized to those of a balanced binary tree implementation (SGI, 1999).
Concatenation of two complete ropes is an O(1) operation. The algorithm simply allocates a new concatenation node with the two ropes as its left and right children respectively, and places a new root above it so that the root still holds the total length. The order of addition matters since ropes are built to maintain word order, so the first part of the sentence should be the left child. The pseudocode for this algorithm is as follows:

concatenate(r1, r2):
    join = new Node(r1.root.value)    // weight of the left subtree
    join->left = r1.root
    join->right = r2.root
    newroot = new Node(r1.root.value + r2.root.value)
    newroot->left = join
    return newroot

Figure 5 offers a visual representation of this algorithm. In general, this is an O(1) method as it simply entails allocating nodes and assigning variables. In other data structures, this step would also incorporate balancing but, as noted above, ropes only call balance explicitly in order to maintain the O(1) runtime.

Figure 5: Representation of the concatenation of "How grey is my dragon" with " and sword"
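The indexing and concatenation algorithms discussed in this section can be tried out with the short sketch below. It follows the conventions of this paper (base-1 indices, concatenation nodes that store the length of their left subtree, and a root that stores the total length with the whole rope as its left subtree), but it is only an illustration: the names Node, make_rope, index, and concatenate are hypothetical, and make_rope builds a deliberately simple, unbalanced tree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: int                      # leaf: len(data); otherwise: length of left subtree
    data: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def make_rope(pieces: List[str]) -> Node:
    """Chain the pieces left to right and put a root on top (no balancing)."""
    body: Optional[Node] = None
    size = 0
    for piece in pieces:
        new_leaf = Node(len(piece), data=piece)
        body = new_leaf if body is None else Node(size, left=body, right=new_leaf)
        size += len(piece)
    return Node(size, left=body)    # root: value is the total length


def index(root: Node, i: int) -> str:
    """Return the i-th character, counting from 1."""
    if not 1 <= i <= root.value:
        raise IndexError("rope index out of range")
    current = root
    while current.data is None:
        if i <= current.value:
            current = current.left
        else:
            i -= current.value
            current = current.right
    return current.data[i - 1]


def concatenate(r1: Node, r2: Node) -> Node:
    """O(1): two node allocations, no copying of characters."""
    join = Node(r1.value, left=r1, right=r2)        # r1.value is r1's total length
    return Node(r1.value + r2.value, left=join)     # new root keeps the total


rope = make_rope(["How ", "grey ", "is ", "my ", "dra", "gon"])
print(index(rope, 11))              # s

joined = concatenate(rope, make_rope([" and ", "sword"]))
print(joined.value)                 # 31
print(index(joined, 27))            # s
```

Note that concatenate never touches the characters themselves, which is why its cost does not depend on the size of either rope.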
Ropes also contain specific methods for single-character concatenations onto the end of the tree, such as those seen in a word processor. In general, if the rope itself is a single node containing a short substring and the node to be concatenated is as well, then the general algorithm is to combine the two nodes into a single node. Given a node with the value 3 and the data "bee" and a node with the value 1 and the data "s", the algorithm specifies that their concatenation is a single node with the value 4 and the data "bees" (Boehm et al., 1995). This is similar to the general concatenation method for strings, but is only used in ropes in this specific case. When the left-hand item is a rope of arbitrary length whose rightmost node is a leaf containing a short substring, and the right-hand item is a single node, the following procedure is used. First, the rightmost node of the left argument is removed from the tree and concatenated to the right-hand argument. In our example, seen in Figure 6, this produces a subrope of size 4. This subrope is then concatenated to the rightmost node of the main rope and all concatenation nodes affected have their values updated (Boehm et al., 1995, SGI, 1999). The following is the pseudocode for this operation:

concatenate(r1, n1):
    lastnode = r1.pop_back()
    newnode = new Node(lastnode.value)    // weight of the left subtree
    newnode->left = lastnode
    newnode->right = n1
    r1.push_back(newnode)

Figure 6: Representation of the concatenation of "How grey is my dragon" with "s"

The push_back() and pop_back() methods are amortized constant methods from the SGI C++ rope implementation. They are used in the pseudocode simply as representations of the additional methods that can be built into the implementation of the rope abstract data type to perform these operations. Given that these operations are O(1) and that the rest of the method is a replica of general concatenation for ropes, the process itself is O(1). Concatenation lends itself to the method of insertion, which is the idea of editing a rope by inserting a string into the middle. This operation is a combination of concatenation and splitting. Splitting is an additional O(log n) method for ropes that simply breaks concatenation node to leaf node connections given a certain index and concatenates the left and right sides to form two distinct subtrees (Boehm et al., 1995). In general, the non-complete tree traversal methods and non-concatenation methods of the rope class are O(log n) because they function like modified indexing (O(log n)) and may or may not contain concatenation (O(1)). Tree traversal methods, like iterating, are O(n) by the definition of traversal.
The above methods and motivation for ropes speak to their overall midlevel performance as a data structure. Strings may have the extremes of O(1) and O(n) operations, but ropes contain mainly O(log n) operations. This ability to represent long strings, concatenate easily, and have generally logarithmic operations is the strength of the rope data structure.

References
Hans J. Boehm, Russ Atkinson, and Michael Plass. December 1995. Ropes: An Alternative to Strings. Software-Practice and Experience, Volume 25. John Wiley & Sons, Ltd., Somerset, NJ.
Neubauer, P. (1999). B-trees: Balanced tree data structures. Bluer White. Retrieved December 8, 2012, from http://www.bluerwhite.org/btree
Rope (computer science). 12 October 2012. In Wikipedia. Retrieved December 7, 2012, from http://en.wikipedia.org/wiki/Rope_(computer_science)
Semaphore Corporation. (n.d.). When not to use btrees. Retrieved December 8, 2012, from http://www.semaphorecorp.com/btp/b11.html
Silicon Graphics International. (n.d.). rope<T, Alloc>. In SGI. Retrieved December 7, 2012, from http://www.sgi.com/tech/stl/Rope.html.
Silicon Graphics International. (n.d.). Rope Implementation Overview. In SGI. Retrieved December 7, 2012, from http://www.sgi.com/tech/stl/ropeimpl.html.
Silicon Graphics International and Hewlett-Packard Company. 1999. SGI STL Library. In SGI. Downloaded December 7, 2012, from http://www.sgi.com/tech/stl/download.html.