Statistical Science

2014, Vol. 29, No. 2, 167–180


DOI: 10.1214/13-STS452
© Institute of Mathematical Statistics, 2014

Object-Oriented Programming, Functional Programming and R
John M. Chambers
arXiv:1409.3531v1 [stat.ME] 9 Sep 2014

Abstract. This paper reviews some programming techniques in R that have
proved useful, particularly for substantial projects. These include several
versions of object-oriented programming, used in a large number of R
packages. The review tries to clarify the origins and ideas behind the
various versions, each of which is valuable in the appropriate context.

R has also been strongly influenced by the ideas of functional programming
and, in particular, by the desire to combine functional with object-oriented
programming.

To clarify how this particular mix of ideas has turned out in the current R
language and supporting software, the paper will first review the basic
ideas behind object-oriented and functional programming, and then examine
the evolution of R with these ideas providing context.

Functional programming supports well-defined, defensible software giving
reproducible results. Object-oriented programming is the mechanism par
excellence for managing complexity while keeping things simple for the user.
The two paradigms have been valuable in supporting major software for
fitting models to data and numerous other statistical applications.

The paradigms have been adopted, and adapted, distinctively in R. Functional
programming motivates much of R but R does not enforce the paradigm.
Object-oriented programming from a functional perspective differs from that
used in non-functional languages, a distinction that needs to be emphasized
to avoid confusion.

R initially replicated the S language from Bell Labs, which in turn was
strongly influenced by earlier program libraries. At each stage, new ideas
have been added, but the previous software continues to show its influence
in the design as well. Outlining the evolution will further clarify why we
currently have this somewhat unusual combination of ideas.

Key words and phrases: Programming languages, functional programming,
object-oriented programming.

John M. Chambers is Consulting Professor, Department of Statistics, Stanford
University, Stanford, California 94305-4065, USA e-mail:
jmc@stat.stanford.edu.

This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29,
No. 2, 167–180. This reprint differs from the original in pagination and
typographic detail.

1. INTRODUCTION

R has become an important medium for communicating new methodology in
statistics and related technology. References to the supporting R software
frequently accompany journal articles or other publications describing new
results. The software is available to other R users, ideally as a package in
a standard repository. The benefits for statistics as a discipline are
considerable: The community has
rapid access to new ideas in a free, open-source format as software that can
in most cases be installed and used immediately by those interested in the
statistical techniques. The user community has both created and benefited
from this resource.

This paper examines two of the most significant paradigms in programming
languages generally: object-oriented programming (OOP) and functional
programming. R makes use of both, but in its own way. Both paradigms are
valuable for serious programming with the language. But in both cases,
understanding the relevant ideas in the context of R is needed to avoid
confusion. The confusion sometimes arises, in both cases, from applying to R
interpretations of the paradigms that apply to other languages but not to
this one. Section 2 of the paper will review the ideas, generally and in
their R versions, with the goal of clarifying the basics. Given the
importance of R software to the community, creators of new R software should
benefit from understanding these concepts.

We will also examine in Section 3 of the paper the evolution that led to
these versions of functional programming and OOP. The prime motivation was
not language design in the abstract but to provide the tools needed for
research and data analysis by the user community at the time. R originally
reproduced the functionality of the S language at Bell Labs, which itself
had evolved through several stages beginning in the late 1970s and which was
in turn based on earlier statistical software libraries, mainly in Fortran.

R added important new ideas and has continued to evolve, but the main
contents inherited through S shaped the capabilities and the approach to
statistical computing. In a surprising number of areas, what we think of as
“the R way” of organizing the computations actually reflects software
developed twenty years or more before R existed.

Having been involved in all the stages, I am naturally inclined to a
historical perspective, but it is also the case that the history itself had
substantial impact on the results. It may be comforting to view programming
languages as abstract definitions, but in practice they evolve from the
needs, interests and limitations of their creators and users.

2. FUNCTIONAL AND OBJECT-ORIENTED PROGRAMMING: THE MAIN IDEAS

Functional and object-oriented programming fit naturally into statistical
applications and into R. The original motivating use case, fitting models to
data, remains compelling. An expression such as

    irisFit <- lm(Sepal.Width ~ . - Sepal.Length, iris)

calls a function that creates an object representing the linear model
specified by the first argument, applied to the data specified by the second
argument. The computation is functional, well-defined by the arguments. It
returns an object whose properties provide the information needed to study
and work with the fitted model. Other functions and other objects can adapt
to different models in a form that is convenient for both the user and the
implementer.

Principles of functional programming guide us in writing reliable,
reproducible functions for the different models. Object-oriented programming
provides tools for defining the model objects clearly, and adapting to new
ideas and new forms of models. Section 3.4 goes into details of the R
implementations.

As they have been realized in R, both paradigms center on a few, intuitive
concepts. The details are more complicated, as they usually are. In the case
of functional programming, the realization in R is only partial, reflecting
the language’s origins as well as practical considerations. In the case of
OOP, there are now at least three realizations of the ideas in R, using two
different paradigms. All three have significant applications and practical
value.

Despite all these devilish details, the main ideas remain visible and
useful, particularly when programming serious applications using the
language.

2.1 Functional Programming

For our purposes, the main principles of functional programming can be
summarized as follows:

1. Programming consists largely of defining functions.
2. A function definition in the language, like a function in mathematics,
   implies that a function call returns a unique value corresponding to each
   valid set of arguments, but only dependent on these arguments.
3. A function call has no side effects that could alter other computations.

The implication of the second point is that functions in the programming
language are mappings from the allowed set of arguments to some range of
output values. In particular, the returned value
should not depend on other quantities that affect the “state” of the
software when the function call is evaluated.

True functional languages conform to these ideas both by what they do
provide, such as pattern expressions, and what they do not provide, such as
procedural iteration or dynamic assignments. The classic tutorial example of
the factorial function, for example, could be expressed in the Haskell
language by the pattern:

    factorial x = if x > 0
                  then x * factorial (x-1) else 1,

plus some type information, such as that a value for x must be an integer
scalar.

Is R a functional programming language in this sense? No. The structure of
the language does not enforce functionality; Section 2.3 examines that
structure as it relates to functional programming and OOP. The evolution of
R from earlier work in statistical computing also inevitably left portions
of earlier pre-functional computations; Section 3 outlines the history.
Random number generation, for example, is implemented in a distinctly
“state-based” model in which an object in the global environment
(.Random.seed) represents the current state of the generators. Purely
functional languages have developed techniques for many of these
computations, but rewriting R to eliminate its huge body of supporting
software is not a practical prospect and would require replacing some very
well-tested and well-analyzed computations (random number generation being a
good example).

Functional programming remains an important paradigm for statistical
computing in spite of these limitations. Statistical models for data, the
motivating example for many features in S and R, illustrate the value of
analyzing the software from a functional programming perspective. Software
for fitting models to data remains one of the most active uses of R. The
functional validity of such software is important both for theoretical
justification and to defend the results in areas of controversy: Can we show
that the fitted models are well-defined functions of the data, perhaps with
other inputs to the model such as prior distributions considered as
additional arguments? The structure of R as described in Section 2.3 can
provide support for analyzing functional validity. Equally usefully, such
analysis can also illuminate the limits of functional validity for
particular software, such as that for model-fitting.

2.2 Object-Oriented Programming

The main ideas of object-oriented programming are also quite simple and
intuitive:

1. Everything we compute with is an object, and objects should be structured
   to suit the goals of our computations.
2. For this, the key programming tool is a class definition saying that
   objects belonging to this class share structure defined by properties
   they all have, with the properties being themselves objects of some
   specified class.
3. A class can inherit from (contain) a simpler superclass, such that an
   object of this class is also an object of the superclass.
4. In order to compute with objects, we can define methods that are only
   used when objects are of certain classes.

Many programming languages reflect these ideas, either from their inception
or by adding some or all of the ideas to an existing language.

Is R an OOP language? Not from its inception, but it has added important
software reflecting the ideas. In fact, it has done so in at least three
separate forms, giving rise to some confusion that this paper attempts to
reduce.

Some of the confusion arises from not recognizing that the final item in the
list above can be implemented in radically different ways, depending on the
general paradigm of the programming language. A key distinction is whether
the methods are to be embedded in some form of functional programming.

Traditionally, most languages adopting the OOP paradigm are not functional;
either the language began with objects and classes as a central motivation
(SIMULA, Java) or added the paradigm to an existing non-functional language
(C++, Python). In such languages, methods were naturally associated with
classes, essentially as callable properties of the objects. The language
would then include syntax to call or invoke a method on a particular object,
most often using the infix operator “.”. The class definition then
encapsulates all the software for the class. Where methods are needed for
other computations, such as special method names in Python or operator
overloading in C++, these are provided by ad hoc mechanisms in the language,
but the method remains part of the class definition.

In a language that is functional or that aspires to behave functionally as S
and R do, the natural role
of methods corresponds to the intuitive meaning of “method”: a technique for
computing the desired result of a function call. In functional OOP, the
particular computational technique is chosen because one or more arguments
are objects from recognized classes.

Methods in this situation belong to functions, not to classes; the functions
are generic. In the simplest and most common case, referred to as a standard
generic function in R, the function defines the formal arguments but
otherwise consists of nothing but a table of the corresponding methods plus
a command to select the method in the table that matches the classes of the
arguments. The selected method is a function; the call to the generic is
then evaluated as a call to the selected method.

We will refer to this form of object-oriented programming as functional OOP
as opposed to the encapsulated form in which methods are part of the class
definition.

2.3 Their Relationship to R

To understand computations in R, two slogans are helpful:

• Everything that exists is an object.
• Everything that happens is a function call.

In contrast to languages such as Java and C++ where objects are distinct
from more primitive data types, every reference in R is to an object, in
particular, to a single internal structure type in the underlying C
implementation. This applies to data in the usual sense and also to all
parts of the language itself, such as function definitions and function
calls. Computations that are more complex than a constant or a simple name
are all treated as function calls by the R evaluator, with control
structures and operators simply alternative syntax hiding the function call.
[Details and examples are shown in (Chambers (2008), pages 458–468).]

The two slogans, however, do not imply that computations in R must follow
either functional or object-oriented programming in the senses outlined in
the preceding sections. With respect to object-oriented programming, R has
several implementations that have evolved as outlined in Section 3. These
can be used by programmers to provide software following either of the OOP
paradigms.

Functional programming’s relationship to R is less straightforward. The
evaluation process in R does not enforce functional programming, but does
encourage it to a degree. In particular, the evaluation process in R
contributes to functional programming by largely avoiding side effects when
function calls are evaluated, but some mechanisms in the language and
especially in the underlying support code can behave in a non-functional
way. To understand in a bit more detail, we need to examine this evaluation
process.

Computations in R are carried out by the R evaluator by evaluating function
call objects. These have an expression for the function definition (usually
a reference to it by name) and zero or more expressions for the arguments to
the call. The full details are somewhat beyond our scope here, but an
essential question is how references to objects are handled. Any programming
language must have references to data, which in R means references to
objects. As discussed in Section 3, the evolution of such references is
central to the evolution of programming languages, especially for
statistics.

In R a reference to an object is the combination of a name and a context in
which to look up that name; the contexts in R are themselves objects, of
type “environment”. A reference is therefore the combination of a name and
an environment. (We’ll look at an example shortly.)

Note that we are talking about references to objects; most objects in R are
not themselves reference objects. Languages implementing OOP in the
traditional, non-functional form essentially always include reference
objects, in particular, what are termed mutable references. If a method
alters an object, say, by assigning new values to some of its properties,
all references to that object see the change, regardless of the context of
the call to the method. Whether the reassignment of the property takes place
where the object originated or down in some other method makes no
difference; the object itself is the reference.

In contrast, the reference in R consists of a name and an environment,
namely the environment in which the object referred to has been assigned
with that name. Most R programming is based on a concept of local
references; that is, reassigning part of an object referred to by name
alters the object referred to by that name, but only in the local
environment. If that local reference started out as a reference in some
other environment, that other reference is still to the original object.
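
The contrast between local references and mutable references can be sketched
in a few lines of R. This is only an illustration under stated assumptions:
the function names addColumn() and bump(), the column name Ratio and the
object counter are hypothetical and do not come from the paper; iris is the
dataset supplied with R.

```r
# Hypothetical helper: replaces part of its argument and returns the result.
addColumn <- function(data) {
  # The replacement alters only the object that `data` refers to in the
  # environment of this call; the caller's object is left untouched.
  data$Ratio <- data$Sepal.Width / data$Sepal.Length
  data
}

original <- iris
modified <- addColumn(original)

"Ratio" %in% names(original)   # FALSE: the caller's reference is unchanged
"Ratio" %in% names(modified)   # TRUE: only the returned object has the column

# By contrast, an environment is a mutable reference object: a change made
# inside a function is seen through every reference to that environment.
counter <- new.env()
counter$n <- 0
bump <- function(env) {
  env$n <- env$n + 1
  invisible(NULL)
}
bump(counter)
counter$n                      # 1: the change is visible to the caller
```

The first function behaves functionally in the sense of Section 2.1, since
its value depends only on its argument; the second has a side effect by
design.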

To understand the relation of local references to functional programming in
R, an example and a few more details of function call evaluation are needed.
R evaluates function calls as objects. For example, when the evaluator
encounters the call

    lm(Sepal.Width ~ . - Sepal.Length, iris),

it uses the object representing the call to create an environment for the
evaluation.

The call identifies the function, also an object of course, typically
referring to it by name. In this case lm refers to an object in the stats
package. That object has formal arguments [14 of them, in the case of lm()].
The evaluator initializes an environment for the call with objects
corresponding to the formal arguments, as unevaluated expressions built from
the two actual arguments and default expressions found in the function
definition. For details see Section 4 of the language definition, R Core
Team (2013) and Chapter 13 of Chambers (2008). As an aside, the common use
of terms like “call by value” (and the contrasting “call by reference”) for
argument passing in R is invalid and misleading. Arguments are not “passed”
in the usual sense.

Local references operate on all the objects in the environment to prevent
side effects. The formal argument data to lm() matches the expression iris,
which refers to an object in the datasets package. Expressions that extract
information from data work on that object. But the local reference defined
by data and the environment of the evaluation is distinct from the reference
to iris in the package. If an assignment or replacement expression is
encountered that would alter data, the evaluator will duplicate the object
first to ensure locality of the reference.

The local reference paradigm is helpful in validating the functionality of
an R function. Only the local assignments and replacements need to be
examined; calls to other functions will not alter references in this
environment, so long as those functions stick to local reference behavior.
If a function f() calls a function g() and both functions stick to local
reference assignments, then knowing that the value of a call to g() depends
only on the arguments is all that is needed; how g() computes that value is
irrelevant.

While local references help avoid side effects, they do not prevent
computations from referring to objects or other data outside the functions
being called, and therefore potentially returning a result that depends on a
non-functional “state.” Whether a particular computation in R is strictly
functional can only be determined by examining it in detail, including all
the functions that call code in C or Fortran.

The rest of this section takes a slight detour to consider how one might do
that examination.

Validating Functionality in R

In principle, the functional validity of particular computations could be
analyzed and either certified or the limitations to functionality reported.
Such functional validation would be useful in cases where either the
theoretical validity or the implications of the result in an application are
being questioned. Fitting models to data provides a natural example for both
aspects. Given a function taking as arguments data and a model specification
and returning a fitted model object, can one validate that the returned
object is functionally defined by the arguments? If not, can the
non-functionality be parametrized meaningfully, in which case one can
construct a functional version of the computation by including such
parameters as implicit arguments? R does not have organized support for such
validity investigations, but developing tools for the purpose would be a
worthwhile project.

Functional validation is a bottom-up construction. The bottom layer consists
of any functions called that are not implemented in R, typically those that
call routines in C++, C or Fortran. Included are the R primitives, routines
from numerical libraries and a variety of other standard sources, plus any
new code brought in to implement the computation in question. The functional
validity of each of these is an empirical assertion. Some are clearly
non-functional, such as the “<<-” operator and assign() function that do
nonlocal assignments.

Many computations in R eventually call subprograms not originally written
for R. Each of these must be examined for potential non-functional behavior,
sometimes a daunting task. However, good practice in using well-tested,
preferably open-source supporting software will often provide a plausible
basis.

If R code includes an interface to code in C, Fortran or other languages
whose functional validity cannot be established, nothing more can be said.
Other than such code, functional validity is likely to fail for one of three
reasons:

• dependence on nonlocal values;
• using low-level computations in R known to violate functionality;
• changing functions or other objects at run time.

A prime example of the first is the use of external data, such as the global
options object, for convergence tolerances or other parameters for iterative
numerical computations. An example of the second is the inclusion of
pseudo-random values in the calculation. The third problem might be caused,
for example, by using a function from the global environment.

The third danger is greatly reduced when the code resides in the namespace
of a package with explicit import rules. Any reasonable approach to
validating functionality would make this a requirement.

My feeling is that most examples of failures could be corrected to create
functionally valid extensions of the computation in question. Tolerances are
often organized through the R options() function, explicitly designed to
avoid functional programming by allowing users to set state parameters that
are then queried by the calculation. Once identified, such options could be
converted to additional arguments to the function being validated. [A
general mechanism would be a version of getOption() that required the option
in question to be supplied as an argument.]

Pseudo-random values are used in a variety of procedures, including some
optimization techniques where they are expected to provide more robust
numerical behavior by jittering values during iteration. These can be made
functionally valid by using well-defined generator software, such as that
supplied in R itself, and by treating the initial state of the generator as
another nonlocal value to be incorporated as an additional argument. One
should always include an explicit initialization via set.seed() in any
example expected to be reproducible, and that practice can be the basis for
a functionally valid version of the computation.

Beyond these specific examples, numerical computations often depend on the
underlying parameters of the floating-point computations, for example, to
select convergence criteria for iteration. Fortunately, several decades of
work by numerical analysts and hardware designers have greatly standardized
the specification of the numerical engine in modern computers: just knowing
32-bit or 64-bit gets us a long way.

Developing a framework for validating functionality seems to me an
interesting cooperative research direction that could be of value to the
statistical community.

3. THE EVOLUTION OF FUNCTIONAL PROGRAMMING, OOP AND R

The computational paradigms for functional programming and for
object-oriented programming have evolved from a sequence of changes in
software, beginning with the earliest programmable computers. During the
same period, software for statistics was also evolving, one thread of which
led through early libraries to S and then to R.

There may be an appearance of earlier languages being replaced by later and
presumably improved approaches. It is true that each major revision asserts
improvements that will extend our abilities to express our ideas in
software. However, none of the versions of S or R actually totally replaced
earlier software paradigms.

The current software in, and interfaced from, R illustrates this evolution.
R has developed important new techniques, but originated from the S
language, reproducing nearly all of S as it was described at that time. S in
turn went through several evolutionary changes and was itself based on
extensive earlier software, particularly subroutine libraries for Fortran
programming. Examining the history shows that a surprising portion of what
we see now is structure inherited from the early stages.

The form in which functional programming and OOP were adopted was also
influenced by the existing software. Examining the history will explain many
of the choices made.

3.1 From Hardware to Data and Libraries

The earliest general-purpose computers were programmed in terms of the
physical machine, its storage and the basic operations provided to move data
around and perform arithmetic and other operations. The IBM 650 (Figure 1)
was probably the first computer widely sold and used (and the machine on
which I did my first programming, around 1960).

In this pre-silicon world, storage for data or programs resided on a
rotating magnetic drum, holding 2000 decimal words. Data could be read or
written only when the corresponding segment of the drum passed under the
appropriate fixed head, so that physical positioning of data was a serious
aspect of performance. With this close view of the hardware, programming
languages (assembly languages for the actual machine instructions) defined
storage in terms of single physical units (words in the 650) and blocks of
sequential storage.

Fig. 1. An IBM 650 computer, mid 1950s. Under the glass is the magnetic drum storage unit (memory), 2000 words for
data and programs.

This was not an environment to encourage abstraction of ideas about data.
However, by 1960 the first generation of “high-level” languages had been
introduced and would support profound changes. For statistical computing
this meant primarily Fortran.

In terms of data storage, Fortran actually continued the basic notion of
single items (scalars) and contiguous blocks (arrays). Two major changes,
however, were made. First, the contents were described in terms of their
content, the first data types including integer and floating point numbers.
Second, the language encouraged operations that iterated over the contents
of the arrays. By interpreting an array as a sequence of equal-length
subarrays, this indexing extended to matrices and to multi-way tables.

Along with the new paradigm for data and facilities for iteration, the
high-level languages encouraged software to be organized in subroutines, so
that a computational method could be realized as one or several units of
software. While the changes may seem modest from the current perspective,
they in fact supported a major revolution in scientific computing generally
and emphatically so in computing for statistics.

Algorithm series and other publications supported by professional societies
began to accumulate refereed, trustworthy procedures for many key
computations. The statistics research group at Bell Labs developed a large
Fortran library that reflected our needs and our philosophy of research and
data analysis. The book “Computational Methods for Data Analysis”, Chambers
(1977), did not present software but did reflect the tools that would later
form the basis for S. After an introduction and discussion of program
design, the remaining six chapters covered computations supported by the
library:

3. Data Management and Manipulation (including sorting and table lookup).
4. Numerical Computations (approximations, Fourier transforms, integration).
5. Linear Models (numerical linear algebra, regression, multivariate
   methods).
6. Nonlinear Models (optimization, nonlinear least squares).
7. Simulation of Random Processes (random number generation and Monte
   Carlo).
8. Computational Graphics (plotting techniques, “bug reports” came to us as a result of confusing


scatter plots, histograms and probability plots). an “I” and a “1” when typing in the stable dis-
tribution software, Chambers, Mallows and Stuck
Each of these was supported in the pre-S era by
subroutines that would then become the basis for (1976).]
corresponding functions in S. Substantial in-house libraries, such as the one at
Much of the organization for basic tools in R has Bell Labs, gave users a fairly wide range of com-
inherited, through S, the structure of the subrou- putations, supported by improved numerical and
tine library. That includes the graphical computa- other algorithms. However, to apply the computa-
tions, in particular, features essential to S and R: tions specifically to a particular dataset with partic-
separation of graphic device specification from plotting; the plot, figure and margins structure; graphical parameter specification to control style. These were not created for S but taken over from previous Fortran software, described in Becker and Chambers (1977).

The Bell Labs software was in the background of Chambers (1977), but general readers were given instructions for obtaining similar software from publicly available sources for the methods described. The procedure would not always be simple, but the potential availability marked a big step forward. For the first time, statisticians could draw on an extensive range of relevant software to support their research, at least in principle. Various statistical software packages had existed for some time, but these were by and large oriented to routine analysis, to teaching or to specialized statistical techniques. Chambers (1977) and the software it reflected were aimed at research in statistics and challenging data analysis. For this purpose, a more general and open-ended approach was needed.

3.2 From Fortran to S

For those involved with statistical theory or applications, in academia or industry, there were two main limitations to the software described so far: availability and the programming interface. The Appendix to Chambers (1977) was a set of tables for each of the chapters, with rows corresponding to computational tools that were more or less available to readers. The last column of the table listed sources for the corresponding software. The entries in that column were not uniformly helpful; in the best situation, a generally available program library could be ordered that provided a number of the subroutines, but these were not designed for statistical applications, most being directed at numerical methods typically motivated by applications in physics. More than half of the entries read “Listing,” implying a laborious and error-prone manual procedure for the user. [As an example, many

ular results in mind required some substantial additional Fortran programming. That programming had to be repeated and revised for each analysis or research question.

In the 1970s the situation was therefore a combination of improved basic computational capabilities but with a high programming barrier for most statisticians. The classical linear regression in Fortran as shown in Becker and Chambers (1985), for example, was fairly straightforward:

    call lsfit(X, N, P, y, coef, resid)

This computes the fitted model and returns it as vectors of coefficients and residuals. The data as objects are restricted to arrays, a matrix X and vector y for the data and two arrays, coef and resid for the fitted model. The structure of the objects and their storage allocation remains the programmer’s responsibility. Linking the basic computation to the data in an actual analysis remained nontrivial and mistakes along the way were likely. And this is for the most standard of models. Even given an extensive library, the programming to apply the tools to most applications was a laborious, error-prone activity, usually assigned to dedicated programmers, research assistants or students. The statistician’s ideas went through nontrivial translation before they were expressed as computations.

The first two versions of S were designed to provide an “interactive environment” that included the computational areas described in Chambers (1977) and that allowed the statistician to formulate ideas directly for computation. The second version of S was licensed for general use and described in Becker and Chambers (1984).

In S, the linear regression computation became a simpler expression, storage for data was provided automatically and the returned model was now an object, with components for the coefficients and residuals:

    fit <- reg(X, y)
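
The same contrast can still be tried in current R. The following is a minimal sketch, not the original S code: it uses R's own lsfit() function as a stand-in for the S function reg(), and the simulated data and variable names are assumptions made only for illustration.

```r
# Simulated data: purely illustrative, not taken from the paper.
set.seed(1)
X <- cbind(x1 = rnorm(20), x2 = rnorm(20))
y <- 1 + 2 * X[, "x1"] - X[, "x2"] + rnorm(20, sd = 0.1)

# A single call; no dimensions or output arrays are passed in,
# and storage for the results is allocated automatically.
fit <- lsfit(X, y)

# The fitted model is one object with named components.
names(fit)
fit$coefficients
length(fit$residuals)
```

Compared with the Fortran call, the dimensions N and P and the output arrays coef and resid have disappeared from the interface; they are implied by, or contained in, the objects themselves.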

At this stage, S had a functional appearance, not radically unlike R, but its paradigm was essentially an extension of the Fortran view. Dynamically created, self-describing objects were assigned in a single workspace, but the underlying computations were those of the earlier subroutine library: The functions in S, documented in Becker and Chambers (1984), were in fact interfaces to Fortran subroutines: reg() would in fact be programmed by calling lsfit().

Although there was a macro facility in the language, programming a function in this version of S meant “extending S” as described in the book of that name, Becker and Chambers (1985). The definition of the new function was programmed in an “interface language” built on Fortran and compiled from its Fortran translation. As the main programming mechanism this was unsatisfactory, in the sense that extending the language had a substantial learning barrier beyond using the language. The ability to access other software via an inter-system interface remains a key feature of R, however, one still under active development.

Equally as important as the technical side was the beginning of a network of statisticians involved in creating and sharing software through the medium of the language. S was licensed from the early 1980s, available thanks to the newly distributed UNIX operating system, with inexpensive academic licenses to encourage adoption by university researchers, also following the example of UNIX. Open-source software was not an option, but the research community was increasingly involved and their interest stimulated further developments on our part, particularly from contacts with interested users belonging to a “beta testing” network.

Simultaneously, we were thinking about a new approach to the language itself, emphasizing the programming aspect of creating new software for statistical and other quantitative applications. Described initially in Chambers (1987) as a language separate from S, this research later merged with other changes to form the next version, labeled S3 and described in the “blue book,” Becker, Chambers and Wilks (1988). The slogans in Section 2.3 were basic to this version of S: everything is an object (stated explicitly) and function calls do all the computation (implicit).

This was functional programming (more or less) and object-based but not object-oriented. Objects were given structure through attributes attached to vectors and through named components, but there were no classes or methods.

3.3 From Data to Classes and Methods

The languages that originated the concepts of classes, properties, inheritance and methods came out of several motivations. The first, Simula, was concerned with simulating systems. In retrospect, modeling by simulation and modeling by fitting to data have clear correspondences but with quite a different perspective. For an example, suppose we want to simulate a simple model for an evolving population of individuals. In R notation, but quite in the style of Simula, we define a class SimplePop. An object from this class is a specific realization of the model population with properties that define the probabilities of birth and death, and a vector of population size at each generation. An object from the population is created by calling the generator for the class:

    p <- SimplePop(birth = 0.08,
                   death = 0.1,
                   size = 100)

Rather than a single functional computation as in the case of linear regression, computations proceed by simulating the evolution of the population object p. The object itself evolves; in the terminology of OOP, it is a mutable reference.

A corresponding difference in the programming paradigms of S and the emerging OOP languages was that the latter did not take a functional view of computation. Instead, computations largely consisted of invoking a method on an object. In the SimplePop example, the fundamental computation is to simulate one generation of the evolution by invoking the evolve() method

    p$evolve()

The value returned by this method is irrelevant. The method’s purpose is to change the object, in this case by simulating one further generation and appending the resulting value to a property in the object, namely, p$size. (See files “SimplePop.R” and “SimplePopExample.R” in the supplementary materials.)

Following the development of Simula in the late 1960s, a variety of languages adopted this paradigm. C++ added classes and methods to the C language; like C, it was initially used for a variety of programming tasks implementing UNIX and application software for UNIX. In contrast to the “add-on” nature of C++, the Smalltalk language was a very pure,
simplified realization of the ideas in Simula. Its major, and revolutionary, application was to implement the graphical user interface created at Xerox PARC in the 1970s. Many other versions of encapsulated OOP followed, either added on to existing languages or incorporated into new languages from the start.

Dialects of the Lisp language and languages based on Lisp also incorporated OOP in various forms. During the 1980s, several research projects built statistical software on the basis of these languages, including some elegant and potentially widely applicable systems, notably LISP-STAT, Tierney (1990). As it turned out, however, the most widely used version of OOP for statistical applications would come from a somewhat casual approach in S.

3.4 Functional OOP in S and R

The chief motivation for introducing classes and functional methods to S was the initial application: fitting, examining and modifying diverse kinds of statistical models for data. This remains arguably the most compelling example for functional OOP in statistics. The “Statistical Models in S” project reported in Chambers and Hastie (1992), the “white book,” brought together ten authors presenting software for a variety of statistical models, from linear regression to tree-based models. The different models were presented as consistently as possible.

Each type of model had a definition as an object having the information, such as coefficients and other properties, required. The object was created by a corresponding function taking as arguments the data, model description and possibly other controlling parameters. A linear regression fit, for example, called the function lm():

    irisFit <- lm(Sepal.Width ~ . - Sepal.Length, iris)

and returned a corresponding linear regression object. Further computations on this object would examine the model, return information about it, or update the fit. The underlying computations still used basic software similar to that for lsfit() and reg(). However, the description of the model (a formula) and the data (a data frame) were designed to apply to statistical models generally. For example, to fit a generalized linear model the user called glm() with formula and data arguments typically similar to those in a call to lm(). Other arguments would provide information suitable to the particular type of model (a link function, e.g.).

For the convenience of the user, further computations should have a uniform appearance. To print or plot the fitted model or to compute predictions or an updated model corresponding to new data, the user should call the same function [print(), plot(), predict() or update()] in the same way, regardless of the type of model. The owner of the software for a particular type of model, on the other hand, would like to write just that version of each function, without being responsible for the other versions.

Once stated, this is essentially a prescription for functional OOP: a class of objects for each kind of model, generic functions for the computations on the objects and methods for each function for each class. Where one class of models is an extension of another (analysis of variance as a subclass of linear models, e.g.), methods can be inherited when that makes sense.

An implementation of generic functions and methods was introduced as part of the statistical models project and described in the Appendix to the white book. The central mechanism was an explicit method dispatch. The function print(), for example, would evaluate the expression:

    UseMethod("print")

The evaluation of this call would examine the “class” attribute of the first formal argument to the function. If present, this would be a character vector. Eligible methods would be those matching one of the strings in the class vector; if none matched, a method matching the string “default” would be used. Inheritance was implemented by having more than one string in the class, with the first string being “the” class and the remainder corresponding to inherited behavior.

Chambers and Hastie (1992), in the discussion of classes and methods, noted that S differed from other OOP languages because of its functional programming style. In fact, this version of functional OOP finessed the resulting distinction from encapsulated OOP in two ways. First, the methods were dispatched according to a single argument, the first formal argument of the generic function in principle. As a result, the methods were unambiguously associated with a single class, as they would be in encapsulated OOP. Methods were actually dispatched on either argument to the usual binary operators, but a number of encapsulated OOP languages do the same, under the euphemism of operator overloading.
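
The dispatch rule described above can be tried directly in current R. The following is only a sketch: the generic describe() and the classes "linmod" and "aovmod" are hypothetical names invented for this illustration, and NextMethod(), a standard R function not discussed in the text, is used to pass control to the inherited method.

```r
# A hypothetical generic function: it does nothing except dispatch.
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions; the ".default" method is used
# when no string in the class vector matches.
describe.default <- function(x, ...) "some object"
describe.linmod <- function(x, ...) "linear model"
describe.aovmod <- function(x, ...) {
  # NextMethod() continues dispatch along the rest of the class vector.
  paste("analysis of variance, a kind of", NextMethod())
}

# Inheritance is simply a class vector with more than one string:
# the first string is "the" class, the rest are inherited behavior.
fit1 <- structure(list(), class = "linmod")
fit2 <- structure(list(), class = c("aovmod", "linmod"))

describe(fit1)   # uses describe.linmod
describe(fit2)   # uses describe.aovmod, then describe.linmod
describe(1:3)    # no matching class string, so describe.default
```

Nothing in this sketch declares that "aovmod" extends "linmod"; the relationship exists only in the class vector attached to the individual object fit2.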

Second, the question of whether methods belonged to a class or a function was avoided by not having them belong to either. Methods were assigned as ordinary functions and identified by the pattern of their name: “function.class”. In any case, there were no class objects and generic functions were ordinary functions that invoked UseMethod() to select and call the appropriate method. Neither the function nor the class was able to own the methods.

Technically, the method dispatch in this version of OOP was instance-based, not class-based, since no rule enforced a consistent set of classes, that is, that all objects with a given first class string would have identical following strings for the superclasses. (R for some time had an S3 class in the base package with a main class string “POSIXt”, representing date/times, that could be followed in different objects by one of two strings that in fact represented specializations, i.e., subclasses, of “POSIXt”.)

The classes and methods implemented for statistical models constituted a bare-bones version of functional OOP, which is not to imply that this was a bad idea. Advantages include a relatively low learning barrier for programming and a thin implementation layer above the previously existing language, which in turn means less computational overhead in some circumstances. [Interestingly, the encapsulated OOP of Python has a similarly thin implementation, with classes containing methods but without defining the properties. A very analogous defense is made for that implementation, in Section 9 of the Python tutorial, Python (2013), e.g.]

A more formal version of functional OOP was developed at Bell Labs, introduced into S in the late 1990s and described in Chambers (1998). By this time, S-based software was exclusively licensed to the Insightful Corporation, which later purchased the rights to the S software, in 2004, and was itself subsequently purchased by Tibco.

The new paradigm differed from S3 classes and methods in three main ways:

1. Methods could be specified for an arbitrary subset of the formal arguments, and method dispatch would find the best match to the classes of the corresponding arguments in a call to the generic function.
2. Classes were defined explicitly with given properties (the slots) and optional superclasses for inheriting both properties and methods.
3. Generic functions, methods and class definitions were themselves objects of formally defined classes, giving the paradigm reflectivity.

The new paradigm was part of the version of S described in the 1998 book and generally referred to as S4. The S4 label is generally applied to this OOP paradigm, whether in S or R. S4 methods never had much chance of replacing S3 methods. In practice, many S4 generic functions were based on functions that already dispatched S3 methods. In this case, the S3 generic function became the default S4 method.

The work on S4 paralleled in time the arrival of R and its conversion into a broad-based joint project following the initial publication by Ihaka and Gentleman (1996). The implementation of R was designed to provide the functionality for S described in the blue book and white book, including S3 methods. Beginning in 2000, an implementation of the S4 version of OOP was added to R. The “Software for Data Analysis” book, Chambers (2008), includes a description of the R version.

Both versions of functional OOP will remain in R. Many prefer the simplicity of the old form, and in any case the very large body of existing code will not be discarded, and should not be. Some important extensions have been made, for example, by registering the S3 methods from a package. Major forward-looking projects have typically used the newer version, for example, the Bioconductor project for bioinformatics software, Gentleman et al. (2004), and the Rcpp interface to C++, Eddelbuettel and François (2011). Recent changes, such as making the S3 and S4 versions of inheritance as compatible as possible, have been aimed at helping the two forms to coexist productively.

Any programming paradigm with some degree of formality is likely to have a higher initial learning barrier and require some extra specification from the programmer. A comparison of encapsulated OOP programming with Python to that with Java is an interesting parallel to S3 and S4. In both examples, the less formal version is likely to be quicker to learn, while the more formal version provides more information about the resulting software. That information in turn can support some forms of validation for the resulting software, as well as tools to analyze and describe it. Python and Java being rather different languages in other respects as well, projects are not too likely to make a choice between them
based solely on the formality of the object-oriented programming.

With R, a conscious choice is more likely. The arguments for a more formal approach apply particularly, in my opinion, to projects with one or more of the characteristics: a substantial amount of software is likely to be written; the application has a fairly wide scope in terms of either the data or the computing methods; or the validity and reliability of the resulting software is important.

Nothing prevents good software from being written without formal tools in this case, nor bad software from being written with them. However, there are several potential benefits that can be summarized in parallel with the main innovations noted above:

1. Allowing methods to depend on multiple arguments fits the functional paradigm in R, in which the arguments collectively define the domain of the function. Many functions in R are naturally applied to different classes of objects, not necessarily corresponding to the first argument, or only to one argument. For example, when binary operators such as arithmetic are defined for a new class, a clean design of methods for the operators often needs to distinguish three cases: the first operand only belonging to the new class, the second operand only or both operands.
2. A formal definition for a class allows programmers to rely on the properties of objects generated from the class. Otherwise, the nature of the objects can only be inferred, if at all, from analyzing all the software that creates or modifies an object of this class.
3. Having formal definitions for the generic functions, methods and class definitions themselves supports a growing set of tools for installing and using packages that include such functions, methods or classes.

The benefits of a general, reliable form of functional OOP extend to developments in the language itself. For example, reference classes were built on the S4 classes and methods, with no internal changes to the R evaluator required.

3.5 Reference Classes

Functional OOP remains an active area in R. In addition, reference classes, introduced to R in 2010 in version 2.12.0, provide an implementation of encapsulated OOP. Class definitions include the properties of the class with optional type declarations; properties may also be optionally declared read-only. Class definitions are themselves objects available at runtime. Methods are programmed as R functions, in which the object itself is implicitly available, not an explicit argument. Methods can access or assign properties in the object by name. These characteristics make the implementation more Java-like, say, than Python- or C++-like.

The programmer defines a reference class in the R style, calling setRefClass() instead of setClass(). The call returns a generator for the class and saves the class definition object as a side effect, as does setClass() for S4 classes.

As a side comment, while R uses a model for most of its objects and computations that is fundamentally different from the object references in encapsulated OOP, a few key features made the implementation of reference classes in R possible and even relatively straightforward. Most importantly, the R data type “environment” provides a vehicle for object references and properties. Environments are universal in R and well supported by programming tools. In particular, the active binding mechanism, which allows access and assignment operations on objects in environments to be programmed in R, was valuable in the implementation.

Reference classes allow the use of encapsulated OOP for objects that suit that paradigm more naturally than they do functional OOP. As noted in Section 3.3, the essential distinction between functional and encapsulated OOP is whether an object is created, once, by a function call or is instead a mutable object that changes as methods are invoked.

Statistical computing has examples clearly suited to each of these paradigms. The linear model returned by lm() is not open to mutation. Change the numbers in the coefficients or residuals and you no longer have an object that should belong to that class. In contrast, a model simulating a dynamic process such as the SimplePop class in Section 3.3 exists precisely for the purpose of changing, with its evolution being the central point of interest. Other, less directly statistical computations in R also may correspond to mutable objects, for example, the frames or other objects in a graphical interface.

Not every case is clear cut. Sometimes, essentially the same class structure may be more appropriate for functional or encapsulated classes depending on the purpose of the computation. Data frames are a prime example. This essential object structure is
viewed naturally as functional when it is part of a functional object related to the data frame. For example, a fitted model that wanted to be fully reproducible could return the data frame on which the fitting was based [e.g., lm() includes the model frame it constructs]. Such a data frame is clearly functional; again, change it and you invalidate the model. On the other hand, a data frame to be used in data cleaning and editing is an object that needs to be mutable.

Having both paradigms in a single language is unusual. Some functional-style languages have implemented functional OOP, notably Dylan, interesting for its parallels with OOP in R; see Shalit (1996), particularly the discussion of method dispatch. Other languages with a functional structure have nevertheless added what is essentially encapsulated OOP, for example, Odersky, Spoon and Venners (2010) for the case of Scala.

We hope that providing both paradigms in R encourages software design that is natural for the application. It does at the same time pose some subtleties. Reference classes and reference class objects are somewhat abnormal in R. One needs to understand the distinctions from standard R objects.

The key is the local reference mechanism noted in Section 2.3. The R evaluator enforces local reference by duplicating an object when a computation might alter a nonlocal reference. Certain object types are exceptions that are not duplicated. The important exception is type “environment”. Reference classes are implemented by extending this type. Encapsulated OOP in R uses no special form of the function call. Method invocation is just a call to the “$” operator, for which reference classes have an S4 method. Reference semantics are obtained by one basic fact: environments are never duplicated automatically. The S4 class mechanism in R nevertheless allows one to subclass the “environment” type in order to define reference class behavior.

The objects in the fields of a reference class object can be ordinary R objects. They behave just as usual and when used in function calls will have regular local reference behavior in that call. It is only when fields in the reference object itself are replaced that the encapsulated OOP is relevant.

Reference class objects are also good candidates for interfaces to other languages that implement the same OOP paradigm, such as Java, C++ or Python. The R object could be a proxy for an object in the other language with methods invoked in R but executed on the original object. The Rcpp interface to C++, Eddelbuettel and François (2011), has a mechanism for extending C++ classes in this way. C++ classes can only be inferred from the source, meaning that either the programmer must supply the interface information (as in the current implementation) or some processing of the source must be applied (currently used to export functions from C++ but not classes). Java classes are accessible as objects, via “reflection” in Java terminology, so that in principle proxy classes in R should be possible. The rJavax package by Danenberg (2011) has an initial implementation. For Python, methods are available from the objects but properties are not formally defined. At the time of writing, basic interfaces to Python exist, for example, Grothendieck and Bellosta (2012), which could be extended to support class interfaces, with methods but not properties inferred from the Python class objects.

Further work on these and other inter-system interfaces would be a valuable contribution to the user community.

4. SUMMARY

R plays a major role in the communication and dissemination of new techniques for statistics and for results of statistical research more generally. In particular, the many packages written in R or using R as a base for interfacing to other software constitute an essential, rapidly growing resource. Therefore, the quality of such software and the ability of programmers to create and extend it are important.

The current R language and its supporting functionality are the result of many years of evolution, from early programming libraries through the S language to R, which itself has evolved and accumulated a variety of programming techniques. This evolution has been much influenced by the functional and object-oriented programming paradigms. New versions have continued to include supporting software and programming tools found useful at earlier stages along with improved capabilities.

The programming paradigms become especially relevant when the applications are complex or the quality of the resulting software is important. In particular, the versions of object-oriented programming in R can assist in dealing with complexity of the underlying data. As noted, R implements OOP in two forms, functional and encapsulated. These
are complementary, with one or the other suitable for particular applications. The latter is essentially the form of OOP used in most other languages, but the former is distinctly different. Considerable confusion has arisen in discussions of OOP in R from not noting that distinction, which the present paper has tried to clarify.

More generally, understanding the role of object-oriented and functional programming in R may assist future contributing programmers in using related programming tools. The continuing rapid growth of R-based software and the expanding, challenging range of techniques it has to support make effective programming an important goal for the statistical community.

The importance of object-oriented programming is likely to increase as statistical software takes on new and challenging applications. In particular, the need to deal with increasingly large objects and distributed sources of data will bring in specialized classes of data and will need powerful computing tools. One important direction has been to transform selected software in R, particularly to speed up large-scale computations; see, for example, the companion paper Temple Lang (2014). Complementary to this is to interface to other languages and software when these provide better performance on “big data” and other computationally demanding applications. In particular, interfaces that match with object-oriented treatments for specialized forms of data can exploit the OOP facilities in R. The interface to C++, Eddelbuettel and François (2011), is an example. Further development of such interfaces will be of much benefit.

Functional programming is perhaps not such an obviously hot topic at the moment. However, the underlying philosophy that our software should be in the form of reliable, defensible units is very much part of R. Situations where the validity of statistical computations needs to be defended are likely to increase, given the growing need for statistical treatment of complex problems for science and society.

ACKNOWLEDGMENTS

Thanks to the Associate Editor and the referees for some helpful comments on presentation and content. Thanks especially to Vincent Carey for organizing and encouraging the set of talks and papers of which this is part.

REFERENCES

Becker, R. A. and Chambers, J. M. (1977). Gr-z: A system of graphical subroutines for data analysis. In Proc. Interface Symp. on Statistics and Computing 10 409–415.

Becker, R. A. and Chambers, J. M. (1984). S: An Interactive Environment for Data Analysis and Graphics. Wadsworth, Belmont, CA.

Becker, R. A. and Chambers, J. M. (1985). Extending the S System. Wadsworth, Belmont, CA.

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Chapman & Hall, Boca Raton, FL.

Chambers, J. M. (1977). Computational Methods for Data Analysis. Wiley, New York. MR0659716

Chambers, J. M. (1987). Interface for a quantitative programming environment. In Comp. Sci. and Stat., Proc. 19th Symp. on the Interface 280–286.

Chambers, J. M. (1998). Programming with Data: A Guide to the S Language. Springer, New York.

Chambers, J. M. (2008). Software for Data Analysis: Programming with R. Springer, New York.

Chambers, J. M. and Hastie, T., eds. (1992). Statistical Models in S. Chapman & Hall, Boca Raton, FL.

Chambers, J. M., Mallows, C. L. and Stuck, B. W. (1976). A method for simulating stable random variables. J. Amer. Statist. Assoc. 71 340–344. MR0415982

Danenberg, P. (2011). rJavax: rJava extensions. R package version 0.3. Available at http://CRAN.R-project.org/package=rJavax.

Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40 1–18.

Gentleman, R. C., Carey, V. J., Bates, D. M. et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5 R80.

Grothendieck, G. and Bellosta, C. J. G. (2012). rJython: R interface to Python via Jython. R package version 0.0-4. Available at http://CRAN.R-project.org/package=rJython.

Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. J. Comput. Graph. Statist. 5 299–314.

Odersky, M., Spoon, L. and Venners, B. (2010). Programming in Scala, 2nd ed. Artima, Walnut Creek, CA.

Python (2013). The Python Tutorial. Python. Available at http://docs.python.org/tutorial.

R Core Team (2013). R Language Definition. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-13-5. Available at http://cran.r-project.org/doc/manuals/R-lang.html/.

Shalit, A. (1996). The Dylan Reference Manual. Addison-Wesley, Reading, MA.

Temple Lang, D. (2014). Enhancing R with advanced compilation tools and methods. Statist. Sci. 29 181–200.

Tierney, L. (1990). LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. Wiley, New York.
