Language Learning

ISSN 0023-8333

A Corpus-Based Analysis of the Discourse Functions of Ser/Estar + Adjective in Three Levels of Spanish as FL Learners
Joe Collentine
Northern Arizona University

Yuly Asención-Delaney
Northern Arizona University

Research on the acquisition of Spanish's two copulas, ser and estar, provides an understanding of the interaction among syntax, semantics, pragmatics, morphology, and
vocabulary during development (e.g., Geeslin, 2003a, 2003b; Gunterman, 1992; Ryan
& Lafford, 1992). Recent research suggests that linguistic features in the surrounding discourse influence learners' copula choice. We present a corpus-based analysis
of the lexico-grammatical features co-occurring with copula + adjective usage among
foreign-language learners of Spanish at three levels of instruction. Findings revealed
the following: (a) both ser + adjective and estar + adjective occur at all levels where
little linguistic complexity typically occurs; (b) ser + adjective appears in descriptive
and evaluative discourse; and (c) estar + adjective is present in narrations, descriptions,
and hypothetical discourse.
Keywords: second language acquisition; Spanish interlanguage; learner corpus; corpus
linguistics; grammatical development; ser and estar; copula choice

Introduction
Studying the acquisition of Spanish copulas, ser and estar, interests second
language acquisition (SLA) researchers because it requires studying syntax, semantics, pragmatics, morphology, and vocabulary during development
We wish to thank Dr. Roy St. Laurent of the Northern Arizona University Statistical Consulting
Lab for his valuable assistance in the design of the statistical analyses of this project. Any errors
reside solely with us. Our thanks also go to Dr. Vincent and Dr. Ojeda for their financial support to
transcribe the texts written by the learners.
Correspondence concerning this article should be addressed to Joe Collentine, Northern Arizona
University, Modern Languages, Box 6004, Flagstaff, AZ 86011. Internet: Joseph.Collentine@nau.edu
Language Learning 60:2, June 2010, pp. 409-445
© 2010 Language Learning Research Club, University of Michigan
DOI: 10.1111/j.1467-9922.2010.00563.x

Collentine and Asencion-Delaney

Corpus-Based Analysis of Ser/Estar + Adjective

(Leonetti, 1994). Although this might seem particular to Spanish as a second
language (L2), the acquisition of these verbs shows how learners acquire one
of the two basic Indo-European sentence types (Halliday, 1970): predicative
(e.g., Juan corre rápidamente 'John runs quickly') and attributive sentences
(e.g., Juan es rápido 'John is quick'), with the ser/estar (S/E) distinction
forming the central verbal element of the latter. Pragmatically speaking, the
S/E distinction requires knowing when the relationship between the subject
and adjective involves characterization (María es capaz 'Mary is capable') or
identification (María es la encargada 'Mary is the one in charge'; Fernández
Leborans, 1999). Semantically, S/E can differ aspectually, with estar often connoting the perfective aspect (e.g., that an event's time frame is short and limited
in duration) and ser connoting the imperfective (e.g., the event is habitual)
(Luján, 1981). Morphologically, Spanish adjectives inflect for gender and number, which is especially difficult for learners whose first language (L1) has few
inflections, like English. Finally, the number of adjectives that learners must associate with either ser or estar presents lexical challenges. Geeslin (2003a) and
Silva-Corvalán (1986, 1994) reminded us that even native speakers of Spanish
show much variation in S/E usage with adjectives as a function of pragmatic
considerations.
Traditionally (and in current learner textbooks), ser + adjective segments
describe a subject's permanent, seemingly unchanging characteristics, whereas estar + adjective segments describe temporary, dynamic characteristics
of a subject. It is for this reason that an adjective like aburrido 'boring/bored'
in soy aburrido, which uses ser, produces the meaning 'I am boring', whereas
in estoy aburrido, which uses estar, it yields roughly 'I am bored'; with ser, the
boredom is constant, whereas with estar, the state (and its effect on others)
should pass. Nonetheless, this traditional view has come under much empirical
scrutiny, with the works of Geeslin (2003a) and Silva-Corvalán (1986, 1994)
showing that this explanation only scratches the surface of the pragmatic nuances that native speakers consider when choosing their copula.
Studying the acquisition of S/E provides a means to address various SLA
questions (e.g., orders of acquisition, the role of study abroad), and researchers
have used various methodologies (e.g., error analysis of open-ended conversations, raters judging the semantic intent of learner utterances). Recent S/E
research suggests that learner copula selection is sensitive to lexical and grammatical features (often referred to together as lexico-grammatical features) in
the surrounding discourse. Corpus-linguistics methods are particularly suited
to study the interaction between a construct and its lexical and grammatical context. After reviewing S/E research and the potential contribution of a corpus
analysis, we present a large-scale corpus-based analysis of learners' use of
S/E + adjective at different instructional levels.

Ser/Estar SLA Research to Date


Initial S/E research identified developmental stages in instructed contexts, focusing on accuracy and omission rates. Estar emerges in later stages, especially
in estar + adjective segments. VanPatten (1985, 1987) studied oral interviews,
grammaticality judgments, and informal class observations to propose five
stages: (a) copula absence, (b) ser as the default copula, (c) estar with progressive, (d) estar with locatives, and (e) estar with adjectives of condition. Simplification, communicative value, frequency in input, and L1 transfer influence
these stages (VanPatten, 1987). Researchers have studied whether VanPatten's
stages generalize to study-abroad and Peace Corps experiences (Gunterman,
1992; Ryan & Lafford, 1992). Oral proficiency interviews in both Gunterman's study and Ryan and Lafford's study confirmed most stages, with estar +
adjectives of condition appearing before estar with locatives.
Although accuracy studies reveal that these two copulas develop in a predictable fashion, they do not explain the variability in S/E usage. Additionally,
these studies appeared when SLA research was highly concerned with the role
of input in acquisition, and explanations focused on issues such as the copulas' individual frequency and communicative value/saliency (Ryan & Lafford,
1992; VanPatten, 1985, 1987) in the input. Ryan and Lafford (1992) attributed
the late emergence of estar + adjective to access to naturalistic input. Nonetheless, we know almost nothing about the input (e.g., the types of discourse)
that learners process in naturalistic settings or over the course of a semester
in at-home or study-abroad settings (Collentine, 2008). SLA theory posits that
output (be it from instructional interventions or naturalistic experiences) plays
as strong a role as input at later stages of acquisition (Shehadeh, 2002; Swain,
1985), which is when estar + adjective emerges. What type of communication,
then, do learners generate that coincides with estar + adjective emergence?
Some evidence suggests that copula + adjective production improves as
learners grow in the complexity of the discourse types they generate. Copula +
adjective segments help beginning learners to relate simple messages, containing a subject and a verb without elaboration (e.g., accompanied by adverbs).
Gunterman (1992) noted that when communication became difficult, learners
resorted to ser + adjective segments. "Because the questions typically elicited
descriptions, explanations, and definitions, the [peace corps volunteers] were
able to build a great number of their answers around ser" (Gunterman, 1992,
p. 1297). Descriptive discourse is structurally and semantically basic, depicting
a situation's important nouns and their states (e.g., via adjectives); descriptions
lack dynamic details about events and changes of states. Estar + adjective
appears in Gunterman's data, where learners went beyond descriptions to communicate narrative discourse, which entails both a situation's states and its
events (often chronologically). Lafford (2004) attributed copula + adjective
gains after a single semester of study abroad to "the pragmatic constraints inherent in real-world discourse . . . and perhaps to improved overall narrative and
discursive abilities, proficiency, and fluency" (p. 216; emphasis added). Subsequent S/E research intimated that copula + adjective growth occurs as lexical
and grammatical choices become sensitive to what appears in the surrounding
discourse.
In the copula + adjective segment, natives demonstrate variation in copula selection because each copula affects different pragmatic and discursive
interpretations (Geeslin, 2002; Silva-Corvalán, 1986), and so the copula +
adjective context is ideal for studying how learners encode pragmatic and
discursive information. Geeslin (2002, 2003a, 2003b) focused on different
instructional levels while considering findings from sociolinguistic studies of
copula + adjective language change in bilingual and monolingual communities
(e.g., Silva-Corvalán, 1986) in which semantic, pragmatic, and sociolinguistic
variables such as frame of reference (i.e., comparison with group norm, Juan
es alto 'John is tall', or with the referent, Juan está alto 'John's gotten
tall'), susceptibility of change (i.e., inherent, Juan es inteligente 'John is
intelligent', vs. changing, Juan está viejo 'John's gotten old'), lexical class
of the adjective (e.g., age, nationality), and semantic transparency (El mango es
verde/El mango está verde 'The mango is green/The mango is unripe' vs. Juan
es casado/Juan está casado 'John is married/John is just married') explained
the overuse of estar. Geeslin (2002) collected data from high school students
with a guided interview, a picture-description task, and a contextualized questionnaire, concluding that learners acquire the restriction of susceptibility of
change earlier than the frame of reference restriction. Geeslin (2003a, 2003b)
later examined copula choice with advanced learners using contextualized
questionnaires, finding that semantic and pragmatic features interact to predict
estar usage. She found that whereas advanced learners seem to overgeneralize pragmatic constraints such as frame of reference and experience with the
referent, native speakers favor lexical and semantic constraints (e.g., predicate
type) to decide when to use ser or estar.
Recently, Geeslin (2003b) and Geeslin and Guijarro-Fuentes (2006) suggested that we need to understand the context that surrounds L2 copula choice:

In the case of copula choice, advanced learners apply pragmatic
constraints, even in contexts in which native speakers do not. In contrast,
native speakers choose not to apply pragmatic constraints in favor of
lexical and semantic constraints. (Geeslin, 2003a, p. 751)
In copula selection, L2 Spanish learners may even be more sensitive to contextual factors than native speakers, who appear to depend on local factors
within the attributive copula + adjective segment (i.e., lexical and semantic
constraints related to the interaction of the copula and the adjective alone);
learners are sensitive to a wider context, apparently attending to speaker intent and implicatures (as pragmatic considerations would imply) as well as
to lexico-grammatical features in the surrounding discourse. Geeslin (2003a,
p. 748) noted that words/phrases that imply change near a copula + adjective
segment apparently cause advanced learners to select estar (the copula associated with changing states) even when the relationship between the adjective
and the copula necessitates the use of ser (the copula associated with permanent states). Thus, to better understand the factors surrounding learners' S/E
usage, we might well ask the following:
What are the contextual features that co-occur with each copula + adjective
segment at different levels of instruction?
What types of discourse (e.g., narratives, descriptions) are usually associated with each segment?
Corpus Techniques and the Study of Context
Geeslin's (2002, 2003a, 2003b) research shows by way of rater judgments
that the pragmatic intent of copula + adjective segments influences whether
learners use ser or estar. It is also reasonable to suspect that discourse type
influences copula selection in important ways. Recall that Gunterman (1992)
argued that ser + adjective and estar + adjective segments are distributed within
different discourse types. Additionally, Lafford (2004) related learners' copula
selection gains to the expansion of the types of discourse they can produce.
Myles and Mitchell (2004) argued that SLA researchers should take note that
corpus research examining large collections of digitized documents has had a
considerable role in furthering the field of discourse analysis. Accordingly, the
present study employs a variety of corpus-based techniques to understand the
contextual features that co-occur with ser + adjective and estar + adjective use
in addition to the discursive functions that learners at different levels assign to
ser and estar. As these techniques are not widely utilized in SLA research, in
the following section we not only briefly delimit what corpus-based research
can reveal about SLA, but we also describe important corpus assumptions
and techniques. Because our analysis compares learner data to corpus-based
native-speaker models, we also describe relevant perspectives that recent corpus
research has uncovered about the nature of Spanish discourse.
Not only does a corpus-based approach lend itself to questions of L2 discourse development, but the techniques also permit empirical comparisons
between learner behaviors and native-speaker models. For instance, using an
English learner corpus and two British native-speaker corpora, Siyanova and
Schmitt (2007) found that, in informal speech, learners are less likely to use
two-word verb constructs (e.g., run into, put off) than are native English speakers. One advantage of comparing learner performance to native-speaker models
is that the SLA researcher can make empirically defensible and testable assumptions about the end state of the acquisition process, an approach we adopt in
the present study.
Myles (2005) and Myles and Mitchell (2004) lamented that SLA research
has not been quick to embrace new technologies for collecting and analyzing
data, especially as it relates to corpus linguistics. They argued that corpus linguistics complements the current research by examining large amounts of data
with relative ease, thus increasing the generalizability of findings (Rutherford
& Thomas, 2001). Still, some notable corpus-based SLA research has contributed to our understanding of the role of context in language development (Belz,
2004; Collentine, 2004; Granger, Hung, & Petch-Tyson, 2002; Klein & Purdue,
1997). Some corpus research exists on ser and estar.
Corpus-Based S/E Findings
Corpus-based S/E research provides some evidence that learners' copula choice
is sensitive to contextual factors and gives reason to suspect that Spanish
copula + adjective segments are distributed across different discourse types. Cheng,
Lu, and Giannakouros (2008) examined a corpus of Mandarin Chinese L1
learners of Spanish. They showed how advanced learners' copula choice varies
according to the pragmatic intent of the surrounding discourse they themselves
produce. They reported that exploratory writing evoked greater estar + adjective usage and that estar + adjective is compatible with the semantic and
pragmatic goals of narratives or descriptions. Collentine (2008), in an invited
commentary article on Cheng et al. (2008), conducted a study on whether copula + adjective segments might serve discernable discourse functions in native
Spanish discourse. His analysis uncovered a significant interaction between
copula and text type. Ser + adjective was relatively frequent in almost all types
of discourse, whereas estar + adjective was most frequent in dramas, which
entail much evaluative language and monologues containing descriptions, and
narratives. These two studies suggest that copula + adjective use by learners
and native speakers is not influenced by local features alone (which range from
within the copula + adjective phrase structure to the lexico-grammatical characteristics of the discourse) but also by communicative goals such as the type
of discourse being produced.
Techniques, Tools, and Utility of Corpus-Based Research
Corpus linguistics ranges in complexity. Minimally, it utilizes searchable digitized texts sampled in a representative fashion, depending on the study's focus.
Textual information is critical for statistical procedures (just as it is for individual learners), and so files are tagged with header information, such as topic,
source type, biographical information about the author, and purpose (argumentative essay, narrative). Concordance applications and scripting languages
allow researchers to search for specific segments and tabulate their frequencies
by text. When investigators need to search for morphosyntactic information
(e.g., all adjectives, all verbs whose infinitive is either ser or estar), they often
use a part-of-speech tagger: a series of software modules that annotates every word with information about its major word classes (e.g., adjective, noun,
verb, determiner, preposition), basic morphological information (e.g., plural,
preterit), as well as its lemma (i.e., its unmarked, dictionary root, such as a
verb's infinitive or a noun's masculine, singular form).
Part-of-speech tagging requires a dictionary with lexical and grammatical
information about the possible words in a language (some words have more
than one entry because many word forms are ambiguous between readings). For the present project
we compiled our own dictionary and we utilized a training set (which assists
tagging ambiguous forms) from samples from the Corpus del español (Biber,
Davies, Jones, & Tracey-Ventura, 2006) as well as software routines from the
Natural Language Tool Kit (NLTK; http://www.nltk.org/). After the corpus is
tagged in this way, the investigator must verify the accuracy of the tagging
and fix errors (individual and/or systematic) through further programming.
An increasingly popular technology to create search patterns (regardless of
the tagging software) utilizes regular expressions, a sophisticated wild-card- and variable-based text-search system (e.g., \w{3,} symbolizes words of three
letters or more; \w+ing symbolizes words of any length ending in ing).
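As a minimal illustration of the two example patterns just mentioned, the following Python sketch applies them to a short sentence. The sample sentence and the \b word-boundary anchors are illustrative assumptions, added so that each pattern matches whole words; they are not part of the study's search patterns.

```python
import re

# Hypothetical sample sentence (not taken from the learner corpus).
sample = "We are testing a pattern for finding long words"

# \w{3,}: words of three letters or more (anchored at word boundaries).
three_plus = re.findall(r"\b\w{3,}\b", sample)

# \w+ing: words of any length ending in "ing".
ing_words = re.findall(r"\b\w+ing\b", sample)

print(three_plus)
print(ing_words)
```

Without the \b anchors, \w{3,} would behave the same on this sentence, but \w+ing could also match the beginning of a longer word such as "singer".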
Having a tagged corpus along with the flexibility of regular expressions
provided us with a powerful means of studying a number of lexical and/or
grammatical phenomena. For instance, the pattern \w+\ v[ `] `[ `] `[ `] `
(?:ser|estar)` (?:obvio|evidente) \ j \w+ que \ \w+ is one way to search for
every verb whose lemma is either ser or estar followed by the adjectives obvio
or evidente followed by the conjunction que.
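The kind of lemma-based search described above can be sketched in Python as follows. The form/POS/lemma token format, the tag names (V, ADJ, CONJ), and the sample line are all hypothetical and were chosen only for readability; they do not reproduce the tag scheme used in the study.

```python
import re

# Hypothetical tagged line: each token is form/POS/lemma (an assumed format).
tagged = ("es/V/ser evidente/ADJ/evidente que/CONJ/que Juan/N/Juan "
          "está/V/estar cansado/ADJ/cansado y/CONJ/y era/V/ser "
          "obvio/ADJ/obvio que/CONJ/que sí/ADV/sí")

# Any verb whose lemma is ser or estar, then obvio/evidente, then que.
pattern = re.compile(
    r"(\S+)/V/(?:ser|estar) "          # copula form, matched by its lemma
    r"(\S+)/ADJ/(?:obvio|evidente) "   # target adjective, matched by its lemma
    r"que/CONJ/que"                    # the conjunction que
)

# Keep only the surface forms of each hit.
hits = [f"{verb} {adj} que" for verb, adj in pattern.findall(tagged)]
print(hits)
```

Because the match is on the lemma rather than the surface form, inflected forms such as es and era are both retrieved, whereas está cansado is skipped because its adjective is not one of the two targets.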
It is important to make mention of two common corpus-statistical techniques. The process of norming is a numerical transformation of counts to
account for the fact that individual texts vary in length and that longer texts
can have a greater influence on the numerical distribution of any phenomenon.
Investigators often norm frequency counts to an arbitrary number, such as per
1,000 or per 10,000 words: The count of some phenomenon in a text is divided by the text's total word count, the quotient of which is multiplied by
1,000 (a higher norming multiplier like 100,000 affords greater precision). The
technique known as normalizing involves converting the count of some phenomenon to its z-score value vis-à-vis its count in each document in the corpus
(i.e., the difference of the phenomenon's frequency and its mean occurrence
in the corpus divided by its standard deviation). Normalizing is convenient
for measuring the relative presence of two or more linguistic features within
any given text, as one can easily sum two or more z-scores to calculate how
concentrated those features are in any text or group of documents while taking
into account the fact that some linguistic phenomena are naturally scarce in a
document (e.g., the subjunctive), whereas others are naturally common (e.g.,
articles) (cf. Biber & Conrad, 2001). For instance, the frequency of adverbs of
time and copula + adjective segments are likely to vary in different ways across
the texts of a corpus (e.g., adverbs of time may be generally more frequent).
By summing the two segments' z-scores per document, we can find which texts
have the highest concentration of the two.
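The two transformations can be sketched in a few lines of Python. The counts and text lengths below are hypothetical, and the use of the sample standard deviation is an assumption, as the choice between sample and population formulas is not specified here.

```python
from statistics import mean, stdev

def norm_per(count, text_length, base=1000):
    """Normed frequency: count divided by text length, times the base."""
    return count / text_length * base

def z_scores(values):
    """Normalize per-document frequencies to z-scores (sample SD assumed)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical word counts and raw feature counts for three documents.
lengths = [200, 500, 1000]
copula_adj = [4, 5, 30]
time_adverbs = [2, 20, 10]

copula_normed = [norm_per(c, n) for c, n in zip(copula_adj, lengths)]
adverb_normed = [norm_per(c, n) for c, n in zip(time_adverbs, lengths)]

# Sum the two features' z-scores per document to measure joint concentration.
combined = [a + b for a, b in zip(z_scores(copula_normed), z_scores(adverb_normed))]
densest = max(range(len(combined)), key=combined.__getitem__)
print(copula_normed, adverb_normed, densest)
```

In this toy data the third document has the highest summed z-score even though the second has far more adverbs of time per 1,000 words, because each feature is measured against its own mean and spread.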
Corpus-Based Native-Speaker Models of Discourse
According to Myles and Mitchell (2004), we now have the ability to define
structurally and statistically different discourse types. Thus, the present study
not only compares learners' copula selection behaviors between different levels
of instruction, but it also attempts to identify the types of discourse learners
produce when using S/E + adjective, based on a native-speaker model. Corpus
linguistics has shown through factor analyses how lexico-grammatical structures bundle together to produce different types of discourse (Biber & Conrad,
2001). Biber et al. (2006) provided the first comprehensive analysis of Spanish,
analyzing a 20 million-word Spanish corpus with written and oral data from a
variety of registers. There are four types of discourse that Biber et al. (2006)
identified that learners might well produce in written texts; the features with
which each is associated are presented in Table 1.

Table 1 Discourse dimensions and features targeted in the learner-native speaker comparison (cf. Biber et al., 2006)

Discourse type          Lexico-grammatical features

Informationally rich    Singular and plural nouns
                        Postnominal descriptive adjectives
                        Prenominal descriptive adjectives
                        Definite articles
                        Prepositions
                        Derived nouns
                        Type-token ratio
                        Long words [a]
                        Se passives (i.e., ergative se use)

Hypothetical            Subjunctive use
                        Conditional use
                        Future use
                        Verbs of obligation and causation (e.g., dejar, permitir, hacer + infinitive)
                        Infinitives not preceded by a verb or article
                        Verbs followed by an infinitive
                        Progressive aspect (imperfect use or present participle)
                        Dependent que clauses

Narrative               Clitic usage
                        Imperfect tense/aspect
                        Preterit tense/aspect
                        Possessives
                        Third-person pronouns
                        Reflexive se and changes of states
                        Infinitives not preceded by a verb or article
                        Verbs followed by an infinitive

Descriptive             Postnominal descriptive adjectives
                        Derived nouns
                        Absence of all narrative variables

[a] Defined as those that have an average number of characters in the dataset, plus that
calculation's standard deviation, plus one character; thus, six or more characters.

Informationally rich discourse is one that conveys large amounts of information densely. Derived nouns, adjectives, multisyllabic words, and passives
convey information in a decidedly encyclopedic fashion. Another important
type of discourse in Spanish, which is not found in English analyses (cf. Biber
& Conrad, 2001), perhaps because Spanish has a neatly defined mood system
(with readily discernable inflections), is hypothetical discourse, which communicates possibilities and counterfactual information. It is characterized by
features such as verbs in the subjunctive and the conditional. The other two
discourse types identified by Biber et al. (2006) are well known to most (viz.,
narratives and descriptions).
Research Questions
The present study adds to our understanding of how contextual
variables interact with learners' use of attributive sentences during acquisition. Although the field
has a good idea of the communicative factors that motivate copula choice, we
do not know how each copula + adjective segment works with other lexical and
grammatical structures to communicate coherent discourse. To address this gap
in the literature and to understand the discursive function that ser + adjective
and estar + adjective segments serve over time, we provide a corpus-based
analysis of the lexico-grammatical features that predict the use of these two
segments with foreign-language (i.e., at-home) learners in the first, second, and
third years of the university level. More specifically, we address the following
research questions:
1. What are the lexico-grammatical features that co-occur with ser + adjective
usage? What are the discursive functions that these co-occurring features
serve?
2. What are the lexico-grammatical features that co-occur with estar + adjective usage? What are the discursive functions that these co-occurring
features serve?
To address these questions, we present the results of a series of regression
analyses predicting the occurrence of each copula + adjective segment from
a variety of lexico-grammatical features (see the Corpus Description section).
We predict that ser + adjective and estar + adjective segments will have
distinct lexico-grammatical associations that change over time. Specifically,
we posit that ser + adjective segments appear in simple discourses (e.g., highly
descriptive and listlike discourse) and estar + adjective segments become
increasingly associated with discursive complexity. However, we posit that the
association of estar + adjective with a particular discourse type will be more
difficult to identify because previous research suggests that even advanced
learners are more sensitive to contextual (i.e., pragmatic) constraints than are
native speakers with this construct.

Method
Corpus Description
This study used a 432,511-word learner corpus of written Spanish, comprising
edited and nonedited compositions collected from English-speaking Spanish
learners at three levels of instruction: first year (230,270 words), second year
(109,224 words), and third year (93,017 words). The compositions were not
specific tasks designed to collect the data for this study but rather writing
samples used for assessment purposes. Students wrote letters, narratives, descriptions, summaries, and argumentative essays both in and out of class as well
as on exams. Topics related to the textbook themes (e.g., family, childhood)
and the cultural readings assigned in class. Each text was tagged for numerous
lexical and grammatical features (see above).
To determine what lexico-grammatical features co-occur with ser + adjective and estar + adjective usage, we considered a total of 75 potential predictor
variables, each operationalized in the form of a regular expression. In corpus
studies, variables refer to the linguistic features in the texts being analyzed. This
study's predictor variables included various lexical features, such as adjectives
other than the ones in the copula + adjective frame (e.g., derived adjectives,
adjective in postnominal position), nouns (e.g., derived nouns, feminine nouns,
masculine nouns), adverbs (e.g., adverbs of place, adverbs of time), and verb
classes (e.g., verb in imperfect aspect, verb in past participle), as well as morphosyntactic features such as dependent clauses, noun phrase configurations
(e.g., article plus noun), pronoun usage (e.g., clitic - third person), as well as
a variety of verb phrases (e.g., verbs of communication, verbs of knowledge).
The set of variables considered involved all parts of speech, common morphosyntactic constructs studied by learners, as well as additional constructs
studied in Biber et al. (2006).
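To make concrete what it means to operationalize predictor variables as regular expressions, the following Python sketch counts three features in one tagged text. The feature names, the form/POS/lemma tag format, the ADVTIME tag, and the sample text are all hypothetical; the study's 75 expressions are not reproduced here.

```python
import re

# Hypothetical feature definitions over an assumed form/POS/lemma tag format.
FEATURES = {
    "ser_adj": re.compile(r"\S+/V/ser \S+/ADJ/\S+"),
    "estar_adj": re.compile(r"\S+/V/estar \S+/ADJ/\S+"),
    "adverb_time": re.compile(r"\S+/ADVTIME/\S+"),
}

def feature_counts(tagged_text):
    """Return the raw count of each feature in one tagged text."""
    return {name: len(rx.findall(tagged_text)) for name, rx in FEATURES.items()}

# Hypothetical tagged learner sentence.
text = ("hoy/ADVTIME/hoy estoy/V/estar cansado/ADJ/cansado pero/CONJ/pero "
        "soy/V/ser feliz/ADJ/feliz y/CONJ/y mañana/ADVTIME/mañana "
        "estaré/V/estar listo/ADJ/listo")

counts = feature_counts(text)
print(counts)
```

Running such a function over every text yields one row of counts per text, which can then be normed and used as the predictor and criterion variables in a regression.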
Data Analyses
Learner Models Analysis
To identify the types of lexico-grammatical features that learners use with
ser + adjective and estar + adjective segments and to identify which variables distinguish among the three levels of learners, we constructed regression
models of lexical-grammatical regressors predicting copula + adjective usage:
a ser + adjective learner model and an estar + adjective learner model. We
constructed regression models for each copula + adjective segment (rather
than, for instance, a single regression model for which the choice between the
two is the dependent variable) because previous research suggests that the
factors motivating ser + adjective usage are not the same as those
motivating estar + adjective usage (cf. Guntermann, 1992). The process involves screening a set of potential predictor variables for standard assumptions
of linear regression, submitting the reduced set to a best-subsets analysis
(rather than a stepwise procedure) to identify the so-called best subset, and,
finally, comparing the predictor variables' ability to distinguish among the three
levels of learners in terms of copula + adjective usage.
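The best-subsets step named above can be sketched as an exhaustive comparison of predictor combinations. The sketch below is a simplified illustration in plain Python, not the procedure as run in the study: the data are hypothetical, the selection criterion (highest R-squared among subsets of a fixed size) is an assumption, and statistical packages typically use penalized criteria to compare subsets of different sizes.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, y):
    """R-squared of a least-squares fit of y on the columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def best_subset(predictors, y, size):
    """Compare every combination of `size` predictors; keep the highest R-squared."""
    return max(combinations(sorted(predictors), size),
               key=lambda names: r_squared([predictors[n] for n in names], y))

# Hypothetical normed frequencies for six texts.
predictors = {
    "adverb_time": [1, 2, 3, 4, 5, 6],
    "derived_noun": [2, 1, 4, 3, 6, 5],
    "clitic_3p": [5, 3, 6, 1, 2, 4],
}
# Toy criterion built as 2 * adverb_time + clitic_3p.
estar_adj = [7, 7, 12, 9, 12, 16]

best = best_subset(predictors, estar_adj, 2)
print(best)
```

Unlike a stepwise procedure, which adds or drops one variable at a time, this approach evaluates variables jointly, at the cost of a search space that grows quickly with the number of candidate predictors.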
We employed a standard data-screening process, identifying which of the
potential predictor variables had honest correlations with the criterion variables,
thus discarding the following: (a) variables that had no correlation with a criterion variable (by examining correlation coefficients and scatter plots between
a potential predictor variable and the criterion); (b) variables that represented
inflated correlations (i.e., where two features correlated highly with each other
and constituted too high an overlap in semantic or structural properties, so as
to avoid collinearity problems in the final model selection phase);1 and (c) variables that constituted deflated correlations, eliminating predictor variables that
had a highly reduced range of responses to the criterion variable (e.g., those
variables whose frequency was very small, such as n = 2, regardless of the
level of the participant or the genre). This screening of the data yielded a list
of 58 potential linguistic variables (37 for ser + adjective and 21 for estar +
adjective) that could be meaningful for the regression analyses to be performed.
Table 2 shows the preliminary list of variables.
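The three screening rules can be approximated in code as follows. This is only a rough, automated sketch under stated assumptions: the thresholds (min_r, max_overlap, min_total) and the data are hypothetical, and rule (b) is reduced here to a correlation check, whereas the screening described above also involved inspecting scatter plots and judging semantic or structural overlap.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def screen(predictors, criterion, min_r=0.1, max_overlap=0.9, min_total=3):
    """Apply three screening rules; every threshold is a hypothetical choice."""
    kept = {}
    for name, values in predictors.items():
        if sum(values) < min_total:                    # (c) too infrequent
            continue
        if abs(pearson(values, criterion)) < min_r:    # (a) no correlation
            continue
        if any(abs(pearson(values, other)) > max_overlap
               for other in kept.values()):            # (b) overlaps a kept variable
            continue
        kept[name] = values
    return kept

# Hypothetical per-text frequencies for six texts.
ser_adj = [1, 2, 3, 4, 5, 6]
candidates = {
    "definite_article": [2, 3, 5, 6, 8, 9],
    "article_noun": [2, 3, 5, 6, 8, 10],   # nearly duplicates definite_article
    "future": [0, 1, 0, 0, 1, 0],          # occurs only twice overall
    "adverb_place": [1, 3, 2, 2, 3, 1],    # uncorrelated with the criterion
    "clitic_3p": [4, 1, 5, 2, 6, 4],
}

kept = screen(candidates, ser_adj)
print(list(kept))
```

Note that rule (b) as sketched keeps whichever of two overlapping variables is encountered first; deciding which one to retain is a judgment that the code does not capture.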
We used best-subsets analyses to derive the two regression models for ser +
adjective and estar + adjective. Social scientists frequently employ stepwise
procedures for building regression models. Although these procedures for variable selection work adequately for reducing a small set of potential predictor
variables to a small, more meaningful set (e.g., a subset that does not have a
high degree of overlap), statisticians do not favor stepwise analyses when the
initial pool of predictor variables is extremely large (Miller, 2002), such as the
present case. Following Rencher (2002), we employed instead a best-subsets
analysis for building the two models for predicting ser + adjective and estar +
adjective. The principal advantage that a best-subsets approach has over statistical/stepwise regression (with a large number of predictor variables) is that
best-subsets approaches attempt to reduce the number of predictor variables
by comparing various combinations of variables, whereas the stepwise procedure attempts the reduction process by considering each and every potential
predictor variable individually. The best-subsets approach has been shown to
produce less spurious results than stepwise procedures when reducing a large
set of potential predictor variables. With large pools of potential predictor variables that have an almost infinite number of combinations, stepwise regression
Language Learning 60:2, June 2010, pp. 409–445



Collentine and Asencion-Delaney

Corpus-Based Analysis of Ser/Estar + Adjective

Table 2 Linguistic variables used in the study after initial data screening

Variable class      Ser + adjective                                   Estar + adjective
------------------  ------------------------------------------------  ------------------------------------
Noun                noun - derived                                    noun - masculine
                    noun - feminine                                   noun - singular

Adjective           adjective - derived                               adjective - singular
                    adjective - feminine                              adjective - type 1
                    adjective - masculine                             adjective - type 2
                    adjective - plural
                    adjective - postnominal: Una casa grande
                      'a large house'
                    adjective - prenominal: Una bella mansion
                      'A beautiful mansion'
                    adjective - singular
                    adjective - type 1: Descriptive adjective with
                      four inflections: masculine, feminine,
                      singular, and plural. Blanco/a(s) 'white'
                    adjective - type 2: Descriptive adjective with
                      two inflections: singular and plural.
                      Interesante(s), liberal(es) 'interesting,
                      liberal'

Pronoun             clitic - third person                             clitic - preverbal
                    pronoun - subject                                 pronoun - third person
                    que subordinator                                  que subordinator

Other noun phrase   article noun segment: El libro 'The book'         article noun segment
elements            definite article                                  possessive adjective
                    possessive adjective

Verbs               SE plus third-singular verb                       SE plus 3rd-singular verb
                    verb - Gustar-like                                verb - Gustar-like
                    verb - third person                               verb - third person
                    verb - communication: Decir 'say/tell',           verb - knowledge
                      anunciar 'announce', explicar 'explain', etc.   verb - past participle
                    verb - imperfect                                  verb - present participle
                    verb - infinitive                                 verb - suasive: Querer 'want',
                    verb - infinitive 2; not preceded by verb or        mandar 'order', etc.
                      article                                         verb - probability: Creer 'believe',
                    verb - knowledge: Saber 'know', recordar            negar 'deny', dudar 'doubt', etc.
                      'recall', entender 'understand', etc.
                    verb - observation: Ver 'see', escuchar
                      'listen', etc.
                    verb - past participle
                    verb - past subjunctive
                    verb - periphrastic future
                    verb - present participle
                    verb - preterit
                    verb - suasive: Querer 'want', mandar 'order'
                    verb aspect - progressive

Adverbs or          adverb - place                                    adverb - time
adverbial clauses   adverb - time                                     adverbial clauses - contingency
                    adverbial clauses - contingency                   adverbial clauses - time
                    adverbial clauses - time

Total               37                                                21

Note. All adjectives in this list did not follow one of the two copulas.

may never consider combinations of predictor variables that are equally good at
predicting the occurrence of the response variable (i.e., the dependent variable)
in question.2 Because this analysis is computationally intensive and not available in many commercial software packages for the social sciences, we used
the statistical package R and its best-subsets regression package to perform the
analysis (see Dalgaard, 2008).3
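The contrast between the two selection strategies can be illustrated with a small exhaustive search. This is a sketch under stated assumptions and not the R routine used in the study: the predictor names and data are invented, and the residual sum of squares (RSS) is used here as the comparison criterion, whereas the source does not state which criterion the authors applied. In the toy data, the best single predictor (`proxy`) does not belong to the best pair, which is the kind of combination a search that evaluates variables one at a time, starting from the best single predictor, may never examine.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

def best_subsets(predictors, y):
    """For every subset size, return the combination of predictor names with the lowest RSS."""
    names = list(predictors)
    return {
        size: min(combinations(names, size),
                  key=lambda subset: rss([predictors[n] for n in subset], y))
        for size in range(1, len(names) + 1)
    }

# Toy data (invented): y is exactly feature_a + feature_b; proxy is y plus small noise.
predictors = {
    "feature_a": [1, 0, 2, 1, 3, 0, 2, 4],
    "feature_b": [0, 3, 1, 4, 0, 2, 5, 1],
    "proxy": [1.2, 2.9, 3.1, 4.8, 3.2, 1.9, 7.1, 5.0],
}
y = [1, 3, 3, 5, 3, 2, 7, 5]
best = best_subsets(predictors, y)
print(best[1], best[2])

# Number of non-empty subsets for a pool of 37 candidate predictors.
print(2 ** 37 - 1)
```

Enumerating every subset is feasible only for small pools; with 37 candidates there are 2^37 − 1 non-empty subsets, which is why the analysis is described as computationally intensive.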
We employed what is termed a subgroup regression analysis to determine
which of the variables in the two models predicting ser + adjective and estar +
adjective usage distinguished among the three levels (Hardy, 1993). The process employs indicator variables (sometimes called dummy variables) to add
categorical predictor variables (into the model described earlier) called differential intercept coefficients. This reveals the effect for each group for each
predictor variable (i.e., the unique contribution of each level in our study to
each coefficient calculated for the predictor variables), producing k − 1 difference (predictor) variable models, where k represents the number of groups.4
Because this group-level coefficient effect process is derived from two regression models, we adjusted the alpha for significant coefficient differences via a
Bonferroni adjustment to 0.025 (i.e., $1 - (1 - .05)^{1/2}$).
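The indicator-variable coding can be sketched as follows. This is an illustration only: the function name `design_row` is hypothetical, and treating the first-year level as the reference group is an assumption, because the source does not state which level served as the reference. With k = 3 levels, the coding adds k − 1 = 2 indicator columns, plus one product term per indicator so that a predictor's coefficient can differ by level.

```python
def design_row(x, level, levels=("first", "second", "third")):
    """Expand one observation into [intercept, x, d_2, d_3, d_2*x, d_3*x].

    The first entry of `levels` is the reference group (an assumption made here),
    so it receives no indicator column of its own.
    """
    reference, *others = levels
    dummies = [1.0 if level == other else 0.0 for other in others]
    interactions = [d * x for d in dummies]
    return [1.0, x] + dummies + interactions

# A document with a normed frequency of 4.0 for some predictor, at each level.
print(design_row(4.0, "first"))
print(design_row(4.0, "second"))
print(design_row(4.0, "third"))
```

In a regression on rows of this form, the coefficient on a product term estimates how far that level's slope departs from the reference level's slope, which corresponds to the interlevel differences reported in the Results.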
Native-Speaker Model Comparison
To objectively identify the types of discourse that the lexico-grammatical structures (dis)associated with each copula + adjective segment represent (derived
from the best-subsets analysis), we compared the two copula + adjective learner
models with the native-speaker discourse model described in Table 1. Our
analysis measured the extent to which the learners' discourse possessed indicators of informational richness, hypothetical discourse, narrative discourse,
and descriptive discourse.
As described earlier, we calculated the normed frequency of the occurrence
of each of these variables in the learner corpus to a scale of 10,000 per text.
Subsequently, we calculated the extent to which documents representing high
concentrations of each copula + adjective model correlated with high concentrations of each of the four native-speaker discourse types in three steps: (1) For
each document we calculated z-score totals for both the ser + adjective and the
estar + adjective models; (2) for each document we calculated a z-score total
for each of the four discourse types in Table 1; (3) we regressed the four discourse type z-score totals against each of the copula model z-score totals along
with subregression analyses to assess differences between the three levels. A
z-score value for any document on a given variable, be it a criterion variable as
in step 1 or a regressor as in step 2, represents the extent to which that variable
is represented in that document vis-à-vis all other documents. Summing a set
of z-scores produces a value representing to what extent any document had a
concentration of that set of variables (see Biber et al., 2006, as well as Biber and
Conrad, 2001, for in-depth discussions of this technique). Thus, summing the
z-scores for each document for variables representing, say, narrative discourse
indicated how narrative each document is. Likewise, z-score totals for the set of
regressors representing the ser + adjective model and for the set representing
the estar + adjective model for each document yield values indicating how
much each document more or less represented each model. (Of course, all
z-scores here must be weighted according to their +/− sign in the model.) The
regression and subregression analyses answer the following question: When
documents reflect the ser + adjective model and the estar + adjective model,
are they more or less encyclopedic, hypothetical, narrative, or descriptive in
nature? Again, because we employ two regression analyses, we adjusted the alpha for significant coefficient differences via a Bonferroni adjustment to 0.025
(i.e., $1 - (1 - .05)^{1/2}$).
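The adjusted alpha can be verified with a short calculation. The expression given in the text, one minus the m-th root of (1 − alpha), is the form usually called the Šidák correction, whereas the Bonferroni correction proper is alpha divided by m; for two analyses the two agree to three decimal places, so both support the reported threshold of 0.025. The function names below are hypothetical.

```python
def sidak_alpha(alpha, m):
    """Per-test alpha of the form 1 - (1 - alpha)^(1/m), the expression given in the text."""
    return 1 - (1 - alpha) ** (1 / m)

def bonferroni_alpha(alpha, m):
    """Per-test alpha of the form alpha / m."""
    return alpha / m

# Two regression models, familywise alpha of .05.
print(round(sidak_alpha(0.05, 2), 4))
print(bonferroni_alpha(0.05, 2))
```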
Finally, to identify documents for the qualitative analysis of the discursive
nature of copula + adjective usage, we chose to concentrate on those documents
for each learner level that most represented each regression model derived from
the best-subsets analysis. This simply entailed identifying those documents that
had high z-score totals for the ser + adjective model and those with high z-score totals
for the estar + adjective model, as described earlier in step 1.
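The z-score totals used to rank documents can be sketched as follows. This is a minimal illustration with invented variable names and frequencies; it assumes the sample standard deviation, which the source does not specify, and it weights each variable by the sign of its coefficient, as the text requires.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize one variable across documents (sample standard deviation assumed)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def signed_z_totals(freqs, signs):
    """Sum sign-weighted z-scores per document.

    freqs: {variable: [normed frequency in each document]}
    signs: {variable: +1 or -1}, the sign of the variable's coefficient in a model
    """
    n_docs = len(next(iter(freqs.values())))
    totals = [0.0] * n_docs
    for name, values in freqs.items():
        for i, z in enumerate(z_scores(values)):
            totals[i] += signs[name] * z
    return totals

# Four toy documents and two hypothetical model variables.
freqs = {
    "adjective_plural": [10, 20, 30, 40],  # positively signed in the toy model
    "verb_knowledge": [40, 30, 20, 10],    # negatively signed in the toy model
}
signs = {"adjective_plural": +1, "verb_knowledge": -1}
totals = signed_z_totals(freqs, signs)

# Documents ordered from most to least representative of the toy model.
ranked = sorted(range(len(totals)), key=lambda i: -totals[i])
print(ranked)
```

Documents at the head of such a ranking would be the candidates for qualitative analysis, since they show the highest concentration of the features in a given model.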
Results
Learner Usage: Ser + Adjective
The best-subsets analysis identified 21 regressors predicting ser + adjective usage across the three levels, with 16 constituting significant regressors (p ≤ .05).
This model included twice as many predictor variables as the estar + adjective
model did. Additionally, the amount of variation that the ser + adjective model
accounted for was 41% in the use of the criterion variable, whereas the estar +
adjective model only accounted for 5% of its criterion variable (see below).
The ser + adjective model accounted significantly for ser + adjective usage,
F(21, 1576) = 54.9; p = .000.
Furthermore, the subgroup regression analysis revealed that 5 of these
21 regressors significantly distinguished among the three levels of learners:
pronoun - subject, adverbs of place, verb - gustar-like, verb - observation,
and verb - past participle (see Table 3). In the following we discuss these
21 regressors by grouping them into six lexico-grammatical regressor categories: adjectives, nouns, pronouns, adverbial constructions, grammatical verb
variables, and lexical verb variables. Within the relevant lexico-grammatical
regressor categories, we discuss the five variables distinguishing among the
levels.
Seven of the regressors represented various features of descriptive adjectives, although none distinguished among the three levels of learners. Table 3
indicates that each variable contributed significantly to the model. For the most
part, adjectives predicted ser + adjective usage, with five associating positively
(i.e., their coefficient sign was positive) and two disassociated with the
construction (i.e., the coefficient sign was negative). The positive, adjectival
regressors reveal that, perhaps not surprisingly, a variety of adjectives representing particular inflectional properties co-occur with ser + adjective, suggesting

Table 3 Best-subsets regression model for ser + adjective

Coefficient                   Estimate sign   Estimate   Std. error   t test    p
----------------------------  -------------   --------   ----------   ------    ----
(Constant)                                    81.371     9.011         9.030    .000
adjective - feminine          +                 .050      .020         2.470    .010
adjective - masculine         +                 .040      .020         2.180    .030
adjective - plural            +                 .100      .020         5.420    .000
adjective - postnominal       −                 .170      .020        10.010    .000
adjective - prenominal        −                 .200      .020         9.370    .000
adjective - singular          +                 .150      .020         9.680    .000
adjective - type 2            +                 .070      .030         2.580    .010
noun - derived                +                 .020      .010         2.030    .040
noun - feminine               +                 .020      .010         2.640    .010
pronoun - subject^a           +                 .050      .010         5.800    .000
adverbs of place^a            −                 .060      .040         1.560    .120
adverbial clauses - cause     +                 .040      .030         1.500    .130
verb - third person           +                 .060      .010        12.380    .000
verb - infinitive             +                 .040      .010         3.730    .000
verb - periphrastic future    −                 .070      .050         1.570    .120
verb - past participle^a      −                 .040      .020         1.530    .130
verb - past subjunctive       −                 .110      .060         1.860    .060
verb - Gustar-like^a          −                 .040      .020         2.460    .010
verb - communication          −                 .070      .030         2.300    .020
verb - knowledge              −                 .090      .040         2.340    .020
verb - observation^a          −                 .080      .040         1.980    .050

^a Variable distinguishing between the levels of instruction.

that at all levels in contexts/discourses where ser + adjective segments appear, learners use adjectives in general in a variety of inflections. Interestingly,
however, the positive correlation with type-2 adjectives (i.e., adjectives with
only two inflections: singular and plural) tempers this conclusion because they
are also significantly associated with the criterion. Finally, although various
morphological properties of adjectives associate with ser + adjective, this construction is not associated with more complex uses of adjectives because ser +
adjective is disassociated with adjectives that appear in either prenominal (e.g.,
bella casa 'beautiful house') or postnominal position (e.g., casa grande 'large
house').
An analysis of the two nominal regressors indicates that a certain degree
of morphological nominal complexity occurs where ser + adjective segments
predominate, as both had a significant positive association with the criterion
variable. The association with feminine nouns links the criterion variable to gender-inflectional processes, whereas the association with
derived nouns links it to nouns that package semantic information in a
dense fashion, as these derived forms have a base/root morpheme and an additional derivational morpheme (e.g., constitu-ción, sereni-dad, procesa-miento).
It is important to note, however, that this is the only indication of ser + adjective
association with semantically dense forms. As with the adjectival regressors,
neither of these two nominal regressors distinguished among the three levels,
suggesting that the association of ser + adjective with a certain degree of morphological complexity occurs from the beginning to more advanced levels of
instruction.
Subject pronouns for the most part also appeared where there was a preponderance of ser + adjective segments, although the subregression analysis
revealed that this regressor significantly distinguished among the three levels
of learners. The subregression analysis revealed that for the first-year learners
subject pronouns were positively associated with ser + adjective (beta = 0.06;
std error = 0.001), that for the second-year learners there was no association at
all (beta = 0.001; std error = 0.017), and that for the third-year learners there
was a disassociation with the criterion variable (beta = −0.06; std error =
0.043); the analysis also revealed that the significant difference came from the
first-year learners rather than the other two (t = 3.00; p = .003), meaning that
the association of ser + adjective with subject pronoun use was primarily due
to the first-year-learner data.
The best-subsets analysis identified two adverbial constructions as important contributing predictors of overall ser + adjective usage: adverbs of place
and adverbial clauses of cause. Although neither of the two contributed significantly on an individual basis, adverbs of place significantly distinguished
among the three levels of learners in terms of predicting when ser + adjective
would occur. The subregression analysis indicated that for the first-year learners, adverbs of place were disassociated with ser + adjective (beta = −0.12;
std error = 0.05), whereas these adverbs were (positively) associated with the
criterion at the second (beta = 0.07; std error = 0.06) and third years (beta =
0.06; std error = 0.09), with the significant difference being attributed to the
difference between the first-year and second-year individual contributions to
the model (t = 2.45; p = .015).
There were six grammatical features of verbs that predicted ser + adjective
usage at the three levels. For the most part, verbal variables were disassociated
with ser + adjective. Similar to the adverbial regressors, three were important
enough to be included in the ser + adjective model but did not individually
contribute significantly: Past subjunctive usage, periphrastic future usage, and
past participles (an adjectival/verbal feature) were disassociated with the use
of the criterion variable. Past participles significantly distinguished among the
three levels, as the second-year coefficients (beta = −0.08; std error = 0.03)
were significantly lower (t = 2.35; p = .02) than those of the first (beta = 0.04;
std error = 0.13) and third year (beta = 0.09; std error = 0.06). Gustar-like
verbs were also disassociated with ser + adjective usage at a significant level;
however, it is important to note that this was a regressor that the subregression
analysis identified as one that distinguished among the three levels. For both
the first (beta = −0.04; std error = 0.02) and the second year (beta = −0.19;
std error = 0.08), its coefficient was negative, meaning that it was disassociated
with ser + adjective. For the third-year learners, this subregression coefficient
was positive (beta = 0.25; std error = 0.10), and the difference between the
third- and second-year coefficients was significant (t = 1.97; p = .05). Because
Gustar-like verbs are syntactically complex for Spanish learners, these data
strongly suggest that ser + adjective appears in general where complex verbal
morphology does not but that more complex verbal syntax begins to become
associated with the criterion at more advanced stages of development. Finally,
simpler verbal grammatical properties were significantly and positively associated with ser + adjective usage, as verbs with third-person morphology and
infinitives reliably associated with the presence of ser + adjective segments.
The final group of variables involves various lexical classes of verbs, all of
which were significantly disassociated with ser + adjective usage. These included verbs of communication and knowledge, indicating that ser + adjective
usage is not associated with discourse where epistemic stance is manifested
(i.e., where one qualifies what is commented on by reporting that some assertion was [only] heard or is known to be true). Verbs of observation were
also disassociated with ser + adjective usage, although this regressor significantly distinguished among the three levels. The second-year (beta = −0.16;
std error = 0.06) and third-year subregression coefficients (beta = −0.21; std
error = 0.10) were negative, whereas the first-year subregression coefficients
were positive (beta = 0.07; std error = 0.07), with the difference between the
first-year and second-year subregression coefficients (and so the third-year coefficients as
well) being significant (t = 2.61; p = .009).
Learner Usage: Estar + Adjective
The best-subsets analysis identified 10 regressors predicting estar + adjective
usage across the three levels, with 8 constituting significant regressors (p ≤
.05). The value of the coefficient of determination ($R^2$) of the model, however,
indicates that only 5% of the variance in the Spanish learners' use of estar +
adjective could be explained by this regression model. This indicates that the
association of estar + adjective with other lexical-grammatical features is weak
within the interlanguage for all levels of learners. The model did account for
a significant amount of the overall variation in estar + adjective usage, [F(10,
1590) = 8.42; p < .0001].
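As a consistency check, the reported F statistics can be converted back into the proportion of variance explained. For a least-squares model with an intercept, F = ($R^2$ / df_model) / ((1 − $R^2$) / df_residual), which rearranges to the expression in the sketch below. The function name is hypothetical, and the calculation is an illustration rather than part of the original analysis.

```python
def r_squared_from_f(f_stat, df_model, df_resid):
    """Recover R-squared from an overall F statistic and its degrees of freedom."""
    return (f_stat * df_model) / (f_stat * df_model + df_resid)

# Estar + adjective model: F(10, 1590) = 8.42
print(round(r_squared_from_f(8.42, 10, 1590), 3))

# Ser + adjective model: F(21, 1576) = 54.9
print(round(r_squared_from_f(54.9, 21, 1576), 3))
```

The values obtained this way, approximately .05 for the estar + adjective model and approximately .42 for the ser + adjective model, are in line with the 5% and 41% reported in the text, allowing for rounding and for the possibility that one of the reported figures is an adjusted value.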
As observed in Table 4, most of these 10 variables contributed significantly to the model, with the subgroup regression analysis revealing
that four regressors significantly distinguished among the three levels of learners: type-2 adjectives (i.e., adjectives with singular and plural inflection), article
noun segments, preverbal clitics, and possessive adjectives. It is interesting to
note that this group of variables is entirely different from the group of significant regressors for the ser + adjective copula. At any rate, these differences
are considered below in the interpretation of the variables, where we discuss
all 10 variables by grouping them into three lexico-grammatical regressor categories: nominal (noun and adjectival), verbal, and syntactic variables.
In contrast to ser + adjective segments, estar + adjective is associated with
decidedly basic grammatical properties. For example, noun phrases in discourse
where estar + adjective occurs usually comprise nouns preceded by articles
or possessive determiners (e.g., mi mamá 'my mother', la universidad 'the
university') and adjectives that have only two inflections (e.g., inteligente 'intelligent') or adjectives in their singular form (alta 'tall [feminine]'). Three of
the four level-distinguishing regressors identified in the subregression analysis
Table 4 Best-subsets regression model for estar + adjective

Coefficient                   Estimate sign   Estimate   Std. error   t test    p
----------------------------  -------------   --------   ----------   ------    ----
(Constant)                                    4.460      3.518        1.267     .205
adjective - singular          +                .010       .003        2.459     .014
adjective - type 2^a          +                .020       .009        2.616     .009
noun - singular               −                .000       .002        1.716     .086
article noun segment^a        +                .010       .002        3.544     .000
possessive adjective^a        +                .010       .003        3.669     .000
verb - Gustar-like            −                .020       .008        2.419     .016
verb - present participle     +                .030       .011        2.364     .018
verb - probability            +                .020       .011        1.871     .062
clitics - preverbal^a         +                .020       .005        3.617     .000
adverbial clauses - cause     +                .020       .009        2.326     .020

^a Variable distinguishing between the levels of instruction.

were nominal in nature. Type-2 adjectives were found to distinguish significantly between first- and third-year learners (t = 2.73; p = .006), indicating
that the trend to associate inflectionally simple adjectives with estar + adjective appears to become stronger as learners progress in their acquisition of
Spanish. This predictor variable was disassociated with the criterion variable
(beta = −0.01, std error = 0.009) for the first-year students and was positively
associated with estar + adjective for the second (beta = 0.01, std error =
0.021) and third year (beta = 0.13, std error = 0.046). The article noun segment significantly distinguished only between first- and second-year learners
(t = 3.30; p = .001). This regressor was weakly associated with the criterion
variable for first-year (beta = 0.002; std error = 0.003) and third-year students
(beta = 0.003; std error = 0.011) and only slightly more associated with estar + adjective for second-year students (beta = 0.018; std error = 0.005).
Finally, possessive adjectives significantly distinguished between second-year
and third-year learners (t = 2.78; p = .005). This regressor was found to be
weakly associated with the criterion level for the first year (beta = 0.008; std
error = 0.003) and the third year (beta = 0.003; std error = 0.011) and only
slightly more associated with estar + adjective for the second-year writing
(beta = 0.041; std error = 0.010).
Among verbal regressors, the significant predictor variables also showed
no evidence that complexity is associated with the criterion. Although Gustar-like verbs are usually associated with complex syntax, in the learners' writing
this variable is negatively associated with the occurrence of estar + adjective.
The other grammatical verb form, the present participle, is expected to co-occur
with estar + adjective because it is mostly associated with estar to form
the progressive aspect. Indeed, its beta coefficient was the highest of those
regressors included in the best-subsets analysis (0.030).
Two syntactic features were positively associated with estar + adjective.
Preverbal clitics positively associated with estar + adjective at all levels, perhaps the only indication of complexity associated with this phrase structure.
The other syntactic regressor, causal adverbial clauses (which usually started
with the conjunction porque), also predicted criterion usage. Preverbal clitics
was the only syntactic regressor variable that distinguished significantly between learners' use of estar + adjective at different levels. This variable was
weakly associated with the criterion for first-year learners (beta = 0.006; std
error = 0.007), an association that increased modestly yet significantly (t = 2.31; p = .021)
into the second and third years, with the association being greater for second-year (beta = 0.033; std error = 0.011) and third-year (beta = 0.056; std error =
0.019) learners.
Native-Speaker Model Comparison


As explained earlier (see the Corpus Description section), our analysis also
included a measurement (via regression analysis) of the extent to which the
learners' texts with high concentrations of each copula + adjective model correlated with high concentrations of each of the four native-speaker discourse
types: informational richness, hypothetical discourse, narrative discourse, and
descriptive discourse. The native-speaker model comparison indicated that
three native-speaker discourse types combined significantly and individually to
predict where the ser + adjective learner model occurred: hypothetical, narrative, and descriptive (see Table 5). As observed in Table 6, three also combined
to predict where the estar + adjective learner model held: information rich,
hypothetical, and narrative.
The information-rich discourse regressor indicates the extent to which documents reflecting a copula + adjective model are accompanied by semantically dense discourse. Considering the sign of the coefficients, and specifically
that the encyclopedic regressor in the ser + adjective model was significantly negative, ser + adjective usage is not semantically dense. Interestingly,
Table 5 Native-speaker discourse-type predictions of documents matching ser +
adjective model

Coefficient          Estimate sign   Estimate   Std. error   t test    p
-------------------  -------------   --------   ----------   ------    ----
(Constant)                           0.001      0.131        9.007     .995
information rich     −               0.079      0.045        1.776     .076
hypothetical         −               0.303      0.039        7.766     .000
narrative            +               0.376      0.124        3.030     .002
descriptive          +               0.350      0.114        3.060     .002

Note. F(4, 1596) = 24.38; p = .000; multiple $R^2$: 0.06; adjusted $R^2$: 0.06.


Table 6 Native-speaker discourse-type predictions of documents matching estar +
adjective model

Coefficient          Estimate sign   Estimate   Std. error   t test    p
-------------------  -------------   --------   ----------   ------    ----
(Constant)           +               81.371     9.011        9.030     .000
information rich     +                0.129     0.026        4.954     .000
hypothetical         +                0.059     0.023        2.574     .010
narrative            +                0.550     0.072        7.592     .000
descriptive          +                0.135     0.067        2.025     .043

Note. F(4, 1596) = 149.3; p = .000; multiple $R^2$: 0.06; adjusted $R^2$: 0.06.


documents reflecting the estar + adjective model appear to be semantically
dense. Furthermore, because the subregression analysis showed no interlevel
coefficient difference, we must surmise that this association is constant for all
three levels of instruction.
The hypothetical regressor implies how much copula + adjective usage
occurs when learners conjecture and present possible scenarios. Given their
signs and significance levels, ser + adjective discourse appears to represent
the antithesis of hypothetical discourse and estar + adjective usage contains
hypothetical elements. The disassociation with ser + adjective discourse may
be partially explained by the observation made earlier that epistemic verbs
(representing stance) are entirely disassociated with ser + adjective usage
as well as the model's exclusion of verbal entities like the subjunctive and
periphrastic future. The subregression analysis indicates that ser + adjective
is wholly unhypothetical at the first year and that at the second and third years
this disassociation rises to the level of no association. The hypothetical
regressor was disassociated with the first-year learner data (beta = −0.638; std
error = 0.071), which was significantly below those of the second-year (beta =
0.167; std error = 0.098; t = 7.652; p = .000) and third-year learner ser +
adjective usage (beta = 0.024; std error = 0.052; t = 3.280; p = .001).
The estar + adjective association with hypothetical discourse is supported
in the above analysis because this model was associated with verbs of probability. Additionally, the learner estar + adjective regression analysis included
causal adverbial clauses in the estar + adjective model, and cause-effect relationships are an important tool for hypothesizing. This hypothetical regressor
was not associated with the first-year coefficients (beta = 0.638; std error =
0.071), which were significantly below the second-year (beta = 0.055; std error = 0.040; t = 4.042; p = .000) and third-year (beta = 0.029; std error =
0.001; t = 3.280; p = .001) coefficients.
The narrative regressors generally indicate where learners used a copula + adjective model accompanied by story-telling elements, although not
necessarily whole narrations. Both copula + adjective segments appear to be
significantly associated with the presence of narrative features. The subregression analysis indicates that both the second- and third-year learners generate
more narrative features where ser + adjective occurs than first-year learners:
Although the coefficients for the second-year (beta = 1.031; std error = 0.192)
and third-year (beta = 1.015; std error = 0.379) data were not significantly
different (t = 0.030; p = .976), the difference between the second- and first-year coefficients (beta = 0.133; std error = 0.177) was significant (t = 3.450;
p = .001). The subregression analysis indicates that the association of
estar + adjective segments with narrative features remains constant through
the three levels, as there were no significant interlevel coefficient differences.
This is consistent with the learner regression analysis, which showed that
present participles, which denote durative aspect (an important element of
stories), were associated with estar + adjective.
Both copula + adjective learner models were associated with descriptive
features, although only the ser + adjective association was significant. This might
seem surprising given the operationalization of Spanish descriptive discourse
offered by Biber et al. (2006), which is almost entirely devoid of narrative
features. The implication here is that both copula + adjective segments operate
in both narrative and descriptive contexts beyond the first year of instruction.
We see a significant transition toward greater association of ser + adjective
segments with descriptive features from first (beta = 0.154; std error =
0.168), to second (beta = 0.989; std error = 0.165), to third year (beta = 1.417;
std error = 0.350), with the second-year coefficients being greater than the first
(t = 4.880; p = .000) as well as the third-year coefficients being greater than
the first (t = 3.271; p = .001). Finally, it is important to note that the strength of
association of estar + adjective segments with narrative features (beta = 0.550;
std error = 0.072) is almost four times as much as with descriptive features
(beta = 0.135; std error = 0.067).
Qualitative Analysis
We contextualize the following qualitative analysis in consideration of the
learner models presented above and of their association with the preceding
native-speaker discourse models. Ser + adjective discourse serves first-year
learners in highly descriptive discourse. The first-year documents reveal that
ser + adjective segments are employed to relate descriptions containing multiple chained adjectives where ser tends to be the most frequently inflected verb.
The following are segments from midterm-exam letters students in a first-year
course wrote to a Mexican friend to describe their girlfriend/boyfriend and
his/her family.
(1) yo estoy bien porque yo tengo novia. se llama jessica. ella [es] bonita,
inteligente y elegante. ella tiene veinte anos. ella es de oregon. y [es] moreno,
bajo y muy bonita. ella lleva camiseta verde y jeans azules. sus ropas es mucho
dolares. ella gusta bailar y cantar para mi. ella gusta tens . . . la madre de
jessica [es] bonita, inteligente y bajo. se llama velerie. nosotros jugamos tenis
mucho. ella [es] bueno. nosotros aprendamos la universidad. ella lleva camisa
verde y los jeans azul en la universidad . . . (I am well because I have a girlfriend.
Her name is Jessica. She is beautiful, intelligent and elegant. She is 20 years
old. She is from Oregon and she is a brunette, short and very beautiful. She
wears a green t-shirt and blue jeans. Her clothes cost a lot of dollars. She likes to
dance and sing for me. She likes tennis. Jessica's mother is beautiful, intelligent
and short. Her name is Valerie. We play tennis a lot. She is good. We learn it at
the University. She wears a green shirt and blue jeans at the university . . .)
(2) yo soy bien porque yo soy amo con novia, selena. ella [es] bonita y
simpatica. ella [es] soltera y practicar. ella es alta y la ropa es mocha colores.
mi muchacha lleva rojo gora, blanco jacqueta, azul jeans, y negro sandalias.
ella es mi amora. selena (stays) con madre en casa grande. la familia [es] baja.
la madre [es] rica y lista y soltera . . . (I am well because I am in love with
my girlfriend, Selena. She is beautiful and nice. She is single and practical.
She is tall and she wears clothes in a lot of colors. My girl wears a red cap,
white jacket, blue jeans, and black sandals. She is my love. Selena stays with
her mother in Casa Grande. Her family is small. Her mother is rich, smart and
single . . .)
In both of these samples we see simple discourse, grammar, and lexicon,
with few verbs except for the copula and an overuse of subject pronouns.
Additionally, although there are numerous adjectives in both segments, it is
apparent that noun + adjective segments are scarce. These first-year samples
are nonnarrative and possess almost no conjecturing.
Among second-year learners, ser + adjective segments appear in list fashion
in discourse with few conjunctions expressing interpropositional relationships
(e.g., ser + adjective + que "copula + adjective + that"). Such loosely connected discourse not only describes people, places, and concepts but also
conveys evaluations of and reactions to events and states. As the learner model
suggests, there is a marked absence of epistemic verbs to demonstrate
stance (verbs of knowledge, pienso que "I think that"; verbs of perception,
vemos que "we see that"; verbs of communication, se dice que "it is said
that"). Instead, copula + adjective segments present (seemingly) indisputable
assertions. Structurally speaking, we see subject pronouns omitted to mark
continuity; still, there are various referents and allusions to the things they do
frequently. This probably accounts for why ser + adjective segments are associated with a mix of descriptive and narrative features. Finally, the derivational
sophistication (and thus semantic density) of the nouns employed is slightly
greater at this level, although these nouns are mostly cognates. The following
is an argumentative essay that a second-year student wrote using short stories as the
topic.
(3) . . . este cuento es un ejemplo que muchos padres estan usando la television como ninera. pienso que esto es un problema porque los jovenes no saben
si es realidad o no. los ninos no reciben la atencion que necesitan para crecer.
tambien pienso que los jovenes necesitan atencion y amor en los primeros
anos mas que de cuando [son] maduros porque cuando son jovenes ellos no
saben que [es] malo o que [es] bueno. tambien, la television [es] mala para
los padres. para los adultos la puede ser un escape tan ellos no tienen hacer
trabajo, o cosas diferentes que necesitan hacer durante el día. pero, tambien
pienso que hay diferentes programas que [son] buenas. hay programas que
ensena como cocinar, leer (para los ninos), y que dice que esta haciendo en el
mundo hoy. no todos de los programas de television [son] mala. pero yo pienso
que [es] malo usar la mas de necesario. (. . . this story is an example that many
parents are using the television as babysitters. I think this is a problem because
young people don't know whether it is real life or not. Children do not receive
attention enough to grow up. I also think that young people need attention and
love in their first years of life more than when they are mature because when
they are young they don't know what is good or what is bad. Also, television
is bad for parents. For adults it can be an escape because they don't have to do
their work or the different things they need to do during the day. But, I also
think that there are different programs that are good. There are programs that
teach you how to cook, to read (for children) and that tell you what is being
done in the world today. Not all the TV programs are bad, but I think it is bad
to use it more than necessary.)
With the third-year learners, ser + adjective is less frequent, reflected by a
lower overall average z-score of ser + adjective. It is now mixed among other
verbs in the third person and adjectives modifying nouns. The discourse is descriptive and evaluative in nature, with references to relevant events, producing
a mix of descriptive and narrative elements. The following texts are expository
essays students wrote in a third-year course about different occupations.
(4) al principio de su vida, el bebe atleta es una hija diferente de sus
hermanas. el grito del bebe [es] mas fuerte, el apetito mas famelico y el
cuerpo pequeno mas musculoso que los otros bebes . . . de repente, en la escuela
primaria, es la estrella de su partido de futbol y la parte necesaria entre su
equipo de basquetbol. al fin, no se puede negar todos los hechos, ella es
atleta. [es] seguro que hay cualidades particulares para las atletas; factores
que definen las mujeres que aman los deportes . . . mientras que la atleta esta
entrenandose, se come un dietetico rico con una variedad de las frutas y las
verduras. sin las vitaminas y minerales de estas comidas, el cuerpo no funciona
mejor . . . se come mucho pescado y tofu, [es] justo porque los dos son comidas
saludables sin mucha grasa . . . en el concepto de la diversion, el cuerpo de la
atleta es su templo. por eso, no pasan los viernes bebiendo cerveza y fumando
cigarrillas. todas las actividades giran de la salud y se mantienen la buena
salud. [es] necesario que las atletas pasen sus noches jugando los juegos
activas como escondite y jugar al corre que te pillo. (At the beginning of her
life, the baby athlete is a different daughter from her sisters. Her crying is
stronger, her appetite more ravenous, and her small frame more muscular
than those of the other babies . . . Suddenly, in grade school, she is the star in
her football game and the main player in her basketball team. At the end, you
cannot deny all the facts, she is an athlete. It is sure that there are particular
qualities to athletes, factors that define women that love sports . . . While the
athlete is training, she has a rich diet with a variety of fruit and vegetables. Without
the vitamins and minerals in this food, her body couldn't work better . . . a lot
of fish and tofu is eaten. It is so because both are healthy foods without much
fat. On the entertainment side, the body of the athlete is her temple. That's why
she doesn't spend her Fridays drinking beer and smoking cigarettes. All her
activities go around her health in order to keep her healthy. It is necessary that
athletes spend their nights playing active games such as hide and seek or run
and catch.)
(5) los musicos es distingue por no estar religioso. muchos de ellos no
creen que haya un dios. actualmente, [es] ironico, porque los musicos viven
como no creen en Dios, pero tan pronto como ganen un premio, lo agradecen . . . la dietetica no [es] similar entre musicos. unos musicos se distinguen
por su dietetica de alcohol y drogas. ellos tambien fumar cigarrillos, o otras
sustancias, y asistir a fiestas todas las noches, entonces casi nunca duermen.
unos musicos estan muy saludable, y estan vegetarianos estrictos . . . musicos
a veces tienen su propia familia. tienen esposos y a veces hijos. tener una
familia es muy difícil cuando los musicos siempre estan viajando. (Musicians
are known for not being religious. A lot of them don't believe there is a God.
Actually, it is ironic because musicians live as if they don't believe in God, but
as soon as they are awarded a prize, they thank God . . . The diet is not similar
among musicians. Some musicians distinguish themselves for having a diet
with alcohol and drugs. They also smoke cigarettes or other substances, and
they attend parties every night. So they almost never sleep. Some musicians are
very healthy and they are strict vegetarians. Musicians sometimes have their
own families. They have spouses and sometimes kids. Having a family is very
difficult when the musicians are always traveling.)
For the most part, however, important information is packaged into nominal
lexemes (adjectives and nouns) with a derivational morpheme (e.g., salud-able
"healthy," muscul-oso "muscular," cuali-dad "quality"). Still, cognates prevail
and there is creative derivation (e.g., the neologism diet-etico "diet"). Finally,
subject pronouns are scarce, perhaps due to topic continuity. As with the first-year learners, we see little expression of stance via epistemic verbs, and statements
are given as unqualified facts.
Regarding the estar + adjective segments, their principal discourse functions appear to be narration and description within narration. In the first year,
estar + adjective mostly appears in a fixed expression such as estoy feliz
"I am happy" or is used in descriptive contexts where ser was required, with
adjectives such as bonita "pretty" and grande "large." The following examples
come from in-class letters that learners wrote to a friend. The examples relate
life events as well as describe familiar people and places.
(6) querida maria, hola! [estoy] muy feliz porque yo tengo un novio nueva.
su nombre es Pete. Pete tiene veinte anos. mi novio es de indiana. Pete es
moreno y alto. mi novio es muy inteligente y optimista. (Dear Mary, Hello! I am
very happy because I have a new boyfriend. His name is Pete. Pete is twenty years
old. My boyfriend is from Indiana. Pete has dark hair and is tall. My boyfriend
is very intelligent and optimistic.)
(7) hola aubrey! fue a costa rica para un semana. fue a un hotel en la
playa dominical de costa rica. la playa dominical [estuvo] mas bonita! viajo
con mis padres y mi hermano. fue en un avion y lo [estuvo] mas grande. dormí
en un hotel en la playa. el mar [estuvo] muy largo y yo pesque mucho. me
gustaron las comidas mucho! (Hello Aubrey! I went to Costa Rica for a week.
I went to a hotel in the Dominical beach in Costa Rica. The Dominical beach
was very beautiful. I traveled with my parents and my brother. I went by plane
and it was very big. I slept in a hotel by the beach. The ocean was very big and
I fished a lot. I liked the meals very much!)
Learners couple their assessments of people's states with causes embedded in porque "because" adverbial clauses. The semantically dense nature of this discourse is
attributable to the use of various cognates that are long words, which describe
places, disciplines, actions, or events.
In second-year writing, estar + adjective is used in narrative and descriptive discourse that is detached from the writer. The writing is elicited by tasks
in which students must summarize events and describe characters in readings
or audiovisual material. The description of events favors the use of the present
participle. The summarizing task also allows students to speculate about characters' motives or actions by using verbs of probability such as creer "to believe"
and causal adverbial clauses that begin with the conjunction porque "because." These are the types of behaviors that account for the hypothetical
nature identified for estar + adjective.
(8) la madre regresa de la cocina, ella piensa que el muchacho [esta]
dormido. sin embargo, el muchacho [esta] despierto todavía y esta mirando la
television. la pantalla [esta] oscura, así la madre le pregunta a su hijo que él
hace. el hijo responde que él esta esperando la muchacha en el televisor. esto
es muy triste, porque obviamente la madre no es una madre muy bien . . . el hijo
cree que la muchacha en el televisor es su amiga. él piensa esto porque cree
que la muchacha esta hablando a él . . . (The mother returns from the kitchen, she
thinks that the boy is asleep. However, the boy is still awake and he is watching
the television. The screen is dark so she asks him what he is doing. The son
responds that he is waiting for the girl in the television. This is very sad because
it is obvious that the mother is not a good mother . . . The son believes that the
girl in the television is his friend. He thinks so because he thinks the girl is
speaking to him . . .)
(9) la personalidad del protagonista, juan, era tímido y tranquilo. le gustaba
[estar] solo con sus pensamientos. él sonaba antes de acostarse de, todas
las peripecias de un viaje a francia, pero no pudo costearlo. no creo que
juan [estuviera] satisfecho con su vida y su trabajo porque sonaba con ir a
francia. él quería experimentar nuevas cosas. su vida era muy rutinario y quería
cambiarla. él no [estaba] satisfecho con su trabajo porque no pudo costear el
viaje . . . el narrador nos sugirio que juan escribiera las cartas porque la letra
tuvo los mismos rasgos esenciales. creo que el narrador nos dijo eso porque él
es anonimo y nos quiere hacer creer que juan [estuviera] loco y se muriera. (The
personality of the main character, Juan, was shy and quiet. He liked to be alone
with his thoughts. He used to dream before going to bed about his adventures
in a trip to France, but he couldn't afford it. I don't think that Juan was satisfied
about his life and his work because he dreamed about going to France. He
wanted to experience new things. His life was a routine and he wanted to change it.
He was not satisfied about his work because he could not afford his trip . . . The
narrator suggested to us that Juan wrote the letters because his handwriting had
the same main features. I think the narrator told us so because he is anonymous
and he wants to make us believe that Juan was crazy and died.)
Third-year students combine different discourse patterns, using estar +
adjective in argumentative texts where they describe two opposing sides of
an issue, as in (10). The writer there compares American and Hispanic
culture or ways of life, which gives the text a hypothetical reading. They also use
estar + adjective to produce personal narratives, such as what happened on a
birthday, as in (11). It is noteworthy that preverbal clitic forms are used not only with some
Gustar-like constructions but also with verbs in middle voice (e.g., infiltrarse
"to be infiltrated") and passive constructions (e.g., enseñarse "to be taught"),
making the discourse more encyclopedic-sounding.
(10) . . . . el sueno americano se infiltra desde juventud, es evidente por television, escuela, la cultura y ejemplos del gobierno y políticos. esta influencia
es subconsciente pero fuerte y se ensena el americano que la única cosa que se
necesita hacer es trabaja fielmente y comprar las cosas correctas y eventualmente se recibira la vida perfecta. / . . . ademas el americano siempre [esta]
preocupado, consumiendo y trabajando pero viva muy poco. (The American
dream is instilled from youth, it is evident in television, school, the culture
and the examples from the government and the politicians. This influence is
subconscious but strong and it is taught to the American that the only thing that
he needs to do is to work loyally and to buy the right things and eventually he
will receive a perfect life. / Moreover, the American is always worried, consuming
and working but he lives very little.)
(11) mi familia recordaron mi cumpleanos! pero, en este momento mis
padres [estaban] enojados conmigo. yo no quería pelear, entonces, termino el
papel y nos sentamos a comer. mis hermanas mi desean un buen cumpleanos
y me dieron un regalo bellísimo. mi padre me hablaba de un film que le había
gustado. yo pide a mi madre si a ella le a gustado y ella me respondio, no, en
un tono agresivo. mientras la entera cena ella solo me dijo, no, y, si, y fue muy
molestosa. no me hablo durante mi fiesta de cumpleanos! [estaba] muy triste
este noche y la proxima día. (My family remembered my birthday! But, at that
moment my parents were angry with me. I didn't want to argue so I finished
my paper and we sat to eat. My sisters wished me a good birthday and gave me
a very beautiful present. My father was talking to me about a film he had liked.
I asked my mother if she had liked it and she answered no, in an aggressive
tone. During the whole dinner she told me no and yes and I was very angry.
She didn't talk to me during my birthday party! I was very sad that night and
the next day.)
Discussion and Conclusions
Studying the acquisition of the Spanish copula provides insights into the interaction among syntax, semantics, pragmatics, morphology, and vocabulary
during development in one of the most basic of syntactic structures, namely,
attributive sentences (Leonetti, 1994). Spanish requires learners to choose between two copulas in attributive sentences in accordance with a variety of
contextual considerations and at a variety of levels of representation. Whereas relevant L2 research has examined (a) copula choice
and (b) the function of attributive sentences in terms of orders of acquisition in different learning contexts (Gunterman, 1992; Ryan & Lafford, 1992;
VanPatten, 1985, 1987) and (c) the contextual and semantic factors that predict
learner usage of this construct as compared to native speakers (Geeslin, 2003a,
2005), the present study is the first to provide a corpus-based analysis of the
lexico-grammatical features that co-occur with Spanish copula (i.e., ser
and estar) + adjective usage, and thus of the different discourse functions that
ser + adjective and estar + adjective segments serve at three learner levels
and in comparison to native-speaker models. The study delves into important
learner issues identified in the S/E research but not fully developed to date, for example, the discourse types learners associate with copula
usage (Gunterman, 1992) and the strong influence of contextual cues on copula
choice (Geeslin, 2003a, 2003b). The results overall revealed the following: (a) both ser +
adjective and estar + adjective were associated with simple discourse at all
levels; (b) ser + adjective appears in descriptive and evaluative discourse where
much linguistic complexity reliably occurs; (c) estar + adjective is present in
narrations, descriptions, and hypothetical discourse where, nonetheless, little
linguistic complexity typically occurs.
Specifically, findings showed that the model predicting ser + adjective
usage identified more variables (n = 21) and accounted for more variation
(41%) than the estar + adjective model, which identified only 10 predictors
and accounted for 5% of the variation. It seems that at beginning levels of instruction, learners
find ser + adjective more communicatively productive and thus more easily
associated with a large array of features within their interlanguage, although
these features are basic grammatical and lexical items. Ser + adjective is one
of the first copula segments taught and is recycled over various semesters,
whereas estar + adjective is primarily used at beginning levels in routines
and formulaic expressions like estar + bien, mal, ocupado, enfermo. In this sense, the
input provided by the teacher, materials, and other students in the class through
task completion emphasizes the use of ser + adjective over estar + adjective
constructions and, therefore, encourages more ser + adjective usage. These
findings are in line with early SLA studies on ser/estar acquisition, which found
that ser + adjective was acquired well before estar + adjective (Gunterman,
1992; Ryan & Lafford, 1992; VanPatten, 1987) presumably because of the
higher frequency and saliency of ser + adjective in instructional and naturalistic
input.
It was also found that many of the lexico-grammatical predictor variables in
both models were characteristics of simple discourse and that they did not differentiate learners' copula + adjective usage among the three levels of instruction.
All levels seem to use copula + adjective as a discourse tool, for example, to communicate evaluatives like es importante, lástima "it's important, it's a shame," and
so forth. However, when the discourse becomes more syntactically and grammatically complex, ser + adjective segments are absent and estar + adjective
segments become more prevalent. On the one hand, these observations contrast
with native speakers, who use ser + adjective for evaluative purposes in a wide
variety of discourses, simple or complex; on the other hand, they are consistent
with native speakers' propensity to use estar + adjective in more complex discourse
(Collentine, 2008).
The ser + adjective model was mostly associated with adjective and grammatical/lexical verb variables. Various morphological properties of adjectives
(e.g., feminine, plural) were associated with ser + adjective, whereas more complex adjectival syntactic processes (e.g., prenominal or postnominal adjectives)
emerged as disassociated. Most of the verbal variables reflecting complex syntax (e.g., periphrastic future, past subjunctive, Gustar-like verbs) were disassociated from the copula construction and started to emerge as associated with
ser + adjective at advanced levels of instruction. Other features such as null subjects also indicated some grammatical sophistication at advanced levels, where
ser + adjective became less frequently used. As for the discursive functions
served by the co-occurrence of the variables in the predictive model for ser +
adjective, the disassociation of verbs of observation and communication from
the construction indicated a discourse that was nonepistemic/nonhypothetical
in nature. Comparisons with native speakers' discourse showed that learners
used ser + adjective in discourse that is highly descriptive in nature and accompanied by story-telling elements, especially at advanced levels of instruction.
These findings corroborate those of Gunterman (1992), who examined learners
in study-abroad contexts where ser + adjective was indicative of descriptive
discourse. Spanish learners, regardless of their level, associate an evaluative stance
with ser + adjective.
The estar + adjective regression analysis revealed a weak association with
other lexico-grammatical features. This indicates that throughout the early to
middle stages of acquisition, this phrase structure is weakly integrated into the
interlanguage in terms of being a productive, necessary tool for the types of
communication in which learners engage. In other words, the use of estar +
adjective segments is not obviated (or evoked, cognitively speaking) when
learners use their standard repertoire of lexico-grammatical tools. All told, the
story is complicated for estar + adjective, which ultimately might account
for its late acquisition. On the one hand, it appears where there is little associated inflectional sophistication (recall that Spanish is a highly inflectional
language, both in verbal and nominal constructs). The few variables associated
with estar + adjective suggest that it appears in discourse lacking in overall
inflectional sophistication (e.g., type-2 adjectives or adjectives with singular
and plural inflections, singular nouns, negative association with Gustar-like
verbs). Its use places significant processing demands on learners, as shown
by Geeslin (2003a, 2003b), who noted that learners use estar + adjective
segments according to pragmatic factors (which require a consideration of a
multitude of contextual variables) rather than according to semantic/lexical
constraints (which are local to the copula + adjective phrase structure). With
this in mind, it is not unreasonable to conjecture that learners are more likely
to have the cognitive resources to employ it when other structural demands
are not overwhelming. On the other hand, estar + adjective segments usually
occurred in discourse that was semantically dense (probably because it was
based on sources), with hypothetical elements (e.g., verbs of probability, causal
adverbial clauses), narrative features (e.g., present participles), and descriptive
features (e.g., type-2 adjectives). Like study-abroad learners (see Gunterman,
1992), learners in an instructional context also use estar + adjective when they
need to fulfill communicative functions that go beyond description. Learners'
awareness of and experience with different kinds of discourse (e.g., narration,
arguments) at advanced levels of instruction might explain these associations
rather than learners' acquisition of discrete grammatical and lexical items,
given the simplicity of their interlanguage, as Lafford (2004) asserted for the
gains observed for learners studying abroad. Cheng et al. (2008) concluded that
more abstract registers can evoke greater estar + adjective usage. In this study,
learners at advanced levels of instruction were asked to complete written tasks
in which they summarized a story or argued in favor of a position. We have
no way of knowing whether the task demands affected learners' estar + adjective
usage; however, the results suggest that the weighty processing demands of discourse in which referents and events were detached from
the writer, and in some cases based on readings with grammar and vocabulary
beyond their linguistic knowledge, could have led to more estar + adjective
use. The semantic and pragmatic goals of narratives as well as hypothetical discourse seem to entail more consideration of the states of affairs of referents and
changes in the background of a story or situation, thus being more compatible
with estar + adjective.
The findings in our study provide evidence of the influence of lexico-grammatical and discourse predictors on learners' copula + adjective usage, as
attested in previous studies (Cheng et al., 2008; Geeslin, 2003a, 2003b, 2005).
Under classroom conditions, learners' written discourse in response to
tasks designed to advance or test their communicative abilities revealed
that copula + adjective usually co-occurred with simple linguistic features. Developmentally, interesting and seemingly contradictory observations emerged
about the use of the two copula + adjective segments studied here. In general, the phrase structure (ser + adjective) associated with greater linguistic
complexity is typically associated with simple types of discourse, whereas the
phrase structure (estar + adjective) associated with less linguistic complexity
is typically associated with more complex discourse types. The implication
of these observations is that processing demands interact with the types of
discourse that learners can produce as they develop and the types of lexico-grammatical structures they produce in those types of discourse, as Cheng
et al. (2008) argued. The relatively simple linguistic features associated with
estar + adjective may be an indication that learners hit a wall when attempting
to communicate messages within complex discourse structures. Conversely, the
learner models reported here may indicate that the simple nature of discourse
structures like descriptions may afford learners processing resources for calling
up relatively complex structures. All in all, the present analysis suggests that
copula usage and choice depend on the amount of processing resources available and the discourse structure a learner produces. Consequently, we have a
much better understanding of the complex ways that pragmatic and discursive
features influence how learners make copula choices (in ways that are not how
native speakers make copula choices), as Geeslin (2003a, 2003b, 2005) has
argued.
The results of our study have pedagogical implications for teaching S/E +
adjective. It makes sense to teach ser + adjective to beginners first because
of its frequency and communicative value; however, estar + adjective probably deserves more attention and practice in different kinds of discourse
(e.g., narration and hypothetical situations). Given that there is some evidence that estar + adjective segments are more prominently distributed in
certain kinds of discourse (cf. Collentine, 2008), exposure to input with
estar + adjective-relevant types of discourse should have a positive effect.
What Spanish educators might examine is whether estar emerges reliably in learner
production as a result of having to write estar + adjective-relevant
types of discourse such as exploratory writing (as Cheng et al.'s, 2008, study
suggests) or narratives. An interesting corollary would be whether such discourse types emerge as a result of asking learners to produce estar + adjective
segments.
Revised version accepted 5 March 2009

Language Learning 60:2, June 2010, pp. 409–445


Notes
1 Screening data for inflated correlations is difficult in the linguistic sciences. Whereas
two words might represent the same part of speech (e.g., adjectives), the inflectional
morphology of adjectives can represent important distinctions for learners, such as
those that are singular and those that are plural. Additionally, natural language is
extremely redundant as communication systems go, and so our initial screening
process sought to balance semantic and structural collinearity considerations.
2 The analysis identifies the optimal combination of regressors for a criterion by
comparing and contrasting all possible predictor-variable combination values for
Mallows' Cp, which simultaneously represents any given model's bias (i.e., how well
it predicts the referent variable) and the variation associated with that bias. The
best-subsets analysis compares numerous combinations of variables by
identifying (a) the number of and (b) which predictor variables balance bias and
variance where the mean-square error of a combination is small. The resulting
model has a small bias with the least amount of predictor variables, such that the
resulting model contains a highly reduced number of predictor variables whose
combination predicts values for the criterion variable that are closest to the observed
values. Statisticians recommend best-subsets analysis when the potential number of
predictor variables is large because stepwise methods tend to miss identifying
models that are equally good at balancing bias and variance as the resulting model
they produce.
3 R is an open-source statistical package based on Bell Labs' (proprietary) S
programming language, a standard among statisticians for statistical programming
(see http://www.r-project.org/). R is gaining increasing popularity in academic
circles because of its reliability, statistical accuracy, and flexibility (it contains
numerous [tested] add-on modules) and because it is freely available under
an open-source license.
4 As a simplified example, because this process extrapolates the true coefficient for
each level, we can extrapolate individual level effects in the following fashion. For
instance, if the X1 coefficient were 8.0 for level 1, 5.0 for level 2, and 3.0 for level 3
and if the process calculates the difference coefficient for X1 between levels 1 and 2
to be 3.0 (i.e., 8.0 − 5.0 = 3.0) and between levels 2 and 3 to be 2.0 (i.e., 5.0 −
3.0 = 2.0), we infer the difference between levels 1 and 3 by summing these
two difference coefficients, or 5.0 (i.e., (8.0 − 5.0) + (5.0 − 3.0)). See Hardy
(1993) for details.
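The arithmetic in Note 4 can be checked with a short Python sketch. The coefficient values are the illustrative ones given in the note; the variable names are hypothetical and do not come from the authors' analysis.

```python
# Illustrative X1 coefficients for each instructional level, taken from Note 4.
level_coefs = {1: 8.0, 2: 5.0, 3: 3.0}

# Difference coefficients between adjacent levels, as the procedure would report them.
diff_1_vs_2 = level_coefs[1] - level_coefs[2]
diff_2_vs_3 = level_coefs[2] - level_coefs[3]

# The level 1 versus level 3 difference is the sum of the adjacent differences.
diff_1_vs_3 = diff_1_vs_2 + diff_2_vs_3

print(diff_1_vs_2, diff_2_vs_3, diff_1_vs_3)
```

The sum works because the level 2 coefficient cancels: (b1 - b2) + (b2 - b3) = b1 - b3.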

References
Belz, J. (2004). Learner corpus analysis and the development of foreign language
proficiency. System, 32, 577–597.


Biber, D., & Conrad, S. (2001). Introduction: Multi-dimensional analysis and the study
of register variation. In S. Conrad & D. Biber (Eds.), Variation in English:
Multi-dimensional studies (pp. 3–13). London: Longman.
Biber, D., Davies, M., Jones, J., & Tracy-Ventura, N. (2006). Spoken and written
register variation in Spanish: A multi-dimensional analysis. Corpora, 1, 1–37.
Cheng, C., Lu, H., & Giannakouros, P. (2008). The uses of Spanish copulas by
Chinese-speaking learners in a free writing task. Bilingualism: Language and
Cognition, 11, 301–317.
Collentine, J. G. (2004). The effects of learning contexts on morphosyntactic and
lexical development. Studies in Second Language Acquisition, 26, 227–248.
Collentine, J. G. (2008). The role of discursive features in SLA modeling and
grammatical frequency: A response to Cheng, Lu and Giannakouros. Bilingualism:
Language and Cognition, 11, 319–321.
Dalgaard, P. (2008). Introductory statistics with R. New York: Springer.
Fernández Leborans, M. J. (1999). La predicación: las oraciones copulativas. In I.
Bosque & V. Demonte (Eds.), Gramática descriptiva de la lengua española (pp.
2354–2460). Madrid: Espasa.
Geeslin, K. (2002). The second language acquisition of copula choice and its
relationship to language change. Studies in Second Language Acquisition, 24,
419–451.
Geeslin, K. (2003a). A comparison of copula choice in advanced and native Spanish.
Language Learning, 53, 703–764.
Geeslin, K. (2003b). The role of adjectival features in the second language acquisition
of copula choice. In P. Kempchinsky & C. Piñeros (Eds.), Theory, practice and
acquisition: Papers from the 6th Hispanic Linguistics Symposium and the
5th Conference on the Acquisition of Spanish and Portuguese (pp. 332–351).
Medford, MA: Cascadilla Press.
Geeslin, K. (2005). Crossing disciplinary boundaries to improve the analysis of second
language data: A study of copula choice with adjectives in Spanish. Munich:
LINCOM Europa Publishers.
Geeslin, K., & Guijarro-Fuentes, P. (2006). Second language acquisition of variable
structures in Spanish and Portuguese speakers. Language Learning, 56, 53–107.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora,
second language acquisition and foreign language teaching. Amsterdam:
Benjamins.
Gunterman, G. (1992). An analysis of interlanguage development over time: Part II,
ser and estar. Hispania, 75, 1294–1303.
Halliday, M. A. K. (1970). Language structure and language function. In J. Lyons
(Ed.), New horizons in linguistics (pp. 140–165). Harmondsworth, UK: Penguin
Books.
Hardy, M. A. (1993). Regression with dummy variables. Sage University Papers, QASS
# 07-093. Newbury Park, CA: Sage.

Klein, W., & Perdue, C. (1997). The basic variety (Or: Couldn't natural languages be
much simpler?). Second Language Research, 13, 301–347.
Lafford, B. A. (2004). The effect of the context of learning on the use of
communication strategies by learners of Spanish as a second language. Studies in
Second Language Acquisition, 26, 201–225.
Leonetti, M. (1994). Ser y estar: estado de la cuestión. Barataria, 1, 182–205.
Luján, M. (1981). The Spanish copulas as aspectual indicators. Lingua, 54, 165–210.
Miller, A. (2002). Subset selection in regression. Boca Raton, FL: Chapman &
Hall/CRC.
Myles, F. (2005). Interlanguage corpora and second language acquisition research.
Second Language Research, 21, 373–391.
Myles, F., & Mitchell, R. (2004). Using information technology to support empirical
SLA research. Journal of Applied Linguistics, 1, 169–196.
Rencher, A. (2002). Methods of multivariate analysis. New York: Wiley-Interscience.
Ryan, J., & Lafford, B. (1992). The acquisition of lexical meaning in a study abroad
environment: Ser + estar and the Granada experience. Hispania, 75, 714–722.
Rutherford, W., & Thomas, M. (2001). The Child Language Data Exchange System in
research on second language acquisition. Second Language Research, 17, 195–212.
Shehadeh, A. (2002). Comprehensible output, from occurrence to acquisition: An
agenda for acquisitional research. Language Learning, 52, 597–649.
Silva-Corvalán, C. (1986). Bilingualism and language change: The extension of estar
in Los Angeles Spanish. Language, 62, 587–608.
Silva-Corvalán, C. (1994). Language contact and change: Spanish in Los Angeles.
Oxford: Clarendon Press.
Siyanova, A., & Schmitt, N. (2007). Native and nonnative use of multi-word versus
one-word verbs. IRAL, 45, 119–139.
Swain, M. (1985). Communicative competence: Some roles of comprehensible input
and comprehensible output in its development. In S. Gass & C. Madden (Eds.),
Input in second language acquisition (pp. 235–253). Rowley, MA: Newbury House.
VanPatten, B. (1985). The acquisition of ser and estar in adult second language
learners: A preliminary investigation of transitional stages of competence.
Hispania, 68, 399–406.
VanPatten, B. (1987). The acquisition of ser and estar: Accounting for developmental
patterns. In B. VanPatten, T. Dvorak, & J. Lee (Eds.), Foreign language learning: A
research perspective (pp. 61–75). New York: Newbury House.


Copyright of Language Learning is the property of Wiley-Blackwell and its content may not be copied or
emailed to multiple sites or posted to a listserv without the copyright holder's express written permission.
However, users may print, download, or email articles for individual use.
