Chapter 1: Thinking like a scientist Chapter 2: Getting started: ideas, resources, ethics Chapter 3: defining, measuring and manipulating variables Chapter 4: descriptive methods Chapter 5: data organization and descriptive statistics Chapter 6: correlational methods and statistics Chapter 7: probability and hypothesis testing Chapter 8: introduction to inferential statistics Chapter 9: the logic of experimental design Chapter 10: inferential statistics: two group designs Chapter 11: experimental designs with more than two levels of an independent variable Chapter 12: complex experimental designs Chapter 13: quasi-experimental and single-case designs
Scientific research has three basic goals: (1) to describe behavior, (2) to predict behavior, and (3) to explain behavior. p.14
Research Methods in Science
Descriptive methods:
o Observational: naturalistic observation and laboratory observation (p.15)
o Case study method: in-depth study of one or more individuals, e.g. Piaget's theory of cognitive development in children was developed by simply describing the individual(s) being studied.
o Survey method: question individuals on topic(s) and then describe their responses
Predictive (relational) methods: we do not systematically manipulate the variables of interest; we only measure them. Since alternative explanations can't be ruled out, these methods cannot establish causation.
o Correlational method: assesses the degree of relationship between two measured vars
o Quasi-experimental method: differs from the experimental method in that subjects choose to be members of the different groups being studied, i.e. the subject/participant var can't be changed (it is not a manipulated variable), e.g. sorority vs. non-sorority girls (p.17)
Experimental method: controls are very important in such experiments; you control who is in the study (get a representative sample), who participates in each group (control for differences in participants by random assignment between the control (baseline) group and the experimental group), and the treatment each group receives (e.g. some take Vit C and some do not). Other vars such as amt of sleep, type of diet, and amt of exercise might also need to be controlled. p.19
Aptitude tests measure an individual's potential to do something, whereas achievement tests measure an individual's competence in an area. p.63 Behavioral measures are often referred to as observational measures because they involve observing anything that a participant does. Most physical measures, or measures of bodily activity, are not directly observable. Reliability refers to the consistency or stability of a measuring instrument. p.65 Sources of error include trait error (e.g. participant truthfulness) and method error (e.g. how an operator uses the equipment). In effect, a measurement is a combination of the true score and an error score: Observed score = True score + Measurement error
Reliability is measured using correlation coefficients. A correlation coefficient measures the degree of relationship between two sets of scores and can vary between -1.00 and +1.00. To establish the reliability (or consistency) of a measure, we expect a strong correlation coefficient, usually in the .80s or .90s, between the two variables or scores being measured. A positive coefficient indicates that those who scored high on the measuring instrument at one time also scored high at another time, and those who scored low at one point scored low again. p.68
Types of reliability: test/retest reliability (lowered by practice effects, i.e. a person can get better between testings from practice), alternate-forms reliability (diff but equivalent questions on the tests), split-half reliability, and inter-rater reliability.
Validity: a measure that is valid measures what it claims to measure. p.70 Content validity is assessed by a systematic examination of the test content to determine whether it covers a representative sample of the domain of behaviors to be measured. Criterion validity: the extent to which a measure can estimate present performance (concurrent validity) or predict future performance (predictive validity). The construct validity of a test is the extent to which a measuring instrument accurately measures the theoretical construct or trait that it is designed to measure. p.72: A test can be reliable without being valid, but it can never be valid without being reliable.
together. p.90
A socially desirable response is one that is given because participants believe it is deemed appropriate by society, rather than because it truly reflects their own views or behaviors.
Mail survey: less sampling bias than phone/email b/c of wide availability; also ppl are more comfortable answering sensitive questions; disadv: if Qs are unclear, no clarification is possible; low response rate: 20%
Sampling Techniques for Surveys: There are two ways to sample individuals from a population: probability sampling and nonprobability sampling.
Probability sampling (p.95): random selection; stratified random sampling (guarantees that the sample accurately represents the population on specific characteristics); cluster sampling (e.g. sample from classes that are required of all students at the university, such as English composition).
Non-probability sampling: individual members of the population do not have an equal likelihood of being selected
o Convenience (haphazard) sampling: if you wanted a sample of 100 college students, you could stand outside of the library and ask people who pass by to participate
o Quota sampling: quota sampling is to nonprobability sampling what stratified random sampling is to probability sampling, but still not much effort is devoted to creating a sample that is truly representative of the population
For the sample mean, $\bar{X} = \frac{\sum X}{N}$. Median: not affected by extreme scores. Measures of Variation: A measure of central tendency provides information about the middleness of a distribution of scores but not about the width or spread of the distribution. p.114 A measure of variation indicates the degree to which scores are either clustered or spread out in a distribution. Range: The simplest measure of variation is the range, the difference between the lowest and the highest scores in a distribution. The range is usually reported with the mean of the distribution.
Standard Deviation for a sample: $S = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$. Compare this to the average deviation for a population, which is given by $AD = \frac{\sum |X - \mu|}{N}$. Note that the standard deviation will always be at least as large as the average deviation (and usually larger) because the squaring of the terms gives more weight to outlying values. If, however, sample data is being used to estimate the population standard deviation, then a modification that divides by N - 1 must be used, giving the unbiased estimator: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$ p.126 Notice that the symbol for the unbiased estimator of the population standard deviation is s (lowercase), whereas the symbol for the sample standard deviation is S (uppercase). The estimate has N - 1 in the denominator to compensate for small samples not containing as much variability as the real population. Variance is the square of the standard deviation.
Normal distributions are bell-shaped, symmetrical, and have an identical mean, median, and mode. They are unimodal; most observations are centrally clustered. Last, when standard deviations are plotted on the x-axis, the percentage of scores falling between the mean and any point on the x-axis is the same for all normal curves. (p.121)
Kurtosis: how flat or peaked a normal distribution is; Platykurtic = short and wide (think: platypus = close to the ground, flat); Mesokurtic = medium height/breadth; Leptokurtic = tall and thin (think: lepto = leap)
In a positively skewed distribution, the peak is to the left of the center point, and the tail extends toward the right. Reason for its name: a few individuals have extremely high scores that pull the distribution in that direction. Negatively skewed is just the opposite. p.122 If your disease has a low median survival time, you would prefer a positive skew: this means some people live for a very long time post-diagnosis.
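As a rough illustration of the spread measures described above, the following Python sketch computes the average deviation, the descriptive standard deviation S (dividing by N), and the unbiased estimator s (dividing by N - 1). The scores and function names are hypothetical and chosen only for illustration; they do not come from the textbook.

```python
import math

def average_deviation(scores):
    """Mean absolute deviation of the scores from their mean."""
    mean = sum(scores) / len(scores)
    return sum(abs(x - mean) for x in scores) / len(scores)

def sd_descriptive(scores):
    """S: standard deviation with N in the denominator."""
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / len(scores))

def sd_unbiased(scores):
    """s: estimate of the population SD with N - 1 in the denominator."""
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / (len(scores) - 1))

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean = 5

ad = average_deviation(scores)      # 1.5
big_s = sd_descriptive(scores)      # 2.0
small_s = sd_unbiased(scores)       # about 2.138
print(ad, big_s, round(small_s, 3))
```

On this data the ordering AD < S < s holds, which matches the point that squaring weights outlying values more heavily and that dividing by N - 1 enlarges the estimate.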
The Z-score (p.124): A z-score or standard score is a measure of how many standard deviation units an individual raw score falls from the mean of the distribution. Thus, when calculating a z-score for an individual in comparison to a sample, $z = \frac{X - \bar{X}}{S}$, while for a population, $z = \frac{X - \mu}{\sigma}$.
If the distribution of scores for which you are calculating transformations (z-scores) is normal (symmetrical and unimodal), then the resulting z distribution is referred to as the standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1. p.126 The standard normal curve can also be used to determine an individual's percentile rank: the percentage of scores equal to or below the given raw score. p.131
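A minimal sketch of the z-score and percentile-rank ideas above, assuming the scores are normally distributed. The exam mean, standard deviation, and raw score are hypothetical; `NormalDist` is the standard normal distribution from Python's `statistics` module.

```python
from statistics import NormalDist

def z_score(raw, mean, sd):
    """Number of standard deviation units the raw score lies from the mean."""
    return (raw - mean) / sd

def percentile_rank(raw, mean, sd):
    """Percentage of scores at or below `raw`, assuming a normal distribution."""
    return NormalDist().cdf(z_score(raw, mean, sd)) * 100

# Hypothetical exam: mean 75, standard deviation 10, raw score 85.
z = z_score(85, 75, 10)            # 1.0
pr = percentile_rank(85, 75, 10)   # about 84.13
print(z, round(pr, 2))
```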
The overall formula is below: General rule of thumb: at least 10 ppl per variable. An alternative, computational formula is listed on p.154.
Coefficient of Determination: Calculated by squaring the correlation coefficient, the coefficient of determination ($r^2$) is a measure of the proportion of the variance in one variable that is accounted for by another variable. $r^2$ is typically reported as a percentage. p.154
Correlations for Nominal or Ordinal Data:
Spearman's rank-order correlation coefficient: both vars are on an ordinal (ranking) scale. If one var is interval/ratio, it must first be converted to the ordinal scale.
Point-biserial correlation coefficient: one var is a dichotomous (two-value) nominal var (e.g. gender) and the other is interval or ratio.
Phi coefficient: both vars are dichotomous nominal vars.
Regression Analysis: p.156 A tool that enables us to predict an individual's score on one variable based on knowing one or more other variables is regression analysis. Regression analysis involves determining the equation for the best-fitting
line for a data set. The line has the form of y = mx + b, but is written as follows: $Y' = bX + a$, where $Y'$ is the predicted value on the Y variable, b is the slope of the line, X represents an individual's score on the X variable, and a is the y-intercept. To compute the slope: $b = r\left(\frac{\sigma_Y}{\sigma_X}\right)$. To compute the y-intercept: $a = \bar{Y} - b\bar{X}$, where the bars are the respective sample means. Multiple regression analysis involves combining several predictor variables in a single regression equation to increase the predictive accuracy, because in the real world it is unlikely that one variable is affected by only one other variable.
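The regression steps above can be sketched as follows. This is a hedged illustration, not the textbook's worked example: the data are hypothetical, Pearson's r is computed in its standard z-score form (the mean of the products of paired z-scores, which is an assumption about the formula the notes refer to), and the slope is taken as r times the ratio of the standard deviations of Y and X.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r as the mean product of paired z-scores (SDs use N)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)

def regression_line(xs, ys):
    """Return (slope b, intercept a) for the line Y' = bX + a."""
    b = pearson_r(xs, ys) * (pstdev(ys) / pstdev(xs))
    a = mean(ys) - b * mean(xs)
    return b, a

xs = [1, 2, 3, 4, 5]   # hypothetical predictor scores
ys = [2, 4, 5, 4, 5]   # hypothetical criterion scores

r = pearson_r(xs, ys)            # about 0.775
b, a = regression_line(xs, ys)   # about 0.6 and 2.2
predicted = b * 6 + a            # predicted Y' for X = 6, about 5.8
print(round(r, 3), round(r ** 2, 2), round(b, 2), round(a, 2), round(predicted, 2))
```

Squaring r here gives a coefficient of determination of about 0.60, i.e. roughly 60% of the variance in Y is accounted for by X in this made-up data set.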
Two-tailed test: e.g. the researcher just wants to show that there are IQ differences between the two groups, but isn't concerned with the direction of those differences.
Errors: p.186
The p-value or alpha level: When a result is statistically significant at the 0.05 (or 5%) level, it means that, if the null hypothesis were true, a difference as large as the one observed between the sample and the population would occur by chance no more than 5 out of every 100 times. In other words, the variation between groups is more plausibly due to true/real differences between them than to chance. In this case, the risk of a Type I error is 5%.
an individual's score to the population mean, but rather a sample mean must be compared instead with a distribution of sample means, known as the sampling distribution. Standard Error of the Mean p.198: the standard error of the mean, $\sigma_{\bar{X}}$ (the standard deviation of the sampling distribution), can never be as large as $\sigma$, the standard deviation for the distribution of individual scores. Think about it this way: if the size of each of these samples were to approach the population size, their means would all be tightly clustered around the pop. mean and the standard deviation of the sampling distribution would be very small. Thus, the central limit theorem states that for any population with mean $\mu$ and standard deviation $\sigma$, the distribution of sample means for sample size N will have a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{N}$ and will approach a normal distribution as N approaches infinity. p.198 Thus, $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}$ and $z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$.
The z-score will tell us how many standard deviation units a sample mean is from the population mean, or the likelihood that the sample is from that population. p.175 e.g. if we wind up with z = 2.06 for a one-tailed test, z critical = 1.645, i.e. the area under the curve to the right of that value is 5%. The z-value would be significant and H0 would be rejected. In APA style, report the result as z (N = 50) = 2.06, p < .05 (one-tailed). For a two-tailed test, z critical at an alpha of 0.05 is +/- 1.96.
Statistical power refers to the probability of correctly rejecting a false H0. Two ways to increase statistical power: First, with a one-tailed test, we are more likely to reject H0 because z obt does not have to be as large. p.180 Second, by increasing sample size, we reduce the standard error of the mean and increase the z value.
Assumptions of the z-test: the distribution of sample means is normal; if the sample size is small (N < 30), the z-test may not be appropriate. p.206 In such cases, or when the S.D. of the pop. is not known, use the t-test.
Confidence Intervals based on the z-distribution p.208 This differs from the previously described z test in that we are not determining whether the sample mean differs significantly from the population mean; rather, we are estimating the population mean based on knowing the sample mean. We are usually given the sample size (e.g. N = 100), sample mean ($\bar{X}$ = 86), and population standard deviation ($\sigma$ = 17). First we calculate $\sigma_{\bar{X}} = \sigma/\sqrt{N} = 17/\sqrt{100} = 1.7$; next, we rearrange the z-equation as follows: $\bar{X} - z_{cv}\,\sigma_{\bar{X}} \le \mu \le \bar{X} + z_{cv}\,\sigma_{\bar{X}}$, with $z_{cv} = 1.96$, to get $82.668 \le \mu \le 89.332$ for the 95% CI. It is also possible to do hypothesis testing with confidence intervals. For example, if you construct a 95% confidence interval based on knowing a sample mean and then determine that the population mean is not in the confidence interval, the result is significant.
The T-test p.209: It is similar to the z-test: a parametric statistical test of the null hypothesis for a single sample that determines the # of standard deviations a score is from the mean of a distribution. However, there are key differences from the z-test: here, the population variance is not known; also, t distributions, although symmetrical and bell-shaped, do not fit the standard normal distribution because the size of the samples is usually less than N = 30. Further, unlike the z distribution, of which there is only one, the t distributions are a family of symmetrical distributions that differ for each sample size. As sample size increases, the t-distribution approaches the z-distribution. Degrees of freedom equal N - 1. Notice that when the degrees of freedom approach infinity at the bottom of the table, for a two-tailed alpha of 0.05 (95% CI), the critical t-score is 1.960, which is the same as the critical z-score value!
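The confidence-interval arithmetic above (N = 100, sample mean 86, population standard deviation 17) can be checked with a short Python sketch. The function name is hypothetical, and 1.96 is the usual two-tailed critical z for a 95% interval.

```python
import math

def z_confidence_interval(sample_mean, pop_sd, n, z_crit=1.96):
    """Return (lower, upper) bounds of a z-based confidence interval."""
    standard_error = pop_sd / math.sqrt(n)   # standard error of the mean
    margin = z_crit * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(86, 17, 100)
print(round(low, 3), round(high, 3))   # 82.668 89.332
```

A hypothesized population mean that falls outside this interval, such as 90, would be rejected at the .05 level (two-tailed), which is the link between confidence intervals and hypothesis testing mentioned in the notes.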
The estimated standard error of the mean: $t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$, where $s_{\bar{X}} = \frac{s}{\sqrt{N}}$. $s_{\bar{X}}$ is the estimated standard error of the mean, i.e. an estimate of the standard deviation of the sampling distribution based on sample data, since the pop. standard dev is not known. s (the estimated standard deviation for a population, based on sample data): $s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$. APA style: t(9) = 2.06, p < .05 (one-tailed) p.212
The chi-square ($\chi^2$) goodness-of-fit test p.216: Neither $\mu$ nor $\sigma$ is known. It is a nonparametric statistical test that tests the observed frequency (the frequency with which participants fall into a category) against the expected frequency (the frequency expected in a category if the sample data represent the population). p.191 It is used with nominal or ordinal data (i.e. categorical data).
$\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed frequency, E is the expected frequency, and $\sum$ indicates that we must sum the indicated fractions for each category in the study (e.g. for the pregnant and not pregnant groups). The null hypothesis is rejected if $\chi^2_{obt}$ is greater than $\chi^2_{cv}$. The $\chi^2_{cv}$ is found in Table A.4 in Appendix A. The degrees of freedom is the number of categories minus 1.
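A minimal sketch of the goodness-of-fit computation, summing (O - E)^2 / E over the categories. The observed and expected counts below are hypothetical (a sample of 80 with an assumed 15% population rate), and 3.841 is the standard critical chi-square value for 1 degree of freedom at alpha = .05.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for two categories (e.g. pregnant, not pregnant).
observed = [7, 73]
expected = [12, 68]    # 15% and 85% of 80

chi2_obt = chi_square_gof(observed, expected)
df = len(observed) - 1             # number of categories minus 1
chi2_cv = 3.841                    # critical value for df = 1, alpha = .05
print(round(chi2_obt, 3), df, chi2_obt > chi2_cv)   # 2.451 1 False
```

With these made-up counts the obtained value is below the critical value, so the null hypothesis would not be rejected.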
Pearson r Correlation Coefficients and Statistical Significance p.218 The null hypothesis (H0) is that the true population correlation is .00, i.e. the variables are not related. The alternative hypothesis (Ha) is that the population correlation is not equal to .00, i.e. the variables are related. A one-tailed test of a correlation coefficient means that we have predicted the expected direction of the correlation coefficient (i.e., predicted either a positive or negative correlation). Table A.5 in Appendix A shows critical values for both one- and two-tailed tests of r, the Pearson product-moment correlation coefficient. The degrees of freedom = N - 2; if the correlation coefficient of +0.33 is based on 20 pairs of observations, then the degrees of freedom are 20 - 2 = 18.
Summary: in the case of the z test, $\mu$ and $\sigma$ must be known; for the t test, only $\mu$ is needed. For the chi-square test, neither $\mu$ nor $\sigma$ is needed.
History: change in dependent var due to external circumstances, e.g. stress reduction b/c exams fall at the start and vacation at the end of the study
Maturation: participants mature physically, socially, and cognitively over the course of the study
Testing: the testing effect, a change in performance due to familiarity with and practice on test items. Includes both the practice effect (+) and the fatigue effect
Regression to the mean: extreme scores that are the product of chance will moderate upon retesting
Instrumentation effect: observer becomes better at, or more fatigued with, taking measures
Attrition/Mortality: e.g. the heaviest smokers in an experimental cessation group drop out; post-test measures would be unduly optimistic
Diffusion of treatment: people receive treatment info from other participants
Experimenter/Participant effects: experimenter bias or expectancy effects influence the outcome, e.g. Clever Hans, the mathematical horse, receiving cues from his owner. Solve via single blind (either the experimenter or the participants are blind to the manipulation being made) or double blind (both unaware). Participant effects include reactivity: a change in behavior due to being watched. Also, the placebo effect.
Floor and ceiling effects: e.g. measuring rat weight in pounds, so no change is detected (floor effect); weighing an elephant with a bathroom scale that has a 350 lb max limit (ceiling effect)
Threats to External Validity
Generalization to populations: hampered by the college sophomore problem
Generalization from lab settings: control is maximized in lab settings (the artificiality criticism); solve by conceptual replication, i.e. test the same concepts via a diff indep var or dep var.
Correlated-Groups Designs: participants in experimental and control groups are related
Within-participants design: also known as a repeated measures design; all participants serve in all conditions. Benefits: you need fewer participants (e.g. if there are 4 conditions and you need 15 ppl per condition, then a between-participants design needs 60 ppl, whereas a within-participants design needs only 15), it takes less time, and it increases statistical power b/c it reduces variability due to individual differences; this design is popular in psychological research. p.240 Downsides: b/c participants are tested at least twice, practice/fatigue effects can occur; solve via counterbalancing, i.e. varying the order in which the conditions are presented across participants. However, 3 conditions have 6 possible orderings and 4 conditions have 24; therefore complete counterbalancing (using all of the possible orderings of conditions) is often not practical. Also carry-over effects: a drug administered in one condition affects performance in subsequent conditions.
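To see why complete counterbalancing grows unwieldy, the orderings can simply be enumerated. The sketch below uses `itertools.permutations`; the condition labels are hypothetical.

```python
from itertools import permutations

def all_orderings(conditions):
    """Every possible order in which the conditions could be presented."""
    return list(permutations(conditions))

three = all_orderings(["A", "B", "C"])
four = all_orderings(["A", "B", "C", "D"])
five = all_orderings(["A", "B", "C", "D", "E"])

print(len(three), len(four), len(five))   # 6 24 120
print(three[0], three[-1])                # ('A', 'B', 'C') ('C', 'B', 'A')
```

The count is k factorial for k conditions, so a study needs at least that many participants (or a multiple of it) to use every ordering equally often.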
Matched-Participants Experimental design: for each participant in one condition, there is a participant in the other condition(s) who matches him or her on some relevant variable or variables. Has advantages over the between-participants design (groups are more similar) and the within-participants design (fewer carryover/testing effects); downsides: more people needed; also mortality effects (if one person drops out, the pair is compromised); also difficulty finding matching participants (p.242)
The dependent var is participants' scores on a test. Statistical significance indicates that an observed difference between two descriptive statistics (such as means) is unlikely to have occurred by chance.
Rather than comparing a single sample mean to a population mean, we are comparing two sample means. To determine how far the difference between the sample means is from the difference between the population means, we convert the mean differences to standard errors.
The standard error of the difference between the means does have a logical meaning. If we took thousands of pairs of samples from these two populations and found $\bar{X}_1 - \bar{X}_2$ for each pair, those differences between means would not all be the same. They would form a distribution. The mean of that distribution would be the difference between the means of the populations ($\mu_1 - \mu_2$) and its standard deviation would be $s_{\bar{X}_1 - \bar{X}_2}$. Thus, $t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$, where $s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$. $s_1^2$ and $s_2^2$ are the variances of the two groups. p.252 The degrees of freedom for this independent-groups t test are $(n_1 - 1) + (n_2 - 1)$. Refer to Table A.3 for the t critical value. APA style: t(18) = 4.92, p < .05 (one-tailed). Note that statistical power (via a larger t value) can be increased by the following three things: greater differences produced by the independent variable, less variability of raw scores in each condition, and increased sample size.
Effect Size: Cohen's d and $r^2$
effect size: the proportion of variance in the dependent variable that is accounted for by the manipulation of the independent variable. It is an estimate of the effect of the independent variable, regardless of sample size. p.232 For the t test, one formula for effect size, known as Cohen's d, is $d = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{2} + \frac{s_2^2}{2}}}$. According to Cohen (1988, 1992), a small effect size is one of at least 0.20, a medium effect size is at least 0.50, and a large effect size is at least 0.80. e.g. APA: t(18) = 4.92, p < .05 (one-tailed), d = 2.198
$r^2$: the proportion of variance accounted for in the dependent variable based on knowing which treatment group the participants were assigned to for the independent variable.
Confidence Intervals: Same formula as before (Ch. 7), except that rather than using the sample mean and the standard error of the mean, we use the difference between the means and the standard error of the difference between means. p.257
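The independent-groups t test and its effect sizes can be sketched in Python as below. This is an illustration under stated assumptions: the two groups of scores are hypothetical, Cohen's d uses the equal-weight pooled form (the square root of half of each variance summed), and $r^2 = t^2 / (t^2 + df)$ is the common conversion from t, which the notes do not spell out.

```python
import math
from statistics import mean, variance

def independent_t(group1, group2):
    """Return (t, df) for an independent-groups t test."""
    n1, n2 = len(group1), len(group2)
    # Standard error of the difference between the means.
    se_diff = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    t = (mean(group1) - mean(group2)) / se_diff
    return t, (n1 - 1) + (n2 - 1)

def cohens_d(group1, group2):
    """Mean difference divided by sqrt(s1^2 / 2 + s2^2 / 2)."""
    pooled_sd = math.sqrt(variance(group1) / 2 + variance(group2) / 2)
    return (mean(group1) - mean(group2)) / pooled_sd

def r_squared_from_t(t, df):
    """Proportion of variance accounted for, computed from t and df."""
    return t ** 2 / (t ** 2 + df)

group1 = [8, 9, 7, 10, 6]   # hypothetical scores, condition 1
group2 = [5, 6, 4, 7, 3]    # hypothetical scores, condition 2

t_obt, df = independent_t(group1, group2)   # 3.0 with df = 8
d = cohens_d(group1, group2)                # about 1.897
r2 = r_squared_from_t(t_obt, df)            # about 0.529
print(round(t_obt, 3), df, round(d, 3), round(r2, 3))
```

For equal group sizes n, this form of d equals t times sqrt(2 / n). Assuming the t(18) = 4.92 example had two groups of 10, that gives about 2.20, close to the reported d = 2.198 (the small gap is presumably rounding).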
The standard error of the difference scores, used for differences between dependent samples: $s_{\bar{D}} = \frac{s_D}{\sqrt{N}}$, where $s_D$ is the unbiased estimator of the standard deviation of the difference scores and N is the number of participants in each group.
Confidence interval: e.g. on word memorization differences between concrete and abstract words, we could answer that we are 95% confident that the difference in performance on the 20-item memory test between the two word type conditions would be between 0.96 and 4.04 words recalled correctly.
Nonparametric Tests
A nonparametric test does not use any population parameters, such as the mean and standard deviation. Three nonparametric tests: the Wilcoxon rank-sum test, the Wilcoxon matched-pairs signed-ranks T test (both used with ordinal data), and the chi-square test of independence, used with nominal data. p.240
Wilcoxon Rank-Sum Test: p.265 The Wilcoxon rank-sum test is similar to the independent-groups t test; however, it uses ordinal data (ranking) rather than interval-ratio data and compares medians rather than means. Interval or ratio data may be converted to ranked ordinal data. The underlying distribution is not assumed to be normal. First, sum the ranks for the group expected to have the smaller total. This value needs to be equal to or less than the critical value to be statistically significant. Refer to Table A.6, which presents the critical values for one-tailed tests only; n1 (the number of participants in a group) is always the smaller of the two groups. If a two-tailed test is used, the table can be adapted by dividing the alpha level in half. Assumptions of this test: p.266
Wilcoxon Matched-Pairs T Test
This is a nonparametric statistic and is necessary whenever the distribution is skewed (i.e. not normal). p.243 e.g. during the first term, the teacher measures the number of books her students read and ranks them ordinally; during the second term, a rewards program is instituted and the students are again ranked. Is there a statistically significant difference between the # of books read?
The null hypothesis is that the median number of books read does not differ; the alternative hypothesis is that the median number of books read during the rewards program is greater.
Step 1: for each student, compute a difference score (subtract books read in the second term from those read in the first term); if the program had no effect, we would expect most difference scores to be close to 0.
Step 2: rank the absolute values of the difference scores. If two scores tied for position 1 have the same numerical value, they are both ranked 1.5 and the next score gets a 3. Note that any difference scores of zero are not ranked and do not figure into the N value.
Step 3: give each rank the sign of the difference score it represents.
Step 4: sum the positive ranks and the negative ranks separately. For a two-tailed test, T obt is equal to the smaller of the summed ranks. In contrast, the T obt for a one-tailed test is the sum of the signed ranks predicted to be smaller. p.268
As with the Wilcoxon rank-sum test, the obtained value needs to be equal to or less than the critical value to be statistically significant.
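The four steps above can be sketched in Python as follows. The book counts are hypothetical and the function name is made up for this illustration; tied absolute differences receive the average of the ranks they occupy, and zero differences are dropped before ranking, as the steps describe.

```python
def wilcoxon_signed_ranks(first, second):
    """Return (sum of positive ranks, sum of negative ranks, N)."""
    # Step 1: difference scores, dropping zero differences.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    # Step 2: rank by absolute value, averaging the ranks of ties.
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        tied_rank = (i + 1 + j) / 2   # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = tied_rank
        i = j
    # Steps 3 and 4: attach signs and sum positive and negative ranks.
    positive = sum(r for d, r in zip(ordered, ranks) if d > 0)
    negative = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return positive, negative, len(ordered)

first_term = [3, 5, 2, 4, 6, 1]    # hypothetical books read, term 1
second_term = [5, 5, 6, 3, 9, 2]   # hypothetical books read, term 2

pos, neg, n = wilcoxon_signed_ranks(first_term, second_term)
t_two_tailed = min(pos, neg)
print(pos, neg, n, t_two_tailed)   # 1.5 13.5 5 1.5
```

In this made-up data, two difference scores tie at an absolute value of 1 and share rank 1.5, and the next difference gets rank 3. The obtained T would then be compared with the tabled critical value for N = 5.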
Objective: determine whether babysitters are more likely to have taken first aid than those who have never worked as babysitters. To determine the expected frequency for each cell: $E = \frac{RT \times CT}{N}$, where RT is the row total, CT is the column total, and N is the total number of observations. p.246 If the $\chi^2_{obt}$ exceeds the $\chi^2_{cv}$, then the null hypothesis can be rejected.
Chi-Square test and effect size: Phi Coefficient
As with the t tests discussed earlier in this chapter, we can also compute the effect size for a $\chi^2$ test of independence: $\phi = \sqrt{\frac{\chi^2}{N}}$. Cohen's (1988) specifications for the phi coefficient indicate that a phi coefficient of .10 is a small effect, .30 is a medium effect, and .50 is a large effect. In our particular example, if the phi value is small, then the difference observed in whether a teenager had taken a first aid class is not strongly accounted for by being a babysitter.
Summary: First consideration: determine whether to use a parametric or a nonparametric statistic; if the data are not normally distributed, use nonparametric; also if certain population parameters such as the mean and standard deviation are not provided, use nonparametric (Wilcoxon or chi-square); if the data are normal, use parametric, such as the t test. Second consideration: whether a between-participants or correlated-groups design has been used. p.248
A nonparametric test is one that does not involve the use of any population parameters, such as the mean and standard deviation. In addition, a nonparametric test does not assume a bell-shaped distribution. The $\chi^2$ test is nonparametric because it fits this definition.
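A rough sketch of the chi-square test of independence and the phi coefficient: expected frequencies come from (row total x column total) / N, and phi is the square root of chi-square divided by N. The 2 x 2 counts are hypothetical, and 3.841 is the standard critical value for df = 1 at alpha = .05.

```python
import math

def expected_frequencies(table):
    """Expected count for each cell: (row total * column total) / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

def chi_square_independence(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

def phi_coefficient(chi2, n):
    """Effect size for a 2 x 2 chi-square test."""
    return math.sqrt(chi2 / n)

# Hypothetical counts. Rows: babysitters, non-babysitters.
# Columns: took a first aid class, did not.
table = [[65, 35], [43, 47]]
n_total = sum(sum(row) for row in table)    # 190

chi2_obt = chi_square_independence(table)   # about 5.727
phi = phi_coefficient(chi2_obt, n_total)    # about 0.174
print(round(chi2_obt, 3), round(phi, 3), chi2_obt > 3.841)
```

With these invented counts the result would be significant at .05, yet phi falls between Cohen's small (.10) and medium (.30) benchmarks, showing that significance and effect size answer different questions.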
Chapter 11: Experimental designs with More than Two Levels of an Independent Variable
The experiments described in Chapter 9 involved manipulating one independent variable with only two levels (aka treatments): either a control group and an experimental group or two experimental groups. Researchers may want more than 2 levels of an independent var b/c they can compare multiple treatments, e.g. compare a placebo group w/ control and experimental groups. p.281 If group 1 is compared to group 2, 2 to 3, 3 to 4, and so on, we increase the risk of a Type I error to $1 - (1 - \alpha)^c$, where c equals the number of comparisons performed. One way of counteracting this is to use a more stringent alpha level by performing the Bonferroni adjustment, in which the desired alpha level is divided by the number of tests or comparisons. However, the risk of a Type II error is then increased. A better method is to use a single statistical test that compares all groups: ANOVA. ANOVA is an inferential parametric statistical test for comparing the means of three or more groups that have interval or ratio data. p.286 If the data are ordinal, use the Kruskal-Wallis analysis of variance for a between-subjects design; for a within-subjects design, where the data are skewed and/or ordinal, use the Friedman rank test. If the data are nominal, use the chi-square test. If the F obt value is greater than the F crit value, the results of the ANOVA indicate that at least one of the sample means differs significantly from the others. In that case, a post hoc test that compares each of the groups in the study with each of the other groups must be conducted to determine which ones differ significantly from each other, e.g. Tukey's HSD test. p.297 Also, see p.296 for the assumptions of the ANOVA (interval-ratio data, normally distributed, etc.)
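The inflation of Type I error across multiple comparisons, and the Bonferroni correction for it, can be illustrated numerically. The sketch assumes the standard familywise formula 1 - (1 - alpha)^c for c independent comparisons; the function names are hypothetical.

```python
def familywise_error(alpha, comparisons):
    """Probability of at least one Type I error across the comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha after the Bonferroni adjustment."""
    return alpha / comparisons

uncorrected = familywise_error(0.05, 3)             # about 0.1426
per_test = bonferroni_alpha(0.05, 3)                # about 0.0167
corrected = familywise_error(per_test, 3)           # about 0.0492
print(round(uncorrected, 4), round(per_test, 4), round(corrected, 4))
```

Three comparisons at alpha = .05 push the overall Type I risk to roughly 14%, while testing each at .05 / 3 keeps it just under 5%, at the cost of a stricter criterion (and so more Type II errors) for each individual test.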
A significant ANOVA result, i.e. F-value, indicates that at least one of the sample means differs significantly from the others. To determine which means differ significantly from the others, one needs to perform a post hoc test (such as Tukey's HSD). p.297 Assumptions (p.296): data are interval/ratio, normally distributed, observations are independent, etc.
The term randomized indicates that participants are randomly assigned to conditions in a between-participants design. The term one-way indicates that the design uses only one independent variable. E.g. rote rehearsal vs. imagery vs. story-telling on # of words recalled. This is a design with one independent var with 3 levels. The null hypothesis is $H_0: \mu_1 = \mu_2 = \mu_3$. The alternative hypothesis is that at least one $\mu$ is not equal to another $\mu$. When a researcher rejects H0 using an ANOVA, it means that the independent variable affected the dependent variable to the extent that at least one group mean differs from the others by more than would be expected based on chance.
The grand mean is the mean performance across all participants in all conditions. Since none of the participants scored the grand mean, there is variability between conditions. Is this variability due to the independent var or due to error variance (chance or uncontrolled variables such as individual differences between participants)?
Within-groups variance: this is an estimate of the population error variance. Error variance can be ascertained by looking at the variability within each condition b/c participants within a condition were treated similarly.
Between-groups variance: systematic variance (due either to the effects of the independent variable or to uncontrolled confounding vars) plus error variance.
The F-ratio: $F = \frac{\text{between-groups variance}}{\text{within-groups variance}} = \frac{\text{systematic variance} + \text{error variance}}{\text{error variance}}$
If we assume that the systematic variance is due to the effects of the independent variable, then if the independent var has a strong effect, the F-ratio will be substantially greater than 1; else it will be around 1. p.264
Step 1: Sum of Squares p.291: Several types of sums of squares (SS) are used in the calculation of an ANOVA; $SS_{within} + SS_{between} = SS_{total}$
Total sum of squares: $SS_{total} = \sum (X - \bar{X}_G)^2$, the sum of the squared deviations of each score from the grand mean $\bar{X}_G$. The squared deviations for all scores in all groups are added together to produce the total sum of squares value.
Within-groups sum of squares: $SS_{within} = \sum (X - \bar{X}_g)^2$, where X is each individual score and $\bar{X}_g$ is the mean for each group or condition. This is the sum of the squared deviations of each score from its group or condition mean and is a reflection of the amount of error variance.
Between-groups sum of squares: $SS_{between} = \sum \left[(\bar{X}_g - \bar{X}_G)^2\, n\right]$. This is the sum of the squared deviations of each group's mean from the grand mean, multiplied by the number of participants in each group (n). The between-groups variance is an indication of the systematic variance across the groups. The basic idea: if the independent var has no effect, the group means would be similar to the grand mean, and there would be little variance across conditions.
Step 2: Mean Square (MS) is the mean squared deviation that is an estimate of variance between and within the groups. $MS_{within}$ and $MS_{between}$ are calculated by dividing each SS by the appropriate df, and the F-ratio is then $F = MS_{between} / MS_{within}$. $df_{total} = N - 1$, where N is the total number of subjects in the study; $df_{within} = N - k$, where k = # of groups; $df_{between} = k - 1$. Note that if the $df_{within}$ number is not present in the table at the back, use the next lowest number (because when df values decrease, the critical value increases). p.294
In APA format, to say that a test with a between groups df of 2 and a within groups df of 21 has a value of 11.07 and is significant at the 0.01 level, we write: F(2,21) = 11.07, p <.01. Increasing any differences between the groups by using stronger controls increases the F-value. Also, decreasing the error variance within groups as well as increasing group size (and hence dfwithin) also increases the F-value. Effect Size From Fobt, we know that there was more variability between groups than within groups. However, it would be useful to know how much of the variability in the dependent variable can be attributed to the independent variable. In ANOVA, effect size is estimated using eta-squared:
$\eta^2 = SS_{between}/SS_{total}$. Since SSbetween reflects the differences between the means of the various levels of the independent variable, and SStotal reflects the total differences between all scores in the experiment, $\eta^2$ reflects how much of the variability in the dependent variable (memory) is attributable to the manipulation of the independent variable. Tukey's Post Hoc Test p.297 Tukey's honestly significant difference (HSD) compares each of the groups in the study with each of the other groups to determine which ones differ significantly from each other. Tukey's test identifies the smallest difference between any two means that is significant with alpha = .05 or alpha = .01.
$HSD = Q(k, df_{within}) \sqrt{MS_{within}/n}$, where k = # of groups, n = # of participants in each group. The value for Q is found in Appendix A.9. A difference between means of at least the HSD value is necessary to conclude that the difference is greater than would be expected by chance. If a difference is significant at .05, check at .01 as well p.298.
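As a rough sketch of how eta-squared and Tukey's HSD are applied, the Python below computes both and flags which pairs of group means differ by at least the HSD. All numbers are hypothetical and unrelated to each other; in particular `q_placeholder` is an assumed stand-in for the Q value that would normally be looked up in the appendix table for the given k and dfwithin.

```python
import math
from itertools import combinations


def eta_squared(ss_between, ss_total):
    """Proportion of total variability attributable to the independent variable."""
    return ss_between / ss_total


def tukey_hsd(q, ms_within, n_per_group):
    """Smallest difference between two means that counts as significant."""
    return q * math.sqrt(ms_within / n_per_group)


# Hypothetical group means and ANOVA results (illustrative only).
group_means = {"A": 3.0, "B": 4.0, "C": 6.5}
q_placeholder = 3.5  # assumption: replace with the tabled Q for your k and df

hsd = tukey_hsd(q_placeholder, ms_within=2.0, n_per_group=8)

# Compare every pair of groups against the HSD.
significant_pairs = [
    (g1, g2)
    for g1, g2 in combinations(group_means, 2)
    if abs(group_means[g1] - group_means[g2]) >= hsd
]

effect = eta_squared(ss_between=50.0, ss_total=80.0)

print(f"HSD = {hsd:.2f}; significant pairs: {significant_pairs}")
print(f"eta-squared = {effect:.3f}")
```

Passing Q in as an argument keeps the sketch free of any statistics library; the same function can be called once with the .05 value and once with the .01 value.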
Step 2: Determine the participant sum of squares, SSparticipants, which indicates the within-groups variance due to individual differences: $SS_{participants} = \sum [(\bar{X}_P - \bar{X}_G)^2 k]$, where $\bar{X}_P$ is the mean across treatments for each participant, $\bar{X}_G$ is the grand mean, and k is the number of treatments. After the variability due to individual differences, SSparticipants, has been removed from the within-groups sum of squares, the error sum of squares is left. Thus the error sum of squares, SSerror, equals $SS_{within} - SS_{participants}$.
Step 3: Calculate F = MSbetween/MSerror p.302, where MS is the mean square. dftotal = N - 1, where N is the total number of scores in the study; dfparticipants = n - 1, where n is the number of participants (p.304); dfbetween = k - 1, where k is the # of conditions; dferror = dfbetween X dfparticipants. In Table A.8, use dfbetween and dferror to find Fcv. Effect size in the repeated measures ANOVA is calculated similarly to one-way ANOVA. P.280 Tukey's Post Hoc HSD test: $HSD = Q(k, df_{error}) \sqrt{MS_{error}/n}$
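The repeated measures steps can be sketched as follows, assuming a small hypothetical data set in which each row is one participant measured under every condition (the scores are made up for illustration). The point of the sketch is to show the participant variability being subtracted from the within-groups sum of squares before F is formed.

```python
# Hypothetical scores: one row per participant, one column per condition.
scores = [
    [2, 4, 7],
    [3, 5, 7],
    [4, 5, 9],
    [3, 6, 9],
]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


n = len(scores)     # number of participants
k = len(scores[0])  # number of conditions (treatments)
grand_mean = mean([x for row in scores for x in row])

columns = [[row[j] for row in scores] for j in range(k)]

# Between-conditions and within-conditions sums of squares.
ss_between = sum(n * (mean(col) - grand_mean) ** 2 for col in columns)
ss_within = sum((x - mean(col)) ** 2 for col in columns for x in col)

# Step 2: variability due to individual differences, then what is left over.
ss_participants = sum(k * (mean(row) - grand_mean) ** 2 for row in scores)
ss_error = ss_within - ss_participants

# Step 3: degrees of freedom, mean squares, and F.
df_between = k - 1
df_participants = n - 1
df_error = df_between * df_participants
ms_between = ss_between / df_between
ms_error = ss_error / df_error
f_obt = ms_between / ms_error

print(f"SSparticipants={ss_participants:.2f} SSerror={ss_error:.2f}")
print(f"F({df_between}, {df_error}) = {f_obt:.2f}")
```

Because SSerror is smaller than SSwithin whenever participants differ consistently from one another, the denominator shrinks relative to a between-participants analysis of the same scores.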
Thus, a 3 X 6 factorial design is one with two independent variables, the first of which has 3 levels and the second 6 levels, for a total of 18 possible conditions. It is not possible to have a 1 X 3 factorial design, because an independent variable must have at least two levels. A main effect is an effect of a single independent variable. The main effect of each independent variable tells us about the relationship between that single independent variable and the dependent variable. In other words, do different levels of one independent variable bring about changes in the dependent variable? For example, in a study of the effects of different rehearsal types (rote, imagery) and different word types (concrete, abstract) on memory, rehearsal type and word type are the independent variables, and memory is the dependent variable. p.317 There can be as many main effects as there are independent variables. An interaction effect is the effect of each independent variable across the levels of the other independent variable. The relationship can be graphed: the dependent variable always goes on the y-axis, one independent variable is placed on the x-axis, and the levels of the other independent variable are captioned in the graph. P.294 The possible outcomes of a 2 X 2 factorial design depend on three yes/no questions: Main effect of A? Main effect of B? Interaction effect? So there are 2*2*2 = 8 possible outcomes (p.296). Question p.322: How many main effect(s) and interaction effect(s) are possible in a 4 X 6 factorial design? A 4 X 6 factorial design has two independent variables. Thus, there is the possibility of two main effects (one for each independent variable) and one interaction effect (the interaction between the two independent variables). Two-Way ANOVA p.323 For the factorial designs discussed in this chapter, a two-way ANOVA would be used. The term two-way indicates that there are two independent variables in the study.
As with one-way ANOVA, if either of the variables has an effect, the variance between the groups should be greater than the variance within the groups. In a 2 X 2 factorial design, such as the one we have been looking at in this chapter, there are three null and alternative hypotheses. The null hypothesis for factor A states that there is no main effect for factor A, and the alternative hypothesis states that there is an effect of factor A. A second null hypothesis states that there is no main effect for factor B. The third null hypothesis states that there is no interaction of factors A and B.
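The counting rules for two-factor designs described above (number of conditions, number of possible effects, and the eight possible outcome patterns) can be checked with a short Python sketch. The function name `describe_two_factor_design` is hypothetical and covers only designs with exactly two independent variables.

```python
from itertools import product


def describe_two_factor_design(levels_a, levels_b):
    """Summarize an A X B factorial design with two independent variables."""
    if levels_a < 2 or levels_b < 2:
        # A "1 X 3" design is not factorial: a variable needs at least 2 levels.
        raise ValueError("each independent variable needs at least two levels")
    return {
        "conditions": levels_a * levels_b,
        "main_effects": 2,   # one per independent variable
        "interactions": 1,   # the A X B interaction
    }


# Each outcome answers three yes/no questions:
# main effect of A? main effect of B? interaction effect?
outcomes = list(product([False, True], repeat=3))

print(describe_two_factor_design(3, 6))
print(describe_two_factor_design(4, 6))
print(len(outcomes), "possible outcome patterns for a two-factor design")
```

The three null hypotheses of the two-way ANOVA correspond one-to-one to the three yes/no questions enumerated in `outcomes`.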
Step 1: Calculate SStotal. This is calculated in the same manner as in one-way ANOVA. dftotal is also the same: N - 1. Step 2: Calculate SSA. p.325 This is the sum of the squared deviations of each factor A group mean from the grand mean, multiplied by the number of scores in each factor A condition (column). The definitional formula is: $SS_A = \sum [(\bar{X}_A - \bar{X}_G)^2 n_A]$, where $\bar{X}_A$ is the mean for each condition of factor A, $\bar{X}_G$ is the grand mean, and $n_A$ is the number of people in each of the factor A conditions. dfA = the number of levels of factor A minus 1. P.325. SSB is calculated similarly. Step 3: Calculate the sum of squares interaction (SSA X B): $SS_{A \times B} = \sum [(\bar{X}_C - \bar{X}_G)^2 n_C] - SS_A - SS_B$, where $\bar{X}_C$ is the mean for each condition (cell), $\bar{X}_G$ is the grand mean, and $n_C$ is the number of scores in each condition or cell. The degrees of freedom for the interaction are based on the number of conditions in the study. To determine the degrees of freedom across the conditions, we multiply the degrees of freedom for the factors involved in the interaction: dfA X B = (A - 1)(B - 1). p.327 Step 4: Calculate sum of squares error (SSError): The sum of squares error (SSError) is the sum of the squared deviations of each score from its condition (cell) mean: $SS_{Error} = \sum (X - \bar{X}_C)^2$. dfError is calculated as follows: the number of conditions in the study is multiplied by the number of participants in each condition minus the one score not free to vary, or AB(n - 1). P.303 In the table below, A = # of conditions in A (e.g. concrete vs. abstract), B = # of conditions in B (e.g. rote vs. imagery)
To determine the Fcritical value in Table A.8, we use dferror running down the left side of the table and dfbetween running across the top of the table. p.329 However, note that there are three dfbetween values and thus three Fcv values. For factor A, dfbetween is dfA; for factor B, dfbetween is dfB; for the interaction, dfbetween is dfinteraction. If FA is significant, this means that there was a significant main effect for factor A. Note that Tukey's post hoc test needs to be completed only if either or both of the independent variables have more than two levels (assuming that the main effects are significant to begin with). E.g. in a 2 X 6 factorial design for which both main effects are significant, the post hoc test needs to be calculated only for the independent variable that has six levels (to determine which pairs of these six differ significantly). p.331 eta-squared = SSbetween/SStotal; here SSbetween equals SSA, SSB, and SSA X B, respectively p.331
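The four two-way ANOVA steps, the three F-ratios, and the three eta-squared values can be sketched in Python as below. The cell labels follow the word type by rehearsal type example, but the scores themselves are hypothetical, and the sketch assumes an equal number of scores in every cell.

```python
# Hypothetical 2 X 2 data keyed by (factor A level, factor B level).
cells = {
    ("concrete", "rote"): [4, 5, 6],
    ("concrete", "imagery"): [8, 9, 10],
    ("abstract", "rote"): [3, 4, 5],
    ("abstract", "imagery"): [4, 5, 6],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


a_levels = sorted({a for a, _ in cells})
b_levels = sorted({b for _, b in cells})
n = len(next(iter(cells.values())))  # scores per cell (assumed equal)

all_scores = [x for s in cells.values() for x in s]
grand_mean = mean(all_scores)


def marginal_ss(levels, position):
    """SS for one factor: squared deviation of each level mean, times its n."""
    total = 0.0
    for level in levels:
        pooled = [x for key, s in cells.items() if key[position] == level for x in s]
        total += len(pooled) * (mean(pooled) - grand_mean) ** 2
    return total


ss_total = sum((x - grand_mean) ** 2 for x in all_scores)                 # Step 1
ss_a = marginal_ss(a_levels, 0)                                           # Step 2
ss_b = marginal_ss(b_levels, 1)
ss_cells = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in cells.values())
ss_axb = ss_cells - ss_a - ss_b                                           # Step 3
ss_error = sum((x - mean(s)) ** 2 for s in cells.values() for x in s)     # Step 4

df_a = len(a_levels) - 1
df_b = len(b_levels) - 1
df_axb = df_a * df_b
df_error = len(a_levels) * len(b_levels) * (n - 1)

ms_error = ss_error / df_error
f_a = (ss_a / df_a) / ms_error
f_b = (ss_b / df_b) / ms_error
f_axb = (ss_axb / df_axb) / ms_error

eta_sq = {"A": ss_a / ss_total, "B": ss_b / ss_total, "AxB": ss_axb / ss_total}

print(f"F_A({df_a}, {df_error}) = {f_a:.2f}")
print(f"F_B({df_b}, {df_error}) = {f_b:.2f}")
print(f"F_AxB({df_axb}, {df_error}) = {f_axb:.2f}")
print({name: round(value, 3) for name, value in eta_sq.items()})
```

Each F shares the same error term, which is why a single dferror is used for all three critical-value lookups while dfbetween changes from effect to effect.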
Single-group posttest-only design: involves the use of a single group of participants to whom some treatment is given. There is neither a comparison group nor a comparison of the results to any previous measurements. The single-group pretest/posttest design is an improvement over the posttest-only design in that measures are taken twice: before the treatment and after the treatment. The single-group time-series design involves using a single group of participants, taking multiple measures over a period of time before introducing the treatment, and then continuing to take several measures after the treatment. The nonequivalent control group posttest-only design is similar to the single-group posttest-only design; however, a nonequivalent control group is added as a comparison group. Nonequivalent means that group membership is not random, but already established. Thus, the differences observed between the two groups on the dependent variable may be due to the nonequivalence of the groups and not to the treatment. P.323. An improvement over the previous design involves the addition of a pretest measure, making it a nonequivalent control group pretest/posttest design. A pretest allows us to assess whether the groups are equivalent on the dependent measure before the treatment is given to the experimental group. The logical extension of the previous design is to take more than one pretest and posttest. In a multiple-group time-series design, several measures are taken on nonequivalent groups before and after treatment. Internal validity is the extent to which the results of an experiment can be attributed to the manipulation of the independent variable, rather than to some confounding variable. Because quasi-experimental designs do not use random assignment, confounds cannot be ruled out; thus, quasi-experimental designs lack internal validity.
p.325 Statistical Analysis: Depending on the type of data (nominal, ordinal, or interval-ratio), the number of levels of the independent variable, the number of independent variables, and whether the design is between-participants or within-participants, we choose the appropriate statistic as we did for the experimental designs. Cross-sectional Designs p.352: Researchers study individuals of different ages at the same time. The advantage of this design is that a wide variety of ages can be studied in a short period of time. The main issue is that the researcher is typically attempting to determine whether there are differences across ages; however, the design tests not only individuals of different ages but also individuals who were born at different times and raised in different generations or cohorts, so rather than testing age differences, the researcher may be testing generational differences. Longitudinal Design: With a longitudinal design, the same participants are studied repeatedly over a period of time. Disadvantage: participants who drop out (attrition) may differ from those who remain in the study. Sequential Designs: A researcher begins with participants of different ages (a cross-sectional design) and tests or measures them. Then, a number of months or years later, the researcher retests or measures the same individuals (a longitudinal design). P.352 Single Case Research: versions of a within-participants experiment in which only one person is measured repeatedly. Often the research is replicated on one or two other participants. Thus, we sometimes refer to these studies as small-n designs. A reversal design is a within-participants design with only one participant in which the independent variable is introduced and removed one or more times.
o An ABA reversal design involves taking baseline measures (A), introducing the independent variable (B) and measuring behavior again, and then removing the independent variable and retaking the baseline measures (A). The reversal controls for confounds that may be changing the dependent variable.
o The ABAB reversal design involves reintroducing the independent variable after the second baseline measurement.
Multiple-baseline designs: Because single-case designs are a type of within-participants design, carryover effects from one condition to another are of concern.
Multiple baselines across participants: Here we assess the effect of introducing the treatment over multiple participants, behaviors, or situations. We control for confounds not by reversing back to baseline after treatment, as in a reversal design, but by introducing the treatment at different times across different people, behaviors, or situations. P.331 This reduces the possibility that some other extraneous variable produced the results. Multiple baselines across behaviors: An alternative multiple-baseline design uses only one participant and assesses the effects of introducing a treatment over several behaviors. E.g. first introduce treatment for aggressive behaviors, then days later for talking out of turn, then days later for temper tantrums. Multiple baselines across situations: Introduce the treatment across different situations. E.g. treat first for bad behavior in math class, then days later for bad behavior in English class. Introducing the treatment at different times in the two classes minimizes the possibility that a confounding variable is responsible for the behavior change.