
Optimal Data Analysis Vol. 2, Release 2 (November 20, 2013), 202-205

Copyright 2013 by Optimal Data Analysis, LLC 2155-0182/10/$3.00

MegaODA Large Sample and BIG DATA Time Trials: Harvesting the Wheat
Robert C. Soltysik, M.S., and Paul R. Yarnold, Ph.D.
Optimal Data Analysis, LLC

In research involving multiple tests of statistical hypotheses the efficiency of Monte Carlo (MC) simulation used to estimate the Type I error rate (p) is maximized using a two-step procedure. The first step is identifying the effects that are not statistically significant or ns. The second step of the procedure is verifying that remaining effects are statistically significant at the generalized or experimentwise criterion (p<0.05), necessary in order to reject the null hypothesis and accept the alternative hypothesis that a statistically significant effect occurred. This research uses experimental simulation to explore the ability of MegaODA to identify p<0.05 effects in a host of designs with a binary class variable and ordered attribute for mildly or weakly challenging discrimination conditions (moderate or modest distribution overlap), target p values of 0.01 and 0.001, and sample sizes of n=100,000 and n=1,000,000. Solution speeds ranged from 5 to more than 83,000 CPU seconds running MegaODA software on a 3 GHz Intel Pentium D microcomputer. Using MegaODA it is straightforward to rapidly rule-in p<0.05 for weak and moderate effects by Monte Carlo simulation with large samples and BIG DATA in designs having ordinal attributes with or without weights applied to observations. Significantly greater time was required for problems involving continuous attributes but even the most computer-intensive analyses were completed in less than a day.

In prior research a data set was constructed with three independently generated random numbers provided for each of 10^6 observations [1]. The first variable was binary, created using a random probability value (p) generated from a uniform distribution: BINARY=0 if p<0.5; BINARY=1 if p>0.5. The second variable was an ordered 5-point Likert-type scale [2] created using another random p generated from a uniform distribution: LIKERT=1 if p<0.2; LIKERT=2 if 0.2<p<0.4; LIKERT=3 if 0.4<p<0.6; LIKERT=4 if 0.6<p<0.8; and LIKERT=5 if p>0.8. The third variable was a constrained real number created as a random probability value generated from a uniform distribution: RANDOM=p.

For the present study two additional data sets were created using this initial experimental data set. The first data set featured an effect that was on the border [3] separating a relatively weak effect (ESS<25) versus a moderate effect (25<ESS<50). It was created by adding half a standard deviation (0.707) to the LIKERT data and half a SD (0.25) to the RANDOM data of class 1 observations. Descriptive statistics for the data are in Table 1 (CV=coefficient of variation).

Table 1: Descriptive Statistics for LIKERT and RANDOM by Class: Weak-Moderate ESS

Variable  Statistic  Class 0      Class 1      Class 0      Class 1
                     (n=49,978)   (n=50,022)   (n=499,928)  (n=500,072)
LIKERT    Mean       2.996        3.719        2.997        3.710
          SD         1.418        1.412        1.414        1.415
          Median     3            3.707        3            3.707
          CV         47.3         38.0         47.2         38.1
          Skewness   0.001        -0.014       0.001        -0.004
          Kurtosis   -1.308       -1.297       -1.300       -1.301
RANDOM    Mean       0.501        0.750        0.500        0.750
          SD         0.288        0.290        0.289        0.289
          Median     0.501        0.747        0.501        0.750
          CV         57.6         38.7         57.7         38.5
          Skewness   -0.003       0.007        -0.002       0.001
          Kurtosis   -1.197       -1.210       -1.199       -1.202

The second data set featured an effect that was either moderate or bordered a relatively strong effect (50<ESS<75). It was created by subtracting half a SD from the LIKERT and the RANDOM data of class 0 observations. After this was done the value 1 was added to all of the weights, because in ODA software weights must all be positive numbers [3] (see Table 2).

Table 2: Descriptive Statistics for LIKERT and RANDOM by Class: Moderate-Strong ESS

Variable  Statistic  Class 0      Class 1      Class 0      Class 1
                     (n=49,978)   (n=50,022)   (n=499,928)  (n=500,072)
LIKERT    Mean       2.289        3.719        2.290        3.710
          SD         1.418        1.412        1.414        1.415
          Median     2.293        3.707        2.293        3.707
          CV         62.0         38.0         61.7         38.1
          Skewness   0.001        -0.014       0.001        -0.004
          Kurtosis   -1.308       -1.300       -1.300       -1.301
RANDOM    Mean       1.251        1.750        1.250        1.750
          SD         0.288        0.290        0.289        0.289
          Median     1.251        1.747        1.251        1.750
          CV         23.1         16.6         23.1         16.5
          Skewness   -0.003       0.007        -0.002       0.001
          Kurtosis   -1.200       -1.210       -1.200       -1.202
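The construction of the two experimental data sets can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names, the seed, and the sample size are hypothetical, and a 0.5 threshold for BINARY is assumed because the class sizes in Tables 1 and 2 are nearly equal. Building the second data set on top of the first, and adding 1 to RANDOM, are inferred from the class means in Table 2. The ESS function uses the usual two-class form, 100 x (mean sensitivity - 50%) / 50%, and scans every cutpoint for the rule "predict class 1 if the value exceeds the cut"; it is a plain brute-force scan and is not claimed to be the algorithm used by MegaODA.

```python
import random


def make_base(n, seed=0):
    """Draw BINARY, LIKERT and RANDOM from independent uniform(0, 1) values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        binary = 0 if rng.random() < 0.5 else 1    # assumed 0.5 threshold
        likert = float(int(rng.random() * 5) + 1)  # 1..5, each with probability 0.2
        rnd = rng.random()                         # RANDOM = p
        rows.append((binary, likert, rnd))
    return rows


def weak_moderate(rows):
    """First data set: shift class 1 up by 0.707 (LIKERT) and 0.25 (RANDOM)."""
    return [(c, lik + 0.707, rnd + 0.25) if c == 1 else (c, lik, rnd)
            for c, lik, rnd in rows]


def moderate_strong(rows):
    """Second data set: also shift class 0 down, then add 1 to RANDOM."""
    out = []
    for c, lik, rnd in weak_moderate(rows):
        if c == 0:
            lik, rnd = lik - 0.707, rnd - 0.25
        out.append((c, lik, rnd + 1.0))  # keeps every weight positive
    return out


def best_ess(rows, col):
    """Largest two-class ESS over all cutpoints (class 1 predicted above the cut)."""
    pairs = sorted((row[col], row[0]) for row in rows)
    n1 = sum(c for _, c in pairs)
    n0 = len(pairs) - n1
    below0 = below1 = 0  # class 0 / class 1 observations at or below the cut
    best = 0.0
    i = 0
    while i < len(pairs):
        value = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == value:  # keep tied values together
            if pairs[i][1] == 0:
                below0 += 1
            else:
                below1 += 1
            i += 1
        sens0 = below0 / n0          # class 0 correctly at or below the cut
        sens1 = (n1 - below1) / n1   # class 1 correctly above the cut
        best = max(best, 100 * ((sens0 + sens1) / 2 - 0.5) / 0.5)
    return best


rows = make_base(20_000, seed=1)
for name, data in (("weak-moderate", weak_moderate(rows)),
                   ("moderate-strong", moderate_strong(rows))):
    print(name, round(best_ess(data, 1), 1), round(best_ess(data, 2), 1))
```

For these shifts the population values of this statistic work out to 20 (LIKERT) and 25 (RANDOM) for the first data set and 40 and 50 for the second, which is in line with the ESS columns of Tables 3 and 4.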

Table 3 gives results for analyses with an ordinal attribute and a continuous weight, and Table 4 gives results for the reverse design: a continuous attribute and an ordinal weight.


Table 3: Simulation Results for Ordinal Attribute, Continuous Weight

Effect Strength  Weighted  ESS   ESP   n          Target p  CPU Seconds  MC Iterations
Weak             No        20.5  24.4  100,000    0.01      5            700
Weak             No        20.5  24.4  100,000    0.001     59           9,300
Weak             Yes       20.3  65.2  100,000    0.01      5            700
Weak             Yes       20.3  65.2  100,000    0.001     66           9,300
Weak             No        20.2  24.1  1,000,000  0.01      104          700
Weak             No        20.2  24.1  1,000,000  0.001     1,647        9,300
Weak             Yes       20.0  65.2  1,000,000  0.01      112          700
Weak             Yes       20.0  65.2  1,000,000  0.001     1,581        9,300
Moderate         No        40.4  42.0  100,000    0.01      4            700
Moderate         No        40.4  42.0  100,000    0.001     64           9,300
Moderate         Yes       40.2  70.1  100,000    0.01      4            700
Moderate         Yes       40.2  70.1  100,000    0.001     61           9,300
Moderate         No        40.2  41.8  1,000,000  0.01      109          700
Moderate         No        40.2  41.8  1,000,000  0.001     1,462        9,300
Moderate         Yes       40.0  70.0  1,000,000  0.01      94           700
Moderate         Yes       40.0  70.0  1,000,000  0.001     1,508        9,300

Table 4: Simulation Results for Continuous Attribute, Ordinal Weight

Effect Strength  Weighted  ESS   ESP   n          Target p  CPU Seconds  MC Iterations
Moderate         No        25.2  56.8  100,000    0.01      831          700
Moderate         No        25.2  56.8  100,000    0.001     3,669        5,000
Moderate         Yes       24.8  62.3  100,000    0.01      822          700
Moderate         Yes       24.8  62.3  100,000    0.001     3,336        5,000
Moderate         No        25.1  33.5  1,000,000  0.01      4,708        700
Moderate         No        25.1  33.5  1,000,000  0.001     83,596       5,000
Moderate         Yes       25.0  62.2  1,000,000  0.01      31,311       700
Moderate         Yes       25.0  62.2  1,000,000  0.001     80,175       5,000
Strong           No        49.9  56.7  100,000    0.01      274          700
Strong           No        49.9  56.7  100,000    0.001     1,727        5,000
Strong           Yes       49.8  76.4  100,000    0.01      414          700
Strong           Yes       49.8  76.4  100,000    0.001     942          5,000
Strong           No        50.1  52.4  1,000,000  0.01      2,650        700
Strong           No        50.1  52.4  1,000,000  0.001     19,979       5,000
Strong           Yes       50.0  76.4  1,000,000  0.01      3,417        700
Strong           Yes       50.0  76.4  1,000,000  0.001     18,621       5,000
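The difference in solution times between Tables 3 and 4 can be summarized with a few lines of Python. This is only a descriptive sketch: the CPU seconds are transcribed from the two tables in row order (assuming each table lists four n=100,000 rows followed by four n=1,000,000 rows within each effect-strength block), and the constant and function names are hypothetical. It does not reproduce the UniODA and MegaODA comparisons reported in the text.

```python
from statistics import median

# CPU seconds transcribed from Tables 3 and 4, in row order (assumed layout).
TABLE3_CPU = [5, 59, 5, 66, 104, 1647, 112, 1581,
              4, 64, 4, 61, 109, 1462, 94, 1508]
TABLE4_CPU = [831, 3669, 822, 3336, 4708, 83596, 31311, 80175,
              274, 1727, 414, 942, 2650, 19979, 3417, 18621]

# Rows 1-4 and 9-12 use n=100,000; rows 5-8 and 13-16 use n=1,000,000.
IS_MILLION = ([False] * 4 + [True] * 4) * 2


def median_by_n(cpu):
    """Return (median at n=100,000, median at n=1,000,000) CPU seconds."""
    small = [t for t, big in zip(cpu, IS_MILLION) if not big]
    large = [t for t, big in zip(cpu, IS_MILLION) if big]
    return median(small), median(large)


for label, cpu in (("Table 3", TABLE3_CPU), ("Table 4", TABLE4_CPU)):
    small, large = median_by_n(cpu)
    print(label, "overall median:", median(cpu),
          "| n=100,000:", small, "| n=1,000,000:", large)

# Longest single analysis, expressed in hours.
print("longest run (hours):", round(max(TABLE4_CPU) / 3600, 1))
```

Medians are used here rather than means because the solution times span several orders of magnitude.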


Comparing findings within Tables 3 and 4 via MegaODA [3,4] indicated that sample size was a statistically significant discriminator of solution time when assessed at the generalized criterion, and comparison between Tables 3 and 4 revealed that the use of a continuous attribute was a statistically significant discriminator of solution time at the experimentwise criterion [3]. The significant jump in computing resources used in rule-in analyses with continuous attributes will be reduced if a procedure is developed for dividing the attribute into a smaller number of segments and utilizing weights to perfectly reproduce the actual score.

References

1. Soltysik RC, Yarnold PR (2013). MegaODA large sample and BIG DATA time trials: Separating the chaff. Optimal Data Analysis, 2, 194-197.

2. Yarnold PR (2013). Comparing attributes measured with identical Likert-type scales in single-case designs via UniODA. Optimal Data Analysis, 2, 148-153.

3. Yarnold PR, Soltysik RC (2005). Optimal data analysis: Guidebook with software for Windows. Washington, D.C.: APA Books.

4. UniODA [3] and MegaODA (see ODA Blog) code used in analysis is given below. Dummy-coded class variables were ess (0=weaker, 1=stronger); n (0=100,000, 1=10^6); the target p (0=0.01, 1=0.001); and whether the simulation involved the use of a weight (0=no, 1=yes). As seen, MC was parameterized for 8 tests of statistical hypotheses; for the test between tables MC was set for nine tests.

    open results.dat;
    output results.out;
    vars ess n p Tabl3CPU weighted Tabl4CPU;
    class ess n p weighted;
    attr Tabl3CPU Tabl4CPU;
    mc iter 25000 target .05 sidak 8 stop 99.9 stopup 99.9;
    go;

Author Notes

ODA Blog: odajournal.com
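The MC command in note 4 sets a target of .05 with a Sidak adjustment for 8 tests (9 for the between-table test). As a rough illustration of what such an adjustment implies, the sketch below computes the per-test criterion from the standard Sidak formula, 1 - (1 - alpha)^(1/k). The function name is hypothetical, and whether the software applies exactly this expression internally is an assumption.

```python
def sidak_per_test_alpha(alpha, k):
    """Per-test criterion that holds the experimentwise Type I error rate
    at alpha across k independent tests (standard Sidak formula)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


for k in (8, 9):
    print(k, "tests -> per-test criterion", round(sidak_per_test_alpha(0.05, k), 6))
```

With an experimentwise target of 0.05, the per-test criterion falls between 0.005 and 0.007 for both 8 and 9 tests, so each individual effect must meet a considerably stricter threshold than 0.05.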