Multivariate Data Sequencing Function and General Stratification For Data Comparison

Tyler McNeal
1 Data Sequences
1.1 Cell Patterns with Sequences
Data with a fixed pattern appears in exactly the same place each time it is generated. This allows us
to construct an Excel formula that evaluates the right cell reliably no matter which item of data is selected.
However, most data is spread out rather than sitting in adjacent cells, and the built-in formulas do not
know how to evaluate cells in a sequenced pattern: when a formula is filled down, its references simply
advance to the next cell. To get around this, the data must be presented to the formula in sequential order,
which can be done using the
INDEX(ROWS(CELL))
nested functions in Excel. These functions can generate evenly spaced sequences of rows.
Unfortunately, a specific sequence of cells cannot be produced through simple addition and subtraction
alone. To solve for a specific sequence of rows, we use the following definitions to construct a set of
linear functions that can describe each data pattern sequence: a particular row in a sequence of interest
κn ; the independent variables k1,n and k2,n , which are the rows selected to go into the function; a common
multiplier µ; the test iteration n; and an offset of extra rows εn . We combine these to find κn as a
function of the selected rows:
κn (k1,n , k2,n ) = µk2,n − µk1,n + µ + εn

s.t. k2,n − k1,n ≥ 0
∀n > 0, n ∈ N
We can make the following simplifications and define εn in the following manner:
k2,n ≥ k1,n
and
εn = en − e
We substitute into our original function:
κn (k1,n , k2,n ) = µk2,n − µk1,n + µ + en − e

Thus our final function is,
κn (k1,n , k2,n ) = µ (k2,n − k1,n + 1) + e (n − 1)
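As a quick check of the final function, here is a minimal Python sketch that evaluates κn for a few inputs. The function name `kappa` and the sample values µ = 3 and e = 2 are illustrative assumptions, not values taken from the text.

```python
def kappa(k1: int, k2: int, mu: int, e: int, n: int) -> int:
    """Row index kappa_n = mu * (k2 - k1 + 1) + e * (n - 1)."""
    if k2 < k1:
        raise ValueError("the constraint k2 >= k1 is violated")
    if n < 1:
        raise ValueError("the iteration n must be a positive integer")
    return mu * (k2 - k1 + 1) + e * (n - 1)


# Hypothetical example: k1 is held at row 1 while k2 advances one row at a time.
first_pass = [kappa(1, k2, mu=3, e=2, n=1) for k2 in (1, 2, 3)]   # [3, 6, 9]
second_pass = [kappa(1, k2, mu=3, e=2, n=2) for k2 in (1, 2, 3)]  # [5, 8, 11]
```

With n = 1 the offset term vanishes and the rows are plain multiples of µ; each later iteration shifts the whole sequence by e rows.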
1.2 Usefulness of the Sequence Function: Two Practical Examples

1.2.1 Solving for Any εn
This function can be used to find any row given any two rows, an offset of extra rows, and a multiplier. In
our case, we are generally interested only in solving for εn for a particular set of rows. Since
εn = e (n − 1), this is easily done by setting n = 2, where ε2 = e:
κ2 (k1,2 , k2,2 ) s.t. k2,2 − k1,2 = 1

which becomes
κ2 = 2µ + e
⇒ e = κ2 − 2µ
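A short worked instance may make the substitution concrete. The values µ = 3 and κ2 = 8 below are assumed purely for illustration; with k2,2 − k1,2 = 1 and n = 2, the final function gives:

```latex
\[
\kappa_2 = \mu\,(k_{2,2} - k_{1,2} + 1) + e\,(2 - 1) = 3\,(1 + 1) + e = 8
\quad\Rightarrow\quad e = \kappa_2 - 2\mu = 8 - 6 = 2 .
\]
```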
1.2.2 Finding a Specific Cell with a Particular Identifier
If we make our function constant:
κn = κ
we can find a particular row that needs to be examined if we have selected our step in the sequence by
solving for k2,n :
κ = µ (k2,n − k1,n + 1) + e (n − 1)
= µk2,n − µk1,n + µ + e (n − 1)
⇒ µk2,n = κ + µk1,n − µ − e (n − 1)
\[ \Rightarrow\; k_{2,n} = \frac{\kappa}{\mu} + k_{1,n} - 1 - \frac{e\,(n - 1)}{\mu} \]
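The rearrangement for k2,n can be sketched in Python as follows. The function name `solve_k2` and the sample numbers are assumptions made for illustration; exact fractions are used so that a target row which does not fall on the sequence shows up as a non-integer result.

```python
from fractions import Fraction


def solve_k2(kappa: int, k1: int, mu: int, e: int, n: int) -> Fraction:
    """Solve kappa = mu * (k2 - k1 + 1) + e * (n - 1) for k2."""
    if mu == 0:
        raise ValueError("the multiplier mu must be non-zero")
    return Fraction(kappa, mu) + k1 - 1 - Fraction(e * (n - 1), mu)


# Hypothetical values: target row 11 with mu = 3, e = 2, n = 2, k1 = 1.
on_sequence = solve_k2(kappa=11, k1=1, mu=3, e=2, n=2)   # Fraction(3, 1)
off_sequence = solve_k2(kappa=10, k1=1, mu=3, e=2, n=2)  # Fraction(8, 3)
```

A result with denominator 1 means the target row κ is reachable at that step; any other denominator means no integer k2,n produces it.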
2 Algorithm Formulas
The Sheet1 row range is expressed as Y and the Sheet2 row range is expressed as Z. The reference data for
Sheet1 is denoted VALUE1 and the reference data for Sheet2 is denoted VALUE2. To create a generic
formula for any data type, use the Full Formula as an example and substitute the Sheet1 and Sheet2 value
formulas as the reference cells for any data used in the formulas. These examples pull data from only two
sources, but they can be modified for multiple comparison tests across multiple sheets of data with
differing patterns.
2.1 Alphanumeric Valued Cells

2.1.1 Alphabetical
The full test formula, function structure and value formulas for Alphabetical Only cells are as follows:
• Full Formula
=IF(OR(AND(ISNUMBER(FIND("B",INDEX(Sheet1!$Column1$1:Sheet1!$Column1$Y,
ROWS(Sheet1!Column1$µ:Sheet1!Cellµ)*µ+εn)))=TRUE,ISNUMBER(FIND("B",INDEX(Sheet2!
$Column2$1:$Column2$Z,ROWS(Sheet2!Column2$µ:Sheet2!Cellµ)*µ+
εn)))=TRUE),AND(ISNUMBER(FIND("S",INDEX(Sheet1!$Column1$1:Sheet1!$Column1$Y,
ROWS(Sheet1!Column1$µ:Sheet1!Cellµ)*µ+εn)))=TRUE,ISNUMBER(FIND("S",INDEX(Sheet2!
$Column2$1:$Column2$Z,ROWS(Sheet2!Column2$µ:Sheet2!Cellµ)*µ+εn)))=TRUE)),
"CLEAR!","Sheet1: "&INDEX(Sheet1!$Column1$1:Sheet1!$Column1$Y,
ROWS(Sheet1!Column1$µ:Sheet1!Cellµ)*µ+εn)&" vs. Sheet2: "&INDEX(Sheet2!
$Column2$1:$Column2$Z,ROWS(Sheet2!Column2$µ:Sheet2!Cellµ)*µ+εn))
• Function Structure - Two Letter Comparison
=IF(OR(AND(ISNUMBER(FIND("Letter1",VALUE1))=TRUE,ISNUMBER(FIND("Letter1",VALUE2))=TRUE),
AND(ISNUMBER(FIND("Letter2",VALUE1))=TRUE,ISNUMBER(FIND("Letter2",VALUE2))=TRUE)),"CLEAR!",
"Sheet1: "&VALUE1&" vs. Sheet2: "& VALUE2)
• Sheet1 Value (i.e. VALUE1)
=INDEX(Sheet1!$Column1$1:Sheet1!$Column1$Y,ROWS(Sheet1!Column1$µ:Sheet1!Cellµ)*µ+εn)
• Sheet2 Value (i.e. VALUE2)
=INDEX(Sheet2!$Column2$1:$Column2$Z,ROWS(Sheet2!Column2$µ:Sheet2!Cellµ)*µ+εn)
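The logic of these formulas can be mirrored outside Excel to sanity-check it. The Python sketch below is an approximation under stated assumptions: `sequence_value` and `compare_two_letters` are hypothetical names, a column is modelled as a plain list, and `step` stands in for the value returned by ROWS(...). Like FIND, the substring test is case-sensitive.

```python
def sequence_value(column: list, step: int, mu: int, eps: int):
    """Mimic INDEX(column, step * mu + eps) with Excel's 1-based rows."""
    row = step * mu + eps
    if row < 1 or row > len(column):
        raise IndexError(f"row {row} is outside the column range")
    return column[row - 1]


def compare_two_letters(value1: str, value2: str,
                        letter1: str = "B", letter2: str = "S") -> str:
    """Return "CLEAR!" when both values contain the same marker letter."""
    both_have_first = letter1 in value1 and letter1 in value2
    both_have_second = letter2 in value1 and letter2 in value2
    if both_have_first or both_have_second:
        return "CLEAR!"
    return f"Sheet1: {value1} vs. Sheet2: {value2}"


# Hypothetical data: every third row, shifted by two rows, holds the field.
column = list("abcdefghijkl")
picked = [sequence_value(column, step, mu=3, eps=2) for step in (1, 2)]
match = compare_two_letters("BUY", "B")
mismatch = compare_two_letters("BUY", "SELL")
```

Here `picked` reads rows 5 and 8 of the sample column, `match` evaluates to "CLEAR!", and `mismatch` reports both values side by side.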
2.1.2 Alphanumeric Strings
The function structure for Alphanumeric Strings is as follows, where VALUERANGE is the number of
trailing characters to compare:
• Function Structure
=IF(RIGHT(VALUE1,VALUERANGE)=RIGHT(VALUE2,VALUERANGE),"CLEAR!",
"Sheet1: "&VALUE1&" vs. Sheet2: "&VALUE2)
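For readers who want to test the trailing-character comparison away from the spreadsheet, a minimal Python sketch follows. The name `compare_suffix` and the sample identifiers are assumptions; the slicing is written to behave like RIGHT, including the case where zero characters are requested.

```python
def compare_suffix(value1, value2, value_range: int) -> str:
    """Compare the last `value_range` characters of two values."""
    if value_range < 0:
        raise ValueError("value_range must be non-negative")
    text1, text2 = str(value1), str(value2)
    # s[-0:] would return the whole string, so handle zero explicitly.
    tail1 = text1[-value_range:] if value_range else ""
    tail2 = text2[-value_range:] if value_range else ""
    if tail1 == tail2:
        return "CLEAR!"
    return f"Sheet1: {text1} vs. Sheet2: {text2}"


# Hypothetical identifiers that differ only in their prefix.
same_tail = compare_suffix("ID-00123", "X00123", 5)
different_tail = compare_suffix("ID-00123", "ID-00124", 5)
```

The first call compares "00123" with "00123" and clears; the second compares "00123" with "00124" and reports both full values.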
2.2 Numeric Valued Cells

2.2.1 Numbers Only
The function structure for Numbers Only is as follows:
=IF(VALUE1-VALUE2=0,"CLEAR!","Sheet1: "&VALUE1&" vs. Sheet2: "&VALUE2)
2.3 Date Valued Cells

2.3.1 Dates
The function structure for Dates is as follows:
=IF(RIGHT(VALUE1,5)=RIGHT(VALUE2,5),"CLEAR!","Sheet1: "&VALUE1&" vs. Sheet2: "&VALUE2)
3 Stratification
One of the most important components of the risk control process is ensuring that every aspect of the
information we are examining is tested at least as much as is proportionately appropriate. For any given
data set with distinct, disjoint groups, we have N units in the population and Nh units in stratum h.
The population is completely described by the strata populations N1 , N2 , . . . , Nh , . . . , NL . We
show completeness through N = N1 + N2 + . . . + Nh + . . . + NL . In the set of data confirms, there are 18
distinct strata that we are examining.
To take a stratified sample we want to find an allocation that characterizes the population correctly. We
find the stratum weight in the usual manner
\[ W_h = \frac{N_h}{N} \]
to use in our stratum sample size nh . To find the total sample size n needed from our population, we use
a 95% confidence interval and a margin of error of 3%. The value for z was selected from Table 1, which
lists the z values for common confidence intervals.
Table 1: Confidence Intervals

Confidence Interval    Upper Limit (z)
90%                    1.645 σ
95%                    1.96 σ
99%                    2.575 σ
Section 4 will define the hypothesis test rigorously, but for now it is sufficient to say that the
probability we are examining is θ = 3%. Solving the margin of error E for the sample size n gives:
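\begin{align*}
E &= z \sqrt{\frac{\theta\,(1 - \theta)}{n}} \\
\Leftrightarrow\quad \frac{E}{z} &= \sqrt{\frac{\theta\,(1 - \theta)}{n}} \\
\Leftrightarrow\quad \left(\frac{E}{z}\right)^{2} &= \frac{\theta\,(1 - \theta)}{n} \\
\Leftrightarrow\quad n &= \frac{z^{2}\,\theta\,(1 - \theta)}{E^{2}}
\end{align*}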
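Plugging in the stated values (95% confidence, E = 3%, θ = 3%) can be sketched in Python as below. Two things here are assumptions rather than part of the text: the result is rounded up to a whole unit, and the stratum sample sizes are computed by proportional allocation nh = Wh · n, which the text does not spell out. The stratum sizes in the example are invented for illustration.

```python
import math

# z values as listed in Table 1.
Z_VALUES = {"90%": 1.645, "95%": 1.96, "99%": 2.575}


def required_sample_size(theta: float, margin: float,
                         confidence: str = "95%") -> int:
    """n = z^2 * theta * (1 - theta) / E^2, rounded up to a whole unit."""
    z = Z_VALUES[confidence]
    return math.ceil(z ** 2 * theta * (1 - theta) / margin ** 2)


def proportional_allocation(stratum_sizes: list, n: int) -> list:
    """Assumed allocation n_h = W_h * n with W_h = N_h / N, rounded up."""
    total = sum(stratum_sizes)
    return [math.ceil(n * size / total) for size in stratum_sizes]


n_total = required_sample_size(theta=0.03, margin=0.03)     # 125
per_stratum = proportional_allocation([500, 300, 200], n_total)
```

The formula evaluates to roughly 124.2, so 125 units are needed. Rounding each stratum up can make the allocated total slightly exceed n (here 63 + 38 + 25 = 126), which errs on the side of testing more rather than less.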
It is left to the reader to use standard one or two-tailed hypothesis testing to verify the statistical
significance of any errors found in their data given that we are assuming the data takes on error with some
probability.
