Census Analysis Project STAT 330

CENSUS 2010 X FEC DONATION DATA

A More In-Depth Look at the recent election
STAT 330 Martin Ieong Robert Lee Michael Love
A Sample Paper for SAS Global Forum Michael Love, Robert Lee, Martin Ieong Cal Poly San Luis Obispo, San Luis Obispo, CA
ABSTRACT
Analyzing spending patterns between states is an integral part of determining the efficacy of donations in electing officials. By identifying patterns in donations, we may also be able to forecast donation patterns from numerous demographic factors. This paper combines two sources, the 2010 Census and the FEC donation data for 2012, to analyze how certain factors helped to predict the 2012 voting patterns. Specifically, we analyze:
- Differences in the average donation per person by state.
- The relationship between donation amount and various demographic information (e.g. household size, age, gender, income and political affiliation).
INTRODUCTION
In recent years, donations to political parties have increased dramatically. The intuition behind this is that candidates with more funds to run their campaigns have a greater chance of being elected to office. While federal laws limit the total amount an individual may contribute to a political campaign, both the Republican and Democratic campaigns raised in excess of $800 million in the most recent election. Analysts have spent a large amount of time determining where and from whom donations are expected to come, as well as where money should best be spent in order to win a state. By analyzing demographic information in an efficient and useful way, campaigns can begin to estimate the donation amounts they will receive and the characteristics of the donors. In the 2012 election, there was a great deal of media coverage of the voting patterns of various states, but donation amounts were rarely discussed. The goal of this paper is to discuss, in an objective and analytical way, the association between donation amount and demographic factors at the national, state, and ZIP code level.
Table of Contents
I) Data Acquisition
   a. Source of Data
   b. Specifications of Data
   c. Using the Data
II) Programming Specifics
   a. Structure
   b. Purposes
   c. Technical Details
III) Analysis Description
   a. Hypotheses
   b. Variables Used
   c. Hypotheses Specifications
IV) Analytic Results
   a. Hypothesis I
   b. Hypothesis II
   c. Hypothesis III
   d. Hypothesis IV
V) Conclusion
VI) References
I) DATA ACQUISITION
A) SOURCES OF DATA
RAW DATA
We took our raw data from two main sources: the FEC Election Data for 2012 and the 2010 Census. The data and their sources are summarized below:
FEC Election Data
- Details about each donation
- Organized by ZIP code
- Website: http://www.fec.gov/disclosurep/PDownload.do

Census Data (Summary File 1)
- Various social factors
- Organized by ZCTA
- Website: http://www2.census.gov/census_2010/04Summary_File_1/
INTERMEDIATE DATA
In preparation for later merging, we also downloaded a third raw data set that maps ZIP codes to ZCTA5 codes. This data set is available from the University of Missouri website:
http://mcdc.missouri.edu/pub/data/corrlst/Zip_to_ZCTA_crosswalk_2010_JSI.csv
B) SPECIFICATIONS OF DATA
FEC ELECTION DATA
Information Provided: The FEC data contains details about each donation, including the contribution receipt amount, contribution state, candidate ID, candidate name and ZIP code.
Organization: The FEC data is organized by donation record, with each record identified by ZIP code.
Technical Notes: There are many observations in the data set where the donation receipt amount is 0 or negative. While these could be refunds, this analysis excludes such observations to be on the safe side, which also makes the results much easier to interpret. Records with unrecognizable ZIP codes are also deleted because they cannot be analyzed. ZIP codes recorded in a different format are reformatted into a usable form. Only records concerning Obama and Romney are used, to produce a more concise and interpretable result.
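The cleaning rules in the technical notes above can be sketched in a few lines. This is a minimal Python illustration, not the authors' SAS program. The candidate IDs are the ones that appear in the paper's summary tables; the field names (`cand_id`, `amount`, `zip`), the placeholder ID `P00000000`, and the rule that 9-digit ZIP+4 values are cut to their first five digits are assumptions made for the example.

```python
# Candidate IDs as they appear in the paper's summary tables.
OBAMA_ID = "P80003338"
ROMNEY_ID = "P80003353"
KEPT_CANDIDATES = {OBAMA_ID, ROMNEY_ID}


def normalize_zip(raw_zip):
    """Return a 5-digit ZIP string, or None if the value is unrecognizable.

    Assumption: 9-digit ZIP+4 values (with or without a hyphen) are reduced
    to their first five digits; anything else is treated as unusable.
    """
    digits = str(raw_zip).strip().replace("-", "")
    if digits.isdigit() and len(digits) in (5, 9):
        return digits[:5]
    return None


def clean_fec_records(records):
    """Keep positive donations to Obama or Romney that have a usable ZIP code."""
    cleaned = []
    for record in records:
        if record["cand_id"] not in KEPT_CANDIDATES:
            continue  # only Obama and Romney are analyzed
        if record["amount"] <= 0:
            continue  # zero or negative receipts (possible refunds) are excluded
        zip5 = normalize_zip(record["zip"])
        if zip5 is None:
            continue  # unrecognizable ZIP codes cannot be linked to a ZCTA
        cleaned.append({**record, "zip": zip5})
    return cleaned


# Hypothetical records, for illustration only.
sample = [
    {"cand_id": OBAMA_ID, "amount": 50.0, "zip": "934051234"},
    {"cand_id": ROMNEY_ID, "amount": -250.0, "zip": "92129"},
    {"cand_id": "P00000000", "amount": 100.0, "zip": "93410"},
    {"cand_id": ROMNEY_ID, "amount": 500.0, "zip": "ABCDE"},
]
print(clean_fec_records(sample))
```

Only the first sample record survives: the second has a negative amount, the third belongs to another candidate, and the fourth has an unusable ZIP code.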
ZIPZCTA CONVERSION DATA
Information Provided: The ZIPZCTA conversion data provides a record of each ZIP code area and its corresponding ZCTA. This information acts as the link between the FEC Election Data and the Census Data that will be analyzed.
Organization: The ZIPZCTA conversion data is organized by ZIP code, each with its corresponding information.
CENSUS DATA
Information Provided: The census data contains numerous demographic factors such as average household size, median age, and race.
Organization: The census data is organized under the structure illustrated below. To best serve our purpose, we focus on data at the ZCTA level. A ZCTA is a 5-digit geographical indicator that is roughly analogous to a ZIP code.
Technical Notes: Records in the Census data that contain unidentifiable state names or ZCTA codes are deleted. Only the necessary variables are kept, to improve efficiency.
C) USING THE DATA

As mentioned earlier, the FEC Election Data will be linked to the Census Data through the ZIPZCTA Conversion Data. Analyses will be performed on a merged dataset that includes both the FEC Election Data and Census Data. The diagram below depicts the usage of the three datasets.
[Diagram: FEC Election Data + ZIP-ZCTA Conversion Data + Census Data -> Final FEC+Census Data Set -> Statistical Analysis]
Note: By combining the FEC Election Data and Census Data, we are assuming that Census information at the ZCTA level is consistent with that of the donor(s) in the same ZCTA, which in reality might not be the case. However, given the available resources, this assumption is necessary and reasonable.
II) PROGRAMMING SPECIFICS

A) STRUCTURE
Three main programs work together to produce this analysis: the Raw Data Program, the Merge Program, and the Statistical Analysis Program.
B) PURPOSES
Raw Data Program

The raw data program reads in all the raw data in the formats in which they were first acquired, including CSV, TXT and Excel files. This program also acts as the "janitor" that cleans out any obviously unusable data to improve efficiency.
Merge Program
This program prepares the datasets created by the previous program for the merging procedure. This includes sorting as well as some minor additional cleaning of unwanted data.
Statistical Analysis Program

This program performs all the statistical operations used to analyze the combined data set. SAS procedures such as ANOVA, REG and MEANS are used. For a more detailed description, please refer to the Analysis Description section.
C) TECHNICAL DETAILS
Note: For a more thorough explanation of technical details that cannot be listed here, please refer to the comments within the programs.
1) Raw Data Program
- Filename and Libname Option: In order to work efficiently on different devices, a master filename and libname statement is created so that users only have to change the path once.
- Macros: Multiple macros were created, each containing a portion of the program that serves a specific purpose. This organization gives the program a clearer logic and sequence and makes it easier to navigate.
[Diagram: FEC File, GEO File, Census Files, Merge File]
2) Merge Program
- Sort: The merge program first sorts all the data sets by their key variables to prepare them for merging.
- Merge: It then merges the following datasets by the key variable:
  (i) Census File and GEO File by Logical Record Number (LOGRECNO)
  (ii) FEC File and ZCTA File by ZIP code
  (iii) Census File and FEC File by ZCTA
3) Statistical Analysis Program
- Macros: Macros were created for more efficient analysis, with options to compare and contrast different states.
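Merge steps (ii) and (iii) of the Merge Program can be illustrated with a short Python sketch. It is not the authors' SAS code. The field names (`zip`, `zcta`, `median_age`, `avg_household_size`), the sample values, and the choice to drop donations that have no matching ZCTA are all assumptions made for the example.

```python
def index_by(rows, key):
    """Build a lookup from a key value to its row (assumes unique keys)."""
    return {row[key]: row for row in rows}


def merge_fec_with_census(fec_rows, crosswalk_rows, census_rows):
    """Attach ZCTA-level census variables to each donation record.

    Step (ii): map each donation's ZIP code to a ZCTA via the crosswalk.
    Step (iii): look up the census row for that ZCTA.
    Assumption: donations without a match in either step are dropped.
    """
    zip_to_zcta = {row["zip"]: row["zcta"] for row in crosswalk_rows}
    census_by_zcta = index_by(census_rows, "zcta")
    merged = []
    for donation in fec_rows:
        zcta = zip_to_zcta.get(donation["zip"])
        census = census_by_zcta.get(zcta)
        if census is None:
            continue  # no crosswalk entry or no census row for this ZCTA
        merged.append({**donation, **census})
    return merged


# Hypothetical rows, for illustration only.
fec = [
    {"zip": "93405", "amount": 50.0},
    {"zip": "00000", "amount": 20.0},
]
crosswalk = [{"zip": "93405", "zcta": "93405"}]
census = [{"zcta": "93405", "median_age": 30.0, "avg_household_size": 2.5}]

merged = merge_fec_with_census(fec, crosswalk, census)
print(merged)
```

Every merged row carries the census values of its ZCTA, which is exactly the assumption the paper makes when it treats ZCTA-level variables as representative of individual donors.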
III) ANALYSIS DESCRIPTION

A) Hypotheses
1) Is there a difference in the average donation to Romney versus the average donation to Obama?
2) Are there states that have a significantly higher average amount per donation?
3) Is there a relationship between average household size and political affiliation?
4) What are some statistically significant predictors of donation amount?
B) Variables Used
FEC Data Variables

Quantitative:
- Contribution Receipt Amount
Categorical:
- Contribution State
- Candidate ID
- Candidate Name
- ZIP Code
Census Data Variables

Quantitative:
- Median Age
- Average Household Size
- Total Population
- Population, White Only
- Population, Asian Only
- Population, Males
- Population, Females
Categorical:
- ZCTA5
- Logical Record Number
C) Hypotheses Specifications
Hypothesis I
Is there a difference in the average donation to Obama versus the average donation to Romney?
The donation amounts for Romney and Obama will be compared based upon individual contributions.
An independent two-sample T-test will be used.
Hypothesis II
Do certain states have a significantly higher average donation amount?
The average donation amount will be compared among the states.
A one-way ANOVA test with State as the grouping/categorical variable will be employed.
Hypothesis III
Is there a relationship between average household size and political affiliation?
The existence and strength of the relationship between household size and political affiliation will be investigated. (Please refer to the Important Note on Variable Assumptions at the end of this section.)
A one-way ANOVA test with Political Affiliation as the grouping/categorical variable will be employed.
Hypothesis IV
What are some statistically significant predictors of donation amount?
The donation amount will be modeled by variables such as median age, racial composition and household size.
PROC REG will be used to fit the model relating the donation amount to the predictor variables.
Important Note on Variable Assumptions:
1) Discrepancy in Data Level: As stated in the Data Acquisition section, the census data is recorded at the ZCTA level while the FEC data is recorded at the individual level. We therefore assume that the census data variables are representative of the donor(s) in that specific ZCTA region.
2) Assumption on Political Affiliation: While this information is not given explicitly, it is reasonable to assume that donors are affiliated with the political party to which their donation is made.
IV) ANALYTIC RESULTS

1) Hypothesis I
Assumptions:
Independence: The two populations being compared (individuals who donated to Obama and individuals who donated to Romney) were assumed to be independent.
Normality: Normality of donation amount appears to be violated for donations to Obama as well as for donations to Romney. This is evident from the obvious curvature in the Q-Q plots and from the histograms with overlaid density curves.
[Figures: Q-Q plot and histogram with density curve of donation amount, one set each for Obama and Romney]
Summary Statistics: There were more than five times as many donations to Obama as to Romney (3,139,614 vs. 587,033). However, the average donation to Romney was $276.30, much higher than the average donation to Obama of $109.10. (P80003338 is Obama; P80003353 is Romney.)

candID       N         Mean     Std Dev   Std Err   Minimum   Maximum
P80003338    3139614   109.1    203.6     0.1149    0.0100    2499.0
P80003353    587033    276.3    361.2     0.4715    0.0100    2500.0
Diff (1-2)             -167.2   235.5     0.3349
Results: At α = .05, there is very strong evidence that the mean donation amount to Obama is not equal to the mean donation amount to Romney. The results were significant for both the two-sample pooled t-test, which assumes equal variances (t-value = -499.13, p-value < .0001), and the Satterthwaite test, which assumes unequal variances (t-value = -344.50, p-value < .0001).

candID       Method          Mean     95% CL Mean        Std Dev   95% CL Std Dev
P80003338                    109.1    108.9    109.3     203.6     203.4    203.8
P80003353                    276.3    275.3    277.2     361.2     360.6    361.9
Diff (1-2)   Pooled          -167.2   -167.8   -166.5    235.5     235.4    235.7
Diff (1-2)   Satterthwaite   -167.2   -168.1   -166.2

Method          Variances   DF       t Value   Pr > |t|
Pooled          Equal       3.73E6   -499.13   <.0001
Satterthwaite   Unequal     658407   -344.50   <.0001

There is also very strong evidence that the variances of the two populations are not equal (Folded F-value = 3.15, p-value < .0001). This test assumes that both samples come from normal distributions. However, as discussed above, this may not be a valid assumption given the departures seen in the normality plots.

Equality of Variances
Method     Num DF   Den DF   F Value   Pr > F
Folded F   587032   3.14E6   3.15      <.0001
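The test statistics reported for Hypothesis I can be reproduced approximately from the summary statistics alone. The Python sketch below is an illustration, not the authors' SAS code; the function name is hypothetical. It applies the standard pooled and Satterthwaite formulas, and the folded F ratio, to the rounded values in the summary table, so small differences from the SAS output are expected.

```python
import math


def two_sample_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled and Satterthwaite t statistics from group summary statistics."""
    diff = mean1 - mean2
    var1, var2 = sd1 ** 2, sd2 ** 2

    # Pooled test: one common variance estimate, df = n1 + n2 - 2.
    pooled_df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / pooled_df
    pooled_t = diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Satterthwaite test: separate variances and an approximate df.
    a, b = var1 / n1, var2 / n2
    satt_t = diff / math.sqrt(a + b)
    satt_df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

    # Folded F: larger sample variance divided by the smaller one.
    folded_f = max(var1, var2) / min(var1, var2)

    return {
        "pooled_t": pooled_t,
        "pooled_df": pooled_df,
        "satterthwaite_t": satt_t,
        "satterthwaite_df": satt_df,
        "folded_f": folded_f,
    }


# Group 1 is Obama (P80003338), group 2 is Romney (P80003353).
result = two_sample_t_from_summary(3139614, 109.1, 203.6, 587033, 276.3, 361.2)
for name, value in result.items():
    print(name, round(value, 2))
```

Because the inputs are rounded to one decimal place, the computed values agree with the reported t-values, degrees of freedom and folded F only up to rounding.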
2) Hypothesis II
Assumptions:
Independence: The samples used in this test are assumed to be independent of each other, i.e. it is assumed that the average donation from one state does not affect the average donation from another.
Normality: The normality of donation amount appears to be somewhat violated, but the departure is not severe. Normality should also be less of a concern given the large sample size.
[Figure: P-P plot of the cumulative distribution of means]
The histogram and density curve again show that the data is slightly skewed. Considering the low magnitude of the skew and the large sample size, we proceed with the test.
Summary Statistics:
Class Level Information
Class   Levels   Values
STATE   52       AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA VT WA WI WV WY

Number of Observations Read   3726647
Number of Observations Used   3726647

In this test, 52 levels were used (the 50 states plus the District of Columbia and Puerto Rico), for a total of 3,726,647 observations.
Results: At α = .05, there is very strong evidence that at least one state has a significantly different mean donation amount than the other states. This conclusion is based upon the very large F-value of 169.27 and the p-value of < .0001.
The ANOVA Procedure
Dependent Variable: AMOUNT

Source            DF       Sum of Squares   Mean Square    F Value   Pr > F
Model             51       509769887.68     9995487.9938   169.27    <.0001
Error             3.73E6   220057115691     59050.451066
Corrected Total   3.73E6   220566885579

R-Square   Coeff Var   Root MSE   AMOUNT Mean
0.002311   179.4346    243.0030   135.4270
At first look, this result might seem obvious considering the large number of states. However, it holds even after removing the top 1% of donation amounts. The large F-value suggests that there is a substantial discrepancy between some states, although the R-Square of 0.0023 shows that state alone explains little of the overall variation in donation amount. It could be of interest to investigate the reasons behind such discrepancies further and to draw conclusions about the election.
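The entries of the one-way ANOVA table are tied together by simple arithmetic, which the following Python sketch makes explicit. It is an illustration rather than the authors' code, and the function name is hypothetical. The error degrees of freedom are taken as the number of observations minus the number of groups, which is the standard one-way ANOVA definition and is consistent with the 3.73E6 shown in the table.

```python
def anova_from_sums_of_squares(ss_model, ss_error, n_obs, n_groups):
    """Derive the remaining one-way ANOVA table entries from the sums of squares."""
    df_model = n_groups - 1
    df_error = n_obs - n_groups
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return {
        "df_model": df_model,
        "df_error": df_error,
        "ms_model": ms_model,
        "ms_error": ms_error,
        "f_value": ms_model / ms_error,
        "r_square": ss_model / (ss_model + ss_error),
        "root_mse": ms_error ** 0.5,
    }


# Sums of squares, observation count and number of state levels from the paper.
table = anova_from_sums_of_squares(
    ss_model=509769887.68,
    ss_error=220057115691,
    n_obs=3726647,
    n_groups=52,
)
for name, value in table.items():
    print(name, round(value, 6))
```

The same function applies to the Hypothesis III table by passing its sums of squares with two groups. In both cases a very large F-value sits next to a very small R-Square, which is what one expects when millions of observations make even small group differences statistically detectable.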
3) Hypothesis III
Assumptions:
Independence: The samples used in this test are assumed to be independent of each other (i.e. it is assumed that individuals who identify as Democrats do not influence individuals who identify as Republicans).
Normality: Average household size appears to follow a roughly uniform distribution, and the normality assumption does not appear to be violated.
Summary Statistics
Class Level Information
Class    Levels   Values
candNM   2        Obama, Barack; Romney, Mitt

Number of Observations Read   3726647
Number of Observations Used   3726647

There are 3,726,647 observations used for this test, divided according to whether the donation went to Romney or Obama.

Results
Dependent Variable: avhsholdsize

Source            DF       Sum of Squares   Mean Square   F Value   Pr > F
Model             1        4206.5393        4206.5393     27052.5   <.0001
Error             3.73E6   579476.3140      0.1555
Corrected Total   3.73E6   583682.8534

R-Square   Coeff Var   Root MSE   avhsholdsize Mean
0.007207   16.18141    0.394329   2.436926
Note: The graph contains numbers because of the large number of observations.
From the boxplot, there does not seem to be a large difference in average household size between individuals who donated to Romney and individuals who donated to Obama. However, the ANOVA results tell a different story. At α = .05, there is very strong evidence that the average household size of these two groups is not equal (F-value = 27052.5, p-value < .0001). With more than 3.7 million observations, even a small difference in means is statistically detectable, which is consistent with the low R-Square of 0.0072.
4) Hypothesis IV
Assumptions:
Independence: The samples used in this test are assumed to be independent of each other (i.e. it is assumed that there is no difference in the factors for individuals who donated to Romney and to Obama).
Normality: The normality of donation amount appears to be somewhat violated, but the departure is not severe. Normality should also be less of a concern given the large sample size.
Linearity: The linearity assumption might be violated, as shown by the residual plot. We proceed with caution when interpreting results.
Equal Variance: The equal variance assumption also appears to be violated, as shown by the residual plot. Caution is used when interpreting results.
Once again the normality plots for donation amount are shown. We can see quite a bit of right skew, with lots of small donations and fewer large donations. The donation amounts were capped at $2500 for analysis purposes, so no values will exceed $2499.99.
Residual Plot of Regression
It appears that both the linearity and equal variance assumptions are violated.
Results
The GLM Procedure
Dependent Variable: AMOUNT

Source            DF       Sum of Squares   Mean Square    F Value   Pr > F
Model             6        1070272107.8     178378684.64   3028.53   <.0001
Error             3.73E6   219496613471     58899.333843
Corrected Total   3.73E6   220566885579

R-Square   Coeff Var   Root MSE   AMOUNT Mean
0.004852   179.2049    242.6918   135.4270
Predictors

Source            DF   Type I SS     Mean Square   F Value   Pr > F
WhiteAlone        1    4638543.7     4638543.7     78.75     <.0001
AsianAlone        1    4329654.0     4329654.0     73.51     <.0001
NonWhiteorAsian   1    654326064.1   654326064.1   11109.2   <.0001
Male              1    361097820.7   361097820.7   6130.76   <.0001
Female            1    19224427.2    19224427.2    326.39    <.0001
medianage         1    26655598.2    26655598.2    452.56    <.0001

Model

Parameter         Estimate      Standard Error   t Value   Pr > |t|
Intercept         126.9487227   0.96406736       131.68    <.0001
WhiteAlone        0.0011759     0.00002092       56.22     <.0001
AsianAlone        0.0013419     0.00002415       55.55     <.0001
NonWhiteorAsian   0.0005407     0.00002372       22.79     <.0001
Male              -0.0024428    0.00003689       -66.23    <.0001
Female            0.0004783     0.00003707       12.90     <.0001
medianage         0.4615732     0.02169709       21.27     <.0001

Conclusion
At any reasonable level of significance, there is evidence of an association between donation amount and one or more of the predictors (White ethnicity, Asian ethnicity, Non-White and Non-Asian ethnicity, male population, female population, and median age). The coefficient for the male population is negative, which means that for every additional male in a ZIP code, the predicted donation amount of a donor drops by about $0.0024 (roughly a quarter of a cent). The R-squared value shows that only about 0.49% of the variation in donation amount is explained by the regression.
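To make the interpretation of the coefficients concrete, the Python sketch below plugs the reported parameter estimates into the fitted linear model. It is an illustration only: the ZCTA profile is hypothetical, and the function name is not from the authors' program. The estimates are in dollars, so the Male coefficient of -0.0024428 corresponds to a change of about -0.24 cents per additional male resident.

```python
# Parameter estimates as reported in the paper's model table (in dollars).
COEFFICIENTS = {
    "Intercept": 126.9487227,
    "WhiteAlone": 0.0011759,
    "AsianAlone": 0.0013419,
    "NonWhiteorAsian": 0.0005407,
    "Male": -0.0024428,
    "Female": 0.0004783,
    "medianage": 0.4615732,
}


def predict_amount(zcta_profile):
    """Predicted donation in dollars for a donor living in the given ZCTA."""
    return COEFFICIENTS["Intercept"] + sum(
        COEFFICIENTS[name] * value for name, value in zcta_profile.items()
    )


# Hypothetical ZCTA profile, for illustration only.
profile = {
    "WhiteAlone": 20000,
    "AsianAlone": 3000,
    "NonWhiteorAsian": 7000,
    "Male": 15000,
    "Female": 15000,
    "medianage": 35.0,
}
one_more_male = {**profile, "Male": profile["Male"] + 1}

# Effect of one additional male resident on the predicted donation.
change_in_dollars = predict_amount(one_more_male) - predict_amount(profile)
print("change in cents:", round(change_in_dollars * 100, 2))

# Share of variation explained: model sum of squares over corrected total.
r_square = 1070272107.8 / 220566885579
print("percent explained:", round(100 * r_square, 2))
```

The second calculation recovers the R-Square from the sums of squares in the GLM table, which is why less than half a percent of the variation in donation amount is attributed to the model.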
V) CONCLUSION
Election information is often expressed in a one-dimensional way, focusing only on votes. This paper did not focus on the number of votes a candidate received, but rather on donation amounts and the various factors associated with them. For our first hypothesis, we found that the average donation differs significantly depending on which candidate it goes to. While Obama received more than five times as many donations, donations to Mitt Romney were much larger on average. For our second hypothesis, we found that the average donation differs significantly for at least one state. For our third hypothesis, we found that the average household size of a donor differs depending on his or her political affiliation. We found this result to be particularly interesting. For our last hypothesis, we found that racial factors, gender, and median age are all significant predictors of donation amount.
REFERENCES
Delwiche, Lora, and Susan Slaughter. The Little SAS Book: A Primer, Fourth Edition.
CONTACT INFORMATION Robert Lee California Polytechnic State University at San Luis Obispo 755 Canyon Drive San Luis Obispo, California 93410 (408) 966-8982 rlee@calpoly.edu
Martin Ieong California Polytechnic State University at San Luis Obispo 750 Canyon Drive San Luis Obispo, California 93410 (650) 430-5804 mieong@calpoly.edu Michael Love California Polytechnic State University at San Luis Obispo 9304 Vervain St. San Diego, California 92129 (858)-25-2754 milove@calpoly.edu
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.