
Statistical Analysis Business Intelligence Survey

Young Italian people's perception of China: opportunity or threat? Evidence from a survey

Erica Casarini Giacomo Zoppi

INDEX
1. Introduction
2. Theoretical Approach
   a. Why China?
   b. The survey
   c. Hypothesis and expectations
3. Research Method
   a. Data collection
4. Procedures & results
   a. Library
   b. Creation of the data matrix
   c. Labelling quantitative variables
   d. Basic descriptive statistical analysis: Means procedure
   e. Principal component analysis
      i. Correlation matrix and observations on the results
      ii. Eigenvalue and eigenvector
   f. The size effect
   g. Cluster creation and analysis
   h. Dendrogram
   i. T-tests for quantitative variables
   j. Frequency procedure and Chi-square tests for qualitative variables
5. Discussion & Conclusions
6. Appendix
   a. Appendix A: the Italian version of the questionnaire
7. References

1. INTRODUCTION
The purpose of this research is to investigate whether young Italians who are preparing to enter the professional world, or who have just started working, are influenced in their career choices by the new career and business possibilities arising from the growing influence in globalized markets of a major global player: China. With this research we want to learn from data: how much do young Italians know about this nation? Are they interested in its history or culture? Are they aware of the possibilities that such a new, huge market offers to the firms able to operate in it? Will Italians exploit the Chinese potential? The aim of this research, in a very unpretentious way, is to answer these questions. We would also like to spend a few words on the marketing potential of this research: we were not able to deepen our analysis because of the small size of the sample we collected (only 109 observations) and because we focused exclusively on young people, which gives the variable age a short range. We think, however, that this work, properly reviewed, could be helpful in the segmentation process for firms producing in China and trying to sell their products in Italy. The main tool for this statistical research is SAS (Statistical Analysis System), a specialized software package for statistical analysis.
The study is based on a questionnaire (see Appendix A) composed of four sets of questions. The first is divided into 10 statements that the respondent must evaluate with a score between 1 and 10; this part of the survey is structured to collect 10 quantitative variables for every observation in the dataset. The second and third sets provide categorical variables, while the fourth supplies general information that will be tested to understand whether it is significant or not.

2. THEORETICAL APPROACH
2.a. Why China?
The first, natural question that may arise for the reader of this research is: why China? According to World Bank estimates, between 1981 and 1997 China grew at a rate four times higher than that of the OECD countries, and the real income of the population grew by 10% per year in the twenty years between 1980 and 2000. Between the end of the seventies and 2003 Chinese GDP (Gross Domestic Product) multiplied ninefold, and in that same year (2003) China became the country with the highest inflow of foreign direct investment, a ranking in which it now surpasses even the USA. Thanks to the success of China, followed by India and Japan, the Asian continent produced 37% of world GDP in 1998, ahead of the USA, which contributed 25% of world GDP, and Europe, which contributed a slight 23%. Since the beginning of the globalization process, and this is one of that process's merits, China has received stimuli to reach the optimal exploitation of its production system. Developing a market economy based on the unlimited increase of global and individual consumption, the country started to gradually invade European markets with an increasingly wide range of products. In addition, Chinese businessmen are acquiring industrial enterprises and financial companies all over the world (one famous example above all: the personal computer business of the American IBM was bought by Lenovo, a Chinese enterprise). Their interests range across different sectors: banking, finance, insurance, high-tech, textiles and clothing, etc. Thanks to tax concessions approved by the government, the financing of these merger and acquisition operations is easy and flexible, and it is made possible by the huge dollar reserves of the Bank of China (Zhongguo Yinhang).
For all these reasons China is becoming the country of records: the growth rate of Chinese GDP was 11.2% in 2004, 9.9% in 2005 and 10.4% in 2006, settling at 9% in 2007. In 2004 China became the seventh global power and, as expected, took part for the first time in history in the G8, the club of the most industrialized countries in the world.

2.b. The survey
The focus of this research is young people who already hold a high school diploma and are still studying or have just graduated; we chose to interview individuals aged between 18 and 30. The sample must be exclusively of Italian nationality, since we want to know if and how the new generations entering working life will be able to interact with such an important nation, one that might lift many national enterprises out of the economic crisis, especially in this problematic period for all Western economies. To understand if and how much our target population knows about China, we set 10 general statements concerning the socio-economic situation of the country, representing quantitative variables; next to each one we provided a Likert scale ranging from 1 (not at all) to 10 (absolutely agree). The other questions, analysed in a later step, produce qualitative dichotomous variables and aim to test the respondents' inclination to travel, live or work in China in the present and in the future. Finally, we created general questions (about gender, age, city of origin and occupation) to obtain qualitative variables that will be tested to understand whether they are significant for our analysis or not.
2.c. Hypothesis and expectations
The first statement to be ranked in the questionnaire is:
q1_1: China is an underdeveloped and poor country
According to the data we have collected, this statement is partly correct, because technological progress is concentrated in a few transnational enterprises and there are huge inequalities in income distribution; across the vast Chinese territory, globalization has created extremely fragmented spaces, composed of new areas of poverty in central and near-central zones and new areas of wealth within extremely poor regions.
By opening itself to the market and to flows of foreign direct investment, China could adopt new technologies and raise its productivity levels, increasing the revenues in its balance of payments; but, in the name of macro-economic development, it had to sacrifice a large part of its economic and political independence, ultimately provoking an increase in poverty rates for the majority of the population through the drastic reduction of welfare state policies. Until a few years ago, the criterion for assessing differences among countries was based on the dualism development/underdevelopment, while now, since the process of globalization started, inequalities and disparities are spread in a non-uniform way. On these premises we expect this statement to receive a low rank in our survey: we think the mean value respondents attribute to it will be low, that is, between 1 and 3.
q1_2: China is a developing country, but still not much developed
In line with the considerations above, we believe that China is actually a developing country, and we suppose that people will perceive it as such, giving this assertion a mean score between 4 and 6.

q1_3: China is an irrelevant State because of its geographical distance from Italy
We firmly think this statement will get low scores, possibly between 1 and 3. It is hard to imagine that young people would fail to notice that the current outlook for this globalized world is that of a multicultural and open society.
q1_4: China is a nation too different from Italy in language, culture, traditions and habits to influence our country
As far as language and culture are concerned, it is absolutely true that Italy and China have almost nothing in common, but this gap is steadily narrowing as time passes, through a homogenization of culture, especially in the business field, so we think this statement might get a mid-low grade.
q1_5: China is the Country of copies, because the Chinese often copy European and American products
This assertion is clearly a provocation, because the link between buyers and sellers is extremely tight and, as will be shown later, China and Italy are closely connected and both involved in the counterfeiting business: the OECD declared that counterfeiting accounts for about 5-7% of worldwide trade and is responsible for the loss of 200 thousand jobs in Europe (Source: UNCRI, 2007). The European primacy for counterfeit goods is held by Italy, while China is the first producer of imitations and fake brands, in particular Italian ones; the Chinese republic is also the first buyer of textile machinery from Italy. At the same time, research conducted by the American Chamber of Commerce in Italy with KPMG showed that Italy holds the unenviable record of first consumer of counterfeit goods, as well as being the first European producer and the third in the world. In this case we do not expect Italians to recognize the link between our nation and China, and therefore we expect to register high scores.
q1_6: China is a country interesting only from a historical, artistic and cultural point of view
The purpose of this question is to investigate whether respondents care about and are interested in Chinese culture, which seems very unlikely to us. We expect few answers with a high score here: first, because people interested in Chinese culture are few, a niche; moreover, those who are should be interested in the socio-economic aspects of the country as well.
q1_7: China is interesting only from a working and economic point of view
As already explained for statement q1_6, we expect a low mean score for this statement; but since China is affecting the Western world in many respects with its remarkable strength in global trade, we expect to see a higher rank than for the previous statement.

q1_8: China is developed enough to represent a threat for the Italian economy
Our expectations for this statement are tied to the first two: we believe that if the first statements get a low score, this one and the next will get a high rank, because we think they are somehow inversely correlated.
q1_9: China is developed enough to represent an opportunity for the Italian economy
This statement is connected with the previous one, and we therefore expect q1_8 and q1_9 to score similar values.
q1_10: China will be the new centre of power that will replace the USA
This represents, in a sense, the opposite of statement q1_1: we consider it plausible that people who give a low score to this statement will give a high score to q1_1, and vice versa.
These hypotheses will be compared with the SAS results produced by PROC MEANS before and after the elimination of the size effect.

Table 1 Hypothesis resume
Hypothesis      Statement                                          Expected mean
Hypothesis 1    China is an underdeveloped country                 1 < evaluation < 3
Hypothesis 2    China is a developing country                      4 < evaluation < 6
Hypothesis 3    China is irrelevant because of the distance        1 < evaluation < 3
Hypothesis 4    China is too different from Italy                  4 < evaluation < 6
Hypothesis 5    China is the country of copies                     8 < evaluation < 10
Hypothesis 6    China is interesting only for its culture          4 < evaluation < 6
Hypothesis 7    China is interesting only for its economy          4 < evaluation < 6
Hypothesis 8    China is a threat for the Italian economy          8 < evaluation < 10
Hypothesis 9    China is an opportunity for the Italian economy    8 < evaluation < 10
Hypothesis 10   China will substitute the power of the USA         1 < evaluation < 3

All these data, which provide the quantitative variables that will be the pillar of our clusters, will be tested using the MEANS procedure of SAS, in order to understand whether our expectations were correct and how far reality departs from the hypotheses.

3. RESEARCH METHOD
3.a. Data Collection
Data were collected using an internet questionnaire survey. The initial sample comprised people between 18 and 30 years old, who were contacted by e-mail and also through widely used social networks such as Facebook, LinkedIn and MSN Messenger. In accordance with common practice, after one week a reminder was e-mailed to those who had not replied. If no answer had been received by the end of the second week, the person was excluded from the survey. After the first e-mail nobody declared the intention not to participate, so the number of non-participants is formally zero. After sending the questionnaire to 145 target people, we waited two weeks and collected 109 answers. The 36 non-respondents represent 24.83% of the people contacted, while those who did answer represent 75.17%.

Table 2 Survey data results
Sent questionnaires              145
Respondents                      109
Non respondents                  36
Non participating                0
Not useful or null               0
Percentage of respondents        75.17%
Percentage of non respondents    24.83%
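The percentages in Table 2 follow directly from the two counts reported in the text. As a quick cross-check, here is a minimal Python sketch (Python is used only for illustration; it is not part of the authors' SAS workflow, and the helper name `percent` is hypothetical):

```python
# Counts reported in Table 2 of the text.
sent = 145
respondents = 109

# Everyone who was contacted but did not answer.
non_respondents = sent - respondents


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


response_rate = percent(respondents, sent)
non_response_rate = percent(non_respondents, sent)

print(f"Non respondents:   {non_respondents}")
print(f"Response rate:     {response_rate}%")
print(f"Non-response rate: {non_response_rate}%")
```

The two rates add up to 100% because nobody formally declined and no questionnaire was discarded as null.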

The variables are measured in different ways. The first question provides quantitative variables, which receive a value between 1 (which stands for "I do not agree at all") and 10 (which corresponds to "I completely agree"). The second question provides qualitative dichotomous variables, which can assume only two values: yes or no. The third part of the survey provides categorical qualitative variables, because we let respondents choose among a given set of cities where they would like to go for work or tourism. The last part of the survey provides basic descriptive variables with demographic data. These are both quantitative (age) and qualitative (sex, city of origin, employment, knowledge of Chinese).

4. PROCEDURES & RESULTS
4.a. Library
SAS can host temporary SAS data sets, stored in the WORK library. This library is defined automatically at the beginning of the session and is deleted automatically at the end of the same session. By default, procedures assume that a data set is read from or written to the WORK library unless the program specifies otherwise. To avoid this volatility we had no choice but to create a permanent library, and we could choose between two ways of doing so: assigning a USER library with a LIBNAME statement or with the SAS system option USER=. We used the LIBNAME statement, since it was more immediate and easier to manage; it takes just the simple command reported below:
SAS syntax 1:

libname Gioconda "G:\Business Intelligence";

We set the library permanently on our pen drive, which avoided re-importing our Excel file every time we had to work on our data matrix.
4.b. Creation of the data matrix
To work with SAS, we must first organize the data gathered from the survey in an Excel data warehouse, where all the collected information is arranged vertically. The next step is to import this Excel file into SAS with PROC IMPORT, which reads data from an external data source and writes them to a SAS data set. The SAS variable definitions are based on the input records. Another easier and more immediate way to import data is the Import Wizard, for which it is enough to select File > Import Data and then follow the instructions provided; this is the procedure we followed, and the resulting syntax was as follows:
SAS syntax 2:

PROC IMPORT OUT= Gioconda.China
            DATAFILE= "C:\Documents and Settings\Zoppi\Desktop\Survey.xls"
            DBMS=EXCEL REPLACE;
     SHEET="Foglio1$";
     GETNAMES=YES;
     MIXED=NO;
     SCANTEXT=YES;
     USEDATE=YES;
     SCANTIME=YES;
RUN;

This command created our data matrix, a two-dimensional table in which each row is an observation (a respondent) and each column is a variable.

4.c. Labels for quantitative variables
The LABEL statement attributes descriptive names to variables, which are named in our data warehouse with a q (standing for question) plus the number of the question followed by the number of the statement, separated by an underscore. By labelling the 10 variables we simply obtained more identifiable tags: the statement does not alter the values registered for our variables in the input data set, and the names we attributed to each statement make the variables easier to tell apart. Each variable is uniquely labelled and no spaces are left between words: the underscore replaces the space.

Table 3 Labels setting
Variable   Label
q1_1       underdeveloped_country
q1_2       in_developement
q1_3       geographical_distance
q1_4       too_different
q1_5       fake_makers
q1_6       culture_only
q1_7       economy_only
q1_8       Threat
q1_9       Opportunity
q1_10      leading_nation

4.d. Basic Descriptive Statistical Analysis: Means Procedure
This statistical tool helps to estimate the population mean of each variable on the basis of the scores given by the respondents. If we join the conditional mean values, we obtain what is known as the population regression line (PRL) or, more generally, the population regression curve. Geometrically, a population regression curve is the locus of the conditional means of the dependent variable for fixed values of the explanatory variable(s). This explains the importance of the procedure: the sample regression line passes through the sample mean values of Y and X, and its formula is $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$. PROC MEANS also returns the minimum and maximum of the observations, the median, the coefficient of variation and two of the most important values: the standard deviation, which measures how far, on average, the observations $x_i$ lie from their mean, and the standard error of the mean, which is the estimated standard deviation of the sampling distribution of the sample mean (by sampling distribution we mean the probability or frequency distribution of the estimator) and equals the standard deviation divided by the square root of the number of observations. The following SAS step runs PROC MEANS; as we explain later, it is the first key procedure to perform in order to obtain some basic statistical values.
SAS syntax 3:

proc means data=Gioconda.China mean stddev min max median stderr cv;
   var q1_1-q1_10;
run;

SAS output 1: SAS System
The MEANS procedure

Variable  Label                   Mean       Std Dev    Min  Max  Median  Std Error  Variation Coefficient
q1_1      underdeveloped_country  4.5137615  2.1926421  1    10   5       0.2100170  48.5768263
q1_2      in_developement         5.8256881  2.1638616  1    10   6       0.2072604  37.1434509
q1_3      geographical_distance   2.3761468  2.2599123  1    10   1       0.2164603  95.1082776
q1_4      too_different           4.4036697  2.6636068  1    10   4       0.2551273  60.4860712
q1_5      fake_makers             5.6055046  3.0031414  1    10   6       0.2876488  53.5748631
q1_6      culture_only            3.9816514  2.3726125  1    10   4       0.2272551  59.5886537
q1_7      economy_only            4.0183486  2.4074793  1    10   4       0.2305947  59.9121557
q1_8      threat                  6.1743119  2.5921678  1    10   6       0.2482846  41.9831044
q1_9      opportunity             6.1009174  2.7112419  1    10   6       0.2596899  44.4399044
q1_10     leading_nation          5.3119266  2.6200694  1    10   5       0.2509571  49.3242778
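The standard error and variation coefficient columns of SAS output 1 can be recomputed from the mean, the standard deviation and the sample size of 109. The Python sketch below is only an illustrative cross-check for three of the ten variables, using the usual definitions (standard error = standard deviation / sqrt(n), coefficient of variation = 100 × standard deviation / mean); the function names are hypothetical.

```python
import math

n = 109  # number of respondents, from the text

# variable: (mean, std dev, reported std error, reported CV), copied from SAS output 1
reported = {
    "q1_1": (4.5137615, 2.1926421, 0.2100170, 48.5768263),
    "q1_3": (2.3761468, 2.2599123, 0.2164603, 95.1082776),
    "q1_8": (6.1743119, 2.5921678, 0.2482846, 41.9831044),
}


def std_error(std_dev, n_obs):
    """Standard error of the mean: std dev divided by sqrt(n)."""
    return std_dev / math.sqrt(n_obs)


def coeff_variation(mean, std_dev):
    """Coefficient of variation, expressed as a percentage of the mean."""
    return 100 * std_dev / mean


for name, (mean, sd, se_sas, cv_sas) in reported.items():
    se = std_error(sd, n)
    cv = coeff_variation(mean, sd)
    print(f"{name}: std error {se:.7f} (SAS {se_sas}), CV {cv:.4f} (SAS {cv_sas})")
```

A coefficient of variation close to 100, as for q1_3, means that the spread of the answers is almost as large as their mean.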

The output above shows the results of PROC MEANS. The mean is a measure of location, like the mode and the median, and it measures the centre of a distribution: the procedure describes the central location of the collected data and reports, for every variable, the standard deviation, the median, the maximum and minimum values, the standard error and the coefficient of variation. The standard deviation is useful as well, because it represents the spread of the values around the mean. We now rank our variables, first in descending order of mean and then in ascending order of standard deviation, in order to better understand which opinions are most widely shared among our target population.
Table 4 Variables classified by Mean and by Standard deviation
Ranked by mean (descending)
Var     Labels                  Mean        St. dev.
q1_8    Threat                  6.17431190  2.59216780
q1_9    Opportunity             6.10091740  2.71124190
q1_2    In_Development          5.82568810  2.16386160
q1_5    Fakes_Makers            5.60550460  3.00314140
q1_10   Leading_Nation          5.31192660  2.62006940
q1_1    Undeveloped_Country     4.51376150  2.19264210
q1_4    Too_Different           4.40366970  2.66360680
q1_7    Economy_Only            4.01834860  2.40747930
q1_6    Culture_Only            3.98165140  2.37261250
q1_3    Geographical_Distance   2.37614680  2.25991230

Ranked by standard deviation (ascending)
Var     Labels                  Mean        St. dev.
q1_2    In_Development          5.82568810  2.16386160
q1_1    Undeveloped_Country     4.51376150  2.19264210
q1_3    Geographical_Distance   2.37614680  2.25991230
q1_6    Culture_Only            3.98165140  2.37261250
q1_7    Economy_Only            4.01834860  2.40747930
q1_8    Threat                  6.17431190  2.59216780
q1_10   Leading_Nation          5.31192660  2.62006940
q1_4    Too_Different           4.40366970  2.66360680
q1_9    Opportunity             6.10091740  2.71124190
q1_5    Fakes_Makers            5.60550460  3.00314140
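The two rankings in Table 4 are plain sorts of the PROC MEANS results. A minimal Python sketch of the same ordering, using the means and standard deviations reported in SAS output 1 (the dictionary `stats` is an illustrative container, not something produced by SAS):

```python
# variable: (mean, standard deviation), copied from SAS output 1
stats = {
    "q1_1": (4.5137615, 2.1926421),
    "q1_2": (5.8256881, 2.1638616),
    "q1_3": (2.3761468, 2.2599123),
    "q1_4": (4.4036697, 2.6636068),
    "q1_5": (5.6055046, 3.0031414),
    "q1_6": (3.9816514, 2.3726125),
    "q1_7": (4.0183486, 2.4074793),
    "q1_8": (6.1743119, 2.5921678),
    "q1_9": (6.1009174, 2.7112419),
    "q1_10": (5.3119266, 2.6200694),
}

# Highest mean first: the statements respondents agree with most.
by_mean = sorted(stats, key=lambda var: stats[var][0], reverse=True)

# Lowest standard deviation first: the statements with the least dispersed answers.
by_std = sorted(stats, key=lambda var: stats[var][1])

print("By mean (descending):  ", by_mean)
print("By std dev (ascending):", by_std)
```

Reading the two orderings together separates how strongly a statement is endorsed (mean) from how uniformly respondents scored it (standard deviation).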

The conclusions stemming from the previous table are the following. The most shared belief is that China currently represents more of a threat than an opportunity for the Italian economy: the mean of q1_8 (6.17) is higher than that of q1_9 (6.10), even if only by about 0.07, and its standard deviation is lower (2.59 against 2.71), so respondents agree slightly more on the threat. For the same reasons (mean and standard deviation) we can state that many people agree with the view of China as a developing country: the mean of this statement ranks third in the list above and its standard deviation is the lowest of the set, which makes it the most consistently shared opinion. We can also notice that our sample does not consider China irrelevant because of its distance from Italy: the mean for this statement is the lowest of the set, so there is a general tendency to recognize that distance does not constrain contacts and interactions between countries. Particularly interesting is statement q1_5, China as a fake-maker nation: its mean ranks fourth. Although the score is lower than the range we predicted, its direction is in line with our expectation: it suggests that the sample's perception of the responsibility for the spread of counterfeit goods is mainly one-sided, attributing the fault to the Chinese producers rather than to the Italian buyers. After these preliminary considerations, we compare our expected outcomes with the results actually obtained from the dataset, and we notice that the majority of our expectations were wrong, starting from the first one. The perception of China as an underdeveloped country is more widespread than we thought; on the other hand, a rather positive judgement, beyond our best expectations, has been given to the country's future global role as the most influential nation in the world in place of the United States. The scores for statements q1_8 and q1_9 (the Chinese republic as a threat and as an opportunity for the Italian economy) are lower than we imagined; this suggests a certain lack of interest in job opportunities in that nation, and also the assumption of living in a country with a strong economy that does not have to worry too much about foreign competitors.
Table 5 Hypothesis vs. processed results
Number  Question                                          Hypothesis            Reality    Accuracy
q1_1    China is an underdeveloped country                1 < evaluation < 3    4.5137615  Wrong
q1_2    China is a developing country                     4 < evaluation < 6    5.8256881  Right
q1_3    China is irrelevant because of the distance       1 < evaluation < 3    2.3761468  Right
q1_4    China is too different from Italy                 4 < evaluation < 6    4.4036697  Right
q1_5    China is the country of copies                    8 < evaluation < 10   5.6055046  Wrong
q1_6    China is interesting only for its culture         4 < evaluation < 6    3.9816514  Wrong
q1_7    China is interesting only for its economy         4 < evaluation < 6    4.0183486  Right
q1_8    China is a threat for the Italian economy         8 < evaluation < 10   6.1743119  Wrong
q1_9    China is an opportunity for the Italian economy   8 < evaluation < 10   6.1009174  Wrong
q1_10   China will substitute the power of the USA        1 < evaluation < 3    5.3119266  Wrong
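The Accuracy column of Table 5 is a simple interval test: a hypothesis counts as right when the observed mean falls strictly inside the expected range. The comparison can be sketched in Python as follows (an illustrative check only, with the ranges taken from Table 1 and the means from SAS output 1):

```python
# Expected range (lower, upper) for each statement, from Table 1.
hypotheses = {
    "q1_1": (1, 3), "q1_2": (4, 6), "q1_3": (1, 3), "q1_4": (4, 6),
    "q1_5": (8, 10), "q1_6": (4, 6), "q1_7": (4, 6), "q1_8": (8, 10),
    "q1_9": (8, 10), "q1_10": (1, 3),
}

# Observed means, from SAS output 1.
means = {
    "q1_1": 4.5137615, "q1_2": 5.8256881, "q1_3": 2.3761468,
    "q1_4": 4.4036697, "q1_5": 5.6055046, "q1_6": 3.9816514,
    "q1_7": 4.0183486, "q1_8": 6.1743119, "q1_9": 6.1009174,
    "q1_10": 5.3119266,
}

# A hypothesis is "Right" when lower < mean < upper (strict, as written in Table 1).
accuracy = {
    var: "Right" if low < means[var] < high else "Wrong"
    for var, (low, high) in hypotheses.items()
}

right = sum(1 for verdict in accuracy.values() if verdict == "Right")
for var, verdict in accuracy.items():
    print(f"{var}: mean {means[var]:.2f}, expected {hypotheses[var]} -> {verdict}")
print(f"{right} of {len(accuracy)} hypotheses confirmed")
```

Note how close q1_6 and q1_7 are to the boundary of 4: the two means differ by less than 0.04, yet they fall on opposite sides of the lower limit and receive opposite verdicts.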

Another surprising value is that of statement q1_5 (China as a country making a lot of copies): it is much lower than we presumed, which might be interpreted as a sign of our sample's confidence in the quality and irreplaceability of Italian products. Another overestimated value concerns the identification of China as a country interesting only for its culture, arts and historical attractions: respondents gave lower scores than we thought, so the general idea is either that China has more to offer than culture, art and history, or that it has nothing to offer.

4.e. Principal component analysis
In data mining, working with a large sample is always positive because, other things being equal, larger samples tend to minimize the probability of errors, maximize the accuracy of population estimates and increase the generalizability of the results. However, in such circumstances it is very likely that the collected variables are highly correlated with each other; the accuracy and reliability of a classification or prediction model is then in danger, because statistical procedures that create optimized linear combinations of variables tend to "overfit" the data. In addition, superfluous variables can slow down data processing. One important step in data mining is therefore to reduce dimensionality without sacrificing accuracy, by summarizing the data into a considerably smaller number of linear combinations without losing fundamental information. This is the aim of principal component analysis (PCA from now on), a mathematical procedure that transforms a number of (possibly) correlated variables into a smaller number of uncorrelated variables called principal components. The objective of PCA is to reduce the dimensionality (number of variables) of the dataset while retaining most of the original variability in the data. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA is concerned with explaining the variance-covariance structure of high-dimensional random vectors through a few linear combinations of the original variables. We need PCA to carry out a more precise examination of the relationships among the quantitative variables, to summarize the data and to discover possible linear relationships.
The main objective is to reduce the number of variables used in the clustering process. We do this through the multivariate technique of principal component analysis, executed in SAS with PROC PRINCOMP. The SAS command and the tables resulting from it are shown below:
SAS syntax 4:

proc princomp data=Gioconda.China;
   var q1_1-q1_10;
run;

SAS output 2.1: SAS System
The PRINCOMP procedure

Observations   109
Variables      10

Simple statistics
Variable   Mean       StD
q1_1       4.513761   2.192642
q1_2       5.825688   2.163862
q1_3       2.376147   2.259912
q1_4       4.403670   2.663607
q1_5       5.605505   3.003141
q1_6       3.981651   2.372612
q1_7       4.018349   2.407479
q1_8       6.174312   2.592168
q1_9       6.100917   2.711242
q1_10      5.311927   2.620069

4.e.i. Correlation matrix
The matrix below shows the correlation among the variables, represented by a coefficient (named r) for each pair of variables. This coefficient indicates the strength and the direction of the linear relationship between two random quantitative variables. The correlation coefficient between two random variables X and Y is defined from their covariance and their standard deviations as in the formula below:
Figure 1 Correlation Coefficient Formula
$$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

It can assume values between -1 and 1:
- if r > 0 there is a positive correlation between the two variables
- if r < 0 there is a negative correlation between the two variables
- if r = 0 there is no linear correlation
If we observe strongly correlated variables in our database, they are dependent on each other and the collected information is therefore redundant and complex. Vice versa, if the level of correlation between variables is low, the database has a low level of redundancy. As shown in the SAS output below, we have some comparatively highly correlated variables, such as statement 6 (China is interesting just for its culture) and statement 7 (China is interesting just for its economy): they show the highest correlation coefficient (0.4945), so we can hypothesize that respondents tended to give similar scores to the two statements because they regard China as interesting, or uninteresting, from both perspectives. Statement 4 (China cannot influence us because of its different culture) and statement 5 (China is a fake-maker nation) have the second highest correlation (0.4773); this probably means that people think of China as a nation with a culture so different and unattractive that, to perform well on the market, it is forced to copy the more appreciated style of Western products. The same reasoning holds for negatively correlated variables, but in the opposite sense: when one of the two variables increases, the other tends to decrease, and vice versa.
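To make the coefficient concrete, the sketch below computes r in Python as the covariance divided by the product of the standard deviations. The two score lists are hypothetical 1-10 answers invented for illustration; they are not taken from the survey data, and `pearson_r` is an illustrative helper rather than the routine SAS uses internally.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The common 1/(n-1) factor cancels between numerator and denominator,
    # so plain sums of products and squares are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores from five respondents (assumption, for illustration only).
culture_only = [2, 4, 5, 7, 9]
economy_only = [1, 5, 4, 8, 9]
# Reversed scale: a respondent scoring s on the first item scores 11 - s here.
reversed_scores = [11 - s for s in culture_only]

print("similar scoring:  r =", round(pearson_r(culture_only, economy_only), 3))
print("identical scores: r =", round(pearson_r(culture_only, culture_only), 3))
print("reversed scores:  r =", round(pearson_r(culture_only, reversed_scores), 3))
```

Respondents who score two statements alike push r towards 1, while a systematically reversed pattern pushes it towards -1; the survey's own coefficients, all below 0.5 in absolute value, indicate much weaker relationships than these extreme cases.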

SAS output 2.2: Correlation Matrix

Variable  Label                   q1_1    q1_2    q1_3    q1_4    q1_5    q1_6    q1_7    q1_8    q1_9    q1_10
q1_1      underdeveloped_country  1.0000  0.4581  0.0597  0.0577  0.3123  0.0196  -.0193  0.1796  0.1719  0.0089
q1_2      in_developement         0.4581  1.0000  0.2332  0.1601  0.3242  0.1292  0.1588  0.2696  0.1056  -.0409
q1_3      geographical_distance   0.0597  0.2332  1.0000  0.4391  0.2103  0.0894  0.2183  0.0251  -.1589  0.0441
q1_4      too_different           0.0577  0.1601  0.4391  1.0000  0.4773  0.2898  0.3064  -.0116  -.2314  -.0089
q1_5      fakes_makers            0.3123  0.3242  0.2103  0.4773  1.0000  0.2407  0.2674  0.0743  -.1941  -.0325
q1_6      culture_only            0.0196  0.1292  0.0894  0.2898  0.2407  1.0000  0.4945  -.1681  -.1537  -.1346
q1_7      economy_only            -.0193  0.1588  0.2183  0.3064  0.2674  0.4945  1.0000  0.0203  0.0877  0.1092
q1_8      threat                  0.1796  0.2696  0.0251  -.0116  0.0743  -.1681  0.0203  1.0000  0.1121  0.2250
q1_9      opportunity             0.1719  0.1056  -.1589  -.2314  -.1941  -.1537  0.0877  0.1121  1.0000  0.3318
q1_10     leading_nation          0.0089  -.0409  0.0441  -.0089  -.0325  -.1346  0.1092  0.2250  0.3318  1.0000

Captions: in the original output, colour highlighting marked the highest positive, high positive, high negative and highest negative correlations. The highest positive correlation is between q1_6 and q1_7 (0.4945); the highest negative correlation is between q1_4 and q1_9 (-.2314).

4.e.ii. Eigenvalues analysis
Each principal component is a linear combination of the original variables, with coefficients equal to the elements of an eigenvector of the correlation (or covariance) matrix. An eigenvector of a linear transformation is a nonzero vector that the transformation may change in length, but not in direction. To each eigenvector corresponds a scalar, the eigenvalue, which gives the scaling factor and, in PCA, the quantity of variance explained by the corresponding principal component. Variance is a measure of our capacity to observe differences between units. Consequently, we can sort the principal components in decreasing order of their eigenvalues, which are equal to the variances of the components. The first eigenvector gives the direction that minimizes the distances between the data points and the axis, and we analyze the principal components through the variance of the new variables (which summarize the complex and redundant information of the original ones). The eigenvectors present the following characteristics:
1. they reduce the redundancy of information, since the components are sorted by the amount of variance they explain;
2. they minimize the distance between the points and the vectors;
3. they are paired with the eigenvalues generated from the correlation matrix.
Now we proceed with the eigenvalues of the correlation matrix and choose the principal components with the highest eigenvalues. The eigenvalue output shows the variance of each principal component, the proportion of total variance it explains and the cumulative proportion from one component to the next, until the maximum value of 1 is reached. The first eigenvalue is the largest, so the first principal component captures more of the variability of the dataset than any other and is the most valuable for describing the data.
o SAS output 2.3:
Eigenvalues of the correlation matrix
      Eigenvalue    Difference    Proportion   Cumulative
 1    2.49549260    0.72269728    0.2495       0.2495
 2    1.77279532    0.47670939    0.1773       0.4268
 3    1.29608593    0.18762639    0.1296       0.5564
 4    1.10845954    0.29883899    0.1108       0.6673
 5    0.80962056    0.05255596    0.0810       0.7482
 6    0.75706459    0.22956106    0.0757       0.8240
 7    0.52750354    0.06663952    0.0528       0.8767
 8    0.46086401    0.02194964    0.0461       0.9228
 9    0.43891437    0.10571484    0.0439       0.9667
10    0.33319953                  0.0333       1.0000

We would like to point out some observations concerning the table above. The significant principal components are the first four, because their eigenvalues are greater than 1. The first one in particular is much larger than the second: the difference between them (0.7227) is the largest in the table, and the first eigenvalue alone explains 24.95% of the variance in the dataset. The first six principal components together explain approximately 82% of the variance, so they are the most useful ones for summarizing the dataset while reducing its complexity.
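As a worked check of the eigenvalue table: the analysis is run on the correlation matrix of ten variables, so the eigenvalues add up to 10 and each proportion is simply the eigenvalue divided by 10. The symbol \lambda_k for the k-th eigenvalue is our own notation, not part of the SAS output.

```latex
\text{Proportion}_k = \frac{\lambda_k}{\sum_{j=1}^{10} \lambda_j} = \frac{\lambda_k}{10},
\qquad
\text{Proportion}_1 = \frac{2.4955}{10} \approx 0.2495,
\qquad
\text{Cumulative}_4 = \frac{2.4955 + 1.7728 + 1.2961 + 1.1085}{10} \approx 0.6673
```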

4.e.ii. The Eigenvectors analysis
o SAS output 2.4:
Eigenvectors
                                Prin1    Prin2    Prin3    Prin4    Prin5    Prin6    Prin7    Prin8    Prin9    Prin10
q1_1   underdeveloped_country   0.22606  0.43501  -0.3465  -.28148  0.24738  0.22802  0.3308   0.43671  -.30828  -.22802
q1_2   In_developement          0.34580  0.37949  -.24671  -.18283  0.06542  -.37704  0.01407  -.51059  0.44332  -.19554
q1_3   geographical_distance    0.35188  -.03951  0.02514  0.51005  0.45501  -.44213  0.15777  0.03769  -.27532  0.33047
q1_4   too_different            0.46001  -.16110  0.04694  0.31714  0.08149  0.22707  -.32461  0.41945  0.47465  -.30796
q1_5   fakes_makers             0.46189  0.03596  -.17139  0.00823  -.14275  0.55862  -.23270  -.37222  -.21977  0.42719
q1_6   culture_only             0.34869  -.26911  0.25405  -.46588  -.17424  -.10004  0.4244   0.23353  0.28472  0.41028
q1_7   economy_only             0.36764  -.05206  0.53582  -.22251  -.13987  -.17687  -.18826  -.11302  -.48341  -.43905
q1_8   threat                   0.06509  0.46704  -.03309  0.27506  -.71393  -.25802  -.07752  0.29916  -.05078  0.16288
q1_9   opportunity              -.13533  0.46512  0.38581  -.26554  0.37448  -.01319  -.48446  0.17272  0.11778  0.35661
q1_10  leading_nation           -.03529  0.35825  0.5341   0.33738  0.02478  0.37244  0.49969  -.21454  0.16685  -.09186

This exhibit shows the relation between the principal components and the original variables. As we can see, the first eight coefficients of the first principal component, which is the most interesting one from our point of view, are all positive. Even if the output of this first table is quite satisfactory, we think that this pattern reveals a redundancy in the database (the size effect) that would negatively affect our investigation; therefore we will proceed with the elimination of the size effect, that is, of this bias.

4.f. Size effect
It is obvious that no sample perfectly reflects the population. Thus, overfitting can lead to erroneous conclusions if models fitted to one dataset are applied to others. In multiple regression this problem manifests itself as an inflated R2 (the index measuring how well a model fits the data) and as misestimated regression coefficients. In PCA overfitting can lead to erroneous conclusions in several ways, including the extraction of erroneous factors or the misassignment of items to factors. If the sample gathered is too small (between 50 and 100 observations), errors of inference can easily arise, particularly with techniques such as PCA. When this phenomenon, called size effect, occurs, it produces a biased principal component. In other words, the size effect is a bias or distortion in the data due to people's different psychological attitudes: their opinions can be optimistically or pessimistically affected by their personal experience, so that some respondents give systematically higher or lower scores and distort the results of the research. Statistically speaking, the term is also related to the effect size, which quantifies the difference between the mean values of two groups divided by the standard deviation and is therefore a measure of the effectiveness of a treatment. The size effect is usually studied because researchers need to know whether the observed difference is meaningful and whether it needs to be eliminated. In order to compare properly the relative scores that respondents gave in the survey, the size effect can be eliminated by standardizing the data: we create a new dataset with 10 new variables [new_q1_1 to new_q1_10] in which each respondent's scores are centred on that respondent's own average.
For each row (every interviewed person) the mean, maximum and minimum values are computed in SAS, and the scores are rescaled with if statements so as to fall in a range from -1 to 1, with missing values set equal to the mean value [0] so that they will not affect the final output. Then the MEANS procedure and the PCA can be recomputed, with the corresponding eigenvalues and eigenvectors. SAS syntax 5:

These are the commands that SAS needs in order to compute a new, more precise matrix with standardized variables not affected by the size effect:
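data Gioconda.China_2;
   set Gioconda.China;
   array p1 q1_1-q1_10;            /* original scores */
   array p2 new_q1_1-new_q1_10;    /* rescaled scores */
   mean=mean(of q1_1-q1_10);       /* row statistics for each respondent */
   min=min(of q1_1-q1_10);
   max=max(of q1_1-q1_10);
   do over p2;
      if p1<mean then p2=(p1-mean)/(mean-min);   /* below the row mean: -1 to 0 */
      if p1>mean then p2=(p1-mean)/(max-mean);   /* above the row mean: 0 to 1 */
      if p1=mean then p2=0;
   end;
run;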
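To make the rescaling concrete, here is a small self-contained sketch applied to one hypothetical respondent. The dataset name work.rescale_demo, the variables q1-q10 and z1-z10 and the ten scores are invented for illustration; the explicit branch for missing answers is our assumption about how the "missing values equal to the mean [0]" rule described in the text could be written.

```sas
/* Row-wise rescaling of one hypothetical respondent (illustrative data). */
data work.rescale_demo;
   input q1-q10;
   array q{10} q1-q10;      /* raw scores */
   array z{10} z1-z10;      /* rescaled scores in the range -1 to 1 */
   row_mean = mean(of q1-q10);
   row_min  = min(of q1-q10);
   row_max  = max(of q1-q10);
   do i = 1 to 10;
      /* assumption: a missing answer is placed at the row mean, i.e. 0 */
      if missing(q{i})        then z{i} = 0;
      else if q{i} < row_mean then z{i} = (q{i} - row_mean) / (row_mean - row_min);
      else if q{i} > row_mean then z{i} = (q{i} - row_mean) / (row_max - row_mean);
      else                         z{i} = 0;
   end;
   drop i;
   datalines;
2 4 5 5 5 5 5 5 6 8
;
run;

proc print data=work.rescale_demo noobs;
   var z1-z10;
run;
```

For this respondent the row mean is 5, the minimum 2 and the maximum 8, so the score 2 becomes -1, the score 4 becomes -1/3, every 5 becomes 0, the score 6 becomes 1/3 and the score 8 becomes 1.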

SAS syntax 6:

The following SAS commands change the labels of the variables in our new dataset, which is not affected by the size effect. The purpose of this operation is to have more convenient names and to avoid confusion with the labels used before the elimination of the size effect:
data Gioconda.China_2;
   set Gioconda.China_2;
   label new_q1_1='undeveloped_country nse'
         new_q1_2='in_development nse'
         new_q1_3='geographical_distance nse'
         new_q1_4='too_different nse'
         new_q1_5='fakes_makers nse'
         new_q1_6='culture_only nse'
         new_q1_7='economy_only nse'
         new_q1_8='threat nse'
         new_q1_9='opportunity nse'
         new_q1_10='leading_nation nse';
run;

In addition, for less skilled users, it is possible to change labels and variable names in SAS from the Explorer window, which shows all the datasets: open the properties of the dataset to be modified, go to the columns tab and edit the label of each variable. The procedure is: Explorer > right click on the dataset > Property > Column > right click on the column name > Modify > Label. The screen-shot of the relabelled dataset is reported below: Figure 2 New relabelled dataset:

Now that the second dataset has been created, we must repeat all the procedures on this new set of data, without the size effect, with the purpose of comparing it with the original one later. The comparison of China and China_2 is useful because it shows how the data treatment changes the results.

So, as we did previously, we compute the mean, minimum, maximum, standard deviation, standard error, median and coefficient of variation of the new variables using the SAS MEANS procedure: SAS syntax 7:
proc means data=Gioconda.China_2 mean stddev min max median stderr cv; var new_q1_1-new_q1_10; run;

o SAS output 3: SAS System

The MEANS procedure
Variable    Label                       Mean       Std Dev   Min     Max    Median     Std Error   Var. Coeff.
New_q1_1    undeveloped_country nse     -0.12554   0.57315   -1.00   1.00   -0.09091   0.0548979   -456.54213
New_q1_2    in_development nse          0.24364    0.51531   -1.00   1.00   0.24528    0.0493582   211.5099
New_q1_3    geographical_distance nse   -0.74200   0.52829   -1.00   1.00   -1.00000   0.0506005   -71.19747
New_q1_4    too_different nse           -0.19383   0.66342   -1.00   1.00   -0.20000   0.0635441   -342.26532
New_q1_5    fakes_makers nse            0.13485    0.72062   -1.00   1.00   0.18919    0.0690225   534.37829
New_q1_6    culture_only nse            -0.30106   0.63725   -1.00   1.00   -0.35484   0.061038    -211.66900
New_q1_7    economy_only nse            -0.27043   0.60328   -1.00   1.00   -0.28571   0.057784    -223.08089
New_q1_8    threat nse                  0.30435    0.66794   -1.00   1.00   0.43396    0.063978    219.46396
New_q1_9    opportunity nse             0.29408    0.69774   -1.00   1.00   0.50000    0.066831    237.25945
New_q1_10   Leading_nation nse          0.07437    0.68777   -1.00   1.00   0.18919    0.065876    924.78689

Now that all the basic statistics needed to analyse the second dataset have been assembled, the next step is to compare the variables according to their means and standard deviations, sorting them first by mean in decreasing order and then by standard deviation in increasing order:

Table 6 New means for the no size effect dataset and Standard Deviations

Sorted by mean (decreasing order):
Variable    Label                       Mean        Std Dev
new_q1_8    threat nse                  0.304353    0.667945
new_q1_9    opportunity nse             0.294082    0.697738
new_q1_2    in_development nse          0.243636    0.515315
new_q1_5    fakes_makers nse            0.134851    0.720616
new_q1_10   leading_nation nse          0.074371    0.687769
new_q1_1    undeveloped_country nse     -0.12554    0.573151
new_q1_4    too_different nse           -0.19383    0.66342
new_q1_7    economy_only nse            -0.27043    0.60328
new_q1_6    culture_only nse            -0.30106    0.637253
new_q1_3    geographical_distance nse   -0.742      0.528285

Sorted by standard deviation (increasing order):
Variable    Label                       Mean        Std Dev
new_q1_2    in_development nse          0.2436362   0.515315
new_q1_3    geographical_distance nse   -0.742      0.528285
new_q1_1    undeveloped_country nse     -0.125542   0.573151
new_q1_7    economy_only nse            -0.270431   0.60328
new_q1_6    culture_only nse            -0.301061   0.637253
new_q1_4    too_different nse           -0.193832   0.66342
new_q1_8    threat nse                  0.3043529   0.667945
new_q1_10   leading_nation nse          0.0743705   0.687769
new_q1_9    opportunity nse             0.2940824   0.697738
new_q1_5    fakes_makers nse            0.1348512   0.720616

Comparing the results of the MEANS procedure before and after the elimination of the size effect, we notice that the ranking of the variables by decreasing mean score has not changed at all. The threat variable is still the first one, followed by the one labelled opportunity; therefore the means previously computed correctly represented the most widespread opinions concerning China in our sample, and our conclusions about the means still hold. The standard deviation ranking of this new dataset has changed: the variable labelled threat moved down by one position, from sixth to seventh place among the 10 variables, which means that, when the size effect is excluded, this variable is less stable because its values fluctuate more. According to the questionnaires collected, the characteristic developing country seems to suit China as well, because it ranks 3rd by mean among the 10 statements attributed to that country, and its standard deviation is the smallest, a sign of quite homogeneous opinion. In addition, we have confirmation that the interviewed sample thinks of China as a fakes maker nation: that variable still has the 4th highest mean score and the largest standard deviation. Two things should be considered: first, the old means with a score under 5.0000 now have a negative sign, probably because a quite high number of respondents gave those statements a score lower than their own average; second, it is important to look at the mean variable, because it represents the average score given by each individual.
Now we compute the PCA for these new data, as we did for the old data affected by the size effect. With the command below SAS computes the principal component analysis for the new variables new_q1_1 - new_q1_10 and stores the scores of each principal component in a new output dataset called Gioconda.cluster (we gave this name to the output because we will use it later in the clustering process): SAS syntax 8:

proc princomp data=Gioconda.China_2 out=Gioconda.cluster; var new_q1_1-new_q1_10; run;

Figure 3 Screen-shot of the SAS dataset Gioconda.Cluster:

o SAS output 4.1: SAS System

The PRINCOMP procedure
Observations   109
Variables       10

Simple Statistics
            Mean         StD
new_q1_1    -.12554168   0.57315066
new_q1_2    0.24363616   0.51531461
new_q1_3    -.7419999    0.52828519
new_q1_4    -.19383219   0.66342035
new_q1_5    0.13485124   0.72061576
new_q1_6    -.30106120   0.63725321
new_q1_7    -.27043089   0.60327963
new_q1_8    0.3043529    0.66794493
new_q1_9    0.29408235   0.69773818
new_q1_10   0.0743705    0.68776861

o SAS output 4.2:

Correlation matrix
                                        new_q1_1  new_q1_2  new_q1_3  new_q1_4  new_q1_5  new_q1_6  new_q1_7  new_q1_8  new_q1_9  new_q1_10
new_q1_1    undeveloped_country nse     1.0000    0.3396    -.1863    -.2460    0.0606    -.2074    -.3001    0.0383    0.0737    -.2198
new_q1_2    in_development nse          0.3396    1.0000    -.0149    -.2025    -.0645    -.0291    -.1473    0.1036    0.0376    -.2457
new_q1_3    geographical_distance nse   -.1863    -.0149    1.0000    0.2281    0.0512    -.1130    -.0272    -.0583    -.3049    -.0936
new_q1_4    too_different nse           -.2460    -.2025    0.2281    1.0000    0.3101    0.1316    0.0791    -.2068    -.3690    -.1639
new_q1_5    fakes_makers nse            0.0606    -.0645    0.0512    0.3101    1.0000    0.1119    0.0799    -.0923    -.3892    -.1303
new_q1_6    culture_only nse            -.2074    -.0291    -.1130    0.1316    0.1119    1.0000    0.3965    -.2375    -.2625    -.2119
new_q1_7    economy_only nse            -.3001    -.1473    -.0272    0.0791    0.0799    0.3965    1.0000    -.0598    -.0180    0.0834
new_q1_8    threat nse                  0.0383    0.1036    -.0583    -.2068    -.0923    -.2375    -.0598    1.0000    0.0936    0.1078
new_q1_9    opportunity nse             0.0737    0.0376    -.3049    -.3690    -.3892    -.2625    -.0180    0.0936    1.0000    0.2356
new_q1_10   leading_nation nse          -.2198    -.2457    -.0936    -.1639    -.1303    -.2119    0.0834    0.1078    0.2356    1.0000

Caption: XXX : highest positive correlation XXX : high positive correlation XXX : high negative correlation XXX : highest negative correlation
We can infer from this output that the size effect has been eliminated, since every variable is now negatively correlated with some others and positive correlations are no longer the majority. We can also state that we did not have any missing values, because the number of observations without the size effect is the same as in the original dataset.

With the size effect elimination, any missing value would have been replaced by the mean score [0]. Before the elimination, the first seven variables were all positively correlated with each other, except for variable number 1 and variable number 7; now many of those pairs are negatively correlated (variable 1, for instance, is positively correlated only with variables 2 and 5 among the first seven). Initially only 11 pairs of variables were negatively correlated; now the negative correlations have spread to 27 of the 45 pairs, and all the pairs that were already negatively correlated have become more strongly so. The most striking result of the comparison is therefore this expansion of negative correlation: every variable now has a negative correlation coefficient with some other. The table below reports the eigenvalues of the new correlation matrix, without size effect. o SAS output 4.3:
Eigenvalues of the correlation matrix
      Eigenvalue    Difference    Proportion   Cumulative
 1    2.27048390    0.57516647    0.2270       0.2270
 2    1.69531743    0.35862309    0.1695       0.3966
 3    1.33669434    0.34678525    0.1337       0.5302
 4    0.98990909    0.02867320    0.0990       0.6292
 5    0.96123589    0.25543589    0.0961       0.7254
 6    0.70580000    0.07486505    0.0706       0.7959
 7    0.63093495    0.04787397    0.0631       0.8590
 8    0.58306098    0.11497139    0.0583       0.9173
 9    0.46808959    0.10961577    0.0468       0.9642
10    0.35847382                  0.0358       1.0000

The significant principal components are now the first five, because their eigenvalues are higher than 1 or very close to it (0.99 and 0.96 for the fourth and fifth principal components are almost equal to 1). For this reason we will keep working on them; together they explain 72.5% of the information in the dataset, while the rest might be purely random noise. The first seven principal components together explain approximately 86% of the information in the dataset. All these pieces of information will be used in the next step to create the clusters: from the eigenvalues of the correlation matrix we can read the cumulative variance and consequently select the principal components to work on without losing meaningful information.

o SAS output 4.4:

Eigenvectors
                                        Prin1    Prin2    Prin3    Prin4    Prin5    Prin6    Prin7    Prin8    Prin9    Prin10
new_q1_1    undeveloped_country nse     -.28731  0.49541  0.15328  -.30250  -.03544  0.19714  0.00264  -.33603  0.56682  0.28757
new_q1_2    in_development nse          -.21018  0.44142  0.24146  0.38484  0.15093  0.31278  0.22961  0.61036  -.05130  -.07792
new_q1_3    geographical_distance nse   0.23523  0.11755  -.51088  0.54873  -.05377  0.38172  0.01218  -.29226  -.04657  0.36388
new_q1_4    too_different nse           0.46125  0.06330  -.23683  -.12250  -.10472  -.34265  0.51812  0.34827  0.39704  0.19076
new_q1_5    fakes_makers nse            0.33957  0.24106  -.03945  -.56387  0.34419  0.32493  0.14877  -.01975  -.49528  0.12332
new_q1_6    culture_only nse            0.35788  -.07560  0.58339  0.15614  0.04026  -.11701  -.34545  0.10276  -.02379  0.59642
new_q1_7    economy_only nse            0.25653  -.37162  0.38706  0.15419  0.29864  0.34364  0.40476  -.30631  0.28716  -.27406
new_q1_8    threat nse                  -.26249  -.02679  -.18377  0.13162  0.82648  -.38512  0.02826  -.06905  0.05414  0.19849
new_q1_9    opportunity nse             -.44556  -.28187  0.13328  -.02350  -.24390  -.02493  0.56895  -.09798  -.32493  0.45245
new_q1_10   Leading_nation nse          -.16905  -.50971  -.24057  -.24556  0.10138  0.46248  -.22106  0.43439  0.28183  0.22868

4.g. Cluster creation and analysis
A problem we face in this survey is how to organize the observed data into meaningful structures. The key to solving it is clustering, a generic term for multivariate analysis whose objective is to determine how many similar or different groups (clusters) of observations are contained in the original dataset. Cluster analysis, also known as data segmentation, encompasses a number of different algorithms and methods for grouping objects of a similar kind into categories, in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Clustering thus means grouping and segmenting data, collecting homogeneous objects into clusters or subsets with similar characteristics, in order to build a strategy for micro-marketing as well as macro-marketing (according to specific needs). In order to proceed with the cluster analysis, we should first examine the distance and the similarity between points. Distance measures how far apart two observations are (observations with similar characteristics have a low distance) and similarity represents the level of alikeness between cases. In brief, the cases with the lowest distance and the highest similarity will constitute a cluster. Now we analyze our dataset containing the principal component variables through the cluster analysis. The SAS command lines needed to run the CLUSTER procedure are reported below: SAS syntax 9:

proc cluster data=Gioconda.Cluster method=ward; var prin1-prin5; run;

The following is the SAS html output of the above proc cluster procedure: o SAS output 5.1:
Eigenvalues of the covariance matrix
     Eigenvalue    Difference    Proportion   Cumulative
1    2.27048390    0.57516647    0.3130       0.3130
2    1.69531743    0.35862309    0.2337       0.5467
3    1.33669434    0.34678525    0.1843       0.7310
4    0.98990909    0.02867320    0.1365       0.8675
5    0.96123589                  0.1325       1.0000

Root-mean-square total-sample standard deviation = 1.204462
Root-mean-square distance between observations = 3.808843

o SAS output 5.2:

Cluster history
NCL   Cluster Joined    Freq   SPRSQ    RSQ    Tie
108   OB78    OB99      2      0.0002   1.00
107   OB71    OB77      2      0.0002   1.00
106   OB37    OB104     2      0.0002   .999
105   OB44    OB49      2      0.0003   .999
104   OB12    OB26      2      0.0003   .999
103   OB7     OB81      2      0.0003   .999
102   OB51    OB55      2      0.0003   .998
101   OB62    OB87      2      0.0003   .998
100   OB79    OB88      2      0.0003   .998
99    OB15    OB68      2      0.0003   .997
98    OB63    OB76      2      0.0004   .997
97    CL100   OB102     3      0.0004   .997
96    CL104   OB94      3      0.0004   .996
95    OB91    OB109     2      0.0004   .996
94    OB40    OB100     2      0.0004   .996
93    OB10    OB17      2      0.0004   .995
92    OB43    OB64      2      0.0004   .995
91    OB1     CL106     3      0.0004   .994
90    OB4     OB27      2      0.0005   .994
89    OB45    OB95      2      0.0005   .993
88    OB18    OB80      2      0.0006   .993
87    OB23    OB107     2      0.0006   .992
86    CL108   OB101     3      0.0007   .991
85    OB29    OB38      2      0.0007   .991
84    CL91    OB22      4      0.0007   .990
83    OB70    OB106     2      0.0007   .989
82    OB21    OB41      2      0.0008   .988
81    CL98    OB85      3      0.0008   .988
80    OB8     OB28      2      0.0008   .987
79    OB13    OB19      2      0.0009   .986
78    OB5     CL80      3      0.0009   .985
77    OB9     OB24      2      0.0009   .984
76    OB56    OB58      2      0.0009   .983
75    OB16    CL105     3      0.0009   .982
74    OB90    OB105     2      0.0010   .981
73    OB25    CL89      3      0.0010   .980
72    OB20    OB103     2      0.0010   .979
71    OB53    OB89      2      0.0010   .978
70    OB33    OB35      2      0.0010   .977
69    CL81    OB97      4      0.0010   .976
68    OB92    OB98      2      0.0012   .975
67    CL92    OB54      3      0.0012   .974
66    OB75    OB83      2      0.0012   .973
65    CL90    OB6       3      0.0013   .972
64    CL87    OB39      3      0.0013   .970
63    CL77    OB82      3      0.0014   .969
62    CL82    CL107     4      0.0015   .967
61    OB2     CL99      3      0.0015   .966
60    CL73    OB48      4      0.0015   .964
59    OB42    OB60      2      0.0015   .963
58    CL76    OB86      3      0.0015   .961
57    OB57    OB96      2      0.0016   .960
56    OB59    CL66      3      0.0018   .958
55    OB11    OB84      2      0.0018   .956
54    CL103   CL69      6      0.0019   .954
53    CL72    OB34      3      0.0022   .952
52    OB30    OB36      2      0.0022   .950
51    CL62    CL101     6      0.0024   .947
50    OB46    CL56      4      0.0026   .945
49    CL59    CL83      4      0.0026   .942
48    CL65    CL78      6      0.0026   .940
47    CL70    OB47      3      0.0027   .937
46    CL85    OB65      3      0.0027   .934
45    OB61    OB108     2      0.0029   .931
44    CL86    CL97      6      0.0030   .928
43    CL93    OB50      3      0.0031   .925
42    CL96    CL102     5      0.0032   .922
41    OB74    OB93      2      0.0033   .919
40    OB52    CL68      3      0.0035   .915
39    OB32    CL95      3      0.0036   .912
38    CL47    CL58      6      0.0038   .908
37    CL94    CL44      8      0.0038   .904
36    CL71    CL74      4      0.0042   .900
35    OB31    OB66      2      0.0044   .896
34    CL57    OB69      3      0.0046   .891
33    CL75    OB73      4      0.0046   .886
32    CL60    CL50      8      0.0046   .882
31    OB3     OB14      2      0.0049   .877
30    CL64    CL45      5      0.0052   .872
29    CL84    CL63      7      0.0052   .866
28    CL49    OB67      5      0.0060   .860
27    CL55    CL53      5      0.0065   .854
26    CL54    CL88      8      0.0068   .847
25    CL79    CL33      6      0.0071   .840
24    CL48    CL36      10     0.0077   .832
23    CL61    CL52      5      0.0081   .824
22    OB72    CL41      3      0.0090   .815
21    CL42    CL37      13     0.0092   .806
20    CL29    CL51      13     0.0093   .797
19    CL39    CL40      6      0.0094   .787
18    CL28    CL67      8      0.0097   .778
17    CL46    CL35      5      0.0116   .766
16    CL31    CL27      7      0.0121   .754
15    CL23    CL24      15     0.0150   .739
14    CL43    CL18      11     0.0162   .723
13    CL26    CL30      13     0.0168   .706
12    CL17    CL22      8      0.0169   .689
11    CL21    CL25      19     0.0214   .668
10    CL15    CL32      23     0.0269   .641
9     CL12    CL19      14     0.0284   .612
8     CL9     CL34      17     0.0310   .581
7     CL20    CL16      20     0.0312   .550
6     CL13    CL38      19     0.0364   .514
5     CL7     CL11      39     0.0518   .462
4     CL10    CL14      34     0.0708   .391
3     CL6     CL8       36     0.0857   .306
2     CL5     CL4       73     0.1170   .188
1     CL2     CL3       109    0.1885   .000

The output above is the cluster history (cronologia delle cluster in Italian) of the hierarchical clustering procedure: for each step it reports the observations or clusters joined, the frequency of the new cluster and other cluster statistics. Initially each case is treated as a cluster of its own; the clusters are then combined step by step on the basis of the measured characteristics, so that those presenting more similarities are aggregated first. The method requested here (method=ward) is the Ward method. It is distinct from linkage methods such as average linkage, which computes the distance between two clusters as the average of all pairwise distances between their points, because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the increase in the within-cluster sum of squares of any two clusters that can be merged at each step. In general, this method is considered a very efficient one; however, it tends to create clusters of small size. SAS then creates a tree diagram that allows us to understand the clustering better: the graph called dendrogram.
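The columns of the cluster history can be read through the standard definition of Ward's criterion. The notation below (n_A and n_B for the sizes of the two clusters being merged, \bar{x}_A and \bar{x}_B for their centroids in the space of prin1 to prin5, T for the total sum of squares) is our own and is only a sketch of how SPRSQ and RSQ relate:

```latex
\Delta(A,B) = \frac{n_A\, n_B}{n_A + n_B}\,\lVert \bar{x}_A - \bar{x}_B \rVert^2,
\qquad
\text{SPRSQ} = \frac{\Delta(A,B)}{T},
\qquad
\text{RSQ}_k = \sum_{j=1}^{k-1} \text{SPRSQ}_j
```

Here SPRSQ_j is the value printed on the row with NCL = j. For the six-cluster solution, for example, 0.1885 + 0.1170 + 0.0857 + 0.0708 + 0.0518 = 0.5138, which matches the RSQ of .514 reported on the row NCL = 6.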

4.h. Dendrogram construction
The dendrogram is a tree diagram that displays the sequence of merges of an agglomerative hierarchical clustering. One of the simplest agglomerative methods is minimum linkage, whose essential feature is that the distance between groups is defined as the distance between the closest pair of objects, one from each cluster; the minimum of these distances is the value of the shortest link between the groups. The graph is a very flexible statistical tool for the analysis of clusters, because it is a graphical representation that identifies how many natural groups are present in the sample and, by observing the data, makes it possible to segment the units by a hierarchical criterion, which helps in the identification of clusters. Trees are represented horizontally on the graph: each row on the y axis represents a unit, and the x axis carries the increasing scale of the proximity coefficients. Units characterised by low distance or high similarity are very close together; they are grouped into a cluster at a low distance coefficient, showing alikeness. SAS syntax 10:

proc tree data=gioconda.tree; run;

Figure 4 Screen-shot of the Dendrogram

We can see that all the groups at the bottom are homogeneous; moreover, it is possible to divide the groups according to the different detectable jumps. As the variance decreases, the confusion between data increases at the bottom of the graph. The dendrogram is a very coherent picture of the population. If the analysis starts from the low levels, we can observe micro-groups and how they are joined into macro-groups at the higher levels (hierarchical criterion); in this way it is possible to analyze the segmentation of the collected data and build a strategy for micro-marketing, or for macro-marketing if the analysis starts from the higher levels. If we begin our study from the highest levels (strategic marketing), there are 3 main groups, so they are not a rich source of information; analyzing the bottom ones (micro-marketing) instead, we can judge whether a group is homogeneous by the jump it shows before the joining point. (The higher the fusion distance at which a group is linked to the others, the higher its homogeneity.) The red line is the cut-off line at which we segment the clusters. It is clear at first glance that the dendrogram presents 6 micro clusters and 3 macro clusters. The analysis then proceeds with the Ward method for the analysis of variance among clusters; here are the SAS command lines: SAS syntax 11:
proc cluster data=Gioconda.Cluster method=ward outtree=Gioconda.tree; var prin1-prin5; copy _all_; id id; run;

With this procedure we can store all the data segmentation in a new dataset called Gioconda.Tree. Previously we created in the dataset Gioconda.Cluster a new variable called id [i.e. identifier], a sort of counter that provides the correspondence between the 2 different datasets. In this way we can match the principal components we decided to analyze with the original records, since SAS mixes up their order. We also needed to copy all the information contained in the previous dataset, so as not to lose the values and statistics obtained in the other steps. The html SAS output of this procedure is the same as the one inserted above; we paste here a few lines of the classic SAS output as a sample of the results of the procedure: o SAS output 6:
The CLUSTER procedure
Ward's minimum variance cluster analysis
Cluster history
NCL   Clusters Joined   Freq   SPRSQ    RSQ    Tie
31    3       14        2      0.0049   .877
30    CL64    CL45      5      0.0052   .872
29    CL84    CL63      7      0.0052   .866
28    CL49    67        5      0.0060   .860
27    CL55    CL53      5      0.0065   .854
26    CL54    CL88      8      0.0068   .847
25    CL79    CL33      6      0.0071   .840
24    CL48    CL36      10     0.0077   .832
23    CL61    CL52      5      0.0081   .824
22    72      CL41      3      0.0090   .815
21    CL42    CL37      13     0.0092   .806

The final column of this dataset is the variable we called id, used to match the new data with the older ones. From the tree dataset SAS generates a new dendrogram, on the basis of which it creates a new dataset, Gioconda.Cluster_2, including our new cluster results. The SAS command lines needed for this procedure are the following: SAS syntax 12:

proc tree data=gioconda.tree n=6 out=Gioconda.Cluster_2; copy _all_; run;

Now we have to prepare the data for the computation of the t-tests for each cluster. As they are not ready yet, a few more steps are needed before we can successfully run the t-test procedure. The first step is to sort the data of the dataset Gioconda.Cluster_2, as follows: SAS syntax 13:

proc sort data=Gioconda.Cluster_2; by id; run;

The above procedure sorts the data by the identifier, because initially the variable id in the dataset Gioconda.Cluster_2 was not in increasing order. The next step before the t-test is to merge Gioconda.Cluster with Gioconda.Cluster_2, keeping the match given by the identifier. The SAS command lines for the merge are the following: SAS syntax 14:

data Gioconda.merged; merge Gioconda.Cluster_2 Gioconda.Cluster; by id; run;

These commands produce a new dataset that combines the other two datasets, ordered by the variable id. To run the t-test procedures, we need an additional cluster containing all the observations. We therefore created a new dataset named Gioconda.merged_1 in which every observation is assigned to a 7th cluster: SAS syntax 15:

data Gioconda.merged_1; set Gioconda.merged; cluster=7; run;

At this point, we put the two previous datasets, Gioconda.merged and Gioconda.merged_1, together in another dataset, Gioconda.merged_2: SAS syntax 16:

data Gioconda.merged_2; set Gioconda.merged Gioconda.merged_1; run;
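The two data steps of SAS syntax 15 and 16 could also be collapsed into one. The following is only a sketch of that alternative, not the authors' procedure; the output name work.merged_2_alt is hypothetical, and the rows come out interleaved rather than stacked, which should not matter for a procedure that uses a CLASS statement.

```sas
/* Write every observation twice: once with its own cluster (1 to 6)
   and once re-labelled as the pooled group 7. */
data work.merged_2_alt;          /* hypothetical output name */
   set Gioconda.merged;
   output;                       /* original row */
   cluster = 7;
   output;                       /* same row, assigned to cluster 7 */
run;
```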

4.i. Clusters t-tests
In order to analyse the relations between the 10 quantitative variables of our database and the clusters, we run the t-test procedure, a statistical test used to compare two groups of data. It evaluates the difference between the means of the two groups, taking into consideration the dispersion of the data (that is, the standard deviation); in this way we can compare, for each variable, the mean of each cluster with the mean of the whole sample. The higher the absolute t-value found with the t-test procedure, the more likely it is that the variable describes and characterises the cluster. Once we have the value of the t statistic, we compare it with the tabulated values given in statistical tables, in order to determine whether the difference between the two means is significant or due to chance. The null hypothesis H0 states that the difference is due to chance, that is, that the cluster and the population coincide; the null hypothesis is rejected when they do not correspond (alternative hypothesis H1). Significance tests are statistical tests that quantify this evidence in terms of probability. In order to use these tests, the groups of data should be normally distributed, with similar variances. If there is no normality and the variances are very different, we should look for a suitable transformation of the original measures, for example a logarithmic transformation. Following the SAS procedure, we calculate the t-test values using two different methods, the Pooled method and the Satterthwaite method. The difference between the two is quite significant and lies in the treatment of the variances.
The Pooled method combines the two sample variances into a single pooled variance (an average weighted by the degrees of freedom) and derives the standard error from it, taking for granted that the variances of the two groups are similar. The Satterthwaite method computes the standard error from the two variances taken separately, each divided by its own sample size, and does not assume that the variances are similar. In the end, the Satterthwaite method can be applied in every situation, while the Pooled method is appropriate only in specific situations (when the variances of the two groups are the same). For practical reasons we chose the Satterthwaite method in the majority of cases, but there are a few exceptions where, after looking at the values provided by the Fisher (F) test for the equality of variances, we took the Pooled method into consideration. A simple scheme to follow in the computation of the t-test is: Figure 6: t-test procedure

Compare two means -> Hypothesis H0: the difference exists by chance -> t-Student test -> Accept or reject H0?
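The difference between the two methods can be illustrated with a minimal sketch, written in Python rather than SAS purely for illustration. The function names `pooled_t` and `satterthwaite_t` are hypothetical; the inputs are the N, Mean and Std Dev that SAS output 7 reports for new_q1_3 in cluster 1 and cluster 7. With these inputs the sketch should give t-values of about -1.92 (Pooled, 126 DF) and -4.18 (Satterthwaite, about 124 DF), in line with the SAS table.

```python
import math


def pooled_t(n1, m1, s1, n2, m2, s2):
    """Equal-variance (pooled) t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their DF.
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


def satterthwaite_t(n1, m1, s1, n2, m2, s2):
    """Unequal-variance t statistic with Satterthwaite's approximate DF."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2  # each variance over its own N
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (m1 - m2) / se, df


# new_q1_3: cluster 1 (N=19) versus cluster 7 (N=109), from SAS output 7
t_p, df_p = pooled_t(19, -0.976, 0.1043, 109, -0.742, 0.5283)
t_s, df_s = satterthwaite_t(19, -0.976, 0.1043, 109, -0.742, 0.5283)

print(f"Pooled:        t = {t_p:.2f}, DF = {df_p}")
print(f"Satterthwaite: t = {t_s:.2f}, DF = {df_s:.1f}")
```

Because the standard deviation of cluster 1 (0.1043) is much smaller than that of cluster 7 (0.5283), the pooled standard error is dominated by the larger group and the two methods give very different t-values for this variable.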

We now apply the theory explained above to our set of data. The following SAS command performs the t-test for cluster number 1. SAS syntax 17:

proc ttest data=Gioconda.merged_2; var new_q1_1-new_q1_10 age; class cluster; where cluster=1 or cluster=7; run;

Running this procedure produces the following tables:
o SAS output 7: SAS System The t-TEST procedure
Statistics
                            Lower CL          Upper CL Lower CL          Upper CL
Variable   Cluster     N    Mean     Mean     Mean     Std Dev  Std Dev  Std Dev  Std Err  Min     Max
new_q1_1   1           19   -0.341   -0.144   0.0531   0.309    0.409    0.6048   0.0938   -1      0.5918
new_q1_1   7           109  -0.234   -0.126   -0.017   0.5058   0.5732   0.6613   0.0549   -1      1
new_q1_1   Diff (1-2)       -0.29    -0.018   0.2535   0.4921   0.5527   0.6305   0.1374
new_q1_2   1           19   -0.085   0.1314   0.3481   0.3399   0.4498   0.6651   0.1032   -1      1
new_q1_2   7           109  0.1458   0.2436   0.3415   0.4548   0.5153   0.5945   0.0494   -1      1
new_q1_2   Diff (1-2)       -0.361   -0.112   0.1369   0.4509   0.5065   0.5778   0.1259
new_q1_3   1           19   -1.026   -0.976   -0.926   0.0788   0.1043   0.1542   0.0239   -1      -0.545
new_q1_3   7           109  -0.842   -0.742   -0.642   0.4663   0.5283   0.6095   0.0506   -1      1
new_q1_3   Diff (1-2)       -0.475   -0.234   0.0073   0.4369   0.4907   0.5598   0.122
new_q1_4   1           19   -0.925   -0.756   -0.587   0.265    0.3507   0.5187   0.0805   -1      0
new_q1_4   7           109  -0.32    -0.194   -0.068   0.5855   0.6634   0.7654   0.0635   -1      1
new_q1_4   Diff (1-2)       -0.871   -0.562   -0.253   0.5594   0.6283   0.7168   0.1562
new_q1_5   1           19   -0.705   -0.425   -0.145   0.4387   0.5806   0.8585   0.1332   -1      0.6552
new_q1_5   7           109  -0.002   0.1349   0.2717   0.636    0.7206   0.8314   0.069    -1      1
new_q1_5   Diff (1-2)       -0.906   -0.56    -0.215   0.6253   0.7023   0.8012   0.1746
new_q1_6   1           19   -0.917   -0.8     -0.683   0.1837   0.2431   0.3595   0.0558   -1      -0.333
new_q1_6   7           109  -0.422   -0.301   -0.18    0.5624   0.6373   0.7352   0.061    -1      1
new_q1_6   Diff (1-2)       -0.793   -0.499   -0.205   0.5316   0.5971   0.6811   0.1484
new_q1_7   1           19   -0.794   -0.638   -0.483   0.2431   0.3217   0.4757   0.0738   -1      0
new_q1_7   7           109  -0.385   -0.27    -0.156   0.5324   0.6033   0.696    0.0578   -1      1
new_q1_7   Diff (1-2)       -0.649   -0.368   -0.087   0.5089   0.5716   0.6521   0.1421
new_q1_8   1           19   0.2996   0.6028   0.9059   0.4753   0.6291   0.9303   0.1443   -1      1
new_q1_8   7           109  0.1775   0.3044   0.4312   0.5895   0.6679   0.7706   0.064    -1      1
new_q1_8   Diff (1-2)       -0.028   0.2984   0.6244   0.5899   0.6625   0.7558   0.1647
new_q1_9   1           19   0.7367   0.8768   1.0168   0.2196   0.2906   0.4298   0.0667   -0.143  1
new_q1_9   7           109  0.1616   0.2941   0.4266   0.6158   0.6977   0.805    0.0668   -1      1
new_q1_9   Diff (1-2)       0.2603   0.5827   0.9051   0.5834   0.6553   0.7475   0.1629
new_q1_10  1           19   0.3912   0.6163   0.8414   0.353    0.4671   0.6908   0.1072   -1      1
new_q1_10  7           109  -0.056   0.0744   0.2049   0.607    0.6878   0.7935   0.0659   -1      1
new_q1_10  Diff (1-2)       0.2168   0.5419   0.867    0.5883   0.6608   0.7538   0.1643
age        1           19   22.471   23.684   24.898   1.9025   2.5178   3.7233   0.5776   21      30
age        7           109  23.212   23.706   24.201   2.2987   2.6045   3.0049   0.2495   19      30
age        Diff (1-2)       -1.298   -0.022   1.2532   2.3079   2.5923   2.9572   0.6445

t-Test
Variable   Method         Variances  DF     t Value  Pr > |t|
new_q1_1   Pooled         Equal      126    -0.13    0.8933
new_q1_1   Satterthwaite  Unequal    31.8   -0.17    0.8662
new_q1_2   Pooled         Equal      126    -0.89    0.3743
new_q1_2   Satterthwaite  Unequal    26.9   -0.98    0.3351
new_q1_3   Pooled         Equal      126    -1.92    0.0573
new_q1_3   Satterthwaite  Unequal    124    -4.18    <.0001
new_q1_4   Pooled         Equal      126    -3.60    0.0005
new_q1_4   Satterthwaite  Unequal    44.6   -5.48    <.0001
new_q1_5   Pooled         Equal      126    -3.21    0.0017
new_q1_5   Satterthwaite  Unequal    28.6   -3.73    0.0008
new_q1_6   Pooled         Equal      126    -3.36    0.0010
new_q1_6   Satterthwaite  Unequal    70.2   -6.04    <.0001
new_q1_7   Pooled         Equal      126    -2.59    0.0107
new_q1_7   Satterthwaite  Unequal    44.1   -3.93    0.0003
new_q1_8   Pooled         Equal      126    1.81     0.0724
new_q1_8   Satterthwaite  Unequal    25.6   1.89     0.0701
new_q1_9   Pooled         Equal      126    3.58     0.0005
new_q1_9   Satterthwaite  Unequal    61.9   6.17     <.0001
new_q1_10  Pooled         Equal      126    3.30     0.0013
new_q1_10  Satterthwaite  Unequal    33.4   4.31     0.0001
age        Pooled         Equal      126    -0.03    0.9726
age        Satterthwaite  Unequal    25.2   -0.04    0.9721

Caption: XXX properly describing the cluster XXX negatively correlated to the cluster description
Equality of variances
Variable   Method    Num DF  Den DF  F Value  Pr > F
new_q1_1   Folded F  108     18      1.96     0.1024
new_q1_2   Folded F  108     18      1.31     0.5214
new_q1_3   Folded F  108     18      25.66    <.0001
new_q1_4   Folded F  108     18      3.58     0.0033
new_q1_5   Folded F  108     18      1.54     0.2949
new_q1_6   Folded F  108     18      6.87     <.0001
new_q1_7   Folded F  108     18      3.52     0.0037
new_q1_8   Folded F  108     18      1.13     0.8099
new_q1_9   Folded F  108     18      5.76     0.0001
new_q1_10  Folded F  108     18      2.17     0.0626
age        Folded F  108     18      1.07     0.9208
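The way the equality-of-variances table feeds the choice between the two t-test methods can be sketched as follows (Python, for illustration only). The helper names `folded_f` and `pick_method` and the 0.05 significance level are assumptions of this sketch, not part of the SAS output; the report itself applies a much stricter rule further on, considering the Pooled method only when Pr > F exceeds 0.95. The standard deviations and Pr > F values are taken from SAS output 7.

```python
def folded_f(sd_a, sd_b):
    """Folded F statistic: the larger sample variance over the smaller one."""
    var_hi = max(sd_a, sd_b) ** 2
    var_lo = min(sd_a, sd_b) ** 2
    return var_hi / var_lo


def pick_method(pr_f, alpha=0.05):
    """Keep the Pooled test only if equal variances are not rejected.

    alpha is an assumed conventional level, not the authors' criterion.
    """
    return "Satterthwaite" if pr_f < alpha else "Pooled"


# Std Dev of new_q1_3 in cluster 1 and in cluster 7 (SAS output 7)
f_q3 = folded_f(0.1043, 0.5283)

# Pr > F from the "Equality of variances" table; "<.0001" entered as 0.0001
pr_f = {"new_q1_1": 0.1024, "new_q1_3": 0.0001,
        "new_q1_8": 0.8099, "age": 0.9208}
methods = {var: pick_method(p) for var, p in pr_f.items()}

print(f"Folded F for new_q1_3: {f_q3:.2f}")
for var, method in methods.items():
    print(var, "->", method)
```

For new_q1_3 the ratio of the two variances is close to the F value of 25.66 shown in the table, which is why the Pooled and Satterthwaite t-values diverge so strongly for that variable.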

To obtain more reliable conclusions we analysed the t-test statistics using only the Satterthwaite method, in which the standard error is computed from the two variances separately. First of all, more than half of the variables (7 out of 11) have a high absolute t-value and are therefore significant for the description of the cluster: their p-values range from <.0001 to 0.0008. Two of them have a positive t-value and strongly explain the composition of the cluster:
New_q1_9 (China is an opportunity) + 6.17
New_q1_10 (China will be the world's leader nation) + 4.31

On the other hand, five variables describe the cluster negatively: they have strongly negative t-values together with p-values very close to 0 [Pr > |t| between <.0001 and 0.0008], where the p-value is the probability, under H0, of obtaining a t statistic at least as extreme as the one observed in the sample. This is because the people included in this cluster answered these questions significantly more negatively than the whole population (cluster number 7). The variables excluded from the cluster explanation are reported below, from the most to the least negative t-value:
New_q1_6 (China is interesting for its culture only) - 6.04
New_q1_4 (China has a totally different culture) - 5.48
New_q1_3 (China is not influential because of its distance) - 4.18
New_q1_7 (China is interesting only for work and its economy) - 3.93
New_q1_5 (China is the Country of copies) - 3.73
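The reading of the table described above can be sketched as a small filter (Python, for illustration only). The cut-off `ALPHA = 0.001` is an assumption chosen to mirror the p-value range quoted in the text (<.0001 to 0.0008); the t-values and p-values are the Satterthwaite rows of SAS output 7.

```python
# (variable, Satterthwaite t-value, Pr > |t|) for cluster 1 vs cluster 7,
# copied from SAS output 7; "<.0001" is entered as 0.0001.
rows = [
    ("new_q1_1", -0.17, 0.8662), ("new_q1_2", -0.98, 0.3351),
    ("new_q1_3", -4.18, 0.0001), ("new_q1_4", -5.48, 0.0001),
    ("new_q1_5", -3.73, 0.0008), ("new_q1_6", -6.04, 0.0001),
    ("new_q1_7", -3.93, 0.0003), ("new_q1_8", 1.89, 0.0701),
    ("new_q1_9", 6.17, 0.0001), ("new_q1_10", 4.31, 0.0001),
    ("age", -0.04, 0.9721),
]

ALPHA = 0.001  # assumed cut-off, not stated in the report

# Keep only the variables whose difference is significant at ALPHA.
significant = [r for r in rows if r[2] < ALPHA]

# Positive t: the cluster agrees more than the whole sample (largest first).
positive = sorted((r for r in significant if r[1] > 0), key=lambda r: -r[1])
# Negative t: the cluster agrees less than the whole sample (most negative first).
negative = sorted((r for r in significant if r[1] < 0), key=lambda r: r[1])

print("describe the cluster:", [(name, t) for name, t, _ in positive])
print("negatively related:  ", [(name, t) for name, t, _ in negative])
```

Under this assumed cut-off the sketch keeps seven variables: the two with positive t-values and the five with negative t-values discussed in the text.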

In this cluster the variable age is not useful for the description, since its t-statistic is close to zero (-0.04) and its p-value is 0.9721. We now proceed to examine the 2nd cluster; the following command performs a new t-test on it. SAS syntax 18:

proc ttest data=Gioconda.merged_2; var new_q1_1-new_q1_10 age; class cluster; where cluster=2 or cluster=7; run;
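The same comparison is repeated for every cluster, each time against cluster 7, which holds the whole sample. As a rough sketch of that repetition (in Python rather than SAS, with hypothetical toy records that are NOT the survey data, and with the illustrative helper names `welch_t` and `cluster_vs_all`), the loop could look like this:

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Unequal-variance t statistic and Satterthwaite DF for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cluster_vs_all(records, variable):
    """Compare each cluster with the whole sample (the role of cluster 7)."""
    everyone = [r[variable] for r in records]
    results = {}
    for c in sorted({r["cluster"] for r in records}):
        members = [r[variable] for r in records if r["cluster"] == c]
        results[c] = welch_t(members, everyone)
    return results


# Hypothetical toy records, only to make the sketch runnable
records = [
    {"cluster": 1, "new_q1_9": 0.9}, {"cluster": 1, "new_q1_9": 0.7},
    {"cluster": 1, "new_q1_9": 1.0}, {"cluster": 2, "new_q1_9": -0.2},
    {"cluster": 2, "new_q1_9": 0.1}, {"cluster": 2, "new_q1_9": -0.5},
]

for c, (t, df) in cluster_vs_all(records, "new_q1_9").items():
    print(f"cluster {c} vs all: t = {t:.2f}, DF = {df:.1f}")
```

As in the SAS runs, the reference group contains the members of the cluster being tested, so the two groups are not independent; the sketch simply mirrors that design.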

o SAS output 8: SAS System The t-TEST procedure


Statistics
                            Lower CL          Upper CL Lower CL          Upper CL
Variable   Cluster     N    Mean     Mean     Mean     Std Dev  Std Dev  Std Dev  Std Err  Min     Max
new_q1_1   2           20   0.18     0.3853   0.5907   0.3337   0.4388   0.6409   0.0981   -0.231  1
new_q1_1   7           109  -0.234   -0.126   -0.017   0.5058   0.5732   0.6613   0.0549   -1      1
new_q1_1   Diff (1-2)       0.2437   0.5109   0.7781   0.4944   0.5551   0.6329   0.135
new_q1_2   2           20   0.426    0.582    0.7379   0.2534   0.3333   0.4868   0.0745   0.0323  1
new_q1_2   7           109  0.1458   0.2436   0.3415   0.4548   0.5153   0.5945   0.0494   -1      1
new_q1_2   Diff (1-2)       0.1013   0.3383   0.5753   0.4386   0.4924   0.5614   0.1198
new_q1_3   2           20   -1.011   -0.99    -0.968   0.0347   0.0456   0.0667   0.0102   -1      -0.796
new_q1_3   7           109  -0.842   -0.742   -0.642   0.4663   0.5283   0.6095   0.0506   -1      1
new_q1_3   Diff (1-2)       -0.482   -0.248   -0.013   0.4342   0.4875   0.5558   0.1186
new_q1_4   2           20   -0.745   -0.488   -0.231   0.4173   0.5487   0.8014   0.1227   -1      1
new_q1_4   7           109  -0.32    -0.194   -0.068   0.5855   0.6634   0.7654   0.0635   -1      1
new_q1_4   Diff (1-2)       -0.606   -0.294   0.0176   0.5768   0.6475   0.7383   0.1575
new_q1_5   2           20   -0.214   0.1068   0.4276   0.5212   0.6853   1.001    0.1532   -1      1
new_q1_5   7           109  -0.002   0.1349   0.2717   0.636    0.7206   0.8314   0.069    -1      1
new_q1_5   Diff (1-2)       -0.372   -0.028   0.3164   0.6372   0.7154   0.8157   0.174
new_q1_6   2           20   -0.82    -0.566   -0.311   0.4142   0.5446   0.7954   0.1218   -1      1
new_q1_6   7           109  -0.422   -0.301   -0.18    0.5624   0.6373   0.7352   0.061    -1      1
new_q1_6   Diff (1-2)       -0.565   -0.265   0.036    0.556    0.6243   0.7118   0.1519
new_q1_7   2           20   -0.901   -0.76    -0.619   0.2293   0.3015   0.4403   0.0674   -1      0.231
new_q1_7   7           109  -0.385   -0.27    -0.156   0.5324   0.6033   0.696    0.0578   -1      1
new_q1_7   Diff (1-2)       -0.764   -0.49    -0.216   0.5063   0.5684   0.6481   0.1383
new_q1_8   2           20   -0.039   0.3047   0.6483   0.5582   0.7341   1.0721   0.1641   -1      1
new_q1_8   7           109  0.1775   0.3044   0.4312   0.5895   0.6679   0.7706   0.064    -1      1
new_q1_8   Diff (1-2)       -0.326   0.0004   0.3268   0.6041   0.6782   0.7733   0.165
new_q1_9   2           20   0.3984   0.6352   0.8719   0.3847   0.5059   0.7389   0.1131   -1      1
new_q1_9   7           109  0.1616   0.2941   0.4266   0.6158   0.6977   0.805    0.0668   -1      1
new_q1_9   Diff (1-2)       0.0174   0.3411   0.6648   0.599    0.6725   0.7668   0.1636
new_q1_10  2           20   -0.491   -0.225   0.0411   0.4324   0.5686   0.8304   0.1271   -1      0.697
new_q1_10  7           109  -0.056   0.0744   0.2049   0.607    0.6878   0.7935   0.0659   -1      1
new_q1_10  Diff (1-2)       -0.622   -0.299   0.0238   0.5979   0.6713   0.7654   0.1633
age        2           20   22.952   24.2     25.448   2.0286   2.6675   3.8961   0.5965   20      30
age        7           109  23.212   23.706   24.201   2.2987   2.6045   3.0049   0.2495   19      30
age        Diff (1-2)       -0.765   0.4936   1.7519   2.3283   2.614    2.9804   0.6359

t-Test
Variable   Method         Variances  DF     t Value  Pr > |t|
new_q1_1   Pooled         Equal      127    3.78     0.0002
new_q1_1   Satterthwaite  Unequal    32.2   4.54     <.0001
new_q1_2   Pooled         Equal      127    2.82     0.0055
new_q1_2   Satterthwaite  Unequal    38     3.79     0.0005
new_q1_3   Pooled         Equal      127    -2.09    0.0386
new_q1_3   Satterthwaite  Unequal    116    -4.80    <.0001
new_q1_4   Pooled         Equal      127    -1.87    0.0642
new_q1_4   Satterthwaite  Unequal    30.2   -2.13    0.0416
new_q1_5   Pooled         Equal      127    -0.16    0.8723
new_q1_5   Satterthwaite  Unequal    27.3   -0.17    0.8688
new_q1_6   Pooled         Equal      127    -1.74    0.0839
new_q1_6   Satterthwaite  Unequal    29.4   -1.94    0.0618
new_q1_7   Pooled         Equal      127    -3.54    0.0006
new_q1_7   Satterthwaite  Unequal    52.2   -5.52    <.0001
new_q1_8   Pooled         Equal      127    0.00     0.9983
new_q1_8   Satterthwaite  Unequal    25.1   0.00     0.9984
new_q1_9   Pooled         Equal      127    2.08     0.0391
new_q1_9   Satterthwaite  Unequal    33.9   2.60     0.0138
new_q1_10  Pooled         Equal      127    -1.83    0.0691
new_q1_10  Satterthwaite  Unequal    30.2   -2.09    0.0451
age        Pooled         Equal      127    0.78     0.4391
age        Satterthwaite  Unequal    26.1   0.76     0.4521

Caption: XXX properly describing the cluster XXX negatively correlated to the cluster description
Equality of variances
Variable   Method    Num DF  Den DF  F Value  Pr > F
new_q1_1   Folded F  108     19      1.71     0.1818
new_q1_2   Folded F  108     19      2.39     0.0323
new_q1_3   Folded F  108     19      134.02   <.0001
new_q1_4   Folded F  108     19      1.46     0.3451
new_q1_5   Folded F  108     19      1.11     0.8430
new_q1_6   Folded F  108     19      1.37     0.4393
new_q1_7   Folded F  108     19      4.00     0.0011
new_q1_8   Folded F  19      108     1.21     0.5302
new_q1_9   Folded F  108     19      1.90     0.1091
new_q1_10  Folded F  108     19      1.46     0.3439
age        Folded F  19      108     1.05     0.8271

This time just 4 variables have a high absolute t-value and are therefore significant for the description of the cluster; their p-values range from <.0001 to 0.0005, a slightly narrower range than in the first cluster. Only one variable has a positive t-value that strongly explains the composition of the cluster:
New_q1_1 (China is an undeveloped country) + 4.54

On the other hand, two variables describe the cluster negatively: they have a strongly negative t-value and a p-value very close to 0 [Pr > |t| < .0001]. This is because the people included in this cluster answered these questions significantly more negatively than the whole population (cluster number 7):
New_q1_3 (China is not influential because of its distance) - 4.80
New_q1_7 (China is interesting only for work and its economy) - 5.52

Variable number two is also moderately important, since it has a positive and fairly high t-statistic, although its p-value is not as close to 0:
New_q1_2 (China is a developing country) + 3.79 (p-value 0.0005)
In this cluster too the variable age is not useful for the description, since its t-statistic is low (0.76) and its p-value is 0.4521. We now continue with the analysis of the 3rd cluster. The procedure is the same as before and is reported below. SAS syntax 19:

proc ttest data=Gioconda.merged_2; var new_q1_1-new_q1_10 age; class cluster; where cluster=3 or cluster=7; run;

The resulting output concerning the third cluster is: o SAS output 9: SAS System The t-TEST procedure
Statistics
                            Lower CL          Upper CL Lower CL          Upper CL
Variable   Cluster     N    Mean     Mean     Mean     Std Dev  Std Dev  Std Dev  Std Err  Min     Max
new_q1_1   3           19   -0.362   -0.128   0.1073   0.3681   0.4872   0.7205   0.1118   -1      0.6296
new_q1_1   7           109  -0.234   -0.126   -0.017   0.5058   0.5732   0.6613   0.0549   -1      1
new_q1_1   Diff (1-2)       -0.278   -0.002   0.2743   0.5001   0.5617   0.6407   0.1396
new_q1_2   3           19   -0.22    0.006    0.2317   0.3537   0.4681   0.6923   0.1074   -1      0.6429
new_q1_2   7           109  0.1458   0.2436   0.3415   0.4548   0.5153   0.5945   0.0494   -1      1
new_q1_2   Diff (1-2)       -0.488   -0.238   0.0127   0.453    0.5088   0.5805   0.1265
new_q1_3   3           19   -0.937   -0.779   -0.621   0.2478   0.328    0.485    0.0752   -1      -0.048
new_q1_3   7           109  -0.842   -0.742   -0.642   0.4663   0.5283   0.6095   0.0506   -1      1
new_q1_3   Diff (1-2)       -0.285   -0.037   0.2116   0.4492   0.5046   0.5756   0.1254
new_q1_4   3           19   0.2828   0.5367   0.7906   0.398    0.5268   0.779    0.1209   -1      1
new_q1_4   7           109  -0.32    -0.194   -0.068   0.5855   0.6634   0.7654   0.0635   -1      1
new_q1_4   Diff (1-2)       0.4129   0.7306   1.0482   0.5748   0.6457   0.7366   0.1605
new_q1_5   3           19   0.715    0.8507   0.9865   0.2128   0.2817   0.4165   0.0646   0.1071  1
new_q1_5   7           109  -0.002   0.1349   0.2717   0.636    0.7206   0.8314   0.069    -1      1
new_q1_5   Diff (1-2)       0.3835   0.7159   1.0483   0.6015   0.6756   0.7707   0.168
new_q1_6   3           19   -0.335   -0.047   0.2412   0.4516   0.5976   0.8838   0.1371   -1      1
new_q1_6   7           109  -0.422   -0.301   -0.18    0.5624   0.6373   0.7352   0.061    -1      1
new_q1_6   Diff (1-2)       -0.057   0.2543   0.5651   0.5624   0.6317   0.7207   0.1571
new_q1_7   3           19   -0.552   -0.239   0.0738   0.4907   0.6494   0.9603   0.149    -1      1
new_q1_7   7           109  -0.385   -0.27    -0.156   0.5324   0.6033   0.696    0.0578   -1      1
new_q1_7   Diff (1-2)       -0.269   0.0312   0.3314   0.5432   0.6101   0.696    0.1517
new_q1_8   3           19   -0.455   -0.117   0.2205   0.5298   0.7012   1.0369   0.1609   -1      1
new_q1_8   7           109  0.1775   0.3044   0.4312   0.5895   0.6679   0.7706   0.064    -1      1
new_q1_8   Diff (1-2)       -0.753   -0.422   -0.091   0.599    0.6728   0.7675   0.1673
new_q1_9   3           19   -0.669   -0.402   -0.136   0.4178   0.5529   0.8176   0.1268   -1      0.6296
new_q1_9   7           109  0.1616   0.2941   0.4266   0.6158   0.6977   0.805    0.0668   -1      1
new_q1_9   Diff (1-2)       -1.03    -0.696   -0.362   0.6045   0.6789   0.7745   0.1688
new_q1_10  3           19   -0.562   -0.246   0.0692   0.4946   0.6546   0.9681   0.1502   -1      1
new_q1_10  7           109  -0.056   0.0744   0.2049   0.607    0.6878   0.7935   0.0659   -1      1
new_q1_10  Diff (1-2)       -0.657   -0.321   0.0154   0.6082   0.6831   0.7793   0.1698
age        3           19   22.751   23.947   25.144   1.876    2.4827   3.6715   0.5696   21      30
age        7           109  23.212   23.706   24.201   2.2987   2.6045   3.0049   0.2495   19      30
age        Diff (1-2)       -1.032   0.2409   1.5139   2.3036   2.5874   2.9517   0.6433

t-Test
Variable   Method         Variances  DF     t Value  Pr > |t|
new_q1_1   Pooled         Equal      126    -0.01    0.9885
new_q1_1   Satterthwaite  Unequal    27.5   -0.02    0.9872
new_q1_2   Pooled         Equal      126    -1.88    0.0627
new_q1_2   Satterthwaite  Unequal    26.2   -2.01    0.0548
new_q1_3   Pooled         Equal      126    -0.29    0.7710
new_q1_3   Satterthwaite  Unequal    36.7   -0.40    0.6889
new_q1_4   Pooled         Equal      126    4.55     <.0001
new_q1_4   Satterthwaite  Unequal    29     5.35     <.0001
new_q1_5   Pooled         Equal      126    4.26     <.0001
new_q1_5   Satterthwaite  Unequal    67.8   7.57     <.0001
new_q1_6   Pooled         Equal      126    1.62     0.1080
new_q1_6   Satterthwaite  Unequal    25.7   1.69     0.1023
new_q1_7   Pooled         Equal      126    0.21     0.8372
new_q1_7   Satterthwaite  Unequal    23.7   0.20     0.8467
new_q1_8   Pooled         Equal      126    -2.52    0.0129
new_q1_8   Satterthwaite  Unequal    24     -2.44    0.0226
new_q1_9   Pooled         Equal      126    -4.12    <.0001
new_q1_9   Satterthwaite  Unequal    29     -4.86    <.0001
new_q1_10  Pooled         Equal      126    -1.89    0.0613
new_q1_10  Satterthwaite  Unequal    25.4   -1.96    0.0616
age        Pooled         Equal      126    0.37     0.7086
age        Satterthwaite  Unequal    25.4   0.39     0.7016

Caption: XXX properly describing the cluster XXX negatively correlated to the cluster description
Equality of variances
Variable   Method    Num DF  Den DF  F Value  Pr > F
new_q1_1   Folded F  108     18      1.38     0.4370
new_q1_2   Folded F  108     18      1.21     0.6655
new_q1_3   Folded F  108     18      2.59     0.0236
new_q1_4   Folded F  108     18      1.59     0.2630
new_q1_5   Folded F  108     18      6.54     <.0001
new_q1_6   Folded F  108     18      1.14     0.7923
new_q1_7   Folded F  18      108     1.16     0.6171
new_q1_8   Folded F  18      108     1.10     0.7208
new_q1_9   Folded F  108     18      1.59     0.2587
new_q1_10  Folded F  108     18      1.10     0.8544
age        Folded F  108     18      1.10     0.8607

This time three variables have a high absolute t-value and are significant for the description of the cluster: their p-values show that the probability of the difference being due to chance is close to 0 [Pr > |t| < .0001]. Two of them have a positive t-value (positive correlation with the cluster) and explain the composition of the cluster better than any other:
New_q1_5 (China is a fakes maker country) + 7.57
New_q1_4 (China has a totally different culture) + 5.35

On the contrary, only one variable describes the cluster negatively, with a strongly negative t-value and a p-value very close to 0 [Pr > |t| < .0001]. The people included in this cluster answered this question significantly more negatively than the whole population (cluster number 7):
New_q1_9 (China is an opportunity) - 4.86

In this cluster as well, the variable age is not useful for the description, since its t-statistic is low (0.39) and its p-value is 0.7016. The test performed for the fourth cluster is reported below. SAS syntax 20:

proc ttest data=Gioconda.merged_2; var new_q1_1-new_q1_10 age; class cluster; where cluster=4 or cluster=7; run;

o SAS output 10: SAS System t-test procedure


Statistics
                            Lower CL          Upper CL Lower CL          Upper CL
Variable   Cluster     N    Mean     Mean     Mean     Std Dev  Std Dev  Std Dev  Std Err  Min     Max
new_q1_1   4           23   -0.636   -0.406   -0.176   0.4113   0.5318   0.7526   0.1109   -1      1
new_q1_1   7           109  -0.234   -0.126   -0.017   0.5058   0.5732   0.6613   0.0549   -1      1
new_q1_1   Diff (1-2)       -0.538   -0.281   -0.024   0.5051   0.5664   0.6447   0.13
new_q1_2   4           23   -0.205   0.0206   0.246    0.403    0.5211   0.7375   0.1087   -1      1
new_q1_2   7           109  0.1458   0.2436   0.3415   0.4548   0.5153   0.5945   0.0494   -1      1
new_q1_2   Diff (1-2)       -0.457   -0.223   0.0114   0.4604   0.5163   0.5877   0.1185
new_q1_3   4           23   -1.007   -0.957   -0.907   0.0902   0.1166   0.165    0.0243   -1      -0.583
new_q1_3   7           109  -0.842   -0.742   -0.642   0.4663   0.5283   0.6095   0.0506   -1      1
new_q1_3   Diff (1-2)       -0.435   -0.215   0.0046   0.4315   0.4839   0.5508   0.111
new_q1_4   4           23   -0.398   -0.196   0.0071   0.3624   0.4686   0.6633   0.0977   -1      0.697
new_q1_4   7           109  -0.32    -0.194   -0.068   0.5855   0.6634   0.7654   0.0635   -1      1
new_q1_4   Diff (1-2)       -0.29    -0.002   0.2864   0.566    0.6347   0.7224   0.1456
new_q1_5   4           23   -0.138   0.1131   0.3639   0.4485   0.58     0.8209   0.1209   -1      1
new_q1_5   7           109  -0.002   0.1349   0.2717   0.636    0.7206   0.8314   0.069    -1      1
new_q1_5   Diff (1-2)       -0.339   -0.022   0.2955   0.6232   0.6988   0.7954   0.1603
new_q1_6   4           23   -0.279   -0.047   0.1858   0.4159   0.5378   0.7611   0.1121   -1      1
new_q1_6   7           109  -0.422   -0.301   -0.18    0.5624   0.6373   0.7352   0.061    -1      1
new_q1_6   Diff (1-2)       -0.028   0.2543   0.5365   0.5543   0.6215   0.7075   0.1426
new_q1_7   4           23   0.0865   0.2642   0.4419   0.3179   0.411    0.5817   0.0857   -0.63   1
new_q1_7   7           109  -0.385   -0.27    -0.156   0.5324   0.6033   0.696    0.0578   -1      1
new_q1_7   Diff (1-2)       0.2735   0.5346   0.7958   0.513    0.5753   0.6548   0.132
new_q1_8   4           23   0.0387   0.3243   0.6099   0.5107   0.6604   0.9347   0.1377   -1      1
new_q1_8   7           109  0.1775   0.3044   0.4312   0.5895   0.6679   0.7706   0.064    -1      1
new_q1_8   Diff (1-2)       -0.283   0.0199   0.3226   0.5945   0.6667   0.7589   0.153
new_q1_9   4           23   0.275    0.526    0.777    0.4489   0.5805   0.8215   0.121    -1      1
new_q1_9   7           109  0.1616   0.2941   0.4266   0.6158   0.6977   0.805    0.0668   -1      1
new_q1_9   Diff (1-2)       -0.076   0.2319   0.5403   0.6058   0.6793   0.7733   0.1559
new_q1_10  4           23   0.3683   0.5436   0.7189   0.3136   0.4055   0.5739   0.0845   -0.259  1
new_q1_10  7           109  -0.056   0.0744   0.2049   0.607    0.6878   0.7935   0.0659   -1      1
new_q1_10  Diff (1-2)       0.1747   0.4692   0.7637   0.5785   0.6487   0.7384   0.1488
age        4           23   22.222   23.478   24.735   2.2472   2.9056   4.1125   0.6059   19      30
age        7           109  23.212   23.706   24.201   2.2987   2.6045   3.0049   0.2495   19      30
age        Diff (1-2)       -1.435   -0.228   0.9784   2.3703   2.6579   3.0254   0.6099

t-Test
Variable   Method         Variances  DF     t Value  Pr > |t|
new_q1_1   Pooled         Equal      130    -2.16    0.0326
new_q1_1   Satterthwaite  Unequal    33.7   -2.27    0.0298
new_q1_2   Pooled         Equal      130    -1.88    0.0620
new_q1_2   Satterthwaite  Unequal    31.7   -1.87    0.0709
new_q1_3   Pooled         Equal      130    -1.94    0.0550
new_q1_3   Satterthwaite  Unequal    130    -3.83    0.0002
new_q1_4   Pooled         Equal      130    -0.01    0.9905
new_q1_4   Satterthwaite  Unequal    43     -0.01    0.9881
new_q1_5   Pooled         Equal      130    -0.14    0.8924
new_q1_5   Satterthwaite  Unequal    37.8   -0.16    0.8768
new_q1_6   Pooled         Equal      130    1.78     0.0769
new_q1_6   Satterthwaite  Unequal    36.3   1.99     0.0539
new_q1_7   Pooled         Equal      130    4.05     <.0001
new_q1_7   Satterthwaite  Unequal    44.7   5.17     <.0001
new_q1_8   Pooled         Equal      130    0.13     0.8965
new_q1_8   Satterthwaite  Unequal    32.2   0.13     0.8963
new_q1_9   Pooled         Equal      130    1.49     0.1392
new_q1_9   Satterthwaite  Unequal    36.8   1.68     0.1019
new_q1_10  Pooled         Equal      130    3.15     0.0020
new_q1_10  Satterthwaite  Unequal    52.9   4.38     <.0001
age        Pooled         Equal      130    -0.37    0.7089
age        Satterthwaite  Unequal    29.9   -0.35    0.7301

Caption: XXX properly describing the cluster XXX negatively correlated to the cluster description
Equality of variances
Variable   Method    Num DF  Den DF  F Value  Pr > F
new_q1_1   Folded F  108     22      1.16     0.7113
new_q1_2   Folded F  22      108     1.02     0.8882
new_q1_3   Folded F  108     22      20.54    <.0001
new_q1_4   Folded F  108     22      2.00     0.0628
new_q1_5   Folded F  108     22      1.54     0.2417
new_q1_6   Folded F  108     22      1.40     0.3629
new_q1_7   Folded F  108     22      2.15     0.0409
new_q1_8   Folded F  108     22      1.02     1.0000
new_q1_9   Folded F  108     22      1.44     0.3226
new_q1_10  Folded F  108     22      2.88     0.0060
age        Folded F  22      108     1.24     0.4550

Here three variables have a high absolute t-value and we can state that they are significant for the description of the cluster, since their p-values, which give the probability of the difference being due to chance, range from <.0001 to at most 0.0002. In this cluster we noticed from the Fisher test that the variable new_q1_8 has a p-value higher than 0.95, so the pooled method can be used for its t-test. More precisely, two variables have a positive t-value (positive correlation with the cluster) and are highly explanatory of the composition of the cluster:
New_q1_7 (China is interesting only for work and its economy) + 5.17
New_q1_10 (China will be a leading nation) + 4.38

On the contrary, just one variable is negatively correlated with the cluster and describes what its members definitely do not think about China: it has a negative t-value and a p-value very close to 0 [Pr > |t| = 0.0002]. This is because the people included in this cluster answered this question significantly more negatively than the whole population (cluster number 7):
New_q1_3 (China is not influential because of its distance) - 3.83

In this cluster as well, the variable age is not useful for the description, since its t-statistic is negative and low in absolute value (-0.35) and its p-value is 0.7301; this means that the mean age of the people in this cluster does not differ significantly from that of the whole population. The other variables, whether positively or negatively correlated with the cluster, are not significant, because their p-values are too high and their t statistics too low in absolute value. The analysis of the clusters now proceeds with the 5th one. The usual command produces the output. SAS syntax 21:

proc ttest data=Gioconda.merged_2; var new_q1_1-new_q1_10 age; class cluster; where cluster=5 or cluster=7; run;

o SAS output 11: SAS System t-test procedure


Statistics
                            Lower CL          Upper CL Lower CL          Upper CL
Variable   Cluster     N    Mean     Mean     Mean     Std Dev  Std Dev  Std Dev  Std Err  Min     Max
new_q1_1   5           17   -0.79    -0.515   -0.239   0.3992   0.536    0.8158   0.13     -1      0.7436
new_q1_1   7           109  -0.234   -0.126   -0.017   0.5058   0.5732   0.6613   0.0549   -1      1
new_q1_1   Diff (1-2)       -0.682   -0.389   -0.096   0.5057   0.5685   0.6493   0.1482
new_q1_2   5           17   -0.085   0.1877   0.4606   0.3953   0.5308   0.8079   0.1287   -1      1
new_q1_2   7           109  0.1458   0.2436   0.3415   0.4548   0.5153   0.5945   0.0494   -1      1
new_q1_2   Diff (1-2)       -0.323   -0.056   0.2111   0.4602   0.5173   0.5908   0.1349
new_q1_3   5           17   -0.012   0.2903   0.5924   0.4377   0.5876   0.8943   0.1425   -1      1
new_q1_3   7           109  -0.842   -0.742   -0.642   0.4663   0.5283   0.6095   0.0506   -1      1
new_q1_3   Diff (1-2)       0.7555   1.0323   1.3091   0.4771   0.5363   0.6125   0.1399
new_q1_4   5           17   -0.107   0.2138   0.5342   0.4642   0.6232   0.9485   0.1512   -1      1
new_q1_4   7           109  -0.32    -0.194   -0.068   0.5855   0.6634   0.7654   0.0635   -1      1
new_q1_4   Diff (1-2)       0.0678   0.4076   0.7474   0.5856   0.6584   0.7519   0.1717
new_q1_5   5           17   -0.41    0.0149   0.44     0.6158   0.8268   1.2583   0.2005   -1      1
new_q1_5   7           109  -0.002   0.1349   0.2717   0.636    0.7206   0.8314   0.069    -1      1
new_q1_5   Diff (1-2)       -0.499   -0.12    0.2595   0.654    0.7352   0.8396   0.1917
new_q1_6   5           17   -0.872   -0.57    -0.267   0.4382   0.5884   0.8955   0.1427   -1      1
new_q1_6   7           109  -0.422   -0.301   -0.18    0.5624   0.6373   0.7352   0.061    -1      1
new_q1_6   Diff (1-2)       -0.594   -0.269   0.0572   0.5614   0.6312   0.7208   0.1646
new_q1_7   5           17   -0.636   -0.368   -0.1     0.3885   0.5216   0.7939   0.1265   -1      0.5455
new_q1_7   7           109  -0.385   -0.27    -0.156   0.5324   0.6033   0.696    0.0578   -1      1
new_q1_7   Diff (1-2)       -0.404   -0.098   0.2085   0.5278   0.5934   0.6777   0.1547
new_q1_8   5           17   0.1329   0.3819   0.631    0.3607   0.4844   0.7372   0.1175   -0.333  1
new_q1_8   7           109  0.1775   0.3044   0.4312   0.5895   0.6679   0.7706   0.064    -1      1
new_q1_8   Diff (1-2)       -0.256   0.0776   0.4116   0.5757   0.6472   0.7391   0.1688
new_q1_9   5           17   -0.47    -0.117   0.235    0.5106   0.6856   1.0434   0.1663   -1      1
new_q1_9   7           109  0.1616   0.2941   0.4266   0.6158   0.6977   0.805    0.0668   -1      1
new_q1_9   Diff (1-2)       -0.771   -0.412   -0.052   0.6193   0.6962   0.7951   0.1815
new_q1_10  5           17   -0.384   -0.012   0.3605   0.5389   0.7236   1.1013   0.1755   -1      1
new_q1_10  7           109  -0.056   0.0744   0.2049   0.607    0.6878   0.7935   0.0659   -1      1
new_q1_10  Diff (1-2)       -0.443   -0.086   0.2715   0.616    0.6925   0.7909   0.1806
age        5           17   22.681   24.059   25.437   1.9962   2.6803   4.0792   0.6501   20      30
age        7           109  23.212   23.706   24.201   2.2987   2.6045   3.0049   0.2495   19      30
age        Diff (1-2)       -0.997   0.3524   1.7018   2.3256   2.6144   2.9858   0.6817

t-Test
Variable   Method         Variances  DF     t Value  Pr > |t|
new_q1_1   Pooled         Equal      124    -2.62    0.0098
new_q1_1   Satterthwaite  Unequal    22.1   -2.76    0.0115
new_q1_2   Pooled         Equal      124    -0.41    0.6791
new_q1_2   Satterthwaite  Unequal    21     -0.41    0.6891
new_q1_3   Pooled         Equal      124    7.38     <.0001
new_q1_3   Satterthwaite  Unequal    20.2   6.83     <.0001
new_q1_4   Pooled         Equal      124    2.37     0.0191
new_q1_4   Satterthwaite  Unequal    22.1   2.49     0.0210
new_q1_5   Pooled         Equal      124    -0.63    0.5326
new_q1_5   Satterthwaite  Unequal    20     -0.57    0.5779
new_q1_6   Pooled         Equal      124    -1.63    0.1053
new_q1_6   Satterthwaite  Unequal    22.3   -1.73    0.0974
new_q1_7   Pooled         Equal      124    -0.63    0.5287
new_q1_7   Satterthwaite  Unequal    23.2   -0.70    0.4891
new_q1_8   Pooled         Equal      124    0.46     0.6466
new_q1_8   Satterthwaite  Unequal    26.6   0.58     0.5669
new_q1_9   Pooled         Equal      124    -2.27    0.0251
new_q1_9   Satterthwaite  Unequal    21.5   -2.30    0.0318
new_q1_10  Pooled         Equal      124    -0.48    0.6351
new_q1_10  Satterthwaite  Unequal    20.8   -0.46    0.6515
age        Pooled         Equal      124    0.52     0.6061
age        Satterthwaite  Unequal    21     0.51     0.6180

Caption: XXX properly describing the cluster XXX negatively correlated to the cluster description
Equality of variances
Variable   Method    Num DF  Den DF  F Value  Pr > F
new_q1_1   Folded F  108     16      1.14     0.8015
new_q1_2   Folded F  16      108     1.06     0.8021
new_q1_3   Folded F  16      108     1.24     0.5044
new_q1_4   Folded F  108     16      1.13     0.8189
new_q1_5   Folded F  16      108     1.32     0.4009
new_q1_6   Folded F  108     16      1.17     0.7516
new_q1_7   Folded F  108     16      1.34     0.5208
new_q1_8   Folded F  108     16      1.90     0.1433
new_q1_9   Folded F  108     16      1.04     1.0000
new_q1_10  Folded F  16      108     1.11     0.7160
age        Folded F  16      108     1.06     0.8060

This is a curious cluster, because it has just one variable with a high positive t-value and another with a negative t-value (-2.76) slightly beyond the 2.57 bound of the acceptance region; only these two variables are significant for the description of the cluster. For the positive one, the p-value shows that the probability of the difference being due to chance is close to 0 [Pr > |t| < .0001]. In this cluster too we detected from the Fisher test that one variable, new_q1_9, has a p-value higher than 0.95, so the pooled method can be used for it, as for new_q1_8 in cluster 4.
New_q1_3 (China is not influential because of its distance) + 6.83

The negatively correlated variable has a fairly low p-value, Pr > |t| = 0.0115:
New_q1_1 (China is an undeveloped country) - 2.76

Beyond these, no variable can be taken as a benchmark for the cluster, since their t-statistics and p-values can hardly be considered significant: their low t-statistics and high p-values mean that they are superfluous for explaining the cluster population. In this cluster as well, the variable age is not useful for the description, since its t-statistic is low (0.51) and its p-value is 0.6180. Finally, the last cluster is examined. SAS syntax 22:

proc ttest data=Gioconda.merged_2; var new_q1_1-new_q1_10 age; class cluster; where cluster=6 or cluster=7; run;

o SAS output 12: SAS System t-test procedure


Statistics
                            Lower CL          Upper CL Lower CL          Upper CL
Variable   Cluster     N    Mean     Mean     Mean     Std Dev  Std Dev  Std Dev  Std Err  Min     Max
new_q1_1   6           11   -0.208   0.169    0.5462   0.3923   0.5615   0.9854   0.1693   -1      1
new_q1_1   7           109  -0.234   -0.126   -0.017   0.5058   0.5732   0.6613   0.0549   -1      1
new_q1_1   Diff (1-2)       -0.064   0.2945   0.653    0.5076   0.5722   0.6558   0.181
new_q1_2   6           11   0.6397   0.7856   0.9315   0.1517   0.2172   0.3811   0.0655   0.5     1
new_q1_2   7           109  0.1458   0.2436   0.3415   0.4548   0.5153   0.5945   0.0494   -1      1
new_q1_2   Diff (1-2)       0.2306   0.5419   0.8533   0.4409   0.497    0.5697   0.1572
new_q1_3   6           11   -1.037   -0.97    -0.902   0.0702   0.1005   0.1764   0.0303   -1      -0.667
new_q1_3   7           109  -0.842   -0.742   -0.642   0.4663   0.5283   0.6095   0.0506   -1      1
new_q1_3   Diff (1-2)       -0.545   -0.228   0.0895   0.4491   0.5063   0.5802   0.1602
new_q1_4   6           11   -0.815   -0.577   -0.339   0.2474   0.354    0.6213   0.1067   -1      -0.057
new_q1_4   7           109  -0.32    -0.194   -0.068   0.5855   0.6634   0.7654   0.0635   -1      1
new_q1_4   Diff (1-2)       -0.786   -0.383   0.0196   0.5704   0.643    0.737    0.2034
new_q1_5   6           11   -0.344   0.1476   0.6392   0.5113   0.7318   1.2843   0.2207   -0.714  1
new_q1_5   7           109  -0.002   0.1349   0.2717   0.636    0.7206   0.8314   0.069    -1      1
new_q1_5   Diff (1-2)       -0.439   0.0127   0.4648   0.6401   0.7216   0.827    0.2283
new_q1_6   6           11   0.1903   0.4861   0.7819   0.3077   0.4403   0.7728   0.1328   0       1
new_q1_6   7           109  -0.422   -0.301   -0.18    0.5624   0.6373   0.7352   0.061    -1      1
new_q1_6   Diff (1-2)       0.3969   0.7872   1.1775   0.5526   0.623    0.714    0.1971
new_q1_7   6           11   -0.14    0.2353   0.6102   0.3899   0.558    0.9793   0.1682   -1      1
new_q1_7   7           109  -0.385   -0.27    -0.156   0.5324   0.6033   0.696    0.0578   -1      1
new_q1_7   Diff (1-2)       0.1301   0.5057   0.8814   0.5319   0.5996   0.6872   0.1897
new_q1_8   6           11   -0.037   0.3553   0.7478   0.4082   0.5842   1.0253   0.1761   -0.434  1
new_q1_8   7           109  0.1775   0.3044   0.4312   0.5895   0.6679   0.7706   0.064    -1      1
new_q1_8   Diff (1-2)       -0.363   0.051    0.4652   0.5866   0.6613   0.7579   0.2092
new_q1_9   6           11   -0.337   0.021    0.3789   0.3722   0.5326   0.9347   0.1606   -1      0.7143
new_q1_9   7           109  0.1616   0.2941   0.4266   0.6158   0.6977   0.805    0.0668   -1      1
new_q1_9   Diff (1-2)       -0.702   -0.273   0.1563   0.6079   0.6853   0.7854   0.2168
new_q1_10  6           11   -0.882   -0.612   -0.342   0.2807   0.4018   0.7051   0.1211   -1      -0.07
new_q1_10  7           109  -0.056   0.0744   0.2049   0.607    0.6878   0.7935   0.0659   -1      1
new_q1_10  Diff (1-2)       -1.105   -0.686   -0.268   0.5928   0.6683   0.766    0.2114
age        6           11   20.978   22.364   23.749   1.4412   2.0627   3.6198   0.6219   19      25
age        7           109  23.212   23.706   24.201   2.2987   2.6045   3.0049   0.2495   19      30
age        Diff (1-2)       -2.948   -1.343   0.2629   2.2736   2.563    2.9376   0.8108

T-Tests

Variable     Method          Variances   DF     t Value   Pr > |t|
new_q1_1     Pooled          Equal       118    1.63      0.1064
new_q1_1     Satterthwaite   Unequal     12.2   1.65      0.1234
new_q1_2     Pooled          Equal       118    3.45      0.0008
new_q1_2     Satterthwaite   Unequal     23.9   6.61      <.0001
new_q1_3     Pooled          Equal       118    -1.42     0.1577
new_q1_3     Satterthwaite   Unequal     83.4   -3.86     0.0002
new_q1_4     Pooled          Equal       118    -1.88     0.0620
new_q1_4     Satterthwaite   Unequal     18.1   -3.09     0.0063
new_q1_5     Pooled          Equal       118    0.06      0.9557
new_q1_5     Satterthwaite   Unequal     12     0.05      0.9571
new_q1_6     Pooled          Equal       118    3.99      0.0001
new_q1_6     Satterthwaite   Unequal     14.6   5.39      <.0001
new_q1_7     Pooled          Equal       118    2.67      0.0087
new_q1_7     Satterthwaite   Unequal     12.5   2.84      0.0143
new_q1_8     Pooled          Equal       118    0.24      0.8080
new_q1_8     Satterthwaite   Unequal     12.8   0.27      0.7900
new_q1_9     Pooled          Equal       118    -1.26     0.2104
new_q1_9     Satterthwaite   Unequal     13.7   -1.57     0.1392
new_q1_10    Pooled          Equal       118    -3.25     0.0015
new_q1_10    Satterthwaite   Unequal     16.7   -4.98     0.0001
age          Pooled          Equal       118    -1.66     0.1004
age          Satterthwaite   Unequal     13.4   -2.00     0.0657

Caption: XXX properly describing the cluster XXX negatively correlated to the cluster description

Equality of variances

Variable     Method     Num DF   Den DF   F Value   Pr > F
new_q1_1     Folded F   108      10       1.04      1.0000
new_q1_2     Folded F   108      10       5.63      0.0050
new_q1_3     Folded F   108      10       27.63     <.0001
new_q1_4     Folded F   108      10       3.51      0.0332
new_q1_5     Folded F   10       108      1.03      0.8447
new_q1_6     Folded F   108      10       2.09      0.1972
new_q1_7     Folded F   108      10       1.17      0.8464
new_q1_8     Folded F   108      10       1.31      0.6762
new_q1_9     Folded F   108      10       1.72      0.3500
new_q1_10    Folded F   108      10       2.93      0.0646
age          Folded F   108      10       1.59      0.4245

In this cluster four variables have a fairly high t-value and two more fall slightly outside the region of acceptance; we connected them with the explanation of the cluster. Their p-values show that the probability of an accidental distribution is close to 0 [Pr > |t| lies between 0.0143 and < 0.0001]. We also point out another variable, new_q1_1, which we can analyse with the pooled method because its p-value for the Fisher (folded F) test is higher than 0.95. In this case two variables with a highly significant test response (positive correlation with the cluster) strongly explain the cluster composition:

New_q1_2 (China is developing country)   + 6.61
New_q1_6 (China is interesting for its culture only)   + 5.39

Another variable also explains the cluster positively, with a lower value:

New_q1_7 (China has a totally different culture)   + 2.84

On the contrary, two variables describe the cluster negatively: their t-values are negative, even though their p-values are very close to 0 [Pr > |t| = 0.0001]. This is because the people included in this cluster answered these questions significantly more negatively than the whole population (cluster number 7):

New_q1_10 (China will be a leading nation)   - 4.98
New_q1_4 (China has a totally different culture)   - 3.86

Interestingly, in this cluster the variable age reaches its highest level of importance for describing the cluster population: its t-statistic (-2.00) is not low enough to be ignored completely, and its p-value is 0.0657. A negative but not extreme t-statistic deserves consideration because, even if age is negatively correlated with the cluster population, in this segmentation it could explain something about the scores given by the individuals.
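The pooled and Satterthwaite figures discussed above can be cross-checked from the group means, standard deviations and sizes printed in SAS output 12. The following is a minimal sketch in Python, not the code used for the report; the function name `two_sample_t` and the dictionary keys are hypothetical, and the inputs are the rounded values for new_q1_2.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled and Satterthwaite t statistics from group summary statistics."""
    diff = mean1 - mean2
    v1, v2 = sd1 ** 2, sd2 ** 2

    # Pooled method: one common variance, df = n1 + n2 - 2.
    df_pooled = n1 + n2 - 2
    var_pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df_pooled
    t_pooled = diff / math.sqrt(var_pooled * (1 / n1 + 1 / n2))

    # Satterthwaite method: separate variances and an approximate df.
    a, b = v1 / n1, v2 / n2
    t_satterthwaite = diff / math.sqrt(a + b)
    df_satterthwaite = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

    # Folded F: the larger sample variance divided by the smaller one.
    folded_f = max(v1, v2) / min(v1, v2)

    return {
        "t_pooled": t_pooled,
        "df_pooled": df_pooled,
        "t_satterthwaite": t_satterthwaite,
        "df_satterthwaite": df_satterthwaite,
        "folded_f": folded_f,
    }

# new_q1_2: cluster 6 (n = 11) against cluster 7 (n = 109), from SAS output 12.
result = two_sample_t(0.7856, 0.2172, 11, 0.2436, 0.5153, 109)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Up to rounding of the inputs, this agrees with the SAS values for new_q1_2 (pooled t = 3.45, Satterthwaite t = 6.61 with about 23.9 degrees of freedom, folded F = 5.63). The folded F value is what decides which of the two t-values should be read.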

4.j. Frequency procedure and Chi-square tests for qualitative variables

For the qualitative variables, both discrete and dichotomous, we want to examine whether an association with the clusters exists and how strong it is; therefore we computed chi-square tests (χ²). This statistical tool measures the level of association between the qualitative variables and our clusters (in our case we use the following variables: geographical provenience, study of Chinese and, in case the interviewed people would like to go to China to work, the city in which they would like to stay). The principal objective of the chi-square test is to check whether the observed frequencies differ significantly from the theoretical ones, that is, from the frequencies expected if variable and cluster were independent. The computed statistic is compared with the critical value given by the statistical tables for the chosen level of significance and (r - 1)(c - 1) degrees of freedom, where r and c are the numbers of rows and columns of the contingency table. If the computed chi-square falls outside the region of acceptance, we reject the null hypothesis of independence, because it is unlikely that the association between the variable and the cluster is accidental; the variables examined are then considered dependent. When the value of χ² is equal to 0 or lies inside the region of acceptance, the observed values are close to the theoretical ones and we cannot reject the hypothesis that the two variables are independent.

The frequency procedure (from now on: PROC FREQ) produces one-way to n-way frequency and contingency tables (cross tabulation) and reports frequency counts. PROC FREQ can also compute chi-square tests for one-way to n-way tables, as well as tests and measures of association and of agreement for two-way to n-way cross-tabulation tables. To estimate the strength of an association, PROC FREQ computes measures of association that tend to be close to zero when there is no association and close to the maximum (or minimum) value when there is perfect association.
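The chi-square statistic described in this section can be written compactly as follows. The notation is introduced here for illustration and does not appear in the original report: O_ij is the observed count in row i and column j, R_i and C_j are the row and column totals, n is the table total, and r and c are the numbers of rows and columns.

```latex
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1)
```

For the cluster by region tables (2 rows, 3 columns) this gives 2 degrees of freedom, and for the cluster by chin_stud tables (2 rows, 2 columns) it gives 1, which matches the DF column of the SAS output.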
In PROC FREQ, the value of the variable in the WEIGHT statement represents the frequency of occurrence for each observation. We now use SAS to perform a chi-square test for four qualitative variables that we selected among all those collected from the survey; our purpose is to investigate whether they influence the cluster populations and how much. To perform such a test we use PROC FREQ. The following are the command lines for the FREQ procedure regarding the influence of the provenience on the different clusters:

CLUSTER 1

SAS Syntax 23:

proc freq data=Gioconda.merged_2; tables cluster*region/all expected; where cluster=1 or cluster=7; run;

Here is the SAS html output, which shows the tables of the FREQ procedure: first the expected and the real composition of the cluster, then, below, the statistical tables showing how the discrete variable performs inside the cluster:

o SAS output 13: SAS System The FREQ procedure

Table of CLUSTER by region (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   Region   Frequency   Expected   Percent   Row Pct   Col Pct
1         C        8           5.1953     6.25      42.11     22.86
1         N        8           6.9766     6.25      42.11     17.02
1         S        3           6.8281     2.34      15.79     6.52
7         C        27          29.805     21.09     24.77     77.14
7         N        39          40.023     30.47     35.78     82.98
7         S        43          39.172     33.59     39.45     93.48

Row totals: cluster 1 = 19 (14.84%), cluster 7 = 109 (85.16%)
Column totals: C = 35 (27.34%), N = 47 (36.72%), S = 46 (35.94%)
Total: 128 (100.00%)

Statistics for the table of CLUSTER by region

Statistic                     DF   Value     Prob
Chi-square                    2    4.4747    0.1067
Likelihood ratio chi-square   2    4.8240    0.0896
Mantel-Haenszel chi-square    1    4.3136    0.0378
Phi coefficient                    0.1870
Contingency coefficient            0.1838
Cramer's V                         0.1870

Statistic                           Value     ASE
Gamma                               0.4144    0.1677
Kendall's Tau-b                     0.1750    0.0764
Stuart's Tau-c                      0.1431    0.0662
Somers' D C|R                       0.2830    0.1211
Somers' D R|C                       0.1082    0.0500
Pearson correlation                 0.1843    0.0820
Spearman correlation                0.1854    0.0810
Lambda asymmetric C|R               0.0494    0.1090
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0400    0.0887
Uncertainty coefficient C|R         0.0173    0.0148
Uncertainty coefficient R|C         0.0449    0.0377
Uncertainty coefficient symmetric   0.0250    0.0213

Sample size = 128

Summary statistics for CLUSTER by region

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    4.3136    0.0378
2           Row mean scores differ   1    4.3136    0.0378
3           General association      2    4.4397    0.1086

Total sample size = 128

From this first table we notice that the expected composition of the first cluster was five people from the centre of Italy and seven each from the north and the south. In reality this group of nineteen people contains eight members from the centre of Italy (42.11% of the cluster), eight from the north (again 42.11%) and only three from the south of Italy (15.79% of the group). The largest discrepancy concerns the south: seven people were expected, but only three were observed. It is also worth underlining that this cluster constitutes 14.84% of the sample population. The χ² is equal to 4.4747 which, with two degrees of freedom, we consider quite high; however, the probability of non-correlation among the variables reaches 10.67%, so the test is not significant at the conventional 5% level. Our conclusion is therefore that the variable region of provenience and cluster 1 appear to be dependent, even if the evidence for this dependence is not strong.
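The expected counts, the χ² value and Cramer's V reported for cluster 1 can be reproduced from the six observed counts alone. The following is a minimal sketch in Python under that assumption; the function name `chi_square` is hypothetical and this is not the procedure SAS runs internally.

```python
import math

def chi_square(observed):
    """Pearson chi-square, df, expected counts and Cramer's V for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    rows, cols = len(row_totals), len(col_totals)
    df = (rows - 1) * (cols - 1)
    cramers_v = math.sqrt(stat / (n * min(rows - 1, cols - 1)))
    return stat, df, expected, cramers_v

# Rows: cluster 1 and cluster 7 (whole sample); columns: C, N, S (SAS output 13).
observed = [[8, 8, 3], [27, 39, 43]]
stat, df, expected, cramers_v = chi_square(observed)

# With exactly 2 degrees of freedom the chi-square upper tail is exp(-x / 2).
p_value = math.exp(-stat / 2)

print(f"chi-square = {stat:.4f}, df = {df}, p = {p_value:.4f}, V = {cramers_v:.4f}")
print("expected, cluster 1:", [round(e, 4) for e in expected[0]])
```

The closed-form p-value used here only holds for 2 degrees of freedom; a table with a different shape would need a general chi-square distribution function.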

CLUSTER 2

SAS Syntax 24:

proc freq data=Gioconda.merged_2; tables cluster*region/all expected; where cluster=2 or cluster=7; run;

o SAS output 14: The FREQ procedure

Table of CLUSTER by region (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   Region   Frequency   Expected   Percent   Row Pct   Col Pct
2         C        4           4.8062     3.10      20.00     12.90
2         N        11          7.7519     8.53      55.00     22.00
2         S        5           7.4419     3.88      25.00     10.42
7         C        27          26.194     20.93     24.77     87.10
7         N        39          42.248     30.23     35.78     78.00
7         S        43          40.558     33.33     39.45     89.58

Row totals: cluster 2 = 20 (15.50%), cluster 7 = 109 (84.50%)
Column totals: C = 31 (24.03%), N = 50 (38.76%), S = 48 (37.21%)
Total: 129 (100.00%)

Statistics for the table of CLUSTER by region

Statistic                     DF   Value     Prob
Chi-square                    2    2.7189    0.2568
Likelihood ratio chi-square   2    2.6783    0.2621
Mantel-Haenszel chi-square    1    0.2640    0.6074
Phi coefficient                    0.1452
Contingency coefficient            0.1437
Cramer's V                         0.1452

Statistic                           Value     ASE
Gamma                               0.1218    0.1787
Kendall's Tau-b                     0.0505    0.0750
Stuart's Tau-c                      0.0418    0.0623
Somers' D C|R                       0.0798    0.1181
Somers' D R|C                       0.0320    0.0477
Pearson correlation                 0.0454    0.0788
Spearman correlation                0.0534    0.0793
Lambda asymmetric C|R               0.0506    0.1117
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0404    0.0896
Uncertainty coefficient C|R         0.0096    0.0118
Uncertainty coefficient R|C         0.0241    0.0292
Uncertainty coefficient symmetric   0.0138    0.0168

Sample size = 129

Summary statistics for CLUSTER by region

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    0.2640    0.6074
2           Row mean scores differ   1    0.2640    0.6074
3           General association      2    2.6979    0.2595

Total sample size = 129

From this table we notice that the expected composition of this cluster was five people from the centre, seven (almost eight) from the north and seven from the south of Italy. This time, inside this group of twenty people, the expectation for the individuals coming from the centre of Italy was almost correct: they are four, and their weight inside the cluster is 20%. Eleven members are from the north (55%) and five from the south of Italy (25% of the group). In this cluster the χ² is 2.72; since this value is low, the variable region of provenience might be independent from cluster two. Moreover, the probability of non-correlation among the variables is 25.68%, well above the conventional 5% level, so independence cannot be rejected. In conclusion, the cluster may still be influenced by the variable, but only slightly and less than the previous one.

CLUSTER 3

SAS Syntax 25:

proc freq data=Gioconda.merged_2; tables cluster*region/all expected; where cluster=3 or cluster=7; run;

o SAS output 15: The FREQ procedure


Table of CLUSTER by region (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   Region   Frequency   Expected   Percent   Row Pct   Col Pct
3         C        3           4.4531     2.34      15.79     10.00
3         N        6           6.6797     4.69      31.58     13.33
3         S        10          7.8672     7.81      52.63     18.87
7         C        27          25.547     21.09     24.77     90.00
7         N        39          38.32      30.47     35.78     86.67
7         S        43          45.133     33.59     39.45     81.13

Row totals: cluster 3 = 19 (14.84%), cluster 7 = 109 (85.16%)
Column totals: C = 30 (23.44%), N = 45 (35.16%), S = 53 (41.41%)
Total: 128 (100.00%)

Statistics for the table of CLUSTER by region

Statistic                     DF   Value     Prob
Chi-square                    2    1.3170    0.5176
Likelihood ratio chi-square   2    1.3355    0.5129
Mantel-Haenszel chi-square    1    1.2798    0.2579
Phi coefficient                    0.1014
Contingency coefficient            0.1009
Cramer's V                         0.1014

Statistic                           Value     ASE
Gamma                               -0.2398   0.2032
Kendall's Tau-b                     -0.0958   0.0808
Stuart's Tau-c                      -0.0776   0.0666
Somers' D C|R                       -0.1535   0.1290
Somers' D R|C                       -0.0597   0.0512
Pearson correlation                 -0.1004   0.0844
Spearman correlation                -0.1012   0.0855
Lambda asymmetric C|R               0.0000    0.0000
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0000    0.0000
Uncertainty coefficient C|R         0.0049    0.0083
Uncertainty coefficient R|C         0.0124    0.0211
Uncertainty coefficient symmetric   0.0070    0.0119

Sample size = 128

Summary statistics for CLUSTER by region

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    1.2798    0.2579
2           Row mean scores differ   1    1.2798    0.2579
3           General association      2    1.3068    0.5203

Total sample size = 128

This time the expected composition of this cluster was four people from the centre, six or seven from the north and eight from the south of Italy. This group is composed of nineteen people and shows only small differences between expectations and reality; the largest concerns the people coming from the south of Italy, who are ten instead of the expected eight. This group presents a χ² value of 1.32, the closest to zero we have found so far (but the next cluster will reach an even lower value). This leads us to think that the variable region of provenience might be independent from cluster 3. To strengthen our hypothesis, the probability of non-correlation is 51.76%; since this is quite high, we classify this cluster as not particularly influenced by the region of provenience: even if this variable has a weak weight on the cluster, it is not particularly relevant.

CLUSTER 4

SAS Syntax 26:

proc freq data=Gioconda.merged_2; tables cluster*region/all expected; where cluster=4 or cluster=7; run;

o SAS output 16: The FREQ procedure

Table of CLUSTER by region (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   Region   Frequency   Expected   Percent   Row Pct   Col Pct
4         C        6           5.75       4.55      26.09     18.18
4         N        7           8.0152     5.30      30.43     15.22
4         S        10          9.2348     7.58      43.48     18.87
7         C        27          27.25      20.45     24.77     81.82
7         N        39          37.985     29.55     35.78     84.78
7         S        43          43.765     32.58     39.45     81.13

Row totals: cluster 4 = 23 (17.42%), cluster 7 = 109 (82.58%)
Column totals: C = 33 (25.00%), N = 46 (34.85%), S = 53 (40.15%)
Total: 132 (100.00%)

Statistics for the table of CLUSTER by region

Statistic                     DF   Value     Prob
Chi-square                    2    0.2456    0.8844
Likelihood ratio chi-square   2    0.2495    0.8827
Mantel-Haenszel chi-square    1    0.0221    0.8819
Phi coefficient                    0.0431
Contingency coefficient            0.0431
Cramer's V                         0.0431

Statistic                           Value     ASE
Gamma                               -0.0341   0.1942
Kendall's Tau-b                     -0.0148   0.0841
Stuart's Tau-c                      -0.0129   0.0731
Somers' D C|R                       -0.0223   0.1269
Somers' D R|C                       -0.0098   0.0558
Pearson correlation                 -0.0130   0.0890
Spearman correlation                -0.0157   0.0890
Lambda asymmetric C|R               0.0000    0.0000
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0000    0.0000
Uncertainty coefficient C|R         0.0009    0.0035
Uncertainty coefficient R|C         0.0020    0.0081
Uncertainty coefficient symmetric   0.0012    0.0049

Sample size = 132

Summary statistics for CLUSTER by region

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    0.0221    0.8819
2           Row mean scores differ   1    0.0221    0.8819
3           General association      2    0.2438    0.8852

Total sample size = 132

For the fourth cluster, the expectations were five or six people from the centre against six observed, eight from the north against seven observed, and nine from the south of Italy against ten observed. This group is composed of twenty-three people and shows only small differences between expectations and reality. The χ² value of this group is 0.2456, the closest to zero we have found; taking also into account the probability of non-correlation of 88.44%, we find no evidence that this cluster is influenced by the region of provenience: this variable has no detectable weight on the cluster.
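The "region of acceptance" used for clusters 1 to 4 can be made explicit by computing the critical value for two degrees of freedom. The following is a minimal sketch in Python; the 5% significance level is an assumption made for illustration, since the report does not state the level it used, and the variable names are hypothetical.

```python
import math

ALPHA = 0.05  # assumed significance level; the report does not state one

# With 2 degrees of freedom the chi-square upper tail is exp(-x / 2),
# so the critical value solves exp(-x / 2) = ALPHA.
critical_value = -2 * math.log(ALPHA)

# Pearson chi-square values of cluster by region, from SAS outputs 13 to 16.
chi_squares = {1: 4.4747, 2: 2.7189, 3: 1.3170, 4: 0.2456}

decisions = {}
for cluster, stat in chi_squares.items():
    p_value = math.exp(-stat / 2)
    decisions[cluster] = "reject" if stat > critical_value else "do not reject"
    print(f"cluster {cluster}: chi2 = {stat:.4f}, p = {p_value:.4f}, "
          f"{decisions[cluster]} independence")

print(f"critical value at alpha = {ALPHA}: {critical_value:.3f}")
```

Under this assumption the critical value is about 5.99 and all four statistics fall below it, which is consistent with the probabilities printed by SAS (0.1067, 0.2568, 0.5176 and 0.8844).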

CLUSTER 5

SAS Syntax 27:

proc freq data=Gioconda.merged_2; tables cluster*region/all expected; where cluster=5 or cluster=7; run;

o SAS output 17: The FREQ Procedure


Table of CLUSTER by region (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   Region   Frequency   Expected   Percent   Row Pct   Col Pct
5         C        3           4.0476     2.38      17.65     10.00
5         N        5           5.9365     3.97      29.41     11.36
5         S        9           7.0159     7.14      52.94     17.31
7         C        27          25.952     21.43     24.77     90.00
7         N        39          38.063     30.95     35.78     88.64
7         S        43          44.984     34.13     39.45     82.69

Row totals: cluster 5 = 17 (13.49%), cluster 7 = 109 (86.51%)
Column totals: C = 30 (23.81%), N = 44 (34.92%), S = 52 (41.27%)
Total: 126 (100.00%)

Statistics for the table of CLUSTER by region

Statistic                     DF   Value     Prob
Chi-square                    2    1.1329    0.5675
Likelihood ratio chi-square   2    1.1224    0.5705
Mantel-Haenszel chi-square    1    0.9996    0.3174
Phi coefficient                    0.0948
Contingency coefficient            0.0944
Cramer's V                         0.0948

Statistic                           Value     ASE
Gamma                               -0.2252   0.2169
Kendall's Tau-b                     -0.0866   0.0829
Stuart's Tau-c                      -0.0675   0.0656
Somers' D C|R                       -0.1446   0.1378
Somers' D R|C                       -0.0519   0.0504
Pearson correlation                 -0.0894   0.0869
Spearman correlation                -0.0916   0.0877
Lambda asymmetric C|R               0.0000    0.0000
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0000    0.0000
Uncertainty coefficient C|R         0.0041    0.0078
Uncertainty coefficient R|C         0.0113    0.0212
Uncertainty coefficient symmetric   0.0061    0.0114

Sample size = 126

Summary statistics for CLUSTER by region

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    0.9996    0.3174
2           Row mean scores differ   1    0.9996    0.3174
3           General association      2    1.1239    0.5701

Total sample size = 126

In the fifth cluster there are no large discrepancies between expectations and reality for the people coming from the centre and the north of Italy (four expected and three observed for the former, six expected and five observed for the latter). Only for the people from the south of Italy was the expectation lower than the reality (seven expected against nine observed). Here the χ² value is 1.13; as for the third cluster, the situation is unclear, since this value is close to zero and the probability of non-correlation is 56.75%. For this reason we think it appropriate to classify this cluster as not particularly influenced by the region of provenience: even if a weak link exists between the cluster and the variable, it is not particularly relevant.

CLUSTER 6

SAS Syntax 28:

proc freq data=Gioconda.merged_2; tables cluster*region/all expected; where cluster=6 or cluster=7; run;

o SAS output 18: The FREQ procedure

Table of CLUSTER by region (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   Region   Frequency   Expected   Percent   Row Pct   Col Pct
6         C        3           2.75       2.50      27.27     10.00
6         N        2           3.7583     1.67      18.18     4.88
6         S        6           4.4917     5.00      54.55     12.24
7         C        27          27.25      22.50     24.77     90.00
7         N        39          37.242     32.50     35.78     95.12
7         S        43          44.508     35.83     39.45     87.76

Row totals: cluster 6 = 11 (9.17%), cluster 7 = 109 (90.83%)
Column totals: C = 30 (25.00%), N = 41 (34.17%), S = 49 (40.83%)
Total: 120 (100.00%)

Statistics for the table of CLUSTER by region

Statistic                     DF   Value     Prob
Chi-square                    2    1.4883    0.4751
Likelihood ratio chi-square   2    1.6089    0.4473
Mantel-Haenszel chi-square    1    0.2482    0.6184
Phi coefficient                    0.1114
Contingency coefficient            0.1107
Cramer's V                         0.1114

WARNING: 50% of the cells have expected counts less than 5. Chi-square may not be a valid test.

Statistic                           Value     ASE
Gamma                               -0.1509   0.2852
Kendall's Tau-b                     -0.0497   0.0930
Stuart's Tau-c                      -0.0328   0.0618
Somers' D C|R                       -0.0984   0.1838
Somers' D R|C                       -0.0251   0.0473
Pearson correlation                 -0.0457   0.0984
Spearman correlation                -0.0525   0.0984
Lambda asymmetric C|R               0.0000    0.0000
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0000    0.0000
Uncertainty coefficient C|R         0.0062    0.0093
Uncertainty coefficient R|C         0.0219    0.0323
Uncertainty coefficient symmetric   0.0097    0.0144

Sample size = 120

Summary statistics for CLUSTER by region

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    0.2482    0.6184
2           Row mean scores differ   1    0.2482    0.6184
3           General association      2    1.4759    0.4781

Total sample size = 120

This group presents a χ² value of 1.49, which leads us to think that cluster 6 might be independent from the variable region of provenience. To strengthen our hypothesis, the probability of non-correlation is 47.51%; since this is quite high, we classify this cluster as not particularly influenced by the region of provenience: even if this variable has a weak weight on the cluster, it is not particularly relevant. SAS also warns that 50% of the cells have expected counts below 5, so the chi-square may not be a valid test for this cluster.

Table 7 Results for the variable Region of provenience

Cluster     Is the cluster influenced by the variable?
Cluster 1   Yes
Cluster 2   Yes
Cluster 3   Slightly
Cluster 4   No
Cluster 5   Slightly
Cluster 6   Slightly

CLUSTER 1

SAS Syntax 29:

proc freq data=Gioconda.merged_2; tables cluster*chin_stud/all expected; where cluster=1 or cluster=7; run;

Below is the SAS html output showing the results for cluster 1:

o SAS Output 19: The FREQ procedure

Table of CLUSTER by chin_stud (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   chin_stud   Frequency   Expected   Percent   Row Pct   Col Pct
1         no          14          17.07      10.94     73.68     12.17
1         yes         5           1.9297     3.91      26.32     38.46
7         no          101         97.93      78.91     92.66     87.83
7         yes         8           11.07      6.25      7.34      61.54

Row totals: cluster 1 = 19 (14.84%), cluster 7 = 109 (85.16%)
Column totals: no = 115 (89.84%), yes = 13 (10.16%)
Total: 128 (100.00%)

Statistics for the table of CLUSTER by chin_stud

Statistic                     DF   Value     Prob
Chi-square                    1    6.3852    0.0115
Likelihood ratio chi-square   1    5.0075    0.0252
Continuity adj. chi-square    1    4.4749    0.0344
Mantel-Haenszel chi-square    1    6.3353    0.0118
Phi coefficient                    -0.2233
Contingency coefficient            0.2180
Cramer's V                         -0.2233

WARNING: 25% of the cells have expected counts less than 5. Chi-square may not be a valid test.

Fisher's exact test
Cell (1,1) frequency (F)   14
Left-sided Pr <= F         0.0252
Right-sided Pr >= F        0.9957
Table probability (P)      0.0209
Two-sided Pr <= P          0.0252

Statistic                           Value     ASE
Gamma                               -0.6370   0.1894
Kendall's Tau-b                     -0.2233   0.1165
Stuart's Tau-c                      -0.0959   0.0552
Somers' D C|R                       -0.1898   0.1041
Somers' D R|C                       -0.2629   0.1383
Pearson correlation                 -0.2233   0.1165
Spearman correlation                -0.2233   0.1165
Lambda asymmetric C|R               0.0000    0.0000
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0000    0.0000
Uncertainty coefficient C|R         0.0595    0.0567
Uncertainty coefficient R|C         0.0466    0.0449
Uncertainty coefficient symmetric   0.0523    0.0498

Estimates of the relative risk (Row1/Row2)

Type of study               Value     95% confidence limits
Case-control (odds ratio)   0.2218    0.0636    0.7736
Cohort (Col1 risk)          0.7952    0.6047    1.0457
Cohort (Col2 risk)          3.5855    1.3118    9.8003

Sample size = 128

Summary statistics for CLUSTER by chin_stud

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    6.3353    0.0118
2           Row mean scores differ   1    6.3353    0.0118
3           General association      1    6.3353    0.0118

Estimates of the common relative risk (Row1/Row2)

Type of study               Method            Value     95% confidence limits
Case-control (odds ratio)   Mantel-Haenszel   0.2218    0.0636    0.7736
Case-control (odds ratio)   Logit             0.2218    0.0636    0.7736
Cohort (Col1 risk)          Mantel-Haenszel   0.7952    0.6047    1.0457
Cohort (Col1 risk)          Logit             0.7952    0.6047    1.0457
Cohort (Col2 risk)          Mantel-Haenszel   3.5855    1.3118    9.8003
Cohort (Col2 risk)          Logit             3.5855    1.3118    9.8003
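Because SAS output 19 warns that 25% of the cells have expected counts below 5, it is worth cross-checking the chi-square results against Fisher's exact test. The following is a minimal sketch in Python that recomputes both from the four cell counts; the function names are hypothetical and the code is an illustration, not the routine SAS uses.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for [[a, b], [c, d]], optionally continuity-adjusted."""
    n = a + b + c + d
    cross = abs(a * d - b * c)
    if yates:
        cross = max(0.0, cross - n / 2)
    return n * cross ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi_square_p_value_1df(stat):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def fisher_exact_2x2(a, b, c, d):
    """Left-sided, right-sided and two-sided Fisher p-values for cell (1,1)."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that cell (1,1) equals x, margins fixed.
        return math.comb(col1, x) * math.comb(n - col1, row1 - x) / math.comb(n, row1)

    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    probs = {x: prob(x) for x in range(lowest, highest + 1)}
    p_table = probs[a]
    left = sum(p for x, p in probs.items() if x <= a)
    right = sum(p for x, p in probs.items() if x >= a)
    # Two-sided: every table no more probable than the observed one.
    two_sided = sum(p for p in probs.values() if p <= p_table * (1 + 1e-7))
    return left, right, two_sided

def share_of_small_expected(a, b, c, d, threshold=5):
    """Fraction of cells whose expected count is below the threshold."""
    n = a + b + c + d
    expected = [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]
    return sum(e < threshold for e in expected) / len(expected)

# Cluster 1 and cluster 7 by chin_stud (columns: no, yes), from SAS output 19.
a, b, c, d = 14, 5, 101, 8
chi2 = chi_square_2x2(a, b, c, d)
chi2_adj = chi_square_2x2(a, b, c, d, yates=True)
left, right, two_sided = fisher_exact_2x2(a, b, c, d)
small_share = share_of_small_expected(a, b, c, d)

print(f"chi-square = {chi2:.4f}, p = {chi_square_p_value_1df(chi2):.4f}")
print(f"continuity-adjusted = {chi2_adj:.4f}, p = {chi_square_p_value_1df(chi2_adj):.4f}")
print(f"Fisher: left = {left:.4f}, right = {right:.4f}, two-sided = {two_sided:.4f}")
print(f"cells with expected count below 5: {small_share:.0%}")
```

Fisher's test does not rely on the large-sample approximation, so it remains usable when some expected counts are small, which is the situation the SAS warning describes.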

Total sample size = 128

For this cluster we observe a chi-square index of 6.39 and a probability of non-correlation of 1.15%, which we consider very low. From these results we can state that cluster number 1 is strongly influenced by the people who are currently studying or have studied Chinese. Because 25% of the cells have expected counts below 5, SAS warns that the chi-square may not be a valid test; Fisher's exact test (two-sided Pr = 0.0252) points in the same direction. We also have to point out that the differences between the observed values and the expected ones are quite large considering the cluster population of just 19 individuals: there are 3 fewer non-Chinese-students than expected and 3 more Chinese-students than the statistical forecast (14 observed versus 17 expected non-Chinese-students, and 5 observed versus 1.93 expected Chinese-students). We take these unequal values as another proof of the influence of the study of Chinese on this first cluster.

CLUSTER 2

SAS Syntax 30:

proc freq data=Gioconda.merged_2; tables cluster*chin_stud/all expected; where cluster=2 or cluster=7; run;

o SAS Output 20: The FREQ procedure


Table of CLUSTER by chin_stud (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   chin_stud   Frequency   Expected   Percent   Row Pct   Col Pct
2         no          18          18.45      13.95     90.00     15.13
2         yes         2           1.5504     1.55      10.00     20.00
7         no          101         100.55     78.29     92.66     84.87
7         yes         8           8.4496     6.20      7.34      80.00

Row totals: cluster 2 = 20 (15.50%), cluster 7 = 109 (84.50%)
Column totals: no = 119 (92.25%), yes = 10 (7.75%)
Total: 129 (100.00%)

Statistics for the table of CLUSTER by chin_stud

Statistic                     DF   Value     Prob
Chi-square                    1    0.1673    0.6825
Likelihood ratio chi-square   1    0.1568    0.6922
Continuity adj. chi-square    1    0.0000    1.0000
Mantel-Haenszel chi-square    1    0.1660    0.6837
Phi coefficient                    -0.0360
Contingency coefficient            0.0360
Cramer's V                         -0.0360

WARNING: 25% of the cells have expected counts less than 5. Chi-square may not be a valid test.

Fisher's exact test
Cell (1,1) frequency (F)   18
Left-sided Pr <= F         0.4799
Right-sided Pr >= F        0.8137
Table probability (P)      0.2935
Two-sided Pr <= P          0.6533

Statistic                           Value     ASE
Gamma                               -0.1676   0.4038
Kendall's Tau-b                     -0.0360   0.0965
Stuart's Tau-c                      -0.0139   0.0376
Somers' D C|R                       -0.0266   0.0716
Somers' D R|C                       -0.0487   0.1307
Pearson correlation                 -0.0360   0.0965
Spearman correlation                -0.0360   0.0965
Lambda asymmetric C|R               0.0000    0.0000
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0000    0.0000
Uncertainty coefficient C|R         0.0022    0.0116
Uncertainty coefficient R|C         0.0014    0.0073
Uncertainty coefficient symmetric   0.0017    0.0090

Estimates of the relative risk (Row1/Row2)

Type of study               Value     95% confidence limits
Case-control (odds ratio)   0.7129    0.1399    3.6333
Cohort (Col1 risk)          0.9713    0.8315    1.1345
Cohort (Col2 risk)          1.3625    0.3119    5.9514

Sample size = 129

Summary statistics for CLUSTER by chin_stud

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    0.1660    0.6837
2           Row mean scores differ   1    0.1660    0.6837
3           General association      1    0.1660    0.6837

Estimates of the common relative risk (Row1/Row2)

Type of study               Method            Value     95% confidence limits
Case-control (odds ratio)   Mantel-Haenszel   0.7129    0.1399    3.6333
Case-control (odds ratio)   Logit             0.7129    0.1399    3.6333
Cohort (Col1 risk)          Mantel-Haenszel   0.9713    0.8315    1.1345
Cohort (Col1 risk)          Logit             0.9713    0.8315    1.1345
Cohort (Col2 risk)          Mantel-Haenszel   1.3625    0.3119    5.9514
Cohort (Col2 risk)          Logit             1.3625    0.3119    5.9514

Total sample size = 129

The chi-square index for the second cluster, one of the biggest, has a value of 0.17. Because of its proximity to 0 and because of its probability of non-correlation of 68.25% (a very high one), we can state that cluster number 2 is not influenced by the variable that identifies the people who have studied Chinese. Moreover, the differences between the observed values and the expected ones are very small: both expectations almost match the observed values.

CLUSTER 3

SAS Syntax 31:

proc freq data=Gioconda.merged_2; tables cluster*chin_stud/all expected; where cluster=3 or cluster=7; run;

o SAS Output 21: The FREQ Procedure


Table of CLUSTER by chin_stud (cell contents: frequency, expected, percent, row pct, col pct)

CLUSTER   chin_stud   Frequency   Expected   Percent   Row Pct   Col Pct
3         no          19          17.813     14.84     100.00    15.83
3         yes         0           1.1875     0.00      0.00      0.00
7         no          101         102.19     78.91     92.66     84.17
7         yes         8           6.8125     6.25      7.34      100.00

Row totals: cluster 3 = 19 (14.84%), cluster 7 = 109 (85.16%)
Column totals: no = 120 (93.75%), yes = 8 (6.25%)
Total: 128 (100.00%)

Statistics for the table of CLUSTER by chin_stud

Statistic                     DF   Value     Prob
Chi-square                    1    1.4875    0.2226
Likelihood ratio chi-square   1    2.6622    0.1028
Continuity adj. chi-square    1    0.4986    0.4801
Mantel-Haenszel chi-square    1    1.4758    0.2244
Phi coefficient                    0.1078
Contingency coefficient            0.1072
Cramer's V                         0.1078

WARNING: 25% of the cells have expected counts less than 5. Chi-square may not be a valid test.

Fisher's exact test
Cell (1,1) frequency (F)   19
Left-sided Pr <= F         1.0000
Right-sided Pr >= F        0.2657
Table probability (P)      0.2657
Two-sided Pr <= P          0.6040

Statistic                           Value     ASE
Gamma                               1.0000    0.0000
Kendall's Tau-b                     0.1078    0.0226
Stuart's Tau-c                      0.0371    0.0142
Somers' D C|R                       0.0734    0.0250
Somers' D R|C                       0.1583    0.0333
Pearson correlation                 0.1078    0.0226
Spearman correlation                0.1078    0.0226
Lambda asymmetric C|R               0.0000    0.0000
Lambda asymmetric R|C               0.0000    0.0000
Lambda symmetric                    0.0000    0.0000
Uncertainty coefficient C|R         0.0445    0.0108
Uncertainty coefficient R|C         0.0248    0.0089
Uncertainty coefficient symmetric   0.0318    0.0092

Estimates of the relative risk (Row1/Row2)

Type of study               Value     95% confidence limits
Cohort (Col1 risk)          1.0792    1.0237    1.1378

One or more risk estimates not computed --- zero cell.

Sample size = 128

Summary statistics for CLUSTER by chin_stud

Cochran-Mantel-Haenszel statistics (based on table scores)

Statistic   Alternative hypothesis   DF   Value     Prob
1           Nonzero correlation      1    1.4758    0.2244
2           Row mean scores differ   1    1.4758    0.2244
3           General association      1    1.4758    0.2244

Estimates of the common relative risk (Row1/Row2)

Type of study               Method            Value     95% confidence limits
Case-control (odds ratio)   Mantel-Haenszel   .         .         .
Case-control (odds ratio)   Logit **          3.2660    0.1810    58.9454
Cohort (Col1 risk)          Mantel-Haenszel   1.0792    1.0237    1.1378
Cohort (Col1 risk)          Logit             1.0792    1.0237    1.1378
Cohort (Col2 risk)          Mantel-Haenszel   0.0000    .         .
Cohort (Col2 risk)          Logit **          0.3235    0.0194    5.3850

To avoid undefined results, some estimates are not computed.
** These logit estimators use a correction of 0.5 in every cell of those tables that contain a zero.

Total sample size = 128

The cluster we are analysing now is a little stranger than the ones analysed previously. We observe a chi-square index of 1.49, a quite low one, but a probability of non-correlation of 22.26%, a medium one. What is really interesting is that in this cluster there are no people who have studied Chinese. From these results we can state that cluster number 3 is not influenced by the variable that identifies the people who have studied Chinese. The differences between the observed values and the expected ones are also quite small compared with the whole cluster population: there is one more student who has not studied Chinese and one fewer who has.

CLUSTER 4

SAS Syntax 32:

proc freq data=Gioconda.merged_2; tables cluster*chin_stud/all expected; where cluster=4 or cluster=7; run;

SAS Output 22:

The FREQ Procedure

Table of CLUSTER by chin_stud

CLUSTER  chin_stud  Frequency  Expected  Percent  Row Pct  Col Pct
4        no                23    21.606    17.42   100.00    18.55
4        yes                0    1.3939     0.00     0.00     0.00
4        Total             23              17.42
7        no               101    102.39    76.52    92.66    81.45
7        yes                8    6.6061     6.06     7.34   100.00
7        Total            109              82.58
Total    no               124              93.94
Total    yes                8               6.06
Total    Total            132             100.00

Statistics for Table of CLUSTER by chin_stud

Statistic                     DF   Value    Prob
Chi-Square                     1  1.7970  0.1801
Likelihood Ratio Chi-Square    1  3.1704  0.0750
Continuity Adj. Chi-Square     1  0.7390  0.3900
Mantel-Haenszel Chi-Square     1  1.7834  0.1817
Phi Coefficient                   0.1167
Contingency Coefficient           0.1159
Cramer's V                        0.1167

WARNING: 25% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
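The Pearson chi-square reported for cluster 4 can be checked by hand from the observed and expected counts of the 2x2 table. The worked computation below is only a sketch of the standard formulas; the symbols O_ij, E_ij, R_i, C_j and n (observed count, expected count, row total, column total and sample size) are notation introduced here and do not appear in the SAS listing.

```latex
\begin{align*}
E_{ij} &= \frac{R_i\,C_j}{n}, \qquad
E_{11} = \frac{23 \times 124}{132} \approx 21.606, \qquad
E_{12} = \frac{23 \times 8}{132} \approx 1.3939,\\
\chi^2 &= \sum_{i,j} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}
 = 1.3939^2\left(\frac{1}{21.606}+\frac{1}{1.3939}+\frac{1}{102.39}+\frac{1}{6.6061}\right)
 \approx 1.797,\\
\chi^2_{MH} &= \frac{n-1}{n}\,\chi^2 = \frac{131}{132}\times 1.797 \approx 1.783 .
\end{align*}
```

In a 2x2 table the absolute difference between observed and expected counts is the same in every cell (here 1.3939), which is why it can be factored out of the sum. The second relation holds for 2x2 tables, where the Mantel-Haenszel statistic differs from the Pearson one only by the factor (n-1)/n.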

Fisher's Exact Test

Cell (1,1) Frequency (F)         23
Left-sided Pr <= F           1.0000
Right-sided Pr >= F          0.2063
Table Probability (P)        0.2063
Two-sided Pr <= P            0.3496

Statistic                             Value     ASE
Gamma                                1.0000  0.0000
Kendall's Tau-b                      0.1167  0.0238
Stuart's Tau-c                       0.0422  0.0157
Somers' D C|R                        0.0734  0.0250
Somers' D R|C                        0.1855  0.0349
Pearson Correlation                  0.1167  0.0238
Spearman Correlation                 0.1167  0.0238
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0525  0.0117
Uncertainty Coefficient R|C          0.0260  0.0093
Uncertainty Coefficient Symmetric    0.0347  0.0101

Estimates of the Relative Risk (Row1/Row2)

Type of Study          Value   95% Confidence Limits
Cohort (Col1 Risk)    1.0792   1.0237   1.1378
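The single relative risk that SAS could compute for this table can be reproduced by hand. The sketch below uses the usual large-sample formula for the confidence limits of a risk ratio; p_1 and p_2 (the share of "no" answers in row 1 and in row 2 of the table) are symbols introduced here for illustration only.

```latex
\begin{align*}
p_1 &= \frac{23}{23} = 1, \qquad p_2 = \frac{101}{109} \approx 0.9266, \qquad
RR = \frac{p_1}{p_2} = \frac{109}{101} \approx 1.0792,\\
SE(\ln RR) &= \sqrt{\frac{1-p_1}{23\,p_1} + \frac{1-p_2}{109\,p_2}}
 = \sqrt{0 + \frac{8/109}{101}} \approx 0.02696,\\
\text{95\% limits} &= \exp\!\left(\ln 1.0792 \pm 1.96 \times 0.02696\right) \approx (1.0237,\; 1.1378).
\end{align*}
```

Because p_1 equals 1 whenever nobody in the cluster has studied Chinese, both the estimate and its limits depend only on row 2. This is why the same figures, 1.0792 with limits 1.0237 and 1.1378, recur for every cluster that has a zero cell, whatever its size.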

One or more risk estimates not computed --- zero cell.
Sample Size = 132

Summary Statistics for CLUSTER by chin_stud

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  1.7834  0.1817
2          Row Mean Scores Differ    1  1.7834  0.1817
3          General Association       1  1.7834  0.1817

Estimates of the Common Relative Risk (Row1/Row2)

Type of Study              Method            Value   95% Confidence Limits
Case-Control (Odds Ratio)  Mantel-Haenszel       .        .         .
                           Logit **         3.9360   0.2193   70.6259
Cohort (Col1 Risk)         Mantel-Haenszel  1.0792   1.0237    1.1378
                           Logit            1.0792   1.0237    1.1378
Cohort (Col2 Risk)         Mantel-Haenszel  0.0000        .         .
                           Logit **         0.2696   0.0161    4.5131
To avoid undefined results, some estimates are not computed.
** These logit estimators use a correction of 0.5 in every cell of those tables that contain a zero.
Total Sample Size = 132

As for the cluster we observed before, here we have a chi-squared index of 1.79, which is quite low, and a probability of non-correlation of 18.01%, a medium-low value. Again we have to notice that nobody in this cluster has studied Chinese. From these results we can state that cluster number 4 is not influenced at all by the variable recording whether people have studied Chinese. The differences between the observed and the expected values are again small compared with the size of the cluster, though slightly larger than in the previous cluster (which had 19 individuals against 23 here): there is about one more student who has not studied Chinese than expected, and one fewer who has.

CLUSTER 5

SAS Syntax 33:

proc freq data=Gioconda.merged_2;
   /* cluster 5 against cluster 7, cross-tabulated with chin_stud */
   tables cluster*chin_stud / all expected;
   where cluster=5 or cluster=7;
run;

SAS Output 23:

The FREQ Procedure

Table of CLUSTER by chin_stud

CLUSTER  chin_stud  Frequency  Expected  Percent  Row Pct  Col Pct
5        no                16    15.786    12.70    94.12    13.68
5        yes                1    1.2143     0.79     5.88    11.11
5        Total             17              13.49
7        no               101    101.21    80.16    92.66    86.32
7        yes                8    7.7857     6.35     7.34    88.89
7        Total            109              86.51
Total    no               117              92.86
Total    yes                9               7.14
Total    Total            126             100.00

Statistics for Table of CLUSTER by chin_stud

Statistic                     DF   Value    Prob
Chi-Square                     1  0.0471  0.8282
Likelihood Ratio Chi-Square    1  0.0495  0.8240
Continuity Adj. Chi-Square     1  0.0000  1.0000
Mantel-Haenszel Chi-Square     1  0.0467  0.8289
Phi Coefficient                   0.0193
Contingency Coefficient           0.0193
Cramer's V                        0.0193

WARNING: 25% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Fisher's Exact Test

Cell (1,1) Frequency (F)         16
Left-sided Pr <= F           0.7411
Right-sided Pr >= F          0.6511
Table Probability (P)        0.3922
Two-sided Pr <= P            1.0000

Statistic                             Value     ASE
Gamma                                0.1179  0.5395
Kendall's Tau-b                      0.0193  0.0825
Stuart's Tau-c                       0.0068  0.0291
Somers' D C|R                        0.0146  0.0623
Somers' D R|C                        0.0256  0.1095
Pearson Correlation                  0.0193  0.0825
Spearman Correlation                 0.0193  0.0825
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0008  0.0067
Uncertainty Coefficient R|C          0.0005  0.0043
Uncertainty Coefficient Symmetric    0.0006  0.0053

Estimates of the Relative Risk (Row1/Row2)

Type of Study                Value   95% Confidence Limits
Case-Control (Odds Ratio)   1.2673   0.1484   10.8224
Cohort (Col1 Risk)          1.0157   0.8919    1.1568
Cohort (Col2 Risk)          0.8015   0.1068    6.0119
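Cluster 5 is the only one of these tables without a zero cell, so the odds ratio is defined and can be recomputed from the four cell counts. The following is a sketch using the standard large-sample interval for the log odds ratio; the symbols n_11, n_12, n_21 and n_22 (the four cell frequencies) are notation introduced here.

```latex
\begin{align*}
OR &= \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}} = \frac{16 \times 8}{1 \times 101} \approx 1.2673,\\
SE(\ln OR) &= \sqrt{\frac{1}{16}+\frac{1}{1}+\frac{1}{101}+\frac{1}{8}} \approx 1.0943,\\
\text{95\% limits} &= \exp\!\left(\ln 1.2673 \pm 1.96 \times 1.0943\right) \approx (0.148,\; 10.82).
\end{align*}
```

The interval is very wide and contains 1, mainly because of the single observation in the "yes" cell of the cluster; this agrees with the chi-square test, which finds no evidence of association.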

Sample Size = 126

Summary Statistics for CLUSTER by chin_stud

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  0.0467  0.8289
2          Row Mean Scores Differ    1  0.0467  0.8289
3          General Association       1  0.0467  0.8289

Estimates of the Common Relative Risk (Row1/Row2)

Type of Study              Method            Value   95% Confidence Limits
Case-Control (Odds Ratio)  Mantel-Haenszel  1.2673   0.1484   10.8224
                           Logit            1.2673   0.1484   10.8224
Cohort (Col1 Risk)         Mantel-Haenszel  1.0157   0.8919    1.1568
                           Logit            1.0157   0.8919    1.1568
Cohort (Col2 Risk)         Mantel-Haenszel  0.8015   0.1068    6.0119
                           Logit            0.8015   0.1068    6.0119

Total Sample Size = 126

For this cluster we observe a chi-squared index of 0.047, very close to 0, and a probability of non-correlation of 82.82%, a very high one. From these results we can state that cluster number 5 is not influenced by the variable recording whether people have studied Chinese. The differences between the observed and the expected values are also very small: both expected counts almost match the observed ones.

CLUSTER 6

SAS Syntax 34:

proc freq data=Gioconda.merged_2;
   /* cluster 6 against cluster 7, cross-tabulated with chin_stud */
   tables cluster*chin_stud / all expected;
   where cluster=6 or cluster=7;
run;

SAS Output 24:

The FREQ Procedure

Table of CLUSTER by chin_stud

CLUSTER  chin_stud  Frequency  Expected  Percent  Row Pct  Col Pct
6        no                11    10.267     9.17   100.00     9.82
6        yes                0    0.7333     0.00     0.00     0.00
6        Total             11               9.17
7        no               101    101.73    84.17    92.66    90.18
7        yes                8    7.2667     6.67     7.34   100.00
7        Total            109              90.83
Total    no               112              93.33
Total    yes                8               6.67
Total    Total            120             100.00

Statistics for Table of CLUSTER by chin_stud

Statistic                     DF   Value    Prob
Chi-Square                     1  0.8650  0.3523
Likelihood Ratio Chi-Square    1  1.5948  0.2066
Continuity Adj. Chi-Square     1  0.0876  0.7673
Mantel-Haenszel Chi-Square     1  0.8578  0.3544
Phi Coefficient                   0.0849
Contingency Coefficient           0.0846
Cramer's V                        0.0849

WARNING: 25% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
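The gap between the plain and the continuity-adjusted chi-square for cluster 6 (0.8650 against 0.0876) follows directly from how small the deviations are. The worked sketch below assumes the usual Yates form of the correction, in which 0.5 is subtracted from each absolute deviation; O_ij and E_ij (observed and expected counts) are notation introduced here.

```latex
\begin{align*}
|O_{ij}-E_{ij}| &= 11 - \frac{11 \times 112}{120} \approx 0.7333 \quad\text{(the same in all four cells)},\\
\chi^2 &= 0.7333^2 \left(\frac{1}{10.267}+\frac{1}{0.7333}+\frac{1}{101.73}+\frac{1}{7.2667}\right) \approx 0.865,\\
\chi^2_{c} &= (0.7333-0.5)^2 \left(\frac{1}{10.267}+\frac{1}{0.7333}+\frac{1}{101.73}+\frac{1}{7.2667}\right) \approx 0.088 .
\end{align*}
```

Since every deviation is smaller than one person, removing half a unit from it leaves almost nothing, and the adjusted statistic collapses towards zero.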

Fisher's Exact Test

Cell (1,1) Frequency (F)         11
Left-sided Pr <= F           1.0000
Right-sided Pr >= F          0.4521
Table Probability (P)        0.4521
Two-sided Pr <= P            1.0000

Statistic                             Value     ASE
Gamma                                1.0000  0.0000
Kendall's Tau-b                      0.0849  0.0197
Stuart's Tau-c                       0.0244  0.0104
Somers' D C|R                        0.0734  0.0250
Somers' D R|C                        0.0982  0.0281
Pearson Correlation                  0.0849  0.0197
Spearman Correlation                 0.0849  0.0197
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0271  0.0085
Uncertainty Coefficient R|C          0.0217  0.0078
Uncertainty Coefficient Symmetric    0.0241  0.0072

Estimates of the Relative Risk (Row1/Row2)

Type of Study          Value   95% Confidence Limits
Cohort (Col1 Risk)    1.0792   1.0237   1.1378

One or more risk estimates not computed --- zero cell.
Sample Size = 120

Summary Statistics for CLUSTER by chin_stud

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  0.8578  0.3544
2          Row Mean Scores Differ    1  0.8578  0.3544
3          General Association       1  0.8578  0.3544

Estimates of the Common Relative Risk (Row1/Row2)

Type of Study              Method            Value   95% Confidence Limits
Case-Control (Odds Ratio)  Mantel-Haenszel       .        .         .
                           Logit **         1.9261   0.1042   35.5924
Cohort (Col1 Risk)         Mantel-Haenszel  1.0792   1.0237    1.1378
                           Logit            1.0792   1.0237    1.1378
Cohort (Col2 Risk)         Mantel-Haenszel  0.0000        .         .
                           Logit **         0.5392   0.0331    8.7722

To avoid undefined results, some estimates are not computed.
** These logit estimators use a correction of 0.5 in every cell of those tables that contain a zero.
Total Sample Size = 120

For cluster number 6 we observe a chi-squared index of 0.86, which is quite low, and a probability of non-correlation of 35.23%, which can be considered a medium value. In this cluster, as already noticed for others, nobody has studied Chinese. From these results we can say that the cluster is not influenced at all by the variable recording whether people have studied Chinese. In addition, the differences between the observed and the expected values are quite small compared with the size of the cluster. With this last cluster the analysis for this variable is complete; the results of the chi-square tests are summarized in the table below:

Table 8 Results for the variable Chinese student

            Is the cluster influenced by the variable?
Cluster 1   YES
Cluster 2   NO
Cluster 3   NO
Cluster 4   NO
Cluster 5   NO
Cluster 6   NO

The attention now switches to another variable, this time not dichotomous but discrete, which concerns the intention of the interviewed people to move to one specific Chinese city in order to work there. The results for each cluster follow:

CLUSTER 1

SAS Syntax 35:

proc freq data=Gioconda.merged_2;
   /* cluster 1 against cluster 7, cross-tabulated with q3_city_2 (city chosen for work) */
   tables cluster*q3_city_2 / all expected;
   where cluster=1 or cluster=7;
run;

SAS Output 25:

The FREQ Procedure

Table of CLUSTER by q3_city_2

CLUSTER  q3_city_2  Frequency  Expected  Percent  Row Pct  Col Pct
1        Hong Kong          3    3.1172     2.34    15.79    14.29
1        Macao              0    0.1484     0.00     0.00     0.00
1        Pechino            5    3.1172     3.91    26.32    23.81
1        Shanghai           1    1.3359     0.78     5.26    11.11
1        none              10    11.281     7.81    52.63    13.16
1        Total             19              14.84
7        Hong Kong         18    17.883    14.06    16.51    85.71
7        Macao              1    0.8516     0.78     0.92   100.00
7        Pechino           16    17.883    12.50    14.68    76.19
7        Shanghai           8    7.6641     6.25     7.34    88.89
7        none              66    64.719    51.56    60.55    86.84
7        Total            109              85.16
Total    Hong Kong         21              16.41
Total    Macao              1               0.78
Total    Pechino           21              16.41
Total    Shanghai           9               7.03
Total    none              76              59.38
Total    Total            128             100.00

Statistics for Table of CLUSTER by q3_city_2

Statistic                     DF   Value    Prob
Chi-Square                     4  1.7850  0.7752
Likelihood Ratio Chi-Square    4  1.7753  0.7770
Mantel-Haenszel Chi-Square     1  0.1718  0.6786
Phi Coefficient                   0.1181
Contingency Coefficient           0.1173
Cramer's V                        0.1181

WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
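The three association coefficients printed with the chi-square for cluster 1 are simple transformations of it. The sketch below recomputes them with the textbook definitions; n is the sample size (128), and r and c denote the number of rows (2) and columns (5) of the table, symbols introduced here for illustration.

```latex
\begin{align*}
\phi &= \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{1.7850}{128}} \approx 0.1181,\\
C &= \sqrt{\frac{\chi^2}{\chi^2+n}} = \sqrt{\frac{1.7850}{129.785}} \approx 0.1173,\\
V &= \sqrt{\frac{\chi^2}{n\,\min(r-1,\,c-1)}} = \sqrt{\frac{1.7850}{128 \times 1}} \approx 0.1181 .
\end{align*}
```

Because every table in this section has only two rows, min(r-1, c-1) is always 1 and Cramer's V coincides with the phi coefficient.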

Statistic                             Value     ASE
Gamma                                0.1174  0.1998
Kendall's Tau-b                      0.0472  0.0834
Stuart's Tau-c                       0.0364  0.0645
Somers' D C|R                        0.0719  0.1270
Somers' D R|C                        0.0309  0.0549
Pearson Correlation                  0.0368  0.0870
Spearman Correlation                 0.0501  0.0886
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0062  0.0088
Uncertainty Coefficient R|C          0.0165  0.0234
Uncertainty Coefficient Symmetric    0.0090  0.0127

Sample Size = 128

Summary Statistics for CLUSTER by q3_city_2

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  0.1718  0.6786
2          Row Mean Scores Differ    1  0.1718  0.6786
3          General Association       4  1.7711  0.7778

Total Sample Size = 128

In this first cluster the leading group, with 10 preferences against 11.28 expected, is that of people not interested in working in China. In second place come those who would like to work in Beijing, with 5 preferences, about two more than the expected 3.12. Third place goes to Hong Kong, with 3 preferences, in line with the expected value of 3.12. Macao gets zero preferences against an expected 0.15, so the expected counts for this cluster were quite accurate; the only value that clearly exceeds its expected count is that of people interested in working in Beijing. This cluster is not particularly large: it is composed of 19 people and accounts for 14.84% of the total. Turning to the chi-square analysis, for the first cluster we observe a chi-squared index of 1.78, a low value close to zero, and a probability of non-correlation of 77.52%, a very high one. From these results we can say that cluster number 1 is not influenced by the different places where the individuals want to work.

CLUSTER 2

SAS Syntax 36:

proc freq data=Gioconda.merged_2;
   /* cluster 2 against cluster 7, cross-tabulated with q3_city_2 (city chosen for work) */
   tables cluster*q3_city_2 / all expected;
   where cluster=2 or cluster=7;
run;

SAS Output 26:

The FREQ Procedure

Table of CLUSTER by q3_city_2

CLUSTER  q3_city_2  Frequency  Expected  Percent  Row Pct  Col Pct
2        Hong Kong          4    3.4109     3.10    20.00    18.18
2        Macao              0     0.155     0.00     0.00     0.00
2        Pechino            4    3.1008     3.10    20.00    20.00
2        Shanghai           2    1.5504     1.55    10.00    20.00
2        none              10    11.783     7.75    50.00    13.16
2        Total             20              15.50
7        Hong Kong         18    18.589    13.95    16.51    81.82
7        Macao              1     0.845     0.78     0.92   100.00
7        Pechino           16    16.899    12.40    14.68    80.00
7        Shanghai           8    8.4496     6.20     7.34    80.00
7        none              66    64.217    51.16    60.55    86.84
7        Total            109              84.50
Total    Hong Kong         22              17.05
Total    Macao              1               0.78
Total    Pechino           20              15.50
Total    Shanghai          10               7.75
Total    none              76              58.91
Total    Total            129             100.00

Statistics for Table of CLUSTER by q3_city_2

Statistic                     DF   Value    Prob
Chi-Square                     4  1.0861  0.8965
Likelihood Ratio Chi-Square    4  1.2169  0.8753
Mantel-Haenszel Chi-Square     1  0.4372  0.5085
Phi Coefficient                   0.0918
Contingency Coefficient           0.0914
Cramer's V                        0.0918

WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
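For the city variable the table has two rows and five columns, so the chi-square has four degrees of freedom rather than one. The sketch below shows, for cluster 2, where the degrees of freedom and the statistic come from; each number in the sum is one cell's contribution (O-E)^2/E, recomputed here from the printed counts, and the symbols r and c (number of rows and columns) are notation introduced for illustration.

```latex
\begin{align*}
df &= (r-1)(c-1) = (2-1)(5-1) = 4,\\
E_{\text{cluster 2, Hong Kong}} &= \frac{20 \times 22}{129} \approx 3.411, \qquad
E_{\text{cluster 2, none}} = \frac{20 \times 76}{129} \approx 11.783,\\
\chi^2 &= \underbrace{0.1018+0.1550+0.2608+0.1304+0.2698}_{\text{cluster 2 row}}
 + \underbrace{0.0187+0.0284+0.0478+0.0239+0.0495}_{\text{cluster 7 row}} \approx 1.086 .
\end{align*}
```

Most of the statistic comes from the small cluster row, where the expected counts are lowest; these are also the cells that trigger the warning about expected counts below 5.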

Statistic                             Value     ASE
Gamma                                0.1579  0.1910
Kendall's Tau-b                      0.0658  0.0838
Stuart's Tau-c                       0.0519  0.0665
Somers' D C|R                        0.0991  0.1259
Somers' D R|C                        0.0437  0.0561
Pearson Correlation                  0.0584  0.0898
Spearman Correlation                 0.0700  0.0892
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0041  0.0065
Uncertainty Coefficient R|C          0.0109  0.0172
Uncertainty Coefficient Symmetric    0.0060  0.0095

Sample Size = 129

Summary Statistics for CLUSTER by q3_city_2

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  0.4372  0.5085
2          Row Mean Scores Differ    1  0.4372  0.5085
3          General Association       4  1.0777  0.8978

Total Sample Size = 129

Cluster number two is the second-biggest cluster (after the fourth one). In this table we find a fairly good match between the expected counts for the cities and the observed ones, even if Hong Kong, Beijing and Shanghai were slightly underestimated. Macao, as in the previous cluster, did not get any preference, and its expected frequency (0.16) is very close to the observed one. The expected count for the sub-group of people not interested in working in China was too high: 11.78 people against 10 observed. In a cluster of 20 people this group accounts for exactly 50%, while Beijing and Hong Kong account for 20% each and Shanghai for only 10%. This is a fairly balanced group whose members have heterogeneous preferences. Even if the largest sub-group is that of people not interested in going to China for work, its share here is the second-lowest among all clusters (the lowest, as will be pointed out later, belongs to the fourth cluster). For cluster number 2 we observe a chi-squared index of 1.08, a low value, and a probability of non-correlation of 89.65%, a very high one. From these results we can say that cluster number 2 is not influenced by the different Chinese cities where the individuals want to work.

CLUSTER 3

SAS Syntax 37:

proc freq data=Gioconda.merged_2;
   /* cluster 3 against cluster 7, cross-tabulated with q3_city_2 (city chosen for work) */
   tables cluster*q3_city_2 / all expected;
   where cluster=3 or cluster=7;
run;

SAS Output 27:

The FREQ Procedure

Table of CLUSTER by q3_city_2

CLUSTER  q3_city_2  Frequency  Expected  Percent  Row Pct  Col Pct
3        Hong Kong          3    3.1172     2.34    15.79    14.29
3        Macao              0    0.1484     0.00     0.00     0.00
3        Pechino            1    2.5234     0.78     5.26     5.88
3        Shanghai           0    1.1875     0.00     0.00     0.00
3        none              15    12.023    11.72    78.95    18.52
3        Total             19              14.84
7        Hong Kong         18    17.883    14.06    16.51    85.71
7        Macao              1    0.8516     0.78     0.92   100.00
7        Pechino           16    14.477    12.50    14.68    94.12
7        Shanghai           8    6.8125     6.25     7.34   100.00
7        none              66    68.977    51.56    60.55    81.48
7        Total            109              85.16
Total    Hong Kong         21              16.41
Total    Macao              1               0.78
Total    Pechino           17              13.28
Total    Shanghai           8               6.25
Total    none              81              63.28
Total    Total            128             100.00

Statistics for Table of CLUSTER by q3_city_2

Statistic                     DF   Value    Prob
Chi-Square                     4  3.5194  0.4749
Likelihood Ratio Chi-Square    4  5.0611  0.2811
Mantel-Haenszel Chi-Square     1  0.7150  0.3978
Phi Coefficient                   0.1658
Contingency Coefficient           0.1636
Cramer's V                        0.1658

WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Statistic                             Value     ASE
Gamma                               -0.3136  0.2578
Kendall's Tau-b                     -0.1037  0.0796
Stuart's Tau-c                      -0.0774  0.0606
Somers' D C|R                       -0.1531  0.1168
Somers' D R|C                       -0.0702  0.0551
Pearson Correlation                 -0.0750  0.0875
Spearman Correlation                -0.1096  0.0841
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0186  0.0103
Uncertainty Coefficient R|C          0.0471  0.0254
Uncertainty Coefficient Symmetric    0.0266  0.0146

Sample Size = 128

Summary Statistics for CLUSTER by q3_city_2

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  0.7150  0.3978
2          Row Mean Scores Differ    1  0.7150  0.3978
3          General Association       4  3.4919  0.4791

Total Sample Size = 128

In this table, the expected count for this cluster is 3.12 people interested in going to Hong Kong (Xianggang) for work, and this estimate fits reality well, since 3 people chose Hong Kong as an interesting city for a job. Macao gets, again, 0 preferences, against an expected 0.15, that is, almost nobody. One person wants to go to Beijing, no one to Shanghai, and 15 of the 19 people in the cluster are not interested at all. In this case the expectations for Hong Kong and for Macao matched reality, while the expected count for the sub-group none was too low: the group of people not interested in going to China has three members more than expected, 15 against 12. This is the highest count among all the sub-groups and it gives the largest share of the cluster according to the row percentages, 78.95%; the second-highest share belongs to people who want to go to Hong Kong (15.79%) and the third to individuals interested in going to Beijing (5.26%). The other cities did not get any preference, so their weight inside the cluster is 0%. For cluster number 3 we observe a chi-squared index of 3.52 and a probability of non-correlation of 47.49%, a medium value.

From these results we can state that this cluster is not completely influenced by the variable under examination. It is a sort of half-way cluster: neither completely independent nor completely dependent.

CLUSTER 4

SAS Syntax 38:

proc freq data=Gioconda.merged_2;
   /* cluster 4 against cluster 7, cross-tabulated with q3_city_2 (city chosen for work) */
   tables cluster*q3_city_2 / all expected;
   where cluster=4 or cluster=7;
run;

SAS Output 28:

The FREQ Procedure

Table of CLUSTER by q3_city_2

CLUSTER  q3_city_2  Frequency  Expected  Percent  Row Pct  Col Pct
4        Hong Kong          7    4.3561     5.30    30.43    28.00
4        Macao              1    0.3485     0.76     4.35    50.00
4        Pechino            2    3.1364     1.52     8.70    11.11
4        Shanghai           2    1.7424     1.52     8.70    20.00
4        none              11    13.417     8.33    47.83    14.29
4        Total             23              17.42
7        Hong Kong         18    20.644    13.64    16.51    72.00
7        Macao              1    1.6515     0.76     0.92    50.00
7        Pechino           16    14.864    12.12    14.68    88.89
7        Shanghai           8    8.2576     6.06     7.34    80.00
7        none              66    63.583    50.00    60.55    85.71
7        Total            109              82.58
Total    Hong Kong         25              18.94
Total    Macao              2               1.52
Total    Pechino           18              13.64
Total    Shanghai          10               7.58
Total    none              77              58.33
Total    Total            132             100.00

Statistics for Table of CLUSTER by q3_city_2

Statistic                     DF   Value    Prob
Chi-Square                     4  4.4903  0.3437
Likelihood Ratio Chi-Square    4  3.9690  0.4102
Mantel-Haenszel Chi-Square     1  2.3468  0.1255
Phi Coefficient                   0.1844
Contingency Coefficient           0.1814
Cramer's V                        0.1844

WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Statistic                             Value     ASE
Gamma                                0.2540  0.1794
Kendall's Tau-b                      0.1128  0.0864
Stuart's Tau-c                       0.0937  0.0730
Somers' D C|R                        0.1627  0.1244
Somers' D R|C                        0.0781  0.0607
Pearson Correlation                  0.1338  0.0952
Spearman Correlation                 0.1202  0.0921
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0130  0.0135
Uncertainty Coefficient R|C          0.0325  0.0337
Uncertainty Coefficient Symmetric    0.0185  0.0192

Sample Size = 132

Summary Statistics for CLUSTER by q3_city_2

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  2.3468  0.1255
2          Row Mean Scores Differ    1  2.3468  0.1255
3          General Association       4  4.4563  0.3478

Total Sample Size = 132

Here we can clearly see that the expected counts for the fourth cluster are, as usual, highest for the category none: the members of this cluster are mostly not interested in travelling to China for work, although in reality only 11 individuals, against 13.4 expected, do not care about going to China for work. This cluster is strongly Hong Kong-oriented: 7 people, against 4.4 expected, are interested in working in this city. The differences between the observed and expected counts are small for Shanghai (2 observed against 1.7 expected) and for Beijing (2 observed against 3.1 expected). Macao, always with low scores, gets only one preference in this cluster; it is the only one among all six clusters and the only one among the 109 observations. This cluster, the largest of all, is composed of 23 elements and accounts for 17.42% of the total. Another interesting point is that, even if the category none scored high as usual and accounts for 47.83% of the cluster (according to the row percentages), this is the lowest percentage it obtained among all the clusters.

For cluster number 4 we observe a chi-squared index of 4.49 and a probability of non-correlation of 34.37%, a medium-low value. From these results we can state that this cluster is not completely influenced by the variable under examination, but it is influenced more strongly than cluster number 3.

CLUSTER 5

SAS Syntax 39:

proc freq data=Gioconda.merged_2;
   /* cluster 5 against cluster 7, cross-tabulated with q3_city_2 (city chosen for work) */
   tables cluster*q3_city_2 / all expected;
   where cluster=5 or cluster=7;
run;

SAS Output 29:

The FREQ Procedure

Table of CLUSTER by q3_city_2

CLUSTER  q3_city_2  Frequency  Expected  Percent  Row Pct  Col Pct
5        Hong Kong          1    2.5635     0.79     5.88     5.26
5        Macao              0    0.1349     0.00     0.00     0.00
5        Pechino            1    2.2937     0.79     5.88     5.88
5        Shanghai           2    1.3492     1.59    11.76    20.00
5        none              13    10.659    10.32    76.47    16.46
5        Total             17              13.49
7        Hong Kong         18    16.437    14.29    16.51    94.74
7        Macao              1    0.8651     0.79     0.92   100.00
7        Pechino           16    14.706    12.70    14.68    94.12
7        Shanghai           8    8.6508     6.35     7.34    80.00
7        none              66    68.341    52.38    60.55    83.54
7        Total            109              86.51
Total    Hong Kong         19              15.08
Total    Macao              1               0.79
Total    Pechino           17              13.49
Total    Shanghai          10               7.94
Total    none              79              62.70
Total    Total            126             100.00

Statistics for Table of CLUSTER by q3_city_2

Statistic                     DF   Value    Prob
Chi-Square                     4  3.0591  0.5480
Likelihood Ratio Chi-Square    4  3.6005  0.4628
Mantel-Haenszel Chi-Square     1  2.3177  0.1279
Phi Coefficient                   0.1558
Contingency Coefficient           0.1540
Cramer's V                        0.1558

WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Statistic                             Value     ASE
Gamma                               -0.3714  0.2310
Kendall's Tau-b                     -0.1223  0.0692
Stuart's Tau-c                      -0.0884  0.0525
Somers' D C|R                       -0.1894  0.1064
Somers' D R|C                       -0.0790  0.0465
Pearson Correlation                 -0.1362  0.0674
Spearman Correlation                -0.1297  0.0734
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0131  0.0121
Uncertainty Coefficient R|C          0.0361  0.0328
Uncertainty Coefficient Symmetric    0.0193  0.0176

Sample Size = 126

Summary Statistics for CLUSTER by q3_city_2

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  2.3177  0.1279
2          Row Mean Scores Differ    1  2.3177  0.1279
3          General Association       4  3.0348  0.5520

Total Sample Size = 126

In this table the expected counts for this cluster were 2.56 people (so two or at most three) interested in going to Hong Kong (Xianggang) for work, 0.13 to Macao, that is, almost nobody, 2.29 to Beijing, 1.35 to Shanghai and 10.66 not interested at all. This group is composed of seventeen people.

The expected counts for Hong Kong and for Beijing were too high (only one preference each was observed); conversely, those for the sub-group none and for Shanghai were too low. The group none in particular has more than two members above its expected count (13 against 10.66) and, for a cluster composed of seventeen individuals, this is certainly a high score, the highest among all the sub-groups. For cluster number 5 we observe a chi-squared index of 3.06 and a probability of non-correlation of 54.80%. From these results we deduce that this cluster is not completely influenced by the variable under examination. It is a sort of half-way cluster, like the third cluster analyzed before: neither completely independent nor completely dependent.

CLUSTER 6

SAS Syntax 40:

proc freq data=Gioconda.merged_2;
   /* cluster 6 against cluster 7, cross-tabulated with q3_city_2 (city chosen for work) */
   tables cluster*q3_city_2 / all expected;
   where cluster=6 or cluster=7;
run;

SAS Output 30:

The FREQ Procedure

Table of CLUSTER by q3_city_2

CLUSTER  q3_city_2  Frequency  Expected  Percent  Row Pct  Col Pct
6        Hong Kong          0      1.65     0.00     0.00     0.00
6        Macao              0    0.0917     0.00     0.00     0.00
6        Pechino            3    1.7417     2.50    27.27    15.79
6        Shanghai           1     0.825     0.83     9.09    11.11
6        none               7    6.6917     5.83    63.64     9.59
6        Total             11               9.17
7        Hong Kong         18     16.35    15.00    16.51   100.00
7        Macao              1    0.9083     0.83     0.92   100.00
7        Pechino           16    17.258    13.33    14.68    84.21
7        Shanghai           8     8.175     6.67     7.34    88.89
7        none              66    66.308    55.00    60.55    90.41
7        Total            109              90.83
Total    Hong Kong         18              15.00
Total    Macao              1               0.83
Total    Pechino           19              15.83
Total    Shanghai           9               7.50
Total    none              73              60.83
Total    Total            120             100.00

Statistics for Table of CLUSTER by q3_city_2

Statistic                     DF   Value    Prob
Chi-Square                     4  2.9748  0.5620
Likelihood Ratio Chi-Square    4  4.5474  0.3370
Mantel-Haenszel Chi-Square     1  0.8080  0.3687
Phi Coefficient                   0.1574
Contingency Coefficient           0.1555
Cramer's V                        0.1574

WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Statistic                             Value     ASE
Gamma                               -0.1542  0.2530
Kendall's Tau-b                     -0.0471  0.0729
Stuart's Tau-c                      -0.0292  0.0457
Somers' D C|R                       -0.0876  0.1355
Somers' D R|C                       -0.0253  0.0395
Pearson Correlation                 -0.0824  0.0587
Spearman Correlation                -0.0500  0.0774
Lambda Asymmetric C|R                0.0000  0.0000
Lambda Asymmetric R|C                0.0000  0.0000
Lambda Symmetric                     0.0000  0.0000
Uncertainty Coefficient C|R          0.0170  0.0079
Uncertainty Coefficient R|C          0.0618  0.0246
Uncertainty Coefficient Symmetric    0.0267  0.0119

Sample Size = 120

Summary Statistics for CLUSTER by q3_city_2

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis   DF   Value    Prob
1          Nonzero Correlation       1  0.8080  0.3687
2          Row Mean Scores Differ    1  0.8080  0.3687
3          General Association       4  2.9500  0.5662

Total Sample Size = 120

From this table we notice that the expected counts for this last cluster are highest for the category none; that is, there is a good chance that the members of this cluster are not interested in travelling to China for work. The expected count (6.69 people, almost 7, with no intention of working in one of the biggest Chinese cities) is confirmed: seven of the eleven members of the cluster do not want to go to China for work, one would like to go to Shanghai and three to Beijing. The observed counts depart from the expected ones for Beijing, where only 1.74 people were expected, and for Hong Kong, which nobody chose against an expected 1.65. In brief, the largest share of the cluster, according to the row percentages, is that of people not interested in going to China, with 63.64%; the second-highest share belongs to people who want to go to Beijing (27.27%) and the third to individuals interested in going to Shanghai (9.09%). The other cities did not get any preference, so their weight inside the cluster is 0%. For the last cluster we observe a chi-squared index of 2.97 and a probability of non-correlation of 56.20%, a medium value. Hence, this cluster is not completely influenced by the variable under examination. It is, again, a sort of half-way cluster: neither completely independent nor completely dependent, but slightly more independent than cluster number five.

Table 9 Results for the variable Chinese city to go to

            Is the cluster influenced by the variable?
Cluster 1   A bit
Cluster 2   No
Cluster 3   A bit
Cluster 4   A bit
Cluster 5   A bit
Cluster 6   A bit

5. DISCUSSION & CONCLUSIONS

Below we report the clusters identified by our analysis; we gave them six different names, with the aim of making them easily recognizable and more immediate to the reader:

CLUSTER 1 Fans and lovers

For this first cluster there are two highly characterizing variables; they give a positive and significant test response and, on their own, significantly explain the whole cluster composition. These are: 1) the statement that China is a country which can provide good opportunities for the Italian economy, and 2) the statement that China will be the world's next leading nation, replacing the United States in the global balance of powers. Conversely, there are five variables that do not describe the cluster's characteristics at all: they have strongly negative t-values. More precisely, the statements depicting China in a negative way all got low scores within the cluster. Those assertions were: 1) China is interesting for its culture only; 2) China is an irrelevant State because of its geographical distance from Italy; 3) China is a nation too different from Italy to influence our country because of its language, culture, traditions and habits; 4) China is interesting only from the working and economic point of view; 5) China is the country of copies, because the Chinese often copy European and American products. In this cluster the variable age is not useful for the description, since its t-statistic is negative and very close to zero (-0.04) and its p-value is 0.9721. As for the variable recording the region of provenience, the χ² test value is 4.4747 and, since there are two degrees of freedom, we consider this value quite high; in addition, the probability of non-correlation among the variables is not particularly high, since it is 10.67%. Therefore our conclusion is that the variable region of provenience and cluster 1 are dependent, even if this dependence is not strong. The knowledge of the Chinese language strongly influences this cluster: we observed a chi-squared index of 6.38 and a probability of non-correlation of 1.1%.
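The probability of 10.67% quoted for the region of provenience can be verified directly, because with two degrees of freedom the upper tail of the chi-square distribution has a closed form. The lines below are a sketch of that check and are not part of the SAS output; x denotes the observed value of the statistic, a symbol introduced here.

```latex
\begin{align*}
P\left(\chi^2_{2} > x\right) &= e^{-x/2},\\
P\left(\chi^2_{2} > 4.4747\right) &= e^{-2.2374} \approx 0.1067 .
\end{align*}
```

The shortcut applies only to tests with exactly two degrees of freedom, such as the one on region of provenience; the other tests in this section have one or four degrees of freedom.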
We also have to point out that the chi-square for the interest in travelling to a Chinese city for work is 1.78, a low value, with a probability of non-correlation of 77.52%, a very high one. From these results we can say that cluster membership is not influenced by the Chinese city in which the individuals would like to work. To sum up, this cluster is composed of people who are educated in the Chinese language, are attentive to the progress of this nation, are quite optimistic about the Chinese rise and recognize this country as interesting and appealing. They see China as a suitable partner for Italy, rather than an enemy.
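
As a cross-check of the figures quoted for cluster 1, the probability of non-correlation can be recomputed from a chi-square statistic and its degrees of freedom. The following is a minimal Python sketch, not the SAS procedure used in this work; the function name `chi2_p_value` is hypothetical, and the single degree of freedom used for the Chinese-language test is an assumption (a yes/no variable crossed with membership of the cluster), since only the two degrees of freedom of the region test are stated in the text.

```python
import math


def chi2_p_value(statistic: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic, for df = 1 or df = 2 only."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    if df == 1:
        # With 1 degree of freedom: P(X > x) = erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        # With 2 degrees of freedom: P(X > x) = exp(-x / 2).
        return math.exp(-statistic / 2.0)
    raise NotImplementedError("this sketch only covers df = 1 and df = 2")


# Region of provenience vs. cluster 1: statistic and df as reported in the text.
p_region = chi2_p_value(4.4747, df=2)

# Knowledge of Chinese vs. cluster 1: df = 1 is an assumption, not stated in the text.
p_language = chi2_p_value(6.38, df=1)

print(f"region of provenience: p = {p_region:.4f}")
print(f"knowledge of Chinese:  p = {p_language:.4f}")
```

With two degrees of freedom the closed form exp(-4.4747 / 2) reproduces the 10.67% reported above; under the one-degree-of-freedom assumption, the language test gives a value close to the reported 1.1%.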

CLUSTER 2 - Underestimators

This time we can observe that just four variables are significant for the cluster description; their p-values are lower than those of the first cluster and range between <.0001 and 0.0005. In addition, for this cluster there is only one variable with a positive significance test response, which strongly explains the cluster composition: China is an undeveloped country. On the other hand, two variables have a negative t-value response: the statement that China has no influence over Italy because of the distance between the two nations, and the statement that China has a totally different culture compared to the Italian one. In this cluster the variable age is not useful for the description, since its t-statistic is low (0.76) and its p-value is 0.4521.

In this cluster the χ² value for the region of provenience is 2.72, which means that this variable does not noticeably influence the group. The probability of non-correlation among the variables is 25.68%, which is too high to reject independence. In conclusion, any influence of the region on this cluster is slighter than in the previous one. According to the results, knowledge of Chinese does not influence this cluster either: the chi-squared index is 0.16 and, given its proximity to 0 and the high probability of non-correlation (68.25%), we can state that cluster 2 is not influenced by the study of the Chinese language. To check whether the wish to work in a well-known Chinese city might influence this cluster, we performed the chi-square test, but the index we found was 1.08, a low one, with a probability of non-correlation of 89.65%.

From these results we can say that cluster membership is not influenced by the Chinese city in which the individuals would like to work. This cluster is composed of people who think that China is not interesting at all as a country: they recognize that globalization has simplified contacts among nations and they know that Italy has links with this country, but they consider it too underdeveloped to deserve their attention.

CLUSTER 3 - Dislikers

To describe the third cluster we can attribute high significance to the following variables. The first is that China is a fake-maker country, mainly producing counterfeit goods (many individuals within this cluster share this opinion). The second is that China has a culture too different to affect Western nations, so the members consider it unlikely that China will be able to extend its influence to Italy. On the contrary, the variable that negatively characterizes the cluster is that China represents an opportunity, implying that the nation is seen as an enemy, not a partner: the opposite view to that of cluster one. Regarding the chi-square analysis, this cluster is not really influenced by the region of provenience of the interviewees (i.e. this variable does not describe the cluster very well), although we can observe that the majority of its members come from the south of Italy. Two variables have no influence at all: the first is age, the second is knowledge of the Chinese language (in this cluster nobody has studied Chinese). In cluster 3 the majority of people are not willing to go to work in China, and the preferred city (after the prevailing negative answer) is Hong Kong, so this variable (the Chinese cities in which the respondent would like to work) is not sufficient to describe the cluster.

CLUSTER 4 - Career-oriented

This cluster is characterised by a high presence of people who think that China has a culture too different from the Italian one, and the most widely shared opinion inside the group is that China will become a leading nation. The opinion that the cluster population does not share is that China lacks influence because of its distance from Italy. The variable age is not relevant, so we do not take it into account in the cluster description. The region of provenience of the individuals (north, centre, south) is not highly significant, because the data are heterogeneous. Knowledge of Chinese has no influence: in this cluster there are no individuals who have studied the language, and consequently this variable is independent from the cluster. The majority of the interviewees within this cluster believe they will not move to China to work, while the others would choose the city of Hong Kong.

CLUSTER 5 - Conservative

For this cluster, as shown by the t-test analysis, the shared opinion is that China is a country that has nothing to do with Italy because of its geographical distance and its cultural differences. In this case, the age factor does not influence the cluster description. In addition, the cluster is not influenced by the region of provenience of its individuals, although in descriptive terms the majority of them come from southern and northern Italy. The cluster includes one individual who has studied or is studying Chinese (there are only eight of them in the whole dataset) and, consistently with the t-test result, the majority of its individuals do not want to move to any major Chinese city for business reasons (fewer than 25% want to).

CLUSTER 6 - Cosmopolitan-minded

The last cluster thinks of China as a developing nation, as well as a nation able to generate interest for its culture and artistic heritage. On the other hand, these individuals do not think China will become the new world-leading nation in place of the United States. In general, this cluster is characterized by a high level of cosmopolitanism and by open-minded people, as they do not consider China an irrelevant nation because of its different culture. In this cluster too, the variable age is not descriptive, so the age of the individuals included has nothing to do with their view of China. We can also see that this is a cluster of only 11 people (10% of our dataset), with a prevalence of southern and central Italian individuals, who together make up 81% of the cluster population; none of them has studied or is studying Chinese. Moreover, many people in this cluster do not want to move to China for business reasons, even if a minority of them would like to work in Beijing.

APPENDIX A - The Italian version of the questionnaire


Alma Mater Studiorum Università di Bologna
Facoltà di Economia
CLAMDA International Management
Luglio 2009

QUESTIONARIO

1. Potrebbe gentilmente esprimere il Suo giudizio personale sulle seguenti affermazioni, attribuendo ad esse un punteggio compreso tra 1 e 10? La Cina è:

1) Un paese sottosviluppato (arretrato e povero) 1 2 3 4 5 6 7 8 9 10

2) Un paese in via di sviluppo, ma ancora poco progredito 1 2 3 4 5 6 7 8 9 10

3) Uno stato irrilevante, per via della lontananza geografica dall'Italia 1 2 3 4 5 6 7 8 9 10

4) Una nazione dalla cultura, lingua, tradizioni ed abitudini troppo diverse da quella italiana per influenzare l'Italia 1 2 3 4 5 6 7 8 9 10

5) La patria delle copie, dato che i cinesi copiano spesso prodotti USA ed Europei 1 2 3 4 5 6 7 8 9 10

6) Un paese interessante soltanto sotto l'aspetto culturale, storico ed artistico 1 2 3 4 5 6 7 8 9 10

7) Un paese interessante soltanto sotto l'aspetto economico-lavorativo 1 2 3 4 5 6 7 8 9 10

8) Uno stato abbastanza sviluppato da rappresentare una minaccia per l'economia italiana 1 2 3 4 5 6 7 8 9 10

9) Uno stato abbastanza sviluppato da rappresentare un'opportunità per l'economia italiana 1 2 3 4 5 6 7 8 9 10

10) Il nuovo centro di potere mondiale che sostituirà gli Stati Uniti 1 2 3 4 5 6 7 8 9 10

2. Potrebbe cortesemente esprimere il Suo parere personale circa le seguenti domande?

1) Lei è mai stato in Cina? SI NO

2) Andrebbe mai a vivere in Cina? SI NO

3) Andrebbe mai a lavorare in Cina? SI NO

4) Vede in questo paese qualche opportunità per migliorare la tua vita in futuro? SI NO

5) C'è una città in particolare che Le interesserebbe visitare per turismo? SI NO

Se ha risposto SI, potrebbe specificare quale? Shanghai Hong Kong Pechino Macao Xi'an Altro (specificare) ..

6) Esiste una città cinese in cui vorrebbe recarsi per lavoro? SI NO

Se ha risposto di SI alla precedente domanda, potrebbe precisare quale? Shanghai Hong Kong Pechino Macao Xi'an Altro (specificare) ..

3. La preghiamo di rispondere, ora, a queste domande di carattere generale:

1) Età: ..

2) Sesso: Maschio Femmina

3) Da quale area d'Italia proviene? Nord Centro Sud

4) Ha mai studiato oppure è attualmente studente di cinese? SI NO

5) Qual è la sua professione? Studente Lavoratore

7. References

Segre Reinach S., 2006, Mantelmi Ed., Manuale di comunicazione, sociologia e cultura della moda, Vol IV Orientalismi
Damodar Gujarati, 2004, The McGraw-Hill Companies, Basic Econometrics, Fourth edition
SAS Institute Inc., 2008, Base SAS 9.2 Procedures Guide. Cary, NC: SAS Institute Inc.
Friedman, Hastie and Tibshirani, 2001, Springer, The elements of statistical learning: data mining, inference and prediction
Free online encyclopedia: www.wikipedia.org
