
# Qualitative data

Qualitative data is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, the term is often used interchangeably with "categorical" data. For example:

favorite color = "yellow"

height = "tall"

When the categories can be ordered, the variables are called ordinal variables. Categorical variables that judge size (small, medium, large, etc.) are ordinal variables. Attitudes (strongly disagree, disagree, neutral, agree, strongly agree) are also ordinal variables, although we may not know which value is best or worst on a given issue. Note that the distance between these categories is not something we can measure.

# Quantitative data

Quantitative data is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. However, not all numbers are continuous and measurable. For example, a social security number is a number, but not something that one can add or subtract. For example:

favorite color = "450 nm"

height = "1.8 m"

Quantitative data are always associated with a scale of measurement. Probably the most common scale type is the ratio scale. Observations of this type are on a scale that has a meaningful zero value and also an equidistant measure (i.e., the difference between 10 and 20 is the same as the difference between 100 and 110). For example, a 10-year-old girl is twice as old as a 5-year-old girl. Since you can measure zero years, time is a ratio-scale variable. Money is another common ratio-scale quantitative measure. Observations that you count are usually ratio-scale (e.g., number of widgets).

A more general quantitative measure is the interval scale. Interval scales also have an equidistant measure. However, the doubling principle breaks down on this scale. A temperature of 50 degrees Celsius is not "half as hot" as a temperature of 100, but a difference of 10 degrees indicates the same difference in temperature anywhere along the scale. The Kelvin temperature scale, however, constitutes a ratio scale because on the Kelvin scale zero indicates absolute zero in temperature, the complete absence of heat. So one can say, for example, that 200 kelvins is twice as hot as 100 kelvins.
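The difference between an interval scale and a ratio scale can be checked numerically. The sketch below is illustrative only: the helper name `celsius_to_kelvin` and the two sample temperatures are assumptions chosen to mirror the 50 and 100 degree example in the text.

```python
# Illustrative sketch: helper name and sample temperatures are assumptions.
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius temperature to kelvins."""
    return c + 273.15

t_low_c, t_high_c = 50.0, 100.0
t_low_k = celsius_to_kelvin(t_low_c)
t_high_k = celsius_to_kelvin(t_high_c)

# Interval property: differences agree on both scales.
diff_c = t_high_c - t_low_c
diff_k = t_high_k - t_low_k

# Ratio property: only the Kelvin ratio is physically meaningful,
# because only the Kelvin scale has a true zero.
ratio_c = t_high_c / t_low_c
ratio_k = t_high_k / t_low_k

print("difference:", diff_c, "vs", round(diff_k, 2))
print("ratio:", ratio_c, "vs", round(ratio_k, 3))
```

The Celsius ratio comes out as exactly 2, while the Kelvin ratio is much closer to 1, which is why 100 degrees Celsius cannot be called "twice as hot" as 50 degrees Celsius.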

## The differences between qualitative and quantitative data:

Quantitative data is data relating to, measuring, or measured by the quantity of something, rather than its quality (e.g. the number of people in a town). Qualitative data is data that can be captured but is not numerical in nature (e.g. the color of people's skin). Thus, essentially, the distinction is that quantitative data deals with numbers and numerical values of what is being tested, whereas qualitative data deals with the quality of what is being tested.

Qualitative data cannot be described in numbers, whereas quantitative data can only be described in numbers.

Cross-sectional data, or a cross section of a study population, in statistics and econometrics is a type of one-dimensional data set. Cross-sectional data refers to data collected by observing many subjects (such as individuals, firms or countries/regions) at the same point in time, or without regard to differences in time. Analysis of cross-sectional data usually consists of comparing the differences among the subjects.

For example, suppose we want to measure current obesity levels in a population. We could draw a sample of 1,000 people randomly from that population (also known as a cross section of that population), measure their weight and height, and calculate what percentage of that sample is categorized as obese. Say 30% of our sample were categorized as obese. This cross-sectional sample provides us with a snapshot of that population at that one point in time. Note that we do not know, based on one cross-sectional sample, whether obesity is increasing or decreasing; we can only describe the current proportion.

Cross-sectional data differs from time series data, also known as longitudinal data, which follows one subject's changes over the course of time. Another variant, panel data (or time-series cross-sectional (TSCS) data), combines both and looks at multiple subjects and how they change over the course of time. Panel analysis uses panel data to examine changes in variables over time and differences in variables between subjects.

In a rolling cross-section, both the presence of an individual in the sample and the time at which the individual is included in the sample are determined randomly. For example, a political poll may decide to interview 100,000 individuals. It first selects these individuals randomly from the entire population. It then assigns a random date to each individual. This is the date on which that individual will be interviewed, and thus included in the survey.

Time Series

A time series is a sequence of observations which are ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time series are best displayed in a scatter plot. The series value $X$ is plotted on the vertical axis and time $t$ on the horizontal axis. Time is called the independent variable (in this case, however, something over which you have little control).

There are two kinds of time series data:

1. Continuous, where we have an observation at every instant of time, e.g. lie detectors, electrocardiograms. We denote this using observation $X$ at time $t$, $X(t)$.
2. Discrete, where we have an observation at (usually regularly) spaced intervals. We denote this as $X_t$.

Examples:

- Economics: weekly share prices, monthly profits
- Meteorology: daily rainfall, wind speed, temperature
- Sociology: crime figures (number of arrests, etc.), employment figures

In statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering and communications engineering, a time series is a sequence of data points, measured typically at successive time instants spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the Nile River at Aswan. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. Time series are very frequently plotted via line charts.

Time series data have a natural temporal ordering. This makes time series analysis distinct from other common data analysis problems, in which there is no natural ordering of the observations (e.g. explaining people's wages by reference to their respective education levels, where the individuals' data could be entered in any order). Time series analysis is also distinct from spatial data analysis, where the observations typically relate to geographical locations (e.g. accounting for house prices by the location as well as the intrinsic characteristics of the houses). A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations further apart. In addition, time series models will often make use of the natural one-way ordering of time, so that values for a given period will be expressed as deriving in some way from past values rather than from future values (see time reversibility).

Methods for time series analysis may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and, more recently, wavelet analysis; the latter include auto-correlation and cross-correlation analysis.

Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure.

Methods of time series analysis may also be divided into linear and nonlinear, and univariate and multivariate.
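The claim that observations close together in time are more closely related can be made concrete with the sample autocorrelation, one of the time-domain tools named above. This is a minimal sketch, not a full analysis routine: the function name `autocorrelation` and the made-up trending series are assumptions for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer `lag`."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    # Total squared deviation from the mean (the lag-0 term).
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Products of deviations for pairs of observations `lag` steps apart.
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

# Hypothetical, smoothly increasing series of 20 observations.
trend = [float(t) for t in range(1, 21)]

r1 = autocorrelation(trend, 1)   # neighbours in time
r5 = autocorrelation(trend, 5)   # observations five steps apart
print(round(r1, 3), round(r5, 3))
```

For this series the lag-1 value is larger than the lag-5 value, matching the intuition that nearby observations are more alike than distant ones.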
Time series analysis can be applied to:

- real-valued, continuous data
- discrete numeric data
- discrete symbolic data (i.e. sequences of characters, such as letters and words in the English language)

Raw data are the basic numbers and details collected from research without any manipulation. That is, raw data are the "input" for any statistical calculation. However, with justification, certain anomalies can be removed from a data set before performing calculations, or subjects might be excluded if they do not meet certain predefined criteria.

Classifications of data

A. According to Nature

1. Quantitative data: information obtained from numerical variables (e.g. age, bills, etc.)
2. Qualitative data: information obtained from variables in the form of categories, characteristics, names or labels, or alphanumeric variables (e.g. birthdays, gender, etc.)

B. According to Source

1. Primary data: first-hand information (e.g. autobiography, financial statement)
2. Secondary data: second-hand information (e.g. biography, weather forecast from newspapers)

C. According to Measurement

1. Discrete data: countable numerical observations
   - whole numbers only
   - has an equal whole-number interval
   - obtained through counting (e.g. corporate stocks, etc.)
2. Continuous data: measurable observations
   - decimals or fractions
   - obtained through measuring (e.g. bank deposits, volume of liquid, etc.)

D. According to Arrangement

1. Ungrouped data: raw data
   - no specific arrangement
2. Grouped data: organized set of data
   - at least 2 groups involved
   - arranged
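The step from ungrouped to grouped data can be sketched as building a frequency distribution. The example below is a sketch under stated assumptions: the function name `frequency_distribution`, the class width, and the list of ages are all hypothetical, and the values are assumed to be integers.

```python
from collections import Counter

def frequency_distribution(values, start, width):
    """Group integer `values` into classes of equal `width` beginning at `start`.

    Returns a dict mapping (lower, upper) class limits to frequencies.
    Empty classes are omitted.
    """
    # Index of the class each value falls into.
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width - 1): counts[k]
        for k in sorted(counts)
    }

# Hypothetical ungrouped (raw) data: ages of ten people.
ages = [21, 25, 34, 38, 22, 29, 41, 45, 33, 27]

grouped = frequency_distribution(ages, start=20, width=10)
for (lower, upper), freq in grouped.items():
    print(f"{lower}-{upper}: {freq}")
```

The raw list has no particular arrangement, while the result has three classes (20-29, 30-39, 40-49) with one frequency each.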
Dichotomous data are data from outcomes that can be divided into two categories (e.g. dead or alive, pregnant or not pregnant), where each participant must be in one or the other category, and cannot be in both.

Your question is very general. I will give you some suggestions and perhaps you can rephrase your question to a specific problem. I believe the question can be rephrased as how a statistician may approach obtaining valid data for the purposes of interpretation. Generally, data is collected with the purpose of making inferences about a larger population which cannot be surveyed in full. So, in statistics, the key to collecting data is that it is representative of the larger population that you are interested in.

The statistician has choices to make in a planned observational or experimental study. Simple random selection may be appropriate in many cases, for example in a quality control situation, where a sample of parts from a larger batch of parts is selected and tested. More complex sampling schemes are possible, still with the intent that the data can provide a significant, meaningful understanding of the population. Reducing bias in these surveys is very important.

Data can be complicated, and may not tell the full story. For instance, let's say that one road has a high number of accidents. Is it a problem of the road condition, the drivers that use that road, poor signs, too many exits, or something else? In this example, statistics and other information can help point to the most important factors.

It should be noted that surveys are not the only way of collecting data. In education, data may be in the form of test scores, GPA, etc. In media research, content analysis is frequently used to count and/or categorize randomly sampled media content (for example, comparing the volume or tone of war coverage in newspapers to television). The list of alternatives to survey research is extensive, but in all cases the principles of random sampling and statistical assumptions still apply.
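The quality control situation mentioned above, drawing a simple random sample of parts from a batch, might be sketched as follows. The function name `simple_random_sample`, the part labels, and the batch and sample sizes are all assumptions for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct items so that every item is equally likely to be chosen.

    `seed` is optional and only makes the draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(population, k)

# Hypothetical batch of 200 labelled parts.
batch = [f"part-{i:03d}" for i in range(1, 201)]

# Select 20 parts for testing.
sample = simple_random_sample(batch, 20, seed=1)
print(sample[:5])
```

Sampling without replacement, as `random.Random.sample` does, guarantees that no part is tested twice.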

Central tendency
In statistics, the term central tendency relates to the way in which quantitative data tend to cluster around some value. A measure of central tendency is any of a number of ways of specifying this "central value". In practical statistical analysis, the terms are often used before one has chosen even a preliminary form of analysis: thus an initial objective might be to "choose an appropriate measure of central tendency". In the simplest cases, the measure of central tendency is an average of a set of measurements, the word average being variously construed as mean, median, or other measure of location, depending on the context. However, the term is applied to multidimensional data as well as to univariate data and in situations where a transformation of the data values for some or all dimensions would usually be considered necessary: in the latter cases, the notion of a "central location" is retained in converting an "average" computed for the transformed data back to the original units. In addition, there are several different kinds of calculations for central tendency, where the kind of calculation depends on the type of data (level of measurement). Both "central tendency" and "measure of central tendency" apply to either statistical populations or to samples from a population.

The following may be applied to individual dimensions of multidimensional data, after transformation, although some of these involve their own implicit transformation of the data.

- Arithmetic mean: the sum of all measurements divided by the number of observations in the data set
- Median: the middle value that separates the higher half from the lower half of the data set
- Mode: the most frequent value in the data set
- Geometric mean: the nth root of the product of the data values
- Harmonic mean: the reciprocal of the arithmetic mean of the reciprocals of the data values
- Weighted mean: an arithmetic mean that incorporates weighting to certain data elements
- Distance-weighted estimator: the measure uses weighting coefficients for $x_i$ that are computed as the inverse mean distance between $x_i$ and the other data points
- Truncated mean: the arithmetic mean of data values after a certain number or proportion of the highest and lowest data values have been discarded
- Midrange: the arithmetic mean of the maximum and minimum values of a data set
- Midhinge: the arithmetic mean of the two quartiles
- Trimean: the weighted arithmetic mean of the median and two quartiles
- Winsorized mean: an arithmetic mean in which extreme values are replaced by values closer to the median
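Several of the measures listed above can be sketched with Python's standard `statistics` module. The helper names below (`truncated_mean`, `winsorized_mean`, `midrange`, `midhinge_and_trimean`) and the small data set are assumptions for illustration. Quartile conventions differ between textbooks; this sketch uses the "inclusive" method of `statistics.quantiles`.

```python
import statistics as st

def truncated_mean(values, k):
    """Arithmetic mean after discarding the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this data set")
    return st.fmean(ordered[k:len(ordered) - k])

def winsorized_mean(values, k):
    """Arithmetic mean after replacing the k most extreme values at each end
    by the nearest remaining value."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this data set")
    low, high = ordered[k], ordered[-k - 1]
    return st.fmean([min(max(v, low), high) for v in ordered])

def midrange(values):
    """Arithmetic mean of the minimum and maximum."""
    return (min(values) + max(values)) / 2

def midhinge_and_trimean(values):
    """Midhinge (mean of the quartiles) and trimean (median counted twice)."""
    q1, q2, q3 = st.quantiles(values, n=4, method="inclusive")
    return (q1 + q3) / 2, (q1 + 2 * q2 + q3) / 4

# Hypothetical positive data (geometric and harmonic means need positive values).
data = [1, 2, 4, 8, 16]

print("arithmetic:", st.fmean(data))
print("geometric: ", st.geometric_mean(data))
print("harmonic:  ", st.harmonic_mean(data))
print("truncated: ", truncated_mean(data, 1))
print("winsorized:", winsorized_mean(data, 1))
print("midrange:  ", midrange(data))
print("midhinge, trimean:", midhinge_and_trimean(data))
```

On this data the measures disagree noticeably, which illustrates why "choosing an appropriate measure of central tendency" is treated as a decision in its own right.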

Introduction
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as, the median and the mode. The mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others. In the following sections we will look at the mean, mode and median and learn how to calculate them and under what conditions they are most appropriate to be used.

Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data (see our Types of Variable guide for data types). The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have $n$ values in a data set and they have values $x_1, x_2, \ldots, x_n$, then the sample mean, usually denoted by $\bar{x}$ (pronounced x bar), is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek capital letter $\Sigma$, pronounced "sigma", which means "sum of...":

$$\bar{x} = \frac{\sum x}{n}$$

You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample mean? This is because, in statistics, samples and populations have very different meanings and these differences are very important, even if, in the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower case letter "mu", denoted as $\mu$:

$$\mu = \frac{\sum x}{n}$$
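The formula for the mean translates directly into code. In this minimal sketch the function name `sample_mean` is an assumption, and the data are the eleven marks that this guide uses in its Median section.

```python
def sample_mean(values):
    """Sum of all values divided by the number of values."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / n

# The eleven marks from the Median section of this guide.
scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

x_bar = sample_mean(scores)
print(x_bar)  # 59.0

# Check: the deviations from the mean sum to zero.
print(sum(x - x_bar for x in scores))  # 0.0
```

Note that 59 is not one of the observed marks, even though it summarises all of them.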

The mean is essentially a model of your data set. It is a single value chosen to represent all the values in the set. You will notice, however, that the mean is not often one of the actual values that you have observed in your data set. However, one of its important properties is that it minimises error in the prediction of any one value in your data set. That is, it is the value that produces the lowest amount of error from all other values in the data set.

An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.

When not to use the mean

The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

| Staff | Salary |
|---|---|
| 1 | 15k |
| 2 | 18k |
| 3 | 16k |
| 4 | 14k |
| 5 | 15k |
| 6 | 15k |
| 7 | 12k |
| 8 | 17k |
| 9 | 90k |
| 10 | 95k |

The mean salary for these ten staff is \$30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the \$12k to \$18k range. The mean is being skewed by the two large salaries. Therefore, in this situation we would like to have a better measure of central tendency. As we will find out later, the median would be a better measure of central tendency in this situation.

Another time when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e. the frequency distribution for our data is skewed). If we consider the normal distribution, as this is the one most frequently assessed in statistics, then when the data is perfectly normal the mean, median and mode are identical. Moreover, they all represent the most typical value in the data set. However, as the data becomes skewed, the mean loses its ability to provide the best central location for the data, as the skewed data drags it away from the typical value. The median, by contrast, best retains this position and is not as strongly influenced by the skewed values. This is explained in more detail in the skewed distribution section later in this guide.
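The effect of the two large salaries can be checked with Python's standard `statistics` module. This is a small sketch using the salary figures from the table above (in thousands); the variable names are assumptions.

```python
import statistics as st

# Salaries of the ten staff, in thousands, as listed in the table.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_all = st.mean(salaries_k)
median_all = st.median(salaries_k)

# Same calculation without the two outlying salaries (staff 9 and 10).
typical = [s for s in salaries_k if s < 50]
mean_typical = st.mean(typical)

print("mean of all ten:      ", mean_all)      # 30.7
print("median of all ten:    ", median_all)    # 15.5
print("mean without outliers:", mean_typical)  # 15.25
```

The median of all ten salaries stays close to the mean of the eight typical salaries, whereas the mean of all ten is roughly double it.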

Median
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92

We first need to rearrange that data into order of magnitude (smallest first):

14, 35, 45, 55, 55, **56**, 56, 65, 87, 89, 92

Our median mark is the middle mark - in this case 56 (highlighted in bold). It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:

65, 55, 89, 56, 35, 14, 56, 55, 87, 45

We again rearrange that data into order of magnitude (smallest first):

14, 35, 45, 55, 55, 56, 56, 65, 87, 89

Only now we have to take the 5th and 6th score in our data set and average them to get a median of 55.5.
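The two cases, an odd and an even number of scores, can be sketched in a few lines. The function name `median` is an assumption; the data are the two sets of marks used above.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values
    when the number of values is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

eleven_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median(eleven_marks))  # 56
print(median(ten_marks))     # 55.5
```

With ten marks, `mid` is 5, so the function averages the 5th and 6th sorted values (55 and 56), exactly as in the worked example.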

Mode
The mode is the most frequent score in our data set. On a bar chart or histogram it corresponds to the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. Normally, the mode is used for categorical data where we wish to know which is the most common category.
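For categorical data the mode can be found by counting how often each category occurs. The sketch below uses Python's standard library; the transport survey responses are made up purely for illustration.

```python
import statistics as st
from collections import Counter

# Hypothetical categorical data: how six people travel to work.
transport = ["bus", "car", "bus", "train", "car", "bus"]

# Frequency of each category, most common first.
counts = Counter(transport)
print(counts.most_common())

# The mode is the category with the highest frequency.
most_popular = st.mode(transport)
print(most_popular)

# When several categories tie, multimode returns all of them.
tied = st.multimode(["a", "b", "a", "b"])
print(tied)
```

A tie is one practical weakness of the mode: the data set may not have a single most common category.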

## Summary of when to use the mean, median and mode

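Please use the following summary table to know what the best measure of central tendency is with respect to the different types of variable.

| Type of variable | Best measure of central tendency |
|---|---|
| Nominal | Mode |
| Ordinal | Median |
| Interval/Ratio (not skewed) | Mean |
| Interval/Ratio (skewed) | Median |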

Measures of central tendency and dispersion provide a convenient way to describe and compare sets of data.

Importance of Statistics in Different Fields

Statistics plays a vital role in every field of human activity. It has an important role in determining the existing position of per capita income, unemployment, population growth rate, housing, schooling, medical facilities, etc. in a country. Statistics now holds a central position in almost every field, such as Industry, Commerce, Trade, Physics, Chemistry, Economics, Mathematics, Biology, Botany, Psychology and Astronomy, so its application is very wide. We now discuss some important fields in which statistics is commonly applied.

(1) In Business: Statistics plays an important role in business. A successful businessman must be very quick and accurate in decision making. He must know what his customers want, and therefore what to produce and sell, and in what quantities. Statistics helps the businessman to plan production according to the taste of the customers, and the quality of the products can also be checked more efficiently by using statistical methods. So all the activities of the businessman are based on statistical information. He can make correct decisions about the location of the business, marketing of the products, financial resources, etc.

(2) In Economics: Statistics plays an important role in economics. Economics largely depends upon statistics. National income accounts are multipurpose indicators for economists and administrators, and statistical methods are used in preparing these accounts. In economic research, statistical methods are used for collecting and analysing data and for testing hypotheses. The relationship between supply and demand is studied by statistical methods; imports and exports, the inflation rate and per capita income are problems which require a good knowledge of statistics.

(3) In Mathematics: Statistics plays a central role in almost all natural and social sciences. The methods of the natural sciences are the most reliable, but conclusions drawn from them are only probable, because they are based on incomplete evidence. Statistics helps in describing these measurements more precisely. Statistics is a branch of applied mathematics. A large number of statistical methods, such as probability, averages, dispersion and estimation, are used in mathematics, and different techniques of pure mathematics, such as integration, differentiation and algebra, are used in statistics.

(4) In Banking: Statistics plays an important role in banking. Banks make use of statistics for a number of purposes. Banks work on the principle that the people who deposit their money with them do not all withdraw it at the same time. The bank earns profits out of these deposits by lending to others at interest. Bankers use statistical approaches based on probability to estimate the number of depositors and their claims for a certain day.

(5) In State Management (Administration): Statistics is essential for a country. Different policies of the government are based on statistics, and statistical data are now widely used in taking all administrative decisions. Suppose the government wants to revise the pay scales of employees in view of an increase in the cost of living; statistical methods will be used to determine the rise in the cost of living. Preparation of federal and provincial government budgets mainly depends upon statistics because it helps in estimating the expected expenditures and revenue from different sources. So statistics are the eyes of the administration of the state.

(6) In Accounting and Auditing: Accounting is impossible without exactness. For decision-making purposes, however, so much precision is not essential; the decision may be taken on the basis of approximation, known as statistics. The correction of the values of current assets is made on the basis of the purchasing power of money or its current value. In auditing, sampling techniques are commonly used. An auditor determines the sample size of the books to be audited on the basis of error.

(7) In Natural and Social Sciences: Statistics plays a vital role in almost all the natural and social sciences. Statistical methods are commonly used for analyzing the results of experiments and testing their significance in Biology, Physics, Chemistry, Mathematics, Meteorology, Research chambers of commerce, Sociology, Business, Public Administration, Communication and Information Technology, etc.

(8) In Astronomy: Astronomy is one of the oldest branches of statistical study. It deals with the measurement of the distances, sizes, masses and densities of heavenly bodies by means of observations. During these measurements errors are unavoidable, so the most probable measurements are found by using statistical methods. Example: the distance of the moon from the earth is measured in this way. Since early times astronomers have been using statistical methods, such as the method of least squares, for finding the movements of stars.

Collection of Statistical Data

Statistical Data: A sequence of observations made on a set of objects included in a sample drawn from a population is known as statistical data.

(1) Ungrouped Data: Data which have not been arranged in a systematic order are called raw data or ungrouped data.

(2) Grouped Data: Data presented in the form of a frequency distribution are called grouped data.

Collection of Data: The first step in any enquiry (investigation) is the collection of data. The data may be collected for the whole population or for a sample only. It is mostly collected on a sample basis. Collection of data is a very difficult job. The enumerator or investigator is the well-trained person who collects the statistical data. The respondents (informants) are the persons from whom the information is collected.

Types of Data: There are two types (sources) for the collection of data: (1) Primary Data and (2) Secondary Data.

(1) Primary Data: Primary data are first-hand information collected, compiled and published by an organization for some purpose. They are the most original data in character and have not undergone any sort of statistical treatment. Example: Population census reports are primary data because they are collected, compiled and published by the population census organization.

(2) Secondary Data: Secondary data are second-hand information which has already been collected by someone (an organization) for some purpose and is available for the present study. Secondary data are not pure in character and have undergone some treatment at least once. Example: The economic survey of England is secondary data because the data are collected by more than one organization, such as the Bureau of Statistics, the Board of Revenue, the banks, etc.

Methods of Collecting Primary Data: Primary data are collected by the following methods:

- Personal Investigation: The researcher conducts the survey him/herself and collects data from it. The data collected in this way are usually accurate and reliable. This method of collecting data is only applicable in the case of small research projects.
- Through Investigation: Trained investigators are employed to collect the data. These investigators contact the individuals and fill in questionnaires after asking for the required information. Most organizations employ this method.
- Collection through Questionnaire: The researchers get the data from local representatives or agents, based upon their own experience. This method is quick but gives only a rough estimate.
- Through Telephone: The researchers get information by telephone. This method is quick and gives accurate information.
Methods of Collecting Secondary Data: Secondary data are collected from the following sources:

- Official: e.g. the publications of the Statistical Division, Ministry of Finance, the Federal Bureaus of Statistics, Ministries of Food, Agriculture, Industry, Labor, etc.
- Semi-Official: e.g. State Bank, Railway Board, Central Cotton Committee, Boards of Economic Enquiry, etc.
- Publications of Trade Associations, Chambers of Commerce, etc.
- Technical and Trade Journals and Newspapers.
- Research Organizations such as universities and other institutions.

Difference between Primary and Secondary Data: The difference between primary and secondary data is only a change of hand. Primary data are first-hand information collected directly from one source. They are the most original data in character and have not undergone any sort of statistical treatment, while secondary data are obtained from some other sources or agencies. They are not pure in character and have undergone some treatment at least once.

For Example: Suppose we are interested in finding the average age of MS students. We can collect the age data by two methods: either by collecting them directly from each student personally, or by getting the ages from the university record. The data collected by direct personal investigation are called primary data, and the data obtained from the university record are called secondary data.
Editing of Data: After collecting the data from either a primary or a secondary source, the next step is editing. Editing means examining the collected data to discover any errors and mistakes before presenting them. It has to be decided beforehand what degree of accuracy is wanted and what extent of error can be tolerated in the inquiry. The editing of secondary data is simpler than that of primary data.
