
Application of statistical tools for data analysis and interpretation in crops
Sudershan Mishra
Department of Plant Physiology, CBSH, GBPUAT&T, Pantnagar.
Email-tosudershanmishra@gmail.com

Abstract
Statistical methods involved in carrying out an agricultural study include planning, designing,
collecting data, analyzing, drawing meaningful interpretations and reporting the research
findings. Statistical analysis gives meaning to otherwise meaningless numbers, thereby breathing
life into lifeless data. The results and inferences are precise only if proper statistical tests are
used. A number of software packages with graphical user interfaces, ranging from proprietary to
open source, are available nowadays to test multivariate data, which is the most common data type
in agricultural studies. With these packages, new facets of multivariate analysis have
evolved, which include principal component analysis (PCA), principal coordinate analysis,
correspondence analysis, multidimensional scaling, factor analysis (FA), discriminant
analysis, multiple logistic regression analysis, multivariate analysis of variance (MANOVA),
cluster analysis (CA), and canonical methods (canonical correlation, redundancy and correspondence).

Introduction

Statistics is a branch of science that deals with the collection, organization and analysis of
data and the drawing of inferences from samples to the whole population. This requires a proper
design of the study, an appropriate selection of the study sample and the choice of a suitable
statistical test. An adequate knowledge of statistics is necessary for the proper design of an
epidemiological study or a clinical trial. Improper statistical methods may result in erroneous
conclusions, which may lead to unethical practice.

A variable is a characteristic that varies from one individual member of a population to
another. Variables such as height and weight are measured on some type of scale, convey
quantitative information and are called quantitative variables. Sex and eye color give
qualitative information and are called qualitative variables.

Categorical or nominal variables are unordered. The data are merely classified into categories
and cannot be arranged in any particular order. If only two categories exist (as in gender: male
and female), the data are called dichotomous (or binary).

Ordinal variables have a clear ordering between the values. However, the ordered data may not
have equal intervals.

Interval variables are similar to ordinal variables, except that the intervals between the
values are equally spaced.

Ratio scales are similar to interval scales, in that equal differences between scale values
have equal quantitative meaning. However,
ratio scales also have a true zero point, which gives them an additional property.

Fig 1- Different types of variables

STATISTICS: DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics try to describe the relationship between variables in a sample or
population. Descriptive statistics provide a summary of data in the form of mean, median and
mode. Inferential statistics use a random sample of data taken from a population to describe and
make inferences about the whole population. It is valuable when it is not possible to examine
each member of an entire population.

Components of descriptive statistics

• Measures of central tendency
• Measures of dispersion
• Skewness
• Kurtosis
• Gaussian distribution
• Skewed distribution

Normal distribution or Gaussian distribution

Most biological variables cluster around a central value, with symmetrical positive and
negative deviations about this point. The standard normal distribution curve is a symmetrical,
bell-shaped curve. In a normal distribution curve, about 68% of the scores are within 1 SD of
the mean, around 95% are within 2 SDs of the mean and about 99.7% are within 3 SDs of the mean.

Fig 2- Normal distribution curve

Skewed distribution

It is a distribution with an asymmetry of the variables about its mean. In a negatively skewed
distribution [Figure 3], the mass of the distribution is concentrated on the right of the
figure, leading to a longer left tail. In a positively skewed distribution [Figure 3], the mass
of the distribution is concentrated on the left of the figure, leading to a longer right tail.
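The percentages quoted for the normal curve and the direction of skew described in this section can be checked numerically. The following is a minimal Python sketch using only the standard library; the function names `share_within` and `skewness` and the small data set are hypothetical and serve only as an illustration. The skewness used here is the moment coefficient (third central moment divided by the 1.5th power of the second), which is one of several common definitions.

```python
from statistics import NormalDist, mean


def share_within(k_sd):
    """Share of a normal population lying within k_sd SDs of the mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k_sd) - z.cdf(-k_sd)


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")

# Hypothetical data: one large value stretches the right tail.
right_tailed = [1, 2, 2, 3, 10]
left_tailed = [-x for x in right_tailed]  # mirror image, long left tail
print(skewness(right_tailed) > 0, skewness(left_tailed) < 0)
```

A positive coefficient corresponds to a positively skewed (long right tail) distribution and a negative coefficient to a negatively skewed one.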


Fig 3- Negatively and positively skewed curves

In inferential statistics, data from a sample are analyzed to make inferences about the larger
population. The purpose is to answer or test hypotheses. A hypothesis (plural hypotheses) is a
proposed explanation for a phenomenon. Hypothesis tests are thus procedures for making rational
decisions about the reality of observed effects.

Probability is the measure of the likelihood that an event will occur. Probability is
quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates
certainty). In inferential statistics, the term ‘null hypothesis’ (H0, ‘H-naught’, ‘H-null’)
denotes that there is no relationship (difference) between the population variables in
question. The alternative hypothesis (H1 or Ha) denotes that a relationship (difference)
between the variables is expected to be true.

The P value (or the calculated probability) is the probability of obtaining a result at least
as extreme as the one observed, by chance alone, if the null hypothesis is true. The P value is
a number between 0 and 1 and is interpreted by researchers in deciding whether to reject or
retain the null hypothesis.

Table 1- P value with interpretation

PARAMETRIC AND NON-PARAMETRIC TESTS

Numerical data (quantitative variables) that are normally distributed are analysed with
parametric tests.

The two most basic prerequisites for parametric statistical analysis are:

• The assumption of normality, which specifies that the means of the sample group are normally
distributed
• The assumption of equal variance, which specifies that the variances of the samples and of
their corresponding population are equal.

However, if the distribution of the sample is skewed towards one side or the distribution is
unknown due to the small sample size, non-parametric statistical techniques are used.
Non-parametric tests are used to analyze ordinal and categorical data.
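To make the idea of a P value concrete, the following is a minimal Python sketch of an exact two-sample permutation test. This is a distribution-free technique that is not named in the text and is used here only as an illustration of how a P value can be obtained without a normality assumption. The plant-height values and the function name `permutation_p_value` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    # Under H0 the group labels are exchangeable, so every way of
    # choosing n_a observations for "group A" is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical plant heights (cm) under two treatments.
control = [12.1, 13.4, 11.8, 14.0, 12.9]
treated = [15.2, 16.1, 14.8, 15.9, 16.4]
p_value = permutation_p_value(control, treated)
print(f"P = {p_value:.4f}")
```

The returned value is the share of all possible relabelings whose difference in means is at least as large as the observed one, which matches the definition of the P value under the null hypothesis. Enumerating every relabeling is only practical for small samples.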
Parametric tests

The parametric tests assume that the data are on a quantitative (numerical) scale, with a
normal distribution of the underlying population. The samples have the same variance
(homogeneity of variances). The samples are randomly drawn from the population, and the
observations within a group are independent of each other. The commonly used parametric tests
are the Student's t-test, analysis of variance (ANOVA) and repeated measures ANOVA.

Non-parametric tests

When the assumptions of normality are not met, and the sample means are not normally
distributed, parametric tests can lead to erroneous results. Non-parametric tests
(distribution-free tests) are used in such situations as they do not require the normality
assumption.[15] Non-parametric tests may fail to detect a significant difference when compared
with a parametric test. That is, they usually have less power.

As is done for the parametric tests, the test statistic is compared with known values for the
sampling distribution of that statistic and the null hypothesis is retained or rejected. The
types of non-parametric analysis techniques and the corresponding parametric analysis
techniques are delineated in Table 2.

Table 2- Analogous parametric and non-parametric tests

Some basic and advanced application software

SPSS (IBM)
SPSS (Statistical Package for the Social Sciences) is perhaps the most widely used statistics
software package within human behavior research. SPSS offers the ability to easily compile
descriptive statistics, parametric and non-parametric analyses, as well as graphical depictions
of results through the graphical user interface (GUI). It also includes the option to create
scripts to automate analysis, or to carry out more advanced statistical processing.

R (R Foundation for Statistical Computing)
R is a free statistical software package that is widely used both in human behavior research
and in other fields. Toolboxes (essentially plugins, called packages in R) are available for a
great range of applications, which can simplify various aspects of data processing. While R is
very powerful software, it also has a steep learning curve, requiring a certain
degree of coding. It does however come with an active community engaged in building and
improving R and the associated plugins, which ensures that help is never too far away.

MATLAB (The MathWorks)
MATLAB is an analytical platform and programming language that is widely used by engineers and
scientists. As with R, the learning curve is steep, and you will be required to write your own
code at some point. Plenty of toolboxes are also available to help answer your research
questions (such as EEGLAB for analysing EEG data). While MATLAB can be difficult for novices to
use, it offers a massive amount of flexibility in terms of what you want to do, as long as you
can code it (or at least operate the toolbox you require).

SAS (Statistical Analysis System)
SAS is a statistical analysis platform that offers options to use either the GUI, or to create
scripts for more advanced analyses. It is a premium solution that is widely used in business,
healthcare, and human behavior research alike. It is possible to carry out advanced analyses
and produce publication-worthy graphs and charts, although the coding can also be a difficult
adjustment for those not used to this approach.

GraphPad Prism
GraphPad Prism is premium software primarily used within statistics related to biology, but
offers a range of capabilities that can be used across various fields. Similar to SPSS,
scripting options are available to automate analyses, or carry out more complex statistical
calculations, but the majority of the work can be completed through the GUI.

Minitab
The Minitab software offers a range of both basic and fairly advanced statistical tools for
data analysis. Similar to GraphPad Prism, commands can be executed through both the GUI and
scripted commands, making it accessible to novices as well as users looking to carry out more
complex analyses.

STAR- Statistical Tool for Agricultural Research
STAR is developed using Eclipse Rich Client Platform (RCP) and R language for crop scientists
and has a user-friendly graphical user interface (GUI). Its current version provides modules
for generating randomization and layout of experimental designs commonly used in crop research,
data management, and basic statistical analysis, including descriptive statistics, hypothesis
testing, and ANOVA of designed experiments. In the future, modules for mixed models, combined
analysis, general linear models and multivariate analysis will also be included.

CropStat
CropStat is a computer program for data management and basic statistical analysis of
experimental data. It can be run in any 32-bit Windows operating system. It has been developed
primarily for the analysis of data from agricultural field trials, but many of the features can
be used for analysis of data from other sources.
The main modules and facilities are:

• Data management with a spreadsheet
• Text editor
• Descriptive statistics and scatterplot graphics
• Balanced analysis of variance
• Unbalanced analysis (generalized linear models)
• Linear mixed models
• Combined analysis of variance
• Analysis of repeated measures
• Regression and correlation analysis
• Single-site analysis for variety trials
• Spatial analysis
• Genotype × environment interaction analysis
• Pattern analysis
• Quantitative trait loci analysis
• Graphics
• Utilities for randomization and layout, and orthogonal polynomial
• Analysis of categorical data

FieldLab

FieldLab is an application for Android tablets that is used for data collection in the field.
IRRI’s researchers and technicians are using this application to go paperless and thus promote
the digital revolution.

Features

• Import an ICIS workbook as a study
• Export collected observation data to an ICIS workbook format
• Validation, range entry and look-up values on the data entry form
• Integration with a wireless bar-code reader (Baracoda brand)
• Manages traits to be measured
• Manages captured images and audio

Various statistical analyses powered by the new application software

Multivariate statistical methods available for the analysis of data comprising multiple
variables include: Ordination, comprising principal component analysis (PCA), principal
coordinate analysis, discriminant analysis, correspondence analysis, multidimensional scaling,
and factor analysis (FA); Discrimination/Classification, comprising discriminant analysis,
multiple logistic regression analysis, multivariate analysis of variance (MANOVA), and cluster
analysis (CA); and Canonical, comprising canonical correlation, correspondence, and redundancy.
Ordination aims at describing data by identifying a reduced data dimension of a few variables
accounting for the greatest amount of variability in the data. Discrimination aims at
delineating experimental groups or classifying observations into experimental groups based on a
set of variables. Canonical analysis aims at describing and predicting the relationship between
two sets of variables.

Multivariate analysis (MVA) usually involves PCA, CA, FA and pattern analysis. PCA is a
multivariate statistical technique which reduces the dimension of a p-dimensional array by
introducing a set of linear combinations of the original variables. It has been suggested that
PCA can be used in studying plant disease epidemics because it provides a means to
quantitatively evaluate the relative
importance of curve elements, which are the level of effect of a factor curve, the rate of
yield increase, and the variation in shape or skewness from the mean curve. CA is an
exploratory data analysis tool which aims at sorting different objects into groups in a way
that the degree of association between two objects is maximal if they belong to the same group
and minimal otherwise. FA, as a branch of multivariate analysis, is useful to explain the
inter-correlations of variables. It helps to find out the number and nature of causative
influences on which more intensive investigations can be concentrated.

MANOVA
MANOVA is a procedure for assessing differences among the levels of several nonmetric
independent variables based on a linear combination of several metric dependent variables. This
procedure enables the simultaneous examination of several dependent variables. MANOVA was first
used by Golinski et al (2002) to assess the effect of two pathogens (Fusarium avenaceum and F.
culmorum) on three yield components (1000-grain weight, and weight and number of kernels per
winter wheat head) of 14 winter wheat cultivars in a two-year study.

Correspondence analysis
Correspondence analysis describes the relationships among two or more cross-tabulated
categorical variables (contingency table). The frequencies in the contingency table are
transformed into Chi-square distances, which are used to establish a perceptual map of the
relations among variables. In one of the first studies using this method, the genomic
variability of 66 isolates of Xanthomonas arboricola pv. juglandis from different geographic
origins was investigated by analyzing the proximities among amplified fragment length
polymorphism (AFLP) banding patterns using correspondence analysis (Loreti et al, 2001).

Canonical correlation analysis
Canonical correlation analysis describes the association between two sets of variables.
Canonical correlation analysis was first employed by Schlosser et al (2000) to characterize the
relationship between plant morphological variables such as plant height, leaf length, leaf
area, and plant growth rates and rice blast disease variables like lesion densities and lesion
types in six upland rice cultivars.

Redundancy analysis
Redundancy analysis aims at measuring the percentage of variation in a set of variables
(considered singly) that is accounted for by the other set of variables (considered
collectively). This determination is achieved by regressing each variable from one set on all
variables in the other set. Redundancy analysis was first used by Folman et al (2003) to
describe the relationship of carbon source utilization profiles of 20 clusters of rhizobacteria
to 9 root tissue types consisting of 3 root regions (tip, intermediate and base of root)
sampled at three developmental stages (seedling, vegetative and generative).

GGE Biplot
The concept of the biplot was first developed by Gabriel (1971). It is a scatter plot that
graphically displays both the entries (e.g.,
cultivars) and the testers (e.g., environments) of a two-way data set. In breeding and genetics
data, testers can also be traits, genetic markers, etc. When the two-way data are subjected to
singular value decomposition, they are decomposed into three matrices: the singular value
matrix, the entry eigenvector matrix, and the tester eigenvector matrix. The singular value
matrix is a diagonal matrix, and its values can be partitioned between the entry and tester
eigenvector matrices. After the singular values are partitioned, the positions of the entries
in the biplot are defined by the entry eigenvector matrix and those of the testers by the
tester eigenvector matrix.

Properties of a biplot
A biplot graphically displays the two-way data and allows visualization of:
• the interrelationship among the entries (e.g., genotypes),
• the interrelationship among testers (e.g., environments), and
• the interaction between entries and testers.

GGE stands for genotype main effect (G) plus genotype by environment interaction (GE), which is
the only source of variation that is relevant to cultivar evaluation. Mathematically, GGE is
the genotype by environment data matrix after the environment means are subtracted. A GGE
biplot is a biplot that displays the GGE of genotype by environment two-way data. The GGE
biplot methodology originates from graphical analysis of multi-environment variety trials (MET)
data, but is equally applicable to all other types of two-way data. Entry and tester are the
two factors in a two-way data set. Entry is the factor to be tested, and tester is the factor
used to test. An entry is a level of the entry factor, and a tester is a level of the tester
factor. In genotype by environment data, genotypes are entries and environments are testers. In
genotype by trait data, genotypes are entries and traits are testers. In genotype by genetic
marker data, genotypes are entries and markers are testers. By convention, entries are
presented as rows, and testers as columns in a data matrix.

Fig 4- Sample GGE Biplot

Additive main effects and multiplicative interaction analysis
An AMMI biplot is a graphical representation in which genotypes and environments, or pathogen
strains and host genotypes, are displayed simultaneously in four sectors depending upon the
positive or negative signs of the scores on the first two principal components. For simple
interpretation of the biplot, the genotypes with vector end points far from the origin
contribute
relatively more to the interaction than those with vector end points close to the origin.
Sector-1 represents host genotypes, pathogen strains or environments with positive IPCA1 as
well as IPCA2 scores, while sector-2 represents positive IPCA1 and negative IPCA2 scores.
Sector-3 represents negative IPCA1 as well as IPCA2 scores and sector-4 represents negative
IPCA1 and positive IPCA2 scores.

Fig 5- Biplot of 52 isolates of rice bacterial blight strains and 16 host genotypes (Nayak et
al, 2008). A, The mean lesion length and the first interaction principal component (IPCA1); B,
IPCA1 and IPCA2 scores. IG, Isolate group; HG, Host genotype group.

Modeling approaches

• Generalized linear model (GLM)- it assumes one fixed factor and multinomial distribution for
the variable
• Linear mixed model (LMM)- with at least one fixed effect factor and one random effect factor
excluding residual
• Generalized linear mixed model (GLMM)- it is an extension to LMM, which contains more than
one random effect in addition to the usual fixed effects
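Two of the steps described in the GGE Biplot and AMMI sections lend themselves to a short illustration: subtracting the environment means from a genotype by environment table, and assigning a point to one of the four biplot sectors from the signs of its IPCA1 and IPCA2 scores. The following is a minimal Python sketch under stated assumptions. The yield table and the function names `environment_centered` and `sector` are hypothetical, a score of exactly zero is treated as positive (the text does not say how zeros are handled), and the singular value decomposition itself is not shown.

```python
from statistics import mean


def environment_centered(table):
    """Subtract each environment (column) mean; what is left is G + GE."""
    col_means = [mean(col) for col in zip(*table)]
    return [[y - m for y, m in zip(row, col_means)] for row in table]


def sector(ipca1, ipca2):
    """Sector numbering as given in the text for the AMMI biplot.

    Assumption: a score of exactly zero counts as positive.
    """
    if ipca1 >= 0:
        return 1 if ipca2 >= 0 else 2
    return 3 if ipca2 < 0 else 4


# Hypothetical yields: rows are genotypes (entries),
# columns are environments (testers).
yields = [
    [4.0, 5.0, 6.0],
    [5.0, 5.5, 9.0],
    [3.0, 6.0, 6.0],
]
gge = environment_centered(yields)
for row in gge:
    print([round(v, 2) for v in row])
```

After centering, every environment column sums to zero, so the remaining variation reflects only genotype main effects and genotype by environment interaction.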
