

PYCHEM: A multivariate analysis package for Python

Article in Bioinformatics, November 2006

DOI: 10.1093/bioinformatics/btl416 (Source: PubMed)



Bioinformatics Advance Access published July 31, 2006

BIOINFORMATICS

PYCHEM: A Multivariate Analysis Package for Python


Roger M. Jarvis1,*, David Broadhurst1, Helen Johnson2, Noel M. O'Boyle3 and Royston Goodacre1

1 School of Chemistry, The University of Manchester, PO Box 88, Sackville St, Manchester M60 1QD, U.K.; 2 Faculty of Life Sciences, University of Manchester, Stopford Building, Oxford Road, Manchester, M13 9PT, U.K.; and 3 Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW, U.K.

Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN.

Associate Editor: Martin Bishop

ABSTRACT

Summary: We have implemented a multivariate statistical analysis toolbox, with an optional standalone graphical user interface (GUI), using the Python scripting language. This is a free and open source project that addresses the need for a multivariate analysis toolbox in Python. Although the functionality provided does not cover the full range of multivariate tools that are available, it has a broad complement of methods that are widely used in the biological sciences. In contrast to tools like MATLAB, PyChem 2.0.0 is easily accessible and free, allows for rapid extension using a range of Python modules, and is part of the growing body of complementary and interoperable scientific software in Python based upon SciPy. One of the attractions of PyChem is that it is an open source project and so there is an opportunity, through collaboration, to increase the scope of the software and to continually evolve a user-friendly platform that has applicability across a wide range of analytical and post-genomic disciplines.

Availability: http://sourceforge.net/projects/pychem

Contact: Roger.Jarvis@manchester.ac.uk or, alternatively, admin@pychem.org.uk

Supplementary information: Further information is available from the project home page at http://pychem.org.uk/ whilst details of data generation are available at http://biospec.net/.

1 INTRODUCTION

Increasingly in the life sciences many experiments generate data that are multivariate in nature, where many observations are recorded for each sample under analysis. Interpretation of such complex data cannot generally be performed by taking a univariate approach, since no single measurement is necessarily adequate to describe the problem being addressed. In fact, the application of univariate methodology is in many cases totally inappropriate, as the complexity of information contained within large biological data sets reflects the complexity of the system(s) being studied. Typical multivariate analysis problems involve unsupervised learning such as factor analysis, for reducing the dimensionality of data and modeling of variance; linear regression, for formulating input-to-output transformation models based on supervised learning, which are generally predictive of quantitative trait(s); and discriminant analysis, for distinguishing between different sample groups and for subsequent predictions on new samples. In fact, multivariate analysis encompasses many more methods than these examples of linear modeling imply (Brereton, 2003), but these tools are perhaps those most commonly used for the modeling of biological data.

Many programs currently exist for multivariate analysis. Flexible environments for mathematical computing are available in the form of MATLAB (The Mathworks, Natick, MA, USA); GNU Octave (http://www.octave.org/), which aims to be a free equivalent of MATLAB; and R (http://www.r-project.org/), which has many other bio-analysis modules, such as Vegan (for environmetrics) and Bioconductor (for genomic analysis). These products provide powerful tools for multivariate analysis through command line interpreters, which allow the user to perform their analysis with a great degree of flexibility. However, they require some investment in time to become familiar with the interpreter's syntax, and are not necessarily straightforward for people with little computational experience. In addition, a number of graphical multivariate software tools are also available; Evince (UmBio, Umeå, Sweden), The Unscrambler (CAMO, Woodbridge, NJ, USA), Pirouette (Infometrix, Bothell, WA, USA), S-Plus (Insightful, Seattle, WA, USA) and SIMCA (Umetrics, Umeå, Sweden) are all good tools for basic multivariate analysis although, with the exception of S-Plus, they lack the flexibility of the interpreter-style interfaces.

Thus there is currently a requirement for a flexible, extensible, free and open source graphical environment for performing multivariate analysis, which can be used by both experts and casual users. The increasing popularity of scripting languages such as Python (http://www.python.org/) within the life sciences community offers the technology and critical mass for such a project. A platform of this type addresses the requirements outlined above, with the additional benefit that it allows for the rapid development of new cross-platform software approaches, and the integration of currently available software libraries through application programming interfaces (APIs).

2 THE MULTIVARIATE ANALYSIS TOOLBOX FOR PYTHON
The PyChem project aims to provide a simple multivariate analysis toolbox with a powerful and intuitive GUI front-end. The project is implemented in Python and utilizes the wxPython (http://www.wxpython.org/), Boa Constructor (http://boa-constructor.sourceforge.net/) and SciPy (http://scipy.org/) packages (see Fig. 1 for an example screenshot) amongst others. The software was designed to provide a range of algorithms that address three fundamental questions commonly asked by the researcher.

(1) What is the shape of the data, including sources of variance and outlier identification?
(2) How similar are different samples?
(3) Which measurements from the original data can be attributed to observed differences and/or similarities?

To help answer these questions, the initial release includes algorithms for the pre-processing of multivariate data (such as scaling, baseline correction, filtering and derivatisation), principal components analysis (PCA) (Jolliffe, 1986), partial least squares regression (PLS1; see Fig. 1) (Martens & Naes, 1989), discriminant function analysis (DFA) (Manly, 1994), cluster analysis (using the C clustering library for Python (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/) (Eisen et al., 1998; de Hoon et al., 2004)), and a number of genetic algorithm (GA) based tools for performing feature selection (Jarvis & Goodacre, 2005).

The software is able to handle any two-dimensional data set where each sample is defined by a series of discrete or continuous measurements. Data can be imported from flat ASCII files that use the standard delimiters. Typical data of this type include those generated from microarrays, proteomics, spectroscopic methods (UV-Vis, infrared and Raman), mass spectrometry, NMR, or indeed any data arrays representing samples for which multiple discrete measurements have been acquired. Once data have been imported into PyChem they can be saved in an XML format (implemented using cElementTree (http://effbot.org/)) as a PyChem experiment, which allows for the subsequent storage of multiple experimental results within a single file. This allows for the capture of the state of the system at a point in time, so that results of multivariate analyses can be stored and the progression of the analysis recorded, which is particularly useful for tracking data analyses as part of good laboratory practice (GLP). The additional benefit of using the XML data structure for storage is that it introduces the potential for engineering simple bespoke interfaces to database storage systems.

PyChem provides simple grid-style user interfaces for the input of experimental and sample metadata, so producing a series of vectors describing the origin and identity of each sample and measured variable. For unsupervised analyses such as PCA the software simply requires a vector, or multiple vectors, of sample labels for plotting; in addition, for supervised analyses, vectors are required to (a) represent putative class structures or some quantitative trait (e.g., level of abiotic or biotic interference) and (b) identify groups into which the data should be split for the purpose of cross-validation. In supervised analyses the issue of model validation is crucial; when a model is formulated there is a possibility that it will overfit the data and find a relationship between the data and the target class structure or dependent variables that does not hold for subsequent predictions; i.e. the model has learnt the training data perfectly and is not able to generalise. This situation can be avoided by performing some form of model validation. In the current version of PyChem (2.0.0) we use the preferred approach of data splitting (Brereton, 2003), which works by dividing the measured X-variables into three groups: a model training set, model cross-validation data and finally an independent test set. The model is trained on the first set, optimised on the second set and then tested for accuracy on the third set of 'hold-out' data.

A major emphasis of this work has been in providing clear and useful graphical reports for the interpretation of results. The GUI uses wxPyPlot (http://www.cyberus.ca/~g_will/wxPython/wxpyplot.html), with a small modification to include text plotting. In the future even more focus will be given to the structure of graphical reporting in PyChem, as well as the functionality associated with the plotting canvases. Finally, all results, both graphical and numerical, can easily be exported from PyChem, with numerical results in ASCII file format to allow for use in other software applications.

Fig. 1. A screenshot demonstrating the partial least squares regression functionality available in PyChem.

ACKNOWLEDGEMENTS

RMJ, DB, HJ, NMOB & RG would like to thank the BBSRC for funding (NMOB grant BB/C51320X/1).

REFERENCES

Brereton, R. (2003). Chemometrics: data analysis for the laboratory and chemical plant, 1st Edn. Chichester: John Wiley & Sons Ltd.
Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. In Proceedings of the National Academy of Science, pp. 14863-14868, USA.
de Hoon, M., Imoto, S., Nolan, J. & Miyano, S. (2004). Open source clustering software. Bioinformatics 20, 1453-1454.
Jarvis, R. & Goodacre, R. (2005). Genetic algorithm optimization for pre-processing and variable selection of spectroscopic data. Bioinformatics 21, 860-868.
Jolliffe, I. T. (1986). Principal Component Analysis. New York: Springer-Verlag.
Manly, B. F. J. (1994). Multivariate Statistical Methods: A Primer. New York: Chapman & Hall/CRC.
Martens, H. & Naes, T. (1989). Multivariate Calibration. Chichester: John Wiley & Sons.

* To whom correspondence should be addressed.

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
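
As an illustration of the data splitting approach described in Section 2 (a training set, a cross-validation set and an independent 'hold-out' test set), the following minimal Python sketch partitions sample indices into three groups. It is not PyChem's own code: the function name `three_way_split`, the 50/25/25 proportions and the random assignment of samples are assumptions made for this example, whereas in PyChem the grouping is taken from a metadata vector entered by the user.

```python
import random


def three_way_split(n_samples, fractions=(0.5, 0.25, 0.25), seed=0):
    """Partition range(n_samples) into (train, validation, test) index lists.

    Hypothetical helper: the proportions and the random shuffling are
    illustrative assumptions, not PyChem defaults.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = sorted(indices[:n_train])
    validation = sorted(indices[n_train:n_train + n_val])
    # Whatever remains becomes the independent hold-out set.
    test = sorted(indices[n_train + n_val:])
    return train, validation, test


train_idx, val_idx, test_idx = three_way_split(20)
print(len(train_idx), len(val_idx), len(test_idx))
```

A model would then be fitted on the rows in `train_idx`, tuned (for example, choosing the number of PLS factors) on `val_idx`, and assessed once on `test_idx`.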

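
The idea in Section 2 of saving an imported data set as an XML "experiment" can be sketched with the standard-library `xml.etree.ElementTree` module, which provides the ElementTree API in current Python versions. The element and attribute names (`experiment`, `sample`, `label`) and both helper functions are hypothetical choices for this example; PyChem's actual file schema is not described in the text.

```python
import xml.etree.ElementTree as ET


def experiment_to_xml(name, sample_labels, data_rows):
    """Serialise labelled rows of measurements to an XML string.

    The layout used here is an assumption for illustration only.
    """
    root = ET.Element("experiment", {"name": name})
    for label, row in zip(sample_labels, data_rows):
        sample = ET.SubElement(root, "sample", {"label": label})
        # repr() of a float round-trips exactly through float().
        sample.text = " ".join(repr(float(v)) for v in row)
    return ET.tostring(root, encoding="unicode")


def experiment_from_xml(xml_text):
    """Recover (name, labels, rows) from a string made by experiment_to_xml."""
    root = ET.fromstring(xml_text)
    samples = root.findall("sample")
    labels = [s.get("label") for s in samples]
    rows = [[float(v) for v in (s.text or "").split()] for s in samples]
    return root.get("name"), labels, rows


xml_text = experiment_to_xml("demo", ["a", "b"], [[1.0, 2.5], [3.0, 4.0]])
print(experiment_from_xml(xml_text))
```

Because results can be appended as further child elements of the same root, a single file of this kind could hold both the raw data and later analysis output, which is the benefit the text attributes to the XML format.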

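
Scaling is one of the pre-processing steps listed in Section 2. A common form is autoscaling, in which each measured variable (column) is mean-centred and divided by its standard deviation so that all variables contribute on a comparable scale. The sketch below is a plain-Python illustration under stated assumptions: the function name `autoscale`, the use of the population standard deviation, and the handling of constant columns are choices made for this example, and the text does not state which scaling variants PyChem implements.

```python
from statistics import mean, pstdev


def autoscale(rows):
    """Mean-centre each column and scale it to unit variance.

    `rows` is a list of samples, each a list of measurements.
    Constant columns (zero standard deviation) are mapped to 0.0,
    an assumption made here to avoid division by zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


scaled = autoscale([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
for row in scaled:
    print(row)
```

After this transformation every non-constant column has zero mean and unit variance, which is a typical starting point for PCA on variables measured in different units.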