Workshop I Report
27-28 August 2015, University of Hawai‘i at Mānoa, Honolulu, HI
Report: 24 November 2015
Table of Contents
Executive Summary Page 1
ECOGEO – a community focused on solutions Page 2
Workshop Goals Page 2
Workshop Outcomes Page 3
I. Summary of workshop structure
II. Grand ‘omics science challenges
III. Specific cyberinfrastructure needs
IV. Leveraging and expanding existing infrastructures
V. Enabling and encouraging big data best practices
Recommendations
Building Environmental ‘Omics Infrastructure for Earth Sciences Page 9
Upcoming events Page 11
Appendices Page 12
Appendix I: Workshop I Agenda
Appendix II: Participant List
Appendix III: Participant Use Cases and Use Case Template
Appendix IV: Community Survey and Summary
Resources
ECOGEO RCN website: http://earthcube.org/group/ecogeo
Workshop website: http://cmore.soest.hawaii.edu/rcn2015/
Workshop agenda: http://cmore.soest.hawaii.edu/rcn2015/agenda.htm (also Appendix I)
Workshop webinar: https://vimeo.com/uhcmore/review/138035693/5223a63a63
Environmental ‘omics Resource Viewer: http://pivots.azurewebsites.net/ecogeo.html
Executive Summary
The aim of ECOGEO’s first workshop was to enable domain scientists and cyberinfrastructure
experts to collaboratively discuss grand challenges in ‘omics science (Outcome II) and explore
use cases that translate those challenges into cyberinfrastructure needs (Outcome III). The
group also worked to outline existing resources, brainstormed how to best leverage and expand
those resources (Outcome IV), and discussed ways to better establish best practices in the
community (Outcome V).
The workshop hosted over 50 participants from more than 20 universities, three national labs,
four cyberinfrastructure centers, two NSF-funded data resources, and three EarthCube-funded
projects. At the conclusion of the workshop, participant comments were unified and optimistic
about how best to move forward as a community. Scientists and cyberinfrastructure experts
were able to identify common ground and consensus on how to address some of the core
‘omics science challenges. This report summarizes discussions and synthesis activities that
generated the following recommendations, aimed at overcoming cyberinfrastructure challenges
in environmental ‘omics research.
III. Data centers, databases, and analytical tools that address issues of data discovery,
scalability, and community HPC access should be further developed.
IV. Data visualization and statistical analytical frameworks should be integrated into
standard ‘omics analyses workflows and software.
V. The cyberinfrastructure should enable and encourage “big data” best practices and
standards for the community.
VI. The ‘omics and cyberinfrastructure communities should enable and provide a platform
for future ‘omics research via streamlined, accessible, state-of-the-art education, training
tools, and best practices.
This report aims to provide a foundation of community support for a federated platform of
interoperable cyberinfrastructures for oceanography and geobiology environmental ‘omics
research. With support for improved cyberinfrastructure, interdisciplinary collaborations, and
best practices and training, the ‘omics community (domain scientists and cyberinfrastructure
experts) is very well positioned to move environmental ‘omics research into the future.
1
24 November 2015
The ECOGEO research community spans an array of disciplines, but is united in developing
and applying ‘omics technologies and bioinformatic approaches to address core questions
relating the interplay of biological, geological, and chemical processes. Investigations range
from high-throughput sequencing of microbial community DNA to assess taxon, gene, and
metabolic pathway distributions across samples (metagenomics), monitoring the expression of
genes and/or proteins in a variety of environmental settings (metatranscriptomics and
proteomics, respectively), and measuring the distribution and significance of metabolites and
lipids in organisms and the environment (metabolomics, lipidomics). These methods enable
researchers in biological oceanography, biogeochemistry, organic geochemistry, microbial
oceanography, and geobiology to explore and inter-relate the biological, geological, and
chemical (biogeochemical) world in hitherto unprecedented depth and detail. This general
approach requires considerable computational hardware and software infrastructures, which
rely on high performance computing, advanced networking and database capabilities, and
collaboration with computer scientists, bioinformaticians, software engineers, computational
biologists, and interdisciplinary support from both government and private funding agencies
and foundations.
Overall goals
Create and sustain a strategic network and community of field and cyber scientists to
explore new facets of ‘omics data.
Articulate needs, challenges, and practical solutions that address: 1) development of
cyberinfrastructure, 2) integration and implementation of workflows, and 3) database and
resource sustainability to support ocean and geobiology environmental ‘omics research.
Develop a community-based framework that integrates best practices for sharing, curation,
and analysis of ‘omics data, with associated “metadata”, and facilitates collaboration and
training among environmental microbiology, geobiology, and computer science disciplines.
Workshop Goals
Highlight core science and technology drivers for research using environmental ‘omics
The field of environmental ‘omics requires close collaboration between domain science and
technology/cyberinfrastructure. Therefore, one of ECOGEO’s main workshop goals was to
discuss key drivers from the perspective of current challenges and cyberinfrastructure needs.
These drivers were identified during review of the original end-user workshop documents,
participants use cases developed for this workshop (Appendix III), and community feedback
from the ECOGEO survey (Appendix IV).
Identify solutions: Based on these challenge-focused science and technology drivers, the
workshop participants also discussed ways to leverage existing resources and described gaps
where new solutions could be built.
EarthCube Context
The workshop leveraged numerous EarthCube resources to enhance our discussions. Several
active members of EarthCube governance and funded projects were in attendance, including
(alphabetical by last name, also available in Appendix II) Emma Aronson, Science Committee;
Basil Gomez, Leadership Council; Danie Kinkade, Leadership Council; Ouida Meier,
CReSCyNT RCN; Ken Rubin, Science Committee; Elisha Wood-Charlson, Engagement and
Liaison Teams, Science Committee, ECOGEO RCN; and Ilya Zaslavsky, GEAR Conceptual
Design, Technology and Architecture Committee, CINERGI Building Block.
In addition, ECOGEO has contributed, and will continue to contribute, to EarthCube’s vision. The 2nd year of
our RCN will focus on integrating the ‘omics community and our collective resources into the
broader EarthCube infrastructure, with the goal of creating sustainable contributions through
future EarthCube-funded projects. Thus far, we have compiled a dozen domain science use
cases, which will be refined at a follow-up working group meeting in early 2016 (see Path Forward),
and we have been actively engaged with CINERGI to expand our environmental ‘omics resource
viewer. Throughout year 2, ECOGEO will continue to have representation in EarthCube
governance, including revision of best practices and core documents, as well as
recommendations for future EarthCube-funded projects. Finally, we will communicate outcomes from our
workshops to the EarthCube community through dissemination of reports, and continue to inform
the broader ‘omics community of EarthCube through society town hall sessions.
Workshop Outcomes
I. Summary of workshop structure
This workshop focused on understanding the key science and technology drivers in the field and
the development of use cases to identify resource gaps, as well as to highlight their potential for
training. The workshop was organized into several breakout groups, with time for reporting and
discussion amongst all participants (please see Appendix I for the full agenda). The first series
of breakouts on science/tech drivers focused on grand ‘omics science challenges:
Geospatial & temporal registry for 'omics data across scales, led by D. Kinkade, V. Orphan
Tracking synoptic ‘omics data products, led by B. Hurwitz, N. Kyrpides
Integrated modeling of organisms (‘omics) and environmental dynamics, led by M. Follows,
N. Levine
The next set of breakouts focused on a subset of participant submitted use cases with the aim
of extracting the overarching science challenges and cyberinfrastructure needs in ‘omics
research. Representative use cases were grouped by theme and are available in Appendix III.
“Google Earth” ‘omics, led by E. Allen. Use cases by H. Alexander and B. Jenkins.
Linking function to biogeochemical cycling in space/time, led by M. Saito. Use cases by R.
Morris and J. Waldbauer.
Using ‘omics for evolution/trait-based studies, led by E. Aronson. Use cases by D. Chivian
and J. Gilbert.
The final discussion focused on the potential for and limitations of existing cyberinfrastructure.
The aim of this session was to move beyond just identifying gaps in existing resources to
proposing solutions as a community to address specific current and future cyberinfrastructure
needs in ‘omics research. Breakout leads included J. Heidelberg, B. Jenkins, and D. Kinkade.
Within the issue of scaling lies the fundamental challenge of understanding how biological
processes interact with these scalable environmental data layers. For example, from an
environmental (microbial) ‘omics perspective, the classic ecological metrics of alpha and beta
species diversity may no longer be entirely relevant, in part because microbes do not follow typical
speciation criteria. Therefore, the context of environmental interactions and functional diversity
(ability to fix nitrogen, utilize low abundance iron, etc.) may be more relevant from a
biogeochemist’s perspective than defining which strain of microbe is present (Use Case:
Chivian, Gilbert, Waldbauer). In particular, trait-based metrics as opposed to taxonomic criteria
may be more important when considering globally significant and/or societally relevant
questions such as, “how are microbe-microbe and microbe-environment interactions impacting
global biogeochemical cycles?” and “how can that information be used to improve climate
models and projections of change?”. Currently, there is a disconnect between model predictions
and data-driven observations. Therefore, we need new ways to enable more iterative
observations, hypothesis generation, and hypothesis testing cycles. These big picture
challenges require a fundamental change in the structure, availability, and scale of ‘omics-
enabling cyberinfrastructures.
One of the most acute needs articulated by the ECOGEO community is a new model of
sequence data repository that facilitates data sharing and data discovery of primary data and
any associated environmental and sample processing data (“metadata”), as well as links to
other data products (and their provenance), analytical software and workflows, and the
infrastructures required to implement them. Such a repository would store sequence read data
with its associated metadata in a manner that would allow seamless and simultaneous queries
of metadata fields and their corresponding sequence reads (and vice versa). A similar repository
for environmental proteomic mass spectrometry data was also identified as a core need. Such
repositories would be invaluable tools for data discovery. They would promote efficient computation,
reduce the requirement for transferring large data sets, and facilitate the
development of downstream analysis pipelines that could be shared and standardized. Beyond
repositories for raw sequence reads and mass spectrometry data, the community also needs a
federated, searchable repository for data products being used as a basis for biological inference
in publications, which would greatly enhance comparative analyses between studies. These
products include such things as phylogenetically resolved population level genome fragments
assembled from metagenomes, gene/protein expression data from metatranscriptome/proteome
analyses, and sequence alignments underlying phylogenetic inferences. In addition to
searchable data and data product repositories, analysis platforms that enable large-scale
metagenomic data comparisons were identified as another core “ocean ‘omics”
cyberinfrastructure requirement. Such platforms would support data searches driven by
taxonomy and physicochemistry, thus making such large-scale comparisons feasible.
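The joint metadata-and-reads query such a repository would support can be sketched with a small in-memory index. All accessions, field names, and records below are hypothetical illustrations, not an existing API:

```python
# Minimal sketch of a metadata-indexed sequence store: each dataset keeps its
# reads alongside queryable metadata, so a single call can filter on
# environmental fields and return the matching read sets (and vice versa).
# All field names and records here are hypothetical.

class OmicsRepository:
    def __init__(self):
        self.datasets = []  # each: {"accession", "reads", "metadata"}

    def add(self, accession, reads, metadata):
        self.datasets.append(
            {"accession": accession, "reads": reads, "metadata": metadata}
        )

    def query_metadata(self, **filters):
        """Return datasets whose metadata satisfies every filter predicate."""
        hits = []
        for ds in self.datasets:
            md = ds["metadata"]
            if all(k in md and pred(md[k]) for k, pred in filters.items()):
                hits.append(ds)
        return hits

repo = OmicsRepository()
repo.add("ENV0001", ["ACGT..."],
         {"depth_m": 25, "temperature_c": 24.1, "biome": "open ocean"})
repo.add("ENV0002", ["GGCA..."],
         {"depth_m": 1000, "temperature_c": 4.2, "biome": "open ocean"})

# "All open-ocean datasets shallower than 100 m" in a single query:
shallow = repo.query_metadata(biome=lambda b: b == "open ocean",
                              depth_m=lambda d: d < 100)
print([ds["accession"] for ds in shallow])  # ['ENV0001']
```

A production system would back this with an indexed database and a controlled metadata vocabulary, but the query pattern is the same.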
However, this vision of the next generation of sequence data repositories goes beyond aggregating
disparate data sets currently housed in dispersed data resources (e.g. NCBI, IMG, iPlant, EBI,
MG-RAST, BCO-DMO, etc.). They must also consider “dark data”, data already available in the
public domain but not readily discoverable via the commonly accessed databases. Data
mining through web crawls focused on primary scientific literature would significantly extend the
volume of data gathered to promote comparative ‘omics investigations.
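One step in such literature mining can be sketched as a scan of free text for INSDC-style accession identifiers. The patterns below cover common SRA run (SRR/ERR/DRR) and BioProject (PRJNA/PRJEB/PRJDB) formats only; a real crawler would handle many more identifier schemes, and the example text is invented:

```python
import re

# Scan free text (e.g. a paper's methods section) for accession-like
# identifiers pointing at deposited sequence data. Patterns cover SRA run
# and BioProject identifier formats; other schemes are omitted for brevity.
ACCESSION_RE = re.compile(
    r"\b(?:SRR|ERR|DRR)\d{6,}\b|\b(?:PRJNA|PRJEB|PRJDB)\d+\b"
)

def find_accessions(text):
    """Return the unique accession-like identifiers found in `text`."""
    return sorted(set(ACCESSION_RE.findall(text)))

methods_text = (
    "Raw reads were deposited in the SRA under BioProject PRJNA123456 "
    "(runs SRR1234567 and SRR1234568)."
)
print(find_accessions(methods_text))
# ['PRJNA123456', 'SRR1234567', 'SRR1234568']
```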
Once ‘omics data are available in a suitable federated infrastructure with queryable and
standardized metadata, it should be possible to pose a striking diversity of hypotheses. Analysis
tools, workflows, visualization, and statistics are critical to making sense of ‘omics data. Many
analysis tools are disparate (developed by individual research groups), which can make them
difficult to capture in a single workflow. Furthermore, most tools require computational
experience to run, and are not well vetted by the community (i.e., which is the best tool for certain
data types and why) or maintained once a developer moves on. This computational climate
prevents the continual improvement and vetting of existing tools by the community, and it
produces an ever-expanding catalog of programs that are difficult to maintain and come up to speed on. This
situation is particularly problematic for researchers without a bioinformatics team. In addition,
programs are often developed independently without workflows in mind. This leads to disparate
output and formats that are not easily bridged between tools without scripting skills to reformat
resulting data and therefore cannot be easily merged into user-defined workflows.
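The "glue" scripting described here typically means reshaping one tool's tabular output into the columns another tool expects. As an illustration, a BLAST-style 12-column tabular record (outfmt 6) is reduced to the three fields a hypothetical downstream tool needs; the input row and field selection are invented:

```python
import csv, io

# Standard BLAST tabular (outfmt 6) column order.
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send",
                  "evalue", "bitscore"]

def blast6_to_csv(blast6_text, wanted=("qseqid", "sseqid", "evalue")):
    """Convert tab-separated BLAST outfmt 6 lines to CSV with a header,
    keeping only the `wanted` columns."""
    rows = []
    for line in blast6_text.strip().splitlines():
        record = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        rows.append({k: record[k] for k in wanted})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(wanted))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# One hypothetical hit: a read matching a nifH reference sequence.
blast_hit = "read_01\tnifH_ref\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-50\t550"
print(blast6_to_csv(blast_hit))
```

Standardized, self-describing output formats would make this per-pair conversion code unnecessary, which is the point of the federated toolkit described next.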
The ideal platform would promote a federated toolkit with inter-operable and standardized
output formats to enable domain scientists to answer their science questions, as well as
encourage continued technology and cyberinfrastructure development. Quantitative Insights Into
Microbial Ecology (QIIME) and the associated Qiita database have recently been adopted by a
large community as platforms for federated ribosomal RNA (rRNA) tag studies, with open
access tools and a well-supported, community-run helpdesk. QIIME provides online tutorials that
allow for community adoption, resulting in a large user base. However, the rRNA tag data
set analyses for which QIIME is designed have low complexity, and require comparatively less
processing than metagenomic data sets (rendering development and use of these analytical
tools much more straightforward). For large-scale meta-‘omic analyses, different cyber
solutions will be required. A few other analysis platforms exist, such as DNAnexus, which
supports the biomedical community, and the bioinformatics apps for the life sciences that exist in the
iPlant “ecosystem”.
Finally, statistical and visualization tools are necessary for researchers to explore data and draw
conclusions related to their core science questions. During the workshop, we were able to start
this conversation with big data statistics and visualization experts. The main take-home
message was that domain scientists should not struggle with these challenges alone. Just as
our community has grown to educate and include bioinformaticians in the development of our
research plans, it is evident that we need to expand our collaborations to also include the big
data statistics and visualization experts.
discussing these needs and identifying existing resources that might be leveraged to
accomplish our research goals.
One of the core issues with ‘omics data is its sheer size. Moving large-scale sequence data sets
requires significant network bandwidth and access to network platforms, such as Globus and
Internet2. It is estimated that the data we have today represents only 10% of the total data
available in the next 5 years, given improvements in sequencing capacity and advances in the
throughput of non-nucleic acid data products. While data storage and networking and
communications infrastructures will continue to evolve to help meet these “big data” needs, they
also need to serve a variety of end-users, from raw novices to domain experts, as well as the
specialized requirements of educational, survey and monitoring, and policy-driven programs
and communities.
Analyzing and interpreting large collections of data will likely require a collaborative and
federated approach. Cloud-based computing approaches hold great promise, but there are a
number of issues that need to be addressed by the community, and the current economics of
commercial cloud solutions do not appear scalable for a large and diverse community. For
larger and more well-established institutions, such as the Joint Genome Institute (JGI),
iPlant/iMicrobe, Broad, J. Craig Venter Institute (JCVI), Sanger, etc., maintaining dedicated
computing resources makes sense, while for small labs it probably does not. For most
intermediate size groups, a hybrid approach may be optimal, with some dedicated resources
coupled with access to Cloud-based resources.
In this context, federated infrastructure virtual machines (including lightweight containers) will
likely be a central avenue for providing easy-to-use analysis tools. These have the potential to
democratize access to software suites that are too complex to install for researchers without
dedicated computational support staff; such researchers may be best served by
access to online analysis tools offered by groups such as JGI, KBase, and iPlant/iMicrobe.
Common APIs and architectures with such virtual machines will help forge the links for an
interoperable and federated infrastructure. In addition, existing EarthCube “dark data” discovery
projects, such as DeepDive, can be used to identify published data that are not in a public
repository. The imagined next generation data repositories will likely be built on a federated
cross-agency structure that ties together data from different providers into a common
framework. Data collections from public resources, such as iMicrobe and the International
Nucleotide Sequence Database Collaboration (INSDC), which includes Sequence Read Archive
(SRA), GenBank, European Nucleotide Archive (ENA), and Integrated Microbial Genomes
(IMG), are currently the most used, robust and sustainable cyber and meta- ’omics resources.
These should definitely be integral, federated players in the context of any proposed meta-
‘omics “cyber superstructure”.
Presently, researchers in the ECOGEO community deposit raw sequence data into the SRA as
part of the National Center for Biotechnology Information’s (NCBI) GenBank service, which
currently houses over 19,000 environmental genomic data sets totaling > 15 TB. This resource,
however, is not easily searchable and thus prevents the integration of data sets across projects
and limits the possibility of ecosystem level analyses. Further, SRA files at NCBI often do not
contain a sufficient description of a sample’s “metadata”, which ideally includes information on
the sampled environment, sample collection, processing, and data generation. This contextual
metadata, in addition to the oceanographic data in BCO-DMO, is essential if data sets are to be
intercomparable. The Genome Standards Consortium (GSC) has established baselines for
describing genomic, metagenomic, metatranscriptomic, and amplicon sequence data
(discussed in the next section). Previously, the ocean ‘omics community relied heavily on the
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis
(CAMERA) database for data discoverability, but this platform was discontinued in 2014. The
CAMERA data sets have been transferred into iMicrobe, a sub-portal within iPlant. In addition,
IMG at JGI and MG-RAST at Argonne National Laboratory are genomic and metagenomic
resources that have been heavily leveraged by the larger community.
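The metadata gap described above could be caught at submission time with a completeness check. In this sketch, the required fields are a small subset loosely modeled on GSC MIxS-style checklists; the real standards define many more fields, controlled vocabularies, and per-environment packages, and the sample record is invented:

```python
# Hypothetical required-field list, loosely modeled on GSC MIxS-style
# checklists (the actual standards are far more extensive).
REQUIRED_FIELDS = ["project_name", "collection_date", "lat_lon",
                   "geo_loc_name", "env_biome", "seq_meth"]

def missing_metadata(sample):
    """Return required fields that are absent or empty in `sample`."""
    return [f for f in REQUIRED_FIELDS
            if not str(sample.get(f, "")).strip()]

# An invented submission that omits two required fields.
sample = {
    "project_name": "Station ALOHA time series",
    "collection_date": "2015-08-27",
    "lat_lon": "22.75 N 158.0 W",
    "seq_meth": "Illumina HiSeq 2500",
}
print(missing_metadata(sample))  # ['geo_loc_name', 'env_biome']
```

Rejecting or flagging submissions with missing contextual metadata at deposit time is far cheaper than trying to recover it after publication.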
Many data repositories, such as GenBank, are also limited in that they do not accept processed
data products and are not properly formatted for non-nucleic acid ‘omics data, such as
proteomic, glycomic, metabolomic, and lipidomic data. Currently, the largest available
metagenomic data integrations are provided through JGI’s IMG system and MG-RAST. Both systems
support the integration and analysis of a number of different ‘omics data sets, and support the
general community by annotating and analyzing user-submitted data. Due to their long-term
funding scheme, volume of existing data, and their position within the international community,
these systems are “pre-adapted” to be an integral part of a federated meta- ‘omics “cyber
superstructure”. A unique capability provided by IMG is the scale of processed data publicly
available. While IMG is not a centralized resource for raw data storage, it has the potential to serve
as a central resource for assembled metagenomic data sets. There is currently an effort to
assemble and annotate a large part of the raw metagenomics data available through SRA and
integrate them with the metadata curation effort through the Genomes OnLine Database
(GOLD). This provides a good example of how current data centers could serve as specialized
hubs in a federated, interoperable alliance, each providing different data products – in this case
with IMG serving as a central repository for metagenome assemblies.
Notably, iPlant has recently expanded their scope to the broader (non-plant) life science community, which is
compatible with ECOGEO-related research questions.
Recommendations
Building Environmental ‘Omics Infrastructure for Earth Sciences
Enabling our community to build the necessary data discovery repositories, with federated and
efficient frameworks for data integration and interoperability, the establishment of best practices
and workflows, and the development of functional platforms for analysis, visualization, and
statistics.
Recommendation III. Data centers, databases, and analytical tools that address issues of
data discovery, scalability, and community HPC access should be further developed.
Data discovery through accessible repositories, semantic integration of associated metadata,
and scalable analyses are crucial for the ‘omics community to address many of the globally
significant and/or societally relevant questions. This level of data integration will require
ingenuity and collaboration between domain scientists, cyberinfrastructure developers,
statisticians, and visualization experts. As resources, tools, and experts become available, the
‘omics community should support the development of innovative ideas.
and disseminate educational training tools, such as training workflows, demonstration videos,
interactive workshops, and training courses. Effective knowledge transfer to the next generation
of ‘omics researchers, developers, and innovators will be necessary to position them to take
‘omics science into the future. Through EarthCube, ECOGEO will develop a foundation of
training videos, but proper development, assessment, and improvements will require support
and a large community effort.
Upcoming events
The ECOGEO RCN has several activities planned for the remaining year of NSF funding
(through August 2016). In addition to hosting a second workshop (late spring/early summer
2016) as funded in the original award (1440066), supplementary funds were granted by the NSF
Division of Ocean Sciences (OCE) for additional activities. In January/February, the ECOGEO
RCN will run a small working group focused on creating 12 complete EarthCube use cases.
Prior to Workshop I, participants were asked to submit use cases to be reviewed and discussed
during the workshop. Due to time constraints, we were only able to review six use cases, but we
are keen to work with the TAC Use Case Working Group to flesh out all 12 use cases, including
integration into EarthCube resources where possible, and then contribute them to the
EarthCube use case repository. In addition, the ECOGEO RCN will be hosting a Town Hall at
the 2016 ASLO/AGU/TOS Ocean Sciences Meeting in New Orleans, LA. The Town Hall will be
held on 25 February from 12:45-13:45 in the Ernest N. Morial Convention Center (217-219). The
Town Hall is intended to introduce the OSM community to EarthCube and the on-going efforts of
the ECOGEO RCN. Because we already have representation on the EarthCube Engagement
Team, several “Introduction to EarthCube” resources are already under development. Our final
workshop will focus on creating instructional webinars that demonstrate ‘omics tools and data
portals, as well as implementing the developed use cases. The main goal is to train the next
generation of ‘omics researchers and develop ways for them to integrate their research with
EarthCube’s on-going mission to enable data science through cyberinfrastructure.
Workshop I – Agenda (updated 15 Sep 2015)
Workshop Participant List (updated 14 Sep 2015)
We are really looking forward to having you join us in Hawai‘i on 27–28 August for the first
ECOGEO RCN workshop. In order to prepare for the workshop, the organizers (Ed, Elisha, and
the ECOGEO Steering Committee) would greatly appreciate having your research group
contribute a single Use Case related to your work in environmental ‘omics.
As our first workshop is focused on core issues in ‘omics research, many of the invited
participants (list available on the website) represent the senior research/PI level. Therefore, we
ask that you use this Use Case development opportunity to involve your research group in the
conversation. Below are a few points to help provide some direction, but don’t hesitate to
contact Elisha if you have questions or would like feedback.
1. Please draft a Use Case that highlights a current challenge/limitation for your ‘omics
research (see the provided Use Case as an example).
2. The provided Use Case represents a current big picture ‘omics question/challenge.
Depending on your Use Case, this may or may not be appropriate. Any level of focus
and/or complexity is welcome.
Prior to the workshop, we will review the submitted Use Cases with the aim of collecting and
preparing representative examples for 1) focused discussion on solving challenges and 2)
progressing each Use Case towards functionality in research and training.
Mahalo, looking forward to seeing you all very soon! Please refer to the website for logistics and
documents related to the workshop.
Cheers!
Elisha Wood-Charlson and Ed DeLong
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
1.
2.
3.
4.
5.
6.
7.
…
Critical Existing Cyberinfrastructure
o
o
o
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Critical Cyberinfrastructure Not in Existence
o
o
o
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Harriet Alexander (halexand@mit.edu)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Dylan Chivian (DCChivian@lbl.gov)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
1. Assembly and annotation of isolate genomes.
2. Assembly, annotation, binning, and assessment of MG-derived genomes.
3. Meta-transcriptomic abundance calculations against isolate and MG-derived genomes.
4. Trait-guild member assignment.
5. Integration of metabolomic and meta-metabolomic data into species models.
6. Time-series models of community adaptation.
7. Stats and visualization.
Critical Existing Cyberinfrastructure
o KBase/RAST/ModelSEED/MG-RAST, M-suite, QIIME, IMG, IMG/M, ggKbase,
iMicrobe, PathwayTools, MicrobesOnline, metaMicrobesOnline
o R, MeV, SparCC, kallisto, bowtie, Cytoscape (analysis and viz)
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Jack A. Gilbert: gilbertjack@gmail.com
Naseer Sangwan: nsangwan@anl.gov
Chris Marshall: chris.w.marshall@gmail.com
Melissa Dsouza: dsouzam@uchicago.edu
Pamela Weisenhorn: pweisenhorn@anl.gov
Basic Flow
1. Quality trimming and de novo assembly of shotgun metagenome datasets
2. Binning metagenome contigs into population genomes (pan-genomes)
3. Gene calling on contig bins representing population genomes
4. Identification of orthologous genes between population genomes
5. Cross-validation of orthologous genes (i.e., length cut-off, sequencing errors)
6. Calculating pairwise dN/dS and codon bias values
7. Normalization and calculation of pairwise correlation between dN/dS and codon bias
profiles
8. Demarcate and functionally characterize protein pairs with positive and/or negative selection
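Step 7 of this flow can be sketched as a Pearson correlation between a dN/dS profile and a codon-bias profile across a set of orthologous gene pairs. The numeric values below are made-up illustrations; real profiles would come from steps 5–6:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-ortholog-pair profiles (values invented for illustration).
dn_ds      = [0.10, 0.25, 0.80, 1.20, 0.05]   # selection pressure
codon_bias = [0.65, 0.55, 0.30, 0.20, 0.70]   # e.g. a CAI-like index

r = pearson(dn_ds, codon_bias)
print(round(r, 3))
```

A strongly negative correlation, as in this toy example, would be consistent with highly expressed, codon-optimized genes being under stronger purifying selection (cf. Ran et al. 2014, cited below).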
Activity Diagram
This can be targeted during the workshop
Problems/Challenges
1. How to access habitat-specific gene pool information?
Recommendation: Create a comprehensive portal that can store such datasets.
3. How to calculate accurate rates of evolution and codon bias on short protein sequences?
a. There are some methods, but they are not validated for the errors and biases introduced
during metagenome data analysis, e.g., length variation, average genome size variation, etc.
b. Recommendation: develop new methods to calculate and normalize the dN/dS
and codon bias profiles of population genomes, e.g., by accounting for average genome size
variation.
References
-Ran W, Kristensen DM, Koonin EV. (2014). Coupling Between Protein Level Selection and
Codon Usage Optimization in the Evolution of Bacteria and Archaea. mBio 5:e00956–14.
-Nielsen, R. (2005). Molecular signatures of natural selection. Annu Rev Genet. 39:197-218.
Notes
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Bethany Jenkins, University of Rhode Island
Joselynn Wallace, PhD candidate, University of Rhode Island
Basic Flow
1. Use global models predicting the role of nutrient limitation on primary production of key
phytoplankton taxa to select an oceanic region of interest.
2. Filter by depth horizon.
3. Retrieve historical macro- and micronutrient measurements collected from this region
and filter the data by concentration of a given nutrient.
4. Retrieve ‘omics datasets from this region (the crux of this pipeline: matching the
nutrient data with the ‘omics data and finding relevant ‘omics data).
5. Compile locations of nutrient measurements at a range of selected values with ‘omics
data availability (metagenomes and metatranscriptomes).
6. Determine from metagenomics data whether target organisms or taxa are present at target
nutrient values.
7. Filter metatranscriptome data by taxonomy to retrieve only transcripts from the target
taxonomic group (the second crux of the pipeline: this step needs to interface with
phylogenetics infrastructure).
8. Use downstream measures to search for specific genes (e.g. BLAST).
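The matching steps (3–5) amount to a filter-and-join over a shared sample identifier. A minimal sketch, in which all sample identifiers, field names, and threshold values are invented for illustration:

```python
# Hypothetical rows standing in for steps 3-5: nutrient measurements and
# 'omics dataset availability, keyed on a shared sample identifier.
nutrients = [
    {"sample_id": "S1", "depth_m": 5,  "fe_nM": 0.05},
    {"sample_id": "S2", "depth_m": 25, "fe_nM": 0.40},
    {"sample_id": "S3", "depth_m": 10, "fe_nM": 0.08},
]
omics = {"S1": "metagenome", "S3": "metatranscriptome"}  # step 4 lookup

# Steps 3-5: filter to low-iron, shallow samples, then join to the
# available 'omics datasets (the "crux" matching step in the flow).
matched = [
    {**row, "dataset": omics[row["sample_id"]]}
    for row in nutrients
    if row["fe_nM"] < 0.1 and row["depth_m"] <= 20 and row["sample_id"] in omics
]
print(matched)
```

In practice the nutrient rows would come from archives such as BCO-DMO or GEOTRACES and the ‘omics availability from repositories such as iMicrobe or EBI; the hard part this use case highlights is that no shared `sample_id` currently links those stores.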
Critical Existing Cyberinfrastructure
o World Ocean Database (Atlas) (https://www.nodc.noaa.gov/OC5/indprod.html)
o BCO-DMO (http://www.bco-dmo.org/)
o GEOTRACES International Data Assembly Center
o PANGAEA archive (http://doi.pangaea.de/10.1594/PANGAEA.840721)
o iMicrobe (http://imicrobe.us/)
o EBI metagenomics (https://www.ebi.ac.uk/metagenomics/)
o European Nucleotide Archive (http://www.ebi.ac.uk/ena)
o NCBI (http://www.ncbi.nlm.nih.gov/)
o QIIME (http://qiime.org/)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
1. Global map of nutrient limitation for diatoms
2. Define region of Fe limitation in N equatorial Atlantic
3. Query database of mixed-layer-depth samples from the specified region with measured Fe values below a specified level; return data with Fe and all other measured nutrient and profiling data (e.g. temp, salinity, etc.)
4. Query database (same or different) with metagenomics information that is cross-referenced to samples
5. Apply taxonomic filtering to data (requires an integrated pipeline for taxonomic classification)
6. Retrieve metatranscriptome data for samples containing taxonomic targets
Problems/Challenges (any barriers to successful completion of use case)
For each one, list
- The challenge
- What, if any, efforts have been undertaken to fix these problems?
- What recommendations do you have for tackling this problem?
1. Cross-referencing of data (e.g. with BCO-DMO): assign an “accession number” to each sample
that is propagated through all data records, so records can be housed in different databases but
search engines can query by record and then for specific types of associated data.
2. Discoverability of ‘omics data: data currently live in a variety of repositories (NCBI,
EBI, iMicrobe), and submissions don’t presently contain links to metadata records. ‘Omics
data may need to live in a separate mirrored repository to facilitate retrieval.
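The shared accession idea in challenge 1 can be sketched as a single key that resolves to repository-local records. Every identifier and field name below is invented for illustration; none are real accessions or repository APIs.

```python
# Sketch of a shared accession propagated through all data records:
# any repository-local record can be resolved from the one shared key.
# All identifiers below are invented; these are not real accessions.
CROSS_REF = {
    "ECOGEO-0001": {
        "bco_dmo": "dataset/XXXX",          # hypothetical BCO-DMO record id
        "sequence_archive": "RUN-XXXX",     # hypothetical sequence run id
        "env_metadata": {"lat": 22.75, "lon": -158.0, "depth_m": 25},
    },
}

def resolve(accession, repository):
    """Return the repository-local record for a shared sample accession."""
    return CROSS_REF[accession][repository]

print(resolve("ECOGEO-0001", "env_metadata")["depth_m"])
```

The point of the sketch is the indirection: search engines would query by the shared accession first, then fetch the specific type of associated data from whichever database houses it.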
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Robert M. Morris, University of Washington (morrisrm@uw.edu)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
1. Identify samples with matching datasets (physical, chemical, biological)
2. Download and retrieve appropriate datasets (omics, metals, nutrients, etc.)
3. Synchronize biological omics datasets (annotate using standard annotations)
4. Identify categories for comparison (CEG paths, EC numbers, taxonomy, etc.)
5. Extract data for comparative analyses
6. Determine genetic potential, gene regulation, and expressed protein functions
7. Multivariate analysis of biological activity with physical and chemical parameters
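Step 7 can be illustrated with the simplest possible case: correlating one gene-category abundance profile with one physical parameter across samples. All values below are invented; a real analysis would be genuinely multivariate (ordination, regression, etc.) rather than a single Pearson coefficient.

```python
import math

# Toy sketch of step 7: correlate a gene-category abundance profile
# with a physical parameter across four samples. Values are invented.
pathway_abundance = [10.0, 14.0, 22.0, 30.0]  # e.g. reads assigned to one pathway
temperature_c = [4.0, 8.0, 15.0, 22.0]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(pathway_abundance, temperature_c)
print(f"Pearson r = {r:.3f}")
```

The harder prerequisite, which the basic flow makes explicit, is steps 1–5: the correlation itself is trivial once samples carry matched physical, chemical, and biological records annotated against a common standard.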
Critical Existing Cyberinfrastructure
o **Standard annotation database developed by Mary Ann Moran
o Data archives (BCO-DMO, NCBI, MG-RAST, SILVA-RDP-Greengenes for 16S)
o Comet: An open source MS/MS sequence database search tool
o KBase: A systems biology knowledge base (mostly genomic at this point)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Jacob Waldbauer
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
Critical Existing Cyberinfrastructure
o Peptide-spectrum matching, spectral library searching and de novo sequencing
algorithms (of varying speed/parallelizability)**
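Spectral library searching, in its simplest form, scores a query MS/MS spectrum against a reference by cosine similarity over binned peaks. The sketch below is a toy illustration (bin width and peak values are invented); production search engines score candidate peptides against theoretical spectra with far more care.

```python
import math

def binned(spectrum, bin_width=1.0):
    """Sum peak intensities onto an integer m/z grid."""
    out = {}
    for mz, intensity in spectrum:
        key = int(mz // bin_width)
        out[key] = out.get(key, 0.0) + intensity
    return out

def cosine_score(spec_a, spec_b):
    """Cosine similarity of two (m/z, intensity) peak lists: ~1 = near-identical."""
    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

spec = [(100.2, 5.0), (200.7, 3.0), (350.1, 1.0)]
print(cosine_score(spec, spec))  # a spectrum scores ~1.0 against itself
```

The speed/parallelizability differences noted above come largely from how many candidate comparisons like this must be made per query spectrum, which is what makes these algorithms candidates for shared CI.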
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Summary of ECOGEO’s community survey
Overview
The main areas of ‘omics research currently being explored by our community are metagenomics, 16S/18S taxonomy, and correlating ‘omics data with environmental data (Figure 1). In addition, the majority of our research community regularly collects samples for processing (~85%), conducts in-depth analysis on the output data (~72%), and uses the data for comparative ‘omics (62%) (n=96, with more than one selection possible). However, our community’s engagement with ‘omics data ranges from doing limited analysis (~47%) to using the data to develop workflows (~40%).
Figure 1. Areas of 'omics research (n=97, more than one selection possible)
Accessing data
Most ‘omics users are able to submit data sets and associated metadata for archival, and to search reference databases by sequence similarity or annotation (Figure 2). However, we struggle to search by associated metadata/project characteristics, and we definitely face challenges in accessing unique data sets not in the main reference databases (a.k.a. “dark data”).
Figure 2. Resources already in use or would like to use.
Community Workflows
Feedback on idealized workflows provided good fodder for use case development, the need for
which was highlighted in Figure 2 (“would like to use” – case studies, interactive webinars).
Therefore, we are asking the 2015 workshop participants to submit a use case prior to the
workshop, one that highlights a current challenge in ‘omics research. We have provided a
template form and an example use case focused on using metadata to retrieve targeted data
sets for further exploration. During the workshop, we will discuss several representative use
cases to 1) highlight areas that we, as a community, need to focus on to move our research
forward, and 2) establish a repository of training tools for the next generation of ‘omics
researchers.
Barriers to Research
The general consensus on the barriers to ‘omics research moving forward is summarized in a
few key, big-picture points (below). During the 2015 ECOGEO workshop, we will tackle
these at a finer scale, in an attempt to move solutions forward.
1. Data standards – including quality measures and a way to index data sets that will
link samples to environmental metadata, across different types of ‘sequencing’, and
throughout various sequence analyses and annotations stages.
2. Central repository of raw and processed data (see #1) that is searchable (see Figure
2) and downloadable with compatible/standardized output, while also having online
tools and compute power for processing (and archiving) assemblies, comparative
analyses, annotations, visualizations, and statistics.
3. Regular annotation updates on existing databases with potential to request
notifications if data sets of interest gain new information.
4. Training – use cases/workflows, training webinars, user-friendly GUIs.
5. Last, but far from least – longevity!
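Barrier 1 (data standards and indexing) can be made concrete with one sample record that links raw data, environmental metadata, and processing provenance under a single identifier. Field names are illustrative only, loosely modeled on MIxS-style checklists, and are not a proposed standard.

```python
# A minimal sketch of barrier 1: a sample record whose shared identifier
# links raw sequence data to environmental metadata and to every
# processing/annotation stage. All values are invented for illustration.
sample_record = {
    "sample_id": "ECOGEO-2015-0042",  # invented identifier
    "collected": {"date": "2015-08-27", "lat": 21.3, "lon": -157.9, "depth_m": 25},
    "environment": {"temp_c": 24.1, "salinity_psu": 35.2},
    "sequencing": {"type": "metagenome", "raw_archive": None},  # filled on archival
    "processing": [
        {"step": "quality trimming", "tool": "example-trimmer", "version": "0.1"},
    ],
}

def has_minimum_fields(record):
    """Check the fields needed to cross-link a record between databases."""
    return all(k in record for k in ("sample_id", "collected", "sequencing"))

print(has_minimum_fields(sample_record))
```

A central repository (barrier 2) would validate records like this at submission time and append to the `processing` list as assemblies, annotations, and re-annotations (barrier 3) accumulate.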
Earth Cube Oceanography and Geobiology Environmental 'Omics
ECOGEO is a brand-new, NSF-funded Research Coordination Network (RCN) housed within the EarthCube platform. Please visit
http://workspace.earthcube.org/ecogeo for more information and to join our listserv!
The mission of this RCN is to identify community needs and develop necessary plans to create a federated cyberinfrastructure to enable ocean and
geobiology environmental ‘omics.
This survey is designed to address the first part of our mission. We are gathering information regarding the current usage of and community needs
for 'omics research in the oceanography and geobiology communities. This brief research survey should take 5-15 minutes of your time, depending
on your level of feedback.
Your participation is greatly appreciated, but also voluntary, and you can choose to not answer any question. This survey is anonymous and without
foreseeable risks to you for taking part in this survey. Please do not include any personal information in your responses. If you have any questions or
concerns regarding this survey, please contact Dr. Elisha Wood-Charlson at the University of Hawai'i at Manoa (ecogeo.rcn@gmail.com). If you
have questions regarding your rights as a participant, please contact the University of Hawai'i at Manoa Human Studies Program
(uhirb@hawaii.edu).
This study has been reviewed and approved by the University of Hawaii Institutional Review Board (#...).
*1. By selecting "Yes", you are indicating your consent to participate in this survey.
o Yes
o No
2. What area(s) of ‘omics research do you typically work in? (select all that apply)
o Genomics
o Single cell genomics
o Metagenomics
o Transcriptomics
o Metatranscriptomics
o Proteomics
o Metaproteomics
o Metabolomics
o Correlating ‘omics data with environmental data
o Phylogenetics
o 16S, 18S; Taxonomy
o Modeling
Other (please specify)
3. What area(s) of ‘omics sample and data processing do you typically engage in? (select
all that apply)
o Collect samples and process for sequencing
o Limited analysis of processed ‘omics data (e.g. post-QC/QA)
o In-depth analysis (e.g. single data set assembly, annotation, pathways, etc.)
o Workflow development
o Analytical and/or statistical tool development
o Use ‘omics data in modeling
o Comparative ‘omics (e.g. across ‘omic types, complex data sets, integration with metadata)
Other (please specify)
(Two checkbox columns per row: Would like to use / Already use)
o Submission of sequence data and metadata for archival services
o Access to unique data sets not available in other sequence repositories
o Search for user-submitted samples by description or project characteristics
o Search for user-submitted samples by sequence similarity (e.g. BLAST, RapSearch)
o Search for user-submitted samples by annotation (e.g. gene function, taxonomy)
o Search for data sets by metadata (e.g. latitude/longitude, date collected, lead PI)
o Access to reference datasets (e.g. non-redundant and RefSeq from NCBI)
o Case-studies for training
o Interactive webinars
Other resources, or additional comments (please specify)
(Two checkbox columns per row: Would like to use / Already use)
o Initial data processing (e.g. QC/QA, trimming)
o BLAST and BLAST-like workflows (e.g. RapSearch)
o Assembly tools (e.g. RayMeta, Newbler)
o Annotation tools (e.g. Pfam, COG/KOG, TIGRFAM, NCBI’s PRK)
o Phylogenetically-based annotation services (e.g. MEGAN)
o Workflow pipelines (e.g. Clustering, RAMMCAP, Redundancy filter)
o Comparative pathway analysis (e.g. KEGG, Pfam)
o Statistical tools
o Visualization tools
Other resources, or additional comments (please specify)
6. If you currently have favorite tools/resources, please list them and explain why they are
working for you.
7. To put the previous questions in a research context, please describe your idealized data
analysis workflow that would best achieve your main science goals using omics data sets.
What do you want ‘omics data to do in order to answer your scientific questions?
8. Please identify the community needs for storage, management, analysis, sharing, integration, and visualization of ‘omic data that you feel are immediate vs. should be considered in future development with a longer-term vision.
(Two checkbox columns per row: Immediate / Long-term)
o Storage of raw data (akin to the NCBI Short Read Archive for sequence data)
o Storage of processed data (e.g., translated proteins or assembled contigs)
o Storage of data used for biological inference (e.g., differential gene/protein expression)
o Linking different ‘omics for a single sample
o Sustainable curation
o Access to high-performance computational resources
o Access to user-submitted data
o Analysis workflows
o Annotation tools
o Comparative pathway tools
o Comparative ‘omics tools
o Statistical tools
o Visualization tools
o Case-studies for training
Other (please specify)
10. Please comment on what you perceive to be the PRIMARY NEEDS surrounding ‘omics research for the oceanography and geobiology communities.