Workshop I Report
27-28 August 2015, University of Hawai‘i at Mānoa, Honolulu, HI
Report: 24 November 2015
Table of Contents
Executive Summary Page 1
ECOGEO – a community focused on solutions Page 2
Workshop Goals Page 2
Workshop Outcomes Page 3
I. Summary of workshop structure
II. Grand ‘omics science challenges
III. Specific cyberinfrastructure needs
IV. Leveraging and expanding existing infrastructures
V. Enabling and encouraging big data best practices
Recommendations
Building Environmental ‘Omics Infrastructure for Earth Sciences Page 9
Upcoming events Page 11
Appendices Page 12
Appendix I: Workshop I Agenda
Appendix II: Participant List
Appendix III: Participant Use Cases and Use Case Template
Appendix IV: Community Survey and Summary
Resources
ECOGEO RCN website: http://earthcube.org/group/ecogeo
Workshop website: http://cmore.soest.hawaii.edu/rcn2015/
Workshop agenda: http://cmore.soest.hawaii.edu/rcn2015/agenda.htm (also Appendix I)
Workshop webinar: https://vimeo.com/uhcmore/review/138035693/5223a63a63
Environmental ‘omics Resource Viewer: http://pivots.azurewebsites.net/ecogeo.html
Executive Summary
The aim of ECOGEO’s first workshop was to enable domain scientists and cyberinfrastructure
experts to collaboratively discuss grand challenges in ‘omics science (Outcome II) and explore
use cases that translate those challenges into cyberinfrastructure needs (Outcome III). The
group also worked to outline existing resources, brainstormed how to best leverage and expand
those resources (Outcome IV), and discussed ways to better establish best practices in the
community (Outcome V).
The workshop hosted over 50 participants from more than 20 universities, three national labs,
four cyberinfrastructure centers, two NSF-funded data resources, and three EarthCube-funded
projects. At the conclusion of the workshop, participant comments were unified and optimistic
about how best to move forward as a community. Scientists and cyberinfrastructure experts
were able to identify common ground and consensus on how to address some of the core
‘omics science challenges. This report summarizes discussions and synthesis activities that
generated the following recommendations, aimed at overcoming cyberinfrastructure challenges
in environmental ‘omics research.
III. Data centers, databases, and analytical tools that address issues of data discovery,
scalability, and community HPC access should be further developed.
IV. Data visualization and statistical analytical frameworks should be integrated into
standard ‘omics analyses workflows and software.
V. The cyberinfrastructure should enable and encourage “big data” best practices and
standards for the community.
VI. The ‘omics and cyberinfrastructure communities should enable and provide a platform
for future ‘omics research via streamlined, accessible, state-of-the-art education, training
tools, and best practices.
This report aims to provide a foundation of community support for a federated platform of
interoperable cyberinfrastructures for oceanography and geobiology environmental ‘omics
research. With support for improved cyberinfrastructure, interdisciplinary collaborations, and
best practices and training, the ‘omics community (domain scientists and cyberinfrastructure
experts) is very well positioned to move environmental ‘omics research into the future.
1
24 November 2015
The ECOGEO research community spans an array of disciplines, but is united in developing
and applying ‘omics technologies and bioinformatic approaches to address core questions
relating the interplay of biological, geological, and chemical processes. Investigations range
from high-throughput sequencing of microbial community DNA to assess taxon, gene, and
metabolic pathway distributions across samples (metagenomics), monitoring the expression of
genes and/or proteins in a variety of environmental settings (metatranscriptomics and
proteomics, respectively), and measuring the distribution and significance of metabolites and
lipids in organisms and the environment (metabolomics, lipidomics). These methods enable
researchers in biological oceanography, biogeochemistry, organic geochemistry, microbial
oceanography, and geobiology to explore and inter-relate the biological, geological, and
chemical (biogeochemical) world in hitherto unprecedented depth and detail. This general
approach requires considerable computational hardware and software infrastructures, which
rely on high performance computing, advanced networking and database capabilities, and
collaboration with computer scientists, bioinformaticians, software engineers, computational
biologists, and interdisciplinary support from both government and private funding agencies
and foundations.
Overall goals
Create and sustain a strategic network and community of field and cyber scientists to
explore new facets of ‘omics data.
Articulate needs, challenges, and practical solutions that address: 1) development of
cyberinfrastructure, 2) integration and implementation of workflows, and 3) database and
resource sustainability to support ocean and geobiology environmental ‘omics research.
Develop a community-based framework that integrates best practices for sharing, curation,
and analysis of ‘omics data, with associated “metadata”, and facilitates collaboration and
training among environmental microbiology, geobiology, and computer science disciplines.
Workshop Goals
Highlight core science and technology drivers for research using environmental ‘omics
The field of environmental ‘omics requires close collaboration between domain science and
technology/cyberinfrastructure. Therefore, one of ECOGEO’s main workshop goals was to
discuss key drivers from the perspective of current challenges and cyberinfrastructure needs.
These drivers were identified during review of the original end-user workshop documents,
participants use cases developed for this workshop (Appendix III), and community feedback
from the ECOGEO survey (Appendix IV).
Identify solutions: Based on these challenge-focused science and technology drivers, the
workshop participants also discussed ways to leverage existing resources and described gaps
where new solutions could be built.
EarthCube Context
The workshop leveraged numerous EarthCube resources to enhance our discussions. Several
active members of EarthCube governance and funded projects were in attendance, including
(alphabetical by last name, also available in Appendix II) Emma Aronson, Science Committee;
Basil Gomez, Leadership Council; Danie Kinkade, Leadership Council; Ouida Meier,
CReSCyNT RCN; Ken Rubin, Science Committee; Elisha Wood-Charlson, Engagement and
Liaison Teams, Science Committee, ECOGEO RCN; and Ilya Zaslavsky, GEAR Conceptual
Design, Technology and Architecture Committee, CINERGI Building Block.
In addition, ECOGEO has contributed, and will continue to contribute, to EarthCube’s vision. The 2nd year of
our RCN will focus on integrating the ‘omics community and our collective resources into the
broader EarthCube infrastructure, with the goal of creating sustainable contributions through
future EarthCube-funded projects. Thus far, we have compiled a dozen domain science use
cases, which will be refined at a follow-up working group meeting in early 2016 (see Path Forward),
and we have been actively engaged with CINERGI to expand our environmental ‘omics resource
viewer. Throughout year 2, ECOGEO will continue to have representation in EarthCube
governance, including revision of best practices and core documents, as well as
recommendations for future EarthCube-funded projects. Finally, we will communicate outcomes from our
workshops to the EarthCube community through dissemination of reports, and continue to inform
the broader ‘omics community of EarthCube through society town hall sessions.
Workshop Outcomes
I. Summary of workshop structure
This workshop focused on understanding the key science and technology drivers in the field and
the development of use cases to identify resource gaps, as well as to highlight their potential for
training. The workshop was organized into several breakout groups, with time for reporting and
discussion amongst all participants (please see Appendix I for the full agenda). The first series
of breakouts on science/tech drivers focused on grand ‘omics science challenges:
Geospatial & temporal registry for 'omics data across scales, led by D. Kinkade, V. Orphan
Tracking synoptic ‘omics data products, led by B. Hurwitz, N. Kyrpides
Integrated modeling of organisms (‘omics) and environmental dynamics, led by M. Follows,
N. Levine
The next set of breakouts focused on a subset of participant submitted use cases with the aim
of extracting the overarching science challenges and cyberinfrastructure needs in ‘omics
research. Representative use cases were grouped by theme and are available in Appendix III.
“Google Earth” ‘omics, led by E. Allen. Use cases by H. Alexander and B. Jenkins.
Linking function to biogeochemical cycling in space/time, led by M. Saito. Use cases by R.
Morris and J. Waldbauer.
Using ‘omics for evolution/trait-based studies, led by E. Aronson. Use cases by D. Chivian
and J. Gilbert.
The final discussion focused on the potential for and limitations of existing cyberinfrastructure.
The aim of this session was to move beyond just identifying gaps in existing resources to
proposing solutions as a community to address specific current and future cyberinfrastructure
needs in ‘omics research. Breakout leads included J. Heidelberg, B. Jenkins, and D. Kinkade.
Within the issue of scaling lies the fundamental challenge of understanding how biological
processes interact with these scalable environmental data layers. For example, from an
environmental (microbial) ‘omics perspective, the classic ecological metrics of alpha and beta
species diversity may no longer be entirely relevant, in part because microbes do not follow typical
speciation criteria. Therefore, the context of environmental interactions and functional diversity
(ability to fix nitrogen, utilize low abundance iron, etc.) may be more relevant from a
biogeochemist’s perspective than defining which strain of microbe is present (Use Case:
Chivian, Gilbert, Waldbauer). In particular, trait-based metrics as opposed to taxonomic criteria
may be more important when considering globally significant and/or societally relevant
questions such as, “how are microbe-microbe and microbe-environment interactions impacting
global biogeochemical cycles?” and “how can that information be used to improve climate
models and projections of change?”. Currently, there is a disconnect between model predictions
and data-driven observations. Therefore, we need new ways to enable more iterative
observations, hypothesis generation, and hypothesis testing cycles. These big picture
challenges require a fundamental change in the structure, availability, and scale of ‘omics-
enabling cyberinfrastructures.
One of the most acute needs articulated by the ECOGEO community is a new model of
sequence data repository that facilitates data sharing and data discovery of primary data and
any associated environmental and sample processing data (“metadata”), as well as links to
other data products (and their provenance), analytical software and workflows, and the
infrastructures required to implement them. Such a repository would store sequence read data
with its associated metadata in a manner that would allow seamless and simultaneous queries
of metadata fields and their corresponding sequence reads (and vice versa). A similar repository
for environmental proteomic mass spectrometry data was also identified as a core need. Such
repositories would be invaluable tools for data discovery. They would promote efficient computation,
reduce the requirement for transferring large data sets, and facilitate the
development of downstream analysis pipelines that could be shared and standardized. Beyond
repositories for raw sequence reads and mass spectrometry data, the community also needs a
federated, searchable repository for data products being used as a basis for biological inference
in publications, which would greatly enhance comparative analyses between studies. These
products include such things as phylogenetically resolved population level genome fragments
assembled from metagenomes, gene/protein expression data from metatranscriptome/proteome
analyses, and sequence alignments underlying phylogenetic inferences. In addition to
searchable data and data product repositories, analysis platforms that enable large-scale
metagenomic data comparisons were identified as another core “ocean ‘omics”
cyberinfrastructure requirement. Such platforms would support data searches driven by
taxonomy and physicochemistry, thus making such large-scale comparisons feasible.
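The joint metadata-and-reads query such a repository would support can be sketched with a small in-memory index. All accessions, field names, and records below are hypothetical illustrations, not an existing API:

```python
# Minimal sketch of a metadata-indexed sequence store: each dataset keeps its
# reads alongside queryable metadata, so a single call can filter on
# environmental fields and return the matching read sets (and vice versa).
# All field names and records here are hypothetical.

class OmicsRepository:
    def __init__(self):
        self.datasets = []  # each: {"accession", "reads", "metadata"}

    def add(self, accession, reads, metadata):
        self.datasets.append(
            {"accession": accession, "reads": reads, "metadata": metadata}
        )

    def query_metadata(self, **filters):
        """Return datasets whose metadata satisfies every filter predicate."""
        hits = []
        for ds in self.datasets:
            md = ds["metadata"]
            if all(k in md and pred(md[k]) for k, pred in filters.items()):
                hits.append(ds)
        return hits

repo = OmicsRepository()
repo.add("ENV0001", ["ACGT..."],
         {"depth_m": 25, "temperature_c": 24.1, "biome": "open ocean"})
repo.add("ENV0002", ["GGCA..."],
         {"depth_m": 1000, "temperature_c": 4.2, "biome": "open ocean"})

# "All open-ocean datasets shallower than 100 m" in a single query:
shallow = repo.query_metadata(biome=lambda b: b == "open ocean",
                              depth_m=lambda d: d < 100)
print([ds["accession"] for ds in shallow])  # ['ENV0001']
```

A production system would back this with an indexed database and a controlled metadata vocabulary, but the query pattern is the same.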
However, this vision of the next generation of sequence data repositories goes beyond aggregating
disparate data sets currently housed in dispersed data resources (e.g. NCBI, IMG, iPlant, EBI,
MG-RAST, BCO-DMO, etc.). They must also consider “dark data”, data already available in the
public domain but not readily discoverable via the commonly accessed databases. Data
mining through web crawls focused on primary scientific literature would significantly extend the
volume of data gathered to promote comparative ‘omics investigations.
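One step in such literature mining can be sketched as a scan of free text for INSDC-style accession identifiers. The patterns below cover common SRA run (SRR/ERR/DRR) and BioProject (PRJNA/PRJEB/PRJDB) formats only; a real crawler would handle many more identifier schemes, and the example text is invented:

```python
import re

# Scan free text (e.g. a paper's methods section) for accession-like
# identifiers pointing at deposited sequence data. Patterns cover SRA run
# and BioProject identifier formats; other schemes are omitted for brevity.
ACCESSION_RE = re.compile(
    r"\b(?:SRR|ERR|DRR)\d{6,}\b|\b(?:PRJNA|PRJEB|PRJDB)\d+\b"
)

def find_accessions(text):
    """Return the unique accession-like identifiers found in `text`."""
    return sorted(set(ACCESSION_RE.findall(text)))

methods_text = (
    "Raw reads were deposited in the SRA under BioProject PRJNA123456 "
    "(runs SRR1234567 and SRR1234568)."
)
print(find_accessions(methods_text))
# ['PRJNA123456', 'SRR1234567', 'SRR1234568']
```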
Once ‘omics data are available in a suitable federated infrastructure with queryable and
standardized metadata, it should be possible to pose a striking diversity of hypotheses. Analysis
tools, workflows, visualization, and statistics are critical to making sense of ‘omics data. Many
analysis tools are disparate (developed by individual research groups), which can make them
difficult to capture in a single workflow. Furthermore, most tools require computational
experience to run, and are not well vetted by the community (i.e., which is the best tool for certain
data types and why) or maintained once a developer moves on. This computational climate
prevents the continual improvement and vetting of existing tools by the community, and it
produces an ever-expanding catalog of programs that are difficult to maintain and come up to speed on. This
situation is particularly problematic for researchers without a bioinformatics team. In addition,
programs are often developed independently without workflows in mind. This leads to disparate
output and formats that are not easily bridged between tools without scripting skills to reformat
resulting data and therefore cannot be easily merged into user-defined workflows.
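The "glue" scripting described here typically means reshaping one tool's tabular output into the columns another tool expects. As an illustration, a BLAST-style 12-column tabular record (outfmt 6) is reduced to the three fields a hypothetical downstream tool needs; the input row and field selection are invented:

```python
import csv, io

# Standard BLAST tabular (outfmt 6) column order.
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send",
                  "evalue", "bitscore"]

def blast6_to_csv(blast6_text, wanted=("qseqid", "sseqid", "evalue")):
    """Convert tab-separated BLAST outfmt 6 lines to CSV with a header,
    keeping only the `wanted` columns."""
    rows = []
    for line in blast6_text.strip().splitlines():
        record = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        rows.append({k: record[k] for k in wanted})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(wanted))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# One hypothetical hit: a read matching a nifH reference sequence.
blast_hit = "read_01\tnifH_ref\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-50\t550"
print(blast6_to_csv(blast_hit))
```

Standardized, self-describing output formats would make this per-pair conversion code unnecessary, which is the point of the federated toolkit described next.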
The ideal platform would promote a federated toolkit with inter-operable and standardized
output formats to enable domain scientists to answer their science questions, as well as
encourage continued technology and cyberinfrastructure development. Quantitative Insights Into
Microbial Ecology (QIIME) and the associated Qiita database have recently been adopted by a
large community as platforms for federated ribosomal RNA (rRNA) tag studies, with open
access tools and a well-supported, community-run helpdesk. QIIME provides online tutorials that
allow for community adoption, resulting in a large user base. However, the rRNA tag data
set analyses for which QIIME is designed have low complexity, and require comparatively less
processing than metagenomic data sets (rendering development and use of these analytical
tools much more straightforward). For large-scale meta-‘omic analyses, different cyber
solutions will be required. A few other analysis platforms exist, such as DNAnexus, which
supports the biomedical community, and the bioinformatics apps for the life sciences that exist in the
iPlant “ecosystem”.
Finally, statistical and visualization tools are necessary for researchers to explore data and draw
conclusions related to their core science questions. During the workshop, we were able to start
this conversation with big data statistics and visualization experts. The main take-home
message was that domain scientists should not struggle with these challenges alone. Just as
our community has grown to educate and include bioinformaticians in the development of our
research plans, it is evident that we need to expand our collaborations to also include the big
data statistics and visualization experts.
discussing these needs and identifying existing resources that might be leveraged to
accomplish our research goals.
One of the core issues with ‘omics data is its sheer size. Moving large-scale sequence data sets
requires significant network bandwidth and access to network platforms, such as Globus and
Internet2. It is estimated that the data we have today represents only 10% of the total data
available in the next 5 years, given improvements in sequencing capacity and advances in the
throughput of non-nucleic acid data products. While data storage and networking and
communications infrastructures will continue to evolve to help meet these “big data” needs, they
also need to serve a variety of end-users, from raw novices to domain experts, as well as the
specialized requirements of educational, survey and monitoring, and policy-driven programs
and communities.
Analyzing and interpreting large collections of data will likely require a collaborative and
federated approach. Cloud-based computing approaches hold great promise, but there are a
number of issues that need to be addressed by the community, and the current economics of
commercial cloud solutions do not appear scalable for a large and diverse community. For
larger and more well-established institutions, such as the Joint Genome Institute (JGI),
iPlant/iMicrobe, Broad, J. Craig Venter Institute (JCVI), Sanger, etc., maintaining dedicated
computing resources makes sense, while for small labs it probably does not. For most
intermediate size groups, a hybrid approach may be optimal, with some dedicated resources
coupled with access to Cloud-based resources.
In this context, federated infrastructure virtual machines (including lightweight containers) will
likely be a central avenue for providing easy-to-use analysis tools. These have the potential to
democratize access to software suites that are too complex to install for researchers without
dedicated computational support staff; such researchers may be best served by
access to online analysis tools offered by groups such as JGI, KBase, and iPlant/iMicrobe.
Common APIs and architectures with such virtual machines will help forge the links for an
interoperable and federated infrastructure. In addition, existing EarthCube “dark data” discovery
projects, such as DeepDive, can be used to identify published data that are not in a public
repository. The imagined next generation data repositories will likely be built on a federated
cross-agency structure that ties together data from different providers into a common
framework. Data collections from public resources, such as iMicrobe and the International
Nucleotide Sequence Database Collaboration (INSDC), which includes Sequence Read Archive
(SRA), GenBank, European Nucleotide Archive (ENA), and Integrated Microbial Genomes
(IMG), are currently the most used, robust and sustainable cyber and meta- ’omics resources.
These should definitely be integral, federated players in the context of any proposed meta-
‘omics “cyber superstructure”.
Presently, researchers in the ECOGEO community deposit raw sequence data into the SRA as
part of the National Center for Biotechnology Information’s (NCBI) GenBank service, which
currently houses over 19,000 environmental genomic data sets totaling > 15 TB. This resource,
however, is not easily searchable and thus prevents the integration of data sets across projects
and limits the possibility of ecosystem level analyses. Further, SRA files at NCBI often do not
contain a sufficient description of a sample’s “metadata”, which ideally includes information on
the sampled environment, sample collection, processing, and data generation. This contextual
metadata, in addition to the oceanographic data in BCO-DMO, is essential if data sets are to be
intercomparable. The Genome Standards Consortium (GSC) has established baselines for
describing genomic, metagenomic, metatranscriptomic, and amplicon sequence data
(discussed in the next section). Previously, the ocean ‘omics community relied heavily on the
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis
(CAMERA) database for data discoverability, but this platform was discontinued in 2014. The
CAMERA data sets have been transferred into iMicrobe, a sub-portal within iPlant. In addition,
IMG at JGI and MG-RAST at Argonne National Laboratory are genomic and metagenomic
resources that have been heavily leveraged by the larger community.
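The metadata gap described above could be caught at submission time with a completeness check. In this sketch, the required fields are a small subset loosely modeled on GSC MIxS-style checklists; the real standards define many more fields, controlled vocabularies, and per-environment packages, and the sample record is invented:

```python
# Hypothetical required-field list, loosely modeled on GSC MIxS-style
# checklists (the actual standards are far more extensive).
REQUIRED_FIELDS = ["project_name", "collection_date", "lat_lon",
                   "geo_loc_name", "env_biome", "seq_meth"]

def missing_metadata(sample):
    """Return required fields that are absent or empty in `sample`."""
    return [f for f in REQUIRED_FIELDS
            if not str(sample.get(f, "")).strip()]

# An invented submission that omits two required fields.
sample = {
    "project_name": "Station ALOHA time series",
    "collection_date": "2015-08-27",
    "lat_lon": "22.75 N 158.0 W",
    "seq_meth": "Illumina HiSeq 2500",
}
print(missing_metadata(sample))  # ['geo_loc_name', 'env_biome']
```

Rejecting or flagging submissions with missing contextual metadata at deposit time is far cheaper than trying to recover it after publication.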
Many data repositories, such as GenBank, are also limited in that they do not accept processed
data products and are not properly formatted for non-nucleic acid ‘omics data, such as
proteomic, glycomic, metabolomic, and lipidomic data. Currently, the largest available
metagenomic data integrations are provided through JGI’s IMG system and MG-RAST. Both systems
support the integration and analysis of a number of different ‘omics data sets, and support the
general community by annotating and analyzing user-submitted data. Due to their long-term
funding scheme, volume of existing data, and their position within the international community,
these systems are “pre-adapted” to be an integral part of a federated meta- ‘omics “cyber
superstructure”. A unique capability provided by IMG is the scale of processed data publicly
available. While IMG is not a centralized resource for raw data storage, it has the potential to serve
as a central resource for assembled metagenomic data sets. There is currently an effort to
assemble and annotate a large part of the raw metagenomics data available through SRA and
integrate them with the metadata curation effort through the Genomes OnLine Database
(GOLD). This provides a good example of how current data centers could serve as specialized
hubs in a federated, interoperable alliance, each providing different data products – in this case
with IMG serving as a central repository for metagenome assemblies.
Notably, iPlant has recently expanded their scope to the broader (non-plant) life science community, which is
compatible with ECOGEO-related research questions.
Recommendations
Building Environmental ‘Omics Infrastructure for Earth Sciences
Enabling our community to build the necessary data discovery repositories, with federated and
efficient frameworks for data integration and interoperability, the establishment of best practices
and workflows, and the development of functional platforms for analysis, visualization, and
statistics.
Recommendation III. Data centers, databases, and analytical tools that address issues of
data discovery, scalability, and community HPC access should be further developed.
Data discovery through accessible repositories, semantic integration of associated metadata,
and scalable analyses are crucial for the ‘omics community to address many of the globally
significant and/or societally relevant questions. This level of data integration will require
ingenuity and collaboration between domain scientists, cyberinfrastructure developers,
statisticians, and visualization experts. As resources, tools, and experts become available, the
‘omics community should support the development of innovative ideas.
and disseminate educational training tools, such as training workflows, demonstration videos,
interactive workshops, and training courses. Effective knowledge transfer to the next generation
of ‘omics researchers, developers, and innovators will be necessary to position them to take
‘omics science into the future. Through EarthCube, ECOGEO will develop a foundation of
training videos, but proper development, assessment, and improvements will require support
and a large community effort.
Upcoming events
The ECOGEO RCN has several activities planned for the remaining year of NSF funding
(through August 2016). In addition to hosting a second workshop (late spring/early summer
2016) as funded in the original award (1440066), supplementary funds were granted by the NSF
Division of Ocean Sciences (OCE) for additional activities. In January/February, the ECOGEO
RCN will run a small working group focused on creating 12 complete EarthCube use cases.
Prior to Workshop I, participants were asked to submit use cases to be reviewed and discussed
during the workshop. Due to time constraints, we were only able to review six use cases, but we
are keen to work with the TAC Use Case Working Group to flesh out all 12 use cases, including
integration into EarthCube resources where possible, and then contribute them to the
EarthCube use case repository. In addition, the ECOGEO RCN will be hosting a Town Hall at
the 2016 ASLO/AGU/TOS Ocean Sciences Meeting in New Orleans, LA. The Town Hall will be
held on 25 February from 12:45-13:45 in the Ernest N. Morial Convention Center (217-219). The
Town Hall is intended to introduce the OSM community to EarthCube and the on-going efforts of
the ECOGEO RCN. Because we already have representation on the EarthCube Engagement
Team, several “Introduction to EarthCube” resources are already under development. Our final
workshop will focus on creating instructional webinars that demonstrate ‘omics tools and data
portals, as well as implementing the developed use cases. The main goal is to train the next
generation of ‘omics researchers and develop ways for them to integrate their research with
EarthCube’s on-going mission to enable data science through cyberinfrastructure.
Workshop I – Agenda (updated 15 Sep 2015)
Workshop Participant List (updated 14 Sep 2015)
We are really looking forward to having you join us in Hawai‘i on 27–28 August for the first
ECOGEO RCN workshop. In order to prepare for the workshop, the organizers (Ed, Elisha, and
the ECOGEO Steering Committee) would greatly appreciate having your research group
contribute a single Use Case related to your work in environmental ‘omics.
As our first workshop is focused on core issues in ‘omics research, many of the invited
participants (list available on the website) represent the senior research/PI level. Therefore, we
ask that you use this Use Case development opportunity to involve your research group in the
conversation. Below are a few points to help provide some direction, but don’t hesitate to
contact Elisha if you have questions or would like feedback.
1. Please draft a Use Case that highlights a current challenge/limitation for your ‘omics
research (see the provided Use Case as an example).
2. The provided Use Case represents a current big picture ‘omics question/challenge.
Depending on your Use Case, this may or may not be appropriate. Any level of focus
and/or complexity is welcome.
Prior to the workshop, we will review the submitted Use Cases with the aim of collecting and
preparing representative examples for 1) focused discussion on solving challenges and 2)
progressing each Use Case towards functionality in research and training.
Mahalo, looking forward to seeing you all very soon! Please refer to the website for logistics and
documents related to the workshop.
Cheers!
Elisha Wood-Charlson and Ed DeLong
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
1.
2.
3.
4.
5.
6.
7.
…
Critical Existing Cyberinfrastructure
o
o
o
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Critical Cyberinfrastructure Not in Existence
o
o
o
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Harriet Alexander (halexand@mit.edu)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Dylan Chivian (DCChivian@lbl.gov)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
1. Assembly and annotation of isolate genomes.
2. Assembly, annotation, binning, and assessment of MG-derived genomes.
3. Meta-transcriptomic abundance calculations against isolate and MG-derived genomes.
4. Trait-guild member assignment.
5. Integration of metabolomic and meta-metabolomic data into species models.
6. Time-series models of community adaptation.
7. Stats and visualization.
Critical Existing Cyberinfrastructure
o KBase/RAST/ModelSEED/MG-RAST, M-suite, QIIME, IMG, IMG/M, ggKbase,
iMicrobe, PathwayTools, MicrobesOnline, metaMicrobesOnline
o R, MeV, SparCC, kallisto, bowtie, Cytoscape (analysis and viz)
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Jack A. Gilbert: gilbertjack@gmail.com
Naseer Sangwan: nsangwan@anl.gov
Chris Marshall: chris.w.marshall@gmail.com
Melissa Dsouza: dsouzam@uchicago.edu
Pamela Weisenhorn: pweisenhorn@anl.gov
Basic Flow
1. Quality trimming and de novo assembly of shotgun metagenome datasets
2. Binning metagenome contigs into population genomes (pan-genomes)
3. Gene calling on contig bins representing population genomes
4. Identification of orthologous genes between population genomes
5. Cross-validation of orthologous genes (i.e., length cut-off, sequencing errors)
6. Calculating pairwise dN/dS and codon bias values
7. Normalization and calculation of pairwise correlation between dN/dS and codon bias
profiles
8. Demarcate and functionally characterize protein pairs with positive and/or negative selection
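Step 7 of this flow can be sketched as a Pearson correlation between a dN/dS profile and a codon-bias profile across a set of orthologous gene pairs. The numeric values below are made-up illustrations; real profiles would come from steps 5–6:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-ortholog-pair profiles (values invented for illustration).
dn_ds      = [0.10, 0.25, 0.80, 1.20, 0.05]   # selection pressure
codon_bias = [0.65, 0.55, 0.30, 0.20, 0.70]   # e.g. a CAI-like index

r = pearson(dn_ds, codon_bias)
print(round(r, 3))
```

A strongly negative correlation, as in this toy example, would be consistent with highly expressed, codon-optimized genes being under stronger purifying selection (cf. Ran et al. 2014, cited below).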
Activity Diagram
This can be targeted during the workshop
Problems/Challenges
1. How to access habitat-specific gene pool information?
Recommendation: Create a comprehensive portal that can store such datasets.
3. How to calculate accurate rates of evolution and codon bias on short protein sequences?
a. There are some methods, but they are not validated for the errors and biases introduced
during metagenome data analysis, e.g., length variation, average genome size variation, etc.
b. Recommendation: develop new methods to calculate and normalize the dN/dS
and codon bias profiles of population genomes, e.g., by accounting for average genome size
variation.
References
-Ran W, Kristensen DM, Koonin EV. (2014). Coupling Between Protein Level Selection and
Codon Usage Optimization in the Evolution of Bacteria and Archaea. mBio 5:e00956–14.
-Nielsen, R. (2005). Molecular signatures of natural selection. Annu Rev Genet. 39:197-218.
Notes
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Bethany Jenkins, University of Rhode Island
Joselynn Wallace, PhD candidate, University of Rhode Island
Basic Flow
1. Use global models predicting the role of nutrient limitation on primary production of key
phytoplankton taxa to select an oceanic region of interest.
2. Filter by depth horizon.
3. Retrieve historical macro- and micronutrient measurements collected from this region
and filter the data by concentration of a given nutrient.
4. Retrieve ‘omics datasets from this region (the crux of this pipeline: matching the
nutrient data with the ‘omics data and finding relevant ‘omics data).
5. Compile locations of nutrient measurements at a range of selected values with ‘omics
data availability (metagenomes and metatranscriptomes).
6. Determine from metagenomics data whether target organisms or taxa are present at target
nutrient values.
7. Filter metatranscriptome data by taxonomy to retrieve only transcripts from the target
taxonomic group (the second crux of the pipeline: this step needs to interface with
phylogenetics infrastructure).
8. Use downstream measures to search for specific genes (e.g. BLAST).
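The matching steps (3–5) amount to a filter-and-join over a shared sample identifier. A minimal sketch, in which all sample identifiers, field names, and threshold values are invented for illustration:

```python
# Hypothetical rows standing in for steps 3-5: nutrient measurements and
# 'omics dataset availability, keyed on a shared sample identifier.
nutrients = [
    {"sample_id": "S1", "depth_m": 5,  "fe_nM": 0.05},
    {"sample_id": "S2", "depth_m": 25, "fe_nM": 0.40},
    {"sample_id": "S3", "depth_m": 10, "fe_nM": 0.08},
]
omics = {"S1": "metagenome", "S3": "metatranscriptome"}  # step 4 lookup

# Steps 3-5: filter to low-iron, shallow samples, then join to the
# available 'omics datasets (the "crux" matching step in the flow).
matched = [
    {**row, "dataset": omics[row["sample_id"]]}
    for row in nutrients
    if row["fe_nM"] < 0.1 and row["depth_m"] <= 20 and row["sample_id"] in omics
]
print(matched)
```

In practice the nutrient rows would come from archives such as BCO-DMO or GEOTRACES and the ‘omics availability from repositories such as iMicrobe or EBI; the hard part this use case highlights is that no shared `sample_id` currently links those stores.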
Critical Existing Cyberinfrastructure
o World Ocean Database (Atlas) (https://www.nodc.noaa.gov/OC5/indprod.html)
o BCO-DMO (http://www.bco-dmo.org/)
o GEOTRACES International Data Assembly Center
o PANGAEA archive (http://doi.pangaea.de/10.1594/PANGAEA.840721)
o iMicrobe (http://imicrobe.us/)
o EBI metagenomics (https://www.ebi.ac.uk/metagenomics/)
o European Nucleotide Archive (http://www.ebi.ac.uk/ena)
o NCBI (http://www.ncbi.nlm.nih.gov/)
o QIIME (http://qiime.org/)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
1. Global map of nutrient limitation for diatoms
2. Define region of Fe limitation in N equatorial Atlantic
3. Query database of mixed-layer-depth samples from the specified region with measured Fe values below a specified level; return data with Fe and all other measured nutrient and profiling data (e.g. temp, salinity, etc.)
4. Query database (same or different) with metagenomics information that is cross-referenced to samples
5. Apply taxonomic filtering to data (requires an integrated pipeline for taxonomic classification)
6. Retrieve metatranscriptome data for samples containing taxonomic targets
Problems/Challenges (any barriers to successful completion of use case)
For each one, list
- The challenge
- What, if any, efforts have been undertaken to fix these problems?
- What recommendations do you have for tackling this problem?
1. Cross-referencing of data (e.g. with BCO-DMO): assign an “accession number” to each sample
that is propagated through all data records, so records can be housed in different databases but
search engines can query by record and then for specific types of associated data.
2. Discoverability of ‘omics data: data currently live in a variety of repositories (NCBI,
EBI, iMicrobe), and submissions don’t presently contain links to metadata records. ‘Omics
data may need to live in a separate mirrored repository to facilitate retrieval.
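The shared accession idea in challenge 1 can be sketched as a single key that resolves to repository-local records. Every identifier and field name below is invented for illustration; none are real accessions or repository APIs.

```python
# Sketch of a shared accession propagated through all data records:
# any repository-local record can be resolved from the one shared key.
# All identifiers below are invented; these are not real accessions.
CROSS_REF = {
    "ECOGEO-0001": {
        "bco_dmo": "dataset/XXXX",          # hypothetical BCO-DMO record id
        "sequence_archive": "RUN-XXXX",     # hypothetical sequence run id
        "env_metadata": {"lat": 22.75, "lon": -158.0, "depth_m": 25},
    },
}

def resolve(accession, repository):
    """Return the repository-local record for a shared sample accession."""
    return CROSS_REF[accession][repository]

print(resolve("ECOGEO-0001", "env_metadata")["depth_m"])
```

The point of the sketch is the indirection: search engines would query by the shared accession first, then fetch the specific type of associated data from whichever database houses it.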
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Robert M. Morris, University of Washington (morrisrm@uw.edu)
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
1. Identify samples with matching datasets (physical, chemical, biological)
2. Download and retrieve appropriate datasets (omics, metals, nutrients, etc.)
3. Synchronize biological omics datasets (annotate using standard annotations)
4. Identify categories for comparison (CEG paths, EC numbers, taxonomy, etc.)
5. Extract data for comparative analyses
6. Determine genetic potential, gene regulation, and expressed protein functions
7. Multivariate analysis of biological activity with physical and chemical parameters
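Step 7 can be illustrated with the simplest possible case: correlating one gene-category abundance profile with one physical parameter across samples. All values below are invented; a real analysis would be genuinely multivariate (ordination, regression, etc.) rather than a single Pearson coefficient.

```python
import math

# Toy sketch of step 7: correlate a gene-category abundance profile
# with a physical parameter across four samples. Values are invented.
pathway_abundance = [10.0, 14.0, 22.0, 30.0]  # e.g. reads assigned to one pathway
temperature_c = [4.0, 8.0, 15.0, 22.0]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(pathway_abundance, temperature_c)
print(f"Pearson r = {r:.3f}")
```

The harder prerequisite, which the basic flow makes explicit, is steps 1–5: the correlation itself is trivial once samples carry matched physical, chemical, and biological records annotated against a common standard.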
Critical Existing Cyberinfrastructure
o **Standard annotation database developed by Mary Ann Moran
o Data archives (BCO-DMO, NCBI, MG-RAST, SILVA-RDP-Greengenes for 16S)
o Comet: An open source MS/MS sequence database search tool
o KBase: A systems biology knowledge base (mostly genomic at this point)
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Use Case Template (revised from EarthCube version 1.1)
Summary Information Section
Contact(s)
Jacob Waldbauer
Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case
example)
Critical Existing Cyberinfrastructure
o Peptide-spectrum matching, spectral library searching and de novo sequencing
algorithms (of varying speed/parallelizability)**
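Spectral library searching, in its simplest form, scores a query MS/MS spectrum against a reference by cosine similarity over binned peaks. The sketch below is a toy illustration (bin width and peak values are invented); production search engines score candidate peptides against theoretical spectra with far more care.

```python
import math

def binned(spectrum, bin_width=1.0):
    """Sum peak intensities onto an integer m/z grid."""
    out = {}
    for mz, intensity in spectrum:
        key = int(mz // bin_width)
        out[key] = out.get(key, 0.0) + intensity
    return out

def cosine_score(spec_a, spec_b):
    """Cosine similarity of two (m/z, intensity) peak lists: ~1 = near-identical."""
    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

spec = [(100.2, 5.0), (200.7, 3.0), (350.1, 1.0)]
print(cosine_score(spec, spec))  # a spectrum scores ~1.0 against itself
```

The speed/parallelizability differences noted above come largely from how many candidate comparisons like this must be made per query spectrum, which is what makes these algorithms candidates for shared CI.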
Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may
be good candidate(s) for a community CI application.
Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be
expanded during the workshop.
Notes (any additional information that does not fit in a previous category)
Summary of ECOGEO’s community survey
Overview
The main areas of ‘omics research currently being explored by our community are metagenomics, 16S/18S taxonomy, and correlating ‘omics data with environmental data (Figure 1). In addition, the majority of our research community regularly collects samples for processing (~85%), conducts in-depth analysis on the output data (~72%), and uses the data for comparative ‘omics (62%) (n=96, with more than one selection possible). However, our community’s engagement with ‘omics data ranges from doing limited analysis (~47%) to using the data to develop workflows (~40%).
Figure 1. Areas of 'omics research (n=97, more than one selection possible)
Accessing data
Most ‘omics users are able to submit data sets and associated metadata for archival, and to search reference databases by sequence similarity or annotation (Figure 2). However, we struggle to search by associated metadata/project characteristics, and we definitely face challenges in accessing unique data sets not in the main reference databases (a.k.a. “dark data”).
Figure 2. Resources already in use or would like to use.
Community Workflows
Feedback on idealized workflows provided good fodder for use case development, the need for
which was highlighted in Figure 2 (“would like to use” – case studies, interactive webinars).
Therefore, we are asking the 2015 workshop participants to submit a use case prior to the
workshop, one that highlights a current challenge in ‘omics research. We have provided a
template form and an example use case focused on using metadata to retrieve targeted data
sets for further exploration. During the workshop, we will discuss several representative use
cases to 1) highlight areas that we, as a community, need to focus on to move our research
forward, and 2) establish a repository of training tools for the next generation of ‘omics
researchers.
Barriers to Research
The general consensus on the barriers to ‘omics research moving forward is summarized in a
few key, big-picture points (below). During the 2015 ECOGEO workshop, we will tackle
these at a finer scale, in an attempt to move solutions forward.
1. Data standards – including quality measures and a way to index data sets that will
link samples to environmental metadata, across different types of ‘sequencing’, and
throughout various sequence analyses and annotations stages.
2. Central repository of raw and processed data (see #1) that is searchable (see Figure
2) and downloadable with compatible/standardized output, while also having online
tools and compute power for processing (and archiving) assemblies, comparative
analyses, annotations, visualizations, and statistics.
3. Regular annotation updates on existing databases with potential to request
notifications if data sets of interest gain new information.
4. Training – use cases/workflows, training webinars, user-friendly GUIs.
5. Last, but far from least – longevity!
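Barrier 1 (data standards and indexing) can be made concrete with one sample record that links raw data, environmental metadata, and processing provenance under a single identifier. Field names are illustrative only, loosely modeled on MIxS-style checklists, and are not a proposed standard.

```python
# A minimal sketch of barrier 1: a sample record whose shared identifier
# links raw sequence data to environmental metadata and to every
# processing/annotation stage. All values are invented for illustration.
sample_record = {
    "sample_id": "ECOGEO-2015-0042",  # invented identifier
    "collected": {"date": "2015-08-27", "lat": 21.3, "lon": -157.9, "depth_m": 25},
    "environment": {"temp_c": 24.1, "salinity_psu": 35.2},
    "sequencing": {"type": "metagenome", "raw_archive": None},  # filled on archival
    "processing": [
        {"step": "quality trimming", "tool": "example-trimmer", "version": "0.1"},
    ],
}

def has_minimum_fields(record):
    """Check the fields needed to cross-link a record between databases."""
    return all(k in record for k in ("sample_id", "collected", "sequencing"))

print(has_minimum_fields(sample_record))
```

A central repository (barrier 2) would validate records like this at submission time and append to the `processing` list as assemblies, annotations, and re-annotations (barrier 3) accumulate.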
Earth Cube Oceanography and Geobiology Environmental 'Omics
ECOGEO is a brand-new, NSF-funded Research Coordination Network (RCN) housed within the EarthCube platform. Please visit
http://workspace.earthcube.org/ecogeo for more information and to join our listserv!
The mission of this RCN is to identify community needs and develop necessary plans to create a federated cyberinfrastructure to enable ocean and
geobiology environmental ‘omics.
This survey is designed to address the first part of our mission. We are gathering information regarding the current usage of and community needs
for 'omics research in the oceanography and geobiology communities. This brief research survey should take 5-15 minutes of your time, depending
on your level of feedback.
Your participation is greatly appreciated, but also voluntary, and you can choose to not answer any question. This survey is anonymous and without
foreseeable risks to you for taking part in this survey. Please do not include any personal information in your responses. If you have any questions or
concerns regarding this survey, please contact Dr. Elisha Wood-Charlson at the University of Hawai'i at Manoa (ecogeo.rcn@gmail.com). If you
have questions regarding your rights as a participant, please contact the University of Hawai'i at Manoa Human Studies Program
(uhirb@hawaii.edu).
This study has been reviewed and approved by the University of Hawaii Institutional Review Board (#...).
*1. By selecting "Yes", you are indicating your consent to participate in this survey.
o Yes
o No
2. What area(s) of ‘omics research do you typically work in? (select all that apply)
o Genomics
o Single cell genomics
o Metagenomics
o Transcriptomics
o Metatranscriptomics
o Proteomics
o Metaproteomics
o Metabolomics
o Correlating ‘omics data with environmental data
o Phylogenetics
o 16S, 18S; Taxonomy
o Modeling
Other (please specify)
3. What area(s) of ‘omics sample and data processing do you typically engage in? (select
all that apply)
o Collect samples and process for sequencing
o Limited analysis of processed ‘omics data (e.g. post-QC/QA)
o In-depth analysis (e.g. single data set assembly, annotation, pathways, etc.)
o Workflow development
o Analytical and/or statistical tool development
o Use ‘omics data in modeling
o Comparative ‘omics (e.g. across ‘omic types, complex data sets, integration with metadata)
Other (please specify)
(Two checkbox columns per row: Would like to use / Already use)
o Submission of sequence data and metadata for archival services
o Access to unique data sets not available in other sequence repositories
o Search for user-submitted samples by description or project characteristics
o Search for user-submitted samples by sequence similarity (e.g. BLAST, RapSearch)
o Search for user-submitted samples by annotation (e.g. gene function, taxonomy)
o Search for data sets by metadata (e.g. latitude/longitude, date collected, lead PI)
o Access to reference datasets (e.g. non-redundant and RefSeq from NCBI)
o Case-studies for training
o Interactive webinars
Other resources, or additional comments (please specify)
(Two checkbox columns per row: Would like to use / Already use)
o Initial data processing (e.g. QC/QA, trimming)
o BLAST and BLAST-like workflows (e.g. RapSearch)
o Assembly tools (e.g. RayMeta, Newbler)
o Annotation tools (e.g. Pfam, COG/KOG, TIGRFAM, NCBI’s PRK)
o Phylogenetically-based annotation services (e.g. MEGAN)
o Workflow pipelines (e.g. Clustering, RAMMCAP, Redundancy filter)
o Comparative pathway analysis (e.g. KEGG, Pfam)
o Statistical tools
o Visualization tools
Other resources, or additional comments (please specify)
6. If you currently have favorite tools/resources, please list them and explain why they are
working for you.
7. To put the previous questions in a research context, please describe your idealized data
analysis workflow that would best achieve your main science goals using omics data sets.
What do you want ‘omics data to do in order to answer your scientific questions?
8. Please identify the community needs for storage, management, analysis, sharing, integration, and visualization of ‘omic data that you feel are immediate vs. should be considered in future development with a longer-term vision.
(Two checkbox columns per row: Immediate / Long-term)
o Storage of raw data (akin to the NCBI Short Read Archive for sequence data)
o Storage of processed data (e.g., translated proteins or assembled contigs)
o Storage of data used for biological inference (e.g., differential gene/protein expression)
o Linking different ‘omics for a single sample
o Sustainable curation
o Access to high-performance computational resources
o Access to user-submitted data
o Analysis workflows
o Annotation tools
o Comparative pathway tools
o Comparative ‘omics tools
o Statistical tools
o Visualization tools
o Case-studies for training
Other (please specify)
10. Please comment on what you perceive to be the PRIMARY NEEDS surrounding ‘omics research for the oceanography and geobiology communities.