
2019 CEF Telecom Call - Public Open Data (CEF-TC-2019-2)

STIR Data: Specifications and Tools for Interoperable and Reusable Data

Concept note
Harmonisation is currently (partly) achieved in the European Data Portal (EDP) only at the level of cataloguing. The heterogeneity of data formats and standards (if any) followed by different data providers, as well as the poor-quality descriptions of the actual content, hinder the deployment of added-value services that would allow end-users to discover and make sense of the available open data. The main objective of this proposal is to deliver a set of specifications along with technological tools which can streamline and facilitate harmonisation at the level of content (i.e. records) of datasets and enable users to search and visualise data that is relevant to them. The project thus seeks, on the one hand, to assist data providers in improving the quality of their data by following a well-defined workflow supported by a user-friendly toolset; and, on the other, to offer advanced search capabilities to interested users, who can benefit from accessing data coming from different providers and countries in a homogeneous way, from a single entry point.

The harmonisation and search capabilities will be offered as a service which will integrate and customise state-of-the-art semantic technologies and tools for the purposes of the project. The
service will offer (a) semi-automatic capabilities for enriching and mapping datasets in different
formats and of different quality to an RDF (Resource Description Framework) representation; and
(b) federated search functionalities, which will support semantic queries on the content of these
datasets. Licensing conditions will be considered from both a legal and a technical perspective: legal experts in copyright law and in open-data terms of use will investigate the cross-border terms of use of the considered datasets and collaborate with technical experts so as to facilitate the mapping of datasets' licenses to established vocabularies.
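As a simple illustration of the technical side of this mapping, free-text licence fields could be normalised to canonical URIs before being attached to dataset descriptions. The sketch below is plain Python with a made-up lookup table; the established vocabulary actually targeted is a decision for the project's legal and technical experts:

```python
# Minimal sketch of licence normalisation: free-text licence fields as
# found on different portals, mapped to canonical URIs. The lookup table
# is a made-up example; the vocabulary to target is a project decision
# informed by the legal analysis.
LICENSE_MAP = {
    "cc-by 4.0": "https://creativecommons.org/licenses/by/4.0/",
    "cc by 4.0": "https://creativecommons.org/licenses/by/4.0/",
    "cc0": "https://creativecommons.org/publicdomain/zero/1.0/",
}

def normalise_license(raw):
    """Return a canonical licence URI, or None when manual/legal review is needed."""
    key = raw.strip().lower().replace("licence", "license")
    return LICENSE_MAP.get(key)

uri = normalise_license("  CC-BY 4.0 ")              # canonical CC-BY 4.0 URI
unknown = normalise_license("all rights reserved")   # None: needs review
```

Unmapped values are deliberately left unresolved rather than guessed, so that they can be flagged for legal review.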

In order to showcase the potential and usefulness of adopting such an approach, the project will
consider datasets from the economy domain coming from different countries. The use case will
particularly focus on datasets about business registries discoverable via the EDP as well as
datasets to be made available by participating and affiliated data providers. The enriched versions
of existing datasets as well as new content which will be provided in the framework of the use
case considered by the project will be published to the respective National/Regional open data
portals (and thus to the EDP).

An open online advanced search service will be deployed, offering useful information to users
who are looking for information about enterprises active in different European countries. Via a
user-friendly interface, the end-user will be able to issue custom semantic queries e.g. asking
which businesses are classified under a particular type of economic activity, which are located in
a certain region, which were established within a certain time frame etc. Visualisations and
statistical overviews will also be offered.

It should be emphasised that, although the project focuses on a particular use case, the offered
services are generic, so that they can be used in any use case which requires harmonisation of
content data and support for semantic queries.

Technical tools to be deployed during the project

The project will deliver a web-based platform, which will support in a semi-automatic way all the
steps required for the harmonisation of data content and offer a user-friendly interface for
performing advanced search on this data and visualising the results. The platform will enable
users to transform data coming from different sources, including the EDP and other national-level registries, and in different formats (CSV, JSON, XML), into RDF, making it compliant with established standards and linking it to an extended (and extendable) set of
supported vocabularies and thesauri. Through the application of a sequence of mappings, the
underlying heterogeneity due to different data structures, field names, and value representations
will be overcome, making the data searchable and discoverable and allowing users to write
custom queries in SPARQL.

Moreover, the platform will make available an online service which will offer semantic search
functionalities over content related to businesses active in different European countries. The
service will support the consumption of the harmonised data in a user-friendly way: it will provide
a set of queries which can be parameterised by users who are not experts along with a set of
visualisations and statistical results (e.g. a user will be able to enter the specific NACE code and
establishment dates they are interested in and visualise the results on a map). Interested data
providers who wish to make their data discoverable via the offered search service will be able to
do so, by following the harmonisation workflow supported by the downloadable toolset.

The services to be offered by the project will build on the following tools developed by the technical
partners of the consortium, which are already in a mature state and have been applied in different
real-world scenarios:
● The LinkedPipes ETL (https://etl.linkedpipes.com/), which has been developed by Charles University. LinkedPipes is an open-source ETL tool for the production and
consumption of Linked Open Data. It utilizes reusable components to create data
transformation pipelines, which can be easily debugged, run repeatedly, etc. It has been
deployed as a backend for harvesting local catalogs in the Czech National Open Data
catalog https://data.gov.cz.
● The D2RML processor (http://apps.islab.ntua.gr/d2rml/), which has been developed by
NTUA, is a service used for transforming, via the definition of a set of rules, heterogeneous
data to RDF graphs.
● MINT (http://mint-wordpress.image.ntua.gr/) is an aggregation and mapping tool, which
has been developed by NTUA. MINT has been used by more than 100 organisations in
the past years in the area of digital cultural heritage for the aggregation of content and
metadata and is also a main component of the ingestion infrastructure of Europeana, the
European Digital Library. Its services allow the operation of different aggregation schemes (thematic or cross-domain; international, national or regional) and include a visual mapping editor, which enables users to map their dataset records to a desired XML target schema.
Although MINT has so far been applied only in the domain of digital cultural heritage,
the requirements brought forth by the harmonisation of open data are of the same nature
and thus, the mapping and aggregation services provided by MINT can be reused with
minimal customisations in the context of the project.
● LinkedPipes DCAT-AP Viewer developed by CUNI is a lightweight viewer of DCAT-AP
metadata records which has been deployed as frontend in the Czech National Open Data
catalog https://data.gov.cz/datasets. LinkedPipes DCAT-AP Forms by CUNI is a simple
user-friendly form for entering DCAT-AP metadata, deployed as frontend in the Czech
National Open Data catalog https://data.gov.cz/formulář/dataset-registration.

The project will use an open framework for the development of flexible and reusable User Interface components for Linked Data applications. For example, the Linked Data Reactor (http://ld-r.org/) framework could be used, which applies the idea of component-based application development to the RDF data model, thus enhancing current user interfaces to view, browse and edit Linked Data. The eTranslation Building Block will also be used for the uniform visualisation of the data content.

Proposed workflow and role of each tool

The services offered by MINT and D2RML will be integrated together with the components of the
LinkedPipes ETL and offered as a single technological framework, so as to support the different
steps of the harmonisation and semantic search workflow proposed by the project.

The first step of the proposed workflow is to define the specifications that will have to be respected for the harmonisation to take place. The specifications will outline a set of vocabularies and thesauri (such as the NACE classification of economic activities), starting from the business use case to be considered in the lifetime of the project. This set is by no means meant to be exhaustive, and the technical infrastructure will support the import of new vocabularies, so that the set of supported standards is extensible. The specifications will consist of a set of mappings from certain data representations to the target RDF: for example, a mapping rule may map field names from a CSV file to target namespaces and manipulate values so that they can be linked to established vocabulary terms. For the harmonisation steps of the workflow, the LinkedPipes framework, extended with the mapping possibilities offered by the D2RML tool, will be used. The visual mapping editor offered by MINT can also be used at this stage, offering a more user-friendly way of manipulating data values and linking them to vocabularies. Users will be able to import datasets from the EDP and other sources as well as to upload their own datasets. By applying the set of rules/mappings defined in the specifications, the original data will be transformed to RDF.

For exposing the search capabilities, the SPARQL constructs of LinkedPipes will be used, extended with accompanying UI components for supporting the parameterisable queries addressed to non-experts and for added-value visualisations (on the map and of statistics). The platform will also connect to the eTranslation Building Block for providing a uniform way of accessing the data content. The DCAT-AP viewer and forms will be used at the final steps of the workflow, allowing data providers to make the descriptions of their datasets DCAT-AP compliant.

Consortium

Confirmed partners
National Technical University of Athens - NTUA (Greece): Coordinator
Ministry of Digital Governance - MDG (Greece)
Athens Chamber of Commerce and Industry - ACCI (Greece)
Charles University - CUNI (Czechia)
Masaryk University, Faculty of Law - MUNI (Czechia)
Brønnøysund Register Centre - BRREG (Norway)
Agency for Public Management and eGovernment - DIFI (Norway)
ThinkCode (Cyprus)

Contacted partners (yet to confirm)


Ministry of Finance (Denmark)
Ministry of Finance (Cyprus) --- NO

External Collaborators (LoS)


BOSA (Belgium)
Ministry of the Interior of the Czech Republic - MoI (Czechia)
