Data curation for SDIL

SDIL Data Catalogue

In the course of the BMBF-funded SDI-X project, we are currently building a repository of Big Data datasets from different domains of industrial and academic research. These include, for example:

  • open datasets like DBpedia or LinkedGeoData,
  • datasets from existing data catalogues and repositories such as the UCI machine learning repository,
  • datasets supplied by academic and industrial partners within the SDIL project, and
  • datasets that emerge from projects conducted at SDIL and are provided under a “fair share” policy.

The overall goal is to provide SDIL users with a comprehensive and easily accessible data catalogue that allows efficient search and retrieval of data suitable for a specific purpose. For that reason, all datasets that are to be part of the catalogue are annotated with relevant metadata. To ensure interoperability with existing data repositories, we use standardized metadata formats and fields based on Semantic Web technologies. The tools and the definition of this metadata vocabulary, as well as the annotation of datasets and their integration into the data catalogue, are developed and supplied by the SDI-X project.
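To make the idea of a standardized, Semantic Web based metadata record more concrete, here is a minimal sketch of how one catalogue entry could be described as JSON-LD and searched by keyword. The text does not name the vocabulary actually used by SDI-X, so the choice of DCAT and Dublin Core Terms is an assumption, and the function names and the example.org identifier are hypothetical.

```python
import json

# Assumption: DCAT and Dublin Core Terms stand in for the SDI-X metadata
# vocabulary, which the text does not name.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
}


def make_dataset_record(identifier, title, description, keywords, landing_page):
    """Build a JSON-LD description of a single catalogue entry."""
    return {
        "@context": CONTEXT,
        "@id": identifier,
        "@type": "dcat:Dataset",
        "dcterms:title": title,
        "dcterms:description": description,
        "dcat:keyword": sorted(keywords),  # sorted for stable output
        "dcat:landingPage": {"@id": landing_page},
    }


def matches(record, term):
    """Case-insensitive substring search over the title and keywords."""
    term = term.lower()
    haystack = [record["dcterms:title"]] + record["dcat:keyword"]
    return any(term in text.lower() for text in haystack)


record = make_dataset_record(
    identifier="https://example.org/catalogue/dbpedia",  # hypothetical URI
    title="DBpedia",
    description="Structured data extracted from Wikipedia.",
    keywords={"linked data", "knowledge graph"},
    landing_page="https://www.dbpedia.org/",
)
print(json.dumps(record, indent=2))
```

Because the record relies on shared vocabulary terms rather than ad hoc field names, other repositories that understand the same vocabulary could in principle read it without a custom mapping.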

Data Storage and Provision

In the standard use case, datasets are stored on the SDIL infrastructure and can be retrieved from there according to the defined access restrictions. Data that does not need to be stored at SDIL (e.g., datasets from existing repositories), or whose storage at SDIL is not legally permitted, can nevertheless be registered in the repository by providing information about the data supplier. In that case, users are redirected to the data supplier, who can provide the data once the parties have agreed on the data usage conditions. The exact modalities of this process can be arranged in the course of the SDIL application process.
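The two provision paths can be sketched as a small decision function. This is only an illustration of the logic described in the text, not the SDIL implementation: the class, its fields, the function name and the returned labels are all hypothetical, and the access check is reduced to a single boolean.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CatalogueEntry:
    """Hypothetical catalogue entry; field names are illustrative only."""
    name: str
    stored_at_sdil: bool
    storage_path: Optional[str] = None      # set when the data is held at SDIL
    supplier_contact: Optional[str] = None  # set when the data is only registered


def resolve(entry: CatalogueEntry, user_has_access: bool) -> Tuple[str, Optional[str]]:
    """Decide how a request for a dataset is served.

    Returns an (action, target) pair:
      - ("download", path)    data is stored at SDIL and access is granted
      - ("denied", None)      data is stored at SDIL but access is restricted
      - ("redirect", contact) data is only registered; the supplier provides it
    """
    if entry.stored_at_sdil:
        if not user_has_access:
            return ("denied", None)
        return ("download", entry.storage_path)
    # Registered-only datasets are never served by SDIL itself.
    return ("redirect", entry.supplier_contact)


local = CatalogueEntry("partner-data", True, storage_path="/data/partner-data")
external = CatalogueEntry("external-data", False, supplier_contact="supplier@example.org")

print(resolve(local, user_has_access=True))
print(resolve(local, user_has_access=False))
print(resolve(external, user_has_access=False))
```

In the redirect case the usage conditions are negotiated between the user and the supplier, so the sketch deliberately does not apply the SDIL access check there.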

Data Curation Tools

Besides the tools described above for establishing, operating and maintaining the data catalogue, we plan to provide a broad spectrum of services that cover the full data life cycle, from data selection to authoring and repair functionality. This also includes basic manipulation and visualisation tools. These services are developed by the SDI-X project and provided by SDIL.