Data Preservation Policy

This document describes the approach of the RODBUK Cracow Open Research Data Repository to the long-term archiving and responsible management of deposited research data.

RODBUK's main objectives include:

  • providing long-term access to research data,
  • maintaining the stability of the repository's operation,
  • ensuring the authenticity, integrity and security of the deposited datasets.

The roles in the Repository are:

  • AGH University of Science and Technology in Kraków, Academic Computer Centre CYFRONET AGH, 11 Nawojki Street, 30-950 Kraków, NIP: 675-000-19-23, REGON: 000001577-00022, responsible for the technical side, storage security and implementation of new functionalities;
  • units of universities or other scientific institutions joining the project, designated to provide support to users and to develop functionalities.

Preservation strategies

The Repository ensures long-term archiving of deposited research data using the following strategies:

  1. the FAIR principles are followed at all stages of working with research data, making the data easily discoverable, accessible, interoperable and reusable;
  2. the depositor is required to include documentation to enable the reading and reuse of published research data;
  3. data are verified, validated and supervised according to defined workflows;
  4. data are described and enriched with metadata according to the Dublin Core standard;
  5. datasets are stored for at least 10 years and metadata for an indefinite period;
  6. data authenticity and integrity are maintained with a view to reuse.
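Strategy 4 can be illustrated with a minimal sketch of a Dublin Core description serialised as XML. The field values, the `build_dc_record` helper and the wrapping `metadata` element are illustrative assumptions, not RODBUK's actual export format; only the element namespace is the standard Dublin Core one.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return an XML string with one dc:* element per (name, value) pair.

    `fields` is a list of (element_name, value) tuples, so repeatable
    elements such as `creator` can appear more than once.
    """
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values, for illustration only.
record = build_dc_record([
    ("title", "Example measurement series"),
    ("creator", "Kowalska, Anna"),
    ("date", "2025-01-21"),
    ("rights", "CC BY 4.0"),
])
print(record)
```

Passing the fields as a list of pairs rather than a dictionary keeps the element order stable and allows several values for the same element.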

The technical side of the long-term archiving processes is handled by the ACC CYFRONET AGH team. This process includes tasks related to: change of media, conversion to current formats, review of integrity, authenticity, control of availability, reading and presentation of data. These tasks relate to both digital objects and metadata.

ACC CYFRONET AGH has implemented a number of internal security procedures (Information Security Policy, Business Continuity Management Policy). It also has an internal Cybersecurity Department and a Data Security Department.

According to the "Information Security Policy" adopted at ACC CYFRONET AGH "security of information and the systems in which it is processed is one of the key elements ensuring the fulfilment of the Centre's statutory tasks. [...] In order to ensure information security, CYFRONET implements a coherent "Information Security Management System". The system [...] serves to protect and provide access to assets in such a way that the confidentiality, availability and integrity of the processed information remain at an appropriate level (Polityka Bezpieczeństwa Informacji /in Polish/)".

Long-term data storage is achieved by regularly creating data backups. A backup is made on active data through the replication process from the source location to a separate and isolated target location. The backup procedure ensures the consistency of source and backup data, both at the level of a single file and entire data sets. For obsolete data, their migration/archiving is planned based on solutions using magnetic tape data storage. ACC CYFRONET AGH currently has three tape libraries with over 9,000 slots for LTO standard magnetic tapes and 44 drives of generations 6, 7, and 9. A single LTO-9 magnetic carrier has a physical capacity of 18 TB and allows for writing at a speed reaching 400 MB/s. The purpose of creating an archive is to ensure the safety of unused data and free up occupied storage resources. Unlike a backup, the archive is created only once, by migrating data from the source location to the target location.
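For a sense of scale, the LTO-9 figures above imply how long one full tape takes to write and what the libraries could hold if every slot contained an LTO-9 tape. This is a back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes), a sustained peak write speed and exactly 9,000 slots; since the libraries have "over 9,000" slots and also serve older tape generations, real figures will differ.

```python
TAPE_CAPACITY_BYTES = 18 * 10**12   # LTO-9 physical capacity: 18 TB
WRITE_SPEED_BYTES_S = 400 * 10**6   # peak write speed: 400 MB/s
SLOTS = 9_000                       # assumption: lower bound of "over 9,000"

# Time needed to fill a single tape at peak speed.
seconds_per_tape = TAPE_CAPACITY_BYTES / WRITE_SPEED_BYTES_S
hours_per_tape = seconds_per_tape / 3600

# Upper bound on capacity if all slots held LTO-9 tapes.
max_capacity_pb = SLOTS * TAPE_CAPACITY_BYTES / 10**15

print(f"{hours_per_tape:.1f} h to fill one tape")            # 12.5 h
print(f"{max_capacity_pb:.0f} PB if all slots held LTO-9")   # 162 PB
```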

Verification of data before publication

The verification of the datasets is done by the data stewards appointed by the institutions having their instances within the Repository. If necessary, the dataset is sent back to the depositor with a message about the scope and purpose of the corrections or additions. If the dataset raises no objections, it is accepted for publication. Data stewards support researchers at all stages of the research data lifecycle in terms of FAIR compliance, including: applying accepted metadata standards, improving descriptions, versioning and converting data to new formats to increase the potential for reuse. Data stewards are also responsible for checking the README file.

Data storage

RODBUK uses the Dataverse project, an open source research data repository software. The application code is developed by the community and made available via a GitHub repository.

ACC CYFRONET AGH guarantees reliable availability. Both hardware and software are well managed and tailored to user needs and application functionality. The RODBUK infrastructure runs on virtual machines powered by OpenStack, using S3-based object storage and running on Rocky Linux 8. The physical resources of the virtual server, such as RAM size, VCPUs, disks and their performance, are adapted to the nature of the application.

All documents stored in RODBUK are archived and made available for at least 10 years, and the metadata that describes them indefinitely. Data security is maintained at all stages of their lifecycle (in the processes of adoption, implementation into the collection and use). Deposited files are automatically backed up as soon as the user enters them into RODBUK, and metadata copies are made once a day.

Due to physical threats that can compromise data integrity, ACC CYFRONET AGH operates in two separate data centers - DC Nawojki and DC Podole - strategically located in different buildings within Krakow. To mitigate the risks associated with natural disasters such as fires or floods, data replication is used. This results in two independent object storage systems, each of which can serve as a failover for the other.

Maintaining availability

Data stewards review deposited data for FAIR compliance before publication in the repository and provide necessary support to depositors in this regard. This ranges from correcting metadata, improving and standardising descriptions to assisting with data versioning and converting files to new formats for future reuse. The final decision on the elements that make up the dataset, its size and format is made by the researcher.

RODBUK recommends the use of open formats that are publicly available and free of charge. At the stage of depositing files in the repository, the Dataverse software recognises the type of format based on its extension. It is recommended that files with an unrecognised format are converted to another format. The only exception to this is when converting files from specialised software to an open format may affect data quality. In such cases, the README file accompanying the data should describe the software used to open the files.
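Extension-based recognition of this kind can be sketched as a simple lookup. The table of extensions and the function names below are hypothetical and deliberately short; this is not the Dataverse implementation, only an illustration of how a depositor might pre-check files before upload.

```python
from pathlib import Path

# Hypothetical, deliberately short table of recognised formats.
KNOWN_EXTENSIONS = {
    ".csv": "text/csv",
    ".txt": "text/plain",
    ".json": "application/json",
    ".png": "image/png",
    ".pdf": "application/pdf",
}


def recognise_format(filename):
    """Return the format for a file name, or None if the extension is unknown."""
    suffix = Path(filename).suffix.lower()  # extensions are matched case-insensitively
    return KNOWN_EXTENSIONS.get(suffix)


def files_needing_attention(filenames):
    """List files that should be converted or described in the README."""
    return [name for name in filenames if recognise_format(name) is None]


# Hypothetical deposit: the last file comes from specialised software.
deposit = ["results.CSV", "README.txt", "spectrum.xyzraw"]
print(files_needing_attention(deposit))  # ['spectrum.xyzraw']
```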

For each dataset, a licence must be selected from the list available in RODBUK. Files can be made openly available, or access can be restricted (embargo, release on request).

Each deposited dataset is assigned a DOI. The DOI is activated when the first version of the dataset is published, after the data steward has verified the deposited data.

If it is determined that a particular data format is no longer technically supported, RODBUK administrators contact the depositor with a request to convert the uploaded files. Where contact is not possible, the RODBUK administrators will convert the data if technical conditions allow. The dataset will be published on RODBUK as a new version.

To maintain optimal performance and security, the infrastructure undergoes regular updates, with ad-hoc updates considered when a vulnerability (CVE) is detected. Similarly, the Dataverse application itself is consistently updated with the latest releases from the Harvard development group, but only after testing in dedicated environments to ensure seamless updates and stability. Relevant teams of personnel, each with specific roles, are granted access to the virtual machines through our company VPN for secure and controlled connectivity.

The RODBUK aggregator stores all metadata of the deposited data from individual institutions, even if the agreement between the institution and ACC CYFRONET AGH is terminated.

Data validation

All datasets deposited in the Repository are subject to regular verification, in which the checksum values calculated at a given time are compared with the checksums generated when the datasets were deposited. This mechanism makes it possible to identify damaged or lost content and restore the correct version from backups. The audit is carried out twice a year.
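A checksum audit of this kind can be sketched as follows. This is a minimal illustration, not the Repository's actual tooling: the choice of SHA-256, the function names and the report labels are assumptions, and the demo files are created in a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256"):
    """Compute a file's checksum, reading in 1 MiB chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(stored_checksums):
    """Compare current checksums with previously recorded ones.

    `stored_checksums` maps file path -> checksum recorded earlier.
    Returns a dict mapping each path to "ok", "damaged" or "missing".
    """
    report = {}
    for path, expected in stored_checksums.items():
        if not Path(path).is_file():
            report[path] = "missing"
        elif file_checksum(path) != expected:
            report[path] = "damaged"
        else:
            report[path] = "ok"
    return report


# Demo with throwaway files: one intact, one altered, one never written.
with tempfile.TemporaryDirectory() as tmp:
    good, bad, gone = (Path(tmp) / n for n in ("good.csv", "bad.csv", "gone.csv"))
    good.write_bytes(b"a,b\n1,2\n")
    bad.write_bytes(b"a,b\n3,4\n")
    stored = {
        str(good): file_checksum(good),
        str(bad): file_checksum(bad),
        str(gone): "0" * 64,  # placeholder checksum for a file that no longer exists
    }
    bad.write_bytes(b"a,b\n3,5\n")  # simulate silent corruption
    demo_report = audit(stored)

print(sorted(demo_report.values()))  # ['damaged', 'missing', 'ok']
```

Files reported as "damaged" or "missing" are the ones that would need to be restored from a backup copy.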

Providing security

RODBUK has multi-level access security. Research data can only be deposited by individuals registered with the repository after logging in using a central authentication system. This procedure is secured with the OIDC (OpenID Connect) or SAML2 protocols. Each login requires a username (email address) and an authentication password (provided at first login). The Depositor only has rights to a specific collection. Only the Depositor can edit the data and change its description and structure, and only until the dataset has been submitted for verification by the data steward. Any subsequent changes to the datasets must be approved by the data steward.

In order to ensure a high level of security and stability of services, regular checks of the infrastructure are carried out. To protect RODBUK from potential data loss, ACC CYFRONET AGH implements a meticulous backup routine at various levels.

Data migration plan to/from RODBUK

Migration of research data metadata between the repositories of the universities co-creating RODBUK is permitted. Metadata are retrieved in order to ensure a complete and consistent representation of the research data of a given unit. The migration process is carried out in close cooperation with ACC CYFRONET AGH.

The data migration (custody transfer) plan consists of the following stages:

  1. defining migration requirements: this step outlines the reasons for migration, the specific data to be migrated, and the desired outcomes of the migration process;
  2. identifying and configuring the target environment: it includes assessing the technological and infrastructure capabilities of the new environment, ensuring its capacity for efficient data transfer, and confirming the legal standing of the environment (e.g., licenses, contracts);
  3. establishing proper metadata formats: ensuring the target environment supports the correct metadata format and minimizing the risk of metadata loss during the migration;
  4. inventorying the current repository: conducting a thorough inventory of the existing repository according to newly defined criteria, including supplementing any missing metadata in the source environment;
  5. migration planning: this involves determining the timing for shutting down the repository, creating data backups prior to migration, verifying data restoration procedures, and informing users about the planned system downtime;
  6. data migration and repository updates: executing the migration and updating all repository-related data and datasets to point to the new environment;
  7. validation and testing: after migration, validating and testing the new environment to ensure the migration was successful.
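The validation in stage 7 could, for example, compare a manifest of the source repository with one produced in the target environment. The sketch below assumes each manifest is a mapping from file path to checksum; the function names and the sample values are hypothetical and do not describe an existing RODBUK tool.

```python
def compare_manifests(source, target):
    """Compare two {file_path: checksum} manifests after a migration."""
    source_paths, target_paths = set(source), set(target)
    return {
        # present before migration, absent afterwards
        "missing": sorted(source_paths - target_paths),
        # present only in the target environment
        "unexpected": sorted(target_paths - source_paths),
        # present in both, but the content differs
        "mismatched": sorted(
            path for path in source_paths & target_paths
            if source[path] != target[path]
        ),
    }


def migration_succeeded(result):
    """A migration passes only if no discrepancy of any kind was found."""
    return not any(result.values())


# Hypothetical manifests with shortened checksums for readability.
source_manifest = {"a.csv": "111", "b.csv": "222", "README.txt": "333"}
target_manifest = {"a.csv": "111", "b.csv": "999", "extra.tmp": "444"}

result = compare_manifests(source_manifest, target_manifest)
print(result)
print(migration_succeeded(result))  # False
```

Reporting the three kinds of discrepancy separately makes it easier to decide whether to re-copy individual files or repeat the whole migration.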

Control procedures/verification

Published data is not subject to change. Once a dataset has been accepted by the data steward, it becomes impossible to edit it. The authors of the dataset can only make the amended/new files available by creating a new version of the dataset. In this case, after completing the missing data in the published dataset, the dataset must be resubmitted for review: "Send for review". The data steward may send it back for correction or, if there are no objections, may publish it.

It is important to agree with the data steward on the final version of the dataset, because the data steward may publish another version with minor or major corrections. In both cases, the version number of the dataset can be checked under the "Versions" tab. Information about the current version can be found at the very top of the page or under the title of the dataset. Publishing another version of the dataset does not change the DOI.
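The effect of minor and major corrections on the version number can be sketched as below, assuming the MAJOR.MINOR numbering used by Dataverse (for example 1.0, 1.1, 2.0). The `next_version` helper is hypothetical and only illustrates the numbering; which kind of change counts as minor or major is decided with the data steward.

```python
def next_version(current, change):
    """Return the next dataset version for a "minor" or "major" change.

    `current` is a string such as "1.0". A minor change increments the
    second number; a major change increments the first and resets the second.
    """
    major, minor = (int(part) for part in current.split("."))
    if change == "major":
        return f"{major + 1}.0"
    if change == "minor":
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(next_version("1.0", "minor"))  # 1.1
print(next_version("1.1", "major"))  # 2.0
```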

In exceptional cases, such as violations of copyright and other intellectual property rights or suspicion of plagiarism, the following actions may be taken:

  1. withdrawal of a dataset: for this purpose, the depositor must contact the data steward. The removal of a dataset involves deleting all its versions. However, basic information about the removed dataset (the so-called tombstone), such as the citation and the reason for the data removal, remains publicly accessible. The full metadata description will only be visible to system administrators (ACC CYFRONET AGH);
  2. removal of available files: when justified by appropriate legal grounds, the files undergo a deaccession process.

Update: 21.01.2025 
