The document describes the approach of the RODBUK Cracow Open Research Data Repository to the long-term archiving and responsible management of deposited research data.
RODBUK's main objectives include:
The roles in the Repository are:
The Repository ensures long-term archiving of deposited research data using the following strategies:
The technical side of the long-term archiving processes is handled by the ACC CYFRONET AGH team. This process includes tasks related to: change of media, conversion to current formats, review of integrity, authenticity, control of availability, reading and presentation of data. These tasks relate to both digital objects and metadata.
ACC CYFRONET AGH has implemented a number of internal security procedures (Information Security Policy, Business Continuity Management Policy). The Centre also has an internal Cybersecurity Department and a Data Security Department.
According to the "Information Security Policy" adopted at ACC CYFRONET AGH, "security of information and the systems in which it is processed is one of the key elements ensuring the fulfilment of the Centre's statutory tasks. [...] In order to ensure information security, CYFRONET implements a coherent 'Information Security Management System'. The system [...] serves to protect and provide access to assets in such a way that the confidentiality, availability and integrity of the processed information remain at an appropriate level" (Polityka Bezpieczeństwa Informacji, in Polish).
Long-term data storage is achieved by regularly creating data backups. A backup of active data is made by replicating it from the source location to a separate and isolated target location. The backup procedure ensures the consistency of source and backup data, both at the level of a single file and of entire data sets. For obsolete data, migration/archiving is planned using magnetic tape storage. ACC CYFRONET AGH currently has three tape libraries with over 9,000 slots for LTO standard magnetic tapes and 44 drives of generations 6, 7, and 9. A single LTO-9 tape has a physical capacity of 18 TB and can be written at speeds of up to 400 MB/s. The purpose of creating an archive is to keep unused data safe and to free up occupied storage resources. Unlike a backup, the archive is created only once, by migrating data from the source location to the target location.
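The tape figures quoted above give a rough sense of how long archiving to tape takes. The following is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a write speed sustained at the quoted maximum; the function name is illustrative and not part of any ACC CYFRONET AGH tooling.

```python
def full_tape_write_hours(capacity_tb: float = 18.0, speed_mb_s: float = 400.0) -> float:
    """Hours needed to fill one tape at a sustained write speed.

    Assumes decimal units: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
    The defaults are the LTO-9 figures quoted in the text.
    """
    capacity_bytes = capacity_tb * 10**12
    speed_bytes_per_second = speed_mb_s * 10**6
    seconds = capacity_bytes / speed_bytes_per_second
    return seconds / 3600


if __name__ == "__main__":
    # 18 TB at 400 MB/s is 45,000 seconds, i.e. 12.5 hours per tape.
    print(full_tape_write_hours())
```

Under these assumptions, filling a single LTO-9 tape takes at least 12.5 hours. Real throughput is usually lower, so this is a lower bound rather than an estimate.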
Datasets are verified by the data stewards appointed by the institutions that have instances within the Repository. If necessary, the dataset is sent back to the depositor with a message about the scope and purpose of the corrections or additions. If the dataset raises no objections, it is accepted for publication. Data stewards support researchers at all stages of the research data lifecycle in terms of FAIR compliance, including: applying accepted metadata standards, improving descriptions, versioning and converting data to new formats to increase the potential for reuse. Data stewards are also responsible for checking the README file.
RODBUK uses the Dataverse project, an open source research data repository software. The application code is developed by the community and made available via a GitHub repository.
ACC CYFRONET AGH guarantees reliable availability. Both hardware and software are well managed and tailored to user needs and application functionality. The RODBUK infrastructure runs on virtual machines powered by OpenStack, using S3-based object storage and running on Rocky Linux 8. The physical resources of the virtual server, such as RAM size, VCPUs, disks and their performance, are adapted to the nature of the application.
All documents stored in RODBUK are archived and made available for at least 10 years, and the metadata describing them is kept indefinitely. Data security is maintained at all stages of the document lifecycle (acceptance, incorporation into the collection and use). Deposited files are automatically backed up as soon as the user uploads them to RODBUK, and metadata copies are made once a day.
Due to physical threats that can compromise data integrity, ACC CYFRONET AGH operates in two separate data centers - DC Nawojki and DC Podole - strategically located in different buildings within Krakow. To mitigate the risks associated with natural disasters such as fires or floods, data replication is used. This results in two independent object storage systems, each of which can serve as a failover for the other.
Data stewards review deposited data for FAIR compliance before publication in the repository and provide necessary support to depositors in this regard. This ranges from correcting metadata, improving and standardising descriptions to assisting with data versioning and converting files to new formats for future reuse. The final decision on the elements that make up the dataset, its size and format is made by the researcher.
RODBUK recommends the use of open formats that are publicly available and free of charge. When files are deposited in the repository, the Dataverse software recognises the format type based on the file extension. It is recommended that files with an unrecognised format are converted to another format. The only exception is when converting files from a specialised software format to an open format may affect data quality. In such cases, the README file accompanying the data should describe the software needed to open the files.
For each dataset, a licence must be selected from the list available in RODBUK. Files can be made openly available, or access can be restricted (embargo, release on request).
Each deposited dataset is assigned a DOI. The DOI is activated once the data steward has verified the deposited data and the first version of the dataset is published.
If it is determined that a particular data format is no longer technically supported, RODBUK administrators contact the depositor with a request to convert the uploaded files. Where contact is not possible, the RODBUK administrators will convert the data if technical conditions allow. The dataset will be published on RODBUK as a new version.
To maintain optimal performance and security, the infrastructure undergoes regular updates, with ad-hoc updates applied when a vulnerability (CVE) is detected. Similarly, the Dataverse application itself is regularly updated to the latest release from the Harvard development group, but only after testing in dedicated environments to ensure seamless updates and stability. Relevant teams of personnel are granted role-based access to the virtual machines through the company VPN for secure and controlled connectivity.
The RODBUK aggregator stores all metadata of the deposited data from individual institutions, even if the agreement between the institution and ACC CYFRONET AGH is terminated.
All datasets deposited in the Repository are subject to regular verification, which compares the checksum values calculated at a given time with the checksums generated when the datasets were deposited. This mechanism makes it possible to identify damaged or lost content and restore the correct version from backups. The audit is carried out twice a year.
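The checksum comparison described above can be sketched as follows. This is a minimal illustration rather than the Repository's actual audit tooling: the function names, the SHA-256 algorithm and the dictionary of recorded checksums are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a file's checksum in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(recorded, root):
    """Compare current checksums with the recorded ones.

    recorded: dict mapping a relative file path to the checksum stored
              when the file entered the repository (hypothetical layout).
    root:     directory that holds the stored files.
    Returns a dict of problem files: path -> "missing" or "damaged".
    """
    problems = {}
    for relative_path, expected in recorded.items():
        path = Path(root) / relative_path
        if not path.is_file():
            problems[relative_path] = "missing"
        elif file_checksum(path) != expected:
            problems[relative_path] = "damaged"
    return problems
```

Any file reported as "missing" or "damaged" would then be a candidate for restoration from backup. An empty result means every stored file still matches its recorded checksum.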
RODBUK has multi-level access security. Research data can only be deposited by individuals registered with the repository, after logging in through a central authentication system. This procedure is secured with the OIDC (OpenID Connect) or SAML2 protocols. Each login requires a username (email address) and a password (provided at first login). The Depositor only has rights to a specific collection. Only the Depositor can edit the data and change its description and structure, and only until the dataset has been submitted for verification by the data steward. Any subsequent changes to the datasets must be approved by the data steward.
To ensure a high level of security and stability, the infrastructure and services are checked regularly. To protect RODBUK from potential data loss, ACC CYFRONET AGH implements a meticulous backup routine at various levels.
Migration of research data metadata between the repositories of the universities that co-create RODBUK is permitted. Metadata is retrieved to ensure a complete and consistent representation of the research data of a given unit. The migration process is carried out in close cooperation with ACC CYFRONET AGH.
The data migration (custody transfer) plan consists of the following stages:
Published data is not subject to change. Once a dataset has been accepted by the data steward, it becomes impossible to edit it. The authors of the dataset can only make the amended/new files available by creating a new version of the dataset. In this case, after completing the missing data in the published dataset, the dataset must be resubmitted for review: "Send for review". The data steward may send it back for correction or, if there are no objections, may publish it.
It is important to agree with the data steward on the final version of the dataset, because the data steward may publish another version with minor or major corrections. In either case, the version number of the dataset can be checked under the "Versions" tab. Information about the current version can be found at the very top of the page or under the title of the dataset. Adding another version of a file does not change the DOI.
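The relationship between versions and the DOI can be illustrated with a short sketch. It assumes the "major.minor" numbering that Dataverse uses for published versions (a minor correction moves 1.0 to 1.1, a major one moves it to 2.0). The class name, method name and DOI value are hypothetical and are not part of RODBUK or the Dataverse API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """One published version of a dataset (illustrative model only)."""
    doi: str
    major: int = 1
    minor: int = 0

    def publish_update(self, major_change: bool) -> "DatasetVersion":
        """Return the next published version; the DOI is carried over unchanged."""
        if major_change:
            return DatasetVersion(self.doi, self.major + 1, 0)
        return DatasetVersion(self.doi, self.major, self.minor + 1)

    @property
    def label(self) -> str:
        return f"{self.major}.{self.minor}"


if __name__ == "__main__":
    first = DatasetVersion(doi="10.1234/EXAMPLE")    # hypothetical DOI
    minor = first.publish_update(major_change=False)  # minor correction
    major = minor.publish_update(major_change=True)   # major correction
    print(first.label, minor.label, major.label)      # 1.0 1.1 2.0
    print(first.doi == minor.doi == major.doi)        # True
```

Each instance is immutable, which mirrors the rule that published data cannot be edited and that changes only appear as a new version under the same DOI.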
In exceptional cases, such as violations of copyright and other intellectual property rights or suspicion of plagiarism, the following actions may be taken:
Update: 21.01.2025