Data preparation -- RODBUK

Folder Organization and File Naming

It is crucial to use the clearest possible folder structure, both during a group project and before sharing a dataset.

Additionally, please remember:

the research team should agree on and adopt a unified folder structure;
folder names should be concise and clear, indicating the type of data contained within;
if the folder structure is complex due to the project format, each folder representing a distinct unit, e.g. an individual study, should include its own README file describing that folder. In addition, the dataset should have a main README file that describes the entire dataset;
the folder hierarchy should be coherent and logical, starting with general folders and moving to more specific ones. It should be neither too deep nor too flat: typically 3-4 levels of folders, depending on the project's size;
as part of the storage strategy, it may be beneficial to define "temporary folders" from which data can be safely deleted after use.
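The points above can be sketched in Python as a small script that creates an agreed folder skeleton. The folder names in LAYOUT, the README.txt stub text, and the function name create_skeleton are hypothetical choices for illustration, not a layout prescribed by this guide.

```python
from pathlib import Path

# Hypothetical layout: general folders first, more specific ones below.
# Folder names are unique across the tree and at most 2 levels deep here.
LAYOUT = [
    "01_studies/study_a",
    "01_studies/study_b",
    "02_analysis",
    "03_docs",
    "tmp",  # temporary folder: contents may be deleted after use
]


def create_skeleton(root: Path, layout: list[str] = LAYOUT) -> list[Path]:
    """Create the folders plus README stubs; return the README paths."""
    root.mkdir(parents=True, exist_ok=True)

    # Main README describing the entire dataset.
    main_readme = root / "00_readme.txt"
    if not main_readme.exists():
        main_readme.write_text("Describe the entire dataset here.\n", encoding="utf-8")
    readmes = [main_readme]

    # One README stub per folder (an assumption: the guide only requires
    # this for folders that represent, e.g., an individual study).
    for rel in layout:
        folder = root / rel
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.txt"
        if not readme.exists():
            readme.write_text(f"Describe the contents of {rel} here.\n", encoding="utf-8")
        readmes.append(readme)
    return readmes
```

Running create_skeleton(Path("my_dataset")) once at project start gives every team member the same structure to work in; existing README files are left untouched on repeated runs.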

Please avoid the following practices:

naming folders with general phrases, such as "ongoing issues";
using researchers' names for folder titles (folder names should reflect the content, not the authors);
creating folders with identical names in different locations;
duplicating files in various folders; if necessary, use shortcuts to reference the original file.
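Two of the practices listed above (identical folder names in different locations, and duplicated files) can be checked automatically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and comparing files by SHA-256 digest is one possible way to detect duplicates, not a method required by this guide.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_folder_names(root: Path) -> dict[str, list[Path]]:
    """Map each folder name that occurs more than once to its locations."""
    by_name = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_dir():
            by_name[path.name].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


def duplicate_files(root: Path) -> list[list[Path]]:
    """Group files with identical content (same SHA-256 digest)."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        # Symbolic links are skipped: a shortcut to the original file
        # is the recommended alternative to a copy.
        if path.is_file() and not path.is_symlink():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Both functions only report findings; deciding which copy is the original is left to the research team. For very large files, reading in chunks instead of read_bytes() would reduce memory use.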

File Organization

File names can carry substantial information about a file's content. They must be consistent, logical, descriptive, concise, and clear. Establishing a naming convention agreed upon by all project members is essential to prevent unexpected errors. Description elements should be ordered from general to specific.

A file name may include the following elements:

an acronym representing the current project or experiment (2-5 letters) to indicate the file's subject;
a short description of the file's content (1-3 words);
information regarding the location or coordinates (if applicable);
the date;
the initials or the full name of the individual (researcher or entity); a full name should always begin with the surname, e.g., KowalskiJ or Kowalski-Jakub.
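As an illustration, the elements listed above can be assembled into a file name programmatically, which helps keep a team's convention consistent. This is a sketch under assumptions: the function name build_file_name, the underscore separator, and the example values (SOIL, pHSurvey, Krakow) are hypothetical and not part of this guide.

```python
from datetime import date
from typing import Optional


def build_file_name(
    project: str,
    description: str,
    author: str,
    day: date,
    extension: str,
    location: Optional[str] = None,
) -> str:
    """Join the name elements from general to specific with underscores."""
    parts = [project, description]
    if location:  # location is optional ("if applicable")
        parts.append(location)
    parts.append(day.strftime("%Y%m%d"))  # ISO-style date, year first
    parts.append(author)                  # surname first, e.g. KowalskiJ
    return "_".join(parts) + "." + extension.lstrip(".")
```

For example, build_file_name("SOIL", "pHSurvey", "KowalskiJ", date(2024, 5, 28), "csv", location="Krakow") returns "SOIL_pHSurvey_Krakow_20240528_KowalskiJ.csv".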

Additional Recommendations:

avoid using spaces. Instead, consider the following alternatives:
- CamelCase: consecutive words are written together, and each word after the first begins with a capital letter, e.g. foreColor, setConnection, isPaymentPosted;
- hyphens (-);
- underscores (_);
when numbering files, use multiple digits with leading zeros (e.g., 001 instead of 1) to prevent sorting issues;
when using dates, follow the ISO 8601 standard (year first, then month and day): YYYYMMDD or YYYY-MM-DD (e.g., 20240528 or 2024-05-28). This can be shortened to just the year or the year and month, depending on your needs and context;
when noting the time, use the HHMMSS format (hours, minutes, seconds);
never use special characters or diacritical marks, such as ę ć ! ? * & # ~ @ $ % ^ ( ) ` ; , [ ] { } ‘ “.
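The recommendations above lend themselves to simple automated checks. The sketch below is one possible interpretation: it treats a name as valid if it contains only unaccented Latin letters, digits, hyphens, and underscores, followed by optional dot-separated extensions. The pattern and the helper names (is_valid_name, numbered_name, timestamp) are assumptions for illustration.

```python
import re
from datetime import datetime

# Assumed rule: ASCII letters, digits, "-" and "_" in the stem,
# then zero or more ".ext" parts. No spaces, no diacritics.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Return True if the whole file name matches the assumed rule."""
    return ALLOWED.fullmatch(name) is not None


def numbered_name(stem: str, index: int, extension: str, width: int = 3) -> str:
    """Zero-pad the running number so that names sort correctly."""
    return f"{stem}_{index:0{width}d}.{extension}"


def timestamp(moment: datetime) -> str:
    """Format a date and time as YYYYMMDD_HHMMSS."""
    return moment.strftime("%Y%m%d_%H%M%S")
```

With zero-padding, alphabetical and numerical order agree: img_002.tif sorts before img_010.tif, whereas img_10.tif would sort before img_2.tif.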

File Formats

In accordance with the guidelines of research funding institutions, research data should be stored in open, widely accessible, and free formats, unless converting files from specialized software to open formats affects data quality.
The file format significantly impacts the ability to access a file in the future. Proprietary file formats require specific software, whereas non-proprietary, or open, formats are more interoperable, meaning they can be used across different hardware, operating systems, and software. Saving your data in open, unencrypted, and uncompressed formats helps ensure its usability for many years.
The dataset must include a 00_readme.txt file that contains essential information about the shared data; please check a sample README file.
For data compression and archiving, we recommend using ZIP or 7-Zip, as they have an open architecture and are publicly available.
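If a ZIP archive is needed, it can be created with Python's standard zipfile module. The following is a minimal sketch; the function name zip_folder is hypothetical, and DEFLATE compression is an assumed choice rather than a requirement of this guide.

```python
import zipfile
from pathlib import Path


def zip_folder(folder: Path, archive: Path) -> list[str]:
    """Pack every file under `folder` into `archive`; return stored names."""
    names = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            # Skip directories and the archive itself, in case it was
            # created inside the folder being packed.
            if not path.is_file() or path.resolve() == archive.resolve():
                continue
            # Store paths relative to the folder, with forward slashes.
            arcname = path.relative_to(folder).as_posix()
            zf.write(path, arcname)
            names.append(arcname)
    return names
```

Because relative paths are stored, the folder structure inside the archive mirrors the dataset layout, and the 00_readme.txt file stays at the top level.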

Recommended File Formats

Type of Data	Recommended Formats
Text Files	.txt (Plain text); .pdf (Portable Document Format); .tex (LaTeX documents); .html (Hypertext Markup Language); .odt (OpenDocument Text); .xml (Extensible Markup Language)
Tables, spreadsheets, and databases	.txt/.tsv/.tab (Tab-separated tables); .csv/.txt (Comma-separated tables); other standard delimiters, e.g. colon, pipe; fixed-width; .ods (OpenDocument Spreadsheet); .odb (OpenDocument Database)
Image Files	.tiff/.tif (TIFF); .jpg (JPEG); .jp2 (JPEG 2000); .png (Portable Network Graphics); .svg (Scalable Vector Graphics); .pdf (Portable Document Format); .gif (Graphics Interchange Format); .bmp (Microsoft Windows Bitmap Format)
Sound Files	.wav (WAVE); .flac (FLAC); .mp3 (MPEG Audio Layer III, usually suitable for human voice and moderate-quality audio, but possibly not for high-fidelity audio); .aiff (Audio Interchange File Format)
Video Files	.mp4 (MPEG-4); .mxf (Material Exchange Format)
Databases	.xml (Extensible Markup Language); .csv (Comma-separated tables)
Geospatial Data	.tiff (Geo-referenced TIFF); .shp, .shx, .dbf (ESRI Shapefile); .kml (Keyhole Markup Language); .nc (Network Common Data Form, NetCDF)
Web Data	.json (JavaScript Object Notation); .xml (Extensible Markup Language); .html (Hypertext Markup Language)
Web Archive	.warc (WebARChive)
Multidimensional Arrays	.cdf (Common Data Format); .nc (Network Common Data Form, NetCDF); .hdf/.h5 (Hierarchical Data Format)
E-books	.epub (Electronic Publication)

Source: File Formats - Research Data Management - Best Practices - Research Guides at Ohio State University
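Before depositing a dataset, it can be useful to list files whose extensions are not in the table above. The sketch below hard-codes the extensions from the table; the names RECOMMENDED and non_recommended are hypothetical. A flagged file is not necessarily wrong, since specialized formats may be justified when conversion would affect data quality.

```python
from pathlib import Path

# Extensions taken from the table of recommended formats above.
RECOMMENDED = {
    ".txt", ".pdf", ".tex", ".html", ".odt", ".xml",
    ".tsv", ".tab", ".csv", ".ods", ".odb",
    ".tiff", ".tif", ".jpg", ".jp2", ".png", ".svg", ".gif", ".bmp",
    ".wav", ".flac", ".mp3", ".aiff",
    ".mp4", ".mxf",
    ".shp", ".shx", ".dbf", ".kml", ".nc",
    ".json", ".warc",
    ".cdf", ".hdf", ".h5",
    ".epub",
}


def non_recommended(file_names: list[str]) -> list[str]:
    """Return the names whose extension is not in the recommended set."""
    return [
        name for name in file_names
        if Path(name).suffix.lower() not in RECOMMENDED
    ]
```

The comparison is case-insensitive, so a file named c.TIF is accepted while b.XLSX and d.docx are reported for review.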
