A lightweight data-centric framework for semantic interoperability. DLite represents data and metadata with simple but formalised data models, which decouples the (meta)data from how it is serialised. It includes a rich and easily extensible plugin system for loading and writing (meta)data to different storage backends (such as JSON, BSON, YAML, RDF, MinIO, MongoDB, PostgreSQL, Redis, CSV/Excel, ...). DLite enhances the reusability of storage plugins by clearly separating data transfer (protocol) from loading/writing. This makes it possible to use the same file-based storage plugin against, for instance, the local file system, an SFTP server or an HTTP server. Semantic interoperability and automated data transformations are achieved by mapping DLite data models and/or their properties to classes defined in ontologies. By combining mappings with a library of reusable mapping functions, fully automated data transformations and integrations can be achieved. DLite also includes a collection of tools for, e.g., validating data models and generating code that handles I/O in C and Fortran programs. DLite is written in C, but includes bindings to Python and Fortran. It is commonly used from Python and is available under a permissive MIT license.
DLite: Lightweight data-centric framework for semantic interoperability
LinkML is a flexible modeling language that allows you to author schemas in YAML that describe the structure of your data. Additionally, it is a framework for working with and validating data in a variety of formats (JSON, RDF, TSV), with generators for compiling LinkML schemas to other frameworks.
Linked (Open) Data Modeling Language | LinkML
The Dublin Core Metadata Element Set, often called Dublin Core (DC), is a standardized metadata scheme for describing any kind of resource, such as documents in electronic and non-electronic form, digital materials (such as video, sound, images, etc.) and composite media like web pages. Dublin Core metadata may be used for multiple purposes, from simple resource description, to combining metadata vocabularies of different metadata standards, to providing interoperability for metadata vocabularies in the Linked Data cloud and Semantic Web implementations. Please note that this version of the specification for the Dublin Core Element Set 1.1 is somewhat out of date, although it is not officially deprecated. The DCMI Metadata Terms specification is linked to this record and is the current documentation that should be used for the Dublin Core Element Set 1.1. This document, an excerpt from the more comprehensive document "DCMI Metadata Terms" [DCTERMS], provides an abbreviated reference version of the fifteen element descriptions that have been formally endorsed in the following standards: ISO Standard 15836:2009 of February 2009 [ISO15836], ANSI/NISO Standard Z39.85-2012 of February 2013 [NISOZ3985], and IETF RFC 5013 of August 2007 [RFC5013].
Dublin Core Metadata Element Set | DCES
Data Documentation Initiative (DDI-Codebook, DDI-C) is a more light-weight version of DDI-Lifecycle, intended primarily to document simple survey data. Originally DTD-based, DDI-C is now available as an XML Schema. The freely available international DDI standards describe data that result from observational methods in the social, behavioral, economic, and health sciences. DDI is used to document data in over 80 countries of the world.
Data Documentation Initiative Codebook | DDI-Codebook
The OpenAIRE Guidelines for Data Archive Managers 2.0 will provide instruction for data archive managers to expose their metadata in a way that is compatible with the OpenAIRE infrastructure. The metadata from data archives should be included in the OpenAIRE information space, and exposed when data are related to an open access publication e.g. a dataset cited by an article.
OpenAIRE Metadata Schema for Data Archives
The DataCite Metadata Schema is a list of core metadata properties chosen for accurate and consistent identification of a resource for citation and retrieval purposes, with recommended use instructions in the documentation. The resource that is being identified can be of any kind, but it is typically a dataset. We use the term ‘dataset’ in its broadest sense. We mean it to include not only numerical data, but any other research objects in keeping with DataCite’s mission (https://datacite.org/value.html). The metadata schema properties are presented and described in detail in the section DataCite Metadata Properties in this document.
DataCite Metadata Schema
This list contains the controlled vocabulary Subject used in the DANS Data Station Social Sciences and Humanities (SSH).
DANS Data Station SSH | Subject List
RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations.
RO-Crate | Research Object Crate
DICOM stands for Digital Imaging and Communications in Medicine. It is a communication standard for dealing with medical information generated by medical equipment, such as scanners. DICOM has been developed by the Medical Imaging and Technology Alliance (MITA).
Before the communication standard existed, information was stored in various file formats associated with the software of a specific medical device. DICOM has put an end to this. DICOM is widely used and is the standard within medical digital image processing. Both doctors and medical imaging research groups use the DICOM file format for their research.
In the standard, various types of images are supported for different medical applications, both still images and moving images. DICOM supports commonly used compression standards, such as JPEG and JPEG2000, or MPEG-2 for video images.
The copyright on the standard is held by the American National Electrical Manufacturers Association (NEMA). DICOM viewing software can be divided into two groups: (1) “proprietary” viewers that are part of the (medical) recording systems and (2) DICOM viewing software for PCs.
Non-proprietary viewers that are available for free are DicomWorks, Osiris and IrfanView (a widely used all-format viewer). Adobe has developed a plug-in for Photoshop that makes it possible to view or export DICOM images and “header” (= metadata) information to other formats. The IrfanView program is also capable of extracting images and / or animations (sequence of images) from DICOM files.
DICOM files are a preferred format, due to their open specification and adoption as standard within the medical world.
In certain circumstances it may be useful to save a single DICOM file from the entire dataset as .jpeg / tiff (the image) and .txt file (the header). This is due to the fact that few people outside the medical world have experience with DICOM. This way you can easily view one file from the dataset, without having to use an open DICOM viewer.
As datasets containing DICOM files tend to be relatively large, we advise you to contact a DANS data manager before depositing these files. We are happy to advise you on the best way to deliver the data and to organise it in the dataset so that it is user-friendly.
DICOM is a preferred format for file type Images (raster).
DICOM (Digital Imaging and Communications in Medicine)
The Matroska Multimedia Container is an open source, open standard multimedia container that can wrap an unlimited number of audio, video, metadata and/or subtitle files. Matroska Audio supports a large number of codecs. It has a very large user base and is the basis for WebM media in HTML5/browsers. Most platforms offer native support for this format. Adoption of the file format is expected to further increase thanks to several outcomes of the PREFORMA project, including the MediaConch toolset, an implementation checker for Matroska. Matroska file extensions are .mkv for video (with subtitles and audio), .mka for audio-only files, and .mks for subtitles only.
Matroska is a DANS preferred container format for audio and video.
Matroska Multimedia Container
The purpose of the REFI-QDA (Qualitative Data Analysis) standard is to enable the exchange of qualitative data between QDA programs. The standard has been developed by QDA software companies ATLAS.TI, Dedoose, f4 analysis, MAXQDA, NVivo, QDAMiner, Quirkos and Transana. The standard consists of two parts:
REFI-QDA Codebook, for exchanging codebooks.
REFI-QDA Project for the exchange of information about projects.
The specifications of the REFI-QDA standard are openly available and the format is based on XML. See: https://www.qdasoftware.org.
The REFI-QDA (Qualitative Data Analysis) standard is a preferred format for file type Computer Assisted Qualitative Data Analysis (CAQDAS).
REFI-QDA (Qualitative Data Analysis)
TriG is a plain text serialization format designed for storing RDF datasets containing multiple graphs within a single file. It is a compact and readable alternative to XML-based syntaxes.
It’s an extension of the Turtle syntax, adding functionality to allow graph separation, which is useful for representing datasets with different contexts or versions. TriG has been standardized by the W3C.
TriG is a preferred serialization for RDF data.
TriG
Industry Foundation Classes (IFC) is an open data model for the exchange of building information model data: data about buildings, construction and maintenance. Files with the .ifc extension use a plain-text STEP encoding; an XML encoding (ifcXML) also exists. IFC can describe all kinds of details of building structures. It was also designed to support CAD data. Various software packages support IFC, and there are also several free viewers available.
IFC is a preferred format for file types 3D and CAD.
IFC (Industry Foundation Classes)
COLLADA is an open XML format for exchanging assets between interactive 3D applications. It was designed by Sony in collaboration with various large graphic design organizations. Today, COLLADA is managed by the non-profit organization Khronos Group. COLLADA .dae has been adopted by ISO as a publicly available specification. COLLADA is supported by popular software packages such as (GIS application) ArcGIS, (CAD application) Autodesk Infrastructure Modeller, Google Earth v1.4, Mac OS X 10.6 Preview. COLLADA supports various graphic 3D elements. It is intended as an intermediate format for passing 3D information, in whole or in part, on to other software. However, the file specification of COLLADA is not very strict, which means that COLLADA exports from different software are not always interoperable. Although DANS has marked COLLADA as a preferred format, X3D is preferred if the 3D data can be supplied in X3D.
COLLADA is a preferred format for file type 3D.
COLLADA File Format
The Graphics Language Transmission Format was designed as an interoperable format for loading scenes and models created with 3D tools, in modern applications. The glTF format is a JSON-based text format and is accompanied by binary data (.bin) for geometry, animations, and other associated data; and raster images (.jpg; .png) for textures.
There are two versions of glTF: glTF 1.0, first released in October 2015; and glTF 2.0, released in June 2017. Version 2.0 is a significant upgrade from version 1.0 and is aimed at better support from applications. With the increasing adoption of version 2.0, version 1.0 seems to have become obsolete.
glb is the extension for binary glTF: a container format that collects the glTF and its associated data into a single file. Binary glTF was introduced as an extension for glTF 1.0 and was incorporated in the glTF 2.0 specification.
glTF is being developed in an open project by the non-profit organization Khronos Group.
glTF 2.0 is a preferred format for file type 3D.
glb is a preferred format for file type 3D on the condition that it is based on glTF 2.0.
glTF 1.0 and glb based on glTF 1.0 are non-preferred formats for file type 3D.
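The version condition on .glb files can be checked without a 3D library. The following is a minimal sketch, assuming the published binary glTF header layout (a 4-byte magic `glTF` followed by little-endian 32-bit version and length fields); the function names are hypothetical and the check is not an official DANS validation procedure.

```python
import struct

GLB_MAGIC = b"glTF"  # first four bytes of a binary glTF container


def glb_version(data: bytes) -> int:
    """Return the container version stored in the 12-byte .glb header."""
    if len(data) < 12:
        raise ValueError("too short to hold a 12-byte GLB header")
    # Header: 4-byte magic, then two little-endian uint32 values
    # (container version and total file length in bytes).
    magic, version, _length = struct.unpack("<4sII", data[:12])
    if magic != GLB_MAGIC:
        raise ValueError("not a binary glTF file")
    return version


def is_preferred_glb(data: bytes) -> bool:
    """True when the container declares glTF version 2 (hypothetical helper)."""
    return glb_version(data) == 2
```

In practice the first 12 bytes of the file are enough, for example `is_preferred_glb(open(path, "rb").read(12))`, where `path` is a placeholder for a local .glb file.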
glTF 2.0 File Format
The X3D format is managed by the non-profit organization Web3D Consortium. It is an XML-based ISO standard set up with the aim of becoming the 3D standard for the web.
In addition to simple 3D models, the format is also suitable for storing complex 3D data such as virtual reality.
X3D contains fairly extensive support for the storage of graphic elements.
X3D is a preferred format for file type 3D.
X3D File Format
PLY is also known as the Stanford Triangle Format. It is a WaveFront Object-inspired format with the option of extensions describing additional features of a 3D model, including color and transparency.
The .ply file can be supplied as an ASCII text file or as a binary format. ASCII PLY is preferred for accessibility and archiving in the long term.
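Whether a given .ply file is ASCII or binary is declared in its plain-text header. The sketch below assumes the usual PLY header layout (a first line `ply`, a `format ...` line, and a closing `end_header` line); the function name is hypothetical and the parsing is deliberately minimal.

```python
def ply_encoding(data: bytes) -> str:
    """Return the encoding keyword from a PLY header, e.g. 'ascii'."""
    # Everything before 'end_header' is plain text, even in binary PLY files.
    header, separator, _body = data.partition(b"end_header")
    if not separator:
        raise ValueError("no end_header line found")
    lines = header.decode("ascii", errors="replace").splitlines()
    if not lines or lines[0].strip() != "ply":
        raise ValueError("missing 'ply' magic line")
    for line in lines[1:]:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "format":
            return parts[1]
    raise ValueError("no format line in header")
```

A return value of `ascii` indicates the variant preferred for long-term archiving; `binary_little_endian` or `binary_big_endian` indicates a binary file.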
Not all software supports every extension of the Polygon file format: depending on the software used, the data may not be read completely. That is why, with PLY, it is especially important to document clearly which information the data contains.
Polygon file format is a preferred format for file type 3D.
Polygon File Format (PLY)
WaveFront Object is a very widely supported open file format for the representation of 3D geometry. The file contains a clear, simple structure in which the spatial positions of each point of the object and texture coordinates are written.
The .obj file can be supplied as an ASCII text file or as a binary format. ASCII OBJ is preferred for accessibility and archiving in the long term.
In addition to the .obj file, material / texture information can be stored in an .mtl file. The texture itself is saved separately as an image. Read more about raster images: https://dans.knaw.nl/en/file-formats/images-raster/. OBJ does not store information about animations.
WaveFront Object is a preferred format for file type 3D.
WaveFront Object
It is advisable to convert Grid files to ASCII text as much as possible. It can be expected from GIS applications that they can correctly import ASCII grid files.
ASCII GRID is a preferred format for file type raster grid.
ASCII GRID
GeoTIFF is a metadata standard for adding georeference to a TIFF image. This metadata is included in the TIFF file itself.
GeoTIFF is an extension of TIFF, an open format in the public domain. The format is supported by various commonly used GIS applications.
GeoTIFF is a preferred format for file type Images (georeference).
GeoTIFF
The “MapInfo Interchange Format” .mif, usually associated with the .mid file, is the export format of the MapInfo software, designed for GIS interoperability. It is a clear, well documented, well supported and stable ASCII text file.
It can be expected from GIS applications that they can correctly import MIF/MID data and that map layers in the application’s own format can be exported to MIF/MID.
MIF/MID is a preferred format for file type GIS.
MIF/MID (MapInfo Interchange Format)
GML is the XML standard for geographic data, designed by the Open Geospatial Consortium.
Since August 2007, GML has been adopted as the ISO standard ISO 19136.
Support for GML was initially limited but has since increased significantly; GIS applications can now be expected to import GML data correctly.
GML is a preferred format for file type GIS.
GML
The developer Autodesk, with the main software AutoCAD, is a market leader in the field of CAD. The AutoCAD formats DXF and DWG are used extensively.
No open formats have been developed for the exchange of CAD formats. DXF is specifically designed to facilitate data interoperability between AutoCAD and other programs and is therefore well supported by other CAD applications.
DXF version R12 (ASCII) appears to be best supported for successful and correct import into other applications.
A major problem with the use of DXF is the development of the DWG format. DWG now offers possibilities for which not all properties can be stored in DXF. If export to DXF R12 (ASCII) does not lead to information loss, this version of the DXF format is considered the best option for preserving AutoCAD formats in a relatively open, widely supported format. A conversion from DWG to DXF must be carefully checked for completeness. If elements of the DWG cannot be stored in the DXF, an export to a more recent version of DXF is acceptable. If conversion is not possible without loss of information, the DWG must be retained.
DXF version R12 (ASCII) is a preferred format for file type Computer Aided Design (CAD).
AutoCAD DXF version R12 (ASCII)
OPUS is a codec for Audio formats.
OPUS
Free Lossless Audio Codec (FLAC) is an open format highly suitable for transcoding without quality loss (lossless compression). The FLAC specification includes a wrapper for FLAC bitstreams plus the lossless compression codec itself. FLAC can be used as a dissemination/delivery format, but is also a suitable format for the archival storage of audio streams. Adoption is moderate.
FLAC is a DANS preferred format for audio, because of its lossless compression of digital audio, but adoption is moderate.
FLAC (Free Lossless Audio Codec)
Material Exchange Format (MXF) is an open wrapper format developed by the Society of Motion Picture and Television Engineers. It supports a number of different bitstreams encoded with any of a variety of codecs, together with a metadata wrapper that describes the material contained within the MXF file. The wrapper is relatively transparent, although the structure of an MXF file can be complex. Since the format was introduced, interest in and adoption of MXF have steadily increased. It is emerging as a standard for professional digital video and audio media. For example, the Library of Congress National Audio-Visual Conservation Center produces a form of MXF as archival masters as they reformat their older videotapes.
MXF is a DANS preferred container format for audio and video.
MXF (Material Exchange Format)
The Broadcast Wave Format (BWF) is an open standard that is widely recommended as the archival master for audio preservation. It consists of uncompressed audio. It is an extension of the WAV format, since it may contain metadata which describes the format of the audio data and the encoding method, and carries a sample-accurate time stamp that can be used to place related files in the proper sequence. BWF can wrap either MPEG or LPCM encoding, although for long-term preservation the use of LPCM is preferred.
BWF is a DANS preferred container format for Audio.
BWF (Broadcast Wave Format)
SVG stands for “Scalable Vector Graphics”. It is a robust, XML-based format for static and dynamic vector images.
SVG is an open standard and is well supported by various applications. SVG vector images can be opened in web browsers such as Firefox, Safari, Google Chrome, Microsoft Edge and Internet Explorer. For further processing, vector image applications such as Adobe Illustrator or Inkscape can be used. Inkscape is free to download from the website and works on Windows, Mac OS X and Linux.
All common Vector Image formats (EPS, AI, WMF, CDR) can be opened in Inkscape and Adobe Illustrator and converted to SVG.
SVG is a preferred format for file type Images (vector) and for file type Computer Aided Design (CAD).
SVG (Scalable Vector Graphics)
JASP is a free platform-independent statistical software package with a graphical user interface. It is an open source alternative to proprietary software like SPSS, and is being developed by the University of Amsterdam. In JASP, you have the option to export data in formats, such as .csv, that can easily be read into other applications. Results of analyses are exported in .html format.
These formats have been added to our preferred formats within the file type of Statistical data.
JASP
For relational databases, SIARD is seen as a suitable and sustainable format. SIARD (Software Independent Archiving of Relational Databases) is intended for archiving relational databases in a way that is as independent of the original DBMS as possible. This format takes into account all the significant characteristics of databases. SIARD is an open, freely available format, based on clear text formats: Unicode, XML, SQL (1999). This makes it accessible for various tools.
SIARD is a relatively young format, see SIARD_Suite.
SIARD is a preferred format for file type Databases.
SIARD (Software Independent Archiving of Relational Databases)
Many DBMSs support the ISO-standardized version of Structured Query Language (SQL): a language for querying and updating relational databases. Together with the Data Definition Language, used to define and modify schemas, the contents of a database can be stored as a collection of schema and data statements.
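As an illustration of storing a database as schema and data statements, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in here: the table and its single row are hypothetical, and other DBMSs have their own dump tools and dialect details.

```python
import sqlite3


def dump_as_sql(connection: sqlite3.Connection) -> str:
    """Serialise a database as a plain-text script of SQL statements."""
    return "\n".join(connection.iterdump())


# Hypothetical example database with one table and one row.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT)")
source.execute("INSERT INTO site (name) VALUES ('Dorestad')")
source.commit()

# The script contains the CREATE TABLE (schema) and INSERT (data) statements.
script = dump_as_sql(source)

# Replaying the script into an empty database reconstructs the contents.
restored = sqlite3.connect(":memory:")
restored.executescript(script)
rows = restored.execute("SELECT id, name FROM site").fetchall()
```

The resulting script is an ordinary text file, so it can be inspected and deposited without the original database software.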
The standard language rarely changes, but vendor-specific modifications and extensions may change along with software updates. If extensions are used, the documentation must show which SQL version is used.
It is possible to refer to non-existent and / or external data in SQL, without invalidating the file. If SQL is used for exchanging data, any external references must therefore be supplied, or each reference must be replaced by the data referred to.
SQL is a preferred format for file type Databases.
Structured Query Language (SQL)
ODF is an ISO standardized format for office documents. For spreadsheets, it uses the .ods extension; the other two extensions are .odt for textual files and .odp for presentations. ODF files are essentially XML files that comply with the XML schema defined in ISO/IEC 26300:2006. The last revision dates from 2015.
The advantage of ODS is that it is an open standard that can be processed by multiple applications. Metadata can be added in the form of RDF to later versions of ODF. In the case of spreadsheets, you can add metadata down to the level of cells. Examples of office suites that use ODF include OpenOffice and LibreOffice. MS Excel files can also be saved as ODS. There are various software libraries that can be used to process ODF files programmatically.
Media type
The media type of ODS files is application/vnd.oasis.opendocument.spreadsheet, and application/vnd.oasis.opendocument.spreadsheet-template for templates.
Also see
Planning for Library of Congress Collections: https://www.loc.gov/preservation/digital/formats/fdd/fdd000439.shtml
ODS is a preferred format for the Spreadsheets file type.
ODS Open Document Spreadsheet format
Files in the text-fabric file format (.tf) store a column of feature values that correspond to nodes and edges in a graph, which together represent annotated text. So, one could say that .tf is a Markup format.
Annotated Text
In the humanities, primary research data often takes the form of texts. Many of these texts are historical artefacts and a lot of knowledge is needed to interpret them. Annotations are a preferred way to represent this knowledge. They may convey detailed linguistic information at the word level, but they can also link persons, places, materials, and concepts found in the text to external descriptions.
Texts are always structured, and annotations need an addressing mechanism to target the specific portions in the text that they are about. The annotations tend to form bodies of knowledge in themselves, and need to be shared and distributed as separate entities.
Data model
Text-Fabric is a tool that facilitates this exchange of data. In order to do so, it defines a model [TF model] for annotated text. In this model, text is an annotated graph: a system of nodes and edges between nodes, where nodes and edges are linked to other information by means of features. The nodes stand for textual concepts, such as words, sentences, chapters, and the edges for relationships between these portions of text. Features are mappings from nodes or edges to values. Nodes themselves are just integer numbers, and edges are just pairs of numbers.
This model is very close to Linguistic Annotation Framework [LAF ISO standard]. The main differences are that LAF prefers to be represented in XML and Text-Fabric is XML-free, and that a LAF dataset may reside in a single or in separate files at the choice of the corpus designer, while a Text-Fabric dataset always stores a single feature in a single file.
Node features
A node feature is a mapping from numbers to values: a column of values, where the position in the column corresponds to the number of the node.
Edge features
An edge feature can be seen as a mapping from nodes to other nodes, where a value may be supplied for each connection. Edge features are also columns of values, where the position in the column corresponds to the number of the node where the edges start.
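The data model described above can be sketched in a few lines of Python. This is an illustration of the model only, not the .tf file syntax or the Text-Fabric API: the toy corpus, the feature names and the choice to number nodes from 1 are all assumptions made for the example.

```python
# Hypothetical toy corpus: three word nodes (1-3) and one sentence node (4).
node_type_column = ["word", "word", "word", "sentence"]
text_column = ["the", "quick", "fox", None]  # the sentence node has no text value


def column_to_feature(column, first_node=1):
    """Turn a column of values into a node -> value mapping.

    The position in the column determines the node number.
    """
    return {
        node: value
        for node, value in enumerate(column, start=first_node)
        if value is not None
    }


node_type = column_to_feature(node_type_column)
text = column_to_feature(text_column)

# An edge feature: for each start node, the nodes it points to,
# each with an optional value (None here).
contains = {4: {1: None, 2: None, 3: None}}


def sentence_text(sentence_node):
    """Join the text values of the word nodes a sentence node points to."""
    return " ".join(text[word] for word in sorted(contains[sentence_node]))
```

Because nodes are plain integers, each feature stays independent of the others, which fits the one-feature-per-file layout of a Text-Fabric dataset.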
File format
Text-Fabric defines an efficient way to store features in files [TF file format]. Each feature occupies a single file. A Text-Fabric dataset is just a flat collection of feature files.
Extension
Feature files typically have the extension .tf.
Tools
Text-Fabric is also a library [TF API] by which you can process text and annotations. It understands the .tf file format and offers an API to load and save feature files and to compute with the data contained in them. Text-Fabric compiles .tf files into binary .tfx files which are optimised to load very fast. These .tfx files are just a convenience but are not suitable for archiving and should not be considered a preferred or even acceptable format. They are dependent on the computer where they have been generated.
Text-Fabric is by no means required to make sense of .tf files. The format is so transparent that several users bypass the tool Text-Fabric and have written their own programs (in languages other than Python) to ingest .tf files.
Corpora
A number of corpora [TF Corpora] have already been converted to Text-Fabric, such as the Hebrew Bible, various Cuneiform tablet collections, the Quran, and more. For all these corpora there are dedicated tutorials [TF tutorials] that show the practice that Text-Fabric supports.
References
TF model: Model – Text-Fabric (archived version)
TF file format: Format – Text-Fabric (archived version)
TF optimizations: Optimizations – Text-Fabric (archived version)
TF example: Banks: convert.ipynb (archived version)
TF API: TF – Text-Fabric (archived version)
TF Corpora: Corpora – Text-Fabric (archived version)
TF Tutorials: tutorials (archived version)
LAF ISO Standard: ISO 24612:2012 – Linguistic annotation framework (LAF)
Text-Fabric is a preferred format for file type Programming languages.
Text-Fabric
MATLAB is a programming platform designed specifically for engineers and scientists. The heart of MATLAB is the MATLAB language, a matrix-based language allowing the most natural expression of computational mathematics. The name MATLAB stands for matrix laboratory. MATLAB was originally written to provide easy access to matrix software developed by the LINPACK and EISPACK projects, which together represent the state of the art in software for matrix computation. MATLAB has evolved over a period of years with input from many users. In university environments, it is the standard instructional tool for introductory and advanced courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice for high-productivity research, development, and analysis. DANS accepts MATLAB code files (.m) and workspace files (.mat) without further conversion. We strongly suggest that you structure your code as neatly as possible, clearly state which version of MATLAB was used to develop it, include comments where necessary, and follow good programming practices. The code should be clear enough for users to understand the results and to reproduce the study and its methods.
MATLAB is a preferred format for file type Programming languages.
Matlab
Markdown is a lightweight markup language that is known for its simplicity and ease of use for writing structured documents. It was created in 2004 by John Gruber, with significant contributions from Aaron Swartz. Gruber wanted to create a plain-text format that could be easily converted to structurally valid HTML. His goal was to develop a syntax that was both human-readable and easy to write, as opposed to the verbose nature of HTML. The resulting Markdown consists of two components: a plain text formatting syntax, and a software tool that converts the plain text to HTML.
The first Markdown version was released as an open-source project and was quickly adopted by bloggers and people working with wikis. Today it is widely used by platforms like GitHub (as README files), Reddit, Stack Overflow, and Discord. Markdown’s readability and ease of use make it popular in a variety of domains, including web content creation, software documentation, note-taking and manuscript writing. It allows bloggers, journalists, developers and authors to create documents for the web without needing a deep understanding of HTML or CSS, as Markdown files can easily be converted into fully formatted documents.
While Markdown is widely used, there isn’t a universally accepted standard specification for its syntax. The exact implementation can differ across platforms and versions. Markdown parsers and implementations may interpret the same input differently, leading to potential difficulties when moving content across platforms or converting it into different formats. For example, certain Markdown features, such as code blocks, images and footnotes, may be rendered differently on GitHub versus a personal website using a different Markdown processor. Note that these rendering differences don’t affect the content of the Markdown file, just the layout.
Interpretations can also differ across different Markdown parsers. This has led to the development of slightly different Markdown syntax variations, referred to as flavors or dialects. The vast majority of the syntax is the same across dialects. The dialects may differ in which features are supported (such as tables or math blocks), and may handle ambiguities differently. DANS accepts any Markdown dialect.
A Markdown document can have a .md or .markdown extension.
Markdown is a preferred file format for file type Markup Language.
Markdown
The OpenDocument family of file formats has been developed by the Organization for the Advancement of Structured Information Standards (OASIS) as XML-based open standards for storing and exchanging various types of data. OpenDocument was adopted as an official ISO standard on 11 November 2006. This standard is further elaborated in OpenDocument 1.1 (1 February 2007) and OpenDocument 1.2 (29 September 2011).
The files are in fact ZIP archives that contain XML files and any additions such as images.
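Because an OpenDocument file is a ZIP package, it can be inspected with a general-purpose ZIP library. The sketch below assumes the package contains an entry named `mimetype` that declares the media type, as the OpenDocument packaging rules describe; `build_toy_package` is a hypothetical helper that builds an incomplete stand-in package purely for illustration.

```python
import io
import zipfile


def odf_media_type(package: bytes) -> str:
    """Read the media type declared inside an OpenDocument package."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.read("mimetype").decode("ascii").strip()


def build_toy_package(media_type: str) -> bytes:
    """Build a tiny, incomplete stand-in for an ODF file (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry is written first and stored uncompressed.
        archive.writestr("mimetype", media_type, compress_type=zipfile.ZIP_STORED)
        archive.writestr("content.xml", "<placeholder/>")
    return buffer.getvalue()
```

For a real document, the bytes would come from the file itself, for example `odf_media_type(open(path, "rb").read())`, with `path` as a placeholder.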
The free Apache OpenOffice software uses OpenDocument formats as its standard formats. Outside of this package, OpenDocument formats also enjoy good support. Microsoft Office has supported OpenDocument formats since version 2007, although problems may occur with specific properties of OpenDocument files.
In many countries, including the Netherlands, the use of open source software and OpenDocuments has been set as standard for making government documentation accessible.
OpenDocument formats are suitable for archiving, accessibility and reusability in the long term.
The OpenDocument format for text documents is OpenDocument Text (.odt). ODT is a preferred format under the Text documents file type.
Open Document Format
CROISSANT Metadata Format
This standard defines a digital file format useful for storage, transmission and processing of scientific and other images in astronomy. Unlike many image formats, FITS is designed specifically for scientific data and hence includes many provisions for describing photometric and spatial calibration information, together with image origin metadata. A major feature of the FITS format is that image metadata is stored in a human-readable ASCII header, so that an interested user can examine the headers to investigate an archived file of unknown provenance.
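The human-readable header can be examined with very little code. The sketch below assumes the FITS convention of fixed 80-character ASCII header records ("cards") with the keyword in the first eight characters, and a header that ends with an `END` card. The function names are hypothetical, and the comment handling is naive: a quoted string value that itself contains a slash would be cut short.

```python
CARD_LENGTH = 80  # each header record ("card") is 80 ASCII characters


def read_header_cards(data: bytes) -> dict:
    """Collect keyword -> raw value text from the start of a FITS file."""
    cards = {}
    for offset in range(0, len(data), CARD_LENGTH):
        card = data[offset:offset + CARD_LENGTH].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":
            # Naive: treat everything after the first "/" as a comment.
            cards[keyword] = card[10:].split("/", 1)[0].strip()
    return cards


def make_card(text: str) -> bytes:
    """Pad a header line to a full card (hypothetical helper for the demo)."""
    return text.ljust(CARD_LENGTH).encode("ascii")


# Synthetic header for illustration, not taken from a real observation.
sample_header = (
    make_card("SIMPLE  =                    T / conforms to FITS")
    + make_card("BITPIX  =                   16")
    + make_card("NAXIS   =                    0")
    + make_card("END")
)
sample_cards = read_header_cards(sample_header)
```

For production use, an established FITS library is the safer choice; the sketch only shows why a header of unknown provenance can be inspected with basic tools.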
Flexible Image Transport System
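To illustrate the human-readable ASCII header, here is a rough sketch that splits a header into 80-character records ("cards") and extracts keyword/value pairs. It assumes the usual card layout (keyword in the first 8 characters, followed by "= " and the value, with an optional "/" comment) and deliberately ignores complications such as string values that contain "/". The function name `parse_fits_cards` is hypothetical; dedicated FITS libraries should be preferred for real files.

```python
def parse_fits_cards(header: str) -> dict:
    """Parse simple keyword/value cards from a FITS-style ASCII header."""
    cards = {}
    for start in range(0, len(header), 80):
        card = header[start:start + 80]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        # A value card carries "= " directly after the 8-character keyword field.
        if card[8:10] == "= ":
            value = card[10:].split("/", 1)[0].strip()  # drop the trailing comment
            cards[keyword] = value
    return cards


# Hand-written sample header, padded to 80 characters per card.
sample = "".join(card.ljust(80) for card in [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                   16 / bits per data value",
    "NAXIS   =                    0 / number of data axes",
    "END",
])

header_values = parse_fits_cards(sample)
print(header_values)
```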
anndata is a Python package for handling annotated data matrices in memory and on disk, positioned between pandas and xarray. anndata offers a broad range of computationally efficient features including, among others, sparse data support, lazy operations, and a PyTorch interface. anndata is part of the scverse project (website, governance) and is fiscally sponsored by NumFOCUS.
anndata | Access and store annotated data matrices
0.12.2
The International Federation of Digital Seismograph Networks (FDSN) defines miniSEED as a format for digital data and related information. The primary intended uses are data collection, archiving and exchange of seismological data. The format is also appropriate for time series data from other geophysical measurements such as pressure, temperature, tilt, etc. In addition to the time series, storage of related state-of-health and parameters documenting the state of the recording system are supported. The FDSN metadata counterpart of miniSEED is StationXML which is used to describe characteristics needed to interpret the data such as location, instrument response, etc.
miniSEED | International Federation of Digital Seismograph Networks FDSN
A CBOR-based serialization format for Linked Data, designed to leverage the JSON-LD
ecosystem for compact, efficient binary encoding in constrained environments.
CBOR-LD
A set of conventions built on top of YAML, which outlines how to serialize
Linked Data as YAML based on JSON-LD syntax, semantics, and APIs.
YAML-LD
MPEG-4 (Moving Picture Experts Group - 4) is a standard for compressing and encoding audio-visual digital data. Developed by ISO/IEC, it enables efficient storage and streaming of multimedia content such as video, audio, subtitles, and interactive graphics. MPEG-4 is widely used in formats like .mp4 and is compatible with web, broadcast, and mobile applications.
MPEG-4 | Moving Picture Experts Group-4
In IUCLID 6, the exchange of chemical information, from either datasets or dossiers, is facilitated via a zip/archive file that has the extension i6z, which stands for IUCLID 6 zip. Chemical information can be exported as an i6z file from an installation of IUCLID 6, and then imported into another. An i6z file has a well-defined and structured format that contains information on the IUCLID 6 entities, documents, and attachments it contains. The export feature of IUCLID 6 provides an advanced filtering mechanism that allows a user to select which of the interrelated entities are included in the archive.
i6z | IUCLID dossier schema format
SPARQL Protocol and RDF Query Language (SPARQL). This document is an overview of SPARQL 1.1. It provides an introduction to a set of W3C specifications that facilitate querying and manipulating RDF graph content on the Web or in an RDF store.
SPARQL Protocol and RDF Query Language Overview | SPARQL
MBTiles is a compact, restrictive specification. It is, technically, a SQLite database. It supports only tiled data, including vector or image tiles and interactivity grid tiles. Only the Spherical Mercator projection is supported for presentation (tile display), and only latitude-longitude coordinates are supported for metadata such as bounds and centers. It is a minimum specification, only specifying the ways in which data must be retrievable. Thus MBTiles files can internally compress and optimize data, and construct views that adhere to the MBTiles specification. One MBTiles file represents a single tileset, optionally including grids of interactivity data. Multiple tilesets (layers, or maps in other terms) can be represented by multiple MBTiles files.
MBTiles File Format
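Since an MBTiles file is technically a SQLite database, it can be queried with any SQLite client. The sketch below uses Python's standard `sqlite3` module against an in-memory database. The `metadata` and `tiles` table layouts follow the MBTiles specification as commonly documented, but they are stated here as an assumption and should be checked against the specification version in use; `read_tile` is a hypothetical helper and the tile bytes are fake.

```python
import sqlite3

# In-memory stand-in for an .mbtiles file (assumed schema, illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
conn.execute(
    "CREATE TABLE tiles ("
    "zoom_level INTEGER, tile_column INTEGER, tile_row INTEGER, tile_data BLOB)"
)
conn.execute("INSERT INTO metadata VALUES ('name', 'demo')")
conn.execute("INSERT INTO tiles VALUES (0, 0, 0, ?)", (b"fake-tile-bytes",))


def read_tile(connection, zoom, column, row):
    """Return the raw tile bytes for one tile address, or None if absent."""
    cursor = connection.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (zoom, column, row),
    )
    found = cursor.fetchone()
    return None if found is None else found[0]


tile = read_tile(conn, 0, 0, 0)
print(tile)
```

For a real tileset, `sqlite3.connect` would be given the path of the .mbtiles file instead of `":memory:"`.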
The LAZ file format is a compressed version of the LAS (Lidar LASer) file format, which is specifically designed for storing lidar point cloud data. LAZ files retain the same data and structure as LAS files but employ lossless compression techniques to reduce file size while preserving the original data fidelity. The LAZ file format was developed to address the growing demand for efficient storage and transmission of large lidar datasets. By compressing LAS files, LAZ files significantly reduce their size, making them easier to manage and transfer. The compression is achieved by employing a combination of different algorithms, such as entropy coding and variable-length encoding, to represent lidar point attributes in a more compact form. Despite the compression, LAZ files retain the ability to fully restore the original LAS data without any loss of information. This means that once a LAZ file is decompressed, it can be processed and analyzed in the same way as an uncompressed LAS file. The compression and decompression process is typically performed using specialized software or libraries that support the LAZ format. The LAZ file format maintains compatibility with LAS files, ensuring interoperability across lidar software and processing tools. This means that applications that can read and process LAS files can typically handle LAZ files without any modifications.
LAZ File Format | .laz
The LAS file format is a public file format for the interchange of 3-dimensional point cloud data between data users. Although developed primarily for exchange of lidar point cloud data, this format supports the exchange of any 3-dimensional x,y,z tuplet. This binary file format is an alternative to proprietary systems or a generic ASCII file interchange system used by many companies. The LAS file format is a binary file format that maintains information specific to the lidar nature of the data while not being overly complex.
LASer (LAS) File Format | .las
ASCII (American Standard Code for Information Interchange) is the most common character encoding format for text data in computers and on the internet. In standard ASCII-encoded data, there are unique values for 128 alphabetic, numeric or special additional characters and control codes.
ASCII | American Standard Code for Information Interchange
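The 128-value limit means every byte of standard ASCII data is below 128. As a small sketch, the hypothetical helper `is_ascii` checks that property for arbitrary bytes:

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte falls in the 128-value standard ASCII range."""
    return all(byte < 128 for byte in data)


plain = "plain text".encode("ascii")
accented = "café".encode("utf-8")  # "é" needs bytes outside the ASCII range

print(is_ascii(plain), is_ascii(accented))
```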
GPS Exchange Format (GPX) is an XML schema designed as a common GPS data format for software applications. It can be used to describe waypoints, tracks, and routes. It is an open format and can be used without the need to pay license fees. Location data (and optionally elevation, time, and other information) is stored in tags and can be interchanged between GPS devices and software. Common software applications for the data include viewing tracks projected onto various map sources, annotating maps, and geotagging photographs based on the time they were taken.
GPS Exchange Format | .gpx
A shapefile is an Esri vector data storage format for storing the location, shape, and attributes of geographic features. It is stored as a set of related files and contains one feature class. Shapefiles often contain large features with a lot of associated data and historically have been used in GIS desktop applications. The primary way to make shapefile data available for others to view through a web browser is to add it to a .zip file, upload it, and publish a hosted feature layer. The .zip file must contain at least the .shp, .shx, .dbf, and .prj file components of the shapefile.
Shape file | .shp
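The requirement that the .zip file contain at least the .shp, .shx, .dbf, and .prj components can be checked before upload. This is a minimal sketch using Python's standard library; `missing_components` is a hypothetical helper, the file names are made up, and the check looks only at file extensions, not at file contents.

```python
import io
import zipfile
from pathlib import PurePosixPath

REQUIRED_SUFFIXES = {".shp", ".shx", ".dbf", ".prj"}


def missing_components(data: bytes) -> set:
    """Return the required shapefile extensions absent from a zipped shapefile."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        present = {PurePosixPath(name).suffix.lower() for name in archive.namelist()}
    return REQUIRED_SUFFIXES - present


# Illustrative archive that lacks the projection (.prj) component.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("roads.shp", "roads.shx", "roads.dbf"):
        archive.writestr(name, b"")

missing = missing_components(buffer.getvalue())
print(missing)
```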
The first proposal for the 'Receiver Independent Exchange Format' (RINEX) was developed by the Astronomical Institute of the University of Berne for the easy exchange of the GPS data to be collected during the large European GPS campaign EUREF 89. Currently the format consists of four ASCII file types: 1. Observation Data File 2. Navigation Message File 3. Meteorological Data File 4. GLONASS Navigation Message File. RINEX Version 2 also allows the inclusion of observation data from more than one site subsequently occupied by a roving receiver in rapid static or kinematic applications.
RINEX2 | Receiver Independent Exchange Format Version 2
The UKOOA format (United Kingdom Offshore Operators Association) is a data format developed to manage and store geological and geophysical information in the context of offshore activities, particularly for the oil and gas industry in the United Kingdom. It is used to standardize and exchange data between operating companies and regulatory authorities, such as the UK Oil and Gas Authority (OGA).
UKOOA | United Kingdom Offshore Operators Association
The SEG-Y (sometimes SEG Y or SEGY) file format is one of several data standards developed by the Society of Exploration Geophysicists (SEG) for the exchange of geophysical data.
SEG-Y | Society of Exploration Geophysicists
PDBx/mmCIF is a data dictionary for archiving macromolecular crystallographic experiments and their results.
macromolecular Crystallographic Information File | PDBx/mmCIF
An exchange format for reporting experimentally determined three-dimensional structures of biological macromolecules that serves a global community of researchers, educators, and students. The data contained in the archive include atomic coordinates, bibliographic citations, primary and secondary structure information, crystallographic structure factors, and NMR experimental data.
Protein Data Bank Format | PDB
A language for describing and validating RDF graphs.
Shapes Constraint Language
SHACL, Shapes Constraint Language, is a language for validating RDF graphs against a set of conditions.
SHACL Shapes Constraint Language
The Fast Healthcare Interoperability Resources standard is a set of rules and specifications for exchanging electronic health care data.
Fast Healthcare Interoperability Resources
IIIF is a set of open standards for delivering high-quality, attributed digital objects online at scale. It’s also an international community developing and implementing the IIIF APIs. IIIF is backed by a consortium of leading cultural institutions.
International Image Interoperability Framework (IIIF)
MARC (MAchine-Readable Cataloging) standards are a set of digital formats for the description of items catalogued by libraries, such as books. MARC 21 was designed to redefine the original MARC record format for the 21st century and to make it more accessible to the international community. MARC 21 is a result of the combination of the United States and Canadian MARC formats (USMARC and CAN/MARC).
MARC21 Format for Bibliographic Data
kind of computer file containing plain text (description from https://www.wikidata.org/wiki/Q86920)
TXT | text file
Surveys designed in Colectica Questionnaires are represented in DDI 3.2. DDI is a data documentation standard used by national statistical organizations, long-running longitudinal studies, data archives, and others. Files of DDI 3.2 XML can be generated from the interface of the Colectica Questionnaires.
DDI XML | DDI 3.2 XML Schema
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
eXtensible Markup Language (XML)
SKOS is a common data model for sharing and linking knowledge organization systems via the Web.
SKOS | Simple Knowledge Organization System
Exchangeable image file format is a standard that specifies formats for images, sound, and ancillary tags used by digital cameras, scanners and other systems handling image and sound files recorded by digital cameras. This resource represents version 3.
EXIF 3 | Exchangeable Image File Format version 3
Exchangeable image file format is a standard that specifies formats for images, sound, and ancillary tags used by digital cameras, scanners and other systems handling image and sound files recorded by digital cameras. This resource represents version 2 (the linked standard is version 2.32).
EXIF 2 | Exchangeable Image File Format version 2
Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, and Python. It is built on top of Apache Parquet.
Delta Lake
Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
Apache Iceberg
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Apache Parquet
RDF-XML is a syntax, defined by the W3C, to express (i.e. serialize) an RDF graph as an XML document.
RDF-XML | XML syntax for RDF
MOD is conceived as an OWL ontology and application profile to capture metadata information for ontologies, vocabularies or semantic resources/artefacts in general. MOD 2.0 is designed as a profile of DCAT 2.
MOD | Metadata for Ontology Description and publication Ver 2.0
Turtle is a common textual syntax for RDF that allows an RDF graph to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes.
Turtle Format
N-Triples is a format for storing and transmitting data. It is a line-based, plain text serialisation format for RDF (Resource Description Framework) graphs, and a subset of the Turtle (Terse RDF Triple Language) format.
N-Triples format
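Because N-Triples is line-based, each non-empty, non-comment line holds exactly one triple terminated by a ".". The sketch below splits such lines on whitespace; it is a simplification that assumes well-formed input and does not validate IRIs, literals, or escapes. The function name `split_ntriples` and the example.org IRIs are hypothetical; an RDF library should be used for real data.

```python
def split_ntriples(text: str) -> list:
    """Split N-Triples text into (subject, predicate, object) string tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate contain no whitespace; the object may (literals).
        subject, predicate, rest = line.split(None, 2)
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()  # drop the terminating "."
        triples.append((subject, predicate, obj))
    return triples


sample = (
    "# one comment line\n"
    '<http://example.org/a> <http://example.org/name> "Alice Smith" .\n'
    "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n"
)

triples = split_ntriples(sample)
print(len(triples))
```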
R file format, most often used with RStudio.
R File Format
SAS is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics.
SAS File Format for the SAS Statistical Analysis System
IBM® SPSS® Statistics data files are files specifically formatted for use by IBM SPSS Statistics, containing both data and the metadata (dictionary) that define the data.
IBM SPSS Statistics Data File Format
Stata data files are saved with the extension “.dta”. This means the file is ready to use in Stata and, unlike data saved in, for example, an Excel file, you do not need to import it into Stata. Stata is a general-purpose statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting.
Stata Data File Format (.dta)
Files with the .csv (Comma Separated Values) extension are plain text files that contain records of data with comma-separated values. Each line in a CSV file is a new record from the set of records contained in the file. Such files are typically generated when data is transferred from one storage system to another. Since most applications can recognize records separated by commas, importing such data files into a database is convenient. Almost all spreadsheet applications, such as Microsoft Excel or OpenOffice Calc, can import CSV without much effort. Data imported from such files is arranged in the cells of a spreadsheet for presentation to the user.
CSV File Format
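The one-record-per-line structure can be illustrated with Python's standard `csv` module. This is a small sketch on made-up data, assuming the first line is a header row; note that all values are read as strings.

```python
import csv
import io

# Illustrative CSV text: a header line followed by one record per line.
text = "station,temperature\nA,12.5\nB,9.0\n"

# DictReader uses the first line as field names for the following records.
records = list(csv.DictReader(io.StringIO(text)))
print(records)
```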
The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript. Web browsers receive HTML documents from a web server or from local storage and render the documents into multimedia web pages. HTML describes the structure of a web page semantically and originally included cues for its appearance.
HTML: The HyperText Markup Language
Jinja is a fast, expressive, extensible templating engine. Special placeholders in the template allow writing code similar to Python syntax. Then the template is passed data to render the final document.
Jinja Template Engine
JSON-LD (JavaScript Object Notation for Linked Data) is a method of encoding linked data using JSON.
JavaScript Object Notation Language for Linked Data
The Web Ontology Language (OWL) is a family of knowledge representation languages or ontology languages for authoring ontologies or knowledge bases. The languages are characterized by formal semantics and RDF/XML-based serializations for the Semantic Web. OWL is endorsed by the World Wide Web Consortium (W3C) and has attracted academic, medical and commercial interest. The OWL 2 Web Ontology Language, informally OWL 2, is an ontology language for the Semantic Web with formally defined meaning. OWL 2 ontologies provide classes, properties, individuals, and data values and are stored as Semantic Web documents. OWL 2 ontologies can be used along with information written in RDF, and OWL 2 ontologies themselves are primarily exchanged as RDF documents.
OWL | Web Ontology Language
JSON-LD is a JSON-based format to serialize Linked Data. The syntax is designed to easily integrate into deployed systems that already use JSON, and provides a smooth upgrade path from JSON to JSON-LD. It is primarily intended to be a way to use Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage engines. JSON-LD is a concrete RDF syntax. A JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model. However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize generalized RDF Datasets.
JSON-LD | JavaScript Object Notation for Linking Data
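Since a JSON-LD document is also a JSON document, an ordinary JSON parser can read it. The sketch below parses a small hand-written document with Python's standard `json` module; the example.org identifier and the mapping of `name` to a schema.org IRI are illustrative assumptions. Interpreting the `@context` as Linked Data would require a JSON-LD processor, which is not shown here.

```python
import json

# Hand-written example: "@context" maps the short key "name" to an IRI.
text = """
{
  "@context": {"name": "http://schema.org/name"},
  "@id": "http://example.org/people/alice",
  "name": "Alice"
}
"""

document = json.loads(text)  # plain JSON parsing, no Linked Data processing
print(document["@context"]["name"])
```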
JavaScript Object Notation (JSON) is a lightweight, text-based, language-independent data interchange format. It was derived from the ECMAScript Programming Language Standard. JSON defines a small set of formatting rules for the portable representation of structured data. This RFC specification aims to remove inconsistencies with other specifications of JSON, repair specification errors, and offer experience-based interoperability guidance.
JSON | JavaScript Object Notation
RDF Schema (RDFS) is the RDF vocabulary description language. RDFS defines classes and properties that may be used to describe classes, properties and other resources.
RDFS | Resource Description Framework Schema
Unidata's Network Common Data Form (netCDF) is a set of software libraries and a machine-independent data format that support the creation, access, and sharing of array-oriented scientific data. This record describes the data format and not the software libraries. The netCDF Data Model is also a community standard for sharing scientific data. The data model of dimensions, variables, and attributes, which defines the Classic Model, was extended starting with netCDF-4.0. The new Enhanced Data Model supports the Classic Model in a completely backward-compatible way, while allowing access to new features such as groups, multiple unlimited dimensions, and new types, including user-defined types. For maximum interoperability with existing code, new data should be created with the Classic Model. The Classic Model was introduced with the very first netCDF release, and is still the core of all netCDF files.
NetCDF | Network Common Data Form
Crossref is a registration agency of the International DOI Foundation. Crossref provides a mechanism for identifying and describing research objects (books and chapters, components, conference proceedings, datasets, dissertations, grants, journals and articles, peer reviews, pending publications, posted content (includes preprints), reports and working papers, and standards). It follows the ISO/IEC 11179 Metadata Registry (MDR) standard, which specifies a schema for recording both the meaning and technical structure of the data for unambiguous usage by humans and computers. Crossref uses a single deposit schema stored as XML, which supports a range of different content types and provides a structure and set of rules to keep everything consistent and interoperable.
Crossref (DOI)
The REFI-QDA (Rotterdam Exchange Format Initiative, Qualitative Data Analysis) Standard enables interoperability for exchanging processed qualitative data between Qualitative Data Analysis Software (QDAS or CAQDAS) programs. Its purpose is to enable users to exchange processed data between programs. It is an open standard and any program can implement it, thus increasing the number of software programs that can talk to one another.
REFI-QDA Standard
The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics.
Text Encoding Initiative
OceanSITES uses netCDF (Network Common Data Form), a set of software libraries and machine-independent data formats developed by the Unidata program at UCAR. The OceanSITES data management team has developed an implementation of netCDF for the data sets. This implementation is based on the community-supported Climate and Forecast Metadata Convention (CF), which provides a definitive description of the data in each variable, and the spatial and temporal properties of the data. Any version of CF may be used, but it must be identified in the ‘Conventions’ attribute.
NetCDF OceanSITES
A file with the .e57 extension uses a compact, vendor-neutral file format for storing and exchanging three-dimensional (3D) imaging data such as point clouds, images, and metadata. E57 is open source and stores 3D point data, its attributes (such as colour and intensity), and 2D imagery as captured by the 3D imaging system.
E57 Lidar Point Cloud Data Format
STL is a file format native to the stereolithography CAD software created by 3D Systems. STL files describe only the surface geometry of a three-dimensional object without any representation of colour, texture or other common CAD model attributes. The files are commonly used for 3D printing.
STL File Format
PNG is a raster-graphics file format that supports lossless data compression. PNG files use the file extension .png and have been assigned the MIME media type image/png.
Portable Network Graphics (PNG)
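Beyond the .png extension and the image/png media type, PNG data can be recognized by its content: the PNG specification defines a fixed 8-byte signature at the start of every file. The sketch below checks for that signature; `looks_like_png` is a hypothetical helper, and matching the signature does not prove that the rest of the file is valid.

```python
# Fixed 8-byte signature that begins every PNG datastream.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Return True if the bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


# Illustrative inputs: signature plus placeholder bytes, and plain text.
candidate = PNG_SIGNATURE + b"placeholder-not-a-real-image"
print(looks_like_png(candidate), looks_like_png(b"not an image"))
```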
The Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it.
Portable Document Format (PDF)