MPEG-G
MPEG-G (ISO/IEC 23092) is an ISO/IEC standard for the representation of genomic information, developed jointly by ISO/IEC JTC 1/SC 29/WG 9 (MPEG) and ISO TC 276 "Biotechnology" Working Group 5. The goal of the standard is to provide interoperable solutions for the storage, access, and protection of data generated by high-throughput sequencing machines and by their subsequent processing and analysis, across different possible implementations.[1][2] The standard is composed of several parts, each addressing a specific aspect, such as compression, metadata association, Application Programming Interfaces (APIs), and reference software for data decoding. Together with the reference decoder software, commercial and open source[3] implementations started to become available in 2019, covering progressively more of the published parts of the standard.

Background

The advent of high-throughput sequencing (HTS) technologies has revolutionized the field of quantitative biology. The availability of large collections of genomic information has entered everyday practice and has become a cornerstone of a number of disciplines, ranging from biological research to personalized medicine in the clinic. At the moment, genomic information is mostly exchanged through a variety of data formats, such as FASTA/FASTQ for unaligned sequencing reads and SAM/BAM/CRAM for aligned reads. The ISO/IEC 23092 (MPEG-G) standard aims to provide a unified format for the efficient representation and compression of such diverse data, both for file storage and for data transport. To that end, the standard is divided into several parts.

Structure of the standard

The MPEG-G standard utilizes technology and data representation architectures previously validated in the field of digital media.
These make it possible to compress and transport genome sequencing data even in complex scenarios, for instance when access is needed to large amounts of possibly distributed data, or when part of the data needs to be encrypted for privacy reasons. Conceptually, such requirements lead to the definition of a number of mutually interrelated mechanisms, which are summarized in the following list:
In turn, some of these topics have been grouped together in order to make the standard easier to understand and implement. As a result, the ISO/IEC 23092 standard is physically structured as a series of separate documents, as follows:
ISO/IEC 23092-1 MPEG-G Part 1

ISO/IEC 23092-1 specifies how genomic data is organized within MPEG-G structures for transport (i.e., streaming) and storage. This part defines the formats of the genomic record, the reference record, the MPEG-G file and the transport stream. It introduces the Access Unit as the container of the compressed genomic data and provides a reference conversion process among the different formats.

ISO/IEC 23092-2 MPEG-G Part 2

ISO/IEC 23092-2 specifies the syntax and methods for MPEG-G lossless compression of sequencing data and lossy compression of the associated quality scores. As is typical for MPEG standards, MPEG-G only specifies the decoding process, while the encoding process is left open to algorithmic and implementation-specific innovations. All conforming MPEG-G decoders produce identical outputs from the multiplexed bitstreams included in MPEG-G files and from the data streams in streaming scenarios. The input of the encoder consists of genomic records or metadata, with optional reference data, while its output is an MPEG-G file or a transport stream.

ISO/IEC 23092-3 MPEG-G Part 3

ISO/IEC 23092-3 specifies a metadata format and provides genomic data representation APIs to support interoperability among existing tools and systems. Part 3 specifies how an MPEG-G compliant bitstream can be integrated with metadata, as well as mechanisms for access control, integrity verification, authentication and authorization. This part also contains an informative section devoted to the mapping between SAM and MPEG-G data structures, including backward compatibility with existing SAM content. It defines:
ISO/IEC 23092-4 MPEG-G Part 4

ISO/IEC 23092-4[9] specifies the reference software for genomic information representation, referred to as the genomic model (GM). It consists of two components: the reference encoder software and the reference decoder software. The reference decoder software is provided to assess conformance to the requirements of ISO/IEC 23092-1,[4] ISO/IEC 23092-2[5] and ISO/IEC 23092-6,[7] while the reference encoder software serves as a guide for implementing those parts. The reference encoder software, called Genie,[3] is open source software developed by a group of individuals from multiple universities and companies around the world. It features the following components:
ISO/IEC 23092-5 MPEG-G Part 5

ISO/IEC 23092-5 specifies conformance for the coding of genomic information. Part 5 provides a means to test and validate the correct implementation of MPEG-G technology in different devices and applications, in order to ensure interoperability among all systems. It specifies a normative procedure to assess conformity to the standard on an exhaustive set of compressed data.

MIME type and filename extensions

No MIME type (IANA media type registered under RFC 6838) is currently defined for MPEG-G files. No conventional file extensions are defined.
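
The conformance idea described for Part 2 and Part 5 (every conforming decoder must produce identical output from the same bitstream) can be illustrated with a small sketch. This is not the normative procedure of ISO/IEC 23092-5: the vector layout, the names `failing_vectors`, `VECTORS` and `broken_decoder`, and the use of SHA-256 digests are all assumptions made for illustration, and `zlib` only stands in for a real MPEG-G decoder.

```python
import hashlib
import zlib
from typing import Callable, Iterable, List, Tuple

# Hypothetical test vector layout (an assumption, not defined by the standard):
# (name, compressed bitstream, SHA-256 digest of the expected decoded output)
ConformanceVector = Tuple[str, bytes, str]


def sha256_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def failing_vectors(decode: Callable[[bytes], bytes],
                    vectors: Iterable[ConformanceVector]) -> List[str]:
    """Run a decoder over every vector and return the names of those it fails.

    A vector fails if the decoder raises, or if its output does not match
    the expected digest bit for bit.
    """
    failures = []
    for name, bitstream, expected_digest in vectors:
        try:
            decoded = decode(bitstream)
        except Exception:
            failures.append(name)
            continue
        if sha256_hex(decoded) != expected_digest:
            failures.append(name)
    return failures


# Toy test set. zlib is only a stand-in codec here, NOT the MPEG-G format.
_payloads = {"reads_a": b"ACGTACGTAA", "reads_b": b"TTGACCA"}
VECTORS = [(name, zlib.compress(data), sha256_hex(data))
           for name, data in _payloads.items()]


def broken_decoder(bitstream: bytes) -> bytes:
    """A deliberately wrong decoder: it reverses the decoded bytes."""
    return zlib.decompress(bitstream)[::-1]


print(failing_vectors(zlib.decompress, VECTORS))
print(failing_vectors(broken_decoder, VECTORS))
```

Because only the decoder output is compared, the check says nothing about how the bitstreams were encoded, which mirrors the way the standard leaves the encoder open.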