GA4GH Computed Identifier Alignment

This appendix describes alignment on standard practices for serializing data, computing digests on serialized data, and constructing CURIE identifiers from the digests. Essentially, it is a generalization of the Computed Identifiers section.

This mechanism for generating identifiers has been in place since VRS version 1.0.

Background

The GA4GH mission entails structuring, connecting, and sharing data reliably. A key component of this effort is to be able to identify entities, that is, to associate identifiers with entities. Ideally, there will be exactly one identifier for each entity, and one entity for each identifier. Traditionally, identifiers are assigned to entities, which means that disconnected groups must coordinate on identifier assignment.

The computed identifier scheme used in VRS computes identifiers from the data itself. Because identifiers depend on the data, groups that independently generate the same variation will generate the same computed identifier for that entity, thereby obviating centralized identifier systems and enabling identifiers to be used in isolated settings such as clinical labs.

The computed identifier mechanism is broadly applicable and useful to the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to share tooling that manipulates identified data. In short, it provides an important consistency within the GA4GH ecosystem.

Here we detail alignment between VRS and other GA4GH products to work towards consistent approaches to identifier design.

VRS Convention

The following algorithmic processes, described in depth in the VRS Computed Identifiers convention, are included in this overview by reference:

GA4GH Digest Serialization is the process of converting an object to a canonical binary form based on JSON and the RFC 8785 specification. This strategy was chosen because RFC 8785 is visible on the IETF site as an independent standard (not IETF-endorsed), and because the Sequence Collections draft standard selected the same specification.
GA4GH Truncated Digest is a convention for using SHA-512, truncated to 24 bytes, and encoded using base64url. This convention is shared with the RefGet v2.0 specification.
GA4GH Identification is the CURIE-based syntax for constructing a namespaced and typed identifier for an object. This convention is shared with the RefGet v2.0 specification, and the identifier syntax has been approved by GA4GH TASC.
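
The first two steps can be sketched in Python as follows. This is a minimal illustration, not the reference implementation. The function names `canonical_json` and `sha512t24u` are chosen here for illustration, and `json.dumps` with sorted keys and compact separators only approximates RFC 8785. It is assumed adequate for objects with ASCII keys and string or integer values; a complete RFC 8785 serializer also prescribes number formatting and key ordering rules that this sketch does not implement.

```python
import base64
import hashlib
import json


def canonical_json(obj) -> bytes:
    """Approximate canonical serialization (assumption: ASCII keys,
    string/integer values only). Keys are sorted, insignificant
    whitespace is removed, and the result is UTF-8 encoded."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest of blob, truncated to the first 24 bytes and
    encoded with the URL-safe base64 alphabet."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Hypothetical object, used only to show the two steps in sequence.
blob = canonical_json({"type": "Example", "value": 1})
print(blob)
print(sha512t24u(blob))
```

Because 24 bytes is a multiple of 3, the base64url encoding is always exactly 32 characters long and carries no `=` padding.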

GA4GH Inherent Properties

New in v2

In VRS v1, data classes were limited to only inherent properties that contained the minimum information for describing a variant or other identifiable object. In practice, this resulted in frequent nesting of VRS objects inside descriptive containers, a complicated pattern for implementations. VRS 2.0 addresses this limitation with the designation of inherent properties for use with the computed identifier algorithm.

When creating computed identifiers from objects, VRS uses a custom schema attribute, ga4gh.inherent, that contains the property names used for computing digests. For example, the Allele JSON Schema:

{
 "$schema": "https://json-schema.org/draft/2020-12/schema",
 "$id": "https://w3id.org/ga4gh/schema/vrs/2.0/json/Allele",
 "title": "Allele",
 "type": "object",
 "maturity": "draft",
 "ga4gh": {
    "prefix": "VA",
    "inherent": [
       "location",
       "state",
       "type"
    ]
 },
 "description": "The state of a molecule at a Location.",
 "properties": {
    ...

Note

The ga4gh JSON Schema namespace is aligned with the Sequence Collections effort (see SeqCol#84).
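
As a rough illustration of how the `inherent` list might be applied, the sketch below keeps only the listed properties of an object before it is serialized. The helper name `inherent_view` and the example object (including its `label` field and placeholder values) are hypothetical. The sketch covers only the property-selection step, not the rest of the computed identifier convention.

```python
# The "ga4gh" attribute as it appears in the Allele schema excerpt above.
ALLELE_GA4GH = {"prefix": "VA", "inherent": ["location", "state", "type"]}


def inherent_view(obj: dict, ga4gh_attr: dict) -> dict:
    """Return a copy of obj restricted to the property names listed
    under ga4gh.inherent; all other properties are dropped."""
    return {name: obj[name] for name in ga4gh_attr["inherent"] if name in obj}


# Hypothetical object: "label" stands in for a descriptive,
# non-inherent property; the nested values are placeholders.
allele = {
    "type": "Allele",
    "label": "example label",
    "location": {"placeholder": "location"},
    "state": {"placeholder": "state"},
}

print(inherent_view(allele, ALLELE_GA4GH))
```

With this selection step, two objects that differ only in descriptive properties yield the same input to digest serialization, and therefore the same digest.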

GA4GH Type Prefixes

A GA4GH identifier is constructed according to this syntax:

"ga4gh" ":" type_prefix "." digest

The digest is computed as described above. The type_prefix is a short alphanumeric code that corresponds to the type of object being represented.
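
A minimal sketch of building and parsing identifiers in this syntax is shown below. The function names and the regular expression are illustrative assumptions. The 32-character digest length is derived from the truncated digest convention (24 bytes encoded as base64url), and the prefix is assumed to consist of ASCII letters and digits.

```python
import re

# Assumed pattern: "ga4gh", ":", an alphanumeric type prefix, ".", and a
# 32-character base64url digest.
GA4GH_ID_RE = re.compile(
    r"ga4gh:(?P<type_prefix>[A-Za-z0-9]+)\.(?P<digest>[A-Za-z0-9_-]{32})"
)


def make_identifier(type_prefix: str, digest: str) -> str:
    """Assemble an identifier from a type prefix and a digest."""
    return f"ga4gh:{type_prefix}.{digest}"


def parse_identifier(identifier: str) -> tuple[str, str]:
    """Split an identifier into (type_prefix, digest); raise ValueError
    if it does not match the assumed pattern."""
    match = GA4GH_ID_RE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a GA4GH computed identifier: {identifier!r}")
    return match.group("type_prefix"), match.group("digest")


# "VA" is the Allele prefix from the schema excerpt; the digest here is the
# truncated digest of an empty byte string, used purely as sample input.
example = make_identifier("VA", "z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc")
print(example)
print(parse_identifier(example))
```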

We use the following guidelines for type prefixes:

Prefixes SHOULD be short, approximately 2-4 characters.
Prefixes SHOULD be used only for concrete classes, not abstract parent classes.
Prefixes SHOULD be used only for stand-alone classes (e.g. Variation, Location), not classes that require additional context to be meaningful (e.g. Range, Sequence Expression) or are primarily used for adding descriptive context to external data types (e.g. Sequence Reference).
A prefix MUST map 1:1 with a schema.
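
Some of these guidelines lend themselves to an automated check. The sketch below is one hypothetical way to lint a mapping of schema names to prefixes. The function name, the registry shape, and the reading of "approximately 2-4 characters" as a strict length range are all assumptions made for illustration. Only the Allele prefix `VA` comes from the schema excerpt above; the other entries are invented.

```python
def check_prefix_registry(schema_to_prefix: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems found in the registry.
    An empty list means no problems were detected."""
    problems = []
    owner_of = {}  # prefix -> first schema seen using it
    for schema, prefix in schema_to_prefix.items():
        if not (prefix.isascii() and prefix.isalnum()):
            problems.append(f"{schema}: prefix {prefix!r} is not alphanumeric")
        if not 2 <= len(prefix) <= 4:
            problems.append(f"{schema}: prefix {prefix!r} is not 2-4 characters")
        if prefix in owner_of:
            # The 1:1 rule: a prefix may not be shared by two schemas.
            problems.append(
                f"{schema}: prefix {prefix!r} already used by {owner_of[prefix]}"
            )
        else:
            owner_of[prefix] = schema
    return problems


# "HypotheticalA" and "HypotheticalB" are invented schema names.
registry = {"Allele": "VA", "HypotheticalA": "XX", "HypotheticalB": "XX"}
for problem in check_prefix_registry(registry):
    print(problem)
```

Because the registry is keyed by schema name, each schema has exactly one prefix by construction, so the check only needs to detect a prefix that is shared by more than one schema.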

Administration

Type prefixes, pURL registration, and JSON Schema keyword administration are coordinated with the GA4GH TASC.