Computed Identifiers

VRS provides an algorithmic solution to deterministically generate a globally unique identifier from a VRS object itself. All valid implementations of the VRS Computed Identifier will generate the same identifier when the objects are identical, and will generate different identifiers when they are not. The VRS Computed Digest algorithm obviates centralized registration services, allows computational pipelines to generate “private” ids efficiently, and makes it easier for distributed groups to share data.

A VRS Computed Identifier for a VRS concept is computed as follows:

1. The object SHOULD be normalized. Normalization formally applies to all VRS classes.
2. Generate binary data to digest through Digest Serialization.
3. Generate a Truncated Digest (sha512t24u) from the binary data.
4. Construct an identifier based on the digest and object type.

Important

Normalizing objects is STRONGLY RECOMMENDED for interoperability. While normalization is not strictly required, automated validation mechanisms are anticipated that will likely disqualify Variation that is not normalized. See should-normalize for a rationale.

The following diagram depicts the operations necessary to generate a computed identifier. These operations are described in detail in the subsequent sections.

Figure: Serialization, Digest, and Computed Identifier Operations

Entities are shown in gray boxes. Functions are denoted by bold italics. The yellow, green, and blue boxes, corresponding to the `sha512t24u`, `ga4gh_digest`, and `ga4gh_identify` functions respectively, depict the dependencies among functions. In the diagram, `SHA512` denotes the SHA-512 digest truncated to its left-most 24 bytes (192 bits), computed using the SHA-512 initialization vector. `base64url` is the official name of the variant of Base64 encoding that uses a URL-safe character set.

Note

Most implementation users will need only the ga4gh_identify function. We describe the ga4gh_serialize, ga4gh_digest, and sha512t24u functions here primarily for implementers.
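As a rough illustration of how these functions depend on one another, the following Python sketch chains simplified stand-ins for them. The signatures are assumptions made for this example, not a reference implementation: `ga4gh_serialize` here only performs compact, key-sorted JSON encoding and assumes the message content has already been normalized and preprocessed as described under Digest Serialization, and the type prefix is passed in explicitly.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """Left-most 24 bytes of the SHA-512 digest, base64url-encoded."""
    truncated = hashlib.sha512(blob).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def ga4gh_serialize(message: dict) -> bytes:
    """Simplified stand-in: compact, key-sorted JSON encoded as UTF-8.

    Assumes `message` is already normalized and preprocessed.
    """
    text = json.dumps(
        message, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def ga4gh_digest(message: dict) -> str:
    """Digest of the serialized message content."""
    return sha512t24u(ga4gh_serialize(message))


def ga4gh_identify(message: dict, type_prefix: str) -> str:
    """CURIE-style identifier; `type_prefix` is supplied by the caller here."""
    return f"ga4gh:{type_prefix}.{ga4gh_digest(message)}"
```

Because the serialization sorts keys, two dictionaries that hold the same content in a different key order yield the same identifier in this sketch.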

Requirements

Implementations MUST adhere to the following requirements:

Implementations MUST use the normalization, serialization, and digest mechanisms described in this section when generating GA4GH Computed Identifiers. Implementations MUST NOT use any other normalization, serialization, or digest mechanism to generate a GA4GH Computed Identifier.
When computing identifiers, implementations MUST ensure that each nested GA4GH Identifiable Object is referenced with a GA4GH Computed Identifier.

New in v2

In VRS v2, all objects now inherit from Entity, providing a means by which common expressions and accessions for VRS objects can be provided in other fields as decorative metadata, alongside object IDs. Implementations may freely implement such fields without impacting computed identifiers. Implementations are therefore encouraged (but not required) to use the id field strictly for computed identifiers and use decorative fields for alternate accessions, to reduce computational complexity.

Digest Serialization

Digest serialization converts a VRS object into a binary representation in preparation for computing a digest of the object. The Digest Serialization specification ensures that all implementations serialize variation objects identically, and therefore that the digests will also be identical. VRS provides validation tests to ensure compliance.

VRS uses the JSON Canonicalization Scheme (RFC 8785) to serialize JSON data, and includes additional preprocessing steps to ensure computed digests are not impacted by decorative metadata.

New in v2

Beginning in VRS v2, object value data and descriptive metadata may be passed in the same object, providing a means for sharing commonly expected annotations (e.g. a “Ref Allele”) on VRS objects. Read GA4GH Inherent Properties for more.

The first step in serialization is to generate message content.

If the object is an instance of a VRS class, implementations MUST:

ensure that objects are referenced with identifiers in the ga4gh namespace

replace each nested identifiable object with its corresponding digest

order arrays of digests and ids by Unicode Character Set values

filter out fields not included in the class GA4GH Inherent Properties (if defined)

filter out fields with null values
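The preprocessing steps above can be sketched in Python over plain dictionaries. This is a minimal illustration under stated assumptions: the `INHERENT` and `IDENTIFIABLE` tables are a hypothetical subset chosen to match the example later in this section (the authoritative lists are defined by the VRS schema), `digest_of` is a caller-supplied function that returns the digest of a nested object, and the check that references use the ga4gh namespace is omitted.

```python
# Hypothetical subset of inherent properties per class (illustration only).
INHERENT = {
    "Allele": {"location", "state", "type"},
    "SequenceLocation": {"start", "end", "sequenceReference", "type"},
    "SequenceReference": {"refgetAccession", "type"},
    "LiteralSequenceExpression": {"sequence", "type"},
}
# Classes treated as identifiable in this sketch (an assumption).
IDENTIFIABLE = {"Allele", "SequenceLocation"}


def message_content(obj, digest_of, is_root=True):
    """Build message content from a dict-based object.

    `digest_of` is a caller-supplied callable mapping a nested
    identifiable object (a dict) to its digest string.
    """
    if isinstance(obj, list):
        items = [message_content(v, digest_of, False) for v in obj]
        # Sort only arrays whose members were all replaced by digests.
        all_replaced = bool(obj) and all(
            isinstance(v, dict) and v.get("type") in IDENTIFIABLE for v in obj
        )
        return sorted(items) if all_replaced else items
    if not isinstance(obj, dict):
        return obj
    if not is_root and obj.get("type") in IDENTIFIABLE:
        # Nested identifiable objects are replaced by their digests.
        return digest_of(obj)
    keep = INHERENT.get(obj.get("type"))  # None means "no filter defined"
    return {
        key: message_content(value, digest_of, False)
        for key, value in obj.items()
        if value is not None and (keep is None or key in keep)
    }
```

With a placeholder such as `digest_of=lambda o: "DIGEST"`, an Allele carrying extra `id` and `label` fields reduces to only its `location` digest, `state`, and `type`.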

The second step is to JSON serialize the message content following the RFC 8785 specification, which includes these REQUIRED constraints:

encode the serialization in UTF-8

exclude insignificant whitespace, as defined in RFC8785§3.2.1

order all keys by Unicode Character Set values

use predefined JSON control character codes when available, as defined in RFC8785§3.2.2.1
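For message content that contains only ASCII keys, strings, and integers, Python's standard `json` module can approximate these constraints. The sketch below is not a complete RFC 8785 implementation: RFC 8785 orders keys by UTF-16 code units and prescribes ECMAScript-style number formatting, whereas `json.dumps` sorts by code point and formats floats in Python's style. The function name `canonical_json_bytes` is a hypothetical choice for this example.

```python
import json


def canonical_json_bytes(message: dict) -> bytes:
    """Approximate RFC 8785 output for ASCII keys and integer values."""
    text = json.dumps(
        message,
        sort_keys=True,          # order keys
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # emit non-ASCII characters directly
    )
    return text.encode("utf-8")  # UTF-8 encoded binary data


message = {
    "type": "Allele",
    "state": {"type": "LiteralSequenceExpression", "sequence": "T"},
    "location": "wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz",
}
blob = canonical_json_bytes(message)
```

Applied to the message content of the Allele example in this section, the resulting bytes match the serialization shown there, with keys ordered `location`, `state`, `type`.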

The criterion for the digest serialization method was that it be relatively easy and reliable to implement in any common programming language.

Example

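# Assumption: `models` and `ga4gh_serialize` are provided by the vrs-python package.
from ga4gh.core import ga4gh_serialize
from ga4gh.vrs import models

allele = models.Allele(
  location=models.SequenceLocation(
    end=44908822,
    start=44908821,
    sequenceReference=models.SequenceReference(
      refgetAccession="SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"
    )
  ),
  state=models.LiteralSequenceExpression(sequence=models.SequenceString("T"))
)
ga4gh_serialize(allele)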

Gives the following binary (UTF-8 encoded) data:

{"location":"wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz","state":{"sequence":"T","type":"LiteralSequenceExpression"},"type":"Allele"}

For comparison, here is one of many possible JSON serializations of the same object:

allele.model_dump(exclude_none=True)

{
  "location": {
    "type": "SequenceLocation",
    "sequenceReference": {
      "type": "SequenceReference",
      "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"
    },
    "start": 44908821,
    "end": 44908822
  },
  "state": {
    "type": "LiteralSequenceExpression",
    "sequence": "T"
  },
  "type": "Allele"
}

Truncated Digest (sha512t24u)

The sha512t24u truncated digest algorithm [Hart2020] computes an ASCII digest from binary data. The method uses two well-established standard algorithms: the SHA-512 hash function, which generates a binary digest from binary data, and base64url, a URL-safe variant of Base64 encoding, which encodes binary data using printable characters.

Computing the sha512t24u truncated digest for binary data consists of three steps:

1. Compute the SHA-512 digest of the binary data.
2. Truncate the digest to the left-most 24 bytes (192 bits). See Truncated Digest Timing and Collision Analysis for the rationale for 24 bytes.
3. Encode the truncated digest as a base64url ASCII string.

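>>> import base64, hashlib
>>> def sha512t24u(blob):
...     # Step 1: full 64-byte SHA-512 digest of the input bytes
...     digest = hashlib.sha512(blob).digest()
...     # Step 2: keep the left-most 24 bytes (192 bits)
...     tdigest = digest[:24]
...     # Step 3: base64url-encode and return as an ASCII string
...     tdigest_b64u = base64.urlsafe_b64encode(tdigest).decode("ASCII")
...     return tdigest_b64u
...
>>> sha512t24u(b"ACGT")
'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'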
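One consequence of these steps is a fixed output length: 24 bytes are 192 bits, and base64url carries 6 bits per character, so every sha512t24u digest is exactly 32 characters long and needs no `=` padding (24 is a multiple of 3). A minimal format check based on that observation might look like the following, where `is_sha512t24u` is a hypothetical helper name. It checks only the shape of a string, not whether it was derived from any particular data.

```python
import re

# 32 characters from the base64url alphabet: A-Z, a-z, 0-9, "-", "_"
_SHA512T24U_RE = re.compile(r"[A-Za-z0-9_-]{32}")


def is_sha512t24u(value: str) -> bool:
    """Return True if `value` has the shape of a sha512t24u digest."""
    return _SHA512T24U_RE.fullmatch(value) is not None
```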

Identifier Construction

The final step of generating a computed identifier for a VRS object is to generate a W3C CURIE formatted identifier, which has the form:

prefix ":" reference

The GA4GH VRS constructs computed identifiers as follows:

"ga4gh" ":" type_prefix "." <digest>

Warning

Do not confuse the W3C CURIE prefix (“ga4gh”) with the type prefix.

Type prefixes used by VRS are:

type_prefix	VRS class name
VA	Allele
CPB	CisPhasedBlock
CN	CopyNumberCount
CX	CopyNumberChange
AJ	Adjacency
TM	Terminus
DM	DerivativeMolecule
SL	SequenceLocation
SQ	Sequence (RefGet)

For example, the identifier for the allele example under Digest Serialization is:

ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt
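The construction rule can be sketched as a pair of small Python helpers. The names `make_identifier` and `parse_identifier` are hypothetical, and the mapping copies the type prefix table above (SQ is omitted here for brevity). The usage line assumes that the location digest shown in the Digest Serialization example belongs to a SequenceLocation.

```python
# Type prefixes copied from the table above, keyed by VRS class name.
TYPE_PREFIXES = {
    "Allele": "VA",
    "CisPhasedBlock": "CPB",
    "CopyNumberCount": "CN",
    "CopyNumberChange": "CX",
    "Adjacency": "AJ",
    "Terminus": "TM",
    "DerivativeMolecule": "DM",
    "SequenceLocation": "SL",
}


def make_identifier(class_name: str, digest: str) -> str:
    """Assemble "ga4gh" ":" type_prefix "." digest."""
    return f"ga4gh:{TYPE_PREFIXES[class_name]}.{digest}"


def parse_identifier(identifier: str) -> tuple[str, str]:
    """Split an identifier into (type_prefix, digest)."""
    namespace, colon, reference = identifier.partition(":")
    type_prefix, dot, digest = reference.partition(".")
    if namespace != "ga4gh" or not colon or not dot or not type_prefix or not digest:
        raise ValueError(f"not a ga4gh computed identifier: {identifier!r}")
    return type_prefix, digest


location_id = make_identifier("SequenceLocation", "wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz")
```

Splitting on the first ":" and the first "." is safe in this sketch because neither character occurs in the base64url alphabet.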

Warning

GA4GH Computed Identifiers are a key mechanism for globally unique federated identification of variants. However, as described here, these identifiers depend on the structure of the object from which the identifier is constructed. Consequently, there is no guarantee that VRS computed identifiers will remain stable across major version releases of VRS; for example, all VRS v1.x computed identifiers are distinct from all VRS v2.x identifiers. Implementers are advised to heed the maturity level of data classes, as defined by the GKS Maturity Model, when gauging the stability of data classes (and therefore of object identifiers from those classes) across releases of VRS.

References

[Hart2020]

Hart RK, Prlić A. SeqRepo: A system for managing local collections of biological sequences. PLoS One. 2020;15: e0239883. doi:10.1371/journal.pone.0239883