Computed Identifiers¶
VRS provides an algorithmic solution to deterministically generate a globally unique identifier from a VRS object itself. All valid implementations of the VRS Computed Identifier will generate the same identifier when the objects are identical, and will generate different identifiers when they are not. The VRS Computed Digest algorithm obviates centralized registration services, allows computational pipelines to generate “private” ids efficiently, and makes it easier for distributed groups to share data.
A VRS Computed Identifier for a VRS concept is computed as follows:
The object SHOULD be normalized. Normalization formally applies to all VRS classes.
Generate binary data to digest. If the object is a Sequence string, encode it using UTF-8. Otherwise, serialize the object using Digest Serialization.
Generate a truncated digest from the binary data.
Construct an identifier based on the digest and object type.
Important
Normalizing objects is STRONGLY RECOMMENDED for interoperability. While normalization is not strictly required, automated validation mechanisms are anticipated that will likely disqualify Variation that is not normalized. See Implementations should normalize Alleles for a rationale.
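For the simplest case, a Sequence string, the steps above reduce to a few lines. The following is a minimal sketch rather than a reference implementation: the helper name `identify_sequence` is hypothetical, no normalization is applied to the string, and the `SQ` type prefix is taken from the type prefix table under Identifier Construction.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, encoded as base64url."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def identify_sequence(sequence: str) -> str:
    """Hypothetical helper: computed identifier for a Sequence string."""
    blob = sequence.encode("utf-8")   # binary data to digest
    digest = sha512t24u(blob)         # truncated digest
    return f"ga4gh:SQ.{digest}"       # namespace, type prefix, digest


print(identify_sequence("ACGT"))
# ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2
```

Objects other than Sequence strings differ only in how the binary data is produced: they go through Digest Serialization instead of a plain UTF-8 encoding.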
The following diagram depicts the operations necessary to generate a computed identifier. These operations are described in detail in the subsequent sections.

Serialization, Digest, and Computed Identifier Operations¶
Entities are shown in gray boxes. Functions are denoted by bold italics. The yellow, green, and blue boxes, corresponding to the sha512t24u, ga4gh_digest, and ga4gh_identify functions respectively, depict the dependencies among functions. SHA512 is SHA-512 truncated to 24 bytes (192 bits), using the SHA-512 initialization vector. base64url is the official name of the variant of Base64 encoding that uses a URL-safe character set. [figure source]
Note
Most implementation users will need only the ga4gh_identify function. We describe the ga4gh_serialize, ga4gh_digest, and sha512t24u functions here primarily for implementers.
Requirements¶
Implementations MUST adhere to the following requirements:
Implementations MUST use the normalization, serialization, and digest mechanisms described in this section when generating GA4GH Computed Identifiers. Implementations MUST NOT use any other normalization, serialization, or digest mechanism to generate a GA4GH Computed Identifier.
Implementations MUST ensure that all nested objects are identified with GA4GH Computed Identifiers. Implementations MUST NOT reference nested objects using identifiers in any namespace other than ga4gh.
Note
The GA4GH schema MAY be used with identifiers from any namespace. For example, a SequenceLocation may be defined using a sequence_id = refseq:NC_000019.10. However, an implementation of the Computed Identifier algorithm MUST first translate sequence accessions to GA4GH SQ accessions to be compliant with this specification.
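The translation step in the note above can be sketched as a lookup performed before any digest is computed. This is an illustration under stated assumptions: the function name `to_ga4gh_sequence_id` and the in-memory alias table are hypothetical stand-ins for whatever sequence repository an implementation uses, and the single mapping shown pairs the RefSeq accession from the note with the ga4gh:SQ identifier used in the Allele example under Digest Serialization.

```python
# Hypothetical alias table; a real implementation would query a sequence
# repository instead of hard-coding accessions.
ALIASES_TO_GA4GH = {
    "refseq:NC_000019.10": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
}


def to_ga4gh_sequence_id(sequence_id: str) -> str:
    """Return the ga4gh:SQ identifier for a sequence accession."""
    if sequence_id.startswith("ga4gh:SQ."):
        return sequence_id  # already in the required namespace
    try:
        return ALIASES_TO_GA4GH[sequence_id]
    except KeyError:
        # Refuse to continue: digesting a non-ga4gh accession would yield
        # an identifier that other implementations cannot reproduce.
        raise ValueError(
            f"no GA4GH SQ accession known for {sequence_id!r}"
        ) from None


print(to_ga4gh_sequence_id("refseq:NC_000019.10"))
```

Failing loudly on an unknown accession is a design choice of this sketch; silently passing the original accession through would produce a non-compliant identifier.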
Digest Serialization¶
Digest serialization converts a VRS object into a binary representation in preparation for computing a digest of the object. The Digest Serialization specification ensures that all implementations serialize variation objects identically, and therefore that the digests will also be identical. VRS provides validation tests to ensure compliance.
Important
Do not confuse Digest Serialization with JSON serialization or other serialization forms. Although Digest Serialization and JSON serialization appear similar, they are NOT interchangeable and will generate different GA4GH Digests.
Although several proposals exist for serializing arbitrary data in a consistent manner ([Gibson], [OLPC], [JCS]), none have been ratified. As a result, VRS defines a custom serialization format that is consistent with these proposals but does not rely on them for definition; it is hoped that a future ratified standard will be forward compatible with the process described here.
The first step in serialization is to generate message content. If the object is a string representing a Sequence, the serialization is the UTF-8 encoding of the string. Because this is a common operation, implementations are strongly encouraged to precompute GA4GH sequence identifiers as described in Required External Data.
If the object is an instance of a VRS class, implementations MUST:
ensure that objects are referenced with identifiers in the ga4gh namespace
replace each nested identifiable object with its corresponding digest
order arrays of digests and ids by Unicode Character Set values
filter out fields that start with underscore (e.g., _id)
filter out fields with null values
The second step is to JSON serialize the message content with the following REQUIRED constraints:
The criterion for the digest serialization method was that it must be relatively easy and reliable to implement in any common computer language.
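The two serialization steps can be sketched with the Python standard library. This is a hedged illustration, not the normative algorithm: the function name `digest_serialize` is hypothetical, the input is assumed to be a plain dict whose nested identifiable objects have already been replaced by their digests, and the JSON settings (sorted keys, no insignificant whitespace, UTF-8 output) are assumptions inferred from the example output in this section.

```python
import json


def digest_serialize(content: dict) -> bytes:
    """Sketch of Digest Serialization for already-digested message content."""

    def clean(value):
        if isinstance(value, dict):
            # Drop fields that start with an underscore and fields that are null.
            return {
                key: clean(val)
                for key, val in value.items()
                if not key.startswith("_") and val is not None
            }
        if isinstance(value, list):
            items = [clean(item) for item in value]
            # Assumption: a list made up only of strings holds digests or ids,
            # so it is ordered by Unicode code point (Python's str ordering).
            if all(isinstance(item, str) for item in items):
                return sorted(items)
            return items
        return value

    text = json.dumps(
        clean(content),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return text.encode("utf-8")


allele = {
    "_id": "placeholder",  # removed: starts with an underscore
    "comment": None,       # removed: null value
    "location": "u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx",
    "state": {"sequence": "T", "type": "SequenceState"},
    "type": "Allele",
}
print(digest_serialize(allele).decode("utf-8"))
# {"location":"u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx","state":{"sequence":"T","type":"SequenceState"},"type":"Allele"}
```

The `_id` and `comment` fields are invented here only to exercise the two filters; the remaining fields reuse the values from the Allele example in this section.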
Example
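from ga4gh.core import ga4gh_serialize  # assumed import paths (vrs-python)
from ga4gh.vrs import models
# Interval values taken from the JSON serialization shown below
simple_interval = models.SimpleInterval(start=44908821, end=44908822)
allele = models.Allele(location=models.SequenceLocation(
        sequence_id="ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        interval=simple_interval),
    state=models.SequenceState(sequence="T"))
ga4gh_serialize(allele)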
Gives the following binary (UTF-8 encoded) data:
{"location":"u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx","state":{"sequence":"T","type":"SequenceState"},"type":"Allele"}
For comparison, here is one of many possible JSON serializations of the same object:
allele.for_json()
{
"location": {
"interval": {
"end": 44908822,
"start": 44908821,
"type": "SimpleInterval"
},
"sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
"type": "SequenceLocation"
},
"state": {
"sequence": "T",
"type": "SequenceState"
},
"type": "Allele"
}
Truncated Digest (sha512t24u)¶
The sha512t24u truncated digest algorithm [Hart2020] computes an ASCII digest from binary data. The method uses two well-established standard algorithms: the SHA-512 hash function, which generates a binary digest from binary data, and a URL-safe variant of Base64 encoding, which encodes binary data using printable characters.
Computing the sha512t24u truncated digest for binary data consists of three steps:
Compute the SHA-512 digest of the binary data.
Truncate the digest to the left-most 24 bytes (192 bits). See Truncated Digest Timing and Collision Analysis for the rationale for 24 bytes.
Encode the truncated digest as a base64url ASCII string.
>>> import base64, hashlib
>>> def sha512t24u(blob):
...     digest = hashlib.sha512(blob).digest()
...     tdigest = digest[:24]
...     tdigest_b64u = base64.urlsafe_b64encode(tdigest).decode("ASCII")
...     return tdigest_b64u
...
>>> sha512t24u(b"ACGT")
'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
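One property that is easy to check in any implementation: 24 bytes is a multiple of 3, so the base64url encoding is always exactly 32 characters long and never carries `=` padding. The sketch below repeats the function so that it runs on its own, then confirms that decoding the result recovers the left-most 24 bytes of the full SHA-512 digest.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, encoded as base64url."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


encoded = sha512t24u(b"ACGT")
decoded = base64.urlsafe_b64decode(encoded)

# 24 bytes -> 32 base64url characters, no padding.
assert len(encoded) == 32 and "=" not in encoded
# Decoding yields the left-most 24 bytes of the untruncated digest.
assert decoded == hashlib.sha512(b"ACGT").digest()[:24]

print(len(encoded), len(decoded))
# 32 24
```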
Identifier Construction¶
The final step of generating a computed identifier for a VRS object is to generate a W3C CURIE formatted identifier, which has the form:
prefix ":" reference
The GA4GH VRS constructs computed identifiers as follows:
"ga4gh" ":" type_prefix "." <digest>
Warning
Do not confuse the W3C CURIE prefix (“ga4gh”) with the type prefix.
Type prefixes used by VRS are:
type_prefix | VRS class name |
---|---|
SQ | Sequence |
VA | Allele |
VH | Haplotype |
VAB | Abundance |
VS | VariationSet |
VSL | SequenceLocation |
VCL | ChromosomeLocation |
VT | Text |
For example, the identifier for the allele example under Digest Serialization is:
ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_
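Identifier construction is plain string assembly once the digest is known. The sketch below encodes the type prefix table as a dict and adds the inverse operation; the names `TYPE_PREFIXES`, `construct_identifier`, and `split_identifier` are hypothetical and chosen only for illustration.

```python
from typing import Tuple

# Type prefixes from the table above, keyed by VRS class name.
TYPE_PREFIXES = {
    "Sequence": "SQ",
    "Allele": "VA",
    "Haplotype": "VH",
    "Abundance": "VAB",
    "VariationSet": "VS",
    "SequenceLocation": "VSL",
    "ChromosomeLocation": "VCL",
    "Text": "VT",
}


def construct_identifier(class_name: str, digest: str) -> str:
    """Build "ga4gh" ":" type_prefix "." digest for a VRS class name."""
    return f"ga4gh:{TYPE_PREFIXES[class_name]}.{digest}"


def split_identifier(identifier: str) -> Tuple[str, str, str]:
    """Split a computed identifier into (CURIE prefix, type prefix, digest)."""
    namespace, _, reference = identifier.partition(":")
    type_prefix, _, digest = reference.partition(".")
    return namespace, type_prefix, digest


identifier = construct_identifier("Allele", "EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_")
print(identifier)
# ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_
print(split_identifier(identifier))
# ('ga4gh', 'VA', 'EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_')
```

Splitting on the first `:` and then the first `.` is safe here because the base64url alphabet contains neither character.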