Computed Identifiers
The VR-Spec provides an algorithmic solution to deterministically generate a globally unique identifier from a VR object itself. All valid implementations of the VR Computed Identifier will generate the same identifier when the objects are identical, and will generate different identifiers when they are not. The VR Computed Digest algorithm obviates the need for centralized registration services, allows computational pipelines to generate “private” ids efficiently, and makes it easier for distributed groups to share data.
A VR Computed Identifier for a VR concept is computed as follows:
- If the object is an Allele, normalize it.
- Generate binary data to digest. If the object is a Sequence string, encode it using UTF-8. Otherwise, serialize the object using Digest Serialization.
- Generate a truncated digest from the binary data.
- Construct an identifier based on the digest and object type.
The following diagram depicts the operations necessary to generate a computed identifier. These operations are described in detail in the subsequent sections.
Note
Most implementation users will need only the ga4gh_identify function. We describe the ga4gh_serialize, ga4gh_digest, and sha512t24u functions here primarily for implementers.
Requirements
Implementations MUST adhere to the following requirements:
- Implementations MUST use the normalization, serialization, and digest mechanisms described in this section when generating GA4GH Computed Identifiers. Implementations MUST NOT use any other normalization, serialization, or digest mechanism to generate a GA4GH Computed Identifier.
- Implementations MUST ensure that all nested objects are identified with GA4GH Computed Identifiers. Implementations MAY NOT reference nested objects using identifiers in any namespace other than ga4gh.
Note
The GA4GH schema MAY be used with identifiers from any namespace. For example, a SequenceLocation may be defined using a sequence_id = refseq:NC_000019.10. However, an implementation of the Computed Identifier algorithm MUST first translate sequence accessions to GA4GH SQ accessions to be compliant with this specification.
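The translation step in the note above can be sketched as a simple lookup. This is only an illustration: the table SQ_ALIASES and the function to_ga4gh_sequence_id are hypothetical names, and the single mapping shown is an assumption made for the example. A real implementation would obtain the mapping from its own sequence data source.

```python
# Hypothetical alias table. The pairing below is an assumption for
# illustration, not something this section defines.
SQ_ALIASES = {
    "refseq:NC_000019.10": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
}


def to_ga4gh_sequence_id(sequence_id: str) -> str:
    """Return a ga4gh:SQ identifier for the given sequence identifier."""
    if sequence_id.startswith("ga4gh:SQ."):
        # Already in the ga4gh namespace; nothing to translate.
        return sequence_id
    try:
        return SQ_ALIASES[sequence_id]
    except KeyError:
        # Refuse to compute an identifier from an untranslated accession.
        raise KeyError(
            f"no GA4GH SQ accession known for {sequence_id!r}"
        ) from None
```

Failing loudly on an unknown accession, rather than passing it through, keeps identifiers from other namespaces out of the digest.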
Digest Serialization
Digest serialization converts a VR object into a binary representation in preparation for computing a digest of the object. The Digest Serialization specification ensures that all implementations serialize variation objects identically, and therefore that the digests will also be identical. The VR Specification provides validation tests to ensure compliance.
Important
Do not confuse Digest Serialization with JSON serialization or other serialization forms. Although Digest Serialization and JSON serialization appear similar, they are NOT interchangeable and will generate different GA4GH Digests.
Although several proposals exist for serializing arbitrary data in a consistent manner ([Gibson], [OLPC], [JCS]), none have been ratified. As a result, the VR Specification defines a custom serialization format that is consistent with these proposals but does not rely on them for definition; it is hoped that a future ratified standard will be forward compatible with the process described here.
The first step in serialization is to generate message content. If the object is a string representing a Sequence, the serialization is the UTF-8 encoding of the string. Because this is a common operation, implementations are strongly encouraged to precompute GA4GH sequence identifiers as described in Required External Data.
If the object is a composite VR object, implementations MUST:
- ensure that objects are referenced with identifiers in the ga4gh namespace
- replace nested identifiable objects (i.e., objects that have id properties) with their corresponding digests
- order arrays of digests and ids by Unicode Character Set values
- filter out fields that start with underscore (e.g., _id)
- filter out fields with null values
The second step is to JSON serialize the message content with the following REQUIRED constraints:
The main criterion for the digest serialization method was that it be relatively easy and reliable to implement in any common programming language.
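The two serialization steps can be sketched over plain Python dicts. This is a minimal illustration under several assumptions, not the reference implementation: objects are represented as dicts; an object is treated as identifiable when its type has a type prefix in this specification; digest_of is a hypothetical caller-supplied function that returns the digest of a nested identifiable object; every list of strings is assumed to hold digests or ids; and the JSON settings (sorted keys, no insignificant whitespace) are inferred from the example output in this section. Stripping the ga4gh CURIE prefix from fields such as sequence_id is omitted.

```python
import json

# Assumption: identifiable types are those that have a type prefix.
IDENTIFIABLE_TYPES = {"Allele", "SequenceLocation", "Text"}


def message_content(obj, digest_of):
    """Step 1: build the message content for a composite object (a dict)."""
    content = {}
    for key, value in obj.items():
        if key.startswith("_") or value is None:
            continue  # drop underscore-prefixed fields and null values
        if isinstance(value, dict):
            if value.get("type") in IDENTIFIABLE_TYPES:
                value = digest_of(value)  # nested identifiable -> digest
            else:
                value = message_content(value, digest_of)
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            value = sorted(value)  # str sorting is by Unicode code point
        content[key] = value
    return content


def digest_serialize(obj, digest_of):
    """Step 2: JSON-serialize the message content to UTF-8 bytes."""
    text = json.dumps(
        message_content(obj, digest_of),
        sort_keys=True,             # assumed: keys in Unicode order
        separators=(",", ":"),      # assumed: no insignificant whitespace
        ensure_ascii=False,
    )
    return text.encode("utf-8")
```

Passing digest_of in as a parameter keeps the sketch independent of how nested digests are computed or cached.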
Example
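# Interval values match the JSON serialization shown below
simple_interval = models.SimpleInterval(start=44908821, end=44908822)
allele = models.Allele(
    location=models.SequenceLocation(
        sequence_id="ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        interval=simple_interval),
    state=models.SequenceState(sequence="T"))
ga4gh_serialize(allele)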
Gives the following binary (UTF-8 encoded) data:
{"location":"u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx","state":{"sequence":"T","type":"SequenceState"},"type":"Allele"}
For comparison, here is one of many possible JSON serializations of the same object:
allele.for_json()
{
  "location": {
    "interval": {
      "end": 44908822,
      "start": 44908821,
      "type": "SimpleInterval"
    },
    "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "T",
    "type": "SequenceState"
  },
  "type": "Allele"
}
Truncated Digest (sha512t24u)
The sha512t24u truncated digest algorithm computes an ASCII digest from binary data. The method uses two well-established standard algorithms: the SHA-512 hash function, which generates a binary digest from binary data, and Base64 URL encoding, which encodes binary data using printable characters.
Computing the sha512t24u truncated digest for binary data consists of three steps:
- Compute the SHA-512 digest of the binary data.
- Truncate the digest to the left-most 24 bytes (192 bits). See Truncated Digest Collision Analysis for the rationale for 24 bytes.
- Encode the truncated digest as a base64url ASCII string.
>>> import base64, hashlib
>>> def sha512t24u(blob):
...     digest = hashlib.sha512(blob).digest()
...     tdigest = digest[:24]
...     tdigest_b64u = base64.urlsafe_b64encode(tdigest).decode("ASCII")
...     return tdigest_b64u
...
>>> sha512t24u(b"ACGT")
'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
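As a quick check on the three steps above: 24 bytes are 192 bits, and base64url encodes 6 bits per character, so every sha512t24u digest should be exactly 32 characters long and carry no "=" padding. The sketch below repeats the function so that it runs on its own; the sample inputs are arbitrary.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    digest = hashlib.sha512(blob).digest()  # full digest, 64 bytes
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


for blob in (b"", b"ACGT", b"A" * 1000):
    d = sha512t24u(blob)
    assert len(d) == 24 * 8 // 6 == 32  # 192 bits at 6 bits per character
    assert "=" not in d                 # 24 is a multiple of 3: no padding
```

Because the length is fixed, implementations can validate a digest's shape before looking it up.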
Identifier Construction
The final step of generating a computed identifier for a VR object is to generate a W3C CURIE formatted identifier, which has the form:
prefix ":" reference
The GA4GH VR-Spec constructs computed identifiers as follows:
"ga4gh" ":" type_prefix "." <digest>
Warning
Do not confuse the W3C CURIE prefix (“ga4gh”) with the type prefix.
Type prefixes used by VR are:
type_prefix | VR Spec class name |
---|---|
SQ | Sequence |
VA | Allele |
VSL | Sequence Location |
VT | Text |
For example, the identifier for the allele example under Digest Serialization gives:
ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_
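The construction rule can be sketched in a few lines. The names TYPE_PREFIXES, make_identifier, and identify_sequence are illustrative and not part of the specification, and the class names are written without spaces on the assumption that they match the type values used in serialized objects.

```python
import base64
import hashlib

# Type prefixes from the table above, keyed by class name.
TYPE_PREFIXES = {
    "Sequence": "SQ",
    "Allele": "VA",
    "SequenceLocation": "VSL",
    "Text": "VT",
}


def sha512t24u(blob: bytes) -> str:
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def make_identifier(type_name: str, digest: str) -> str:
    """Build "ga4gh" ":" type_prefix "." digest for a known type."""
    return f"ga4gh:{TYPE_PREFIXES[type_name]}.{digest}"


def identify_sequence(sequence: str) -> str:
    """A Sequence string is digested directly from its UTF-8 encoding."""
    return make_identifier("Sequence", sha512t24u(sequence.encode("utf-8")))


print(make_identifier("Allele", "EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_"))
print(identify_sequence("ACGT"))
```

The first call reproduces the allele identifier shown above from its digest; the second applies the whole pipeline to a short sequence string.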