Sequence Location
The sequence location class is a fundamental concept in VRS. Sequence locations are used to describe every form of Variation, and they are also useful on their own for describing locations on a sequence in other (non-variation) contexts. This class represents a location on a specified Sequence Reference, which is typically a chromosome, transcript, or protein sequence.
Definition and Information Model
Note
This data class is at a trial use maturity level and may change in future releases. Maturity levels are described in the GKS Maturity Model.
Computational Definition
A Location defined by an interval on a Sequence Reference.
GA4GH Digest
Prefix | Inherent |
---|---|
SL | ['end', 'sequenceReference', 'start', 'type'] |
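The digest table above names the inherent properties that feed the identifier, and the information model describes the digest as a sha512t24u value. The following is a minimal sketch of those two pieces only, not the complete VRS Computed Identifier algorithm (canonical serialization and the handling of nested objects are left out). The helper name `inherent_subset` is hypothetical.

```python
import base64
import hashlib

# Inherent keys for SequenceLocation, copied from the digest table above.
SL_INHERENT_KEYS = ["end", "sequenceReference", "start", "type"]


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def inherent_subset(obj: dict) -> dict:
    """Hypothetical helper: keep only the inherent keys that are present."""
    return {key: obj[key] for key in SL_INHERENT_KEYS if key in obj}
```

Truncating to 24 bytes always yields a 32-character string with no padding. Properties such as `id`, `name`, or `description` are not in the inherent list, so they would be dropped before any digest is computed.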
Information Model
Some SequenceLocation attributes are inherited from GA4GH Identifiable Object.
Field | Flags | Type | Limits | Description |
---|---|---|---|---|
id | | string | 0..1 | The ‘logical’ identifier of the Entity in the system of record, e.g. a UUID. This ‘id’ is unique within a given system, but may or may not be globally unique outside the system. It is used within a system to reference an object from another. |
name | | string | 0..1 | A primary name for the entity. |
description | | string | 0..1 | A free-text description of the Entity. |
aliases | ⋮ | string | 0..m | Alternative name(s) for the Entity. |
extensions | ⋮ | | 0..m | A list of extensions to the Entity, that allow for capture of information not directly supported by elements defined in the model. |
digest | | string | 0..1 | A sha512t24u digest created using the VRS Computed Identifier algorithm. |
type | | string | 1..1 | MUST be “SequenceLocation” |
sequenceReference | | | 0..1 | A reference to a Sequence Reference on which the location is defined. |
start | | integer or Range | 0..1 | The start coordinate or range of the SequenceLocation. The minimum value of this coordinate or range is 0. For locations on linear sequences, this MUST represent a coordinate or range less than or equal to the value of end. For circular sequences, start is greater than end when the location spans the sequence 0 coordinate. |
end | | integer or Range | 0..1 | The end coordinate or range of the SequenceLocation. The minimum value of this coordinate or range is 0. For locations on linear sequences, this MUST represent a coordinate or range greater than or equal to the value of start. For circular sequences, end is less than start when the location spans the sequence 0 coordinate. |
sequence | | | 0..1 | The literal sequence encoded by the sequenceReference at these coordinates. |
Example
The Sequence Location for the position 44908822 is:
{
  "id": "ga4gh:SL.4t6JnYWqHwYw9WzBT_lmWBb3tLQNalkT",
  "type": "SequenceLocation",
  "sequenceReference": {
    "type": "SequenceReference",
    "refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
  },
  "start": 44908821,
  "end": 44908822
}
Implementation Guidance
Start, End, and Ranges
At least one of the start and end properties MUST be specified in any SequenceLocation instance.
When only one of these properties is specified, this represents an open interval beginning at the specified coordinate and extending left (when start is null) or right (when end is null).
When there is ambiguity at a coordinate (e.g., when using a SequenceLocation to describe the confidence boundary of a copy number segment), this is specified using the Range class for that coordinate.
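The start and end rules above can be sketched as a small check. This is an illustration under assumptions, not a reference validator: the function names are hypothetical, and a Range is assumed here to be a two-item list of the form [lower, upper].

```python
def describe_bounds(location: dict) -> str:
    """Classify a SequenceLocation-like dict by which bounds are present."""
    start = location.get("start")
    end = location.get("end")
    if start is None and end is None:
        raise ValueError("at least one of start and end MUST be specified")
    if start is None:
        return "extends left from end"
    if end is None:
        return "extends right from start"
    return "bounded on both sides"


def is_ambiguous(coordinate) -> bool:
    """True when the coordinate is a Range (assumed: a two-item list)."""
    return isinstance(coordinate, (list, tuple))


# A copy number segment whose start is only known to lie within a range.
segment = {"type": "SequenceLocation", "start": [1000, 1200], "end": 5000}
print(describe_bounds(segment), is_ambiguous(segment["start"]))
```

A location with only `end` set is reported as extending left, and one with only `start` set as extending right, matching the guidance above.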
New in v2
In VRS v1, the SequenceLocation class had an interval property which contained start and end attributes. This intermediate object layer has been removed in v2.0, making start and end top-level properties of the SequenceLocation.
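The structural change described above can be sketched as a small transformation. This is only an illustration of removing the interval layer: the function name is hypothetical, start and end are assumed to be plain values that can be copied as-is, and any other differences between v1 and v2 objects are not handled.

```python
def flatten_interval(v1_location: dict) -> dict:
    """Move start and end out of a nested 'interval' object to the top level."""
    # Copy everything except the intermediate interval layer.
    v2_location = {k: v for k, v in v1_location.items() if k != "interval"}
    interval = v1_location.get("interval") or {}
    for key in ("start", "end"):
        if key in interval:
            v2_location[key] = interval[key]
    return v2_location


old_style = {"type": "SequenceLocation", "interval": {"start": 44908821, "end": 44908822}}
print(flatten_interval(old_style))
```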
The “Ref” Allele
In some variant representation formats (e.g. HGVS, VCF) sequence variants are described by both their “reference” (ref) and “alternate” (alt) alleles. When representing an Allele with VRS v2, it is also possible to describe the ref sequence (derived from the Sequence Reference at the location) using the sequence property.
The sequence property is for describing the sequence derived from the SequenceLocation, and is not a substitute for the sequenceReference property that references the sequence on which the location is defined.
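As a rough illustration of how the sequence property relates to the coordinates, the sketch below slices a toy reference string. It assumes a linear sequence and 0-based, inter-residue style coordinates, which is consistent with the example above (position 44908822 expressed as start 44908821, end 44908822). The function names and the toy reference are hypothetical.

```python
def reference_subsequence(reference: str, start: int, end: int) -> str:
    """Return the residues between start and end on a linear reference."""
    if not 0 <= start <= end <= len(reference):
        raise ValueError("expected 0 <= start <= end <= len(reference)")
    return reference[start:end]


def with_ref_sequence(location: dict, reference: str) -> dict:
    """Return a copy of the location with its 'sequence' property filled in."""
    ref = reference_subsequence(reference, location["start"], location["end"])
    return {**location, "sequence": ref}


toy_reference = "ACGTACGT"
toy_location = {"type": "SequenceLocation", "start": 2, "end": 5}
print(with_ref_sequence(toy_location, toy_reference)["sequence"])
```

The sequenceReference property is untouched by this step; the added sequence value only records what the reference contains at the location.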
New in v2
In VRS v1, sequence derived from the reference was not transmitted. This feature was added in VRS v2 because transmitting the reference sequence is common practice in other variant representation formats.
Linear and Circular Sequence Coordinates
When representing a linear sequence, it is expected that for a Sequence Reference of length n, 0 ≤ start ≤ end ≤ n.
For a circular sequence, 0 ≤ end ≤ start ≤ n is also allowed. In cases where end < start, this represents a location that spans the circular sequence origin coordinate.
New in v2
The v2 SequenceLocation now also supports circular sequences. The optional circular property of the Sequence Reference class may be set to True or False to explicitly indicate if a reference is circular, and therefore if 0 ≤ end ≤ start ≤ n is also allowed.
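The coordinate constraints for linear and circular sequences can be sketched as follows. This is a hedged illustration with hypothetical function names; it handles integer coordinates only (not Range values) and assumes that a location spanning the origin covers the residues from start to n plus those from 0 to end.

```python
def validate_coordinates(start: int, end: int, n: int, circular: bool = False) -> None:
    """Raise ValueError if the coordinates break the constraints above."""
    if not (0 <= start <= n and 0 <= end <= n):
        raise ValueError("start and end must lie between 0 and n")
    if end < start and not circular:
        raise ValueError("end < start is only allowed on circular sequences")


def spanned_length(start: int, end: int, n: int, circular: bool = False) -> int:
    """Number of residues covered by the location."""
    validate_coordinates(start, end, n, circular)
    if start <= end:
        return end - start
    # Origin-spanning case: the tail of the sequence plus the head.
    return (n - start) + end


print(spanned_length(2, 5, 10))
print(spanned_length(8, 2, 10, circular=True))
```

On a linear reference the second call would raise a ValueError, since end < start is only meaningful when the reference is marked as circular.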
Implied Sequence Coordinates
The Sequence Location class refers to coordinates on a Sequence Reference; if that sequence represents a coding transcript, then the coordinates refer to the coding transcript, and not a chromosome sequence to which it aligns. VRS intentionally does not allow for start or end values that use an offset system to represent sequence not found on the Sequence Reference.
Todo
Describe and add a ref to an intronic variant profile
CisPhasedBlocks and the Inferred SequenceReference
When a Sequence Reference is provided in a Cis-Phased Block, all member Allele objects are defined to occur on that sequence. Consequently, the SequenceLocation object for each Allele does not need to populate the sequenceReference property. There may be other contexts where this optional property may be omitted, but when this is done there SHOULD be a means of inferring the content of this property (as is explicitly described in Cis-Phased Block).
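One way to picture this inference is a lookup that falls back to the enclosing block. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the block is assumed to be a dict that carries its Sequence Reference under a `sequenceReference` key.

```python
from typing import Optional


def resolve_sequence_reference(location: dict, block: Optional[dict] = None) -> dict:
    """Return the location's own sequenceReference, else the block's."""
    own = location.get("sequenceReference")
    if own is not None:
        return own
    if block is not None and block.get("sequenceReference") is not None:
        return block["sequenceReference"]
    raise LookupError("no sequenceReference given and none can be inferred")


# Toy block: the reference is stated once and shared by member locations.
block = {
    "type": "CisPhasedBlock",
    "sequenceReference": {
        "type": "SequenceReference",
        "refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul",
    },
}
member_location = {"type": "SequenceLocation", "start": 44908821, "end": 44908822}
print(resolve_sequence_reference(member_location, block)["refgetAccession"])
```

Raising an error when nothing can be inferred reflects the SHOULD above: a location without a sequenceReference is only interpretable when its context supplies one.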