Terminology & Information Model

When biologists define terms in order to describe phenomena and observations, they rely on a background of human experience and intelligence for interpretation. Definitions may be abstract, perhaps correctly reflecting uncertainty of our understanding at the time. Unfortunately, such terms are not readily translatable into an unambiguous representation of knowledge.

For example, “allele” might refer to “an alternative form of a gene or locus” [Wikipedia], “one of two or more forms of the DNA sequence of a particular gene” [ISOGG], or “one of a set of coexisting sequence alleles of a gene” [Sequence Ontology]. Even for human interpretation, these definitions are inconsistent: does the definition precisely describe a specific change on a specific sequence, or, rather, a more general change on an undefined sequence? In addition, all three definitions are inconsistent with the practical need for a way to describe sequence changes outside regions associated with genes.

The computational representation of biological concepts requires translating precise biological definitions into information models and data structures that may be used in software. This translation should result in a representation of information that is consistent with conventional biological understanding and, ideally, be able to accommodate future data as well. The resulting computational representation of information should also be cognizant of computational performance, the minimization of opportunities for misunderstanding, and ease of manipulating and transforming data.

Accordingly, for each term we define below, we begin by describing the term as used by the genetics and/or bioinformatics communities, where such usage is available. When a term has multiple such definitions, we explicitly choose one of them for the purposes of computational modelling. We then provide a computational definition that reformulates the community definition in terms of information content. Finally, we translate each computational definition into precise specifications for the information model. Terms are ordered “bottom-up” so that definitions depend only on previously defined terms.

Note

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Information Model Principles

  • VRS objects are minimal value objects. Two objects are considered equal if and only if their respective attributes are equal. As value objects, VRS objects are used as primitive types and MUST NOT be used as containers for related data, such as primary database accessions, representations in particular formats, or links to external data. Instead, related data should be associated with VRS objects through identifiers. See Computed Identifiers.

  • VRS uses polymorphism. VRS uses polymorphism extensively in order to provide a coherent top-down structure for variation while enabling precise models for variation data. For example, Allele is a kind of Variation, SequenceLocation is a kind of Location, and SequenceState is a kind of State. See Future Plans for the roadmap of VRS data classes and relationships. All VRS objects contain a type attribute, which is used to discriminate polymorphic objects.

  • Error handling is intentionally unspecified and delegated to implementation. VRS provides foundational data types that enable significant flexibility. Except where required by this specification, implementations may choose whether and how to validate data. For example, implementations MAY choose to validate that particular combinations of objects are compatible, but such validation is not required.

  • VRS uses snake_case to represent compound words. Although the schema is currently JSON-based (which would typically use camelCase), VRS itself is intended to be neutral with respect to languages and databases.

  • Optional attributes start with an underscore. Optional attributes are not part of the value object. Such attributes are not considered when evaluating equality or creating computed identifiers. The _id attribute is available to identifiable objects, and MAY be used by an implementation to store the identifier for a VRS object. If used, the stored _id element MUST be a CURIE. If used for creating a Truncated Digest (sha512t24u) for parent objects, the stored element must be a GA4GH Computed Identifier. Implementations MUST ignore attributes beginning with an underscore and they SHOULD NOT transmit objects containing them.
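The value-object and underscore-attribute principles above can be illustrated with a minimal Python sketch. It assumes VRS objects are held as plain dicts decoded from JSON. The helper names strip_optional and vrs_equal are hypothetical and are not part of VRS or of any VRS library.

```python
def strip_optional(obj):
    """Return a copy of obj without attributes whose names start with '_'."""
    if isinstance(obj, dict):
        return {
            key: strip_optional(value)
            for key, value in obj.items()
            if not key.startswith("_")
        }
    if isinstance(obj, list):
        return [strip_optional(item) for item in obj]
    return obj


def vrs_equal(a, b):
    """Compare two objects by value, ignoring underscore-prefixed attributes."""
    return strip_optional(a) == strip_optional(b)


# "example:text-1" is a made-up CURIE used only for illustration.
with_id = {"_id": "example:text-1", "type": "Text", "definition": "MSI High"}
without_id = {"definition": "MSI High", "type": "Text"}
other = {"type": "Text", "definition": "MSI Low"}

print(vrs_equal(with_id, without_id))  # the _id attribute is ignored
print(vrs_equal(with_id, other))       # the definition differs
```

Because equality is decided on the stripped copies, attribute order and the presence of _id do not affect the result.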

Variation

In the genetics community, variation is often used to mean sequence variation, describing the differences observed in DNA or AA bases among individuals, and typically with respect to a common reference sequence.

In VRS, the Variation class is the conceptual root of all types of biomolecular variation, and the Variation abstract class is the top-level object in the Current Variation Representation Specification Schema. Variation types are broadly categorized as Molecular Variation, Systemic Variation, or a utility subclass. Types of variation are widely varied, and there are several Variation Classes currently under consideration to capture this diversity.

Computational Definition

A representation of the state of one or more biomolecules.

Information Model

Field  Type    Limits  Description
-----  ------  ------  -----------
_id    CURIE   0..1    Variation Id. MUST be unique within document.
type   string  1..1    The Variation class type. MUST match child class type.

Molecular Variation

Computational Definition

A Variation on a contiguous molecule.

Allele

Note

The terms allele and variant are often used interchangeably, although this use may mask subtle distinctions made by some users. Specifically, while allele connotes a specific sequence state, variant connotes a change between states.

This distinction makes it awkward to use variant to represent an unchanged (reference-agreement) state at a Sequence Location. This was a primary factor for choosing to use allele over variant when designing VRS. Read more about this design decision: Using Allele Rather than Variant.

An allele may refer to a number of alternative forms of the same gene or same genetic locus. In the genetics community, allele may also refer to a specific haplotype. In the context of biological sequences, “allele” refers to a distinct state of a molecule at a location.

Computational Definition

The state of a molecule at a Location.

Information Model

Some Allele attributes are inherited from Variation.

Field     Type                                              Limits  Description
--------  ------------------------------------------------  ------  -----------
_id       CURIE                                             0..1    Variation Id. MUST be unique within document.
type      string                                            1..1    MUST be “Allele”
location  CURIE | Location                                  1..1    Where Allele is located
state     Sequence Expression | SequenceState (deprecated)  1..1    An expression of the sequence state

Implementation Guidance

  • The Sequence Expression and Location subclasses respectively represent diverse kinds of sequence changes and mechanisms for describing the locations of those changes, including varying levels of precision of sequence location and categories of sequence changes.

  • Implementations MUST enforce interval.end ≤ sequence_length when the Sequence length is known.

  • Alleles are equal only if the component fields are equal: at the same location and with the same state.

  • Alleles MAY have multiple related representations on the same Sequence type due to normalization differences.

  • Implementations SHOULD normalize Alleles using fully-justified normalization whenever possible to facilitate comparisons of variation in regions of representational ambiguity.

  • Implementations SHOULD preferentially represent Alleles using LiteralSequenceExpression; however, there are cases where another Sequence Expression class is more appropriate. See Using Sequence Expressions for guidance.

  • When the alternate Sequence is the same length as the interval, the lengths of the reference Sequence and imputed Sequence are the same. (Here, imputed sequence means the sequence derived by applying the Allele to the reference sequence.) When the replacement Sequence is shorter than the length of the interval, the imputed Sequence is shorter than the reference Sequence, and conversely for replacements that are larger than the interval.

  • When the state is a LiteralSequenceExpression of "" (the empty string), the Allele refers to a deletion at this location.

  • The Allele entity is based on Sequence and is intended to be used for intragenic and extragenic variation. Alleles are not explicitly associated with genes or other features.

  • Biologically, referring to Alleles is typically meaningful only in the context of empirical alternatives. For modelling purposes, Alleles MAY exist as a result of biological observation or computational simulation, i.e., virtual Alleles.

  • “Single, contiguous” refers to the representation of the Allele, not the biological mechanism by which it was created. For instance, two non-adjacent single residue Alleles could be represented by a single contiguous multi-residue Allele.

  • When a trait has a known genetic basis, it is typically represented computationally as an association with an Allele.

  • This specification’s definition of Allele applies to any Location, including locations on RNA or protein Sequence.
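The guidance above on imputed sequences, deletions, and the interval.end bound can be sketched in Python. This is an illustration only, assuming a literal replacement sequence and integer inter-residue coordinates. The function name apply_allele and the toy reference string are hypothetical.

```python
def apply_allele(reference: str, start: int, end: int, replacement: str) -> str:
    """Return the imputed sequence: reference with [start, end) replaced.

    start and end are inter-residue coordinates, so they map directly
    onto Python slice indices.
    """
    if not 0 <= start <= end <= len(reference):
        raise ValueError("interval must satisfy 0 <= start <= end <= len(reference)")
    return reference[:start] + replacement + reference[end:]


reference = "ACGTACGT"  # toy reference sequence, not real data

# Replacement has the same length as the interval: lengths are unchanged.
substitution = apply_allele(reference, 2, 3, "T")

# Empty replacement: the Allele is a deletion of the interval.
deletion = apply_allele(reference, 2, 4, "")

# Replacement longer than the (zero-width) interval: the imputed sequence grows.
insertion = apply_allele(reference, 4, 4, "GG")

print(substitution, deletion, insertion)
```

The bounds check mirrors the requirement that interval.end not exceed the sequence length when that length is known.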

Examples

An Allele corresponding to rs7412 C>T on GRCh38:

{
  "location": {
    "interval": {
      "end": {
        "type": "Number",
        "value": 44908822
      },
      "start": {
        "type": "Number",
        "value": 44908821
      },
      "type": "SequenceInterval"
    },
    "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "T",
    "type": "SequenceState"
  },
  "type": "Allele"
}

Sources

Haplotype

A haplotype is a specific combination of Alleles that occur in cis, that is, on the same physical molecule. Haplotypes are commonly described with respect to locations on a gene, a set of nearby genes, or other physically proximal genetic markers that tend to be transmitted together.

Computational Definition

A set of non-overlapping Allele members that co-occur on the same molecule.

Information Model

Some Haplotype attributes are inherited from Variation.

Field    Type            Limits  Description
-------  --------------  ------  -----------
_id      CURIE           0..1    Variation Id. MUST be unique within document.
type     string          1..1    MUST be “Haplotype”
members  Allele | CURIE  1..m    List of Alleles, or references to Alleles, that comprise this Haplotype.

Implementation Guidance

  • Haplotypes are an assertion of Alleles known to occur “in cis” or “in phase” with each other.

  • All Alleles in a Haplotype MUST be defined on the same reference sequence or chromosome.

  • Alleles within a Haplotype MUST NOT overlap (“overlap” is defined in Interval).

  • The locations of Alleles within the Haplotype MUST be interpreted independently. Alleles that create a net insertion or deletion of sequence MUST NOT change the location of “downstream” Alleles.

  • The members attribute is required and MUST contain at least one Allele.

  • Haplotypes with one Allele are intended to be distinct entities from the Allele by itself. See discussion on Equivalence Between Concepts.
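Several of the constraints above can be checked mechanically. The following Python sketch is one possible approach, not a prescribed validation procedure. It only inspects inline Alleles whose coordinates are of type Number, and it skips members given as CURIE references. The overlap test (two spans overlap when each starts before the other ends) is an assumption made for this sketch; the authoritative definition is the one given in Interval. The names allele_span, check_haplotype, and make_allele are hypothetical.

```python
from itertools import combinations

SEQ_ID = "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"


def make_allele(start, end, sequence, sequence_id=SEQ_ID):
    """Build an inline Allele dict shaped like the examples in this document."""
    return {
        "type": "Allele",
        "location": {
            "type": "SequenceLocation",
            "sequence_id": sequence_id,
            "interval": {
                "type": "SequenceInterval",
                "start": {"type": "Number", "value": start},
                "end": {"type": "Number", "value": end},
            },
        },
        "state": {"type": "LiteralSequenceExpression", "sequence": sequence},
    }


def allele_span(allele):
    """Return (sequence_id, start, end) for an inline Allele."""
    location = allele["location"]
    interval = location["interval"]
    return location["sequence_id"], interval["start"]["value"], interval["end"]["value"]


def check_haplotype(haplotype):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    members = haplotype.get("members", [])
    if not members:
        problems.append("members must contain at least one Allele")
    # CURIE references (strings) cannot be checked without dereferencing them.
    spans = [allele_span(m) for m in members if isinstance(m, dict)]
    if len({seq_id for seq_id, _, _ in spans}) > 1:
        problems.append("Alleles are defined on different sequences")
    for (_, s1, e1), (_, s2, e2) in combinations(spans, 2):
        # Assumed overlap test for inter-residue spans.
        if s1 < e2 and s2 < e1:
            problems.append(f"Alleles at [{s1}, {e1}) and [{s2}, {e2}) overlap")
    return problems


example = {
    "type": "Haplotype",
    "members": [
        make_allele(44908821, 44908822, "C"),
        make_allele(44908683, 44908684, "C"),
    ],
}
print(check_haplotype(example))
```

Note that each Allele's location is read as given; no member's coordinates are shifted to account for insertions or deletions in other members.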

Sources

  • ISOGG: Haplotype — A haplotype is a combination of alleles (DNA sequences) at different places (loci) on the chromosome that are transmitted together. A haplotype may be one locus, several loci, or an entire chromosome depending on the number of recombination events that have occurred between a given set of loci.

  • SequenceOntology: haplotype (SO:0001024) — A haplotype is one of a set of coexisting sequence variants of a haplotype block.

  • GENO: Haplotype (GENO:0000871) - A set of two or more sequence alterations on the same chromosomal strand that tend to be transmitted together.

Examples

An APOE ε2 Haplotype with inline Alleles:

{
  "members": [
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908822
          },
          "start": {
            "type": "Number",
            "value": 44908821
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908684
          },
          "start": {
            "type": "Number",
            "value": 44908683
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    }
  ],
  "type": "Haplotype"
}

The same APOE ε2 Haplotype with referenced Alleles:

{
  "members": [
    "ga4gh:VA.-kUJh47Pu24Y3Wdsk1rXEDKsXWNY-68x",
    "ga4gh:VA.Z_rYRxpUvwqCLsCBO3YLl70o2uf9_Op1"
  ],
  "type": "Haplotype"
}

The GA4GH computed identifier for these Haplotypes is ga4gh:VH.i8owCOBHIlRCPtcw_WzRFNTunwJRy99-, regardless of whether the Variation objects are inlined or referenced, and regardless of order. See Computed Identifiers for more information.

Systemic Variation

Computational Definition

A Variation of multiple molecules in the context of a system, e.g. a genome, sample, or homologous chromosomes.

AbsoluteCopyNumber

Absolute Copy Number Variation captures the copies of a molecule within a genome, and can be used to express concepts such as amplification and copy loss. Copy Number Variation has conflated meanings in the genomics community, and can mean either (or both) the notion of copy number in a genome or copy number on a molecule. VRS separates the concerns of these two types of statements; this concept is a type of Systemic Variation and so describes the number of copies in a genome. The related Molecular Variation concept can be expressed as an Allele with a RepeatedSequenceExpression.

Computational Definition

The absolute count of discrete copies of a Molecular Variation, Feature, Sequence Expression, or a CURIE reference within a system (e.g. genome, cell, etc.).

Information Model

Some AbsoluteCopyNumber attributes are inherited from Variation.

Field    Type                                                         Limits  Description
-------  -----------------------------------------------------------  ------  -----------
_id      CURIE                                                        0..1    Variation Id. MUST be unique within document.
type     string                                                       1..1    MUST be “AbsoluteCopyNumber”
subject  Molecular Variation | Feature | Sequence Expression | CURIE  1..1    Subject of the Copy Number object
copies   Number | IndefiniteRange | DefiniteRange                     1..1    The integral number of copies of the subject in a system

Examples

Three or more total copies of APOE (ncbigene:348):

{
  "copies": {
    "comparator": ">=",
    "type": "IndefiniteRange",
    "value": 3
  },
  "subject": {
    "gene_id": "ncbigene:348",
    "type": "Gene"
  },
  "type": "AbsoluteCopyNumber"
}
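To make the meaning of the copies attribute concrete, the following Python sketch tests whether an observed integer copy count is consistent with a copies value. It is illustrative only. Only the ">=" comparator appears in the example above; support for "<=" and the inclusive min and max attributes on DefiniteRange are assumptions of this sketch, and the function name copies_match is hypothetical.

```python
import operator

# ">=" appears in the example above; "<=" is assumed here.
COMPARATORS = {"<=": operator.le, ">=": operator.ge}


def copies_match(copies, observed):
    """Return True if an observed copy count satisfies the copies expression."""
    kind = copies["type"]
    if kind == "Number":
        return observed == copies["value"]
    if kind == "IndefiniteRange":
        compare = COMPARATORS[copies["comparator"]]
        return compare(observed, copies["value"])
    if kind == "DefiniteRange":
        # Assumption: DefiniteRange carries inclusive "min" and "max" attributes.
        return copies["min"] <= observed <= copies["max"]
    raise ValueError(f"unsupported copies type: {kind}")


three_or_more = {"comparator": ">=", "type": "IndefiniteRange", "value": 3}
print(copies_match(three_or_more, 2))
print(copies_match(three_or_more, 5))
```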

RelativeCopyNumber

Relative Copy Number Variation captures a classification of copies of a molecule within a system, relative to a baseline. These types of Variation are common outputs from CNV callers, particularly in the somatic domain where Absolute Copy Counts are difficult to estimate and less useful in practice than relative statements.

Computational Definition

The relative copies of a Molecular Variation, Feature, Sequence Expression, or a CURIE reference against an unspecified baseline in a system (e.g. genome, cell, etc.).

Information Model

Some RelativeCopyNumber attributes are inherited from Variation.

Field                Type                                                         Limits  Description
-------------------  -----------------------------------------------------------  ------  -----------
_id                  CURIE                                                        0..1    Variation Id. MUST be unique within document.
type                 string                                                       1..1    MUST be “RelativeCopyNumber”
subject              Molecular Variation | Feature | Sequence Expression | CURIE  1..1    Subject of the Copy Number object
relative_copy_class  string                                                       1..1    MUST be one of “complete loss”, “partial loss”, “copy neutral”, “low-level gain” or “high-level gain”.

Examples

Low-level copy gain of APOE (ncbigene:348):

{
  "relative_copy_class": "low-level gain",
  "subject": {
    "gene_id": "ncbigene:348",
    "type": "Gene"
  },
  "type": "RelativeCopyNumber"
}
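Because relative_copy_class is restricted to five values, a consumer may wish to check it before use. The Python sketch below shows one way to do so. The function name is hypothetical, and whether and how to validate remains an implementation choice.

```python
RELATIVE_COPY_CLASSES = frozenset(
    {
        "complete loss",
        "partial loss",
        "copy neutral",
        "low-level gain",
        "high-level gain",
    }
)


def is_valid_relative_copy_number(obj):
    """Check the type, the presence of subject, and the relative_copy_class value."""
    return (
        obj.get("type") == "RelativeCopyNumber"
        and "subject" in obj
        and obj.get("relative_copy_class") in RELATIVE_COPY_CLASSES
    )


gain = {
    "relative_copy_class": "low-level gain",
    "subject": {"gene_id": "ncbigene:348", "type": "Gene"},
    "type": "RelativeCopyNumber",
}
print(is_valid_relative_copy_number(gain))
```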

Utility Variation

Computational Definition

A collection of Variation subclasses that cannot be constrained to a specific class of biological variation, but are necessary for some applications of VRS.

Text

A free-text description of variation that is intended for interpretation by humans.

Important

Text variation should be used sparingly. The Text type is provided as an option of last resort for systems that need to represent human-readable descriptions of complex genetic phenomena or variation for which VRS does not yet have a data type. Structured data types should be preferred over Text.

Computational Definition

A free-text definition of variation.

Information Model

Some Text attributes are inherited from Variation.

Field       Type    Limits  Description
----------  ------  ------  -----------
_id         CURIE   0..1    Variation Id. MUST be unique within document.
type        string  1..1    MUST be “Text”
definition  string  1..1    A textual representation of variation not representable by other subclasses of Variation.

Implementation Guidance

  • An implementation MUST represent Variation with subclasses other than Text if possible.

  • Because the Text type can be easily abused, implementations are NOT REQUIRED to provide it. If it is provided, implementations SHOULD consider applying access controls.

  • Implementations SHOULD upgrade Text variation to structured data types when available. A future version of VRS will provide additional guidance regarding upgrade mechanisms.

  • Additional Variation subclasses are continually under consideration. Please open a GitHub issue if you would like to propose a Variation subclass to cover a needed variation representation.

Examples

{
  "definition": "MSI High",
  "type": "Text"
}

VariationSet

Sets of variation are used widely, such as sets of variants in dbSNP or ClinVar that might be related by function.

Computational Definition

An unconstrained set of Variation members.

Information Model

Some VariationSet attributes are inherited from Variation.

Field    Type               Limits  Description
-------  -----------------  ------  -----------
_id      CURIE              0..1    Variation Id. MUST be unique within document.
type     string             1..1    MUST be “VariationSet”
members  CURIE | Variation  0..m    List of Variation objects or identifiers. Attribute is required, but MAY be empty.

Implementation Guidance

  • The VariationSet identifier MAY be computed as described in Computed Identifiers, in which case the identifier effectively refers to a static set because a different set of members would generate a different identifier.

  • members may be specified as Variation objects or CURIE identifiers.

  • CURIEs MAY refer to entities outside the ga4gh namespace. However, objects that use non-ga4gh identifiers MUST NOT use the Computed Identifiers mechanism.

  • VariationSet identifiers computed using the GA4GH Computed Identifiers process do not depend on whether the Variation objects are inlined or referenced, and do not depend on the order of members.

  • Elements of members must be subclasses of Variation, which permits sets to be nested.

  • Recursive sets are not meaningful and are not supported.

  • VariationSets may be empty.
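The order-independence of members can be illustrated with a small Python sketch that compares two VariationSets as unordered collections. This is not the GA4GH Computed Identifiers procedure; it simply serializes each member to a canonical JSON string using the standard json module and compares the resulting sets. It does not treat an inlined object and a reference to that object as the same member, which the computed identifier does. The function names are hypothetical.

```python
import json


def canonical(member):
    """Serialize a member (dict or CURIE string) with sorted keys."""
    return json.dumps(member, sort_keys=True, separators=(",", ":"))


def same_members(set_a, set_b):
    """Compare the members of two VariationSets without regard to order."""
    members_a = {canonical(m) for m in set_a["members"]}
    members_b = {canonical(m) for m in set_b["members"]}
    return members_a == members_b


first = {
    "members": [
        "ga4gh:VA.-kUJh47Pu24Y3Wdsk1rXEDKsXWNY-68x",
        "ga4gh:VA.Z_rYRxpUvwqCLsCBO3YLl70o2uf9_Op1",
    ],
    "type": "VariationSet",
}
second = {"members": list(reversed(first["members"])), "type": "VariationSet"}
empty = {"members": [], "type": "VariationSet"}

print(same_members(first, second))
print(same_members(first, empty))
```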

Examples

Example VariationSet with inline Alleles:

{
  "members": [
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908822
          },
          "start": {
            "type": "Number",
            "value": 44908821
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908684
          },
          "start": {
            "type": "Number",
            "value": 44908683
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    }
  ],
  "type": "VariationSet"
}

The same VariationSet with referenced Alleles:

{
  "members": [
    "ga4gh:VA.-kUJh47Pu24Y3Wdsk1rXEDKsXWNY-68x",
    "ga4gh:VA.Z_rYRxpUvwqCLsCBO3YLl70o2uf9_Op1"
  ],
  "type": "VariationSet"
}

The GA4GH computed identifier for these sets is ga4gh:VS.QLQXSNSIFlqNYWmQbw-YkfmexPi4NeDE, regardless of whether the Variation objects are inlined or referenced, and regardless of order. See Computed Identifiers for more information.

Locations and Intervals

Location

As used by biologists, the precision of “location” (or “locus”) varies widely, ranging from precise start and end numerical coordinates defining a Location, to bounded regions of a sequence, to conceptual references to named genomic features (e.g., chromosomal bands, genes, exons) as proxies for the Locations on an implied reference sequence.

The most common and concrete Location is a SequenceLocation, i.e., a Location based on a named sequence and an Interval on that sequence. Another common Location is a ChromosomeLocation, specifying a location from cytogenetic coordinates of stained metaphase chromosomes. Additional Intervals and Locations may also be conceptual or symbolic locations, such as a cytoband region or a gene. Any of these may be used as the Location for Variation.

Computational Definition

A contiguous segment of a biological sequence.

Information Model

Field  Type    Limits  Description
-----  ------  ------  -----------
_id    CURIE   0..1    Location Id. MUST be unique within document.
type   string  1..1    The Location class type. MUST match child class type.

Implementation Guidance

  • Location refers to a position. Although it MAY imply a sequence, the two concepts are not interchangeable, especially when the location is non-specific (e.g., specified by an IndefiniteRange). To represent a sequence derived from a Location, see DerivedSequenceExpression.

ChromosomeLocation

Chromosomal locations based on named features, including named landmarks, cytobands, and regions observed from chromosomal staining techniques.

Computational Definition

A Location on a chromosome defined by a species and chromosome name.

Information Model

Some ChromosomeLocation attributes are inherited from Location.

Field       Type              Limits  Description
----------  ----------------  ------  -----------
_id         CURIE             0..1    Location Id. MUST be unique within document.
type        string            1..1    MUST be “ChromosomeLocation”
species_id  CURIE             1..1    CURIE representing a species from the NCBI species taxonomy. Default: “taxonomy:9606” (human)
chr         string            1..1    The symbolic chromosome name. For humans, chromosome names MUST be one of 1..22, X, Y (case-sensitive)
interval    CytobandInterval  1..1    The chromosome region defined by a CytobandInterval

Implementation Guidance

  • ChromosomeLocation is intended to enable the representation of cytogenetic results from karyotyping or low-resolution molecular methods, particularly those found in older scientific literature. Precise SequenceLocation should be preferred when nucleotide-scale location is known.

  • species is specified using the NCBI taxonomy. The CURIE prefix MUST be “taxonomy”, corresponding to the NCBI taxonomy prefix at identifiers.org, and the CURIE reference MUST be an NCBI taxonomy identifier (e.g., 9606 for Homo sapiens).

  • ChromosomeLocation is intended primarily for human chromosomes. Support for other species is possible and will be considered based on community feedback.

  • chromosome is an archetypal chromosome name. Valid values for, and the syntactic structure of, chromosome depend on the species. chromosome MUST be an official sequence name from NCBI Assembly. For humans, valid chromosome names are 1..22, X, Y (case-sensitive). NOTE: A `chr` prefix is NOT part of the chromosome and MUST NOT be included.

  • interval refers to a contiguous region specified by named markers, which are presumed to exist on the specified chromosome. See CytobandInterval for additional information.

  • The conversion of ChromosomeLocation instances to SequenceLocation instances is out-of-scope for VRS. When converting start and end to SequenceLocations, the positions MUST be interpreted as inclusive ranges that cover the maximal extent of the region.

  • Data for converting cytogenetic bands to precise sequence coordinates are available at NCBI GDP, UCSC GRCh37 (hg19), UCSC GRCh38 (hg38), and bioutils (Python).

  • See also the rationale for Not using External Chromosome Declarations.
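The species and chromosome-name rules above lend themselves to a simple check. The Python sketch below is a minimal illustration for human ChromosomeLocation objects. The function name is hypothetical, and requiring the taxonomy reference to consist only of digits is an assumption of this sketch.

```python
# Human chromosome names: "1" .. "22", "X", "Y" (case-sensitive, no "chr" prefix).
HUMAN_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def check_chromosome_location(location):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    species_id = location.get("species_id", "")
    prefix, _, reference = species_id.partition(":")
    # Assumption: the taxonomy reference is a string of digits, e.g. "9606".
    if prefix != "taxonomy" or not reference.isdigit():
        problems.append(f"species_id is not a taxonomy CURIE: {species_id!r}")
    if species_id == "taxonomy:9606" and location.get("chr") not in HUMAN_CHROMOSOMES:
        problems.append(f"invalid human chromosome name: {location.get('chr')!r}")
    return problems


location = {
    "chr": "19",
    "interval": {"end": "q13.32", "start": "q13.32", "type": "CytobandInterval"},
    "species_id": "taxonomy:9606",
    "type": "ChromosomeLocation",
}
print(check_chromosome_location(location))
print(check_chromosome_location(dict(location, chr="chr19")))
```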

Examples

{
  "chr": "19",
  "interval": {
    "end": "q13.32",
    "start": "q13.32",
    "type": "CytobandInterval"
  },
  "species_id": "taxonomy:9606",
  "type": "ChromosomeLocation"
}

SequenceLocation

A Sequence Location is a specified subsequence of a reference Sequence. The reference is typically a chromosome, transcript, or protein sequence.

Computational Definition

A Location defined by an interval on a referenced Sequence.

Information Model

Some SequenceLocation attributes are inherited from Location.

Field        Type                               Limits  Description
-----------  ---------------------------------  ------  -----------
_id          CURIE                              0..1    Location Id. MUST be unique within document.
type         string                             1..1    MUST be “SequenceLocation”
sequence_id  CURIE                              1..1    A VRS Computed Identifier for the reference Sequence.
interval     SequenceInterval | SimpleInterval  1..1    Reference sequence region defined by a SequenceInterval.

Implementation Guidance

  • For a Sequence of length n:
    • 0 ≤ interval.start ≤ interval.end ≤ n

    • inter-residue coordinate 0 refers to the point before the start of the Sequence

    • inter-residue coordinate n refers to the point after the end of the Sequence.

  • Coordinates MUST refer to a valid Sequence. VRS does not support referring to intronic positions within a transcript sequence, extrapolations beyond the ends of sequences, or other implied sequence.
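Inter-residue coordinates behave like Python slice indices, which makes the constraints above easy to demonstrate. The sketch below uses a toy sequence; the function name subsequence is hypothetical.

```python
def subsequence(sequence: str, start: int, end: int) -> str:
    """Return the residues between inter-residue coordinates start and end."""
    n = len(sequence)
    if not 0 <= start <= end <= n:
        raise ValueError(f"require 0 <= start <= end <= {n}")
    return sequence[start:end]


seq = "ACGT"  # toy sequence of length n = 4

first = subsequence(seq, 0, 1)  # coordinate 0 is the point before the first residue
whole = subsequence(seq, 0, 4)  # coordinate n is the point after the last residue
empty = subsequence(seq, 2, 2)  # a zero-width interval selects no residues

print(repr(first), repr(whole), repr(empty))
```

A request such as subsequence(seq, 3, 5) raises ValueError, reflecting the rule that coordinates must not extend beyond the ends of the sequence.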

Important

HGVS permits variants that refer to non-existent sequence. Examples include coordinates extrapolated beyond the bounds of a transcript and intronic sequence. Such variants are not representable using VRS and MUST be projected to a genomic reference in order to be represented.

Examples

{
  "interval": {
    "end": 44908822,
    "start": 44908821,
    "type": "SimpleInterval"
  },
  "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
  "type": "SequenceLocation"
}

SequenceInterval

SequenceInterval is intended to be compatible with a “region” in Sequence Ontology (SO:0000001), with the exception that the GA4GH VRS SequenceInterval may be zero-width. The SO definition of region has an “extent greater than zero”.

Computational Definition

A SequenceInterval represents a span on a Sequence. Positions are always represented by contiguous spans using interbase coordinates or coordinate ranges.

Information Model

Field  Type                                      Limits  Description
-----  ----------------------------------------  ------  -----------
type   string                                    1..1    MUST be “SequenceInterval”
start  Number | IndefiniteRange | DefiniteRange  1..1    The start coordinate or range of the interval. The minimum value of this coordinate or range is 0. MUST represent a coordinate or range less than or equal to the value of end.
end    Number | IndefiniteRange | DefiniteRange  1..1    The end coordinate or range of the interval. The minimum value of this coordinate or range is 0. MUST represent a coordinate or range greater than or equal to the value of start.

Sources

Examples

{
  "end": {
    "type": "Number",
    "value": 44908822
  },
  "start": {
    "type": "Number",
    "value": 44908821
  },
  "type": "SequenceInterval"
}

CytobandInterval

Important

VRS currently supports only human cytobands and cytoband intervals. Implementers wishing to use VRS for other cytogenetic systems are encouraged to open a GitHub issue.

Cytobands refer to regions of chromosomes that are identified by visible patterns on stained metaphase chromosomes. They provide a convenient, memorable, and low-resolution shorthand for chromosomal segments.

Computational Definition

A contiguous span on a chromosome defined by cytoband features. The span includes the constituent regions described by the start and end cytobands, as well as any intervening regions.

Information Model

Field  Type           Limits  Description
-----  -------------  ------  -----------
type   string         1..1    MUST be “CytobandInterval”
start  HumanCytoband  1..1    The start cytoband region. MUST specify a region nearer the terminal end (telomere) of the chromosome p-arm than end.
end    HumanCytoband  1..1    The end cytoband region. MUST specify a region nearer the terminal end (telomere) of the chromosome q-arm than start.

Implementation Guidance

  • When using CytobandInterval to refer to human cytogenetic bands, the following conventions MUST be used. Bands are denoted by the arm (“p” or “q”) and position (e.g., “22”, “22.3”, or the symbolic values “cen” or “ter”) per ISCN conventions 1. These conventions identify cytobands in order from the centromere towards the telomeres. In VRS, we order cytoband coordinates in the p-ter → cen → q-ter orientation, analogous to sequence coordinates. This has the consequence that bands on the p-arm are represented in descending numerical order when selecting cytobands for start and end.
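The p-ter → cen → q-ter ordering can be sketched as a sort key in Python. This is a simplified illustration under stated assumptions: it handles only numeric band positions (not the symbolic values “cen” or “ter”), and it assumes that positions on the same arm compare correctly when read as decimal numbers. The function names are hypothetical.

```python
def cytoband_key(band: str):
    """Sort key placing numeric bands in p-ter -> cen -> q-ter order."""
    arm, position = band[0], float(band[1:])
    if arm == "p":
        # On the p-arm, larger numbers lie nearer the p telomere, so negate.
        return (0, -position)
    if arm == "q":
        return (1, position)
    raise ValueError(f"unknown chromosome arm in {band!r}")


def is_ordered(start: str, end: str) -> bool:
    """Return True if start does not lie after end in VRS orientation."""
    return cytoband_key(start) <= cytoband_key(end)


print(is_ordered("p22.3", "p21"))      # descending numbers on the p-arm
print(is_ordered("p11", "q13.32"))     # the p-arm precedes the q-arm
print(is_ordered("q13.32", "q13.32"))  # a single-band interval
```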

Examples

{
  "end": "q13.32",
  "start": "q13.32",
  "type": "CytobandInterval"
}

Sequence Expression

VRS provides several syntaxes for expressing a sequence, collectively referred to as Sequence Expressions. Each is described in its own subsection below.

Some SequenceExpression instances may appear to resolve to the same sequence, but are intended to be semantically distinct. There MAY be reasons to select or enforce one form over another that SHOULD be managed by implementations. See discussion on Equivalence Between Concepts.

Computational Definition

An expression describing a Sequence.

Information Model

Field  Type    Limits  Description
-----  ------  ------  -----------
type   string  1..1    The SequenceExpression class type. MUST match child class type.

LiteralSequenceExpression

A LiteralSequenceExpression “wraps” a string representation of a sequence for parallelism with other SequenceExpressions.

Computational Definition

An explicit expression of a Sequence.

Information Model

Some LiteralSequenceExpression attributes are inherited from Sequence Expression.

Field     Type      Limits  Description
--------  --------  ------  -----------
type      string    1..1    MUST be “LiteralSequenceExpression”
sequence  Sequence  1..1    the literal Sequence expressed

Examples

{
  "sequence": "ACGT",
  "type": "LiteralSequenceExpression"
}

DerivedSequenceExpression

Certain mechanisms of variation result from relocating and transforming sequence from another location in the genome. A derived sequence is a mechanism for expressing (typically large) reference subsequences specified by a SequenceLocation.

Computational Definition

An approximate expression of a sequence that is derived from a referenced sequence location. Use of this class indicates that the derived sequence is approximately equivalent to the reference indicated, and is typically used for describing large regions in contexts where the use of an approximate sequence is inconsequential.

Information Model

Some DerivedSequenceExpression attributes are inherited from Sequence Expression.

Field               Type              Limits  Description
------------------  ----------------  ------  -----------
type                string            1..1    MUST be “DerivedSequenceExpression”
location            SequenceLocation  1..1    The location from which the approximate sequence is derived
reverse_complement  boolean           1..1    A flag indicating if the expressed sequence is the reverse complement of the sequence referred to by location

Examples

{
  "location": {
    "interval": {
      "end": {
        "type": "Number",
        "value": 44908822
      },
      "start": {
        "type": "Number",
        "value": 44908821
      },
      "type": "SequenceInterval"
    },
    "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    "type": "SequenceLocation"
  },
  "reverse_complement": false,
  "type": "DerivedSequenceExpression"
}
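
As an illustration of how such an expression might be resolved, the following minimal sketch slices a reference sequence by the location's inter-residue coordinates and optionally reverse-complements the result. The in-memory store, the identifier example:seq1, and the function name resolve_derived are hypothetical and not part of VRS; the sketch also assumes Number start and end values and a DNA alphabet of A, C, G, and T. Because a DerivedSequenceExpression is an approximate expression, the returned string should be read as approximately equivalent to the expressed sequence.

```python
# Hypothetical in-memory store; a real implementation would look sequences up
# by identifier in a sequence repository.
REFERENCE_SEQUENCES = {"example:seq1": "AACCGGTTAC"}

# Complement table for unambiguous DNA residues only (an assumption).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def resolve_derived(expr, store=REFERENCE_SEQUENCES):
    """Return the reference subsequence a DerivedSequenceExpression points to."""
    if expr["type"] != "DerivedSequenceExpression":
        raise ValueError("expected a DerivedSequenceExpression")
    location = expr["location"]
    interval = location["interval"]
    start = interval["start"]["value"]  # assumes Number, not a range
    end = interval["end"]["value"]
    # Inter-residue coordinates map directly onto Python slice bounds.
    subsequence = store[location["sequence_id"]][start:end]
    if expr["reverse_complement"]:
        subsequence = subsequence.translate(_COMPLEMENT)[::-1]
    return subsequence


example = {
    "type": "DerivedSequenceExpression",
    "reverse_complement": True,
    "location": {
        "type": "SequenceLocation",
        "sequence_id": "example:seq1",
        "interval": {
            "type": "SequenceInterval",
            "start": {"type": "Number", "value": 0},
            "end": {"type": "Number", "value": 4},
        },
    },
}
derived = resolve_derived(example)  # reverse complement of "AACC"
```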

RepeatedSequenceExpression

Repeated Sequence is a class of sequence expression where a specified subsequence is repeated multiple times in tandem. Microsatellites are an example of a common class of repeated sequence, but repeated sequence can also be used to describe larger subsequence repeats, up to and including large-scale tandem duplications.

Computational Definition

An expression of a sequence composed of a subsequence repeated in tandem.

Information Model

Some RepeatedSequenceExpression attributes are inherited from Sequence Expression.

Field     Type                                                   Limits  Description
--------  -----------------------------------------------------  ------  -----------
type      string                                                 1..1    MUST be “RepeatedSequenceExpression”
seq_expr  LiteralSequenceExpression | DerivedSequenceExpression  1..1    An expression of the repeating subsequence
count     Number | IndefiniteRange | DefiniteRange               1..1    The count of repeated units, as an integer or inclusive range

Examples

{
  "count": {
    "comparator": ">=",
    "type": "IndefiniteRange",
    "value": 6
  },
  "seq_expr": {
    "location": {
      "interval": {
        "end": {
          "type": "Number",
          "value": 44908822
        },
        "start": {
          "type": "Number",
          "value": 44908821
        },
        "type": "SequenceInterval"
      },
      "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
      "type": "SequenceLocation"
    },
    "reverse_complement": false,
    "type": "DerivedSequenceExpression"
  },
  "type": "RepeatedSequenceExpression"
}
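
To make the idea of a tandem repeat concrete, the sketch below counts how many copies of a repeat unit make up an observed sequence and compares that number with an IndefiniteRange count like the one in the example above. The helper name tandem_copies and the CAG unit are illustrative assumptions, not part of VRS.

```python
def tandem_copies(unit, sequence):
    """Return how many tandem copies of `unit` form `sequence`, or None."""
    if not unit:
        raise ValueError("the repeat unit must be a non-empty string")
    copies, remainder = divmod(len(sequence), len(unit))
    if remainder == 0 and unit * copies == sequence:
        return copies
    return None


# Count object in the same shape as the example above: six or more copies.
count = {"comparator": ">=", "type": "IndefiniteRange", "value": 6}

observed = "CAG" * 8  # illustrative observed sequence
copies = tandem_copies("CAG", observed)
satisfies_count = copies is not None and copies >= count["value"]
```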

ComposedSequenceExpression

Composed Sequence is a class of sequence expression where two or more constituent sequence expressions are expressed as an ordered list, representing a concatenated sequence. This class is useful for expressing concepts such as the OPMD polyalanine alleles [2].

[2] Brais B, et al. Short GCG expansions in the PABP2 gene cause oculopharyngeal muscular dystrophy. Nat Genet. (1998).

Computational Definition

An expression of a sequence composed from multiple other Sequence Expression objects. MUST have at least one component that is not a LiteralSequenceExpression. MUST NOT be composed from nested composed sequence expressions.

Information Model

Some ComposedSequenceExpression attributes are inherited from Sequence Expression.

Field       Type                                                                                Limits  Description
----------  ----------------------------------------------------------------------------------  ------  -----------
type        string                                                                              1..1    MUST be “ComposedSequenceExpression”
components  LiteralSequenceExpression | RepeatedSequenceExpression | DerivedSequenceExpression  2..m    An ordered list of Sequence Expression components comprising the expression.

Examples

{
  "type": "ComposedSequenceExpression",
  "components": [
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
      "count": { "type": "Number", "value": 11 }
    },
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCA" },
      "count": { "type": "Number", "value": 3 }
    },
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
      "count": { "type": "Number", "value": 1 }
    }
  ]
}
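
The constraints in the computational definition can be checked mechanically. The following minimal sketch validates a ComposedSequenceExpression and, when every component is a literal or a repeat with a definite Number count, expands it into the concatenated sequence. The function names validate_composed and expand are hypothetical; components that need reference data (DerivedSequenceExpression) or that have a range count are deliberately rejected by expand rather than guessed at.

```python
ALLOWED_COMPONENTS = {
    "LiteralSequenceExpression",
    "RepeatedSequenceExpression",
    "DerivedSequenceExpression",
}


def validate_composed(expr):
    """Raise ValueError if the composed expression breaks a stated constraint."""
    kinds = [component["type"] for component in expr["components"]]
    if len(kinds) < 2:
        raise ValueError("at least two components are required")
    if any(kind not in ALLOWED_COMPONENTS for kind in kinds):
        raise ValueError("unsupported or nested component type")
    if all(kind == "LiteralSequenceExpression" for kind in kinds):
        raise ValueError("at least one component must be non-literal")


def expand(expr):
    """Expand an expression to a plain string where that is well defined."""
    kind = expr["type"]
    if kind == "LiteralSequenceExpression":
        return expr["sequence"]
    if kind == "RepeatedSequenceExpression":
        if expr["count"]["type"] != "Number":
            raise ValueError("only a definite Number count can be expanded")
        return expand(expr["seq_expr"]) * expr["count"]["value"]
    if kind == "ComposedSequenceExpression":
        validate_composed(expr)
        return "".join(expand(component) for component in expr["components"])
    raise ValueError(f"cannot expand {kind} without reference data")


def repeat(unit, n):
    """Build a RepeatedSequenceExpression over a literal unit."""
    return {
        "type": "RepeatedSequenceExpression",
        "seq_expr": {"type": "LiteralSequenceExpression", "sequence": unit},
        "count": {"type": "Number", "value": n},
    }


# Same structure as the example above: (GCG)11 (GCA)3 (GCG)1.
opmd = {
    "type": "ComposedSequenceExpression",
    "components": [repeat("GCG", 11), repeat("GCA", 3), repeat("GCG", 1)],
}
expanded = expand(opmd)
```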

Feature

Computational Definition

A named entity that can be mapped to a Location. Genes, protein domains, exons, and chromosomes are some examples of common biological entities that may be Features.

Information Model

Field  Type    Limits  Description
-----  ------  ------  -----------
type   string  1..1    The Feature class type. MUST match child class type.

Gene

A gene is a basic and fundamental unit of heritability. Genes are functional regions of heritable DNA or RNA that include transcript coding regions, regulatory elements, and other functional sequence domains. Because of the complex nature of these many components comprising a gene, the interpretation of a gene depends on context.

Computational Definition

A reference to a Gene as defined by an authority. For human genes, the use of hgnc as the gene authority is RECOMMENDED.

Information Model

Some Gene attributes are inherited from Feature.

Field    Type    Limits  Description
-------  ------  ------  -----------
type     string  1..1    MUST be “Gene”
gene_id  CURIE   1..1    A CURIE reference to a Gene concept

Implementation guidance

  • Gene symbols (e.g., “BRCA1”) are unreliable keys. Implementations MUST NOT use a gene symbol to define a Gene.

  • A gene is specific to a species. Gene orthologs have distinct records in the recommended databases. For example, the BRCA1 gene in humans and the Brca1 gene in mouse are orthologs and have distinct records in the recommended gene databases.

  • Implementations MUST use authoritative gene namespaces available from identifiers.org whenever possible. Examples include:

  • The hgnc namespace is RECOMMENDED for human variation in order to improve interoperability. When using the hgnc namespace, the optional “HGNC:” prefix MUST NOT be used.

  • Gene MAY be converted to one or more Locations using external data. The source of such data and mechanism for implementation is not defined by this specification.

  • See discussion on Equivalence Between Concepts.

Examples

The following example refers to the human APOE gene:

{
  "gene_id": "hgnc:613",
  "type": "Gene"
}

Sources

  • SequenceOntology: gene (SO:0000704) — A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.

Basic Types

Basic types are data structures that represent general concepts and that may be applicable in multiple parts of VRS.

Number

Computational Definition

A simple integer value as a VRS class.

Information Model

Field  Type     Limits  Description
-----  -------  ------  -----------
type   string   1..1    MUST be “Number”
value  integer  1..1    The value represented by Number

Examples

{
  "type": "Number",
  "value": 55
}

DefiniteRange

Computational Definition

A bounded, inclusive range of numbers.

Information Model

Field  Type    Limits  Description
-----  ------  ------  -----------
type   string  1..1    MUST be “DefiniteRange”
min    number  1..1    The minimum value; inclusive
max    number  1..1    The maximum value; inclusive

Examples

{
  "max": 33,
  "min": 22,
  "type": "DefiniteRange"
}

IndefiniteRange

Computational Definition

A half-bounded range of numbers represented as a number bound and an associated comparator. The comparator is interpreted as follows: ‘>=’ denotes all numbers greater than or equal to value; ‘<=’ denotes all numbers less than or equal to value.

Information Model

Field       Type    Limits  Description
----------  ------  ------  -----------
type        string  1..1    MUST be “IndefiniteRange”
value       number  1..1    The bounded value; inclusive
comparator  string  1..1    MUST be one of “<=” or “>=”, indicating which direction the range is indefinite

Examples

This value is equivalent to the concept of “equal to or greater than 22”:

{
  "comparator": ">=",
  "type": "IndefiniteRange",
  "value": 22
}
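
The three basic count types can be interpreted uniformly. As a minimal sketch, the hypothetical helper count_matches below tests whether an observed integer is described by a Number, a DefiniteRange, or an IndefiniteRange, following the inclusive semantics stated in the definitions above.

```python
def count_matches(count, observed):
    """Return True if `observed` is described by the given count object."""
    kind = count["type"]
    if kind == "Number":
        return observed == count["value"]
    if kind == "DefiniteRange":
        # Both bounds are inclusive.
        return count["min"] <= observed <= count["max"]
    if kind == "IndefiniteRange":
        if count["comparator"] == ">=":
            return observed >= count["value"]
        if count["comparator"] == "<=":
            return observed <= count["value"]
        raise ValueError("comparator must be '<=' or '>='")
    raise ValueError(f"unsupported count type: {kind}")


# The example objects from the Number, DefiniteRange, and IndefiniteRange sections.
number = {"type": "Number", "value": 55}
definite = {"max": 33, "min": 22, "type": "DefiniteRange"}
indefinite = {"comparator": ">=", "type": "IndefiniteRange", "value": 22}
```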

Primitives

Primitives represent simple values with syntactic or other constraints. These constraints help ensure the correctness of values stored in VRS.

CURIE

Computational Definition

A W3C Compact URI formatted string. A CURIE string has the structure prefix:reference, as defined by the W3C syntax.

Implementation Guidance

  • All identifiers in VRS MUST be a valid CURIE, regardless of whether the identifier refers to GA4GH VRS objects or external data.

  • For GA4GH VRS objects, this specification RECOMMENDS using globally unique Computed Identifiers for use within and between systems.

  • For external data, CURIE-formatted identifiers MUST be used. When an appropriate namespace exists at identifiers.org, that namespace MUST be used. When an appropriate namespace does not exist at identifiers.org, support is implementation-dependent. That is, implementations MAY choose whether and how to support informal or local namespaces.

  • Implementations MUST use CURIE identifiers verbatim. Implementations MUST NOT modify CURIEs in any way (e.g., case-folding).

Examples

Identifiers for GRCh38 chromosome 19:

ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl
refseq:NC_000019.10
grch38:19

See Identifier Construction for examples of CURIE-based identifiers for VRS objects.
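
As a minimal sketch of handling the prefix:reference structure, the hypothetical helper below splits a CURIE at its first colon and returns both parts unmodified, in keeping with the guidance to use CURIEs verbatim. It checks only that both parts are present; it does not implement the full W3C CURIE grammar.

```python
def split_curie(curie):
    """Split a CURIE into (prefix, reference) without altering either part."""
    prefix, separator, reference = curie.partition(":")
    if not separator or not prefix or not reference:
        raise ValueError(f"not a prefix:reference CURIE: {curie!r}")
    return prefix, reference


# The three identifiers for GRCh38 chromosome 19 listed above.
parts = [
    split_curie("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"),
    split_curie("refseq:NC_000019.10"),
    split_curie("grch38:19"),
]
```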

HumanCytoband

Cytobands are any of a pattern of stained bands, formed on chromosomes of cells undergoing metaphase, that serve to identify particular chromosomes. Human cytobands are predominantly specified by the International System for Human Cytogenomic Nomenclature (ISCN) [1].

Computational Definition

A character string representing cytobands derived from the International System for Human Cytogenomic Nomenclature (ISCN) guidelines.

Information Model

A string constrained to match the regular expression ^cen|[pq](ter|([1-9][0-9]*(\.[1-9][0-9]*)?))$, derived from the ISCN guidelines [1].

[1] McGowan-Jordan J (Ed.). ISCN 2016: An international system for human cytogenomic nomenclature (2016). Karger (2016).
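
A validator for this primitive might look like the sketch below. Because alternation binds more loosely than the anchors, the pattern as written anchors "cen" only at the start and the arm form only at the end; the sketch therefore assumes that the intent is a whole-string match and uses re.fullmatch. The function name is_human_cytoband is hypothetical.

```python
import re

# Pattern copied verbatim from the information model above.
HUMAN_CYTOBAND = re.compile(r"^cen|[pq](ter|([1-9][0-9]*(\.[1-9][0-9]*)?))$")


def is_human_cytoband(value):
    """Return True if the whole string matches the cytoband pattern."""
    return HUMAN_CYTOBAND.fullmatch(value) is not None


valid = [is_human_cytoband(band) for band in ("cen", "pter", "qter", "q13.32", "p11")]
invalid = [is_human_cytoband(band) for band in ("centromere", "q13.02", "x13", "")]
```

With re.search instead of re.fullmatch, a string such as "centromere" would be accepted, which is why the whole-string match is used here.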

Examples

"q13.32" (string)

Residue

A residue refers to a specific monomer within the polymeric chain of a protein or nucleic acid (Source: Wikipedia Residue page).

Computational Definition

A character representing a specific residue (i.e., molecular species) or groupings of these (“ambiguity codes”), using one-letter IUPAC abbreviations for nucleic acids and amino acids.

Sequence

A sequence is a character string representation of a contiguous, linear polymer of nucleic acid or amino acid Residues. Sequences are the prevalent representation of these polymers, particularly in the domain of variant representation.

Computational Definition

A character string of Residues that represents a biological sequence using the conventional sequence order (5’-to-3’ for nucleic acid sequences, and amino-to-carboxyl for amino acid sequences). IUPAC ambiguity codes are permitted in Sequences.

Information Model

A string constrained to match the regular expression ^[A-Z*\-]*$, derived from the IUPAC one-letter nucleic acid and amino acid codes.

Implementation Guidance

  • Sequences MAY be empty (zero-length) strings. Empty sequences are used as the replacement Sequence for deletion Alleles.

  • Sequences MUST consist of only uppercase IUPAC abbreviations, including ambiguity codes.

  • A Sequence provides a stable coordinate system by which an Allele MAY be located and interpreted.

  • A Sequence MAY have several roles. A “reference sequence” is any Sequence used to define an Allele. A Sequence that replaces another Sequence is called a “replacement sequence”.

  • In some contexts outside VRS, “reference sequence” may refer to a member of the set of sequences that comprise a genome assembly. In the VRS specification, any sequence may be a “reference sequence”, including those in a genome assembly.

  • For the purposes of representing sequence variation, it is not necessary that Sequences be explicitly “typed” (i.e., DNA, RNA, or AA).
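
The regular expression and the guidance above can be exercised with a short check. This is a minimal sketch; the function name is_sequence is hypothetical, and re.fullmatch is used so that the entire string, including any trailing newline, must conform.

```python
import re

# Pattern copied verbatim from the information model above.
SEQUENCE = re.compile(r"^[A-Z*\-]*$")


def is_sequence(value):
    """Return True if `value` is a syntactically valid Sequence string."""
    return SEQUENCE.fullmatch(value) is not None


checks = {
    "ACGT": is_sequence("ACGT"),      # plain nucleic acid sequence
    "": is_sequence(""),              # empty sequences are allowed
    "MKT*": is_sequence("MKT*"),      # '*' is permitted by the pattern
    "acgt": is_sequence("acgt"),      # lowercase is rejected
    "AC GT": is_sequence("AC GT"),    # whitespace is rejected
}
```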

Examples

"ACGT" (string)

Deprecated and Obsolete Classes

SimpleInterval

Warning

DEPRECATED. Use SequenceInterval instead. SimpleInterval will be removed in VRS 2.0.

Computational Definition

DEPRECATED: A SimpleInterval represents a span of sequence. Positions are always represented by contiguous spans using interbase coordinates. This class is deprecated. Use SequenceInterval instead.

Information Model

Field  Type     Limits  Description
-----  -------  ------  -----------
type   string   1..1    MUST be “SimpleInterval”
start  integer  1..1    The start coordinate
end    integer  1..1    The end coordinate

Implementation Guidance

  • Implementations MUST enforce values 0 ≤ start ≤ end. In the case of double-stranded DNA, this constraint holds even when a feature is on the complementary strand.

  • VRS uses inter-residue coordinates because they provide conceptual consistency that is not possible with residue-based systems (see rationale). Implementations will need to convert between inter-residue coordinates and the 1-based, inclusive residue coordinates familiar to most human users.

  • Inter-residue coordinates start at 0 (zero).

  • The length of an interval is end - start.

  • An interval in which start == end is a zero width point between two residues.

  • An interval of length == 1 MAY be colloquially referred to as a position.

  • Two intervals are equal if their start and end coordinates are equal.

  • Two intervals intersect if the start or end coordinate of one is strictly between the start and end coordinates of the other. That is, if:

    • b.start < a.start < b.end OR

    • b.start < a.end < b.end OR

    • a.start < b.start < a.end OR

    • a.start < b.end < a.end

  • Two intervals a and b coincide if they intersect or if they are equal (the equality condition is REQUIRED to handle the case of two identical zero-width SimpleIntervals).

  • <start, end>=<0,0> refers to the point with width zero before the first residue.

  • <start, end>=<i,i+1> refers to the (i+1)th (1-based) residue.

  • <start, end>=<N,N> refers to the position after the last residue for Sequence of length N.

  • See example notebooks in GA4GH VRS Python Implementation.

Examples

{
  "end": 44908822,
  "start": 44908821,
  "type": "SimpleInterval"
}
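
The implementation guidance for intervals translates directly into code. The following minimal sketch encodes the 0 ≤ start ≤ end constraint, the length rule, and the intersect and coincide definitions, plus a conversion to 1-based inclusive residue coordinates. The class and function names are hypothetical and are not taken from any VRS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """An inter-residue interval; equality compares start and end."""

    start: int
    end: int

    def __post_init__(self):
        if not 0 <= self.start <= self.end:
            raise ValueError("intervals require 0 <= start <= end")

    @property
    def length(self):
        return self.end - self.start


def intersects(a, b):
    """True if an endpoint of one interval is strictly inside the other."""
    return (
        b.start < a.start < b.end
        or b.start < a.end < b.end
        or a.start < b.start < a.end
        or a.start < b.end < a.end
    )


def coincides(a, b):
    """True if the intervals intersect or are equal (covers zero-width pairs)."""
    return a == b or intersects(a, b)


def to_one_based_inclusive(interval):
    """Convert to (first, last) residue numbers, 1-based and inclusive."""
    if interval.length == 0:
        raise ValueError("a zero-width interval spans no residues")
    return interval.start + 1, interval.end


position = Interval(44908821, 44908822)  # the example interval above
residues = to_one_based_inclusive(position)
```

Under these definitions, adjacent intervals such as <0,5> and <5,10> do not intersect, and two identical zero-width intervals coincide only through the equality clause.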

SequenceState

Warning

DEPRECATED. Use LiteralSequenceExpression instead. SequenceState will be removed in VRS 2.0.

Deprecated since version 1.2.

Computational Definition

DEPRECATED. A Sequence as a State. This is the State class to use for representing “ref-alt” style variation, including SNVs, MNVs, del, ins, and delins. This class is deprecated. Use LiteralSequenceExpression instead.

Information Model

Field     Type      Limits  Description
--------  --------  ------  -----------
type      string    1..1    MUST be “SequenceState”
sequence  Sequence  1..1    A string of Residues

Examples

{
  "sequence": "T",
  "type": "SequenceState"
}
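
Because both deprecated classes have direct replacements, existing data can be migrated mechanically. The sketch below shows one way this might be done on plain dictionaries; the function names are hypothetical, and the SequenceInterval shape (Number objects for start and end) is assumed from the SequenceLocation examples elsewhere in this specification.

```python
def upgrade_sequence_state(state):
    """Convert a deprecated SequenceState into a LiteralSequenceExpression."""
    if state["type"] != "SequenceState":
        raise ValueError("expected a SequenceState")
    return {"type": "LiteralSequenceExpression", "sequence": state["sequence"]}


def upgrade_simple_interval(interval):
    """Convert a deprecated SimpleInterval into a SequenceInterval."""
    if interval["type"] != "SimpleInterval":
        raise ValueError("expected a SimpleInterval")
    return {
        "type": "SequenceInterval",
        "start": {"type": "Number", "value": interval["start"]},
        "end": {"type": "Number", "value": interval["end"]},
    }


new_state = upgrade_sequence_state({"sequence": "T", "type": "SequenceState"})
new_interval = upgrade_simple_interval(
    {"end": 44908822, "start": 44908821, "type": "SimpleInterval"}
)
```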

State

Warning

OBSOLETE. State was an abstract class that was intended for future growth. It was replaced by SequenceExpressions, which subsumes the functionality envisioned for State. Because State was abstract, and therefore purely an internal concept, it was made obsolete at the same time that SequenceState was deprecated.

Deprecated since version 1.2.

Computational Definition

State objects are one of two primary components specifying a VRS Allele (in addition to Location), and the designated components for representing change (or non-change) of the features indicated by the Allele Location. As an abstract class, State currently encompasses single and contiguous Sequence changes (see SequenceState).