Normalization
In VRS, “normalization” refers to the process of rewriting an ambiguous representation of variation into a canonical form. Normalization eliminates a class of ambiguity that impedes comparison of variation across systems.
In the sequencing community, “normalization” refers to the process of converting a given sequence variant into a canonical form, typically by left- or right-shuffling insertion/deletion variants. VRS normalization extends this concept to all classes of VRS Variation objects.
Implementations MUST provide a normalize function that accepts any Variation object and returns a normalized Variation. Guidelines for these functions are below.
General Normalization Rules
 Object types that do not have explicit VRS normalization rules below are returned as-is. That is, all types of Variation MUST be supported, even if such objects are unchanged.
 VRS normalization functions are idempotent: normalizing a previously-normalized object returns an equivalent object.
 VRS normalization functions are not necessarily homomorphic: That is, the input and output objects may be of different types.
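The pass-through and idempotence rules above can be sketched with a type-based dispatcher. This is a minimal illustration, not a VRS implementation: `Text` and `ToyAllele` are hypothetical stand-ins for Variation types, and the registered rule performs only suffix and prefix trimming on explicit reference and alternate sequences, standing in for a full normalization rule.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class Text:
    """Hypothetical Variation type that has no normalization rule."""
    definition: str


@dataclass(frozen=True)
class ToyAllele:
    """Hypothetical simplified Allele with explicit ref and alt sequences."""
    start: int
    end: int
    ref: str
    alt: str


@singledispatch
def normalize(variation):
    # Default rule: types without an explicit rule are returned as-is.
    return variation


@normalize.register
def _(variation: ToyAllele):
    # Illustrative rule only: trim the common suffix, then the common prefix.
    ref, alt = variation.ref, variation.alt
    start, end = variation.start, variation.end
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt, end = ref[:-1], alt[:-1], end - 1
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1
    return ToyAllele(start, end, ref, alt)


text = Text("example")
once = normalize(ToyAllele(10, 12, "AC", "AT"))
twice = normalize(once)
```

Here `normalize(text)` returns the same `Text` object, `once` is `ToyAllele(11, 12, "C", "T")`, and `twice == once`, which is the idempotence property. A rule registered this way is also free to return an object of a different type than its input.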
Allele Normalization
Certain insertion or deletion alleles may have ambiguous representations when using conventional sequence normalization, resulting in significant challenges when comparing such alleles.
VRS uses a “fully-justified” normalization algorithm inspired by NCBI’s Variant Overprecision Correction Algorithm [1]. Fully-justified normalization expands such ambiguous representations over the entire region of ambiguity, resulting in an unambiguous representation that may be readily compared with other alleles.
VRS RECOMMENDS that Alleles at precise locations are normalized to a fully justified form unless there is a compelling reason to do otherwise. Alleles SHOULD be normalized in order to generate Computed Identifiers.
The process for fully justifying an Allele is outlined below.
Given an Allele:
 Let reference allele sequence refer to the subsequence at the Allele’s SequenceLocation.
 Let alternate allele sequence be the sequence in the Allele’s State object.
 Let start and end initially be the start and end of the Allele’s SequenceLocation.
1. Trim sequences:
 Remove suffixes common to the reference allele sequence and alternate allele sequence, if any. Decrement end by the length of the trimmed suffix.
 Remove prefixes common to the reference allele sequence and alternate allele sequence, if any. Increment start by the length of the trimmed prefix.
2. If reference allele sequence and alternate allele sequence are both empty, the input Allele is a reference Allele. Return the input Allele unmodified.
3. If reference allele sequence and alternate allele sequence are both non-empty, the input Allele has been reduced to a substitution Allele. Construct and return a new Allele with the current start, end, and alternate allele sequence.
NOTE: The remaining cases are that exactly one of reference allele sequence or alternate allele sequence is empty. If reference allele sequence is empty, the Allele represents an insertion in the reference. If alternate allele sequence is empty, the Allele represents a deletion in the reference.
4. Determine bounds of ambiguity:
 a. Left roll: While the terminal base of all non-empty alleles is equal to the base prior to the current position, circularly permute all alleles rightward and move the current position leftward. When terminating, return left_roll, the number of steps rolled leftward.
 b. Right roll: Symmetric case of left roll, returning right_roll, the number of steps rolled rightward.
5. Fully justify the trimmed allele sequences:
 To the reference allele sequence and alternate allele sequence, prepend the left_roll bases prior to the trimmed allele position and append the right_roll bases after the trimmed allele position.
 Decrement start by left_roll and increment end by right_roll.
6. Construct and return a new Allele with the current start, end, and alternate allele sequence.
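The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `Allele` class is a hypothetical simplification (interbase `start` and `end` plus the alternate sequence in `state`), the reference sequence is passed in as a plain string rather than resolved from a SequenceLocation, and the sequence is treated as linear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allele:
    """Hypothetical simplified Allele: interbase coordinates and alternate sequence."""
    start: int
    end: int
    state: str


def common_suffix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def common_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def roll_left(sequence: str, allele: str, pos: int) -> int:
    """Number of steps the non-empty allele can be shuffled leftward from pos."""
    steps = 0
    while pos - steps > 0 and allele[-1] == sequence[pos - steps - 1]:
        allele = allele[-1] + allele[:-1]  # circularly permute rightward
        steps += 1
    return steps


def roll_right(sequence: str, allele: str, pos: int) -> int:
    """Number of steps the non-empty allele can be shuffled rightward from pos."""
    steps = 0
    while pos + steps < len(sequence) and allele[0] == sequence[pos + steps]:
        allele = allele[1:] + allele[0]  # circularly permute leftward
        steps += 1
    return steps


def normalize_allele(allele: Allele, sequence: str) -> Allele:
    start, end = allele.start, allele.end
    ref, alt = sequence[start:end], allele.state

    # Trim the common suffix, then the common prefix.
    n = common_suffix_len(ref, alt)
    ref, alt, end = ref[:len(ref) - n], alt[:len(alt) - n], end - n
    n = common_prefix_len(ref, alt)
    ref, alt, start = ref[n:], alt[n:], start + n

    if not ref and not alt:
        return allele  # reference Allele: returned unmodified
    if ref and alt:
        return Allele(start, end, alt)  # substitution

    # Exactly one sequence is non-empty: insertion (ref empty) or deletion (alt empty).
    moving = ref or alt
    left_roll = roll_left(sequence, moving, start)
    right_roll = roll_right(sequence, moving, end)

    # Fully justify: extend the alternate sequence with the flanking reference bases.
    prefix = sequence[start - left_roll:start]
    suffix = sequence[end:end + right_roll]
    return Allele(start - left_roll, end + right_roll, prefix + alt + suffix)


result = normalize_allele(Allele(4, 6, "CAGCA"), "TCAGCAGCT")
```

With the reference sequence `TCAGCAGCT`, `result` is `Allele(1, 8, "CAGCAGCAGC")`, and normalizing `result` again returns an equal object. A deletion such as `Allele(4, 7, "")` on the same sequence expands to `Allele(1, 8, "CAGC")`.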
Example, listing for each step the start and end (interbase), the allele sequences, and the equivalent representations:

Given Allele: (4,6), (“CA”, “CAGCA”)

\[TCAG \Bigl[ \frac{CA}{CAGCA} \Bigr] GCT\]

Trim sequences: (5,5), (“”, “AGC”)

\[TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT\]

4a. Roll Left: (1,1), (“”, “CAG”)

\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT \\
TCAG \Bigl[ \frac{}{CAG} \Bigr] CAGCT \\
TCA \Bigl[ \frac{}{GCA} \Bigr] GCAGCT \\
TC \Bigl[ \frac{}{AGC} \Bigr] AGCAGCT \\
T \Bigl[ \frac{}{CAG} \Bigr] CAGCAGCT \\
\Rightarrow left\_roll = 4\end{split}\]

4b. Roll Right: (8,8), (“”, “AGC”)

\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT \\
TCAGCA \Bigl[ \frac{}{GCA} \Bigr] GCT \\
TCAGCAG \Bigl[ \frac{}{CAG} \Bigr] CT \\
TCAGCAGC \Bigl[ \frac{}{AGC} \Bigr] T \\
\Rightarrow right\_roll = 3\end{split}\]

Fully justify: (1,8), (“CAGCAGC”, “CAGCAGCAGC”)

\[T \Bigl[ \frac{CAGCAGC}{CAGCAGCAGC} \Bigr] T\]

NOTE: Trimming the common suffix before the common prefix, as the trim step specifies, reduces the given Allele to (4,4) and (“”, “CAG”) rather than the equivalent (5,5) and (“”, “AGC”) shown above. Rolling from (4,4) gives left_roll = 3 and right_roll = 4 and yields the same fully-justified result at (1,8).
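The equivalence of the representations in this example can be checked by applying each one to the reference sequence and comparing the results. The sketch below assumes the reference sequence is `TCAGCAGCT`, as read off the first row of the example; `apply_allele` is a hypothetical helper, not part of VRS.

```python
def apply_allele(sequence: str, start: int, end: int, alt: str) -> str:
    """Replace the interbase span [start, end) of sequence with alt."""
    return sequence[:start] + alt + sequence[end:]


REFERENCE = "TCAGCAGCT"  # assumed from the worked example

# (start, end, alternate allele sequence) for each row of the example.
ROWS = [
    (4, 6, "CAGCA"),        # given Allele
    (5, 5, "AGC"),          # trimmed
    (1, 1, "CAG"),          # rolled fully left
    (8, 8, "AGC"),          # rolled fully right
    (1, 8, "CAGCAGCAGC"),   # fully justified
]

results = {apply_allele(REFERENCE, *row) for row in ROWS}
```

Every row produces the same sequence, so `results` contains the single value `"TCAGCAGCAGCT"`. Only the fully-justified row is independent of where within the repeat the insertion happened to be placed.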

References
[1]  Holmes, J. B., Moyer, E., Phan, L., Maglott, D. & Kattman, B. L. SPDI: Data Model for Variants and Applications at NCBI. Bioinformatics (2020 March 15). doi:10.1093/bioinformatics/btz856 