Normalization
Certain insertion or deletion alleles may be represented ambiguously when using conventional sequence normalization, resulting in significant challenges when comparing such alleles.
The VRSpec describes a “fully-justified” normalization algorithm inspired by NCBI’s Variant Overprecision Correction Algorithm [1]. Fully-justified normalization expands such ambiguous representations over the entire region of ambiguity, resulting in an unambiguous representation that may be readily compared with other alleles.
The VRSpec RECOMMENDS that Alleles at precise locations are normalized to a fully justified form unless there is a compelling reason to do otherwise.
The process for fully justifying two alleles (reference sequence and alternate sequence) at an interval is outlined below.
 Trim sequences:
 Remove suffixes common to all alleles, if any. Decrement the interval end position by the length of the trimmed suffix.
 Remove prefixes common to all alleles, if any. Increment the interval start position by the length of the trimmed prefix.
 If neither allele is empty after trimming, the alleles share no common prefix or suffix. Normalization is not applicable, and the trimmed alleles are returned.
 Determine bounds of ambiguity:
 Left roll: While the terminal base of all nonempty alleles is equal to the base prior to the current position, circularly permute all alleles rightward and move the current position leftward. When terminating, return left_roll, the number of steps rolled leftward.
 Right roll: Symmetric case of left roll, returning right_roll, the number of steps rolled rightward.
 Update position and alleles:
 To each trimmed allele, prepend the left_roll bases prior to the trimmed allele position and append the right_roll bases after the trimmed allele position.
 Expand the trimmed allele position by decrementing the start by left_roll and incrementing the end by right_roll.
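
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `fully_justify` and its parameters are hypothetical, the reference sequence is assumed to be available as an in-memory string, and interbase coordinates are assumed to map directly onto Python slice indices. The handling of two identical alleles (both empty after trimming) is also an assumption, since the text does not cover that case.

```python
def fully_justify(sequence, start, end, alleles):
    """Sketch of fully-justified normalization (hypothetical helper).

    sequence: reference sequence as a string
    start, end: interbase interval, so sequence[start:end] is the reference span
    alleles: (reference_allele, alternate_allele)
    Returns ((start, end), (reference_allele, alternate_allele)).
    """
    alleles = list(alleles)

    # Trim sequences: remove the common suffix, then the common prefix.
    suffix = 0
    while suffix < min(map(len, alleles)) and len({a[-1 - suffix] for a in alleles}) == 1:
        suffix += 1
    alleles = [a[:len(a) - suffix] for a in alleles]
    end -= suffix

    prefix = 0
    while prefix < min(map(len, alleles)) and len({a[prefix] for a in alleles}) == 1:
        prefix += 1
    alleles = [a[prefix:] for a in alleles]
    start += prefix

    # Normalization is not applicable when neither allele is empty.
    # Assumption: identical alleles (both empty after trimming) are returned as-is.
    if all(alleles) or not any(alleles):
        return (start, end), tuple(alleles)

    # Determine bounds of ambiguity using the nonempty allele(s).
    rolled = [a for a in alleles if a]
    left_roll = 0
    while start - left_roll > 0 and all(
        a[-1] == sequence[start - left_roll - 1] for a in rolled
    ):
        rolled = [a[-1] + a[:-1] for a in rolled]  # circular permutation rightward
        left_roll += 1

    rolled = [a for a in alleles if a]
    right_roll = 0
    while end + right_roll < len(sequence) and all(
        a[0] == sequence[end + right_roll] for a in rolled
    ):
        rolled = [a[1:] + a[0] for a in rolled]  # circular permutation leftward
        right_roll += 1

    # Update position and alleles: pad each trimmed allele with the rolled-over bases.
    left_bases = sequence[start - left_roll:start]
    right_bases = sequence[end:end + right_roll]
    alleles = tuple(left_bases + a + right_bases for a in alleles)
    return (start - left_roll, end + right_roll), alleles


print(fully_justify("TCAGCAGCT", 4, 6, ("CA", "CAGCA")))
# ((1, 8), ('CAGCAGC', 'CAGCAGCAGC'))
```

Because this sketch trims the suffix before the prefix, its intermediate trimmed allele for this input is (4,4) with (“”, “CAG”), giving left_roll = 3 and right_roll = 4. That allele lies in the same region of ambiguity as (5,5) with (“”, “AGC”), and the fully justified result is the same.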
Worked example. Each entry gives the step, the interbase position and alleles at that step, and the resulting allele set. All alleles in the resulting allele sets produce the same empirical sequence change. The allele marked ① is the same trimmed allele in every entry.

Input: (4,6), (“CA”, “CAGCA”)

\[TCAG \Bigl[ \frac{CA}{CAGCA} \Bigr] GCT\]

Trim sequences: (5,5), (“”, “AGC”)

\[TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\]

Left roll: (1,1), (“”, “CAG”)

\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\\
TCAG \Bigl[ \frac{}{CAG} \Bigr] CAGCT \\
TCA \Bigl[ \frac{}{GCA} \Bigr] GCAGCT \\
TC \Bigl[ \frac{}{AGC} \Bigr] AGCAGCT \\
T \Bigl[ \frac{}{CAG} \Bigr] CAGCAGCT \\
\Rightarrow left\_roll = 4\end{split}\]

Right roll: (8,8), (“”, “AGC”)

\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\\
TCAGCA \Bigl[ \frac{}{GCA} \Bigr] GCT \\
TCAGCAG \Bigl[ \frac{}{CAG} \Bigr] CT \\
TCAGCAGC \Bigl[ \frac{}{AGC} \Bigr] T \\
\Rightarrow right\_roll = 3\end{split}\]

Update position and alleles: (1,8), (“CAGCAGC”, “CAGCAGCAGC”)

\[\begin{split}TCAGC \Bigl[ \frac{}{AGC} \Bigr] AGCT ①\\
T \Bigl[ \frac{CAGCAGC}{CAGCAGCAGC} \Bigr] T\end{split}\]
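
The claim that every allele in the worked example above yields the same sequence change can be checked directly. The following is a small sketch, assuming the reference sequence TCAGCAGCT from the example and treating interbase coordinates as Python slice indices; `apply_allele` is a hypothetical helper, not part of the VRSpec.

```python
def apply_allele(reference, start, end, alternate):
    """Replace reference[start:end] (interbase coordinates) with the alternate allele."""
    return reference[:start] + alternate + reference[end:]


reference = "TCAGCAGCT"

# (start, end, alternate allele) for each representation in the worked example.
candidates = [
    (4, 6, "CAGCA"),       # input
    (5, 5, "AGC"),         # trimmed allele
    (4, 4, "CAG"),         # left roll steps
    (3, 3, "GCA"),
    (2, 2, "AGC"),
    (1, 1, "CAG"),
    (6, 6, "GCA"),         # right roll steps
    (7, 7, "CAG"),
    (8, 8, "AGC"),
    (1, 8, "CAGCAGCAGC"),  # fully justified allele
]

results = {apply_allele(reference, s, e, alt) for s, e, alt in candidates}
print(results)
# {'TCAGCAGCAGCT'}
```

All ten representations collapse to a single resulting sequence, which is why only the fully justified form over (1,8) is unambiguous for comparison.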

References
[1]  Holmes, J. B., Moyer, E., Phan, L., Maglott, D. & Kattman, B. L. SPDI: Data Model for Variants and Applications at NCBI. bioRxiv 537449 (2019). doi:10.1101/537449 