NX4 – A data visualization tool for massive analysis of viral sequences

2019 | open source | data visualization | product design | genomics

The Sabeti Laboratory at the Broad Institute of MIT and Harvard needed an efficient way to analyze hundreds of viral genetic sequences extracted from infected patients. We designed a tool with an innovative visualization approach that addresses the interaction and accessibility shortcomings of existing software.

https://nx4.io/

This project has a peer-reviewed publication!

What is NX4?

NX4 is an open-source, web-based visualization tool for exploring aligned viral sequences. In other words, NX4 lets researchers visually analyze thousands of pieces of genetic information without scrolling through an endless sea of columns and rows of data. You can read our publication in the journal Bioinformatics (Oxford University Press).

We tested the tool with datasets of over 1.8 million data points (101 viral samples, each a sequence 18,000 base pairs long: 101 × 18,000 = 1,818,000).

How to read the image above?

Genetic sequences are composed of letters (A, C, T, G, etc.) that have a specific order or position. In a traditional multiple sequence alignment (MSA) visualization, each row represents a single genetic sequence, and each column represents a numbered position. Assigning a color to each letter makes it easier to spot changes (mutations) in a group of sequences.
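
As a minimal sketch of the row-and-column layout described above, the snippet below stores a toy alignment as a list of strings and reports the positions where the sequences disagree. The sequences are made up for illustration and the function name is hypothetical; this is not code from NX4.

```python
# Toy alignment (invented data): each string is one sequence (a row),
# and each index within the string is one position (a column).
alignment = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",
    "ACGTAA",
]


def variable_positions(rows):
    """Return the 1-based positions whose column holds more than one distinct letter."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned sequences must all have the same length")
    # zip(*rows) walks the alignment column by column.
    return [i + 1 for i, column in enumerate(zip(*rows)) if len(set(column)) > 1]


print(variable_positions(alignment))  # prints [3, 6]
```

Even in this tiny example, only two of the six columns carry any change, which is why a view that shows every row and column spends most of its space on identical letters.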

The usability problem with traditional visualizations

Traditional viewers for multiple sequence alignments (MSAs) have relied on the same visualization for decades: a very long and tall matrix where every row represents a genetic sequence and every column a position in the sequence. While some existing viewers are extremely performant, they sacrifice accessibility (because they rely on many colors) and good user interaction (because most of the data is hidden from view at any given time).

NX4 improves on traditional MSA viewers by combining statistical calculations that reduce the complexity of the data with visualization techniques like focus+context.

Conceptual Work

Through an iterative process, I started to develop different notational systems (a type of visual design language) that could help us abstract and “compress” the information so that the visualization wouldn’t have to show each row and column. Instead, I wanted to focus on the mutations and how to display them in a clear and salient way.

Grounding concepts in science and math

In collaboration with a math expert, Andrés Colubri, we chose Shannon entropy as a measure of variability in the genomic samples. A peak in the line chart indicates high variability at that position, and therefore a region of the genome with the potential for many mutations.
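
To illustrate the idea, here is a rough sketch of per-position Shannon entropy, H = Σ p · log2(1/p), where p is the share of sequences carrying each letter at a position. The base-2 logarithm, the treatment of every symbol (including gaps) as an ordinary letter, and the function names are assumptions made for this example; the published tool may compute it differently.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the letters observed in one alignment column.

    Assumption: every symbol in the column, including gaps, counts as a letter.
    """
    counts = Counter(column)
    total = sum(counts.values())
    # p * log2(1/p) is the same as -p * log2(p).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_profile(rows):
    """Entropy for every position of an alignment given as equal-length strings."""
    return [column_entropy(column) for column in zip(*rows)]


# Invented data: position 1 is identical everywhere, position 2 is split evenly.
print(entropy_profile(["AA", "AC"]))  # prints [0.0, 1.0]
```

A position where all samples agree scores 0, and the score grows as the letters become more evenly mixed, so peaks in the resulting series mark the most variable positions.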

The resulting tool combined a line chart, built with a focus+context approach, that plotted the entropy calculated for every position in the genetic sequence, and a custom heat map that showed the frequency of each letter (A, C, T, G, or missing values).
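
The data behind such a heat map can be sketched as a per-position frequency table. In the example below, the "-" code for missing values, the choice to fold any other symbol (such as N) into that category, and the helper names are all assumptions for illustration rather than details taken from NX4.

```python
from collections import Counter

# "-" stands in for missing values (assumed encoding for this sketch).
SYMBOLS = ("A", "C", "T", "G", "-")


def letter_frequencies(rows):
    """For each position, the fraction of sequences carrying each symbol."""
    table = []
    for column in zip(*rows):
        # Assumption: anything outside A, C, T, G is counted as missing.
        counts = Counter(s if s in SYMBOLS else "-" for s in column)
        table.append({s: counts[s] / len(column) for s in SYMBOLS})
    return table


def focus_window(values, start, end):
    """Slice a per-position series to the 1-based, inclusive range shown in the focus view."""
    return values[start - 1:end]


# Invented data: three short sequences of three positions each.
frequencies = letter_frequencies(["ACG", "AC-", "ATN"])
print(focus_window(frequencies, 2, 3))
```

Here the full table plays the role of the context, while `focus_window` picks out the range a reader has zoomed into; each dictionary corresponds to one heat map column.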

Final solution

The published and released tool provided improvements in color accessibility, user interaction, and discoverability of data compared to traditional visualizations.

Credits

Data Visualization and Product Design: Antonio Solano-Román | Statistical Analysis and Transformations: Andrés Colubri | Web Development: Carlos Cruz | Advisor: Dietmar Offenhuber. This project was done in collaboration with the Sabeti Laboratory at the Broad Institute of MIT and Harvard.