Currently at Hewlett Packard Laboratories, Palo Alto, California 94304.
Several studies have shown how sets of single-nucleotide polymorphisms (SNPs) can help to classify subjects on the basis of their continental origins, with applications to case–control studies and population genetics. However, most of these studies use dimensionality-reduction methods, such as principal component analysis, or clustering methods that result in unipartite (either subjects or SNPs) representations of the data. Such analyses conceal important bipartite relationships, such as how subject and SNP clusters relate to each other, and the genotypes that determine their cluster memberships.
To overcome the limitations of current methods of analyzing SNP data, the authors used three bipartite analytical representations (bipartite network, heat map with dendrograms, and Circos ideogram) that enable the simultaneous visualization and analysis of subjects, SNPs, and subject attributes.
The results demonstrate (1) novel insights into SNP data that are difficult to derive from purely unipartite views of the data, (2) the strengths and limitations of each method, and the role that each plays in revealing novel insights, and (3) implications for how the methods can be used for the analysis of SNPs in genomic studies associated with disease.
The results suggest that bipartite representations can reveal new patterns in SNP data compared with existing unipartite representations. However, the novel insights require multiple representations to discover, verify, and comprehend the complex relationships. The results therefore motivate the need for a complementary visual analytical framework that guides the use of multiple bipartite representations to analyze complex relationships in SNP data.
Because more than 99% of the 3 billion base pairs in the human genome are identical across all humans,
To the best of our knowledge, the methods used in the above studies rely on a unipartite view (either SNPs or subjects) of the data. For example, studies that identify AIM SNPs typically use dimensionality-reduction methods, such as principal component analysis (PCA), or clustering methods, such as
Several detailed reviews of methods used to analyze SNP data exist.
Most analyses of SNP data use the univariate χ² test to identify which SNPs differ most significantly across the populations being studied (eg, subjects from different ancestries, or diseased versus healthy populations). For each SNP, this method compares the proportion of genotypes between the two or more groups being studied, and outputs the significance for that SNP. Because of the large number of SNPs being tested, the results are adjusted for multiple testing using methods such as the Bonferroni correction. Researchers then use the most significant SNPs for further analyses. Although this method is powerful, it is limited because it treats each SNP independently, when SNPs could in fact be working in groups.
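The per-SNP test described above can be sketched as follows. This is a minimal illustration and not the procedure of any particular study: the function names and the genotype counts are hypothetical, and both groups are assumed to be non-empty. The p value uses the closed-form χ² survival function for 2 degrees of freedom (two groups by three genotypes), and falls back to 1 degree of freedom when one genotype is absent from both groups.

```python
import math

def chi_square_genotype_test(counts_a, counts_b):
    """Pearson chi-square test on a 2 x 3 table of genotype counts.

    counts_a and counts_b are [n_genotype0, n_genotype1, n_genotype2]
    for two groups (hypothetical input format; both groups non-empty).
    """
    table = [counts_a, counts_b]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    # Degrees of freedom shrink if a genotype is absent from both groups.
    nonzero_cols = sum(1 for c in col_totals if c > 0)
    df = (len(table) - 1) * (nonzero_cols - 1)
    if df == 2:
        p_value = math.exp(-statistic / 2.0)            # exact for df = 2
    elif df == 1:
        p_value = math.erfc(math.sqrt(statistic / 2.0))  # exact for df = 1
    else:
        p_value = 1.0                                    # no variation to test
    return statistic, p_value

def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# Hypothetical genotype counts for two SNPs in two populations.
snp_counts = {
    "snp_1": ([50, 8, 2], [5, 15, 40]),
    "snp_2": ([20, 30, 10], [22, 28, 10]),
}
raw = {name: chi_square_genotype_test(a, b)[1] for name, (a, b) in snp_counts.items()}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
print(adjusted)
```

Because each SNP is tested on its own table, the sketch also makes the stated limitation visible: nothing in the computation relates one SNP to another.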
Multivariate methods that are applied to SNP data can be broadly classified into two categories: (1) distance-based, and (2) model-based.
The distance-based methods typically consist of two steps.
In contrast to the above distance-based methods, model-based algorithms such as STRUCTURE
Although the above methods are powerful in separating subjects on the basis of continental origins and disease subtypes, or in identifying the important SNPs, they are based on a unipartite view of the data: they can be used to analyze either SNP clusters based on subjects or subject clusters based on SNPs. For example, they cannot directly reveal which clusters of subjects are related to which clusters of SNPs, nor can they reveal the nature of their membership based on the proportion of genotypes. To address these limitations, we explored the use of bipartite visual analytical representations to analyze SNP data. Such representations enable the simultaneous view of (1) subjects and SNPs, and (2) the type and frequency of genotype associations between subjects and SNPs. We therefore posed the research question: what is the bipartite relationship between subjects (from different continental origins) and SNPs (known to code for ancestry information)?
To address the research question, we used SNP data from the phase 2 HapMap (release 23) database.
A SNP typically has only two possible bases (eg, A or G), one of which is less common in the population (‘minor allele’), and the other is more common (‘major allele’). Because humans carry two copies of the genome, SNPs that have bases A and G can have three combinations across the two copies of the genome: AA, AG and GG. These three combinations are referred to as the ‘genotypes’ of the SNP. For each SNP, we coded the three genotypes as 0, 1, or 2 denoting whether a subject was a ‘major homozygote’ (having two copies of the major allele), a ‘heterozygote’ (having one copy of each allele), or a ‘minor homozygote’ (having two copies of the minor allele), respectively. The minor allele of a SNP was determined to be the one that had the lower frequency in the data. This encoding is referred to as the ‘additive genetic model’,
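As an illustration of the 0, 1, 2 coding described above, the following sketch maps genotype strings for one biallelic SNP to the number of copies of the minor allele. The function name and the input format (two-letter strings such as 'AG') are assumptions made for the example; the minor allele is taken to be the less frequent allele in the supplied data, as stated in the text, and ties are broken arbitrarily.

```python
from collections import Counter

def encode_additive(genotypes):
    """Code each genotype as its number of minor-allele copies (0, 1, or 2).

    genotypes: list of two-letter strings for ONE SNP, e.g. ['AA', 'AG', 'GG'].
    """
    allele_counts = Counter(allele for g in genotypes for allele in g)
    if len(allele_counts) != 2:
        raise ValueError("expected exactly two alleles for a biallelic SNP")
    # The minor allele is the one with the lower frequency in these data.
    minor_allele = min(allele_counts, key=allele_counts.get)
    return [g.count(minor_allele) for g in genotypes]

# Hypothetical genotypes for six subjects: A occurs 8 times, G occurs 4 times.
codes = encode_additive(["AA", "AG", "GG", "AA", "GA", "AA"])
print(codes)  # [0, 1, 2, 0, 1, 0]
```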
Our analysis consisted of two steps: (1) ‘exploratory visual analysis’ through the use of three bipartite visual representations chosen to identify emergent bipartite relationships between subjects and SNPs; (2) ‘quantitative analysis’ through the use of methods suggested by the emergent visual patterns. This two-step method was motivated by our earlier studies
We selected bipartite networks as our primary method for analyzing the relationship between subjects and SNPs because they (1) are based on a simple but expressive graph-based visual representation to display both subjects and SNPs simultaneously, and (2) can be interactively manipulated to explore emergent patterns, which can be quantitatively verified through a wide range of graph-based and other quantitative methods. However, as described below, because the bipartite network representation was not adequate for our analysis, we used two other complementary bipartite visual representations that are well known in the bioinformatics community, but not often used in combination.
Networks provide a powerful approach for representing and analyzing complex relationships. They are increasingly being used to analyze a wide range of molecular measurements related to gene regulation,
A sample bipartite network showing 15 subjects (black and white nodes) and eight SNPs (colored nodes), and their connecting edges representing genotypes 0 (white), 1 (gray), and 2 (black). The nodes are sized on the basis of the sum of the weights of their connecting edges, and laid out using the Kamada–Kawai algorithm, which helps to reveal the relationship between the nodes and the nature of cluster memberships. This figure is produced in colour in the online journal-please visit the website to view the colour figure.
Edge weights in the network were used to represent the genotype (0, 1, or 2). Node diameter was used to represent the sum of weights on the edges connected to that node. This enabled rapid visual inspection to determine, for example, which subjects have overall high aggregate genotype values, and how such subjects relate to the rest of the network.
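The node-sizing rule above amounts to a weighted degree on each side of the bipartite network. A minimal sketch, assuming the network is supplied as hypothetical (subject, SNP, genotype) triples:

```python
from collections import defaultdict

def weighted_degrees(edges):
    """Sum the genotype weights (0, 1, or 2) on the edges of every node.

    edges: iterable of (subject, snp, genotype) triples (assumed format).
    Returns two dicts: subject -> size and snp -> size.
    """
    subject_size = defaultdict(int)
    snp_size = defaultdict(int)
    for subject, snp, genotype in edges:
        subject_size[subject] += genotype
        snp_size[snp] += genotype
    return dict(subject_size), dict(snp_size)

# Hypothetical two-subject, two-SNP network.
edges = [
    ("s1", "snp_a", 2), ("s1", "snp_b", 0),
    ("s2", "snp_a", 1), ("s2", "snp_b", 2),
]
subject_size, snp_size = weighted_degrees(edges)
print(subject_size)  # {'s1': 2, 's2': 3}
print(snp_size)      # {'snp_a': 3, 'snp_b': 2}
```

A subject with a large value carries many minor alleles across the selected SNPs, which is the aggregate that the node diameter conveys visually.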
Global patterns in the network were visualized and analyzed using the Kamada–Kawai
Network analyses provide two advantages for analyzing complex relationships. (1) They do not require a priori assumptions about the relationship of nodes within the data, such as the hierarchical assumption of hierarchical clustering or disjoint clusters of
Although networks provide a powerful method for visualizing data, the edges can often get very dense, making it difficult to analyze the edges and their weights connected to specific nodes. We therefore used a second bipartite representation called a bipartite heat map.
Although heat maps enable inspection of subjects and their relationship to each SNP, they cannot simultaneously represent attributes of the entities, such as the sex of subjects, nor do they allow interactive exploration of the relationship between subsets of the data, such as subjects who have high admixture (resulting from mating between subjects from reproductively isolated ancestral populations
The insights derived from the three bipartite visualizations were analyzed using three quantitative methods, which were selected based on their appropriateness to the emergent patterns in the network.
Because the network analysis suggested the existence of disjoint clusters, we used agglomerative hierarchical clustering to verify the number of clusters and to identify the boundaries of the clusters. The clustering was performed using Manhattan distance (to handle the 0, 1, and 2 edge weights representing the genotype) with the Ward linkage function.
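The clustering step can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the software used in the study: the function names and the toy genotype profiles are hypothetical, and merges follow the Lance–Williams form of Ward's update applied to Manhattan dissimilarities. Ward's criterion is formally defined for Euclidean distances, so combining it with Manhattan distance follows the description in the text but should be read as a heuristic.

```python
import math

def manhattan(u, v):
    """Manhattan (city-block) distance between two genotype vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def ward_clusters(profiles, n_clusters):
    """Agglomerative clustering using Ward's Lance-Williams update.

    profiles: list of genotype vectors (values 0, 1, 2), one per subject.
    Returns the clusters as sorted lists of row indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = {i: [i] for i in range(len(profiles))}
    dist = {
        (i, j): float(manhattan(profiles[i], profiles[j]))
        for i in clusters for j in clusters if i < j
    }
    next_id = len(profiles)
    while len(clusters) > n_clusters:
        # Merge the closest pair of current clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        n_a, n_b = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        new_dist = {}
        for k, members in clusters.items():
            n_k = len(members)
            d_ka = dist[(min(k, a), max(k, a))]
            d_kb = dist[(min(k, b), max(k, b))]
            # Ward update of the distance from cluster k to the merged cluster.
            new_dist[(k, next_id)] = math.sqrt(
                ((n_a + n_k) * d_ka ** 2 + (n_b + n_k) * d_kb ** 2 - n_k * d_ab ** 2)
                / (n_a + n_b + n_k)
            )
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical profiles: subjects 0, 1, 4 resemble each other, as do 2 and 3.
profiles = [[0, 0, 2, 2], [0, 1, 2, 2], [2, 2, 0, 0], [2, 2, 0, 1], [0, 0, 2, 1]]
print(ward_clusters(profiles, 2))  # [[0, 1, 4], [2, 3]]
```

Applying the same function to the transposed matrix would cluster SNPs by their subject profiles rather than subjects by their SNP profiles.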
To identify subjects with high admixture of SNPs from the two ancestries, we calculated the betweenness centrality
To test the statistical significance of clusteredness in the network, we compared the variance, skewness, and kurtosis of the dissimilarities in the HapMap data with those from 1000 random permutations of these data. For each network permutation, we preserved the size of the network and the edge weight distribution of each SNP when analyzing the SNP dendrogram, and the edge weight distribution for each subject when analyzing the subject dendrogram. Significant breaks in the HapMap's subject or SNP dendrograms would result in a significantly larger variance, skewness, and kurtosis of the dissimilarity measures, compared with the same measures generated from the random networks.
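A permutation test of this kind can be sketched as follows, under several assumptions that go beyond the text: dissimilarities are taken to be pairwise Manhattan distances between rows, kurtosis is the non-excess ratio of the fourth central moment to the squared variance, and the p value is a one-sided empirical proportion (the study reports two-tailed tests). Shuffling values within each row preserves that row's genotype distribution, so rows represent subjects when testing the subject dendrogram, and the matrix would be transposed when testing the SNP dendrogram. All names and data are hypothetical.

```python
import random
from itertools import combinations

def moments(values):
    """Return (variance, skewness, kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0, 0.0, 0.0
    return m2, m3 / m2 ** 1.5, m4 / m2 ** 2

def dissimilarities(rows):
    """Pairwise Manhattan distances between all rows (assumed measure)."""
    return [sum(abs(a - b) for a, b in zip(u, v)) for u, v in combinations(rows, 2)]

def permutation_test(rows, n_permutations=1000, seed=0):
    """Compare observed moments with those of row-wise shuffled matrices."""
    rng = random.Random(seed)
    observed = moments(dissimilarities(rows))
    exceed = [0, 0, 0]
    for _ in range(n_permutations):
        # Shuffle within each row: keeps every row's 0/1/2 distribution.
        shuffled = [rng.sample(row, len(row)) for row in rows]
        permuted = moments(dissimilarities(shuffled))
        for i in range(3):
            if permuted[i] >= observed[i]:
                exceed[i] += 1
    # One-sided empirical p values with the usual +1 correction.
    p_values = [(count + 1) / (n_permutations + 1) for count in exceed]
    return observed, p_values

# Hypothetical matrix with two clearly separated groups of rows.
rows = [[0, 0, 0, 2, 2, 2]] * 3 + [[2, 2, 2, 0, 0, 0]] * 3
observed, p_values = permutation_test(rows, n_permutations=200)
print(observed, p_values)
```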
We also computed the modularity of the bipartite network. Modularity
The bipartite visualizations and quantitative analysis revealed distinct SNP and subject clusters, in addition to a subset of subjects that represents an admixed population. For each outcome, we describe the results of the visual analysis, followed by their quantitative verification.
The bipartite network visualization of 120 subjects and 78 SNPs revealed a complex but understandable clustered pattern. As shown in
(A) The bipartite network showing the subjects (black and white nodes), ancestry informative marker (AIM) single-nucleotide polymorphisms (SNPs) (colored nodes), and their connecting edges representing genotypes 0 (white), 1 (gray), and 2 (black). (B) The SNP dendrogram was used to determine the boundaries of the SNP clusters, and a similar dendrogram determined the boundaries of the subject clusters.
To quantitatively verify the above visual result, we used agglomerative hierarchical clustering. As shown in
To generate a network based on a parsimonious subset of the SNPs, and to examine the admixture based on the dominant SNP clusters (blue and pink), we removed the center SNP cluster (red nodes) from the network, and re-laid the network using the Kamada–Kawai algorithm.
(A) The bipartite network without the non-discriminating single-nucleotide polymorphisms (SNPs); (B) the associated heat maps with dendrograms, which were used to determine the boundaries of the SNP and subject clusters.
The clusteredness of the subjects in the HapMap data was statistically significant when compared with 1000 random networks based on variance of the dissimilarities (HapMap =74822.5, random mean =1023.6, p<0.001, two-tailed test), skewness of the distribution of dissimilarities (HapMap =10.56, random mean =4.3, p<0.001, two-tailed test), and kurtosis of the distribution of dissimilarities (HapMap =114.01, random mean =24.28, p<0.001, two-tailed test).
To compute modularity, we generated unweighted bipartite networks representing the dominant and recessive genetic models as described in the methods section. For the recessive network shown in
To analyze the admixed subjects who are located in the center of the network, we used the betweenness centrality measure. Because genotype 2 appeared to be the main determinant of the clusters, we used the recessive model to conduct this analysis. As shown by the enclosed dotted line in
(A) The bipartite network with nodes sized based on the betweenness centrality measure; (B) the Circos ideogram showing the relationship of the admixed Utah Americans to the SNPs of both clusters (Utah American and Yoruba African SNP clusters), and the sex of the subjects (outer ring). (The betweenness centrality measure for each node has been multiplied by 10 000 to enable Pajek to display it within its maximum of two decimal places.)
The betweenness centrality also identified SNPs that have strong connections to the admixed subjects, and therefore are implicated in the admixture. However, owing to the density of black edges in the network, it was difficult to determine which SNPs from each cluster were connected to subjects from the opposite cluster. Furthermore, the admixed subjects were scattered across the heat map (rows containing dark cells representing genotype 2 in the upper right and lower left areas of
To address this limitation, we used the Circos representation for a closer inspection of this subset of subjects across all the SNPs.
Our goal was to explore the role of bipartite visual analytical representations in the analysis of SNP data. Although the results matched many of the results from earlier AIMs studies,
Furthermore, this smaller set also enabled us to closely examine the admixed subjects, and which SNPs were involved in that admixture. The results therefore provided a richer understanding of the association between the SNP and subject clusters, in addition to the nature of the cluster memberships. These in turn enabled us to understand the complementary role that each bipartite representation played in revealing the associations as discussed below.
The bipartite network of 78 SNPs in
In addition to the identification of the cluster boundaries, and the relationship between the clusters, the bipartite representations also revealed the nature of the cluster memberships. Unlike unipartite representations used by methods such as PCA,
Similar to the inadequacy of any single representation to enable the comprehension of the clusters and their relationship to each other, networks and heat maps were also unable to provide a more complete view of the admixed population. While the network helped to identify the existence of the admixed population, the density of the edges did not allow a direct inspection of the nature of their admixture, and which SNPs in both clusters were responsible for that admixture. Furthermore, these admixed subjects were spread out in the heat map, as their discovery is based on a network-based relationship which is not the basis of the clustering algorithm. In contrast, the Circos representation enabled the selection of edges on the basis of the subject cluster to which they were connected, which helped to quickly identify which nodes were, or were not, implicated in the admixture. Therefore, although the Circos representation is not designed to identify clusters, it enabled the inspection of the admixture in a much more effective way compared with networks and heat maps.
The results have methodological and theoretical implications. From a methodological perspective, the bipartite representations intuitively show a researcher studying the data from a case–control study, not only which subjects have high admixture, but also the reason for that admixture based on the type and nature of SNP cluster membership. For example, if SNPs are the focus of the study, then the bipartite representation can reveal important information for making critical decisions to prevent confounding experimental results. Furthermore, when studying SNP data beyond AIMs, researchers can use the identity of the SNP membership to rapidly derive data-driven hypotheses for disease causation. For example, we used the Genetic Association Database
From a theoretical perspective, we have demonstrated that the network representation enabled us to rapidly explore the effect of different genetic models (eg, recessive, dominant) on the SNP and subject clustering, and how the emergent patterns were detected and quantitatively verified through network measures such as modularity and betweenness. We have also elucidated the limitations of networks, and how to overcome them through the use of multiple bipartite visual representations. The results show that each representation provides different affordances, and therefore plays the role of enabling discovery, confirmation, explanation, and inspection for different tasks. This understanding has inspired us to explore the development of a complementary visual analytical framework, which could explain and guide the use of multiple visual analytical representations for rapidly enabling discoveries in complex SNP data.
Although we have focused on separately identifying SNP and subject clusters to understand how they are related, biclustering methods are designed to identify clusters that allow membership of both types of node (eg, SNPs and subjects). Indeed, our use of biclustering
There are three main limitations to our study. (1) Because it was designed as a proof-of-concept for the application of multiple visual analytical representations to comprehend the relationship between subjects and SNPs, we focused on the use of existing visual analytical methods that are well known in the bioinformatics community. However, there exist several other visual analytical representations, such as TreeMap
Although there exist powerful methods for analyzing SNP data, to the best of our knowledge they rely on unipartite representations of the data. Here we explored the use of three bipartite visual analytical representations and associated quantitative methods to enable a richer understanding of the relationships in SNP–subject data. The results suggest that bipartite representations of AIM SNPs data can provide not only an understanding of the SNP and subject clusters based on different models, but also how the clusters are related to each other, and the nature of the membership of the subjects to different SNP clusters.
Although we have demonstrated the value of bipartite representations in only one SNP dataset, our ongoing research suggests that the approach is more general. For example, we have begun to use the same approach to analyze a dataset of SNPs related to Alzheimer's disease. The results are revealing complex patterns of bipartite clustering, which have the potential to lead to a deeper understanding of the underlying genetics in Alzheimer's disease. Furthermore, we are investigating how to extend bipartite modularity to handle weighted edges, which will enable us to additionally analyze SNP–subject bipartite networks with all three genotypes simultaneously.
Finally, we believe we have only scratched the surface in understanding the complementary role of multiple bipartite visual analytic representations. While the development of new methods holds a high premium in the informatics field, we believe that there is much to be understood in how to strategically combine existing visual analytical methods to reveal new insights in a domain. Accordingly, in our future research, we hope to develop a comprehensive framework that integrates current methods with bipartite visual analytical representations, with the goal of helping researchers to rapidly identify complex SNP-related phenomena and unravel the mysteries related to the genetic causes of complex diseases.
We thank G Vallabha, V McMicken, M Abbas, J Tupa, and S Trevino III for their contributions.
The Kamada–Kawai layout algorithm is approximate because it does not guarantee a globally optimal layout. The method is therefore used to explore the data using different starting conditions, and the observed topology is verified using appropriate quantitative methods. Layouts generated using the Fruchterman–Reingold algorithm produce similar topologies to Kamada–Kawai, but with different node positions because the two algorithms optimize different layout criteria.