Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees
Evolutionary relationships are represented by phylogenetic trees, and a phylogenetic analysis of gene sequences typically produces a collection of these trees, one for each gene in the analysis. Analysis of samples of trees is difficult due to the multi-dimensionality of the space of possible trees. In Euclidean spaces, principal component analysis is a popular method of reducing high-dimensional data to a low-dimensional representation that preserves much of the sample’s structure. However, the space of all phylogenetic trees on a fixed set of species does not form a Euclidean vector space, and methods adapted to tree space are needed. Previous work introduced the notion of a principal geodesic in this space, analogous to the first principal component. Here we propose a geometric object for tree space similar to the kth principal component in Euclidean space: the locus of the weighted Fréchet mean of k + 1 vertex trees when the weights vary over the k-simplex. We establish some basic properties of these objects, in particular showing that they have dimension k, and propose algorithms for projection onto these surfaces and for finding the principal locus associated with a sample of trees. Simulation studies demonstrate that these algorithms perform well, and analyses of two datasets, containing Apicomplexa and African coelacanth genomes respectively, reveal important structure from the second principal components.
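The object proposed in the abstract can be written out explicitly. The following is a sketch only: the symbols $\mathcal{T}$ (tree space), $d$ (its geodesic metric), $v_0, \dots, v_k$ (the vertex trees), $\mu(w)$ and $\Pi$ are notation assumed here for illustration and may differ from the notation in the published article.

```latex
% Assumed notation: (\mathcal{T}, d) is tree space with a geodesic metric,
% v_0, ..., v_k are the k + 1 vertex trees, and w is a weight vector.
\[
  \Delta^{k} = \Bigl\{ w \in \mathbb{R}^{k+1} : w_i \ge 0,\ \sum_{i=0}^{k} w_i = 1 \Bigr\}
\]
% Weighted Frechet mean: the minimizer of the weighted sum of squared distances.
\[
  \mu(w) = \operatorname*{arg\,min}_{x \in \mathcal{T}} \sum_{i=0}^{k} w_i \, d(x, v_i)^{2}
\]
% Locus of the weighted Frechet mean as the weights range over the k-simplex.
\[
  \Pi(v_0, \dots, v_k) = \bigl\{ \mu(w) : w \in \Delta^{k} \bigr\}
\]
% Euclidean special case: with d(x, y) = \|x - y\|, the minimizer is the
% weighted average, so the locus is the convex hull of the vertices.
\[
  \mu(w) = \sum_{i=0}^{k} w_i \, v_i
\]
```

In the Euclidean case the locus is a k-dimensional simplex whenever the vertices are affinely independent, which is the sense in which the locus in tree space plays a role similar to that of the kth principal component.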
The article of record as published may be found at http://dx.doi.org/10.1093/biomet/asx047
Rights: This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Showing items related by title, author, creator and subject.
Tang, Xiaoxian; Wang, Houjie; Yoshida, Ruriko (ArXiv, 2020); Most data in genome-wide phylogenetic analysis (phylogenomics) is essentially multidimensional, posing a major challenge to human comprehension and computational analysis. Also, we cannot directly apply statistical learning ...
Yoshida, Ruriko; Zhang, Leon; Zhang, Xu (ArXiv, 2017-10-15); Principal component analysis is a widely-used method for the dimensionality reduction of a given data set in a high-dimensional Euclidean space. Here we define and analyze two analogues of principal component analysis in ...
Monod, Anthea; Lin, Bo; Yoshida, Ruriko; Kang, Qiwen (ArXiv, 2018); Phylogenetic trees are the fundamental mathematical representation of evolutionary processes in biology. As data objects, they are characterized by the challenges associated with “big data,” as well as the complication ...