Applications and benefits for big data sets using tree distances and the t-SNE algorithm
Buttrey, Samuel E.
Whitaker, Lyn R.
Modern data sets often consist of unstructured data and mixed data; that is, they include both numerical and categorical variables. Often, these data sets also include noise, redundancy, missing values and outliers. Clustering is one of the most important and widely used data analytic methods. However, clustering requires the ability to measure distances or dissimilarities, which are not defined in an obvious way for mixed data. Practitioners often use the Gower dissimilarity for this task. In this work we use the tree distance, computed with Buttrey’s treeClust package in R as discussed by Buttrey and Whitaker in 2015, to process mixed data while also handling missing values and outliers. Visualization is also an important tool for analyzing big data. We use the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, introduced by van der Maaten and Hinton in 2008, which produces visualizations of high-dimensional data by assigning each data point a location in a two- or three-dimensional map. We also use the popular visualization techniques grouped under the name multidimensional scaling. We compare the results obtained with the tree distance and the t-SNE algorithm to those obtained with the Gower dissimilarity and multidimensional scaling. Unlike established dimensionality reduction techniques, which generally map from high dimensions directly to two (or three) dimensions, we explore a new approach in which the dimensionality reduction takes place in several separate steps. Our experiments show that our new techniques can outperform the established techniques in producing visualizations of high-dimensional mixed data.
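As an illustration of the baseline mentioned in the abstract, the following is a minimal sketch of the Gower dissimilarity for mixed data, written in plain Python. It is not the tree distance from treeClust and is not taken from the thesis; the function name `gower_dissimilarity`, the `numeric_cols` argument, the use of `None` for missing values, and the sample records are all assumptions made for the example. Numeric columns contribute the absolute difference scaled by the column range, categorical columns contribute 0 for a match and 1 for a mismatch, and any column with a missing value in either record is skipped for that pair.

```python
import math


def gower_dissimilarity(rows, numeric_cols):
    """Pairwise Gower dissimilarities for a list of equal-length records.

    rows: list of tuples; None marks a missing value (an assumption of this sketch).
    numeric_cols: set of column indices treated as numeric; every other
        column is treated as categorical.
    """
    n_cols = len(rows[0])

    # Range of each numeric column, ignoring missing values.
    ranges = {}
    for k in numeric_cols:
        vals = [r[k] for r in rows if r[k] is not None]
        ranges[k] = (max(vals) - min(vals)) if vals else 0.0

    def pair(a, b):
        total, used = 0.0, 0
        for k in range(n_cols):
            # Skip any variable that is missing in either record.
            if a[k] is None or b[k] is None:
                continue
            if k in numeric_cols:
                # Range-scaled absolute difference; a constant column adds 0.
                total += abs(a[k] - b[k]) / ranges[k] if ranges[k] > 0 else 0.0
            else:
                # Simple matching for categorical variables.
                total += 0.0 if a[k] == b[k] else 1.0
            used += 1
        # No comparable variables: the dissimilarity is undefined.
        return total / used if used else math.nan

    return [[pair(a, b) for b in rows] for a in rows]


# Hypothetical mixed records: (numeric, categorical, numeric with one missing value).
records = [(1.0, "red", 10.0), (3.0, "blue", None), (5.0, "red", 30.0)]
D = gower_dissimilarity(records, numeric_cols={0, 2})
```

The resulting matrix `D` is symmetric with zeros on the diagonal, so it could be handed to any embedding routine that accepts precomputed dissimilarities, such as a multidimensional scaling or t-SNE implementation.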
Approved for public release; distribution is unlimited