Applications and benefits for big data sets using tree distances and the t-SNE algorithm

Author
Lee, Suyoung
Date
2016-03
Advisor
Buttrey, Samuel E.
Second Reader
Whitaker, Lyn R.
Abstract
Modern data sets often consist of unstructured and mixed data; that is, they include both numerical and categorical variables. These data sets often include noise, redundancy, missing values, and outliers. Clustering is one of the most important and widely used data analytic methods. However, clustering requires the ability to measure distances or dissimilarities, which are not defined in an obvious way for mixed data. Practitioners often use the Gower dissimilarity for this task. In this work we use the tree distance computed using Buttrey’s treeClust package in R, as discussed by Buttrey and Whitaker in 2015, to process mixed data while also handling missing values and outliers. Visualization is also an important method for big data. We use the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, introduced by van der Maaten and Hinton in 2008, which visualizes high-dimensional data by assigning each data point a location in a two- or three-dimensional map. We also use popular visualization techniques grouped under the name multidimensional scaling. We compare the results from the tree distance and the t-SNE algorithm with results from the Gower dissimilarity and multidimensional scaling. Unlike established dimensionality reduction techniques, which generally map from high dimensions directly to two (or three) dimensions, we explore a new approach in which the dimensionality reduction takes place in several separate steps. Our experiments show that our new techniques can outperform the established techniques in producing visualizations of high-dimensional mixed data.
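As a rough illustration of the baseline mentioned in the abstract, the following minimal Python sketch computes a Gower-style dissimilarity between mixed-type records. It is not the thesis's tree distance and does not use treeClust. The function name `gower_dissimilarity`, the `rows` sample data, and the `ranges` argument are hypothetical names introduced here. The sketch assumes the common unweighted form: range-scaled absolute difference for numeric variables, simple mismatch for categorical variables, and averaging over the variables observed in both records.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str, None]  # None marks a missing value (assumption)


def gower_dissimilarity(
    a: Sequence[Value],
    b: Sequence[Value],
    ranges: Sequence[Optional[float]],
) -> float:
    """Unweighted Gower-style dissimilarity between two mixed-type records.

    ranges[k] is the observed range (max - min) of variable k when it is
    numeric, and None when variable k is categorical.
    """
    total = 0.0
    used = 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip variables missing in either record
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r  # numeric: range-scaled difference
        # a numeric variable with zero range contributes 0
        used += 1
    if used == 0:
        raise ValueError("no variables observed in both records")
    return total / used


# Hypothetical records: (age, income, colour); income is missing in row 1.
rows = [
    (25.0, 40000.0, "red"),
    (35.0, None, "blue"),
    (45.0, 60000.0, "red"),
]
ranges = [20.0, 20000.0, None]  # age range, income range, categorical

d01 = gower_dissimilarity(rows[0], rows[1], ranges)
d02 = gower_dissimilarity(rows[0], rows[2], ranges)
```

A full pairwise matrix of such values could then be passed to a multidimensional scaling or t-SNE implementation that accepts precomputed dissimilarities.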
Rights
Copyright is reserved by the copyright owner.
Related items
Showing items related by title, author, creator and subject.
- 3D visualization of an invariant display strategy for hyperspecteral imagery
  Kim, Kang Suk (Monterey, California. Naval Postgraduate School, 2002-12)
  Spectral Imagery provides multi-dimensional data, which are difficult to display in standard three-color image formats. Tyo, et al. (2001) propose an invariant display strategy to address this problem. This approach is to ...
- A SYSTEMS ENGINEERING APPROACH TO COMPARING MIXED REALITY GAMING ENGINES WITHIN THE DOD
  Cha, Ted L.; Davis, Blake A.; Shutte, Zachariah R.; Snodgrass, Douglas J.; Wimsatt, Christopher J.; Ybarra, Rene V. (Monterey, CA; Naval Postgraduate School, 2020-12)
  Joint Special Operations Command (JSOC), the primary stakeholder of this report, identified a need to visualize the operating environment prior to mission execution. Historically, JSOC performed visualization by two-dimensional ...
- Scenario Authoring and Visualization for Advanced Graphical Environments (SAVAGE)
  Nicklaus, Shane D. (Monterey, California. Naval Postgraduate School, 2001-09)
  Today's planning and modeling systems use two-dimensional (2D) representations of the three-dimensional (3D) battlespace. This presents a challenge for planners, commanders, and troops to understand the true nature of the ...