Applications and benefits for big data sets using tree distances and the t-SNE algorithm

Authors
Lee, Suyoung
Subjects
big data sets
tree distance algorithm
treeClust
visualization
t-SNE algorithm
dimensionality reduction
Advisors
Buttrey, Samuel E.
Date of Issue
2016-03
Publisher
Monterey, California: Naval Postgraduate School
Abstract
Modern data sets often consist of unstructured data and mixed data; that is, they include both numerical and categorical variables. Often, these data sets will include noise, redundancy, missing values and outliers. Clustering is one of the most important and widely used data analytic methods. However, clustering requires the ability to measure distances or dissimilarities, which are not defined in an obvious way for mixed data. Practitioners often use the Gower dissimilarity for this task. In this work we use tree distances computed using Buttrey’s treeClust package in R, as discussed by Buttrey and Whitaker in 2015, to process mixed data while also handling missing values and outliers. Visualization is also an important method for big data. We use the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, introduced by van der Maaten and Hinton in 2008, which visualizes high-dimensional data by assigning each data point a location in a two- or three-dimensional map. We also use popular visualization techniques grouped under the name multidimensional scaling. We compare the results from the tree distance and the t-SNE algorithm to the results from the Gower dissimilarity and multidimensional scaling. Unlike established dimensionality reduction techniques, which generally map from high dimensions directly to two (or three) dimensions, we explore a new approach in which the dimensionality reduction takes place in several separate steps. Our experiments show that our new techniques can outperform the established techniques in producing visualizations of high-dimensional mixed data.
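As a rough illustration of the baseline mentioned in the abstract, the Gower dissimilarity for mixed data can be sketched in a few lines of Python. This is not code from the thesis and it does not reproduce treeClust. The function name `gower_dissimilarity`, the toy `rows` data, and the choice to skip variables with a missing value in either row are assumptions made for this example only.

```python
from typing import List, Optional, Sequence, Union

Value = Optional[Union[float, str]]  # None marks a missing value


def gower_dissimilarity(rows: Sequence[Sequence[Value]],
                        is_numeric: Sequence[bool]) -> List[List[float]]:
    """Pairwise Gower dissimilarity for rows with numeric and categorical columns.

    Numeric columns contribute |a - b| / range; categorical columns
    contribute 0 for a match and 1 for a mismatch. Columns where either
    value is missing are skipped, and the result is the mean over the
    columns that were actually compared.
    """
    n, p = len(rows), len(is_numeric)

    # Observed range of each numeric column, ignoring missing values.
    ranges: List[float] = []
    for k in range(p):
        if is_numeric[k]:
            observed = [r[k] for r in rows if r[k] is not None]
            ranges.append(max(observed) - min(observed) if observed else 0.0)
        else:
            ranges.append(0.0)  # unused for categorical columns

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total, used = 0.0, 0
            for k in range(p):
                a, b = rows[i][k], rows[j][k]
                if a is None or b is None:
                    continue  # skip comparisons involving a missing value
                if is_numeric[k]:
                    total += abs(a - b) / ranges[k] if ranges[k] > 0 else 0.0
                else:
                    total += 0.0 if a == b else 1.0
                used += 1
            # NaN when the two rows share no observed column.
            dist[i][j] = dist[j][i] = total / used if used else float("nan")
    return dist


# Toy mixed data (hypothetical): one numeric and one categorical column.
rows = [(1.0, "a"), (3.0, "a"), (5.0, "b"), (None, "b")]
D = gower_dissimilarity(rows, is_numeric=[True, False])
```

A matrix such as `D` is the kind of input that multidimensional scaling, or a t-SNE implementation that accepts precomputed dissimilarities, would then map to two or three dimensions. The thesis compares this baseline against tree distances, which this sketch does not compute.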
Type
Thesis
Department
Operations Research (OR)
Distribution Statement
Approved for public release; distribution is unlimited.
Rights
Copyright is reserved by the copyright owner.