A scale-independent, noise-resistant dissimilarity for tree-based clustering of mixed data
Abstract
Clustering techniques divide observations into groups. Current techniques usually rely on measurements of dissimilarities between
pairs of observations, between pairs of clusters, and between an observation and a cluster. For numeric variables, these dissimilarity
measurements often depend on the scaling of the variables, are changed by monotonic transformations, and do not provide for
selection of “important” variables. In our scheme, we fit a set of regression or classification trees with each variable acting in turn
as the “response” variable. Points are “close” to one another if they tend to appear in the same leaves of these trees. Trees with poor
predictive power are discarded. Therefore, “noise” variables will often appear in none of the trees and have no effect on the clustering.
Because our technique uses trees, the dissimilarities are unaffected by linear transformations of the numeric variables and resistant
to monotonic ones and to outliers. Categorical variables are included automatically and missing values handled in a natural way. We
demonstrate the performance of this technique by using these dissimilarities to cluster some well-known data sets to which noise has
been added.
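The scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes numeric variables only, uses a small hand-written regression tree, and discards a tree when its in-sample share of explained variation falls below a hypothetical threshold `min_r2`. The dissimilarity between two points is taken here as the fraction of retained trees that place them in different leaves. The function names, parameters, and toy data are all illustrative assumptions.

```python
from itertools import combinations

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def grow(rows, idx, resp, preds, depth, min_leaf, leaves):
    """Recursively split the rows in idx; append each final leaf to leaves."""
    best = None
    if depth > 0:
        parent = sse([rows[i][resp] for i in idx])
        for p in preds:
            cuts = sorted({rows[i][p] for i in idx})
            for lo, hi in zip(cuts, cuts[1:]):
                cut = (lo + hi) / 2  # midpoint between adjacent observed values
                left = [i for i in idx if rows[i][p] <= cut]
                right = [i for i in idx if rows[i][p] > cut]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                gain = (parent - sse([rows[i][resp] for i in left])
                        - sse([rows[i][resp] for i in right]))
                if gain > 1e-12 and (best is None or gain > best[0]):
                    best = (gain, left, right)
    if best is None:
        leaves.append(idx)  # no useful split: this node is a leaf
    else:
        grow(rows, best[1], resp, preds, depth - 1, min_leaf, leaves)
        grow(rows, best[2], resp, preds, depth - 1, min_leaf, leaves)

def tree_dissimilarity(rows, depth=2, min_leaf=2, min_r2=0.5):
    """Return (dissimilarity matrix, number of retained trees)."""
    n, k = len(rows), len(rows[0])
    kept = []  # one list of leaf labels per retained tree
    for resp in range(k):  # each variable acts in turn as the response
        preds = [p for p in range(k) if p != resp]
        leaves = []
        grow(rows, list(range(n)), resp, preds, depth, min_leaf, leaves)
        total = sse([r[resp] for r in rows])
        within = sum(sse([rows[i][resp] for i in leaf]) for leaf in leaves)
        r2 = 1 - within / total if total > 0 else 0.0
        if r2 < min_r2:
            continue  # assumed discard rule: weak in-sample fit
        label = [0] * n
        for leaf_id, leaf in enumerate(leaves):
            for i in leaf:
                label[i] = leaf_id
        kept.append(label)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if kept:
            # share of retained trees that separate points i and j
            d[i][j] = d[j][i] = sum(t[i] != t[j] for t in kept) / len(kept)
    return d, len(kept)

# Toy data: columns 0 and 1 carry two groups; column 2 is unrelated noise.
rows = [
    [1.0, 10.0, 3.0], [1.1, 10.2, 7.0], [0.9, 9.8, 7.0], [1.2, 10.1, 3.0],
    [5.0, 20.0, 3.0], [5.1, 20.3, 7.0], [4.9, 19.9, 7.0], [5.2, 20.1, 3.0],
]
d, n_kept = tree_dissimilarity(rows, depth=1)
print(n_kept, d[0][1], d[0][4])
```

On this toy data the tree that tries to predict the noise column explains little of its variation and is dropped, so only the two informative trees contribute. Because splits depend only on the ordering of predictor values, multiplying a column by a positive constant should leave the matrix unchanged. An in-sample fit measure is optimistic, so a real analysis would more plausibly judge predictive power by cross-validation or pruning.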
Rights
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
NPS Report Number
NPS-OR-16-003
Related items
Showing items related by title, author, creator and subject.
- Visualizing mixed variable-type multidimensional data using tree distances
  Shaham, Yoav (Monterey, California: Naval Postgraduate School, 2015-09)
  This research explores the use of the tree distances of Buttrey and Whitaker to visualize multidimensional data of mixed-variable types, having both numerical and categorical data. Tree distances measure dissimilarities ...
- Applications and benefits for big data sets using tree distances and the t-SNE algorithm
  Lee, Suyoung (Monterey, California: Naval Postgraduate School, 2016-03)
  Modern data sets often consist of unstructured data and mixed data; that is, they include both numerical and categorical variables. Often, these data sets will include noise, redundancy, missing values and outliers. ...
- treeClust: an R package for tree-based clustering dissimilarities
  Buttrey, Samuel E.; Whitaker, Lyn R. (2015)
  This paper describes treeClust, an R package that produces dissimilarities useful for clustering. These dissimilarities arise from a set of classification or regression trees, one with each variable in the data acting ...