Improving cluster analysis with automatic variable selection based on trees
Orr, Anton D.
Buttrey, Samuel E.
Whitaker, Lyn R.
Clustering is an algorithmic technique that groups similar objects together to give users a better understanding of the underlying structure of their data. It can be thought of as a two-step process. The first step measures the distances among the objects to determine how dissimilar they are. The second step, clustering, takes the dissimilarity measurements and assigns each object to a cluster. To address problems with determining dissimilarity, we examine three distance measures based on classification and regression trees, proposed by Buttrey at the Joint Statistical Meetings in Seattle, August 2006. Current algorithms do not simultaneously address automatic variable selection, independence from variable scaling, resistance to monotonic transformation, and datasets of mixed variable types. These "tree distances" are compared with an existing dissimilarity algorithm and two newer methods using four well-known datasets that contain numeric, categorical, and mixed variable types. In addition, noise variables are added to test the ability of each algorithm to automatically select important variables. The tree distances offer much improvement on the problems they were designed to address: they perform well against competitors on numeric datasets and outperform them on categorical and mixed-type datasets.
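The two-step process described in the abstract can be illustrated with a minimal sketch. The tree distances themselves are not reproduced here. As a stand-in, the sketch assumes a Gower-style dissimilarity (range-scaled numeric differences plus simple matching on categorical values) followed by average-linkage agglomerative clustering. The function names, the toy data, and the choice of two clusters are all hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def gower_dissimilarity(rows, numeric_cols):
    """Step 1: pairwise dissimilarity for rows of mixed variable types.

    Assumption: a Gower-style measure stands in for the tree distances.
    """
    n, p = len(rows), len(rows[0])
    # Range of each numeric column, so the result does not depend on units.
    ranges = {}
    for j in numeric_cols:
        values = [r[j] for r in rows]
        ranges[j] = (max(values) - min(values)) or 1.0
    d = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        total = 0.0
        for j in range(p):
            if j in numeric_cols:
                total += abs(rows[a][j] - rows[b][j]) / ranges[j]
            else:
                # Categorical variable: 0 if the values match, 1 otherwise.
                total += 0.0 if rows[a][j] == rows[b][j] else 1.0
        d[a][b] = d[b][a] = total / p
    return d


def average_linkage(d, k):
    """Step 2: merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(d))]

    def between(c1, c2):
        # Mean pairwise dissimilarity between two clusters.
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: between(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(d)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical toy data: one numeric and one categorical variable per object.
rows = [(1.0, "red"), (1.2, "red"), (0.9, "red"),
        (8.0, "blue"), (8.3, "blue"), (7.9, "blue")]
dist = gower_dissimilarity(rows, numeric_cols={0})
labels = average_linkage(dist, k=2)
print(labels)
```

Keeping the two steps in separate functions mirrors the abstract's framing: any dissimilarity matrix, including one derived from trees, could in principle be passed to the same clustering step.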
Rights: This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Showing items related by title, author, creator and subject.
Data Consolidation of Disparate Procurement Data Sources for Correlated Performance-Based Acquisition. Nangia, Samantha; Dickover, Ryan; Wardwell, Thomas; Mora, Randall (Monterey, California. Naval Postgraduate School, 2017-03); SYM-AM-17-095. Frank Kendall, then Under Secretary of Defense for Acquisition, Technology and Logistics, released the first defense acquisition system performance report in June 2013. This report focused primarily on performance related ...
Data Consolidation of Disparate Procurement Data Sources for Correlated Performance-Based Acquisition Decision Support. Nangia, Samantha; Dickover, Ryan; Wardwell, Thomas; Mora, Randall (Monterey, California. Naval Postgraduate School, 2017-03); SYM-AM-17-044. Frank Kendall, then Under Secretary of Defense for Acquisition, Technology and Logistics, released the first defense acquisition system performance report in June 2013. This report focused primarily on performance related ...
McArdle, Ryan P. (Monterey, California: Naval Postgraduate School, 2017-12). Over the past decade, a deluge of large and complex datasets (aka big data) has overwhelmed the scientific community. Traditional computing architectures were not capable of processing the data efficiently, or in some ...