Parallel processing with treeClust

Loading...
Thumbnail Image
Authors
McKechnie, I. Taylor
Advisors
Buttrey, Samuel E.
Second Readers
Whitaker, Lyn R.
Subjects
tree clusters
parallel processing
high performance computing
big data sets
batch scripting
information systems technology
Date of Issue
2017-09
Date
Sep-17
Publisher
Monterey, CA; Naval Postgraduate School
Language
Abstract
Clustering data is one of the most common statistical and machine learning techniques for analyzing big data. Clustering can be particularly difficult when the data sets include categorical, missing, or noise variables. The tree clustering algorithm developed by Samuel Buttrey and Lyn Whitaker, as described in the December 2015 issue of The R Journal, seems to provide a solution to these problems, but it requires a large set of overhead computations. This issue is intensified when working with high-dimensional data because the extent of treeClust’s overhead computations are based on the dimensions of the data. High performance computing (HPC) and parallel processing present a solution to this overhead computation burden, but treeClust’s existing parallel processing method does not work on the Naval Postgraduate School’s HPC, the Hamming Supercomputer (HSC). Furthermore, correctly determining what HPC resources to use can be a difficult task. In this thesis, we present a new HSC-specific method for parallel processing data using the treeClust R package developed by Buttrey and Whitaker. Based on the results of our experiments, our method approximates the optimal resource HPC request, so that users realize the best run time when using treeClust on the HSC.
Type
Thesis
Description
Series/Report No
Department
Operations Research (OR)
Organization
Identifiers
NPS Report Number
Sponsors
Funding
Format
Citation
Distribution Statement
Approved for public release; distribution is unlimited.
Rights
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Collections