Data mining of extremely large ad-hoc data sets to produce reverse web-link graphs

Authors
Chang, Tao-hsiang
Subjects
Amazon Web Services
cluster computing
data mining
Hadoop MapReduce
Common Crawl
Advisors
Kragh, Frank
Xie, Geoffrey
Date of Issue
2017-03
Publisher
Monterey, California: Naval Postgraduate School
Abstract
Data mining can be a valuable tool, particularly for the acquisition of military intelligence. As the second study within a larger Naval Postgraduate School research project using Amazon Web Services (AWS), this thesis focuses on mining a very large data set (32 TB) drawn from Common Crawl, an open web-crawl archive. Like previous studies, this research employs MapReduce (MR) to sort and categorize output key-value pairs. Our research, however, is the first to implement the basic Reverse Web-Link Graph (RWLG) algorithm as a search capability for websites, and to validate that it works correctly. A second goal is to extend the RWLG algorithm to process a full Common Crawl archive as a single MR job. To mitigate out-of-memory errors, we relate key environment variables to the Yet Another Resource Negotiator (YARN) architecture and provide sample error-tracking methods. As a further contribution, this study examines limitations of using AWS, which inform our recommendations for future work.
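The reverse web-link graph computation the abstract refers to can be sketched outside Hadoop as a plain map and reduce over toy page/link data. This is a minimal illustration of the idea, not the thesis's implementation; the URLs and the in-memory `pages` structure below are hypothetical stand-ins for parsed Common Crawl records.

```python
from collections import defaultdict

# Hypothetical toy input: page URL -> list of outgoing link URLs,
# standing in for records parsed from a Common Crawl archive.
pages = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

def map_phase(pages):
    """Map step: for each outgoing link, emit a (target, source) pair."""
    for source, targets in pages.items():
        for target in targets:
            yield target, source

def reduce_phase(pairs):
    """Reduce step: collect all sources that link to each target."""
    reverse_graph = defaultdict(list)
    for target, source in pairs:
        reverse_graph[target].append(source)
    return dict(reverse_graph)

reverse_graph = reduce_phase(map_phase(pages))
# reverse_graph["c.example"] now lists every page linking to c.example
```

In an actual MR job the shuffle phase performs the grouping that `reduce_phase` does here, which is also where per-key memory pressure (and hence the out-of-memory behavior the abstract mentions) can arise for heavily linked targets.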
Type
Thesis
Department
Electrical and Computer Engineering (ECE)
Distribution Statement
Approved for public release; distribution is unlimited.
Rights
Copyright is reserved by the copyright owner.