Data mining of extremely large ad hoc data sets to produce inverted indices
Coudray, Aaron D.
The purpose of this study is to make existing Internet-sized ad hoc data sets searchable by creating an inverted index over them. In particular, the study focuses on the Common Crawl web corpus. The work involves exploring the tools and techniques needed to traverse this data set effectively, and building the tools that create an inverted index relating the terms found in web archive files to the websites that contain them. The primary tools used are Apache Hadoop and its MapReduce framework, Amazon Web Services, and Java. The thesis also investigates ways to enrich the index with other information of interest. Specifically, an index was developed that additionally records the relative location of each term. This inverted index is an essential component of, and the first step toward, a robust search capability for a very large ad hoc data set.
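As an illustration of the kind of index the abstract describes, the following is a minimal in-memory sketch in Java. It is not the thesis's implementation: it uses no Hadoop or Common Crawl APIs, and the class name, method names, and tokenization rule are all hypothetical. It also assumes that "relative location" means a term's token position divided by the number of tokens on the page, which the abstract does not define.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an inverted index from term -> postings, where each
// posting records the page URL and the term's relative location in that page.
class PositionalIndexSketch {

    // One occurrence of a term on one page.
    static final class Posting {
        final String url;
        final double relativeLocation; // ASSUMPTION: token position / token count, in [0, 1)

        Posting(String url, double relativeLocation) {
            this.url = url;
            this.relativeLocation = relativeLocation;
        }

        @Override
        public String toString() {
            return url + "@" + relativeLocation;
        }
    }

    // Plays the role of a map step: tokenize one page and add (term, posting) pairs.
    static void indexPage(String url, String text, Map<String, List<Posting>> index) {
        List<String> terms = new ArrayList<>();
        // ASSUMPTION: lowercase ASCII alphanumeric runs are the terms.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        for (int i = 0; i < terms.size(); i++) {
            double relative = (double) i / terms.size();
            index.computeIfAbsent(terms.get(i), k -> new ArrayList<>())
                 .add(new Posting(url, relative));
        }
    }

    // Plays the role of the grouping/reduce step: all postings for a term end up
    // in one list, with terms kept in sorted order.
    static Map<String, List<Posting>> build(Map<String, String> pages) {
        Map<String, List<Posting>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            indexPage(page.getKey(), page.getValue(), index);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://a.example", "Hadoop indexes web pages");
        pages.put("http://b.example", "web archive");
        build(pages).forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

In a real MapReduce job the per-page step would run in mappers over web archive records and the grouping would be done by the framework's shuffle; the sketch collapses both into a single process so the data structure itself is visible.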
Approved for public release; distribution is unlimited. Includes supplementary material.