Data mining of extremely large ad hoc data sets to produce inverted indices
Coudray, Aaron D.
The purpose of this study is to leverage existing Internet-sized ad hoc data sets by creating an inverted index that will enable a robust search capability. In particular, this study is focused on the Common Crawl web corpus. This involves exploring the tools and techniques necessary to effectively traverse this data set, as well as producing the tools to create an inverted index relationship between the terms and websites found within web archive files. The primary tools utilized in this process are Apache Hadoop, Apache MapReduce, Amazon Web Services, and Java. Additionally, methods to enhance this relationship with other information of interest are investigated in this thesis. Specifically, an index was developed that contains the added component of term relative location. This inverted index relationship is an essential component of, and the first step in, creating a robust search capability for a very large ad hoc data set.
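To make the idea of an inverted index with term relative location concrete, the following is a minimal single-machine sketch in Java. It is not the thesis's Hadoop MapReduce implementation: the class name `PositionalIndex`, its methods, the example URLs, and the simple lowercase alphanumeric tokenizer are all assumptions made for illustration. Each term maps to the documents that contain it, and each document entry records the zero-based token positions at which the term occurs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the thesis's MapReduce code.
public class PositionalIndex {
    // term -> (document URL -> zero-based token positions of the term)
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    // Tokenize the text (assumed tokenizer: lowercase, split on non-alphanumerics)
    // and record where each term occurs in the given document.
    public void addDocument(String url, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(url, u -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    // Return the documents containing the term, with the term's positions in each.
    public Map<String, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of());
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.addDocument("http://a.example/", "Hadoop builds the index");
        index.addDocument("http://b.example/", "The index maps terms; the index is inverted");
        // prints {http://a.example/=[3], http://b.example/=[1, 5]}
        System.out.println(index.lookup("index"));
    }
}
```

In a MapReduce setting the same relationship would typically be split into a map phase that emits (term, document and position) pairs and a reduce phase that groups them by term; the sketch above collapses both phases into one in-memory map for readability.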
Approved for public release; distribution is unlimited. Includes supplementary material.