Scaling bulk data analysis with mapreduce

dc.contributor.advisor: McCarrin, Michael
dc.contributor.advisor: Stefanou, Marcus S.
dc.contributor.author: Andrzejewski, Timothy J.
dc.contributor.department: Computer Science (CS)
dc.date: Sep-17
dc.date.accessioned: 2017-11-07T23:39:28Z
dc.date.available: 2017-11-07T23:39:28Z
dc.date.issued: 2017-09
dc.description.abstract: Between 2005 and 2015, the world population grew by 11% while hard drive capacity grew by 95%. Increased demand for storage combined with decreasing costs presents challenges for digital forensic analysts working within tight time constraints. Advancements have been made to current tools to assist the analyst, but many require expensive specialized systems, knowledge and software. This thesis provides a method to address these challenges through distributed analysis of raw forensic images stored in a distributed file system using open-source software. We develop a proof-of-concept tool capable of counting unique bytes in a 116 TiB corpus of drives in 1 hour 41 minutes, demonstrating a peak throughput of 18.33 GiB/s on a 25-node Hadoop cluster. Furthermore, we demonstrate the ability to perform email address extraction on the corpus in 2 hours 5 minutes, for a throughput of 15.84 GiB/s, a result that compares favorably to traditional email address extraction methods, which we estimate to run with a throughput of approximately 91 MiB/s on a 24-core production server. Primary contributions to the forensic community are: 1) a distributed, scalable method to analyze large data sets in a practical timeframe, 2) a MapReduce program to count unique bytes of any forensic image, and 3) a MapReduce program capable of extracting 233 million email addresses from a 116 TiB corpus in just over two hours. [en_US]
dc.description.distributionstatement: Approved for public release; distribution is unlimited.
dc.description.service: Civilian, Department of the Navy [en_US]
dc.description.uri: http://archive.org/details/scalingbulkdatan1094556132
dc.identifier.uri: https://hdl.handle.net/10945/56132
dc.publisher: Monterey, CA; Naval Postgraduate School
dc.rights: This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States. [en_US]
dc.subject.author: hadoop [en_US]
dc.subject.author: mapreduce [en_US]
dc.subject.author: digital forensics [en_US]
dc.subject.author: bulk data analysis [en_US]
dc.subject.author: bulk_extractor [en_US]
dc.subject.author: distributed digital forensics [en_US]
dc.subject.author: data mining [en_US]
dc.subject.author: big data [en_US]
dc.title: Scaling bulk data analysis with mapreduce [en_US]
dc.type: Thesis [en_US]
dspace.entity.type: Publication
etd.thesisdegree.discipline: Computer Science [en_US]
etd.thesisdegree.grantor: Naval Postgraduate School [en_US]
etd.thesisdegree.level: Masters [en_US]
etd.thesisdegree.name: Master of Science in Computer Science [en_US]
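
Illustrative note on the abstract: the byte-counting contribution can be pictured as a map step that builds a partial histogram per block of an image and a reduce step that sums the partial histograms. The sketch below is not the thesis code and does not use Hadoop. It assumes "counting unique bytes" means counting occurrences of each of the 256 byte values, and the function names, the in-memory `sample` image, and the 4-byte block size are all hypothetical choices made for illustration.

```python
from collections import Counter
from functools import reduce


def split_blocks(image: bytes, block_size: int):
    """Yield fixed-size blocks of the image (stand-in for input splits)."""
    for offset in range(0, len(image), block_size):
        yield image[offset:offset + block_size]


def map_block(block: bytes) -> Counter:
    """Map step: count how often each byte value occurs in one block."""
    # Iterating over a bytes object yields integers in the range 0..255.
    return Counter(block)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial histograms by summing their counts."""
    return left + right


def byte_histogram(image: bytes, block_size: int = 4) -> Counter:
    """Run the map step on every block, then fold the partial results."""
    partials = map(map_block, split_blocks(image, block_size))
    return reduce(reduce_counts, partials, Counter())


# Hypothetical 9-byte "image" used only to exercise the sketch.
sample = b"aab\x00\x00\x00abc"
histogram = byte_histogram(sample)
```

Because each block is counted independently and the merge is a plain sum, the blocks could be processed on separate nodes in any order, which is the property a distributed job would rely on.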
Files
Original bundle
Name: 17Sep_Andrzejewski_Timothy.pdf
Size: 2.1 MB
Format: Adobe Portable Document Format