Scaling bulk data analysis with MapReduce
Andrzejewski, Timothy J.
Stefanou, Marcus S.
Between 2005 and 2015, the world population grew by 11% while hard drive capacity grew by 95%. Increased demand for storage combined with decreasing costs presents challenges for digital forensic analysts working within tight time constraints. Advancements have been made to current tools to assist the analyst, but many require expensive specialized systems, knowledge and software. This thesis provides a method to address these challenges through distributed analysis of raw forensic images stored in a distributed file system using open-source software. We develop a proof-of-concept tool capable of counting unique bytes in a 116 TiB corpus of drives in 1 hour 41 minutes, demonstrating a peak throughput of 18.33 GiB/s on a 25-node Hadoop cluster. Furthermore, we demonstrate the ability to perform email address extraction on the corpus in 2 hours 5 minutes, for a throughput of 15.84 GiB/s, a result that compares favorably to traditional email address extraction methods, which we estimate to run with a throughput of approximately 91 MiB/s on a 24-core production server. Primary contributions to the forensic community are: 1) a distributed, scalable method to analyze large data sets in a practical timeframe, 2) a MapReduce program to count unique bytes of any forensic image, and 3) a MapReduce program capable of extracting 233 million email addresses from a 116 TiB corpus in just over two hours.
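The map and reduce steps described in the abstract can be illustrated with a minimal single-process sketch in Python. This is not the thesis's Hadoop code: the function names, the fixed split size, and the simplified email pattern are all assumptions made for illustration. Each mapper counts byte values or email-like strings in one input split, and the reducer merges the per-split counts. Note that this naive splitting would miss an address that straddles a split boundary, which a real job would need to handle.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical, deliberately simple pattern; not the extractor used in the thesis.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def split_image(image: bytes, split_size: int):
    """Cut a raw image into fixed-size input splits (stand-in for file system blocks)."""
    for offset in range(0, len(image), split_size):
        yield image[offset:offset + split_size]


def map_byte_counts(split: bytes) -> Counter:
    """Map step: count how often each byte value (0-255) occurs in one split."""
    return Counter(split)


def map_emails(split: bytes) -> Counter:
    """Map step: count lowercased email-like strings found in one split."""
    return Counter(m.group(0).lower() for m in EMAIL_RE.finditer(split))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial count tables by summing per key."""
    return left + right


def run_job(image: bytes, split_size: int, mapper) -> Counter:
    """Apply the mapper to every split, then fold the partial results together."""
    partials = map(mapper, split_image(image, split_size))
    return reduce(reduce_counts, partials, Counter())


def throughput_gib_per_s(size_tib: float, hours: int, minutes: int) -> float:
    """Average throughput in GiB/s for a corpus of size_tib processed in the given time."""
    return size_tib * 1024 / (hours * 3600 + minutes * 60)
```

As a check on the reported figures, `throughput_gib_per_s(116, 2, 5)` gives about 15.84 GiB/s, matching the email extraction throughput stated in the abstract.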
Approved for public release; distribution is unlimited