Filetype identification using long, summarized n-grams
Mayer, Ryan C.
Past research into file type identification has employed many techniques, including n-gram analysis, in an attempt to accurately classify files and file fragments. However, naive application of n-grams breaks down for n-grams longer than two bytes because the feature space becomes sparse. As a result, other researchers have generally ignored long n-grams for file type identification. This thesis explores the use of long n-grams for whole-file and file-fragment classification by building feature distributions of commonly occurring n-grams for single file types and using those distributions to classify unknown files and file fragments. This thesis also uses summarized n-grams to "collapse" similar n-grams within a file type into common n-grams. The algorithms developed to generate these distributions and to compare unknown files against them are presented, along with results from an experiment conducted using another researcher's data set.
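As a rough illustration of the general idea only, the following Python sketch builds a per-type set of commonly occurring long n-grams and classifies an unknown input by overlap. It is not the thesis's algorithm: the function names (`ngrams`, `build_profile`, `classify`), the n-gram length of 4, the `top_k` cutoff, and the overlap score are all assumptions made for illustration, and n-gram summarization is not modeled.

```python
from collections import Counter


def ngrams(data: bytes, n: int):
    """Yield every overlapping n-byte window of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(samples, n=4, top_k=100):
    """Return the top_k n-grams that occur in the most sample files of one type.

    Hypothetical helper: the cutoff and per-file counting are assumptions.
    """
    counts = Counter()
    for data in samples:
        # Count each n-gram once per file so one large file cannot dominate.
        counts.update(set(ngrams(data, n)))
    return {gram for gram, _ in counts.most_common(top_k)}


def classify(data: bytes, profiles, n=4):
    """Return the type whose profile shares the most distinct n-grams with data."""
    seen = set(ngrams(data, n))
    return max(profiles, key=lambda t: len(profiles[t] & seen))


# Toy training data, invented for this example.
profiles = {
    "html": build_profile([
        b"<html><body><p>hi</p></body></html>",
        b"<html><head></head><body></body></html>",
    ]),
    "csv": build_profile([
        b"1,2,3,4\n5,6,7,8\n",
        b"10,20,30\n40,50,60\n",
    ]),
}

print(classify(b"<body><p>text</p></body>", profiles))
```

Keeping only the most common n-grams per type is one simple way to sidestep the sparseness that affects long n-grams, since the full space of 4-byte values is never enumerated.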
Approved for public release; distribution is unlimited.