dc.contributor.author: Rowe, Neil C.
dc.contributor.author: Schwamm, Riqui
dc.contributor.author: Garfinkel, Simson L.
dc.date.accessioned: 2013-09-18T18:42:22Z
dc.date.available: 2013-09-18T18:42:22Z
dc.date.issued: 2013
dc.identifier.uri: http://hdl.handle.net/10945/36471
dc.description: This paper appeared in the Proceedings of the 2013 DFRWS Conference. Also includes powerpoint presentation. (en_US)
dc.description.abstract: Forensic examiners are frequently confronted with content in languages that they do not understand, and they could benefit from machine translation into their native language. But automated translation of file paths is a difficult problem because of the minimal context for translation and the frequent mixing of multiple languages within a path. This work developed a prototype implementation of a file-path translator that first identifies the language for each directory segment of a path, and then translates to English those that are not already English nor artificial words. Brown’s LA-Strings utility for language identification was tried, but its performance was found inadequate on short strings and it was supplemented with clues from dictionary lookup, Unicode character distributions for languages, country of origin, and language-related keywords. To provide better data for language inference, words used in each directory over a large corpus were aggregated for analysis. The resulting directory-language probabilities were combined with those for each path segment from dictionary lookup and character-type distributions to infer the segment's most likely language. Tests were done on a corpus of 50.1 million file paths looking for 35 different languages. Tests showed 90.4% accuracy on identifying languages of directories and 93.7% accuracy on identifying languages of directory/file segments of file paths, even after excluding 44.4% of the paths as obviously English or untranslatable. Two of seven proposed language clues were shown to impair directory-language identification. Experiments also compared three translation methods: the Systran translation tool, Google Translate, and word-for-word substitution using dictionaries. Google Translate usually performed the best, but all still made errors with European languages and a significant number of errors with Arabic and Chinese. (en_US)
dc.publisher: Monterey, California. Naval Postgraduate School (en_US)
dc.rights: This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States. (en_US)
dc.title: Language Translation for File Paths (en_US)
dc.type: Conference Paper (en_US)
dc.subject.author: digital forensics (en_US)
dc.subject.author: file paths (en_US)
dc.subject.author: machine translation (en_US)
dc.subject.author: dictionary (en_US)
dc.subject.author: character distribution (en_US)
dc.subject.author: Unicode (en_US)
dc.subject.author: Naive Bayes inference (en_US)
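
The abstract describes combining directory-language probabilities with per-segment clues from dictionary lookup and character-type distributions, and the subject terms name Naive Bayes inference. The sketch below is one minimal way such a combination could look. It is not taken from the paper: the function name `combine_clues`, the probability floor, the language codes, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def combine_clues(prior, clue_likelihoods, floor=1e-6):
    """Combine per-language clue probabilities under a naive
    independence assumption and return a normalized posterior.

    prior            -- dict: language -> probability for the directory
    clue_likelihoods -- list of dicts: language -> probability per clue
    floor            -- hypothetical smoothing value to avoid log(0)
    """
    log_scores = {}
    for lang, p in prior.items():
        score = math.log(max(p, floor))
        for clue in clue_likelihoods:
            # A language missing from a clue is treated as very unlikely.
            score += math.log(max(clue.get(lang, 0.0), floor))
        log_scores[lang] = score

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {lang: math.exp(s - top) for lang, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {lang: v / total for lang, v in unnormalized.items()}


# Hypothetical probabilities for a single path segment.
directory_prior = {"en": 0.5, "es": 0.3, "de": 0.2}   # from the directory
dictionary_clue = {"en": 0.1, "es": 0.8, "de": 0.1}   # dictionary lookup
character_clue = {"en": 0.3, "es": 0.4, "de": 0.3}    # character types

posterior = combine_clues(directory_prior, [dictionary_clue, character_clue])
best_language = max(posterior, key=posterior.get)
```

Working in log space keeps the product of many small probabilities from underflowing, which matters more as the number of clues grows. With these illustrative numbers, the dictionary clue outweighs the directory prior, so the segment is assigned a different language than its directory's most likely one.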