Language Translation for File Paths

Loading...
Thumbnail Image
Authors
Rowe, Neil C.
Schwamm, Riqui
Garfinkel, Simson L.
Subjects
digital forensics
file paths
machine translation
dictionary
character distribution
Unicode
Naive Bayes inference
Advisors
Date of Issue
2013
Date
Publisher
Monterey, California. Naval Postgraduate School
Language
Abstract
Forensic examiners are frequently confronted with content in languages that they do not understand, and they could benefit from machine translation into their native language. But automated translation of file paths is a difficult problem because of the minimal context for translation and the frequent mixing of multiple languages within a path. This work developed a prototype implementation of a file-path translator that first identifies the language for each directory segment of a path, and then translates to English those that are not already English nor artificial words. Brown’s LA-Strings utility for language identification was tried, but its performance was found inadequate on short strings and it was supplemented with clues from dictionary lookup, Unicode character distributions for languages, country of origin, and language-related keywords. To provide better data for language inference, words used in each directory over a large corpus were aggregated for analysis. The resulting directory-language probabilities were combined with those for each path segment from dictionary lookup and character-type distributions to infer the segment's most likely language. Tests were done on a corpus of 50.1 million file paths looking for 35 different languages. Tests showed 90.4% accuracy on identifying languages of directories and 93.7% accuracy on identifying languages of directory/file segments of file paths, even after excluding 44.4% of the paths as obviously English or untranslatable. Two of seven proposed language clues were shown to impair directory-language identification. Experiments also compared three translation methods: the Systran translation tool, Google Translate, and word-for-word substitution using dictionaries. Google Translate usually performed the best, but all still made errors with European languages and a significant number of errors with Arabic and Chinese.
Type
Conference Paper
Description
This paper appeared in the Proceedings of the 2013 DFRWS Conference. Also includes powerpoint presentation.
Series/Report No
Department
Organization
Identifiers
NPS Report Number
Sponsors
Funder
Format
Citation
Distribution Statement
Rights
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Collections