Language Translation for File Paths
Rowe, Neil C.
Garfinkel, Simson L.
Forensic examiners are frequently confronted with content in languages that they do not understand, and they could benefit from machine translation into their native language. But automated translation of file paths is a difficult problem because of the minimal context for translation and the frequent mixing of multiple languages within a path. This work developed a prototype implementation of a file-path translator that first identifies the language of each directory segment of a path, and then translates to English those segments that are neither already English nor artificial words. Brown's LA-Strings utility for language identification was tried, but its performance proved inadequate on short strings, so it was supplemented with clues from dictionary lookup, Unicode character distributions for languages, country of origin, and language-related keywords. To provide better data for language inference, the words used in each directory over a large corpus were aggregated for analysis. The resulting directory-language probabilities were combined with those for each path segment from dictionary lookup and character-type distributions to infer the segment's most likely language. Tests on a corpus of 50.1 million file paths, looking for 35 different languages, showed 90.4% accuracy on identifying languages of directories and 93.7% accuracy on identifying languages of directory/file segments of file paths, even after excluding 44.4% of the paths as obviously English or untranslatable. Two of seven proposed language clues were shown to impair directory-language identification. Experiments also compared three translation methods: the Systran translation tool, Google Translate, and word-for-word substitution using dictionaries. Google Translate usually performed the best, but all three methods still made errors with European languages and a significant number of errors with Arabic and Chinese.
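The combination of clues described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the word lists, the script-to-language mapping, the smoothing constant, and all function names below are hypothetical, and the clues are combined with a simple naive-Bayes-style product, which the abstract does not specify. It only shows the general idea of scoring each path segment with a dictionary clue and a character-type clue, then weighting the result by a per-directory language prior.

```python
import unicodedata

# Hypothetical toy word lists; a real system would load full dictionaries.
TOY_DICTIONARIES = {
    "en": {"documents", "pictures", "my", "music", "report"},
    "es": {"documentos", "mis", "imagenes", "musica", "informe"},
    "de": {"dokumente", "bilder", "meine", "musik", "bericht"},
}

# Hypothetical mapping from Unicode character-name prefixes to one language.
# Real scripts are shared by many languages; this is a simplification.
SCRIPT_HINTS = {"ARABIC": "ar", "CJK": "zh", "CYRILLIC": "ru"}

LANGUAGES = ["en", "es", "de", "ar", "zh", "ru"]
SMOOTH = 0.1  # assumed smoothing so that no language gets probability zero


def normalize(scores):
    """Scale scores so they sum to 1."""
    total = sum(scores.values())
    return {lang: value / total for lang, value in scores.items()}


def split_segments(path):
    """Split a path into non-empty directory/file segments."""
    return [s for s in path.replace("\\", "/").split("/") if s]


def tokenize(segment):
    """Split a segment into lowercase runs of letters."""
    words, current = [], []
    for ch in segment:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def dictionary_clue(segment):
    """Score languages by how many words of the segment their list contains."""
    scores = {lang: SMOOTH for lang in LANGUAGES}
    for word in tokenize(segment):
        for lang, vocab in TOY_DICTIONARIES.items():
            if word in vocab:
                scores[lang] += 1.0
    return normalize(scores)


def script_clue(segment):
    """Score languages by the Unicode script of each character."""
    scores = {lang: SMOOTH for lang in LANGUAGES}
    for ch in segment:
        name = unicodedata.name(ch, "")
        for prefix, lang in SCRIPT_HINTS.items():
            if name.startswith(prefix):
                scores[lang] += 1.0
    return normalize(scores)


def combine(*clues):
    """Multiply the clue distributions per language and renormalize."""
    combined = {lang: 1.0 for lang in LANGUAGES}
    for clue in clues:
        for lang in LANGUAGES:
            combined[lang] *= clue[lang]
    return normalize(combined)


def identify_segments(path, directory_prior=None):
    """Return (segment, most likely language) pairs for a path."""
    prior = directory_prior or {lang: 1.0 / len(LANGUAGES) for lang in LANGUAGES}
    results = []
    for segment in split_segments(path):
        posterior = combine(prior, dictionary_clue(segment), script_clue(segment))
        results.append((segment, max(posterior, key=posterior.get)))
    return results
```

Calling `identify_segments("Mis Documentos/Informe/report.txt")` labels each segment separately, so a path that mixes languages is not forced into a single label. When no clue fires, the choice falls back to the prior, which is why a well-estimated directory-level prior matters for short, ambiguous segments.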
This paper appeared in the Proceedings of the 2013 DFRWS Conference. The record also includes a PowerPoint presentation.
Rights: This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.