Language Translation for File Paths
Rowe, Neil C.
Garfinkel, Simson L.
Forensic examiners are frequently confronted with content in languages they do not understand, and they could benefit from machine translation into their native language. Automated translation of file paths is difficult, however, because paths offer minimal context for translation and frequently mix multiple languages. This work developed a prototype file-path translator that first identifies the language of each directory segment of a path, and then translates into English those segments that are neither already English nor artificial words. Brown’s LA-Strings utility for language identification was tried, but its performance was inadequate on short strings, so it was supplemented with clues from dictionary lookup, Unicode character distributions for languages, country of origin, and language-related keywords. To provide better data for language inference, the words used in each directory over a large corpus were aggregated for analysis. The resulting directory-language probabilities were combined with those for each path segment from dictionary lookup and character-type distributions to infer the segment's most likely language. Tests on a corpus of 50.1 million file paths, looking for 35 different languages, showed 90.4% accuracy in identifying the languages of directories and 93.7% accuracy in identifying the languages of directory/file segments of file paths, even after excluding 44.4% of the paths as obviously English or untranslatable. Two of the seven proposed language clues were shown to impair directory-language identification. Experiments also compared three translation methods: the Systran translation tool, Google Translate, and word-for-word substitution using dictionaries. Google Translate usually performed best, but all three methods still made errors with European languages and a significant number of errors with Arabic and Chinese.
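The clue-combination step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the probabilities were combined, so the naive product-and-normalize rule below is an assumption, as are all function names, the `floor` smoothing value, and the example probabilities. The character-type clue is approximated here by the first word of each letter's Unicode name.

```python
import unicodedata
from collections import Counter


def split_segments(path):
    """Split a file path into its non-empty directory/file segments."""
    return [seg for seg in path.replace("\\", "/").split("/") if seg]


def script_clue(segment):
    """Rough character-type clue: fraction of letters per Unicode script.

    The script is approximated by the first word of the character's
    Unicode name (e.g. 'LATIN', 'CYRILLIC', 'ARABIC', 'CJK').
    """
    scripts = Counter()
    for ch in segment:
        if ch.isalpha():
            scripts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    total = sum(scripts.values())
    return {s: n / total for s, n in scripts.items()} if total else {}


def combine_clues(clues, languages, floor=0.01):
    """Combine per-clue language probabilities (assumed rule: product).

    Each clue is a dict mapping language code -> probability. Languages a
    clue does not mention get the small `floor` value (an assumption) so
    that a single silent clue cannot zero out a language.
    """
    scores = {}
    for lang in languages:
        score = 1.0
        for clue in clues:
            score *= max(clue.get(lang, 0.0), floor)
        scores[lang] = score
    total = sum(scores.values())
    return {lang: s / total for lang, s in scores.items()}


def most_likely_language(clues, languages):
    """Return the language with the highest combined probability."""
    combined = combine_clues(clues, languages)
    return max(combined, key=combined.get)
```

As a usage example with made-up numbers, a directory-level prior of `{"en": 0.3, "es": 0.6, "fr": 0.1}`, a dictionary-lookup clue of `{"en": 0.2, "es": 0.7, "fr": 0.1}`, and a character-distribution clue of `{"en": 0.3, "es": 0.4, "fr": 0.3}` can be passed as a list to `most_likely_language` along with `["en", "es", "fr"]`; Spanish has the largest product and is selected. A product rule treats the clues as independent, which is a simplification; weighted or learned combinations are equally plausible readings of the abstract.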
This paper appeared in the Proceedings of the 2013 DFRWS Conference. The record also includes a PowerPoint presentation.