Automatically detecting authors' native language

Authors
Ahn, Charles S.
Advisors
Martell, Craig H.
Date of Issue
2011-03
Publisher
Monterey, California. Naval Postgraduate School
Abstract
When non-native speakers learn English, their first language (L1) influences how they learn the second (L2). This is known as L1-L2 language transfer, and linguistic studies have shown that such transfer can affect writing as well. A model that exploits L1-L2 language transfer to identify an author's native language would be an invaluable tool for the intelligence community as well as for the field of education. The objective of this research is therefore to determine whether an author's native language can be detected automatically from their writing in English using traditional machine learning techniques. For this research, we used eight collections of writings by authors from eight native-language groups: native English speakers as well as speakers of Bulgarian, Chinese, Czech, French, Japanese, Russian, and Spanish. Among the feature sets evaluated, character trigrams and bag of words each achieved higher than 80% accuracy on their own, and an empirical analysis of the character trigrams revealed that they simply model lexical usage. When content words were extracted, performance dropped, and the results revealed that topic words were doing all the work.
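To make the two feature sets named in the abstract concrete, the following is a minimal sketch of character-trigram and bag-of-words extraction feeding a simple classifier. It is not the thesis's implementation: the abstract says only that "traditional machine learning techniques" were used, so the multinomial Naive Bayes model with add-one smoothing, the helper names (`char_trigrams`, `bag_of_words`, `NaiveBayes`), and the tiny training strings are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(text):
    """Return the overlapping character trigrams of the lowercased text."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def bag_of_words(text):
    """Return the lowercased whitespace-separated tokens of the text."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed classifier,
    not necessarily the one used in the thesis)."""

    def __init__(self, featurize):
        self.featurize = featurize          # function: text -> list of features
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.doc_counts = Counter()         # label -> number of training texts
        self.vocab = set()                  # all features seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = self.featurize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            label_total = sum(self.counts[label].values())
            # Log prior plus smoothed log likelihood of every feature.
            score = math.log(self.doc_counts[label] / total_docs)
            for f in feats:
                score += math.log(
                    (self.counts[label][f] + 1) / (label_total + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder strings, not data from the thesis corpora.
toy_texts = ["aaa aaa aaa", "bbb bbb bbb"]
toy_labels = ["group_A", "group_B"]

trigram_model = NaiveBayes(char_trigrams).fit(toy_texts, toy_labels)
word_model = NaiveBayes(bag_of_words).fit(toy_texts, toy_labels)
```

Because most character trigrams fall inside words or span a single word boundary, the two feature sets overlap heavily in what they capture, which is consistent with the abstract's observation that character trigrams mainly model lexical usage. A real experiment would need held-out test data and control for topic, since the abstract reports that topic words drove the results.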
Type
Thesis
Department
Computer Science
Organization
Naval Postgraduate School (U.S.)
Format
xviii, 95 p. : ill. ;
Distribution Statement
Approved for public release; distribution is unlimited.
Rights
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.