Authorship discovery in blogs using Bayesian classification with corrective scaling

Authors
Gehrke, Grant T.
Advisors
Martell, Craig H.
Second Readers
Squire, Kevin M.
Subjects
Date of Issue
2008-06
Date
Publisher
Monterey, California. Naval Postgraduate School
Language
Abstract
Widespread availability of free, public blog platforms has facilitated growth in the amount of individually written electronic text available online. Our research leverages an extremely large blog corpus for a study in authorship discovery, both to evaluate a traditional technique as applied to blogs and to demonstrate the implications of authorship discovery in blogs for intelligence and forensic purposes. Our study uses a Bayesian classifier with two important extensions. First, we introduce a post-classification corrective scaling technique to mitigate the over-classification of many samples to a few authors. Second, we propose an n-percent-correct threshold metric, whereby we define a "correct" result as one where the true author is within some small subset of the original search space rather than requiring that he or she be the single most probable author. Using this technique, we are able to reduce a search space of 2000 authors to 1% of its original size with 91% accuracy when 1000 bigrams are present, or reduce the search space to 10% of its original size with 94% accuracy when only 500 bigrams are present.
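The n-percent-correct threshold metric described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the dictionary of per-author scores, and the rounding rule for the size of the reduced search space are all assumptions made for illustration. It only shows the idea of counting a result as "correct" when the true author ranks within the top fraction of candidates.

```python
import math


def within_top_fraction(scores, true_author, fraction):
    """Return True if true_author ranks within the top `fraction` of candidates.

    `scores` maps each candidate author to a classifier score (for example a
    log-probability); higher means more probable. Hypothetical interface.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Rank candidate authors from most to least probable.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Size of the reduced search space; rounding up and keeping at least
    # one author is an assumption, not taken from the thesis.
    keep = max(1, math.ceil(fraction * len(ranked)))
    return true_author in ranked[:keep]


def n_percent_accuracy(samples, fraction):
    """Share of (scores, true_author) samples whose true author survives."""
    hits = sum(within_top_fraction(s, a, fraction) for s, a in samples)
    return hits / len(samples)


# Toy data: four candidate authors, scores invented for the example.
scores = {"a": -10.0, "b": -12.0, "c": -11.0, "d": -20.0}
samples = [(scores, "a"), (scores, "c"), (scores, "b"), (scores, "d")]
accuracy_at_half = n_percent_accuracy(samples, 0.5)
```

Under these assumptions, a search space of 2000 authors with a fraction of 0.01 leaves 20 candidates, and a sample counts as correct whenever the true author is among them.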
Type
Thesis
Description
Series/Report No
Department
Organization
Naval Postgraduate School (U.S.)
Identifiers
NPS Report Number
Sponsors
Funding
Format
xii, 37 p. : ill. ;
Citation
Distribution Statement
Approved for public release; distribution is unlimited.
Rights
Collections