New framework for cross-domain document classification

Author
Gupta, Anjum
Date
2011-03
Advisor
Martell, Craig
Gera, Ralucca
Schein, Andrew I.
Volpano, Dennis
Young, Joel
Abstract
Automatic text document classification is a fundamental problem in machine learning. Given the dynamic nature and the exponential growth of the World Wide Web, one needs the ability to classify not only a massive number of documents, but also documents that belong to a wide variety of domains. Examples of such domains include e-mails, blogs, Wikipedia articles, news articles, newsgroups, and online chats. It is the difference in writing style that differentiates these domains. Text documents are usually classified using supervised learning algorithms that require a large set of pre-labeled data. This requirement for labeled data poses a challenge in classifying documents that belong to different domains. Our goal is to classify text documents in the testing domain without requiring any labeled documents from that domain. Our research develops specialized cross-domain learning algorithms based on the distributions over words obtained from a collection of text documents by topic models such as Latent Dirichlet Allocation (LDA). Our major contributions include (1) empirically showing that conventional supervised learning algorithms fail to generalize their learned models across different domains and (2) developing novel and specialized cross-domain classification algorithms that show an appreciable improvement over conventional methods used for cross-domain classification, an improvement that is consistent across different datasets. Our research addresses many real-world needs. Because a massive number of new types of text documents are generated daily, it is crucial to have the ability to transfer learned information from one domain to another. Cross-domain classification lets us leverage information learned from one domain in the classification of documents in a new domain.
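The cross-domain setting described in the abstract (train on labeled documents from one domain, classify documents from another domain that supplies no labels) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy documents, the labels, and the choice of a multinomial naive Bayes baseline are invented here and are not taken from the thesis, and the sketch does not implement the thesis's LDA-based algorithms. It only shows how a conventional bag-of-words classifier can lose accuracy when the target domain uses different vocabulary.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model on tokenized, labeled documents."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, tokens):
    """Return the most probable label using add-one (Laplace) smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):  # sorted so ties break deterministically
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in tokens:
            if word in vocab:  # words never seen in the source domain add no evidence
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(predict_nb(model, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)

# Hypothetical source domain: news-style sentences, with labels.
source_docs = [s.split() for s in [
    "the team won the championship match",
    "the striker scored a late goal",
    "the company released a new processor",
    "engineers patched the software vulnerability",
]]
source_labels = ["sports", "sports", "tech", "tech"]

# Hypothetical target domain: chat-style messages. Their labels are used
# only for scoring, never for training.
target_docs = [s.split() for s in [
    "lol did u see that goal",
    "my laptop keeps crashing ugh",
]]
target_labels = ["sports", "tech"]

model = train_nb(source_docs, source_labels)
source_acc = accuracy(model, source_docs, source_labels)
target_acc = accuracy(model, target_docs, target_labels)
print(f"source-domain accuracy: {source_acc:.2f}")
print(f"target-domain accuracy: {target_acc:.2f}")
```

In this toy setup the second chat message shares no words with the source vocabulary, so the baseline can only fall back on the class prior. This kind of vocabulary mismatch between writing styles is one plausible reason a model trained on one domain may not transfer to another.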
Rights
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Related items
Showing items related by title, author, creator and subject.
- Toward an improved method of HSI evaluation in Defense Acquisition
  Simpson, Matthew A. (Monterey, California. Naval Postgraduate School, 2006-12)
  Each of the domains of HSI is of itself a discipline with vast amounts of research, analytic techniques, educational programs, and methods for evaluating the effectiveness of the system with respect to the specific domain. ...
- Bi-criteria risk analysis of domain-specific and cross-domain changes in complex systems
  Doerr, Kenneth H.; Kang, Keebom (2014)
  Government and not-for-profit organizations measure success in terms of their ability to promote an organizational mission. Complex assets in such organizations are acquired in a budget-allocation process which reflects ...
- A Cloud-Oriented Cross-Domain Security Architecture
  Nguyen, D Thuy; Gondree, Mark A.; Shifflet, David J.; Khosalim, Jean; Levin, Timothy E.; Irvine, Cynthia E. (Military Communications Conference (MILCOM 2010), San Jose, CA, 2010-11-07)
  The Monterey Security Architecture addresses the need to share high-value data across multiple domains of different classification levels while enforcing information flow policies. The architecture allows users with different ...