New framework for cross-domain document classification
Schein, Andrew I.
MetadataShow full item record
Automatic text document classification is a fundamental problem in machine learning. Given the dynamic nature and the exponential growth of the World Wide Web, one needs the ability to classify not only a massive number of documents, but also documents that belong to wide variety of domains. Some examples of the domains are e-mails, blogs, Wikipedia articles, news articles, newsgroups, online chats, etc. It is the difference in the writing style that differentiates these domains. Text documents are usually classified using supervised learning algorithms that require large set of pre-labeled data. This requirement, of labeled data, poses a challenge in classifying documents that belong to different domains. Our goal is to classify text documents in the testing domain without requiring any labeled documents from the same domain. Our research develops specialized cross-domain learning algorithms based the distributions over words obtained from a collection of text documents by topic models such as Latent Dirichlet Allocation (LDA). Our major contributions include (1) empirically showing that conventional supervised learning algorithms fail to generalize their learned models across different domains and (2) development of novel and specialized cross-domain classification algorithms that show an appreciable improvement over conventional methods used for cross-domain classification that is consistent for different datasets. Our research addresses many real-world needs. Since massive number of new types of text documents is generated daily, it is crucial to have the ability to transfer learned information from one domain to another domain. Cross-domain classification lets us leverage information learned from one domain for use in the classification of documents in a new domain.
RightsThis publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Showing items related by title, author, creator and subject.
Simpson, Matthew A. (Monterey, California. Naval Postgraduate School, 2006);Each of the domains of HSI is of itself a discipline with vast amounts of research, analytic techniques, educational programs, and methods for evaluating the effectiveness of the system with respect to the specific domain. ...
Doerr, Kenneth H.; Kang, Keebom (2014);Government and not-for-profit organizations measure success in terms of their ability to promote an organizational mission. Complex assets in such organizations are acquired in a budget-allocation process which reflects ...
Nguyen, D Thuy; Gondree, Mark A.; Shifflet, David J.; Khosalim, Jean; Levin, Timothy E.; Irvine, Cynthia E. (Military Communications Conference (MILCOM 2010), San Jose, CA, 2010);The Monterey Security Architecture addresses the need to share high-value data across multiple domains of different classification levels while enforcing information flow policies. The architecture allows users with different ...