Exploiting Captions for Web Data Mining by Neil C. Rowe
Abstract
We survey research on using captions in data mining from the Web. Captions are text that describes some other information
(typically, multimedia). Since text is considerably easier to index and manipulate than non-text (being usually smaller and less
ambiguous), a good strategy for accessing non-text is to index its captions. However, captions are often not obvious on the Web because
there are few standards. Caption text can therefore reside within paragraphs near a media reference, in clickable text or display text for
it, in the names of media files, in headings or titles on the page, and in explicit references arbitrarily far from the media. We discuss the
range of possible syntactic clues (such as HTML tags) and semantic clues (such as meanings of particular words). We discuss how to
quantify their strength and combine their information to arrive at a consensus. We then discuss the problem of mapping information in
captions to information in media objects. While this mapping is hard, classes of mapping schemes are distinguishable, and segmentation of the
media can be matched to a parsing of the caption by constraint-satisfaction methods. Active work is addressing the issue of
automatically learning the clues for mapping from examples.
Description
This article is to appear in Web Mining: Applications and Techniques, ed. A. Scime, 2004.
Rights
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Related items
- Mapping between image regions and caption concepts of captioned depictive photographs
Rowe, Neil C. (Monterey, California. Naval Postgraduate School, 1998-07); We discuss the obstacles to inference of correspondences between objects within photographic images and their counterpart concepts in descriptive captions of those images. This is important for information retrieval of ...
- Understanding of Navy Technical Language via Statistical Parsing
Rowe, Neil C. (Naval Postgraduate School, Monterey CA, 2004); A key problem in indexing technical information is the interpretation of technical words and word senses, expressions not used in everyday language. This is important for captions on technical images, whose often pithy ...
- Using Context to Disambiguate Web Captions
Rowe, Neil C. (Monterey, California. Naval Postgraduate School, 2004-06); The easiest way to index multimedia from ordinary Web pages is to find their captions. However, captions are not used consistently, and retrieval effectiveness for caption-based multimedia browsers is significantly poorer ...