Exploiting Captions for Web Data Mining by Neil C. Rowe
Rowe, Neil C.
We survey research on using captions in data mining from the Web. Captions are text that describes some other information (typically multimedia). Since text is considerably easier to index and manipulate than non-text (being usually smaller and less ambiguous), a good strategy for accessing non-text is to index its captions. However, captions are often not obvious on the Web because there are few standards. Captions can reside within paragraphs near a media reference, in its clickable text or display text, in the names of media files, in headings or titles on the page, and in explicit references arbitrarily far from the media. We discuss the range of possible syntactic clues (such as HTML tags) and semantic clues (such as meanings of particular words), and how to quantify their strength and combine their information to arrive at a consensus. We then discuss the problem of mapping information in captions to information in media objects. While this mapping is hard, classes of mapping schemes are distinguishable, and a segmentation of the media can be matched to a parsing of the caption by constraint-satisfaction methods. Active work is addressing the issue of automatically learning the clues for mapping from examples.
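The abstract's idea of quantifying clue strengths and combining them into a consensus can be illustrated with a minimal sketch. It assumes a naive-Bayes-style combination of independent clues in log-odds space, which is one plausible scheme and not necessarily the one the article uses. The clue names, the likelihood ratios, the prior, and the functions `caption_probability` and `rank_candidates` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical likelihood ratios, P(clue | caption) / P(clue | not caption).
# The clue names and numbers are illustrative assumptions, not values from the article.
CLUE_LIKELIHOOD_RATIOS = {
    "alt_text": 6.0,            # display text attached to the media reference
    "anchor_text": 3.0,         # clickable text linking to the media
    "filename_words": 2.0,      # words extracted from the media file name
    "adjacent_paragraph": 1.5,  # paragraph next to the media reference
    "page_title": 1.2,          # heading or title on the page
    "far_from_media": 0.4,      # text far from the media counts against
}


def caption_probability(clues, prior=0.1):
    """Combine clue strengths into one probability, assuming independent clues."""
    log_odds = math.log(prior / (1.0 - prior))
    for clue in clues:
        log_odds += math.log(CLUE_LIKELIHOOD_RATIOS[clue])
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank_candidates(candidates, prior=0.1):
    """Return (probability, text) pairs sorted from most to least caption-like."""
    scored = [(caption_probability(clues, prior), text) for text, clues in candidates]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    candidates = [
        ("Aerial view of the harbor", ["alt_text", "anchor_text"]),
        ("Contact the webmaster", ["far_from_media"]),
        ("harbor aerial", ["filename_words"]),
    ]
    for probability, text in rank_candidates(candidates):
        print(f"{probability:.3f}  {text}")
```

In practice the ratios would be estimated from labeled example pages instead of set by hand, in the spirit of the clue learning mentioned at the end of the abstract.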
This article is to appear in Web Mining: Applications and Techniques, ed. A. Scime, 2004.
Rights: This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.
Showing items related by title, author, creator and subject.
Rowe, Neil C. (Monterey, California. Naval Postgraduate School, 1998-07); We discuss the obstacles to inference of correspondences between objects within photographic images and their counterpart concepts in descriptive captions of those images. This is important for information retrieval of ...
Rowe, Neil C. (Naval Postgraduate School, Monterey CA, 2004); A key problem in indexing technical information is the interpretation of technical words and word senses, expressions not used in everyday language. This is important for captions on technical images, whose often pithy ...
Rowe, Neil C. (Monterey, California. Naval Postgraduate School, 2004-06); The easiest way to index multimedia from ordinary Web pages is to find their captions. However, captions are not used consistently, and retrieval effectiveness for caption-based multimedia browsers is significantly poorer ...