Exploiting Captions for Web Data Mining by Neil C. Rowe

Author

Rowe, Neil C.

Date

2004

Abstract

We survey research on using captions in data mining from the Web. Captions are text that describes some other information (typically multimedia). Since text is considerably easier to index and manipulate than non-text (being usually smaller and less ambiguous), a good strategy for accessing non-text is to index its captions. However, captions are often not obvious on the Web because there are few standards for marking them. Caption references can therefore reside within paragraphs near a media reference, in clickable text or display text for it, in the names of media files, in headings or titles on the page, and in explicit references arbitrarily far from the media. We discuss the range of possible syntactic clues (such as HTML tags) and semantic clues (such as the meanings of particular words), how to quantify their strength, and how to combine their information to arrive at a consensus. We then discuss the problem of mapping information in captions to information in media objects. While this mapping is hard, classes of mapping schemes are distinguishable, and a segmentation of the media can be matched to a parsing of the caption by constraint-satisfaction methods. Active work is addressing the issue of automatically learning the clues for mapping from examples.
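As an illustration of what quantifying clue strengths and combining them into a consensus could look like, the sketch below merges several clues in log-odds space under a naive independence assumption. The abstract does not specify the combination method, so this is not presented as the author's approach; the clue names, the strength values, the prior, and the function `combine_clues` are all hypothetical.

```python
import math

# Hypothetical clue strengths: the estimated probability that a text
# candidate is a caption when the clue is present. Illustrative values
# only; they are not taken from the paper.
CLUE_STRENGTH = {
    "alt_text": 0.85,
    "anchor_text": 0.70,
    "filename_words": 0.60,
    "adjacent_paragraph": 0.40,
    "page_title": 0.30,
}

# Assumed prior probability that an arbitrary candidate is a caption.
PRIOR = 0.2


def logit(p):
    """Convert a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def combine_clues(clues, prior=PRIOR):
    """Combine clue evidence, treating the clues as independent.

    Each clue shifts the log-odds by the difference between its own
    log-odds and the prior's, which is the naive-Bayes update rule.
    """
    score = logit(prior)
    for clue in clues:
        score += logit(CLUE_STRENGTH[clue]) - logit(prior)
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    for clues in ([], ["page_title"], ["alt_text", "anchor_text"]):
        print(clues, round(combine_clues(clues), 3))
```

With no clues the function returns the prior, and with a single clue it returns that clue's own strength. Two agreeing clues reinforce each other, which is the intended consensus behavior but also overstates confidence when clues are correlated, as nearby HTML clues often are.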

Description

This article is to appear in Web Mining: Applications and Techniques, ed. A. Scime, 2004.

Rights

This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.

URI

https://hdl.handle.net/10945/36457

Collections

Faculty and Researchers' Publications