A High-Recall Self-Improving Web Crawler That Finds Images Using Captions
Abstract
Finding multimedia objects to meet some need is considerably harder on the World Wide Web than finding text because content-based
retrieval of multimedia is much harder than text retrieval and caption text is inconsistently placed. We describe MARIE-4, a Web "crawler" and
caption filter we have developed that searches the Web to find text likely to be image captions and the associated image
objects. Rather than examining only a few features as existing systems do, it uses a broad set of criteria, including some novel ones, to yield
higher recall than competing systems, which generally focus on high precision. We tested these criteria in careful experiments that
extracted 8140 caption candidates for 4585 representative images, and quantified for the first time the relative value of several kinds of
clues for captions. The crawler is self-improving in that it obtains from experience further statistics as positive and negative clues. We
index the results found by the crawler and provide a user interface. We have done a demonstration implementation of a Web search
engine for all 667,573 publicly-accessible U.S. Navy Web images.
Description
IEEE Intelligent Systems, July/August 2002
Related items
Marie-4: A High-Recall, Self-Improving Web Crawler That Finds Images Using Captions
Rowe, Neil C. (2002); Marie-4, a Web crawler and caption filter, searches the Web to find image captions and the associated image objects. It uses a broad set of criteria to yield higher recall than competing systems, which generally focus ...
Exploiting captions in retrieval of multimedia data
Rowe, Neil C.; Guglielmo, Eugene J. (Monterey, California. Naval Postgraduate School, 1992-07); NPS-CS-92-011; Descriptive natural-language captions can help organize multimedia data. We described our MARIE system that interprets English queries directing the fetch of media objects. It is novel in the extent to which it exploits ...
Exploiting Captions for Web Data Mining by Neil C. Rowe
Rowe, Neil C. (Monterey, California. Naval Postgraduate School, 2004); We survey research on using captions in data mining from the Web. Captions are text that describes some other information (typically, multimedia). Since text is considerably easier to index and manipulate than non-text ...