A High-Recall Self-Improving Web Crawler That Finds Images Using Captions

Authors
Rowe, Neil C.
Subjects
Images
captions
World Wide Web
software agents
data mining
digital libraries
information filtering
keywords
parsing
image processing
probabilistic reasoning
servlets
Advisors
Date of Issue
2002-07
Date
July / August 2002
Publisher
Language
Abstract
Finding multimedia objects to meet some need is considerably harder on the World Wide Web than finding text, because content-based retrieval of multimedia is much harder than text retrieval and caption text is inconsistently placed. We describe MARIE-4, a Web "crawler" and caption filter we have developed that searches the Web to find text likely to be image captions and its associated image objects. Rather than examining a few features, as existing systems do, it uses a broad set of criteria, including some novel ones, to yield higher recall than competing systems, which generally focus on high precision. We tested these criteria in careful experiments that extracted 8140 caption candidates for 4585 representative images, and quantified for the first time the relative value of several kinds of clues for captions. The crawler is self-improving in that it obtains from experience further statistics to use as positive and negative clues. We index the results found by the crawler and provide a user interface. We have built a demonstration implementation of a Web search engine for all 667,573 publicly accessible U.S. Navy Web images.
Type
Conference Paper
Description
IEEE Intelligent Systems, July/August 2002
Series/Report No
Department
Organization
Identifiers
NPS Report Number
Sponsors
Funder
Format
Citation
IEEE Intelligent Systems, July/August 2002
Distribution Statement
Approved for public release; distribution is unlimited.
Rights
Collections