Efficient and Effective Search Services Over Archival Webs
NSF AWARD IIS-0803605
The Web is enormous and in constant flux, causing much content to be
lost over time. Historical collections of web content are thus of
monumental value in preserving records of significant aspects of modern
society. The Internet Archive offers access to hundreds of billions of
historical web page snapshots. The scale of such archives, however,
presents tremendous challenges to making this content fully searchable.
This NSF-funded research effort investigates efficient and effective approaches to
store, index, and retrieve web content from large-scale historical
archives. In addition, the temporal content and structure of the
archives are mined to exploit temporal characteristics that can improve
search result ranking. Technological advances from this work will be
tested on content from, and in collaboration with, the Internet Archive
and potentially integrated into its infrastructure, enabling new archival
search capabilities for the public.
Participants:
- Investigators: Brian D. Davison (PI: Lehigh University),
  Torsten Suel (co-PI: NYU Polytechnic),
  Kris Carpenter Negulescu (Internet Archive), and
  Gordon Mohr (Internet Archive).
- Research Assistants: Josh Attenberg, Na Dai, Shuai Ding, Jinru He,
  Liangjie Hong, Xiaoguang Qi, Zhenzhen Xue, Hao Yan, and Junyuan Zeng.
- Additional staff at the Internet Archive: Brad Tofel, Vinay Goel,
  and Aaron Binns.
Primary publications:
- H. Yan, S. Ding, and T. Suel. (2009)
  Inverted Index Compression and Query Processing with Optimized
  Document Ordering. In Proceedings of the 18th International World
  Wide Web Conference (WWW), pages 401-410, Madrid, Spain, ACM Press,
  April.
- N. Dai, B. D. Davison, and X. Qi. (2009)
  Looking into the Past to Better Classify Web Spam.
  In Proceedings of the Fifth International Workshop on Adversarial
  Information Retrieval on the Web (AIRWeb), pages 1-8, Madrid, Spain,
  ACM Press, April.
- J. He, H. Yan, and T. Suel. (2009)
  Compact Full-Text Indexing of Versioned Document Collections.
  To be published in Proceedings of the 18th ACM Conference on
  Information and Knowledge Management (CIKM), Hong Kong, ACM Press,
  November.
- N. Dai and B. D. Davison. (2009)
  Vetting the Links of the Web.
  To be published in Proceedings of the 18th ACM Conference on
  Information and Knowledge Management (CIKM), Hong Kong, ACM Press,
  November.
This research grant supports, in part, research projects in the WUME Lab
of the Computer Science and Engineering Department at Lehigh University
and the WEST Lab of the Computer Science and Engineering Department at
NYU Poly.
This material is based upon work supported by the National Science
Foundation under Grant No. 0803605 (III-COR-Medium: Efficient and
Effective Search Services Over Archival Webs). Any opinions, findings,
and conclusions or recommendations expressed in this material are those
of the author(s) and do not necessarily reflect the views of the
National Science Foundation.
Last modified: 6 September 2009