Keynote presentation
Adapting Rankers Online
Maarten de Rijke, Intelligent Systems Lab Amsterdam, NL
Authors: Katja Hofmann, Shimon Whiteson, and Maarten de Rijke
At the heart of many effective approaches to the core information retrieval problem (identifying relevant content) lies the following three-fold strategy: obtaining content-based matches, inferring additional ranking criteria and constraints, and combining all of the above so as to arrive at a single ranking of retrieval units.
As retrieval systems become more complex, learning to rank approaches are being developed to automatically tune the parameters for integrating multiple ways of ranking documents. This is the issue on which we will focus in the talk.
Search engines are typically tuned offline; they are tuned manually or using machine learning methods to fit a specific search environment. These efforts require substantial human resources and are therefore only economical for relatively large groups of users and search environments.
More importantly, they are inherently static and disregard the dynamic nature of search environments, where collections change and users acquire knowledge and adapt their search behaviors. Using online learning to rank approaches, retrieval systems can learn directly from implicit feedback, while they are running.
The talk will discuss three issues around online learning to rank: balancing exploitation and exploration, gathering data using one pair of rankers and using it to compare another pair of rankers, and the use of rich contextual data.
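To make the first issue concrete, the following is a minimal sketch of one way to trade off exploitation and exploration when building a result list. It is an illustration only and not the method presented in the talk: the function name `mix_rankings`, the per-slot mixing rule, and the parameter `epsilon` are assumptions made for this example.

```python
import random


def mix_rankings(exploit, explore, epsilon, rng):
    """Build one result list from two rankings (hypothetical helper).

    Each slot is filled from the exploratory ranking with probability
    `epsilon` and from the exploitative ranking otherwise. Documents that
    were already placed are skipped, so every document appears once.
    """
    merged = []
    seen = set()
    total = len(set(exploit) | set(explore))
    while len(merged) < total:
        # Keep only the documents that have not been shown yet.
        exploit_left = [d for d in exploit if d not in seen]
        explore_left = [d for d in explore if d not in seen]
        if explore_left and (not exploit_left or rng.random() < epsilon):
            doc = explore_left[0]   # explore: take the candidate ranker's pick
        else:
            doc = exploit_left[0]   # exploit: take the current best ranker's pick
        merged.append(doc)
        seen.add(doc)
    return merged


if __name__ == "__main__":
    rng = random.Random(42)
    print(mix_rankings(["a", "b", "c"], ["c", "a", "d"], epsilon=0.2, rng=rng))
```

With `epsilon` set to 0 the list follows the exploitative ranking, and with `epsilon` set to 1 it follows the exploratory one; values in between show mostly trusted results while still collecting feedback on alternatives.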
Maarten de Rijke is full professor of Information Processing and Internet in the Informatics Institute at the University of Amsterdam. He holds MSc degrees in Philosophy and Mathematics (both cum laude), and a PhD in Theoretical Computer Science. He worked as a postdoc at CWI, before becoming a Warwick Research Fellow at the University of Warwick, UK. He joined the University of Amsterdam in 1998, and was appointed full professor in 2004.
He leads the Information and Language Processing Systems group, one of the leading academic research groups in information retrieval in Europe. During the most recent computer science research assessment exercise, the group achieved maximal scores on all dimensions. De Rijke's current focus is on intelligent web information access, with projects on search and discovery for social media, vertical search engines, machine learning for information retrieval, semantic search and multilingual information.
A Pionier personal innovational research incentives grant laureate (comparable to an advanced ERC grant), De Rijke has generated over 30MEuro in project funding. With an h-index of 39 he has published close to 500 papers, has published or edited over a dozen books, is editor for various journals and book series, and a former coordinator of retrieval evaluation tracks at TREC, CLEF and INEX (Blog, Web, Question answering). He is general co-chair for the CLEF 2011 conference, the director of the University of Amsterdam's Intelligent Systems Lab (ISLA), its Information Science bachelor program and its Center for Creation, Content and Technology (CCCT).
IRF Conference 2011 Accepted Papers
Expanding queries with term and phrase translations in patent retrieval
Authors: Charles Jochim, Christina Lioma and Hinrich Schuetze
Session: 10:45 - 12:00 Session 1: Patents and Multilinguality
Keywords: patent retrieval, cross-language information retrieval, query translation, statistical machine translation, relevance feedback, query expansion
Abstract: Patent retrieval is a branch of Information Retrieval (IR) that aims to enable the challenging task of retrieving highly technical and often complicated patents. Typically, patent granting bodies translate patents into several major foreign languages, so that language boundaries do not hinder their accessibility. Given such multilingual patent collections, we posit that the patent translations can be exploited for facilitating patent retrieval. Specifically, we focus on the translation of patent queries from German and French, the morphology of which poses an extra challenge to retrieval. We compare two translation approaches that expand the query with (i) translated terms and (ii) translated phrases. Experimental evaluation on a standard CLEF-IP dataset reveals a novel finding: phrase translation may be more suited to French, and term translation may be more suited to German. We trace this to language morphology, and we conclude that tailoring the query translation per language leads to improved results in patent retrieval.
Query Expansion for Language Modeling using Sentence Similarities
Authors: Debasis Ganguly, Johannes Leveling and Gareth Jones
Session: 13:00 - 14:15 Session 2: Interactive Retrieval Support
Keywords: Language Modeling, Blind Relevance Feedback, Dirichlet prior, Query drift
Abstract: We propose a novel method of query expansion for Language Modeling (LM) in Information Retrieval (IR) based on the similarity of the query with sentences in the top ranked documents. We argue that the terms in the expanded query obtained by the proposed method roughly follow a Dirichlet distribution, which, being the conjugate prior of the multinomial distribution used in the LM retrieval model, helps the feedback step.
Information retrieval experiments on the TREC ad-hoc collection and topics using the sentence based query expansion (SBQE) show a significant increase in Mean Average Precision (MAP) compared to the baselines of standard term-based query expansion using LM selection score and the Relevance Model (RLM).
The proposed approach of query expansion for LM increases the likelihood of generation of the pseudo-relevant documents by adding sentences with maximum term overlap with the query sentences for each top ranked pseudo-relevant document, thus making the query look more like these documents.
A per topic analysis shows that the new method hurts fewer queries than the baseline methods and improves the average precision (AP) over a broad range of queries ranging from easy to difficult in terms of the initial retrieval AP. We also show that the new method is able to add a higher number of good feedback terms (the gold standard of good terms is the set of terms added by True Relevance Feedback). Experiments on the hard topics of the TREC-2004 Robust track show that the new method is able to improve MAP by 5.7% without the use of external resources and query hardness prediction typically used for these topics.
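The core idea of adding the sentences with maximum term overlap to the query can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the whitespace tokenizer, the period-based sentence splitter, the fixed `sentences_per_doc` parameter, and all function names are assumptions made for this example.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def split_sentences(document):
    """Split on periods; real systems would use a proper sentence splitter."""
    return [s.strip() for s in document.split(".") if s.strip()]


def expand_query(query, top_documents, sentences_per_doc=1):
    """Append, for each top-ranked document, the sentences that share the
    most terms with the query. Returns the expanded query as a token list."""
    query_terms = set(tokenize(query))
    expanded = tokenize(query)
    for document in top_documents:
        sentences = split_sentences(document)
        # Rank sentences by the number of distinct terms shared with the query.
        ranked = sorted(
            sentences,
            key=lambda s: len(query_terms & set(tokenize(s))),
            reverse=True,
        )
        for sentence in ranked[:sentences_per_doc]:
            expanded.extend(tokenize(sentence))
    return expanded


if __name__ == "__main__":
    docs = [
        "Patent search is hard. Cats sleep a lot.",
        "Dogs bark. Prior art search helps examiners.",
    ]
    print(expand_query("patent search", docs))
```

Because whole sentences are added, query terms that recur in the selected sentences are repeated in the expanded query, which is what makes the query look more like the pseudo-relevant documents.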
Word Clouds of Multiple Search Results
Authors: Rianne Kaptein and Jaap Kamps
Session: 13:00 - 14:15 Session 2: Interactive Retrieval Support
Keywords: word clouds, language models, SERP summarization
Abstract: Search engine result pages (SERPs) are known as the most expensive real estate on the planet. Most queries yield millions of organic search results, yet searchers seldom look beyond the first handful of results. To make things worse, different searchers with different query intents may issue the exact same query. There is a fast-growing body of research on personalizing, customizing, diversifying, and aggregating search results. Current SERPs are but a distant reflection of the raw ranking of organic results, and Boolean choices are made to include or exclude certain types of results, essentially trying to pick the five results that will make everyone happy. An alternative to showing individual web pages summarized by snippets is to represent whole groups of results. In this paper we investigate if we can use word clouds to summarize groups of documents, e.g. to give a preview of the next SERP, or of clusters of related documents. We experiment with three word cloud generation methods (full-text, query biased and anchor text based clouds) and evaluate them in a user study. Our findings are: First, biasing the cloud towards the query does not lead to test persons better distinguishing relevance and topic of the search results, but is preferred by the test persons more often than the full-text clouds, because it emphasizes the differences between the clouds. Second, our test persons can distinguish the relevance and topic of groups of search results better using clouds generated from anchor text than using full-text, and test persons also prefer the anchor text clouds. Anchor text contains fewer noisy words than the full text of documents. Third, we obtain moderately positive results on the relation between the selected word clouds and the underlying search results: there is exact correspondence in 70% of the subtopic matching judgments and in 60% of the relevance assessment judgments. Our initial experiments open up new possibilities to have SERPs reflect a far larger number of results by using word clouds to summarize groups of search results.
Supporting Arabic Cross-Lingual Retrieval using Contextual Information
Authors: Farag Ahmed, Andreas Nürnberger and Marcus Nitsche
Session: 10:45 - 12:00 Session 1: Patents and Multilinguality
Keywords: cross lingual information retrieval, word sense disambiguation, Arabic
Abstract: One of the main problems that impact the performance of cross-language information retrieval (CLIR) systems is how to disambiguate translations and, since this usually cannot be done completely automatically, how to smoothly integrate a user in this disambiguation process. In order to ensure that a user has a certain confidence in selecting a translation that he or she possibly cannot even read or understand, we have to make sure that the system has provided sufficient information about translation alternatives and their meaning. In this paper, we present a CLIR tool that automatically translates the user query and provides possibilities to interactively select relevant terms using contextual information. This information is obtained from a parallel corpus to describe the translation in the user's query language. Furthermore, a user study was conducted to identify weaknesses in both the disambiguation algorithm and the interface design. The outcome of this user study leads to a much clearer view of how and what CLIR should offer to users.
Applying Web Usage Mining for Adaptive Intranet Navigation
Authors: Sharhida Zawani Saad and Udo Kruschwitz
Session: 14:30 - 15:50 Session 3: IR and the Net
Keywords: Web usage mining, Adaptive Web sites, Evaluation
Abstract: Much progress has recently been made in assisting a user in the search process, be it Web search, where the big search engines have now all incorporated more interactive features, or be it online shopping, where customers are commonly recommended items that appear to match the customer's interest.
While assisted Web search relies very much on implicit information such as the users' search behaviour, recommender systems typically rely on explicit information, expressed for example by a customer purchasing an item. Surprisingly little progress has however been made in making navigation of a Web site more adaptive. Web sites can be difficult to navigate as they tend to be rather static and a new user has no idea what documents are most relevant to his or her need. We try to assist a new user by exploiting the navigation behaviour of previous users. On a university Web site for example, the target users change constantly. What we propose is to exploit the navigation behaviour of existing users so that we can make the Web site more adaptive by introducing links and suggestions to commonly visited pages without changing the actual Web site. We simply add a layer on top of the existing site that makes recommendations regarding links found on the page or pages that are further away but have been typical landing pages whenever a user visited the current Web page. This addresses a real business need not just for university Web pages but similarly for company sites where new employees join the company and others retire.
This paper reports on a task-based evaluation that demonstrates that the idea is very effective. Introducing suggestions as outlined above was not only preferred by the users in our study but also allowed them to get to the results more quickly.
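The idea of suggesting pages that earlier visitors typically reached after the current page can be sketched with simple counting over navigation sessions. This is an illustrative sketch and not the system evaluated in the paper: the session format (one list of page names per visit), the function name `build_suggestions`, and the `top_n` parameter are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_suggestions(sessions, top_n=3):
    """For every page, list the pages most often visited later in the same
    session. `sessions` is a list of page-name lists, one per visit."""
    followers = defaultdict(Counter)
    for session in sessions:
        for position, page in enumerate(session):
            # Count each later page once per session, whether it is a
            # direct link target or a landing page several clicks away.
            for later_page in set(session[position + 1:]):
                if later_page != page:
                    followers[page][later_page] += 1
    return {
        page: [target for target, _ in counts.most_common(top_n)]
        for page, counts in followers.items()
    }


if __name__ == "__main__":
    logs = [
        ["home", "courses", "exams"],
        ["home", "exams"],
        ["home", "library", "exams"],
    ]
    print(build_suggestions(logs, top_n=1))
```

The suggestions are computed from logs alone, so they can be shown in a layer on top of the existing site without changing the site itself.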
Search Result Caching in Peer-to-Peer Information Retrieval Networks
Authors: Almer Tigelaar, Djoerd Hiemstra and Dolf Trieschnigg
Session: 14:30 - 15:50 Session 3: IR and the Net
Keywords: distributed query processing, peer-to-peer simulation, large-scale distributed systems
Abstract: For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.
[This paper is a full version of an earlier extended abstract]
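The comparison between a bounded and an unbounded result cache can be illustrated by replaying a query log against both. The sketch below is not the simulation used in the paper: the least-recently-used eviction policy, the function name `hit_ratio`, and the toy query log are assumptions made for this example.

```python
from collections import OrderedDict


def hit_ratio(queries, capacity=None):
    """Replay `queries` against a result cache and return the fraction of
    queries answered from the cache. `capacity=None` means unbounded;
    otherwise the least recently used entry is evicted when full."""
    cache = OrderedDict()
    hits = 0
    for query in queries:
        if query in cache:
            hits += 1
            cache.move_to_end(query)       # mark as most recently used
        else:
            cache[query] = True            # placeholder for the stored results
            if capacity is not None and len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(queries) if queries else 0.0


if __name__ == "__main__":
    log = ["a", "b", "a", "c", "a", "b"]
    print("unbounded:", hit_ratio(log))
    print("bounded (2 entries):", hit_ratio(log, capacity=2))
```

Every cache hit is a query that does not need to be forwarded through the network, so the hit ratio gives a rough indication of how much query load caching removes.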
Building Queries for Prior Art Search
Authors: Parvaz Mahdabi, Mostafa Keikha, Shima Gerani, Monica Landoni and Fabio Crestani
Session: 10:45 - 12:00 Session 1: Patents and Multilinguality
Keywords: Information Retrieval, Patent Retrieval, Prior-Art Search, Query Generation
Abstract: Prior art search is a critical step in the examination procedure of a patent application. This study explores automatic query generation from patent documents to facilitate the time-consuming and labor-intensive search for relevant patents. It is essential for this task to identify discriminative terms in different sections of a query patent, which enable us to distinguish relevant patents from non-relevant patents. To this end we investigate the term distribution of words occurring in different sections of the query patent and compare it with the rest of the collection using language modeling estimation techniques. We experiment with term weighting based on the Kullback-Leibler divergence between the query patent and the collection and also with parsimonious language model estimation. Both of these techniques promote words that are common in the query patent and are rare in the collection. We also incorporate the classification assigned to patent documents into our model, to exploit the available human judgements in the form of a hierarchical classification. Experimental results show the effectiveness of the generated queries, particularly in terms of recall, while the patent description proved to be the most useful source for extracting terms.
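Term weighting based on the Kullback-Leibler divergence can be sketched by scoring each term with its pointwise contribution p(w|Q) log(p(w|Q) / p(w|C)), where Q is the query patent and C the collection. The sketch below is an illustration, not the authors' model: maximum-likelihood estimates are used, the query patent's counts are added to the collection counts so that no probability is zero, and the function name `kl_term_scores` is hypothetical.

```python
import math
from collections import Counter


def kl_term_scores(query_tokens, collection_tokens):
    """Score each query-patent term by p(w|Q) * log(p(w|Q) / p(w|C)).

    Assumption: the query patent is treated as part of the collection, so
    its counts are added to the collection counts to avoid zero
    probabilities."""
    query_counts = Counter(query_tokens)
    collection_counts = Counter(collection_tokens) + query_counts
    query_length = sum(query_counts.values())
    collection_length = sum(collection_counts.values())
    scores = {}
    for term, frequency in query_counts.items():
        p_query = frequency / query_length
        p_collection = collection_counts[term] / collection_length
        scores[term] = p_query * math.log(p_query / p_collection)
    return scores


if __name__ == "__main__":
    patent = ["valve", "valve", "the", "the"]
    collection = ["the"] * 6 + ["pump"] * 2
    ranked = sorted(kl_term_scores(patent, collection).items(),
                    key=lambda item: item[1], reverse=True)
    print(ranked)
```

Terms that are frequent in the query patent but rare in the collection receive high positive scores, while terms that are common everywhere score near zero or below, so the top-scoring terms are natural candidates for the generated query.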
Combining Interaction and Content for Feedback-Based Ranking
Authors: Emanuele Di Buccio, Massimo Melucci and Dawei Song
Session: 13:00 - 14:15 Session 2: Interactive Retrieval Support
Keywords: Interaction Features, Geometric Model, Implicit Feedback, Pseudo-Relevance Feedback
Abstract: The paper is concerned with the design and the evaluation of the combination of user interaction and informative content features for implicit and pseudo feedback-based document re-ranking. The features are observed during the visit of the top-ranked documents returned in response to a query. The feature combination is supervised by a geometric model which seamlessly joins the conceptual design with the logical design and then with the experiments. TREC Web test collection-based experiments have been carried out and the experimental outcomes are illustrated. We report that the effectiveness of the combination of user interaction for implicit feedback depends on whether the document re-ranking is on a single-user or a user-group basis. Moreover, document re-ranking on a user-group basis can be adopted to improve pseudo feedback performance by providing more effective documents for query expansion.
Free-Text Search over Complex Web Forms
Authors: Kien Tjin-Kam-Jet, Dolf Trieschnigg and Djoerd Hiemstra
Session: 14:30 - 15:50 Session 3: IR and the Net
Keywords: query processing, free-text interfaces, query translation
Abstract: This paper investigates the problem of using free-text queries as an alternative means for searching 'behind' web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.
Multilingual Document Clustering using Wikipedia as External Knowledge
Authors: Kiran Kumar N, Santosh GSK and Vasudeva Varma
Session: 14:30 - 15:50 Session 3: IR and the Net
Keywords: Multilingual Document Clustering, Wikipedia, Document Representation
Abstract: This paper presents Multilingual Document Clustering (MDC) on comparable corpora. Wikipedia, a structured multilingual knowledge base, has been highly exploited in many monolingual clustering approaches and also in comparing multilingual corpora. But there is no prior work which studied the impact of Wikipedia on MDC. Here, we present an in-depth study on using Wikipedia to enhance MDC performance. We tried to leverage the Wikipedia knowledge structure (Crosslingual links, Category, Outlinks, Infobox information, etc.) to enrich the document representation for clustering multilingual documents. By avoiding language-specific tools, this approach becomes a general framework which can easily be extended to other languages. We have experimented with the bisecting k-means clustering algorithm on a standard dataset provided by FIRE for their 2010 Adhoc Cross-Lingual document retrieval task on Indian languages. We have considered English and Hindi datasets. The system is evaluated using F-score and Purity measures and the results obtained are encouraging.
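Of the two evaluation measures mentioned, Purity is the simpler to illustrate: each cluster is credited with its most frequent gold-standard class, and the credited documents are divided by the total number of documents. The sketch below uses this standard definition; the function name `purity` and the toy labels are assumptions made for this example and do not come from the paper.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their
    cluster. Both arguments are parallel lists, one entry per document."""
    if not class_labels:
        return 0.0
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    # Credit each cluster with the size of its most frequent class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    topics = ["x", "x", "y", "y", "y", "x"]
    print(purity(clusters, topics))
```

A purity of 1.0 means every cluster contains documents of a single class; since purity rises as clusters get smaller, it is usually reported together with a measure such as F-score.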