TREC-CHEM '10
The TREC-CHEM evaluation campaign aims to create a reference collection for chemical information retrieval engines. Both academics and industry practitioners can use it to evaluate their own systems or third-party systems in which they intend to invest.
Building on the substantial progress made in information retrieval (IR) in both theoretical models and evaluation, research in domain-specific IR has recently received more and more attention, as evidenced by the organisation of the Genomics and Legal tracks in TREC. Now is the right time to carry out large-scale evaluations on chemistry datasets in order to promote research in chemical IR in general and chemical patent IR in particular. Accordingly, we are organising a chemical IR track in TREC to address the challenges of chemical and patent IR.
We will provide a test collection consisting of full-text chemical patents from the Information Retrieval Facility (IRF) and research papers from several publishers (see below). The aim is to identify how well current IR methods adapt to text containing chemical names and formulas. Although it is not a prerequisite, we encourage participants to use entity identification methods to extract and index chemicals. The evaluation process will combine the pooling/sampling/expert evaluation approach frequently used in TREC with an automatic evaluation method based on references in patent documents.
For the most up-to-date information, please also visit our wiki. If you are a participant, please remember to subscribe to our mailing list, trec-chem@ir-facility.org.
Data collection
The 2010 TREC-CHEM data collection is very similar to the 2009 collection, but larger.
Chemical patent documents come from the MAREC collection and include all patent documents classified in section C or subclass A61K of the International Patent Classification (IPC). The total number of documents in this set is approximately 2 million.
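The selection rule above (keep a patent if any of its IPC codes falls under C or A61K) can be illustrated with a minimal sketch. It assumes each document's IPC codes are already available as plain strings such as "C07D 401/12"; the actual MAREC file format is not described here, and the function name, document identifiers and sample codes are hypothetical.

```python
def is_chemical_patent(ipc_codes):
    """Return True if any IPC code falls under C or A61K.

    Assumption: codes are plain strings like "C07D 401/12"; this is an
    illustrative filter, not the procedure used to build the collection.
    """
    for code in ipc_codes:
        # Normalise case and spacing so "a61k31/00" and "A61K 31/00" match.
        normalized = code.strip().upper().replace(" ", "")
        if normalized.startswith("C") or normalized.startswith("A61K"):
            return True
    return False


# Hypothetical documents mapped to their IPC codes.
sample = {
    "DOC-1": ["C07D 401/12", "A61P 35/00"],
    "DOC-2": ["A61K 31/44"],
    "DOC-3": ["H04L 9/32"],
    "DOC-4": ["a61k31/00"],
    "DOC-5": ["A61B 5/00"],
}

selected = [doc_id for doc_id, codes in sample.items() if is_chemical_patent(codes)]
print(selected)
```

A document with several classifications is kept as soon as one of them matches, which is why DOC-1 is selected even though one of its codes lies outside both categories.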
Scientific articles come from several publishers this year:
- The Royal Society of Chemistry - all articles of 31 journals, in XML format since 1997, with images since 2001
- Oxford University Press - Nucleic Acids Research (NAR) Journal, in XML/SGML since 2005, with images since 2008
- Hindawi Publishers - Advances in Physical Chemistry Journal, in XML with images since 2008 (other journals are in the pipeline)
- IUCr Journals - Acta Crystallographica Section E, in XML with images, 2009
- Molecular Diversity Preservation International - International Journal of Molecular Sciences, coverage to be determined
- All other relevant open-access journals in PubMed Central
Tasks
This year's tasks will continue the 2009 tasks, namely "Prior Art Search" and "Technology Survey Search". Additional tasks may still be added.
Evaluation
We use two types of evaluation: automatic evaluation based on patent citations (Prior Art task) and manual judgements by students and experts (Technology Survey task).
Compared to 2009, this year we aim to make the evaluation process for the Technology Survey (TS) task more domain-specific by introducing specific relevance judgements (e.g. "has compound", "has disease").
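The citation-based evaluation of the Prior Art task can be sketched as follows: the documents cited by a query patent are treated as the relevant set, and a system's ranked list is scored against it. The measures shown (average precision and recall at a cutoff) are common IR measures chosen for illustration; the official measures of the track are not specified here, and all identifiers are hypothetical.

```python
def average_precision(ranked_ids, cited_ids):
    """Average precision of a ranked list, treating cited documents as relevant."""
    relevant = set(cited_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a cited document is retrieved.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def recall_at_k(ranked_ids, cited_ids, k):
    """Fraction of cited documents found in the top k results."""
    relevant = set(cited_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# Hypothetical run for one query patent that cites P1 and P2.
ranked = ["P3", "P1", "P7", "P2"]
cited = {"P1", "P2"}

ap = average_precision(ranked, cited)
r2 = recall_at_k(ranked, cited, k=2)
print(ap, r2)
```

Because citations are already present in the patent documents, this kind of scoring needs no human assessors, which is what distinguishes it from the manual judgements used for the Technology Survey task. The sketch assumes each document appears at most once in the ranked list.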
Organizers
As in 2009, TREC-CHEM 2010 is a collaboration between the Information Retrieval Facility in Vienna, Austria, University College London, UK, and York University, Canada. It is supported by the National Institute of Standards and Technology (NIST), USA.
- John Tait, Information Retrieval Facility, Vienna, Austria
- Jianhan Zhu, University College London, UK
- Jimmy Xiangji Huang, York University, Toronto, Canada
- Mihai Lupu, Information Retrieval Facility, Vienna, Austria