Source Retrieval 2012
Synopsis
- Task: Given a suspicious document and a web search API, your task is to retrieve all plagiarized sources while minimizing retrieval costs.
- Input: [data]
- Baseline: [code]
Input
To develop your software, we provide you with a training corpus that consists of suspicious documents. Each suspicious document is about a specific topic and may contain plagiarized passages obtained from web pages on that topic found in the ClueWeb09 corpus.
API
If you are not in possession of the ClueWeb09 corpus, we also provide access to two search engines which index the ClueWeb, namely the Lemur Indri search engine and the ChatNoir search engine. To programmatically access these two search engines, we provide a unified search API.
Note: To better separate the source retrieval task from the text alignment task, the API provides a text alignment oracle feature. For each document you request to download from the ClueWeb, the text alignment oracle discloses whether this document is a source of plagiarism for the suspicious document in question. In addition, the plagiarized text is returned. This way, participation in the source retrieval task does not require the development of a text alignment solution. However, you are free to use your own text alignment if you want to.
Output
For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an interaction log suspicious-documentXYZ.log which logs meta information about your retrieval process:
Timestamp [Query|Download_URL]
1258326592 barack obama family tree
1258326597 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=110212744
1258326598 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=10221241
1258326599 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=100003305377
1258326605 barack obama genealogy
1258326610 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=82208332
...
For example, the above file would specify that at 1258326592 (Unix timestamp) the query "barack obama family tree" was sent, and that three of the retrieved documents were subsequently selected for download before the next query was sent.
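To make the log format concrete, the following is a minimal Python sketch of a reader for such a log. The names `LogEntry`, `parse_log`, and `group_by_query` are illustrative and not part of any provided tooling. The sketch assumes that each line holds a numeric timestamp, one space, and then either a query or a URL, and that any value starting with `http://` or `https://` is a download.

```python
from typing import List, NamedTuple, Tuple


class LogEntry(NamedTuple):
    timestamp: int  # Unix timestamp in seconds
    kind: str       # "query" or "download"
    value: str      # query text or download URL


def parse_log(text: str) -> List[LogEntry]:
    """Parse 'timestamp value' lines; skip anything without a numeric timestamp."""
    entries = []
    for line in text.splitlines():
        stamp, _, value = line.strip().partition(" ")
        if not stamp.isdigit() or not value:
            continue  # header line, "..." placeholder, or blank line
        # Assumption: URLs are downloads, everything else is a query.
        is_url = value.startswith(("http://", "https://"))
        entries.append(LogEntry(int(stamp), "download" if is_url else "query", value.strip()))
    return entries


def group_by_query(entries: List[LogEntry]) -> List[Tuple[str, List[str]]]:
    """Attach each download to the most recent preceding query."""
    groups: List[Tuple[str, List[str]]] = []
    for entry in entries:
        if entry.kind == "query":
            groups.append((entry.value, []))
        elif groups:
            groups[-1][1].append(entry.value)
    return groups
```

Applied to the example log, `group_by_query(parse_log(text))` would yield two groups: the first query with three downloads and the second query with one.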
Evaluation
Performance will be measured based on the following five scores, each averaged over all suspicious documents:
- Number of queries submitted.
- Number of web pages downloaded.
- Precision and recall of web pages downloaded regarding actual sources of a suspicious document.
- Number of queries until the first actual source is found.
- Number of downloads until the first actual source is downloaded.
The first three measures capture the overall behavior of a system, and the last two assess the time to first result. The quality of identifying reused passages between documents is not taken into account here. Note that retrieving a duplicate of a source document is considered a true positive, whereas retrieving more than one duplicate of the same source document does not improve performance.
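As an illustration of how these scores relate to one interaction log, here is a rough Python sketch. It is not the official evaluation code. The function name `evaluate_log` and the event representation (ordered `("query" | "download", value)` pairs) are assumptions made for this example. Precision and recall are computed over distinct downloaded URLs, and duplicate detection for source documents is deliberately left out.

```python
from typing import Dict, List, Optional, Set, Tuple


def evaluate_log(events: List[Tuple[str, str]], true_sources: Set[str]) -> Dict[str, Optional[float]]:
    """Compute per-document scores from ordered (kind, value) events.

    Simplification: a download counts as a true positive only if its URL is
    literally in `true_sources`; duplicates of sources are not recognized.
    """
    queries = downloads = 0
    first_queries: Optional[int] = None
    first_downloads: Optional[int] = None
    downloaded: Set[str] = set()

    for kind, value in events:
        if kind == "query":
            queries += 1
            continue
        downloads += 1
        downloaded.add(value)
        if first_downloads is None and value in true_sources:
            # Workload spent up to and including the first true source.
            first_queries, first_downloads = queries, downloads

    hits = downloaded & true_sources
    return {
        "queries": queries,
        "downloads": downloads,
        "precision": len(hits) / len(downloaded) if downloaded else 0.0,
        "recall": len(hits) / len(true_sources) if true_sources else 0.0,
        "queries_to_first": first_queries,      # None if no source was found
        "downloads_to_first": first_downloads,  # None if no source was found
    }
```

Averaging each returned value over all suspicious documents would give corpus-level scores. How documents without any detection enter the time-to-first-result averages is not specified here, so the sketch only reports `None` for them.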
Baseline
For your convenience, we provide a baseline program written in Python. The program loops through the suspicious documents in a given directory and outputs a search interaction log for each of them. The log is valid with respect to the output format described above. You may use the source code as a starting point for your own approach.
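The overall shape of such a program might look like the following sketch. This is not the provided baseline: the query strategy (fixed-size word chunks) is a placeholder, and `search` and `download` are hypothetical callables standing in for whatever client you write against the search API, whose actual interface is not shown here.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def make_queries(text: str, words_per_query: int = 5, max_queries: int = 3) -> List[str]:
    """Placeholder strategy: cut the document into consecutive word chunks."""
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_query])
              for i in range(0, len(words), words_per_query)]
    return chunks[:max_queries]


def process_corpus(corpus_dir: str,
                   search: Callable[[str], Iterable[str]],   # hypothetical: query -> result URLs
                   download: Callable[[str], object],        # hypothetical: URL -> document
                   top_k: int = 3,
                   clock: Callable[[], float] = time.time) -> None:
    """Write one suspicious-documentXYZ.log next to each suspicious-documentXYZ.txt."""
    for doc in sorted(Path(corpus_dir).glob("suspicious-document*.txt")):
        lines = []
        for query in make_queries(doc.read_text(encoding="utf-8")):
            lines.append(f"{int(clock())} {query}")
            for url in list(search(query))[:top_k]:
                download(url)  # a real system would inspect the result here
                lines.append(f"{int(clock())} {url}")
        doc.with_suffix(".log").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Passing the search and download functions in as parameters keeps the loop independent of a particular search engine and makes it easy to test with stubs.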
Results
Source Retrieval Performance

Queries to 1st Detection | Downloads to 1st Detection | Precision (Downloaded Sources) | Recall (Downloaded Sources) | Team |
---|---|---|---|---|
4.47 | 25.88 | 0.0182 | 0.5567 | L. Gillam, N. Newbold, and N. Cooke (University of Surrey, UK) |
8.78 | 12.50 | 0.0709 | 0.4342 | A. Jayapal (University of Sheffield, UK) |
80.59 | 27.47 | 0.0178 | 0.3742 | L. Kong°, H. Qi°, S. Wang°, C. Du*, S. Wang*, and Y. Han° (°Heilongjiang Institute of Technology and *Harbin Engineering University, China) |
27.28 | 318.94 | 0.0025 | 0.2123 | Y. Palkovskii and A. Belov (Zhytomyr State University, Ukraine) |
1.53 | 6.28 | 0.0812 | 0.3512 | Š. Suchomel, J. Kasprzak, and M. Brandejs (Masaryk University, Czech Republic) |
A more detailed analysis of the retrieval performances can be found in the overview paper accompanying this task.