Authorship Verification 2014
Synopsis
Introduction
Authorship attribution is an important problem in many areas, including information retrieval and computational linguistics, but also in applied areas such as law and journalism, where knowing the author of a document (such as a ransom note) may be able to save lives. The most common framework for testing candidate algorithms is a text classification problem: given known sample documents from a small, finite set of candidate authors, which of them, if any, wrote a questioned document of unknown authorship? It has been commented, however, that this may be an unreasonably easy task. A more demanding problem is author verification: given a set of documents by a single author and a questioned document, determine whether the questioned document was written by that particular author or not. This may more accurately reflect real life, as in the experiences of professional forensic linguists, who are often called upon to answer this kind of question. This is the second year that PAN focuses on the so-called author verification problem.
A note to forensic linguists: In order to bridge the gap between linguistics and computer science, we strongly encourage submissions from researchers from both fields. We understand that research groups with expertise in linguistics use manual or semi-automated methods and, therefore, they are not able to submit their software. To enable their participation, we will provide them with the opportunity to analyze the test corpus after the deadline of software submission (mid-April). Their results will be ranked in a separate list with respect to the performance of the software submissions and they will be entitled to describe their approach in a paper. In this framework, any scholar or research group with expertise in linguistics wishing to participate should contact the Task Chair.
Task
Given a small set (no more than 5, possibly as few as one) of "known" documents by a single person and a "questioned" document, the task is to determine whether the questioned document was written by the same person who wrote the known document set.
For your convenience, we summarize the main contributions of the 2014 edition of the author identification task with respect to previous editions:
Novelties:
- The output of your software must be real-valued (probability) scores rather than binary Y/N answers
- The maximum number of documents of known authorship within a problem is 5 (instead of 10)
- The evaluation measures used for ranking are (ROC) AUC and c@1 instead of recall, precision and F1
- More languages/genres are represented in the corpus
- The training/evaluation corpora are larger
- It is possible (optionally) to submit a trainable version of your approach to be used with any given training corpus
Remained the same:
- The task definition is the same
- The format of corpus and ground truth is the same
- The positive/negative problems are equally distributed
Input
To develop your software, we provide you with a training corpus that comprises a set of author verification problems in several languages/genres. Each problem consists of some (up to five) known documents by a single person and exactly one questioned document. All documents within a single problem instance are in the same language, and best efforts have been made to ensure that within-problem documents are matched for genre, register, theme, and date of writing. Document lengths vary from a few hundred to a few thousand words.
The documents of each problem are located in a separate folder, the name of which (problem ID) encodes the language/genre of the documents. The following list shows the available languages/genres, their codes, and examples of problem IDs:
Language | Genre | Code | Problem IDs |
---|---|---|---|
Dutch | essays | DE | DE001, DE002, DE003, etc. |
Dutch | reviews | DR | DR001, DR002, DR003, etc. |
English | essays | EE | EE001, EE002, EE003, etc. |
English | novels | EN | EN001, EN002, EN003, etc. |
Greek | articles | GR | GR001, GR002, GR003, etc. |
Spanish | articles | SP | SP001, SP002, SP003, etc. |
The ground truth data of the training corpus, found in the file truth.txt, include one line per problem with the problem ID and the correct binary answer (Y means the known and the questioned documents are by the same author and N means the opposite). For example:

EN001 N
EN002 Y
EN003 N
...
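For illustration, the following is a minimal sketch of how the training corpus and ground truth could be loaded. The assumption that the questioned document is named unknown.txt, as well as the function names, are ours and are not prescribed above.

```python
import os

def load_truth(corpus_dir):
    """Parse truth.txt into a dict mapping problem ID -> 'Y' or 'N'."""
    truth = {}
    with open(os.path.join(corpus_dir, "truth.txt"), encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                problem_id, answer = parts
                truth[problem_id] = answer
    return truth

def load_problem(corpus_dir, problem_id):
    """Return (known_texts, unknown_text) for one problem folder.

    Assumes the questioned document is named 'unknown.txt' and all other
    .txt files in the folder are known documents (an assumption, not part
    of the official specification above).
    """
    folder = os.path.join(corpus_dir, problem_id)
    known, unknown = [], None
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            text = f.read()
        if name == "unknown.txt":
            unknown = text
        else:
            known.append(text)
    return known, unknown
```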
Output
Your software must take as input the absolute path to a set of problems. For each problem there is a separate sub-folder within that path containing the set of known documents and the single unknown document of that problem (similarly to the training corpus). The software has to output a single text file, answers.txt, with the produced answers for the whole set of evaluation problems. Each line of this file corresponds to a problem instance: it starts with the problem ID, followed by a score, a real number in [0,1] inclusive, corresponding to the probability of a positive answer. That is, 0 means it is absolutely certain that the questioned document is not by the author of the known documents, 1.0 means it is absolutely certain that the questioned document and the known documents are by the same author, and 0.5 means that a positive and a negative answer are equally likely. The probability scores should be rounded to three decimal digits. Use a single whitespace character to separate the problem ID and the probability score. For example, an answers.txt file may look like this:
EN001 0.031
EN002 0.874
EN003 0.500
...
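As a sketch of producing this output format (assuming your method has already computed a score per problem; the function name and example scores below are illustrative):

```python
def write_answers(scores, output_path="answers.txt"):
    """Write one 'problem_id score' line per problem, with scores rounded to three decimals."""
    with open(output_path, "w", encoding="utf-8") as f:
        for problem_id in sorted(scores):
            f.write(f"{problem_id} {scores[problem_id]:.3f}\n")

# Example usage with made-up scores:
write_answers({"EN001": 0.031, "EN002": 0.874, "EN003": 0.5})
```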
Evaluation
The participants’ answers will be evaluated according to the area under the ROC curve (AUC) of their probability scores.
In addition, the performance of the binary classification results will be measured with c@1 (Peñas & Rodrigo, 2011). Binary answers are derived automatically from the probability scores: a score greater than 0.5 counts as a positive answer, a score lower than 0.5 counts as a negative answer, and a score of exactly 0.5 counts as an unanswered problem (an "I don't know" answer):
- c@1 = (1/n)*(nc+(nu*nc/n))
where:
- n = #problems
- nc = #correct_answers
- nu = #unanswered_problems
Note: when positive/negative answers are provided for all available problems (all probability scores different from 0.5), c@1 equals accuracy. However, c@1 rewards approaches that keep the same number of correct answers while reducing the number of incorrect answers by leaving some problems unanswered (i.e., by outputting a probability score of exactly 0.5).
The final ranking of the participants will be based on the product of AUC and c@1.
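To make the ranking concrete, here is a small sketch that computes c@1 and the final score from an answers dict (problem ID -> probability score) and a truth dict (problem ID -> 'Y'/'N'). Using scikit-learn's roc_auc_score for AUC is our assumption, not necessarily the official evaluation code.

```python
from sklearn.metrics import roc_auc_score  # AUC; using scikit-learn here is an assumption

def c_at_1(answers, truth):
    """c@1 = (1/n) * (nc + nu*nc/n); a score of exactly 0.5 counts as unanswered."""
    n = len(truth)
    nc = sum(1 for pid, label in truth.items()
             if (answers[pid] > 0.5 and label == "Y")
             or (answers[pid] < 0.5 and label == "N"))
    nu = sum(1 for pid in truth if answers[pid] == 0.5)
    return (nc + nu * nc / n) / n

def final_score(answers, truth):
    """Final ranking score: ROC AUC of the raw scores times c@1 of the derived answers."""
    pids = sorted(truth)
    y_true = [1 if truth[pid] == "Y" else 0 for pid in pids]
    y_score = [answers[pid] for pid in pids]
    return roc_auc_score(y_true, y_score) * c_at_1(answers, truth)
```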
Results
Authorship attribution performance

Final score | Team |
---|---|
0.566 | Meta Classifier |
0.490 | Mahmoud Khonji and Youssef Iraqi Khalifa University, United Arab Emirates |
0.484 | Jordan Fréry°, Christine Largeron°, and Mihaela Juganaru-Mathieu* °Université de Lyon and *École Nationale Supérieure des Mines, France |
0.461 | Esteban Castillo°, Ofelia Cervantes°, Darnes Vilariño*, David Pinto*, and Saul León* °Universidad de las Américas Puebla and *Benemérita Universidad Autónoma de Puebla, Mexico |
0.451 | Erwan Moreau, Arun Jayapal, and Carl Vogel Trinity College Dublin, Ireland |
0.450 | Cristhian Mayor, Josue Gutierrez, Angel Toledo, Rodrigo Martinez, Paola Ledesma, Gibran Fuentes, and Ivan Meza Universidad Nacional Autonoma de Mexico, Mexico |
0.426 | Hamed Zamani, Hossein Nasr, Pariya Babaie, Samira Abnar, Mostafa Dehghani, and Azadeh Shakery University of Tehran, Iran |
0.400 | Satyam, Anand, Arnav Kumar Dawn, and Sujan Kumar Saha Birla Institute of Technology, India |
0.375 | Pashutan Modaresi and Philipp Gross pressrelations GmbH, Germany |
0.367 | Magdalena Jankowska, Vlado Kešelj, and Evangelos Milios Dalhousie University, Canada |
0.335 | Oren Halvani and Martin Steinebach Fraunhofer Institute for Secure Information Technology SIT, Germany |
0.325 | Baseline |
0.308 | Anna Vartapetiance and Lee Gillam University of Surrey, UK |
0.306 | Robert Layton Federation University, Australia |
0.304 | Sarah Harvey University of Waterloo, Canada |
A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.
Related Work
- Author Identification, PAN @ CLEF'13
- Author Identification, PAN @ CLEF'12
- Author Identification, PAN @ CLEF'11
- Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval, Volume 1, Issue 3, March 2008.
- Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60, Issue 1, pages 9-26, January 2009.
- Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, March 2009.