Author Diarization 2016
Synopsis
- Task: Given a document, identify and group text fragments that correspond to individual authors.
- Input: [data]
- Submission: [submit]
Introduction
The term author diarization originates from the research field of speaker diarization, where approaches try to automatically identify and cluster the different speakers in an audio speech signal, such as a telephone conversation or a political TV debate, by processing and analyzing the audio frequency signal (an overview of such approaches can be found, for example, here).
Similar to such approaches, the task of author diarization in this PAN edition is to identify different authors within a single document. Such documents may be the result of collaborative work (e.g., a combined master's thesis written by two students, a scientific paper written by four authors, …), or the result of plagiarism. The latter is a special case, where it can be assumed that the main text is written by one author and only some fragments are written by other writers (the plagiarized or intrusive sections). On the other hand, the contributions to a collaboratively written document may be equally weighted, i.e., each author contributes to the same extent.
Task
Given a document, identify and group text fragments that correspond to individual authors. As in speaker diarization, where the active speaker may change at any time, you cannot assume that changes in authorship occur only on paragraph boundaries, for example. Rather, you should be prepared to detect a different author at any text position. An example could be as follows:
"She is also a successful businesswoman and an American icon, was born in Jersey City to middle-class Polish-American parents and she earned a partial scholarship to …"
Nevertheless, you may use paragraph boundaries or other useful metrics as heuristics for potential changes.
To cover different variants of the problem, the task of this year's PAN edition is split into three subproblems. All documents are provided in English.
- Traditional intrinsic plagiarism detection: here, you can assume that there exists one main author who wrote at least 70% of the text. The remaining text, up to 30%, may be written by other authors. For this problem, you should build exactly two clusters: one containing the text fragments of the main author, and one containing the intrusive fragments.
- Diarization with a given number (n) of authors: given a document, the task is to build exactly n clusters containing the contributions of the different writers. Each author may have contributed to an arbitrary, but non-zero, extent.
- Diarization with an unknown number of authors: finally, this variant covers the most challenging task. It is similar to the previous one, but without knowing how many authors contributed to the document.
Input
The data set consists of three folders, one for each subtask. For each problem instance X in each subtask, three files are provided:
- problem-X.txt contains the actual text
- problem-X.meta contains meta information about the file in JSON format: the "language" (which is always English this year), the problem "type" ("plagiarism" or "diarization"), and, for the diarization problem with a given number of authors, additionally the correct number of authors ("numAuthors")
- problem-X.truth contains the ground truth, i.e., the correct solution in JSON format:
{ "authors": [ [ {"from": fromCharPosition, "to": toCharPosition}, … ], … ] }
To identify the text fragments, the absolute character start/end positions within the document are used, where the document starts at character position 0.
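To make the file layout concrete, the following Python sketch reads one training instance and prints each author's fragments. It assumes a hypothetical instance name problem-1, UTF-8 encoded files, and that the "to" positions are inclusive (as the examples below suggest); it is an illustration, not part of the official tooling.

import json

# Read the document text (hypothetical instance name, UTF-8 assumed).
with open("problem-1.txt", encoding="utf-8") as f:
    text = f.read()

# Read the meta information ("language", "type", possibly "numAuthors").
with open("problem-1.meta", encoding="utf-8") as f:
    meta = json.load(f)

# Read the ground truth: {"authors": [[{"from": ..., "to": ...}, ...], ...]}
with open("problem-1.truth", encoding="utf-8") as f:
    truth = json.load(f)

for author_id, fragments in enumerate(truth["authors"]):
    for frag in fragments:
        # The examples suggest "to" is an inclusive position, hence the +1.
        snippet = text[frag["from"]:frag["to"] + 1]
        print(author_id, frag["from"], frag["to"], snippet[:40].replace("\n", " "))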
Note that, for reasons of simplicity, the solutions for the intrinsic plagiarism task contain exactly 2 clusters: one for the main author and one combining all other authors. Nevertheless, when producing your output, you are free to create as many clusters as you wish for the plagiarized sections.
An example for an intrinsic plagiarism detection solution could look like this:
{ "authors": [ [ {"from": 314, "to": 15769} ], [ {"from": 0, "to": 313}, {"from": 15770, "to": 19602} ] ] }
An example of the diarization solution of a document that was written by four authors could then look like this:
{ "authors": [ [ {"from": 123, "to": 400}, {"from": 598, "to": 680} ], [ {"from": 0, "to": 122} ], [ {"from": 401, "to": 597}, {"from": 681, "to": 1020}, {"from": 1101, "to": 1400} ], [ {"from": 1021, "to": 1100} ] ] }
Of course, in the actual evaluation phase the ground truth, i.e., the problem-X.truth file, will be missing.
Output
In general, the data structure during the evaluation phase will be similar to that of the training phase, with the exception that the ground truth files are missing. This means you can also use the information provided in the problem-X.meta file. Your software should finally output the missing solution file problem-X.truth for every problem instance X in the respective output folder (see Submission). The output syntax should be exactly the same as in the training phase.
In general, there is no difference in the output between the intrinsic plagiarism detection and the diarization subtasks. Moreover, the order of the entries is not relevant.
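A minimal sketch of writing such a solution file in Python is given below; the function name, the output folder, and the representation of clusters as lists of (from, to) character ranges (assumed inclusive) are our own illustrative choices, not prescribed by the task.

import json
import os

def write_solution(clusters, problem_name, out_dir):
    # clusters: list of clusters, each a list of (from, to) character ranges
    # (assumed inclusive, as in the training data).
    solution = {"authors": [
        [{"from": start, "to": end} for (start, end) in cluster]
        for cluster in clusters
    ]}
    with open(os.path.join(out_dir, problem_name + ".truth"), "w", encoding="utf-8") as f:
        json.dump(solution, f)

# Hypothetical example: two clusters for an intrinsic plagiarism instance.
write_solution([[(314, 15769)], [(0, 313), (15770, 19602)]], "problem-3", "output")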
In the following, we provide you with some examples for both subtasks:
- For the intrinsic plagiarism detection subtask, you should create one entry for the main author. For the plagiarized sections, you are free to either combine them into one entry (as is done in the training data) or split them into several entries. As an example, if you found 2 plagiarized sections in the file problem-3.txt, you should produce the file problem-3.truth, where both
{ "authors": [ [ {"from": 314, "to": 15769} ], [ {"from": 0, "to": 313}, {"from": 15770, "to": 19602} ] ] }
and
{ "authors": [ [ {"from": 314, "to": 15769} ], [ {"from": 0, "to": 313} ], [ {"from": 15770, "to": 19602} ] ] }
are valid solutions.
- For the diarization subtask, if you found 3 authors for the file problem-12.txt, you should produce the file problem-12.truth containing a solution like this:
{ "authors": [ [ {"from": 0, "to": 409}, {"from": 645, "to": 4893} ], [ {"from": 410, "to": 644}, {"from": 4894, "to": 6716} ], [ {"from": 6717, "to": 15036} ] ] }
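For illustration only, a trivial baseline for the subtask with a known number of authors could assign n roughly equal-sized contiguous chunks to n clusters. This is not a serious approach, but it produces syntactically valid output; the sketch reuses text, meta, and write_solution from the sketches above.

def equal_chunks_baseline(text, num_authors):
    # Naive baseline: split the document into num_authors contiguous,
    # roughly equal-sized chunks, one cluster per chunk (inclusive positions).
    n = len(text)
    bounds = [round(i * n / num_authors) for i in range(num_authors + 1)]
    return [[(bounds[i], bounds[i + 1] - 1)] for i in range(num_authors)]

# Hypothetical usage for an instance whose meta file provides "numAuthors":
clusters = equal_chunks_baseline(text, meta["numAuthors"])
write_solution(clusters, "problem-12", "output")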
Performance Measures
- To evaluate the quality of the intrinsic plagiarism detection algorithms, the micro- and macro-averaged F-score will be used (see this paper).
- For the diarization algorithms, the BCubed F-score (Amigo et al. 2007) will be used.
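For reference, a plain implementation of the BCubed F-score over character positions could look like the sketch below; this is our own illustration of the standard definition, not the official evaluation tool.

def bcubed_f1(pred_clusters, gold_clusters):
    # Both arguments are lists of sets of items (here: character positions).
    pred_of = {item: cluster for cluster in pred_clusters for item in cluster}
    gold_of = {item: cluster for cluster in gold_clusters for item in cluster}
    items = [item for item in pred_of if item in gold_of]
    precision = sum(len(pred_of[i] & gold_of[i]) / len(pred_of[i]) for i in items) / len(items)
    recall = sum(len(pred_of[i] & gold_of[i]) / len(gold_of[i]) for i in items) / len(items)
    return 2 * precision * recall / (precision + recall)

# Toy example with character positions 0..5:
pred = [{0, 1, 2}, {3, 4, 5}]
gold = [{0, 1}, {2, 3, 4, 5}]
print(bcubed_f1(pred, gold))   # ≈ 0.764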
Evaluation
Once you have finished tuning your approach to achieve satisfying performance on the training corpus, your software will be tested on the evaluation corpus. During the competition, the evaluation corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below. After the competition, the evaluation corpus will become available, including the ground truth data. This way, you have everything you need to evaluate your approach on your own, while remaining comparable to those who took part in the competition.
Related Work
- PAN@CLEF'11
- PAN@CLEF'12
- Sven Meyer zu Eissen, Benno Stein. Intrinsic Plagiarism Detection. In Advances in Information Retrieval: Proceedings of the 28th European Conference on IR Research (ECIR), pages 565-569, 2006.
- Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval, Volume 1, Issue 3, March 2008.
- Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, March 2009.