Author Profiling 2016
Synopsis
- Task: Given a document, what are its author's traits?
- Input: [data]
- Twitter Downloader: [code]
- Submission: [submit]
Introduction
Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This helps in identifying profile traits such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in applications in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to know the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.
Task
The focus of the 2016 shared task is on cross-genre age and gender identification. That is, the training documents will come from one genre (e.g. Twitter, blogs, or other social media) and the evaluation documents from a different genre.
Three languages will be addressed: English, Spanish and Dutch.
Award
We are happy to announce that the best performing team at the 4th International Competition on Author Profiling will be awarded 300,- Euro, sponsored by MeaningCloud. The best performing team is:
- Mart Busger op Vollenbroek, Talvany Carlotto, Tim Kreutz, Maria Medvedeva, Chris Pool, Johannes Bjerva, Hessel Haagsma, Malvina Nissim. University of Groningen, Netherlands.
Congratulations!
Data
To develop your software, we provide you with a training data set that consists of tweets in English, Spanish and Dutch.
The English and Spanish datasets are labeled with age and gender, whereas the Dutch one is labeled only with gender. With regard to age, we will consider the following classes: 18-24, 25-34, 35-49, 50-64, 65-xx.
Remark. Due to Twitter's privacy policy we cannot provide tweets directly, but only URLs referring to them. You will have to download them yourself. For your convenience, we provide download software for this. We expect participants to extract gender and age information only from the textual part of a tweet and to discard any other meta information that may be provided by Twitter's API. When we evaluate your software at our site, we do not expect it to download tweets; we will do this beforehand.
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
<author id="{author-id}" type="not relevant" lang="en|es|nl" age_group="18-24|25-34|35-49|50-64|65-xx" gender="male|female" />
The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension.
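As an illustration of the output format described above, the following is a minimal Python sketch that writes one such XML file per author. The function name `write_prediction`, its parameters, and the example author id are hypothetical and not part of the task definition; the attribute names and the `<author-id>.xml` naming follow the template and recommendation above.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_prediction(out_dir, author_id, lang, age_group, gender):
    """Write one prediction as <author-id>.xml in out_dir and return the path.

    Hypothetical helper: only the attribute names come from the task page.
    """
    element = ET.Element("author", {
        "id": author_id,
        "type": "not relevant",
        "lang": lang,            # one of: en, es, nl
        "age_group": age_group,  # one of: 18-24, 25-34, 35-49, 50-64, 65-xx
        "gender": gender,        # one of: male, female
    })
    path = os.path.join(out_dir, author_id + ".xml")
    ET.ElementTree(element).write(path, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Example call with made-up values, written to a temporary directory.
    out_dir = tempfile.mkdtemp()
    print(write_prediction(out_dir, "example-author", "en", "25-34", "female"))
```

The Dutch data carries no age label, so what to write into `age_group` for Dutch authors is left open in this sketch.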
Evaluation
The performance of your author profiling solution will be ranked by accuracy.
Concretely, we will calculate individual accuracies for each language, gender, and age class. Then, we will average the accuracy values to obtain a joint identification of age and gender in each language.
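One possible reading of this scoring scheme can be sketched in Python as follows. This is not the official evaluation script: the function names, the dictionary layout, and in particular the assumption that a "joint" hit means both gender and age group are correct for the same author are illustrative choices made here.

```python
def accuracy(pairs):
    """Fraction of (truth, prediction) pairs that match exactly."""
    pairs = list(pairs)
    return sum(t == p for t, p in pairs) / len(pairs)


def score_language(truth, predicted, has_age=True):
    """Score one language.

    truth / predicted: dicts mapping author id -> (gender, age_group).
    Authors missing from `predicted` count as wrong.
    Set has_age=False for a language without age labels (Dutch here).
    """
    ids = sorted(truth)
    missing = (None, None)
    scores = {
        "gender": accuracy((truth[i][0], predicted.get(i, missing)[0]) for i in ids)
    }
    if has_age:
        scores["age"] = accuracy(
            (truth[i][1], predicted.get(i, missing)[1]) for i in ids
        )
        # Assumption: "joint" means gender AND age group are both correct.
        scores["joint"] = accuracy((truth[i], predicted.get(i, missing)) for i in ids)
    return scores


if __name__ == "__main__":
    # Made-up labels for four authors, purely for illustration.
    truth = {"a": ("male", "18-24"), "b": ("female", "25-34"),
             "c": ("female", "35-49"), "d": ("male", "50-64")}
    predicted = {"a": ("male", "18-24"), "b": ("female", "35-49"),
                 "c": ("male", "35-49"), "d": ("male", "50-64")}
    print(score_language(truth, predicted))
    # {'gender': 0.75, 'age': 0.75, 'joint': 0.5}
```

Per-language scores computed this way could then be averaged across languages to obtain a single ranking figure; whether the official ranking averages in exactly this way is not stated here.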
Results
The following table lists the performances achieved by the participating teams:
Author profiling performance

Avg. Accuracy | Team |
---|---|
0.5258 | Mart Busger op Vollenbroek, Talvany Carlotto, Tim Kreutz, Maria Medvedeva, Chris Pool, Johannes Bjerva, Hessel Haagsma, Malvina Nissim. University of Groningen, Netherlands. |
0.5247 | Pashutan Modaresi, Matthias Liebeck, Stefan Conrad. Heinrich Heine University Düsseldorf, Germany. |
0.4834 | Ivan Bilan, Desislava Zhekova. University of Munich, Germany. |
0.4602 | Philipp Gross, Siavash Sefidrodi, Germany. |
0.4593 | Ilia Markov, Helena Gómez Adorno, Grigori Sidorov, Alexander Gelbukh. Instituto Politécnico Nacional, Mexico. |
0.4519 | Konstantinos Bougiatiotis, Anastasia Krithara. NCSR Demokritos, Greece. |
0.4425 | Daniel Dichiu, Irina Rancea. Bitdefender, Romania. |
0.4369 | Hannes de Valkeneer, Shoira Mukhsinova. Belgium. |
0.4293 | Waser, Switzerland (this team withdrew their submission a posteriori) |
0.4255 | Roy Bayot, Teresa Gonçalves. Universidade de Évora, Portugal. |
0.4015 | Pepa Gencheva, Martin Boyanov, Elena Deneva, Preslav Nakov, Yasen Kiprov, Ivan Koychev, Georgi Georgiev. Sofia University, Bulgaria. |
0.4014 | Elena Deneva, Nikolay Hubanov. Sofia University, Bulgaria. |
0.3971 | Madhulika Agrawal, Teresa Gonçalves. Universidade de Évora, Portugal. |
0.3800 | Mirco Kocher, Jacques Savoy. University of Neuchâtel, Switzerland. |
0.3664 | Constantino Román Gómez. Universidad Politécnica de Madrid, Spain. |
0.3660 | María José Garciarena Ucelay, María Paula Villegas, Dario G. Funez, Leticia C. Cagnina, Marcelo L. Errecalde, Gabriela Ramírez de la Rosa, Esaú Villatoro Tello. Universidad Nacional de San Luis, Argentina. |
0.3154 | Anam Zahid, Aadarsh Sampath, Anindya Dey, Golnoosh Farnadi. University of Washington Tacoma, United States. |
0.2949 | José María Aceituno. Spain. |
0.1688 | Shaina Ashraf, Hafiz Rizwan Iqbal, Rao Muhammad Adeel Nawab. Institute of Information Technology, Pakistan. |
0.1560 | Rodwan Bakkar Deyab, José Duarte, Teresa Gonçalves. Universidade de Évora, Portugal. |
0.1410 | Oliver Pimas, Andi Rexha, Mark Kröll, Roman Kern. Know-Center GmbH, Austria. |
0.0571 | Anand Kumar M., Sanjay S. Poongunran. Amrita Vishwa Vidyapeetham, India. |
Related Work
- Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 LABs and Workshops. CEUR-WS.org, vol. 1179.
- Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-927.
- Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In: Linda Cappellato, Nicola Ferro, Gareth Jones and Eric San Juan (Eds.): CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/, 2015.
- S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically profiling the author of an anonymous text, Communications of the ACM 52 (2): 119–123.
- J. Schler, Moshe Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.
- M. Koppel, S. Argamon and A. Shimoni (2002), Automatically categorizing written texts by author gender, Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.
- J. Pennebaker (2011). The secret life of pronouns: What our words say about us. New York: Bloomsbury Publishing, 2011.
- PAN-AP-13 corpus - Author Profiling Shared Task
- PAN-AP-14 corpus - Author Profiling Shared Task
- PAN-AP-15 corpus - Author Profiling Shared Task
- The Blog Authorship Corpus