Author Profiling 2014
Synopsis
- Task: Given a document, what are the traits of its author?
- Input: [data]
- Twitter Downloader: [code]
- Output Validator: [code]
- Submission: [submit]
Introduction
Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.
Task
Note. In addition, at RepLab 2014, author profiling will be approached from the perspective of online reputation monitoring. Given a large number of Twitter profiles with 600 associated tweets each, participants will be asked to classify the author of a set of tweets as journalist, politician, activist, professional, client, company, authority, or citizen, since belonging to a certain category could determine the importance of the user's opinions. The dataset will contain English and Spanish tweets related to the banking and automotive domains.
Award
We are happy to announce that the best performing team at the 2nd International Competition on Author Profiling will be awarded 300,- Euro sponsored by Atribus (Corex).
- A. Pastor López-Monroy, Manuel Montes-y-Gómez, Hugo Jair Escalante, and Luis Villaseñor-Pineda from INAOE, Mexico
Congratulations!
Input
To develop your software for age and gender identification, we provide you with a training data set that consists of blog posts, Twitter tweets and social media texts written in both English and Spanish as well as hotel reviews written in English. With regard to age, we will consider the following classes: 18-24, 25-34, 35-49, 50-64, 65-xx.
Remark. Due to Twitter's privacy policy we cannot provide tweets directly, but only URLs referring to them. You will have to download them yourself. For your convenience, we provide a downloader for this. We expect participants to extract gender and age information only from the textual part of a tweet and to discard any other meta information that may be provided by Twitter's API. When we evaluate your software at our site, we do not expect it to download tweets. We will do this beforehand.
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus. During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below. After the competition, the test corpus is available including ground truth data. This way, you have everything needed to evaluate your approach on your own while remaining comparable to those who took part in the competition.
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
<author id="{author-id}" type="blog|twitter|socialmedia|reviews" lang="en|es" age_group="18-24|25-34|35-49|50-64|65-xx" gender="male|female" />
The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension. The output files have to be written either directly to the working directory (to "./") or to a subfolder. The author-id has to be extracted from each document's filename, which follows the pattern <authorid>_<lang>_<age>_<gender>.xml. Note that in the test corpus the age and gender information are replaced by "xxx".
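The output step described above can be sketched as follows. This is only an illustration in Python using the standard library, not the official validator or a reference implementation. The function names `parse_document_name` and `write_prediction` are hypothetical. The sketch also assumes that the document type is supplied by the caller, since it is not part of the filename pattern.

```python
import os
import xml.etree.ElementTree as ET

# Allowed attribute values, taken from the output format shown above.
ALLOWED = {
    "type": {"blog", "twitter", "socialmedia", "reviews"},
    "lang": {"en", "es"},
    "age_group": {"18-24", "25-34", "35-49", "50-64", "65-xx"},
    "gender": {"male", "female"},
}


def parse_document_name(path):
    """Split '<authorid>_<lang>_<age>_<gender>.xml' into its four parts.

    Splitting from the right keeps author-ids that contain underscores
    intact (an assumption; the task description does not say whether
    author-ids may contain underscores).
    """
    stem, _ext = os.path.splitext(os.path.basename(path))
    parts = stem.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError("unexpected document name: %r" % path)
    author_id, lang, age, gender = parts
    return author_id, lang, age, gender


def write_prediction(out_dir, author_id, doc_type, lang, age_group, gender):
    """Write one <author .../> element to '<out_dir>/<author_id>.xml'."""
    attributes = {
        "id": author_id,
        "type": doc_type,
        "lang": lang,
        "age_group": age_group,
        "gender": gender,
    }
    # Reject values that the output format does not list.
    for name, allowed_values in ALLOWED.items():
        if attributes[name] not in allowed_values:
            raise ValueError("invalid %s: %r" % (name, attributes[name]))
    out_path = os.path.join(out_dir, author_id + ".xml")
    ET.ElementTree(ET.Element("author", attributes)).write(out_path, encoding="utf-8")
    return out_path
```

For a test document named `abc123_en_xxx_xxx.xml`, `parse_document_name` yields the author-id `abc123`, and `write_prediction(".", "abc123", "blog", "en", "25-34", "female")` would write `./abc123.xml` with the predicted classes.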
Evaluation
The performance of your author profiling solution will be ranked by accuracy.
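As a rough illustration of the measure, accuracy is the fraction of documents whose predicted class matches the ground truth. The sketch below, in Python, uses a hypothetical helper named `accuracy` and made-up example labels. Scoring age and gender jointly as a pair is an assumption made here for illustration; the text above does not specify how the two predictions are combined.

```python
def accuracy(predicted, truth):
    """Return the fraction of positions where predicted equals truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)


# Made-up (age_group, gender) pairs for four authors.
predicted = [("25-34", "male"), ("18-24", "female"), ("35-49", "female"), ("25-34", "male")]
truth = [("25-34", "male"), ("18-24", "male"), ("35-49", "female"), ("50-64", "male")]

# Per-attribute accuracy.
age_accuracy = accuracy([p[0] for p in predicted], [t[0] for t in truth])        # 0.75
gender_accuracy = accuracy([p[1] for p in predicted], [t[1] for t in truth])     # 0.75

# Joint accuracy (assumption): a prediction counts only if both age and gender match.
joint_accuracy = accuracy(predicted, truth)                                      # 0.5
```

Joint scoring is stricter than either per-attribute score, because a single wrong attribute makes the whole prediction count as an error.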
Results
The following table lists the performances achieved by the participating teams:
Author profiling performance
Avg. Accuracy | Team |
---|---|
0.2895 | A. Pastor López-Monroy, Manuel Montes-y-Gómez, Hugo Jair Escalante, and Luis Villaseñor-Pineda (Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico) |
0.2802 | Liau Yung Siang and Vrizlynn L. L. Thing (Institute for Infocomm Research, Singapore) |
0.2760 | Suraj Maharjan, Prasha Shrestha, and Thamar Solorio (University of Alabama at Birmingham, USA) |
0.2349 | Edson R. D. Weren, Viviane P. Moreira, and José P. M. de Oliveira (UFRGS, Brazil) |
0.2314 | Julio Villena-Román and José Carlos González-Cristóbal (DAEDALUS - Data, Decisions and Language, S.A., Spain) |
0.1998 | James Marquardt°, Golnoosh Farnadi*, Gayathri Vasudevan°, Marie-Francine Moens*, Sergio Davalos°, Ankur Teredesai°, Martine De Cock° (°University of Washington Tacoma, USA, *Katholieke Universiteit Leuven, Belgium) |
0.1677 | Christopher Ian Baker (Private, UK) |
0.1404 | Baseline |
0.1067 | Seifeddine Mechti, Maher Jaoua, and Lamia Hadrich Belguith (University of Sfax, Tunisia) |
0.0946 | Esteban Castillo Juarez°, Ofelia Delfina Cervantes Villagomez*, Darnes Vilariño Ayala*, David Pinto Avendaño*, and Saul Leon Silverio* (°Universidad de las Américas Puebla and *Benemérita Universidad Autónoma de Puebla, Mexico) |
0.0834 | Gilad Gressel, Hrudya P, Surendran K, Thara S, Aravind A, Prabaharan Poornachandran (Amrita University, India) |
A more detailed analysis of the results can be found in the overview paper accompanying this task.
Related Work
- Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. Proceedings of PAN at CLEF 2013.
- The Blog Authorship Corpus
- S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically profiling the author of an anonymous text, Communications of the ACM 52 (2): 119–123.
- J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.
- M. Koppel, S. Argamon and A. Shimoni (2002), Automatically categorizing written texts by author gender, Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.
- J. Pennebaker (2011), The secret life of pronouns: What our words say about us. New York: Bloomsbury Publishing.