Author Profiling 2017
Synopsis
- Task: Given the text of a Twitter feed, identify the author's gender and language variety.
- Input: [data]
- Submission: [submit]
Introduction
Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This helps to identify profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance for applications in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to know the linguistic profile of the author of a harassing text message (language used by a certain type of people) and to identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.
Task
Gender and language variety identification in Twitter. Demographic traits such as gender and language variety have so far been investigated separately. In this task we provide participants with a Twitter corpus annotated with the authors' gender and the specific variety of their native language:
- English (Australia, Canada, Great Britain, Ireland, New Zealand, United States)
- Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela)
- Portuguese (Brazil, Portugal)
- Arabic (Egypt, Gulf, Levantine, Maghrebi)
Although we suggest participating in both subtasks (gender and language variety identification) and in all languages, it is possible to participate in only one of them and in a subset of the languages.
Award
We are happy to announce that the best-performing team at the 5th International Competition on Author Profiling will be awarded 300 Euro, sponsored by MeaningCloud.
- Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. University of Groningen, Netherlands.
Congratulations!
Data
To develop your software, we provide you with a training data set that consists of tweets in English, Spanish, Portuguese, and Arabic, labeled with gender and language variety. Download corpus (Updated March 10, 2017)
Info about additional training material (although domains are different): http://ttg.uni-saarland.de/resources/DSLCC
Test Corpus
Download test corpus + truth files (Updated March 16, 2017)
Output
Your software must take as input the absolute path to an unpacked dataset and must output, for each document of the dataset, a corresponding XML file that looks like this:
<author id="author-id" lang="en|es|pt|ar" variety="australia|canada|great britain|ireland|new zealand|united states|argentina|chile|colombia|mexico|peru|spain|venezuela|portugal|brazil|gulf|levantine|maghrebi|egypt" gender="male|female" />
The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension.
IMPORTANT! Languages must not be mixed: create a folder for each language and place inside it only the prediction files for that language.
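For illustration, here is a minimal Python sketch of this output step; the `predictions` list, the author ids, and the `output` directory name are hypothetical placeholders:

```python
import os
from xml.sax.saxutils import quoteattr

# Hypothetical predictions: (author_id, language, variety, gender).
predictions = [
    ("a1b2c3", "en", "canada", "female"),
    ("d4e5f6", "es", "mexico", "male"),
]

output_dir = "output"  # assumed location; the real path would be given at runtime
for author_id, lang, variety, gender in predictions:
    lang_dir = os.path.join(output_dir, lang)  # one folder per language, never mixed
    os.makedirs(lang_dir, exist_ok=True)
    xml = "<author id=%s lang=%s variety=%s gender=%s />\n" % (
        quoteattr(author_id), quoteattr(lang), quoteattr(variety), quoteattr(gender)
    )
    # Recommended naming: author-id as filename, "xml" as extension.
    with open(os.path.join(lang_dir, author_id + ".xml"), "w", encoding="utf-8") as f:
        f.write(xml)
```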
Evaluation
The performance of your author profiling solution will be ranked by accuracy.
For each language, we will calculate individual accuracies for gender and variety identification. Then we will calculate the joint accuracy, i.e., the proportion of authors for whom BOTH variety and gender are predicted correctly. Finally, we will average the accuracy values per language to obtain the final ranking.
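To make the metric concrete, here is a small Python sketch of the evaluation logic for a single language, assuming gold and predicted labels are available as parallel lists of (variety, gender) pairs; the example data is invented:

```python
# Per-language accuracies for gender, variety, and the joint prediction.
def accuracies(gold, pred):
    n = len(gold)
    gender_acc = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    variety_acc = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    joint_acc = sum(g == p for g, p in zip(gold, pred)) / n  # both labels correct
    return gender_acc, variety_acc, joint_acc

# Invented example with three authors:
gold = [("spain", "female"), ("mexico", "male"), ("chile", "female")]
pred = [("spain", "female"), ("spain", "male"), ("chile", "male")]
print(accuracies(gold, pred))  # approx. (0.667, 0.667, 0.333)
```

Note that the joint accuracy can be at most the minimum of the two individual accuracies, since both labels must be correct for an author to count.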
Results
The following tables list the performances achieved by the participating teams in the different subtasks:
We provide three baselines:
- LDR-baseline: described in A Low Dimensionality Representation for Language Variety Identification. In: Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016, Springer-Verlag, LNCS, arXiv:1705.10754
- BOW-baseline: a standard bag-of-words model restricted to the 1,000 most frequent words (see the sketch after this list)
- STAT-baseline: a statistical baseline (majority class or random choice)
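As a rough illustration of the BOW-baseline, the sketch below builds a bag-of-words over the 1,000 most frequent words and trains a linear classifier; the use of scikit-learn and of logistic regression is our assumption, not the official implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_bow_baseline(train_texts, train_labels):
    """Train a bag-of-words baseline on raw texts (one string per author)."""
    model = make_pipeline(
        CountVectorizer(max_features=1000),  # keep only the 1,000 most frequent words
        LogisticRegression(max_iter=1000),   # assumed classifier choice
    )
    model.fit(train_texts, train_labels)
    return model  # model.predict(test_texts) yields gender or variety labels
```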
GLOBAL RANKING

| RANK | TEAM | GENDER | VARIETY | JOINT | AVERAGE |
|---|---|---|---|---|---|
| 1 | Basile et al. | 0.8253 | 0.9184 | 0.8361 | 0.8599 |
| 2 | Martinc et al. | 0.8224 | 0.9085 | 0.8285 | 0.8531 |
| 3 | Tellez et al. | 0.8097 | 0.9171 | 0.8258 | 0.8509 |
| 4 | Miura et al. | 0.8127 | 0.8982 | 0.8162 | 0.8424 |
| 5 | López-Monroy et al. | 0.8047 | 0.8986 | 0.8111 | 0.8381 |
| 6 | Markov et al. | 0.7957 | 0.9056 | 0.8097 | 0.8370 |
| 7 | Poulston et al. | 0.7974 | 0.8786 | 0.7942 | 0.8234 |
| 8 | Sierra et al. | 0.7641 | 0.8911 | 0.7822 | 0.8125 |
| | LDR-baseline | 0.7325 | 0.9187 | 0.7750 | 0.8087 |
| 9 | Ogaltsov & Romanov | 0.7669 | 0.8591 | 0.7653 | 0.7971 |
| 10 | Franco-Salvador et al. | 0.7667 | 0.8508 | 0.7582 | 0.7919 |
| 11 | Schaetti | 0.7207 | 0.8864 | 0.7511 | 0.7861 |
| 12 | Kodiyan et al. | 0.7531 | 0.8522 | 0.7509 | 0.7854 |
| 13 | Ciobanu et al. | 0.7504 | 0.8524 | 0.7498 | 0.7842 |
| 14 | Kheng et al. | 0.7002 | 0.8513 | 0.7176 | 0.7564 |
| 15 | Ganesh* | 0.7342 | 0.7626 | 0.6881 | 0.7283 |
| 16 | Kocher & Savoy | 0.7178 | 0.7661 | 0.6813 | 0.7217 |
| 17 | Ignatov et al. | 0.6917 | 0.7024 | 0.6270 | 0.6737 |
| | BOW-baseline | 0.6763 | 0.6907 | 0.6195 | 0.6622 |
| 18 | Khan | 0.6252 | 0.5296 | 0.4952 | 0.5500 |
| 19 | Ribeiro-Oliveira et al. | 0.3666 | 0.4141 | 0.3092 | 0.3633 |
| | STAT-baseline | 0.5000 | 0.2649 | 0.2991 | 0.3547 |
| 20 | Alrifai et al. | 0.1806 | 0.1888 | 0.1701 | 0.1798 |
| 21 | Bouzazi* | 0.1530 | 0.0931 | 0.1027 | 0.1163 |
| 22 | Adame et al. | 0.1353 | 0.0476 | 0.0695 | 0.0841 |
VARIETY RANKING

| RANK | TEAM | ARABIC | ENGLISH | PORTUGUESE | SPANISH | AVERAGE |
|---|---|---|---|---|---|---|
| | LDR-baseline | 0.8250 | 0.8996 | 0.9875 | 0.9625 | 0.9187 |
| 1 | Basile et al. | 0.8313 | 0.8988 | 0.9813 | 0.9621 | 0.9184 |
| 2 | Tellez et al. | 0.8275 | 0.9004 | 0.9850 | 0.9554 | 0.9171 |
| 3 | Martinc et al. | 0.8288 | 0.8688 | 0.9838 | 0.9525 | 0.9085 |
| 4 | Markov et al. | 0.8169 | 0.8767 | 0.9850 | 0.9439 | 0.9056 |
| 5 | López-Monroy et al. | 0.8119 | 0.8567 | 0.9825 | 0.9432 | 0.8986 |
| 6 | Miura et al. | 0.8125 | 0.8717 | 0.9813 | 0.9271 | 0.8982 |
| 7 | Sierra et al. | 0.7950 | 0.8392 | 0.9850 | 0.9450 | 0.8911 |
| 8 | Schaetti | 0.8131 | 0.8150 | 0.9838 | 0.9336 | 0.8864 |
| 9 | Poulston et al. | 0.7975 | 0.8038 | 0.9763 | 0.9368 | 0.8786 |
| 10 | Ogaltsov & Romanov | 0.7556 | 0.8092 | 0.9725 | 0.8989 | 0.8591 |
| 11 | Ciobanu et al. | 0.7569 | 0.7746 | 0.9788 | 0.8993 | 0.8524 |
| 12 | Kodiyan et al. | 0.7688 | 0.7908 | 0.9350 | 0.9143 | 0.8522 |
| 13 | Kheng et al. | 0.7544 | 0.7588 | 0.9750 | 0.9168 | 0.8513 |
| 14 | Franco-Salvador et al. | 0.7656 | 0.7588 | 0.9788 | 0.9000 | 0.8508 |
| 15 | Kocher & Savoy | 0.7188 | 0.6521 | 0.9725 | 0.7211 | 0.7661 |
| 16 | Ganesh* | 0.7144 | 0.6021 | 0.9650 | 0.7689 | 0.7626 |
| 17 | Ignatov et al. | 0.4488 | 0.5813 | 0.9763 | 0.8032 | 0.7024 |
| | BOW-baseline | 0.3394 | 0.6592 | 0.9712 | 0.7929 | 0.6907 |
| 18 | Khan | 0.5844 | 0.2779 | 0.9063 | 0.3496 | 0.5296 |
| 19 | Ribeiro-Oliveira et al. | | 0.6713 | 0.9850 | | 0.4141 |
| | STAT-baseline | 0.2500 | 0.1667 | 0.5000 | 0.1429 | 0.2649 |
| 20 | Alrifai et al. | 0.7550 | | | | 0.1888 |
| 21 | Bouzazi* | | 0.3725 | | | 0.0931 |
| 22 | Adame et al. | | | | 0.1904 | 0.0476 |
GENDER RANKING

| RANK | TEAM | ARABIC | ENGLISH | PORTUGUESE | SPANISH | AVERAGE |
|---|---|---|---|---|---|---|
| 1 | Basile et al. | 0.8006 | 0.8233 | 0.8450 | 0.8321 | 0.8253 |
| 2 | Martinc et al. | 0.8031 | 0.8071 | 0.8600 | 0.8193 | 0.8224 |
| 3 | Miura et al. | 0.7644 | 0.8046 | 0.8700 | 0.8118 | 0.8127 |
| 4 | Tellez et al. | 0.7838 | 0.8054 | 0.8538 | 0.7957 | 0.8097 |
| 5 | López-Monroy et al. | 0.7763 | 0.8171 | 0.8238 | 0.8014 | 0.8047 |
| 6 | Poulston et al. | 0.7738 | 0.7829 | 0.8388 | 0.7939 | 0.7974 |
| 7 | Markov et al. | 0.7719 | 0.8133 | 0.7863 | 0.8114 | 0.7957 |
| 8 | Ogaltsov & Romanov | 0.7213 | 0.7875 | 0.7988 | 0.7600 | 0.7669 |
| 9 | Franco-Salvador et al. | 0.7300 | 0.7958 | 0.7688 | 0.7721 | 0.7667 |
| 10 | Sierra et al. | 0.6819 | 0.7821 | 0.8225 | 0.7700 | 0.7641 |
| 11 | Kodiyan et al. | 0.7150 | 0.7888 | 0.7813 | 0.7271 | 0.7531 |
| 12 | Ciobanu et al. | 0.7131 | 0.7642 | 0.7713 | 0.7529 | 0.7504 |
| 13 | Ganesh* | 0.6794 | 0.7829 | 0.7538 | 0.7207 | 0.7342 |
| | LDR-baseline | 0.7044 | 0.7220 | 0.7863 | 0.7171 | 0.7325 |
| 14 | Schaetti | 0.6769 | 0.7483 | 0.7425 | 0.7150 | 0.7207 |
| 15 | Kocher & Savoy | 0.6913 | 0.7163 | 0.7788 | 0.6846 | 0.7178 |
| 16 | Kheng et al. | 0.6856 | 0.7546 | 0.6638 | 0.6968 | 0.7002 |
| 17 | Ignatov et al. | 0.6425 | 0.7446 | 0.6850 | 0.6946 | 0.6917 |
| | BOW-baseline | 0.5300 | 0.7075 | 0.7812 | 0.6864 | 0.6763 |
| 18 | Khan | 0.5863 | 0.6692 | 0.6100 | 0.6354 | 0.6252 |
| | STAT-baseline | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 |
| 19 | Ribeiro-Oliveira et al. | | 0.7013 | 0.7650 | | 0.3666 |
| 20 | Alrifai et al. | 0.7225 | | | | 0.1806 |
| 21 | Bouzazi* | | 0.6121 | | | 0.1530 |
| 22 | Adame et al. | | | | 0.5413 | 0.1353 |
JOINT RANKING

| RANK | TEAM | ARABIC | ENGLISH | PORTUGUESE | SPANISH | AVERAGE |
|---|---|---|---|---|---|---|
| 1 | Basile et al. | 0.6831 | 0.7429 | 0.8288 | 0.8036 | 0.7646 |
| 2 | Martinc et al. | 0.6825 | 0.7042 | 0.8463 | 0.7850 | 0.7545 |
| 3 | Tellez et al. | 0.6713 | 0.7267 | 0.8425 | 0.7621 | 0.7507 |
| 4 | Miura et al. | 0.6419 | 0.6992 | 0.8575 | 0.7518 | 0.7376 |
| 5 | López-Monroy et al. | 0.6475 | 0.7029 | 0.8100 | 0.7604 | 0.7302 |
| 6 | Markov et al. | 0.6525 | 0.7125 | 0.7750 | 0.7704 | 0.7276 |
| 7 | Poulston et al. | 0.6356 | 0.6254 | 0.8188 | 0.7471 | 0.7067 |
| 8 | Sierra et al. | 0.5694 | 0.6567 | 0.8113 | 0.7279 | 0.6913 |
| | LDR-baseline | 0.5888 | 0.6357 | 0.7763 | 0.6943 | 0.6738 |
| 9 | Ogaltsov & Romanov | 0.5731 | 0.6450 | 0.7775 | 0.6846 | 0.6701 |
| 10 | Franco-Salvador et al. | 0.5688 | 0.6046 | 0.7525 | 0.7021 | 0.6570 |
| 11 | Kodiyan et al. | 0.5688 | 0.6263 | 0.7300 | 0.6646 | 0.6474 |
| 12 | Ciobanu et al. | 0.5619 | 0.5904 | 0.7575 | 0.6764 | 0.6466 |
| 13 | Schaetti | 0.5681 | 0.6150 | 0.7300 | 0.6718 | 0.6462 |
| 14 | Kheng et al. | 0.5475 | 0.5704 | 0.6475 | 0.6400 | 0.6014 |
| 15 | Ganesh* | 0.5075 | 0.4713 | 0.7300 | 0.5614 | 0.5676 |
| 16 | Kocher & Savoy | 0.5206 | 0.4650 | 0.7575 | 0.4971 | 0.5601 |
| | BOW-baseline | 0.1794 | 0.4713 | 0.7588 | 0.5561 | 0.4914 |
| 17 | Ignatov et al. | 0.2875 | 0.4333 | 0.6675 | 0.5593 | 0.4869 |
| 18 | Khan | 0.3650 | 0.1900 | 0.5488 | 0.2189 | 0.3307 |
| 19 | Ribeiro-Oliveira et al. | | 0.4831 | 0.7538 | | 0.3092 |
| 20 | Alrifai et al. | 0.5638 | | | | 0.1410 |
| | STAT-baseline | 0.1250 | 0.0833 | 0.2500 | 0.0714 | 0.1324 |
| 21 | Bouzazi* | | 0.2479 | | | 0.0620 |
| 22 | Adame et al. | | | | 0.1017 | 0.0254 |
Related Work
- Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, Benno Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog K., Cappellato L., Ferro N., Macdonald C. (Eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1609, pp. 750-784.
- Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In: Cappellato L., Ferro N., Jones G., San Juan E. (Eds.) CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/, 2015.
- Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-927.
- Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 LABs and Workshops. CEUR-WS.org, vol. 1179.
- Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli. Findings of the VarDial Evaluation Campaign 2017. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 1-15. Valencia, Spain.
- Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann. Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 1-14. Osaka, Japan.
- Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Preslav Nakov. Overview of the DSL Shared Task 2015. Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), pp. 1-9. Hissar, Bulgaria.
- Francisco Rangel, Paolo Rosso. On the Impact of Emotions on Author Profiling. Information Processing & Management, vol. 52, issue 1, pp. 73-92.
- Francisco Rangel, Marc Franco-Salvador, Paolo Rosso. A Low Dimensionality Representation for Language Variety Identification. In: Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016, Springer-Verlag, LNCS, arXiv:1705.10754.
- S. Argamon, M. Koppel, J. Pennebaker, J. Schler (2009). Automatically Profiling the Author of an Anonymous Text. Communications of the ACM 52(2): 119-123.
- J. Pennebaker (2011). The Secret Life of Pronouns: What Our Words Say About Us. New York: Bloomsbury Publishing.
- PAN-AP-16 corpus - Author Profiling Shared Task
- PAN-AP-15 corpus - Author Profiling Shared Task
- PAN-AP-14 corpus - Author Profiling Shared Task
- PAN-AP-13 corpus - Author Profiling Shared Task
- The Blog Authorship Corpus
- DSL Corpus Collection