Understanding Authorship Attributions of Kazakh Texts via Distance Measures


Izbassarov T., Turan C.

2022 International Conference on Smart Information Systems and Technologies, SIST 2022, Nur-Sultan, Kazakhstan, 28 - 30 April 2022

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/sist54437.2022.9945707
  • City: Nur-Sultan
  • Country: Kazakhstan
  • Keywords: Authorship attribution, Burrows's Delta, Eder's Delta, Intertextual distance
  • Süleyman Demirel University Affiliated: No

Abstract

© 2022 IEEE. Identifying the author of a given text is an important open problem in Natural Language Processing (NLP), with useful applications such as plagiarism detection and forensic analysis. However, there is little to no information about the effectiveness of authorship identification methods when applied to the Kazakh language. This paper therefore implements several empirical language-based authorship identification techniques and reviews their accuracy. For this purpose, a new dataset was assembled, consisting of texts by 11 different authors, including Abai Qunanbaiuly, Muhtar Ayezov, and Saken Seifyllin. All documents in the corpus were pre-processed, tokenized, and split into samples of 5,000 n-grams of size 1 (unigrams). The methods reviewed include the Chi-Squared measure by Kilgarriff and the Delta method by Burrows, together with its derivatives and variants. Experimentally, we show that methods based on Burrows's Delta produce promising results when applied to texts written in Kazakh.
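
For readers unfamiliar with the Delta family of measures, the following is a minimal sketch of the classic Burrows's Delta as it is commonly described: relative frequencies of the most frequent words are z-scored across the corpus, and the distance between two samples is the mean absolute difference of their z-scores. It is not the authors' implementation. The function names, the placeholder tokens, the tiny feature count, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def z_scores(samples, n_features):
    """Return (vocab, z) where z maps each sample name to its z-scored
    relative frequencies over the n_features most frequent corpus tokens.

    samples: dict mapping a sample name to a list of tokens (unigrams).
    """
    # Most frequent tokens across the whole corpus form the feature set.
    corpus_counts = Counter()
    for tokens in samples.values():
        corpus_counts.update(tokens)
    vocab = [tok for tok, _ in corpus_counts.most_common(n_features)]

    # Relative frequency of each feature token in each sample.
    freqs = {}
    for name, tokens in samples.items():
        counts = Counter(tokens)
        freqs[name] = [counts[tok] / len(tokens) for tok in vocab]

    # Standardize each feature across samples (assumption: population
    # standard deviation; a zero deviation is replaced by 1.0).
    columns = list(zip(*freqs.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]
    z = {
        name: [(f - m) / s for f, m, s in zip(row, mus, sigmas)]
        for name, row in freqs.items()
    }
    return vocab, z


def burrows_delta(z_a, z_b):
    """Mean absolute difference between two z-score vectors."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


# Hypothetical toy corpus with placeholder tokens, not real Kazakh text.
toy_samples = {
    "author_A": "a a a b c".split(),
    "author_B": "a b b b c".split(),
    "unknown": "a a a a b c c".split(),
}
vocab, z = z_scores(toy_samples, n_features=3)
dist_to_a = burrows_delta(z["unknown"], z["author_A"])
dist_to_b = burrows_delta(z["unknown"], z["author_B"])
```

In an attribution setting, the candidate author with the smallest distance to the disputed sample would be chosen. Real experiments would use far larger samples (the abstract mentions 5,000 unigrams per sample) and many more features than this toy setup.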