Authorship identification on limited samplings

Resource type
Authors/contributors
Boran, T.; Martinaj, M.; Hossain, M. S.
Title
Authorship identification on limited samplings
Abstract
The internet has changed the way many people access written works. Books and articles of various lengths and in several formats can be bought and accessed online, both legally and illegally. Even shorter texts originate in forums, SMS, blogs, emails, and social media. Automating the process of determining the authorship of posted texts would help combat plagiarism and online piracy of copyrighted text. In addition, authorship identification could help detect fraudulent email messages from dangerous sources and combat cyberattacks by identifying authentic sources. We experiment with several machine learning algorithms on a limited set of public domain literature to identify the most efficient method of authorship identification using the fewest samples. Differently sized data sets are created by 5 predefined rounds of random sampling of 1500-word blocks from a total of 28 text books in a corpus of 7 authors. Traditional methods of authorship identification, such as Naive Bayes, Artificial Neural Network, and Support Vector Machine, are implemented in addition to a modern Deep Learning Neural Network for classification. Thirteen stylometric features are extracted, ranging from character-based and word-based to syntactic features. Our model consistently showed that Support Vector Machine outperforms the other classification methods. © 2020
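The sampling and feature-extraction steps named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions: blocks are taken as consecutive, non-overlapping runs of 1500 words, sampling is without replacement, and the three features shown are illustrative stand-ins, since the abstract does not list the thirteen features the authors used. All function names are hypothetical and are not taken from the paper.

```python
import random
import string

BLOCK_SIZE = 1500  # words per block, as stated in the abstract


def split_into_blocks(text, block_size=BLOCK_SIZE):
    """Split a text into consecutive, non-overlapping blocks of block_size words.

    Assumption: a trailing partial block is dropped.
    """
    words = text.split()
    n_full = len(words) // block_size
    return [words[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def sample_blocks(blocks, n_samples, seed=0):
    """Draw up to n_samples blocks at random, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps a sampling round repeatable
    return rng.sample(blocks, min(n_samples, len(blocks)))


def stylometric_features(block):
    """Return three illustrative features for one non-empty block of words.

    These are examples of word-based and character-based measures only;
    they are not claimed to be the features used in the paper.
    """
    text = " ".join(block)
    punctuation_count = sum(ch in string.punctuation for ch in text)
    return {
        "avg_word_length": sum(len(w) for w in block) / len(block),
        "type_token_ratio": len({w.lower() for w in block}) / len(block),
        "punctuation_ratio": punctuation_count / len(text),
    }
```

In use, each book would be passed through split_into_blocks, a fixed number of blocks per author drawn with sample_blocks, and the resulting feature dictionaries turned into vectors for whichever classifier is being compared.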
Publication
Computers and Security
Publisher
Elsevier Ltd
Date
2020
Volume
97
Journal Abbr
Comput Secur
Citation Key
boranAuthorshipIdentificationLimited2020
ISSN
0167-4048
Archive
Scopus
Language
English
Extra
6 citations (Crossref) [2023-10-31]
Citation
Boran, T., Martinaj, M., & Hossain, M. S. (2020). Authorship identification on limited samplings. Computers and Security, 97. https://doi.org/10.1016/j.cose.2020.101943