Authorship identification on limited samplings

Boran, T.; Martinaj, M.; Hossain, M.S.

doi:10.1016/j.cose.2020.101943

Resource type

Authors/contributors

Boran, T. (Author)
Martinaj, M. (Author)
Hossain, M.S. (Author)

Title

Authorship identification on limited samplings

Abstract

The internet has changed the way that many people access written works. Books and articles of various lengths and in several formats can be bought and accessed online, both legally and illegally. Even shorter texts originate from forums, SMS, blogs, emails, and social media. Automating the process of determining the authorship of posted texts would help combat plagiarism and online piracy of copyrighted text. In addition, authorship identification could help detect fraudulent email messages from dangerous sources and combat cyberattacks by identifying authentic sources. We experiment with several machine learning algorithms on a limited set of public domain literature to identify the most efficient method of authorship identification using the fewest samples. Differently sized data sets are created by 5 predefined rounds of random sampling of 1,500-word blocks on a total of 28 text books from a corpus of 7 authors. Traditional methods of authorship identification, such as Naive Bayes, Artificial Neural Network, and Support Vector Machine, are implemented in addition to a modern Deep Learning Neural Network for classification. Thirteen stylometric features are extracted, spanning character-based, word-based, and syntactic features. Our model consistently showed that Support Vector Machine outperforms the other classification methods. © 2020
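
The sampling and feature-extraction steps summarized in the abstract might look roughly like the Python sketch below. It is an illustration only, not the authors' implementation: the function names `sample_blocks` and `stylometric_features`, the seed handling, and the three features shown are assumptions, since the abstract does not list the thirteen features or the sampling details. The classification step (for example a Support Vector Machine) is left out.

```python
import random
import re


def sample_blocks(text, block_size=1500, n_blocks=5, seed=0):
    """Draw n_blocks random contiguous blocks of block_size words from text.

    Hypothetical helper: the paper's exact sampling scheme is not given
    in the abstract, so overlapping blocks are allowed here.
    """
    words = text.split()
    if len(words) < block_size:
        raise ValueError("text is shorter than one block")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    starts = [rng.randrange(len(words) - block_size + 1) for _ in range(n_blocks)]
    return [words[s:s + block_size] for s in starts]


def stylometric_features(words):
    """Return three illustrative features for one block of words.

    These are assumed examples of the three feature families named in
    the abstract, not the thirteen features used in the paper.
    """
    text = " ".join(words)
    n_words = len(words)
    # Character-based: average word length in characters
    avg_word_len = sum(len(w) for w in words) / n_words
    # Word-based: type-token ratio (distinct words / total words)
    type_token_ratio = len({w.lower() for w in words}) / n_words
    # Rough syntactic proxy: punctuation marks per word
    punct_per_word = len(re.findall(r"[,;:.!?]", text)) / n_words
    return [avg_word_len, type_token_ratio, punct_per_word]
```

Each block's feature vector, labeled with its author, could then be passed to any standard classifier to compare methods across data set sizes.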

Publication

Computers and Security

Publisher

Elsevier Ltd

Date

2020

Volume

97

Journal Abbr

Comput Secur

DOI

10.1016/j.cose.2020.101943

Citation Key

boranAuthorshipIdentificationLimited2020

URL

https://www.scopus.com/inward/record.uri?eid=2-s2.0-85087198784&doi=10.1016%2fj.cose.2020.101943&partnerID=40&md5=3df5edc2ccac3acedd64b5d46ba0f809

ISSN

0167-4048