Author identification: A forensic-linguistic research study in Afrikaans SMS language

Forensic linguistics is a field of study that has gained popularity in many countries around the world (Blackwell 2012). In South Africa forensic linguistics is not a well-known field of study, but academics and postgraduate students are beginning to explore research and study opportunities within this field. Author identification, which is the focus of this article, is only one of the subcategories within forensic linguistics. A very basic definition of forensic linguistics is that it is a section of applied linguistics where a variety of both written and spoken texts is analysed for judicial purposes. The field is roughly divided into two main categories: language use and its judicial implications and the analysis of forensic texts (written or spoken). In the first category forensic linguists consider translating and interpreting in a courtroom setting, the language use and discourse of a trial and the language rights of individuals in the courtroom or during the course of the trial, among others. The second category includes author identification, speaker identification, profiling and identification of plagiarism (Olsson n.d.:4–5).

Author identification is the analysis of a text with the goal of determining its possible author when there is uncertainty or dispute about who wrote that specific text (or group of texts). Texts that are usually analysed in author identification include ransom notes, e-mail messages, threat letters and blackmail messages. Although author identification has previously been applied to shorter texts such as SMS messages, ransom notes and Facebook messages (Ishihara 2011; McLeod and Grant 2012; Michell 2013), research on author identification of short (and extremely short) texts remains limited. Author identification in Afrikaans SMS messages has never been attempted. It is mainly for this reason that the current article, and the dissertation on which it is based, are considered of value to the field of author identification (Thiart 2014).

For the purposes of this study the researcher aimed to answer three questions. First it had to be determined if a generic SMS language exists that could complicate author identification. The presence of a generic SMS language would mean that there are very few individual characteristics present when SMS messages in the corpus are compared. Secondly, it had to be determined whether individual idiolects could be identified within the supposed generic SMS language, and thirdly, to what extent it is possible to identify the author of an SMS text with the limited data available to the forensic linguist.

Thirteen participants between the ages of 18 and 23 were used in the research. The only selection criterion was that the participants had to be mother-tongue speakers of Afrikaans. Each participant was asked to send 5 to 10 SMS messages of between 30 and 50 words each to the researcher. The participants were asked to select messages on their phones that they had already sent, i.e. messages that they had already typed in the past. This was done to ensure that the participants would not type in a different manner when creating new messages that they knew would be used in the analysis. Each participant was given a number in order for the researcher to identify participants and to ensure that they remained anonymous. One participant, Deelnemer 2 (Participant 2), was asked to send a second set of messages to the researcher. This set was labelled Teks X (Text X) and was the “suspect text”. All the other texts in the corpus were compared with Text X in order to determine if it was possible to match the first set of texts from Deelnemer 2 with the “suspect” texts. Based on the statistical analyses and comparisons the researcher would then be able either to identify Deelnemer 2 as the author of Text X with a high percentage of certainty or conclude that it was not possible to match Deelnemer 2 to Text X successfully.

The corpus for the study is small, consisting of only 2 434 words in total. This is because the system used to receive the SMS messages from the participants, namely SMSPortal, limits the number of characters it can read per SMS. As a result, some of the SMS messages were truncated, which decreased the amount of data available to the researcher.

Both stylometric and stylistic methods were used to analyse the data. WordSmith Tools and AntConc were used to perform statistical analyses on the data, and a very basic n-gram analysis was also used to strengthen the results. Both Pearson’s chi-square test and the Yates correction were used in determining the results of the statistical analyses. The limited amount of data obtained from the participants is realistic for what can be expected in a real-life forensic linguistic situation. Even though no actual crime was being investigated, the research gives an accurate indication of what is possible when a forensic linguist has only limited data to analyse and a number of possible authors.
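As a rough illustration of the kind of calculation involved, the sketch below computes Pearson’s chi-square statistic for a 2x2 table, with and without the Yates correction, and builds a simple n-gram count. It is not the procedure used in the study, which relied on WordSmith Tools and AntConc. The function names, the choice of character (rather than word) n-grams and the example counts are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lower-cased text.

    Character n-grams are an assumption; the study does not say
    whether its n-grams were character- or word-based.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def feature_table(feature, known_tokens, suspect_tokens):
    """Build 2x2 counts (feature vs. all other tokens) for two texts."""
    a = known_tokens.count(feature)
    c = suspect_tokens.count(feature)
    return a, len(known_tokens) - a, c, len(suspect_tokens) - c


def chi_square_2x2(a, b, c, d, yates=True):
    """Pearson's chi-square statistic for the table [[a, b], [c, d]].

    Rows are the two texts, columns are "feature" and "other tokens".
    With yates=True, the Yates continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    diff = abs(a * d - b * c)
    if yates:
        # Shrink the difference by n/2, never letting it go below zero.
        diff = max(0.0, diff - n / 2)
    return n * diff ** 2 / (row1 * row2 * col1 * col2)


# Invented counts: a feature occurs 12 times in 200 tokens of a known
# text and 5 times in 200 tokens of a suspect text.
uncorrected = chi_square_2x2(12, 188, 5, 195, yates=False)
corrected = chi_square_2x2(12, 188, 5, 195)
print(round(uncorrected, 3), round(corrected, 3))
```

With one degree of freedom the critical value at the 5% level is 3.841, so a statistic below that value gives no evidence that the two texts differ in their use of the feature. The Yates correction always lowers the statistic, which makes the test more conservative when counts are small, as they are in a corpus of short SMS messages.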

The results of both the stylistic and stylometric analyses address all three of the research questions mentioned above. Firstly, it was found that no generic SMS language existed among the participants in this study. Secondly, this indicated that idiolects were present. Thirdly, however, due to the limited data used in the study it was not possible to determine the author of the suspect text with any certainty. Although these results were negative, they were still useful in narrowing down the number of suspects from 13 to 11. Eleven is still a large number of suspects, but within that group the actual author (Deelnemer 2) was identified as the possible author in most of the analysis results.

The results showed that even though identification of the actual author of the suspect text was not possible in the situation created in the study, the methods used do show potential. As mentioned above, several researchers have shown that successful author identification is, to some extent, possible when a forensic linguist has limited data to analyse. It has to be taken into account, however, that these studies made use of a much larger corpus than was the case in the current study. Other methods should also be tested on a similarly small corpus to see if better results can be achieved. It is also important to note that “successful author identification” does not mean that a suspect has been identified with 100% certainty; it simply indicates that the statistical probability of a suspect’s being the author of a specific text is high enough for him or her to be considered as the possible author.

Keywords: author identification, forensic linguistics, idiolect, SMS, stylometry

Read the full article in Afrikaans: Outeuridentifikasie: ’n Forensies-taalkundige ondersoek na Afrikaanse SMS-taal.

  • As far as Afrikaans is concerned, one must also bear in mind the various variants of Afrikaans, as well as the fact that the idiolect of individuals in SA has been drastically influenced in recent years by 'gentrification', immigration, intercultural communication, a new kind of socialisation, etc. 'Code switching' and 'style switching' occur so frequently as a result of the changed and ever-changing SA communities that they will greatly complicate any analysis.

  • Thank you very much for this article. I am currently studying BA Communication Science, but would like to go into "Forensic Linguistics". This article is very helpful. - Janrie
