A corpus-based study of cohesion links as characteristic of authorship of André le Roux and Dana Snyman

Abstract

Individuals prefer certain linguistic items when they use language, so writers leave traces of their authorship in the texts they produce (Louwerse 2004:307). In theory, a text can therefore be attributed to a specific author on the basis of the linguistic choices it contains. Identifying the author of a given text is the task of the academic field of authorship attribution. Juola (2008:249–51) points out that some aspects of authorship attribution remain unsettled. These relate mainly to choosing the most appropriate methodology for an attribution study and to choosing the linguistic item to be analysed as a possible indicator of authorship. This study explores both questions.

This hypothesis-testing study aims mainly to explore a multistage process for isolating and analysing cohesion links within a text. Its methodological contribution to the field of authorship attribution lies in its research design: the authors of all the texts in the corpus were known, but it was unclear whether cohesion links indicate authorship and how patterns in their use could best be shown. Predetermined linguistic variables (cohesion links) were therefore analysed in texts by known authors in order to identify patterns in the use of cohesion links as possible signs of authorship. The five types of cohesive device described by, among others, Halliday and Hasan (1976) served as the linguistic variables: reference, substitution, ellipsis, conjunction and lexical cohesion. In addition, the study describes how cohesion links appear in two text types, not only in the texts of two different authors.

A corpus-based methodology was followed to study the use of cohesive devices in the texts of two popular Afrikaans authors, André le Roux and Dana Snyman. The corpus consisted of four sub-corpora: columns by Dana Snyman, short stories by Dana Snyman, columns by André le Roux and short stories by André le Roux. Each sub-corpus contained ten texts. Both the text analysis and the statistical analysis were divided into stages to simplify the analysis and improve the reliability of the findings.

The text analysis had three phases. In the first phase, the texts (in .txt format) were tagged to create metadata indicating cohesion: every linguistic item that was cohesively related to another item, regardless of the type of cohesion link, was tagged as related to its antecedent. In the second phase, the tagged .txt files were processed with computer software (Oxford WordSmith Tools) to isolate the parts of each text containing linguistic items that related to a single antecedent. The tags could also be sorted at this point, so that all tags relating to one antecedent could be viewed together. In the third phase, each tag was classified as an instance of a specific type of cohesion link. Once the cohesion links had been mapped, data on their frequency and position within the text were available for statistical analysis.
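The first two phases can be illustrated with a minimal sketch. The abstract does not describe the tagging scheme, and the study used Oxford WordSmith Tools, not custom code, so everything below is an assumption made for illustration: the `<coh ant="...">` tag format, the function name `group_by_antecedent` and the sample sentence are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical tag format, assumed for illustration only:
#   <coh ant="ANTECEDENT_ID">linked item</coh>
TAG = re.compile(r'<coh ant="([^"]+)">(.*?)</coh>')

def group_by_antecedent(tagged_text):
    """Return {antecedent_id: [(item, relative_position), ...]}."""
    # Length of the text once the markup is stripped; positions are scaled to 0..1.
    plain_length = len(TAG.sub(lambda m: m.group(2), tagged_text))
    groups = defaultdict(list)
    markup_seen = 0  # markup characters that precede the current match
    for match in TAG.finditer(tagged_text):
        antecedent, item = match.group(1), match.group(2)
        # Offset of the item in the untagged text.
        start_in_plain = match.start() - markup_seen
        groups[antecedent].append((item, round(start_in_plain / plain_length, 2)))
        markup_seen += len(match.group(0)) - len(item)
    return dict(groups)

# Invented English sentence, not taken from the corpus.
sample = ('The man called the dog. <coh ant="dog">It</coh> came, and '
          '<coh ant="man">he</coh> smiled at <coh ant="dog">it</coh>.')

for antecedent, items in sorted(group_by_antecedent(sample).items()):
    print(antecedent, items)
```

Grouping by antecedent first and recording a relative position for each item mirrors the order described above: the type of cohesion link is assigned only afterwards, in the third phase, and that classification remains a manual judgement.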

Like the text analysis, the statistical analysis consisted of multiple stages, with the goal of systematically and reliably identifying and delineating the relevant data. In each stage a data set was analysed with a specific statistical test to identify the parts of the data set that could be unique to the author or text type in question. The data identified in one stage became the data set for the next. First, descriptive statistics and relative frequencies were used to identify anomalies in the use of cohesion links in any of the sub-corpora; cohesion links that appeared notably more or less frequently in certain sub-corpora were flagged for further analysis. In the second stage, chi-square analyses were performed on the frequencies of occurrence of the flagged cohesion link types. Cohesion link patterns whose frequencies in one sub-corpus differed to a statistically significant degree from those in the other sub-corpora were flagged as possible indications of authorship for the texts in that sub-corpus and carried over to the third stage. In the third stage a binary regression model was generated from the cohesion link categories identified in the previous two stages. This model was used to test how well the authorship of selected sub-corpora could be predicted.
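The first two stages can be sketched in a few lines of Python. The counts below are invented for illustration and are not data from the study; the sub-corpus labels, the function names and the choice of a Pearson chi-square test on the whole contingency table are likewise assumptions, since the abstract does not specify how the tests were set up.

```python
# Invented counts of cohesion link types in two hypothetical sub-corpora.
counts = {
    "sub_corpus_1": {"reference": 120, "conjunction": 60, "lexical": 220},
    "sub_corpus_2": {"reference": 90, "conjunction": 95, "lexical": 215},
}

def relative_frequencies(row):
    """Stage 1: share of each cohesion link type within one sub-corpus."""
    total = sum(row.values())
    return {link_type: n / total for link_type, n in row.items()}

def chi_square(table):
    """Stage 2: Pearson chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Critical values of the chi-square distribution at alpha = 0.05.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

for name, row in counts.items():
    print(name, {k: round(v, 3) for k, v in relative_frequencies(row).items()})

link_types = sorted(counts["sub_corpus_1"])
table = [[counts[name][t] for t in link_types] for name in counts]
statistic, df = chi_square(table)
print("significant at 0.05:", statistic > CRITICAL_05[df])
```

Comparing the statistic with a tabulated critical value keeps the sketch free of third-party dependencies; a statistics package would normally report an exact p-value instead.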

The statistical analysis supported several findings about authorship and about the mapping of cohesive devices as an authorship attribution strategy. In general, Dana Snyman’s use of cohesion links was more consistent across the two text types than André le Roux’s. The two authors also differed more clearly in their use of lexical cohesion, reference and conjunction than in their use of substitution and ellipsis. Lastly, the position in which cohesion links appear in the text did not vary significantly between the two authors; the relative frequency of cohesion links in the whole text was a better indicator of authorship.
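Whole-text relative frequencies are the kind of signal a binary regression model can use to predict authorship. The sketch below assumes a binary logistic regression fitted by gradient descent; the abstract says only "binary regression model", so the model type, the two features, the centring step and all the numbers are assumptions for illustration, not the study's data or results.

```python
import math

# Invented toy data: per-text relative frequencies of (reference, conjunction).
# Label 1 = hypothetical author A, label 0 = hypothetical author B.
X = [(0.32, 0.14), (0.30, 0.15), (0.31, 0.13), (0.29, 0.16),
     (0.20, 0.24), (0.22, 0.23), (0.21, 0.25), (0.23, 0.22)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Centre each feature on its mean so the intercept stays near zero.
means = [sum(col) / len(col) for col in zip(*X)]

def centre(x):
    return [xj - mj for xj, mj in zip(x, means)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    rows = [centre(x) for x in X]
    w = [0.0] * len(rows[0])
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(rows, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (author A) if the predicted probability is at least 0.5."""
    z = sum(wj * xj for wj, xj in zip(w, centre(x))) + b
    return 1 if sigmoid(z) >= 0.5 else 0

w, b = fit_logistic(X, y)
print([predict(w, b, x) for x in X])
```

With only ten texts per sub-corpus, a model of this kind would need to be checked on texts held out from fitting before its predictions could be read as evidence of authorship.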

While this study provides useful insights into the patterns of cohesion link use of these two authors, many questions remain about the practicalities of using cohesion links to identify authorship. Categorising linguistic items into types of cohesion links can be subjective, and the best way to show variation statistically is still not clear. Nonetheless, the multistage process of text analysis and statistical analysis proved useful in promoting evidence-based methodological decisions within the research process. Closer collaboration with statisticians and more empirical research using this method may help to refine the process and clarify methodological best practice in the field of authorship attribution.

Keywords: authorship style; cohesion; column; computer-based authorship attribution; idiolect

 

Read the full article in Afrikaans:

’n Korpusgebaseerde ondersoek na kohesieskakels as kenmerkende eienskap van outeurstyl van Dana Snyman en André le Roux
