The statistical properties of written Afrikaans as a complex network


Abstract

Ferdinand de Saussure (1966), one of the most influential linguists of the 20th century, proposed that language could be studied as a system. More recently, Dorogovtsev and Mendes (2001), Weideman (2009b), Lee, Mikesell, Joaquin, Mates and Schumann (2009), Kwapień and Drożdż (2012), Cong and Liu (2014) and others have proposed that language could be studied as a complex system. After the seminal publications of Watts and Strogatz (1998) and Barabási and Albert (1999), language has also been approached as a complex network (Ferrer i Cancho and Solé 2001), an approach in which the study of language entails precise, quantitative analysis. When language is studied as a complex network, statistical methods developed mainly in physics since the late 1990s are used to measure the similarities and differences between languages, and between languages and other complex networks such as protein-protein interaction networks, social networks, neural networks and power grids. Many studies of language as a complex network have accordingly been published in physics journals, in particular in Physica A: Statistical Mechanics and its Applications (Holanda, Pisa, Kinouchi, Martinez and Ruiz 2004; Kosmidis, Kalampokis and Argyrakis 2006; Markošová 2008; Zhou, Hu, Zhang and Guan 2008; Liang, Shi, Tse, Liu, Wang and Cui 2009; Sheng and Li 2009; Amancio, Nunes, Oliveira, Pardo, Antiqueira and Costa 2011; Ke, Zeng, Ma and Zhu 2014). Unlike studies in linguistics, such studies focus on the statistical properties of a language, and especially on its structure as a complex network.

In the study of language as a complex network, Markošová (2008) distinguishes between two approaches: conceptual and positional studies. Conceptual studies examine the relationships between words on a semantic level, and include studies of synonyms, antonyms and hyponyms (Borge-Holthoefer and Arenas 2010; Kenett, Kenett, Ben-Jacob and Faust 2011). Positional studies investigate the surface structure of a language by analysing word co-occurrence networks. The latter approach is taken in the current study, following several international studies (e.g. Ferrer i Cancho and Solé 2001; Masucci and Rodgers 2006; Antiqueira, Nunes, Oliveira and Costa 2007; Markošová 2008; Minett and Wang 2008; Zhou et al. 2008; Liang et al. 2009; Sheng and Li 2009; Grabska-Gradzińska, Kulig, Kwapień and Drożdż 2012; Ke et al. 2014) and a South African study (Senekal and Geldenhuys 2016).
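To make the positional approach concrete, the following minimal Python sketch builds a word co-occurrence network from a list of sentences. It rests on assumptions that are not taken from the study: words are linked only when they are directly adjacent within a sentence, text is lowercased, punctuation and digits are discarded, and edges are undirected and unweighted. The function name `build_cooccurrence_network` and the two sample sentences are purely illustrative.

```python
import re
from collections import defaultdict


def build_cooccurrence_network(sentences):
    """Return an undirected adjacency map: word -> set of neighbouring words.

    Assumption (not from the study): two words are linked when they occur
    directly next to each other in the same sentence.
    """
    adjacency = defaultdict(set)
    for sentence in sentences:
        # Lowercase the sentence and keep runs of Unicode letters only.
        words = re.findall(r"[^\W\d_]+", sentence.lower())
        for left, right in zip(words, words[1:]):
            if left != right:  # ignore self-loops from repeated words
                adjacency[left].add(right)
                adjacency[right].add(left)
    return dict(adjacency)


# Hypothetical toy input, for illustration only.
sentences = ["Die kat sit op die mat.", "Die hond sit op die stoep."]
network = build_cooccurrence_network(sentences)

# The degree of a word is the number of distinct words it co-occurs with.
degrees = {word: len(neighbours) for word, neighbours in network.items()}
```

In such a network, frequent function words (here "die") tend to acquire many neighbours, which is what the degree distribution analysis later quantifies.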

The aim of the present study is to determine the statistical properties of written Afrikaans as a complex network and to compare them with those of English. The focus falls on macro-level structural characteristics relating to two network models from mathematical graph theory, namely those of Watts and Strogatz (1998) and Barabási and Albert (1999). More specifically, the comparison of Afrikaans texts and network models relates to:

Whether, and if so to what extent, the structure of Afrikaans is similar to that of English and other languages in terms of Watts and Strogatz's small-world network model. For this analysis, the average path length (L) and clustering (C) of Afrikaans texts are compared with those of the Erdős and Rényi (1959) random network model, using the method proposed by Humphries and Gurney (2008), which yields a value for the small-world index (S).
Whether, and if so to what extent, the structure of Afrikaans is similar to that of English and other languages in terms of the degree distribution patterns found in Barabási and Albert's scale-free network model. For this comparison, the degree distribution of words is compared with the degree distribution in an equivalent Barabási and Albert network.
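The first comparison can be sketched in code. Humphries and Gurney's small-world index is S = (C / C_rand) / (L / L_rand), where C_rand and L_rand are the clustering and average path length of an equivalent random network. The sketch below is an illustration under stated assumptions, not the study's procedure: it assumes a connected, undirected network stored as an adjacency map, and it estimates the random-network values with the standard analytic approximations C_rand ≈ ⟨k⟩/n and L_rand ≈ ln(n)/ln(⟨k⟩) rather than by generating random graphs. All function names are hypothetical.

```python
import math
from collections import deque


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for neighbours in adj.values():
        nbrs = list(neighbours)
        k = len(nbrs)
        if k < 2:
            continue
        # Count the links that exist between pairs of neighbours.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (breadth-first search)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def small_world_index(adj):
    """S = (C / C_rand) / (L / L_rand), using analytic random-graph estimates (an assumption)."""
    n = len(adj)
    mean_degree = sum(len(v) for v in adj.values()) / n
    c_rand = mean_degree / n                      # approximation for a random graph
    l_rand = math.log(n) / math.log(mean_degree)  # requires mean_degree > 1
    return (average_clustering(adj) / c_rand) / (average_path_length(adj) / l_rand)


# Hypothetical toy network: a triangle a-b-c with a tail c-d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
s_value = small_world_index(toy)
```

Values of S well above 1 are usually read as evidence of small-world structure. The all-pairs breadth-first search is quadratic in network size, so a network of realistic size would normally be handled with a graph library or with sampling.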

In order to conduct the comparison between written Afrikaans as a complex network and the above network models, as well as with English texts, a large data set was analysed. The Afrikaans component of the data set consists of a total of 257 863 sentences and 5 009 819 words, while the English component consists of 374 706 sentences and 6 928 894 words. In total, 11 938 713 words are included in this data set. From this data set, 63 word co-occurrence networks were constructed and analysed.

As found in studies of other languages, Afrikaans is characterised by a short average path length: L lies in the range 2,7743 ≤ L ≤ 3,2385, with an average of L = 2,975. For the English texts studied here, L lies in the range 2,5752 ≤ L ≤ 3,0448, with an average of L = 2,747. Afrikaans shows a lesser degree of clustering than English: for the Afrikaans networks C lies in the range 0,2645 ≤ C ≤ 0,4928, with an average of C = 0,386, while for the English networks C lies in the range 0,3445 ≤ C ≤ 0,6319, with an average of C = 0,478.

Humphries and Gurney's method also shows Afrikaans to be a small-world network. For the networks investigated, S lies in the range 105 ≤ S ≤ 2544 for Afrikaans (with an average of S = 837) and in the range 124 ≤ S ≤ 1907 for English (with an average of S = 541).

In addition, the distribution of degrees in Afrikaans word co-occurrence networks follows the pattern of the Barabási and Albert (BA) model rather than that of the Erdős and Rényi (ER) model. Correlations between the word co-occurrence networks and the BA model lie in the range 0,6384 ≤ r ≤ 0,9215 (with an average of r = 0,8358) for Afrikaans texts and 0,6997 ≤ r ≤ 0,9138 (with an average of r = 0,8334) for English texts. By contrast, correlations between the word co-occurrence networks and their equivalent ER models lie in the range –0,2393 ≤ r ≤ 0,2067 (with an average of r = –0,1421) for Afrikaans texts and –0,2525 ≤ r ≤ –0,1044 (with an average of r = –0,1916) for English texts. The ER model is therefore not a suitable model for the degree distribution patterns in these networks, while the BA model is well suited to representing the degree distributions of words in Afrikaans.
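One possible way to compare an observed degree distribution with the two models is sketched below. It is an illustration only, and the abstract does not state how the reported correlations were computed. The assumptions are: the Barabási and Albert (BA) model is represented by its known stationary degree distribution P(k) = 2m(m+1) / (k(k+1)(k+2)) for k ≥ m, the Erdős and Rényi (ER) model by a Poisson distribution with the same mean degree, and agreement is measured with Pearson's r over the degree values 1 to the maximum observed degree. The function names and the toy degree sequence are hypothetical.

```python
import math
from collections import Counter


def ba_pmf(k, m):
    """Stationary BA degree distribution: 2m(m+1) / (k(k+1)(k+2)) for k >= m, else 0."""
    return 2.0 * m * (m + 1) / (k * (k + 1) * (k + 2)) if k >= m else 0.0


def poisson_pmf(k, mean):
    """Poisson probability, computed in log space to avoid overflow for large k."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def compare_with_models(degrees):
    """Return (r_BA, r_ER): correlation of the observed distribution with each model."""
    counts = Counter(degrees)
    n = len(degrees)
    ks = list(range(1, max(degrees) + 1))
    observed = [counts.get(k, 0) / n for k in ks]
    mean_degree = sum(degrees) / n
    # Assumption: choose m so that the BA mean degree (about 2m) matches the data.
    m = max(1, round(mean_degree / 2))
    r_ba = pearson(observed, [ba_pmf(k, m) for k in ks])
    r_er = pearson(observed, [poisson_pmf(k, mean_degree) for k in ks])
    return r_ba, r_er


# Hypothetical heavy-tailed degree sequence, for illustration only.
toy_degrees = [1] * 60 + [2] * 15 + [3] * 7 + [4] * 4 + [5] * 2 + [12]
r_ba, r_er = compare_with_models(toy_degrees)
```

For a heavy-tailed sequence like the toy one, the BA curve tracks the observed distribution more closely than the Poisson curve does. The exact coefficients depend on choices such as binning and the range of degrees compared, so this sketch is not expected to reproduce the reported values.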

Overall, Afrikaans is found to be statistically similar to English and other languages as studied in previous complex network studies, but differences are also discussed. Suggestions are made for further research.

Keywords: Afrikaans; R. Albert; A.-L. Barabási; complex networks; scale-free networks; small-world networks; S.H. Strogatz; D.J. Watts; word co-occurrence networks

Read the full article in Afrikaans: Die statistiese eienskappe van geskrewe Afrikaans as ’n komplekse netwerk


Litnet
