## Guido Caldarelli

Print publication date: 2007

Print ISBN-13: 9780199211517

Published to Oxford Scholarship Online: January 2010

DOI: 10.1093/acprof:oso/9780199211517.001.0001

Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (www.oxfordscholarship.com). (c) Copyright Oxford University Press, 2019. All Rights Reserved. An individual user may print out a PDF of a single chapter of a monograph in OSO for personal use.  Subscriber: null; date: 17 October 2019

# Social and cognitive networks

Chapter:
(p.211) 10. Social and cognitive networks
Source:
Scale-Free Networks
Publisher:
Oxford University Press
DOI:10.1093/acprof:oso/9780199211517.003.0011

# Abstract and Keywords

This chapter studies the properties of social and cognitive networks, including collaborative and linguistic networks such as Wikipedia.

When Professor Paul Erdőscripto;s of Budapest announced that he would shortly be celebrating his eighty-first birthday with a party of special magnificence, there was much talk and excitement in the mathematical community. The special reason for the excitement was that this was probably the last time he could celebrate a birthday with a number of years that was a perfect square (Cerabino, 1994). Paul Erdőscripto;s was not only the developer of graph theory, but also one of the most respected mathematicians of his time (Hoffmann, 1998). A great sign of distinction for his colleagues was to be numbered in the list of his collaborators and coauthors for a scientific paper. In his life he wrote many papers, with 507 distinct collaborators. These 507 are at (p.212) a ‘distance’ 1 from Erdőscripto;s. People collaborating with them (if not already listed in the 507 set) are at a distance 2 from Erdőscripto;s. This ‘Erdőscripto;s Number’ is then a measure of the proximity with such a genius.

Fig. 10.1 The map of co-authorship for Nobel laureates in Physics (⋆) and in Chemistry (+) (after the on-line extended version of De Castro and Grossman (1999)).

This issue is so serious that various scientific papers have been written on this subject (De Castro and Grossman, 1999; Batagelj and Mrvar, 2000) and a web page (from which we took the above figures) is also devoted to the Erdőscripto;s number (Grossman, 2003). Unfortunately, it is very difficult to be proud of a small Erdőscripto;s number. Either you did not write a scientific paper in your life and therefore your distance is infinite, or your Erdőscripto;s number is small. I am not very good at mathematics (Gabrielli, 2000) but nevertheless (as far as I know) my Erdőscripto;s number is as small as 3.68

The idea of using networks to characterize social systems was introduced by an experiment on the structure of social communities (Milgram, 1967; Travers and Milgram, 1969). An example of collaboration network is shown in Fig. 10.1 The purpose of the experiment was to measure the network of acquaintances in human society. The idea was to send a passport-like packet to a few hundred randomly selected individuals in Wichita, Kansas. They had to deliver them to one ‘target’ in Cambridge Massachusetts. The study was then repeated by selecting random person in Omaha, Nebraska with ‘target’ living in Sharon and working in Boston, Massachusetts. A constraint in the delivering process was that each person must send the packet only to someone whom they knew on a first-name basis (thinking they were more likely to know the target than themselves). To this end some information was provided on the targets (name, address, occupation). Since every participant was also requested to tear off a card (after recording certain demographic details about themselves) and mail it back to the organizers of the experiment, it was possible to track all the paths to destination. The results of the experiment were surprising. Most of the messages arrived at their destination and this happened in a very short number of passages. A similar experiment on a much larger scale69 was conducted at Columbia University (Dodds et al., 2003) yielding similar results. This phenomenon described as ‘six degrees of separation’ (or ‘small-world effect’) became lately widely known being the title of a Hollywood movie (F. Schepisi (Director) and J. Guare (Writer), 1993).

The cases presented in this chapter range from the structure of collaboration between people to the propagation of diseases in epidemics. We also present some discussion on various language networks. In this field, people (p.213) distinguish between cognitive approaches to language and social approaches: cognitive approaches focus on the structure of knowledge inside an individual’s head, while social approaches focus on patterns of communication and interaction among individuals. From this point of view, everything presented here is definitely ‘cognitive’ rather than ‘social’. Another issue that can be referred to as cognitive research is that of the organization of knowledge as it can be measured for example in an encyclopedia (this is a topic that can also be treated in conjunction with taxonomy). The case study is that of Wikipedia, which many consider a technological network, rather than a cognitive one. The situation is then open, as is often the case, for new evolving interdisciplinary fields. Here we make this division only for the sake of presentation of the results.

# 10.1 Networks of scientific papers

## 10.1.1 Co-authorship networks

One of the most difficult features when considering a network of persons is to characterize the quantity and quality of interactions between different people. Consider for example friendship: it is very difficult, if not impossible, to determine the degree of acquaintance between two persons. Anyway, by restricting the problem there is a specific area where we can measure the strength of connections. This is the case of scientific collaboration. It is a fair assumption to believe that two people have something in common if they write a paper together. Therefore, in the spirit of the Erdőscripto;s number, one can generalize the question and consider the statistical properties of the whole network of co-authorships in one or more fields. This approach has a long tradition and it is interesting to note that one of the first models of preferential attachment was introduced in bibliometrics (de Solla Price, (p.214) 1965). The structure and features of collaboration networks are therefore one measure of the social interactions in this particular area. The structure of this particular network is given by the bipartite graph of the kind shown in Fig. 10.2. On one side there are papers, on the other the authors connected to their common papers. This bipartite structure can be transformed into a simple graph, by connecting to each other all the authors of the same paper. In this process we must expect a large global clustering, since the graph is made by little complete subgraphs connected to each other.

Fig. 10.2 The network of co-authorships is a bipartite graph. Dark vertices are the papers and white vertices authors. This network can be transformed into an author-only network by considering only the vertices in the bottom layer. We draw an edge between authors if they have written a paper together (they have an edge to the same dark vertex).

As regards the corpus of co-authorship networks, some extensive work has been done by collecting different data sets from electronic preprint databases (Newman 2001a, 2001b) and from other sets of data (Barabási et al. 2002, Newman 2001c, 2004a). The first database is formed by a corpus of 2, 163,923 papers which can be extracted from the Medline database and whose topic is biomedical research. The second data set is a network of about 1.7 million papers from the database of Mathematical Review (Grossman and Ion, 1995). The third data set is composed of 156,454 papers on Economics taken from the archive Econlit. This sample is particularly important since it allows us to investigate the dynamics of growth of the network. The data stored refer to papers published in the decades 1970–9, (p.215) 1980–9 and 1990–9 respectively (Goyal, van der Leij, and Moraga-GonzÃles, 2004). The fourth database is a corpus of 98,502 papers submitted to the e-print archive at http://www.arxiv.org. This archive consists of different areas where authors can upload their contribution. These areas include, for example, condensed matter (cond-mat) and high energy physics (hep) (Newman 2001a, 2001b). The fifth in the series is composed of 66,652 papers collected from the database SPIRES, hosting papers published in the area of high energy physics. The last data set collected consists of 13, 169 papers from NCSTRL, a preprint database in computer science.

Fig. 10.3 The network of co-authorships in Santa Fe Institute. Different clusters correspond to different topics (Girvan and Newman, 2002). Data provided by M. E. J. Newman.

Some authors may happen to be present in more than one data set even if the various archives correspond to rather different topics. These authors represent a sort of shortcut between different areas (as defined by editorial decisions of journals). This structure is particular evident from the network of collaboration in the Santa Fe Institute (an interdisciplinary research institution). In Fig. 10.3 we also see that in a small environment the communities of colleagues are visible (Girvan and Newman, 2002; Newman, 2004a). The main quantities characterizing these networks are reported in Table 10.1 In Fig. 10.4 we present the plot of the degree distribution (Newman, 2004a). Again we find distributions characterized by fat tails. In one case (the Medline archive) the function is well described by a power law. Another interesting quantity is that of the number of collaborations. As (p.216) shown in Table 10.1 it ranges from the small values of 1.67 in economics and 3.9 in physics to the larger values of 18.1 in biomedicine and 173 in the high energy physics.

Table 10.1 First, second and fourth columns are after Newman (2004a). The third column is after Goyal, van der Leij, and Moraga-Gonzáles (2004). The last two columns are after Newman, (2001a, 2001b). 1, 2, 3, 4, 5, 6 stand for the Medline, M. Review, EconLit, Physics, SPIRES, NCSTRL archives. The statistical error is on the last digit (or lower) of the figures. The size of giant component (g.c.) is the percentage with respect to the total size. When the network is not connected the average path length l is measured on the giant component. C indicates the clustering coefficient and assortativity is measured by means of assortativity coefficient. In this case it is computed as the assortativity coefficient defined in 1.3.3.

1

2

3

4

5

6

papers

2, 163, 923

≃1.75 106

156, 454

98, 502

66, 652

13, 169

authors

1,520, 251

253, 339

81, 217

52, 909

56, 627

11, 994

papers/author

6.4

6.9

-

5.1

11.6

2.55

authors/paper

3.75

1.45

-

2.53

8.96

2.22

coll./author

18.1

3.9

1.67

9.7

173

3.59

g.c.

92%

82%

41%

85%

89%

57%

l

4.6

7.6

9.47

5.9

4.0

9.7

diameter

24

27

29

20

19

31

C

0.066

0.15

0.157

0.43

0.726

0.496

assortativity

0.13

0.12

-

0.36

-

-

As regards the small-world effect, we find small values of the distance between authors. Typically the path length is anti-correlated to the general clustering C. This is because the more the system is connected, the lower the distance between two vertices. Note the exception of computer science with a fairly large clustering and a large value of l.

A further analysis available only for Econlit is the evolution over time. For example, the average degree <k> changed from 0.894 in the 1970s to 1.244 in the 1980s to 1.672 in the 1990s. This can be related to an evolution in research habits but also to the substantial increase in the number of scientists in the field (increasing from 33, 770 in the 1970s to 81,217 in the 1990s). The issue of evolution has also been treated considering measures of centrality to characterize the impact of authors (Börner et al., 2005). In this case a drift towards more cooperation in the production of scientific knowledge has also been observed.

## (p.217) 10.1.2 Citation network

Scientific papers form another kind of network when we consider the set of citations made between them. There are two main differences from the previous case. Firstly, the network is not bipartite; here the papers are the only kind of vertices connected by citations. Secondly, the papers are ordered according to the time of publication. In fact, a paper cannot cite papers that are not yet written (apart those in preparation, which are actually present even in a preliminary form) (see Fig. 10.5).

Fig. 10.4 Data on collaboration network provided by M. E. J. Newman (Newman, 2004a).

A typical question in the field is to investigate if the inter-citation is based on real quality of cited papers or rather on personal knowledge. In other words, we want to know if the reason for the edge is more scientific than social. The case study presented here (White, Wellmann, and Nazer, 2004) investigates the inter-citation patterns of sixteen members of a research group on human development established in 1993. In this group of scientists there are 240 possible edges (the network is directed and therefore we have that this number is given by n Ã (n 1) where n is the number of vertices). Half of the possible couples of scientists have a strong collaboration and 74 percent consider themselves friends or colleagues. Also in this case the inter-citation patterns are divided into four snapshots: prior to 1989, from 1989 to 1992, 1993 to 1996, and 1997 to 2000. Co-citation is shown to predict intercitation; one cites those with whom one is co-cited. As members became better acquainted, citation of one another increased. Inter-citation was not randomly distributed, with a core group of twelve pairs predominating.

(p.218) Friends cited friends more than acquaintances, and inter-citers communicated more than non-inter-citers. However, intellectual affinity, as shown by co-citation, rather than social ties, leads to inter-citation.

Fig. 10.5 An example of a very schematic citation network. Starting from one paper (black vertex) we consider all the old papers in the reference list. For every one of these we iterate the procedure. The colour of vertices in this picture is loosely related to the time of publication passing from dark (recent) to light (old). This procedure could in principle produce an exponentially large number of ancestors. Actually since the corpus of all the scientific production is large but finite, we must expect that most of the ancestors coincide (i.e. strong presence of cycles in the graph).

The same database SPIRES considered above has also been analysed as a network of citations (Lehmann, Lautrup, and Jackson, 2003). The first result is that the probability that a given paper in the SPIRES database has k citations is well described by simple power laws. We have that P(k) α k, with γ ∝ 1.2 for k less than fifty citations and α ≃ 2.3 for fifty or more citations. Different models reproduce these data. One of them generates power laws, while another generates a stretched exponential. It is not possible to discriminate between these models on the present empirical basis. A classification of citations distribution by subfield shows that the citation patterns of high energy physics form a remarkably homogeneous network. Furthermore, the extreme improbability that the citation records of selected individuals and institutions have been obtained by a random procedure can be demonstrated.

(p.219) Another way to look at the same structure is to consider the corpus of citations to a single person (Lahèrrere and Sornette, 1998) or to a single paper (Redner, 1998). In the first case by considering the top scientists over the period 1981-97, one finds a stretched exponential distribution for the number of citations received. In the second case the statistics of citations are represented by a power law with exponent –γ = –3. This behaviour has been confirmed by successive analysis (Tsallis and de Albuquerque, 1999). From a different perspective, it is also possible to study the out-degree (Vázquez, 2001) that is the number of papers cited within a single paper. In this case the analysis reveals that at least for large values of the number of outgoing citations decays exponentially fast with two slopes characterizing two classes of journals, one with limitation on the page number (and therefore on the number of citations) and the second with no limitation. In the latter case the tail of the distribution is not very large, maybe because the number of papers that can be cited is in any case bounded by author knowledge.

The number and quality of citations also plays a role in the assessment of the success of scientific journals. In first approximation one can determine the impact factor of a journal with the ratio Cit/I where I is the number of papers published in a certain period and C it is the number of citations these papers received. Put in this way the impact factor is nothing more than the total in-degree of the journal, normalized with the total number of publications. In the same spirit it is also possible to apply the Google PageRank algorithm (see Section 9.6.2) to assess the relative importance of one manuscript (Chen et al., 2007). This study was done on the journal ‘Physical Review’ in the period 1893-2003. While in general PageRank is correlated with the number of citations, some manuscripts that are outside this relationship (i.e. large PageRank and few citations) correspond to very famous and important pieces of research. Therefore it would be sensible to consider another way to measure the impact factor based on PageRank (Bollen, Rodriguez, and Van de Sompel, 2006).

# 10.2 Contact networks

## 10.2.1 Sexual networks

As already pointed out in the introduction, social networks are very difficult to define and measure when the vertices are individuals and the edges relationships amongst them. Degree of acquaintance, friendships or even love cannot be measured. Even if this were possible, the relations are not reciprocal; therefore the edges have one value for the first person and another for the second one. This difference can be so large that even the existence of (p.220) one connection can be questioned by one of the two persons concerned. In the case presented here (Liljeros et al., 2001), we are dealing with a social network where this basic ambiguity is removed. The case is that of sexual contacts in a sample of the Swedish population. These data were collected in a Swedish survey of 1996 (Lewin, 1998) from a sample of 4,781 persons aged between 18 and 74 (of these only 2,810 persons (59%) answered). Because of the nature of the sexual intercourse, this network is highly dynamical and edges appear and disappear as relations are initiated and terminated. Averaging out the fluctuations we have the cumulative degree distribution shown in Fig. 10.6.

Fig. 10.6 Frequency distribution of the number of sexual contacts. Data provided by F. Liljeros (Liljeros et al., 2001).

This degree distribution also has a power-law shape. We consider in the next section what implications this might have for this specific case. Here we want to extract some more information from a non evident feature of the plot in Fig. 10.6. If you note this plot, the area below the circles (males) and the squares (females) should be the roughly the same; instead the two areas are rather different. It has already been argued (Laumann et al., 1994) that males tend to declare a larger number of intercourses than females. It is worth noting that ‘proportional cheating’ i.e. declaring a number proportional to the real one, does not change the shape of the distributions.

By reasoning in terms of absolute quantities the spread of the number of partners ranges from one to the thousands. These numbers at least for the order of magnitude are not strange and have been documented both (p.221) in real life and in art.70 A very simplified model that is nevertheless able to reproduce most of the above statistical properties is based on only one parameter (the ‘collision rate’) that drives the probability and the strength of interaction between two individuals (González, Lind, and Herrmann, 2006). The presence of such hubs can play a role in the way diseases spread out in a community as we shall see in the next subsection.

## 10.2.2 Epidemics

The study of epidemics is based on the fact that some diseases (as well as computer viruses) survive by passing from one host to another. Their diffusion is then connected with the structure of the individuals (or computers) that could be infected. We have already seen (as in the preceding subsection) that many social structures are scale-free networks with small-world properties. This fact plays a crucial role in the spread of epidemics and in the policies adopted to stop this diffusion. Because of the importance of the topic of disease control, much effort has been devoted both to data collection and to theoretical analysis (modelling). Many theoretical studies on infection spread are based on the assumption that the network of individuals is formed by a regular lattice of connections. This is not completely true (as for the sexual network) and a change of this basic hypothesis has important consequences for the behaviour of disease propagation (Pastor-Satorras and Vespignani, 2002).

The two basic models for infection propagation describe two different situations (Diekmann and Heesterbeek, 2000; Anderson and May, 1992). The first, is called SIR after the various possible states of an individual (Susceptible, Immune, and Recovered). This model describes diseases for which you have permanent immunization after you recover from them. The second one is called SIS, after the initial of Susceptible, Immune, and Susceptible and describes all the diseases for which immunization is not permanent. This means that you can again suffer the same infection (e.g. cold but also most computer viruses). In both cases a scale-free network of connections has the effect of dramatically increasing the strength of the infection.

Let us focus on the standard description of the SIS model; one key parameter of the model is the spreading rate λ. This is given by the ratio of (p.222) the probability ν of a vertex being infected divided by the probability δ of recovering from the disease, that is $λ = ν δ$ The average number of connections for an individual is given by the quantity <k> (the degree of the vertex). One can write an equation for the evolution in time of the percentage ρ(t) of infected people. There are two contributions to consider. Firstly at time t all the individuals affected at time t — 1 are now recovered. Secondly those who were healthy at time t — 1 but neighbours of infected ones can develop the disease. At the steady state these two contributions must balance and the value of ρ(t) is constant.

As a first approximation we can write

(10.241)

The second term of right-hand side is composed of the probability of being healthy (1 —ρ), times the probability of having a neighbour infected, that is <k>ρ. The above quantity is then multiplied by the spreading rate λ. In this expression all the individuals have the same number of connections and furthermore we assume that the strength of the infection is proportional to ρ(t). This hypothesis that corresponds to the ‘mean field’ approach in physics is known as homogeneous mixing (Anderson and May 1992) in epidemics. By making the left-hand side equal to zero and indicating as ρ the stationary value for ρ(t) we have that

(10.242)

The requirement of a characteristic number of connections defines an epidemic threshold λc = <k>-1 that we can use to compute the value of ρ. We obtain

(10.243)

The plot of the variation of this quantity with respect to λ constitutes the phase diagram of the system, as shown in Fig. 10.7. From the above formula we see that this model predicts a non-zero epidemic threshold. If the spreading rate is larger than the threshold, the infection spreads and become persistent. Instead below the threshold the infection dies out. In the language of physics this is an absorbing state. The main hypothesis behind these results is the homogeneity of the number of connections; that is <k> constant.

In the case of a scale-free network the above statements are no longer correct. The same equations must now consider the case of non-homogeneous (p.223) distribution of infected individuals. This corresponds to introducing a function ρk of infected nodes with degree k and a function Θ(ρk) that gives the probability that any edge points to an infected node.

Fig. 10.7 The plot of the states of the system obtained by varying the percentage ρ of individuals infected and the spread of the disease λ.

The dynamical equation for the quantity ρk becomes

(10.244)

The assumption is that Θ is a function of the partial densities of infected nodes. In the steady (endemic) state, the latter are functions only of the spreading rate. Thus, the probability Θ also becomes an implicit function of the spreading rate, and by imposing the stationarity condition one finds

(10.245)

The simplest form for Θ can be obtained by imposing no correlations between the degree of various nodes, on average the probability that an edge is drawn to a vertex whose degree is s is simply sP(s)/<k>.

Therefore, we obtain

(10.246)

in which, by putting the expression given by eqn 10.245, we have

(10.247)

(p.224) The trivial solution of this equation is Θ = 0. But a non-zero stationary prevalence (ρk ≠ 0) is also ensured when the following inequality is satisfied

(10.248)

The value of λ for which the left-hand side of the above equation gives 1 defines the value of λc.

Taking into consideration the above ingredients one finds that now the value of the epidemic threshold is given by

(10.249)

This means that whenever the denominator goes to infinity (as happens for scale-free networks whose degree exponent γ is such that 2 < γ > 3) we have λc = 0.

# 10.3 Linguistic networks

Linguistic networks in a loose sense are all those networks that can be established by studying the features of a particular (or more than one) language.

Most of these are composed of words related to each other by some kind of relationship related to position, syntax, or grammar. The simplest possibility is to start from a text document; after that one considers all the words present as the vertices of the graph and draws an edge between them according to the relationship considered. In almost all cases this results in a scale-free degree frequency distribution. One possible explanation of this behaviour could be the presence of Zipf’s law. This law states that the frequency of words in a text follows a power-law distribution.

## 10.3.1 Zipf’s law. The word frequency distribution

One elementary model reproducing Zipf’s law is related to the exponential combination already seen in Section 4.3.2. Consider that words are randomly formed by collecting letters randomly. A word is complete when we select as ‘letter’ a separation space. All the letters are equiprobable, so that if the probability of having the separation space at any letter extraction is p, and the letters are m, their probability is q = (1 —p)/m. The probability x with (p.225) which a particular word of y letter is extracted is given by

(10.250)

where μ = ln(1 p) – ln(m).

The distribution of words of length m is an exponential distribution p(y) = my or p(y) = ey|n(m). Therefore the distribution of the x (i.e. the distribution of the frequency of the words) is a power law (Mandelbrot, 1953)

(10.251)

Since

(10.252)

and in most cases m is large and p small we obtain γ ≃ 2, which is in good agreement with most of the data.

## 10.3.2 Co-occurrence networks

The most immediate relationship is given when words are simply neighbours in the text. This would mean that in the elementary sentence ‘John eats pizza’ the vertices are ‘John’, ‘eats’ and ‘pizza’ and the edges are between ‘John’ and ‘eats’ and between ‘eats’ and ‘pizza’. This rule can be loosened a little bit and in general one can put an edge between words even if they are a little further apart. For linguistic reasons (Kaplan, 1955) it is possible to connect all the words within a distance of two71 (i.e. the nearest neighbours and the next nearest neighbours) (Ferrer i Cancho and Solé, 2001). The first data set we present here is composed of the British National Corpus as registered on the website http://info.ox.ac.uk/bnc/. This consists of about 107 words that can be linked in various ways giving rise to different networks. The first is created by the above defined rule of drawing an edge between two words at a distance less than or equal to two. This case will be indicated by the name ‘Unrestricted Word Network’ (UWN). The second network is a subset of the first, where an edge is actually drawn only if the co-occurrence has a probability larger than in a random case. This case will be denoted as the ‘Restricted Word Network’ (RWN). Finally they define also a kernel word network (KWN) where the only words present are those of a ‘kernel’, (p.226) that is all the words that are common to all the speakers of the language. The frequency of these common words follow a Zipf’s law with an exponent different from the rarest ones. So the same authors claim that this set of words in the kernel is characterized by the change in the exponent in the Zipf’s law. The striking features of all these networks, UWN, RWN and KWN is the presence of scale-invariance and small-world effect (see Tab. 10.2). A similar analysis also holds when considering poems. The source of data is the Iliad and as shown in Fig. 10.8 the words in the same verse are linked together. Remarkably, this network also shows evidence of scaleinvariance in the degree probability function (Caldarelli and Servedio, 2006). This is very likely related with the Zipf’s law for the same word as already shown for this case study in Fig. 3.17.

Fig. 10.8 An example of the network that can be drawn out of the first verse of Homer’s Iliad (Sing, o Goddess the anger of Achilles son of Peleus). On the right the plot of the degree distribution for such a network. Note in the inset the plot of Zipf’s law for this text.

Table 10.2 The value of the graph quantities in the graph. Both UWN and RWN display a degree probability distribution.

Size

Measure

C

<d>

<k>

Unrestricted (UWN)

478, 773

66, 652

0. 687

2. 63

74. 2

Restricted (RWN)

460, 902

1. 77107

0. 437

2. 63

70. 13

Kernel (KWN)

5, 000

1. 61107

-

2.63

1219

## (p.227) 10.3.3 Word associations

On top of the syntactic relations between the words, we can also define a network where the words are related by some more complicated relationship. One example of these linguistic networks is given by the word association network. In this case the structure is given by an experiment carried out as follows. A psychologist prepares a list of words (‘stimulus’) that is read to persons taking part in the experiment. The persons must respond with the first word that comes in their mind after the stimulus. These words are then used in a new experiment as stimuli and proceeding in this way a whole interconnected web is created (see Fig. 10.9). These patterns have been analysed from the point of view of their social importance (Steyvers and Tenenbaum, 2005) as well as a test case for an algorithm of community detections (Capocci et al., 2005). The data set is formed by the results of one specific experiment with more than 6,000 participants and 5,000 initial words (Nelson, McEvoy, and Schreiber, 1998).

Fig. 10.9 A portion of the network of word association coming from ‘Volcano’ to ‘Ache’. After Steyvers and Tenenbaum (2005).

Another way to put words together is by means of a Thesaurus. A thesaurus is a collection of synonymous (same meaning) or antonymous (opposite meaning) words. The idea is to connect two words if they happen to be synonymous. An extension of this concept is represented by Wordnet, where the network is composed by different classes of words. These two categories are called ‘forms’ and ‘meanings’

(p.228)

• An edge is drawn between many word ‘forms’ and a word ‘meaning’ if the former have the same meaning (i.e. if the word forms are synonymous)

• an edge is drawn between a word ‘form’ and the various word ‘meanings’ it can have (polysemic words)

• an edge is drawn between two words meanings if they are antonymous

• an edge is drawn between two words meanings if they have n hypernymous or meronymous relation72.

In the literature two main corpus have been studied: The Merrian-Webster Dictionary and the Roget’s Thesaurus (Roget, 2002). In these two cases of Wordnet and Thesaurus the two networks show remarkably features of scale invariance. In Fig. 10.10 we report the various degree distributions for such syntactic networks; after Steyvers and Tenenbaum (2005).

Fig. 10.10 The power law distribution of the various syntactic network. Data provided by M. Steyvers (Steyvers and Tenenbaum, 2005).

# (p.229) 10.4 Wikipedia

The last system we present here has something in common both with the cognitive networks with which we organize the meanings through language and with technological networks. The graph is that of Wikipedia (http://www.wikipedia.org) a free on-line encyclopedia. The whole system works on the same principle of hypertext markup language as the World Wide Web. That is, there are documents with mostly text, but also pictures, and other materials that are mutually accessible by means of the system of the hyperlinks. Every entry of the encyclopedia is a separate page and people can decide (and actually do) to connect different pages according to some kind of relation that might exist between the two entries. This system grows constantly as new entries are continuously added by users through the Internet. This is made possible by means of Wiki software. This software allows any user to introduce new entries and modify old ones. It is natural to represent this corpus of documents as a directed graph, where the vertices correspond to entries and edges to hyperlinks (autonomously drawn by the thousands of contributors). The distribution of both in-degree and out-degree is a power law (see Fig. 10.11). From the point of view of the medium used this is certainly a technological network. To be precise it is a thematic subset of the World Wide Web, since any entry is also a legitimate web page. On the other hand is the result of a cognitive process since entries are arranged (and that’s a new part with respect to conventional encyclopedia) according to the perception of the readers. In this respect the (p.230) system is also very similar to a word association network. One important feature of Wikipedia is that the users build different version of it in different languages. In principle, this could allow us to check if different cultures and ways of thinking result in different organization of the knowledge.

Fig. 10.11 The frequency distribution of the English version of Wikipedia.

Some analysis has been done on the system (Holloway, Bočević, and Börner 2005; Capocci et al. 2006; Zlatić et al. 2006) showing most of the statistical properties of the system. The various languages form different graphs ranging from the English version of more than one million entries, to a few thousands for less common languages.

As a general remark the Wikipedia system is still expanding, new vertices are added continuously, and the number of edges is growing as a power of the number of vertices. In particular m(t) α n(t)1.14. The Wikipedia shows a rather large interconnection; by using the bow-tie representation introduced for the WWW we find that most of the vertices are in the SCC. From almost any page it is possible to reach any other. For the English version (the largest) of the Wikipedia the in-degree exponent is γin ≈ 2.1 ± 0.1 in almost all the analyses cited. For the out-degree instead, the various data sets analysed show a range of values from γout ≃ 2.1 to γout ≃ 2.6. A possible explanation for this discrepancy is given by the different times at which data sets were collected. No particular structure is present from the point of view of the clustering coefficient. In this case also different studies report slightly different result, in one case (Capocci et al., 2006) no assortativity was measured; in another the assortative coefficient (defined in eqn 1.30) is very small (r = -0.10 ± 0.05) (Zlatić et al., 2006). Particularly important is the fact that different networks (in different languages) show very similar behaviour (Zlatić et al., 2006). This suggests that the structure of the organization of knowledge might actually be similar in different cultures.

## Notes:

(68) A. Capocci, R. Ferrer i Cancho, G. Caldarelli R. Ferrer i Cancho, O. Riordan, B. Bollobas, B. Bollobas, P. Erdőscripto;s (18 joint papers, first in 1962).

(70) Consider the famous Aria nr.4 ‘Madamina il catalogo è questo’ by Lorenzo Da Ponte in the libretto for Mozart’s ‘Don Giovanni’ (Da Ponte and Mozart, 1787). ‘In Italia seicento e quaranta, in Lamagna duecento e trentʼuna, cento in Francia, in Turchia novantʼuna, ma in Ispagna son giá mille e tre’ (six hundred, forty in Italy, two hundred, thirty one in Germany, one hundred in France, ninety one in Turkey, but in Spain are already one thousand and three). This list accounts for 2,065 partnerships; the same order of magnitude as the tail of the above distribution.

(71) Statistically 87 percent of the syntactic relations take place at this distance (Kaplan, 1955). This of course does not mean that all edges drawn at a distance less than two are syntactically meaningful. Rather roughly one half of the edges drawn do not correspond to a syntactically valid relation.

(72) A meronym is the name of a portion of something, or the material constituting something. ‘Beer’ is a meronym of ‘drinks’, ‘Goalkeeper’ is a meronym of ‘football player’ etc. An hypernym is a word whose meaning encompasses the meaning of another word.