Ethnic diversity in Facebook

Abstract

Social Networking Systems (SNS) not only provide a virtual representation of a person, but also allow people to develop and maintain their social network. In this report, we analyse a subset of the data of one SNS, Facebook, and capture its diversity by approximating users’ ethnicity. We use Onomap, a name-based ethnicity classifier, for our analysis, and then identify the overall structure and cohesiveness of each ethnic group. We also find two clear power-law regimes in the degree distribution, as was found in [1], separated at degree 300. We classify the ethnic groups into major and minor groups based on their population. To identify the hidden communities within the network, we run a community detection algorithm and then, in each identified community, calculate inter- and intra-ethnic relationships. We find almost the same behaviour of ethnicities across the identified communities, where the major groups have higher homophily than the minor ones.

1. Introduction

With the inception of online social networking, generally known as Social Networking Systems (SNS), the structure and the focus of the Internet have changed drastically. Unlike the old paradigm, where the majority was at the receiving end of information on the Internet, ordinary people now have their own presence on it with the help of an SNS. The Internet is said to have become social. There has been a lot of research on how people form friendships and interact over it (e.g. [2], [3], [4]). Friendster envisaged the concept back in 2003; Facebook followed in 2004 and has been the leading SNS for the last five years or so. With over 800 million users to its credit [5], Facebook has become a part of everyday life.

The amount of data in an SNS is humongous, a holy grail for the research community (and for advertising companies as well). Since there are huge monetary and legal stakes involved, SNS data is generally not shared; only a handful of researchers manage to get a subset of it, and they are in turn reluctant to share it with others. According to a study carried out in 2007, the amount of digital information created, captured, and replicated was 281 billion gigabytes [6]. Given the relatively public nature of these SNS, the only means left is to use automatic tools such as a web crawler, or an SNS-specific application, to get users’ information. We used the crawling strategy to get the data for our analysis.

Section 2 describes the background of our work by detailing our data collection and the methodology used for our analysis. We discuss the relevant research in Section 3 and present our analysis in Section 4. Section 5 concludes the report.

2. Dataset and Methodology

In this section, we describe our dataset, how we gathered it, and how we estimated ethnicities within it. We also discuss some of the weaknesses of the underlying methodology of data collection and estimation.

2.1 Data Collection

We created a Facebook account and joined our university’s network. To register in the network, we had to use our official email account with the domain mmu.ac.uk. We collected our data from Facebook between November 2009 and April 2010. At the time, Facebook had almost 400 million users [7]. Initially we wanted to crawl a whole location-based (regional) network, such as Manchester, for all the profiles and their information. Unfortunately, Facebook had removed that feature[1]. Instead of a location, we therefore added a few people from our own network, that of Manchester Metropolitan University (MMU), and crawled the Facebook network through them. A single profile was enough to get our crawler started.

Our methodology was to add a person from our own network to our profile and then collect as much information as possible by exploring with a breadth-first search (BFS) algorithm. In total, we managed to collect data on over half a million profiles. The web crawler we used was an adaptation of Alan Mislove's[2] crawler, originally designed to collect a location-specific network, i.e. the vertices and their edges. Since we were interested in all the publicly available information, specifically racial and ethnic information, we modified it to suit our requirements. We ran the crawler on one machine and kept it running until we had a sizeable sample of the Facebook network.

We used a (biased) breadth-first search (BFS), a well-known graph traversal algorithm that has been used extensively to crawl various SNS [8], [9], [10]. The algorithm starts from a single vertex, known as the seed, and then discovers its neighbours. In our case the crawler agent logs into Facebook with our dummy account credentials, fetches the profile page of every neighbouring user, scrapes the data out of it, cleans it and stores it in a local database. It then fetches the user’s friends list, and the same process repeats for the next friend in FIFO order, provided that friend belongs to the MMU network. In other words, all ego nodes are from the MMU network, while alters may not be. Vertices are recorded in the order in which they are found. A standard BFS stops when all the vertices have been visited; in our case, however, we selected the next neighbour only if it was from the MMU network.

For a graph as large as Facebook, a complete crawl is highly expensive and time consuming. Even leaving the time constraint aside, according to a Facebook study carried out in 2010 [11], 44 terabytes of data would have to be downloaded and processed for the whole Facebook network. If we take a single friends-list page of a Facebook user to be around 200 KB [12], then given the current population of Facebook [5], a total of 200 KB x 800 million = 160 terabytes of HTML data would have to be collected. Hence we collected only a subset, terminating our crawler after obtaining a large dataset (over half a million users’ data). Our strategy is therefore an incomplete BFS. Table 1 summarizes the structure of our dataset. The average number of friends is considerably higher than the figure in Facebook’s statistics (130) [5], but it is comparable to that of another study of Facebook [13]. The diameter of our crawl is also similar to that of the same study.
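The crawl order described in this section can be sketched as follows. This is a minimal illustration, not the crawler actually used: `get_friends`, `in_network` and `max_profiles` are hypothetical stand-ins for the friends-list scraper, the MMU membership check and the stopping threshold, and the toy graph is invented for the example.

```python
from collections import deque

def biased_bfs(seed, get_friends, in_network, max_profiles):
    """Incomplete BFS: only in-network users are expanded (egos);
    the crawl stops once max_profiles profiles have been discovered."""
    visited = []            # egos whose friends list was fetched, in visit order
    discovered = {seed}     # every profile seen so far (egos and alters)
    edges = []              # directed (ego, friend) pairs
    queue = deque([seed])   # FIFO frontier
    while queue and len(discovered) < max_profiles:
        ego = queue.popleft()
        visited.append(ego)
        for friend in get_friends(ego):
            edges.append((ego, friend))
            if friend not in discovered:
                discovered.add(friend)
                if in_network(friend):   # alters outside the network are never expanded
                    queue.append(friend)
    return visited, discovered, edges

# Toy example (invented data): "c" is outside the network, so "e" is never reached.
toy_friends = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "e"], "d": ["b"], "e": ["c"]}
toy_network = {"a", "b", "d"}
crawl_visited, crawl_discovered, crawl_edges = biased_bfs(
    "a", lambda u: toy_friends.get(u, []), lambda u: u in toy_network, max_profiles=100
)
```

Because only in-network vertices enter the queue, the number of visited egos stays far smaller than the number of discovered profiles, which mirrors the gap between visited users and discovered neighbours in Table 1.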

Table 1 - Dataset Description

| # of visited users | # of discovered neighbours | # of edges | Avg. # of friends | Diameter |
|---|---|---|---|---|
| 4601 | 568037 | 1501233 | 326.28 | 6 |

2.2 Onomap Background

In this section, we discuss the Onomap project[3], which helped us identify, approximately, the ethnicity of users within our dataset. To get this ethnic information for each Facebook user, we collaborated with the geography department at UCL, London. The estimation was based on the names in users’ Facebook profiles. The Onomap system covers data from 28 countries, with particularly detailed information for the UK. The data is accumulated from the UK electoral register and the public telephone directories of 27 countries. It contains 10.8 million unique surnames and 6.5 million unique forenames. It has its own classification system with 185 Onomap Types, aggregated into 66 ethnic Subgroups and 15 Groups. Details can be found in [14], [15]. Table 2 shows the Onomap ethnic groups and subgroups.

Table 2 - Onomap Classification

 

| Onomap Group | Onomap Subgroups |
|---|---|
| AFRICAN | AFRICAN, BLACK SOUTHERN AFRICAN, CONGOLESE, ETHIOPIAN, GHANAIAN, NIGERIAN, SIERRA LEONIAN, UGANDAN |
| CELTIC | CELTIC, IRISH, SCOTTISH, WELSH |
| EAST ASIAN & PACIFIC | CHINESE, EAST ASIAN & PACIFIC, HONG KONGESE, MALAYSIAN, SOUTH KOREAN, VIETNAMESE |
| ENGLISH | BLACK CARIBBEAN, ENGLISH |
| EUROPEAN | AFRIKAANS, ALBANIAN, BALKAN, BALTIC, CZECH, DUTCH, ENGLISH, EUROPEAN, FRENCH, GERMAN, HUNGARIAN, ITALIAN, POLISH, ROMANIAN, RUSSIAN, SERBIAN, UKRAINIAN |
| GREEK | GREEK |
| HISPANIC | HISPANIC, PORTUGUESE, SPANISH |
| INTERNATIONAL | INTERNATIONAL |
| JAPANESE | JAPANESE |
| JEWISH AND ARMENIAN | ARMENIAN, JEWISH, JEWISH AND ARMENIAN |
| MUSLIM | BANGLADESHI, ERITREAN, IRANIAN, LEBANESE, MUSLIM, MUSLIM MIDDLE EAST, MUSLIM NORTH AFRICAN, MUSLIM STANS, PAKISTANI, PAKISTANI KASHMIR, SOMALIAN, TURKISH |
| NORDIC | DANISH, FINNISH, NORDIC, NORWEGIAN, SWEDISH |
| SIKH | SIKH |
| SOUTH ASIAN | HINDI NOT INDIAN, INDIAN HINDI, SOUTH ASIAN, SRI LANKAN |
| UNCLASSIFIED | UNCLASSIFIED |
| VOID | VOID |

The Onomap system takes the first and last names of a person as input and, after several iterations, calculates an ethnic estimate based on the data it has collected. A probability is also assigned to each output. For each set of names, we received the following information:

Table 3 - Onomap Classification Attributes

Onomap Type Code

Onomap Group

Onomap Subgroup

Onomap Type

Geographical Area

Religion

Major

Language

Major Language SIL Code

Major Language Family Tree

Personal Score

Onomap coding case

As can be seen in Table 3, Onomap not only estimates a fine-grained ethnic classification through the Onomap Subgroup, but also provides religion and language classifications. In this technical report, however, we consider only the higher-level ethnic classification represented by the Onomap Group.

To our knowledge, name-based ethnicity recognition has never been applied to an SNS at such a large scale and across the world population. A few months ago, the diversity team of Facebook released the trends of various ethnicities in the US, based on US census data [16]. Facebook itself also tried to understand different ethnic behaviours by classifying US users using census data [17]. Other than that, we are not aware of any study on an SNS with a dataset as large and diverse as ours.

2.3 Shortcomings

There are a few limitations in our dataset, starting with the crawling strategy we employed: both empirical and theoretical research shows that incomplete BFS tends to favour highly connected nodes [18], resulting in a skewed degree distribution.

For the last few years, SNS users, especially those of Facebook, have become more aware of privacy issues. Rather than displaying their profile and friends list publicly, many users have started to hide them. In our data collection, we did manage to get more than the screen name for some Facebook users, but for conformity and generalization, we have not taken this additional information into account.

We are unable to quantify (or cross-check) how accurate the ethnic estimation by Onomap is, but given the global information it has collected from various sources, we considered it a good measure. To express the sensitivity of each estimation, a “personal score” is also calculated, but for simplicity we did not take it into account in our analysis.

Regardless of these shortcomings, we have analysed the data we gathered. The purpose of the whole exercise is to identify the ethnic diversity and segregation in Facebook.

3. Related Work

In this section, we discuss the general characteristics of SNS based on both theoretical and empirical studies, along with some of the relevant work.

3.1 SNS General Outlook

Here we discuss the network structure of an SNS. In general, the graph of an SNS can be characterized by its low average distance, moderate clustering coefficient and a power-law distribution of the number of links [3], [19]. Social networks generally have a moderate clustering coefficient ranging from 0.2 to 0.7, depending on the size and the degree of the network [19], and a low average distance when compared with a random network of the same density.

A plethora of research on SNS has been done over the last five years. It is impossible to cover all of it; hence only some of the relevant work is mentioned here. The major focus of such work has been the identification of the static nature of SNS (e.g. see [8]).

To understand how students’ real social networks develop, a function of mutual acquaintances has been used in [20] to build a model. Jackson et al. [19] developed a model in which a social network grows through local search in the neighbourhood; it is capable of reproducing many characteristics of socially generated networks. Based on realistic modes of interaction among university students, we also developed a model and later compared it with the underlying dataset [22].

Adalbert studied Facebook from an economist’s point of view [23]. The data he collected and studied showed that race plays the most significant role in the development of student friendships, especially in the case of minorities. In his previous study of students at Texas A&M, he found that the largest share of new friendships (26%) was driven by members of the same school organizations. In another study carried out on a students’ network [24], race and local proximity, such as a shared dorm, were found to play the most important role, followed by common interests such as major and similar social standing, which in turn were followed by common characteristics such as being in the same year.

In the case of SNS growth, some studies identify different classes of users [4], and a couple of studies describe social network development based on the activity of users [25]. Once communities within a network have been identified, users within them tend, as homophily [26] suggests, to share attributes, which can therefore be inferred from their social neighbours [27]. Based only on the structure of an SNS, a couple of exploration techniques have also been devised to predict which new links users are going to make [28], [29], but they usually do not take the rich attribute information of users into account [30].

4. Findings

In this section, we summarize our findings for the whole network we collected. First, in Section 4.1, we describe the ethnic and geographical distribution of the network. Then, in Section 4.2, we discuss the subgraph of traversed vertices and their network. In Section 4.3, we identify the underlying communities within our dataset and analyse the ethnic distribution in them.

4.1 Ethnic Distribution

Here we describe the Onomap classification of our dataset. Figure 1 shows the distribution of the major Onomap groups. The English group is the largest (46.8%), followed by the Muslim and then the Celtic group. The names of the top 9 ethnic groups are shown in the figure. In total, 18 ethnic groups were found, including three unknown groups that either could not be classified or about which Onomap did not have sufficient knowledge. These groups are labelled 'unclassified' and 'void' and together represent 4.2% of the whole population.

Figure 1 - Onomap Group Classification

Similarly, in the geographical classification based on these names, the majority lies in the British Isles, as can be seen in Figure 2. As said earlier, we do not have the geographical location of each profile, so we cannot cross-check this, but the estimate does suggest that the majority of users in our dataset live in the UK and have their social network within the UK as well.

Figure 2 - Onomap Geographical Classification

4.2 Degree Distribution

Our dataset contains information on over half a million profiles (566012). Due to the limitations of the crawling strategy and the privacy settings of users, the social networks (ego networks) of 4601 people were gathered in total. Although friendship links in Facebook are bidirectional (both parties have to agree to become friends), we kept the graph directed in order to identify the degree distribution of those who have their social network public (and are included in our dataset). The total, in- and out-degree distributions can be seen in Figure 3.

Seminal work on the node degree distribution in Facebook was done in [1]. The authors show that, contrary to the established understanding that the degree distribution of vertices in an SNS follows a single power law, Facebook shows two different power-law regimes: one for 1 ≤ k ≤ 300 and another for 300 ≤ k ≤ 5000, where k is the node degree. The total degree in our dataset shows a similar pattern, with two clearly identifiable power-law regimes below and above degree 300, as was found in [1]. For degrees up to 300, i.e. 1 ≤ k ≤ 300, the fitted power-law distribution has an exponent (alpha) of 2.06. For degrees over 300, the exponent was found to be 2.88, which is not too far from the 3.39 found in [1].
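The report does not state how the exponents were fitted. As one common option, the sketch below uses the continuous maximum-likelihood estimator alpha = 1 + n / Σ ln(k_i / k_min) for the tail k ≥ k_min; the function name, the synthetic sample and the choice of estimator are assumptions made for illustration, and the bounded lower regime (1 ≤ k ≤ 300) would need a truncated fit that is not shown here.

```python
import math
import random

def powerlaw_alpha(degrees, k_min):
    """Continuous maximum-likelihood estimate of the power-law exponent
    for the part of the sample with degree >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("all tail degrees equal k_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum

# Synthetic check (invented data): draw from a power law with a known exponent
# by inverse-transform sampling and confirm the estimator recovers it.
rng = random.Random(0)
true_alpha, tail_start = 2.88, 300.0
sample = [tail_start * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
          for _ in range(20000)]
estimated_alpha = powerlaw_alpha(sample, tail_start)
```

On real degree data the estimate depends strongly on the chosen k_min, so the split point at degree 300 should be treated as an input to the fit, not as something the estimator discovers.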

Figure 3 – Total/In/Out Degree Distribution

4.3 Community Detection

We used the statistical tool R to identify communities in the network; in particular, we used its igraph package for all our analysis. After importing the whole network as an edge list (source and target list) and the Onomap classification as an attribute of each user, we created a graph. To identify meaningful subgraphs within the whole graph, we ran the algorithm of [31], which finds subsets of the network that are closely linked together. These subgraphs are usually called communities, and the whole process is known as community detection. The algorithm treats community detection as the problem of finding the ground state of a spin glass model. It works as follows: each vertex i is labelled by a Potts spin variable σi, which indicates the cluster that includes the vertex. The basic principle of the model is that edges should connect vertices with the same spin state, while vertices with different spin states should, in the ideal case, be disconnected. One therefore has to energetically favour edges between vertices in the same class (spin state), as well as non-edges between vertices in different classes, and penalize edges between vertices of different classes, along with non-edges between vertices in the same class [32]. The resulting Hamiltonian of the spin model is:
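In the form commonly given in the literature for this spin glass model, the Hamiltonian reads as follows. This is a reconstruction from the symbols defined in the surrounding text, not a copy of the report's own equation: the Kronecker delta δ(σi, σj), equal to 1 when i and j share a spin state and 0 otherwise, is assumed notation, and any overall coupling constant is omitted.

```latex
\mathcal{H}(\{\sigma\}) = -\sum_{i<j} \left( A_{ij} - \gamma\, p_{ij} \right) \delta(\sigma_i, \sigma_j)
```

With γ = 1 and the same null model, minimizing this expression is equivalent to maximizing the modularity, since the two then differ only by a negative constant factor.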

where Aij are the elements of the adjacency matrix of the graph, γ > 0 is a parameter expressing the relative contribution to the energy of existing and missing edges, and pij is the expected number of links connecting i and j in a null-model graph with the same total number of edges m as the graph considered. The aim is to find the spin configuration for which the Hamiltonian is minimal, which maximizes the modularity.

After running the algorithm, 25 communities were found in total, the largest of which has 56844 vertices and 234540 edges. Figure 4 plots the size of all the communities in terms of the number of edges in each of them (sorted by size for better visualization).

Figure 4 - Community according to their sizes (# of vertices)

We wanted to visualize the whole network, but unfortunately we could not find any supporting library for it. In Figure 5, we have displayed the largest community identified by the employed algorithm.

Figure 5 - Biggest identified community

For better insight, we have coloured each vertex and edge according to the Onomap group classification. The colours used and the overall distribution of vertices in this community can be seen in Table 1.

Table 1 - Biggest community with Onomap group distribution

4.3.1. Ethnic Diversity in the Identified Communities

In this section, we identify the ethnic distribution of the communities according to their population. As mentioned before, there are 18 ethnic groups in total in our dataset. Figure 6 shows the ethnic division of the top two and bottom two communities. Each bar represents the number of edges of a group; the higher the bar, the greater the number of edges.

Figure 6 - Ethnic Distribution of the top two and the bottom two communities

In all four communities, the English and Muslim groups have the highest bars. It can thus be concluded that the communities, from the largest to the smallest, have more or less the same ethnic composition, quite similar to that of the whole network (see Figure 1). A handful of groups, such as Muslim, English and Celtic, represent the majority, while the rest is made up of minorities. In almost all the communities, all the ethnic groups are represented. It can be concluded that a group's internal diversity is proportional to the ethnic group’s population.

4.3.2 Inter-ethnic relationships

In this section, we identify how each ethnic group is connected with itself. Table 4 shows the 'same-links' for all the communities, i.e. the total number of edges that lie solely within a group. The English group has the most edges to itself, followed by the Muslim, Celtic and South Asian groups. We refer to these four groups as major groups, and to the rest as minor groups. On average, the English group has over 12 thousand links within itself. The table does not show the external links (links between vertices of the English group and vertices of other groups, for instance), so it does not tell us how much cohesion exists within each group.

Table 4 - # of In links within ethnic groups

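| Comm. | AFRICAN | CELTIC | EAST ASIAN & PACIFIC | ENGLISH | EUROPEAN | GREEK | HISPANIC | INTL. | JAPANESE | JEWISH AND ARMENIAN | MUSLIM | NORDIC | SIKH | SOUTH ASIAN | UNCLASSIFIED | VOID 1 | VOID 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 982 | 12209 | 164 | 68603 | 480 | 104 | 72 | 6 | 0 | 7 | 1429 | 0 | 43 | 108 | 643 | 9 | 0 |
| 2 | 8 | 7635 | 7 | 53931 | 33 | 0 | 0 | 0 | 0 | 0 | 101 | 1 | 15 | 30 | 18 | 0 | 0 |
| 3 | 29 | 7404 | 100 | 48441 | 66 | 24 | 14 | 0 | 0 | 4 | 1060 | 0 | 0 | 24 | 1 | 0 | 0 |
| 4 | 21 | 5389 | 15 | 24737 | 82 | 20 | 81 | 0 | 0 | 0 | 1002 | 1 | 12 | 34 | 20 | 0 | 0 |
| 5 | 0 | 3537 | 5 | 21864 | 44 | 0 | 0 | 0 | 0 | 0 | 427 | 0 | 0 | 1 | 26 | 0 | 0 |
| 6 | 0 | 832 | 2 | 8374 | 22 | 0 | 0 | 0 | 0 | 2 | 440 | 0 | 0 | 13 | 46 | 1 | 0 |
| 7 | 22 | 1902 | 1067 | 8349 | 11 | 0 | 55 | 0 | 0 | 0 | 408 | 0 | 0 | 2 | 94 | 0 | 0 |
| 8 | 210 | 650 | 26 | 8093 | 52 | 0 | 123 | 0 | 0 | 0 | 831 | 0 | 0 | 6 | 106 | 0 | 0 |
| 9 | 0 | 448 | 1 | 7157 | 19 | 14 | 125 | 0 | 0 | 2 | 576 | 0 | 0 | 125 | 30 | 0 | 0 |
| 10 | 0 | 637 | 16 | 6344 | 0 | 1 | 0 | 0 | 0 | 0 | 542 | 0 | 0 | 11 | 44 | 0 | 0 |
| 11 | 0 | 495 | 0 | 5897 | 11 | 1 | 0 | 0 | 0 | 0 | 1951 | 0 | 7 | 24 | 20 | 0 | 0 |
| 12 | 0 | 435 | 0 | 5747 | 6 | 0 | 47 | 0 | 0 | 0 | 2678 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 65 | 227 | 0 | 5151 | 0 | 0 | 70 | 0 | 0 | 0 | 2600 | 0 | 0 | 90 | 2 | 6 | 0 |
| 14 | 40 | 1232 | 54 | 5129 | 129 | 0 | 0 | 0 | 0 | 0 | 1535 | 0 | 0 | 12 | 20 | 0 | 0 |
| 15 | 36 | 851 | 116 | 4662 | 76 | 0 | 12 | 0 | 0 | 0 | 85518 | 2 | 406 | 520 | 362 | 1 | 0 |
| 16 | 0 | 422 | 0 | 4648 | 324 | 0 | 85 | 0 | 0 | 0 | 1561 | 4 | 0 | 380 | 4 | 0 | 0 |
| 17 | 56 | 325 | 0 | 4325 | 29 | 0 | 0 | 0 | 0 | 0 | 2847 | 0 | 0 | 305 | 62 | 1 | 0 |
| 18 | 11 | 483 | 0 | 4146 | 156 | 53 | 39 | 1 | 0 | 0 | 667 | 0 | 0 | 436 | 18 | 0 | 0 |
| 19 | 0 | 341 | 0 | 3042 | 6 | 0 | 26 | 0 | 0 | 2 | 1622 | 0 | 76 | 172 | 368 | 0 | 0 |
| 20 | 7 | 966 | 131 | 2741 | 2 | 0 | 1 | 0 | 1 | 0 | 552 | 0 | 1 | 54 | 35 | 25 | 0 |
| 21 | 229 | 311 | 9 | 2467 | 0 | 10 | 2 | 0 | 0 | 0 | 3067 | 0 | 390 | 1093 | 28 | 0 | 0 |
| 22 | 427 | 44 | 0 | 2457 | 0 | 0 | 0 | 0 | 0 | 0 | 7795 | 0 | 0 | 3 | 115 | 0 | 0 |
| 23 | 1 | 81 | 13 | 1732 | 149 | 19 | 13 | 1 | 0 | 0 | 4932 | 0 | 0 | 14 | 154 | 3 | 0 |
| 24 | 0 | 89 | 14 | 1431 | 24 | 0 | 13 | 0 | 0 | 0 | 23389 | 0 | 1095 | 299 | 105 | 3 | 0 |
| 25 | 2 | 118 | 37 | 1340 | 0 | 0 | 0 | 0 | 0 | 0 | 9474 | 0 | 82 | 9872 | 124 | 0 | 0 |
| Mean | 85.84 | 1882.52 | 71.08 | 12432.32 | 68.84 | 9.84 | 31.12 | 0.32 | 0.04 | 0.68 | 6280.16 | 0.32 | 85.08 | 545.12 | 97.80 | 1.96 | 0.00 |
| St. Dev. | 211.27 | 3061.09 | 212.69 | 17934.77 | 112.78 | 23.13 | 40.07 | 1.22 | 0.20 | 1.65 | 17198.19 | 0.90 | 237.15 | 1958.75 | 149.66 | 5.27 | 0.00 |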

In the 25 identified communities, we calculated a Silo Index for each ethnic group. This index measures the degree of links between vertices that share a particular attribute value in a (social) network. If the set of vertices with value Y for attribute X has all its edges to itself, and none to vertices with other values of X, a very strong community exists that is totally disconnected from the rest of the network. In short, the index helps us identify how cohesive the edges within an attribute value are. It ranges from -1 to 1, representing the extreme cases (no in-group links and only in-group links, respectively).
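The report does not spell out the formula for the index. A definition consistent with the stated range and extreme cases is (I - E) / (I + E), where I is the number of in-group edges and E the number of out-group edges of a group; the sketch below assumes that definition, and also assumes that a group with no edges at all is assigned 0. The function name and the toy data are invented for illustration.

```python
from collections import defaultdict

def silo_index(edges, group_of):
    """Per-group (I - E) / (I + E), where I counts edges with both endpoints
    in the group and E counts edges with exactly one endpoint in the group."""
    internal = defaultdict(int)
    external = defaultdict(int)
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu == gv:
            internal[gu] += 1
        else:
            external[gu] += 1
            external[gv] += 1
    result = {}
    for g in set(group_of.values()):
        total = internal[g] + external[g]
        # Assumed convention: a group without any edges gets index 0.
        result[g] = 0.0 if total == 0 else (internal[g] - external[g]) / total
    return result

# Toy example (invented data).
silo_groups = {"a": "ENGLISH", "b": "ENGLISH", "c": "MUSLIM",
               "d": "MUSLIM", "e": "CELTIC", "f": "SIKH"}
silo_edges = [("a", "b"), ("a", "c"), ("c", "d"), ("a", "e")]
silo = silo_index(silo_edges, silo_groups)
```

In the toy data the MUSLIM group has one in-group and one out-group edge, so its index is 0 even though it is connected, which is the same ambiguity of the value 0 that arises when the index is read off per community.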

  

Table 5 - Silo Index of ethnic groups

| Comm. | AFRICAN | CELTIC | EAST ASIAN & PACIFIC | ENGLISH | EUROPEAN | GREEK | HISPANIC | INTL. | JAPANESE | JEWISH AND ARMENIAN | MUSLIM | NORDIC | SIKH | SOUTH ASIAN | UNCL. | VOID 1 | VOID 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.85 | -0.73 | -0.96 | -0.25 | -0.94 | -0.88 | -0.98 | -1 | -1 | -1 | -0.9 | -1 | -0.97 | -0.93 | -0.94 | -0.99 | -1 |
| 2 | -0.99 | -0.71 | -0.98 | 0.02 | -0.98 | -1 | -1 | -1 | -1 | -1 | -0.96 | -0.99 | -0.94 | -0.96 | -0.99 | -1 | -1 |
| 3 | -0.89 | -0.7 | -0.88 | -0.04 | -0.97 | -0.94 | -0.97 | -1 | -1 | -0.99 | -0.81 | -1 | -1 | -0.98 | -1 | -1 | -1 |
| 4 | -0.98 | -0.66 | -0.99 | -0.15 | -0.95 | -0.95 | -0.87 | -1 | -1 | -1 | -0.76 | -0.99 | -0.95 | -0.94 | -0.98 | -1 | -1 |
| 5 | -1 | -0.71 | -0.99 | -0.07 | -0.96 | -1 | -1 | -1 | -1 | -1 | -0.83 | -1 | -1 | -1 | -0.97 | -1 | -1 |
| 6 | -1 | -0.76 | -0.99 | 0.03 | -0.95 | -1 | -1 | -1 | -1 | -0.99 | -0.6 | -1 | -1 | -0.95 | -0.85 | -1 | 0 |
| 7 | -0.95 | -0.68 | -0.06 | -0.21 | -0.98 | -1 | -0.89 | -1 | -1 | -1 | -0.76 | -1 | -1 | -0.99 | -0.92 | -1 | -1 |
| 8 | -0.07 | -0.78 | -0.91 | 0.08 | -0.93 | -1 | -0.43 | -1 | -1 | -1 | -0.37 | -1 | -1 | -0.96 | -0.86 | -1 | -1 |
| 9 | -1 | -0.85 | -1 | 0.03 | -0.95 | -0.91 | -0.78 | -1 | -1 | -0.99 | -0.64 | -1 | -1 | -0.72 | -0.93 | -1 | 0 |
| 10 | -1 | -0.76 | -0.98 | -0.14 | -1 | -0.99 | -1 | -1 | -1 | -1 | -0.73 | -1 | -1 | -0.97 | -0.93 | -1 | 0 |
| 11 | -1 | -0.8 | -1 | -0.12 | -0.98 | -0.99 | -1 | -1 | -1 | -1 | -0.32 | -1 | -0.94 | -0.95 | -0.97 | -1 | -1 |
| 12 | -1 | -0.8 | -1 | -0.07 | -0.98 | -1 | -0.89 | -1 | -1 | -1 | -0.24 | -1 | -1 | -1 | -1 | -1 | -1 |
| 13 | -0.52 | -0.86 | -1 | -0.18 | -1 | -1 | -0.92 | -1 | -1 | -1 | -0.25 | -1 | -1 | -0.84 | -1 | -0.97 | -1 |
| 14 | -0.88 | -0.65 | -0.73 | -0.25 | -0.85 | -1 | -1 | -1 | -1 | -1 | -0.48 | -1 | -1 | -0.97 | -0.96 | -1 | 0 |
| 15 | -0.98 | -0.66 | -0.99 | -0.15 | -0.95 | -0.95 | -0.87 | -1 | -1 | -1 | -0.76 | -0.99 | -0.95 | -0.94 | -0.98 | -1 | -1 |
| 16 | -1 | -0.8 | -1 | -0.17 | -0.83 | -1 | -0.79 | -1 | -1 | -1 | -0.5 | -0.98 | -1 | -0.57 | -0.99 | -1 | -1 |
| 17 | -0.65 | -0.83 | -1 | -0.17 | -0.94 | -1 | -1 | -1 | -1 | -1 | -0.14 | -1 | -1 | -0.56 | -0.89 | -0.99 | 0 |
| 18 | -0.87 | -0.82 | -1 | -0.37 | -0.86 | -0.77 | -0.9 | -1 | -1 | -1 | -0.79 | -1 | -1 | -0.51 | -0.97 | -1 | 0 |
| 19 | -1 | -0.82 | -1 | -0.41 | -0.99 | -1 | -0.92 | -1 | -1 | -0.98 | -0.44 | -1 | -0.78 | -0.81 | -0.85 | -1 | 0 |
| 20 | -0.97 | -0.67 | -0.85 | -0.38 | -0.99 | -1 | -1 | -1 | -1 | -1 | -0.68 | -1 | -1 | -0.92 | -0.94 | -0.92 | 0 |
| 21 | -0.52 | -0.84 | -0.98 | -0.52 | -1 | -0.89 | -0.99 | -1 | -1 | -1 | -0.41 | -1 | -0.81 | -0.65 | -0.97 | -1 | 0 |
| 22 | -0.63 | -0.95 | -1 | -0.19 | -1 | -1 | -1 | -1 | -1 | -1 | 0.03 | -1 | -1 | -0.99 | -0.92 | -1 | 0 |
| 23 | -1 | -0.91 | -0.98 | -0.46 | -0.82 | -0.59 | -0.96 | -1 | -1 | -1 | -0.19 | -1 | -1 | -0.97 | -0.88 | -0.98 | 0 |
| 24 | -1 | -0.97 | -0.98 | -0.83 | -0.98 | -1 | -0.99 | -1 | -1 | -1 | -0.08 | -1 | -0.55 | -0.88 | -0.95 | -0.99 | 0 |
| 25 | -1 | -0.92 | -0.94 | -0.74 | -1 | -1 | -1 | -1 | -1 | -1 | -0.27 | -1 | -0.91 | -0.26 | -0.93 | -1 | 0 |
| Mean | -0.87 | -0.79 | -0.93 | -0.23 | -0.95 | -0.95 | -0.93 | -1.00 | -1.00 | -1.00 | -0.52 | -1.00 | -0.95 | -0.85 | -0.94 | -0.99 | -0.48 |
| St. Dev. | 0.23 | 0.09 | 0.19 | 0.23 | 0.05 | 0.09 | 0.12 | 0.00 | 0.00 | 0.01 | 0.28 | 0.01 | 0.10 | 0.19 | 0.05 | 0.02 | 0.51 |

Table 5 presents the Silo Index of all the 18 ethnic groups in the communities identified in the reference dataset. Each row represents one of the 25 communities. The most cohesive ethnic group is English, followed by the Muslim, Celtic and South Asian groups. It can therefore be concluded that the majority groups have higher homophily than the minorities. The value 0 has a special meaning here. It represents two special cases: when the group has neither in-group nor out-group edges, and when the in-group and out-group edges are equal in number and so cancel each other out. As mentioned earlier, the value -1 means that there are only out-group edges, i.e. not a single group member is connected to any other member of the same group.

Although the majority groups seem densely connected in Table 5, the Silo Index is mostly negative for both majority and minority ethnic groups. Among the majority groups, only a few communities have a positive Silo Index (see, for example, the Silo Index of the 2nd community for the English group). At the other extreme, the minor groups have very few or no edges whatsoever among themselves (see the Japanese and International groups, for instance). In most communities, they also do not hold links to other groups, as can be seen in Table 4. It can be construed from this analysis that none of the groups is fully segregated. Each community presents a well-mixed set of group members. There is a high preference for one’s own group in certain ethnic groups, but the overall social network is quite heterogeneous.

4.4 Out Community

Since relatively few vertices have been crawled (i.e. have an out-degree greater than zero), we made a graph of just those vertices and the edges between them. In other words, we made a network of only the ego vertices in our overall dataset and excluded all other vertices. In this section, we look at the degree distribution, SNA measures and ethnic diversity of this network alone. To see its total degree distribution, we have drawn a log-log plot of the total degree of each vertex, shown in Figure 7. Within this subgraph, too, there are two power-law regimes: one for 1 ≤ k ≤ 100 and another for 100 ≤ k ≤ 1100, where k is the total degree.

Figure 7 - Total Degree Distribution

Table 6 shows the SNA measures of this graph. Not only does the clustering coefficient (0.36) tell us that there is a strong relationship among vertices, but the community modularity (0.553) also shows a high level of cohesiveness.
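As an illustration of how two of these measures can be obtained, the sketch below computes the average local clustering coefficient and the triangle count of an undirected graph. It assumes the "Avg. Clustering Coef." in Table 6 is the average of local coefficients, which the report does not state, and it gives vertices with fewer than two neighbours a coefficient of 0; the function name and toy graph are invented.

```python
from itertools import combinations

def clustering_and_triangles(adjacency):
    """adjacency maps each vertex to the set of its neighbours
    (undirected graph, no self-loops). Returns (average local
    clustering coefficient, number of triangles)."""
    local = {}
    triangle_corners = 0
    for v, nbrs in adjacency.items():
        k = len(nbrs)
        # Count pairs of neighbours of v that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adjacency[a])
        triangle_corners += links
        local[v] = 0.0 if k < 2 else 2.0 * links / (k * (k - 1))
    average = sum(local.values()) / len(local)
    # Every triangle is counted once at each of its three corners.
    return average, triangle_corners // 3

# Toy example (invented data): a triangle a-b-c with a pendant vertex d on c.
toy_adjacency = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
avg_clustering, n_triangles = clustering_and_triangles(toy_adjacency)
```

The pairwise neighbour check is quadratic in the degree of each vertex, which is acceptable for a subgraph of a few thousand vertices but would need a more careful implementation for the full crawl.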

Table 6 - SNA attributes of Subgraph

| Average Degree | Avg. Clustering Coef. | Total triangles | Community Modularity |
|---|---|---|---|
| 77.189 | 0.36 | 1187264 | 0.553 |

In total, there are 4601 vertices and 177573 edges. The distribution of ethnic groups is shown in Figure 8.

Figure 8 - Barchart of Ethnic Diversity

There are 16 ethnic groups in total in this graph. The English group has the highest number of vertices (1714), about 37 percent, while the Muslim group represents 32 percent of the population. Some groups are over-represented and some are under-represented.

As for the relationships between ethnic groups, we calculated the percentage of inter-ethnic links, which is summarized in Table 7. For each ethnic group, we calculated the percentage of its edges shared with each ethnic group: the higher the number of edges, the higher the percentage. All groups appear to have most of their links with the majority groups.
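A row-normalized table of this kind can be produced as sketched below. The counting convention is an assumption: each edge is counted once from each endpoint, so an intra-group edge contributes twice to its group's own cell, and each row sums to 100. The function name and toy data are invented for illustration.

```python
from collections import defaultdict

def mixing_percentages(edges, group_of):
    """For each group, the percentage of its edge endpoints whose
    other end lies in each group (rows sum to 100)."""
    counts = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        counts[gu][gv] += 1
        counts[gv][gu] += 1   # an intra-group edge is counted from both endpoints
    table = {}
    for g, row in counts.items():
        total = sum(row.values())
        table[g] = {h: 100.0 * n / total for h, n in row.items()}
    return table

# Toy example (invented data).
mix_groups = {"a": "ENGLISH", "b": "ENGLISH", "c": "MUSLIM", "d": "MUSLIM", "e": "CELTIC"}
mix_edges = [("a", "b"), ("a", "c"), ("c", "d"), ("a", "e")]
mix_table = mixing_percentages(mix_edges, mix_groups)
```

Row normalization makes small groups comparable with large ones, but it also means a large group will dominate every row simply because it has more vertices to link to, which should be kept in mind when reading the percentages.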

Table 7 - Inter-ethnic percentile vertices

| Group | AFRICAN | CELTIC | EAST ASIAN & PACIFIC | ENGLISH | EUROPEAN | GREEK | HISPANIC | INTL. | JAPANESE | JEWISH AND ARMENIAN | MUSLIM | NORDIC | SIKH | SOUTH ASIAN | UNCL. | VOID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AFRICAN | 0.78 | 16.80 | 1.76 | 36.36 | 2.21 | 0.31 | 1.13 | 0.09 | 0.00 | 0.38 | 30.95 | 0.12 | 1.86 | 4.00 | 2.98 | 0.28 |
| CELTIC | 1.44 | 8.16 | 1.77 | 41.09 | 2.43 | 0.20 | 0.97 | 0.13 | 0.01 | 0.57 | 34.53 | 0.14 | 1.17 | 4.04 | 2.86 | 0.49 |
| EAST ASIAN & PACIFIC | 1.28 | 15.04 | 0.89 | 36.79 | 2.77 | 0.14 | 0.99 | 0.12 | 0.03 | 0.55 | 32.71 | 0.14 | 1.33 | 3.86 | 3.11 | 0.26 |
| ENGLISH | 1.42 | 18.77 | 1.98 | 23.02 | 2.81 | 0.24 | 1.26 | 0.17 | 0.02 | 0.54 | 39.89 | 0.13 | 1.47 | 4.50 | 3.25 | 0.53 |
| EUROPEAN | 1.19 | 15.28 | 2.06 | 38.82 | 1.19 | 0.10 | 1.05 | 0.10 | 0.00 | 0.53 | 31.73 | 0.20 | 1.08 | 3.59 | 2.61 | 0.44 |
| GREEK | 1.98 | 15.22 | 1.22 | 39.42 | 1.22 | 0.00 | 0.30 | 0.30 | 0.00 | 0.30 | 29.07 | 0.00 | 2.74 | 5.78 | 2.13 | 0.30 |
| HISPANIC | 1.33 | 13.39 | 1.61 | 37.94 | 2.30 | 0.06 | 0.39 | 0.11 | 0.00 | 0.53 | 34.28 | 0.06 | 1.52 | 3.41 | 2.63 | 0.44 |
| INTL. | 0.80 | 12.88 | 1.41 | 36.42 | 1.61 | 0.40 | 0.80 | 0.00 | 0.00 | 0.00 | 36.42 | 0.40 | 0.80 | 4.83 | 2.82 | 0.40 |
| JAPANESE | 0.00 | 9.68 | 3.23 | 30.65 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 46.77 | 0.00 | 0.00 | 3.23 | 0.00 | 6.45 |
| JEWISH AND ARMENIAN | 0.98 | 17.25 | 1.95 | 35.89 | 2.56 | 0.12 | 1.16 | 0.00 | 0.00 | 0.12 | 33.09 | 0.12 | 1.22 | 3.41 | 2.01 | 0.12 |
| MUSLIM | 1.37 | 17.84 | 1.99 | 45.12 | 2.60 | 0.20 | 1.29 | 0.19 | 0.03 | 0.57 | 19.25 | 0.13 | 1.32 | 4.46 | 3.15 | 0.49 |
| NORDIC | 1.29 | 17.48 | 2.06 | 35.73 | 4.11 | 0.00 | 0.51 | 0.51 | 0.00 | 0.51 | 32.90 | 0.00 | 1.03 | 1.54 | 2.31 | 0.00 |
| SIKH | 1.94 | 14.22 | 1.91 | 39.08 | 2.08 | 0.44 | 1.35 | 0.10 | 0.00 | 0.49 | 31.01 | 0.10 | 0.42 | 3.82 | 2.79 | 0.25 |
| SOUTH ASIAN | 1.32 | 15.53 | 1.75 | 37.92 | 2.19 | 0.29 | 0.95 | 0.19 | 0.02 | 0.43 | 33.16 | 0.05 | 1.21 | 1.84 | 2.75 | 0.42 |

UNCLASSIFIED