ABSTRACT
Title of dissertation: MULTILINGUAL USE OF TWITTER:
LANGUAGE CHOICE AND LANGUAGE
BRIDGES IN A SOCIAL NETWORK
Irene Eleta, Doctor of Philosophy, 2014
Dissertation directed by: Professor Jennifer Golbeck
College of Information Studies
Social media is international: users from different cultures and language back-
grounds are generating and sharing content. But language barriers emerge in the
communication landscape online. In the quest for language diversity and universal
access, the vision of a cosmopolitan Internet has stumbled over the language frontier.
Expatriates, minorities, diasporic communities, and language learners play an
important role in forming transnational networks, creating social ties across borders.
Many users of social media are multicultural and multilingual; they are mediat-
ing between language communities. In the microblogging site Twitter, information
spreads across languages and countries. How are multilingual users of Twitter con-
necting language groups? What are the factors influencing their language choices?
This research advances a step towards understanding the network structures and
communication strategies that enable intercultural dialog, cross-language sharing of
information, and awareness of global problems.
This dissertation research aims at: (1) exploring the ways in which multilingual
users of Twitter are connecting different language groups in their social network; (2)
modeling how the network influences their language choices; and (3) exploring what
the textual features of their posts can reveal about language choices and mediation
between groups.
This dissertation goes beyond survey information about multilingualism and
provides a deeper understanding of the structural relations between language
communities in Twitter. This research work is one of the few that apply social
network analysis to the study of sociolinguistic questions on the Internet. Focusing
on the social networks of multilingual users, this dissertation contributes an original
classification of network types based on the patterns of connections between language
groups. Also, it applies the novel idea of modeling the influence of network factors on
the language choices of the user. Finally, this dissertation tests the hypothesis that
the type of exchange influences language choice, and explores with a theme analysis
how other textual features might elicit cross-cultural awareness. These results can
inform the design of social media platforms.
MULTILINGUAL USE OF TWITTER:
LANGUAGE CHOICE AND LANGUAGE BRIDGES IN A
SOCIAL NETWORK
by
Irene Eleta Mogollón
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2014
Advisory Committee:
Professor Jennifer Golbeck, Chair/Advisor
Professor Benjamin B. Bederson
Professor Jordan Boyd-Graber
Professor Kari M. Kraus
Professor Ira Chinoy
© Copyright by
Irene Eleta Mogollón
2014
In that ocean that separates continents and lives,
I have not yet lost you in the storm.
To Pepe.
Acknowledgments
First and foremost I would like to thank my advisor, Prof. Jennifer Golbeck,
for her inspiring lessons on social network analysis, for encouraging me to develop my
own original ideas and guiding me through that difficult process, always caring for
my motivation. I have reached this milestone thanks to her support in the moments
of adversity. She also made it possible by financing this dissertation research.
I owe my gratitude to the other committee members of my dissertation, Prof.
Ben Bederson, Prof. Jordan Boyd-Graber, Prof. Kari Kraus, and Prof. Ira Chinoy,
who have provided valuable feedback in their diverse areas of expertise to make this
dissertation a more solid and complete research work.
I would also like to thank Dr. Judith Klavans and Prof. Doug Oard for their
advice and mentoring in the early stages of my doctoral endeavor. They transformed
the graduate student I was into a researcher. It was an immense privilege to be able
to count on them.
I would also like to acknowledge help from Tony Rogers, who joined Prof.
Jennifer Golbeck and me in our search for multilingual users of Twitter.
My peers and the professors at the College of Information Studies, the iSchool,
and at the Human-Computer Interaction Lab (HCIL) have enriched my graduate
experience in many ways, providing inspiration and support.
I owe my deepest thanks to Fulbright for sponsoring my doctoral studies and
for the financial support in the first years of my program. Also, they have enriched
my stay in the United States by giving me the opportunity to participate in the many
cultural and social events they organize, including academic workshops, where I have
met many new friends. The Fulbright community has had an enormous influence
on my vision of the world and on this research work. They are the most inspiring
example I know of multicultural social ties between the world’s nations.
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 What is Twitter?
1.2 Motivation
1.2.1 Language bubbles?
1.2.2 The bridges between the local and the global
1.2.3 Values in the design of communication platforms
1.3 An Ultimate Goal
1.4 Objectives and Research Questions
1.5 Contributions and Audiences
2 Theoretical Framework
2.1 The Global Language System
2.2 The Ecology of Language
2.3 Networks
2.3.1 Concepts of Social Network Analysis
2.3.2 A network perspective on Sociolinguistics
2.4 The Internet as a Sociolinguistic Ecology
2.4.1 The mediation of technology and the cosmopolitan space
2.4.2 Overview and remarks
2.5 Micro-Sociology Focus: Conceptualizing Multilingual Users and Language Choice in Twitter
3 Related Work
3.1 Language Choice and Code-Switching Online
3.2 Networked Languages
3.3 Multilingual Twitter
4 Methodology
4.1 Research Design
4.2 Sampling and Data Collection
4.3 Methods for Assigning Language Labels to Users
4.3.1 Tools for automatic language identification
4.3.2 Algorithm for assigning a language label to a person
4.4 Testing Methods for Assigning Language Labels to Users
4.4.1 The test dataset
4.4.2 The baseline
4.4.3 Testing the language identification tools and the algorithm that assigns language labels to users
4.4.4 Deciding the number of posts per user
4.5 Assigning Language Labels to Users
4.6 Scope
4.7 Limitations
4.8 Reliability and Validity
4.9 Ethical Considerations
5 Social Network Analysis
5.1 Qualitative Approach
5.2 Network Statistics
5.3 Application of Categories
5.4 Discussion
6 Factor Analysis
6.1 Operationalization of Variables
6.2 Regression Models and Analysis
6.3 Results
6.4 Discussion
7 Exploring Textual Features
7.1 Description of the Data
7.2 Hypothesis Testing: Fisher’s Exact Test
7.3 Discussion: Addressivity as a Factor
7.4 Theme Analysis
7.4.1 International themes in the English language set
7.4.2 English hashtags in the non-English language set
8 Discussion and Future Work
8.1 Of Links, Social Ties, and Gravitational Forces
8.2 The Road Ahead...
8.2.1 Translation and Mediation in Twitter
8.2.2 Who Are the Multilingual Users?
9 Conclusion
A Visualizations of Social Networks
B International Themes in English Posts
Bibliography
List of Tables
4.1 Research design schema
4.2 Budget options and associated error rates
5.1 Properties of bilingual networks observed in visualizations
6.1 Linear regression coefficients for English use
6.2 Linear regression coefficients for L2 use
6.3 Logistic regression coefficients for English use
6.4 Logistic regression coefficients for L2 use
7.1 2x2 contingency table for the Fisher’s Exact Test
7.2 Frequencies of international themes in English posts
7.3 Conversational tags: discourse conventions in Twitter
7.4 Other conversational tags
7.5 Hashtags: ICT topic, brands and devices
7.6 Hashtags: events, music, TV and sports
7.7 Hashtags: location, time, and other named entities
7.8 Hashtags: other topics
List of Figures
1.1 Example of Twitter posts
2.1 Egocentric network
2.2 Schematic view of a network with clusters
2.3 European language communities in Twitter
2.4 Interactions constrained by technology and social network
2.5 Factors for language choice in Twitter
3.1 Language share of top 20 most active countries on Twitter
4.1 Schematic description of the datasets
4.2 Words for detecting languages
4.3 Purpose of datasets
4.4 Language label assignation to users
4.5 Estimated error function
4.6 Comparison of langPy and Google Language ID
5.1 Trilingual egocentric network: English, Spanish, Basque
5.2 Trilingual egocentric network: English, Spanish, Catalan
5.3 Trilingual egocentric network: English, Chinese, Japanese
5.4 Qualitative categories of bilingual networks
5.5 L2 inner/crossing edge ratio for five and three categories
5.6 L2 group proportion for five and three categories
5.7 Cross-language edge ratio for five and three categories
5.8 Bilingual ratio for five and three categories
5.9 Results of classification model
6.1 Sample input data file for factor analysis
A.1 Trilingual networks (1).
A.2 Trilingual networks (2).
A.3 Trilingual networks (3).
A.4 Bilingual networks: gatekeeper type (1).
A.5 Bilingual networks: gatekeeper type (2).
A.6 Bilingual networks: gatekeeper type (3).
A.7 Bilingual networks: gatekeeper type (4).
A.8 Bilingual networks: gatekeeper type (5).
A.9 Bilingual networks: gatekeeper type (6).
A.10 Bilingual networks: language bridge type (1).
A.11 Bilingual networks: language bridge type (2).
A.12 Bilingual networks: language bridge type (3).
A.13 Bilingual networks: language bridge type (4).
A.14 Bilingual networks: language bridge type (5).
A.15 Bilingual networks: language bridge type (6).
A.16 Bilingual networks: union type (1).
A.17 Bilingual networks: union type (2).
A.18 Bilingual networks: union type (3).
A.19 Bilingual networks: union type (4).
A.20 Bilingual networks: integration type (1).
A.21 Bilingual networks: integration type (2).
A.22 Bilingual networks: integration type (3).
A.23 Bilingual networks: integration type (4).
A.24 Bilingual networks: integration type (5).
A.25 Bilingual networks: integration type (6).
A.26 Bilingual networks: integration type (7).
A.27 Bilingual networks: integration type (8).
A.28 Bilingual networks: peripheral language type (1).
A.29 Bilingual networks: peripheral language type (2).
A.30 Bilingual networks: peripheral language type (3).
A.31 Bilingual networks: peripheral language type (4).
A.32 Bilingual networks: peripheral language type (5).
A.33 Bilingual networks: peripheral language type (6).
A.34 Small and monolingual networks (1).
A.35 Small and monolingual networks (2).
A.36 Small and monolingual networks (3).
A.37 Small and monolingual networks (4).
A.38 Small and monolingual networks (5).
A.39 Small and monolingual networks (6).
A.40 Small and monolingual networks (7).
A.41 Small and monolingual networks (8).
Chapter 1
Introduction
[G]lobalization is characterised by unprecedented flows of information, ex-
changes among different groups and networks that transcend the local and
national [116, p. 9].
As the number of Internet users from different parts of the world grows [58], so
does the use of a wealth of languages online [89]. The Internet is not accessed only
through computers, but also through cellphones and tablets; this trend is enabling
more people in developing countries and speakers of a plethora of languages to access
it [58]. While access to the Internet and communication flows are greater than ever
before, there is evidence of fragmentation due to language and national borders on
the Web [44], and in the blogosphere [53, 47]. Also, many authors warn about the
existence of a “linguistic digital divide” that prevents many users of the Internet
from having access to relevant information in their languages [78, 67, 68, 8].
In the past years, social media has emerged as horizontal networks of commu-
nication, where a complex interplay takes place between mainstream media, jour-
nalists, political actors, grassroots activists, citizens and technology [13, 77]. On
the one hand, there are powerful social actors shaping the linguistic landscape of
the Internet with a top-down approach, like national and supranational institutions,
broadcasting media, and companies with interests in transnational business [30]. On
the other hand, users of social networks and content-sharing platforms constitute a
counter-power [13], reshaping this linguistic landscape with their contributions.
Social media has enabled valuable social outcomes such as spontaneous organization
during humanitarian crises [98], public denunciations of human rights
violations [85], creation of relevant content for communities that are underserved in
terms of information on the Internet and in their languages [102], and foreign lan-
guage practice and participation in transnational interest communities and diaspora
communities [100].
Many researchers and media outlets are turning their attention to the mi-
croblogging site Twitter. They have realized the potential of Twitter for spreading
information of unfolding events in real-time across languages and geographic regions
[55, 77]. But how is the news traveling across language frontiers?
In this dissertation, I study how multilingual users of Twitter mediate between
language groups in their social network, focusing on social connections and language
choice. My long-term goal is to advance our understanding of the network struc-
tures and communication strategies that enable intercultural dialog, cross-language
sharing of information, and awareness of global problems.
This research goes beyond survey information about multilingualism: I apply
social network analysis to gain a deeper understanding of the structural relations
between language communities in Twitter. I focus on the social networks of multi-
lingual users and contribute a classification of network types based on the patterns
of connections between language groups. Also, I propose and apply the novel idea
of modeling the influence of network factors on the language choices of the user.
1.1 What is Twitter?
In Twitter, users share posts with followers; these posts are limited to 140
characters and often include links to webpages, images and other resources. Twitter
has characteristics of a social network —although relationships do not need to be
reciprocal— and an information-sharing network, where both mainstream media
and user-generated content are disseminated publicly [69, 77]. The posts of the
people a user follows are laid out in a vertical stream, in reverse chronological order,
i.e. the most recent posts are at the top and the user can scroll down the screen to
read earlier messages. Twitter posts can be of three types:
1. an original comment by the author;
2. a reposting of a comment authored by someone else; the user who passes the
message along can do so either with the “Retweet” button or by preceding the
copied text with “RT”, “rt”, or other markers of attribution (see figure 1.1);
3. a reply or comment addressed to a particular user by means of a “mention”,
the @ sign followed by a username.
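The three post types listed above can be told apart, roughly, from the text of a post. The following Python sketch is illustrative only and is not the procedure used in this dissertation. The function name classify_post, the flag via_retweet_button, the label strings, and the regular expressions are all assumptions introduced for this example; other markers of attribution are not covered.

```python
import re

# Assumed heuristic: a text-based repost starts with "RT"/"rt" followed by a mention.
RETWEET_MARKER = re.compile(r"(?:^|\s)rt\s*@\w+", re.IGNORECASE)
# A mention is the @ sign followed by a username.
MENTION = re.compile(r"@\w+")


def classify_post(text: str, via_retweet_button: bool = False) -> str:
    """Return 'repost', 'reply_or_mention', or 'original' for one post.

    `via_retweet_button` is a hypothetical flag standing in for whatever
    metadata records that the Retweet button was used.
    """
    if via_retweet_button or RETWEET_MARKER.search(text):
        return "repost"
    if MENTION.search(text):
        return "reply_or_mention"
    return "original"
```

A heuristic of this kind checks the repost marker first, because a text-based repost also contains a mention of the original author and would otherwise be counted as a reply.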
The success and novelty of Twitter stem from the speed of information
dissemination and the fact that most of this information is publicly accessible. For
instance, it takes a “tweet” less than one hour on average to be reposted and, if it
gets beyond that first hop, it will be reposted almost instantly in subsequent hops,
reaching an average of 1000 people [69].
Figure 1.1: Two Twitter repostings. The authors’ usernames at the top of each message have
been erased for privacy reasons, as well as the usernames next to “Retweeted by”; the latter users
reposted them by clicking the Retweet button. The message at the top was previously reposted
by copying the text and preceding it with RT and the mention of the original author’s username, which
is partially shown (@k). Screenshot taken in 2011 reflecting interface design at the time of data
collection. For more images and details on the evolution of Twitter’s interface until 2011, see [114]:
http://blog.bufferapp.com/how-twitter-evolved-from-2006-to-2011.
As a consequence, there is an emergent body of research literature studying
how to leverage Twitter’s tremendous potential for “participatory sensing” and col-
laboration [28], enabling “situational awareness” in emergency events [105], and
functioning as an “awareness system” for journalism [51].
1.2 Motivation
The underlying motivation for my research is promoting language diversity
and facilitating access to multilingual information on the Internet, so that everybody
can benefit from it for communicating, learning, doing business, and sharing ideas and
resources. However, multilingualism also brings new challenges, like the segregation
of information and communication spheres, which can hinder the potential of the In-
ternet for discovery, cross-cultural awareness, intercultural dialog, and transnational
collaboration to find solutions for local conflicts and global problems.
To support my views, I highlight below the relevant points of the “Geneva
Declaration of Principles” [115] for an inclusive Information Society, approved at
the World Summit of the Information Society in 2003:
• The international management of the Internet should facilitate access for all,
taking into account multilingualism.
• The Information Society should foster and respect cultural and linguistic
diversity, dialogue among cultures and civilizations, and encourage in-
ternational cooperation.
• Everyone should have the right to seek, receive, and impart information and
ideas through any media and regardless of frontiers.
These principles are based on prior international declarations, such as
UNESCO’s “recommendation concerning the promotion and use of multilingualism
and universal access to cyberspace” [103].
The Declaration of Principles states the importance of a “rich public domain
[...] for the growth of the Information Society, creating multiple benefits such as an
educated public, new jobs, innovation, business opportunities, and the advancement
of sciences” [115, p. 4]. Similar supporting arguments come from the Internet
Society, which is a non-profit international organization that provides leadership
for Internet policy and technology standards [59]. The Internet Society’s vision was
remarkably conveyed by Vint Cerf in his 1999 speech “The Internet is for Everyone”
[14].
Also, I was inspired by Zuckerman’s comparison of cosmopolitan cities with
the Internet [119]; using this metaphor, he proposed to plan and design technol-
ogy for creating the structure that fosters social contact, vibrant communities, and
discovery, to fulfill the vision of an internet that constitutes a truly cosmopolitan
space.
Unfortunately, there are innumerable challenges to achieving these goals that go
beyond the technical aspects, such as socioeconomic inequality, lack of infrastructure
in rural areas and disadvantaged parts of the world [34, 91], and restrictive governmental
or private controls [107]. Indeed, the Internet has many types of frontiers and
barriers, but I am particularly interested in the language frontier.
1.2.1 Language bubbles?
The first language frontier many potential users encounter prevents them from
using the services on the Internet: the interfaces are not localized into their language,
writing system, and cultural conventions. There is a wealth of literature, mostly
practice-oriented, about localization of interfaces for improving usability and acces-
sibility of software products and websites targeting a global market [117, 50, 92].
Additionally, there are other —more subtle— language frontiers.
Drawing similarities with the “filter bubble” problem, Scott Hale [46] writes
about “language bubbles” on the Internet. As an example, he shows the different sets
of search results obtained for the query “Tiananmen Square” in English and Chinese
using the search engine Google [46]. The Filter Bubble [86] was a widely discussed book
warning the public about the widespread use of algorithms for personalization of
search results and news feeds online without the knowledge or control of the end
user.
This book, and other works expressing similar concerns, have triggered debate
about the decisions shaping the design of information systems and social networks,
and the impact they have on society. Maybe as a result of this debate, the recom-
mender systems community is making an effort to incorporate the values of diversity
and novelty into the recommendation models and algorithms [1]. However, as Hale
points out [46], system designers might not be taking into account the dimensions
of culture and language yet.
A research study of 25 language versions of Wikipedia by Hecht and Gergle [49]
serves to illustrate this ignored challenge. Wikipedia is an online encyclopedia built
with user contributions and revolves around the principle of reaching consensus on
concepts’ descriptions. Hecht and Gergle [49] found that more than 74% of concepts
in Wikipedia are described in only one language and there is a surprisingly small
overlap of concepts in different languages. For instance, in the case of two mature
language editions, the authors report that 51% of concepts in English are covered
in German, but only 16% of concepts in German are also in the English Wikipedia
[49].
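The asymmetry between the two percentages follows from the fact that each one divides the same set of shared concepts by a different total. The short Python sketch below illustrates this with invented toy data; the function name coverage, the edition names, and the concept identifiers are hypothetical and do not come from Hecht and Gergle's study.

```python
from typing import AbstractSet


def coverage(source: AbstractSet[str], target: AbstractSet[str]) -> float:
    """Share of the concepts in `source` that also appear in `target`."""
    if not source:
        return 0.0
    return len(source & target) / len(source)


# Toy concept sets, invented for illustration only.
edition_a = {"c1", "c2", "c3", "c4"}
edition_b = {"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"}

a_in_b = coverage(edition_a, edition_b)  # shared concepts over the size of edition_a
b_in_a = coverage(edition_b, edition_a)  # same shared concepts over the size of edition_b
```

In this toy case the two editions share two concepts, so the coverage is 2/4 in one direction and 2/8 in the other.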
One implication is that we are seeing a substantially different knowledge repos-
itory depending on the language we use. This is not only a matter of insufficient
translation, but of concepts that are culture-specific or not considered relevant in
other languages, e.g. city districts, national sports teams [49].
A survey on the topology of web links determined that the number of hy-
perlinks that cross international borders is significantly lower than the number of
domestic hyperlinks [44]. Similarly, the blogosphere is fragmented into language
communities [80, 53, 47]. We see a different Internet depending on the language we
use, which is hindering our capabilities for sharing and learning, but this research
problem remains unexplored for the most part.
Notably, the field of multilingual information retrieval has a solid body of
literature [87], including the design of multilingual search interfaces and the specific
problem of cross-language information retrieval, but it is narrowly focused on search.
Also, there is a growing body of literature in the Semantic Web field about multilingual
ontologies and cross-language linking of data and resources [3], but its application
is limited to certain domains.
In general, there have been few efforts to understand which network structures
and communication environments foster intercultural dialog, cross-language
sharing of information and resources, awareness of global problems, and international
collaboration.
1.2.2 The bridges between the local and the global
Coupland [18] highlights how social relations become possible across distance
with the help of Information and Communication Technologies (ICTs) in this time of
unprecedented numbers of mobile trajectories and flows of populations. Expatriates,
migrants, minorities, diaspora communities, and language learners play an impor-
tant role in forming transnational networks and cultural bridges between nations
and communities.
Many users of the Internet are multicultural and multilingual. They sometimes
act as invisible translators. For instance, during casual daily interactions, they
might be passing information from one language community to the other, without
strictly translating, but re-contextualizing a story in a new language and culture [6].
Some ground-breaking initiatives are already taking advantage of multilingual users’
language skills for raising international awareness about local conflicts, human rights
violations, and advocacy causes. Such is the case of Global Voices, “an international
community of bloggers who report on blogs and citizen media from around the
world” [37].
In 2009, the Berkman Center for Internet and Society mapped the Arabic
blogosphere and described a key concept that has motivated this dissertation work.
They identified English and French “language bridges” in the Arabic blogosphere,
consisting of bloggers who wrote in English or French and in their native (Arabic)
language, which connected the different national blogospheres with the international
one [32].
What the impact of these “language bridges” is, and how social
media is used for “reaching out to the world” and drawing the attention of international
broadcasting media, are still open questions of particular interest after the
popular uprisings during 2011 [16].
In the microblogging site Twitter, information spreads across languages and
countries [75, 76, 77] and, as I will show, this is possible thanks to multilingual
users that are mediating between language communities. “[T]he greatest connecting
power is the will of the users who want to be connected” [45], as in the example
of bloggers in Arab countries connecting with an international audience [32] or the
self-described “voluntweeters” after the earthquake in Haiti [98].
A world that faces global challenges could benefit from leveraging the interconnections
of its population for finding and sharing solutions from the local level
to the international level.
1.2.3 Values in the design of communication platforms
The localization of the interface into a diversity of languages and cultural
codes, support for non-Latin scripts and bidirectional text displays, as well as providing
assistive technologies for translation, are basic requirements for a globally
accessible communication platform [117, 50].
The debate about filter bubbles reveals that information systems, online social
networks, content-sharing and communications platforms are not neutral tools.
The values and design decisions that underlie these systems [95] have an impact on
the users’ perceptions of the world and their behavior.
The design of Twitter and other information-sharing platforms comes with an
embedded set of values, like sharing, dissemination, being public and participative,
etc. In accordance with these values, research can shed light on how to leverage the
language skills and multicultural background of its users to promote dissemination
of information across language frontiers.
Even after designers identify the values they want to imprint in the system,
they still need to understand the associated challenges. For example, if we want a
communication and information-sharing platform that enables intercultural dialog
and collaboration, cross-language link sharing, and awareness of global problems,
we need to study how the system might be constraining the linguistic decisions of
multilingual users and impairing their ability to cross online frontiers. Also, we
should acknowledge and respect that, in some cases, certain communities might
have reasons for concealing information or resources.
1.3 An Ultimate Goal
The overarching goal that motivates my research is to advance our understand-
ing of the network structures and communication strategies that foster intercultural
dialog, cross-language sharing of information, and awareness of global problems. We
could leverage this knowledge to reduce the impact of language frontiers online, to
encourage social contacts and links to resources across languages, and to promote
the use of multiple languages, i.e. instead of constraining multilingual users to one
language choice, empowering them to mediate between cultures.
Ultimately, who are the people and what are the reasons that connect different
cultural and linguistic groups? What can we do to foster and leverage these cross-
cultural connections for building a cosmopolitan space?
These are very broad and ambitious questions, and my research path has barely
started. In the next section, I narrow the scope to lay a foundation for
this area of inquiry.
1.4 Objectives and Research Questions
This dissertation research aims at: (1) exploring the ways in which multilingual
users of Twitter are connecting different language groups in their social network; (2)
modeling how the network influences their language choices; and (3) exploring what
the textual features of their posts can reveal about language choices and mediation
between language groups.
This dissertation focuses on the microblogging site Twitter because it is
an example of a social and information-sharing network where information
spreads across languages and countries (see subsection 1.2.2). Also, the
interface is available in a diversity of languages, it supports various non-Latin scripts
and bidirectional text, and it does not filter posts by language. Therefore, it
potentially exposes users to a multilingual conversation if they choose to follow
people writing in different languages.
Four questions drive this research:
1. In what ways are multilingual users of Twitter connecting language groups?
2. How is the social network of multilingual users in Twitter influencing their
choice of language?
3. Does the type of exchange in Twitter (i.e. public post, reply) influence the
language choice of multilingual users?
4. What do the themes and textual features in the posts of multilingual users reveal
about cross-cultural awareness or international dialogue?
Inspired by an expanded paradigm of Web Content Analysis proposed by
Herring [52], this research includes social network analysis, natural language processing
for automatic language identification, and theme and exchange analyses.
In this dissertation, the research subjects are Twitter users authoring posts in
English and at least one other language. Focusing on the social network of these
multilingual users, the methodology combines a qualitative approach to social network
analysis and network statistics to present a taxonomy of network types based on
the patterns of intersections and connections between language groups. The result-
ing theoretical constructs or categories answer the first research question. A factor
analysis based on two regression models will answer the second research question on
the influence of the social network on the language choices of multilingual users.
To answer the third research question, I test the hypothesis that the textual
feature indicating addressivity (@ sign) within the posts of multilingual users in-
fluences their language choice. Finally, regarding the fourth research question, a
generic theme analysis will provide preliminary findings on topics that might help
in raising cross-cultural awareness, and on the reasons for using English keywords
in non-English posts.
1.5 Contributions and Audiences
The main contribution of this dissertation is that it goes beyond survey
information about multilingualism and provides a deeper understanding of the
structural relations between language communities in an online social network. In
particular, this research proposes new network statistics to sharpen the
definitions of original theoretical constructs: the types of intersections between language
groups in social networks.
Inspired by previous studies on the blogosphere, I propose to apply social
network analysis to study sociolinguistic questions on the Internet. Adapting the
Ecology of Language theoretical framework from sociolinguistics to the social net-
work context, this research conceives of the social network of multilingual users as
a micro-scale language ecology, influencing their communication strategies and lan-
guage choices. This conceptualization leads to a second key contribution, which is
the novel idea of modeling the influence of social network factors on the language
choices of the user.
Other contributions include: the confirmation of previous empirical observa-
tions pointing to addressivity as a factor for language choice in Twitter, the iden-
tification of themes that might be raising cross-cultural awareness, and the identifi-
cation of certain types of hashtags (keywords preceded by the # sign) and related
contexts that could encourage multilingual conversations.
Regarding the audiences that this dissertation addresses, the research area of
information diffusion in social networks could benefit from the findings about the
structural relations between language groups. More broadly, this work is relevant
to the fields of Information Studies and Social Informatics.
The lessons learned in this work could inform the design of socio-technical
systems. This research contributes to “understanding users” in the field of Human-
Computer Interaction, especially, in the areas of computer-supported cooperative
work and technology-mediated social participation. Also, this dissertation might be
of interest in the field of Language Technologies for potential applications.
This dissertation was inspired by works in Computer-Mediated Communication
and constitutes another example of applying social network analysis, an
approach that is still rarely used in this field.
Finally, other audiences include the fields of Digital Humanities, Sociolinguis-
tics, and Communications Studies, especially in relation to the Internet. Researchers
in these areas might find inspiration in this dissertation for exploring their research
questions with the new lens of social network analysis, and the use of automatic
language processing.
Chapter 2
Theoretical Framework
I begin this chapter by reviewing relevant theoretical perspectives from
Sociolinguistics in the context of globalization. The Global Language System theory
proposed by De Swaan provides a macro-scale perspective on language; it was later
reinterpreted by Calvet in his Ecology of World Languages, which brings together the
amalgam of views that make up the Ecology of Language approach. This approach
comprises different levels of analysis in the study of languages: the macro-scale language
dynamics, described as language ecologies, emerge as a result of the interactions of
individuals and their language choices at the micro-scale level. However,
sociolinguistic theories and methods remain too fragmentary.
Inspired by previous studies on the blogosphere, in this chapter I propose
to apply social network analysis to study sociolinguistic questions on the Internet.
Social network analysis enables us to understand the influence of micro-scale
interactions on macro-scale social dynamics; this analytical approach could enrich the
Ecology of Language perspective.
Finally, narrowing the scope to the particular environment of the Internet and
social media, I discuss and define the concepts that relate to interactions of users
and technology, with particular attention to multilingualism, language choice, and
language-switching.
2.1 The Global Language System
De Swaan’s theory of communication potential and language competition,
called the “Global Language System” [26, 25, 27], illustrates the dynamics of the
world’s languages with a constellation metaphor: English constitutes the hyper-
central sun of the global constellation of languages. At the top level, supranational
subsystems —like Spanish, French, and Arabic— compete with English as languages
of global communication. There are a dozen supranational languages in this con-
stellation and a hundred national languages orbiting around them like planets. This
pattern appears at different levels in the system, starting in the periphery with local
languages surrounding a national language like the satellites of planets. The cen-
tral languages in each subsystem or cluster have a mediation function between local
languages.
A key point in De Swaan’s theory that is relevant to this research is the
connection between the language groups through polyglots and interpreters.
Multilingualism and translation constitute the gravitational force that provides cohesion to the
system, enabling communication and interaction between different language groups.
At the same time, speakers are confronted by multiple and competing linguistic
options. Individual and collective choices shape the system, but are themselves
influenced by the spheres of politics, economics, and culture [25].
De Swaan [26] proposes a formula to determine the communication potential
of a language, which could influence the decision of people to learn it. The factors
that the formula takes into account are: the number of speakers of the language and
the number of multilingual speakers that know the language. The number of mul-
tilingual speakers is related to the centrality of the language in the system. These
multilingual speakers increase the communication potential value of a language
because they enable connections with other languages in the system. For instance, in
the language subsystem of the European Union (E.U.), German has a high
communication potential value due to the large number of speakers in the region, but
English has a higher value due to the larger number of multilingual speakers
competent in English, which provides the opportunity to communicate with people from
many different countries in the E.U. [27].
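The interplay of the two factors can be illustrated with a small worked example. The sketch below assumes the multiplicative form usually associated with De Swaan's measure (the share of all speakers who know the language, times the share of multilingual speakers who know it). The function name, the toy repertoires, and the resulting values are hypothetical and are not data from [26] or [27].

```python
def communication_potential(repertoires, language):
    """Toy communication potential: prevalence * centrality.

    repertoires: list of sets, one set of known languages per speaker.
    Assumption: the two factors are multiplied.
    """
    speakers = [r for r in repertoires if language in r]
    multilinguals = [r for r in repertoires if len(r) > 1]
    multilingual_speakers = [r for r in multilinguals if language in r]

    # Share of all speakers in the subsystem who know the language.
    prevalence = len(speakers) / len(repertoires)
    # Share of multilingual speakers who know the language.
    centrality = (len(multilingual_speakers) / len(multilinguals)
                  if multilinguals else 0.0)
    return prevalence * centrality


# Hypothetical subsystem of ten speakers (not real E.U. figures).
repertoires = (
    [{"de"}] * 4            # German monolinguals
    + [{"de", "en"}] * 2    # German-English bilinguals
    + [{"fr", "en"}] * 2    # French-English bilinguals
    + [{"fr"}] * 2          # French monolinguals
)

q_values = {lang: communication_potential(repertoires, lang)
            for lang in ("de", "en", "fr")}
```

In this toy subsystem German has the most speakers, yet English obtains the highest value because every multilingual speaker knows it, which mirrors the E.U. example above.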
De Swaan already mentions network concepts that will be introduced in sec-
tion 2.3: the “centrality” of a language in the system, which accounts for the medi-
ation potential between that language and others thanks to the number of polyglots
speaking it; these polyglots are facilitating “linkages or connections” among lan-
guages, which are necessary for the “cohesion” of the system.
The constellation metaphor serves to illustrate the following concepts: lan-
guages of local communication (described as “satellites” or “the periphery”), lan-
guages of regional communication (illustrated as “planets”), and languages of global
communication (represented as the “central stars”). When considering the linkages
between language groups and the communication potential, the following terms are
used given a particular context or region: a vernacular language is the first language
of the majority of its users, a lingua franca is the second language of the majority of
its users, who speak different first languages, and in the case of a vehicular language
there is a balance between the number of speakers that use it as a first language
and the number of speakers that use it as a second language [2].
2.2 The Ecology of Language
While De Swaan’s theory places English as a hyper-central sun and uses socio-
economic concepts to assign a “value” to languages, the Ecology of Language ap-
proach adds nuanced views regarding language hierarchies. For instance, this ap-
proach addresses the phenomena of “ethnic revivals” as a form of counter-power in
the language dynamics of a globalized world [56, 30, 13], where the socio-economic
approach of De Swaan falls short.
Borrowing concepts from the field of Ecology, Haugen introduced the Lan-
guage Ecology approach and the notion of the co-evolution of languages and their
interdependence within a social system [19, 71, 48]. Hornberger [56] synthesizes
this analytical approach as taking into account all languages in a given ecosystem,
recognizing their social spaces and contexts. Adapting the Ecology of Language
approach to Twitter, this research conceives of the social network of multilingual
users as a micro-scale language ecology, influencing their communication strategies
and language choices.
Under the ample umbrella of this approach and its various names (language or
linguistic ecology, ecology of languages, and ecolinguistics), there is a diverse group
of authors and works painting the multifaceted social, political, and cultural reality
surrounding the languages of the world. However, this painting is, for the most part,
fragmentary in nature. While the works of the Ecology of Language generally focus
on micro-sociology problems, such as managing multilingualism in South African
and Bolivian classrooms [56], De Swaan proposes a planetary vision. In essence, the
difference between the two perspectives is based on the scale of the analysis.
Calvet provides an integrative perspective in his book Towards an Ecology of
World Languages [11]. Calvet reinterprets De Swaan’s theory as the “gravitational
model”, which describes the global ecosystem of languages, or the macroscopic scale,
and complements it with other models that account for phenomena in lower scales,
such as internal regulation of languages, social and official functions of languages,
and identity [71]. In this way, the Ecology of Language does not exclude the Global
Language System theory by De Swaan, but embraces it as part of this inclusive
approach.
In summary, the Ecology of Language comprises various levels of study in
sociolinguistics, from microscopic to macroscopic scale: from the individuals inter-
acting to the population, and ultimately, the ecosystem of society, policy, economics,
and communication media [71]. However, this framework lacks a connecting tool
between those levels of analysis, i.e. how global language dynamics emerge from
individual interactions and language choices. Inspired by Language Networks on
LiveJournal [53] (one of the first studies to take a network analysis approach
to the study of languages in social media), I propose to apply social network analysis to
facilitate our understanding of the connections between micro-scale and macro-scale
sociolinguistic questions.
2.3 Networks
Building on a long tradition of network analysis in sociology and anthro-
pology [...] and an even longer history of graph theory in discrete mathemat-
ics [...], the study of networks and networked systems has exploded across the
academic spectrum [109](243).
The science of networks provides a new framework to understand complex
systems in biology, sociology, communication technologies, business, etc. [7]. Network
structure is “thought to influence individual (micro) and collective (macro) behavior,
as well as the relationships between the two”, and has provided useful insights in the
study of the spread of disease and information dissemination [109](256). Similarly,
networks could help us gain a deeper understanding of language use in society, in
communication technologies, and of language effects on information dissemination.
2.3.1 Concepts of Social Network Analysis
Social network analysis (SNA) studies the network structure of relations be-
tween people to understand social phenomena, instead of categorizing human be-
havior based on individual inner forces [80]. In SNA, people are represented as nodes
of a social graph. The nodes are connected by edges, or social ties, that could be
reciprocal or just a one-way relation, like the “follower of” relation in Twitter.
In this work, I will use an important type of subgraph: the egocentric network.
The egocentric network is obtained by selecting an individual node, called the ego,
and all of its connections [38]. In other words, it constitutes the personal social
Figure 2.1: Egocentric network with degree 1 (left) and with degree 1.5 (right). Both panels show the ego with its contacts Bob, Kat, and Jen.
network of an individual with his or her contacts. An egocentric network that
includes only the connections with the ego has degree 1. More frequently, researchers
are interested in including the connections among the ego’s contacts; in this case, the
egocentric network has degree 1.5 [38]. Figure 2.1 illustrates these basic concepts.
The egocentric network has become a standard unit of measurement for studying
small scale interactions (or micro-sociology) [41, 80].
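The difference between the two degrees can be made concrete with a short sketch. It assumes the network is stored as a dictionary of directed "follower of" ties; the function name, the data layout, and the example ties (including the extra user Zoe) are hypothetical and only loosely based on the names in Figure 2.1.

```python
def ego_network(follows, ego, degree=1.0):
    """Return the set of directed edges in the egocentric network of `ego`.

    follows: dict mapping each user to the set of users they follow.
    degree:  1.0 keeps only ties that involve the ego;
             1.5 also keeps ties among the ego's contacts.
    """
    edges = {(u, v) for u, targets in follows.items() for v in targets}
    # Contacts are users tied to the ego in either direction.
    contacts = ({v for u, v in edges if u == ego}
                | {u for u, v in edges if v == ego})
    ego_edges = {(u, v) for u, v in edges if ego in (u, v)}
    if degree == 1.0:
        return ego_edges
    alter_edges = {(u, v) for u, v in edges
                   if u in contacts and v in contacts}
    return ego_edges | alter_edges


# Hypothetical ties; Zoe is not a contact of the ego.
follows = {
    "Ego": {"Bob", "Kat", "Jen"},
    "Bob": {"Kat"},
    "Jen": {"Ego"},
    "Kat": {"Zoe"},
}

degree_1 = ego_network(follows, "Ego", degree=1.0)
degree_1_5 = ego_network(follows, "Ego", degree=1.5)
```

Here the degree 1.5 network adds only the tie from Bob to Kat, while the tie from Kat to Zoe is left out of both networks because Zoe is not connected to the ego.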
An edge that constitutes the only connection between two groups of nodes
is called a bridge [38], which is a mathematical concept. Bridges are of special
importance for the dissemination of information from one group to the other [41, 42].
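As an illustration of the bridge concept, the following sketch finds bridges in a small undirected graph by brute force: an edge is reported when its two endpoints can no longer reach each other once the edge is ignored. This is a simple approach chosen for readability, not the method used in this dissertation, and the users and ties are hypothetical.

```python
from collections import deque


def reachable(adj, start, goal, skip_edge):
    """Breadth-first search from start to goal that ignores skip_edge."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj[node]:
            if {node, nxt} == set(skip_edge) or nxt in seen:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False


def find_bridges(edges):
    """Return the edges whose removal disconnects their two endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return [(u, v) for u, v in edges if not reachable(adj, u, v, (u, v))]


# Hypothetical network: an English-posting triangle and a Spanish-posting
# triangle joined by a single tie between Ana and Dan.
edges = [
    ("Ana", "Ben"), ("Ben", "Cat"), ("Cat", "Ana"),   # English cluster
    ("Dan", "Eva"), ("Eva", "Fer"), ("Fer", "Dan"),   # Spanish cluster
    ("Ana", "Dan"),                                   # only tie between them
]
bridges = find_bridges(edges)
```

In this example the tie between Ana and Dan is the only bridge, since every other edge lies on a triangle and can be bypassed.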
In this context, gatekeepers are people enabling communication between two
groups. Multilingual users of Twitter might be in a position of their social network
where information necessarily has to pass through them to reach the other language
group. This has several implications: the gatekeeper’s role is critical for spreading
news between communities and for raising cross-cultural awareness, but they could
broker information to their advantage [80], or they could be conservative in their
decisions to transfer information to one group for cultural reasons [82]; in either
case, places where we find gatekeepers could be considered structural holes between
communities.
Communities or clusters are composed of nodes that are more connected to
one another than with the rest of the social graph. For example, imagine a town with
different neighborhoods, where almost everybody that lives in the same neighbor-
hood knows each other, but they know fewer people from the other neighborhoods.
There are a variety of automatic methods to detect clusters or communities based
on network structure [38]. Figure 2.2 shows a schematic view of how people are con-
nected in clusters. In Twitter, clusters form due to language, geography, or topic of
interest, but this research focuses on language clusters.
The cohesion of a social graph is the minimum number of edges
whose removal would break the graph into isolated components [38]. These
types of edges that connect clusters of people are critical for providing cohesion to
the society, building a sense of community, and for effective self organization and
collective action across language groups [41, 42].
Centrality is a core concept in SNA, and measures how “central” a node is in
the network to estimate its importance [38]. There are different centrality measures
that account for various reasons why a node might be important. For instance,
degree centrality counts the number of edges or connections a node has. In this
work, and in De Swaan’s theory (explained in section 2.1), the relevant centrality
measure is betweenness centrality, which captures how important a node is in the
Figure 2.2: Schematic view of a network with clusters. Clusters (or communities) are composed
of nodes that are more connected to one another than with the rest of the social graph. This
dissertation focuses on language clusters. This visualization is an extract from an open access
journal article [63]: M. Kaiser, M. Görner, and C. C. Hilgetag. (2007). Criticality of spreading
dynamics in hierarchical cluster networks without inhibition. New Journal of Physics:
http://iopscience.iop.org/1367-2630/9/5/110/fulltext/
flow of information from one part of the network to another [38]. In other words,
betweenness centrality reveals the potential mediators or gatekeepers.
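To show how the two measures can diverge, the sketch below computes degree centrality and unnormalized betweenness centrality for a small undirected, unweighted graph. The betweenness function follows Brandes' shortest-path counting algorithm; the users and ties are hypothetical, and a network library would normally be used instead of this hand-written version.

```python
from collections import deque


def degree_centrality(adj):
    """Number of edges attached to each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    score = {node: 0.0 for node in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from source.
        order = []
        preds = {node: [] for node in adj}
        paths = {node: 0 for node in adj}
        dist = {node: -1 for node in adj}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to source.
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected shortest path was counted from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical network: Mia is the only link between two language clusters.
edges = [
    ("Ana", "Ben"), ("Ben", "Cat"), ("Cat", "Ana"),   # English cluster
    ("Dan", "Eva"), ("Eva", "Fer"), ("Fer", "Dan"),   # Spanish cluster
    ("Mia", "Ana"), ("Mia", "Dan"),                   # multilingual user
]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

degree = degree_centrality(adj)
betweenness = betweenness_centrality(adj)
```

In this example Mia has only two connections, fewer than Ana or Dan, but the highest betweenness, because every shortest path between the two clusters passes through her.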
2.3.2 A network perspective on Sociolinguistics
In this dissertation, I focus on multilingual users of Twitter as mediators be-
tween language clusters. For example, imagine a user posting in English and Span-
ish. In Twitter, she follows the updates of researchers posting in English, but also
the Twitter accounts of Spanish local media. Her friends in Spain are connected with
her in Twitter, and also her colleagues in the United States. This user sometimes
posts in Spanish commenting on local news or replying to some Spanish friend. Of-
ten, she posts in English to disseminate research content. She might post in English
to draw international attention to important events in Spain.
This dissertation hypothesizes that the language choices this user makes ev-
ery time she writes a post will be influenced by the language composition of her
social network and, in turn, will have an impact on it. In section 2.2, adapting
the Ecology of Language approach to Twitter, I propose to conceptualize the social
network of multilingual users as a micro-scale language ecology, influencing their
communication strategies and language choices.
Also, this dissertation investigates the structural relation between the
language clusters in the social network of this user to develop methods for detecting
gatekeepers or structural holes. Future research on information dissemination could
benefit from these methods that account for the language effect.
An early work that inspired the question of multilingual users as mediators,
at the micro-sociology scale, was the research on gatekeepers by Metoyer-Duran in a
variety of ethnolinguistic communities in the United States (American Indian, Chi-
nese, Japanese, Korean, and Latino) [82]. She studied their profiles (multilingual
and multi-literate), their behavior as information providers in their respective com-
munities and how they utilize their interpersonal network and new technologies [82].
Her study identified the profiles that facilitated access to information resources for
underserved communities.
At the macroscopic scale, giant clusters in the social network might represent
language communities at the international level, in some cases roughly corresponding
with national borders. To illustrate this idea, I include a visualization of European
language communities in Twitter from a recent study [83]; in figure 2.3 the dots
represent Twitter posts with geolocation information and the colors differentiate the
languages of the posts. However, at the micro-scale, there are pockets of expatriates
and diverse ethnolinguistic communities (similar to those studied by Metoyer-Duran)
immersed in these giant clusters, where multilingualism is present.
A pair of language communities that share more connections through multi-
lingual individuals or translations than other pairs would have a “communication
highway” between those two languages. Continuing with the metaphoric theme
of roadways, if only a few multilingual gatekeepers and translations connect both
language communities, there would be a “rope bridge” crossing the structural hole.
However, these relationships between languages are rarely reciprocal; in other words,
the communication highway might be one-way only. For instance, statistics on
literature translation in Europe reflect the dominance of English as a source (origin
of the translation) and, in the other extreme, low percentages of non-European lan-
Figure 2.3: European language communities in Twitter. The colored dots represent Twitter
posts with geolocation information and their colors differentiate the languages of the posts. Giant
language clusters roughly overlap with national borders at the macro-scale. However, at the micro-
scale, there are pockets of expatriates and minority languages immersed in these giant clusters.
This visualization is an extract from a journal article published under a Creative Commons license
[83]: Mocanu D, Baronchelli A, Perra N, Gonçalves B, Zhang Q, et al. (2013) The Twitter of Babel:
Mapping World Languages through Microblogging Platforms. PLoS ONE 8(4): e61981
guages as a source [74]. For this reason, studying language choice is important to
understand the directionality of information flows.
Network science connects the microscopic scale (e.g. the multilingual user
interacting with her network and switching between languages) and the macroscopic
scale (e.g. the language dynamics in the Twitter system). The problem of
modeling language competition exemplifies the connection between the microscopic and
macroscopic scales in sociolinguistics.
When looking at the different outcomes these models predict over time, there
are cases of language death, language dominance, language coexistence, and language
fragmentation into multiple languages, which recall the Language Ecology
approach. Vazquez et al. [104] described the idea underlying these models: collective
social phenomena are studied in terms of interacting agents, which are represented
as nodes in a network of social interactions; nodes can change their language accord-
ing to specified rules of interaction with the neighbors in the network. The models
include probabilities to switch languages determined by the local density of speakers
of the opposite language, prestige of the language and other parameters [104].
The rules of interaction and probabilities to switch languages belong to the
microscopic level, but the simulation of interacting agents generates macroscopic
results. For example, the Bilinguals Model with three types of people —speakers of
language X, speakers of language Y, and bilingual speakers— shows how the social
structure influences the final outcome: the lower the cohesion of the network, the
higher the chances of evolving into one dominant language [104].
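The general mechanism behind such models can be sketched in a few lines. The code below is a deliberately simplified two-state model (speakers of X or Y only, with no bilingual state), so it is not the Bilinguals Model of [104]. The switching rule (prestige of the other language times its local density among neighbors), the parameter names, and the ring network are assumptions made for illustration.

```python
import random


def step(adj, state, prestige_x, rng):
    """Update one randomly chosen node in place."""
    node = rng.choice(sorted(adj))
    neighbors = adj[node]
    if not neighbors:
        return
    other = "Y" if state[node] == "X" else "X"
    # Local density of neighbors who use the other language.
    density_other = sum(state[n] == other for n in neighbors) / len(neighbors)
    prestige_other = prestige_x if other == "X" else 1.0 - prestige_x
    # Microscopic rule: switch with probability prestige * local density.
    if rng.random() < prestige_other * density_other:
        state[node] = other


def simulate(adj, state, prestige_x=0.5, steps=1000, seed=0):
    """Run the model on a copy of `state` and return the final languages."""
    rng = random.Random(seed)
    state = dict(state)
    for _ in range(steps):
        step(adj, state, prestige_x, rng)
    return state


# Hypothetical ring of six users: three use X, three use Y.
users = ["a", "b", "c", "d", "e", "f"]
adj = {u: {users[i - 1], users[(i + 1) % 6]} for i, u in enumerate(users)}
start = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y", "f": "Y"}
final = simulate(adj, start, prestige_x=0.6, steps=500, seed=42)
```

Counting the languages in `final` for different networks and prestige values gives a macroscopic outcome (dominance or coexistence) that emerges from the microscopic switching rule.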
Using social network analysis in Sociolinguistics is not straightforward.
Representing people as nodes requires careful thought about assigning individual attributes,
like the languages they understand, the languages they use, level of language
competence and literacy, race or ethnic identity, gender, etc. Additionally, nodes can
have a location attribute, which overlaid on a map can distinguish the expatriates
and migrant populations. Also, nodes can represent other actors in society, like
organizations and media outlets. The edges connecting the nodes might be face-to-face
interactions, interactions mediated by technology (like phone or email), or
affective or affiliation relations, to name a few examples. Ultimately, the sociolinguist
has to conceptualize and interpret what the network measures, like centrality and
cohesion, reveal.
2.4 The Internet as a Sociolinguistic Ecology
There is an ongoing debate about the dominance of English on the Internet
and its impact on language diversity [20, 39, 33]. The United States’ leading role
in developing the Internet had consequences like the initial use of English only,
protocols devised for the Roman alphabet, and a telecommunications infrastructure
that was economically dominated by U.S. companies [33].
However, the Internet is evolving very fast. As other nations started to come
into play, and users of different countries gained access to the Internet, a wealth
of languages blossomed online [89]. At the same time that online content was in-
creasing exponentially, the percentage of English content diminished to 45% in 2005,
in favor of other languages, while the estimated online content in Chinese grew to
9% in 2008, followed by German and Spanish [89]. UNESCO’s “recommendation
concerning the promotion and use of multilingualism and universal access
to cyberspace” [103] and new standards like Unicode, which enable the use of different
writing systems, intensified the trend towards a multilingual Internet.
Despite this progress, non-English speaking users perceive the scarcity of online
resources in their first language and are generally appreciative when they can find
information in their language [8]. If users have sufficient knowledge of English as a
second language, they might search in English because they perceive there is more
content in this language and of better quality [8].
Organizations interested in transnational business have realized the impor-
tance of adapting to local cultures to be competitive in a global economy [30]. They
are translating and localizing (adapting to the culture) their products and services
on the Internet. Dor [30] warns against leaving the standardization of vernacular
languages in the hands of software, media, and advertising industries, to the detriment
of the users’ key role in language change, identity, and maintenance.
Even though this concern is well founded, when Dor wrote his article
the participatory Web was still in its infancy. Recently, the wide array of content-sharing
and social media platforms, blogs, wikis, and social networking sites that
make up the so-called “Web 2.0” has lowered the barriers for users to become
producers of content too [6]. The social networking site Facebook broke with the top-down
approach of language standardization in interface localization and implemented one
where users seek consensus about the translation of terms in the interface [73]. How-
ever, the model of inviting users to translate the interface of a site is not transferable
to every company.
The World Wide Web relied on the information retrieval paradigm, where users
search and read content generated by institutions, organizations, broadcasting me-
dia, etc., while interpersonal communication happened via email, Internet Relay
Chat (IRC), and newsgroups [6]. With the advent of Web 2.0 environments, which
encouraged participation and sharing, there was a paradigm shift. Users have be-
come consumers and producers of content at the same time, blurring the boundaries
between professional and user-generated discourse, individual and collective author-
ship, and various communication modes co-existing in a single platform: personal
messages, instant messaging or chat, public posts, etc. [6].
Thanks to the changes brought by the participatory Web, there is a growing
body of literature documenting the increased visibility of vernaculars [6], the cre-
ation of relevant content in minority languages [102], and foreign language practice
and participation in transnational interest communities and diaspora communities
[100]. Other studies in the field of computer-mediated communication focus on the
reproduction in written form of patterns associated with spoken language, the use
of slang or dialect features, playful uses of orthography and typography [23], and
describe the informal adaptations to the Roman alphabet of languages with other
writing systems, like Arabic [108]. These characteristics of the written language in
social media pose a challenge for the automatic analysis of text, which I will discuss
in detail in chapter 4.
2.4.1 The mediation of technology and the cosmopolitan space
Instead of just thinking of a global language and its impact on local ones,
Androutsopoulos [6] proposes to direct our attention to the circulation of cultural
artifacts across national and ethnolinguistic borders and how social media platforms
enable the negotiation of local responses, and the appropriation of those artifacts in
new socio-cultural environments.
In these participatory environments, a network of users interacting with other
users and with digital resources emerges. Digital resources like videos, still images,
speech, music, and text can be labelled (tagged) by users, who are collectively
building taxonomies [88], or even creating multilingual knowledge repositories like
Wikipedia [49]. Very importantly, users are now finding information through social
recommendations or cues (like tags) left by other people [88]. As a result of these
overlapping networks of content and users, information and resources circulate in
different ways across countries [6].
In the sociolinguistic ecology of the Internet, interactions between users are
constrained by the mediation of technology [6]. The design of keyboards, displays,
interfaces, standards that support writing systems, and features of communication
platforms have an impact on the users’ language choices and translation behaviors.
As explained in section 2.3, language choice can affect the directionality of infor-
mation flows between language groups. In social media, social interactions are not
as clearly delimited from interactions with content, since user comments on a dig-
ital object (text, photo, video), and repostings, also constitute an interaction with
the user that posted it. Figure 2.4 illustrates technology and network structure
conditioning users’ communication strategies.
[Figure 2.4, two panels: (a) Interactions mediated by technology. (b) Bilingual social network.]
Figure 2.4: (a) Focuses on a few users, Jen, Bob and Kate, who are interacting between them
and with Web content through the mediation of technology. Two networks overlap: the network of
digital objects that are interlinked and the social network of users. And (b) illustrates the social
network to which Jen and Bob belong. Pink nodes represent people who use English, blue nodes
represent people who use Chinese, and the edges represent the “follower of” relationship in Twitter.
Zuckerman [119] used the metaphor of cities to provide a vision of an internet
that aspires to be a cosmopolitan space, enabling the contact with the unfamiliar,
the serendipity that propitiates learning. In cities, urban planning can create the
structure for social contact, vibrant communities, and discovery [119]. An urban
planning for a vibrant language ecology on the Internet has to challenge the ex-
isting structure of the network of hyperlinks and social connections, and consider
the capabilities for sharing multimedia, the length of text permitted, the language
technologies available, the functionality for managing audiences, and the flexibility for
users to reinvent purposes and adapt content.
Inspired by this vision, I focus on the problem of the social network structure in
multilingual egocentric networks and on the factors influencing the flexible language
choices of multilingual users.
2.4.2 Overview and remarks
In summary, the sociolinguistic ecology of the Internet is determined by pow-
erful social actors like national and supranational institutions, broadcasting media,
companies with interests in transnational business, and also by the contributions
and interactions of users in social networks, content-sharing platforms, blogs, wikis,
etc. At the microscopic scale, the interactions of users are mediated by technol-
ogy, constrained by it and the network structure. At the macroscopic scale, the
Internet is facilitating transnational communication, the flow of information and
digital artifacts across language and national borders, and language learning. The
growing language diversity of the Internet seems to be enabling access to informa-
tion in minority languages and encouraging participation across a wider spectrum
of society.
However, the flows, access and participation are hindered by socioeconomic
factors, reduced bandwidth and lack of infrastructure in rural areas and disadvantaged
parts of the world [34, 91], or even by governments that purposefully seek to
keep their nation confined to an isolated information and communication
sphere [107].
2.5 Micro-Sociology Focus: Conceptualizing Multilingual Users and
Language Choice in Twitter
Adapting the Ecology of Language approach to the social network context,
this dissertation focuses on the social network of the multilingual user, conceptual-
ized as a micro-scale language ecology influencing the user’s language choices. As an
application of this conceptualization, I propose the novel idea of modeling the influ-
ence of social network factors in the language choices of the user. In the rest of this
section, I describe key concepts underlying this research related to multilingualism,
language choice and mediation in the context of Twitter.
Mediators. In section 2.1, I explained the importance that De Swaan gave
to multilingual speakers, who are linking language groups and providing cohesion
to the system of languages [25]. In section 2.3, I introduced the work of Metoyer-
Duran focusing on gatekeepers of ethnolinguistic communities, where she described
them as being multilingual and multi-literate [82]. In the context of the Inter-
net, Androutsopoulos [6] highlighted an additional condition to become a mediator
between global resources and local audiences: adequate technology access and com-
petence. Bringing these characteristics together, we can draw a very basic profile
of mediators between language groups in Twitter: multilingual, multi-literate, and
technologically literate.
There are many degrees of language competence and literacy: some users might
understand a second language but are unable to speak it or write it; others abstain
from using one of the languages they know in certain contexts, like documented cases
of Internet users who preferred to search in English instead of their first language [8].
A discussion on the various degrees of bilingualism, and what Hornberger called “the
continua of biliteracy” [56], falls outside the scope of this work. For a comprehensive
discussion and classification of the types of bilinguals, consult The Bilingualism
Reader [111]. In this dissertation, I focus on the multilingual users that write in at
least two languages on Twitter.
Language choice. A recurrent theme in the Sociolinguistics literature regard-
ing multilingualism is language choice, or how multilingual speakers make decisions
about which language to use in each situation and interaction. These small scale
decisions have an impact on the global dynamics when aggregated. Simplifying
the outcomes of numerous research studies, Androutsopoulos [5] identified setting,
participants, and topic as the main factors influencing language choice in bilingual
online communities. Other works that I will review in this section and in chapter
3 highlight the influence of the audience, the social context, prestige of a language,
identity, etc. Figure 2.5 represents the main factors theoretically influencing the
language choices of multilingual Twitter users. One of the contributions of this
dissertation is to study, for the first time, social network factors in language choice.
Code-switching. A term that frequently appears associated with language
choice and bilingualism is code-switching. In this work, I use the definition of
Joshi [62], which considers two types of code-switching: intra-sentential, when the
user alternates from one language to another within the same sentence, and inter-
sentential, when the change of language happens at the same time that the sentence
finishes and a new one starts. When studying code-switching, we need to identify
the matrix language, which usually provides the grammatical structure and more
lexical items, and the embedded language [62].
[Figure 2.5, diagram labels: Bob, Jen; interaction; addressee; audience; Social context (i.e. professional, personal); Social network; The setting: Twitter; The Internet language ecology; Global Language Ecosystem.]
Figure 2.5: Aside from topic and identity, important factors influencing the language choices
of a multilingual Twitter user are: the addressee in the interaction, the imagined audience, the
social context, and the social network. Also, the setting and the Internet influence language choice
due to the language ecologies associated. There is an overarching global language ecosystem.
In Twitter, posts are so short that we could consider inter-sentential code-
switching when the language changes from one post to the next, while bilingual posts
would be cases of intra-sentential code-switching. In chapter 6, the factor analysis
represents language choices of the users as counts of inter-sentential code-switching.
In chapter 7, the theme analysis includes cases of intra-sentential code-switching,
where there are embedded English keywords in other languages.
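As an illustration of treating a change of language between consecutive posts as inter-sentential code-switching, the following minimal sketch counts such changes in a sequence of per-post language labels. The function name and the input format (one language code per post, in chronological order) are assumptions made for this example; the exact operationalization used in the factor analysis is given in chapter 6.

```python
from typing import List


def count_language_switches(labels: List[str]) -> int:
    """Count language changes between consecutive posts.

    `labels` holds one language code per post, in posting order
    (hypothetical input format chosen for this sketch).
    """
    # Compare each post's label with the label of the post that follows it.
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


# Five posts: English, English, Spanish, English, English.
example = ["en", "en", "es", "en", "en"]
switches = count_language_switches(example)  # 2 switches: en->es and es->en
```

Bilingual posts (intra-sentential code-switching) are not captured by this count, since each post carries a single label here.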
The setting. Although not directly addressed in this dissertation, it is impor-
tant to acknowledge the setting as an underlying factor for language choice. In this
work, the setting is the Twitter platform and is characterized by its design features,
like the limitation of text to 140 characters in every interaction, the default mode
of communication being public, the languages available to interact with the inter-
face, the display of messages posted by other users, the possibility to share links,
the asynchronous nature of communication, the features for users to manage their
social network, etc. Also, conventions and social uses of Twitter have emerged over
time among its users [66].
The Twitter setting has a specific language ecology derived from sociopolitical
factors. Twitter is a company based in the United States, which has an impact on
its adoption across the world, or lack thereof in certain countries like China [83].
After the micro-blogging service launched in 2006, it was rapidly adopted in many
countries; as early as 2007, Java et al. [60] reported its international adoption
in North America, Europe, and Asia (mainly in Japan), but they estimated that
45% of the social network lay within North America.
Not surprisingly, English became a dominant language, with various estimates
ranging from 51% of posts [55] to 53% [90]. A recent large-scale study selected the
20 most active countries in Twitter and showed the percentage of English use in each
country against the percentage of their corresponding vernaculars [83], illustrating
the weight that English has in communications via Twitter. Also, Poblete et al.
[90] selected the 10 most active countries in Twitter and found that the U.S. was
the country that concentrated the most connections from overseas. For these reasons, I
selected multilingual users who have English as one of their language options.
The participants or interlocutors. The participants or interlocutors in
an interaction can vary from a one-to-one exchange in an online chat to a one-to-many
question posed in a forum for an entire community. On the micro-blogging
site Twitter, posts are generally public, but there are also posts addressed to specific
individuals, and the possibility to send private messages.
The audience. In Twitter, there are different levels of reach a user could have.
First, the posts addressed to one or few individuals could be seen by common friends,
and potentially found in search results of the platform by others; second, public
posts can be seen by the network of people that “follows” the user, and potentially,
these posts can reach anyone in Twitter. Re-using a theoretical framework from the
field of communication, Johnson classifies Twitter audiences as addressees, auditors,
over-hearers and eavesdroppers [61]. Marwick and boyd [81] described the concept
of the “imagined audience” in Twitter, or how the user conceptualizes his or her
audience to be able to make linguistic choices, even though the real audience that
reads the post might be different.
In chapter 7, I test the hypothesis that addressing a message to one interlocutor
or a public audience influences the choice of language of the multilingual user.
The egocentric network. Whether Twitter users address posts to the members
of their social network explicitly or not, in this dissertation I argue that the
egocentric network has an impact on the choice of language. Not only could it have
an influence as a perceived audience, but also as a source of information.
Topics or interests. Java et al. [60] identified Twitter communities based on
the social network structure and, analyzing the words in the posts, they observed the
common topics or interests that differentiated the communities. In the context of
the Internet, where the perception of distance and territory blurs, the experience of
identity becomes multi-layered. In addition to ethnic identities, there are dimensions
of shared “feelings”, “knowledge”, or “activities” across distance [18]. Because of
these multi-layered identities, the user can belong to different communities and choose
the language accordingly. This dissertation includes preliminary work related to
topics in the theme analysis (chapter 7), but a complete analysis of topics as a
factor for language choice is left for future research.
Other factors. There are other important factors surrounding the multifaceted
reality of language choice online. Kelly-Holmes [65] argues that the prestige
and international importance of a language encourage its use online. Also, language
choice could relate to the availability of online resources in a language, or
lack thereof [65]. The social context of the interaction is another factor, for example,
English being used for professional emails and a vernacular language for
personal communications [108]. As mentioned above, identity, as a marker of social
and cultural differences, plays an important role in language choice [108, 5]. However,
this dissertation does not include these factors in the analysis to keep this research work to
a reasonable scope.
Chapter 3
Related Work
This chapter comprises a review of the literature informing the present research
work in one or more of the following themes: language choice and code-switching on
the Internet, a network approach to language on the Internet, and multilingualism
on Twitter.
This dissertation contributes a classification of network types based on the
patterns of connections between language groups, which goes beyond survey works
about multilingualism on Twitter. I used a network approach after being inspired
by works analyzing language networks on the blogosphere. The literature about
language choice on the Internet serves to frame my novel proposal of modeling the
influence of network factors in the language choices of the user. Also, the literature
about language choice is relevant to chapter 7, where I test the hypothesis that
English is used more in public messages than in replies to individuals. In the theme
analysis, I include cases of code-switching.
3.1 Language Choice and Code-Switching Online
A survey gathering answers by 2267 students in high schools and universities of
eight countries (France, Italy, Indonesia, Macedonia, Oman, Poland, Ukraine, and
Tanzania) revealed the complex reality of language choices of Internet users [65].
Most of the participants reported some knowledge of English in addition to their
native language, or language of education, and often, they also reported competence
in a third language. The study found that bilingual or trilingual Internet sessions
were somewhat frequent, that language choice could relate to the availability of
online resources in a language, or lack thereof, and to the prestige and international
communication potential of a native language [65]. Other research studies support
the observation that the perceived scarcity of online resources in a native language
influences behavior and attitudes of its users when searching online [8, 67].
Also, domain knowledge influences language choice in online searching because
higher expertise on a topic facilitates understanding of relevant texts in a second
language [67]. Combining the factors topic and context, there are reported cases
where looking for international news and doing academic work encourages the use of
English, while personal communication is conducted more often in native languages
[65].
Along the same lines, a study on email and Internet chat in Egypt documented
the use of English for professional emails, while in personal emails and chat users
preferred a romanized form of Egyptian Arabic, which was mostly used orally before
the advent of the Internet [108]. These informal transliterations include the numbers
2, 3 and 7 for rendering additional phonemes from Arabic into the Roman alphabet
[108]. On the other hand, the use of Classical Arabic and Arabic script was
somewhat relegated [108]. Informal transliterations pose a challenge when doing
automatic text analysis, as I will explain in detail in chapter 4.
The same study observed cases of code-switching between English and romanized
Egyptian Arabic; the latter was used for greetings, humor, sarcasm, food,
holidays and religion [108]. A case study on code-switching between English and
Spanish, and English and Indonesian in Internet chat provided evidence of borrowed
English terms related to computers, such as “e-mail”, “attachment”, and “PC” [12].
When chatting about friendship and relationships, the subjects preferred their first
language instead of English [12]. Another study on the language choices of Greek,
Turkish and Persian diaspora online communities in Germany found that topics
on politics and technology disfavored the use of home and minority languages in
newsgroups and web forums, while music and poetry favored it [5].
These works studying topics and contexts (i.e. professional versus personal)
as factors for language choice provide some basis for the theme analysis in chapter 7
and for future research about the influence of interest communities in the language
choices of Twitter users. The remaining works that I review in this section study
the selection of English in multilingual settings online.
A longitudinal study of a Swiss forum with participants of three different
native languages detected the increase in the use of English over time, even though
English was not the first language of any of the users [31]. This finding suggests
that the presence of a multilingual or international audience might encourage the
use of English as a lingua franca. In view of this previous finding, I propose a
multilingual index of the social network as a potential predictor of English use by
the multilingual user (chapter 6). Other research works on email and mailing lists
[31, 65] studied the impact on language choice of addressing a message to one person
or to a multilingual audience; in the latter case English was preferred.
Recent works on the use of social networking sites by bilinguals argue that
the intended audience determines the language choice. In Facebook, Welsh-English
bilingual high school students write the status updates more often in English to
ensure that all their friends feel included, whereas they use Welsh for one-to-one
messages with other Welsh-speaking friends [21].
In Twitter, Welsh-English bilinguals use proportionally more English in public
posts (53%) than in replies to individuals (44%) in a sample of 500 posts [61]. The
reason for using Welsh or English in replies is related to the language profile of
the addressee to some extent; sometimes English is used to communicate between
Welsh-English bilinguals [61]. In relation to this finding, Johnson [61] speculates
that the use of English is encouraged in Twitter for its potential to reach a wider
audience. Finally, the latter study on Twitter reported very few cases of bilingual
posts [61]. In chapter 7, I test whether English is used more in public messages and I
observe that bilingual posts are scarce.
As a closing note to this section, I would like to acknowledge an existing
body of literature about “context collapse” in social media, and particularly in
Twitter [81] and Facebook [106]. Professional and personal contexts merge in the
same communication environment, where users seek to balance the different identity
presentations [81] and, as a result, possibly their linguistic choices.
3.2 Networked Languages
The literature on multilingual computer-mediated communication is very helpful
for raising awareness about the multifaceted reality of language choice and how
the context and the mediating technology influence it, but generally does not address
the potential transnational impact. Language Networks on LiveJournal [53]
was possibly one of the first studies to take a network analysis approach to study
the language demographics of a social media site, a blog hosting service in particular.
Apart from studying the robustness of non-English language networks, the study
identified blogs that were bridging language communities and described some characteristics
of their authors (students of foreign languages, expatriates, multilingual
and multicultural) and topics (images, content with international appeal).
Two years later, the Berkman Center identified English and French “language
bridges” on the Arabic blogosphere, consisting of bloggers that wrote in English
or French and their native (Arabic) language, and connected the different national
blogospheres with the international one [32]. However, they did not explore fur-
ther into the connections with the international blogosphere or their motivations
for language choice. These questions are important to understand how people draw
international attention using their transnational networks and how information dis-
seminates across language borders.
Hale [47] tackled the aspect of cross-language linking among blogs in English,
Spanish, and Japanese. Focusing on the topic of the earthquake in Haiti in 2010,
he was able to quantify the increase in foreign content awareness over time and
detect patterns of cross-lingual linking among blogs [47]. Most notably, blogs in
English linked much less to foreign content than the blogs in Spanish and Japanese,
the largest single destination of cross-lingual links being a collection of photos [47].
Interestingly, Global Voices, an international blogging community that promotes
translation of content, created 15% of all cross-lingual linking in the dataset [47].
This finding illustrates that designing for multilingualism and cross-cultural awareness
has an impact on the network structure.
This network approach to languages has been applied to the blogosphere, but
not yet to the microblogging service Twitter. A study on the influence of distance
in the formation of Twitter ties [99] tangentially included language. The most
interesting finding related to this research is the observation of cross-language ties
in a sample of 1768 pairs of nodes: English-Other (7.4%), Other-English (3.1%),
Other-Other (1.1%) [99]. However, the authors identified the users’ language using
only one post, and made some questionable assumptions in their interpretations,
like users being monolingual and equating country with language use, which led
them to be skeptical of their own observation of 8% of geolocated subjects in Brazil
using English [99]. Actually, this percentage of English use on Twitter in Brazil is
very close to the estimate provided by a more solid and large-scale survey [83].
As in [99], there are examples of rough research assumptions about languages,
geography, and text analysis on Twitter that show the need for a deeper
understanding of these interrelated areas of study.
Shifting the focus from languages to countries, it is possible to find research
that looks at transnational ties in Twitter. Poblete et al. [90] illustrated with a
detailed network graph the strength of ties between the ten most active countries in
Twitter. These ties are the aggregated results of users’ “following” relationships in
Twitter. Apart from the expected stronger connections among countries that share
the same language (U.S., U.K., Canada, and Australia), the graph also unveils that
the U.S. attracts the most international attention, while it pays little attention outside
its borders [90]. Also, South Korea and Brazil have few connections overseas
[90].
Ideally, a holistic view of the language ecology in Twitter will require an
analysis of the languages in the overlapping networks of users’ attention ties (follower
relationship), interaction ties (replies and retweets), topics and linked resources. As
a starting point, this dissertation focuses on attention networks at small scale.
3.3 Multilingual Twitter
In section 3.2, I noted that there are currently no research works that take a
network approach to languages on Twitter, which would be useful to understand
cross-language ties and communication connections between language communities.
On the positive side, there are a number of works that study language on Twitter
for different purposes and, as the body of research literature on Twitter is growing
fast, it may be only a matter of time before works on networked language communities
are published.
A large-scale study on the languages used in Twitter (more than 62 million
posts over a period of four weeks) reveals that almost 49% of the posts are written
in a language different from English and provides a ranking of the top 10 languages
[55]. This study records the number of URLs and hashtags (keywords preceded by
the # sign) shared by people from different language communities [55]. Their closing
reflection encourages the study of bilingual brokers and how information flows across
language communities [55]. An even larger-scale survey of Twitter (380 million posts
over 564 days), conducted later, also presents a ranking of top languages among the
78 detected [83].
When comparing the language rankings of the first study and the later survey,
we can observe that European languages are losing positions against Asian
languages, except for the increase of Spanish and the unchallenged dominance of
English [83]. Twitter is a fast-changing environment, where the language ecologies
and their impact on communication behavior are constantly in the process of
negotiating their contexts and finding their balances.
The same survey, The Twitter of Babel [83], compares the percentage of En-
glish use on Twitter in 20 countries versus the vernacular language use with an illu-
minating graph (figure 3.1). This graph should dismiss the assumptions that equate
country and language use. Another piece of evidence against such an assumption
is the survey’s focus on Belgium. Mocanu et al. [83] compare Belgian census data
with Twitter data, and observe that the Flemish-speaking population (or Belgian
Dutch speakers) is over-represented in Twitter by comparison to the Walloon French-speaking
population. The researchers connect it with the finding that Twitter has
higher penetration in the Netherlands than in France [83], which might attract more
users of Dutch language variants regardless of geographical borders [83].
Figure 3.1: Language share of the top 20 most active countries on Twitter, ordered by number
of English posts. This graph is an extract from the journal article [83]: Mocanu D, Baronchelli A,
Perra N, Gonçalves B, Zhang Q, et al. (2013) The Twitter of Babel: Mapping World Languages
through Microblogging Platforms. PLoS ONE 8(4): e61981
The Twitter of Babel [83] constitutes a very valuable large-scale survey with
examples at different scales, from country level, to city level and neighborhood.
However, it does not mention or count cross-lingual connections and bilingual users,
even though they are implicit in the multilingual situations it describes. This is where
a network approach could provide more insights about transnational influence.
Drawing attention to methodological challenges, Graham et al. [40] compare
common geolocalization and language identification methods used for Twitter data
analysis. One of the issues the authors encountered is the difficulty in classifying text
correctly as Arabic when the Twitter posts were written in Roman alphabet [40],
like the cases reported in section 3.1. In particular, Compact Language Detector
failed to classify romanized Arabic in 89% of cases [40]. Also, Bergsma et al. [9]
detail the challenges in automatic language identification of Twitter posts. The
researchers recruited human annotators to label the language of the posts and build a test
collection of Twitter texts in languages with non-Roman scripts, namely Arabic, Farsi,
Hindi, and Urdu [9]. The aim is to improve automatic language identification in
these languages [9].
Looking at the different use of Twitter depending on the language, at least
two studies document the different frequency of features in Twitter posts, such as
URLs, hashtags, repostings, and user mentions [110, 55]. The findings of these works
suggest that Twitter is used more for conversational purposes in some languages, like
Indonesian, while in other languages, like German, it is more commonly used for sharing
resources [110, 55]. This dissertation proposes to study in future work particular
languages as factors influencing the communication behavior of multilingual users.
In conclusion, the studies about languages in Twitter are descriptive, or of
survey type, and only implicitly one can guess there are multilingual users playing
a role in the language landscape they describe. Even when comparing the different
uses of Twitter depending on the language, these studies do not investigate further
into the interactions between language communities and their relation to language
choice.
Chapter 4
Methodology
Inspired by an expanded paradigm of Web Content Analysis proposed by Her-
ring [52], who also pioneered a network approach to the study of languages on social
media [53], this research includes social network analysis, natural language pro-
cessing for automatic language identification, theme and exchange analyses. This
expanded paradigm proposes broadening the construct of content analysis for ac-
commodating new techniques of analysis appropriate for the evolving landscape of
the Internet, and enumerates link and exchange analyses, topic analysis, feature
analysis, image analysis, language analysis, etc.
In this dissertation, I apply social network analysis to answer the first research
question about the egocentric networks of multilingual users; I use two regression
models in the factor analysis to answer the second research question on social net-
work factors that affect language choice; finally, I test the hypothesis that the ad-
dressivity feature (@ sign) influences language choice, and explore with a theme
analysis how other textual features might be facilitating cross-cultural awareness.
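To make the addressivity hypothesis concrete, the sketch below splits a user's posts into addressed posts and public posts and tabulates the language labels of each group. It is only an illustration under stated assumptions: a post is treated as addressed when its text begins with the @ sign, posts are given as (text, language) pairs, and the names `is_addressed`, `language_by_addressivity`, and `english_share` are hypothetical. The actual test procedure is described in chapter 7.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def is_addressed(text: str) -> bool:
    # Assumption for this sketch: a post is addressed to an individual
    # when it starts with the @ sign (ignoring leading whitespace).
    return text.lstrip().startswith("@")


def language_by_addressivity(posts: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Tabulate language labels separately for addressed and public posts."""
    table = {"addressed": Counter(), "public": Counter()}
    for text, lang in posts:
        key = "addressed" if is_addressed(text) else "public"
        table[key][lang] += 1
    return table


def english_share(counts: Counter) -> float:
    """Fraction of posts labeled 'en'; 0.0 when there are no posts."""
    total = sum(counts.values())
    return counts["en"] / total if total else 0.0


# Invented sample posts, for illustration only.
sample = [
    ("@bob hola, que tal?", "es"),
    ("Good morning world", "en"),
    ("  @jen thanks!", "en"),
    ("Buenos dias", "es"),
    ("New blog post is up", "en"),
]
table = language_by_addressivity(sample)
public_share = english_share(table["public"])
addressed_share = english_share(table["addressed"])
```

The two resulting count tables form the contingency table on which a test of independence between addressivity and language could then be run.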
This chapter starts with a brief introduction of the research design, followed
by an account of the collection and processing of data that underlies the four stud-
ies of the dissertation. The details of the analysis are described in the chapters
corresponding to each study.
4.1 Research Design
The research design is composed of four sequential studies of the same datasets,
focusing on complementary facets of mediation between language communities and
language choice. Table 4.1 shows a schematic view of the four studies and the
corresponding chapters.
First, I identified Twitter users authoring posts in English and another lan-
guage. I collected their last 50 posts and their egocentric network with degree 1.5.
The egocentric network with degree 1.5 includes the people connecting with the mul-
tilingual subject or ego and the connections among the people directly connected
with the ego (see section 2.3 for social network concepts). Also, I analyzed auto-
matically the last 30 posts of all the users within the egocentric networks to identify
the language they are using in Twitter.
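The definition of the egocentric network with degree 1.5 can be sketched as a filter over a list of ties: keep the ego, everyone directly connected with the ego, and the ties among those people. The edge format (pairs of user names read as follower and followed) and the function name `ego_network_1_5` are assumptions for this example, not the collection procedure used in the dissertation.

```python
from typing import Iterable, Set, Tuple

# Hypothetical edge format: (follower, followed).
Edge = Tuple[str, str]


def ego_network_1_5(ego: str, edges: Iterable[Edge]) -> Tuple[Set[str], Set[Edge]]:
    """Return the ego's direct contacts and the ties of the 1.5-degree network."""
    edge_set = set(edges)
    # Contacts: anyone directly connected with the ego, in either direction.
    contacts = {v for u, v in edge_set if u == ego} | {u for u, v in edge_set if v == ego}
    contacts.discard(ego)
    nodes = contacts | {ego}
    # Keep ego-contact ties and contact-contact ties; drop ties that leave the network.
    kept = {(u, v) for u, v in edge_set if u in nodes and v in nodes}
    return contacts, kept


# Invented example: "c" and "d" are not directly connected with the ego.
ties = [("ego", "a"), ("b", "ego"), ("a", "b"), ("a", "c"), ("c", "d")]
contacts, kept = ego_network_1_5("ego", ties)
```

In the example, the tie between contacts a and b is retained because both are directly connected with the ego, while ties reaching c and d are dropped.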
In summary, the data comprises a list of 92 egos, with 50 posts each, and a list
of contacts associated with every ego, with a language label, and their linkages in
the form of an adjacency list. Figure 4.1 illustrates the components of the datasets.
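One possible in-memory layout for the data of a single ego is sketched below, using the example values shown in Figure 4.1. The class name `EgoRecord`, its field names, and the two helper methods are hypothetical choices for this illustration; the dissertation does not prescribe a storage format.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EgoRecord:
    ego_id: str
    # (text, language label) for each of the ego's posts.
    posts: List[Tuple[str, str]] = field(default_factory=list)
    # Language label assigned to each contact in the egocentric network.
    contact_language: Dict[str, str] = field(default_factory=dict)
    # Adjacency list: links between pairs of contacts.
    adjacency: List[Tuple[str, str]] = field(default_factory=list)

    def language_group_sizes(self) -> Counter:
        """Number of contacts per language label."""
        return Counter(self.contact_language.values())

    def cross_language_links(self) -> int:
        """Number of links whose two endpoints carry different language labels."""
        return sum(
            1
            for a, b in self.adjacency
            if self.contact_language[a] != self.contact_language[b]
        )


# Values taken from the example shown in Figure 4.1.
record = EgoRecord(
    ego_id="Ego 1",
    posts=[("Post 1", "en"), ("Post 2", "en"), ("Post 3", "es")],
    contact_language={
        "Contact 1": "en",
        "Contact 2": "es",
        "Contact 3": "en",
        "Contact 4": "en",
    },
    adjacency=[
        ("Contact 1", "Contact 2"),
        ("Contact 1", "Contact 3"),
        ("Contact 2", "Contact 3"),
        ("Contact 3", "Contact 4"),
    ],
)
sizes = record.language_group_sizes()
cross = record.cross_language_links()
```

Grouping contacts by language label and counting the links that cross groups are the kinds of quantities a classification of networks by language-group connections would build on.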
In chapter 5, the social network analysis combines a qualitative approach and
network statistics to generate a taxonomy of network types based on the patterns
of intersections and connections between language groups. The study follows an
exploratory design, with a first qualitative phase that takes a grounded theory ap-
proach, and a second quantitative phase that consolidates the qualitative findings.
The unit of the analysis is the egocentric network of multilingual users. I visualized
the 92 networks with the Gephi social network analysis tool and identified the groups
Problem                                         Facet           Chapter  Objective                                     Approach
Multilingual Twitter users as mediators         social network  5        classification of egocentric networks         social network analysis (QUAL+QUAN)
Multilingual Twitter users as mediators         content         7        exploring cross-cultural themes               theme analysis (QUAL)
Language choices of multilingual Twitter users  social network  6        influence of network in language choice       factor analysis, regression (QUAN)
Language choices of multilingual Twitter users  content         7        influence of addressivity in language choice  hypothesis test (QUAN)
Table 4.1: Research design divided into four studies of the same datasets, looking at the research
problem of multilingual Twitter users as mediators and their language choices with different foci. In
chapter 5, I use social network analysis for classifying egocentric networks of multilingual Twitter
users. Chapter 6 consists of a factor analysis, using regression models to detect if the social
network influences language choices of multilingual users. Chapter 7 focuses on textual content;
firstly, I test the hypothesis that addressivity influences language choice and, secondly, I look for
international themes and other textual features that might indicate cross-cultural awareness.
[Figure 4.1 diagram. Four components: list of egos (Ego 1, Ego 2, ..., Ego 92); text and language
of each post (e.g., Post 1, en; Post 2, en; Post 3, es; ...; Post 50, en); adjacency list (e.g.,
Contact 1, contact 2; Contact 1, contact 3; Contact 2, contact 3; Contact 3, contact 4; ...);
network languages (e.g., Contact 1, en; Contact 2, es; Contact 3, en; Contact 4, en; ...).]
Figure 4.1: The data comprises a list of 92 egos, with 50 posts each and language labels, a list of
their contacts with a language label, and the social network links in the form of an adjacency list.
of people that write in different languages. Focusing on the structural relationships
of these language groups, I complemented the qualitative study of visualizations
with network statistics specifically created to provide a robust definition of network
types. Finally, I used machine learning for testing the results.
In chapter 6, the factor analysis models the influence of a set of factors related
to the social network in the language choices of multilingual users. The dependent
variables considered are the proportion of English use and non-English use within
the 50 posts of the ego (language choice of the ego). The factors included are the
proportion of English and non-English language use in the social network of the ego,
and the degree of multilingualism of the social network. The relative importance
of factors, or their weight, is represented by the coefficients obtained by fitting two
different generalized linear models to the dataset (linear and logistic regression).
In chapter 7, exploring textual features, I shift attention from the social network
to the content of the posts written by the egos. First, I look at the textual feature
of the @ sign at the beginning of a post as an indicator of addressivity. Based on
this indicator, I test the hypothesis that the type of exchange (public post versus
reply to an individual) influences the choice between English and other languages.
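As an illustration of the addressivity indicator, the sketch below flags posts that begin with the @ sign and tabulates exchange type against language group. It is a minimal sketch under stated assumptions: the function names, the tolerance for leading whitespace, and the sample posts are invented for the example, and the statistical test itself is left to chapter 7.

```python
from collections import Counter


def is_addressed(post):
    """True when the post starts with '@', the addressivity indicator.

    Leading whitespace is ignored (an assumption of this sketch).
    """
    return post.lstrip().startswith("@")


def contingency(posts):
    """Count (exchange type, language group) pairs.

    `posts` is an iterable of (text, language_code) tuples.
    """
    table = Counter()
    for text, lang in posts:
        exchange = "reply" if is_addressed(text) else "public"
        group = "english" if lang == "en" else "other"
        table[(exchange, group)] += 1
    return table


# Invented sample posts, for illustration only.
sample = [
    ("@ana nos vemos mañana", "es"),
    ("Looking forward to the conference", "en"),
    ("@bob see you tomorrow", "en"),
    ("Qué día tan largo", "es"),
]
table = contingency(sample)
```

The resulting 2x2 table of counts is the kind of input a test of independence between exchange type and language choice would need.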
In a second study included in chapter 7, I look at content with the objective
of detecting themes that might help in creating cross-cultural awareness, where the
multilingual users could be acting as mediators from the point of view of their mes-
sages. I identify themes related to non-English speaking countries or communities
in English posts and, also, I identify English hashtags (keywords preceded by the
# sign) inserted in non-English posts. Using a generic theme analysis, this study
serves as an explorative qualitative phase to inform the design of future studies after
this dissertation work.
4.2 Sampling and Data Collection
I identified potential multilingual Twitter users with the help of Prof. Jennifer
Golbeck and Tony Rogers. We started by issuing queries to the Google search engine,
restricted to the Twitter domain, that combined one English word (“between” or
“tomorrow”) and one of the words in the list of figure 4.2. For instance, a query
was “tomorrow” and “también” (which means “also” in Spanish). The words were
selected from lists of “stop words” for every language. Stop words are very common
words in a language. There are many lists of stop words created for natural language
Language    Words
Arabic      [illegible]
Chinese     [illegible]
French      alors, très
German      zusammen, gern
Greek       περίπου
Hebrew      טוב
Italian     molto, peggio
Japanese    [illegible]
Korean      [illegible]
Polish      właśnie, chyba, albo
Portuguese  muito
Russian     к, о
Spanish     desde, también
Figure 4.2: Common words in different languages used for querying in combination with English
words.
processing, usually to filter them for various purposes. In this case, I used these
common words to represent each language. The main selection criterion was that the
word should not be identical or similar to any other word in a different language, in
order to avoid ambiguity about the language the word represents.
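The query construction can be sketched as follows. This is only an illustration: the `site:` restriction and the quoting are assumptions about how the queries might have been written, the helper `build_queries` is hypothetical, and only two of the languages from figure 4.2 are included.

```python
from itertools import product

ENGLISH_WORDS = ["between", "tomorrow"]
# Subset of figure 4.2, for illustration.
STOP_WORDS = {
    "French": ["alors", "très"],
    "Spanish": ["desde", "también"],
}


def build_queries(english_words, stop_words, site="twitter.com"):
    """Pair every English word with every stop word of every language.

    Returns a list of (language, query string) tuples. The query syntax
    is an assumption of this sketch, not the exact queries issued.
    """
    queries = []
    for language, words in stop_words.items():
        for en_word, other_word in product(english_words, words):
            queries.append((language, f'site:{site} "{en_word}" "{other_word}"'))
    return queries


queries = build_queries(ENGLISH_WORDS, STOP_WORDS)
```

Each query pairs exactly one English word with one word from another language, so a page that matches it is a candidate for a profile that mixes the two languages.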
In sections 2.5 and 3.3, I reviewed studies that document the dominant use of
English in Twitter and how this relates to the weight that the United States has in
the social network [60, 90] and to its use in many non-English speaking countries
that are active on Twitter [83]. When trying bilingual combinations in our initial
search, it was very difficult to find bilinguals that did not use English as one of
the active languages; this realization is supported by the findings of a study [99]
discussed in section 3.3. For this reason, we decided to limit the sampling to
multilingual users who wrote in English and, at least, one other language.
The search results led to the users’ profile pages on Twitter. The ranking of
users’ profiles returned by Google may have placed more popular users first,
biasing our initial selection, but we do not know the actual criteria used by the search
engine. We visited these profile pages, read the latest posts, and checked that the
subjects were actually using two languages. We established clear written instructions
for selecting them. In particular, we did not select users whose:
• posts in one language were automatically generated (e.g., users posting in
Spanish with only Foursquare check-ins in English) or were spam,
• posts in one language were only named entities, like song titles, names of
books, etc., or
• posts in one language were only reposted content (or “retweets”).
Note that reposting on Twitter does not prove any active knowledge of a
language, as it only requires clicking a button or copying text. Moreover, if users just
repost the same text, the message stays confined to the same language community.
Also, the instructions for selecting a user required that he or she had written
at least one post entirely in a second language, had more than 30 posts (excluding
“retweets”), and had between 4 and 5,000 followers. I discarded potential subjects
that had more than 5,000 followers due to the computational workload required for
processing large social networks and the policy limitations of the Twitter API for
extracting data.
We identified 175 potential multilingual users. After this first selection of
users, I retrieved the last 50 posts of each one of them by means of the Twitter
API. The API only allows one to extract data from public user accounts in Twitter.
For this reason, accounts made private by the users are not included in the sample
of multilingual subjects, and when extracting their contacts automatically, private
accounts render no data.
As in the previous selection phase, I did not include the repostings, with the
exception of those that had a comment added by the user; in such cases, it might be
possible to find bilingual text and translation. Specifically, the 50 posts of every user
did not include the repostings that were shared by clicking on a “retweet” button, or
those posts that started with the characters “RT” or “rt” (abbreviations of retweet),
but could include the posts that had some text written before RT or rt.
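The reposting rule can be expressed as a small filter. The sketch below is illustrative only: `keep_post` and the `is_native_retweet` flag are hypothetical names, and the flag stands in for whatever marker identifies a post shared through the “retweet” button.

```python
def keep_post(text, is_native_retweet=False):
    """Apply the reposting rule to one post.

    Drops posts shared through the "retweet" button and posts that start
    with the characters "RT" or "rt"; keeps posts in which the user wrote
    some text before "RT" or "rt". The check is literal, as in the text:
    any post starting with those two characters is dropped.
    """
    if is_native_retweet:
        return False
    return not text.lstrip().startswith(("RT", "rt"))


posts = [
    "RT @user: hola a todos",
    "So true! RT @user: hola a todos",
    "Good morning",
]
kept = [p for p in posts if keep_post(p)]
```

Only the second and third posts survive: the second one is a repost with an added comment, which may contain bilingual text or a translation.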
Based on the data, I selected only those users who had written at least 4 non-
automatically generated posts in a second language, to ensure that the language
was well represented. During the data collection process, I had to discard some
users because they made their accounts private or closed them, and one user started
posting spam. The data collection process spanned from October 3 to November 7,
2011.
Finally, my sample contains 92 multilingual users that write in 19 languages
(Arabic, Basque, Catalan, Chinese, Dutch, English, French, Galician, German,
Greek, Hebrew, Italian, Japanese, Korean, Mongolian, Polish, Portuguese, Russian,
Turkish), usually two or three languages per person.
Figure 4.3 shows the purpose of different components of the dataset. I kept
the last 50 posts of the final multilingual users for studying their language choices
and conducting the theme analysis. Also, with the help of Prof. Jennifer Golbeck,
[Figure 4.3 diagram. Egos dataset: 92 egos; posts of egos (for language choice and theme
analysis); location (contextual info.). Contacts dataset: followers & followings (adjacency list
of contacts; egocentric network degree 1.5; social network analysis); last 30 posts of contacts
(automatic language identification).]
Figure 4.3: Data collection and purpose of different datasets extracted from Twitter.
I extracted the location of the multilingual users from their profiles as contextual
information. For every multilingual user (ego) we extracted the egocentric networks
with degree 1.5 in the form of adjacency lists of followers and followings (contacts)
for the social network analysis, as explained in section 4.1. In total, there are
25,556 contacts within the 92 egocentric networks. Finally, I retrieved the last 100
posts from the contacts’ accounts to identify the languages they use in Twitter,
with the exception of private accounts and accounts with no posts (5,950 cases).
As previously explained, only 30 posts per user were analyzed, and any repostings
included have text before the characters “RT” or “rt”.
4.3 Methods for Assigning Language Labels to Users
In this section, I introduce the options I considered to assign language labels
to every user in the egocentric networks with the purpose of completing the contacts
dataset illustrated in figure 4.3. First, I had to automatically detect the language(s)
used in a number of their posts and, secondly, I had to determine their language
profile based on those multiple posts.
The egos dataset also required the automatic identification of the languages
of posts, but the egos were not assigned a language label. In the factor analysis,
the language profile of the multilingual egos is conceptualized differently, as pairs
or frequencies for the two main languages of the user.
I considered two main options in order to assign a language label to a user in
Twitter: (1) extracting the language code that the user has selected on the Twitter
interface, and (2) automatically identifying the language of a certain number of
posts that the user has written. In the case of extracting the language code of the
interface, the immediate problem is that this code is not accurate for bilingual users.
Also, as I will explain in section 4.4, a test suggested that the interface language has
a very high error rate in representing the actual languages of the user. One reason
could be that many users do not change the interface language given by default
(i.e. English) because they can understand it, but prefer to write in their native
language.
As a result of these considerations, I chose the option of automatically iden-
tifying the language of users’ posts. As reviewed in section 3.2, related research
[99] only used one post per person to assign a language to a user. This is insuffi-
cient to determine bilingual and multilingual use, and also problematic in a noisy
environment such as Twitter, with frequent cases of automatic posting and spam.
A question that I will address in the next section is how many Twitter posts
are enough to determine the language(s) of the user. Having more than one post
and language label for a user requires a process to determine which language label
fits best for that user.
4.3.1 Tools for automatic language identification
The first step was to identify the language of the users’ posts. I considered
three tools: the Google Language Identification tool (Google’s proprietary option),
the Chrome browser Compact Language Detector (partly open source code by Google),
and the Python Language Detector (an open source module for the Python programming
language). Google’s language detection algorithm uses quadgrams, or four-character
tokens [118, 97], and the Python module uses trigrams [43].
Briefly, the process of using the language identification tool works as follows:
I send an input file that contains the posts of a user after eliminating mentions,
hashtags, URLs and symbols, and the language identification tool returns an output
file with the language labels and confidence levels for every post.
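The cleaning step that precedes language identification can be sketched with regular expressions. This is a minimal sketch, not the original preprocessing code: the patterns are assumptions, and “symbols” is interpreted here as any character that is neither a letter, a digit, nor whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
# Assumption: "symbols" means anything that is not a word character or space.
SYMBOL_RE = re.compile(r"[^\w\s]|_")


def clean_post(text):
    """Remove URLs, mentions, hashtags and symbols, then collapse spaces."""
    text = URL_RE.sub(" ", text)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return " ".join(text.split())


cleaned = clean_post("@ana Mañana vamos a #Madrid! http://t.co/abc :)")
```

URLs are removed first so that their punctuation does not leave stray fragments behind. Because `\w` is Unicode-aware in Python 3, accented and non-Latin letters are preserved. A post that is empty after cleaning gives the language identification tool nothing to work with.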
4.3.2 Algorithm for assigning a language label to a person
In a second step, I elaborated the rules for assigning a language or languages
to a person. For a given user, with a list containing pairs of language and confidence
level, the heuristics of the algorithm are:
• Discard all pairs with confidence level below 0.1. The purpose of this rule is
to eliminate noise or inaccuracies in the language assignation method. If no
pair remains, the language label assigned to the user is “unknown”.
• For each remaining language, compute the frequency (number of posts in that
language) and select the highest confidence value of all posts in that language,
hereafter called “maximum confidence”.
• Discard all languages with a frequency below 10% of the total number of posts
for that user. The purpose of this rule is to eliminate languages that are not
well represented in the profile of the user, due to automatic posting, etc. If
no language passes the frequency threshold, the label assigned to the user is
“unknown”.
• Determine if the user is monolingual or multilingual. If more than one language
has maximum confidence equal to or greater than 0.7, the language label assigned
to the user is a multilingual label. Otherwise, it is a monolingual label. Note that the
“maximum confidence” is the highest confidence level of all posts in a language
for one user. Therefore, the requirements for considering a user multilingual
are (1) at least two languages with 10% minimum frequency and (2) maximum
confidence equal to or greater than 0.7. This multilingual label is composed
of the code of the most frequent language and the code of the second most
frequent language.
• In the monolingual case, if only one language has maximum confidence equal
to or greater than 0.7, that is the language assigned to the user.
• In the monolingual case, if no language has maximum confidence equal to or
greater than 0.7, the language assigned to the user is the one with the highest
frequency.
Figures 4.4a and 4.4b illustrate the process of assigning a language label to a
user with examples.
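The heuristics above can be sketched in Python as follows. This is an illustrative reading of the rules, not the original implementation. Three points are assumptions: the frequency share is computed over all posts of the user, including those discarded by the first rule; the two languages of a multilingual label are the two most frequent among those that reach the 0.7 threshold; and frequency ties are broken by maximum confidence. The function name and parameter names are hypothetical.

```python
from collections import defaultdict


def assign_language_label(pairs, min_conf=0.1, min_share=0.10, high_conf=0.7):
    """Assign a language label to a user from (language, confidence) pairs."""
    total = len(pairs)
    # Rule 1: discard low-confidence pairs.
    kept = [(lang, conf) for lang, conf in pairs if conf >= min_conf]
    if not kept:
        return "unknown"
    # Rule 2: frequency and maximum confidence per language.
    count = defaultdict(int)
    max_conf = defaultdict(float)
    for lang, conf in kept:
        count[lang] += 1
        max_conf[lang] = max(max_conf[lang], conf)
    # Rule 3: discard languages below the frequency threshold
    # (assumption: share is computed over all posts of the user).
    frequent = [lang for lang in count if count[lang] / total >= min_share]
    if not frequent:
        return "unknown"
    frequent.sort(key=lambda lang: (-count[lang], -max_conf[lang]))
    # Rules 4-6: multilingual if two or more languages are confident enough.
    confident = [lang for lang in frequent if max_conf[lang] >= high_conf]
    if len(confident) >= 2:
        return "+".join(confident[:2])
    if len(confident) == 1:
        return confident[0]
    return frequent[0]


# Inputs taken from figures 4.4a and 4.4b.
mary = [("cy", 0.08), ("en", 0.75), ("en", 0.40), ("en", 0.65), ("ar", 0.20),
        ("ar", 0.35), ("ar", 0.80), ("en", 0.80), ("fr", 0.5), ("en", 0.30),
        ("en", 0.70), ("ar", 0.60)]
tony = [("cy", 0.08), ("en", 0.75), ("en", 0.40), ("en", 0.65), ("ar", 0.20),
        ("ar", 0.35), ("pt", 0.10), ("en", 0.80), ("fr", 0.5), ("en", 0.30),
        ("en", 0.70), ("en", 0.60)]
label_mary = assign_language_label(mary)
label_tony = assign_language_label(tony)
```

With the inputs of figure 4.4, the sketch yields a multilingual label for @mary and a monolingual label for @tony, matching the outcomes shown in the figure.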
Note that the thresholds are based on the assumption that the confidence level
represents a probability between 0 and 1, but they could be tailored for each tool.
The confidence level of 0.7, used as threshold to determine multilingualism, was
selected after observing issues derived from transliteration.
For instance, Arabic was sometimes written in the Arabic script, but also in a
romanized form that the tool identified incorrectly and with low confidence as some
other language. In the cases where Arabic was one of the two languages used, the
romanized form (with an inaccurate language label and low confidence) had enough
frequency to displace the Arabic as one of the actual languages that composed the
multilingual label. Adding the requirement of high confidence (equal to or greater
than 0.7) eliminated the instances with incorrect labeling and favored the Arabic
that was correctly classified because it was not transliterated.
User @mary
Outcome of the automatic language ID tool (language, confidence):
  cy, 0.08; en, 0.75; en, 0.40; en, 0.65; ar, 0.20; ar, 0.35; ar, 0.80; en, 0.80; fr, 0.50; en, 0.30; en, 0.70; ar, 0.60
1st threshold: no language with confidence below 0.1. If no language remains, return label “unknown”.
Languages ordered by frequency, maximum confidence selected:
  en, 50%, 0.80
  ar, 33%, 0.80
  fr, 8%, 0.50
2nd threshold: no language below 10% frequency. If no language remains, return label “unknown”.
  en, 50%, 0.80
  ar, 33%, 0.80
Multilingual: at least 2 languages with confidence 0.7 or more. Monolingual: 0 or 1 languages with confidence 0.7 or more.
Algorithm assigns label en+ar to user @mary.
(a) Example where the algorithm assigns a multilingual label to a user.
User @tony
Outcome of the automatic language ID tool (language, confidence):
  cy, 0.08; en, 0.75; en, 0.40; en, 0.65; ar, 0.20; ar, 0.35; pt, 0.10; en, 0.80; fr, 0.50; en, 0.30; en, 0.70; en, 0.60
1st threshold: no language with confidence below 0.1. If no language remains, return label “unknown”.
Language, frequency, maximum confidence:
  en, 64%, 0.80
  ar, 18%, 0.35
  fr, 9%, 0.50
  pt, 9%, 0.10
2nd threshold: no language below 10% frequency. If no language remains, return label “unknown”.
  en, 64%, 0.80
  ar, 18%, 0.35
Multilingual: at least 2 languages with confidence 0.7 or more. Monolingual: 0 or 1 languages with confidence 0.7 or more.
Algorithm assigns label en to user @tony.
(b) Example where the algorithm assigns a monolingual label to a user.
Figure 4.4: Two examples of how users are assigned language labels by the algorithm after the
language of their posts has been automatically identified.
4.4 Testing Methods for Assigning Language Labels to Users
In this section, I explain the creation of a test dataset and a baseline for com-
paring the results of different language identification tools, and for comparing the
results of the language assignation algorithm versus a human making that assigna-
tion.
Finally, I compare the test results using a number of posts per user between 1
and 100 to answer the question: how many Twitter posts are enough to determine
the language(s) of the user?
4.4.1 The test dataset
For testing the tools, I prepared a sample of users. I randomly sampled 10 egos
from the 92, and 20 contacts for each ego (or all contacts, when an ego had fewer than 20),
obtaining a total of 190 users from the list of 25,556 contacts. Of those users, only
177 had data available. The other 13 users had no data, either because their account
was private or because they had not posted anything. Finally, I extracted the last 100 posts
of the 177 users, including repostings that had an added comment by the user. In
total, I extracted 15,973 posts. This is my test dataset.
4.4.2 The baseline
I created a baseline as close as possible to human labeling. Given the time and
resource constraints, I decided to use one of the automatic language identification
tools to assign a language to each post as an initial guess and manually revise the
results. In this task, I took advantage of my skills in English, Spanish, and French,
as well as my familiarity with some Romance languages, such as Portuguese, Italian,
Catalan, and Galician. The number of posts that were sent to the language identifier
after eliminating mentions, hashtags, URLs and symbols is 15,856 (from the test
dataset). Then, I revised the language labels of the posts following these criteria:
• A. A post is labelled monolingual if any of the following are true:
1. All words are in one language.
2. There is only one word in a second language within a post of five or more words.
3. There is a title or named entity in a second language and fewer than
five words in the dominant language of the user.
• B. A post is labelled bilingual if any of the following are true:
1. There is one word in a second language and the post has fewer than five
words.
2. There are two words in a second language and the post has between
five and ten words, inclusive, unless case A3 applies.
3. The post has more than ten words and there are at least three words
in a second language, unless case A3 applies.
4. A title or named entity is in a second language and there are at least
five words in the other language.
• C. A post is labelled automatic if any of the following are true:
1. There are two identical posts.
2. There are at least three posts that are nearly identical except for a
number or a word.
3. There are sentences like: “Posted a picture on Facebook”, “liked a photo
on Facebook”, “favorited a Youtube video”, “I am at something @ name of
place” (foursquare).
• D. A post has a non-identifiable language if any of the following are true:
1. There is a named entity that could correspond to more than one lan-
guage.
2. There are only symbols, emoticons, or random letters.
3. Other reasons.
Named entities —a term coined in the field of Information Extraction— are
information units that consist of rigid designators for a referent, like proper names
of people or organization names, locations or times, among many types [84].
In the cases where Arabic, Hebrew and Mongolian were transliterated, the tool
did not identify the language correctly. In those cases, I considered the language
of the post to be the language of the user other than English (Arabic, Hebrew or
Mongolian). This language was determined from other posts written by the same user
in non-transliterated form, where the tool is more accurate.
Finally, the criterion for classifying the language profile of the user for the baseline
was to use a 10% frequency threshold to determine if the user was monolingual
or multilingual. If fewer than 10% of the posts were in a second language, the user
was considered monolingual and assigned the label of the most frequent language.
If the second language passed the threshold, the user was assigned a multilingual
label composed of the codes of the most frequent language and the second most
frequent language. Automatic posts and non-identifiable language posts did not add
to the frequency count of any language. Bilingual posts added 0.5 frequency points
(instead of 1) for each of the two languages.
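The baseline profiling rule can be sketched as follows. This is an illustrative sketch under stated assumptions: each revised post is represented as a tuple of language codes (one code for a monolingual post, two for a bilingual post, none for automatic or non-identifiable posts), the 10% share is computed over the posts that contribute to the frequency count, and a user with no countable posts is labelled “unknown”. None of these representation choices come from the text, and the function name is hypothetical.

```python
from collections import defaultdict


def baseline_profile(post_labels, threshold=0.10):
    """Return a baseline language label from manually revised post labels."""
    points = defaultdict(float)
    for langs in post_labels:
        if len(langs) == 1:
            points[langs[0]] += 1.0          # monolingual post
        elif len(langs) == 2:
            for lang in langs:               # bilingual post: 0.5 points each
                points[lang] += 0.5
        # Empty tuples (automatic or non-identifiable posts) add nothing.
    if not points:
        return "unknown"  # assumption: not specified in the text
    total = sum(points.values())
    ranked = sorted(points, key=points.get, reverse=True)
    if len(ranked) > 1 and points[ranked[1]] / total >= threshold:
        return ranked[0] + "+" + ranked[1]
    return ranked[0]


mostly_english = [("en",)] * 19 + [("es",)]
mixed = [("en",)] * 8 + [("en", "es")] * 2 + [("es",)] * 2 + [()]
```

In the first example the second language accounts for 5% of the counted posts, so the user is treated as monolingual; in the second it accounts for 25%, so the user receives a multilingual label.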
To create the baseline, I decided to use Google Language Identification tool
because the Compact Language Detector (CLD) has two additional disadvantages
and Python Language Detector has one additional disadvantage. Unlike Google’s
proprietary option, CLD cannot detect the Mongolian language in the dataset, and
the confidence values do not represent probability. The confidence values range from
0 to over 100, but the maximum value is unknown. The lack of interpretability of
confidence values poses a challenge to use the algorithm that assigns a language
label to a person.
Python Language Detector is able to detect Mongolian and the confidence
values are between 0 and 1, which might indicate probability. In practice, the
confidence values are biased towards the range 0.9–1. The minimum is only 0 when
the post is empty. Instead, 0.17 acts as the minimum confidence value, for instance,
in cases when the post has just a symbol and any guess should return a 0 confidence
value. This biased behavior of the confidence values would affect the performance of
the algorithm. I considered tailoring the thresholds of the algorithm that assigns a
language to a person to account for this, but given the disproportionate amount of
posts with confidence level 0.9, this value has little discriminatory power. For these
reasons, I expected that the Google Language Identification tool would perform
better.
In summary, I considered my baseline to be the results of Google Language
Identification with a subsequent revision, and I assigned the language labels accord-
ing to a set of criteria described in this subsection 4.4.2.
4.4.3 Testing the language identification tools and the algorithm that
assigns language labels to users
I tested the performance of the language assigning algorithm, comparing the
baseline (Google Language ID, manual revision, criteria-based language assignation)
with the results of the automated language assignation (Google Language ID, man-
ual revision, language assigning algorithm). The manual revision is not performed
in the actual analysis of the contacts dataset, but serves for testing the performance
of the language assigning algorithm, changing only one variable with respect to the
baseline. To obtain the estimated error rate, I divided the number of cases where
the automated results did not coincide with the baseline by 177 total cases. The
resulting estimated error rate is 6.78%, with 1.13% false negatives (missing multi-
linguals), and 5.65% false positives. Therefore, the algorithm tends to overrepresent
multilinguals.
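The error rate computation can be sketched as below. The function is illustrative, and it assumes that a label containing “+” denotes a multilingual user. The counts in the final lines are back-calculated from the reported percentages (12 mismatches out of 177, of which 2 are missed multilinguals and 10 are spurious multilinguals); they are not stated in the text.

```python
def error_rates(baseline, predicted):
    """Compare two dicts mapping user -> language label.

    A false negative is a baseline multilingual labelled monolingual;
    a false positive is a baseline monolingual labelled multilingual.
    Other mismatches (e.g., a different language) count only as errors.
    """
    n = len(baseline)
    fn = fp = other = 0
    for user, truth in baseline.items():
        guess = predicted[user]
        if guess == truth:
            continue
        truth_multi, guess_multi = "+" in truth, "+" in guess
        if truth_multi and not guess_multi:
            fn += 1
        elif guess_multi and not truth_multi:
            fp += 1
        else:
            other += 1
    return {
        "error": (fn + fp + other) / n,
        "false_negative": fn / n,
        "false_positive": fp / n,
    }


# Toy labels (invented) to show the function at work.
toy_baseline = {"a": "en", "b": "en+ar", "c": "es", "d": "fr"}
toy_predicted = {"a": "en", "b": "en", "c": "es+en", "d": "fr"}
toy_rates = error_rates(toy_baseline, toy_predicted)

# Back-calculated counts consistent with the reported percentages.
reported = [round(100 * k / 177, 2) for k in (12, 2, 10)]
```

The back-calculation reproduces the reported 6.78%, 1.13%, and 5.65%, which suggests that every mismatch in this test was a disagreement about multilingual status rather than about which language was used.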
Subsequently, I tested the performance of Google Language Identification tool
in combination with the language assigning algorithm, eliminating the human re-
vision (Google Language ID, language assigning algorithm). These are the actual
conditions of the analysis performed with the contacts dataset. I computed the es-
timated error rate with respect to the baseline. Figure 4.5 shows how increasing the
number of posts used for assigning a language to a person diminishes the estimated
error rate.
[Figure 4.5 plot. x axis: Number of Tweets (0 to 100); y axis: Error rate (0.05 to 0.30).]
Figure 4.5: The y axis represents the estimated error rate of using the Google Language ID
method with respect to the baseline and the x axis represents the number of posts per person used
for language assignation. As the number of posts increases, the estimated error rate diminishes like
a negative logarithmic function.
Using a regression model, I obtained function 4.1 fitting the estimated error
rate as a function of the number of posts per person used for language assignation.
The variable x is the number of posts per user.
f(x) = 0.285 − 0.056 × log(x) (4.1)
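Function 4.1 can be evaluated directly. The sketch below assumes that log denotes the natural logarithm; this is an inference, made because it is the reading under which 15, 30, and 70 posts fall below the error rates of 0.15, 0.10, and 0.05 listed in table 4.2, whereas a base-10 logarithm would not.

```python
import math


def estimated_error(posts_per_user):
    """Function 4.1, assuming log is the natural logarithm."""
    return 0.285 - 0.056 * math.log(posts_per_user)


def estimated_error_base10(posts_per_user):
    """Same coefficients with a base-10 logarithm, for comparison only."""
    return 0.285 - 0.056 * math.log10(posts_per_user)


for n in (15, 30, 70):
    print(n, round(estimated_error(n), 3))
```

Under this reading, the fitted error rate drops quickly over the first few dozen posts and flattens afterwards, so each additional post per user buys a smaller reduction in error.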
Compared to Google Language ID, the Compact Language Detector fails to
detect the language in many more instances: Google did not identify the language
in 46 cases, while CLD did not identify the language in 3,071 cases out of 15,856 in the
test dataset.
In the case of the Python Language Detector (LangPy), there are 3,590 differ-
ent outcomes compared to the Google Language ID (from a total of 15,856 posts). I
compared the performance of LangPy in combination with the algorithm that assigns
language labels to users (LangPy, language assigning algorithm) with the baseline
(Google Language ID, manual revision, criteria-based language assignation). Figure
4.6 shows the estimated error rate of using LangPy with respect to the baseline,
as the number of posts used for assigning a language to a person increases. It also
displays, side by side, the previous results of the estimated error rate using Google
Language ID (Google Language ID, language assigning algorithm) with respect to
the baseline. In the case of Google Language ID, the estimated error rate is al-
ways lower, partially due to the fact that the baseline uses this tool as a starting
point. Also, the biased confidence values of Python Language Detector constitute a
challenge for the algorithm that assigns language labels to users.
[Figure 4.6 plot. x axis: Number of tweets (0 to 100); y axis: Error rate (0.1 to 0.3); legend
(Language Detector): Google, LangPy.]
Figure 4.6: The y axis represents the estimated error rate with respect to the baseline and
the x axis represents the number of posts per person used for language assignation. The tools
compared are Google Language ID and Python Language Detector (LangPy). As the number
of posts increases, the estimated error rate diminishes. In the case of Google Language ID, the
estimated error rate is always lower, partly due to the fact that the baseline uses this tool as a
starting point.
As explained before, using the language code of Twitter’s interface to identify a language
has the highest estimated error rate of all methods considered: 0.418. Of the 177 users
in the test dataset, 25.99% were multilingual, but the interface does not offer
the option to select more than one language. Another 15.82% of users were using
a language different from the language selected on the interface, which was English
in all of these cases.
Estimated Error Rate  Number of Posts per User  Estimated Cost ($)
Below 0.15            15                        363.11
Below 0.10            30                        704.40
Below 0.05            70                        1547.85
Table 4.2: Overview of different budget options: estimated error rate in the automatic language
analysis associated with the number of posts used per person, and the corresponding analysis costs
for the entire contacts dataset. The use of Google Language ID tool costs $20 per one million
characters of text.
4.4.4 Deciding the number of posts per user
Once I decided to use Google Language ID tool, a question remained: “How
many Twitter posts are enough to determine the language(s) of the user?” In
essence, this is a budget question. The cost of using Google Language ID tool is $20
per one million characters of text.
Drawing from the estimated error rate results shown in figure 4.5 and the
estimated error rate function 4.1, I selected three error rate options paired with the
number of posts per person needed. Based on this number, I used the character
count to estimate the cost of analysis for the contacts dataset, which comprises
19,606 contacts with up to 100 posts per user. Table 4.2 provides an overview of
estimated cost versus estimated error rate in the automatic language analysis of the
contacts dataset.
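The cost estimate follows from the price per character. The sketch below shows the arithmetic; the function names are hypothetical, and the character totals are back-calculated from the costs in table 4.2 rather than taken from the text.

```python
PRICE_PER_MILLION_CHARS = 20.0  # USD, as stated in the text
CONTACTS_WITH_DATA = 19_606


def analysis_cost(total_characters):
    """Cost in USD of sending `total_characters` to the language ID tool."""
    return total_characters / 1_000_000 * PRICE_PER_MILLION_CHARS


def implied_characters(cost_usd):
    """Number of characters that a given cost corresponds to."""
    return cost_usd / PRICE_PER_MILLION_CHARS * 1_000_000


# Back-calculation for the 30-posts option of table 4.2 ($704.40).
chars_30 = implied_characters(704.40)
chars_per_post = chars_30 / (30 * CONTACTS_WITH_DATA)
```

For the selected option, the cost corresponds to about 35.2 million characters, or roughly 60 characters per processed post if every contact contributed 30 posts. The true average per post is somewhat higher, since contacts with fewer than 30 posts contribute less text.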
With Prof. Jennifer Golbeck’s advice, we selected an estimated error rate of 0.10,
at a cost of $704.40 for the automatic language identification of the contacts dataset
using 30 posts per person. Aside from the contacts dataset, there is also a small
dataset of 92 egos with 50 posts per user (figure 4.3).
4.5 Assigning Language Labels to Users
For the social network analysis and the factor analysis, the contacts dataset
required language labels for 25,556 people in the 92 egocentric networks. However,
only 19,606 contacts had available data. The rest were assigned the language label
“unknown”. As explained in section 4.4.4, I decided to extract 30 posts per per-
son to determine the language or languages they are using in Twitter. The text
extracted from the posts was processed through a pipeline for automatic language
identification. The first stage in the pipeline involved the elimination of URLs, hash-
tags (keywords preceded by the # sign), replies and mentions (usernames preceded
by the @ sign), and other symbols. In the second stage, I used the Google API to
identify the language of each processed post and the confidence value.
Subsequently, every user was represented by a file that contained the languages
and confidence values of their posts. This file was processed by the algorithm that
assigns language labels to users; the details of the heuristics can be found in section
4.3.2. The labels could be: “unknown”; a code for one language in the case of
monolingual users (e.g., “en” for English); or two language codes joined by the
symbol “+” in the case of multilingual users (e.g., “en+ar” for English and Arabic).
In the case of the egos dataset, composed of 50 posts from 92 multilingual
users, the text was similarly processed for automatic language identification. The
purpose of this dataset being different, I automatically processed the data to obtain
a percentage of use of the two most frequent languages for every person, with the
corresponding language codes, and identified those egos that used a third language
in at least 10% of the posts. The results served to describe the characteristics of this
dataset, and to quantify the language choices. The egos dataset is composed of 87
bilingual users and 5 trilingual users, all of whom use English as their first or second most
frequent language (which was a condition in the sampling method). Also, they use
one or two of the following 18 languages: Arabic, Basque, Catalan, Chinese, Dutch,
French, Galician, German, Greek, Hebrew, Italian, Japanese, Korean, Mongolian,
Polish, Portuguese, Russian, Turkish.
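The processing of the egos dataset described above can be sketched as follows, assuming each post has already been reduced to one language code. The function name `language_profile` and the returned structure are hypothetical; only the 10% threshold for a third language comes from the text.

```python
from collections import Counter


def language_profile(post_langs, third_threshold=0.10):
    """Summarize one ego's language use from per-post language codes.

    Returns the two most frequent languages with their share of posts, and
    the third most frequent language if it reaches the threshold share.
    """
    if not post_langs:
        raise ValueError("at least one post is required")
    counts = Counter(post_langs)
    total = sum(counts.values())
    ranked = counts.most_common()  # sorted by decreasing frequency
    top_two = [(lang, count / total) for lang, count in ranked[:2]]
    third = None
    if len(ranked) > 2 and ranked[2][1] / total >= third_threshold:
        third = ranked[2][0]
    return {"top_two": top_two, "third": third}


# Hypothetical ego with 50 posts: 30 English, 14 Spanish, 6 Basque.
posts = ["en"] * 30 + ["es"] * 14 + ["eu"] * 6
print(language_profile(posts))
```

In this hypothetical example the third language accounts for 12% of the posts, so the ego would be counted as trilingual.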
4.6 Scope
When looking at the links between users in Twitter, there are two types of
networks where the methodology could focus: (1) the network generated by the
exchange of messages between people, like replies and repostings, which represents a
communication network and a transient social network, dynamically evolving around
a topic of interest [64]; (2) the network of “follower of” relationships between people,
representing a relatively more stable social network, spanning diverse topics and
communities. In both cases, the networks could reflect a static moment in time or
an evolution, and the data collection has to be planned accordingly.
In section 1.3, An Ultimate Goal, I ask what generates connections across
language communities and enables cross-cultural communication. Answering that
question completely requires different approaches and collecting data to look at both
types of networks. There are many complementary aspects that can be analyzed
using the transient communication networks and the attention (or followers) social
networks. However, in the interest of keeping this research project to a reasonable
scope, I decided to focus on attention social networks of multilingual users, as cap-
tured at one point in time. In future research, I would like to broaden the scope to
account for dynamic networks, topic-based and communication networks.
As documented in section 3.1, language choice online is influenced by many
simultaneous factors, such as the cultural and linguistic context in a particular re-
gion, the social context, the users’ perception of the availability of online resources
in a language, and the topic, to name a few factors that this dissertation will not
consider. The distinctive approach of this research consists of shifting the focus to
factors derived from the social network where the user is immersed. Also, partic-
ipants and audience are implicit factors when studying the textual feature of the
@ sign. Regarding the setting factor, the social networking site Twitter could be
considered one variable for comparison, but this work will not expand into other
social networks with different characteristics, like Facebook.
Methodologically, Androutsopoulos recommends taking into account the digital
surroundings when analyzing written text, for instance, looking at the pictures
and videos that are linked [6]. Although this strategy is particularly relevant to
Twitter and would enrich the qualitative analysis, this work will focus just on tex-
tual themes and hashtags due to time constraints in the final stage.
According to Rotman et al. [93], this type of research work would be considered
an exploratory step prior to embarking on “extreme ethnography”, which is a
new approach to ethnographic methods for the study of human behavior in large
scale online environments. Indeed, adding detailed geographic information and cul-
tural backgrounds of the nodes in the social networks would provide a fascinating
overview of international communication patterns among individuals. However, such
an endeavor would require a wealth of resources and a long time.
4.7 Limitations
Due to the policy limitations of the Twitter API for extracting data and the
computational workload required for processing large social networks, I discarded
potential subjects that had more than 5000 followers. In practice, this decision biases
the sample against the bigger hubs in the social network. Other limitations in data
collection include the impossibility to obtain information from private accounts,
for technical and ethical reasons, and the presence of inactive users. These issues
translate into 23% of subjects in the contacts dataset having no data available.
This research work focuses on 92 subjects, and their egocentric networks, which
is a small dataset that poses challenges in obtaining statistically significant results.
The diversity of languages included, and the small size of their respective samples,
hindered any attempt to make comparisons between language groups.
Another challenge lies in automatically identifying the language of short texts
of this type, and subsequently of the nodes in the egocentric networks. In
computer-mediated communication the text often has characteristics of both written and
spoken language, with colloquial and regional dialect features and playful performance with
orthography and typography [23], which makes automatic language identification
more difficult. As reviewed in section 3.3, other research works have encountered this
problem and have reported high error rates in identifying romanized Arabic [9, 40].
Finally, during the data collection process, we did not collect geolocation in-
formation of posts. This type of information consists of GPS coordinates derived
from the users’ devices, or approximate area derived from the Internet Protocol (IP)
address of the users’ computer [40]. Only a small number of users publish geocoded
posts; as a result, requiring geocoded posts would have reduced the number of subjects and
biased the sample [83]. Instead, we collected the location information users provide
in a specific field of the Twitter interface. However, this data is unreliable [40] and
I decided not to take it into account during the analysis.
4.8 Reliability and Validity
Regarding the reliability of the data collected, this work focuses on messages
and actions —like “following” someone— of multilingual users in Twitter, and is
not using the data as a proxy for their behavior in other settings. The Twitter API
provides access to this information. Spammers are the most compromising problem
for the reliability of data. In the case of the 92 egos, I designed the selection steps
(section 4.2) to avoid spammers and, in relation to validity, I also discarded people
that were not multilingual users as defined in this investigation.
To improve the reliability of language assignment, I used 30 posts per person
and tested both different language identification tools and the algorithm that assigns
language labels to users based on their posts. I provide an estimated error rate
below 10%.
In the social network analysis, I designed a two-step study to improve the relia-
bility of the categories obtained in the qualitative phase with quantitative measures,
and tested the results with a classification model.
In the factor analysis, I operationalized the multilingualism of a social network
using the concept of entropy, which can be interpreted as the unpredictability of the
language used in the network. The more people in the network using different
languages, the higher the entropy. Unlike just counting the number of languages
present, the entropy accounts for the weight those diverse languages have in the
network and provides a more accurate measure of multilingualism.
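As a brief illustration of why entropy is more informative than a language count, the sketch below computes the Shannon entropy of the language labels in a network. It assumes base-2 logarithms and one label per node; how “unknown” and bilingual labels enter the computation is not specified here, so the function simply treats every label it receives as a distinct category.

```python
import math
from collections import Counter


def language_entropy(labels):
    """Shannon entropy (in bits) of the language labels in a network."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    total = sum(counts.values())
    # Each language contributes -p * log2(p), where p is its share of nodes.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two hypothetical networks, both containing exactly two languages.
balanced = ["en"] * 50 + ["es"] * 50
skewed = ["en"] * 90 + ["es"] * 10
print(language_entropy(balanced), language_entropy(skewed))
```

Both hypothetical networks contain two languages, but the balanced one has an entropy of 1 bit while the skewed one has less than half a bit, reflecting the smaller weight of the second language.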
In the factor analysis, I used two regression models to compare the results
and test their validity. Finally, the qualitative data, consisting of the posts written
by the 92 egos, were categorized into public posts and replies. The comparison of
categorical data requires the use of non-parametric statistics to obtain valid results.
The theme analysis includes many examples from the data and compares the results
with previous findings in related studies to support their validity.
Regarding external validity and generalizability of results, the small scale of
the study limits the possibility for extrapolation to a wider multilingual population
in Twitter. On the positive side, the systematic documentation of steps in the
grounded theory approach for classifying network types, complemented with network
measures, enables replication. This replicability makes it easier to scale up the study and
potentially obtain more generalizable results.
4.9 Ethical Considerations
This type of research, which consists of collecting public content from the
Internet with no aim of presenting subjects in a bad light, is considered a low risk
research activity that does not require an approval process by the Institutional
Review Board (IRB). In particular, I am not collecting any private information, like
age, gender, or real names, nor am I collecting data from accounts made private.
However, the study of these new social media environments with user-generated
content is challenging the established ethical protocols. Despite this study following
the current norms of the research community, many ethical issues are still being
debated and protocols might change in the future.
In the article Six Provocations for Big Data [22], the authors warn that the fact that
users post publicly accessible messages online does not automatically mean that they
consent to anyone collecting and using their data. Users have an intended audience and
purpose, and are unaware of their data being collected. Unfortunately, there is no
practical way to obtain consent from users or to inform them of the data collection
process.
The main concern is the user’s privacy. The only identifiable personal
information in the present datasets is the public usernames, but I took the additional
precaution of using anonymized identification codes for all the subjects. When
presenting textual content that could include usernames, I either eliminated the
user mention or replaced it with a fake name. However, the vast majority of textual
content was used only for automatic language identification.
I discovered one minor in the dataset by reading a post that stated the user’s
age. At that point, I immediately eliminated the subject
and the corresponding data from the sample. This raises the question of how to
detect minors and be able to discard their data. In 2012, Twitter added an age
screening program, but it only works in the context of a minor trying to follow a
brand intended for adults and registered in the program [101].
Chapter 5
Social Network Analysis
The main goal in this study is to develop a classification of egocentric networks
based on the number of language groups that compose them and the patterns
of connections between the groups. The types of egocentric networks constitute
theoretical constructs to understand the ways in which multilingual users of Twitter
are connecting language groups.
For that purpose, I visualized the 92 egocentric networks with the Gephi social
network analysis tool. The visualizations represent people as colored nodes and the
“follower of” relationship as edges. The colors indicate the single language a node
uses in Twitter, whether the node is bilingual, or whether it has no data available. The ego is taken
out of the picture to avoid obscuring the display with too many edges; all members
of the egocentric network are connected with the ego by definition. I chose the
layout “Force Atlas”, which is a force directed placement scheme developed by
Mathieu Jacomy in 2007 for Gephi [35]. The Force Atlas layout follows a placement
scheme similar to that of the commonly used Fruchterman-Reingold layout, where the
algorithm replicates a hypothetical physical system trying to minimize the energy,
balancing attraction between nodes connected by springs [38]. The force-directed
placement schemes are particularly useful in revealing network structure [10], such
as communities.
In summary, the visualizations convey structural information and language
information about the social network, by separating the social groups or communities
in the layout and distinguishing the language groups with colors.
As explained in section 4.1, this study follows an exploratory design with two
phases. First, I use a grounded theory approach to identify emergent types of ego-
centric networks focusing on the structural relationships of language groups. I use
grounded theory in the generic sense, to define theoretical constructs derived from
qualitative analysis of data, following the principles of the book by Corbin and
Strauss [17]. This approach consists of a sequence of coding stages, firstly establish-
ing some basic properties observed in the social networks, secondly extracting codes
from the visualizations as defined by their properties, and finally grouping codes
into categories according to shared properties.
In the second phase, I complemented the qualitative study of visualizations
with network statistics to provide a robust definition of network types. Also, I
propose an application of these types using machine learning for classification, which
also serves for testing the results. Finally, I discuss the findings in relation to the
theoretical framework and related work.
5.1 Qualitative Approach
As an initial step based on visual differences, I separated the egocentric networks
into three types: 25 monolingual or very small networks, 62 bilingual networks,
and 5 trilingual networks. Based on this initial classification, I established quanti-
tative thresholds that define these types:
• Bilingual networks: have at least 7 nodes using a second language (L2), and
the L2 group represents at least 2% of the graph nodes;
• Monolingual or very small networks: have fewer than 7 nodes using L2, or the
L2 group represents less than 2% of the graph nodes;
• Trilingual networks: have at least 7 nodes using a third language (L3), and
the L3 group represents at least 7% of the graph nodes.
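The thresholds in the list above can be expressed as a small decision function. This is a sketch under assumptions: the order in which the rules are checked (trilingual first) and the function and parameter names are mine, and `n_nodes` stands for whatever node count the percentages were computed against.

```python
def network_type(n_l2, n_l3, n_nodes):
    """Classify an egocentric network from the sizes of its L2 and L3 groups.

    n_l2, n_l3: number of nodes using the second and third language.
    n_nodes: number of nodes in the graph (assumed denominator).
    """
    if n_nodes <= 0:
        raise ValueError("the graph must contain at least one node")
    # Assumption: the stricter trilingual rule is tested before the bilingual one.
    if n_l3 >= 7 and n_l3 / n_nodes >= 0.07:
        return "trilingual"
    if n_l2 >= 7 and n_l2 / n_nodes >= 0.02:
        return "bilingual"
    return "monolingual or very small"


print(network_type(n_l2=40, n_l3=2, n_nodes=300))   # L2 is 13% of the graph
print(network_type(n_l2=10, n_l3=0, n_nodes=1000))  # L2 is only 1% of the graph
```

The second hypothetical network has enough L2 nodes in absolute terms but falls below the 2% share, so it is not treated as bilingual.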
A higher threshold for trilingual networks helps to overcome the problem of
noise in multilingual networks, where differentiating a third language among several
others sometimes becomes challenging. This issue is less accentuated in bilingual
networks, which will be the focus of the subsequent analysis. Before proceeding with
the analysis of bilingual networks, I provide some insights about trilingual networks
with three examples from the dataset.
The first example is the egocentric network of the user “Kepa”1 (fig. 5.1). The
Basque group on the upper side (dark blue) connects with the Spanish group (green
in the middle) and in turn, the Spanish group connects with the English group at the
bottom (pink). Basque is a minority language in Spain and a co-official language
in the Basque Country region, where Spanish is also an official language. This network
illustrates the interesting intersections and overlaps of language groups in the
context of a bilingual region, where English is taught as a language for international
1Reported user names are changed for privacy protection.
Figure 5.1: Basque group on the upper side (dark blue) connects with the Spanish group (green
in the middle) and in turn, the Spanish group connects with the English group at the bottom
(pink) and English-Spanish bilinguals (violet). Visualization made with Gephi.
communication. The Spanish-posting group seems to create a path of communica-
tion between English and the Basque community.
Figure 5.2: Catalan group in the center (dark blue) integrated in the Spanish group at the bottom
(light green), connects with the English group at the top (pink). Bilinguals of Catalan-Spanish
are represented in dark green, Catalan-English in light violet, and Spanish-English in dark violet.
The nodes in light yellow represent nodes with no data. Visualization made with Gephi.
The second example is the egocentric network of the user “Montse” (fig. 5.2).
The Catalan group in the central axis of the graph (dark blue) is completely in-
tegrated within the Spanish group on the lower side (light green), and there is a
smaller English group on the upper side (pink). The number of bilinguals of
Catalan and Spanish (darker green) is noteworthy, followed by those of Catalan and English
(light violet) and of Spanish and English (darker violet). Catalan is a minority
language in Spain and a co-official language in the Catalonia region, where Spanish is
also an official language. This network illustrates a different flavor of overlap between
language groups in the context of a bilingual region, where English is taught as a language
for international communication.
Figure 5.3: The Chinese group in the center (dark blue) connects through a few nodes with the
Japanese group on the upper side (green), and with the English group on the lower side (pink),
through some bilinguals (violet). The nodes in yellow represent nodes with no data. Visualization
made with Gephi.
Finally, the third example is the egocentric network of the user “Wei” (fig.
5.3). The Chinese group in the center (dark blue) connects through a few nodes
with the Japanese group on the upper side (green), and with the English group
on the lower side (pink). Some of the users connecting the groups either post in
English or both in English and Chinese. In this example, English seems to be playing
the role of an international lingua franca, connecting the Chinese-posting group with
other language groups.
I focused the social network analysis on the bilingual networks, to simplify
the categorization into types of intersections between two language groups. During the
initial coding, I created a list of properties observed in the visualizations concerning
the structural relationships between the language groups. These properties are
shown in table 5.1:
Properties
A) Degree of connection between language groups
   A1 few connections
   A2 tightly connected
B) Degree of integration of one language group inside another
   B1 separated
   B2 partial integration
   B3 complete integration
C) Relative size of one language group with respect to the other
   C1 similar size
   C2 very different size
Table 5.1: Properties of bilingual networks observed in the Gephi visualizations.
When combining the three types of properties, I deductively obtained 12 codes,
for instance: code 1 consisted of two language groups of similar size (C1), separated
(B1), and connected by a few nodes (A1); code 2 consisted of two language groups
of very different size (C2), separated (B1), and connected by a few nodes (A1);
code 9 consisted of two language groups of similar size (C1), tightly connected
(A2), and one language group has been partially penetrated by users of the other
(B2); code 12 consisted of two language groups of very different size (C2), the
small one completely integrated within the big one (A2, B3); and so on.
During the final iteration of the coding process, I observed some codes had no
instances in the dataset or very few. Those codes that had very few instances could
be grouped with codes of similar properties. For instance, regarding codes with a
high degree of integration of one language group inside another (B3), there are few
instances of language groups with similar size (C1), therefore I merged codes with
properties A2 and B3, regardless of the differences in group size (either C1 or C2).
In relation to codes with no instances, some properties presuppose others: B2
and B3 (some degree of integration of one language group inside another) require
a high degree of connection between the groups (A2). Consequently, certain
combinations of properties are not possible, and the corresponding codes were discarded.
• Code 1 (A1, B1, C1) with 12 networks;
• Code 2 (A1, B1, C2) and code 8 (A2, B1, C2) grouped together have 12
networks;
• Code 3 (A1, B2, C1), code 4 (A1, B2, C2), code 5 (A1, B3, C1), and code 6
(A1, B3, C2) have contradictory properties, because B2 and B3 require A2,
and there are no instances in the dataset;
• Code 7 (A2, B1, C1) with 12 networks;
• Code 9 (A2, B2, C1) and code 10 (A2, B2, C2) grouped together have 9
networks;
• Code 11 (A2, B3, C1) and code 12 (A2, B3, C2) grouped together have 17
networks.
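The numbering of the 12 codes and the contradictory combinations listed above can be reproduced with a short enumeration. This is only an illustrative sketch: the names `codes` and `is_contradictory` are mine, and the rule encoded is the one stated in the text (B2 and B3 require A2).

```python
from itertools import product

A = ["A1", "A2"]        # degree of connection
B = ["B1", "B2", "B3"]  # degree of integration
C = ["C1", "C2"]        # relative size

# Varying C fastest, then B, then A reproduces the code numbers in the text.
codes = {number: combo for number, combo in enumerate(product(A, B, C), start=1)}


def is_contradictory(combo):
    """Partial or complete integration (B2, B3) presupposes tight connection (A2)."""
    a, b, _ = combo
    return a == "A1" and b in ("B2", "B3")


for number, combo in codes.items():
    status = "contradictory" if is_contradictory(combo) else "possible"
    print(number, combo, status)
```

The enumeration marks codes 3 to 6 as contradictory, which matches the codes with no instances in the dataset.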
The resulting groups of codes constitute the five categories of bilingual net-
works obtained with a qualitative approach. Below, I define the categories of ego-
centric networks in relation to the patterns of intersection between language groups.
Figure 5.4 illustrates them with examples from the data. The names of the cat-
egories are metaphorical; here bridge is not used as the graph theory term. See
appendix A for all the visualizations and the categories they were assigned.
• Gatekeeper (Fig. 5.4.1): two language groups connected by a few nodes only,
with properties A1, B1, and C1 (12 networks);
• Language bridge (Fig. 5.4.2): two tightly connected language groups, but still
separated, with properties A2, B1, and C1 (12 networks);
• Peripheral language (Fig. 5.4.3): a dominant language group connected to a
small or non-cohesive language group, with properties A1 or A2, B1, and C2
(12 networks);
• Union (Fig. 5.4.4): two tightly connected language groups, where one lan-
guage group has been penetrated by the other, with properties A2, B2, and
C1 or C2 (9 networks);
• Integration (Fig. 5.4.5): one language group inside another with properties
A2, B3, and C1 or C2 (17 networks).
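The mapping from properties to the five categories defined above can be summarized as a lookup. The sketch below is an illustration of the definitions in the list, not a tool used in the analysis (the categories were assigned by qualitative coding of the visualizations); the function name `category` is hypothetical.

```python
def category(a, b, c):
    """Map a combination of properties (A, B, C) to a network category."""
    if a == "A1" and b in ("B2", "B3"):
        raise ValueError("B2 and B3 require A2; this combination is not possible")
    if b == "B3":
        return "integration"
    if b == "B2":
        return "union"
    # From here on the two language groups are separated (B1).
    if c == "C2":
        return "peripheral language"
    return "gatekeeper" if a == "A1" else "language bridge"


# Number of networks per category, as reported in the list above.
reported = {
    "gatekeeper": 12,
    "language bridge": 12,
    "peripheral language": 12,
    "union": 9,
    "integration": 17,
}
print(category("A2", "B1", "C1"), sum(reported.values()))
```

The reported counts add up to the 62 bilingual networks.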
[Figure 5.4 panels: Fig. 5.4.1, Fig. 5.4.2, Fig. 5.4.3, Fig. 5.4.4, Fig. 5.4.5]
Figure 5.4: Networks of 5 multilingual Twitter users exemplifying the types of egocentric net-
works. The nodes are their contacts and the edges represent the “follower of” relationship. Pink
nodes post in English and yellow/white is used for nodes with no data. Fig. 5.4.1 is the gatekeeper
type; there is a French group on the right side (green) loosely connected with an English group on
the left. Fig. 5.4.2 represents the language bridge type; in this network, the Japanese group on the
right side (green) is tightly connected with the English group on the left, and intermingled with
bilingual users (violet and dark green). Fig. 5.4.3 shows the peripheral language, Portuguese, on
the right side (green) of the dominant English group. Fig. 5.4.4 exemplifies the union type, where
the Greek group on the left (turquoise) is merging and mixing with the English group on the right,
and there are many bilinguals (violet and dark green). Fig. 5.4.5 illustrates the integration type;
the English group is inside the Arabic group (green). Visualizations made with Gephi.
The categories gatekeeper and language bridge present a continuum of in-
creasing connectivity between the two language groups, where extreme cases could
potentially belong to the other category. Similarly, the union and integration cate-
gories present a continuum of increasing penetration of one language group within
the other. The implication is that no statistic is going to divide these categories
cleanly. However, the network statistics helped to refine which networks were in
which categories in the extreme cases.
These different structures can potentially impact information diffusion [80]
across languages and nations. In the case of the gatekeeper type, and peripheral
language, cross-cultural awareness and information diffusion between the language
groups depend on a small number of users. If we look at the proportion of links
between the language groups, it seems that information will have higher chances of
crossing the language barrier in the case of the union and integration types.
5.2 Network Statistics
Similarly to how user types were defined by network structure in [112], I ex-
plored different network statistics to provide a robust definition of the types of
bilingual networks. The objective is to define a set of measures that, taken to-
gether, can differentiate each network type. Note that this analysis continues to
focus on the set of 62 bilingual networks.
First, I tried to convey the qualitative property of degree of connection between
language groups with the cross-language edge ratio (XLangR), as suggested by Prof.
Jennifer Golbeck. To compute this ratio, I used the total number of edges in the
graph (T ), except those linking to nodes with no data or a non-identifiable language,
and the number of edges linking two nodes of different language (t):
\[ \frac{t}{T} = \mathrm{XLangR} \tag{5.1} \]
Additionally, the ratio between inner edges in the L2 group and the edges
going out of the group could convey both the degree of connection of the L2 group
with the rest of the graph and the relative size of the group with respect to the
graph. Computing the L2 inner/crossing edge ratio (XL2R) requires: the number
of edges connecting two nodes of L2 (τL2), and the number of edges connecting an
L2 node with a node in a different language (tL2).
\[ \frac{\tau_{L2}}{t_{L2}} = \mathrm{XL2R} \tag{5.2} \]
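The two edge-based ratios (equations 5.1 and 5.2) can be computed from an edge list and a mapping from nodes to language labels, as in the sketch below. Several choices are assumptions on my part: labels are compared as whole strings (so a bilingual label such as “en+ja” counts as different from “en”), edges touching unlabeled nodes are excluded from both ratios, and the function name `edge_ratios` is hypothetical.

```python
def edge_ratios(edges, lang, l2, unknown="unknown"):
    """Return (XLangR, XL2R) for one egocentric network.

    edges: iterable of (source, target) pairs.
    lang: dict mapping each node to its language label.
    l2: label of the second language group.
    """
    total = cross = inner_l2 = crossing_l2 = 0
    for source, target in edges:
        ls = lang.get(source, unknown)
        lt = lang.get(target, unknown)
        if unknown in (ls, lt):
            continue  # excluded from T, as stated for equation 5.1
        total += 1
        if ls != lt:
            cross += 1          # contributes to t
        if ls == l2 and lt == l2:
            inner_l2 += 1       # contributes to tau_L2
        elif l2 in (ls, lt):
            crossing_l2 += 1    # contributes to t_L2
    xlangr = cross / total if total else None
    xl2r = inner_l2 / crossing_l2 if crossing_l2 else None
    return xlangr, xl2r


# Hypothetical toy network: two English nodes, three Japanese nodes, one unlabeled.
labels = {"a": "en", "b": "en", "c": "ja", "d": "ja", "e": "ja", "f": "unknown"}
toy_edges = [("a", "b"), ("a", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("b", "d")]
print(edge_ratios(toy_edges, labels, l2="ja"))
```

The function returns `None` for a ratio whose denominator is zero, leaving it to the caller to decide how to treat such networks.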
Another property that is related to the degree of connection and overlap between
the language groups is the bilingual ratio. After determining the two main
languages, L1 and L2, computing the bilingual ratio (BR) requires the number of
nodes in each group (n, m), and the number of bilinguals using both L1 and L2 (b):
\[ \frac{b}{n+m} = \mathrm{BR} \tag{5.3} \]
Finally, to account for the qualitative property of different size of the two main
language groups, I use the proportion of nodes in the L2 group (p(L2)) with respect
to the sum of nodes in L2 (n) and L1 (m):
\[ \frac{n}{n+m} = p(L2) \tag{5.4} \]
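The two node-based statistics (equations 5.3 and 5.4) can be sketched in the same spirit. The sketch assumes that n and m count only nodes labeled with a single language, that bilingual nodes carry labels of the form “xx+yy” in either order, and that nodes with other labels are ignored; the function name `node_ratios` is hypothetical.

```python
def node_ratios(labels, l1, l2):
    """Return (BR, p(L2)) from the language labels of the nodes in a network."""
    m = sum(1 for label in labels if label == l1)  # nodes in the L1 group
    n = sum(1 for label in labels if label == l2)  # nodes in the L2 group
    pair = {l1, l2}
    # Bilinguals using both L1 and L2, whatever the order of the codes.
    b = sum(1 for label in labels if "+" in label and set(label.split("+")) == pair)
    if n + m == 0:
        raise ValueError("no monolingual nodes in L1 or L2")
    return b / (n + m), n / (n + m)


# Hypothetical network: 30 English, 10 Japanese, 4 bilinguals, 8 other nodes.
toy_labels = ["en"] * 30 + ["ja"] * 10 + ["en+ja"] * 3 + ["ja+en"] + ["unknown"] * 6 + ["fr"] * 2
print(node_ratios(toy_labels, l1="en", l2="ja"))
```

With these hypothetical counts the bilingual ratio is 4/40 and the L2 proportion is 10/40.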
As explained in section 5.1, the network categories present a continuum of
increasing connectivity and overlap between two language groups, where extreme
cases could potentially belong to another category. Even though no statistic is
going to divide the categories cleanly, the figures below show how the five categories
can be regrouped into three main types that are differentiated by the statistics.
The categories gatekeeper and language bridge present a continuum of increas-
ing connectivity between the two language groups, but are different from the other
types because the L2 inner/crossing edge ratio is higher, which implies more con-
nections within the same language group than across language groups (figure 5.5).
Also, the L2 proportion differentiates the gatekeeper-bridge type from the peripheral type,
because in the former the two language groups tend to be of similar size, whereas a marked
difference in the sizes of the language groups is a defining property of the peripheral type (figure 5.6).
Similarly, the union and integration categories present a continuum of increas-
ing penetration of one language group within the other. The box plots in figure 5.7,
representing the cross-language edge ratio, show that the integration and union types
have higher ratios and are clearly differentiated from the other types. This pattern is
consistent with the bilingual ratio (figure 5.8), which reinforces the differentiation between
integrated (union and integration) and separated (gatekeeper, language bridge, and
peripheral language) types.
[Box plots. x-axis: Categories (gatekeeper, bridge, integration, peripheral, union); y-axis: L2 inner/crossing edge ratio (0 to 4).]
Figure 5.5: L2 inner/crossing edge ratio for five categories (left) and for three categories (right).
This statistic differentiates the gatekeeper-bridge type.
[Box plots. x-axis: Categories (gatekeeper, bridge, integration, peripheral, union); y-axis: L2 proportion (0.05 to 0.30).]
Figure 5.6: L2 group proportion for five categories (left) and for three categories (right). This
statistic differentiates the peripheral language type.
[Box plots. x-axis: Categories (gatekeeper, bridge, integration, peripheral, union); y-axis: Cross-lang edge ratio (0.2 to 0.6).]
Figure 5.7: Cross-language edge ratio for five categories (left) and for three categories (right).
This statistic differentiates the integration and union type.
[Box plots. x-axis: Categories (gatekeeper, bridge, integration, peripheral, union); y-axis: Bilingual ratio (0.0 to 0.8).]
Figure 5.8: Bilingual ratio for five categories (left) and for three categories (right). This statistic
differentiates the integration and union type.
In summary, from a quantitative approach three main types of intersection
between two language groups in the social network can be defined:
• Gatekeeper-Language bridge: defined by a high L2 inner/crossing edge ratio,
more connections within the same language group than across language groups,
and language groups of similar size (24 networks);
• Integration and union: defined by high cross-language edge and bilingual ratios
(26 networks);
• Peripheral language: the L2 group accounts for a small proportion of the
graph, and it does not meet the defining properties of integration and union
types (12 networks).
Following the reasoning in section 5.1, the cross-language edge ratio and the
bilingual ratio can reflect the potential for information dissemination across language
borders. If we are able to classify the types of intersection between language groups
in a set of egocentric networks, we might be able to predict which networks have
more potential for cross-lingual linking, translation, and cross-cultural awareness.
However, the relationship between the types and the spread of information across
languages requires further investigation and falls outside the scope of this work.
5.3 Application of Categories
One potential application of the categories, particularly the three types that
are differentiated more clearly with the statistics, is the classification of bilingual
egocentric networks. If the linkage between the types and the potential for cross-
language information dissemination is supported by further research, this classifi-
cation could be fundamental in detecting nodes and their egocentric networks that
can spread information across language and national borders more effectively.
In this section, I test the results of the social network analysis with a clas-
sification model using machine learning. I trained the classification model using
support vector machines (sequential minimal optimization implementation, SMO) and the
dataset of 62 bilingual networks divided into three types. This dataset included
the attributes type, L1, L2, cross-language edge ratio, L2 inner/crossing edge ratio,
bilingual ratio, and proportion of L2. I used the Weka (Waikato Environment for
Knowledge Analysis) free software for machine learning, developed at the University
of Waikato, New Zealand.
Figure 5.9 shows the confusion matrix: all 26 networks of the integration
type were classified correctly; 19 of 24 gatekeeper-bridge networks were classified
correctly, while only half of the networks of peripheral language type were classified
correctly. Overall, 51 of the 62 networks were classified correctly and 11 incorrectly.
The weighted average F-measure is 0.812, and it is particularly high for the
integration type (0.881). These results show promising potential for prediction,
even with this small dataset. As observed, these statistics enable differentiation
between the types of bilingual networks.
=== Summary ===
Correctly Classified Instances          51               82.2581 %
Incorrectly Classified Instances        11               17.7419 %
Kappa statistic                          0.7127
Mean absolute error                      0.2688
Root mean squared error                  0.3474
Relative absolute error                 63.0008 %
Root relative squared error             75.1714 %
Coverage of cases (0.95 level)          96.7742 %
Mean rel. region size (0.95 level)      66.6667 %
Total Number of Instances               62     
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure Class
                 0.792    0.079    0.864      0.792   0.826 gatekeeper bridge
                 1        0.194    0.788      1       0.881 integration
                 0.5      0.02     0.857      0.5     0.632 peripheral
Weighted Avg.    0.823    0.116    0.831      0.823   0.812
=== Confusion Matrix ===
  a  b  c   <-- classified as
 19  4  1 |  a = gatekeeper bridge
  0 26  0 |  b = integration
  3  3  6 |  c = peripheral
Figure 5.9: Classification results using 10-fold cross-validation for the SVM model. This model
was trained using the dataset of 62 bilingual networks with attributes type, L1, L2, cross-language
edge ratio, L2 inner/crossing edge ratio, bilingual ratio, and proportion of L2.
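The per-class statistics in Figure 5.9 follow directly from the confusion matrix. The sketch below is written in Python and is not part of the original Weka workflow. It recomputes accuracy, per-class precision, recall and F-measure, the weighted F-measure, and Cohen's kappa from the nine cell counts. All function and variable names are illustrative assumptions.

```python
# Confusion matrix from Figure 5.9: rows are actual classes, columns are predicted.
LABELS = ["gatekeeper bridge", "integration", "peripheral"]
MATRIX = [
    [19, 4, 1],
    [0, 26, 0],
    [3, 3, 6],
]


def class_metrics(matrix, k):
    """Return (precision, recall, f_measure) for the class with index k."""
    tp = matrix[k][k]
    actual = sum(matrix[k])                    # row total: true members of class k
    predicted = sum(row[k] for row in matrix)  # column total: predicted as class k
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def accuracy(matrix):
    """Share of instances on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def weighted_f(matrix):
    """F-measure averaged over classes, weighted by the true class sizes."""
    total = sum(sum(row) for row in matrix)
    return sum(sum(matrix[k]) * class_metrics(matrix, k)[2]
               for k in range(len(matrix))) / total


def kappa(matrix):
    """Cohen's kappa: agreement beyond what the margins alone would produce."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    expected = sum(sum(matrix[k]) * sum(row[k] for row in matrix)
                   for k in range(len(matrix))) / total ** 2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    for k, label in enumerate(LABELS):
        p, r, f = class_metrics(MATRIX, k)
        print(f"{label:18s} precision={p:.3f} recall={r:.3f} F={f:.3f}")
    print(f"accuracy={accuracy(MATRIX):.4f} weighted F={weighted_f(MATRIX):.3f} "
          f"kappa={kappa(MATRIX):.4f}")
```

Recomputing the summary from the matrix is a quick consistency check on the reported values; it does not reproduce the SVM training or the cross-validation itself.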
5.4 Discussion
According to the Global Language System theory, polyglots provide cohesion
to the system [25]. In section 2.3, I explained that the cohesion of a social graph
depends on the edges that prevent the entire graph from breaking into isolated
components [38]. In other words, multilingual users might be preventing the social graph
of Twitter from breaking into isolated language groups, or “language bubbles” [46],
where information is concealed and similar views reinforced. As motivated in sec-
tion 1.2, instead of promoting isolated communities, social media sites should seek
to expose their users to the unexpected [119] and foster cross-cultural awareness.
This social network analysis reveals how multilingual users stand between
language groups. Reusing the concept of “language bridges” applied by Etling
et al. [32] to the blogosphere, multilingual users form part of a language
bridge between communities to varying degrees. These varying degrees are
presented in this chapter as a continuum of increasing connections between the language
groups and a continuum of increasing penetration of one language group within the
structure of the other. The classification of egocentric networks, or intersections of
language groups, could serve to distinguish those egos who might be playing a role
as gatekeepers [82] or language brokers [55]; it also reveals that not all multilingual
users are necessarily in such a position. For instance, in the union and integration
types, many users besides the ego itself are connecting both language groups.
As a result of this analysis, other questions arise: what are the profiles and
social contexts of these multilingual users, and how do they relate to the type of network?
For instance, does the integration type reflect a minority or immigrant community
in a country? Do small English-posting groups integrated in a non-English group
reflect an elite in a country? An example related to the latter question can be
found in section 3.1, where I reviewed a study on email and Internet chat in Egypt
documenting the use of English by a professional elite [108].
The relationship between these types of egocentric networks and their potential
impact on information flows remains an open question. It seems reasonable to
hypothesize in future research that certain types, like the union and integration
categories, might favor cross-lingual linking and dissemination, while other types,
like the gatekeeper category, might be interesting for those seeking purposeful
concealment of information.
Methodologically, social network analysis enables going beyond survey
information about multilingualism, like the large-scale survey The Twitter of Babel [83],
and facilitates a deeper understanding of the structural relations between
language communities, potentially shedding light on the dynamics of international
communication. In this respect, the present study takes a similar approach to
Language Networks on LiveJournal [53], but enhances the descriptive analysis with the
creation and definition of theoretical constructs: the types of intersections between
language groups in egocentric networks. Also, this study conceives the egocentric
network as a language ecology in which the ego is immersed at the micro-scale level,
and which influences the ego’s communication strategy and language choices. This is
relevant to the next chapter on social network factors for language choice.
Chapter 6
Factor Analysis
The main goal of this study is to explore how the social network influences the
language choices of the multilingual Twitter user. In particular, I tested if we can
model the number of times this person (the ego) chooses one language over the other
using some characteristics of the egocentric network as predictors. The dependent
variables considered are the frequency of English use and non-English language use
within the 50 posts of the ego. The frequencies represent the language choices of
the multilingual user.
As explained in section 2.5, I treat a change of language from one post to the
next as inter-sentential code-switching, while bilingual posts would be
cases of intra-sentential code-switching. In this study, I only take into account
inter-sentential code-switching and each post represents one interaction. Every post
was assigned a language label by the automatic language identification tool and,
subsequently, frequency counts of posts in each language were calculated for every
ego. Finally, the two most frequent languages of a user were selected to represent
his or her options for language choice. English was always one of them due to the
sampling conditions.
The factors (independent variables or predictors) are the proportion of
English and non-English language users in the social network, and the degree of
multilingualism of the social network. The relative importance of the factors, or their
weight, is represented by the coefficients obtained by fitting two different generalized
linear models to the dataset: linear regression and logistic regression. The main
hypotheses are: higher proportions of English users in the network will be a good
predictor of more frequent English use by the ego; inversely, higher proportions of
non-English language users in the network will be a good predictor of more frequent
use of a non-English language by the ego; and the multilingualism of the network
will also be a predictor of English use, reflecting its role as a lingua franca.
6.1 Operationalization of Variables
The language choices of the multilingual user are defined by two dependent
variables: (1) the number of English posts within the 50 (or fewer) posts extracted
for each multilingual ego, and (2) the number of posts in the other language, called L2.
The dependent variables reflect the aggregation of posts at the user level, not individual
posts. In other words, the models do not consider whether one particular post is written
in English or L2, but how much or how little English a person will tend to use in
interactions via Twitter. The factors considered are:
• proportion of English users in the network, computed as the number of
users labelled as English users divided by the total number of nodes
in the network;
• proportion of users of the most frequent non-English language in the network
(L2), computed as the number of users labelled as L2 users divided
by the total number of nodes in the network;
• the multilingual index of the network, which represents the degree of
multilingualism of the social network.
As suggested by Prof. Jordan Boyd-Graber, the multilingual index can be
operationalized as the entropy of a multinomial distribution (formula 6.1). This
idea borrows the concept of entropy from Information Theory [94].
In this context, the entropy can be interpreted as the unpredictability of the
language used in the network. An entropy close to 0 means that most people in
the network are writing in one language, hence the language of the network is more
predictable. The more people in the network using different languages, the higher
the entropy, reflecting the uncertainty about the language of the network. Unlike
a simple count of the languages present, the entropy accounts for the weight each
language has in the network, in the form of probabilities.
\[ H = -\sum_{i=1}^{n} p_i \log(p_i) \tag{6.1} \]
Equation 6.1 for calculating the multilingual index of a social network is
borrowed from Shannon’s entropy theorem [94]. In this dissertation study, $n$ is the
number of languages in the network and $p_i$ is the number of nodes in language $i$
divided by the total number of nodes.
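As an illustration of equation 6.1, the following Python sketch computes the multilingual index from a list of per-node language labels. The function name and the input format are assumptions made for this example, and the natural logarithm is also an assumption, since the text does not state which base was used.

```python
import math
from collections import Counter


def multilingual_index(language_labels):
    """Shannon entropy of the language distribution over the nodes of a network.

    language_labels holds one language code per node (hypothetical input format).
    The natural logarithm is an assumption; the source does not state the base.
    """
    counts = Counter(language_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the network has no labelled nodes")
    # p * log(1 / p) equals -p * log(p) and avoids returning a negative zero.
    return sum((c / total) * math.log(total / c) for c in counts.values())


if __name__ == "__main__":
    monolingual = ["en"] * 10
    balanced = ["en"] * 5 + ["es"] * 5
    skewed = ["en"] * 9 + ["es"]
    for name, labels in [("monolingual", monolingual),
                         ("balanced", balanced),
                         ("skewed", skewed)]:
        print(name, round(multilingual_index(labels), 4))
```

A network where everyone writes in one language yields an index of 0, and the index grows as nodes spread more evenly over more languages, which matches the interpretation given above.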
Ego.en.use  Ego.L2.use  N.of.posts  entropy     net.en.use  net.L2.use
28          19          50          0.79230492  0.51793722  0.45515695
11          38          50          0.67065588  0.63473054  0.35329341
35          15          50          0.42538128  0.84931507  0.10958904
18          6           25          0.69040118  0.47368421  0.52631579
15          35          50          0.57881166  0.7202381   0.26785714
Figure 6.1: Input data file for factor analysis with a reduced number of lines for illustration
purposes. The columns represent, from left to right, dependent variables English use by the ego
and L2 use by the ego, number of available posts for the ego, and factors multilingual index or
entropy, proportion of English users and proportion of L2 users in the egocentric network.
6.2 Regression Models and Analysis
In this study, I used two different generalized linear regression models to build a
probabilistic model that relates a dependent variable y to more than one independent
or predictor variable [29].
Formula 6.2 is the multiple linear regression equation for three predictor
variables, $x_1$, $x_2$, $x_3$: proportion of English users, proportion of L2 users, and
multilingual index of the network.
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \tag{6.2} \]
In formula 6.2, $\beta_1$, $\beta_2$, $\beta_3$ are the regression coefficients. $\beta_1$ is
interpreted as the expected change in $y$ associated with a 1-unit increase in $x_1$, while
$x_2$ and $x_3$ are held fixed [29]. Analogous interpretations hold for $\beta_2$ and
$\beta_3$. The intercept of the fitted line is $\beta_0$, which is the predicted value of $y$
when all factors have value 0 [29].
I used the linear regression model for two dependent variables $y_{en}$ and $y_{l2}$,
which are operationalized as the normalized count of posts written in English by
the ego ($y_{en}$) and the normalized count of posts written in L2 by the ego ($y_{l2}$).
Given that not all egos have 50 posts available, the normalization consists of dividing a
particular count by the total number of posts available for the ego. However, the
normalized counts are proportions between 0 and 1, whereas the predictions of a
linear regression model are unbounded and can fall outside that interval. For this reason,
the linear regression model might not be the best option for this dataset.
Alternatively, logistic regression can be used to get probability scores (between
0 and 1) as the predicted values of the dependent variable $y$ [79]. Formula 6.3 is
the logit transformation that relates the probability $y$ to the linear combination
of the three predictor variables $x_1$, $x_2$, $x_3$ [79].
\[ \log\frac{y}{1 - y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \tag{6.3} \]
As with the previous model, I used the logistic regression model for two
dependent variables $y_{en}$ and $y_{l2}$, which are operationalized as pairs of counts:
($y_{en}$, $N - y_{en}$) and ($y_{l2}$, $N - y_{l2}$), where $N$ is the total number of
posts available for a user.
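The two operationalizations can be made concrete with a small Python sketch, which is not the R code used in this study. It builds the (successes, failures) pair for one ego and implements the logit of formula 6.3 together with its inverse. The function names are assumptions for illustration, and the counts in the usage lines come from the first and fourth rows of Figure 6.1.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1 (left side of formula 6.3)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Map a linear predictor back to a probability between 0 and 1."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)  # this branch avoids overflow for very negative eta
    return e / (1.0 + e)


def binomial_response(count, n_posts):
    """Pair (posts in the language, posts not in the language) for one ego."""
    if not 0 <= count <= n_posts:
        raise ValueError("count must be between 0 and n_posts")
    return count, n_posts - count


if __name__ == "__main__":
    # First row of Figure 6.1: 28 English posts out of 50 available.
    print(binomial_response(28, 50), 28 / 50)
    # Fourth row of Figure 6.1: 18 English posts out of 25 available.
    print(binomial_response(18, 25), 18 / 25)
    # The logit and its inverse undo each other.
    print(inverse_logit(logit(0.56)))
```

Keeping the pair of counts, rather than only the normalized proportion, lets the model give more weight to egos with more available posts.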
I used the R language for the statistical analysis. R is an open-source programming
language and software environment for statistical computing. As a result of
fitting these generalized linear models, R outputs the regression coefficients for the
independent variables or factors, including the intercept; the sign of each coefficient
indicates a positive or negative correlation. In addition, R provides the p-value for
each of the regression coefficients.
Firstly, I used the linear regression function in R to model the use of English
by the ego (model 6.4) and the use of L2 by the ego (model 6.5).
    lm(Ego.en.use / N.of.posts ~ entropy + net.en.use + net.L2.use)    (6.4)
    lm(Ego.L2.use / N.of.posts ~ entropy + net.en.use + net.L2.use)    (6.5)
Secondly, I used the logistic regression function in R to model the use of English
by the ego (model 6.6) and the use of L2 by the ego (model 6.7).
    glm(response ~ entropy + net.en.use + net.L2.use, family = binomial("logit"))    (6.6)
    glm(responseL2 ~ entropy + net.en.use + net.L2.use, family = binomial("logit"))    (6.7)
In the logistic regression model, the dependent variables are operationalized as
pairs of counts:
    response <- cbind(Ego.en.use, N.of.posts - Ego.en.use)
    responseL2 <- cbind(Ego.L2.use, N.of.posts - Ego.L2.use)
Predictors of Ego.en.use   Estimate   Std. Error   p-value
(Intercept)                0.0002     0.554        1.000
entropy                    0.0414     0.160        0.797
net.en.use                 0.796      0.502        0.117
net.L2.use                 0.075      0.454        0.868
Table 6.1: Linear regression coefficients for modeling the use of English by the ego. None of the
coefficients are statistically significant. The proportion of English users in the network is the most
important predictor of English use by the ego.
6.3 Results
The results of the linear regression model in table 6.1 do not provide statisti-
cally significant coefficients for predictors of English use by the ego. I established
the level of significance for the coefficients at a p-value of 0.05. The proportion of
English users in the network has the greatest coefficient, indicating this factor is
more important for predicting the use of English by the ego. Both the proportion
of English users in the network and the multilingual index have a positive correlation
with the use of English by the ego, as stated in the hypothesis.
Similarly, the results of the linear regression model in table 6.2 do not provide
statistically significant coefficients for predictors of L2 use by the ego. The
proportion of English users in the network is the factor with the greatest coefficient in
absolute value, but is negatively correlated with L2 use by the ego. The proportion
of L2 users in the network is positively correlated with L2 use, as stated in the
hypothesis. However, in disagreement with the hypothesis, the entropy is positively
correlated with L2 use. In summary, the proportion of English users and the
proportion of L2 users in the network are better predictors of L2 use by the ego than
the entropy.
Predictors of Ego.L2.use   Estimate   Std. Error   p-value
(Intercept)                0.482      0.548        0.382
entropy                    0.0335     0.159        0.833
net.en.use                 -0.354     0.497        0.479
net.L2.use                 0.263      0.449        0.559
Table 6.2: Linear regression coefficients for modeling the use of L2 by the ego. None of the
coefficients are statistically significant. The factor proportion of L2 users in the network has a
positive correlation with the use of L2 by the ego, whereas the factor proportion of English users
has a negative correlation.
The results of the logistic regression model in table 6.3 include the coefficients
for predictors of English use by the ego. The proportion of English users in the
network is a statistically significant predictor. As in the previous model, the
proportion of English users in the network has the greatest coefficient, indicating
this is the best predictor of English use by the ego. In this model, the signs of all
the coefficients agree with the hypothesis: both the proportion of English users in
the network and the multilingual index correlate positively with the use of English
by the ego, while the proportion of L2 users in the network correlates negatively.
Predictors of Ego.en.use   Estimate   Std. Error   p-value
(Intercept)                -1.718     0.918        0.061
entropy                    0.114      0.265        0.668
net.en.use                 2.981      0.832        0.0003 *
net.L2.use                 -0.086     0.739        0.907
Table 6.3: Logistic regression coefficients for modeling the use of English by the ego. The
proportion of English users in the network is a statistically significant predictor and the most
important for predicting English use by the ego.
Predictors of Ego.L2.use   Estimate   Std. Error   p-value
(Intercept)                -0.563     0.899        0.531
entropy                    0.302      0.251        0.245
net.en.use                 -1.109     0.801        0.170
net.L2.use                 1.551      0.737        0.035 *
Table 6.4: Logistic regression coefficients for modeling the use of L2 by the ego. The proportion
of L2 users in the network is a statistically significant predictor. Both the proportion of L2 users
and English users in the network are important predictors of L2 use by the ego.
Finally, the results of the logistic regression model in table 6.4 include the
coefficients for predictors of L2 use by the ego. Using this model, the proportion
of L2 users in the network is a statistically significant predictor at the 0.05
level of significance. Also, the proportion of L2 users in the network
has the greatest coefficient, indicating this is the most important predictor of L2
use by the ego. The proportion of English users in the network correlates negatively
with the use of L2 by the ego, and the proportion of L2 users in the network
correlates positively, as stated in the hypothesis. However, in disagreement with the
hypothesis, the entropy correlates positively with the use of L2 by the ego.
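The p-values in tables 6.3 and 6.4 can be approximately recovered from the estimates and standard errors with a Wald z-test, which is the test R's summary output reports for binomial models. The Python sketch below assumes that this is how the tabulated p-values were obtained. The function name is illustrative, and small differences from the tables are expected because the inputs are rounded.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for a coefficient under a standard normal reference."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # (estimate, standard error) pairs copied from tables 6.3 and 6.4.
    rows = {
        "table 6.3 (Intercept)": (-1.718, 0.918),
        "table 6.3 net.en.use": (2.981, 0.832),
        "table 6.4 net.L2.use": (1.551, 0.737),
    }
    for name, (est, se) in rows.items():
        print(f"{name}: z={est / se:.3f} p={wald_p_value(est, se):.4f}")
```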
6.4 Discussion
In conclusion, the two generalized linear regression models consistently show
that the proportion of English users in the network constitutes a key influencing
factor in the frequency of English use by the multilingual individual, as stated in
the hypothesis. This result was statistically significant in the logistic regression
model. Also, the coefficient of this factor was the greatest in the two models.
Similarly, the proportion of L2 users in the network is a very important factor
influencing the frequency of L2 use by the multilingual person, in agreement with
the hypothesis. This result was statistically significant in the logistic regression
model. Also, the coefficient of this factor was the greatest in the logistic regression
model of L2 use. Interestingly, the factor proportion of English use in the network
is also an important predictor for L2 use by the ego, but is negatively correlated.
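To give a sense of the size of this effect, the Python sketch below plugs the coefficients of table 6.3 into the inverse of formula 6.3 and returns the predicted share of English posts for a given network composition. It is an illustration under the assumption that the rounded coefficients in the table are adequate for this purpose. The function name, dictionary keys, and the example input values in the contrast are hypothetical.

```python
import math

# Coefficients copied from table 6.3 (logistic model of English use by the ego).
COEF_EN = {
    "intercept": -1.718,
    "entropy": 0.114,
    "net_en_use": 2.981,
    "net_l2_use": -0.086,
}


def predicted_share(coef, entropy, net_en_use, net_l2_use):
    """Predicted probability that a post by the ego is in the modeled language."""
    eta = (coef["intercept"]
           + coef["entropy"] * entropy
           + coef["net_en_use"] * net_en_use
           + coef["net_l2_use"] * net_l2_use)
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    # First row of Figure 6.1; the observed share of English posts there is 28/50.
    print(predicted_share(COEF_EN, 0.79230492, 0.51793722, 0.45515695))
    # Hypothetical contrast: same entropy, mostly-L2 versus mostly-English network.
    print(predicted_share(COEF_EN, 0.5, 0.2, 0.8))
    print(predicted_share(COEF_EN, 0.5, 0.8, 0.2))
```

Varying only the proportion of English users while holding the other inputs fixed shows how strongly the large positive coefficient moves the predicted share.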
Regarding the multilingual index (or entropy), the results are inconclusive
about the relation to the language choice of multilingual users. The hypothesis that
the entropy could be a good predictor of English use is not confirmed. A future
study could explore further the question of English being used as a lingua franca by
focusing on multilingual egocentric networks with no monolingual users of English.
Controlling this variable can eliminate the confounding influence of users writing in
English only.
In essence, these results suggest that the multilingual Twitter users perceive
the language composition of their egocentric network and interact accordingly. Or,
on the contrary, the language choices of multilingual users might attract followers
of a specific language profile. Most probably, the relation goes both ways, in a self-
feeding cycle. Social networks evolve over time and users may adapt their language
choices in a dynamic relationship with their egocentric network. As Marwick and
boyd theorized: “[...] identity on Twitter is constructed through conversations with
others. Tweets are formulated based partially on a social context constructed from
the tweets of people one follows” [81](11).
Other factors that I initially tested in this analysis were the most frequently
used non-English language of the network and the type of network. However, these
factors posed specific challenges. In the 92 networks, there were a total of 16 lan-
guages that appeared as the most frequently used non-English language. As a
consequence, the data were too sparse for any one language to be operationalized as
a factor. The type of network structure resulting from the social network analysis
was challenging to use as a factor because there are no clear-cut divisions between
types, only stages in a continuum. In future work, it will be interesting to study
users’ awareness or intuition about their social network type and how this might
affect their language choices.
There are other factors influencing language choice that do not fall under
the scope of this work: cultural and linguistic context in a particular region, the
perceived availability of online resources in a language, social context, language
competence, geographic location of the ego, time zone, etc.
Chapter 7
Exploring Textual Features
In this chapter, I shift attention from the social network to the content of the
posts written by the egos. There is a convention in Twitter for addressing a message
to a particular user or referencing a person, the “mentions”, which consists of an
@ sign and a username [81]. Honeycutt and Herring [54] studied the @ sign as a
marker of addressivity in Twitter; they found that more than 90% of messages with
the @ sign were addressed to a user, 5% were referencing a person, and the rest
were indicating location or other functions. Surprisingly, they did not comment on
the location of the mention within the message, a key factor for identifying the posts
intended for a specific user: mentions at the beginning of a post are typically used
to reply to someone’s message.
In the first part of this study, I look at the textual feature of the @ sign at the
beginning of a post as an indicator of addressivity, in particular, to distinguish the
posts that are replies to an individual. In other words, this indicator can be used to
differentiate the type of exchange: sending a public post, including repostings with
a comment, and replying to an individual. The objective is to test the hypothesis
that the type of exchange is a factor that affects language choice.
The second part of this study takes a qualitative approach to detect the themes
that might help in creating cross-cultural awareness, where the multilingual users
might be trying to reach an international audience, acting as mediators from the
point of view of their messages. I identify themes related to non-English speaking
countries or communities in English posts and, also, I identify English hashtags
(keywords preceded by the # sign) inserted in non-English posts. Using a generic
theme analysis, this study serves as an exploratory qualitative phase to inform the
design of future studies after this dissertation work.
7.1 Description of the Data
The dataset used in this study was called “ego dataset” in chapter 4 and in-
cludes the last 50 posts (or fewer) of the main 92 subjects, who are multilingual
users. In total, the dataset contains 4,423 Twitter posts, associated with their
respective authors. Note that these posts are never automatic repostings (using a
“retweet” button), due to the requirements during data collection, as explained in
chapter 4. The majority of posts have a language label, obtained with automatic
language identification. In some cases, there was no language label because the
post contained only symbols or URLs, and those were removed before proceeding
to the automatic identification of the language. The precise number of posts with a
language label is 4,374.
In preparation for this analysis, I revised the language labels with two
objectives:
1. eliminating from the frequency counts of English the automatic posts sent by
applications, for example, “Posted a picture on Facebook”, “liked a photo on
Facebook”, “favorited a Youtube video”, “I am at something @ name of place”
(foursquare);
2. identifying bilingual posts with the appropriate label. I used the criteria de-
scribed in section 4.4.2 to classify a post as bilingual, in particular, the post
has to meet one of these conditions:
– one word is in a second language in a post with fewer than five words;
– two words are in a second language in a post that has between five and
ten words (inclusive), except if those two words are a named entity;
– at least three words are in a second language in a post with more than
ten words, except when they are a named entity.
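The length-dependent conditions for labelling a post as bilingual can be sketched as a small Python function. This is an illustration rather than the procedure actually used for the revision of labels. It assumes that each condition is a minimum number of second-language words, and that the caller has already excluded words forming a named entity, since named-entity detection is outside the sketch. The function and parameter names are hypothetical.

```python
def is_bilingual(n_words, n_second_language_words):
    """Apply the length-dependent thresholds for labelling a post as bilingual.

    n_second_language_words is assumed to exclude words that belong to a
    named entity; treating each threshold as a minimum is also an assumption.
    """
    if n_words < 5:
        threshold = 1   # fewer than five words: one second-language word
    elif n_words <= 10:
        threshold = 2   # five to ten words inclusive: two words
    else:
        threshold = 3   # more than ten words: at least three words
    return n_second_language_words >= threshold


if __name__ == "__main__":
    print(is_bilingual(4, 1))    # short post with one second-language word
    print(is_bilingual(8, 1))    # mid-length post with only one such word
    print(is_bilingual(15, 3))   # long post with three such words
```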
Also, the posts were classified into three types of exchange: public post (ToAll),
reposting with a comment (RT), or reply to an individual (reply). I designed a
simple algorithm for the automatic classification of the posts, using regular
expressions; for example, a post that starts with the @ sign followed by a username is
a reply to that person.
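A classifier of this kind might look like the following Python sketch. The regular expressions are assumptions written for this example and are not the ones used in the study; in particular, the treatment of a standalone “RT” or “rt” token as the marker of a reposting with a comment, and the precedence of the reply rule over the RT rule, are guesses. The sample posts are invented.

```python
import re

# A reply starts with "@" immediately followed by a username (assumed pattern).
REPLY_RE = re.compile(r"^\s*@\w+")
# A standalone "RT"/"rt" token, not part of a word, hashtag or username (assumed pattern).
RT_RE = re.compile(r"(?<![\w@#])rt(?!\w)", re.IGNORECASE)


def exchange_type(post):
    """Classify a post as 'reply', 'RT' or 'ToAll'."""
    if REPLY_RE.match(post):
        return "reply"   # assumption: the reply rule takes precedence
    if RT_RE.search(post):
        return "RT"
    return "ToAll"


if __name__ == "__main__":
    samples = [
        "@maria gracias!",
        "So true! RT @bob: language matters",
        "Off to Geneva for a show",
        "Meet me @ the station",
    ]
    for post in samples:
        print(exchange_type(post), "|", post)
```

Anchoring the reply pattern at the start of the post is what separates replies from posts that merely mention a user or use the @ sign to indicate a location.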
Initially, I asked whether I would find more bilingual posts among repostings
with a comment, thinking of potential translations, which triggered
the revision of language labels and posts to identify them. Also, this initial question
justified the differentiation between general public posts and repostings with a
comment, using the convention of “RT” or “rt” [66]. However, the resulting number of
bilingual posts in the ego dataset was very low, 37, and none of them were repostings
with a comment. For this reason, I discarded any hypotheses related to bilingual
posts.
To sum up, the dataset used has 4,374 posts, all of which have a language
category (English, other), and a category of exchange type (ToAll, RT, reply).
7.2 Hypothesis Testing: Fisher’s Exact Test
In this section, I propose to test the following hypothesis: English is used more
when addressing a post to the public in general (ToAll and RT) than when replying
to individuals. This hypothesis is based on the empirical observations of previous
research studies on email and mailing lists [31, 65], focusing on the impact on
language choice of addressing a message to one person or to a multilingual audience;
in the latter case English was preferred for its function as a lingua franca.
If a is the number of replies in English, and b is the number of ToAll and RT
posts in English, the null hypothesis can be stated as:
\[ H_0 : a \geq b \tag{7.1} \]
And the alternative hypothesis is:
\[ H_a : a < b \tag{7.2} \]
I set the commonly accepted value of p = 0.05 as the level of significance
associated with the null hypothesis.
Honeycutt and Herring [54] estimated that roughly 30% of all Twitter posts
contained an @ sign regardless of language. However, they recognized that English
was by far the most frequent language in their sample. More recently, Weerkamp
et al. [110] looked specifically at the different proportion of replies depending on
the language in Twitter, which varies from 36% of posts being replies in Dutch and
34% in Spanish, to 25% of replies in English, and 13% of replies both in Portuguese
and Indonesian. Similarly, in the ego dataset the number of replies to specific users
is lower in general than posts addressed to a wider audience. In particular, 22% of
English posts are replies, while 35% of posts in other languages are replies.
Given the lower number of replies versus public posts in this dataset, I
normalize the counts. Therefore, I compare the proportion of replies that are written
in English, $a/(a+c)$, with the proportion of ToAll and RT posts written in English,
$b/(b+d)$, where $c$ is the number of replies in other languages, and $d$ is the number
of ToAll and RT posts in other languages.
The results suggest that the null hypothesis is untrue: the proportion
of replies in English is $a/(a+c) = 0.3810$, while the proportion of public posts written
in English is $b/(b+d) = 0.5368$. To reject the null hypothesis, I have to apply a
statistical test of significance. Since I am comparing the categories of the posts, the
most appropriate non-parametric test is Fisher’s Exact Test [36].
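These proportions, and the reply percentages quoted earlier in this section, can be recomputed from the four cell counts given in table 7.1. The short Python sketch below does so; the variable names are illustrative.

```python
# Cell counts from table 7.1.
a, b = 482, 1669   # English posts: replies, ToAll+RT
c, d = 783, 1440   # posts in other languages: replies, ToAll+RT

# Share of each type of exchange that is written in English.
english_share_of_replies = a / (a + c)
english_share_of_public = b / (b + d)

# Share of each language category that consists of replies.
reply_share_of_english = a / (a + b)
reply_share_of_other = c / (c + d)

if __name__ == "__main__":
    print(f"English among replies:     {english_share_of_replies:.4f}")
    print(f"English among ToAll+RT:    {english_share_of_public:.4f}")
    print(f"Replies among English:     {reply_share_of_english:.2%}")
    print(f"Replies among other posts: {reply_share_of_other:.2%}")
```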
In the Fisher’s Exact Test, the data can be displayed in a 2x2 contingency
table (table 7.1). The probability of obtaining any such set of values is given by the
hypergeometric distribution in equation 7.3.
\[ p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!} \tag{7.3} \]
                         Replies     ToAll+RT    Language total
English                  a=482       b=1669      a+b=2151
Other                    c=783       d=1440      c+d=2223
Type of exchange total   a+c=1265    b+d=3109    n=4374
Table 7.1: 2x2 contingency table for the Fisher’s Exact Test.
The resulting p-value is $2 \times 10^{-21}$, which is much lower than the set value. In
conclusion, we can reject the null hypothesis in favor of the alternative hypothesis:
multilingual Twitter users use English more frequently in public
posts than in replies to individuals.
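For readers who wish to reproduce the order of magnitude of this result, the following Python sketch sums the hypergeometric probabilities of equation 7.3 over all tables with a top-left cell as small as, or smaller than, the observed one. It is a one-sided test written with exact integer arithmetic from the standard library, and it is not the code used in this study. The text does not state whether the reported p-value is one-sided or two-sided, so the value obtained here may differ somewhat from the reported one. The function name is an assumption.

```python
from math import comb


def fisher_exact_less(a, b, c, d):
    """One-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    Returns P(X <= a), where X is the top-left cell under the hypergeometric
    distribution with all row and column totals held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    lowest = max(0, col1 - row2)  # smallest top-left count the margins allow
    numerator = sum(comb(row1, k) * comb(row2, col1 - k)
                    for k in range(lowest, a + 1))
    # Python integers are exact, so the only rounding happens in this division.
    return numerator / comb(n, col1)


if __name__ == "__main__":
    # Counts from table 7.1: English replies, English ToAll+RT,
    # other-language replies, other-language ToAll+RT.
    print(fisher_exact_less(482, 1669, 783, 1440))
```

Summing exact binomial coefficients avoids the overflow that evaluating the factorials of equation 7.3 in floating point would cause for a table of this size.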
7.3 Discussion: Addressivity as a Factor
This result reinforces the idea that addressing an online message to a perceived
multilingual audience encourages the use of English. In Twitter, the only previous
study that has looked at addressivity as a factor for language choice focused on
Welsh-English bilinguals [61]. The study by Johnson [61] also found that they use
proportionally more English in public posts (53%) than in replies to individuals
(44%) in a sample of 500 posts. The author speculates that the use of English is
encouraged in Twitter for its potential to reach a wider audience [61]. The work by
Johnson reported very few cases of bilingual posts [61], which is consistent with the
description in section 7.1.
In section 3.3, I reviewed two works that suggest Twitter is used more for
conversational purposes in some languages, with a higher frequency of @ signs, while
in other languages it is more common to use it for sharing resources, as the higher
frequency of URLs and repostings might indicate [110, 55]. Taking these previous
findings into consideration, future work should study the combination of the language
profile of the user and the addressivity of the message to understand language choices
in Twitter.
The lesson for system designers is that different types of exchange between
people, with the corresponding number of sources and receivers (one or many),
could prompt code-switching, a user changing the language.
7.4 Theme Analysis
In chapter 5, I studied the social connections between language groups, but
ultimately, I am interested in understanding the complementary roles of social con-
nections and topic-based linkages in creating cross-cultural bridges.
This qualitative analysis constitutes a first exploration of themes that could
potentially connect communities speaking different languages. I use the ego dataset,
which was not initially collected with the idea of detecting communities around a
topic; instead, the purpose was to gather multilingual users regardless of their
interests. With this limitation in mind, as well as the numerous languages present in the
dataset, I decided to focus on identifying keywords related to non-English speaking
countries or communities in English posts, and English “hashtags” (keywords
prefixed with the pound sign) inserted in non-English posts.
Previous works [24] have attempted to develop various classification schemes of
content in Twitter, but they are too broad for the purposes of this study. Dann [24]
focuses his review on five classification schemes and complements them with his own
scheme of six general categories, namely conversational, pass along, news, status,
phatic, spam, and 24 subcategories, like response, location, endorsement, headlines,
sports, events, etc. In this section, I do not base my analysis on these existing
classification schemes; instead, I read through the data without preconceptions
in search of answers to the questions: why is the user mentioning a non-English
speaking place or a non-English language? What is she/he saying about it? Why is
the user inserting an English hashtag in a non-English post?
The descriptive nature of this generic theme analysis intends to stimulate more
systematic research questions and classification schemes to discover the topics and
feelings that bring multilingual online communities together.
7.4.1 International themes in the English language set
Firstly, while reading the 2,151 posts in English, I identified the names of
places, cities, countries, or languages that are not English speaking. At the same
time, I annotated the context in which they were mentioned. After a second revision,
I used the annotations about the context as the basis for the following emergent
themes:
• international news, which includes links to media sources and reactions (e.g.
“Arab Revolution power”, “The Ukrainian experience for Arab world by Lionel
Beehner”, “Hope that everyone in Japan is fine...”);
• people’s travel plans or an accomplished travel (e.g. “can’t wait to fly to
#Barcelona”, “Now considering Martinique & Guadeloupe for my next holiday”,
“Hi back from germany”, “off to Geneva for a show at a fancy birthday
party”);
• people’s location (e.g. “It’s 3AM here in France so I’m kind of tired”, “[...]you
are at the Schokoladenmuseum”, “[...]we need you in Oberhausen! [...]How
long will you stay in Hamburg?”);
• events’ location (e.g. “Buskers Festival in Bern was fantastic!”, “annual Swiss
Congress of Radiology”, “Global Performing Arts Exchange Singapore”,
“Tomorrow is our concert in Dresden”);
• an opinion, political or otherwise (e.g. “miserable greek reality”, “I am for
communism. Swedish communism”, “bahrain tonight should burn...”, “Given
the birth rate & z population figures in Egypt, I can’t understand how sex is
a taboo [...]”);
• internationalization of technology (e.g. “[...]software penetrating Angola”,
“Portuguese RTS game for PS3 [...]”, “I use it in China to redirect my website”,
“@adobe should organise an event for nord africa just like @google”);
• Sports (e.g. “German Rugby”, “Forza Milan!”);
• Culture (e.g. “Nigerian Clubbing etiquette”, “Victor Manuelle is a Puerto
Rican artist!”, “Israeli band Orphaned Land rocks Turkey”);
• Language, including remarks about the language of a linked resource when it is
different from English (e.g. “New Review 9/10 Points (German)”, “Afrikaans
Video”);
• gastronomy and restaurants (e.g. “Turkish scrambled eggs...”, “all you can
eat at thai-jap rest...”);
• travel recommendations (e.g. “If a Lille visit is on the agenda [...] I’d highly
recommend the LAM museum”, “see my latest thoughts on ‘Where To Stay’
in Istanbul”, “You should come to Israel during the summer, you’ll have a
blast!”).
There are 227 posts in which a non-English speaking place or language is
mentioned. Table 7.2 displays the frequency of the themes. The complete list of
places and languages, with an extract of the textual context and associated theme
can be found in Appendix B.
Looking at this list of themes found in English posts, I wondered which ones are
drawing attention to countries and communities that are not English-speaking and
providing some information. Generally, posts about people’s locations (32 instances)
or travel plans and accomplished travels (33 instances) do not provide information
about those places, and I speculate that the function of these posts relates more to
coordinating with friends than to drawing attention to a culture.
Theme                             Instances
International news                       31
Reaction to international news            4
Travel plans                             29
Accomplished travel                       4
People’s location                        32
Events’ location                         27
Opinion                                  24
Tech internationalization                14
Sports                                   13
Culture                                  12
Language                                  8
Language of resource                      4
Gastronomy and restaurants               10
Travel recommendations                    9
A country’s policy                        2
Location and language                     1
Culture and humor                         1
Travel plans and opinion                  1
Movies                                    1
Table 7.2: Frequencies of themes related to non-English speaking places and languages mentioned
in English posts.
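The aggregate figures quoted in this section (35 news-related posts, 33 travel-related posts, 12 language mentions, and 227 posts overall) follow from the counts in Table 7.2. As a quick arithmetic check, the sums can be reproduced with a minimal Python sketch. The dictionary keys mirror the table rows; the group names are labels introduced here for illustration only and are not part of the original analysis.

```python
# Counts transcribed from Table 7.2 (theme -> number of posts).
theme_counts = {
    "International news": 31,
    "Reaction to international news": 4,
    "Travel plans": 29,
    "Accomplished travel": 4,
    "People's location": 32,
    "Events' location": 27,
    "Opinion": 24,
    "Tech internationalization": 14,
    "Sports": 13,
    "Culture": 12,
    "Language": 8,
    "Language of resource": 4,
    "Gastronomy and restaurants": 10,
    "Travel recommendations": 9,
    "A country's policy": 2,
    "Location and language": 1,
    "Culture and humor": 1,
    "Travel plans and opinion": 1,
    "Movies": 1,
}

# Groupings used in the discussion; the group names are illustrative labels.
groups = {
    "news_and_reactions": ["International news", "Reaction to international news"],
    "travel": ["Travel plans", "Accomplished travel"],
    "language_mentions": ["Language", "Language of resource"],
}

# Total number of posts mentioning a non-English speaking place or language.
total = sum(theme_counts.values())

# Sum of the table rows that make up each group.
group_totals = {
    name: sum(theme_counts[theme] for theme in themes)
    for name, themes in groups.items()
}

print(total)
print(group_totals)
```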
Likewise, the posts related to events’ locations (27 instances) might have a
primary coordinating function, but in some cases (e.g. “oktoberfest”) the post is
drawing attention to events of cultural interest in a country and could be considered
as fostering cross-cultural awareness.
The theme international news, which also includes reactions to the news,
is the theme that most frequently fosters cross-cultural awareness in the ego dataset
(35 instances). These posts show the author’s interest in a non-English speaking
community and provide information to the followers. Opinions, political
or otherwise (24 instances), are the next most frequent theme raising
cross-cultural awareness. In this case, the author also shows interest in a non-English
speaking community and provides some piece of information, although the impressions
are sometimes negative. Less frequently, posts refer to Sports (13 cases),
Culture (12 cases), gastronomy and restaurants (10 cases), and travel recommendations
(9 instances) for countries that are non-English speaking; these also provide
some information about cultural aspects.
As an aside, the theme internationalization of technology is interesting in its own
right because it reflects technology adoption in different areas of the world.
Although the number of instances is low in this dataset (14 cases), the theme points to a
promising application: tracking the adoption of technology products and services at an
international scale.
Finally, it is worth noting that among the 12 mentions of a language different
from English, there are 4 instances in this dataset where users specify the language of
a resource they are linking. They are creating a cross-language link, a phenomenon
that has been studied in the blogosphere [53, 47]. Cross-language linking, like the
cross-language social connections studied in chapter 5, contributes to building paths
between communication spheres online that are separated by language, enabling the
flow of information across national boundaries.
In summary, even though the mention of a non-English place or language
does not always come with information about it, some themes might be facilitating
cross-cultural awareness, like international news, sports, and culture. Unfortunately,
sharing resources in languages other than English within the English language sphere
on Twitter seems to be infrequent, or at least language notices preceding the
link are. Alternative data collection techniques could shed light on this phenomenon
in future research.
7.4.2 English hashtags in the non-English language set
In a second phase, I read the 2,223 posts that were left out of the English
language set. Among those, however, there are posts in English that were classified
as automatic (see section 7.1). I ignored such posts for the purposes of this analysis.
Aside from English, there are 18 languages in the ego dataset. Given the challenge
that so many languages posed for the analysis, I focused on identifying only English
“hashtags”, that is, keywords or phrases prefixed with the pound sign (#).
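The identification described here was done by reading the posts. As an illustration only, a first automatic pass over a larger collection could extract hashtag candidates with a regular expression and keep those found in an English word list. The sketch below is not the procedure used in this study: the pattern is a simplification of Twitter's real tokenization, and the function names, the sample post, and the tiny vocabulary are hypothetical.

```python
import re

# A run of word characters after '#'. This is a simplifying assumption;
# Twitter's own hashtag tokenization has additional rules.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(post: str) -> list[str]:
    """Return the hashtags in a post, without the leading '#'."""
    return HASHTAG_RE.findall(post)


def english_hashtags(post: str, english_vocabulary: set[str]) -> list[str]:
    """Keep only hashtags whose lowercase form is in a given English word list.

    The word list is supplied by the caller; names of places, brands, and
    acronyms would still need manual review, as described in the text.
    """
    return [tag for tag in extract_hashtags(post)
            if tag.lower() in english_vocabulary]


# Hypothetical German post and a toy vocabulary, for illustration only.
sample_post = "Der neue Build ist kaputt #fail #berlin"
toy_vocabulary = {"fail", "wtf"}

print(extract_hashtags(sample_post))
print(english_hashtags(sample_post, toy_vocabulary))
```

A dictionary lookup of this kind would miss transliterated place names and ad-hoc acronyms, which is why a manual pass remains necessary for a sample like the one analyzed here.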
The purpose is to explore the reasons why a user writing in a language different
from English might want to add an English word or phrase. Are they examples of
code-switching? Are they a mechanism to draw international attention? Or are
there other motivations behind them?
The hashtag was a convention of the Internet Relay Chat (IRC) introduced
and accepted by users as the Twitter tagging feature in 2007 [15]. Hashtags are used
to “funnel related tweets into common streams” [57], by aggregating posts with a
common hashtag. It is important to bear in mind that the hashtag can be a means
to classify a message, but also provides the user with visibility in a “many-to-many”
conversation about a topic, potentially enabling the user to reach an audience beyond
their followers. The Twitter system leverages hashtags for creating lists of popular
topics in real time, called “trending topics”. These trending topics are visible to all
users, thus potentially drawing worldwide attention. As a side note, the proportion
of Twitter posts with hashtags seems to vary with language: for example, one study
reports that they account for 25% of posts in German, 14% in English, and as few
as 4% in Japanese [110].
I created a list of the hashtags identified in English, and I also included
hashtags of brand names and products, names of places in English or transliterated
into Latin script, and many acronyms. Acronyms posed a particular challenge, given
that many of them were just informal ad-hoc abbreviations and required search
and documentation to understand the meaning. Often, they were abbreviations of
conference names, music festivals, and other events with a potential international
audience. Occasionally, I could not determine the meaning of an acronym, and
consequently I did not include it in the list of selected hashtags. Also, I discarded
acronyms that referred to local events or institutions (e.g. German institution
#rbb, Portuguese political debate #e2011pt) when they were specific to the culture
and language in which the author was writing and did not constitute a change of
language or script.
Subsequently, I classified the hashtags using the annotations about their meaning.
I tried to classify them into topics, like Information and Communication
Technologies, and political topics and campaigns. However, some hashtags have a primary
conversational function instead of referring to a topic, as studied by Huang et al.
[57]. They argue that these types of tags “serve as a prompt for user comment”
and “the resulting content is an asynchronous massively-multi-person conversation”;
also, they provide a few examples, like #igrewupon, #liesmentell [57]. The conver-
sational tags of Huang et al. [57] correspond to what Laniado and Mika [70] called
“tags characterized by a common sentiment”, which they illustrate with examples
such as #thankfulfor and #youknowyouareuglyif. They estimated that 20% of
Twitter hashtags belong to this type. Also, many hashtags refer to named entities.
Laniado and Mika [70] estimated that 39% of hashtags are named entities, most
commonly referring to organizations, products, and events.
As a result of this diversity, I first classified hashtags into two main groups:
conversational tags, and the rest. Secondly, I divided the conversational tags into
three types: emergent discourse conventions in Twitter (e.g. #fail, #wtf), reflecting
on Twitter use (e.g. #1000followers, #odd trend), and informal or ad-hoc Twitter
genres (e.g. #a thought, #kindlyAdvice, #roadrage). Finally, I classified the rest of
the hashtags into groups of topics, and grouped separately the brands, devices, events,
locations, and dates. The tables of hashtags, including frequency of appearance and
the categories in which they are classified, can be found at the end of this section.
A prominent group of English hashtags relates to Information and Communication
Technologies (e.g. #mobileusers, #Android, #Cloud Computing, #hyperlinking),
which also spans brands and devices, such as #microsoft, #skype, #Ipod
(see table 7.5). Among these hashtags, there are named entities, but also cases of
intrasentential code-switching, where the user switches from one language to an-
other within the same sentence [62]; following Joshi’s definition [62], English is the
embedded language within a non-English matrix language in these examples.
While the topic of Technology and Internet seems to trigger the use of English
terms, international news (e.g. #bahrain, #egypt) and campaigns, such as
#1billionhungry, might be biased towards English hashtags to draw global attention (see
tables 7.7 and 7.8). International news could potentially lead to a trending topic
and draw attention to events unfolding in real time in some part of the world,
as was the case during the popular uprising in Egypt in 2011 [16].
There are also numerous hashtags referring to conferences and music festivals.
If the organizers of such events announce or promote a hashtag, they avoid the
problem of fragmentation of message streams related to the event due to variations
of keywords [15]. However, in this sample there are variants of a hashtag for the
same event, such as #caexpoitaly and #caexpo, #gmaghreb and #gmaghreb11,
#sepIn11 and #sepIn (see table 7.6). Letierce et al. [72] studied the use of Twitter
and hashtags in conferences of Semantic Web researchers and revealed that users
have a desire to participate in the discussion around the conference and see their
messages included in the conference stream, while hoping to increase their network.
They concluded that Twitter is used as a background communication channel during
those events.
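The fragmentation caused by hashtag variants can be made concrete with a small sketch that merges variants of the same event tag and adds up their frequencies. The heuristic used below (drop a trailing run of digits, then attach each tag to the shortest tag that is a prefix of it) is an assumption made for illustration, not a method used in this study, and it would need manual review because unrelated tags can share a prefix. The function names are hypothetical; the frequencies are those listed in table 7.6.

```python
import re
from collections import defaultdict


def normalize(tag: str) -> str:
    """Lowercase a hashtag and drop a trailing run of digits (e.g. a year)."""
    base = tag.lstrip("#").lower()
    stem = re.sub(r"\d+$", "", base)
    # Keep the original form if the tag consists of digits only.
    return stem or base


def group_variants(counts: dict[str, int]) -> dict[str, int]:
    """Merge tags whose normalized form extends a shorter normalized form."""
    stems = sorted({normalize(tag) for tag in counts}, key=len)
    merged = defaultdict(int)
    for tag, frequency in counts.items():
        norm = normalize(tag)
        # The shortest stem that is a prefix of this tag becomes its group key.
        key = next(stem for stem in stems if norm.startswith(stem))
        merged[key] += frequency
    return dict(merged)


# Frequencies as listed in table 7.6 for the event hashtags discussed above.
event_counts = {
    "#caexpoitaly": 9,
    "#mcdd10": 7,
    "#caexpo": 5,
    "#gmaghreb": 2,
    "#gmaghreb11": 1,
    "#sepIn11": 1,
    "#sepIn": 1,
}

merged = group_variants(event_counts)
print(merged)
```

Under this heuristic the two #caexpo variants together account for 14 posts, which illustrates how much of an event's stream a single keyword search would miss.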
Most interestingly, international conferences and music festivals attract multi-
lingual and multicultural audiences, who can conform to the same Twitter hashtag
and generate multilingual conversations around the event taking place. These inter-
national events might be key in promoting cross-language sharing of resources and
creating social ties across language and national boundaries in the online communi-
cation sphere.
There are still other reasons to use an English hashtag when writing in a different
language: as studied by Huang et al. [57], some hashtags are prompts for user
comment and a way to participate in a multi-person conversation, one
that can be multilingual.
A few of these conversational tags constitute emergent discourse conventions
in Twitter, even adopted from other online sites, such as the commonly used #like
in the social networking site Facebook. Kooti et al. [66] studied the emergence and
evolution of this type of convention, looking at the specific example of reposting
in Twitter. They provided data showing how the user community chose to
include “RT” in their repostings more often than other alternatives over time. In
other words, Twitter users have progressively agreed on the use of certain codes,
such as adding #fail to their post when they talk about disappointing or deceiving
news (7 times in the non-English sample), #wtf to express disbelief, #FF or “follow
friday” to recommend other users, etc. (see table 7.3). Even though these conventions
come from the English language, they have been adopted in other languages for
communicating in Twitter.
Similarly, users in the non-English sample sometimes categorized their posts
in an “informal or ad-hoc genre” by adding an English hashtag, like #a thought,
#kindlyAdvice, #thisislife, #roadrage (see table 7.4). These examples are along the
lines of those presented by Huang et al. [57] and Laniado and Mika [70], referring to
common sentiments, but they are less persistent over time than the previously discussed
emergent discourse conventions. These informal or ad-hoc genres in English seem
to have the potential to spread internationally and be adopted across languages in
Twitter, but this phenomenon is still not well documented.
In summary, some English hashtags reflect code-switching in relation to certain
topics, like Information and Communication Technologies. A question for future
research could be whether this topic affects language choice, favoring the use of English. Also,
international news and campaigns might tend to trigger the use of English hashtags
to draw global attention. Finally, international events organizing back-channel com-
ments around a common hashtag, as well as certain conversational tags, could be
the focus of further research for their potential to foster multilingual conversations.
Hashtag   Frequency   Meaning/Context
#fail     7           Sharing a bad experience or deceiving news
#wtf      2           Expressing surprise or disbelief
#FF       1           “follow friday”: recommending a person or organization to follow
#Like     1
#np       1           now playing or no problem
Table 7.3: Conversational tags: emergent discourse conventions in Twitter, social networks or
online chat.
Hashtag                             Frequency
Reflecting on Twitter use
  #TweepsMidName                    5
  #1000followers                    1
  Twitter #addicted                 1
  #odd trend                        1
  #summer trends                    1
Informal or ad-hoc Twitter genres
  #a thought                        1
  #kindlyAdvice                     1
  #thisislife                       1
  #roadrage                         1
  #supportedby                      1
  #UNeverKnow                       1
Table 7.4: Conversational tags: reflecting on Twitter use and informal or ad-hoc Twitter genres.
Topic           Hashtag               Frequency   Meaning/Context
ICT             #mobileusers          6
                #cisa                 2           Online live broadcasting
                #OVH                  2           Online virtual hosting
                #Android              1
                #AR                   1           Augmented Reality
                #Cloud Computing      1
                #Honeycomb            1           Android version
                #hyperlinking         1
                #launch               1           a website
                #opendata             1
Devices         #fb funerals          2           Facebook
                #n900                 1           Nokia model
ICT brands      #Ipod                 1
                #microsoft            1
                #samsungcheerdance    1           Samsung
                #skype                1
Vehicle brand   #Audi                 1
Table 7.5: Hashtags: ICT topic, brands and devices.
Topic                Hashtag             Frequency   Meaning/Context
Conferences/events   #caexpoitaly        9           CA expo 2011
                     #mcdd10             7           Mobile Camp Dresden
                     #caexpo             5           CA expo 2011
                     #gmaghreb           2           Google event in Maghreb
                     #gmaghreb11         1           Google event in Maghreb
                     #innovlab2011       1
                     #pycon4             1
                     #SMW11              1           Social Media Week 2011
                     #sepIn11            1
                     #sepIn              1
Music festivals      #212RMX             2
                     #fib2011            2
                     #readingandleeds    1
Music                #arcticmonkeys      1           Band
                     #FearFactory        1           Album
TV and sports        #BlueWolves         1           Mongolian soccer team
                     #Comedystreet       1           German TV program
                     #tv                 1
Table 7.6: Hashtags: events, music, TV, and sports.
Topic          Hashtag                  Frequency   Meaning/Context
Location       #bahrain                 12
               #berlin                  2           Germany
               #egypt                   2
               #greece                  2
               #germany                 1
               #korinthos               1           Greece
               #Mongolia                1
               #liveyourmythingreece    1           Greece
               #Tahrir                  1           Egypt
               #uniineurope             1           Europe
Dates/time     #14feb                   1           events in Bahrain
               #september               1
               #winter                  1
Celebrations   #Jerusalemday            1
               #Ramadan                 1
Table 7.7: Hashtags: location, time, and other named entities.
Topic                 Hashtag               Frequency   Meaning/Context
Project management    #marketing            1
                      #scrum                1
Political/Campaign    #debtocracy           13
                      #1billionhungry       1
                      #endSH                1           End Street Harassment
                      #sexquota             1
Not classified        #networking           1
                      #selective default    1
Table 7.8: Hashtags: other topics.
Chapter 8
Discussion and Future Work
I have encountered many challenges in this explorative research that require
bringing together multiple fields and diverse methods in ways that have not been
established previously. For instance, one of these challenges was detecting multi-
lingual users in Twitter and, more broadly, determining language profiles of users
(if they are monolingual or multilingual), and assigning language labels accordingly.
An example of the multidisciplinary character of this dissertation is the application
of social network analysis to sociolinguistic questions.
As a result of the decision process for resolving the challenges as they became
apparent, a concomitant contribution of this research is its set of methodological
considerations. Namely, the process of testing natural language identification tools and the
relationship between the number of posts analyzed per user and the estimated error rates
in language profiling can serve as a guide for approaching the problems that arise in the
study of languages in Twitter.
Another challenge of doing research about Twitter (and other Internet plat-
forms) is that the findings on particular aspects of this fast evolving environment
can easily become outdated. This dissertation aimed at answering the research
questions without requiring an exhaustive review of the Twitter interface, which has
changed several times in the past years, and at obtaining conclusions that could
be informative for the study of other social networks and communication environments
that share the key characteristics of Twitter, such as being public and offering
reposting and replying capabilities.
This chapter discusses the results of the studies that compose this disserta-
tion, and highlights the key contributions and future directions of this research.
Ultimately, I hope this discussion provokes questions for new studies.
8.1 Of Links, Social Ties, and Gravitational Forces
The vision of a cosmopolitan Internet with vibrant communities, enabling
contact with the unfamiliar, discovery, and the serendipity that propitiates learning
[119] is challenged by the existence of language frontiers online.
In the view of the Global Language System theory (section 2.1), multilingual
people constitute the gravitational force that provides cohesion to the system, by
connecting different language groups. There is empirical evidence of this language
bridging in the blogosphere [53, 32].
The language ecology approach (section 2.2) connects this macro-scale
dimension of languages with the micro-scale level of interactions between individuals.
Social network analysis provides an analytic tool for studying these language ecolo-
gies that emerge from the interactions of the multilingual users with their social
connections.
The main contribution of this dissertation is going beyond survey information
about multilingualism in Twitter [83, 55], and providing a deeper understanding
about the structural relations between language communities in a social network
online. Although inspired by previous studies on the blogosphere [53], this research
enhances the descriptive analysis with the creation and definition of theoretical
constructs: the types of bilingual networks.
Focusing on the networks of multilingual users, the social network analysis re-
vealed three types of bilingual networks: the Gatekeeper-Language bridge, represent-
ing a continuum of increasing connections between two separate language groups;
the Integration and union type, representing a continuum of increasing penetration
of one language group within the structure of the other; and the Peripheral language
type, where one language group is smaller or less cohesive, and lies at the periphery
of the social graph.
This research conceives of the social network of multilingual users as a micro-
scale language ecology, influencing their communication strategies and language
choices. This conceptualization leads to a second key contribution, which is the
novel idea of modeling the influence of social network factors in the language choices
of the user.
In the factor analysis, the dependent variables considered are the proportion of
English use and non-English use within the posts of the user. The factors included
are the proportion of English and non-English language users in the social network
of the multilingual subject, and the degree of multilingualism of the social network.
The relative importance of factors is represented by the coefficients obtained by
fitting two generalized linear models to the dataset (linear and logistic regression).
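To make the modeling idea concrete, the following minimal Python sketch fits a linear and a logistic model to a single predictor. It is a simplified illustration under stated assumptions, not the analysis reported in this dissertation: the models described above include several factors, whereas the sketch uses only one (the share of English users in the network), the data points are invented for the example, and the function names, learning rate, and number of iterations are arbitrary choices.

```python
import math


def fit_linear(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_logistic(x, y, learning_rate=0.5, steps=5000):
    """Logistic regression on proportions y in [0, 1].

    Fitted by gradient ascent on the Bernoulli log-likelihood, treating
    each proportion as a fractional response.
    """
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            grad_a += yi - p
            grad_b += (yi - p) * xi
        a += learning_rate * grad_a / n
        b += learning_rate * grad_b / n
    return a, b


# Hypothetical data, NOT taken from the dissertation: for six imaginary users,
# the share of English users in their network and of English in their posts.
english_share_in_network = [0.1, 0.25, 0.4, 0.55, 0.7, 0.9]
english_share_in_posts = [0.15, 0.2, 0.45, 0.5, 0.75, 0.85]

lin_intercept, lin_slope = fit_linear(english_share_in_network, english_share_in_posts)
log_intercept, log_slope = fit_logistic(english_share_in_network, english_share_in_posts)

print(lin_intercept, lin_slope)
print(log_intercept, log_slope)
```

In both models, the sign and size of the fitted coefficient play the role described above: a positive coefficient means that a larger share of English users in the network goes together with more frequent use of English by the multilingual user.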
The proportion of English users in the network constitutes a key influencing
factor in the frequency of English use by the multilingual individual. Similarly, the
proportion of non-English language (L2) users in the network is a very important
factor influencing the frequency of L2 use by the multilingual person. The results
suggest that multilingual Twitter users perceive the language composition of their
network and interact accordingly. Conversely, the language choices of multilingual
users might attract followers of a specific language profile. Most probably,
the relation goes both ways, in a self-feeding cycle.
Shifting attention from the social network to the content of the posts written
by the multilingual users, I tested the hypothesis that the type of exchange (public
post versus reply to an individual) influences the choice between English and other
languages. The result reinforces previous empirical findings suggesting that sending
public messages to a seemingly multilingual audience encourages the use of English
[31, 65].
Finally, there is another gravitational force that could connect language groups
and affect language choice: topics [65, 5]. Common interest in certain topics attracts
people from different cultures, and encourages the creation of cross-language links
to resources and news [47].
As a step toward future studies on international topics, this dissertation ex-
plores what themes might be raising cross-cultural awareness. I identified themes
related to non-English speaking countries or communities in English posts, and I
concluded that international news was the most popular theme.
Also, I identified English hashtags (keywords preceded by the # sign) inserted
in non-English posts and related contexts that could encourage multilingual con-
versations. International conferences and music festivals attract multilingual and
multicultural audiences, who conform to the same Twitter hashtag and generate
multilingual conversations around the event taking place. These international events
might be key in promoting cross-language sharing of resources and creating social
ties across geographic regions.
If we embrace the idea of a vibrant language ecology on the Internet, we should
challenge the existing structure of the network of hyperlinks and social ties, for
instance by empowering multilingual users to leverage their social ties across language
groups, facilitating translation, and recommending links to resources in different
languages.
8.2 The Road Ahead...
Future directions for this research include scaling up the social network analysis
to account for multilingual users with larger social networks. This will require
improving the methods for analysis of larger collections of data, e.g. training natural
language identification tools to detect transliterated text, and using spam detection
algorithms.
Also, I envision expanding the theme analysis to include methods of automatic
topic detection in multiple languages, or crowdsourcing annotations using platforms
such as Mechanical Turk or CrowdFlower (i.e. sending micro-tasks to large numbers
of people for specifying the topics of Twitter posts). Further research could focus on
topic-based networks, targeting the sampling to specific language pairs and topics
for enabling comparisons across languages and a more complex factor analysis.
Finally, studying the evolution of social networks over time could unveil the
relationship between the language composition of the social network, audience
perception, and language choice. In relation to this, other questions arise: whether
multilingual users are aware of the type of social network they have; and if they are,
how this affects their language choices.
8.2.1 Translation and Mediation in Twitter
In section 7.1, I describe my unsuccessful attempt to generate a hypothesis
in relation to bilingual posts and repostings with a comment, as a preliminary step toward
identifying translation behaviors. The resulting number of bilingual posts in
the ego dataset was only 37. It seems that the limited number of characters allowed
poses a problem for including a translation together with the original comment
in the same post. Alternatively, translations might be found in separate posts
but, unlike reposting, there is no way to connect the translation to the original
message. Also, some people create separate accounts for each language to address
different audiences. Future research could use automatic topic analysis to identify
translations.
Although Twitter enables the use of many languages and writing systems
thanks to Unicode, it does not offer support for translation or features for
strengthening connections between language groups. Regarding support for translation, there
are no embedded linguistic resources in the interface, such as machine translation,
dictionaries or transliteration tools. The Meedan project [113] organizes volunteers
for translating Twitter posts and has encountered a number of challenges: engag-
ing users in translation, linking and representation of translations in relation to the
original post, authorship, validation, etc.
Instead of relying on volunteer translators, we could seek ways to encourage
translation, cross-language linking and connection behaviors that are happening
already. However, in Robert Munro’s words, there is not a “unified resource that
links people by languages spoken” in social media [96], which would be a helpful
starting point.
Recommendation mechanisms could foster the creation of cross-language links.
For example, AlMeshary and Abhari [4] propose a strategy for recommending people
to follow on Twitter with the purpose of obtaining local information in the context
of a user relocating from a different country. They use machine translation to match
the users’ interests found in their posts with the local offers.
Finally, by studying the dynamic language preferences of multilingual users,
we will not only be in a better position to design a satisfying experience for those
users, but will also learn how to help them in their mediation tasks. This
dissertation advances in that direction by modeling the influence of factors in the
language choices of the multilingual users.
8.2.2 Who Are the Multilingual Users?
This dissertation focuses on multilingual users because of their role in connect-
ing different language groups. But who are they? Are they expatriates? Members
of minority communities? Language learners?
Twitter posts and the languages in which they are written represent just a
limited language profile of the user, and they barely provide any social context.
Androutsopoulos recommends taking into account the digital surroundings when
analyzing written text, for instance, looking at the pictures and videos that are
linked [6]. Also, adding detailed geographic information could help in building a
more complete profile.
Understanding more about the context of multilingual users could help in the
identification of roles and motivations for mediating between language groups and
in finding the relationship of these roles with network types.
The next step after this dissertation is adding consideration of geolocation
information and content analysis of the resources linked in the posts to provide
more attributes for nodes and edges in the social network analysis. Additionally,
ethnographic methods could shed light on who the people connecting different
language groups are and what their reasons are.
Chapter 9
Conclusion
Social media is international: users from different cultures and language back-
grounds are communicating, generating and sharing content. However, language
barriers emerge in the communication landscape online. The aspiration of an Inter-
net that constitutes a cosmopolitan space and fosters language diversity has stum-
bled over the language frontier.
In the microblogging site Twitter, information spreads across languages and
countries. But how is the news traveling across borders? Expatriates, migrants,
minorities, diasporic communities, and language learners play an important role in
forming transnational networks and cultural bridges between nations and commu-
nities. They are multicultural and multilingual.
This dissertation studied how multilingual users of Twitter mediate between
language groups in their social network, looking at social connections and language
choices. The overarching goal that motivates this research is to advance our un-
derstanding of the network structures and communication strategies that enable
intercultural dialog, cross-language sharing of information, and awareness of global
problems. The implication for the design of social media platforms is that, instead
of constraining multilingual users to only one language option, technology should
support their language-switching and mediating role between cultures.
The objectives of this dissertation were: (1) to explore the ways in which
multilingual users of Twitter are connecting different language groups in their social
network; (2) to model how the network influences their language choices; and (3)
to explore what the textual features of their posts can elicit about language choices
and mediation between groups.
RQ 1: In what ways are multilingual users of Twitter connecting language
groups? Focusing on the social network of 92 multilingual users, the methodology
combined a qualitative approach to social network analysis and network statistics
to present a classification of network types based on the patterns of connections
between language groups. The study followed an exploratory design, with a first
qualitative phase that took a grounded theory approach to classify the network
visualizations, and a second quantitative phase that complemented the qualitative
study with network statistics specifically created to provide a robust definition of
network types. Finally, I used machine learning for testing the results.
The social network analysis revealed three differentiated types of bilingual
networks: the Gatekeeper-Language bridge, representing a continuum of increasing
connections between two separate language groups; the Integration and union type,
representing a continuum of increasing penetration of one language group within
the structure of the other; and the Peripheral language type, where one language
group is smaller or less cohesive, and lies at the periphery of the social graph.
RQ 2: How is the social network of multilingual users in Twitter influencing
their choice of language? The factor analysis modeled the influence of a set of fac-
tors related to the social network in the language choices of multilingual users. The
dependent variables considered are the proportion of English use and non-English
use within the 50 posts of the user. The factors included are the proportion of
English and non-English language users in the social network of the multilingual
subject, and the degree of multilingualism of the social network. The relative im-
portance of factors, or their weight, is represented by the coefficients obtained by
fitting two generalized linear models to the dataset (linear and logistic regression).
The proportion of English users in the network is a key factor influencing
the frequency of English use by the multilingual individual. Similarly, the
proportion of non-English language (L2) users in the network is a very important
factor influencing the frequency of L2 use by the multilingual person. Interestingly,
the proportion of English users in the network is also
important when modeling L2 use, and is negatively correlated with it. Regarding the
multilingual index, the results were inconclusive about its influence on the language
choice of multilingual users.
RQ 3: Does the type of exchange in Twitter influence the language choice of
multilingual users? I shifted the attention from the social network to the content of
the posts written by the multilingual users. First, I looked at the textual feature of
the @ sign at the beginning of a post as an indicator of addressivity. Based on this
indicator, I tested the hypothesis that the type of exchange (public post versus reply
to an individual) influences the choice between English and other languages. The
result reinforces previous empirical findings suggesting that sending public messages
to a seemingly multilingual audience encourages the use of English.
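The addressivity indicator can be sketched as follows, assuming each post is available as a pair of text and language code. The helper names and the sample posts are hypothetical; the sketch only tabulates counts and does not reproduce the statistical test used in the study.

```python
from collections import Counter


def exchange_type(text):
    """Classify a post as a reply (leading @ sign) or a public post."""
    return "reply" if text.lstrip().startswith("@") else "public"


def cross_tabulate(posts):
    """Count posts by (exchange type, English vs. other language)."""
    table = Counter()
    for text, lang in posts:
        language_group = "en" if lang == "en" else "other"
        table[(exchange_type(text), language_group)] += 1
    return table


# Hypothetical sample of (text, language code) pairs.
sample_posts = [
    ("@amiga nos vemos luego", "es"),
    ("Looking forward to the conference", "en"),
    ("@friend see you there", "en"),
    ("Buenos dias a todos", "es"),
]
counts = cross_tabulate(sample_posts)
```

The resulting two-by-two table of counts is the kind of input a test of independence between exchange type and language choice would take.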
RQ 4: What do the themes and textual features in the posts of multilingual users
reveal about cross-cultural awareness or international dialogue? Finally, I looked at
content with the objective of detecting themes that might help in creating cross-
cultural awareness, where the multilingual users could be acting as mediators from
the point of view of their messages. I identified themes related to non-English speak-
ing countries or communities in English posts and, also, I identified English hashtags
(keywords preceded by the # sign) inserted in non-English posts. Using a generic
theme analysis, I concluded that international news was the most popular theme
when mentioning a non-English speaking place. This study serves as an exploratory
qualitative phase to inform the design of future studies beyond this dissertation.
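Hashtag extraction of the kind described above can be approximated with a regular expression. This is a minimal sketch: the pattern is a simplification of Twitter's actual hashtag rules, the function name is hypothetical, and deciding whether a hashtag is English would need a separate language-identification step that is not shown. The example post is one of the extracts listed in Appendix B.

```python
import re

# Simplified pattern: a # sign followed by word characters (letters, digits, underscore).
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(post):
    """Return the hashtags in a post, without the leading # sign."""
    return HASHTAG_PATTERN.findall(post)


example = extract_hashtags("#chile #students #4deagosto several pictures")
```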
The main contribution of this dissertation is going beyond survey information
about multilingualism and providing a deeper understanding of the structural
relations between language communities in a social network online. This research
work is one of the few that apply social network analysis to the study of sociolinguis-
tic questions on the Internet. In particular, it contributes an original classification
of network types based on the patterns of connections between language groups,
complemented with new network statistics that enhance the definitions of these
theoretical constructs.
Adapting the Ecology of Language approach from Sociolinguistics to the social
network context, this research conceived of the social network of multilingual users
as a micro-scale language ecology, influencing their communication strategies and
language choices. This conceptualization led to the novel idea of modeling the
influence of social network factors in the language choices of the user.
This dissertation can benefit the study of information diffusion regarding the
potential impact of these types of network structures on cross-language flows. Also,
it contributes to understanding users’ behavior and informing the design of social
media platforms.
Future directions for this research include: scaling up the social network
analysis to account for multilingual users with larger social networks; studying
topic-based networks and detecting cases of translation; targeting the sampling to specific
language pairs and topics to enable comparisons across languages and a more
complex factor analysis; and studying the evolution of social networks over time to
explore how this affects audience perception and language choice.
The next step for this dissertation research is to add geolocation information
and content analysis of the resources linked in the posts to provide more attributes
for nodes and edges in the social network analysis. Finally, ethnographic methods
could shed light on who the people are, and what the reasons are, that connect
different cultural and linguistic groups.
Appendix A
Visualizations of Social Networks
This appendix contains the visualizations of the 92 egocentric networks, with
the qualitative category assigned, the language codes, and the corresponding colors.
Language labels can be one language code, following the ISO standard codes for names
of languages (e.g., “en” for English, “es” for Spanish, “de” for German); two
language codes joined by the + sign in the case of bilinguals; the word “empty” for
nodes with no data available; or the number 0 for nodes where the language could not
be identified. All visualizations were made with the Gephi social network analysis
tool, using the Force Atlas layout. The size of the nodes represents betweenness
centrality.
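The label conventions described above can be read programmatically along these lines. This is a hedged sketch: the function name and the returned dictionary layout are hypothetical choices for illustration, not part of the original tooling.

```python
def parse_language_label(label):
    """Interpret a node label from the network visualizations."""
    label = label.strip()
    if label == "empty":
        # No data was available for this node.
        return {"status": "no_data", "languages": []}
    if label == "0":
        # Data existed, but the language could not be identified.
        return {"status": "unidentified", "languages": []}
    # One code ("en") or two codes joined by a plus sign ("en+es").
    return {"status": "identified", "languages": label.split("+")}


bilingual = parse_language_label("en+es")
```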
Figure A.1: Trilingual networks (1).
Figure A.2: Trilingual networks (2).
Figure A.3: Trilingual networks (3).
Figure A.4: Bilingual networks: gatekeeper type (1).
Figure A.5: Bilingual networks: gatekeeper type (2).
Figure A.6: Bilingual networks: gatekeeper type (3).
Figure A.7: Bilingual networks: gatekeeper type (4).
Figure A.8: Bilingual networks: gatekeeper type (5).
Figure A.9: Bilingual networks: gatekeeper type (6).
Figure A.10: Bilingual networks: language bridge type (1).
Figure A.11: Bilingual networks: language bridge type (2).
Figure A.12: Bilingual networks: language bridge type (3).
Figure A.13: Bilingual networks: language bridge type (4).
Figure A.14: Bilingual networks: language bridge type (5).
Figure A.15: Bilingual networks: language bridge type (6).
Figure A.16: Bilingual networks: union type (1).
Figure A.17: Bilingual networks: union type (2).
Figure A.18: Bilingual networks: union type (3).
Figure A.19: Bilingual networks: union type (4).
Figure A.20: Bilingual networks: integration type (1).
Figure A.21: Bilingual networks: integration type (2).
Figure A.22: Bilingual networks: integration type (3).
Figure A.23: Bilingual networks: integration type (4).
Figure A.24: Bilingual networks: integration type (5).
Figure A.25: Bilingual networks: integration type (6).
Figure A.26: Bilingual networks: integration type (7).
Figure A.27: Bilingual networks: integration type (8).
Figure A.28: Bilingual networks: peripheral language type (1).
Figure A.29: Bilingual networks: peripheral language type (2).
Figure A.30: Bilingual networks: peripheral language type (3).
Figure A.31: Bilingual networks: peripheral language type (4).
Figure A.32: Bilingual networks: peripheral language type (5).
Figure A.33: Bilingual networks: peripheral language type (6).
Figure A.34: Small and monolingual networks (1).
Figure A.35: Small and monolingual networks (2).
Figure A.36: Small and monolingual networks (3).
Figure A.37: Small and monolingual networks (4).
Figure A.38: Small and monolingual networks (5).
Figure A.39: Small and monolingual networks (6).
Figure A.40: Small and monolingual networks (7).
Figure A.41: Small and monolingual networks (8).
Appendix B
International Themes in English Posts
In the English language set of posts written by the 92 egos, totaling 2151 posts,
there are 227 posts in which a non-English speaking place or language is mentioned.
This appendix presents in landscape layout the complete list of places and languages
in the central column, with an extract of the textual context on the left side, and
associated theme on the right side.
TEXTUAL CONTEXT | PLACE OR LANGUAGE | THEME
Paris calling for a meeting at... [...] | Paris | travel plans
can't wait to fly to #Barcelona | Barcelona | travel plans
Inspiration for a little Lyon break […] | Lyon | travel recommendation
If a Lille visit is on the agenda […] I'd highly recommend the LAM museum […] | Lille | travel recommendation
Now considering Martinique & Guadeloupe for my next holiday. […] | Martinique & Guadeloupe | travel plans
Oh dear! Explosion @Moscow airport. […] | Moscow | International news
Is going to the Chinese for dinner […] | Chinese | gastronomy and restaurants
| | accomplished travel
REQUESTING THE #NKOTBCRUISE2012 ON THE MEDITERRANEAN SEA! […] | Mediterranean Sea | travel plans
| | accomplished travel
[…] Germany in 3 Days […] | Germany | travel plans
[…] Zimbabwe and Mugabe's Rule […] | Zimbabwe | International news
| Afrikaans | Language of resource
[...] It's 3AM here in France […] | France | location
[...] Venice, tomorrow 13:20? [...] | Venice | travel plans
Virtual #grocery shopping experience in Korea […] | Korea | tech internationalization
[…] Bye Germany | Germany | location
| | location
[…] we need you in Oberhausen! (: How long will you stay in Hamburg? | Oberhausen and Hamburg | location
[…] I have heard you speak german […] | german | Language
| | location
Belgium fries candidates to UNESCO patrimony! | Belgium | International news
Arab Revolution power […] | Arab Revolution | International news
Had to call my Pops so he could send me a couple of Mrs. Dash seasonings to Puerto Rico! | Puerto Rico | location
| Puerto Rican | culture
[…] Have a good one and salute from Puerto Rico! […] | Puerto Rico | location
looking forward to eating china food […] | China | gastronomy and restaurants
| | location
Off to Geneva for a show at a fancy birthday party :-) | Geneva | travel plans
| | location
Buskers Festival in Bern was fantastic! | Bern | event location
We have the honor of performing at one of Switzerland's most exclusive Hotels tonight […] | Switzerland | location
[…] at the annual Swiss Congress of Radiology […] | Swiss | event location
| german | Language of resource
This weekend we'll be traveling to Barcelona! Can't wait for another amazing European Yo-Yo Meeting. […] | Barcelona and European | travel plans
| Zürich | event location
| Swiss | culture
We're heading to Augsburg/Germany today. […] | Augsburg/Germany | travel plans
We are grateful about contacts during Live! Global Performing Arts Exchange Singapore […] | Singapore | event location
Tomorrow is our concert in Dresden […] | Dresden | event location
Today CD-Release in Germany […] | Germany | event location
| | event location
Got home tonight from Berlin […] | Berlin | accomplished travel
| Dresden | event location
| Russian | opinion
| Russian | opinion
visited the 4th marketing-day in austria […] | austria |
[…] Hi back from germany ;o) | germany |
Afrikaans Video: giving fracking a drilling […] | |
[...] you are at the Schokoladenmuseum. […] | Schokoladenmuseum |
[…] stormy belgium | belgium |
[…] Victor Manuelle is a Puerto Rican artist! […] | |
We had so much fun today :D. Amazing audience in Küttigen!! [...] | Küttigen |
Tonight we're performing at Stadtcasino Frauenfeld. That's the theatre where […] | Stadtcasino Frauenfeld |
A special celebration video for the handball club Oberwil/BL (german) […] | |
Tomorrow Saturday: Special midnight blacklight-performance at […] (Zürich Airport). | |
[…] Sven Epiney, Swiss TV presenter […] | |
[...] will be playing at Kuala Lumpur, Malysia, mid-septembre this year […] | Kuala Lumpur, Malysia |
The Festival Jazztage Dresden is over. We had great musicians here! […] | |
[…] this Scharansky is very dangerous gangster and the leader of terrible russian triads, be afraid broth. | |
[…] be afraid, Lev Scharansky is native of Brighton-beach Russian mafia , very, very dangerous gangster. | |
I am for communism. Swedish communism | Swedish | opinion
[…] Watch servants of the people in Russia […] | Russia | culture
[…] in Russia and Paris, for the people of those countries are so willing to be amused […] | Russia and Paris | opinion
Did you know that about 50 thousand people are killed from snakebites for a year? It is only in India | India | International news
| Ukrainian and Arab World | International news
| Ukraine | International news
If you want to become a part of Euro-2012 […] | Euro 2012 | sports
[…] Brazil... Here I come :D | Brazil | travel plans
| | travel plans
| | gastronomy and restaurants
| Milan | sports
Who's got the bigger melons? Size em up... Turkish Style […] | Turkish | gastronomy and restaurants
| Italian | gastronomy and restaurants
| Ottoman | culture
| Asia | travel recommendation
[…] see my latest thoughts on "Where To Stay" in Istanbul. […] | Istanbul | travel recommendation
[…] great Istanbul food. See more at […] | Istanbul | gastronomy and restaurants
| Turkish | gastronomy and restaurants
[…] turkish eating party […] | Turkish | gastronomy and restaurants
Just finished church in Nigeria […] | Nigeria | culture
wave two fingers in the air... Nigerian Clubbing etiquette […] | Nigerian | culture
[…] currently in Lagos, Nigeria for weekend between volunteer teaching in Ghana. | Lagos, Nigeria and Ghana | travel plans
[…] organizing a month of teaching English in Ghana for June | Ghana | travel plans
Is happy that the german and physics tests are over;)) | german | Language
Haifa was a lot of fun... […] | Haifa | location
So I have a ticket to Amsterdam... now I need to find some one who will come with me to Berlin […] | Amsterdam and Berlin | travel plans
| Munich | opinion
| | opinion
| Munich | opinion
In the aftermath of #Norway attacks, piece by NYTimes on the rise of right-wing movements in Europe […] | Norway and Europe | International news
[…] I hope to see you dancing in Italy (Milan) soon! ;-* | Italy (Milan) | travel plans
[…] Best Wishes and a lot of love from Italy! | Italy | location
[…] But I live in Italy!! :-( | Italy | location
| Angola | tech internationalization
[…] We bet in low cost high Q business software - Portugal / Euro-Asia / USA | Portugal / Euro-Asia | tech internationalization
| Europe | tech internationalization
Doctors have been sentenced to 15 years in prison in #bahrain for treating protesters […] | Bahrain | International news
bahrain tonight should burn.. […] | Bahrain | opinion
First district training in K-town tomorrow, afterwards first individual training in Stuttgart! | Stuttgart | travel plans
| | travel plans
I had a great meeting with the board members from US Youth Soccer Europe...[…] | Europe | sports
The Horn of Africa: Chronicle of a famine foretold […] | The Horn of Africa | International news
BBC News - Japan pensioners volunteer to tackle nuclear crisis […] | Japan | International news
The Norway attacks: Manifesto of a murderer […] | Norway | International news
[…] there's no Such a thing as tourism in Jordan !! How's anything done in this country ? | Jordan | opinion
| Egypt | opinion
next step Japan: Tokyo!! […] | | travel plans
The Ukrainian experience for Arab world by Lionel Beehner […] | |
Me-e-e-t U-u-u-kraine - the champion of all antiratings... Now the 18 Countries Most Likely To Default […] | |
[…] almost worth a trip from munich to berlin :-) | munich and berlin |
Today english garden, chinese tower, german beer! […] | chinese and german |
Forza Milan！We are the champions！ | |
[…] An Italian vibe. A moto and small tables. Thin crust pizzas […] | |
Thinking of Sultans, Belly Dancing, Harem, Gypsy Music, Raki, Meze and Wine... the good old Ottoman Times... | |
[…] The best stop for Midye Dolma in Town. Part of my IST - Asia tour […] | |
Loving Menemen -- Turkish scrambled eggs... […] | |
And I think that's the only thing I really really hate about munich | |
[…] oktoberfest, omg so many drunk people... | oktoberfest |
Aaaaahhh so many tourists in my beloved Munich | |
Gestix ERP software penetrating Angola even without local resellers... […] | |
[…] Gestix Certified from EUR 150 lifetime license […] - Europe and USA-ready | |
I will start working on […] in Saarbrücken | Saarbrücken |
Given the birth rate & z population figures in Egypt, I can't understand how sex is a taboo #justsaying | |
| Japan, Tokio |
| | location
| Valencia, Spain | event location
[…] Austria adopts CC-BY as nation-wide default! | Austria | International news
| Haifa, Israel | event location
#icwsm2011 will take place next week in Barcelona […] | Barcelona | event location
| Madrid | event location
| Madrid | event location
[…] can't wait to listen them at Japan:) […] | Japan | travel plans
| Japan | travel recommendation
[…] JAPAN IS ON #THEVERGE. HMV Tokyo w/ TFT on display […] | | opinion
Leopard Trek to lead tribute to Weylandt in Giro d'Italia [...] | Italia | sports
| | sports
| Basque Country | travel recommendation
Tour of Qatar already history, Tour of Oman starting. | Qatar and Oman | travel plans
Shanghai, China (PVG) Atlanta (ATL) Jun 5, 2011; Tues/Sun westbound; Wed/Mon eastbound. | Shanghai, China | travel plans
| | International news
[…] I use it in China to redirect my website […] | China | tech internationalization
| | location
| | gastronomy and restaurants
Going to Roma by train... | Roma | travel plans
[…] for chileans, a touristic place in London is "The Clinic". | chileans | opinion
| Chile | location
#chile #students #4deagosto several pictures | Chile | event location
#4deagosto […] trending since the students in Chile are protesting to reform education inequality and cost […] | Chile | International news
[…] German Rugby Championship of the Universities !! | german | sports
| | culture
Italy wins vs France 22:21 so amazing | Italy, France | sports
[…] European Tech Tour Gala dinner on Wednesday night in berlin | European and berlin | event location
CROSS INNOVATION ACADEMY this Thursday in Bonn […] | Bonn | event location
| Munich | location
| Berlin | event location
BBC News - Greece says debt talks to avert default 'productive' […] | Greece | International news
[…] to escape from miserable greek reality | greek | opinion
[…] are you sure of this piece of news coz Egypt can't take the consequences […] #Egypt#Jan25#Mubarak | Egypt | reaction to International news
Egypt we are your protectors and your builders and we will start from scratch […] | Egypt | opinion
[…] all i can think of is that after every rainfall must come a rainbow waiting for Egypt's rainbow […] | Egypt | opinion
Feeling 3 m high just for being an Egyptian,Dear country I love you […] | Egyptian | opinion
| | International news
[…] is off to a great start building a reliable and professional taxi network in Athens! | Athens | location
[…] I just read it: "Can Greeks Become Germans?" […] | Greeks, Germans | opinion
Fake Apple store in China […] | China | International news
[…] Student from Sweden sent me […] | Sweden | location
| Greece | opinion
Getting ready for another sunset in Seychelles... […] | Seychelles | location
I know Israel is an internationally known start-ups maker, I just love being reminded […] | Israel | culture
| Libya | International news
[…] You are always invited back to Israel. The summer here is amazing :) | Israel | travel recommendation
My internet connection sucks in Chapala. | Chapala |
[…] Intl. Conf. on Information, Process, and Knowledge Management in Valencia, Spain […] | |
In Haifa, Israel, supporting free, collaborative, and open knowledge at #wikimania 2011[…] | |
Video of yesterday's presentation of #powerofopen at @eoi Madrid | |
At the presentation of @creativecommons book #thepowerofopen at @eoi Madrid | |
[…] needs to tour with […] in Japan! pleaseeeeee!! | Japan, Tokio |
Mario Cipollini’s Milan-San Remo form guide [...] | Milan-San Remo |
A visit to Orbea premises and the Basque Country is always fun […] | |
Bahnunglück in China - Train accident in China [...] | Bahnunglück in China |
at Sapienza...[...] | Sapienza |
all you can eat at thai-jap rest...[...] | Thai-jap |
#4deagosto y #cacerolazo banging on a pot for better education in #Chile | |
[…] useless trivia: Weihenstephan is the eldest brewery in the world ! Big Cheers from Munich ! | Weihenstephan, Munich |
Madvertise @ grow in munich | |
madvertise will celebrate its series A closing party on 29.04. in Berlin -- hope to see you there! […] | |
Security Theater Lessons From Utøya […] | Utøya |
Greece definitelly needs its stateleaks, too […] | |
Microsoft Country Manager In Libya Detained By Authorities […] | |
[…] You should come to Israel during the summer, you'll have a blast! | Israel | travel recommendation
Israeli band Orphaned Land rocks Turkey, despite discord […] | Israeli and Turkey | culture
| Israel and Amsterdam | location
New week (@ UPC Nederland w/ 2 others) […] | UPC Nederland | location
| Italy | culture
[…] Is this for real? (Hebrew) […] | Hebrew | Language of resource
| Jerusalem | location
[…] I didn't know there was one in Haifa […] #gdd11 | Haifa | event location
| | travel plans
Tomorrow we will rock Vienna! | Vienna | travel plans
Tickets for our 5th Anniversary concert in Graz are now available![…] | Graz | event location
[…] For this special occasion we will play a show in Graz […] | Graz | event location
| Graz | event location
| Europe | event location
New Review 9/10 Points (German) […] | german | Language of resource
| Palestine and Israel | opinion
| | event location
| Japan | reaction to International news
Google china is a joke | China | tech internationalization
Cooking in french :p | french | Language
[…] You might be interested at this: Are Chinese moms better than Western moms? […] | Chinese | opinion
Are Chinese moms better than Western moms? […] | Chinese | opinion
Staying in Montreal, learning French simultaneously […] | Montreal and French | location, language
Madrid is much better choice :) RT […] Paris is well located to […] between Seattle and Beijing […] | Madrid, Paris, Beijing | opinion
Please consider coming to Spain, too. Thanks for an unforgettable time! | Spain | travel recommendation
[…] Demonstrations all over #spain since last sunday #15m for #real #democracy | Spain | International news
[…] National Researchers System (SNI in Spanish) […] | Spanish | Language
Bomb attack at Moscow airport […] | Moscow | International news
| Melanesians | International news
short english translation (sorry for the bad english) for out international Fans in UK, Russia, Brazil […] | Russia and Brazil | location
| Germany | event location
Come to see us and our british friends from The Dissociates in Aachen […] | Aachen | travel plans
[…] All the video material is from their European Tour with us last year! […] | European | event location
Aida conference in Pavia on social networks | Pavia | event location
Editing an article about Italian case-law on liability of ISPs | Italian | a country's policy
Look forward to experiencing new Italian opposition procedure | Italian | a country's policy
| | event location
| | location
| | location
#TribalDDB Lisbon manages 2 of the most engaging Facebook Pages in Portugal […] | Lisbon, Portugal | tech internationalization
[…] I´m anxious to buy the Tower of Belem, here in Lisbon :-) | Lisbon | location
| Barcelona | location
| Portuguese | tech internationalization
Portuguese RTS game for PS3 gets an incredible cinematic trailer […] | Portuguese | tech internationalization
Portugal Gives Itself a Clean-Energy Makeover | Portugal | International news
@phillord @dullhunk I bet that's his name in greek | greek | Language
@timoreilly […] basques also :-) […] | basques | tech internationalization
My daughter is coming back home from Israel and I'm waiting by the gate (@ Amsterdam Airport Schiphol) | |
[…] Liberation Day by the Papal State :: #welldone #italy #papal #carnival […] | |
Outbrain's weekend at Jerusalem :) […] | |
We will play a guest show on the upcoming Black Trolls Over Europe Tour. […] next week in Traun, Austria!... […] | Europe and Traun, Austria |
Franz Loechinger will be drumming for ILLUMINATA on Fr. 4.3. in Graz (Explosiv) […] | |
Franz Loechinger will hit the drums on the Black Trolls Over Europe Tour […] | |
Apple removed an app of the Palestine Third Intifada just like facebook, Israel is controlling Media? | |
@adobe should organise an event for nord africa just like @google (gmaghreb) | Nord Africa |
Hope that everyone in Japan is fine... #prayforjapan | |
Archaic Denisovans (hominin group) contributed to modern Melanesians! […] | |
The Dissociates coming to Germany for the "HighFünf tour" […] | |
Miss World Tourism 2011, In Kefalonia […] | Kefalonia |
Summer Night in Argostoli (Kefalonia)... […] | Argostoli (Kefalonia) |
Swimming in the winter sea of Lixouri (Kefalonia)... […] | Lixouri (Kefalonia) |
Got a great time at hyperisland in Barcelona […] | |
Under Siege™, the portuguese RTS videogame for PS3 won today the first prize […] | |
We want to translate Twitter to Basque,support us! […] | Basque | tech internationalization
| Israel | culture
Travel in Eilat is over. The Red Sea is amazing and the water is so clear. but Tel Aviv weather […] much better. | Eilat, Red Sea, Tel Aviv | travel plans, opinion
Happy Israel Independence Day! Fireworks in the sky tonight! | Israel | culture
Good dim sums in Brussels? Does it even exist? | Brussels | opinion
in brussels ... no diving sites :( | Brussels | location
Arrived in Melaka in Malaysia, but everything is closed early tonight. Will check Chinese shopping tomorrow :) | Melaka, Malaysia, Chinese | location
| Chinese | gastronomy and restaurants
[…]...but just vocabulary). I have one but in Polish and I want something like this in English […] | Polish | Language
[…] I'm looking for some computer program to learn German vocabulary […] | german | Language
preparing to dance my ass off for haiti!!!! | haiti | reaction to International news
| Frankfurt | travel plans
| | travel plans
| Frankfurt | location
back from berlin...tired now | Berlin | accomplished travel
getting my hair cut, then flying to berlin... | Berlin | travel plans
berlin is calling and i am following.everybody from berlin meet me at […] | Berlin | travel plans
[…] i love busted!!![…]....Spain are with you!!!;))) | Spain | opinion
Brazil: Death of Forest Defender Couple is a Shame to the Country […] | Brazil | International news
| Greek, Spanish, european | International news
| | International news
[…] Stay in Japan to work for this f**king company? NO WAY!! | Japan | location
Won won won!!! Japan has become the Queen!!! | Japan | sports
Sorting algorithms demonstrated with Hungarian folk dance […] | Hungarian | culture/humor
Which countries match the GDP of America's states? […] California is Italy! But... Italy has 20M more people... | Italy | International news
beautiful night view of Italy taken from International Space Station #ISS […] | Italy | International news
Why Young Italians Are Leaving […] | Italians | International news
[…] China's fake Apple stores […] | China | International news
World Cup joy for Japan […] | Japan | sports
Watching on ITV1 #England vs #Switzerland […] | Switzerland | sports
| Mongolia | sports
| Chinese, China | tech internationalization
[…] U ARE FAMOUS IN CHINESE TWITTER (WEIBO) […] | Chinese | tech internationalization
| China | reaction to International news
Tomorrow will be my first spanish test […] | Spanish | Language
| | International news
| Japan | sports
| | tech internationalization
Watching a 1965 BW Polish film set in Spain... […] | Polish, Spain | movies
Listening to Spanish football games […] | Spanish | sports
Tu BeAv - The jewish holyday of Love in Israel | |
Rain! And a spicy Chinese nuddles […] | |
tomorrow at mosaiic bar frankfurt... | |
[…]..finished munich...off to langenselbold with […
]
m
un
ich
, l
an
ge
ns
el
bo
ld
to
ni
gh
t a
t m
os
ai
ic 
ba
r f
ra
nk
fu
rt.
..[
…
]
[…
] G
re
ek
 a
nd
 S
pa
ni
sh
 y
ou
ng
 p
eo
pl
e 
oc
cu
py
in
g 
Tr
af
al
ga
r s
qu
ar
e 
#e
ur
op
ea
nr
ev
ol
ut
io
n 
#u
kr
ev
ol
ut
io
n 
#L
on
do
n
Ti
m
 H
et
he
rin
gt
on
 is
 k
ille
d 
in
 #
M
isr
at
a!
 […
] #
Li
by
a
M
isr
at
a,
 L
ib
ya
Ha
ku
ho
 M
on
go
lia
's 
be
st
 p
ai
d 
sp
or
ts
 s
ta
r -
 Y
ah
oo
! E
ur
os
po
rt 
[…
]
[…
] C
hi
ne
se
 o
ffic
ia
l m
ed
ia
 […
] q
uo
te
d 
ur
 o
pi
ni
on
 a
bo
ut
 O
ba
m
a 
fro
m
 tw
itt
er
 (a
 w
eb
sit
e 
bl
oc
ke
d 
in
 C
hi
na
) !
!! 
[…
]
#m
ol
k 
BU
T 
th
is 
th
ou
gh
t i
s 
ba
se
d 
on
 th
e 
re
po
rt 
I r
ea
d 
in
 th
e 
CH
IN
A.
 […
]
Bo
m
bi
ng
 in
 O
slo
 a
nd
 s
ho
ot
in
g 
at
 U
tø
ya
 ! 
[…
]
O
slo
 a
nd
 U
tø
ya
Ja
pa
n 
wo
n 
!! 
[…
] #
wo
rld
cu
pf
in
al
Ne
tfl
ix 
Br
as
il b
lo
g 
[…
]
Br
as
il
199
Bibliography
[1] Workshop on novelty and diversity in recommender systems - DiveRS 2011. In
Pablo Castells, Jun Wang, Rubén Lara, and Dell Zhang, editors, Proceedings
of the fifth ACM conference on Recommender systems - RecSys ’11, pages
393–394, New York, New York, USA, October 2011. ACM Press. 1.2.1
[2] A Toolkit for Transnational Communication in Europe. In J. Normann
Jørgensen, editor, The Copenhagen Studies in Bilingualism Vol. 64, 2011.
2.1
[3] Proceedings of the 3rd Workshop on the Multilingual Semantic Web (MSW3).
In Paul Buitelaar, Philipp Cimiano, David Lewis, James Pustejovsky, and
Felix Sasaki, editors, International Semantic Web Conference, volume 936,
Boston, 2012. CEUR. URL http://ceur-ws.org/Vol-936/. Last accessed
Oct 30, 2013. 1.2.1
[4] Meshary AlMeshary and Abdolreza Abhari. A recommendation system for
Twitter users in the same neighborhood. In Proceedings of the 16th Commu-
nications & Networking Symposium, pages 1–5, San Diego, California, April
2013. Society for Computer Simulation International. 8.2.1
[5] Jannis Androutsopoulos. Language Choice and Code Switching in German-
Based Diasporic Web Forums. In Brenda Danet and Susan Herring, editors,
The Multilingual Internet: Language, Culture, and Communication Online,
chapter 15. Oxford University Press, New York, 2007. 2.5, 2.5, 3.1, 8.1
[6] Jannis Androutsopoulos. Localizing the Global on the Participatory Web.
In Nikolas Coupland, editor, The Handbook of Language and Globalization,
chapter 9, pages 203–231. Wiley-Blackwell, Malden, MA, 2010. 1.2.2, 2.4,
2.4.1, 2.5, 4.6, 8.2.2
[7] Albert-Laszlo Barabasi. Linked: How Everything Is Connected to Everything
Else and What It Means for Business, Science, and Everyday Life. Plume,
2003. 2.3
[8] Bettina Berendt and Anett Kralisch. A user-centric approach to identifying
best deployment strategies for language tools: the impact of content and access
language on Web user behaviour and attitudes. Information Retrieval, 12(3):
380–399, January 2009. 1, 2.4, 2.5, 3.1
[9] Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and
Theresa Wilson. Language identification for creating language-specific Twit-
ter collections. In LSM ’12 Proceedings of the Second Workshop on Language
in Social Media, pages 65–74. Association for Computational Linguistics, June
2012. 3.3, 4.7
[10] Carter T. Butts. Social network analysis: A methodological introduction.
Asian Journal Of Social Psychology, 11(1):13–41, March 2008. 5
[11] Louis-Jean Calvet. Towards an Ecology of World Languages. Polity Press,
Cambridge, 2006. 2.2
[12] Mónica Stella Cárdenas-Claros and Neny Isharyanti. Code switching and code
mixing in Internet chatting: between yes, ya, and si a case study. The Journal
of the JALT CALL SIG, 5(3):67–78, 2009. 3.1
[13] Manuel Castells. Communication, Power and Counter-power in the Network
Society. International Journal of Communication, 1:238–266, 2007. 1, 2.2
[14] Vint Cerf. The Internet is for Everyone, 1999. URL http://www.
internetsociety.org/internet-everyone. Last accessed Oct 30, 2013. 1.2
[15] Hsia-Ching Chang. A new perspective on Twitter hashtag use: Diffusion
of innovation theory. Proceedings of the American Society for Information
Science and Technology, 47(1):1–4, November 2010. 7.4.2
[16] Alok Choudhary, William Hendrix, Kathy Lee, Diana Palsetia, and Wei-Keng
Liao. Social media evolution of the Egyptian revolution. Communications of
the ACM, 55(5):74–80, May 2012. 1.2.2, 7.4.2
[17] Juliet Corbin and Anselm Strauss. Basics of Qualitative Research: Techniques
and Procedures for Developing Grounded Theory. SAGE Publications, Inc, 3rd
edition, 2007. 5
[18] Nikolas Coupland. Introduction: Sociolinguistics in the Global Era. In Nikolas
Coupland, editor, The Handbook of Language and Globalization, chapter 0,
pages 1–27. Wiley-Blackwell, Malden, MA, 2010. 1.2.2, 2.5
[19] Angela Creese and Peter Martin. Introduction to Volume 9: Ecology of Lan-
guage. In Angela Creese, Peter Martin, and Nancy H. Hornberger, editors,
Ecology of Language - Encyclopedia of Language and Education Volume 9,
pages i–vi. Springer, 2nd edition, 2008. 2.2
[20] David Crystal. English as a Global Language. Cambridge University Press,
2nd edition, 2003. 2.4
[21] Daniel Cunliffe, Delyth Morris, and Cynog Prys. Investigating the Differ-
ential Use of Welsh in Young Speakers’ Social Networks: A Comparison of
Communication in Face-to-Face Settings, in Electronic Texts and on Social
Networking Sites. In Elin Haf Gruffydd Jones and Enrique Uribe-Jongbloed,
editors, Social Media and Minority Languages: Convergence and the Creative
Industries, pages 75–86. Multilingual Matters, Bristol, Buffalo, Toronto, 2013.
3.1
[22] danah Boyd and Kate Crawford. Six Provocations for Big Data. In A Decade
in Internet Time: Symposium on the Dynamics of the Internet and Society.
SSRN Electronic Journal, September 2011. URL http://papers.ssrn.com/
abstract=1926431. Last accessed Oct 30, 2013. 4.9
[23] Brenda Danet and Susan Herring. Introduction: Welcome to the Multilin-
gual Internet. In Brenda Danet and Susan Herring, editors, The Multilingual
Internet: Language, Culture, and Communication Online, chapter 1. Oxford
University Press, New York, 2007. 2.4, 4.7
[24] Stephen Dann. Twitter content classification. First Monday, 15(12), Novem-
ber 2010. URL http://firstmonday.org/ojs/index.php/fm/article/
view/2745/2681. Last accessed Oct 30, 2013. 7.4
[25] Abram De Swaan. The Evolving European Language System: A Theory of
Communication Potential and Language Competition. International Political
Science Review, 14(3):241–255, January 1993. 2.1, 2.5, 5.4
[26] Abram De Swaan. The Emergent World Language System: An Introduction.
International Political Science Review, 14(3):219–226, January 1993. 2.1
[27] Abram De Swaan. Language Systems. In Nikolas Coupland, editor, The Hand-
book of Language and Globalization, chapter 2, pages 56–76. Wiley-Blackwell,
Malden, MA, 2010. 2.1
[28] Murat Demirbas, Murat Ali Bayir, Cuneyt Gurcan Akcora, Yavuz Selim Yil-
maz, and Hakan Ferhatosmanoglu. Crowd-sourced sensing and collaboration
using twitter. In 2010 IEEE International Symposium on “A World of Wire-
less, Mobile and Multimedia Networks” (WoWMoM), pages 1–9. IEEE, June
2010. 1.1
[29] Jay L. Devore. Probability and Statistics for Engineering and the Sciences.
Thomson Brooks/Cole, Belmont, CA, 7th edition, 2008. 6.2, 6.2
[30] Danny Dor. From Englishization to Imposed Multilingualism: Globalization,
the Internet, and the Political Economy of the Linguistic Code. Public Culture,
16(1):97–118, 2004. 1, 2.2, 2.4
[31] Mercedes Durham. Language Choice on a Swiss Mailing List. In Brenda Danet
and Susan Herring, editors, The Multilingual Internet: Language, Culture, and
Communication Online, chapter 14. Oxford University Press, New York, 2007.
3.1, 7.2, 8.1
[32] Bruce Etling, John Kelly, Robert Faris, and John Palfrey. Mapping the Arabic
blogosphere: politics and dissent online. New Media & Society, 12(8):1225–
1243, December 2010. 1.2.2, 3.2, 5.4, 8.1
[33] Madelyn Flammia and Carol Saunders. Language as power on the Internet.
Journal of the American Society for Information Science and Technology, 58
(12):1899–1903, October 2007. 2.4
[34] C. Fuchs. The Role of Income Inequality in a Multivariate Cross-National
Analysis of the Digital Divide. Social Science Computer Review, 27(1):41–58,
April 2008. 1.2, 2.4.2
[35] Gephi.org. Gephi Tutorial Layouts — Gephi.org, 2011. URL http://gephi.
org/tutorials/gephi-tutorial-layouts.pdf. Last accessed Oct 30, 2013.
5
[36] Jean D. Gibbons. Nonparametric Statistics: An Introduction (Quantitative
Applications in the Social Sciences). SAGE Publications, Inc, 1993. 7.2
[37] Global Voices. About Global Voices, 2007. URL http://
globalvoicesonline.org/about/. Last accessed Oct 28, 2013. 1.2.2
[38] Jennifer Golbeck. Analyzing the Social Web. Morgan Kaufmann, 2013. 2.3.1,
2.3.1, 2.3.1, 5, 5.4
[39] David Graddol. English Next. Technical report, British Coun-
cil, 2006. URL http://www.britishcouncil.org/learning-research-
englishnext.htm. Last accessed Oct 30, 2013. 2.4
[40] Mark Graham, Scott A. Hale, and Devin Gaffney. Where in the World are
You? Geolocation and Language Identification in Twitter. The Professional
Geographer, 2013. 3.3, 4.7
[41] Mark Granovetter. The Strength of Weak Ties. American Journal of Sociology,
78(6):1360–1380, 1973. 2.3.1, 2.3.1
[42] Mark Granovetter. The Strength of Weak Ties: A Network Theory Revisited.
Sociological Theory, 1:201–233, 1983. 2.3.1
[43] Jeffrey Graves. Python Language Detector, 2012. URL https://github.com/
decultured/Python-Language-Detector/blob/master/README.md. Last
accessed Oct 4, 2013. 4.3.1
[44] Alexander Halavais. National Borders on the World Wide Web. New Media
& Society, 2(1):7–28, March 2000. 1, 1.2.1
[45] Scott Hale. Translating Twitter, 2011. URL http://www.scotthale.net/
blog/?p=152. Last accessed Oct 30, 2013. 1.2.2
[46] Scott Hale. Online language bubbles: the last frontier?, 2012. URL
http://freespeechdebate.com/en/discuss/online-language-bubbles-
the-last-frontier/. Last accessed Oct 23, 2013. 1.2.1, 5.4
[47] Scott A. Hale. Net Increase? Cross-Lingual Linking in the Blogosphere. Jour-
nal of Computer-Mediated Communication, 17(2):135–151, January 2012. 1,
1.2.1, 3.2, 7.4.1, 8.1
[48] Einar Haugen. The Ecology of Language. In Anwar S Dil, editor, Essays by
Einar Haugen. Stanford University Press, Stanford, CA, 1972. 2.2
[49] Brent Hecht and Darren Gergle. The Tower of Babel Meets Web 2.0: User-
Generated Content and Its Applications in a Multilingual Context. In Pro-
ceedings of the 28th international conference on Human factors in computing
systems - CHI ’10, pages 291–300, New York, New York, USA, April 2010.
ACM Press. 1.2.1, 2.4.1
[50] Amir Helzer. Localizing for software, websites and global apps. Multilingual,
22(3):34–37, 2011. 1.2.1, 1.2.3
[51] Alfred Hermida. From TV to Twitter: How Ambient News Became Ambient
Journalism. Media/Culture Journal, 13(2), 2010. URL http://ssrn.com/
paper=1732603. Last accessed Oct 30, 2013. 1.1
[52] Susan Herring. Web Content Analysis: Expanding the Paradigm. In Jeremy
Hunsinger, Lisbeth Klastrup, and Matthew Allen, editors, International Hand-
book of Internet Research, chapter 11, pages 233–249. Springer Verlag, Berlin,
2010. 1.4, 4
[53] Susan Herring, John Paolillo, Irene Ramos-Vielba, Inna Kouper, Elijah
Wright, Sharon Stoerger, Lois Scheidt, and Benjamin Clark. Language Net-
works on LiveJournal. In Proceedings of the 40th Annual Hawaii International
Conference on System Sciences (HICSS’07), pages 79–90. IEEE Computer So-
ciety, January 2007. 1, 1.2.1, 2.2, 3.2, 4, 5.4, 7.4.1, 8.1
[54] Courtenay Honeycutt and Susan C. Herring. Beyond Microblogging: Con-
versation and Collaboration via Twitter. In Proceedings of the 42nd Annual
Hawaii International Conference on System Sciences (HICSS’09), pages 1–10.
IEEE Computer Society, December 2009. 7, 7.2
[55] Lichan Hong, Gregorio Convertino, and Ed Chi. Language Matters in Twit-
ter: A Large Scale Study. In Proceedings of the Fifth International AAAI
Conference on Weblogs and Social Media, volume 91, pages 518–521. AAAI
Publications, 2011. 1, 2.5, 3.3, 3.3, 5.4, 7.3, 8.1
[56] Nancy H. Hornberger. Multilingual language policies and the continua of
biliteracy: An ecological approach. Language Policy, 1(1):27–51, March 2002.
2.2, 2.5
[57] Jeff Huang, Katherine M. Thornton, and Efthimis N. Efthimiadis. Conver-
sational Tagging in Twitter. In Proceedings of the 21st ACM conference on
Hypertext and hypermedia - HT ’10, pages 173–178, New York, New York,
USA, June 2010. ACM Press. 7.4.2
[58] International Telecommunication Union. ITU Measuring the Information
Society. Technical report, Geneva, 2011. URL http://www.itu.int/ITU-
D/ict/publications/idi/. Last accessed Oct 30, 2013. 1
[59] Internet Society. Who We Are. URL http://www.internetsociety.org/
who-we-are. Last accessed Oct 28, 2013. 1.2
[60] Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why we twitter:
understanding microblogging usage and communities. In Proceedings of the
9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social
network analysis - WebKDD/SNA-KDD ’07, pages 56–65, New York, New
York, USA, August 2007. ACM Press. 2.5, 4.2
[61] Ian Johnson. Audience Design and Communication Accommodation Theory:
Use of Twitter by Welsh-English Biliterates. In Elin Haf Gruffydd Jones
and Enrique Uribe-Jongbloed, editors, Social Media and Minority Languages:
Convergence and the Creative Industries, chapter 6, pages 99–118. Multilingual
Matters, Bristol, Buffalo, Toronto, 2013. 2.5, 3.1, 7.3
[62] Aravind K. Joshi. Processing of sentences with intra-sentential code-switching.
In Proceedings of the 9th conference on Computational linguistics -, volume 1,
pages 145–150, Morristown, NJ, USA, July 1982. Association for Computa-
tional Linguistics. 2.5, 7.4.2
[63] M Kaiser, M Görner, and C C Hilgetag. Criticality of spreading dynamics in
hierarchical cluster networks without inhibition. New Journal of Physics, 9
(5):110–110, May 2007. 2.2
[64] Krishna Yeshwanth Kamath and James Caverlee. Transient crowd discovery
on the real-time social web. In Proceedings of the fourth ACM international
conference on Web search and data mining - WSDM ’11, pages 585–594, New
York, New York, USA, February 2011. ACM Press. 4.6
[65] Helen Kelly Holmes. An Analysis of the Language Repertoires of Students in
Higher Education and their Language Choices on the Internet. International
Journal of Multicultural Societies, 6(1):52–75, 2004. 2.5, 3.1, 7.2, 8.1
[66] Farshad Kooti, Haeryun Yang, Meeyoung Cha, Krishna Gummadi, and Winter
Mason. The Emergence of Conventions in Online Social Networks. In Inter-
national AAAI Conference on Weblogs and Social Media, 2012. URL http:
//www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4661. Last
accessed Oct 30, 2013. 2.5, 7.1, 7.4.2
[67] A. Kralisch and B. Berendt. Language-sensitive search behaviour and the
role of domain knowledge. New Review of Hypermedia and Multimedia, 11(2):
221–246, December 2005. 1, 3.1
[68] A. Kralisch and T. Mandl. Barriers to Information Access across Languages on
the Internet: Network and Language Effects. In Proceedings of the 39th Annual
Hawaii International Conference on System Sciences (HICSS’06), volume 3,
page 54b. IEEE Computer Society, January 2006. 1
[69] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twit-
ter, a social network or a news media? In Proceedings of the 19th international
conference on World wide web - WWW ’10, pages 591–600, New York, New
York, USA, April 2010. ACM Press. 1.1, 1.1
[70] David Laniado and Peter Mika. Making Sense of Twitter. In Peter F. Patel-
Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian
Horrocks, and Birte Glimm, editors, The Semantic Web ISWC 2010, volume
6496 of Lecture Notes in Computer Science, pages 470–485. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2010. 7.4.2
[71] Nadège Lechevrel. L’écolinguistique : une discipline émergente? Revue des
étudiants en linguistique du Québec - Quebec Student Journal of Linguistics,
3(1):18–38, 2008. 2.2
[72] Julie Letierce, Alexandre Passant, John Breslin, and Stefan Decker. Under-
standing how Twitter is used to spread scientific messages. In Proceedings of
the WebSci10: Extending the Frontiers of Society On-Line, Raleigh, NC: US,
2010. URL http://journal.webscience.org/314/. Last accessed Oct 30,
2013. 7.4.2
[73] David Lewis, Stephen Curran, Gavin Doherty, Kevin Feeney, Nikiforos Kara-
manis, Saturnino Luz, and John McAuley. Supporting Flexibility and Aware-
ness in Localisation Workflows. The International Journal of Localisation, 8
(1):29–38, 2009. 2.4
[74] Literature Across Frontiers. Publishing Translations in Europe. Trends 1990-
2005. Technical report, Mercator Institute for Media, Languages and Culture,
2010. 2.3.2
[75] Gilad Lotan. #OccupyWallStreet: origin and spread visualized — So-
cialFlow blog, 2011. URL http://blog.socialflow.com/post/7120244404/
occupywallstreet-origin-and-spread-visualized. Last accessed Oct 30,
2013. 1.2.2
[76] Gilad Lotan. Data Reveals That Occupying Twitter Trending Topics is
Harder Than it Looks!, 2011. URL http://blog.socialflow.com/post/
7120244374/data-reveals-that-occupying-twitter-trending-topics-
is-harder-than-it-looks. Last accessed Oct 30, 2013. 1.2.2
[77] Gilad Lotan, Erhardt Graeff, Mike Ananny, Devin Gaffney, Ian Pearce, and
danah Boyd. The Revolutions Were Tweeted: Information Flows during the
2011 Tunisian and Egyptian Revolutions. International Journal of Commu-
nication, 5:1375–1405, 2011. 1, 1.1, 1.2.2
[78] Safari Mafu. From the Oral Tradition to the Information Era: The Case of
Tanzania. International Journal of Multicultural Societies, 6(1):99–124, 2004.
1
[79] Christopher Manning. Logistic Regression (with R), 2007. URL http:
//nlp.stanford.edu/~manning/courses/ling289/logistic.pdf. Last ac-
cessed Oct 8, 2013. 6.2
[80] Cameron A. Marlow. The Structural Determinants of Media Contagion. PhD thesis,
Massachusetts Institute of Technology, 2005. 1.2.1, 2.3.1, 2.3.1, 5.1
[81] A. E. Marwick and d. Boyd. I tweet honestly, I tweet passionately: Twitter
users, context collapse, and the imagined audience. New Media & Society, 13
(1):114–133, July 2010. 2.5, 3.1, 6.4, 7
[82] Cheryl Metoyer-Duran. Gatekeepers in Ethnolinguistic Communities. Infor-
mation Management, Policy and Services. Ablex Publishing Corporation, Nor-
wood, New Jersey, 1993. 2.3.1, 2.3.2, 2.5, 5.4
[83] Delia Mocanu, Andrea Baronchelli, Nicola Perra, Bruno Gonçalves, Qian
Zhang, and Alessandro Vespignani. The Twitter of Babel: mapping world
languages through microblogging platforms. PloS one, 8(4):e61981, January
2013. URL http://dx.plos.org/10.1371/journal.pone.0061981. Last ac-
cessed Oct 30, 2013. 2.3.2, 2.3, 2.5, 3.2, 3.3, 3.1, 3.3, 4.2, 4.7, 5.4, 8.1
[84] David Nadeau and Satoshi Sekine. A survey of named entity recognition and
classification. Lingvisticae Investigationes, 30(1):3–26, January 2007. 4.4.2
[85] Ory Okolloh. Ushahidi, or ’testimony’: Web 2.0 tools for crowdsourcing crisis
information. Participatory Learning and Action, 59(1):65–70, 2009. 1
[86] Eli Pariser. The Filter Bubble: What the Internet is Hiding from You. The
Penguin Press, New York, 2011. 1.2.1
[87] Carol Peters, Martin Braschler, and Paul Clough. Multilingual Information
Retrieval: From Research To Practice. Springer, 2012. 1.2.1
[88] Isabella Peters. Folksonomies. Indexing and Retrieval in Web 2.0. Knowledge
and Information. De Gruyter, Berlin, 2009. 2.4.1
[89] Daniel Pimienta, Daniel Prado, and Álvaro Blanco. Twelve years of measur-
ing linguistic diversity in the Internet: balance and perspectives — UNESCO
publications for the World Summit on the Information Society. Technical re-
port, United Nations Educational, Scientific and Cultural Organization, Paris,
2009. 1, 2.4
[90] Barbara Poblete, Ruth Garcia, Marcelo Mendoza, and Alejandro Jaimes. Do
all birds tweet the same?: characterizing Twitter around the world. In Pro-
ceedings of the 20th ACM international conference on Information and knowl-
edge management - CIKM ’11, pages 1025–1030, New York, New York, USA,
October 2011. ACM Press. 2.5, 3.2, 4.2
[91] James E. Prieger. The broadband digital divide and the economic benefits of
mobile broadband for rural areas. Telecommunications Policy, 37(6):483–502,
2013. 1.2, 2.4.2
[92] Pei-Luen Patrick Rau, Tom Plocher, and Yee-Yin Choong. Cross-Cultural
Design for IT Products and Services. CRC Press, 2012. 1.2.1
[93] Dana Rotman, Jennifer Preece, Yurong He, and Allison Druin. Extreme
ethnography. In Proceedings of the 2012 iConference, pages 207–214, New
York, New York, USA, February 2012. ACM Press. 4.6
[94] C.E. Shannon. A Mathematical Theory of Communication. The Bell System
Technical Journal, 27(3):379–423, 1948. 6.1, 6.1
[95] Katie Shilton, Jes A. Koepfler, and Kenneth R. Fleischmann. How to See
Values in Social Computing: Methods for Studying Values Dimensions. In (To
appear in) Proceedings of the ACM 2014 conference on Computer Supported
Cooperative Work - CSCW ’14. ACM Press, 2014. 1.2.3
[96] David Sims. Understanding place and space in a digital Babel. The nuances
of location language, 2012. URL http://radar.oreilly.com/2012/03/
location-unstructured-non-english-health-outbreak.html. Last ac-
cessed Oct 30, 2013. 8.2.1
[97] Richard L. Sites. Language Technology Ecosystem, 2011. URL http://www.
hltd.org/alex.pdf. Last accessed Oct 30, 2013. 4.3.1
[98] Kate Starbird and Leysia Palen. “voluntweeters”: Self-organizing by digital
volunteers in times of crisis. In Proceedings of the 2011 annual conference on
Human factors in computing systems - CHI ’11, pages 1071–1080, New York,
New York, USA, May . ACM Press. 1, 1.2.2
[99] Yuri Takhteyev, Anatoliy Gruzd, and Barry Wellman. Geography of Twitter
networks. Social Networks, 34(1):73–81, 2012. 3.2, 4.2, 4.3
[100] Steven L. Thorne, Rebecca W. Black, and Julie M. Sykes. Second Language
Use, Socialization, and Learning in Internet Interest Communities and Online
Gaming. The Modern Language Journal, 93:802–821, December 2009. 1, 2.4
[101] Twitter Help Center. Age screening on Twitter. URL https://support.
twitter.com/articles/20169945-age-screening-on-twitter. Last ac-
cessed Oct 7, 2013. 4.9
[102] Claire Ulrich. Technological Developments for African Languages. Multilin-
gual, 21(5):51–53, 2010. 1, 2.4
[103] UNESCO. Recommendation Concerning the Promotion and Use of Mul-
tilingualism and Universal Access to Cyberspace, 2003. URL http:
//www.unesco.org/new/en/communication-and-information/about-
us/how-we-work/strategy-and-programme/promotion-and-use-of-
multilingualism-and-universal-access-to-cyberspace/. Last accessed
Oct 30, 2013. 1.2, 2.4
[104] Federico Vazquez, Xavier Castelló, and Maxi San Miguel. Agent based models
of language competition: macroscopic descriptions and order-disorder tran-
sitions. Journal of Statistical Mechanics: Theory and Experiment, 2010
(04):P04007, 2010. URL http://iopscience.iop.org/1742-5468/2010/
04/P04007/. Last accessed Oct 30, 2013. 2.3.2
[105] Sarah Vieweg, Amanda L. Hughes, Kate Starbird, and Leysia Palen. Mi-
croblogging during two natural hazards events. In Proceedings of the 28th
international conference on Human factors in computing systems - CHI ’10,
pages 1079–1088, New York, New York, USA, April 2010. ACM Press. 1.1
[106] Jessica Vitak, Cliff Lampe, Rebecca Gray, and Nicole B. Ellison. “Why won’t
you be my Facebook friend?”. In Proceedings of the 2012 iConference, pages
555–557, New York, New York, USA, February 2012. ACM Press. 3.1
[107] Barney Warf. Geographies of global Internet censorship. GeoJournal, 76(1):
1–23, November 2010. 1.2, 2.4.2
[108] Mark Warschauer, Ghada El Said, and Ayman Zohry. Language Choice On-
line: Globalization and Identity in Egypt. In Brenda Danet and Susan Herring,
editors, The Multilingual Internet: Language, Culture, and Communication
Online, chapter 13. Oxford University Press, New York, 2007. 2.4, 2.5, 3.1,
5.4
[109] Duncan J. Watts. The “New” Science of Networks. Annual Review of Sociol-
ogy, 30:243–270, 2004. 2.3
[110] Wouter Weerkamp, Simon Carter, and Manos Tsagkias. How People use
Twitter in Different Languages. In WebSci Conference 2011, Koblenz, Ger-
many, June 2011. URL http://journal.webscience.org/539/2/Table1.
png. Last accessed Oct 30, 2013. 3.3, 7.2, 7.3, 7.4.2
[111] Li Wei. The Bilingualism Reader, volume 24. Routledge, London, July 2000.
2.5
[112] Howard T. Welser, Eric Gleave, Danyel Fisher, and Marc Smith. Visualizing
the Signatures of Social Roles in Online Discussion Groups. JoSS: The Journal
of Social Structure, 8(2):1–31, 2007. 5.2
[113] George Weyman. Translating Tweets from the Arab Spring: Towards a Trans-
lation Workbench for Twitter, 2012. URL http://meedan.org/2012/03/
translation-twitter-middle-east-arabic/. Last accessed Oct 30, 2013.
8.2.1
[114] Leo Widrich. How Twitter evolved from 2006 to 2011, 2011. URL http:
//blog.bufferapp.com/how-twitter-evolved-from-2006-to-2011. Last
accessed Oct 28, 2013. 1.1
[115] World Summit of the Information Society. Building the Information Society: a
global challenge in the new Millennium. Declaration of Principles, 2003. URL
http://www.itu.int/wsis/basic/about.html. Last accessed Oct 30, 2013.
1.2
[116] Sue Wright. Multilingualism on the Internet - Thematic introduction. Inter-
national Journal of Multicultural Societies, 6(1):5–13, 2004. 1
[117] John Yunker. Beyond Borders: Web Globalization Strategies. New Riders,
2002. 1.2.1, 1.2.3
[118] John Yunker. Inside Google’s language detection tool - Global by Design,
2010. URL http://www.globalbydesign.com/blog/2010/12/06/inside-
googles-language-detection-tool/. Last accessed Oct 30, 2013. 4.3.1
[119] Ethan Zuckerman. CHI keynote: Desperately Seeking Serendipity,
2011. URL http://www.ethanzuckerman.com/blog/2011/05/12/chi-
keynote-desperately-seeking-serendipity/. Last accessed Oct 28, 2013.
1.2, 2.4.1, 5.4, 8.1