ABSTRACT

Title of dissertation: MULTILINGUAL USE OF TWITTER: LANGUAGE CHOICE AND LANGUAGE BRIDGES IN A SOCIAL NETWORK

Irene Eleta, Doctor of Philosophy, 2014

Dissertation directed by: Professor Jennifer Golbeck, College of Information Studies

Social media is international: users from different cultures and language backgrounds are generating and sharing content. But language barriers emerge in the communication landscape online. In the quest for language diversity and universal access, the vision of a cosmopolitan Internet has stumbled over the language frontier.

Expatriates, minorities, diasporic communities, and language learners play an important role in forming transnational networks, creating social ties across borders. Many users of social media are multicultural and multilingual; they are mediating between language communities. In the microblogging site Twitter, information spreads across languages and countries. How are multilingual users of Twitter connecting language groups? What are the factors influencing their language choices?

This research advances a step towards understanding the network structures and communication strategies that enable intercultural dialog, cross-language sharing of information, and awareness of global problems.

This dissertation research aims at: (1) exploring the ways in which multilingual users of Twitter are connecting different language groups in their social network; (2) modeling how the network influences their language choices; and (3) exploring what the textual features of their posts can reveal about language choices and mediation between groups.

This dissertation goes beyond survey information about multilingualism and provides a deeper understanding of the structural relations between language communities in Twitter. This research work is one of the few that apply social network analysis to the study of sociolinguistic questions on the Internet.
Focusing on the social networks of multilingual users, this dissertation contributes an original classification of network types based on the patterns of connections between language groups. Also, it applies the novel idea of modeling the influence of network factors on the language choices of the user. Finally, this dissertation tests the hypothesis that the type of exchange influences language choice, and explores with a theme analysis how other textual features might reveal cross-cultural awareness. These results can inform the design of social media platforms.

MULTILINGUAL USE OF TWITTER: LANGUAGE CHOICE AND LANGUAGE BRIDGES IN A SOCIAL NETWORK

by

Irene Eleta Mogollón

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2014

Advisory Committee:
Professor Jennifer Golbeck, Chair/Advisor
Professor Benjamin B. Bederson
Professor Jordan Boyd-Graber
Professor Kari M. Kraus
Professor Ira Chinoy

© Copyright by Irene Eleta Mogollón 2014

In that ocean that separates continents and lives, I have not yet lost you in the storm. To Pepe.

Acknowledgments

First and foremost I would like to thank my advisor, Prof. Jennifer Golbeck, for her inspiring lessons on social network analysis, for encouraging me to develop my own original ideas and guiding me through that difficult process, always caring for my motivation. I have reached this milestone thanks to her support in moments of adversity. She also made it possible by financing this dissertation research.

I owe my gratitude to the other committee members of my dissertation, Prof. Ben Bederson, Prof. Jordan Boyd-Graber, Prof. Kari Kraus, and Prof. Ira Chinoy, who have provided valuable feedback in their diverse areas of expertise to make this dissertation a more solid and complete research work.

I would also like to thank Dr. Judith Klavans and Prof.
Doug Oard for their advice and mentoring in the early stages of my doctoral endeavor. They transformed the graduate student I was into a researcher. It was an immense privilege to be able to count on them.

I would also like to acknowledge help from Tony Rogers, who joined Prof. Jennifer Golbeck and me in our search for multilingual users of Twitter.

My peers and the professors at the College of Information Studies, the iSchool, and at the Human-Computer Interaction Lab (HCIL) have enriched my graduate experience in many ways, providing inspiration and support.

I owe my deepest thanks to Fulbright for sponsoring my doctoral studies and for the financial support in the first years of my program. Also, they have enriched my stay in the United States by giving me the opportunity to participate in the many cultural and social events they organize, including academic workshops, where I have met many new friends. The Fulbright community has had an enormous influence on my vision of the world and on this research work. They are the most inspiring example I know of multicultural social ties between the world’s nations.

Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 What is Twitter?
  1.2 Motivation
    1.2.1 Language bubbles?
    1.2.2 The bridges between the local and the global
    1.2.3 Values in the design of communication platforms
  1.3 An Ultimate Goal
  1.4 Objectives and Research Questions
  1.5 Contributions and Audiences
2 Theoretical Framework
  2.1 The Global Language System
  2.2 The Ecology of Language
  2.3 Networks
    2.3.1 Concepts of Social Network Analysis
    2.3.2 A network perspective on Sociolinguistics
  2.4 The Internet as a Sociolinguistic Ecology
    2.4.1 The mediation of technology and the cosmopolitan space
    2.4.2 Overview and remarks
  2.5 Micro-Sociology Focus: Conceptualizing Multilingual Users and Language Choice in Twitter
3 Related Work
  3.1 Language Choice and Code-Switching Online
  3.2 Networked Languages
  3.3 Multilingual Twitter
4 Methodology
  4.1 Research Design
  4.2 Sampling and Data Collection
  4.3 Methods for Assigning Language Labels to Users
    4.3.1 Tools for automatic language identification
    4.3.2 Algorithm for assigning a language label to a person
  4.4 Testing Methods for Assigning Language Labels to Users
    4.4.1 The test dataset
    4.4.2 The baseline
    4.4.3 Testing the language identification tools and the algorithm that assigns language labels to users
    4.4.4 Deciding the number of posts per user
  4.5 Assigning Language Labels to Users
  4.6 Scope
  4.7 Limitations
  4.8 Reliability and Validity
  4.9 Ethical Considerations
5 Social Network Analysis
  5.1 Qualitative Approach
  5.2 Network Statistics
  5.3 Application of Categories
  5.4 Discussion
6 Factor Analysis
  6.1 Operationalization of Variables
  6.2 Regression Models and Analysis
  6.3 Results
  6.4 Discussion
7 Exploring Textual Features
  7.1 Description of the Data
  7.2 Hypothesis Testing: Fisher’s Exact Test
  7.3 Discussion: Addressivity as a Factor
  7.4 Theme Analysis
    7.4.1 International themes in the English language set
    7.4.2 English hashtags in the non-English language set
8 Discussion and Future Work
  8.1 Of Links, Social Ties, and Gravitational Forces
  8.2 The Road Ahead...
    8.2.1 Translation and Mediation in Twitter
    8.2.2 Who Are the Multilingual Users?
9 Conclusion
A Visualizations of Social Networks
B International Themes in English Posts
Bibliography

List of Tables

4.1 Research design schema
4.2 Budget options and associated error rates
5.1 Properties of bilingual networks observed in visualizations
6.1 Linear regression coefficients for English use
6.2 Linear regression coefficients for L2 use
6.3 Logistic regression coefficients for English use
6.4 Logistic regression coefficients for L2 use
7.1 2x2 contingency table for the Fisher’s Exact Test
7.2 Frequencies of international themes in English posts
7.3 Conversational tags: discourse conventions in Twitter
7.4 Other conversational tags
7.5 Hashtags: ICT topic, brands and devices
7.6 Hashtags: events, music, TV and sports
7.7 Hashtags: location, time, and other named entities
7.8 Hashtags: other topics

List of Figures

1.1 Example of Twitter posts
2.1 Egocentric network
2.2 Schematic view of a network with clusters
2.3 European language communities in Twitter
2.4 Interactions constrained by technology and social network
2.5 Factors for language choice in Twitter
3.1 Language share of top 20 most active countries on Twitter
4.1 Schematic description of the datasets
4.2 Words for detecting languages
4.3 Purpose of datasets
4.4 Language label assignation to users
4.5 Estimated error function
4.6 Comparison of langPy and Google Language ID
5.1 Trilingual egocentric network: English, Spanish, Basque
5.2 Trilingual egocentric network: English, Spanish, Catalan
5.3 Trilingual egocentric network: English, Chinese, Japanese
5.4 Qualitative categories of bilingual networks
5.5 L2 inner/crossing edge ratio for five and three categories
5.6 L2 group proportion for five and three categories
5.7 Cross-language edge ratio for five and three categories
5.8 Bilingual ratio for five and three categories
5.9 Results of classification model
6.1 Sample input data file for factor analysis
A.1 Trilingual networks (1).
A.2 Trilingual networks (2).
A.3 Trilingual networks (3).
A.4 Bilingual networks: gatekeeper type (1).
A.5 Bilingual networks: gatekeeper type (2).
A.6 Bilingual networks: gatekeeper type (3).
A.7 Bilingual networks: gatekeeper type (4).
A.8 Bilingual networks: gatekeeper type (5).
A.9 Bilingual networks: gatekeeper type (6).
A.10 Bilingual networks: language bridge type (1).
A.11 Bilingual networks: language bridge type (2).
A.12 Bilingual networks: language bridge type (3).
A.13 Bilingual networks: language bridge type (4).
A.14 Bilingual networks: language bridge type (5).
A.15 Bilingual networks: language bridge type (6).
A.16 Bilingual networks: union type (1).
A.17 Bilingual networks: union type (2).
A.18 Bilingual networks: union type (3).
A.19 Bilingual networks: union type (4).
A.20 Bilingual networks: integration type (1).
A.21 Bilingual networks: integration type (2).
A.22 Bilingual networks: integration type (3).
A.23 Bilingual networks: integration type (4).
A.24 Bilingual networks: integration type (5).
A.25 Bilingual networks: integration type (6).
A.26 Bilingual networks: integration type (7).
A.27 Bilingual networks: integration type (8).
A.28 Bilingual networks: peripheral language type (1).
A.29 Bilingual networks: peripheral language type (2).
A.30 Bilingual networks: peripheral language type (3).
A.31 Bilingual networks: peripheral language type (4).
A.32 Bilingual networks: peripheral language type (5).
A.33 Bilingual networks: peripheral language type (6).
A.34 Small and monolingual networks (1).
A.35 Small and monolingual networks (2).
A.36 Small and monolingual networks (3).
A.37 Small and monolingual networks (4).
A.38 Small and monolingual networks (5).
A.39 Small and monolingual networks (6).
A.40 Small and monolingual networks (7).
A.41 Small and monolingual networks (8).

Chapter 1

Introduction

[G]lobalization is characterised by unprecedented flows of information, exchanges among different groups and networks that transcend the local and national [116, p. 9].

As the number of Internet users from different parts of the world grows [58], so does the use of a wealth of languages online [89]. The Internet is not accessed only through computers, but also through cellphones and tablets; this trend is enabling more people in developing countries and speakers of a plethora of languages to access it [58]. While access to the Internet and communication flows are greater than ever before, there is evidence of fragmentation due to language and national borders on the Web [44] and in the blogosphere [53, 47]. Also, many authors warn about the existence of a “linguistic digital divide” that prevents many users of the Internet from having access to relevant information in their languages [78, 67, 68, 8].

In recent years, social media has emerged as horizontal networks of communication, where a complex interplay takes place between mainstream media, journalists, political actors, grassroots activists, citizens and technology [13, 77]. On the one hand, there are powerful social actors shaping the linguistic landscape of the Internet with a top-down approach, like national and supranational institutions, broadcasting media, and companies with interests in transnational business [30]. On the other hand, users of social networks and content-sharing platforms constitute a counter-power [13], reshaping this linguistic landscape with their contributions.
Social media has enabled valuable social outcomes such as spontaneous organization during humanitarian crises [98], public denunciations of human rights violations [85], creation of relevant content for communities that are underserved in terms of information on the Internet and in their languages [102], and foreign language practice and participation in transnational interest communities and diaspora communities [100].

Many researchers and media outlets are turning their attention to the microblogging site Twitter. They have realized the potential of Twitter for spreading information about unfolding events in real time across languages and geographic regions [55, 77]. But how does news travel across language frontiers?

In this dissertation, I study how multilingual users of Twitter mediate between language groups in their social network, focusing on social connections and language choice. My long-term goal is to advance our understanding of the network structures and communication strategies that enable intercultural dialog, cross-language sharing of information, and awareness of global problems.

This research goes beyond survey information about multilingualism: I apply social network analysis to gain a deeper understanding of the structural relations between language communities in Twitter. I focus on the social networks of multilingual users and contribute a classification of network types based on the patterns of connections between language groups. Also, I propose and apply the novel idea of modeling the influence of network factors on the language choices of the user.

1.1 What is Twitter?

In Twitter, users share posts with followers; these posts are limited to 140 characters and often include links to webpages, images and other resources.
Twitter has characteristics of a social network (although relationships do not need to be reciprocal) and of an information-sharing network, where both mainstream media and user-generated content are disseminated publicly [69, 77]. The posts of the people a user follows are laid out in a vertical stream, in reverse chronological order, i.e. the most recent posts are at the top and the user can scroll down the screen to read the previous messages.

Twitter posts can be of three types:

1. an original comment by the author;
2. a reposting of a comment authored by someone else; the user who passes the message along can do so either by means of the “Retweet” button or by preceding the copied text with “RT”, “rt”, or other markers of attribution (see figure 1.1);
3. a reply or comment addressed to a particular user by means of a “mention”, the @ sign followed by a username.

The key to Twitter’s success and novelty is the speed of information dissemination and the fact that most of this information is publicly accessible. For instance, it takes a “tweet” less than one hour on average to be reposted and, if it gets beyond that first hop, it will be reposted almost instantly in subsequent hops, reaching an average of 1000 people [69].

Figure 1.1: Two Twitter repostings. The authors’ usernames at the top of each message have been erased for privacy reasons, as well as the usernames next to “Retweeted by”; the latter users reposted them by clicking the Retweet button. The message at the top was previously reposted by copying the text and preceding it with RT and the mention of the original author’s username, which is partially shown (@k). Screenshot taken in 2011, reflecting the interface design at the time of data collection. For more images and details on the evolution of Twitter’s interface until 2011, see [114]: http://blog.bufferapp.com/how-twitter-evolved-from-2006-to-2011.
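The three post types listed in this section can be told apart, to a first approximation, from surface markers in the text of a post. The following Python sketch illustrates the idea; the function name `classify_post` and the two regular expressions are assumptions made for illustration, not part of the dissertation's method. The sketch only recognizes the textual “RT”/“rt” marker, so repostings made with the Retweet button (which leave no marker in the text) and other attribution markers would need additional information.

```python
import re

# Assumed surface markers (illustrative only):
# "RT" or "rt" as a standalone word marks a manual repost.
RETWEET_MARKER = re.compile(r"\bRT\b", re.IGNORECASE)
# "@" followed by a username, not preceded by a word character
# (so that e-mail addresses are not counted as mentions).
MENTION = re.compile(r"(?<!\w)@\w+")


def classify_post(text):
    """Return 'repost', 'reply', or 'original' for the text of one post."""
    # Check the repost marker first: "RT @user ..." also contains a mention.
    if RETWEET_MARKER.search(text):
        return "repost"
    if MENTION.search(text):
        return "reply"
    return "original"


if __name__ == "__main__":
    for post in ["RT @k: hello world",
                 "@maria gracias por el enlace",
                 "Good morning from Madrid"]:
        print(classify_post(post), "<-", post)
```

Checking the repost marker before the mention is a deliberate choice in this sketch, because a manual repost normally cites the original author with a mention.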
Because information spreads so quickly and publicly on Twitter, there is an emergent body of research literature studying how to leverage Twitter’s tremendous potential for “participatory sensing” and collaboration [28], enabling “situational awareness” in emergency events [105], and functioning as an “awareness system” for journalism [51].

1.2 Motivation

The underlying motivation for my research is promoting language diversity and facilitating access to multilingual information on the Internet, so that everybody can benefit from it for communicating, learning, doing business, and sharing ideas and resources. However, multilingualism also brings new challenges, like the segregation of information and communication spheres, which can hinder the potential of the Internet for discovery, cross-cultural awareness, intercultural dialog, and transnational collaboration to find solutions for local conflicts and global problems.

To support my views, I highlight below the relevant points of the “Geneva Declaration of Principles” [115] for an inclusive Information Society, approved at the World Summit of the Information Society in 2003:

• The international management of the Internet should facilitate access for all, taking into account multilingualism.
• The Information Society should foster and respect cultural and linguistic diversity, dialogue among cultures and civilizations, and encourage international cooperation.
• Everyone should have the right to seek, receive, and impart information and ideas through any media and regardless of frontiers.

These principles are based on prior international declarations, such as UNESCO’s “recommendation concerning the promotion and use of multilingualism and universal access to cyberspace” [103]. The Declaration of Principles states the importance of a “rich public domain [...] for the growth of the Information Society, creating multiple benefits such as an educated public, new jobs, innovation, business opportunities, and the advancement of sciences” [115, p. 4].
Similar supporting arguments come from the Internet Society, a non-profit international organization that provides leadership for Internet policy and technology standards [59]. The Internet Society’s vision was remarkably conveyed by Vint Cerf in his 1999 speech “The Internet is for Everyone” [14].

Also, I was inspired by Zuckerman’s comparison of cosmopolitan cities with the Internet [119]; using this metaphor, he proposed to plan and design technology for creating the structure that fosters social contact, vibrant communities, and discovery, to fulfill the vision of an Internet that constitutes a truly cosmopolitan space.

Unfortunately, there are innumerable challenges to achieving these goals that go beyond the technical aspects, such as socioeconomic inequality, lack of infrastructure in rural areas and disadvantaged parts of the world [34, 91], and restrictive governmental or private controls [107]. Indeed, the Internet has many types of frontiers and barriers, but I am particularly interested in the language frontier.

1.2.1 Language bubbles?

The first language frontier many potential users encounter prevents them from using the services on the Internet: the interfaces are not localized into their language, writing system, and cultural conventions. There is a wealth of literature, mostly practice-oriented, about localization of interfaces for improving usability and accessibility of software products and websites targeting a global market [117, 50, 92]. Additionally, there are other, more subtle, language frontiers.

Drawing similarities with the “filter bubble” problem, Scott Hale [46] writes about “language bubbles” on the Internet. As an example, he shows the different sets of search results obtained for the query “Tiananmen Square” in English and Chinese using the search engine Google [46].
The Filter Bubble [86] was a widely discussed book that warned the public about the widespread use of algorithms for personalization of search results and news feeds online without the knowledge or control of the end user. This book, and other works expressing similar concerns, have triggered debate about the decisions shaping the design of information systems and social networks, and the impact they have on society.

Maybe as a result of this debate, the recommender systems community is making an effort to incorporate the values of diversity and novelty into the recommendation models and algorithms [1]. However, as Hale points out [46], system designers might not be taking into account the dimensions of culture and language yet.

A research study of 25 language versions of Wikipedia by Hecht and Gergle [49] serves to illustrate this ignored challenge. Wikipedia is an online encyclopedia built with user contributions that revolves around the principle of reaching consensus on concepts’ descriptions. Hecht and Gergle [49] found that more than 74% of concepts in Wikipedia are described in only one language and that there is a surprisingly small overlap of concepts in different languages. For instance, in the case of two mature language editions, the authors report that 51% of concepts in English are covered in German, but only 16% of concepts in German are also in the English Wikipedia [49].

One implication is that we are seeing a substantially different knowledge repository depending on the language we use. This is not only a matter of insufficient translation, but of concepts that are culture-specific or not considered relevant in other languages, e.g. city districts, national sport teams [49].

A survey on the topology of web links determined that the number of hyperlinks that cross international borders is significantly lower than the number of domestic hyperlinks [44]. Similarly, the blogosphere is fragmented into language communities [80, 53, 47].
We see a different Internet depending on the language we use, which is hindering our capabilities for sharing and learning, but this research problem remains unexplored for the most part. Notably, the field of multilingual information retrieval has a solid body of literature [87], including the design of multilingual search interfaces and the specific problem of cross-language information retrieval, but it is narrowly focused on search. Also, there is a growing body of literature in the Semantic Web field about multilingual ontologies and cross-language linking of data and resources [3], but its application is limited to certain domains.

In general, few efforts have been made to understand which network structures and communication environments foster intercultural dialog, cross-language sharing of information and resources, awareness of global problems, and international collaboration.

1.2.2 The bridges between the local and the global

Coupland [18] highlights how social relations become possible across distance with the help of Information and Communication Technologies (ICTs) in this time of unprecedented numbers of mobile trajectories and flows of populations. Expatriates, migrants, minorities, diaspora communities, and language learners play an important role in forming transnational networks and cultural bridges between nations and communities.

Many users of the Internet are multicultural and multilingual. They sometimes act as invisible translators. For instance, during casual daily interactions, they might be passing information from one language community to the other, without strictly translating, but re-contextualizing a story in a new language and culture [6].

Some ground-breaking initiatives are already taking advantage of multilingual users’ language skills for raising international awareness about local conflicts, human rights violations, and advocacy causes.
Such is the case of Global Voices, “an international community of bloggers who report on blogs and citizen media from around the world” [37].

In 2009, the Berkman Center for Internet and Society mapped the Arabic blogosphere and described a key concept that has motivated this dissertation work. They identified English and French “language bridges” in the Arabic blogosphere, consisting of bloggers who wrote in English or French as well as in their native (Arabic) language, and who connected the different national blogospheres with the international one [32].

What the impact of these “language bridges” is, and how social media is used for “reaching out to the world” and drawing the attention of international broadcasting media, are still open questions of particular interest after the popular uprisings of 2011 [16].

In the microblogging site Twitter, information spreads across languages and countries [75, 76, 77] and, as I will show, this is possible thanks to multilingual users who are mediating between language communities. “[T]he greatest connecting power is the will of the users who want to be connected” [45], as in the example of bloggers in Arab countries connecting with an international audience [32] or the self-described “voluntweeters” after the earthquake in Haiti [98].

A world that faces global challenges could benefit from leveraging the interconnections of its population for finding and sharing solutions from the local level to the international level.

1.2.3 Values in the design of communication platforms

The localization of the interface into a diversity of languages and cultural codes, support for non-Latin scripts and bidirectional text displays, as well as assistive technologies for translation, are basic requirements for a globally accessible communication platform [117, 50].
The debate about filter bubbles reveals that information systems, online social networks, and content-sharing and communication platforms are not neutral tools. The values and design decisions that underlie these systems [95] have an impact on the users’ perceptions of the world and their behavior.

The design of Twitter and other information-sharing platforms comes with an embedded set of values, like sharing, dissemination, being public and participative, etc. In accordance with these values, research can shed light on how to leverage the language skills and multicultural background of its users to promote dissemination of information across language frontiers.

Even after designers identify the values they want to imprint in the system, they still need to understand the associated challenges. For example, if we want a communication and information-sharing platform that enables intercultural dialog and collaboration, cross-language link sharing, and awareness of global problems, we need to study how the system might be constraining the linguistic decisions of multilingual users and impairing their ability to cross online frontiers. Also, we should acknowledge and respect that, in some cases, certain communities might have reasons for concealing information or resources.

1.3 An Ultimate Goal

The overarching goal that motivates my research is to advance our understanding of the network structures and communication strategies that foster intercultural dialog, cross-language sharing of information, and awareness of global problems. We could leverage this knowledge to reduce the impact of language frontiers online, to encourage social contacts and links to resources across languages, and to promote the use of multiple languages, i.e. instead of constraining multilingual users to one language choice, empowering them to mediate between cultures.

Ultimately, who are the people and what are the reasons that connect different cultural and linguistic groups?
What can we do to foster and leverage these cross-cultural connections for building a cosmopolitan space? These are very broad and ambitious questions, and my research path has barely started. In the next section, I narrow the scope to provide a foundation for this area of inquiry.

1.4 Objectives and Research Questions

This dissertation research aims at: (1) exploring the ways in which multilingual users of Twitter are connecting different language groups in their social network; (2) modeling how the network influences their language choices; and (3) exploring what the textual features of their posts can elicit about language choices and mediation between language groups.

This dissertation focuses on the microblogging site Twitter because it constitutes an example of a social and information-sharing network where information spreads across languages and countries (see subsection 1.2.2). Also, the interface is available in a diversity of languages, supports various non-Latin scripts and bidirectional text, and does not filter the posts by language. Therefore, it potentially exposes the user to a multilingual conversation if she/he chooses to follow people writing in different languages.

Four questions drive this research:

1. In what ways are multilingual users of Twitter connecting language groups?
2. How is the social network of multilingual users in Twitter influencing their choice of language?
3. Does the type of exchange in Twitter (i.e., public post, reply) influence the language choice of multilingual users?
4. What do the themes and textual features in the posts of multilingual users reveal about cross-cultural awareness or international dialogue?

Inspired by an expanded paradigm of Web Content Analysis proposed by Herring [52], this research includes social network analysis, natural language processing for automatic language identification, and theme and exchange analyses.
In this dissertation, the research subjects are Twitter users authoring posts in English and at least one other language. Focusing on the social network of these multilingual users, the methodology combines a qualitative approach to social network analysis and network statistics to present a taxonomy of network types based on the patterns of intersections and connections between language groups. The resulting theoretical constructs or categories answer the first research question. A factor analysis based on two regression models will answer the second research question, on the influence of the social network on the language choices of multilingual users.

To answer the third research question, I test the hypothesis that the textual feature indicating addressivity (@ sign) within the posts of multilingual users influences their language choice. Finally, regarding the fourth research question, a generic theme analysis will provide preliminary findings on topics that might help in raising cross-cultural awareness, and on the reasons for using English keywords in non-English posts.

1.5 Contributions and Audiences

The main contribution of this dissertation is that it goes beyond survey information about multilingualism and provides a deeper understanding of the structural relations between language communities in an online social network. In particular, this research proposes new specific network statistics to enhance the definitions of original theoretical constructs: the types of intersections between language groups in social networks.

Inspired by previous studies on the blogosphere, I propose to apply social network analysis to study sociolinguistic questions on the Internet. Adapting the Ecology of Language theoretical framework from sociolinguistics to the social network context, this research conceives of the social network of multilingual users as a micro-scale language ecology, influencing their communication strategies and language choices.
This conceptualization leads to a second key contribution, which is the novel idea of modeling the influence of social network factors on the language choices of the user.

Other contributions include: the confirmation of previous empirical observations pointing to addressivity as a factor for language choice in Twitter, the identification of themes that might be raising cross-cultural awareness, and the identification of certain types of hashtags (keywords preceded by the # sign) and related contexts that could encourage multilingual conversations.

Regarding the audiences that this dissertation addresses, the research area of information diffusion in social networks could benefit from the findings about the structural relations between language groups. More broadly, this work is relevant to the fields of Information Studies and Social Informatics.

The lessons learned in this work could inform the design of socio-technical systems. This research contributes to “understanding users” in the field of Human-Computer Interaction, especially in the areas of computer-supported cooperative work and technology-mediated social participation. Also, this dissertation might be of interest in the field of Language Technologies for potential applications.

This dissertation was inspired by works in Computer-Mediated Communication and constitutes another example of applying social network analysis in this field, where the method is still rarely used.

Finally, other audiences include the fields of Digital Humanities, Sociolinguistics, and Communications Studies, especially in relation to the Internet. Researchers in these areas might find inspiration in this dissertation for exploring their research questions with the new lens of social network analysis and the use of automatic language processing.

Chapter 2 Theoretical Framework

I begin this chapter by reviewing relevant theoretical perspectives from Sociolinguistics in the context of globalization.
The Global Language System theory proposed by De Swaan provides a macro-scale perspective on language; it was later reinterpreted by Calvet in his Ecology of World Languages, which includes an amalgam of views that make up the Ecology of Language approach. This approach comprises different levels of analysis in the study of languages: the macro-scale language dynamics, described as language ecologies, emerge as a result of the interactions of individuals and their language choices at the micro-scale level. However, theories and methods in Sociolinguistics remain too fragmentary.

Inspired by previous studies on the blogosphere, in this chapter I propose to apply social network analysis to study sociolinguistic questions on the Internet. Social network analysis enables us to understand the influence of micro-scale interactions on macro-scale social dynamics; this analytical approach could enrich the Ecology of Language perspective.

Finally, narrowing the scope to the particular environment of the Internet and social media, I discuss and define the concepts that relate to interactions of users and technology, with particular attention to multilingualism, language choice, and language-switching.

2.1 The Global Language System

De Swaan’s theory of communication potential and language competition, called the “Global Language System” [26, 25, 27], illustrates the dynamics of the world’s languages with a constellation metaphor: English constitutes the hyper-central sun of the global constellation of languages. At the top level, supranational subsystems (like Spanish, French, and Arabic) compete with English as languages of global communication. There are a dozen supranational languages in this constellation and a hundred national languages orbiting around them like planets. This pattern appears at different levels in the system, starting in the periphery with local languages surrounding a national language like the satellites of planets.
The central languages in each subsystem or cluster have a mediation function between local languages.

A key point in De Swaan’s theory that is relevant to this research is the connection between the language groups through polyglots and interpreters. Multilingualism and translation constitute the gravitational force that provides cohesion to the system, enabling communication and interaction between different language groups. At the same time, speakers are confronted by multiple and competing linguistic options. Individual and collective choices shape the system, but are themselves influenced by the spheres of politics, economics, and culture [25].

De Swaan [26] proposes a formula to determine the communication potential of a language, which could influence the decision of people to learn it. The factors that the formula takes into account are the number of speakers of the language and the number of multilingual speakers who know the language. The number of multilingual speakers is related to the centrality of the language in the system. These multilingual speakers increase the communication potential value of a language because they enable connections with other languages in the system. For instance, in the language subsystem of the European Union (E.U.), German has a high communication potential value due to the large number of speakers in the region, but English has a higher value due to the larger number of multilingual speakers competent in English, which provides the opportunity to communicate with people from many different countries in the E.U. [27].

De Swaan already mentions network concepts that will be introduced in section 2.3: the “centrality” of a language in the system, which accounts for the mediation potential between that language and others thanks to the number of polyglots speaking it; these polyglots are facilitating “linkages or connections” among languages, which are necessary for the “cohesion” of the system.
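As a rough illustration of the two factors named above, the following sketch assumes the product form in which De Swaan’s communication value is usually stated: the share of all speakers in a constellation who know the language (prevalence) multiplied by the share of multilingual speakers who know it (centrality). The function name `communication_potential`, the toy repertoires, and the resulting values are hypothetical; they are not data from [26] or [27].

```python
def communication_potential(repertoires, language):
    """Prevalence times centrality for `language` (assumed product form).

    repertoires: list of sets, one set of known languages per speaker.
    """
    total = len(repertoires)
    multilingual = [r for r in repertoires if len(r) > 1]
    if total == 0 or not multilingual:
        return 0.0
    # Share of all speakers who know the language
    prevalence = sum(language in r for r in repertoires) / total
    # Share of multilingual speakers who know the language
    centrality = sum(language in r for r in multilingual) / len(multilingual)
    return prevalence * centrality


# Hypothetical toy constellation of ten speakers (invented for illustration)
speakers = (
    [{"de"}] * 4           # German monolinguals
    + [{"de", "en"}] * 2   # German-English bilinguals
    + [{"fr", "en"}] * 2   # French-English bilinguals
    + [{"fr"}]             # French monolingual
    + [{"en"}]             # English monolingual
)

q_de = communication_potential(speakers, "de")
q_en = communication_potential(speakers, "en")
```

In this invented constellation German has more speakers than English (six versus five), yet English obtains the higher value because every multilingual speaker has it in their repertoire, which mirrors the qualitative point of the E.U. example.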
The constellation metaphor serves to illustrate the following concepts: languages of local communication (described as “satellites” or “the periphery”), languages of regional communication (illustrated as “planets”), and languages of global communication (represented as the “central stars”). When considering the linkages between language groups and the communication potential, the following terms are used given a particular context or region: a vernacular language is the first language of the majority of its users; a lingua franca is the second language of the majority of its users, who speak different first languages; and in the case of a vehicular language there is a balance between the number of speakers who use it as a first language and the number of speakers who use it as a second language [2].

2.2 The Ecology of Language

While De Swaan’s theory places English as a hyper-central sun and uses socio-economic concepts to assign a “value” to languages, the Ecology of Language approach adds nuanced views regarding language hierarchies. For instance, this approach addresses the phenomenon of “ethnic revivals” as a form of counter-power in the language dynamics of a globalized world [56, 30, 13], where the socio-economic approach of De Swaan falls short.

Borrowing concepts from the field of Ecology, Haugen introduced the Language Ecology approach and the notion of the co-evolution of languages and their interdependence within a social system [19, 71, 48]. Hornberger [56] synthesizes this analytical approach as taking into account all languages in a given ecosystem, recognizing their social spaces and contexts. Adapting the Ecology of Language approach to Twitter, this research conceives of the social network of multilingual users as a micro-scale language ecology, influencing their communication strategies and language choices.
Under the ample umbrella of this approach and its various names (language or linguistic ecology, ecology of languages, and ecolinguistics), there is a diverse group of authors and works painting the multifaceted social, political, and cultural reality surrounding the languages of the world. However, this painting is, for the most part, fragmentary in nature. While the works of the Ecology of Language generally focus on micro-sociology problems, such as managing multilingualism in South African and Bolivian classrooms [56], De Swaan proposes a planetary vision. In essence, the difference between the two perspectives is based on the scale of the analysis.

Calvet provides an integrative perspective in his book Towards an Ecology of World Languages [11]. Calvet reinterprets De Swaan’s theory as the “gravitational model”, which describes the global ecosystem of languages, or the macroscopic scale, and complements it with other models that account for phenomena at lower scales, such as internal regulation of languages, social and official functions of languages, and identity [71]. In this way, the Ecology of Language does not exclude the Global Language System theory by De Swaan, but embraces it as part of this inclusive approach.

In summary, the Ecology of Language comprises various levels of study in sociolinguistics, from microscopic to macroscopic scale: from the individuals interacting to the population, and ultimately, the ecosystem of society, policy, economics, and communication media [71]. However, this framework lacks a connecting tool between those levels of analysis, i.e., an account of how global language dynamics emerge from individual interactions and language choices.
Inspired by Language Networks on LiveJournal [53], one of the first studies to take a network analysis approach to study languages in social media, I propose to apply social network analysis to facilitate our understanding of the connections between micro-scale and macro-scale sociolinguistic questions.

2.3 Networks

“Building on a long tradition of network analysis in sociology and anthropology [...] and an even longer history of graph theory in discrete mathematics [...], the study of networks and networked systems has exploded across the academic spectrum” [109](243).

The science of networks provides a new framework to understand complex systems in biology, sociology, communication technologies, business, etc. [7]. Network structure is “thought to influence individual (micro) and collective (macro) behavior, as well as the relationships between the two”, and has provided useful insights in the study of the spread of disease and information dissemination [109](256). Similarly, networks could help us gain a deeper understanding of language use in society, in communication technologies, and of language effects on information dissemination.

2.3.1 Concepts of Social Network Analysis

Social network analysis (SNA) studies the network structure of relations between people to understand social phenomena, instead of categorizing human behavior based on individual inner forces [80]. In SNA, people are represented as nodes of a social graph. The nodes are connected by edges, or social ties, that could be reciprocal or just a one-way relation, like the “follower of” relation in Twitter.

In this work, I will use an important type of subgraph: the egocentric network. The egocentric network is obtained by selecting an individual node, called the ego, and all of its connections [38]. In other words, it constitutes the personal social network of an individual with his or her contacts. An egocentric network that includes only the connections with the ego has degree 1. More frequently, researchers are interested in including the connections among the ego’s contacts; in this case the egocentric network has degree 1.5 [38]. Figure 2.1 illustrates these basic concepts. The egocentric network has become a standard unit of measurement for studying small-scale interactions (or micro-sociology) [41, 80].

Figure 2.1: Egocentric network with degree 1 (left) and with degree 1.5 (right).

An edge that constitutes the only connection between two groups of nodes is called a bridge [38], which is a mathematical concept. Bridges are of special importance for the dissemination of information from one group to the other [41, 42]. In this context, gatekeepers are people enabling communication between two groups. Multilingual users of Twitter might be in a position in their social network where information necessarily has to pass through them to reach the other language group. This has several implications: the gatekeeper’s role is critical for spreading news between communities and for raising cross-cultural awareness, but they could broker information to their advantage [80], or they could be conservative in their decisions to transfer information to one group for cultural reasons [82]; in either case, places where we find gatekeepers could be considered structural holes between communities.

Communities or clusters are composed of nodes that are more connected to one another than with the rest of the social graph. For example, imagine a town with different neighborhoods, where almost everybody who lives in the same neighborhood knows each other, but they know fewer people from the other neighborhoods. There are a variety of automatic methods to detect clusters or communities based on network structure [38]. Figure 2.2 shows a schematic view of how people are connected in clusters.
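The egocentric network and bridge definitions above can be made concrete with a minimal sketch. It assumes an undirected toy graph stored as a dictionary of neighbor sets (Twitter’s “follower of” relation is directed, so this is a simplification); the node names reuse the labels of Figure 2.1 plus one invented node, and the edges, `ego_network`, and `is_bridge` are hypothetical illustrations rather than code or data from this study.

```python
from collections import deque

# Hypothetical undirected toy graph: node -> set of neighbours
graph = {
    "Ego": {"Bob", "Kat", "Jen"},
    "Bob": {"Ego", "Kat"},
    "Kat": {"Ego", "Bob"},
    "Jen": {"Ego", "Ana"},
    "Ana": {"Jen"},
}


def ego_network(graph, ego, include_alter_ties=False):
    """Return the edges of the ego network as a set of frozensets.

    Degree 1 keeps only ego-contact edges; with include_alter_ties=True
    (degree 1.5) edges among the ego's contacts are added as well.
    """
    alters = graph[ego]
    edges = {frozenset((ego, a)) for a in alters}
    if include_alter_ties:
        for a in alters:
            for b in graph[a] & alters:
                edges.add(frozenset((a, b)))
    return edges


def is_bridge(graph, u, v):
    """True if u can no longer reach v once the edge u-v is ignored."""
    seen, queue = {u}, deque([u])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if {node, nxt} == {u, v}:
                continue  # pretend the edge u-v has been removed
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return v not in seen


degree_1 = ego_network(graph, "Ego")
degree_1_5 = ego_network(graph, "Ego", include_alter_ties=True)
```

In this toy graph the tie between Bob and Kat appears only in the degree 1.5 network, and the edge between Ego and Jen is a bridge because it is the only path from Jen and Ana to the rest of the graph.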
In Twitter, clusters form due to language, geography, or topic of interest, but this research focuses on language clusters.

The cohesion of a social graph is a count of the minimum number of edges that prevent the entire graph from breaking into isolated components [38]. These types of edges that connect clusters of people are critical for providing cohesion to the society, building a sense of community, and for effective self-organization and collective action across language groups [41, 42].

Centrality is a core concept in SNA; it measures how “central” a node is in the network to estimate its importance [38]. There are different centrality measures that account for various reasons why a node might be important. For instance, degree centrality counts the number of edges or connections a node has. In this work, and in De Swaan’s theory (explained in section 2.1), the relevant centrality measure is betweenness centrality, which captures how important a node is in the flow of information from one part of the network to another [38]. In other words, betweenness centrality reveals the potential mediators or gatekeepers.

Figure 2.2: Schematic view of a network with clusters. Clusters (or communities) are composed of nodes that are more connected to one another than with the rest of the social graph. This dissertation focuses on language clusters. This visualization is an extract from an open access journal article [63]: M. Kaiser, M. Görner, and C. C. Hilgetag. (2007). Criticality of spreading dynamics in hierarchical cluster networks without inhibition. New Journal of Physics: http://iopscience.iop.org/1367-2630/9/5/110/fulltext/

2.3.2 A network perspective on Sociolinguistics

In this dissertation, I focus on multilingual users of Twitter as mediators between language clusters. For example, imagine a user posting in English and Spanish.
In Twitter, she follows the updates of researchers posting in English, but also the Twitter accounts of Spanish local media. Her friends in Spain are connected with her in Twitter, and so are her colleagues in the United States. This user sometimes posts in Spanish, commenting on local news or replying to a Spanish friend. Often, she posts in English to disseminate research content. She might also post in English to draw international attention to important events in Spain.

This dissertation hypothesizes that the language choices this user makes every time she writes a post will be influenced by the language composition of her social network and, in turn, will have an impact on it. In section 2.2, adapting the Ecology of Language approach to Twitter, I proposed to conceptualize the social network of multilingual users as a micro-scale language ecology, influencing their communication strategies and language choices.

Also, this dissertation investigates the structural relation between the language clusters in the social network of this user in order to develop methods for detecting gatekeepers or structural holes. Future research on information dissemination could benefit from these methods, which account for the language effect.

An early work that inspired the question of multilingual users as mediators, at the micro-sociology scale, was the research on gatekeepers by Metoyer-Duran in a variety of ethnolinguistic communities in the United States (American Indian, Chinese, Japanese, Korean, and Latino) [82]. She studied their profiles (multilingual and multi-literate), their behavior as information providers in their respective communities, and how they utilize their interpersonal network and new technologies [82]. Her study identified the profiles that facilitated access to information resources for underserved communities.
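Returning to the hypothetical English-Spanish user at the start of this subsection, the sketch below shows how betweenness centrality (section 2.3.1) can single out such a user as a potential gatekeeper. It is a minimal illustration under stated assumptions: the graph is an invented, undirected toy network with one English cluster (`e1` to `e3`), one Spanish cluster (`s1` to `s3`), and a bilingual node `user`; the function `betweenness` follows Brandes’ algorithm for unweighted graphs and is not the procedure used in this dissertation.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected graph (node -> set of neighbours) from edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints
    return {v: s / 2 for v, s in score.items()}


# Hypothetical toy network: two language clusters joined by one bilingual user
graph = make_graph([
    ("e1", "e2"), ("e1", "e3"), ("e2", "e3"),   # English cluster
    ("s1", "s2"), ("s1", "s3"), ("s2", "s3"),   # Spanish cluster
    ("user", "e1"), ("user", "s1"),             # bilingual user's ties
])
scores = betweenness(graph)
gatekeeper = max(scores, key=scores.get)
```

In this toy network every shortest path between an English-cluster node and a Spanish-cluster node passes through `user`, so that node receives the highest score. Real egocentric networks are directed and far denser, so a score of this kind is only a first indicator of a mediating position.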
At the macroscopic scale, giant clusters in the social network might represent language communities at the international level, in some cases roughly corresponding with national borders. To illustrate this idea, I include a visualization of European language communities in Twitter from a recent study [83]; in figure 2.3 the dots represent Twitter posts with geolocation information and the colors differentiate the languages of the posts. However, at the micro-scale, there are pockets of expatriates and diverse ethnolinguistic communities (similar to those studied by Metoyer-Duran) immersed in these giant clusters, where multilingualism is present.

Figure 2.3: European language communities in Twitter. The colored dots represent Twitter posts with geolocation information and their colors differentiate the languages of the posts. Giant language clusters roughly overlap with national borders at the macro-scale. However, at the micro-scale, there are pockets of expatriates and minority languages immersed in these giant clusters. This visualization is an extract from a journal article published under a Creative Commons license [83]: Mocanu D, Baronchelli A, Perra N, Gonçalves B, Zhang Q, et al. (2013) The Twitter of Babel: Mapping World Languages through Microblogging Platforms. PLoS ONE 8(4): e61981

A pair of language communities that share more connections through multilingual individuals or translations than other pairs would have a “communication highway” between those two languages. Continuing with the metaphoric theme of roadways, if only a few multilingual gatekeepers and translations connect both language communities, there would be a “rope bridge” crossing the structural hole. However, these relationships between languages are rarely reciprocal; in other words, the communication highway might be one-way only. For instance, statistics on literature translation in Europe reflect the dominance of English as a source (origin of the translation) and, at the other extreme, low percentages of non-European languages as a source [74]. For this reason, studying language choice is important to understand the directionality of information flows.

Network science connects the microscopic scale (e.g., the multilingual user interacting with her network and switching between languages) and the macroscopic scale (e.g., the language dynamics in the Twitter system). The problem of modeling language competition exemplifies the connection between the microscopic and macroscopic scales in sociolinguistics.

When looking at the different outcomes these models predict over time, there are cases of language death, language dominance, language coexistence, and language fragmentation into multiple languages, which is reminiscent of the Language Ecology approach. Vazquez et al. [104] described the idea underlying these models: collective social phenomena are studied in terms of interacting agents, which are represented as nodes in a network of social interactions; nodes can change their language according to specified rules of interaction with the neighbors in the network. The models include probabilities to switch languages determined by the local density of speakers of the opposite language, the prestige of the language, and other parameters [104]. The rules of interaction and probabilities to switch languages belong to the microscopic level, but the simulation of interacting agents generates macroscopic results. For example, the Bilinguals Model with three types of people (speakers of language X, speakers of language Y, and bilingual speakers) shows how the social structure influences the final outcome: the lower the cohesion of the network, the higher the chances of evolving into one dominant language [104].

Using social network analysis in Sociolinguistics is not straightforward.
Representing people as nodes requires careful thinking for assigning individual attributes, like the languages they understand, the languages they use, level of language competence and literacy, race or ethnic identity, gender, etc. Additionally, nodes can have a location attribute, which overlaid on a map can distinguish the expatriates and migrant populations. Also, nodes can represent other actors in society, like organizations and media outlets. The edges connecting the nodes might be face-to-face interactions, interactions mediated by technology (like phone or email), or affective or affiliation relations, to name a few examples. Ultimately, the sociolinguist has to conceptualize and interpret what the network measures, like centrality and cohesion, reveal.

2.4 The Internet as a Sociolinguistic Ecology

There is an ongoing debate about the dominance of English on the Internet and its impact on language diversity [20, 39, 33]. The United States’ leading role in developing the Internet had consequences like the initial use of English only, protocols devised for the Roman alphabet, and a telecommunications infrastructure that was economically dominated by U.S. companies [33].

However, the Internet is evolving very fast. As other nations started to come into play, and users of different countries gained access to the Internet, a wealth of languages blossomed online [89]. At the same time that online content was increasing exponentially, the percentage of English content diminished to 45% in 2005, in favor of other languages, while the estimated online content in Chinese grew to 9% in 2008, followed by German and Spanish [89]. UNESCO’s “recommendation concerning the promotion and use of multilingualism and universal access to cyberspace” [103] and new standards like Unicode, enabling the use of different writing systems, intensified the trend towards a multilingual Internet.
Despite this progress, non-English-speaking users perceive the scarcity of online resources in their first language and are generally appreciative when they can find information in their language [8]. If users have sufficient knowledge of English as a second language, they might search in English because they perceive there is more content in this language and of better quality [8].

Organizations interested in transnational business have realized the importance of adapting to local cultures to be competitive in a global economy [30]. They are translating and localizing (adapting to the culture) their products and services on the Internet. Dor [30] warns against leaving the standardization of vernacular languages in the hands of software, media, and advertising industries, to the detriment of the users’ key role in language change, identity, and maintenance.

Even though this preoccupation is well founded, when Dor wrote his article the participatory Web was still in its infancy. Recently, the wide array of content-sharing and social media platforms, blogs, wikis, and social networking sites that make up the so-called “Web 2.0” has lowered the barriers for users to become producers of content too [6]. The social networking site Facebook broke with the top-down approach of language standardization in interface localization and implemented one where users seek consensus about the translation of terms in the interface [73]. However, the model of inviting users to translate the interface of a site is not transferable to every company.

The World Wide Web relied on the information retrieval paradigm, where users search and read content generated by institutions, organizations, broadcasting media, etc., while interpersonal communication happened via email, Internet Relay Chat (IRC), and newsgroups [6]. With the advent of Web 2.0 environments, which encouraged participation and sharing, there was a paradigm shift.
Users have become consumers and producers of content at the same time, blurring the boundaries between professional and user-generated discourse, individual and collective authorship, and various communication modes co-existing in a single platform: personal messages, instant messaging or chat, public posts, etc. [6].

Thanks to the changes brought by the participatory Web, there is a growing body of literature documenting the increased visibility of vernaculars [6], the creation of relevant content in minority languages [102], and foreign language practice and participation in transnational interest communities and diaspora communities [100]. Other studies in the field of computer-mediated communication focus on the reproduction in written form of patterns associated with spoken language, the use of slang or dialect features, and playful uses of orthography and typography [23], and describe the informal adaptations to the Roman alphabet of languages with other writing systems, like Arabic [108]. These characteristics of the written language in social media pose a challenge for the automatic analysis of text, which I will discuss in detail in chapter 4.

2.4.1 The mediation of technology and the cosmopolitan space

Instead of just thinking of a global language and its impact on local ones, Androutsopoulos [6] proposes to direct our attention to the circulation of cultural artifacts across national and ethnolinguistic borders and to how social media platforms enable the negotiation of local responses and the appropriation of those artifacts in new socio-cultural environments.

In these participatory environments, a network of users interacting with other users and with digital resources emerges. Digital resources like videos, still images, speech, music, and text can be labelled (tagged) by users, who are collectively building taxonomies [88], or even creating multilingual knowledge repositories like Wikipedia [49].
Very importantly, users are now finding information through social recommendations or cues (like tags) left by other people [88]. As a result of these overlapping networks of content and users, information and resources circulate in different ways across countries [6].

In the sociolinguistic ecology of the Internet, interactions between users are constrained by the mediation of technology [6]. The design of keyboards, displays, interfaces, standards that support writing systems, and features of communication platforms have an impact on the users’ language choices and translation behaviors. As explained in section 2.3, language choice can affect the directionality of information flows between language groups. In social media, social interactions are not as clearly delimited from interactions with content, since user comments on a digital object (text, photo, video), and repostings, also constitute an interaction with the user who posted it. Figure 2.4 illustrates technology and network structure conditioning users’ communication strategies.

Figure 2.4: (a) Interactions mediated by technology; (b) bilingual social network. Panel (a) focuses on a few users, Jen, Bob and Kate, who are interacting with one another and with Web content through the mediation of technology. Two networks overlap: the network of digital objects that are interlinked and the social network of users. Panel (b) illustrates the social network to which Jen and Bob belong. Pink nodes represent people who use English, blue nodes represent people who use Chinese, and the edges represent the “follower of” relationship in Twitter.

Zuckerman [119] used the metaphor of cities to provide a vision of an Internet that aspires to be a cosmopolitan space, enabling contact with the unfamiliar and the serendipity that fosters learning.
In cities, urban planning can create the structure for social contact, vibrant communities, and discovery [119]. Urban planning for a vibrant language ecology on the Internet has to challenge the existing structure of the network of hyperlinks and social connections, and consider the capabilities for sharing multimedia, the length of text permitted, the language technologies available, the functionality for managing audiences, and the flexibility for users to reinvent purposes and adapt content.

Inspired by this vision, I focus on the problem of the social network structure in multilingual egocentric networks and on the factors influencing the flexible language choices of multilingual users.

2.4.2 Overview and remarks

In summary, the sociolinguistic ecology of the Internet is determined by powerful social actors like national and supranational institutions, broadcasting media, and companies with interests in transnational business, and also by the contributions and interactions of users in social networks, content-sharing platforms, blogs, wikis, etc. At the microscopic scale, the interactions of users are mediated by technology, and constrained by it and by the network structure. At the macroscopic scale, the Internet is facilitating transnational communication, the flow of information and digital artifacts across language and national borders, and language learning. The growing language diversity of the Internet seems to be enabling access to information in minority languages and encouraging participation across a wider spectrum of society.

However, the flows, access and participation are hindered by socioeconomic factors, reduced bandwidth and lack of infrastructure in rural areas and disadvantaged parts of the world [34, 91], or even by governments that purposefully seek to keep their nations confined within an isolated information and communication sphere [107].
2.5 Micro-Sociology Focus: Conceptualizing Multilingual Users and Language Choice in Twitter

Adapting the Ecology of Language approach to the social network context, this dissertation focuses on the social network of the multilingual user, conceptualized as a micro-scale language ecology influencing the user’s language choices. As an application of this conceptualization, I propose the novel idea of modeling the influence of social network factors on the language choices of the user. In the rest of this section, I describe key concepts underlying this research related to multilingualism, language choice and mediation in the context of Twitter.

Mediators. In section 2.1, I explained the importance that De Swaan gave to multilingual speakers, who link language groups and provide cohesion to the system of languages [25]. In section 2.3, I introduced the work of Metoyer-Duran focusing on gatekeepers of ethnolinguistic communities, whom she described as being multilingual and multi-literate [82]. In the context of the Internet, Androutsopoulos [6] highlighted an additional condition to become a mediator between global resources and local audiences: adequate technology access and competence. Bringing these characteristics together, we can draw a very basic profile of mediators between language groups in Twitter: multilingual, multi-literate, and technologically literate.

There are many degrees of language competence and literacy: some users might understand a second language but be unable to speak or write it; others abstain from using one of the languages they know in certain contexts, like the documented cases of Internet users who preferred to search in English instead of their first language [8]. A discussion on the various degrees of bilingualism, and what Hornberger called “the continua of biliteracy” [56], falls outside the scope of this work.
For a comprehensive discussion and classification of the types of bilinguals, consult The Bilingualism Reader [111]. In this dissertation, I focus on the multilingual users that write in at least two languages on Twitter.

Language choice. A recurrent theme in the Sociolinguistics literature regarding multilingualism is language choice, or how multilingual speakers make decisions about which language to use in each situation and interaction. These small-scale decisions have an impact on the global dynamics when aggregated. Simplifying the outcomes of numerous research studies, Androutsopoulos [5] identified setting, participants, and topic as the main factors influencing language choice in bilingual online communities. Other works that I will review in this section and in chapter 3 highlight the influence of the audience, the social context, the prestige of a language, identity, etc. Figure 2.5 represents the main factors theoretically influencing the language choices of multilingual Twitter users. One of the contributions of this dissertation is to study, for the first time, social network factors in language choice.

Code-switching. A term frequently associated with language choice and bilingualism is code-switching. In this work, I use the definition of Joshi [62], which considers two types of code-switching: intra-sentential, when the user alternates from one language to another within the same sentence, and inter-sentential, when the change of language happens at a sentence boundary, as one sentence finishes and a new one starts. When studying code-switching, we need to identify the matrix language, which usually provides the grammatical structure and more lexical items, and the embedded language [62].

Figure 2.5: Aside from topic and identity, important factors influencing the language choices of a multilingual Twitter user are: the addressee in the interaction, the imagined audience, the social context (i.e. professional, personal), and the social network. Also, the setting (Twitter) and the Internet influence language choice due to the language ecologies associated with them. There is an overarching global language ecosystem.

In Twitter, posts are so short that we could consider a change of language from one post to the next as inter-sentential code-switching, while bilingual posts would be cases of intra-sentential code-switching. In chapter 6, the factor analysis represents language choices of the users as counts of inter-sentential code-switching. In chapter 7, the theme analysis includes cases of intra-sentential code-switching, where English keywords are embedded in posts in other languages.

The setting. Although not directly addressed in this dissertation, it is important to acknowledge the setting as an underlying factor for language choice. In this work, the setting is the Twitter platform, characterized by its design features, like the limitation of text to 140 characters in every interaction, the default mode of communication being public, the languages available to interact with the interface, the display of messages posted by other users, the possibility to share links, the asynchronous nature of communication, the features for users to manage their social network, etc. Also, conventions and social uses of Twitter have emerged over time among its users [66].

The Twitter setting has a specific language ecology derived from sociopolitical factors.
Twitter is a company based in the United States, which has an impact on its adoption across the world, or lack thereof in certain countries like China [83]. After the micro-blogging service launched in 2006, it was rapidly adopted in many countries; as early as 2007, Java et al. [60] reported its international adoption in North America, Europe, and Asia (mainly in Japan), but they estimated that 45% of the social network lay within North America. Not surprisingly, English became a dominant language, with estimates ranging from 51% of posts [55] to 53% [90]. A recent large-scale study selected the 20 most active countries in Twitter and showed the percentage of English use in each country against the percentage of their corresponding vernaculars [83], illustrating the weight that English has in communications via Twitter. Also, Poblete et al. [90] selected the 10 most active countries in Twitter and found that the U.S. was the country that attracted the most connections from overseas. For these reasons, I selected multilingual users who have English as one of their language options.

The participants or interlocutors. The participants or interlocutors in an interaction can vary from a one-to-one exchange in an online chat to a one-to-many question posed in a forum for an entire community. In the micro-blogging site Twitter, posts are generally public, but there are also posts addressed to specific individuals, and the possibility to send private messages.

The audience. In Twitter, there are different levels of reach a user could have. First, the posts addressed to one or a few individuals could be seen by common friends, and potentially found in search results of the platform by others; second, public posts can be seen by the network of people that “follows” the user, and potentially, these posts can reach anyone in Twitter.
Re-using a theoretical framework from the field of communication, Johnson classifies Twitter audiences as addressees, auditors, over-hearers and eavesdroppers [61]. Marwick and boyd [81] described the concept of the “imagined audience” in Twitter, or how the user conceptualizes his or her audience to be able to make linguistic choices, even though the real audience that reads the post might be different. In chapter 7, I test the hypothesis that addressing a message to one interlocutor or to a public audience influences the choice of language of the multilingual user.

The egocentric network. Whether Twitter users address posts to the members of their social network explicitly or not, in this dissertation I argue that the egocentric network has an impact on the choice of language. Not only could it have an influence as a perceived audience, but also as a source of information.

Topics or interests. Java et al. [60] identified Twitter communities based on the social network structure and, analyzing the words in the posts, they observed the common topics or interests that differentiated the communities. In the context of the Internet, where the perception of distance and territory blurs, the experience of identity becomes multi-layered. In addition to ethnic identities, there are dimensions of shared “feelings”, “knowledge”, or “activities” across distance [18]. Because of these multi-layered identities, the user can belong to different communities and choose the language accordingly. This dissertation includes preliminary work related to topics in the theme analysis (chapter 7), but a complete analysis of topics as a factor for language choice is left for future research.

Other factors. There are other important factors surrounding the multifaceted reality of language choice online. Kelly-Holmes [65] argues that the prestige and international importance of a language encourage its use online.
Also, language choice could relate to the availability of online resources in a language, or lack thereof [65]. The social context of the interaction is another factor, for example, English being used for professional emails and a vernacular language for personal communications [108]. As mentioned above, identity, as a marker of social and cultural differences, plays an important role in language choice [108, 5]. However, this dissertation does not include these factors in the analysis, to keep this research work to a reasonable scope.

Chapter 3 Related Work

This chapter comprises a review of the literature informing the present research work in one or more of the following themes: language choice and code-switching on the Internet, a network approach to language on the Internet, and multilingualism on Twitter.

This dissertation contributes a classification of network types based on the patterns of connections between language groups, which goes beyond survey works about multilingualism on Twitter. I used a network approach after being inspired by works analyzing language networks on the blogosphere. The literature about language choice on the Internet serves to frame my novel proposal of modeling the influence of network factors on the language choices of the user. Also, the literature about language choice is relevant to chapter 7, where I test the hypothesis that English is used more in public messages than in replies to individuals. In the theme analysis, I include cases of code-switching.

3.1 Language Choice and Code-Switching Online

A survey gathering answers from 2267 students in high schools and universities of eight countries (France, Italy, Indonesia, Macedonia, Oman, Poland, Ukraine, and Tanzania) revealed the complex reality of language choices of Internet users [65]. Most of the participants reported some knowledge of English in addition to their native language, or language of education, and often they also reported competence in a third language.
The study found that bilingual or trilingual Internet sessions were somewhat frequent, and that language choice could relate to the availability of online resources in a language, or lack thereof, and to the prestige and international communication potential of a native language [65]. Other research studies support the observation that the perceived scarcity of online resources in a native language influences the behavior and attitudes of its users when searching online [8, 67]. Also, domain knowledge influences language choice in online searching because higher expertise on a topic facilitates understanding of relevant texts in a second language [67].

Combining the factors topic and context, there are reported cases where looking for international news and doing academic work encourage the use of English, while personal communication is conducted more often in native languages [65]. Along the same lines, a study on email and Internet chat in Egypt documented the use of English for professional emails, while in personal emails and chat users preferred a romanized form of Egyptian Arabic, which was mostly used orally before the advent of the Internet [108]. These informal transliterations include the numbers 2, 3 and 7 for rendering additional phonemes from Arabic into the Roman alphabet [108]. On the other hand, the use of Classical Arabic and Arabic script was somewhat relegated [108]. Informal transliterations pose a challenge when doing automatic text analysis, as I will explain in detail in chapter 4.

The same study observed cases of code-switching between English and romanized Egyptian Arabic; the latter was used for greetings, humor, sarcasm, food, holidays and religion [108]. A case study on code-switching between English and Spanish, and English and Indonesian in Internet chat provided evidence of borrowed English terms related to computers, such as “e-mail”, “attachment”, and “PC” [12].
When chatting about friendship and relationships, the subjects preferred their first language instead of English [12]. Another study on the language choices of Greek, Turkish and Persian diaspora online communities in Germany found that topics on politics and technology disfavored the use of home and minority languages in newsgroups and web forums, while music and poetry favored it [5]. These works studying topics and contexts (i.e. professional versus personal) as factors for language choice provide some basis for the theme analysis in chapter 7 and for future research about the influence of interest communities on the language choices of Twitter users.

The remaining works that I review in this section study the selection of English in multilingual settings online. A longitudinal study of a Swiss forum with participants of three different native languages detected an increase in the use of English over time, even though English was not the first language of any of the users [31]. This finding suggests that the presence of a multilingual or international audience might encourage the use of English as a lingua franca. In view of this finding, I propose a multilingual index of the social network as a potential predictor of English use by the multilingual user (chapter 6). Other research works on email and mailing lists [31, 65] studied the impact on language choice of addressing a message to one person or to a multilingual audience; in the latter case English was preferred.

Recent works on the use of social networking sites by bilinguals argue that the intended audience determines the language choice. In Facebook, Welsh-English bilingual high school students write their status updates more often in English to ensure that all their friends feel included, whereas they use Welsh for one-to-one messages with other Welsh-speaking friends [21].
In Twitter, Welsh-English bilinguals use proportionally more English in public posts (53%) than in replies to individuals (44%) in a sample of 500 posts [61]. The reason for using Welsh or English in replies is related to the language profile of the addressee to some extent; sometimes English is used to communicate between Welsh-English bilinguals [61]. In relation to this finding, Johnson [61] speculates that the use of English is encouraged in Twitter for its potential to reach a wider audience. Finally, the latter study on Twitter reported very few cases of bilingual posts [61]. In chapter 7, I test whether English is used more in public messages, and I observe that bilingual posts are scarce.

As a closing note to this section, I would like to acknowledge an existing body of literature about “context collapse” in social media, and particularly in Twitter [81] and Facebook [106]. Professional and personal contexts merge in the same communication environment, where users seek to balance their different identity presentations [81] and, as a result, possibly their linguistic choices.

3.2 Networked Languages

The literature on multilingual computer-mediated communication is very helpful for raising awareness about the multifaceted reality of language choice and how the context and the mediating technology influence it, but it generally does not address the potential transnational impact. Language Networks on LiveJournal [53] was possibly one of the first studies to take a network analysis approach to the language demographics of a social media site, a blog hosting service in particular. Apart from studying the robustness of non-English language networks, the authors identified blogs that were bridging language communities and described some characteristics of their authors (students of foreign languages, expatriates, multilingual and multicultural) and topics (images, content with international appeal).
Two years later, the Berkman Center identified English and French “language bridges” on the Arabic blogosphere, consisting of bloggers that wrote in English or French and their native (Arabic) language, and connected the different national blogospheres with the international one [32]. However, they did not explore further the connections with the international blogosphere or their motivations for language choice. These questions are important to understand how people draw international attention using their transnational networks and how information disseminates across language borders.

Hale [47] tackled the aspect of cross-language linking among blogs in English, Spanish, and Japanese. Focusing on the topic of the earthquake in Haiti in 2010, he was able to quantify the increase in foreign content awareness over time and detect patterns of cross-lingual linking among blogs [47]. Most notably, blogs in English linked much less to foreign content than the blogs in Spanish and Japanese, the largest single destination of cross-lingual links being a collection of photos [47]. Interestingly, Global Voices, an international blogging community that promotes translation of content, created 15% of all cross-lingual linking in the dataset [47]. This finding illustrates that designing for multilingualism and cross-cultural awareness has an impact on the network structure.

This network approach to languages has been applied to the blogosphere, but not yet to the microblogging service Twitter. A study on the influence of distance in the formation of Twitter ties [99] tangentially included language. The most interesting finding related to this research is the observation of cross-language ties in a sample of 1768 pairs of nodes: English-Other (7.4%), Other-English (3.1%), Other-Other (1.1%) [99].
However, the authors identified the users’ language using only one post, and made some questionable assumptions in their interpretations, such as treating users as monolingual and equating country with language use, which led them to be skeptical of their own observation that 8% of geolocated subjects in Brazil used English [99]. Actually, this percentage of English use on Twitter in Brazil is very close to the estimate provided by a more solid and large-scale survey [83]. As in [99], there are examples of rough research assumptions about languages, geography, and text analysis on Twitter that show the need for a deeper understanding of these interrelated areas of study.

Shifting the focus from languages to countries, it is possible to find research that looks at transnational ties in Twitter. Poblete et al. [90] illustrated with a detailed network graph the strength of ties between the ten most active countries in Twitter. These ties are the aggregated results of users’ “following” relationships in Twitter. Apart from the expected stronger connections among countries that share the same language (U.S., U.K., Canada, and Australia), the graph also reveals that the U.S. attracts the most international attention, while it pays little attention outside its borders [90]. Also, South Korea and Brazil have few connections overseas [90].

Ideally, a holistic view of the language ecology in Twitter will require an analysis of the languages in the overlapping networks of users’ attention ties (follower relationship), interaction ties (replies and retweets), topics and linked resources. As a starting point, this dissertation focuses on attention networks at small scale.

3.3 Multilingual Twitter

In section 3.2, I noted that there are currently no research works that take a network approach to languages on Twitter, which would be useful to understand cross-language ties and communication connections between language communities.
On the positive side, a number of works study language on Twitter for different purposes and, as the body of research literature on Twitter is growing fast, it may be only a matter of time before works on networked language communities are published.

A large-scale study on the languages used in Twitter (more than 62 million posts over a period of four weeks) reveals that almost 49% of the posts are written in a language other than English and provides a ranking of the top 10 languages [55]. This study records the number of URLs and hashtags (keywords preceded by the # sign) shared by people from different language communities [55]. Its closing reflection encourages the study of bilingual brokers and how information flows across language communities [55]. An even larger-scale survey of Twitter (380 million posts over 564 days), conducted later, also presents a ranking of top languages among the 78 detected [83].

When comparing the language rankings of the first study and the later survey, we can observe that European languages are losing positions to Asian languages, except for the increase of Spanish and the unchallenged dominance of English [83]. Twitter is a fast-changing environment, where the language ecologies and their impact on communication behavior are constantly in the process of negotiating their contexts and finding their balances.

The same survey, The Twitter of Babel [83], compares the percentage of English use on Twitter in 20 countries versus the vernacular language use with an illuminating graph (figure 3.1). This graph should dispel the assumptions that equate country and language use. Another piece of evidence against such an assumption is the survey’s focus on Belgium. Mocanu et al. [83] compare Belgian census data with Twitter data, and observe that the Flemish-speaking population (or Belgian Dutch speakers) is over-represented in Twitter by comparison to the Walloon French-speaking population.
The researchers connect this observation with the finding that Twitter has higher penetration in the Netherlands than in France, which might attract more users of Dutch language variants regardless of geographical borders [83].

Figure 3.1: Language share of the top 20 most active countries on Twitter, ordered by number of English posts. This graph is an extract from the journal article [83]: Mocanu D, Baronchelli A, Perra N, Gonçalves B, Zhang Q, et al. (2013) The Twitter of Babel: Mapping World Languages through Microblogging Platforms. PLoS ONE 8(4): e61981.

The Twitter of Babel [83] constitutes a very valuable large-scale survey with examples at different scales, from country level to city and neighborhood level. However, it does not mention or count cross-lingual connections and bilingual users, even though they are implicit in the multilingual situations it describes. This is where a network approach could provide more insights about transnational influence.

Drawing attention to methodological challenges, Graham et al. [40] compare common geolocation and language identification methods used for Twitter data analysis. One of the issues the authors encountered is the difficulty in classifying text correctly as Arabic when the Twitter posts were written in the Roman alphabet [40], like the cases reported in section 3.1. In particular, Compact Language Detector failed to classify romanized Arabic in 89% of cases [40]. Also, Bergsma et al. [9] detail the challenges in automatic language identification of Twitter posts. The researchers used human annotators to label the language of the posts and build a test collection of Twitter texts in languages with non-Roman scripts, i.e. Arabic, Farsi, Hindi, and Urdu [9]. The aim is to improve automatic language identification in these languages [9].
Looking at the different use of Twitter depending on the language, at least two studies document the different frequency of features in Twitter posts, such as URLs, hashtags, repostings, and user mentions [110, 55]. The findings of these works suggest that Twitter is used more for conversational purposes in some languages, like Indonesian, while in other languages, like German, it is more common to use it for sharing resources [110, 55]. This dissertation proposes to study in future work particular languages as factors influencing the communication behavior of multilingual users.

In conclusion, the studies about languages in Twitter are descriptive, or of survey type, and one can only infer implicitly that multilingual users play a role in the language landscape they describe. Even when comparing the different uses of Twitter depending on the language, these studies do not investigate further the interactions between language communities and their relation to language choice.

Chapter 4 Methodology

Inspired by an expanded paradigm of Web Content Analysis proposed by Herring [52], who also pioneered a network approach to the study of languages on social media [53], this research includes social network analysis, natural language processing for automatic language identification, and theme and exchange analyses. This expanded paradigm proposes broadening the construct of content analysis to accommodate new techniques of analysis appropriate for the evolving landscape of the Internet, and enumerates link and exchange analyses, topic analysis, feature analysis, image analysis, language analysis, etc.
In this dissertation, I apply social network analysis to answer the first research question about the egocentric networks of multilingual users; I use two regression models in the factor analysis to answer the second research question on social network factors that affect language choice; finally, I test the hypothesis that the addressivity feature (@ sign) influences language choice, and explore with a theme analysis how other textual features might be facilitating cross-cultural awareness.

This chapter starts with a brief introduction of the research design, followed by an account of the collection and processing of data that underlies the four studies of the dissertation. The details of the analysis are described in the chapters corresponding to each study.

4.1 Research Design

The research design is composed of four sequential studies of the same datasets, focusing on complementary facets of mediation between language communities and language choice. Table 4.1 shows a schematic view of the four studies and the corresponding chapters.

First, I identified Twitter users authoring posts in English and another language. I collected their last 50 posts and their egocentric network with degree 1.5. The egocentric network with degree 1.5 includes the people connected with the multilingual subject, or ego, and the connections among those people (see section 2.3 for social network concepts). Also, I automatically analyzed the last 30 posts of all the users within the egocentric networks to identify the language they use in Twitter.

In summary, the data comprises a list of 92 egos, with 50 posts each, and a list of contacts associated with every ego, with a language label, and their linkages in the form of an adjacency list. Figure 4.1 illustrates the components of the datasets.
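
To make the data layout concrete, the following is a minimal Python sketch, not the code used in this research. It assumes a toy ego with four posts and four contacts; the identifiers (`ego_posts`, `contact_language`, `adjacency_list`) and the helper functions are hypothetical names introduced only for illustration. The sketch shows how language shares could be computed for the ego and for the network, how the edges of a 1.5-degree egocentric network could be assembled from the adjacency list, and how posts could be split by the addressivity feature, under the assumption that a reply is a post whose text starts with the @ sign.

```python
from collections import Counter

# Hypothetical toy data in the layout described above (all names are illustrative).
ego = "ego_1"
ego_posts = [                       # (text, language label) for the ego's posts
    ("Post 1", "en"),
    ("@kate Post 2", "es"),
    ("Post 3", "en"),
    ("@bob Post 4", "es"),
]
contact_language = {"c1": "en", "c2": "es", "c3": "en", "c4": "en"}
adjacency_list = [("c1", "c2"), ("c1", "c3"), ("c2", "c3"), ("c3", "c4")]


def language_shares(labels):
    """Return the proportion of each language label in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def ego_network_edges(ego, contact_language, adjacency_list):
    """Edges of a 1.5-degree egocentric network.

    Ego-contact ties come first, followed by the ties among the ego's contacts.
    """
    ego_edges = [(ego, c) for c in sorted(contact_language)]
    alter_edges = [(a, b) for a, b in adjacency_list
                   if a in contact_language and b in contact_language]
    return ego_edges + alter_edges


def split_by_addressivity(posts):
    """Split language labels into public posts and replies (text starting with '@')."""
    public = [lang for text, lang in posts if not text.startswith("@")]
    replies = [lang for text, lang in posts if text.startswith("@")]
    return public, replies


ego_shares = language_shares([lang for _, lang in ego_posts])
network_shares = language_shares(list(contact_language.values()))
edges = ego_network_edges(ego, contact_language, adjacency_list)
public, replies = split_by_addressivity(ego_posts)
print(ego_shares, network_shares, len(edges), public, replies)
```

In this toy example the ego writes half of the posts in English while three quarters of the contacts use English; proportions of this kind are the sort of quantities that a factor analysis could relate to each other.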
In chapter 5, the social network analysis combines a qualitative approach and network statistics to generate a taxonomy of network types based on the patterns of intersections and connections between language groups. The study follows an exploratory design, with a first qualitative phase that takes a grounded theory approach, and a second quantitative phase that consolidates the qualitative findings. The unit of analysis is the egocentric network of multilingual users. I visualized the 92 networks with the Gephi social network analysis tool and identified the groups of people that write in different languages. Focusing on the structural relationships of these language groups, I complemented the qualitative study of visualizations with network statistics specifically created to provide a robust definition of network types. Finally, I used machine learning for testing the results.

Table 4.1: Research design divided in four studies of the same datasets, looking at the research problem of multilingual Twitter users as mediators and their language choices with different foci. In chapter 5, I use social network analysis for classifying egocentric networks of multilingual Twitter users. Chapter 6 consists of a factor analysis, using regression models to detect if the social network influences language choices of multilingual users. Chapter 7 focuses on textual content; firstly, I test the hypothesis that addressivity influences language choice and, secondly, I look for international themes and other textual features that might indicate cross-cultural awareness.

Problem                                          Facet            Chapter   Objective                                       Approach
Multilingual Twitter users as mediators          social network   5         classification of egocentric networks           social network analysis (QUAL+QUAN)
Multilingual Twitter users as mediators          content          7         exploring cross-cultural themes                 theme analysis (QUAL)
Language choices of multilingual Twitter users   social network   6         influence of network in language choice         factor analysis, regression (QUAN)
Language choices of multilingual Twitter users   content          7         influence of addressivity in language choice    hypothesis test (QUAN)

Figure 4.1: The data comprises a list of 92 egos, with 50 posts each and language labels, a list of their contacts with a language label, and the social network links in the form of an adjacency list.

In chapter 6, the factor analysis models the influence of a set of factors related to the social network on the language choices of multilingual users. The dependent variables considered are the proportion of English use and non-English use within the 50 posts of the ego (language choice of the ego). The factors included are the proportion of English and non-English language use in the social network of the ego, and the degree of multilingualism of the social network. The relative importance of factors, or their weight, is represented by the coefficients obtained by fitting two different generalized linear models to the dataset (linear and logistic regression).

In chapter 7, exploring textual features, I shift attention from the social network to the content of the posts written by the egos. First, I look at the textual feature of the @ sign at the beginning of a post as an indicator of addressivity. Based on this indicator, I test the hypothesis that the type of exchange (public post versus reply to an individual) influences the choice between English and other languages.

In a second study included in chapter 7, I look at content with the objective of detecting themes that might help in creating cross-cultural awareness, where the multilingual users could be acting as mediators from the point of view of their messages.
I identify themes related to non-English speaking countries or communities in English posts and, also, I identify English hashtags (keywords preceded by the # sign) inserted in non-English posts. Using a generic theme analysis, this study serves as an explorative qualitative phase to inform the design of future studies after this dissertation work. 4.2 Sampling and Data Collection I identified potential multilingual Twitter users with the help of Prof. Jennifer Golbeck and Tony Rogers. We started by issuing queries to the Google search engine, restricted to the Twitter domain, that combined one English word (“between” or “tomorrow”) and one of the words in the list of figure 4.2. For instance, a query was “tomorrow” and “tambie´n” (which means “also” in Spanish). The words were selected from lists of “stop words” for every language. Stop words are very common words in a language. There are many lists of stop words created for natural language 56 Language Words Arabic !"#$ , %#"& Chinese  French alors, très German zusammen, gern Greek περίπου Hebrew טוב Italian molto, peggio Japanese ,  Korean  Polish właśnie, chyba, albo Portuguese muito Russian к , о Spanish desde, también Figure 4.2: Common words in different languages used for querying in combination with English words. processing, usually to filter them for various purposes. In this case, I used these common words to represent each language. The main selection criteria was that the word should not be identical or similar to any other word in a different language, in order to avoid ambiguity about the language the word represents. In sections 2.5 and 3.3, I reviewed studies that document the dominant use of English in Twitter and how this relates to the weight that the United States has in the social network [60, 90] and to its use in many non-English speaking countries that are active on Twitter [83]. 
When trying bilingual combinations in our initial search, it was very difficult to find bilinguals that did not use English as one of their active languages; this observation is supported by the findings of a study [99] discussed in section 3.3. For this reason, we decided to limit the sampling to multilingual users who wrote in English and at least one other language.

The search results pointed to the users’ profile pages on Twitter. The ranking of users’ profiles given by Google could be placing more popular users first, biasing our initial selection, but we do not know the actual criteria used by the search engine. We visited these profile pages, read the latest posts, and checked that the subjects were actually using two languages. We established clear written instructions for selecting them. In particular, we did not select users whose:

• posts in one language were automatically generated (e.g. users posting in Spanish with only Foursquare check-ins in English) or were spam,

• posts in one language were only named entities, like song titles, names of books, etc.,

• posts in one language were only reposted content (or “retweets”).

Note that reposting on Twitter does not prove any active knowledge of a language, as it only requires clicking a button or copying text. Moreover, if users just repost the same text, the message stays confined to the same language community.

The instructions for selecting a user also required that he or she had written at least one post entirely in a second language, had more than 30 posts (excluding “retweets”), and had between 4 and 5,000 followers. I discarded potential subjects that had more than 5,000 followers due to the computational workload required for processing large social networks and the policy limitations of the Twitter API for extracting data.

We identified 175 potential multilingual users. After this first selection of users, I retrieved the last 50 posts of each one of them by means of the Twitter API. The API only allows one to extract data from public user accounts in Twitter. For this reason, accounts made private by the users are not included in the sample of multilingual subjects, and when extracting their contacts automatically, private accounts yield no data. As in the previous selection phase, I did not include the repostings, with the exception of those that had a comment added by the user; in such cases, it might be possible to find bilingual text and translation. Specifically, the 50 posts of every user did not include the repostings that were shared by clicking on a “retweet” button, or those posts that started with the characters “RT” or “rt” (abbreviations of retweet), but could include the posts that had some text written before “RT” or “rt”.

Based on the data, I selected only those users who had written at least 4 non-automatically generated posts in a second language, to ensure that the language was well represented. During the data collection process, I had to discard some users because they made their accounts private or closed them, and one user started posting spam. The data collection process spanned from October 3 to November 7, 2011.

The final sample contains 92 multilingual users that write in 19 languages (Arabic, Basque, Catalan, Chinese, Dutch, English, French, Galician, German, Greek, Hebrew, Italian, Japanese, Korean, Mongolian, Polish, Portuguese, Russian, Turkish), usually two or three languages per person.

Figure 4.3 shows the purpose of the different components of the dataset. I kept the last 50 posts of the final multilingual users for studying their language choices and conducting the theme analysis. Also, with the help of Prof. Jennifer Golbeck, I extracted the location of the multilingual users from their profiles as contextual information. For every multilingual user (ego) we extracted the egocentric networks with degree 1.5 in the form of adjacency lists of followers and followings (contacts) for the social network analysis, as explained in section 4.1. In total, there are 25,556 contacts within the 92 egocentric networks. Finally, I retrieved the last 100 posts from the contacts’ accounts to identify the languages they use in Twitter, with the exception of private accounts and accounts with no posts (5,950 cases). As explained in section 4.4.4, only 30 posts per user were analyzed, and the only repostings included are those with text before the characters “RT” or “rt”.

[Figure 4.3 components. Egos dataset: 92 egos; posts of egos (for language choice and theme analysis); location (contextual information). Contacts dataset: followers and followings as an adjacency list of contacts, egocentric network of degree 1.5 (for social network analysis); last 30 posts of contacts (for automatic language identification).]

Figure 4.3: Data collection and purpose of different datasets extracted from Twitter.

4.3 Methods for Assigning Language Labels to Users

In this section, I introduce the options I considered to assign language labels to every user in the egocentric networks with the purpose of completing the contacts dataset illustrated in figure 4.3. First, I had to automatically detect the language(s) used in a number of their posts and, secondly, I had to determine their language profile based on those multiple posts. The egos dataset also required the automatic identification of the languages of posts, but the egos were not assigned a language label. In the factor analysis, the language profile of the multilingual egos is conceptualized differently, as pairs or frequencies for the two main languages of the user.

I considered two main options in order to assign a language label to a user in Twitter: (1) extracting the language code that the user has selected on the Twitter interface, and (2) automatically identifying the language of a certain number of posts that the user has written.
In the case of extracting the language code of the interface, the immediate problem is that this code is not accurate for bilingual users. Also, as I will explain in section 4.4, a test suggested that the interface language has a very high error rate in representing the actual languages of the user. One reason could be that many users do not change the interface language given by default (e.g. English) because they can understand it, but prefer to write in their native language.

As a result of these considerations, I chose the option of automatically identifying the language of users’ posts. As reviewed in section 3.2, related research [99] used only one post per person to assign a language to a user. This is insufficient to determine bilingual and multilingual use, and also problematic in a noisy environment such as Twitter, with frequent cases of automatic posting and spam. A question that I will address in the next section is how many Twitter posts are enough to determine the language(s) of the user. Having more than one post and language label for a user requires a process to determine which language label fits that user best.

4.3.1 Tools for automatic language identification

The first step was to identify the language of the users’ posts. I considered three tools: the Google Language Identification tool (Google’s proprietary option), the Chrome browser Compact Language Detector (partly open source code by Google), and the Python Language Detector (an open source module for the Python programming language). Google’s language detection algorithm uses quadgrams, or four-character tokens [118, 97], and the Python Language Detector uses trigrams [43].

Briefly, the process of using a language identification tool works as follows: I send an input file that contains the posts of a user after eliminating mentions, hashtags, URLs and symbols, and the language identification tool returns an output file with the language labels and confidence levels for every post.
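The cleaning step that precedes language identification can be sketched as follows. This is a minimal illustration, not the code used in the study: the regular expressions, the function name `clean_post`, and the decision to keep apostrophes and digits are all assumptions, since the text only states that mentions, hashtags, URLs and symbols were eliminated.

```python
import re

# Hypothetical patterns; the original cleaning rules are not documented in detail.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
# Anything that is not a letter, digit, whitespace or apostrophe counts as a symbol here.
SYMBOL_RE = re.compile(r"[^\w\s']|_")
WHITESPACE_RE = re.compile(r"\s+")


def clean_post(text):
    """Remove URLs, mentions, hashtags and symbols, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_post("@ana mañana vamos al cine! http://t.co/abc #plan"))
```

URLs are removed first so that their punctuation does not leave stray fragments behind. A post that consists only of mentions or hashtags becomes an empty string, which a caller would probably want to skip before sending text to the language identifier.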
4.3.2 Algorithm for assigning a language label to a person

In a second step, I elaborated the rules for assigning a language or languages to a person. For a given user, with a list containing pairs of language and confidence level, the heuristics of the algorithm are:

• Discard all pairs with confidence level below 0.1. The purpose of this rule is to eliminate noise or inaccuracies in the language assignation method. If no pair remains, the language label assigned to the user is “unknown”.

• For each remaining language, compute the frequency (number of posts in that language) and select the highest confidence value of all posts in that language, hereafter called “maximum confidence”.

• Discard all languages with a frequency below 10% of the total number of posts for that user. The purpose of this rule is to eliminate languages that are not well represented in the profile of the user, due to automatic posting, etc. If no language passes the frequency threshold, the label assigned to the user is “unknown”.

• Determine if the user is monolingual or multilingual. If more than one language has maximum confidence equal to or greater than 0.7, the language label assigned to the user is a multilingual label. Otherwise, the label is monolingual. Note that the “maximum confidence” is the highest confidence level of all posts in a language for one user. Therefore, the requirements for considering a user multilingual are: (1) at least two languages with a minimum frequency of 10%, and (2) a maximum confidence equal to or greater than 0.7 for each of them. This multilingual label is composed of the code of the most frequent language and the code of the second most frequent language.

• In the monolingual case, if only one language has maximum confidence equal to or greater than 0.7, that is the language assigned to the user.

• In the monolingual case, if no language has maximum confidence equal to or greater than 0.7, the language assigned to the user is the one with the highest frequency.
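The heuristics above can be sketched in a few lines of Python. This is an illustrative reading of the rules, not the original implementation: the function and constant names are hypothetical, frequencies are taken over all posts of the user (including posts discarded by the confidence rule, which is one interpretation of “the total number of posts”), the multilingual label is built from the two most frequent languages among those that reach the 0.7 threshold, and ties in frequency are broken by maximum confidence. The last two choices are assumptions where the text is not explicit.

```python
from collections import defaultdict

MIN_CONFIDENCE = 0.1    # rule 1: discard posts identified with lower confidence
MIN_FREQUENCY = 0.10    # rule 3: discard languages used in fewer than 10% of posts
HIGH_CONFIDENCE = 0.7   # rule 4: maximum confidence required for the multilingual label


def assign_language_label(posts):
    """Return 'unknown', a language code, or two codes joined by '+'.

    posts: list of (language_code, confidence) pairs, one pair per post.
    """
    total_posts = len(posts)
    counts = defaultdict(int)
    max_conf = defaultdict(float)
    for lang, conf in posts:
        if conf < MIN_CONFIDENCE:
            continue
        counts[lang] += 1
        max_conf[lang] = max(max_conf[lang], conf)

    # Keep only languages that are well represented for this user.
    kept = [lang for lang in counts if counts[lang] / total_posts >= MIN_FREQUENCY]
    if not kept:
        return "unknown"

    # Most frequent first; ties broken by maximum confidence (assumption).
    kept.sort(key=lambda lang: (counts[lang], max_conf[lang]), reverse=True)
    confident = [lang for lang in kept if max_conf[lang] >= HIGH_CONFIDENCE]

    if len(confident) >= 2:
        return "+".join(confident[:2])   # multilingual label
    if len(confident) == 1:
        return confident[0]              # monolingual, one confident language
    return kept[0]                       # monolingual, fall back to most frequent


if __name__ == "__main__":
    mary = [("cy", 0.08), ("en", 0.75), ("en", 0.40), ("en", 0.65),
            ("ar", 0.20), ("ar", 0.35), ("ar", 0.80), ("en", 0.80),
            ("fr", 0.5), ("en", 0.30), ("en", 0.70), ("ar", 0.60)]
    print(assign_language_label(mary))
```

The input in the usage example is the list of language and confidence pairs shown for the user @mary in figure 4.4a, for which the expected label is “en+ar”.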
Figures 4.4a and 4.4b illustrate the process of assigning a language label to a user with examples. Note that the thresholds are based on the assumption that the confidence level represents a probability between 0 and 1, but they could be tailored for each tool.

The confidence level of 0.7, used as the threshold to determine multilingualism, was selected after observing issues derived from transliteration. For instance, Arabic was sometimes written in the Arabic script, but also in a romanized form that the tool identified incorrectly, and with low confidence, as some other language. In the cases where Arabic was one of the two languages used, the romanized form (with an inaccurate language label and low confidence) had enough frequency to displace Arabic as one of the actual languages that composed the multilingual label. Adding the requirement of high confidence (equal to or greater than 0.7) eliminated the instances with incorrect labeling and favored the Arabic that was correctly classified because it was not transliterated.

[Figure 4.4a. User @mary. Outcome of the automatic language ID tool (language, confidence): cy 0.08; en 0.75; en 0.40; en 0.65; ar 0.20; ar 0.35; ar 0.80; en 0.80; fr 0.5; en 0.30; en 0.70; ar 0.60. 1st threshold: no language with confidence below 0.1. Languages ordered by frequency, with maximum confidence selected: en, 50%, 0.80; ar, 33%, 0.80; fr, 8%, 0.50. 2nd threshold: no language below 10% frequency (if no language remains at either threshold, return label “unknown”): en, 50%, 0.80; ar, 33%, 0.80. Multilingual: at least 2 languages with confidence 0.7 or more. Monolingual: 0 or 1 languages with confidence 0.7 or more. The algorithm assigns the label “en+ar” to user @mary.]

(a) Example where the algorithm assigns a multilingual label to a user.
[Figure 4.4b. User @tony. Outcome of the automatic language ID tool (language, confidence): cy 0.08; en 0.75; en 0.40; en 0.65; ar 0.20; ar 0.35; pt 0.10; en 0.80; fr 0.5; en 0.30; en 0.70; en 0.60. 1st threshold: no language with confidence below 0.1. Language, frequency, maximum confidence: en, 64%, 0.80; ar, 18%, 0.35; fr, 9%, 0.50; pt, 9%, 0.10. 2nd threshold: no language below 10% frequency (if no language remains at either threshold, return label “unknown”): en, 64%, 0.80; ar, 18%, 0.35. Multilingual: at least 2 languages with confidence 0.7 or more. Monolingual: 0 or 1 languages with confidence 0.7 or more. The algorithm assigns the label “en” to user @tony.]

(b) Example where the algorithm assigns a monolingual label to a user.

Figure 4.4: Two examples of how users are assigned language labels by the algorithm after the language of their posts has been automatically identified.

4.4 Testing Methods for Assigning Language Labels to Users

In this section, I explain the creation of a test dataset and a baseline for comparing the results of different language identification tools, and for comparing the results of the language assignation algorithm with those of a human making that assignation. Finally, I compare the test results using a number of posts per user between 1 and 100 to answer the question: how many Twitter posts are enough to determine the language(s) of the user?

4.4.1 The test dataset

For testing the tools, I prepared a sample of users. I randomly sampled 10 egos from the 92, and 20 contacts for each ego (or all of the ego’s contacts when there were fewer than 20), obtaining a total of 190 users from the list of 25,556 contacts. Of those users, only 177 had data available. The other 13 users had no data, either because their account was private or because they had not posted anything. Finally, I extracted the last 100 posts of the 177 users, including repostings that had a comment added by the user. In total, I extracted 15,973 posts. This is my test dataset.
4.4.2 The baseline

I created a baseline as close as possible to human labeling. Given the time and resource constraints, I decided to use one of the automatic language identification tools to assign a language to each post as an initial guess and then manually revise the results. In this task, I took advantage of my skills in English, Spanish, and French, as well as my familiarity with some Romance languages, such as Portuguese, Italian, Catalan, and Galician.

The number of posts from the test dataset that were sent to the language identifier after eliminating mentions, hashtags, URLs and symbols is 15,856. Then, I revised the language labels of the posts following these criteria:

• A. A post is labelled monolingual if any of the following are true:
1. All words are in one language.
2. There is only one word in a second language within a post of five or more words.
3. There is a title or named entity in a second language and fewer than five words in the most dominant language of the user.

• B. A post is labelled bilingual if any of the following are true:
1. There is one word in a second language and the post has fewer than five words.
2. There are two words in a second language and the post has between five and ten words (inclusive), unless case A3 applies.
3. The post has more than ten words and there are at least three words in a second language, unless case A3 applies.
4. A title or named entity is in a second language but there are at least five words in the other language.

• C. A post is labelled automatic if any of the following are true:
1. There are two identical posts.
2. There are at least three posts that are nearly identical except for a number or a word.
3. There are sentences like: “Posted a picture on Facebook”, “liked a photo on Facebook”, “favorited a Youtube video”, “I am at something @ name of place” (Foursquare).

• D. A post has a non-identifiable language if any of the following are true:
1. There is a named entity that could correspond to more than one language.
2. There are only symbols, emoticons, or random letters.
3. Other reasons.

Named entities, a term coined in the field of Information Extraction, are information units that consist of rigid designators for a referent, like proper names of people or organizations, locations or times, among many types [84].

In the cases where Arabic, Hebrew and Mongolian were transliterated, the tool did not identify the language correctly. In those cases, I considered the language of the post to be the language of the user other than English (Arabic, Hebrew or Mongolian). This language was determined from other posts written by the same user in non-transliterated form, where the tool is more accurate.

Finally, the criterion for classifying the language profile of the user for the baseline was a 10% frequency threshold to determine if the user was monolingual or multilingual. If fewer than 10% of the posts were in a second language, the user was considered monolingual and assigned the label of the most frequent language. If the second language passed the threshold, the user was assigned a multilingual label composed of the codes of the most frequent language and the second most frequent language. Automatic posts and non-identifiable language posts did not add to the frequency count of any language. Bilingual posts added 0.5 frequency points (instead of 1) to each of the two languages.

To create the baseline, I decided to use the Google Language Identification tool because the Compact Language Detector (CLD) has two additional disadvantages and the Python Language Detector has one additional disadvantage. Unlike Google’s proprietary option, CLD cannot detect the Mongolian language in the dataset, and its confidence values do not represent a probability. The confidence values range from 0 to over 100, but the maximum value is unknown.
The lack of interpretability of the confidence values makes it difficult to use the algorithm that assigns a language label to a person.

The Python Language Detector is able to detect Mongolian and its confidence values are between 0 and 1, which might indicate a probability. In practice, the confidence values are biased towards the range 0.9–1. The minimum is 0 only when the post is empty. Otherwise, 0.17 acts as the minimum confidence value, for instance, in cases where the post has just a symbol and any guess should return a confidence value of 0. This biased behavior of the confidence values would affect the performance of the algorithm. I considered tailoring the thresholds of the algorithm that assigns a language to a person to account for this, but given the disproportionate number of posts with confidence level 0.9, this value has little discriminatory power. For these reasons, I expected that the Google Language Identification tool would perform better.

In summary, I considered my baseline to be the results of Google Language Identification with a subsequent manual revision, and I assigned the language labels according to the set of criteria described in this subsection 4.4.2.

4.4.3 Testing the language identification tools and the algorithm that assigns language labels to users

I tested the performance of the language assigning algorithm by comparing the baseline (Google Language ID, manual revision, criteria-based language assignation) with the results of the automated language assignation (Google Language ID, manual revision, language assigning algorithm). The manual revision is not performed in the actual analysis of the contacts dataset, but it serves for testing the performance of the language assigning algorithm, changing only one variable with respect to the baseline. To obtain the estimated error rate, I divided the number of cases where the automated results did not coincide with the baseline by the 177 total cases.
The resulting estimated error rate is 6.78%, with 1.13% false negatives (missing multilinguals) and 5.65% false positives. Therefore, the algorithm tends to overrepresent multilinguals.

Subsequently, I tested the performance of the Google Language Identification tool in combination with the language assigning algorithm, eliminating the human revision (Google Language ID, language assigning algorithm). These are the actual conditions of the analysis performed with the contacts dataset. I computed the estimated error rate with respect to the baseline. Figure 4.5 shows how increasing the number of posts used for assigning a language to a person diminishes the estimated error rate.

[Figure 4.5 plot: x axis, number of tweets (0 to 100); y axis, error rate (0.05 to 0.30).]

Figure 4.5: The y axis represents the estimated error rate of using the Google Language ID method with respect to the baseline and the x axis represents the number of posts per person used for language assignation. As the number of posts increases, the estimated error rate decreases approximately logarithmically.

Using a regression model, I obtained function 4.1 by fitting the estimated error rate as a function of the number of posts per person used for language assignation. The variable x is the number of posts per user.

f(x) = 0.285 − 0.056 × log(x)    (4.1)

Compared to Google Language ID, the Compact Language Detector fails to detect the language in many more instances: Google did not identify the language in 46 cases, while CLD did not identify the language in 3,071 cases out of 15,856 in the test dataset.

In the case of the Python Language Detector (LangPy), there are 3,590 outcomes that differ from those of Google Language ID (from a total of 15,856 posts).
I compared the performance of LangPy in combination with the algorithm that assigns language labels to users (LangPy, language assigning algorithm) with the baseline (Google Language ID, manual revision, criteria-based language assignation). Figure 4.6 shows the estimated error rate of using LangPy with respect to the baseline, as the number of posts used for assigning a language to a person increases. It also displays, side by side, the previous results of the estimated error rate using Google Language ID (Google Language ID, language assigning algorithm) with respect to the baseline. In the case of Google Language ID, the estimated error rate is always lower, partially due to the fact that the baseline uses this tool as a starting point. Also, the biased confidence values of the Python Language Detector constitute a challenge for the algorithm that assigns language labels to users.

[Figure 4.6 plot: x axis, number of tweets (0 to 100); y axis, error rate (0.1 to 0.3); one series per language detector (Google, LangPy).]

Figure 4.6: The y axis represents the estimated error rate with respect to the baseline and the x axis represents the number of posts per person used for language assignation. The tools compared are Google Language ID and the Python Language Detector (LangPy). As the number of posts increases, the estimated error rate diminishes. In the case of Google Language ID, the estimated error rate is always lower, in part because the baseline uses this tool as a starting point.

As explained before, looking at Twitter’s interface language code to identify a language has the highest estimated error rate of all methods considered: 0.418. Of the 177 users in the test dataset, 25.99% were multilingual, but the interface does not offer them the option to select more than one language. Another 15.82% of users were using a language different from the language selected on the interface, which was English in all of these cases.
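The estimated error rates reported in this subsection all follow the same recipe: the share of test users whose automated label differs from the baseline label, split into missed multilinguals (false negatives) and spurious multilinguals (false positives). The sketch below illustrates that computation with made-up labels. The function names and the extra “other” category (for mismatches that do not change the monolingual or multilingual status) are assumptions added for completeness and are not part of the original analysis.

```python
def is_multilingual(label):
    """Multilingual labels join two language codes with '+', e.g. 'en+ar'."""
    return "+" in label


def estimated_error_rates(baseline, predicted):
    """Compare automated labels with baseline labels for the same users."""
    if not baseline or len(baseline) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    false_neg = false_pos = other = 0
    for base, pred in zip(baseline, predicted):
        if base == pred:
            continue
        if is_multilingual(base) and not is_multilingual(pred):
            false_neg += 1      # a multilingual user was missed
        elif is_multilingual(pred) and not is_multilingual(base):
            false_pos += 1      # a monolingual user was labelled multilingual
        else:
            other += 1          # same status, different language(s)
    n = len(baseline)
    return {
        "error": (false_neg + false_pos + other) / n,
        "false_negatives": false_neg / n,
        "false_positives": false_pos / n,
        "other": other / n,
    }


if __name__ == "__main__":
    # Hypothetical labels for five users, for illustration only.
    baseline = ["en", "en+es", "es", "en+ar", "fr"]
    predicted = ["en", "en", "en+es", "en+ar", "de"]
    print(estimated_error_rates(baseline, predicted))
```

As a sanity check on the reported figures, 6.78%, 1.13% and 5.65% of 177 users are consistent with 12 mismatches, 2 false negatives and 10 false positives, although only the percentages are reported here. Likewise, the interface-language error of 0.418 is the sum of the two shares given above (25.99% and 15.82%).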
Estimated Error Rate | Number of Posts per User | Estimated Cost ($)
Below 0.15 | 15 | 363.11
Below 0.10 | 30 | 704.40
Below 0.05 | 70 | 1,547.85

Table 4.2: Overview of different budget options: estimated error rate in the automatic language analysis associated with the number of posts used per person, and the corresponding analysis costs for the entire contacts dataset. The use of the Google Language ID tool costs $20 per one million characters of text.

4.4.4 Deciding the number of posts per user

Once I decided to use the Google Language ID tool, a question remained: “How many Twitter posts are enough to determine the language(s) of the user?” In essence, this is a budget question. The cost of using the Google Language ID tool is $20 per one million characters of text. Drawing from the estimated error rate results shown in figure 4.5 and the estimated error rate function 4.1, I selected three error rate options paired with the number of posts per person needed. Based on this number, I used the character count to estimate the cost of analysis for the contacts dataset, which comprises 19,606 contacts with up to 100 posts per user. Table 4.2 provides an overview of estimated cost versus estimated error rate in the automatic language analysis of the contacts dataset.

With Prof. Jennifer Golbeck’s advice, we selected an estimated error rate of 0.10, at a cost of $704.40 for the automatic language identification of the contacts dataset using 30 posts per person. Aside from the contacts dataset, there is also a small dataset of 92 egos with 50 posts per user (figure 4.3).

4.5 Assigning Language Labels to Users

For the social network analysis and the factor analysis, the contacts dataset required language labels for 25,556 people in the 92 egocentric networks. However, only 19,606 contacts had available data. The rest were assigned the language label “unknown”. As explained in section 4.4.4, I decided to extract 30 posts per person to determine the language or languages they are using in Twitter.
The text extracted from the posts was processed through a pipeline for automatic language identification. The first stage in the pipeline involved the elimination of URLs, hashtags (keywords preceded by the # sign), replies and mentions (usernames preceded by the @ sign), and other symbols. In the second stage, I used the Google API to identify the language of each processed post and the confidence value. Subsequently, every user was represented by a file that contained the languages and confidence values of their posts. This file was processed by the algorithm that assigns language labels to users; the details of the heuristics can be found in section 4.3.2. The labels could be: “unknown”; a code for one language in the case of monolingual users (e.g. “en” for English); or two language codes joined by the symbol “+” in the case of multilingual users (e.g. “en+ar” for English and Arabic).

In the case of the egos dataset, composed of 50 posts from each of the 92 multilingual users, the text was similarly processed for automatic language identification. Since the purpose of this dataset is different, I automatically processed the data to obtain the percentage of use of the two most frequent languages for every person, with the corresponding language codes, and identified those egos that used a third language in at least 10% of the posts. The results served to describe the characteristics of this dataset and to quantify the language choices.

The egos dataset is composed of 87 bilingual users and 5 trilingual users, all of whom use English as their first or second most frequent language (which was a condition in the sampling method). They also use one or two of the following 18 languages: Arabic, Basque, Catalan, Chinese, Dutch, French, Galician, German, Greek, Hebrew, Italian, Japanese, Korean, Mongolian, Polish, Portuguese, Russian, Turkish.
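The choice of 30 posts per person made in section 4.4.4 can be checked numerically against function 4.1 and Table 4.2. The short sketch below evaluates the fitted function for the three budget options and converts each cost into the volume of text it implies at $20 per million characters. It assumes that the logarithm in function 4.1 is the natural logarithm; the text does not state the base, but this choice reproduces the three thresholds of Table 4.2 (about 0.133, 0.095 and 0.047 for 15, 30 and 70 posts), whereas a base-10 logarithm does not. The function names are illustrative.

```python
import math

COST_PER_MILLION_CHARS = 20.0  # dollars, as stated in section 4.4.4


def estimated_error_rate(posts_per_user):
    """Function 4.1, assuming a natural logarithm (the base is not stated)."""
    return 0.285 - 0.056 * math.log(posts_per_user)


def implied_characters(cost_dollars):
    """Number of characters that a given budget pays for."""
    return cost_dollars / COST_PER_MILLION_CHARS * 1_000_000


if __name__ == "__main__":
    # (posts per user, estimated cost in dollars) from Table 4.2
    budget_options = [(15, 363.11), (30, 704.40), (70, 1547.85)]
    for posts, cost in budget_options:
        rate = estimated_error_rate(posts)
        chars = implied_characters(cost)
        print(f"{posts} posts: error rate {rate:.3f}, about {chars:,.0f} characters")
```

Under these assumptions, the selected option of 30 posts per person corresponds to roughly 35 million characters of cleaned text for the contacts dataset.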
4.6 Scope

When looking at the links between users in Twitter, there are two types of networks on which the methodology could focus: (1) the network generated by the exchange of messages between people, like replies and repostings, which represents a communication network and a transient social network, dynamically evolving around a topic of interest [64]; (2) the network of “follower of” relationships between people, representing a relatively more stable social network, spanning diverse topics and communities. In both cases, the networks could reflect a static moment in time or an evolution, and the data collection has to be planned accordingly.

In section 1.3, An Ultimate Goal, I ask what generates connections across language communities and enables cross-cultural communication. Answering that question completely requires different approaches and collecting data to look at both types of networks. There are many complementary aspects that can be analyzed using the transient communication networks and the attention (or followers) social networks. However, in the interest of keeping this research project to a reasonable scope, I decided to focus on the attention social networks of multilingual users, as captured at one point in time. In future research, I would like to broaden the scope to account for dynamic networks, topic-based networks and communication networks.

As documented in section 3.1, language choice online is influenced by many simultaneous factors, such as the cultural and linguistic context in a particular region, the social context, the users’ perception of the availability of online resources in a language, and the topic, to name a few factors that this dissertation will not consider. The distinctive approach of this research consists in shifting the focus to factors derived from the social network in which the user is immersed. Also, participants and audience are implicit factors when studying the textual feature of the @ sign.
Regarding the setting factor, the social networking site Twitter could be considered one variable for comparison, but this work will not expand into other social networks with different characteristics, like Facebook.

Methodologically, Androutsopoulos recommends taking into account the digital surroundings when analyzing written text, for instance, looking at the pictures and videos that are linked [6]. Although this strategy is particularly relevant to Twitter and would enrich the qualitative analysis, this work will focus just on textual themes and hashtags due to time constraints in the final stage.

According to Rotman et al. [93], this type of research work would be considered an exploratory step prior to embarking on “extreme ethnography”, which is a new approach to ethnographic methods for the study of human behavior in large scale online environments. Indeed, adding detailed geographic information and cultural backgrounds of the nodes in the social networks would provide a fascinating overview of international communication patterns among individuals. However, such an endeavor would require a wealth of resources and a long time.

4.7 Limitations

Due to the policy limitations of the Twitter API for extracting data and the computational workload required for processing large social networks, I discarded potential subjects that had more than 5,000 followers. In practice, this decision biases the sample against the bigger hubs in the social network. Other limitations in data collection include the impossibility of obtaining information from private accounts, for technical and ethical reasons, and the presence of inactive users. As a result of these issues, 23% of the subjects in the contacts dataset have no data available.

This research work focuses on 92 subjects and their egocentric networks, which is a small dataset that poses challenges in obtaining statistically significant results.
The diversity of languages included, and the small size of their respective samples, hindered any attempt to make comparisons between language groups.

Also, a challenge lies in automatically identifying the language of this type of short text, and subsequently of nodes in the egocentric networks. In computer-mediated communication the text often has characteristics of both written and spoken language, with colloquial and regional dialect features and playful performance with orthography and typography [23], which adds difficulty for automatic language identification. As reviewed in section 3.3, other research works have encountered this problem and have reported high error rates in identifying romanized Arabic [9, 40].

Finally, during the data collection process, we did not collect geolocation information of posts. This type of information consists of GPS coordinates derived from the users’ devices, or an approximate area derived from the Internet Protocol (IP) address of the users’ computer [40]. Only a small number of users publish geocoded posts; as a result, requiring this information would have reduced the number of subjects and biased the sample [83]. Instead, we collected the location information users provide in a specific field of the Twitter interface. However, this data is unreliable [40] and I decided not to take it into account during the analysis.

4.8 Reliability and Validity

Regarding the reliability of the data collected, this work focuses on messages and actions (like “following” someone) of multilingual users in Twitter, and is not using the data as a proxy for their behavior in other settings. The Twitter API provides access to this information. Spammers are the most compromising problem for the reliability of data. In the case of the 92 egos, I designed the selection steps (section 4.2) to avoid spammers and, in relation to validity, I also discarded people that were not multilingual users as defined in this investigation.
To improve the reliability of language assignment, I used 30 posts per person, and tested different language identification tools and the algorithm that assigns language labels to users based on their posts. I provide an estimated error rate below 10%.

In the social network analysis, I designed a two-step study to improve the reliability of the categories obtained in the qualitative phase with quantitative measures, and tested the results with a classification model.

In the factor analysis, I operationalized the multilingualism of a social network using the concept of entropy, which can be interpreted as the unpredictability of the language used in the network. The more people in the network using different languages, the higher the entropy. Unlike just counting the number of languages present, the entropy accounts for the weight those diverse languages have on the network and provides a more accurate measure of multilingualism. In the factor analysis, I used two regression models to compare the results and test their validity.

Finally, the qualitative data consisting of the posts written by the 92 egos was categorized into public posts and replies. The comparison of categorical data requires the use of non-parametric statistics to obtain valid results. The theme analysis includes many examples from the data and compares the results with previous findings in related studies to support their validity.

Regarding external validity and generalizability of results, the small scale of the study limits the possibility of extrapolation to a wider multilingual population in Twitter. On the positive side, the systematic documentation of steps in the grounded theory approach for classifying network types, complemented with network measures, enables replication. This replicability makes it easier to scale up the study and potentially obtain more generalizable results.
4.9 Ethical Considerations

This type of research, which consists of collecting public content from the Internet with no aim of presenting subjects in a bad light, is considered a low-risk research activity that does not require an approval process by the Institutional Review Board (IRB). In particular, I am not collecting any private information, like age, gender, or real names, nor am I collecting data from accounts made private. However, the study of these new social media environments with user-generated content is challenging the established ethical protocols. Although this study follows the current norms of the research community, many ethical issues are still being debated and protocols might change in the future.

In the article Six Provocations for Big Data [22], the authors warn that users who post publicly accessible messages online have not thereby consented to anyone collecting and using their data. They have an intended audience and purpose, and are unaware of their data being collected. Unfortunately, there is no practical way to obtain consent from users or to inform them of the data collection process.

The main concern is the user’s privacy. The only identifiable personal information in the present datasets is public usernames, but I took the additional precaution of using anonymized identification codes for all the subjects. When presenting textual content that might include usernames, I either eliminated the user mention or replaced it with a fake name. However, the vast majority of textual content was used only for automatic language identification.

I encountered the problem of discovering one minor in the dataset by reading a post stating the age. At that point, I immediately eliminated the subject and corresponding data from the sample. This raises the question of how to detect minors and discard their data.
In 2012, Twitter added an age screening program, but it only works in the context of a minor trying to follow a brand intended for adults and registered in the program [101].

Chapter 5

Social Network Analysis

The main goal in this study is to develop a classification of egocentric networks based on the number of language groups that compose them and the patterns of connections between the groups. The types of egocentric networks constitute theoretical constructs to understand the ways in which multilingual users of Twitter are connecting language groups.

For that purpose, I visualized the 92 egocentric networks with the Gephi social network analysis tool. The visualizations represent people as colored nodes and the “follower of” relationship as edges. The colors indicate the single language a node uses in Twitter, whether the node is bilingual, or whether no data is available. The ego is taken out of the picture to avoid obscuring the display with too many edges; all members of the egocentric network are connected with the ego by definition. I chose the layout “Force Atlas”, which is a force-directed placement scheme developed by Mathieu Jacomy in 2007 for Gephi [35]. The Force Atlas layout follows a placement scheme similar to the commonly used Fruchterman-Reingold layout, where the algorithm replicates a hypothetical physical system trying to minimize the energy, balancing attraction between nodes connected by springs [38]. The force-directed placement schemes are particularly useful in revealing network structure [10], such as communities.

In summary, the visualizations convey structural information and language information about the social network, by separating the social groups or communities in the layout and distinguishing the language groups with colors.

As explained in section 4.1, this study follows an exploratory design with two phases.
First, I use a grounded theory approach to identify emergent types of egocentric networks focusing on the structural relationships of language groups. I use grounded theory in the generic sense, to define theoretical constructs derived from qualitative analysis of data, following the principles of the book by Corbin and Strauss [17]. This approach consists of a sequence of coding stages: first establishing some basic properties observed in the social networks, then extracting codes from the visualizations as defined by their properties, and finally grouping codes into categories according to shared properties.

In the second phase, I complemented the qualitative study of visualizations with network statistics to provide a robust definition of network types. Also, I propose an application of these types using machine learning for classification, which also serves for testing the results. Finally, I discuss the findings in relation to the theoretical framework and related work.

5.1 Qualitative Approach

As an initial step based on visual differences, I separated the egocentric networks into three types: 25 monolingual or very small networks, 62 bilingual networks, and 5 trilingual networks. Based on this initial classification, I established quantitative thresholds that define these types:

• Bilingual networks: have at least 7 nodes using a second language (L2), and the L2 group represents at least 2% of the graph nodes;

• Monolingual or very small networks: have fewer than 7 nodes using L2, or the L2 group represents less than 2% of the graph nodes;

• Trilingual networks: have at least 7 nodes using a third language (L3), and the L3 group represents at least 7% of the graph nodes.

A higher threshold for trilingual networks helps to overcome the problem of noise in multilingual networks, where differentiating a third language among several others sometimes becomes challenging.
This issue is less accentuated in bilingual networks, which will be the focus of the subsequent analysis. Before proceeding with the analysis of bilingual networks, I provide some insights about trilingual networks with three examples from the dataset.

The first example is the egocentric network of the user “Kepa”¹ (fig. 5.1). The Basque group on the upper side (dark blue) connects with the Spanish group (green in the middle) and, in turn, the Spanish group connects with the English group at the bottom (pink). Basque is a minority language in Spain and a co-official language in the Basque Country region, where Spanish is also an official language. This network illustrates the interesting intersections and overlaps of language groups in the context of a bilingual region, where English is taught as a language for international communication. The Spanish-posting group seems to create a path of communication between English and the Basque community.

¹ Reported user names are changed for privacy protection.

Figure 5.1: Basque group on the upper side (dark blue) connects with the Spanish group (green in the middle) and in turn, the Spanish group connects with the English group at the bottom (pink) and English-Spanish bilinguals (violet). Visualization made with Gephi.

Figure 5.2: Catalan group in the center (dark blue) integrated in the Spanish group at the bottom (light green), connects with the English group at the top (pink). Bilinguals of Catalan-Spanish are represented in dark green, Catalan-English in light violet, and Spanish-English in dark violet. The nodes in light yellow represent nodes with no data. Visualization made with Gephi.

The second example is the egocentric network of the user “Montse” (fig. 5.2). The Catalan group in the central axis of the graph (dark blue) is completely integrated within the Spanish group on the lower side (light green), and there is a smaller English group on the upper side (pink).
The number of bilinguals of Catalan and Spanish (darker green) is noteworthy, followed by Catalan and English (light violet) and Spanish and English (darker violet). Catalan is a minority language in Spain and a co-official language in the Catalonia region, where Spanish is also an official language. This network illustrates a different flavor of overlap between language groups in the context of a bilingual region, where English is taught as a language for international communication.

Figure 5.3: The Chinese group in the center (dark blue) connects through a few nodes with the Japanese group on the upper side (green), and with the English group on the lower side (pink), through some bilinguals (violet). The nodes in yellow represent nodes with no data. Visualization made with Gephi.

Finally, the third example is the egocentric network of the user “Wei” (fig. 5.3). The Chinese group in the center (dark blue) connects through a few nodes with the Japanese group on the upper side (green), and with the English group on the lower side (pink). Some of the users connecting the groups post either in English or in both English and Chinese. In this example, English seems to be playing the role of international lingua franca, connecting the Chinese-posting group with other language groups.

I focused the social network analysis on the bilingual networks, to simplify the categorization to types of intersections between two language groups. During the initial coding, I created a list of properties observed in the visualizations concerning the structural relationships between the language groups.
These properties are shown in table 5.1:

Properties
A) Degree of connection between language groups
   A1  few connections
   A2  tightly connected
B) Degree of integration of one language group inside another
   B1  separated
   B2  partial integration
   B3  complete integration
C) Relative size of one language group with respect to the other
   C1  similar size
   C2  very different size

Table 5.1: Properties of bilingual networks observed in the Gephi visualizations.

When combining the three types of properties, I deductively obtained 12 codes. For instance: code 1 consisted of two language groups of similar size (C1), separated (B1), and connected by a few nodes (A1); code 2 consisted of two language groups of very different size (C2), separated (B1), and connected by a few nodes (A1); code 9 consisted of two language groups of similar size (C1), tightly connected (A2), where one language group has been partially penetrated by users of the other (B2); and code 12 consisted of two language groups of very different size (C2), the small one completely integrated within the big one (A2, B3).

During the final iteration of the coding process, I observed that some codes had no instances in the dataset or very few. Those codes that had very few instances could be grouped with codes of similar properties. For instance, regarding codes with a high degree of integration of one language group inside another (B3), there are few instances of language groups with similar size (C1); therefore I merged codes with properties A2 and B3, regardless of the differences in group size (either C1 or C2).

In relation to codes with no instances, some properties presume others: B3 or B2 (some degree of integration of one language group inside another) require a high degree of connection between the groups (A2); in consequence, certain combinations of properties are not possible. For this reason, some codes were discarded.
• Code 1 (A1, B1, C1) with 12 networks;

• Code 2 (A1, B1, C2) and code 8 (A2, B1, C2) grouped together have 12 networks;

• Code 3 (A1, B2, C1), code 4 (A1, B2, C2), code 5 (A1, B3, C1), and code 6 (A1, B3, C2) have contradictory properties, because B2 and B3 require A2, and there are no instances in the dataset;

• Code 7 (A2, B1, C1) with 12 networks;

• Code 9 (A2, B2, C1) and code 10 (A2, B2, C2) grouped together have 9 networks;

• Code 11 (A2, B3, C1) and code 12 (A2, B3, C2) grouped together have 17 networks.

The resulting groups of codes constitute the five categories of bilingual networks obtained with a qualitative approach. Below, I define the categories of egocentric networks in relation to the patterns of intersection between language groups. Figure 5.4 illustrates them with examples from the data. The names of the categories are metaphorical; here bridge is not used as the graph theory term. See appendix A for all the visualizations and the categories they were assigned.

• Gatekeeper (Fig. 5.4.1): two language groups connected by a few nodes only, with properties A1, B1, and C1 (12 networks);

• Language bridge (Fig. 5.4.2): two tightly connected language groups, but still separated, with properties A2, B1, and C1 (12 networks);

• Peripheral language (Fig. 5.4.3): a dominant language group connected to a small or non-cohesive language group, with properties A1 or A2, B1, and C2 (12 networks);

• Union (Fig. 5.4.4): two tightly connected language groups, where one language group has been penetrated by the other, with properties A2, B2, and C1 or C2 (9 networks);

• Integration (Fig. 5.4.5): one language group inside another, with properties A2, B3, and C1 or C2 (17 networks).

[Panels Fig. 5.4.1 to Fig. 5.4.5]

Figure 5.4: Networks of 5 multilingual Twitter users exemplifying the types of egocentric networks. The nodes are their contacts and the edges represent the “follower of” relationship.
Pink nodes post in English and yellow/white is used for nodes with no data. Fig. 5.4.1 is the gatekeeper type; there is a French group on the right side (green) loosely connected with an English group on the left. Fig. 5.4.2 represents the language bridge type; in this network, the Japanese group on the right side (green) is tightly connected with the English group on the left, and intermingled with bilingual users (violet and dark green). Fig. 5.4.3 shows the peripheral language, Portuguese, on the right side (green) of the dominant English group. Fig. 5.4.4 exemplifies the union type, where the Greek group on the left (turquoise) is merging and mixing with the English group on the right, and there are many bilinguals (violet and dark green). Fig. 5.4.5 illustrates the integration type, with the English group inside the Arabic group (green). Visualizations made with Gephi.

The categories gatekeeper and language bridge present a continuum of increasing connectivity between the two language groups, where extreme cases could potentially belong to the other category. Similarly, the union and integration categories present a continuum of increasing penetration of one language group within the other. The implication is that no single statistic is going to divide these categories cleanly. However, the network statistics helped to refine the assignment of networks to categories in the extreme cases.

These different structures can potentially impact information diffusion [80] across languages and nations. In the case of the gatekeeper and peripheral language types, cross-cultural awareness and information diffusion between the language groups depend on a small number of users. If we look at the proportion of links between the language groups, it seems that information will have higher chances of crossing the language barrier in the case of the union and integration types.
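The coding scheme of section 5.1 can be sketched in a few lines of Python. This is only an illustration of the combinatorics described in the text, not the procedure used in the study (which was a manual, visual coding); the function names are hypothetical, and the numbering of codes assumes the order connection, integration, size, which matches the codes listed above.

```python
from itertools import product

# Property values from table 5.1 (connection, integration, relative size).
CONNECTION = ("A1", "A2")
INTEGRATION = ("B1", "B2", "B3")
SIZE = ("C1", "C2")


def enumerate_codes():
    """Number the 12 property combinations 1..12 (assumed ordering)."""
    combos = product(CONNECTION, INTEGRATION, SIZE)
    return {number: combo for number, combo in enumerate(combos, start=1)}


def is_contradictory(combo):
    """B2 and B3 (some integration) presuppose a tight connection (A2)."""
    connection, integration, _ = combo
    return connection == "A1" and integration in ("B2", "B3")


def category(combo):
    """Map a property combination to one of the five categories, or None."""
    if is_contradictory(combo):
        return None
    connection, integration, size = combo
    if integration == "B3":
        return "integration"
    if integration == "B2":
        return "union"
    # From here on the two groups are separated (B1).
    if size == "C2":
        return "peripheral language"
    return "gatekeeper" if connection == "A1" else "language bridge"


codes = enumerate_codes()
categories = {number: category(combo) for number, combo in codes.items()}
```

Under these assumptions, codes 3 to 6 map to no category, codes 2 and 8 both map to the peripheral language type, and codes 9 to 12 map to the union and integration types, in line with the grouping reported above.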
5.2 Network Statistics

Similarly to how user types were defined by network structure in [112], I explored different network statistics to provide a robust definition of the types of bilingual networks. The objective is to define a set of measures that, taken together, can differentiate each network type. Note that this analysis continues to focus on the set of 62 bilingual networks.

First, I tried to convey the qualitative property of degree of connection between language groups with the cross-language edge ratio (XLangR), as suggested by Prof. Jennifer Golbeck. To compute this ratio, I used the total number of edges in the graph (T), except those linking to nodes with no data or a non-identifiable language, and the number of edges linking two nodes of different language (t):

\frac{t}{T} = \mathrm{XLangR}    (5.1)

Additionally, the ratio between inner edges in the L2 group and the edges going out of the group could convey both the degree of connection of the L2 group with the rest of the graph and the relative size of the group with respect to the graph. Computing the L2 inner/crossing edge ratio (XL2R) requires the number of edges connecting two nodes of L2 (\tau_{L2}) and the number of edges connecting an L2 node with a node in a different language (t_{L2}):

\frac{\tau_{L2}}{t_{L2}} = \mathrm{XL2R}    (5.2)

Another property that is related to the degree of connection and overlap between the language groups is the bilingual ratio.
After determining the two main languages, L1 and L2, computing the bilingual ratio (BR) requires the number of nodes in each group (n, m), and the number of bilinguals using both L1 and L2 (b):

\frac{b}{n+m} = \mathrm{BR}    (5.3)

Finally, to account for the qualitative property of different size of the two main language groups, I use the proportion of nodes in the L2 group (p(L2)) with respect to the sum of nodes in L2 (n) and L1 (m):

\frac{n}{n+m} = p(L2)    (5.4)

As explained in section 5.1, the network categories present a continuum of increasing connectivity and overlap between two language groups, where extreme cases could potentially belong to another category. Even though no single statistic is going to divide the categories cleanly, the figures below show how the five categories can be regrouped into three main types that are differentiated by the statistics.

The categories gatekeeper and language bridge present a continuum of increasing connectivity between the two language groups, but are different from the other types because the L2 inner/crossing edge ratio is higher, which implies more connections within the same language group than across language groups (figure 5.5). Also, the L2 proportion differentiates the gatekeeper-bridge from the peripheral type because the two language groups tend to be of similar size, whereas the difference in size between the language groups is a defining property of the peripheral type (figure 5.6).

Similarly, the union and integration categories present a continuum of increasing penetration of one language group within the other. The box plots in figure 5.7, representing the cross-language edge ratio, show that the integration and union types have higher ratios and are clearly differentiated from the other types. This pattern is consistent with the bilingual ratio (figure 5.8), which reinforces the differentiation between integrated (union and integration) and separated (gatekeeper, language bridge, and peripheral language) types.
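The four statistics defined in equations 5.1 to 5.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function and argument names are hypothetical, each node carries exactly one label (a single language, a bilingual label, or None for no data), and a bilingual node is treated as having a label different from both L1 and L2 when counting cross-language edges. The small network at the end is made up for illustration.

```python
import math


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def network_statistics(edges, labels, l1, l2, bilingual):
    """Compute XLangR, XL2R, BR and p(L2) for one egocentric network.

    edges     -- list of (u, v) node pairs
    labels    -- dict mapping node -> language label, or None for no data
    l1, l2    -- labels of the two main language groups
    bilingual -- label for nodes posting in both L1 and L2 (assumption)
    """
    # Equation 5.1 leaves out edges touching nodes with no identified language.
    known = [(u, v) for u, v in edges
             if labels.get(u) is not None and labels.get(v) is not None]
    total = len(known)                                         # T
    crossing = sum(labels[u] != labels[v] for u, v in known)   # t
    inner_l2 = sum(labels[u] == l2 and labels[v] == l2
                   for u, v in known)                          # tau_L2
    crossing_l2 = sum((labels[u] == l2) != (labels[v] == l2)
                      for u, v in known)                       # t_L2

    values = list(labels.values())
    n = values.count(l2)          # nodes in the L2 group
    m = values.count(l1)          # nodes in the L1 group
    b = values.count(bilingual)   # nodes using both L1 and L2

    return {
        "XLangR": ratio(crossing, total),
        "XL2R": ratio(inner_l2, crossing_l2),
        "BR": ratio(b, n + m),
        "pL2": ratio(n, n + m),
    }


# Made-up six-node network: two English nodes, two Japanese nodes,
# one bilingual node, and one node with no data.
example_labels = {"a": "en", "b": "en", "c": "ja", "d": "ja",
                  "e": "en+ja", "f": None}
example_edges = [("a", "b"), ("c", "d"), ("a", "c"),
                 ("b", "e"), ("d", "e"), ("a", "f")]
stats = network_statistics(example_edges, example_labels, "en", "ja", "en+ja")
```

Returning NaN for an empty denominator is a design choice for the sketch: a network with no edges crossing out of the L2 group has an undefined inner/crossing ratio rather than a ratio of zero.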
[Box plots; x-axis: Categories; y-axis: L2 inner/crossing edge ratio]

Figure 5.5: L2 inner/crossing edge ratio for five categories (left) and for three categories (right). This statistic differentiates the gatekeeper-bridge type.

[Box plots; x-axis: Categories; y-axis: L2 proportion]

Figure 5.6: L2 group proportion for five categories (left) and for three categories (right). This statistic differentiates the peripheral language type.

[Box plots; x-axis: Categories; y-axis: Cross-lang edge ratio]

Figure 5.7: Cross-language edge ratio for five categories (left) and for three categories (right). This statistic differentiates the integration and union type.

[Box plots; x-axis: Categories; y-axis: Bilingual ratio]

Figure 5.8: Bilingual ratio for five categories (left) and for three categories (right). This statistic differentiates the integration and union type.
In summary, from a quantitative approach three main types of intersection between two language groups in the social network can be defined:

• Gatekeeper-Language bridge: defined by a high L2 inner/crossing edge ratio, more connections within the same language group than across language groups, and language groups of similar size (24 networks);

• Integration and union: defined by high cross-language edge and bilingual ratios (26 networks);

• Peripheral language: the L2 group accounts for a small proportion of the graph, and it does not meet the defining properties of integration and union types (12 networks).

Following the reasoning in section 5.1, the cross-language edge ratio and the bilingual ratio can reflect the potential for information dissemination across language borders. If we are able to classify the types of intersection between language groups in a set of egocentric networks, we might be able to predict which networks have more potential for cross-lingual linking, translation, and cross-cultural awareness. However, the relationship between the types and the spread of information across languages requires further investigation and falls outside the scope of this work.

5.3 Application of Categories

One potential application of the categories, particularly the three types that are differentiated more clearly with the statistics, is the classification of bilingual egocentric networks. If the linkage between the types and the potential for cross-language information dissemination is supported by further research, this classification could be fundamental in detecting nodes and their egocentric networks that can spread information across language and national borders more effectively.

In this section, I test the results of the social network analysis with a classification model using machine learning.
I trained the classification model using support vector machines (the sequential minimal optimization implementation, SMO) and the dataset of 62 bilingual networks divided into three types. This dataset included the attributes type, L1, L2, cross-language edge ratio, L2 inner/crossing edge ratio, bilingual ratio, and proportion of L2. I used the Weka (Waikato Environment for Knowledge Analysis) free software for machine learning, developed at the University of Waikato, New Zealand.

Figure 5.9 shows the confusion matrix: all 26 networks of the integration type were classified correctly; 19 of 24 gatekeeper-bridge networks were classified correctly, while only half of the networks of the peripheral language type (6 of 12) were classified correctly. Overall, 51 of the 62 networks were classified correctly and 11 incorrectly. The weighted average F-measure is 0.812, and it is particularly high for the integration type, 0.881. These results show a promising potential for prediction, even with this small dataset. As observed, these statistics enable differentiation between the types of bilingual networks.

=== Summary ===

Correctly Classified Instances          51        82.2581 %
Incorrectly Classified Instances        11        17.7419 %
Kappa statistic                          0.7127
Mean absolute error                      0.2688
Root mean squared error                  0.3474
Relative absolute error                 63.0008 %
Root relative squared error             75.1714 %
Coverage of cases (0.95 level)          96.7742 %
Mean rel. region size (0.95 level)      66.6667 %
Total Number of Instances               62

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  Class
               0.792    0.079    0.864      0.792   0.826      gatekeeper bridge
               1        0.194    0.788      1       0.881      integration
               0.5      0.02     0.857      0.5     0.632      peripheral
Weighted Avg.  0.823    0.116    0.831      0.823   0.812

=== Confusion Matrix ===

  a  b  c   <-- classified as
 19  4  1 |  a = gatekeeper bridge
  0 26  0 |  b = integration
  3  3  6 |  c = peripheral

Figure 5.9: Classification results using 10-fold cross-validation for the SVM model.
This model was trained using the dataset of 62 bilingual networks with attributes type, L1, L2, cross-language edge ratio, L2 inner/crossing edge ratio, bilingual ratio, and proportion of L2.

5.4 Discussion

According to the Global Language System theory, polyglots provide cohesion to the system [25]. In section 2.3, I explain that the cohesion of a social graph depends on the edges that prevent the entire graph from breaking into isolated components [38]. In other words, multilingual users might be preventing the social graph of Twitter from breaking into isolated language groups, or “language bubbles” [46], where information is concealed and similar views reinforced. As motivated in section 1.2, instead of promoting isolated communities, social media sites should seek to expose their users to the unexpected [119] and foster cross-cultural awareness.

This social network analysis reveals how multilingual users are standing between language groups. Reusing the concept of “language bridges” applied by Etling et al. [32] to the blogosphere, multilingual users are forming part of a language bridge between communities in varying degrees. These varying degrees are presented in this chapter as a continuum of increasing connections between the language groups and a continuum of increasing penetration of one language group within the structure of the other. The classification of egocentric networks, or intersections of language groups, could serve to distinguish those egos who might be playing a role as gatekeepers [82] or language brokers [55], and also reveals that not all multilingual users are necessarily in such a position. For instance, in the case of the union and integration types many users are connecting both language groups aside from the ego itself.

As a result of this analysis other questions arise: what are the profiles and social contexts of these multilingual users, and how do they relate to the type of network?
For instance, does the integration type reflect a minority or immigrant community in a country? Do small English-posting groups integrated in a non-English group reflect an elite in a country? An example related to the latter question can be found in section 3.1, where I reviewed a study on email and Internet chat in Egypt documenting the use of English by a professional elite [108].

The relationship between these types of egocentric networks and the potential impact on information flows remains an open question. It seems reasonable to hypothesize in future research that certain types (like the union and integration categories) might favor cross-lingual linking and dissemination, while other types (like the gatekeeper category) might be interesting for those seeking purposeful concealment of information.

Methodologically, social network analysis enables going beyond survey information about multilingualism, like the large-scale survey The Twitter of Babel [83], and facilitates a deeper understanding of the structural relations between language communities, potentially shedding light on the dynamics of international communication. In this respect, the present study takes a similar approach to Language Networks on LiveJournal [53], but enhances the descriptive analysis with the creation and definition of theoretical constructs: the types of intersections between language groups in egocentric networks. Also, this study conceives the egocentric network as a language ecology where the ego is immersed at the micro-scale level, influencing its communication strategy and language choices. This is relevant to the next chapter on social network factors for language choice.

Chapter 6

Factor Analysis

The main goal of this study is to explore how the social network influences the language choices of the multilingual Twitter user.
In particular, I tested whether we can model the number of times this person (the ego) chooses one language over the other using some characteristics of the egocentric network as predictors.

The dependent variables considered are the frequency of English use and non-English language use within the 50 posts of the ego. The frequencies represent the language choices of the multilingual user. As explained in section 2.5, I consider inter-sentential code-switching when the language changes from one post to the next, while bilingual posts would be cases of intra-sentential code-switching. In this study, I only take into account inter-sentential code-switching and each post represents one interaction. Every post was assigned a language label by the automatic language identification tool and, subsequently, frequency counts of posts in each language were calculated for every ego. Finally, the two most frequent languages of a user were selected to represent his or her options for language choice. English was always one of them due to the sampling conditions.

The factors (independent variables or predictors) are the proportion of English and non-English language users in the social network, and the degree of multilingualism of the social network. The relative importance of factors, or their weight, is represented by the coefficients obtained by fitting two different generalized linear models to the dataset: linear regression and logistic regression. The main hypotheses are: higher proportions of English users in the network will be a good predictor of more frequent English use by the ego; inversely, higher proportions of non-English language users in the network will be a good predictor of more frequent use of a non-English language by the ego; and the multilingualism of the network will also be a predictor of English use, reflecting its role as a lingua franca.
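The aggregation of per-post language labels into the ego's language choices can be sketched as follows. This is a hedged illustration rather than the study's pipeline: the function name, the label strings, and the example labels are hypothetical, and L2 is taken to be the most frequent non-English label, which coincides with "the other of the two most frequent languages" whenever English is among the top two.

```python
from collections import Counter


def language_choices(post_labels, english="en"):
    """Summarize one ego's language choices from a list of per-post labels."""
    counts = Counter(post_labels)
    # Candidate L2 labels, most frequent first, leaving English out.
    others = [(lang, k) for lang, k in counts.most_common() if lang != english]
    l2, l2_posts = others[0] if others else (None, 0)
    return {
        "n_posts": len(post_labels),
        "english_posts": counts[english],
        "l2": l2,
        "l2_posts": l2_posts,
    }


# Made-up labels for one ego with 50 posts.
example = language_choices(["en"] * 28 + ["es"] * 19 + ["fr"] * 3)
```

Posts in a third language are counted in the number of posts but contribute to neither frequency, which is why the two counts need not add up to the total.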
6.1 Operationalization of Variables

The language choices of the multilingual user are defined by two dependent variables: (1) the number of English posts within the 50 (or fewer) posts extracted for each multilingual ego, and (2) the number of posts in the other language, called L2. The dependent variables reflect the aggregation of posts at the user level, not individual posts. In other words, the models do not consider whether one particular post is written in English or L2, but how much or how little English a person will tend to use in interactions via Twitter.

The factors considered are:

• proportion of English users in the network, represented by the number of speakers labelled as English users divided by the total number of nodes in the network;

• proportion of users of the most frequent non-English language in the network (L2), represented by the number of speakers labelled as L2 users divided by the total number of nodes in the network;

• the multilingual index of the network, which represents the degree of multilingualism of the social network.

As suggested by Prof. Jordan Boyd-Graber, the multilingual index can be operationalized as the entropy of a multinomial distribution (formula 6.1). This idea borrows the concept of entropy from Information Theory [94]. In this context, the entropy can be interpreted as the unpredictability of the language used in the network. An entropy close to 0 means that most people in the network are writing in one language, hence the language of the network is more predictable. The more people in the network use different languages, the higher the entropy, reflecting the uncertainty about the language of the network. Unlike the bare number of languages as a measure of multilingualism, the entropy accounts for the weight that those diverse languages have in the network, in the form of probabilities.
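The three factors above can be computed from the language labels of the network nodes, as in the following minimal Python sketch. The function name `network_factors`, the language codes, and the use of the natural logarithm are assumptions made for illustration; the base of the logarithm only rescales the index.

```python
import math
from collections import Counter

def network_factors(node_labels, l2, english="en"):
    """Compute the three predictors from one language label per network node."""
    counts = Counter(node_labels)
    total = sum(counts.values())
    # p_i: share of nodes labelled with language i.
    probabilities = [c / total for c in counts.values()]
    # Multilingual index: entropy of the language distribution (natural log assumed).
    entropy = -sum(p * math.log(p) for p in probabilities)
    return {
        "entropy": entropy,
        "net.en.use": counts[english] / total,
        "net.L2.use": counts[l2] / total,
    }

# A network with 3 English nodes and 1 Spanish node.
print(network_factors(["en", "en", "en", "es"], l2="es"))
```

A network in which every node uses the same language yields an index of 0, and a network split evenly between two languages yields log(2), which matches the interpretation of the index as unpredictability.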
\[ H = -\sum_{i=1}^{n} p_i \log(p_i) \tag{6.1} \]

Equation 6.1 for calculating the multilingual index of a social network is borrowed from Shannon's entropy theorem [94]. In this dissertation study, n is the number of languages in the network and p_i is the number of nodes in language i divided by the total number of nodes.

Ego.en.use  Ego.L2.use  N.of.posts  entropy     net.en.use  net.L2.use
28          19          50          0.79230492  0.51793722  0.45515695
11          38          50          0.67065588  0.63473054  0.35329341
35          15          50          0.42538128  0.84931507  0.10958904
18          6           25          0.69040118  0.47368421  0.52631579
15          35          50          0.57881166  0.7202381   0.26785714

Figure 6.1: Input data file for factor analysis with a reduced number of lines for illustration purposes. The columns represent, from left to right, the dependent variables English use by the ego and L2 use by the ego, the number of available posts for the ego, and the factors multilingual index or entropy, proportion of English users, and proportion of L2 users in the egocentric network.

6.2 Regression Models and Analysis

In this study, I used two different generalized linear regression models to build a probabilistic model that relates a dependent variable y to more than one independent or predictor variable [29].

Formula 6.2 is the multiple linear regression equation for three predictor variables, x_1, x_2, x_3: proportion of English users, proportion of L2 users, and multilingual index of the network.

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \tag{6.2} \]

In formula 6.2, β_1, β_2, β_3 are the regression coefficients. β_1 is interpreted as the expected change in y associated with a 1-unit increase in x_1, while x_2 and x_3 are held fixed [29]. Analogous interpretations hold for β_2 and β_3. The intercept of the fitted line is β_0, which is the predicted value of y when all factors have value 0 [29].

I used the linear regression model for two dependent variables, y_en and y_l2, which are operationalized as the normalized count of posts written in English by the ego (y_en) and the normalized count of posts written in L2 by the ego (y_l2).
Given that not all egos have 50 posts available, the normalization consists of dividing a particular count by the total number of posts available for the ego. However, the output of the linear regression model is not constrained to the interval between 0 and 1, so it can predict values that are not valid proportions. For this reason, the linear regression model might not be the best option for this dataset.

Alternatively, logistic regression can be used to get probability scores (between 0 and 1) as the predicted values of the dependent variable y [79]. Formula 6.3 is the transformation equation from a linear regression output to logistic regression probabilities, with three predictor variables x_1, x_2, x_3 [79].

\[ \log \frac{y}{1 - y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \tag{6.3} \]

As in the previous model, I used the logistic regression model for two dependent variables, y_en and y_l2, which are operationalized as pairs of counts: (y_en, N − y_en) and (y_l2, N − y_l2), where N is the total number of posts available for a user.

I used the R language for the statistical analysis. R is an open-source programming language and software environment for statistical computing. As a result of fitting these generalized linear models, R outputs the regression coefficients for the independent variables or factors, including the intercept, whose signs indicate a positive or negative association. In addition, R provides the p-value for each of the regression coefficients.

Firstly, I used the linear regression function in R to model the use of English by the ego (model 6.4) and the use of L2 by the ego (model 6.5).

    lm(Ego.en.use/N.of.posts ~ entropy + net.en.use + net.L2.use)    (6.4)

    lm(Ego.L2.use/N.of.posts ~ entropy + net.en.use + net.L2.use)    (6.5)

Secondly, I used the logistic regression function in R to model the use of English by the ego (model 6.6) and the use of L2 by the ego (model 6.7).
    glm(response ~ entropy + net.en.use + net.L2.use, family = binomial('logit'))    (6.6)

    glm(responseL2 ~ entropy + net.en.use + net.L2.use, family = binomial('logit'))    (6.7)

In the logistic regression model, the dependent variables are operationalized as pairs of counts:

    response <- cbind(Ego.en.use, N.of.posts - Ego.en.use)
    responseL2 <- cbind(Ego.L2.use, N.of.posts - Ego.L2.use)

6.3 Results

The results of the linear regression model in table 6.1 do not provide statistically significant coefficients for predictors of English use by the ego. I established the level of significance for the coefficients at a p-value of 0.05. The proportion of English users in the network has the greatest coefficient, indicating that this factor is the most important for predicting the use of English by the ego. Both the proportion of English users in the network and the multilingual index have a positive correlation with the use of English by the ego, as stated in the hypothesis.

Predictors of Ego.en.use   Estimate   Std. Error   p-value
(Intercept)                0.0002     0.554        1.000
entropy                    0.0414     0.160        0.797
net.en.use                 0.796      0.502        0.117
net.L2.use                 0.075      0.454        0.868

Table 6.1: Linear regression coefficients for modeling the use of English by the ego. None of the coefficients are statistically significant. The proportion of English users in the network is the most important predictor of English use by the ego.

Similarly, the results of the linear regression model in table 6.2 do not provide statistically significant coefficients for predictors of L2 use by the ego. The proportion of English users in the network is the factor with the greatest coefficient in absolute value, but it is negatively correlated with L2 use by the ego. The proportion of L2 users in the network is positively correlated with L2 use, as stated in the hypothesis. However, in disagreement with the hypothesis, the entropy is positively correlated with L2 use. In summary, the proportion of English users and the proportion of L2 users in the network are better predictors of L2 use by the ego than the entropy.

Predictors of Ego.L2.use   Estimate   Std. Error   p-value
(Intercept)                0.482      0.548        0.382
entropy                    0.0335     0.159        0.833
net.en.use                 -0.354     0.497        0.479
net.L2.use                 0.263      0.449        0.559

Table 6.2: Linear regression coefficients for modeling the use of L2 by the ego. None of the coefficients are statistically significant. The factor proportion of L2 users in the network has a positive correlation with the use of L2 by the ego, whereas the factor proportion of English users has a negative correlation.

The results of the logistic regression model in table 6.3 include the coefficients for predictors of English use by the ego. The proportion of English users in the network is a statistically significant predictor. As in the previous model, the proportion of English users in the network has the greatest coefficient, indicating that this is the best predictor of English use by the ego. In this model, all the positive and negative correlations of the coefficients are in agreement with the hypothesis, i.e. both the proportion of English users in the network and the multilingual index correlate positively with the use of English by the ego, while the proportion of L2 users in the network correlates negatively.

Predictors of Ego.en.use   Estimate   Std. Error   p-value
(Intercept)                -1.718     0.918        0.061
entropy                    0.114      0.265        0.668
net.en.use                 2.981      0.832        0.0003 *
net.L2.use                 -0.086     0.739        0.907

Table 6.3: Logistic regression coefficients for modeling the use of English by the ego. The proportion of English users in the network is a statistically significant predictor and the most important for predicting English use by the ego.

Finally, the results of the logistic regression model in table 6.4 include the coefficients for predictors of L2 use by the ego. Using this model, the proportion of L2 users in the network is a statistically significant predictor at the significance level of p = 0.05. Also, the proportion of L2 users in the network has the greatest coefficient, indicating that this is the most important predictor of L2 use by the ego. The proportion of English users in the network correlates negatively with the use of L2 by the ego, and the proportion of L2 users in the network correlates positively, as stated in the hypothesis. However, in disagreement with the hypothesis, the entropy correlates positively with the use of L2 by the ego.

Predictors of Ego.L2.use   Estimate   Std. Error   p-value
(Intercept)                -0.563     0.899        0.531
entropy                    0.302      0.251        0.245
net.en.use                 -1.109     0.801        0.170
net.L2.use                 1.551      0.737        0.035 *

Table 6.4: Logistic regression coefficients for modeling the use of L2 by the ego. The proportion of L2 users in the network is a statistically significant predictor. Both the proportion of L2 users and the proportion of English users in the network are important predictors of L2 use by the ego.

6.4 Discussion

In conclusion, the two generalized linear regression models consistently show that the proportion of English users in the network constitutes a key influencing factor in the frequency of English use by the multilingual individual, as stated in the hypothesis. This result was statistically significant in the logistic regression model. Also, the coefficient of this factor was the greatest in the two models.

Similarly, the proportion of L2 users in the network is a very important factor influencing the frequency of L2 use by the multilingual person, in agreement with the hypothesis. This result was statistically significant in the logistic regression model. Also, the coefficient of this factor was the greatest in the logistic regression model of L2 use.
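To make the size of the logistic coefficients more tangible, the fitted model of table 6.3 can be inverted from the log-odds scale of formula 6.3 back to a predicted share of English posts. The following Python sketch is only an illustration under stated assumptions: the coefficients are copied from table 6.3, the input values are those of the third data row in figure 6.1, and the function names are hypothetical. A single row is not a validation of the model.

```python
import math

# Coefficients copied from table 6.3 (logistic model of English use by the ego).
TABLE_6_3 = {
    "intercept": -1.718,
    "entropy": 0.114,
    "net.en.use": 2.981,
    "net.L2.use": -0.086,
}

def predicted_english_share(entropy, net_en_use, net_l2_use, coef=TABLE_6_3):
    """Invert formula 6.3: y = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x3)))."""
    log_odds = (coef["intercept"]
                + coef["entropy"] * entropy
                + coef["net.en.use"] * net_en_use
                + coef["net.L2.use"] * net_l2_use)
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds_ratio(coefficient, delta):
    """Multiplicative change in the odds when a predictor increases by delta."""
    return math.exp(coefficient * delta)

# Third row of figure 6.1: the ego wrote 35 of 50 posts in English.
share = predicted_english_share(0.42538128, 0.84931507, 0.10958904)
print(round(share, 2), 35 / 50)

# Effect on the odds of an English post of 10 more percentage points of English users.
print(round(odds_ratio(TABLE_6_3["net.en.use"], 0.1), 2))
```

Read this way, the coefficient of 2.981 means that, with the other factors held fixed, a network with ten percentage points more English users is associated with odds of an English post that are roughly a third higher.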
Interestingly, the proportion of English users in the network is also an important predictor of L2 use by the ego, but it is negatively correlated.

Regarding the multilingual index (or entropy), the results are inconclusive about its relation to the language choice of multilingual users. The hypothesis that the entropy could be a good predictor of English use is not confirmed. A future study could look deeper into the question of English being used as a lingua franca by focusing on multilingual egocentric networks with no monolingual users of English. Controlling this variable can eliminate the confounding influence of users writing in English only.

In essence, these results suggest that multilingual Twitter users perceive the language composition of their egocentric network and interact accordingly. Or, conversely, the language choices of multilingual users might attract followers of a specific language profile. Most probably, the relation goes both ways, in a self-feeding cycle. Social networks evolve over time and users may adapt their language choices in a dynamic relationship with their egocentric network. As Marwick and boyd theorized: “[...] identity on Twitter is constructed through conversations with others. Tweets are formulated based partially on a social context constructed from the tweets of people one follows” [81](11).

Other factors that I initially tested in this analysis were the most frequently used non-English language of the network and the type of network. However, these factors posed specific challenges. In the 92 networks, there were a total of 16 languages that appeared as the most frequently used non-English language. As a consequence, the data were too sparse for any one language to be operationalized as a factor. The type of network structure resulting from the social network analysis was challenging to use as a factor because there are no clear-cut divisions between types, but rather stages in a continuum.
In future work, it will be interesting to study users’ awareness or intuition about their social network type and how this might affect their language choices.

There are other factors influencing language choice that do not fall under the scope of this work: the cultural and linguistic context in a particular region, the perceived availability of online resources in a language, social context, language competence, geographic location of the ego, time zone, etc.

Chapter 7

Exploring Textual Features

In this chapter, I shift attention from the social network to the content of the posts written by the egos. There is a convention in Twitter for addressing a message to a particular user or referencing a person, the “mention”, which consists of an @ sign and a username [81]. Honeycutt and Herring [54] studied the @ sign as a marker of addressivity in Twitter; they found that more than 90% of messages with the @ sign were addressed to a user, 5% were referencing a person, and the rest were indicating location or other functions. Surprisingly, they did not comment on the position of the mention within the message as a key factor for differentiating the posts intended for a specific user: mentions at the beginning of the post are typically used to reply to someone’s message.

In the first part of this study, I look at the textual feature of the @ sign at the beginning of a post as an indicator of addressivity, in particular, to distinguish the posts that are replies to an individual. In other words, this indicator can be used to differentiate the type of exchange: sending a public post (including repostings with a comment) versus replying to an individual. The objective is to test the hypothesis that the type of exchange is a factor that affects language choice.
The second part of this study takes a qualitative approach to detect the themes that might help in creating cross-cultural awareness, where the multilingual users might be trying to reach an international audience, acting as mediators from the point of view of their messages. I identify themes related to non-English speaking countries or communities in English posts and, also, I identify English hashtags (keywords preceded by the # sign) inserted in non-English posts. Using a generic theme analysis, this study serves as an exploratory qualitative phase to inform the design of future studies after this dissertation work.

7.1 Description of the Data

The dataset used in this study was called “ego dataset” in chapter 4 and includes the last 50 posts (or fewer) of the main 92 subjects, who are multilingual users. In total, the dataset contains 4,423 Twitter posts, associated with their respective authors. Note that these posts are never automatic repostings (using a “retweet” button), due to the requirements during data collection, as explained in chapter 4.

The majority of posts have a language label, obtained with automatic language identification. In some cases, there was no language label because the post contained only symbols or URLs, which were removed before proceeding to the automatic identification of the language. The precise number of posts with a language label is 4,374.

In preparation for this analysis, I revised the language labels with two objectives:

1. eliminating from the frequency counts of English the automatic posts sent by applications, for example, “Posted a picture on Facebook”, “liked a photo on Facebook”, “favorited a Youtube video”, “I am at something @ name of place” (foursquare);

2. identifying bilingual posts with the appropriate label.
I used the criteria described in section 4.4.2 to classify a post as bilingual; in particular, the post has to meet one of these conditions:

   – one word is in a second language in a post with fewer than five words;

   – two words are in a second language in a post that has between five and ten words, inclusive, except if those two words are a named entity;

   – at least three words are in a second language in a post with more than ten words, except when they are a named entity.

Also, the posts were classified into three types of exchange: public post (ToAll), reposting with a comment (RT), or replying to an individual (reply). I designed a simple algorithm for the automatic classification of the posts, using regular expressions; for example, when a post starts with the @ sign followed by a username, it is a reply to that person.

Initially, I posed the question of whether I would find more bilingual posts of the type reposting with a comment, thinking of potential translations, which triggered the revision of language labels and posts to identify them. Also, this initial question justified the differentiation between general public posts and repostings with a comment, using the convention of “RT” or “rt” [66]. However, the resulting number of bilingual posts in the ego dataset was very low, 37, and none of them were repostings with a comment. For this reason, I discarded any hypotheses related to bilingual posts.

To sum up, the dataset used has 4,374 posts, all of which have a language category (English, other) and a category of exchange type (ToAll, RT, reply).

7.2 Hypothesis Testing: Fisher’s Exact Test

In this section, I propose to test the following hypothesis: English is used more when addressing a post to the public in general (ToAll and RT) than when replying to individuals.
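The classification of posts into the three exchange types of section 7.1 can be sketched with two regular expressions. This is a minimal Python illustration, not the algorithm actually used: the patterns, the function name `exchange_type`, and the rule that a leading mention takes precedence over an “RT” marker are all assumptions.

```python
import re

# A reply starts with a mention: the @ sign immediately followed by a username.
REPLY_RE = re.compile(r"^\s*@\w+")
# A reposting with a comment uses the "RT"/"rt" convention followed by a mention.
RT_RE = re.compile(r"\bRT\s*@\w+", re.IGNORECASE)

def exchange_type(post):
    """Return 'reply', 'RT', or 'ToAll' for the text of one post."""
    if REPLY_RE.match(post):
        return "reply"
    if RT_RE.search(post):
        return "RT"
    return "ToAll"

for text in ["@maria gracias por el enlace",
             "So true! RT @bob: languages matter",
             "I am at something @ name of place"]:
    print(exchange_type(text), "<-", text)
```

Requiring a username right after the @ sign keeps location-style uses, such as the foursquare example in section 7.1, out of the reply category.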
This hypothesis is based on the empirical observations of previous research studies on email and mailing lists [31, 65], focusing on the impact on language choice of addressing a message to one person or to a multilingual audience; in the latter case English was preferred for its function as a lingua franca.

If a is the number of replies in English, and b is the number of ToAll and RT posts in English, the null hypothesis can be stated as:

\[ H_0 : a \geq b \tag{7.1} \]

And the alternative hypothesis is:

\[ H_a : a < b \tag{7.2} \]

I set the commonly accepted value of p = 0.05 as the level of significance associated with the null hypothesis.

Honeycutt and Herring [54] estimated that roughly 30% of all Twitter posts contained an @ sign regardless of language. However, they recognized that English was by far the most frequent language in their sample. More recently, Weerkamp et al. [110] looked specifically at how the proportion of replies in Twitter differs by language: it varies from 36% of posts being replies in Dutch and 34% in Spanish, to 25% in English, and 13% in both Portuguese and Indonesian. Similarly, in the ego dataset the number of replies to specific users is lower in general than the number of posts addressed to a wider audience. In particular, 22% of English posts are replies, while 35% of posts in other languages are replies.

Given the lower number of replies versus public posts in this dataset, I normalize the counts. Therefore, I compare the proportion of replies that are written in English, a/(a+c), with the proportion of ToAll and RT posts written in English, b/(b+d), where c is the number of replies in other languages, and d is the number of ToAll and RT posts in other languages. The results are at odds with the null hypothesis: the proportion of replies in English is a/(a+c) = 0.3810, while the proportion of public posts written in English is b/(b+d) = 0.5368. To reject the null hypothesis, I have to apply a statistical test of significance.
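The normalized comparison above, and an exact test on the same four counts, can be reproduced with the Python standard library. The sketch below uses the counts of table 7.1 and sums the hypergeometric probabilities of all tables with at most a English replies, given fixed margins. Whether the reported test was one-sided in this way is an assumption, as are the function names.

```python
from math import comb

def english_shares(a, b, c, d):
    """Proportion of replies in English and proportion of public posts in English."""
    return a / (a + c), b / (b + d)

def fisher_lower_tail(a, b, c, d):
    """One-sided exact p-value: probability of at most `a` English replies,
    given the row and column totals of the 2x2 table."""
    english, other, replies = a + b, c + d, a + c
    n = english + other
    k_min = max(0, replies - other)
    # Count the tables with k English replies, for every k up to the observed a.
    favourable = sum(comb(english, k) * comb(other, replies - k)
                     for k in range(k_min, a + 1))
    return favourable / comb(n, replies)

# Counts from table 7.1: a, b = English replies and public posts; c, d = other languages.
a, b, c, d = 482, 1669, 783, 1440
print(english_shares(a, b, c, d))
print(fisher_lower_tail(a, b, c, d))
```

Using exact integer arithmetic for the binomial coefficients avoids the overflow that factorials of this size would cause in floating point; only the final ratio is converted to a float.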
Since I am comparing the categories of the posts, the most appropriate non-parametric test is Fisher’s Exact Test [36].

In Fisher’s Exact Test, the data can be displayed in a 2x2 contingency table (table 7.1). The probability of obtaining any such set of values is given by the hypergeometric distribution in equation 7.3.

\[ p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!} \tag{7.3} \]

                         Replies       ToAll+RT      Language total
English                  a = 482       b = 1669      a+b = 2151
Other                    c = 783       d = 1440      c+d = 2223
Type of exchange total   a+c = 1265    b+d = 3109    n = 4374

Table 7.1: 2x2 contingency table for the Fisher’s Exact Test.

The resulting p-value is 2 × 10^−21, which is much lower than the set value. In conclusion, we can reject the null hypothesis in favor of the alternative hypothesis: multilingual Twitter users use English more frequently in public posts than in replies to individuals.

7.3 Discussion: Addressivity as a Factor

This result reinforces the idea that addressing an online message to a perceived multilingual audience encourages the use of English. In Twitter, the only previous study that has looked at addressivity as a factor for language choice focused on Welsh-English bilinguals [61]. The study by Johnson [61] also found that they use proportionally more English in public posts (53%) than in replies to individuals (44%) in a sample of 500 posts. The author speculates that the use of English is encouraged in Twitter for its potential to reach a wider audience [61]. The work by Johnson reported very few cases of bilingual posts [61], which is consistent with the description in section 7.1.

In section 3.3, I reviewed two works that suggest Twitter is used more for conversational purposes in some languages, with a higher frequency of @ signs, while in other languages it is more common to use it for sharing resources, as the higher frequency of URLs and repostings might indicate [110, 55].
Taking these previous findings into consideration, future work should study the combination of the language profile of the user and the addressivity of the message to understand language choices in Twitter. The lesson for system designers is that different types of exchange between people, with the corresponding number of sources and receivers (one or many), could prompt code-switching, that is, a user changing the language.

7.4 Theme Analysis

In chapter 5, I studied the social connections between language groups, but ultimately, I am interested in understanding the complementary roles of social connections and topic-based linkages in creating cross-cultural bridges.

This qualitative analysis constitutes a first exploration of themes that could potentially connect communities speaking different languages. I use the ego dataset, which was not initially collected with the idea of detecting communities around a topic; instead, the purpose was to gather multilingual users regardless of their interests. With this limitation in mind, as well as the numerous languages present in the dataset, I decided to focus on identifying keywords related to non-English speaking countries or communities in English posts, and English “hashtags” (keywords prefixed with the pound sign) inserted in non-English posts.

Previous works [24] have attempted to develop various classification schemes of content in Twitter, but they are too broad for the purposes of this study. Dann [24] focuses his review on five classification schemes and complements them with his own scheme of six general categories, namely conversational, pass along, news, status, phatic, and spam, and 24 subcategories, like response, location, endorsement, headlines, sports, events, etc.
In this section, I do not base my analysis on these existing classification schemes; instead, I read through the data without preconceptions in search of answers to the questions: why is the user mentioning a non-English speaking place or a non-English language? What is she/he saying about it? Why is the user inserting an English hashtag in a non-English post? The descriptive nature of this generic theme analysis is intended to stimulate more systematic research questions and classification schemes to discover the topics and feelings that bring multilingual online communities together.

7.4.1 International themes in the English language set

Firstly, while reading the 2,151 posts in English, I identified the names of places, cities, countries, or languages that are not English speaking. At the same time, I annotated the context in which they were mentioned. After a second revision, I used the annotations about the context as the basis for the following emergent themes:

• international news, which include links to media sources and reactions (e.g. “Arab Revolution power”, “The Ukrainian experience for Arab world by Lionel Beehner”, “Hope that everyone in Japan is fine...”);

• people’s travel plans or an accomplished travel (e.g. “can’t wait to fly to #Barcelona”, “Now considering Martinique & Guadeloupe for my next holiday”, “Hi back from germany”, “off to Geneva for a show at a fancy birthday party”);

• people’s location (e.g. “It’s 3AM here in France so I’m kind of tired”, “[...]you are at the Schokoladenmuseum”, “[...]we need you in Oberhausen! [...]How long will you stay in Hamburg?”);

• event’s location (e.g. “Buskers Festival in Bern was fantastic!”, “annual Swiss Congress of Radiology”, “Global Performing Arts Exchange Singapore”, “Tomorrow is our concert in Dresden”);

• an opinion, of political kind or not (e.g. “miserable greek reality”, “I am for communism. Swedish communism”, “bahrain tonight should burn...”, “Given the birth rate & z population figures in Egypt, I can’t understand how sex is a taboo [...]”);

• internationalization of technology (e.g. “[...]software penetrating Angola”, “Portuguese RTS game for PS3 [...]”, “I use it in China to redirect my website”, “@adobe should organise an event for nord africa just like @google”);

• Sports (e.g. “German Rugby”, “Forza Milan!”);

• Culture (e.g. “Nigerian Clubbing etiquette”, “Victor Manuelle is a Puerto Rican artist!”, “Israeli band Orphaned Land rocks Turkey”);

• Language, including remarks about the language of a linked resource when it is different from English (e.g. “New Review 9/10 Points (German)”, “Afrikaans Video”);

• gastronomy and restaurants (e.g. “Turkish scrambled eggs...”, “all you can eat at thai-jap rest...”);

• travel recommendations (e.g. “If a Lille visit is on the agenda [...] I’d highly recommend the LAM museum”, “see my latest thoughts on ‘Where To Stay’ in Istanbul”, “You should come to Israel during the summer, you’ll have a blast!”).

There are 227 posts in which a non-English speaking place or language is mentioned. Table 7.2 displays the frequency of the themes. The complete list of places and languages, with an extract of the textual context and the associated theme, can be found in Appendix B.

Looking at this list of themes found in English posts, I wondered which ones are drawing attention to countries and communities that are not English-speaking and providing some information. Generally, posts about people’s locations (32 instances) or travel plans and accomplished travels (33 instances) do not provide information about those places, and I speculate that the function of these posts relates more to coordinating with friends than to drawing attention to a culture.
Theme                               Instances
International news                  31
Reaction to international news      4
Travel plans                        29
Accomplished travel                 4
People’s location                   32
Events’ location                    27
Opinion                             24
Tech internationalization           14
Sports                              13
Culture                             12
Language                            8
Language of resource                4
Gastronomy and restaurants          10
Travel recommendations              9
A country’s policy                  2
Location and language               1
Culture and humor                   1
Travel plans and opinion            1
Movies                              1

Table 7.2: Frequencies of themes related to non-English speaking places and languages mentioned in English posts.

Likewise, the posts related to events’ locations (27 instances) might have a primary coordinating function, but in some cases (e.g. “oktoberfest”) the post is drawing attention to events of cultural interest in a country and could be considered as fostering cross-cultural awareness.

The theme international news, which also includes the reactions to the news, is the theme that most frequently fosters cross-cultural awareness in the ego dataset (35 instances). These posts show the interest of the author in a non-English speaking community and provide information to the followers. Opinions, of political kind or not (24 instances), are the next most frequent theme in raising cross-cultural awareness. In this case, the author also shows interest in a non-English speaking community and provides some piece of information, although the impressions are sometimes negative. Less frequently, there are posts referring to Sports (13 cases), Culture (12 cases), gastronomy and restaurants (10 cases), and travel recommendations (9 instances) of countries that are non-English speaking, which also provide some information about cultural aspects.

Separately, the theme internationalization of technology is interesting in its own right because it reflects technology adoption in different areas of the world.
Although the number of instances is low in this dataset (14 cases), it points to a promising application for tracking the adoption of technology products and services at an international scale.

Finally, it is worth noting that among the 12 mentions of a language different from English, there are 4 instances in this dataset where users specify the language of a resource they are linking. They are creating a cross-language link, a phenomenon that has been studied in the blogosphere [53, 47]. Cross-language linking, like the cross-language social connections studied in chapter 5, contributes to building paths between communication spheres online that are separated by language, enabling the flow of information across national boundaries.

In summary, even though the mention of a non-English place or language does not always come with information about it, some themes might be facilitating cross-cultural awareness, like international news, sports, and culture. Unfortunately, the sharing of resources in languages other than English within the English language sphere on Twitter seems to be infrequent, or at least the language notices preceding the link are. Alternative data collection techniques could shed light on this phenomenon in future research.

7.4.2 English hashtags in the non-English language set

In a second phase, I read the 2,223 posts that were left out of the English language set. However, among those, there are posts in English that were classified as automatic (see section 7.1). I ignored such posts for the purposes of this analysis.

Aside from English, there are 18 languages in the ego dataset. Given the challenge that so many languages posed for the analysis, I focused on identifying English “hashtags” only, that is, keywords or phrases prefixed with the pound sign (#). The purpose is to explore the reasons why a user writing in a language different from English might want to add an English word or phrase. Are they examples of code-switching?
Are they a mechanism to draw international attention? Or are there other motivations behind them?

The hashtag was a convention of Internet Relay Chat (IRC), introduced and accepted by users as the Twitter tagging feature in 2007 [15]. Hashtags are used to “funnel related tweets into common streams” [57], by aggregating posts with a common hashtag. It is important to bear in mind that the hashtag can be a means to classify a message, but it also provides the user with visibility in a “many-to-many” conversation about a topic, potentially enabling the user to reach an audience beyond their followers. The Twitter system leverages hashtags for creating lists of popular topics in real time, called “trending topics”. These trending topics are visible to all users, thus potentially drawing worldwide attention. As a side note, the proportion of Twitter posts with hashtags seems to vary with language; for example, a study reports that they account for 25% of posts in German, 14% in English, and as low as 4% in Japanese [110].

I created a list with the identified hashtags in English, and I also included hashtags of brand names and products, names of places in English or transliterated into Latin script, and many acronyms. Acronyms posed a particular challenge, given that many of them were just informal ad-hoc abbreviations and required search and documentation to understand their meaning. Often, they were abbreviations of conference names, music festivals, and other events with a potential international audience. Occasionally, I could not determine the meaning of an acronym, and consequently I did not include it in the list of selected hashtags. Also, I discarded acronyms that referred to local events or institutions (e.g. German institution #rbb, Portuguese political debate #e2011pt) when they were specific to the culture and language in which the author was writing and did not constitute a change of language or script.
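The list of hashtags described above was compiled by reading the posts, but a first candidate list with frequencies could be produced automatically and then reviewed by hand. The following Python sketch is an illustration under stated assumptions: the pattern, the function name `hashtag_frequencies`, and the decision to merge case variants are not taken from the original procedure, and deciding whether a hashtag is English still requires manual annotation.

```python
import re
from collections import Counter

# A hashtag is the # sign followed by letters, digits, or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")

def hashtag_frequencies(posts):
    """Count hashtags over a list of post texts, merging case variants."""
    counts = Counter()
    for post in posts:
        for tag in HASHTAG_RE.findall(post):
            counts[tag.lower()] += 1
    return counts

sample = ["Nuevo post sobre #Android y #cloud",
          "Ich liebe #android!",
          "sin etiquetas"]
print(hashtag_frequencies(sample).most_common())
```

Merging case variants keeps spellings such as #Android and #android together, which is convenient for frequency tables but would hide deliberate capitalization.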
Subsequently, I classified the hashtags using the annotations about their meaning. I tried to classify them into topics, like Information and Communication Technologies, and political topics and campaigns. However, some hashtags have a primary conversational function instead of referring to a topic, as studied by Huang et al. [57]. They argue that these types of tags “serve as a prompt for user comment” and “the resulting content is an asynchronous massively-multi-person conversation”; they also provide a few examples, like #igrewupon and #liesmentell [57]. The conversational tags of Huang et al. [57] correspond to what Laniado and Mika [70] called “tags characterized by a common sentiment”, which they illustrate with examples such as #thankfulfor and #youknowyouareuglyif. Laniado and Mika estimated that 20% of Twitter hashtags belong to this type. Also, many hashtags refer to named entities. Laniado and Mika [70] estimated that 39% of hashtags are named entities, most commonly referring to organizations, products, and events.

As a result of this diversity, I first classified hashtags into two main groups: conversational tags, and the rest. Secondly, I divided the conversational tags into three types: emergent discourse conventions in Twitter (e.g. #fail, #wtf), reflecting on Twitter use (e.g. #1000followers, #odd trend), and informal or ad-hoc Twitter genres (e.g. #a thought, #kindlyAdvice, #roadrage). Finally, I classified the rest of the hashtags into groups of topics, and grouped aside the brands, devices, events, locations, and dates. The tables of hashtags, including frequency of appearance and the categories in which they are classified, can be found at the end of this section.

A prominent group of English hashtags relates to Information and Communication Technologies (e.g. #mobileusers, #Android, #Cloud Computing, #hyperlinking), which also spans brands and devices, such as #microsoft, #skype, #Ipod (see table 7.5).
Among these hashtags, there are named entities, but also cases of intrasentential code-switching, where the user switches from one language to another within the same sentence [62]; following Joshi’s definition [62], English is the embedded language within a non-English matrix language in these examples.

While the topic of Technology and Internet seems to trigger the use of English terms, international news (e.g. #bahrain, #egypt) and campaigns, such as #1billionhungry, might be biased towards English hashtags to draw global attention (see tables 7.7 and 7.8). International news could potentially lead to a trending topic and draw attention to events unfolding in real time in some part of the world, as was the case during the popular uprising in Egypt in 2011 [16].

There are also numerous hashtags referring to conferences and music festivals. If the organizers of such events announce or promote a hashtag, they avoid the problem of fragmentation of message streams related to the event due to variations of keywords [15]. However, in this sample there are variants of a hashtag for the same event, such as #caexpoitaly and #caexpo, #gmaghreb and #gmaghreb11, #sepIn11 and #sepIn (see table 7.6). Letierce et al. [72] studied the use of Twitter and hashtags in conferences of Semantic Web researchers and revealed that users have a desire to participate in the discussion around the conference and see their messages included in the conference stream, while hoping to increase their network. They concluded that Twitter is used as a background communication channel during those events.

Most interestingly, international conferences and music festivals attract multilingual and multicultural audiences, who can conform to the same Twitter hashtag and generate multilingual conversations around the event taking place.
These international events might be key in promoting cross-language sharing of resources and creating social ties across language and national boundaries in the online communication sphere.

There are still other reasons to use an English hashtag when writing in a different language: as studied by Huang et al. [57], some hashtags are prompts for user comment and a way to participate in a multi-person conversation, which can be multilingual. A few of these conversational tags constitute emergent discourse conventions in Twitter, some even adopted from other online sites, such as the commonly used #like in the social networking site Facebook. Kooti et al. [66] studied the emergence and evolution of this type of convention, looking at the specific example of reposting in Twitter. They provided data showing how, over time, the user community chose to include “RT” in their repostings more often than other alternatives.

In other words, Twitter users have progressively agreed on the use of certain codes, such as adding #fail to their post when they talk about disappointing or deceiving news (7 times in the non-English sample), #wtf to express disbelief, #FF or “follow friday” to recommend other users, etc. (see table 7.3). Even though these conventions come from the English language, they have been adopted in other languages for communicating in Twitter.

Similarly, users in the non-English sample sometimes categorized their posts in an “informal or ad-hoc genre” by adding an English hashtag, like #a thought, #kindlyAdvice, #thisislife, #roadrage (see table 7.4). These examples are along the lines of those presented by Huang et al. [57] and Laniado and Mika [70], referring to common sentiments, but they are also less persistent over time than the previously discussed emergent discourse conventions.
These informal or ad-hoc genres in English seem to have the potential to spread internationally and be adopted across languages in Twitter, but this phenomenon is still not well documented.

In summary, some English hashtags reflect code-switching in relation to certain topics, like Information and Communication Technologies. A question for future research could be whether this topic affects language choice, favoring the use of English. Also, international news and campaigns might tend to trigger the use of English hashtags to draw global attention. Finally, international events organizing back-channel comments around a common hashtag, as well as certain conversational tags, could be the focus of further research for their potential to foster multilingual conversations.

| Hashtag | Frequency | Meaning/Context |
|---|---|---|
| #fail | 7 | Sharing a bad experience or deceiving news |
| #wtf | 2 | Expressing surprise or disbelief |
| #FF | 1 | “follow friday”: recommending a person or organization to follow |
| #Like | 1 | |
| #np | 1 | now playing or no problem |

Table 7.3: Conversational tags: emergent discourse conventions in Twitter, social networks or online chat.

| Type | Hashtag | Frequency |
|---|---|---|
| Reflecting on Twitter use | #TweepsMidName | 5 |
| | #1000followers | 1 |
| | Twitter #addicted | 1 |
| | #odd trend | 1 |
| | #summer trends | 1 |
| Informal or ad-hoc Twitter genres | #a thought | 1 |
| | #kindlyAdvice | 1 |
| | #thisislife | 1 |
| | #roadrage | 1 |
| | #supportedby | 1 |
| | #UNeverKnow | 1 |

Table 7.4: Conversational tags: reflecting on Twitter use and informal or ad-hoc Twitter genres.

| Topic | Hashtag | Frequency | Meaning/Context |
|---|---|---|---|
| ICT | #mobileusers | 6 | |
| | #cisa | 2 | Online live broadcasting |
| | #OVH | 2 | Online virtual hosting |
| | #Android | 1 | |
| | #AR | 1 | Augmented Reality |
| | #Cloud Computing | 1 | |
| | #Honeycomb | 1 | Android version |
| | #hyperlinking | 1 | |
| | #launch | 1 | a website |
| | #opendata | 1 | |
| Devices | #fb funerals | 2 | Facebook |
| | #n900 | 1 | Nokia model |
| ICT brands | #Ipod | 1 | |
| | #microsoft | 1 | |
| | #samsungcheerdance | 1 | Samsung |
| | #skype | 1 | |
| Vehicle brand | #Audi | 1 | |

Table 7.5: Hashtags: ICT topic, brands and devices.
| Topic | Hashtag | Frequency | Meaning/Context |
|---|---|---|---|
| Conferences/events | #caexpoitaly | 9 | CA expo 2011 |
| | #mcdd10 | 7 | Mobile Camp Dresden |
| | #caexpo | 5 | CA expo 2011 |
| | #gmaghreb | 2 | Google event in Maghreb |
| | #gmaghreb11 | 1 | Google event in Maghreb |
| | #innovlab2011 | 1 | |
| | #pycon4 | 1 | |
| | #SMW11 | 1 | Social Media Week 2011 |
| | #sepIn11 | 1 | |
| | #sepIn | 1 | |
| Music festivals | #212RMX | 2 | |
| | #fib2011 | 2 | |
| | #readingandleeds | 1 | |
| Music | #arcticmonkeys | 1 | Band |
| | #FearFactory | 1 | Album |
| TV and sports | #BlueWolves | 1 | Mongolian soccer team |
| | #Comedystreet | 1 | German TV program |
| | #tv | 1 | |

Table 7.6: Hashtags: events, music, TV, and sports.

| Topic | Hashtag | Frequency | Meaning/Context |
|---|---|---|---|
| Location | #bahrain | 12 | |
| | #berlin | 2 | Germany |
| | #egypt | 2 | |
| | #greece | 2 | |
| | #germany | 1 | |
| | #korinthos | 1 | Greece |
| | #Mongolia | 1 | |
| | #liveyourmythingreece | 1 | Greece |
| | #Tahrir | 1 | Egypt |
| | #uniineurope | 1 | Europe |
| Dates/time | #14feb | 1 | events in Bahrain |
| | #september | 1 | |
| | #winter | 1 | |
| Celebrations | #Jerusalemday | 1 | |
| | #Ramadan | 1 | |

Table 7.7: Hashtags: location, time, and other named entities.

| Topic | Hashtag | Frequency | Meaning/Context |
|---|---|---|---|
| Project management | #marketing | 1 | |
| | #scrum | 1 | |
| Political/Campaign | #debtocracy | 13 | |
| | #1billionhungry | 1 | |
| | #endSH | 1 | End Street Harassment |
| | #sexquota | 1 | |
| Not classified | #networking | 1 | |
| | #selective default | 1 | |

Table 7.8: Hashtags: other topics.

Chapter 8 Discussion and Future Work

I have encountered many challenges in this explorative research, which requires bringing together multiple fields and diverse methods in ways that have not been established previously. For instance, one of these challenges was detecting multilingual users in Twitter and, more broadly, determining the language profiles of users (whether they are monolingual or multilingual) and assigning language labels accordingly. An example of the multidisciplinary character of this dissertation is the application of social network analysis to sociolinguistic questions.

As a result of the decision process for resolving the challenges as they became apparent, a concomitant contribution of this research is its set of methodological considerations.
Namely, the process of testing natural language identification tools and the relationship between the number of posts analyzed per user and the estimated error rates in language profiling can serve as a guide to approach the problems arising in the study of languages in Twitter.

Another challenge of doing research about Twitter (and other Internet platforms) is that the findings on particular aspects of this fast evolving environment can easily become outdated. This dissertation aimed at answering the research questions without requiring an exhaustive review of the Twitter interface, which has changed several times in the past years, and at obtaining conclusions that could be informative for the study of other social networks and communication environments that share the key characteristics of Twitter, such as being public and offering reposting and replying capabilities.

This chapter discusses the results of the studies that compose this dissertation, and highlights the key contributions and future directions of this research. Ultimately, I hope this discussion provokes questions for new studies.

8.1 Of Links, Social Ties, and Gravitational Forces

The vision of a cosmopolitan Internet with vibrant communities, enabling contact with the unfamiliar, discovery, and the serendipity that propitiates learning [119] is challenged by the existence of language frontiers online. In the view of the Global Language System theory (section 2.1), multilingual people constitute the gravitational force that provides cohesion to the system, by connecting different language groups. There is empirical evidence of this language bridging in the blogosphere [53, 32].

The language ecology approach (section 2.2) connects this macro-scale dimension of languages with the micro-scale level of interactions between individuals.
Social network analysis provides an analytic tool for studying these language ecologies that emerge from the interactions of the multilingual users with their social connections.

The main contribution of this dissertation is going beyond survey information about multilingualism in Twitter [83, 55], and providing a deeper understanding about the structural relations between language communities in a social network online. Although inspired by previous studies on the blogosphere [53], this research enhances the descriptive analysis with the creation and definition of theoretical constructs: the types of bilingual networks.

Focusing on the networks of multilingual users, the social network analysis revealed three types of bilingual networks: the Gatekeeper-Language bridge, representing a continuum of increasing connections between two separate language groups; the Integration and union type, representing a continuum of increasing penetration of one language group within the structure of the other; and the Peripheral language type, where one language group is smaller or less cohesive, and lies at the periphery of the social graph.

This research conceives of the social network of multilingual users as a micro-scale language ecology, influencing their communication strategies and language choices. This conceptualization leads to a second key contribution, which is the novel idea of modeling the influence of social network factors in the language choices of the user.

In the factor analysis, the dependent variables considered are the proportion of English use and non-English use within the posts of the user. The factors included are the proportion of English and non-English language users in the social network of the multilingual subject, and the degree of multilingualism of the social network. The relative importance of factors is represented by the coefficients obtained by fitting two generalized linear models to the dataset (linear and logistic regression).
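To make the idea of reading a factor's importance from a fitted coefficient more concrete, here is a deliberately reduced Python sketch. It fits an ordinary least squares line with a single factor (the proportion of English users in the network) against the proportion of English use in the posts. The models in this dissertation include several factors and a logistic variant, which this sketch does not attempt to reproduce. The function name and all data values are invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept). The slope plays the role of the
    coefficient whose size indicates the weight of the factor.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Invented values, one pair per hypothetical multilingual user:
# share of English users in the ego network, share of English posts.
english_share_in_network = [0.2, 0.4, 0.5, 0.7, 0.9]
english_share_in_posts = [0.1, 0.35, 0.5, 0.6, 0.95]

slope, intercept = fit_line(english_share_in_network, english_share_in_posts)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A positive slope in such a fit would be in line with the association reported in this chapter, but with a single factor and made-up numbers the sketch illustrates only the mechanics, not the result.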
The proportion of English users in the network constitutes a key influencing factor in the frequency of English use by the multilingual individual. Similarly, the proportion of non-English language (L2) users in the network is a very important factor influencing the frequency of L2 use by the multilingual person. The results suggest that multilingual Twitter users perceive the language composition of their network and interact accordingly. Conversely, the language choices of multilingual users might attract followers of a specific language profile. Most probably, the relation goes both ways, in a self-feeding cycle.

Shifting attention from the social network to the content of the posts written by the multilingual users, I tested the hypothesis that the type of exchange (public post versus reply to an individual) influences the choice between English and other languages. The result reinforces previous empirical findings suggesting that sending public messages to a seemingly multilingual audience encourages the use of English [31, 65].

Finally, there is another gravitational force that could connect language groups and affect language choice: topics [65, 5]. Common interest in certain topics attracts people from different cultures, and encourages the creation of cross-language links to resources and news [47].

As a step toward future studies on international topics, this dissertation explores what themes might be raising cross-cultural awareness. I identified themes related to non-English speaking countries or communities in English posts, and I concluded that international news was the most popular theme. Also, I identified English hashtags (keywords preceded by the # sign) inserted in non-English posts and related contexts that could encourage multilingual conversations.
International conferences and music festivals attract multilingual and multicultural audiences, who conform to the same Twitter hashtag and generate multilingual conversations around the event taking place. These international events might be key in promoting cross-language sharing of resources and creating social ties across geographic regions.

If we embrace the idea of a vibrant language ecology on the Internet, we should challenge the existing structure of the network of hyperlinks and social ties, for instance by empowering multilingual users to leverage their social ties across language groups, facilitating translation, and recommending links to resources in different languages.

8.2 The Road Ahead...

Future directions for this research include scaling up the social network analysis to account for multilingual users with larger social networks. This will require improving the methods for analysis of larger collections of data, e.g. training natural language identification tools to detect transliterated text, and using spam detection algorithms.

Also, I envision expanding the theme analysis to include methods of automatic topic detection in multiple languages, or crowdsourcing annotations using platforms such as Mechanical Turk or CrowdFlower (i.e. sending micro-tasks to large numbers of people for specifying the topics of Twitter posts). Further research could focus on topic-based networks, targeting the sampling to specific language pairs and topics for enabling comparisons across languages and a more complex factor analysis.

Finally, studying the evolution of social networks over time could unveil the relationship between the language composition of the social network, audience perception, and language choice. In relation to this, other questions arise: whether multilingual users are aware of the type of social network they have and, if they are, how this affects their language choices.
8.2.1 Translation and Mediation in Twitter

In section 7.1, I describe my unsuccessful attempt to generate a hypothesis in relation to bilingual posts and repostings with a comment, as a preliminary step to identifying translation behaviors. However, the resulting number of bilingual posts in the ego dataset was only 37. It seems that the limited number of characters allowed poses a problem for including a translation together with the original comment in the same post. Alternatively, translations might be found in separate posts but, unlike reposting, there is no way to connect the translation to the original message. Also, some people create separate accounts for each language to address different audiences. Future research could use automatic topic analysis to identify translations.

Although Twitter enables the use of many languages and writing systems thanks to Unicode, it does not offer support for translation or features for strengthening connections between language groups. Regarding support for translation, there are no embedded linguistic resources on the interface, such as machine translation, dictionaries or transliteration tools. The Meedan project [113] organizes volunteers for translating Twitter posts and has encountered a number of challenges: engaging users in translation, linking and representation of translations in relation to the original post, authorship, validation, etc.

Instead of relying on volunteer translators, we could seek ways to encourage translation, cross-language linking and connection behaviors that are happening already. However, in Robert Munro’s words, there is not a “unified resource that links people by languages spoken” in social media [96], which would be a helpful starting point. Recommendation mechanisms could foster the creation of cross-language links.
For example, AlMeshary and Abhari [4] propose a strategy for recommending people to follow on Twitter with the purpose of obtaining local information in the context of a user relocating from a different country. They use machine translation to match the users’ interests found in their posts with the local offers.

Finally, by studying the dynamic language preferences of multilingual users, we will not only be in a better position to design a satisfying experience for those users, but also learn how to help them in their mediation tasks. This dissertation advances in that direction by modeling the influence of factors in the language choices of the multilingual users.

8.2.2 Who Are the Multilingual Users?

This dissertation focuses on multilingual users because of their role in connecting different language groups. But who are they? Are they expatriates? Members of minority communities? Language learners?

Twitter posts and the languages in which they are written represent just a limited language profile of the user, and they barely provide any social context. Androutsopoulos recommends taking into account the digital surroundings when analyzing written text, for instance, looking at the pictures and videos that are linked [6]. Also, adding detailed geographic information could help in building a more complete profile.

Understanding more about the context of multilingual users could help in the identification of roles and motivations for mediating between language groups and in finding the relationship of these roles with network types. The next step after this dissertation is adding consideration of geolocation information and content analysis of the resources linked in the posts to provide more attributes for nodes and edges in the social network analysis. Additionally, ethnographic methods could shed light on who the people connecting different language groups are, and what their reasons are.
Chapter 9 Conclusion

Social media is international: users from different cultures and language backgrounds are communicating, generating and sharing content. However, language barriers emerge in the communication landscape online. The aspiration of an Internet that constitutes a cosmopolitan space and fosters language diversity has stumbled over the language frontier.

In the microblogging site Twitter, information spreads across languages and countries. But how is the news traveling across borders? Expatriates, migrants, minorities, diasporic communities, and language learners play an important role in forming transnational networks and cultural bridges between nations and communities. They are multicultural and multilingual.

This dissertation studied how multilingual users of Twitter mediate between language groups in their social network, looking at social connections and language choices. The overarching goal that motivates this research is to advance our understanding of the network structures and communication strategies that enable intercultural dialog, cross-language sharing of information, and awareness of global problems. The implication for the design of social media platforms is that, instead of constraining multilingual users to only one language option, technology should support their language-switching and mediating role between cultures.

The objectives of this dissertation were: (1) to explore the ways in which multilingual users of Twitter are connecting different language groups in their social network; (2) to model how the network influences their language choices; and (3) to explore what the textual features of their posts can elicit about language choices and mediation between groups.

RQ 1: In what ways are multilingual users of Twitter connecting language groups?
Focusing on the social network of 92 multilingual users, the methodology combined a qualitative approach to social network analysis and network statistics to present a classification of network types based on the patterns of connections between language groups. The study followed an exploratory design, with a first qualitative phase that took a grounded theory approach to classify the network visualizations, and a second quantitative phase that complemented the qualitative study with network statistics specifically created to provide a robust definition of network types. Finally, I used machine learning for testing the results.

The social network analysis revealed three differentiated types of bilingual networks: the Gatekeeper-Language bridge, representing a continuum of increasing connections between two separate language groups; the Integration and union type, representing a continuum of increasing penetration of one language group within the structure of the other; and the Peripheral language type, where one language group is smaller or less cohesive, and lies at the periphery of the social graph.

RQ 2: How is the social network of multilingual users in Twitter influencing their choice of language? The factor analysis modeled the influence of a set of factors related to the social network in the language choices of multilingual users. The dependent variables considered are the proportion of English use and non-English use within the 50 posts of the user. The factors included are the proportion of English and non-English language users in the social network of the multilingual subject, and the degree of multilingualism of the social network. The relative importance of factors, or their weight, is represented by the coefficients obtained by fitting two generalized linear models to the dataset (linear and logistic regression).
The proportion of English users in the network constitutes a key influencing factor in the frequency of English use by the multilingual individual. Similarly, the proportion of non-English language (L2) users in the network is a very important factor influencing the frequency of L2 use by the multilingual person. Interestingly, the proportion of English users in the network is also an important factor when modeling L2 use, and is negatively correlated with it. Regarding the multilingual index, the results were inconclusive about its influence in the language choice of multilingual users.

RQ 3: Does the type of exchange in Twitter influence the language choice of multilingual users? I shifted the attention from the social network to the content of the posts written by the multilingual users. First, I looked at the textual feature of the @ sign at the beginning of a post as an indicator of addressivity. Based on this indicator, I tested the hypothesis that the type of exchange (public post versus reply to an individual) influences the choice between English and other languages. The result reinforces previous empirical findings suggesting that sending public messages to a seemingly multilingual audience encourages the use of English.

RQ 4: What do the themes and textual features in the posts of multilingual users reveal about cross-cultural awareness or international dialogue? Finally, I looked at content with the objective of detecting themes that might help in creating cross-cultural awareness, where the multilingual users could be acting as mediators from the point of view of their messages. I identified themes related to non-English speaking countries or communities in English posts and, also, I identified English hashtags (keywords preceded by the # sign) inserted in non-English posts. Using a generic theme analysis, I concluded that international news was the most popular theme when mentioning a non-English speaking place.
This study serves as an explorative qualitative phase to inform the design of future studies after this dissertation work.

The main contribution of this dissertation is going beyond survey information about multilingualism and providing a deeper understanding about the structural relations between language communities in a social network online. This research work is one of the few that apply social network analysis to the study of sociolinguistic questions on the Internet. In particular, it contributes an original classification of network types based on the patterns of connections between language groups, complemented with new network statistics that enhance the definitions of these theoretical constructs.

Adapting the Ecology of Language approach from Sociolinguistics to the social network context, this research conceived of the social network of multilingual users as a micro-scale language ecology, influencing their communication strategies and language choices. This conceptualization led to the novel idea of modeling the influence of social network factors in the language choices of the user.

This dissertation can benefit the study of information diffusion regarding the potential impact of these types of network structures on cross-language flows. Also, it contributes to understanding users’ behavior and informing the design of social media platforms.

Future directions for this research include: scaling up the social network analysis to account for multilingual users with larger social networks; studying topic-based networks and detecting cases of translation; targeting the sampling to specific language pairs and topics for enabling comparisons across languages and a more complex factor analysis; and studying the evolution of social networks over time to explore how this affects audience perception and language choice.
The next step after this dissertation research is adding geolocation information and content analysis of the resources linked in the posts to provide more attributes for nodes and edges in the social network analysis. Finally, ethnographic methods could shed light on who the people connecting different cultural and linguistic groups are, and what their reasons are.

Appendix A Visualizations of Social Networks

This appendix contains the visualizations of the 92 egocentric networks, with the qualitative category assigned, the language codes and corresponding colors. Language labels can have one language code, following the ISO standard codes for names of languages (e.g. “en” for English, “es” for Spanish, “de” for German); two language codes joined by the + sign in the case of bilinguals; the word “empty” for nodes with no data available; or the number 0 for nodes where the language could not be identified. All visualizations were made with the Gephi social network analysis tool, using the Force Atlas layout. The size of the nodes represents betweenness centrality.

Figure A.1: Trilingual networks (1).
Figure A.2: Trilingual networks (2).
Figure A.3: Trilingual networks (3).
Figure A.4: Bilingual networks: gatekeeper type (1).
Figure A.5: Bilingual networks: gatekeeper type (2).
Figure A.6: Bilingual networks: gatekeeper type (3).
Figure A.7: Bilingual networks: gatekeeper type (4).
Figure A.8: Bilingual networks: gatekeeper type (5).
Figure A.9: Bilingual networks: gatekeeper type (6).
Figure A.10: Bilingual networks: language bridge type (1).
Figure A.11: Bilingual networks: language bridge type (2).
Figure A.12: Bilingual networks: language bridge type (3).
Figure A.13: Bilingual networks: language bridge type (4).
Figure A.14: Bilingual networks: language bridge type (5).
Figure A.15: Bilingual networks: language bridge type (6).
Figure A.16: Bilingual networks: union type (1).
Figure A.17: Bilingual networks: union type (2).
Figure A.18: Bilingual networks: union type (3).
Figure A.19: Bilingual networks: union type (4).
Figure A.20: Bilingual networks: integration type (1).
Figure A.21: Bilingual networks: integration type (2).
Figure A.22: Bilingual networks: integration type (3).
Figure A.23: Bilingual networks: integration type (4).
Figure A.24: Bilingual networks: integration type (5).
Figure A.25: Bilingual networks: integration type (6).
Figure A.26: Bilingual networks: integration type (7).
Figure A.27: Bilingual networks: integration type (8).
Figure A.28: Bilingual networks: peripheral language type (1).
Figure A.29: Bilingual networks: peripheral language type (2).
Figure A.30: Bilingual networks: peripheral language type (3).
Figure A.31: Bilingual networks: peripheral language type (4).
Figure A.32: Bilingual networks: peripheral language type (5).
Figure A.33: Bilingual networks: peripheral language type (6).
Figure A.34: Small and monolingual networks (1).
Figure A.35: Small and monolingual networks (2).
Figure A.36: Small and monolingual networks (3).
Figure A.37: Small and monolingual networks (4).
Figure A.38: Small and monolingual networks (5).
Figure A.39: Small and monolingual networks (6).
Figure A.40: Small and monolingual networks (7).
Figure A.41: Small and monolingual networks (8).

Appendix B
International Themes in English Posts

In the English language set of posts written by the 92 egos, totaling 2151 posts, there are 227 posts in which a non-English speaking place or language is mentioned. This appendix presents in landscape layout the complete list of places and languages in the central column, with an extract of the textual context on the left side, and associated theme on the right side.

TEXTUAL CONTEXT | PLACE OR LANGUAGE | THEME
Paris calling for a meeting at... [...] | Paris | travel plans
can't wait to fly to #Barcelona | Barcelona | travel plans
Inspiration for a little Lyon break […] | Lyon | travel recommendation
If a Lille visit is on the agenda […] I'd highly recommend the LAM museum […] | Lille | travel recommendation
Now considering Martinique & Guadeloupe for my next holiday. […] | Martinique & Guadeloupe | travel plans
Oh dear! Explosion @ Moscow airport. […] | Moscow | International news
Is going to the Chinese for dinner […] | Chinese | gastronomy and restaurants
visited the 4th marketing-day in austria […] | austria | accomplished travel
REQUESTING THE #NKOTBCRUISE2012 ON THE MEDITERRANEAN SEA! […] | Mediterranean Sea | travel plans
[…] Hi back from germany ;o) | germany | accomplished travel
[…] Germany in 3 Days […] | Germany | travel plans
[…] Zimbabwe and Mugabe's Rule […] | Zimbabwe | International news
Afrikaans Video: giving fracking a drilling […] | Afrikaans | Language of resource
[...] It's 3AM here in France […] | France | location
[...] Venice, tomorrow 13:20? [...] | Venice | travel plans
Virtual #grocery shopping experience in Korea […] | Korea | tech internationalization
[…] Bye Germany | Germany | location
[...] you are at the Schokoladenmuseum. […] | Schokoladenmuseum | location
[…] we need you in Oberhausen! (: How long will you stay in Hamburg? | Oberhausen and Hamburg | location
[…] I have heard you speak german […] | german | Language
[…] stormy belgium | belgium | location
Belgium fries candidates to UNESCO patrimony! | Belgium | International news
Arab Revolution power […] | Arab Revolution | International news
Had to call my Pops so he could send me a couple of Mrs. Dash seasonings to Puerto Rico! | Puerto Rico | location
[…] Victor Manuelle is a Puerto Rican artist! […] | Puerto Rican | culture
[…] Have a good one and salute from Puerto Rico! […] | Puerto Rico | location
looking forward to eating china food […] | China | gastronomy and restaurants
We had so much fun today :D. Amazing audience in Küttigen!! [...] | Küttigen | location
Off to Geneva for a show at a fancy birthday party :-) | Geneva | travel plans
Tonight we're performing at Stadtcasino Frauenfeld. That's the theatre where […] | Stadtcasino Frauenfeld | location
Buskers Festival in Bern was fantastic! | Bern | event location
We have the honor of performing at one of Switzerland's most exclusive Hotels tonight […] | Switzerland | location
[…] at the annual Swiss Congress of Radiology […] | Swiss | event location
A special celebration video for the handball club Oberwil/BL (german) […] | german | Language of resource
This weekend we'll be traveling to Barcelona! Can't wait for another amazing European Yo-Yo Meeting. […] | Barcelona and European | travel plans
Tomorrow Saturday: Special midnight blacklight-performance at […] (Zürich Airport). […] | Zürich | event location
Sven Epiney, Swiss TV presenter […] | Swiss | culture
We're heading to Augsburg/Germany today. […] | Augsburg/Germany | travel plans
We are grateful about contacts during Live! Global Performing Arts Exchange Singapore […] | Singapore | event location
Tomorrow is our concert in Dresden […] | Dresden | event location
Today CD-Release in Germany […] | Germany | event location
[...] will be playing at Kuala Lumpur, Malysia, mid-septembre this year […] | Kuala Lumpur, Malysia | event location
Got home tonight from Berlin […] | Berlin | accomplished travel
The Festival Jazztage Dresden is over. We had great musicians here! […] | Dresden | event location
[…] this Scharansky is very dangerous gangster and the leader of terrible russian triads, be afraid broth. | Russian | opinion
[…] be afraid, Lev Scharansky is native of Brighton-beach Russian mafia, very, very dangerous gangster. | Russian | opinion
I am for communism. Swedish communism | Swedish | opinion
[…] Watch servants of the people in Russia […] | Russia | culture
[…] in Russia and Paris, for the people of those countries are so willing to be amused […] | Russia and Paris | opinion
Did you know that about 50 thousand people are killed from snake bites for a year? It is only in India | India | International news
The Ukrainian experience for Arab world by Lionel Beehner […] | Ukrainian and Arab World | International news
Me-e-e-t U-u-u-kraine - the champion of all antiratings... Now the 18 Countries Most Likely To Default […] | Ukraine | International news
If you want to become a part of Euro-2012 […] | Euro 2012 | sports
[…] Brazil... Here I come :D | Brazil | travel plans
[…] almost worth a trip from munich to berlin :-) | munich and berlin | travel plans
Today english garden, chinese tower, german beer! […] | chinese and german | gastronomy and restaurants
Forza Milan! We are the champions! […] | Milan | sports
Who's got the bigger melons? Size em up... Turkish Style […] | Turkish | gastronomy and restaurants
An Italian vibe. A moto and small tables. Thin crust pizzas […] | Italian | gastronomy and restaurants
Thinking of Sultans, Belly Dancing, Harem, Gypsy Music, Raki, Meze and Wine... the good old Ottoman Times... […] | Ottoman | culture
The best stop for Midye Dolma in Town. Part of my IST - Asia tour […] | Asia | travel recommendation
[…] see my latest thoughts on "Where To Stay" in Istanbul. […] | Istanbul | travel recommendation
[…] great Istanbul food. See more at […] | Istanbul | gastronomy and restaurants
Loving Menemen -- Turkish scrambled eggs... […] | Turkish | gastronomy and restaurants
[…] turkish eating party […] | Turkish | gastronomy and restaurants
Just finished church in Nigeria […] | Nigeria | culture
wave two fingers in the air... Nigerian Clubbing etiquette […] | Nigerian | culture
[…] currently in Lagos, Nigeria for weekend between volunteer teaching in Ghana. | Lagos, Nigeria and Ghana | travel plans
[…] organizing a month of teaching English in Ghana for June | Ghana | travel plans
Is happy that the german and physics tests are over ;)) | german | Language
Haifa was a lot of fun... […] | Haifa | location
So I have a ticket to Amsterdam... now I need to find some one who will come with me to Berlin […] | Amsterdam and Berlin | travel plans
And I think that's the only thing I really really hate about munich | Munich | opinion
[…] oktoberfest, omg so many drunk people... | oktoberfest | opinion
Aaaaahhh so many tourists in my beloved Munich | Munich | opinion
In the aftermath of #Norway attacks, piece by NYTimes on the rise of right-wing movements in Europe […] | Norway and Europe | International news
[…] I hope to see you dancing in Italy (Milan) soon! ;-* | Italy (Milan) | travel plans
[…] Best Wishes and a lot of love from Italy! | Italy | location
[…] But I live in Italy!! :-( | Italy | location
Gestix ERP software penetrating Angola even without local resellers... […] | Angola | tech internationalization
[…] We bet in low cost high Q business software - Portugal / Euro-Asia / USA | Portugal / Euro-Asia | tech internationalization
[…] Gestix Certified from EUR 150 lifetime license […] - Europe and USA-ready | Europe | tech internationalization
Doctors have been sentenced to 15 years in prison in #bahrain for treating protesters […] | Bahrain | International news
bahrain tonight should burn.. […] | Bahrain | opinion
First district training in K-town tomorrow, afterwards first individual training in Stuttgart! | Stuttgart | travel plans
I will start working on […] in Saarbrücken | Saarbrücken | travel plans
I had a great meeting with the board members from US Youth Soccer Europe...[…] | Europe | sports
The Horn of Africa: Chronicle of a famine foretold […] | The Horn of Africa | International news
BBC News - Japan pensioners volunteer to tackle nuclear crisis […] | Japan | International news
The Norway attacks: Manifesto of a murderer […] | Norway | International news
[…] there's no Such a thing as tourism in Jordan !! How's anything done in this country? | Jordan | opinion
Given the birthrate & z population figures in Egypt, I can't understand how sex is a taboo #justsaying | Egypt | opinion
next step Japan: Tokyo!! […] | Japan, Tokio | travel plans
My internet connection sucks in Chapala. | Chapala | location
[…] Intl. Conf. on Information, Process, and Knowledge Management in Valencia, Spain […] | Valencia, Spain | event location
[…] Austria adopts CC-BY as nation-wide default! | Austria | International news
In Haifa, Israel, supporting free, collaborative, and open knowledge at #wikimania 2011[…] | Haifa, Israel | event location
#icwsm2011 will take place next week in Barcelona […] | Barcelona | event location
Video of yesterday's presentation of #powerofopen at @eoi Madrid | Madrid | event location
At the presentation of @creativecommons book #thepowerofopen at @eoi Madrid | Madrid | event location
[…] can't wait to listen them at Japan :) […] | Japan | travel plans
[…] needs to tour with […] in Japan! pleaseeeeee!! | Japan | travel recommendation
[…] JAPAN IS ON #THEVERGE. HMV Tokyo w/ TFT on display […] | Japan, Tokio | opinion
Leopard Trek to lead tribute to Weylandt in Giro d'Italia [...] | Italia | sports
Mario Cipollini’s Milan-San Remo form guide [...] | Milan-San Remo | sports
A visit to Orbea premises and the Basque Country is always fun […] | Basque Country | travel recommendation
Tour of Qatar already history, Tour of Oman starting. | Qatar and Oman | travel plans
Shanghai, China (PVG) Atlanta (ATL) Jun 5, 2011; Tues/Sun westbound; Wed/Mon eastbound. | Shanghai, China | travel plans
Bahnunglück in China - Train accident in China [...] | Bahnunglück in China | International news
[…] I use it in China to redirect my website […] | China | tech internationalization
at Sapienza...[...] | Sapienza | location
all you can eat at thai-jap rest... [...] | Thai-jap | gastronomy and restaurants
Going to Roma by train... | Roma | travel plans
[…] for chileans, a touristic place in London is "The Clinic". | chileans | opinion
#4deagosto y #cacerolazo banging on a pot for better education in #Chile | Chile | location
#chile #students #4deagosto several pictures | Chile | event location
#4deagosto […] trending since the students in Chile are protesting to reform education inequality and cost […] | Chile | International news
[…] German Rugby Championship of the Universities!! | german | sports
[…] useless trivia: Weihenstephan is the eldest brewery in the world! Big Cheers from Munich! | Weihenstephan, Munich | culture
Italy wins vs France 22:21 so amazing | Italy, France | sports
[…] European Tech Tour Gala dinner on Wednesday night in berlin | European and berlin | event location
CROSS INNOVATION ACADEMY this Thursday in Bonn […] | Bonn | event location
Madvertise @ grow in munich | Munich | location
madvertise will celebrate its series A closing party on 29.04. in Berlin -- hope to see you there! […] | Berlin | event location
BBC News - Greece says debt talks to avert default 'productive' […] | Greece | International news
[…] to escape from miserable greek reality | greek | opinion
[…] are you sure of this piece of news coz Egypt can't take the consequences […] #Egypt#Jan25#Mubarak | Egypt | reaction to International news
Egypt we are your protectors and your builders and we will start from scratch […] | Egypt | opinion
[…] all i can think of is that after every rainfall must come a rainbow waiting for Egypt's rainbow […] | Egypt | opinion
Feeling 3m high just for being an Egyptian, Dear country I love you […] | Egyptian | opinion
Security Theater Lessons From Utøya […] | Utøya | International news
[…] is off to a great start building a reliable and professional taxi network in Athens! | Athens | location
[…] I just read it: "Can Greeks Become Germans?" […] | Greeks, Germans | opinion
Fake Apple store in China […] | China | International news
[…] Student from Sweden sent me […] | Sweden | location
Greece definitelly needs its stateleaks, too […] | Greece | opinion
Getting ready for another sunset in Seychelles... […] | Seychelles | location
I know Israel is an internationally known start-ups maker, I just love being reminded […] | Israel | culture
Microsoft Country Manager In Libya Detained By Authorities […] | Libya | International news
[…] You are always invited back to Israel. The summer here is amazing :) | Israel | travel recommendation
[…] You should come to Israel during the summer, you'll have a blast! | Israel | travel recommendation
Israeli band Orphaned Land rocks Turkey, despite discord […] | Israeli and Turkey | culture
My daughter is coming back home from Israel and I'm waiting by the gate (@ Amsterdam Airport Schiphol) […] | Israel and Amsterdam | location
New week (@ UPC Nederland w/ 2 others) […] | UPC Nederland | location
Liberation Day by the Papal State :: #welldone #italy #papal #carnival […] | Italy | culture
[…] Is this for real? (Hebrew) […] | Hebrew | Language of resource
Outbrain's weekend at Jerusalem :) […] | Jerusalem | location
[…] I didn't know there was one in Haifa […] #gdd11 | Haifa | event location
We will play a guest show on the upcoming Black Trolls Over Europe Tour. […] next week in Traun, Austria!... […] | Europe and Traun, Austria | travel plans
Tomorrow we will rock Vienna! | Vienna | travel plans
Tickets for our 5th Anniversary concert in Graz are now available![…] | Graz | event location
[…] For this special occasion we will play a show in Graz […] | Graz | event location
Franz Loechinger will be drumming for ILLUMINATA on Fr. 4.3. in Graz (Explosiv) […] | Graz | event location
Franz Loechinger will hit the drums on the Black Trolls Over Europe Tour […] | Europe | event location
New Review 9/10 Points (German) […] | german | Language of resource
Apple removed an app of the Palestine Third Intifada just like facebook, Israel is controlling Media? | Palestine and Israel | opinion
@adobe should organise an event for nord africa just like @google (gmaghreb) | Nord Africa | event location
Hope that everyone in Japan is fine... #prayforjapan | Japan | reaction to International news
Google china is a joke | China | tech internationalization
Cooking in french :p | french | Language
[…] You might be interested at this: Are Chinese moms better than Western moms? […] | Chinese | opinion
Are Chinese moms better than Western moms? […] | Chinese | opinion
Staying in Montreal, learning French simultaneously […] | Montreal and French | location, language
Madrid is much better choice :) RT […] Paris is well located to […] between Seattle and Beijing […] | Madrid, Paris, Beijing | opinion
Please consider coming to Spain, too. Thanks for an unforgettable time! | Spain | travel recommendation
[…] Demonstrations all over #spain since last sunday #15m for #real #democracy | Spain | International news
[…] National Researchers System (SNI in Spanish) […] | Spanish | Language
Bomb attack at Moscow airport […] | Moscow | International news
Archaic Denisovans (hominin group) contributed to modern Melanesians! […] | Melanesians | International news
short english translation (sorry for the bad english) for out international Fans in UK, Russia, Brazil […] | Russia and Brazil | location
The Dissociates coming to Germany for the "High Fünf tour" […] | Germany | event location
Come to see us and our british friends from The Dissociates in Aachen […] | Aachen | travel plans
[…] All the video material is from their European Tour with us last year! […] | European | event location
Aida conference in Pavia on social networks | Pavia | event location
Editing an article about Italian case-law on liability of ISPs | Italian | a country's policy
Look forward to experiencing new Italian opposition procedure | Italian | a country's policy
Miss World Tourism 2011, In Kefalonia […] | Kefalonia | event location
Summer Night in Argostoli (Kefalonia)... […] | Argostoli (Kefalonia) | location
Swimming in the winter sea of Lixouri (Kefalonia)... […] | Lixouri (Kefalonia) | location
#TribalDDB Lisbon manages 2 of the most engaging Facebook Pages in Portugal […] | Lisbon, Portugal | tech internationalization
[…] I´m anxious to buy the Tower of Belem, here in Lisbon :-) | Lisbon | location
Got a great time at hyperisland in Barcelona […] | Barcelona | location
Under Siege™, the portuguese RTS videogame for PS3 won today the first prize […] | Portuguese | tech internationalization
Portuguese RTS game for PS3 gets an incredible cinematic trailer […] | Portuguese | tech internationalization
Portugal Gives Itself a Clean-Energy Makeover | Portugal | International news
@phillord @dullhunk I bet that's his name in greek | greek | Language
@timoreilly […] basques also :-) […] | basques | tech internationalization
We want to translate Twitter to Basque,support us! […] | Basque | tech internationalization
Tu BeAv - The jewish holyday of Love in Israel | Israel | culture
Travel in Eilat is over. The Red Sea is amazing and the water is so clear. but Tel Aviv weather […] much better. | Eilat, Red Sea, Tel Aviv | travel plans, opinion
Happy Israel Independence Day! Fireworks in the sky tonight! | Israel | culture
Good dim sums in Brussels? Does it even exist? | Brussels | opinion
in brussels ... no diving sites :( | Brussels | location
Arrived in Melaka in Malaysia, but everything is closed early tonight. Will check Chinese shopping tomorrow :) | Melaka, Malaysia, Chinese | location
Rain! And a spicy Chinese nuddles […] | Chinese | gastronomy and restaurants
[…]...but just vocabulary). I have one but in Polish and I want something like this in English […] | Polish | Language
[…] I'm looking for some computer program to learn German vocabulary […] | german | Language
preparing to dance my ass off for haiti!!!! | haiti | reaction to International news
tomorrow at mosaiic bar frankfurt... | Frankfurt | travel plans
[…]..finished munich... off to langenselbold with […] | munich, langenselbold | travel plans
tonight at mosaiic bar frankfurt...[…] | Frankfurt | location
back from berlin... tired now | Berlin | accomplished travel
getting my hair cut, then flying to berlin... | Berlin | travel plans
berlin is calling and i am following.everybody from berlin meet me at […] | Berlin | travel plans
[…] i love busted!!![…]....Spain are with you!!! ;))) | Spain | opinion
Brazil: Death of Forest Defender Couple is a Shame to the Country […] | Brazil | International news
[…] Greek and Spanish young people occupying Trafalgar square #europeanrevolution #ukrevolution #London | Greek, Spanish, european | International news
Tim Hetherington is killed in #Misrata! […] #Libya | Misrata, Libya | International news
[…] Stay in Japan to work for this f**king company? NO WAY!! | Japan | location
Won won won!!! Japan has become the Queen!!! | Japan | sports
Sorting algorithms demonstrated with Hungarian folk dance […] | Hungarian | culture/humor
Which countries match the GDP of America's states? […] California is Italy! But... Italy has 20M more people... | Italy | International news
beautiful night view of Italy taken from International Space Station #ISS […] | Italy | International news
Why Young Italians Are Leaving […] | Italians | International news
[…] China's fake Apple stores […] | China | International news
World Cup joy for Japan […] | Japan | sports
Watching on ITV1 #England vs #Switzerland […] | Switzerland | sports
Hakuho Mongolia's best paid sports star - Yahoo! Eurosport […] | Mongolia | sports
[…] Chinese official media […] quoted ur opinion about Obama from twitter (a website blocked in China) !!! […] #molk | Chinese, China | tech internationalization
[…] U ARE FAMOUS IN CHINESE TWITTER (WEIBO) […] | Chinese | tech internationalization
BUT this thought is based on the report I read in the CHINA. […] | China | reaction to International news
Tomorrow will be my first spanish test […] | Spanish | Language
Bombing in Oslo and shooting at Utøya! […] | Oslo and Utøya | International news
Japan won !! […] #worldcupfinal | Japan | sports
Netflix Brasil blog […] | Brasil | tech internationalization
Watching a 1965 BW Polish film set in Spain... […] | Polish, Spain | movies
Listening to Spanish football games […] | Spanish | sports

Bibliography

[1] Workshop on novelty and diversity in recommender systems - DiveRS 2011. In Pablo Castells, Jun Wang, Rubén Lara, and Dell Zhang, editors, Proceedings of the fifth ACM conference on Recommender systems - RecSys ’11, pages 393–394, New York, New York, USA, October 2011. ACM Press. 1.2.1
[2] A Toolkit for Transnational Communication in Europe. In J. Normann Jørgensen, editor, The Copenhagen Studies in Bilingualism Vol. 64, 2011. 2.1
[3] Proceedings of the 3rd Workshop on the Multilingual Semantic Web (MSW3). In Paul Buitelaar, Philipp Cimiano, David Lewis, James Pustejovsky, and Felix Sasaki, editors, International Semantic Web Conference, volume 936, Boston, 2012. CEUR. URL http://ceur-ws.org/Vol-936/. Last accessed Oct 30, 2013. 1.2.1
[4] Meshary AlMeshary and Abdolreza Abhari. A recommendation system for Twitter users in the same neighborhood. In Proceedings of the 16th Communications & Networking Symposium, pages 1–5, San Diego, California, April 2013. Society for Computer Simulation International. 8.2.1
[5] Jannis Androutsopoulos. Language Choice and Code Switching in German-Based Diasporic Web Forums. In Brenda Danet and Susan Herring, editors, The Multilingual Internet: Language, Culture, and Communication Online, chapter 15. Oxford University Press, New York, 2007. 2.5, 2.5, 3.1, 8.1
[6] Jannis Androutsopoulos. Localizing the Global on the Participatory Web. In Nikolas Coupland, editor, The Handbook of Language and Globalization, chapter 9, pages 203–231. Wiley-Blackwell, Malden, MA, 2010. 1.2.2, 2.4, 2.4.1, 2.5, 4.6, 8.2.2
[7] Albert-Laszlo Barabasi. Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life. Plume, 2003. 2.3
[8] Bettina Berendt and Anett Kralisch.
A user-centric approach to identifying best deployment strategies for language tools: the impact of content and access language on Web user behaviour and attitudes. Information Retrieval, 12(3):380–399, January 2009. 1, 2.4, 2.5, 3.1
[9] Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. Language identification for creating language-specific Twitter collections. In LSM ’12 Proceedings of the Second Workshop on Language in Social Media, pages 65–74. Association for Computational Linguistics, June 2012. 3.3, 4.7
[10] Carter T. Butts. Social network analysis: A methodological introduction. Asian Journal Of Social Psychology, 11(1):13–41, March 2008. 5
[11] Louis-Jean Calvet. Towards an Ecology of World Languages. Polity Press, Cambridge, 2006. 2.2
[12] Mónica Stella Cárdenas-Claros and Neny Isharyanti. Code switching and code mixing in Internet chatting: between yes, ya, and si a case study. The Journal of the JALT CALL SIG, 5(3):67–78, 2009. 3.1
[13] Manuel Castells. Communication, Power and Counter-power in the Network Society. International Journal of Communication, 1:238–266, 2007. 1, 2.2
[14] Vint Cerf. The Internet is for Everyone, 1999. URL http://www.internetsociety.org/internet-everyone. Last accessed Oct 30, 2013. 1.2
[15] Hsia-Ching Chang. A new perspective on Twitter hashtag use: Diffusion of innovation theory. Proceedings of the American Society for Information Science and Technology, 47(1):1–4, November 2010. 7.4.2
[16] Alok Choudhary, William Hendrix, Kathy Lee, Diana Palsetia, and Wei-Keng Liao. Social media evolution of the Egyptian revolution. Communications of the ACM, 55(5):74–80, May 2012. 1.2.2, 7.4.2
[17] Juliet Corbin and Anselm Strauss. Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. SAGE Publications, Inc, 3rd edition, 2007. 5
[18] Nikolas Coupland. Introduction: Sociolinguistics in the Global Era. In Nikolas Coupland, editor, The Handbook of Language and Globalization, chapter 0, pages 1–27. Wiley-Blackwell, Malden, MA, 2010. 1.2.2, 2.5
[19] Angela Creese and Peter Martin. Introduction to Volume 9: Ecology of Language. In Angela Creese, Peter Martin, and Nancy H. Hornberger, editors, Ecology of Language - Encyclopedia of Language and Education Volume 9, pages i–vi. Springer, 2nd edition, 2008. 2.2
[20] David Crystal. English as a Global Language. Cambridge University Press, 2nd edition, 2003. 2.4
[21] Daniel Cunliffe, Delyth Morris, and Cynog Prys. Investigating the Differential Use of Welsh in Young Speakers’ Social Networks: A Comparison of Communication in Face-to-Face Settings, in Electronic Texts and on Social Networking Sites. In Elin Haf Gruffydd Jones and Enrique Uribe-Jongbloed, editors, Social Media and Minority Languages: Convergence and the Creative Industries, pages 75–86. Multilingual Matters, Bristol, Buffalo, Toronto, 2013. 3.1
[22] danah Boyd and Kate Crawford. Six Provocations for Big Data. In A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society. SSRN Electronic Journal, September 2011. URL http://papers.ssrn.com/abstract=1926431. Last accessed Oct 30, 2013. 4.9
[23] Brenda Danet and Susan Herring. Introduction: Welcome to the Multilingual Internet. In Brenda Danet and Susan Herring, editors, The Multilingual Internet: Language, Culture, and Communication Online, chapter 1. Oxford University Press, New York, 2007. 2.4, 4.7
[24] Stephen Dann. Twitter content classification. First Monday, 15(12), November 2010. URL http://firstmonday.org/ojs/index.php/fm/article/view/2745/2681. Last accessed Oct 30, 2013. 7.4
[25] Abram De Swaan. The Evolving European Language System: A Theory of Communication Potential and Language Competition. International Political Science Review, 14(3):241–255, January 1993. 2.1, 2.5, 5.4
[26] Abram De Swaan. The Emergent World Language System: An Introduction. International Political Science Review, 14(3):219–226, January 1993. 2.1
[27] Abram De Swaan. Language Systems. In Nikolas Coupland, editor, The Handbook of Language and Globalization, chapter 2, pages 56–76. Wiley-Blackwell, Malden, MA, 2010. 2.1
[28] Murat Demirbas, Murat Ali Bayir, Cuneyt Gurcan Akcora, Yavuz Selim Yilmaz, and Hakan Ferhatosmanoglu. Crowd-sourced sensing and collaboration using twitter. In 2010 IEEE International Symposium on “A World of Wireless, Mobile and Multimedia Networks” (WoWMoM), pages 1–9. IEEE, June 2010. 1.1
[29] Jay L. Devore. Probability and Statistics for Engineering and the Sciences. Thomson Brooks/Cole, Belmont, CA, 7th edition, 2008. 6.2, 6.2
[30] Danny Dor. From Englishization to Imposed Multilingualism: Globalization, the Internet, and the Political Economy of the Linguistic Code. Public Culture, 16(1):97–118, 2004. 1, 2.2, 2.4
[31] Mercedes Durham. Language Choice on a Swiss Mailing List. In Brenda Danet and Susan Herring, editors, The Multilingual Internet: Language, Culture, and Communication Online, chapter 14. Oxford University Press, New York, 2007. 3.1, 7.2, 8.1
[32] Bruce Etling, John Kelly, Robert Faris, and John Palfrey. Mapping the Arabic blogosphere: politics and dissent online. New Media & Society, 12(8):1225–1243, December 2010. 1.2.2, 3.2, 5.4, 8.1
[33] Madelyn Flammia and Carol Saunders. Language as power on the Internet. Journal of the American Society for Information Science and Technology, 58(12):1899–1903, October 2007. 2.4
[34] C. Fuchs. The Role of Income Inequality in a Multivariate Cross-National Analysis of the Digital Divide. Social Science Computer Review, 27(1):41–58, April 2008. 1.2, 2.4.2
[35] Gephi.org. Gephi Tutorial Layouts - Gephi.org, 2011. URL http://gephi.org/tutorials/gephi-tutorial-layouts.pdf. Last accessed Oct 30, 2013. 5
[36] Jean D. Gibbons. Nonparametric Statistics: An Introduction (Quantitative Applications in the Social Sciences). SAGE Publications, Inc, 1993. 7.2
[37] Global Voices. About Global Voices, 2007. URL http://globalvoicesonline.org/about/. Last accessed Oct 28, 2013. 1.2.2
[38] Jennifer Golbeck. Analyzing the Social Web. Morgan Kaufmann, 2013. 2.3.1, 2.3.1, 2.3.1, 5, 5.4
[39] David Graddol. English Next. Technical report, British Council, 2006. URL http://www.britishcouncil.org/learning-research-englishnext.htm. Last accessed Oct 30, 2013. 2.4
[40] Mark Graham, Scott A. Hale, and Devin Gaffney. Where in the World are You? Geolocation and Language Identification in Twitter. The Professional Geographer, 2013. 3.3, 4.7
[41] Mark Granovetter. The Strength of Weak Ties. American Journal of Sociology, 78(6):1360–1380, 1973. 2.3.1, 2.3.1
[42] Mark Granovetter. The Strength of Weak Ties: A Network Theory Revisited. Sociological Theory, 1(1983):201–233, 1983. 2.3.1
[43] Jeffrey Graves. Python Language Detector, 2012. URL https://github.com/decultured/Python-Language-Detector/blob/master/README.md. Last accessed Oct 4, 2013. 4.3.1
[44] Alexander Halavais. National Borders on the World Wide Web. New Media & Society, 2(1):7–28, March 2000. 1, 1.2.1
[45] Scott Hale. Translating Twitter, 2011. URL http://www.scotthale.net/blog/?p=152. Last accessed Oct 30, 2013. 1.2.2
[46] Scott Hale. Online language bubbles: the last frontier?, 2012. URL http://freespeechdebate.com/en/discuss/online-language-bubbles-the-last-frontier/. Last accessed Oct 23, 2013. 1.2.1, 5.4
[47] Scott A. Hale. Net Increase? Cross-Lingual Linking in the Blogosphere. Journal of Computer-Mediated Communication, 17(2):135–151, January 2012. 1, 1.2.1, 3.2, 7.4.1, 8.1
[48] Einar Haugen. The Ecology of Language. In Anwar S Dil, editor, Essays by Einar Haugen. Stanford University Press, Stanford, CA, 1972. 2.2
[49] Brent Hecht and Darren Gergle. The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context. In Proceedings of the 28th international conference on Human factors in computing systems - CHI ’10, pages 291–300, New York, New York, USA, April 2010. ACM Press. 1.2.1, 2.4.1
[50] Amir Helzer. Localizing for software, websites and global apps. Multilingual, 22(3):34–37, 2011. 1.2.1, 1.2.3
[51] Alfred Hermida. From TV to Twitter: How Ambient News Became Ambient Journalism. Media/Culture Journal, 13(2), 2010. URL http://ssrn.com/paper=1732603. Last accessed Oct 30, 2013. 1.1
[52] Susan Herring. Web Content Analysis: Expanding the Paradigm. In Jeremy Hunsinger, Lisbeth Klastrup, and Matthew Allen, editors, International Handbook of Internet Research, chapter 11, pages 233–249. Springer Verlag, Berlin, 2010. 1.4, 4
[53] Susan Herring, John Paolillo, Irene Ramos-Vielba, Inna Kouper, Elijah Wright, Sharon Stoerger, Lois Scheidt, and Benjamin Clark. Language Networks on LiveJournal. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS’07), pages 79–90. IEEE Computer Society, January 2007. 1, 1.2.1, 2.2, 3.2, 4, 5.4, 7.4.1, 8.1
[54] Courtenay Honeycutt and Susan C. Herring. Beyond Microblogging: Conversation and Collaboration via Twitter. In Proceedings of the 42nd Annual Hawaii International Conference on System Sciences (HICSS’09), pages 1–10. IEEE Computer Society, December 2009. 7, 7.2
[55] Lichan Hong, Gregorio Convertino, and Ed Chi. Language Matters in Twitter: A Large Scale Study. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, volume 91, pages 518–521. AAAI Publications, 2011. 1, 2.5, 3.3, 3.3, 5.4, 7.3, 8.1
[56] Nancy H. Hornberger. Multilingual language policies and the continua of biliteracy: An ecological approach. Language Policy, 1(1):27–51, March 2002. 2.2, 2.5
[57] Jeff Huang, Katherine M. Thornton, and Efthimis N. Efthimiadis. Conversational Tagging in Twitter.
In Proceedings of the 21st ACM conference on Hypertext and hypermedia - HT ’10, pages 173–178, New York, New York, USA, June 2010. ACM Press. 7.4.2 204 [58] International Telecommunication Union. ITU Measuring the Information Society. Technical report, Geneva, 2011. URL http://www.itu.int/ITU- D/ict/publications/idi/. Last accessed Oct 30, 2013. 1 [59] Internet Society. Who We Are. URL http://www.internetsociety.org/ who-we-are. Last accessed Oct 28, 2013. 1.2 [60] Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis - WebKDD/SNA-KDD ’07, pages 56–65, New York, New York, USA, August 2007. ACM Press. 2.5, 4.2 [61] Ian Johnson. Audience Design and Communication Accommodation Theory: Use of Twitter by Welsh-English Biliterates. In Elin Haf Gruffydd Jones and Enrique Uribe-Jongbloed, editors, Social Media and Minority Languages: Convergence and the Creative Industries, chapter 6, pages 99–118. Multilingual Matters, Bristol, Buffalo, Toronto, 2013. 2.5, 3.1, 7.3 [62] Aravind K. Joshi. Processing of sentences with intra-sentential code-switching. In Proceedings of the 9th conference on Computational linguistics -, volume 1, pages 145–150, Morristown, NJ, USA, July 1982. Association for Computa- tional Linguistics. 2.5, 7.4.2 [63] M Kaiser, M Go¨rner, and C C Hilgetag. Criticality of spreading dynamics in hierarchical cluster networks without inhibition. New Journal of Physics, 9 (5):110–110, May 2007. 2.2 [64] Krishna Yeshwanth Kamath and James Caverlee. Transient crowd discovery on the real-time social web. In Proceedings of the fourth ACM international conference on Web search and data mining - WSDM ’11, pages 585–594, New York, New York, USA, February 2011. ACM Press. 4.6 [65] Helen Kelly Holmes. 
An Analysis of the Language Repertoires of Students in Higher Education and their Language Choices on the Internet. International Journal of Multicultural Societies, 6(1):52–75, 2004. 2.5, 3.1, 7.2, 8.1 [66] Farshad Kooti, Haeryun Yang, Meeyoung Cha, Krishna Gummadi, and Winter Mason. The Emergence of Conventions in Online Social Networks. In Inter- national AAAI Conference on Weblogs and Social Media, 2012. URL http: //www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4661. Last accessed Oct 30, 2013. 2.5, 7.1, 7.4.2 [67] A. Kralisch and B. Berendt. Language-sensitive search behaviour and the role of domain knowledge. New Review of Hypermedia and Multimedia, 11(2): 221–246, December 2005. 1, 3.1 205 [68] A. Kralisch and T. Mandl. Barriers to Information Access across Languages on the Internet: Network and Language Effects. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS’06), volume 3, page 54b. IEEE Computer Society, January 2006. 1 [69] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twit- ter, a social network or a news media? In Proceedings of the 19th international conference on World wide web - WWW ’10, pages 591–600, New York, New York, USA, April 2010. ACM Press. 1.1, 1.1 [70] David Laniado and Peter Mika. Making Sense of Twitter. In Peter F. Patel- Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian Horrocks, and Birte Glimm, editors, The Semantic Web ISWC 2010, volume 6496 of Lecture Notes in Computer Science, pages 470–485. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. 7.4.2 [71] Nade`ge Lechevrel. L’e´colinguistique : une discipline e´mergente? Revue des e´tudiants en linguistique du Que´bec - Quebec Student Journal of Linguistics, 3(1):18–38, 2008. 2.2 [72] Julie Letierce, Alexandre Passant, John Breslin, and Stefan Decker. Under- standing how Twitter is used to spread scientific messages. 
In Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, Raleigh, NC: US, 2010. URL http://journal.webscience.org/314/. Last accessed Oct 30, 2013. 7.4.2 [73] David Lewis, Stephen Curran, Gavin Doherty, Kevin Feeney, Nikiforos Kara- manis, Saturnino Luz, and John McAuley. Supporting Flexibility and Aware- ness in Localisation Workflows. The International Journal of Localisation, 8 (1):29–38, 2009. 2.4 [74] Literature Across Frontiers. Publishing Translations in Europe. Trends 1990- 2005. Technical report, Mercator Institute for Media, Languages and Culture, 2010. 2.3.2 [75] Gilad Lotan. #OccupyWallStreet: origin and spread visualized — So- cialFlow blog, 2011. URL http://blog.socialflow.com/post/7120244404/ occupywallstreet-origin-and-spread-visualized. Last accessed Oct 30, 2013. 1.2.2 [76] Gilad Lotan. Data Reveals That Occupying Twitter Trending Topics is Harder Than it Looks!, 2011. URL http://blog.socialflow.com/post/ 7120244374/data-reveals-that-occupying-twitter-trending-topics- is-harder-than-it-looks. Last accessed Oct 30, 2013. 1.2.2 [77] Gilad Lotan, Erhardt Graeff, Mike Ananny, Devin Gaffney, Ian Pearce, and danah Boyd. The Revolutions Were Tweeted: Information Flows during the 206 2011 Tunisian and Egyptian Revolutions. International Journal of Commu- nication, 5:1375–1405, 2011. 1, 1.1, 1.2.2 [78] Safari Mafu. From the Oral Tradition to the Information Era: The Case of Tanzania. International Journal of Multicultural Societies, 6(1):99–124, 2004. 1 [79] Christopher Manning. Logistic Regression (with R), 2007. URL http: //nlp.stanford.edu/~manning/courses/ling289/logistic.pdf. Last ac- cessed Oct 8, 2013. 6.2 [80] Cameron A. Marlow. The Structural Determinants of Media Contagion. Ph.d., Massachusetts Institute of Technology, 2005. 1.2.1, 2.3.1, 2.3.1, 5.1 [81] A. E. Marwick and d. Boyd. I tweet honestly, I tweet passionately: Twitter users, context collapse, and the imagined audience. New Media & Society, 13 (1):114–133, July 2010. 
2.5, 3.1, 6.4, 7 [82] Cheryl Metoyer-Duran. Gatekeepers in Ethnolinguistic Communities. Infor- mation Management, Policy and Services. Ablex Publishing Corporation, Nor- wood, New Jersey, 1993. 2.3.1, 2.3.2, 2.5, 5.4 [83] Delia Mocanu, Andrea Baronchelli, Nicola Perra, Bruno Gonc¸alves, Qian Zhang, and Alessandro Vespignani. The Twitter of Babel: mapping world languages through microblogging platforms. PloS one, 8(4):e61981, January 2013. URL http://dx.plos.org/10.1371/journal.pone.0061981. Last ac- cessed Oct 30, 2013. 2.3.2, 2.3, 2.5, 3.2, 3.3, 3.1, 3.3, 4.2, 4.7, 5.4, 8.1 [84] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, January 2007. 4.4.2 [85] Ory Okolloh. Ushahidi, or ’testimony’: Web 2.0 tools for crowdsourcing crisis information. Participatory Learning and Action, 59(1):65–70, 2009. 1 [86] Eli Pariser. The Filter Bubble: What the Internet is Hiding from You. The Penguin Press, New York, 2011. 1.2.1 [87] Carol Peters, Martin Braschler, and Paul Clough. Multilingual Information Retrieval: From Research To Practice. Springer, 2012. 1.2.1 [88] Isabella Peters. Folksonomies. Indexing and Retrieval in Web 2.0. Knowledge and Information. De Gruyter, Berlin, 2009. 2.4.1 [89] Daniel Pimienta, Daniel Prado, and A´lvaro Blanco. Twelve years of measur- ing linguistic diversity in the Internet: balance and perspectives — UNESCO publications for the World Summit on the Information Society. Technical re- port, United Nations Educational, Scientific and Cultural Organization, Paris, 2009. 1, 2.4 207 [90] Barbara Poblete, Ruth Garcia, Marcelo Mendoza, and Alejandro Jaimes. Do all birds tweet the same?: characterizing Twitter around the world. In Pro- ceedings of the 20th ACM international conference on Information and knowl- edge management - CIKM ’11, pages 1025–1030, New York, New York, USA, October 2011. ACM Press. 2.5, 3.2, 4.2 [91] James E. Prieger. 
The broadband digital divide and the economic benefits of mobile broadband for rural areas. Telecommunications Policy, 37(6):483–502, 2013. 1.2, 2.4.2 [92] Pei-Luen Patrick Rau, Tom Plocher, and Yee-Yin Choong. Cross-Cultural Design for IT Products and Services. CRC Press, 2012. 1.2.1 [93] Dana Rotman, Jennifer Preece, Yurong He, and Allison Druin. Extreme ethnography. In Proceedings of the 2012 iConference, pages 207–214, New York, New York, USA, February 2012. ACM Press. 4.6 [94] C.E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27(3):379–423, 1948. 6.1, 6.1 [95] Katie Shilton, Jes A. Koepfler, and Kenneth R. Fleischmann. How to See Values in Social Computing: Methods for Studying Values Dimensions. In (To appear in) Proceedings of the ACM 2014 conference on Computer Supported Cooperative Work - CSCW ’14. ACM Press, 2014. 1.2.3 [96] David Sims. Understanding place and space in a digital Babel. The nuances of location language, 2012. URL http://radar.oreilly.com/2012/03/ location-unstructured-non-english-health-outbreak.html. Last ac- cessed Oct 30, 2013. 8.2.1 [97] Richard L. Sites. Language Technology Ecosystem, 2011. URL http://www. hltd.org/alex.pdf. Last accessed Oct 30, 2013. 4.3.1 [98] Kate Starbird and Leysia Palen. “voluntweeters”: Self-organizing by digital volunteers in times of crisis. In Proceedings of the 2011 annual conference on Human factors in computing systems - CHI ’11, pages 1071–1080, New York, New York, USA, May . ACM Press. 1, 1.2.2 [99] Yuri Takhteyev, Anatoliy Gruzd, and Barry Wellman. Geography of Twitter networks. Social Networks, 34(1):73–81, 2012. 3.2, 4.2, 4.3 [100] Steven L. Thorne, Rebecca W. Black, and Julie M. Sykes. Second Language Use, Socialization, and Learning in Internet Interest Communities and Online Gaming. The Modern Language Journal, 93:802–821, December 2009. 1, 2.4 [101] Twitter Help Center. Age screening on Twitter. URL https://support. 
twitter.com/articles/20169945-age-screening-on-twitter. Last ac- cessed Oct 7, 2013. 4.9 208 [102] Claire Ulrich. Technological Developments for African Languages. Multilin- gual, 21(5):51–53, 2010. 1, 2.4 [103] UNESCO. Recommendation Concerning the Promotion and Use of Mul- tilingualism and Universal Access to Cyberspace, 2003. URL http: //www.unesco.org/new/en/communication-and-information/about- us/how-we-work/strategy-and-programme/promotion-and-use-of- multilingualism-and-universal-access-to-cyberspace/. Last accessed Oct 30, 2013. 1.2, 2.4 [104] Federico Vazquez, Xavier Castello´, and Maxi San Miguel. Agent based models of language competition: macroscopic descriptions and order-disorder tran- sitions. Journal of Statistical Mechanics: Theory and Experiment, 2010 (04):P04007, 2010. URL http://iopscience.iop.org/1742-5468/2010/ 04/P04007/. Last accessed Oct 30, 2013. 2.3.2 [105] Sarah Vieweg, Amanda L. Hughes, Kate Starbird, and Leysia Palen. Mi- croblogging during two natural hazards events. In Proceedings of the 28th international conference on Human factors in computing systems - CHI ’10, pages 1079–1088, New York, New York, USA, April 2010. ACM Press. 1.1 [106] Jessica Vitak, Cliff Lampe, Rebecca Gray, and Nicole B. Ellison. “Why won’t you be my Facebook friend?”. In Proceedings of the 2012 iConference, pages 555–557, New York, New York, USA, February 2012. ACM Press. 3.1 [107] Barney Warf. Geographies of global Internet censorship. GeoJournal, 76(1): 1–23, November 2010. 1.2, 2.4.2 [108] Mark Warschauer, Ghada El Said, and Ayman Zohry. Language Choice On- line: Globalization and Identity in Egypt. In Brenda Danet and Susan Herring, editors, The Multilingual Internet: Language, Culture, and Communication Online, chapter 13. Oxford University Press, New York, 2007. 2.4, 2.5, 3.1, 5.4 [109] Duncan J. Watts. The “New” Science of Networks. Annual Review of Sociol- ogy, 30:243–270, 2004. 2.3 [110] Wouter Weerkamp, Simon Carter, and Manos Tsagkias. 
How People use Twitter in Different Languages. In WebSci Conference 2011, Koblenz, Ger- many, June 2011. URL http://journal.webscience.org/539/2/Table1. png. Last accessed Oct 30, 2013. 3.3, 7.2, 7.3, 7.4.2 [111] Li Wei. The Bilingualism Reader, volume 24. Routledge, London, July 2000. 2.5 [112] Howard T. Welser, Eric Gleave, Danyel Fisher, and Marc Smith. Visualizing the Signatures of Social Roles in Online Discussion Groups. JoSS: The Journal of Social Structure, 8(2):1–31, 2007. 5.2 209 [113] George Weyman. Translating Tweets from the Arab Spring: Towards a Trans- lation Workbench for Twitter, 2012. URL http://meedan.org/2012/03/ translation-twitter-middle-east-arabic/. Last accessed Oct 30, 2013. 8.2.1 [114] Leo Widrich. How Twitter evolved from 2006 to 2011, 2011. URL http: //blog.bufferapp.com/how-twitter-evolved-from-2006-to-2011. Last accessed Oct 28, 2013. 1.1 [115] World Summit of the Information Society. Building the Information Society: a global challenge in the new Millennium. Declaration of Principles, 2003. URL http://www.itu.int/wsis/basic/about.html. Last accessed Oct 30, 2013. 1.2 [116] Sue Wright. Multilingualism on the Internet - Thematic introduction. Inter- national Journal of Multicultural Societies, 6(1):5–13, 2004. 1 [117] John Yunker. Beyond Borders: Web Globalization Strategies. New Riders, 2002. 1.2.1, 1.2.3 [118] John Yunker. Inside Google’s language detection tool - Global by Design, 2010. URL http://www.globalbydesign.com/blog/2010/12/06/inside- googles-language-detection-tool/. Last accessed Oct 30, 2013. 4.3.1 [119] Ethan Zuckerman. CHI keynote: Desperately Seeking Serendipity, 2011. URL http://www.ethanzuckerman.com/blog/2011/05/12/chi- keynote-desperately-seeking-serendipity/. Last accessed Oct 28, 2013. 1.2, 2.4.1, 5.4, 8.1 210