ABSTRACT

Title of Dissertation: WHO, WHAT, WHEN, WHERE, AND WHY? QUANTIFYING AND UNDERSTANDING BIOMEDICAL DATA REUSE

Lisa Federer, Doctor of Philosophy, 2019

Dissertation directed by: Dr. Katie Shilton, Associate Professor & Doctoral Program Director, College of Information Studies

Since the mid-2000s, new data sharing mandates have led to an increase in the amount of research data available for reuse. Reuse of data benefits the scientific community and the public by potentially speeding scientific discovery and increasing the return on investment of publicly funded research. However, despite the potential benefits of reuse and the increasing availability of data, research on the impact of data reuse is so far sparse. This dissertation provides a deeper understanding of the impacts of shared biomedical research data by exploring who is reusing data and for what purpose.

Specifically, this dissertation examines use requests and dataset descriptions from three biomedical repositories that require potential requestors to submit descriptions of their planned reuse. Content analysis of use requests yields insight into who is requesting data and the methods and topics of their planned reuse. Comparing use requests to the descriptions of the original datasets provides insight into the breadth of impact of data reuse, and text mining of the original dataset descriptions helps determine the topics of datasets that are highly reused.

This study demonstrates that patterns of reuse differ between dataset types, with genomic datasets used more frequently together in meta-analyses for topics that diverge from the original purpose of collection, while clinical datasets are used more often on their own within a context that is similar to the reason for which they were collected. While requestors do come from a range of career stages from around the world, they are not evenly distributed; most requests come from English-speaking countries, especially the United States.
This study also finds that datasets that receive the most requests soon after release continue to be requested more often, and that datasets covering common diseases are requested more than datasets on rare diseases. These findings have implications for several stakeholders, including funders and institutions developing policies to reward and incentivize data sharing, researchers who share data and those who reuse it, and repositories and data curators who must make choices about which datasets to curate and preserve.

WHO, WHAT, WHEN, WHERE, AND WHY? QUANTIFYING AND UNDERSTANDING BIOMEDICAL DATA REUSE

by

Lisa Federer

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2019

Advisory Committee:
Professor Katie Shilton, Chair
Professor Lindley Darden, Dean’s Representative
Professor Beth St. Jean
Professor Yla Tausczik
Professor Susan Winter

© Copyright by Lisa Federer 2019

Acknowledgements

This dissertation is the culmination of intense research conducted over a very short period of time, and it was only possible with the help and support of a number of people whom I gratefully acknowledge here.

First, I thank my advisor, Dr. Katie Shilton, who has been tremendously supportive throughout this process. Some advisors might have tried to talk me out of my very ambitious dissertation timeline, but she was always encouraging and had incredibly helpful advice. I am likewise grateful for my committee members, Dr. Lindley Darden, Dr. Beth St. Jean, Dr. Yla Tausczik, and Dr. Susan Winter, who have given excellent feedback that has helped shape this dissertation and who were generous with their time in working with my tight schedule. I also thank Dr. Andrea Wiggins, who was my advisor when I started in the doctoral program but left for a new position after my first year.
The work I did with her during that time was foundational to the research presented here, and I greatly appreciate her insights and direction.

I am very grateful to Dr. Mike Feolo from dbGaP, Sean Coady from NHLBI’s BioLINCC repository, and Sharon Lawlor from the NIDDK Central Repository, who were crucial in assisting me with obtaining the use requests and repository information that were the foundation of this research. I also appreciate Jim Mork’s guidance in preparing my data for batch processing by the Medical Text Indexer, and I thank Dr. Maryam Zaringhalam and Franklin Sayre for serving as additional coders to validate my qualitative coding methodology.

Undertaking a doctoral degree is an arduous task on its own; doing so while also employed in a demanding full-time position presents an additional set of challenges. I am fortunate to have had the enthusiastic support of leadership in both of the libraries where I have held positions while working on this degree. Dr. Keith Cogdill encouraged me when I first began considering pursuing the degree, and I appreciate his support throughout the process, as well as the support of all my former colleagues at the NIH Library. Since moving to the National Library of Medicine, I have been grateful for the encouragement and mentorship of Dr. Mike Huerta and Dr. Patti Brennan, as well as of the many colleagues at the NLM and the NIH who have provided helpful input and cheered me on throughout the process, especially Dianne Babski and Anna Ripple. I am also immensely appreciative of the financial support from the NIH Library and NLM that has made it possible for me to complete this degree.

I gratefully acknowledge the many friends and colleagues who have provided advice and feedback on my dissertation research, particularly Dr. Ben Busby, Dr. Maryam Zaringhalam, and the members of the Ethics and Values in Design lab.
I also thank my non-academic friends who have been willing to listen to me ramble at length about data and have been enthusiastic cheerleaders all along the way, especially Ali Sabzevari, Susie Nguyen, Monica Waterston, Matt Woodrum, and Rich McGowan.

I thank my family for their unending support throughout not only this degree, but also my many educational endeavors. My parents have always made clear their pride and love for me, and I am grateful to them for encouraging me in my academic interests from a young age. My father passed away during the first semester of my doctoral studies, and his absence at my graduation will be deeply felt, but I know he would have been very proud. I have been very thankful for the love and friendship of my mother during this degree and throughout my life.

Finally, even though I know she’ll never read this, I thank my dog Ophelia, my best friend and faithful companion for the last seven years. She provided many cuddles and listened very seriously when I talked through my research problems with her. Most importantly, she never hesitated to remind me that, no matter how much serious work you have to do, you should always be sure you make time to play.

Table of Contents

Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Background of the Research
  1.2 Research Questions
  1.3 Scope of this Study
  1.4 Study Methodologies
  1.5 Importance and Contributions
  1.6 Organization of the Dissertation
Chapter 2: Review of the Literature
  2.1 Scientific Credit and Reward
    2.1.1 The Role of Credit in Science
    2.1.2 Metrics for Scientific Credit
    2.1.3 Patterns of Scientific Attention
  2.2 Scientific Data Sharing and Reuse
    2.2.1 Understanding Data Reuse
    2.2.2 Challenges in Tracking and Quantifying Data Reuse
  2.3 Conclusions
Chapter 3: Methodology
  3.1 Research Design
    3.1.1 Operationalizing “Reuse”
    3.1.2 Sampling and Data Collection
  3.2 Data Preparation and Analysis
    3.2.1 Research Question 1: For what research objectives are biomedical datasets reused?
    3.2.2 Research Question 2: What are the demographics of researchers who reuse existing datasets?
    3.2.3 Research Question 3: Are there temporal patterns to dataset requests?
    3.2.4 Research Question 4: Are there dataset topics that are more highly requested?
  3.3 Limitations
Chapter 4: Findings About Requests and Requestors
  4.1 Research Question 1: For what research objectives are biomedical datasets reused?
    4.1.1 Research Question 1.1: For what methods and analysis types are datasets reused?
    4.1.2 Research Question 1.2: How closely are the topics for data reuse aligned with the topics for which the data were originally collected?
    4.1.3 Summary of Findings
  4.2 Research Question 2: What are the demographics of researchers who reuse existing datasets?
    4.2.1 Research Question 2.1: Where are requestors located in the world?
    4.2.2 Research Question 2.2: Are there patterns in career stage of requestors?
    4.2.3 Summary of Findings
  4.3 Conclusions and Summary of Findings
Chapter 5: Findings About Datasets
  5.1 Research Question 3: Are there temporal patterns to dataset requests?
    5.1.1 dbGaP Results
    5.1.2 NHLBI Results
    5.1.3 Summary of Findings
  5.2 Research Question 4: Are there dataset topics that are more highly requested?
    5.2.1 Defining Topics
    5.2.2 Comparing Requests Across Topics
    5.2.3 dbGaP Results
    5.2.4 NHLBI Results
    5.2.5 NIDDK Results
    5.2.6 Summary of Findings
  5.3 Conclusions and Summary of Findings
Chapter 6: Discussion
  6.1 Summary of the Major Findings
  6.2 Interpretation of the Major Findings
    6.2.1 Who is Reusing Data?
    6.2.2 What Are the Most Requested Topics?
    6.2.3 When in a Dataset’s Life Cycle Are Requests Made?
    6.2.4 Where in the World Are Requestors Located?
    6.2.5 Why Are Requestors Reusing Datasets?
  6.3 Methodological Contributions of the Study
  6.4 Limitations and Considerations for Application of Findings
  6.5 Summary of Discussion
Chapter 7: Conclusion
  7.1 Implications of the Findings
    7.1.1 For Researchers
    7.1.2 For Repositories and Curators
    7.1.3 For Research Funders
  7.2 Directions for Future Research
    7.2.1 Understanding Data Requestors and Data Reuse
    7.2.2 Long-term Temporal Patterns
    7.2.3 Understanding Reuse Within the Broader Research Context
  7.3 Conclusion
Appendix A: Examples of Requests for Each Type of Reuse
Appendix B: Custom Stopwords Used in LDA
Appendix C: Topic Model Term Charts
References

List of Figures

Figure 3-1. MeSH tree sample demonstrating semantic similarity. The number following each term is its semantic similarity score (SSS) to the index term of “Heart Diseases.”
Figure 3-2. Demonstration of analyzing topics and requests.
Figure 4-1. Distribution of maximum semantic similarity scores for request/dataset pairs.
Figure 4-2. Relative difference in composition of requests for dbGaP datasets and universities in countries in the world.
Figure 4-3. Counts of universities compared to counts of requests to dbGaP.
Figure 4-4. Relative difference in composition of requests for NHLBI datasets and universities in countries in the world.
Figure 4-5. Counts of universities compared to counts of requests to NHLBI.
Figure 4-6. Relative difference in composition of requests for NIDDK datasets and universities in countries in the world.
Figure 4-7. Counts of universities compared to counts of requests to NIDDK.
Figure 4-8. Relative difference in composition of requests for dbGaP datasets and NIH funding in FY18 by state within the US.
Figure 4-9. Relative difference in composition of requests for NHLBI datasets and NIH funding in FY18 by state within the US.
Figure 4-10. Relative difference in composition of requests for NIDDK datasets and NIH funding in FY18 by state within the US.
Figure 5-1. Mean requests by year for dbGaP datasets in each decile, by age of the dataset at time of request.
Figure 5-2. Mean requests by year for dbGaP datasets in mean quartile, by age of the dataset at time of request.
Figure 5-3. Mean requests by year for NHLBI datasets in each decile, by age of the dataset at time of request.
Figure 5-4. Mean requests by year for NHLBI datasets released between 2009 and 2017 in each decile, by age of the dataset at time of request.
Figure 5-5. Output from the ldatuning package for the dbGaP dataset descriptions.
Figure 5-6. Output from the ldatuning package for the NHLBI dataset descriptions.
Figure 5-7. Output from the ldatuning package for the NIDDK dataset descriptions.
Figure 5-8. An example of a chart showing the top ten terms in topic 7 of the 14-group NIDDK model with its corresponding beta value.
Figure 5-9. Visual explanation of request ratio calculation.
Figure 5-10. Request to dataset ratios for dbGaP datasets, by topic, calculated annually from 2008 – 2018.
Figure 5-11. Request to dataset ratios for dbGaP datasets related to cancer, by cancer type, calculated annually from 2008 – 2018.
Figure 5-12. Request to dataset ratios for NHLBI datasets by topic, calculated annually from 2000 – 2018.
Figure 5-13. Request to dataset ratios for NIDDK datasets by topic, calculated annually from 2013 – 2018.

List of Tables

Table 3-1. Number of datasets, requestors, institutional affiliations, and use requests from each repository and overall.
Table 3-2. Contents of use requests by repository (X indicates the repository contains the item).
Table 3-3. Coding categories and their definitions.
Table 3-4. An example matrix of semantic similarity scores between two sets of terms.
Table 4-1. Coding categories and their definitions.
Table 4-2. Counts and percentages of requests describing various types of reuse for NIDDK and dbGaP datasets.
Table 4-3. Example semantic similarity scoring.
Table 4-4. Summary statistics of semantic similarity scores for dbGaP and NIDDK request/dataset pairs.
Table 4-5. Countries with number of universities and number of requests (N) and relative difference in composition (RDC) for each repository.
Table 4-6. Proportions of datasets requested by career status of requestor for dbGaP and NIDDK.
Table 4-7. Relative difference in composition (RDC) between faculty at five academic ranks in US institutions and their requests to dbGaP and NIDDK.
Table 5-1. Distribution of dbGaP datasets by request deciles for requests made between 2007 and 2017.
Table 5-2. Distribution of dbGaP datasets by mean request quartiles for requests made between 2007 and 2017.
Table 5-3. Results of regression analysis showing effects of requests during year one, two, and three of a dbGaP dataset’s life on the total number of requests during the 2007 – 2017 period.
Table 5-4. Results of regression analysis showing effects of requests during year one, two, and three of a dbGaP dataset’s life on the total number of requests in the fourth year and later during the 2007 – 2017 period.
Table 5-5. Distribution of NHLBI datasets by request deciles for requests made between 2000 and 2017.
Table 5-6. Results of regression analysis showing effects of requests during years one, two, and three of an NHLBI dataset’s life on the total number of requests during the 2010 – 2017 period.
Table 5-7. Results of regression analysis showing effects of requests during year one, two, and three of an NHLBI dataset’s life on the total number of requests in the fourth year and later during the 2009 – 2017 period.
Table 5-8. Results of regression analysis showing effects of requests during year two and three of an NHLBI dataset’s life on the total number of requests in the fourth year and later during the 2009 – 2017 period.
Table 5-9. Distribution of dbGaP datasets and requests among 18 topics derived from the assigned primary phenotype, and calculated request to dataset (RTD) ratio.
Table 5-10. Distribution of dbGaP datasets specific to cancer and their requests among 10 cancer topics derived from the assigned primary phenotype, and calculated request to dataset (RTD) ratio.
Table 5-11. Distribution of NHLBI datasets and their requests among 14 topics determined by LDA, and calculated request to dataset (RTD) ratio.
Table 5-12. Distribution of NIDDK datasets and their requests from 2013 – 2018, for 14 topics determined by LDA, and calculated request to dataset (RTD) ratio.
Table 6-1. Summary of the major findings.

Chapter 1: Introduction

In 2007, computer scientist Jim Gray asserted that the practice of science had been fundamentally changed by the advent of new technologies that facilitated the collection, storage, and analysis of large digital datasets. “Techniques and technologies for such data-intensive science are so different,” he argued, “that it is worth distinguishing data-intensive science…as a new, fourth paradigm for scientific exploration” (Hey, Tansley, & Tolle, 2009, p. xix). In the decade since Gray first proposed this new paradigm, thousands of human genomes have been sequenced, and petabytes’ worth of scientific data collected, with more pouring in every day, giving rise to a veritable data deluge.

It is not only the technical ability to more quickly and inexpensively gather, create, and store data that has transformed the practice of science, but also the establishment by both major funders and prominent publishing groups of mandates to share those data. Researchers around the world have begun to share their data not only in response to such mandates, but also as part of a growing movement toward open science practices that bring not only data, but a broad range of products of scientific research out of desk drawers and hard drives and into the public sphere, where they can be accessed, reused, and repurposed. In many fields, researchers today can feasibly conduct studies using publicly shared data, without ever having to set foot in a lab or seek funding to gather new data.
Despite this increasing availability of a broad range of datasets across scientific disciplines, little research has focused on how, why, or even whether researchers are using publicly available, shared research data. This dissertation aims to help close that gap in knowledge by exploring the ways in which scientific research datasets that are publicly shared have been reused. Specifically, I examine use requests from three biomedical data repositories in order to answer questions about who is reusing these datasets, how they are using them, and why some datasets are used more than others.

1.1 Background of the Research

A number of cultural and policy changes in the last few years have increased the availability of scientific research data for reuse. In 2013, the United States Office of Science and Technology Policy (OSTP) issued a memo directing agencies to develop policies to increase public access to research data generated using federal funds (Holdren, 2013). Accordingly, federal funders including the National Science Foundation (NSF) and National Institutes of Health (NIH) have created policies requiring researchers to share their data (National Institutes of Health Office of Extramural Research, 2016; National Institutes of Health Office of Science Policy, 2017; National Science Foundation, 2010). The International Committee of Medical Journal Editors (ICMJE) has encouraged member journals to require that authors make data underlying their articles publicly available (Taichman et al., 2017), and many major publishers have already done so, including PLoS and Nature (Nature Publishing Group, 2017; Silva, 2014). Researchers themselves are also increasingly embracing a culture of greater data sharing and transparency under the umbrella of various open science practices (Nosek et al., 2015b).

As previous research has demonstrated, data sharing and openness bring a number of benefits to the researchers who share, the scientific community, and the general public.
In addition to enhancing scientific reproducibility (Ioannidis, 2014; Munafò et al., 2017), shared data can be reused by other researchers, potentially to answer new questions not addressed in the original research. Data reuse increases the return on the investment of the original grant, and also saves funding that would otherwise have been used to gather new data (Arzberger et al., 2004; Costello, 2009). The speed of scientific discovery, and in turn translation to clinical practice, can be accelerated when researchers can reuse existing data instead of spending months or years collecting new data (Knoppers, 2014; Knoppers, Harris, Budin, & Edward, 2014). Researchers who share their data may be rewarded in the form of increased citations to articles with associated publicly available data, as well as opportunities to collaborate and co-author publications with the researchers who reuse their shared data (Piwowar, Day, & Fridsma, 2007; Tenopir et al., 2015).

Despite the potential benefits of reuse and the increasing availability of data, research on the actual impacts of data reuse is so far sparse. Some studies have considered patterns of data request and citation for individual repositories (Coady et al., 2017; Paltoo et al., 2014), but less research has sought a deeper understanding of the impacts that shared research data can have, or determined how to quantify or measure those impacts. Such research has important implications for both policy and practice. Data sharing policies should be founded on a strong evidence base that demonstrates the impacts and benefits of data sharing (Pryor, 2009). The time and effort required to share and curate data are not trivial (Leonelli, 2014), so quantifying the actual impacts of these datasets – as well as determining which datasets have the most potential for long-term impact – helps ensure that these investments are worthwhile.
Understanding how and why researchers reuse data could also inform development of better technical infrastructure to facilitate discoverability and enhance reuse (Jagodnik et al., 2017). Finally, understanding patterns of data reuse could incentivize sharing by making it possible to build upon existing academic reward structures to give credit to researchers who share high-use and high-impact datasets (Olfson, Wall, & Blanco, 2017). At present, most academic institutions do not recognize shared data as a scholarly product in the context of tenure and promotion decisions, likely because tracking data reuse is technologically challenging and the impact of shared datasets on the broader scientific community is difficult to quantify (Ali-Khan, Harris, & Gold, 2017; Piwowar, Becich, Bilofsky, & Crowley, 2008).

While tracking data reuse across science in general may be informative, the question of how to quantify data reuse and its impacts is especially salient in the context of biomedical research. In some disciplines, such as geology and astronomy, a culture of data reuse is relatively well established, given that these research communities have a long history of sharing data that are generated by a small number of sensors or telescopes and then analyzed by researchers around the world (Giles, 1995; Pepe, Goodman, Muench, Crosas, & Erdmann, 2014). However, widespread data sharing and reuse have not been the norm in biomedical research, and biomedical researchers have expressed both less willingness to share their own data and less interest in using others’ data (Tenopir et al., 2011, 2015). Some biomedical researchers even consider data reuse anathema, with one controversial editorial decrying researchers who reuse data as “research parasites” (Longo & Drazen, 2016).
One argument made by detractors of data sharing is that it will discourage researchers from undertaking large studies, particularly clinical trials, because they expect to be able to publish multiple articles over the course of several years using the data (The International Consortium of Investigators for Fairness in Trial Data Sharing, 2016). Sharing the data before they have the chance to conduct longer term studies, they argue, means that other researchers could “scoop” them – beat them to publication on discoveries that they could have gotten credit for. Given that articles are one of the most important currencies in academic credit systems, this argument suggests that identifying a means to reward researchers for sharing data could alleviate some of these concerns and remove some of the disincentives to sharing. Indeed, the NIH’s recent Strategic Plan for Data Science recognizes that “appropriate reward…systems are central to making data FAIR [findable, accessible, interoperable, and reusable] and for incentivizing researchers to share their data and analysis tools widely for reuse by others” (National Institutes of Health, 2018b, p. 24). While the research presented here does not necessarily solve the deeper cultural problems associated with biomedical data sharing, the findings of this study will help lay the foundation for solutions by providing a deeper understanding of the nature of biomedical data reuse. Data sharing cannot be meaningfully rewarded, nor can informed decisions be made about data curation and preservation, if it remains unclear how much datasets are being reused, who is reusing them, and for what purpose. This study explores biomedical data reuse in ways that will help answer these questions, as well as providing insight into how repositories and funders can make evidence-based decisions about policy and practice.
1.2 Research Questions

To better understand how and why biomedical researchers reuse existing datasets, this dissertation is guided by four research questions:

Research Question 1: What are the purposes and characteristics of biomedical research reuse?
Research Question 1.1: For what methods and analysis types are datasets reused?
Hypothesis 1.1: Genomic datasets of the type found in dbGaP will be more likely to be used in combination in meta-analyses, while clinical datasets of the type found in the NIDDK repository will be more likely to be used on their own to answer an original research question.
Research Question 1.2: How closely are the topics for data reuse aligned with the topics for which the data were originally collected?
Hypothesis 1.2: Similarity between original topics and topics of reuse will be lower for genomic data (found in dbGaP) than for clinical data (found in the NIDDK repository).

Research Question 2: What are the demographics of researchers who reuse existing datasets?
Research Question 2.1: Where are requestors located in the world?
Hypothesis 2.1: Requestors will be primarily located in regions with a greater proportion of research institutions, including North America, Europe, and Asia.
Research Question 2.2: Are there patterns in career stage of requestors?
Hypothesis 2.2: A broad range of career stages, from student to full professor (or equivalent) will be represented.

Research Question 3: Are there temporal patterns to dataset requests?
Hypothesis 3: Patterns of requests relative to the original dataset release date will demonstrate a cumulative advantage process, similar to other scientific communication processes such as article citation.

Research Question 4: Are there dataset topics that are more highly requested?

These four questions approach the topic of reuse from two perspectives.
Research Questions 1 and 2 answer questions about the characteristics of requests and requestors: who are the requestors and what are they planning to do with the data? Research Questions 3 and 4, on the other hand, examine characteristics of the datasets: which datasets are most requested and how do a dataset’s requests evolve over the years after its release? Together, the findings of these questions will provide a better understanding of the complex phenomenon of biomedical data reuse.

1.3 Scope of this Study

Broadly speaking, “biomedical research data” can include many different types of data generated, collected, or used in the course of the wide range of research activities that biomedical researchers conduct. In its original 2003 statement on data sharing, the NIH specifically notes that only “final research data” fall within the purview of its sharing policy. Their definition of final research data is the “recorded factual material commonly accepted in the scientific community as necessary to validate research findings.” They further note that other research objects such as “laboratory notebooks, partial datasets, preliminary analyses, drafts of scientific papers, plans for future research, peer review reports, communications with colleagues, or physical objects, such as gels or laboratory specimens” do not constitute final research data and are therefore excluded from policies regarding sharing (National Institutes of Health Office of Extramural Research, 2004). The 2003 statement also recognizes that there are many mechanisms by which research data may be shared, from the relatively restrictive (interested parties must contact the original researcher to negotiate access) to the maximally open (data are made freely available in a public repository).
In more recent policies and mandates from publishers and funders, the once-acceptable “data available upon request” is often considered inadequate as a means of sharing, especially since requestors have often found that authors cannot or do not share data upon request (Langille, Ravel, & Fricke, 2018; Savage, Vickers, Kats, & Molenaar, 2009; Stodden, Seiler, & Ma, 2018). Instead, most policies encourage, and sometimes even require, researchers to make data freely available in a repository, although that ideal is not always fully realized (Federer et al., 2018). Given the policy move toward repositories as the “gold standard” for data sharing, this study focuses on data shared within public biomedical data repositories. This choice is also based on practical considerations; data kept within an individual researcher’s lab would not only be difficult for someone else to reuse, but nearly impossible to identify for inclusion in this study. A further challenge to this research is identifying means for quantifying data reuse. Obtaining accurate counts of reuse of research datasets is challenging, given that standards for data citation have not been widely adopted yet. My previous research on the correlation between data use requests and citations to those datasets in the published literature found that the average dataset from biomedical data repositories had between about five and nine use requests for every one citation, suggesting that most use requests do not result in a publication that can be identified using existing search tools (Federer, 2018). While many open repositories track download counts for datasets, such raw counts provide little insight into who is using the data and for what purpose, or even whether they end up actually using the data at all.
In the absence of a tool or method for accurately quantifying and tracking reuse of shared datasets, this study utilizes use requests submitted for controlled-access biomedical datasets as a proxy for data reuse. This study considers three repositories administered by various groups within the NIH, all of which make their use requests publicly available. The Database of Genotypes and Phenotypes (dbGaP), housed at the National Center for Biotechnology Information (NCBI), contains human genetic sequence data and associated diseases or characteristics (National Center for Biotechnology Information, 2018). The BioLINCC repository and the NIDDK Central Repository contain datasets arising from research funded by the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), respectively (National Heart, Lung, and Blood Institute, 2018; National Institute of Diabetes and Digestive and Kidney Diseases, 2018). Together, these three repositories cover a range of data types, from clinical data (NIDDK and NHLBI) to genomic data (dbGaP), as well as a range of diseases and topics. As will be further discussed throughout this dissertation, this method for operationalizing reuse has certain limitations, as does the selection of these particular repositories. A request for a dataset does not necessarily guarantee that the requestor ended up using it, nor can it be known for certain whether the person who requested the data was the person who intended to use it – for example, a professor might request a dataset on behalf of a student. Still, these use requests provide a richer source of information about how biomedical datasets are reused than other currently available methods. Throughout this study, I will note how the methodologies and data sources used here limit the generalizability of these results and provide specific discussion about how these results can be meaningfully and responsibly applied.
1.4 Study Methodologies

This study utilizes a mixed methods approach, combining qualitative and quantitative methods to gain a holistic view of the reuse of biomedical datasets from the three repositories. Some of this work considers the requests and requestors, while other parts of the study focus on the datasets themselves. Taken together, these different pieces of data and types of analysis form a view of the who, what, when, where, and why of data reuse. In the first part of the study, content analysis of the use requests provides insight into who is making requests, where in the world they are located, and why they would like to reuse the data. I coded use requests for the type of reuse using a taxonomy drawn from the literature and inductively expanded to address types of reuse not previously identified. This analysis provides insight into the ways different types of data are reused. Using an automated indexing tool, I further coded requests with topics drawn from a controlled vocabulary that the repositories also use to describe the datasets. Comparing the similarity between topics in the requests to topics in the datasets provides a quantitative means to understand how similar intended data reuse is to the reasons for which the data were originally collected. By analyzing demographic information about the researchers who request datasets, this study also provides an understanding of who is benefitting from shared data – specifically, what is the career status of researchers who request data, and where in the world are they geographically located? The second part of the study focuses on analysis of the patterns of reuse of the datasets, investigating when in the data’s life cycle it is requested and what topics are most requested.
Analyzing patterns of requests over the course of a dataset’s life can yield insight into the long-term usefulness of a dataset, as well as provide an understanding of how similar patterns of request are to other processes in science, such as citations to articles over time. I also conducted text mining to determine whether there are topics that are more highly requested than others. Using topic modeling on repository-provided dataset descriptions yields groupings of datasets that are conceptually similar. Examining patterns of reuse among those topics enables identification of highly requested topics. Understanding these “when” and “what” questions of data reuse could aid in early identification of datasets that will go on to be highly requested; datasets that show early signs of high reuse patterns or those that cover highly requested topics could be prioritized for more in-depth curation.

1.5 Importance and Contributions

The findings of this study will have implications for a number of different stakeholders interested in how to track and quantify data reuse. At present, rewarding researchers for sharing data is challenging because of the difficulties in identifying and tracking reuse; moreover, the practical impact of a shared dataset cannot be quantified. Methods for evaluating research impact generally rely on well-established metrics with widely agreed-upon significance across scholarly communities. For example, the impact of an article may be quantified by the number of times other researchers cite it; the impact of a research grant may be quantified by the number of patents it generates or the market value of a drug that it yields. While these measures may be imperfect representations of the practical impact of a researcher’s work and productivity, they still represent a common currency used in the context of tenure, promotion, retention, and funding decisions (Carpenter, Cone, & Sarli, 2014; Holden, Rosenberg, & Barker, 1994; Moher et al., 2018).
If data sharing is to be rewarded, the research community must come to consensus about how the impact of a shared dataset is quantified. Simple counts of use requests or downloads elide the many, often very different, forms of reuse. By providing a better understanding of how datasets are reused, this research will help inform how to most effectively and fairly reward data sharing. Thus, these findings may provide insight for funders that wish to reward researchers for sharing, for academic institutions that want ways to measure the impact of their researchers’ contributions, and for researchers who often spend significant time and effort to share data but do not yet have mechanisms to be rewarded for doing so (CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013; Mooney & Newton, 2012). This understanding of how to quantify the impact of data sharing has further implications for development of policies informing data sharing and reuse. The aforementioned OSTP memo calling for federal agencies to create policies to increase public access to research results, including data, was issued in 2013; five years on, policy in these areas is still nascent (Holdren, 2013). For example, as of this writing, the NIH has yet to issue such a policy, although they have received and are reviewing public comments on their Proposed Provisions for a Draft Data Management and Sharing Policy (National Institutes of Health, 2018c). The NIH’s existing policy only requires a data sharing plan for grants of over $500,000 annually and does not consider the content of the plan in its competitive review process (National Institutes of Health Office of Extramural Research, 2004). A better understanding of the ways that shared data contribute to advancing science through reuse could help inform future policy developments. Academic research institutions, too, are beginning to adopt policies that could be informed by the results of this study.
For example, the Montreal Neurological Institute (MNI) has adopted an institution-wide open science policy that includes rewarding open sharing of data and other research products in the tenure and promotion process (Ali-Khan, Jean, MacDonald, & Gold, 2018). At the same time, they have recognized that doing so requires understanding how to quantify the impact of open science products and are developing a toolkit of qualitative and quantitative techniques to do so (Gold et al., 2018). That work, which has drawn on the input of international experts in open science, will be further enhanced by the deeper understanding of data reuse that this study will yield. This study also has implications for the repositories that host data and the curators who do the often time-consuming work of making data ready for reuse (Leonelli, 2014; Levin & Leonelli, 2017). Traditional libraries must make choices about which materials they will commit to preserve and which they will discard because physical space and other resources are limited. Similarly, it is neither feasible nor desirable to curate and preserve every single research dataset in perpetuity. It is difficult to predict which datasets will have future value, in part because biomedical research is a moving target – the hottest topics and most advanced technologies of today can quickly become outdated – but understanding what characteristics are most predictive of reuse can build an evidence base for making well-informed decisions about which datasets to prioritize. By exploring the demographics of the researchers who reuse datasets, this research may also provide a better understanding of how data sharing can help democratize science and facilitate research in areas where funding resources are sparser.
For example, in regions of the world where less scientific funding is available, generating certain types or large quantities of data may be financially out of reach (Serwadda, Ndebele, Grabowski, Bajunirwe, & Wanyenze, 2018). Even in countries where research is comparatively well funded, resources may not be equally distributed. Early career researchers, women, under-represented minorities, and researchers at smaller institutions may not have the resources or funding that are required to generate certain types of data. Not every researcher has access to sophisticated high-throughput sequencing machines or a cadre of staff to collect years’ worth of longitudinal data. If these data already exist, sharing them may be a more efficient way to distribute limited resources while maximizing scientific discovery. By better understanding who is (and who is not) currently using shared data resources, this research will be useful to funders who may wish to fund research that encourages reuse of existing resources, as well as to repositories and others who may be in a position to conduct outreach to increase awareness of the availability of such resources.

1.6 Organization of the Dissertation

This dissertation comprises seven chapters, including this introduction. Chapter 2 reviews the literature to contextualize this research within the literatures of science and technology studies, open science, and scholarly metrics. Chapter 3 describes the design of this study, including discussion of methods for data collection and analysis. The findings of these analyses are split into two chapters; Chapter 4 describes findings based on analysis of use requests and requestors, while Chapter 5 focuses on the analysis of the datasets themselves. Chapter 6 synthesizes these findings to better define the who, what, when, where, and why of biomedical data reuse.
Finally, Chapter 7 discusses the implications of these findings for various stakeholders in the biomedical research community and outlines directions for future research that builds on this exploratory study.

Chapter 2: Review of the Literature

Although the wide availability of research data for reuse is a relatively new development, various areas of inquiry into scholarly communication and academic reward systems, as well as researchers’ data use and reuse behaviors, provide a foundation for this study. This chapter begins with a discussion of research impact, including the historical context of how and why research impacts are measured, as well as an examination of how these measures are used in the context of academic reward systems today. Understanding how research outputs are currently measured and rewarded backgrounds this study’s approach to how metrics of data reuse could fit within existing scientific reward structures, and therefore provides insight into what characteristics of datasets and data reuse should be considered in a model for quantifying the impact of datasets. Some of these approaches draw on established bibliometrics techniques; although citations to articles cannot be considered exactly equivalent to instances of data reuse, many of the approaches used in the context of articles can yield insight into the quantification of data reuse. This chapter also draws upon the nascent literature on data sharing and reuse to provide background on what is already known about how researchers reuse data. Many of these studies consider scientific research from non-biomedical fields; while it has been established that different disciplines have different cultures of data sharing and reuse (Tenopir et al., 2011, 2015), these studies provide important ideas about how to conceptualize data reuse and its role in advancing science.
2.1 Scientific Credit and Reward

Before attempting to track, quantify, and predict data reuse, it is essential to understand the ecosystem of credit and reward within which science operates. At the heart of science, of course, is the attempt to understand the phenomena that drive the world around us, but the pursuit of knowledge is arguably not the only goal of many researchers – rather, it is the pursuit of knowledge that will allow them to gain credit in the scientific community.

2.1.1 The Role of Credit in Science

Robert K. Merton has posited “four sets of institutional imperatives taken to comprise the ethos of modern science”: communalism, universalism, disinterestedness, and organized skepticism (Merton, 1942, p. 270). The norm of disinterestedness suggests that science be conducted for the common good rather than the researcher’s personal benefit, particularly in the context of financial gain. By communalism, Merton means that scientific knowledge should be “owned” communally by, and therefore be accessible to, the entire scientific community in order to facilitate collaboration and advance research. This argument is especially salient in the context of federally funded research; as the Office of Science and Technology Policy’s 2013 memorandum on Increasing Access to the Results of Federally Funded Scientific Research points out, the outcomes of research should be available to the public that has funded it through their tax dollars (Holdren, 2013). Some critics have argued that Merton’s norms do not present a comprehensive view of the normative structure of science, suggesting that “counternorms” often drive scientists’ behavior and serve a function in scientific communities. For example, scientists regularly engage in secrecy, the counternorm to communalism, by strategically withholding information to ensure that others cannot steal credit for their work.
Some secrecy is probably essential to the social structure of science, as without it, “science would degenerate into a state of continual warfare” (Mitroff, 1974, p. 593). Like Anderson et al., I suggest that Merton’s norms are best viewed as “ideals that...are counterbalanced by opposing norms” (Anderson, Ronning, DeVries, & Martinson, 2010, p. 5). Scientific knowledge progresses most effectively when researchers operate somewhere between complete secrecy and complete openness, in a system that provides them with a mechanism for receiving credit for their contributions while still allowing them to build upon the knowledge of others. This view is not incompatible with Merton’s norms – despite arguing for a high level of openness and community ownership of knowledge, Merton does not suggest scientists should work without reward or acknowledgement. Rather, he suggests that “the scientist’s claim to ‘his’ intellectual ‘property’ is limited to that of recognition and esteem,” and argues that, when scientific institutions function well, they reward scientists proportionally to the significance of their work (Merton, 1942, p. 273). Article citation is an essential mechanism for enabling this proportional reward process. The practice of citing articles makes it possible to trace influence and inspiration and serves the very practical purpose of giving credit to researchers for their scholarly labor. Researchers need not pay a licensing fee or purchase an idea to build upon it in their own work; the “payment” for the idea is rendered to the original creator in the form of a citation. A citation on its own has no monetary value, but citations have very real economic impacts on researchers, given that they are often used in academic hiring, tenure and promotion, and funding decisions (Carpenter et al., 2014; Durieux & Gevenois, 2010; Holden et al., 1994).
Several researchers have explored the concept of credit and its important role in the economy of the research community. In their seminal work Laboratory Life: The Construction of Scientific Facts, Latour and Woolgar devote an entire chapter to “Cycles of Credit” (1986). They describe science as a process of accumulating credibility capital through recognition in the form of citations, awards, and credentials, which can in turn be “reinvested” to receive the necessary resources to continue conducting research, such as grant funding, laboratory resources, and tenure. “The notion of credibility,” they argue, “makes possible the conversion between money, data, prestige, credentials, problem areas, argument, papers, and so on” (Latour & Woolgar, 1986, p. 200). Other scholars have also taken an economic view of the function of credit and citation, for example, describing citation as payment of an “intellectual debt” (Garfield, 2002; Kochen, 1987). Merton argues that, in a sense, getting citations is the impetus for scientific publishing in the first place, pointing out that, “since recognition by qualified peers is the basic form of extrinsic reward…and since that reward can be accorded only when the work is made known, this historically evolving reward system provides institutionalized incentive for open publication without direct financial reward” (Merton, 1983, para. 5). The importance of credit in scientific research is underscored by the grave tone of discussions about instances in which credit is not properly given. Variously termed “citation amnesia,” “bibliographic negligence,” “disregard syndrome,” and “petty larceny plagiarism,” the failure of a researcher to cite an article that has informed his or her work is considered a serious breach of scientific conduct (Garfield, 1982, 1991; Ginsburg, 2001; Maes, 2015).
Lack of proper attribution is also framed as a moral failing, with citation being described as “a matter of science’s family values” and failing to cite as “a menace to honest science” and a “serious transgression” (Garfield, 1991; Ginsburg, 2001; Palevitz, 1997). Garfield even muses on the possibility of establishing a “science court” that would enforce the norms of citation and “met[e] out punishment to willful perpetrators” (Garfield, 1987, 1989, 1991, para. 2). In light of this economy of reward and the intellectual theft that failure to cite represents to the scientific community, it is not difficult to understand why some researchers would be unwilling to share their data. As I will discuss, standard mechanisms for researchers to cite datasets they have reused have not yet been widely adopted. Data sharing detractors see data reuse as “possibly stealing from the research productivity planned by the data gatherers” (Longo & Drazen, 2016, para. 3). From a game theory perspective, though sharing is good for the community at large, researchers would not logically do so, since “there is a conflicting interest for individual researchers, who are always better off not sharing and omitting the sharing cost while they would have higher impact when sharing as a community” (Pronk, Wiersma, van Weerden, & Schieving, 2015, p. 1). Until mechanisms exist to situate data reuse within the scientific economy – that is, to quantify data reuse and reward researchers who share high-value datasets that go on to be frequently reused – many researchers may see reuse of their data as intellectual theft.

2.1.2 Metrics for Scientific Credit

The assumption that citations equal credit is foundational to the field of bibliometrics, which has as its aim measuring scientific impact. Bibliometricians use various indicators and statistical methods to assess the value of articles, the impact of journals, and the productivity of researchers.
For example, the h-index, calculated by considering the number of citations for each paper in a researcher’s body of work, is often used in hiring and funding decisions and has been demonstrated to be effective in comparing researchers’ outputs and predicting future scientific success (Acuna, Allesina, & Kording, 2012; Bornmann & Daniel, 2007; Carpenter et al., 2014; Hirsch, 2005, 2007; Penner, Pan, Petersen, Kaski, & Fortunato, 2013). Despite the widespread use of bibliometric methods, uncertainty remains about how well these measures accurately reflect scientific achievement and productivity. Article citations are not always easy to collect, and analysis may provide incomplete results (Lane, 2010). Article citation is also only one means of measuring scientific output, and cannot capture uses of scientific knowledge that occur outside of the traditional scientific literature (Hicks, Wouters, Waltman, de Rijcke, & Rafols, 2015). As Priem puts it, “ideas do not leave good tracks” (Priem, 2014, p. 263). As access to articles has become largely digital, new methods for counting article use have emerged, including article downloads, Mendeley readership, and mentions in online sources such as blogs and social media, but their validity and significance remain unclear (Bollen, Van De Sompel, Smith, & Luce, 2005; Galligan & Dyas-Correia, 2013; Schlögl, Gorraiz, Gumpenberger, Jack, & Kraker, 2014; Thelwall, Haustein, Larivière, & Sugimoto, 2013). To address some of these limitations, bibliometrics researchers have undertaken research to consider how effectively article citation reflects impact. Using citations as a means to reward impactful science assumes that citations are positive, though in Garfield’s influential list of fifteen reasons for citations, some are actually negative, such as “criticizing previous work” or “disclaiming work or ideas of others” (Garfield, 1964, p. 85).
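To make the h-index mentioned above concrete, the following is a minimal Python sketch of its standard definition (Hirsch, 2005): the largest number h such that h of a researcher’s papers have each received at least h citations. The function name and the citation counts in the usage note are illustrative assumptions and are not drawn from this study’s data.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each.

    `citations` is a list of per-paper citation counts (hypothetical input).
    """
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            # The paper at this rank still has at least `rank` citations.
            h = rank
        else:
            # Counts only decrease from here, so h cannot grow further.
            break
    return h
```

For a hypothetical researcher whose five papers have 10, 8, 5, 4, and 3 citations, `h_index([10, 8, 5, 4, 3])` returns 4, since four papers have at least four citations each but there are not five papers with at least five. The sketch also shows why the measure is insensitive to the reasons behind each citation: only the counts enter the calculation.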
A growing body of literature explores what citation counts actually measure, including quantitative studies of citations and qualitative studies of researchers’ citing behaviors (Bornmann & Daniel, 2008). Although significant questions remain about researchers’ motivations for citing articles, citation counts are still widely recognized as “a strong indicator of scientific performance” (van Raan, 2005, p. 3). Nonetheless, even when bibliometric measures can be relatively well defined and easily measured, bibliometricians urge caution in interpreting and using these metrics for decision-making. The 2015 “Leiden Manifesto,” a declaration of best practices for bibliometrics, described a scientific community inundated by metrics that are “usually well intentioned, not always well informed, often ill applied” (Hicks et al., 2015, p. 429). They warn that citation counts in particular are subject to “conceptual ambiguity and random variability” and urge the scientific community to “avoid misplaced concreteness and false precision” when interpreting and using all types of research impact measures (Hicks et al., 2015, p. 431). In considering how to quantify data reuse and measure its impact, an important caution that researchers should take from bibliometricians is to consider potential unintended consequences. Some critics see citation counts as an example of Goodhart’s Law, which states that “when a measure becomes a target, it ceases to be a good measure” (Edwards & Roy, 2017, p. 52). They argue that using article citation to reward high-impact science may have the effect of creating perverse incentives that encourage self-citation and other bad behaviors that artificially inflate citation counts (Edwards & Roy, 2017; Werner, 2015).
Further, it is also essential to consider how to measure what is meaningful, and not simply what is easy to count, especially in the context of phenomena for which there is an “absence of internationally meaningful comparative data” – a situation that is almost certainly the case for data reuse at present (Hazelkorn, 2013, p. 6). As the “Leiden Manifesto” points out, “the problem is that evaluation is now led by the data rather than by judgement” (Hicks et al., 2015, p. 429).

2.1.3 Patterns of Scientific Attention

Merton has described a social phenomenon in science that he termed “the Matthew effect” after a parable that states “for to every one who has will more be given, and he will have abundance; but from him who has not, even what he has will be taken away” (1968, p. 159). He describes this effect at the level of the investigator, characterizing it as “the accruing of greater increments of recognition for particular scientific contributions to scientists of considerable repute and the withholding of such recognition from scientists who have not yet made their mark” (Merton, 1968, p. 159). In other words, the more well known a researcher is, the more likely he or she is to gain further attention. Bibliometric research has demonstrated that this effect, also called the “success breeds success” phenomenon (Cozzens, 1985), exists in article citations as well; that is, articles that are highly cited are more likely to receive more citations in the future (Bornmann & Daniel, 2008; Burrell, 2003; Cozzens, 1985). Given that this phenomenon occurs at the researcher level as well as the article level, it stands to reason that dataset reuse may also be governed by such a model, such that the more a dataset is reused, the more attention it gets and the more likely it is to be reused. Further, it has been shown that the data creator’s reputation is a factor in a researcher’s decision to use a dataset (Faniel, Kriesberg, & Yakel, 2015).
Since researcher reputation is subject to the Matthew effect, it follows that the success of the researcher will breed success of his or her datasets. In the context of data science, this process is likely especially true for benchmarking datasets, which are used for testing new tools and methods, as well as comparing them to existing gold standard tools (Moura et al., 2013; Ó Conchúir et al., 2015). Datasets that have been used for benchmarking are more likely to go on to be used for this purpose again, since it is useful to compare a new tool to an existing tool on the same dataset.

In statistics, the Matthew effect is described as a cumulative advantage process (de Solla Price, 1976). Some bibliometricians argue that citations to papers are accrued in a linear fashion at a constant rate (Bornmann & Daniel, 2008; Hirsch, 2005). Others contend that papers accrue citations at random, therefore arguing for a stochastic model (Burrell, 2003, 2008). De Solla Price suggests that accumulated citations are determined by the number of citations that the article receives early, which he terms the initial pulse (1976). These types of models of cumulative advantage may be helpful for predicting future reuse of datasets.

While dataset reuse likely follows some of the same patterns of article citation over shorter time spans, the long-term patterns may differ. Even the most highly cited papers are subject to a process of “attention decay;” citations hit a peak, typically between two and seven years depending on discipline, and citations subsequently taper off (Eom & Fortunato, 2011; Parolo et al., 2015). Attention decay in articles is largely driven by knowledge obsolescence; as new discoveries are made and new articles written, researchers are more likely to cite the newer, more current information (Fortunato et al., 2018). However, this same process may not hold with datasets.
Some of the datasets considered in this study demonstrate that even old data can be of significance to researchers; for example, in the NHLBI repository, over 20% of the datasets were collected more than 20 years ago. That these datasets are still being requested suggests that datasets may not be subject to the same pattern of attention decay as articles.

2.2 Scientific Data Sharing and Reuse

The sharing of data with other researchers is not new to science. However, data previously tended to be shared through interpersonal connections, such that a researcher who wanted to use a dataset had to first know that it existed, and then negotiate with the data creator for access. This process required significant and often tacit knowledge of the discipline and interpersonal connections within the field, limiting opportunities for students, early-career researchers, under-represented minorities, and others who were not research community insiders (Wallis, Rolando, & Borgman, 2013; Yoon, 2017; Zimmerman, 2007). Data was exchanged in the context of a “gift economy” – sharing in an open repository would be undesirable because data had value as an item to be bartered with other researchers in return for resources or intangible credit capital (Bollen, Van de Sompel, Hagberg, & Chute, 2009; Wallis et al., 2013). Data use agreements governed how reusers would be expected to “compensate” the data sharer, sometimes in the form of co-authorship on resulting papers (Gorgolewski, Margulies, & Milham, 2013). Although this type of sharing still occurs, the development of computational and technological infrastructure that has enabled the creation of data repositories, together with the policy mandates that have driven researchers to populate them, has inherently changed how datasets are shared today (Tausczik, 2016).
2.2.1 Understanding Data Reuse

Given that widespread data availability is a relatively new phenomenon, our understanding of how researchers are reusing publicly available datasets is still emerging. Coady et al. (2017) developed a set of categories for coding reuse requests in their study of the NHLBI repository (emphasis added for ease of reading):

new question, defined as a secondary analysis designed to explore associations, prognostic factors, subgroup analyses, or similar issues; meta-analysis or pooled study, defined as a formal meta-analysis of individual participant data, combined study analysis, or consortium of studies with participant-level data; statistical methods, defined as a project focused on the development and testing of new statistical approaches; clinical trial methods, defined as a project examining statistical methods or analytic approaches that are generalizable to all or specific types of clinical trials; and other projects, examples of which include pilot data for a subsequent grant submission, simulation studies, and development of prediction equations. (p. 1851)

This taxonomy provides a useful starting point for understanding types of data reuse, although the datasets considered in the Coady et al. study are limited to clinical trial data; additional types of reuse not covered in this taxonomy have been discussed in other studies. For example, publicly available data can be useful in reproducing or verifying the results of an original study, a particularly compelling use given that many scientific disciplines are troubled by a “reproducibility crisis” (Borgman, 2011; Pasquetto, Randles, & Borgman, 2017). As data science methodologies advance, researchers also need existing data for development and validation of new software, particularly in the context of supervised machine learning tasks, which require well-described and tagged data from which the algorithm can learn patterns (Kotsiantis, 2007).
Even beyond research, shared data can have important applications in training the next generation of researchers who do not yet have their own data to analyze. For example, Compute Canada funds cloud-based data and compute hubs for training use in Canadian academic institutions (Compute Canada, 2018).

Some of the variation in ways datasets are reused is due to characteristics of the data themselves. Biomedical data can include a wide range of data types; the two considered in this study, clinical and genomic data, have very different histories that influence the ways they are collected, and therefore the ways they can be reused. Clinical research traces its history back hundreds of years (Bhatt, 2010); by comparison, genomic research is quite young, beginning with the Human Genome Project (HGP) in the 1990s (National Human Genome Research Institute, 2012). Data sharing has been a norm in genomic research from the start – the HGP considered “rapid prepublication data release” fundamental to genomic research, and this principle was even codified in the form of the Bermuda Principles and adopted into policy by the National Institutes of Health (Collins, Morgan, & Patrinos, 2003, p. 288; Powledge, 2003). That type of widespread sharing and collaboration has not been part of the culture of clinical research, which likely contributes to the resistance among many clinical researchers to policies that would require them to share (The International Consortium of Investigators for Fairness in Trial Data Sharing, 2016).

Since genomic research has embraced sharing and collaboration from its beginnings, genomic data have intentionally been standardized; the Genomic Standards Consortium was formed in 2005 to develop and promote data standards (Field et al., 2011).
These standards enable researchers not only to use data from another lab, but to aggregate it with their own, which is especially important given that genomic research requires a much larger sample of participants to achieve statistical power than does clinical research (Hong & Park, 2012). On the other hand, little standardization exists across clinical datasets; researchers often word questions to patients in different ways or record the same concept using different terminology (Richesson & Nadkarni, 2011). As a result, even if clinical researchers share their data, other researchers’ ability to aggregate it with other datasets is limited. Efforts are underway to improve standardization of clinical data; for example, the National Institutes of Health’s activities to promote Common Data Elements would help ensure greater consistency across clinical datasets and thereby enable aggregation and potentially increase reuse (Sheehan et al., 2016).

Beyond the what of data reuse, a number of studies have considered the why, exploring researchers’ attitudes toward and experiences with reusing research data. Tenopir et al.’s 2011 article and their 2015 follow-up provide useful insight into how practices have changed over time. Eighty-three percent of respondents strongly or somewhat agreed that they “would use other researchers’ datasets if their datasets were easily accessible” (Tenopir et al., 2011, p. 8). The authors do not report the percentages for responses in the follow-up article, but do indicate that agreement with this statement increased significantly from a mean of 4.19 to 4.33, on a scale of 1 (disagree strongly) to 5 (agree strongly) (Tenopir et al., 2015). However, these attitudes differ across disciplines; notably, researchers in medical and health sciences fields had the lowest rate of agreement with the statement in both the original study and its follow-up (Tenopir et al., 2011, 2015).
Other studies have aimed to understand the reasons underlying researchers’ attitudes about reuse. Several studies have found that trust plays a major role in researchers’ decision to reuse data (Faniel & Jacobsen, 2010; Faniel et al., 2015; Rolland & Lee, 2013; Yakel, Faniel, Kriesberg, & Yoon, 2013; Yoon, 2014, 2017), although another study found that reuse decisions were more based on perceived usefulness of the data than its trustworthiness (Kim & Yoon, 2017). The concept of trustworthiness may be tied to the repository (does the researcher trust the repository to curate, preserve, and provide accurate data?), as well as the original data collector (does the data collector have a reputation for accurate and clean data?). Characteristics of the datasets themselves also play a significant role in researchers’ selection of datasets to reuse. Researchers look for datasets that are complete, credible, accompanied by high-quality metadata, and easy to use (Faniel & Jacobsen, 2010; Faniel et al., 2015).

However, most of these studies considered reuse in specific research disciplines, such as earthquake engineering or social sciences, and little research has interrogated the practices and attitudes of biomedical researchers. Given Tenopir et al.’s (2011, 2015) findings that researchers in biomedical research differ from their counterparts in other disciplines in many ways regarding sharing and reuse, these findings may not be generalizable to biomedical researchers. Differences likely exist even within specific sub-disciplines of biomedical research; a previous study my colleagues and I conducted found that NIH clinical researchers were significantly less likely to consider data reuse important to their work than non-clinical researchers at NIH (Federer, Lu, Joubert, Welsh, & Brandys, 2015).
Not only are biomedical researchers different from those in other disciplines, but the data used in biomedical research is also different in an important way: it often contains personally identifiable information on human subjects. Data reuse in the context of biomedical research therefore raises some additional concerns about privacy that may not apply to other types of research data. The Health Insurance Portability and Accountability Act of 1996 stipulates that patients’ data cannot be shared without their consent, thus limiting the sharing of some types of patient data, unless they can be de-identified adequately to present “only a very small risk” of the patient being re-identified (Meystre et al., 2017). However, in some cases, such as patients with very rare diseases, de-identification may not be possible (Hansson et al., 2016; Wan et al., 2017). Paradoxically, it is these very patients who could potentially stand to benefit the most from data sharing, since collecting enough data to draw statistically meaningful conclusions often necessitates researchers from around the world sharing data on their patients.

2.2.2 Challenges in Tracking and Quantifying Data Reuse

As data reuse becomes more common, many in the scientific community have recognized the need for mechanisms to track and quantify data reuse. One approach that has been championed by various stakeholders is data citation (Bierer, Crosas, & Pierce, 2017). In 2014, the scholarly communication organization FORCE11 issued a Joint Declaration of Data Citation Principles that suggests “data should be considered legitimate, citable products of research,” and proposes eight principles for the “purpose, function and attributes of citations” (Data Citation Synthesis Group, 2014).
This formal declaration is situated within a body of literature exploring both the ideal forms that data citation might take (Altman & Crosas, 2013; Altman & King, 2007; Silvello, 2017) and actual citation practices observed in the literatures of various disciplines (Edmunds, Pollard, Hole, & Basford, 2012; Henderson & Kotz, 2015; Mooney & Newton, 2012). Advocates see data citation as a means to enhance scientific reproducibility by allowing readers to easily locate data underlying scientific articles (Altman & Crosas, 2013; CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013). Researchers themselves also seem to consider data citation important: in 2011, 92% of respondents said that they “agree strongly” or “agree somewhat” with the statement “it is important that my data are cited when used by other researchers,” although a 2015 follow-up found significantly less agreement with that statement (Tenopir et al., 2011, 2015).

However, utilizing data citations as a means for quantifying reuse remains challenging, especially since standards have not been widely adopted (Zhao, Yan, & Li, 2017). While article citations are standardized and typically found in a reference section, authors place data citations throughout articles, including in the acknowledgements, materials/methods section, or elsewhere (Callahan, Winnenburg, & Shah, 2018; Piwowar, Carlson, & Vision, 2011). Others may cite not the dataset itself, but an article describing the dataset. For example, the GenBank database directs researchers who have reused data to cite a paper describing the database (Benson, Karsch-Mizrachi, Lipman, Ostell, & Wheeler, 2005). Citations to that paper do not necessarily reflect use of GenBank data; authors may cite that paper even when GenBank data have not been used (as I have done here). Inconsistencies in data citations complicate the process of locating articles that report on reuse of a dataset.
A variety of academic databases have article citation indices that automatically connect a user to citing articles, but a similarly comprehensive data citation index does not yet exist (Garfield, 1955; Robinson-García, Jiménez-Contreras, & Torres-Salinas, 2015). Though some computational and automatic methods have been developed (Piwowar, 2010; Q. Zhang, Cheng, Huang, & Lu, 2016), correctly and completely identifying articles citing datasets often requires significant manual work. Most studies that have utilized citation-based methods to quantify data reuse have relied at least partly on manual identification of articles and elimination of false positives (Belter, 2014; Callahan et al., 2018; Piwowar et al., 2011). These methods would be impractical in large-scale analyses to systematically quantify the impact of larger sets of data citations. In my previous research, I demonstrated that some articles that report on data reuse do not cite the original dataset at all; I could include them in my study because the repository containing the cited dataset had been notified of the publication and so included it in the list they provided me. Had I attempted to locate all articles citing those datasets on my own, it would have been impossible for me to identify them (Federer, 2018).

Despite the slow uptake, it appears that many within the scholarly community are eager to move toward data citation standards and infrastructure that allow for better tracking of data and its reuse in the scholarly literature. These efforts could also be stimulated by increased recognition of data reuse as a form of scientific impact that merits scholarly credit.

2.3 Conclusions

Although the study of data reuse is relatively young and many questions remain about who is reusing biomedical research data and for what purposes, the bodies of research described here can help inform directions for this research. Quantifying and tracking data reuse is important for ensuring proper credit and attribution.
Because data reuse is situated within the context of an existing structure for academic credit, this research is most useful if it builds upon our current understanding of how researchers interact with and use scientific knowledge of all types. As the next chapter will discuss, applying bibliometric models of understanding scientific credit and reward also supplies a useful set of methods for quantifying and tracking data reuse.

Chapter 3: Methodology

Chapters 1 and 2 have provided an overview of the need for and challenges involved in quantifying and understanding biomedical data reuse. In this chapter, I will describe the research design and the approach I have taken to answer four research questions intended to elucidate who is reusing data and how they are doing so:

Research Question 1: What are the purposes and characteristics of biomedical research data reuse?

Research Question 1.1: For what methods and analysis types are datasets reused?

Hypothesis 1.1: Genomic datasets of the type found in dbGaP will be more likely to be used in combination in meta-analyses, while clinical datasets of the type found in the NIDDK repository will be more likely to be used on their own to answer an original research question.

Research Question 1.2: How closely are the topics for data reuse aligned with the topics for which the data were originally collected?

Hypothesis 1.2: Similarity between original topics and topics of reuse will be lower for genomic data (found in dbGaP) than for clinical data (found in the NIDDK repository).

Research Question 2: What are the demographics of researchers who reuse existing datasets?

Research Question 2.1: Where are requestors located in the world?

Hypothesis 2.1: Requestors will be primarily located in regions with a greater proportion of research institutions, including North America, Europe, and Asia.

Research Question 2.2: Are there patterns in career stage of requestors?
Hypothesis 2.2: A broad range of career stages, from student to full professor (or equivalent) will be represented.

Research Question 3: Are there temporal patterns to dataset requests?

Hypothesis 3: Patterns of requests relative to the original dataset release date will demonstrate a cumulative advantage process, similar to other scientific communication processes such as article citation.

Research Question 4: Are there dataset topics that are more highly requested?

3.1 Research Design

This study utilizes a mixed methods approach to explore the complicated phenomenon of biomedical data reuse, employing both manual techniques for analyzing the qualitative content of data reuse requests and automated analyses that aim to quantify and better understand patterns of requests. A mixed methods design has the benefit of combining qualitative and quantitative methods to provide a more complete picture of a phenomenon, as well as allowing for exploration of multiple related research questions (Bryman, 2006). This chapter describes how this study combines qualitative content analysis of data reuse requests with quantitative methods, including text mining and bibliometric modeling.

3.1.1 Operationalizing “Reuse”

Quantifying reuse of datasets is challenging, since reuse can take so many different forms. Some forms of reuse are easy to identify, but others leave few traces that can be identified and tracked. An article that makes an explicit citation to a shared dataset and clearly describes its role in the study is an obvious instance of reuse; however, articles frequently do not cite datasets in systematic ways that can be easily and automatically tracked. Even when efforts are made to systematically track and record citations to datasets, dataset requests typically outnumber citations by 75% (Federer, 2018). While using citations as a proxy likely underestimates reuse, using counts of downloads and views as a proxy likely overestimates reuse.
In open repositories where anyone can download or view a dataset, it cannot be known how or even if the downloader goes on to use the data. Further, because most of these repositories do not collect information about who is viewing or downloading, little can be known about the potential users of the dataset.

One approach that may more accurately reflect reuse is analysis of data use requests. Repositories that contain sensitive human research data cannot make datasets available to freely download because of privacy and consent issues. Instead, researchers must make a formal request for datasets, including a description of the specific purpose for which they are requesting the data and, in most cases, clearance from their Institutional Review Board (IRB); these requests are then reviewed by a Data Access Committee (DAC) at the repository, a body charged with determining acceptable reuse. Since researchers cannot use the data without submitting a request, and a request cannot be submitted without having a specific intended use, use requests likely provide a reasonably complete representation of data reuse, as well as providing information about how the requestor intends to use the data. In this study, I will draw on these requests as a proxy for reuse.

While they are likely more accurate than citations or download counts, use requests also do not provide an exact measure of data reuse. Just because researchers must have a specific use in mind when they apply for the dataset does not mean that they end up using the data. They may realize once they have the dataset that it is not actually suited for their purpose after all, or they may discover that the data do not support their initial hypothesis and discard the project. Knowing the identity of the requestor also does not mean that the actual data reuser is known; it is possible that someone else, or even whole research groups, are the actual users.
For example, a professor might request a dataset on behalf of a student, or an administrator may request data on behalf of an entire team. Despite these limitations, use requests are a useful proxy for data reuse in the context of this study, in that they provide a depth of information about how researchers at least intend to reuse datasets. Throughout this dissertation, I will discuss how the limitations of this approach constrain the application and generalizability of the findings, as well as propose how these findings could be supplemented by future research that draws on other methodologies and data sources.

3.1.2 Sampling and Data Collection

This study considers three repositories administered by various groups within the NIH, all of which require researchers to submit requests to reuse the datasets. While other NIH repositories do exist, these three lend themselves to study because they not only require submission of use requests, but also make most or all of the request contents publicly available. The Database of Genotypes and Phenotypes (dbGaP), administered by the National Center for Biotechnology Information (NCBI), contains human genetic sequence data and associated diseases or characteristics (National Center for Biotechnology Information, 2018). The BioLINCC repository and the NIDDK Central Repository contain biospecimens and datasets arising from research funded by the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), respectively (National Heart, Lung, and Blood Institute, 2018; National Institute of Diabetes and Digestive and Kidney Diseases, 2018). Together, these three repositories cover a range of data types, from clinically focused data (NIDDK and NHLBI) to genomic data (dbGaP), as well as a range of diseases and topics. The data contained within these repositories almost exclusively comes from NIH-funded studies.
While individual researchers may submit data, many of these datasets arise from large efforts that involve research teams or even multi-site consortia.

In addition to use requests, all three of the repositories considered in this study display descriptive metadata about available datasets, including Medical Subject Headings (MeSH) terms that describe the focus of the dataset and narrative descriptions of the original study, which contain information such as the purpose of the original study, data collection methods, characteristics of the original study participants (such as adults, children, healthy volunteers, or individuals with a particular disease), and findings of the original study.

Table 3-1 summarizes the counts of data used in this study (the total datasets requested is greater than total requests, since many individual requests mentioned more than one dataset). NHLBI could not provide identifying information about requestors for privacy reasons; therefore, analyses can be conducted at the dataset and institution levels, but not at the individual requestor level, for NHLBI. While dbGaP’s full dataset also includes requests that were rejected, this study considers (and Table 3-1 reflects) only requests that were accepted. Future study on differences between requests that were accepted and those that were rejected may be fruitful, but this study considers data reuse, which of course did not occur in the case of requests that were rejected. Table 3-2 indicates the content included in the use requests by repository.

Table 3-1. Number of datasets, requestors, institutional affiliations, and use requests from each repository and overall.

                                   dbGaP      NHLBI    NIDDK    All repositories
Datasets                           1,014      146      77       1,237
Total requestors                   5,260      N/A      253      5,513
Total institutional affiliations   1,230      1,001    195      2,426
Total requests                     9,444      1,939    449      11,832
Total datasets requested           104,326    3,864    562      108,752

Table 3-2. Contents of use requests by repository (X indicates the repository contains the item).

Columns: dbGaP, NHLBI, NIDDK
Requestor name: X X
Requestor institution/affiliation: X X X
Dataset(s) requested: X X X
Date of request: X X X
Reuse summary: X X
Technical research use statement: X
Non-technical research use statement: X

I acquired the data for analysis through a combination of web-scraping from the public sites (that is, writing a script to automatically fetch and parse the data) and requesting the data from the repositories. I requested data on use requests and dataset descriptions from NIDDK and NHLBI, and they provided this information in a set of comma-separated value (CSV) files. dbGaP staff were not able to provide the information I requested due to staffing limitations and time constraints. However, all the information needed for these analyses is publicly available on the dbGaP website, so I was able to obtain the necessary data from the site. Rather than manually download all the metadata and use requests, I wrote an R script that automatically downloaded the dbGaP use requests and dataset metadata. This web-scraping process was accomplished using the R packages httr (version 1.3.1) and rvest (version 0.3.2) (Wickham, 2016, 2017a).

Once I obtained all the data, I wrote various custom R scripts to clean, organize, and visualize the data to prepare it for coding and analysis, incorporating existing functions from the R tidyverse package (version 1.2.1) (Wickham, 2017b). Except where noted otherwise, all code is written in R version 3.4.1 and run in RStudio version 1.0.143. All code used for data collection, cleaning, and analysis is available at https://github.com/informationista/integrative_paper.

3.2 Data Preparation and Analysis

3.2.1 Research Question 1: For what research objectives are biomedical datasets reused?
Requests to dbGaP and NIDDK included not only general information about who was making the request and what data they were requesting, but the actual text of the request itself. These requests provide an overview of how the requestor intended to reuse the dataset, written with enough detail to enable the repositories’ Data Access Committee (or equivalent body) to make a determination about whether the request constituted valid and appropriate reuse. These detailed requests provide a rich corpus from which to draw information about how data are intended to be reused. Specifically, I consider the type of reuse and the similarity of the reuse to the topic for which the data were originally collected.

I manually coded requests for the type of reuse from a taxonomy drawn from existing literature and validated in my previous research on the use requests in this dataset (Borgman, 2011; Coady et al., 2017; Federer, 2018; Pasquetto et al., 2017). I also inductively added categories as needed for cases that did not fit within the taxonomy. For example, initial coding revealed that some of the use requests asked for data to include in a larger database for general use, which did not fit any of the existing categories. Therefore, I added the category of “infrastructure” to describe this type of reuse. Table 3-3 describes and defines the categories used in this analysis.

Table 3-3. Coding categories and their definitions.
Category: Definition
Original research study: use of a single dataset to answer a new research question, distinct from the specific question for which the data were originally collected
Meta-analysis study: aggregation or integration of the dataset with other datasets to answer a research question or conduct a formal meta-analysis
Statistical methods study: use of one or more datasets to develop or verify new statistical methodology
Software or tool development study: use of one or more datasets to develop, test, or validate a new software product or analysis tool
Validation: use of one or more datasets to validate other findings, such as validating findings from an animal model in human subjects
Comparison or control: use of one or more datasets to validate the investigator’s own data, provide comparison, or serve as a control group
Reproducibility or reanalysis study: reanalysis of one or more datasets to answer the same question for which the data were originally collected or to verify the original study’s findings
Infrastructure: use of one or more datasets to populate a database or repository for internal or institutional use

Of the 449 unique requests for NIDDK datasets, 17 were missing the executive summary that contained information about how the datasets would be reused. The remaining 432 unique requests had executive summaries, a number small enough to permit me to code all the requests. This total population sampling has the benefit of avoiding sampling error and providing a richer understanding of the phenomena of interest (Etikan, Abubakar Musa, & Sunusi Alkassim, 2016; Thygesen & Ersbøll, 2014). However, dbGaP datasets had 9,444 unique requests, too many for me to feasibly code manually. Therefore, for the dbGaP analysis, I randomly selected a subset of 1,500 of the 9,444 requests (15.9%), which provides a confidence interval of +/-1.1 at a 95% confidence level (based on estimation of proportion).
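The sampling step described above can be illustrated with a short sketch. The author’s own scripts were written in R; what follows is an independent Python illustration, not the code used in the study. The function names, the random seed, and the sequential request identifiers are hypothetical. The margin-of-error function uses the standard normal approximation for a proportion, with an optional finite population correction. Because this section does not state the proportion assumed in the reported interval, the proportion is left as a parameter, and the sketch is not expected to reproduce the reported figure exactly.

```python
import math
import random


def draw_sample(population_ids, sample_size, seed=0):
    """Draw a simple random sample without replacement.

    The seed is a hypothetical choice that makes the draw repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(list(population_ids), sample_size)


def margin_of_error(sample_size, proportion=0.5, z=1.96, population_size=None):
    """Half-width of a confidence interval for an estimated proportion.

    z = 1.96 corresponds to a 95% confidence level. The assumed proportion
    strongly affects the result; 0.5 is the most conservative choice.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size is not None:
        # Finite population correction: sampling a sizeable share of a
        # finite population narrows the interval.
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error


# Hypothetical sequential identifiers standing in for the 9,444 dbGaP requests.
request_ids = range(1, 9445)
sample = draw_sample(request_ids, 1500, seed=42)
print(len(sample), "requests selected for manual coding")
print("margin of error:", margin_of_error(1500, proportion=0.5, population_size=9444))
```

Fixing the seed means the same 1,500 requests are selected on every run, which matters when the manual coding is spread over several sessions.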
To identify the topic of reuse proposed in each request, I used an automated coding method rather than manual coding. Using an automated method has the benefit of applying systematic coding, not affected by human judgment, across the entire dataset. The use of an automated technique also allowed me to include the entire set of both dbGaP and NIDDK requests (9,444 and 432 requests, respectively) in this analysis, since I was not limited by what I could feasibly manually code.

MeSH On Demand is an automated tool that generates a list of Medical Subject Headings (MeSH) terms for each use request, including the specific organ systems, diseases, research techniques, and other topics that describe the content of a text by using the National Library of Medicine’s (NLM) Medical Text Indexer (MTI) (National Library of Medicine, 2018). MTI was originally developed to partly automate indexing of journal articles for inclusion in MEDLINE, the database of biomedical literature maintained by the NLM. Prior to the development of MTI, human indexers manually indexed all articles for MEDLINE; by 2014, MTI was being used in the indexing of over 60% of MEDLINE articles (Mork, Aronson, & Demner-Fushman, 2017). With advances in technology and the application of machine learning technologies to MTI, its precision and recall have improved since its original development in 2002 (Aronson, Mork, Gay, Humphrey, & Rogers, 2004; Mork, Yepes, & Aronson, 2013).

Even so, MTI is not as accurate as a human indexer, so I tested whether it would perform adequately for use in this study by comparing the terms it automatically generated with my own manual coding for ten randomly selected use requests from the repositories considered in this analysis (five from dbGaP and five from the NIDDK repository). For each of the requests, MTI assigned more terms than I did (mean 9.7 terms per use request for MTI compared to a mean of 4.7 for my own indexing).
My own indexing focused only on terms such as diseases, conditions, and organ systems, which are the categories of terms that are used to describe the original datasets, while MTI also picked up on concepts such as analytical methods and study populations. Considering only disease, condition, and organ system terms, the MTI terms and my own matched in all ten cases. Given that MTI’s sensitivity (in other words, its ability to identify all relevant terms) is similar to my human indexing, its lack of specificity (that is, its tendency to identify some irrelevant terms) does not present a problem for this study.

To help improve the accuracy of the MTI indexing, I also removed high-level terms related to study populations, such as Male, Female, Child, and Adult. Leaving extraneous terms in would not significantly affect the outcome of the analysis; as will be discussed, the algorithm that calculates similarity considers terms that come from two separate “branches” of the MeSH tree hierarchy to be entirely unrelated. Since the MeSH terms assigned to datasets almost exclusively covered diseases and organ systems, which are on separate branches of the MeSH tree from study population terms, the algorithm would consider these terms unrelated to the dataset terms. They would therefore be irrelevant to this analysis, since the similarity score is based on the most similar pair of terms. Leaving unrelated or extraneous terms in the MTI-produced term lists would therefore have no impact on the outcome of the analysis, but removing them did improve the efficiency of an already computationally intensive analysis, so I removed them.

Once MeSH terms were assigned for each request, these terms could be compared to the MeSH terms assigned by the repositories to the corresponding dataset in order to determine how closely the proposed reuse matches the original reason for which the dataset was collected.
This comparison is based on a technique called semantic similarity, which employs ontologies to calculate the relatedness of a set of terms (Pesquita, Faria, Falcão, Lord, & Couto, 2009). MeSH’s tree structure makes it possible to calculate semantic similarity between terms based on their relative positions in the hierarchy, where 0 means two terms are completely unrelated and 1 means they are identical (Gan, Dou, & Jiang, 2013; Garla & Brandt, 2012; Zhou et al., 2015). Figure 3-1 demonstrates the concept of semantic similarity in a small portion of the MeSH tree structure. Considering the term Heart Diseases, some of the terms in the tree are similar conceptually (for example, Vascular Diseases also affect the cardiovascular system), while others are completely unrelated (for example, Informatics is on a totally separate branch of the MeSH tree and has no conceptual relationship to Heart Diseases). Figure 3-1 shows the semantic similarity score (SSS) for each term to the index term of Heart Diseases.

All MeSH terms
    Diseases (SSS = 0.85)
        Cardiovascular Diseases (SSS = 0.95)
            Heart Diseases (SSS = 1)
            Vascular Diseases (SSS = 0.9)
        Eye Diseases (SSS = 0.85)
            Eye Infections (SSS = 0.783)
    Information Science (SSS = 0)
        Informatics (SSS = 0)

Figure 3-1. MeSH tree sample demonstrating semantic similarity. The number following each term is its semantic similarity score (SSS) to the index term of “Heart Diseases.”

I calculated semantic similarity using the shortest path algorithm in the R package MeSHSim (version 1.2.0; requires R version 3.2.1) (Zhou & Shui, 2015). I tested each of the nine algorithms that are implemented in this R package; all performed similarly in terms of how they relatively ranked similarity of terms, but the shortest path algorithm has the benefit of being on a 0 to 1 scale that enables straightforward interpretation of similarity (or lack thereof).
Both use requests and datasets can be tagged with multiple MeSH terms; the MeSHSim package returns results in the form of a matrix of similarities for all terms, as shown in the example in Table 3-4. I recorded the highest semantic similarity value for each use request/dataset pair (for example, in the case of the terms in Table 3-4, I would record the value 0.86, since the terms Lung and Cardiovascular System are most similar). Since the datasets and requests are both described by multiple terms, it is likely that many of the term pairs in the matrix will be 0, even if the dataset and request also share a term that is an exact match. For that reason, the use of the maximum rather than the mean score provides a better understanding of the similarity between the dataset and the request.

Table 3-4. An example matrix of semantic similarity scores between two sets of terms.

Request terms (rows) / Dataset terms (columns) | Ankle Brachial Index | Cardiovascular System | Intermittent Claudication | Peripheral Vascular Diseases
Lung | 0 | 0.86 | 0 | 0
Smoking | 0 | 0 | 0 | 0
Global Health | 0 | 0 | 0 | 0
Cohort Studies | 0.24 | 0 | 0 | 0
Biological Markers | 0 | 0 | 0 | 0
Pulmonary Disease, Chronic Obstructive | 0 | 0 | 0.62 | 0.65

Together, the analyses of manually-coded reuse types and machine-coded topics provide insight into the uses for which data are being requested. For example, are most datasets being used in the context of the same topic for which they were originally collected, as measured by semantic similarity? Are multiple datasets being combined to derive additional findings that would not have been possible using a single dataset on its own? Are genomic datasets of the type found in dbGaP reused in different contexts or ways than clinical datasets of the type found in NIDDK? These findings provide a clearer view of biomedical data reuse that will contribute to understanding the impacts of shared datasets.
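The maximum-score rule described above can be illustrated with the values from Table 3-4. The following Python sketch is an illustration only, not the code used in the study (which relied on the R package MeSHSim); the names `similarity`, `dataset_terms`, and `best_pair` are hypothetical.

```python
# Similarity scores from Table 3-4: one row per request term,
# one column per dataset term (same column order as dataset_terms).
dataset_terms = [
    "Ankle Brachial Index",
    "Cardiovascular System",
    "Intermittent Claudication",
    "Peripheral Vascular Diseases",
]
similarity = {
    "Lung": [0, 0.86, 0, 0],
    "Smoking": [0, 0, 0, 0],
    "Global Health": [0, 0, 0, 0],
    "Cohort Studies": [0.24, 0, 0, 0],
    "Biological Markers": [0, 0, 0, 0],
    "Pulmonary Disease, Chronic Obstructive": [0, 0, 0.62, 0.65],
}


def best_pair(matrix, columns):
    """Return (score, request_term, dataset_term) for the highest-scoring pair."""
    return max(
        (score, request_term, columns[i])
        for request_term, row in matrix.items()
        for i, score in enumerate(row)
    )


score, request_term, dataset_term = best_pair(similarity, dataset_terms)

# The mean is dragged toward 0 by the many unrelated term pairs,
# which is why the maximum is recorded instead.
all_scores = [s for row in similarity.values() for s in row]
mean_score = sum(all_scores) / len(all_scores)

print(score, request_term, dataset_term)
print(round(mean_score, 3))
```

In this example the maximum identifies the Lung / Cardiovascular System pair, while the mean across all 24 term pairs is below 0.1 even though several pairs are clearly related.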
Given the concern that many researchers have about others “scooping” their work if they share their data, the answers to these questions may also have implications for researchers’ attitudes toward sharing.

3.2.2 Research Question 2: What are the demographics of researchers who reuse existing datasets?

To better understand the types of requestors who are reusing data, I manually coded the use requests with demographic information about the requestor. First, for each unique institution, I recorded the latitude and longitude of the institution’s city to determine where requests are originating. The latitude and longitude enabled me to use the data with R packages that rely on geocoded data for visualization, both at the international level and at the state level within the United States. The institution name was available for all 9,444 dbGaP requests, all 1,939 of the NHLBI requests, and 255 of the 449 NIDDK requests (57%). The large number of missing institutions in the NIDDK datasets is due to differences in the repository’s systems prior to September 2013 that resulted in some data about requests being unavailable; therefore, this analysis reflects only the most recent six years of use.

Raw counts of requests would not provide useful insight into which countries were making the most reuse of shared datasets, since research activities are not evenly distributed around the world. For example, it would be reasonable that more requests would come from the United States (a large country with a sizeable research enterprise) than from, say, Liechtenstein (one of the smallest countries in the world). Therefore, rather than use raw counts, I compared the number of requests coming from a geographic region to its research presence. Research presence is difficult to quantify, since research is conducted within many different organizations, including academic institutions, government agencies, non-profit organizations, and private research corporations, to name a few.
For international-level comparisons, I used number of universities as a proxy for research presence. For state-level comparisons, I used NIH funding received within each state in Fiscal Year 2018 (the most recent year for which complete funding data are available). This state-level proxy likely provides a more accurate representation of research presence, since NIH funds are awarded not only to universities, but to other types of research institutions.

To compare a country’s (or state’s) research presence to the number of use requests its researchers make to each repository, I calculated the relative difference in composition (RDC). RDC is a measure of how over- or underrepresented a group is within a specific context compared to the composition of the entire population. For example, RDC has been used to measure underrepresentation of racial groups in gifted and talented education programs compared to their total presence in a school overall; a group that makes up 50% of the students in the whole school, but only 25% of the students in the gifted and talented program, is underrepresented (Ford, 2014). I calculated RDC for countries and for states within the United States to determine whether certain geographic regions are making more requests than might be expected based on their research presence. I did this analysis for each repository individually to determine whether there was variation in where requests were concentrated for the different repositories.

To better understand who is reusing data, I also coded each unique request with the requestor’s career stage at the time of the request. To determine career stage, I located web resources that documented requestors’ careers, such as LinkedIn, CVs, biosketches, and web pages. Where I could not definitively determine a requestor’s career stage using available online materials, I coded the career status as unknown.
Because a single requestor may have made multiple requests across his or her career, I recorded the career stage for each unique request. For example, a requestor may have been an assistant professor when she made her first request in 2013, but she had received tenure and was an associate professor by the time she made her next request in 2016. I converted non-United States job titles to their United States equivalent to allow for comparison across countries. For example, in many Commonwealth countries such as the United Kingdom and Australia, the term “lecturer” is the equivalent of assistant professor in the United States (Wikipedia, 2018).

Because NHLBI did not provide the names of individual requestors, I limited this analysis to the dbGaP and NIDDK requests. Of the 449 unique NIDDK requests, 286 included the requestor’s name (64%). As with institution name, the requestor names were missing from the oldest requests (in the case of requestor name, those made before December 2012). The 9,444 requests to dbGaP came from 5,260 unique requestors. As with coding for reuse type, locating career status information for so many requestors was not feasible, so I coded a subset of 1,500 of the 9,444 requests (15.9%), which provides a confidence interval of +/-1.1 at a 95% confidence level (based on estimation of proportion).

As with the distribution of research across different geographic areas, the distribution of researchers across career stages is not even. More requests may come from assistant professors simply because more researchers are at this career stage, and not because they are actually making more requests than researchers at other stages. Therefore, I took the same approach of calculating relative difference in composition between the proportion of individuals at a career stage overall and the proportion of requests coming from individuals at this career stage.
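The relative difference in composition used in both the geographic and career-stage comparisons can be sketched as follows. This is a minimal Python illustration that assumes the common formulation of RDC, namely the difference between a group's share in the context of interest and its share in the overall population, divided by its share in the overall population. The function name and the region figures are hypothetical; only the 25% versus 50% example comes from the text.

```python
def relative_difference_in_composition(group_share, population_share):
    """Percent by which a group's share in a context differs from its overall share.

    Negative values indicate underrepresentation, positive values
    overrepresentation. Assumes the (context - overall) / overall formulation.
    """
    if population_share <= 0:
        raise ValueError("population_share must be positive")
    return (group_share - population_share) / population_share * 100


# Example from the text: a group is 25% of the gifted and talented program
# but 50% of the school overall.
gifted_rdc = relative_difference_in_composition(0.25, 0.50)

# Hypothetical region (made-up numbers): 30% of use requests,
# 20% of research presence.
region_rdc = relative_difference_in_composition(0.30, 0.20)

print(round(gifted_rdc, 1), round(region_rdc, 1))
```

Under this formulation, the group in the gifted and talented example is underrepresented by half relative to its presence in the school, while the hypothetical region makes more requests than its research presence alone would suggest.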
For non-academic career stages, determining the number of individuals in a given career stage, such as “senior scientist” or “executive,” would be nearly impossible, since these individuals are employed in so many different types of institutions. However, for academic requestors, this analysis is possible, since the National Center for Education Statistics tracks counts of full-time faculty in US degree-granting postsecondary institutions (National Center for Education Statistics, 2017).

One unavoidable limitation of this approach is that the person who requested the data may not actually be the person who used the data. For example, a junior lab member may request data on behalf of his or her principal investigator, or a professor may request data on behalf of a student. Future survey research of dataset requestors could help elucidate the extent to which the data requestor and the data reuser differ.

3.2.3 Research Question 3: Are there temporal patterns to dataset requests?

The repositories in this study contain many years’ worth of datasets, some dating back to the early 2000s, and records of requests dating back almost as long. With many years’ worth of request data available, it is possible to track the dynamics of requests over a dataset’s lifetime to better understand when datasets are most requested. Further, understanding temporal patterns to dataset requests could make it possible to predict early in a dataset’s life how much use it would receive in the long term, which could be useful in making curation and preservation decisions. Knowing how long a dataset remains useful could also influence preservation decisions: if datasets are generally no longer requested once they reach a certain age, it may be reasonable to discard them.

This inquiry into requests to datasets over time is similar to the study of citation dynamics within bibliometrics, which considers the numbers of citations an article receives over time.
Part of this exploration involves mapping “citation bursts,” or the time it takes for articles in a given field to reach their peak annual citation before citations begin to decline (Eom & Fortunato, 2011). The literature also contains explorations of unique or unusual citation dynamics, such as descriptions of the dynamics of “sleeping beauties” (articles that receive few citations for many years and then suddenly attract significant attention) and “flashes in the pan” (articles that receive a great deal of initial attention, which quickly dies down) (Li, 2014; van Raan, 2004). These explorations provide a basis upon which to begin to explore temporal patterns of dataset requests.

As has been previously discussed, NIDDK’s move to a different system in September 2013 means that the year of release for datasets prior to that date is unknown. Removing all datasets from before September 2013 left too few datasets for this analysis. Therefore, for this analysis I used dbGaP, which contains 982 datasets with a total of 100,115 requests, and NHLBI, which contains 143 datasets with a total of 3,860 requests.

For each dataset, I aggregated the number of requests it had received each year. I also calculated the dataset’s age at the time of request, enabling comparison across datasets at the same age, regardless of when they were released. If dataset requests are a cumulative advantage process, with success breeding success, then datasets that are older are likely to receive more requests in a given calendar year than those that are younger. For example, consider a dataset released in 2010 and one released in 2016. The older dataset has had an additional six years to accrue advantage, so if we compare the number of requests each received in calendar year 2017, it is likely that the 2010 dataset would receive more than the 2016 dataset.
However, considering how many requests each received in the first year after they were released provides a more meaningful basis for comparison.

Once the number of requests per year of a dataset’s life was calculated, I divided the data within each repository into groups. First, I divided the datasets into tiers based on their percentile ranking of total requests over time, that is, the top 10% most requested, the next 10% most requested, and so on. To better control for age of dataset, I also calculated the mean percentile ranking over the course of a dataset’s life. For example, if it was in the 20th percentile of first year requests, the 30th percentile of second year requests, and the 40th percentile of third year requests, its mean percentile ranking is the 30th percentile. I divided the mean percentile rankings into quartiles. I then plotted both the overall request deciles and the mean request quartiles to visualize the pattern of requests for datasets of varying levels of attention based on requests.

In addition to understanding patterns of requests over time, I also aimed to determine whether the number of requests a dataset received early in its life was predictive of how many requests it would receive over the long run. That is, is a dataset that receives many requests in its first year likely to go on to receive more requests than a dataset that is less requested soon after its release? I tested this by fitting three regression models to the dbGaP and NHLBI datasets, looking at the relationship between total requests and first-year requests only; first- and second-year requests; and first-, second-, and third-year requests, controlling for year of release for all three models. These models provide an understanding of the extent to which requests in the first three years of a dataset’s life can be used to potentially predict the number of requests it will go on to receive.
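The age-based aggregation and the mean percentile ranking described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis code: the function names and request years are hypothetical, a dataset's age is assumed to be 1 in its release year, and the percentile rank is assumed to be the percentage of values less than or equal to a given value.

```python
from collections import Counter
from statistics import mean


def requests_by_age(release_year, request_years):
    """Count requests by dataset age, assuming age 1 is the release year."""
    counts = Counter(year - release_year + 1 for year in request_years)
    return dict(sorted(counts.items()))


def percentile_rank(value, all_values):
    """Percent of values in all_values that are less than or equal to value."""
    return 100 * sum(v <= value for v in all_values) / len(all_values)


# Hypothetical request years for a dataset released in 2010 and one in 2016.
older = requests_by_age(2010, [2010, 2010, 2011, 2017, 2017, 2017])
newer = requests_by_age(2016, [2016, 2017, 2017])

# Comparing at the same age (first year of life) rather than the same
# calendar year puts the two datasets on an equal footing.
first_year_older = older.get(1, 0)
first_year_newer = newer.get(1, 0)

# Example from the text: 20th, 30th, and 40th percentile in years one to three.
mean_percentile = mean([20, 30, 40])

print(older, newer, first_year_older, first_year_newer, mean_percentile)
```

Indexing requests by age rather than calendar year is what allows first-year requests to be compared across datasets released in different years.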
Because dynamics and temporal patterns of dataset requests have not yet been studied, my primary aim here was to determine whether patterns do in fact exist, and if so, the general dynamics of requests over time. This study provides an initial view of the temporal patterns within dataset requests that can be expanded based on request dynamics. This analysis also demonstrates the extent to which dataset requests can be considered a cumulative advantage process.

3.2.4 Research Question 4: Are there dataset topics that are more highly requested?

The time-based methods described above provide insight into patterns of how datasets are requested over time and whether cumulative advantage processes and attention decay effects influence how many requests datasets receive. However, these models likely do not fully account for the reasons why some datasets are more highly requested than others. Previous studies have explored researchers’ decision-making processes related to choice of and satisfaction with datasets, but the factors identified in these studies are subjective and would be difficult to measure in the context of this study. For example, opinions about dataset credibility would likely differ significantly among dataset requestors, so it would be difficult to develop a method to quantify credibility as a factor. Reputation of the data creator is also a factor in data reuse; even if there were an objective measure of reputation, many of these datasets have been collected by large, multi-site consortia with many individuals involved, and some of the datasets do not list who originally collected the data at all. In the absence of robust and reliable methods for quantifying these subjective measures, it is necessary to look to the datasets themselves to understand why some are more highly requested.
The repositories considered in this study do include some basic metadata about the dataset, such as the number of subjects in the dataset and the dates of data collection. However, this metadata is sparse and provides little useful insight into the content of the dataset itself. In addition, the content of the metadata differs across the three repositories, making it challenging to identify patterns that would hold for biomedical data reuse broadly, rather than being specific to an individual repository.

More useful than this basic metadata is the narrative description of the dataset, which can be meaningfully explored using text-mining methods. At its most basic level, text mining is useful in understanding the contents of a document by identifying the terms that are most central based on frequency (Hotho, Andreas, & Paaß, 2005). This simple approach considers a document as a “bag of words,” simply counting the number of times a given word appears without consideration of its context within the text (Y. Zhang, Jin, & Zhou, 2010). More advanced topic modeling techniques make it possible to identify complex latent topics in a text by counting words in their broader context, such as considering n-grams (a set of n words appearing together in sequence), sentences, or paragraphs (Blei, Ng, & Jordan, 2003). These topic modeling techniques are considered “unsupervised,” in that the algorithm simply identifies patterns of words within a corpus that frequently appear together in texts, and it is up to a human subject matter expert to determine the topic it describes.
For example, the algorithm might determine that the terms “myocardial infarction,” “hypertension,” and “cardiac output” form a topic in a corpus; a human interpreter would then be able to determine that texts containing this topic could be described as being about “cardiovascular disease.”

Text mining is especially useful in this analysis because it allows for the detection of patterns in the data even when potentially important features are not known in advance, and it has the benefit of being able to account for a wide range of features that are not captured in the metadata. For example, since data descriptions include information such as the specific brand of the sequencing machine and the study methodologies, these methods will be able to take into account whether these features are characteristic of reuse.

Text mining techniques are also a practical method here because they have been demonstrated to be useful in various bibliometric applications, which, as has been discussed, are similar to the type of inquiry being conducted here. For example, topic modeling techniques have been used to successfully identify high impact articles, with significant correlation to article citation counts (Gerrish & Blei, 2010; Mann, Mimno, & McCallum, 2006). Text mining has also been used to detect similarities between patent documents and scientific articles (Magerman, van Looy, & Song, 2010); while I did not use that technique in this study, this approach could have potential future applications for detecting similarities between dataset descriptions and the associated reuse requests.

The narrative study descriptions from each of the three repositories formed the corpus for text mining, specifically using a topic modeling approach. This analysis includes the descriptions of 1,150 datasets from dbGaP, 166 datasets from NHLBI, and 140 datasets from NIDDK.
I wrote a script that retrieved dataset descriptions from each dataset’s web page, then prepared the texts using standard text mining pre-processing techniques incorporated in the R text mining package tm (version 0.7-3) (Meyer, Hornik, & Feinerer, 2008). These techniques included converting all text to lowercase (since R is case-sensitive); removing common English language stopwords such as “the” and “and”; stemming, which converts various forms of a word to their common root (for example, “genetic,” “genetically,” and “genetics” would all be collapsed to “genetic”); and trimming of white space and special characters. The dataset descriptions, particularly those from the same repository, are all somewhat homogeneous in their use of certain scientific words that would not be contained in the English language stopword list, but that would not be informative about the content of the description, such as “study” and “subject.” Therefore, I also removed a custom set of stopwords that appeared almost universally in the descriptions and provided no useful context about the topic of the dataset; this list is in Appendix B.

Once the texts were prepared, I proceeded to develop topic models for each repository using latent Dirichlet allocation (LDA) implemented in the R package topicmodels (version 0.2-7). To understand the LDA model, consider a set of documents from a corpus. Most documents do not have a single topic, but several; for example, a description of a dataset in dbGaP has the topic genetics, as well as the topic of whatever disease or condition it is studying. The topic genetics, in turn, has a number of words that are associated with it, such as “genetic,” “sequence,” and “genome.” LDA fits a mathematical model to “[find] the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document” (Silge & Robinson, 2018, para. 5).
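The pre-processing steps described above might look roughly like the following Python sketch. It is an illustration only: the study used the R package tm, the stopword sets here are tiny stand-ins (the full custom list is in Appendix B), stemming is omitted, and dropping digits along with special characters is an assumption. The function name and sample sentence are hypothetical.

```python
import re

# Tiny illustrative subsets, not the lists used in the study.
ENGLISH_STOPWORDS = {"the", "and", "of", "in", "a", "to"}
CUSTOM_STOPWORDS = {"study", "subject"}  # the two examples named in the text


def preprocess(text):
    """Lowercase, strip non-letter characters, and drop stopwords.

    Stemming is deliberately left out of this sketch, so inflected forms
    such as "subjects" are not matched by the stopword "subject".
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # assumption: digits are dropped too
    tokens = text.split()                  # also collapses extra white space
    stopwords = ENGLISH_STOPWORDS | CUSTOM_STOPWORDS
    return [token for token in tokens if token not in stopwords]


tokens = preprocess("The Study of Genetic Variants in 1,200 Subjects!")
print(tokens)
```

Because this sketch skips stemming, "subjects" survives the custom stopword filter; with a stemming step that reduces it to "subject" before stopword removal, it would be dropped as well. This shows why the order and completeness of pre-processing steps matter.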
The topicmodels package will generate a list of the terms most highly associated with each topic, as well as calculating the probability that a term is predictive of a given topic. For example, the term “carcinoma” would have a higher probability of being associated with the topic of cancer than the topic of cardiovascular disease. The word “topic” should be understood broadly here, not just to refer to the disciplinary focus of a dataset, but to also potentially draw on other concepts contained in the descriptions, such as study type or characteristics of subjects.

The application of the LDA model does require a fair amount of judgment on the part of the human programmer. For example, the choice must be made whether to use individual words as the “token” or unit of analysis (the bag of words approach) or group words into n-grams with their nearest neighbors. In a simple text with a basic vocabulary, the bag of words approach may be effective, but more technical texts might use many multi-word phrases, which would not be reflected if using simple counts of single words. Therefore, achieving meaningful results requires experimenting with using single words, bigrams (word pairs), or trigrams (word triplets).

In addition, the human must determine the number of topic groups into which to divide the corpus. While there are some statistical methods that can aid in identifying the optimal number of topic groups, achieving meaningful topics largely relies on human judgment. The process of determining the number of topics is iterative, starting with the predicted optimal number of groups and experimenting with varying numbers until the most meaningful categories appear. In addition, it is up to the human to identify what the topics actually describe.
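The choice of token discussed above can be made concrete with a short Python sketch. It is illustrative only: the sample phrase and the `ngrams` helper are hypothetical, and the study itself worked in R.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical snippet of a technical description.
tokens = "myocardial infarction risk after myocardial infarction".split()

unigram_counts = Counter(ngrams(tokens, 1))  # bag of words
bigram_counts = Counter(ngrams(tokens, 2))   # word pairs

print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

With single words, "myocardial" and "infarction" are counted as two separate terms. With bigrams, the phrase "myocardial infarction" is counted as a single unit, which is closer to how a technical reader would understand the text.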
The topicmodels package simply returns a set of numbered topics and the words most highly associated with them; based on the mixtures of terms associated with the topic, I applied my subject matter knowledge to determine what the topic describes.

Once the datasets were organized into topics, I determined which topics were most requested based on the number of use requests to datasets in each topic. I looked not only at overall counts, but at counts by year, to determine whether the most popular topics changed over time. A problem here is that the datasets are not divided evenly among the topics. For example, in Figure 3-2, Topic A contains 4 datasets while Topic B contains only half as many. Topic A would be reasonably expected to have more requests than Topic B, not necessarily because it is more popular, but because it contains more datasets to receive requests. The fact that Topic B has actually received more requests than Topic A despite having fewer datasets must also be accounted for; not only does it have more requests, but it has them despite having half as many datasets to be requested.

Figure 3-2. Demonstration of analyzing topics and requests.

To solve the problem of comparing topics of uneven size, I compared the proportion of datasets in a topic to total datasets in the repository with the proportion of requests received by the topic to total requests received by the repository. For example, Topic A contains 4 datasets of the 6 datasets total (0.67) and 70 requests of the 192 requests total (0.36). That is, it contains 67% of the total datasets but has received only 36% of the total requests. By comparison, Topic B contains only 33% of the datasets but received 64% of the requests. This analysis makes it possible to compare topics’ requests even when the datasets are unevenly distributed among them, to determine which topics are most highly requested.
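The Topic A and Topic B comparison can be reproduced with a few lines of Python. This is a minimal sketch using the counts given in the example (Topic B's 2 datasets and 122 requests follow from the stated totals of 6 and 192). The variable names are hypothetical, and the final ratio of request share to dataset share is an added illustrative summary, not a measure defined in the text.

```python
# Counts from the worked example: datasets and requests per topic.
topics = {
    "Topic A": {"datasets": 4, "requests": 70},
    "Topic B": {"datasets": 2, "requests": 122},
}

total_datasets = sum(t["datasets"] for t in topics.values())
total_requests = sum(t["requests"] for t in topics.values())

shares = {}
for name, counts in topics.items():
    dataset_share = counts["datasets"] / total_datasets
    request_share = counts["requests"] / total_requests
    shares[name] = {
        "dataset_share": dataset_share,
        "request_share": request_share,
        # Illustrative summary (an assumption, not from the text):
        # values above 1 mean more requests than the topic's size predicts.
        "ratio": request_share / dataset_share,
    }

for name, s in shares.items():
    print(name, round(s["dataset_share"], 2), round(s["request_share"], 2),
          round(s["ratio"], 2))
```

Comparing shares rather than raw counts shows that Topic B draws a disproportionately large share of requests for its size, while Topic A draws a disproportionately small one.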
3.3 Limitations

It is important to note that the scope of this study limits its generalizability not only beyond biomedical data, but also beyond the three repositories considered here. Analyzing each repository separately from the others makes it possible to gain insight into the extent to which biomedical repositories differ from each other, such as whether genomic datasets are reused differently from clinical datasets. Still, caution should be used in generalizing results, and further research should examine whether the findings of this study hold for other repositories, data types, and disciplines.

The repositories considered here are also somewhat unusual in that they are restricted access repositories. Because of the limitations I have described, it is difficult or even impossible to know who is using data from truly open repositories and in what ways. Counts of dataset views and downloads provide limited insight into the deeper questions about dataset reuse considered here. At present, use requests are one of the few robust ways to operationalize data reuse, so the limitations associated with these findings are difficult to avoid. However, efforts currently underway in the scientific community to standardize data citation will likely enable better automated tracking of data reuse over time, including data from both restricted access and fully open repositories. As data citation standards mature, future research may be able to address questions about differences in reuse of different types of data and repositories.

Chapter 4: Findings About Requests and Requestors

This chapter presents the findings for the two research questions that focus on requests and requestors, or the who, where, and why of biomedical data reuse: who is reusing biomedical data, from where in the world requests come, and why datasets are reused.
Specifically, the research questions and hypotheses considered here are:

Research Question 1: What are the purposes and characteristics of biomedical research reuse?
Research Question 1.1: For what methods and analysis types are datasets reused?
Hypothesis 1.1: Genomic datasets of the type found in dbGaP will be more likely to be used in combination in meta-analyses, while clinical datasets of the type found in the NIDDK repository will be more likely to be used on their own to answer an original research question.
Research Question 1.2: How closely are the topics for data reuse aligned with the topics for which the data were originally collected?
Hypothesis 1.2: Similarity between original topics and topics of reuse will be lower for genomic data (found in dbGaP) than for clinical data (found in the NIDDK repository).
Research Question 2: What are the demographics of researchers who reuse existing datasets?
Research Question 2.1: Where are requestors located in the world?
Hypothesis 2.1: Requestors will be primarily located in regions with a greater proportion of research institutions, including North America, Europe, and Asia.
Research Question 2.2: Are there patterns in career stage of requestors?
Hypothesis 2.2: A broad range of career stages, from student to full professor (or equivalent), will be represented.

4.1 Research Question 1: For what research objectives are biomedical datasets reused?

Biomedical research is a large umbrella that encompasses many different research methodologies on a range of topics, from efforts aimed at understanding the very building blocks of life to specific trials on the efficacy of various types of therapies. The types of data that make up biomedical research data are similarly diverse, as are their potential applications. Even when datasets seem very specific in their scope and application, the potential often exists for researchers to reuse data in new and sometimes unexpected ways.
In fact, as data science methodologies advance, biomedical research data has potential for researchers who might not even be considered biomedical researchers, such as computer scientists who need test data to develop and validate new algorithms or statisticians who can use existing data to pioneer new statistical approaches.

Here, I aim to better understand how researchers are making use of data available through the dbGaP and NIDDK repositories by examining the descriptions that are submitted as part of a potential reuser’s request to access the data. These descriptions, intended for evaluation by repository staff responsible for determining whether the use is appropriate, contain details about the specific research questions researchers intend to explore with the dataset. In this section, I use a combination of qualitative analysis and computational indexing methods to understand the types of research conducted using these datasets, as well as the topics of reuse, including how similar (or different) they are from the original data use.

This analysis draws on data from two repositories; NIDDK provided me a spreadsheet containing details of use requests, including the proposed use, and I wrote an R script to retrieve requests from the dbGaP website. The NIDDK requests cover the period between 2005 and 2018, while dbGaP includes 2007 to 2018. NHLBI does not make their full use requests public; they provided me with summary information about requests, but they did not share identifying information about requestors or the text of proposed uses. Therefore, this analysis does not include NHLBI requests.

4.1.1 Research Question 1.1: For what methods and analysis types are datasets reused?

To determine the purpose for which researchers intend to use requested datasets, I analyzed the descriptions of reuse that are included in the request submission.
I hypothesized that the types of reuse described for datasets in dbGaP, which contains primarily genetic data, would differ from those in NIDDK, which contains primarily clinical data. Given that studies using genetic data typically require a large number of subjects to achieve adequate statistical power (Hong & Park, 2012), I expected that dbGaP would have more requests to use datasets in meta-analyses (that is, in combination with other data). On the other hand, clinical data of the type found in the NIDDK repository can be difficult to combine with other datasets because of nuances of how individual researchers or teams collect the data, so I expected that these datasets would more likely be used on their own to answer an original research question.

After reading the proposed use for each request, I classified the request according to the type of reuse. The categories were based on review of the relevant literature, with the addition of new categories when needed for use requests proposing an activity not covered by an existing category, and include the eight types described in Table 4-1.

Table 4-1. Coding categories and their definitions.
Category | Definition
Original research study | use of a single dataset to answer a new research question, distinct from the specific question for which the data were originally collected
Meta-analysis study | aggregation or integration of the dataset with other datasets to answer a research question or conduct a formal meta-analysis
Statistical methods study | use of one or more datasets to develop or verify new statistical methodology
Software or tool development study | use of one or more datasets to develop, test, or validate a new software product or analysis tool
Validation | use of one or more datasets to validate other findings, such as validating findings from an animal model in human subjects
Comparison or control | use of one or more datasets to validate the investigator’s own data, provide comparison, or serve as a control group
Reproducibility or reanalysis study | reanalysis of one or more datasets to answer the same question for which the data were originally collected or to verify the original study’s findings
Infrastructure | use of one or more datasets to populate a database or repository for internal or institutional use

Each of the requests may ask for more than one dataset, and I report the findings at the dataset rather than the request level. For example, if I code a request as being a meta-analysis, and it asks for 200 datasets, 200 instances of meta-analysis are added to the tally. This treats each dataset request as its own unit; even though a requestor may use the same request text for more than one dataset, each dataset’s request should still be counted. Appendix A provides examples of use requests in each category from dbGaP and NIDDK.

The determination of how to categorize each dataset was based on a number of factors, including the number of datasets included in the request (more than one requested dataset would suggest a meta-analysis) and the inclusion of phrases that explicitly named a reuse type (e.g.
“we propose to validate findings from our own colorectal cancer studies” or “our goal here is to perform a meta-analysis of densely sequenced genomes” – emphasis added) or keywords that likewise identified a reuse type (e.g. “we develop a Bayesian hierarchical model,” with Bayesian referring to a statistical approach or “we are currently evaluating the performance of our mutation detection pipeline,” where a pipeline refers to a series of software tools used in sequence to conduct a specific analysis).

I also drew on my own extensive experience with biomedical research, including nearly ten years working in biomedical libraries, the last six of which were spent working closely with researchers at the National Institutes of Health and National Library of Medicine. In that capacity, I have served as a consultant to and collaborator with biomedical researchers, used my expertise in data science as a team member in “hackathons” aimed at using some of these same types of data to answer biomedical research questions, and developed and delivered training for other biomedical librarians interested in learning more about these skills. These experiences have given me a depth of understanding of research techniques and a familiarity with the vocabulary of the science described within these requests.

I also validated my coding by comparing my codes to those of two outside coders for a random subset of twenty requests (ten from each repository). Both coders have experience working with research of the type described in the use requests: one is an academic biomedical librarian who consults with researchers on issues related to biomedical data and computational reproducibility, and the other is an NIH fellow in data science and open science policy, who holds a doctoral degree in computational biology.
Their mean percent agreement with my codes was 72.5% (70% and 75%), which is considered Substantial agreement on Landis and Koch’s scale for Strength of Agreement (Landis & Koch, 1977). Most of the variability between their coding and mine was due to their use of the “comparison or control” code when I used “meta-analysis” or vice versa. These two types of reuse are similar, since they both refer to combining data.

The set of NIDDK requests included 416 requests from 252 unique requestors, requesting a total of 561 datasets. Each request asked for a mean of 1.3 datasets, with a minimum of 1 and a maximum of 10. For the dbGaP analysis, I randomly selected a subset of 1,500 of the 9,444 requests (15.9%), which provides a confidence interval of +/-1.1 at a 95% confidence level (based on estimation of proportion). This set came from 1,069 unique requestors and included requests for a total of 20,179 datasets. Each request asked for a mean of 13.5 datasets, with a minimum of 1 and a maximum of 398. Table 4-2 shows the number of requests in each reuse category and the percent of overall requests for requests to dbGaP and NIDDK.

Table 4-2. Counts and percentages of requests describing various types of reuse for NIDDK and dbGaP datasets.

Reuse type | dbGaP requests (N) | dbGaP requests (%) | NIDDK requests (N) | NIDDK requests (%)
Original research | 460 | 2.3% | 282 | 50.27%
Meta-analysis | 14,619 | 72.4% | 139 | 24.78%
Comparison | 858 | 4.3% | 2 | 0.36%
Validation | 221 | 1.2% | 14 | 2.5%
Statistics | 2,242 | 11.1% | 84 | 15.0%
Software | 1,097 | 5.4% | 14 | 2.5%
Infrastructure | 644 | 3.2% | 0 | 0%
Re-analysis | 11 | 0.05% | 2 | 0.36%
Reuse type not specified | 2 | 0.01% | 24 | 4.28%

Although some types of reuse are uniformly low for both dbGaP and NIDDK datasets, the most common ways that they are reused are very different from each other. A chi-squared test of independence confirms that the distributions of reuse between dbGaP and NIDDK are significantly different (χ² = 4547, df = 8, p < 0.01).
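The chi-squared statistic and degrees of freedom for a test of independence can be recomputed directly from the counts in Table 4-2. The following is a minimal sketch in plain Python, not the analysis code used for this study; the function name is illustrative, and the p-value (which needs the chi-squared distribution) is left out. Run on the Table 4-2 counts, the statistic should come out close to the reported value with 8 degrees of freedom.

```python
# Counts copied from Table 4-2, in the table's row order.
DBGAP = [460, 14619, 858, 221, 2242, 1097, 644, 11, 2]
NIDDK = [282, 139, 2, 14, 84, 14, 0, 2, 24]


def chi_squared_independence(rows):
    """Return (statistic, degrees of freedom) for a contingency table.

    `rows` is a list of equal-length lists of observed counts.
    Hypothetical helper written for illustration only.
    """
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            # Expected count if repository and reuse type were independent.
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


statistic, dof = chi_squared_independence([DBGAP, NIDDK])
print(f"chi-squared = {statistic:.0f}, df = {dof}")
```

Several expected counts for NIDDK are small (below 5), which is worth keeping in mind when interpreting a chi-squared approximation on this table.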
As hypothesized, original research is the most common reuse type for NIDDK datasets, but it is actually the fourth least common reuse type for dbGaP datasets. On the other hand, nearly three-quarters of dbGaP datasets are requested for use in a meta-analysis; while meta-analysis is still a significant category for NIDDK data reuse, it is much less common than for dbGaP. The greater frequency of meta-analyses in dbGaP and original research studies in NIDDK is also reflected in the very different number of datasets per request for these two repositories: on average, NIDDK requests ask for just 1.3 datasets to dbGaP’s mean of 13.5. A Welch unpaired two-sample t-test shows that the means of datasets per request for dbGaP and NIDDK are significantly different (t = 11.5, df = 1504, p < 0.001).

These variations are likely due to the differences in the types of data that dbGaP and NIDDK house. Genome-wide association studies, a common use of the dbGaP datasets, require a much larger sample size to achieve adequate statistical power than do clinical studies, and therefore several datasets may need to be pooled in order to have enough subjects for a study (Hong & Park, 2012). On the other hand, many of the NIDDK datasets are clinical datasets, which are often difficult to combine using meta-analytic techniques because different research teams collecting the original datasets often use their own unique ways of recording variables.

For example, many of the studies in NIDDK ask participants about their alcohol consumption habits, but they do so in ways that make it difficult to compare across studies. One such study, the Diabetes Prevention Program Outcomes Study (DPPOS), queries in specific detail, asking participants to recall how many “12 ounce bottles of beer,” “4 ounce glass of wine,” and “1.5 ounce shots of hard liquor or mixed drinks” they had consumed in the past seven days (Diabetes Prevention Program Outcomes Study, 2016).
Another, the Nonalcoholic Fatty Liver Disease (NAFLD) study, simply asks participants how many drinks they have on a typical day (Nonalcoholic Fatty Liver Disease (NAFLD) Adult Database, 2016). It is difficult to know if responses to these questions yield truly comparable results. Perhaps an NAFLD participant is in the habit of going to the pub for a pint of beer (16 ounces) every evening, and without this more specific guidance of “12 ounces,” will likely count each of these as one drink. When this NAFLD participant responds that he has seven beers a week, he has consumed 112 ounces of beer, or about 33% more than a DPPOS respondent who says she consumes seven 12-ounce bottles of beer a week (or 84 ounces). These two studies also differ on how they define binge drinking, with the DPPOS asking about how often the participant has had seven or more drinks in 24 hours, whereas NAFLD asks about how often the participant has had six or more drinks on one occasion. Even such seemingly inconsequential differences – six versus seven drinks, “on one occasion” versus in 24 hours – mean that different information is being elicited from participants. With many of these clinical studies having hundreds or even thousands of variables, these small differences can add up to significant challenges that prevent datasets from being combined for meta-analytic purposes.

4.1.2 Research Question 1.2: How closely are the topics for data reuse aligned with the topics for which the data were originally collected?

This analysis aims to quantify similarity between the original subject focus of shared datasets and the focus of the research for which requestors hope to reuse them. I hypothesize that the differences in genomic versus clinical research discussed above will also lead to differences in the similarity of reuse to original data purpose between dbGaP and NIDDK.
Given the broader applications of dbGaP data compared to the relatively specific applicability of NIDDK’s clinical datasets, I expect greater similarity between NIDDK datasets and their topics of reuse than for dbGaP datasets and their topics of reuse.

Medical Subject Heading (MeSH) terms provide a means by which to compute an objective measure of similarity between original use and reuse. These terms are used to describe medical literature consistently as well as to understand relationships between terms. Because MeSH terms are arranged in a hierarchical fashion in a tree structure, it is possible to calculate a measure of similarity between two terms, known as semantic similarity. Terms closer to each other in the hierarchy will have a high semantic similarity score, whereas terms that are far from each other on the tree will have a lower semantic similarity score. Exactly identical terms have a semantic similarity score of 1, whereas a semantic similarity score of 0 indicates that the two terms are not in any way topically related (since they are on totally different top-level branches in the 16-branch MeSH tree). Thus, comparing MeSH terms that are assigned to a dataset with MeSH terms assigned to a request for that data allows for a quantitative measure of similarity between the proposed reuse and the original dataset’s purpose.

Conveniently, datasets from dbGaP and NIDDK were classified by the repository with one or more MeSH terms. To determine MeSH terms for the requests, I used the MeSH On Demand tool, which utilizes the National Library of Medicine’s (NLM) Medical Text Indexer (MTI) to assign terms to a provided text. Given a reuse request description, the MeSH On Demand tool returns a list of relevant MeSH terms. I removed very general terms, such as “Human” and “Adult” from the list of returned MeSH terms, since these provided little useful context.
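To make the idea of tree-based semantic similarity concrete, the sketch below scores two MeSH-style tree numbers with a simple Wu-Palmer-style path measure: twice the depth of the shared prefix divided by the sum of the two depths. This is an illustrative stand-in, not the measure used for the scores reported in this chapter, so its values will not match those scores. The tree numbers are placeholders shaped like MeSH tree numbers, not looked-up assignments, and the function names are hypothetical.

```python
def tree_path(tree_number):
    """Split a MeSH-style tree number into levels.

    The leading letter is treated as the top-level branch, so
    'C08.381.495' becomes ['C', '08', '381', '495'].
    """
    head, *rest = tree_number.split(".")
    return [head[0], head[1:]] + rest


def path_similarity(a, b):
    """Wu-Palmer-style score: 2 * shared depth / (depth(a) + depth(b)).

    Illustrative assumption only; returns 1.0 for identical tree numbers
    and 0.0 when the two numbers sit on different top-level branches.
    """
    path_a, path_b = tree_path(a), tree_path(b)
    shared = 0
    for level_a, level_b in zip(path_a, path_b):
        if level_a != level_b:
            break
        shared += 1
    return 2 * shared / (len(path_a) + len(path_b))


# Placeholder tree numbers (not actual MeSH assignments).
print(path_similarity("C08.381.495", "C08.381.495"))  # identical terms
print(path_similarity("C08.381.495", "C08.381.600"))  # siblings
print(path_similarity("C08.381.495", "A04.411"))      # different branches
```

The sketch reproduces the two boundary properties described above: identical terms score 1, and terms on different top-level branches score 0. Terms that share a branch fall in between, with higher scores the deeper their shared ancestry.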
Once the terms had been assigned, I wrote an R script that would join the set of MeSH terms for a request with the set of MeSH terms for all the datasets included in the request. Since most datasets and requests had more than one MeSH term, the script calculated a semantic similarity score for each request/dataset term pair and recorded the highest score. Table 4-3 shows an example of a request/dataset pair from dbGaP with their terms and the semantic similarity score for each term.

Most of the term pairs have a semantic similarity score of 0, since they are on totally different top-level branches of the MeSH tree. Others have a small score because they are on the same branch, but far apart from each other. For example, Ankle Brachial Index and Cohort Studies are both on the top-level branch Analytical, Diagnostic, and Therapeutic Techniques, and Equipment. However, moving down the tree, they are far down on very distant branches from each other. On the other hand, Pulmonary Disease, Chronic Obstructive, is much closer to Intermittent Claudication and Peripheral Vascular Diseases on the Diseases top-level branch. Cardiovascular System and Lung are the closest to each other, just one level apart on the Anatomy branch. Because many of the terms are unrelated, recording the maximum score provides the best comparison of the similarity between the request and the dataset; for example, in this case, the mean would only be 0.03, compared to the maximum score of 0.86.

Table 4-3. Example semantic similarity scoring.

Request terms (rows) and dataset terms (columns) | Ankle Brachial Index | Cardiovascular System | Intermittent Claudication | Peripheral Vascular Diseases
Lung | 0 | 0.86 | 0 | 0
Smoking | 0 | 0 | 0 | 0
Global Health | 0 | 0 | 0 | 0
Cohort Studies | 0.24 | 0 | 0 | 0
Biological Markers | 0 | 0 | 0 | 0
Pulmonary Disease, Chronic Obstructive | 0 | 0 | 0.62 | 0.65

Semantic similarity scores were calculated for each request/dataset pair in dbGaP and NIDDK; NHLBI was not included in this analysis because they did not provide me the text of use requests.
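The step of recording the highest score across all request/dataset term pairs can be sketched as follows, using the scores shown in Table 4-3. The original script was written in R; this is a minimal Python illustration with hypothetical names, not that script.

```python
# Dataset terms (columns) and request terms (rows), as in Table 4-3.
DATASET_TERMS = [
    "Ankle Brachial Index",
    "Cardiovascular System",
    "Intermittent Claudication",
    "Peripheral Vascular Diseases",
]

SCORES = {
    "Lung": [0, 0.86, 0, 0],
    "Smoking": [0, 0, 0, 0],
    "Global Health": [0, 0, 0, 0],
    "Cohort Studies": [0.24, 0, 0, 0],
    "Biological Markers": [0, 0, 0, 0],
    "Pulmonary Disease, Chronic Obstructive": [0, 0, 0.62, 0.65],
}


def best_pair(scores, dataset_terms):
    """Return (request term, dataset term, score) for the highest-scoring pair."""
    best = (None, None, -1.0)
    for request_term, row in scores.items():
        for dataset_term, score in zip(dataset_terms, row):
            if score > best[2]:
                best = (request_term, dataset_term, score)
    return best


request_term, dataset_term, score = best_pair(SCORES, DATASET_TERMS)
print(request_term, "/", dataset_term, "->", score)
```

For this example the recorded score is the Lung / Cardiovascular System pair, the single closest match between the request and the dataset.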
The dbGaP dataset included 9,348 unique requests for 986 unique datasets, for a total of 92,523 request/dataset pairs. The NIDDK dataset included 544 unique requests for 65 unique datasets, for a total of 539 request/dataset pairs.

Figure 4-1 shows the distribution of maximum semantic similarity scores for dbGaP and NIDDK request/dataset pairs. The top part of the chart shows density at each score (i.e. the proportion of how many request/dataset pairs have that score). The horizontal boxplot below shows the distribution of maximum semantic similarity scores. Each of the points overlaying the boxplot corresponds to a single request/dataset pair at that score. Table 4-4 provides summary statistics.

Table 4-4. Summary statistics of semantic similarity scores for dbGaP and NIDDK request/dataset pairs.

Repository | Mean score | Number of pairs with score = 0 | Number of pairs with score = 1
dbGaP | 0.56 | 28,804 (31.1%) | 18,347 (19%)
NIDDK | 0.78 | 85 (15.8%) | 297 (55.1%)

Figure 4-1. Distribution of maximum semantic similarity scores for request/dataset pairs.

A Welch unpaired two-sample t-test shows that the means of the maximum semantic similarity scores for dbGaP and NIDDK request/dataset pairs are significantly different (t = -14.22, df = 546, p < 0.001). These differences suggest that dbGaP requestors are using data for topics that vary more from the original data topic than NIDDK requestors are. As hypothesized, NIDDK scores tended to be higher (more similar), while dbGaP scores tended to be lower (less similar). Over half of the NIDDK datasets had a score of 1, indicating that requestors intended to reuse the datasets for the same topic of research for which they had originally been collected. On the other hand, nearly a third of dbGaP datasets had a score of 0, suggesting that these datasets were being used in entirely novel contexts compared to the topic for which the data were originally collected.
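For readers unfamiliar with the Welch test used in this section, the sketch below computes its t statistic and the Welch-Satterthwaite degrees of freedom for two samples with unequal variances. It is a minimal illustration in plain Python, not the analysis code used here; the function name and the two toy samples are made up, and the p-value is omitted because it requires the t distribution.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator), scaled by sample size.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b

    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, dof


# Toy samples for illustration only; not data from this study.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t, dof = welch_t(group_a, group_b)
print(f"t = {t:.3f}, df = {dof:.2f}")
```

Because the degrees of freedom are driven largely by the smaller, more variable group, a comparison between a very large sample and a much smaller one can yield degrees of freedom close to the size of the smaller sample.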
4.1.3 Summary of Findings

These findings demonstrate that dbGaP and NIDDK datasets are being reused in very different ways from each other. dbGaP datasets were most often used in combination with other datasets to conduct meta-analyses, and they were more likely to be used for a topic that diverged from the original reason the data were collected. On the other hand, just over half of the NIDDK datasets were requested for use in an original research study, using a single dataset on its own. NIDDK datasets were also reused in contexts that were generally more similar to the reason for which the data had originally been collected.

The differences in reuse observed here are likely reflective of the very different types of data in the two repositories. dbGaP houses genetic sequence data; because of statistical issues associated with analyzing this type of data, very large sample sizes are required to achieve adequate statistical power and arrive at meaningful results (Hong & Park, 2012). A number of the dbGaP datasets contain genetic sequences of normal, healthy humans, which can serve as a useful comparison group for a researcher’s own set of sequences on a particular disease, since identifying where variations occur in the disease group but not in the healthy comparison group can elucidate genetic regions of interest.

In general, genetic sequence data provides more flexibility in its range of research applications than the type of clinical data collected in NIDDK. These clinical datasets tend to be more focused on a specific disease or condition and therefore have less broad applicability. Further, while genetic sequence data is largely standardized and therefore generally interoperable regardless of who collected it, the same is not true for clinical data, which is often recorded based on the specific practices of individual research teams, and therefore more difficult to analyze in combination with other datasets.
4.2 Research Question 2: What are the demographics of researchers who reuse existing datasets?

The repositories included in this study represent a valuable resource for the research community at large, regardless of a researcher’s country of origin or career status. A young assistant professor at a small university in South America is just as eligible to request data as an acclaimed full professor at an Ivy League university. However, just because both of these hypothetical researchers are able to request data does not necessarily mean that they do. Here, I aim to understand the demographics of researchers who request data by exploring the geographic distribution of requests and the career status of requestors.

4.2.1 Research Question 2.1: Where are requestors located in the world?

Although the three repositories considered here are funded by and administered through various parts of the National Institutes of Health, a United States government research institution, researchers from around the world are permitted to request use of the datasets. While requests can and do come from around the world, I hypothesize that most requests will arise from geographic regions with a large research presence, such as North America, Europe, and Asia, as well as highly populated states within the US.

Research activities are not distributed evenly among countries around the world, nor among states in the United States. For example, a country such as the United States that is large and has many well-established research institutions is likely to have more dataset requests than a country such as Liechtenstein, which is much smaller and has fewer universities, simply because there are more researchers in the United States to request datasets. Therefore, I calculated relative difference in composition between requests by repository and a proxy measure for presence of research institutions.
Relative difference in composition (RDC) is used to quantify over- and underrepresentation of specific groups in a measure of interest compared to their representation in the population overall (Ford, 2014). To calculate RDC, first the difference in composition between the measure of interest (requests) and the comparison measure (the proxy measure for research presence) is calculated. For example, suppose that requests from a country constitute 15% of the overall requests to a repository, and that country has 10% of the research institutions in the world. The difference in composition is 5%. Then, the RDC is calculated by dividing the difference in composition by the composition of the research proxy, that is, 5%/10%, and multiplying by 100, yielding an RDC of 50% - that is, that particular country’s share of requests is 50% higher than would be expected given its number of research institutions.

In this analysis, I use counts of individual requests rather than counts of datasets requested to represent how many studies the repository is supporting. For example, if one researcher from a country is requesting 250 datasets to conduct a single meta-analysis, it is counted as one request, not 250. If each dataset requested was counted individually, a single meta-analysis could significantly sway a country’s results, overrepresenting the amount of research supported by the shared data.

A single list of all research institutions of all types globally would be nearly impossible to obtain, so I use number of universities in the country as a proxy for number of research institutions. Although there are various types of non-academic research institutions employing researchers that might request datasets, the number of universities provides a reasonable basis for quantifying the relative research presence of a given country.
The Cybermetrics Lab in the Consejo Superior de Investigaciones Científicas (CSIC), a public research institution in Spain, maintains a list of universities and rankings for 209 countries around the world, including 28,077 universities as of January 2019 (Consejo Superior de Investigaciones Científicas, 2019). I used this list to calculate the percent of all universities in the world located in each country. For example, India has 3,944 universities, the most of any country in the world, accounting for 14% of all global universities. By comparison, a country such as Malawi that has only 12 universities accounts for 0.04% of the world’s universities. Considering the difference between the percent of all repository requests coming from a country and the percent of all universities in the world that are in that country provides a basis for determining whether a country is requesting datasets at a rate that is proportional to its representation among global universities.

Figure 4-2, Figure 4-4, and Figure 4-6 show relative difference in composition by each repository internationally. Darker shades of blue indicate more significant underrepresentation of requests relative to number of universities, while darker shades of red indicate more significant overrepresentation. Countries in gray have no universities represented in the CSIC list, nor requests to the data repository. Each figure has a different legend based on the maximum difference in relative composition for each repository. Figure 4-3, Figure 4-5, and Figure 4-7 compare counts of universities per country to counts of requests coming from that country for each repository, demonstrating that there is neither a linear nor quadratic relationship between these two variables.

Figure 4-2. Relative difference in composition of requests for dbGaP datasets and universities in countries in the world.

Figure 4-3. Counts of universities compared to counts of requests to dbGaP.

Figure 4-4.
Relative difference in composition of requests for NHLBI datasets and universities in countries in the world.

Figure 4-5. Counts of universities compared to counts of requests to NHLBI.

Figure 4-6. Relative difference in composition of requests for NIDDK datasets and universities in countries in the world.

Figure 4-7. Counts of universities compared to counts of requests to NIDDK.

As these three maps demonstrate, requests for datasets are unevenly distributed, with a few countries highly overrepresented. In fact, most countries that had at least one university had never made any requests to the repositories; 79% of countries with universities had no requests to NHLBI, 81% had no requests to dbGaP, and 90% had made no requests to NIDDK. Given that these three repositories are within the United States, it is perhaps unsurprising that United States-based institutions are highly overrepresented among requests to all three repositories.

Datasets also appear to be more highly requested in English-speaking countries; Canada, the United Kingdom, and Australia are all overrepresented for some or all three of the repositories. This finding could be due to the documentation and web pages of the repositories being written in English; non-English speakers might have difficulty finding and using datasets that do not include documentation in their native language, especially given that requesting the datasets requires writing a detailed description of the proposed reuse in English.

Table 4-5 shows the number of universities, requests per repository, and relative difference in composition for the ten highest scoring countries for each repository, except NIDDK, which only had six countries that were overrepresented (several countries are in the top ten for more than one repository). RDC values of less than 0 are highlighted in light gray.
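The RDC calculation described in this section can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not the code used to produce the tables and figures; it takes each share as a percentage and returns the RDC as a percentage.

```python
def share(count, total):
    """Percentage of `total` accounted for by `count`."""
    return count / total * 100


def relative_difference_in_composition(request_share, proxy_share):
    """RDC (%) = (request share - proxy share) / proxy share * 100.

    `request_share` is a group's percentage of all requests and
    `proxy_share` is its percentage of the research-presence proxy
    (universities for countries, NIH funding for states).
    """
    return (request_share - proxy_share) / proxy_share * 100


# A country with 15% of requests and 10% of research institutions.
print(relative_difference_in_composition(15, 10))

# India's share of the 28,077 universities in the CSIC list.
india_share = share(3944, 28077)
print(round(india_share, 1))

# A country with universities but no requests is always -100%.
print(relative_difference_in_composition(0, india_share))
```

Two properties follow directly from the formula: a country with no requests always has an RDC of -100% regardless of how many universities it has, and a country with very few universities can reach a very large positive RDC with only a handful of requests.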
As Table 4-5 demonstrates, countries were not universally under- or overrepresented among requests to the various repositories; in fact, relative difference in composition varied significantly among the repositories. For example, Luxembourg, which had the highest relative difference in composition for dbGaP requests (1,397% overrepresented), did not have a single request to either of the other two repositories and therefore was 100% underrepresented in both.

Table 4-5. Countries with number of universities and number of requests (N) and relative difference in composition (RDC) for each repository.

Country | University count | dbGaP N | dbGaP RDC | NIDDK N | NIDDK RDC | NHLBI N | NHLBI RDC
Australia | 188 | 183 | 221% | 6 | 55% | 35 | 170%
Canada | 355 | 301 | 179% | 2 | -72% | 85 | 246%
Cyprus | 26 | 1 | -89% | 1 | 84% | 0 | -100%
Finland | 46 | 23 | 65% | 0 | -100% | 4 | 28%
Germany | 465 | 223 | 58% | 2 | -26% | 22 | -32%
Iceland | 9 | 12 | 337% | 0 | -100% | 0 | -100%
Israel | 42 | 77 | 501% | 0 | -100% | 10 | 248%
Italy | 239 | 86 | 19% | 5 | 2% | 1 | -94%
Luxembourg | 3 | 14 | 1,397% | 0 | -100% | 0 | -100%
Netherlands | 133 | 106 | 162% | 2 | -26% | 32 | 248%
New Zealand | 56 | 27 | 60% | 0 | -100% | 11 | 186%
Qatar | 9 | 0 | -100% | 0 | -100% | 1 | 56%
Singapore | 45 | 44 | 224% | 0 | -100% | 3 | -6%
Sweden | 46 | 63 | 352% | 5 | 431% | 3 | -8%
Switzerland | 102 | 59 | 90% | 2 | -4% | 4 | -42%
United Kingdom | 280 | 471 | 484% | 16 | 179% | 71 | 267%
United States | 3,257 | 5,773 | 484% | 338 | 406% | 1,556 | 592%

Among the most highly overrepresented countries, the large number of requests cannot be explained by a single highly prolific requestor or institution. For example, all of the dbGaP requests from Luxembourg do come from just one of its three national universities, but the 14 requests come from nine different requestors. The 77 requests to dbGaP from Israel, the next most overrepresented country, come from 15 different institutions. However, some of the countries that are overrepresented in fact have a low number of requests and only appear overrepresented because they also have very few universities.
For example, Qatar is the eighth most highly represented country among NHLBI requests despite having only one request. In fact, 27 countries have more requests than Qatar, but 20 of them have a lower RDC because they have many more universities than Qatar’s nine.

Just as research institutions are not evenly distributed around the world, they are also not evenly distributed among states within the United States. I conducted the RDC analysis for states as well, using NIH funding amounts in Fiscal Year 2018 (National Institutes of Health Research Portfolio Online Reporting Tools, 2018) as a proxy for research presence. I calculated the relative difference between the percent of total requests by repository made in a state and the percent of all NIH funding awarded within the United States that was awarded to that state. NIH research funding is probably a more accurate proxy for biomedical research presence than the university count that was feasible to use for the world analysis, since NIH awards funding to a variety of types of research institutions, not just universities, and focuses specifically on the type of biomedical research that is relevant here.

Figure 4-8, Figure 4-9, and Figure 4-10 show RDC by repository within the United States. Red indicates states that are requesting a larger share of datasets compared to the research funding they receive, while blue indicates states that are requesting a smaller share. The darker the color, the more highly the state is over- or underrepresented, while states in white request datasets at a rate about equivalent to their research presence.

Figure 4-8. Relative difference in composition of requests for dbGaP datasets and NIH funding in FY18 by state within the US.

Figure 4-9. Relative difference in composition of requests for NHLBI datasets and NIH funding in FY18 by state within the US.

Figure 4-10.
Relative difference in composition of requests for NIDDK datasets and NIH funding in FY18 by state within the US.

The state RDC analysis shows more variation in geographic distributions than the global RDC analysis. The states that are the most highly overrepresented among the various repositories are not necessarily the ones that might be expected: New Mexico, Wyoming, and Alaska all appear as outliers. On the other hand, other states with a strong research reputation also are overrepresented, such as Massachusetts and California. Compared with the global analysis, more states appear in white (or a shade close to it), indicating that they are requesting datasets at a level that is proportional to the amount of NIH funding they receive. This finding could suggest that requests for data are more evenly distributed among research institutions within the United States than they are among universities across the world.

Where disparities do exist within the states, they also generally tended to be less significant than those among countries. Compared to RDCs of nearly 1,500% for the most highly overrepresented countries, the most extreme RDC for NHLBI is about 500% and dbGaP’s is only 85%. However, NIDDK requests are skewed at a level closer to that seen at the global level, largely due to the very high overrepresentation of requests from New Mexico and Washington, DC.

As with the global RDC analysis, some states appear highly overrepresented not because they have a very high number of requests, but because they receive very little NIH funding. For example, only two requests for NHLBI data came from Wyoming, but it also receives the least NIH funding of any state – only 0.05% of all NIH funding. However, some highly funded states also request very little data. For example, Texas, the seventh-highest funded state, had only made four requests to NIDDK.

4.2.2 Research Question 2.2: Are there patterns in career stage of requestors?
Although requestors to the three repositories must demonstrate that they are legitimate researchers (for example, dbGaP requestors must be registered in NIH’s Electronic Research Administration system, while NHLBI and NIDDK have a process for requestors to apply for an account, which includes indicating their research affiliation and status), researchers from a range of career stages are free to request datasets. That range extends from students to full professors and includes career stages outside of academia, such as senior scientists, CEOs and other executives, and managers. Some requestors may be at a career stage at which they might benefit more substantially from the opportunity to use existing data; for example, students and early career researchers are less likely to have access to the significant funding, laboratory resources, and staff that it would take to generate their own data. Despite the potentially greater benefit to early career researchers, I hypothesized that a broad range of career stages, from student to full professor (or their equivalents in non-academic contexts), would be represented.

For this analysis, I used the NIDDK requests and a random sample of 1,500 of the total 9,444 dbGaP requests (15.9%), which provides a confidence interval of +/-1.1 at a 95% confidence level based on estimation of proportion. Of the 416 NIDDK requests, 144 (35%) did not include a requestor name and were therefore excluded from this analysis, leaving 272 requests. NHLBI did not provide individual researcher-level request data for privacy reasons, so those requests could not be included in this analysis. I determined the career status of each researcher at the time they made the request by searching the internet for documentation of their career history, such as institutional web pages, CVs, biosketches, and LinkedIn pages.
Titles from non-American institutions were converted to their American equivalent; for example, the rank of “senior lecturer” in the United Kingdom is the equivalent of an associate professor in the US (Wikipedia, 2018). For requestors for whom I could not definitively determine the career status at the time of request, I recorded “unknown.”

The 1,500 dbGaP requests came from 1,118 unique requestors and requested access to 18,117 total datasets (since a request could ask for multiple datasets). Each request asked for between one and 529 datasets, with a mean of 12.1 datasets per request. The 272 NIDDK requests came from 252 unique requestors and requested a total of 394 datasets, with each request asking for between one and ten datasets (mean 1.4). While many requests asked for more than one dataset, this analysis counts individual requests rather than requests by dataset to provide a clear understanding of how much research is being supported at each career stage. For example, if an associate professor requests 30 datasets for a meta-analysis, that request supports one research project; counting each dataset separately would inflate counts of how much research is being supported at a given career stage. Table 4-6 provides the distribution of requests for dbGaP and NIDDK by career status, with statuses grouped into career stages that approximately reflect where each status falls in a broader career trajectory.

Table 4-6. Proportions of datasets requested by career status of requestor for dbGaP and NIDDK.

| Career Stage | Title | Percent of dbGaP requests | Percent of NIDDK requests |
|---|---|---|---|
| Pre-professional | Student | 0.7% | 1.8% |
| | Fellow | 0.7% | 3.1% |
| | Total | 1.4% | 4.9% |
| Early career | Assistant Professor | 19.1% | 27.6% |
| | Resident Physician | 0% | 1.1% |
| | Lecturer | 0.07% | 0.4% |
| | Instructor | 0.07% | 0% |
| | Total | 19.2% | 29.1% |
| Mid-Career | Associate Professor | 15.4% | 13% |
| | Scientist | 5.7% | 3.9% |
| | Attending Physician | 0% | 0.2% |
| | Manager | 0.7% | 0.4% |
| | Total | 21.8% | 17.5% |
| Established | Professor | 26.8% | 24% |
| | Director | 8.5% | 5.5% |
| | Executive | 3% | 5.1% |
| | Senior Scientist | 10.3% | 6.7% |
| | Total | 48.6% | 41.3% |
| Unknown | | 9% | 5.9% |

Patterns of requests appeared generally similar across dbGaP and NIDDK; however, a chi-squared test of independence revealed that the two distributions were in fact significantly different (χ² = 81, df = 12, p < 0.001). This statistic was most influenced by the numbers of resident and attending physicians requesting datasets; expected counts would be 1 physician (0.9 expected resident and 0.1 expected attending) for NIDDK and 6 for dbGaP (5.1 expected resident and 0.9 expected attending), but all 7 requests from physicians went to NIDDK. This finding could be explained by the fact that NIDDK contains clinical data of the type that would be familiar to physicians, whereas physicians generally do not have training in dealing with genomic information and would therefore be less likely to use the genomic data found in dbGaP (Demmer & Waggoner, 2014; Manolio & Murray, 2014; Murray, 2014).

Despite the fact that the distribution of requestors between the two repositories differed statistically, the requests did at least follow a broadly similar pattern, with nearly half of requests to both repositories coming from full professors and other researchers in more established positions. Assistant professors also represented a sizeable proportion of requestors, accounting for about a quarter of the datasets requested from both dbGaP and NIDDK.
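
The expected physician counts quoted above follow from the usual independence calculation (row total times column total, divided by the grand total). The sketch below is illustrative only and is not the code used for this analysis. It assumes the denominators are the 1,500 sampled dbGaP requests and the 272 usable NIDDK requests; the function and variable names are hypothetical.

```python
def expected_count(row_total, column_total, grand_total):
    """Expected cell count under independence: row total * column total / grand total."""
    return row_total * column_total / grand_total


def chi_squared_contribution(observed, expected):
    """Contribution of a single cell to the chi-squared statistic."""
    return (observed - expected) ** 2 / expected


# Request totals reported in the text (assumed here to be the test's denominators).
DBGAP_REQUESTS = 1500
NIDDK_REQUESTS = 272
TOTAL_REQUESTS = DBGAP_REQUESTS + NIDDK_REQUESTS

# The text reports 7 physician requests in total, all of them to NIDDK.
PHYSICIAN_REQUESTS = 7

expected_niddk = expected_count(PHYSICIAN_REQUESTS, NIDDK_REQUESTS, TOTAL_REQUESTS)
expected_dbgap = expected_count(PHYSICIAN_REQUESTS, DBGAP_REQUESTS, TOTAL_REQUESTS)

# Observed counts: 7 at NIDDK and 0 at dbGaP.
physician_contribution = (
    chi_squared_contribution(7, expected_niddk)
    + chi_squared_contribution(0, expected_dbgap)
)

print(f"Expected physician requests: NIDDK {expected_niddk:.2f}, dbGaP {expected_dbgap:.2f}")
print(f"Chi-squared contribution of the physician cells: {physician_contribution:.1f}")
```

Rounded to whole requests, these expected counts agree with the 1 and 6 reported above. The contribution printed here treats physicians as a single category, so it will not match the resident and attending breakdown exactly.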
Almost none of the requests came from pre-professionals such as students and fellows. However, a limitation that should be noted for this analysis is that the person who requested the data might not be the person who actually ended up using the data. For example, a full professor might request data on behalf of a graduate student.

As with universities’ uneven distribution around the world, researchers are not necessarily evenly distributed among career ranks. For example, faculty might be more concentrated in lower ranks, and therefore it would be expected that they would make more requests, since there are more individuals to make requests. Therefore, in addition to considering proportions overall, I also calculated the relative difference in composition (RDC), as described in Section 4.1. Obtaining counts of non-academic ranks such as CEO or scientist was infeasible, but I calculated RDC for the academic-related ranks based on 2016 data from the National Center for Education Statistics, which reports counts of full-time faculty in US degree-granting postsecondary institutions (National Center for Education Statistics, 2017). I compared the proportion of each rank within all of US faculty to its proportion of academic requests for dbGaP and NIDDK. Note that this analysis only considers requests that came from academic requestors; for example, the 46.3% reported for professors requesting dbGaP datasets refers not to the proportion of datasets this group requested compared to all requests, but to the proportion requested compared to requests coming from the five academic ranks listed in the results, reported in Table 4-7.

Table 4-7. Relative difference in composition (RDC) between faculty at five academic ranks in US institutions and their requests to dbGaP and NIDDK.

| Faculty status | Percent of US faculty | Academic dbGaP requests (%) | Academic dbGaP requests (RDC) | Academic NIDDK requests (%) | Academic NIDDK requests (RDC) |
|---|---|---|---|---|---|
| Professor | 22.4% | 44% | 96% | 40% | 78% |
| Associate professor | 19.3% | 25% | 29% | 20% | 4% |
| Assistant professor | 21.6% | 31% | 43% | 42% | 94% |
| Instructor | 12.4% | 0.1% | -99% | 0% | -100% |
| Lecturer | 5.2% | 0.1% | -98% | 0.6% | -88% |
| Other | 19.1% | NA | NA | NA | NA |

A chi-squared test of independence revealed that request counts from staff at different faculty ranks differed significantly from their representation in American universities for both dbGaP and NIDDK (χ² = 641, df = 5, p < 0.001 and χ² = 108, df = 5, p < 0.001, respectively). Their distributions are also significantly different from each other (χ² = 14, df = 4, p = 0.01). As Table 4-7 demonstrates, professors are overrepresented in their requests to both repositories, although to a lesser degree among requests to NIDDK. Instructors and lecturers are almost 100% underrepresented, a finding that seems reasonable given that many faculty members at this level have teaching and service responsibilities that may limit their engagement in research, and they therefore request less data for that purpose.

A surprising finding is that the representation of assistant professors and associate professors varies between dbGaP and NIDDK. Associate professors are about 30% overrepresented among dbGaP requests but only 4% overrepresented among NIDDK requests. Assistant professors, on the other hand, are 94% overrepresented among NIDDK requests, but less than half as overrepresented among dbGaP requests (43%). A possible explanation for this finding could be that researchers at different ranks are more likely to engage in the types of research that tend to be supported by each repository; that is, associate professors are requesting more data from dbGaP because they are doing more meta-analyses, and assistant professors are requesting more data from NIDDK because they are doing more original research studies.
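
As an illustration of how the RDC values in Table 4-7 relate to the percentages beside them, the sketch below assumes that RDC is the difference between a group's share of requests and its share of the reference population, divided by the reference share. The formal definition is in Section 4.1; this form is an assumption, though it reproduces the published values to within rounding. The function name is hypothetical.

```python
def relative_difference_in_composition(request_share, reference_share):
    """Percent by which a group's share of requests exceeds (positive) or
    falls short of (negative) its share of the reference population."""
    return (request_share - reference_share) / reference_share * 100


# Rounded percentages copied from Table 4-7:
# rank -> (percent of US faculty, percent of dbGaP requests, percent of NIDDK requests)
TABLE_4_7 = {
    "Professor": (22.4, 44, 40),
    "Associate professor": (19.3, 25, 20),
    "Assistant professor": (21.6, 31, 42),
    "Instructor": (12.4, 0.1, 0),
    "Lecturer": (5.2, 0.1, 0.6),
}

for rank, (faculty, dbgap, niddk) in TABLE_4_7.items():
    dbgap_rdc = relative_difference_in_composition(dbgap, faculty)
    niddk_rdc = relative_difference_in_composition(niddk, faculty)
    print(f"{rank}: dbGaP RDC {dbgap_rdc:.0f}%, NIDDK RDC {niddk_rdc:.0f}%")
```

Because the inputs here are the rounded percentages from the table, a recomputed value can differ from the published one by about a percentage point.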
Further research into how requestors are using datasets could help elucidate some of the differences in request rates.

4.2.3 Summary of Findings

Although datasets from the three repositories considered here are theoretically available to any qualified researcher, requests for datasets are unequally distributed around the world and among researchers at different career stages. English-speaking regions, particularly the United States, were overrepresented in requests compared to their number of research institutions. Established researchers who were at higher career ranks were also overrepresented, particularly among academic staff. These findings suggest that, in many cases, datasets are going to the researchers most able to collect their own data if need be: established researchers in wealthy countries who likely have access to resources that earlier career researchers and those in poorer countries do not.

4.3 Conclusions and Summary of Findings

The results reported here have helped elucidate the who, where, and why of data reuse. From these findings, a general picture of biomedical data reuse begins to appear. Researchers are making use of data in a wide range of contexts, from using one dataset in a context very similar to its original purpose, to requesting hundreds of datasets from a range of unrelated topics to conduct large-scale meta-analyses. The range of types and contexts of reuse seen here demonstrates that data reuse is complex, not a single, easily explained phenomenon, although some of the differences in reuse can be explained by the repository and the type of data it holds. Researchers from around the world are taking advantage of the opportunity to reuse existing datasets rather than gathering their own, though requests tend to be concentrated in English-speaking countries, particularly the United States.
Requests come from researchers at all different career stages, from students just beginning their career to full professors who are well established in their discipline, though later career researchers are somewhat overrepresented. In Chapter 5, I will build on this emerging picture of biomedical data reuse by considering patterns of use requests in relation to dataset topic and time since dataset release.

Chapter 5: Findings About Datasets

This chapter presents the findings for the two research questions that focus on the datasets themselves, that is, the when and what of biomedical data reuse: when in a dataset’s life cycle is it most requested, and what topics are the most requested? Specifically, the research questions and hypotheses considered here are:

Research Question 3: Are there temporal patterns to dataset requests?
Hypothesis 3: Patterns of requests relative to the original dataset release date will demonstrate a cumulative advantage process, similar to other scientific communication processes such as article citation.
Research Question 4: Are there dataset topics that are more highly requested?

5.1 Research Question 3: Are there temporal patterns to dataset requests?

Many processes in the study of science, including citations to articles, follow the model of a cumulative advantage process: the rich get richer, and success breeds success. In other words, an article that has already been cited many times is more likely to go on to receive more citations than an article that has only been cited a few times. This process makes sense for a variety of reasons: an article cited many times could be cited more because it is of higher quality than a less-cited article, and a highly cited article likely ends up having more visibility than a less-cited article, since it appears in the bibliography of more citing articles.
I hypothesize that temporal patterns in requests for datasets over time can, like article citations, be explained by a cumulative advantage model.

For this analysis, I used dbGaP and NHLBI datasets only. The NIDDK repository only had a specific release year for datasets released in 2014 or later, which is just 30% of its datasets. As a result, only 91 of the total 516 requests, just 18%, could be matched with a dataset with a known release date, and most datasets had only one or two requests per year. With so few datasets and only four years’ worth of requests to consider, the NIDDK data was inadequate for this analysis.

Request data began in 2007 for dbGaP and 2000 for NHLBI. For both the dbGaP and NHLBI analyses, 2018 requests were excluded, since the list of dataset requests was collected in mid-2018 and therefore did not represent a full year’s worth of requests. Thus, the dbGaP analysis included requests made between 2007 and 2017, and the NHLBI analysis, requests made between 2000 and 2017.

For each repository, I considered how many total requests each dataset had received across the years included in this analysis (that is, excluding 2018 requests). Based on these total requests, I determined rankings for the least to the most requested datasets by calculating how many requests a dataset would need to fall in each decile (or set of 10 percentile points) between the 10th and 90th percentile and determined the decile for each dataset. For example, a dbGaP dataset with a total of 6 requests would be in the 20th percentile, while a highly requested dataset that had received 200 requests would be in the 90th percentile. For each individual request, I determined the age of the dataset at the time of the request by subtracting the year the dataset was first shared from the year the request was made.
Finally, I calculated the mean number of requests by decile for each year since dataset release (for example, of the 365 datasets that fall into the 50th percentile for dbGaP, the mean number of requests they received in the first year of being available was 5.85).

Using the dataset’s age at the time of request (e.g. the request was made 2 years after the dataset was released) rather than the calendar year of the request (e.g. the request was made in 2015) makes it possible to compare datasets of different ages. If the cumulative advantage effect holds true, a dataset released in 2009, for example, would be more likely to have a higher number of requests in 2015 than a dataset released in 2014, since it is six years old and has had more time to accumulate advantage than a one-year-old dataset. However, the number of requests the 2009 dataset received in 2010, when it was one year old, can be reasonably compared to the number of requests the 2014 dataset received in 2015, when it was also one year old.

Of course, it is possible that the year a dataset was released might affect the number of requests it receives even when comparing like to like by using dataset age. For example, data science and other computational methods have become increasingly popular in recent years, so perhaps a dataset released in 2015 would be more requested in its first year than a dataset released in 2009 would in its first year, simply because more people are making requests overall. However, this analysis also accounts for release year, as will be further discussed, by measuring the correlation between year of release and total requests. A weak correlation between year of release and number of requests received would suggest that year of release has little impact on a dataset’s number of requests.
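
The procedure described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: the cut-offs are the upper bounds of the request-count ranges later reported for dbGaP in Table 5-1, the function and variable names are hypothetical, and the choice to average over every dataset in a group that was old enough to reach a given age by 2017 is an assumption about the denominator.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Upper bounds of the dbGaP request-count ranges in Table 5-1 (10th to 80th
# percentile groups); totals above the last bound fall in the 90th group.
UPPER_BOUNDS = [4, 8, 12, 18, 27, 42, 66, 174]
LABELS = ["10th", "20th", "30th", "40th", "50th", "60th", "70th", "80th", "90th"]


def decile_label(total_requests):
    """Return the percentile group for a dataset's total request count."""
    return LABELS[bisect_left(UPPER_BOUNDS, total_requests)]


def age_at_request(request_year, release_year):
    """Dataset age in years when the request was made (0 = year of release)."""
    return request_year - release_year


def mean_requests_by_age(release_years, requests, last_year=2017):
    """Mean requests per dataset for each (percentile group, age) pair.

    release_years: {dataset_id: release year}
    requests: iterable of (dataset_id, request_year) pairs
    """
    per_dataset_age = Counter()
    totals = Counter()
    for dataset_id, request_year in requests:
        # Skip partial-year requests and datasets without a known release year.
        if request_year > last_year or dataset_id not in release_years:
            continue
        age = age_at_request(request_year, release_years[dataset_id])
        per_dataset_age[(dataset_id, age)] += 1
        totals[dataset_id] += 1

    sums = defaultdict(int)
    counts = defaultdict(int)
    for dataset_id, release_year in release_years.items():
        label = decile_label(totals[dataset_id])
        # Assumption: a dataset counts toward every age it could have reached.
        for age in range(last_year - release_year + 1):
            sums[(label, age)] += per_dataset_age[(dataset_id, age)]
            counts[(label, age)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

For instance, `decile_label(6)` returns `"20th"` and `decile_label(200)` returns `"90th"`, matching the two examples given in the text.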
5.1.1 dbGaP Results

The dbGaP analysis includes 982 datasets with a total of 100,115 requests between 2007 and 2017; 68 datasets for which a year of release could not be determined from the dbGaP website were excluded. Figure 5-1 shows the number of requests datasets in each decile received in each year since their release (not cumulative requests). The count of age at request on the x-axis begins with 0, which indicates requests made within the first year of a dataset’s release, with 1 indicating requests made when the data is one year old, and so on. Table 5-1 shows the range and distribution of dbGaP datasets within deciles.

Figure 5-1. Mean requests by year for dbGaP datasets in each decile, by age of the dataset at time of request.

Table 5-1. Distribution of dbGaP datasets by request deciles for requests made between 2007 and 2017.

| Decile | Request count range | Number of datasets in percentile | Mean age of datasets, years |
|---|---|---|---|
| 10th percentile | 4 or fewer | 164 | 1.82 (range 0–7) |
| 20th percentile | 5–8 | 263 | 2.89 (range 0–8) |
| 30th percentile | 9–12 | 254 | 3.57 (range 0–7) |
| 40th percentile | 13–18 | 324 | 3.85 (range 0–7) |
| 50th percentile | 19–27 | 373 | 4.42 (range 0–7) |
| 60th percentile | 28–42 | 406 | 4.66 (range 0–8) |
| 70th percentile | 43–66 | 465 | 5.16 (range 1–9) |
| 80th percentile | 67–174 | 635 | 6.75 (range 1–10) |
| 90th percentile | 175–2754 | 1807 | 9.03 (range 5–10) |

As Table 5-1 demonstrates, datasets in the lower percentiles of requests are on average younger, which is logical considering that a ten-year-old dataset has had twice as long to accumulate requests as a five-year-old dataset, and would therefore be in a higher decile. Length of data availability does appear to explain, at least partly, the number of requests a dataset has received; no datasets that had less than five years to accrue requests (i.e. no datasets released after 2012) made it into the 90th percentile, and none of the oldest datasets (released between 2007 and 2009) fell below the 80th percentile.
However, a dataset’s age cannot fully account for the number of requests it has received, given that at least some datasets that had eight years to accrue requests were only in the 20th percentile, compared to other datasets that had only one year to accrue requests and made it into the 60th percentile. Further, as discussed above, this analysis controls for age by comparing requests over time based on dataset age rather than calendar year. Datasets in the 90th percentile were already more highly requested in the first year after being released, receiving on average 42 requests in the first year: more than three times as many as datasets in the 80th percentile received in their first year (mean = 13) and more than twenty times as many as datasets in the bottom 10th percentile received in their first year (mean = 2). The number of requests in the first year varies significantly even among the lower deciles; on average, datasets in the 20th percentile received over 70% more requests in their first year than those in the 10th percentile.

Because this method is still somewhat affected by the age of the dataset, I also calculated percentile ranges for each year of a dataset’s life (i.e. calculated percentiles for all requests in year one of a dataset’s life, year two, and so on) and conducted the same analysis using the mean percentile across all the years of its availability, rather than its overall percentile. For example, a dataset in the 90th percentile of first-year requests, 80th percentile of second-year requests, and 70th percentile of third-year requests would have a mean of the 80th percentile. Using the mean instead of the overall percentile ranking helps compare newer datasets more fairly with older datasets. Whereas an older dataset would be more likely to be in a higher percentile overall because it had more time to accrue requests, using the mean percentile makes it possible to compare datasets at various points in their life.
For example, suppose a dataset has been available for two years, and receives 25 requests in its first year and 40 requests in its second year, putting it in the 90th percentile for both first- and second-year requests, for a mean of the 90th percentile. In its two years, it has accrued a total of 65 requests, but this is only enough to put it in the 70th percentile overall, since it is also competing against datasets that have had five times as long to accrue total requests. Using the mean percentile instead of the overall percentile more accurately reflects its high performance over the course of its life, comparing it to datasets of the same age at each year of its life.

Most datasets had at least some variability in their performance across all years, with few achieving the same percentile at every year since their release. Because of this variability, and since the mean is a measure of central tendency, few datasets had a mean percentile at the highest end; only two datasets more than two years old had a mean of the 90th percentile. Therefore, rather than use deciles, I grouped more datasets together and used quartiles (i.e. 0-25th percentile, 26-50th percentile, 51-75th percentile, and 76-100th percentile).

Figure 5-2 shows the number of requests for datasets in each mean quartile received in each year since their release (not cumulative requests). The count of age at request on the x-axis begins with 0, which indicates requests made within the first year of a dataset’s release, with 1 indicating requests made when the data is one year old, and so on. Table 5-2 shows the range and distribution of dbGaP datasets within quartiles. As Table 5-2 demonstrates, calculating the mean quartile by averaging the percentile for a dataset in each year of its life creates a more even distribution of datasets by age among the various quartiles. However, the two higher quartiles do have older datasets on average than the two lower quartiles.

Figure 5-2.
Mean requests by year for dbGaP datasets in each mean quartile, by age of the dataset at time of request.

Table 5-2. Distribution of dbGaP datasets by mean request quartiles for requests made between 2007 and 2017.

| Quartile | Total request count range | Number of datasets in percentile | Mean age of datasets, years |
|---|---|---|---|
| 1-25th percentile | 1–114 | 867 | 4.7 (range = 0-9) |
| 26-50th percentile | 2–215 | 1321 | 4.9 (range = 0-10) |
| 51-75th percentile | 5–442 | 1808 | 7.7 (range = 0-10) |
| 76-100th percentile | 10–2754 | 695 | 7.8 (range = 0-9) |

While the exact dynamics of requests in the analysis using the mean percentile rather than the overall percentile differ, as Figure 5-2 demonstrates, the general pattern is the same: datasets that start out being highly requested go on to continue being highly requested as time goes on. Taken together, the mean percentile and overall percentile analyses suggest that dbGaP dataset requests are at least partly affected by a cumulative advantage process. Datasets that are highly requested soon after their release go on to continue to receive more requests later, while datasets that initially receive fewer requests continue to be less requested over time.

As Figure 5-1 demonstrates, datasets across all overall deciles reach a peak of requests in their second year (age = 1), and requests begin to drop off in the third year. This pattern is very similar to citations to articles over time: articles reach a peak of citations at various ages depending on discipline (for example, Clinical Medicine articles peak around 4 years while Biology articles peak around 7 years), but citations eventually drop off as the articles become older (Eom & Fortunato, 2011; Parolo et al., 2015; Wang, 2013). As with article citations, this decline continues over time for datasets in the 10th through 80th percentiles overall, but the same is not true of datasets in the 90th percentile overall.
After following the pattern of third-year drop-off, requests for these datasets actually begin to increase again in the fourth year and steadily climb in each subsequent year, eventually reaching and even surpassing the number of requests received in the first year. With only ten years of data to consider here, it is difficult to completely explain the mechanism behind this pattern. Perhaps these highly requested datasets see a bump in requests in subsequent years as early requestors begin to publish articles that cite their reuse of the dataset, thus creating a cycle of increased attention for the already highly requested datasets.

Looking at the data requests as a whole, rather than dividing them by deciles, also demonstrates a strong relationship between the number of requests a dataset receives in the first few years after release and the number it receives over the long term. The number of long-term requests is strongly positively correlated with the number of requests received in the first year and second year (correlation coefficient for both = 0.8), and even more so with requests in the third year (correlation coefficient = 0.9). There is a negative correlation between release year (i.e. calendar year) and total number of requests, indicating that older datasets have more requests, but this correlation is only moderate (correlation coefficient = -0.6).

Fitting a linear regression model to the request data further demonstrates the importance of a dbGaP dataset’s early performance in predicting its long-term request rate. Table 5-3 summarizes results from three regression models: first-year requests only; first- and second-year requests; and first-, second-, and third-year requests. All three models also include release year to control for the influence of a dataset’s age on the number of requests. All three models are statistically significant at p < 0.001.
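
For readers less familiar with these statistics, the sketch below shows how a Pearson correlation coefficient and a single-predictor least-squares fit are computed. It is a simplified illustration: the models summarized in Table 5-3 also include release year (and, for the larger models, later-year requests) as predictors, which this sketch omits. The data are hypothetical, as are the function names.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


def simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor.

    Returns (slope, intercept, r_squared). With a single predictor,
    R-squared equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept, pearson_r(xs, ys) ** 2


# Hypothetical data: first-year and total requests for six datasets.
first_year_requests = [2, 5, 9, 13, 20, 42]
total_requests = [10, 30, 70, 95, 160, 310]

slope, intercept, r_squared = simple_ols(first_year_requests, total_requests)
print(f"r = {pearson_r(first_year_requests, total_requests):.2f}")
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R-squared = {r_squared:.2f}")
```

The slope plays the same role as the regression coefficient discussed in the text: the average change in total requests for each additional first-year request.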
To determine whether results were affected by collinearity, I calculated the variance inflation factor (VIF) for each model and each variable. VIF values of greater than 10 indicate collinearity; all VIFs here are less than 10 (Dormann et al., 2013).

The R-squared value of a regression model is a measure of the amount of variability in the outcome variable (total requests) that is explained by a given model. For example, the one-year model accounts for 73% of the variability in total requests, while adding the second and third year increases the amount of variability the model accounts for. The regression coefficient (coef) represents the mean change in total requests for every additional increase of one in the predictor variable, while holding other variables constant. For example, in the one-year model, the coefficient of 6.61 for number of requests in the first year means that for every additional request in the first year, a dataset would have, on average, 6.61 additional requests over time. The standard error (SE) is a measure of the average distance between the regression line and the values in the data. The higher the standard error, the less correct the model is on average; variables with SE of greater than 2.5 are not statistically significant at a 95% prediction interval. Finally, the Beta value (β) indicates the relative influence (and the direction of that influence) of a variable on the number of total requests a dataset receives. Values close to 1 (or -1) indicate a high level of influence, while values close to 0 indicate less influence.

Table 5-3. Results of regression analysis showing effects of requests during year one, two, and three of a dbGaP dataset’s life on the total number of requests during the 2007 – 2017 period.

| Predictor | Statistic | One-Year Model | Two-Year Model | Three-Year Model |
|---|---|---|---|---|
| First year requests | coef | 6.61 (p < 0.001) | 4.6 (p < 0.001) | 3.54 (p < 0.001) |
| | SE | 0.26 (p < 0.001) | 0.28 (p < 0.001) | 0.23 (p < 0.001) |
| | β | 0.76 (p < 0.001) | 0.53 (p < 0.001) | 0.41 (p < 0.001) |
| | VIF | 1.1 | 1.11 | 1.6 |
| Second year requests | coef | NA | 5.44 (p < 0.001) | -0.56 (p < 0.001) |
| | SE | NA | 0.45 (p < 0.001) | 0.51 (p < 0.001) |
| | β | NA | 0.38 (p < 0.001) | -0.04 (p < 0.001) |
| | VIF | NA | 2 | 4 |
| Third year requests | coef | NA | NA | 7.57 (p < 0.001) |
| | SE | NA | NA | 0.46 (p < 0.001) |
| | β | NA | NA | 0.58 (p < 0.001) |
| | VIF | NA | NA | 4.9 |
| Year of release | coef | -10.92 (p < 0.001) | -5.42 (p < 0.001) | -5.8 (p < 0.001) |
| | SE | 2.31 (p < 0.001) | 2.05 (p < 0.001) | 1.63 (p < 0.001) |
| | β | -0.14 (p < 0.001) | -0.07 (p < 0.001) | -0.08 (p < 0.001) |
| | VIF | 1.1 | 2.04 | 2.2 |
| R-squared | | 0.733 (p < 0.001) | 0.7979 (p < 0.001) | 0.8733 (p < 0.001) |

As Table 5-3 shows, the three-year model accounts for nearly 90% of the variability in long-term requests. Even the model with only first-year requests accounts for 73% of the variability in total requests. Year of release appears to have only a small amount of influence on the total number of requests, with Beta values close to 0 for all three models. These models suggest that the number of requests a dataset receives in the first few years is a good predictor of long-term requests, regardless of when the dataset was released.

Because first-, second-, and third-year requests are included in total requests, I also fit models to determine the effect of first- through third-year requests on all later requests, in other words, total requests made in the fourth year and beyond. This analysis includes the 615 datasets that were released before 2015 (that is, those that were old enough to have more than three years’ worth of requests).
All three models also include release year to control for the influence of a dataset’s age on the number of requests. All three models are statistically significant at p < 0.001. Table 5-4 summarizes the results of these three models.

Table 5-4. Results of regression analysis showing effects of requests during year one, two, and three of a dbGaP dataset’s life on the total number of requests in the fourth year and later during the 2007 – 2017 period.

| Predictor | Statistic | One-Year Model | Two-Year Model | Three-Year Model |
|---|---|---|---|---|
| First year requests | coef | 2.53 (p < 0.001) | 2.54 (p < 0.001) | 1.16 (p < 0.001) |
| | SE | 0.22 (p < 0.001) | 0.16 (p < 0.001) | 0.18 (p < 0.001) |
| | β | 0.32 (p < 0.001) | 0.32 (p < 0.001) | 0.14 (p < 0.001) |
| | VIF | 1.1 | 1.1 | 1.78 |
| Second year requests | coef | NA | 4.75 (p < 0.001) | 2.18 (p < 0.001) |
| | SE | NA | 0.2 (p < 0.001) | 0.27 (p < 0.001) |
| | β | NA | 0.65 (p < 0.001) | 0.3 (p < 0.001) |
| | VIF | NA | 5.02 | 2.21 |
| Third year requests | coef | NA | NA | 4.96 (p < 0.001) |
| | SE | NA | NA | 0.39 (p < 0.001) |
| | β | NA | NA | 0.47 (p < 0.001) |
| | VIF | NA | NA | 4.96 |
| Year of release | coef | -40.3 (p < 0.001) | -8.85 (p < 0.001) | -9.56 (p < 0.001) |
| | SE | 1.76 (p < 0.001) | 1.85 (p < 0.001) | 1.65 (p < 0.001) |
| | β | -0.62 (p < 0.001) | -0.14 (p < 0.001) | -0.14 (p < 0.001) |
| | VIF | 1.1 | 2.31 | 2.32 |
| R-squared | | 0.599 (p < 0.001) | 0.7895 (p < 0.001) | 0.833 (p < 0.001) |

This analysis shows that early requests are good predictors for how many requests datasets will go on to receive in later years. In fact, the two- and three-year models perform almost as well in predicting later requests as they do in predicting total requests. This finding provides further evidence that requests early in a dataset’s life can be helpful in predicting patterns of long-term reuse among dbGaP datasets.

5.1.2 NHLBI Results

The NHLBI analysis includes 143 datasets with a total of 3,860 requests between 2000 and 2017.
Figure 5-3 shows the number of requests datasets in each decile received in each year since their release (not cumulative requests). Table 5-5 shows the range and distribution of NHLBI datasets within deciles.

Figure 5-3. Mean requests by year for NHLBI datasets in each decile, by age of the dataset at time of request.

Table 5-5. Distribution of NHLBI datasets by request deciles for requests made between 2000 and 2017.

| Decile | Request count range | Number of datasets in percentile | Mean age of datasets, years |
|---|---|---|---|
| 10th percentile | 2 or fewer | 39 | 4.45 (range 0–17) |
| 20th percentile | 3 | 19 | 6.74 (range = 1–15) |
| 30th percentile | 4 | 22 | 8.64 (range = 3–17) |
| 40th percentile | 5 | 46 | 9.04 (range = 0–15) |
| 50th percentile | 6–8 | 70 | 6.73 (range = 2–12) |
| 60th percentile | 9–14 | 97 | 9.7 (range = 2–15) |
| 70th percentile | 15–21 | 107 | 9.49 (range = 2–16) |
| 80th percentile | 22–34 | 127 | 11.83 (range = 2–16) |
| 90th percentile | 35 or more | 367 | 13.26 (range = 1–17) |

Figure 5-3 reveals a markedly different pattern of requests from what was observed within the dbGaP analysis. NHLBI requests appear to follow no demonstrable pattern at all. Part of the variability seen in Figure 5-3 is due to the sparsity of datasets that reach age 17 or 18. For example, given that only eight datasets exist that had been around for 17 years by 2017, the notable spike in requests in year 17 is probably not a meaningful finding; with so few datasets to consider in that age range, one or two outliers will have more of an effect on the mean than in a larger pool of datasets.

A further consideration in this analysis is that methods for requesting and accessing the data changed significantly over the course of the nearly twenty years considered in the full 2000 – 2017 analysis. In September 2009, NHLBI launched the BioLINCC website as it exists in its current form, which permits users to request and access data through a secure web portal (Giffen et al., 2015).
Prior to the launch of the site, requestors were required to submit a paper request form by mail, and datasets were disseminated to approved requestors by mailing them a CD-ROM within two weeks (National Heart, Lung, and Blood Institute, 2008). Given the very different means of accessing data before and after the launch of the BioLINCC site, it seems reasonable to expect that patterns of requests from the two periods would differ.

To determine whether request patterns were more predictable after the launch of the BioLINCC site, I repeated this analysis including only the 90 datasets released between 2010 (the first complete year that BioLINCC was online) and 2017, and the 3,704 requests those datasets received during that time. However, as Figure 5-4 demonstrates, this subset shows no more coherent pattern than did the entire set.

Figure 5-4. Mean requests by year for NHLBI datasets released between 2009 and 2017 in each decile, by age of the dataset at time of request.

While it seems apparent that dbGaP requests are likely a cumulative advantage process, this analysis suggests that the same may not be true for NHLBI requests. However, the seeming randomness of the NHLBI data may be based more on the relative sparsity of this set compared to dbGaP’s. With 982 datasets to NHLBI’s 96, and a whopping 100,115 requests to NHLBI’s 3,704, the dbGaP data massively dwarfs the NHLBI data. While it’s possible that NHLBI request patterns over time are significantly different from dbGaP’s, and in fact seem to follow no real pattern at all, it seems just as likely that this is simply too small a set to yield meaningful findings.

Nonetheless, I did proceed with regression analysis of the NHLBI 2010 – 2017 release subset. Total requests were strongly correlated with first- and second-year requests (correlation coefficient = 0.9 for both) but only moderately correlated with third-year requests (correlation coefficient = 0.4).
Total number of requests is weakly positively correlated with calendar year of the dataset’s release (correlation coefficient = 0.1), suggesting that older datasets are slightly more likely to have fewer requests. Table 5-6 summarizes results from these three regression models fit to the 2010 – 2017 request data. All models are statistically significant at p < 0.001.

Table 5-6. Results of regression analysis showing effects of requests during years one, two, and three of an NHLBI dataset’s life on the total number of requests during the 2010 – 2017 period.

| Predictor | Statistic | One-Year Model | Two-Year Model | Three-Year Model |
| --- | --- | --- | --- | --- |
| First year requests | coef | 1.63 (p < 0.001) | -0.66 (p > 0.05) | 0.52 (p > 0.05) |
| | SE | 0.13 (p < 0.001) | 0.56 (p > 0.05) | 0.45 (p > 0.05) |
| | β | 0.9 (p < 0.001) | 0.38 (p > 0.05) | 0.3 (p > 0.05) |
| | VIF | 1.03 | 17.44 | 23.29 |
| Second year requests | coef | NA | 3.81 (p < 0.001) | 1.82 (p < 0.05) |
| | SE | NA | 0.91 (p < 0.001) | 0.75 (p < 0.05) |
| | β | NA | 1.33 (p < 0.001) | 0.64 (p < 0.05) |
| | VIF | NA | 23.22 | 17.27 |
| Third year requests | coef | NA | NA | 5 (p < 0.001) |
| | SE | NA | NA | 0.88 (p < 0.001) |
| | β | NA | NA | 0.31 (p < 0.001) |
| | VIF | NA | NA | 1.38 |
| Year of release | coef | -4.15 (p = 0.01) | -5.07 (p = 0.01) | -5.57 (p < 0.001) |
| | SE | 1.37 (p = 0.01) | 1.62 (p = 0.01) | 1.34 (p < 0.001) |
| | β | -0.22 (p = 0.01) | -0.21 (p = 0.01) | -0.21 (p < 0.001) |
| | VIF | 1.03 | 1.04 | 1.06 |
| R-squared | | 0.795 (p < 0.001) | 0.888 (p < 0.001) | 0.955 (p < 0.001) |

The three-year model is the best fit, accounting for more than 95% of the variability in long-term requests. However, while the R-squared values for the NHLBI regression models are higher than those of the models for the corresponding years of requests in dbGaP, the Beta values for year of release are also higher for the NHLBI models, suggesting that the year of release has a more significant impact on NHLBI requests.
That is, part of the better predictive power in the NHLBI models is simply due to the fact that older datasets have had more time to accrue requests, and not because requests within the first several years are more highly predictive within NHLBI than within dbGaP.

In addition, the two- and three-year models’ VIF values indicate that there is a level of collinearity between first- and second-year requests. However, this finding is not necessarily problematic, since it is likely explained by the fact that datasets receive similar numbers of requests in their first and second years. Moreover, first- and second-year requests are not perfectly collinear in the sense that one predicts the other in the way that variables like age and date of birth do. Finally, while methods exist to address collinearity, they do not perform much better than standard regression models, and many statisticians recommend simply ignoring collinearity (Dormann et al., 2013).

I also considered the role of first-, second-, and third-year requests in predicting later requests, made in the fourth year and beyond. This analysis includes the 49 datasets that were released before 2015 (that is, those that were old enough to have more than three years’ worth of requests). All three models also include release year to control for the influence of a dataset’s age on the number of requests. All three models are statistically significant, although the one-year model achieved a higher (but still significant) p-value. Table 5-7 shows a summary of the results.

Table 5-7. Results of regression analysis showing effects of requests during year one, two, and three of an NHLBI dataset’s life on the total number of requests in the fourth year and later during the 2009 – 2017 period.
| Predictor | Statistic | One-Year Model | Two-Year Model | Three-Year Model |
| --- | --- | --- | --- | --- |
| First year requests | coef | 4.6 (p = 0.03) | 1.43 (p = 0.3) | 0.12 (p = 0.9) |
| | SE | 2.1 (p = 0.03) | 1.48 (p = 0.3) | 1.48 (p = 0.9) |
| | β | 0.3 (p = 0.03) | 0.09 (p = 0.3) | 0.008 (p = 0.9) |
| | VIF | 1.13 | 1.24 | 1.39 |
| Second year requests | coef | NA | 5.13 (p < 0.001) | 3.8 (p < 0.001) |
| | SE | NA | 0.68 (p < 0.001) | 0.82 (p < 0.001) |
| | β | NA | 0.69 (p < 0.001) | 0.51 (p < 0.001) |
| | VIF | NA | 1.1 | 1.79 |
| Third year requests | coef | NA | NA | 2.31 (p = 0.01) |
| | SE | NA | NA | 0.88 (p = 0.01) |
| | β | NA | NA | 0.3 (p = 0.01) |
| | VIF | NA | NA | 1.99 |
| Year of release | coef | -4.55 (p < 0.001) | -4.61 (p < 0.001) | -4.25 (p < 0.001) |
| | SE | 1.26 (p < 0.001) | 0.85 (p < 0.001) | 0.81 (p < 0.001) |
| | β | -0.5 (p < 0.001) | -0.5 (p < 0.001) | -0.46 (p < 0.001) |
| | VIF | 1.13 | 1.13 | 1.17 |
| R-squared | | 0.2332 (p = 0.002) | 0.6593 (p < 0.001) | 0.7057 (p < 0.001) |

Unlike with the dbGaP data, first-year requests do not appear to be a good predictor for reuse later in an NHLBI dataset’s life. In fact, while the one-year model did achieve statistical significance, the first-year request variable barely did (p = 0.03), and first-year requests did not achieve statistical significance in the two- and three-year models. Further, the Beta value for first-year requests was lower than for year of release in each model; the very low Beta values in the two- and three-year models indicate that first-year requests have very little impact at all on later requests.

Because first-year requests are such a poor predictor for requests later in the dataset’s life, I also fit models for second-year requests only and for second- and third-year requests. Table 5-8 summarizes these models.

Table 5-8. Results of regression analysis showing effects of requests during year two and three of an NHLBI dataset’s life on the total number of requests in the fourth year and later during the 2009 – 2017 period.
| Predictor | Statistic | Second Year Only Model | Second and Third Year Model |
| --- | --- | --- | --- |
| Second year requests | coef | 5.32 (p < 0.001) | 3.8 (p < 0.001) |
| | SE | 0.66 (p < 0.001) | 0.81 (p < 0.001) |
| | β | 0.71 (p < 0.001) | 0.51 (p < 0.001) |
| | VIF | 1.01 | 1.79 |
| Third year requests | coef | NA | 2.33 (p = 0.006) |
| | SE | NA | 0.82 (p = 0.006) |
| | β | NA | 0.31 (p = 0.006) |
| | VIF | NA | 1.77 |
| Year of release | coef | -4.34 (p < 0.001) | -4.23 (p < 0.001) |
| | SE | 0.8 (p < 0.001) | 0.75 (p < 0.001) |
| | β | -0.47 (p < 0.001) | -0.46 (p < 0.001) |
| | VIF | 1.01 | 1.02 |
| R-squared | | 0.6522 (p < 0.001) | 0.7056 (p < 0.001) |

Interestingly, these models performed substantially better than the models that include the first year. For example, a model considering only first-year requests accounts for just 23% of the variability in later requests, while a model considering only second-year requests accounts for 65% of the variability. The second and third year model likewise performs better than the first and second year model. This finding suggests that, while first-year requests are a good predictor for dbGaP datasets’ long-term reuse, the same is not true for NHLBI. Rather, it appears that it takes longer for a dataset in NHLBI to be “noticed” and start receiving requests. This finding is further supported by the fact that while 27% of dbGaP datasets received no requests in their first year, 40% of the NHLBI datasets received no requests in their first year. A Welch unpaired two-sample t-test shows that the means of the maximum semantic similarity scores for dbGaP and NIDDK request/dataset pairs are significantly different (t = 2.49, df = 83.34, p = 0.01).

5.1.3 Summary of Findings

Here, I considered whether there are temporal patterns to dataset requests by considering the number of requests datasets receive by year since their release, rather than calendar year.
Specifically, I tested the hypothesis that patterns of requests relative to the original dataset release date will be similar to patterns of citations to articles relative to their publication date. The findings of the dbGaP analysis support this hypothesis; like citations to articles, requests to dbGaP datasets appear to be a cumulative advantage process, with highly requested datasets going on to receive even more requests over time. Except for the tier of most-requested datasets, dbGaP dataset requests peak around the third year after a dataset is released and gradually decline over time, a pattern again seen in citations to articles (Parolo et al., 2015). Finally, regardless of when a dataset was released, the number of requests it receives in the first few years of its life is a good predictor of how many requests it will go on to receive.

On the other hand, this hypothesis did not hold true with the NHLBI analysis. While mean requests by decile followed a clean pattern in the dbGaP datasets, requests in the NHLBI datasets appeared to follow virtually no pattern at all. Even when considering only the subset of datasets that had been released during the time that the request process existed in electronic form, no pattern appeared to exist. Interestingly, however, these analyses found that first-year requests, which were a good predictor of later requests in the dbGaP datasets, were actually a very poor predictor for later requests in NHLBI. Instead, second- and third-year requests showed good predictive power, suggesting that patterns of reuse differ between NHLBI and dbGaP. NHLBI datasets take longer to begin accruing requests, a finding that could suggest genomic data starts being requested earlier than clinical data, but these differences could also be due to characteristics of the repositories themselves. For example, perhaps dbGaP does more to raise awareness of its datasets among the community of researchers who use them.
If a dataset’s release is not publicized, researchers would likely not be aware of its existence until some time after its release, perhaps when articles start to cite the dataset (which would likely correspond with the second and third year after the dataset’s release). Further research into how these repositories promote outreach to their research communities and how researchers typically find datasets to reuse could help explain these findings.

5.2 Research Question 4: Are there dataset topics that are more highly requested?

The datasets contained within the three repositories considered here vary, in some cases significantly, in the number of requests they have received. The length of time a dataset has been available likely plays some role in accounting for more requests; a dataset released in 2009 has had more time to be requested than a dataset released in 2017, so it stands to reason that the total number of requests accrued by age would differ. However, as was demonstrated in Section 5.1, the variations in request numbers can only partly be explained by how long a dataset has been available.

The repositories considered here contain datasets that cover a wide range of different conditions and disorders, from the very common (such as heart disease, which is the leading cause of death in the US) to the very rare (such as biliary atresia, a rare liver disorder affecting about 1 in 20,000 live births in the US), as well as including some reference sets of healthy human subjects (Hopkins, Yazigi, & Nylund, 2017; National Center for Health Statistics, 2017). Given that the burden of disease for various conditions differs widely, biomedical research funders and pharmaceutical development companies understandably tend to focus more money and effort on certain diseases. Likewise, it seems logical that some of the topics within the datasets from these three repositories would be more “popular” than others and would therefore receive more requests.
Here I aim to explore whether variations in numbers of requests are due to the subject coverage of the dataset – in other words, are some topics more highly requested?

5.2.1 Defining Topics

Determining whether some topics are more requested requires first defining what the major topics in repositories are, a task that is not as straightforward as it might seem. Of course, datasets could simply be categorized into topics based on the primary condition the dataset covers, but there are other characteristics of datasets that might influence requests. For example, some of the studies in the NIDDK and NHLBI repositories are longitudinal, following participants over the course of decades. Such datasets provide a wealth of data that could be useful for a range of research purposes, regardless of the topic focus of the original study. As the semantic similarity analysis described in Section 4.1.2 demonstrated, datasets often end up being reused in the context of topics that differ significantly from the topic for which the dataset was originally requested, so at least in some cases, the topic of the dataset is not what makes it appealing for reusers.

Rather than make my own assumptions about how to divide datasets into topics, I utilized a topic modeling approach that used a technique called latent Dirichlet allocation (LDA) to sort datasets into topics with other datasets that were most similar to them. For this analysis, I considered each repository separately, since the subject coverage and request patterns for the three repositories are very different, and used the descriptions of each dataset as the corpus for text mining. I wrote custom R scripts that retrieved the descriptions for each dataset from the three repositories and removed extraneous text, such as HTML tags and section headers, leaving only the dataset description.
Next, I removed common English language stopwords (such as “and” and “the”), as well as a set of custom stopwords, which were terms such as “patients” or “research” that appeared in many of the dataset descriptions and did not provide meaningful context (the full list of stopwords is included as Appendix B).

Next, I experimented with several text processing techniques to determine which processes would yield the most meaningful input for the topic-modeling algorithm. First, various forms of the same word needed to be processed to arrive at a common form that would prevent the algorithm from considering them as different words. For example, the words “decrease,” “decreases,” “decreased,” and “decreasing” are all forms of the common stem “decrease,” and therefore should be considered as one term, rather than four separate terms. I tested the stemming algorithm in the R package SnowballC (Bouchet-Valat, 2019), but found it to be too aggressive in stemming words, essentially indiscriminately stripping many “-s”, “-d”, and “-ing” endings from words that should not have been stemmed, resulting in many terms being reduced to the same stem when in fact they were not topically similar. Instead, I used lemmatization, a process that is more computationally intensive but achieved better results by identifying the term within a pre-defined dictionary and thereby determining the correct root word (Rinker, 2018).

Once terms had been lemmatized, I experimented to determine the unit of analysis that would provide the most meaningful input for the LDA algorithm. I started with individual lemmatized words, producing a count of the number of times each word appeared in a given description. This approach is referred to as a “bag of words” approach, since it simply counts words in the text without consideration for the context of the word within the description.
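As a rough illustration of the bag-of-words counting described above, the following is a minimal sketch in Python rather than the R used for the study's own scripts. The stopword sets, the function name `bag_of_words`, and the example sentence are all hypothetical stand-ins; the study's actual custom stopword list is the one in Appendix B, and lemmatization is assumed to have already been applied to the input.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stopword lists for illustration only;
# the study's full custom list appears in its Appendix B.
ENGLISH_STOPWORDS = {"and", "the", "of", "in", "with", "a", "to"}
CUSTOM_STOPWORDS = {"patients", "research"}


def bag_of_words(description):
    """Count each remaining word in a description, ignoring word order."""
    tokens = re.findall(r"[a-z]+", description.lower())
    stopwords = ENGLISH_STOPWORDS | CUSTOM_STOPWORDS
    return Counter(token for token in tokens if token not in stopwords)


# Hypothetical description text, assumed to be lemmatized already.
example = "Research in patients with elevated blood pressure and elevated glucose"
counts = bag_of_words(example)
print(counts)
```

Because only individual words are counted, "elevated" is tallied twice here with no record of what was elevated in each case.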
Given the complexity of the concepts within the descriptions of the datasets, the bag of words approach was not effective. For example, consider the topics “elevated blood pressure,” “elevated risk,” “elevated glucose levels,” and “elevated liver enzymes.” These four terms refer to very different concepts, but the bag of words approach would consider them similar because they all contain the term “elevated.”

Instead, I experimented with bigrams and trigrams, sets of two or three words appearing in conjunction with each other. This approach provides greater consideration of the words in context. For example, “elevated blood pressure” would be split into the bigrams “elevated blood” and “blood pressure.” Some less meaningful connections would still be made with the “elevated blood” bigram, connecting this description to ones referring, for example, to “elevated blood glucose” or “elevated blood platelets.” However, it would also have the bigram “blood pressure” to connect it to concepts such as “high blood pressure” and “blood pressure measurement.” I processed all descriptions into both bigrams and trigrams and found that bigrams provided the most useful sets of terms for these descriptions.

The lemmatized bigrams were then arranged in a document-term matrix (dtm), which gives a count of the number of times a bigram appears in each dataset description. The dtm serves as the input for the LDA algorithm, which sorts the documents (i.e., the dataset descriptions) into k groups of similar documents, where k is the user-defined number of topics. To some extent, determining the number of topics is a matter of trial and error, experimenting with different values of k until meaningful topics appear. However, the R package ldatuning provides some metrics to assist in determining the optimal value of k. For each repository, I tested all values of k between 5 and 25 to determine the optimal number of groups.
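The bigram and document-term matrix steps described above can be sketched as follows. This is an illustrative Python sketch, not the study's R code; the function names `bigrams` and `document_term_matrix`, the dataset labels, and the token lists are hypothetical, and the tokens are assumed to be lemmatized with stopwords already removed.

```python
from collections import Counter


def bigrams(tokens):
    """Return each pair of adjacent tokens joined as one term."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def document_term_matrix(docs):
    """Build bigram counts per document over a shared, sorted vocabulary."""
    counts = {name: Counter(bigrams(tokens)) for name, tokens in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    matrix = {name: [c[term] for term in vocab] for name, c in counts.items()}
    return vocab, matrix


# Hypothetical, already-lemmatized descriptions.
docs = {
    "dataset_a": ["elevated", "blood", "pressure"],
    "dataset_b": ["high", "blood", "pressure"],
    "dataset_c": ["elevated", "blood", "glucose"],
}
vocab, matrix = document_term_matrix(docs)
for name, row in matrix.items():
    print(name, dict(zip(vocab, row)))
```

In this toy input, dataset_a shares "blood pressure" with dataset_b and "elevated blood" with dataset_c, which mirrors both the meaningful and the less meaningful connections noted in the text. A matrix of this shape is the kind of input an LDA implementation would take.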
The ldatuning package returns results of four metrics, two for which the optimal value should be as low as possible and two for which the optimal value should be as high as possible. Figure 5-5, Figure 5-6, and Figure 5-7 show output from the ldatuning package, run for k between 5 and 25 for each of the repositories.

Figure 5-5. Output from ldatuning package for the dbGaP dataset descriptions.

Figure 5-6. Output from ldatuning package for the NHLBI dataset descriptions.

Figure 5-7. Output from the ldatuning package for the NIDDK dataset descriptions.

Based on this output, I ran the LDA algorithm with the k values that appeared optimal, comparing a few variations where the ldatuning package showed some ambiguity. For example, for NIDDK, I tried models with k of 13 and 14, and examined the outputs to determine which model provided the most meaningful groupings. Each term was assigned a beta value that indicated how strongly it was associated with a topic. Reviewing the ten terms with the highest beta for each grouping helped provide insight as to whether the grouping was meaningful and what the topic was about. Figure 5-8 shows an example of a chart showing the top ten terms in topic 7 of the 14-group NIDDK model. Appendix C contains the full set of charts for all topics for each repository.

Figure 5-8. An example of a chart showing the top ten terms in topic 7 of the 14-group NIDDK model with its corresponding beta value.

A consideration of the terms shown in Figure 5-8 confirms that this is a logical grouping of documents and also provides insight into how the topic can be described. Gastroparesis is a disorder characterized by delayed gastric emptying, resulting in nausea, vomiting, and abdominal pain. It is diagnosed by a gastric emptying scintigraphy test, as well as upper endoscopy, and the severity of the condition can be quantified using the Gastroparesis Cardinal Symptom Index.
Gastroparesis often occurs in insulin-dependent patients with diabetes, but in non-diabetics, it is known as idiopathic gastroparesis. Based on this list of terms, this topic appears to contain datasets that are about gastroparesis.

Of course, some of the terms are general enough that they might not refer to gastroparesis – for example, nausea and vomiting are common to many illnesses, and upper endoscopy is used in the context of many gastrointestinal disorders. Therefore, I also reviewed the list of datasets that had been classified as belonging to topic 7 to determine whether “gastroparesis” was an accurate title for this grouping, as well as to ensure that the grouping of datasets also made logical sense. In reviewing the datasets in this topic, I found that gastroparesis was the most common topic, but a few datasets also contained data about other rare gastrointestinal disorders, so I expanded the title of this topic to “gastroparesis and other GI disorders.”

I conducted this procedure with the descriptions of datasets from each of the repositories. For both NIDDK and NHLBI, the optimal number of groupings was 14, the topics of which are described below. However, the LDA algorithm was less useful in analyzing the dbGaP dataset descriptions. The tuning algorithms suggested using a k of 11, which yielded groupings of datasets that seemed mostly unconnected and for which I could not find meaningful topic descriptions. I experimented with a range of different values for k, but was not able to obtain groupings that made sense. The success of the topic modeling in NIDDK and NHLBI might be due to the fact that the datasets in these repositories do generally fall within a relatively constrained range of topics – after all, they only collect datasets related to diabetes, digestive disorders, and kidney diseases (NIDDK) and heart, lung, and blood disorders (NHLBI).
dbGaP, by comparison, contains thousands of datasets spanning the range of human disease and health, so it may be that the range of topics is too complex to be meaningfully captured by the LDA algorithm. Of course, it is also possible that the groupings the algorithm made did actually have some meaning, but one too obscure for me to understand (such as “datasets with a principal investigator named Jim”); groupings of that kind would also have been unlikely to provide a meaningful basis for this analysis.

Since the LDA algorithm was ineffective for the dbGaP datasets, I instead categorized them based on the “primary phenotype” (that is, the main disease or characteristic of interest in the dataset) reported on the dbGaP website for each dataset. The 1,150 datasets had 452 unique primary phenotypes; to achieve a more manageable number of topics, I grouped the datasets into 18 broad topics as described below, using the MeSH trees into which each phenotype term fell as a guide. Because dbGaP also contains a large number of datasets covering different types of cancers, I further categorized cancer datasets by the type of cancer they described and conducted a sub-analysis of these datasets.

5.2.2 Comparing Requests Across Topics

Because datasets were not evenly distributed among the topics, raw request counts would not provide a fair comparison for considering request rates. For example, consider the top two most requested topics in the NIDDK repository, Chronic Kidney Diseases and Type 2 and Gestational Diabetes (shortened here for convenience to CKD and T2D), which have received 125 and 104 requests and account for 32% and 27%, respectively, of all requests submitted to the NIDDK repository. However, there are almost twice as many CKD datasets (13, or 14% of all NIDDK datasets) as there are T2D datasets (7, or 8% of all NIDDK datasets).
Even though the request counts are similar, the 104 requests for T2D topics are spread among a much smaller set of datasets, and therefore cannot reasonably be compared to the CKD requests.

To account for the differences in number of datasets per topic, I calculated a request to dataset (RTD) ratio. First, I calculated the proportion of requests by dividing the number of requests in a topic by the total number of requests in the repository. Similarly, I calculated the proportion of datasets by dividing the number of datasets in a topic by the total number of datasets in the repository. Dividing the proportion of requests by the proportion of datasets, I arrived at the request ratio. Figure 5-9 provides a visual explanation of this process. In this example, topic A’s request ratio is calculated by dividing the proportion of its requests (70 requests for topic A datasets divided by 192 total requests for datasets in the repository = 0.36) by the proportion of datasets in the topic (4 datasets in topic A divided by 6 datasets total in the repository = 0.67), arriving at a ratio of 0.54.

Figure 5-9. Visual explanation of request ratio calculation.

A ratio of 1 would indicate that a topic received as many requests as would be expected based on the number of datasets in the topic. If every topic in a repository received a score of 1, it would mean that every topic had been requested at the same rate, and essentially all topics were equally popular. A topic with a ratio of greater than 1 is over-requested based on how many datasets it contains; for example, a ratio of 2 would mean the topic had received twice as many requests as would be expected based on the number of datasets it contained. Similarly, a ratio of less than 1 means the topic is under-requested; a ratio of 0.3 would mean the topic had received only 30% as many requests as would be expected based on the number of datasets it contained.
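The RTD calculation described above can be expressed compactly. The following is a minimal sketch in Python; the function name `rtd_ratio` and its parameter names are hypothetical and are not taken from the study's own R scripts. It reuses the topic A figures from the worked example.

```python
def rtd_ratio(topic_requests, total_requests, topic_datasets, total_datasets):
    """Request to dataset (RTD) ratio: share of requests over share of datasets."""
    request_proportion = topic_requests / total_requests
    dataset_proportion = topic_datasets / total_datasets
    return request_proportion / dataset_proportion


# Topic A from the worked example: 70 of 192 requests, 4 of 6 datasets.
topic_a = rtd_ratio(70, 192, 4, 6)
print(topic_a)

# A topic holding 10% of datasets and 10% of requests is requested as expected.
proportional = rtd_ratio(10, 100, 1, 10)
print(proportional)
```

Computed without intermediate rounding, the topic A ratio is approximately 0.55; the 0.54 quoted in the text results from dividing the already rounded proportions, 0.36 by 0.67.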
To revisit our NIDDK example, the CKD topic has a request proportion of 0.323 and a dataset proportion of 0.141, yielding a request ratio of 2.29. The T2D topic has a request proportion of 0.269 and a dataset proportion of 0.076, yielding a request ratio of 3.54. Both topics are highly requested; with a request ratio of greater than 1, they received more requests than would be expected if all topics were requested equally. However, despite the T2D topic having received 21 fewer requests than the CKD topic, it has actually outperformed CKD by 1.5 times, given that T2D had fewer datasets than CKD overall.

In addition to considering the total RTD ratio for each topic, I also calculated a yearly RTD ratio to explore whether topic popularity remained consistent or whether some topics gained or lost popularity over time. To calculate the yearly RTD ratio, I used annual request counts and cumulative rather than total dataset counts. For example, in 2009 dbGaP contained eight datasets in the Cancer topic, which received a total of 230 requests in that year. By 2010, an additional 10 datasets had been added for a total of 18 datasets that received 592 requests in the Cancer topic. To calculate the 2009 RTD ratio for the Cancer topic, I used the proportions for that year; the eight datasets that were about Cancer constituted 4.5% of the 178 datasets that existed in dbGaP at that time. By 2010, dbGaP contained 243 total datasets, so the 18 Cancer datasets now constituted 7.4% of the total.

5.2.3 dbGaP Results

A total of 1,133 datasets from dbGaP were sorted into 18 topics based on their primary phenotype. These datasets had received a total of 104,337 requests between 2008 and 2018. Table 5-9 shows the distribution of datasets and requests among the 18 topics and each topic’s RTD ratio.

Table 5-9. Distribution of dbGaP datasets and requests among 18 topics derived from the assigned primary phenotype, and calculated request to dataset (RTD) ratio.
| Topic | Datasets | Requests | RTD Ratio |
| --- | --- | --- | --- |
| Blood and Cardiovascular | 269 (23.7%) | 66,725 (64%) | 2.69 |
| Mental Disorders | 39 (3.4%) | 3,117 (3%) | 0.87 |
| Eye Disorders | 20 (1.8%) | 1,298 (1.2%) | 0.7 |
| Normal | 48 (4.2%) | 2,668 (2.6%) | 0.6 |
| Women's Health and Pregnancy | 26 (2.3%) | 1,412 (1.4%) | 0.59 |
| Cancer | 319 (28.2%) | 17,208 (16.5%) | 0.59 |
| Neurological | 86 (7.6%) | 4,154 (4%) | 0.52 |
| Lung and Respiratory Disorders | 38 (3.4%) | 1,653 (1.6%) | 0.47 |
| Substance Use Disorders | 18 (1.6%) | 729 (0.7%) | 0.44 |
| Metabolic Diseases | 57 (5%) | 2,147 (2.1%) | 0.41 |
| Skin Disorders | 7 (0.6%) | 238 (0.2%) | 0.37 |
| Other | 39 (3.4%) | 926 (0.9%) | 0.26 |
| Musculoskeletal | 17 (1.5%) | 293 (0.28%) | 0.19 |
| GI Disorders | 26 (2.3%) | 365 (0.3%) | 0.15 |
| Congenital Disorders | 70 (6.2%) | 910 (0.9%) | 0.14 |
| Immune and Autoimmune Disorders | 16 (1.4%) | 173 (0.2%) | 0.12 |
| Urogenital Disorders | 18 (1.6%) | 190 (0.2%) | 0.11 |
| Infectious Disease | 20 (1.8%) | 131 (0.1%) | 0.07 |

The mean RTD ratio for dbGaP topics is 0.52, indicating that disparity exists among the various topics. Most of this disparity comes from the Blood and Cardiovascular topic being highly over-requested, receiving requests at a rate nearly triple what would be expected based on the number of datasets in the category. Six categories also have ratios of less than 0.2, having received less than 20% as many requests as would be expected.

Figure 5-10 shows the annual RTD ratios for each topic between 2008 and 2018. The dashed line indicates a ratio of 1; values above the line indicate higher-than-expected requests, and values below the line, lower-than-expected requests. Annual results are similar to the overall results described above, and RTD ratios remain generally steady for most topics over time. However, a few topics do show change over time. Blood and Cardiovascular datasets, already over-requested in 2008 with a ratio of 1.14, continue to rise in popularity, eventually reaching a high RTD of 2.3 in 2017.
Conversely, datasets in the Mental Disorders topic see their ratio decline over time; with RTD ratios over 1 and even approaching 2 in most years between 2008 and 2012, the ratio declined to just over 0.6 by 2018.

Figure 5-10. Request to dataset ratios for dbGaP datasets, by topic, calculated annually from 2008 – 2018.

In addition to considering the full dbGaP repository, I also performed this analysis for the 319 datasets that contained data about cancer to determine whether differences existed in requests for data about specific types of cancer. These datasets received 17,208 requests between 2008 and 2018. I classified them into ten groups based on primary cancer site, with one group for datasets that included multiple types of cancer as well as forms of cancer that could not be categorized into one of the other nine types. Table 5-10 shows the distribution of datasets and requests among the ten cancer types and each topic’s RTD ratio.

Table 5-10. Distribution of dbGaP datasets specific to cancer and their requests among 10 cancer topics derived from the assigned primary phenotype, and calculated request to dataset (RTD) ratio.

| Topic | Datasets | Requests | RTD Ratio |
| --- | --- | --- | --- |
| Other or Multiple Cancers | 39 (12.2%) | 4,388 (25.5%) | 2.09 |
| Blood Cancer | 63 (19.7%) | 3,950 (23%) | 1.16 |
| Bone and Soft Tissue Cancers | 11 (3.4%) | 618 (3.6%) | 1.04 |
| Urogenital Cancers | 32 (10%) | 1,581 (9.2%) | 0.92 |
| Prostate Cancer | 28 (8.8%) | 1,373 (8%) | 0.91 |
| Lung Cancer | 23 (7.2%) | 1,116 (6.5%) | 0.9 |
| Brain and Nervous System Cancers | 21 (6.6%) | 965 (5.6%) | 0.85 |
| Breast Cancer | 36 (11.3%) | 1,434 (8.3%) | 0.74 |
| Skin Cancers | 25 (7.8%) | 716 (4.2%) | 0.53 |
| GI Cancers | 41 (12.9%) | 1,067 (6.2%) | 0.48 |

The mean RTD ratio among the cancer datasets is 0.96, indicating that requests are relatively evenly distributed among the topics.
The Other or Multiple Cancer type is requested at a rate more than double what would be expected based on the number of datasets, but this category is influenced by a significant outlier: the Cancer Genome Atlas (TCGA). This dataset, which contains detailed data about several different types of cancer, has been requested 2,857 times since its release in 2009, more than three times as many requests as the next-most requested dataset in all of dbGaP. No other dataset in dbGaP (or in any of the other repositories in this study) exceeds its peers by such a margin; TCGA's requests alone account for 65% of requests in the Other or Multiple Cancer topic and 17% of all the requests in the subset of datasets on cancer. Without the TCGA requests, the Other or Multiple Cancer topic would only have an RTD ratio of 0.33, and if TCGA alone were considered its own topic, it would have an RTD ratio of 55.3.

Figure 5-11 shows RTD ratios for the ten cancer types for each year between 2008 and 2018, which remain mostly steady over time.

Figure 5-11. Request to dataset ratios for dbGaP datasets related to cancer, by cancer type, calculated annually from 2008 to 2018.

5.2.4 NHLBI Results

The LDA algorithm classified the 166 datasets from NHLBI into 14 different topics. These datasets had received a total of 893 requests between 2000 and 2018. Table 5-11 shows the distribution of datasets and requests among the 14 topics and each topic’s RTD ratio.

Table 5-11. Distribution of NHLBI datasets and their requests among 14 topics determined by LDA, and calculated request to dataset (RTD) ratio.
| Topic | Datasets | Requests | RTD Ratio |
|---|---|---|---|
| Heart Disease Treatment and Prevention | 17 (10.2%) | 157 (17.6%) | 1.72 |
| Lung Injuries and Mechanical Ventilation | 12 (7.2%) | 97 (10.9%) | 1.5 |
| Population-Based Studies | 7 (4.2%) | 53 (5.9%) | 1.41 |
| Cardiovascular Risk Factors | 11 (6.6%) | 82 (9.2%) | 1.39 |
| Heart Failure and Rhythm Disorders | 16 (9.6%) | 106 (11.9%) | 1.23 |
| Hypertension | 15 (9%) | 98 (11%) | 1.21 |
| Non-Asthma Lung Diseases | 14 (8.4%) | 81 (9.1%) | 1.08 |
| Myocardial Ischemia | 13 (7.8%) | 61 (6.8%) | 0.87 |
| Sickle Cell Anemia and Blood-Borne Diseases | 8 (4.8%) | 32 (3.6%) | 0.74 |
| HIV and Other Viral Diseases | 9 (5.4%) | 33 (3.7%) | 0.68 |
| Asthma | 19 (11.4%) | 60 (6.7%) | 0.59 |
| Emergency Resuscitation | 8 (4.8%) | 17 (1.9%) | 0.4 |
| Coagulation and Sleep Disorders | 5 (3%) | 9 (1%) | 0.33 |
| Blood Transfusions and Marrow Transplants | 12 (7.2%) | 7 (0.8%) | 0.11 |

The mean RTD ratio for all NHLBI datasets was 0.94, suggesting a relatively even distribution of requests among the 14 topics. Topics related to heart disease were particularly popular, with Heart Disease Treatment and Prevention and the related Hypertension and Cardiovascular Risk Factors topics (both of which lead to heart disease) all having RTD ratios over 1. By comparison, non-heart-related topics were more under-requested; of the seven topics with an RTD of less than 1, only one of them, Myocardial Ischemia, is related to any kind of cardiovascular disorder.

Figure 5-12 shows RTD ratio scores for each topic over time. Several of the topics do not appear across all years of this analysis; for example, the first datasets in the Coagulation and Sleep Disorders topic were not released until 2014, so that topic’s first RTD ratio is recorded in that year. As with the dbGaP topics, RTD scores among the NHLBI topics remain mostly steady over time.

Figure 5-12. Request to dataset ratios for NHLBI datasets by topic, calculated annually from 2000 to 2018.

5.2.5 NIDDK Results

The 92 datasets in NIDDK were sorted into 14 topics determined by the LDA algorithm.
These datasets received a total of 387 requests between 2013 and 2018. These datasets and requests do not represent the entire set of NIDDK datasets and requests; because the annual RTD analysis requires knowing how many total datasets existed in a given year, the date of release must be known for every dataset, but NIDDK did not track the date of release for datasets released before 2013. Because 49 of the NIDDK datasets were simply recorded as being released sometime before 2013, the earliest the annual analysis could begin was 2013. Table 5-12 shows the distribution of datasets and requests among the 14 topics and each topic’s RTD ratio.

Table 5-12. Distribution of NIDDK datasets and their requests from 2013 to 2018, for 14 topics determined by LDA, and calculated request to dataset (RTD) ratio.

| Topic | Datasets | Requests | RTD Ratio |
|---|---|---|---|
| Type 2 and Gestational Diabetes | 7 (7.6%) | 104 (26.9%) | 3.53 |
| Chronic Kidney Diseases | 13 (14.1%) | 125 (32.3%) | 2.29 |
| Glomerulopathies* | 3 (3.2%) | 11 (2.8%) | 0.87 |
| Genetics and Disease Mechanisms | 8 (8.7%) | 26 (6.7%) | 0.77 |
| Dialysis and Lifestyle Interventions | 9 (9.8%) | 27 (7%) | 0.71 |
| Nonalcoholic Liver Diseases and Bariatric surgery | 9 (9.8%) | 23 (5.9%) | 0.61 |
| Type 1 Diabetes | 8 (8.7%) | 20 (5.2%) | 0.59 |
| Hepatitis | 6 (6.5%) | 14 (3.6%) | 0.55 |
| Incontinence | 5 (5.4%) | 8 (2.1%) | 0.38 |
| Urological Disorders | 12 (13%) | 19 (4.9%) | 0.38 |
| Gastroparesis and GI Diseases | 4 (4.3%) | 6 (1.6%) | 0.36 |
| Biliary Diseases and Liver Transplantation | 6 (6.5%) | 4 (1%) | 0.16 |
| Islet Transplantation** | 2 (2.2%) | 0 (0%) | 0 |

*diseases affecting the filtering mechanism of the kidney
**transplantation of insulin-producing cells to treat type 1 diabetes

The NIDDK topics had a mean RTD ratio of 0.86, suggesting that there is at least moderate disparity in requests among the topics. In fact, NIDDK has two of the highest RTD ratios of all three repositories, with the Type 2 and Gestational Diabetes and Chronic Kidney Diseases topics scoring 3.53 and 2.29, respectively.
NIDDK is also the only repository to have a topic with an RTD score of 0, meaning the datasets in this topic have never been requested. However, only two datasets were in the Islet Transplantation topic, and both were released in 2016, so these datasets may go on to receive requests over time.

Figure 5-13 shows RTD ratio scores for each topic over time (the Islet Transplantation topic is not shown because its RTD ratio is 0). Many of the topics remain steady over time, but the yearly RTD ratios show somewhat more variability than those for NHLBI and dbGaP. For example, the Hepatitis and Dialysis and Lifestyle Interventions topics both have RTD ratios greater than 1 for the first two years of the analysis, but then decline and drop below 1 in 2015. The Chronic Kidney Diseases topic also sees a significant increase in its RTD ratio in 2015; while some topics in the other repositories see a spike in a single year and then drop back to the baseline rate in the following year, the increase in the RTD ratio for this topic persists through the remaining years of this analysis.

Figure 5-13. Request to dataset ratios for NIDDK datasets by topic, calculated annually from 2013 to 2018.

5.2.6 Summary of Findings

Considering requests of datasets by topic reveals that not all topics are requested at the same rate, with certain topics emerging as highly popular. Although some variation did exist over time, the RTD ratio for most topics stayed generally consistent across the range of years analyzed for each repository.

Among all three repositories, the most highly requested topics were all related to illnesses and disorders with a significant disease burden (National Center for Health Statistics, 2017). For example, heart disease is the number one cause of death in the US, and had the highest ratio for topics in both dbGaP and NHLBI. Diabetes and chronic kidney diseases, the two topics with the highest ratios in NIDDK, are also both on the list of top ten causes of death in the US.
However, other topics are surprisingly under-requested based on their disease burden. For example, breast cancer is the most common cancer in the US, with more than 260,000 women diagnosed annually, and also the fourth deadliest, but falls close to the bottom of the rankings for the cancer-specific dbGaP requests (National Cancer Institute, 2019). Some of the multiple cancer datasets that are the most highly requested likely do contain some breast cancer data (for example, the highly requested TCGA dataset does include breast cancer cases). However, datasets covering prostate cancer, which is also included in the TCGA dataset, receive requests at 1.2 times the rate of those covering breast cancer, despite breast cancer killing nearly 30% more people annually. The difference in requests might suggest that breast cancer is less studied in comparison to prostate cancer, but the opposite is true; in Fiscal Year 2017, NIH funded nearly 3 times as much breast cancer research as prostate cancer research, with $689 million for breast cancer and $239 million for prostate cancer (National Institutes of Health, 2018a), suggesting that breast cancer is in fact more widely researched than prostate cancer. Thus, it is not clear why prostate cancer datasets are requested so much more than those on breast cancer, which is both deadlier and more highly funded.

One possible explanation for the disparity between these cancers’ dataset request rates and their disease burden and research funding could be that prostate cancer researchers receive less funding and therefore take advantage of existing datasets to make the most of their limited funds. Conversely, prostate cancer researchers could be requesting less funding because datasets suited to their purpose already exist and they therefore do not need to request funds to gather new data.
The request data here does not provide enough information to draw a conclusion about the reasons behind this disparity, but does suggest potential avenues for future research.

Considered in combination with the results of the temporal analysis described in Section 5.2, these findings suggest that dataset requests do follow potentially predictable patterns and are not simply a function of datasets accruing requests over time. As will be discussed in Chapters 6 and 7, these findings have potential implications for how datasets are curated and stored.

5.3 Conclusions and Summary of Findings

This chapter has turned to information about the datasets themselves to better understand the dynamics and patterns of why some datasets are requested more than others. These analyses provide answers to the when and what of biomedical data reuse: when in a dataset’s life cycle does reuse happen, and what are the topics that are most highly requested?

A number of factors appear to be at play in determining which datasets researchers choose to request. As would be expected, the age of a dataset has some influence on the number of total requests it has received; a dataset that has been around longer has had more time to accrue requests. However, at least among the dbGaP datasets, age alone does not fully explain the number of requests a dataset receives. dbGaP datasets appeared to follow a cumulative advantage model, with the number of requests a dataset receives in its first year, regardless of when it was released, being highly predictive of how many requests it will receive in later years of its life.

Some of the variation in request rates is also likely due to the topics that datasets cover. Among all three repositories, the most highly requested datasets were those that were related to common diseases that take a high toll on public health, with fewer requests for datasets covering rare diseases.
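The cumulative advantage finding summarized above rests on the idea that early requests predict later requests beyond what dataset age alone explains. One generic way to express "controlling for age" is a partial correlation, sketched below. This is an illustration only, not the model used in this study: the functions `pearson` and `partial_correlation` are hypothetical helpers, and the three lists of values are invented toy numbers rather than data from any repository.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# HYPOTHETICAL toy values, invented purely for illustration.
# They are NOT drawn from dbGaP, NHLBI, or NIDDK.
early_requests = [2, 5, 9, 14, 20, 31]    # requests in a dataset's early years
later_requests = [3, 12, 15, 35, 38, 70]  # requests in its later years
age_years = [4, 9, 5, 10, 6, 8]           # years since release

r_partial = partial_correlation(early_requests, later_requests, age_years)
print(f"Partial correlation, controlling for age: {r_partial:.2f}")
```

A partial correlation that stays well above zero would be consistent with early requests carrying predictive information that age does not. Because request counts tend to be skewed, a rank-based correlation or a regression model with age as a covariate may be a better fit in practice.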
The findings presented here build upon the analyses described in Chapter 4 and help to provide a deeper understanding of how biomedical datasets are reused. Chapter 6 will discuss how the findings presented here and in Chapter 4 can be interpreted within the broad context of biomedical data reuse and explore what these findings tell us about who is using data and why.

Chapter 6: Discussion

The previous chapters have provided a view of the impacts of shared datasets: who is reusing them, for what topics they are being reused, when in their life cycle they are requested, where in the world they are being reused, and why they are reused. In this chapter, I will interpret the major findings of this study and discuss how these findings help advance our understanding of biomedical data reuse. I will also discuss the limitations of these findings and the context within which they can be meaningfully applied.

6.1 Summary of the Major Findings

This study aimed to provide a better understanding of how data are reused by exploring four broad research questions. Because research into data reuse is still nascent, I drew on an understanding of other phenomena in scientific research to formulate hypotheses for these questions. One exception to this is Research Question 4, about the topics of datasets that are most highly requested; little exists in the way of prior studies that would enable forming a hypothesis on this exploratory question. Table 6-1 provides a summary of the major findings.

Table 6-1. Summary of the major findings.

| Research Question | Hypothesis | Finding |
|---|---|---|
| Research Question 1.1: For what methods and analysis types are datasets reused? | Hypothesis 1.1: Genomic datasets of the type found in dbGaP will be more likely to be used in combination in meta-analyses, while clinical datasets of the type found in the NIDDK repository will be more likely to be used on their own to answer an original research question. | Confirmed. Genomic datasets from dbGaP are more often used together in meta-analysis and clinical datasets from NIDDK are more often used on their own for an original study. There are statistically significant differences in the ways that dbGaP and NIDDK datasets are used. |
| Research Question 1.2: How closely are the topics for data reuse aligned with the topics for which the data were originally collected? | Hypothesis 1.2: Similarity between original topics and topics of reuse will be lower for genomic data (found in dbGaP) than for clinical data (found in the NIDDK repository). | Confirmed. Similarity between original topics and topics of reuse is lower for genomic datasets from dbGaP than for clinical datasets from NIDDK. This difference is statistically significant. |
| Research Question 2.1: Where are requestors located in the world? | Hypothesis 2.1: Requestors will be primarily located in regions with a greater proportion of research institutions, including North America, Europe, and Asia. | Partially confirmed. Requestors are located around the world, but English-speaking countries are most over-represented when considering their requests compared to their international research presence. |
| Research Question 2.2: Are there patterns in career stage of requestors? | Hypothesis 2.2: A broad range of career stages, from student to full professor (or equivalent) will be represented. | Partially confirmed. While requests do come from a broad range of requestors, the majority of requests come from established researchers, rather than those early in their career. |
| Research Question 3: Are there temporal patterns to dataset requests? | Hypothesis 3: Patterns of requests relative to the original dataset release date will demonstrate a cumulative advantage process, similar to other scientific communication processes such as article citation. | Confirmed. Patterns of requests do appear to follow a cumulative advantage model, with patterns of requests over time similar to patterns of article citations over time. Early requests are predictive of later requests, especially for dbGaP. |
| Research Question 4: Are there dataset topics that are more highly requested? | NA | Datasets that contain data on more common diseases are more requested. |

6.2 Interpretation of the Major Findings

This study used a variety of methods to describe biomedical data reuse to better understand patterns of reuse and the impacts of shared data for the biomedical research community. Throughout, I have framed this approach as providing answers to the questions of the who, what, when, where, and why of biomedical data reuse. Here, I interpret what the findings of this study can tell us about each of those questions.

6.2.1 Who is Reusing Data?

As I supposed in Hypothesis 2.2, researchers from across the research career life cycle reuse biomedical research data, from students just kicking off their careers, to mid-career professors, to well-established researchers and high-level commercial executives. This finding suggests that data sharing is more equitable in its current form (that is, in data repositories) than it had been through the interpersonal “gift economy” that previously characterized data sharing (Wallis et al., 2013; Yoon, 2017; Zimmerman, 2007). Students and early career researchers who would have lacked the professional network and status to be able to locate and negotiate access to data on their own can, and as this study found, do, make use of the datasets shared through repositories. These earlier career researchers can particularly benefit from the ability to use existing data, since they likely have less access to funding and other research
The representation of researchers from both earlier and later career stages here suggests that a system of sharing data through repositories is more equitable and can help democratize research. However, it is also notable that these various career stages are not evenly represented among requestors. Just under half of requestors to both the NIDDK and dbGaP datasets were established researchers at the full professor, senior scientist, executive, or director level, while assistant professors accounted for around a quarter of the requests. While these early-career and established researchers were making many requests, surprisingly, researchers in the middle of their careers were making fewer. Considering the relative difference in composition of requests for academic career stages (for which actual counts of researchers at each level are known) reveals that the number of researchers in a given career stage alone does not account for differences in rates of requests. The reason for the lower request rates among mid- career researchers cannot be determined with the data available in this study; further research could help elucidate the drivers behind different request rates. Another finding that merits further exploration is the differences in rates of requests for associate and assistant professors to the dbGaP and NIDDK repositories. As discussed in section 4.2.2, associate professors are overrepresented in requests to dbGaP and underrepresented in requests to NIDDK, while the opposite is true for assistant professors. Further research would be needed to explain the reasons behind this finding, but it is possible that assistant and associate professors are engaged in substantively different types of research. 
The higher request rate to dbGaP for associate professors could indicate that they are doing more genomic research or conducting more meta-analyses (the most common type of reuse for dbGaP data), while assistant professors’ higher request rate to NIDDK could suggest that they are doing more clinical research or doing more original research studies (the most common type of reuse for NIDDK data). Analysis of the articles that arise from reuse among these two groups could help provide insight into how early versus mid-career researchers are reusing data.

6.2.2 What Are the Most Requested Topics?

The three repositories considered here include datasets covering a wide range of topics. Even within the NIDDK and NHLBI repositories, which are more constrained in terms of topic coverage than dbGaP, many different diseases and conditions are represented. In general, datasets about more common diseases and conditions were more requested than those that covered rare diseases. It stands to reason that a disease such as type 2 diabetes, which affects many people, would be the focus of more research, and therefore receive more reuse requests, than something such as a rare genetic disorder that affects only a few families in the world. On the other hand, the datasets that cover uncommon diseases do not go entirely unrequested, suggesting that they still represent a valuable source of data for the researchers who are engaged in such research.

Disease burden and research density alone do not fully explain request rates for some topics. The example discussed in section 5.2.6, the relative request rates of prostate cancer and breast cancer datasets, demonstrates that not all topics are requested at a rate that correlates with the relative disease burden and research funding for that topic.
Based on this analysis, which simply compares relative request rates, it is impossible to know what other factors might be at play in determining what topics researchers are most likely to request. Perhaps prostate cancer data are more difficult or expensive to collect than breast cancer data, and therefore researchers are more likely to request existing datasets rather than collect their own. Perhaps the prostate cancer datasets have been cited more in the literature, thus giving them higher visibility. Perhaps the prostate cancer datasets just happen to be better described and more clearly documented than the breast cancer datasets and are therefore more useful. Perhaps it is an issue of gender disparity in research, with prostate cancer, a disease affecting men, receiving more requests than breast cancer, a disease primarily affecting women. These findings suggest that further research into the broader funding, publication, and disease context in which datasets are requested could provide additional insight into the drivers behind the patterns of requests by topics seen here.

6.2.3 When in a Dataset’s Life Cycle Are Requests Made?

Temporal analysis of data requests reveals that long-term requests of datasets can likely be predicted from early requests. In both the dbGaP and NHLBI repositories, the number of requests that a dataset receives in the first three years after its release is a good predictor of how many requests it will receive in the long term, considering both total requests and requests made after the first three years. This finding holds true even when controlling for age, suggesting that the number of requests a dataset receives is not merely a function of how old it is. However, interestingly, while the first year of requests is a good predictor of long-term reuse in dbGaP, it is actually a very poor predictor of reuse in NHLBI.
It is not until the second year that NHLBI datasets begin to be requested at a rate that is predictive of long-term reuse. This finding could be due to differences in patterns of how clinical versus genomic datasets are reused, or could be reflective of differences in how datasets from these two particular repositories are reused. Unfortunately, the NIDDK repository did not have enough historical data to include in this analysis, which would have provided a means of better understanding whether the difference could be ascribed to differences in ways clinical data is used. However, this analysis could be expanded to include other repositories to determine the mechanism behind these different request patterns.

Within dbGaP, datasets also follow typical patterns of request over the course of their life cycle that suggest that dbGaP data reuse requests, much like other scientific processes, follow a cumulative advantage model: success breeds success. Datasets that are highly requested early in their life go on to continue to be highly requested, whereas datasets that receive few requests in their first years tend to continue to be less requested. Dataset requests are also similar to article citations in that they tend to reach a peak number of requests and then receive gradually fewer requests over time. In article citations, this peak is often achieved around five to ten years after the article’s publication (Wang, 2013); for datasets, this peak occurs in the second year of the dataset’s life, after which requests slowly taper off over time. The shorter time period in which datasets reach their peak compared to articles could be due to differences in where in the life cycle of datasets requests happen versus where in the life cycle of articles citations happen. A use request happens much earlier than an article citation, at the start of the research process rather than at its end.
The publication process often stretches over the course of months, as the article goes through peer review, potential revisions, and preparation of the final documents, so there will always be a lag between the time that a researcher uses an article and the appearance of evidence of that use, in the form of a citation. On the other hand, a use request provides evidence of use immediately. Therefore, it is likely that patterns of when datasets and articles are used are similar, and what differs is just the times at which evidence of that use appears.

A surprising exception to the finding about peak request year was that the most highly requested datasets, those in the 90th percentile of overall requests, diverge from the pattern observed in the less-requested datasets in ways that suggest different dynamics could be at work in driving requests. The mean number of requests for these datasets does reach a peak and then drop off in the third year after release, like datasets in the other percentiles. However, the mean number of requests begins to rise again in the fourth year, increasing over subsequent years and eventually even surpassing its previous peak. Without further research, it is difficult to definitively say why this pattern occurs, but one possible explanation is that the most highly requested datasets see peaks at the usual time for datasets and the usual time for article citations. That is, the dataset is released, and, like its peers, reaches its peak in the first year, following whatever dynamics drive requests. However, unlike less-requested datasets, these highly requested datasets also go on to be cited in articles reporting on the secondary reuse that arose in this first wave of requests.
As descriptions of dataset reuse start to appear in the literature, perhaps the temporal pattern of requests starts to behave more like article citations, reaching a peak around the same time that the article describing the dataset would also expect to see a peak in citations (around 5-10 years after publication of the article). With only ten years of requests available for this analysis, it is impossible to know whether this explanation holds. Mean requests for 90th percentile datasets were at their highest in the final year of requests available for this study; without additional years of data to consider, it cannot be known whether that year is the peak or whether requests will continue to increase over time. Revisiting this analysis a few years from now, when additional years of requests are available, could demonstrate whether in fact this predicted pattern occurs. In addition, having a better mechanism to connect datasets with articles that cite them would help provide additional evidence that could support this potential explanation. At present, data citation mechanisms do not allow for sufficiently accurate counts of articles that cite datasets to enable a meaningful analysis of this theory.

If further research supports the initial findings of this study, that the number of requests a dataset receives early in its life is predictive of its long-term reuse, reward and credit for researchers who share highly reused datasets could come more quickly than with other measures of scientific success or productivity. One criticism of measures such as citations to articles is that these are lagging measures that cannot show impact until months or even years after the release of the original article. As discussed above, the nature of the scholarly publication cycle means that citations to articles generally do not even begin to appear until well after the article’s publication, peaking sometimes as late as a decade after the article’s publication.
Some bibliometricians have tried to identify measures that could provide earlier identification of high-impact articles, so-called altmetrics such as mentions of the article on Twitter or number of times readers have saved the article in Mendeley. While altmetrics provide quantitative counts of attention early in an article’s life cycle, that attention generally does not translate into long-term impact in the form of article citations, limiting their usefulness as a means of assigning meaningful scholarly credit (Thelwall et al., 2013). If it can be demonstrated that early attention to datasets in the form of requests in the first few years does reliably predict long-term use of datasets, credit could be given comparatively early in the research life cycle to researchers who share high-value datasets. Being able to recognize researchers who share high-value datasets soon after they share them, rather than having to wait years to receive credit, could incentivize researchers to not only share datasets, but to do so in a timely fashion.

6.2.4 Where in the World Are Requestors Located?

Although the repositories considered here are funded and administered by various organizations within the NIH, an agency of the US government, the datasets contained within them are available worldwide and represent a potentially valuable global research resource. Indeed, requests do come into these repositories from all around the world, but the global distribution of requests is far from uniform. Even when accounting for research presence by considering the number of universities within countries, the United States is highly overrepresented in requests to all three repositories.

Outside of the United States, other patterns emerged in which countries were over- and underrepresented. Other English-speaking countries, such as Canada, the United Kingdom, Australia, and New Zealand, were also overrepresented given their share of universities.
This is a finding that I did not predict, but it is logical given that the websites and documentation for all three repositories considered here are available in English only, making it more challenging for non-English speakers to request and meaningfully use the data. Applying a similar methodology to analyze geographic distribution of reuse for datasets in repositories documented in other languages could provide a comparison to test whether researchers’ native language drives their choice to reuse certain datasets.

Besides potential language barriers, other geographic factors may influence rates of reuse among researchers in various countries. Researchers may be more familiar with repositories located within their home country or region than those in other parts of the world. Previous studies on researchers’ data reuse practices have identified trust in the repository as a major factor in the decision to reuse data (Faniel & Jacobsen, 2010; Faniel et al., 2015; Rolland & Lee, 2013; Yakel et al., 2013; Yoon, 2014, 2017); perhaps researchers are more likely to trust a repository located within their region. One example that supports this hypothesis is the existence of an international collaboration among three nucleotide sequence databases that all contain the exact same data: GenBank, located in the United States; the European Molecular Biology Laboratory (EMBL), with several locations in Europe; and the DNA Data Bank of Japan (DDBJ). The three databases are synchronized daily, so that a user need only submit data to one of the databases for it to be available in all three (Baker et al., 2000). While distributed and redundant data storage makes sense from a preservation perspective, the fact that these three identical databases exist with their own distinctive names (two of which reference geography explicitly) suggests that researchers might make choices about where to look for data based on the geographic location of the repository.
These three databases do not require submission of a use request to access data, so other methods, such as analysis of use logs and IP address access, would be needed to track patterns of reuse. Such an analysis could nonetheless provide insight into the extent to which geographic factors play a role in researchers’ choice to use data from a repository.

For all three repositories in this study, North American and European countries (including European countries where English is not the official language) were the most overrepresented, while countries in Asia, the Middle East, Africa, and South America were almost universally underrepresented – if they were represented at all. The low use of data in Asian nations in particular was a surprising finding, given the major research presence within that region. For example, together, India, China, and Indonesia have about 30% of the world’s universities, yet account for only 2% of the data requests. The majority of countries in the world that had at least one university had no requests at all to any of the three repositories. This finding suggests that these valuable data resources might not be benefitting the researchers who could potentially gain the most value from them: those in countries with less research funding and therefore fewer resources with which to collect their own data.

Within the United States, requests are more evenly distributed among states than they are among countries in the world. There were fewer extremes among requests within the US, with most states requesting data from the three repositories at rates that are in accordance with the amount of NIH funding received by research institutions within the state.
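The over- and underrepresentation comparisons in this section rest on the relative difference in composition, the measure this study borrows from research on disproportionality in education. As a minimal sketch, assuming the commonly used form of that measure (a group's share of requests minus its share of the baseline, divided by its share of the baseline), the figures quoted above for India, China, and Indonesia work out as follows. The function name is illustrative, and the exact formula used in the analysis is not restated in this section.

```python
def relative_difference_in_composition(request_share: float, baseline_share: float) -> float:
    """Return (request_share - baseline_share) / baseline_share.

    Assumed form of the measure: 0 means proportional representation,
    negative values mean underrepresentation, positive values mean
    overrepresentation relative to the baseline (e.g., share of universities).
    """
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return (request_share - baseline_share) / baseline_share


# Shares taken from the text: about 30% of the world's universities,
# about 2% of the data requests.
rdc = relative_difference_in_composition(request_share=0.02, baseline_share=0.30)
print(round(rdc, 3))  # -0.933
```

Under this assumed definition, a value near -0.93 indicates that these three countries submit roughly 93% fewer requests than their share of universities would suggest. The same function could be used with any other proxy for research presence, such as share of NIH funding by state.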
The less extreme variations between request rates and research presence within the states versus within countries could simply indicate that the proxy for research presence within states – NIH funding – is a better representation of biomedical research presence than the proxy used for countries – number of universities. However, outliers do still exist – Alaska, New Mexico, and Wyoming are somewhat surprising outliers in terms of overrepresentation. This finding could indicate that these states are requesting more data than might be expected given the amount of research underway, but conversely, it could also mean that these states are receiving less NIH funding than might be expected given the amount of research. Perhaps researchers in these states are unable to secure adequate NIH funding to support large-scale data collection, so they turn to existing datasets to fill the gap. On the other hand, it is possible that these researchers simply are not applying for as much research funding because they are already planning to reuse existing data. Analysis of not just the research that is funded in each state, but also the proposals that are not funded, could help elucidate the reasons behind the funding/request discrepancy (although information about unfunded proposals is not publicly available). Either way, this finding could support economic arguments in favor of sharing and reuse of biomedical research data – not only can reuse of data save money that would have otherwise been spent on gathering new data, but sharing also increases the return on investment of scientific research funding by extracting additional discoveries from the original data (Arzberger et al., 2004; Costello, 2009).

6.2.5 Why Are Requestors Reusing Datasets?
This study revealed that researchers request data for a variety of different reasons – sometimes they simply want a dataset in which to test a research question, but researchers also request data to pool multiple datasets for questions that one dataset alone cannot answer, develop and test new statistical methods, design and validate software and computational tools, develop data infrastructure, and more. While any given dataset can be, and often is, used in a variety of different contexts, the genomic and clinical datasets here demonstrate different patterns of reuse that are at least in part accounted for by the different methodological limitations and practices associated with these two data types.

As section 4.1.1 describes, some of the methodological differences in how researchers use datasets can be explained in part by the strengths and limitations of certain types of data. Genomic datasets of the kind found in dbGaP often must be combined with each other to reach the massive sample sizes that are needed to achieve adequate statistical power for this type of study (Hong & Park, 2012). With this in mind, the genomic research community has developed data standards to ensure that, wherever you are in the world and whatever type of equipment you use to collect your data, it will likely be interoperable with other genomic datasets (Field et al., 2011).

On the other hand, clinical datasets of the kind found in NIDDK often use variables developed uniquely for a specific research study, often aimed at capturing subjective measures of patient experience. Similar concepts may be represented with varying degrees of difference among studies, such as the example discussed in Chapter 4 of differences in how alcohol consumption and binge drinking are defined in two similar studies from the NIDDK repository. Even if the discrepancy between how two studies define a concept is slight, those two datasets cannot be meaningfully combined.
Although the NIH has made efforts to encourage the use of Common Data Elements (CDEs) that would enable harmonization of data across studies, uptake has not been universal, and researchers will still face problems integrating datasets that have already been collected without CDEs, such as many of the datasets in NIDDK (Sheehan et al., 2016).

These differences in how data are used also influence the degree of similarity between the topic of the original dataset and the topic for which it will be reused. NIDDK datasets were used in contexts that were similar to the original reason for which the data were collected. Over half of request/dataset pairs had a semantic similarity score of 1, meaning that the request proposed reuse in the exact context for which the data had been originally collected, and the mean score for NIDDK was 0.78, demonstrating a high degree of similarity between reuse and the original data context. This finding makes sense, considering the attributes of clinical datasets described above. These datasets focus not only on a defined patient population, but also on fairly specific characteristics of that population – their response to a particular drug or intervention, symptoms and clinical findings related to their disease, or their self-described perception of their health and emotional well-being. While these datasets provide a depth of understanding – often featuring hundreds, if not thousands, of variables – they provide it in a very specific context, meaning that the applicability of these datasets is relatively constrained to a small set of related topics.

On the other hand, genomic data is comparatively uncomplicated, consisting of the genetic sequences of individuals with a certain condition (or even normal, healthy individuals). Not only are these data interoperable with other genomic datasets, but they are also more generally applicable beyond a narrow disease category.
As a result, they are used in a broader range of reuses that may diverge quite significantly from the original reason for which they were collected. The mean semantic similarity score for dbGaP request/dataset pairs was only 0.56, and nearly a third of them had a score of 0, meaning that the request proposed a topic of reuse that was completely different from the reason for which the data had been collected.

It may be tempting to suggest that dbGaP datasets are more useful than NIDDK datasets, since they are not only more requested, but also reused in a broader range of contexts. As Chapter 7 will discuss, just because a dataset is infrequently requested does not mean that it lacks value. However, the datasets that are most likely to be requested frequently and for the broadest range of reuse may merit additional curation or prioritization for preservation.

6.3 Methodological Contributions of the Study

This study is an early exploration of questions that need to be answered to understand the impact of data sharing and thereby reward researchers who share high-value datasets. As this study has demonstrated, data reuse takes many forms; this study also introduces a set of methods for understanding various aspects of this complex phenomenon. These methods will also be of use to repositories that wish to better understand who is using their data and how. Researchers could also benefit from knowing the answers to these questions, so repositories could consider creating dashboards or reports that draw on these methods to provide more detailed information to researchers beyond simple counts of reuse.

First, this study introduces semantic similarity as a method to understand how similar a proposed reuse is to the reason for which the dataset was originally collected.
Using MeSH terms is a useful approach here, since the datasets already have MeSH terms, and the reliable automated text indexer that NLM makes freely available enables easy description of texts with little manual intervention. Because semantic similarity is used in a range of biomedical text comparison applications, packages that incorporate existing, validated algorithms exist for R and other popular statistical software, lessening the challenges of adopting semantic similarity as a metric. While measuring semantic similarity with MeSH terms is limited to texts within the context of biomedical literature, other similar methods exist for quantitatively determining similarity between a pair of texts, so repositories with other types of data could use either a discipline-specific or a general-purpose measure.

The coding of reuse requests in this study gives new insight into the ways that datasets are reused by expanding on the existing taxonomy of reuse types drawn from the literature. This expanded taxonomy provides a more complete understanding of the ways that datasets are reused and is validated by external coders. While other types of reuse likely exist outside of biomedical research, this taxonomy provides a basis for categorizing and understanding types of reuse. Unfortunately, this method is time-consuming because it requires manual coding of reuse requests, which can only be done by someone with a reasonably comprehensive understanding of the science described in the requests. However, in future research, I intend to use the set of use requests I have coded with reuse type as a corpus for a machine learning text classifier to determine whether an automated approach could be used to categorize requests, which could replace the manual process, at least in the context of repositories with similar types of data to those discussed here.
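To make the semantic similarity scores discussed in this section more concrete, the following is a minimal sketch and not the algorithm used in this study. It assumes each MeSH-style term is identified by a dotted hierarchy code, scores a pair of terms by the depth of their shared ancestry, and combines term scores with a best-match average. The codes and function names are hypothetical and chosen only for illustration; validated packages such as those mentioned above would be preferable in practice.

```python
from typing import Iterable


def term_similarity(code_a: str, code_b: str) -> float:
    """Score two dotted hierarchy codes by their shared leading components.

    Returns 2 * shared_depth / (depth_a + depth_b), which is 1.0 for
    identical codes and 0.0 for codes in different top-level branches.
    """
    parts_a, parts_b = code_a.split("."), code_b.split(".")
    shared = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        shared += 1
    return 2 * shared / (len(parts_a) + len(parts_b))


def set_similarity(request_codes: Iterable[str], dataset_codes: Iterable[str]) -> float:
    """Best-match average between the codes of a request and of a dataset."""
    request, dataset = list(request_codes), list(dataset_codes)
    if not request or not dataset:
        return 0.0
    # For every code on each side, keep its best match on the other side.
    best_for_request = [max(term_similarity(r, d) for d in dataset) for r in request]
    best_for_dataset = [max(term_similarity(d, r) for r in request) for d in dataset]
    total = sum(best_for_request) + sum(best_for_dataset)
    return total / (len(best_for_request) + len(best_for_dataset))


# Hypothetical codes shaped like hierarchy identifiers (not real assignments).
dataset = ["C12.777.419"]
print(set_similarity(["C12.777.419"], dataset))  # 1.0: same term
print(set_similarity(["C12.777"], dataset))      # 0.8: parent of the dataset term
print(set_similarity(["C04.588"], dataset))      # 0.0: unrelated branch
```

A score of 1 here corresponds to a request described with the same terms as the dataset, and a score of 0 to a request whose terms share no ancestry with the dataset's terms, which mirrors how the scores of 1 and 0 are interpreted in this chapter.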
This study draws from a discipline quite distant from biomedical research, borrowing the measure of relative difference in composition that is used to assess racial and ethnic disproportionality in educational settings. This metric moves beyond raw counts of reuse to contextualize the extent to which researchers from particular countries or career stages are reusing existing datasets. I have used number of universities per country and amount of NIH funding by state as a proxy for research presence, but this method could also be used with other ways of approximating research presence, such as funding from another relevant funder or number of publications arising from a country.

To understand temporal patterns of dataset requests, this study proposes two techniques: tracking use by deciles of overall reuse and quintiles based on the mean decile per year over the course of the dataset’s life. This method can be applied to dataset requests from any repository, regardless of discipline, since it does not rely on information about the dataset or topic of request. Further, this method could also be used to explore cumulative advantage processes outside of dataset requests, such as citations to articles over time.

Finally, this study introduces the request to dataset ratio as a way of understanding which topics are most requested. This method could also be applied to other repositories in different disciplines or even to other comparisons of topics, such as comparing citations to articles with certain topics. Here, I use a topic modeling algorithm to identify topics within the datasets, a technique that is broadly applicable to texts of any type, regardless of their linguistic content. This approach could therefore be applied to any repository, but topics could also be determined manually or by drawing on metadata from the dataset descriptions.
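One plausible reading of the request to dataset ratio is the number of requests on a topic divided by the number of datasets on that topic. The sketch below assumes that reading, which this section does not spell out, and uses made-up dataset identifiers and topic labels; the function name is likewise illustrative.

```python
from collections import Counter


def request_to_dataset_ratio(dataset_topics: dict[str, str],
                             requested_datasets: list[str]) -> dict[str, float]:
    """Return requests per dataset for each topic.

    dataset_topics maps a dataset ID to its single topic label;
    requested_datasets lists one dataset ID per request received.
    """
    datasets_per_topic = Counter(dataset_topics.values())
    requests_per_topic = Counter(dataset_topics[d] for d in requested_datasets)
    return {topic: requests_per_topic[topic] / n_datasets
            for topic, n_datasets in datasets_per_topic.items()}


# Hypothetical repository: two datasets on a common disease, one on a rare one.
topics = {"ds1": "type 2 diabetes", "ds2": "type 2 diabetes", "ds3": "rare disorder"}
requests = ["ds1", "ds1", "ds2", "ds1", "ds3"]

ratios = request_to_dataset_ratio(topics, requests)
print(ratios)  # {'type 2 diabetes': 2.0, 'rare disorder': 1.0}
```

Because the function only needs a topic label per dataset, the labels could come from a topic model, from manual assignment, or from existing metadata.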
For the dbGaP datasets, for example, the topic model did not perform well, so I used primary phenotype to determine the topics. Once topics are determined, the request to dataset ratio can be used with any number of topics and any number of datasets to provide insight into the topics that are most requested.

Based on the variation in findings among the three repositories studied here, repositories from other disciplines would likely also exhibit some differences in how datasets are reused. In addition to providing a set of methods for exploring data reuse, this study provides a set of data to compare against to understand how reuse differs from one discipline to another, or even from one repository to another within the same discipline. This study also provides a baseline against which to compare data reuse over time. For example, it could be informative to revisit these analyses after the NIH implements its forthcoming data management and sharing plan policy to determine whether the increased demand to share datasets impacts reuse.

6.4 Limitations and Considerations for Application of Findings

As has been discussed, this study aimed to provide a preliminary understanding of a very complex phenomenon. As such, the findings should be understood and interpreted in that context. This study considers a very small group of repositories, several of which had incomplete data (such as NIDDK, which was missing release dates and request info from before September 2013, or NHLBI, which did not provide me with data on requestors or the text of use requests). Even where full data were available, the NIDDK and NHLBI datasets had far fewer requests than dbGaP, so these findings should be considered with less certainty, given that any variations here may be due more to the smaller population size than to actual differences in the phenomena described.

As has been discussed, reuse of data is difficult to quantify.
This study used requests to reuse data as a proxy for reuse, which is likely a better proxy than some other measures, such as download counts or citations within the scholarly literature, but requests are still an imperfect measure. Although requestors must have a fairly specific reason for which they intend to use the data, their actual research may not proceed according to those plans. A researcher might request a dataset and then, upon receiving it, discover it is not actually suited to her needs after all and end up not using it. Anecdotally, researchers have told me that the request process for some of the repositories is onerous enough that they sometimes request more datasets than they will likely need just in case, rather than find out later that they need additional data and have to go through the process again. Connecting use to requestors may also lead to inaccuracies in understanding the career status of reusers; the person who requests the data may not actually be the person reusing it. A professor might request a dataset on behalf of a student, or a project manager on behalf of a research team. Therefore, a data request cannot be considered exactly equivalent to an instance of data reuse, and results should be interpreted with this consideration in mind.

As this study has demonstrated, findings that hold true for one repository may not hold true for another, which suggests that the ability to generalize findings across repositories may be limited. Some of the findings were similar across repositories – datasets were almost universally most highly requested by researchers in the United States and other English-speaking countries, and topics with significant global disease burden were among the most requested for all three repositories, compared to rare diseases. However, for other questions, the findings differed widely between repositories.
For example, the types of research for which dbGaP and NIDDK datasets were used differed widely, as did the temporal patterns of use between dbGaP and NHLBI. That this much difference existed between three relatively similar repositories – all three housing human subject data related to biomedical research and funded by the NIH – suggests that data reuse is not a phenomenon with simple, universal explanations. Therefore, caution should be used in trying to apply these findings to biomedical research repositories or datasets more broadly, and they almost certainly should not be applied to data and repositories from other disciplines.

6.5 Summary of Discussion

This chapter has provided an interpretation of the findings, with a particular focus on what this study can tell us about the who, what, when, where, and why of data reuse. The answers to these questions help extend our understanding of the nature of biomedical data reuse and contribute to the development of scholarship in this area. This study was designed around a specific definition of reuse and constrained by the limited information that is currently collected about data reuse, so these findings must be interpreted within the context of a specific type of biomedical data reuse. Despite these limitations, these findings suggest potential implications for a range of stakeholders in the biomedical research ecosystem, which will be discussed in Chapter 7.

Chapter 7: Conclusion

With researchers increasingly being required to share their data, the amount of publicly available and potentially reusable biomedical research data will continue to grow. Understanding how those datasets are reused will help ensure that informed decisions are made about how to best curate, preserve, and share data, as well as how to reward researchers who share high-value datasets.
Shared datasets exist within a complex research ecosystem with a variety of stakeholders; accordingly, I will suggest how each of these stakeholders could consider acting on the findings of this study. Given the exploratory nature of this study, I will also propose future research that could build upon, confirm, and explain the findings I have presented within this dissertation.

7.1 Implications of the Findings

7.1.1 For Researchers

The findings of this study may help allay some of the concerns that researchers have expressed about sharing their data. Researchers have worried that they might get “scooped” if they share their data – that someone else will beat them to publication on a discovery that they would have gone on to make themselves (Laine, 2017). One controversial editorial on data sharing worried that researchers who reuse data would end up “possibly stealing from the research productivity planned by the data gatherers” (Longo & Drazen, 2016, para. 3). However, the findings of this study suggest that the ways in which researchers are reusing shared data make it unlikely they will end up scooping the original data collector in most cases. Especially for data in dbGaP, the context in which researchers proposed to reuse datasets often diverged markedly from the reason they were originally collected. These reusers are unlikely to scoop the original data collectors because they are looking at such different questions than the collectors were. While topics of reuse were more similar in the NIDDK repository, only about half of the request/dataset pairs had a semantic similarity score of 1, meaning they were reusing the data in the same context as the original collector.

Of course, a semantic similarity of 1 does not mean that the reuser is doing the exact same research as the original collector. Semantic similarity scores are based on the MeSH terms assigned to use requests and datasets, which are mostly diseases or even broad disease categories.
A use request and its corresponding dataset would have a semantic similarity score of 1 if they were both described as covering “Kidney Diseases,” but this term is sufficiently broad that the reuse and the original study could actually be considering quite different questions. Even so, clinical data does generally have more limited reuse potential than genomic data, based on the type of information contained in these datasets and how it is collected. The potential to be scooped is therefore perhaps higher for researchers sharing clinical data than those sharing genomic data.

It should also be noted that not sharing data does not protect a researcher from being scooped; it happens all the time and did even before sharing data became a common practice. The nature of scientific research and discovery means that there are often multiple research teams around the world working on a topic at any given time, not because one is riding the others’ coattails, but simply because “we tend to make important new advances when the tools (intellectual and technical) become available, and others are not unlikely to do the same” (Mole, 2004, para. 10). In fact, in some cases, data sharing and other open science practices can actually help prevent scooping by establishing the primacy of one’s scientific claim. For example, researchers may choose to pre-register their studies using a platform such as the Open Science Framework (where data can also be shared), a process by which they state in advance their outcomes of interest. Because pre-registering or sharing data in a repository creates a time stamp, researchers can definitively demonstrate that they were the originator of an idea or discovery, helping to lessen the possibility that they will be scooped (or at least giving them ammunition to fight back if they are).
Other researchers have expressed concern that making their data publicly available might open them up to scrutiny of their original results by outside researchers (The International Consortium of Investigators for Fairness in Trial Data Sharing, 2016). With increasing concerns about the reproducibility of research, this concern is not entirely unfounded (“Reality check on reproducibility,” 2016), although one might argue that making sure your original results are correct before publishing might be the best course of action to avoid such problems. While it may seem that re-running analyses on the exact same dataset would necessarily lead to the same results and outcomes, it often turns out that this is not the case; it is entirely possible to use the exact same data and come to entirely different results, particularly when the original authors have not clearly documented the computational methods they have used in their analysis. Results can be dependent on factors such as the specifics of the computing environment, software versions and dependencies, and choices the researcher makes about parameters of the analysis (Begley & Ioannidis, 2014; Grüning et al., 2018; Stodden et al., 2012).

This study’s results suggest that, at least for dbGaP and the NIDDK repository, reproducibility studies or other attempts to replicate the original study’s findings are not common purposes for requesting the data. Only 11 requests for dbGaP and two for NIDDK indicated they intended to use the data to reproduce the original results. These numbers correspond to just 0.05% of all requests for dbGaP data and 0.36% for NIDDK data. Of these requests, most described an interest in reproducing the results using slightly different methods, such as using different software or different sampling criteria, rather than questioning the original findings.
Only one of the requests indicated that it aimed to re-analyze the data because the original findings had not been confirmed in other studies; the requestor speculates that this finding “was a spurious result of inappropriate statistical technique.” However, this request is only one out of thousands, indicating that reanalyzing data for the purpose of debunking the original findings is not a major type of reuse.

Of course, this is not to say that reproducibility is not a significant problem in biomedical research; many researchers have raised alarms over the reproducibility of biomedical research (Begley & Ioannidis, 2014; Ioannidis, 2005, 2014). A range of efforts are underway to increase reproducibility in biomedical research, such as the development of guidelines to enhance reproducibility (National Institutes of Health, 2017b) and tools to encourage broader adoption of open scientific practices (Munafò et al., 2017; Nosek et al., 2015a; Nosek & Bar-Anan, 2012). However, it appears from this study that verifying or reproducing results is not a common use of shared research data. Limiting access to data because of an individual’s concern about possible scrutiny, when sharing has the potential to further science and enhance human health, does not serve the public good, particularly given that the findings of this study suggest that this type of reuse is rare.

7.1.2 For Repositories and Curators

Patterns of use requests – both temporal patterns and patterns of highly requested topics – can provide an evidence base for informing curation and preservation decisions. While it may seem desirable to preserve all biomedical data indefinitely, just in case it is of use at some point, doing so is not feasible, nor would long-term storage of certain datasets be an efficient use of funds. For example, as costs of genome sequencing continue to decline, in some cases it may actually be cheaper to just re-collect data rather than store them (Weymann et al., 2017).
Curating data to ensure they are in a usable and discoverable form often requires significant human effort, and despite decreasing costs of memory and the availability of cloud storage, long-term preservation can come with high costs. The findings of this study are preliminary and do not hold across all three repositories, but at least for the data in dbGaP, the number of requests a dataset receives in its first year is highly predictive of the number of requests it will receive over the long term. It may therefore be possible to make meaningful curation decisions early in the data life cycle, prioritizing the datasets that are most highly requested in their first few years.

In addition to predicting future use based on early request rates, it may also be possible to anticipate demand for datasets based on the topics they cover. As this study demonstrated, datasets that focus on common diseases are more requested than those that focus on rare diseases. However, that is not to say that datasets covering rare diseases should be discarded or ignored; in fact, quite the opposite is true. Even though they may be less requested than datasets on more common and well-studied disorders, data on rare diseases are in a sense more valuable because they are more difficult to re-collect. Given the prevalence of diseases such as heart disease, type 2 diabetes, and cancer, finding participants for studies on these topics would be relatively easy, since they affect so many people. On the other hand, it is much more difficult to locate patients with rare diseases by virtue of the fact that they are rare. Especially in the case of genomic research, which requires larger sample sizes, it is often necessary to pool rare disease data from multiple sites that are able to collect the data from small patient groups to whom they have access.
Repositories have been described as “unequivocally essential” to rare disease research, given their important role in facilitating access to rare disease data that can support research that might not be accomplished otherwise (Raza & Hall, 2017, p. 37). In fact, bioethicists have argued that researchers have a responsibility to their participants to share research data, particularly in the case of rare diseases. These patients have freely given their time and data to participate in research that they hope will lead to treatments, and researchers should do all they can to advance that work, including sharing data (Hansson et al., 2016).

Repositories must find a balance between focusing curation and preservation efforts on datasets with high reuse potential and those that may not be reused as often, but have value because of their rarity. Library practices may provide insight into how to prioritize curation and preservation of certain content without entirely discarding lower-use materials. For example, NLM provides enhanced indexing of certain journals that are searchable within its PubMed bibliographic database. The subset of journals that are selected for inclusion in MEDLINE (one of the underlying data sources searched within PubMed), based on criteria such as journal scope and coverage and quality of content, are indexed with additional metadata such as Medical Subject Heading (MeSH) terms and publication type (National Library of Medicine, 2019). Articles from journals that are not selected for MEDLINE indexing can still be searched in PubMed based on metadata such as keywords in their abstract, or author’s name; they just do not have the added information that comes from the NLM’s investment of a curator’s time that enhances the metadata associated with selected journals.

Library practices can also provide guidance on how repositories might choose to make preservation decisions.
Libraries must make choices about their collections based on the physical limitations of their space; there are only so many books that can fit on the shelves. Sometimes this means discarding items that are out of date, damaged, or no longer used. This choice may be appropriate for some datasets in repositories as well, especially if technologies advance in ways that make existing datasets technologically obsolete. On the other hand, sometimes libraries have books that are not highly used, but still merit keeping, perhaps because they have historical value, or are still used from time to time. Off-site storage can provide a location to more cheaply and efficiently store less-used items, with a tradeoff in terms of convenience – a user must request the item and wait for it to be retrieved, rather than walking into the library and simply taking it off the shelf. Repositories could take a similar approach of using “cold storage” for infrequently used data (Dell EMC, 2019). Cold storage methods are more economical and computationally efficient, preserving high-cost and high-performance systems for frequently accessed data while still enabling preservation of lower-use data. Researchers who want to use a lower-use dataset may have to wait a little longer to get it, but they will still be able to get access, while the repository can help control storage costs.

This study also demonstrated that biomedical data reuse is not evenly distributed among researchers around the world. Repositories could consider outreach to under-resourced regions to increase awareness of and access to freely available data resources. In many parts of the world, potential partners are already in place who could facilitate this outreach. For example, the NIH and other US funders support a variety of research and capacity building efforts in Sub-Saharan Africa (National Institutes of Health Fogarty International Center, 2019).
Libraries and institutions that train researchers would be natural partners to help increase awareness and access. Initiatives such as the Hinari Access to Research for Health Programme and Librarians without Borders, which already provide training and resources for librarians in underserved regions, could help to increase librarians’ knowledge of how to support researchers interested in working with existing research data (Medical Library Association, 2019; World Health Organization, 2019).

Establishing contacts within those regions could also help encourage researchers to in turn deposit their data in these repositories, which could significantly increase the usefulness of the repository as a research resource. For example, the Human Health and Heredity in Africa (H3Africa) project aims to increase research infrastructure and expertise to collect genomic and clinical data from African populations (Human Health and Heredity in Africa, 2019). Repositories could greatly improve their representation by ingesting this type of dataset. A 2016 study found that over 80% of the existing genomic data in the world came from people of European descent; other populations made up as little as 0.05% of the existing genomic data (Popejoy & Fullerton, 2016). Partnering with researchers in other regions of the world could therefore not only increase access and use of existing data, but potentially create pathways to increase the diversity of subjects represented in repositories and thereby improve healthcare for patients of all races.

7.1.3 For Research Funders

As this study has demonstrated, biomedical data repositories represent a rich source of data to fuel research across a broad range of topics, sometimes diverging widely from the original purpose for which the data were collected. The NIH has, accordingly, made a significant investment in curating and making available data arising from NIH-funded research.
The recent NIH Strategic Plan for Data Science highlights the need to develop infrastructure and policies that help make biomedical research data FAIR (findable, accessible, interoperable, and reusable), thereby enhancing the ability of researchers to locate and reuse the data (National Institutes of Health, 2018b). The findings of this study suggest that researchers do have an interest in using shared data from repositories, and further emphasis by NIH on funding and policy aimed at increasing the FAIRness of data could help increase reuse by making it easier and lowering the barrier to entry.

In addition to providing funding and policy guidance that will increase the availability and usability of biomedical research data, the NIH has also begun to encourage researchers to reuse data by providing funding specifically for that purpose. While many Funding Opportunity Announcements (FOAs) mention that secondary analysis or data reuse is permitted, a few of the currently active FOAs are intended specifically for that purpose. Some of these FOAs are specific to particular disorders or areas of research, such as “Secondary Analyses of Existing Alcohol Research Data” and “Cancer-Related Behavioral Research through Integrating Existing Data,” or even fund use of data from a specific repository, such as “Leveraging Population-based Cancer Registry Data to Study Health Disparities,” which funds secondary analysis of data in either the Surveillance, Epidemiology, and End Results (SEER) Program or the National Program of Cancer Registries (NPCR) (National Institutes of Health, 2016, 2017a, 2018d). These FOAs highlight some of the benefits of reusing data – accelerating discovery, increasing cost-efficiency, and enabling access to large datasets or data on rare diseases that researchers likely would not be able to gather on their own.
While these FOAs can help raise awareness of existing data resources and incentivize their reuse, it is important to caution that support for reuse of shared data should not be considered an alternative to providing funding for researchers who aim to collect their own data. For example, an NIH pediatric cancer research effort proposed in 2019 features data sharing as a major focus of the initiative. While cancer researchers generally recognize the importance of sharing and combining data, especially in the context of rare cancers, some argue that making data sharing the emphasis in this initiative is ineffective. They point to differences in the biology of childhood cancers that make integrating data from multiple sources a less meaningful approach than in the context of adult cancers and suggest that funding other approaches might be more effective than prioritizing data sharing (Kaiser, 2019). Not all questions can be answered with existing data, and as technologies progress, older data may no longer be useful. Therefore, even as more data of higher quality become widely available, reuse of existing datasets should be considered complementary to, rather than a replacement for, research activities that involve collecting new data. While this point may seem too obvious to need stating, I believe it bears stating explicitly given a political climate in which some government entities are seeking to cut funding to federal agencies that conduct and fund research.

Funders also have an important role to play in thinking about how they will not only encourage reuse of shared data, but also reward the researchers who originally collected datasets that go on to be reused. As Chapter 2 discussed, the notions of credit and reward are foundational to scientific norms (Carpenter et al., 2014; Durieux & Gevenois, 2010; Garfield, 2002; Holden et al., 1994; Kochen, 1987; Latour & Woolgar, 1986; Merton, 1942).
Many researchers already balk at the idea of sharing data because they see it as giving away one of the products of their financial and intellectual investment (Longo & Drazen, 2016; The International Consortium of Investigators for Fairness in Trial Data Sharing, 2016). Funders (as well as research institutions) are in a position to encourage and incentivize data sharing by giving credit to researchers who have shared data that go on to be used by others.

As this study has demonstrated, not all biomedical datasets are reused equally. Some of the datasets in this study had been requested hundreds or even thousands of times, whereas others had only a handful of requests. Part of the variability in the number of requests a dataset receives is due to the type of data it contains; for example, NIDDK clinical datasets, with their relatively constrained uses based on the way the data are collected, are requested less than dbGaP genomic datasets, which are more interoperable and tended to be used in a range of topics that diverged more significantly from the original context in which the data were collected. Given the differences in these types of data, it would be reasonable to expect that dbGaP datasets, which have a wider range of uses, would be more requested. Because of these differences, it hardly seems fair to compare datasets from these repositories on counts of requests alone; dbGaP datasets had 103 requests on average compared to just 8 requests on average for NIDDK datasets. Raw counts alone simply cannot be used to compare dataset use across multiple repositories.

Even within repositories, using raw request counts may be an ineffective means of rewarding data sharing, since dataset use may not always be equivalent to dataset value. As the topic request analysis in section 5.3 demonstrated, datasets that cover common illnesses receive more requests than datasets covering rare illnesses.
However, it could be argued that a dataset on a rare disease is more valuable than one on a common disease, regardless of how much either dataset is used. As discussed above, data on rare diseases are more difficult to come by and would be more difficult to recreate than data on common diseases, which have plenty of potential subjects to draw on. It could be reasonably argued that a researcher who shares a dataset on a very rare disease is making a significant contribution to research and to meaningfully improving the lives of patients who would not otherwise have been the focus of research beyond the original researcher’s work, even if only a few other researchers use the data. To suggest that such a dataset deserves less credit than a dataset that is requested many times risks rewarding researchers of common diseases over researchers of rare diseases and could even potentially disincentivize sharing of rare disease data.

Much as article citations are a flawed means of measuring the actual value or impact of an article (Edwards & Roy, 2017; Lane, 2010; Werner, 2015), simple counts of dataset reuse (whether measured by requests or other quantitative counts) are likely an inaccurate means of determining a dataset’s impact. Bibliometricians have begun to call for responsible application of metrics to avoid creating perverse incentives or misunderstanding the actual impact of articles, and that field has long been characterized by efforts to develop more accurate means of measuring and quantifying scientific impact (Edwards & Roy, 2017; Hicks et al., 2015). The scientific community has a rare opportunity now, as data sharing begins to become a more standard and formalized practice, to think carefully about how data sharing should be quantified, considering such questions as how value in data is defined and how to give credit for sharing in ways that meaningfully advance science and reward data sharers for meritorious contributions.
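As a rough illustration of why raw request counts do not transfer across repositories, the following sketch ranks each dataset only against the other datasets in its own repository. It is a minimal example under stated assumptions, not a metric proposed or validated in this study: the dataset names and request counts are invented for illustration, `percentile_ranks` is a hypothetical helper, and the mid-rank percentile is only one of several possible normalizations.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(counts):
    """Return each dataset's mid-rank percentile (0-100) within one repository."""
    ordered = sorted(counts.values())
    total = len(ordered)
    ranks = {}
    for dataset, n_requests in counts.items():
        # Number of datasets in this repository with strictly fewer requests.
        below = bisect_left(ordered, n_requests)
        # Number of datasets with exactly this count (including this one).
        ties = bisect_right(ordered, n_requests) - below
        ranks[dataset] = 100.0 * (below + 0.5 * ties) / total
    return ranks


# Hypothetical request counts, invented for illustration only;
# these are NOT figures from the repositories analyzed in this study.
requests_by_repository = {
    "dbGaP": {"genomic_A": 12, "genomic_B": 103, "genomic_C": 950},
    "NIDDK": {"clinical_A": 2, "clinical_B": 8, "clinical_C": 40},
}

# Normalize each repository separately so datasets are compared to their peers.
normalized = {
    repo: percentile_ranks(counts)
    for repo, counts in requests_by_repository.items()
}

for repo, ranks in normalized.items():
    for dataset, pct in ranks.items():
        raw = requests_by_repository[repo][dataset]
        print(f"{repo:6s} {dataset:11s} raw={raw:4d} percentile={pct:5.1f}")
```

In this invented example, a genomic dataset with 103 requests and a clinical dataset with 8 requests both fall at the 50th percentile of their own repository, even though their raw counts differ by more than a factor of ten. A within-repository rank of this kind addresses only the cross-repository comparison problem; it does nothing to resolve the concern that a little-requested rare disease dataset may be more valuable than its request count suggests.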
The findings of this study help lay the foundation for future efforts aimed at determining the answers to these questions.

7.2 Directions for Future Research

This study represents some of the first research to seek a comprehensive understanding of biomedical data reuse. As such, it has largely been exploratory in nature, but these findings suggest a wide range of potential avenues for future research. Some of the research directions I propose here are not currently possible, either because data sharing as described here is a new enough phenomenon that not enough historical data are yet available to conduct the analyses, or because the necessary data are simply not collected at present. I hope that the research I propose here may encourage repositories to collect the necessary data, as well as provide direction for the development of infrastructure that will enable connections between datasets and the articles that cite them.

7.2.1 Understanding Data Requestors and Data Reuse

This study enabled a high-level understanding of who is requesting data, but it raises many additional questions about who is reusing data and patterns of reuse among requestors. For example, what accounts for the lower rate of requests among mid-career researchers, particularly associate professors, compared to early and later career researchers? Are there meaningful reasons behind the finding that associate professors are overrepresented in requests to dbGaP and underrepresented in requests to NIDDK, while the opposite is true for assistant professors? Some of these questions could be answered by examining not only use requests, but publications arising from these requests. If systems existed to automatically connect articles to the datasets they cite, it could be possible to trace data reuse from the point of request to the point of publication, which would enable a better understanding of what various requestors are actually doing with the data.
Some efforts at developing such systems are already underway. For example, the Make Data Count project aims to track data reuse by using the infrastructure that already exists to track citations to articles (Fenner et al., 2018; Make Data Count, 2019). However, tracking data reuse in this way requires not only that datasets have persistent unique identifiers that comply with a global standard, such as Digital Object Identifiers (DOIs), but also that authors know how to cite datasets and journals correctly indicate that the citations refer to datasets. Even with the technical infrastructure in place, correct and complete tracking of dataset reuse will require significant cultural changes in science to ensure that all stakeholders in the research process document data citations in a way that enables tracking of data reuse. It is worth noting that none of the three repositories included in this study assign DOIs to their datasets, so tracking their reuse in publications is at present technically infeasible.

An even better way of finding out what requestors are doing with the data is simply asking them – since the identity of requestors is known, survey research could elicit further information about why requestors chose to reuse data, what they intended to do with it, what they actually did with it, and the impact that shared data have had on their research. This research could enable a deeper understanding of the nuances of data reuse that could inform repository plans and policies, funding decisions, and outreach to researchers.

7.2.2 Long-term Temporal Patterns

Many scientific research processes, including article citations, follow temporal patterns, and understanding these patterns can help make predictions about future performance as well as enable the development of meaningful metrics to evaluate the phenomenon in question.
While this research was only able to find such patterns in requests for one of the repositories, the findings were in line with patterns observed in similar phenomena, such as article citations. With only two repositories to consider here, it is possible that this study simply did not have enough data to draw on, so repeating this analysis with requests from other repositories could provide more meaningful results. It might also be possible to use counts of downloads or views for this analysis, in order to include repositories that do not require submission of a use request. While other parts of this study relied on use requests to understand reuse, for this analysis, that level of detail is not required, and simple annual counts of use – whether in the form of downloads, views, or requests – may be adequate.

This analysis could also yield more meaningful results if repeated in a few years, when a longer period of request data is available. For example, the 90th percentile dbGaP datasets received more requests in the final year of available data than in any previous year, so considering how the pattern of requests progresses over time could help answer some remaining questions. Will request rates continue to increase each year? It seems likely that request rates would peak and then start to decline at some point, but when will that be? Revisiting this analysis in perhaps two to five years could give a more complete picture of the temporal patterns of reuse.

The temporal analysis is also an area of research that could benefit from better connections between datasets and articles citing them. Use requests for datasets are almost certainly driven in part by the publication of articles in which researchers describe their reuse – citations to the datasets increase their visibility as well as potentially suggesting new types of reuse, when they are used in contexts that diverge from the original reason they were collected.
The ability to track citations to datasets could help explain some of the temporal patterns in requests and provide additional predictive power to models aimed at forecasting future patterns of reuse.

7.2.3 Understanding Reuse Within the Broader Research Context

Biomedical datasets are part of a complex research ecosystem that includes other research inputs and outputs, such as articles, code and software, and research funding, to name just a few. This study has provided insight into some patterns of reuse, but understanding the drivers behind those patterns likely requires looking to the broader context of how those datasets are situated within the research ecosystem. As I have emphasized, the ability to connect datasets with the articles that cite them is crucial for understanding the context of how datasets are reused. In addition, comparing reuse of datasets by topic to the broader research funding context and the global disease burden could help provide insight into why some topics are more requested than others. These findings could identify disease areas for which datasets are under-utilized and could potentially benefit from outreach to research communities.

7.3 Conclusion

This study has provided a clearer picture of biomedical data reuse – who is reusing data, what they are doing with it, and why some datasets are more highly requested. The findings presented here demonstrate that biomedical data sharing is not a single phenomenon, but can take a range of forms that are in many cases driven by the type of data in question. Patterns of reuse differ between genomic and clinical data, with the former being used in more meta-analyses and across a range of topics that diverges more from the original purpose for which the data were collected, while the latter tend to be reused on their own in studies that are more similar to the purpose for which they were collected.
Reuse is also driven by the topic of the dataset, with datasets covering common diseases being requested more than those covering rare diseases. Beyond the value of a dataset’s topic in predicting the number of requests it receives, its performance early in its life is also useful in predicting how many requests it will accrue over time. Finally, data are reused by researchers from around the world and from a range of career stages, though they are in many cases most highly requested by the researchers who have the most resources with which they could collect their own data – later career researchers in the United States – as opposed to earlier career researchers and those in less-funded countries who could potentially benefit the most from having data available for reuse. These findings are a first step in better understanding this complex phenomenon, and suggest potential avenues for future research, as well as policy and curation directions for funders and repositories.

A vast amount of biomedical research data is already available, and this amount is only going to continue to grow as data sharing policies are put in place, especially when NIH eventually adopts a sharing policy that applies to all NIH funding. Understanding how those datasets are being reused is crucial to ensuring that data are shared in ways that enable meaningful reuse and that the datasets with the most value are properly curated and preserved. Many questions still remain, but this study has taken some important first steps in better understanding data reuse.

Appendix A: Examples of Requests for Each Type of Reuse

The following table provides examples of use requests from dbGaP and NIDDK that exemplify the types of reuse described here. Request text is reproduced exactly from the original without corrections or addition of spelled-out acronyms.
Reuse Type: Original research
dbGaP Example: We propose to conduct a genome-wide scan for genetic associations with secondary phenotypes captured in the case-control sample, such as body-mass-index, lipid levels, fasting blood sugar, and serum creatinine measures using a novel secondary analysis approach. The analysis proposed represents a comprehensive and statistically rigorous genome-wide search of secondary phenotypic associations, and as such, is likely to contribute to our understanding of the underlying biologic process of peripheral arterial disease (PAD).
NIDDK Example: Cardiovascular disease is the leading cause of mortality among Hemodialysis patients. Prior research suggests that volume status and vascular stiffness are associated with cardiovascular disease. These factors are thought to be related to the rate of ultrafiltration, Hemodialysis session length, dialysate sodium concentration and phosphate intake. Though analysis of data from the HEMO Study, we seek to clarify the relationships of the relationships of these factors to one another as well as to cardiovascular outcomes among Hemodialysis patients.

Reuse Type: Meta-analysis
dbGaP Example: The main goal of this research is to re-define the place multiple sclerosis (MS) occupies in the human disease landscape. MS is a complex autoimmune disorder of the central nervous system and is the second most common cause of neurological disability in adults after trauma. We will use de-identified genetic information from studies performed on other neurological, autoimmune, and unrelated diseases to better understand their similarities and differences with MS on a genome-wide scale.
NIDDK Example: Our goal of this study is to improve clinical outcomes in health. Hypertension is a topic that influences millions of lives around the world. As such, optimal targets for patients is of upmost importance. Furthermore, it is possible that optimal targets are not consistent by subpopulation groups. The NIDDK has offered access to guideline influencing studies: specifically the AASK and the MDRD trial. Our goal of our study will be to utilize advances from both these studies and pool data together to find new, meaningful clinical insights.

Reuse Type: Comparison or control
dbGaP Example: Some children have severe seizures and other issues with their brains. Occasionally brain tissue is removed from these kids for surgical reasons. By RNA sequencing these samples we might be able to understand the cause, course and treatments for the disease. The GTex data allows us to compare these sick kids to normal individuals so that we can better understand what is going wrong in the kids.
NIDDK Example: The primary aim of this community participatory project is to conduct a translational study of the CDC Diabetes Prevention Program’s successful clinic-based lifestyle intervention delivered in Community settings by community residents. Community residents at increased risk of type 2 diabetes based on BMJ, along with other risk factors, form the target population. Outcome measures include anthropometrics (e.g., BMJ, waist circumference), eating habits, and physical activity habit. The DPP data will be used to form comparison groups to examine the outcome of the community based lifestyle, intervention program.

Reuse Type: Validation
dbGaP Example: The aim of our project is to better understand how oncogenic events cooperate during the early stages of lung cancer and during its malignant progression. To achieve this goal we are using multiple mouse models of lung cancer to study how gene gain and loss of function influences tumorigenesis. The dbGAP dataset will provide a valuable resource to help validate that recurrent genomic changes seen in our mouse models are relevant to the human disease. Ultimately, our goal is to identify new targets for diagnosis, prognosis and personalized treatment of patients.”
NIDDK Example: To date the majority of studies have focused on chronic kidney disease as a single entity with respect to outcomes. We have preliminary data to suggest that in heart failure populations this may not be correct and that the underlying pathophysiology may be highly relevant with respect to the adverse prognosis. To date we have validated these findings in 4 heart failure datasets. Interestingly, there did not appear to be any relationship between heart failure severity and the strength of this interaction. As a result it is possible that the above described observations may not be restricted to heart failure populations and thus we are requesting the MDRD dataset to investigate these findings

Reuse Type: Statistical methods
dbGaP Example: We are requesting the late onset Alzheimers disease data to apply the statistical methods that we develop for mapping complex genetic traits. Complex genetic traits are caused by more than one disease gene and/or non-genetic traits. Our methods take into account this fact to map the disease genes. We have developed a method that does not require disease model specification, i.e., the type of inheritance pattern of the disease in a family, which is unknown in real life but many methods need its specification. To study its properties, we have applied the method to simulated data. Now we need to apply it to a real data and so we are requesting this family data.
NIDDK Example: In most longitudinal medical researches, the spacing of visits is usually the same for all subjects (unbalanced design). In this study, we will evaluate how unbalanced design with increasing the frequency of visits in the high risk group will influence the precision of covariate effect estimation in interval-censored time to event data. The TN01 study used this unbalanced design, we will use data from this study to illustrate how this unbalanced design is beneficial in term of improving precision in risk factor estimation.”

Reuse Type: Software or tool development
dbGaP Example: The goal of this research is to create software for physician researchers which allow them to rapidly identify common genetic changes among patients suffering from the same disease. That knowledge will enable physicians to better diagnose and treat disease of all types. The real world data requested for this project will ensure that the software we develop meets the needs of clinical personnel.
NIDDK Example: Computer simulation models would enable researchers to assess the comparative-effectiveness and cost-effectiveness of alternative strategies for the prevention and treatment of type 2 diabetes. However, due to constantly evolving treatment landscape, these models need to be repeatedly updated as new evidence becomes available to inform their structure or input values. This project aim to update the stroke, coronary heart disease, and nephropathy sub-models in MMD by using both secondary individual-level data available through NIH repository and summary data published in the literature.

Reuse Type: Infrastructure
dbGaP Example: The Autism Sequencing Consortium (ASC) is an organization of more than 20 research groups. The ASC seeks to collectively exploit DNA sequencing to resolve a substantial fraction of the genetic factors that contribute to Autism Spectrum Disorders (ASD). Mount Sinai School of Medicine serves as the bioinformatic Hub of the ASC. As the Hub, we store and share sequence data and call variants with ASC members, and provide them with a computing platform on which they can perform analyses. The main goal of this work is to identify rare genetic variants that associate with ASD to better understand the underlying causes of ASD.
NIDDK Example: [No requests for this use type in this repository.]

Reuse Type: Reproducibility or reanalysis
dbGaP Example: We wish to replicate the work of Alexandrov et al. (Nature 2013; study reviewed in Martincorena Science 2015) counting the number of mutations that correspond to various mutational signatures. For this we begin with a list of mutations available through TCGA .maf files; we must then add the local genetic context for these mutations, e.g. the preceding and following nucleotides for each single-nucleotide mutation, and this information is in the requested data from TCGA.
NIDDK Example: An analysis in the DCCT, suggested that men were at increased risk for severe hypoglycaemia. This has not been replicated in other studies. We hypothesise that gender difference is not a risk factor for severe hypoglycaemia, and that the effect found in the DCCT was a spurious result of inappropriate statistical technique.

Appendix B: Custom Stopwords Used in LDA

This list contains stopwords that were removed from the NHLBI and NIDDK dataset descriptions for the LDA topic modeling.

background
center
conclusions
data
design
grant
individual
measure/measures
objectives
outcome
participant/participants
research
sample/samples
source
study/studies
supported

Appendix C: Topic Model Term Charts

[Figure: Terms associated with topics from dbGaP LDA model.]

[Figure: Terms associated with topics from NHLBI LDA model.]

[Figure: Terms associated with topics from NIDDK LDA model.]

References

Acuna, D. E., Allesina, S., & Kording, K. P. (2012). Future impact: Predicting scientific success. Nature, 489(7415), 201–202. https://doi.org/10.1038/489201a
Ali-Khan, S. E., Harris, L. W., & Gold, E. R. (2017). Motivating participation in open science by examining researcher incentives. eLife, 6, 1–12. https://doi.org/10.7554/eLife.29319
Ali-Khan, S. E., Jean, A., MacDonald, E., & Gold, E. R. (2018). Defining success in open science. MNI Open Research, 2, 2. https://doi.org/10.12688/mniopenres.12780.1
Altman, M., & Crosas, M. (2013). The evolution of data citation: From principles to implementation. IASSIST Quarterly, 37(1–4), 62–70.
Altman, M., & King, G. (2007).
A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine, 13(3/4). Retrieved from http://www.dlib.org/dlib/march07/altman/03altman.html
Anderson, M. S., Ronning, E. A., DeVries, R., & Martinson, B. C. (2010). Extending the Mertonian norms: Scientists’ subscription to norms of research. Journal of Higher Education, 81(3), 612–624. https://doi.org/10.1353/jhe.0.0095
Aronson, A. R., Mork, J. G., Gay, C. W., Humphrey, S. M., & Rogers, W. J. (2004). The NLM Indexing Initiative’s Medical Text Indexer. Studies in Health Technology and Informatics, 107(Pt 1), 268–72. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/15360816
Arzberger, P., Schroeder, P., Beaulieu, A., Bowker, G., Casey, K., Laaksonen, L., & Moorman, D. (2004). Promoting access to public research data for scientific, economic, and social development. Data Science Journal, 3(November), 135–152.
Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G., & Tuli, M. A. (2000). The EMBL nucleotide sequence database. Nucleic Acids Research, 28(1), 19–23. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/10592171
Begley, C. G., & Ioannidis, J. P. A. (2014). Reproducibility in science. Circulation Research, 116(1), 116–126. Retrieved from http://circres.ahajournals.org/content/116/1/116.long
Belter, C. W. (2014). Measuring the value of research data: A citation analysis of oceanographic data sets. PLoS ONE, 9(3). https://doi.org/10.1371/journal.pone.0092590
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Wheeler, D. L. (2005). GenBank. Nucleic Acids Research, 33(Database issue), D34-8. https://doi.org/10.1093/nar/gki063
Bhatt, A. (2010). Evolution of clinical research: A history before and beyond James Lind. Perspectives in Clinical Research, 1(1), 6–10. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/21829774
Bierer, B. E., Crosas, M., & Pierce, H. H. (2017). Data authorship as an incentive to data sharing.
New England Journal of Medicine, 376(17), 1684–1687. https://doi.org/10.1056/NEJMsb1616595
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Retrieved from http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
Bollen, J., Van de Sompel, H., Hagberg, A., & Chute, R. (2009). A principal component analysis of 39 scientific impact measures. PLoS ONE, 4(6). https://doi.org/10.1371/journal.pone.0006022
Bollen, J., Van De Sompel, H., Smith, J. A., & Luce, R. (2005). Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing and Management, 41(6), 1419–1440. https://doi.org/10.1016/j.ipm.2005.03.024
Borgman, C. L. (2011). The conundrum of sharing research data. SSRN Electronic Journal, (1–14). https://doi.org/10.2139/ssrn.1869155
Bornmann, L., & Daniel, H.-D. (2007). What do we know about the h index? Journal of the American Society for Information Science and Technology, 58(9), 1381–1385. https://doi.org/10.1002/asi
Bornmann, L., & Daniel, H.-D. (2008). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation, 64(1), 45–80. https://doi.org/10.1108/00220410810844150
Bouchet-Valat, M. (2019). SnowballC: Snowball stemmers based on the C “libstemmer” UTF-8 Library. R package version 0.6.0. Retrieved from https://cran.r-project.org/package=SnowballC
Bryman, A. (2006). Integrating quantitative and qualitative research: How is it done? Qualitative Research, 6(1), 97–113. https://doi.org/10.1177/1468794106058877
Burrell, Q. L. (2003). Predicting future citation behavior. Journal of the American Society for Information Science and Technology, 54(5), 372–378. https://doi.org/10.1002/asi.10207
Burrell, Q. L. (2008). The publication/citation process at the micro level: A case study. COLLNET Journal of Scientometrics and Information Management, 3(1), 71–77.
https://doi.org/10.1080/09737766.2009.10700866
Callahan, A., Winnenburg, R., & Shah, N. H. (2018). Analysis: U-Index, a dataset and an impact metric for informatics tools and databases. Scientific Data, (March), 1–10.
Carpenter, C. R., Cone, D. C., & Sarli, C. C. (2014). Using publication metrics to highlight academic productivity and research impact. Academic Emergency Medicine, 21(10), 1160–1172. https://doi.org/10.1111/acem.12482
Coady, S. A., Mensah, G. A., Wagner, E. L., Goldfarb, M. E., Hitchcock, D. M., & Giffen, C. A. (2017). Use of the National Heart, Lung, and Blood Institute Data Repository. New England Journal of Medicine, 376(19), 1849–1858. https://doi.org/10.1056/NEJMsa1603542
CODATA-ICSTI Task Group on Data Citation Standards and Practices. (2013). Out of cite, out of mind: The current state of practice, policy, and technology for the citation of data. Data Science Journal, 12(September), 1–75. https://doi.org/10.2481/dsj.OSOM13-043
Collins, F. S., Morgan, M., & Patrinos, A. (2003). The Human Genome Project: Lessons from large-scale biology. Science, 300(5617), 286–290. https://doi.org/10.1126/science.1084564
Compute Canada. (2018). JupyterHub. Retrieved November 12, 2018, from https://docs.computecanada.ca/wiki/JupyterHub#cite_note-1
Consejo Superior de Investigaciones Científicas. (2019). Ranking web of universities. Retrieved March 3, 2019, from http://www.webometrics.info/en/node/54
Costello, M. J. (2009). Motivating online publication of data. BioScience, 59(5), 418–427. https://doi.org/10.1525/bio.2009.59.5.9
Cozzens, S. E. (1985). Comparing the sciences: Citation context analysis of papers from neuropharmacology and the sociology of science. Social Studies of Science, 15(1), 127–153. https://doi.org/10.1177/030631285015001005
Data Citation Synthesis Group. (2014). Joint declaration of data citation principles. (M. Martone, Ed.). FORCE11.
Retrieved from https://www.force11.org/group/joint-declaration-data-citation-principles-final de Solla Price, D. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5–6), 292–306. Dell EMC. (2019). Cold Data Storage. Retrieved March 17, 2019, from https://www.dellemc.com/en-us/glossary/cold-data-storage.htm Demmer, L. A., & Waggoner, D. J. (2014). Professional medical education and genomics. Annual Review of Genomics and Human Genetics, 15(1), 507–516. 209 https://doi.org/10.1146/annurev-genom-090413-025522 Diabetes Prevention Program Outcomes Study. (2016). Data dictionary. Retrieved from https://repository.niddk.nih.gov/media/studies/dppos/Data Dictionary.pdf Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., … Lautenbach, S. (2013). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x Durieux, V., & Gevenois, P. A. (2010). Bibliometric indicators: Quality measurements of scientific publication. Radiology, 255(2), 342–351. https://doi.org/10.1148/radiol.09090626 Edmunds, S. C., Pollard, T. J., Hole, B., & Basford, A. T. (2012). Adventures in data citation: Sorghum genome data exemplifies the new gold standard. BMC Research Notes, 5(223). https://doi.org/10.1186/1756-0500-5-223 Edwards, M. A., & Roy, S. (2017). Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34(1), 51–61. https://doi.org/10.1089/ees.2016.0223 Eom, Y. H., & Fortunato, S. (2011). Characterizing and modeling citation dynamics. PLoS ONE, 6(9), 1–7. https://doi.org/10.1371/journal.pone.0024926 Etikan, I., Abubakar Musa, S., & Sunusi Alkassim, R. (2016). Comparison of convenience sampling and purposive sampling. 
American Journal of Theoretical and Applied Statistics, 5(1), 1–4. https://doi.org/10.11648/j.ajtas.20160501.11 210 Faniel, I. M., & Jacobsen, T. E. (2010). Reusing scientific data: How earthquake engineering researchers assess the reusability of colleagues’ data. Computer Supported Cooperative Work, 19(3–4), 355–375. https://doi.org/10.1007/s10606-010-9117-8 Faniel, I. M., Kriesberg, A., & Yakel, E. (2015). Social scientists’ satisfaction with data reuse. Journal of the Association for Information Science and Technology, 67(6), 1404–1416. https://doi.org/10.1002/asi.23480 Federer, L. (2018). Quantifying biomedical data reuse: Do citations tell the whole story? (under revision for JASIST). Federer, L., Belter, C. W., Joubert, D. J., Livinski, A., Lu, Y.-L., Snyders, L. N., & Thompson, H. (2018). Data sharing in PLOS ONE: An analysis of Data Availability Statements. PLoS One, 13(5), e0194768. https://doi.org/10.1371/journal.pone.0194768 Federer, L., Lu, Y.-L., Joubert, D. J., Welsh, J., & Brandys, B. (2015). Biomedical data sharing and reuse: Attitudes and practices of clinical and scientific research staff. PLoS One, 10(6), e0129506. https://doi.org/10.1371/journal.pone.0129506 Fenner, M., Lowenberg, D., Jones, M., Needham, P., Vieglais, D., Abrams, S., … Chodacki, J. (2018). Code of Practice for Research Data Usage Metrics Release 1. PeerJ Preprints, 1–43. https://doi.org/10.7287/peerj.preprints.26505v1 Field, D., Amaral-Zettler, L., Cochrane, G., Cole, J. R., Dawyndt, P., Garrity, G. M., … Wooley, J. (2011). The Genomic Standards Consortium. PLoS Biology, 9(6), e1001088. https://doi.org/10.1371/journal.pbio.1001088 211 Ford, D. Y. (2014). Segregation and the underrepresentation of Blacks and Hispanics in gifted education: Social inequality and deficit paradigms. Roeper Review, 36(3), 143–154. https://doi.org/10.1080/02783193.2014.919563 Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., … Barabási, A.-L. (2018). 
Science of science. Science, 359(6379), eaao0185. https://doi.org/10.1126/science.aao0185 Galligan, F., & Dyas-Correia, S. (2013). Altmetrics: Rethinking the way we measure. Serials Review, 39, 56–61. https://doi.org/10.1016/j.serrev.2013.01.003 Gan, M., Dou, X., & Jiang, R. (2013). From ontology to semantic similarity: calculation of ontology-based semantic similarity. The Scientific World Journal, 2013, 793091. https://doi.org/10.1155/2013/793091 Garfield, E. (1955). Citation indexes for science: A new dimension in documentation through association of ideas. Science, 122(July), 108–11. Retrieved from http://science.sciencemag.org/content/122/3159/108 Garfield, E. (1964). Can citation indexing be automated? In Statistical Assocation Methods for Mechanized Documentation (Vol. 269, pp. 84–90). https://doi.org/10.1093/ije/dyl190 Garfield, E. (1982). More on the ethics of scientific publication: Abuses of authorship attribution and citation amnesia undermine the reward system of science. Current Contents, 30, 5–10. Garfield, E. (1987). Contemplating a science court: On the question of institutionalizing scientific factfinding. The Scientist, 1(6), 9. 212 Garfield, E. (1989). Can a science court settle controversies between scientists? Current Contents, 28(3–6), 189–192. Retrieved from http://garfield.library.upenn.edu/essays/v12p189y1989.pdf Garfield, E. (1991). Bibliographic negligence: A serious transgression. The Scientist, 5(23), 14. Retrieved from https://www.the- scientist.com/commentary/bibliographic-negligence-a-serious-transgression- 60359 Garfield, E. (2002). Demand citation vigilance. The Scientist, 16(2), 6. Retrieved from http://garfield.library.upenn.edu/papers/demandcitationvigilance012102.html Garla, V. N., & Brandt, C. (2012). Semantic similarity in the biomedical domain: An evaluation across knowledge sources. BMC Bioinformatics, 13, 261. https://doi.org/10.1186/1471-2105-13-261 Gerrish, S. M., & Blei, D. M. (2010). 
A language-based approach to measuring scholarly impact. In Proceedings of the 27th International Conference on International Conference on Machine Learning (pp. 375–382). https://doi.org/10.1002/chin.200533198 Giffen, C. A., Carroll, L. E., Adams, J. T., Brennan, S. P., Coady, S. A., & Wagner, E. L. (2015). Providing contemporary access to historical biospecimen collections: Development of the NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). Biopreservation and Biobanking, 13(4), 271–9. https://doi.org/10.1089/bio.2014.0050 213 Giles, J. R. A. (1995). The what, why, when, how, where and who of geological data management. Geological Society, London, Special Publications, 97(1), 1–4. https://doi.org/10.1144/GSL.SP.1995.097.01.01 Ginsburg, I. (2001). The disregard syndrome, a menace to honest science. The Scientist, 15(24). Retrieved from https://www.the-scientist.com/opinion-old/the- disregard-syndrome-a-menace-to-honest-science-53924 Gold, E. R., Ali-Khan, S. E., Allen, L., Ballell, L., Barral-Netto, M., Carr, D., … Thelwall, M. (2018). An open toolkit for tracking open science partnership implementation and impact. F1000Research, 2. https://doi.org/10.21955/GATESOPENRES.1114891.1 Gorgolewski, K. J., Margulies, D. S., & Milham, M. P. (2013). Making data sharing count: A publication-based solution. Frontiers in Neuroscience, 7, 9. https://doi.org/10.3389/fnins.2013.00009 Grüning, B., Chilton, J., Köster, J., Dale, R., Soranzo, N., van den Beek, M., … Taylor, J. (2018). Practical computational reproducibility in the life sciences. Cell Systems, 6(6), 631–635. https://doi.org/10.1016/j.cels.2018.03.014 Hansson, M. G., Lochmüller, H., Riess, O., Schaefer, F., Orth, M., Rubinstein, Y., … Woods, S. (2016). The risk of re-identification versus the need to identify individuals in rare disease research. European Journal of Human Genetics, 24(11), 1553–1558. https://doi.org/10.1038/ejhg.2016.52 Hazelkorn, E. (2013). 
How rankings are reshaping higher education. In Rankings and the Reshaping of Higher Education: The Battle for World-Class Excellence. (pp. 214 1–8). https://doi.org/10.1057/9781137446671 Henderson, T., & Kotz, D. (2015). Data citation practices in the CRAWDAD wireless network data archive. D-Lib Magazine, 21(1), 1. https://doi.org/10.1045/january2015-henderson Hey, T., Tansley, S., & Tolle, K. (2009). Jim Gray on eScience: A transformed scientific method. In The Fourth Paradigm: Data-Intensive Scientific Discovery (pp. xvii–xxxi). Redmond, WA: Microsoft Research. Retrieved from http://research.microsoft.com/en- us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf Hicks, D., Wouters, P., Waltman, L., de Rijcke, S., & Rafols, I. (2015). Bibliometrics: The Leiden Manifesto for research metrics. Nature, 520(7548), 429–431. https://doi.org/10.1038/520429a Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569–16572. https://doi.org/10.1073/pnas.0507655102 Hirsch, J. E. (2007). Does the h index have predictive power? Proceedings of the National Academy of Sciences, 104(49), 19193–19198. https://doi.org/10.1073/pnas.0707962104 Holden, G., Rosenberg, G., & Barker, K. (1994). Bibliometrics: A Potential decision making aid in hiring, reappointment, tenure and promotion decisions. Social Work, 39(4), 421–431. https://doi.org/10.1300/J010v41n03 Holdren, J. P. (2013). Increasing access to the results of federally funded scientific 215 research. Retrieved July 19, 2017, from https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_pu blic_access_memo_2013.pdf Hong, E. P., & Park, J. W. (2012). Sample size and statistical power calculation in genetic association studies. Genomics & Informatics, 10(2), 117–22. https://doi.org/10.5808/GI.2012.10.2.117 Hopkins, P. C., Yazigi, N., & Nylund, C. M. (2017). 
Incidence of biliary atresia and timing of hepatoportoenterostomy in the United States. Journal of Pediatrics, 187, 253–257. https://doi.org/10.1016/j.jpeds.2017.05.006 Hotho, A., Andreas, N., & Paaß, G. (2005). A brief survey of text mining. LDV- Forum 20, (1), 19–62. https://doi.org/10.1111/j.1365-2621.1978.tb09773.x Human Health and Heredity in Africa. (2019). Vision. Retrieved March 21, 2019, from https://h3africa.org/index.php/about/vision/ Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124 Ioannidis, J. P. A. (2014). How to make more published research true. PLoS Medicine, 11(10), e1001747. https://doi.org/10.1371/journal.pmed.1001747 Jagodnik, K. M., Koplev, S., Jenkins, S. L., Ohno-Machado, L., Paten, B., Schurer, S. C., … Ma’ayan, A. (2017). Developing a framework for digital objects in the Big Data to Knowledge (BD2K) commons: Report from the Commons Framework Pilots workshop. Journal of Biomedical Informatics, 71, 49–57. https://doi.org/10.1016/J.JBI.2017.05.006 216 Kaiser, J. (2019). Data sharing will be a major thrust of Trump’s $500 million childhood cancer plan. Science. https://doi.org/10.1126/science.aax1698 Kim, Y., & Yoon, A. (2017). Scientists’ data reuse behaviors: A multilevel analysis. Journal of the Association for Information Science and Technology, 68(12). https://doi.org/10.1002/asi.23892 Knoppers, B. M. (2014). Framework for responsible sharing of genomic and health- related data. The HUGO Journal, 8(1). https://doi.org/10.1186/s11568-014- 0003-1 Knoppers, B. M., Harris, J. R., Budin, I., & Edward, L. (2014). A human rights approach to an international code of conduct for genomic and clinical data sharing. Human Genetics, (1), 895–903. https://doi.org/10.1007/s00439-014- 1432-6 Kochen, M. (1987). How well do we acknowledge intellectual debts? Journal of Documentation, 43(1), 54–64. https://doi.org/10.1108/eb026801 Kotsiantis, S. B. (2007). 
Supervised machine learning: A review of classification techniques. Informatica, 31, 249–268. Retrieved from http://www.informatica.si/index.php/informatica/article/viewFile/148/140 Laine, H. (2017). Afraid of scooping – Case study on researcher strategies against fear of scooping in the context of open science. Data Science Journal, 16, 29. https://doi.org/10.5334/dsj-2017-029 Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–74. Retrieved from 217 http://www.ncbi.nlm.nih.gov/pubmed/843571 Lane, J. (2010). Let’s make science metrics more scientific. Nature, 464(7288), 488– 489. https://doi.org/10.1038/464488a Langille, M. G. I., Ravel, J., & Fricke, W. F. (2018). “Available upon request”: not good enough for microbiome data! Microbiome, 6(1), 8. https://doi.org/10.1186/s40168-017-0394-z Latour, B., & Woolgar, S. (1986). Laboratory Life. Princeton, NJ: Princeton University Press. Leonelli, S. (2014). What difference does quantity make? On the epistemology of big data in biology. Big Data & Society, 1(1). https://doi.org/10.1177/2053951714534395 Levin, N., & Leonelli, S. (2017). How does one “open” science? Questions of value in biological research. Science Technology and Human Values, 42(2), 280–305. https://doi.org/10.1177/0162243916672071 Li, J. (2014). Citation curves of “all-elements-sleeping-beauties”: “flash in the pan” first and then “delayed recognition.” Scientometrics, 100(2), 595–601. https://doi.org/10.1007/s11192-013-1217-z Longo, D. L., & Drazen, J. M. (2016). Data Sharing. New England Journal of Medicine, 374(3), 276–277. https://doi.org/10.1056/NEJMe1516564 Maes, M. (2015). A review on citation amnesia in depression and inflammation research. Neuro Endocrinology Letters, 36(1), 1–6. Magerman, T., van Looy, B., & Song, X. (2010). 
Exploring the feasibility and 218 accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications. Scientometrics, 82(2), 289–306. https://doi.org/10.1007/s11192-009-0046-6 Make Data Count. (2019). About. Retrieved March 23, 2019, from https://makedatacount.org/about/ Mann, G. S., Mimno, D., & McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL ’06 (p. 65). https://doi.org/10.1145/1141753.1141765 Manolio, T. A., & Murray, M. F. (2014). The growing role of professional societies in educating clinicians in genomics. Genetics in Medicine, 16(8), 571–572. https://doi.org/10.1038/gim.2014.6 Medical Library Association. (2019). Librarians without Borders. Retrieved March 21, 2019, from https://www.mlanet.org/page/librarians Merton, R. K. (1942). The normative structure of science. In N. Storer (Ed.), The Sociology of Science: Theoretical and Empirical Investigations (pp. 267–278). Chicago: University of Chicago Press. Merton, R. K. (1968). The Matthew effect in science. Science, 159(3810), 56–63. Retrieved from http://www.unc.edu/~fbaum/teaching/PLSC541_Fall06/Merton_Science_1968.p df Merton, R. K. (1983). Foreward. In Citation Indexing: Its Theory and Application in 219 Science, Technology, and Humanities (pp. v–ix). Philadelphia: ISI Press. Retrieved from http://www.garfield.library.upenn.edu/cifwd.html Meyer, D., Hornik, K., & Feinerer, I. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1–54. Retrieved from http://epub.wu.ac.at/3978/%5Cnhttp://epub.wu.ac.at/%5Cnhttp://www.jstatsoft.o rg/ Meystre, S. M., Lovis, C., Bürkle, T., Tognola, G., Budrionis, A., & Lehmann, C. U. (2017). Clinical data reuse or secondary use: Current status and potential future progress. Yearbook of Medical Informatics, 26(01), 38–52. https://doi.org/10.15265/IY-2017-007 Mitroff, I. 
I. (1974). Norms and counter-norms in a select group of the Apollo moon scientists: A case study of the ambivalence of scientists. American Sociological Review, 39(4), 579–595. https://doi.org/10.2307/2094423 Moher, D., Naudet, F., Cristea, I. A., Miedema, F., Ioannidis, J. P. A., & Goodman, S. N. (2018). Assessing scientists for hiring, promotion, and tenure. PLOS Biology, 16(3), e2004089. https://doi.org/10.1371/journal.pbio.2004089 Mole. (2004). Stealing thunder I. Journal of Cell Science, 117(Pt 15), 3073–4. https://doi.org/10.1242/jcs.01281 Mooney, H., & Newton, M. (2012). The anatomy of a data citation: Discovery, reuse, and credit. Journal of Librarianship and Scholarly Communication, 1(1), eP1035. https://doi.org/10.7710/2162-3309.1035 Mork, J. G., Aronson, A., & Demner-Fushman, D. (2017). 12 years on: Is the NLM 220 medical text indexer still useful and relevant? Journal of Biomedical Semantics, 8(8). https://doi.org/10.1186/s13326-017-0113-5 Mork, J. G., Yepes, A. J. J., & Aronson, A. R. (2013). The NLM Medical Text Indexer System for Indexing Biomedical Literature. Retrieved from https://ii.nlm.nih.gov/Publications/Papers/MTI_System_Description_Expanded_ 2013_Accessible.pdf Moura, D. C., López, M. A. G., Cunha, P., de Posada, N. G., Pollan, R. R., Ramos, I., … Fernandes, T. C. (2013). Benchmarking datasets for breast cancer Computer- Aided Diagnosis (CADx). In J. Ruiz-Shulcloper & G. Sanniti di Baja (Eds.), Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2013 (pp. 326–333). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41822-8_41 Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie, N., … Wagenmakers, E. (2017). A manifesto for reproducible science. Nature Human Behavior, 1(January), 1–9. https://doi.org/10.1038/s41562-016- 0021 Murray, M. F. (2014). Educating physicians in the era of genomic medicine. Genome Medicine, 6(6), 45. 
https://doi.org/10.1186/gm564 National Cancer Institute. (2019). Common cancer types. Retrieved March 2, 2019, from https://www.cancer.gov/types/common-cancers National Center for Biotechnology Information. (2018). dbGaP. Retrieved March 29, 2018, from https://www.ncbi.nlm.nih.gov/gap 221 National Center for Education Statistics. (2017). Digest of Education Statistics, 2017. Retrieved December 3, 2018, from https://nces.ed.gov/programs/digest/d17/tables/dt17_315.20.asp?current=yes National Center for Health Statistics. (2017). FastStats: Leading causes of death. Retrieved March 2, 2019, from https://www.cdc.gov/nchs/fastats/leading-causes- of-death.htm National Heart, Lung, and Blood Institute. (2008). Procedures for requesting data sets (Archived at Internet Archive). Retrieved March 1, 2019, from https://web.archive.org/web/20081120052240/http://www.nhlbi.nih.gov./resourc es/deca/prcdrs.htm National Heart, Lung, and Blood Institute. (2018). BioLINCC: Biologic Specimen and Data Repository Information Coordinating Center. Retrieved March 29, 2018, from https://biolincc.nhlbi.nih.gov/home/ National Human Genome Research Institute. (2012). A Brief History of the Human Genome Project. Retrieved March 29, 2019, from https://www.genome.gov/12011239/a-brief-history-of-the-human-genome- project/ National Institute of Diabetes and Digestive and Kidney Diseases. (2018). NIDDK Central Repository. Retrieved March 29, 2018, from https://repository.niddk.nih.gov/home/ National Institutes of Health. (2016). PAR-16-256: Cancer-related Behavioral Research through Integrating Existing Data (R01). Retrieved March 12, 2019, 222 from https://grants.nih.gov/grants/guide/pa-files/PAR-16-256.html National Institutes of Health. (2017a). PA-17-289: Leveraging population-based cancer registry data to study health disparities (R01). Retrieved March 12, 2019, from https://grants.nih.gov/grants/guide/pa-files/PA-17-289.html National Institutes of Health. (2017b). 
Principles and guidelines for reporting preclinical research. Retrieved March 23, 2019, from https://www.nih.gov/research-training/rigor-reproducibility/principles- guidelines-reporting-preclinical-research National Institutes of Health. (2018a). Estimates of funding for various Research, Condition, and Disease Categories (RCDC). Retrieved from https://report.nih.gov/categorical_spending.aspx National Institutes of Health. (2018b). NIH strategic plan for data science. https://doi.org/10.1109/OFC.2007.4348300 National Institutes of Health. (2018c). NOT-OD-19-014: Request for Information (RFI) on proposed provisions for a draft data management and sharing policy for NIH funded or supported research. Retrieved November 11, 2018, from https://grants.nih.gov/grants/guide/notice-files/NOT-OD-19-014.html National Institutes of Health. (2018d). PA-17-467: Secondary Analyses of Existing Alcohol Research Data (R01). Retrieved March 12, 2019, from https://grants.nih.gov/grants/guide/pa-files/PA-17-467.html National Institutes of Health Fogarty International Center. (2019). Sub-Saharan African region information, grants and resources. Retrieved March 21, 2019, 223 from https://www.fic.nih.gov/WorldRegions/Pages/SubSaharanAfrica.aspx National Institutes of Health Office of Extramural Research. (2004). Frequently asked questions (FAQs) on data sharing. Retrieved November 11, 2018, from https://grants.nih.gov/grants/policy/data_sharing/data_sharing_faqs.htm#912 National Institutes of Health Office of Extramural Research. (2016). NIH Data Sharing Policy. Retrieved July 19, 2017, from https://grants.nih.gov/grants/policy/data_sharing/ National Institutes of Health Office of Science Policy. (2017). NIH Genomic Data Sharing Policy. Retrieved July 19, 2017, from https://osp.od.nih.gov/scientific- sharing/policies/ National Institutes of Health Research Portfolio Online Reporting Tools. (2018). NIH awards by location and organization. 
Retrieved February 7, 2019, from https://report.nih.gov/award/index.cfm National Library of Medicine. (2018). MeSH on Demand. Retrieved October 27, 2018, from https://meshb.nlm.nih.gov/MeSHonDemand National Library of Medicine. (2019). Fact Sheet: MEDLINE® Journal Selection. Retrieved March 17, 2019, from https://www.nlm.nih.gov/lstrc/jsel.html National Science Foundation. (2010). Dissemination and sharing of research results. Retrieved from http://www.nsf.gov/bfa/dias/policy/dmp.jsp Nature Publishing Group. (2017). Availability of data & materials. Retrieved April 2, 2017, from http://www.nature.com/authors/policies/availability.html Nonalcoholic Fatty Liver Disease (NAFLD) Adult Database. (2016). Data dictionary. 224 Retrieved from https://repository.niddk.nih.gov/media/studies/nafld_pediatric/Data_Dictionary.p df Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., … Yarkoni, T. (2015a). Promoting an open research culture. Science, 348(6242), 1422 LP-1425. Retrieved from http://science.sciencemag.org/content/348/6242/1422.abstract Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., … Yarkoni, T. (2015b). Promoting an Open research culture. Science, 348(6242), 1422–1425. Retrieved from http://science.sciencemag.org/content/348/6242/1422.abstract Nosek, B. A., & Bar-Anan, Y. (2012). Scientific Utopia: I. Opening Scientific Communication. Psychological Inquiry, 23(3), 217–243. https://doi.org/10.1080/1047840X.2012.692215 Ó Conchúir, S., Barlow, K. A., Pache, R. A., Ollikainen, N., Kundert, K., O’Meara, M. J., … Kortemme, T. (2015). A web resource for standardized benchmark datasets, metrics, and Rosetta protocols for macromolecular modeling and design. PLOS ONE, 10(9), e0130433. https://doi.org/10.1371/journal.pone.0130433 Olfson, M., Wall, M. M., & Blanco, C. (2017). Incentivizing data sharing and collaboration in medical research-the S-index. JAMA Psychiatry, 74(1), 5–6. 
https://doi.org/10.1001/jamapsychiatry.2016.2610 225 Palevitz, B. A. (1997). The ethics of citation: A matter of science’s family values. The Scientist. Retrieved from https://www.the-scientist.com/opinion-old/the-ethics- of-citation-a-matter-of-sciences-family-values-57456 Paltoo, D. N., Rodriguez, L. L., Feolo, M., Gillanders, E., Ramos, E. M., Rutter, J. L., … Green, E. D. (2014). Data use under the NIH GWAS Data Sharing Policy and future directions. Nature Genetics, 46(9), 934–938. https://doi.org/10.1038/ng.3062 Parolo, P. D. B., Pan, R. K., Ghosh, R., Huberman, B. A., Kaski, K., & Fortunato, S. (2015). Attention decay in science. Journal of Informetrics, 9(4), 734–745. https://doi.org/10.1016/J.JOI.2015.07.006 Pasquetto, I. V., Randles, B. M., & Borgman, C. L. (2017). On the reuse of scientific data. Data Science Journal, 16, 1–9. https://doi.org/10.5334/dsj-2017-008 Penner, O., Pan, R. K., Petersen, A. M., Kaski, K., & Fortunato, S. (2013). On the predictability of future impact in science. Scientific Reports, 3, 1–8. https://doi.org/10.1038/srep03052 Pepe, A., Goodman, A., Muench, A., Crosas, M., & Erdmann, C. (2014). How do astronomers share data? Reliability and persistence of datasets linked in AAS publications and a qualitative study of data practices among US astronomers. PLoS ONE, 9(8), e104798. https://doi.org/10.1371/journal.pone.0104798 Pesquita, C., Faria, D., Falcão, A. O., Lord, P., & Couto, F. M. (2009). Semantic similarity in biomedical ontologies. PLoS Computational Biology, 5(7), e1000443. https://doi.org/10.1371/journal.pcbi.1000443 226 Piwowar, H. A. (2010). A method to track dataset reuse in biomedicine: Filtered GEO accession numbers in PubMed Central. Proceedings of the ASIST Annual Meeting, 47, 1–2. https://doi.org/10.1002/meet.14504701450 Piwowar, H. A., Becich, M. J., Bilofsky, H., Crowley, R. S., & on behalf of the caBIG Data Sharing and Intellectual Capital Workspace. (2008). 
Towards a data sharing culture: Recommendations for leadership from academic health centers. PLoS Medicine, 5(9), e183. https://doi.org/10.1371/journal.pmed.0050183 Piwowar, H. A., Carlson, J. D., & Vision, T. J. (2011). Beginning to track 1000 datasets from public repositories into the published literature. Proceedings of the ASIST Annual Meeting, 48. https://doi.org/10.1002/meet.2011.14504801337 Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLoS ONE, 2(3), e308. https://doi.org/10.1371/journal.pone.0000308 Popejoy, A. B., & Fullerton, S. M. (2016). Genomics is failing on diversity. Nature, 538(7624), 161–164. https://doi.org/10.1038/538161a Powledge, T. M. (2003). Revisiting Bermuda. Genome Biology, 4(1). https://doi.org/10.1186/gb-spotlight-20030311-01 Priem, J. (2014). Altmetrics. In B. Cronin & C. R. Sugimoto (Eds.), Beyond Bibliometrics: Harnessing Multidimensional Indicators of Scholarly Impact (pp. 263–287). Cambridge, MA: MIT Press. Pronk, T. E., Wiersma, P. H., van Weerden, A., & Schieving, F. (2015). A game theoretic analysis of research data sharing. PeerJ, 3, e1242. 227 https://doi.org/10.7717/peerj.1242 Pryor, G. (2009). Multi-scale Data Sharing in the Life Sciences: Some Lessons for Policy Makers. International Journal of Digital Curation, 4(3), 71–82. https://doi.org/10.2218/ijdc.v4i3.115 Raza, S., & Hall, A. (2017). Genomic medicine and data sharing. British Medical Bulletin, 123(1), 35–45. https://doi.org/10.1093/bmb/ldx024 Reality check on reproducibility. (2016). Nature, 533, 437. Richesson, R. L., & Nadkarni, P. (2011). Data standards for clinical research data collection forms: Current status and challenges. Journal of the American Medical Informatics Association, 18(3), 341–346. https://doi.org/10.1136/amiajnl-2011-000107 Rinker, T. W. (2018). {textstem}: Tools for stemming and lemmatizing text. R package version 0.1.4. 
Retrieved from http://github.com/trinker/textstem Robinson-García, N., Jiménez-Contreras, E., & Torres-Salinas, D. (2015). Analyzing data citation practices using the Data Citation Index. Journal of the American Society for Information Science and Technology, 18071, 12. https://doi.org/10.1002/asi.23529 Rolland, B., & Lee, C. P. (2013). Beyond trust and reliability: reusing data in collaborative cancer epidemiology research. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW’13) (pp. 435–444). https://doi.org/10.1145/2441776.2441826 Savage, C. J., Vickers, A. J., Kats, J., & Molenaar, D. (2009). Empirical study of data 228 sharing by authors publishing in PLoS journals. PLoS ONE, 4(9), e7078. https://doi.org/10.1371/journal.pone.0007078 Schlögl, C., Gorraiz, J., Gumpenberger, C., Jack, K., & Kraker, P. (2014). Comparison of downloads, citations and readership data for two information systems journals. Scientometrics, 101(2), 1113–1128. https://doi.org/10.1007/s11192-014-1365-9 Serwadda, D., Ndebele, P., Grabowski, M. K., Bajunirwe, F., & Wanyenze, R. K. (2018). Open data sharing and the Global South: Who benefits? Science, 359(6376), 642–643. https://doi.org/10.1126/science.aap8395 Sheehan, J., Hirschfeld, S., Foster, E., Ghitza, U., Goetz, K., Karpinski, J., … Huerta, M. (2016). Improving the value of clinical research through the use of Common Data Elements. Clinical Trials (London, England), 13(6), 671–676. https://doi.org/10.1177/1740774516653238 Silge, J., & Robinson, D. (2018). Topic modeling. In Text Mining with R: A Tidy Approach. Boston: O’Reilly Media. Silva, L. (2014). PLOS’ new data policy: Public access to data. Retrieved April 2, 2017, from http://blogs.plos.org/everyone/2014/02/24/plos-new-data-policy- public-access-data-2/ Silvello, G. (2017). Theory and practice of data citation. Journal of the Association for Information Science and Technology, 69(1), 6–20. 