ABSTRACT

Title of Dissertation: EXPLORING BLIND AND SIGHTED USERS' INTERACTIONS WITH ERROR-PRONE SPEECH AND IMAGE RECOGNITION

Jonggi Hong
Doctor of Philosophy, 2021

Dissertation Directed by: Assistant Professor Hernisa Kacorri, College of Information Studies

Speech and image recognition, already employed in many mainstream and assistive applications, hold great promise for increasing independence and improving the quality of life for people with visual impairments. However, their error-prone nature, combined with the challenges of visually inspecting errors, can hold back their use for more independent living. This thesis explores blind users' challenges and strategies in handling speech and image recognition errors through non-visual interactions, looking at both perspectives: that of an end-user interacting with already trained and deployed models, such as automatic speech recognizers and image recognizers, and that of an end-user who is empowered to attune a model to their idiosyncratic characteristics, as with teachable image recognizers. To better contextualize the findings and account for human factors beyond visual impairments, the user studies also involve sighted participants in a parallel thread. More specifically, Part I of this thesis explores blind and sighted participants' experience with speech recognition errors through audio-only interactions. Here, the recognition result from a pre-trained model is not displayed; instead, it is played back through text-to-speech. Through carefully engineered speech dictation tasks in both crowdsourcing and controlled-lab settings, this part investigates the percentage and type of errors that users miss, their strategies for identifying errors, and potential manipulations of the synthesized speech that may help users better identify errors. Part II investigates blind and sighted participants' experience with image recognition errors.
Here, we consider both pre-trained image recognition models and models fine-tuned by the users. Through carefully engineered questions and tasks in both crowdsourcing and semi-controlled remote lab settings, this part investigates the percentage and type of errors that users miss, their strategies for identifying errors, and potential interfaces for accessing training examples that may help users better avoid prediction errors when fine-tuning models for personalization.

EXPLORING BLIND AND SIGHTED USERS' INTERACTIONS WITH ERROR-PRONE SPEECH AND IMAGE RECOGNITION

by Jonggi Hong

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2021

Advisory Committee:
Assistant Professor Hernisa Kacorri, Chair/Advisor
Assistant Professor Marine Carpuat
Assistant Professor Huaishu Peng
Assistant Professor Zhicheng Liu
Associate Professor Leah Findlater (University of Washington)

© Copyright by Jonggi Hong 2021

Acknowledgments

First and foremost, I would like to thank my advisor, Professor Hernisa Kacorri, for her endless support and thoughtful advice throughout the years. She has supported me both in completing my research and in becoming an independent researcher. Her passion and creative ideas inspired me many times. I believe that what I learned from her will guide me as a researcher for the rest of my life. I also would like to thank my former advisor, Professor Leah Findlater, who shaped my research at the beginning of my Ph.D. at UMD. She taught me many important skills and techniques related to all parts of conducting research and building relationships with other researchers. I was very fortunate to have opportunities to work with Professor Hernisa Kacorri and Professor Leah Findlater. I would like to thank my dissertation committee members: Marine Carpuat, Huaishu Peng, and Leo Zhicheng Liu.
Thank you for providing insightful comments that made my work stronger. I am grateful to all lab mates, students, and friends who have shared skills and ideas with me: Uran, Kotaro, Lee, Meethu, Kristin, Kyungjun, Utkarsh, Rie, Alisha, Christine, Ebrima, Jaina, Ernest, June, Tak, Deokgun, Soekbin, Seongkook, and Jaeyeon. As always, I thank my parents for their huge support in completing my Ph.D. successfully. I am also grateful to my sister for giving me lots of assistance.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
Chapter 2: Background on User Interaction With Error-Prone Systems
  2.1 Identifying Errors
  2.2 Understanding Errors
  2.3 Avoiding and Correcting Errors
Part I: Interacting with Error-Prone Speech Recognition
Prologue to Part I
Chapter 3: Background
  3.1 Automatic Speech Recognition and Error Identification
  3.2 Automatic Speech Recognition for Accessibility
  3.3 Comprehension of Synthesized Speech
Chapter 4: Characterizing the Challenges in Identifying ASR Errors With Sighted Users
  4.1 Motivation and Introduction
  4.2 Understanding Error Identification in Recognized Speech
    4.2.1 Method
    4.2.2 Results
    4.2.3 Summary and Discussion
  4.3 Improving Error Identification Through Speech Rate and Pause
    4.3.1 Method
    4.3.2 Results
    4.3.3 Summary and Discussion
  4.4 Improving Error Identification by Varying Pause Length
    4.4.1 Method
    4.4.2 Results
    4.4.3 Summary and Discussion
  4.5 Improving Error Identification by Listening to Speech Twice
    4.5.1 Method
    4.5.2 Results
    4.5.3 Summary and Discussion
  4.6 Discussion
  4.7 Conclusion
Chapter 5: Comparing Error Identification in ASR Across Blind and Sighted Users
  5.1 Motivation and Introduction
  5.2 Method
  5.3 Results
  5.4 Discussion
  5.5 Conclusion
Epilogue to Part I
Part II: Interacting with Error-Prone Image Recognition
Prologue to Part II
Chapter 6: Background
  6.1 Image Recognition and Error Identification
  6.2 Image Recognition for Accessibility
  6.3 Machine Teaching and Teachable Interfaces
Chapter 7: Understanding Error Identification in Pre-Trained Image Recognition With Blind Users
  7.1 Motivation and Introduction
  7.2 Method
  7.3 Results
  7.4 Discussion
  7.5 Conclusion
Chapter 8: Exploring Error Understanding and Avoidance in Teachable Image Recognition With Sighted Users
  8.1 Motivation and Introduction
  8.2 Method
  8.3 Results
  8.4 Discussion
  8.5 Conclusion
Chapter 9: Designing a Teachable Object Recognizer with Training Set Descriptors for Blind Users
  9.1 Motivation and Introduction
  9.2 Method
  9.3 Results
  9.4 Discussion
  9.5 Conclusion
Epilogue to Part II
Chapter 11: Conclusions and Future Work
  11.1 Summary of Contributions
  11.2 Future Directions
Bibliography

List of Tables

4.1 Subjective vote tallies in Study 2. The 200 WPM speech rate and the two shortest pause lengths were the most preferred, while 300 WPM was least likely to be voted easiest.
5.1 Participant characteristics, with "B" denoting blind and "S" sighted participants. All but B10 and S12 were native English speakers; B10 and S12 had lived in the US for 30 and 27 years, respectively.
5.2 Definition and number of error instances for the types where error instances sounded like the original words. The "identified" column reports the proportion of exactly identified errors (the number of exactly identified error instances divided by the number of all error instances).
5.3 Definition and number of error instances for the types where error instances did not sound like the original words. The "identified" column reports the proportion of exactly identified errors (the number of exactly identified error instances divided by the number of all error instances).
6.1 Related studies' characteristics juxtaposed with ours.
7.1 Participants' characteristics.
8.1 Variation attributes, true if a variation is present for at least one object.
8.2 Inconsistency attributes, true if there is an inconsistency in variation across the three objects.
8.3 Count attributes, the number of photos with a given characteristic, including those looking at quality issues.
8.4 Modeling recognition performance based on attributes capturing variation, inconsistency, and other characteristics.
9.1 Our descriptors for reviewing photos are informed by prior studies exploring how people who have no machine learning expertise synthesize their data for training and iterate on them when they can access them visually [1, 2].
9.2 Participants' characteristics.

List of Figures

1.1 Upward trend and a plateauing line for top-1 and top-5 accuracies for image classification on ImageNet from 2012 to 2017. (Source: Su et al., 2018 [3])
4.1 The experimental setup.
4.2 WER, precision, recall, and phrase-level accuracy in Study 1. Recall results showed that participants missed more than half of the speech recognition errors. Error bars show standard error (N = 12 per group).
4.3 Screenshot of the online testbed used for Studies 2, 3, and 4, showing a single trial. A trial consisted of reading a presented phrase, listening to an audio clip of what a speech recognition engine had heard, and marking errors in the recognized version (i.e., discrepancies between text and audio).
4.4 Precision, recall, phrase-level accuracy, and trial completion time in Study 2. The shaded portion in trial completion time indicates the average length of audio clips in that condition. Participants identified errors most accurately with the 200 WPM speech rate and 150 ms pause. Error bars show standard error (N = 52).
4.5 Types of errors participants missed identifying in Study 2. Participants missed 33% fewer multiple-word errors with a 150 ms pause compared to no pause.
4.6 Graphs of precision, recall, phrase-level accuracy, and trial completion time in Study 3. The shaded portion in trial completion time indicates the length of audio clips. There were no significant differences in accuracy measures due to pause length. Error bars show standard error (N = 40).
4.7 Graphs of precision, recall, and phrase-level accuracy in Study 4. The shaded portion in trial completion time indicates the average length of audio clips in that condition. Only trial completion time differed significantly between the two conditions. Error bars show standard error (N = 30).
5.1 Study setup for the speech dictation task, showing researcher (left) and participant (right) perspectives. The screen was blank across all participants to control for access to visual information.
5.2 Reported frequency of using synthesized speech (N = 24).
5.3 Reported frequency of using speech input for dictation and voice commands (N = 24).
5.4 Perceived frequency of encountering ASR errors when dictating text (N = 24).
5.5 Frequency with which participants reported reviewing and editing text after dictation (N = 24).
5.6 Recall and precision for the blind and sighted participants in trials with short scenarios (SS) and open questions (OQ). The trials with open questions had longer messages with higher error rates.
5.7 The strategies used to report different types of ASR errors by the blind and sighted participants. A cell has no strategy if no error occurred or a participant missed all errors.
5.8 WER and length of dictated messages for the blind and sighted participants in trials with short scenarios (SS) and open questions (OQ). Participants dictated longer messages in OQ trials than in SS trials. There was no significant difference in WER between sighted and blind participants.
5.9 Speech rate and length of words for the blind and sighted participants in trials with short scenarios (SS) and open questions (OQ). Blind participants spoke more slowly than sighted participants. The average length of words was shorter in OQ trials than in SS trials.
6.1 Characterization of our testbed in the machine teaching problem space [4], where T stands for teacher and S for student. A human T employs pool-based, model-free, angelic, empirical teaching. The testbed has a single recognition model S learning in batch mode, unaware that it is being taught, while considering T as a friend (no adversarial examples).
7.1 Object stimuli: baking soda, caramel coffee, Cheetos, chewy bars, chicken broth, Coca-Cola, diced tomatoes, Diet Coke, dill, Fritos, LaCroix apricot, LaCroix mango, Lays, oregano, Pike Place roast.
7.2 A screenshot of the general object recognizer.
7.3 Participant responses to questions about their experience in taking photos.
7.4 What participants captured in their photos.
7.5 Camera-based assistive apps the participants have used regularly.
7.6 Participant responses to two questions about the frequency of encountering errors and verifying the outputs from the apps.
7.7 Participant responses to two questions about handling errors in the apps.
7.8 The number of missed errors (false negatives, FN) and correct predictions considered as misrecognitions (false positives, FP).
8.1 Given an object category, MTurkers are asked to choose three object instances and train a robust personal object recognizer using their mobile camera. Here we include examples from some of the participants' selected objects.
8.2 Testbed screenshots: questionnaires, category selection, object labeling, and camera view in training and testing.
8.3 Participants' technology experience and familiarity with machine learning, mostly ranging from slightly familiar (have heard of it but don't know what it does) to somewhat familiar (a broad understanding of what it is and what it does).
8.4 Examples of variation attributes in teaching sets.
8.5 Sample photos considered by the count attributes.
8.6 Number of participants per variation and inconsistency attribute across all five interactions with the model: preliminary test (TS0), train 1 (TR1), test 1 (TS1), train 2 (TR2), and test 2 (TS2). The graphs on the left indicate how participants incorporate diversity in their photos in terms of object size, viewpoint, location, and illumination when they train and debug their models.
8.7 Percentage of photos per participant given a count attribute, with standard error as error bars. Participants mostly took photos with the logo showing, and many against a textured or cluttered background. Often the objects were cropped in the camera frame, and sometimes participants' hands were included in the photos. Surprisingly, few participants opened the object and trained the model on its contents as well. The most common quality issues were blurry and dim photos, though these were not that prevalent.
9.1 Screenshots from the TOR app indicating, from left to right, the home screen, teach screen, teach screen with descriptors, teach screen with the number-of-remaining-photos notification, review screen (top), and review screen (bottom).
9.2 Screenshots from the TOR app indicating, from left to right, the labeling screen, home screen when training is in progress, home screen with a recognition result, list of items screen, item information screen (top), and item information screen (bottom).
9.3 Object stimuli in the study: Fritos, Cheetos, and Lays.
9.4 Participant responses to questions about their training experience during the study.
9.5 Training photos annotated as having too-small objects (target objects are marked with blue dotted rectangles).
9.6 Scatter plots with the manually annotated values on the x axis and estimated values on the y axis. The correlation coefficient (r) and p-value (p) are specified in the plots.
9.7 Training photos with cluttered backgrounds.
9.8 Training photos with little variation.
9.9 Training photos with problems in framing (i.e., adjusting the distance and centering the object).
9.10 Test photos with cluttered backgrounds.
9.11 Participant responses to questions about their testing experience during the study.
9.12 The number of tests per object.
9.13 The proportion of errors and number of tests.
9.14 The number of tests per object and proportion of errors.
9.15 The accuracy of the object recognition models tested by the participants.
9.16 Average accuracy versus satisfaction with the performance.
The red dots are means.
9.17 The number of tests per object and proportion of errors.
9.18 Participant responses to questions about their reviewing and editing experience during the study.
9.19 Participant responses to questions about their overall experience during the study.

Chapter 1: Introduction

Using deep neural networks and large datasets (e.g., ImageNet [5], LibriSpeech [6]), recent machine learning systems have reduced errors dramatically. For example, the word error rate of state-of-the-art speech recognition is only around 5% for English [7], and the top-5 error rate for image classification is roughly 4% [8]. With advances in computer vision, speech recognition, and natural language processing, machine learning has been employed in a variety of applications such as self-driving cars, automated retail services (e.g., Amazon Go), and voice-controlled intelligent personal assistants (e.g., Google Home, Amazon Echo). While we've reached low error rates in object and speech recognition tasks with public benchmark datasets (e.g., lower than 5% error rates in speech recognition with the Wall Street Journal dataset [9] and in top-5 image classification with ImageNet [5]), these error rates may not be reflective of real-world scenarios. In practice, one would expect to see much higher numbers due to many factors, such as more difficult tasks (e.g., the error rate of top-1 image classification is higher than that of top-5 classification, as shown in Figure 1.1), limited computational resources (e.g., classifying images locally on a mobile device), or inputs that deviate from the training data (e.g., classifying images with personal items or unique backgrounds collected by a user).

Figure 1.1: Upward trend and a plateauing line for top-1 and top-5 accuracies for image classification on ImageNet from 2012 to 2017. (Source: Su et al., 2018 [3])

Such errors are known to affect the user experience significantly in machine learning applications [10, 11]. For example, Fox et al. [12] identified causes of errors in automatic speech recognition (ASR) such as similar-sounding words, software parsing errors, faulty microphones, and hardware issues. Reviewing and editing text to correct ASR errors is known to be a bottleneck in text entry through speech [13, 14, 15, 16]. In object recognition systems, errors can sneak in due to a mismatch between training and real-world data [17], and these systems are also sensitive to adversarial attacks [18, 19]. While image classifiers are particularly useful for blind users [20], blind users would not be able to detect most of these errors, especially when they cannot touch the object recognized in the image (e.g., recognizing a far-away object or a scene). Therefore, interfaces that support users in identifying, correcting, understanding, and altogether avoiding errors are crucial. In the broader HCI area, we see recent efforts such as Amershi et al. [10] providing guidelines for user interface design in AI-infused systems (a term coined by Amershi et al. [10]). They emphasize that an interface ought to inform users how accurately the system can do its task and how to deal with errors. As it is hard for users to predict and understand the errors from AI-infused systems, researchers have put effort into explaining the model and its output, to help users understand the rationale behind the system's output and the impact of a user's input on the behavior of the model [21].
This thesis complements these efforts with a focus on everyday AI-infused systems that employ speech and image recognition, can benefit the blind community, and whose errors are typically identified through sight, making them inaccessible to blind users.

Chapter 2: Background on User Interaction With Error-Prone Systems

This chapter provides an overview of user-error interaction in broader automation as well as in machine learning systems, with a focus on blind users' interactions with errors. As machine learning has been employed in many assistive technologies, such as brain-computer interfaces [22, 23], sign language synthesis [24], and indoor localization [25, 26], developing accessible interfaces for identifying, understanding, and recovering from errors is essential. This chapter organizes prior studies along these steps for resolving errors: identifying, understanding, and recovering from them.

2.1 Identifying Errors

Error identification plays an important role in user-error interaction as the first step in handling errors. While errors are obvious and easy to spot in some applications, where users can quickly compare the system's outcome with the ground truth (e.g., navigating familiar routes with a way-finding system), the outcome may not be clearly perceived due to the characteristics of the task, a poorly designed interface, the complexity of the information, or poor concentration caused by a high workload [27, 28]. For example, the ground truth may not be available immediately when the system provides its outcome (e.g., medical diagnosis, weather prediction). The ground truth may also not be obvious to the user if the system handles data in an unfamiliar work domain [29]. Therefore, researchers have explored the challenge of identifying machine learning errors.
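Quantitatively, error identification is often scored by comparing the errors a user flags against the ground-truth errors, yielding precision (how many flags were real errors) and recall (how many real errors were caught). The helper below is a minimal, hypothetical sketch of such scoring, not taken from any particular study; it assumes errors are represented as word positions.

```python
def error_identification_scores(true_errors, flagged):
    """Score a user's error flags against ground truth.

    true_errors / flagged: sets of word positions (or error IDs).
    Returns (precision, recall): precision penalizes false alarms,
    recall penalizes missed errors.
    """
    true_errors, flagged = set(true_errors), set(flagged)
    hits = len(true_errors & flagged)  # correctly flagged errors
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(true_errors) if true_errors else 1.0
    return precision, recall

# Hypothetical trial: real errors at word positions 2 and 7;
# the user flagged positions 2 and 4.
print(error_identification_scores({2, 7}, {2, 4}))  # (0.5, 0.5)
```

A low recall under this scoring corresponds directly to missed errors, the central concern of this chapter.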
For example, interfaces for identifying errors have been investigated because the error handling process significantly affects the effectiveness of natural input in speech, handwriting, and gesture recognition systems [14, 15, 30, 31]. To address these challenges, prior studies have developed interfaces that help users identify errors. Bourguet [30] identified two approaches when categorizing these prior studies: automatic identification, where the system automatically identifies potential errors in its output, and machine-led discovery, where the system provides an interface that aids users in discovering errors. However, while these approaches can reduce the errors missed by sighted users, it remains hard for blind users to identify all errors, for several reasons. If an automatic identification system misses an error, blind users may have no way to tell. Also, many systems taking the machine-led discovery approach are designed for sighted users and rely on visual feedback [30]. For example, one handwriting recognition system displayed different types of lines (e.g., bold and dotted) to represent the system's top guess and potential alternatives [32]. A common way to present alternative predictions for potential errors in speech recognition systems is a graphical user interface showing the n-best predictions. These approaches do not enable blind users to identify errors easily, because most of the interfaces depend on visual information that is not available to them. Consequently, prior studies on speech and image recognition systems revealed difficulties in identifying errors. While speech input is blind users' main method of entering text on a mobile device, they have difficulty identifying speech recognition errors due to the similarity between the original speech input and the misrecognized text in synthesized speech [13]. MacLeod et al. [33] conducted a user study to understand blind users'
experience with social media images and showed that blind users were usually unable to identify problems in computer-generated image captions, even when the captions were incorrect and out of context. When errors occurred in computer-generated captions, blind users tended to rationalize the difference between the caption and the text around the image rather than treating the captions as erroneous. These studies emphasize the importance of developing accessible interfaces for identifying and understanding errors.

2.2 Understanding Errors

Though deep neural networks and large datasets have produced a breakthrough in the performance of machine learning systems, the complexity of these systems makes it difficult for users to understand their behavior. Therefore, prior studies have presented guidelines for the interface design of machine learning applications to help users understand the applications' output. For example, the guidelines developed by Amershi et al. [10] recommend informing users of what a system can do and how well it can do it, to manage users' expectations. Another study confirmed that managing a user's expectations with information about the system's capability positively impacts the user's experience when the system makes errors [34]. As the complexity of machine learning models makes it harder for users to understand why errors occur, researchers have actively investigated methods for explaining machine learning models and their output (i.e., explainable artificial intelligence) [35, 36, 37, 38, 39, 40]. Explanations have been found to make users more satisfied, perceive better control of the system, and trust the output of the system more [41, 42]. Accordingly, many machine learning applications, such as image classifiers [43], recommendation systems [35, 44], and news feed algorithms in social media [38], have employed explainable interfaces.
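As a small illustration of the "inform users how well the system can do" guideline above, a recognizer's prediction can be verbalized with a confidence qualifier suitable for text-to-speech, rather than stated as fact. This sketch is purely hypothetical (the thresholds and phrasing are assumptions, not drawn from the cited guidelines):

```python
def describe_prediction(label, confidence):
    """Render a classifier's output as a spoken-friendly sentence
    that conveys uncertainty instead of stating the label as fact.
    Thresholds below are illustrative, not from any guideline."""
    if confidence >= 0.9:
        qualifier = "This is"
    elif confidence >= 0.6:
        qualifier = "This is probably"
    else:
        qualifier = "This might be"
    return f"{qualifier} {label} ({confidence:.0%} confident)."

print(describe_prediction("diced tomatoes", 0.93))
# This is diced tomatoes (93% confident).
print(describe_prediction("chicken broth", 0.41))
# This might be chicken broth (41% confident).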
As blind users depend on machine learning-based assistive tools without visual information for some tasks (e.g., writing text with speech or handwriting recognition [30, 45], finding routes with a navigation system [46]), they care about understanding errors and their consequences. Prior studies indeed showed that errors significantly affect blind users' experience with these systems. For example, when blind users were asked about using self-driving vehicles independently, without help from sighted people, safety issues caused by malfunctions in the autopilot system were their main concern [47, 48]. Similar concerns exist for systems where the risk posed by errors is far lower than in self-driving vehicles, where errors may threaten users' lives. Though some errors in a navigation system are acceptable to blind users, they reported negative experiences with errors when people around them responded in misguided ways [49]. Another study showed that when a navigation system guided a blind user to a place only a few meters away from the destination, the user could be frustrated and get totally lost due to this small error [46]. This suggests that interfaces for user-error interaction need to provide context or supplementary information along with a machine learning model's prediction, so that blind users can accurately assess the cause and severity of errors.

2.3 Avoiding and Correcting Errors

Prior studies have shown that interactions for correcting errors have a significant impact on users' experience with machine learning systems [14, 15, 30, 31]. To improve these interactions, several guidelines for designing user interfaces in machine learning applications recommend an intermediate step in which users confirm the predictions of machine learning models [50, 51].
One of the most common approaches for letting users correct errors in recognition systems is to repeat the input that was misrecognized by the system [30]. For example, in handwriting recognition, a user overwrites incorrectly recognized words to correct them [52]. However, repeating has been shown to be an inefficient way to correct errors because the same errors can recur [53, 54]. Therefore, to avoid repeated misrecognition of inputs in the same modality, a prior study recommended multi-modal interactions that let users correct errors with a different modality [55]. For example, if speech input fails repeatedly, users can type the misrecognized text with a keyboard. A prior study confirmed the effectiveness of multi-modal interactions, showing that experienced users tend to switch modalities to correct errors more often than first-time users of speech recognition systems [14]. In assistive tools for blind people, the ease of correcting errors is a critical factor in users' decisions about whether to continue using the tools. Prior studies showed that people with disabilities frequently decide to adopt or abandon accessibility tools based on whether they can easily recover from errors [56, 57]. When it is difficult to recover from a system's failures, users of assistive technologies often turn to mitigation strategies such as using multiple devices or applications for the same task [58]. However, when multiple devices or applications are not available, blind users have to depend on their intuition or experience to recover from errors. For example, when a blind navigation system provides incorrect directional guidance, blind users find the correct directions based on their experience and awareness of the situation [26].
On the other hand, a prior study also showed that errors of only a few meters in a navigation system can leave blind users frustrated and lost when the place is unfamiliar to them [46]. Therefore, given the many sources of error in speech and image recognition systems, assistive tools based on these recognition systems need to provide user interfaces for reviewing, understanding, and recovering from errors.

Part I: Interacting with Error-Prone Speech Recognition

Prologue to Part I

Using deep neural networks, researchers have achieved vast improvements in speech recognition. Speech input is faster and more accurate on mobile devices than entering text with a touchscreen keyboard [59]. It is a primary means of text input on devices that have a small display or none at all, such as smartwatches or voice-based intelligent personal assistants (e.g., Google Home, Amazon Echo). Speech input is also particularly useful for eyes-free interaction (e.g., using a mobile device while walking or driving) or as accessible input for blind users [13, 60]. Reviewing and editing the dictated text, however, is a bottleneck [15]. Speech recognition errors arise from several sources: the ambiguity of words (e.g., homophones and pronouns), background noise, and mistakes by users [61]. Visual interfaces for error detection and correction have been proposed and studied for desktops and mobile devices (e.g., [62, 63, 64]). When visual output is available, users can read the recognized text and easily identify these errors. However, error identification is challenging when users can only hear an audio synthesis of that text [13]. Azenkot et al. [13] showed that while speech is a primary text input method for blind users on mobile devices, 80% of the time is spent reviewing and correcting errors with synthesized audio of the recognized text. Part I characterizes the challenges in identifying ASR errors through audio-only interactions.
Since experience with and listening rates for synthesized speech differ between blind and sighted users, due to differences in screen reader experience, we analyzed the ability to identify ASR errors for blind and sighted users separately. The first four user studies therefore involve sighted participants, recruited through Amazon Mechanical Turk, who do not use a screen reader. Findings and insights from these initial studies guide the experimental design of a follow-up in-depth study that investigates experiences and the ability to identify errors across blind and sighted participants in a lab setup. In all studies, error-identification accuracy is measured through speech dictation tasks. Specifically, Part I of this thesis explores the following research questions:

- RQ1: How frequently are ASR errors missed? (We investigate RQ1 in Chapter 4)
- RQ2: Do different synthetic speech manipulations affect the user's accuracy in identifying ASR errors? (We investigate RQ2 in Chapter 4)
- RQ3: For what tasks do blind and sighted users use ASR? (We investigate RQ3 in Chapter 5)
- RQ4: How do experiences with speech dictation and listening differ between blind and sighted users? (We investigate RQ4 in Chapter 5)
- RQ5: Does the accuracy of identifying ASR errors differ between blind and sighted users? (We investigate RQ5 in Chapter 5)
- RQ6: What are blind and sighted users' strategies for pointing to ASR errors? (We investigate RQ6 in Chapter 5)

Chapter 3: Background

ASR and synthesized speech have been employed in a variety of accessibility scenarios. This chapter reviews state-of-the-art ASR systems, applications to accessibility with a focus on blind users, and studies of the comprehension of synthesized speech and of ASR errors through audio.

3.1 Automatic Speech Recognition and Error Identification

The performance of ASR systems has been improved with various techniques.
The techniques can be categorized into three approaches: the acoustic-phonetic approach, the pattern recognition approach, and the artificial intelligence approach [65, 66]. Early ASR systems were built with the acoustic-phonetic approach [67]. This approach has been particularly useful for applications based on speech sounds, such as multilingual speech recognition, accent classification, and speech activity detection [68]. The pattern recognition approach was the dominant method for building ASR systems for decades before the artificial intelligence approach emerged with advances in deep learning. State-of-the-art ASR systems employ the artificial intelligence approach using deep neural networks, recently reaching a 5% word error rate (WER) [69]. WER is defined as

WER = (S + D + I) / N,

where S is the number of substituted words, D is the number of deleted words, I is the number of inserted words, and N is the total number of words in the reference. Though prior studies achieved quite low error rates in restricted environments (e.g., noise-free sound, limited vocabulary, articulate speech), many factors such as speaker variation and noise can cause ASR errors in practice [61, 70]. Therefore, researchers have been exploring techniques for automatically detecting ASR errors to supplement ASR systems, which are inherently error-prone [71]. Prior work has attempted to detect ASR errors automatically or to help users identify them. A simple approach is to visually highlight words that are grammatically incorrect, which is common on mainstream mobile devices, or words that have low ASR confidence [72]. Researchers have also attempted to automatically detect ASR errors to enhance speech-based interfaces (e.g., asking the user to clarify a voice request when the system detects a potential recognition error [73]).
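As an illustration, the WER formula can be computed with a word-level Levenshtein alignment, since S + D + I is the minimum edit distance between the reference and hypothesis word sequences. The sketch below is a minimal illustration only (`wer` is a hypothetical helper name, not code from this dissertation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, the Study 1 misrecognition "storm redoubles" to "stormy doubles" contributes two substitutions, so a three-word reference yields a WER of 2/3.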
While recent studies have developed methods to predict ASR errors using neural networks [73, 74, 75], the predictions reach 70% precision and 60% recall at best, suggesting that this remains an open area of research. The focus of our study is on users identifying ASR errors in a non-visual context; a follow-on step is to correct those errors by editing the dictated text. With the exception of Azenkot et al. [13], already discussed, work on editing ASR results has assumed that users will visually review and edit the text. These visual editing approaches can be classified as unimodal (speech used to edit) and multimodal (other input modalities used to edit) [31]. Multimodal solutions have combined speech with modalities such as pen, touchscreen, and keyboard input [76, 77, 78, 79]. As an example of unimodal (speech-only) correction, Choi et al. [80] developed a prediction model for distinguishing whether a user's utterance is intended as dictation input or as a correction command, achieving 84% accuracy in offline experiments. However, unimodal interfaces suffer from cascading side effects, where speech commands for correcting errors cause further ASR errors [15].

3.2 Automatic Speech Recognition for Accessibility

People with disabilities have been early adopters of user interfaces with speech input. Speech input can allow for efficient control of a computer, home-based IPAs (e.g., Amazon Echo, Google Home), or mobile devices for people with visual (e.g., [81, 82, 83, 84]) or motor impairments (e.g., [85, 86, 87, 88, 89]). ASR can also provide access to spoken information for people who are Deaf/deaf or hard of hearing (e.g., [90]). For people with speech impairments, speech input has been used to support self-assessment of pronunciation (e.g., [91, 92]) and to recognize a user's dictation and reproduce it through a synthesized voice (e.g., [93]).
In Chapters 4 and 5, we characterize the strategies and challenges in detecting ASR errors through synthesized speech (i.e., text-to-speech) among blind and sighted users. When comparing these two user groups, prior work has shown that blind users use speech dictation on mobile devices more often than sighted users [94], likely due to the inefficiency of using touchscreen keyboards with a screen reader [60]. Blind users also use speech input to access smartphone apps [84] and to browse the web [81, 82]. While in the latter cases users can infer errors from the system's response (e.g., which app opens), for dictation tasks ASR errors need to be identified by listening to the text-to-speech output from the screen reader. This ASR error identification task is the focus of our study.

3.3 Comprehension of Synthesized Speech

Several studies have concluded that blind people comprehend synthesized speech better than sighted people. For example, Papadopoulos and Koustriava [95] showed that the comprehensibility of synthesized speech was higher for blind users, probably due to greater experience with screen readers, while natural speech was easier to understand than synthesized speech for both blind and sighted users. Similarly, Stent et al. [96] showed that users' experience with synthesized speech positively impacts the accuracy of transcribing fast synthesized speech; they tested speech rates of 300 to 500 words per minute (WPM) with users with early-onset blindness. A recent study by Bragg et al. [97] measured the accuracy of answering questions based on synthesized speech ranging from 100 to 800 WPM, and found that the maximum intelligible speech rate was higher for blind users than for sighted users. Blind users have also rated their understanding of ultra-fast synthesized speech, at a rate of 17-22 syllables per second (680-880 WPM), higher than sighted users have [98].
However, the differences between these two groups of users may disappear when there are multiple streams of speech, the so-called cocktail party environment [99]. In support of this, Guerreiro and Gonçalves [100] found no differences between blind and sighted users in the ability to focus on a specific source when exposed to 2-4 concurrent synthesized speech sources. While the above studies evaluated the intelligibility and comprehensibility of synthesized speech and compared the performance of blind and sighted people, they focused on speech output without errors, which contrasts with our focus on identifying ASR errors through synthesized speech.

Chapter 4: Characterizing the Challenges in Identifying ASR Errors With Sighted Users

4.1 Motivation and Introduction

In this chapter, we quantify the problem of identifying speech recognition errors through audio-only feedback and investigate potential solutions. While researchers have examined the understanding of and ability to transcribe synthesized speech output (e.g., [96, 98]), the impact of different synthesized speech manipulations on the user's ability to identify speech recognition errors has not been investigated. We report on a series of four controlled studies. The goal of the first study was to characterize the problem of identifying errors based on audio-only output. For this in-lab study, native and non-native English speakers dictated and listened to the recognized version of a series of phrases in silent and noisy conditions. Overall, participants were unable to identify more than 50% of recognition errors when listening to the audio of the recognized text, with the most common difficulty being multiple-word errors (e.g., "mean" recognized as "me in", or "storm redoubles" as "stormy doubles").
The studies in Sections 4.2 through 4.5 then investigated the effect of three synthesized speech manipulations (pauses between words, speech rate, and speech repetition) on the user's ability to identify those recognition errors. Inserting pauses, in particular, could help address the multiple-word errors identified in the study in Section 4.2. The studies in Sections 4.3 and 4.4 showed that adding a pause between words resulted in significantly higher error identification rates than no pause, and that fast speech (i.e., 300 WPM) made identification more difficult. Finally, the study in Section 4.5 evaluated another alternative, repeating the audio output twice, and found that repetition did not improve participants' ability to identify errors over simply listening to the audio once.

4.2 Understanding Error Identification in Recognized Speech

Though previous studies have shown that reviewing dictated text using non-visual output is a challenge [13], the extent of that challenge and the specific difficulties that users encounter have not been quantitatively assessed. How many misrecognized words do users miss when reviewing only through audio? What kinds of errors are the hardest for users to identify? To answer these questions, we conducted a lab-based study where participants dictated a set of phrases using a mobile device and reviewed the system's recognition of each phrase by listening to audio output. To increase the generalizability of the findings, we manipulated the level of background noise and participants' fluency, two factors that are known to impact speech recognition accuracy [101].

4.2.1 Method

This controlled experiment measured the impact of background noise level and the user's English proficiency on the WER of the speech recognizer and on the user's ability to identify recognition errors based on synthesized speech output.

Figure 4.1: The experimental setup.

Participants.
We recruited 12 native English speakers (5 male, 7 female) and 12 non-native English speakers (8 male, 4 female) through campus email lists. The native English speakers ranged in age from 18 to 36 (M=23.4, SD=5.3), while the non-native English speakers were 22 to 38 years old (M=26.3, SD=4.4). None reported having hearing loss. Non-native speakers had lived in the United States for 0.3 years on average (SD=2.3). Ten native speakers and eight non-native speakers had prior experience with speech input; the remaining participants did not.

Procedure. Study sessions took 30 minutes and were conducted in a quiet room. As shown in Figure 4.1, participants sat at a table on which a Galaxy Tab 4 and two speakers were placed. We first collected demographic information and experience using speech input. The silent and noisy conditions were then presented in counterbalanced order. The tablet's audio output was set to 75% of maximum volume, which was approximately 60 dB with the synthesized speech audio. For the noisy condition, the speakers played street noise at 50 dB. A custom Android application guided participants through 30 trials per condition, where each trial consisted of: (1) reading a phrase displayed on the tablet screen, (2) dictating the phrase, which included double-tapping the screen to indicate the start and end of dictation, (3) listening to synthesized speech output of the recognized phrase, and (4) identifying discrepancies, if any, between the dictated and recognized text. This last step involved reporting words that had been incorrectly recognized and locations where extra words had been inserted. Participants viewed the reference phrase while listening to the synthesized speech and verbally reported the errors they heard to the experimenter. The phrases were randomly selected without replacement from a set of 200 phrases extracted from the LibriSpeech ASR corpus [6].
Of the 2,703 phrases in the LibriSpeech development subset, 600 had 10 or fewer words; of these, we randomly selected 200 that formed complete sentences, were comprehensible, and contained no proper nouns, which would increase ASR errors. The IBM Speech-to-Text API was used for speech recognition because it provides functions to analyze the recognition results (e.g., confidence scores and word timings). The speech was synthesized on the tablet using the TextToSpeech function in Android 5.0 with the default speech rate of 175 WPM (which is within the range recommended in the research literature [102]).

4.2.1.1 Data Analysis

Study Design. This study used a mixed factorial design with a within-subjects factor of Noise (silent vs. noisy) and a between-subjects factor of Fluency (native vs. non-native). The silent and noisy conditions were presented in counterbalanced order, and participants were randomly assigned to orders.

Measures and Data Analysis. To provide a baseline understanding of how well the speech recognizer performed, we computed word error rate (WER) on the recognized text [72]; lower rates are better. To assess the user's ability to identify errors, we computed precision (when a participant thinks they hear an error, how often is it actually an error?) and recall (the proportion of true errors participants were able to identify). We also employed phrase-level accuracy as a secondary measure: whether a participant identified at least one error in a phrase that contains one or more errors, or no errors in a correct phrase. For this exploratory study, we focused on accuracy and did not measure speed. To compute these measures, we needed to judge whether each instance where the participant pointed out an incorrectly recognized or inserted word was a true positive, or whether the lack of an error label was a true negative. Ambiguity arose when a single word was recognized as multiple words (e.g., "meet"
to "me it"). Is this (i) one "incorrect word: meet", or (ii) one "incorrect word: meet" plus one "inserted word: it"? We considered both responses to be correct, with (i) counted as one true positive and (ii) counted as two true positives. As a third case, if the participant marked this error as simply one "inserted word: it", we judged the response to include one false negative (the word "meet" should have been marked as incorrect) and one true positive (for the word "it" being added). WER and precision violated the normality assumption of an ANOVA (Shapiro-Wilk tests, p < .05), so we instead used 2-way repeated-measures ANOVAs with the aligned rank transform (ART) for these measures, a non-parametric alternative to a factorial ANOVA [103]. Recall was analyzed in the same way, but without the ART adjustment.

Figure 4.2: WER, precision, recall, and phrase-level accuracy in Study 1. Recall results showed that participants missed identifying more than half of the speech recognition errors. Error bars show standard error (N = 12 per group).

4.2.2 Results

Figure 4.2 shows the WER, precision, recall, and phrase-level accuracy in the silent and noisy conditions for native and non-native speakers.

Fluency affected speech recognition accuracy. The impacts of fluency and noise on WER have been studied previously, so our intention in including these factors in the experiment was simply to increase the generalizability of our main measures (i.e., precision and recall in identifying recognition errors). For completeness, however, we still examined whether fluency and noise impacted WER. As expected based on past work [104], fluency did impact WER. WER was higher for non-native speakers than native speakers, at 0.22 (SD=0.05) compared to 0.07 (SD=0.04); this difference was significant (main effect of Fluency: F(1,22) = 123.00, p < .001, η² = 0.74). The WER for native speakers was close to the typical WERs achieved by recent speech recognition engines, at 0.05-0.10 [7].
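The precision and recall measures described above can be sketched by treating a participant's marks and the ground-truth errors as sets of word positions. This is a simplified illustration only: `score_marks` is a hypothetical helper name, and it glosses over the multi-word ambiguity cases just discussed.

```python
def score_marks(true_errors: set, marked: set) -> tuple:
    """Precision and recall of a participant's error marks against ground truth."""
    tp = len(true_errors & marked)   # real errors the participant caught
    fp = len(marked - true_errors)   # marks that were not actually errors
    fn = len(true_errors - marked)   # real errors the participant missed
    precision = tp / (tp + fp) if marked else 1.0
    recall = tp / (tp + fn) if true_errors else 1.0
    return precision, recall
```

For instance, if words 1, 3, and 5 were misrecognized but a participant marked words 1, 2, and 3, both precision and recall would be 2/3.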
Different levels of background noise did not significantly impact WER (main effect of Noise: F(1,22) = 0.79, p = .384, η² < 0.01), nor was there a significant interaction effect between Fluency and Noise (F(1,22) = 0.16, p = .693, η² < 0.01).

Participants missed more than half of the errors. In terms of participants' ability to identify speech recognition errors based on audio output, across all conditions precision was 0.81 (SD=0.18), meaning that 19% of the errors that participants marked were not true errors. Of greater importance for producing accurate text input, however, are the relatively low recall rates: on average across all four conditions, only 0.44 (SD=0.16) of true errors were identified, so more than half of the errors went undetected. Phrase-level accuracy, which could at least let a user know they should re-dictate an entire phrase even if they are not aware of all the detailed errors, was higher, at 0.90 (SD=0.06) in the best case (native speakers, silent condition). ANOVA (with ART where applicable) results revealed no significant main or interaction effects of Fluency or Noise on precision or recall. There was a significant main effect of Fluency on phrase-level accuracy (F(1,22) = 7.48, p = .009, η² = 0.14), whereby native speakers had higher accuracy than non-native speakers, at 0.85 (SD=0.08) compared to 0.76 (SD=0.12). However, the main effect of Noise and the interaction effect between Fluency and Noise on phrase-level accuracy were not significant.

Multiple-word errors were the most difficult to identify. To better understand what types of errors participants had trouble identifying, we qualitatively analyzed the 183 errors that native-speaker participants missed (i.e., instances of false negatives). Native speakers, who are the most likely to use speech input in English, were the target participants in Studies 2-4, so we focused on native speakers in this analysis. One research team member coded the missed errors into the categories below.
For validation, a second coder independently coded all missed errors, and Cohen's kappa showed strong inter-rater agreement (kappa=0.82, 95% CI: [0.76, 0.88]). The categories were as follows:

- Multiple-word errors (N=107; 58.5%). Multiple sequential words sometimes sounded like another word or words. We included cases where multiple words were recognized as a single word (e.g., "a while" and "awhile"), multiple words were recognized as other multiple words (e.g., "storm redoubles" and "stormy doubles"), and single words were recognized as multiple words (e.g., "meet" and "me it").

- Single-word errors (N=57; 31.1%). This type of error includes single words that were replaced with homophones or other single words with similar sounds (e.g., "inquire" and "acquire", "he" and "she").

- Punctuation mark errors (N=7; 3.8%). There is typically no explicit indication of punctuation marks such as apostrophes in text-to-speech output. If the recognized word was exactly the same as the intended word except for a punctuation mark, we classified it as a punctuation error (e.g., "state's" and "states").

- Other (N=12; 6.6%). In some cases, the type of error was unclear. For example, when there were many errors in a phrase, the participant may simply have been unable to remember them all.

4.2.3 Summary and Discussion

Across both user groups, participants missed over 50% of recognition errors when listening to the audio playback. Phrase-level accuracy, which would allow a participant to know they should re-dictate an entire phrase, was higher but still left many errors unidentified (10% of phrases). The majority of the errors that participants did not notice were classified as multiple-word errors. A potential solution to address this type of error is to emphasize individual words in the text-to-speech output by adding pauses between words, an approach that we focus on in Studies 2-4 alongside other simple output manipulations.
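For reference, Cohen's kappa for two coders' category labels can be computed as in the minimal sketch below. This is illustrative only: the analysis above was likely done with a statistics package, and the confidence-interval computation is omitted.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b) -> float:
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    n = len(coder_a)
    # observed agreement: fraction of items both coders labeled identically
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    ca, cb = Counter(coder_a), Counter(coder_b)
    # chance agreement: probability both coders pick the same category at random
    expected = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Perfect agreement yields kappa = 1.0, chance-level agreement yields 0, and systematic disagreement yields negative values; a value of 0.82, as reported above, is conventionally read as strong agreement.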
That there were no differences in WER or in participants' ability to identify recognition errors between the background noise levels suggests that we may have needed a wider range of noise levels to properly assess that factor.

4.3 Improving Error Identification Through Speech Rate and Pause

Study 1 showed that participants missed a substantial number of errors when listening to the confirmation audio clips, with the most common type of missed error being a multiple-word error. In Study 2, we focused on a straightforward potential means of addressing this problem: adding artificial pauses between words in the speech output, which should allow the user to more easily distinguish individual words. Inserting pauses in synthesized speech affects prosody and elision, the latter being when successive words are strung together in speech, causing the omission of an initial or final sound in a word. While this change is not ideal for many uses of text-to-speech, it is potentially useful for helping users correct recognition errors through audio-only interaction. This study isolated the error identification component of speech input and correction. Fifty-four crowdsourced participants read a series of phrases that had been dictated in Study 1, listened to the corresponding confirmation audio clips (i.e., text-to-speech output of what the system had recognized), and identified discrepancies (recognition errors) between the presented text and the audio output under varying conditions: no/short/long pause and three speech rates.

Figure 4.3: Screenshot of the online testbed used for Studies 2, 3, and 4, showing a single trial. A trial consisted of reading a presented phrase, listening to an audio clip of what a speech recognition engine had heard, and marking errors in the recognized version (i.e., discrepancies between text and audio).
4.3.1 Method

For this study and the two subsequent ones, we recruited participants on Amazon's Mechanical Turk so that we could run a series of studies with a larger and more diverse sample than would have been feasible in the lab.

Participants. The 54 participants (33 male, 21 female) ranged in age from 21 to 58 (M=33.4, SD=8.8). All participants were native English speakers, and none reported hearing loss. Just over half (N=29) had experience using speech input. All participants reported completing the study in a quiet room.

Procedure. Participants were directed to an online testbed that guided them through the 45-minute study procedure. The procedure began with a background questionnaire, followed by instructions about the overall tasks. Participants were then shown a sample phrase and asked to adjust the sound volume to ensure they could easily hear the audio clip. The task consisted of identifying discrepancies between presented text phrases and audio clips, where the audio clips could contain errors made by a speech recognizer. To test realistic speech recognition errors, the pairings of presented phrases and audio clips were taken from the speech input collected during Study 1. That study resulted in 600 pairs of presented and recognized phrases, of which 32.4% of the recognized phrases included at least one error. The Say app in Mac OS X was used to generate the synthesized speech, including pauses. Figure 4.3 shows an example trial, with the presented phrase and an audio clip widget. After clicking to listen to the audio clip once (a single play; no replays allowed), the participant answered (yes/no) whether the audio clip matched the presented phrase. The page included boxes that mapped to each word in the presented phrase, as well as to locations before and after words where extra words could appear. Participants marked all discrepancies between the presented text and the audio by clicking the corresponding boxes.
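As an illustration of how such word-level pauses can be generated, Apple's speech synthesizer supports embedded commands such as `[[slnc <ms>]]`, which inserts a silence of the given length in milliseconds. The sketch below interleaves these commands between words; it is an assumption-laden illustration (`with_pauses` is a hypothetical helper), not the study's exact tooling.

```python
def with_pauses(phrase: str, pause_ms: int) -> str:
    """Interleave Apple speech-synthesis silence commands between words.

    The result can be passed to the macOS `say` tool, e.g.:
        say -r 200 "storm [[slnc 150]] redoubles"
    """
    if pause_ms <= 0:
        return phrase                 # the "no pause" condition
    return f" [[slnc {pause_ms}]] ".join(phrase.split())
```

The speech rate itself would be controlled separately (e.g., `say -r 100`, `-r 200`, `-r 300` for the three WPM conditions).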
The boxes were only enabled after the audio clip finished playing, so participants could not mark errors while actively listening to the audio. The presented phrase was visible for the duration of the trial. The "next" button was enabled only after the participant had reported whether the audio contained any errors. Participants first completed six practice trials to familiarize themselves with the task. Practice trials used a typical text-to-speech output setting of 180 WPM with no pauses between words. After each practice trial, participants were shown the correct answer as feedback. The experimental conditions were then presented in counterbalanced order, with 20 test trials per condition. Phrases were randomly selected from the set of 600 without replacement, and different phrases were used for practice and test trials. After finishing all conditions, participants answered questions about the ease of use of and their preference among the conditions.

Study design. Study 2 used a 3x3 within-subjects design with factors of Speech Rate (100, 200, and 300 WPM) and Pause Length (no pause, 1 ms, and 150 ms). The order of presentation of the nine conditions was counterbalanced using a balanced Latin square (in fact, two squares, due to the odd number of conditions). Participants were randomly assigned to orders. The 1 ms pauses, while too short to cause a detectable silence in the output, were used to eliminate elision, in contrast to the "no pause" condition. The 150 ms pause length was selected by pilot testing different lengths (1 to 200 ms) to identify a short yet distinguishable pause. Because the effectiveness of pause lengths, and error identification in general, may be affected by the speech rate, we included three speech rates: one close to the default rates of commercial text-to-speech systems (200 WPM), a slower rate (100 WPM), and a faster rate (300 WPM).
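The counterbalancing scheme can be sketched with the standard balanced Latin square construction (a Williams design): the first row alternates forward and backward steps through the conditions, and each subsequent row shifts by one. For an odd number of conditions, each row's reverse is appended as a second square, which is what necessitates two squares for nine conditions. This is an illustrative sketch under those standard assumptions, not the study's code.

```python
def balanced_latin_square(n: int) -> list[list[int]]:
    """Condition orders (0..n-1) balanced for first-order carryover effects."""
    rows = []
    for i in range(n):                     # one order per starting condition
        row, fwd, back = [], 0, 0
        for k in range(n):
            if k % 2 == 0:                 # even positions step forward
                row.append((i + fwd) % n)
                fwd += 1
            else:                          # odd positions step backward
                back += 1
                row.append((i + n - back) % n)
        rows.append(row)
    if n % 2 == 1:                         # odd n: append the mirrored square
        rows += [row[::-1] for row in rows]
    return rows
```

For the nine conditions of Study 2, `balanced_latin_square(9)` yields 18 orders (two squares), and participants would be assigned to orders at random.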
4.3.1.1 Data Analysis

As in Study 1, we computed precision, recall, and phrase-level accuracy for identifying errors in the audio clips. Two participants were excluded from the analysis because they did not mark any words as errors in one condition, making it impossible to calculate precision. Although our focus is on how well participants identify errors, for completeness we also report trial completion time (the time from the start of a trial to clicking the "next" button). However, low trial completion times are not necessarily our goal, since they could be due to not noticing, and thus not taking the time to mark, errors. Perhaps more importantly, the length of the audio clips varies by condition, so we also report descriptive statistics for audio clip length. Precision, phrase-level accuracy, and trial completion time violated the normality assumption of ANOVA (Shapiro-Wilk tests, p < .05). Therefore, 2-way repeated-measures ANOVAs with ART were used, with Wilcoxon signed-rank tests and a Bonferroni correction for post hoc pairwise comparisons. For recall, a 2-way RM ANOVA was used with paired t-tests for post hoc pairwise comparisons.

4.3.2 Results

Figure 4.4 shows our primary measures of precision, recall, and phrase-level accuracy, along with trial completion time for completeness.

Figure 4.4: Precision, recall, phrase-level accuracy, and trial completion time in Study 2. The shaded portion in trial completion time indicates the average length of audio clips in that condition. Participants identified errors most accurately with the 200 WPM speech rate and 150 ms pause. Error bars show standard error (N = 52).

Pauses and slower speech improve recall. Recall ranged from 0.48 to 0.67 across the nine conditions. Pause Length significantly impacted recall (F(2,408) = 1.47, p < .001, η² = 0.07). All post hoc pairwise comparisons were significant (p < .05), showing that as pause length increased, so did recall.
Speech Rate also significantly affected recall (F2,408 = 0.66, p < .001, η² = 0.03). Posthoc pairwise comparisons showed that the 300 WPM speech rate resulted in significantly lower recall than the other two speeds (both comparisons p < .05). The interaction between Speech Rate and Pause Length was not significant (F4,408 = 0.89, p = .467, η² < 0.01). Pauses also impact precision. Precision ranged from 0.83 to 0.92 across the nine conditions. Precision was significantly impacted by Pause Length (F2,408 = 3.71, p = .025, η² = 0.01), although after a Bonferroni correction no posthoc pairwise comparisons were significant. There was no significant main effect of Speech Rate on precision (F2,408 = 1.22, p = .297, η² < 0.01), nor was the Pause Length × Speech Rate interaction effect significant (F4,408 = 1.53, p = .193, η² < 0.01).

Figure 4.5: Types of errors participants missed identifying in Study 2. Participants missed 33% fewer multiple-word errors with a 150ms pause compared to no pause.

Secondarily, pauses improve phrase-level accuracy. Overall, Pause Length significantly impacted phrase-level accuracy (F2,408 = 22.36, p < .001, η² = 0.07); posthoc pairwise comparisons showed that the differences between all pairs of pause lengths were significant (all p ≤ .05). Speech Rate also significantly impacted phrase-level accuracy (F2,408 = 6.46, p < .001, η² = 0.01), but no posthoc pairwise comparisons were significant after a Bonferroni correction. The interaction effect between Speech Rate and Pause Length was not significant (F4,408 = 2.31, p = .058, η² < 0.01). Trial completion times and audio lengths as expected. The time to play the audio clip accounted for a substantial portion of the trial completion time on average, as shown in Figure 4.4. The downside of inserting pauses between words and slowing down speech playback is that these changes lengthen the audio clip time.
Accordingly, there was wide variation in both trial completion times and audio clip length. Even the 1ms pause added 10-15% to trial completion times across the three speech rates compared to no pause, and 44-47% if just examining the length of the audio clips, because the pauses eliminate overlaps between words (eliminating elision). Identifying multiple-word errors improved the most. To examine the effect of pauses on specific types of errors, we manually coded 2341 missed errors from all participants. Figure 4.5 shows the number of errors of all types. The overall trend shows that all three types of errors decreased as the pause increased. However, the most substantial reduction was for multiple-word errors, which dropped 33.2% from the no pause condition (497 missed errors) to the 150ms pause condition (332 errors). In contrast, missed single-word errors only dropped 18.9%, from 359 to 291, and punctuation errors dropped 28.2%, from 46 to 33.

                  Speech rate (WPM)     Pause length (ms)
                  100    200    300     no     1      150
    Ease          21     27     4       17     21     14
    Preference    7      31     14      20     22     10

Table 4.1: Subjective vote tallies in Study 2. The 200 WPM speech rate and the shortest two pause lengths were the most preferred, while 300 WPM was least likely to be voted easiest.

Speech rate impacted perceived ease and preference. The subjective responses differed from the objective measures. Table 4.1 shows vote tallies for the easiest and most preferred speech rates and pause lengths. Pearson chi-square tests showed that Speech Rate significantly impacted ease (χ²(2, N = 52) = 16.42, p < .001) and preference votes (χ²(2, N = 52) = 17.58, p < .001). The 200 WPM speech rate received the most votes for both ease and preference. Pause Length did not significantly impact either measure. In open-ended comments, participants said that 200 WPM felt natural because it was close to a normal speech rate.
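The reported χ² statistics appear consistent with one-way (goodness-of-fit) chi-square tests of the three speech-rate vote counts against a uniform expectation; a minimal check (the helper name is ours):

```python
def chi_square_uniform(observed):
    """One-way Pearson chi-square statistic against a uniform expectation
    over the categories (df = len(observed) - 1)."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Ease and preference votes per speech rate (100, 200, 300 WPM), Table 4.1
ease_x2 = chi_square_uniform([21, 27, 4])
pref_x2 = chi_square_uniform([7, 31, 14])
```

Both round to the reported values, χ²(2, N = 52) = 16.42 and 17.58.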
While accuracy with the 150ms pause was highest, nine participants felt that it sounded unnatural compared to the other two pause lengths. Four participants, however, reported that the 1ms pause gave them a moment to think and sounded more natural than the 150ms pause.

4.3.3 Summary and Discussion

Recall and phrase-level accuracy were highest with the longest pause length (150ms), while the fastest speech rate (300 WPM) negatively affected recall. An important consideration, however, is that inserting pauses and slowing down speech increases audio clip length and thus overall task time. Compared to the baseline condition (i.e., 200 WPM, no pause), the best combination (200 WPM, 150ms pause) resulted in a 31% increase in recall and a 4% increase in precision, though it also almost doubled the playback length. Even the 1ms pause made the audio 0.6-1.9s longer than the no pause audio because it removed the elision in the phrase. In terms of subjective responses, most participants preferred the 200 WPM speech rate (which corresponds to [102]) and felt that 300 WPM made the task harder. However, there was no impact of pause length on subjective measures, suggesting that these short pauses (1ms, 150ms) may be acceptable compared to no pause even though they add time to the task.

4.4 Improving Error Identification by Varying Pause Length

Study 2 showed that inserting pauses between words in the speech output enables users to identify errors more accurately, but it only included two pause lengths that were greater than 0ms. Because adding pauses increases overall task time, we would ideally be able to pinpoint the shortest pause length that is still effective, and use that during audio-only speech input. To more precisely identify an ideal pause length than was possible in Study 2, here we evaluate seven pause lengths ranging from 1ms to 300ms. Figure 4.6: Graphs of precision, recall, phrase-level accuracy, and trial completion time in Study 3.
The shaded portion in trial completion time indicates the length of audio clips. There were no significant differences in accuracy measures due to pause length. Error bars show standard error (N = 40).

4.4.1 Method

The study method is similar to Study 2, with the exceptions described here. The speech rate was fixed at 200 WPM because there were no significant error identification differences between 100 and 200 WPM in Study 2, and participants preferred 200 WPM. We recruited 42 participants (23 male, 19 female). Participants were on average 37.2 years old (SD = 11.7; range 21-68). All were native English speakers and none had hearing loss. Twenty had previously used speech input. Four participants reported completing the study with light background noise (e.g., light street noise or an office), while the remaining 38 participants reported using a quiet room. This study employed a within-subjects design with the single factor of Pause Length (1, 50, 100, 150, 200, 250, or 300ms). This range spans from imperceptible pauses to highly obvious pauses. The seven conditions were presented in counterbalanced order using a balanced Latin square, similar to Study 2. Participants were randomly assigned to orders. Precision, recall, phrase-level accuracy, and trial completion time all violated the normality assumption of ANOVA (Shapiro-Wilk tests, p < .05), so one-way RM ANOVAs with ART were used. Two participants who marked no errors in one condition were excluded from the analysis because their precision could not be calculated.

4.4.2 Results

Figure 4.6 shows results for the four main measures. Unlike in Study 2, there were no significant main effects of Pause Length on recall, precision, or phrase-level accuracy (respectively: F6,234 = 2.12, p = .052, η² = 0.04; F6,234 = 0.68, p = .667, η² = 0.02; F6,234 = 2.09, p = .06, η² = 0.04). Average audio clip length ranged from 3.0s per trial with the 1ms pause to 5.0s per trial with the 300ms pause.
The trial completion time was shortest with the 1ms pause, at 6.4s, and longest, at 8.1s, for both the 250ms and 300ms pauses. Following the performance results, there were no statistically significant differences in ease and preference due to Pause Length (chi-square tests, p > .05).

4.4.3 Summary and Discussion

These results are unexpected and appear to contradict Study 2, where we had concluded that the 150ms pauses resulted in significantly higher recall and phrase-level accuracy than the 1ms pause. (Note that the worst-performing condition from Study 2, no pause, is not included in this study.) To confirm that the result of Study 3 was not obtained by chance and to better understand this unexpected result, we conducted two additional studies, which we report on briefly. First, we approximately replicated Study 3, but with 30 participants and two adjustments to increase statistical power: only four pause length conditions (1, 75, 150, and 225ms), and 40 trials per condition instead of 20. This replication yielded a similar result to what is reported above: no significant effects of pause length on recall, precision, or phrase-level accuracy. A subsequent closer examination of the Study 2 results, however, revealed that an important yet not statistically significant interaction effect may have affected those earlier conclusions: the 1ms vs. 150ms pause difference may have arisen primarily from the 300 WPM speech rate condition, rather than the 100 WPM or 200 WPM conditions. As such, because we used only 200 WPM in Study 3, we revisited the 200 WPM data from Study 2. A simple paired t-test showed that there was no significant difference between the 1ms and 150ms pauses for recall; similarly, Wilcoxon signed-rank tests were not significant for precision or phrase-level accuracy. Thus, Study 3 does confirm Study 2 but also provides more nuance on the conclusions.
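The paired comparison used when revisiting the 200 WPM recall data is a standard paired t-test; the helper below (illustrative, with toy data rather than the study's measurements) returns the t statistic and degrees of freedom for two matched samples:

```python
import math

def paired_t(x, y):
    """Paired t-test: t statistic and degrees of freedom for matched
    samples x and y (e.g., per-participant recall with 1ms vs. 150ms
    pauses). Equivalent in form to scipy.stats.ttest_rel's statistic.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Toy data only -- not the study's measurements
t, df = paired_t([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0])
```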
Again, the worst-performing pause length from Study 2 was the no pause condition, which had allowed us to conclude that inserting even 1ms pauses was better than no pause. To confirm that this conclusion still held for the 200 WPM speech rate alone, we first conducted a t-test and Wilcoxon signed-rank tests on the 200 WPM data from Study 2. The 1ms pause resulted in significantly higher recall and phrase-level accuracy than no pause (all p < .05). We then conducted a short follow-up replication: we collected new data from 28 participants who completed 25 trials in two conditions: 200 WPM with a 1ms pause and 200 WPM with no pause. The 1ms pause resulted in significantly higher recall (t-test, t27 = 2.73, p = .011, d = 0.59) and phrase-level accuracy (Wilcoxon signed-rank test, W = 216.5, Z = 2.43, p = .014, r = 0.32) than no pause. Considering the results from both Studies 2 and 3, we can conclude that inserting a pause between words does significantly help in identifying speech recognition errors at the preferred speech rate of 200 WPM, but the length of that pause does not matter. What is most important is the existence of a pause, perhaps because it eliminates elision.

4.5 Improving Error Identification by Listening to Speech Twice

Inserting pauses between words lengthens the time for audio playback. As already mentioned, even with only a 1ms pause, playback time increased by an additional 45% over no pause with the text-to-speech engine we used in Study 2. In this final study, we conducted an initial assessment of an alternative approach to making use of extra time: simply playing the audio clip twice rather than once. Participants in the earlier studies had only been allowed to listen to each audio clip once, to assess their first-pass ability to identify errors. However, repeating the audio twice could improve error identification.
While it may be useful to assess the effects of repetition in more detail in future work, for this first evaluation, we compared clips at 200 WPM played only once versus played twice, with no pauses between words.

4.5.1 Method

The method is the same as for Study 2 except as follows. Participants. We recruited 30 participants (17 male, 13 female). Participants ranged in age from 23 to 66 (M = 36.6, SD = 11.5). All participants were native English speakers with no hearing loss. All participants reported completing the task in a quiet room, except for one who reported light background noise. Seventeen participants had previously used speech input on their phone or computer. Study Design. We used a within-subjects design with two conditions: Default or Repeat. With Default, the audio feedback played once at 200 WPM with no pause between words, whereas with Repeat the audio played twice at 200 WPM with no pause between words and a chime sound (1.1s long) between repetitions. The two conditions were presented in counterbalanced order. Participants were randomly assigned to orders. Precision, recall, phrase-level accuracy, and trial completion time data all violated the normality assumption (Shapiro-Wilk tests, p < .05). Therefore, Wilcoxon signed-rank tests were used to compare the two conditions.

4.5.2 Results

Figure 4.7 shows the measures for Study 4. There was no significant difference in recall between the two conditions (W = 192, Z = -0.56, p = .591, r = 0.07). The differences in precision and phrase-level accuracy were also not significant (respectively: W = 184, Z = 0.50, p = .633, r = 0.06; W = 213, Z = 1.02, p = .315, r = 0.13). The average length of the audio clips was 2.1s (SD = 0.01) in the Default condition and 5.1s (SD = 0.02) in the Repeat condition. Due to the longer audio, trial completion time in the Repeat condition was also longer than in the Default condition, at 9.3s (SD = 0.4) compared to 6.4s (SD = 0.4); the difference was significant (W = 1, Z = -4.76, p < .001, r = 0.61).

Figure 4.7: Graphs of precision, recall, and phrase-level accuracy in Study 4. The shaded portion in trial completion time indicates the average length of audio clips in that condition. Only trial completion time was significantly different between the two conditions. Error bars show standard errors (N = 30).

4.5.3 Summary and Discussion

Listening to audio clips twice did not improve the accuracy measures, although it added length to the audio clips. However, this result should be considered a preliminary exploration of the repetition approach, with more work needed to evaluate potential interactions among speech rates, pauses between words, and repetition.

4.6 Discussion

Combined, the four studies show, first, that identifying speech recognition errors through audio-only interaction is hard: participants missed identifying over 50% of errors in Study 1, the majority of which involved speech sounds strung across multiple words (e.g., one word recognized as two separate words that sound similar to the original word). Studies 2 to 4 then explored three straightforward speech output manipulations, showing that adding even an imperceptibly brief pause (1ms) between words increases recall and phrase-level accuracy. In terms of speech rate, a high speech rate of 300 WPM reduced the ability to identify errors compared to slower and more subjectively comfortable rates. Finally, repeating the audio output (i.e., playing it twice instead of once) did not impact error identification, at least at a 200 WPM speech rate. Designing Audio-Only Speech Input. The manipulations we evaluated all add time to the audio output. One design choice would be to add very short inter-word pauses to all dictated text output. However, it may be preferable to provide user control over whether to achieve higher input accuracy at the cost of this extra time.
Users could listen to the text output using typical speech settings (e.g., no pause); then, if they detect the possibility of an error, they could review the text again in more detail using pauses and slower speech. Depending on the speech recognizer's accuracy, this could in fact be the most efficient interaction style overall, achieving both high speed and text input accuracy. More work is needed. Users may also have individual preferences regarding the tradeoff between speed and text input accuracy, where some may be more concerned than others about missed errors. The level of concern will also vary based on task context, similar to how the acceptability of handwriting input recognition errors varies based on context [105]. For example, sending an informal text message to a spouse likely does not require the same level of attention to accuracy as writing an email to one's work supervisor. Previous work has shown that experience with synthesized speech impacts comprehensibility (e.g., [95]). As such, it will be important to investigate how experience may interact with the effectiveness of pauses, speech rate, and repetition on audio-only error identification. For example, users with visual impairments who are experienced with screen readers will likely perform differently than the sighted users included in our study. Another factor that could impact users' ability to identify errors when reviewing audio is the attentional demand of many settings where speech is used, such as mobile interaction [106]. Relatedly, while we did not see an impact of background noise level on error identification rates in Study 1, we hypothesize that our background noise was simply too quiet and that louder noise would cause lower identification rates. Limitations. Our study has limitations that should be addressed in future work.
First, though we purposely chose a single-factor design for Study 3 to bolster statistical power, the follow-up data collection showed that there is an interaction effect between pause length and speech rate, an interaction that should be examined further. Second, we used a transcription task where participants dictated a presented phrase. This approach allows for precise measurement of error identification rates (by comparing errors and participant responses to the presented phrases) but is less realistic than a free-form dictation task would be. Third, we reused the phrase set and errors from Study 1 for all subsequent studies. It will thus be important to generalize the findings to different phrase sets. Fourth, participants were provided with feedback on whether they had correctly identified speech recognition errors only during the practice trials in Studies 2-4, not during test trials. It is possible that such feedback, while not representative of real use, would have impacted performance and subjective responses. Finally, while we did not observe any impacts due to the quality of the synthesized speech, future work should examine the potential impacts of different types of speech synthesis engines on error identification.

4.7 Conclusion

We reported on four studies to characterize and address the difficulty of identifying speech recognition errors when using audio-only speech input. Study 1 revealed that, by listening to audio clips alone, users could identify less than half of the speech recognition errors. We then addressed the most common type of error that participants had missed in Study 1 (errors where multiple words blended together) by inserting pauses between words and varying the speech rate in the audio output. The simple solution of inserting even a 1ms pause between words improved the ability to identify errors, while a fast speech rate made the task more difficult, and repeating the audio output had no effect.
These findings have implications for speech-based text input in a variety of non-visual contexts, and an important avenue of future work will be to extend the investigation to accessibility for blind and visually impaired users.

Chapter 5: Comparing Error Identification in ASR Across Blind and Sighted Users

5.1 Motivation and Introduction

The study in Chapter 4 quantified and characterized the challenge of identifying ASR errors. Studying sighted users, it showed that participants missed about 50% of ASR errors when reviewing dictated text with no visual output and standard text-to-speech synthesis (at 200 words per minute). Despite the importance of accurate audio-only speech input for blind users (for example, blind users make use of speech input at higher rates than sighted users [13]), the ability of screen reader users to identify ASR errors has not been evaluated. Individuals with visual impairments who use screen readers are known to comprehend synthesized speech better than sighted people [95]. This leads to the following research questions: How do blind and sighted individuals' experience with speech input and concerns about ASR errors differ? How well can blind screen reader users identify ASR errors when using speech input? In this chapter, we report on an exploratory user study that compares blind and sighted users' experience with speech input and interactions with ASR errors to better understand challenges and strategies in reviewing ASR errors through audio alone. Study sessions included a semi-structured interview on experiences with speech dictation and synthesized text-to-speech output, followed by a speech dictation task where participants were asked to identify ASR errors in their dictated text. We found that while both user groups confirmed the importance of speech input, blind participants used speech input more frequently than sighted participants (confirming results from [13]).
Other differences between the two groups included the most common uses of synthesized speech (reading text on a screen for blind participants vs. conversational interfaces such as Siri for sighted participants) and methods of reviewing the inputted text (visual magnifier1 or audio for blind participants vs. visual review for sighted participants). During the initial interview, most participants reported that identifying ASR errors is not challenging, but the performance data in our study suggest otherwise. In the speech dictation task, participants in both groups were only able to identify around 40% of ASR errors, and, counter to our hypothesis, there were no significant performance differences between the two user groups. While the challenge of identifying ASR errors through audio alone was established for sighted users in Chapter 4, sighted users can choose to review important text visually when needed. That audio-only identification of ASR errors is equally challenging for blind users with substantial synthesized speech experience, who do not have the option of visual review, emphasizes the importance of developing speech input techniques that allow blind users to more accurately review and edit dictated text. We show that identifying ASR errors with audio is even more difficult for longer texts, indicating that the length of the message may need to be considered when designing interfaces for reviewing dictated text. Based on the analysis of the audio recordings, we found that blind participants dictated their messages more slowly than sighted participants, perhaps compensating for system limitations, though this difference was not reflected in the corresponding ASR errors. Similarly, we observed that shorter words were used on average in longer messages, yet more ASR errors were observed.

1 Our blind participants included two legally blind individuals who used a magnifier to read text.
Most importantly, we identified three distinct strategies that participants used to indicate ASR errors in the played-back messages, which could lead to novel interactions for reviewing ASR errors: pointing to a specific word or words, indicating the location of the errors in the message, and counting the overall errors that they spotted.

5.2 Method

To compare blind and sighted users' experiences with speech input and their ability to identify ASR errors with only audio output, we recruited 24 participants and conducted a two-part study that included a semi-structured interview followed by a speech dictation task.

5.2.1 Participants

We recruited 12 blind participants (6 male, 6 female) who were screen reader users and 12 sighted participants (5 male, 7 female) from campus email lists and local organizations. The sample size was in line with typical sample sizes in this community and designed to balance research goals with practical issues of recruitment and burden on the participant community [103]. Blind participants ranged in age from 23 to 67 (M = 49.9, SD = 15.1) and sighted participants were 19 to 31 years old (M = 22.1, SD = 3.9). Blind participants reported being totally blind (N = 6), having some light perception (N = 2), or being legally blind (N = 4).

    ID   Age  Gender  Visual impairment  Age of onset      ID   Age  Gender
    B1   33   F       Total blindness    27                S1   26   F
    B2   40   M       Light perception   35                S2   22   M
    B3   30   F       Legally blind      23                S3   22   M
    B4   65   M       Total blindness    Birth             S4   20   M
    B5   52   F       Total blindness    15                S5   19   F
    B6   59   F       Total blindness    1                 S6   27   M
    B7   63   M       Light perception   40                S7   19   F
    B8   23   F       Total blindness    13                S8   19   F
    B9   49   F       Total blindness    34                S9   21   M
    B10  54   M       Legally blind      6 months          S10  20   F
    B11  67   M       Legally blind      Birth             S11  19   F
    B12  64   M       Legally blind      50                S12  31   F

Table 5.1: Participant characteristics, with "B" denoting blind and "S" denoting sighted participants. All but B10 and S12 were native English speakers; B10 and S12 had lived in the US for 30 and 27 years, respectively.
All but two participants (one blind and one sighted) were native English speakers2. Background information for all participants is shown in Table 5.1; blind participants are denoted "B#" and sighted participants are denoted "S#." Our blind participants were all familiar with synthesized speech, since it serves as the speech output for their screen readers; participants used a screen reader several times a day (N = 11) or several times a week (only B11). Only one participant across both groups (B12) reported some hearing loss3.

2 However, the two non-native English speakers (B10, S12) were not found to be outliers in terms of message length, ASR errors, or missed errors on the speech dictation task, with outliers defined at 1.5 times the interquartile range [107]. Thus, their data are included in the analysis.
3 However, we did not find B12 to be an outlier in terms of message length, ASR errors, or missed errors on the speech dictation task, with outliers defined at 1.5 times the interquartile range. Thus, B12 was also included in the analysis.

5.2.2 Procedure

Study sessions took up to 1.5 hours and were conducted in a quiet room. The whole procedure was video recorded for later analysis of participants' input in the interview and speech dictation task. The session started with a questionnaire to collect demographic information and experience with screen readers. Semi-structured Interview. We then conducted a semi-structured interview (about 30 minutes) on prior experience with synthesized speech, speech input, and ASR errors. For the questions about ASR errors, we defined speech recognition errors as text recorded incorrectly by the device because it misunderstood a word or words that the user said. Specifically, participants responded to questions about:
- frequency of use, usefulness, devices, and applications for synthesized speech output
- frequency of speech rate adjustment and reasoning behind these adjustments
- 
frequency of use, usefulness, devices, and applications for speech input
- maximum length of previously dictated text, and reviewing practices for dictated text
- frequency of encountering and fixing ASR errors
- ASR error importance and how that relates to specific situations
- strategies for identifying and fixing ASR errors

For the two questions regarding the frequency of using speech input or synthesized speech, frequencies were measured on an absolute 7-point scale adopted from Rosen et al. [108] (Never, Once a month, Several times a month, Once a week, Several times a week, Once a day, Several times a day). For example, the absolute scale was used when asking "How often do you use speech input to dictate text?" Four other questions, which were relative to the frequency of using speech input or synthesized speech, employed a relative 6-point scale (Never, Very rarely, Rarely, Occasionally, Very frequently, Always) [109]. For example, a question with the relative scale asked, "How often do you encounter speech recognition errors when you dictate text?"

Figure 5.1: Study setup for the speech dictation task, showing researcher (left) and participant (right) perspectives. The screen was blank across all participants to control for access to visual information.

Speech Dictation Task. Participants then completed a speech dictation task using our custom experimental testbed built for the Apple iPhone 8 and using iOS's built-in ASR4 and synthesized speech5 features. A female voice with a 175 words per minute (WPM) speech rate was used for the synthesized speech. The study setup is shown in Figure 5.1. We employed a free-form text entry task (i.e., participants composed the text for speech input) instead of asking participants to read reference text. The free-form
The free-form 4https://developer.apple.com/documentation/speech 5https://developer.apple.com/documentation/avfoundation/speech_ synthesis 48 text entry task is more realistic than reading reference phrases for the speech input because people usually compose a text rather than reading a reference text when they use speech input. Moreover, the free-form text entry task allows us to recruit blind participants from the general population without restrictions on Braille literacy. If the reference phrases had been given to the blind participants in Braille for this task, the participants would have to be Braille readers who are from around 10% of all people with visual impairments [110]. The task consisted of four practice trials followed by 30 test trials. For each trial the participant composing short text or email messages in response to a series of prompted scenarios, then reviewing the recognized text to identify any ASR errors. The overall task description was as follows: ?In this task, you will be given a series of situations in which you need to compose a text message or email. For each situation, you will listen to a description with a chime sound at the beginning, then dictate a short text message or email with 1-2 sentences in response.? The 30 different prompts for the test trials were presented in random order. The test prompts were selected from a list of short scenarios (?situations?) studied by Vertanen and Kristensson [111] for a freeform text composition task, such as: ?Your housemate has been sick for the last week. You are currently shopping downtown. See if he requires anything.? We asked participants to limit the dictated messages to 1-2 sentences so that they would remember their original input easily when it came time to review for ASR errors. Participants were allowed to make up names for message recipients when desired. 
As shown in Figure 5.1, the screen was blank throughout the task so that neither blind nor sighted participants received visual feedback. After completing the 30 trials with short scenarios, the testbed presented three additional trials (prompts for narrative writing from the New York Times [112]) with open question prompts that were intended to elicit longer descriptive answers, such as: "You are filling out an online questionnaire about customer reviews of products. Describe how much you trust online reviews and why." In these three trials, participants were given no length limit for their dictated messages. At the start of each trial, the testbed played a chime sound followed by an audio recording of the prompt description. We chose to use pre-recorded audio spoken by a native English speaker for all the prompts to control for any potential effect of synthesized speech in this description on participants' ability to later identify ASR errors through synthesized speech. Participants were allowed to repeat the prompt multiple times to ensure that they understood it and were ready to dictate a response. Participants then double-tapped on the iPhone screen, dictated their message, and double-tapped again to end the dictation. Sound effects played to provide feedback when the system started and stopped recording (the on/off sounds used for Siri on iOS), to help participants speak only while the ASR was activated. Immediately following dictation, the text recognized by the ASR system was played using synthesized speech. After listening once to the synthesized speech output, participants were asked to verbally report any differences between the original speech they had dictated and the text they heard via the synthesized speech output. Participants also reported how certain they were that they had identified all errors in the message using a 4-point scale (very certain, certain, uncertain, very uncertain).
Participants were allowed to redo the dictation for a trial once and only once if they felt they had made a mistake while speaking (e.g., stumbling over words). Of the 720 trials across all participants, S5, B1, and B12 opted to re-dictate their input in 1, 1, and 6 trials with short scenarios, respectively. Only one of these instances (a trial of B12) occurred after the synthesized speech output had played. An additional three trials with short scenarios for B5 were redone because the participant's accidental input caused the system to prematurely end the trial. B12 also re-dictated the input in one trial with open questions while speaking in the first attempt.6

Post-study questions. At the end of the study, we asked questions about the overall experience of reviewing the dictated message during the task. Specifically, participants reported their agreement with the following statements, using a 5-point scale (strongly agree, agree, neither agree nor disagree, disagree, strongly disagree) from Rosen et al. [108]:

• "The system correctly recognized almost everything I said."
• "It was difficult to identify errors made by the speech recognition system."

Open-ended questions were used to obtain a rationale for their responses as well as feedback on strategies and challenges in identifying errors.

5.2.3 Measures and Data Analysis

The responses from the participants in the semi-structured interview and speech dictation task were transcribed from the videos of the user study and used to analyze the results. We logged the timing of speech input and the ASR results from the experimental testbed.

6 The trials where participants re-dictated their input were not found to be outliers (defined as values beyond 1.5 times the interquartile range), so these trials were included in the statistical analysis.

5.2.3.1 Semi-structured Interview

We qualitatively coded the responses to open questions using a thematic coding method to identify the major themes in the participants'
responses [113]. Two researchers collaborated to code the interviews. The first researcher transcribed all of the interview data. The second researcher prepared the initial codebook based on the transcription and coded the answers. The first researcher then conducted a peer review of the codebook and of randomly selected transcripts from two blind and two sighted participants. There were 10 disagreements out of 72 coded answers. The two researchers resolved the disagreements through consensus and updated the codebook, which contained 132 codes for 16 open questions, with two new codes about why participants were using synthesized speech and the method of reviewing text from ASR. Answers for all Likert-scale questions were analyzed with the Mann-Whitney U test, a non-parametric test that allows us to compare ordinal data from the two participant groups.

5.2.3.2 Speech Dictation Task

The speech dictation task used a mixed factorial design with a within-subjects factor of Prompt (short scenarios vs. open questions) and a between-subjects factor of Vision (blind vs. sighted). To analyze ASR errors from the speech dictation task, we manually transcribed the participants' original speech input and the verbal reports of the ASR errors from the video recordings. The differences between the manually transcribed speech input and the ASR results recorded by the experimental system were considered to be ASR errors. We defined an error instance as an ASR error involving a word or a group of consecutive words. Error instances were coded based on their identification by participants as one of three levels of correctness:

• Identified: a participant mentioned the specific incorrect word(s). For example, if the original input is "Can I have the vendor's price lists?", the ASR result is "Can I have the vendor's price list?" (i.e., missing an "s"), and the participant says, "I said lists instead of list.", then "lists" is an identified error instance.
Error instances with multiple consecutive words were considered identified if at least one of those words was identified exactly, based on the assumption that users would be able to locate that error instance if they wanted to edit it.

• Noticed: an error instance was noticed by the participant but was described with some ambiguity. If the participant says, "I think there was an error in there." in the above example, "lists" is a noticed error instance.

• Missed: a participant did not notice any of the misrecognized words of an error instance.

Based on the coded errors, we computed precision (when a participant thinks they identified an error, how often it is actually an error) and recall (the proportion of error instances that participants were able to identify). We measured the WER of the ASR results to see how frequently errors occurred. The length of messages was also measured as the number of characters and the number of words in the original speech input. WER, recall, and precision did not violate the normality assumption (Shapiro-Wilk test, p > .05) and were analyzed using Welch's t-tests (α = 0.05). Message length violated the normality assumption for sighted participants (Shapiro-Wilk test, p = .023), so we used a Mann-Whitney U test, a non-parametric alternative to the t-test, for this measure.

We also examined, from the transcribed data, how participants reported the ASR errors during the speech dictation task. The strategies for reporting errors are potentially related to how people identify and remember ASR errors while reviewing an ASR result. We found three distinct strategies that participants employed to report the ASR errors in the short scenario trials:

• Finding a specific word(s): an error instance was pointed out with the specific incorrect word(s). For example, a participant reported errors by saying "I think there was one error where it missed the word 'the'" or "last word it said 'think' instead of 'thinking'."
• Indicating the location: an error instance was indicated by its location in the text. For example, a participant said "I think the last part is messed up [...]"

• Counting: a participant counted the errors in the ASR result (e.g., "I heard two errors.").

The strategies were not used exclusively; participants used one or more of these methods in a trial. A total of 274 error instances from 2 blind and 2 sighted participants (randomly selected) were independently coded by two researchers for interrater validation. There was substantial agreement on the level of correctness (Cohen's kappa7 = 0.75) and almost perfect agreement on the strategy of reporting errors (Cohen's kappa = 0.83) [115]. After the validation process, one of the two researchers coded the error instances from all participants.

7 Using cohen.kappa from the R package "psych" [114].

5.3 Results

5.3.1 Insights from Semi-Structured Interview

The main themes from the interview included experience with synthesized speech and speech input as well as strategies for detecting ASR errors.

5.3.1.1 Experience with Synthesized Speech

While 11 out of 12 blind participants reported using their screen readers several times a day, when asked about the frequency of use for synthesized speech only 9 participants reported several times a day. We suspect that the other 3 participants might not have associated the term "synthesized speech" with their screen reader voice when answering this question. Still, blind participants reported using synthesized speech more frequently than sighted participants (U = 17.5, p < .001; r = 0.60); only two of the 12 sighted participants used synthesized speech on a daily basis (Figure 5.2). Participants in both groups reported using synthesized speech with a range of devices, such as a computer, smartphone, tablet, watch, TV, or smart speaker (e.g., Amazon Echo, Google Home).
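Comparisons of ordinal Likert responses in this chapter (e.g., the U = 17.5 result above) use the Mann-Whitney U test described in Section 5.2.3.1. As a minimal, pure-Python sketch of the rank-sum arithmetic behind the reported U values, with hypothetical 5-point frequency ratings (the `blind` and `sighted` lists below are illustrative, not the study's data):

```python
# Minimal sketch of the Mann-Whitney U statistic used for the Likert
# comparisons in this chapter. Ratings below are hypothetical
# 5-point frequencies (1 = never ... 5 = several times a day).

def rank_with_ties(values):
    """Assign 1-based average ranks to a list of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return the conventional (smaller) U for the two groups."""
    combined = list(group_a) + list(group_b)
    ranks = rank_with_ties(combined)
    r_a = sum(ranks[: len(group_a)])      # rank sum of group_a
    n_a, n_b = len(group_a), len(group_b)
    u_a = r_a - n_a * (n_a + 1) / 2.0
    return min(u_a, n_a * n_b - u_a)

blind = [5, 5, 5, 5, 4, 5]               # hypothetical ratings
sighted = [2, 3, 1, 2, 3, 2]
print(mann_whitney_u(blind, sighted))    # prints: 0.0
```

In practice such tests are run with a statistics package (e.g., scipy.stats.mannwhitneyu); the sketch only illustrates the arithmetic, not the p-value computation.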
However, while smartphones were the most popular device for both groups, only one sighted participant used synthesized speech on a computer, versus 9 of the 12 blind participants. Blind participants also primarily used synthesized speech with screen readers (N = 12), whereas sighted participants used it mostly with conversational interfaces such as asking Siri a question (N = 6) and calling (N = 3).

Figure 5.2: Reported frequency of using synthesized speech (N = 24).

Unsurprisingly, as indicated by prior studies (e.g., [97, 98]), blind participants preferred faster speech rates than sighted participants. More than half of the blind participants (N = 7) preferred a speech rate setting of 51-100 (around 250-780 WPM [97]) on iOS, which is faster than the default speech rate of 50; the rest preferred the default (N = 4) or a slightly slower speech rate (N = 1). Blind participants who used faster speech than the default rate were accustomed to listening to fast synthesized speech. Some participants mentioned balancing comprehensibility and speed. When asked about the speech rate, B3 said "[...] I think mine is set to something like 57% and basically I can understand everything. If it's faster than that, I may miss some things that it says because it may sound jumbled. If it's slower than that, it may be aggravating [...]" On the other hand, sighted participants were not concerned with the speech rate, saying they did not have any preferred speech rate (N = 7) or that they preferred the default (N = 5). Nine out of 12 blind participants had experience adjusting the speed of synthesized speech, while none of the sighted participants did. Only one of the blind participants, B7, reported doing so frequently, using a fast speech rate for standard listening but slowing it down for books or articles.
Other blind participants adjusted the speech rate occasionally (N = 4) or very rarely (N = 4) for various reasons: reading certain words or content carefully (e.g., email, books, addresses), letting other people use their device to get help or share content, getting used to a new device, and just for variety's sake. B2 said, "[...] If I'm working on someone else's device I would have to adjust their rate to match what my rate is [...] If I'm teaching, I would have to adjust it, so another person could understand because it may be too fast for them [...]"

Figure 5.3: Reported frequency of using speech input for dictation and voice commands (N = 24).

5.3.1.2 Experience with Speech Input

Blind participants also used speech input more frequently than sighted participants (U = 26, p = .006; r = 0.49), as shown in Figure 5.3 (and confirming [13]). Across both groups, participants most commonly used speech input on a smartphone compared to other devices. In terms of specific tasks, blind participants regularly used speech input for writing text in various applications (N = 7), such as text messages, emails, and online forms, while only a few sighted participants used speech input for writing text messages (N = 4). Blind participants who composed text with speech input found it more comfortable than typing on a keyboard. B2 said, "[...] Probably my main reason I mean is really just the convenience of it (speech input) so I don't have to really type anything out unless I have to more so the quickness of it."

Figure 5.4: Perceived frequency of encountering ASR errors when dictating text (N = 24). Participants in "No experience" had not entered text with speech input; participants in "Never" had entered text with speech input but never encountered any ASR error.
The majority of both blind (N = 9) and sighted (N = 9) participants used conversational interfaces such as calling, asking Siri questions, opening apps, and setting timers. Regardless of whether they regularly used speech input for dictating text, to understand differences in how speech input is being used, we asked participants to describe the length of the longest text that they had dictated. Of the ten blind participants who had experience dictating text, eight had entered text longer than two sentences and four of those eight had dictated several paragraphs at a time. In contrast, only one sighted participant had dictated an entire paragraph, whereas the remaining eleven reported dictating at most 1-2 sentences.

5.3.1.3 Experience with Detecting ASR Errors

As seen in Figure 5.4, the majority of participants in both groups felt that they encountered ASR errors at least occasionally when dictating text; there was no significant difference between the two groups on this measure (U = 73, p = .976). When participants were asked an open-ended question about how concerned they were about ASR errors, the majority of blind participants expressed deep concerns (N = 9) versus only some of the sighted participants (N = 5). In particular, B1 said, "I care about them a lot because I don't want people to think that I'm stupid and I want them to understand what I'm talking about, what I'm trying to say to them.", highlighting a previously studied misconception about the relation between spelling errors and cognitive abilities such as intelligence and logical ability [95, 162]. B10, one of the two blind participants who cared moderately, said "To some extent. I wouldn't say I care extremely or I don't care just as much as I could have it correct." B3, the only blind participant who cared a little, said "... [I care] a little because if she can pick up 96% of what I'm saying, I'm happy with that."
None of the blind participants and four of the sighted participants reported not being concerned about ASR errors. These participants did not necessarily feel that ASR was accurate. For example, S12 said, "I mean I think it's a frustration but it's not a big deal. If it's an informal text it's fine [...] I wouldn't use it [speech input] to write something that's a little more important because it's not as reliable." As illustrated by S12's comment, the importance of ASR errors also varied depending on the situation. To explore such use cases, as a follow-up question we asked participants if there were situations in which they were more concerned about ASR errors than others. When necessary, we further clarified this question by providing situation themes such as: specific tasks, certain contents, communicating with different people, and being more rushed. Blind participants reported paying more attention when sending a message to someone in a professional relationship such as a work colleague or client (N = 5) or in a rushed situation, to avoid wasting time fixing ASR errors (N = 3). Blind participants also focused on punctuation marks, certain words likely to be misrecognized by the speech recognizer (e.g., addresses, proper nouns, numbers), and content that may be hard to understand with incorrect speech recognition. B3 said "[...] if you don't put a period, of course, it's one run-on sentence so again I guess that's user error because if I say period or comma it'll give the space [...]" B7 said "[...] I need to speak a person's name, or a location that is something the speech recognition software is very unlikely to recognize and it's essential that the name or location be accurate."

Figure 5.5: Frequency with which participants reported reviewing and editing text after dictation (N = 24).
On the other hand, sighted participants said they were most concerned about ASR errors when sending emails to multiple people and when performing a voice search (N = 6). Some of the sighted participants (N = 4) mentioned rushed situations where they have limited time to review and fix ASR errors. For example, S6 said, "If I'm more relaxed I don't really care but if I'm rushed and I need to like articulate a text message then I'm going to take the time to actually type it out [...]" Like blind participants, sighted participants also mentioned concerns about ASR errors when sending a message to a person in a professional relationship compared to family or friends (N = 6).

The frequency of reviewing dictated text was not significantly different between blind and sighted participants (U = 49, p = .168), as shown in Figure 5.5. Unsurprisingly, blind participants were more likely to review dictated text via audio (synthesized speech output). Of the blind participants who reported having reviewed their dictated text (N = 10), the majority had used primarily audio (N = 8) and only two (B10, B11) had used audio plus a magnifier. Of the sighted participants, only one (S9) had used audio to review dictated text, and did so in conjunction with visual output by listening to the dictated text first, then visually checking if it sounded like there were ASR errors. The remaining sighted participants reported having reviewed dictated text only visually (N = 9) or had no experience with speech input for text entry at all (N = 2). Though the study in Chapter 4 showed that the accuracy of identifying ASR errors by audio playback is only around 50%, when asked how difficult it is to identify ASR errors, participants were not aware of such a challenge. All participants who had experience with reviewing dictated text thought that identifying ASR errors is not challenging.
For example, B2 said "not challenging at all", S2 "not challenging", S9 "not that hard", and B4 "not really challenging." Exceptionally, B11 pointed out that some ASR errors are not easy to detect because they sound similar to the original input: "[...] you can easily hear an error, but you may not see it, you might not know it's an error. In other words, 'to' and it might put two 'o's instead of one or something." Perhaps the rest of the participants did not realize the challenge of identifying ASR errors with synthesized speech due to difficulty in validating what they heard (for the blind participants) or due to limited experience with audio review (for the sighted participants).

Figure 5.6: Recall and precision for the blind and sighted participants in trials with short scenarios (SS) and open questions (OQ). The trials with open questions had longer messages with higher error rates.

5.3.2 Results from Speech Dictation Task

We report on WER and length of messages, and analyze participants' performance in identifying ASR errors based on precision and recall. Our primary analysis compares blind and sighted participants in the short scenario (SS) trials. As a secondary analysis, we report on the open question (OQ) trials, comparing them to the SS trials. Given that we purposely chose to focus on the SS trials, did not counterbalance the SS and OQ trials, and have many more SS than OQ trials, this analysis should be considered exploratory: useful for informing future research directions but not meant to be conclusive. We further analyzed the characteristics of the participants' speech input (i.e., speech rate and length of words) and of the error instances (i.e., types of errors and the strategy of reporting errors). This analysis provides empirical findings on the patterns of entering text using speech input and identifying errors.
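As a minimal sketch of the recall and precision measures defined in Section 5.2.3.2, the following treats each trial as a set of true ASR error instances and a set of instances the participant reported; a reported instance that matches a true error counts toward both measures. All data below are hypothetical, for illustration only:

```python
# Sketch of the recall/precision measures defined in Section 5.2.3.2,
# computed over hypothetical coded trials.

def recall_precision(trials):
    true_errors = sum(len(t["errors"]) for t in trials)
    reported = sum(len(t["reported"]) for t in trials)
    hits = sum(len(set(t["errors"]) & set(t["reported"])) for t in trials)
    recall = hits / true_errors if true_errors else 0.0      # errors found
    precision = hits / reported if reported else 0.0         # reports correct
    return recall, precision

trials = [  # hypothetical coded data
    {"errors": {"lists"}, "reported": {"lists"}},            # identified
    {"errors": {"proto type", "Cara"}, "reported": {"Cara", "think"}},
    {"errors": {"OU"}, "reported": set()},                   # missed
]
r, p = recall_precision(trials)
print(round(r, 2), round(p, 2))  # prints: 0.5 0.67
```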
The hypotheses for this task are: (i) blind participants can identify ASR errors through audio more accurately than sighted participants; (ii) ASR error identification is harder with longer speech input.

5.3.2.1 Differences in Identifying ASR Errors

Figure 5.6 shows the average recall and precision of identifying ASR errors.

Short scenario trials. While prior studies have shown that blind users comprehend synthesized speech better than sighted users [97, 98], this did not translate to a significantly improved ability to identify ASR errors through synthesized speech. Recall, the proportion of error instances correctly identified, was 0.42 (SD = 0.13) for blind users and 0.38 (SD = 0.16) for sighted users. This difference was not statistically significant (t21 = -0.64, p = .529). Precision, the proportion of ASR errors identified by the participants that were actually errors (not mistakes on the participant's part), was also not significantly different across the two groups: on average 0.72 (SD = 0.17) for blind participants and 0.56 (SD = 0.20) for sighted participants (t17 = -1.54, p = .140).

Open question trials. Compared to the short scenario trials above, identifying ASR errors was more challenging in the three open question trials. The average recall of all 24 participants was 0.25 (SD = 0.24), which was significantly lower than in SS trials at 0.40 (SD = 0.15) (W = 42, p = .001; r = 0.45). Specifically, recall in OQ trials was 0.25 (SD = 0.24) and 0.26 (SD = 0.21) for blind and sighted participants, respectively. The average precision in open question trials was 0.50 (SD = 0.40) for blind participants and 0.69 (SD = 0.34) for sighted participants. The average precision of all 24 participants was 0.59 (SD = 0.37) in OQ trials and 0.64 (SD = 0.20) in SS trials. There was no significant difference in precision between the two types of trials (W = 91.5, p = .627).
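The WER values reported in this chapter follow the standard definition: word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference transcript (the participant's original speech input). A minimal sketch, reusing the "price lists" example from Section 5.2.3.2:

```python
# Word error rate as used in this chapter: word-level edit distance
# divided by the number of words in the reference transcript.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of seven -> WER of about 0.14.
print(wer("can I have the vendor's price lists",
          "can I have the vendor's price list"))
```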
The most common strategy was finding a specific word(s), which was used by 12 blind participants in 156 trials and 12 sighted participants in 137 trials. Some participants counted the errors when they identified ASR errors: eight blind and eight sighted participants reported errors by counting them in 36 and 34 trials, respectively. Nine blind and eight sighted participants indicated the location of errors in 29 and 18 trials, respectively.

Figure 5.7: The strategy used to report different types of ASR errors by the blind and sighted participants. There is no strategy in a cell if no error occurred or the participant missed all errors.

Figure 5.7 shows that some participants (B6, S4, S5, S7, S8) tended to use the same strategies across different types of errors. The length of the message and the number of errors in the ASR results may have influenced the strategies. When participants counted the ASR errors or indicated the location of errors, the messages were 38.2 (SD = 24.3) and 40.6 (SD = 27.7) words long on average, respectively. The messages were only 30.4 (SD = 19.2) words long on average in trials where participants pointed out the specific word to report ASR errors. In trials where participants found a specific word, the ASR results had 1.75 (SD = 1.5) error instances on average, while there were 2.4 (SD = 1.9) and 2.4 (SD = 2.0) error instances on average in trials where participants counted errors or indicated their location.

Figure 5.8: WER and length of dictated messages for the blind and sighted participants in trials with short scenarios (SS) and open questions (OQ). Participants dictated longer messages in OQ trials than SS trials. There was no significant difference in WER between sighted and blind participants.

5.3.3 Characteristics of Dictated Messages

There are many characteristics of the dictated messages that could relate to the number of ASR errors that participants were able to identify.
To better contextualize our findings, we report differences in word error rate, message length, speech rate, and word length across the recordings of blind and sighted participants as well as across short scenario and open question trials.

Word Error Rate and Length of Messages. Overall, no significant differences were found in WER or message length between the two user groups. Figure 5.8 shows the average WER and length of messages.

Short scenario trials (SS). In the 30 SS trials, the average WER of blind and sighted participants' speech input was 0.04 (SD = 0.02) and 0.04 (SD = 0.02), respectively, which is similar to the WER of state-of-the-art ASR engines [7]. We asked participants to keep their dictated messages to 1-2 sentences in length. Blind and sighted participants' dictated messages were 129.3 (SD = 56.8) and 98.5 (SD = 29.5) characters, or 25.9 (SD = 11.4) and 19.9 (SD = 6.1) words, respectively; the medians were 121.3 characters (24.7 words) for blind participants and 95.1 characters (19.1 words) for sighted participants. This difference was not statistically significant (calculated in characters; the mean ranks of blind and sighted participants were 14.4 and 10.6, respectively; U = 49, Z = 1.33, p = .198).

Figure 5.9: Speech rate and length of words for the blind and sighted participants in trials with short scenarios (SS) and open questions (OQ). Blind participants spoke slower than sighted participants. The average length of words was shorter in OQ trials than SS trials.

To examine whether the characteristics of the ASR results impacted the accuracy of identifying errors, we compared the trials with and without missed errors in terms of message length and the number of errors in a trial. The average number of words was 32.7 (SD = 20.4) in trials with missed errors and 24.8 (SD = 15.0) in trials without missed errors. The average number of errors was 3.6 (SD = 2.2) in trials with missed errors and 2.2 (SD = 0.8) in trials without missed errors.
These results suggest that the length of the message and the number of errors in the ASR result are potential factors affecting accuracy; that is, users would be able to identify ASR errors better in a shorter message and when there are fewer ASR errors.

Comparing SS and OQ trials. As expected given the task instructions, the dictated messages were longer in the OQ trials than the SS trials, at on average 292.1 (SD = 108.1) and 268.1 (SD = 134.4) characters, or 56.2 (SD = 21.6) and 51.3 (SD = 25.5) words, for blind and sighted participants, respectively. The average length of messages across all 24 participants in OQ trials was 280.1 (SD = 120.0) characters (53 words, SD = 23.5), longer than in SS trials at 113.9 (SD = 46.9) characters (22.9 words, SD = 9.4); this difference between SS and OQ trials was significant (calculated in characters; W = 0, Z = -4.29, p < .001; r = 0.62). The average WER of all 24 participants was also significantly higher in the OQ trials at 0.06 (SD = 0.04) than in the SS trials at 0.04 (SD = 0.02) (t23 = -2.63, p = .015; d = 0.60).

Speech Rate and Length of Words. We analyzed the original speech input of the short scenario trials from the blind and sighted participants in terms of speech rate and word length to examine whether experience with speech input influences the speech rate or the complexity of words in the speech input. Figure 5.9 shows the speech rate and length of words. In the SS trials, blind participants spoke slower than sighted participants, at 94.6 (SD = 22.9) and 131.3 (SD = 17.2) WPM, respectively (t20 = 4.42, p < .001, d = 1.81). However, we did not observe this to be reflected in the WER of the ASR: as shown in the previous section, there was no significant difference between blind and sighted participants. In the comparison between SS and OQ trials, there was no significant difference in speech rate (t45 = 0.72, p = .476).
Participants spoke at 113.0 (SD = 27.3) WPM in SS trials and 107.0 (SD = 30.5) WPM in OQ trials on average. The average length of words in the speech input was 4.2 characters (SD = 0.2) for blind participants and 4.3 characters (SD = 0.2) for sighted participants in SS trials, with no significant difference (t21 = 0.22, p = .832). On the other hand, the average length of words in SS trials across all participants, at 4.3 (SD = 0.2) characters, was longer than in OQ trials, at 4.0 (SD = 0.1) characters (t40 = 5.06, p < .001, d = 1.46). Considering that the OQ trials had higher WER than the SS trials, speaking shorter words did not have a positive impact on reducing ASR errors.

5.3.3.1 Error Analysis

In the SS trials, there were 340 error instances in total for blind participants and 236 for sighted participants. Participants in both groups missed more than half of the ASR error instances: missed error instances represented 52.1% (SD = 11.1) and 52.1% (SD = 12.2) of all error instances on average for the blind and sighted participant groups, respectively. A further 42.3% (SD = 12.7) of ASR error instances for blind participants and 38.5% (SD = 16.5) for sighted participants were exactly identified. Finally, only a small portion of errors, 5.6% (SD = 4.7) for blind participants and 9.4% (SD = 12.5) for sighted participants, were merely noticed. To further understand error identification challenges, we assessed what types of ASR errors were missed in SS and OQ trials. As expected based on past work [13], participants missed ASR errors when the errors sounded like the original words, as shown in Table 5.2. However, not all missed errors related to similar-sounding words, as shown in Table 5.3. In general, the accuracy of identifying error instances that sounded similar to the original words was lower, at 36.8%, than the accuracy of identifying error instances that did not sound like the original words, at 44.8%.
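The per-type identification rates in Tables 5.2 and 5.3 are each the number of exactly identified instances of a type divided by the number of occurrences of that type. A minimal sketch of this tally, over hypothetical coded instances (the `coded` list below is illustrative, not the study's data):

```python
# Sketch of the per-type identification rates reported in Tables 5.2
# and 5.3: exactly identified instances / occurrences, per error type.
from collections import defaultdict

def rates_by_type(instances):
    """instances: list of (error_type, was_identified) pairs."""
    counts = defaultdict(lambda: [0, 0])   # type -> [identified, total]
    for err_type, identified in instances:
        counts[err_type][1] += 1
        if identified:
            counts[err_type][0] += 1
    return {t: (hit, total, hit / total) for t, (hit, total) in counts.items()}

coded = [("homophones", True), ("homophones", False),
         ("missing", True), ("missing", True), ("inserted", False)]
for t, (hit, total, rate) in sorted(rates_by_type(coded).items()):
    print(f"{t}: {hit}/{total} = {rate:.1%}")
```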
Table 5.2: Definition and number of error instances for the types where error instances sounded like the original words. The Identified column gives the proportion of exactly identified errors (the number of exactly identified error instances divided by the number of all error instances of that type).

Type         Occurred  Identified       Description and example
Pronouns     54        64.8% (35/54)    Pronouns were recognized as other words with similar sounds (e.g., "Carol" and "Cara").
Spacing      15        33.3% (5/15)     The recognized words were incorrect because of spacing (e.g., "prototype" and "proto type").
Homophones   11        36.3% (4/11)     The recognized words were homophones of the original words (e.g., "owe you" and "OU").
Apostrophes  5         0.0% (0/5)       The recognized words were incorrect because of an apostrophe (e.g., "doctor's" and "doctors").
Similar      135       27.4% (37/135)   The recognized words have similar sounds to the original words (e.g., "I'll" and "I will").
Total        220       36.8% (81/220)

Table 5.3: Definition and number of error instances for the types where error instances did not sound like the original words. The Identified column gives the proportion of exactly identified errors.

Type         Occurred  Identified       Description and example
Missing      46        58.7% (27/46)    The original word(s) was missing from the recognized text.
Inserted     51        17.6% (9/51)     Word(s) were inserted into the recognized text though not spoken by the participant (e.g., recognized filler words).
Formatting   4         50.0% (2/4)      The recognized text has a different format (e.g., "2 o'clock PM" and "2:00 PM").
Others       258       47.7% (123/258)  The recognized words sound different from the original words; we did not observe a common pattern among these errors (e.g., "we are" and "of years", "I think" and "Of think").
Total        359       44.8% (161/359)

Though the sound of error instances of the "spacing", "homophones", and "apostrophes" types is almost the same as that of the original words, participants could identify a few of them. For example, B6 guessed the misrecognition of "too" as "to" in a trial, saying "[...] I think it might've said the wrong version of too." Some participants picked up the small difference in synthesized speech caused by a space between words. For example, S10 distinguished "prototype" and "proto type", saying "It just said 'prototype' like 'prawto type' do you guys care about how it says words? That's the difference." Participants identified the errors better when words were missing from the recognized text than when additional words were inserted into it: the proportion of identified error instances was higher than 50% for the "missing" type but only 17.6% for the "inserted" type.

Error instances of the types that sound almost the same as the original words (i.e., "spacing", "homophones", "apostrophes") are hard to identify through audio alone. To assess the ability to identify errors that can be distinguished through audio, we measured the precision and recall of SS trials after excluding the error instances of the "spacing", "homophones", and "apostrophes" types. Still, there was no significant difference in precision or recall between blind and sighted participants (t21 = -2.01, p = .056; t21 = -0.42, p = .677).

5.3.3.2 Subjective Certainty

For each trial, participants were asked how certain they were that they had identified all ASR errors in their dictated text, using a 4-point scale (very certain, certain, uncertain, very uncertain). In the short scenario trials, blind participants were confident, being "very certain" in 247 (68.6%) trials, certain in 104 (28.9%), and uncertain in only nine (2.5%). There was no trial in which blind participants were very uncertain.
Similarly, sighted participants were very certain in 237 (65.8%) trials, certain in 110 (30.6%), uncertain in 12 (3.3%), and very uncertain in 1 (0.3%). These numbers might not be surprising given that all but one participant, who had experience with reviewing dictated text, reported that they did not think that identifying ASR errors is challenging (Section 4.2). While participants were confident in more than 96% of trials, the accuracy of identifying ASR errors in those trials was still low in terms of recall (very certain: 0.37; certain: 0.46) and precision (very certain: 0.67; certain: 0.61). Perhaps this could be explained by the fact that some ASR errors were difficult to detect because they sounded like the participants' intended words, as in [13]. Another plausible explanation could be that when interacting with a reliable ASR (with WER around 4% in our study), participants may have been less vigilant and less able to detect ASR errors when they occurred. Prior work, surveyed in [116], indicates that complacency could explain why more reliable automation hurts the identification of system errors.

5.3.3.3 Qualitative Feedback

After completing the ASR dictation task, participants remained positive about the performance of ASR and about their ability to identify ASR errors. When participants were asked whether they agreed that the system correctly recognized their input (5-point scale), 9 blind and 10 sighted participants agreed or strongly agreed; there was no significant difference between the two participant groups (U = 76, p = .914). Participants also disagreed when asked whether it was difficult to identify ASR errors: eight blind and eight sighted participants disagreed or strongly disagreed with this statement. Again, there was no significant difference between the two groups (U = 90, p = .375).
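As a minimal illustration of the nonparametric test used for these Likert comparisons, the sketch below forms the Mann-Whitney U statistic by counting, over all cross-group pairs, how often one group's rating exceeds the other's (ties count 0.5). The ratings here are hypothetical, not the data collected in the study.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a vs. group_b: number of cross-group pairs
    where a > b, counting ties as 0.5 (suitable for ordinal Likert data)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in group_a for b in group_b)

# Hypothetical 5-point Likert ratings (1 = strongly disagree, 5 = strongly agree);
# illustrative values only, not the ratings collected in the study.
blind = [5, 4, 4, 3, 5, 4, 2, 4]
sighted = [4, 4, 5, 3, 4, 5, 4, 4]

u_a = mann_whitney_u(blind, sighted)
u_b = mann_whitney_u(sighted, blind)
# The two statistics always sum to len(blind) * len(sighted).
print(u_a, u_b)
```

In practice a library routine such as scipy.stats.mannwhitneyu would also supply the p-value; the sketch only shows how the U statistic itself is formed.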
When asked about any other difficulties they had during the task, seven blind participants reported no difficulty at all, while the remaining five blind participants mentioned challenges in remembering ASR errors in a long text, checking punctuation marks, and distinguishing words with similar sounds. For example, B3 said: "I knew there was a mistake in the beginning and the end but anything in the middle was fuzzy because these were like I said random tasks." In contrast, 11 of the 12 sighted participants said they had difficulties, including remembering ASR errors in a long text, imperfect pronunciation of synthesized speech, and the fast rate of synthesized speech. S12 said: "If there are a couple of little errors in the larger text then you kind of lose track of them. [...] another is, is it me who's like am I creating and I saying it incorrectly or is the system picking it up incorrectly?"

5.4 Discussion

The semi-structured interview showed differences between blind and sighted participants with respect to their experience with speech input and error identification. In the speech dictation task, blind participants spoke more slowly than sighted participants when using speech input. We also found that the length of the speech input impacts the accuracy of error identification. Further analysis of the errors characterized the patterns of identifying ASR errors. The empirical findings from the user study provide several insights for future research.

5.4.1 Implications

Need for accessible ASR error reviewing through audio-only interactions. Our findings reinforce the importance of improving audio-only text entry for blind users, confirming past studies showing that blind users are more likely to use speech dictation than sighted users [13, 60]. However, when it comes to reviewing their dictated text, our interview findings show that sighted participants used the visual output, which is available to blind users only through text-to-speech audio.
Perhaps this explains why blind participants were more concerned about ASR errors than sighted participants, given the difficulty of reviewing ASR results through audio. For both groups, context related to their concerns about ASR errors (i.e., the kinds of tasks, the content, the recipient of the dictated message, rushed or relaxed situations), suggesting that in some cases users may be willing to use a more time-consuming but accurate reviewing process rather than simply hearing back their dictated message.

Mismatch between ability and perception of challenges in finding ASR errors. Neither participant group felt it was challenging to identify ASR errors by just listening to the dictated message. However, when asked to perform this task, they missed more than half of the ASR errors. This contradiction suggests that users may be making more errors in their dictated text than they are aware of, motivating future work in assessing real-world error rates in dictated and reviewed messages. Future work is also needed to develop an interface that enables blind users to check the final text after revision efficiently, rather than going through the text letter by letter.

Higher chance of missing errors with longer text. Comparing the results from the trials with short scenarios to the trials with open questions also showed that identifying ASR errors is more difficult with longer input. With longer input and higher WER in the trials with open questions, participants had to identify more errors in the long text than in the short text. This would have increased the mental load of the task, requiring participants to remember more ASR errors. Since blind participants were more likely to have experience dictating longer passages of text than sighted participants, this challenge may unduly affect blind users.
It will be important to consider whether mechanisms to support users in reviewing and editing speech dictation via synthesized speech output will need to differ for shorter versus longer passages of text, such as supporting users in reviewing only one sentence at a time.

Little impact of experience with a screen reader on the ability to find ASR errors. Contrary to our hypothesis, no significant differences were found between blind and sighted participants' ability to identify ASR errors through synthesized speech in our speech dictation task. Though our interview showed that blind participants had more experience than sighted participants in reviewing dictated text via audio, only the two blind participants who also used magnifiers had the opportunity to confirm what they heard in the synthesized speech by checking visually. This lack of visual confirmation may have led blind participants to overly trust the ASR results compared to sighted users, who had on average substantial exposure to visual feedback of ASR results. The relatively low WERs seen in the task, though reflective of state-of-the-art automatic speech recognizers [7], may also have made it more difficult to detect a statistically significant difference between the two user groups.

Distinct strategies for reporting ASR errors can lead to novel interactions. We found three distinct strategies for identifying ASR errors by analyzing how participants reported the ASR errors during the speech dictation task. The most common strategy was finding the specific word(s) of the ASR errors. The other two strategies were indicating the location of errors and counting the errors. We found that the average length of messages was shorter and the number of errors smaller in trials where participants found a specific word(s) than in trials where participants counted errors or indicated their location.
The selection of strategy is potentially related to the length of the message and the number of errors in the ASR result. When the text is long or the ASR result includes many errors, participants may have counted the errors or remembered their locations rather than memorizing words, in order to reduce mental load. Future work on designing accessible interfaces for reviewing ASR results should consider that the strategy for identifying ASR errors can be influenced by the length of the message and the number of ASR errors.

Variation of speech input in different contexts. The analysis of speech rate provides empirical evidence that blind users speak more slowly than sighted users when they enter text with speech input. The difference in speech rate may have been caused by blind participants' caution in avoiding ASR errors. A prior study showed that users articulate their speech more carefully when they want to enter text via speech input without errors [55]. In this case, blind participants would compensate for potential limitations of the ASR system by speaking slowly.

5.4.2 Limitations

The speech dictation task in the user study was designed to be realistic to the participants by employing a free-form text entry task. Though we were able to measure the ability to identify ASR errors and characterize the use of speech input, the design also has some limitations.

Limitation of the free-form text entry. For our speech dictation task, we employed a free-form text entry task that allowed participants to compose texts for themselves. Though the study in Chapter 4 evaluated the ability to identify ASR errors using reference phrases, free-form text entry was adopted in this chapter because of the advantages mentioned in Section 5.2.2. However, a user study design with free-form text entry has a few drawbacks compared to using reference texts.
The free-form text entry task can result in ambiguity during error coding by the research team, given that the team had access only to the messages spoken by the participants and not to ground-truth text phrases. For example, some proper nouns were accurately recognized by the ASR engine (e.g., city and product names) but others were ambiguous (e.g., did the user intend to spell the name Steven or Stephen?). In these cases, if the proper noun in the synthesized speech had a correct spelling, we marked it as correct. However, a proper noun (e.g., "Barbara") that was recognized as a common noun (e.g., "barber") was considered an ASR error. The participants dictated 13.6 proper nouns on average throughout the task.

Missing the semantic change in metrics of performance. In this work, we analyzed the performance of an ASR system for text entry through speech only, both in terms of WER and in terms of participants' ability to identify these errors, using metrics such as recall and precision. A limitation of these metrics is that they focus on the number of error instances instead of the degree of change in the meaning of the text. For example, if "want" and "can" are recognized as "wanted" and "can't," the latter usually changes the meaning of the text more significantly than the former. However, WER, recall, and precision cannot reflect such differences in error analysis [117]. Metrics reflecting the semantic change of the original text due to the semantic differences between ASR errors (e.g., the ACE metric by Kafle et al. [118]) would also be useful to examine.

Small sample size. The small number of participants in this study limits the statistical power to detect significant differences with small effect sizes, though 24 participants is a common sample size in the CHI and ASSETS communities.
Therefore, in the analysis of precision and recall, this limitation may have resulted in no statistically significant difference between blind and sighted participants. The small number of participants also makes the statistical analysis sensitive to potential outliers. Considering this limitation, we conducted another statistical analysis of the data from the speech dictation task with outliers excluded. Specifically, there were four outliers (S3, S11, B10, B11) in terms of dictated message length and one outlier (S11) in terms of precision. Removing these outliers did not change any of our results.

5.5 Conclusion

We explored the experience of speech input, synthesized speech, and ASR error identification through a semi-structured interview, and we evaluated the ability to identify ASR errors through a task of entering and reviewing text using speech only. From the semi-structured interview, we found that sighted and blind participants' experiences differ in many aspects, such as the tasks, devices, and frequency of using speech input, as well as the methods employed for reviewing dictated text. Though most participants reported that identifying ASR errors is not a challenging task, participants in both groups identified only around 40% of the ASR errors. This indicates that identifying ASR errors is challenging even for blind users, who may have more experience with speech input and synthesized speech than sighted participants. We also characterized how participants identified ASR errors through the analysis of the speech input, the ASR errors, and the strategies for pointing to ASR errors in the speech dictation task. These findings enable us to better understand and quantify the challenges in identifying ASR errors for both sighted and blind users. Moreover, they reveal the need for further research on improving user interaction for speech-only text input that relies on inherently error-prone systems such as ASR.
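For reference, the performance metrics used throughout this chapter can be made concrete with a short sketch: WER as the word-level edit distance divided by the reference length, and precision/recall of error identification, here simplified to treat reported and actual errors as sets of words. This is an illustrative implementation under those simplifying assumptions, not the analysis code used in the study, and the example strings are hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance (substitutions, insertions,
    deletions) divided by the reference length. Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def precision_recall(reported_errors, actual_errors):
    """Precision/recall of error identification, with errors modeled as word sets."""
    true_positives = len(reported_errors & actual_errors)
    precision = true_positives / len(reported_errors) if reported_errors else 0.0
    recall = true_positives / len(actual_errors) if actual_errors else 0.0
    return precision, recall

# Two substitutions out of six reference words.
print(wer("i will call you at two", "i all call you at to"))
# The participant reported only one of the two actual error words.
print(precision_recall({"all"}, {"all", "to"}))
```

Note that, as discussed in Section 5.4.2, these counts treat all errors alike and cannot distinguish an error that changes the meaning of the text from one that does not.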
Epilogue to Part I

In Part I of this thesis, the challenge of identifying ASR errors was characterized through crowdsourcing and controlled lab studies with blind and sighted participants. The studies investigated the impact of manipulating the synthesized speech (i.e., inserting pauses between words and repeating the synthesized speech), the types of missed ASR errors, the accuracy of identifying ASR errors, and strategies for identifying ASR errors. The controlled lab study also revealed the experience of identifying ASR errors in terms of how much participants cared about the errors, their different attitudes toward reviewing the recognized text in various situations, etc.

The crowdsourcing study examined quantitatively whether identifying ASR errors with synthesized speech alone is challenging. Participants were asked to identify ASR errors after listening to the synthesized speech of a phrase. They were able to identify only around 50% of the errors, showing that identifying ASR errors through audio is challenging. Inserting pauses between words and a slower speech rate improved accuracy; on the other hand, repeating the audio of the synthesized speech did not help users identify ASR errors more accurately.

Next, we conducted a controlled lab study to compare the experience with synthesized speech and the accuracy of identifying ASR errors between blind and sighted participants. The hypothesis of the study was that blind users care more about ASR errors and identify them more accurately, due to more frequent experience with speech input and synthesized speech than sighted users. The results showed that both blind and sighted participants identified only around 40% of the ASR errors, though they thought identifying ASR errors is not a challenging task. On the other hand, blind participants cared more about the ASR errors.
The analysis of the speech dictation task characterized the strategies participants used to identify errors when reviewing ASR results. Part I of this thesis reported on the completed work and answered the following research questions:

- RQ1: How frequently are ASR errors missed? (The user studies in Chapters 4 and 5 measured blind and sighted participants' accuracy in identifying ASR errors through synthetic speech. They showed that both blind and sighted participants missed around 50% of the errors.)
- RQ2: Do different synthetic speech manipulations affect the user's accuracy in identifying ASR errors? (We found from the results of the user studies in Chapter 4 that synthetic speech with a slower speech rate and pauses between words can increase the accuracy of error identification with audio.)
- RQ3: For what tasks do blind and sighted users use ASR? (The study in Chapter 5 showed that blind people use ASR mainly for entering text, while sighted people use it for conversational interfaces or voice commands.)
- RQ4: How different are the experiences with speech dictation and listening between blind and sighted users? (Both blind and sighted participants thought identifying ASR errors with synthetic speech is not challenging, but blind participants cared more about the errors, as shown in Chapter 5.)
- RQ5: Is the accuracy of identifying ASR errors different between blind and sighted users? (The results of the speech dictation task in Chapter 5 showed that both blind and sighted participants missed around 50% of errors, with no statistically significant difference.)
- RQ6: What are blind and sighted users' strategies for pointing to ASR errors? (The user study in Chapter 5 found that they spot a specific word, indicate locations where errors occurred, or count the number of errors.)
Part II: Interacting with Error-Prone Image Recognition

Prologue to Part II

In Part II, this work explores the challenge of identifying errors in camera-based assistive apps with blind users and of reducing errors in teachable object recognizers (TOR) through iterations with blind and sighted users. Identifying and validating the predictions from camera-based assistive apps can be even harder for blind people because the predictions depend on the visual characteristics of the inputs. While prior studies have shown that blind users actively use the cameras in their mobile devices for fun, preserving memories, using social media, and using assistive apps, many of these studies focused on the challenges of blind photography and on developing user interfaces for blind photography. In this work, we characterize the challenges of validating the predictions from camera-based apps and identifying their errors. Furthermore, we investigate the usability issues of TOR with blind and sighted users, considering another perspective on using an image recognition system: personalizing it with a teachable interface.

While training and validating a machine learning model (i.e., machine teaching) has mostly been conducted by experts in machine learning, recent intelligent systems enable end-users to conduct the machine teaching task to personalize the system for their idiosyncratic environments and inputs (e.g., teachable object recognizers [20], personal sound detectors [119]). Though building a teachable interface where end-users personalize an intelligent system through machine teaching is technically possible with few-shot learning or meta-learning approaches, end-users with little knowledge of machine learning may have difficulties in training and validating a machine learning model with their own data samples. In this research, we explore these difficulties from two perspectives: (i) understanding non-experts'
patterns and misconceptions in machine teaching (Chapter 8), and (ii) designing a mobile TOR app that enables blind users to review their teaching strategies to reduce errors (Chapter 9).

Part II consists of three studies that investigate blind users' experience with camera-based assistive apps, explore sighted non-experts' challenges in machine teaching, and develop a mobile TOR app for blind users. The first study consists of a semi-structured interview and an error identification task to explore users' challenges in identifying pre-trained image recognition errors (Chapter 7). In the second study, we recruited 100 sighted participants who are non-experts in machine learning through crowdsourcing; the participants were asked to train and validate a TOR through the web (Chapter 8). The photos and feedback from the participants revealed patterns in non-experts' machine teaching strategies (i.e., how they train an object recognition model, test it, and change their training strategy after observing errors). In the third study, with blind participants, we evaluate a TOR app designed based on the findings of the first study and a prior study on the feasibility of using TOR for blind people [20]. The study includes tasks of training and testing the app and managing the information about objects in the users' datasets (Chapter 9). Part II of this thesis aims to resolve the following research questions:

- RQ7: For what tasks and objects do blind users take photos? (We will examine RQ7 in Chapter 7.)
- RQ8: How do blind users identify image recognition errors? (We will examine RQ8 in Chapter 7.)
- RQ9: How accurate are blind users at identifying object recognition errors? (We will examine RQ9 in Chapter 7.)
- RQ10: What are their strategies for identifying the errors? (We will examine RQ10 in Chapter 7.)
- RQ11: What are non-experts' teaching and debugging strategies for a teachable object recognizer? (We will examine RQ11 in Chapter 8.)
- RQ12: Do teaching strategies evolve through iteration? (We will examine RQ12 in Chapter 8.)
- RQ13: How could descriptors be useful for avoiding errors due to training examples? (We will examine RQ13 in Chapter 9.)
- RQ14: What are blind users' teaching and debugging patterns? (We will examine RQ14 in Chapter 9.)

Chapter 6: Background

This chapter discusses prior work on object recognition, focusing on error identification and assistive technologies for blind people. As the work in Chapters 8 and 9 constitutes machine teaching studies, a research field that investigates people who teach machines, we also present background on machine teaching.

6.1 Image Recognition and Error Identification

Object detection and classification have been actively studied for decades, as they are fundamental and challenging problems in computer vision. Specifically, object detection aims to provide the location and size of object instances in an image (e.g., bounding boxes) [120]. The goal of object classification is to determine whether objects from a set of classes exist in an image [121]. Object detection and classification are employed in a variety of applications, including blind navigation systems (e.g., [122, 123]), object recognizers for blind people (e.g., [124, 125]), and image captioning (e.g., [126]). In this thesis, we use the general term image recognition, which embraces both object detection and classification tasks [127, 128], to indicate general applications related to object detection and classification. Recently, emerging image recognition systems have achieved dramatic improvements with deep learning techniques [129]. However, they are still error-prone due to the high variation of objects, the huge number of object categories, and the limited computing power of mobile/wearable devices [121]. Errors in an image recognition system affect blind users'
experience with it significantly, as most blind users depend on the output from the system due to the difficulty of verifying it [130]. For example, a blind user of social media relies on automatically generated captions to understand photos without sighted help and may be confused by errors in the captions [33]. Therefore, researchers in human-computer interaction and accessibility have emphasized the importance of conveying the confidence of outputs and the degree of reliability of AI-infused systems to users [10, 130].

While the impact of object recognition errors on user experience has not been explored thoroughly, prior studies presented some cases where such errors caused problems. A study on blind users' experience in using social media with images revealed that they overtrust automatically generated captions even when incorrect captions make little sense [33]. While some errors in blind navigation systems are acceptable when blind users are familiar with the surrounding environment, errors are not acceptable when people around the user react with misguided responses [49]. When image recognition is used to help blind users control household objects (e.g., turning a stove on/off, finding an outlet), errors can pose safety threats. Therefore, such tools are required to have robust safety mechanisms, or users are recommended to use other types of assistive tools, such as voice commands, to control household objects rather than computer vision-based tools [131]. Given the significance of error handling in using object recognition systems for blind users, this work explores and characterizes their challenges in identifying and recovering from object recognition errors.

6.2 Image Recognition for Accessibility

Object recognition has been actively used to develop assistive tools for people with disabilities.
For example, body movement recognition has been used to create assistive tools for people with motor impairments, such as rehabilitation systems with automatic body movement guidance [132], automatic symptom diagnosis of motor impairments through motion analysis [133], and gesture recognizers as input methods for controlling robots [134] and computers [135]. Assistive tools for people with cognitive impairments also employ computer vision and object recognition; body motion analysis of images or videos has been used to automatically detect autism-related behavior [136, 137]. Moreover, object recognition and computer vision have been used in assistive technologies for other types of disabilities, such as hearing impairments (e.g., [138, 139]) and visual impairments (e.g., [20, 140, 141, 142]). In this thesis, we focus on blind users' experience with camera-based assistive tools.

As object recognition can enable blind people to have a better sense of the visual world, many products that enable people to read text, distinguish colors, and recognize objects with computer vision are already on the market [124, 125, 143, 144, 145]. Due to the significant impact of the errors in these tools, prior studies have presented guidelines for AI-infused systems with recommendations to enable users to recover from errors [10, 50, 51]. However, they targeted general AI-infused systems, including applications for sighted people, without an in-depth analysis of blind users' interactions with such systems.

A unique aspect of this thesis is that it covers blind users' interactions with both pre-trained camera-based assistive tools and teachable object recognizers (TOR). Since TOR was shown to be useful for blind people [24], researchers have been actively developing user interfaces that help blind people take photos to train an object recognition model [141, 146].
However, as TOR is an emerging technology, it has many issues to resolve, such as developing user interfaces that enable blind users to identify, understand, and recover from errors effectively through machine teaching.

6.3 Machine Teaching and Teachable Interfaces

Machine teaching involves a teacher who knows the decision boundaries and designs an optimal training set for one or more students [4]. In this thesis, the teacher is a human and the student is a classification model being trained to classify images of objects, as shown in Figure 6.1, though the inverse (machines teaching humans to classify images) is also an active area of research [147]. There is rich literature on sequential machine teaching with humans as the teacher, e.g., programming by demonstration for teaching robots to manipulate objects [148, 149]. However, in this review, we focus on prior work that utilizes batch teaching, where examples are given as a set and their order does not matter. Batch teaching is a very common paradigm for many real-world AI-infused systems, e.g., those using face recognition, fraud detection, and speech recognition. This teaching is typically done by experts in the field, and end-users are hardly exposed to the underlying mechanisms that could help explain the systems' limitations. Teachable interfaces1 that fall under this machine teaching paradigm have the potential to help in this direction, as they can enable non-experts to uncover basic machine learning concepts (e.g., [151]).

[Figure 6.1: Characterization of our testbed in the machine teaching problem space [4], where T stands for teacher and S for student. A human T employs pool-based, model-free, angelic, empirical teaching. The testbed has a single recognition model S learning in batch mode, unaware that it is being taught, while considering T as a friend (no adversarial examples). The figure lays out the dimensions of the problem space: human vs. machine teacher and student, one vs. many students, batch vs. sequential learning, teaching signal (synthetic/constructive, hybrid, or pool-based), model-based vs. model-free teaching, student awareness of teaching, angelic vs. adversarial teaching, and theoretical vs. empirical teaching.]

1 A term coined by Patel and Roy (1998) [150], where "the user is a willing participant in the adaptation process and actively provides feedback to the machine to guide its learning."

Moreover, with advances in transfer learning [152, 153], teachable interfaces can spur innovation, as end-users can re-purpose models trained on vast amounts of data for new but related tasks, e.g., personalizing assistive technologies [154]. We look into prior work employing teachable interfaces, a term perhaps not originally used by the authors. Here, we focus on a subset of the interactive machine learning literature where users are called to generate all the training and testing examples for a personalized model. Table 6.1 presents representative examples of prior studies from 2011-2019 on gesture recognition for musicians [155], sign language [156], educational applications [151], personalized sound detectors for people who are deaf/Deaf or hard-of-hearing [119], personal object recognizers for blind people [20], and physical activity classifiers for young athletes [157].

[Table 6.1: Related studies' characteristics juxtaposed with ours, comparing [155], [156], [119], [20], [151], and [157] with this study along five dimensions: People (number of participants: 1/7/21, 10, 12, 8, 30, 5, and 100 in this study), Setting (controlled, real-world, crowd; whether participants were children or had a disability), Input (sensing, audio, image, video), Output (recognition, detection, control), and Analysis (accuracy, behavior, feedback).]

In contrast to this work, prior studies tend to have smaller participant pools and are typically conducted in a controlled setting where the researchers are present. Partially this could be due to the user characteristics of interest: people with disabilities [20, 119], children [151], and students [157]. Another reason could be the challenges of remote data collection, which would require a working prototype [20, 119] or specialized devices from the users [151, 157]. Our teachable object recognition testbed, utilizing the built-in camera of a mobile phone and existing crowdsourcing platforms, allows us to reach a larger participant pool that can be scaled further.

As shown in Table 6.1, the input modality for the teaching set was more often based on sensing [151, 155, 157] and videos [155, 156], with one example for sound [119] and one for photos [20]. For the last two, participants could not assess the quality of their teaching examples: participants who were deaf/Deaf or hard-of-hearing could not hear the sounds they recorded [119], and blind participants could not see the photos they took [20]. In this thesis, we choose images as the input modality for the teaching set. This allows us to tap into a large group of non-experts who can simply use their mobile phones to take photos in a real-world setting. Moreover, by choosing an object classification task, a task accessible to many where they can serve as the oracle, we are given the opportunity to explore how humans teach a high-dimensional decision boundary to machines by feeding them only a few instances. More importantly, this modality allows us to visually inspect the teaching set for common patterns in users' behavior. Similar to most of the prior work in Table 6.1, our analysis is based on observed behaviors and participant feedback. Leveraging prior work in neuroscience, we examine how non-experts'
teaching strategies draw parallels in machine robustness to human robustness, where object recognition involves generalization across size, location, viewpoint, and illumination [158]. While prior work did not include such a fine-grained analysis of the participants' input, it provided insights and anecdotal evidence that guided the design of our studies, such as the need for iterations [151, 155, 157], which may vary not only across participants but also with the underlying algorithm and task [159]. For comparison purposes and for the sake of time, we opted to keep the number of iterations constant at two. Similar to our study, the number of classes was limited (2-5), with an exception of 15 [20], where there were no iterations.

Chapter 7: Understanding Error Identification in Pre-Trained Image Recognition With Blind Users

7.1 Motivation and Introduction

The past few years have yielded vast improvements in image recognition due to advances in machine learning. As computer vision can help blind people access the visual world independently using the camera on their smartphones, many assistive mobile apps (e.g., Seeing AI [124], Envision AI [144]) have been deployed to enable them to read text, recognize objects, understand images, and navigate. However, most image recognition systems are built on benchmark datasets with images collected by sighted people and are thus more likely to err on images from blind users. Prior studies have provided anecdotal evidence indicating that errors in image recognition systems significantly affect blind users' experience. As it is hard for most blind users to verify image recognition results without sighted help, they trust and rely on the output of the recognition system, which may not be error-free. For example, MacLeod et al. [33] showed that blind users may overtrust the output of an automatic caption generator even when it makes little sense.
Image recognition errors can cause even more critical problems when blind people interact with real-world objects through their cameras and computer vision. Jafri et al. [131] have warned that such misrecognitions may pose a safety threat when computer vision applications are used for controlling household objects such as stoves and microwaves. Errors are especially unacceptable when they can adversely affect blind users' interactions with others in a way that may affect how others perceive them. In a recent study, Lee et al. [160] discussed how blind users prefer not to get a prediction of a passerby's gender, as potential errors can lead to an embarrassing situation. This echoes some of our previous findings in Chapter 5, where blind participants worried that errors in their dictation could affect others' perception of their intellect. Similar to Part I, we start our exploration by better understanding and quantifying blind users' ability to identify image recognition errors on their input, an object to be recognized. In this chapter, we focus on blind people's experience with image recognition systems and their errors. We conducted a controlled experiment, which due to COVID-19 had to be simulated in people's homes. The experiment mirrors the controlled lab methods employed in the study in Chapter 5, which included a semi-structured interview and a speech dictation task. Our research questions also mirrored those in the speech study. With the semi-structured interview, we aim to get some context on the use of image recognition by this user group and prior strategies for identifying errors. Then, through an error identification task, we explore how well blind users can identify the errors and what strategies they employ.

7.2 Method

To understand blind people's experience with camera-based assistive tools, we conducted a two-part user study including a semi-structured interview and a task of identifying errors of a general object recognizer.
7.2.1 Participants

We recruited 12 blind participants (6 female, 6 male) from campus email lists and local organizations (Table 7.1). The participants ranged in age from 32 to 70 (M = 54.3, SD = 15.2). The participants reported being totally blind (N = 3), having some light perception (N = 5), or being legally blind (N = 4). All participants used smartphones several times a day. While P1 had an auditory processing disorder and P2 had difficulty hearing high-pitched sounds, no participants had problems communicating or using a screen reader. All participants reported that they take a photo or record a video at least once a month. When asked to report their levels of familiarity with machine learning on a 4-point scale (not familiar at all: have never heard of machine learning; slightly familiar: have heard of it but don't know what it does; somewhat familiar: have a broad understanding of what it is and what it does; extremely familiar: have extensive knowledge of machine learning), two participants selected not familiar at all, eight selected slightly familiar, and two selected somewhat familiar.

ID  | Age | Gender | Level of vision  | Age of onset | Familiarity with ML*
P1  | 39  | Female | Light perception | Birth        | Not familiar at all
P2  | 67  | Male   | Legally blind    | 55           | Slightly familiar
P3  | 62  | Female | Totally blind    | Birth        | Somewhat familiar
P4  | 32  | Male   | Legally blind    | 20           | Slightly familiar
P5  | 66  | Male   | Light perception | 46           | Slightly familiar
P6  | 61  | Male   | Light perception | 41           | Somewhat familiar
P7  | 70  | Male   | Legally blind    | Birth        | Slightly familiar
P8  | 50  | Female | Legally blind    | 45           | Slightly familiar
P9  | 69  | Female | Totally blind    | 55           | Not familiar at all
P10 | 66  | Female | Light perception | Birth        | Slightly familiar
P11 | 33  | Female | Light perception | Birth        | Slightly familiar
P12 | 36  | Male   | Totally blind    | Birth        | Slightly familiar
*ML: Machine learning

Table 7.1: Participants' characteristics.

7.2.2 Procedure

The user study was conducted over two days.
We conducted a semi-structured interview on the first day, with questions regarding demographic information, photo-taking experience, and experience with camera-based assistive tools. On the second day, participants completed a task of identifying errors of a general object recognizer.

Semi-structured interview. The interview lasted one hour and was conducted through an online meeting application, Zoom. The whole interview procedure was video recorded for later analysis. Specifically, participants responded to questions about:

- frequency of using a mobile device, taking photos, reviewing photos, and changing settings of the camera
- purpose of taking a photo, subjects of the photos, applications used to take photos, devices to take photos, and confidence in taking a good photo
- frequency of use, usefulness, and devices for a camera-based assistive tool
- frequency of verifying the recognition results of a camera-based assistive tool, encountering errors, the importance of the errors, and difficulty of identifying the errors
- strategy of taking photos for a camera-based assistive tool, and degree of understanding how a camera-based assistive tool works

For most questions regarding frequencies, we employed an absolute 7-point scale adopted from Rosen et al. [108] (never, once a month, several times a month, once a week, several times a week, once a day, several times a day). For some questions about relative frequencies (e.g., how often do you encounter misrecognitions when you use [a name of a camera-based assistive tool]?), the frequencies were measured on a relative 6-point scale (never, very rarely, rarely, occasionally, very frequently, always) [109].

Error identification task. Participants completed a task of taking photos of objects with a general object recognizer and identifying recognition errors 1-7 days after the interview. The devices and objects for the user study were delivered to the participants'
houses, and the instructions were given through Zoom due to COVID-19. For remote communication, participants were given a laptop computer with Zoom running. We also provided Vuzix Blade smart glasses with a camera, also connected to Zoom, so that we could monitor the participants' views throughout the study and for later data analysis. Participants used an iPhone 8 with an object recognition app that we built for this study. At the beginning of the task, the experimenter provided the names of 15 objects (Figure 7.1). In each trial of the task, participants were asked to select one of the objects randomly, take a photo of it, and get a label from the object recognition app.

Figure 7.1: Object stimuli: baking soda, caramel coffee, Cheetos, chewy bars, chicken broth, Coca-Cola, diced tomatoes, Diet Coke, dill, Fritos, LaCroix apricot, LaCroix mango, Lays, oregano, Pike Place roast.

The label was provided through synthesized speech. After listening to the label, participants reported whether the recognition was correct or not and how certain they were that the recognition was correct (or incorrect). After finishing the recognition trials with the 15 objects, participants went through the objects once more in random order, completing 30 trials in total. Participants were encouraged to think aloud throughout the task. At the end of the task, we asked questions about the difficulty and strategies of identifying errors.

7.2.3 Object Stimuli

For the task of identifying errors from object recognizers, we used 15 objects (Figure 7.1) used in a prior study examining blind users' interaction with a teachable object recognizer by Kacorri et al. [20]. We followed their approach, where the objects are selected to include different shapes, sizes, materials, and visual similarities.
The logos or images on the containers of some products (i.e., baking soda, chicken broth, diced tomatoes, and Diet Coke) are slightly different from the products in the prior study because their designs have changed. However, the shape, material, and weight that may affect participants' tactile perception of the objects were the same for all objects.

7.2.4 Testbed

For the task in the user study, we used a general object recognizer fine-tuned on photos of the objects in Figure 7.1. The base model of the general object recognizer is an InceptionV3 [161] model trained on ImageNet [5]. We fine-tuned the base model on a dataset of photos taken by nine blind participants in a prior study where the participants trained a teachable object recognizer [146]. The dataset included 225 photos for each object, 3,375 photos in total. The fine-tuning was conducted with 500 steps of gradient descent and a 0.01 learning rate. During the task, participants used the general object recognizer app on an Apple iPhone 8 (Figure 7.2). When a participant touched a "Scan Item" button on the screen, the app sent an image over HTTP to a server, where the fine-tuned object recognition model predicted the label of the image, and received the recognition results back.

7.2.5 Measures and Data Analysis

The responses in the semi-structured interview and the tasks were video recorded using Zoom. We transcribed the responses to analyze the results. The images and labels from the object recognizer were saved on the server during the task so that we could examine whether participants identified errors correctly.

Figure 7.2: A screenshot of the general object recognizer.

7.2.5.1 Semi-Structured Interview. We used a thematic coding approach to find the major themes in the participants' responses [113]. To reduce subjectivity, two researchers cooperated to code the responses. One of the researchers transcribed the responses.
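The fine-tuning recipe in Section 7.2.4 (a frozen backbone whose pooled features feed a new 15-way classification head, trained with 500 steps of gradient descent at a 0.01 learning rate) can be sketched as follows. This is a minimal stand-in rather than the testbed's actual code: the random features and the small per-object sample are placeholders for the InceptionV3 embeddings and the 225-photos-per-object dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 15   # objects in Figure 7.1
FEAT_DIM = 2048    # InceptionV3 pooled-feature size
PER_CLASS = 15     # stand-in; the real dataset had 225 photos per object
STEPS, LR = 500, 0.01

# Placeholder for frozen-backbone features of the training photos.
X = rng.normal(size=(PER_CLASS * NUM_CLASSES, FEAT_DIM)).astype(np.float32)
y = np.repeat(np.arange(NUM_CLASSES), PER_CLASS)

# New softmax classification head, trained from scratch on top of the features.
W = np.zeros((FEAT_DIM, NUM_CLASSES), dtype=np.float32)
b = np.zeros(NUM_CLASSES, dtype=np.float32)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(STEPS):
    grad = softmax(X @ W + b)            # forward pass: class probabilities
    grad[np.arange(len(y)), y] -= 1.0    # d(cross-entropy)/d(logits)
    grad /= len(y)
    W -= LR * (X.T @ grad)               # full-batch gradient descent update
    b -= LR * grad.sum(axis=0)

train_acc = float((softmax(X @ W + b).argmax(axis=1) == y).mean())
print(f"head training accuracy after {STEPS} steps: {train_acc:.2f}")
```

In the deployed testbed, the resulting model answered HTTP requests when a participant pressed "Scan Item"; only the head-training loop above corresponds to the described 500-step, 0.01-learning-rate fine-tuning.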
With the transcribed data, the two researchers coded the responses and created initial codebooks. They compared the two codebooks and the coded data to resolve disagreements through consensus. There were 35 disagreements out of 373 answers. After resolving the disagreements, they established a codebook and coded the data. In the final codebook, the responses to 17 open questions in the semi-structured interviews included 153 codes.

7.2.5.2 Error Identification Task. The blind participants' ability to identify the object recognition errors was measured using precision and recall, which are commonly used to measure the performance of machine learning models. Specifically, precision indicates how often the recognition results are actually errors when participants thought the results were incorrect. Recall denotes the proportion of errors that participants correctly identified. We also measured the performance of the object recognizer itself using precision, recall, and accuracy.

7.3 Results

The semi-structured interview provides insights into blind people's experience in taking photos and recording videos. The error identification task revealed blind users' patterns in identifying object recognition errors.

7.3.1 Insights from Semi-Structured Interview

The main themes in the interview questions included experience in taking photos (or recording videos) and interacting with camera-based assistive apps. We focused on how blind people check the quality of their photos, why they take photos, and how they identify misrecognitions from the camera-based assistive apps.

7.3.1.1 Experience in Taking Photos or Videos

All participants have taken photos at least once a month, as shown in Figure 7.3. When taking a photo or recording a video, they rarely changed the settings of their cameras or environments. The majority of the participants said they have never changed the settings (N = 8).
The four participants who did change their settings tried to find a place with maximum light (N = 3), changed the flash setting (N = 1), or tried different camera angles (N = 1). For example, P8 thought that taking photos at home would get the maximum light, saying "when I'm home, I feel it gives me the maximum amount of light and I get the best pictures. [...] I might move it around a couple of times so that it'll describe it in the most detailed way." When asked how often they check whether their photos are good, many participants responded that they did so only occasionally relative to how often they take photos. The majority of the participants checked their photos several times a month or less (N = 8). The participants with low vision reviewed photos with their vision (N = 4). They also used automatically generated image descriptions from assistive tools such as Seeing AI and the built-in image captioning function in iOS (N = 5). For example, P12, who inferred the quality of a photo based on the text recognition results, said "what's relevant are the OCR results I get from it. Especially if there is a garbled section that doesn't fall into a normal OCR error pattern, then I know the photo's not good." The participants also got help from sighted people around them (N = 3) or remotely using assistive apps such as Aira [145] and BeMyEyes [162] (N = 1).

Figure 7.3: Participant responses to questions about their experience in taking photos.

When asked what they captured in their photos, the most common responses were documents for text recognition (N = 10), people (N = 9), and objects (N = 8), as shown in Figure 7.4. Similarly, the most common purposes of taking photos or recording videos were text recognition (N = 10), video calls (N = 8), and object recognition (N = 5). These responses are somewhat different from the findings of a prior study conducted by Jayant et al.
[142] in 2011, which found that blind people mostly took photos to capture friends/family for fun, while their most desired use for a camera was text recognition. One possible reason for this difference is that computer vision-based assistive apps have become more common among blind people as they have improved with advances in machine learning. However, many participants still thought image framing (i.e., centering the object and adjusting the distance between the camera and the object) was challenging (N = 9). For example, P1 and P5 said "Making sure the information I'm trying to capture is in the frame of the camera." and "I don't know how far away from the object to hold the phone." Participants also mentioned other challenges: keeping the focus on the object (N = 2), holding the camera steady (N = 2), adjusting the lighting conditions (N = 2), and finding the right orientation of the object (N = 2).

Figure 7.4: What participants captured in their photos.

7.3.1.2 Experience in Using Camera-Based Assistive Apps

We asked participants which camera-based assistive apps they had used regularly. Participants had used eight camera-based assistive apps, and we asked questions about their experience with each app, yielding 20 cases (i.e., participant-app pairs). The majority of participants had used Seeing AI (N = 9), as shown in Figure 7.5. They used other apps with text and object recognition functions such as Google Lookout, KNFB Reader, Super Lidar, Supersense, and Voice Dream Scanner. They used Aira and Be My Eyes to get sighted help remotely through the apps. The participants used the apps several times a day (N = 5), several times a week (N = 7), several times a month (N = 5), or once a month (N = 3). When asked how frequently they encountered misrecognitions or mistakes from the apps, the participants reported that they rarely (N = 10), very rarely, or never encountered such cases.
When asked how frequently they encountered errors on the absolute frequency scale, participants reported encountering them less frequently than once a week in most cases (N = 19). However, one thing to note is that participants may not have perceived many errors while using the apps without validation, considering that participants missed more than half of the errors in the error identification task in Section 7.3.2. Therefore, the frequency reported by the participants is likely lower than the actual frequency of errors.

Figure 7.5: Camera-based assistive apps the participants have used regularly.

To take good photos for these apps, participants in most cases tried to find a proper distance between the camera and the object (N = 9), adjust the orientation of the object (N = 7), and center the object in the camera frame (N = 7). This is likely because most participants thought image framing was challenging, as mentioned earlier. When computer-generated feedback for blind photography was available, participants also used it to take good photos (N = 8). For example, P12, who had used Voice Dream Scanner, said "It has this system where the louder and steadier the audio tone is, the better you are. There's a certain tone. You've got the perfect picture and you snap it." We also asked them how frequently they validate the predictions from the apps (Figure 7.6). The participants had never verified the outputs in most cases while using the apps (N = 9). Most of these participants mentioned that they just believe the outputs from the apps (N = 7). For example, P2 and P8 said "if it says it's a $5 bill, I believe it" and "I assume it's correct when it reads it to me." This response is consistent with a finding from a prior study that blind users often overtrust computer-vision systems [33]. Some participants did not validate the outputs from the apps because they thought it was easy to find the errors (N = 6).
When using a text recognizer, they could identify errors if the outputs did not make sense. For example, P11, who reported that she had never verified the outputs from Seeing AI and Voice Dream Scanner, said "If it tells me a certain thing, I'll know that it actually meant certain numbers. The errors that are sometimes made, they kind of have patterns if you know what it is." When recognizing objects, they compared the outputs from the apps with what they expected based on the textures, shapes, and weights of the objects. For example, P6, who never validated outputs from Seeing AI, said "[...] I could say sometimes it does get the canned soup name wrong, but I guess I don't consider it wrong enough to call it wrong." With some apps, they verified the outputs occasionally (N = 5), rarely (N = 3), or very rarely (N = 1). The most common reason for verifying the results was being unsure about a single output (i.e., needing multiple trials to make a decision) (N = 8). For example, P3 said, "if I'm consistently not getting a result with Seeing AI, then I'll see if KNFB Reader will give me results." In most cases, participants agreed (N = 13) or strongly agreed (N = 3) that they cared about misrecognitions from the apps, as shown in Figure 7.7. They sometimes did not care about the errors because they could understand the outputs from the apps even with some errors (e.g., the errors in text recognition did not change the meaning of the texts significantly) or because they did not use the apps for sensitive or important tasks. P8 said, "It's not the most important thing, because I'm not using it for something critical."

Figure 7.6: Participant responses to two questions about the frequency of encountering errors and verifying the outputs from the apps.

When we asked whether there were situations in which they cared about the errors more than others, participants' responses were mostly about text recognition and the importance of the content of the texts.
The most common responses concerned reading bills, currency, expiration dates, or other important numbers (N = 11). P1, who used Be My Eyes, said "if they don't see the expiration date properly on something and it's expired, you know, I could get sick." Other situations included reading directions for some tasks (N = 5) and reading important documents (N = 5). P9 provided some examples of important documents, saying "probably when it's something that is connected to legal documents, financial statements, legal financial statements." The participants' responses were divided on the difficulty of identifying misrecognitions from the apps. Participants disagreed or strongly disagreed that it was challenging when they could easily find the errors using context, such as the content around misrecognized texts and the texture of misrecognized objects (N = 10). For example, P1 said "if they're wrong, I know they're wrong." P12 said "I can catch the errors as they come up because often, it's not wrong enough for me to not be able to figure out what it says." In other cases, participants thought the errors were not clearly distinguishable and were hard to identify (N = 9). P8, who was aware of the possibility of missing errors, said "If it's wrong, I wouldn't know... I don't even know whether it's wrong or true." P9 had experienced sighted people finding errors from Voice Dream Scanner that she had missed. She said "There have been occasions when I didn't detect anything and a sighted person may have indicated there was something that I just did not get."

Figure 7.7: Participant responses to two questions about handling errors in the apps.

When we asked participants how they identify errors, we found that in most cases they identified misrecognitions by themselves based on context (i.e., the text around text recognition errors, the textures of the objects) (N = 10).
For example, P1, who recognized texts with Seeing AI, said "If the information reading isn't very clear, if I can tell that it's only reading a part of something then I have to readjust it." P6, who identified objects with Seeing AI, said "[...] if I get a soup, and it's not pronouncing the type of soup, that type of thing [...]." In other cases, they asked sighted people to clarify (N = 5) or verified the outputs from the app with multiple trials (N = 5).

Figure 7.8: The number of missed errors (false negatives, FN) and correct predictions considered as misrecognitions (false positives, FP).

7.3.2 Results from Error Identification Task

Over the 30 trials, the accuracy of the object recognition was 0.76 (SD = 0.10) on average. Participants encountered between 4 and 13 errors during the task. We counted the number of missed errors (false negatives) and correct predictions considered as errors (false positives). The numbers of false negatives and false positives were 3.67 (SD = 2.46) and 0.5 (SD = 6.74) on average, respectively. The proportion of errors identified by the participants was 0.49 (SD = 0.32) on average. These results indicate that participants tended to believe the predictions from the object recognizer rather than doubting them, missing more than half of the errors. While participants missed many errors, they mostly thought that identifying the errors was not challenging. When asked whether it was challenging, the majority of them disagreed (N = 5) or strongly disagreed (N = 3). P8, who has low vision, could tell correct from incorrect predictions based on her vision and the textures of the objects. Some participants identified errors by comparing the predictions across multiple trials. For example, P10 elaborated "I didn't recognize a mistake until the second similar object appeared. So like the two cans of the Lacroix apricot and Lacroix mango, one of them was incorrect because it was telling me apricot both times."
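The precision and recall defined in Section 7.2.5.2 can be computed directly from a participant's trial-by-trial judgments. The sketch below uses hypothetical data, not any participant's actual responses; a judgment counts as a true positive when a recognition the participant flagged as incorrect really was an error.

```python
# Sketch of the error-identification measures from Section 7.2.5.2: the
# participant flags each recognition as correct or incorrect, and we score
# those judgments against ground truth.
def error_identification_scores(flagged_error, is_error):
    """Precision/recall of a participant's error judgments over the trials."""
    tp = sum(f and e for f, e in zip(flagged_error, is_error))      # errors caught
    fp = sum(f and not e for f, e in zip(flagged_error, is_error))  # correct predictions flagged as errors
    fn = sum(e and not f for f, e in zip(flagged_error, is_error))  # errors missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fn, fp

# Hypothetical participant: 6 true errors in 10 trials, 3 of them caught,
# plus 1 correct prediction mistakenly flagged as an error.
is_error      = [True, True, True, True, True, True, False, False, False, False]
flagged_error = [True, True, True, False, False, False, True, False, False, False]
p, r, fn, fp = error_identification_scores(flagged_error, is_error)
print(p, r, fn, fp)  # 0.75 0.5 3 1
```

Under this scoring, the average recall of 0.49 reported above means participants caught about half of the true errors.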
The errors were sometimes clear to the participants because the predicted and true objects had different textures, shapes, or weights, as mentioned by P12. He said "[...] for example, the diced tomatoes versus the chicken broth, chicken broth is more liquid. It was easy to identify that it was wrong." On the other hand, some participants agreed that it was challenging to identify the errors (N = 3). Two of the three participants mentioned that the recognition results were not consistent for an object, making it hard to decide whether the results were correct or not. One participant mentioned that it was difficult to remember all the objects explained at the beginning of the study, which made it hard to decide whether the recognition results were correct or not.

7.4 Discussion

7.4.1 Understanding the Contexts of Using the Object Recognition Apps

Looking into blind people's experience with camera-based assistive apps, we observed that the majority of their use involved text recognition apps. Comparing the results of this study with those of a prior study by Jayant et al. [142] from 2011, we can see that both text and object recognition are more popular now than a decade ago. However, object recognition apps were found not to be as popular as text recognition apps among the participants in this study. One possible reason is that text recognition may be more reliable than object recognition in practice, because the text in documents is more structured than the shapes and textures of arbitrary objects. A problem with general object recognition apps is that the recognition results are not fine-grained, due to technical limitations in computer vision [20]. Though both text and object recognition use cameras, blind users' interactions with these recognition systems can differ significantly.
For example, this study showed that some participants identified an error based on the text around it while using text recognition, whereas others depended on background knowledge of objects, such as textures, shapes, and weights, to identify object recognition errors. Therefore, we need a more in-depth study that separates the user experiences of text and object recognition apps. Understanding the contexts of using such recognition apps would provide insights for enabling blind users to identify errors, as we found that one of the strategies for identifying errors in the error identification task was using contextual information (i.e., recognition results of objects with similar textures). In the error identification task of this study, we assumed that participants would know the candidate objects (i.e., the 15 objects in the task) before using an object recognizer. However, the contexts in the task would differ from real use cases. In reality, for example, the accuracy and strategy of identifying errors would depend on many factors, such as whether users know the objects and the performance of the object recognizer in advance. To understand the impact of context, we need an in-depth analysis of blind users' experience in using object recognizers in real scenarios.

7.4.2 Interface with Feedback for Identifying Errors

As it is challenging for blind users to aim at an object properly, adjust the lighting conditions, and check whether the background is cluttered, feedback that helps them take good photos would be valuable. Some participants in this study had also used feedback from camera-based assistive apps to capture a target object. For example, P12, who had used Voice Dream Scanner with feedback, mentioned "It has a system where the louder and steadier the audio tone is, the better you are. When there's a certain tone, you've got the perfect picture and you snap it."
Enabling blind users to take good photos would not only result in lower error rates but would also affect their trust in the system and their accuracy in identifying errors. While prior studies have developed systems that provide feedback for centering an object and adjusting the distance [141, 146], they have not investigated the impact of such feedback on the error rate and on blind users' interaction with the errors. Through the interviews with participants, we found that the perceived significance of errors and the frequency of verifying the outputs from recognition apps are affected by the apps' performance as estimated by the participants through past experience. However, as a prior study showed that blind users overtrust automatically generated captions [33], participants' estimates may not always be accurate. Therefore, informing blind users of the certainty of the predictions and the actual performance of the apps would help them identify errors. Prior studies in Explainable AI (XAI) have shown that explaining the certainty and rationale behind the output of a machine learning model can improve the trust and usability of machine learning systems [163, 164]. However, many of these approaches are based on visual information inaccessible to blind users or have not been assessed with blind people. In future studies, we need to evaluate the impact of feedback conveying the certainty of predictions and the performance of an object recognition system on blind users' experience.

7.5 Conclusion

In this chapter, we investigated blind users' experience with camera-based assistive apps through a semi-structured interview and measured their accuracy in identifying object recognition errors through an error identification task. Through the interview, we found that participants were divided on the difficulty of identifying misrecognitions of camera-based assistive apps.
Because they believed the outputs from the apps or considered some errors easily identifiable based on context, they rarely verified the outputs of the apps. However, through the error identification task, we observed that the participants missed more than half of the object recognition errors. Analyzing the participants' strategies for identifying object recognition errors, we found that they used their knowledge of the objects, such as textures and weights, as well as the recognition results of other objects, to infer the correctness of the recognition results. The findings emphasize the need to understand the context in which blind people use object recognizers and to design interfaces that provide feedback for blind photography and explainable outputs.

Chapter 8: Exploring Error Understanding and Avoidance in Teachable Image Recognition With Sighted Users

8.1 Motivation and Introduction

The previous chapters (Chapters 4, 5, and 7) covered both speech and image recognition systems in which the machine learning models had been pre-trained and deployed by experts. This and the following chapters shift the context from interactions with recognition systems pre-trained by experts to systems that can be personalized by end-users through machine teaching. As machine learning and artificial intelligence become more present in everyday applications, so do efforts to better capture, understand, and imagine this coexistence. Experts from diverse disciplines are working together and critically examining the impact of algorithmic decisions, their assumptions, and their biases [34, 165, 166, 167, 168]. Error-prone, computationally complex, and failing in ways unexpected by humans, such algorithms called early on for transparency, interpretability, accountability, and control [169, 170, 171, 172, 173].
More recently, these efforts have redoubled (surveyed in [94, 174]), fueled by funding and legal initiatives such as the DARPA Explainable Artificial Intelligence program [175] and the European Union's General Data Protection Regulation [176], while feeding into future initiatives such as the Algorithmic Accountability Act [177].

Figure 8.1: Given an object category, MTurkers are called to choose three object instances and train a robust personal object recognizer using their mobile camera. Here we include examples from some of the participants' selected objects.

Machine teaching [4, 178] lies at the core of these efforts, as it enables end-users and domain experts with no machine learning expertise to innovate and build AI-infused (a term in Amershi et al., 2019 [179] for "systems that have features harnessing AI capabilities that are directly exposed to the end-user") systems. Beyond helping to democratize machine learning, it offers an opportunity for a deeper understanding of how people perceive and interact with such systems to inform the design of future interfaces and algorithms [180], a perspective this chapter shares. Within this paradigm, teachable interfaces [150, 181] explore applications where users can explicitly train a model with their generated data and labels. While facilitating user control, the effectiveness of these applications can be hindered by a lack of expertise or by misconceptions about machine learning. Though personalization is often the ultimate goal (e.g., [20]), the interactive nature of these interfaces can in return help users uncover basic machine learning concepts (e.g., [151]). In this chapter, we examine how people conceptualize, experience, and reflect on their engagement with machine teaching in the context of a supervised image classification task, a task where humans are extremely good compared to machines, especially when they possess prior knowledge of the image classes.
As in the study in Chapter 4, we reached out to a larger pool of sighted participants through crowdsourcing. Using a teachable interface for object recognition, we recruit participants (N = 100) through Amazon Mechanical Turk (https://www.mturk.com/) to choose three objects in their environment and train a model to distinguish between them in real-time using the camera on their mobile phones, as shown in Figure 8.1. We build a web-based testbed for a mobile teachable object recognizer and ask participants to train and evaluate it on three objects of choice within an object category (Figure 8.1). Categories represent daily objects that span different characteristics such as size, shape, color, material, and function. Through a performance-based payment scheme [182], participants are called to iterate and reflect on their efforts with the goal of making their recognition models more robust. Serving as an oracle, they are tasked with delivering a teaching set to the recognition model to help it learn the classification task. We conduct a contextualized quantitative analysis of the participants' photos, their written responses, as well as their model performance. We find that diversity, important in machine learning, is deemed important by a majority of participants and incorporated into teaching strategies, drawing parallels to how humans generalize across object size, viewpoint, location, and illumination [158]. Many misconceptions relate to consistency: a few think that it is good to be consistent and teach with almost identical examples, while others failed to be consistent in incorporating diversity across classes. While participants have good intuition on the importance of discriminatory features in teaching, when evaluating their models we observe susceptibility to missing edge cases. Last, we see that the majority of participants do not change strategies on a second attempt even though they possess a reasonable intuition on what would be important.
We see how our findings and insights can help better understand non-experts' interactions with machine teaching and guide the design of future teachable interfaces that can anticipate users' misconceptions and assumptions.

8.2 Method

We deploy our testbed on Amazon Mechanical Turk (IRB #1255427-1) and investigate how non-expert crowdworkers teach a machine a high-dimensional decision boundary, such as fine-grained image classification, with only a few examples.

8.2.1 Testbed: Teachable Object Recognizer

To explore how non-experts conceptualize, experience, and reflect on their engagement with machine teaching, we build a web-based teachable object recognizer for mobile phones. Participants can train, test, and re-train it to distinguish between three objects of their choice. In this case, a test corresponds to a "direct" evaluation [155], where participants take photos of their objects in real-time and observe the model's behavior. To help us better contextualize our observations, participants also provide background information and feedback (questions and prompts can be found in the supplementary material).

Our machine teaching problem. As shown in Figure 6.1, we adopt Zhu et al.'s [4] machine teaching problem space to characterize the teachable interface in our testbed as a system where the human is the teacher and the machine is the student. The teacher provides, in batch mode, a finite pool of examples consisting of labeled photos of objects as the teaching signal. The teacher takes a model-free approach, treating the student as a black box, though we anticipate that humans may already have some assumptions on how the black box works or should work. The student, employing a convolutional neural network, does not anticipate teaching, i.e., it assumes training examples are independent and identically distributed and that there are no errors. Moreover, the teacher is considered a friend, i.e., there is no adversarial training.
Last, we assume that the teacher uses heuristic teaching methods to improve the performance of the student, the object recognition model in our case. We aim to better understand these heuristic methods, factors they may relate to, as well as assumptions that people may have.

Model. For each user, our testbed creates a new convolutional neural network using Google Inception V3 [161] pre-trained on ImageNet [5]. Every time the user provides a teaching set, the last layer of the pre-trained model is replaced with a new softmax layer and re-trained on the user's images for 500 steps with a gradient descent learning rate of 10^-2. The models are trained on our 8-GPU server in real-time and asynchronously; the app continues to run and ask users for open-ended feedback while the training continues in the background. The web interface communicates with the server using the Flask API [183].

Interface. As shown in Figure 8.2, the testbed initially asks for background information, technology experience, and familiarity with machine learning. Then, it provides five object category options: bottle, cereal, drink, snack, and spice, with three sample icons for each category indicative of the preferred shape. Categories are inspired by prior work on personal object recognizers [20] and are engineered to elicit objects that are present in daily life but differ in size, shape, color, material, and function. Participants can train on only one of the categories.
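The training step described above (replace the last layer with a new softmax layer, then run gradient descent for 500 steps at a learning rate of 10^-2) can be illustrated without the full Inception pipeline. The sketch below is a minimal stand-in, not the testbed's actual TensorFlow code: it trains a softmax output layer by batch gradient descent on hand-made 2-D feature vectors playing the role of Inception embeddings. All function names and data here are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_softmax_layer(feats, labels, n_classes, steps=500, lr=1e-2):
    """Train a fresh softmax layer (weights + biases) with plain batch
    gradient descent, mirroring the testbed's 500-step retraining."""
    d = len(feats[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        # Accumulate cross-entropy gradients over the whole teaching set.
        gW = [[0.0] * d for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in zip(feats, labels):
            p = softmax([sum(W[k][j] * x[j] for j in range(d)) + b[k]
                         for k in range(n_classes)])
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)  # dL/dlogit_k
                gb[k] += err
                for j in range(d):
                    gW[k][j] += err * x[j]
        n = len(feats)
        for k in range(n_classes):
            b[k] -= lr * gb[k] / n
            for j in range(d):
                W[k][j] -= lr * gW[k][j] / n
    return W, b

def predict(W, b, x):
    """Return the class index with the highest score."""
    scores = [sum(wk[j] * x[j] for j in range(len(x))) + bk
              for wk, bk in zip(W, b)]
    return scores.index(max(scores))
```

With three well-separated feature clusters standing in for three objects, 500 steps at this learning rate are enough for the layer to separate the classes (e.g., training on two example vectors per class and then predicting those same vectors back).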
To prevent object shape or size from being a factor in any observed inconsistencies between the classes, participants are asked to use objects (a total of three) that fall within the same category; three, the smallest number for multiclass classification and previously used in teachable interfaces for non-experts [181], minimizes both the challenge of finding different object instances within a category in a real-world environment and the task completion time (already 40 minutes long). After labeling their objects, participants are guided through five interactions with the machine learning model (the student); all instructions can be found in the supplementary material.

Preliminary test (TS0): Participants are asked to take photos of their objects to see if the existing non-personalized model can recognize them. The instruction reads: "Take a photo of an object (name at the top) by tapping on the camera screen. The existing model will try to predict it." Given an object label displayed at the top, one takes a photo of the corresponding object and sees the recognition result (a label displayed for 3 seconds). This repeats 15 times (5 times per object in random order). As expected, during this interaction recognition results will not match participants' labels, as the generic model is based on Google's Inception V3 and is not yet personalized. There is a dual motivation behind this interaction. First, it helps familiarize participants with the interface, which simulates the native camera app. Second, it helps collect evaluation examples unbiased by one's teaching experience that is to follow.

Figure 8.2: Testbed screenshots: questionnaires, category selection, object labeling, and camera view in training and testing.

Train 1 (TR1): Participants are asked to train the object recognizer with the following instructions: "Train our object recognizer to identify robustly your objects anywhere, anytime, for anyone.
We will randomly choose one of your objects and ask you to take 30 photos of it. You will be paid $2 extra if your examples pass our robustness test." Here, we hint that model robustness means being able to recognize an object anywhere, anytime, for anyone. Motivated by Ho et al.'s [182] performance-based payment scheme, we also create the impression of a "secret" test distinguishing the examples best for robustness, though on our end this is merely a naive quality examination (e.g., photos of objects on a screen rather than in the real world). As shown in Figure 8.2, given an object label displayed at the top, participants take 30 sequential photos. This repeats 3 times (once per object in random order). Thus, the first teaching set comprises 90 photos (30 per object).

Test 1 (TS1): Similar to TS0, participants are asked to "Test the trained object recognizer again to see how robust it is." Here, recognition labels match participants' labels except in cases of misclassification, where an object is misrecognized as one of the other two. Again, no confidence scores are shown.

Train 2 (TR2): Participants are given an opportunity to re-train their model from scratch with the following instructions: "You told us what you would do differently, now show us! On the next screen, take 30 more pictures of the requested object. You will be paid $3 extra if this training does better than the previous one in our robustness test."

Test 2 (TS2): As in TS1, users can test the re-trained model. The instruction given to the participant was "The object recognizer is trained again. Test the trained object recognizer."

Eliciting Feedback. The testbed includes the following open-ended questions: "What did you think was important to consider when training the object recognizer?" after TR1; "If you were to retrain the system to make it more robust, what would you do differently?"
after TS1; and "How did you position the object in the image?", "How did you decide the distance of the camera from the object?", and "How did you decide which side of the object is visible in the image?" at the end.

8.2.2 Participants

We recruited 143 participants over 10 days. However, data from 43 were excluded from the analysis: 7 helped in piloting, 1 used the same object for all classes, 3 took photos of objects on display screens, and 2 took photos with no objects. The other 30 encountered technical problems when attempting the task simultaneously; our system failed to distribute them across the 8 GPUs, losing data from 12 and interrupting the task for the other 18. All were compensated, and the bug was fixed. The 100 participants who were included in the dataset ranged from 20 to 60 in age (μ=32.6, σ=8.3); 49 were male, 50 female, and 1 non-binary, with 90 reporting being right-handed. No one reported a visual or motor impairment. As shown in Figure 8.3, the majority of participants are frequent users of mobile devices, taking photos with them weekly, though many of them do not use any applications for recognizing objects, food, or plants.

Figure 8.3: Participants' technology experience and familiarity with machine learning, mostly ranging from slightly familiar (have heard of it but don't know what it does) to somewhat familiar (a broad understanding of what it is and what it does).

When asked about familiarity with machine learning, 6 reported never having heard of it, 45 had heard of it but didn't
know what it does, 48 had a broad understanding of what it is and what it does, and only one reported having extensive knowledge.

8.2.3 Procedure

With the goal of attracting non-experts in machine learning, we opted for a HIT description that minimizes technical terms: "You will be asked to take photos of everyday products such as soda cans, cereal boxes, and spices to teach your phone to automatically recognize them. To see how well the object recognition works you will test it by giving a single photo at the time." A warning message was displayed if participants attempted to start the study from a device other than a mobile phone. Only one participation was allowed. Through piloting, we estimated that a study session could be successfully completed within 30-40 minutes. Adopting a $15/hour compensation rate [184], all participants received a total of $10 once all the data collection was completed. To incentivize participants, we used a performance-based payment scheme [182], where this amount was split as a $5 flat participation fee, a $2 bonus for passing "our robustness test" in the first attempt to train, and a $3 bonus for achieving a better performance in "our robustness test" the second time around. Given that objects differ across participants, it was not possible to have an ideal "secret robustness test"; the bonus was decided merely on a quality check. While the testbed's connection is persistent and one could do other tasks in between, we observe that participants took on average 35.57 minutes (14.21-79.86, σ=12.85) to complete the study, very close to our estimates. We explore how participants conceptualize, experience, and reflect on their engagement with machine teaching by looking at the photos they took for the teaching and testing sets as well as changes in their behavior when repeating the process. Observations are contextualized with participants' responses.

Visual Attributes in Photos.
We collected a total of 22,500 photos from 100 participants across all training and testing interactions. To uncover patterns in participants' teaching strategies, photos were coded using thematic coding [113]. Two researchers independently created initial codebooks of visual attributes in photos across four dimensions, i.e., size, location, viewpoint, and illumination; prior work on visual object understanding [158] indicates that our ability to recognize objects generalizes across these dimensions. We want to see how participants draw parallels from their understanding of robustness in these dimensions to enable machines to do the same. The researchers discussed disagreements to produce a final codebook, shown in Tables 8.1-8.3 with examples in Figures 8.4 and 8.5.

Table 8.1: Variation attributes, true if a variation is present for at least one object.
VSizeDist: True if the camera distance, measured as the ratio of object height to frame height, differs for two or more photos, using [0, 0.25), [0.25, 0.5), [0.5, 1.0), and [1.0, ∞) bins.
VLocBg: True if the background differs for two or more photos, i.e., different locations or perspectives of a space.
VViewSide: True if the side of the object differs for two or more photos.
VViewAngle: True if the angle between the camera and the object, for the same side of the object, differs for two or more photos.
VViewPos: True if the position of the object in the camera frame (center, top left, top right, bottom left, or bottom right) differs for two or more photos.
VIllumExp: True if the exposure to light differs for two or more photos taken at the same location.
VIllumSrc: True if the source of light differs for two or more photos because they were taken at different locations.

There are two types of attributes: binary and count. Binary attributes capture the presence of variation or inconsistency within a teaching or testing set of photos.
If a participant varied photos for an object along an attribute such as distance (VSizeDist) or background (VLocBg), the corresponding attribute is 1; otherwise it is 0. Similarly, variation inconsistency across the three objects is captured through binary attributes named ISize, ILoc, IView, and IIllum. Count attributes indicate the number of photos within a set with a certain characteristic, such as the presence of the participant's hand (CHands) and use of a flashlight (CFlash), or a quality issue, such as dark (QDim) and blurry (QBlurry) photos. There was substantial agreement (Cohen's kappa=0.80).

Subjective Feedback. Participants' responses to the open-ended questions were also analyzed with a thematic coding approach [113]. The same two researchers who coded the photos created initial codebooks and merged them through discussions, resolving disagreements. Responses were coded independently with substantial agreement (Cohen's kappa=0.73).

Table 8.2: Inconsistency attributes, true if there is an inconsistency in variation across the three objects.
ISize: True if the camera distance varies in the photos for one or two objects but not all three.
ILoc: True if the background varies in the photos for one or two objects but not all three.
IView: True if size, angle, or position capturing viewpoint varies in the training photos for one or two objects but not all three.
IIllum: True if light exposure or source capturing illumination varies in the training photos for one or two objects but not all three.

Table 8.3: Count attributes, number of photos with a given characteristic, including those capturing quality issues.
CCrop: Number of photos where the object is cropped, i.e., the object is close to the camera, out of frame, or obscured by another object.
CReshape: Number of photos where the object was reshaped (e.g., opening the lid of a package).
CContents: Number of photos where the contents inside a package were taken out of the container or the inside of the package is visible.
CNoBg: Number of photos where the background is not visible because the photo is filled with the object completely.
CPlainBg: Number of photos where the background includes two or fewer colors with no or very simple textures.
CClutBg: Number of photos where the background is cluttered with objects other than the object of interest.
CTextBg: Number of photos where the background includes a wall, floor, or furniture with texture.
CHands: Number of photos where the participant's hand(s) is visible in the photo.
CLogo: Number of photos where the side with the logo (or label) of the object was visible.
CFlash: Number of photos where the brightness varies in different parts of the photo, as when using a flashlight.
QSmall: Number of photos where the object is too small (height of the object ≤ 25% of the height of the photo).
QDim: Number of photos where the brightness of the photo is too dark to recognize the texture or edges of the object.
QBlurry: Number of photos where the object of interest is blurry.
QIrrelevant: Number of photos that include only irrelevant objects without the object of interest.

Figure 8.4: Examples of variation attributes in teaching sets.

Figure 8.5: Sample photos considered by the count attributes.

8.3 Results

8.3.1 Teaching and Debugging Strategies

We explore how variation, inconsistency, and other attributes manifest in participants' image sets when they are first called to train the object recognizer on objects of their choice (a preliminary analysis of this appears in a work-in-progress [1]).

Incorporating diversity in teaching. Diversity plays an important role in machine learning [185]. When incorporated in the teaching set, it ensures that examples can provide more discriminatory information to help the model learn. By looking at participants'
photos (results in Figure 8.6) and by reading their responses, we find that the majority of the participants share this intuition, but not all. In detail, 23 participants (age 21-60, μ=37.57, σ=9.87) did not include any kind of variation in their TR1 teaching set; 3 of them reported never having heard of machine learning, 12 had heard of it but did not know what it does, and 8 had a broad understanding of what it is and what it does. Immediately after training, when asked about what they considered important, 5 participants referred to the need for consistency, which in this context contradicts the way machines and people learn. For instance, P6 said "I figured I needed to be consistent when I took the picture so they looked similar." and P30 "Keeping the pictures the same." Others, who did not consider this type of consistency, mentioned that it is important to have a good quality photo where the object is well framed (4) with visible labels (8) and images that are clear (6) with ample light (2). Without even having tested their model, P2 said: "Getting different angles and perspectives so the trainer could recognize it more easily", a contradiction to their initial teaching set that had no variation. We observed that in TR2, P2 reflected on this observation and varied both the object size and viewpoint. Only two other participants from this group did so as well, P5 and P18. They said having the "name and color in" is important in TR1 but also varied the camera distance (P5) and angle (P18) in TR2. However, the majority of participants (N = 77) diversified examples in their first attempt. They varied either size (N = 65) or viewpoint (N = 63), with some considering location (N = 39) and illumination (N = 19). Light exposure was least diverse (N = 4).
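For concreteness, the VSizeDist coding from Table 8.1 reduces to binning each photo's object-to-frame height ratio and flagging variation when more than one bin occurs. The sketch below is a minimal illustration of that rule; the ratio values and function names are hypothetical, not part of the study's coding tooling.

```python
# Bin boundaries from Table 8.1: [0, 0.25), [0.25, 0.5), [0.5, 1.0), [1.0, inf)
BIN_EDGES = [0.25, 0.5, 1.0]

def size_bin(ratio):
    """Map an object-height / frame-height ratio to its bin index."""
    for i, edge in enumerate(BIN_EDGES):
        if ratio < edge:
            return i
    return len(BIN_EDGES)  # the [1.0, inf) bin

def v_size_dist(ratios):
    """VSizeDist: True if two or more photos fall in different size bins."""
    return len({size_bin(r) for r in ratios}) > 1
```

For example, `v_size_dist([0.3, 0.32])` is False because both photos fall in the [0.25, 0.5) bin, while `v_size_dist([0.2, 0.7])` is True.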
Looking at responses on important considerations for training, many participants (N = 52) mentioned these strategies (all questions, instructions, and prompts prior to training were carefully edited not to prime participants towards our coding attributes) and reflected on the need for diversity with concrete terms such as "different", "various", "all", "many", "multiple", "every", "variety", and "difference" combined with "angles", "views", "sides", "facets", "background", "lighting", "distance", and "positioning". These terms correspond to the four dimensions of our coding scheme, informed by prior work on visual object understanding [158], highlighting that humans' strategies for machine teaching parallel their own abilities.

Figure 8.6: Number of participants per variation and inconsistency attribute across all five interactions with the model: preliminary test (TS0), train 1 (TR1), test 1 (TS1), train 2 (TR2), and test 2 (TS2). The graphs on the left indicate how participants incorporate diversity in their photos in terms of object size, viewpoint, location, and illumination when they train and debug their models.

However, only 11 participants (age: μ = 34, σ = 8.71) incorporated diversity in their teaching set across all four dimensions; 3 reported having heard of machine learning with no further understanding, and 8 had a broad understanding of what it is and what it does.

Being fair and consistent between classes. Model consistency across classes is a desirable trait in machine learning with many social implications for fairness, whose definition is still being debated in the community (e.g., [186, 187]). There is anecdotal evidence of non-experts learning to balance class proportions in the training set over multiple iterations [155, 157].
By keeping the number of training examples constant, we look into their behavior across other potential disparate treatments. Given that many participants considered diversity important for good performance, we explore how fair (i.e., consistent) they are in incorporating diversity across their three objects, with results shown in Figure 8.6. (In this work, classes are object instances that fall within the same category and consequently share similarities such as shape, size, and material in the context of the decision-making task of incorporating variation. Thus, we consider "individual fairness" [188], where "similar individuals should be treated similarly", and explore whether object instances within a category are being treated the same by a participant when introducing variation in the training photos.) Beyond the 23 participants who did not introduce any variation for any object, we find that there were 30 other participants who were consistent.

Figure 8.7: Percentage of photos per participant given a count attribute, with standard error as error bars. Participants took photos mostly with the logo visible, and many of them against a textured or cluttered background. Often the objects were cropped in the camera frame, and sometimes participants' hands were included in the photos. Surprisingly, a few participants opened the object and trained the model on its contents as well. The most common quality issues were blurry and dim photos, though these were not that prevalent.

This is promising, especially since this group included participants from all levels of familiarity with machine learning: not familiar at all (N = 1), slightly familiar (N = 11), somewhat familiar (N = 17), and the only participant in our study who reported being extremely familiar (N = 1).
While none of these participants explicitly mentioned consistency as important, we find that more than half of them (N = 16) continued doing so in their second attempt at training, in TR2. For the remaining 47 participants, inconsistencies were found in variations related to all four dimensions: object size (N = 21), viewpoint (N = 31), location (N = 10), and illumination (N = 5).

Deciding what to show in the teaching set. We analyze the fine-grained count attributes in teaching and testing sets (Figure 8.7) to uncover common teaching patterns across participants. Khan et al. [189] observed that one of the most prominent teaching strategies for a binary classification task among non-experts, called the extreme strategy, is consistent with the "curriculum learning" principle [190, 191], where participants start with the most extreme examples and continue with those closer to the decision boundary. (In the Khan et al. [189] study, participants did not generate the examples; they ordered them by how representative they were of the two classes and chose to teach one by one using all of them or a subset.) While our batch teaching task does not allow for a similar sequential analysis, we find that almost all participants (N = 98) included the logo (or label) of objects in their teaching sets; on average, 84.9% (SD = 25.0) of any participant's images included logos. This indicates that participants understand that logos and labels tend to include the most discriminatory features, which serve as the most extreme examples. Then, through variation, they add less discriminative viewpoints that are closer to the decision boundary. Indeed, 18 participants explicitly mentioned logos or labels as being important in training. For instance, P36 said "... trying to have a constant label view" and P46 "... a clear shot of the front of the package with minimal background interference."
When looking deeper at these responses, though, we find that many of the participants assumed that the machine would read the text. For example, P28 said "It [the model] recognizing the different cereals by name" and P44 "Getting a clear shot where the writing and the size are clear." In terms of the background, we find that the majority were textured (N = 66) or cluttered (N = 62), while many used a plain background (N = 48) and a few none at all (N = 11); the latter two are preferred since very few varied the object location. We observe that 26 participants included their hands in the photos. The presence of hands has been leveraged to better distinguish objects by modeling the contextual relationship between grasp types and object attributes [192] or to estimate the object of interest in a cluttered environment [140, 146]. However, given this study's fine-grained task, the grasp is expected to be similar across objects of the same category. Thus, the presence of the hand does not really help, especially if it is not applied consistently across classes. More surprisingly, we observe that 8 participants reshaped their objects, e.g., opened the lid, and 4 decided to train on the contents of the object as well, e.g., cinnamon powder. When asked what is important for training, one of these participants, P76, said: "Getting lots of different angles and different ways the spice could be portrayed." In general, there were not many photos with quality issues. Participants took clear photos in most cases, and many of them mentioned the importance of image quality in their responses, but some (N = 36) mistakenly took a few blurry photos. Also, objects sometimes appeared too small (N = 17) and occasionally the light was dim (N = 9).

Debugging and including edge cases in testing. When asked to evaluate their model in TS1, many participants (N = 30) did not diversify their images at all;
2 of them reported never having heard of machine learning, 17 had heard of it but didn't know what it does, and 11 had a broad understanding of what it is and what it does. This means that they did not check whether the recognizer is robust. We also find that, compared to training, fewer participants diversify their testing set across object size (N = 57), viewpoint (N = 49), location (N = 21), and illumination (N = 6). This could be explained by many factors, such as: a smaller number of photos in testing (15) compared to training (90); difficulty in conceptualizing robustness; assumptions about the machine's generalizing capabilities; not anticipating future uses of the model under different circumstances; or simply minimizing effort for this HIT. Logos were still included by the majority of the participants (N = 98), and the same number of participants (N = 11) took photos that did not include any background, keeping their testing data consistent with their training examples. Similar to what Zimmermann et al. [157] observed, participants "enacted [testing] practices wherein their models appeared to have high reliability but questionable validity." Using a Wilcoxon signed-rank test, we also find that participants took fewer photos with a plain background (W = 756, Z = 2.17, p = .030, r = 0.15) and with objects that were too small (W = 126.5, Z = 2.61, p = .011, r = 0.18). None of the interesting object reshaping or content images present in training carried over to testing; a similar behavior to Kacorri et al. [20], with "exaggerated" variation in training unobserved in testing.

8.3.2 Changes in Teaching Strategies Through Iterations

Prior work indicates that the interactive nature of teachable interfaces can help users uncover machine learning concepts [151]. We ask participants whether they would do something differently were they to retrain the model a second time, and offer a bonus if they could make it even more robust.
Updating teaching strategies to improve performance. "Is this information a signal or noise?" was one of the most common debugging strategies among experts [193]. We investigate whether participants employ a similar approach by comparing TR2 to TR1 in terms of the variation, inconsistency, and other image characteristics, which serve as information signals for the model. Using a McNemar test for binary attributes and a Wilcoxon signed-rank test for count attributes, we find that the only significant difference is variation of location as observed by changes in the photo background (VLocBg). More participants diversified the background in their teaching set on the first attempt than the second (χ²(1, N = 100) = 4.35, p = .037, φ = 0.21; the odds ratio is 11.86). As in Zimmermann et al. [157], we suspect that participants were trying to maximize performance by increasing consistency between their training and testing data, even though in our prompts we had defined robustness as the ability to recognize the objects anywhere, anytime, for anyone. No other significant differences were observed, though this could be partially explained by limitations in the binary nature of our variation and inconsistency attributes, which fail to capture changes in magnitude. We shed light on other possible explanations by looking at participants' responses. When asked what they would do differently if they were to retrain, some (N = 22) said "nothing", "wouldn't do it differently", or "would not change anything". A few said they had nothing to change because they were satisfied with the performance in TS1 (N = 6). For instance, P23 said "Nothing it seems very robust after the learning phase." This was not a surprise given that in TS1 participants did not opt for a thorough evaluation, as discussed above. "Having no idea what to change" was also mentioned by some (N = 19), reflected by terms such as "not sure", "unsure", "I can't think of anything", "have no idea", or "don't know".
Indeed, we find that the models of these 22 participants perform well on their own test data, with an average F1 score of 0.981 (SD = 0.048; note that only recognition labels, and no scores, are available in testing) and significantly better than those of the rest of the participants (U = 1472, Z = 5.22, p < .001, r = 0.52); a trend that carries over to the second attempt. A few participants wanted to change elements of the teaching process, such as improving the testbed (N = 3), taking photos faster (N = 1), adding more classes (N = 2), or adding more samples (N = 6). Yang et al. [193] characterized the latter as "most non-experts' only strategy to improve a model's performance." Others focused on improving the quality of their teaching set, such as better focus (N = 5), more light (N = 2), showing labels (N = 2), better framing at a certain distance (N = 1), and centering (N = 1). A few participants (N = 2) explicitly mentioned the importance of the background, with P83 saying "I would try to change the color of the background to ensure that it knows what the actual object is. I think it was confused by the curry because of the black stove background which may look like the black cap of the cumin." Surprisingly, one participant (P85) pointed to the limited discriminability of their objects, uncovering challenges in fine-grained classification, by stating "Change objects to not look so similar." Last, some participants (N = 22) explicitly indicated that adding more variation to their training set is something they would do. For instance, P14: "I would take a wider variety of angles" and P21: "Take picture from many different locations lighting and positions." Only one, P36, mentioned doing so in testing: "Test different sizes".
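The non-parametric tests used throughout this analysis — McNemar for paired binary attributes, Wilcoxon signed-rank for paired counts, and Mann-Whitney U for comparing independent groups — can be sketched minimally as follows. The function names and example values are ours, not the study's data, and a real analysis would also derive p-values, e.g., with scipy.stats.

```python
def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar chi-square for paired binary data,
    e.g., b = varied the background in TR1 only, c = in TR2 only.
    Concordant pairs drop out of the statistic; the odds ratio is b / c."""
    return (abs(b - c) - 1) ** 2 / (b + c)

def wilcoxon_w(before, after):
    """Wilcoxon signed-rank W: rank the non-zero paired differences by
    magnitude (average ranks for ties), then return the smaller of the
    positive- and negative-rank sums."""
    diffs = sorted((a - b for b, a in zip(before, after) if a != b), key=abs)
    ranks, i = [0.0] * len(diffs), 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        ranks[i:j] = [(i + 1 + j) / 2] * (j - i)  # average rank for tied |d|
        i = j
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_pos, w_neg)

def mann_whitney_u(x, y):
    """Mann-Whitney U: the number of pairs (xi, yj) with xi > yj,
    counting ties as one half."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)
```

For example, `mcnemar_chi2(12, 3)` returns the chi-square for a hypothetical pair of discordant counts, and `mann_whitney_u([3, 4, 5], [1, 2])` returns 6.0, the maximum possible U for samples of sizes 3 and 2.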
When examining what they actually did in their second attempt at training, we find differing approaches: some indeed started incorporating new variations (N = 13), some perhaps changed the magnitude, as variations were present in both the first and second attempt (N = 5), and others (N = 4) did not make those changes. While variation for these 22 participants was mostly limited to the 4 dimensions (size, viewpoint, location, and illumination), a few other participants (N = 5) indicated that they would also include different forms of the same object, e.g., different containers, perhaps difficult within this study.

8.3.3 Analysis of Performance

We report the performance of the models that the participants train by looking at the predicted labels during the first and second round of testing, using the F1 score measure (F-score).

Relating observed behavior to performance. Participants achieved on average a 0.75 (SD = 0.38) F-score in their first attempt to train the model. Using multiple linear regression, we explore how attributes capturing their behavior in teaching and testing may relate to the relative performance of their models. While this performance is far from an ideal controlled robustness (see footnote 10), it can provide some context for the observations above, such as participants' behavior in the second attempt. We use a square root transform of the F-score (see footnote 11) as the dependent variable. As independent variables, we use variation, inconsistency, and count attributes in TR1 and TS1 and their interaction. For model selection, we use stepwise variable selection based on the Akaike information criterion (AIC) [194], with results shown in Table 8.4. We find that only 28% of the variability in recognition performance is accounted for by this model, as indicated by the adjusted R-squared metric. While this is modest, it is not surprising, as there are many factors that can contribute to the performance of an image classification algorithm.
For instance, performance can vary based on object similarities, a common challenge in fine-grained classification; a similarity that is not directly captured by our attributes. In training, we find that variation in light exposure (VIllumExp) relates positively to the F-score, though very few participants included this type of diversity in their teaching set. We also see that the number of images where the object is taken against a plain background (CPlainBg) has a negative relationship with model performance. Though counter-intuitive, we suspect that lack of diversity in the background might have contributed to a model that does not generalize well, e.g., when tested. This seems to be supported by the negative relationship of the number of cluttered background images during testing.

Footnote 10: Such a neutral test is unrealistic in our study since participants choose different objects in different environments.
Footnote 11: Transformation is used to meet the normality assumption.

Attempt   Variable      Estimate   Std. Error   t value
          (Intercept)    0.939      0.048       19.79***
TR1       VIllumExp      0.167      0.063        2.64**
          VIllumSrc     -0.076      0.049       -1.55
          CCrop          0.000      0.002        0.12
          CPlainBg      -0.002      0.001       -2.50*
          CTextBg       -0.001      0.001       -1.55
TS1       VSizeDist     -0.068      0.037       -1.81.
          VViewSide      0.108      0.038        2.83**
          VViewPos      -0.089      0.045       -1.97.
          CCrop          0.048      0.012        4.04***
          CClutBg       -0.007      0.003       -2.14*
          QBlurry       -0.016      0.009       -1.74.
TR1*TS1   CCrop         -0.001      0.000       -3.16**
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.157 on 87 degrees of freedom
Multiple R-squared: 0.3681, Adjusted R-squared: 0.2809
F-statistic: 4.223 on 12 and 87 DF, p-value: 3.195e-05
Table 8.4: Modeling recognition performance based on attributes capturing variation, inconsistency, and other characteristics.

In testing, we find that variation in object size (VViewSide) relates positively to the F-score. We also see that the number of images where objects appear to be cropped (CCrop) has a positive relationship with model performance.
A plausible explanation could be that these attributes capture participants' behavior of zooming in on the object's most discriminative features, thus helping the model to distinguish objects. However, when considered as an interaction between training and testing (TR1*TS1-CCrop), this attribute appears to be negatively related to model performance, perhaps pointing to a sensitivity to consistency between the two: if you crop objects in one case, then it helps to do so in the other as well.

Improving performance the second time around. As shown in the previous analysis, we observe few changes in participants' teaching strategies in the second training as captured by our attributes, though some participants said they would do things differently. We find that this is also reflected when comparing the performance of their second model to the first. On average, participants achieved a 0.746 (SD = 0.38) F-score the first time and a 0.749 (SD = 0.28) the second, with no significant change (W = 80.5, Z = -0.16, p = .871). However, participants who indicated they would do nothing to improve their model after the first attempt (N = 22) seem to achieve significantly higher performance than the rest (U = 1472, Z = 5.22, p < .001, r = 0.52), and this is a consistent trend across both attempts (U = 1459.5, Z = 5.12, p < .001, r = 0.51). Looking at these relatively low F-scores for such a simple 3-way classification task, it is surprising that the second group of participants did not further improve their performance even though they expressed reasonable strategies. Perhaps the incentives were not strong enough and they had a higher threshold for errors, or there was not enough time and there were not enough iterations to try things out. It could simply be that their object instances were too similar. Indeed, the majority (N = 38) of the participants in this group had chosen spices.
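The F-scores analyzed above can be computed per class from precision and recall and then averaged across the three classes. A minimal sketch follows; the label sequences are hypothetical, and we assume macro averaging, which the text does not spell out.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: per-class F1 from precision and recall,
    then the unweighted mean over classes."""
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

# Hypothetical ground-truth and predicted labels for a 3-way task.
score = macro_f1(["a", "a", "b", "b", "c", "c"],
                 ["a", "a", "b", "c", "c", "c"],
                 ["a", "b", "c"])
```

In this toy case one "b" photo is misclassified as "c", giving per-class F1 of 1.0, 2/3, and 0.8, for a macro F-score of about 0.82.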
8.4 Discussion

We see how our results, some being new insights, others strengthening prior empirical and anecdotal evidence, can help better understand non-experts' interactions with machine teaching and guide the design of future teachable interfaces. We highlight some of them with the following suggestions:

Account for teaching strategies: Our observations suggest that non-experts mainly tend to teach with clear representative examples and sometimes incorporate examples that are closer to the decision boundary through variation, which parallels how humans generalize for similar recognition tasks. In the case of object recognition, these were object size, viewpoint, location, and illumination [158]; though all four were considered only by a few. Our analysis also suggests that beyond class imbalance [155, 157], there can be other disparate treatments, such as inconsistency in the way variation is incorporated across classes.

Anticipate misconceptions: A prevalent misconception relates to consistency. While it is true that consistency between training and testing data will result in better performance, assuming they both represent real-life examples, some thought that being consistent entails teaching with multiple identical examples with no variation whatsoever. Other misconceptions relate to the machine's capabilities for reasoning. For example, participants would train with visually disparate examples from both the container and its content separately. Others would assume that the models were able to infer the text.

Help users craft evaluation examples: Our observations indicate that testing examples tend to be less diverse, or not diverse at all. Thus, it is no surprise to see many people wanting to change nothing, being satisfied with the performance, or not knowing what to do. Even among those who did change their behavior when training for a second time, the change was to not vary the background rather than to make their model more generalizable.
Help may look different based on the goal of the teachable interface. If it is personalization (e.g., [195]), then it could mean guiding the user to generate examples that are more representative of future use cases [155]. However, if it is an application intended to uncover machine learning concepts (e.g., [151]), perhaps promoting more model-breaking examples [196] would be more appropriate; though in the context of a teachable interface this could lead to users training the model with less authentic data simply to improve its performance [157].

This work has several limitations, listed below:

Task: We explore machine teaching in a narrow context, that of a supervised 3-way image classification task. This allows us to dive deep in our analysis, using a fine-grained scheme when coding participants' examples, informed by prior work on visual object understanding. However, it also limits the generalizability of our findings. We attempt to overcome this by connecting our results with those of prior work when possible. Three, the smallest number of classes for multiclass classification, was selected to minimize challenges in finding different object instances within a category in a real-world environment as well as the task completion time (already 40 minutes long).

Study: While teachable object recognizers are real-world applications [146], they are typically intended for blind users. Thus, the sighted participants may lack motivation in this study. We attempt to compensate for this lack of incentives with a performance-based payment scheme [182], creating the impression that we have a "secret" test to distinguish models that are more "robust"; though on our end this is merely a naive quality examination. By doing so, combined with the fact that the testbed shows only the predicted labels but no confidence scores in testing, we might have limited participants' criteria for model evaluation [155] to just correctness.
Analysis: Through crowdsourcing we were able to quickly recruit a large participant pool and collect data outside a lab in the users' environment. However, this limited our control over the object instances that participants could use as well as the opportunity to create our own evaluation set for comparing the performance of the models against the same data. To allow some time before testing for the photos to be received on our server and the models to be trained on our GPUs, participants were asked to review their training photos and select 10 out of 30, 5 out of 10, and 1 out of 5. We are still analyzing these data while considering more fine-grained variation and inconsistency attributes.

8.5 Conclusion

We have presented a crowdsourcing study, where MTurkers choose three objects in their environment and iteratively train a model to distinguish between them in real-time using the camera on their mobile phones. By doing so, we were able to explore, with a large participant pool (N = 100), an instance of a machine teaching problem with a task where many non-experts can serve as the oracle. Our findings and insights can contribute to the ongoing discussion on how non-experts conceptualize, experience, and reflect on their engagement with machine teaching. To allow for study replicability and future comparisons, we have provided a detailed description of our testbed, its framing within the machine teaching problem space from Zhu et al. [4], and the list of questions and prompts used in the study. Our results are based on a fine-grained analysis of the participants' examples, contextualized by their responses, background, and model performance. We discuss how they can guide the design of future teachable interfaces to anticipate users' tendencies, misconceptions, and assumptions.
Given our research group's interest in teachable interfaces for accessibility [195], our next step will be to explore whether these insights and data from sighted participants could be leveraged for the design of effective teachable object recognizers for blind users. Our rationale is that insights from this study can perhaps enable us to decouple non-experts' misconceptions from challenges in camera manipulation among blind users [146].

Chapter 9: Designing a Teachable Object Recognizer with Training Set Descriptors for Blind Users

9.1 Motivation and Introduction

In Chapter 8, we defined some attributes of photos that would affect the performance of a teachable object recognizer. We see that the attributes describing the photos and teaching strategies of sighted users can be leveraged to serve as descriptors in teachable object recognizers, where descriptors inform blind users of the attributes of their training photos. To demonstrate this implication, we built TOR, an accessible teachable object recognizer that enables blind users to review their training photos through a set of descriptors. In this chapter, we design and implement the TOR app with descriptors and evaluate it with blind participants through a simulated controlled study that was conducted in blind participants' homes due to COVID-19. The user study explores blind participants' experience with the TOR app and its descriptors by asking blind users to use a prototype of the app and asking questions about their experiences. In addition, while the user study in Chapter 8 explores the patterns of training and testing a personal object recognizer with sighted participants, the user study in this chapter examines the interactions between blind users and a mobile personal object recognizer app more thoroughly, including the tasks of training, testing the app, managing items, and iterating on these processes to improve the app. We analyze the blind users'
training and testing strategies based on their photos taken during the tasks and their subjective feedback after the tasks. We report on a user study with 12 blind participants. We found that the descriptors in the TOR app benefited the users in their experimentation with collecting good training examples. Though the descriptors provided only approximate estimates of photo attributes, with some errors, the participants could understand the attributes that would affect the performance of TOR and inspect their training examples with the descriptors. With an interface design based on findings from prior studies, the subjective evaluation by the blind participants showed that they could effectively train the object recognizer, test it, and manage the information of the objects in their training sets. However, they pointed out important design issues that should be resolved in future work, such as the time-consuming process of collecting many photos for training. The accuracy of recognizing objects with TOR trained by the blind participants was only 0.65, which is low considering that we included only three objects in the study, though the objects were engineered for a worst-case scenario. We identified possible reasons for this through an analysis of the participants' photos, revealing a lack of variation as well as cropped objects and cluttered backgrounds in both their training sets and their test examples. To the best of our knowledge, this is the first work to propose non-visual access to, and to provide empirical results with blind participants on, automatically estimating and incorporating accessible descriptors for inspecting training data in teachable computer vision applications. Our analysis focuses on object recognizers, where "learning to train" is deemed one of the main challenges among blind users [20, 195].
However, we see how the underlying methods for extracting meaningful instance- and set-level descriptors can be adopted for other teachable assistive technologies. Perhaps they can also serve as a first step towards more accessible approaches to explainable AI interfaces, where there is an underlying assumption about people's ability to visually inspect explanations [197, 198, 199].

9.2 Method

We built a prototype of the TOR app on an Apple iPhone 8 and evaluated the app design through a user study with blind participants.

Figure 9.1: Screenshots from the TOR app indicating, from left to right, the home screen, teach screen, teach screen with descriptors, teach screen with the number of remaining photos notification, review screen (top), and review screen (bottom).

9.2.1 Interface Design

As shown in Figure 9.1, when users open the app, they enter the main screen, which has three buttons: Scan item, View items, and Teach TOR (Figure 9.1). With these, users can recognize an object, manage items in the training dataset, and collect photos of an object of interest for training the recognition model, respectively.

9.2.1.1 Training

The Teach TOR button brings users to a screen where they can take photos for training an object recognition model. The training process includes three steps: taking photos, reviewing the training examples with descriptors, and providing information (i.e., labeling and recording an audio description) for the object.

Taking photos. When users select the Teach TOR button, the app displays a camera screen where users can take a photo using the shutter button at the bottom center (Figure 9.1). As soon as they take a photo, the photo is sent to a back-end server which calculates the descriptors of the photo that may affect the performance of an object recognition model. The descriptors are instance- and set-level attributes (Table 9.1) devised to help blind users examine their photos in terms of their quality for the purpose of training.
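To make the attribute definitions concrete, here is a rough sketch of how some of these checks might be computed. These are our own simplified stand-ins: the app's actual estimators (detailed later, in section 9.2.2.2) rely on a YOLOv3 detector, a hand-segmentation model, and ARKit camera poses, and the thresholds below mirror the ones reported there.

```python
import math

def small_object(bbox, img_w, img_h, min_frac=0.125):
    """bbox = (x, y, w, h) in pixels, e.g., from an object detector.
    Flags the object as too small below 1/8 (12.5%) of the image area."""
    x, y, w, h = bbox
    return (w * h) / (img_w * img_h) < min_frac

def cropped_object(bbox, img_w, img_h):
    """Flags the object as cropped when its box touches the image border."""
    x, y, w, h = bbox
    return x <= 0 or y <= 0 or x + w >= img_w or y + h >= img_h

def blurry_photo(gray, threshold=3.0):
    """Variance of a 3x3 Laplacian response over a grayscale image
    (nested lists of 0-255 values); sharp edges give high variance."""
    vals = [(-4 * gray[r][c] + gray[r - 1][c] + gray[r + 1][c]
             + gray[r][c - 1] + gray[r][c + 1])
            for r in range(1, len(gray) - 1)
            for c in range(1, len(gray[0]) - 1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals) < threshold

def perspective_variation(orientations):
    """Std. dev. of pairwise cosine similarities between per-photo camera
    orientation vectors (the ARKit fallback described in 9.2.2.2)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(x * x for x in b)))
    sims = [cos(a, b) for i, a in enumerate(orientations)
            for b in orientations[i + 1:]]
    mean = sum(sims) / len(sims)
    return math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
```

For example, a uniform gray image is flagged as blurry (zero Laplacian variance), while identical camera orientations across all photos yield a perspective variation of zero.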
As soon as a photo is taken, users are notified of the instance-level attributes (e.g., a cropped object in the photo) with synthesized speech. The attributes are visually displayed on the screen at the same time, as shown in Figure 9.1. For every object, users take 30 photos, with the count indicated in real-time.

Reviewing the photos with descriptors. As shown in Figure 9.2, when users are done taking the 30 training photos, the app presents a screen with the set-level attributes indicating how much variation there is among the photos in terms of object size and perspective as well as background. Moreover, an aggregate of the instance-level attributes is given, indicating the total number of photos where the object is too small or cropped, the image is blurred, or a hand is present. After reviewing the descriptors, users may select OK to proceed or Retrain to restart the training process.

Instance-level attributes
  Small object: The bounding box of the object is smaller than 1/8 (12.5%) of the image.
  Cropped object: The object is partially included in the image.
  Blurry photo: The photo is too blurry to recognize textures or texts.
  Hand in photo: A user's hand is visible in the image.
Set-level attributes
  Variation in size: A set of images shows objects with different sizes.
  Variation in perspective: A set of images shows different sides of objects.
  Variation in background: A set of images shows backgrounds with different textures or items.
Table 9.1: Our descriptors for reviewing photos are informed by prior studies exploring how people who have no machine learning expertise synthesize their data for training and iterate on them when they can access them visually [1, 2?].

Figure 9.2: Screenshots from the TOR app indicating, from left to right, the labeling screen, home screen when training is in progress, home screen with a recognition result, list of items screen, item information screen (top), and item information screen (bottom).

Providing information about the object.
Before the object recognition model is trained with the photos, users need to provide a name and, optionally, an audio description for the object. A dialogue box with a text field shows up so that users can enter the name of the object, which will be used as a label for training (Figure 9.2). Once this step is completed, the app notifies users that training has started. At this moment, the object recognition model is trained with the photos on the server side. While training is in progress, the Scan item and Teach TOR buttons on the main screen are disabled. Users are notified when training is done.

9.2.1.2 Recognizing Objects

The main screen shows a camera view to allow users to take a photo with the Scan item button. The Scan item button is disabled when users have trained the object recognition model with fewer than three objects. After training the model with three or more objects, users can recognize objects by taking photos of them with the Scan item button. When a photo is taken, it is sent to a server where the user's personal object recognition model makes a prediction. The mobile app plays a synthesized speech rendering of the label and visually displays it on the screen. Users hear the label within 100 milliseconds after taking a photo. To distinguish objects not in the training dataset, we employed an approach that quantifies the confidence level of the discrimination with the entropy of the confidence scores [200]. Specifically, when the entropy value is greater than 2.0 or the confidence score is lower than 0.4, the app says "Don't know" in synthesized speech instead of the label from the model. The thresholds for the entropy and confidence score were decided based on internal tests conducted by our research team.

9.2.1.3 Managing Items in One's Dataset

When users select the View items button on the main screen, the app shows a screen with a list of items (Figure 9.2). The list includes the names of objects, the dates when the objects were added, and thumbnail images.
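Returning to the rejection rule in the Recognizing Objects section above: with per-label confidence scores in hand, the "Don't know" logic can be sketched as follows. The function name is ours, the thresholds are the reported ones, and we assume base-2 entropy, which the text does not specify.

```python
import math

def predict_or_reject(confidences, entropy_max=2.0, confidence_min=0.4):
    """Return the top label, or "Don't know" when the model seems unsure.

    confidences: dict mapping label -> softmax confidence score.
    Thresholds mirror those in the text (entropy > 2.0 or top score < 0.4),
    which were tuned through the research team's internal tests.
    """
    # Shannon entropy of the score distribution: high entropy means the
    # scores are spread out and no single label clearly dominates.
    entropy = -sum(p * math.log2(p) for p in confidences.values() if p > 0)
    top_label, top_score = max(confidences.items(), key=lambda kv: kv[1])
    if entropy > entropy_max or top_score < confidence_min:
        return "Don't know"
    return top_label
```

For example, `predict_or_reject({"Fritos": 0.9, "Cheetos": 0.05, "Lays": 0.05})` returns "Fritos", while a near-uniform score distribution is rejected. Note that under this sketch, with only three trained objects the maximum possible entropy is log2(3) ≈ 1.58 bits, so the confidence threshold is the test that fires in practice; the entropy cutoff becomes relevant as more objects are added.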
When users select one of the items, the app brings them to a screen with the descriptors and the photos that they took to train the object recognizer (Figure 9.2). On this screen, users can select the edit button to change the name of the object and re-record the audio description, as they did when entering the name and recording an audio description of the object for training (Figure 9.2).

9.2.2 Implementation

We built TOR as an iOS app on the Apple iPhone 8. For the object recognition and the estimation of the descriptors, we use various computer vision techniques such as image classification, object detection, and hand segmentation. To speed up the calculations for real-time interactions, these functions run on a back-end server with GPUs, though the promise of teachable object recognizers is that eventually they will run on the device for more privacy. The TOR app and the server communicate through HTTP.

9.2.2.1 Object Recognition Model

The base model for object recognition is Inception V3 pre-trained on ImageNet [5]. When users train the TOR app, the last layer of the base model is fine-tuned using transfer learning with the photos taken by the users. The transfer learning is done with a gradient descent algorithm with 500 iterations and a 0.01 learning rate. For example, training takes around 80 seconds with 90 photos of three objects.

9.2.2.2 Descriptors

The attributes in Hong et al. [201], which inspired our descriptors, were originally coded manually by two researchers after visually inspecting the photos taken by the participants. Given that this is not a trivial process, methods like Wizard of Oz were not deemed appropriate in this early exploration of these descriptors for facilitating experimentation that is accessible to blind users.
Thus, we opted for methods that attempt to automatically estimate them, even though developing techniques for more accurate estimation is beyond the focus of this paper and is briefly discussed in section ??. In the current version of TOR, the descriptors are estimated with the following approach:

- Small object: The bounding box of an object in the image is detected by a YOLOv3 object detection model [202]. The object is considered too small if the size of the bounding box is smaller than 1/8 (12.5%) of the image.
- Cropped object: If the bounding box is at the edge of the image, we consider the object cropped.
- Blurry photo: An image is converted to grayscale (with pixel values in the range 0-255). If the variance of the pixels in the output of Laplacian edge detection [? ] is lower than 3.0, the photo is considered blurry.
- Hand in photo: The pixels belonging to a hand are detected via a hand segmentation model that has been previously tested with blind participants [140]. If these pixels make up more than 0.3% of the image, a hand is detected.
- Variation in size: The position of the camera (i.e., the smartphone device) is detected using the 3D coordinate system in ARKit (https://developer.apple.com/documentation/arkit/content_anchors/scanning_and_detecting_3d_objects) of iOS when a photo is taken. As the size of the object changes depending on the distance between the camera and the object, the differences in the positions of the camera are measured. The variation is calculated using the standard deviation of these differences.
- Variation in perspective: The sides of an object are detected using the 3D object detection in ARKit. The number of sides shown in the photos is used to measure the variation in perspective. While the object can be detected with ARKit when it is captured clearly against a plain background, this does not work for photos with a cluttered background or a cropped object.
When the object is not detected in a user's photos, the orientation of the camera was used to measure the variation in perspective instead of the number of sides, assuming that a user would move the camera to capture different sides of the object. In this case, we calculated the standard deviation of the cosine similarities between the orientations in the 3D coordinate system of ARKit to measure the variation in perspective.

- Variation in background: Assuming that the backgrounds captured in photos vary as a user moves the camera to different places or changes its orientation, we used the orientation and the location of the camera to measure the variation in the background. The standard deviations of the differences in both orientations and locations in the 3D coordinate system are calculated, and the maximum of the two standard deviations is selected as the variation in background.

9.2.3 Procedure

To explore the usability of TOR and its potential for increasing the accessibility of experimentation with teachable object recognizers, we conducted a user study with blind participants. Participants were asked to train the app to recognize three snacks. Participants carried out tasks of training and testing their object recognition models and reviewing the information of the items in their training dataset with the app. After each task, we asked questions about their experiences. The study was approved by the IRB at [Anonymized institution] (IRB #: anonymized).

9.2.4 Participants

We recruited 12 blind participants (6 female, 6 male) from campus email lists and local organizations (Table 9.2). The participants ranged in age from 32 to 70 (M = 54.3, SD = 15.2). They self-reported being totally blind (N = 3), having some light perception (N = 5), or being legally blind (N = 4). All participants used smartphones several times a day. P1 and P2 reported having some hearing loss (auditory processing disorder and difficulty hearing high frequencies, respectively).
All participants reported that they take a photo or record a video at least once a month. When asked to report their level of familiarity with machine learning on a 4-point scale: not familiar at all (have never heard of machine learning); slightly familiar (have heard of it but don't know what it does); somewhat familiar (have a broad understanding of what it is and what it does); extremely familiar (have extensive knowledge of machine learning), two participants selected not familiar at all, eight selected slightly familiar, and two reported being somewhat familiar.

ID    Age   Gender   Level of vision    Age of onset   Familiarity with ML*
P1    39    Female   Light perception   Birth          Not familiar at all
P2    67    Male     Legally blind      55             Slightly familiar
P3    62    Female   Totally blind      Birth          Somewhat familiar
P4    32    Male     Legally blind      20             Slightly familiar
P5    66    Male     Light perception   46             Slightly familiar
P6    61    Male     Light perception   41             Somewhat familiar
P7    70    Male     Legally blind      Birth          Slightly familiar
P8    50    Female   Legally blind      45             Slightly familiar
P9    69    Female   Totally blind      55             Not familiar at all
P10   66    Female   Light perception   Birth          Slightly familiar
P11   33    Female   Light perception   Birth          Slightly familiar
P12   36    Male     Totally blind      Birth          Slightly familiar
*ML: Machine learning
Table 9.2: Participants' characteristics.

9.2.5 Procedure

The study took place in participants' homes due to COVID-19. The participants were given a laptop and wore Vuzix Blade smart glasses [? ] with an online meeting application (Zoom) to communicate with the experimenter remotely. The study consists of three tasks: 1) training the TOR app with photos of objects, 2) testing the performance of the TOR app, and 3) reviewing and editing the information of the objects. At the beginning of the study, we explained the concept of TOR briefly, with minimal description of how to take photos to train or test the app effectively, so that we could observe participants' strategies for taking photos for training and testing an object recognizer.
The description of the app given at the beginning of the study reads as follows: "The idea behind the app is that you can teach it to recognize objects by giving it a few photos of them, their names, and if you wish, audio descriptions. Once you've trained the app and it has them in its memory, you can point it to an object, take a photo, and it will tell you what it is. You can always go back and manage its memory."

Figure 9.3: Object stimuli in the study: Fritos, Cheetos, and Lays.

Participants were asked to train the app with photos of the three objects in Figure 9.3. The order of objects was fully counterbalanced between participants. When participants trained the app with the first object, the experimenter provided instructions on the user interface of the app step by step. For the second and third objects, participants were asked to train the app by themselves, and they could ask the experimenter about the user interface if necessary. After training the app with three objects, they tested the performance of their models by taking photos of the objects. Participants had no restrictions on how many photos they needed or how the objects should be captured during the tests. They measured the performance of their object recognizer and decided for themselves when to finish testing it. After the tests, participants were asked to review the information of an object (i.e., descriptors, label, and audio description) and edit its label at the end. Throughout the study, we encouraged participants to think aloud and to ask questions at any time. For each task, we asked questions related to the experience with teachable interfaces and usability satisfaction questions developed by Lewis [? ]. At the end of the study, we also had a post-task interview with open-ended questions about their overall experience with the app.
All questions in this study were either open questions or on a 5-point Likert scale (i.e., strongly disagree, disagree, neutral, agree, strongly agree).

9.2.6 Object Stimuli

Based on the need for recognizing objects with similar sizes, weights, and textures [20] with fine-grained labels, we used three snacks of the same size and texture and nearly identical weights, shown in Figure 9.3. With these snacks, we could simulate a scenario in which a blind user uses TOR to recognize different objects that are difficult to distinguish by tactile sensation alone. It is engineered to be a challenging scenario as it involves fine-grained recognition of similarly shaped and colored deformable objects with reflective surfaces. Unique and personal objects without logos or text on them (e.g., a key or a mug) could potentially be used with TOR and perhaps would fit a more realistic scenario. However, for this study, we included only commercial products to allow for comparison and replicability, similar to what other prior studies on object recognition have done [20, 146, 203].

9.3 Results

We found that participants could train an object recognition model, test to see how well it works, and review the information of their training examples with TOR. All participants completed the tasks in the study successfully. While participants' responses to the questions in the study revealed that it was generally easy to complete these tasks, they also pointed out some design issues that would improve the usability of the app. Moreover, analysis of participants' feedback and photos revealed some problems in their strategies for taking photos for training and testing an object recognition model, which could potentially be resolved with guidance from the app.

Figure 9.4: Participant responses to questions about their training experience during the study.

9.3.1 Training

We evaluate the interface design based on participants' feedback.
We analyze the interaction with descriptors and participants' strategies for training an object recognition model in depth.

9.3.1.1 Interacting with the Interface for Training

Participants spent 143.8 seconds (SD = 72.4) on average to take 30 photos of an object. Six participants re-trained the app with at least one object after reviewing the training photos with descriptors. All participants completed the training task, though the performance of the object recognition model varied across participants. When asked if they could train the app effectively, ten participants agreed that they could. Seven participants thought the training interface was easy to learn and straightforward. P1 and P10, for example, who were not familiar at all and slightly familiar with machine learning, said "after a while, I learned that I could train it" and "It's pretty easy. You have to teach me though. But if you teach me then it's pretty easy to follow instruction and finish the process," respectively. Three participants mentioned that the descriptors helped them understand their training examples. For example, P3 said, "(I could train it effectively) because you can use the feedback for determining if you've gotten a good representation of the object." On the other hand, P11 and P12 neither agreed nor disagreed that they could train the app effectively. P11 pointed out that taking 30 photos is a time-consuming task, saying "I don't really feel like I was all that effective, because it takes a while to train for each one." P12 thought the descriptors were not helpful due to errors. P12 mentioned "I don't think that the app is correct, especially when I know, for example, that my hand was not in the photo...I don't have a lot of confidence in the app's accuracy." When asked whether they could train the app quickly, five participants agreed, but four disagreed and three were neutral. Seven participants thought that taking 30 photos is tedious.
For example, P10 said, "The process is pretty straightforward. But I have to spend, like, quite a long time to train the three objects." To resolve this problem, P6 suggested allowing users to record a video to shorten the step of collecting multiple photos of an object. For the question about the difficulty of the training task, all but one participant agreed or strongly agreed that the task was not difficult. P11, who was neutral, thought it was not difficult but tedious. While participants could record audio descriptions if they wanted, only four participants did during the study. One of the participants added the audio description to clarify the object with details. P10 said "The reason I did with the Lays is because Lays makes different things. They make potato chips, Fritos. So I wanted to say it was potato chips because I thought that's what they were." Six participants thought that the label itself was enough to understand the object. However, three of them mentioned that they would add audio descriptions for other kinds of objects. P7 said, "[...] regarding some food, if it were, for example, milk, I might add it for any of the other items that have an expiration date." Another reason for not adding the audio description was that two participants did not like listening to their own voice from the app. On the other hand, P2 thought it is easier to understand his own voice than synthesized speech.

9.3.1.2 Interacting with the Descriptors

All but one participant (P1) agreed or strongly agreed that the descriptors are easy to understand. This indicates that the factors that may affect the performance of the object recognizer are easily understandable to the users. P6 said "I understood what it was telling me. I didn't have questions about what I was supposed to do." However, the values of descriptors presented while the participants reviewed the photos could be somewhat ambiguous to participants, as mentioned by P1 who was neutral on this.
P1 said, "I guess just knowing exactly what they're referring to, what numbers are, really preferable." P4 also mentioned the challenge in understanding the values, but he could figure it out based on his experience during the study. P4 said "I wasn't aware of any of those fields when we did the first object [...] For the second and third objects, I could take a little bit more variation in the photos or to better train the application." While the current app had some errors in estimating the descriptors of photos, participants thought the descriptors are useful for understanding what to do to collect good training photos. To measure the accuracy of the descriptors, we calculated the correlation coefficients between the estimated and manually annotated attributes. The estimated attributes are the percentages of variations and the number of photos with instance-level attributes estimated by the app during the study. A researcher manually annotated the photos from the participants based on the definitions of the descriptors in Section 9.2.2.2. To quantify the variation of background and perspective, the researcher grouped the photos in a training set with the same background and the same side of the object. The groups are used to calculate the Shannon-Wiener Diversity Index [204]. The cropped-object, hand-in-photo, and blurry-photo attributes are simply quantified as the number of photos with the attributes identified through visual inspection. For the attributes related to the size of the object (i.e., variation in size, too small object), the researcher annotated the bounding boxes of the objects. The manually annotated variation-of-size and too-small-object attributes are, respectively, the standard deviation of the sizes of the bounding boxes, which range from 0.0 (i.e., the object is not captured) to 1.0 (i.e., the size of the photo), and the number of photos with bounding boxes smaller than 12.5% of the photo.
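As a concrete sketch of the annotation metrics just described — the Shannon-Wiener Diversity Index over groups of photos sharing a background or object side, the standard deviation of normalized bounding-box sizes, and the count of boxes under the 12.5% threshold — the following Python functions illustrate the computations. The function names are hypothetical, not the study's actual analysis code:

```python
import math
from collections import Counter

def shannon_diversity(groups):
    """Shannon-Wiener Diversity Index H over group labels.

    Each photo is assigned a group (e.g., photos sharing the same
    background, or showing the same side of the object); H = 0 means
    no variation across the training set.
    """
    counts = Counter(groups)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def size_metrics(box_areas, too_small_threshold=0.125):
    """Variation-in-size and too-small-object metrics.

    `box_areas` holds bounding-box areas normalized by the photo area,
    ranging from 0.0 (object not captured) to 1.0 (object fills the
    photo). Returns the standard deviation of the areas and the number
    of boxes below the 12.5% threshold.
    """
    mean = sum(box_areas) / len(box_areas)
    sd = math.sqrt(sum((a - mean) ** 2 for a in box_areas) / len(box_areas))
    too_small = sum(1 for a in box_areas if a < too_small_threshold)
    return sd, too_small
```

For instance, a training set shot entirely against one background yields a diversity index of 0, matching the no-variation criterion used later in the analysis.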
The low correlation between the manually annotated and estimated attributes (Figure 9.6) shows that the descriptors had some flaws. The correlation coefficients between them ranged from 0.23 to 0.57. No photos were estimated by the app as having too small objects, while the manually annotated bounding boxes in three photos in Figure 9.5 were smaller than 12.5% of the photo due to cropped or obscured objects. While we employed naive approaches for estimating the descriptors as a proof of concept, generating accurate descriptors is a complicated problem.

Figure 9.5: Training photos annotated as having too small objects (target objects are marked with blue dotted rectangles).

Some possible reasons for the low correlations are poor object detection in idiosyncratic environments (i.e., textures of backgrounds, light conditions, cluttered photos) and a mismatch between the movement of the camera and the visual changes in photos. We discuss the technical problem of estimating attributes in Section ??. While the descriptors had some errors, ten participants agreed or strongly agreed that the descriptors were useful. P10 and P11 thought the descriptors helped them understand how to collect training examples for the object recognizer. P10 said, "(I agree) because I know the quality of the photos, the different aspects of the photos that I take." P11 said, "It helped me understand what the camera needed in order to recognize the objects." Participants also used them to diagnose problems in their training sets. P10 elaborated "you have to get feedback or you're not going to improve [...] it helps you to understand what you're doing wrong." P2 had a similar idea: "the explanation afterward, in the analysis, told me that my photographs were not always good, so I have to learn to take better photographs." On the other hand, P11 neither agreed nor disagreed that the descriptors are useful. P12 thought they were not useful because of the errors.
P12 said, "I don't think that the app is correct, especially when I know, for example, that my hand was not in the photo, or that the object is not cropped because the previous objects were cropped."

Figure 9.6: Scatter plots with the manually annotated values on the x axis and estimated values on the y axis. The correlation coefficient (r) and p-value (p) are specified in the plots.

Participants thought that specifying how to resolve issues would improve the descriptors. P7 suggested integrating the feedback for blind photography (e.g., [141, 146]) with the descriptors, elaborating "Cropped, it did not help me know what to do differently. If it said, maybe move up, move down and move camera left, move the camera right. That would have been more useful." P6 mentioned that an interface for replacing problematic photos in a training set would improve the app. He said "I would assume the training process can self-evaluate itself and it should sum that up for me and tell me what photos I should replace. [...] you need to replace those bad pictures unless you don't need them for the training."

9.3.1.3 Training Strategies

When participants finished the training task, we asked them what they thought was important to consider when training the object recognizer. The most frequent responses were about diversifying the photos. Five participants thought varying the distance between the camera and the object was important. Another five participants intended to vary the perspectives in photos. Some reasons for these variations were: they wanted to include photos that would be similar to the photos they would take for recognition, with different perspectives and sides; and they hoped the app would learn the visual information from different sides of the objects. Four participants mentioned that centering the object was important. The participants' responses revealed that the descriptors affected their strategies in collecting photos.
For example, P3 said "It was important to consider the instances of cropped photo and handed photo. You know, it was always good to hear when it would just click the shutter and then not hear those two things." and P7 mentioned, "you can make adjustments very easily so there's a good chance you're going to get a reasonable percentage that would help you to identify the object." We also analyzed the photos from the participants based on the manually annotated descriptors to identify patterns and problems in the photos. The majority of participants varied the photos in their training sets with at least one object. Eight and six participants varied the background and perspective, respectively (i.e., the diversity index is greater than 0). Eight participants varied the size of objects captured in photos (i.e., the standard deviation of the sizes of bounding boxes is greater than 0.1). On the other hand, we found that seven participants took photos with no variation at all with at least one object (i.e., the diversity indices of the distance and perspective variation are 0, and the standard deviation of the sizes of bounding boxes is lower than 0.1). Example photos with no variation are shown in Figure 9.8. This is consistent with the findings from a prior user study exploring non-experts' perception of machine teaching [? ]. It showed that the majority of non-experts in machine learning are aware of the importance of diversity in a dataset, though having no variation at all is frequently observed in their photos. We also observed quality issues in participants' training samples. Four participants took photos with cluttered backgrounds as shown in Figure 9.7.

Figure 9.7: Training photos with cluttered backgrounds.
Figure 9.8: Training photos with little variation.
Figure 9.9: Training photos with problems in framing (i.e., adjusting the distance and centering the object).
Figure 9.10: Test photos with cluttered backgrounds.
The training examples from ten participants included photos with poor image framing (i.e., cropped objects) as shown in Figure 9.9.

9.3.2 Testing (Recognizing Objects)

We evaluated the interface design for testing (i.e., recognizing objects) based on the responses from participants. The responses and photos revealed patterns in how users test an object recognizer and interpret their test results.

Figure 9.11: Participant responses to questions about their testing experience during the study.
Figure 9.12: The number of tests per object.
Figure 9.13: The proportion of errors and number of tests.
Figure 9.14: The number of tests per object and proportion of errors.

9.3.2.1 Interacting with the Interface for Testing

After testing the app, we asked participants if they could test the object recognizer effectively and quickly. These questions are about the effectiveness of the interface and the time for understanding the performance of the object recognizer, regardless of its performance. Ten participants agreed or strongly agreed that they could test the object recognizer effectively and quickly. Most participants thought it was easy and straightforward. P10 said "I felt like I went through it pretty quick. I felt like I understood what to do." On the other hand, two participants, P2 and P3, who disagreed, pointed out that the misrecognitions in the tests made it hard for them to evaluate the app. P3 said "(I disagree because) I got different results (with one object). I'd want to be certain about what I was getting."

9.3.2.2 Strategies for Testing the App

As prior studies showed that it is challenging for non-experts to test a machine learning model systematically [20, 157, 205], the analysis of participants' testing samples revealed some patterns that may be problematic for conducting a thorough evaluation of an object recognizer. During the testing task, participants took 3.7 photos per object on average (SD = 3.2).
Considering that the test data need samples with different visual contexts (e.g., sides and sizes of objects, background, light condition) to test the object recognizer thoroughly, participants likely had fewer photos for testing than necessary for a thorough evaluation. Looking at the number of test photos per object (Figure 9.12), we observed that the number of test samples differed across objects. The number of errors would affect the number of tests, as the correlation between the number of errors and test samples per object (Figure 9.13) is strong (Pearson correlation, r = 0.82, p < .001). The average accuracy (i.e., the number of correct predictions divided by the number of total test samples) of the object recognition models was 0.65 (SD = 0.24) when they were tested by the participants (Figure 9.15). While machine learning models are typically evaluated with large benchmark datasets, the test sets from participants include photos from participants' idiosyncratic environments. Looking into the test photos, we found that they had quality problems that would affect the validity of the participants' evaluations. One of the frequent problems was image framing. The test photos from four participants included less than half of the objects. We also observed that four participants took photos capturing two or three snacks, making it hard for the app to distinguish which one they wanted to recognize (Figure 9.10). The problems in the test sets would be critical, as the perceived and actual performance of TOR may differ when it is used after training.

9.3.2.3 Interpreting the Test Results

When we asked participants if they were satisfied with the performance of the object recognizer, five participants agreed or strongly agreed, six participants disagreed or strongly disagreed, and one participant was neutral.
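The two quantitative measures used in this analysis — per-model accuracy as correct predictions over total test samples, and the Pearson correlation between error counts and test counts per object — can be sketched in plain Python. The helper names are illustrative, not the study's analysis code:

```python
import math

def accuracy(predictions, ground_truth):
    """Number of correct predictions divided by total test samples."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Under the pattern observed in the study, objects with more recognition errors attracting more test photos would yield a strong positive r, as with the reported r = 0.82.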
Looking at their responses and the performance of the object recognizer together (Figure 9.16), we observed that participants were not satisfied if the accuracy was lower than 0.6. On the other hand, the accuracy spread between 0.6 and 1.0 among participants who were satisfied with the performance. Based on the responses from the participants, we found that performance was not the only factor that affects users' satisfaction. One of the factors was the amount of effort required to train the object recognizer. P11 was neutral, though she did not observe any errors, because the training task was tedious. P11 said, "Because it took so much work to get that small amount of performance." P7 and P10 agreed that they were satisfied with the performance, though the accuracy was only 0.6 and 0.4, respectively. P7 has low vision, and the app's performance was enough to supplement his vision. P10 thought she did not train the app properly with the object that the app misrecognized. P10 said, "I think it recognized objects, but if you don't train it properly, then it's not going to recognize anything [...] the Fritos bag was the one that didn't work out, but that was probably my fault." While nine out of the 12 participants observed misrecognitions during the tests, the participants mostly did not have any idea of why they happened. Six participants were neutral or disagreed that they had a good sense of why the misrecognitions happened. Their responses were simply "I have no idea." or "I don't know." Though P7 and P10 strongly agreed and agreed, respectively, they had only abstract ideas about the errors. P10 said "I think it was my fault. I think it was my training. Other than that, I don't know." P9 strongly agreed because the descriptors provided feedback that the samples had problems, elaborating "The reason is because I was teaching it, and I wasn't 100% sure that it was 100% accurate.
It makes sense that while I was teaching it, I was a little bit off, so its recognition was a little bit off. It kept telling me that the hand was in the photos."

9.3.3 Managing Items in the User's Dataset

All participants successfully reviewed the information of objects (i.e., listening to descriptors, audio descriptions, and labels) and edited the label of an object. This may be because the task consists of basic interactions used in other apps, such as navigating through a table view and entering text in a text field.

Figure 9.15: The accuracy of the object recognition models tested by the participants.
Figure 9.16: Average accuracy versus satisfaction with the performance. The red dots are means.
Figure 9.17: The number of tests per object and proportion of errors.
Figure 9.18: Participant responses to questions about their reviewing and editing experience during the study.

They also agreed or strongly agreed that they could complete the task with this app effectively and quickly. Participants said "It was easy to do", "That was easy", and "the steps to be taken on the app were easy to follow." P12 thought the interface could be improved by integrating the review and edit interfaces. He elaborated "Because it's pretty easy, although it could be more streamlined. I would think that you could put the name right on the review screen just as a text field." P6 pointed out that the difficulty of using this interface would depend on experience with the iPhone, as the app uses typical designs and controls of iOS apps. P6 said, "I guess if you never used an iPhone before, it might be a little bit of an effort but the interface is compatible with VoiceOver."

9.3.4 Overall Experience

At the end of the study, all participants agreed or strongly agreed that it was easy to learn to use the TOR app.
Nine participants agreed or strongly agreed that it was simple to use the TOR app overall, while two participants were neutral and one disagreed, as shown in Figure 9.19. These three participants thought the training process was inefficient. They pointed out that taking 30 photos for training is tedious and inefficient. In particular, P12, who disagreed, could not come up with any case in which training would be necessary, assuming that users already know about an object when they train the app. P12 said "The very fact that I have to teach it makes it inefficient. If I have to teach it what an object is, then I already have to know what an object is." Based on the responses from participants, we found that they have different attitudes toward a teachable interface. For example, P11 thought the effort of training is large considering that the information from the object recognizer is small. P11 said "I honestly feel like it takes too long to do that. I feel like if you have to train it to recognize things, you're not going to be as efficient. I like the other way better, where you just have it read the label (using a text recognizer)." On the other hand, P9 was positive about having a teachable interface for an object recognizer. She said "To identify what an object is so handy. And then to be able to teach it you know, to identify items that may not already be there is particularly powerful because you know, something's not there, you have the ability to include it." All participants thought that the organization of the interface is clear. All but one participant agreed or strongly agreed that it was easy and quick to recover from mistakes using the interface. P11, who disagreed, thought it was hard to figure out a way to fix problems indicated by the descriptors.

Figure 9.19: Participant responses to questions about their overall experience during the study.
She said "If you're getting bad images, you need to take several sometimes as you're trying to figure out what angles and everything to use. Honestly, it's not quick, not really efficient." We got a similar response from P10 when we asked participants if they agree that the TOR app has all the functions and capabilities they expect it to have. P10 elaborated "What does it mean when it says it's cropped? [...] Like, if you get this feedback, what should you do? They don't know what to do." Four participants disagreed that the app has all important functions and capabilities. They expected the app to provide: more detailed information about the snacks, such as ingredients, in recognition results (P2); better object recognition performance (P5); an interface to replace a subset of training examples (P6); and feedback for fixing problems in descriptors (P10).

9.4 Discussion

9.4.1 Usability Issues of TOR

Through the evaluation of the app design, we observed that participants could easily understand and carry out the tasks of training an object recognition model, testing it, and managing the information of objects in their datasets. While the responses from the participants were positive with regard to the user experience with the app, they also raised issues that provide useful insights for designing a TOR app in the future. One of the critical problems was that taking many photos for training (i.e., 30 photos in this study) would be tedious for some blind users. Some possible solutions to this problem would be using computer vision techniques that require a smaller number of photos (e.g., one-shot learning [206]) or extracting frames from a short video. To achieve this, a future study is needed to find a number of photos that balances the performance of an object recognizer and the usability of a teachable interface.
While the descriptors only approximately estimated the attributes of photos in this study, the majority of the participants thought they were helpful. They also pointed out some ways to improve the interaction with the descriptors. For example, P8 suggested having an interface that filters out bad images based on descriptors or enables users to replace them instead of retaking all images. P10 mentioned a challenge in resolving problems flagged by descriptors because they inform users of the problems in photos without providing ways to solve them. For example, when an object was cropped in a photo, participants did not get feedback on which direction the camera should move. This indicates that combining the descriptors with systems that provide audio/haptic feedback for blind photography (e.g., [141, 146]) would make the descriptors more effective. When it comes to the usefulness of a teachable interface, participants had different attitudes. From the participants' responses, we could identify both participants who valued the possibility of recognizing personal items with the TOR app and those who questioned the need for a teachable interface for object recognition. The participants with questions thought they would not need an object recognizer if they could train it, because they already know about the object. Though there are scenarios where identifying personal items can be useful (e.g., scanning the surroundings to find an item, distinguishing similar objects) and assistive apps on the market (e.g., Seeing AI, LookTel) are deploying teachable interfaces in object recognizers, blind users may not have a clear motivation to use a teachable interface without instructions. Therefore, as an emerging technology for accessibility, a teachable interface would need to be accompanied by descriptions of real-world scenarios where blind people can use it.
9.4.2 Differences Between the User Study and Real Use Cases

Though we set up a controlled user study to simulate a real scenario of using the app, the study has a few limitations that make it differ from real use. These limitations highlight the need for a future deployment study that enables participants to use the app in their own environments. One limitation is the fact that participants were asked to wear smart glasses and communicate with the experimenter through a laptop computer in front of them. Though these devices were necessary for communication and data analysis, they would limit the participants' behavior, such as walking around with the phone and finding a place for taking photos. For example, when participants wanted to vary the backgrounds in photos, they took pictures with different parts of a table as backgrounds. However, if they could move around outside the user study setup, they would be able to choose completely different locations for background variation. Another limitation is that all participants had to use the app on an iPhone 8 instead of their own mobile devices. As all but one participant had used an iPhone, most participants would be familiar with using iOS apps. However, the size of the device and the position of the camera would affect the quality of photos taken without vision.

9.5 Conclusion

We designed and implemented a mobile TOR app for blind users. We aimed to resolve the known issues found in prior studies on interaction between TOR and blind users. The TOR app design was evaluated through a user study with blind participants. The user study also revealed some patterns in training and testing an object recognition model through an analysis of feedback and photos from the participants. The responses to questions on the usability of the app revealed that participants could easily train the app with descriptors, evaluate it with their test samples, and manage the information of the objects during the study.
However, participants also pointed out some difficulties, such as taking many photos and resolving problems found in the descriptors. Moreover, we observed that the photos from participants had some issues, such as little variation in training photos and cluttered backgrounds in test photos, that would make the training and testing ineffective. The findings from the user study provide insights and research problems for improving the usability of a TOR app for blind users.

Epilogue to Part II

In Part II of this thesis, the challenges of identifying image recognition errors and managing errors with TOR were characterized through crowdsourcing and controlled lab studies with blind and sighted participants. The studies investigated blind users' experience in identifying errors in camera-based assistive tools and their challenges in identifying object recognition errors. In the follow-up studies with TOR, we further looked into the interaction between blind users and a teachable interface for an object recognizer. The study in Chapter 7 investigated the challenge of identifying errors from pre-built camera-based assistive apps, including an object recognizer, with blind users. It revealed that blind users identify errors based on context such as the shape, size, and weight of the object. As with ASR errors, participants rarely verified the outputs from the camera-based assistive apps. On the other hand, while most blind people thought identifying ASR errors was not challenging, around half of the blind participants were aware of the difficulty of identifying image recognition errors. The results of the error detection task showed that blind participants missed more than half of the object recognition errors. The studies in Chapters 8, 9, and ?? characterized blind and sighted people's interactions with TOR.
The analysis of their feedback and photos with web-based and mobile TOR apps revealed that they tend to have problems in their teaching strategies, such as little variation and cluttered photos, that may cause errors in a TOR. The responses of both blind and sighted participants in the user studies in Chapters 8 and ?? revealed that even when they observe errors during the tests, many of them do not know what to do to resolve them. However, some participants in Chapter ?? could figure out what to change in their strategies based on the descriptors of photos in our TOR app. Part II answered the following research questions:

- RQ7: For what tasks and objects do blind users take photos? (The interview with blind participants in Chapter 7 showed that they mostly take photos for text recognition, video calls, and object recognition.)
- RQ8: How did blind users identify image recognition errors? (The most common way for blind people to identify image recognition errors, as reported during the interview in Chapter 7, was to judge correctness for themselves based on context, e.g., the text surrounding a recognized text, or comparing the object recognition result with the shape, size, and texture of the object.)
- RQ9: How accurately do blind users identify object recognition errors? (During the error identification task in Chapter 7, participants identified 49% of the image recognition errors.)
- RQ10: What are their strategies for identifying the errors? (Blind participants in Chapter 7 most often compared the recognition results with the expected objects based on weight, texture, and shape. Some participants compared recognition results between trials. Some participants with low vision used perceived colors and shapes to judge correctness.)
- RQ11: What are non-experts' teaching and debugging strategies for a teachable object recognizer?
(The study in Chapter 8 revealed both promising trends (e.g., incorporating diversity in training examples) and misconceptions (e.g., inconsistency between classes) in photos collected by non-experts.)
- RQ12: Do teaching strategies evolve through iteration? (The study in Chapter 8 showed that strategies did not evolve significantly, because non-experts did not know what to change or did not want to change their strategies.)
- RQ13: How could descriptors help users avoid errors caused by their training examples? (Participants in the user study in Chapter 9 could learn from the descriptors what is important to consider when collecting good training examples.)
- RQ14: What are blind users' teaching and debugging patterns? (The analysis of their photos in Chapter 9 showed trends similar to those found in Chapter 8, though blind users also had image framing problems.)

Chapter 11: Conclusions and Future Work

11.1 Summary of Contributions

This dissertation characterized blind and sighted users' interactions with errors in speech and image recognition systems. This chapter summarizes the key contributions of the thesis and presents directions for future research. Overall, the contributions of this thesis relate to speech recognition systems, object recognition systems, machine teaching, and accessibility.

- Speech recognition systems: establishing a baseline accuracy for identifying ASR errors, and demonstrating that manipulations of synthesized speech can enable users to identify errors more accurately.
- Object recognition systems: understanding the challenges of using camera-based assistive apps; understanding non-experts' strategies for training and testing a teachable object recognizer; enabling blind users to understand and avoid errors by reviewing their training examples with descriptors.
- Machine teaching: identifying research problems in building teachable object recognizers for non-experts in machine learning and for blind users.

11.2 Future Directions

My dissertation research explored the challenges of identifying and avoiding errors in speech recognizers and teachable object recognizers with blind and sighted users. Based on the findings and observations in my research, I highlight the following future directions to facilitate identifying, understanding, and correcting errors in AI-infused systems for blind and sighted people.

11.2.1 Interacting with Speech Recognition Errors

Enabling users to better identify ASR errors with audio-only interactions. As devices with no or very small visual displays (e.g., smart speakers, wearable devices), where speech input is useful, have become increasingly popular, audio-based interactions for identifying, understanding, and correcting errors can be used more frequently in applications such as text editors, office tools, and social media apps. However, audio-only interaction for text entry is still an under-explored area. My previous study presented simple, promising manipulations that increased the accuracy of error identification (i.e., adding pauses between words and using slower speech rates). Even with these manipulations, however, the best average proportion of identified errors was around 0.70, which leaves room for improvement through more elaborate audio-only error identification support. That is, improvement is necessary to bring audio-only text input more in line with the accuracy achievable with visual text entry interfaces. One possibility is to explore audio techniques comparable to visually underlining words that the recognition system deems potentially incorrect.

User interface for correcting errors with audio-only interactions.
Though my dissertation research focused on the challenge of identifying ASR errors in synthetic speech, a system using speech input needs user interfaces not only for error identification but also for error understanding and correction. A possible approach for supporting error identification, understanding, and correction in a non-visual context is a dialogue-based interface that allows users to indicate errors and re-speak misrecognized words through a conversation with the system. Some of the strategies for pointing to errors found in my previous work could be leveraged to develop such dialogue-based interfaces in future work.

11.2.2 Interacting with Error-Prone Image Recognition

Effective teachable interfaces for real-world scenarios. While many parameters of teachable object recognizers can be explored in the future, such as incremental model learning, extreme illumination changes, and video versus images in training, a critical remaining issue is scalability over a long period of time. Although my thesis and prior studies have shown success for moderately sized datasets (e.g., fewer than 20 objects, around 30 photos per object), the number of objects in real use cases would grow over time to hundreds or thousands. This scalability problem affects both performance and usability. As a dataset includes more objects, a teachable object recognizer inevitably makes more errors because the recognition task becomes harder. Moreover, just as machine learning practitioners put substantial effort into curating large datasets from various perspectives (e.g., fairness, diversity, consistent distribution of training and real-world data), it will be difficult for end-users to manage datasets with many classes collected over time. Therefore, future studies are needed to support scalable data management and model evaluation for end-users.
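The scalability trade-off above can be made concrete with a minimal sketch. The following toy recognizer, a nearest-centroid classifier over fixed image embeddings, is purely illustrative (the class name `NearestCentroidTOR` and the random stand-in vectors are assumptions for this sketch, not the recognizer used in this thesis): teaching a new object only appends one centroid, but every added class is another candidate that can win a nearest-distance comparison, so errors tend to accumulate as the taught set grows.

```python
import numpy as np

class NearestCentroidTOR:
    """Toy stand-in for a teachable object recognizer (TOR): each taught
    class is summarized by the mean of its training-photo embeddings, and
    recognition returns the label of the nearest centroid."""

    def __init__(self):
        self.labels = []     # one label per taught class
        self.centroids = []  # one embedding centroid per taught class

    def teach(self, label, embeddings):
        """Add a class from an (n_photos, dim) array of photo embeddings."""
        self.labels.append(label)
        self.centroids.append(np.asarray(embeddings, dtype=float).mean(axis=0))

    def recognize(self, embedding):
        """Return the label whose centroid is closest to the query embedding."""
        dists = [np.linalg.norm(embedding - c) for c in self.centroids]
        return self.labels[int(np.argmin(dists))]

# Teach two household objects; the random vectors stand in for features
# from a fixed extractor (3 "photos" of 8 dimensions per object).
rng = np.random.default_rng(0)
tor = NearestCentroidTOR()
mug_photos = rng.normal(0.0, 0.1, size=(3, 8))
can_photos = rng.normal(1.0, 0.1, size=(3, 8))
tor.teach("mug", mug_photos)
tor.teach("soup can", can_photos)
print(tor.recognize(mug_photos[0]))  # prints: mug
```

In this design, data management cost grows with the number of classes exactly as the paragraph above anticipates: the model itself stays cheap to update, but the user-curated dataset behind the centroids is what must remain varied, uncluttered, and consistent as it scales.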
Descriptors in other teachable assistive applications. My thesis explored the challenge of understanding and avoiding errors in teachable object recognizers, where "learning to train" is deemed one of the main challenges for blind users. To address this challenge, we presented descriptors that allowed blind users to understand the important attributes of a training dataset and to evaluate their training examples quantitatively. While my thesis focused on a teachable object recognizer, this challenge is common to teachable assistive applications in which users are required to build their own datasets. The underlying methods for extracting meaningful descriptors could be adopted for other teachable assistive technologies and user groups, such as teachable sound detectors for Deaf/deaf and hard of hearing people.
In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248?255, 2009. [6] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206?5210, 2015. [7] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke. The microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5934? 5938, 2018. [8] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the aaai conference on artificial intelligence, volume 33, pages 4780?4789, 2019. 183 [9] Yingbo Zhou, Caiming Xiong, and Richard Socher. Improving end-to-end speech recognition with policy learning. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5819?5823. IEEE, 2018. [10] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, and et al. Guidelines for human-ai interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ?19, New York, NY, USA, 2019. Association for Computing Machinery. [11] Andrew Begel, John Tang, Sean Andrist, Michael Barnett, Tony Carbary, Piali Choudhury, Edward Cutrell, Alberto Fung, Sasa Junuzovic, Daniel McDuff, Kael Rowan, Shibashankar Sahoo, Jennifer Frances Waldern, Jessica Wolk, Hui Zheng, and Annuska Zolyomi. Lessons learned in designing ai for autistic adults. In The 22nd International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ?20, New York, NY, USA, 2020. Association for Computing Machinery. [12] Matthew A Fox, Carl J Aschkenasi, and Arjun Kalyanpur. Voice recognition is here comma like it or not period. 
The Indian journal of radiology & imaging, 23(3):191, 2013. [13] Shiri Azenkot and Nicole B. Lee. Exploring the use of speech input by blind people on mobile devices. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ?13, New York, NY, USA, 2013. Association for Computing Machinery. [14] Christine A Halverson, Daniel B Horn, Clare-Marie Karat, and John Karat. The beauty of errors: Patterns of error correction in desktop speech systems. In INTERACT, volume 99, pages 1?8, 1999. [15] Clare-Marie Karat, Christine Halverson, Daniel Horn, and John Karat. Patterns of entry and correction in large vocabulary continuous speech recognition systems. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 568?575, 1999. [16] Ben Shneiderman. The limits of speech recognition. Communications of the ACM, 43(9):63?65, 2000. [17] Michael A Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4845?4854, 2019. [18] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [19] Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016. 184 [20] Hernisa Kacorri, Kris M. Kitani, Jeffrey P. Bigham, and Chieko Asakawa. People with visual impairment training personal object recognizers: Feasibility and challenges. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ?17, page 5839?5849, New York, NY, USA, 2017. Association for Computing Machinery. [21] Carrie J. Cai, Jonas Jongejan, and Jess Holbrook. The effects of example-based explanations in a machine learning interface. IUI ?19, page 258?262, New York, NY, USA, 2019. 
Association for Computing Machinery. [22] Andrew Fowler, Brian Roark, Umut Orhan, Deniz Erdogmus, and Melanie Fried- Oken. Improved inference and autotyping in eeg-based bci typing systems. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, pages 1?8, 2013. [23] Heidi Horstmann Koester and Simon P Levine. Validation of a keystroke-level model for a text entry system used by people with disabilities. In Proceedings of the first annual ACM conference on Assistive technologies, pages 115?122, 1994. [24] Hernisa Kacorri and Matt Huenerfauth. Continuous profile models in asl syntactic facial expression synthesis. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2084? 2093, 2016. [25] Cole Gleason, Anhong Guo, Gierad Laput, Kris Kitani, and Jeffrey P Bigham. Vizmap: Accessible visual information through crowdsourced map reconstruction. In Proceedings of the 18th International ACM SIGACCESS Conference on Computers and Accessibility, pages 273?274, 2016. [26] Daisuke Sato, Uran Oh, Kakuya Naito, Hironobu Takagi, Kris Kitani, and Chieko Asakawa. Navcog3: An evaluation of a smartphone-based blind indoor navigation assistant with semantic features in a large-scale environment. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, pages 270?279, 2017. [27] Tom Kontogiannis. User strategies in recovering from errors in man?machine systems. Safety Science, 32(1):49?68, 1999. [28] Tom Kontogiannis and Stathis Malakis. A proactive approach to human error detection and identification in aviation and air traffic control. Safety Science, 47(5):693 ? 706, 2009. [29] Abigail J Sellen. Detection of everyday errors. Applied Psychology, 43(4):475? 498, 1994. [30] Marie-Luce Bourguet. Towards a taxonomy of error-handling strategies in recognition-based multi-modal human?computer interfaces. 
Signal Processing, 86(12):3625?3643, 2006. 185 [31] Bernhard Suhm, Brad Myers, and Alex Waibel. Model-based and empirical evaluation of multimodal interactive error correction. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 584?591, 1999. [32] Takeo Igarashi, Satoshi Matsuoka, Sachiko Kawachiya, and Hidehiko Tanaka. Interactive beautification: A technique for rapid geometric design. In ACM SIGGRAPH 2007 courses, pages 18?es. 2007. [33] Haley MacLeod, Cynthia L Bennett, Meredith Ringel Morris, and Edward Cutrell. Understanding blind people?s experiences with computer-generated captions of social media images. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 5988?5999, 2017. [34] Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. Will you accept an imperfect ai? exploring designs for adjusting end-user expectations of ai systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ?19, page 1?14, New York, NY, USA, 2019. Association for Computing Machinery. [35] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM conference on Computer supported cooperative work, pages 241?250, 2000. [36] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th international conference on intelligent user interfaces, pages 126?137, 2015. [37] Brian Y Lim and Anind K Dey. Assessing demand for intelligibility in context- aware applications. In Proceedings of the 11th international conference on Ubiquitous computing, pages 195?204, 2009. [38] Emilee Rader, Kelley Cotter, and Janghee Cho. Explanations as mechanisms for supporting algorithmic transparency. In Proceedings of the 2018 CHI conference on human factors in computing systems, pages 1?13, 2018. 
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135?1144. ACM, 2016. [40] Daniel S Weld and Gagan Bansal. The challenge of crafting intelligible intelligence. Communications of the ACM, 62(6):70?79, 2019. [41] Rene? F Kizilcec. How much information? effects of transparency on trust in an algorithmic interface. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 2390?2395, 2016. 186 [42] Todd Kulesza, Simone Stumpf, Margaret Burnett, and Irwin Kwan. Tell me more? the effects of mental model soundness on personalizing an intelligent agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1?10, 2012. [43] A?ngel Alexander Cabrera, Fred Hohman, Jason Lin, and Duen Horng Chau. Interactive classification for deep learning interpretation. arXiv preprint arXiv:1806.05660, 2018. [44] Pearl Pu and Li Chen. Trust building with explanation interfaces. In Proceedings of the 11th international conference on Intelligent user interfaces, pages 93?100, 2006. [45] Rahhal Errattahi, Asmaa El Hannani, Thomas Hain, and Hassan Ouahmane. Towards a generic approach for automatic speech recognition error detection and classification. In 2018 4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pages 1?6. IEEE, 2018. [46] Manaswi Saha, Alexander J Fiannaca, Melanie Kneisel, Edward Cutrell, and Meredith Ringel Morris. Closing the gap: Designing for the last-few-meters wayfinding problem for people with visual impairments. In The 21st international acm sigaccess conference on computers and accessibility, pages 222?235, 2019. [47] Robin N Brewer and Vaishnav Kameswaran. Understanding the power of control in autonomous vehicles for people with vision impairment. 
In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, pages 185?197, 2018. [48] Julian Brinkley, Brianna Posadas, Julia Woodward, and Juan E Gilbert. Opinions and preferences of blind and low vision consumers regarding self-driving vehicles: Results of focus group discussions. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, pages 290?299, 2017. [49] Ali Abdolrahmani, William Easley, Michele Williams, Stacy Branham, and Amy Hurst. Embracing errors: Examining how context of use impacts blind individuals? acceptance of navigation aid errors. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 4158?4169, 2017. [50] Kristina Ho?o?k. Steps to take before intelligent user interfaces become real. Interacting with computers, 12(4):409?426, 2000. [51] Donald A Norman. How might people interact with agents. Communications of the ACM, 37(7):68?71, 1994. [52] Tomoko Hashida, Kohei Nishimura, and Takeshi Naemura. Hand-rewriting: automatic rewriting similar to natural handwriting. In Proceedings of the 2012 ACM international conference on Interactive tabletops and surfaces, pages 153? 162, 2012. 187 [53] David Dearman, Amy Karlson, Brian Meyers, and Ben Bederson. Multi-modal text entry and selection on a mobile device. In Proceedings of Graphics Interface 2010, pages 19?26. 2010. [54] Clive Frankish, Dylan Jones, and Kevin Hapeshi. Decline in accuracy of automatic speech recognition as a function of time on task: fatigue or voice drift? International Journal of Man-Machine Studies, 36(6):797?816, 1992. [55] Sharon Oviatt and Robert VanGent. Error resolution during multimodal human- computer interaction. In Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP?96, volume 1, pages 204?207. IEEE, 1996. [56] Anja Kintsch and Rogerio DePaula. A framework for the adoption of assistive technology. 
SWAAAC 2002: Supporting learning through assistive technology, pages 1?10, 2002. [57] Betsy Phillips and Hongxin Zhao. Predictors of assistive technology abandonment. Assistive technology, 5(1):36?45, 1993. [58] Shaun K Kane, Chandrika Jayant, Jacob O Wobbrock, and Richard E Ladner. Freedom to roam: a study of mobile device adoption and accessibility for people with visual and motor disabilities. In Proceedings of the 11th international ACM SIGACCESS conference on Computers and accessibility, pages 115?122, 2009. [59] Sherry Ruan, Jacob O Wobbrock, Kenny Liou, Andrew Ng, and James Landay. Speech is 3x faster than typing for english and mandarin text entry on mobile devices. arXiv preprint arXiv:1608.07323, 2016. [60] Hanlu Ye, Meethu Malu, Uran Oh, and Leah Findlater. Current and future mobile and wearable device use by people with visual impairments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3123?3132, 2014. [61] Hui Jiang. Confidence measures for speech recognition: A survey. Speech communication, 45(4):455?470, 2005. [62] W Feng. Using handwriting and gesture recognition to correct speech recognition errors. Urbana, 51:61801, 1994. [63] Arnout RH Fischer, Kathleen J Price, and Andrew Sears. Speech-based text entry for mobile handheld devices: an analysis of efficacy and error correction techniques for server-based solutions. International Journal of Human-Computer Interaction, 19(3):279?304, 2005. [64] Kazuki Fujiwara. Error correction of speech recognition by custom phonetic alphabet input for ultra-small devices. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pages 104?109, 2016. 188 [65] Vidyashree Kanabur, Sunil S Harakannanavar, and Dattaprasad Torse. An extensive review of feature extraction techniques, challenges and trends in automatic speech recognition. International Journal of Image, Graphics and Signal Processing, 10(5):1, 2019. 
[66] Arul Valiyavalappil Haridas, Ramalatha Marimuthu, and Vaazi Gangadharan Sivakumar. A critical review and analysis on techniques of speech recognition: The road ahead. International Journal of Knowledge-Based and Intelligent Engineering Systems, 22(1):39?57, 2018. [67] D Walker. The sri speech understanding system. IEEE transactions on acoustics, speech, and signal processing, 23(5):397?416, 1975. [68] Parabattina Bhagath and Pradip K Das. Acoustic phonetic approach for speech recognition: A review. Language, 77:93, 2004. [69] Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L Seltzer, and Christian Fuegen. End-to-end contextual speech recognition using class language models and a token passing decoder. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6186?6190. IEEE, 2019. [70] Sharon Goldwater, Dan Jurafsky, and Christopher D Manning. Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates. Speech Communication, 52(3):181?200, 2010. [71] Rahhal Errattahi, Asmaa El Hannani, and Hassan Ouahmane. Automatic speech recognition errors detection and correction: A review. Procedia Computer Science, 128:32?37, 2018. [72] Larwan Berke, Christopher Caulfield, and Matt Huenerfauth. Deaf and hard- of-hearing perspectives on imperfect automatic speech recognition for captioning one-on-one meetings. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, pages 155?164, 2017. [73] Yik-Cheung Tam, Yun Lei, Jing Zheng, and Wen Wang. Asr error detection using recurrent neural network language model and complementary asr. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2312?2316. IEEE, 2014. [74] Sahar Ghannay, Nathalie Camelin, and Yannick Esteve. Which asr errors are hard to detect. 
In Errors by Humans and Machines in Multimedia, Multimodal and Multilingual Data Processing (ERRARE 2015) Workshop, Sinaia, Romania, pages 11?13, 2015. [75] Sahar Ghannay, Yannick Esteve, and Nathalie Camelin. Word embeddings combination and neural networks for robustness in asr error detection. In 2015 23rd European Signal Processing Conference (EUSIPCO), pages 1671?1675. IEEE, 2015. 189 [76] David Huggins-Daines and Alexander Rudnicky. Interactive asr error correction for touchscreen devices. In Proceedings of the ACL-08: HLT Demo Session, pages 17?19, 2008. [77] Yuan Liang, Koji Iwano, and Koichi Shinoda. Simple gesture-based error correction interface for smartphone speech recognition. In Fifteenth Annual Conference of the International Speech Communication Association, 2014. [78] Jun Ogata and Masataka Goto. Speech repair: quick error correction just by using selection operation for speech input interfaces. In Ninth European Conference on Speech Communication and Technology, 2005. [79] Lijuan Wang, Tao Hu, Peng Liu, and Frank K Soong. Efficient handwriting correction of speech recognition errors with template constrained posterior (tcp). In Ninth Annual Conference of the International Speech Communication Association, 2008. [80] Junhwi Choi, Kyungduk Kim, Sungjin Lee, Seokhwan Kim, Donghyeon Lee, Injae Lee, and Gary Geunbae Lee. Seamless error correction interface for voice word processor. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4973?4976. IEEE, 2012. [81] Vikas Ashok, Yevgen Borodin, Yury Puzis, and IV Ramakrishnan. Capti-speak: a speech-enabled web screen reader. In Proceedings of the 12th Web for All Conference, pages 1?10, 2015. [82] Yevgen Borodin, Jalal Mahmud, IV Ramakrishnan, and Amanda Stent. The hearsay non-visual web browser. In Proceedings of the 2007 international cross- disciplinary conference on Web accessibility (W4A), pages 128?129, 2007. 
[83] Alisha Pradhan, Kanika Mehta, and Leah Findlater. ? accessibility came by accident? use of voice-controlled intelligent personal assistants by people with disabilities. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1?13, 2018. [84] Yu Zhong, TV Raman, Casey Burkhardt, Fadi Biadsy, and Jeffrey P Bigham. Justspeak: enabling universal voice control on android. In Proceedings of the 11th Web for All Conference, pages 1?4, 2014. [85] Jeff Bilmes, Xiao Li, Jonathan Malkin, Kelley Kilanski, Richard Wright, Katrin Kirchhoff, Amarnag Subramanya, Susumu Harada, James Landay, Patricia Dowden, et al. The vocal joystick: A voice-based human-computer interface for individuals with motor impairments. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 995?1002, 2005. [86] Eric Corbett and Astrid Weber. What can i say? addressing user experience challenges of a mobile voice user interface for accessibility. In Proceedings of the 190 18th international conference on human-computer interaction with mobile devices and services, pages 72?82, 2016. [87] Susumu Harada, James A Landay, Jonathan Malkin, Xiao Li, and Jeff A Bilmes. The vocal joystick: evaluation of voice-based cursor control techniques for assistive technology. Disability and Rehabilitation: Assistive Technology, 3(1- 2):22?34, 2008. [88] Bill Manaris, Rene?e McCauley, and Valanne MacGyvers. An intelligent interface for keyboard and mouse control. In Proc. 14th Int?l Florida AI Research Symposium (FLAIRS-01), pages 182?188. Citeseer, 2001. [89] Yoshiyuki Mihara, Etsuya Shibayama, and Shin Takahashi. The migratory cursor: accurate speech-based cursor movement by moving multiple ghost cursors using non-verbal vocalizations. In Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility, pages 76?83, 2005. 
[90] Stephen Cox, Michael Lincoln, Judy Tryggvason, Melanie Nakisa, Mark Wells, Marcus Tutt, and Sanja Abbott. Tessa, a system to aid communication with deaf people. In Proceedings of the fifth international ACM conference on Assistive technologies, pages 205?212, 2002. [91] Thomas Pellegrini, Lionel Fontan, Julie Mauclair, Je?ro?me Farinas, Charlotte Alazard-Guiu, Marina Robert, and Peggy Gatignol. Automatic assessment of speech capability loss in disordered speech. ACM Transactions on Accessible Computing (TACCESS), 6(3):1?14, 2015. [92] Oscar Saz, Shou-Chun Yin, Eduardo Lleida, Richard Rose, Carlos Vaquero, and William R Rodr??guez. Tools and technologies for computer-aided speech and language therapy. Speech Communication, 51(10):948?967, 2009. [93] Mark S Hawley, Stuart P Cunningham, Phil D Green, Pam Enderby, Rebecca Palmer, Siddharth Sehgal, and Peter O?Neill. A voice-input voice-output communication aid for people with severe speech impairment. IEEE Transactions on neural systems and rehabilitation engineering, 21(1):23?31, 2012. [94] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y Lim, and Mohan Kankanhalli. Trends and trajectories for explainable, accountable and intelligible systems: An hci research agenda. In Proceedings of the 2018 CHI conference on human factors in computing systems, pages 1?18, 2018. [95] Konstantinos Papadopoulos and Eleni Koustriava. Comprehension of synthetic and natural speech: Differences among sighted and visually impaired young adults. Enabling Access for Persons with Visual Impairment, 147:149?153, 2015. [96] Amanda Stent, Ann Syrdal, and Taniya Mishra. On the intelligibility of fast synthesized speech for individuals with early-onset blindness. In The proceedings of the 13th international ACM SIGACCESS conference on Computers and accessibility, pages 211?218, 2011. 191 [97] Danielle Bragg, Cynthia Bennett, Katharina Reinecke, and Richard Ladner. A large inclusive study of human listening rates. 
In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1?12, 2018. [98] Anja Moos and Ju?rgen Trouvain. Comprehension of ultra-fast speech?blind vs.?normally hearing?persons. In Proceedings of the 16th International Congress of Phonetic Sciences, volume 1, pages 677?680. Saarland University Saarbru?cken, Germany, 2007. [99] E Colin Cherry. Some experiments on the recognition of speech, with one and with two ears. The Journal of the acoustical society of America, 25(5):975?979, 1953. [100] Joa?o Guerreiro and Daniel Gonc?alves. Scanning for digital content: How blind and sighted people perceive concurrent speech. ACM Transactions on Accessible Computing (TACCESS), 8(1):1?28, 2016. [101] Ann R Bradlow and Jennifer A Alexander. Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners. The Journal of the Acoustical Society of America, 121(4):2339?2349, 2007. [102] Brenda Sutton, Julia King, Karen Hux, and David Beukelman. Younger and older adults? rate performance when listening to synthetic speech. Augmentative and Alternative Communication, 11(3):147?153, 1995. [103] Kelly Caine. Local standards for sample size at chi. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 981?992, 2016. [104] Zhirong Wang, Tanja Schultz, and Alex Waibel. Comparison of acoustic model adaptation techniques on non-native speech. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP?03)., volume 1, pages I?I. IEEE, 2003. [105] Mary LaLomia. User acceptance of handwritten recognition accuracy. In Conference companion on Human factors in computing systems, pages 107?108, 1994. [106] Antti Oulasvirta, Sakari Tamminen, Virpi Roto, and Jaana Kuorelahti. Interaction in 4-second bursts: the fragmented nature of attentional resources in mobile hci. 
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 919–928, 2005.
[107] Graham Upton and Ian Cook. Understanding Statistics. Oxford University Press, 1996.
[108] Larry D. Rosen, Kelly Whaling, L. Mark Carrier, Nancy A. Cheever, and Jeffrey Rokkum. The media and technology usage and attitudes scale: An empirical investigation. Computers in Human Behavior, 29(6):2501–2511, 2013.
[109] Sorrel Brown. Likert scale examples for surveys. ANR Program Evaluation, Iowa State University, USA, 2010.
[110] Marcelo Philip. Technology seeks to preserve fading skill: Braille literacy. AP Financial, 2017.
[111] Keith Vertanen and Per Ola Kristensson. Complementing text entry evaluations with a composition task. ACM Transactions on Computer-Human Interaction (TOCHI), 21(2):1–33, 2014.
[112] Michael Gonchar. 650 prompts for narrative and personal writing. New York Times, 20, 2016.
[113] Virginia Braun and Victoria Clarke. Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2):77–101, 2006.
[114] William R. Revelle. psych: Procedures for personality and psychological research. 2017.
[115] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
[116] Sara E. McBride, Wendy A. Rogers, and Arthur D. Fisk. Understanding human management of automation errors. Theoretical Issues in Ergonomics Science, 15(6):545–577, 2014.
[117] Xiaodong He, Li Deng, and Alex Acero. Why word error rate is not a good metric for speech recognizer training for the speech translation task? In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5632–5635. IEEE, 2011.
[118] Sushant Kafle and Matt Huenerfauth. Evaluating the usability of automatically generated captions for people who are deaf or hard of hearing. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, pages 165–174, 2017.
[119] Danielle Bragg, Nicholas Huynh, and Richard E. Ladner. A personalizable mobile sound detector app design for deaf and hard-of-hearing users. In Proceedings of the 18th International ACM SIGACCESS Conference on Computers and Accessibility, pages 3–13, 2016.
[120] Xin Zhang, Yee-Hong Yang, Zhiguang Han, Hui Wang, and Chao Gao. Object class detection: A survey. ACM Computing Surveys (CSUR), 46(1):1–53, 2013.
[121] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International Journal of Computer Vision, 128(2):261–318, 2020.
[122] Jinqiang Bai, Dijun Liu, Guobin Su, and Zhongliang Fu. A cloud and vision-based navigation system used for blind people. In Proceedings of the 2017 International Conference on Artificial Intelligence, Automation and Control Technologies, pages 1–6, 2017.
[123] YingLi Tian, Xiaodong Yang, Chucai Yi, and Aries Arditi. Toward a computer vision-based wayfinding aid for blind persons to access unfamiliar indoor environments. Machine Vision and Applications, 24(3):521–535, 2013.
[124] SeeingAI. An app for visually impaired people that narrates the world around you, 2017.
[125] Aipoly. Vision through artificial intelligence, 2016.
[126] Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In European Conference on Computer Vision, pages 417–434. Springer, 2020.
[127] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[128] Alexander Andreopoulos and John K. Tsotsos. 50 years of object recognition: Directions forward. Computer Vision and Image Understanding, 117(8):827–891, 2013.
[129] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
Nature, 521(7553):436–444, 2015.
[130] Meredith Ringel Morris. AI and accessibility. Communications of the ACM, 63(6):35–37, 2020.
[131] Rabia Jafri, Syed Abid Ali, Hamid R. Arabnia, and Shameem Fatima. Computer vision-based object recognition for the visually impaired in an indoors environment: A survey. The Visual Computer, 30(11):1197–1222, 2014.
[132] Alejandro Reyes-Amaro, Yanet Fadraga-González, Oscar Luis Vera-Pérez, Elizabeth Domínguez-Campillo, Jenny Nodarse-Ravelo, Alejandro Mesejo-Chiong, Biel Moyà-Alcover, and Antoni Jaume-i-Capó. Rehabilitation of patients with motor disabilities using computer vision based techniques. Journal of Accessibility and Design for All, 2(1):62–70, 2012.
[133] Taha Khan, Dag Nyholm, Jerker Westin, and Mark Dougherty. A computer vision framework for finger-tapping evaluation in Parkinson's disease. Artificial Intelligence in Medicine, 60(1):27–40, 2014.
[134] Hairong Jiang, Ting Zhang, Juan P. Wachs, and Bradley S. Duerstock. Enhanced control of a wheelchair-mounted robotic manipulator using 3-D vision and multimodal interaction. Computer Vision and Image Understanding, 149:21–31, 2016.
[135] Cristina Manresa-Yee, Javier Varona, Francisco J. Perales, and Iosune Salinas. Design recommendations for camera-based head-controlled interfaces that replace the mouse for motion-impaired users. Universal Access in the Information Society, 13(4):471–482, 2014.
[136] Kathleen Campbell, Kimberly L. H. Carpenter, Jordan Hashemi, Steven Espinosa, Samuel Marsan, Jana Schaich Borg, Zhuoqing Chang, Qiang Qiu, Saritha Vermeer, Elizabeth Adler, et al. Computer vision analysis captures atypical attention in toddlers with autism. Autism, 23(3):619–628, 2019.
[137] Jordan Hashemi, Mariano Tepper, Thiago Vallin Spina, Amy Esler, Vassilios Morellas, Nikolaos Papanikolopoulos, Helen Egger, Geraldine Dawson, and Guillermo Sapiro. Computer vision tools for low-cost and noninvasive measurement of autism-related behaviors in infants.
Autism Research and Treatment, 2014, 2014.
[138] Ruxandra Tapu, Bogdan Mocanu, and Titus Zaharia. Deep-hear: A multimodal subtitle positioning system dedicated to deaf and hearing-impaired people. IEEE Access, 7:88150–88162, 2019.
[139] Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, Tessa Verhoef, et al. Sign language recognition, generation, and translation: An interdisciplinary perspective. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility, pages 16–31, 2019.
[140] Kyungjun Lee and Hernisa Kacorri. Hands holding clues for object recognition in teachable machines. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2019.
[141] Dragan Ahmetovic, Daisuke Sato, Uran Oh, Tatsuya Ishihara, Kris Kitani, and Chieko Asakawa. ReCog: Supporting blind people in recognizing personal objects. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2020.
[142] Chandrika Jayant, Hanjie Ji, Samuel White, and Jeffrey P. Bigham. Supporting blind photography. In Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility, pages 203–210, 2011.
[143] TapTapSee. Mobile camera application designed specifically for the blind and visually impaired iOS users, 2016.
[144] Envision AI. Enabling vision for the blind, 2018.
[145] Aira. Your life, your schedule, right now, 2017.
[146] Kyungjun Lee, Jonggi Hong, Simone Pimento, Ebrima Jarjue, and Hernisa Kacorri. Revisiting blind photography in the context of teachable object recognizers. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility, pages 83–95, 2019.
[147] E. Johns, O. M. Aodha, and G. J. Brostow. Becoming the expert – interactive multi-class machine teaching.
In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2616–2624, June 2015.
[148] Andrea L. Thomaz and Cynthia Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6):716–737, 2008.
[149] Rüdiger Dillmann. Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems, 47(2):109–116, 2004. Robot Learning from Demonstration.
[150] Rupal Patel and Deb Roy. Teachable interfaces for individuals with dysarthric speech and severe physical disabilities. In Proceedings of the AAAI Workshop on Integrating Artificial Intelligence and Assistive Technology, pages 40–47. Citeseer, 1998.
[151] Tom Hitron, Yoav Orlev, Iddo Wald, Ariel Shamir, Hadas Erel, and Oren Zuckerman. Can children understand machine learning concepts? The effect of uncovering black boxes. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, New York, NY, USA, 2019. Association for Computing Machinery.
[152] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October 2010.
[153] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2019.
[154] Hernisa Kacorri. Teachable machines for accessibility. SIGACCESS Access. Comput., (119):10–18, November 2017.
[155] Rebecca Fiebrink, Perry R. Cook, and Dan Trueman. Human model evaluation in interactive supervised learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, pages 147–156, New York, NY, USA, 2011. Association for Computing Machinery.
[156] I. I. Itauma, H. Kivrak, and H. Kose. Gesture imitation using machine learning techniques.
In 2012 20th Signal Processing and Communications Applications Conference (SIU), pages 1–4, April 2012.
[157] Abigail Zimmermann-Niefield, Makenna Turner, Bridget Murphy, Shaun K. Kane, and R. Benjamin Shapiro. Youth learning machine learning through building models of athletic moves. In Proceedings of the 18th ACM International Conference on Interaction Design and Children, IDC '19, pages 121–132, New York, NY, USA, 2019. Association for Computing Machinery.
[158] Thomas J. Palmeri and Isabel Gauthier. Visual object understanding. Nature Reviews Neuroscience, 5(4):291, 2004.
[159] Jerry Alan Fails and Dan R. Olsen. Interactive machine learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces, IUI '03, pages 39–45, New York, NY, USA, 2003. Association for Computing Machinery.
[160] Kyungjun Lee, Daisuke Sato, Saki Asakawa, Hernisa Kacorri, and Chieko Asakawa. Pedestrian detection with wearable cameras for the blind: A two-way perspective. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2020.
[161] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, June 2016.
[162] BeMyEyes. Lend your eyes to the blind, 2016.
[163] Donghee Shin. The effects of explainability and causability on perception, trust, and acceptance: Implications for explainable AI. International Journal of Human-Computer Studies, 146:102551, 2021.
[164] Eric S. Vorm. Assessing demand for transparency in intelligent systems using machine learning. In 2018 Innovations in Intelligent Systems and Applications (INISTA), pages 1–7. IEEE, 2018.
[165] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, 2016.
[166] Solon Barocas and Andrew D. Selbst. Big data's disparate impact. Cal. L. Rev., 104:671, 2016.
[167] danah boyd and Kate Crawford. Critical questions for big data. Information, Communication & Society, 15(5):662–679, 2012.
[168] Alex Campolo, Madelyn Sanfilippo, Meredith Whittaker, and Kate Crawford. AI Now 2017 report. AI Now Institute at New York University, 2017.
[169] Lucy A. Suchman. Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press, 1987.
[170] William R. Swartout. XPLAIN: A system for creating and explaining expert consulting programs. Artificial Intelligence, 21(3):285–325, 1983.
[171] Ben Shneiderman and Pattie Maes. Direct manipulation vs. interface agents. Interactions, 4(6):42–61, November 1997.
[172] Nicholas Diakopoulos. Algorithmic accountability reporting: On the investigation of black boxes. 2014.
[173] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. Designing theory-driven user-centric explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, New York, NY, USA, 2019. Association for Computing Machinery.
[174] Daniel S. Weld and Gagan Bansal. Intelligible artificial intelligence. arXiv preprint arXiv:1803.04263, 2018.
[175] David Gunning. Explainable artificial intelligence (XAI), 2017.
[176] European Commission. European Union General Data Protection Regulation (GDPR), 2016.
[177] US Congress. S.1108 - Algorithmic Accountability Act of 2019, 2019.
[178] Patrice Y. Simard, Saleema Amershi, David Maxwell Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, and John Wernsing. Machine teaching: A new paradigm for building machine learning systems. CoRR, abs/1707.06742, 2017.
[179] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, et al. Guidelines for human-AI interaction.
In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, New York, NY, USA, 2019. Association for Computing Machinery.
[180] Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. Power to the people: The role of humans in interactive machine learning. AI Magazine, 35(4):105–120, December 2014.
[181] Google Creative Lab. Teachable Machine, 2017.
[182] Chien-Ju Ho, Aleksandrs Slivkins, Siddharth Suri, and Jennifer Wortman Vaughan. Incentivizing high quality crowdwork. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, pages 419–429, Republic and Canton of Geneva, CHE, 2015. International World Wide Web Conferences Steering Committee.
[183] Flask API. Browsable web APIs for Flask, 2010.
[184] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, and Jeffrey P. Bigham. A data-driven analysis of workers' earnings on Amazon Mechanical Turk. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, pages 449:1–449:14, New York, NY, USA, 2018. ACM.
[185] Z. Gong, P. Zhong, and W. Hu. Diversity in machine learning. IEEE Access, 7:64323–64350, 2019.
[186] Arvind Narayanan. FAT* tutorial: 21 fairness definitions and their politics. New York, NY, USA, 2018.
[187] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning, 2019.
[188] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, New York, NY, USA, 2012. Association for Computing Machinery.
[189] Faisal Khan, Bilge Mutlu, and Jerry Zhu. How do humans teach: On curriculum learning and teaching dimension. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q.
Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1449–1457. Curran Associates, Inc., 2011.
[190] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA, 2009. Association for Computing Machinery.
[191] Y. J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR 2011, pages 1721–1728, June 2011.
[192] Minjie Cai, Kris Kitani, and Yoichi Sato. Understanding hand-object manipulation by modeling the contextual relationship between actions, grasp types and object attributes, 2018.
[193] Qian Yang, Jina Suh, Nan-Chen Chen, and Gonzalo Ramos. Grounding interactive machine learning tool design in how non-experts actually build models. In Proceedings of the 2018 Designing Interactive Systems Conference, DIS '18, pages 573–584, New York, NY, USA, 2018. Association for Computing Machinery.
[194] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, December 1974.
[195] Hernisa Kacorri. Teachable machines for accessibility. ACM SIGACCESS Accessibility and Computing, (119):10–18, 2017.
[196] Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. Transactions of the Association for Computational Linguistics, 7(0):387–401, 2019.
[197] Pierre Stock and Moustapha Cisse. ConvNets and ImageNet beyond accuracy: Understanding mistakes and uncovering biases. In The European Conference on Computer Vision (ECCV), September 2018.
[198] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. CoRR, abs/1708.08296, 2017.
[199] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman.
Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
[200] Guangxiao Zhang, Zhuolin Jiang, and Larry S. Davis. Online semi-supervised discriminative dictionary learning for sparse representation. In Asian Conference on Computer Vision, pages 259–273. Springer, 2012.
[201] Jonggi Hong, Christine Vaing, Hernisa Kacorri, and Leah Findlater. Reviewing speech input with audio: Differences between blind and sighted users. ACM Trans. Access. Comput., 13(1), April 2020.
[202] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[203] Joan Sosa-García and Francesca Odone. "Hands on" visual recognition for visually impaired users. ACM Transactions on Accessible Computing (TACCESS), 10(3):1–30, 2017.
[204] Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 2001.
[205] A. Groce, T. Kulesza, C. Zhang, S. Shamasunder, M. Burnett, W. Wong, S. Stumpf, S. Das, A. Shinsel, F. Bice, and K. McIntosh. You are the only possible oracle: Effective test selection for end users of interactive machine learning systems. IEEE Transactions on Software Engineering, 40(3):307–323, March 2014.
[206] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, koray kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3630–3638. Curran Associates, Inc., 2016.