ABSTRACT

Title of Dissertation: TOWARD SYMBIOTIC HUMAN-AI INTERACTION FOCUSING ON PROGRAMMING BY EXAMPLE
Tak Yeon Lee, Doctor of Philosophy, 2017
Dissertation directed by: Professor Benjamin B. Bederson, Computer Science

Programming has become a new literacy, but it is still inaccessible to ordinary people. Programming-by-example (PBE) is an alternative approach that allows people to teach computers repetitive tasks by demonstrating a few input and output examples of those tasks. While the advancement of PBE has been driven mainly by algorithmic improvements, a growing community of researchers has begun to recognize the importance of issues on the human side of PBE. For instance, inexperienced users often find it hard to provide complete and consistent examples, which is crucial for computers to learn the correct programs. Unfortunately, most PBE systems have limited ways to communicate with users about what they can or cannot do, and about how to handle unsuccessful situations. The lack of symbiotic interaction between human users and PBE engines remains a major hurdle to widespread adoption of PBE techniques. To address the issues on the human side of PBE, this dissertation comprises four research threads. First, we began with two formative studies to establish a better understanding of inexperienced users' needs and mental models. Second, based on the findings of the formative studies, we developed a Visual Environment for Symbiotic Programming, called VESPY. VESPY interleaves visual programming and PBE techniques, enabling users (1) to decompose complex tasks into small modules on its 2-D grid, and (2) to complete each module by providing input and output examples. Four sample programs demonstrate VESPY's remarkable versatility. However, we also noticed that VESPY still had a number of usability issues. Third, to better understand those usability issues and how to help users recover from common mistakes, we conducted an online user study that observed how inexperienced users perform program decomposition and disambiguation, which are the two core activities of PBE. We identified seven types of mistakes, and reaffirmed that informative feedback on those mistakes is crucial for designing usable systems. Finally, we explored the design space of feedback components in order to understand their impact on users' experience. My dissertation contributes to the AI and HCI communities with: (i) identification of unmet needs of end-users of the Web; (ii) characterization of non-programmers' mental models; (iii) a design process for interleaving visual programming and PBE; (iv) identification of mistakes people make while using PBE; and (v) design and assessment of feedback components for PBE users.

TOWARD SYMBIOTIC HUMAN-AI INTERACTION FOCUSING ON PROGRAMMING BY EXAMPLE

by Tak Yeon Lee

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2017

Advisory Committee: Professor Ben Bederson, Chair; Professor Jeff Foster; Assistant Professor Leah Findlater; Assistant Professor Jon Froehlich; Associate Professor Jen Golbeck, Dean's Representative

© Copyright by Tak Yeon Lee 2017

Statement of Co-Authorship

All work in this dissertation was conducted under the supervision of Dr. Benjamin B. Bederson, and I am the primary contributor to all aspects of this research.
Most of the research in this dissertation, in Chapters 3 and 5, consists of updated versions of the published papers listed below:
• Chapter 3: Lee, T.Y., and Bederson, B.B. Give the people what they want: studying end-user needs for enhancing the web. PeerJ Computer Science 2:e91, https://doi.org/10.7717/peerj-cs.91, 2016.
• Chapter 5: Lee, T.Y., Dugan, C., and Bederson, B.B. Towards Understanding User Behavior of Programming by Example: A Crowdsourced User Study. Accepted for IUI'17: 22nd International Conference on Intelligent User Interfaces, 2017.

Dedication

To my family

Acknowledgements

First of all, I would like to thank my advisor Dr. Benjamin B. Bederson for his endless support and encouragement. Along the long path of this thesis, Ben has always been supportive of my decisions. I learned a lot from Ben about education, research, writing, and even the pronunciation of "image". I also thank Dr. Leah Findlater for her guidance on research methodology and scientific writing. I am very grateful to Casey Dugan, who is the best mentor I have ever met. I thank my dissertation committee, Jeff Foster, Jon Froehlich, and Jennifer Golbeck, for their valuable advice and patience. I would like to thank the HCIL night watches and all the precious people who would always be by my side: Uran, Kotaro, Matthew, Meethu, Jonggi, SeokBin, DeokGeun, Joon, Alina, MinKyoung, Angela, SoYoung, YoungSam, HyunJong, and others. Lastly, I'd like to thank my parents for all their encouragement and faith. Most of all, I thank HyeRee for all the love and care.

Table of Contents

Statement of Co-Authorship ...................................................................................................... ii Acknowledgements ................................................................................................................... iv Table of Contents ....................................................................................................................... v List of Tables .......................................................................................................................... viii List of Figures ........................................................................................................................... ix Chapter 1: Introduction .......................................................................................................... 1 Motivation .............................................................................................................................. 1 Dissertation goals and statement ........................................................................................... 2 Approach and overview ......................................................................................................... 3 Organization of the Dissertation ............................................................................................ 5 Chapter 2: Related Work ....................................................................................................... 7 End-User Development (EUD) .............................................................................................. 7 End-User Programming (EUP) .............................................................................................. 8 Text-based programming ......................................................................................... 9 Visual programming .......................................................................................................
11 Dataflow Programming ................................................................................................... 13 Programming-by-Example / Demonstration (PBE / PBD) ............................................. 15 Summary ......................................................................................................................... 18 End-User Software Engineering (EUSE) ............................................................................ 18 Supporting the exploratory approach of end-user programming .................................... 19 Understanding end-user programmer’s mental model .................................................... 19 Preventing mistakes of end-user programmers ............................................................... 21 Human-AI Interaction .......................................................................................................... 23 Direct manipulation vs. Autonomous agent .................................................................... 23 Mixed Initiative interaction ............................................................................................. 24 Chapter 3: Formative Study: End-User Needs for Enhancing the Web .............................. 27 Introduction .......................................................................................................................... 28 Study 1: End-User Needs on the Web ................................................................................. 30 Participants ...................................................................................................................... 31 Procedure ........................................................................................................................ 31 Data and Analysis ........................................................................................................... 32 Result: Challenges .......................................................................................................... 33 Potential Functionality of Web Enhancements ............................................................... 42 Design Implications ........................................................................................................ 45 Limitations ...................................................................................................................... 48 STUDY 2: NON-PROGRAMMERS MENTAL MODEL OF COMPUTATIONAL TASKS 48 Participants ...................................................................................................................... 48 Method ............................................................................................................................ 49 Procedure ........................................................................................................................ 53 Data and Analysis ........................................................................................................... 54 Findings .......................................................................................................................... 54 Implications ..................................................................................................................... 57 Limitations ...................................................................................................................... 59 Conclusion ...................................................................................................................... 
59 Chapter 4: VESPY: A Visual Environment for Symbiotic Programming ........................... 61 Introduction .......................................................................................................................... 61 Design Iteration ................................................................................................................... 62 Version 1: Spreadsheet ................................................................................................... 62 Version 2: Graph of Multiple Spreadsheets .................................................................... 64 vi Version 3: List of Operations .......................................................................................... 65 Version 4: Grid and Semantic Zoom .............................................................................. 66 Version 5: Grid and Pop-up Panel .................................................................................. 68 Version 6: Grid, Pop-up Panel, and Side Panel .............................................................. 69 Example Walkthrough ......................................................................................................... 69 VESPY System .................................................................................................................... 76 The Grid UI ..................................................................................................................... 76 Direct Specification ........................................................................................................ 77 Programming-by-Example (PBE) Engine ...................................................................... 78 Domain Specific Language (DSL) .................................................................................. 80 Single-step inference algorithms ..................................................................................... 83 Multi-step PBE with task recipes .................................................................................... 87 Example Enhancements ....................................................................................................... 87 Example #1: Deep search ................................................................................................ 88 Example #2: Custom Filter ............................................................................................. 90 Example #3: Event Parser for Google Calendar ............................................................. 90 Example #4: Multi-Attribute Ranking ............................................................................ 90 Preliminary User Study ........................................................................................................ 91 Method ............................................................................................................................ 92 Tasks ............................................................................................................................... 93 Tasks ............................................................................................................................... 94 Results ............................................................................................................................. 95 Discussion ............................................................................................................................ 95 PBE vs. 
Direct Specification ........................................................................................... 95 Limitations ...................................................................................................................... 97 CONCLUSION .................................................................................................................... 98 Chapter 5: Understanding Human Mistakes when Programming by Example ................... 99 Abstract ................................................................................................................................ 99 Introduction .......................................................................................................................... 99 METHODS ........................................................................................................................ 101 Experimental System ......................................................................................................... 102 SUCCESS RATE ............................................................................................................... 105 Types of Mistakes .............................................................................................................. 106 Missing steps (found 92 times; 30 were critical) .......................................................... 107 Ambiguous cases (29 times; 11 critical) ....................................................................... 107 Inconsistent or unsupported values (28 times; 8 critical) ............................................. 108 Unnecessary steps (15 times; 5 critical) ........................................................................ 108 Describing with formula (11 times; 7 critical) .............................................................. 109 Inconsistent program (3 times; 2 critical) ..................................................................... 110 Empty cases (2 times; 0 critical) ................................................................................... 110 LIMITATIONS .................................................................................................................. 110 CONCLUSION .................................................................................................................. 111 Chapter 6: Experiments on Feedback and Human Mistakes in PBE Systems .................. 112 Motivation and Introduction .............................................................................................. 112 Experimental System UI .................................................................................................... 114 Feedback rules ................................................................................................................... 116 Missing steps ................................................................................................................. 116 Ambiguous cases .......................................................................................................... 118 Inconsistent or unsupported values ............................................................................... 119 Unnecessary steps ......................................................................................................... 121 Describing with formula ............................................................................................... 
121 Inconsistent program ..................................................................................................... 122 Empty cases .................................................................................................................. 122 Methods ............................................................................................................................. 122 Procedure ...................................................................................................................... 122 Closing survey .............................................................................................................. 123 vii Compensation ............................................................................................................... 123 Experimental design ...................................................................................................... 123 Measurements ............................................................................................................... 124 Participants .................................................................................................................... 125 Result ................................................................................................................................. 125 The insignificant impact of feedback messages on completion and success rates ....... 125 Frequency and click rates of feedback .......................................................................... 128 Perceived quality of the system and the outcomes ....................................................... 130 Participant background and behavior ............................................................................ 131 Discussion .......................................................................................................................... 132 The insignificant impact of feedback messages ............................................................ 132 Potential reasons and remedies for the high dropout rate ............................................. 132 Plan for a follow-up experiment: addressing the high dropout rate .............................. 134 Chapter 7: Conclusion ....................................................................................................... 136 Answers to the research questions ..................................................................................... 136 R1. What do end-user programmer need to improve the Web? ................................... 136 R2. How do non-programmers express their programming intent? ............................. 136 R3. Is PBE better than direct specification? ................................................................. 137 R4. Can inexperienced users perform problem decomposition and disambiguation? .. 137 R5. What is the best feedback design for PBE users? .................................................. 137 Thesis contributions ........................................................................................................... 138 Identification of unmet needs of end-users of the Web ................................................ 138 Characterization of non-programmers’ mental model .................................................. 139 Design process of interleaving visual programming and PBE ..................................... 139 Identification of human mistakes of PBE ..................................................................... 
140 Design and assessment of feedback for PBE users ....................................................... 140 Future work ........................................................................................................................ 140 Crowdsourcing feedback rules to users ........................................................................ 141 Balancing between too much or too little feedback to users ........................................ 141 Long-term user study of practical EUP systems ........................................................... 142 Final remarks ..................................................................................................................... 142 Bibliography .......................................................................................................................... 143 viii List of Tables Table 1. Occupational background of the participants of study 1 .............................. 31 Table 2. Occupational background of the participants ............................................... 49 Table 3. In Task 2, the participants were asked to create a filter than removes houses with less than three bedrooms among housing rental posts scraped from Craigslist.com. .................................................................................................... 51 Table 4. VESPY operations and their required parameters. Subscripted types (e.g. VAL of IVAL) mean that the operation requires the type of the value. IDOM must contain only DOM elements; IVAL can be any type except DOM elements. ...... 82 Table 5. Examples of Substring inference. ................................................................. 85 Table 6. Examples of String Test inference ................................................................ 85 Table 7. Examples of Number Test inference ............................................................ 85 Table 8. Examples of Arithmetic inference ................................................................ 85 Table 9. Examples of Compose Text inference .......................................................... 85 Table 10. The core set of task recipes in VESPY. If input and output satisfies the condition, the recipe will create temporary nodes (in orange color) and will try to find sub-solution. ................................................................................................ 86 Table 11. The four tasks for the controlled experiment consist of thirteen problems. 94 Table 12. Wilcoxon signed rank test result of the completion times for each problem. For simple problems that require single steps (P1-P7), the Direct Specification condition equivalent or better performance. However, for complex problems requiring multiple-steps (P8-13), the PBE condition was significantly more efficient (p<0.03) ................................................................................................ 95 Table 13. With the given description and default examples for each task, participants were asked to add more examples, such as the solution examples shown. ...... 102 Table 14. Success rates (proportion of participants who passed the task) and average numbers of trials for the baseline (Base.) and the experimental (Exp.) conditions. Highlighted cells are significant (p<.05). ......................................................... 105 Table 15. Examples of missing steps ........................................................................ 107 Table 16. 
Examples of ambiguous cases .................................................................. 108 Table 17. Examples of inconsistent or unsupported values ...................................... 108 Table 18. Examples of unnecessary steps ................................................................. 109 Table 19. Examples of describing with formula ....................................................... 109 ix List of Figures Figure 1. Chickenfoot scripting environment running inside the Firefox browser. Users type scripting code in the script editor (left) to automate, customize, and integrate Web applications without examining HTML source code. ................. 10 Figure 2. The Inky command line window. When user types a command in the Input area, the Feedback area shows a list of interpreted and fixed candidates. .......... 10 Figure 3. A screenshot of Scratch, a visual programming environment for creating stories, games, and animations. Children can easily understand and use Scratch’s visual widgets. ..................................................................................................... 12 Figure 4. LabView is a dataflow programming language widely used in laboratories. ............................................................................................................................. 12 Figure 5. Sample widgets in Yahoo's Pipes. Users create complex operations by connecting widgets and customizing parameters. ............................................... 12 Figure 6. A sample mashup in Marmite [86]2] extracts address and other information from a Web page. Users select operators on the left, the widgets in the middle show the data flow, and the table on the right shows the processed data. .......... 13 Figure 7. Quartz Composer can process and render graphical data. ........................... 14 Figure 8. The Wrangler interface. Users can select or edit data in the right panel. Then the left-bottom panel shows suggestions of transforming operations based on a user’s latest action. As the user selects one of the suggestions, it will be applied to the data set and appended to the transform script (left-top). .......................... 17 Figure 9. The user interface of Karma. By highlighting a segment of text (“Japon Bistro”) in the embedded Web browser (left) and dragging it into the table (right), a user can specify a data retrieval operation. .......................................... 17 Figure 10. Sorting by year in STEPS. Mock input/output pairs1 specify each step; nested colored blocks represent structure. .......................................................... 17 Figure 11. WYSIWYT approach highlights potential bugs in spreadsheets. Red borders indicate incorrect cells. Check marks indicate that the cells have passed generated test cases, while question marks indicate that the cells need testing. . 21 Figure 12. Whyline is a debugging tool in the Alice programming environment. Users can press “why” or “why not” buttons for getting detailed information (e.g. program’s execution history) of specific animated behavior. ............................. 21 Figure 13. Conversational Clarification being used to disambiguate different programs that extract individual authors. ............................................................ 23 Figure 14. Program Navigation tab allows users to navigate sub-expressions of a program, and choose among alternative sub-expressions that other programs have suggested. 
................................................................................................... Figure 15. Participants were asked to explain how to draw a histogram of the numbers in the table. In this example, the participant gave histogram bins different codes (A-D), and marked each number with the codes. Since the participant could not put 12 into any bin, he marked the number with a question mark and a line pointing to a missing bin. ........................................................... 50 Figure 16. In this example of Task 2, the participant used scribbles along with verbal statements. For example, the participant wrote variations of keywords that indicate "bedroom" used in the list. He/she also circled and underlined the number of rooms in each title to demonstrate the text extraction logic, crossed out titles that did not meet the criteria, and drew arrows from houses to empty slots in the list. ........................................................... 52 Figure 17. In Task 3, participants were asked to describe a simple Mashup program that shows the available colors of each individual product in the Main page (top left), extracted from the Product Detail (bottom right) page. ........................................................... 53 Figure 18. The 1st design of the VESPY UI looks like a spreadsheet. Each column represents a list of values. The green arrows represent operations that calculate the next column. ........................................................... 63 Figure 19. The 2nd design of the VESPY UI. Widgets that contain small spreadsheets represent complex program structure such as branching and merging. ........................................................... 63 Figure 20. The 3rd design of the VESPY UI is optimized for showing the description of every operation. ........................................................... 64 Figure 21. The 4th design of the VESPY UI employs a 2D grid and a semantic zoom feature. ........................................................... 65 Figure 22. The 5th design of the VESPY UI includes a pop-up panel that shows details of the currently selected node. The top row represents values of the input nodes and the current node. The middle row explains what operation is assigned to the current node. The bottom row shows a set of operations that users can click to assign to the current node. ........................................................... 67 Figure 23. The VESPY user interface consists of the grid, info, actions, and node details. Users can open the UI on any web page by pressing the button in the top right corner of the web browser. ........................................................... 69 Figure 24. A new enhancement is created. The grid UI contains a Trigger node to begin with. The original web page is shown on the right side. ........................................................... 70 Figure 25. Users can (1) drag an operation from the Actions panel (left) to the grid (center), (2) directly change options of the operation (e.g. "button", "calculate sum") in the floating node detail window, and then (3) run the operation by clicking the play button on the right side of the window. Finally, (4) the values of the node will be updated. ........................................................... 71 Figure 26.
Users can attach new elements to any place in the web page by (1) dragging and dropping an element onto the target place, (2) choosing the relative position (before, front, back, after) with respect to the target, and (3) clicking a suggested program in the Actions panel. Then (4) two nodes are added to the grid. ........................................................... 71 Figure 27. Users can set an event handler by (1) dragging a Trigger operation next to the node containing elements, and (2) setting the correct input channel. ........................................................... 72 Figure 28. Users can specify a node that extracts elements at a specific DOM position by (1) creating an empty node, (2) clicking an element of interest and pressing the extract button (repeated twice for extracting a set of elements), and (3) confirming the suggested Extract Element operation in the action panel. Then (4) the empty node is replaced with a node that can extract all the elements at the same position. ........................................................... 73 Figure 29. Users can create new elements from values with the Create Element node. ........................................................... 73 Figure 30. Users can extract specific attributes from elements by (1) creating an empty node next to the elements, (2) clicking the attribute value in the detail window, and (3) confirming the suggested action. ........................................................... 74 Figure 31. A simple enhancement creates a button for calculating total points. Three nodes on the left side create and attach a "calculate sum" button to the table. When the button is clicked at runtime, the trigger node executes the following nodes to extract all the points from the table, add them, and attach the result back to the page. ........................................................... 75 Figure 32. A simple enhancement creates a button for calculating total points. Four nodes on the left side attach a "calculate total points" button to the Web page. When the button is clicked, the trigger node runs the following nodes to extract all the points from the table, add them, and attach the result back to the page. ........................................................... 76 Figure 33. Users can bring in the elements of the input node by (1) clicking the arrow button in the node detail window. (2) When the current node contains elements of the input node, PBE suggests a three-step task that filters the input elements by their properties. (3) Clicking the task will add three new nodes for the filtering task. ........................................................... 78 Figure 34. VESPY PBE suggests single / multiple operation tasks based on the values of the input nodes and the current node. To sort a list of numbers, a user (1) creates an empty node that follows the input node. (2) He starts typing the desired output "-5". However, at this point, PBE can only suggest a task with Number Test + Filter operations. (3) As he types sufficient output values, PBE suggests the correct Sort operation. (4) He clicks the suggestion to confirm it as the node's operation. ........................................................... 79 Figure 35. Filtering a set of table rows by values of a specific column requires the filtered list [c] and the key values for the predicate [b]. Users (1) extract key values from the original list, (2) ........................................................... 80 Figure 36. The syntax of VESPY enhancements. ........................................................... 81 Figure 37. Representation of the VESPY program. An enhancement consists of multiple nodes.
The enhancement in this figure calculates the average of numbers ([1,3,6]) by running the four nodes in the numbered order (1→2→3→4). Each node contains an operation, values, and input nodes. When its preceding node triggers a node, it executes its operation, updates its values, and then triggers its following nodes. ........................................................... 81 Figure 38. An example of Extract Parent operation inference. ........................................................... 83 Figure 39. The deep search enhancement adds a text input box to the original page. When a user types a keyword in the input box, it searches all the linked pages and highlights links whose pages contain the keyword. The main content of each link's page is attached to the link as well. ........................................................... 88 Figure 40. The custom filter enhancement extracts all the venues from the publication list, and attaches a list of unique buttons. When a button is clicked, it shows only the articles published in the selected venue. ........................................................... 89 Figure 41. The event parser enhancement attaches a button to every event in the list. When a button is clicked, it finds an open tab of Google Calendar and fills the input form with the event information. ........................................................... 89 Figure 42. The multi-attribute ranking enhancement adds text boxes to each column header, in which users can type their own weight factors. When a factor is changed, it updates the weighted total scores and color codes on the right end of the table. It also attaches a Sort button that reorders the table rows by weighted total score. ........................................................... 89 Figure 43. An example of the second problem of the Calculating numbers task. Given the two input nodes, the participants need to create an Arithmetic node that multiplies the two node values. ........................................................... 93 Figure 44. The study UI and basic walkthrough. ........................................................... 103 Figure 45. The experimental system UI. The TASK section describes the program participants should build. The EXAMPLES section contains a table of user-provided examples and feedback from the PBE engine. In the RESULT panel, users press the Teach Computer button to let the PBE engine generate programs based on the provided examples, and get feedback. Finally, the HISTORY panel shows all the trials provided for the current task. ........................................................... 114 Figure 46. The mechanism of choosing and locking commands for a step. When the computer generates multiple commands, users can choose one among them. Chosen commands are locked to the step, and stay until they are unlocked. ........................................................... 115 Figure 47. Probabilities of participants reaching and completing tasks compared across different feedback compositions. Lines indicate the portion of participants who reached specific tasks. Bars indicate the portions of participants who accomplished tasks without giving up. The green line above the other lines suggests that the 'BOTH' setting, which shows both system info and instruction, outperformed the other settings. ........................................................... 127 Figure 48.
Probabilities of participants reaching and completing tasks compared to whether the history panel is given or not. Lines indicate the portion of participants who reached specific tasks. Bars indicate the portions of participants who accomplished tasks without giving up. The two lines go along with each other, suggesting that the history panel does not have a strong impact on how many users reached and completed tasks. ........................................................... 127 Figure 49. Number of tasks (and tutorials) in which a specific feedback rule was activated and clicked by participants. ........................................................... 129 Figure 50. The closing survey result. The Likert scale ratings generally suggest that the BOTH condition is perceived to be intuitive, effective, and useful for increasing the credibility of the outcome. However, a few participants perceived the BOTH condition to be hard to understand and ineffective. ........................................................... 129

Chapter 1: Introduction

Motivation

Programming has become a new literacy, but it is still one of the most challenging skills for ordinary people. As of 2016, only 2.54% of the employed workforce in the United States are software developers [83]. To enable ordinary people to perform complex and customizable computational tasks, researchers have proposed the concept of End-User Development (EUD), "a set of methods, techniques and tools that allow users of software systems, who are acting as non-professional software developers, at some point to create, modify, or extend a software artifact" [51]. Since end-user programmers have characteristics different from professional programmers, they need specially designed programming environments, which is the goal of end-user programming (EUP) research. EUP researchers have proposed various approaches for making programming concepts easy to learn, as reviewed in Section 2.2. While every EUP approach has its own strengths and weaknesses, the common ground is that end-user programmers need to learn the constructs of such systems and imperatively specify them. Programming-by-Example (PBE) is an alternative approach that allows users to teach computers to perform repetitive tasks by demonstrating or providing examples through a conventional direct manipulation interface, instead of specifying programs directly via text-based coding or visual programming techniques [14,50]. Therefore, PBE has been successful in areas where users can easily demonstrate complete and consistent examples, such as controlling robot arms, automating repetitive tasks, creating animations, and wrangling structured data (reviewed in Section 2.2.4). Despite its strong potential, the advancements of PBE have mostly been driven by technical improvements rather than by addressing issues on the human side. For example, users of PBE systems often express frustration at not knowing the capabilities and limitations of the PBE engine [88]. If users make a mistake while expressing their intent, or if the intent is not expressible, PBE systems fail without a backup plan [44]. There is no easy way to check the correctness of generated programs, especially when extensive test cases are unavailable [54]. Decomposing a complex task into smaller subtasks is a challenge for inexperienced users [25]. In sum, usability issues remain barriers to widespread adoption of PBE [44].
Dissertation goals and statement

At a high level, the goal of this dissertation has been to improve the design of PBE systems. More specifically, our goal was to answer the following research questions:

R1. (Chapter 3) What do end-user programmers need to improve the Web?
a. What challenges do end-users experience on the Web?
b. What features should an EUP system provide to end-user programmers?
R2. (Chapter 3) How do non-programmers express their programming intent?
R3. (Chapter 4) Is PBE better than direct specification?
R4. (Chapter 5) Can inexperienced users perform problem decomposition and disambiguation?
a. What mistakes do users make when using PBE?
R5. (Chapter 6) What is the impact of feedback design on users' experience of PBE?
a. Is showing system information, instructions, or both helpful for completing tasks, understanding the system, and fixing human mistakes?
b. Does feedback design affect users' behavior when using PBE features?
c. Does feedback design affect the credibility users place in the programs they make?
d. Does demographic background affect users' performance and behavior when using PBE features?
e. Is a history of previous trials helpful for users to understand and fix their mistakes?

We addressed these questions with four research threads: (1) studying inexperienced users' needs and mental models, (2) designing a symbiotic environment that interleaves visual programming and PBE, (3) identifying mistakes that inexperienced users make while using PBE, and (4) exploring the design space of feedback for human mistakes.

Approach and overview

Towards the objectives of the dissertation outlined above, we started by conducting two formative user studies (Chapter 3). First, a semi-structured interview study explored challenges that 35 end-users experience daily, and identified seven categories of web enhancements that would be helpful additions to future EUP systems. Second, a Wizard of Oz study with 13 non-programmers observed how they naturally explain common computational tasks through conversational dialogue. This study expands existing work with characteristics of non-programmers' mental models. The findings, though preliminary, suggest that future EUP tools should support multi-modal and mixed-initiative interaction to make programming more natural and easier to use. Building on the findings from the formative studies, we developed VESPY, an end-user programming environment for creating interactive web components (Chapter 4). The development of VESPY was a long iterative process, taking 1.5 years to explore various ways of combining visual programming and PBE. The design goal was to interleave visual programming and PBE so that users could decompose complex tasks into modules, and generate solutions for each module by providing input and output examples to the PBE engine. Section 4.5 presents four scenarios of sample enhancements that demonstrate the unique capability and versatility of the approach. We also conducted a preliminary user study with VESPY to compare the PBE and direct specification approaches. For complex tasks requiring multiple inferences, PBE outperformed direct specification in terms of user performance. However, for simple tasks, direct specification was as good as PBE, particularly after participants understood the domain-specific language. We also observed that participants experienced usability issues similar to those reported for other PBE systems.
While PBE systems can be quite difficult for inexperienced users, there is little research on people's ability to accomplish complex tasks by providing examples. Chapter 5 presents an online user study that investigates to what extent inexperienced participants perform decomposition and disambiguation for complex PBE tasks, and identifies types of common mistakes. We developed an experimental PBE system that supports simple tasks (e.g. arithmetic, string extraction, and conditional filtering). Among 161 participants recruited from Amazon Mechanical Turk, only 18.6% (30 participants) could finish the entire study. We identified seven types of common mistakes, and reaffirmed that decomposition and disambiguation are tricky for inexperienced users. In addition, we observed that providing actionable feedback for unsuccessful trials can significantly improve users' success rates compared to simple feedback. Finally, we explored the design space of feedback for PBE (Chapter 6). First, we created three types of feedback messages: (1) detection of user intent; (2) system information; and (3) instructions for resolving the current issue. We also developed a history panel that shows all the unsuccessful trials for the current task. Using the same experimental system, we compared eight combinations of the feedback design factors. The findings suggest that feedback messages had no significant impact on participants' performance. However, providing both system information and instructions increased the perceived effectiveness of feedback messages. The results also suggest that high dropout rates and information overload lowered the validity of the study, so we will conduct a follow-up experiment with a revised system and study design. The contributions and future research directions of this dissertation are discussed in Chapter 7.

Organization of the Dissertation

The rest of this dissertation is organized as follows. Chapter 2 reviews the literature related to this thesis. Chapter 3 reports the formative studies, including the Wizard of Oz study that explores how non-programmers describe computational tasks. Chapter 4 introduces the implementation of VESPY, a visual programming environment that employs PBE techniques. Chapter 5 presents the online user study of how ordinary people perform PBE decomposition and disambiguation, along with the seven types of human mistakes we identified. Chapter 6 reports the follow-up study of the extended feedback components. Finally, Chapter 7 proposes possible future research projects that could extend the current scope of this dissertation.

Chapter 2: Related Work

This chapter provides an overview of end-user development in terms of interaction approaches for making programming accessible to inexperienced people, as well as a brief history of human-AI interaction research. We begin with a general background on how the End-User Development (EUD) paradigm has evolved to make programming accessible to ordinary people (Section 2.1). Section 2.2 describes the End-User Programming (EUP) concept, which is a subset of EUD that focuses on enabling end users to create their own programs; we delve into a variety of interaction styles used in EUP systems. In Section 2.3, we review End-User Software Engineering (EUSE), another related concept overlapping with EUD and EUP that emphasizes the quality of the software that end-users create, modify, and extend.
Finally, Section 2.4 reviews research topics that have advanced toward symbiotic interaction between humans and AI.

End-User Development (EUD)

More and more people use computers on a daily basis for diverse, complex, and frequently changing needs [9]. Enabling people to solve their own problems is a value in itself. Moreover, professional software developers, who comprised only 2.54% of the total employed workforce in the United States as of 2016, cannot fully meet all the needs of the country [83]. EUD is "a set of methods, techniques and tools that allow users of software systems, who are acting as non-professional software developers, at some point to create, modify, or extend a software artifact" [51]. End-user programmers are often also domain experts, such as teachers using a spreadsheet for efficient grading, interaction designers building working prototypes, and journalists crawling data from Web pages, as surveyed by Ko et al. [39]. Spreadsheet applications such as VisiCalc, Lotus 1-2-3, and Microsoft Excel were the first, and remain by far the most successful, EUD environments. Although end users may not think they are creating programs, the spreadsheet artifacts they create are actually first-order functional programs [33]. In the early days of personal computers, spreadsheets' EUD support was a major factor in buying expensive machines. Complex applications such as word processors usually provide functionality sufficient to satisfy diverse target user groups, but are not optimized for any single user. Customization, or tailoring, of a UI means specifying parameters of an existing application to meet the user's needs [9]. For instance, a wide range of tools such as web browsers, word processors, integrated development environments, and even games allow users to add plug-ins and change configurations. The first step in creating a successful EUD environment is to understand what additional features people want, and how to enable them to specify those features. In this dissertation, we conducted a formative interview study (Chapter 3) to investigate what problems end-users experience on the Internet, and how they would fix them.

End-User Programming (EUP)

End-user programming is defined by Ko et al. [39] as "programming to achieve the result of a program, rather than the program itself." According to this definition, end-user programmers, compared to professional programmers, are less concerned about the reusability, reliability, and security of the programs they create. Instead, end users are interested in quick-and-dirty ways to solve the problems at hand. Programs created through EUP extend the functionality of existing applications (e.g. web pages) or run as stand-alone software. Kelleher and Pausch [36] have surveyed EUP systems and identified five interaction styles in addition to text-based programming. In this section, we review EUP systems by their interaction styles [63].

Text-based programming

Textual programming is often considered unfriendly to end-user programmers. However, for users with sufficient programming skills, text-based scripting is an efficient and expressive way to use the full functionality of domain-specific languages. For example, early EUP systems for customizing the Web, such as Greasemonkey [100], are as versatile as JavaScript, at the expense of requiring professional programming skills. To make the efficiency of text-based programming accessible to end-users, EUP researchers have proposed various interactive supports for textual programming. Chickenfoot [6] automatically identifies page elements that match user-provided keywords, as shown in Figure 1. When a user types a click("Go") command, Chickenfoot finds clickable elements (e.g. a hyperlink or button) containing the keyword "Go" and triggers the click events assigned to those elements.
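The keyword-matching idea behind such commands can be sketched in a few lines of code. The listing below is a hypothetical illustration, not Chickenfoot's actual algorithm: page elements are modeled as plain records, and the tag list and matching rule are our own simplifications.

# Hypothetical sketch of keyword-based command resolution in the spirit of
# click("Go"). Elements are plain dictionaries; a real system would walk the
# DOM and dispatch genuine click events on the matched nodes.
CLICKABLE_TAGS = {"a", "button", "input"}

def resolve_click(keyword, elements):
    """Return clickable elements whose visible text contains the keyword."""
    keyword = keyword.lower()
    return [el for el in elements
            if el["tag"] in CLICKABLE_TAGS and keyword in el["text"].lower()]

page = [
    {"tag": "a",      "text": "Go to checkout"},
    {"tag": "button", "text": "Go"},
    {"tag": "div",    "text": "Go appears here, but this element is not clickable"},
]

print([el["text"] for el in resolve_click("Go", page)])
# ['Go to checkout', 'Go']

A real system would additionally rank the candidates (e.g. preferring exact text matches) and ask the user when several elements tie.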
Inky [58] has a sloppy syntax and rich feedback features that allow commands with incorrect ordering or missing keywords or parameters. While the user is typing, Inky incrementally and continuously shows rich feedback on how it interprets and fixes the command, as shown in Figure 2. Many of these supports were later brought to professional IDEs (e.g. Eclipse) as auto-completion or code-quality suggestion plug-ins. While EUP systems that support other interaction styles (e.g. PBE, visual programming) rarely require end-users to write code from scratch, textual description is still a common representation of existing programs, because once users understand a textual description they can easily validate and modify programs [34,37,86,89]. In this dissertation, we employ textual descriptions to present the programs generated by PBE engines in VESPY (Chapter 4) and in the online usability study of PBE (Chapter 6).

Figure 1. Chickenfoot scripting environment running inside the Firefox browser. Users type scripting code in the script editor (left) to automate, customize, and integrate Web applications without examining HTML source code.

Figure 2. The Inky command line window. When a user types a command in the Input area, the Feedback area shows a list of interpreted and fixed candidates.

Visual programming

To address the steep learning curve of textual coding, many EUP tools employ visual elements to represent low-level language constructs (e.g. commands, control structures, and variables) so that end-user programmers can arrange them to build programs, animated stories, and games. Using visual constructs has many advantages. First, the widgets' shapes and colors help users understand program structure and memorize language constructs. Like Lego bricks, the connectors of widgets constrain how they should be put together, without the obscure syntax or punctuation of textual coding. Also, the palette of available commands and the options of each command present the tool's capabilities intuitively. It is not surprising that many visual programming environments have educational purposes, such as Alice [37], LEGOsheets [21], and Scratch [71] (see Figure 3). In spite of their educational benefits, visual programming is often criticized for being impractical for solving real-world problems. For instance, visual blocks take up large amounts of screen space, and arranging visual blocks takes much longer than typing code [2].

Figure 3. A screenshot of Scratch, a visual programming environment for creating stories, games, and animations. Children can easily understand and use Scratch's visual widgets.

Figure 4. LabView is a dataflow programming language widely used in laboratories.

Figure 5. Sample widgets in Yahoo's Pipes. Users create complex operations by connecting widgets and customizing parameters.

Dataflow Programming

Dataflow programming (DFP) models a program as a directed graph of information flowing between operations [32]. Recently, many advancements have been made in visual DFP, because complex program structure becomes easier to reason about when the flow of information is visualized. DFP's application domains include signal processing for real-time music / video performance (e.g. Max/MSP¹, Pure Data², VVVV³), processing large amounts of data (e.g. Marmite [86], Karma [81], Yahoo! Pipes [97]), and prototyping interactive UIs (e.g. Quartz Composer [99]).

¹ http://cycling74.com/products/max
² http://en.wikipedia.org/w/index.php?title=Pure_Data&oldid=629733021
³ http://vvvv.org/documentation/vvvv-a-multipurpose-toolkit

DFP falls short when representing complex cyclic control flows such as for-loops and recursion [32]. Thus many DFP tools conceal the entire loop inside each operation, so that a node deals with a list of input and output values without explicit looping.
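To make that last point concrete, the sketch below composes a tiny pipeline in which every node consumes and produces a whole list, so no loop ever appears at the graph level. It is a generic illustration of the list-at-a-time convention, not the execution model of any particular tool cited above; the node names and data are invented for the example.

# Generic sketch of list-at-a-time dataflow nodes: each node handles its whole
# input list internally, so the user-visible graph contains no explicit loops.
def extract_prices(rows):            # node 1: pull one field out of each record
    return [row["price"] for row in rows]

def keep_under(limit):               # node 2 (parameterized): filter a list
    return lambda prices: [p for p in prices if p < limit]

def total(prices):                   # node 3: reduce the list to a single value
    return sum(prices)

def run_pipeline(data, *nodes):
    """Push data through the nodes in order, like edges in a dataflow graph."""
    for node in nodes:
        data = node(data)
    return data

rows = [{"price": 12.0}, {"price": 55.0}, {"price": 8.5}]
print(run_pipeline(rows, extract_prices, keep_under(50), total))   # 20.5

Because each node hides its own iteration, the visible graph stays flat and acyclic, which is what makes the structure easy to read at the cost of awkward cyclic control flow.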
Figure 6. A sample mashup in Marmite [86] extracts addresses and other information from a Web page. Users select operators on the left, the widgets in the middle show the data flow, and the table on the right shows the processed data.

LabView⁴ is a well-known DFP environment for analyzing data in laboratories. Nodes in LabView programs are predefined functions that connect to each other, passing data around as shown in Figure 4. Internet mashup builders have also employed dataflow as their program representation. For example, Yahoo! Pipes⁵ (Figure 5) and Marmite (Figure 6) enable users to compose nodes to aggregate, manipulate, and mash up content from around the Web. The Origami toolkit for Quartz Composer⁶ (Figure 7) is a visual DFP tool for creating interactive design prototypes. Dataflow programming is often confused with visual programming, because both rely on visual elements. The difference is whether the visual elements represent low-level language constructs such as variables, operators, and control flows (in visual programming) or high-level structure such as sub-processes (in dataflow programming). In this dissertation, we employed visual elements for the dataflow approach, allowing users to decompose a complex programming task into small modules.

⁴ http://www.ni.com/labview/
⁵ https://en.wikipedia.org/wiki/Yahoo!_Pipes
⁶ https://en.wikipedia.org/wiki/Quartz_Composer

Figure 7. Quartz Composer can process and render graphical data.

Programming-by-Example / Demonstration (PBE / PBD)

Programming-by-example (PBE), sometimes called programming-by-demonstration (PBD), is an EUP technique for teaching a computer to perform certain tasks by demonstrating or providing examples through a conventional direct manipulation interface, instead of specifying them directly via text-based coding or visual programming techniques [14,50]. Given that end users are readily able to demonstrate consistent and complete examples, PBE is supposed to be easier to learn and use than traditional programming. PBE is commonly used for creating animations [55,70], drawing geometric shapes [1], creating macros for repetitive document editing [45] or Web-based processes [47], extracting data from structured documents [42,46,78], transforming data in spreadsheets [23,25], and controlling robot arms [61]. In this dissertation we build an EUP system for data extraction, transformation, and web automation and customization, and we discuss a couple of PBE systems that are relevant to those topics.
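Before turning to specific systems, the sketch below shows, in miniature, the generate-and-test loop that many PBE engines share in some form: enumerate candidate programs from a small domain-specific language and keep only those consistent with every user-provided example. It is a simplified illustration of the general idea, not the inference algorithm of any system discussed in this chapter; the toy DSL and examples are our own.

# Minimal generate-and-test PBE sketch over a toy string-extraction DSL.
# A candidate program is a (name, function) pair; an example is (input, output).
def candidate_programs():
    progs = [("last word", lambda s: s.split()[-1]),
             ("first word", lambda s: s.split()[0]),
             ("uppercase", str.upper)]
    for n in range(1, 5):                          # "last n characters" programs
        progs.append((f"last {n} chars", lambda s, n=n: s[-n:]))
    return progs

def consistent(fn, examples):
    return all(fn(inp) == out for inp, out in examples)

def synthesize(examples):
    return [name for name, fn in candidate_programs() if consistent(fn, examples)]

# With a single example, the intent is ambiguous: two programs still fit.
print(synthesize([("Tak Lee", "Lee")]))
# ['last word', 'last 3 chars']

# A second example disambiguates the intent.
print(synthesize([("Tak Lee", "Lee"), ("Ben Bederson", "Bederson")]))
# ['last word']

Real engines search far richer DSLs and rank the surviving candidates, but the same tension is already visible here: with too few or inconsistent examples, several programs remain consistent, and the engine cannot know which one the user meant. This is the disambiguation burden on users that later chapters examine.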
For instance, Wrangler [34] is a system for interactive data transformation, shown in Figure 8. To build a sequence of basic transforms in Wrangler, users demonstrate their intent by selecting or editing a few examples in the spreadsheet. Then the PBE engine of Wrangler suggests a list of transform operations ordered by relevance so that users can choose one of them to apply. The goal of the Wrangler interface is to provide multiple means to add each transformation step so that users can choose the most convenient one for their tasks. Karma [79] (Figure 9) is another data transformation tool; it enables users to quickly extract, clean, and integrate data from multiple sources including databases, spreadsheets, text files, XML, and Web APIs. Karma uses PBE techniques that generate data transformation scripts from the user's actions on its data table. Text processing tasks (e.g. extracting or replacing substrings, reformatting structured text) are tedious and error-prone even for professional programmers. Therefore, text-editing tools often employ PBD / PBE techniques to automate text editing based on a user's keystrokes and mouse clicks. For instance, SMARTedit [43] generates macros from repetitive editing actions. Karma [79] and Wrangler [34] generate corresponding text transforms from multiple pairs of input and output examples. STEPS [89] enables end-users to select and manipulate parts of the hierarchical structure of text by example.
Figure 8. The Wrangler interface. Users can select or edit data in the right panel. Then the left-bottom panel shows suggestions of transform operations based on the user's latest action. As the user selects one of the suggestions, it is applied to the data set and appended to the transform script (left-top).
Figure 9. The user interface of Karma. By highlighting a segment of text (“Japon Bistro”) in the embedded Web browser (left) and dragging it into the table (right), a user can specify a data retrieval operation.
Figure 10. Sorting by year in STEPS. Mock input/output pairs specify each step; nested colored blocks represent structure.
Summary
In Section 2.2, we reviewed four interaction styles commonly used in EUP systems. It is noteworthy that none of them is strictly better or worse than the others. Instead, each of them has strengths and weaknesses. Text-based programming is hard to learn, but very effective at describing programs. Visual programming makes it easy to learn basic programming concepts, but is not as scalable as text-based or dataflow programming. Dataflow programming provides an effective way to handle the high-level structure of certain programs. PBE allows end users to create programs without learning how to specify them, but it may not be applicable to every task. In fact, most EUP systems employ multiple styles in combination. For example, visual constructs and textual descriptions are commonly used together to describe programs [86,97]. Some EUP systems provide PBE as well as traditional direct manipulation [34]. In Chapter 4, we build VESPY, an EUP system that employs a combination of dataflow and PBE.
End-User Software Engineering (EUSE)
End-user programmers may not have the same skills and goals as professional programmers. However, issues of software engineering, such as maintainability, reusability, privacy, and security, are essential requirements for the success of EUD.
End-user software engineering (EUSE) is a body of research that focuses on systematic and disciplined activities that address the quality of software created by end-users [39]. In this chapter, we review a few research topics in EUSE that are most relevant to this dissertation.
Supporting the exploratory approach of end-user programming
Professional developers usually do not have complete knowledge about the domain, but are supposed to investigate and define the requirements of software before starting development. In contrast, end-user programmers usually have a good understanding of their needs, and jump directly into development without specifying requirements or considering other issues of software engineering [72]. End-user programmers tend to take evolutionary or exploratory approaches, leaving parts of the design in a rough and ambiguous state. One integrative approach is to use community support to help less experienced users learn from more experienced end-user programmers. For example, the CoScripter community enables end-user programmers to share and extend macro scripts in an enterprise [5]. As another approach, EUP systems often have design critic features that give end-users context-aware critiques for improving their designs [17]. In this dissertation, we propose context-aware critic features to help end-users decompose PBE tasks (Chapters 5 and 6).
Understanding end-user programmers' mental models
Understanding end-users' needs and mental models is essential for building successful programming tools [73]. Researchers have studied a wide range of end-user programmers including children [68], teachers, interaction designers [62], and anyone else who would develop programs for professional or personal needs. Keller and Pausch [36] surveyed development environments for novice programmers, mainly focusing on the educational impacts of such settings [51]. Miller [57] examined non-programmers generating procedural instructions in natural language, which resulted in a set of recommended features for programming languages. For instance, he suggested that contextual referencing would be a good alternative to using variables and traversing data structures. Pane et al. [67] studied the vocabulary and structure of non-programmers expressing solutions to computational problems, and identified patterns of imprecise and underspecified information in them. In this dissertation, I conducted an interview study to examine how people with varying programming expertise express their needs for Web customization (section 3.2), and a Wizard-of-Oz study to investigate non-programmers describing computational tasks through conversational dialogue (section 3.3). I also examined what mistakes people make while using PBE to solve complex problems (Chapter 5).
Preventing mistakes of end-user programmers
Even though end-user programmers usually create software artifacts for their own needs, small bugs in their code can have critical consequences. For instance, a Texas oil firm lost millions of dollars in an acquisition deal because of a buggy spreadsheet formula [69]. A business Web site with broken links can result in loss of revenue and credibility [73]. In consideration of the quality issues in EUSE, researchers have begun to study what mistakes end-user programmers make, and have proposed support for preventing such mistakes. For instance, “What You See Is What You Test” (WYSIWYT) is an end-user testing approach that helps users systematically validate and find bugs in their spreadsheets [19].
Figure 11. The WYSIWYT approach highlights potential bugs in spreadsheets. Red borders indicate incorrect cells.
Check marks indicate that the cells have passed generated test cases, while question marks indicate that the cells need testing.
Figure 12. Whyline is a debugging tool in the Alice programming environment. Users can press “why” or “why not” buttons to get detailed information (e.g. the program’s execution history) about specific animated behavior.
When WYSIWYT finds a potential bug in a spreadsheet, it highlights the area with colored borders to attract the user’s attention, and adds a tooltip that explains its meaning (Figure 11). Interrogative Debugging [40] is another interactive debugging support for the Alice storytelling system. End-user programmers can ask “why did” and “why didn’t” questions about runtime failures in their programmed animations to get detailed information such as the program’s execution history (Figure 12). End-user programmers using PBE systems also make mistakes. User-provided examples are often ambiguous, in that the PBE engine might synthesize an unintended program that is consistent with the provided examples. This can cause users to lose confidence in the PBE system, which is a major usability issue of PBE, as Lau [44] pointed out. To resolve the ambiguity, researchers have proposed a few interaction models. For instance, Wrangler [35] lets users choose an operation among top candidates. FlashProg [54] has suggested two interaction models. First, program navigation (Figure 14) allows users to effectively choose the intended program among a large number of candidates by comparing positive and negative test results. Second, a conversational clarification interaction model (Figure 13) asks users specific questions that can effectively resolve ambiguities.
Figure 13. Conversational Clarification being used to disambiguate different programs that extract individual authors.
Figure 14. The Program Navigation tab allows users to navigate sub-expressions of a program, and choose among alternative sub-expressions that other programs have suggested.
A few PBE systems [35,88] support decomposition by allowing users to create multiple operations one by one. However, users of such systems are often frustrated at not knowing what the possible primitive operations are [24,88], how far the current state is from the solution, or what intermediate steps would reach the solution [25]. Supporting users in decomposing complex tasks into small subtasks and incrementally composing solutions is still an open-ended research question.
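As a deliberately simplified illustration of the conversational clarification idea (not FlashProg’s actual algorithm), the sketch below keeps several hypothetical candidate programs that agree on the user’s example, finds an input on which they disagree, and asks the user which output is intended; each answer eliminates the candidates that contradict it.

```python
# Hypothetical candidate programs, all consistent with the single example
# ("item 12" -> "12") but disagreeing on other inputs.
CANDIDATES = {
    "last token":        lambda s: s.split()[-1],
    "all digits":        lambda s: "".join(ch for ch in s if ch.isdigit()),
    "text after 'item'": lambda s: s.split("item", 1)[1].strip(),
}

def clarify(candidates, probe_inputs, ask):
    """Ask clarifying questions until at most one candidate remains."""
    alive = dict(candidates)
    for probe in probe_inputs:
        outputs = {name: prog(probe) for name, prog in alive.items()}
        if len(set(outputs.values())) <= 1:
            continue                      # candidates agree; nothing to ask
        chosen = ask(probe, sorted(set(outputs.values())))
        alive = {n: p for n, p in alive.items() if outputs[n] == chosen}
        if len(alive) <= 1:
            break
    return alive

def simulated_user(probe, options):
    # Stand-in for the real user, who intends the "last token" behavior.
    intended = CANDIDATES["last token"](probe)
    print(f"For input {probe!r}, which output do you mean? {options} -> {intended}")
    return intended

remaining = clarify(CANDIDATES, ["item 7 of 30", "item A4"], simulated_user)
print("remaining candidates:", list(remaining))   # ['last token']
```

A few well-chosen questions of this form can resolve ambiguity that the original examples left open, without requiring the user to read or compare the synthesized programs themselves.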
Human-AI Interaction
Direct manipulation vs. Autonomous agent
A great visionary at the beginning of human-computer interaction, Licklider [49] envisioned symbiotic interaction as an optimal collaboration of man and machine, which aims to solve complex problems by tightly coupling human minds and computers. For decades, system developers and researchers have built on his vision, striving for an optimal division of roles, responsibilities, and initiative. There was a hot debate between Ben Shneiderman and Pattie Maes [76] about whether direct manipulation or autonomous agents would be the ultimate form of human-computer interaction. Direct manipulation provides rapid, incremental, reversible actions and feedback that give users the feeling of being in control and the responsibility for the decisions they make [74]. However, as all the initiative has to come from the user, solving complex problems with direct manipulation can be very inefficient and hard to learn [27]. In contrast, autonomous agents proactively keep track of a user model and suggest the most likely solutions, so that users can delegate complex problems to software agents [76]. The debate ended not with a winner but with an open-ended research question – how to make the two approaches complement each other [29]. For example, autonomous agents can improve the productivity of direct manipulation systems by automating repetitive tasks. Even if users trust autonomous agents and delegate all their tasks, direct manipulation is still important to keep the system comprehensible, predictable, and controllable.
Mixed-initiative interaction
“Mixed-initiative … refers broadly to methods that explicitly support an efficient, natural interleaving of contributions by users and automated services … allowing computers to behave like associates… Achieving … fluid collaboration between users and computers requires solving difficult challenges.” –[28]
In response to the debate between direct manipulation and autonomous agents [76], mixed-initiative interaction aims to interleave them by letting humans and computers work on shared tasks, monitor each other’s activity, and negotiate who will take the initiative. Eric Horvitz has summarized principles [28] and challenges [29] of mixed-initiative interaction. Tecuci et al. [20] have introduced seven aspects of mixed-initiative interaction (task, control, awareness, communication, personalization, architecture, and evaluation) to help understand existing mixed-initiative systems and build general design principles. Example applications of mixed-initiative interaction include:
• Major search engines7 update suggested keywords and retrieved web pages for every keystroke made by users. The rapid feedback loop enables users to understand what combination of keywords would give better results.
• Integrated development environments predict the language constructs that the user is currently typing. This reduces the number of characters to be typed, and also prevents users from making typos.
• Planning tools (e.g. floor-planning CAD [18], meeting schedulers [10], and thermostats [38]) offer advice (e.g. that the stove and refrigerator are too far apart in a floor plan) according to a user’s activity.
7 google.com; www.bing.com; search.yahoo.com
It is noteworthy that the applications of mixed-initiative interaction listed above are exploratory and creative processes where neither users nor computer agents have a complete understanding of the problems or the solutions. Instead, users incrementally refine their goals based on the solutions suggested by computer agents [26], as shown in applications such as the meeting scheduler [10,96] and the thermostat [38]. In spite of the aforementioned opportunities, mixed-initiative interaction may not be a panacea for all usability issues. For instance, users of Proactive Wrangler [25] did not trust programs that were generated without their initiation. Causal relationships in mixed-initiative systems tend to be more complex than in fixed-initiative systems. Developing a mixed-initiative system requires synergistic integration of AI and HCI [20]. From the AI perspective, it might require knowledge representation of the task and the user’s intention, problem solving and planning, and learning algorithms.
From the HCI perspective, it requires an effective UI for dialogue, intent expression, understanding generated solutions, and building trust. In this dissertation, we applied principles of mixed-initiative interaction to propose two solutions for usability issues of PBE. First, I proposed a novel interaction model for VESPY that allows both the user and the PBE engine to take the initiative in program decomposition (Chapter 4). I also identified patterns of common mistakes that users make while using PBE (Chapter 5), and proposed a novel mixed-initiative feedback mechanism to help users quickly understand and fix mistakes in collaboration (Chapter 6).
Chapter 3: Formative Study: End-User Needs for Enhancing the Web
End-user programming (EUP) is a common approach for helping ordinary people create small programs for their professional or daily tasks. Since end-users may not have programming skills or strong motivation for learning them, tools should provide what end-users want with minimal costs of learning – i.e., they must decrease the barriers to entry. However, it is often hard to address these needs, especially for fast-evolving domains such as the Web. To better understand these existing and ongoing challenges, we conducted two formative studies with Web users – a semi-structured interview study, and a Wizard-of-Oz study. The interview study identified challenges that participants have in their daily experiences on the Web. The Wizard-of-Oz study investigated how participants would naturally explain three computational tasks to an interviewer, who acted as a hypothetical computer agent. The two user studies demonstrate a disconnect between what end-users want and what existing EUP systems support, and thus open a path towards better support for end-user needs. In particular, our findings from the interview study are (1) an analysis of challenges that end-users experience on the Web, and solutions they envision, and (2) seven core functionalities of EUP for addressing these challenges. Findings from the Wizard-of-Oz study include (3) characteristics of non-programmers describing three common computational tasks, and (4) design implications for future EUP systems.
Introduction
Over the decades, the Web has become the most popular and convenient workbench for individuals and businesses, supporting an incredible number of activities. However, developers of Web services cannot completely anticipate future uses and problems at design time, when a service is developed. Thus we can expect that users, at use time, will discover misalignments between their needs and the support that an existing system can provide for them [16]. Numerous examples of this misalignment exist. For example, a site designed to support comparison shopping for online shoppers may not meet the needs of shoppers who want to compare prices across different sites and even track daily prices8. Another example is that people often use customizable applications (e.g. RSS feed readers) to manage ever-growing channels instead of visiting individual sites. More broadly, fraudulent sites and deceptive opinion spam are ongoing concerns for consumers [65]. When a Web page does not match their needs, people often use mashups [15,85,90,91,93], or browser extensions and scripts [7,47,60,98], built by third-party programmers.
Unfortunately, there are not enough third-party solutions to address the needs of all 1.4 billion end-users across 175 million websites [78], and enabling end-users to develop their own solutions is the goal of end-user programming on the Web (WebEUP). A clear understanding of end-user needs is essential for building successful programming tools [73]. In this chapter we report two user studies. The first study, a semi-structured interview study, addresses the research questions defined in section 1.2:
R1. What do end-user programmers need to improve the Web?
a. What challenges do end-users experience on the Web?
b. What features should EUP systems provide to end-user programmers?
The second study addresses:
R2. How do non-programmers express their programming intent?
8 http://camelcamelcamel.com
Answering the above questions is important to have a clear understanding of the direction we should take to develop WebEUP systems that will be useful and effective for a broad range of people. Prior studies [91–93] characterize potential end-user programmers’ mindsets and needs. Researchers have also investigated end-user programmers’ real-world behavior and the software artifacts they created with specific WebEUP tools such as CoScripter [5]. Live collections such as the Chrome Web Store9 and ProgrammableWeb10 are valuable resources that address user needs through community-developed scripts and mashups. This chapter reports on an interview study with similar motivations – to investigate what challenges end-users experience and how they would improve the Web – but focuses on the unmet needs of 35 end-users on the Web with minimal bias from current technology. Through iterative coding we identify the patterns of challenges that end-users experience. We also suggest seven functionalities of EUP for addressing the challenges: Modify, Compute, Interactivity, Gather, Automate, Store, and Notify.
9 https://chrome.google.com/webstore/category/apps
10 http://www.programmableweb.com/
There is a wealth of prior work on the second research question. Researchers have studied the psychology of non-programmers. Miller [56,57] examined natural language descriptions by non-programmers and identified a rich set of characteristics such as contextual referencing. Biermann, Ballard and Sigmon [3] confirmed that there are numerous regularities in the way non-programmers describe programs. Pane et al. [67] identified vocabulary and structure in non-programmers’ descriptions of programs. We conducted a Wizard-of-Oz study with 13 non-programmers to observe how they naturally explain common computational tasks through conversational dialogue with an intelligent agent. The interviewer acted as a hypothetical computer agent who understands participants’ verbal statements, gestures, and scribbles. This study expands existing work with characteristics of non-programmers’ mental models. Findings from the interviews and the Wizard-of-Oz study together demonstrate a disconnect between what end-users need from EUP and what current systems support. In addition to identifying a set of important functionalities that should be included to best support end-users, our findings specifically highlight the need for social platforms for solving complex problems, and for interactivity in programs created with EUP tools to alleviate end-users’ concerns about using third-party programs. The Wizard-of-Oz study also shows that future EUP tools should support multi-modal and mixed-initiative interaction for making programming more natural and easy-to-use.
The two studies have the following contributions: 1) identification of unmet needs of end-users of the Web; 2) characterization of non-programmers’ mental models describing computational tasks; 3) implications for designing future EUP systems. Study 1: End-User Needs on the Web To better understand end-user needs on the Web, we conducted a semi-structured interview study. The goal was to better understand the challenges that the participants experience, and enhancement ideas that they envision without technical constraints. 31 The approach is to qualitatively analyze the participant responses to identify themes that should be considered in the development of future WebEUP systems. Participants 35 participants (14 males, 21 females) were recruited via a university campus mailing list, social network, and word-of-mouth. They were on average 30.8 years old (SD = 5.1) and had a wide range of occupations as shown in Table 1. Every participant spends at least one hour per day on the Web. 10 out of 35 participants had used at least one programming language, and five participants had created web pages. However, none of them had the experience of end-user programming on the Web. We did not offer any incentive for participation. Procedure 18 interviews were conducted via a video chat program with shared screen11, while the rest were face-to-face interviews at public areas such as libraries and cafes. I asked participants, 11 Google Hangout (https://hangouts.google.com/) Table 1. Occupational background of the participants of study 1 Graduate students 15 Engineering 8 Business 4 Psychology 2 Education 1 Professionals 12 IT specialists 8 Directors and office managers 4 Non-professionals (e.g. homemaker) 8 Total 35 32 “Show me a couple Web sites that you recently visited, and tell us challenges that you experienced there. If you could hire a team of designers and developers for free, how would you improve the Web sites?” We recorded (or videotaped for the face-to-face interviews) the participants visiting two to four sites they recently experienced problems. While demonstrating regular tasks on the sites, participants followed the think-aloud protocol. For the challenges they mentioned, we asked them to imagine a team of third-party developers, and to explain to the “team” an enhancement for the Web site. Each interview covered approximately three (M = 3.02) sites, and took approximately 20-40 minutes. The study was found to be exempt from IRB review. Data and Analysis 35 participants demonstrated the use of 92 sites (M = 2.63) that included online shopping (24 sites), academic research (17), streaming video (11), news (10), work- related sites (7), forums (5), search engines (5), social network services (4), travel (4), finance (2), review sites (1), job market (1), and weather (1). Note that these frequencies do not correlate the frequency of regular visits but the challenges that our participants experienced. While visiting the sites the participants explained 106 challenges. Every interview video was transcribed, and coded. As an exploratory work, we pursued an iterative analysis approach using a mixture of inductive and deductive coding [8,30]. First, we created a codebook derived from the literature [13,92] and an initial post- interview discussion within the research team. 
The codebook included types of challenges (lack of relevant information, repetitive operations, poorly-organized information, privacy, security, fake information, bugs), and functionalities required for 33 doing a wide range of WebEUP tasks (mashup, redesign, automation, social knowledge, sharing, monitoring). To assure high quality and reliable coding, two researchers independently coded ten randomly selected ideas. Analyzing the Inter-Rater Reliability (IRR) of that analysis with Krippendorff’s alpha (α = 0.391; total disagreements = 24 out of 255), we revised the codebook. Then the two researchers coded another ten randomly selected ideas, and achieved a high IRR (α = 0.919; total disagreements = 6 out of 248). After resolving every disagreement, the first researcher coded the remaining data. Following the guide of thematic analysis [8], we collated the different codes into potential themes, and drew initial thematic maps that represent the relationship between codes and themes. We then reviewed and refined the thematic maps, to make sure that data within a theme was internally coherent, and that different themes were distinguished as clearly as possible. The two following subsections summarize the two groups of themes: challenges that participants experience on the Web, and functionalities of WebEUP for addressing those challenges. Result: Challenges Based on the above-described process, four groups of common challenges and enhancement ideas were found which are described in the following sections. Challenge #1: Untruthful information While trust is a key element of success in online environments [12], 17 participants reported four kinds of untruthful information on the Internet. Deceptive ads were reported by three participants. Two of them reported deceptive advertisements that used confusing or untrue promises to mislead their consumers. For 34 example, P31 gave a poignant example that a local business review site posts unavailable items on the Internet: “If you're looking for a contractor to work on your home, and other home stuff, [local business review site] shows them with ratings. A few weeks ago I started paying them again for other information, but they have something very frustrating. They have a several page list of mortgage brokers searchable from [search engine]. But when you pay the fee for their service, they have only a fraction of the information. I complained to them, but they have some stories why it is not... Anyways, I canceled my membership without getting my one-month fee refunded.” (P31) Another participant tried to avoid using an online marketplace because of deceptive ads in it: “I know there are rental houses with good value on [online marketplace], but I do not use it often. There are too many liars on [online marketplace]. Instead I post on [Social Network Service] to get help or recommendations from people that I trust.” (P21) Links to low-quality content were reported by seven participants. During the interview, two participants clicked broken links to error pages. Five participants reported that they had to spend significant time and effort to find high-quality video links in underground streaming video sites: “At [Underground TV show sites], I have to try every link until I find the first ‘working’ link. By working, I mean the show must be [in] high-resolution, not opening any popup, and most of all as little ads as possible.” (P6) A straightforward solution is to attach quality markers next to the links. 
However, it is extremely challenging to define a metric of high-quality links that everybody will agree upon. 35 Virus and Malware was reported by four participants. They were aware of the risks of installing programs downloaded from the Internet, but estimating risks is often inaccurate. For example, two participants stopped using a streaming video site and a third-party plugin worrying about computer viruses, though in fact, those site and plugins were safe. “I used this streaming link site for a while, but not after a friend of mine told me her computer got infected with malwares from this Web site. I wish I could check how trustworthy the site [is] when using [it].” (P24) “I have [used popup blocker extension], but am not using [it] now. Those apps have viruses, don't they? I also don't use any extensions.” (P17) This suggests that end-users may have inaccurate knowledge about the risks of their activities on the Internet. Even though third-party programs provide terms and conditions, and permission requests, users are often ‘trained’ to give permission to popular apps [11] as stated by P27: “If the site is important to me, I just press the 'agree' button without reading.” Opinion spam was reported by four participants. While social ratings and consumer reviews are conventional ways to see feedback on products and information, the reliability of the feedback is often questionable [66]. Four participants reported concerns about opinion spam – inappropriate or fraudulent reviews created by hired people. For example, P31 reflected, “I saw that some sites have certificates, but they were on their own sites. So, who knows what they’re gonna do with that information? […] For example, I had a terrible experience with a company that I hired for a kitchen 36 sealing repair, even though they had an A+ rating on [a local business review site].” P27 also expressed concerns about fake reviews, “ratings are somewhat helpful. However, I cannot fully trust them especially when they have 5 star ratings - they might have asked their friends and families to give them high ratings.” Similar to deceptive ads, opinion spam is a gateway to serious financial risks such as Nigerian scams [12], but there is no simple way to estimate the risk. Summary. In order to deal with untruthful information, participants would look for more trustworthy alternatives. For example, P21 used a social network service instead of online marketplaces. If participants could not find an alternative source, they would assess the risks and benefits of using the untruthful information, and decide either to give up the task or to take the risk, as P31 said, “I don’t believe everything on the Internet. But sometimes I have no other choices than to try it with caution.” The remaining issue is that estimating the risk of untruthful information is often quite difficult. Challenge #2: Cognitive Distraction Most participants reported cognitive distractions that make information on the Web hard to understand. We identified four types of cognitive distractions as listed below. Abrupt design changes were reported by three participants. Websites are occasionally redesigned – from a minor tune-up to a complete overhaul – for good reason. However, it often undermines the prior knowledge of its users, and makes the sites navigation difficult. 
For example, P22 could not find her favorite menu item because “the library recently changed its design, making it much harder to find the menu.” Since she found a button for switching back to the classic design at the end, she didn’t take advantage of new features in the updated design. P24 shared a similar story: 37 “One day Facebook suddenly changed the timeline to show this double column view. That was very annoying.” Annoying advertisements were reported by 30 participants. We found that the degree of cognitive distraction varies across different types of ads. For example, ads with dynamic behavior are much more annoying than static banner ads: “There are popup ads that cover the content and follow your scrolling. Although they usually have very small 'X' or 'Close' buttons, I often miss-click the popup to open the Web page. That's pretty annoying.” (P17) This finding is consistent with prior research that found display ads with excessive animation impair user’s accuracy on cognitive tasks [22]. 16 participants were using browser extensions (e.g. Chrome AdBlock12) to automatically remove ads. However, one participant had stopped using it for security and usability issues: “I have, but am not using [AdBlock] now. Those apps have viruses, don't they? […] They would be very useful in the beginning, however they also restrict in many ways. For example, the extension sometimes automatically block crucial popup windows. So I ended up manually pressing 'X' buttons.” (P17) Unintuitive tasks. Six participants reported that several Websites are hard to use. For example, to create a new album in Facebook, users are required to upload pictures first. This task model clearly did not match a participant’s mental model: “I tried to create a new photo album. But I could not find a way to create a new album without 12 http://goo.gl/rA6sdC 38 uploading a picture. That was a very annoying experience.” (P18). Another user reported a similar issue of not being able to create a new contact after searching in a mailing list: “I'm adding a new person to the contact database. I should first search the last name in order not to put duplicate entry. If the name does not exist, it simply shows [0 result found]. Obviously I want to add a new entry, but there's no button for that. That bugs me a lot, because I have to get back to the previous page and type the name again.” (P16) Websites with unintuitive navigational structures would require users to do many repetitive trial-and-errors. “When preparing to visit a touristic place, I look for entrance fee, direction, and other basic information from their official sites. However, some sites have that information deep in their menu structure, so I had to spend much time finding them. I wish those information were summarized and shown in one page. Sometimes it's hard to find useful images for campsites or cabins. For example, I want to see the image of bathroom, but people upload pictures of fish they caught.” (P33) Information overload. Five participants reported that excessive and irrelevant information prevents them from understanding the main things that they care about. For example, P22 was disappointed at blog posts full of irrelevant information: “I was searching for tips to clean my computer. 
However, most blog posts have very long explanations of why I should keep computers clean without telling how to clean it till the end.” (P22) 39 A long list without effective filtering also causes information overload as P2 stated: “I want these conferences filtered by deadline, for example, showing conferences whose deadlines are at least 1-month from now. Also, if possible, the filter can look at descriptions of each venue and choose ones containing at least three relevant keywords.” A simple enhancement to solve this problem is to remove unnecessary, excessive information, which is often very hard to decide. For example, P27 criticized an online shopping site for having a lot of unnecessary and irrelevant information. However, when evaluating usefulness of individual components, she became more vigilant, and stressed that her opinions are personal and depending on her current situation. “I would remove these promoted products on the side bar. However, if these promotions were relevant to my current interest, I would keep them. […] Shopping cart and Personal coupon box can be useful later. […] I don’t need extra information about secured payment, getting products at the shop, or printing receipts.” (P27) To enhance websites with an over-abundance of information, participants envisioned creative scenarios including interactivity and design details. For example, P2 proposed to add a custom filter for a long list. P26 wanted to have the personalized summary at the top of a long document with a pop-up window for important information: “I do not read every Terms and Condition agreement. It’s too long and mostly irrelevant. However, it would be useful if hidden charges or tricky conditions were highlighted. I think critical information such as hidden charges can be shown in a pop-up 40 window. It would be best the most important summary is shown at the top, because I could just click 'yes' without scrolling it down.” (P26) Challenge #3: Repetitive Operations Participants reported tedious and repetitive operations on the Web. Based on them, we identified three common reasons for repetitive operations. Unsupported Tasks. Seven participants wanted to automate repetitive tasks. Efficient repeating of some of the tasks is unsupported by the websites. For example, four participants wanted to automate simple interactions such as downloading multiple files or clicking a range of checkboxes with a single click. “[At an academic library], I click the "Save to Binder" button, then select a binder from the drop-down in a new window. Then I click the "save" button then the "done" button, then close the window. It's really annoying to do it over and over. It would be great to create a "save this!" button.” (P4) Three participants wanted to automate filling the forms of personal / credit card information. Information from multiple sources. Reported by 20 participants, integrating information from multiple sources is a common practice on the Web [93]. End-users switch between browser tabs to compare information repeatedly, but it can be time consuming since it requires short-term memory to compare information on tabs that are not simultaneously visible. 17 participants wanted to save their time and effort by integrating information across multiple sources. 
For example, P33 told, “I often search for videos on YouTube for baby diapers or other things to wear because those videos 41 are very helpful to understand usage of products.” Similarly, four participants wanted to integrate course schedule page with extra information available such as student reviews, lecture slides, and reading lists. Time-sensitive Information. Five participants reported that they regularly check time-sensitive information such as price (3), hot deals (1), second-hand products (1), and other notifications (1). Using price trends as an example, three participants envisioned a complex service that automatically archives price information retrieved from multiple sites, visualizes the price data as timeline graph, and sends email / mobile notifications when the price drops: “I can imagine that program or Web site will be able to grab information, especially prices from various malls, and compare it automatically. [...] It will also say ‘this is the lowest price for recent three months.’ so that I don't have to visit Amazon and Newegg everyday. […] I want it to send me email alerts - saying ‘Hey, based on your recent search history on the Canon G15, we found these new deals and prices. It's the lowest price in the last month.’ ” (P21) “[She opened CamelCamelCamel.com] If I want to buy a bread machine, I search and choose one model. Here the graph shows the price trend of the model. I can make a decision on whether I should buy or wait. Unfortunately, this site only shows products from Amazon.com.” (P33) Challenge #4: Privacy Privacy did not come up much, but one participant (P24) expressed strong negative opinions about the way that a social networking service handles her data: 42 “[At a social network service] a friend of mine told me that if I 'like' her photos or put comment on them, others will be able to see it even if the photos are private. […] Here's another example that I don't like about [the SNS]. One day I uploaded a family photo, and my family-in-law shared those photos. That's totally fine. However, the problem began when friends of my family-in-law started liking and commenting on my family photos. I received a lot of notifications of those activities by people I do not know at all. I felt a little scared.” (P24) As another example of privacy issues, P24 believed that her browser tracks her activity history, and shared it with online advertisement companies without her permission, because banner ads on other Web pages show ads related to her previous activity. Potential Functionality of Web Enhancements Based on the challenges of the previous section, here we present functionality that we believe future WebEUP systems should consider. The functionality has seven categories: Modify, Compute, Interactivity, Gather, Automate, Store and Notify. To our knowledge, Interactivity, Store, and Notify among them were not supported by existing EUP systems for the Web. Modify. Modification of existing web pages is the most commonly required functionality for 66 out of 109 enhancements. Examples include attaching new DOM elements to the original pages (31 enhancements), removing or temporarily hiding unnecessary elements (15 enhancements), and highlighting information of interest by changing font size, color, or position (5 enhancements). Modification often involves adding new interactive behavior of Web sites (8 enhancements). 
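As a rough sketch of what such modifications involve, the example below removes ad elements, highlights a price, and attaches a new element to a made-up page snippet using BeautifulSoup; real WebEUP tools perform comparable operations on the live DOM inside the browser rather than on static HTML.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

HTML = """
<div id="content">
  <div class="ad">Buy now!</div>
  <p class="price">$129.99</p>
  <div class="ad">Subscribe!</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Remove unwanted elements (here: anything with class "ad").
for ad in soup.select(".ad"):
    ad.decompose()

# Highlight information of interest by changing its style.
for price in soup.select(".price"):
    price["style"] = "font-size: 1.5em; background: yellow"

# Attach a new element to the original page.
note = soup.new_tag("p")
note.string = "Lowest price in 3 months"
soup.select_one("#content").append(note)

print(soup.prettify())
```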
Existing WebEUP 43 tools support a wide range of modification such as removing unwanted DOM elements [7], and attaching new DOM elements or interactive behavior to existing elements [78]. Compute. 29 enhancements require a variety of data transformation: filtering elements by user-specified criteria (13 enhancements), extracting specific information from text documents (9), and arithmetic operations (7). While computation is a fundamental part of programming languages, existing EUP systems support it in varying degrees. For example, scripting languages [98] offer an extensive set of language constructs such as general-purpose languages (e.g. JavaScript). Data integration systems [80,87] focus on handling large amount of semi-structured text input, but provide less support on numerical operations. Systems for automated browsing [47,59] provide few language constructs for computation. Interactivity. 29 enhancements would need interactive components that address the dynamic needs of users. For example, 13 enhancements include triggering buttons, because users wanted to make use of them in-situ. Eight enhancements show previews of changes it will make on the original sites so that users can choose among them. Enhancements often require users to configure options such as search keywords, filtering criteria, specific DOM elements based on their information needs (8 enhancements). WebEUP tools often employ predefined interactive components such as buttons and preview widgets [78]. However, none of them enable users to create their own interactivity. Gather. 18 enhancements gather information from either the current domain (9 enhancements) or external sources (5 enhancements). One example use of information of the current domain is to preview linked resources without clicking, as P5 stated, “At 44 various cosmetic malls, I wish the main listing page showed detailed direction on how to use the products.” In contrast, participants wanted to gather information from external sources that current sites are missing. Information gathering is supported by mashup tools [15,82,87]. Automate. 15 enhancements automate repetitive tasks that include filling in input forms (4 enhancements), downloading multiple images and files (4), page navigation (3), clicking a series of buttons, checkboxes, and links (3), and keyword search (1). Existing WebEUP tools such as CoScripter [47] and Inky [60] support automating repetitive tasks. Store. 14 enhancements store three types of data while being used. The first type relates to user’s activities such as filling input forms, page navigation, and job applications found in five enhancements. The second type is temporal information periodically gathered from designated sources such as online shopping malls, or ticketing sites found in five enhancements. The last is bookmarks of online resources such as news articles, blog posts, or streaming videos found in four enhancements. Existing WebEUP systems such as CoScripter [5] often provide public repositories for scripts, but none of them allow end-users to create custom storage of usage data. Notify. Eight enhancements send notifications to users via emails (7 enhancements) or SMS messages (1), periodically or when user-specified events occur. To our knowledge, no existing WebEUP tool supports notification. 45 Design Implications Based on the challenges and the potential functionality of Web enhancements, we discuss two design implications for future WebEUP systems and designing Web sites in general. 
Social Platform beyond Technical Support Traditional WebEUP systems focus on lowering the technical barrier of Web programming. For example, mashup tools enable users to integrate information from multiple pages with just a few clicks. Automation tools allow users to create macro scripts through demonstration. Despite the advantage of those technical aids, we noted a few enhancement ideas require domain knowledge of multiple users who have the same information needs. For instance, when end-users want to integrate additional information with original pages, the key question is where the additional information can be found. When users want to focus on an important part of a long text, the key is which part of the text previous visitors found useful (similar to Amazon Kindle’s “Popular Highlights” feature.) An example of how a social platform could address the untruthful information issue follows. An end-user programmer creates and deploys an enhancement that attaches an interactive component (e.g. button for rating individual hyperlinks) to the original page. Users who have installed the enhancement would use the new component to provide their knowledge (e.g. quality of the linked resources), which will be saved in the enhancement’s database. As more data is collected, the enhancement will become more powerful. To enable end-users to build social platforms in the aforementioned scenario, future WebEUP systems need two functionalities. First, end-user programmers should 46 be able to create and attach interactive components that collect knowledge and feedback from users. Second, end-user programmers should be able to set up centralized servers that communicate with individual enhancements running in each user’s browser, and store collected information. To our knowledge, no prior WebEUP system has fully supported these functionalities for social platforms. However, there are certainly custom solutions of this type that are commonly used such as, for example, Turkopticon13 that helps web workers using Amazon Mechanical Turk rate job creators. Alleviate the Risk of Using Enhancements According to the attention investment framework [4], end-users would decide whether to use an enhancement or not as a function of perceived benefit versus cost. Even though our participants assumed no development costs, we could identify the following concerns about risks of using enhancements. Uncertain needs. Our participants often had concerns about the dynamic and uncertain nature of their needs and situation. For example, P27 found advertisements on an online shopping site to be annoying, but did not remove the advertisements because of their potential usefulness in the future. WebEUP systems should be able to support interactivity so that users can change configurations or make decisions whenever their needs change. Otherwise end-users will be forced to stop using it, as P26 and P17 did with non-interactive pop-up blockers. Breaking the original design. Enhancement developers should try to minimize unnecessary change of the original site. Two participants expressed concerns about 13 https://turkopticon.ucsd.edu/ 47 breaking the original site’s design and functionality: “I think the best part of Craigslist is its simplicity. I might have seen the filters, but did not bother setting them every time I visit this site.” (P20) Privacy and Security. End-users have significant privacy and security concerns about installing extra software, especially those developed by third-party programmers. 
Ironically, we observed end-users rarely read legal documents, and are trained to give permissions to popular apps. Future work should confront these practical concerns and design how to communicate potential risks and treatments. Summary of Design Implications The seven categories of enhancements can be useful to web site designers as they think about what a wide range of users might want. There is another potential of more directly benefiting from end-user modifications to web sites. Actual enhancements made by end-users could provide valuable feedback for designers of the sites if those desires were expressed via use of a WebEUP tool. For example, designers could learn what kind of information users consider to be untruthful by learning about user feedback on specific information. Repetitive operations could be observed by seeing what modifications users make, etc. Nevertheless, those feedbacks cannot replace WebEUP, as designers and users often have conflicting interests. For instance, designers may not agree to remove advertisements that end-users find annoying since they provide revenue. Some ideas may be useful for specific user groups but not for everyone, and so are not worth pursuing. Ideally, designers should consider providing hooks or APIs that enable end-users to build robust, high-quality enhancements. 48 Limitations We made several simplifying assumptions that limit the scope of our findings. First, 35 participants of the interview study were not large enough to represent the entire population of potential end-user programmers. In order to extend the generalizability of the findings, an online survey would be an appropriate method. Second, around half of our participants have non-technical backgrounds, which is an unusual characteristic of end-user programmers. Some of the challenges and solutions they shared could be different from end-user programmers who usually have technical knowledge. Third, in order to minimize technical bias, the semi-structured interview did not provide any technical constraints. Therefore, participants imagined EUP solutions without considering the time and effort of development. STUDY 2: NON-PROGRAMMERS MENTAL MODEL OF COMPUTATIONAL TASKS Programming is difficult to learn since its fundamental structure (e.g. looping, if-then conditional, and variable referencing) is not familiar or natural for non-programmers [67]. Understanding non-programmer’s mindset is an important step to develop an easy-to-learn programming environment. This second study builds on the first by examining how non-programmers naturally describe computational tasks common to the WebEUP enhancements described in the first study. The findings suggest both design implications and open-ended research questions for future EUP systems. Participants The study was conducted with 13 participants, including five males and eight females, average 33.3 years old (SD = 5.86) with varying occupations as summarized in Table 49 2. All of the participants were experienced computer users, but they all said that they had not programmed before. The participants were recruited by the university mailing list that we used in the first study. They received no compensation for participation in this study. Method The study aims to characterize how non-programmers naturally describe complex tasks without being biased by specific language constructs or interactive components. 
We employed the Wizard-of-Oz technique [95] where the interviewer acted as a hypothetical computer agent that could understand the non-programmer’s verbal statements, behavioral signals (e.g. page navigation, mouse click), gestures, and drawing on scratch paper, and help them through conversational dialogue. The computer agent (called “computer” from here on) followed the rules listed below. 1. The computer can understand all the literal meaning of participants’ instruction, gestures, and drawings. However, the computer cannot automatically infer any semantic meaning of the task or the material. For example, a rental posting “4 Bedrooms 3 Lvl Townhome $1650 / 4br” is just a line of text to the computer. Table 2. Occupational background of the participants Graduate students 6 Engineering 3 Business 2 Education 1 Professionals 3 IT specialists 2 Directors and office managers 1 Non-professionals (e.g. homemaker) 3 Total 13 50 2. The computer can perceive a pattern from participant’s repeated examples and demonstration. For example, if a participant counted numbers within a range 1-3 in a table, the computer asks the participants “Are you counting numbers that are within a specific range?” 3. The computer can execute the participant’s instruction only if it is clearly specified without ambiguity. Otherwise the computer asks for additional information to resolve it through conversational dialogue like below: Programmer: Delete houses with fewer than three bedrooms. Computer: Please tell me more about ‘houses with fewer than three bedrooms’. Which part of the page is relevant? When the programmer demonstrates a set of examples, the computer will suggest a generalizing statement like below: Programmer: Delete this one because it contains 3br. Computer: Do you want me to delete every line that has 3br? Figure 15. Participants were asked to explain how to draw a histogram of the numbers in the table. In this example, the participant gave histogram bins different codes (A-D), and marked each number with the codes. Since the participant could not put 12 into any bin, he marked the number with question mark and a line that points a missing bin. 51 A sheet of paper containing basic instruction was provided, and the participants could draw or write anything on the paper as shown in Figure 15 and Figure 16. Task 1. Drawing Histogram Given a sheet of paper containing a blank histogram and 10 random numbers between 0 and 12 (see Figure 1), the participants were asked to explain the computer how to draw a histogram of the numbers. The blank histogram has four bins (0~3, 3~6, 6~9, and 9~12). The purpose of this task was to observe how non-programmers perform: (1) common data-processing operations (e.g. iteration, filtering, and counting), and (2) visualize numeric data by examples and demonstration. Table 3. In Task 2, the participants were asked to create a filter than removes houses with less than three bedrooms among housing rental posts scraped from Craigslist.com. “You want to create a filter that removes houses having less than 3 bedrooms. How would you explain it to the computer?” Brand New Townhome! $2200 / 3br - 1948ft² - (Clarksburg) Lanham 2/1 new deck $1050 / 1818ft² - (Lanham) 4 Bedrooms 3 Lvl Townhome $1650 / 4br - (MD) 823 Comer Square Bel Air, MD 21014 $1675 / studio … (6 more)… 52 Task 2. Custom Filter We prepared 10 rental postings in Table 3 copied from an online marketplace14. The participants were asked to create a program that removes houses having fewer than 3 bedrooms. 
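For reference, the sketch below shows one way a conventional program might implement the filter that participants were asked to describe; the extraction patterns are illustrative assumptions, not rules produced by participants. The three components such a program needs are discussed next and are marked in the code.

```python
import re

POSTS = [
    "Brand New Townhome! $2200 / 3br - 1948ft² - (Clarksburg)",
    "Lanham 2/1 new deck $1050 / 1818ft² - (Lanham)",
    "4 Bedrooms 3 Lvl Townhome $1650 / 4br - (MD)",
    "823 Comer Square Bel Air, MD 21014 $1675 / studio",
]

def bedroom_count(post):
    """(1) Extraction: find the number of bedrooms, written as '3br',
    '4 Bedrooms', or a '2/1' rooms/baths shorthand; None if unknown."""
    m = re.search(r"(\d+)\s*(?:br\b|bedrooms?\b)", post, re.IGNORECASE)
    if not m:
        m = re.search(r"\b(\d+)/\d+\b", post)   # e.g. "2/1"
    return int(m.group(1)) if m else None

def keep(post):
    """(2) Conditional: keep posts with at least three bedrooms."""
    n = bedroom_count(post)
    return n is not None and n >= 3

# (3) Removal: hide everything the condition rejects.
filtered = [p for p in POSTS if keep(p)]
print(filtered)   # only the 3br and 4br listings remain
```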
The program consists of three components: (1) extracting text that represents the number of bedrooms in each post (e.g. “3br(s)”, “3bedroom(s)”, “3 BEDROOMS”, “3/2”), (2) a conditional logic for filtering posts with less than three bedrooms, and (3) removing / hiding the filtered houses. The purpose of the task is to 14 Craigslist.com Figure 16. In this example of Task 2, the participant used scribbles along with verbal statements. For example, the participant wrote variations of keywords that indicate “bedroom” used in the list. He/she also circled and underlined the number of rooms in each title to demonstrate the text extraction logic, crossed out titles that did not meet the criteria, and drew arrows from houses to empty slots in the list. 53 observe how non-programmers decompose a big task into sub-tasks, specify extraction queries, and refer temporary variables such as sub-strings and selected postings. Task 3. Mash Up At Amazon.com, each product has different options (e.g. available colors and sizes) that are shown in the product detail page. The participants are asked to create a program that extracts the available colors from detail pages, and attaches to the product listing. The purpose of the task is to understand how non-programmers would describe copy operations across multiple pages, and event handling. Procedure Each session began with a brief interview about the participant’s programming experience and occupational background. The interviewer introduced the Wizard-of- Figure 17. In Task 3, participants were asked to describe a simple Mashup program that shows available colors of each individual product in the Main page (top left) extracted from the Product Detail (bottom right) page. 54 Oz method, and gave an exercise task – ordering the interviewer (acting the hypothetical computer agent) to move a cup to another corner of the table. After participants said they fully understood the concept of the hypothetical computer agent, we started the actual study by introducing the three scenarios in a randomized order. For each scenario, participants were asked to explain the task to the “computer”. Participants were allowed to finish or to give up a task at any point. Data and Analysis The entire session was video recorded, and transcribed for qualitative analysis. The transcript of each task consists of a sequence of conversational dialogue between the participant and the interviewer, finger and mouse pointing gestures, scribbles on the paper (Figure 16 and Figure 17; only for T1 and T2), and page scroll and mouse events in the browser (only for Task 3). To analyze the transcript, the first author created the initial codebook derived from the literature [67] and an initial post-interview discussion within the research team. The codebook included how the participants described and what challenges they experienced. While repeating the coding process, a few categories emerged: programming styles, imperative commands, ambiguities, and multi-modal intent. Findings In this section we characterize how non-programmers describe computational tasks. Participants were allowed to stop at any moment, but all of them could eventually complete tasks with the computer’s help. Each task took an average of 415.3 seconds (SD = 217.4). We did not observe any fatigue effect. Since participants had very limited understanding of the computer at the beginning, most of their initial explanations were 55 not very informative. Thus the computer asked for further information as the examples below. 
(Task 1) P12: Wouldn't computers draw graph when numbers are assigned? I'm asking because I have no idea. P11: Find the numbers, and draw them at the first bin. Computer: How can I draw them? P11: What should I tell? Color? (Task 2. Custom Filter) P5: First, I scan the list with my eyes and exclude them. They clearly stand out. Computer: How do they stand out? P8: I'd order, “Exclude houses with one or two bedrooms.” Computer: How can I know the number of bedrooms? (Task 3. Mash Up) P11: I'd ask computer to show available colors of this Columbia shirt. Computer: Where can I get available colors? Natural language tends to be underspecified and ambiguous [67]. We frequently observed that our participants skipped mentioning essential information. For example, most participants did not specify how to iterate multiple elements in list. They instead demonstrated handling the first item, and expected the computer to automatically repeat the same process for the rest of the items. They did not refer to objects by names as programmers use variables. However, they referred to previously mentioned objects by their actual values (underlined in the following example), as P20 said, “In this next column, we need items going 6, 7, and 8. So please find those 6, 7, 8, and draw bar in this column.” They also used pronouns (e.g. “Remove them”), data type (e.g. “Attach colors”), and gestures (e.g. “Paste them here.”). While loops and variable referencing 56 are core concepts of programming languages, our findings suggest that non- programmers would find them unnecessary or even unnatural. We will discuss the issue further with design implications for future EUP systems in the discussion section. Through conversational dialogues, participants figured out what information the computer requires and how to explain. We found several characteristics of how non- programmers explain computational tasks as listed below. Explaining with rules and examples was used by 9 of 13 participants. When participants explained rules first, the following examples usually provided information that the rules were potentially missing. For example, while drawing a histogram for Task 1, P4 stated a rule, “Determine which bin each number is in”, followed by an example, “If the number is one (pointing the first item in the table), then count up this bin (pointing the first bin in the histogram).” Participants also provided examples first, and then explained the rules. P10 doing Task 1 gave all the numbers (0, 1, and 2) for the first bin, “For here (pointing the first column) we need 0, 1, and 2”, and then explained the range of those numbers, “Find numbers including zero, smaller than two.” Traditional programming languages rarely allow example-based programming. Although EUP systems often support Programming-by-Example (PBE) techniques, they do not allow this pattern – combining rules and examples to describe individual functional elements. Elaborating general statement through iteration was observed for every participant. Initial explanations of tasks were usually top-level actions (e.g. draw bars, remove houses, attach pictures) that contained a variety of ambiguities; but participants then iteratively elaborated the statements by adding more details. For example, P1 57 doing T3 described the top-level action, “Attach pictures here.” Then he elaborated where the pictures were taken from, “Attach pictures from the pages.” He kept on elaborating what the pages are and how to extract pictures from the pages. 
For T 1, as another example, P14 told the computer, “Draw a graph.” She then rephrased the statement with more details, “Draw a graph to number 2.” This pattern is far from traditional programming languages that support users to create statements in the order of their execution. Multi-modal expressions including gestures and scribbles were frequently used by all participants. While verbal statements were still the central part of explanation, they used gestures along with pronouns (e.g. “Count these”, “Put them here”), and scribbles to supplement verbal statements like an example in Figure 2. While multi- modal expressions seem to be natural and effective for non-programmers, traditional programming environments rarely support them. Rationales are not direct instructions for the computer. However, we consistently observed participants explaining rationales. For example, P6 doing T3 explained why she chose to attach small color chips rather than larger images, “While we can show images, which would be quite complex, I'd want you to do use color boxes.” P13 also explained rationale of her scribbles on the sheet of T1, “We can also secretly write number here (center of each cell) to remember, so track for afterward so we didn't make any mistake.” Implications This study provides characteristics of non-programmers explaining how they would solve computational tasks. Given that traditional programming environments do not 58 fully support the way these participants conceptualized their solutions, we discuss the implications for the design of multi-modal and mixed-initiative approaches for making end-user programming more natural and easy-to-use for these users. Our recommendations are to: Allow end-users to express ideas with rules, examples, gesture, and rationales. Traditional programming environments mostly support single programming styles: imperative, declarative, or example/demonstration-based. However, as seen in the user study, end-users express ideas via combinations of rules, example, and rationales. Support iterative refinement of programs. End-users may not be able to provide complete information of the programs they want. Instead, they would start with quick and brief description of task outlines, goals, or solutions that handle only a subset of the potential scenarios. They then iteratively refine it by adding more rules and examples. In order to support this iterative refinement, future EUP tools should allow users to sketch programs with missing details, and guide them to fill in. Support mixed-initiative interaction to disambiguate user intent. To guide non- programmers to explain essential information such as loops and variable referencing, our study employed conversational dialogue (as explained in Section 5.2) between participants and the computer. For example, when participants gave incomplete statements (e.g. demonstration for the first item), the computer asked them for additional information (“What would you like to do for the rest items?”) or confirmation (e.g. “Do you want to do the same for the rest items?”) Likewise, future EUP tools should incorporate mixed-initiative interaction to help end-users express unambiguous 59 statements; although it is an open-ended research question how the computer and end- users have mutual understanding. Limitations We made several simplifying assumptions that limit the scope of our findings. First, the computer followed three informal rules, which may not be specific enough to design working a system. 
A formal set of rules would make the Wizard-of-Oz study stronger. Second, participants could not review or test the programs they built, unlike in most programming environments. Third, the three tasks do not represent the full spectrum of computational tasks. However, we believe that even this narrow analysis provided useful insights for designing natural and intuitive EUP systems. To address these limitations, a follow-up study should employ an actual, interactive EUP system that implements and tests solutions addressing the challenges and implications identified in this study.
Conclusion
This chapter reports two formative studies that extend our understanding of end-users' needs and mental models. The first study, a semi-structured interview, explores challenges that end-users experience daily, and suggests seven functionalities for future EUP systems. The second, a Wizard-of-Oz study, demonstrates how non-programmers explain common computational tasks and provides design implications for more natural programming environments. Based on these findings, we designed VESPY, an interactive EUP system, which is presented in the following chapter.
Chapter 4: VESPY: A Visual Environment for Symbiotic Programming
Introduction
In the previous chapter, we reported end-users' needs for enhancing the web, and how they express programming intent. Based on that, we developed VESPY (Visual Environment for Symbiotic Programming), an end-user programming tool that enables amateur programmers to build interactive Web enhancements. This chapter presents VESPY's user interface, domain-specific language, and PBE engine. While most end-user programming systems for the Web (WebEUP) focus on specific application domains (e.g. extracting data from pages, automation, information mashup), VESPY covers a much wider range of enhancements by letting users orchestrate common functionalities (e.g. extraction, transformation, integration, automation, customization, and interactivity). Our approach is to interleave visual programming techniques with programming-by-example (PBE) so that users decompose complex tasks into tractable modules (with the grid UI), and generate solutions for each module by providing input and output examples to the PBE engine. To demonstrate the versatility of VESPY, we present four example enhancements. Finally, a preliminary user study shows that PBE helps users complete complex tasks efficiently, whereas simple tasks are better suited to direct specification. We also observed that participants experienced usability issues and made a variety of mistakes.
Design Iteration
The development of VESPY was a long iterative process, taking 1.5 years to explore various ways to combine visual programming and PBE. In this chapter we describe the design prototypes of the VESPY UI, along with the challenges we considered and our design rationales.
Version 1: Spreadsheet
The first UI design (Figure 18) was matrix-based, motivated by the generality and understandability of standard spreadsheets. Each column represented a list of homogeneous values, calculated by the same operation (e.g. DOM elements extracted with a single query). When the "application" represented by the spreadsheet is executed, VESPY runs the operation assigned to the leftmost column to update its values, and then repeats the same process for the columns to the right. The blue arrows between adjacent columns represent operations that calculate the next column from the previous columns on the left side.
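To make this evaluation model concrete, the following is a rough sketch of how such a spreadsheet-style pipeline could execute. It is illustrative only: the p.row selector is taken from the example described in the text, but the data structure and function are our assumptions, not VESPY's actual implementation.

// Rough sketch of the Version 1 evaluation model (illustrative; not VESPY's actual code).
// Each column holds a list of homogeneous values plus the operation that computes it
// from the column on its left; running the "application" walks the columns left to right.
const columns = [
  { name: "page", op: null, values: [document] }, // the current page, before execution
  { name: "rows", op: (pages) => pages.flatMap((p) => [...p.querySelectorAll("p.row")]) },
  { name: "urls", op: (rows) => rows.map((r) => r.querySelector("a").href) },
  // further columns would load each URL and extract its images, as in the example
];

function runSpreadsheet(cols) {
  for (let i = 1; i < cols.length; i++) {
    cols[i].values = cols[i].op(cols[i - 1].values); // each operation reads the column on its left
  }
  return cols[cols.length - 1].values;
}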
Note that PBE can suggest multiple operations for a single column, illustrated as multiple blue arrows next to the first column. The number in each blue arrow is the number of values the operation calculates. For example, in Figure 18, the first column has a single value, which represents the current page, before the program is executed. As the program runs, the second column's operation extracts 100 DOM elements (p.row), one for each row. The third column extracts 100 URLs from those rows. The fourth column loads the pages of the URLs, and the fifth column then gets their images. We informally assessed the strengths and weaknesses of the spreadsheet design. We felt that the spreadsheet approach was an effective, intuitive representation of values, and that the linear flow of data was straightforward. However, we decided to try an alternative design, because the spreadsheet metaphor was too limiting: we were not able to find a way to adapt it to support complex control flow such as branches or nested loops.
Figure 18. The 1st design of VESPY UI looks like a spreadsheet. Each column represents a list of values. The green arrows represent operations that calculate the next column.
Figure 19. The 2nd design of VESPY UI. Widgets that contain small spreadsheets represent complex program structure such as branching and merging.
Version 2: Graph of Multiple Spreadsheets
The 2nd version was designed to represent more complex, non-linear data flow such as branching, merging, and executing other modules. Although existing dataflow programming environments accommodate graph structures (as discussed in the Related Work section), they do not consider how to support interaction, such as reviewing how individual values change along the control flow, or providing input and output examples. We wanted our UI to represent control flow and data flow at the same time. For example, Figure 19 illustrates a data flow with branching and merging for finding a specific item in a list. While the 2nd design is clearly more versatile than the 1st, we felt it was still weak at conveying what the entire program is about.
Figure 20. The 3rd design of VESPY UI is optimized for showing the description of every operation.
Version 3: List of Operations
While designing the 3rd revision, we shifted our focus to describing operations rather than values. As illustrated in Figure 20, items in the vertical list (e.g. "Pick elements", "Inspect links") represent steps, which will be executed in sequence when users run the program. The vertical list works like an accordion, where users can fold / unfold items to see details. To add an operation, users click at the bottom of the list. To insert an operation, users click between two items. To infer commands for a step, users can click the step to unfold its input (above the step) and output (below the step) value tables. As users type examples in the value tables, the PBE engine recommends corresponding commands that users can click to confirm. The design is relatively compact – taking up only a small part of the screen – so it would be comfortable to use alongside web pages. However, the design also has many limitations. For example, it would be extremely difficult to visualize complex non-linear programs that contain branching and merging. Moreover, the design was not flexible enough to accommodate large amounts of input and output examples.
Figure 21. The 4th design of VESPY UI employs a 2D grid and a semantic zoom feature.
In sum, we felt the 3rd design was too simplistic to support symbiotic interaction. Version 4: Grid and Semantic Zoom The 4th revision was designed to better visualize non-linear control flow and data at the same time. Each cell on the 2D grid represent an operation and calculated values. By default, each cell accepts input data from the left and the above, and thus data flows top-to-bottom, or left-to-right. Users could manually modify each node to take input from arbitrary node as well. For example, the program in Figure 21 starts from the leftmost cell, “Run when page loaded”, and triggers the cell on the right side. The cell in the middle, “Choose [left] containing [above]”, gets input from left and above. After filtering, 37 items that contains the keyword becomes invisible by the rightmost cell. We liked the grid UI, not only because it represents non-linear flow of execution, but also provides extra flexibility for program decomposition. For example, to solve 67 complex problems, users can create sets of connected operations, and connect them later. The biggest design challenge was to find a balanced representation for both operations and values of each node. The 4th design addresses the requirement with semantic zoom. As users click a cell, the system would zoom into the selected nodes and show more details, such as full description of the operation, and current values. Although semantic zoom sounded brilliant, we soon realized that it also has a major limitation. When creating a new node or arranging multiple nodes, which are common Figure 22. The 5th design of VESPY UI includes a pop-up panel that shows details of the currently selected node. The top row represents values of the input nodes and the current node. The middle row explains what operation is assigned to the current node. The bottom row shows a set of operations that users can click to assign to the current node. 68 activities of visual programming, users need to see the overview (for organizing nodes) and node details (for providing input and output examples) at the same time. Since semantic zoom can only provide a single view at a time, we could not confirm that the small and the large level of semantic zoom provide much benefits. Version 5: Grid and Pop-up Panel For the 5th revision, we designed a pop-up panel that shows details of the current node and supports PBE interaction. As illustrated in Figure 22, the top of the panel shows values of the input nodes (left) and the current node (right) side by side so that users can easily compare corresponding input and output examples for PBE. The middle row gives title and description of the operation assigned to the node. The bottom row shows operations generated by the PBE engine based on the input and output examples. Users pick one of the operations to assign to the current node. I liked the panel design for its top, middle, and bottom structure can represent a wide range of situations. However, we soon realized that the bottom part can easily be overcrowded with a large number of generated operations. Moreover, in pilot studies we observed that inexperienced users could not easily grasp in what order they have to use the top, middle, and bottom parts of the panel. In the end, we decided to separate the bottom, so that the panel can focus on details of the current node. 69 Version 6: Grid, Pop-up Panel, and Side Panel The 6th design was the last revision, as illustrated in Figure 23. 
It has a side panel on the left, which contains information about the current program (called an enhancement) and the operations (called actions) that the PBE engine generates or filters based on the current node values. The pop-up panel shows information about the current node. Another pop-up panel, called the Inspector, appears when users select DOM elements of the web page. More details will be discussed in the following sections.
Example Walkthrough
Here is a typical walkthrough of creating a simple enhancement using VESPY. Jane is a knowledge worker who frequently calculates the sum of numbers in an HTML table. So far she has been doing the task by manually importing the entire page into Microsoft Excel, but she wants to add an interactive feature to the web page so that she can calculate the sum simply by clicking a button.
Figure 23. The VESPY user interface consists of the grid, info, actions, and node details. Users can open the UI at any web page by pressing the button on the top right corner of the web browser.
Jane first navigates to the page and clicks the button on top of her browser (Figure 23). Then the VESPY UI appears on the left side of the browser, pushing the original page to the right. In the middle, the grid UI shows a new empty enhancement with a Trigger node (Figure 24). The Trigger node will execute the nodes below and to its right when the page is completely loaded.
Figure 24. A new enhancement is created. The grid UI contains a Trigger node to begin with. The original web page is shown on the right side.
Since Jane wants to calculate the sum only when she needs it, the next step is to attach a button for executing the calculation. She drags the Create Element operation from the Actions panel to the cell below the Trigger node. The pop-up panel shows the details of the Create Element operation (Figure 25), and she directly changes its parameters from "Create [span] elements using the [input1]" to "Create [button] elements using the [calculate sum]".
Figure 25. Users can (1) drag an operation from the Actions panel (left) to the grid (center), (2) directly change options of the operation (e.g. "button", "calculate sum") in the floating node detail window, and then (3) run the operation by clicking the play button on the right side of the window. Finally, (4) the values of the node will be updated.
Figure 26. Users can attach new elements to any place in the web page by (1) dragging and dropping an element to the target place, (2) choosing the relative position (before, front, back, after) to the target, and (3) clicking a suggested program in the Actions panel. Then (4) two nodes are added to the grid.
Now she tests the Create Element operation by clicking the play button on the right side of the panel (Step 3 in Figure 25); the node runs the operation and puts the newly created button element into its values. Then she attaches the "calculate sum" button to the page as Figure 26 illustrates. Users can drag a DOM element, which is a value of the current node, to any target DOM element of the current page. As she drops the button onto the table header (Step 1 of Figure 26), another pop-up panel shows up and asks her to clarify its relative placement (before, front, back, after) to the target. If she chooses back, the button will be attached as the last element inside the target. After attaching at least two elements, the PBE engine generates a corresponding 2-step operation (Extract Element and Attach Element) and shows it in the Actions panel.
Figure 27. Users can set an event handler by (1) dragging the Trigger operation next to the node containing the elements, and (2) setting the correct input channel.
As she clicks the 2-step operation, both nodes are inserted into the grid UI. The next step is to create an event handler, which specifies what will happen when the button is clicked at runtime. She drags a Trigger operation to the right of the Attach Element node, which contains the button element (Step 1 in Figure 27). Now the Trigger node monitors the button element, and executes the following nodes when the button is clicked.
Figure 28. Users can specify a node that extracts elements at a specific DOM position by (1) creating an empty node, (2) clicking an element of interest and pressing the Extract button (repeated twice to extract a set of elements), and (3) confirming the suggested Extract Element operation in the Actions panel. Then (4) the empty node is replaced with a node that extracts all the elements at the same position.
Figure 29. Users can create new elements from values with the Create Element node.
Thus far the enhancement creates a new button, attaches it to the table, and assigns an event handler to the button. The next step is to define how to calculate the sum of the numbers. As illustrated in Figure 28, she creates an empty node next to the trigger so that the node will be executed when the button is clicked. She clicks the DOM elements of the numbers, and clicks the Extract button in the Inspector pop-up to add them to the current node (Step 2 in Figure 28). As multiple (at least two) elements are added to the current node, the PBE engine generates one or more Extract Element operations and shows them in the Actions panel. She validates the suggested operation and clicks it to assign it to the current node (Step 4 in Figure 28).
Figure 30. Users can extract specific attributes from elements by (1) creating an empty node next to the elements, (2) clicking the attribute value in the detail window, and (3) confirming the suggested action.
The Extract Element node now contains the DOM elements of the numbers whose sum she wants to calculate. Thus the next step is to get the attribute from those elements. She creates an empty node next to the extracted elements (Step 1 of Figure 30), and clicks the attribute of interest ("50"; Step 2 of Figure 30). Then the PBE engine generates a "Get [text] of [Input1]" operation, and suggests it in the Actions panel (Step 3 of Figure 30). Lastly, she clicks the operation to assign it to the current node.
Next, she needs to specify how to sum the numbers. There are two methods for specifying the Sum operation: first, she can drag and drop the Sum operation directly from the Actions panel to the empty cell below the numbers. Second, she can create an empty node below the numbers, and type the correct output value of the Sum operation so that the PBE engine will generate and suggest the Sum operation at the top of the Actions panel. After specifying the Sum operation, she drags and drops the Create Element operation, and changes the parameter to "Create [span] elements using the [input1]." This step is important, as only DOM elements can be attached to the page.
Figure 31. A simple enhancement creates a button for calculating total points. Three nodes on the left side create and attach the "calculate sum" button to the table. When the button is clicked at runtime, the trigger node executes the following nodes to extract all the points from the table, add them, and attach the result back to the page.
Finally, she drags and drops the span element to the table, and clicks the Attach Element operation suggested in the Action panel – as she previously attached the button element to the table. The complete enhancement (Figure 31) attaches a “calculate sum” button to the HTML table, and when the button is clicked, the nodes on the right side of the trigger extract the points, sum them up, and then attach it back to the page. VESPY System Along with the iterative design process, we implemented VESPY as a Chrome browser extension. Users could activate VESPY for any HTML based web pages, to create an enhancement that would automatically customize all the pages in the same domain. This section describes design and technical details of how VESPY works. The Grid UI VESPY includes several UI components, illustrated in Figure 23. Users open the UI by clicking the button at the top of the browser window. The Grid panel shows all the nodes of the current enhancement. The Info panel contains the current enhancement’s Figure 32. A simple enhancement creates a button for calculating total points. Four nodes on the left side attach “calculate total points” button to the Web page. When the button is clicked, the trigger node runs the following nodes to extract all the points from the table, add them, and attach the result back to the page. 77 title and description. The Actions panel shows operations defined in the VESPY’s domain specific language and actions suggested by PBE and PBD. The node detail UI shows details of the currently selected node such as operation, values, and input nodes. VESPY employs the data-flow programming paradigm [32] that represents a program as a directed graph of data flowing between connected operations. Edges between nodes pass not only data but also define which node should run next. As an example, Figure 32 illustrates the structure of a simple enhancement that attaches a “Calculate total points” button below a plain HTML table. When the button is clicked, the nodes on the right side of the trigger extract the points, sum them up, and then attach it back to the page. Most real-world problems are complex enough that state-of-the-art PBE engines cannot solve in single steps. Thus it is crucial for users to deconstruct and reconstruct smaller modules. Our approach is to interleave visual programming and PBE techniques. With the grid UI, users create nodes and arrange them to compose large programs without necessarily following the order of execution. Although some existing PBE tools such as CoScripter [47], Wrangler [25] or Karma [79] allow users to build up multi-step programs, they support sequential lists only that users have to create steps in the exact order of execution. In contrast, VESPY’s grid UI provides more flexibility for users to arrange multiple groups of operations, where each group can be independently created and tested, and connect them later to compose large programs. Direct Specification VESPY allows users to directly specify a node by dragging an operation from the Action panel to the grid. Users then manually change the parameters in the node detail UI, and 78 test the completed operation. While direct specification is simple and efficient for simple operations, and applicable for most programming tasks, it has learnability and efficiency issues as previously noted by researchers [76]. First, direct specification requires users to know all the syntax and usage from documentation. 
Second, even if end-users had sufficient knowledge, directly specifying complex tasks requires a significant amount of time and effort. While direct specification is suitable for simple WebEUP tools, expressive programming tools like VESPY, which has more than 30 operations, require alternative methods.
Programming-by-Example (PBE) Engine
As reviewed in section 2.2.4, PBE is an approach to finding programs that are consistent with a few input and output pairs given by the user. VESPY provides PBE techniques to generate single or multiple operations from user intent, and offers six ways for users to express that intent.
Figure 33. Users can bring in the elements of an input node by (1) clicking the arrow button in the node detail window. (2) When the current node contains elements of the input node, PBE suggests a three-step task that filters the input elements by their properties. (3) Clicking the task will add three new nodes for the filtering task.
First, users can type desired output values in a node that follows the input. This is suitable for data-transform operations such as arithmetic, filter, sort, and substring. Figure 34 illustrates the process of creating a Sort operation using PBE. Second, users can extract a couple of example DOM elements from web pages, and ask the PBE engine to infer a consistent Extract Element operation for elements at the same position. Figure 28 illustrates a typical use case of this process. Third, users can choose specific attributes (e.g. text) directly from input elements (e.g. a paragraph). For example, Figure 33 illustrates the process of getting the text attribute from a TD element by clicking an example of the desired attribute values in the following node's detail. Fourth, users can directly choose a subset of input elements as the intent of a filtering task. The PBE engine then infers tasks that consist of 2-4 operations (Get Attribute, Number / String Test, and Filter). Fifth, users can drag and drop values in the current node to the web page. To attach elements to the current page, a user opens up the node containing the elements, and drags the button tag to the target as shown in Figure 26. If the user wants to attach the element to a set of targets, he/she repeats the steps once more. Then VESPY will suggest a 2-step task (Extract Element → Attach Element).
Figure 34. VESPY PBE suggests single / multiple operation tasks based on the values of the input nodes and the current node. To sort a list of numbers, a user (1) creates an empty node that follows the input node. (2) He starts typing the desired output "-5". However, at this point, PBE can only suggest a task with Number Test + Filter operations. (3) As he types sufficient output values, PBE suggests the correct Sort operation. (4) He clicks the suggestion to confirm it as the node's operation.
Lastly, users can express their intent with multiple input nodes. For example, as Figure 35 illustrates, if a user wants to filter elements with complex predicate logic, he/she can prepare a few steps to get the key values, and then use both the original elements and the key values as input nodes for the node of filtered elements.
Domain Specific Language (DSL)
The expressive power of VESPY enhancements is defined by the domain-specific language (DSL) written in JavaScript. The DSL enables users to build a wide range of enhancements by combining the five areas of common WebEUP functionalities.
As summarized in Figure 36 and Figure 37, an enhancement consists of multiple nodes that are connected to other input nodes. Each node contains an operation (P) and a list Figure 35. Filtering a set of table rows by values of a specific column requires the filtered list [c] and the key values for predicate [b]. Users (1) extract key values from the original list, (2) 81 of values (V). VESPY currently supports only four value types (DOM element, String, Number, and Boolean), and each value list can have single data type, defined by the top element. An operation has a type (e.g. Load page, Extract Elements) and parameters. The DSL’s operations are shown in Table 2, covering five areas the common WebEUP functionalities plus event handling and data storages. Figure 37. Representation of the VESPY program. An enhancement consists of multiple nodes. The enhancement in this figure calculates the average of numbers ([1,3,6]) by running the four nodes in the numbered order (1à2à3à4). Each node contains an operation, values, and input nodes. When its preceding node triggers a node, it executes its operation, updates its values, and then triggers its following nodes. Figure 36. The syntax of VESPY enhancements. 82 Operation Input Output Param Description DATA EXTRACTION Load page IURL ODOM - Load page DOM elements of IURL Extract Elements IDOM ODOM path Extracts elements using path from IDOM Extract Parent IDOM ODOM d Get enclosing elements of IDOM, d steps above Find Tab IURL Find a currently open tab of IURL, and executes the following nodes in the tab. Get attribute IDOM OVAL k Get attribute k of elements of IDOM DATA TRANSFORMATION Literal I O Directly set the current node data to I Filter I O Get a subset of I whose corresponding boolean value in IBOOL is true IBOOL Sort IVAL OVAL direction Sort values in IVAL in ascending / descending direction Unique I O O ß I without repeating same values Substring ISTR OSTR S1, S2, B1, B2 Get substring of ISTR from string S1 to S2, including Si if Bi is true. String Test ISTR1 OBOOL p OBOOL will have True if ISTR1 contains ISTR2 , False otherwise ISTR2 Number Test INUM1 OBOOL op Evaluate inequality (INUM1 op INUM2) and update true / false in OBOOL. op can be <,<=,>,>=,==,!= INUM2 Arithmetic INUM1 ONUM op Calculate two operands INUM1 and INUM2 with operator op, which can be +,-,*,/,%. INUM2 Compose text ISTR1 OSTR s Concatenate every pair of values in ISTR1 and ISTR2 with separator s ISTR2 MODIFYING / CREATING DOM ELEMENTS Attach elements IDOM1 ODOM Attach DOM elements of IDOM1 to IDOM2. IDOM2 Create elements IVAL ODOM t Create new DOM elements using Input values and tag name t, which can be button, span, or img. Literal element - ODOM t Create a single element from t, which is JSON string of an arbitrary element. Hide / Show IDOM ODOM Hide / Show IDOM elements. Set attribute IDOM ODOM k Updates IDOM elements’ attribute k with IVAL IVAL SIMULATING MOUSE AND KEYBOARD INTERACTION Click IDOM - Simulate mouse clicks on Input elements. Type IDOM - str Simulate keyboard input str on input field of VDOM. DATA STORAGE Store data IVAL - k Store Input values into data storage with key k. (Not implemented yet) EVENT HANDLING AND FLOW CONTROL Trigger IDOM ODOM e Trigger the node on the right when event e occurs. Table 4. VESPY operations and their required parameters. Subscripted types (e.g. VAL of IVAL) mean that the operation requires the type of the value. IDOM must contain only DOM elements; IVAL can be any type except DOM elements. 
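To make the representation in Figures 36 and 37 concrete, the sketch below shows one plausible JavaScript encoding of an enhancement (the average-of-[1,3,6] example) and its triggered, data-flow execution. The field names and the Sum / Count operation names are our assumptions for illustration; this is not VESPY's actual source code.

// A minimal sketch, assuming the node structure described above: each node holds an
// operation (P), a value list (V), and references to its input nodes.
const enhancement = {
  title: "Average of numbers",
  nodes: {
    n1: { op: { type: "Literal", param: "[1,3,6]" }, inputs: [], values: [] },
    n2: { op: { type: "Sum" }, inputs: ["n1"], values: [] },
    n3: { op: { type: "Count" }, inputs: ["n1"], values: [] },
    n4: { op: { type: "Arithmetic", param: "/" }, inputs: ["n2", "n3"], values: [] },
  },
};

// When a node is triggered, it runs its operation on its inputs' values, updates its own
// values, and then triggers every node that lists it as an input and whose inputs are ready.
function trigger(enh, id, evaluate /* (op, inputValueLists) => values */) {
  const node = enh.nodes[id];
  node.values = evaluate(node.op, node.inputs.map((i) => enh.nodes[i].values));
  for (const [nextId, next] of Object.entries(enh.nodes)) {
    const ready = next.inputs.every((i) => enh.nodes[i].values.length > 0);
    if (next.inputs.includes(id) && ready) trigger(enh, nextId, evaluate);
  }
}
// trigger(enhancement, "n1", evaluator) would run the nodes in the order n1, n2, n3, n4,
// matching the numbered order in Figure 37.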
83 Single-step inference algorithms Many operations of VESPY have inference algorithms that find parameters corresponding to given Input (input node values) and Output (current node values). When a node value has been changed, the PBE system asks each operation to try its inference algorithm. An inference algorithm would fail and return false, if there is no parameter that satisfy the given input and output. The following list briefly explains the inference algorithms. Extract Element can infer a path for extracting a set of elements (Output) from a single element / multiple elements (Input). If the input node contains a single element, the algorithm tries to infer a 1-to-n query for extracting all the output elements from the element. If multiple elements are in the input node, the algorithm will try to find a 1-to-1 query that extracts each output from matching input elements. VESPY uses a XPath-based algorithm similar with Sifter [31], Karma [79], and Vegemite [52]. Extract Parent can infer how many steps a set of elements (Out) is above another set of elements (In) as illustrated in Figure 38. It returns fail if every output does not enclose the corresponding input element. Get Attribute can infer the attribute key for getting the output values from the input elements. For example, if the output elements are URL attributes of the input Figure 38. An example of Extract Parent operation inference. 84 elements, then the algorithm will return a Get Attribute operation having ‘URL’ as the key parameter. Literal can infer the parameter for setting the current node values. For example, if the current node values are [1,2,3] then the algorithm simple suggest Literal operation with “[1,2,3]” as the parameter. Sort can infer the right direction to get the output values by sorting the input values. Unique checks whether the unique set of the input values is equivalent to the output values, and returns a Unique operation or false. Substring can infer two tokens (S1 and S2) and boolean values (B1 and B2) for getting the output texts from the input texts. S1 and S2 indicate the starting and ending position of the substring, and B1 and B2 indicate whether S1 and S2 should be included or not. If it cannot find a consistent set of parameters, it returns false. Table 5 shows three examples of Substring inference. String Test can infer a keyword that can determine true or false values of the output from the input strings. The inference algorithm also tries whether the keyword should be in or not in the input string as shown in Table 6. 85 Number Test can infer an operator (e.g. <, >, ==, <=, >=, !=, %=, !%=) and an operand node or number that can determine true of false values of the output numbers from the input numbers as shown in Table 7. In order to get an accurate result, Number Test requires around many examples. IN OUT RESULT. S1 [B1] – S2 [B2] S1 B1 S2 B2 [“CSIC-1032”, “MSC-33”] [“1032”, “33”] “-“ false “_EOF_” false [“(6/7)(4/5)”,”49(28/11)”] [“(6/7)”, “(28/11)”] “(“ true “)” true [“323-708-7700”, “510-333”] [“323”,”510”] “” false “-“ false Table 5. Examples of Substring inference. IN OUT RESULT. keyword In / not in [“CSIC-1032”, “MSC-33”] [true, false] “CSIC” in [“a 1”, ”b 2”, ”a 2”] [false, true, true] “1” not in [“tomato soup”, ”potato soup”, “tomato salad”] [true, false, true] “tomato” in Table 6. Examples of String Test inference IN OUT RESULT. 
operator Operand [-5, 3, 9, 1, 2] [true, false, false, false, false] <= -4 [1,2,3,4,5] [false, true, false, true, false] %= (divisible) 2 [3,1,2,0,5] [false, true, true, true, true] != 3 Table 7. Examples of Number Test inference IN1 IN2 OUT RESULT. Operand1 operator Operand2 [-5, 1, 2] [5] [10,6,7] IN1 + IN2 [1, 2, 3] [2, 4, 6] IN1 * 2 [3] [6,2,-3] [0,2,0] IN2 % IN1 Table 8. Examples of Arithmetic inference IN1 IN2 OUT RESULT. Text1 connector Text2 [“CSIC”, “MSC”] [1032, 33] [“CSIC-1032”, “MSC-33”] IN1 “-“ IN2 [“a”, ”b”] [“a is good”, “b is good”] IN1 “” “ is good” [ “soup”, ”salad”] [“potato”] [“potato soup”, “potato salad”] IN2 “ “ IN1 Table 9. Examples of Compose Text inference 86 Arithmetic can infer an operator (e.g. +, -, *, /, ^) and two operands (numbers or input nodes) for getting the output numbers. Table 8 shows examples of Arithmetic inference. Compose Text can infer a connector (text or an input node) and two text (or input nodes) for getting the output text (Table 9). The rest operations (e.g. Create element, Literal element, Hide, Set Attribute) are not suitable for input and output examples. Recipe Condition / Decomposition Extract Attribute Condition: Every Output Text exists in Input Element. Find Path Condition: Target elements are not within or enclosing the Input elements. Filter Element Condition: Filtered Elements is a subset of Original Elements. Table 10. The core set of task recipes in VESPY. If input and output satisfies the condition, the recipe will create temporary nodes (in orange color) and will try to find sub-solution. 87 Multi-step PBE with task recipes PBE provides a larger benefit when it generates programs for complex tasks. For example, filtering elements by attribute task requires at least four nodes. If PBE can generate the four nodes in a single step, it will save a significant amount of user’s time and effort. The problem is that solving large tasks mostly require multiple PBE algorithms. We thus developed a planner that decomposes a large problem into sub-problems, and assigns them to different PBE algorithm. Each plan is called a task recipe. Table 10 shows a few of them. The search algorithm was inspired by HTN (Hierarchical Task Network) planning [64]. The algorithm detail is beyond the scope of this paper. In short, when the provided input-output examples match with a recipe’s condition, the recipe creates several intermediate nodes (orange color in Table 10) and requests corresponding PBE algorithms to solve them. If they could find matching solutions for every intermediate node, the planner combines them and suggests to the user. Example Enhancements To demonstrate the versatility of VESPY, this section presents four enhancements designed to exemplify the kinds of problems we know that users have based on the studies of Chapter 3. 88 Example #1: Deep search A fictitious user regularly visits Craigslist.com to buy second-hand items. For every item he found interesting, He must visit the detail page, check information, and move back to the listing page. To make it more efficient, He wants to look into linked pages and search keywords without opening them. Let’s call it deep search. Using VESPY, he built an enhancement consists of 17 operations. Deep search attaches a text input box above the links (see Figure 39). When users type a search keyword, it automatically loads every page of the links, and highlights some of the links that contain the keyword. Also, for further preview, it extracts the key content from the pages and attaches below. 
Figure 39. The deep search enhancement adds a text input box to the original page. When user types a keyword in the input box, it searches all the linked pages and highlights links whose pages contain the keyword. The main content of the links are attached to the links as well. 89 Figure 40. The custom filter enhancement extracts all the venues from the publication list, and attaches a list of unique buttons. When a button is clicked, it shows only the articles published to the selected venue. Figure 41. The event parser enhancement attaches button to every event in the list. When a button is clicked, it finds an open tab of Google Calendar and fills the input form with the event information. Figure 42. The multi-attribute ranking enhancement adds text boxes to each column header that users can type in their own weight factors. When a factor is changed, it updates weighted total scores and color codes on the right end of the table. It also attaches the Sort button that reorders the table rows by weighted total scores. 90 Example #2: Custom Filter While reading a publication site, a user wants to filter the publications by their venues. However, personal Websites rarely provide filtering functionality. She saw a custom filter enhancement in the VESPY repository. While the enhancement had been built for other sites, she could adapt it by modifying two Extract Element nodes that extract the items to be filtered, and a target position for buttons. As shown in Figure 40, her custom filter extracts all the venues from the page, creates buttons that are alphabetically sorted without duplication. Clicking a button hides articles published to other domains. Example #3: Event Parser for Google Calendar A user regularly visits the event-listing site. When he finds an interesting event he has to manually type essential information (when, where, description) to his calendar application. Event parser enhancement can help him by adding a button next to each event (see Figure 41). When he clicks the button, it extracts the essential information and looks for a tab of Google Calendar app. If the calendar app is found, it injects the information to corresponding input boxes so that he can check and confirm to create the event. Example #4: Multi-Attribute Ranking A user is a prospective student deciding on a university. On the Web, he found a data table containing multiple attributes of universities. He wants to compare them with his own ranking formula. The multi-attribute ranking enhancement (shown in Figure 42) attaches input boxes to columns of a plain HTML table so that he can change weight factors, calculate the weighted total scores, and sort the rows. When he tweaks his 91 weight factors, the weighted total score column updates the weighted scores and colors. After setting the best weight factors for him, he sorts the universities by the updated score. This scenario includes a wide range of tasks: (1) creating and attaching new DOM elements (input boxes and buttons) to the page; (2) extracting information from the page; (3) performing complex arithmetic; (4) modifying attributes (background- color) of elements; (5) sorting elements by custom scores; (6) adding event handlers to the input boxes and the buttons; and finally (7) orchestrating the above tasks with an execution flow. To our knowledge, no existing WebEUP tool can support all of these tasks. 
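To clarify what the weighted-scoring part of the multi-attribute ranking enhancement computes, here is a minimal sketch in plain JavaScript. The attribute names, values, and weights are invented for illustration; in VESPY the same logic is expressed with Get Attribute, Arithmetic, Set Attribute, and Sort nodes rather than hand-written code.

// Illustrative data only: rows extracted from the comparison table, one object per university.
const universities = [
  { name: "A", tuition: 9, ranking: 7, distance: 3 },
  { name: "B", tuition: 5, ranking: 9, distance: 8 },
  { name: "C", tuition: 7, ranking: 6, distance: 9 },
];

// Weight factors the user types into the input boxes attached to each column header.
const weights = { tuition: 0.5, ranking: 0.3, distance: 0.2 };

// Recompute the weighted total whenever a weight changes, then sort by it (the Sort button).
function rank(rows, w) {
  return rows
    .map((r) => ({
      ...r,
      total: Object.keys(w).reduce((sum, key) => sum + w[key] * r[key], 0),
    }))
    .sort((a, b) => b.total - a.total);
}

console.log(rank(universities, weights)); // highest weighted total first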
Preliminary User Study To answer the research question, “(R3) Is PBE better than direct specification?”, we conducted a preliminary user study using two interaction modes that provide limited functionalities of VESPY. The Direct Specification (DS) mode only allows users to directly drag and drop operations from the Action panel. In the Programming By Example (PBE) mode, the participants were not allowed to drag operations in the actions panel to the grid, but can use the PBE features (e.g. Typing in node values, suggestion of actions). We recruited 16 amateur programmers through a university mailing list. Three subjects were female and thirteen were male. Their average age was 29.25 years (SD=8.1). We defined the eligibility criteria for amateur programmers that they must be familiar with at least one programming language, and understand basic programming concepts such as loop, conditionals, and data objects. We excluded 92 applicant who had coded a program longer than 500 lines or for commercial purpose. We offered qualified participants ten dollars per hour. Method We conducted a within-subject experiment that compared two modes (DS and PBE) of VESPY. The study began by learning the basics of common part of the UI using Web- based tutorial. Throughout the study, one of the authors was sitting next to the participants answering questions. First, the participants learned the basic concepts using a web-based tutorial that includes short demonstration clips and exercises. The tutorial took 10-20 minutes. After completing the tutorial, they tried to accomplish four tasks, as shown in Table 11. For each task, they first read the instructions about a randomly-selected version (e.g. DS), tried a practice task, and then completed the actual task. The same process repeated for the other version (e.g. PBE). To minimize learning effects, half of the participants used the DS mode first, while another half used the PBE mode first. Also the practice and the actual tasks used variations of the same problems with different numbers and parameters. At the end of each task, the participants answered to a survey question about the problems’ difficulty. After finishing all the tasks, the participants took a general survey and participated in a semi-structured wrap-up interview. 93 We measured the completion time for the tasks. However, as VESPY requires training, and is not intended to be a walk-up-and-use system, the level of understandings about the system had large variance across users. During the pilot study, we observed that the participant’s understanding had a dominating impact on their performance. For example, if a participant got lost for several minutes, all the other aspects would make little difference to the total completion time. To avoid the situation, when the participants got stuck for longer than 20 seconds we reminded them high- level hints (e.g. “If you want to extract DOM elements, click them”, “You need to confirm one of these suggestions.”), which were also instructed during the tutorial. Tasks We designed four tasks that are commonly used in most enhancement scenarios and solvable in both modes. Each task has 1-5 problems depending on their difficulty. As illustrated in Figure 43, the participants were requested to add new nodes to get the desired result from given input nodes. They pressed the START and DONE button at the Figure 43. An example of the second problem of the Calculating numbers task. 
Given the two input nodes, the participants need to create an Arithmetic node that multiplies the two node values. beginning and end of each problem. The time gap was the metric of the system's performance.
Task 1. Calculating numbers
Inputs: [1,2,3], [2,0,2]. Add every number in [1,2,3] with [2,0,2] => result: [3,2,5]
Inputs: [8,7,10], [2,2,3]. Multiply every number in [8,7,10] with [2,2,3] => result: [16,14,30]
Inputs: [3,6,9]. Divide the numbers [3,6,9] by 3 => result: [1,2,3]
Inputs: [1,9,-5]. Arrange [1,9,-5] in increasing order => result: [-5,1,9]
Inputs: [4,1,1]. How many numbers are in [4,1,1]? => result: [3]
Task 2. Extracting information
Inputs: 4 elements. Get the text attributes of the elements
Inputs: 4 elements. Get the URL attributes of the links
Inputs: 4 elements. Get the text of sub-elements within the input
Inputs: 2 elements. Find a path from one set of elements to another
Task 3. Filtering
Inputs: [apple juice, banana, apple, peach]. Find text values that contain "apple" => result: ["apple juice", "apple"]
Inputs: [1,2,3,4]. Find even numbers => result: [2,4]
Inputs: 6 elements. Find elements that contain a specific keyword
Task 4. Attaching Elements
Inputs: 1 element. Attach an element to a set of items in the page
Table 11. The four tasks for the controlled experiment consist of thirteen problems.
Results
We tested whether the completion times of each group follow a normal distribution. It turned out that 9 out of the 26 data groups (P1-P13 for the two conditions) are non-normally distributed (p>0.03). Therefore, we compared the two conditions using the Wilcoxon signed-rank test. The results indicate that the participants could finish problems requiring multiple steps (i.e. P8-P13) significantly faster under the PBE condition than under the Direct Specification condition (see Table 12 for Z-scores and p-values). These results suggest that neither PBE nor direct specification outperforms the other in all cases; thus, EUP systems should support both approaches for different circumstances.
Discussion
PBE vs. Direct Specification
Before running the study, we expected to see either PBE or direct specification outperform the other. On the contrary, the usefulness of PBE appears to be affected by three factors: (1) the user's knowledge of the domain-specific language, the PBE engine, and the task; (2) the amount of work required to create sufficient examples versus directly specifying parameters; and (3) the credibility of the generated programs.
Completion times per problem, Direct Specification (DS) vs. PBE, mean (SD), with Wilcoxon signed-rank statistics:
P1 (Task 1, 1 node): DS 13.1 (4.68), PBE 17 (6.26), Z = -2.8114, p = 0.005
P2 (Task 1, 1 node): DS 14.1 (6.17), PBE 21 (14.68), Z = -1.5254, p = 0.126
P3 (Task 1, 1 node): DS 19.4 (6.30), PBE 18 (7.43), Z = -0.8519, p = 0.395
P4 (Task 1, 1 node): DS 22.4 (10.61), PBE 14 (3.81), Z = 2.1344, p = 0.033
P5 (Task 1, 1 node): DS 13.4 (8.51), PBE 14 (7.21), Z = -0.6592, p = 0.509
P6 (Task 2, 1 node): DS 17.8 (13.76), PBE 15 (5.81), Z = -0.2841, p = 0.779
P7 (Task 2, 1 node): DS 29.3 (35.56), PBE 14 (12.91), Z = -0.1675, p = 0.093
P8 (Task 2, 2 nodes): DS 63.9 (53.90), PBE 28 (17.17), Z = -3.2577, p = 0.001
P9 (Task 2, 3 nodes): DS 108.8 (69.23), PBE 57 (38.50), Z = -2.7406, p = 0.006
P10 (Task 3, 2 nodes): DS 53.6 (29.69), PBE 19 (7.12), Z = -3.4645, p = 0.0005
P11 (Task 3, 2 nodes): DS 53.6 (17.81), PBE 18 (11.15), Z = -3.4645, p = 0.0005
P12 (Task 3, 4 nodes): DS 111.5 (75.65), PBE 55 (38.81), Z = -2.5854, p = 0.01
P13 (Task 4, 2 nodes): DS 63.7 (27.85), PBE 25 (12.44), Z = -3.5162, p = 0.0004
Table 12.
Wilcoxon signed-rank test results of the completion times for each problem. For simple problems that require a single step (P1-P7), the Direct Specification condition showed equivalent or better performance. However, for complex problems requiring multiple steps (P8-P13), the PBE condition was significantly more efficient (p<0.03).
First, to use PBE effectively, users must know what programs the system can generate. Otherwise, users must figure out the system's capability through trial and error, which can be tedious and frustrating. For instance, while learning the Extract Element operation, s4 wondered, "What if I want to extract only these two elements not the entire set?" System knowledge also matters for users to feel confident in the programs they make; to apply the same approach to similar problems, users need to understand how the PBE engine extracts DOM elements from a few examples. Although existing PBE systems teach their capabilities and limitations with samples and feedback, we believe there is plenty of room for improvement.
Second, if a user can create a program using both PBE and direct specification, he/she will choose the more efficient approach. For simple problems (P1-P7), participants preferred to use direct specification, which requires less time and effort, as s7 said, "why should I type correct outcomes when I can program it easily?" For complex problems (P8-P13), participants could easily perceive the benefits of PBE, which is not just easier but also more efficient than direct specification. Future PBE systems should consider how to make example creation more efficient.
Lastly, lack of credibility is another important issue for PBE. Even after learning how to use PBE, some participants were still reluctant to completely rely on it. s8 told us, "For larger data set, I would prefer direct specification, because PBE may generate incorrect solutions." How to quickly build up credibility between the user and the system is an interesting research question for the longitudinal study in our future work.
Designing PBE involves issues very different from those of conventional direct-manipulation UIs, many of which are still open research questions in HCI. In the next chapter we
The preliminary study in this Chapter reports only the first few hours of user experience. To assess the efficacy of the tool, a multi-dimensional in-depth long-term case study (MILCs [77]) would be appropriate. 98 CONCLUSION This chapter presented VESPY, an end-user programming environment for creating web enhancements. VESPY enables amateur programmers to deconstruct complex tasks into smaller sub-tasks, and to find programs for sub-tasks with examples. Four scenarios of sample enhancements demonstrate unique capability and versatility of VESPY’s approach. In the preliminary user study, we observed that PBE significantly increased user’s performance for multi-step tasks. However, direct specification is as good as PBE for single-step tasks. We believe VESPY can help Web end-users improve their productivity by creating, sharing, and customizing interactive Web enhancements. 99 Chapter 5: Understanding Human Mistakes when Programming by Example Abstract In the previous chapters, we examined how inexperienced users would describe computational tasks (Chapter 3), and introduced VESPY, a WebEUP system that employs visual programming and PBE techniques (Chapter 4). Findings from the preliminary user study of VESPY indicate that PBE systems can be much harder for inexperienced users than PBE researcher’s expectation. Unfortunately, there is little research on people's ability to accomplish complex tasks by providing examples. This chapter presents an online user study, reporting how well people decompose complex tasks, and disambiguate sub-tasks. The findings suggest that disambiguation and decomposition are difficult for even highly-motivated workers from Amazon Mechanical Turk. We identify seven types of mistakes made, and suggest new opportunities for actionable feedback based on unsuccessful examples, with design implications for future PBE systems. Introduction As described in Chapter 2, the goal of PBE is to enable ordinary people to automate complex and repetitive tasks, and it has even made its way into commercial products such as Microsoft Excel’s FlashFill [23]. However, guiding inexperienced users on how to provide high-quality examples is still an open-ended research question. To create high-quality examples, users need to consider two requirements: (1) disambiguation, and (2) decomposition. First, users must be able to provide diverse cases to 100 disambiguate the operation they want to create from other operations the PBE engine could infer. Second, to create operations for complex tasks, users need to decompose those tasks into small sub-tasks that the PBE engine can (more easily) infer. Both disambiguation and problem decomposition are challenging computational thinking skills and are often part of required training for computer science and engineering students. To answer the research questions, “(R4) Can inexperienced users perform problem decomposition and disambiguation?” and “(R4a) What mistakes do users make when using PBE?”, we conducted an online user study with participants recruited from Amazon Mechanical Turk (AMT) who were asked to complete 6 tutorials and 5 main tasks using our PBE system. Our research focuses on examining the behavior of ordinary people providing input and output examples, managing steps and cases for decomposition and disambiguation, and making and fixing mistakes. To provide recommendations for PBE tool designers, we also designed two feedback mechanisms, and compared their impact on the main task success rate. 
A total of 161 users participated in the study, and 30 of them successfully completed all five main tasks. Our findings suggest that disambiguation and decomposition are difficult for even highly-motivated AMT workers, and for those that had practiced all required subtasks during the tutorials. We report seven types of mistakes identified from unsuccessful trials. We also determined that that those unsuccessful trials contain meaningful information about users’ intent and misunderstandings about PBE. Under the actionable feedback condition, participants received context-aware suggestions based on the information from unsuccessful trials, and outperformed other participants. 101 METHODS We conducted an online user study that began with a brief introduction to PBE. Then six tutorials on the user interface and basic PBE tasks (Table 13) were given. After finishing the tutorials, participants were asked to complete five main tasks, that are advanced variations of the tutorials. Finally, the tasks were followed by a demographic survey. The study took around 26 minutes (M = 25.97, STD = 11.54), and participants who finished the entire study were paid $3.00. The study was posted on Amazon Mechanical Turk for two days, during which 161 workers started the first tutorial, 137 workers finished the tutorials and proceeded to the tasks, but only 30 finished the entire study. Summary demographics of the 30 participants who finished the entire study indicate the majority age range was 25-34 (60%, M = 36.43, STD = 7.56), male (60%), with bachelor (50%) or high school degrees (37%). The majority (84%) of participants reported that they had no programming knowledge (57%) or only basic concepts (27%). However, many of them had various IT experience, such as using spreadsheets (70%), creating web pages using HTML (30%) or content-management systems (20%), database (23%), and scripting languages such as Python or Ruby (20%). 102 Experimental System We developed an experimental PBE system that allows non-technical participants to quickly learn and perform decomposition and disambiguation as illustrated in Figure 1. The system can generate simple programs for standard PBE tasks (e.g. arithmetic, text Description Default examples Solution examples T utorials T1,2 Input + 1 IN 1 OUT 2 IN 1 5 OUT 2 6 T3 (Input + 1) * 2 IN 1 OUT 4 IN 1 2 STEP 2 3 OUT 4 6 T4 Get the sum of all numbers IN 1,1 OUT 2 IN 1,1 3,2 OUT 2 5 T5 Get length of a text value (including spaces). IN yes OUT 3 IN yes no OUT 3 2 T6 Find numbers that are greater than 9 IN 11,8,9,10 OUT 11,10 IN 11,8,9,10 STEP T,F,F,T OUT 11,10 M ain tasks T7 (Input + 1) * (Input – 1) IN 1 OUT 0 IN 1 2 3 STEP 2 3 4 STEP 0 1 2 OUT 0 3 8 T8 Sort numbers in ascending order IN 1,-1 OUT -1,1 IN 1,-1 5,2,3 OUT -1,1 2,3,5 T9 Find words that are longer than two letters IN be, are, I, some OUT are, some IN be, are, I, some STEP 2,3,1,4 STEP F,T,F,T OUT are, some T10 Find numbers that are not divisible by 4 without remainder IN 1,4,5 OUT 1,5 IN 1,4,5 2,4 STEP T,F,T F,T OUT 1,5 4 T11 Extract prices of cars that are manufactured in 2014 or later. IN Civic(2014)-$12000, Elantra(2012)-$9500, Corolla(2015)-$14000, Corolla(2013)-$10000 OUT 12000,14000 IN Civic(2014)-$12000, Elantra(2012)-$9500, Corolla(2015)-$14000, Corolla(2013)-$10000 STEP 2014, 2012, 2015, 2013 STEP 12000, 9500, 14000, 10000 STEP T, F, T, F OUT 12000,14000 Table 13. With the given description and default examples for each task, participants were asked to add more examples, such as the solution examples shown. 
103 processing, filtering). In the system’s UI, table rows represent sequential steps from input to output, and table columns represent independent cases. Participants can type 1. Initial state of the task UI that contains default Input and Output values (1 and 4), buttons for adding case (“Add Case”), adding step (“+”), and inferring operations from current examples (“Teach Computer”). 2. As the user clicks the “Teach Computer” button, the UI shows feedback messages for every step and the entire program. 3a. As the user clicks “Add Case”, an empty column is added to the right of the table in which he/she types an example (2 and 6). 3b. Alternatively, the user could click [+] between two rows, and an extra step would be inserted between the rows. 4. By adding a case and a step, the user makes every step find a single operation, and teaches the correct operation. Figure 44. The study UI and basic walkthrough 104 examples values in table cells, insert steps by pressing “+” buttons between rows, and add cases by pressing the “Add Case” button. Pressing “Teach Computer” runs the PBE inference engine, to generate operations that calculate each step. When the engine fails to determine operations from the provided examples, feedback messages are shown to the right rows. If participants spent at least three minutes, and tried (unsuccessfully) to “Teach Computer” at least eight times for a given task, a button was shown to allow them to give up and move on to the next task. Through internal pilot tests, we decided on a reasonable, high number of minutes and trials to give people the chance to try a number of answers in order to study example-providing. We designed two types of feedback (simple and actionable) to see whether actionable feedback effects user’s behavior. The simple feedback provides only the number of programs that the system generated. We designed the simple feedback as the baseline condition, since most existing PBE systems [23,35,54,88] provide a similar level of feedback for generated programs. In contrast, the actionable feedback detects user’s intentions from the examples, and explains details why it failed to generate any program and how to resolve the issue. To our knowledge, no prior PBE systems provide actionable feedback. When the PBE system finds a single operation for the step, both types of feedback show the same message, "Found a single program that calculates the step." When the system finds multiple operations, the simple feedback is "Found N programs that calculate this step", where N is the number of generated operations. The actionable feedback is same, but adds "Provide more examples." to the end. 105 When the system finds no operation for the step, the simple feedback is "Found no program that calculates this step." In contrast, the actionable feedback includes the following messages: • If there is an empty cell in the current row, the actionable feedback is, "There is an empty case. Did you miss filling it?" • If the current row contains values of multiple types (e.g. number and string), the actionable feedback is, "There are number and string examples in this case. This might have caused the computer to fail in finding a program." • If there is any row above the current row that contains all the values of the current row, the actionable feedback is, "If you are trying to filter values from steps above, you need an additional step containing T or F." 
• If the current row is a substring of a filtered subset of any row above, the actionable feedback is, "Are you trying to filter and extract part of a string at the same time? If that's the case, you have to do them in two steps."

SUCCESS RATE

Task | Success rate (Base.) | Success rate (Exp.) | χ² p-value | Avg. # trials (Base.) | Avg. # trials (Exp.) | Mann-Whitney U test
Tutorials:
T1 | 1.00 | 1.00 | >.5 | 1.67 | 1.07 | p > .5
T2 | 0.93 | 1.00 | >.5 | 3.00 | 1.20 | Z = 0.91, p < .30
T3 | 0.80 | 0.87 | >.5 | 6.80 | 3.47 | Z = 2.13, p < .30
T4 | 1.00 | 1.00 | >.5 | 3.40 | 1.87 | Z = 0.76, p < .30
T5 | 1.00 | 1.00 | >.5 | 1.20 | 1.07 | p > .5
T6 | 0.67 | 0.67 | >.5 | 7.40 | 6.47 | p > .5
Main tasks:
T7 | 0.53 | 0.87 | <.05 | 10.33 | 3.13 | Z = 2.32, p < .01
T8 | 0.67 | 1.00 | <.03 | 8.73 | 2.73 | Z = 5.56, p < .3
T9 | 0.27 | 0.93 | <.001 | 18.27 | 5.27 | Z = 2.90, p < .001
T10 | 0.53 | 0.93 | <.03 | 13.00 | 4.13 | Z = 1.60, p < .05
T11 | 0.27 | 0.67 | <.03 | 28.73 | 6.87 | Z = 3.17, p < .001

Table 14. Success rates (proportion of participants who passed the task) and average numbers of trials for the baseline (Base.) and the experimental (Exp.) conditions. Highlighted cells are significant (p < .05).

As mentioned, 30 of the 161 participants finished the entire study. They successfully finished most tutorials (average success rate = 91.1%, average # trials = 3.22), as shown in Table 14. The main tasks were completed successfully less often than the tutorials (success rate = 66.7%, average # trials = 10.12). To understand the effect of feedback on successful task completion, we conducted a non-parametric repeated-measures ANOVA [84]. The result yielded an F ratio of F(1, 150) = 26.01, p < .001, indicating that the success rate was significantly greater with the actionable feedback than with the baseline feedback. We also conducted a factorial ANOVA to check the effect of demographic factors on success rate, but found no significant impact (p > .03).

Types of Mistakes

We counted as mistakes the errors in user-provided examples that prevent the PBE engine from generating a single program for each step. The first author reviewed 150 task results (5 main tasks done by 30 participants), and identified 246 mistakes. 25.6% of the mistakes were critical, meaning that they remained until participants gave up the task. We grouped the mistakes into the categories below.

Missing steps (found 92 times; 30 were critical)

The PBE engine failed to generate programs when participants did not provide crucial steps, as illustrated in Table 15: (a) missing steps of key values above predicates (35 times; 15 critical), (b) missing steps of predicate values above a list-filtering step (31 times; 7 critical), (c) missing subtasks in a combination of filtering and text extraction (22 times; 15 critical), and missing intermediate steps in multi-step arithmetic (T3 and T7; 4 times).

Ambiguous cases (29 times; 11 critical)

Participants often could not provide sufficient examples for the engine to find the right program. For example, participants stuck with single-case examples (18 times; 8 critical). (a) To generate a "not divisible by 4" condition for T10, the input requires a "2", but eight participants had to try multiple times, and three of them gave up. (b) Similarly, T8 (sorting numbers) requires an additional case containing at least three numbers whose output is not the input in reverse order. See examples in Table 16.

(a) IN be, are, I, some ST1 F,T,F,T OUT are, some - For T9, a predicate step (ST1) needs a step of key values ("2,3,1,4") above it. (b) IN 1,4,5 OUT 1,5 - For T10, the filtered result (OUT) requires a step containing predicate values ("T" for including, "F" for excluding values).
(c) IN Civic(2014)-$12000, Elantra(2012)-$9500, Corolla(2015)-$14000, Corolla(2013)-$10000 STEP 2014, 2012, 2015, 2013 STEP T, F, T, F OUT 12000,14000 For T11, the output (“12000, 14000”) is a substring of the filtered list. It requires either a substring of the original list or the filtered list above. Table 15. Examples of missing steps 108 Inconsistent or unsupported values (28 times; 8 critical) Participants provided a variety of values that the PBE engine could not find a matching program, such as inconsistent values for arithmetic tasks (9 times; 2 critical), incorrect predicates for filtering (5 times; 1 critical), and incorrectly sorted list (2 times). Participants also provided steps with single Boolean values, when the correct program requires multiple values (7 times; 3 critical). Participants often made formatting mistakes such as (a) Boolean values next to numbers (e.g. "T11, T10, F8, F9": 2 times), Boolean values without a separator (e.g. "FTFT"; 3 times) and using "Yes" and "No" instead of "T" and "F" (1 time). Unnecessary steps (15 times; 5 critical) Participants often added unnecessary steps. For example, (a) they often provided steps of unnecessary Boolean values for filtering tasks (7 times; 2 critical), numbers for arithmetic (4 times; 2 critical), or completely empty steps (2 times; 1 critical). For T10, (a) IN 1,4,5 2,4 STEP T,F,T F,T OUT 1,5 4 For T10, to disambiguate “divisible by 4” from “divisible by 2”, IN requires a value “2”. (b) IN 1,-1 5,2,1 OUT -1,1 1,2,5 For T8, examples for sorting must contain three numbers that are not in reverse order. Table 16. Examples of ambiguous cases (a) IN 11,8,9,10 STEP T11, T10,F8, F9 OUT 11,10 “T11” probably means that the value “11” is marked with “T” Table 17. Examples of inconsistent or unsupported values 109 (b) two participants provided a step that contains "4", which is the operand of the number-predicate program they need (2 times). For examples, see Table 18 Describing with formula (11 times; 7 critical) Five participants described steps with formulas instead of example values. For instance, (a) they provided "Input+1", "*2", "(2)*(0)", and "+1" for arithmetic tasks (3 times; 3 critical). For the filtering tasks, they tried (b) "<2014", "1/4", "1<2<3<4", "-1<1", "are>2", and "some>2" (6 times; 3 critical). For the sorting tasks, two participants tried to describe the direction with "increasing order" and "reverse input" (2 times; 1 critical). For examples, see Table 19. (a) IN 11,8,9,10 STEP T,F,F,T STEP T,T OUT 11,10 The third row (“T,T”) is unnecessary. (b) IN 1,4,5 3,8,15 STEP 4 4 STEP F F OUT 1,5 3,15 To express a conditional “not divisible by 4”, a participant created steps of “4” and “F”. Table 18. Examples of unnecessary steps (a) IN 1 STEP Input+1 STEP *2 OUT 4 “Input+1” and “*2” are formulas for arithmetic tasks. (b) IN be, are, I, some STEP T STEP are>2 STEP some>2 OUT are, some “are>2” and “some>2” are conditional formulas for predicates. Table 19. Examples of describing with formula 110 Inconsistent program (3 times; 2 critical) Even when the PBE engine generated a single program for every step, the entire program could be inconsistent with the task. For instance, participants often created wrong arithmetic (2 times; 2 critical), or filtering programs (1 time). Empty cases (2 times; 0 critical) Participants sometimes left the right most case empty. LIMITATIONS We made several simplifying assumptions that limit the scope of our findings. 
First, to allow non-expert users to quickly learn, the study introduces only a few standard tasks (e.g. arithmetic, string processing, and filtering). While the general patterns of findings will likely apply to other tasks, it will be important to confirm the extent to which this is true. Second, our experimental system does not show generated programs, while a few PBE systems [35,54] support interactive disambiguation where users read program descriptions and disambiguate by directly choosing a desired program. Further work is needed to explore the opportunity and effectiveness of interactive disambiguation. Third, we did not collect log data of dropouts, which could explain why they gave up the study. The high dropout rate suggests that an attrition bias might exist between the baseline and the actionable feedback settings. The study also leads us to a wide research area. For example, how to construct and train a knowledge model of a PBE user is an open-ended research question. How various design factors effect a user’s motivation and understanding of the PBE system is the goal of the next chapter. 111 CONCLUSION This chapter presents a user study that examines how inexperienced users learn and use our PBE system. Findings include seven types of common mistakes, and an evidence confirming that we can automatically detect a user’s programming intent, and generate actionable feedback that helps the user quickly fix mistakes. 112 Chapter 6: Experiments on Feedback and Human Mistakes in PBE Systems Motivation and Introduction Human-centered design is an essential factor for the success of any interactive system, but is often overlooked for PBE systems [44]. As reported in Chapter 5, inexperienced PBE users make a wide range of mistakes while decomposing complex tasks and providing unambiguous examples. However, our preliminary user study (Chapter 5) suggests that even unsuccessful examples contain enough clues for detecting a user's programming intent and misunderstanding of the system, and PBE systems can provide useful feedback based on those clues. The result reaffirms a widely known principle - human-readable, informative feedback is crucial for designing usable interfaces [75]. However, there is little prior research about detailed feedback design particularly for PBE users. The goal of our study in this chapter is to address R5 - exploring the design space of feedback. R5. What is the impact of feedback design on user's experience of PBE? a. Is showing either system information, instruction, or both helpful for completing tasks, understanding the system, and fixing human mistakes? b. Does feedback design affect user's behavior of using PBE features? c. Does feedback design affect user's credibility of the programs they make? d. Does demographic information affect user's performance and behavior of using PBE features? 113 e. Is the history of previous trials helpful for users to understand and fix their mistakes? To answer the above questions, we conducted an online experiment with 133 participants, who were recruited from Amazon Mechanical Turk. The experiment is based on the the preliminary study in Chapter 5, with a few modifications and extended features. First, we collected log data from not only those who finished the entire study but also who dropped out in the middle of the study. We compared different feedback design in terms of user’s dropout rates (i.e. how far participants proceeded in the tutorials and main tasks), the success rate (i.e. 
how likely each participant would accomplish each tutorial and task), behavioral metrics (e.g. click rates of the feedback messages), and subjective assessments (e.g. perceived usefulness the system, credibility of programs that participants created). Second, we developed 12 rules for detecting mistakes and generating feedback messages. We believe these findings provide valuable implications for designers of future PBE systems. 114 Experimental System UI To conduct the study, we extended the experimental system used in Chapter 5. The new system has the same goal – to enable non-technical participants to quickly learn and perform decomposition and disambiguation. There are a few extended features. First, if a command is assigned to a step, the description of the command is provided next to the step, as “Calculate Input*2” next in Figure 45. Second, when the PBE engine cannot find any command from user-provided examples, it shows a feedback message, “No command is found. What is your intent for Figure 45. The experimental system UI. The TASK section describes the program participants should build. The EXAMPLES contains a table of user-provided examples and feedback from the PBE engine. In the RESULT panel, users press the Teach Computer button to let the PBE engine generate programs based on provided examples, and get feedback. Finally, the HISTORY panel shows all the trials provided for the current task. 115 the step?”, and a list of potential user intents that it extracted from the examples. For instance, Step2 in Figure 45 shows a potential intent of a user, “Calculating a multi-step arithmetic from above numbers”. If the user thinks the intent is correct, he/she will click it to see the relevant information about the system (e.g. “The system can only learn a single arithmetic step”), and/or instruction for fixing the examples (e.g. “Insert a step above that contains intermediate values of the arithmetic that you want.”) 1. To solve a task, users need to make every step has a single command. If users provided examples that are ambiguous, the PBE engine provides feedback as below, “N commands found. Add another case, or CHOOSE AMONG THEM” where N is the number of commands consistent with the examples. 2. Clicking the “CHOOSE AMONG THEM” button will open a popup that contains the list of generated commands. Users can select a command to lock, or close the popup. 3. A selected command will be locked for the step. Locked commands will not be updated by teaching computer again. 4. Clicking the “UNLOCK” button will remove the locked command. To get a command for the step, users need to teach again. Figure 46. The mechanism of choosing and locking commands for a step. When the computer generates multiple commands users can choose one among them. Chosen commands are locked to the step, and stay until they got unlocked. 116 Lastly, if the PBE engine generates multiple commands for a step, it shows a message “n commands found. Add another case, or [CHOOSE AMONG THEM].” Users have two options for fixing it: (1) providing additional cases, (2) choosing among the list of generated programs as illustrated in Figure 46. Fourth, if a user makes more than five unsuccessful trials, the system shows a button for giving up in the RESULT panel. Lastly, the HISTORY panel shows the list of unsuccessful trials that the user has made so far for the current task. Feedback rules In Chapter 5, I identified the seven types of mistakes that inexperienced users make. 
To detect types of mistakes in the current example, and to provide adequate feedback messages, I developed 12 rules. It has to be noted that the rules are applied only when the PBE engine found no command for the step - not multiple commands. Therefore, the rules do not cover Ambiguous cases and Inconsistent program, for which the PBE engine generated multiple commands or a single command respectively. Missing steps When participants cannot decompose a complex task (i.e. did not create essential steps), the PBE engine would fail to generate any command for the provided examples. To help users understand and decompose tasks by adding essential steps, we developed three feedback rules. F1. FILTER WITHOUT PREDICATE The PBE engine requires a predicate step between source and target steps of a filtering task. However, inexperienced users often forget to create a step of predicates. The first 117 feedback rule detects the mistake using two conditions, in addition to the failure of generating any command. First, the current step must be a target of the filtering task. To check this, values of the current step (e.g. “a, d”) are a subset of values of any source step above (e.g. “a, b, c, d”). The second condition is non-existence of a valid predicate step between the current step and the source. A valid predicate step must have the same shape as the original step. In order for two steps to have the same shape, they must have the same number of cases, and every case must have the same number of values. For instance, the three examples below have the same shape, because all of them have three columns, and all the matching columns have the same number of values. a, b c d, e T, F F T,F 0,0 0 0,0 When the rule is satisfied, the system generates the feedback components below, and show users according to their experimental conditions. • Intent: “Trying to filter {source step}.” • System Information: “Filtering requires a predicate step containing T or F for each value.” • Instruction: “Insert a step above and type predicate values. For instance, F,T,F will keep the second values, and filter out the first and the third values.” F2. PREDICATE WITHOUT NUMBERS The PBE engine can evaluate numbers. Thus, any predicate step requires a step that contains key numbers. Our system detects this mistake if two conditions are satisfied. First, the current step must hold predicates only. Second, there is no above step that contains numbers in the same shape as the current step. Two steps having the same shape means that they have the same number of columns (i.e. cases), and corresponding 118 columns always have the same number of values. When the rule is satisfied, the system generates the feedback as below. • Intent: “Predicates for filtering {source step}” where the source step is the closest step above that has the same shape. • System Information: “To calculate predicates, it requires numbers above.” • Instruction: “Insert a step above and type key numbers for determining predicates.” F3. MULTISTEP ARITHMETIC T3 and T7 require that participants decompose complex arithmetic tasks such as (Input+1)*2 or (Input+1)*(Input-1). However, Inexperienced users often realize they need to add additional steps between the input and output steps. Our system detects this type of mistake if the current step contains only numbers, and there exist a step containing numbers with the same shape. When the rule is satisfied, the system generates the feedback below. 
• Intent: “Calculating a multi-step arithmetic from above numbers” • System Information: “The system can only learn a single arithmetic step.” • Instruction: “Insert a step above that contains intermediate values of the arithmetic that you want.” Ambiguous cases To accomplish a task, users need to specify a single command for every step. However, the PBE engine often generates multiple commands that are consistent with provided examples. In order to fix it, users either provide additional cases (i.e. additional columns in the EXAMPLE table) or manually choosing from the list of generated commands. Since these two solutions are well-explained within their own UIs, we did not create a feedback rule for this type of mistake. 119 Inconsistent or unsupported values All the values across columns must be consistent with a command, which is supported by the PBE engine. However, we observed that inexperienced users make a wide range of mistakes. We created five rules (F4-F8) for detecting and generating feedback for inconsistent or unsupported values. F4. INCONSISTENT CASES When our participants provided multiple cases for disambiguation, they often gave a value inconsistent with the others. It took many trials until they noticed the mistakes. Our system detects inconsistent cases using the leave-one-out cross-validation technique [41]. To begin with, the step must contain at least three cases. Second, the PBE engine tries to generate commands multiple times leaving one case out of the examples. For instance, if the current step contains three cases, the PBE engine tries to generate commands three times using (1,2), (1,3), and (2,3). If it generates an alternative command from (2,3), which left out the first case, the system will create a feedback message as follows. • Intent: “Trying to teach {alternative command}” • System Information: “It cannot learn when cases are inconsistent.” • Instruction: “Consider fixing or removing the 1st case.” F5. NON MATCHING SIZE OF PREDICATE A predicate step must have exactly the same shape with the step to be filtered. However, inexperienced users often use their creativity to give examples that the PBE engine cannot comprehend. For instance, in the previous user study (Chapter 5), we observed 7 (out of 150) cases in which participants provided single predicate values. To detect this mistake, the system checks whether the current step contains predicate values, and there is no step above has values of the same shape. If the rules are satisfied, the system generates feedback as follows. • Intent: “Making predicates for filtering” • System Information: “Predicates and items to be filtered must have the same length.” • Instruction: “Modify this step to have predicates (T or F) for every value in the step to be filtered.” 120 F6. PREDICATE AND VALUE COMBINED PBE users often create values in their own format. For example, we observed two (out of 150) cases in which participants provided predicates and values to be filtered combined (e.g. “T11, T10, F8, F9”; indicating that 11 and 10 are true, 8 and 9 are false cases). Our system detects this mistake by testing that the current step begins with a predicate value (e.g. “T”, “F”, “t”, or “f”) but also contains non-predicate values. If the rule is satisfied, the system generates feedback as follows. 
• Intent: “These are predicates with values combined” • System Information: “However, predicates must be T and F separated by commas.” • Instruction: “Modify this step to have predicates (T or F) for every value to be filtered.” F7. LIST WITHOUT SEPARATOR PBE users often provide predicate values without a separator, such as “FTFT”. Detecting this mistake is simple: the current step data must consist of predicate values (e.g. “T”, “F”, “t”, or “f”) only. If the rule is satisfied, the system creates the {correct list} by adding separators between every pair of adjacent characters, and generates feedback as follows. • Intent: “Predicates for filtering” • System Information: “Values in a list must be separated by a comma (,).” • Instruction: “Modify the value to {correct list}.” F8. YES NO PREDICATE Users can create a predicate step with “yes” and “no” – instead of “T” and “F”. This is a very rare mistake, which happened only once. To detect this mistake, our system uses a regular expression that checks whether the current step contains “yes” or “no” separated by commas. When the rule is satisfied, it can automatically generate {corrected predicates} by replacing “yes” to “T”, and “no” to “F”, and generate feedback as follows. • Intent: “Predicates for filtering” • System Information: “But predicates must be a list of T and F separated by commas.” • Instruction: “Modify the value to {corrected predicates}.” 121 Unnecessary steps F9. UNUSED STEP Users often create steps that are unnecessary for calculating the output. Although unused steps do not fail the task, it is better to remove them for clarity. To evaluate whether the current step is unused, the system checks two conditions. First, the output must have a single command. Second, the current step is not in the ancestors of the output. When the rule is satisfied, the system generates feedback as follows. • Intent: “I have no intent for this step” • System Information: “This step is unnecessary for getting the output.” • Instruction: “You may remove the step to simplify your program.” F10. EMPTY STEP Users often leave a step completely empty, especially when they are exploring to the solution. However, since empty steps will fail to generate any command, the system provides the following feedback for empty steps. • Intent: “I left this step empty” • System Information: “The system cannot learn any program from an empty step.” • Instruction: “Consider adding relevant values or removing the step.” Describing with formula F11. FORMULA Although the PBE engine accepts example values only, PBE users often provide formulas. To detect formulas, the system uses a regular expression of common operators (\, <, >, =, +, -, *, STEP, INPUT). If the current step contains any of those common operators, it generates feedback messages as follows. • Intent: “I described the step using formula” • System Information: “The system cannot understand formula.” • Instruction: “Give values that your formula calculates.” 122 Inconsistent program This is the case when the PBE engine can generate single programs for every step, but the entire program is wrong. We do not have any feedback rule for this type of mistake. Empty cases F12. EMPTY CASES During the preliminary study we observed that participants often provided incomplete tables having a few empty cells. Since the PBE engine requires consistency across columns, a step with empty cases might fail to generate any command. Detecting this mistake is straightforward; the system checks whether a column is empty. 
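Taken together, the twelve rules share a common structure: each tests a condition over the user-provided example table and, when it fires, emits the three feedback components (intent, system information, instruction). The sketch below is our own simplified illustration of that structure for three representative rules (F1, F7, and F12). The step representation and helper functions are assumptions made for this sketch, not the actual implementation of the study system.

```python
from typing import List, Optional, Tuple

Step = List[List[str]]           # one step (row) of the example table: a list of cases,
                                 # each case holding the values the participant typed
Feedback = Tuple[str, str, str]  # (intent, system information, instruction)

def same_shape(a: Step, b: Step) -> bool:
    """Same number of cases, and matching cases have the same number of values."""
    return len(a) == len(b) and all(len(x) == len(y) for x, y in zip(a, b))

def is_predicate_step(step: Step) -> bool:
    """True if the step contains only T/F values."""
    values = [v for case in step for v in case]
    return bool(values) and all(v.upper() in ("T", "F") for v in values)

def f1_filter_without_predicate(current: Step, steps_above: List[Step]) -> Optional[Feedback]:
    """F1: the current step looks like a filtered subset of a step above,
    but no same-shaped predicate step sits between the two."""
    for i, source in enumerate(steps_above):
        if len(source) != len(current):
            continue
        looks_filtered = all(set(cur) <= set(src) for cur, src in zip(current, source))
        has_predicate = any(is_predicate_step(s) and same_shape(s, source)
                            for s in steps_above[i + 1:])
        if looks_filtered and not has_predicate:
            return ("Trying to filter a step above.",
                    "Filtering requires a predicate step containing T or F for each value.",
                    "Insert a step above and type predicate values, e.g. F,T,F.")
    return None

def f7_list_without_separator(current: Step) -> Optional[Feedback]:
    """F7: predicate values typed without a separator, e.g. 'FTFT'."""
    for case in current:
        for value in case:
            if len(value) > 1 and "," not in value and set(value.upper()) <= {"T", "F"}:
                corrected = ",".join(value.upper())
                return ("Predicates for filtering.",
                        "Values in a list must be separated by a comma (,).",
                        "Modify the value to " + corrected + ".")
    return None

def f12_empty_case(current: Step) -> Optional[Feedback]:
    """F12: one of the cases (columns) of the step was left empty."""
    if any(len(case) == 0 for case in current):
        return ("I left some columns empty on purpose.",
                "To teach a program, all the cases must be consistent.",
                "Type consistent values in every case, or remove unnecessary columns.")
    return None
```

A dispatcher would simply run such detectors in order for every step that failed to yield a command, and show only the components that the participant's experimental condition allows.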
The empty-case rule, however, can produce false positives, because cases are often empty for good reasons, especially when they are the results of filtering. The system generates the feedback message as follows.
• Intent: "I left some columns empty on purpose"
• System Information: "To teach a program, all the cases must be consistent."
• Instruction: "Type consistent values in every case, or remove unnecessary columns."

Methods

Procedure

To participate, workers clicked the hyperlink to our system posted on Amazon Mechanical Turk (AMT). On the landing page, they read the consent form and clicked the "I agree" button to proceed. On the second page, they filled in the demographic survey form, which asked about their age, gender, highest level of education, major, current occupation, programming expertise, and technical experience. After the survey, participants learned the basic usage and the concept of example-based programming through the six tutorials. They then proceeded to the five main tasks. Both the tutorials and the tasks are the same as in the preliminary study in Chapter 5. Finally, participants were asked to fill in a closing survey about the perceived usability and effectiveness of the system.

Closing survey

After finishing the entire study, we gave participants several questions about their experience and opinions. Participants rated how much they agreed with the following statements.
• The system was easy to understand.
• The interface was effective to accomplish the tasks.
• The feedback next to each row was helpful.
• The programs I taught will work correctly for wider ranges of inputs.
Lastly, we asked them for general comments.
• Do you have any other comments on what worked or didn't work about the system?

Compensation

During the preliminary study we observed that many participants from AMT dropped out. Through a few rounds of pilot studies, we settled on a reasonable multi-stage compensation policy. Participants received a $1 basic reward when they finished the tutorials. Those who finished the entire study, no matter how many tasks they gave up, received a $2 bonus. In addition, we gave a $1 extra bonus to the best-performing participant among every 10 participants.

Experimental design

To explore the design space of the feedback mechanism, we chose two factors: feedback components and the history panel. For the first factor, our feedback rules generate feedback messages as three components: intent, system information, and instruction, as described in section 6.3. Although it is possible to make a maximum of eight combinations from the three items, the detected intent is an essential component that must be included in every condition that shows feedback, in order to enable users to choose the right feedback. Therefore, we created four conditions of feedback components as follows.
• (BASELINE): The baseline setting shows no feedback component.
• (SYSTEM INFO): Shows an intent first. As users click the intent, it reveals the relevant system information.
• (INSTRUCTION): Shows an intent first. As users click the intent, it reveals the instruction for fixing the example.
• (BOTH): Shows an intent first. As users click the intent, it reveals both the system information and the instruction.
The second factor is whether the UI shows the history panel or not. A participant can see his/her previous trials for the current task in the history panel. The study uses a between-subjects design.
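A minimal sketch of how the resulting 4 x 2 between-subjects assignment could be implemented is shown below; it is illustrative only, with hypothetical names rather than the code of the deployed study server.

```python
import random
from itertools import product

# The two between-subjects factors described above (illustrative constants).
FEEDBACK_COMPONENTS = ["BASELINE", "SYSTEM_INFO", "INSTRUCTION", "BOTH"]
HISTORY_PANEL = [True, False]

# All eight experimental conditions: 4 feedback settings x 2 history settings.
CONDITIONS = list(product(FEEDBACK_COMPONENTS, HISTORY_PANEL))

def assign_condition(participant_id: str) -> dict:
    """Randomly assign a newly arriving participant to one of the eight conditions."""
    feedback, history = random.choice(CONDITIONS)
    return {"participant": participant_id,
            "feedback_components": feedback,
            "show_history_panel": history}

# Example: assign_condition("p-042")
# -> {'participant': 'p-042', 'feedback_components': 'BOTH', 'show_history_panel': False}
```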
When a new participant visits, our system randomly assigns one of the eight conditions (4 feedback components x 2 history panel settings).

Measurements

The research questions focus on users' performance in relation to the various feedback components. We measured each participant's progress - how many tutorials and tasks he/she completed before either finishing the entire study or stopping at a specific task. The system also collected the success rate - whether participants passed or gave up each task they attempted. In addition, the system collected various information about each feedback rule, such as frequency (i.e. how many times the rule was activated) and click rate (i.e. the percentage of feedback messages clicked by participants). The system also measured users' behavior with UI components such as "add a case", "add a step", "choose among them", etc.

Participants

We recruited participants from AMT. The only constraint was that participants must reside in North America. We kept posting batches of our HIT until we reached around 35 participants for each feedback setting.

Result

The insignificant impact of feedback messages on completion and success rates

133 participants started the study; 61.5% finished the tutorials, and 30.7% finished the entire study. The biggest portion (26.5%) of participants dropped out while doing T3, the first tutorial in which they learned to decompose the complex arithmetic "(Input+1)*2". The second most difficult task was T7, the first main task, where 29.5% of the participants who reached T7 dropped out. We expected dropout rates to indirectly indicate the performance of each feedback setting: if a condition is better than the others, participants using that condition would be more likely to finish the entire study. It seems that Figure 47 supports the hypothesis - the BOTH condition (shown as a green line) was above the other conditions for the main tasks T8-T11. However, Pearson's chi-squared tests do not support a significant difference between pairs of conditions. For example, even the best (BOTH) and the worst (BASELINE) performing conditions do not have significantly different impacts on the probability of participants reaching the final task, χ²(2, N = 67) = 2.5519, p > .1. A Mann-Whitney U test also does not show a significant difference between conditions, Z = 1.0858, p > .1. We also performed a Kruskal-Wallis test on the number of tasks that participants reached, but could not find a significant difference between conditions, χ²(10, N = 67) = 14.891, p > .1. Similarly, feedback conditions do not have a significant impact on the number of tasks that participants successfully finished, χ²(10, N = 67) = 2.393, p = .495.

We compared the two history-panel conditions in Figure 48. It seems that the history panel helped users keep going after T3, but by T6 and T7 the two conditions converged.

Figure 47. Probabilities of participants reaching and completing tasks compared across different feedback compositions. Lines indicate the portion of participants who reached specific tasks. Bars indicate the portions of participants who accomplished tasks without giving up. The green line above the other lines suggests that the 'BOTH' setting, which shows both system info and instruction, outperformed the other settings.

Figure 48. Probabilities of participants reaching and completing tasks compared by whether the history panel is given or not. Lines indicate the portion of participants who reached specific tasks.
Bars indicate the portions of participants who accomplished tasks without giving up. The two lines go along with each other, suggesting that the history panel does not have a strong impact on how many users reached and completed tasks. 128 Frequency and click rates of feedback The system shows feedback messages that 12 feedback rules generate based on user- provided examples. This section analyzes the log data of 876 tasks focusing on how frequently each feedback rule was activated (i.e. shown), and how frequently participants clicked them to read system information and instruction components15. In total, feedback rules were activated for 314 times, and clicked 115 (36.6% click rate) times. As shown in Figure 50, R1. Filter without predicate is the most frequently activated (60 times) and clicked (25 times). R2. Predicate without number was activated less frequently (35 times), but clicked as many times as R1. The three rules (R1-R3) for the Missing Step mistakes were clicked 49.3% of the tasks they were shown, which is higher than the average click rate (36.6%). R5-R8 were activated much less (<10 times) than other rules. The rules were created for the Inconsistent and unsupported values type of mistakes, which occurred less frequently than we saw in the preliminary study. It does not necessarily mean that R5- R8 are less useful than the other rules. R9-R12 were shown 25-40 times, and clicked 7-12 times. The 27.3% click rate is lower than the average click rate (36.6%). 15 Note that no matter how many times a rule is activated or clicked within one task, we counted it as one activation or click. This is because users often try a task repeatedly (>50 times). 129 Figure 49. # of tasks (and tutorials) that a specific feedback rule was activated and clicked by participants. Figure 50. The closing survey result. The Likert scale ratings generally suggest that the BOTH condition is perceived to be intuitive, effective, and useful to increase the credibility of outcome. However, a few participants perceived the BOTH condition to be hard to understand and ineffective. 130 Perceived quality of the system and the outcomes After finishing the main tasks, participants filled a closing survey about the effectiveness of the system, and the generalizability of the programs they created, as illustrated in Figure 50. We compared raw frequency of answers, and then conducted Kruskal-Wallis test to see whether the differences are significant. It has to be noted that there is an attrition bias across conditions because of the high dropout rate. For example, only seven participants in the BASELINE condition answered the closing survey, while 10, 10, and 13 participants answered the survey for SYSTEM INFO, INSTRUCTION, and BOTH conditions respectively. The first question was about how they perceived the usability of the system. The majority of participants using the first three conditions (BASELINE, SYSTEM INFO, and INSTRUCTION) gave negative (“disagree” or “strongly disagree”) answers. In contrast, participants who used the BOTH condition gave either strongly positive or strongly negative ratings. However, the difference is not statistically significant by feedback conditions, 𝜒" = 2.203, 𝑝 = .5313, at the 𝛼 = 0.05 significance level. The second question was about the effectiveness of the UI. Participants gave similar but a bit more positive ratings than the first question. Ratings on the BOTH condition were again polarized into positive and negative opinions. 
However, the difference is not statistically significant across feedback conditions, χ² = 1.003, p = .8005, at the α = 0.05 significance level.

The third question was about the effectiveness of the feedback. More than 60% of the participants who used the BOTH and the SYSTEM INFO conditions rated it positively, and no participant gave these conditions a negative rating. The difference is statistically significant, χ² = 8.266, p = .0408, at the α = 0.05 significance level. A post-hoc analysis using Dunn's test adjusted by the Benjamini-Hochberg FDR indicates that the BASELINE and BOTH conditions differ (p = .0593).

The last question was about how confident participants felt about the programs they created. More than 60% of the participants in the three conditions other than BASELINE rated their programs positively. 45% of the participants who used the BOTH condition strongly agreed that they trusted their programs. Around 10% of the participants in the three non-BASELINE conditions were quite negative as well. The difference is not statistically significant across feedback conditions, χ² = 2.461, p = .4824. To sum up, we found a statistically significant difference between the BASELINE and the BOTH conditions only for the perceived effectiveness of the feedback.

Participant background and behavior

133 participants (80 males, 53 females) were recruited via AMT. They were on average 34.6 years old (SD = 10.33, range 20-70) and all currently live in the United States or Canada. The majority (73) of participants have bachelor's degrees, 41 have high school diplomas, 7 have master's degrees, 11 have professional degrees, and 1 has a doctoral degree. In terms of programming experience, 48 participants have no programming knowledge, while 53 know basic concepts. Participants also include 23 amateur programmers, and 9 reported that they are professional programmers.

We conducted rank-order correlation tests to check whether participants' progress and demographic information are positively or negatively correlated. First, participants' gender and progress were not significantly correlated, Z = 0.5053, p = .6101. Age did not affect their progress, r_s = -0.1235, p = .1566. Participants' education level had a marginally non-significant positive correlation with progress, r_s = 0.1695, p = .0510. Lastly, their programming expertise has a significant positive correlation with progress, r_s = 0.1844, p = .0333.

Discussion

The insignificant impact of feedback messages

We observed that dropout rates in the experiment are statistically indistinguishable across feedback conditions, although the difference between conditions is visible in Figure 47. The finding contradicts a finding from the online user study in Section 5.5. There are several potential reasons for the contradiction. First, the experimental system has many features (e.g. choosing among generated programs) that were added or extended since the online study, which may have weakened the impact of feedback conditions on both completion and success rates. It is possible that the UI was so overloaded with information that participants ignored the feedback messages. To confirm whether feedback settings have significant impacts on user performance, we have a few options for follow-up studies. First, we can collect more data to see whether the differences between feedback settings gain statistical power. Second, we can remove irrelevant information from the system so that participants are not overloaded with too much information. Third, we can analyze detailed log data to investigate what events occurred just before participants dropped out.
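As an illustration of the third option, the sketch below shows one way such a log analysis might look. The event log schema, event names, and data are hypothetical; the actual logs collected by our system may differ.

```python
from collections import defaultdict

# Hypothetical log records: (participant_id, timestamp, event_name, task_id)
LOG = [
    ("p01", 10, "teach_computer_failed", "T3"),
    ("p01", 95, "feedback_shown", "T3"),
    ("p01", 96, "page_closed", "T3"),
    ("p02", 12, "task_completed", "T3"),
]

FINISHED = {"p02"}  # participants who completed the entire study

def last_events_before_dropout(log, finished, n=3):
    """For each participant who dropped out, return their last n logged events,
    which may hint at what happened right before they gave up."""
    by_participant = defaultdict(list)
    for pid, ts, event, task in log:
        by_participant[pid].append((ts, event, task))
    result = {}
    for pid, events in by_participant.items():
        if pid in finished:
            continue
        events.sort()              # chronological order by timestamp
        result[pid] = events[-n:]  # the trailing events before the dropout
    return result

print(last_events_before_dropout(LOG, FINISHED))
```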
Potential reasons and remedies for the high dropout rate

Among the 133 participants recruited from AMT, 69.3% dropped out before finishing the entire study. For most user studies, such a high attrition (dropout) rate is an alarming signal that serious issues might exist in the study design. In my study, several factors possibly contributed to the high dropout rate.

First, participants might have dropped out because our tutorials and tasks were too challenging. I believe this is highly likely because the two biggest drops happened when participants first learned the concept of decomposition (T3 and T7), and 15.5% of participants dropped out after finishing the tutorials, which suggests that they lost the motivation to attempt the main tasks. Although an immediate fix for this issue is to use shorter and simpler tasks and tutorials, we need to consider that the aim of the experiment is to give users sufficiently hard tasks so that they need additional support.

Second, the experimental UI and instructions might have room for improvement. This reason is also likely because participants made mistakes even though they had already read the relevant instruction. Why did they miss the relevant information? Information overload is a potential reason, as discussed in 6.6.1. To improve the design quality of the UI and instructions, we could have conducted more pilot studies in the lab.

Third, AMT may not be the best platform to recruit participants for this experiment, which requires them to learn many new concepts such as example-based programming and managing steps and cases. Many experimental studies using online samples (e.g. Amazon Mechanical Turk) do not report attrition rates, which can range from 30% to 50% and vary across experimental conditions [94]. If this is the dominant reason for the high dropout rate, improving the UI and instructions will have only limited effectiveness. The best way to fix it is to conduct a lab study - which is also the best way to find out the actual problem. However, conducting a lab study requires much more resources than running an online experiment. A more economical way is to add a screening test, which estimates a user's ability to finish the experiment, and to allow only participants who pass the screening to proceed.

Plan for a follow-up experiment: addressing the high dropout rate

This section describes a plan for a follow-up experiment addressing the high dropout rate.

Simplify the study through an iterative design process

To identify and fix potential usability issues in the experimental system, I will go through a few rounds of an iterative design process consisting of in-person pilot studies and redesign of the tasks, UI, and instructions. While redesigning the system, I will consider two options for simplifying the study. First, I will observe participants to identify redundant or unnecessary parts of the UI, and remove them to prevent information overload. Second, I will consider breaking the current tasks (and tutorials) into a few groups so that each participant can finish the experiment with less time and effort.

Diversify the population and the study setting

Who the participants are and where the experiment is conducted may have a significant impact on the result. To control for this impact, I will conduct the study with two populations: (P1) Amazon Mechanical Turk, and (P2) a campus mailing list.
Another factor is the study setting. I will conduct the study with P2 in two environments: (E1) in-person lab study, and (E2) remotely. P1 will always participate remotely. Cross validation of the results will give insights of how population and study setting affect the dropout rate. 135 Screening questions I found out that programming expertise is not perfect but still a good indicator of participants' completion. To lower the dropout rate, I will carefully add a few screening questions about their ability to understand the basic concept of PBE tasks and usage of the system. The first question will be about the participant's programming experience. It will require participants to know at least basic concepts of programming. Second, I will give five tests about PBE. Each test will show four examples of input and output values, and participants must pick a matching program that calculates all the input and output examples among four programs, as illustrated below. Participants who answered all the questions correctly will be able to proceed to the tutorials. Input => Output 1 => 3 2 => 4 3 => 5 4 => 6 Choose the right program that matches the examples above (a) Input * 3 (b) Input + 2 (c) Input +1 (d) Input * 4 - 2 Closing survey when a participant dropout The closing survey is an important source of information, but the current system does not collect from dropouts. I will redesign the system so that participants are asked to fill in the closing survey for the HIT completion code, even when they want to stop participating. 136 Chapter 7: Conclusion The goal of this dissertation was to improve the human-centered design of PBE systems by studying users’ needs and mental models, identifying usability issues and human mistakes, and developing and testing novel features. In this chapter, we summarize what we have learned in response to the research questions, thesis contributions, and directions for future work. Answers to the research questions R1. What do end-user programmer need to improve the Web? To answer the question, we conducted a semi-structured interview study with 35 end- users of the Web, as presented in section 3.2. The interview study explored the space of challenges that end-users regularly experience on the Web, and the functionalities of enhancements that they envisioned. We proposed seven categories of enhancements (Modify, Compute, Interact, Gather, Automate, Store and Notify), which provide guidance to website designers in the first place to be aware of the unique needs of many users. R2. How do non-programmers express their programming intent? To answer the question, we conducted a Wizard of Oz study (section 3.3) that asked non-programmers to express computational tasks. We found interesting characteristics of them. First, non-programmers would express their intent effectively using multiple channels such as rules, examples, and rationales. Although they may not be able to provide complete information at first, they can iteratively refine their intent with additional information. To enable non-programmers to express high-quality intent, 137 future EUP tools should incorporate mixed-initiative interaction to help end-users express unambiguous statements. R3. Is PBE better than direct specification? To answer the question, we conducted a preliminary user study with two versions of VESPY (Chapter 4). We could not find a clear answer, such as “PBE is always better than direct specification” or the opposite. 
Instead we observed that PBE is effective for complex tasks where users can skip multiple steps of interaction. Thus the alternative answer to the question is that the usefulness of PBE is affected by many factors: (1) user’s knowledge of the domain-specific language, the PBE engine, and the task; (2) the amount of work for creating sufficient examples vs. directly specifying parameters; (3) credibility of programs. R4. Can inexperienced users perform problem decomposition and disambiguation? The answer was “No”. As reported in Chapter 5, we observed that only 30 out of the 161 participants finished the entire study. We also identified seven types of common mistakes: Missing steps, Ambiguous cases, Inconsistent or unsupported values, Unnecessary steps, Describing with formula, Inconsistent programs, and Empty cases. However, we also found empirical evidence that the PBE system can automatically detect a user’s programming intent, and generate actionable feedback that helps the user quickly fix mistakes. R5. What is the best feedback design for PBE users? To answer the last question, we conducted an online experiment with 133 participants. We developed 12 rules for detecting mistakes and generating feedback messages, three 138 components of feedback messages, and the history panel showing the previous trials. According to Figure 47, the BOTH condition seems to outperform the others, and the BASELINE was the worst setting. However, statistical tests could not confirm that the numbers of completed tasks are significantly different across conditions. We also compared the closing survey result across feedback conditions, and found out that only the third question was significant - i.e. participants rated the perceived effectiveness of feedback in the BOTH condition higher than the others. Participants did not give significantly different ratings for the system's intuitiveness and their credibility of the programs. We discussed about a few potential reasons of the insignificant differences, in relation with the high dropout rate. To investigate whether personal background affect user's performance, we conducted rank-order correlation tests on demographic information and the number of completed tasks. While age, gender, and education level did not have significant impacts on user's performance, programming expertise is helpful to complete more tasks. Thesis contributions Identification of unmet needs of end-users of the Web End-user programming (EUP) is a common approach for helping ordinary people create small programs for their professional or daily tasks. However, it is often hard to address these needs, especially for fast-evolving domains such as the Web. We conducted a semi-structured interview study (Chapter 3.2) with 35 end-users of the Web. The interview study explored the space of challenges that end-users regularly experience on the Web, and the functionalities of enhancements that they envisioned. 139 We identified seven categories of enhancements that can provide guidance to future EUP developers. Characterization of non-programmers’ mental model Programming is difficult to learn since its fundamental structure (e.g. looping, if-then conditional, and variable referencing) is not familiar or natural for non-programmers [67]. Understanding a non-programmer’s mindset is an important step to develop an easy-to-learn programming environment. 
We conducted a Wizard of Oz study (Chapter 3.3), which provided characteristics of non-programmers explaining how they would express their intent of computational tasks. Given that traditional programming environments do not fully support them, we discussed the implications for the design of multi-modal and mixed-initiative approaches for making end-user programming more natural and easy-to-use for these users. Design process of interleaving visual programming and PBE Researchers and companies have developed many PBE systems, but how to design UI to support users to decompose and disambiguate complex tasks is still an open-ended research question. Through a 1.5 year-long iterative process, we developed VESPY UI (Chapter 4) in which users decompose complex tasks into tractable modules (using visual / dataflow programming techniques), and generate solutions for each module (using PBE techniques). We believe the design process and the final outcome would be valuable resources for future PBE system designers. 140 Identification of human mistakes of PBE PBE systems can be challenging for inexperienced users. Unfortunately, there is little research on people's ability to accomplish complex tasks by providing examples. We conducted an online user study that investigates how well people decompose complex tasks, and disambiguate sub-tasks. We also identified seven types of mistakes made, and suggested new opportunities for actionable feedback based on unsuccessful examples. Design and assessment of feedback for PBE users While human-readable, informative feedback is crucial for designing usable interfaces [75], there is little prior research about feedback design for PBE users. To explore the design space of feedback, in Chapter 6, we designed three components of feedback messages: user intent, system information, and instruction. We also proposed a history panel that shows previous trials of the user. To assess their impacts on user’s performance, we conducted an online experiment. The findings suggest that the feedback messages do not significantly affect participants' performance, but providing both system information and instruction increases the perceived effectiveness of feedback messages. The result also suggests that the high dropout rates and information overloads lowered the validity of the study, we will conduct a follow-up experiment with a revised system and study design. Future work With the investigations and designs presented in this dissertation, I have demonstrated that human-centered aspects of PBE can be improved with mixed-initiative, actionable 141 feedback for human mistakes. From here I present several directions for continued research. Crowdsourcing feedback rules to users We have shown that actionable feedback messages are essential to inexperience users of PBE systems. However, since the current set of rules are manually created by the author, they may not be scalable or generalizable to other PBE systems. To overcome this limitation, designers of future PBE systems can consider crowdsourcing feedback rules. For example, if a lot of users make similar mistakes that the current set of rules cannot detect, the system can ask users provide structured hints. Based on multiple hints, the PBE can automatically create a new feedback rule. 
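One way this could work, sketched below with made-up thresholds, signatures, and class names (this is an idea sketch, not an implemented component), is to group unexplained failures by a coarse signature of the failing step and promote a crowd-supplied hint to a candidate rule once enough users contribute the same hint for the same signature.

```python
from collections import defaultdict

def failure_signature(step_values):
    """A coarse, illustrative signature of an unexplained failure: the kinds of values
    the participant typed (numbers, predicates, text) in the failing step."""
    kinds = set()
    for v in step_values:
        if v.upper() in ("T", "F"):
            kinds.add("predicate")
        elif v.replace("-", "", 1).replace(".", "", 1).isdigit():
            kinds.add("number")
        else:
            kinds.add("text")
    return tuple(sorted(kinds))

class CrowdRuleCollector:
    """Collect structured hints from users whose mistakes no existing rule explains,
    and promote a hint to a candidate feedback rule once it recurs often enough."""
    def __init__(self, promote_after=5):
        self.hints = defaultdict(list)  # signature -> list of (intent, instruction) hints
        self.promote_after = promote_after

    def add_hint(self, step_values, intent, instruction):
        sig = failure_signature(step_values)
        self.hints[sig].append((intent, instruction))
        if len(self.hints[sig]) >= self.promote_after:
            most_common = max(set(self.hints[sig]), key=self.hints[sig].count)
            return {"signature": sig, "candidate_rule": most_common}
        return None
```

A human designer would still need to review any promoted candidate before it is shown to users as feedback.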
Balancing between too much and too little feedback
Although a main benefit of PBE is that it requires users to learn little additional knowledge, we observed that inexperienced users could not provide high-quality examples without proper feedback. At the same time, most features proposed in this dissertation (e.g., adding and removing steps and cases, feedback messages) add a significant amount of information to the system, and we observed in the last study that overloading users with too much information can have negative effects. Providing the right amount of information is therefore an important decision when designing usable PBE systems. There are a few directions for future work on this issue. First, we need a metric to monitor whether the current feedback gives too much or too little information. Second, we need a metric to assess the importance of different pieces of information so that we can stress the most important point. Third, we need a new interaction model that initially reveals a small portion of information in situ and lets users gradually learn the system without being overloaded with irrelevant or excessive information.
Long-term user study of practical EUP systems
Although the last two studies (Chapters 5 and 6) were motivated by the usability issues of VESPY (Chapter 4), we did not have a chance to apply our findings to VESPY. I would like to improve the usability of VESPY with actionable feedback components and conduct a long-term user study of how users gradually learn the capabilities of PBE based on the feedback they receive. A long-term user study would also provide an opportunity to crowdsource feedback rules, as explained in Section 7.3.1.
Final remarks
We are in the early stages of a widespread adoption of automated systems, including PBE engines, statistical models, intelligent agents, and more. As we interact with automated systems more frequently, the importance of symbiotic interaction between human minds and automated systems will only increase. Without symbiotic interaction, humans risk blindly accepting what automated systems suggest, or rejecting their suggestions without reasoning. In this dissertation, we have provided a variety of insights into users' needs, mental models, and mistakes. We have also proposed several steps toward symbiotic interaction, including VESPY's interleaved UI, the feedback rules, and the history panel. I plan to continue this research with the goal of further exploring the design space of symbiotic interaction between humans and AI.
Bibliography
1. Daniel L Ashbrook, James R Clawson, Kent Lyons, Thad E. Starner, and Nirmal Patel. 2008. Quickdraw: The Impact of Mobility and On-Body Placement on Device Access Time. Proceeding of the twenty-sixth annual CHI conference on Human factors in computing systems - CHI ’08: 219–222. https://doi.org/10.1145/1357054.1357092 2. A. Begel and Mitchel Resnick. 1996. LogoBlocks: A Graphical Programming Language for Interacting with the World. In MIT Media Lab. 3. Alan W Biermann, Bruce W Ballard, and Anne H Sigmon. 1983. An experimental study of natural language programming. International journal of man-machine studies 18, 1: 71–87. 4. A. Blackwell and M. Burnett. 2002. Applying attention investment to end-user programming. In IEEE 2002 Symposia on Human Centric Computing Languages and Environments, 2002. Proceedings, 28–30. https://doi.org/10.1109/HCC.2002.1046337 5. C. Bogart, M. Burnett, A. Cypher, and C. Scaffidi. 2008. End-user programming in the wild: A field study of CoScripter scripts.
In IEEE Symposium on Visual Languages and Human-Centric Computing, 2008. VL/HCC 2008, 39–46. https://doi.org/10.1109/VLHCC.2008.4639056 6. Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller. 2005. Automation and customization of rendered web pages. In Proceedings of the 18th annual ACM symposium on User interface software and technology (UIST ’05), 163–172. https://doi.org/10.1145/1095034.1095062 7. Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller. 2005. Automation and customization of rendered web pages. In (UIST ’05), 163– 172. https://doi.org/10.1145/1095034.1095062 8. Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2: 77–101. https://doi.org/10.1191/1478088706qp063oa 9. Margaret M. Burnett and Christopher Scaffidi. 2014. End-User Development. The Encyclopedia of Human-Computer Interaction, 2nd Ed. Retrieved June 7, 2015 from /encyclopedia/end-user_development.html 10. Amedeo Cesta. 1998. Mixed-Initiative Issues in an Agent-Based Meeting Scheduler. In No.1-2, pp 45 – 78, 45–78. 11. Pern Hui Chia, Yusuke Yamamoto, and N. Asokan. 2012. Is This App Safe?: A Large Scale Study on Application Permissions and Risk Signals. In Proceedings of the 21st International Conference on World Wide Web (WWW ’12), 311–320. https://doi.org/10.1145/2187836.2187879 12. Cynthia L. Corritore, Beverly Kracher, and Susan Wiedenbeck. 2003. On-line trust: concepts, evolving themes, a model. International Journal of Human-Computer Studies 58, 6: 737–758. https://doi.org/10.1016/S1071-5819(03)00041-7 13. Allen Cypher, Mira Dontcheva, Tessa Lau, and Jeffrey Nichols. 2010. No Code Required: Giving Users Tools to Transform the Web. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. 144 14. Allen Cypher, Daniel C. Halbert, David Kurlander, Henry Lieberman, David Maulsby, Brad A. Myers, and Alan Turransky (eds.). 1993. Watch what I do: programming by demonstration. MIT Press, Cambridge, MA, USA. 15. Rob Ennals, Eric Brewer, Minos Garofalakis, Michael Shadle, and Prashant Gandhi. 2007. Intel Mash Maker: join the web. SIGMOD Rec. 36, 4: 27–33. https://doi.org/10.1145/1361348.1361355 16. Gerhard Fischer and Elisa Giaccardi. 2006. Meta-design: A Framework for the Future of End-User Development. In End User Development, Henry Lieberman, Fabio Paternò and Volker Wulf (eds.). Springer Netherlands, 427–457. Retrieved September 5, 2014 from http://link.springer.com/chapter/10.1007/1-4020-5386- X_19 17. Gerhard Fischer, Andreas C. Lemke, Thomas Mastaglio, and Anders I. Morch. 1990. Using Critics to Empower Users. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’90), 337–347. https://doi.org/10.1145/97243.97305 18. Gerhard Fischer, Andreas C. Lemke, Thomas Mastaglio, and Andres I. Morch. 1991. The Role of Critiquing in Cooperative Problem Solving. ACM Trans. Inf. Syst. 9, 2: 123–151. https://doi.org/10.1145/123078.128727 19. Marc Fisher, II, Mingming Cao, Gregg Rothermel, Darren Brown, Curtis R. Cook, and Margaret M. Burnett. 2006. Integrating automated test generation into the WYSIWYT spreadsheet testing methodology. Acm Trans. Softw. Eng. Methodol 15: 2006. 20. Mihai Boicu Gheorghe Tecuci. 2007. Seven Aspects of Mixed-Initiative Reasoning: An Introduction to this Special Issue on Mixed-Initiative Assistants. AI Magazine 28: 11–12. 21. J. Gindling, A. Ioannidou, J. Loh, O. Lokkebo, and A. Repenning. 1995. 
LEGOsheets: A Rule-based Programming, Simulation and Manipulation Environment for the LEGO Programmable Brick. In Proceedings of the 11th International IEEE Symposium on Visual Languages (VL ’95), 172–. Retrieved October 27, 2014 from http://dl.acm.org/citation.cfm?id=832276.834311 22. Daniel G. Goldstein, R. Preston McAfee, and Siddharth Suri. 2013. The Cost of Annoying Ads. In Proceedings of the 22Nd International Conference on World Wide Web (WWW ’13), 459–470. Retrieved April 13, 2015 from http://dl.acm.org/citation.cfm?id=2488388.2488429 23. Sumit Gulwani. 2011. Automating string processing in spreadsheets using input- output examples. SIGPLAN Not. 46, 1: 317–330. https://doi.org/10.1145/1925844.1926423 24. Sumit Gulwani. 2016. Programming by Examples (and its applications in Data Wrangling). In Verification and Synthesis of Correct and Secure Systems, Javier Esparza, Orna Grumberg and Salomon Sickert (eds.). IOS Press. 25. Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, and Jeffrey Heer. 2011. Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In (UIST ’11), 65–74. https://doi.org/10.1145/2047196.2047205 26. Marti A. Hearst. 1999. Mixed-initiative interaction. IEEE Intelligent Systems 14: 14–23. 145 27. Eric Horvitz. 1999. Principles of Mixed-initiative User Interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’99), 159–166. https://doi.org/10.1145/302979.303030 28. Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems (CHI ’99), 159–166. https://doi.org/10.1145/302979.303030 29. Eric J. Horvitz. 2007. Reflections on Challenges and Promises of Mixed-Initiative Interaction. AI Magazine 28, 2: 3. https://doi.org/10.1609/aimag.v28i2.2036 30. Daniel J. Hruschka, Deborah Schwartz, Daphne Cobb St.John, Erin Picone-Decaro, Richard A. Jenkins, and James W. Carey. 2004. Reliability in Coding Open-Ended Data: Lessons Learned from HIV Behavioral Research. Field Methods 16, 3: 307– 331. https://doi.org/10.1177/1525822X04266540 31. David F. Huynh, Robert C. Miller, and David R. Karger. 2006. Enabling Web Browsers to Augment Web Sites’ Filtering and Sorting Functionalities. In Proceedings of the 19th Annual ACM Symposium on User Interface Software and Technology (UIST ’06), 125–134. https://doi.org/10.1145/1166253.1166274 32. Wesley M. Johnston, J. R. Paul Hanna, and Richard J. Millar. 2004. Advances in Dataflow Programming Languages. ACM Comput. Surv. 36, 1: 1–34. https://doi.org/10.1145/1013208.1013209 33. Simon Peyton Jones, Alan Blackwell, and Margaret Burnett. 2003. A User-centred Approach to Functions in Excel. In Proceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming (ICFP ’03), 165–176. https://doi.org/10.1145/944705.944721 34. Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11), 3363–3372. https://doi.org/10.1145/1978942.1979444 35. Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In (CHI ’11), 3363–3372. https://doi.org/10.1145/1978942.1979444 36. Caitlin Kelleher and Randy Pausch. 2005. Lowering the barriers to programming: A taxonomy of programming environments and languages for novice programmers. 
ACM Comput. Surv. 37, 2: 83–137. https://doi.org/10.1145/1089733.1089734 37. Caitlin Kelleher, Randy Pausch, and Sara Kiesler. 2007. Storytelling Alice Motivates Middle School Girls to Learn Computer Programming. In (CHI ’07), 1455–1464. https://doi.org/10.1145/1240624.1240844 38. D. V. Keyson, M. P. A. J. de Hoogh, A. Freudenthal, and A. P. O. S. Vermeeren. 2000. The Intelligent Thermostat: A Mixed-initiative User Interface. In CHI ’00 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’00), 59– 60. https://doi.org/10.1145/633292.633329 39. Andrew J. Ko, Robin Abraham, Laura Beckwith, Alan Blackwell, Margaret Burnett, Martin Erwig, Chris Scaffidi, Joseph Lawrance, Henry Lieberman, Brad Myers, Mary Beth Rosson, Gregg Rothermel, Mary Shaw, and Susan Wiedenbeck. 2011. The state of the art in end-user software engineering. ACM Comput. Surv. 43, 3: 21:1–21:44. https://doi.org/10.1145/1922649.1922658 146 40. Andrew J. Ko and Brad A. Myers. 2004. Designing the Whyline: A Debugging Interface for Asking Questions About Program Behavior. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’04), 151–158. https://doi.org/10.1145/985692.985712 41. Ron Kohavi. 1995. A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2 (IJCAI’95), 1137–1143. Retrieved April 11, 2017 from http://dl.acm.org/citation.cfm?id=1643031.1643047 42. Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, and Juliana S. Teixeira. 2002. A Brief Survey of Web Data Extraction Tools. SIGMOD Rec. 31, 2: 84–93. https://doi.org/10.1145/565117.565137 43. Tessa Lau. 2001. Programming by Demonstration: a Machine Learning Approach. 44. Tessa Lau. 2009. Why PBD systems fail: Lessons learned for usable AI. AI Magazine 30.4, 65. 45. Tessa Lau, Steven A. Wolfman, Pedro Domingos, and Daniel S. Weld. 2003. Programming by Demonstration Using Version Space Algebra. Mach. Learn. 53, 1–2: 111–156. https://doi.org/10.1023/A:1025671410623 46. Vu Le and Sumit Gulwani. 2014. FlashExtract: A Framework for Data Extraction by Examples. In (PLDI ’14), 542–553. https://doi.org/10.1145/2594291.2594333 47. Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. 2008. CoScripter: automating & sharing how-to knowledge in the enterprise. In (CHI ’08), 1719– 1728. https://doi.org/10.1145/1357054.1357323 48. Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. 2008. CoScripter: automating & sharing how-to knowledge in the enterprise. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’08), 1719– 1728. https://doi.org/10.1145/1357054.1357323 49. J. C R Licklider. 1960. Man-Computer Symbiosis. IRE Transactions on Human Factors in Electronics HFE-1, 1: 4–11. https://doi.org/10.1109/THFE2.1960.4503259 50. Henry Lieberman. 2001. Your Wish is My Command: Programming By Example. Morgan Kaufmann, San Francisco. 51. Henry Lieberman, Fabio Paternò, Markus Klann, and Volker Wulf. 2006. End-User Development: An Emerging Paradigm. In End User Development, Henry Lieberman, Fabio Paternò and Volker Wulf (eds.). Springer Netherlands, 1–8. Retrieved April 16, 2014 from http://link.springer.com/chapter/10.1007/1-4020- 5386-X_1 52. James Lin, Jeffrey Wong, Jeffrey Nichols, Allen Cypher, and Tessa A. Lau. 2008. End-user programming of mashups with vegemite. 106. 53. Greg Little, Tessa A. Lau, Allen Cypher, James Lin, Eben M. Haber, and Eser Kandogan. 2007. 
Koala: Capture, Share, Automate, Personalize Business Processes on the Web. In (CHI ’07), 943–946. https://doi.org/10.1145/1240624.1240767 54. Mikaël Mayer, Gustavo Soares, Maxim Grechkin, Vu Le, Mark Marron, Oleksandr Polozov, Rishabh Singh, Benjamin Zorn, and Sumit Gulwani. 2015. User Interaction Models for Disambiguation in Programming by Example. In 147 Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology (UIST ’15), 291–301. https://doi.org/10.1145/2807442.2807459 55. Richard G. McDaniel and Brad A. Myers. 1999. Getting More out of Programming- by-demonstration. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’99), 442–449. https://doi.org/10.1145/302979.303127 56. L. A.. Miller. 1981. Natural language programming: styles, strategies, and contrasts. IBM Syst. J. 20, 2: 184–215. https://doi.org/10.1147/sj.202.0184 57. Lance A. Miller. 1974. Programming by non-programmers. International Journal of Man-Machine Studies 6, 2: 237–260. https://doi.org/10.1016/S0020- 7373(74)80004-0 58. Robert C. Miller, Victoria H. Chou, Michael Bernstein, Greg Little, Max Van Kleek, David Karger, and Mc Schraefel. Inky: A Sloppy Command Line for the Web with Rich Visual Feedback. 59. Robert C. Miller, Victoria H. Chou, Michael Bernstein, Greg Little, Max Van Kleek, David Karger, and Mc Schraefel. Inky: A Sloppy Command Line for the Web with Rich Visual Feedback. 60. Robert C Miller, Victoria H Chou, Michael Bernstein, Greg Little, Max Van Kleek, David Karger, and others. 2008. Inky: a sloppy command line for the web with rich visual feedback. In Proceedings of the 21st annual ACM symposium on User interface software and technology, 131–140. 61. S. Münch, J. Kreuziger, M. Kaiser, and R. Dillmann. 1994. Robot Programming by Demonstration (RPD) - Using Machine Learning and User Interaction Methods for the Development of Easy and Comfortable Robot Programming Systems. In In Proceedings of the 24th International Symposium on Industrial Robots, 685–693. 62. Brad Myers, Sun Young Park, Yoko Nakano, Greg Mueller, and Andrew Ko. 2008. How Designers Design and Program Interactive Behaviors. In Proceedings of the 2008 IEEE Symposium on Visual Languages and Human-Centric Computing (VLHCC ’08), 177–184. https://doi.org/10.1109/VLHCC.2008.4639081 63. Bonnie A. Nardi. 1993. A Small Matter of Programming: Perspectives on End User Computing. MIT Press, Cambridge, MA, USA. 64. Dana S. Nau, Stephen J. J. Smith, and Kutluhan Erol. 1998. Control Strategies in HTN Planning: Theory Versus Practice. In Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence (AAAI ’98/IAAI ’98), 1127–1133. Retrieved April 14, 2014 from http://dl.acm.org/citation.cfm?id=295240.296264 65. Myle Ott, Claire Cardie, and Jeff Hancock. 2012. Estimating the Prevalence of Deception in Online Review Communities. In (WWW ’12), 201–210. https://doi.org/10.1145/2187836.2187864 66. Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT ’11), 309–319. Retrieved April 13, 2015 from http://dl.acm.org/citation.cfm?id=2002472.2002512 67. John F. Pane, Brad A. Myers, and Chotirat Ann Ratanamahatana. 2001. Studying the language and structure in non-programmers’ solutions to programming 148 problems. Int. J. 
Hum.-Comput. Stud. 54, 2: 237–264. https://doi.org/10.1006/ijhc.2000.0410 68. Marian Petre and Alan F. Blackwell. 2007. Children As Unwitting End-User Programmers. In Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing (VLHCC ’07), 239–242. https://doi.org/10.1109/VLHCC.2007.13 69. Panko Ray. 1995. Finding spreadsheet errors; most spreadsheet models have design flaws that may lead to long-term miscalculations. Information Week. 70. Alexander Repenning and Corrina Perrone. 2000. Programming by Example: Programming by Analogous Examples. Commun. ACM 43, 3: 90–97. https://doi.org/10.1145/330534.330546 71. Mitchel Resnick, John Maloney, Andrés Monroy-Hernández, Natalie Rusk, Evelyn Eastmond, Karen Brennan, Amon Millner, Eric Rosenbaum, Jay Silver, Brian Silverman, and Yasmin Kafai. 2009. Scratch: Programming for All. Commun. ACM 52, 11: 60–67. https://doi.org/10.1145/1592761.1592779 72. M. B. Rosson, H. Sinha, and T. Edor. 2010. Design Planning in End-User Web Development: Gender, Feature Exploration and Feelings of Success. In 2010 IEEE Symposium on Visual Languages and Human-Centric Computing, 141–148. https://doi.org/10.1109/VLHCC.2010.28 73. M.B. Rosson, J. Ballin, and J. Rode. 2005. Who, what, and how: a survey of informal and professional Web developers. 199–206. https://doi.org/10.1109/VLHCC.2005.73 74. Ben Shneiderman. 1984. The Future of Interactive Systems and the Emergence of Direct Manipulation. In Proc. Of the NYU Symposium on User Interfaces on Human Factors and Interactive Computer Systems, 1–28. Retrieved October 10, 2014 from http://dl.acm.org/citation.cfm?id=2092.2093 75. Ben Shneiderman. 1997. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. 76. Ben Shneiderman and Pattie Maes. 1997. Direct Manipulation vs. Interface Agents. interactions 4, 6: 42–61. https://doi.org/10.1145/267505.267514 77. Ben Shneiderman and Catherine Plaisant. 2006. Strategies for Evaluating Information Visualization Tools: Multi-dimensional In-depth Long-term Case Studies. In Proceedings of the 2006 AVI Workshop on BEyond Time and Errors: Novel Evaluation Methods for Information Visualization (BELIV ’06), 1–7. https://doi.org/10.1145/1168149.1168158 78. Michael Toomim, Steven M. Drucker, Mira Dontcheva, Ali Rahimi, Blake Thomson, and James A. Landay. 2009. Attaching UI enhancements to websites with end users. In (CHI ’09), 1859–1868. https://doi.org/10.1145/1518701.1518987 79. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. 2007. Building data integration queries by demonstration. In Proceedings of the 12th international conference on Intelligent user interfaces (IUI ’07), 170–179. https://doi.org/10.1145/1216295.1216328 149 80. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. 2007. Building data integration queries by demonstration. In (IUI ’07), 170–179. https://doi.org/10.1145/1216295.1216328 81. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. 2008. Building Mashups by example. In Proceedings of the 13th international conference on Intelligent user interfaces (IUI ’08), 139–148. https://doi.org/10.1145/1378773.1378792 82. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. 2008. Building Mashups by example. In (IUI ’08), 139–148. https://doi.org/10.1145/1378773.1378792 83. US Bureau of Labor Statistics. 2017. United States Labor Force Statistics - Seasonally Adjusted. Labor Market Information. Rhode Island Department of Labor and Training. 
Retrieved March 8, 2017 from http://www.dlt.ri.gov/lmi/laus/us/usadj.htm 84. Jacob O. Wobbrock, Leah Findlater, Darren Gergle, and James J. Higgins. 2011. The Aligned Rank Transform for Nonparametric Factorial Analyses Using Only Anova Procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11), 143–146. https://doi.org/10.1145/1978942.1978963 85. Jeffrey Wong and Jason Hong. 2008. What Do We “Mashup” when We Make Mashups? In Proceedings of the 4th International Workshop on End-user Software Engineering (WEUSE ’08), 35–39. https://doi.org/10.1145/1370847.1370855 86. Jeffrey Wong and Jason I. Hong. 2007. Making mashups with marmite: towards end-user programming for the web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’07), 1435–1444. https://doi.org/10.1145/1240624.1240842 87. Jeffrey Wong and Jason I. Hong. 2007. Making mashups with marmite: towards end-user programming for the web. In (CHI ’07), 1435–1444. https://doi.org/10.1145/1240624.1240842 88. Kuat Yessenov, Shubham Tulsiani, Aditya Menon, Robert C. Miller, Sumit Gulwani, Butler Lampson, and Adam Kalai. 2013. A Colorful Approach to Text Processing by Example. In (UIST ’13), 495–504. https://doi.org/10.1145/2501988.2502040 89. Kuat Yessenov, Shubham Tulsiani, Aditya Menon, Robert C. Miller, Sumit Gulwani, Butler Lampson, and Adam Kalai. 2013. A Colorful Approach to Text Processing by Example. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (UIST ’13), 495–504. https://doi.org/10.1145/2501988.2502040 90. Nan Zang and Mary Beth Rosson. 2009. Web-active Users Working with Data. In CHI ’09 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’09), 4687–4692. https://doi.org/10.1145/1520340.1520721 91. Nan Zang, Mary Beth Rosson, and Vincent Nasser. 2008. Mashups: who? what? why? In (CHI EA ’08), 3171–3176. https://doi.org/10.1145/1358628.1358826 92. Nan Zang and M.B. Rosson. 2008. What’s in a mashup? And why? Studying the perceptions of web-active end users. In IEEE Symposium on Visual Languages and 150 Human-Centric Computing, 2008. VL/HCC 2008, 31–38. https://doi.org/10.1109/VLHCC.2008.4639055 93. Nan Zang and M.B. Rosson. 2009. Playing with information: How end users think about and integrate dynamic data. In IEEE Symposium on Visual Languages and Human-Centric Computing, 2009. VL/HCC 2009, 85–92. https://doi.org/10.1109/VLHCC.2009.5295293 94. Haotian Zhou and Ayelet Fishbach. 2016. The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. Journal of Personality and Social Psychology 111, 4: 493–504. https://doi.org/10.1037/pspa0000056 95. John Zimmerman, Kathryn Rivard, Ian Hargraves, Anthony Tomasic, and Ken Mohnkern. 2009. User-created forms as an effective method of human-agent communication. In (CHI ’09), 1869–1878. https://doi.org/10.1145/1518701.1518988 96. John Zimmerman, Anthony Tomasic, Isaac Simmons, Ian Hargraves, Ken Mohnkern, Jason Cornwell, and Robert Martin McGuire. 2007. Vio: A Mixed- initiative Approach to Learning and Automating Procedural Update Tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’07), 1445–1454. https://doi.org/10.1145/1240624.1240843 97. 2013. Yahoo! Pipes. Retrieved September 12, 2013 from http://pipes.yahoo.com/pipes/ 98. 2013. Greasemonkey. Retrieved September 12, 2013 from https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/ 99. 2014. 
Quartz Composer. Wikipedia, the free encyclopedia. Retrieved October 27, 2014 from http://en.wikipedia.org/w/index.php?title=Quartz_Composer&oldid=598066763 100. Greasemonkey. Retrieved September 12, 2013 from https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/