ABSTRACT

Title of dissertation: REPRESENTING, VISUALIZING, AND MODELING ONLINE AUCTION DATA

Valerie Hyde, Doctor of Philosophy, 2007

Dissertation directed by: Professor Galit Shmueli, Department of Decision and Information Technologies, and Professor Wolfgang Jank, Department of Decision and Information Technologies

The wide and growing popularity of online auctions creates enormous amounts of publicly available bid data, providing an important topic for research. These data pose unique statistical challenges because of their special structures. This research focuses on methods for representing, visualizing, and modeling such data.

We find semi-continuous data manifested in auction consumer surplus data. Semi-continuous data arise in many applications where naturally continuous data become contaminated by the data generating mechanism. The resulting data contain several values that are "too-frequent", a hybrid between discrete and continuous data. The main problem is that standard statistical methods, which are geared towards continuous or discrete data, cannot be applied adequately to semi-continuous data. We propose a new set of two transformations for semi-continuous data that "iron out" the too-frequent values into completely continuous data. We show that the transformed data maintain the properties of the original data but are suitable for standard analysis.

We are also interested in the effect of concurrency not only on the final price of an auction but also on the relationship between the current bid levels and high bids in simultaneous ongoing auctions. We suggest several ways to visually represent the concurrent nature of online auction prices. These include "rug plots" for displaying the price evolution and price dynamics in concurrent auctions, time-grouped box plots, and moving statistics plots. We find price trends and relationships between prices in concurrent auctions and raise new research questions.

Finally, bids during an online auction arrive at unequally-spaced discrete time points. Our goal is to capture the entire continuous price-evolution function by representing it as a functional object. Various nonparametric smoothing methods exist to estimate the functional object from the observed discrete bid data. Previous studies use penalized polynomial and monotone smoothing splines; however, these require the determination of a large number of coefficients and often lengthy computational time. We present a family of parametric growth curves that describe the price evolution during online auctions. Our approach is parsimonious and has an appealing interpretation in the online auction context. We also provide an automated fitting algorithm that is computationally fast.

REPRESENTING, VISUALIZING, AND MODELING ONLINE AUCTION DATA

by Valerie Hyde

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2007

Advisory Committee: Professor Galit Shmueli, Chair/Advisor; Professor Wolfgang Jank, Co-Chair/Co-Advisor; Professor Benjamin Kedem; Professor Ben Shneiderman; Professor Paul Smith

© Copyright by Valerie Hyde 2007

Dedication

I would like to dedicate this dissertation to my father, my mother, and Bill.

Acknowledgments

I wish to express my deep appreciation to my advisor, Professor Galit Shmueli, for her time, sage advice, meticulous reading of all of my papers, and encouragement.
Her enthusiasm for research is contagious, and her concern for the development of her students is exceptional. It has been a wonderful experience to have a talented, sharp mentor and role model in a field with so few females. I would also like to thank my co-advisor, Professor Wolfgang Jank. His cre- ativity, advice, and prodding for me to take risks and look at the big picture greatly enhanced my research as well as this final document. Thanks are also due to the rest of my thesis committee: Professor Benjamin Kedem, Professor Ben Shneiderman, and Professor Paul Smith (alphabetically) for their time and reading of my dissertation. Professor Kedem provided me with a full university fellowship my first two years at the university which gave me the much needed time to devote to coursework and prepare for qualifying examinations. Professor Smith has made himself available to me, and all students, since the first day of classes. I especially appreciate that he conducted an independent study with me to help me gauge my interest in data mining and increase my statistical toolbox. I would also like to thank Professor Shneiderman for his comments and helping me advance this research. My statistics study group was recognized university-wide for our ability to work together as a team and to translate that teamwork into success in the class- room. Without my fellow ?geese?: Eulus Moore, Aliza Kwiat, Andie Hodge, and iii Olga Karles, I have no doubt that I would not have come this far or this fast. I appreciate their continued support long after they have left the university. I am especially grateful to the numerous people who provided me with real- world data sets. I would liketo thank Ravi Bapna from the Indian School of Business for the consumer surplus data, Claudia Perlich from the IBM T.J. Watson Research Center for the customer wallet data, Shanshan Wang for the Xbox data, and Sharad Borle for the luxury wristwatch data. I owe a special debt of thanks to my family and friends for their never-ending patience and support. Without them I never would have succeeded. iv Table of Contents List of Tables viii List of Figures ix List of Abbreviations xi 1 Introduction to Online Auctions 1 1.1 Online Auctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 eBay Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 eBay Bid Level Data Sets . . . . . . . . . . . . . . . . . . . . 5 1.3 Online Auction Literature . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Contributions of this Dissertation . . . . . . . . . . . . . . . . . . . . 10 1.4.1 Semi-Continuous Transformation . . . . . . . . . . . . . . . . 11 1.4.2 Visualizing Concurrency . . . . . . . . . . . . . . . . . . . . . 12 1.4.3 Growth Models . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.4.4 Summary of Dissertation Contributions . . . . . . . . . . . . . 15 2 Transformations for Semi-Continuous Data 16 2.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Identifying Semi-Continuous Data . . . . . . . . . . . . . . . . . . . . 24 2.3 Transforming Semi-Continuous Data . . . . . . . . . . . . . . . . . . 25 2.3.1 The Jittering Transform . . . . . . . . . . . . . . . . . . . . . 26 2.3.2 The Local Regeneration Transform . . . . . . . . . . . . . . . 28 2.3.3 Goodness of Fit and Deviation . . . . . . . . . . . . . . . . . . 30 2.4 Performance Evaluation Via Simulation . . . . . . . . . . . . . . . . . 32 2.4.1 Data Simulation . . . . . . . . . . . . . . . . . . . . 
. . . . . . 32 2.4.2 Transforming the Contaminated Data: Jittering . . . . . . . . 36 2.4.3 Transforming the Contaminated Data: Local Regeneration . . 36 2.4.4 Choosing The Transformation Parameters . . . . . . . . . . . 40 2.4.4.1 Jittering . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.4.4.2 Local Regeneration . . . . . . . . . . . . . . . . . . . 47 2.5 Transforming the Surplus Data . . . . . . . . . . . . . . . . . . . . . 49 2.6 Conclusions ................................ 56 3 Functional Data Analysis 59 3.1 What is Functional Data Analysis? . . . . . . . . . . . . . . . . . . . 59 3.2 Representing Auction Price Evolution as a Continuous Curve . . . . . 60 3.3 Representing a Functional Object Nonparametrically . . . . . . . . . 62 3.3.1 Basis Series Expansion . . . . . . . . . . . . . . . . . . . . . . 62 3.3.2 Kernel Smoothers . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3.3 Roughness Penalty . . . . . . . . . . . . . . . . . . . . . . . . 65 3.4 FDA Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 v 4 Investigating Concurrency in Online Auctions Through Visualization 69 4.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . 69 4.2 Visualizing Concurrent Auctions . . . . . . . . . . . . . . . . . . . . . 74 4.2.1 RugPlots ............................. 74 4.2.1.1 Curve Shapes . . . . . . . . . . . . . . . . . . . . . . 79 4.2.1.2 Temporal Groupings of Curve Types . . . . . . . . . 80 4.2.1.3 Other Groupings of Curves . . . . . . . . . . . . . . 82 4.2.2 Time Grouped Box Plots . . . . . . . . . . . . . . . . . . . . . 83 4.2.2.1 Comparing Medians of Adjacent Auctions . . . . . . 87 4.2.3 Moving Statistics Plots . . . . . . . . . . . . . . . . . . . . . . 88 4.2.4 Price Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . 90 4.3 Conclusions ................................ 92 5 A Family of Growth Models for Representing the Price Process in Online Auctions 94 5.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . 94 5.2 WristwatchData ............................. 97 5.3 Representing Auction Price Nonparametrically . . . . . . . . . . . . . 98 5.3.1 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . . 98 5.3.2 Monotone Splines . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4 Representing Price Evolution Parametrically . . . . . . . . . . . . . . 102 5.4.1 Exponential Growth . . . . . . . . . . . . . . . . . . . . . . . 103 5.4.1.1 Exponential Model . . . . . . . . . . . . . . . . . . . 103 5.4.1.2 Logarithmic Model . . . . . . . . . . . . . . . . . . . 105 5.4.2 Logistic Growth . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.4.2.1 Logistic Model . . . . . . . . . . . . . . . . . . . . . 106 5.4.2.2 Reflected-Logistic Model . . . . . . . . . . . . . . . . 108 5.4.3 Fitting Growth Models . . . . . . . . . . . . . . . . . . . . . . 109 5.4.3.1 Fitting Exponential Growth . . . . . . . . . . . . . . 110 5.4.3.2 Fitting Logarithmic Growth . . . . . . . . . . . . . . 110 5.4.3.3 Fitting Logistic Growth . . . . . . . . . . . . . . . . 111 5.4.3.4 Fitting Reflected-Logistic Growth . . . . . . . . . . . 111 5.5 Selecting the Best Growth Model . . . . . . . . . . . . . . . . . . . . 112 5.5.1 Model Selection Metrics . . . . . . . . . . . . . . . . . . . . . 113 5.5.2 Model Selection Procedure . . . . . . . . . . . . . . . . . . . . 114 5.6 Smoothing Method Comparison . . . . . . . . . . . . . . . . . . . . . 117 5.7 Using Growth Curves . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 123 5.7.1 RugPlots .............................124 5.7.2 Integrating Growth Model Parameters Into Analyses . . . . . 128 5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6 Future Research 132 6.1 Semi-Continuous Transformation . . . . . . . . . . . . . . . . . . . . 132 6.2 Visualizing Concurrency . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.3 GrowthModels ..............................134 vi A eBay Bid Level Data Sets 138 A.1 Luxury Wristwatch Data . . . . . . . . . . . . . . . . . . . . . . . . . 138 A.2 Palm Pilot M515 Data . . . . . . . . . . . . . . . . . . . . . . . . . . 138 A.3 XboxData.................................140 vii List of Tables 2.1 SimulatedSemi-Continuousand Transformed Data DescriptiveStatis- tics..................................... 35 2.2 Simulated Semi-Continuous and Jittering Transformed Lognormal Data GOF Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.3 Simulated Semi-Continuous and Jittering Transformed Weibull Data GOFStatistics .............................. 42 2.4 Simulated Semi-Continuous and Jittering Transformed Normal Data GOFStatistics .............................. 43 2.5 SimulatedSemi-Continuousand Local Regeneration Transformed Data GOFStatistics .............................. 50 2.6 Surplus and Transformed Data Descriptive Statistics . . . . . . . . . 54 2.7 Surplus and Transformed Data GOF Statistics . . . . . . . . . . . . . 54 5.1 Growth Models Formulas . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.2 Parametric and Nonparametric Comparison . . . . . . . . . . . . . . 119 5.3 Distribution of Chosen Smoothing Method . . . . . . . . . . . . . . . 120 5.4 Elapsed Run Time for Different Smoothing Methods . . . . . . . . . 123 A.1 Wristwatch Descriptive Statistics . . . . . . . . . . . . . . . . . . . . 139 A.2 Palm Pilot M515 Descriptive Statistics . . . . . . . . . . . . . . . . . 141 A.3 Xbox Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . 143 viii List of Figures 1.1 Bid History and Auction Attributes for Rolex Auction . . . . . . . . 3 1.2 Proxy and Live Bid Plot for Rolex Auction . . . . . . . . . . . . . . . 4 2.1 Max-Bin Histograms for Real World Semi-Continuous Data . . . . . . 18 2.2 Surplus Probability Plots . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3 Surplus Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Lognormal Max-Bin Histograms . . . . . . . . . . . . . . . . . . . . . 37 2.5 Weibull Max-Bin Histograms . . . . . . . . . . . . . . . . . . . . . . . 38 2.6 Normal Max-Bin Histograms . . . . . . . . . . . . . . . . . . . . . . . 39 2.7 Simulated Semi-Continuous and Jittering Transformed Data GOF ScreePlots................................. 45 2.8 Simulated Semi-Continuous and Jittering Transformed Data Kernel Density Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.9 SimulatedSemi-Continuousand Local Regeneration Transformed Data GOFScreePlots ............................. 51 2.10 Transformed Surplus Max-Bin Histograms and Weibull Probability Plots.................................... 52 2.11 Transformed Surplus GOF Scree Plots . . . . . . . . . . . . . . . . . 55 2.12 Transformed Surplus Residuals . . . . . . . . . . . . . . . . . . . . . 55 2.13 Transformed Surplus Probability Plots . . . . . . . . . . . . . . . . . 56 4.1 eBay Website Resulting From Palm Pilot Search . . . . . . . . . . . . 
71 4.2 Live Bids and Monotone Smoothed Price Curves for 3 Palm Pilot Auctions.................................. 75 4.3 Palm Pilot Price Rug Plots (Monotone Smooth) . . . . . . . . . . . . 77 4.4 Palm Pilot Velocity Rug Plots (Monotone Smooth) . . . . . . . . . . 78 ix 4.5 Palm Pilot Price Rug Plot (Monotone Smooth) Coded By Auction Duration.................................. 83 4.6 Palm Pilot Closing Price Time Grouped Box Plots (Width Propor- tional to Number of Auctions) . . . . . . . . . . . . . . . . . . . . . . 85 4.7 Palm Pilot Time Grouped Closing Price Box Plots and Auction Fre- quency Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.8 Palm Pilot Time Grouped Closing Price Box Plots (Notched, Over- lapping, and Zoomed-In) . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.9 Palm Pilot Closing Price Moving Mean Plot . . . . . . . . . . . . . . 90 4.10 Palm Pilot Closing Price Autocorrelation Plot . . . . . . . . . . . . . 92 5.1 Daily Bid Distribution for Wristwatch Data . . . . . . . . . . . . . . 98 5.2 Wristwatch Exponential Dynamics . . . . . . . . . . . . . . . . . . . 105 5.3 Wristwatch Logarithmic Dynamics . . . . . . . . . . . . . . . . . . . 105 5.4 Wristwatch Logistic Dynamics . . . . . . . . . . . . . . . . . . . . . . 108 5.5 Wristwatch Reflected-Logistic Dynamics . . . . . . . . . . . . . . . . 108 5.6 Wristwatch Model Selection Overweighting Y Dimension . . . . . . . 115 5.7 Wristwatch Model Selection Overweighting X Direction . . . . . . . . 115 5.8 Price Curves for Growth Models Fit to Wristwatch Auctions . . . . . 116 5.9 Growth Model Distribution for Wristwatch Data . . . . . . . . . . . . 117 5.10 Growth Models and Nonparametric Smoothing of Sample Wristwatch Auctions..................................121 5.11 Palm Pilot and Xbox Rug Plots (Growth Models), Jittered Closing Models, and Temporally Grouped Model Bar Plots . . . . . . . . . . 125 5.12 Palm Pilot Rug Plots (Growth Models) Broken Out By Duration . . 127 5.13 Xbox Rug Plots (Growth Models) Broken Out By Condition . . . . 128 x List of Abbreviations EDA Exploratory Data Analysis FDA Functional Data Analysis GCV Generalized Cross Validation LHS Left Hand Side OLS Ordinary Least Squares PCA Principal Components Analysis xi Chapter 1 Introduction to Online Auctions 1.1 Online Auctions With the advent of the Internet came online auction marketplaces, such as the popular eBay.com, which allow consumers and businesses to sell, buy, and bid on a variety of different goods. Ebay is the largest consumer-to-consumer (C2C) marketplace and touts net revenues that topped $1 billion for the first time for the first quarter of 2005 and were close to $1.4 billion (36% higher) for the first quarter of 2006. Ebay?s popularity is further reflected by the amount and growth of participation. For example, at the end of the first quarter 2006, there were 192.9 million registered users (up 31% from the previous year) and 75.4 million active users: those who had bid, bought or listed in the previous 12 months (up 25% from the previous year). On any given day, several million items, dispersed across thousands of categories, are available for sale on eBay. There were 575.4 million new listings in the first quarter of 2006 (up 33% from the first quarter in 2005). Indeed, eBay?s slogan, ?What ever it is, you can find it on eBay,? is appropriate. More information on eBay?s success can be found at http://www.findarticles.com/p/articles/mi m0EIN/is 2006 April 19/ai n16127113. 
Online auctions are different from traditional brick-and-mortar auctions in that they occur simultaneously or within close temporal proximity. Online auctions tend to be longer, ranging over several days rather than minutes. Further, online auctions are not limited by the geographic barriers of traditional auctions. Buyers and sellers may be located on different continents and still conduct business since the online auction marketplace is always open and available.

1.2 eBay Data Structure

The number of different online auction sites is growing steadily. Despite different formats and rules, there is a common data structure that can be found across most sites. This structure comprises a time series that describes the bids placed over time (the bid history) and an associated set of features that describe the auction setting, such as the seller rating, the auction duration, and the item category. We refer to these features as the auction attributes. Figure 1.1 provides a snapshot of a closed auction from eBay.com showing the auction attributes (top) and the bid history (bottom). We see that this is a 7-day auction for a Vintage Rolex Submariner Black Dial Men's Wristwatch. The seller tutleandwabbit has a feedback rating of 100 with 100% positive feedback. The closing price is $2600.00, and there are a total of 13 bids from 9 bidders. In this case, eBay has decided to alias the bidders' user names in order to protect them from fake offers. Aliasing is performed quite often for high-end merchandise.

Figure 1.1: Time series and attributes for a men's Rolex wristwatch auction. Notice that bids are arranged in descending order by bid amount. This order, however, does not reflect the arrival of the bids. Rather, it reflects the current auction high bid.

To understand the structure of bid history data, it is necessary to understand the auction rules and bidding mechanism. On eBay, the majority of auctions are second-price auctions, which means that the winner is the bidder who places the highest bid, but s/he pays the second highest price plus an increment. In our auction, Bidder 8 placed the highest bid but paid only Bidder 9's bid. eBay does not disclose the highest bid (here, by Bidder 8). Furthermore, eBay uses a so-called "proxy bidding" system where bidders place the highest value that they are willing to pay, and then eBay bids on their behalf by increasing the current price by only an increment. For further details see http://pages.ebay.com/help/buy/proxy-bidding.html. During the auction, the "current high bid" displayed is actually the second highest bid at the time plus an increment. Figure 1.2 shows this for the auction in Figure 1.1. We call the current price the "live bid." We see that Bidder 9 placed a proxy bid of $2600 at 19:35:08. Bidder 8 placed an unknown higher bid at 18:22:24 so is credited with the highest bid and wins the auction.

Figure 1.2: Proxy bids (triangles) and live bids (squares) for a men's Rolex wristwatch auction. The line connecting the live bids represents the current auction price.

1.2.1 eBay Bid Level Data Sets

We use predominantly three eBay.com bid-level data sets throughout this research. All of the data sets contain the auction ID, bids, bid dates and times, opening price, start date and time, closing price, and end date and time.
In some cases, we have additional auction attributes such as unique bidder ID and rating, seller ID and rating, and item description. Our data are bid histories of closed auctions for luxurywristwatches(Rolexand Cartier), Palm Pilot M515 personal handheld organizers, and Xbox game consoles. The luxury wristwatch data set includes a wide variety of items across a wide range of prices. We use this data set when the goal is to capture very different price behavior. The Palm Pilot and Xbox data sets are for the same or similar items. These data sets are used for illustration in cases where the focus is on a single item such as competing concurrent auctions. A full description of each data set as well as descriptive statistics is given in Appendix ??. In some cases, we break down the descriptive statistics by auction attributes since that information is later used in the analysis. 1.3 Online Auction Literature The popularity of online auctions as well as the rich data sets has encouraged many empirical research studies. Although empirical research has flourished, the great majority of work is by nonstatisticians; therefore, the statistical methods used tend to be standard ?off-the-shelf? tools such as regression models. We describe 5 some of the major contributions to the online auction literature to date. The article by Bajari and Hortascu (2004) provides an extensive overview of the literature by describing various auction attributes and documenting what has been found to date. It is a first-read for anyone interested in studying online auctions. Lucking-Reiley et al. (2005) conduct one of the first eBay empirical research studies (draft available in 1999). They examine, via regression analysis, the determi- nants of the final price of auctions where the item, in this case coins, have a known book value. They find that a seller?s rating has an effect on the final price; however, a negative rating has a more significant effect than a positive rating (i.e., a negative seller rating will decrease the final selling price much more than a positive rating will increase it). They also find that the length of an auction plays an important role in the final price: longer auctions (7- or 10-days) tend to fetch a higher closing price. Finally, they have mixed results for setting a reserve price (the minimum amount a seller is willing to part with the item for). Although a reserve price signif- icantly increases the final price of the auction, they are quick to note that a reserve, especially a secret reserve, decreases the likelihood of the auction transacting. Bajari and Hortascu (2003) investigate the determinants of bidder and seller behavior. In terms of bidders, they find that last-moment bidding, often referred to as ?sniping?, occurs quite frequently in online auctions and usually results in the winning bid. They notice that sellers tend to set opening bid values much less than the market price of the item, which is thought to encourage bidder entry. Further, reserve prices are usually limited to high-end products. 6 Roth and Ockenfels (2002) examine bidding behavior based on the rules for ending an auction. For example, eBay has a hard close where the auction has a known fixed end while Amazon auctions may have an unknown end. An Amazon auction remains open (after the specified close) as long as a bid has been placed in the previous 10 minutes. They find that the hard close gives eBay bidders an incentive to bid late, and it is the experienced bidders that tend to bid late. 
On Amazon, there is still a lot of late bidding, but this tends to be done by inexperienced bidders. Bapna et al. (2004a) examine bidder-level (rather than auction-level) data to learn about bidders? preferences, behaviors, and economic welfare. Using explana- tory variables: time of entry, time of exit, and number of bids (which is a proxy for a bidder?s time valuation), they use K-means clustering and find five distinct bidder classes. Evaluators place only one bid in the beginning or middle of the auction, perhaps their willingness to pay. Opportunists bid late in the auction and are most similar to snipers. Sip-and-dippers place a bid in the beginning of the auction (which establishes time priority) and revise their bid at the end of the auction sequence. Participators bid frequently, start bidding early, and place their final bid late, which is the profile of a typical ?gamer?. Finally, agent bidders use automatic bidding agents to place their bid for them. They are similar to participators except that the agent monitors bidding so they do not have to. Bapna et al. (2004a) find that agent bidders and then participators have the highest level of surplus while opportunists and sip-and-dippers have the highest likelihood of winning the auction. They also find that while experienced bidders are usually the ones who snipe, inexperienced 7 bidders who do not understand eBay?s proxy bidding mechanism tend to bid high early in the auction. Kauffman and Wood (2005) examine how the fee structure on eBay may mo- tivate bid shilling, the act of bidding in ones own auction. There are two types of shilling: competitive and reserve price. Competitive shilling is where a seller bids in his own auction to increase the price that other bidders need to pay to win the item. Reserve price shilling is where the seller sets a low opening price but bids his reserve in order to avoid the fee associated with setting a secret reserve price on eBay. They are able to detect when shilling has occurred (for example, if a bidder typically bids high in the same seller?s auctions and drops out early) and predict which auctions are likely to be shilled. Recently,Dellarocas and Wood (2007) examined the reporting bias that occurs in the eBay (or any two-party) feedback mechanism. They find that people are more likelyto leave positive feedback when they are satisfied than negative feedback when they are dissatisfied. This is likelydue to fear of retaliation that would blemish their own ratings. They argue that not leaving feedback carries important information and should also be reported on a trader?s profile. More recently, statisticians have become involved in online auction research. Notably, there is a special issue of Statistical Science (May 2006) devoted entirely to ?statistical challenges and opportunities in electronic commerce research.? Re- search by statisticians has focused on statistical challenges and proposes suitable probabilistic and statistical approaches. For example, Shmueli et al. (2004) model bid arrivals with a nonhomogenous Poisson process. They find that there is some 8 bidding activity at the auction start (first day), followed by a period of very little activity, culminating in a surge of bidding at the very end of the auction. There has also been a trend in studying the priceformation and price dynamics rather than just the closing price of an auction. Jank and Shmueli (2005) explore price dynamics by representing the price curve as a smooth continuous function. 
They use curve clustering to group similarly shaped price curves and compare the price dynamics of the clusters. They then look at auction attributes such as opening bid, seller rating, etc. within each cluster to gain insight into the relationship between price dynamics and auction attributes. The first cluster is thought of as experienced buyer/seller auctions. They have a moderately high opening price and low competition but still achieve a high closing price. The second cluster is formed by greedy sellers, who post a high opening price and thus attract little competition which results in a low closing price. Finally, there are Bazaar auctions, which are characterized by a low opening price, high competition, and the highest closing prices. Several recent papers in the eCommerce literature have emphasized the im- portance of the price formation process and dynamics in online auctions (Jank and Shmueli, 2006; G?oes et al., 2007). For example, the price-velocity indicates how quickly the price is changing at every point during the auction. This knowledge can be used to build powerful dynamic forecasting models for price (Wang et al., 2007a). It can also be used to visually ?mine? a database of auctions for the same or similar products (Shmueli and Jank, 2005; Hyde et al., 2006; Shmueli et al., 2006; Jank et al., 2007). 9 Hyde et al. (2004) use functional principal components analysis (PCA) to examine the variability in price over the duration of an auction. They find that price varies between auctions mostly in the middle portion of the auction and that a high opening bid leads to little variability in the middle part of the auction, most likely due to a market value for the item. Auctions vary next most in the distance (range) between opening and closing price. This list is by no means exhaustive; however, these early results motivated some of our research questions. Our contribution is described next in Section 1.4. 1.4 Contributions of this Dissertation The wide and growing popularity of online auctions creates enormous amounts of publicly available auction bid data providing an interesting and important topic for research. These data also pose unique statistical challenges because of their special structures. Although various statistical methods have been used thus far in the online auction context, they are usually used in a very basic manner, not always accounting for the special data structures that arise. For that reason, there has mostly been a neglect of the entire bid history with a concentration more on the end (final price) of the auction. This dissertation examines some of the special data structures and develops or adapts statistical methods for representing, visualizing, and modeling such data. 10 1.4.1 Semi-Continuous Transformation Buyers and sellers participate in online auctions for a variety of reasons. The success of online auctions is due in large part to the ability to bring buyers and sellers together in a manner that may not have previously existed, especially when there are geographic limitations. This is the case in particular for unique or hard- to-find items. The online marketplace can provide a good approximation of the true market value; however, due to imperfect information, there is the opportunity for sellers to make an usually large profit or bidders to get a ?steal?. eBay auctions are second-price, meaning that the winner is the bidder who places the highest bid but pays the second highest price plus a bid increment. Bapna et al. 
(2005) quantify the overall consumer surplus, the difference between the price paid for an item and the amount that would have been paid, that eBay produces. Consumer surplus is inherently nonnegative and continuous. However, the data set that Bapna et al. (2005) use is semi-continuous in that there are numerous ?too-frequent? values. This is easily seen in a histogram of the consumer surplus data. Bapna et al. (2005) remedy this problem by imposing a ceiling transform, where the surplus values are rounded to the next integer value, reasoning that surpluses within the same $1 unit carry the same or similar information. This may or may not be true, but it would be better to approximate the underlying continuous distribution that would exist if a discretization contamination mechanism, like the tendency to bid ?round? values, did not exist. Further, only the ?too-frequent? values should be transformed since they are the only values contaminated. 11 Our first contribution is a new set of two transformations for semi-continuous data that ?iron-out? the too-frequent values and produce completely continuous data. We show that the transformed data maintain the properties of the original data but are suitable for standard analysis. We also assess the performance of the transformed data. This research, discussed in Chapter 2, is also under review (Shmueli et al., 2007). 1.4.2 Visualizing Concurrency A great majority of the literature that analyzes online auction data assumes independence across auctions. This assumption is typically made for reasons of simplicity and convenience, while in reality auctions for the same item, competing items, or even related (substitute) items will influence each other especially if they take place within a close time frame. On eBay.com, an identical product is often sold in numerous simultaneous auctions. While each auction contains a replicate of the product, the resulting sales, prices, and even the number and level of bids during the ongoing auctions are clearly not independent of each other. In such situations, it is likely that there is dependence between the auctions since buyers have the option to select which of the competing items to bid on, and sellers have the option to decide when to post their item for sale using information on similar previously sold items. We approach the study of concurrency from a visual point of view and propose a series of visualizations that are suitable for the special structure of bid data. 12 Our major contribution is the rug plot which graphically displays the entire price- evolution curves or dynamics curves over calendar time. Other displays that we use are time-grouped box plots and moving statistics plots. We find price trends and relationships between prices in concurrent auctions and raise new research questions. While our main contribution is the rug plot, this research is one of the few to date that considers the dependence among online auctions. Although our analysis takes place in the online auction context, it may be used whenever concurrency of events is prevalent. This research is discussed in Chapter 4 and has been published (Hyde et al., 2006). 1.4.3 Growth Models In order to analyze an auction?s price process, it must be captured with a statistical structure or model. This is a challenging task because we start with a nonstandard time series: bids during an online auction arrive at unequally-spaced discrete time points with sometimes very sparse and/or dense bidding periods. 
Our goal is to achieve a continuous smooth representation of the auction?s price process that is estimated from the discrete bid data. Previous work treating bid histories as functional objects has focused on non- parametricsmoothing methods. Jank et al.(2007) use penalizedpolynomial smooth- ing splines due to their computational efficiency and good balance between fit and smoothness. Unfortunately, splines do not necessarily yield a monotone nondecreas- ing function, as the auction price process demands. They also suffer from extra 13 volatility at the start and end of the auction, which are of special interest, partic- ularly in forecasting scenarios. In comparison, monotone smoothing used by Hyde et al. (2006) achieves a monotone price process, but it is computationally intensive which might not be useful for large auction data sets. Both methods require es- timating a large number of coefficients and for numerous parameters to be set in advance by the user. Finally, both methods are data-driven and thus do not offer a theoretical explanatory model of the process captured. This motivated us to explore the use of meaningful parametric representations. An important contribution of this research is to introduce a parametric family of growth models that can capture a variety of possible underlying price processes. These are exponential growth, logarithmic growth, logistic growth, and reflected- logistic growth, where ?reflected? refers to a mapping of the original model over the line y = x. We also develop an automated model selection procedure that uses a specialized proximity metric that measures the distance between bids and the fitted curve in the two dimensions of price and time. The parametric approach is also computationally fast, more parsimonious, and generally more insightful. This research, discussed in Chapter 5, is also under review (Hyde et al., 2007). This class of parametric growth models can be used not only for data repre- sentation and visualization, but also for formal modeling of the auction processes. Statistical methods such as functional linear regression models, functional PCA, and curve clustering can be applied to the parametric objects in order to test hypotheses and model relationships between the price process (and its dynamics) and factors such as auction design and concurrency. 14 1.4.4 Summary of Dissertation Contributions To summarize, the contributions of this dissertation are to: 1. Propose a transformation for semi-continuous data found in online auction surplus data (Chapter 2). 2. Investigate concurrency in online auctions through visualization and the in- vention of the rug plot (Chapter 4). 3. Represent unevenly-spaced time series data as a continuous curve (function) and compare various previously used nonparametric smoothing techniques with a parametric family of growth models (Chapter 5). 15 Chapter 2 Transformations for Semi-Continuous Data 2.1 Introduction and Motivation Standard statistical methods can be divided into methods for continuous data and those for discrete data. However, there are situations in which observed data do not fall in either category. In particular, we consider data that are inherently continuous but get contaminated by inhomogeneous discretizing mechanisms. Such data lose their basic continuous structure and instead are spotted with a set of ?too-frequent? values. We call these semi-continuous data. Semi-continuous data arise in various settings. 
Reasons range from human tendencies to enter round numbers or to report values that conform to given specifications (e.g., reporting quality levels that conform to specifications), to contamination mechanisms related to data entry, processing, storage, or any other operation that introduces a discrete element into continuous data. Examples are numerous and span various applications. In accounting, for example, a method for detecting fraud in financial reports is to search for figures that are "too common", such as ending with 99, or being "too round". In quality control, data are sometimes manipulated in order to achieve or meet certain criteria. For example, Bzik (2005) describes practices of data handling in the semiconductor industry that deteriorate the performance of statistical monitoring. One of these is replacing data with "physical limits": in reporting contamination levels, negative values tend to be replaced with zeros and values above 100% are replaced with 100%. However, negative and >100% values are feasible due to measurement error, flawed calibration, etc. Another questionable practice is "rounding" actual measurements to values that are considered ideal in terms of specifications. This results in data that include multiple repetitions of one or more values. We use the term "too-frequent" to describe such values.

We encountered two particular studies where semi-continuous data are present. The first is a research project by IBM on customer wallet estimation, where the marketing experts who derived the estimates tended to report round estimates, thereby leading to data with several too-frequent values (Perlich and Rosset, 2006). A second study, which motivated this work, studies consumer surplus in the online marketplace eBay. Here, the observed surplus values had several too-frequent values, most likely due to the discrete set of bid increments that eBay uses and the tendency of users to place integer bids. The top panels in Figure 2.1 show the frequencies of the values in samples from each of these two data sets. In the eBay surplus data (left panel) the value $0 accounts for 6.35% of the values, and values such as $0.01, $0.50, $1.00, $1.01, $1.50, $2.00, $2.01, $3.00 are much more frequent than their neighboring values. In the IBM customer wallet estimates, too-frequent values are 0, 100, 200, 250, 300, and 400.

Figure 2.1: Frequencies of values for three data sets: eBay surplus data (top), IBM customer wallet estimates (middle), and contaminated normal simulated data (bottom).

We use the surplus data throughout this chapter to illustrate and motivate the proposed transformations. We therefore describe a few more details about the mechanism that generates the data. Consumer surplus, used by economists to measure consumer welfare, is the difference between the price paid and the amount that consumers are willing to pay for a good or service. In second-price auctions, such as those on the famous online marketplace eBay, the winner is the highest bidder; however, s/he pays a price equal to the second highest bid. Surplus in a second-price auction is defined (under some conditions) as the difference between the price paid and the highest bid.
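To make the second-price arithmetic concrete, here is a minimal sketch (not taken from the dissertation); the bid values and the flat $0.50 increment are illustrative placeholders, not eBay's actual increment schedule:

```python
# Illustrative only: the increment and bids below are made up.
def second_price_outcome(proxy_bids, increment=0.50):
    """Price paid and consumer surplus in a second-price (proxy-bid) auction."""
    bids = sorted(proxy_bids, reverse=True)
    highest, second = bids[0], bids[1]
    price_paid = min(second + increment, highest)  # winner never pays more than own bid
    surplus = highest - price_paid                 # nonnegative by construction
    return price_paid, surplus

print(second_price_outcome([2600.00, 2550.00, 2325.00]))  # -> (2550.5, 49.5)
```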
Bapna et al. (2005), who investigate consumer surplus in eBay, find that although surplus is inherently continuous, observed surplus data are contaminated by too-frequent values, as shown in the top left panel of Figure 2.1. Using the term "contamination" in this case is somewhat artificial because it reflects human behavior. However, the result is similar to the contamination of a physical process by machine error if we assume that humans, acting rationally, would bid their true valuation, which may or may not be a round number. For example, a person may value a product at $5.00 but bid $5.03 corresponding to a May 3 birthday, which would not be rational behavior. This type of "contamination", although quite possibly present, is difficult to identify in the data set. In this sense, the surplus data is a "contaminated" version of the continuous bidding process.

Semi-continuous data can appear as if they come from a mixture of discrete and continuous populations. Graphs such as scatter plots and frequency plots might indicate segregated areas or clouds. Such challenges arise in the surplus data described above. Figure 2.2 displays two probability plots for log(surplus+1): the first is a lognormal fit and the second is a Weibull fit. It is obvious that neither of these two models approximates the data well (other distributions fit even worse) because there are too many 0 values in the data. Furthermore, using log(surplus+1) as the response in a linear regression model yields residual plots that exhibit anomalies that suggest a mixture of populations. Figure 2.3 shows two residual plots exhibiting such behavior. These plots indicate clear violation of the assumptions of a linear regression model.

Figure 2.2: Probability plots for log(surplus+1): lognormal fit (left) and Weibull fit (right).

Figure 2.3: Residuals from a linear regression model of log(surplus+1) on log(price+1) (left) and normal probability plot of residuals (right).

The main problem with semi-continuous data is that they tend to be unfit for use with many standard statistical analysis methods. One possible solution is to separate the data into continuous and discrete parts, model each part separately, and then integrate the models. For example, in a data set that has too many zeros (zero-inflated) but otherwise positive, continuous values, we might create a classification model for zero/nonzero and then a prediction model for the positive data (a brief sketch of such a two-part model is given below). This approach has two practical limitations: first, partitioning the data leads to loss of statistical power, and second, it requires the too-frequent values to be concentrated in a limited area or in some meaningful locations. To solve the first issue one might argue for a mixture model. Although there exists a plethora of such models for mixed continuous populations or for mixed discrete populations (e.g., "zero-inflated" models, Lambert, 1992), we have not encountered models for mixtures of continuous and discrete data. Moreover, there is a major conceptual distinction between mixture and semi-continuous data: unlike mixture data, semi-continuous data are inherently generated from a single process. Therefore, treating them as a mixture of populations is artificial.
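As an illustration of the split-and-model approach described above (and whose limitations are noted there), a minimal sketch might look as follows; the column names, the choice of scikit-learn models, and the log transformations are our assumptions for illustration, not part of the original analysis:

```python
# Hypothetical sketch of the split-and-model ("two-part") approach: a classifier
# for zero vs. positive surplus, then a regression fit only to the positive values.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

def fit_two_part(df):
    x = np.log1p(df["price"].to_numpy()).reshape(-1, 1)     # log(price + 1)
    y = df["surplus"].to_numpy()
    clf = LogisticRegression().fit(x, (y > 0).astype(int))  # part 1: zero vs. positive
    pos = y > 0
    reg = LinearRegression().fit(x[pos], np.log1p(y[pos]))  # part 2: positives only
    return clf, reg

def expected_surplus(clf, reg, price):
    x = np.array([[np.log1p(price)]])
    p_pos = clf.predict_proba(x)[0, 1]        # P(surplus > 0)
    cond = np.expm1(reg.predict(x)[0])        # rough back-transform of the log model
    return p_pos * cond

df = pd.DataFrame({"price": [12.5, 30.0, 7.0, 55.0, 21.0, 9.9],
                   "surplus": [0.0, 4.2, 0.0, 11.7, 1.0, 0.0]})
clf, reg = fit_two_part(df)
print(expected_surplus(clf, reg, price=25.0))
```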
The ideal solution, of course, would be to find the source of discretization in the data generation mechanism and eliminate it or account for it. However, in many cases this is impossible, very costly, or very complicated. We therefore strive for a method that "unites" the apparent "subpopulations" so that the data can be integrated into a single model and treated with ordinary models for continuous data. Our proposed solution is a set of two transformations which yield continuous data. We distinguish between two cases: one, where the goal is to obtain data that fit a particular continuous distribution (e.g., a normal distribution, for fitting a linear regression model); and two, where there is no particular parametric distribution in mind, but the data are still required to be continuous.

The first approach, suitable when the data should fit a particular distribution, is based on binning the data in a way that is meaningful with respect to the assumed underlying distribution and then replacing the too-frequent values with randomly generated observations in that bin. We call this the local regeneration transform. The second approach, when the goal is simply to obtain continuous data, is based on jittering. As in graphical displays, where jittering is used to better see duplicate observations, our data transformation adds a random perturbation to each too-frequent value, thereby ironing out the anomalous high frequencies. We call this the jittering transform.

Jittering is used not only in graphical displays, but also in data privacy protection. It is a common method for disguising sensitive continuous data while retaining the distributional properties of the sample that are required for statistical inference. The difference between this application of jittering and our proposed jittering transform is that unlike in data privacy protection, we do not jitter each observation but rather only the too-frequent values.

Local regeneration is related to smoothing histograms and rebinning of histograms; however, there are two fundamental differences. First, the assumption about the data origin is different. In histogram smoothing and binning, unless we are dealing with a mixture model, the underlying assumption is that extreme peaks and dips in the histogram result from sampling error. Therefore, increasing the sample size should alleviate these phenomena, and in the population such peaks would not appear at all (Good and Gaskins, 1980). In contrast, in semi-continuous data the cause of the peaks is not sampling error but rather a structural distortion created by the data generating mechanism. This means that even very large samples will exhibit the same semi-continuity. The second difference is the goal: whereas histogram smoothing and binning are used mainly for density estimation (without attempting to modify the data), the purpose of local regeneration is to transform the semi-continuous data into data that fit a particular continuous parametric distribution, similar to the famous Box-Cox transformation.

Local regeneration and binning both attempt to find the best representation of the data in histogram form. The literature on optimal bin sizes focuses on minimizing some function of the difference between the histogram estimate f̂(x) and the real density f(x) (such as the mean integrated squared error, or MISE).
In practice, f(x) is assumed to be a particular density function (e.g., Scott (1979)) or else it is estimated by using an estimator that is known to have a smaller MISE (e.g., a kernel estimator). Methods range from rules of thumb to (asymptotically) theoretically optimal bin width formulas. The methods also vary widely with respect to computational complexity and performance. Simple bin-width formulas such as Sturges' rule and Doane's modifications, which are used in many software packages (e.g., Minitab and S-Plus), have been criticized as leading to oversmoothing (Wand, 1997). On the other hand, methods that have better asymptotic optimality features tend to be computationally intensive and less straightforward to implement in practice. In the semi-continuous data case, we assume that it is impossible to obtain a set of data that is not distorted. Our target function is therefore a parametric distribution.

The remainder of this chapter is organized as follows. In Section 2.2 we describe a method for identifying semi-continuous data. Section 2.3 describes the jittering and local regeneration transforms in detail. We illustrate the performance of these methods using simulated data in Section 2.4 and apply them to the online auction surplus data in Section 2.5. Further issues and future directions are discussed in Section 2.6.

2.2 Identifying Semi-Continuous Data

To identify whether a continuous data set suffers from semi-continuity, we examine each of the unique values and see whether there is one that is too frequent. One way to check this is to examine a one-way pivot table, with counts for each value in the data, for very large counts. Sorting the counts in descending order can ease the search. However, since such a table loses the ordering of the values, too-frequent values that appear in sparser areas might remain undetected. An alternative, which enhances the spotting of too-frequent values while preserving the interval scale of the data, is a visualization that is a hybrid between a bar chart and a histogram: the max-bin histogram. The max-bin histogram is essentially a histogram with bin widths equal to the smallest unit in the data. It is therefore equivalent to a bar chart with as many nonzero bars as there are unique values in the data, except that its x-axis has a continuous meaning rather than labels. On the max-bin histogram of the raw data, where frequencies are represented by bars, frequencies that are outstandingly high compared to their neighbors indicate values suspected of being too frequent for a continuous scheme.

Figure 2.1 displays three max-bin histograms. The top two are for the eBay surplus data and the IBM customer wallet estimates. As mentioned in the previous section, too-frequent values are very visible in each of these plots. The bottom panel shows a simulated data set of size 10,000, where 9,000 observations were generated from a standard normal distribution, retaining 4 decimal digits, and the remaining 1,000 observations were obtained by duplicating 25 of the generated values 40 times. The 25 too-frequent values are clearly visible in the max-bin histogram.

Using plots for continuous data can also assist in identifying semi-continuous data. For instance, probability plots might display "steps" or contain other indications of mixed populations. Scatter plots of the suspected variable vs. other variables of interest can also reveal multiple clouds of data.
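A minimal sketch of how such a max-bin scan might be implemented is given below; the smallest unit, the neighborhood width, and the flagging rule (a count more than three times the average of its neighbors) are illustrative choices of ours, not prescriptions from the text:

```python
# Sketch: max-bin histogram (one bin per smallest data unit) and a simple
# scan for counts that dwarf their neighbors.
import numpy as np
import pandas as pd

def max_bin_counts(x, unit=0.01):
    """Frequency of each value, expressed in multiples of the smallest unit."""
    vals = np.round(np.asarray(x) / unit).astype(int)
    counts = pd.Series(vals).value_counts().sort_index()
    full_index = np.arange(counts.index.min(), counts.index.max() + 1)
    return counts.reindex(full_index, fill_value=0)  # include empty bins

def flag_too_frequent(counts, window=5, factor=3.0):
    """Values whose frequency is far above the average of nearby bins."""
    neighbor_mean = (counts.rolling(2 * window + 1, center=True, min_periods=1).sum()
                     - counts) / (2 * window)
    return counts[(counts > factor * neighbor_mean) & (counts > 10)].index

# Example with injected too-frequent values:
rng = np.random.default_rng(0)
data = np.round(rng.lognormal(0, 2, 3000), 2)
data[:500] = rng.choice([0.0, 1.0, 2.0, 5.0], size=500)
counts = max_bin_counts(data, unit=0.01)
print(flag_too_frequent(counts) * 0.01)   # flagged values on the original scale
```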
However, depending on the locations and prevalence of the too-frequent values, standard plots might not enhance the detection of such values (e.g., if the too-frequent values are in high-frequency areas of the distribution). The max-bin histogram is therefore a powerful and unique display that is ideal for this purpose.

2.3 Transforming Semi-Continuous Data

Given a set of semi-continuous data, the influence of "bouts" of high-frequency values can be "ironed out" (a term coined by Good and Gaskins (1980)) in one of the following ways, depending on the goal of the analysis. The first approach is to jitter each of the too-frequent values, which means that we add a random perturbation to each such observation. The second approach defines local neighborhoods by binning the data and then replaces the too-frequent values with randomly generated observations within their respective neighborhoods. The two methods are similar in several respects. First, they both assume that the too-frequent values are distortions of other, nearby values. Second, both methods define a local neighborhood for the too-frequent values, and the transformed data are actually randomly generated observations from within this neighborhood. Third, in both cases only the too-frequent values are replaced while the other observations remain in their original form. And finally, in both cases the definition of a local neighborhood must be determined. We describe each of the two methods in detail next.

2.3.1 The Jittering Transform

Although many statistical methods assume a parametric distribution of the data, there are many nonparametric methods that only assume a continuous nature. This is also true of various graphical displays, which are suitable for many data structures, as long as they are continuous (e.g., scatterplots). After identifying the too-frequent values, the transform operates by perturbing each of them by adding random noise. If we denote the ith original observation by X_i and the transformed observation by X̃_i, then the jittering transformation is

\tilde{X}_i = \begin{cases} X_i + \varepsilon & \text{if } X_i \text{ is a too-frequent value} \\ X_i & \text{otherwise,} \end{cases} \qquad (2.1)

where ε is a random variable with mean 0 and standard deviation σ_ε. The choice of distribution for ε depends on the underlying process that the too-frequent values are most likely a result of. If there is no reason to believe that the too-frequent values are a result of an asymmetric process, then a symmetric distribution (e.g., normal or uniform) is adequate. If there is some information on a directional distortion that leads to the too-frequent values, then that should be considered in the choice of the perturbing distribution. In general, the choice of distribution is similar to that in kernel estimation, where the choice of kernel is based on domain knowledge, trial-and-error, and robustness.

The second choice is the value of σ_ε, which should depend on the scale of the data and its practical meaning within the particular application. Specifically, domain knowledge should guide the maximal distance from a too-frequent value that can still be considered reasonable. For example, if the data are measurements of product length in inches, where the target value is 10 inches and a deviation of more than an inch is considered "large", then σ_ε should be chosen such that the jittering will be less than one inch. This, in turn, defines σ_max, the upper bound on σ_ε. A series of increasing values of σ_ε is then chosen in the range (0, σ_max], and the performance of the transformation is evaluated at each of these values in order to determine the adequate level.
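For concreteness, here is a minimal sketch of the jittering transform in Equation (2.1), assuming a uniform perturbation on (−delta, +delta), one of the symmetric choices mentioned above; the variable names are ours:

```python
import numpy as np

def jitter_transform(x, too_frequent, delta, rng=None):
    """Equation 2.1: perturb only the too-frequent values; leave the rest untouched."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float).copy()
    mask = np.isin(x, too_frequent)       # exact repeats of the flagged values
    # Symmetric perturbation; an asymmetric distribution could be used instead
    # if a directional distortion is suspected.
    x[mask] += rng.uniform(-delta, delta, mask.sum())
    return x

surplus = np.array([0.0, 0.37, 1.0, 1.0, 2.64, 5.0, 0.0])
surplus_jittered = jitter_transform(surplus, too_frequent=[0.0, 1.0, 5.0], delta=0.10)
```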
Let 2δ be the width of the jittering neighborhood (symmetric around the too-frequent value). The mapping between δ and σ_ε depends on the jittering distribution. If the perturbation distribution for ε is normal, then σ_ε should be chosen such that 3σ_ε = δ, since almost 100% of the data should fall within 3 standard deviations of the mean. If the underlying distribution is Uniform(a, b), then 2δ = b − a and therefore √3 σ_ε = δ.

Since jittering is performed in a way that is supposed to (inversely) mimic the data mechanism that generated the too-frequent observations, the transformed data should be similar to the unobserved, continuous data in the sense that they are both samples from the same distribution. In fact, since we only transform the few too-frequent values, most of the observations in the original and transformed data are identical.

2.3.2 The Local Regeneration Transform

Many standard statistical methods assume a particular parametric distribution that underlies the data. In that case the jittering transform can still be used, but it might require additional tweaking (in the choice of the perturbation distribution and of σ_ε) in order to achieve a satisfactory parametric fit. The reason is that the jittering operation is anchored around the too-frequent values, and the locations of these values are independent of the unobservable, underlying continuous distribution.

A more direct approach is therefore to anchor the transformation not to the too-frequent values, but rather to anchors that depend on the parametric distribution of interest. We achieve this by first binning the observed data into bins that correspond to percentiles of the distribution. In particular, we create k bins that have upper bounds at the 100/k, 2(100/k), ..., k(100/k) percentiles of the distribution. For example, for k = 10 we create 10 bins, each as wide as the corresponding spacing between two consecutive deciles of the distribution. Each bin now defines a local neighborhood in the sense that values within a bin have similar densities. This will create narrow bins in areas of the distribution that are high density and much wider bins in sparse areas such as tails. The varying neighborhood width differentiates this transformation from the jittering transform, which has constant-sized neighborhoods. This means that the main difference between the two transforms will be in areas of low density that include too-frequent values.

To estimate the percentiles of the distribution we use the original data. If the continuous distribution belongs to the exponential family, we can use the sufficient statistics computed from the original data for estimating the distribution parameters and then compute percentiles. Because too-frequent values are assumed to be close to their unobservable counterparts, summary statistics based on the observed data should be sufficiently accurate for initial estimation. In the second step, each too-frequent value is replaced with a randomly generated observation within its bin. As in the jittering transform, the choice of the generating distribution is guided by domain knowledge about the nature of the mechanism that generates the too-frequent values (symmetry, reasonable distance of a too-frequent value from the unobservable value, etc.). The local regeneration transform for the ith observation can be written as

\tilde{X}_i = \begin{cases} l_j + \varepsilon & \text{if } X_i \text{ is a too-frequent value} \\ X_i & \text{otherwise,} \end{cases} \qquad (2.2)

where l_j is the jth percentile closest to X_i from below (i.e., the lower bound of the bin), j = 1, 2, ..., k, and ε is a random variable with support [l_j, l_{j+1}).
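A minimal sketch of the local regeneration transform in Equation (2.2), assuming a lognormal target distribution estimated from the observed data and uniform regeneration within each bin; both choices, and the variable names, are ours for illustration:

```python
import numpy as np
from scipy import stats

def local_regeneration(x, too_frequent, k=10, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float).copy()

    # Step 1: estimate the assumed distribution from the observed (contaminated)
    # data and compute bin edges at the 100/k, 2(100/k), ..., 100th percentiles.
    shape, loc, scale = stats.lognorm.fit(x[x > 0], floc=0)
    edges = stats.lognorm.ppf(np.linspace(0, 1, k + 1), shape, loc=loc, scale=scale)
    edges[0] = min(edges[0], x.min())
    edges[-1] = x.max()   # cap the open-ended top bin at the observed maximum

    # Step 2: replace each too-frequent value by a random draw from its own bin.
    mask = np.isin(x, too_frequent)
    bins = np.clip(np.searchsorted(edges, x[mask], side="right") - 1, 0, k - 1)
    x[mask] = rng.uniform(edges[bins], edges[bins + 1])
    return x

x = np.array([0.0, 0.37, 1.0, 1.0, 2.64, 5.0, 0.0, 12.8])
x_regen = local_regeneration(x, too_frequent=[0.0, 1.0, 5.0], k=5)
```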
The local regeneration transform for the ith observation can be written as

\tilde{X}_i = \begin{cases} l_j + \varepsilon & \text{if } X_i \text{ is a too-frequent value} \\ X_i & \text{otherwise,} \end{cases}   (2.2)

where l_j is the jth percentile closest to X_i from below (i.e., the lower bound of the bin), j = 1, 2, ..., k, and ε is a random variable with support [l_j, l_{j+1}).

The two settings that need to be determined are therefore the parametric distribution and the number of bins (k). For the distribution, one might try transforming the data according to a few popular distributions and choose the distribution that best fits the data (according to measures of goodness-of-fit). The number of bins can be set in a similar way: trying a few configurations and choosing the one that yields the best fit. In both cases, of course, practical considerations and domain knowledge must be integrated into the decision process.

2.3.3 Goodness of Fit and Deviation

Our goal is to find a level of jittering or binning that sufficiently "irons out" the discrete bursts in the data and creates a data set that is continuous and perhaps fits a parametric distribution. In the parametric fitting (using the local regeneration transform), we can measure how well the data fit prespecified distributions at each level of binning. For example, the three most popular distributions used for fitting data, and which serve as the basis for many statistical analyses, are the normal, lognormal, and Weibull distributions. We can evaluate how well data are approximated by each of these distributions by using graphical tools, such as probability plots, and goodness-of-fit measures, such as the Anderson-Darling statistic, Kolmogorov-Smirnov statistic, Chi-squared statistic, and the correlation based on a probability plot.

In the nonparametric case, where the goal is to achieve continuous data without a specified parametric distribution, we use the notion of local smoothness. Treating the max-bin histogram as an estimate of a continuous density function, we require frequencies of neighboring bins to be relatively similar. One approach is to look for the presence of an "abnormality" in the max-bin histogram by comparing each frequency to its neighbors. Shekhar et al. (2003) define an abnormal point as one that is extreme relative to its neighbors (rather than relative to the entire data set). They construct the following measure of deviation, Θ_i:

\Theta_i = p_i - E_{j \in N_i}(p_j),   (2.3)

where p_i = f_i / \sum_j f_j is the observed relative frequency at value i, N_i is the set of neighbors of p_i, and E_{j \in N_i}(p_j) is the average proportion over the neighboring values. They then assume an a priori distribution of the data or their histogram, which allows them to determine what is considered an abnormal value of Θ_i. In our context, the underlying distribution of the data is assumed to be unknown. We therefore use an ad-hoc threshold: values i that have Θ_i larger than 3 standard deviations of Θ_i are considered abnormal (assuming that deviations are approximately normal around 0).

An alternative approach is to measure smoothness of the max-bin histogram by looking at every pair of consecutive frequencies (f_i − f_{i+1}). Various distance metrics can be devised based on these pairwise deviations, such as the sum of absolute deviations or the sum of squared deviations. We use one such measure here, the "sum of absolute deviations between neighboring frequencies":

\mathrm{SADBNF} = \sum_i |f_i - f_{i+1}|.   (2.4)
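To make these two nonparametric measures concrete, the sketch below computes the Θ_i deviations and SADBNF from a vector of max-bin frequencies. The two-neighbor neighborhood and the three-standard-deviation rule follow the description above; the code itself is only an illustrative sketch.

```python
import numpy as np

def deviation_measure(freqs):
    """Theta_i = p_i minus the mean of neighboring relative frequencies (eq. 2.3),
    using the two adjacent bins as the neighborhood. Illustrative sketch only."""
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()
    # Average of left and right neighbors; endpoints use their single neighbor.
    left = np.roll(p, 1); left[0] = p[1]
    right = np.roll(p, -1); right[-1] = p[-2]
    return p - (left + right) / 2.0

def abnormal_bins(freqs):
    """Flag bins whose deviation exceeds 3 standard deviations of the deviations."""
    theta = deviation_measure(freqs)
    return np.where(np.abs(theta) > 3.0 * theta.std())[0]

def sadbnf(freqs):
    """Sum of absolute deviations between neighboring frequencies (equation 2.4)."""
    f = np.asarray(freqs, dtype=float)
    return np.abs(np.diff(f)).sum()
```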
A third approach is to fit a nonparametric curve to the max-bin histogram and to measure the smoothness of the resulting curve, or the deviation of the bar heights from the curve (e.g., Efromovich (1997)).

In addition to measuring the fit to a distribution, we also measure deviation from the original data by computing summary statistics at each level of binning or jittering and comparing them to those from the original data. We examine the mean, standard deviation, and median; however, robust statistics may be considered for certain distributions and/or different contamination mechanisms.

2.4 Performance Evaluation Via Simulation

In order to evaluate the performance of the jittering and local regeneration transformations in practice, we simulate data from known distributions and then contaminate them to resemble observable semi-continuous real-world data. In particular, we choose parameters and contaminations that mimic the surplus data. We apply the transformations to the data and choose the transformation parameters (δ in jittering and k in local regeneration) that best achieve the goal of approximating the underlying (unobserved) distribution, or at least obtaining a continuous distribution (which should coincide with the underlying generating distribution).

2.4.1 Data Simulation

We simulate "unobservable" data from three continuous distributions: lognormal, Weibull, and normal. We then contaminate each data set in a way that resembles the surplus data. The three data sets are called Contaminated Lognormal (CLNorm), Contaminated Weibull (CWeibull), and Contaminated Normal (CNorm). The steps for all three simulations are similar except that each simulation starts with a different underlying distribution, different too-frequent values, and a different contamination spread. The initial data, whether lognormal, Weibull, or normal, are thought of as the unobservable data that come about naturally. However, a mechanism contaminates the observed data, thereby introducing a few values with high frequencies. Some of the original characteristics of the data are still present, but the contamination makes it difficult to work with the observed data in their present form.

In the following we describe the details and notation for the CLNorm data. Equivalent steps are taken for the CWeibull and CNorm simulated data. We generate 3000 observations (Y_i, i = 1, ..., 3000) from a lognormal distribution with parameters μ = 0, σ = 2. We choose the too-frequent values {s_j} = {0, 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 3.00, 5.00, 10.00} and contaminate 750 of the 3000 observations by replacing values that fall within Δ = 0.10 of a too-frequent value by that frequent value (in other words, Δ is the half-width of the contamination neighborhood). The contaminated data are therefore obtained by the following operation:

X_i = \begin{cases} s_j & \text{if } Y_i \in [s_j - \Delta, s_j + \Delta] \text{ and } i = 1, \dots, 750 \\ Y_i & \text{otherwise.} \end{cases}   (2.5)

The underlying distribution for the CWeibull data is Weibull(shape = 0.5, scale = 10) with too-frequent values {s_j} = {0, 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 3.00, 5.00, 10.00} and a contamination neighborhood of Δ = 0.12. The underlying distribution for the CNorm data is N(μ = 4, σ = 1) with too-frequent values {s_j} = {2.00, 3.00, 3.50, 3.75, 4.00, 4.25, 4.50, 5.00, 6.00} and a contamination neighborhood of Δ = 0.05.
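The contamination step in (2.5) is straightforward to reproduce. The snippet below is a rough sketch of the CLNorm construction under the parameters just described; it is illustrative and not the simulation code used for the reported results.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Unobservable" lognormal sample (mu = 0, sigma = 2).
n = 3000
y = rng.lognormal(mean=0.0, sigma=2.0, size=n)

# Contaminate the first 750 observations (equation 2.5).
too_frequent = np.array([0, 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 3.00, 5.00, 10.00])
delta = 0.10

x = y.copy()
for i in range(750):
    # Distance from this observation to the nearest too-frequent value.
    j = np.argmin(np.abs(too_frequent - y[i]))
    if abs(too_frequent[j] - y[i]) <= delta:
        x[i] = too_frequent[j]   # snap to the too-frequent value

print("observations actually contaminated:", int((x != y).sum()))
```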
The top two panels of Figures 2.4, 2.5, and 2.6 show max-bin histograms of the unobservable and contaminated data. The lognormal and Weibull plots show a zoomed-in region in order to better see the areas where most of the data are located (i.e., not in the tails). The theoretical density is also drawn as a solid grey line. We see that the contaminated data have the shape of the original distributions, with peaks at the too-frequent values. For the lognormal and Weibull data, only the too-frequent values with high overall density are discernible. These are 0, 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, and 3.00 for the CLNorm data and 0, 0.25, 0.50, 0.75, 1.00, and 1.50 for the CWeibull data. For the CNorm data, the too-frequent values 2.00, 3.00, 3.50, 3.75, 4.00, 4.25, and 5.00 stand out. The values 1.80 and 5.12 also stand out, but this is the result of random noise.

Table 2.1 provides the sample mean, standard deviation, and median for each data set. The summary statistics are very close for the original and contaminated samples. This supports the use of summary statistics from the contaminated data for parameter estimation. These statistics serve as benchmarks for comparing the transformed statistics, to make sure that the chosen transformation preserves the main characteristics of the distribution.

Table 2.1: Descriptive statistics for simulated lognormal (top), Weibull (middle), and normal (bottom) data. Summaries are for the original data, the contaminated data, the jitter-transformed data (with best parameters), and the local regeneration-transformed data (with best parameters).

Lognormal(μ=0, σ=2)               Sample Mean (Standard Deviation)   Sample Median
Original                          8.15 (63.86)                       1.04
Contaminated                      8.15 (63.86)                       1.01
Jittered (Uniform, δ=0.07)        8.15 (63.86)                       1.03
Local Regeneration (k=5)          8.15 (63.86)                       1.05

Weibull(shape=0.5, scale=10)      Sample Mean (Standard Deviation)   Sample Median
Original                          19.73 (44.60)                      4.47
Contaminated                      19.73 (44.60)                      4.47
Jittered (Uniform, δ=0.02)        19.73 (44.60)                      4.47
Local Regeneration (k=16)         19.74 (44.59)                      4.47

Normal(μ=4, σ=1)                  Sample Mean (Standard Deviation)   Sample Median
Original                          3.98 (1.02)                        3.99
Contaminated                      3.98 (1.02)                        4.00
Jittered (Uniform, δ=0.07)        3.98 (1.02)                        3.98
Local Regeneration (k=20)         3.98 (1.02)                        3.99

2.4.2 Transforming the Contaminated Data: Jittering

We start by choosing a range of values for δ, which defines the neighborhood for jittering. We choose six values, δ = 0.01, 0.02, 0.05, 0.07, 0.10, and 0.12, and compare the uniform and normal perturbation distributions at each of these values. Note that we use the observed too-frequent values for the transformations in order to better mimic a realistic implementation. Max-bin histograms are produced for each δ and perturbation distribution. To be concise, only the best jitter level/distribution (as determined in Section 2.4.4) is shown for each underlying distribution in the third panel of Figures 2.4, 2.5, and 2.6. The complete set of max-bin histograms can be found at www.smith.umd.edu/ceme/statistics/papers. We see that jittering indeed irons out the too-frequent values, and the distribution of the transformed data more closely resembles that of the original data. In addition, for all three distributions the jittering transformation hardly affects the overall summary statistics, as can be seen in Table 2.1.

2.4.3 Transforming the Contaminated Data: Local Regeneration

We start by choosing the levels of binning, the regeneration distribution, and the parametric distributions of interest. For the number of bins we choose seven values, k = 4, 5, 8, 10, 16, 20, and 25, each of which divides 100 into convenient percentile increments.
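Before describing the specific fits, the binning-and-regeneration step of Section 2.3.2 can be sketched as follows. The sketch assumes a frozen scipy distribution fitted to the observed data and regenerates too-frequent values uniformly within their percentile bin; function and variable names are illustrative, not the code used for the reported results.

```python
import numpy as np
from scipy import stats

def local_regeneration(x, too_frequent, k, dist, rng=None):
    """Replace each too-frequent value by a random draw within its percentile bin
    of the fitted distribution `dist` (a frozen scipy distribution), as in (2.2).
    A minimal sketch only."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float).copy()
    # Bin edges at the 0, 100/k, 2(100/k), ..., 100 percentiles of the fitted distribution.
    edges = dist.ppf(np.linspace(0.0, 1.0, k + 1))
    # Guard against infinite outer edges (e.g., ppf(1) for unbounded distributions).
    edges[0] = min(edges[0], x.min()) if np.isfinite(edges[0]) else x.min()
    edges[-1] = max(edges[-1], x.max()) if np.isfinite(edges[-1]) else x.max()
    for i in np.where(np.isin(x, too_frequent))[0]:
        j = np.clip(np.searchsorted(edges, x[i], side="right") - 1, 0, k - 1)
        x[i] = rng.uniform(edges[j], edges[j + 1])   # regenerate within the bin
    return x

# Example usage with k = 5 bins from a lognormal fitted to contaminated data `x_cont`
# (`x_cont` and `tf_values` are assumed to come from the simulation sketch above):
# s_hat, loc_hat, scale_hat = stats.lognorm.fit(x_cont, floc=0)
# x_new = local_regeneration(x_cont, tf_values, k=5,
#                            dist=stats.lognorm(s_hat, loc_hat, scale_hat), rng=2)
```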
For the CLNorm data, we bin the observed too-frequent values according to the percentiles of the LNorm(μ̂, σ̂) distribution, where μ̂ and σ̂ are estimated from the CLNorm data, and we then "iron out" the too-frequent values within their corresponding percentile bins. The CWeibull and CNorm data are "ironed out" in the same way, using parameter estimates obtained from the contaminated data.

Max-bin histograms are produced for each k, but only the best k (as determined in Section 2.4.4) is shown for each underlying distribution in the bottom panel of Figures 2.4, 2.5, and 2.6. The complete collection of max-bin histograms can be found at www.smith.umd.edu/ceme/statistics/papers. From the max-bin histograms, we see that the local regeneration transformation yields data that closely resemble the original data. In the Weibull case, it also appears to perform better than jittering, probably because the too-frequent values are located in low-density areas of the distribution. In such cases, the local regeneration spreads the too-frequent values over a wider neighborhood (because the percentiles of the distribution are farther apart). As with the jittering transform, the summary statistics for the transformed data, for all three distributions, are extremely close to those of the original data, as shown in Table 2.1.

Figure 2.4: Histograms of the original, contaminated, jittered (uniform, δ = 0.07), and locally regenerated (k = 5) data for the simulated lognormal data. The solid grey line is the lognormal density. Note that the y-axis scale is different for the contaminated data.

Figure 2.5: Histograms of the original, contaminated, jittered (uniform, δ = 0.02), and locally regenerated (k = 16) data for the simulated Weibull data. The solid grey line is the Weibull density.

Figure 2.6: Histograms of the original, contaminated, jittered (uniform, δ = 0.07), and locally regenerated (k = 20) data for the simulated normal data. The solid grey line is the normal density. Note that the y-axis scale is different for the contaminated data.

2.4.4 Choosing The Transformation Parameters

While visual inspection of the transformation process (in particular via max-bin histograms) is valuable in determining the parameters of the transformation (δ and the perturbing distribution in jittering, and k in the local regeneration transform), quantitative metrics that assess the data fit can also be helpful, especially for purposes of automation.

Table 2.2: Goodness-of-fit statistics (and p-values) for the original, contaminated, and jittering-transformed lognormal data.
Lognormal(μ=0, σ=2)    Anderson-Darling   Cramer von Mises   Kolmogorov      SADBNF
Original               1.07 (0.25)        0.147 (0.25)       0.0152 (0.25)   1997
Contaminated           15.66 (0.00)       0.332 (0.11)       0.0353 (0.00)   2523
Uniform, δ = 0.01      7.40 (0.00)        0.300 (0.14)       0.0353 (0.00)   2224
Uniform, δ = 0.02      4.15 (0.00)        0.272 (0.17)       0.0316 (0.00)   2264
Uniform, δ = 0.05      1.52 (0.18)        0.180 (0.25)       0.0206 (0.15)   2128
Uniform, δ = 0.07      1.16 (0.25)        0.154 (0.25)       0.0171 (0.25)   2102
Uniform, δ = 0.10      1.12 (0.25)        0.160 (0.25)       0.0185 (0.25)   2131
Uniform, δ = 0.12      1.52 (0.18)        0.180 (0.25)       0.0200 (0.18)   2119
Normal, δ = 0.01       13.86 (0.00)       0.334 (0.12)       0.0353 (0.00)   2415
Normal, δ = 0.02       8.54 (0.00)        0.301 (0.14)       0.0353 (0.00)   2211
Normal, δ = 0.05       3.75 (0.01)        0.249 (0.20)       0.0273 (0.02)   2158
Normal, δ = 0.07       2.27 (0.07)        0.204 (0.25)       0.0231 (0.08)   2140
Normal, δ = 0.10       1.60 (0.15)        0.182 (0.25)       0.0199 (0.19)   2089
Normal, δ = 0.12       1.24 (0.25)        0.169 (0.25)       0.0172 (0.25)   2106

Table 2.3: Goodness-of-fit statistics (and p-values) for the original, contaminated, and jittering-transformed Weibull data.

Weibull(shape=0.5, scale=10)   Anderson-Darling   Cramer von Mises   Kolmogorov      SADBNF
Original                       1.18 (0.25)        0.172 (0.25)       0.0164 (0.25)   3119
Contaminated                   3.92 (0.01)        0.231 (0.22)       0.0344 (0.00)   3442
Uniform, δ = 0.01              1.62 (0.15)        0.193 (0.25)       0.0224 (0.10)   3259
Uniform, δ = 0.02              1.44 (0.20)        0.187 (0.25)       0.0173 (0.25)   3233
Uniform, δ = 0.05              2.21 (0.08)        0.203 (0.25)       0.0280 (0.02)   3149
Uniform, δ = 0.07              2.70 (0.04)        0.219 (0.24)       0.0300 (0.01)   3166
Uniform, δ = 0.10              3.53 (0.02)        0.258 (0.18)       0.0306 (0.01)   3165
Uniform, δ = 0.12              4.29 (0.01)        0.293 (0.14)       0.0328 (0.00)   3125
Normal, δ = 0.01               2.81 (0.04)        0.212 (0.25)       0.0274 (0.02)   3357
Normal, δ = 0.02               1.68 (0.14)        0.193 (0.25)       0.0220 (0.11)   3209
Normal, δ = 0.05               1.57 (0.16)        0.187 (0.25)       0.0230 (0.09)   3324
Normal, δ = 0.07               1.67 (0.14)        0.187 (0.25)       0.0246 (0.05)   3183
Normal, δ = 0.10               2.08 (0.09)        0.199 (0.25)       0.0280 (0.02)   3149
Normal, δ = 0.12               2.35 (0.06)        0.206 (0.25)       0.0290 (0.01)   3180

Table 2.4: Goodness-of-fit statistics (and p-values) for the original, contaminated, and jittering-transformed normal data.

Normal(μ=4, σ=1)       Anderson-Darling   Cramer von Mises   Kolmogorov      SADBNF
Original               1.50 (0.18)        0.261 (0.18)       0.0224 (0.10)   1358
Contaminated           1.52 (0.17)        0.267 (0.17)       0.0224 (0.10)   1710
Uniform, δ = 0.01      1.51 (0.18)        0.263 (0.18)       0.0224 (0.10)   1442
Uniform, δ = 0.02      1.50 (0.18)        0.261 (0.18)       0.0224 (0.10)   1470
Uniform, δ = 0.05      1.51 (0.18)        0.264 (0.18)       0.0224 (0.10)   1462
Uniform, δ = 0.07      1.47 (0.19)        0.255 (0.19)       0.0224 (0.10)   1378
Uniform, δ = 0.10      1.50 (0.18)        0.261 (0.18)       0.0224 (0.10)   1398
Uniform, δ = 0.12      1.59 (0.16)        0.281 (0.15)       0.0224 (0.10)   1422
Normal, δ = 0.01       1.52 (0.18)        0.266 (0.17)       0.0224 (0.10)   1592
Normal, δ = 0.02       1.51 (0.18)        0.264 (0.18)       0.0224 (0.10)   1408
Normal, δ = 0.03       1.50 (0.18)        0.263 (0.18)       0.0224 (0.10)   1410
Normal, δ = 0.07       1.50 (0.18)        0.263 (0.18)       0.0224 (0.10)   1406
Normal, δ = 0.10       1.49 (0.18)        0.261 (0.18)       0.0224 (0.10)   1428
Normal, δ = 0.12       1.51 (0.18)        0.262 (0.18)       0.0224 (0.10)   1420

In order to determine how well the "ironed-out" data fit an underlying distribution, we use three parametric goodness-of-fit measures: the Anderson-Darling (AD), Cramer von Mises (CvM), and Kolmogorov (K) statistics. These are computed for each combination of distribution and δ for the jittering transform, and for each k for the local regeneration transform.

We define a "good" parameter as one that captures the underlying (unobserved) distribution while altering the data as little as possible. In other words, we seek to obtain a reasonable distributional fit while minimizing the differences between the original and transformed too-frequent values.
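A rough sketch of how such statistics could be computed, and of one way to automate the "smallest acceptable parameter" criterion just described, is given below. It uses scipy's Kolmogorov-Smirnov routine as a stand-in for the exact statistics reported in Tables 2.2-2.4; the selection rule and the `transform` callable (e.g., the jittering sketch given earlier) are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def ks_statistic(x, dist):
    """KS statistic of x against a scipy distribution fitted to x (sketch only)."""
    params = dist.fit(x)
    return stats.kstest(x, dist.name, args=params).statistic

def choose_delta(x_cont, deltas, transform, dist, tol):
    """Pick the smallest jittering level delta whose KS statistic drops below `tol`.
    `transform(x, delta)` should return jittered data, for example the earlier
    jitter_transform sketch with sigma_eps = delta / sqrt(3) for a uniform noise."""
    for d in sorted(deltas):
        x_new = transform(x_cont, d)
        if ks_statistic(x_new, dist) <= tol:
            return d, x_new
    return None, x_cont
```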
2.4.4.1 Jittering

Tables 2.2, 2.3, and 2.4 provide the goodness-of-fit test statistics and corresponding p-values for the jittering transformation of the lognormal, Weibull, and normal simulated data sets, respectively. We tried two perturbation distributions (uniform and normal) and six values of δ. Note that when the goal is fitting a parametric distribution we advocate the local regeneration transform over jittering. However, we show the results of the jittering transform for purposes of comparison and also to assess the ability of the jittering transform to approximate the original generating distribution. In order to determine the best jitter level, we plot the tabular results using scree plots for the test statistics, as shown in Figure 2.7. On each scree plot, the lower horizontal line is the test statistic for the original data, and the upper horizontal line is the test statistic for the contaminated data. An optimal parameter would have a test statistic close to the lower horizontal line.

Figure 2.7: Scree plots for the Anderson-Darling (left), Cramer von Mises (middle), and Kolmogorov (right) statistics after applying the uniform jittering transform to the CLNorm (top), CWeibull (center), and CNorm (bottom) data. The upper horizontal line is the test statistic for the contaminated data, and the lower horizontal line is the test statistic for the original data.

For simplicity, we only discuss the results using a uniform perturbation (which provided better results than a normal perturbation). For the CLNorm data, the AD statistic suggests δ = 0.10, the CvM statistic suggests δ = 0.07, and the K statistic indicates δ = 0.07. Looking at the p-values in Tables 2.2, 2.3, and 2.4, when δ = 0.07 the p-value is large for all tests, suggesting that the transformed data resemble the underlying distribution. Since we want to manipulate the data as little as possible, we choose δ = 0.07. The CWeibull data yield more straightforward results: the lowest value of each of the three goodness-of-fit statistics is achieved at δ = 0.02. In contrast, the results for the CNorm data are very fuzzy, with no clear pattern in the scree plots. Indeed, according to the p-values for these test statistics, even the contaminated data resemble a N(4,1) distribution quite well! This highlights the robustness of these tests in the normal case, which in our setting is a limitation. It is obvious from the max-bin histograms that the contaminated data differ from a reasonable normal sample. We also attempted asymmetric contamination for the CNorm data, but the robustness result remained unchanged.
Finally, to determine the level of jittering using the nonparametric measures, we look at the three proposed methods for evaluating smoothness of the max-bin histogram: kernel smoothers of the max-bin histogram, deviations between pairs of neighboring frequencies (e.g., SADBNF), and searching for local abnormalities (using the Θ_i measure). The kernel smoothers are shown in Figure 2.8 for the original, contaminated, and transformed data. In the lognormal and Weibull cases the kernel smoother for the jittered data shows an improvement over the contaminated data and closely resembles that of the original data. In contrast, in the normal case we see once again the robustness to deviations from normality, and all kernel smoothers appear identical. We conclude that for nonnormal data, kernel smoothers are a useful tool for choosing the jittering parameter.

Figure 2.8: Comparing kernel density estimates for the original data (solid line) versus (dashed) the contaminated data (top), jittered data (middle), and locally regenerated data (bottom). The columns correspond to the lognormal (left), Weibull (center), and normal (right) simulated data.

With respect to measures of smoothness, we find these less helpful. Although the SADBNF measure (shown in the last column of Tables 2.2, 2.3, and 2.4) seems to lead to δ values that correspond to those indicated by the parametric goodness-of-fit tests, these drops in SADBNF are more likely an artifact of the particular samples that we generated. Instead, we notice a tendency of this and other local smoothness metrics to always favor larger values of δ. Since our goal is to find the minimal level of smoothing that is able to "iron out" the too-frequent values, one option is to introduce a penalty term into the local smoothness measure. In contrast, using the Θ_i measure with a three-standard-deviation threshold leads to the opposite conclusion: it favors no jittering at all. The reason is that low levels of jittering (with the extreme case being no jittering at all) are associated with a large standard deviation of Θ, and thus no Θ_i value exceeds the threshold.

2.4.4.2 Local Regeneration

Table 2.5 provides the goodness-of-fit statistics (and p-values) for the local regeneration transformation, and the corresponding scree plots are shown in Figure 2.9. In this case we choose seven values for the number of bins k, ranging between 4 and 25. The CLNorm data fit best when k = 5 (note that k = 8 appears worse than both
k = 5 and k = 10, but this is most likely an artifact of these particular data). For the CWeibull data, the best fit is achieved with k = 16. Finally, for the CNorm data the picture is again ambiguous, probably due to the strong robustness of the tests to deviations from normality. The transformed data with these chosen parameters are plotted using max-bin histograms (Figures 2.4-2.6, bottom panels). It can be seen that in each of the three cases the transformed data no longer exhibit too-frequent values, and they appear very similar to the original "unobservable" data.

Table 2.5: Goodness-of-fit statistics (and p-values) for the original, contaminated, and local-regeneration-transformed lognormal (top), Weibull (middle), and normal (bottom) data.

Lognormal(μ=0, σ=2)    Anderson-Darling   Cramer von Mises   Kolmogorov
Original               1.08 (0.25)        0.147 (0.25)       0.0152 (0.25)
Contaminated           15.66 (0.00)       0.332 (0.11)       0.0353 (0.00)
k=25                   4.36 (0.01)        0.286 (0.15)       0.0316 (0.00)
k=20                   3.24 (0.02)        0.223 (0.23)       0.0316 (0.00)
k=16                   2.83 (0.04)        0.234 (0.22)       0.0280 (0.02)
k=10                   1.38 (0.21)        0.117 (0.25)       0.0205 (0.16)
k=8                    1.77 (0.12)        0.292 (0.14)       0.0242 (0.06)
k=5                    0.90 (0.25)        0.101 (0.25)       0.0169 (0.25)
k=4                    2.48 (0.05)        0.375 (0.09)       0.0271 (0.02)

Weibull(shape=0.5, scale=10)   Anderson-Darling   Cramer von Mises   Kolmogorov
Original                       1.18 (0.25)        0.172 (0.25)       0.0164 (0.25)
Contaminated                   3.92 (0.01)        0.231 (0.22)       0.0344 (0.00)
k=25                           2.03 (0.09)        0.194 (0.25)       0.0224 (0.10)
k=20                           1.37 (0.22)        0.175 (0.25)       0.0172 (0.25)
k=16                           1.46 (0.25)        0.160 (0.25)       0.0164 (0.25)
k=10                           1.55 (0.17)        0.185 (0.25)       0.0216 (0.12)
k=8                            1.83 (0.12)        0.180 (0.25)       0.0236 (0.07)
k=5                            2.37 (0.06)        0.196 (0.25)       0.0266 (0.03)
k=4                            2.33 (0.06)        0.179 (0.25)       0.0246 (0.05)

Normal(μ=4, σ=1)       Anderson-Darling   Cramer von Mises   Kolmogorov
Original               1.50 (0.18)        0.261 (0.18)       0.0224 (0.10)
Contaminated           1.52 (0.17)        0.267 (0.17)       0.0224 (0.10)
k=25                   1.42 (0.20)        0.237 (0.21)       0.0224 (0.0985)
k=20                   1.35 (0.22)        0.235 (0.22)       0.0224 (0.10)
k=16                   1.45 (0.19)        0.248 (0.20)       0.0224 (0.10)
k=10                   1.34 (0.23)        0.215 (0.24)       0.0224 (0.10)
k=8                    1.41 (0.21)        0.249 (0.20)       0.0223 (0.10)
k=5                    1.60 (0.15)        0.237 (0.21)       0.0254 (0.04)
k=4                    1.40 (0.21)        0.212 (0.24)       0.0223 (0.10)

2.5 Transforming the Surplus Data

We now apply the local regeneration transform to the auction surplus data, with the goal of obtaining data from a parametric continuous distribution. Since the data are real world, the underlying distribution is unknown. The data best fit a three-parameter Weibull distribution when transformed to the log scale (log(surplus + 1)), as can be seen from the probability plots in Figure 2.2. The data do not fit any familiar distribution without the log transformation. By inspecting zoomed-in regions of the max-bin histogram (top left panel in Figure 2.1), the too-frequent values appear to be {0, 0.01, 0.50, 0.51, 1.00, 1.01, 1.49, 1.50, 1.75, 2.00, 2.01, 2.50, 3.00, 3.50, 4.00, 4.50, 5.00}. These same values also appear in a max-bin histogram of the log-transformed data (top panel in Figure 2.10). As explained above, we operate on the log-transformed data from here on.
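A sketch of the first estimation step for these data (fitting a three-parameter Weibull to log(surplus + 1)) using scipy is shown below. The surplus values are replaced by a synthetic stand-in sample, since the real data are not reproduced here; all names and numbers in the sketch are illustrative.

```python
import numpy as np
from scipy import stats

# Stand-in for the consumer surplus data (the real data set is not reproduced here).
surplus = np.random.default_rng(3).weibull(1.4, size=1000) * 5.0

log_surplus = np.log(surplus + 1.0)

# Three-parameter Weibull: scipy's weibull_min.fit returns (shape, location, scale),
# where the location parameter plays the role of the threshold.
shape_hat, thresh_hat, scale_hat = stats.weibull_min.fit(log_surplus)
print(f"shape={shape_hat:.2f}, scale={scale_hat:.2f}, threshold={thresh_hat:.2f}")

# Interior percentile boundaries of the fitted distribution, used by the local
# regeneration transform with k bins.
k = 5
edges = stats.weibull_min.ppf(np.arange(1, k) / k, shape_hat, thresh_hat, scale_hat)
```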
Figure 2.9: Scree plots for the Anderson-Darling (left), Cramer von Mises (middle), and Kolmogorov (right) statistics after applying the local regeneration transform to the CLNorm (top), CWeibull (center), and CNorm (bottom) data. The upper horizontal line is the test statistic for the contaminated data, and the lower horizontal line is the test statistic for the original data.

Figure 2.10: Histograms (left) and Weibull probability plots (right) for the original and transformed surplus data. The solid grey line on the histograms is the Weibull density.

The first step is to estimate the three parameters of a Weibull distribution from the observed (contaminated) data. These turn out to be shape = 1.403793, scale = 2.129899, and threshold = −0.10379. The local regeneration transformation is then applied using k bins from the estimated Weibull distribution, where k = 4, 5, 8, 10, 16, 20, and 25. We show the max-bin histograms of the transformed data for several of these k values in Figure 2.10. The overlaid grey line is the three-parameter Weibull density estimate. Note that the y-axis for the original data is much larger than that for the transformed data; this emphasizes the excess frequency of the zero value. With k = 20, and even k = 10, there are still too many zero and near-zero values compared to a Weibull distribution. However, k = 5 appears to "iron out" the data quite well. This can also be seen in the corresponding probability plots (right column of Figure 2.10), where a small "step" is visible in all plots except for the data transformed with k = 5. Furthermore, the probability plot for k = 5 also appears closest to a straight line, even in the right tail.

To evaluate the deviation between the observed and transformed data, we note that approximately 17% of the original data are considered too-frequent. Comparing the summary statistics before and after the transformation (Table 2.6) shows that they are remarkably similar for all k values tested. This reassures us that the transformation has not altered the main features of the distribution. Finally, examining the three goodness-of-fit statistics (Table 2.7) and the corresponding scree plots (Figure 2.11) indicates that k = 5 leads to the best fit to an underlying three-parameter Weibull distribution.

Table 2.6: Summary statistics for the original and transformed surplus data.

            Mean (Standard Deviation)   Median
surplus     19.19 (64.77)               4.49
k=25        19.20 (64.77)               4.48
k=20        19.19 (64.77)               4.40
k=16        19.20 (64.77)               4.45
k=10        19.21 (64.77)               4.49
k=8         19.21 (64.77)               4.52
k=5         19.22 (64.76)               4.49
k=4         19.27 (64.75)               4.54

Table 2.7: Goodness-of-fit statistics (and p-values) for the original and transformed surplus data (log scale + 1) fit to a Weibull distribution.
Weibull Fit    Anderson-Darling   Cramer von Mises   Kolmogorov
surplus        23.47 (0.00)       2.45 (0.00)        0.0567 (0.00)
k=25           22.41 (0.00)       2.72 (0.00)        0.0589 (0.00)
k=20           20.92 (0.00)       2.68 (0.00)        0.0591 (0.00)
k=16           18.30 (0.00)       2.41 (0.00)        0.0536 (0.00)
k=10           12.25 (0.00)       1.82 (0.00)        0.0474 (0.00)
k=8            9.78 (0.00)        1.57 (0.00)        0.0466 (0.00)
k=5            5.29 (0.00)        0.84 (0.01)        0.0352 (0.00)
k=4            6.55 (0.00)        1.09 (0.00)        0.0402 (0.00)

These indications, coupled with visual evidence from the max-bin histograms and probability plots, lead us to conclude that the (log) surplus data are best approximated by a three-parameter Weibull distribution. The transformed data can then be used in further statistical analyses. For example, Figures 2.12 and 2.13 compare the residuals from a linear regression model of surplus on price (both in log form) using the raw data (left panel) vs. the transformed data (right panel). Transforming the raw data removes some of the unwanted pattern from the residuals, thereby yielding a more appropriate model for consumer surplus in online auctions.

Figure 2.11: Scree plots for the Anderson-Darling (left), Cramer von Mises (middle), and Kolmogorov (right) statistics for the transformed surplus data, for an increasing number of bins.

Figure 2.12: Residuals from a linear regression model of log(surplus + 1) on log(price + 1) using the original data (left) vs. the transformed data (right).

Figure 2.13: Normal probability plots for regression residuals using the original surplus data (left) vs. the transformed data (right).

2.6 Conclusions

We introduce two transformations, jittering and local regeneration, for semi-continuous data, both aimed at yielding data that are continuous. One method is directly aimed at fitting a parametric continuous distribution, while the other is nonparametric and leads to continuous data of an unspecified parametric form. The idea behind both transformations is to replace too-frequent values with values randomly generated within their neighborhood. The difference between the two transformations lies in the definition of a neighborhood. While jittering defines a fixed-size neighborhood anchored around the too-frequent values, local regeneration uses percentiles of the fitted parametric distribution to define neighborhoods. In the latter case, the size of the neighborhood depends on the shape of the fitted distribution, with wider neighborhoods in tails or other low-density areas. The transformed data from the two transformations therefore differ when too-frequent values are located in low-frequency areas of the data.

The proposed transforms, and in particular the local regeneration transformation, are similar in flavor to the well-known and widely used Box-Cox transformation. In both cases the goal is to transform data into a form that can be fed into standard statistical methods. Like the Box-Cox transformation, the process of finding the best transformation level (λ in the Box-Cox case, δ in jittering, and k in local regeneration) is iterative.

To evaluate the performance of the transformations we use simulated data from three common parametric distributions, which are "contaminated".
We find that both transformations perform well, yielding transformed data that are close to the original simulated data. We use both graphical aids and test statistics to evaluate goodness-of-fit. However, for some distributions (e.g., the normal), statistical goodness-of-fit tests appear to be insensitive to the too-frequent values. The same occurs when comparing transformed data using different parameter values. We therefore advocate the use of max-bin histograms for identifying too-frequent values, for choosing parameters, and for assessing goodness-of-fit. When the goal is to achieve continuous data without a specific parametric fit, we find that kernel smoothers provide a useful visual aid in determining the required level of jittering, with the exception of the normal distribution. We also note that simple measures of smoothness of the max-bin histogram, such as SADBNF or Θ_i, tend to favor either no jittering or excessive jittering, and therefore there is room for developing measures that can better assist in determining the right level of jittering. In conclusion, we believe that further studies with real semi-continuous data will show the importance and usefulness of these transformations.

Chapter 3
Functional Data Analysis

3.1 What is Functional Data Analysis?

Functional data analysis (FDA) was popularized by the seminal works of Ramsay and Silverman (2005, 2002). At its core, functional data analysis treats curves, shapes, or objects as the observations, instead of the discrete data points of classical statistics. For example, the temperature at a weather station over a period of one day may be considered a functional observation (curve), since it arises from a continuous process: the temperature at any moment in time over that day. The curves representing daily temperature at several different weather stations would be a set of functional observations.

Functional data analysis generalizes classical statistics to the functional context. For example, data summaries include measures of central tendency, variability, skewness, etc., and traditionally summary statistics are presented in numerical form. In the functional setting, each summary statistic is itself a functional object, such as the mean function or the standard deviation function. FDA has also changed the way that we look at regression. Functional linear models are similar to ordinary linear models except that the explanatory variables are now in functional form; the response may be scalar, multivariate, or another function. See Ramsay and Silverman (2005, Chapters 10 and 11) for details and examples.

One of the main advantages of using functional data is that it often allows for the estimation of derivatives. The first derivative of temperature over one day is also a functional object, and it represents the temperature-velocity: the rate of change in temperature over time. Under normal weather conditions (i.e., no fronts or passing storms), the temperature-velocity is positive approaching 2pm (the warmest part of the day) since the temperature is increasing. After 2pm, the velocity is negative since the temperature is decreasing. Similarly, the temperature-acceleration (second derivative) measures the speed at which the temperature is changing. After the sun sets, the temperature may be decreasing at an increasing rate, which corresponds to negative acceleration (deceleration).

3.2 Representing Auction Price Evolution as a Continuous Curve

It is often difficult or impossible to measure a continuous process.
For example, the temperature at a weather station may only be recorded once an hour. Similarly, bids in online auctions are placed at varying time points. The resulting bid histories are therefore unevenly spaced time series, with some very sparse and some very dense regions. Instead of considering the discrete set of bids in an auction as a vector, we use them to estimate the complete continuous price evolution that takes place during the auction. While we could simply "connect the dots" to obtain the price of the auction at any given time, this would overfit the data (i.e., model the noise) and thereby fail to provide a good representation of the underlying continuous price process. An alternative is to represent the price as a continuous smooth curve. This type of curve representation is prevalent in FDA (Ramsay and Silverman, 2005).

The first step is therefore to represent/estimate the continuous price function from the discrete bid data. Let y_j be the observation recorded at time t_j, j = 1, ..., n, in an auction with n bids. We convert the raw data into a continuous function, f(t), that allows for the evaluation of the price at any point t during the auction. As with any measurement, there is error, so we have

y_j = f(t_j) + e(t_j),   (3.1)

where e(t_j) is considered white noise (e(t_j) ~ (0, σ²)). Different nonparametric smoothing methods exist to approximate the price function f(t) and are discussed in Section 3.3 below. Section 5.3 discusses the smoothing techniques applied to auction bid data thus far. In Chapter 5, we propose a parametric alternative that also produces continuous price curves.

The advantage of the curve representation is that it treats the price evolution as a single continuous entity. It captures the complete price evolution in a more compact and easier-to-visualize way than the raw bid data. The price process can then be described by a few coefficients. Furthermore, an appealing feature of smooth curves is that we can gauge their derivatives (the first derivative is the price-velocity and the second derivative is the price-acceleration) in order to learn how price dynamics behave during the auction.

3.3 Representing a Functional Object Nonparametrically

In this section, we discuss various smoothing techniques for estimating a functional object from discrete measurements. Ramsay and Dalzell (1991) represent finite observations in infinite-dimensional space, since functional data analysis often involves infinite-dimensional processes and/or data. For ease of notation, and to coincide with our research, we discuss estimating a functional object in a finite-dimensional space. Except where noted, the formulas contained in this section are from Ramsay and Silverman (2005); however, the discussion surrounding the formulation is our own.

3.3.1 Basis Series Expansion

Define a linear smoother

\hat{f}(t) = \sum_{j=1}^{n} S_j(t) y_j   (3.2)

that estimates the function f(t). The most popular smoothing procedure is to represent the function as a linear combination of K known independent basis functions φ_k, so that

f(t) = \sum_{k=1}^{K} c_k \phi_k(t).   (3.3)

An example of a basis function is φ_k(t) = t, which corresponds to a linear model, or φ_k(t) = log(t), which is simply a nonlinear transformation of the inputs (Hastie et al., 2001). For n observations, let Φ = {φ_k(t_j)} be a full-rank n × K matrix of basis functions evaluated at the observation points.
An exact representation may be possible when n = K, since we could then choose the coefficients c_k such that f(t_j) = y_j for each j. The amount of smoothing that takes place depends on the number K of basis functions. Ideally, we would like a small value of K for computational efficiency; however, K must be large enough that the main features of the data are captured, yet small enough to avoid fitting the noise. The weights S_j(t) are determined by choosing the coefficients c_k that minimize the least squares criterion

\mathrm{SMSSE}(y|c) = \sum_{j=1}^{n} \Big[ y_j - \sum_{k=1}^{K} c_k \phi_k(t_j) \Big]^2.   (3.4)

In matrix terms, this is

\mathrm{SMSSE}(y|c) = (y - \Phi c)^T (y - \Phi c) = \| y - \Phi c \|^2.   (3.5)

Therefore, the linear smoother applied to the vector of observations y is the matrix

S = \Phi (\Phi^T \Phi)^{-1} \Phi^T.   (3.6)

There are many different choices of basis functions. Examples are Fourier series, polynomials, regression splines, and wavelets. Fourier series are used when the underlying function is cyclical. Wavelet bases handle discontinuities and rapid changes in behavior, in direct contrast to Fourier series. Polynomials fit well in the center of the data but are often wiggly and inaccurate in the tails.

The most popular bases are regression splines, which are functions constructed by joining polynomials together smoothly at special values called knots. The researcher determines the number and placement of the knots. More knots are chosen in areas where the curve should fit the data most accurately. For online auctions, this occurs at the end of the auction, when there is the most bidding activity. Let K_1 + 1 be the number of knots. Then there are 2 exterior knots and K_1 − 1 interior knots. Between any two adjacent knots, a polynomial spline of a fixed degree, K_2, is fit in such a manner that the values of consecutive polynomials agree at each interior knot. A polynomial of degree 0 is a step function that is discontinuous at the knots; degree 1 gives a piecewise linear function; degree 2 gives a piecewise quadratic with a continuous first derivative; a degree 3 (cubic) spline has continuous first and second derivatives, so that the function appears virtually smooth; and so on. The number of basis functions depends on the number of interior knots, the number of parameters per region, and the number of constraints per knot. Numerous examples and illustrations of piecewise polynomials are provided in Hastie et al. (2001).

A basis must be carefully chosen to have features matching those of the function to be estimated. Another criterion for choosing a basis is that a certain number of derivatives (intrinsic to the particular data problem) behave well. For example, if we want to represent the price of an auction by a function, its first point should be the opening bid, its last point should be the closing bid, and the function should be monotonically nondecreasing, since the price of an auction never decreases. We also require that two derivatives, representing the price-velocity and price-acceleration, be accurate. In the online auction context, Jank et al. (2007) use polynomial smoothing splines (discussed in Section 5.3.1), and Hyde et al. (2006) use monotone smoothing splines (discussed in Section 5.3.2).
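To make the basis-expansion smoother concrete, the sketch below builds a small polynomial basis and applies the least-squares solution of equations (3.4)-(3.6) to a toy bid history. The basis choice and the data are illustrative only; this is not the spline machinery used later in this dissertation.

```python
import numpy as np

def basis_smoother(t, y, K=4):
    """Least-squares basis-expansion smoother (equations 3.4-3.6) using a simple
    polynomial basis phi_k(t) = t**k, k = 0, ..., K-1. Illustrative sketch only."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    Phi = np.vander(t, N=K, increasing=True)          # n x K basis matrix
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # coefficients minimizing SMSSE
    S = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T      # smoother matrix (3.6)
    fhat = lambda tt: np.vander(np.atleast_1d(tt), N=K, increasing=True) @ c
    return fhat, S

# Toy bid history: times (days) and bid amounts for a 7-day auction (made-up numbers).
t_bids = np.array([0.1, 1.5, 3.0, 5.2, 6.5, 6.9])
y_bids = np.array([0.99, 5.0, 22.5, 41.0, 80.0, 112.5])
fhat, S = basis_smoother(t_bids, y_bids, K=3)
print(fhat(3.5))   # estimated price at day 3.5
```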
3.3.2 Kernel Smoothers

Another option for smoothing is the use of a kernel smoother. The idea behind kernel smoothing is that the value of a function at any point t should be most influenced by the observations near t. The uniform kernel is Kern(u) = 0.5 for |u| ≤ 1. The quadratic kernel is Kern(u) = 0.75(1 − u²) for |u| ≤ 1. The Gaussian kernel is Kern(u) = (2π)^{−1/2} exp(−u²/2). The weights w_j(t) = Kern[(t_j − t)/h] are concentrated on the t_j near t, and the degree of concentration is controlled by the bandwidth parameter h. Small values of h mean that only observations close to t receive substantial weight, whereas large values of h average over a wider window. The Nadaraya-Watson kernel estimator uses the local weights

S_j(t) = \frac{\mathrm{Kern}[(t_j - t)/h]}{\sum_r \mathrm{Kern}[(t_r - t)/h]}.

3.3.3 Roughness Penalty

The above smoothers tend to fit the data too closely (and thus model the noise); therefore, a roughness penalty approach is commonly employed. This method takes into account the trade-off between a curve that fits the data well and smoothness (i.e., not overfitting the noise). For example, we would like to find a function x(t) that minimizes the residual sum of squares \sum_j [y_j - x(t_j)]^2 in order to provide a good fit to the data. However, fitting the data too closely could yield a function that is too oscillatory, or locally variable, and that also models the noise in the data. A popular measure of the wiggliness (or roughness) of a function is its integrated squared mth derivative

\mathrm{PEN}_m(x) = \int [D^m x(t)]^2 \, dt.   (3.7)

This measures the degree of departure from a straight line. A highly variable function yields a high value of PEN_m(x). The researcher should choose the highest order of derivative that she is interested in investigating and then add 2, to ensure that the highest derivative of interest is itself a smooth function. For online auctions, we are interested in up to the second derivative, the price-acceleration, so m = 4. Define the penalized residual sum of squares to be

\mathrm{PENSSE}_\lambda = \sum_j [y_j - x(t_j)]^2 + \lambda \, \mathrm{PEN}_m(x),   (3.8)

where λ is the smoothing parameter. When λ = 0, the penalty drops out and the function fits the data closely. Indeed, if there were as many knots as data points for a linear (degree 1) spline, we would have linear interpolation, which would fit the data too closely. Larger values of λ penalize the function for being curvy. As λ → ∞, the fitted curve approaches the standard linear regression line. Ramsay and Dalzell (1991) and Ramsay (1998) suggest that the smoothing parameter λ can often be chosen by inspection of the smooths or through an automated procedure such as generalized cross-validation (GCV).
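As an illustration of the penalized criterion in (3.8), the sketch below fits a monomial basis by penalized least squares, with the penalty discretized as the integrated squared second derivative on a fine grid. This is only a rough stand-in for the penalized (and monotone) smoothing splines discussed in this dissertation, and all names and defaults are illustrative.

```python
import numpy as np

def penalized_poly_fit(t, y, K=8, lam=1.0, grid_size=200):
    """Penalized least squares in the spirit of PENSSE_lambda (equation 3.8), using a
    monomial basis phi_k(t) = t**k and a penalty on the integrated squared second
    derivative, discretized on a fine grid. Illustrative sketch only."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    powers = np.arange(K)
    Phi = t[:, None] ** powers                                   # n x K basis matrix
    # Second derivatives of the basis functions evaluated on a grid over the auction.
    g = np.linspace(t.min(), t.max(), grid_size)
    d2 = powers * (powers - 1) * g[:, None] ** np.clip(powers - 2, 0, None)
    R = d2.T @ d2 * (g[1] - g[0])                                # roughness penalty matrix
    c = np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T @ y)        # penalized coefficients
    return lambda tt: (np.atleast_1d(tt)[:, None] ** powers) @ c
```

Larger values of `lam` pull the fitted curve toward a straight line, mirroring the behavior of λ described above.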
3.4 FDA Methodology

There has been a great deal of effort in developing models and methods in the functional domain, many of which generalize traditional statistical approaches. We review some of these efforts next; Ramsay and Silverman (2005) provide a more extensive list.

Functional summary statistics can be computed from functional observations rather than from discrete data points. Clustering of discrete points in n dimensions is performed using a proximity metric, such as in K-means; in the functional context, functions can be grouped by the parameters that describe each function. Jank and Shmueli (2007) use functional clustering to examine different groups of price processes in online auctions. Functional linear models are similar to ordinary linear models; however, the explanatory variables are now functional, and the response may be scalar, multivariate, or another function. Smooth curves also allow for the estimation of derivatives. Understanding a function's dynamics is useful because the dynamics are what drive the process. Differential equations also extend quite nicely to the functional context. Ramsay and Silverman (2002) model handwriting and juggling with a differential equation model. Wang et al. (2007b) show that a second-order differential equation approximates well the price dynamics that take place during an eBay auction.

FDA also has many uses in visualization and exploratory data analysis (EDA). Unfortunately, visualization has mostly been ignored in the functional literature, which mainly focuses on the derivation of mathematical models. However, there are some noteworthy exceptions. Functional summary statistics such as the mean or median function can be displayed along with the functional observations. Phase plane plots are useful in understanding the interplay of dynamics. Graphing functional principal components, the most widely used of the visualization techniques, enables the variability within a function to be studied. Functional PCA has been applied to many different problems: Canadian weather data (Ramsay and Dalzell, 1991; Ramsay and Silverman, 2005), criminal activity (Ramsay and Silverman, 2002), adolescent growth (Ramsay and Silverman, 2002), lip motion (Ramsay et al., 1996), and online auction prices (Hyde et al., 2004; Hyde, 2006), to name a few. Recently, Jank et al. (2007) explored functional data in the online auction context exclusively via visualization. While some standard functional visualization tools are implemented there, a key contribution is the cutting-edge interactive time series visualization tool Timesearcher. It allows users to view long time series and multivariate time series, to zoom in to various rectangular blocks of time, and to search for a selected pattern.

Data visualization, especially in the functional context, provides a deeper understanding of data than summary measures alone. The field of data visualization should continue to expand with increased computer capabilities. We expand the literature in Chapter 4 by presenting a novel graphical tool that displays functional data over calendar time.

Chapter 4
Investigating Concurrency in Online Auctions Through Visualization

4.1 Introduction and Motivation

Concurrency of events is omnipresent in today's commerce, both within the electronic commerce world and even more so if we consider the online and offline domains together. For example, a consumer purchasing a luxury item such as a digital camera can gather information about prices from a variety of different sources: by reading a store's sales circular, by checking the price at online retailers, by using comparison shopping websites such as bizrate.com that search across many online retailers, by reading newspaper or online sales postings for used cameras by individuals (e.g., craigslist.com), by checking online auction sites such as eBay.com, and more. The availability of items from multichannel sources gives consumers the power to compare prices and other related information (such as shipping time and cost, trustworthiness of the seller, travel time, and physical inspection). This also means that sources compete with one another and therefore, more than likely, also influence each other. There is a growing literature investigating how these different channels affect each other, their sales, their prices, etc. (e.g., Etzion et al. (2004); Gallien (2002); Vakrat and Seidmann (1999)).
In this research we focus on concurrency within a single channel, namely, online auctions on eBay.com. Online auctions differ from traditional brick-and-mortar auctions in that they occur simultaneously or within close temporal proximity. Online auctions tend to be longer, ranging over several days rather than minutes. Further, online auctions are not limited by the geographic barriers of traditional auctions. Sellers and bidders can be located in different parts of the country, or even the world, and still conduct business. The wide and growing popularity of online auctions makes the problem of concurrency an interesting and important topic for research.

There are typically numerous auctions for the same or similar products occurring simultaneously. Figure 4.1 shows a snapshot of an eBay webpage that results from a search on "Palm Pilot M515." It displays all open auctions closing within the next two days that meet the search criteria. Notice that the soonest-closing auction (in 6 hours and 55 minutes) has received only 3 bids, and the highest bid is $41. This is surprising since two other auctions closing the following day have already received more bids and also have a higher current price. One reason for this could be the lack of a picture on the display page, since pictures attract more attention. Another reason could be the less detailed product description.

Figure 4.1: List of Palm Pilot M515 auctions closing within the next two days. Note that some of the auctions are for accessories, not the actual Palm M515 product.

A great majority of the literature that analyzes online auction data assumes independence across auctions. This assumption is typically made for reasons of simplicity and convenience, while in reality auctions for the same item, competing items, or even related (substitute) items will influence each other, especially if they take place within a close time frame. On eBay.com, an identical product is often sold in numerous simultaneous auctions. While each auction contains a replicate of the product, the resulting sales, prices, and even the number and level of bids during the ongoing auctions are clearly not independent of each other. Empirical research has mostly been based on auctions for the same item (e.g., mint-condition Indian-head pennies (Lucking-Reiley et al., 2005), coins (Bajari and Hortascu, 2003), personal digital assistants (Jank and Shmueli, 2005; Shmueli and Jank, 2005; Ghani and Simmons, 2004), and rare coin auctions (Kauffman and Wood, 2005)) that closed during a certain time period (typically within a few months). In such situations it is likely that there is dependence between the auctions, since buyers have the option to select which of the competing items to bid on, and sellers can decide when to post their item for sale using information on similar previously sold items.

Auction theory is mainly concerned with a single auction, and there has not been much theoretical research on concurrent auctions, even in the offline context. One paper that does consider simultaneous auctions is that by Guerre et al. (2000), who examine the underlying distribution of a bidder's private value in sealed-bid first-price offline auctions. Prior empirical research concerned with the interplay of auctions has focused mainly on sequential classical (offline) auctions, that is, auctions that occur one after the other, but not simultaneously. Sequential auction research is concerned with price and quality of selection as one auction succeeds another.
Allen and Swisher (2000) find that auctions occurring later in an auction sequence tend to fetch a higher price than those at the beginning, probably due to the limited supply at the end of a sequential auction. Deltas (1999) examines sequential ordering across numerous auctions for cattle, where it is common that higher-value lots are sold at the beginning of an auction. Deltas (1999) hence finds that prices decline throughout an auction, but also finds a different rate of decline for auctions of different size: prices decline faster in auctions where only a small number of lots are sold.

While the insight gleaned from these analyses is useful in the traditional auction setting, it is of limited applicability to the online context. Online auctions present new and challenging questions because of the prevalence of simultaneous events. Even within this particular universe, concurrency is prevalent, and it is challenging to evaluate its magnitude and effects. There have been only a few attempts at quantifying concurrency in online auctions. Snir (2007) shows that the expected selling price in S sequential auctions is equivalent to the Sth order statistic. While that work is primarily concerned with the final price, we are also interested in the effect of concurrency on the process of bidding. Zeithammer (2006) examines sequential auctions on eBay and finds that bidders deflate their bids based on the expected future surplus. In particular, when bidders expect more auctions in the future, bids decrease. While that work does not examine price trends over time, it examines how information about the near future affects the bidding equilibrium. Specifically, it investigates whether it is better to bid high today and lose potential surplus tomorrow, or to bid cautiously today in hopes of surplus tomorrow. Finally, the recent paper by Anwar et al. (2006) finds that bidders on eBay typically place bids in multiple competing auctions.

We are interested in the effect of concurrency not only on the final price of an auction but also on the relationship between the current bid levels and high bids in simultaneous ongoing auctions. Moreover, very few visualization methods have been developed for the special data challenges arising in online auctions; exceptions are the work of Jank et al. (2007), Shmueli and Jank (2005), and Aris et al. (2005). Although we concentrate on online auctions, our proposed methods can be generalized to other applications where one is interested in studying the effect of concurrency on the final result of an event and also on the event-evolution itself. Our work adds to this line of research by specifically focusing on the aspect of concurrency and its impact in online auctions.

The goals of this research are twofold: first, to study concurrent events via statistical exploratory methods (mainly visualization); and second, to develop new research questions based on these visualizations. In Section 4.2 we propose several new visualizations and adaptations of classic graphical displays for capturing concurrency. We conclude in Section 4.3.

4.2 Visualizing Concurrent Auctions

We use the Palm Pilot M515 data set (described in detail in Appendix A.2) in this section to investigate concurrency, since all auctions are for the same exact good and they occur in close temporal proximity.

4.2.1 Rug Plots

For studying concurrency of prices in competing auctions, we sometimes consider as the observation the complete price evolution that takes place during each auction (as described in Chapter 3).
In this research, we use monotone smoothing splines (described in detail in Section 5.3.2) for their flexibility and suitability in the auction context: since the bidding process is necessarily nondecreasing, monotone smoothing splines guarantee monotone nondecreasing price curves. Our choice of knots is governed by bidding frequencies, such that more knots are placed during the "active period" at the very end of the auction and far fewer during the "quiet period" in the middle of the auction (for further details on the choice of knots and smoothing parameter see Jank and Shmueli (2005)). Figure 4.2 shows the results of monotone smoothing for three sample auctions.

[Figure 4.2: Bids and fitted price curves for three different auctions (a "j"-curve, a straight line, and an "s"-curve). Knots are placed at days (0, 1, 2, 3, 4, 5, 6, 6.25, 6.50, 6.75, 6.8125, 6.8750, 6.9375, 7), the smoothing parameter is lambda = 0.01, and the polynomial order is 5.]

Even though the bid patterns in the three auctions are very different, the same family of monotone splines (using the same knots, smoothing parameter, and polynomial order) leads to a reasonable approximating curve in each case and avoids adding an extra source of variability across curves.

Our first goal is to visualize a set of auctions by describing their entire price evolution, for the purpose of exploring similarities and patterns in the price evolutions of concurrent (or partially concurrent) auctions. One step in this direction is the Calendar of Auctions plot (Shmueli and Jank, 2005) for displaying a sample of auctions over the data collection period. Although it preserves chronological information, it does not preserve the bid history information. We propose a graphical method that expands upon the Calendar of Auctions by displaying the entire price-evolution curves or dynamics curves over calendar time. Because of its rug-like appearance we name these displays rug plots. A rug plot is a functional data analysis visualization tool that displays numerous overlapping continuous processes over calendar time. Figure 4.3 displays the price curves and Figure 4.4 displays the price-velocity curves. Since the end of an auction is of special interest, we emphasize the endpoint of each curve with a darkened dot. Viewing rug plots at multiple resolutions allows seeing an overview as well as local details. For instance, the bottom panels of Figures 4.3 and 4.4 display a subset of all curves, zoomed-in between March 29, 2004 and April 8, 2004.

The price-velocity curve, which is the first derivative of the price-evolution curve, is always non-negative (because the price curve is monotone). A zero velocity occurs when the price does not increase, representing a period of bidding inactivity. A small positive velocity means that the price increases slowly at that instant. Similarly, a high velocity corresponds to rapid price increases.

The next step is to see what types of concurrency can be gleaned from rug plots. The rug plot display maintains the temporal information of each auction (or event, in general) in order to compare global trends to more local ones. Concurrency can affect the shape of a price curve or the shape of the dynamics curve if we assume that different curve shapes capture different bidding patterns (e.g., gradual vs. bursty bidding or early vs. late bidding).
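As a concrete illustration, the following R sketch assembles a simplified rug plot: each auction's bid history is converted into a monotone nondecreasing price curve and overlaid on calendar time, with the closing point emphasized. Base R's monotone Hermite spline (splinefun with method "monoH.FC") is used here only as a convenient stand-in for the penalized monotone smoothing splines described above; the data frame bids, with columns auction_id, bid_time, bid_amount, open_time, and close_time (all times assumed to be numeric, e.g., days since the start of data collection), is hypothetical.

## Illustrative sketch of a rug plot. splinefun(..., method = "monoH.FC") is a
## simplified stand-in for penalized monotone smoothing splines; the data frame
## `bids` and its columns are hypothetical, with numeric calendar times.
rug_plot <- function(bids) {
  plot(NULL, xlim = range(c(bids$open_time, bids$close_time)),
       ylim = range(bids$bid_amount),
       xlab = "Calendar time", ylab = "Price",
       main = "Price evolution curves vs. calendar time")
  for (id in unique(bids$auction_id)) {
    b <- bids[bids$auction_id == id, ]
    b <- b[order(b$bid_time), ]
    # anchor the curve at the opening and closing prices, and use the running
    # maximum of the bids as a simple proxy for the displayed price
    t <- c(b$open_time[1], b$bid_time, b$close_time[1])
    y <- c(min(b$bid_amount), cummax(b$bid_amount), max(b$bid_amount))
    keep <- !duplicated(t)                                  # spline fitting needs unique times
    f <- splinefun(t[keep], y[keep], method = "monoH.FC")   # monotone nondecreasing curve
    tt <- seq(b$open_time[1], b$close_time[1], length.out = 100)
    lines(tt, f(tt), col = grey(0.4))
    points(b$close_time[1], f(b$close_time[1]), pch = 19)   # darkened endpoint
  }
}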
Taking an exploratory approach, we examine the price and velocity curves in the rug plots for certain prominent shapes and look for patterns of temporal proximity between types of curves.

[Figure 4.3: Price evolution over calendar time (top; 133 "j"-curves, 42 straight lines, and 61 "s"-curves) and zoomed-in to the March 29 - April 8 region (bottom). The thick curve and grey band are the daily median and IQR of the closing price.]

[Figure 4.4: Price velocity over calendar time (top) and zoomed-in to the March 29 - April 8 region (bottom). The thick curve and grey band are the daily median and IQR of the velocity.]

4.2.1.1 Curve Shapes

The empirical online auction literature describes several bidding patterns. One of these is late bidding (also called "sniping"), which results in a very high bidding frequency at the end of an auction. We therefore expect auctions with sniping to have price-velocity curves that spike towards the auction end. Bidding frequency should also manifest itself in the resulting curve: gradual bidding would lead to more gradual curves (and slow dynamics), whereas jump bidding would lead to bursty velocity curves. Relating different types of curves to bidding behavior can be useful in interpreting the rug plot results.

Examining the price curves in Figure 4.3 reveals three main shapes of price curves, corresponding to the three shapes seen in Figure 4.2. The most frequent (133 auctions) is a concave-up "j"-shaped price curve, which represents auctions with gradual price increases until mid-auction and a price jump towards the auction end, perhaps indicative of sniping. The corresponding velocity curve slowly increases and culminates in a peak. Further inspection reveals that most of the "j"-shaped auctions start at the lowest opening bid of $0.01.

A second type of price curve that we identify is the straight line, where the slope depends on the ratio of the opening and closing prices. The corresponding velocity curve is constant. These curves represent auctions with little to no bidding in the middle of the auction. Since all of our auctions transacted, a flat price curve represents auctions with a high opening bid (set by the seller) and, therefore, very little bidding. In our data, the 42 straight-line auctions vary widely in their opening and closing bids, but they all open above $150.

The third typical shape in our data (61 auctions) is a stretched-out "s"-shaped curve, which reflects auctions where the price increases slowly, then jumps up during mid-auction, and then slowly increases to the close. The corresponding velocity curve spikes at the start of the price jump and then decreases when the price slows down again, resulting in a single "hump" shape. This can be indicative of jump bidding during mid-auction, where a single bidder raises the price drastically.

4.2.1.2 Temporal Groupings of Curve Types

When searching for auctions for a particular item on eBay, users are presented with information on all open auctions for that item as well as on auctions that closed in the last 60 days.
A typical user would examine each of the open auctions (their current price, number of bids, and closing time) and perhaps some closed auctions to learn about closing prices. All of these factors can influence how this bidder will place a bid. In addition, sellers can use this information to schedule their auctions, to set their opening bids, etc. We therefore expect to see temporal patterns in bidding (and posting) behavior in concurrent auctions. These patterns can manifest themselves as similarities in bidding behavior, for example when bidders mimic a sniping behavior. Or they can show dissimilarities, for instance, if low closing prices lead new sellers to change strategy.

Our rug plots (Figures 4.3-4.4) show temporal effects such as clustering of shapes in certain periods or the lack of such clustering in others. One global observation is that the straight-line price curves appear to be scattered throughout the collection period. As mentioned earlier, many of these auctions have a high opening bid (sometimes higher than the 75th percentile of the daily closing price), perhaps because there is something special about the product, like an included accessory.

In general, during periods with many auctions, it is useful to zoom in on the x-axis to better separate curves and make their shapes more visible. For this reason we concentrate on three periods: the calendar start, the zoomed-in period of March 29 - April 8 (more visible in the bottom panels of Figures 4.3-4.4), and the calendar end.

During the beginning of the data period (before March 24), we see mostly "s"-shaped price curves and a few straight-line price curves. The main distinction between the two appears to be the opening bid, with the "s"-shaped price curves starting at lower prices. This is more visible in the velocity rug plot, where the first group of auctions has a positive velocity "bump" at mid-auction (corresponding to the "s"-shaped price curves), followed by a period with nearly flat velocity curves. This raises questions about the bidding behavior during this period. What is going on during this flat price-velocity period? Why was there no jump bidding during this period?

The second period (March 29 - April 8) contains many concurrent auctions (see the bottom panels of Figures 4.3-4.4). First, we see a global peak in velocity around April 3, as indicated by the median daily acceleration, indicating fast price increases during this period. Second, during this entire period, we see two dominant types of auctions: a few "j"-shaped price curves (more easily identified in the zoomed-in velocity rug plot as curves that end high), followed by a large group of "s"-shaped price curves (i.e., single-humped velocity curves). The price curves of both types appear to intersect mainly around $100-$150, perhaps suggesting that bidders who see later-closing auctions reaching a price in this range tend to jump bid to this amount in auctions soon to close, reasoning that the price will most likely reach (at least) this amount.

Finally, during the last period of the calendar, there appear to be more and more auctions with high opening bids. One possible explanation is that the Palm M515 has been auctioned long enough by this time that an opening bid of $50 or $100 seems reasonable to bidders and sellers. The resulting price curves are mostly "j"-shaped curves, and many of them tend to close lower than the 25th percentile of the daily price.
4.2.1.3 Other Groupings of Curves

The rug plot allows one to explore groupings not only of curve types but also of other relevant factors such as the number of bidders, the seller rating, and the auction duration. This can be performed by coding each curve according to the level or category of the variable of interest. Such an extension allows investigating hypothesized concurrency effects such as: how does a high-rated seller's auction affect the bidding progress in a concurrent low-rated seller's auction? Or, do sellers choose the auction duration according to the durations of other open auctions for a similar item?

To show how this can reveal interesting effects, consider Figure 4.5, which displays the price curve rug plot from Figure 4.3 coded by auction duration.

[Figure 4.5: Price evolution by auction duration (green = 7-day, blue = 5-day, red = 3-day, gold = 10-day).]

Besides the popularity of 7-day auctions, we see that some auction durations seem to group temporally while others do not exhibit such a pattern. There are two periods when 3-day auctions were very popular and a number of periods where 5-day auctions are prevalent. In contrast, 10-day auctions are relatively rare and do not seem to group together.

4.2.2 Time Grouped Box Plots

In order to study the relationship between closing prices of nearby auctions as well as trends over time, we group each auction's closing price into categories based on the calendar date and ending time of the auction. Using multi-scale box plots for visualizing closing prices in a series of online auctions was proposed by Shmueli and Jank (2005). In particular, they propose an interactive computation of box plots according to temporal scales of interest ("STAT-zoom"). To incorporate the volume of auctions within each temporal window, the widths of the box plots are made proportional to the sample size, or alternatively, the box plots are coupled with histograms.

To study the relationship between selling prices of concurrent or nearly concurrent auctions we apply the same method. The levels of grouping we examine are weekly, daily, and half-daily. Figure 4.6 displays the time-grouped box plots. The width of each box is proportional to the square root of the number of auctions (McGill et al., 1978) to show the volume of auctions closing in a particular period. We see that the median of weekly closing prices is relatively constant over time, whereas median prices are highly variable on the daily and half-daily scales. This is comparable to the stock market, where prices are highly volatile within a day but less so when an entire week is considered. The within-day price fluctuations could be attributed to the different time zones that bidders live in or to day/night differences in bidding activity.

While the width of the box plot is visible in the weekly displays, it is not easily discernible at the daily and half-daily scales; therefore, we enhance the box plots with histograms of auction closings. Figure 4.7 displays box plots that are coupled with histograms containing the number of observations in each group. For the daily displays (top of Figure 4.7), the number of auctions closing on each day is between 1 and 10.
In some cases where there are many auctions, the median closing price is small, as would be expected by economic theory: as supply increases, price decreases. However, this is not always the case, due to factors such as seller rating, shipping costs, experience of the bidders, etc., that may also influence price. For instance, quite a few auctions end around April 3 and result in a high median price, much higher than the median price on the days before or after, where the auction volume is comparable.

[Figure 4.6: Grouped box plots of closing price (weekly, daily, and half-daily groups) with box width proportional to the square root of the number of auctions in the group.]

The histograms provided in the half-daily displays (bottom of Figure 4.7) show that most auctions close during only one half of the day, probably due to the fact that people sleep during the other half! However, there is also a significant number of auctions that close at night (e.g., around April 27), which may be attributed to time differences or simply to unfortunate and sloppy seller choices.

[Figure 4.7: Closing price box plots and auction closing frequency for day (top) and half-day (bottom) groups.]

4.2.2.1 Comparing Medians of Adjacent Auctions

To better differentiate between adjacent medians, we use the test by Chambers et al. (1983) that compares medians of box plots. The boxplot function in R performs this test with the option "notch". The notches extend to ±1.58 IQR/√n, where IQR is the interquartile range and n is the group size. If the notches overlap, then the medians are not significantly different with 95% confidence. This test assumes asymptotic normality of the median and roughly equal sample sizes. Both assumptions seem reasonable for the auction data.

There are a total of 41 daily box plots between March 22 and April 30 in Figure 4.7. In order to declutter the graphical display, we partition the 41 plots into three overlapping groups (March 21 - April 10, April 1 - April 20, and April 11 - April 30) and enhance them with the notches described above. The result can be seen in Figure 4.8. For instance, we see that the notches for March 26 and March 27 overlap; the median prices for these two days are not significantly different. In contrast, the median prices of March 25 and March 26 are significantly different. In general, while the median daily closing prices are variable, the differences are not statistically significant in most instances. There also does not appear to be any day-of-the-week effect: the median daily closing price for any particular day of the week is not consistently significantly different from that of the other days of the week.
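A minimal sketch of how such a display could be produced in R is shown below. The varwidth argument of boxplot() makes box widths proportional to the square roots of the group sizes, and notch adds the Chambers-style notches used for the median comparison. The data frame auctions, with columns close_time (POSIXct) and close_price, is hypothetical.

## Illustrative sketch: notched, variable-width box plots of closing price by
## closing day and half-day; the data frame `auctions` is hypothetical.
auctions$day <- as.Date(auctions$close_time)
boxplot(close_price ~ day, data = auctions,
        varwidth = TRUE,   # widths proportional to the square root of group size
        notch    = TRUE,   # notches extend to the median +/- 1.58 * IQR / sqrt(n)
        xlab = "Closing day", ylab = "Closing price",
        main = "Closing price grouped by day")

## Half-day grouping: split each day into AM and PM closings
auctions$halfday <- paste(auctions$day,
                          ifelse(format(auctions$close_time, "%H") < "12", "AM", "PM"))
boxplot(close_price ~ halfday, data = auctions, varwidth = TRUE, notch = TRUE,
        xlab = "Closing half-day", ylab = "Closing price")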
Figure 4.8 also shows the notched box plots for the half-daily displays. In this display, we find more statistically significant differences between adjacent medians. This suggests that the time of day can have an advantageous effect for the bidder or the seller. This finding also coincides with Snir (2007), who finds lower bids now (i.e., at night) if more auctions are expected in the future (i.e., during the day).

[Figure 4.8: Closing price box plots zoomed at different overlapping periods of calendar time for day groups (top three panels) and half-day groups (bottom three panels). Overlapping notches suggest that adjacent medians are not statistically different.]

4.2.3 Moving Statistics Plots

Another way of visualizing time trends in prices (or any other measure of interest) is by applying moving statistics plots, that is, by graphing the moving average or median. Moving statistics plots are a form of smoothing the data: the particular statistic is computed over subsets of the data. We use the observation index for the purpose of grouping (the earliest-closing auction is denoted "1", the next-closing auction is denoted "2", etc.). For example, using a moving window of 3 auctions for our 236 auctions yields a series of 234 statistics.

Note that this method does not take into account the temporal distance between auctions. This means that moving statistics do not differentiate between auctions that are close in time and those that are far apart. One possibility is to use weights that reflect these temporal distances.

A moving statistics plot helps in understanding time trends in the data, and unlike side-by-side box plots, it presents a smoother transition over time. Figure 4.9 shows moving average plots for time windows of 3, 5, and 10 auctions. The dark solid line shows the moving average, and the lighter lines correspond to upper and lower 95% confidence bands. The straight line through the center of the data shows the overall average closing price.

[Figure 4.9: Closing price moving average based on windows of 3, 5, and 10 auctions.]

The increased variability in average price for the 3- and 5-auction time windows compared to the relatively constant 10-auction window suggests, again, that auctions are similar to the stock market: while there exists some variability in the short run, prices tend to converge to the market value in the long run. This also suggests that the smart (or lucky) bidder can take advantage of this variability to obtain a price that is lower than the market value.
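The moving averages themselves are straightforward to compute; the sketch below uses base R's stats::filter() on the closing prices ordered by closing time. The vector close_price is hypothetical, and the confidence bands shown in Figure 4.9 are omitted here.

## Illustrative sketch: moving average of closing prices over a sliding window
## of auctions ordered by closing time; `close_price` is hypothetical.
moving_average <- function(x, w) {
  m <- stats::filter(x, rep(1 / w, w), sides = 2)   # centered moving average
  as.numeric(m[!is.na(m)])                          # drop the w - 1 edge values
}

ma3  <- moving_average(close_price, 3)   # 236 auctions yield 234 values
ma5  <- moving_average(close_price, 5)
ma10 <- moving_average(close_price, 10)

plot(ma3, type = "l", xlab = "Auction index", ylab = "Average closing price",
     main = "Moving average of closing price (3-auction window)")
abline(h = mean(close_price), lty = 2)   # overall average closing price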
4.2.4 Price Autocorrelation

We now examine the correlation between adjacent auctions. If the closing price in an auction affects the closing price in nearby auctions, we expect to find some degree of autocorrelation across the series of prices, especially if the groups are small enough. One is less likely to find autocorrelation in weekly prices because of the large price variability among auctions within every week and also, possibly, because the most popular auction duration is one week. Narrowing the window to a day or half-day may provide a better indication of autocorrelation. Ideally, we would like to look at hourly autocorrelation; unfortunately, this data set is too sparse to do so. Since most auctions do not close in the middle of the night, it would be hard to find a data set rich enough for this type of investigation. Therefore, a method for dealing with this type of missing values is necessary.

Even in the half-daily series we encounter missing data. Our solution is to impute missing values with the average of the prices from one time period before and one time period after the missing value. This "nearest neighbor" approach can be extended to multiple neighbors if it is reasonable to assume that the price is stable within a wider window of auctions. Another possibility is to impute the missing value using only information prior to the missing value, to reflect the fact that bidders do not have information about future prices.

Figure 4.10 shows price autocorrelation based on the series of moving averages and moving medians for several window widths.

[Figure 4.10: Autocorrelation of grouped closing prices for week (left), day (middle), and half-day (right) groups, for the mean (top) and median (bottom).]

It appears that there is no autocorrelation in the means or medians at any window. We suspect that the groups are too large and do not differentiate well between auctions that closed hours apart and those that closed minutes apart. Perhaps a richer data set, or a moving statistic that accounts for the temporal distances between auctions, would reveal more. The general question is how to represent time lags in unequally spaced observations (e.g., Jank and Shmueli (2006)).
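A sketch of the nearest-neighbor imputation and the resulting autocorrelation plot is given below. The vector halfday_mean, holding the mean closing price per half-day period with NA where no auction closed, is hypothetical.

## Illustrative sketch: impute missing half-day mean prices with the average of
## the adjacent periods, then examine the autocorrelation. `halfday_mean` is
## hypothetical (NA marks half-days in which no auction closed).
impute_neighbors <- function(x) {
  for (i in which(is.na(x))) {
    before <- if (i > 1) x[i - 1] else NA
    after  <- if (i < length(x)) x[i + 1] else NA
    x[i] <- mean(c(before, after), na.rm = TRUE)   # average of the two neighbors
  }
  x
}

filled <- impute_neighbors(halfday_mean)
acf(filled, main = "ACF of half-day mean closing price")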
4.3 Conclusions

Concurrency is prevalent in the world of commerce. Whether it is stocks, sales promotions of competing products, or mortgage rates offered by lenders, action often results in reaction. Concurrency is particularly prevalent in the online environment, since the openness of the Internet allows players to observe each other's moves in real time. In this work, we consider concurrency of online auctions on eBay. We approach the study of concurrency from a visual point of view and propose a series of visualizations that are suitable for the special structure of bid data.

Considering the entire price evolution during an auction and its dynamics, we find several patterns and raise new research questions. We also find that online auctions tend to resemble the stock market in that most auctions tend to close around their market value but still contain enough variability to allow bidders to purchase products significantly below market value. While some of this variability could be attributed to other factors, such as seller rating and the winner's experience, it creates opportunities for an eBay bidder. We also find a difference in median closing prices at different times of the day. A bidder, for instance, may be able to obtain a lower price by taking part in auctions that close later in the evening, when fewer people are participating. These observations are all based on the exploratory analysis performed here and can serve as a basis for further studies of concurrency in online auctions.

Chapter 5
A Family of Growth Models for Representing the Price Process in Online Auctions

5.1 Introduction and Motivation

The wide and growing popularity of online auctions creates enormous amounts of publicly available auction bid data, providing an interesting and important topic for research. These data pose special statistical challenges because of their special structures. This research focuses on uncovering the entire continuous price process during an auction.

Bids during an online auction arrive at unevenly-spaced discrete time points chosen by bidders; however, we are interested in obtaining a continuous approximation of the underlying price evolution: the price at any time during the auction. From a visualization point of view, a series of discrete unevenly-spaced bids (e.g., presented as a scatterplot of bid vs. time) loses the conceptual as well as the continuous nature of the price process. Furthermore, such plots do not scale to multiple auctions. A better representation is a continuous function with only a few parameters. Such a representation is conceptually more appealing, is more parsimonious, and can be further used for analyses such as clustering or regression models. Finally, the underlying continuous process may depend on derivatives of the function. A smooth price curve allows the calculation of derivatives (the first derivative is price-velocity and the second derivative is price-acceleration).

In FDA, nonparametric smoothing techniques are used to estimate a functional object from discrete measurements (as described in Chapter 3). Examples are kernel smoothers, polynomial splines, and monotone splines (Ramsay and Silverman, 2005; Ramsay, 1998). Smoothing should be done in such a way that the resulting object adequately represents the continuous process. In the auction setting, we know that the price is always positive and monotonically nondecreasing, so our smoothed price curve must reflect this nature.

To date, the smoothing methods applied to online auctions have all involved the specification of many parameters, such as the number and position of knots, a roughness penalty parameter, and the polynomial order. The resulting curves capture the price evolution reasonably well; however, the fit often requires manual visual inspection (for parameter selection). Smoothing splines, which suffer from edge fitting, often result in a deteriorated fit at the start and end of the price curve and produce curves that are not necessarily monotone. Monotone smoothing splines produce monotone curves; however, they are computationally intensive and require lengthy run times for even a moderate number of auctions.
Lastly, nonparametric smoothing does not provide an explanatory model for the price process. This motivated us to explore meaningful parametric representations of the price process. A parametric approach would be more elegant in the sense that it would provide a theoretical explanation of the process; it would also potentially be computationally fast and would provide a more parsimonious representation.

For example, in Chapter 4, monotone smoothing is employed to estimate the price curves of a set of Palm M515 auctions. We find three distinct shapes of price curves: "j", stretched-out "s", and straight-line curves. The most frequent shape in the Palm Pilot data set is a concave-up "j"-shaped price curve, which represents auctions with gradual price increases until mid-to-late auction and then a price jump towards the end; the second most popular shape is the straight line, where the slope depends on the ratio of the opening and closing prices; and the third typical shape is a stretched-out "s"-shaped curve, which reflects auctions where the price increases slowly, jumps up during mid-auction, and slowly increases to the close. These price curve shapes are the result of several bidding styles documented by Bapna et al. (2004b). Determining the number of curves of each type requires visual inspection of every single price curve; clearly, a better method for grouping curves is needed beyond visual examination.

The goal of this chapter is to introduce, within an FDA framework, a new parametric family of growth functions that describe the price processes in online auctions. This chapter is organized as follows: Section 5.2 discusses the data we use in this research. In Section 5.3, we discuss two nonparametric smoothing methods that have been used to describe auction price evolution and their limitations. In Section 5.4, we describe two popular growth models (logistic and exponential) and two additional useful growth models (logarithmic and reflected-logistic) and show how our parametric family of growth models is used for representing auction price growth. Section 5.5 introduces an automated selection procedure to choose the best growth model. Section 5.6 compares the quality of curves obtained through nonparametric methods with those from our parametric family of curves. In Section 5.7, we suggest several further uses of growth models. We conclude in Section 5.8 with final remarks and future research.

5.2 Wristwatch Data

We use the luxury wristwatch data described in Section A.1 to test our family of growth models because of its variability. This sample includes a variety of items in terms of make and model, new and used condition, and closing price. We know from the literature (Bapna et al., 2004b; Wang et al., 2007a) that auction attributes such as opening price, seller experience, number of bids, etc., affect not only the closing price of an auction but also the entire price process. Indeed, our sample is varied in all of those attributes.

Figure 5.1 shows the distribution of the number of bids in our sample for each day of the auction. Typically, there is some bidding activity at the auction start (first day), followed by a period of very little activity, culminating in a surge of bidding at the very end of the auction (Shmueli et al., 2004). This last-moment bidding is often referred to as "sniping" (Bajari and Hortacsu, 2003; Roth and Ockenfels, 2002). Clearly, when modeling the price process, the start and end of the auction are of particular importance.
[Figure 5.1: Distribution of daily bids over the course of 472 luxury wristwatch auctions. The percentages of bids placed on days 0-1 through 6-7 are 26.41, 11.64, 7.68, 7.29, 5.74, 9.43, and 31.80, respectively.]

5.3 Representing Auction Price Nonparametrically

There have been predominantly two approaches for representing the price process in online auctions. Jank et al. (2007) use penalized polynomial smoothing splines (p-splines), and Hyde et al. (2006) use penalized monotone splines. A comparison of the two approaches for auction data is given in Alford and Urimi (2004). We now describe each of the two methods and their properties.

5.3.1 Smoothing Splines

Each auction has bids placed at different times. Rather than applying smoothers to the raw data directly, we apply them to a derived data set that is sampled on the same set of time points for all auctions, as follows. Consider the observed price during an auction, represented by the step function in Figure 1.2, where there is a new step every time a bid is placed. We use this step function to obtain our sampled data by selecting a set of knots at times $\tau_1, \tau_2, \ldots, \tau_L$ and determining the corresponding auction prices at those knots, $y_{\tau_1}, y_{\tau_2}, \ldots, y_{\tau_L}$. The noisy bid data are then discarded and replaced with the observations $(\tau_i, y_{\tau_i})$. This transformation allows us to estimate the price process in different auctions using the same smoothing method.

The polynomial spline (Green and Silverman, 1994) of order $p$ is given by
$$f(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p + \sum_{l=1}^{L} \beta_{pl}\,(t - \tau_l)_+^p \qquad (5.1)$$
where $u_+ = u\,I(u \ge 0)$ is the positive part of $u$. Many smoothers of this type tend to fit the data too closely (and thus model the noise); therefore, a roughness penalty approach is commonly employed. This method takes into account the trade-off between data fit (i.e., minimizing $\sum_j (y_{\tau_j} - f(\tau_j))^2$) and function smoothness. A popular measure of roughness, which measures the degree of departure from a straight line, is of the form
$$\mathrm{PEN}_m(t) = \int [D^m f(t)]^2\, dt \qquad (5.2)$$
where $D^m f$, $m = 1, 2, 3, \ldots$, denotes the $m$th derivative of the function $f$. A highly variable function will yield a high value of $\mathrm{PEN}_m$. If the highest derivative of interest is $m$, then using $m + 2$ as the polynomial order will assure $m$ continuous derivatives. For online auctions, where the first and second derivatives have been shown to be of interest, we use polynomials of order 4. The penalized smoothing spline $f$ minimizes the penalized squared error
$$\mathrm{PENSSE}_{\lambda,m} = \int (y(t) - f(t))^2\, dt + \lambda\, \mathrm{PEN}_m(t). \qquad (5.3)$$
When the roughness parameter is set to $\lambda = 0$, the penalty term drops out and the function simply fits the data. Larger values of $\lambda$ penalize the function for being curvy, and as $\lambda \to \infty$ the fitted curves approach a linear regression. Ramsay and Dalzell (1991) and Ramsay (1998) suggest that the smoothing parameter $\lambda$ can often be chosen by inspection of the smooths or by optimizing metrics such as generalized cross-validation (GCV).

We have encountered a number of challenges using penalized smoothing splines in the online auction context. First, and most detrimental, the resulting functions are not always monotone nondecreasing, as an auction price necessarily must be. Second, the functions are often very wiggly, especially at the ends. This is particularly egregious in the online auction context, where the start and end of the auction are of major importance.
Third, there tends to be a large number of coefficients to estimate (due to the required choices of knots and the polynomial order). Finally, there are multiple decisions about smoothing parameters that must be made in advance: the number and position of knots, the polynomial order, and the roughness penalty parameter $\lambda$, all chosen to provide a reasonable fit to the entire set of auctions. An alternative, where each auction is fit separately using a different set of parameters, will lead to confounding in the analysis results (see Jank et al. (2007)).

5.3.2 Monotone Splines

Since the bidding process is by nature nondecreasing, Hyde et al. (2006) use monotone smoothing splines to represent the price process. The idea behind monotone smoothing (Ramsay, 1998) is that monotone increasing functions have a positive first derivative. The exponential function has this property and can be described by the differential equation $f'(t) = w(t)\,f(t)$; that is, the rate of change of the function is proportional to its size. Consider the linear differential equation
$$D^2 f(t) = w(t)\, D f(t). \qquad (5.4)$$
Here, $w(t) = D^2 f(t) / D f(t)$, which is the ratio of the acceleration and the velocity. It is also the derivative of the logarithm of the velocity, which always exists (since the velocity is defined to be positive via $Df(t) = e^{w(t)}$). The differential equation has the following solution:
$$f(t) = \beta_0 + \beta_1 \int_{t_0}^{t} \exp\!\left( \int_{t_0}^{u} w(v)\, dv \right) du \qquad (5.5)$$
where $t_0$ is the lower boundary over which we are smoothing. After some substitutions (see Ramsay and Silverman (2005)), we can write
$$f(t) = \beta_0 + \beta_1 e^{w t} \qquad (5.6)$$
and estimate $\beta_0$, $\beta_1$, and $w(t)$ from the data. Since $w(t)$ has no constraints, as $f(t)$ does in the form of the differential equation, it may be defined as a linear combination of $K$ known basis functions (i.e., $w(t) = \sum_k c_k \phi_k(t)$). Examples of basis functions are $\phi_k(t) = t$, which represents a linear model, and $\phi_k(t) = \log(t)$, which is a nonlinear transformation of the inputs. The penalized least squares criterion is thus
$$\mathrm{PENSSE}_{\lambda} = \sum_i [y_i - f(t_i)]^2 + \lambda \int_0^T [D^2 w(t)]^2\, dt. \qquad (5.7)$$

While monotone smoothing solves the wiggliness problem of the penalized smoothing splines, some of the same challenges remain and new ones arise. First, monotone smoothing is computationally intensive: the more bids, the longer the fitting process. Second, we still cannot fit the original raw data but rather the derived data $(\tau_i, y_{\tau_i})$ obtained from the step function. Finally, as with smoothing splines, the researcher must determine the number and location of knots and the roughness parameter $\lambda$ that provide a reasonable fit to the entire set of auctions.
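The two-step procedure described above, sampling the bid step function at a common set of knots and then smoothing the derived observations, can be sketched in R as follows. smooth.spline() is used only as a simplified stand-in for the penalized p-splines and monotone splines discussed in this section (it does not enforce monotonicity or use exactly the penalty in (5.3)), and the vectors bid_time (in days) and bid_amount for a single auction are hypothetical.

## Illustrative sketch: derive (tau_i, y_tau_i) from the bid step function and
## smooth the derived data. smooth.spline() is a simplified stand-in for the
## penalized smoothers above; `bid_time` and `bid_amount` are hypothetical.
knots <- c(0:6, 6.25, 6.5, 6.75, 6.8125, 6.875, 6.9375, 7)       # knots in days

o    <- order(bid_time)
t_o  <- bid_time[o]
y_o  <- cummax(bid_amount[o])            # running maximum as the displayed price
keep <- !duplicated(t_o, fromLast = TRUE)
price_step <- stepfun(t_o[keep], c(min(bid_amount), y_o[keep]))  # observed step function

y_tau <- price_step(knots)               # derived observations at the knots
fit   <- smooth.spline(knots, y_tau, spar = 0.5)                 # roughness controlled via spar

tt <- seq(0, 7, by = 0.01)
plot(bid_time, bid_amount, xlab = "Day of auction", ylab = "Price")
lines(predict(fit, tt), lwd = 2)         # smoothed curve (not guaranteed monotone)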
5.4 Representing Price Evolution Parametrically

We now introduce a parametric family of four growth models that are able to capture the price evolution in many types of auctions. These are the exponential, logarithmic, logistic, and reflected-logistic growth functions. Exponential and logistic growth functions have long been used to model population growth, the dissemination of information, the spread of disease, and more.

Our parametric approach is elegant, computationally fast, and parsimonious. It allows automated fitting, and there is no need to specify any parameters in advance. We choose models that are theoretically relevant in terms of monotonicity (to accurately reflect the price process) and that provide insight into the price process in online auctions.

Our approach, although motivated by online auctions, actually provides an alternative to the nonparametric smoothing methods that are customary in the field of Functional Data Analysis (FDA). This research opens the door for parametric curves to be the basis of FDA. In the following, we describe each of the four models. Their functional forms, derivatives, and parameters are summarized in Table 5.1.

Table 5.1: Price-evolution, -velocity, and -acceleration functions for the growth models.

Exponential: $Y(t) = A e^{rt}$; $Y'(t) = A r e^{rt}$; $Y''(t) = A r^2 e^{rt}$; parameters $A, r$.
Logarithmic: $Y(t) = \ln(t/A)/r$; $Y'(t) = 1/(rt)$; $Y''(t) = -1/(r t^2)$; parameters $A, r$.
Logistic: $Y(t) = L/(1 + C e^{rt})$; $Y'(t) = -LCr e^{rt}/(1 + C e^{rt})^2$; $Y''(t) = -LCr^2 e^{rt}(1 - C e^{rt})/(1 + C e^{rt})^3$; parameters $C, r$.
Reflected-Logistic: $Y(t) = [\ln(L/t - 1) - \ln(C)]/r$; $Y'(t) = -L/[r t^2 (L/t - 1)]$; $Y''(t) = L(L - 2t)/[r t^4 (L/t - 1)^2]$; parameters $C, r$.

5.4.1 Exponential Growth

5.4.1.1 Exponential Model

Exponential growth has been used to describe a variety of natural phenomena, including the dissemination of information, the spread of disease, and the multiplication of cells in a petri dish. In finance, the exponential equation is used to calculate the value of interest-bearing accounts compounded continuously. The idea behind exponential growth is that the rate of growth is proportional to the function's current size; that is, growth follows the differential equation
$$Y'(t) = r Y(t), \qquad (5.8)$$
or the equivalent equation
$$Y(t) = A e^{rt}, \qquad (5.9)$$
where $t$ is time and $r > 0$ is the growth constant. Analogously, exponential decay, when $r < 0$, can model phenomena such as the half-life of a decaying substance.

From a theoretical standpoint, we hypothesize that exponential growth can describe the price process of auctions where there are gradual price increases until mid-to-late auction and a price jump towards the end, perhaps due to sniping (Bajari and Hortacsu, 2003; Roth and Ockenfels, 2002). This is reminiscent of the "j"-shaped price curves that Hyde et al. (2006) find. An example of an auction that is captured well by exponential growth is shown in Figure 5.2 (top). The corresponding velocity curve (middle) and acceleration curve (bottom) are proportional to the price curve, as the differential equation implies. The price-velocity and price-acceleration are zero or small during most of the auction before spiking at the end.

[Figure 5.2: Price (top), velocity (middle), and acceleration (bottom) for an auction fit with exponential growth.]

[Figure 5.3: Price (top), velocity (middle), and acceleration (bottom) for an auction fit with logarithmic growth.]

5.4.1.2 Logarithmic Model

We have learned from Bapna et al. (2004a) that inexperienced bidders who do not understand the proxy bidding mechanism on eBay tend to bid high early, which increases the price early in the auction. Since the item for sale in an auction can often be purchased in other venues, such as online and brick-and-mortar stores, there is a "market value" that caps the price. The existence of a market value for the product forces the price process to flatten out for the remainder of the auction.
This type of price behavior tends to be rare, as most bidders do not wish to reveal their bids early in the auction. The inverse of the exponential function,
$$Y(t) = \frac{\ln(t/A)}{r}, \qquad (5.10)$$
which we call logarithmic growth, approximates this price process well. We choose a form of the logarithmic model that is the reflection of the original exponential model over the line $y = x$. An example of an auction whose price process is described well by logarithmic growth is shown in Figure 5.3. In this model, the velocity starts at its maximum and then decays to little or zero velocity as the auction progresses. The acceleration is always negative, since the price increases ever more slowly throughout the auction, and it approaches zero at the end of the auction (where very little change in price is occurring).

5.4.2 Logistic Growth

5.4.2.1 Logistic Model

While exponential growth often makes sense over a fixed period of time, in many cases growth cannot continue indefinitely. For example, there are only a finite number of people to spread information or disease, and a petri dish can only hold a maximal number of cells. A typical application of the logistic equation is population growth. In the beginning, there are seemingly unlimited resources and the population grows increasingly fast. At some point, competition for food, water, land, and other resources slows down the growth; however, the population is still growing. Finally, overcrowding, lack of food, and susceptibility to disease limit the population to some maximal carrying capacity.

In auctions, it is possible that growth starts out exponentially, with a big price increase in the middle of the auction, perhaps due to an inexperienced bidder. In the presence of a market value (or "carrying capacity"), the increased price slows competition, as smart bidders will make sure not to overpay for the item, and the price necessarily flattens out near the end of the auction. The closing price that we witness is analogous to the carrying capacity $L$ in the logistic growth function. The logistic model is given by
$$Y(t) = \frac{L}{1 + C e^{rt}}, \qquad (5.11)$$
and the corresponding differential equation is
$$Y'(t) = r\, Y(t) \left( \frac{Y(t)}{L} - 1 \right), \qquad (5.12)$$
where $L$ is the carrying capacity, $t$ is time, $r$ is the growth rate, and $C$ is a constant.

Logistic growth forms a stretched-out "s"-shaped curve, discussed by Hyde et al. (2006), where the price increases slowly, then jumps up during mid-auction, and finally levels off through the end of the auction. This price process and its dynamics can be seen in Figure 5.4. The velocity is small or zero at the beginning and end of the auction, where there is little change in price, with a peak at mid-auction corresponding to the price increase. The acceleration is zero for most of the beginning and end of the auction. It peaks during the first part of the growth spike, where the price is increasing at an increasing rate, followed by a valley during the second part of the growth spike, where the price is increasing at a decreasing rate.

[Figure 5.4: Price (top), velocity (middle), and acceleration (bottom) for an auction fit with logistic growth.]

[Figure 5.5: Price (top), velocity (middle), and acceleration (bottom) for an auction fit with reflected-logistic growth.]
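To make the functional forms concrete, the R sketch below implements the price-evolution and price-velocity expressions from Table 5.1 for the exponential, logarithmic, and logistic models (the reflected-logistic model is introduced in the next subsection). The parameter values are arbitrary and chosen only to produce plausible curves over a 7-day auction.

## Illustrative sketch: price-evolution and price-velocity functions from
## Table 5.1; the parameter values below are arbitrary illustrations.
exp_price   <- function(t, A, r) A * exp(r * t)
exp_vel     <- function(t, A, r) A * r * exp(r * t)               # velocity (not plotted)

log_price   <- function(t, A, r) log(t / A) / r
log_vel     <- function(t, A, r) 1 / (r * t)                      # velocity (not plotted)

logis_price <- function(t, L, C, r) L / (1 + C * exp(r * t))      # r < 0
logis_vel   <- function(t, L, C, r) -L * C * r * exp(r * t) / (1 + C * exp(r * t))^2

t  <- seq(0.01, 7, by = 0.01)          # avoid t = 0 for the logarithmic model
op <- par(mfrow = c(1, 3))
plot(t, exp_price(t, A = 1, r = 0.8),             type = "l", main = "Exponential", ylab = "Price")
plot(t, log_price(t, A = 0.01, r = 0.02),         type = "l", main = "Logarithmic", ylab = "Price")
plot(t, logis_price(t, L = 250, C = 100, r = -2), type = "l", main = "Logistic",    ylab = "Price")
par(op)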
5.4.2.2 Reflected-Logistic Model

Shmueli et al. (2004) found that there is some bidding activity at the auction start (first day), followed by a period of very little activity, culminating in a surge of bidding at the very end of the auction. The early price increase is indicative of early bidding by inexperienced bidders, and the price spike at the end may be caused by sniping. We hypothesize that the early bidding will increase the price, followed by a period of no price increase when little or no bidding occurs, and then a large price increase with the surge of bidding towards the very end of the auction. This type of price process may be described well by the inverse of logistic growth, or reflected-logistic growth, given by the function
$$Y(t) = \frac{\ln(L/t - 1) - \ln(C)}{r}. \qquad (5.13)$$
An example of reflected-logistic growth is shown in Figure 5.5. The velocity peaks at the start and end of the auction, where there are jumps in price, and is little or zero during the middle of the auction, where the price does not change significantly. The acceleration curve is similar in shape to the price curve; however, it is negative in the beginning (where the price increases at a decreasing rate) and positive at the end (where the price increases at an increasing rate).

5.4.3 Fitting Growth Models

Unlike the nonparametric smoothers, we fit the growth models directly to the live bids. A simple and computationally efficient method for fitting each of the growth models is to linearize the function and then perform least squares. Since we are especially interested in obtaining an accurate fit at the beginning and end of the auction, it is usually necessary to add two additional points representing the price at the start and close of the auction: $(t = 0,\, y = \min(y_j))$ and $(t = l,\, y = \max(y_j))$, where $l$ is the duration of the auction and $y_j$ is the value of bid $j$. Note that $\min(y_j)$ equals the opening price and $\max(y_j)$ equals the closing price. It is not necessary to add these extra points for logistic growth, where the maximum price is already incorporated into the function by defining $L = \max(y_j)$. Further, empirical evidence shows that auctions whose underlying price process is logistic tend to start close to zero, where logistic growth must start.

5.4.3.1 Fitting Exponential Growth

The exponential growth model from equation (5.9) can be linearized as
$$\ln Y = \ln A + r t. \qquad (5.14)$$
We fit this model directly to the live bids, with the two additional points added in order to constrain the price at the start and end of the auction. In this case, two parameters, $A$ and $r$, are estimated. Although we could fix $A$ at the opening price, since $Y(t = 0) = A e^{0} = A$, we choose to estimate both parameters for two reasons: first, empirical evidence shows that the two-parameter estimation allows a better fit at the end of the auction, and second, the other three growth models require estimating two parameters, which makes it easier to compare model fits.

The straight-line curves discussed in Hyde et al. (2006) are a special case of exponential growth: when $r = 0$ we obtain a horizontal straight line, and when $r$ is close to 0 the price curve resembles a straight line with positive slope.

5.4.3.2 Fitting Logarithmic Growth

The logarithmic growth model is given in equation (5.10). As with exponential growth, the two extra points are added to the live bids to ensure a good fit at the start and end of the auction. Notice that we cannot reduce this function to one parameter because $\ln(0)$ is undefined at $t = 0$. We also cannot linearize this function in $y$. Therefore, rather than using optimization methods for parameter estimation, which require initial-value guesses, we fit $T(y) = A e^{r y}$ and linearize as in the exponential growth case. This time, least squares minimizes over time instead of price.
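A minimal sketch of the linearized least-squares fits for the exponential and logarithmic models follows. The vectors bid_time (in days) and bid_amount for one auction, and the duration l = 7, are hypothetical; dropping the start anchor from the logarithmic fit (because log(0) is undefined) is a choice made for this sketch and is not spelled out in the text above.

## Illustrative sketch: fitting exponential and logarithmic growth by
## linearization and least squares; `bid_time`, `bid_amount`, and l are hypothetical.
l <- 7
t <- c(0, bid_time, l)                              # add the start and end anchor points
y <- c(min(bid_amount), bid_amount, max(bid_amount))

## Exponential growth: ln Y = ln A + r t, so regress log price on time
fit_exp <- lm(log(y) ~ t)
A_hat <- exp(coef(fit_exp)[1])
r_hat <- coef(fit_exp)[2]

## Logarithmic growth: fit T(y) = A exp(r y), i.e., regress log time on price,
## so that least squares minimizes over time instead of price
pos <- t > 0                                        # drop the start anchor (log(0) undefined)
fit_log <- lm(log(t[pos]) ~ y[pos])
A_log <- exp(coef(fit_log)[1])
r_log <- coef(fit_log)[2]

## Fitted exponential price curve over the full auction
tt <- seq(0, l, by = 0.05)
plot(t, y, xlab = "Day of auction", ylab = "Price")
lines(tt, A_hat * exp(r_hat * tt), lwd = 2)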
5.4.3.3 Fitting Logistic Growth

The logistic growth model from equation (5.11), where $L$ is the curve's asymptote, can be linearized as
$$\ln\!\left( \frac{L}{y} - 1 \right) = \ln(C) + r t. \qquad (5.15)$$
We know that $\lim_{t \to l} Y(t) = L$ (since for logistic growth $r < 0$ and $L > 0$). We define $L = \max(\text{price}) + \epsilon$, where $\epsilon = 0.01$ is needed so that the left-hand side (LHS) is defined over all bids $y$. In this case, there is no need to add the start and closing points to the live bids, because defining the asymptote takes care of the fit at the end. Auctions whose underlying price process can be described by logistic growth tend to start out low, so there is also no need to set the start value.

5.4.3.4 Fitting Reflected-Logistic Growth

The reflected-logistic function is given in equation (5.13). As with logarithmic growth, we cannot linearize this function in $y$. We instead fit $T(y) = \frac{L}{1 + C e^{r y}}$, where $L = l + \epsilon$ ($\epsilon = 0.00001$), to obtain the coefficients $C$ and $r$. We need $\epsilon <$