ABSTRACT Title of Document: MY MOBILE MUSIC: AN ADAPTIVE PERSONALIZATION SYSTEM FOR DIGITAL AUDIO PLAYERS Tuck Siong Chung, Ph.D., 2007 Directed By: Professor Roland T. Rust, Department of Marketing Professor Michel Wedel, Department of Marketing This paper develops a music recommendation system that automates the downloading of songs into a mobile digital audio device. The system tailors the compositions of the songs to the preferences of individuals based on past behaviors. We describe and predict individual listening behaviors using a lognormal hazard function. Our recommendation system is the first to accomplish this and there is as of this moment no existing alternative. Our proposed approach provides an improvement over alternative methods that could be used for product recommendations. Our system has a number of distinct features. First, we use a Sequential Monte Carlo algorithm that enables the system to deal with massive historical datasets containing listening behavior of individuals. Second, we apply a variable selection procedure that helps to reduce the dimensionality of the problem, because in many applications the collection of songs needs to be described by a very large number of explanatory variables. Third, our system recommends a batch of products rather than a single product, taking into account the predicted utility and the uncertainty in the parameter estimates, and applying experimental design methods. MY MOBILE MUSIC: AN ADAPTIVE PERSONALIZATION SYSTEM FOR DIGITAL AUDIO PLAYERS By Tuck Siong Chung Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2007 Advisory Committee: Professor Roland T. Rust, Co-Chair Professor Michel Wedel, Co-Chair Professor John P. Rust Associate Professor P.K. Kannan Associate Professor Wendy Moe ? Copyright by Tuck Siong Chung 2007 ii Dedication To my wife Angie who has been both supportive and accommodating during the pursuit of my doctoral degree. iii Acknowledgements I am grateful to many individuals who contributed to this dissertation, especially to Dr. Roland T. Rust and Dr. Michel Wedel for providing many hours of guidance and stimulating discussions, and for all their patience and encouragement. Committee members, Dr. John P. Rust, Dr. P.K. Kannan and Dr. Wendy Moe provided valuable comments and suggestions that helped me to improve the quality of this dissertation. I would like to thank Dr. Howard Frank, the Dean of Robert H. Smith School of Business, Maryland, College Park for his generous financial and moral support that made it possible to run the experiment for this dissertation. iv Table of Contents Dedication..................................................................................................................... ii Acknowledgements......................................................................................................iii Table of Contents......................................................................................................... iv List of Tables ................................................................................................................ v List of Figures.............................................................................................................. vi Chapter 1: Introduction................................................................................................. 1 Chapter 2: The Model ................................................................................................... 7 Section 1: The Lognormal Hazard Function............................................................. 9 Section 2: Bayesian Variable Selection .................................................................. 14 Section 3: Sequential Monte Carlo Estimation....................................................... 17 Section 4: Model Averaging ................................................................................... 20 Section 5: Bayesian Experimental Design.............................................................. 23 Chapter 3: Simulation Study....................................................................................... 28 Section 1: Simulated Data....................................................................................... 30 Section 2: Estimation .............................................................................................. 31 Section 3: Simulation results .................................................................................. 34 Chapter 4: Experimental Study................................................................................... 41 Section 1: Data collection ....................................................................................... 41 Section 2: Estimation and customization procedures ............................................. 45 Section 3: Results of the experiment ...................................................................... 46 Chapter 5: Discussion and conclusion ........................................................................ 59 Appendix: Survey on Music Preference Study........................................................... 63 References................................................................................................................... 64 v List of Tables Table 1 Parameter used to generate simulated data ........................................... 30 Table 2 Parameter estimates in the first simulation study (with variable selection) ................................................................................................................. 34 Table 3 Parameter estimates in the first simulation study (without variable selection)................................................................................................. 34 Table 4 Credible intervals of parameter estimates in the first simulation study . 35 Table 5 Comparison of Mean Squared Errors (MSE) of predictions with and without model averaging......................................................................... 36 Table 6 Comparison of design performances ...................................................... 37 Table 7 Distribution of the song-specific constants............................................. 51 Table 8 Distribution of subject weights. ............................................................... 52 Table 9 Comparison of play-lists? performance in terms of how well they recommend songs.................................................................................... 54 Table 10 Subjects? ratings of the play-lists? performance obtained from post experiment............................................................................................... 56 vi List of Figures Figure 1 Plot of the lognormal hazard function................................................... 12 Figure 2 Transition in actual utilities for the optimal designs during sequential design generation.................................................................................... 39 Figure 3 Change in song compositions during the sequential generation of the optimal design......................................................................................... 40 Figure 4 Screen shot of PDA software .................................................................. 42 Figure 5 Sigma values of the estimated experiment subject?s parameters............ 47 Figure 6 Percentage distribution of songs heard by the duration heard in milliseconds............................................................................................. 48 Figure 7 Percentage distribution of songs by the ratio of listening duration to song length....................................................................................................... 48 Figure 8 Distribution of experimental subjects by the number of non-zero song characteristic coefficients. ...................................................................... 49 Figure 9 The distribution of experimental subjects by the number of non-zero coefficients for describing song genres................................................... 50 Figure 10 Distribution plot of the song-specific constants. ....................................... 52 Figure 11 Distribution of subject weights. ................................................................. 53 1 Chapter 1: Introduction Mobile digital audio devices have become so common that in 2005 about eleven percent of the American adult population have an iPod or MP3 player (Pew Internet & American Life survey). Concurrent with the increasing penetration of these devices is the growing adoption of digital music in 2006 (US Music Consumer Survey, 2006). The use of music play-lists has steadily increased as consumers look for a more personalized digital music collection to suit their individual tastes. Despite the growing importance of the issue of music recommendation for mobile devices, as far as we know there is yet to be published any article that develops a music recommendation system for mobile music devices. The objective of this paper is to fill this gap and to develop such a music recommendation system. This paper will also provide a validation of the proposed recommendation system through an actual implementation of the system on Personal Digital Assistants (PDAs), and data collected through an recommendation experiment. The target audiences for this paper are researchers who are involved in the research in recommendation systems, and practitioners who are looking for a better recommendation system for their music products. Our recommendation system is directly implementable and provides useful new features that will benefit recommendation practice as it resolves a number of the challenges of a real life recommendation system. A choice of music is highly dynamic. An individual music choice may change due to the individual?s emotions, listening context and contacts with the other media. This calls for a system that is able to make real time recommendations. One 2 of the challenges of making music recommendations arises from the heterogeneity of individual music preferences. Music preferences are highly personal as they relate to specific personality characteristics, cognitive ability and emotions (Rentfrow and Gosling, 2003). An effective music recommendation system therefore is one that provides individual customization. In order to carry out the customization, information on individual music preferences is necessary. The problem of explicitly asking individuals for their music preferences is that they are reluctant to actively provide personal information due to the efforts involved, and the fear of the invasion of privacy. Explicit inputs of individual preferences in recommendation systems are sparse because only a small proportion of them are willing to provide the inputs (see Konstan et al. 1997) and as a consequence, most of the information in the systems based on these data is missing (Ying, Feinberg and Wedel, 2006). In addition, when asked about their product preferences, individuals may not be able to fully or accurately express them. For example, one problem with the use of genre to classify music type is that a system?s categorization of genres may not map to an individual?s mental model of music. Yet all studies published in the marketing literature on recommendation systems depend on such input elicited from respondents. The use of explicit preference data that comes in the form of product ratings creates another problem. Individuals don?t always response to rating scale in accordance to their preferences. They commonly indicate their preferences by choosing the middle, or the extreme of the rating scales, (Rossi, Gilula and Allenby, 2001). 3 An alternative to explicit inputs from the individuals is to infer their preferences from their demographic profiles. However, due to the less than perfect match between demographic profiles and personality traits, recommendation systems that utilize only demographic profiles for recommendations are inherently inaccurate. A more effective music recommendation system will be one that infers music preferences based on the individuals? past choice behaviors, as well as the choice behaviors of other similar individuals. Recommendation systems generally fall into two categories -- content filtering and collaborative filtering. Simply stated, content filtering makes recommendations based on an individual?s past preferences for product attributes. Other the other hand, collaborative filtering predicts an individual?s preferences using a weighted sum of other individual?s preferences. The weights are reflection of how closely one individual?s preferences are to the others. Our recommendation system falls into the category of a hybrid system since it utilizes both content and collaborative filtering. Recommendation systems that also use a hybrid approach can be found in the marketing literature (e.g. Ansari, Essegaier and Kohli, 2000). An additional challenge for music recommendation arises from the problem of massive datasets created as a result of the number of songs and individuals involved, and the potentially large number of explanatory variables. A massive dataset is an issue when the algorithm is computationally intensive, as with the Markov Chain Monte Carlo (MCMC) procedure that involves more than a single pass through the data. A large number of explanatory variables can thus, results in an unreasonable time lag for product customization. The last challenge results from the demands on 4 the system for individualized customization. This means that the system developed needs to sequentially update individual level estimates to refine the coefficient estimates and to adapt to the changes in individual music preferences. There are a number features in our recommendation system that resolve the above mentioned challenges of a recommendation system. First, the Sequential Monte Carlo algorithm that we use updates the parameter estimates as new data come in. The Sequential Monte Carlo algorithm is ideal for the implementation of real-time customization not only because of the sequential way in which it updates parameter estimates, but also because of the speed with which the estimation is done. Second, our music recommendation system removes the need for individuals to explicitly provide inputs on their song preferences. Music preferences are learned based on the history of how long a song is listened to, with the assumption that individuals will listen to songs that provide higher utilities. Third, our recommendation system addresses the problem of massive data. The ability to handle large datasets comes partly from the processing of data in blocks, and partly from the incorporation of a variable selection step into our algorithm. The variable selection step removes redundant or irrelevant variables that unnecessarily complicate the analysis, therefore reducing the system?s computational burden. Fourth, the use of Bayesian methods in our system ties in with the need for sequential updating of individual level estimates. As more and more blocks of data come in, our estimates of individual preferences improve. Our system also adapts the customization process to the changes in tastes and preferences. This is because, as 5 new information is obtained on the individuals, this information is used to dynamically update the system?s estimation of preferences. Fifth, our recommendation system automates the downloading of songs to a mobile audio device, and as a result removes the need for individual intervention. When the system removes the need for individual inputs, and also adapts to individual changes in preferences, it enables the music downloading process to be fully automated. We reviewed some of the existing music recommendation system in practice and the systems that have been developed in the academic literature. In practice no mobile audio device recommendation system based on past behavior is available. Some of the existing music recommendation systems, such as CDNow, MediaUnbound and MoodLogic just to name a few, ultimately allow the website users to download the recommended music into mobile music audio devices. However, they based their recommendations on music data groupings and the website users? interests. These systems take inputs from the users on their music preferences using general questions on their music preferences and tastes, ratings on a sample of music objects, and open ended questions on favorite types of music. Other music recommendation systems like the Pandora.com and last.fm broadcast music tracks to their internet radio station listeners, and modify their future broadcasts based on the user indication of which artists are and are not acceptable. However, such systems are not specifically design to help the website users download music into their mobile audio devices, and are based on simple similarity measures of user mentioned preferences to general song characteristics (music genomes). 6 In the academic literature recommendation systems are static, and their performances are validated through the use of secondary data collected for other purposes (e.g. Ansari and Mela, 2003; Montgomery et al., 2004) or through simulated data (e.g. Raghu et al., 2001; Ariely, Lynch and Aparicio, 2004). Some of these papers (e.g. Ansari, Essegaier and Kohli, 2000) use rating data and therefore encounter the problem of rating scale response biases mentioned in Rossi, Gilula and Allenby (2001). In addition, unlike our recommendation system that recommends a play-list (i.e. a set of products) these recommendation systems make product prediction to an individual one product at a time. This paper is divided into five chapters. Chapter one provides an introduction, background and motivations for the paper. Chapter two provides a description of the model used to estimate subject preferences and a description of the play-list customization procedures. Chapter three describes the simulation study conducted as an initial test of our model. The experimental study is discussed in chapter four, along with results. Finally, our paper ends with chapter five that provides a discussion and conclusions. 7 Chapter 2: The Model This section describes the model used to estimate subject preferences and the customization of individualized music play-lists to suit these preferences. The model has the following features: (1) it uses a hazard model to estimate subject?s utility from the time a subject spent listening to a song, (2) it uses a Bayesian variable selection procedure to select relevant song characteristic in the hazard model for each subject, (3) it uses a particle filter for estimating the models in real time, allowing for sequential updating of the parameters as new data come in, and accommodating the massive data accumulated on song listening behavior, and (4) it uses a model averaging procedure for collaborative filtering to generate recommendations. To achieve this, models of different subjects are averaged based on their similarity. The model averaging also allows us to deal with idiosyncratic song effects not captured by the song descriptor variables, through a pooled model with song-specific constants. Finally, (5) it uses optimal sequential experimental design methods to generate the utility maximizing play-lists across a sequence of batch recommendations, based on the model-averaged estimates. This section starts by describing the hazard function. The hazard function is used to link a subject?s utility for a song to how long this subject listens to the song. A hazard model predicts duration and the probability of termination for an event. In our case, the event corresponds to the act of listening. The Bayesian Variable Selection procedure is described next. This procedure is used to deal with the situation in which the number of explanatory variables is large. It reduces the number of variables to a more manageable size and removes the 8 variables which are redundant from the model, at the individual level. The variable selection simplifies and improves the parameter estimations and model predictions. We illustrate the Bayesian Variable Selection, and demonstrate that for some subjects some of the variables can be removed from the model. The next subsection describes the Sequential Monte Carlo procedure, based on particle filtering. This procedure is used to deal with a massive dataset and for sequential updating of the parameters estimates as new data are obtained. First, we describe how the particles are generated. Second, we describe the updating of particle weights, which dictates how much of the parameters in each particle are to be used for prediction. Third, we describe the particle rejuvenation or re-sampling, an important step in particle filtering. This step is used when the effectiveness of the particles drops below a tolerance level. We move on to the description of the Model Averaging next. Model averaging is used because we believe that the prediction of a subject?s song preferences can be improved by borrowing the information from other similar subjects who have listened to the same song. In addition, model averaging also allows us to introduce song-specific constants into the prediction. The song-specific constants are important because they are used to incorporate characteristics of songs that are not otherwise captured by the prediction variables. Lastly, we discussed how the individualize play-list is designed using an Experimental Design approach. In our model, we derive the optimal play-list by maximizing a function consisting of the proxy of song?s utility (i.e. the predicted listening duration), and an information criterion that accounts for parameter 9 uncertainty, enabling the listening data to be used to update the individual hazard model parameters efficiently. Section 1: The Lognormal Hazard Function We make the assumption that we can infer song preferences based on the time a subject spends listening to a particular song. The assumption that a subject will listen to songs they like longer than to songs they don?t like is not an unreasonable one. It is a commonly observed phenomenon that an individual will prolong pleasurable experiences and shorten painful ones. Specifically, Holbrook and Gardner (1993) shows that pleasure has a positive effect on music listening duration. We model listening behavior using a commonly-used parametric survival model - the lognormal hazard model. The basic input for our model is the amount of time a subject has listened to a particular song. Individual models and an aggregate model are estimated using the available data. Our models are fully individualized and estimated separately for each subject, which enables real-time processing of the data, individual level variable selection and individual level recommendations. The estimates are obtained using song attributes and genre as predictors. The aggregate model on the other hand, predicts all subjects? listening durations based purely on song-specific constants. This is done by using a dummy variable in the log-normal hazard model for each song across all subjects. The aggregate model is used to capture the impact of an individual song?s unique characteristics, which are not otherwise reflected in the prediction by the individual level models. Equation 1, 2 and 3 below describes the aggregate model, where ky represents an observation describing the log listening duration of a subject for the song k. For 10 simplicity in notation, we have omitted the individual level subscript. If ky has a lognormal distribution with a mean of ? and variance of 2? , then the probability density function is { }2212/1 ))(log(2/1exp)()2(),|( ???pi?? ??= ?? kkk yyyf (1) The survival function is given by ]/})[{log(1),|( ???? ???= kk yyS (2) We can thus write the likelihood function of ),( ?? as ? = ?= N k kk kk uySyfL 1 )1(),|(),|(),( ?? ????? (3) There are N observations in the aggregate dataset, which is formed by stacking the data of every subject?s listening duration to every song. The number of observations per subject need not be the same. In the aggregate model, we introduce covariates through ? , and write ?z '=k? for observation k. 'z is vector consisting of a binary dummy variable for every song, and ? is the vector of coefficients for the aggregate model. The data are censored as we are only able to observe the listening duration up to the length of a particular song. Whether there is censoring in a particular observation is indicated using a censoring indicator k? . In the individual model, the probability density, survival and likelihood function have a similar form as the aggregate model. However, model estimation is done subject by subject using only the subject?s data, for reasons that will become apparent below. The likelihood of the individual model takes the form ? = ?= i kk n k iiikiiikii uySyfL 1 )1(),|(),|(),( ?? ????? (4) 11 The term in above represents the number of observations for subject i . The covariates in the individual model are introduced through i? , and we write iik '?x=? for observation k. 'g91 represents song attributes and genre variables and i? is the subject specific coefficients. In both the aggregate and the individual-level models we assume that the error terms are independent and identically distributed (iid) random variables. In addition, the variance term i? in the models act as the shape parameter that will determine the shape of the log normal hazard functions. We utilize the lognormal hazard function because it is flexible and represents the type of non-monotone hazard that we expect for songs, that is, unimodal. Other hazard functions like the log-logistic or the expo-power hazards (Saha and Hilton, 1997) represent these shapes as well, but the lognormal has the additional, and for our study crucial, characteristic that the mean is available in closed form, and that it facilitates computation for massive data set in real time. The hazard function is defined as the ratio of the probability that subject i will stop listening to song j to the probability that subject i is still listening to song j at a particular time. The shape of this hazard function is plotted in figure 1 using different values of ? . 12 Figure 1 Plot of the lognormal hazard function As shown in the figure, the hazard function is downward sloping when ? is at a higher value (e.g. a value of 2). This describes a listening behavior in which a subject has a higher tendency to stop listening to a song (i.e. a higher hazard rate) at the beginning. The tendency to continue listening increases (i.e. the hazard rate drops) as the subject listens longer. This reflects a case of increasing interest/attractiveness of the song. When ? is at a lower value such as 0.5, the hazard function describes a different listening behavior. The hazard rate increases from zero to a maximum and then tapers off when the duration increases. This would imply the case when a subject needs certain duration of listening before s/he can decide whether the song is preferable. At the beginning of the curve, the hazard rate increases with the listening duration, possible due to the subject?s better understanding of the song?s attributes or appeal as time passes. Tendency to make a decision on whether to reject a song increases with a better understanding. If the subject decides to listen to a song longer 13 than the duration in which the hazard curve inflects, the tendency to stop listening to the song reduces with the listening duration. The closed form for the posterior distribution of our model parameters is not available. To resolve this issue, we use Markov Chain Monte Carlo (MCMC) and Metropolis Hasting algorithm to sample from the posterior distribution. For the individual models, the first block of data on a subject that comes in is used to generate the initial particles used for the Sequential Monte Carlo procedure. Each particle represents a Metropolis-within-Gibbs sample of the subject?s coefficients after a burn-in period. The procedure updates subject?s parameters using importance sampling. (The details of the Sequential Monte Carlo procedure are given in a subsequent section of this paper.) The MCMC sampler requires one complete pass through the first block of data. We use a variable selection step into our MCMC procedure. If the variable selection step is not applied, the MCMC sampling will involve simply the successive simulation of ),|( 2 iii i p y,?? g735? , )|( i i p ??g735 and ),|( 2 iiip y?? , where ig735 ? represents the variance-covariance matrix of subject i ?s coefficients. (The details of the MCMC sampling with the variable selection step are explained in the next section.) For the individual models, estimation of subjects? parameters is done with the Sequential Monte Carlo procedure on the second block of data onwards. For the aggregate model, we do not include a variable selection step since we need the coefficient for each variable. Each variable in the aggregate model is a song-specific dummy, and the parameters estimated using this model provide a song- specific constant for every song. We will only be able to generate a constant for a 14 particular song when at least one subject has listened to the song. This means that we need to generate new song-specific constants as new data come in, because some subjects have now listened to new songs. Consequently, the Sequential Monte Carlo procedure is not used for the aggregate model because we will always require new MCMC estimations to generate new song-specific constants. To keep the size of the dataset manageable for the MCMC estimation, only the most recent listening duration of a subject for a particular song is used. In other words, if subject i has listened to a song j in the most recent block of data and also in a different block of data in the past, we will only use the information from the last data block. An additional benefit of this approach is that the song-specific constants are estimated using only the most current data. When there is a change in music tastes, this change is reflected in the song constants. Section 2: Bayesian Variable Selection Typical applications of our music recommendation system have a huge number of explanatory variables. This is especially the case when the system includes variables describing the song attributes and granularly defined genres. In a situation where the number of explanatory variables is large, the vector of explanatory variables could contain many redundant or irrelevant variables that unnecessarily complicate the analysis. In our recommendation system, variable selection is applied to the individual models but not to the aggregate model, because we desire estimates of all song-specific constants. 15 The goal of variable selection is to ignore a variable jpx (i.e. the p-th attribute variable for song j ) if ip? (i.e. subject i ?s p-th coefficient) is equal to zero (Chipman, George and McCulloch, 2001). This involves selecting a sub-model, the likelihood of which has the form ),(),|( 22 I?x?y g757g757g757 g76g76g76g76 iiiiiii ii N,p ??? = (5) where ),|( 2 iiii ,p i ??y g757 ? is the likelihood of iy given the parameters. This is similar to the likelihood shown in equation 4, but with some of the coefficients set to a value of zero. We indicate if the value of ip? is set to zero using ),,( 1 iPii ?? L=? . When ip? is not equal to zero ip? has the value of one, otherwise ip? has a value of zero. The vector i? is updated via a Metropolis search during the estimation process. A different set of values of i? represents a different set of coefficients remaining in the model after the variable selection. iig757 ? and g76g76g76g76g757 x i are the vector of regression coefficients and the x matrix corresponding to a particular i? . We apply a Metropolis-Hastings Algorithm for variable selection that involves the successive simulation of ),,|( 2 iiiii ,p y??? g82g79g71g757g81g72g90 g76g76g76g76 ? (6) ),,|( 2 iiii iii p y?,?? g757g735g757 ? ),|( ii iii p ??? g757g735 g757 ),,|( 2 iiiip y?? g76g76g76g76g757 ? 16 After a burn-in period, the Metropolis-within-Gibbs sample draws for the values of ii and?? g76g76g76g76g76g76g76g76g757 ?, are saved as initial particles used for the Sequential Monte Carlo procedure described in the next section. We use the following priors for iig757 ? , iig757g735 ? and 2i? ( ) iiiii Np iiii g757g757 g735g735g757 ?y??? ,),,,|( 2 0=? (7) )2/,2/(),|( xii vvIGp iii ???? g757g735 g757 = )2/,2/(),,|( 2 yiiii vvIGp ?? =y?? g76g76g76g76g736 The use of a prior mean of 0 for iig757 ? is a neutral choice reflecting the indifference between positive and negative values for the coefficients. These coefficients are normally distributed with an diagonal variance-covariance matrix iig757g735 ? . We draw the values of the variance vector iig757g735 ? using a prior of )2/,2/( xvvIG ? . We chose a small value v = 5 for a diffuse prior, and equate g91? to the sample variance of ix . The value of 2i? is drawn similarly with a prior of )2/,2/( yvvIG ? , while equating y? to the sample variance of iy . The proposed newi? is accepted with the probability: ?? ??? ?? ??? )y?? y?? g82g79g71 g81g72g90 g757 g757 iii old i iii new i ,f ,f i i ,|( ),|( ,1min 2 2 ? ? (8) Where ),,|(),|(,|( 222 iiiiiiiiiii ,p,f),f y?????yy?? g82g79g71g757g81g72g90g76g76g76g76g757g757g76g76g76g76 g76g76g76g76g76g76g76g76g76g76g76g76 ??? ? A symmetric transition kernel is used for our Metropolis algorithm. This is based on the assumption that every variable has the same probability of being in the model. It is also assumed that all variables have the same probability of switching 17 from a non-zero coefficient to a zero coefficient as it is vice versa. In other words, we assume that 1||1),,|( 1 2 =?= ? = old ip P p new ipiiiii andP,p ??? y??? g82g79g71 g736 g81g72g90 g76g76g76g76 (9) These are realistic assumptions as they reflect a belief that all variables have the same prior probability of being in the model. They are also convenient assumptions because they simplify the probability of acceptance for newi? to ?? ??? ?? ??? old iiii new iiii ,f ,f i i ??y ??y g82g79g71 g81g72g90 g757 g757 ,|( ,|( ,1min 2 2 ? ? (10) The variable selection step is implemented as a part of the MCMC sampling for the individual models on the first block of data. A set of initial particles is generated and is used for the Sequential Monte Carlo procedure. Section 3: Sequential Monte Carlo Estimation One of the challenges of analyzing online consumer behavior is the sheer mass of the data available. A standard Markov Chain Monte Carlo method generally requires a complete scan of the dataset and also a large number of iterations to estimate the model parameters. This makes Bayesian analysis of massive datasets using Markov Chain Monte Carlo methods infeasible. In this paper, we use the Sequential Monte Carlo Method to estimate the model parameters in the individual models (Ridgeway and Madigan, 2003). Not only does this procedure allows us to analyze large datasets, it also enables us to analyze the data in blocks. In other words, we could estimate a subject?s parameters using the data available in one time period, and improve the estimate as new blocks of data 18 becomes available later. This type of sequential updating of individual level parameters as new data come in is crucial for mobile recommendation systems. A typical Monte Carlo method samples from the posterior, )|( ii yf ? for subject i , where i? refers to subject i ?s parameters (i.e ii and?? g76g76g76g76g76g76g76g76g757 ?, ) and iy refers to the data. The method then estimates )|)(( ii yhE ? as ? =Mm mihM 1 )()/1( ? . Here M refers to the number of draws from the posterior used to estimate i? and h is a transformation. In our procedure, we estimate subject parameters using Markov Chain Monte Carlo Methods on the first block of data. We retain the estimates obtained in the different iterations as ?particles? after a burn-in period. Each particle is a vector containing the current estimated values of the parameters ii and?? g76g76g76g76g76g76g76g76g757 ?, obtained from the Metropolis-within-Gibbs sample draw procedure described in the previous section. When a second block of data arrives, our estimates of the subjects? parameters are updated using importance sampling. We represent the first block of data as iy1 and the second block of data as i y2 . If Mii ?? ,,1 L are drawn from )|( 1 iyf g76g76g76g76? we can estimate the posterior expectation of any function )( ih ? as ( )? ?= == Mm Mm immiimiii whwyyhE 1 121 /)(),)|((? ?? (11) where iw ?s are the importance sampling weights for subject i . The value of imw for particle m is given by: )|)((/),|)(( 121 imiimim yfyyfw g76g76g76g76g76g76g76g76 ??= (12) This greatly simplifies (Ridgeway and Madigan, 2003) to: 19 iK K k m ik yyyyf 2 1 1 },,{,)|( ??? = L g76g76g76g76 ? (13) where iKi yy ,,1 L are the observations in the iy2 data block. The posterior variance of the parameter estimates conditioned on iy1 and iy 2 will be smaller than those conditioned on iy 1 alone. This means that ),|( 21 iii yyf ? has a narrower distribution than )|( 1 ii yf ? . The narrower ),|( 21 iii yyf ? is, as compared to )|( 1 ii yf ? , the larger is the proportion of draws from )|( 1 ii yf ? that has zero importance weights. This reduces the efficiency of the importance step. We mitigate the loss of efficiency by the incorporation of a rejuvenation step. The rejuvenation step is used when the effective sample size (ESS) of the importance sampling fall below a tolerance level. We use the tolerance level of 0.1, which means that the rejuvenation step is invoked when the ESS falls below 10% of the sample size used for the Sequential Monte Carlo procedure (Ridgeway and Madigan, 2003). The ESS is the number of observations from a simple random sample needed to obtain an estimate with Monte Carlo variation equal to that obtained with a weighted draw of M particles. Kong, Liu and Wong (1994) shows that the ESS can be approximated by: ?? == = M m im M m im wwESS 1 22 1 )(/)( (14) We carry out the rejuvenation step using a systematic re-sampling procedure (Arulampalam et al., 2002). The basic idea is to reduce the number of particles with small sampling weights and increase the number of particles with large weights. The 20 procedure involves uniform sampling of the particles with replacement using the probabilities derived from the particle weights; particles with higher weights will be selected more often. The sampling is done M times if we need to generate a new set of M particles. After the re-sampling is done, the new particles are each given the new weights of 1/M. The particles and the importance sampling weights generated with the Sequential Monte Carlo procedure are used to predict the listening duration. Without model averaging, the predicted mean listening duration of subject i for song j , that is in using model i , is given by the weighted average: ( )? ?= =+= Mm Mm immimijimij wwyE i1 1 2 /)2/)('()(? ? g757?X (15) Where imw is the m-th particle sampling weight for subject i , 'jX represents a p-vector of attributes for song j , mi ig757 ? is the m-th particle?s values for subject i , and 2)( m i? is the m-th particle?s residual variance for subject i . Model averaging is used in our system because averaging the prediction over different models will improve our predictions of a subject?s listening duration, in particular for songs that the subject has not listened to. The model averaging is similar in spirit to collaborative filtering and is described in the next section. Section 4: Model Averaging Typically Model Averaging is used to deal with the problem of model uncertainty. For example, a researcher may have several believable models, each with its own sets of parameters, but is not sure which one is correct. Alternatively, 21 the researcher may not believe that any of the models is actually correct, but uses them as proxies for some unknown underlying model. Our motivation for using Model Averaging is closer to the latter since we believe that averaging over the different models will improve our predictions of a subject?s listening duration, or stated differently, we use model averaging as our procedure for collaborative filtering. This is especially useful in the case where we have to predict the listening duration of a song that a subject has not listened to before, in generating the play-lists. In such instances, we have to rely on the listening behavior of the other subjects who have similar preferences to make a prediction. It is the essence of collaborative filtering applied to subjects? models rather than the conventional method of applying on subjects? data. Whereas existing recommendation systems define recommendations based on how close the data of person j is to that of person i, we combine individuals? models based on how well they predict the target individuals? data. That is, we borrow strengths across models of different individuals, by combining the predictions/recommendations based on them. In our Model Averaging, similarity between subjects is based on how well one subject?s parameters predicts the listening duration of another subject. Even for the case of predicting the listening duration of a song that a subject has listened before, the averaging will give a more accurate prediction because we incorporate the prediction from an aggregate model. The impact of individual song constants is incorporated in the prediction by using the aggregate model as a part of the Model Averaging. Note that the song-specific constants would not be estimable in the individual models, due to the severe lack of data, even using Bayesian shrinkage. 22 Let 1Model , 2Model , ?, IModel be the models estimated for subjects 1,?,I, where I represents the total number of subjects. We also define 1+IModel to be the model that contains the particles of the aggregate model?s parameters. In addition, we define ? to be a I+1 by J matrix of indicator variables. ),,( ,1 jIij += ?? Lg77g77g77g77? and ij? = 1 if subject i has ever listened to song j , ij? = 0 otherwise. In addition, jI ,1+? = 1 if at least one subject has ever listened to song j , jI ,1+? = 0 otherwise. The subject weights are calculated using the posterior odds as in a typical model averaging approach. For example, to calculate the weights that should be applied to subject 2?s (i.e. 2Model ?s) prediction of expected listening duration for subject 1 (i.e. 21? ), we use (Hoeting et al. 1999): )( )(* )|( )|( )|( |( 1 2 1 21 11 12 21 Modelp Modelp Modelp Modelp Modelp )Modelp ly y y y ==? (16) Here, ( )iji y=y represents the vector of log observed listening duration for subject 1. )( 1ModelP ? )( 1+IModelP represent the prior probabilities for model 1 to I+1. We use a uniform prior for the models due to the absence of any reason favoring one model to another, therefore the probabilities for all the models are the same and are all equal to )1/(1)( += IModelp i (17) This simplifies the posterior odds above to the form of a Bayes factor: )|( )|( )|( )|( 11 21 11 12 21 Modelp Modelp Modelp Modelp y y y y ==? (18) Thus the weight that we should place on a model in the prediction of another subject?s data is based on the ratio of the likelihood of that model over the likelihood 23 of the subject?s own model. The ratio is based on the likelihood calculated from one model?s particles divided by the likelihood calculated using the particles form another model. The same observations are used for the calculation of both likelihoods. We calculate the between-subject weights for all combinations of subjects, while treating the aggregate model as one of the candidate models. We thus obtain values 11? to 1,1 ++ II? . The model-averaged prediction of the expected log-listening duration of subject 1 for song 1 is given by the expression: ?? +=+= 11 11111 11 */)(?** Ii iiiIi ii yE ???? (19) In this equation 1i? has the value of one, that is, the target individuals? indicator is always set to one so that the targets individuals? data are always used in the prediction. The predictions of the expected log listening duration for the subjects are used in the experimental design stage to generate the play-list as described in the next section. In the experimental design stage customized individualized music play- list are created for all the subjects. Section 5: Bayesian Experimental Design The optimal music play-list that we create determines the songs that are downloaded into the mobile audio devices. The term optimal needs some clarification. Fully optimal designs for nonlinear models with unknown parameters are not obtainable in practice. We define a design to be optimal in the Bayesian sense and given a specific prior, if there is no better design to be found under our multi- criteria objective unless the parameters are known exactly in advance. 24 When deriving the optimal music play-list for our subjects, we face a sequential decision problem that seeks a balance between immediate payoff on one hand and immediate information that might lead to increased future payoff on the other hand. Immediate payoff is maximized when we maximize a subject?s predicted utilities, while immediate information is maximized when we maximize the efficiency in estimating a subject?s parameters. Maximizing immediate payoff, that is expected utility, has the disadvantage that the design converges to the same set of songs on all future recommendations. That, in turn makes some parameters inestimable, because subsets of songs that enable the identification of such parameters (i.e. genres) no longer appear in the design. For example, a subject may show a strong preference for Jazz music from the past estimates. We maximize immediate payoff by recommending only Jazz songs to this subject. However, this makes it impossible to estimate the other parameters because songs of the other genres are no longer in the play-list. Instead of using the immediate payoff maximizing approach, we generate a sequence of play-lists that dynamically optimizes both expected utility and parameter efficiency over a finite time horizon. A multi-criteria experimental design problem that optimizes a combination of an outcome and an information criterion is discussed in Verdinelli and Kadane (1992). The objective function takes the form of [ ]2/)(|log ?? RE d'd'd ++ xx1y (20) Where 1y 'd represents the predicted outcome as result of the experiment design, 'dx is the design matrix, 2/?R is the prior precision matrix, and ? is a weight reflecting the relative emphasis that are placed on the outcome versus the Shannon 25 Information, or Bayes D-optimality, criterion 2/)( ?Rd'd +xx . What is different between the design of Verdinelli and Kadane (1992) and our design is that unlike ours, their design is not sequential. Even in a case of static multi-criteria design, balancing between the gains in information against current yield is difficult. Often, good performance with respect to one criterion works against the performance of others. The challenge then is to understand how the performance of the design changes when we trade off one criterion for another, as reflected in the weight ? . There is no hard and fast rule on what the value of ? should be, and in many cases it is determined subjectively, by the end user of the objective function or through some experimentation (Verdinelli, 1992; Verdinelli and Kadane, 1992). We derive it for a batch sequential design problem. We extend the work of Verdinelli and Kadane (1992) to a batch sequential experimental design, because of the way our data are collected and the way we customize the play-list. Each time we customize the play-list for a subject, we obtain new data on the subjects? music preferences, and based on the data we have received so far we customize a play-list for the next round of listening. It is batch sequential design rather than a fully sequential design because our play-list involves designing a set of songs (a batch of design points) rather than just a single song (a single design point). Whether the use of a sequential design gives a better estimation than a static design depends on how good the initial estimate of the parameters is. When the initial estimates are poor, sequential design provides an advantage over static designs, however when the initial design yields estimates that are very close to the real 26 parameter values sequential design may lead to a lower estimation efficiency (see Ford, Titterington and Kitsos, 1989 for a discussion). In our model, we derive the optimal play-list for subject i by maximizing the function |/|log}]1*)[log{()( 221 iddddQq iq iii RQkNyE ?g757 g757g757 xx' ++?+? = (21) In the equation above, Q is the number of songs in our optimal play-list, which is defined a-priori. )( iqyE is the prediction of the expected log listening duration for song q based on the model averaging we described in the previous section, given that song q is one of the songs in the optimal play-list. The i? subscript in the design matrix i dg757x and the prior precision matrix 2/ iiR ?g757 indicates that the variables that are taken out of the model during the variable selection process are not used for designing the optimal play-list. The static weight ? in equation 20 is replaced by the expression 2}]1*)[log{( +? QkN dd , here dk is a count variable that keeps track of how many play-lists we have generated for the subject so far, and dN is the number of designs that will be created before the weight drops to zero. In other words, dN is the finite time horizon over which the design is optimized. The ?+1? is added to the expression so that when dk = dN , the weight? has a value of zero, and the optimal play-list is derived by solely maximizing )( iqyE . Pronzato and Thierry (2003) prove that designs generated based on the criteria in the form of equation (21) yields both approximately unbiased estimates of the coefficients and optimal designs. 27 In deriving the optimal play-list, we use a design procedure that maximizes the outcome and the design efficiency criterion simultaneously. The outcome component of the criterion )( iqyE maximizes the subject?s expected utility. The design criterion improves our future inference of the subject?s model parameters. That is, it increases the ?spread? of the design to continue to provide information on the individuals attribute coefficients, even as the play-lists are successively zoomed in on the subjects? preferences as data accumulate. The objective of formulating the weight as we do above is to ensure that a greater importance is placed on getting a more accurate estimate of the parameters in the initial designs. As the estimates improve in subsequent designs, we place greater emphasis on increasing the utility that the songs provide. The value of dN , although any reasonable time horizon may be chosen, is calibrated in the simulation done in the next section. It is desirable for the value of dN to be somewhat large so that the songs in the play-list do not converge quickly to a very similar type of songs. However, a larger value of dN means a lower overall utility of the songs in the initial design and if the subjects? parameters change the later designs may not be attuned to the individual estimates well. We choose a reasonable value of dN through our simulation study such that the initial design will not perform worse than an ad-hoc design, while we ensure that not an excessive number of play- lists are generated based on the same parameter values. Providing utilities that are too low in the initial design will greatly discourage the subject from using our recommendation system, and we would like the expected utilities of the play-lists to gradually increase over the finite time horizon dN . 28 Using the expression in equation 21, the optimal design is derived using a modified Fedorov method (Cook and Nachtsheim, 1980) and is implemented in a program written in R. The method we implement involves choosing an initial design, which we derive using a ?Greedy? criterion of maximum design utility. At each iteration, the algorithm will exchange each song in the design with a song chosen from the song database, so as to optimize the design according to the multi-criteria. If an improvement is achieved, the new design is adopted. This iterative replacement method is carried out until there is no further improvement even after a complete sweep of all the songs available. We evaluate how well our model and recommendation system perform using both simulation studies and an experiment. The details of the simulation studies are described in the next section. After the section on simulation, the experimental study is detailed next. Chapter 3: Simulation Study As an initial test of our model, we ran two simple simulation studies using five artificial subjects. The first simulation study is used to test how well the sequential Monte Carlo procedure is able to recover the parameters from a simulated dataset. In addition, the study tests the effectiveness of the variable selection step. This is to see if the variable selection really does remove redundant variables from the model. In the second simulation study, song-specific constants were used to simulate the data. The objective of the study is to find out if the model averaging will improve the prediction of a subject?s listening duration for the songs. In addition, we compare 29 the performance of the play-list design of four different design approaches. These four approaches are: (1) a maximum utility design (2) an optimal design (3) a minimum dissimilarity design (4) a random design. The weight used for balancing the outcome and information criterion was also calibrated from the simulation results of the sequential effect of the weight on the designs. 30 Section 1: Simulated Data The data for the first simulation study were created using the parameter values shown in table 1 below. Table 1 Parameter used to generate simulated data Variables Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 Constant 0.00 0.00 0.00 0.00 0.00 Volume 0.00 0.00 0.35 0.25 0.35 Tempo 0.00 0.35 0.35 0.00 0.00 Voice 0.50 0.45 0.00 0.35 0.35 Size 0.00 0.00 0.00 0.55 0.05 Purpose 0.00 0.00 0.35 0.00 0.00 Mood 0.50 0.00 0.00 0.00 0.30 Alternate 1.00 0.00 0.00 0.00 0.00 Dance 0.00 0.35 0.00 0.00 0.00 Jazz 0.00 0.00 1.00 0.00 0.00 Latin 0.00 0.00 0.00 0.00 0.00 Metal 0.00 0.00 0.00 0.00 0.00 Pop 0.00 0.00 0.00 0.00 0.00 Hip_Hop 0.00 0.00 0.00 0.00 0.00 Reggae 0.00 0.00 0.00 0.00 0.00 Rock 0.00 0.00 0.00 0.25 0.00 Soul 0.00 0.00 0.00 0.00 0.00 Vocals 0.00 0.00 0.00 0.00 0.20 Std Dev 0.50 0.50 0.50 0.50 0.50 The values of the standard deviation (i.e. ?) for all the subjects were set at 0.5. This gives a hazard functional curve in which the subjects need some listening duration before making a decision on whether to accept or reject the songs. The song attributes used in the simulated datasets consists of variables that describe the genre that a song belongs to in addition to the characteristics of the songs. The genre used for our model indicates if a song is: Alternative/Indies, Electronic/Dance, Jazz, Metal, Pop, Rap/Hip Hop, Reggae/Ska, Rock , Soul/R&B or Vocal. The variables that describes a songs musical context include: the perceived loudness of the song, the song?s tempo, whether the voice of the song is more instrumental or vocal, whether the song is more of a solo or a more orchestra performance, the purpose of the song be it for plain listening or dancing, and finally whether the mood of the song is happy 31 or angry. We obtained these song attributes for 400 of the actual songs that will be used for an experiment. The attributes of each of the 400 actual songs were obtained using the information extracted from the www.musiclens.de music recommendation site. Two blocks of data were generated for each subject in both of the two simulation studies. Each block of data was generated by randomly assigning 50 songs to each subject. To ensure that we have a dataset that is of a reasonable size, we assumed that the subject listening to the each assigned song for ten times and therefore giving 500 observations in each block of data. The censoring of the data is kept at a level roughly between 25-35%. For the first study the data were simulated using the parameter values in table 1. For the second study the data were simulated with an additional song-specific constant for each song at a value between 0.00 and 0.25. Section 2: Estimation In both studies, the first block of data was used to generate the initial particles using a Metropolis-within-Gibbs sample draw for the individual model. 2500 iterations were used for the MCMC procedure and a burn-in of 2000 iterations was used. This provides us with 500 particles. The variable selection step was applied on the individual models. The variable selection indicator vector i? was updated via a Metropolis search by randomly changing one of the P indices in oldi? (e.g. oldip? ). When the existing value of the index oldip? was one, we set the new value newip? to zero. This is equivalent to setting the corresponding value newip? to zero if oldip? is not zero. If 32 the existing value of the index oldip? was zero, we set the new value newip? to one by assigning the value of newip? to a non- zero coefficient taken from the oldi? vector. A variable which had a non-zero coefficient was then chosen randomly and was given a coefficient of zero. This procedure is similar to the one used in Sha, Tadesse and Vannucci (2006). In the aggregate model, a Gibbs sample draw was done only for the coefficients of the songs heard by at least one of the subjects. This keeps the coefficients of the unheard songs, and therefore the song-specific constants, at zero. No variable selection and Sequential Monte Carlo procedure was used on the aggregate model. The Metropolis-within-Gibbs sampling was applied only after the second block of data and, only the most recent listening duration of a subject for a particular song was used in the aggregate model. The composition of the songs heard in the two data blocks are different, which means that running the MCMC sampling after the second data blocks would result in a higher number of song-specific constants. 2500 iterations were used for the MCMC procedure and a burn-in of 2000 iterations was used generating 500 particles. With no Sequential Monte Carlo procedure applied, we assigned an equal importance sampling weight to the particles in the aggregate model. In the first study, the simulation ends at the Sequential Monte Carlo step. The predicted listening durations for the subjects for all 400 songs using the estimated parameters are compared to the predicted value using the initial simulated parameters. In the second study, the model averaging and experimental design stage are 33 implemented after the Sequential Monte Carlo procedure. The results of the simulations are shown in the next section. 34 Section 3: Simulation results The estimation of the subjects? parameters in the first study is shown in table 2 below. An additional table (i.e. table 3) shows the estimates when there is no variable selection applied. The values are obtained after using two blocks of data and after applying the Sequential Monte Carlo procedure. Table 2 Parameter estimates in the first simulation study (with variable selection) Variables Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 Constant 0.00 0.00 0.00 0.00 0.00 Volume 0.00 0.00 0.34 0.19 0.44 Tempo 0.00 0.42 0.42 0.00 0.00 Voice 0.52 0.47 0.00 0.52 0.46 Size 0.00 0.00 0.00 0.81 0.00 Purpose 0.00 0.00 0.38 0.00 0.00 Mood 0.57 0.00 0.00 0.00 0.38 Alternate 0.88 0.00 0.00 0.00 0.00 Dance 0.00 0.50 0.00 0.00 0.00 Jazz 0.00 0.00 0.48 0.00 0.00 Latin 0.00 0.00 0.00 0.00 0.00 Metal 0.00 0.00 0.00 0.00 0.00 Pop 0.00 0.00 0.00 0.00 0.00 Hip_Hop 0.00 0.00 0.00 0.00 0.00 Reggae 0.00 0.00 0.00 0.00 0.00 Rock 0.00 0.00 0.00 0.00 0.00 Soul 0.00 0.00 0.00 0.00 0.00 Vocals 0.00 0.00 0.00 0.00 0.03 Std Dev 0.69 0.67 0.51 0.83 0.63 Table 3 Parameter estimates in the first simulation study (without variable selection) Variables Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 Constant 0.00 0.00 0.00 0.00 0.00 Volume 0.04 -0.04 0.38 0.23 0.44 Tempo -0.04 0.51 0.40 -0.04 0.06 Voice 0.59 0.52 0.01 0.57 0.52 Size -0.01 -0.01 0.03 0.79 0.04 Purpose -0.06 -0.01 0.41 -0.01 -0.03 Mood 0.60 0.03 -0.01 -0.02 0.35 Alternate 0.98 -0.11 -0.03 -0.08 -0.16 Dance -0.07 0.41 0.00 -0.09 -0.31 Jazz -0.05 -0.18 0.63 -0.23 -0.13 Latin -0.02 -0.05 -0.05 -0.03 -0.01 Metal -0.01 -0.07 0.02 -0.09 -0.03 Pop -0.01 -0.01 -0.07 -0.05 -0.13 Hip_Hop -0.03 0.06 -0.08 -0.06 0.01 Reggae 0.00 -0.01 0.00 0.00 0.00 Rock -0.05 -0.15 -0.11 0.24 -0.17 Soul -0.11 -0.04 -0.07 0.02 -0.03 Vocals -0.13 0.01 0.01 0.00 -0.02 Std Dev 0.75 0.67 0.44 0.78 0.57 Comparing the values in table 1, 2 and 3, the variable selection seems to work well. It effectively forced the non significant parameters to a value of zero. In the 35 case of subject 4 the genre parameter is set to 0 even when it was given a positive value during the data simulation. We attribute this to the fact that the song characteristics have captured enough of the variability to make the genre variable redundant. This is consistent with our argument that variable selection helps to deal with non essential and potentially numerous genre variables. Without using the variable selection step in our MCMC procedure, the parameters are more spread out and non essential variables are given non zero coefficients. The parameter recovery is also reasonable, with the credible intervals that are tightly bounded. Table 4 shows the statistics of the posterior distribution of the parameter. At the interest of space, only the minimum, maximum, mean and median values of subject 1 are shown. The posterior distributions of the other four subjects have the same pattern as that of subject 1. Table 4 Credible intervals of parameter estimates in the first simulation study Variables Min Mean Median Max Constant 0.00 0.00 0.00 0.00 Volume 0.00 0.00 0.00 0.00 Tempo 0.00 0.00 0.00 0.00 Voice 0.47 0.52 0.51 0.57 Size 0.00 0.00 0.00 0.00 Purpose 0.00 0.00 0.00 0.00 Mood 0.49 0.57 0.58 0.64 Alternate 0.74 0.88 0.84 1.06 Dance 0.00 0.00 0.00 0.00 Jazz 0.00 0.00 0.00 0.00 Latin 0.00 0.00 0.00 0.00 Metal 0.00 0.00 0.00 0.00 Pop 0.00 0.00 0.00 0.00 Hip_Hop 0.00 0.00 0.00 0.00 Reggae 0.00 0.00 0.00 0.00 Rock 0.00 0.00 0.00 0.00 Soul 0.00 0.00 0.00 0.00 Vocals 0.00 0.00 0.00 0.00 Std Dev 0.69 0.69 0.69 0.69 In the second simulation study, we look at the performance of the model averaging in our algorithm. The Mean Squared Errors (MSE) of the predicted listening duration for all the 400 songs are used to compare the prediction based on model average against the prediction using only subject?s own particles. The Mean 36 Squared Errors are calculated using the predicted listening duration minus the ?actual? listening duration of the 400 songs. The listening durations before there is any censoring are compared. The ?actual? duration is calculated using the original parameter values that we have used to simulate the data. The calculation is deterministic. As shown in table 5, compared to the model calibrated on the data itself, the improvement from the model averaging is substantial. The MSEs under model averaging are close to half of those without model averaging. Table 5 Comparison of Mean Squared Errors (MSE) of predictions with and without model averaging Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 MSE (with Model Averging ) 3.30 2.78 6.60 16.76 3.66 MSE (Without Model Averging) 6.34 3.46 11.10 43.29 4.71 t-value for difference in means -3.420*** -1.824* -4.79*** -7.36*** -2.145** Note: * reflects < 0.1 significance, ** reflects < 0.05 significance, *** reflects < 0.01 significance, We compare the performance of the optimal design, a design generated using our model, against three other designs generated using: (1) maximum predicted utility (2) minimum dissimilarity and (3) random approach. The maximum utility design creates the play-list that gives the highest predicted listening duration derived from the model averaging. This design maximizes immediate payoff without any consideration for immediate information. Comparing this design to the optimal design will give an indication of how our model tradeoff immediate payoff for immediate information. The minimum dissimilarity design is derived by minimizing the average Gower distances of the songs in the play-list from the songs that a subject prefer in the subject last run of listening. We infer that a subject prefers a song when the subject finishes listening to the song - in other words the duration is censored. The 37 minimum dissimilarity design uses a clustering algorithm similar to the one used by Pandora.com. Pandora.com, an existing commercial music website, recommends new songs to an individual by looking for songs that have attributes closest to the songs the individual indicated as desirable. The minimum dissimilarity design enables us to compare our model with a commercially available recommendation system. The random design creates a play-list by choosing songs randomly for each subject. This is the design we use when we do not have any preference data, and the design is used to generate our initial play-lists in our experimental study. With the incorporation of preference data, the three other designs should provide higher utilities than the random design. In other words, the random design helps to detect any error in the algorithms of the three other designs. The performances of the four designs in terms of total utilities and posterior precisions are shown in table 6. Table 6 Comparison of design performances Subj 1 Subj 2 Subj 3 Subj 4 Subj 5 Maximum utility design Total utility 458.33 318.22 400.09 484.31 361.87 Posterior Precision 31.01 24.81 26.02 25.51 27.67 Optimal design Total utility 416.29 318.22 400.09 465.89 357.98 Posterior Precision 30.98 24.81 26.02 25.45 27.60 Minimum dissimilarity design Total utility 306.10 281.49 267.63 289.73 271.20 Posterior Precision 31.26 24.88 26.24 25.62 27.58 Random design Total utility 229.09 153.25 173.62 214.88 187.22 Posterior Precision 30.96 24.54 26.04 25.29 27.37 Note: The optimal design refers to the design generated using our model. Table 6 shows the performance of the optimal design when we set the value of dN in equation 21 to ten. In other words, ten designs are created before the weight 38 applied on the Shannon information drops to zero. Operationally this implies that we re-estimate a subject?s preferences after we have created ten music play-lists for the subject. Over the duration in which we create the ten music play-lists, we assume that the subjects? preference parameters are constant. After the tenth design, the estimation process starts from the MCMC procedure again which generates a new set of initial particles to be used in the Sequential Monte Carlo estimations that follow. Looking at table 6, the random design always gives the lowest utility. This is reasonable since the random design does not incorporate any preference data. Since it is a random design, the posterior precisions of the parameters estimates may or may not be higher than the maximum utility design depending on the random design composition. The optimal design, created using our model, is obtained from the maximum utility design by trading off utility for D-optimality. This means that in most cases, the utility of the optimal design is lower than the maximum utility design. The minimum dissimilarity design gives a lower utility but higher posterior precision than the optimal design. On the other hand, the minimum dissimilarity design does not predict subjects? preferences as well as the optimal design. We simulate the changes in the actual utility of the optimal design from design one to ten for subject five to see if design utilities do improve through sequential design generation. The transition of the actual utilities for the optimal design for this subject is shown in figure 2. The first design is created randomly rather than using the multi-criteria objective, the ten other designs that follows are the ones created optimally. Actual utility of the design is calculated from the true parameters of this subject. Notice that the actual utility increases with each subsequent design. This is a 39 result of an improvement in the estimation of the parameters with more data, and a result of a lower emphasis placed on the Shannon information criterion. Figure 2 Transition in actual utilities for the optimal designs during sequential design generation The changes in song compositions in each design created sequentially are shown in figure 3. Using the value of one to indicate that a song is chosen to be in the design and a zero otherwise, we can calculate a correlation value between one design and the design that precedes it. A correlation value of one indicates that a design has not changed since the previous design, while a correlation value of zero indicates that the composition of songs has changed completely. As it is shown in figure 3, the correlation value varies between 0.80 and 1.00. The figure shows that the improvement in utilities from the sequential generation of the optimal design is a result of a change in song compositions over and above the change in the weight applied on the Shannon information. 40 Figure 3 Change in song compositions during the sequential generation of the optimal design Our recommendation system appears to perform the way that it should based on the results of the simulation study. The optimal designs created based on our model provide higher utilities compared to the random and the minimum dissimilarity design. The minimum dissimilarity design, a clustering algorithm used in Pandora.com, will be used as the benchmark to compare our model in the experimental study. In addition, the utilities of the optimal design converge towards the maximum over the duration in which the 10 different sequential designs are created. As a further proof of our concept we ran an experiment involving real subjects and a working music recommendation system. The details of the experiment are described in the next section. 41 Chapter 4: Experimental Study Section 1: Data collection The experiment was run using an actual implementation of the recommendation system. The experiment required the subjects to listen to the songs played from a Palm TX PDA for two waves of listening. In each wave songs were played by the PDAs from the individual play-lists that were downloaded. The play- lists used for the first wave were generated randomly, while the play-lists in the second wave were generated using either the customization procedure described in our model, or a heuristic customization procedure we used as a benchmark. We developed our recommendation system from scratch and two programs were written for this experiment. The first program was written for the PDAs to allow them to play Mp3 songs according to a play-list, and also to act as data collection instruments. The programming of the PDAs was done using the Handheld Basic Software developed by Peter Holmes Consulting. The screen shots of the PDAs program is shown in figure 4. 42 Figure 4 Screen shot of PDA software The second program was written in R for a desktop computer. The R program implements the estimation and customization procedures using ours, and the benchmark customization procedures. Estimations of subjects? parameters were made using the data downloaded from the PDAs to the desktop computer. Finally, the individually customized play-lists generated in the desktop computer were downloaded into the PDAs to determine the actual songs the PDAs play. Subjects in the experiment are undergraduates and graduate students from an eastern United States university who volunteered to participate in the experiment in the period of April and May 2007. Subjects who finished the whole experiments were given monetary rewards in addition to the chance of winning a Palm TX PDA. The 86 subjects who participated in the experiment were solicited using emails without any screening based on demographic criteria. Of the subjects 77% were aged between 18 to 21 years old, 20% were aged 22 to 29 years old, while the rest were 30 43 years old and above. In addition, 37% of the subjects are male, while 63% of the subjects are female. Subjects who signed up for the experiment knew that the experiment involved understanding music preferences. They were informed that the experiment involved two waves of listening, and that their listening behaviors were recorded. About one- third of the subjects were randomly chosen to be in the control group. The subjects in the control group were subjected to the same procedures as the experimental group except that their play-lists are customized using a clustering procedure. For the subjects in the experimental group, their play-lists were customized using the procedures described in our model. The data captured in the experiment are records of subjects? listening duration for the songs played from the PDAs. 16,835 data points are collected, where each data point represents an instance in which a particular subject had listened to a particular song. To collect the data while the subjects were in their ?natural? situations, the PDAs were issued out to the subjects during the duration of the experiment. After the subjects had collected the PDAs, they had the liberty to decide when and where to listen to the songs. There were two waves of listening for each subject, with each wave lasting five days each. There was a gap of two days in between the two waves so as to allow the researchers to retrieve the data from the PDA from the first wave, analyze the first wave data, and finally customized the play- lists for the second wave. Due to the limitation on the number of PDAs available, the subjects were broken up into three different batches. The running of the experiment for these three batches was completed in one and a half months. To supplement the 44 data collected through the PDAs, subjects were asked to complete up a short survey, shown in the appendix, after the whole experiment was over to rate the quality of songs in the play-lists both in wave one and two. Play-lists were used to determine which songs to play in the PDAs. Each play-list holds the name of 50 different Mp3 songs out of the universal set of 400 unique Mp3 songs used for the experiment. At the first wave of listening, randomly generated initial play-lists were used to determine which songs to play to the subjects. These initial play-lists were generated randomly and were not customized to the preference of the subjects due to the lack of prior preference data. The songs were played sequentially in the PDAs, and the subjects had the option of skipping over the song if they disliked them. In addition, the participants also had the options of pausing the current song played, resuming the playing of the paused song, and stopping the song altogether. The subjects however, were not given the option of deciding which song should be played next, to prevent order effects. The duration data captured indicate how long a song is listened to until the subject consciously chose to skip or stop the song that is presently playing. Five days after the PDAs were issued to the subjects for the first wave of listening, the PDAs were returned for data download. The duration data from the mobile music devices was downloaded into the desktop computer for preference estimation and play-list customization. Customized play-lists were generated for the second wave. These play-lists were generated using either our customization procedure or a benchmark customization procedure. The benchmark procedure used a clustering algorithm and is described in more details in the next sub-section. Two days after the PDA were 45 returned from the first wave, the subjects collect the PDAs again for the second wave of listening. Section 2: Estimation and customization procedures The estimation and customization procedures based on our model were used to create the second wave play-lists for the subjects in the experimental group. For those subjects in the control group a clustering procedure was used. The estimation and customization procedures based on our model were carried out in the following steps: (1) 500 particles using the first wave?s data for each subject were generated, where each particle was a vector containing the estimated values of a subject?s parameters. These particles were generated using the data on listening duration and song attributes similar to the ones we used in the simulation section. These song attributes consist of variables that described the genre that a song belongs to in addition to the characteristics of the songs. 2500 iterations were used for the MCMC procedure with a burn-in of 2000. The variable selection step was also applied to the MCMC procedure. (2) An aggregate model was estimated to generate song-specific constants. 500 particles were generated for the aggregate model. In the context of the aggregate model, each particle corresponds to a vector containing an estimate of the different song constants. (3) After the individual and aggregate models? particles were estimated, the model averaging step was used to predict the listening duration of each subject for the 400 different songs. In the process, subject weights were generated. These weights were used to determine how much of one subject?s parameter should be used to predict another target subject?s listening duration. (4) The Bayesian Experimental Design procedure was used to customize the play-list for 46 the individual subjects. Similar to the simulation, the weights used in the optimization criteria were such that the system converges to a maximum utility design after ten customizations. We used a heuristic clustering algorithm as the benchmark to compare our model against. Such a heuristic procedure is conceptually similar to the one used by for example Pandora.com, an existing internet radio recommendation system. For the purpose of the benchmark algorithm we inferred that a subject likes a particular song if the subject finished listening to it during the first wave of listening. The clustering procedure generated a play-list for this subject in the second wave by choosing the songs that were most similar to the preferred songs. Similarity was measured using the Gower distances between one song from the next in the song attribute space. The clustering procedure minimizes the total Gower distances of the 50 songs in the play- lists. Section 3: Results of the experiment The lognormal hazard function is used in this paper to model subjects? listening behavior. For the lognormal hazard function, the standard deviation of the distribution determines the function?s shape. A small sigma value (e.g. 0.5) describes a non-monotone hazard function, and as the value of sigma increases (e.g. around 2.0) the hazard function adopts a shape closer to a downward sloping curve ( Figure 1). The distribution of 86 subjects? sigmas based on the first wave (i.e. standard deviations of the estimated parameters) is shown in figure 5. 47 Figure 5 Sigma values of the estimated experiment subject?s parameters Given that the sigma values of in figure 5 is closer to 2.0 then 0.5, a downward sloping hazard function generally describes the subject?s listening behavior. This implies that the subjects? do not need to listen to the songs for long before they could decide whether a song is preferable. In addition, the tendency to continue listening to a song increases as the listening duration increases. Figure 6 and 7 gives an indication on how the subjects chose which songs to listen to. Figure 6 shows how long in milliseconds that the songs are listened to by the subjects. Figure 7 show the ratio between listening duration and the length of the songs. Both figures show that apart from the songs which the subjects chose to finish, the portion of songs that was skipped early was higher than the portion of songs that are skipped later into the song. These two figures show further that the subjects make a quick decision on which songs to finish listening to. 1.00000 2.00000 3.00000 Subjects' sigma value 0% 5% 10% 15% Percentage of subjects 48 Figure 6 Percentage distribution of songs heard by the duration heard in milliseconds Figure 7 Percentage distribution of songs by the ratio of listening duration to song length.. 100000 200000 300000 400000 500000 Listening duration (ms) 0% 5% 10% 15% 20% Percentage of songs heard 0% 25% 50% 75% 100% Ratio of listening duration over the length of song Percentage of songs heard 0% 10% 20% 30% 49 Figure 8 and 9 demonstrate the effectiveness of the variable selection procedure. They show the distribution of the subjects by the number of parameter estimates which are set to zero by the variable selection procedure. Figure 8 shows the distribution for the six variables that describe the song characteristics (e.g. perceived loudness of the song, song tempo, etc). Figure 9 shows the distribution for the eleven variables that describe the song genres (e.g. Jazz, pop, etc.) Figure 8 Distribution of experimental subjects by the number of non-zero song characteristic coefficients. Number of non-zero coefficients for the variables describing song characteristics 1 2 3 4 5 0% 25% 50% 75% 100% Cumulative percent of subjects 50 Figure 9 The distribution of experimental subjects by the number of non-zero coefficients for describing song genres. From figure 8 and 9 we can see that the preferences of about 14% of the subjects can be described using only half of the song characteristics variables (figure 8) , and the preferences of about 20% of the subjects can be described using only half of the variables describing genres (figure 9). For these subjects, the variable selection eliminates the variables that are redundant and therefore simplifies the parameter estimation and predictions. The Sequential Monte Carlo estimation procedure is used to deal with the challenges of analyzing big dataset. In our experiment, the computation time using Sequential Monte Carlo methods is about 15% of the MCMC methods when the same dataset is used. A standard Markov Chain Monte Carlo method requires a complete scan of the dataset and also a large number of iterations to estimate the model parameters. Computational time for the MCMC method increases greatly with the 10 Number of non-zero coefficients for the variables describing song genres Cumulative percentage of subjects 2 4 6 8 0% 25% 50% 75% 100% 51 size of the dataset. The ability to break datasets into blocks helps to keep the computational time of the Sequential Monte Carlo method short. Therefore, the benefit of the Sequential Monte Carlo Method is more pronounced when the size of the dataset is greater than the dataset we have in our experiment. We used the aggregate model to capture the impact of individual song?s unique characteristics which are not otherwise reflected in the prediction by the individual level models. This aggregate model uses a dummy variable for each song across all subjects. The coefficients of the aggregate model therefore represent the song-specific constants. The distribution of the song-specific constants is shown in table 7 and in figure 10. The song-specific constants follow a normal distribution with a high variance. The normal distribution is a result of the use of a normal prior when we estimated the constants. We infer that the 400 songs that we used for our experiment represents a diverse collection of songs with very different characteristics. We have deliberately chosen the songs to be greatly different in order to better cater to the diverse taste in music among the experiment subjects. Table 7 Distribution of the song-specific constants.. Mean Median Min Max 25th Percentile 50th Percentile 75th Percentile Standard Deviation -0.11 -0.12 -1.50 1.82 -0.42 -0.12 0.20 0.46 52 Figure 10 Distribution plot of the song-specific constants. Our motivation for using model averaging is to improve our predictions of a subject?s listening duration. We combine individual models (i.e. sets of parameter estimates) based on how well they predict the target individual?s data. The weight that we place on a model in the prediction of another subject?s data is based on the ratio of the likelihood of that model over the subjects? own model. We calculate the between-subject weights for all combination of subject while treating the aggregate model as one of the candidate models. Table 8 below shows the distribution of subject weights. Table 8 Distribution of subject weights. Mean Median Minimum Maximum Standard Deviation 0.87 0.91 0.05 1.35 0.19 On average a weight of about 0.87 is applied to other subject?s parameters to predict the target subject listening duration. There are instances where one subject?s Value of song-specific constants Percentage -1.00 0.00 1.00 0% 4% 8% 12% 53 parameters are poor predictor of another subject?s listening duration (i.e. when the weight is at 0.05), but in other instances another?s subject?s parameters maybe a better predictor for the target?s subject preferences. It is in the instances when the weights are above 1.00 that the model averaging approach helps to improve the predictions of the subjects? listening duration substantially. This happens in about 16% of the cases. The distribution of the subject weights is shown in figure 11. Figure 11 Distribution of subject weights. In table 9, we compare the performance of the play-lists in terms of how well they recommend songs to subjects. The performances of the random play-lists, and also the play-lists generated using of our model and the benchmark model are shown. The comparisons are done using two criteria: (1) the percentage in which the subjects listened to each of the songs presented to them. (2) The proportion of all the songs presented to the subjects which they finished listening. Subject weights Percentage 0.25 0.50 0.75 1.00 1.25 10% 20% 30% 40% 54 Table 9 Comparison of play-lists? performance in terms of how well they recommend songs. Mean percentage of each song that the subjects have listened to Mean proportion of the songs that the subjects have finished Listening to (1) (2) (3) (4) (5) (6) Play-lists generated by our model 55.00% 55.00% 0.46 0.46 Play-lists generated by benchmark model 44.79% 44.79% 0.36 0.36 Random play- lists 44.09% 40.57% 0.34 0.30 Difference in mean 10.21%* 10.91%* 4.22%* 0.10* 0.12* 0.06* Percent improvement 22.78% 24.74% 10.41% 27.78% 35.29% 20.00% Note: * reflects < 0.01 significance Based on the results shown, our model performs 23% better than the benchmark model in terms of percentage of each song listened to (Column 1, table 9) ,and 28% better for the proportion of songs that the subjects have finished listening (Column 4, table 9). This comparison is done using the data in the second wave of listening because it is the data based on customized play-lists. When we compare the listening behavior of the subjects before and after we have customized the play-lists (i.e. comparison of wave one and wave two data), the play-lists customized using our model results in a 25% (column 2, table 9) improvement in the percentage of each song listened to, and a 35% improvement in the proportion of songs that the subjects finished (column 5, table 9). Comparatively, the play-list customized using the benchmark model results in a 10% improvement in the percentage of each song listened to (column 3, table 9), and a 20% improvement in proportion of songs that the subjects finished (column 6, table 9). Looking at the relative ratios of improvement, our model is 2.59 (i.e. 10.91% / 4.22%) times better than the benchmark model in increasing the percentage of each song listened to, and is 2.00 times (i.e. 0.12 / 0.06) better than the benchmark model 55 in increasing the number of songs that the subjects finished. These results show that our model is indeed better than the benchmark model. To provide an indication of the best case and worst scenarios for the performance of our and the benchmark recommendation system we analyze the recommendation performances on an individual basis. For our model the changes in the percentage of each song listened to before and after the play-lists are customized range between -42.41% to 184.14% and have a median of 8.85%. Comparatively, the range for the benchmark model is between -64.40% and 124.32% with a median of 3.29%. In terms of the changes in the proportions of songs that the subjects finished our model has a range between -50.98% and 373.12% with a median at 13.08%. The respective percent changes for the benchmark model are -66.94% and 239.85% respectively. The median for the changes in proportion for the benchmark model is 8.32%. This means that our model has a less severe worst case scenario and a better best case scenario compared to the benchmark model both for the percentages of each song listened to and the proportions of songs that the subjects finished. By using the median and as a result reducing the effect of outliers, our model still performs better than the benchmark model. When we asked the subjects to indicate how well the play-lists perform using a post experiment survey, we are not able to see any significant difference in the way the subjects rate our model and the benchmark model, and in the way the subject rate the play-lists before and after they are customized. The actual survey used is shown in the appendix, and the results of the survey are shown in table 10. For the rating on subjects? song satisfaction, a value of 1 reflects that the subjects are very satisfied, 56 and a value of 7 reflects that they are very unsatisfied. The values in table 10 are around 3.0, which corresponds to the average rating of ?somewhat satisfying?. For the rating on the proportion of songs the subjects liked, a rating of 1 reflects that subjects like all the songs, and a rating of 5 reflects that subjects like none of the songs. A value close to 3.0 as shown in table 10 indicates that the subjects like only some of the songs in the play-lists. None of the differences in ratings are significant. The problem of using rating data in making recommendations is discussed in Rossi, Gilula and Allenby (2001). In this paper the authors show that rating data do not always indicate the true preferences. In our case, subjects reported ratings are not useful in making any preference predictions. Table 10 Subjects? ratings of the play-lists? performance obtained from post experiment survey Satisfaction with the songs Proportion of songs liked (1) (2) (3) (4) (5) (6) Play-lists generated by our model 3.63 3.63 3.11 3.11 Play-lists generated by benchmark model 3.59 3.59 3.14 3.14 Random play- lists 3.86 3.55 3.26 3.17 difference in mean 0.04 -0.23 0.04 -0.03 -0.15 -0.03 Note: For satisfaction, a rating of 1 reflects very satisfied and a rating of 7 reflects very unsatisfied. For the proportion of songs liked, a rating of 1 reflects that subjects like all the songs, and a rating of 5 reflects that subjects like none of the songs. None of the differences in ratings is significant. We investigate further into the results of our satisfaction measure by correlating the actual listening behaviors of the subjects with their responses in the post experiment survey. The satisfaction levels with the songs generated by the customized play-lists indicated in the survey correlate significantly with the actual 57 listening behavior when we combine the survey results of the subjects in the control and the experiment group. The correlation between the satisfaction levels and the percentages of each song listened to in the second wave is at -0.31, and the correlation between satisfaction levels and the proportions of songs that the subjects finished in the second wave is at -0.30. The correlations are negative because smaller values for the satisfaction measure indicate higher levels of satisfaction. Both of the correlations are significant at the 0.01 level. The results are different when we correlate the satisfaction levels with the songs in the second wave and the actual listening behaviors separately for the play-lists generated using our method and the play-lists generated by the benchmark model. For the benchmark model the correlations are significant at the 0.01 level. The correlations are -0.40 between satisfaction level and the percentages of each song listened to, and -0.39 between satisfaction level and the proportions of songs that the subjects finished in the second wave. The corresponding correlations for our model is at -0.11 and -0.15 respectively. Not only are the correlations smaller for the play-lists generated using our model, the correlations are also not significant. The performance of our model is shown to be better than the benchmark model. The fact that correlation between satisfaction level and actual listening behavior is significant for the benchmark model and not our model indicates that the subjects are stricter in indicating their satisfaction level when a recommendation system is performing better in contrast to a recommendation system that is performing poorer. This provides some explanations as to why our model does not result in a higher satisfaction level when compared to the benchmark model. This is inline with Gilula and Allenby (2001) observation that 58 rating data do not always indicate the true preferences. We run into the same problem when we look at the correlation between the proportions of songs that subjects like indicated in the survey and the actual listening behaviors. The correlations are significant at the 0.01 level for the play-lists generated using our model and are non significant for the play-lists generated using the benchmark model. The subjects indicated that they are wiling to pay about $6.65 on average for the recommendation system. This is lower compared to the $9.99 that Rhapsody offers for the song download. The lower willingness to pay for our system is explained partly by the fact that subjects do not always indicate their true preferences in a survey. For example, we did not find any significant results between willingness to pay and satisfaction level, and between willingness to pay and actual listening behaviors. A spontaneous remark on the willingness to pay was given by a subject in the survey form. She remarked: ?Free as a first time product trial. Maybe later if I like the service, I would consider upgrading my subscription for $5.00 extra a month?. This provide a possible explanation that apart from the problem of the subjects not indicating their true preferences in the survey, another possible reason for the lower willingness to pay is that the subjects may not see the recommendation system as a separate product but as a possible value add to existing system. In addition, subjects may want a trial period to further evaluate the performance of our recommendation system before they can provide a more accurate indication of their willingness to pay. 59 We move on to discuss our results in the next section and also provide a conclusion to this paper. Chapter 5: Discussion and conclusion The main objective of this paper is to develop a music recommendation system that automates the downloading of songs into a mobile digital audio device. We have taken the approach that involves as little effort as possible from the consumers? end. This provides some challenges to our recommendation system. First, we cannot use consumer demographic profiles or prior music preference questions to help in the generation of the initial play-list. Our use of a randomly generated initial play-list is due to the belief that asking for personal information may be considered an invasion of privacy, and may discourage the adoption of the recommendation system. However, we could in principle use such information to initialize our recommendation system. Second, we choose not to involve individuals when they visit the music downloading website, for example by asking them to indicate which songs they prefer after listening to a collection of music snippets. We have not incorporated this feature due to our desire to minimize the effort required from individuals during the process of music recommendation and downloading. Nevertheless, this could in principle be incorporated in an extension of our system. Third, we have also kept our Mp3 software interface simple to reduce the effort needed to operate the mobile music playing device. On the reverse side, subjects do not have the option to choose which songs they want to listen to. The simulation section of this paper demonstrated that our model does achieve it objectives in handling massive data and improving predictions through model 60 averaging. The speed comparison between MCMC and Sequential Monte Carlo was shown to be substantial. By using simulated data in the simulation, and thus knowing the true parameters, the Sequential Monte Carlo and variable selection procedures were shown to provide good estimates of an individual?s preferences. Experimental results show that variable selection does simplify estimation and prediction as different individuals differ in the number of variables need to definite their listening behaviors. The results also show that for some individuals, model averaging does in fact help to improve predictions. Looking at the results of the experiment, our model provides 23 ? 35% improvement in recommendations. This improvement is achieved in a single wave and in a natural experimental setting in which the subjects have a choice or when, where and how they want to listen to the songs. Running the experiment in a natural setting brings with it a set of new challenges. First, unlike recommendation systems based on secondary data we did not have the options of fine tuning our model in the midst of our experiment. Second, the listening context in which the subjects are in can be very different when they listening to the songs in the first wave versus the context they are in for the second wave. For example, a subject could be listening to the songs mostly when they are working out in a gym in the first wave, but chose to listen to the songs mostly when the subject is driving in the second wave. Preferences may change with the listening context and thus reducing the effectiveness of our recommendations. We expect such changes to diminish over time as recommendations are made in multiple waves. Future extension for our model could potentially address the issue of context effect on listening preferences. Our 61 experiment results show that individuals make their choices on which songs they prefer quickly. In addition, given the great speed in which the Sequential Monte Carlo procedure updates parameter estimates, it can be programmed into the mobile music playing device itself. The system then becomes a distributed computing system incorporating parallel computation on the PDAs for individual subjects. After an individual?s particles is downloaded from the website into the music device, the device can effectively chose which of the potential multiple play-lists to use after observing several of the individual choices. Third, we have less data points per subject per wave of listening compared to the 500 data points per subjects used in our simulation. Instead of the 500 data points, we have on average only 98 data points per subject per wave of listening, which translates to about 3 hours of listening on average. This resulted from the fact that we did not impose any total listening duration requirement on our subjects. The lower number of data points affects the accuracy of our recommendation system. We have also not utilized a maximum utility design when we recommend our songs, but one in which utility and precision are traded off in a single recommendation wave. We have deliberately incorporated precision into the criteria used to recommend a composition of songs because this is how the system operates in real-life conditions. Even in the simulation, the use of precision in this way reduces our design utility by about 3%. In conclusion, we believe that have achieve our objectives for developing a full working and directly implementable recommendation system. The system achieves dramatic improvements over a realistic and heuristic one, that itself has not yet been implemented as far as we are aware of. This system is one that automates 62 the downloading of songs into a mobile digital audio device based just on past listening behavior. One immediate refinement that can be done to our model is to improve the attributes used to differentiate one song from the next. The music genome project which characterizes different music with a different music ?DNA? is an natural direction to take in achieving this goal. Possible immediate application of our recommendation system is to incorporate it into the itune website, in which ipods are used as the music playing and data collection devices, but we envision many other useful applications. 63 Appendix: Survey on Music Preference Study To help us better understand your listening experience during the study please take a few minutes to complete this survey. Question 1 and 2 applies to your listening experience in the last five days. Please place a check in the circle that best describes your listening experience. Question 1: What proportion of the sings do you like? ? All ? Most ? Some ? Not many ? None Question 2: How satisfied are you with the songs? ? Very Satisfied ? Satisfied ? Somewhat satisfied ? Neither Satisfied or disatisfied ? Somewhat dissatified ? Dissatified ? Very dissatified Question 3 and 4 applies to your listening experience in the first five days. Please place a check in the circle that best describes your listening experience. Question 3: What proportion of the sings do you like? ? All ? Most ? Some ? Not many ? None Question 4: How satisfied are you with the songs? ? Very Satisfied ? Satisfied ? Somewhat satisfied ? Neither Satisfied or disatisfied ? Somewhat dissatified ? Dissatified ? Very dissatified Question 5: If the online music company decides to launch the website, it plans to provide its services through a monthly subscription plan. Consumers who subscribe to the service will have unlimited song downloads. As a comparision, Rhapsody (an existing music website) offers unlimited song download at a fee of $9.99 a month. How much would you be willing to pay for the monthly subscription to the website that offers automatic download of music? US$ _______________________ 64 References Ansari, Asim, Carl F. Mela. 2003. E-Customization. Journal of Marketing Research. 40(2), 131-145. Ansari, Asim, Skander Essegaier, Rajeev Kohli. 2000. Internet Recommendation Systems. Journal of Marketing Research. 37(3), 363-375. Ariely, Dan, John G. Lynch. Jr., Manuel Aparicio IV. 2004. Learning by Collaborative and Individual-Based Recommendation Agents. Journal of Consumer Psychology. 14(1&2), 81-95. Arulampalam, Sanjeev, Simon Maskell, Neil Gordon, Tim Clapp. 2002. A Tutorial on Particle Filters for On-Line Non-linear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing. 50(2), 174-188. Chaloner, Kathryn, Isabella Verdinelli. 1995. Bayesian Experimental Design: A Review. Statistical Science. 10(3), 273-304. Chipman, Hugh, Edward I. George, Robert E. McCulloch. 2001. The Practical Implementation of Bayesian Model Selection. Model Selection. Institute of Mathematical Statistics, Beachwood, Ohio. 38, 65-116. Cook, R. Dennis, Christopher J. Nachtsheim. 1980. A Comparision of Algorithms For Constructing Exact D-Optimal Design. Technometrics. 22, 315-324. Ford, Ian, D. M. Titterington, Christos P. Kitsos. 1989. Recent Advances in Nonlinear Experimental Design. Techometrics. 31(1), 49-60. Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, Chris T. Volinsky. 1999. Bayesian Model Averaging: A Tutorial. Statistical Science. 14(4), 382-417. Holbrook, Morris B., Meryl P. Gardner. 1993. An Approach to Investigating the Emotional Determinants of Consumption Durations: Why Do People Consumer What They Consumer For As Long As They Consumer It? Journal of Consumer Psychology. 2(2), 123-142. Kiefer, J., J. Wolfowitz. 1959. Optimum Designs in Regression Problems. The Annals of Mathematical Statistics. 30(2), 271-294. Kong, Augustine, Jun S. Liu, Wing Hung Wong. 1994. Sequential Imputations and Bayesian Missing Data Problems. Journal of the American Statistical Association. 89(425), 278-288.` Konstan, Joseph, Bradley N. Miller, David Maltz, Jonathan Herlocker, et al. 1997. Grouplens: Applying Collaborative Filtering to USENET news. Communications of the ACM. 40(3), 77-87. Montgomery, Alan L., Kartik Hosanagar, Ramayya Krishnan, Karen B. Clay. 2004. Designing a Better Shopbot. Management Science. 50(2), 189-206. Pronzato, Luc, ?ric Thierry. 2003. Sequential Experimental Design and Response Optimisation. Statistical Methods & Applications. 11, 277-292. Raghu, T.S., P.K. Kannan, H.R. Rao, Andrew B. Whinston. 2001. Dynamic Profiling of Consumers for Customized Offerings Over the Internet: A Model And Analysis. Decision Support System. 32(2001), 117-134. Rentfrow, Peter J., Samuel D. Gosling. 2003. The Do Re Mi?s of Everyday Life: The Structure and Personality Correlates of Music Preferences. Journal of Personality and Social Psychology. 84(6), 1236-1256. 65 Ridgeway, Greg, David Madigan. 2003. A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets. Data Mining and Knowledge Discovery. 7(3), 301-319. Rossi, Peter E, Zvi Gilula, Greg M. Allenby. 2001. Overcoming Scale Usage Heterogeneity: A Bayesian Hierarchical Approach. Journal of the American Statistical Association. 96(453), 20-31. Saha, Atanu, Lynette Hilton. 1997. Expo-power: A Flexible Hazard Function for Duration Data Models. Economic Letters. 54,227 ? 233. Sha, Naijun, Mahlet G. Tadesse, Marina Vannucci. 2006. Bayesian Variable Selection for the Analysis of Microarray Data with Censored Outcomes. BioInformatics. 22(18), 2262 ? 2268. Verdinelli, Isabella. 1992. Advances in Bayesian Experimental Design. Bayesian Statistics 4. Oxford University Press. 467-481. Verdinelli, Isabella, Joseph B. Kadane. 1992. Bayesian Designs for Maximizing Information and Outcome. Journal of American Statistical Association. 87(418), 510-515. Ying, Yuan Ping, Fred Feinberg, Michel Wedel. 2006. Leveraging Missing Ratings to Improve Online Recommendation Systems. Journal of Marketing Research. 43(3), 355-365.