ISR develops, applies and teaches advanced methodologies of design and analysis to solve complex, hierarchical, heterogeneous and dynamic problems of engineering technology and systems for industry and government. ISR is a permanent institute of the University of Maryland, within the Glenn L. Martin Institute of Technology/A. James Clark School of Engineering. It is a National Science Foundation Engineering Research Center. Web site http://www.isr.umd.edu

Institute for Systems Research Technical Research Report T.R. 99-3

Understanding Patterns of User Visits to Web Sites: Interactive Starfield Visualizations of WWW Log Data

Harry Hochheiser, Ben Shneiderman*
Human-Computer Interaction Lab, Department of Computer Science,
*Institute for Systems Research and Institute for Advanced Computer Studies,
University of Maryland, College Park, MD 20742
{hsh,ben}@cs.umd.edu

Abstract

HTTP server log files provide Web site operators with substantial detail regarding the visitors to their sites. Interest in interpreting this data has spawned an active market for software packages that summarize and analyze this data, providing histograms, pie graphs, and other charts summarizing usage patterns. While useful, these summaries obscure useful information and restrict users to passive interpretation of static displays. Interactive starfield visualizations can be used to provide users with greater abilities to interpret and explore web log data. By combining two-dimensional displays of thousands of individual access requests, color and size coding for additional attributes, and facilities for zooming and filtering, these visualizations provide capabilities for examining data that exceed those of traditional web log analysis tools.
We introduce a series of interactive starfield visualizations, which can be used to explore server data across various dimensions. Possible uses of these visualizations are discussed, and difficulties of data collection, presentation, and interpretation are explored.

Keywords: World-Wide Web, Log File Analysis, Information Visualization

1 Introduction

For WWW information providers, understanding of user visit patterns is essential for effective design of sites involving online communities, government services, digital libraries, and electronic commerce. Such understanding helps resolve issues such as depth vs. breadth of tree structures, incidental learning patterns, utility of graphics in promoting exploration, and motivation for abandoned shopping baskets.

WWW server activity logs provide a rich set of data that track the usage of a site. As a result, monitoring of site activity through analysis and summary of server log files has become a commonplace activity. In addition to several research projects on the topic, there are over 50 commercial and freeware products supporting analysis of log files currently available [2]. Unfortunately, these products tend to provide static displays of subsets of the log data, in a manner that can obscure patterns and other useful information.

Interactive visualizations of log data can provide a richer and more informative means of understanding site usage. This paper describes the use of Spotfire [17] to generate a variety of interactive visualizations of log data, ranging from aggregate views of all web site hits in a time interval to close-ups that approximate the path of a user through a site. We begin with a discussion of currently available solutions and research efforts, followed by examples of the visualizations created in Spotfire. Difficulties of data collection, presentation, and interpretation are discussed, along with suggestions for future improvements.
2 Current Efforts

Log analysis efforts can be divided into two categories: products and research projects.

2.1 Products

Products such as wwwstat [20], analog [5], HitList [11], and wusage [19] parse log files in order to produce aggregate reports, such as "transfers by request date", "transfers by URL/archive", most popular pages, visits by time of day or day of week, originating regions, user agent, or other criteria. While these packages focus on aggregate statistics, some provide user-level information, such as "example visits" or "document trails", which describe paths that have been taken through the site. Output reports are generally presented as tables, histograms, or pie charts. While these packages differ in the specific reports available, they generally share several characteristics:

• Static displays: Reports are generally presented on an HTML page, without interactive facilities.
• Low dimensionality of reports: Results are presented in a series of low-dimensional reports, which must be scanned sequentially.
• Lack of low-level details: Reports focus on aggregations, with minimal (if any) support for direct examination of records relating to individual page requests.
• Relative lack of flexibility: While most tools include some configuration functionality to allow for selection of reports to be generated, customization facilities are often quite limited.
• Lack of integration of knowledge of site layout: Reports are presented in terms of visits by URL, without any further information regarding the site layout.

More recently, products such as BazaarSuite [6] have augmented these facilities with additional reports tracking entry and exit sites and visual displays of user paths through a site.

2.2 Research Efforts

Since the early WebViz effort [16], various projects have revisited the issue of log display and visualization.
Disk Trees and Time Tubes [8] provide three-dimensional visualizations of web "ecologies", displaying the evolution of a web site over time, using attributes such as display line color or thickness to encode multi-dimensional information. Other efforts, such as Palantir [13] and Chitra [1], examined the use of log analysis for specific goals, such as understanding of patterns in geographic origin of requests or caching performance. However, these tools lack facilities for general-purpose, interactive exploration of log data.

Characterization and modeling of web-site access patterns has been an active area of research [18] [15] [9] [14]. While these efforts often rely upon web log analysis, their focus is generally on modeling and data-mining. This paper presents a series of interactive visualizations that might be used to augment these data models.

3 Starfield Visualizations

Starfield visualization tools [3] such as Spotfire [17] combine simultaneous display of large numbers of individual data points with a tightly-coupled interface that provides facilities for zooming, filtering, and dynamic querying (Figure 1). By using these facilities to examine the content of web server logs, we can gain an understanding of human factors issues related to visitation patterns.

Figure 1: Interactive Visualizations in Spotfire: A Spotfire visualization, with the URL requested on the y-axis and the time of request on the x-axis. Checkboxes on the right-hand side have been selected to display only data points with "OK" status, and a slider labeled "Time" has been adjusted to eliminate points corresponding to requests that occurred before September 4, 1998. Detailed information about a selected point is displayed in the lower right-hand corner.

Interactive visualizations of visits to the web site of the Human-Computer Interaction Lab (http://www.cs.umd.edu/hcil) were generated from the logs of the University of Maryland's Computer Science department (http://www.cs.umd.edu/). In an attempt to generate meaningful page request data, these logs were processed to remove any accesses that either came from machines in the cs.umd.edu domain or referenced pages outside the "hcil" subdirectory. Requests for non-HTML objects (images, applets, etc.) were also eliminated, in order to avoid generating multiple data points for any single page request. This process can be viewed as a simplified version of the pre-processing performed by WebMiner [9] and similar systems.

During this processing, each entry was also assigned to a category, using a simple pattern match on its URL. Furthermore, client host names were parsed to allow categorization by top and second-level Internet domain names, and attempts were made to identify host names for accesses that were logged only by IP number. In addition to identifying the requesting host, timestamp, URL, and Category, the resulting visualization file includes HTTP Status, number of bytes delivered, HTTP-referer, and User-Agent for each hit. The available data fields are summarized in Table 1.

For a two-month period covering late August to late October 1998, this data set consisted of over 33,000 data points. This data set was used to generate several visualizations, some of which required additional processing.

3.1 Time vs. URL, Macro View

Accesses were plotted with time on the x-axis and URL (alphabetically) on the y-axis. Secondary codings include size coding for document size and color coding for HTTP response code. This "all at once" overview provides a high-level view of major usage patterns of web site visits (Figure 2), including:

• Usage frequency.
• Weekly usage: vertical "lanes" of lower hit density correspond to weekends.
• Correlated references: short vertical groupings indicate pages that had similar URLs (due to prefix similarity) and references that were close together in time.
• Bandwidth usage: frequency of hits to larger files.
• HTTP errors: color coding of HTTP status responses allows for quick visual scanning to identify document requests that caused errors.

By displaying all of these usage patterns in one screen, the visualization gives a compact overview of site activity. Due to their qualitative nature, these observations are more useful for identification of potential areas of interest than for direct comparison. However, Spotfire's zooming and dynamic query facilities can be used to quickly narrow in on interesting subsets of the data.

Replacing URL with category on the y-axis groups points into horizontal bands, based on the semantic category assigned during pre-processing. While potentially hiding the information carried in the distinct URLs, the discrete categories provide a more orderly display that can simplify investigations of relative usage of different parts of the site. Specifically, category usage information may provide insights into the topics and areas that were of interest to users, as opposed to simply identifying the pages that were accessed. This information might be useful for designers interested in focusing maintenance efforts on the most highly-used portions of a site, or for researchers testing hypotheses about site design.

3.2 Time vs. URL, Micro View

Zoom and filter techniques can be used to modify the time vs. URL visualization to display lower-level usage patterns, such as per-host visits.
By restricting the above visualization to display hits from particular clients, we can examine patterns of repeated visits over extended periods of time, in order to identify host machines that may have repeatedly returned to the site over the course of several weeks. Zooming in to display smaller time slices provides a potential visualization of the events in a given visit (Figure 3). Of course, these visualizations must be interpreted carefully: hits from hostnames that indicate proxy hosts or dynamically-assigned hostnames (for ISP dialups) are less likely to indicate single visits from a small group of individuals.

Use of this visualization to examine patterns found for multiple hosts can also reveal some interesting patterns. For this data set, this visualization clearly indicated that the vast majority of individual hosts had recorded only one request to the site.

Table 1: Visualization Data Fields

Field          Description
Client Host    Client's Internet host name: "cs.umd.edu"
TLD            Top-level Internet host name: ".edu"
SLD            Second-level Internet name: "umd.edu"
Timestamp      Date and time of client's request: "980822 17:05:03", indicating August 22, 1998 at 5:05:03 PM EST
URL            Uniform Resource Locator: the name of the file that was requested
Category       Classification within the web site. Possibilities include projects within the group, such as "Visible Human", "Pad++", or "Lifelines"
HTTP Status    The web server's response to a request. Values include "OK", "Unauthorized", "Not Found", and other values specified in the HTTP specification [10]
Bytes          The size of the resource delivered, in bytes
HTTP-referer   The URL that the user's browser was on before making the current request. When present, identifies the page that links to the requested page
User Agent     A description of the specific client software used to make a request (e.g., "Mozilla/4.0 (compatible; MSIE 4.01; MSN 2.5; Windows 98)"). Can be used to identify the user's operating system and browser. Also useful for identifying WWW robots - automated web traversing programs. Example robots include "ArchitextSpider" and "Slurp/2.0 (slurp@inktomi.com; http://www.inktomi.com/slurp.html)"

Figure 2: Time vs. URL, Macro View: Two weeks of accesses to a subset of the HCIL pages. The requested URL is on the y-axis, with the date and time on the x-axis. The dark lines on the x-axis correspond to weekends. Each circle represents a request for a single page. The size of the circle indicates the number of bytes delivered for the given request. Color is used to indicate the HTTP status response, with the majority of points being "OK", indicating a successful request. Numbered arrows point to examples of the interesting patterns described in Section 3.1:
1) The group home page, "/index.html", shows a steady stream of visits, as indicated by the horizontal line of access points that spans the entire graph. The gap to the left of the arrow shows a slight dip in access frequency, due to the weekend of September 12-13, 1998.
2) Groups of access points clumped together vertically indicate pages that both have similar URLs and were accessed at points close together in time, possibly indicating sequences of requests that form user sessions.
3) Large circles indicate large files. Frequent accesses to such files might cause concerns regarding bandwidth allocation.
4) Color coding for HTTP status codes allows for quick identification of errors: the straight line of error codes here indicates a non-existent URL that is frequently requested - perhaps from an outdated link on an external page.

Figure 3: Time vs. URL, Micro View: A series of requests from a single client. Over the course of five weeks, this client made several series of requests to the HCIL web site: 4 pages on September 8, one on September 14, 3 on September 27, and 4 (of which three are shown) on October 16. URLs are alphabetized on the y-axis, so closely-packed points in a vertical line are accesses occurring on a single day involving files with similar file names. Each of these request clusters may constitute a visit to the site.

3.3 Time vs. Hostname

Examination of trends in accesses by hostname can provide insights into the patterns of visitors to the web site. By plotting time on the x-axis and fully-qualified domain name on the y-axis (or IP number, if the complete domain name is unavailable) and maintaining the size and color codings used previously, we can see trends in requests from different hosts.

As with the "time vs. URL" visualization (Section 3.1), this display may illustrate usage patterns that would not be obvious in output from traditional log analysis tools. For example, horizontal lines indicate hosts that have visited the site repeatedly, perhaps over a period of days or weeks. Particularly strong trends in the horizontal - a given host visiting the site repeatedly and regularly over an extended period of time - may indicate a visit from an automated web agent, or classes of visitors coming from a proxy or cache server.

Changing the view to display second-level domains (e.g., .umd.edu) or top-level domains (e.g., .edu) provides information regarding the organization or locality of the originating host. Filtering and zooming to select specific hostnames can be used to provide another version of the usage patterns from individual hosts described under the "time vs. URL, micro view" visualization (Section 3.2).

Unfortunately, the high frequency of hosts that do not have resolvable hostnames results in a large proportion of the hits being classified by IP number only.
Furthermore, some of the hostnames that were found in the log either came from proxies (proxy.host.com), or were obviously associated with dialup PPP lines (ppp.dialup.isp.net). In the data set used to generate these visualizations, approximately 2500 hits (roughly 7%) involved hosts with names containing "proxy" or "dialup", and approximately 6200 (roughly 18%) were identified solely by IP number. While these percentages are not necessarily typical, these difficulties clearly present challenges for any analysis system that hopes to extract useful information from hostname information in log files.

3.4 Client Host vs. URL

Visualization of client hostname (x-axis) vs. requested URL (y-axis) can illustrate trends in access patterns for individual Internet hosts. In this display, each vertical lane corresponds to requests from a single host: examination of these lanes can provide insights into the files requested by different hosts.

This display might also be used to identify URL request patterns that are shared by multiple hosts. Specifically, multiple parallel vertical lanes that have data points (hits) in the same vertical positions indicate groups of clients that visited similar pages. Unfortunately, the alphabetic ordering of client hosts and URLs might make such patterns difficult to identify.

The visualization might also be used to identify visits from web robots. Vertical lines that extend throughout large portions of the URL space show cases in which many pages on the site were hit by a single host in a short time period, indicating a possible robot visit (Figure 4). This information may be useful for site operators interested in knowing when their site is being visited by an automated agent. Of course, the difficulties with unidentified or uninformative hostnames (described above) apply to this visualization as well.
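The robot pattern described above (one host requesting many pages in a short time) can also be checked for directly in the pre-processed data. The following is a minimal sketch, not part of the original study: the function name `possible_robots`, the thresholds `min_pages` and `window`, and the `(host, timestamp, url)` tuple layout are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def possible_robots(hits, min_pages=20, window=timedelta(minutes=10)):
    """Return hosts that requested >= min_pages distinct URLs within `window`.

    `hits` is an iterable of (host, timestamp, url) tuples, where timestamp
    is a datetime. Thresholds are illustrative assumptions, not values
    taken from the paper.
    """
    by_host = defaultdict(list)
    for host, ts, url in hits:
        by_host[host].append((ts, url))

    flagged = set()
    for host, reqs in by_host.items():
        reqs.sort()                 # order each host's requests by time
        start = 0                   # index of the oldest request in the window
        in_window = defaultdict(int)  # URL -> number of requests in the window
        for ts, url in reqs:
            in_window[url] += 1
            # Drop requests that have fallen out of the sliding time window.
            while ts - reqs[start][0] > window:
                old_url = reqs[start][1]
                in_window[old_url] -= 1
                if in_window[old_url] == 0:
                    del in_window[old_url]
                start += 1
            if len(in_window) >= min_pages:
                flagged.add(host)
                break
    return flagged
```

A host flagged this way is only a candidate: as noted above, proxy and dialup hostnames can aggregate requests from many individual users, so flagged hosts would still need to be checked against the User Agent field.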
3.5 Index Page Link Requests

Researchers and web site designers may be interested in using data regarding hits to links on a site's home page as a means of evaluating the effectiveness of the site's design. One way to perform this assessment would be to track the frequency of user visits to URLs that are referenced from the home page. In order to visualize this data, we reprocessed the visualization files, calculating the total number of hits per day per linked URL for each of the 35 links found on the HCIL index page. As part of this processing, each URL that was linked from the index page was assigned a number (links on the home page to off-site resources were ignored). Numbers were assigned in descending order, starting with -1 for the top link on the home page, thus guaranteeing that a link's position in the visualization will correspond to its position in the home page.

This revised data was then displayed in a visualization, with date of access on the x-axis, rank on the y-axis, color coding for the URL, and size coding for the number of hits on each day, with larger points indicating more hits. This provides a visualization with a series of horizontal lines, each tracking accesses to a given link on the HCIL home page.

This visualization can be used to track frequency and regularity of user visits to the home page links. Of course, this display does not address the relative use of the underlying pages: in fact, this visualization confirmed our suspicions that many of the user visits to HCIL pages were coming from external links. Unfortunately, this visualization is potentially misleading, as references to pages linked from the home page do not necessarily involve selection from the home page: site visitors might arrive at these pages by selecting links from some other page, or by typing a link directly into their browser.
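The reprocessing step described above (numbering the index-page links from -1 downward and totaling hits per day per linked URL) might be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the code used for the paper: the function names are hypothetical, `index_links` is assumed to be the on-site links in top-to-bottom page order, and timestamps are assumed to use the "980822 17:05:03" form shown in Table 1.

```python
from collections import Counter
from datetime import datetime


def parse_timestamp(text):
    """Parse a timestamp in the Table 1 form, e.g. "980822 17:05:03"."""
    return datetime.strptime(text, "%y%m%d %H:%M:%S")


def rank_links(index_links):
    """Number links in page order: -1 for the top link, -2 for the next, ..."""
    return {url: -(position + 1) for position, url in enumerate(index_links)}


def daily_link_hits(hits, index_links):
    """Count hits per (day, rank) for URLs that are linked from the index page.

    `hits` is an iterable of (timestamp, url) pairs with datetime timestamps.
    Requests for URLs that are not linked from the index page are skipped.
    """
    ranks = rank_links(index_links)
    counts = Counter()
    for ts, url in hits:
        if url in ranks:
            counts[(ts.date(), ranks[url])] += 1
    return counts
```

Each `(day, rank)` key with its count corresponds to one point in the visualization: day on the x-axis, rank on the y-axis, and the count driving the point size. Because the ranks are negative, a plot with a conventional ascending y-axis places the top link of the page at the top of the display.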
This one-screen display of the relative frequency of use of the various links can provide valuable insights to designers and webmasters interested in improving page performance. For example, rarely-used links towards the top of a page might be occupying space that would be better allocated to more popular resources (Figure 5). Alternatively, high-interest items found at the end of a long page might show lower levels of access, perhaps reflecting users' unwillingness to scroll to the end of longer pages.

Figure 4: Client Host vs. URL: Client hosts on the x-axis, and requested URL on the y-axis. Vertical slices display files visited by each host, while horizontal slices indicate the patterns of requests for a given web page. This display indicates that the HCIL index page (arrow labeled 1) is visited by many of the hosts that come to the site, but the percentage of visitors that visit the page for the lab's Visible Human project (2) or technical reports (4) appears to be higher. The vertical line indicated by arrow 3 is a visit from a web robot.

3.6 Referer vs. Time

Many web site operators are interested in understanding their site's position in the web universe. While search engines such as Altavista [4] may provide facilities for searching for links to a given URL, such searches do not provide any information about the actual use of these links. Fortunately, many web logs contain the HTTP-referer field, which indicates the URL that a browser was viewing before a given page request was made, thus indicating the page that led to the request.

Log files containing HTTP-referer fields can be used to derive visualizations that might provide some valuable insights into the use of internal and external links. By plotting time on the x-axis,
referer URL on the y-axis, along with color coding for HTTP status and size coding for size of resource requested, we can generate a visualization that displays trends in referring URLs that lead users to the site (Figure 6).

For example, dense horizontal bands indicate referer URLs that are continually and regularly leading people to the site. Of these URLs, external sites are likely to be the most interesting, but internal referers may provide interesting clues as to which links on the site are being used. Furthermore, changes in the referer profiles over time may indicate the addition or deletion of links to the site.

Examination of the range of referers is also instructive. Search engines often return responses to queries as dynamically-generated content with similarly dynamic URLs. As a result, visits that originated with search engines have distinct referers, leading to horizontal bands in the visualization, indicating classes of visits from different search engines. Furthermore, search terms are often encoded in the URLs of search results, so examination of individual referer URLs for these search engine referers may provide some insights into the search keywords that are leading visitors to the site.

Figure 5: Index Page Link Requests: Requests for pages that have links on the group index page. Each row corresponds to a link on the index page. The vertical position of each row in the visualization corresponds to the vertical position of the link on the index page, with links at the top of the page found at the top of the visualization. Date of access is plotted on the x-axis, and the points are scaled to indicate the relative number of requests on each day - larger points indicating more frequent accesses. From this visualization, we can see that some links placed fairly high on the page are not referenced very frequently (arrow 1, indicating links to the pages for HCIL's Baltimore Learning Community project), while frequently requested links such as HCIL's technical report page are placed further down the page (arrow 2). This information might be used to redesign the index page, by identifying frequently requested links that might be moved to more prominent positions.

Figure 6: Referer vs. Time: The URL of the referring page is given on the y-axis, and the request date and time is on the x-axis. Referer patterns seem to be fairly regular across the 5 weeks of data displayed. The points shown below arrow 1 are those that have referring URLs that are inside the HCIL web site, indicating visitors who went from one page to another within the site. As expected, this class makes up a significant portion of the data points. The line marked by arrow 2 indicates a URL at the National Library of Medicine that consistently refers users to the pages for the HCIL's Visible Human project. The area indicated by the vertical brace on the left (labeled #3) indicates a band of referer URLs corresponding to a search engine.

3.7 Referer vs. URL

Further insight into paths that users take to reach various pages can be gained by plotting the HTTP-referer (x-axis) vs. the URL being retrieved (y-axis), while maintaining the color and size codings used above for HTTP status and resource size, respectively. While this visualization may provide interesting insights, the presence of a large number of intra-site and search engine referers may lead to possibilities for misinterpretation. If these potential confounds are properly accounted for, several interesting patterns may be observed:

• Pages accessed from a variety of external referers: Horizontal bars correspond to pages that are referenced from multiple sources - either external or internal. These bars may be used to gauge the relative external visibility of different web pages, in a manner that identifies the links that actually bring users to the site (as opposed to links that may exist but are never visited).
• Frequent referers: Vertical lines (or bands) indicate URLs (or groups of URLs) that may reference multiple pages on the site. In the case of external referers, these patterns may be used to identify WWW resources with a strong affinity to the material on a given site.
• Non-link references: The referer field is only recorded for HTTP requests that originate when a user clicks on a link found in a web page. Examination of the entries that do not have referer values may provide insights into the prevalence of users who are reaching the site in question by manually providing a URL to their browser. This may be used to gain some understanding of the extent to which knowledge about the site is propagating via non-WWW mechanisms.
• Problem links: As described above, color coding based on HTTP status can be used to quickly identify requests that corresponded to problem responses. In particular, referer/URL combinations that result in the "not found" response can be quickly identified, and this information might be used to locate external pages that may include links to one or more pages on the site that do not exist. This information might be used to determine when appropriate redirection may prove useful, or to identify web site operators who might be asked to update their pages.

The use of this visualization for the HCIL web site provided an example of how artifacts in the data present potential pitfalls in the use of these techniques. Specifically, we observed strong patterns in the visualization, in the form of multiple data points that seemed to form two distinct lines of non-zero slope, cutting across large sections of the URL space (Figure 7).

While these lines present a striking visual image, the phenomenon being observed is actually quite simple.
Like many other web sites, the HCIL pages are arranged hierarchically on a Unix file system, where pages for a given interest area - such as a research project or user home pages - are stored in a single directory. As a result, a page in one of these areas is likely to contain links that refer to other pages in that area: a user's home page might contain links to her CV, and vice-versa. Since the URLs differ only slightly, page requests that move between these pages will generate tight clusters in the visualization. Furthermore, the presence of areas on a web site with a common prefix (e.g., /Research/1997 and /Research/1998) will lead to a juxtaposition of these clusters, thus forming easily-visible lines. While this display may provide the impression of a strong pattern of usage and references, the understanding of usage patterns that is gained is actually quite small. Further clarification of the data, either through elimination of intra-site referers, or through aggregation of referers by URL domain (as opposed to complete URL path), may eliminate the potential problems caused by this sort of display.

3.8 Other Visualizations

Several other possible visualizations may provide further understanding of site access patterns:

• User Agent: Plotting user-agent vs. time, URL, or domain may prove useful for understanding the software used to access a given web site. This information might be useful for web site designers interested in deciding which HTML features to use.
• Totals by Time Period: With the exception of the "front page visits" (Section 3.5), the above visualizations do not summarize information in terms of hits by day, week, or month. Visualizations based on these aggregates (which can be easily generated through re-processing of the log files) might be used to identify further patterns.
• Site "mapping": Plots containing category identifiers vs. URL illustrate the layout of the site, in terms of categories occupied by various URLs.
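The two clean-up options suggested at the end of Section 3.7 (eliminating intra-site referers and aggregating referers by URL domain rather than by full path) could be applied while re-processing the log data. The sketch below is one possible approach under stated assumptions, not the procedure used in the paper: the function names are hypothetical, a missing referer is assumed to be logged as an empty string or "-", and any referer on the host `www.cs.umd.edu` is treated as intra-site, which is coarser than restricting the test to the "hcil" subdirectory.

```python
from collections import Counter
from urllib.parse import urlsplit


def referer_domain(referer):
    """Return the lower-cased host of a referer URL, or None if there is none."""
    if not referer or referer == "-":
        return None
    return urlsplit(referer).hostname


def external_referer_counts(hits, site_host="www.cs.umd.edu"):
    """Count requests per (referer domain, requested URL).

    `hits` is an iterable of (referer, url) pairs. Entries with no usable
    referer and entries whose referer is on `site_host` are skipped, so
    only links arriving from other sites are counted.
    """
    counts = Counter()
    for referer, url in hits:
        domain = referer_domain(referer)
        if domain is None or domain == site_host:
            continue
        counts[(domain, url)] += 1
    return counts
```

Plotting these aggregated pairs instead of full referer URLs would collapse each external site, including each search engine, into a single column, which should remove the diagonal artifacts caused by intra-site navigation at the cost of hiding the individual referring pages.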
Figure 7: Referer vs. URL: URL on the y-axis, referer on the x-axis. From this visualization, we can see that some URLs have numerous associated referers (see arrow #1), indicating either multiple links to the URL or frequent searches that lead users to the URL. Arrow #2 points to the "non-link" references: requests that did not involve named referers. Arrow #3 illustrates the patterns formed when users follow paths through the site, as described in Section 3.7.

4 Discussion

All of the data reported above might be included - in some form - in the output of a traditional web log analysis tool. However, interactive starfield visualizations offer several advantages [3] in understanding user visits, including:

• Rich display of multiple-dimensional data, allowing discovery of multiple trends: The Time vs. URL visualization described above can potentially reveal several usage patterns in the data.
• Simultaneous display of large numbers of individual data points: While traditional analysis tools display bar charts or tables containing dozens of data points, Spotfire can present thousands of data points, representing every site user visit for a month-long period, on a single screen.
• Filter and zoom for access to detail: In generation of aggregate summaries, traditional tools obscure most information about individual events. The visualizations described above allow analysts to move seamlessly from viewing thousands of hits from a period of several weeks to individual accesses from an hour-long visit by a single user.
• Goal-neutral, interactive output: Existing log-analysis tools provide reports and output that are limited in flexibility and tied directly to the problem domain. As a result, the analyst's ability to expand the range of questions being asked, or to simply "explore" the data, is limited.
The lack of domain knowledge in a tool such as Spotfire is in many ways an advantage, as it may avoid over-constraining analysts in their efforts to find meaningful patterns.

These facilities combine to provide an environment that may prove useful for generating hypotheses about web usage patterns that would be difficult to make with traditional tools. For example, the combination of the Time vs. URL and Front Page Visit visualizations was used to identify pages that were entered "through the side door" - pages that had user visits from links that originated outside of the local site. This provides another perspective on the notion of "entry points" [14][9].

Visualizations helped illustrate data artifacts that might have been obscured by the output of traditional packages. For example, some projects described on the HCIL web page have all of their information on a given web page, while others use multiple pages. Using traditional tools, it might appear as if the former projects had more user visits, because these hits would be focused on a small number of pages, instead of being distributed across a larger set. The categorization of web pages as described above helps avoid this problem, and could easily be added to traditional tools. However, the interactive visualization provides analysts with the ability to quickly switch between the categorized and non-categorized views, thus presenting a means of visually identifying a trend that might be obscured in the static layout of a traditional tool.

With the exception of "Front Page Visits" (Section 3.5), the visualizations described above presented each access as a distinct point in the starfield visualization. This use of individual points instead of aggregate summaries is a double-edged sword: while visualizations eliminate the data loss that is inherent in summaries, they also mask some of the more basic information provided by traditional tools. For example, the "time vs.
URL" visualizations provide access to each individual data point, but identification of aggregate trends may not be immediately available. Interactive visualizations might work best as complements to traditional analysis tools. Specifically, visualizations might be most useful for identifying questions to be asked, while traditional log analysis tools might provide the answers to those questions.

Ideally, web log analysis will lead to an understanding of usage patterns that can be used to guide web site design or research, in order to effectively realize the goals of the site. For maximal benefit, this analysis will be done in the context of a clear understanding of the goals of a site: usage patterns from an academic site are likely to be very different from those of an online supermarket. By providing direct access to data from a large number of user visits, interactive visualizations provide web site operators with the ability to answer questions such as "which links are being used?", "when are people visiting the site?", "where are visitors coming from?", and others. Answers to such questions can be valuable inputs to the process of site and page design.

5 Future Work

Additional insights may be gained from visualizations covering a longer time range. By extending the above visualizations to cover longer time periods - perhaps 6 months or one year - we might gain an understanding of seasonal usage trends, the impact of site redesign, or other factors that might be missed in a smaller time sample. Unfortunately, such expanded visualizations might exceed the capabilities of the visualization tool: the performance of the current version of Spotfire (3.2) degrades noticeably on data sets containing more than twenty thousand points.
While improved software and processing hardware should help, display technologies may not be able to adequately handle the hundreds of thousands or millions of data points that might be involved in visualizing usage patterns for larger sites.

The utility of web log visualizations is also limited by the available data that can be manipulated, and by the types of manipulations that can be done. Inclusion of additional data, along with tools to manage that data, may increase the expressive power of these visualizations.

Specifically, visualizations that combine web log data with other appropriate data may help users place data in the appropriate contexts. The most basic external data sources include additional log files, tracking errors, cookies, or other web server output. Visualizations that combine web log data with site "maps" might improve the utility of visualizations that approximate user sessions. For sites aimed at accomplishing specific goals, data relevant to those goals might provide further utility. For example, visualizations of log data for electronic commerce sites might be enhanced through inclusion of relevant marketing data [7].

Further improvements might be made through the addition of data modeling tools to the visualization environment. Spotfire is primarily a data visualization tool: facilities for data modeling are somewhat limited. Potentially useful additions to the visualization environment include:

• Improved aggregation facilities: Facilities for generating "on-the-fly" aggregations of data may prove useful for identifying trends. Fully general aggregation facilities could be used to generate aggregations that would go beyond those provided by traditional tools.

• Generalized handling of hierarchical data: Log data has several attributes that are hierarchical in structure: URL file names, timestamps, and client host names.
Facilities to easily move through views at different levels of the hierarchy, in combination with improved aggregation tools, would simplify the process of building models. For example, users would be able to move from display of all hits in a given month, to aggregate counts by hour, day, or week.

• "Inter-visualization" visualization: While investigating the index page link request visualization, we noticed that one area of the web site had a high level of traffic, despite the fact that the corresponding link on the index page was not heavily used. We hypothesized that users were arriving through an external link to another page in that section of the site. Examination of the referer vs. time and referer vs. URL visualizations verified that this was indeed the case. Tools that would allow for coordination of data between different visualizations would increase the likelihood of identifying trends of this sort.

The large space of possible visualizations of log data presents a challenge for effective use of these tools: further exploration of these possibilities might lead to identification of an "optimal" set of visualizations. This reduced set would ideally provide the necessary understanding with minimal effort.

Despite the efforts of several research projects [15][9], modeling of web usage remains an inexact science [12]. Interactive visualizations of web log data may be useful complements to static reports generated by current tools and session models currently being developed.

Finally, no matter how rich or accurate the log data, answers to many questions may require coordinated observations or interviews with users. For example, a long visit to many pages on a site may indicate satisfaction and interest in the contents, or confusion and frustration due to an unsuccessful search for information.
While visualizations of the log data may expose patterns that provide some insights into the user's experience, the characterizations of user behaviors provided by these patterns will be at best indirect, and may require direct observations for further clarification.

Acknowledgments

This research was supported by a grant from IBM. Thanks to Anne Rose for help with generation of the visualizations.

References

[1] Abrams, M., Williams, S., Abdulla, G., Patel, S., Ribler, R., and Fox, E. Multimedia traffic analysis using chitra95. In ACM Multimedia (1995). http://ei.cs.vt.edu/~succeed/95multimediaAWAFPR/95multimediaAWAFPR.html.

[2] Access Log Analyzers. http://www.uu.se/Software/Analyzers/Access-analyzers.html.

[3] Ahlberg, C., and Shneiderman, B. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In ACM CHI Conference on Human Factors in Computing Systems (1994), pp. 313–317.

[4] Altavista. http://www.altavista.digital.com.

[5] Analog. http://www.statslab.cam.ac.uk/~sret1/analog/.

[6] BazaarSuite. http://www.bazaarsuite.com.

[7] Büchner, A. G., and Mulvenna, M. D. Discovering internet marketing intelligence through online analytical web usage mining. SIGMOD Record 27, 4 (December 1998), 54–61.

[8] Chi, E., Pitkow, J., Mackinlay, J., Pirolli, P., Gossweiler, R., and Card, S. Visualizing the evolution of web ecologies. In ACM CHI Conference on Human Factors in Computing Systems (1998), pp. 400–407.

[9] Cooley, R., Mobasher, B., and Srivastava, J. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems 1, 1 (1999). http://maya.cs.depaul.edu/~mobasher/papers/webminer-kais.ps.

[10] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., and Berners-Lee, T. Rfc 2068 – hypertext transfer protocol – http/1.1.
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2068.txt.

[11] Hit List. http://www.marketwave.com.

[12] Monticino, M. Web-analysis: Stripping away the hype. IEEE Computer 31, 12 (December 1998), 130–132.

[13] Papadakakis, N., Markatos, E. P., and Papathanasiou, A. E. Palantir: A visualization tool for the world wide web. In INET'98 Proceedings (1998). http://www.csi.forth.gr/~papathan/papers/INET98 Palantir/.

[14] Pirolli, P., Pitkow, J., and Rao, R. Silk from a sow's ear: Extracting usable structures from the web. In ACM CHI Conference on Human Factors in Computing Systems (1996), pp. 118–125.

[15] Pitkow, J. In search of reliable usage data on the www. Tech. rep., Georgia Tech. College of Computing, Graphics, Visualization, and Usability Center, 1996. ftp://ftp.gvu.gatech.edu/pub/gvu/tr/1997/97-13.pdf.

[16] Pitkow, J., and Bharat, K. Webviz: A tool for world-wide web access log analysis. In First International Conference on the World-Wide Web (1994). http://www1.cern.ch/PapersWWW94/pitkow-webvis.ps.

[17] Spotfire. http://www.spotfire.com.

[18] Tauscher, L., and Greenberg, S. Revisitation patterns in world-wide-web navigation. In ACM CHI Conference on Human Factors in Computing Systems (1997), pp. 399–406.

[19] Wusage. http://www.boutell.com/wusage/.

[20] Wwwstat. http://www.ics.uci.edu/pub/websoft/wwwstat/.