
Data Mining User Behavior

Tom Wilson

Introduction

In order to test a developed system, we must understand how the system's users will interact with it. Some user behaviors that we will be investigating are (1) user arrival rate, (2) session duration, and (3) think time. One behavior that we will not be investigating here is workload, which is the mix of the activities that the users choose to perform. For the data that we will be looking at, we will use both science and art in our analyses.

We will use visualization to assist in communicating the analyses. A few graphs will use interval sampling to summarize large amounts of data. Interval sampling consists of counting events during a time interval. It has its limitations ([Wil10a]), but is often sufficient in helping us see how data are organized. Some graphs will summarize data with boxplots. Boxplots have several useful features: (1) the top and bottom of the box correspond to the 75th and 25th percentiles, (2) the median, or 50th percentile, is shown by a line within the box, (3) values beyond the box but within 1.5 times the interquartile range are represented above and/or below the box by a whisker (a dashed line, which can be up to 1.5 times the size of the box), and (4) potential outliers are shown as points above and/or below the whiskers. In this paper, boxplots are clipped near the upper whisker so we can see the details of the box better. The boxplot quickly shows us how much the data are clustered or spread. We will also plot the mean when drawing the boxplot. Histograms are also a graphical method of summarizing data, but suffer from being sensitive to the data and the choice of bins. Our presentation of a histogram is modified slightly in that it is drawn using a line rather than a series of bars because our choices for bins often result in small widths.1 So only the heights of the bins are shown as the points on the line they create. We will then add the mean and the 25th, 50th, and 75th percentiles to the histogram in order to relate it to the boxplot.
In this paper, the distribution is clipped like the boxplot so as not to compress the majority of the data. As we analyze each data set, we will comment on some of its content. Such discussion usually focuses on the extremes (why are some values very small or very large?) since we should question whether these values are valid. Hopefully, the analyses presented here can strengthen your own analyses.
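The boxplot features described above can be made concrete with a short sketch (shown here in Python with invented sample data; the paper's own graphs are drawn with R):

```python
import statistics

def boxplot_stats(values):
    """Compute the boxplot features described in the text."""
    xs = sorted(values)
    q1, q2, q3 = statistics.quantiles(xs, n=4)  # 25th, 50th, 75th percentiles
    iqr = q3 - q1                               # height of the box
    hi = q3 + 1.5 * iqr                         # upper whisker limit
    lo = q1 - 1.5 * iqr                         # lower whisker limit
    outliers = [x for x in xs if x < lo or x > hi]
    return {"q1": q1, "median": q2, "q3": q3,
            "mean": statistics.mean(xs), "outliers": outliers}

# one large value to show how it is flagged as a potential outlier
stats = boxplot_stats([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
print(stats["median"], stats["outliers"])
```

Note how the single extreme value pulls the mean well above the median, which is exactly why the boxplots in this paper also plot the mean.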

Source Data

The proprietary system providing example data supports logistics and maintenance for military equipment. For the purpose of anonymity, some general details are provided and some specific details have been changed. Minor details of the system are scattered throughout the discussion. Figure 1 graphs the number of users and logins over an 8-month period. Both graphs have a similar pattern of a 5-day series of peaks (i.e., the work week) followed by a 2-day extended valley (i.e., the weekend). The login graph has several sharp spikes that are probably not a function of time. Besides a steady growth in each parameter over the interval, a couple of small dips occur that are a result of holiday periods (months 2 and 6). We will take a closer look at these two metrics before turning our attention to the three user behaviors that we are interested in.

2.1 User Counts

Users access the system as part of their jobs, with the bulk of the users working a first-shift weekday schedule. So, user load is highly influenced by the time of day, the day of the week, and seasonal events (e.g., holidays). Figure 2a shows the minimum, average, and maximum number of users per hour for every hour of each day of the week. The statistics are computed for the entire 8-month interval. The weekday minimums are certainly affected by holidays; if holidays were excluded, the minimums would be about 100 higher. Figure 2b shows a boxplot of the same data, but constrains the day of the week to Tuesday. The limited interval allows more information to be presented clearly. As before, eliminating holidays would drive the minimums up by
Published in CMG MeasureIT, September 2010.

1 The resulting graph looks like a density plot, but it is not.

[Figure 1 appears here: line charts of Users (top) and Logins (bottom) by Month, y-axes 0 to 1,500.]

Figure 1: The charts show the number of users (top) and the number of logins (bottom) over an 8-month period. The charts use the same y-axis limits for better visual correlation. It is apparent that several login spikes have occurred.

[Figure 2 appears here: panel (a) "Users by Hour of Day of Week" (Max/Avg/Min by Day of Week) and panel (b) "Users by Hour of Day (Tuesday)" (boxplots by Hour).]

Figure 2: Chart (a) shows the maximum, average, and minimum number of users for every hour of the week. The summary contains 8 months of data. The chart suggests that the number of users is sensitive to the time of day and day of week. Chart (b) shows a boxplot of the users for every hour of all Tuesdays. Means are added to the boxplot with a line connecting them.

about 100. The location of the mean below the median (during most hours) suggests that the distribution is left skewed. We did not plot this distribution because it is mostly uninteresting. It is quite easy to see which hours constitute the workday for most users (i.e., 6 to 16). The typical lunch-time lull is also present, giving a double-hump appearance to most descriptive statistics on the graph.
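The hour-of-week summary behind Figure 2 amounts to bucketing samples by (day of week, hour) and computing the minimum, average, and maximum of each bucket. A minimal Python sketch with invented samples:

```python
from collections import defaultdict

# (day_of_week, hour) -> list of user counts observed in that bucket
buckets = defaultdict(list)

# invented samples: (day_of_week 0-6 with Monday = 0, hour 0-23, users online)
samples = [(1, 9, 480), (1, 9, 510), (1, 9, 350),  # Tuesdays at 09:00
           (1, 13, 400), (5, 9, 40)]
for dow, hour, users in samples:
    buckets[(dow, hour)].append(users)

# each bucket reduces to the (min, avg, max) triple plotted in Figure 2a
summary = {k: (min(v), sum(v) / len(v), max(v)) for k, v in buckets.items()}
print(summary[(1, 9)])
```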

2.2 Login Counts

Initially, we might think that login and user counts are equivalent, but they are not. A user can log in several times during a sampling interval, yet only count as one user (this assumes that a user cannot be logged in multiple times simultaneously). Likewise, a user can persist over several sampling intervals without logging in during any of the intervals. So, we expect only a moderate-to-high correlation. This was demonstrated in [Wil10b]. Users can log in and out frequently because, for many users, most of their jobs consist of work away from the system. Limited numbers of terminals may also be a factor. Figure 3a shows the number of logins per hour for each day and hour of the week. These data contain holidays, which are likely to impact weekday minimums. The spikes (large maximums) are notable since they are significantly higher than the averages. A downtime event is one possible explanation since, when the system is restored, many users are waiting to log in at the same time. Such possible outliers could be removed, but they truly are valid data when our explanation is correct.
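The distinction between logins and users can be sketched directly: count every login event, but count each user at most once per sampling interval. A small Python illustration with invented events (it does not model users who persist across intervals without logging in):

```python
from collections import defaultdict

# invented login events: (hour_index, user_id); one user may log in repeatedly
logins = [(0, "amy"), (0, "amy"), (0, "bob"),
          (1, "bob"), (1, "cat"), (1, "cat")]

login_counts = defaultdict(int)   # every login event counts
user_sets = defaultdict(set)      # each user counts once per interval

for hour, user in logins:
    login_counts[hour] += 1
    user_sets[hour].add(user)

for hour in sorted(login_counts):
    print(hour, login_counts[hour], len(user_sets[hour]))
```

In both invented intervals there are three logins but only two users, which is why the two metrics correlate only moderately.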

[Figure 3 appears here: panel (a) "Logins by Hour of Day of Week" (Max/Avg/Min by Day of Week) and panel (b) "Logins by Hour of Day (Tuesday)" (boxplots by Hour).]

Figure 3: Chart (a) shows the maximum, average, and minimum number of logins for every hour of the week. The summary contains 8 months of data. The chart suggests that the number of logins is sensitive to the time of day and day of week. Chart (b) shows a boxplot of the logins for every hour of all Tuesdays.

Figure 3b shows only the Tuesday login data as a boxplot. The spikes now appear as circles and are significantly distant from the whiskers. They are also few in number.

2.3 Interarrival Time Data

Interarrival time is computed from the login times. The time between successive logins (for any users) defines an interarrival time. There are almost 325,000 interarrival times in the source data. As illustrated in Figure 3, these times appear to be sensitive to the time of day and day of week, although no correlation has been performed. Other factors, such as holidays, are also important. The Original column of Figure 4a shows the distribution of the interarrival times using a boxplot, and Figure 4b shows the distribution using a (line) histogram. It appears from the histogram that the data are exponentially distributed. Because the data are right-skewed, the median might be a better location estimator than the mean. There is a slight bulge at the median; we cannot offer an explanation for this.
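Computing interarrival times is a simple differencing of the sorted login timestamps, as this Python sketch with invented times shows:

```python
# invented login timestamps in seconds since some epoch, all users mixed together
login_times = [0.0, 12.5, 13.0, 90.0, 91.5, 600.0]

# interarrival time = gap between successive logins (times must be sorted)
times = sorted(login_times)
interarrivals = [b - a for a, b in zip(times, times[1:])]
print(interarrivals)  # [12.5, 0.5, 77.0, 1.5, 508.5]
```

Even this tiny invented sample shows the pattern discussed below: many small gaps during busy moments and occasional very large gaps during quiet periods.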

[Figure 4 appears here: panel (a) "Interarrival Time Boxplot" (Minutes; Original and Filtered columns) and panel (b) "Interarrival Time (Line) Histogram" (Frequency vs. Minutes).]

Figure 4: Chart (a) shows a boxplot of the interarrival times for both the original and filtered data, while chart (b) shows the original data as a (line) histogram. Each chart also shows important statistics such as the median and mean.

During heavily-loaded time periods, users can arrive very closely together. This contributes to the large number of small values. During lightly-loaded time periods, there can be a significant gap between users arriving. An event that brings the system down will likely result in a large cluster of logins as soon as the system is back up. If so, the number of such events is small. So, what do the data look like if we consider the busier hours of workdays? We will define these data as those for hours 6 through 16 of weekdays that are not holidays, and refer to them as the filtered data. How do important statistics change? Figure 4a shows the boxplot of the filtered data. The change in most descriptive statistics is small but apparent. The mean has changed considerably. Approximately 28,000 times were eliminated by the filtering criteria. The histogram for the filtered data is not shown since it is so similar to the original data. If we were going to simulate users arriving, we would use the interarrival times to create a distribution model. However, our testing did not include any test of a long enough duration to use this information.
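The filtering criteria (hours 6 through 16 of weekdays that are not holidays) reduce to a simple predicate over timestamps. The dates and the holiday list in this Python sketch are invented for illustration:

```python
from datetime import datetime

holidays = {(12, 25), (1, 1)}  # invented holiday list as (month, day) pairs

def is_busy_hour(ts: datetime) -> bool:
    """True for hours 6-16 of a non-holiday weekday."""
    return (ts.weekday() < 5                       # Monday through Friday
            and 6 <= ts.hour <= 16
            and (ts.month, ts.day) not in holidays)

stamps = [datetime(2010, 3, 2, 10, 0),   # Tuesday 10:00 -> kept
          datetime(2010, 3, 6, 10, 0),   # Saturday     -> dropped
          datetime(2010, 3, 2, 22, 0),   # late evening -> dropped
          datetime(2010, 1, 1, 10, 0)]   # holiday      -> dropped
kept = [ts for ts in stamps if is_busy_hour(ts)]
print(len(kept))  # 1
```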

2.4 Session Duration Data

Session duration measures the amount of time that a user is logged into the system. This is the time between the log in and the log out. We should not associate any level of activity with the measurement. It is very feasible to have one or more periods of inactivity during the session. Sessions can easily span several days if time-outs do not terminate sessions, which is the case here. There are over 400,000 times in the source data. The Original column of Figure 5a shows the distribution of the session durations using a boxplot. Figure 5b shows the distribution using a (line) histogram. It appears from the histogram that the data are exponentially distributed. As before, because the data are right-skewed, the median might be a better location estimator than the mean. We would like to eliminate large session durations since they are unlikely to represent someone continuously working on the system. So, any session longer than 10 hours was eliminated. This only reduces the data set by about 8,500 times. The boxplot of the filtered data is shown in Figure 5a. As before, there is a slight change in most of the descriptive statistics, except for the mean, which changed significantly. Also, the corresponding histogram is not shown because of its similarity to the original graph. Many questions could be asked about the session duration data. We do not have answers to many of these questions since we did not perform any tests requiring a session duration to be computed. We will mention them nonetheless.
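The duration computation and the 10-hour cut-off can be sketched as follows (Python, with invented login/logout pairs):

```python
# invented sessions as (login_seconds, logout_seconds)
sessions = [(0, 1800),          # 30-minute session
            (100, 200),         # 100-second session
            (0, 50 * 3600),     # multi-day "session" left logged in
            (500, 505)]         # 5-second session

durations = [logout - login for login, logout in sessions]

MAX_SECONDS = 10 * 3600  # drop sessions longer than 10 hours
filtered = [d for d in durations if d <= MAX_SECONDS]
print(len(durations), len(filtered))  # 4 3
```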

[Figure 5 appears here: panel (a) "Session Duration Boxplot" (Minutes; Original and Filtered columns) and panel (b) "Session Duration (Line) Histogram" (Frequency vs. Minutes).]

Figure 5: Chart (a) shows a boxplot of the session durations for both the original and filtered data, while chart (b) shows the original data as a (line) histogram. Each chart also shows important statistics such as the median and mean.

Is session duration sensitive to the time of day, day of week, or seasonal periods? Any result becomes difficult to interpret because of the long sessions that are mainly idle. In order to assess the productivity of a user, how long of an idle period should be allowed? It may be sufficient to shorten the session duration by subtracting sufficiently long idle periods. This would give a better estimate for the rate of work (e.g., transactions/hour). Long idle periods can significantly lower this number. A highly-detailed, long-term test could allow idleness with a long think time, but this is rarely the objective of running a test. There is some merit to the perspective since an idle session still ties up some resources. What value do extremely short sessions have? A session of 0.5 seconds hardly seems worth implementing: log in, then log out! This is easy to implement and does affect some resources, but do we really want such sequences to happen with a high frequency? The data say that there are a lot of them. Are they anomalies? The last question seems to imply that the exponential distribution might not be the desired distribution. A lognormal distribution has a similar right-skew property, but can significantly reduce the frequency of extremely small times. We are not promoting this, but it could be studied further. Other difficult questions are: Is session duration sensitive to user count or system load? Are sessions lengthened because it is taking longer to get responses, which is causing users to multitask away from the terminal? Are sessions shortened because users are frustrated with response times and resolve to come back at a later time?

2.5 Think Time Data

Think time is the time the user spends between tasks. It might be possible to categorize the times based upon the activity being performed. Examples include: selecting from a menu, entering data into a text field, reviewing a list of results before selecting one or more of them, and navigating to a completely new task. Significant instrumentation is required to provide this resolution. What cannot be determined is whether or not the user is truly interacting with the system. As with session duration, idle times impact think times. Thresholds could be established for eliminating outliers that probably contain idle intervals. There are over 13 million think times in the source data. Almost 110,000 times are less than or equal to 0. Negative think times can occur when two or more transactions are generated by the same user action. In such cases, the earlier one finishes after the next one has already started. Sometimes, a user submits another transaction before the previous one has finished. This is often due to a slow response time, but can also be a result of abandoning the previous transaction due to a change of mind. We have chosen to filter these times out of our analysis since they

cannot be used in a test. We refer to these data as Filtered 1. Very short think times greater than 0 can be a result of anticipation. Because of familiarity with the application, the user does not review the result and quickly submits the next transaction to move on. We do this when we manually batch process, doing the same thing over and over again. These times are valid. The Filtered 1 column of Figure 6a shows the distribution of the think times using a boxplot. Figure 6b shows the distribution using a (line) histogram. It appears from the histogram that the data have the shape of a lognormal distribution (although we did not fit the data to this distribution). Because the data are right-skewed, the median might be a better location estimator than the mean.
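The think time computation and the Filtered 1 rule can be sketched for a single user's transaction stream (Python, invented times; a negative value appears where two transactions overlap):

```python
# invented transactions for one user: (start_seconds, end_seconds), sorted by start
transactions = [(0, 4), (10, 12), (11, 15), (30, 31)]

# think time = gap between the end of one transaction and the start of the next
think_times = [nxt_start - prev_end
               for (_, prev_end), (nxt_start, _) in zip(transactions,
                                                        transactions[1:])]
print(think_times)  # [6, -1, 15]

# Filtered 1: drop times less than or equal to 0
valid = [t for t in think_times if t > 0]
print(valid)        # [6, 15]
```

The -1 arises because the second and third invented transactions overlap, mirroring the negative times described above.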

[Figure 6 appears here: panel (a) "Think Time Boxplot" (Seconds; Filtered 1 and Filtered 2 columns) and panel (b) "Think Time (Line) Histogram" (Frequency vs. Seconds).]

Figure 6: Chart (a) shows a boxplot of the think times for both filtered data sets, while chart (b) shows the same data as a (line) histogram. Each chart also shows important statistics such as the median and mean.

If we were to conduct a short test, we would not want very large times. We removed times over an hour and were still left with over 13 million times. The Filtered 2 column of Figure 6a shows the distribution of these think times using a boxplot.

Modeling Think Time

The point of doing analysis is to either make decisions based on it or feed it into a model. In the cases of interarrival time and session duration, the analyses were just exercises since we did not perform tests that could use the results. However, the think time analysis was used in our performance testing. Our testing is for a new system, while the data we have are from its predecessor system. Each generation adds new functionality and supports more users. We expect the user to behave in a consistent manner on the new system. So, the think time data are applicable. More fidelity is needed when modeling think times. The time spent thinking varies for different actions. Inserting a random time from one distribution for all actions is not accurate. We need a think time distribution for each action. Table 1 lists the actions pertinent to our system. We have chosen to associate a range with each action, and base the think times for each action on a normal distribution. The testing tool can generate uniformly distributed random numbers, and we can use those generated numbers to index a table containing the normally distributed range. The table can be updated without any impact to the scripts. Understanding the think time types is not critical, but we will give a short description of each. Tab is used when the user selects one of several tabs, each leading to a different dialog on the screen. Menu is used when the user selects an item from a pull-down menu. Search Options is used when the user selects or checks options associated

Table 1: Think Time Types and Ranges (in Seconds)

 #  Type            Min  Max
 1  Tab               1    5
 2  Menu              3    9
 3  Search Options    5   15
 4  View              8   24
 5  Update           10   30
 6  Search Results   15   45
 7  Create           15   60
 8  Inter-task       30   90

with searching. View is used when the user looks at data on various screens besides search results. Update is used when the user modifies retrieved data. Search Results is used when the user reviews results from a search. Create is used when the user enters new data. Inter-task is used between iterations of a script. Each think time type has a table of 100 values. The distribution of each table is shown in Figure 7a. The hope is that, when all of the times are collected, a distribution results that compares to the original data. This is difficult to predict without knowing the frequency of each type. Those frequencies could be computed if all of the scripts were done.
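The table-lookup scheme can be sketched in Python. The midpoint mean and range/6 standard deviation below are illustrative assumptions (the generated values are clipped to the [Min, Max] range), not the parameters actually used in our tables:

```python
import random

def make_table(lo, hi, size=100, seed=1):
    """Table of `size` normally distributed think times clipped to [lo, hi].

    Mean at the range midpoint and sd of (hi - lo) / 6 are assumptions
    made for this sketch.
    """
    rng = random.Random(seed)
    mu, sigma = (lo + hi) / 2, (hi - lo) / 6
    return [min(hi, max(lo, rng.gauss(mu, sigma))) for _ in range(size)]

tab_table = make_table(1, 5)   # "Tab" row of Table 1
menu_table = make_table(3, 9)  # "Menu" row of Table 1

# the testing tool hands us a uniformly distributed random number in [0, 1);
# use it to index the table, yielding a normally distributed think time
def think_time(table, uniform01):
    return table[int(uniform01 * len(table)) % len(table)]

t = think_time(tab_table, random.random())
print(1 <= t <= 5)  # True
```

Because the scripts only ever index the table, the table's contents can be regenerated or hand-tuned without any script changes, which is the property noted above.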

[Figure 7 appears here: panel (a) "Think Times by Type" (Frequency vs. Seconds, one curve per type: Tab, Menu, Search Options, View, Update, Search Results, Create, Inter-task) and panel (b) "Think Time Comparison" (Actual vs. Test, Frequency vs. Seconds).]

Figure 7: Chart (a) shows the think time distributions for each type. Each distribution is normal with the intent that the accumulation will result in a lognormal distribution. Chart (b) shows the filtered source think time data and the filtered test think time data for comparison using (line) histograms.

After a test is run, think times are computed from the transactions. We encountered issues in computing think times with the new system similar to those encountered with the current system. False times result from two consecutive transactions that are generated by one user action. However, the transactions do not overlap, and the time between the transactions is small (less than 1 second). So, these times are filtered out. Figure 7b shows a comparison between the source data, filtered to include only times less than 125 seconds, and the filtered test data. In the test data, times can be larger than 90 seconds because some scripts contain two consecutive think times. One think time might be a view type; the second is always an inter-task type. The results are close at a high level. The times in Table 1 could be modified to improve the overall shape, but this task will be postponed until the entire workload is present (i.e., do not waste time tuning an incomplete system).

Use R!

The graphs in this paper were generated with R ([R D09]), although they could be created with another tool, such as Microsoft Excel. Since some non-standard R graphs were created, we thought we would demonstrate how they were made. Script 1 shows an R script that generates a figure similar to Figure 4 (only the original data are plotted in the boxplot). Basic R facilities are used, but extra work is done to make the graphs more presentable. Similar scripts generate Figure 5 and Figure 6.

Script 1: Interarrival Time R Script

# read the data from a CSV file. the column we're interested in is "Duration"
w = read.csv("interarrival_times.csv")
m = mean(w$Duration)  # compute the mean to plot on the graphs
b = boxplot(w$Duration, plot=FALSE)  # create a boxplot without drawing it
# compute the y-axis limit. b$stats[5] is the top whisker of the boxplot
ymax = max(b$stats[5], m)
n = ceiling(ymax)  # convert the limit to an integer for the histogram
# create a histogram without drawing it. the "breaks" run from 0 to n with an
# additional break created at the largest value. we're not going to plot
# anything in the last interval
h = hist(w$Duration, breaks=c(0:n, ceiling(max(w$Duration))), plot=FALSE)
x = h$breaks[2:(n+1)]
y = h$intensities[1:n]
# compute where the 25th, 50th, and 75th percentiles and the mean fall on the
# histogram. approx does the math so that the points are on the line
pts = approx(x, y, xout=c(b$stats[2:4], m))
#
# draw the boxplot
#
pdf("interarrival_time_boxplot.pdf")  # create a PDF file
# draw the previously computed boxplot (b), but limit the y-axis so that the
# whisker and mean can be drawn. since there are so many points above the
# whisker, suppress the points by using "NA_integer_". this prevents the PDF
# file from becoming large
bxp(b, ylim=c(0, ymax), ylab="Minutes", las=1, pch=NA_integer_,
    main="Interarrival Time Boxplot")
# "fake" the points above the whisker by drawing a thick line -- looks the same
lines(c(1, 1), c(b$stats[5], par("usr")[4]), lwd=7)
points(m, pch=23, bg="yellow")  # plot the mean. pch=23 is a filled diamond
dev.off()  # close the file
#
# draw the (line) histogram
#
pdf("interarrival_time_distr.pdf")  # create a PDF file
# plot the histogram as a line. the last bin is not drawn. suppress the y-axis
plot(x, y, type="l", yaxt="n", xlab="Minutes", ylab="Frequency",
     main="Interarrival Time (Line) Histogram")
yticks = axTicks(2)  # compute the standard y-axis ticks
# draw the y-axis, but make it prettier by using a 0%-100% format
axis(2, at=yticks, labels=sprintf("%d%%", yticks*100), las=1)
# plot the percentiles and mean. pch=23 is a filled diamond
points(pts, pch=23, bg=c("red", "red", "red", "yellow"))
# print text next to the points
text(pts, labels=c("25th Pct.", "Median", "75th Pct.", "Mean"),
     pos=c(4, 4, 4, 3), offset=1)
dev.off()  # close the file

R provides the facility to compute the data for a graph (e.g., boxplot or histogram) without drawing it. The data can then be further manipulated or used to improve the graph's appearance. In Script 1, the boxplot's upper whisker is compared to the mean so that the larger determines the upper limit of the graph. This same cut-off point is used to limit the histogram. The histogram actually has one more big bin to hold all of the data that is off the chart. The histogram's breaks provide the x-values, while the intensities provide the y-values. These values, which define the histogram line, are also used to determine where the descriptive statistics fall since the x-values will probably not be integers. The boxplot contains a large number of values that exceed the upper whisker. There are so many that they appear as a thick line. To save space in the generated file, the points are suppressed and are simulated by a thick line. The y-axis of the histogram is labeled with formatted percentages rather than decimal numbers less than 1. In contrast to the next example, these percentages fit in the space provided and can be drawn with the axis command.

Figure 1 and Figure 3 create a pretty y-axis with a comma separator for numbers over 1,000 and with the numbers written horizontally. Script 2 is an excerpt of a script that demonstrates how to create the axis tick labels. The main problem that is solved is avoiding the y-axis label colliding with the tick labels. Two things must be done to overcome this: the margin must be made larger, and the axis label must be written as margin text (i.e., it cannot be created by the plot command or the axis command).
Script 2: R Script Excerpt Demonstrating Other Techniques

# num_hours, maxs, avgs, and mins are set earlier in the real script
par(mai=c(1.02, 1.02, 0.82, 0.42))  # set the margins
plot(c(0, num_hours), c(0, max(maxs)), type="n", ylab="", xlab="Day of Week",
     main="Users by Hour of Day of Week", xaxt="n", yaxt="n")
mtext("Users", side=2, line=4)  # put the axis label in the margin
yticks = axTicks(2)  # compute the standard y-axis ticks
# draw the y-axis, but make it prettier by adding commas
axis(2, at=yticks, labels=prettyNum(yticks, big.mark=","), las=1)
#
polygon(c(1, 1:num_hours, num_hours), c(0, maxs, 0), col="pink", border="red3")
polygon(c(1, 1:num_hours, num_hours), c(0, avgs, 0), col="lightblue", border="blue3")
polygon(c(1, 1:num_hours, num_hours), c(0, mins, 0), col="lightgreen", border="green4")

The day of week charts in Figure 2 and Figure 3 are drawn using polygon commands. The last few lines of Script 2 demonstrate drawing them. Smaller polygons overlay the larger polygons.

Conclusions

This paper examines some common user metrics mined from a real transaction system. Some characteristics of each metric were discussed. Some questions were posed, but with only a few answers given. The distributions of the interarrival time, session duration, and think time data were shown. In the case of think time, the data were used for testing in the new system. However, the test required finer detail than the available data provided. The inability to reverse-engineer the different types of think times left us guessing as to how to apply them in a test. Normal distributions were applied to the different think time categories with the hope that the accumulation of times would result in a distribution similar to the source data.

References
[R D09] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009.

[Wil10a] Tom Wilson. Principles of Performance Measurement. CMG MeasureIT, June 2010.

[Wil10b] Tom Wilson. Workload Correlation and Visualization. CMG 10 International Conference, December 2010.
