ABSTRACT

In this paper, we analyze web access and error logs to identify major error sources and to evaluate web site reliability. Our results show that both the error distribution and the reliability distribution among different file types are highly uneven, and point to the potential benefit of focusing on specific file types with a high concentration of errors and high impact to effectively improve web site reliability and overall user satisfaction.

KEYWORDS

Web errors and problems, reliability, error log, access log, file type and classification.
1. INTRODUCTION
The prevalence of the World Wide Web also spreads intended or unintended problems on an ever larger scale. These problems include various malicious viruses as well as unintended failures caused by communication breakdowns, hardware failures, and software defects. Identifying the root causes of these problems can help us understand their severity and scope. More importantly, such understanding helps us derive effective means to deal with the problems and improve web reliability. In this paper, we focus on the identification, analysis, and characterization of software defects that lead to web problems and affect web reliability.

The 80:20 rule (Koch, 2000), which states that the majority (say 80%) of the problems can be traced back to a small proportion (say 20%) of the components, has been observed to be generally true for software systems (Porter and Selby, 1990; Tian, 1995). Identifying and characterizing these major error sources could therefore lead us to effective reliability improvement. For web applications, various log files are routinely kept at web servers. In this paper, we extend our previous study on statistical web testing and reliability analysis (Kallepalli and Tian, 2001) to extract web error and workload information from these log files to support our analyses.

The rest of the paper is organized as follows: Section 2 analyzes the web reliability problems and examines the contents of various web logs. Section 3 presents our error analyses, with common error sources identified and characterized, followed by reliability analyses in Section 4. Conclusions and future directions are discussed in Section 5.
achieved via prevention of web failures or reduction of chances for such failures. We define web failures as the inability to obtain and deliver information, such as documents or computational results, requested by web users. This definition conforms to the standard definition of failures as behavioral deviations from user expectations (IEEE, 1990). Based on this definition, we can consider the following failure sources in the process of obtaining and delivering information requested by web users:

Host or network failures: Host hardware or system failures and network communication problems may lead to web failures. However, such failures are no different from regular system or network failures, which can be analyzed by existing techniques. Therefore, these failure sources are not the focus of our study.

Browser failures: These failures can be treated the same way as software product failures, so existing techniques for software quality assurance and reliability analysis can be used to deal with them. Therefore, they are not the focus of our study either.

Source or content failures: Web failures can also be caused by the information source itself at the server side. We primarily deal with this kind of web failure in this study.
The failure information, when used in connection with workload measurement, can be fed into many software reliability models (Lyu, 1995; Musa, 1998) to help us evaluate the web site reliability and the potential for reliability improvement. In this paper, we use the Nelson model (Nelson, 1978), one of the earliest and most widely used input domain reliability models, to assess the web site's current reliability. If a total of f failures are observed for n workload units, the estimated reliability R according to the Nelson model can be obtained as:

R = 1 - r = 1 - f/n = (n - f)/n

where r = f/n is the failure rate, a complementary measure to reliability R. The summary reliability measure, mean-time-between-failures (MTBF), can be calculated as:

MTBF = n/f = 1/r

If discovered defects are fixed over time, their effect on reliability (or reliability growth due to defect removal) can be analyzed by using various software reliability growth models (Lyu, 1995). Both the time domain and input domain information can also be used in tree-based reliability models (Tian, 1995) to identify reliability bottlenecks for focused reliability improvement.
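The Nelson model estimate above is simple enough to check with a few lines of code. The sketch below (in Python; the function name is ours) uses the 404-error totals reported later in the paper, 28631 errors over 763021 logged hits:

```python
def nelson_reliability(failures, workload_units):
    """Nelson input-domain model: f failures over n workload units."""
    r = failures / workload_units      # failure rate r = f/n
    reliability = 1 - r                # R = 1 - r = (n - f)/n
    mtbf = workload_units / failures   # MTBF = n/f = 1/r
    return reliability, r, mtbf

# The paper's 404-error figures: 28631 errors over 763021 logged hits.
R, r, mtbf = nelson_reliability(28631, 763021)
print(round(R, 4), round(r, 4), round(mtbf, 2))  # → 0.9625 0.0375 26.65
```

Here a "workload unit" is one logged request; other workload measures (sessions, bytes transferred) would plug into the same formula.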
[Mon Aug 16 00:00:56 1999] [error] [client 129.119.4.17] File does not exist: /users/csegrad2/srinivas/public_html/Image10.jpg

Figure 2. A sample entry in an error log.
A ``hit'' is registered in the access log if a file corresponding to an HTML page, a document, or other web content is explicitly requested, or if some embedded content, such as graphics or a Java class within an HTML page, is implicitly requested or activated. Most web servers record the following information in their access logs: the requesting computer, the date and time of the request, the file that the client requested, the size of the requested file, and an HTTP status code. In this paper, we use this information together with error information to assess the impact of different types of web sources on web reliability.
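Most servers write these access-log fields in the Common Log Format, from which they can be pulled out with a regular expression. A minimal sketch, assuming that format (the sample line below is illustrative, not taken from the paper's logs):

```python
import re

# Common Log Format: host identity user [timestamp] "request" status size.
CLF = re.compile(r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                 r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)')

# A hypothetical entry in the style of the logs analyzed here:
line = ('129.119.4.17 - - [16/Aug/1999:00:00:56 -0500] '
        '"GET /gradinfo.html HTTP/1.0" 200 2048')
m = CLF.match(line)
print(m.group('host'), m.group('status'), m.group('size'))
# → 129.119.4.17 200 2048
```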
ANALYZING WEB LOGS TO IDENTIFY COMMON ERRORS AND IMPROVE WEB RELIABILITY
Although access logs also record common HTML errors, separate error logs are typically used by web servers to record details about the problems encountered. The format of these error logs is simple: a timestamp followed by the error message, as in Figure 2. Common error types are listed below:

- permission denied
- no such file or directory
- stale NFS file handle
- client denied by server configuration
- file does not exist
- invalid method in request
- invalid URL in request
- connection mod_mime_magic
- request failed
- script not found or unable to start
- connection reset by peer
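Tallying error-log entries by these types is a matter of matching each logged message against the predefined phrases. A minimal sketch, assuming substring matching suffices (a simplification; only a few of the types are shown):

```python
from collections import Counter

# A few of the predefined error types, matched as lowercase substrings.
ERROR_TYPES = (
    "permission denied",
    "file does not exist",
    "script not found or unable to start",
    "connection reset by peer",
)

def classify(entry):
    """Assign an error-log line to the first matching predefined type."""
    msg = entry.lower()
    for etype in ERROR_TYPES:
        if etype in msg:
            return etype
    return "other"

log = [
    "[Mon Aug 16 00:00:56 1999] [error] [client 129.119.4.17] "
    "File does not exist: /users/csegrad2/srinivas/public_html/Image10.jpg",
]
tally = Counter(classify(entry) for entry in log)
print(tally["file does not exist"])  # → 1
```

Running such a tally over the full error log yields the distribution reported in Table 1.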
Notice that most of these errors conform closely to the source or content failures we defined in Section 2.1. We refer to such failures as errors in subsequent discussions to conform to the commonly used terminology in the web community. Questions about error occurrences and distribution, as well as overall reliability of the web site, can be answered by analyzing error logs and access logs.
Table 1. Error distribution by predefined error types.

Error type                              Errors
permission denied                         2079
no such file or directory                   14
stale NFS file handle                        4
client denied by server configuration        2
file does not exist                      28631
invalid method in request                    0
invalid URL in request                       1
connection mod_mime_magic                    1
request failed                               1
script not found or unable to start         27
connection reset by peer                     0
Total                                    30760
There are two scenarios in the denied access situations: The first is denying unauthorized accesses to restricted resources, which should not be counted as web failures. The second is wrongfully denied accesses to unrestricted resources, or to restricted resources with proper access authorization, which should be counted as web failures. Consequently, further information needs to be gathered for this type of error to determine whether to count it as a failure. Only then can these properly counted failures be used in web reliability analyses. File-not-found errors usually represent bad links and should be counted as web failures. They are also called 404 errors because of their status code in the web logs. They are by far the most common type of problem in web usage, accounting for more than 90% of the total errors in this case. Further analyses can be performed to examine the causes of these errors and to assess web site reliability.
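The distinction drawn above, between denied accesses that need follow-up investigation and 404 errors counted directly as failures, can be made mechanically from the HTTP status code recorded in the access log. A minimal sketch (the category labels are ours):

```python
def triage(status):
    """Sort a logged HTTP status code into the failure-counting
    categories discussed above (labels are illustrative)."""
    if status in (401, 403):
        return "denied access: needs follow-up before counting"
    if status == 404:
        return "web failure: file not found"
    if status >= 400:
        return "other error"
    return "not counted as a failure"

print(triage(404))  # → web failure: file not found
print(triage(403))  # → denied access: needs follow-up before counting
```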
For our web site, there are more than 100 different file extensions, with most of them accounting for very few 404 errors. We sorted these file extensions by their corresponding 404 errors and give the results for the top 10 in Table 2. These top 10 error sources represent a dominating share (more than 99%) of the overall 404 errors. In fact, only four file types, .gif, .class, directory, and .html, represent close to 90% of all
the errors. Our results generally confirm the uneven distribution of problems, or the 80:20 rule mentioned earlier, and point out the potential benefit of identifying and correcting such highly problematic areas. The top error sources indicate what kind of problems a web user would be most likely to encounter when a request is not completed because of 404 errors. Consequently, fixing these problems would improve the overall web site reliability and thus overall user satisfaction. Furthermore, root cause analysis can be carried out to understand the reasons for these high-defect file types, and corresponding follow-up actions can be taken to prevent future problems related to the identified causes. On the other hand, the effort to fix the problems may not be directly linked to the total number of 404 errors for each file type, because a few missing files may be requested repeatedly with high frequency, resulting in a high error count for the corresponding file type. In fact, the effort to fix the problems is directly proportional to the actual number of such missing files. This number can be obtained from the error log by counting unique 404 errors, i.e., an error is counted only once, at its first observation, and not counted subsequently.
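Counting unique 404 errors, as described above, amounts to keeping only the first observation of each missing file and discarding repeats. A sketch, with hypothetical file paths:

```python
def count_unique_404s(requested_paths):
    """Count each missing file once, at its first observation;
    repeated requests for the same path are ignored."""
    return len(set(requested_paths))

# A missing file requested three times contributes one unique error:
paths = ["/java/Applet1.class", "/java/Applet1.class",
         "/images/logo.gif", "/java/Applet1.class"]
print(count_unique_404s(paths))  # → 2
```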
Table 3. File types and corresponding unique errors or missing files.

File type   Unique errors   % of total   Errors   % of total
directory            1137       40.46%     4443       15.52%
.html                 897       31.92%     3639       12.71%
.gif                  274        9.75%    12471       43.56%
.ico                  122        4.34%      849        2.97%
.jpg                  106        3.77%     1354        4.73%
.ps                    52        1.85%      209        0.73%
.pdf                   42        1.49%      235        0.82%
.txt                   25        0.89%       32        0.11%
.doc                   23        0.82%       75        0.26%
.class                 21        0.75%     4913       17.16%
Cumulative           2699       96.05%    28220       98.56%
Total                2810         100%    28631         100%
Table 3 gives the top 10 missing file types and the corresponding numbers of such missing files, with the corresponding total errors also presented for comparison. These top 10 file types are almost identical to those in Table 2, with the exception that .mp3 is replaced by .txt. As in Table 2, there is an uneven distribution of missing files, with the top 10 file types representing 96.05% of all the missing files, and three types, directory, .html, and .gif, representing more than 80%. However, the relative ranking is quite different. The most striking difference is for the .class files, where 21 missing files, representing only 0.75% of all the missing files, were requested 4913 times (or 17.16% of the 404 errors). A similar difference is observed for .gif files, with relatively few missing files requested relatively often. The directories and .html files demonstrate the opposite trend, where relatively more missing files or directories were requested less often. These differences confirm the usefulness of obtaining unique error or missing file counts: such information as in Table 3 can help us properly allocate development effort to fix missing file problems.
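The contrast described above can be made concrete by dividing total 404 errors by unique missing files for each type, giving the average number of times each missing file was requested (figures from Table 3):

```python
# (file type, total 404 errors, unique missing files) from Table 3
rows = [("directory", 4443, 1137), (".gif", 12471, 274), (".class", 4913, 21)]
for ftype, errors, unique in rows:
    print(ftype, round(errors / unique, 1))
# Each missing .class file was requested about 234 times on average,
# versus roughly 4 requests per missing directory.
```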
If we count all the errors (not just 404 errors), the total is 30760, giving us slightly worse reliability numbers, as follows:

- Error rate r = 0.0403, or 4.03% of the requests will result in an error.
- Reliability R = 0.9597, or the web site is 95.97% reliable.
- MTBF = 24.81, or on average, a user would expect a problem for every 24.81 requests.
Table 4. File types and corresponding reliability (error rate).

File type      Hits   % of total   Errors   % of total   Error rate
.gif         438536       57.47%    12471       43.56%       0.0284
.html        128869       16.89%     3639       12.71%       0.0282
directory     87067       11.41%     4443       15.52%       0.0510
.jpg          65876        8.63%     1354        4.73%       0.0199
.pdf          10784        1.41%      235        0.82%       0.0218
.class        10055        1.32%     4913       17.16%       0.4886
.ps            2737        0.36%      209        0.73%       0.0764
.ppt           2510        0.33%        7        0.02%       0.0028
.css           2008        0.26%       17        0.06%       0.0085
.txt           1597        0.21%       32        0.11%       0.0200
.doc           1567        0.21%       75        0.26%       0.0479
.c             1254        0.16%        5        0.02%       0.0040
.ico            849        0.11%      849        2.97%       1.0000
Cumulative   753709       98.78%    28249       98.67%       0.0375 (average rate)
Total        763021         100%    28631         100%       0.0375 (average rate)
The subsets with extremely low reliability (or high error rate) are the types .ico and .class: every request for .ico files resulted in a 404 error, while close to half (48.86%) of the requests for .class files resulted in 404 errors. Further analysis can be carried out to understand the root causes for these low-reliability file types, and corresponding follow-up actions can be taken to prevent future reliability problems related to the identified causes. Relative to the average error rate of 0.0375, we can categorize these 13 major file types into four categories by their reliability (error rate):

- Good reliability, with error rate r < 0.01: .c, .css, and .ppt files.
- Above-average reliability, with error rate r between 0.01 and 0.0375 (the average): .txt, .html, .gif, .jpg, and .pdf files.
- Below-average reliability, with error rate r between 0.0375 (the average) and 0.1: directories and .doc and .ps files.
- Poor reliability, with error rate r > 0.1: .ico and .class files.
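This four-way categorization can be expressed as a simple threshold function (thresholds from the text; the function name is ours):

```python
AVERAGE_RATE = 0.0375  # overall 404 error rate from Table 4

def reliability_category(error_rate, average=AVERAGE_RATE):
    """Bucket a file type by its 404 error rate, per the categories above."""
    if error_rate < 0.01:
        return "good"
    if error_rate < average:
        return "above-average"
    if error_rate <= 0.1:
        return "below-average"
    return "poor"

# Error rates from Table 4:
print(reliability_category(0.0028))  # .ppt      → good
print(reliability_category(0.0284))  # .gif      → above-average
print(reliability_category(0.0510))  # directory → below-average
print(reliability_category(0.4886))  # .class    → poor
```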
This reliability categorization also helps us prioritize effort for reliability improvement: the file types with poor reliability should be the primary focus, because they represent the reliability bottleneck, and improvement to their reliability will have a large impact on the overall reliability. In this case, fixing the 122 missing .ico files and the 21 missing .class files (see Table 3 for unique errors or missing files) would eliminate 5762 (= 849 + 4913) errors and improve the overall web site reliability from an average error rate r = 0.0375 to r = 0.0300. This significant reliability improvement of 20% can be achieved with relatively low effort, because these 143 (= 122 + 21) missing files represent only a small share (about 5%) of all 2810 missing files. After fixing these problems, we can focus on the below-average file types, then the above-average ones, before working on the good ones. This priority scheme gives us a cost-effective procedure to improve the overall web site reliability.
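The projected gain from fixing the two poor-reliability file types works out as follows (all figures from Tables 3 and 4):

```python
hits, errors = 763021, 28631   # total hits and 404 errors (Table 4)
removed = 849 + 4913           # all .ico and .class 404 errors (Table 4)

r_before = errors / hits
r_after = (errors - removed) / hits
improvement = (r_before - r_after) / r_before

print(round(r_before, 4))     # → 0.0375
print(round(r_after, 4))      # → 0.03
print(round(improvement, 2))  # → 0.2, a 20% reduction in error rate
```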
5. CONCLUSION
By analyzing the unique problems and information sources of the web environment, we have developed an approach for identifying and characterizing web errors and for assessing and improving web site reliability based on information extracted from existing web logs. This approach has been applied to analyze the log files for the web site of the School of Engineering at Southern Methodist University. Our results demonstrate that the error distribution across different error types and sources is highly uneven. In addition, the missing file distribution, workload distribution, and reliability distribution for individual types of requested files are all quite uneven. These distributions generally follow the so-called 80:20 rule, where a few components are responsible for most of the problems or effort. Our analysis results can help web site owners prioritize their web site maintenance and quality assurance effort and guide further analyses, such as root cause analysis, to identify problem causes and perform preventive and corrective actions. All these focused actions and efforts would lead to better web service and user satisfaction due to the improved web site reliability.

The primary limitation of our study is that our web site may not be representative of many non-academic web sites. Most of our web pages are static, with HTML documents and embedded graphics dominating other types of pages, while in e-commerce and various other business applications, dynamic pages and context-sensitive contents play a much more important role. To overcome these limitations, we plan to analyze some public domain web logs, such as those from the Internet Traffic Archive at ita.ee.lbl.gov or the W3C Web Characterization Repository at repository.cs.vt.edu, to cross-validate our general results.
As an immediate follow-up to this study, we plan to analyze web site reliability over time for different types of error sources, and perform related risk identification activities for focused reliability improvement. We also plan to identify better existing tools, develop new tools and utility programs, and integrate them to provide better implementation support for our strategy. All these efforts should lead us to a more practical and effective approach to achieve high quality web service and to maximize user satisfaction.
ACKNOWLEDGEMENT
This research is supported in part by NSF grants 9733588 and 0204345, THECB/ATP grants 003613-00301999 and 003613-0030-2001, and Nortel Networks.
REFERENCES
Behlandorf, B., 1996. Running a Perfect Web Site with Apache, 2nd Ed. MacMillan Computer Publishing, New York, USA.

IEEE, 1990. IEEE Standard Glossary of Software Engineering Terminology. STD 610.12-1990. IEEE.

Kallepalli, C. and Tian, J., 2001. Measuring and modeling usage and reliability for statistical web testing. IEEE Trans. on Software Engineering, Vol. 27, No. 11, pp. 1023-1036.

Koch, R., 2000. The 80/20 Principle: The Secret of Achieving More With Less. Nicholas Brealey Publishing, London, UK.

Lyu, M. R., editor, 1995. Handbook of Software Reliability Engineering. McGraw-Hill, New York, USA.

Musa, J. D., 1998. Software Reliability Engineering. McGraw-Hill, New York, USA.

Nelson, E., 1978. Estimating software reliability from test data. Microelectronics and Reliability, Vol. 17, No. 1, pp. 67-73.

Porter, A. A. and Selby, R. W., 1990. Empirically guided software development using metric-based classification trees. IEEE Software, Vol. 7, No. 2, pp. 46-54.

Tian, J., 1995. Integrating time domain and input domain analyses of software reliability using tree-based models. IEEE Trans. on Software Engineering, Vol. 21, No. 12, pp. 945-958.

Tian, J. and Palma, J., 1997. Test workload measurement and reliability analysis for large commercial software systems. Annals of Software Engineering, Vol. 4, pp. 201-222.