The Intelligent Classification Engine uses neural networks’ learning capabilities to provide fast, accurate differentiation between pornographic and nonpornographic Web pages. The basic framework can also serve to distinguish other types of Web content.

Web-filtering systems are commercially available, and potential users can download trial versions from the Internet. However, the techniques these systems use are insufficiently accurate and do not adapt well to the ever-changing Web. (For more on current approaches and systems, see the “Web Content-Filtering Approaches” and “Current Web Content-Filtering Systems” sidebars.)

To solve this problem, we propose using artificial neural networks (ANNs)[1,2] to classify Web pages during content filtering. We focus on blocking pornography because it is among the most prolific and harmful Web content. According to CyberAtlas, pornography-related terms such as “sex” and “porn” are among the top 20 search terms queried at the 10 leading Internet portals and search engines.[3] Furthermore, research suggests that pornography is addictive and causes harmful side effects.[4] However, our general framework is adaptable for filtering other objectionable Web material.

Know the enemy
How do pornographic Web pages differ from others? We attempt to answer this question by studying these pages’ characteristics and analyzing data.

Characteristics
Understanding pornographic Web pages’ characteristics can help us develop effective content analysis techniques. Although it is well known that pornographic Web sites contain many sexually oriented images, text and other information can also help us distinguish these sites. We focus on three characteristics: page layout format, use of PICS (Platform for Internet Content Selection) ratings, and indicative terms.

Page layout format. We treat all the HTML documents used to construct a multiframe Web page as a single entity. This is because statistics obtained from any aspects of a Web page should be derived as a whole from the aggregated data collected from every HTML document that is part of that page.

PICS use. Web publishers can use PICS labels to limit access to Web content (see the “Web Content-Filtering Approaches” sidebar). A PICS label is usually distributed with the associated Web page by one of these methods:

• The Web publisher (or Web content owner) embeds the label in the HTML code’s header section.
• The Web server inserts the label in the HTTP packet’s header section before sending the page to a requesting client.

In both cases, we can’t determine whether a Web page has a PICS label simply by inspecting the Web page contents displayed in a browser. The first method requires viewing the Web page’s HTML code, which you can easily do with any major Web browser that supports such functionality. In contrast, the second method requires checking the HTTP packet’s header section, which you can’t do with any known browser. Consequently, to analyze pornographic Web sites’ use of PICS systems, we collect statistics from the samples’ HTML code.

Indicative terms. Terms (words or phrases) that indicate pornographic Web pages fall into two major groups according to their meanings and use. Most are sexually explicit terms; the rest consist primarily of legal terms used to establish the legal conditions of use of the material. Legal terms often appear because many pornographic Web sites’ entry pages contain a warning message block.

Most indicative terms occur in the pornographic Web page’s text. We can extract them from different locations of the corresponding HTML document that might contain information useful for distinguishing pornographic Web sites. These locations are:

• The Web page title
• The warning message block
• Other viewable text in the Web browser window
• The “description” and “keywords” metadata
• The Web page’s URL and other URLs embedded in the Web page
• The image tooltip (the text string displayed when a user points to an object using a mouse; it usually occurs in the <IMG> tag)
• Graphical text (sometimes graphics or images contain text that we can extract)

Indicative terms might be displayed or nondisplayed in the Web browser window. Displayed terms appear in the Web page title, warning message block, other viewable text, and graphical text. Nondisplayed terms are stored in the URL, the “description” and “keywords” metadata, and the image tooltip.
Table A. A comparison of 10 popular Web-filtering systems. Bold lettering indicates each system’s main approach.

                        |------------- Content-filtering approach -------------|
System        Location  PICS support  URL blocking  Keyword filtering  Content analysis  Filtering domain
Cyber Patrol  Client    Yes           Yes           Yes                No                General
Cyber Snoop   Client    Yes           Yes           Yes                No                General
CYBERsitter   Client    Yes           Yes           Yes(A)             No                General
I-Gear        Server    Yes           Yes           Yes(B)             No                General
Net Nanny     Client    No            Yes           Yes                No                General
SmartFilter   Server    No            Yes           No                 No                General
SurfWatch     Client    Yes           Yes           Yes                No                General
WebChaperone  Client    Yes           Yes           Yes                Yes(C)            Pornographic
Websense      Server    No            Yes           Yes(D)             No                General
X-Stop        Client    Yes           Yes           Yes(E)             No                Pornographic
Vector encoding
Index              1       2       3  4  ...  55      56      57      58      59      60      61
Basic vector       1       1       0  0  ...  3       1       14      8       1       24      236

Weight multiplication
Normalized vector  0.0042  0.0042  0  0  ...  0.0499  0.0146  0.0582  0.1331  0.0125  0.0998  0.9775

Figure 3. An example of transformation. Vector encoding is the result of preprocessing that forms the 61-element basic vector. Weight multiplication then weighs the elements according to their relative indicativeness. The weighted vector is then normalized before being fed to the ANN.
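The transformation in Figure 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's code: the weight values are hypothetical, and the normalization is assumed to scale the weighted vector to unit Euclidean length, which this excerpt does not state explicitly. A four-element vector stands in for the 61-element page vector.

```python
import math


def transform(basic_vector, weights):
    """Weight each element, then scale the result to unit Euclidean length."""
    if len(basic_vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    weighted = [x * w for x, w in zip(basic_vector, weights)]
    norm = math.sqrt(sum(v * v for v in weighted))
    if norm == 0.0:
        return [0.0] * len(weighted)  # an all-zero page vector stays all-zero
    return [v / norm for v in weighted]


basic = [1, 0, 3, 24]             # hypothetical stand-in for the basic vector
weights = [1.0, 1.0, 4.0, 1.0]    # hypothetical indicativeness weights
normalized = transform(basic, weights)
print([round(v, 4) for v in normalized])
```

Scaling to unit length makes pages of very different sizes comparable, because only the relative proportions of the weighted counts reach the network.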
Repeat steps 2 and 3 for every exemplar in the training set for a user-defined number of iterations.
Figure 4. The KSOM (Kohonen’s Self-Organizing Maps) training algorithm.
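Because the individual steps of Figure 4 are not reproduced in this excerpt, the sketch below follows the standard textbook formulation of Kohonen training rather than the figure itself: initialize the weight vectors, find the winning neuron for an exemplar, pull the winner and its grid neighbors toward that exemplar, and repeat over the training set. The function names, the linear decay schedules, the Gaussian neighborhood, and the toy data are all assumptions. Table 4's 49 output neurons could correspond to a 7 x 7 grid, but the excerpt does not state the layout.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)),
    )


def train_som(exemplars, rows, cols, iterations, lr0=0.5, seed=0):
    """Textbook SOM training on a rows x cols grid; every parameter is illustrative."""
    rng = random.Random(seed)
    dim = len(exemplars[0])
    # Initialize each neuron's weight vector with random values in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for t in range(iterations):
        decay = 1.0 - t / iterations
        lr = lr0 * decay                     # learning rate shrinks linearly
        radius = max(radius0 * decay, 0.5)   # neighborhood shrinks to the winner only
        for x in exemplars:
            # Find the winning neuron for this exemplar.
            bmu = best_matching_unit(weights, x)
            b_row, b_col = divmod(bmu, cols)
            # Pull the winner and its grid neighbors toward the exemplar.
            for j, w in enumerate(weights):
                row, col = divmod(j, cols)
                d2 = (row - b_row) ** 2 + (col - b_col) ** 2
                if d2 <= radius * radius:
                    h = math.exp(-d2 / (2.0 * radius * radius))
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
    return weights


# Hypothetical toy data: two loose groups of 3-element vectors.
data = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train_som(data, rows=3, cols=3, iterations=50)
clusters = [best_matching_unit(som, x) for x in data]
print(clusters)
```

After training, each output neuron acts as one cluster, and the winning neuron's index serves as the cluster ID for a page vector.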
Category assignment. After the ANN generates the clusters, we still need to determine each cluster’s nature. This is because a cluster defines a group of Web pages with similar features and characteristics but does not classify those similarities.

The engine assigns each cluster to one of three categories: pornographic, nonpornographic, or unascertained. Unascertained clusters contain a fair mixture of pornographic and nonpornographic Web pages. Table 3 summarizes the thresholds for category assignment. For example, if a cluster contains 70 percent of Web pages labeled as “pornographic,” we map the cluster to the “pornographic” category.

Table 3. Thresholds for category assignment.

                  Proportion of Web pages in clusters
Cluster category  Pornographic  Nonpornographic
Pornographic      [70%, 100%]   [0%, 30%)
Nonpornographic   [0%, 30%)     [70%, 100%]
Unascertained     [30%, 70%)    [30%, 70%]

The system records the results in a cluster-to-category mapping database. Each database entry has a unique ID identifying one of the generated clusters and its category.

Classification
This process uses the trained ANN to classify incoming Web pages; it outputs one of the three predefined categories. Similar to the training process, it also performs feature extraction, preprocessing, and transformation for each incoming Web page. After these steps have generated the Web page vector, the system feeds the vector into the ANN to produce an activated cluster. The categorization step uses the cluster-to-category mapping database to determine the activated cluster’s category, which it uses to classify the corresponding Web page.

To further reduce the number of unascertained Web pages, we introduce a postprocessing step called metacontent checking. This step applies keyword filtering to the “description” and “keywords” metacontents in the HTML header of the unascertained Web pages. This mechanism is effective because the metacontents include terms directly related to the associated Web pages’ subject. For keywords, we use the terms in the indicative terms dictionary. If the filtering finds at least one indicative term in the metacontents, the engine classifies the associated Web page as pornographic. Otherwise, it identifies the page as nonpornographic. If the engine cannot find the metacontents or they do not exist, the page remains unascertained.

Table 4. Training efficiency for the Kohonen’s Self-Organizing Maps and the Fuzzy Adaptive Resonance Theory networks.

                              Results
Attribute                     KSOM                       Fuzzy ART
Number of inputs              61                         61
Number of output neurons      49                         47
Number of iterations          24,500                     93
Number of training exemplars  4,786                      4,786
Total processing time         37 hrs., 43 min., 23 sec.  47 sec.

Table 6. Classification accuracy. The “Meta” column indicates the number of Web pages that metacontent checking classified.

                            Correctly classified  Incorrectly classified
ANN type   Web page         ANN    Meta           ANN    Meta             Unascertained  Total
KSOM       Pornographic     499    9              23     0                4              535
           Nonpornographic  496    1              5      2                19             523
           Total            1,005 (95.0%)         30 (2.8%)               23 (2.2%)      1,058
Fuzzy ART  Pornographic     428    32             47     0                28             535
           Nonpornographic  475    8              7      9                24             523
           Total            943 (89.1%)           63 (6.0%)               52 (4.9%)      1,058

Performance evaluation
For training, we used the 4,786 Web pages (93,578,232 bytes) mentioned in the “Training” section. First, we measured the processing time required for the three pre-ANN steps (feature extraction, preprocessing, and transformation) for both training and classification. The three steps took 167 seconds to process all Web pages (an average of 35 ms per page).
Next, we measured the training efficiency and accuracy for KSOM and Fuzzy ART. Tables 4 and 5 summarize the results. Although KSOM requires much longer training time, it produces better training accuracy and gives a smaller set of unascertained Web pages.
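The size of the unascertained set follows from how clusters are mapped to categories. Below is a minimal sketch of the Table 3 thresholds and of a cluster-to-category mapping, with hypothetical function names and toy data. Table 3 mixes open and closed interval bounds, so the handling of the exact 70 percent boundary is one reading of it, chosen to agree with the article's example that a cluster with 70 percent pornographic pages maps to "pornographic".

```python
from collections import Counter

PORN = "pornographic"
NONPORN = "nonpornographic"
UNASCERTAINED = "unascertained"


def assign_category(n_porn, n_nonporn):
    """Map a cluster's labeled page counts to one of the three categories."""
    total = n_porn + n_nonporn
    if total == 0:
        return UNASCERTAINED  # assumption: an empty cluster cannot be ascertained
    # Integer arithmetic avoids floating-point trouble at exactly 70 percent.
    if 100 * n_porn >= 70 * total:
        return PORN
    if 100 * n_nonporn >= 70 * total:
        return NONPORN
    return UNASCERTAINED


def build_mapping(cluster_ids, labels):
    """Build a cluster-ID-to-category dictionary from labeled training pages."""
    counts = {}
    for cluster_id, label in zip(cluster_ids, labels):
        counts.setdefault(cluster_id, Counter())[label] += 1
    return {cid: assign_category(c[PORN], c[NONPORN]) for cid, c in counts.items()}


# Hypothetical toy training result: the cluster each page activated, and its label.
cluster_ids = [0, 0, 0, 1, 1, 2, 2]
labels = [PORN, PORN, PORN, PORN, NONPORN, NONPORN, NONPORN]
mapping = build_mapping(cluster_ids, labels)
print(mapping)
```

At classification time, looking up the activated cluster's ID in such a mapping yields the page's category, and only pages that land in unascertained clusters need further checking.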
Finally, we measured both networks’ classification accuracy. We compiled a testing exemplar set with 535 pornographic Web pages and 523 nonpornographic Web pages. Table 6 summarizes the results. The “Meta” column indicates the number of Web pages that metacontent checking classified.

From these results, we conclude that the KSOM network performed much better than

Figure 6. The engine misclassified this page as pornographic because it contains sexually explicit terms in its displayed contents and the “description” and “keywords” metacontents.
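The metacontent checking step counted in Table 6's “Meta” columns can be sketched as a three-way decision over a page's “description” and “keywords” metadata. This is a minimal illustration with hypothetical names and demo terms, not the engine's code. Plain case-insensitive substring matching and the treatment of empty metadata as missing are both assumptions.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collects the content of "description" and "keywords" META elements."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("description", "keywords"):
            self.contents.append(attr.get("content", ""))


def metacontent_check(html, indicative_terms):
    """Classify an unascertained page from its metacontents alone."""
    parser = MetaContentParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.contents).lower().strip()
    if not text:
        return "unascertained"  # metacontents missing or empty (assumption)
    if any(term.lower() in text for term in indicative_terms):
        return "pornographic"   # at least one indicative term found
    return "nonpornographic"


# Hypothetical demo terms and pages.
terms = ["adults only"]
flagged = '<head><meta name="keywords" content="Adults Only, members"></head>'
clean = '<head><meta name="description" content="Cooking recipes"></head>'
bare = "<head><title>No metadata</title></head>"
print(metacontent_check(flagged, terms))
print(metacontent_check(clean, terms))
print(metacontent_check(bare, terms))
```

Because a single matching term decides the outcome, a page whose metadata mentions such terms for legitimate reasons is flagged, which is the kind of error Figure 6 illustrates.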
Acknowledgments
The list of products surveyed in this article is not exhaustive. We are not related to any of the vendors, and we do not endorse any of these products.

References
6. Y. Yang and C.G. Chute, “An Application of Least Squares Fit Mapping to Text Information Retrieval,” Proc. 16th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 93), ACM Press, New York, 1993, pp. 281–290.
12. D. Roussinov and H. Chen, “A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation,” Communication and Cognition - Artificial Intelligence, vol. 15, nos. 1–2, Spring 1998, pp. 81–111.