
Bayesian Inferencing AKA Naive Bayesian Filtering

Using b8

Spam Filtering with b8:


    // Start using the Bayesian filtering class
    $b8 = new b8;

    // Try to classify a text
    $b8->classify('Hello World');

    // Show it something that isn't spam
    echo $b8->classify("Everybody has a birthday");
    $b8->learn("Everybody has a birthday", "ham");
    echo $b8->classify("Everybody has a birthday");

    // Show it something that is spam
    echo $b8->classify("Today isn't mine.");
    $b8->learn("Today isn't mine.", "spam");
    echo $b8->classify("Today isn't mine.");

    // Try to classify a text
    echo $b8->classify("It's somebody's birthday today");

    // Show it that this isn't spam either
    echo $b8->classify("It's somebody's birthday today");
    $b8->learn("It's somebody's birthday today", "ham");
    echo $b8->classify("It's somebody's birthday today");

    // Let's try this one on for size
    echo $b8->classify("Say Happy Birthday to Dave!");
    // That was pretty quick, wasn't it?

The output:

    Spaminess: could not calculate spaminess

    Classification before learning: could not calculate spaminess
    Saved the text as Ham
    Classification after learning: could not calculate spaminess

    Classification before learning: could not calculate spaminess
    Saved the text as Spam
    Classification after learning: 0.884615

    Spaminess: 0.583509

    Classification before learning: 0.583509
    Saved the text as Ham
    Classification after learning: 0.105294

    Spaminess: 0.065217

Any Questions?

Good!
I am glad I am not the only one...
AKA Wikipedia to the rescue...

What is Bayesian Inference?


In layman's terms: a bunch of statistical mumbo-jumbo that learns from the past to let you classify things in the future. Or, more formally, from Wikipedia:

Bayesian inference is statistical inference in which evidence or observations are used to update or to newly infer the probability that a hypothesis may be true. The name "Bayesian" comes from the frequent use of Bayes' theorem in the inference process. Bayes' theorem was derived from the work of the Reverend Thomas Bayes.
http://en.wikipedia.org/wiki/Bayesian_inference

Who?
Thomas Bayes (c. 1702 - 7 April 1761) was a British mathematician and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes' theorem, which was published posthumously. Bayes' solution to a problem of "inverse probability" was presented in the "Essay Towards Solving a Problem in the Doctrine of Chances" (1764), published posthumously by his friend Richard Price in the Philosophical Transactions of the Royal Society of London. This essay contains a statement of a special case of Bayes' theorem.
http://en.wikipedia.org/wiki/Thomas_Bayes

Bayes' Theorem
Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a non-vanishing probability:

    P(A|B) = P(B|A) P(A) / P(B)

Each term in Bayes' theorem has a conventional name:

P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
P(B|A) is the conditional probability of B given A.
P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about observing A are updated by having observed B. Objective Bayesians emphasise that these probabilities are fixed by a body of well-specified background knowledge K, so their version of the theorem conditions everything on K:

    P(A|B,K) = P(B|A,K) P(A|K) / P(B|K)

http://en.wikipedia.org/wiki/Bayes'_theorem

Duh... an example please?


Suppose there is a co-ed school having 60% boys and 40% girls as students. The girl students wear trousers or skirts in equal numbers; the boys all wear trousers. An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers. What is the probability this student is a girl?

The correct answer can be computed using Bayes' theorem. The event A is that the student observed is a girl, and the event B is that the student observed is wearing trousers. To compute P(A|B), we first need to know:

P(A), the probability that the student is a girl regardless of any other information. Since the observer sees a random student, meaning that all students have the same probability of being observed, and the fraction of girls among the students is 40%, this probability equals 0.4.
P(A'), the probability that the student is a boy regardless of any other information (A' is the complementary event to A). This is 60%, or 0.6.
P(B|A), the probability of the student wearing trousers given that the student is a girl. As they are as likely to wear skirts as trousers, this is 0.5.
P(B|A'), the probability of the student wearing trousers given that the student is a boy. This is given as 1.
P(B), the probability of a (randomly selected) student wearing trousers regardless of any other information. Since P(B) = P(B|A)P(A) + P(B|A')P(A'), this is 0.5 × 0.4 + 1 × 0.6 = 0.8.

Given all this information, the probability of the observer having spotted a girl given that the observed student is wearing trousers is computed by substituting these values into the formula:

    P(A|B) = P(B|A) P(A) / P(B) = (0.5 × 0.4) / 0.8 = 0.25
http://en.wikipedia.org/wiki/Bayes'_theorem
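
To check the arithmetic, here is the same calculation as a few lines of plain PHP - nothing b8-specific, just the numbers from the example above:

    <?php
    // Bayes' theorem applied to the trousers example:
    // P(girl | trousers) = P(trousers | girl) * P(girl) / P(trousers)
    $pGirl         = 0.4;  // P(A): 40% of the students are girls
    $pBoy          = 0.6;  // P(A'): 60% are boys
    $pTrousersGirl = 0.5;  // P(B|A): girls wear trousers half the time
    $pTrousersBoy  = 1.0;  // P(B|A'): boys always wear trousers

    // Total probability of trousers: P(B) = P(B|A)P(A) + P(B|A')P(A')
    $pTrousers = $pTrousersGirl * $pGirl + $pTrousersBoy * $pBoy; // 0.8

    // Posterior probability that the observed student is a girl
    echo ($pTrousersGirl * $pGirl) / $pTrousers; // 0.25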

Yeah... I can't really get my head around it either...


There are lots of resources online to help:

http://en.wikipedia.org/wiki/Bayesian_probability
http://en.wikipedia.org/wiki/Bayesian_inference
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://blog.oscarbonilla.com/2009/05/visualizing-bayestheorem/ (a really cool explanation using Venn diagrams)
http://www.ibm.com/developerworks/web/library/wa-bayes1/

Which brings us back to b8


b8 is a naive Bayesian spam filter library written by Tobias Leupold.
http://nasauber.de/opensource/b8/index.php.en
Why use a class library? For the usual reasons:

It's written by somebody who knows more about the problem.
It just works, with a minimum of fuss.
Many of the 'gotchas' and edge cases have already been resolved.
The published code has been reviewed by many people.

It does all that stuff from two slides ago as easily as you saw on the 2nd slide.

b8's 'target audience'


b8 is designed - optimized, really - to classify very short messages, such as comments left on blogs. b8 accepts only a single text string for classification; there is no header/body distinction. b8 tallies the number of instances of a word, so it can distinguish between a single URL in a comment and 20 links.

The author claims it may not be suited to longer text strings such as email messages.

How does b8 work?


1. b8 'tokenizes' a string into individual words, URL bits and pieces, IP addresses, HTML tags, etc.
   o You can create your own 'lexer' if you want different tokens.
   o Tokens that aren't in the existing known-token list go through a 'degeneration' process to try to find similar tokens.
2. b8 picks the 15 (configurable) most interesting tokens - those farthest from a neutral score of 0.5 - and calculates the probability with them, as sketched below.

b8 can also 'learn' that a text's set of tokens represents spam or not. It will use this new data for future classifications.
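
This isn't b8's actual code, but a minimal sketch of step 2 - picking the most interesting tokens and combining them the classic naive-Bayes way. The token probabilities here are invented for illustration:

    <?php
    // Hypothetical per-token spam probabilities, as learned from a corpus.
    $tokenProbs = array(
        'birthday' => 0.02, 'viagra' => 0.99, 'sale' => 0.87,
        'hello'    => 0.31, 'dave'   => 0.05, 'http' => 0.91,
    );

    // Sort by "interestingness": distance from the neutral score of 0.5.
    uasort($tokenProbs, function ($a, $b) {
        $d = abs($b - 0.5) - abs($a - 0.5);
        return $d > 0 ? 1 : ($d < 0 ? -1 : 0);
    });

    // Keep only the N most interesting tokens (15 by default in b8).
    $interesting = array_slice($tokenProbs, 0, 15, true);

    // Combine: spaminess = prod(p) / (prod(p) + prod(1 - p))
    $spamProduct = array_product($interesting);
    $hamProduct  = array_product(array_map(function ($p) {
        return 1 - $p;
    }, $interesting));

    echo $spamProduct / ($spamProduct + $hamProduct); // 0..1, higher = spammier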

How the default Lexer creates Tokens


The lexer is where you can really give b8 its 'smarts', since you define how the individual tokens are created. The default lexer tries to find all IP addresses and URL-looking strings in the provided text. It then breaks the URLs into bits, using both the whole URL and its individual elements as tokens. The default lexer also tries to pull out the HTML tags to use as tokens as well. Remember, b8 was originally written to combat blog comment spam, which is primarily links to websites.
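
A toy illustration of that idea - the regexes here are simplified stand-ins (IP handling omitted), not b8's real ones:

    <?php
    // Toy lexer sketch: turn a comment into URL, URL-part, HTML-tag and
    // word tokens, roughly in the spirit of b8's default lexer.
    function toyTokenize($text) {
        $tokens = array();

        // Whole URLs become tokens...
        preg_match_all('~https?://[^\s"\'<>]+~i', $text, $m);
        foreach ($m[0] as $url) {
            $tokens[] = $url;
            // ...and so do their individual elements.
            foreach (preg_split('~[/.?&=:]+~', $url, -1, PREG_SPLIT_NO_EMPTY) as $bit) {
                $tokens[] = $bit;
            }
        }

        // HTML tags are kept whole, so things like <a href=...> can be learned.
        preg_match_all('~<[^>]+>~', $text, $m);
        $tokens = array_merge($tokens, $m[0]);

        // Everything else becomes plain word tokens.
        preg_match_all('~\w+~u', strip_tags($text), $m);
        return array_merge($tokens, $m[0]);
    }

    print_r(toyTokenize('Cheap pills! <a href="http://pills.example/buy">here</a>'));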

What else can you have the Lexer do?


With a little insight into the text strings you're trying to classify, you can make the lexer quite intelligent in creating tokens.

For email classification, you can create a token out of the SPF record lookup in the header. You could also create a token out of the spam score header line added by your email host's spam filter. Some Bayesian implementations will tokenize on phrases, so sentence structure can be utilized instead of just a list of words. Doing this allows the following two phrases, both using the words 'buy' and 'now', to be distinguished, as sketched below:

"Now I know what to buy"
"24 Hour Sale! Buy Now"
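
b8 itself doesn't tokenize on phrases, but a sketch of the idea is simple - emit word pairs (bigrams) alongside the single words:

    <?php
    // Bigram tokenizer sketch (not part of b8): emits word pairs so
    // "buy now" and "to buy" become distinct tokens.
    function bigramTokenize($text) {
        preg_match_all('~\w+~u', strtolower($text), $m);
        $words  = $m[0];
        $tokens = $words; // keep the single words too

        for ($i = 0; $i < count($words) - 1; $i++) {
            $tokens[] = $words[$i] . ' ' . $words[$i + 1];
        }
        return $tokens;
    }

    print_r(bigramTokenize("Now I know what to buy"));
    // ...contains "to buy" but not "buy now"
    print_r(bigramTokenize("24 Hour Sale! Buy Now"));
    // ...contains "buy now" but not "to buy"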

Degeneration
b8 will take a token that it hasn't seen before and apply several transforms to it, trying to find it in the existing corpus of known tokens. If a degenerated version is found, b8 picks the most interesting one for scoring.

b8 only does this when scoring a text; it will not save degenerated tokens. The degeneration process currently applies several different transforms:

1. lowercase the whole token
2. uppercase the whole token
3. capitalize the first letter of the token
4. remove punctuation from the token, such as: . ! ?
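
A rough sketch of how such a lookup chain might work - illustrative only; b8's actual transform list differs in detail:

    <?php
    // Degeneration sketch: generate simple variants of an unknown token
    // and return the ones found in the known-token corpus.
    function degenerate($token, array $corpus) {
        $trimmed    = rtrim($token, '.!?'); // simplified punctuation removal
        $candidates = array_unique(array(
            strtolower($token), strtoupper($token), ucfirst(strtolower($token)),
            $trimmed, strtolower($trimmed), strtoupper($trimmed),
        ));

        $found = array();
        foreach ($candidates as $candidate) {
            if ($candidate !== $token && isset($corpus[$candidate])) {
                $found[$candidate] = $corpus[$candidate];
            }
        }
        return $found; // the caller picks the most interesting variant for scoring
    }

    // "FREE!!!" was never learned, but "free" was:
    $corpus = array('free' => 0.93, 'birthday' => 0.04);
    print_r(degenerate('FREE!!!', $corpus)); // Array ( [free] => 0.93 )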

Learning about spam and not spam


b8 saves each token into a database with a count of the number of instances it was seen in both spam texts and not-spam texts. b8 will also save when each token was last seen, but I don't know if this is really used or was just curiosity on the author's behalf. When a token already exists in the database, b8 updates its spam/not-spam counts. b8 counts each instance of a token in each learned text, not just a single instance per text, so it's possible for the counts to exceed the total number of texts learned.
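
Conceptually, learning is just bumping two counters per token. A stand-in sketch - b8 really persists this in a database, not a PHP array:

    <?php
    // Illustrative learning step: bump per-token ham/spam counters.
    function learnText($text, $category, array &$db) {
        preg_match_all('~\w+~u', strtolower($text), $m);

        // Count every instance of a token, not just its presence --
        // which is why counts can exceed the number of texts learned.
        foreach ($m[0] as $token) {
            if (!isset($db[$token])) {
                $db[$token] = array('ham' => 0, 'spam' => 0);
            }
            $db[$token][$category]++;
        }
    }

    $db = array();
    learnText("Everybody has a birthday", 'ham', $db);
    learnText("Buy now! Buy now!", 'spam', $db);
    print_r($db['buy']); // Array ( [ham] => 0 [spam] => 2 )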

forgetaboutit... AKA [ctrl]-z


b8 can also unlearn a text string. This is useful if you accidentally flagged a text one way when it was really the other. It also matters because some implementations will auto-learn high-probability spam messages as spam. This can make the system adaptive to changes in spamming tactics as the changes are seen: new tactics appearing alongside the old ones will automatically be learned as spam. There is a potential problem, though - you can unlearn a text that was never learned in the first place, so beware.
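
In code, correcting a mistake looks roughly like this - assuming an unlearn() method that mirrors learn(); check it against your b8 version:

    <?php
    $b8 = new b8; // a configured b8 instance, as on the earlier slide

    // A text was accidentally learned as spam; undo it and relearn as ham.
    $text = "It's somebody's birthday today";

    $b8->unlearn($text, "spam"); // remove the tokens from the spam counts
    $b8->learn($text, "ham");    // and add them to the ham counts instead

    echo $b8->classify($text);   // spaminess should drop accordingly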

The future of b8
The author and another individual currently have the next version, 0.5, in SVN development. It's basically a total rewrite of everything but the core Bayesian math processing. This new version is a complete PHP5-native rewrite, and its MySQL query usage is much more efficient, providing a significant speed increase.

Work is being done on multiple categorization - not just spam/not spam. This looks to significantly complicate the code, so it isn't likely to be a 0.5 feature.

Bayesian Poisoning
The idea is to provide enough otherwise innocuous text that the 'spam' message is lost amongst the non-spam text. There are several ways this is done:

Random dictionary words.
Short text snippets from various sources, such as Shakespeare, Wikipedia or news websites.
Embedding the spam message in an image file, where the Bayesian inference engine can't see it.
