Academic Documents
Professional Documents
Culture Documents
Using b8
Classification before learning: 0.583509
Saved the text as Ham
Classification after learning: 0.105294
Spaminess: 0.065217
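The before/after numbers above come from combining per-token spam probabilities. A minimal sketch of that idea, assuming a Graham-style naive-Bayes combination (b8's exact internal formula may differ in its details):

```python
# Toy sketch of naive-Bayes "spaminess" scoring, the kind of math a
# filter like b8 performs internally (illustrative, not b8's actual code).

def spaminess(token_probs):
    """Combine per-token spam probabilities via Bayes' theorem,
    assuming the tokens are independent (the 'naive' assumption)."""
    p_spam = 1.0
    p_ham = 1.0
    for p in token_probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# Before learning: unseen tokens get near-neutral probabilities,
# so the combined score hovers around the middle.
print(spaminess([0.6, 0.55, 0.6]))

# After the text is learned as ham, the same tokens score low,
# and the combined classification drops sharply.
print(spaminess([0.2, 0.1, 0.15]))
```

Learning "as ham" simply shifts each token's probability toward 0, which is why a single learn call moved the score from 0.58 to 0.11 in the demo above.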
Any Questions?
Good!
I am glad I am not the only one...
AKA Wikipedia to the rescue...
Bayesian inference is statistical inference in which evidence or observations are used to update or to newly infer the probability that a hypothesis may be true. The name "Bayesian" comes from the frequent use of Bayes' theorem in the inference process. Bayes' theorem was derived from the work of the Reverend Thomas Bayes.
http://en.wikipedia.org/wiki/Bayesian_inference
Who?
Thomas Bayes (c. 1702 - 7 April 1761) was a British mathematician and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes' theorem, which was published posthumously. Bayes' solution to a problem of "inverse probability" was presented in the Essay Towards Solving a Problem in the Doctrine of Chances (1764), published posthumously by his friend Richard Price in the Philosophical Transactions of the Royal Society of London. This essay contains a statement of a special case of Bayes' theorem.
http://en.wikipedia.org/wiki/Thomas_Bayes
Bayes' Theorem
Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a non-vanishing probability:

P(A|B) = P(B|A) P(A) / P(B)

Each term in Bayes' theorem has a conventional name:

P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
P(B|A) is the conditional probability of B given A.
P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about observing 'A' are updated by having observed 'B'. Objective Bayesians emphasise that these probabilities are fixed by a body of well-specified background knowledge (K), so their version of the theorem expresses this:

P(A|B,K) = P(B|A,K) P(A|K) / P(B|K)
http://en.wikipedia.org/wiki/Bayes'_theorem
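The theorem is easiest to see with numbers. A small worked example with made-up spam-filter figures (all probabilities below are illustrative assumptions, not measurements):

```python
# Worked numeric example of Bayes' theorem applied to spam filtering.
# All figures are illustrative assumptions.

p_spam = 0.4             # P(A): prior probability that a message is spam
p_word_given_spam = 0.7  # P(B|A): the word "free" appears in 70% of spam
p_word_given_ham = 0.1   # the same word appears in only 10% of ham

# P(B): marginal probability of seeing the word at all -
# the normalizing constant in the denominator.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): posterior probability the message is spam, given the word.
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))  # 0.8235
```

Observing one spammy word updates the belief from a 40% prior to roughly an 82% posterior, which is exactly the "beliefs about A updated by observing B" intuition above.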
The author claims it may not be suited to longer text strings such as email messages.
Degeneration
b8 will take a token that it hasn't seen before and apply several transforms to it, trying to find it in the existing corpus of known tokens. If a degenerated version is found, it picks the most interesting one for scoring.
b8 only does this when scoring text; it will not save degenerated tokens. The degeneration process currently applies several transforms:
1. lowercase the whole token
2. uppercase the whole token
3. capitalize the first letter of the token
4. remove punctuation from the token, such as: . ! ?
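The four transforms above can be sketched as follows (an illustration, not b8's source; b8's real transform order and duplicate handling may differ):

```python
# Sketch of b8-style token degeneration: generate lookup variants
# for an unseen token. Illustrative only - not b8's actual code.

def degenerate(token):
    """Return candidate variants of an unseen token, in transform order,
    skipping duplicates and the original token itself."""
    variants = [
        token.lower(),        # 1. lowercase the whole token
        token.upper(),        # 2. uppercase the whole token
        token.capitalize(),   # 3. capitalize the first letter
        token.strip(".!?"),   # 4. remove surrounding punctuation . ! ?
    ]
    seen = set()
    out = []
    for v in variants:
        if v != token and v not in seen:
            seen.add(v)
            out.append(v)
    return out

print(degenerate("FREE!"))  # ['free!', 'Free!', 'FREE']
```

The filter would then look up each variant in the corpus and score the unseen token using the most "interesting" (furthest-from-neutral) match.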
The future of b8
The author and another developer currently have the next version, 0.5, in SVN development. It is basically a total rewrite of everything but the core Bayesian math. The new version is a native PHP5 rewrite, and its MySQL query usage is much more efficient, providing a significant speed increase.
Work is also being done on multiple categorization - not just spam/not spam. This looks likely to complicate the code significantly, so it probably won't be a 0.5 feature.
Bayesian Poisoning
The idea is to provide enough otherwise-innocuous text that the 'spam' message is lost amongst the non-spam text. There are several ways this is done:
Random dictionary words.
Short text snippets from various sources, such as Shakespeare, Wikipedia, or news websites.
Embedding the spam message in an image file, where the Bayesian inference engine can't see it.