# Naive Bayes

## 1. Introduction

Bayesian methods have a long history and a solid theoretical foundation. They handle many problems directly and efficiently, and many advanced Natural Language Processing models evolved from them. Learning the Bayesian method is therefore a very good entry point for studying Natural Language Processing problems.

## 2. Bayes' formula

Bayes' formula is a single line:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

In fact, it is derived from the joint probability formula.

$$P(Y, X) = P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)$$

Here $P(Y)$ is called the prior probability, $P(Y \mid X)$ is called the posterior probability, and $P(Y, X)$ is called the joint probability.

That's it: the core formula of the Bayesian method is no more than this.
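As a quick numeric check of the formula, the sketch below computes a posterior for a binary $Y$. The function name `posterior` and the input numbers are made up for illustration; $P(X)$ is expanded with the law of total probability over $Y$ and not-$Y$.

```python
def posterior(prior_y: float, p_x_given_y: float, p_x_given_not_y: float) -> float:
    """Return P(Y|X) for a binary Y using Bayes' formula."""
    # Joint probability P(Y, X) = P(X|Y) P(Y)
    joint_y = p_x_given_y * prior_y
    # Joint probability P(not Y, X) = P(X|not Y) P(not Y)
    joint_not_y = p_x_given_not_y * (1.0 - prior_y)
    # P(X) by the law of total probability
    p_x = joint_y + joint_not_y
    if p_x == 0:
        raise ValueError("P(X) is zero, so P(Y|X) is undefined")
    return joint_y / p_x


# Illustrative numbers only: P(Y)=0.5, P(X|Y)=0.8, P(X|not Y)=0.2
result = posterior(0.5, 0.8, 0.2)
print(round(result, 3))  # 0.8
```

With a prior of 0.5, the posterior is driven entirely by how much more likely $X$ is under $Y$ than under not-$Y$.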

## 3. Understanding Bayes' formula from a machine learning perspective

From a machine learning perspective, we read $X$ as **"has a certain feature"** and $Y$ as **"class label"** (in most machine learning problems it is `X=>Features`, `Y=>Result`, right?). In the simplest binary classification problem (a `yes`/`no` decision), we read $Y$ as the label **"belongs to a certain class"**. Bayes' formula then becomes:

$$P(\text{"belongs to a class"} \mid \text{"has a feature"}) = \frac{P(\text{"has a feature"} \mid \text{"belongs to a class"})\,P(\text{"belongs to a class"})}{P(\text{"has a feature"})}$$

Let us explain each term of the formula above:

$P(\text{"belongs to a class"} \mid \text{"has a feature"})$ = the probability that a sample belongs to the class, given that the sample is known to have the feature. This is why it is called the **posterior probability**.

$P(\text{"has a feature"} \mid \text{"belongs to a class"})$ = the probability that a sample has the feature, given that the sample is known to belong to the class.

$P(\text{"belongs to a class"})$ = the probability that a sample belongs to the class when it is not known whether the sample has the feature. This is why it is called the **prior probability**.

$P(\text{"has a feature"})$ = the probability that a sample has the feature when it is not known whether the sample belongs to the class.

The ultimate goal of our binary classification problem is simply to **judge whether $P(\text{"belongs to a class"} \mid \text{"has a feature"})$ is greater than 1/2**. The Bayesian method converts the computation of the probability of **"belonging to a class given a feature"** into the computation of the probability of **"having a feature given a class"**, and the latter is much simpler to obtain: we only need to find some samples with known labels and train on them. Because the class labels of the samples are known, the Bayesian method belongs to supervised learning in machine learning.

One supplementary note: **"prior probability" and "posterior probability" always appear relative to each other**. For example, $P(Y)$ and $P(Y \mid X)$ are the prior and posterior probabilities of $Y$, while $P(X)$ and $P(X \mid Y)$ are the prior and posterior probabilities of $X$.
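To see how labelled samples turn into these four quantities, here is a minimal sketch that estimates each term by counting. The eight samples are made-up toy data, and the variable names are illustrative only.

```python
# Toy labelled samples: (has_feature, belongs_to_class). Made-up data.
samples = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
    (True, True), (False, False),
]

n = len(samples)
n_class = sum(1 for has_feature, in_class in samples if in_class)
n_feature = sum(1 for has_feature, in_class in samples if has_feature)
n_both = sum(1 for has_feature, in_class in samples if has_feature and in_class)

prior = n_class / n            # P("belongs to a class")
likelihood = n_both / n_class  # P("has a feature" | "belongs to a class")
evidence = n_feature / n       # P("has a feature")

# Posterior through Bayes' formula, and directly by counting, for comparison.
posterior_bayes = likelihood * prior / evidence
posterior_direct = n_both / n_feature

print(posterior_bayes, posterior_direct, posterior_bayes > 0.5)
```

Both routes give the same posterior on this toy data; the decision rule from the text is the final comparison against 1/2.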

## 4. Spam identification (a binary classification problem)

For example, suppose we want to classify email into spam and normal mail. If we choose a naive Bayes classifier, the goal is to **judge whether $P(\text{"spam"} \mid \text{"has a feature"})$ is greater than 1/2**. Now suppose we have 10,000 spam emails and 10,000 normal emails as a training set (in reality there is usually less spam than this, so real data has an imbalance problem). We need to decide whether the following email is spam:

"Our company can handle regular invoices (fidelity) 17% VAT invoice points discount!"

That is, we **judge whether the probability $P(\text{"spam"} \mid \text{"Our company can handle regular invoices (fidelity) 17\% VAT invoice points discount!"})$ is greater than 1/2**.

Ahem, have you noticed? Once the problem is converted into this probability, the calculation is just counting: write a counter and go +1, +1, +1 to tally how many times this sentence appears in all the spam and in all the normal mail! To be specific:

$$P(\text{"spam"} \mid \text{"Our company can handle regular invoices (fidelity) 17\% VAT invoice points discount!"}) = \frac{\text{number of times this sentence appears in spam}}{\text{number of times this sentence appears in spam} + \text{number of times this sentence appears in normal mail}}$$

However, if you search all the samples for the complete sentence, you may find it only once, or not at all. Matching whole sentences is therefore not workable: the text has to be segmented first and then matched, moving from the coarse granularity of a whole passage to the fine granularity of individual words.
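The sentence-level counting can be sketched as follows. The function name `spam_probability_by_sentence` and the tiny training lists are hypothetical; the count ratio matches the posterior only because the two classes are assumed to be the same size, as in the 10,000/10,000 setup above.

```python
def spam_probability_by_sentence(sentence, spam_mails, normal_mails):
    """Estimate P("spam" | sentence) by counting exact matches of the sentence."""
    # Count how many times the exact sentence occurs in each class.
    spam_count = sum(mail.count(sentence) for mail in spam_mails)
    normal_count = sum(mail.count(sentence) for mail in normal_mails)
    total = spam_count + normal_count
    if total == 0:
        # The sentence never occurs in the training set, so the ratio is undefined.
        return None
    return spam_count / total


# Tiny made-up training set; the article assumes 10,000 mails per class.
spam_mails = [
    "Our company can handle regular invoices (fidelity) 17% VAT invoice points discount!",
    "Win a free prize now!",
]
normal_mails = ["The meeting is moved to Friday.", "Please find the report attached."]

seen = "Our company can handle regular invoices (fidelity) 17% VAT invoice points discount!"
unseen = "Our company can handle regular invoices, 17% VAT invoice points discount!"

print(spam_probability_by_sentence(seen, spam_mails, normal_mails))    # 1.0
print(spam_probability_by_sentence(unseen, spam_mails, normal_mails))  # None
```

The second call shows the weakness: a sentence that differs by a single word has no matches at all, so the estimate cannot even be computed.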

## 5. Word segmentation

A sad but realistic conclusion: **the training set is finite, while the set of possible sentences is infinite. A training set that covers every possible sentence therefore does not exist, and we need to work at the finer granularity of words.**

So what is the solution? **The set of possible sentences is infinite, but the set of words is finite!** Chinese has about 2,500 commonly used characters and about 56,000 commonly used words (now you finally understand how well-meaning your primary school Chinese teacher was). In people's experience, two sentences with slightly different wording can mean the same thing. For example, **"Our company can handle regular invoices, 17% VAT invoice points discount!"** is missing the word **"(fidelity)"** compared with the previous sentence, but the meaning is the same. If such cases are also counted, the number of matching samples increases, which makes our calculation easier.

Therefore, instead of taking the whole sentence as the feature, we take the words (or word combinations) in the sentence as features. For example, **"regular invoice"** can be treated as a single word, **"VAT"** can also be treated as a single word, and so on.

The sentence "Our company can handle regular invoices (fidelity) 17% VAT invoice points discount!" can then be turned into ("our", "company", "can", "handle", "regular invoice", "fidelity", "VAT", "invoice", "points", "discount").

So you have just met one of the most important techniques in Chinese NLP: **word segmentation**! That is, **splitting a whole sentence into finer-grained words**. In addition, after segmentation, **removing punctuation, digits, and even irrelevant components (stop words) is a feature preprocessing technique**.
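The segmentation and cleanup steps can be sketched on the English rendering of the example email. This is only a stand-in: real Chinese text needs a proper segmenter, and the `PHRASES` and `STOP_WORDS` lists below are hypothetical values chosen for this one sentence.

```python
import re

# Hypothetical multi-word terms and stop words, chosen only for this example.
PHRASES = ["regular invoices"]
STOP_WORDS = {"our", "can"}


def segment(sentence: str) -> list[str]:
    """Split a sentence into word features, dropping punctuation, digits and stop words."""
    text = sentence.lower()
    # Replace everything that is not a letter or whitespace (digits, punctuation) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Keep multi-word terms together by temporarily joining them with an underscore.
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = text.split()
    # Drop stop words and restore the spaces inside multi-word terms.
    return [t.replace("_", " ") for t in tokens if t not in STOP_WORDS]


email = "Our company can handle regular invoices (fidelity) 17% VAT invoice points discount!"
print(segment(email))
```

The digits, the percent sign, the parentheses, and the two stop words disappear, while "regular invoices" survives as a single feature.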

**Chinese word segmentation is a specialized technical field in its own right (I won't tell you that a certain search engine company has engineers who work on nothing but word segmentation!). Students who attended the earlier course know that Python has a very convenient word segmentation tool, jieba.**

Look at ("our", "company", "can", "handle", "regular invoice", "fidelity", "VAT", "invoice", "points", "discount"). **This can be understood as a vector: each dimension of the vector indicates that a feature word is present at a specific position in the text. Splitting features into smaller units and reasoning from these more flexible, finer-grained features is very common and effective in Natural Language Processing and machine learning.**
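One way to make the vector idea concrete is a presence vector over a fixed vocabulary. Note the assumption: the text ties each dimension to a position in the sentence, whereas this sketch ties each dimension to a vocabulary entry, which is a common simplification. The vocabulary and the helper name `to_presence_vector` are hypothetical.

```python
def to_presence_vector(tokens, vocabulary):
    """Return a 0/1 vector with one dimension per vocabulary word."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary; in practice it would be built from the training set.
vocabulary = ["company", "discount", "invoice", "meeting", "report", "VAT"]
tokens = ["our", "company", "can", "handle", "regular invoice",
          "fidelity", "VAT", "invoice", "points", "discount"]

vector = to_presence_vector(tokens, vocabulary)
print(vector)  # [1, 1, 1, 0, 0, 1]
```

Every email, whatever its length, maps to a vector of the same size, which is what lets word-level counts be compared across the training set.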

**Keyword extraction (statistical probability methods): jieba provides it, and so does `feature_extraction.text` in sklearn.**

So Bayes' formula becomes:

P("spam" | ("our", "company", "can", "handle", "regular invoice", "fidelity", "VAT", "invoice", "points", "discount")) = P(("our", "company", "can", "handle", "regular invoice", "fidelity", "VAT", "invoice", "points", "discount") | "spam") P("spam") / P(("our", "company", "can", "handle", "regular invoice", "fidelity