Article from: https://www.cnblogs.com/Josie-chen/p/9124558.html

Naive Bayes

1. Introduction

The Bayesian method has a long history and a solid theoretical foundation; it handles many problems directly and efficiently, and many advanced Natural Language Processing models can be seen as evolutions of it. Therefore, learning the Bayesian method is a very good entry point for studying Natural Language Processing problems.

2. The Bayes formula

The Bayes formula is just one line:

P(Y|X) = P(X|Y) · P(Y) / P(X)

In fact, it is derived from the joint probability formula:

P(X, Y) = P(Y|X) · P(X) = P(X|Y) · P(Y)

where P(Y) is the prior probability, P(X|Y) is the likelihood, and P(Y|X) is the posterior probability; the denominator P(X) can be expanded with the law of total probability.

That's right: the core formula of the Bayesian method is just this simple.
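As a quick sanity check, the formula can be verified numerically. The probabilities below are made-up toy values chosen only for illustration, not numbers from the article:

```python
# Toy numeric check of Bayes' formula (all probabilities are made up).
p_y = 0.5              # prior P(Y)
p_x_given_y = 0.8      # likelihood P(X|Y)
p_x_given_not_y = 0.2  # likelihood P(X|not Y)

# Denominator P(X) via the law of total probability.
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Posterior P(Y|X) via Bayes' formula.
p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)  # 0.8
```

With these values the evidence X is four times more likely under Y than under not-Y, so with an even prior the posterior rises from 0.5 to 0.8.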

3. Understanding the Bayes formula from a machine learning perspective

From a machine learning perspective, we read X as "having a certain feature" and Y as "belonging to a certain class", so the formula reads:

P("belongs to a certain class" | "has a certain feature") = P("has a certain feature" | "belongs to a certain class") · P("belongs to a certain class") / P("has a certain feature")

We can abbreviate the above formula as:

P(class | feature) = P(feature | class) · P(class) / P(feature)

The ultimate goal of our binary classification problem is simply to judge whether P(class | feature) is greater than 1/2. The Bayesian method converts the hard-to-obtain probability of "belonging to a certain class given a certain feature" into the probability of "having a certain feature given a certain class", and the latter is much simpler: we only need to find some samples with known labels and train on them. Because the class labels of the samples are explicit, the Bayesian method belongs to supervised learning in machine learning.

A supplementary note: the "prior probability" and the "posterior probability" are relative terms. For example, in the formula above, P(class) is the prior probability and P(class | feature) is the posterior probability.
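The binary decision described above can be sketched as a comparison of the two class posteriors. Since both posteriors share the same denominator P(feature), it cancels and comparing the numerators is enough. The function name and the input numbers below are illustrative assumptions, not part of the article:

```python
# Compare P(class1|x) with P(class2|x); the shared denominator P(x) cancels.
def pick_class(p_x_given_c1, p_c1, p_x_given_c2, p_c2):
    score1 = p_x_given_c1 * p_c1  # proportional to P(class1|x)
    score2 = p_x_given_c2 * p_c2  # proportional to P(class2|x)
    return "class1" if score1 > score2 else "class2"

# With equal priors, the likelihoods alone decide.
print(pick_class(0.001, 0.5, 0.0001, 0.5))  # class1
```

Skipping the denominator is what makes the Bayesian decision rule cheap: we never need to estimate P(feature) itself.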

4. Spam identification (a binary classification problem)

For example, suppose we now want to classify emails into spam and normal mail. If we choose a naive Bayes classifier, the goal is to judge whether P(spam | email content) is greater than 1/2. Now suppose we have 10,000 spam emails and 10,000 normal emails as a training set (in reality there is usually far less spam, i.e. a class imbalance problem). We need to determine whether the following email is spam:

"Our company can handle regular invoices (fidelity) with 17% VAT invoice points discount!"

That is, we judge the probability P(spam | "Our company can handle regular invoices (fidelity) with 17% VAT invoice points discount!").

Ahem, did you notice? Once converted to P(sentence | spam), the calculation method is obvious: write a counter, then +1, +1, +1, counting how many times this sentence appears among all the spam and among all the normal mail! More specifically:

But if you search for this complete sentence in all samples, you may find it only once, or not at all, so the estimate is useless. This method is therefore not workable; we need to segment the text first and then count. That is, we go from the coarse granularity of a whole paragraph down to fine-grained words.
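A toy illustration of why whole-sentence counting fails. The miniature "corpus" below is invented for the example:

```python
# Estimating P(sentence | spam) by exact-match counting on a toy corpus.
spam_corpus = [
    "handle VAT invoice cheap",
    "regular invoice big discount",
    "invoice points discount offer",
]
sentence = "our company can handle regular invoices fidelity VAT discount"

matches = sum(1 for mail in spam_corpus if mail == sentence)
p_sentence_given_spam = matches / len(spam_corpus)
print(p_sentence_given_spam)  # 0.0 -- the exact sentence never recurs
```

The estimate is zero for almost every new email, no matter how large the corpus grows, because exact sentences essentially never repeat.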

5. Word segmentation

A sad but realistic conclusion: the training set is finite, while the possible sentences are infinite, so a training set covering all possible sentences does not exist. Therefore, we need to solve the problem with finer-grained word segmentation.

So what is the solution? The possibilities of sentences are infinite, but words are not! There are roughly 2,500 commonly used Chinese characters and about 56,000 commonly used words (now you finally understand why primary-school Chinese teachers can cope). In human experience, two sentences that share most of their words usually have similar meanings. For example, "Our company can handle regular invoices, 17% VAT invoice points discount!" differs from the previous sentence only by lacking the word "(fidelity)", but the meaning is the same. If such near-duplicate cases are pooled together, the counts increase, which makes our estimates far easier to compute.

Therefore, instead of taking the whole sentence as the feature, we can take the words (or word combinations) in the sentence as the features. For example, "regular invoice" can be one word, "VAT" can be another word, and so on.

The sentence "Our company can handle regular invoices (fidelity) with 17% VAT invoice points discount!" can then be turned into ("I", "company", "can", "handle", "regular invoice", "fidelity", "VAT", "invoice", "points", "discount").

So you have just met one of the most important techniques in Chinese NLP: word segmentation! That is, splitting a whole sentence into finer-grained words. In addition, removing punctuation marks, numbers, and even irrelevant components (stop words) after segmentation is a standard feature-preprocessing technique.

Chinese word segmentation is a specialized technical field (I won't even tell you that search-engine companies have dedicated word-segmentation engineers!). Students who have taken the course know that Python has a very convenient segmentation tool, jieba.

Looking at ("I", "company", "can", "handle", "regular invoice", "fidelity", "VAT", "invoice", "points", "discount"), this can be understood as a vector: each dimension of the vector indicates whether the corresponding feature word appears in the text. Splitting features into smaller units and reasoning over these more flexible, finer-grained features is very common and effective in Natural Language Processing and machine learning.
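For example, the token tuple above can be mapped to a 0/1 vector over a vocabulary. The vocabulary below is a hypothetical toy list built from this one sentence, not a real vocabulary from the article:

```python
# Bag-of-words style 0/1 vector over a toy vocabulary.
vocabulary = ["I", "company", "can", "handle", "regular invoice",
              "fidelity", "VAT", "invoice", "points", "discount"]

# A segmented mail that happens to lack the word "fidelity".
tokens = {"I", "company", "can", "handle", "regular invoice",
          "VAT", "invoice", "points", "discount"}

vector = [1 if word in tokens else 0 for word in vocabulary]
print(vector)  # [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
```

This is the same idea that sklearn's feature_extraction.text vectorizers implement at scale, with the vocabulary learned from the training corpus.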

Keyword extraction (by statistical probability): jieba has built-in keyword-extraction utilities, and sklearn offers the feature_extraction.text module.

So the Bayes formula becomes:

P(spam | "I", "company", "can", …, "discount") = P("I", "company", "can", …, "discount" | spam) · P(spam) / P("I", "company", "can", …, "discount")
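Putting the pieces together, here is a minimal word-level naive Bayes scorer with add-one (Laplace) smoothing. The training tokens and the equal prior are toy assumptions invented for the sketch, and treating the words as independent given the class is the simplification that makes the method "naive":

```python
import math
from collections import Counter

# Toy training data: word tokens pooled from each class (made up).
spam_tokens = ["invoice", "discount", "VAT", "invoice", "points"]
normal_tokens = ["meeting", "report", "schedule", "invoice"]

spam_counts = Counter(spam_tokens)
normal_counts = Counter(normal_tokens)
vocab = set(spam_tokens) | set(normal_tokens)

def log_likelihood(words, counts, total):
    # log P(words | class) with add-one smoothing, assuming the words
    # are conditionally independent given the class.
    return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in words)

def classify(words, p_spam=0.5):
    spam_score = math.log(p_spam) + log_likelihood(words, spam_counts, len(spam_tokens))
    normal_score = math.log(1 - p_spam) + log_likelihood(words, normal_counts, len(normal_tokens))
    return "spam" if spam_score > normal_score else "normal"

print(classify(["invoice", "discount"]))  # spam
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, and the smoothing keeps an unseen word from zeroing out the whole score.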