Article From:https://www.cnblogs.com/lvchaoshun/p/9127088.html

One of the ten major algorithms for machine learning: EM algorithm. It can be rated as one of the top ten, which sounds like NB. What is NB? We generally say someone is NB because he can solve problems that others can’t solve. Why is God a God? Because God can do many things that no one can do. So EMWhat is the problem that the algorithm can solve? Or why the EM algorithm came to this world and attracted the attention of so many people.

I hope I can understand it in a popular way, but the question of EM is really not so good to speak in popular language, because it is simple and complex. The simplicity lies in its simple thinking that it can accomplish powerful functions in only two steps, and its complexity lies in its mathematical reasoning.The more complicated probability formulas are involved. If it is simple, it loses the essence of the EM algorithm. If it is merely mathematical reasoning, it is too boring and astringent, but on the other hand, it is not easy to combine the two. So I can’t expect what I can say about it. I hope you don’t mean it.Guide.

One, maximum likelihood

I have to pull in too much and get to the point. Suppose we have a problem with the following:

Suppose we need to investigate the height distribution of boys and girls in our school. How do you do it? You say so many people can not ask one by one, it must be a sample. Suppose you caught 100 boys and 100 girls on campus. They were 200 people (200 tall)Sample data, to facilitate the presentation, the following, I say, “human” means the corresponding height is in the classroom. What’s the next step? You begin to shout, “the left side of the man, the right side of the woman, the rest of the station!” Then you first count the height of the 100 boys sampled. Suppose they areHeight is subject to Gauss’s distribution. But the mean u and variance of the distribution2We do not know that these two parameters are what we want to estimate. Note as theta =[uT

In the language of mathematics, in the school so many boys (height), we separate 100 (height) independently according to the probability density p (x| theta), and make up the sample set X. We want to estimate the unknown parameter theta by the sample set X. Here’s the probability density p (x| theta) we know it’s GaussIn the form of N (U, =[u), the unknown parameter is theta =[u.T。The sample set is X={x1,x2,…,xN}，Among them, XiThe height of the I person is shown here. N is 100, indicating the number of samples drawn.

Since each sample is extracted from P (x| theta) independently, in other words, any one of the 100 boys is my casual catch, and from my point of view there is no relation between these boys. So why did I pick up the 100 men from so many boys at school? DrawWhat is the probability of reaching these 100 people? Because these boys are subject to the same Gauss distribution P (x| theta). So the probability that I take the boy A (height) is p (x).A|θ)，The probability of getting a boy B is p (xB|θ)，Since they are independent, it is obvious that the probability that I smoke male A and male B at the same time is p (x).A|θ)* p(xB|θ)，Similarly, the probability that I drew the 100 boys at the same time is the product of their respective probabilities. Using the muzzle of a mathematician, the probability of extracting the 100 samples from the total sample of P (x| theta), that is, the joint probability of each sample in the sample set X, is expressed in the following form:

This probability reflects that the probability of X’s sample is obtained when the parameter of the probability density function is theta. Because X is known here, that is to say, the height of the 100 people I can extract can be measured, that is, known. While theta is unknown, the above formula is only unknown.It’s a function of theta. This function shows the possibility of obtaining the current set of samples under different parameter theta values, so it is called the likelihood function (likehood function) of the parameter theta relative to the sample set X. It is recorded as L (theta).

Here is a concept, the likelihood function. Do you remember our goal? We need to estimate the value of the parameter theta under the condition that the sample X has been drawn. How do you estimate it? What’s the use of the likelihood function? Let’s understand the concept of lower likelihood first.

Give a direct example:

A student went hunting with a hunter, and a hare ran past. Just listen to a gunshot, the hare answers to the next, if you want to guess, who hit this hit shot? You will think that if you shoot a single shot, you will hit the probability that the hunter is usually greater than the student’s hit probability.The gun was shot by the hunter.

The inference of this example embodies the basic idea of the maximum likelihood method.

For example, after class, a group of boys and girls went to the bathroom. Then you are idle and want to know that there are more people who go to the toilet during the class or more for the girls to go to the toilet, and then you run to the door of the men’s and the women’s toilet. Squatting for five minutes, suddenly a beautiful woman came out, you were ecstatic, came running to tell me.There are more girls going to toilets during the break. Do you believe you can go in and count them? Ha ha, I am not so stupid to run into the number, and then no headlines. I ask you how you know. You said, “5 minutes, the girls and girls are coming out, so the probability that girls will come out is the biggest.It’s bigger than boys, so there must be more people in the ladies’ toilets than in men’s toilets. See, no, you have used maximum likelihood estimation. You can see girls come out first, then what kind of girls will come out first? It must be the best time for a girl to come out. When does a girl come out?Max, that must be the time when women’s toilets are more than men’s toilets. This is the parameter you estimate.

From the above two examples, what conclusion have you got?

Back to the example of the boy’s height. Among the boys at school, I smoked the 100 boys (the height), not the others, and that was the greatest probability of the 100 people in the whole school. So what does this probability mean? Oh, that’s the one aboveThe function is L (theta). So, we just need to find a parameter theta whose corresponding likelihood function L (theta) is maximum, that is to say, the probability of pumping the 100 boys (the height) is the greatest. The maximum likelihood estimator, called theta, is recorded as:

Sometimes you can see that L (theta) is multiplicative, so in order to facilitate analysis, we can define logarithmic likelihood function and turn it into continuous addition.

Well, now we know that the requirement of theta is to maximize the likelihood function L (theta) of theta, and then the maximum value corresponding to theta is our estimate. Here we go back to the question of the best value. How do you seek the best value of a function? Of course, it is derivative, then let the derivative be 0, then solve the theta obtained by this equation.That is, (of course, the premise is that the function L (theta) is continuously differentiable). What if theta is a vector containing multiple parameters? It is, of course, to find the partial derivative of L (theta) to all the parameters, that is, the gradient, then the N unknown parameters, there are n equations, and the solution of the equations is the extreme point of the likelihood function.Now, of course, we get these n parameters.

Maximum likelihood estimation, you can think of it as a backstepping. In most cases we calculate the results according to the known conditions, and the maximum likelihood estimation is that the result has been known, and then we seek the most likely condition for the result to be used as an estimate. For example, if other conditions are certain, smokersThe risk of developing lung cancer is 5 times that of those who do not smoke. If I already know someone is lung cancer, I would like to ask you if he smokes or not. How do you judge? You may not know anything about this person. What you know is only one thing, that is, smoking is more prone to lung cancer, so you can guess this person.Don’t smoke? I believe you are more likely to say that this person smokes. Why? This is “the greatest possibility”, and I can only say that he “most likely” is smoking, “he is smoking” is the estimate is “most likely” to get the result of “lung cancer”. This is the maximum likelihood estimate.

Maximum likelihood estimation is only one application of probability theory in statistics. It is one of the methods of parameter estimation. It is known that a random sample is known to satisfy a certain probability distribution, but the specific parameters are not clear. The parameter estimation is through a number of experiments to observe the results and to use the results to introduce the approximate values of the parameters.The maximum likelihood estimation is based on the idea that a given parameter can make the maximum probability of this sample, and of course we will not choose any other sample of small probability, so we simply use this parameter as the true value of the estimate.

The general step of estimating the maximum likelihood function:

（1）Write down the likelihood function;

（2）The logarithm of the likelihood function is taken and arranged.

（3）The derivative is obtained, the derivative is 0, and the likelihood equation is obtained.

（4）The parameters are obtained by solving the likelihood equation.

Two, EM algorithm

OK, let’s go back to the question of height distribution estimation above. Now, by extracting the height of the 100 boys and the height of the known 100, we can get the parameter theta =[u corresponding to the Gauss distribution by maximizing the likelihood function.TThat’s right. The height distribution of girls in our school can also be obtained in the same way.

Back to the example itself, if there is no “left side of the male, the right side of the woman, the other station in the middle!” This step, or I caught these 200 people, some boys and some girls fell in love at first sight, had already been good, entangled up. We do not want to be so cruel and hard to pull them away. So now this 200 people have been mixed up, and at this time, you just give me a person (height) from the 200 people (height), I can’t determine whether this person (height) is a boy (height) or a girl (height). That is to say, you do not know everyone in the 200 people who have been sampled.Is it extracted from the height distribution of a boy or the height distribution of a girl? In mathematical language, every sample extracted does not know which distribution is extracted from it.

At this time, there are two things that need to be guessed or estimated for every sample or the person you pick up. Is this man a man or a woman? Two, what are the parameters of Gauss distribution for boys and girls?

Only when we know which people belong to the same Gauss distribution, we can predict the parameters of this distribution, such as the first maximum likelihood, but now two kinds of Gauss distribution people are mixed together, we do not know which people belong to the first Gauss distribution, which belong to the first.The two one is that it is impossible to estimate the parameters of the two distributions. In turn, only when we make an accurate estimation of the parameters of the two distribution, we can know exactly which people belong to the first distribution and those who belong to the second distribution.

This is the question of whether there is chicken or egg. The chicken said, without me, who gave birth to you. The egg refused to accept, saying, without me, where did you jump out? Oh, this is a philosophical question. Of course, scientists later said that there were eggs, because eggs evolved from eggs. In order to solve this you areDepend on me, I rely on your cycle of dependence, one of the parties to break the deadlock first, say, no matter, I first casually value out, see how you change, and then I adjust my changes according to your changes, and then iterative and continuous deduction, and eventually converge to a solution. This is the EM algorithmThe basic idea.

I wonder if you can understand the ideas, so I’ll give you a long time. In fact, this idea is not at all.

For example, when I was a child, mom gave you a big bag of candy and asked you to share it with your sister, and then you didn’t bother to count the number of sweets, so you didn’t know how many people had to divide. What do we do in general? First divide a bag of sweets into two bags, then take two bags of candy to the left and right hands to see which weight is heavier.If the right hand is heavy, it’s obvious that there’s more candy in the right hand, then you take the bag in the right hand bag and put it in the left hand, then feel the weight, then take a small bag in the heavy bag and put it in the light bag and go on until you feel that the two bags of candy are almost equal. Hehe, and thenTo show fairness, you also let your sister choose.

EMThis is the algorithm, assuming that we want to estimate the two parameters of A and B, and the two are unknown in the beginning, but if you know the information of the A, you can get the information of the B, and in turn know that B gets A. You can consider giving A some initial value to get the estimated value of B, and then from B.Starting from the current value, we reestimate the value of A, which continues until convergence.

EMIt means “Expectation Maximization”, and in our above problem, we first randomly guess the parameters of the normal distribution of the boy (height), such as the mean and the variance. For example, the average of boys is 1 meters 7, and the variance is 0.1 meters (of course).The first or second normal distribution (for example, the height of the man is 1 meters 8, that is obviously, he’s most likely to be the boy’s distribution), which belongs to the Expectation step. With everyoneTo belong, or to say that we have roughly divided the 200 men into two parts of boys and girls by the way above, we can reestimate the parameters of the first distribution, based on the previous maximum likelihood, by the n individuals who are probably divided into boys, and the girls’ distribution is reestimated in the same way.This is Maximization. Then, when we update these two distributions, the probability of each of these two distributions is changed again, then we need to adjust the E step… So, until the parameters are basically no longer changed.

Here, the complete description of each person (sample) is regarded as a three tuple y.i={xi,zi1,zi2}，Among them, XiIt is the observed value of the I sample, which corresponds to the height of the person, which is the observed value. Zi1And Zi2Which of the two Gauss distributions of boys and girls is used to generate the value x?i，That is to say, these two values indicate whether this person is male or female (height distribution). We do not know these two values, they are implied variables. To be exact, ZijAt xiFrom the J Gauss distribution, the time value is 1, otherwise it is 0. For example, when a sample is observed at 1.8, then he comes from the Gauss distribution of the boy, then we can represent this sample as {1.8, 1, 0}. If Zi1And Zi2The value is known, that is to say that each person I have been marked as a boy or a girl, then we can use the maximum likelihood algorithm above to estimate the parameters of their respective Gauss distribution. But they are unknown, so we can only use the EM algorithm.

Is it not because of that nauseous hidden variable (that each sample that is extracted does not know which distribution is extracted) makes the original simple problem complex and can’t be solved. Then what shall I do? The idea of solving human problems is whether we can simplify complex problems. OK, soNow, turn this complex problem back, I suppose we have known the implicit variable. Well, then it’s very easy to solve the parameter of that distribution, and the maximum likelihood estimate is right on the top. Then you ask me, the implicit variable is unknown. How do you come to the hypothesis that it is known? youThis assumption is unfounded. Well, I know, so we can first give this distribution an initial value, and then the expectation of the implicit variable, as the known value of the implicit variable, then we can now use the maximum likelihood to solve the parameter of that distribution, which assumes that this parameter is more than before.The random parameter is better, it is more able to express the true distribution, then we can then get the expectation of the implicit variable by the distribution determined by this parameter, then maximize, and get another better parameter,… With iteration, you can get a happy result.

At that time, you refuse to accept that you are always iterative and iterative. How do you know that the new parameter is better than the original one? Why does this method work? Is there a time for failure? When does it fail? What do you need to pay attention to using this method? Ha ha, throw out so many problems at once, make me.But it proves that you have a good potential for research. Ha ha, actually these problems are mathematicians need to solve the problem. In mathematics, it can be proved or drawn. Let’s use mathematics to redescribe the above problems. It is known here, no matter how much it is.The idea of a complex or simple physical world needs to be modeled and abstracted through mathematical tools to make it possible to use and play its powerful role, and the mathematics contained in it can often bring you more things that you can’t imagine, which is the essence of mathematics.

Three and EM algorithm derivation

Suppose we have a sample set {x(1),…,x(m)}，Contains M independent samples. But each sample I corresponds to the class Z(i)It is unknown (equivalent to clustering), that is, implied variable. So we need to estimate the parameter theta of the probability model P (x, z), but because it contains the implicit variable Z, it is difficult to use the maximum likelihood solution, but if Z knows it, then we can easily solve it.

For parameter estimation, we essentially want to obtain a parameter theta that maximizes the likelihood function. Now different from the maximum likelihood, we have only an unknown variable Z in the likelihood function formula. See the next formula (1). That is to say, our goal is to find the suitable theta and Z to maximize the L (theta). So we alsoMay think, you are an unknown variable, ah, I can also separate the unknown theta and Z partial guide respectively, then make it equal to 0, the solution is not the same?

In essence, we need to maximize (1) (1), and we recall the solution of the marginal probability density function of a variable under the joint probability density, and here Z is also a random variable. For every possible category Z of every sample I, the joint probability density function on the right side of the equation is obtained, and the equation is obtained.The marginal probability density of the random variable x on the left is the likelihood function, but it can be seen that there is a “sum of logarithms”, and the form will be very complex after the derivation (you can imagine log (f).1(x)+ f2(x)+ f3(x)+…)Therefore, it is difficult to obtain unknown parameters Z and theta. So, OK, can we make some changes to (1)? We see (2), (2) is only the molecular denominator the same multiplied by an equal function, or the “logarithm of the sum”, or still can’t be solved. Then why do you want to do this?What? Let’s ignore, look at (3), and find that (3) becomes “logarithmic sum”. We pay attention to it and find that the equal sign has become an inequality sign. Why can it be changed? This is the place where Jensen inequality is great.

JensenInequalities:

Let f be a function that defines a field as a real number, if it is x for all real numbers. If the two derivative of all real X and f (x) is greater than or equal to 0, then f is a convex function. WhenxWhen vectors are, if their Hessian matrix H is semi positive definite, then f is a convex function. If it is greater than 0 and not equal to 0, then f is strictly convex function.

JensenThe inequalities are expressed as follows:

If f is a convex function, X is a random variable, then: E[f (X)]> =f (E[X]).

In particular, if f is strictly convex function, if and only if X is constant, the upper form takes an equal sign.

It will be clear if it is represented by a graph:

In the graph, the real line f is a convex function, X is a random variable, a probability of 0.5 is a, and a probability of 0.5 is B. (like a coin toss). The expected value of X is the median of a and B. E[f (X)]&gt and =f (E[X]) can be seen in the graph.

When f is (Yan Ge) concave function if and only if -f is (Yan Ge) convex function.

JensenWhen inequality is applied to concave functions, the direction of inequality is reversed.

Go back to formula (2), because f (x) =log x is a concave function (its second derivative is -1/x).2<0）。

（2）In the form ofThe expectation (taking into account E (X) = Sigma x*p (x) and f (X) is a function of X, then E (f (X)) = sigma f (f) (()), and，So we can get the inequality of formula (3). (if you don’t understand, please take a pen, ha ha):

OK，Here, the present formula (3) is easy to guide, but the formula (2) and formula (3) is an inequality, the maximum value of formula (2) is not the maximum (3) of the maximum value, and we want to get the maximum value of (2), and how to do that?

Now we need a bit of imagination, and the above (2) and (3) inequalities can be written as: the likelihood function L (theta) > =J (Z, Q), then we can make L (theta) continuously increase by maximizing the lower bound J, and finally reach its maximum value.

See above, we fix theta, adjust Q (z) to increase the lower bound J (Z, Q) to equal to L (theta) at this point theta (green curve to blue curve), and then fix Q (z), adjust theta to make the lower J (Z, Q) reach the maximum value (theta) (theta).tTo thetat+1），And then fix the theta and adjust the Q (z)… Until we converge to the theta * of the maximum of the likelihood function L (theta) *. Here are two questions: when is the lower bound J (Z, Q) equal to L (theta) at this point? Why do we have to converge?

First of all, in the Jensen inequality, when the independent variable X is constant, the equation is established. And here, that is:

Further derivation, due to（Because Q is a random variable Z(i)The probability density function) can be obtained: the sum of the molecules is equal to C (the numerator denominator is all Z.(i)Summation: the equality of numerator denominator is invariable. This holds that the two probability ratios of each sample are c).

At this point, we have introduced the formula for calculating the Q (z) of the lower bound after the fixed parameter theta, which is a posteriori probability, and solved the problem of how to choose Q (z). This step is the E step, and the lower bound of L (theta) is established. The next M step is to adjust theta after giving a Q (z) and to maximize L (theta).The lower bound J (after the fixed Q (z), the lower bound can be adjusted even more). Then the general steps of the EM algorithm are as follows:

EMAlgorithm (Expectation-maximization):

The maximum expected maximum likelihood algorithm is a maximum likelihood estimation method for solving the parameter of probability model in an incomplete data or a data set with data loss (existing implicit variables).

EMThe algorithm flow:

Initialization of the distribution parameter theta;

Repeat the following steps until convergence

EStep:The posterior probability of hidden variables is calculated according to the initial value of the parameter or the model parameters of the previous iteration, which is actually the expectation of the hidden variable. As the present estimated value of a hidden variable:

MStep:Maximizing the likelihood function to obtain new parameter values:

With this continuous iteration, the parameter theta maximizing the likelihood function L (theta) can be obtained. Then we have to answer the second questions. Will it converge?

Because the lower bound is increasing, the maximum likelihood estimation is monotonically increased, so we will reach the maximum likelihood estimation. If we analyze rationally, we will get the following things:

Specifically how to prove, see the derivation process reference: Andrew Ng “The EM algorithm”.

Another understanding of the four, EM algorithm

The coordinate rising method (Coordinate ascent):

The path of the linear iterative optimization in the graph shows that every step forward one step toward the optimal value, and the forward route is parallel to the coordinate axis, because each step optimizes only one variable.

This is like finding the extreme value of a curve in the X-Y coordinate system. However, the curve function can not be directly derived, so what gradient descent method is not applicable. But when a variable is fixed, the other can be derived by the derivation, so the coordinate rise method can be used, one variable is fixed at a time, and the other is for the extremum.Finally, it is gradually approaching the extremum. Corresponding to EM,EStep:Fix theta and optimize Q.MStep:Fix Q, optimize theta, maximize the extreme value alternately.

The application of five and EM

EMThere are many applications of the algorithm. The most widely used ones are GMM mixed Gauss model, clustering, HMM and so on. Specifically, we can refer to the Machine Learning column in JerryLead’s cnblog:

（EMAlgorithm) The EM Algorithm

Hybrid Gauss model (Mixtures of Gaussians) and EM algorithm

K-meansclustering algorithm