NLP & Deep Learning: An Overview of Recent Trends


Abstract: When NLP meets deep learning, what changes have taken place?


In a recent paper, Young and his colleagues summarized some of the latest trends in deep-learning-based natural language processing (NLP) systems and applications. The paper reviews and compares state-of-the-art (SOTA) results for various NLP tasks, such as visual question answering (QA) and machine translation. In this comprehensive review, you can learn about the past, present, and future of deep learning for NLP, as well as some best practices for applying deep learning to NLP. Topics include:

1. The rise of distributed representations (e.g., word2vec);

2. Convolutional, recurrent, and recursive neural networks;

3. Applications of reinforcement learning in NLP;

4. Recent progress in unsupervised models for representation learning;

5. Combining deep learning models with memory-augmentation strategies.



What is NLP?


Natural Language Processing (NLP) involves building computational algorithms to automatically analyze and represent human language. NLP-based systems are now widely used: consider Google's powerful search engine or Alibaba's recent voice assistant, the Tmall Genie. NLP can also be used to teach machines to perform complex natural-language tasks, such as machine translation and dialogue generation.


For a long time, most methods used to study NLP problems relied on shallow machine learning models and time-consuming, hand-crafted features. Because most linguistic information was represented sparsely (as high-dimensional features), this led to problems such as the curse of dimensionality. With the recent rise of word embeddings (low-dimensional, distributed representations), however, neural-based models have achieved outstanding results on various language-related tasks compared with traditional machine learning models such as SVMs or logistic regression.


Distributed representation


As mentioned earlier, hand-crafted features were the primary means of modeling natural language tasks until neural network approaches emerged and solved some of the problems faced by traditional machine learning models, such as the curse of dimensionality.


Word embeddings: Distributional vectors, also known as word embeddings, are based on the so-called distributional hypothesis: words that appear in similar contexts have similar meanings. Word embeddings are pre-trained on a task whose goal is to predict a word from its context, usually using a shallow neural network. The following illustration shows a neural language model proposed by Bengio and his colleagues.



Word vectors tend to embed syntactic and semantic information and underpin SOTA results in a variety of NLP tasks, such as sentiment analysis and sentence compositionality.


Distributed representations were used to study various NLP tasks in the past, but they became truly popular only after the continuous bag-of-words (CBOW) and skip-gram models were introduced to the field. They are popular because they can efficiently build high-quality word embeddings, and because those embeddings can be used for semantic composition (for example, 'man' + 'royal' = 'king').
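To make this kind of semantic arithmetic concrete, here is a minimal sketch of the classic analogy computation ("king" - "man" + "woman" ≈ "queen") using hand-set 3-dimensional toy vectors. Real embeddings are learned from data and have hundreds of dimensions; the vocabulary and values below are invented for illustration only.

```python
import math

# Toy 3-d "embeddings": dimensions loosely read as (royalty, male, female).
# Hand-set for illustration, not learned.
vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "banana": [0.0, 0.3, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman should land near "queen"
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
candidates = {w: v for w, v in vectors.items()
              if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```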


Word2vec: Around 2013, Mikolov and others proposed both the CBOW and skip-gram models. CBOW is a neural network method for constructing word embeddings whose objective is to compute the conditional probability of a target word given its context words. Skip-gram, on the other hand, is a neural network method for constructing word embeddings whose objective is to predict the surrounding context words (i.e., their conditional probabilities) given a central target word. For both models, the word embedding dimension is determined by measuring prediction accuracy (in an unsupervised manner).
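As a concrete illustration of the skip-gram setup, the sketch below enumerates the (center word, context word) training pairs for a fixed window size; the sentence and window size are arbitrary choices for the example, and a real implementation would feed these pairs to a shallow neural network.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs for a skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = words within `window` positions of the center word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
print(pairs[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```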


One challenge with word embeddings arises when we want a vector representation for a phrase such as "hot potato" or "Boston Globe". We cannot simply combine the individual word vectors, because these phrases do not represent a combination of the meanings of the individual words. Things become even more complicated for longer phrases and sentences.


Another limitation of the word2vec models is that using small window sizes produces similar embeddings for contrasting words such as "good" and "bad", which is undesirable for tasks where this distinction matters, such as sentiment analysis. Another caveat of word embeddings is that they depend on the application that uses them: retraining task-specific embeddings for every new task is one option, but it is usually computationally expensive and can be handled more efficiently with negative sampling. Word2vec models have other problems as well, such as ignoring polysemy and other biases that may emerge from the training data.
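Negative sampling, mentioned above, avoids normalizing over the whole vocabulary by updating only the observed (center, context) pair plus a handful of randomly drawn "noise" words. A minimal sketch of the per-pair loss follows; the vectors are arbitrary toy values for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(center, context, negatives):
    """Loss = -log sigma(c.w) - sum of log sigma(-c.n) over sampled negatives."""
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss

center  = [0.5, 0.1]
context = [0.4, 0.3]                   # a word actually observed near `center`
negs    = [[-0.2, 0.1], [0.0, -0.3]]   # randomly sampled "noise" words
print(neg_sampling_loss(center, context, negs))
```

Minimizing this loss pushes the center vector toward its true context word and away from the sampled negatives, which is far cheaper than a full softmax over the vocabulary.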


Character embeddings: For tasks such as POS tagging and named entity recognition (NER), it is useful to look at morphological information within a word, such as characters or combinations of characters. This is also useful for morphologically rich languages such as Portuguese, Spanish, and Chinese. Because we analyze text at the character level, these embeddings help with unknown words, and we no longer need to represent a huge vocabulary that would otherwise have to be reduced for efficient computation.


Finally, it is important to understand that even though word-level and character-level embeddings have been applied successfully to various NLP tasks, their long-term value has been questioned. For example, Lucy and Gauthier recently found that word vectors are limited in how well they capture the different facets of the conceptual meaning behind words. They claim that distributional semantics alone cannot be used to understand the concepts behind words. Recently, there has been significant debate about meaning representation in the context of natural language processing systems.


Convolutional neural network (CNN)


A CNN is basically a neural-network-based feature function applied to constituent words or n-grams to extract higher-level features. The resulting abstract features have been used effectively in sentiment analysis, machine translation, question answering (QA) systems, and other tasks. Collobert and Weston were among the first researchers to apply CNN-based frameworks to NLP tasks. In their approach, words are converted into vector representations through a lookup table, which amounts to a primitive word embedding method whose weights are learned during neural network training (see figure below).




To use a basic CNN for sentence modeling, we first tokenize the sentence into words and convert it into a d-dimensional word embedding matrix (i.e., the input embedding layer). Convolutional filters covering all desired window sizes are then applied to this layer to produce what are called feature maps. A max-pooling operation is then applied to each feature map, which yields a fixed-length output and reduces the output's dimensionality; this process produces the final sentence representation.
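The window-convolution plus max-over-time-pooling step can be sketched in a few lines of plain Python. The example uses toy 2-dimensional embeddings and a single hand-set filter; a real model would learn many filters over much higher-dimensional embeddings.

```python
def conv_max_pool(embeddings, filt, bias=0.0):
    """Apply one convolutional filter of window size k over a sentence's
    word embeddings, then max-over-time pool to a single scalar feature."""
    k, d = len(filt), len(filt[0])
    feature_map = []
    for i in range(len(embeddings) - k + 1):   # slide window over sentence
        s = bias + sum(filt[a][b] * embeddings[i + a][b]
                       for a in range(k) for b in range(d))
        feature_map.append(max(0.0, s))        # ReLU non-linearity
    return max(feature_map)                    # max pooling -> fixed-length output

# 3-word sentence with toy 2-d embeddings; one filter with window size 2
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
print(conv_max_pool(sentence, filt))  # 2.0
```

Note how the output is a single number regardless of sentence length; concatenating the outputs of many filters gives the fixed-length sentence representation described above.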




By increasing the complexity of these basic CNNs and adapting them to perform word-level predictions, other NLP tasks such as NER, aspect detection, and POS (part-of-speech) tagging can be studied. This requires a window-based approach, in which a fixed-size window of neighboring words around each word is considered. A standalone CNN is then applied to the sentence, and the training goal is to predict the word at the center of the window; this is also known as word-level classification.


One drawback of CNNs is their inability to model long-distance dependencies, which are important for various NLP tasks. To address this, CNNs have been coupled with time-delayed neural networks (TDNNs), which enable a wider contextual range at once during training. Another useful type of CNN, which has performed well on NLP tasks such as sentiment prediction and question-type classification, is the dynamic convolutional neural network (DCNN). A DCNN uses a dynamic k-max pooling strategy in which filters can dynamically span variable ranges while performing sentence modeling.


CNNs have also been used for more complex tasks with texts of varying lengths, such as aspect detection, sentiment analysis, short-text categorization, and sarcasm detection. However, some of these studies reported that external knowledge is necessary when applying CNN-based methods to micro-texts such as tweets. Other tasks for which CNNs have proven useful include query-document matching, speech recognition, machine translation, and question answering. DCNNs, meanwhile, have been used to hierarchically learn to capture and combine low-level lexical features into high-level semantic concepts for the automatic summarization of text.


In general, CNNs are effective because they can mine semantic clues within contextual windows, but they struggle to preserve sequential order and to model long-distance contextual information. RNNs are better suited to this type of learning, and we discuss them next.


Recurrent neural network (RNN)


An RNN is a neural network specifically designed to process sequential information. It applies the same computation to each element of an input sequence, conditioned on the results of previous computations. These sequences are usually represented by fixed-size vectors of tokens that are fed sequentially to a recurrent unit. The following illustration shows a simple RNN framework.




The main advantage of an RNN is that it can memorize the results of previous computations and use that information in the current computation. This makes RNN models suitable for inputs of any length with contextual dependencies, allowing them to build proper compositions of the input. RNNs have been used to study various NLP tasks, such as machine translation, image captioning, and language modeling.
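The recurrence described above, in which each step reuses the previous result, reduces to one update equation in the simple (Elman-style) formulation: h_t = tanh(Wx·x_t + Wh·h_{t-1} + b). A minimal sketch with hand-set toy weights:

```python
import math

def rnn_step(x, h_prev, Wx, Wh, b):
    """One recurrent step: h_t = tanh(Wx x_t + Wh h_{t-1} + b)."""
    return [math.tanh(sum(Wx[i][j] * x[j] for j in range(len(x))) +
                      sum(Wh[i][j] * h_prev[j] for j in range(len(h_prev))) +
                      b[i])
            for i in range(len(b))]

# Toy weights: 2-d inputs, 2-d hidden state (values are arbitrary)
Wx = [[0.5, -0.3], [0.1, 0.8]]
Wh = [[0.2, 0.0], [0.0, 0.2]]
b  = [0.0, 0.1]

h = [0.0, 0.0]                       # initial hidden state
for x in [[1.0, 0.0], [0.0, 1.0]]:   # feed the sequence one token at a time
    h = rnn_step(x, h, Wx, Wh, b)    # h carries information forward
print(h)
```

Because the same weights are reused at every step, the final hidden state h summarizes the whole sequence, which is what makes RNNs length-agnostic.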

Compared with CNN models, RNN models can be equally effective or even better on specific natural language tasks, but not universally so. The two architectures model different aspects of the data, and which is more effective depends on the semantics the task requires.


The expected input to an RNN is usually a one-hot encoding or a word embedding, but in some cases it is coupled with abstract representations constructed by a CNN model. Simple RNNs suffer from the vanishing gradient problem, which makes it difficult for the network to learn and to adjust parameters in the earlier layers. Variants such as long short-term memory (LSTM) networks, residual networks (ResNets), and gated recurrent units (GRUs) were later introduced to overcome this limitation.




An LSTM consists of three gates (the input, forget, and output gates) and computes its hidden state through a combination of the three. GRUs are similar to LSTMs but contain only two gates, making them more efficient because they are less complex. One study suggests that it is hard to say which gated RNN is more effective; they are usually chosen based on the computing power available. Research and experiments show that LSTM-based models have been used for sequence-to-sequence mapping (via the encoder-decoder framework) and are suitable for machine translation, text summarization, human dialogue modeling, question answering, image-based language generation, and other tasks.
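For illustration, here is a scalar (1-dimensional) sketch of the three gates and the cell update; real LSTMs use weight matrices over vectors, and the weights below are arbitrary toy values rather than learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step: gates decide what to write, erase, and expose."""
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["bi"])     # input gate
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["bf"])     # forget gate
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["bo"])     # output gate
    g = math.tanh(p["Wg"] * x + p["Ug"] * h_prev + p["bg"])   # candidate cell
    c = f * c_prev + i * g     # new cell state: keep some old, add some new
    h = o * math.tanh(c)       # hidden state exposed to the next layer
    return h, c

params = {k: 0.5 for k in ["Wi", "Ui", "bi", "Wf", "Uf", "bf",
                           "Wo", "Uo", "bo", "Wg", "Ug", "bg"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:     # feed a short toy sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The forget gate f is what lets gradients flow along the cell state c over long spans, which is how LSTMs mitigate the vanishing gradient problem of simple RNNs.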


In general, RNN can be used in many NLP systems, such as:

•  Word level classification (NER);

•  Language modeling;

•  Sentence level classification (for example, sentiment polarity);

•  Semantic matching (for example, matching messages with candidate responses in a dialog system);

•  Natural language generation (for example, machine translation, visual QA, and image captioning).


Attention mechanism


In essence, the attention mechanism is a technique inspired by the need to let the RNN-based decoder described above use not only the last hidden state, but also information computed over the full sequence of input hidden states (i.e., context vectors). It is especially useful for tasks that require alignment between the input and output text.
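A minimal sketch of this idea: score every encoder hidden state against the decoder's current state, normalize the scores with a softmax, and take the weighted sum as the context vector. All vectors here are toy values, and simple dot-product scoring stands in for the various learned scoring functions used in practice.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, states):
    """Dot-product attention: weight each encoder state by its
    softmax-normalized similarity to the decoder query."""
    scores = [dot(query, s) for s in states]
    m = max(scores)                            # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * s[i] for w, s in zip(weights, states))
               for i in range(len(query))]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]                             # decoder hidden state
weights, context = attention(query, encoder_states)
print([round(w, 3) for w in weights])  # highest weight on the most similar state
```

The weights form a soft alignment: for translation, they indicate which source positions the decoder is "looking at" while emitting each target word.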


Attention mechanisms have been used successfully in machine translation, text summarization, image captioning, dialogue generation, and aspect-based sentiment analysis. Various forms and types of attention mechanisms have been proposed, and they remain an important area of study for NLP researchers across many application domains.


Recursive neural network


Like RNNs, recursive neural networks are well suited to modeling compositional data. This is because language can be viewed as a recursive structure, in which words and sub-phrases compose higher-level phrases in a hierarchy. In such a structure, a non-terminal node is represented through the representations of all its children. The following illustration shows a simple recursive neural network.




In its basic form, a recursive neural network uses a composition function (i.e., the network) to combine constituents in a bottom-up manner and compute the representations of higher-level phrases (see figure above). In the MV-RNN variant, words are represented by both a matrix and a vector, which means the parameters learned by the network include a matrix for each constituent. Another variant, the recursive neural tensor network (RNTN), allows more interaction between input vectors while avoiding the large number of parameters of the MV-RNN. Recursive neural networks are flexible and can be coupled with LSTM units to deal with problems such as vanishing gradients.
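The bottom-up composition can be sketched directly: a binary parse tree is given as nested tuples of leaf word vectors, and one shared composition function, g([left; right]) = tanh(W[left; right] + b), builds each phrase representation. The 2-d vectors and weights below are toy values for illustration.

```python
import math

def compose(left, right, W, b):
    """g([left; right]) = tanh(W [left; right] + b) for one tree node."""
    concat = left + right                  # concatenate child vectors
    return [math.tanh(sum(W[i][j] * concat[j] for j in range(len(concat))) + b[i])
            for i in range(len(b))]

def encode(node, W, b):
    """Recursively encode a binary parse tree (nested tuples) into one vector."""
    if isinstance(node, tuple):
        return compose(encode(node[0], W, b), encode(node[1], W, b), W, b)
    return node                            # leaf: a word vector

# Tree for ((very good) movie): toy 2-d word vectors, one shared 2x4 matrix
W = [[0.5, 0.1, -0.2, 0.3], [0.0, 0.4, 0.2, -0.1]]
b = [0.0, 0.0]
tree = (([0.8, 0.1], [0.2, 0.9]), [0.5, 0.5])
print(encode(tree, W, b))
```

Because the same W is applied at every node, phrases of any shape map into the same vector space as single words, which is what lets a classifier score whole phrases.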


Recursive neural networks are used for various applications, such as:

•  Parsing;

•  Sentiment analysis using phrase-level representations;

•  Semantic relation classification (e.g., topic-message);

•  Sentence relatedness.



Reinforcement learning

Reinforcement learning is a machine learning method in which agents are trained to perform discrete actions and then receive a reward. Several natural language generation (NLG) tasks, such as text summarization, are being studied through reinforcement learning.


The application of reinforcement learning to NLP is motivated by some problems with standard training. When using RNN-based generators, the ground-truth tokens seen during training are replaced at inference time by tokens generated by the model itself, which quickly increases the error rate. In addition, for such models the word-level training objective differs from the test-time metric, such as the n-gram overlap measure BLEU used in machine translation and dialogue systems. Because of this mismatch, current NLG systems often produce incoherent, repetitive, and dull output.


To address these problems, the community has applied a reinforcement learning algorithm called REINFORCE to NLP tasks such as image captioning and machine translation. This reinforcement learning framework consists of an agent (an RNN-based generative model) that interacts with an external environment (the input words and context vectors seen at each time step). The agent chooses an action according to a policy (its parameters), which here means predicting the next word of the sequence at each time step. The agent then updates its internal state (the RNN's hidden units). This continues until the end of the sequence, at which point a reward is finally computed. Reward functions vary from task to task; for example, in a sentence generation task, the reward could be information flow.
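The loop just described (sample an action from the policy, observe a reward, nudge the policy toward rewarded actions) is the essence of REINFORCE. The toy sketch below applies it to the simplest possible "environment", a two-action bandit rather than word generation, purely to show the update rule grad(log pi(a)) * r; the learning rate and reward scheme are arbitrary choices.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

random.seed(0)
logits = [0.0, 0.0]   # policy parameters over 2 actions
lr = 0.5

for _ in range(200):
    probs = softmax(logits)
    action = 0 if random.random() < probs[0] else 1  # sample from the policy
    reward = 1.0 if action == 1 else 0.0             # only action 1 pays off
    # REINFORCE update: grad of log pi(action) wrt logits = one_hot - probs
    for k in range(len(logits)):
        logits[k] += lr * ((1.0 if k == action else 0.0) - probs[k]) * reward

print(softmax(logits))  # probability of the rewarded action is now near 1
```

In sequence generation the "action" is the next word and the reward arrives only at the end of the sentence, but the gradient estimator is the same.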


Although reinforcement learning methods show promise, they require proper handling of the action and state spaces, which may limit the expressive power and learning capacity of the models. Keep in mind that standalone RNN-based models derive their strength from their expressive power and their natural fit for modeling language.


Adversarial training has also been used to train language generators, where the objective is to fool a discriminator trained to distinguish generated sequences from real ones. Consider a dialogue system: using a policy gradient (a policy network), the task can be framed under the reinforcement learning paradigm, with the discriminator acting like a human Turing tester. The discriminator is essentially trained to distinguish between human-generated and machine-generated dialogues.


Unsupervised learning

Unsupervised sentence representation learning involves mapping sentences to fixed-size vectors in an unsupervised way. These distributed representations capture semantic and syntactic properties of language and are trained using an auxiliary task.


Similar to the algorithms used to learn word embeddings, researchers have proposed the skip-thought model, in which the task is to predict the neighboring sentences given a central sentence. The model is trained using the seq2seq framework, where the decoder generates the target sequences and the encoder is treated as a generic feature extractor; word embeddings are even learned in the process. The model essentially learns a distributed representation of the input sentence, analogous to how word embeddings were learned in earlier language modeling techniques.


Deep generative models


Deep generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs) can also be applied in NLP to discover rich structure in natural language by generating realistic sentences from a latent code space.

It is well known that standard autoencoders cannot generate realistic sentences because their latent space is unconstrained. VAEs impose a prior distribution on the hidden latent space, enabling the model to generate proper samples. A VAE consists of an encoder network and a generator network: the encoder maps the input into a latent space, and the generator then produces samples from that latent space. The training objective is to maximize a variational lower bound on the log-likelihood of the observed data under the generative model. The following illustration shows an RNN-based VAE for sentence generation.
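Two pieces of that objective can be sketched concretely: the reparameterization trick z = mu + sigma * eps, which keeps the sampling step differentiable with respect to the encoder outputs, and the closed-form KL term between a diagonal Gaussian posterior and a standard normal prior. The 2-dimensional values below are toy numbers standing in for real encoder outputs.

```python
import math, random

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), so the sampling
    step stays differentiable with respect to mu and logvar."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian, closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))

rng = random.Random(0)
mu, logvar = [0.5, -0.2], [0.0, -1.0]   # toy encoder outputs for one sentence
z = reparameterize(mu, logvar, rng)      # latent code fed to the generator
print(z, kl_to_standard_normal(mu, logvar))
# The KL penalty vanishes exactly when the posterior equals the prior:
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

The variational lower bound is then (reconstruction log-likelihood) minus this KL term; the KL penalty is what constrains the latent space that a plain autoencoder leaves unconstrained.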




Generative models are useful for many NLP tasks, and they are flexible by nature. For example, an RNN-based VAE generative model was proposed that generates sentences that are more diverse and well-formed than those of a standard autoencoder. Other models allow structured variables (such as tense and sentiment) to be incorporated into the latent code to generate plausible sentences.


GANs, composed of two competing networks (a generator and a discriminator), have also been used to generate realistic text. For example, an LSTM has been used as the generator and a CNN as the discriminator that distinguishes real data from generated samples. In this case, the CNN acts as a binary sentence classifier. After adversarial training, the model is able to generate realistic text.


Besides the problem that the discriminator's gradients cannot properly back-propagate through discrete variables, deep generative models are also difficult to evaluate. Many solutions have been proposed in recent years, but none has yet been standardized.


Memory-augmented networks

The hidden vectors accessed by the attention mechanism during output generation represent the model's "internal memory". Neural networks can also be coupled with some form of external memory to solve tasks such as visual QA, language modeling, POS tagging, and sentiment analysis. For example, to solve QA tasks, supporting facts or common-sense knowledge are supplied to the model as a form of memory. Dynamic memory networks improve on earlier memory-based models by employing neural network models for the input representation, the attention mechanism, and the answering mechanism.



So far, we have seen the capacity and effectiveness of neural-network-based models such as CNNs and RNNs. We have also seen that reinforcement learning, unsupervised methods, and deep generative models are being applied to complex NLP tasks such as visual QA and machine translation. Attention mechanisms and memory-augmented networks are powerful extensions of what neural NLP models can do. Combining these powerful techniques, we believe, will yield convincing ways to handle the complexity of language.


Author: [direction], Yunqi Community, Alibaba
Source: Jianshu
The copyright belongs to the author. Any reproduction should be authorized by the author, and the source should be credited.

