NLP 1: Word Embedding in Natural Language Processing (NLP)
This is a summary of section 25.1 of “Artificial Intelligence: A Modern Approach”.
A one-hot vector is one of the most basic ways to encode a word: the ith word in the dictionary is encoded with a 1 in the ith input position and a 0 in all the other positions. However, this encoding does not capture any similarity between words.
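To make this concrete, here is a minimal sketch of one-hot encoding in Python; the five-word toy vocabulary is invented purely for illustration:

import numpy as np

vocab = ["aardvark", "abacus", "cut", "physical", "word"]   # toy dictionary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # 1 in the word's dictionary position, 0 everywhere else
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cut"))                    # [0. 0. 1. 0. 0.]
# Any two distinct one-hot vectors have dot product 0, so the encoding
# carries no information about which words are similar.
print(one_hot("cut") @ one_hot("word"))  # 0.0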
n-gram counts do a better job of capturing the context around a word, i.e. the phrases the word appears in, but they are enormous: with a 100,000-word vocabulary there are 100,000⁵ = 10²⁵ possible 5-grams to keep track of. If we can reduce this to a vector with just a few hundred dimensions, the representation generalises better. Such a low-dimensional vector representing a word is called a word embedding.
Each word is just a vector of numbers, where the individual dimensions and numeric values do not have physical meanings:
"physical" = [-0.7, +0.2, -3.2, ...]
"meanings" = [+0.5, +0.9, -1.3, ...]
The feature space has the property that similar words have similar vectors. It turns out that word embedding vectors have additional properties beyond mere proximity for similar words: we can use the vector difference from one word to another to represent the relationship between the two words.
Hence, word embeddings are a good representation for downstream language tasks such as question answering, translation, or summarisation, although they are not guaranteed to answer analogy questions on their own.
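As a toy illustration of both points, the sketch below measures word similarity with cosine similarity and uses a vector difference as a "capital-of" relationship; the 3-dimensional vectors are invented for the example, not taken from any trained model:

import numpy as np

emb = {
    "paris":  np.array([ 0.9,  0.1,  0.3]),
    "france": np.array([ 0.8,  0.2,  0.1]),
    "rome":   np.array([ 0.7, -0.6,  0.3]),
    "italy":  np.array([ 0.6, -0.5,  0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar words (here, two country names) get similar vectors.
print(cosine(emb["france"], emb["italy"]))

# The difference paris - france stands in for a capital-of relationship;
# adding it to italy should land near rome -- when the analogy holds.
guess = emb["italy"] + (emb["paris"] - emb["france"])
print(max(emb, key=lambda w: cosine(emb[w], guess)))   # "rome"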
Word embedding vectors have proved more helpful than one-hot encodings in deep learning approaches to NLP tasks. Most of the time we can use generic pretrained vectors. Commonly used vector dictionaries include WORD2VEC, GloVe, and FastText, which offers embeddings for 157 languages.
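If the gensim library is installed, its downloader module is one convenient way to fetch such pretrained vectors; the snippet below assumes the "glove-wiki-gigaword-100" dataset is available and is only a sketch of typical usage:

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")    # 100-dimensional GloVe vectors

print(glove["language"][:5])                   # first few dimensions of one embedding
print(glove.most_similar("language", topn=3))  # nearest neighbours in the vector space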
We can also train our own word vectors. This is usually done at the same time as training a network for a particular task. Unlike generic pretrained embeddings, word embeddings produced for a specific task can be trained on a carefully selected corpus and tend to emphasise the aspects of words that are useful for that task. Suppose, for instance, that the task is part-of-speech (POS) tagging: predicting the correct part of speech for each word in a sentence. This is a simple task, but it is nontrivial because many words can be tagged in multiple ways. The word "cut", for example, can be a present-tense verb, a past-tense verb, an infinitive verb, a past participle, an adjective, or a noun. If a nearby temporal adverb refers to the past, we would expect the embedding to capture the past-referring aspect of such adverbs.
How to do POS tagging with word embeddings?
Given a corpus of sentences with POS tags, we learn the parameters for the word embeddings and the POS tagger simultaneously. The process works as follows:
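In outline, a window of words around each target word is fed through a trainable embedding layer and a small feedforward classifier, and the tagging loss is backpropagated into both, so the embeddings are updated along with the tagger. Below is a minimal PyTorch-style sketch of one such joint training step; the vocabulary size, window width, tag count, and random data are placeholders, not the book's exact recipe:

import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, WINDOW, HIDDEN, NUM_TAGS = 10_000, 100, 5, 128, 17   # placeholder sizes

class WindowTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)   # word embeddings, learned jointly
        self.ff = nn.Sequential(
            nn.Linear(WINDOW * EMB_DIM, HIDDEN),
            nn.ReLU(),
            nn.Linear(HIDDEN, NUM_TAGS),                 # one score per POS tag
        )

    def forward(self, window_ids):                       # (batch, WINDOW) word indices
        vectors = self.embed(window_ids)                 # (batch, WINDOW, EMB_DIM)
        return self.ff(vectors.flatten(start_dim=1))     # (batch, NUM_TAGS) tag scores

model = WindowTagger()
optimizer = torch.optim.Adam(model.parameters())         # includes the embedding weights
loss_fn = nn.CrossEntropyLoss()

# One dummy training step on random data, just to show the joint update.
window_ids = torch.randint(0, VOCAB_SIZE, (32, WINDOW))  # fake word-index windows
gold_tags = torch.randint(0, NUM_TAGS, (32,))            # fake POS labels
loss = loss_fn(model(window_ids), gold_tags)
loss.backward()
optimizer.step()                                         # updates tagger AND embeddings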