Word2Vec Models
- Python Automation and Machine Learning for ICs -
- An Online Book -
Python Automation and Machine Learning for ICs                                          http://www.globalsino.com/ICs/        




Word2Vec is a popular technique in natural language processing (NLP) used to represent words as vectors in a continuous vector space. The main idea behind Word2Vec is to capture semantic relationships between words by representing them as dense vectors, positioned so that words with similar meanings lie close together; these relationships are learned from the distributional patterns of words in a large corpus of text. Word2Vec was introduced in 2013 by a team of researchers at Google, including Tomas Mikolov.

Word2Vec can be trained on a large dataset to learn a vector representation for each word. Because the vectors of words used in similar contexts end up close to each other in the vector space, the model can capture semantic relationships and similarities between words.

The key idea behind Word2Vec is the distributional hypothesis, which suggests that words that appear in similar contexts tend to have similar meanings. Word2Vec leverages this idea by learning to predict the context (surrounding words) of a target word, and in the process, it learns vector representations that encode semantic information. 

The word2vec tool takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text and then learns vector representations of the words. The resulting word-vector file can be used as features in many natural language processing and machine learning applications. Word2Vec usually performs better than a simple bag-of-words model, which only counts how many times each word appears in each document and carries no information about how similar words are. Because Word2Vec can figure out that some words are similar to each other, it typically performs better when doing machine learning with text.
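The limitation described above can be seen directly: a minimal bag-of-words sketch using only the standard library (the two document strings are illustrative, not from a real corpus):

```python
from collections import Counter

# Two short example documents (illustrative text, not from a real corpus).
doc_a = "the cat sat on the mat"
doc_b = "the kitten sat on the rug"

# A bag-of-words model only counts word occurrences per document.
bow_a = Counter(doc_a.split())
bow_b = Counter(doc_b.split())

print(bow_a["the"])  # "the" appears twice in doc_a -> 2

# "cat" and "kitten" are completely unrelated under bag-of-words:
# the counts carry no notion of word similarity, which is exactly
# what Word2Vec's dense vectors add.
print(bow_a["cat"], bow_b["kitten"])
```

Under this representation the only information retained is raw frequency; semantically related documents share no signal unless they share exact word forms.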

Word2vec can be used to train word vectors.

There are two main learning algorithms in word2vec:
         i) continuous bag-of-words (CBOW),
         ii) continuous skip-gram.
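The two objectives differ only in the direction of prediction: continuous bag-of-words predicts a target word from its surrounding context, while skip-gram predicts the context words from the target. A minimal sketch of how (target, context) training pairs are generated from a sentence, assuming a symmetric context window (the sentence and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as skip-gram training examples."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = words within `window` positions of the target.
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
# CBOW uses the same windows but reverses the roles: the whole context
# list predicts the target, e.g. (["the", "brown"], "quick").
```

These pairs are then fed to a shallow neural network whose learned weights become the word vectors.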

In the word embeddings of a word2vec model, Euclidean distance does not work well for high-dimensional word vectors, because Euclidean distance tends to grow as the number of dimensions increases, even between embeddings that stand for different meanings. Alternatively, cosine similarity can be used to measure the similarity between two vectors. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity therefore captures the angle between the word vectors and not their magnitude. Under cosine similarity, no similarity corresponds to a 90-degree angle (cosine 0), while total similarity of 1 corresponds to a 0-degree angle.
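The cosine similarity described above is straightforward to compute; a minimal sketch using only the standard library (the 3-dimensional vectors are illustrative stand-ins for real word embeddings):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (ignores magnitude)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors (0-degree angle): similarity 1.0, regardless of magnitude.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # 1.0

# Orthogonal vectors (90-degree angle): similarity 0.0.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0
```

Note that scaling a vector does not change its cosine similarity to another vector, which is why the measure captures direction (meaning) rather than magnitude.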

Word2Vec with Gensim