
.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT

=======================================
Demonstrate word embedding using Gensim
=======================================

    >>> from nltk.test.gensim_fixt import setup_module
    >>> setup_module()

We demonstrate three tasks:

- Train word embeddings using the Brown Corpus;
- Load the pre-trained model and perform simple tasks; and
- Prune the pre-trained binary model.

    >>> import gensim

---------------
Train the model
---------------

Here we train a word embedding using the Brown Corpus:

    >>> from nltk.corpus import brown
    >>> train_set = brown.sents()[:10000]
    >>> model = gensim.models.Word2Vec(train_set)

Training can take some time, so once the model is trained, it can be saved and reloaded as follows:

    >>> model.save('brown.embedding')
    >>> new_model = gensim.models.Word2Vec.load('brown.embedding')

The model maps each word in its vocabulary to an embedding vector, so we can look up the vector for a word directly. With Gensim's default settings, each vector has 100 dimensions:

    >>> len(new_model.wv['university'])
    100

Gensim already implements several helper functions for working with word embeddings.
For example, to compute the cosine similarity between two words:

    >>> new_model.wv.similarity('university','school') > 0.3
    True
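
As a rough illustration of what such a similarity score measures, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses NumPy with made-up 3-dimensional vectors rather than real embeddings; the helper name `cosine_similarity` is hypothetical and is not part of Gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors chosen for illustration only.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

sim_ab = cosine_similarity(a, b)  # close to 1: same direction
sim_ac = cosine_similarity(a, c)  # close to 0: orthogonal
```

Because the score depends only on direction, scaling a vector does not change its similarity to other vectors.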

---------------------------
Using the pre-trained model
---------------------------

NLTK includes a pre-trained model that is a subset of a model trained on about 100 billion words from the Google News dataset.
The full model is from https://code.google.com/p/word2vec/ (about 3 GB).

    >>> from nltk.data import find
    >>> word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
    >>> model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

We pruned the model to only include the most common words (~44k words).

    >>> len(model)
    43981

Each word is represented as a 300-dimensional vector:

    >>> len(model['university'])
    300

Finding the top n words that are most similar to a target word is simple. The result is a list of n (word, similarity score) pairs.

    >>> model.most_similar(positive=['university'], topn = 3)
    [('universities', 0.70039...), ('faculty', 0.67809...), ('undergraduate', 0.65870...)]

Finding the word that does not belong in a list is also supported, although implementing this yourself is simple.

    >>> model.doesnt_match('breakfast cereal dinner lunch'.split())
    'cereal'
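
A minimal sketch of one way to implement this yourself, assuming the odd word out is the one least similar to the mean of the unit-normalized vectors. The function name `odd_one_out` and the toy 2-dimensional vectors are hypothetical; this is not claimed to be Gensim's exact implementation.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group mean.

    `vectors` is any mapping from word to NumPy vector (hypothetical helper).
    """
    # Normalize each vector to unit length so only direction matters.
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    # Cosine similarity of each word to the mean direction.
    scores = unit @ mean
    return words[int(np.argmin(scores))]

# Made-up vectors: three "meal" words point one way, 'cereal' another.
toy = {
    'breakfast': np.array([1.0, 0.1]),
    'lunch': np.array([0.9, 0.2]),
    'dinner': np.array([1.0, 0.0]),
    'cereal': np.array([0.1, 1.0]),
}
result = odd_one_out(['breakfast', 'cereal', 'dinner', 'lunch'], toy)
```

With the pre-trained model loaded above, the same function could presumably be called with `model` in place of `toy`, since `model[word]` returns a vector.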

Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example,
the vector 'King - Man + Woman' is close to 'Queen' and 'Germany - Berlin + Paris' is close to 'France'.

    >>> model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)
    [('queen', 0.71181...)]

    >>> model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)
    [('France', 0.78840...)]
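
The analogy queries above can be sketched as vector arithmetic followed by a nearest-neighbour search. The example below uses made-up 2-dimensional vectors and a hypothetical helper named `analogy`; it follows the common additive formulation over unit-normalized vectors and is not claimed to match Gensim's exact computation.

```python
import numpy as np

def analogy(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    def unit(v):
        return v / np.linalg.norm(v)

    target = (sum(unit(vectors[w]) for w in positive)
              - sum(unit(vectors[w]) for w in negative))
    target = unit(target)
    # Exclude the query words themselves from the candidates.
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: float(unit(vectors[w]) @ target))

# Toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only).
toy = {
    'king': np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man': np.array([0.1, 1.0]),
    'woman': np.array([0.1, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}
result = analogy(['woman', 'king'], ['man'], toy)
```

Excluding the query words matters: without it, the nearest neighbour of the target is often one of the input words.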

We can visualize the word embeddings using t-SNE (https://lvdmaaten.github.io/tsne/). For this demonstration, we visualize the first 1000 words.

|    import numpy as np
|    labels = []
|    count = 0
|    max_count = 1000
|    X = np.zeros(shape=(max_count,len(model['university'])))
|
|    for term in model.index_to_key:
|        X[count] = model[term]
|        labels.append(term)
|        count+= 1
|        if count >= max_count: break
|
|    # It is recommended to use PCA first to reduce to ~50 dimensions
|    from sklearn.decomposition import PCA
|    pca = PCA(n_components=50)
|    X_50 = pca.fit_transform(X)
|
|    # Using TSNE to further reduce to 2 dimensions
|    from sklearn.manifold import TSNE
|    model_tsne = TSNE(n_components=2, random_state=0)
|    Y = model_tsne.fit_transform(X_50)
|
|    # Show the scatter plot
|    import matplotlib.pyplot as plt
|    plt.scatter(Y[:,0], Y[:,1], 20)
|
|    # Add labels
|    for label, x, y in zip(labels, Y[:, 0], Y[:, 1]):
|        plt.annotate(label, xy = (x,y), xytext = (0, 0), textcoords = 'offset points', size = 10)
|
|    plt.show()

------------------------------
Prune the trained binary model
------------------------------

Here is the supporting code to extract part of the binary model (GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/.
We used this code to produce the `word2vec_sample` model.

|    import gensim
|    # Load the binary model
|    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary = True)
|
|    # Only output words that appear in the Brown corpus
|    from nltk.corpus import brown
|    words = set(brown.words())
|    print(len(words))
|
|    # Write the words present in both vocabularies to the output file
|    out_file = 'pruned.word2vec.txt'
|    with open(out_file,'w') as f:
|        word_presented = words.intersection(model.index_to_key)
|        f.write('{} {}\n'.format(len(word_presented),len(model['word'])))
|
|        for word in word_presented:
|            f.write('{} {}\n'.format(word, ' '.join(str(value) for value in model[word])))
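
The file written above follows the word2vec text format: a header line with the vocabulary size and the vector dimension, then one line per word with space-separated values. The sketch below writes and reads back such a file for two made-up words using only the standard library. The helper names `write_word2vec_text` and `read_word2vec_text` are hypothetical, and the reader is only an illustration of the layout, not a replacement for `load_word2vec_format`.

```python
import os
import tempfile

def write_word2vec_text(path, vectors):
    """Write {word: list of floats} in word2vec text format."""
    dim = len(next(iter(vectors.values())))
    with open(path, 'w', encoding='utf-8') as f:
        # Header: vocabulary size and vector dimension.
        f.write('{} {}\n'.format(len(vectors), dim))
        for word, values in vectors.items():
            f.write('{} {}\n'.format(word, ' '.join(str(v) for v in values)))

def read_word2vec_text(path):
    """Read a word2vec text file back into {word: list of floats}."""
    with open(path, encoding='utf-8') as f:
        count, dim = (int(n) for n in f.readline().split())
        vectors = {}
        for line in f:
            word, *values = line.rstrip('\n').split(' ')
            vectors[word] = [float(v) for v in values]
    # Check the body against the header.
    assert len(vectors) == count
    assert all(len(v) == dim for v in vectors.values())
    return vectors

# Made-up vectors for illustration only.
toy = {'cat': [0.1, 0.2, 0.3], 'dog': [0.4, 0.5, 0.6]}
path = os.path.join(tempfile.mkdtemp(), 'toy.word2vec.txt')
write_word2vec_text(path, toy)
loaded = read_word2vec_text(path)
```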