# Natural Language Toolkit: Clusterers
#
# Copyright (C) 2001-2023 NLTK Project
# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT

"""
This module contains a number of basic clustering algorithms. Clustering
describes the task of discovering groups of similar items within a large
collection. It is also described as unsupervised machine learning, as the
data from which it learns is unannotated with class information, unlike the
case for supervised learning. Annotated data is difficult and expensive to
obtain in the quantities required for the majority of supervised learning
algorithms. This problem, the knowledge acquisition bottleneck, is common to
most natural language processing tasks, thus fueling the need for quality
unsupervised approaches.

This module contains a k-means clusterer, an E-M clusterer and a
group-average agglomerative clusterer (GAAC). All these clusterers involve
finding good cluster groupings for a set of vectors in multi-dimensional
space.

The K-means clusterer starts with k arbitrarily chosen means, then allocates
each vector to the cluster with the closest mean. It then recalculates the
means of each cluster as the centroid of the vectors in the cluster. This
process repeats until the cluster memberships stabilise. This is a
hill-climbing algorithm which may converge to a local maximum. Hence the
clustering is often repeated with random initial means and the most commonly
occurring output means are chosen.
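
For instance, the repeated clustering described above is available through
the repeats argument (a minimal sketch; the vectors are the same toy data as
in the usage example below)::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # run k-means ten times from random initial means and keep the most
    # commonly occurring set of output means
    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
    clusters = clusterer.cluster(vectors, assign_clusters=True)
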
The GAAC clusterer starts with each of the *N* vectors as singleton clusters.
It then iteratively merges pairs of clusters which have the closest centroids.
This continues until there is only one cluster. The order of merges gives
rise to a dendrogram, a tree with the earlier merges lower than later merges.
The cluster memberships for a given number of clusters *c*, *1 <= c <= N*,
can be found by cutting the dendrogram at depth *c*.
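
A sketch of recovering a flat clustering from the merge history
(num_clusters controls where the dendrogram is cut)::

    from numpy import array
    from nltk.cluster import GAAClusterer

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # merge the singleton clusters bottom-up, then cut the resulting
    # dendrogram into two clusters
    clusterer = GAAClusterer(num_clusters=2)
    clusters = clusterer.cluster(vectors, assign_clusters=True)

    # print the merge history as a tree
    clusterer.dendrogram().show()
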
The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters; this is the 'E' step. The cluster parameters are then updated in
the 'M' step using the maximum likelihood estimate from the cluster
membership probabilities. This process continues until the likelihood of the
data does not significantly increase.
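
A sketch of the corresponding API (EMClusterer takes the initial means
explicitly; bias is a variance bias that keeps the covariance matrices
non-singular)::

    from numpy import array
    from nltk.cluster import EMClusterer

    vectors = [array(f) for f in [[0.5, 0.5], [1.5, 0.5], [1, 3]]]

    # start E-M from two guessed means and iterate until the likelihood
    # stops improving significantly
    clusterer = EMClusterer([[4, 2], [4, 2.01]], bias=0.1)
    clusters = clusterer.cluster(vectors, assign_clusters=True, trace=True)
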
They all extend the ClusterI interface which defines common operations
available with each clusterer. These operations include (see the sketch
after this list):

- cluster: clusters a sequence of vectors
- classify: assigns a vector to a cluster
- classification_probdist: gives the probability distribution over cluster
  memberships
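
For example, continuing the E-M sketch above (classification_probdist
returns a probability distribution whose prob method gives the membership
probability of each cluster index)::

    # hard assignment of a new vector to a single cluster
    print(clusterer.classify(array([2, 2])))

    # soft assignment: a distribution over all clusters
    pdist = clusterer.classification_probdist(array([2, 2]))
    for i in range(clusterer.num_clusters()):
        print(i, pdist.prob(i))
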
The existing clusterers also extend VectorSpaceClusterer, an abstract class
which allows for singular value decomposition (SVD) and vector normalisation.
SVD is used to reduce the dimensionality of the vector space in such a manner
as to preserve as much of the variation as possible, by reparameterising the
axes in order of variability and discarding all bar the first *d* dimensions.
Normalisation ensures that vectors fall in the unit hypersphere.
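
Both are enabled through constructor arguments of the vector-space
clusterers; a sketch with k-means (normalise and svd_dimensions as accepted
by VectorSpaceClusterer)::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in
               [[3, 3, 1], [1, 2, 1], [4, 2, 1], [4, 0, 1]]]

    # normalise each vector to unit length, then cluster in a
    # two-dimensional SVD-reduced space
    clusterer = KMeansClusterer(2, euclidean_distance,
                                normalise=True, svd_dimensions=2)
    clusters = clusterer.cluster(vectors, assign_clusters=True)
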
Usage example (see also demo())::

    from nltk import cluster
    from nltk.cluster import euclidean_distance
    from numpy import array

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeansClusterer(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print(clusterer.classify(array([3, 3])))

Note that the vectors must use numpy array-like objects.
nltk_contrib.unimelb.tacohn.SparseArrays may be used for efficiency when
required.
"""

from nltk.cluster.em import EMClusterer
from nltk.cluster.gaac import GAAClusterer
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import (
    Dendrogram,
    VectorSpaceClusterer,
    cosine_distance,
    euclidean_distance,
)