# K-Means Clustering

In my opinion, K-Means is often confused with K-Nearest Neighbors (KNN). KNN is a classification algorithm in which a given point x is classified by its k nearest neighbors: x is assigned to the class that holds the majority among those neighbors.
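To make the contrast concrete, here is a minimal sketch of the KNN majority vote in NumPy. The function name `knn_predict`, the default `k=3` and the toy data are my own assumptions for illustration, not part of any library. Note that KNN needs the labels `y_train`, which K-Means never sees.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points.

    Hypothetical helper written for this post, not a library function.
    """
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy labeled data (made up for illustration)
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]))
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]))
```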

K-Means, on the other hand, is a centroid-based clustering algorithm. It falls in the category of unsupervised learning.

In unsupervised learning the data has no corresponding output. The algorithm takes only input data, and its main purpose is to find hidden patterns in the data set, because the data is not labeled. There is no correct answer; the machine is left with the algorithm to find patterns and determine the structure of the data set. Unsupervised learning is used in data mining, bioinformatics, medical imaging and computer vision.

K-Means

Formally, we are given a training data set { x(1), x(2), …, x(m) } and want to group it into clusters. As mentioned, in unsupervised learning there are no labels, that is, y is not given.

Cluster Assignments

First, initialize the K cluster centroids µ1, µ2, …, µK randomly. These current guesses represent the means of our clusters.
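One common way to do the random initialization is to pick K distinct training points as the starting centroids. The sketch below assumes that choice; the function name `init_centroids`, the `seed` parameter and the toy data are hypothetical and only for illustration.

```python
import numpy as np

def init_centroids(X, K, seed=0):
    """Pick K distinct rows of X as the initial centroids (an assumed strategy)."""
    rng = np.random.default_rng(seed)
    # Sample K different row indices without replacement
    idx = rng.choice(X.shape[0], size=K, replace=False)
    # Copy so later centroid updates do not modify the data
    return X[idx].copy()

# Toy data (made up for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
mu = init_centroids(X, K=2)
```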

for each i = 1 .. m:
    c(i) := argmin over k of ||x(i) − µk||^2

K is the number of clusters to create. In the above equation, µk is the centroid of cluster k and x(i) is the i-th element of the training data set. The equation sets c(i) to the index (1 to K) of the cluster centroid nearest to x(i).
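The assignment step can be sketched in NumPy as follows. The function name `assign_clusters` and the toy arrays are my own assumptions. Also note that the code uses 0-based indices (0 to K-1), while the text counts clusters from 1 to K.

```python
import numpy as np

def assign_clusters(X, mu):
    """Return, for every point, the index of its nearest centroid."""
    # Squared Euclidean distance between every point and every centroid.
    # X has shape (m, d), mu has shape (K, d); the result has shape (m, K).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # argmin over the centroid axis gives c(i) for each point (0-based)
    return sq_dists.argmin(axis=1)

# Toy data and centroids (made up for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.9, 8.3]])
mu = np.array([[0.0, 0.0], [9.0, 9.0]])
c = assign_clusters(X, mu)
```

Here the first two points end up in cluster 0 and the last two in cluster 1.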

Move Centroid

Next, move the cluster centroids. Let's say the points x(1), x(2), x(4) and x(10) were nearest to µ2; then c(1) = 2, c(2) = 2, c(4) = 2 and c(10) = 2, and µ2 is moved to the average of these four points.

for each k = 1 .. K:
    µk := average (mean) of the points assigned to cluster k
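A minimal NumPy sketch of the centroid move is shown here. The name `move_centroids` is hypothetical, and keeping the old centroid when a cluster has no points is my own assumption, since the text does not say what to do in that case.

```python
import numpy as np

def move_centroids(X, c, mu):
    """Move each centroid to the mean of the points assigned to it."""
    new_mu = mu.copy()
    for k in range(mu.shape[0]):
        members = X[c == k]          # points currently assigned to cluster k
        if len(members) > 0:         # assumption: an empty cluster keeps its old centroid
            new_mu[k] = members.mean(axis=0)
    return new_mu

# Toy data, assignments and centroids (made up for illustration)
X = np.array([[1.0, 1.0], [3.0, 1.0], [8.0, 8.0], [10.0, 8.0]])
c = np.array([0, 0, 1, 1])
mu = np.array([[0.0, 0.0], [9.0, 9.0]])
new_mu = move_centroids(X, c, mu)
```

With these numbers the centroids move to (2, 1) and (9, 8).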

Repeat the two steps above (cluster assignment, then moving the centroids) until convergence, that is, until the assignments stop changing.
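Putting both steps in a loop gives a small, from-scratch version of the algorithm. This is only a sketch under a few assumptions of mine: the function name `kmeans`, the `max_iter` cap, passing the initial centroids in explicitly (so the run is reproducible), and stopping when the assignments no longer change.

```python
import numpy as np

def kmeans(X, mu_init, max_iter=100):
    """Alternate assignment and centroid moves until assignments stop changing."""
    mu = np.array(mu_init, dtype=float)   # working copy of the centroids
    K = mu.shape[0]
    c = np.full(X.shape[0], -1)           # no assignments yet
    for _ in range(max_iter):
        # Cluster assignment: nearest centroid for every point
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_c = sq_dists.argmin(axis=1)
        if np.array_equal(new_c, c):      # converged: nothing changed
            break
        c = new_c
        # Move centroid: mean of the points assigned to each cluster
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:          # assumption: empty cluster keeps its centroid
                mu[k] = members.mean(axis=0)
    return c, mu

# Toy data with two obvious groups (made up for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [7.9, 8.3], [8.2, 7.8]])
# Start from one point of each group instead of a random draw
c, mu = kmeans(X, mu_init=X[[0, 3]])
```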

Using the scikit-learn (sklearn) library for Python, we can generate the clusters without writing these steps by hand; sklearn does the clustering for us.

The code below is taken from: Jakevdp-PythonDataScience-KMean

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
# make_blobs lives in sklearn.datasets; the old
# sklearn.datasets.samples_generator module was removed in newer releases
from sklearn.datasets import make_blobs

# Generate sample data: 300 points around 4 centers
np.random.seed(0)
X, labels_true = make_blobs(n_samples=300, centers=4,
                            cluster_std=0.60, random_state=0)

# Plot the raw, unlabeled points; the four groups are easy to pick out by eye
plt.scatter(X[:, 0], X[:, 1], s=50)

# Fit K-Means with 4 clusters and get the cluster index of every point
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Color each point by its assigned cluster (on a new figure)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Mark the learned cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

plt.show()
```

Reference:

Andrew Ng K-Means