Today my hope is to give a less bewildering introduction to one of the cornerstone unsupervised learning, K-Means. The only expectation will be some programming experience and a passing understanding of Python. If you are already a rock star machine learning developer then you will likely know all this like the back of your hand. As for everyone else buckle up.
So, quick refresher of some machine learning 101. There are two primary types of learning algorithms: supervised, and unsupervised. Supervised algorithms validate their estimations while learning by correcting themselves after looking at supplied answers. By supply the answers this allows the algorithm to model data in a way specified by its creator.
Unsupervised algorithms simply require data with enough inherent context for them to unravel a pattern. This is handy since one difficulty in machine learning is labeling data to get an algorithm to learn to fill gaps and generalize accurately. But it does come at a cost. Unsupervised algorithms will not be able to label and categorize as straightforwardly as supervised learners. This is due to the inherent context that labels give data. For instance, it would not be possible to give an unsupervised algorithm trained with animal data points a dog and expect it to directly output the category dog. But it will likely throw it in the same category as other dog data points, maybe wolfs as well etc. Unsupervised learners real strength is for finding patterns we are not looking for.
So, let’s get started.
We are also going to use the variable k denote the number of categories we want the algorithm to split our data into.
Let us also call the center (x,y) point of a category in our graph its centroid. More on how this works shortly. We will assign each data point will always be assigned to its closest centroid.
For those who (like myself) slept through math class, lets quickly talk about finding the distance between two vectors. Which is formally noted as:
Which broken down looks like:
For those a little rusty on there euclidean geometry here is a simple explanation as to how distance is derived.
Optional Nerd Knowledge
I’m sure, like most of us, you’re wondering why not just use cosine dissimilarity or some other form of distance. Well in truth there are variations to k-means which do calculate distance with other methods.
Here is a brief explanation as to why euclidean distance is used as well why we square the distance, as you will see below. The short, smarty pants, answer goes as follows “.. the sum of squared deviations from (the) centroid is equal to the sum of (the) pairwise squared Euclidean distances divided by the number of points”. But mostly, it saves us from taking the square root of the difference between vectors.
Lets quickly walk through the algorithms steps. But first lets randomly initialize the x and y positions of K number of centroids.
1.) We now check each k for the one with the shortest distance squared between and assign each point to its nearest centroid.
2.) Now that we have updated all of our points, lets updated our centroids. Each centroid now moves to the average of all the newly assigned points. Which is done by adding up all the points in each cluster and dividing by the count.
Now all that is needed is to repeat steps one and two by either a fixed amount of say 100 iterations or perhaps check to see how much the centroids have moved. When it stops moving any significant amount, stop the loop.
Now that wasn’t too bad, was it? Lets see a bit of sample code to demonstrate it in practice.
def assign_clusters(): for x_indx, x in enumerate(points): min_distance = sys.maxint min_category = 0 for idx,centroid in enumerate(k): # using built in vector function for distance to make simple distance = x.dist(centroid) # is it closer? if so lets make it our category if distance < min_distance: min_distance = distance min_category = idx # update point with new category points[x_indx] = (x, min_category) def update_centroid(): for idx,_ in enumerate(k): sum = PVector(0,0) count = 0 # sum the vectors for p in (item for item in points if item == idx): sum.add(p) count +=1 # bad things happen when you divide by zero if count == 0: continue # normalize to the average position of all points sum.div(count) k[idx] = sum
- Its simple, fast, simple to understand, and usually pretty easy to debug.
- Reliable when clear patterns exist.
- It can get stuck in local optima, which usually requires re-running the algorithm several times and taking the best result.
- It won’t detect non linear clusters.
Hopefully you have found this to be useful, if not or should you have any questions be sure to let me know on twitter @zen_code. Of course this article is a pretty elementary description of what K-Means can do. For those looking for a bit more, or would like to see some interesting applications. If you interested in using this on a large or production data set, what ever you do don’t right you own. Sci-kit learn has an excellent k-means tool set. I’ve included a link to the full code as well additional resources below and as always happy learning.