K-Means What? A Less Bewildering Introduction

Today my hope is to give a less bewildering introduction to one of the cornerstone unsupervised learning, K-Means. The only expectation will be some programming experience and a passing understanding of Python. If you are already a rock star machine learning developer then you will likely know all this like the back of your hand. As for everyone else buckle up.

So, quick refresher of some machine learning 101. There are two primary types of learning algorithms: supervised, and unsupervised. Supervised algorithms validate their estimations while learning by correcting themselves after looking at supplied answers. By supply the answers this allows the algorithm to model data in a way specified by its creator.

Unsupervised algorithms simply require data with enough inherent context for them to unravel a pattern. This is handy since one difficulty in machine learning is labeling data to get an algorithm to learn to fill gaps and generalize accurately. But it does come at a cost. Unsupervised algorithms will not be able to label and categorize as straightforwardly as supervised learners. This is due to the inherent context that labels give data. For instance, it would not be possible to give an unsupervised algorithm trained with animal data points a dog and expect it to directly output the category dog. But it will likely throw it in the same category as other dog data points, maybe wolfs as well etc. Unsupervised learners real strength is for finding patterns we are not looking for.

So, let’s get started.

K-Means Basics

We are also going to use the variable k denote the number of categories we want the algorithm to split our data into.

Let us also call the center (x,y) point of a category in our graph its centroid. More on how this works shortly. We will assign each data point x_n will always be assigned to its closest centroid.

For those who (like myself) slept through math class, lets quickly talk about finding the distance between two vectors. Which is formally noted as:

\left \| \vec{x^n} - \vec{\mu^k} \right \|

Which broken down looks like:

distance = \sqrt{(x_{2} - x_{1})^2 + (y_{2} - y_{1})^2}

For those a little rusty on there euclidean geometry here is a simple explanation as to how distance is derived.

Optional Nerd Knowledge

I’m sure, like most of us, you’re wondering why not just use cosine dissimilarity or some other form of distance. Well in truth there are variations to k-means which do calculate distance with other methods.

Here is a brief explanation as to why euclidean distance is used as well why we square the distance, as you will see below. The short, smarty pants, answer goes as follows “.. the sum of squared deviations from (the) centroid is equal to the sum of (the) pairwise squared Euclidean distances divided by the number of points”. But mostly, it saves us from taking the square root of the difference between vectors.

Step Through

Lets quickly walk through the algorithms steps. But first lets randomly initialize the x and y positions of K number of centroids.

1.) We now check each for the one with the shortest distance squared between and assign each point to its nearest centroid.

min_{k} \left \| \vec{x^n} - \vec{\mu^k} \right \|^2

2.) Now that we have updated all of our points, lets updated our centroids. Each centroid now moves to the average of all the newly assigned points. Which is done by adding up all the points in each cluster and dividing by the count.

K_{i} = (1/C_i)\sum^{C_i}_{j=1} x_i

Now all that is needed is to repeat steps one and two by either a fixed amount of say 100 iterations or perhaps check to see how much the centroids have moved. When it stops moving any significant amount, stop the loop.

Now that wasn’t too bad, was it? Lets see a bit of sample code to demonstrate it in practice.


def assign_clusters():
    for x_indx, x in enumerate(points):
        min_distance = sys.maxint
        min_category = 0
        
        for idx,centroid in enumerate(k):
           # using built in vector function for distance to make simple
           distance = x[0].dist(centroid)
           
           # is it closer? if so lets make it our category
           if distance < min_distance:
               min_distance = distance
               min_category = idx
        
        # update point with new category
        points[x_indx] = (x[0], min_category)
        
def update_centroid():
    for idx,_ in enumerate(k):
        sum = PVector(0,0)
        count = 0
        
        # sum the vectors
        for p in (item for item in points if item[1] == idx):
            sum.add(p[0])
            count +=1
        
        # bad things happen when you divide by zero
        if count == 0:
            continue
        
        # normalize to the average position of all points
        sum.div(count)
        k[idx] = sum

Advantages

  • Its simple, fast, simple to understand, and usually pretty easy to debug.
  • Reliable when clear patterns exist.

Disadvantages

  • It can get stuck in local optima, which usually requires re-running the algorithm several times and taking the best result.
  • It won’t detect non linear clusters.

K-Means doesn’t like non linear data –   🙁 [source]

Final Thoughts

Hopefully you have found this to be useful, if not or should you have any questions be sure to let me know on twitter @zen_code. Of course this article is a pretty elementary description of what K-Means can do. For those looking for a bit more, or would like to see some interesting applications. If you interested in using this on a large or production data set, what ever you do don’t right you own. Sci-kit learn has an excellent k-means tool set. I’ve included a link to the full code as well additional resources below and as always happy learning.

Summary of DeepStack: Expert-Level AI in Heads-Up No-Limit Poker

Below is an excerpt of some of the research I have been doing for a Udacity Course I am taking on AI. It’s been a lot of fun digging into the merging of logic and machine learning. I’ll likely post a bit more details and perhaps a review once complete. But for now, here is a summary of some pretty cool research coming out of the University of Alberta.

https://arxiv.org/pdf/1701.01724.pdf

Imperfect information games provide a significant challenge as compared to deterministic games. Since limited information is available to the agent, the agent must calculate a statistical best choice(s). Formally the statistical best response is accomplished by approximating the Nash equilibrium given the private information held by the opponent. DeepStack builds upon the traditional technique known as counterfactual regret minimization, which is a recursive algorithm which pre-computes the probability distributions. The issue with this method is that the computational requirements for Heads-Up No-Limit Poker (HUNL) approaches 10160 therefore approximates to game states would have to be made. A goal of DeepStack was to eliminate as much contextual information loss as possible.

When player a two-player zero sum game such as HUNLor goal is to maximize the expected utility against a best response strategy. Deep Stack calculates this through 3 elements. The first generates a representative picture of all currently available public information. Public information in the case would be the agents two cards in hand, the pot, and any visible cards dealt to the table. The second component is a depth limited search technique which uses a trained neural network to quickly evaluate values below a certain depth. By using the flexibility of neural networks to generate values, the depth needed to search can be greatly limited. Limited set of look-ahead branches. By limiting to a known set of optimal actions, the recursive traversal computational cost can be reduced.

The internal representation of all public information is regenerated each action by a process the authors call “Continual re-solving”. This is done with two values, obviously, the first required values are the public state. Secondly, a vector of counterfactual values of valid “what if’s” relevant to the opponent. The counterfactual values start with a default starting state for the first round. Afterwards the vector is updated by output of the recalculation.

In order to make re-solving practical, the depth of the search is limited to four actions. Values are calculated through a neural network the authors call “Deep Counterfactual Value Networks”. Given the complexity created by imperfect information, each vector must contain what amounts to a description of the poker game being played. Concretely, a representation of the probability distribution of the cards dealt to the private hand and a value of the current knowns.  The authors liken this output to “Intuition”. The network itself consists of seven hidden layers with 500 nodes each. Trained on 10 million randomly generated poker games for the turn and 1 million for the flop.

DeepStack has shown significant improvement over previous systems. Competing against various professional human players. Playing a total of 44,852 games. DeepStack won 492 mbb/g. Which the authors indicate as over 4 standard deviations away from zero. In comparison to previous algorithms such as Claudico(2015) which lost at a rate of 91 mbb/g. DeepStack shows a novel approach to relegating to neural networks strength in determining probability distribution over a wide range of outcomes. This system which demonstrates the possibilities in the future of AI in regards to solving problems with imperfect information.