K-Means What? A Less Bewildering Introduction

Today my hope is to give a less bewildering introduction to one of the cornerstone unsupervised learning, K-Means. The only expectation will be some programming experience and a passing understanding of Python. If you are already a rock star machine learning developer then you will likely know all this like the back of your hand. As for everyone else buckle up.

So, quick refresher of some machine learning 101. There are two primary types of learning algorithms: supervised, and unsupervised. Supervised algorithms validate their estimations while learning by correcting themselves after looking at supplied answers. By supply the answers this allows the algorithm to model data in a way specified by its creator.

Unsupervised algorithms simply require data with enough inherent context for them to unravel a pattern. This is handy since one difficulty in machine learning is labeling data to get an algorithm to learn to fill gaps and generalize accurately. But it does come at a cost. Unsupervised algorithms will not be able to label and categorize as straightforwardly as supervised learners. This is due to the inherent context that labels give data. For instance, it would not be possible to give an unsupervised algorithm trained with animal data points a dog and expect it to directly output the category dog. But it will likely throw it in the same category as other dog data points, maybe wolfs as well etc. Unsupervised learners real strength is for finding patterns we are not looking for.

So, let’s get started.

K-Means Basics

We are also going to use the variable k denote the number of categories we want the algorithm to split our data into.

Let us also call the center (x,y) point of a category in our graph its centroid. More on how this works shortly. We will assign each data point will always be assigned to its closest centroid.

For those who (like myself) slept through math class, lets quickly talk about finding the distance between two vectors. Which is formally noted as:

Which broken down looks like:

For those a little rusty on there euclidean geometry here is a simple explanation as to how distance is derived.

Optional Nerd Knowledge

I’m sure, like most of us, you’re wondering why not just use cosine dissimilarity or some other form of distance. Well in truth there are variations to k-means which do calculate distance with other methods.

Here is a brief explanation as to why euclidean distance is used as well why we square the distance, as you will see below. The short, smarty pants, answer goes as follows “.. the sum of squared deviations from (the) centroid is equal to the sum of (the) pairwise squared Euclidean distances divided by the number of points”. But mostly, it saves us from taking the square root of the difference between vectors.

Step Through

Lets quickly walk through the algorithms steps. But first lets randomly initialize the x and y positions of K number of centroids.

1.) We now check each for the one with the shortest distance squared between and assign each point to its nearest centroid.

2.) Now that we have updated all of our points, lets updated our centroids. Each centroid now moves to the average of all the newly assigned points. Which is done by adding up all the points in each cluster and dividing by the count.

Now all that is needed is to repeat steps one and two by either a fixed amount of say 100 iterations or perhaps check to see how much the centroids have moved. When it stops moving any significant amount, stop the loop.

Now that wasn’t too bad, was it? Lets see a bit of sample code to demonstrate it in practice.


def assign_clusters():
for x_indx, x in enumerate(points):
min_distance = sys.maxint
min_category = 0

for idx,centroid in enumerate(k):
# using built in vector function for distance to make simple
distance = x[0].dist(centroid)

# is it closer? if so lets make it our category
if distance < min_distance:
min_distance = distance
min_category = idx

# update point with new category
points[x_indx] = (x[0], min_category)

def update_centroid():
for idx,_ in enumerate(k):
sum = PVector(0,0)
count = 0

# sum the vectors
for p in (item for item in points if item[1] == idx):
count +=1

# bad things happen when you divide by zero
if count == 0:
continue

# normalize to the average position of all points
sum.div(count)
k[idx] = sum


• Its simple, fast, simple to understand, and usually pretty easy to debug.
• Reliable when clear patterns exist.

• It can get stuck in local optima, which usually requires re-running the algorithm several times and taking the best result.
• It won’t detect non linear clusters.

K-Means doesn’t like non linear data –   🙁 [source]

Final Thoughts

Hopefully you have found this to be useful, if not or should you have any questions be sure to let me know on twitter @zen_code. Of course this article is a pretty elementary description of what K-Means can do. For those looking for a bit more, or would like to see some interesting applications. If you interested in using this on a large or production data set, what ever you do don’t right you own. Sci-kit learn has an excellent k-means tool set. I’ve included a link to the full code as well additional resources below and as always happy learning.

Does it Function?

Photo by: Gaby Av

What are the ideal properties of a function and when should you refactor a functions? Or from another angle, when is the best time to refactor a function out, restructure within or just leave it? In the last week or so I’ve noticed an interesting pattern in functions written by some slightly novice developers. I got to thinking about how to best give constructive feedback. But something I found interesting about all of is that the structure of the functions, at least on the surface, seemed to meet all the traditional best practices of a function. They performed a single purpose, were created to avoid repetition, etc. But the code was awkward to work with and functions almost obfuscated rather than simplified. So what attributes made them difficult to work with and kinda smell of code rot? So I thought I would list out some further considerations when breaking a functions out.

Functional Mutation

What the heck is functional mutation? Well I did just make the word up but it seemed appropriate. Usually, this is a sneaky violation of “functions should perform a single purpose”. It happens when a function’s output varies unpredictable from what can been gathered from the function’s name and its inputs. This is usually caused when the code inside the functions varies output wildly due to non-intuitive conditions, at least as seen by the outside observer. It really pays to leave conditional logic which impacts the straightforwardness of a function in a higher level of code. If that just doesn’t seem reasonable then usually that’s an indication that it’s to rethink the structure to match the conditional lines.

Sequential Dependence

When a function can not perform its required task due to its dependence on another function it’s usually bad. There are exceptions to this one, especially when designed something that interface with an external system. But, it’s very important to ask whether the dependence can be removed by either restructuring the input parameters, restructuring functions, or both. The problem with sequential dependence when it comes to maintenance is that it hurts code readability, and locks it any changes to external dependencies. So ask yourself twice if there is a better way if you find a set of function like this.

Sometimes it’s better to not refactor a function out unless it stand on its own. But be sure to remember that the person that comes behind you, which is usually you 6 months from now who has totally forgotten everything, won’t know your ideology while they absorb it purpose. So do yourself, and them a favor by writing code which speaks to them rather than having to be strangle out.

Raspberry Pi Interrupts – An adventure with vector interrupt controllers

Today’s topic is about the Raspberry Pi Interrupt controller. I know what you’re thinking but its not as bad as it sounds promise. So, I’ve been goofing around with my ultra sweet Raspberry Pi and in doing so had this wonderful idea to just roll my own OS. You know average weekend project stuff, right? Anyways my work on RazOS has just begun and most likely will continue for some time. Though so far its been awesome, I have encountered many a roadblock along the way and thought I should share a few in hopes of saving someone else from the pain and suffering of sorting though Linux source as well as the little tidbits that are scattered about. That said, if you’re interested in OS coding I’m planning a series to come which should provided a bit more back-story. I should mention for those interested in learning something cool that this is should give you a bit of the detail you will need. So if you have no idea what a Vector Interrupt Controller is then stick with me I’ll bring you up to speed, if you do know then you may want to skip the whys and go to the core of the article down below.

What is a Vector Interrupt Controller

Unfortunately our CPU is a bit dumber than our accountant so we have to be a little more straight forward with our decision making but the metaphor is still quite applicable. But the question lingers still, what is a vector interrupt controller? In short a vector interrupt controller is for when we have more than one door which is almost always the case.  In the case of the Raspberry Pi it can visualized with one buzzer and you have to look up which door it came from but we will come back to that in more concrete terms.

Raspberry Pi Interrupt Controller

To began lets look at this bit of code from a great example provided from David Welch.


.globl _start
_start:
ldr pc,reset_handler
ldr pc,undefined_handler
ldr pc,swi_handler
ldr pc,prefetch_handler
ldr pc,data_handler
ldr pc,unused_handler
ldr pc,irq_handler
ldr pc,fiq_handler

reset_handler: .word reset
undefined_handler: .word hang
swi_handler: .word hang
prefetch_handler: .word hang
data_handler: .word hang
unused_handler: .word hang
irq_handler: .word irq
fiq_handler: .word hang

reset:
mov r0,#0x8000
mov r1,#0x0000
ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}
ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}

irq:
;@ What gets called if a IRQ event happens


This is an example set of one possible way to initialize your vector table but I find it to be very straight forward for illustration purposes. This code’s job is to put the addresses of the subroutines which will be called automatically by the processors hardware should any of the events be triggered as well as enabled. By default the hardware assumes that the first eight words of memory are address to subroutines, but it can be changed to the last eight words should you wish it. You can find out how here, get familiar with it many arm specific questions can be answered with it. Since the Pi expects and runs your code at 0x8000 we must relocate the table to 0x0000, which is what the remainder of the code does for us.

Since we are only interested in interrupts today I’m going to ignore all but the IRQ calls, though it should be noted that FIQ is essentially the same with a few perks to speed up interrupt handling. You can find more on the subject in the ARM Manual.

Who knocks?

So the final part of our interrupt adventure comes in finding which door the knock has occurred. But before we get any knocks we have to enable IRQ, this is a two step procedure. First of which we must turn on the “master IRQ switch” so to speak which is located outside in a coproccessor. Once again code courtesy of David Welch. Note the mrs and msr calls.


.globl enable_irq
enable_irq:
mrs r0,cpsr
bic r0,r0,#0x80
msr cpsr_c,r0
bx lr


Once we have enabled IRQ as a whole them we must enable the individual “Doors” which we wish to hear knocks from. From this point each sub component handles the the details, so you must refer to each individual component for the details of how it triggers an interrupt.  It is worth mentioning that most of the interrupts are not handled via the ARM processor unfortunately since technically it is the secondary processor in the Broadcom BCM2835, the undocumented Videocore GPU is the king on this chip. But the ARM does have access to all of the primary ones.

The subsystems can be found here. One big point to note about the peripherals document. Unless you have the MMU enabled and mapped as stated then almost all address in the guide must be translated from 0x7Ennnnnn to 0x20nnnnnn. For example the guide states that the interrupt begins at register is 0x7E00B000, NOPE! It is in fact at the physical address of 0x200B000, so keep this mind. Since this is running a bit long I’m going to end it here, a follow up to come.

The Executioner: the JavaScript execution context and how to defeat it

JavaScript as the dynamic and versatile language it is provides us with several non intuitive points to be weary of while constructing our code. But before we dig too deep into the issues and their solutions lets review a few facets of the JavaScript’s interpreter.

Execution Context

[Note: also sometimes referred to as Activation Object, Scope Object, Variable Fairy]

Not as scary as it sounds, I assure you. To understand its purpose we need to merely break down its very name. Execution Context, it is the manager of the context of the code that is currently being executed. Quickly, lets take a look at what the execution contexts properties are.

var executionContextObject = { variableObject:{}, scopeChain:{}, this:{} };

• VariableObject – Simply the list of variables/arguments which may be accessed in the function. More on this in a moment.
• ScopeChain – All the execution context objects containing of all the parent functions.
• This – the value associated with the current this

Variable Object Array

Lets take a moment to peel back the covers and understand how the variable object works. Now at the point this is happening the interpreter has reached a new function and is creating the execution context for us. These are the steps it will take while creating it.

1. Look at the inputs||arguments of the function and make an arguments array out of them.
2. Find each function and and save the pointer to it, if it finds a matching pointer overwrite it with the newest value. Stick results into the Variable Object array.
3. Find each variable with a var and initialize it to its line assigned value (var me = ‘Ben’) or default which is the “undefined” value. Stick results into Variable Object array.

This array is what lets us do things like this.

var mySweetObject = { color: "Red", magicalAbility: "telekinesis" };

 

//Normal way to access a property var color = mySweetObject.color; //Using the Variable Object Array var magicalAbility = mySweetObject["magicalAbility"];

The Execution Stack

Now that we understand a bit of how the execution context looks lets take a look at the execution stack. The execution stacks job is to keep track of whats being executed as well as its associated execution context. We can think of it simply as a stack which items are placed on top as they are encountered or LIFO if you prefer that verbiage.

[Image created by David Shariff]


Hoisting

You hear a lot about hoisting, but armed with the above you can tell all of your friends about how simple it is, properly asserting your intellectual superiority over them. Lets look at an example.

console.log(typeof funkyFunction ); // function pointer

 

var funkyFunction = function(){ return "awesome"; }

Wha, bit odd that myFunkyFunction seems to have a value as though it was declared before its var. But we know that the variable object array is created before the code is executed and therefore is the any inside the scope of this function are created before execution. (Side Note: this does not apply to variables applied to child function)

Scope Chain

Like mentioned before the scope chain is an array which contains all parent execution contexts. Its purpose is to store the location of all variables that the current function can access , this is to facilitate JavaScript’s external variable storage which is represented as lexical scoping.  Lexical scoping simply means functions are stuck with the variables of their parent functions even if called from another physical location.

Use local variables

Now considering the impact of all our variables being stored in physical order, also it is then given they must each be resolved every time a non local variable is utilized. This is a very good reason to be weary of the global scope since it must always be contained in the scope chain. So lighten up on those global declarations, it can save you on bigger JavaScript projects. I have seen scope variables easily reach over 100,000 in big projects, this meaning a serious impact on scan times when the interpreter must resolve variables. But of course use of local variables avoids this entirely, allowing the interpreter to remain in the variable object array.

P.S.

Don’t forget that lack of a var keyword when declaring a variable always lands that variable into the global scope, this includes functions.

SQL Voodoo Query: Merging a column in multiple rows into just one

[This is an old but regularly requested  post of mine from BIDN in 2011, updated due to my renowned grammar]

Today I wanted to quickly mention a technique to pivot a one to many data structure into a one column list of items. I know it has been mentioned before a bit differently in previous blogs before on BIDN.

The problem?

We have three tables

InstructorsInstructorClassesClasses

In the instructors table we have the instructors name, ID, and yada yada. InstructorClasses is our junction table and links all of the instructors to their list of classes and finally classes is our simple list of classes.  Now assuming we have a web page that needs to show them as below?

Instructor: Jim Bean

Classes: Rowing; Yoga; Bartending;

Now assuming you are unable to format in code or perhaps you are archiving data and needing to convert it during ETL. To get the expected results we can utilize the XML Path command in SQL Server to provide us a pivoted view of classes as demonstrated in the example below.

SELECT Left(Main.Categories,Len(Main.Categories)-1) as Classes, Instructors.FirstName + ' ' + Instructors.LastName as InstructorsName FROM Instructors left join (Select distinct ST2.InstructorID , (Select distinct Classes.ClassTitle + '; ' AS [text()] From InstructorsClasses ST1 join dbo.Classes on ST1.ClassID = Classes.ClassID Where ST1.InstructorID = ST2.InstructorID For XML PATH ('') ) [Categories] From InstructorsClasses ST2) [Main] on InstructorID = main.InstructorID