Are you curious about Machine Learning and how to implement its fundamental algorithms? Then you should keep reading.

This post is a brief account of what happened at JArchitect's office on April 25th, during the latest *JArchitect Tech Day*.

This *Tech Day* edition was fully dedicated to some of the most important machine learning algorithms: the ones that form the basis of the state-of-the-art algorithms and help us understand how machine learning can be applied in real-world scenarios.

Obviously, most IT people have had contact with the common buzzwords and terms related to machine learning and artificial intelligence, at least to some degree. However, terms like *supervised learning, reinforcement learning, neural networks, deep learning* and so many others are not just buzzwords. Even though we are getting used to these words as they become more and more common, few people actually *know how* these techniques work. From my perspective, the reason is quite simple: *people don't know the basic concepts on which the most powerful algorithms are built*.

This perception was the single driver and motivator for the **Diving into the fundamental machine learning algorithms: let's code them!** *Tech Day* session.

Therefore, let's shed some light on the subject by implementing a few of the fundamental machine learning algorithms. Hopefully we'll grasp the basics that will help us understand the more advanced techniques and algorithms later on.

## Caveats

- Some of the workshop's audience had little or no experience with machine learning;
- The code base is written in pure Python (no fancy libraries allowed) and can be found in the GitHub repository [https://github.com/jmetzz/ml-fundamentals];
- The code is **not** production ready, making it useful for **educational purposes only**;
- For most examples in the repository I'm using data sets from the UCI Machine Learning Repository;
- Oftentimes there are examples using simpler/toy data sets that are hard-coded into the scripts or even randomly generated for that specific scenario.

## Scope

Back in 2008, a group of scientists published a really interesting article listing the *Top 10 algorithms in data mining* (see references). These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are among the most important topics in data mining research and development.

The authors briefly described each algorithm and discussed its impact on the field as support to their claim on why that particular algorithm should be among the most influential data mining algorithms in the research community.

Given that diving into the implementation details of all 10 algorithms would not fit in the 'time box' we had for the workshop, I've chosen a small selection of algorithms that partially overlaps with the top 10 list. Thus, the list of algorithms covered here is:

- K-means partitional clustering (**2nd in the list**)
- Single Linkage hierarchical clustering
- K-nearest neighbour (**8th in the list**)
- CART (**10th in the list**)
- Perceptron Neural Network

The first question one might ask is: *why choose algorithms not present in the top 10 list?*

Single Linkage hierarchical clustering and the Perceptron Neural Network didn't make it into the 2008 list, but I do consider them quite important for understanding some different aspects and techniques of machine learning. Moreover, we have to keep the audience in mind (*see caveat #1*): Support Vector Machines or ensemble methods would probably be a bit too complex for the non-initiated.

## The algorithms

In this section you will find a *super brief* description of the algorithms. You can also get the presentation slides, which contain a few images that will help you better understand the step-to-step transitions in each algorithm. However, I do encourage you to dive into the code to check how each algorithm really works. The code should be clean and simple enough even for those with no Python experience. I've tried to use only basic, general-purpose programming language resources and techniques, so do not expect any obscure or fancy Python structure or command.

In order to keep this post short (and hopefully not boring) I've split the content into two separate posts:

- This one, containing the description of K-means, Single Linkage hierarchical clustering and K-nearest neighbour classifier; and
- The next post (which will be available soon) containing the description of CART and Perceptron Neural Network.

Stay tuned if you want to learn more about these algorithms.

### K-means

K-means is an unsupervised learning algorithm, which means it does not use any kind of *a priori* additional information about the data apart from its features. Consider the table below as an example of a data set. Each column represents a feature, in this case a coordinate in the 2-dimensional space, and each line represents an *instance* of the domain.

|   x   |  y   |
|:-----:|:----:|
| 2.77  | 1.78 |
| 1.72  | 1.16 |
| 3.67  | 2.81 |
| 3.96  | 2.61 |
| 2.99  | 2.20 |
| 7.49  | 3.16 |
| 9.00  | 3.33 |
| 7.44  | 0.47 |
| 6.64  | 3.31 |
| 10.13 | 3.23 |

K-means treats every domain instance as a *point* in the *n*-dimensional space, where *n* is the number of features in the domain, i.e., the number of columns in your data set.
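Since every instance is just a point, it can be represented in plain Python as a coordinate tuple, and the distance between two points computed with a small helper. This is a sketch under the workshop's no-library constraint, not code from the repository, and it uses the Euclidean metric, though other metrics would work too:

```python
def euclidean(a, b):
    """Euclidean distance between two points in n-dimensional space."""
    assert len(a) == len(b), "points must have the same number of features"
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two points in 2-dimensional space (n = 2 features)
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

The same helper works unchanged for any number of features, since it just pairs up coordinates with `zip`.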

Given the data set and a value *k* of 'seeds' or prototypes to start from, which represent the initial cluster centroids, the algorithm iteratively divides the space into *k* clusters, each containing a set of instances from the input data set. The basic idea is to iterate over the data set, partitioning its *m* instances into the *k* clusters by assigning each instance to the cluster with the nearest centroid (mean point or prototype).

Therefore, we could define this algorithm as follows:

```
Given the inputs X1, X2, X3, …, Xm and a value of k
1. Pick k random points as cluster prototypes, or centroids.
2. Assign each instance Xi to the nearest cluster based on the distance to each centroid.
3. Update the cluster centroids by calculating the average of the contained instances.
4. Repeat steps 2 and 3 until none of the cluster assignments change.
```

The image below shows an illustration of the k-means execution (source: Stanford University, CS221 course).
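The four steps above can be sketched in plain Python. This is a toy implementation written for this post under the same no-library constraint as the workshop code (the repository's version differs), run here on the data set from the table:

```python
import random

def euclidean(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_means(points, k, max_iterations=100, seed=42):
    """Partition `points` into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: k random prototypes
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # step 2: assign each instance to the cluster with the nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:     # step 4: stop when nothing changes
            break
        assignments = new_assignments
        # step 3: recompute each centroid as the mean of its member instances
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments

data = [(2.77, 1.78), (1.72, 1.16), (3.67, 2.81), (3.96, 2.61), (2.99, 2.20),
        (7.49, 3.16), (9.00, 3.33), (7.44, 0.47), (6.64, 3.31), (10.13, 3.23)]
centroids, labels = k_means(data, k=2)
```

Note that the result depends on the random initial prototypes, which is why real implementations typically run the algorithm several times with different seeds and keep the best partition.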

### Single Linkage Hierarchical Clustering

Hierarchical clustering algorithms also try to split the data set into groups. However, in this case the algorithm is defined in a way that instances are distributed in clusters respecting a certain hierarchy. There are two main strategies to form this hierarchy:

- Agglomerative (“bottom up” approach): each instance starts in its own cluster, and pairs of clusters are merged based on some similarity measure as one moves up the hierarchy.
- Divisive (“top down” approach): all instances start in one cluster, and splits are performed recursively as one moves down the hierarchy.

For this post we've chosen Agglomerative Single Linkage hierarchical clustering, using the Euclidean distance metric to calculate the *inter-cluster similarity*, i.e., the similarity between clusters.

We could define this algorithm as follows:

```
Given the inputs X1, X2, X3, …, Xm and a stop criterion
1. Initialize the clusters, each containing exactly one instance of the data set
2. Calculate the distance between each pair of clusters according to the linkage criterion
3. Find the pair of clusters with the highest similarity (smallest distance)
4. Merge the selected pair of clusters into a new cluster and remove the original two
5. Repeat steps 2 to 4 until the stop criterion is met, for example a specific number of clusters is formed
```

The linkage criterion defines how the inter-cluster distance is calculated, i.e., the distance between a pair of instances not yet belonging to the same cluster. In the single-linkage strategy, the inter-cluster similarity is represented by the distance between the two instances (one in each cluster) that are closest to each other. Therefore, this method is also known as *nearest neighbour clustering* or minimum linkage. If we instead picked the pair of instances (again, one in each cluster) that are farthest apart, we would be using the complete-linkage strategy. The image below shows an illustration of these two linkage strategies:

Other linkage criteria can also be applied, such as:

- Average linkage clustering, or UPGMA
- Centroid linkage clustering, or UPGMC
- Minimum energy clustering

Check the references for more detail.
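The agglomerative loop with a single-linkage distance can be sketched in plain Python as below (again a toy sketch for this post, not the repository's implementation); swapping `min` for `max` in the linkage function would give complete linkage instead:

```python
def euclidean(a, b):
    """Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage_distance(cluster_a, cluster_b):
    """Single linkage: smallest pairwise distance between the two clusters.
    Using max() here instead would yield complete linkage."""
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)

def agglomerative_single_linkage(points, target_clusters):
    """Merge clusters until `target_clusters` remain; returns a list of clusters."""
    clusters = [[p] for p in points]           # step 1: one cluster per instance
    while len(clusters) > target_clusters:     # step 5: stop criterion
        # steps 2-3: find the closest (most similar) pair of clusters
        best_pair, best_dist = None, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage_distance(clusters[i], clusters[j])
                if d < best_dist:
                    best_pair, best_dist = (i, j), d
        # step 4: merge the selected pair and remove the original two
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

data = [(2.77, 1.78), (1.72, 1.16), (3.67, 2.81), (7.49, 3.16), (9.00, 3.33)]
clusters = agglomerative_single_linkage(data, target_clusters=2)
# separates the three left-hand points from the two right-hand points
```

The stop criterion used here is a target number of clusters; a distance threshold would work just as well by breaking out of the loop when `best_dist` exceeds it.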

### K-nearest neighbour

Unlike the previous two clustering algorithms, K-nearest neighbour, or K-NN for short, is a supervised learning algorithm, or classification method, which means it uses extra information about the data to create a classification model. This extra information is called the **class** and represents the *a priori* known classification of each particular instance.

Actually, this algorithm is a specific type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification time. In other words, no actual model is created; rather, the data itself is used as the model during the classification of a new instance.

Regarding the input data, the main difference between unsupervised and supervised learning is the presence (or absence) of the special feature called class. Consider the table below as an example of a supervised learning data set, where the last column is the class attribute and its value represents whether the instance is a positive or negative case of the domain.

| Outlook  | Temperature | Humidity | Windy  | Class |
|---------:|:-----------:|:--------:|:------:|:-----:|
| Sunny    | Hot         | High     | Weak   | No    |
| Sunny    | Hot         | High     | Strong | No    |
| Overcast | Hot         | High     | Weak   | Yes   |
| Rain     | Mild        | High     | Weak   | Yes   |
| Rain     | Cool        | Normal   | Weak   | Yes   |
| Rain     | Cool        | Normal   | Strong | No    |
| Overcast | Cool        | Normal   | Strong | Yes   |
| Sunny    | Mild        | High     | Weak   | No    |
| Sunny    | Cool        | Normal   | Weak   | Yes   |
| Rain     | Mild        | Normal   | Weak   | Yes   |
| Sunny    | Mild        | Normal   | Strong | Yes   |
| Overcast | Mild        | High     | Strong | Yes   |
| Overcast | Hot         | Normal   | Weak   | Yes   |
| Rain     | Mild        | High     | Strong | No    |

This is the iconic *Saturday mornings* data set first introduced by Quinlan in 1986 (see references), where positive mornings are suitable for some unspecified activity and negative ones are not, for example playing some kind of sport or preparing a barbecue for your friends and family.

Now we can define K-NN classification algorithm as follows:

```
Given the input data set, the value k of neighbours, and a new instance X to classify
1. Load the data
2. Calculate the distance between the instance X and all the other instances in the data set
3. Identify the k nearest instances (neighbours)
4. Vote on the label (class) based on the majority class of the neighbours, i.e., the class
   held by the majority of the nearest neighbours
```

Pretty simple, isn’t it?
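Simple enough to fit in a few lines of plain Python, in fact. One caveat: the Saturday-mornings features are categorical, so Euclidean distance does not apply to them directly (they would need an encoding, or a metric such as Hamming distance); the sketch below therefore reuses the numeric toy points from the K-means section, with a made-up Yes/No label per point:

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_classify(training_set, new_instance, k=3):
    """Classify `new_instance` given a list of (features, class_label) pairs."""
    # steps 2-3: rank every stored instance by distance and keep the k nearest
    neighbours = sorted(training_set,
                        key=lambda item: euclidean(item[0], new_instance))[:k]
    # step 4: majority vote on the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((2.77, 1.78), "No"), ((1.72, 1.16), "No"), ((3.67, 2.81), "No"),
            ((7.49, 3.16), "Yes"), ((9.00, 3.33), "Yes"), ((6.64, 3.31), "Yes")]
print(knn_classify(training, (8.0, 3.0), k=3))  # the 3 nearest are all "Yes"
```

Notice that "loading the data" (step 1) is all the training K-NN ever does; the distance computations only happen when a new instance arrives, which is exactly the lazy-learning behaviour described above.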

## Where to go next?

We intend to add more content soon, possibly covering several machine learning topics in individual posts. But first things first, right?

Thus, let's start with the second part of this post, about CART and the Perceptron. Stay tuned!

## References

- Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Metz, Jean. Hierarchical clustering interpretation. 2006. Master thesis. [get it (Portuguese only)]
- Quinlan, J.R. Mach Learn (1986) 1: 81. [https://doi.org/10.1007/BF00116251]
- Wu, X., Kumar, V., Ross Quinlan, J. et al. Knowl Inf Syst (2008) 14: 1. [https://doi.org/10.1007/s10115-007-0114-2]