K-Nearest Neighbours (KNN Algorithm) - Integrated Circuits - An Online Book
Integrated Circuits  http://www.globalsino.com/ICs/
=================================================================================
KNN (K-Nearest Neighbours) uses similarity for classification: new input values are fed into the trained model to predict the output. KNN is one of the simplest supervised machine learning algorithms and is mostly used to classify a data point based on how its neighbours are classified. KNN stores all available cases and then classifies new cases based on a similarity measure.

Choosing the right value of k is a process called parameter tuning, and it is important for good prediction accuracy. If k equals m, then the program will look at the m nearest neighbours. If k is too small, the prediction is overly sensitive to noise; if k is too large, the prediction becomes slow to compute, and a category with only a few samples in it will always be outvoted by other categories, even though large values of k do smooth things over. Assuming n is the total number of data points, there are common rules of thumb for choosing k.

Advantages of the KNN algorithm are:

Disadvantages of the KNN algorithm are:
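As a rough illustration of the classification rule described above, the following sketch (plain NumPy; the function and variable names are only illustrative and are not taken from the book's linked code pages) classifies a new point by a majority vote among its k nearest stored examples:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest neighbours
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: two classes in 2D
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> 0

With k = 3, the new point [1.1, 1.0] has two class-0 examples among its three nearest neighbours, so the vote returns class 0.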
As an example, the application of the KNN procedure (for two classes) proceeds as follows. The data set is {x^(1), x^(2), ..., x^(m)}.

ii) Pick two points as centroids, one for each class, by a best guess (code). Note that for k classes we need to choose k initial cluster centroids (where k is the number of clusters we want to identify): μ_1, μ_2, ..., μ_k ∈ ℝ^n. For 2D data it is easy to pick the values of the k initial cluster centroids, but for high-dimensional data it is more complicated, so we normally pick k of the m examples as the initial cluster centroids.

iii) Assign each data point to the nearest cluster centroid, i.e. colour the dots according to the closer cluster centroid. In the case above we have two clusters (green and red), so each point is assigned to the green or red cluster based on proximity (code). This step assigns each data point x^(i) to the cluster whose centroid is the closest, where c^(i) is the cluster index. Mathematically, the step is defined as:

     c^(i) := argmin_j ||x^(i) − μ_j||² --------------------------- [4707a]

where c^(i) is the index of the cluster centroid closest to x^(i), and μ_j is the j-th cluster centroid. A minimal sketch of this assignment step is given below.
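The assignment step of Equation [4707a] could look as follows, assuming the data points are the rows of a NumPy array X and the current centroids are the rows of mu (the names are illustrative only):

import numpy as np

def assign_clusters(X, mu):
    # Squared Euclidean distance from every point to every centroid, shape (m, k)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # c[i] is the index j of the centroid closest to X[i]  (Equation [4707a])
    return d2.argmin(axis=1)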
The centroid μ_j for cluster j is calculated as the mean of all the data points assigned to that cluster:

     μ_j := (1/|C_j|) Σ_{i ∈ C_j} x^(i) --------------------------- [4707b]

where C_j is the set of indices of the data points currently assigned to cluster j, and |C_j| is the number of points in that cluster. A sketch of this centroid-update step is given below.
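Correspondingly, a sketch of the centroid update of Equation [4707b], with the same illustrative names:

import numpy as np

def update_centroids(X, c, k):
    # Each centroid becomes the mean of the points assigned to it (Equation [4707b]);
    # an empty cluster would need special handling (e.g. re-seeding), not shown here.
    return np.array([X[c == j].mean(axis=0) for j in range(k)])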
iv) Recalculate the centroids of the clusters based on the assigned data points: look at all the green and red dots, compute the averages of the green and red dots separately, and then move each centroid to the corresponding average (code).

v) Repeat: repeat the two steps above until convergence. Convergence occurs when the centroids no longer change significantly or a predefined number of iterations is reached.
  v.a) Recalculate the centroids of the clusters, and then re-colour the dots based on the closer cluster centroid.
  v.b) Recalculate the centroids and re-colour the dots again.
  v.c) Recalculate the centroids and re-colour the dots once more.

The cost function for the k-means clustering algorithm is typically the sum of squared distances between each data point and its assigned cluster centroid:

     J(c^(1), ..., c^(m), μ_1, ..., μ_k) = Σ_{i=1}^{m} ||x^(i) − μ_{c^(i)}||² --------------------------- [4707c]

where m is the number of data points, c^(i) is the index of the cluster to which x^(i) is assigned, and μ_{c^(i)} is the centroid of that cluster.
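Putting the assignment and update steps together, a hedged sketch of the whole iterative loop, using the cost J of Equation [4707c] as the stopping criterion (illustrative code, not the book's linked implementation):

import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step ii: pick k of the m examples as the initial centroids
    mu = X[rng.choice(len(X), size=k, replace=False)]
    prev_cost = np.inf
    for _ in range(max_iters):
        # Step iii (Eq. [4707a]): assign each point to its nearest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # Step iv (Eq. [4707b]): move each centroid to the mean of its assigned points
        mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                       for j in range(k)])
        # Cost (Eq. [4707c]): sum of squared distances to the assigned centroids
        cost = d2[np.arange(len(X)), c].sum()
        # Step v: stop when the cost no longer decreases significantly
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    return c, mu, cost

Because both steps can only keep the cost the same or decrease it, J decreases monotonically, which is why monitoring it is a convenient convergence check.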
The cost function measures the overall distortion, or spread, of the clusters. The goal of the k-means algorithm is to minimize this cost function by iteratively adjusting the cluster assignments and the centroids; each iteration updates the assignments and the centroids so as to reduce this sum of squared distances. When the cost function reaches its lowest value (although it generally cannot be zero), the algorithm has converged.

Determining the appropriate number of clusters, often denoted k, is one of the most frequently asked and challenging questions in applying clustering algorithms, including k-means. The choice of k is crucial because it directly affects the structure and interpretation of the clusters. Several methods are commonly used to determine the optimal k, and this remains an active area of research in data clustering:
Figure 4707g shows the Elbow Method for finding the optimal k. The idea is to plot the cost (or inertia) against different values of k and look for an "elbow" in the plot, where the rate of decrease in cost starts to slow down. Looking at the plot, the optimal k appears to be 4: up to k = 4 there is a significant reduction in inertia, and beyond k = 4 the reduction is not as substantial.

Figure 4707g. Elbow Method for the optimal k for K-nearest neighbours.

Choosing k is often a subjective decision and involves a trade-off between computational efficiency and the desire for meaningful, interpretable clusters. It is important to consider the characteristics of the data and the goals of the analysis when making this decision.
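A sketch of how such an elbow plot can be produced, here with scikit-learn's KMeans and synthetic placeholder data (the blob positions and the range of candidate k values are arbitrary choices for illustration):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Placeholder data: four well-separated blobs in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (4, 0), (0, 4), (4, 4)]])

ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the closest centroid

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('inertia (cost)')
plt.title('Elbow method for choosing k')
plt.show()

For data drawn from four blobs like these, the curve typically bends sharply around k = 4, matching the reading of Figure 4707g.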
===================================================
Machine learning: KNN algorithm: code:
===================================================
Machine learning: KNN algorithm (version 3 -- more functions are added): code:
===================================================
Find the nearest neighbour (number) by using a K-dimensional tree: code: