P.Mean: Strange results in KMeans clustering (created 2011-12-31).

Understanding the logic of cluster analysis (KMeans): 1. After considering the data set, we select purely random initial centroids and the number of clusters required. 2. Using the Euclidean distance, we assign the points to different clusters. If it selects random points in Step 1, then how does it return the same result every time? Is there some other treatment or logic that is used to select the initial centroids and segregate the points uniformly? I am unable to comprehend the initial centroid selection part of cluster analysis and would appreciate your help.

There are two possible answers. First, random numbers on a computer are (almost always) generated using a seed value. If you use the same seed, you get the same sequence of random numbers. That allows you to re-run a simulation under the exact same conditions without having to store the entire sequence of random numbers. That's all well and good, but in some software programs, the seed is not automatically changed each time you run the program, so you get the same "random" centroids on every run until you change the seed manually. How you do this is software dependent. Read your manual for details.

Second, it may be that the random numbers change, but you have a fairly stable set of clusters and all initial values converge on the same solution. This is good news, as it provides evidence that your solution is the best solution.

If you found, unhappily, that two different sets of random centroids led to two different solutions, then you could choose the solution that better meets some optimality criterion, but you would not know whether a third set of random centroids might lead to a third clustering that is better than the two you observed. Your only option is to keep trying new random centroids and comparing the results. There is no guarantee in KMeans clustering (or in many other iterative statistical methods) that you will converge on the best solution. Most of the time it works pretty well, but sometimes it doesn't.
How hard you try to establish that your clustering is the best clustering depends on how much time and energy you have and how critical the project is.
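
Here is a minimal sketch in Python of both points: a fixed seed reproduces the same clustering, and several different seeds can be compared against a criterion. It is an illustration, not any particular package's method. The function names `sq_dist` and `kmeans`, the made-up data, the seed values, and the choice of the within-cluster sum of squares as the criterion are all assumptions for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed, max_iter=100):
    """Basic k-means; the initial centroids are k data points chosen at random."""
    rng = random.Random(seed)              # same seed -> same initial centroids
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:     # nothing moved, so the algorithm has converged
            break
        centroids = new_centroids
    # Within-cluster sum of squares: smaller means tighter clusters.
    wss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, wss

# Made-up data: three groups of 30 points scattered around three centers.
data_rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(30)]

# First answer: the same seed gives exactly the same result.
run_a = kmeans(points, 3, seed=42)
run_b = kmeans(points, 3, seed=42)
print("same seed, same result:", run_a == run_b)

# Second answer: try many seeds and keep the best result seen so far.
results = [kmeans(points, 3, seed=s) for s in range(20)]
best_centroids, best_wss = min(results, key=lambda r: r[1])
distinct = {round(w, 6) for _, w in results}
print("distinct solutions found:", len(distinct))
print("best within-cluster sum of squares:", best_wss)
```

If `distinct` contains a single value, every starting point converged to the same solution, which is the reassuring case. If it contains several values, the best of the 20 runs is only the best one found so far, and trying more seeds might still turn up a better one.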

This page was written by Steve Simon and is licensed under the Creative Commons Attribution 3.0 United States License.