Tuesday, October 13, 2020

K-Means Clustering (unsupervised, clustering)



"""
Code from "COGNITIVE CLASS.ai - K-Means Clustering", author: Saeed Aghabozorgi

K-means is a clustering method for unlabeled data.
The k-means algorithm is not directly applicable to categorical variables, because Euclidean distance is not meaningful for discrete categories.
It partitions the data points into k mutually exclusive groups, for example 3 clusters.

The algorithm first picks initial centroids at random, one as the starting point for each cluster, and then alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, until the centroid positions stop changing.
"""

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

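##### Step1: Data & Preprocessing #####
df = pd.read_csv("data.csv")
df = df.drop('X_categorical', axis=1) # drop the categorical variable (Euclidean distance is not meaningful for it)
X = df.values[:, 1:] # keep every column except the first
X = np.nan_to_num(X) # replace NaN with 0 so the scaler and KMeans do not fail on missing values
Clus_dataSet = StandardScaler().fit_transform(X) # standardize each feature to zero mean and unit variance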

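##### Step2: Modeling #####
clusterNum = 3
# n_init = 12 runs the algorithm 12 times with different centroid seeds and keeps the best run
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(Clus_dataSet) # fit on the standardized features so no single feature dominates the distance
labels = k_means.labels_ # cluster index (0 to clusterNum - 1) for each row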

##### Step3: Insights #####
df["Clus_km"] = labels # assign the cluster label to each row in the dataframe
print(df.groupby("Clus_km").mean()) # summarize each cluster by averaging its features in the original units
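
The notes at the top describe the loop that KMeans runs internally: pick starting centroids, then repeatedly reassign points and move the centroids. As a rough illustration only, here is a minimal NumPy sketch of that loop. The function name kmeans_sketch, its parameters, and the toy data are hypothetical and not part of the original course code. It uses plain random initialization rather than the "k-means++" seeding used above, so it can settle on a poorer solution than scikit-learn would.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative Lloyd-style k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids (plain random init).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points;
        # a cluster that received no points keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Final assignment against the centroids that are returned.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Synthetic toy data for illustration: three 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])
centroids, labels = kmeans_sketch(toy, k=3)
```

Because the result depends on the random starting centroids, scikit-learn's n_init option reruns the algorithm several times and keeps the best run; this sketch performs a single run.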