k-modes聚类算法（Python实现）

k-means 算法是一种简单且实用的聚类算法，但是传统的 k-means 算法只适用于连续属性的数据集，而对于离散属性的数据集，计算簇的均值以及点之间的欧氏距离就变得不合适了。

k-modes 作为 k-means 的一种扩展，适用于离散属性的数据集。

假设有 N 个样本，M 个属性且全是离散的，簇的个数为 k，其实现步骤为：

随机确定 k 个聚类中心 C₁，C₂，…，C_k 是长度为 M 的向量，C_i=[C¹_i，C²_i，…，C^M_i]。
对于样本 x_j（j=1，2，…，N），分别比较其与 k 个中心之间的距离（此处的距离为不同属性值的个数，假如 x₁=[1，2，1，3]，C₁=[1，2，3，4]，那么 x1 与 C1 之间的距离为 2）。
将 x_j 划分到距离最小的簇，在全部的样本都被划分完毕后，重新确定簇中心，向量 C_i 中的每一个分量都更新为簇 i 中的众数。
重复步骤 2 和 3，直到总距离（各个簇中样本与各自簇中心距离之和）不再降低，返回最后的聚类结果。

【实例】利用 k-modes 算法对给定的数据进行聚类。

import numpy as np

if __name__ == '__main__':
    # 导入数据
    datas = np.array([[1,2,3,4],
                    [1,6,8,8],
                    [1,2,3,3],
                    [2,4,5,5],
                    [4,7,8,7],
                    [7,6,8,9],
                    [4,4,3,3],
                    [2,2,5,5],
                    [7,5,5,5],
                    [5,6,8,9]])

    # 选取聚类中心
    centers = np.array([datas[1],datas[6],datas[2]])
    # 初始化各类中到中心的距离总和,但是由于距离总和不减少时,停止聚类,所以设置一个尽量大的数,以免影响过程
    distance1 = 1000000
    distance2 = 1000000
    distance3 = 1000000  # 记录聚类次数
    n = 0
    # 开始聚类循环
    while(True):
        n+=1
        # 使用 numpy 的 array 作为两个类的容器
        cluster1 = np.array([centers[0]])
        cluster2 = np.array([centers[1]])
        cluster3 = np.array([centers[2]])
        # 使用 list 存放聚类中每个点到聚类中心的汉明距离
        hamming1 = []
        hamming2 = []
        hamming3 = []
        # 遍历所有数据
        for i in datas:
            # 用于记录汉明距离
            n1=0
            n2=0
            n3=0
            # 循环每个位置,但有位置与聚类中心有相同属性时,n1/n2 加 1
            for j in range(4):
                if(i[j]!=centers[0][j]):
                    n1+=1
                if(i[j]!=centers[1][j]):
                    n2+=1
                if(i[j]!=centers[2][j]):
                    n3+=1
            # 将每个汉明距离存储到 list
            hamming1.append(n1)
            hamming2.append(n2)
            hamming3.append(n3)
        # 将每个数据添加到其对应的类里面
        if(n1==0 and n2==0 and n3!=0):
            cluster1 = np.vstack([cluster1,i])
        if(n1<n2 and n2<n3):
            cluster2 = np.vstack([cluster2,i])
        if(n3<n2 and n3<n1):
            cluster3 = np.vstack([cluster3,i])
        # 将两个类别转置,方便求每个属性的众数。因为 NumPy 中没有直接求每一列众数的函数,所以这样操作后使用 list 的求众数操作
        cluster1_t = np.transpose(cluster1)
        cluster2_t = np.transpose(cluster2)
        cluster3_t = np.transpose(cluster3)
        list1=[]
        list2=[]
        list3=[]
        # 将每一行的众数分别存储到两个列表中,作为下一次聚类的新中心。因为转置了,所以现在一行的众数,就是之前列的众数
        for i in range(4):
            counts1 = np.bincount(cluster1_t[i])
            list1.append(np.argmax(counts1))
            counts2 = np.bincount(cluster2_t[i])
            list2.append(np.argmax(counts2))
            counts3 = np.bincount(cluster3_t[i])
            list3.append(np.argmax(counts3))
        # print(f"第{n}次聚类 新中心{list1}")
        # 将新中心作为下一次聚类的中心
        centers = np.array([list1,list2,list3])
        # 判断汉明距离和上一次聚类的差别,如果总的汉明距离没有减小,那么聚类结束
        if(sum(hamming1)>=distance1 and sum(hamming2) >= distance2 and sum(hamming3)>=distance3):
            print(f"聚类{n}次后结束")
            break;
        else:
            distance1 = sum(hamming1)
            distance2 = sum(hamming2)
            distance3 = sum(hamming3)

    print(f"第一类\n{cluster1}")
    print(f"中心 {centers[0]}")
    print(f"第二类\n{cluster2}")
    print(f"中心 {centers[1]}")
    print(f"第三类\n{cluster3}")
    print(f"中心 {centers[2]}")

运行程序，输出如下：

聚类3次后结束
第一类
[[1 6 8 9]
[1 6 8 8]
[4 7 8 7]
[7 6 8 9]
[5 6 8 9]]
中心[1 6 8 9]
第二类
[[2 4 3 3]
[2 4 5 5]
[4 4 3 3]]
中心[2 4 3 3]
第三类
[[1 2 3 3]
[1 2 3 4]]
中心[1 2 3 3]

k-modes聚类算法（Python实现）

相关文章