# [Pattern Recognition Assignment] K-means Clustering + MATLAB Implementation + UCI Iris and Seeds Data Sets + Classification

## 1.Introduction

In this assignment, I implemented and tested a k-means clustering method in MATLAB to solve three-class classification tasks on two data sets. The data sets, Iris and Seeds, are downloaded from the UCI Machine Learning Repository.

Each program consists of one .m file that assigns every sample to a class. Since each data set contains three classes, the number of clusters is fixed at k = 3. Because the clustering result depends strongly on the choice of the initial cluster centers, and to keep the computational cost manageable, candidate initial centers are drawn by taking every 5th sample from each class block, and each such triple of samples is tried as an initial center set. The triple that yields the highest classification accuracy is kept as the initialization.
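The initialization search described above can be sketched compactly. The following is a runnable Python illustration (the assignment itself is in MATLAB) on synthetic 2-D data that mimics the 3×50 class-block layout of Iris; the data, dimensions, and helper names here are stand-ins, not the assignment's real values:

```python
import random

random.seed(0)
# 150 synthetic 2-D points: three well-separated blocks of 50
# (stand-ins for the three Iris classes; the real features are 4-D)
points = [(random.gauss(mu, 0.3), random.gauss(mu, 0.3))
          for mu in (0.0, 3.0, 6.0) for _ in range(50)]
truth = [c for c in range(3) for _ in range(50)]

def dist2(a, b):
    # squared Euclidean distance, as in the MATLAB code
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(pts, centers, max_iter=1000):
    """Lloyd's algorithm with fixed initial centers; returns final labels."""
    for _ in range(max_iter):
        labels = [min(range(3), key=lambda c: dist2(p, centers[c]))
                  for p in pts]
        new = []
        for c in range(3):
            members = [p for p, l in zip(pts, labels) if l == c]
            new.append(tuple(sum(x) / len(members) for x in zip(*members))
                       if members else centers[c])
        if new == centers:          # centers unchanged -> converged
            return labels
        centers = new
    return labels

# Try every 5th sample of each block as a candidate initial center and
# keep the triple with the best accuracy, as described above.
best_acc, best_init = 0.0, None
for i in range(0, 50, 5):
    for j in range(50, 100, 5):
        for k in range(100, 150, 5):
            labels = kmeans(points, [points[i], points[j], points[k]])
            acc = sum(l == t for l, t in zip(labels, truth)) / len(truth)
            if acc >= best_acc:
                best_acc, best_init = acc, (i, j, k)
```

On well-separated synthetic blocks every initialization converges to nearly the same partition; on real Iris data the spread across initializations is what the search exploits.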

Experimental results show that the k-means clustering model does not perform well on these three-class problems. In my view this is because the clustering process depends heavily on the chosen number of clusters k and on the initial cluster centers; the processing is not fast, and the accuracy varies with the initialization. The results on the Seeds data set are similar to those on Iris, even though Seeds has 7 features.

## 2.The characteristics of the data sets

2.1 Iris
The data set contains 150 samples. Every sample has 4 attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The three classes are Iris Setosa, Iris Versicolour and Iris Virginica. This is one of the most popular data sets in the UCI repository; its clear structure and plentiful samples make it suitable for this assignment.

2.2 Seeds
The data set contains 210 samples. Every sample has 7 attributes: area A, perimeter P, compactness C = 4πA/P², length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. All of these attributes are real-valued and continuous. The data set comprises kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian. It is often used for classification and cluster-analysis tasks.
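The compactness feature C = 4πA/P² can be sanity-checked numerically: a perfect circle scores exactly 1, and less compact shapes score below 1. A small Python check (for illustration only, not part of the assignment code):

```python
import math

def compactness(area, perimeter):
    # C = 4*pi*A / P^2, as defined for the Seeds features
    return 4 * math.pi * area / perimeter ** 2

r = 2.0
# circle of radius r: A = pi*r^2, P = 2*pi*r  ->  C = 1
print(compactness(math.pi * r ** 2, 2 * math.pi * r))   # ~ 1.0
# unit square: A = 1, P = 4  ->  C = pi/4 ~ 0.785
print(compactness(1.0, 4.0))
```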

## 3.Data preprocessing

Taking Iris as an example, the first step is to convert the character labels in the data set into numeric codes. Read the data file with MATLAB, extract all attribute values into an array, and append a numeric label to each sample: Iris-setosa is marked as 1, Iris-versicolor as 2, and Iris-virginica as 3. Then copy the processed data set for the subsequent clustering, and write it to a txt file so it can be saved and reused.
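The same preprocessing step can be sketched in Python (the assignment does it in MATLAB; the three lines below are a stand-in for the full iris.data file):

```python
# Three sample lines standing in for the full iris.data file
sample = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica"""

# Species name -> numeric label, as in the assignment
label_map = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

rows = []
for line in sample.splitlines():
    *attrs, name = line.strip().split(",")      # 4 attributes + class name
    rows.append([float(a) for a in attrs] + [label_map[name]])

# Serialize back to comma-separated text, ready to be saved as a txt file
out = "\n".join(",".join("%g" % v for v in row) for row in rows)
```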

## 4.The code

%This program is based on the data set Iris
%The original data set contains three kinds of flowers
%Here, the Kmeans method is processed to do the three-category problem
%The flower name in the data set is changed to a numeric symbol of 1 2 3

clear;
clc;

%%%%%%%%%%%%%%Data preprocessing section%%%%%%%%%%%%%%%%
f=fopen('iris.data');%Open dataset file
data=textscan(f,'%f%f%f%f%s','Delimiter',',');%Read the 4 numeric attributes and the class name

D=[];% Used to store attribute values
for i=1:length(data)-1
D=[D data{1,i}];
end
fclose(f);

label=data{1,length(data)};% Class-name strings
n1=0;n2=0;n3=0;
% Find the index of each type of data
for j=1:length(label)
    if strcmp(label{j,1},'Iris-setosa')
        n1=n1+1;
        index_1(n1)=j;% Record the index belonging to the "Iris-setosa" class
    elseif strcmp(label{j,1},'Iris-versicolor')
        n2=n2+1;
        index_2(n2)=j;% Record the index belonging to the "Iris-versicolor" class
    elseif strcmp(label{j,1},'Iris-virginica')
        n3=n3+1;
        index_3(n3)=j;% Record the index belonging to the "Iris-virginica" class
    end
end

% Retrieve each type of data according to the index
class_1=D(index_1,:);
class_2=D(index_2,:);
class_3=D(index_3,:);
Attributes=[class_1;class_2;class_3];

%Iris-setosa is marked as 1; Iris-versicolor is marked as 2
%Iris-virginica is marked as 3
I=[1*ones(n1,1);2*ones(n2,1);3*ones(n3,1)];
Iris=[Attributes I];% Change the name of the flower to a number tag

save Iris.mat Iris % Save all data as a mat file

%Save all data as a txt file
f=fopen('iris1.txt','w');
[m,n]=size(Iris);
for i=1:m
    for j=1:n
        if j==n
            fprintf(f,'%g \n',Iris(i,j));
        else
            fprintf(f,'%g,',Iris(i,j));
        end
    end
end
fclose(f);

%Directly take the number of categories k = 3
Iris_test=Iris;
[m,n]=size(Iris_test);
acc_rateqian=0;

%Try every 5th sample in each class block as a candidate initial center;
%the triple of sample points giving the best accuracy is kept as the
%initial cluster centers
for bianli1=1:5:50 % candidates from class 1 (rows 1-50)
for bianli2=51:5:100 % candidates from class 2 (rows 51-100)
for bianli3=101:5:150 % candidates from class 3 (rows 101-150)

u1=Iris(bianli1,:);
u2=Iris(bianli2,:);
u3=Iris(bianli3,:);

u1_qian=u1(1:4);
u2_qian=u2(1:4);
u3_qian=u3(1:4);

diedai=0;

while 1

%Divide each sample into different classes
for i=1:m
    d1=0;
    d2=0;
    d3=0;
    for j=1:n-1
        d1=d1+(Iris_test(i,j)-u1(1,j))^2;
        d2=d2+(Iris_test(i,j)-u2(1,j))^2;
        d3=d3+(Iris_test(i,j)-u3(1,j))^2;
    end
    %Determine which sample center is closest to each point
    if (d1<=d2)&&(d1<=d3)
        Iris_test(i,n)=1;
    elseif (d2<=d1)&&(d2<=d3)
        Iris_test(i,n)=2;
    else
        Iris_test(i,n)=3;
    end
end

%Save the results after the first cluster to show the difference
if diedai==0
    f_first=fopen('iris_first.txt','w');
    [m_first,n_first]=size(Iris_test);
    for i=1:m_first
        for j=1:n_first
            if j==n_first
                fprintf(f_first,'%g \n',Iris_test(i,j));
            else
                fprintf(f_first,'%g,',Iris_test(i,j));
            end
        end
    end
    fclose(f_first);
end

%Update new cluster center
Iris_1=[0,0,0,0]; geshu1=0;% sum and count of samples assigned to cluster 1
Iris_2=[0,0,0,0]; geshu2=0;
Iris_3=[0,0,0,0]; geshu3=0;
for i=1:m
    if Iris_test(i,n)==1
        for j=1:n-1
            Iris_1(1,j)=Iris_1(1,j)+Iris_test(i,j);
        end
        geshu1=geshu1+1;
    end
    if Iris_test(i,n)==2
        for j=1:n-1
            Iris_2(1,j)=Iris_2(1,j)+Iris_test(i,j);
        end
        geshu2=geshu2+1;
    end
    if Iris_test(i,n)==3
        for j=1:n-1
            Iris_3(1,j)=Iris_3(1,j)+Iris_test(i,j);
        end
        geshu3=geshu3+1;
    end
end

u1=(1/geshu1)*Iris_1;
u2=(1/geshu2)*Iris_2;
u3=(1/geshu3)*Iris_3;

%If the cluster center points have not changed,
%the clustering process stops
if isequal(u1_qian,u1) && isequal(u2_qian,u2) && isequal(u3_qian,u3)
    break;
end
u1_qian=u1;
u2_qian=u2;
u3_qian=u3;

%Limit the number of cluster iterations
if diedai>1000
    break
end
diedai=diedai+1;

end

%Calculate clustering accuracy
[m_result,n_result]=size(Iris_test);
n_error=0;% avoid shadowing MATLAB's built-in function "error"
for i=1:m_result
    if Iris(i,n_result)~=Iris_test(i,n_result)
        n_error=n_error+1;
    end
end

acc_rate=(m_result-n_error)/m_result;

if acc_rate>=acc_rateqian
    acc_rateqian=acc_rate;% remember the best accuracy so far

    %Save the clustering result obtained with the best
    %initial cluster centers so far
    f_result=fopen('iris_result.txt','w');
    for i=1:m_result
        for j=1:n_result
            if j==n_result
                fprintf(f_result,'%g \n',Iris_test(i,j));
            else
                fprintf(f_result,'%g,',Iris_test(i,j));
            end
        end
    end
    fclose(f_result);

    acc_final=acc_rate;
    yangben1=bianli1;
    yangben2=bianli2;
    yangben3=bianli3;
    diedai_best=diedai;
end

end
end
end
fprintf('acc_rate is %f\n',acc_final);
fprintf('Initial cluster centers are samples %d, %d and %d\n',yangben1,yangben2,yangben3);
fprintf('Number of iterations is %d\n',diedai_best);
fprintf('The results of the clustering are saved\n');



## 5.Limitations and improvements

1. This K-means implementation can be applied to both the Seeds and Iris data sets, but its results are poor compared with a neural-network method or logistic regression. It deserves further improvement, perhaps through additional data preprocessing.
2. Only three-class tasks are considered in this assignment; applying the K-means clustering model to multi-class tasks and to large-scale data analysis should be tried in the future. Clustering may have advantages over neural-network methods in big-data analysis.
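One further improvement worth noting: the accuracy computation in Section 4 compares cluster indices to the true labels directly, which only works because each initial center is drawn from the corresponding class block. A label-permutation-invariant accuracy (a standard evaluation trick, sketched here in Python; not part of the assignment code) removes that dependence:

```python
from itertools import permutations

def clustering_accuracy(truth, clusters):
    """Best accuracy over all one-to-one mappings of cluster ids to labels."""
    ids = sorted(set(clusters))
    classes = sorted(set(truth))
    best = 0.0
    for perm in permutations(classes, len(ids)):
        mapping = dict(zip(ids, perm))          # cluster id -> class label
        acc = sum(mapping[c] == t
                  for c, t in zip(clusters, truth)) / len(truth)
        best = max(best, acc)
    return best
```

With this metric the score no longer depends on which numeric id k-means happens to assign to each cluster, so any initialization can be evaluated fairly.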