Data clustering methods are used to solve data mining tasks in many areas of human activity. The reason is that we often need to group a large data array under complete uncertainty and to identify the distinctive features of each group in order to classify other individual objects.
Crisp clustering methods such as k-means [1] and Hard C-Means [2] solve these tasks successfully, but only when the data are clearly separated from each other. Real data, however, have a mixed nature, so fuzzy clustering methods are more suitable for such tasks. Many algorithms of this branch use various metrics to evaluate the distances between objects. Such an approach, however, does not take into account the relationships between object attributes. This degrades the quality of the final result, because in real situations object attributes cannot be completely independent of one another.
This paper therefore introduces a fuzzy clustering algorithm that is based on the principles of existing fuzzy clustering methods but uses inter-attributive correlation to increase the quality of the final result.
The problem
Let there be a set X containing D objects, X = {x(1), x(2), …, x(D)}. Each object is described by a list of N numerical attributes characterizing its state, so an object can be represented as a point in an N-dimensional Euclidean space. It is necessary to divide the set X into a finite number of clusters C and to find the coordinates of their centroids.
The Fuzzy C-Means data clustering algorithm [2; 3; 4] is taken as the basis for solving the formulated problem. Inter-attributive correlation is proposed for evaluating the distances between objects and cluster centroids.
It is also necessary to evaluate the qualitative and quantitative characteristics of the clustering algorithm with and without the use of inter-attributive correlation.
The algorithm description
Clusters are described by the fuzzy partition matrix U:
U = [\mu_{ki}], \quad k = 1, \dots, D, \quad i = 1, \dots, C,  (1)
where μki is the membership degree of the k-th object in the i-th cluster.
The following condition must be satisfied:
\sum_{i=1}^{C} \mu_{ki} = 1, \quad k = 1, \dots, D.  (2)
At the beginning, the following initial parameters are set:
the number of clusters C;
the weighting exponent w;
the termination criterion ε.
According to [4], the weighting exponent is usually set equal to 2, since no rule for choosing it exists.
Stage 1. The fuzzy partition matrix U is initialized with random real values from 0 to 1, subject to condition (2).
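For illustration, stage 1 could be sketched in Python as follows (the function name init_partition and the use of NumPy are our own assumptions, not part of the original description):

```python
import numpy as np

def init_partition(D, C, seed=None):
    """Randomly initialize a D x C fuzzy partition matrix U.

    Each row is normalized so that the memberships of one object
    over all clusters sum to one, as required by condition (2).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((D, C))                    # random values in [0, 1)
    return U / U.sum(axis=1, keepdims=True)   # enforce condition (2)
```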
Stage 2. The centroid coordinates are calculated by the following formula:
c^{(i)} = \frac{\sum_{k=1}^{D} \mu_{ki}^{w} \, x^{(k)}}{\sum_{k=1}^{D} \mu_{ki}^{w}},  (3)
where c(i) is the center of the i-th cluster and x(k) is the attribute vector of the k-th object.
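A minimal NumPy sketch of stage 2, assuming the data set is stored as a D x N array X and the partition matrix as a D x C array U (the function name update_centroids is ours):

```python
import numpy as np

def update_centroids(X, U, w=2.0):
    """Compute the C x N matrix of cluster centroids by formula (3).

    X : (D, N) array of objects, U : (D, C) fuzzy partition matrix,
    w : weighting exponent.
    """
    Uw = U ** w                                   # mu_ki raised to the power w
    return (Uw.T @ X) / Uw.sum(axis=0)[:, None]   # weighted mean per cluster
```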
Stage 3. The squared distances from objects to centroids are calculated by the following formula:
d^{2}\left(x^{(k)}, c^{(i)}\right) = \left(x^{(k)} - c^{(i)}\right)^{T} R \left(x^{(k)} - c^{(i)}\right),  (4)
where R is the correlation matrix of the object attributes.
The elements of the correlation matrix are calculated as Pearson's correlation coefficients:
r_{pq} = \frac{\sum_{k=1}^{D} \left(x_{p}^{(k)} - \bar{x}_{p}\right)\left(x_{q}^{(k)} - \bar{x}_{q}\right)}{\sqrt{\sum_{k=1}^{D} \left(x_{p}^{(k)} - \bar{x}_{p}\right)^{2} \sum_{k=1}^{D} \left(x_{q}^{(k)} - \bar{x}_{q}\right)^{2}}},  (5)
where the lower indices p and q are attribute numbers and the bar denotes the mean value of an attribute over all objects.
Note that in the classical Fuzzy C-Means algorithm [2] the identity matrix is used in (4) instead of R, so the squared distances are calculated by the Euclidean metric.
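Stage 3 could be sketched as follows (function names are our own; np.corrcoef yields the Pearson correlation matrix of the attributes, and passing the identity matrix instead of R reproduces the squared Euclidean distances of the classical algorithm):

```python
import numpy as np

def attribute_correlation(X):
    """Pearson correlation matrix R of the N attributes, formula (5)."""
    return np.corrcoef(X, rowvar=False)                 # shape (N, N)

def squared_distances(X, centroids, R):
    """Squared distances (x(k) - c(i))^T R (x(k) - c(i)), formula (4).

    Returns a (D, C) array: row k holds the squared distances of
    object k to each of the C centroids.
    """
    diff = X[:, None, :] - centroids[None, :, :]         # shape (D, C, N)
    return np.einsum('dcn,nm,dcm->dc', diff, R, diff)    # quadratic form
```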
Stage 4. The new membership degrees are recalculated by the following formula:
\mu_{ki}^{*} = \left[ \sum_{j=1}^{C} \left( \frac{d^{2}\left(x^{(k)}, c^{(i)}\right)}{d^{2}\left(x^{(k)}, c^{(j)}\right)} \right)^{\frac{1}{w-1}} \right]^{-1}.  (6)
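In code, the update of stage 4 could be sketched as a direct transcription of formula (6), assuming the (D, C) array d2 of squared distances computed at stage 3:

```python
import numpy as np

def update_memberships(d2, w=2.0):
    """Recompute the fuzzy partition matrix U* by formula (6).

    d2 : (D, C) array of squared object-to-centroid distances.
    Assumes no object coincides exactly with a centroid (d2 > 0).
    """
    ratio = d2[:, :, None] / d2[:, None, :]    # d2_ki / d2_kj, shape (D, C, C)
    return 1.0 / (ratio ** (1.0 / (w - 1.0))).sum(axis=2)
```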
Stage 5. The difference between the matrices U and U* is compared with ε using the Chebyshev criterion:
\max_{k,\, i} \left| \mu_{ki}^{*} - \mu_{ki} \right| < \varepsilon.  (7)
If condition (7) is not satisfied, the algorithm returns to stage 2 with U replaced by U*. Otherwise the algorithm terminates and returns the result.
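Putting the stages together, the whole iteration could be sketched as below, reusing the helper functions from the previous sketches; the max_iter safeguard is our own addition and is not part of the described algorithm:

```python
import numpy as np

def fuzzy_c_means(X, C, w=2.0, eps=1e-3, use_correlation=True,
                  max_iter=100000, seed=None):
    """Fuzzy C-Means, optionally with inter-attributive correlation (C-FCM)."""
    D, N = X.shape
    R = attribute_correlation(X) if use_correlation else np.eye(N)
    U = init_partition(D, C, seed)                    # stage 1
    for iteration in range(1, max_iter + 1):
        centroids = update_centroids(X, U, w)         # stage 2
        d2 = squared_distances(X, centroids, R)       # stage 3
        U_new = update_memberships(d2, w)             # stage 4
        if np.max(np.abs(U_new - U)) < eps:           # stage 5, criterion (7)
            return U_new, centroids, iteration
        U = U_new
    return U, centroids, max_iter
```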
Experiments description
The data set contains 300 objects described by 7 attributes. Each attribute has an integer value from 0 to 12. The set has to be divided into 4 clusters. The weighting exponent is set equal to 2 and the termination criterion to 0.001, according to [4].
In addition, each object has an expert assessment of its state from 0 to 3, which is used to check the accuracy of the algorithms. After data processing, this value is compared with the assigned cluster number; an error is registered if they do not match. The quality level of an algorithm is calculated by the following formula:
Q = \frac{D - E}{D} \cdot 100\%,  (8)
where E is the number of errors and D is the cardinality of the set.
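Under the assumption that the expert assessments are stored in an integer array expected and that the cluster numbering already matches the assessment values (the matching itself is not described here), the quality measure could be computed as:

```python
import numpy as np

def clustering_quality(U, expected):
    """Quality by formula (8): the share of objects whose assigned
    cluster matches the expert assessment, in percent.

    U        : (D, C) final fuzzy partition matrix,
    expected : (D,) integer expert assessments in the range 0..C-1.
    """
    assigned = U.argmax(axis=1)                  # harden memberships to labels
    errors = int(np.sum(assigned != expected))   # E in formula (8)
    return (len(expected) - errors) / len(expected) * 100.0, errors
```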
We also measure the clustering execution time and the number of iterations.
The experiments are carried out for the classical Fuzzy C-Means algorithm and for the introduced algorithm with inter-attributive correlation, on sets of 50, 100, 150, 200, 250 and 300 multidimensional objects.
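For example, a run over both algorithm variants and all set sizes could be sketched as follows (reusing fuzzy_c_means and clustering_quality from the earlier sketches; X and expected are assumed to be loaded from the experimental data, and taking prefixes of the full set is our own simplification):

```python
for size in (50, 100, 150, 200, 250, 300):
    for use_corr in (False, True):
        U, centroids, iters = fuzzy_c_means(X[:size], C=4, w=2.0, eps=1e-3,
                                            use_correlation=use_corr)
        quality, errors = clustering_quality(U, expected[:size])
        name = 'C-FCM' if use_corr else 'FCM'
        print(f'{name}, {size} objects: quality {quality:.2f}%, '
              f'{errors} errors, {iters} iterations')
```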
Results and conclusion
The results of the experiments are shown in Table 1.
Table 1 – Results of the experiments for the classical Fuzzy C-Means (FCM) and Fuzzy C-Means using inter-attributive correlation (C-FCM); the columns correspond to the number of objects in the set
Characteristic | 50 | 100 | 150 | 200 | 250 | 300
FCM
Quality | 48% | 40% | 35.33% | 34% | 33.2% | 33.33%
Errors | 26 | 60 | 97 | 132 | 167 | 200
Iterations | 13 | 13 | 13 | 13 | 13 | 13
Total time, ms | 4.92 | 8.02 | 12.02 | 16.04 | 19.97 | 23.96
C-FCM
Quality | 92% | 88% | 72% | 88% | 88% | 84%
Errors | 4 | 12 | 42 | 24 | 30 | 48
Iterations | 739 | 1595 | 624 | 871 | 1055 | 1024
Total time, ms | 152.79 | 665.31 | 389.19 | 727.29 | 1122.44 | 1304.57
As can be seen, using inter-attributive correlation decreases the number of clustering errors and therefore increases the clustering quality level. However, dividing the set requires more time and iterations. It can also be noted that the qualitative characteristics do not depend noticeably on the size of the set.
In conclusion, the introduced algorithm is well suited for high-precision calculations, but it requires high-performance machines if it is to be used in real-time systems processing big data streams.
References
1. Fasulo D. An Analysis of Recent Work on Clustering Algorithms. Technical Report UW-CSE-01-03-02, University of Washington, 1999.
2. Barsegyan A.A., Kupriyanov M.S., Kholod I.I., Tess M.D., Elizarov S.I. Data and Process Analysis: textbook. 3rd ed., revised and expanded. St. Petersburg: BHV-Peterburg, 2009. 512 p.
3. Bezdek J.C. Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press, 1981.
4. Shtovba S.D. Introduction to the Theory of Fuzzy Sets and Fuzzy Logic. Vinnitsa: Kontinent-Prim, 2003. 198 p.