Hopkins statistic

The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) measures the cluster tendency of a data set.^[1] It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test in which the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.^[2] With the definition given below, if the individuals are aggregated the value approaches 1, and if they are randomly distributed the value tends to 0.5.^[3]

Preliminaries

A typical formulation of the Hopkins statistic follows.^[2]

Let $X$ be the set of $n$ data points.

Generate a random sample $\tilde{X}$ of $m \ll n$ data points sampled without replacement from $X$.

Generate a set $Y$ of $m$ uniformly randomly distributed data points.

Define two distance measures:

$u_{i}$, the minimum distance (given some suitable metric) of $y_{i} \in Y$ to its nearest neighbour in $X$, and

$w_{i}$, the minimum distance of $\tilde{x}_{i} \in \tilde{X} \subseteq X$ to its nearest neighbour $x_{j} \in X$, $\tilde{x}_{i} \neq x_{j}$.
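The sampling steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation. The function name `hopkins_distances` is hypothetical, the metric is assumed to be Euclidean, and the uniform points $Y$ are assumed to be drawn from the axis-aligned bounding box of $X$, a choice the formulation above does not prescribe.

```python
import math
import random


def hopkins_distances(X, m, rng=None):
    """Return (u, w) nearest-neighbour distance lists for data X.

    X is a list of equal-length coordinate tuples; m is the sample size.
    """
    rng = rng or random.Random()
    n, d = len(X), len(X[0])
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")

    # Assumption: the uniform points Y are drawn from the bounding box of X.
    lo = [min(p[k] for p in X) for k in range(d)]
    hi = [max(p[k] for p in X) for k in range(d)]
    Y = [tuple(rng.uniform(lo[k], hi[k]) for k in range(d)) for _ in range(m)]

    # Sample m indices without replacement to form the subset of X.
    sample_idx = rng.sample(range(n), m)

    # u_i: Euclidean distance from each uniform point y_i to its nearest point in X.
    u = [min(math.dist(y, x) for x in X) for y in Y]

    # w_i: distance from each sampled point to its nearest *other* point in X
    # (the point itself is excluded by index).
    w = [
        min(math.dist(X[i], X[j]) for j in range(n) if j != i)
        for i in sample_idx
    ]
    return u, w
```

The brute-force nearest-neighbour search costs on the order of $m \cdot n$ distance evaluations, which is acceptable for small data sets; a spatial index would be the usual replacement for larger ones.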

Definition

With the above notation, if the data are $d$-dimensional, then the Hopkins statistic is defined as:^[4]

$H={\frac {\sum _{i=1}^{m}{u_{i}^{d}}}{\sum _{i=1}^{m}{u_{i}^{d}}+\sum _{i=1}^{m}{w_{i}^{d}}}}\,$

Under the null hypothesis, this statistic has a $\mathrm{Beta}(m,m)$ distribution.
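As a rough illustration of the definition, the following Python sketch computes $H$ from two lists of distances and approximates the $\mathrm{Beta}(m,m)$ cumulative distribution function by numerical integration. The function names `hopkins_statistic` and `beta_mm_cdf` are hypothetical, and the midpoint rule is only one of several ways to evaluate the Beta distribution; a statistics library would normally be used instead.

```python
import math


def hopkins_statistic(u, w, d):
    """H = sum(u_i^d) / (sum(u_i^d) + sum(w_i^d)) for equal-length u and w."""
    if len(u) != len(w) or not u:
        raise ValueError("u and w must be non-empty and of equal length")
    su = sum(x ** d for x in u)
    sw = sum(x ** d for x in w)
    if su + sw == 0:
        raise ValueError("all distances are zero; H is undefined")
    return su / (su + sw)


def beta_mm_cdf(h, m, steps=20000):
    """Approximate P(H <= h) for H ~ Beta(m, m) with the midpoint rule."""
    if h <= 0.0:
        return 0.0
    if h >= 1.0:
        return 1.0
    # Log of the normalising constant 1 / B(m, m).
    log_norm = math.lgamma(2 * m) - 2 * math.lgamma(m)
    dx = h / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx  # midpoint of the k-th sub-interval of (0, h)
        total += math.exp(log_norm + (m - 1) * (math.log(x) + math.log1p(-x)))
    return total * dx
```

With this definition, `1 - beta_mm_cdf(H, m)` gives an approximate upper-tail probability of observing a value at least as large as $H$ under the null hypothesis, so a small result suggests clustering.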

Notes and references

[1] Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. 18 (2). Annals Botany Co: 213–227. doi:10.1093/oxfordjournals.aob.a083391.
[2] Banerjee, A. (2004). "Validating clusters using the Hopkins statistic". 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542). Vol. 1. pp. 149–153. doi:10.1109/FUZZY.2004.1375706. ISBN 0-7803-8353-2. S2CID 36701919.
[3] Aggarwal, Charu C. (2015). Data Mining. Cham: Springer International Publishing. p. 158. doi:10.1007/978-3-319-14142-8. ISBN 978-3-319-14141-1. S2CID 13595565.
[4] Cross, G.R.; Jain, A.K. (1982). "Measurement of clustering tendency". Theory and Application of Digital Control: 315–320. doi:10.1016/B978-0-08-027618-2.50054-1.

External links

http://www.sthda.com/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning
