BIRCH

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets.[1] With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm.[2] An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.

Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively",[1] beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.[3]

Problem with previous methods

Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case wherein a data-set was too large to fit in main memory. As a result, there was a lot of overhead maintaining high clustering quality while minimizing the cost of additional IO (input/output) operations. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each 'clustering decision' and do not perform heuristic weighting based on the distance between these data points.

Advantages with BIRCH

It is local in that each clustering decision is made without scanning all data points and currently existing clusters. It exploits the observation that the data space is not usually uniformly occupied and not every data point is equally important. It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does not require the whole data set in advance.

Algorithm

The BIRCH algorithm takes as input a set of N data points, represented as real-valued vectors, and a desired number of clusters K. It operates in four phases, the second of which is optional.

The first phase builds a clustering feature () tree out of the data points, a height-balanced tree data structure, defined as follows:

  • Given a set of N d-dimensional data points, the clustering feature of the set is defined as the triple , where
    • is the linear sum.
    • is the square sum of data points.
  • Clustering features are organized in a CF tree, a height-balanced tree with two parameters:[clarification needed] branching factor and threshold . Each non-leaf node contains at most entries of the form , where is a pointer to its th child node and the clustering feature representing the associated subcluster. A leaf node contains at most entries each of the form . It also has two pointers prev and next which are used to chain all leaf nodes together. The tree size depends on the parameter . A node is required to fit in a page of size . and are determined by . So can be varied for performance tuning. It is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.

In the second step, the algorithm scans all the leaf entries in the initial tree to rebuild a smaller tree, while removing outliers and grouping crowded subclusters into larger ones. This step is marked optional in the original presentation of BIRCH.

In step three an existing clustering algorithm is used to cluster all leaf entries. Here an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their vectors. It also provides the flexibility of allowing the user to specify either the desired number of clusters or the desired diameter threshold for clusters. After this step a set of clusters is obtained that captures major distribution pattern in the data. However, there might exist minor and localized inaccuracies which can be handled by an optional step 4. In step 4 the centroids of the clusters produced in step 3 are used as seeds and redistribute the data points to its closest seeds to obtain a new set of clusters. Step 4 also provides us with an option of discarding outliers. That is a point which is too far from its closest seed can be treated as an outlier.

Calculations with the clustering features

Given only the clustering feature , the same measures can be calculated without the knowledge of the underlying actual values.

  • Centroid:
  • Radius:
  • Average Linkage Distance between clusters and :

In multidimensional cases the square root should be replaced with a suitable norm.

BIRCH uses the distances DO to D3 to find the nearest leaf, then the radius R or the diameter D to decide whether to absorb the data into the existing leaf or whether to add a new leaf.

Numerical issues in BIRCH clustering features

Unfortunately, there are numerical issues associated with the use of the term in BIRCH. When subtracting or similar in the other distances such as , catastrophic cancellation can occur and yield a poor precision, and which can in some cases even cause the result to be negative (and the square root then become undefined).[2] This can be resolved by using BETULA cluster features instead, which store the count , mean , and sum of squared deviations instead based on numerically more reliable online algorithms to calculate variance. For these features, a similar additivity theorem holds. When storing a vector respectively a matrix for the squared deviations, the resulting BIRCH CF-tree can also be used to accelerate Gaussian Mixture Modeling with the expectation–maximization algorithm, besides k-means clustering and hierarchical agglomerative clustering.

Instead of storing the linear sum and the sum of squares, we can instead store the mean and the squared deviation from the mean in each cluster feature ,[4] where

  • is the node weight (number of points)
  • is the node center vector (arithmetic mean, centroid)
  • is the sum of squared deviations from the mean (either a vector, or a sum to conserve memory, depending on the application)

The main difference here is that S is computed relative to the center, instead of relative to the origin.

A single point can be cast into a cluster feature . In order to combine two cluster features , we use

  • (incremental update of the mean)
  • in vector form using the element-wise product, respectively
  • to update a scalar sum of squared deviations

These computations use numerically more reliable computations (c.f. online computation of the variance) that avoid the subtraction of two similar squared values. The centroid is simply the node center vector , and can directly be used for distance computations using, e.g., the Euclidean or Manhattan distances. The radius simplifies to and the diameter to .

We can now compute the different distances D0 to D4 used in the BIRCH algorithm as:[4]

  • Euclidean distance and Manhattan distance are computed using the CF centers
  • Inter-cluster distance
  • Intra-cluster distance
  • Variance-increase distance

These distances can also be used to initialize the distance matrix for hierarchical clustering, depending on the chosen linkage. For accurate hierarchical clustering and k-means clustering, we also need to use the node weight .

Clustering Step

The CF-tree provides a compressed summary of the data set, but the leaves themselves only provide a very poor data clustering. In a second step, the leaves can be clustered using, e.g.,

  1. k-means clustering, where leaves are weighted by the numbers of points, N.
  2. k-means++, by sampling cluster features proportional to where the are the previously chosen centers, and is the BETULA cluster feature.
  3. Gaussian mixture modeling, where also the variance S can be taken into account, and if the leaves store covariances, also the covariances.
  4. Hierarchical agglomerative clustering, where the linkage can be initialized using the following equivalence of linkages to BIRCH distances:[5]
Correspondence between HAC linkages and BIRCH distances[5]
HAC Linkage BIRCH distance
UPGMA D2²
WPGMA D0²
Ward 2 D4²

Availability

  • ELKI contains BIRCH and BETULA.
  • scikit-learn contains a limited version of BIRCH, which only supports D0 distance, static thresholds, and which uses only the centroids of the leaves in the clustering step.[6]

References

  1. ^ a b Zhang, T.; Ramakrishnan, R.; Livny, M. (1996). "BIRCH: an efficient data clustering method for very large databases". Proceedings of the 1996 ACM SIGMOD international conference on Management of data - SIGMOD '96. pp. 103–114. doi:10.1145/233269.233324.
  2. ^ a b Lang, Andreas; Schubert, Erich (2020), "BETULA: Numerically Stable CF-Trees for BIRCH Clustering", Similarity Search and Applications, pp. 281–296, arXiv:2006.12881, doi:10.1007/978-3-030-60936-8_22, ISBN 978-3-030-60935-1, S2CID 219980434, retrieved 2021-01-16
  3. ^ "2006 SIGMOD Test of Time Award". Archived from the original on 2010-05-23.
  4. ^ a b Lang, Andreas; Schubert, Erich (2022). "BETULA: Fast clustering of large data with improved BIRCH CF-Trees". Information Systems. 108: 101918. doi:10.1016/j.is.2021.101918.
  5. ^ a b Schubert, Erich; Lang, Andreas (2022-12-31), "5.1 Data Aggregation for Hierarchical Clustering", Machine Learning under Resource Constraints - Fundamentals, De Gruyter, pp. 215–226, arXiv:2309.02552, ISBN 978-3-11-078594-4
  6. ^ as discussed in [1]

Read other articles:

Don Camillo en Russie Graziella Granata et Gianni Garko dans la dernière scène du film. Données clés Titre original Il compagno don Camillo Réalisation Luigi Comencini Scénario Giovannino Guareschi Acteurs principaux FernandelGino Cervi Sociétés de production Franco-London FilmsRizzoli Films Pays de production France Italie Allemagne de l'Ouest Genre Comédie Durée 111 minutes Sortie 1965 Série Don Camillo Don Camillo Monseigneur(1961) Pour plus de détails, voir Fiche technique et Dis…

American inventor, researcher, and mystic For the Swiss Olympic biathlete, see Marcel Vogel (biathlete). This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Marcel Vogel – news · newspapers · books · scholar · JSTOR (August 2013) (Learn how and when to remove this message) Marcel VogelBorn(1917-04-14)April 14, 1917…

جي بي لاغونا 2017 تفاصيل السباقسلسلة3. جي بي لاغونامنافسةطواف أوروبا للدراجات 2017 1.2‏التاريخ19 فبراير 2017المسافات158 كمالبلد كرواتيانقطة البدايةفونتانا [الإنجليزية]‏نقطة النهايةبوريكالفرق14عدد المتسابقين في البداية150عدد المتسابقين في النهاية109متوسط السرعة43٫593 كم/سالمن…

GeometryProjecting a sphere to a plane OutlineHistory (Timeline) Branches Euclidean Non-Euclidean Elliptic Spherical Hyperbolic Non-Archimedean geometry Projective Affine Synthetic Analytic Algebraic Arithmetic Diophantine Differential Riemannian Symplectic Discrete differential Complex Finite Discrete/Combinatorial Digital Convex Computational Fractal Incidence Noncommutative geometry Noncommutative algebraic geometry ConceptsFeaturesDimension Straightedge and compass constructions Angle Curve …

Percy Jackson e gli dei dell'Olimpo: la battaglia del labirintoTitolo originalePercy Jackson & The Olympians: The Battle of the Labyrinth AutoreRick Riordan 1ª ed. originale2008 1ª ed. italiana2011 Genereromanzo Sottogenerefantasy Lingua originaleinglese AmbientazioneCampo Mezzosangue,il Labirinto,Alcatraz,Monte Sant'Elena,Isola di Ogigia ProtagonistiPercy Jackson (figlio di Poseidone) CoprotagonistiGrover Underwood (satiro),Annabeth Chase (figlia di Atena),Rachel Elizabeth Dare (mortale),…

  提示:此条目页的主题不是中華人民共和國最高領導人。 中华人民共和国 中华人民共和国政府与政治系列条目 执政党 中国共产党 党章、党旗党徽 主要负责人、领导核心 领导集体、民主集中制 意识形态、组织 以习近平同志为核心的党中央 两个维护、两个确立 全国代表大会 (二十大) 中央委员会 (二十届) 总书记:习近平 中央政治局 常务委员会 中央书记处 中…

Danish actor You can help expand this article with text translated from the corresponding article in Danish. (October 2023) Click [show] for important translation instructions. View a machine-translated version of the Danish article. Machine translation, like DeepL or Google Translate, is a useful starting point for translations, but translators must revise errors as necessary and confirm that the translation is accurate, rather than simply copy-pasting machine-translated text into the Engl…

Cercle Arctique T. Cancer Équateur T. Capricorne Cercle AntarctiqueTracé du méridien de 135° est En géographie, le 135e méridien est est le méridien joignant les points de la surface de la Terre dont la longitude est égale à 135° est. Géographie Dimensions Comme tous les autres méridiens, la longueur du 135e méridien correspond à une demi-circonférence terrestre, soit 20 003,932 km. Au niveau de l'équateur, il est distant du méridien de Greenwich de 15 0…

Instituto Ayrton SennaInstituto Ayrton Sennacode: pt is deprecated   (Portugis)SingkatanIASDinamai berdasarkanAyrton SennaTanggal pendirian20 November 1994; 29 tahun lalu (1994-11-20)PendiriKeluarga SennaTujuanPerkembangan manusiaKantor pusatPinheiros, Subprefektur Pinheiros, São Paulo, BrasilWilayah BrasilPresidenViviane SennaTokoh pentingBianca Senna, Direktur BrandingEmilio Munaro, Direktur Pengembangan GlobalJumlah Staf 24Situs webinstitutoayrtonsenna.org.br Instituto Ayrton …

مشروع تنمية شمال سيناء البلد مصر  تعديل مصدري - تعديل   هو مشروع لتحقيق التنمية الشاملة والمستدامة لمنطقة شمال شبه جزيرة سيناء، يتضمن مشروعات بنية تحتية واستثمارية ضخمة في مجالات الزراعة والثروة السمكية والصناعة والسياحة والإسكان والأمن والنقل والتعليم والطاقة تمتد ف…

Disambiguazione – Se stai cercando altri significati, vedi Final Fantasy (disambigua). «Non penso di essere bravo a creare giochi d'azione. Preferisco narrare storie.» (Hironobu Sakaguchi, creatore di Final Fantasy) Il bobi della serie. Final Fantasy (ファイナルファンタジー?, Fainaru Fantajī) è una serie di videogiochi di ruolo giapponese prodotti da Square (divenuta Square Enix nel 2003). Final Fantasy rappresenta uno dei più grandi marchi per il mondo del divertimento interat…

Some NightsSingel oleh Fundari album Some NightsDirilis4 Juni 2012Direkam2011GenreAlternative rock[1]Durasi4:36LabelFueled by Ramen, AtlanticPencipta Jeff Bhasker Nate Ruess Andrew Dost Jack Antonoff ProduserJeff BhaskerKronologi singel Fun We Are Young (2011) Some Nights (2012) Carry On (2012) Video musikSome Nights di YouTube Some Nights adalah lagu dari band pop indie asal Amerika Serikat Fun. Lagu ini dirilis pada 4 Juni 2012, sebagai single kedua dari album studio kedua mereka denga…

James CoburnCoburn di Charade di 1963LahirJames Harrison Coburn, Jr.Tahun aktif1958 – 2002Suami/istriBeverly Kelly (1959–1979) Paula Murad (1993–2002) James Harrison Coburn (31 Agustus, 1928-18 November, 2002) merupakan seorang aktor berkebangsaan Amerika Serikat yang memenangkan nominasi Academy Award. Dia dilahirkan di Laurel, Nebraska. Dia berkarier di dunia film sejak tahun 1958 sampai dia meninggal dunia pada tahun 2002 akibat serangan jantung. Filmografi Ride Lonesome (1959) Fac…

Bunded reservoir (Proposed) in OxfordshireAbingdon Reservoir2023 Illustration by Thames Water of the conceptual design and location for a 150 Mm3 reservoir.[1]Abingdon ReservoirLocationOxfordshireCoordinates51°38′04″N 1°21′15″W / 51.63444°N 1.35417°W / 51.63444; -1.35417Lake typeBunded reservoir (Proposed)Primary inflowsRiver ThamesPrimary outflowsRiver ThamesBasin countriesUnited KingdomSurface area6.7 km2 (670 ha; 1,700 acres)Average d…

Atoll in American Samoa Not to be confused with Rose Island (disambiguation). Atoll in American SamoaRose AtollAtollRose AtollNASA satellite imageryRose AtollShow map of OceaniaRose AtollShow map of Pacific OceanCoordinates: 14°32′48″S 168°09′07″W / 14.54667°S 168.15194°W / -14.54667; -168.15194TerritoryAmerican SamoaArea • Land0.02 sq mi (0.05 km2)Population • Total0 • Density0/sq mi (0/km2) Rose Atoll, …

Academy in Bishop's Cleeve, Gloucestershire, EnglandCleeve SchoolAddressTwo Hedges RoadBishop's Cleeve, Gloucestershire, GL52 8AEEnglandCoordinates51°56′29″N 2°03′20″W / 51.941389°N 2.055556°W / 51.941389; -2.055556InformationTypeAcademyMotto'A.C.T' (Aspiring, collaborating, Transforming)Department for Education URN136772 TablesOfstedReportsExecutive HeadAlvin RichardsGendermixedAge11 to 18Enrolment1,495HousesRowling, Shackleton, Attenborough, Holmes and N…

Pemulihan pendorong tingkat pertama Falcon 9 setelah pendaratan pertamanya Kendaraan peluncur pakai ulang memiliki bagian yang dapat dipulihkan dan diterbangkan kembali, sambil membawa muatan dari permukaan ke luar angkasa. Tingkatan roket adalah bagian kendaraan peluncur paling umum yang ditujukan untuk dipakai kembali. Bagian yang lebih kecil seperti mesin roket dan pendorong juga dapat dipakai kembali, meskipun wahana antariksa pakai ulang dapat diluncurkan di atas kendaraan peluncuran sekali…

This is a list of U.S. states, federal district, and territories by total fertility rate. Total Fertility Rate by U.S. state in 2021 according to the Center for Disease Control & Prevention Fertility rate by State 2008 - 2020 2019-2022 Statefederal district or territory TFR2019[1] TFR2020[2] TFR2021[3] TFR2022[4]  Guam 2.74 2.64 2.36 2.26  American Samoa -- -- -- --  South Dakota 2.08 1.98 2.07 2.01  Nebraska 1.97 1.94 1.95 1.94  Alask…

Irish-born American Methodist preacher For the American polo champion, see Robert Early Strawbridge Jr. Robert Strawbridge (born 1732 - died 1781) was a Methodist preacher born in Drumsna, County Leitrim, Ireland. Early life and ancestral history Information detailing the early life of Robert Strawbridge is somewhat limited. One article, Robert Strawbridge, offers some light unto the subject. Another source, Robert Strawbridge: Some Additional Irish Perspectives offers additional details. In the…

Artikel ini tidak memiliki referensi atau sumber tepercaya sehingga isinya tidak bisa dipastikan. Tolong bantu perbaiki artikel ini dengan menambahkan referensi yang layak. Tulisan tanpa sumber dapat dipertanyakan dan dihapus sewaktu-waktu.Cari sumber: Ekonomika teknik – berita · surat kabar · buku · cendekiawan · JSTOR Ekonomika teknik adalah penentuan faktor-faktor dan kriteria ekonomi yang digunakan ketika satu atau lebih alternatif dipertimbangkan unt…