Primary clustering

In computer programming, primary clustering is a phenomenon that causes performance degradation in linear-probing hash tables. The phenomenon states that, as elements are added to a linear probing hash table, they have a tendency to cluster together into long runs (i.e., long contiguous regions of the hash table that contain no free slots). If the hash table is at a load factor of for some parameter , then the expected length of the run containing a given element is . This causes insertions and negative queries to take expected time in a linear-probing hash table.

Causes of primary clustering

Primary clustering has two causes:

  • Winner keeps winning: The longer that a run becomes, the more likely it is to accrue additional elements. This causes a positive feedback loop that contributes to the clustering effect. However, this alone would not cause the quadratic blowup.[1][2]
  • Joining of runs: A single insertion may not only increase the length of the run that it is in by one, but may instead connect together two runs that were already relatively long. This is what causes the quadratic blowup in expected run length.[1]

Another way to understand primary clustering is by examining the standard deviation on the number of items that hash to a given region within the hash table.[2] Consider a sub-region of the hash table of size . The expected number of items that hash into the region is . On the other hand, the standard deviation on the number of such items is . It follows that, with probability , the number of items that hash into the region will exceed the size of the region. Intuitively, this means that regions of size will often overflow, while larger regions typically will not. This intuition is often used as the starting point for formal analyses of primary clustering.[2][3][4]

Effect on performance

Primary clustering causes performance degradation for both insertions and queries in a linear probing hash table. Insertions must travel to the end of a run, and therefore take expected time .[1] Negative queries (i.e., queries that are searching for an element that turns out not to be present) must also travel to the end of a run, and thus also take expected time .[1] Positive queries can terminate as soon as they find the element that they are searching for. As a result, the expected time to query a random element in the hash table is .[1] However, positive queries to recently-inserted elements (e.g., an element that was just inserted) take expected time .[1]

These bounds also hold for linear probing with lazy deletions (i.e., using tombstones for deletions), as long as the hash table is rebuilt (and the tombstones are cleared out) semi-frequently. It suffices to perform such a rebuild at least once every insertions.[2]

Common misconceptions

Many textbooks describe the winner-keeps-winning effect (in which the longer a run becomes, the more likely it is to accrue additional elements) as the sole cause of primary clustering.[5][6][7][8][9][10][11] However, as noted by Knuth,[1] this is not the main cause of primary clustering.

Some textbooks state that the expected time for a positive query is ,[11][12] typically citing Knuth.[1] This is true for a query to a random element. Some positive queries may have much larger expected running times, however. For example, if one inserts an element and then immediately queries that element, the query will take the same amount of time as did the insertion, which is in expectation.

Techniques for avoiding primary clustering

Ordered linear probing[13] (often referred to as Robin Hood hashing[14]) is a technique for reducing the effects of primary clustering on queries. Ordered linear probing sorts the elements within each run by their hash. Thus, a query can terminate as soon as it encounters any element whose hash is larger than that of the element being queried. This results in both positive and negative queries taking expected time .

Graveyard hashing is a variant of ordered linear probing that eliminates the asymptotic effects of primary clustering for all operations.[2] Graveyard hashing strategically leaves gaps within runs that future insertions can make use of. These gaps, which can be thought of as tombstones (like those created by lazy deletions), are inserted into the table during semi-regular rebuilds. The gaps then speed up the insertions that take place until the next semi-regular rebuild occurs. Every operation in a graveyard hash table takes expected time .

Many sources recommend the use of quadratic probing as an alternative to linear probing that empirically avoids the effects of primary clustering.

References

  1. ^ a b c d e f g h Knuth, Donald Ervin (1997). The art of computer programming, volume 3, sorting and searching. Reading, Mass.: Addison-Wesley. pp. 527–528. ISBN 0-201-89683-4. OCLC 36241708.
  2. ^ a b c d e Bender, Michael A.; Kuszmaul, Bradley C.; Kuszmaul, William (February 2022). "Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering". 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). IEEE. pp. 1171–1182. doi:10.1109/focs52979.2021.00115. ISBN 978-1-6654-2055-6. S2CID 235731820.
  3. ^ Pagh, Anna; Pagh, Rasmus; Ruzic, Milan (2007-06-11). "Linear probing with constant independence". Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. New York, NY, USA: ACM. pp. 318–327. doi:10.1145/1250790.1250839. ISBN 9781595936318. S2CID 7523004.
  4. ^ Thorup, Mikkel; Zhang, Yin (January 2012). "Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation". SIAM Journal on Computing. 41 (2): 293–331. doi:10.1137/100800774. ISSN 0097-5397.
  5. ^ Cormen, Thomas H. (2022). Introduction to algorithms. Charles Eric Leiserson, Ronald L. Rivest, Clifford Stein (Fourth ed.). Cambridge, Massachusetts. ISBN 978-0-262-04630-5. OCLC 1264174621.{{cite book}}: CS1 maint: location missing publisher (link)
  6. ^ Drozdek, Adam (1995). Data structures in C. PWS Pub. Co. ISBN 0-534-93495-1. OCLC 31077222.
  7. ^ Kruse, Robert L. (1987). Data structures and program design (2nd ed.). Englewood Cliffs, N.J.: Prentice-Hall. ISBN 0-13-195884-4. OCLC 13823328.
  8. ^ McMillan, Michael (2014). Data structures and algorithms with JavaScript. Sebastopol, CA: O'Reilly. ISBN 978-1-4493-6493-9. OCLC 876268837.
  9. ^ Smith, Peter, February 1- (2004). Applied data structures with C++. Sudbury, Mass.: Jones and Bartlett Publishers. ISBN 0-7637-2562-5. OCLC 53138521.{{cite book}}: CS1 maint: multiple names: authors list (link) CS1 maint: numeric names: authors list (link)
  10. ^ Tremblay, Jean-Paul (1976). An introduction to data structures with applications. P. G. Sorenson. New York: McGraw-Hill. ISBN 0-07-065150-7. OCLC 1858301.
  11. ^ a b Handbook of Data Structures and Applications. [S.l.]: CRC PRESS. 2020. ISBN 978-0-367-57200-6. OCLC 1156995269.
  12. ^ Sedgewick, Robert (1998). Algorithms in C (Third ed.). Reading, Massachusetts. ISBN 0-201-31452-5. OCLC 37141168.{{cite book}}: CS1 maint: location missing publisher (link)
  13. ^ Amble, Knuth (1974). "Ordered hash tables". The Computer Journal. 17 (2): 135–142. doi:10.1093/comjnl/17.2.135.
  14. ^ Celis, Pedro, Per-Ake Larson, and J. Ian Munro. "Robin hood hashing." 26th Annual Symposium on Foundations of Computer Science (sfcs 1985). IEEE, 1985.