EdSchouten 4 days ago | next |

From "2 Related Work":

> None of these data structures can provide a non-probabilistic upper bound on the number of items per vertex. This hampers efficient implementation; and adversarial data suppliers can trivially produce n items in O(n) expected time that must all be stored in the same vertex.

And "4.3 Novel G-Trees":

> Our analysis confirms that — with high probability — the resulting trees are of logarithmic height and their G-nodes store O(k) items.
>
> The second key insight toward an efficient k-ary data structure is, paradoxically, that there is no need to be clever about it. Sorted linked lists are naïve, inefficient data structures, yet zip trees are efficient. We can be similarly naïve for our k-ary construction.
>
> We use a sorted linked list in which every node stores up to k items. We require all items to be stored as early in the list as possible; this is the simplest way of achieving history-independence. In other words, the only node to store fewer than k items is the final node. We call such a list a k-list (see Figure 9).

To me it's not obvious why an adversarial data supplier can't produce items that all need to go into the same G-node. Sure, it can be transformed into a k-ary data structure by partitioning the items across n/k nodes, but I don't see how that's better than a linked list.

So this makes me wonder: how is this approach any better than a Prolly tree? I see it being mentioned twice in this article, but the comparison doesn't go into detail.

bminor13 3 days ago | root | parent |

In order to be in the same G-node, they'd need to have the same rank and be close in value (such that they were not "broken up" by a value in the next highest rank), right?

Seems like brute-force search for adjacent values with the same rank is possible, but guaranteeing that intermediate higher-rank values don't also exist may not be (for an attacker). Maybe one mitigation for this sort of attack is to search for higher-rank extra values to insert to break up large G-nodes?

This also assumes they can know the hash function (if rank is chosen by cryptographically-secure hash); maybe also salting values before hashing could thwart these sorts of attacks?

yorwba 8 hours ago | root | parent | next |

Generating arbitrarily many values of the minimum rank is very easy for an attacker. Since the rank is geometrically distributed with parameter p = 1 - 1/k and k ≥ 2, randomly sampling a value will give you one of minimum rank with probability p ≥ 1/2 and it only gets easier for larger k.
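For intuition, here's a small simulation of that claim. The rank function below is a hypothetical one (geometric rank from the trailing base-k zero digits of a SHA-256 hash); the paper only requires that ranks be geometrically distributed with parameter p = 1 - 1/k:

```python
import hashlib

def rank(value: bytes, k: int) -> int:
    # Hypothetical rank function: geometric rank with parameter
    # p = 1 - 1/k, derived from the hash's trailing base-k zero digits.
    h = int.from_bytes(hashlib.sha256(value).digest(), "big")
    r = 0
    while h % k == 0:
        r += 1
        h //= k
    return r

k = 4
samples = [rank(i.to_bytes(8, "big"), k) for i in range(10_000)]
min_rank_fraction = samples.count(0) / len(samples)
# With p = 1 - 1/k = 0.75, roughly three quarters of random values have
# the minimum rank, so an attacker gets n minimum-rank values in O(n)
# expected hash evaluations.
```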

If you want to break that up with dummy elements, you now have the problem of choosing those dummies in a history-independent manner efficiently.

But I think their recursive construction with G-trees of G-trees of ... might work if nodes with too many elements are stored as G-trees with a different, independent rank function (e.g. using a hash function with different salt). Producing many nodes with the same ranks should then get exponentially harder as the number of independent rank functions increases.
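A sketch of what independent salted rank functions might look like, with an empirical check that rank collisions become rarer as rank functions are stacked. The salts and the SHA-256-based geometric rank here are illustrative assumptions, not the paper's construction:

```python
import hashlib

def rank(value: bytes, salt: bytes, k: int = 4) -> int:
    # Hypothetical rank function: geometric rank derived from a salted
    # SHA-256 hash; distinct salts give approximately independent ranks.
    h = int.from_bytes(hashlib.sha256(salt + value).digest(), "big")
    r = 0
    while h % k == 0:
        r += 1
        h //= k
    return r

# Fraction of random pairs sharing a rank under one rank function,
# versus under two independently salted rank functions:
pairs = [(i.to_bytes(8, "big"), (i + 50_000).to_bytes(8, "big"))
         for i in range(20_000)]
same_one = sum(rank(a, b"s0") == rank(b, b"s0")
               for a, b in pairs) / len(pairs)
same_two = sum(rank(a, b"s0") == rank(b, b"s0") and
               rank(a, b"s1") == rank(b, b"s1")
               for a, b in pairs) / len(pairs)
```

For k = 4, two random values share a rank with probability p/(2-p) = 0.6, so stacking m independent rank functions drives the collision probability down like 0.6^m.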

carsonfarmer 4 hours ago | root | parent |

Yes this exactly. Another really simple way to do this, is to use alternating leading and trailing zero counts in the hash in your nested G-trees. Simple, and pretty effective.
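A minimal sketch of deriving two rank functions from a single hash this way (SHA-256 and the 256-bit width are assumptions for illustration):

```python
import hashlib

def lz_tz_ranks(value: bytes, bits: int = 256):
    # Two rank functions from one hash: the count of leading zero bits
    # and the count of trailing zero bits. Nested G-trees can alternate
    # between the two. (Sketch of the alternating scheme described
    # above, not a normative definition.)
    h = int.from_bytes(hashlib.sha256(value).digest(), "big")
    if h == 0:  # vanishingly unlikely, but keeps both ranks well-defined
        return bits, bits
    leading = bits - h.bit_length()
    trailing = (h & -h).bit_length() - 1  # index of the lowest set bit
    return leading, trailing

lead, trail = lz_tz_ranks(b"example")
```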

yorwba an hour ago | root | parent |

Hmmm... if you need to go deeper (because 1/4 of all hashes have zero leading zeros and zero trailing zeros), you can generalize this by converting the hash into its run-length encoding to get a sequence of rank functions where finding values with the same rank for all rank functions is equivalent to finding hash collisions. Very nice.
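Sketched in code, the run-length-encoding idea might look like this (the hash choice is an assumption; the point is that the run lengths, together with the first bit, determine the hash, so matching every rank amounts to finding a hash collision):

```python
import hashlib
from itertools import groupby

def rank_sequence(value: bytes) -> list[int]:
    # Run-length encode the 256-bit hash: rank_0 is the length of the
    # first run of equal bits, rank_1 the length of the second run, and
    # so on. Two values agreeing on every rank (and the first bit) must
    # have colliding hashes.
    h = int.from_bytes(hashlib.sha256(value).digest(), "big")
    bits = format(h, "0256b")
    return [len(list(run)) for _, run in groupby(bits)]

seq = rank_sequence(b"example")
```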

EdSchouten 3 days ago | root | parent | prev |

Exactly. But my concern is that this is not any stronger/better than what Prolly trees already offer, which is why I'm disappointed that they are mentioned under "related work", but not discussed/compared in more detail.

carsonfarmer 4 hours ago | root | parent |

You're right, we should compare against Prolly trees in more detail. We actually have a lot of experience with Prolly trees, and have found, in practice, that you need to do a lot of the things that folks like Dolt have had to do to make them work nicely. Whereas with G-trees, the basic implementation turns out to be quite nice (and extremely easy to reason about).

One of the biggest benefits of G-trees, in my mind, is their ease of implementation. Additionally, we did a lot of work to explore their statistical properties, analysis that doesn't exist for Prolly trees (though in hindsight we have done much of it ourselves, so we should probably write it up formally).

EdSchouten 29 minutes ago | root | parent |

Another thing that's worth investigating:

As the name implies, the sizes of nodes of Prolly trees and geometric search trees are geometrically distributed. My question is: is this really the right distribution to use? The larger nodes get, the larger the probability that they get mutated. This means that in a content-addressed storage system, there will be more large objects than small ones. My gut feeling tells me that the distribution should be uniform, with the spread between min/max sizes bounded by a small constant factor (2x? 4x?).

Some time ago I experimented with this, where I implemented a content defined chunking algorithm that chunks inputs at locations where the value of a rolling hash is maximal, as opposed to finding offsets at which the first/last n bits are zeros/ones. My observation was that this led to a 2-3% reduction in storage space usage. The source code for this can be found here:

https://github.com/buildbarn/go-cdc
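As a toy illustration of the "cut where a rolling hash is maximal" idea (this is not the go-cdc algorithm; the windowing, hash, and parameters here are simplified assumptions):

```python
import hashlib

def chunk_boundaries(data: bytes, window: int = 8, radius: int = 32) -> list[int]:
    # Hash the `window` bytes ending at each position and cut where that
    # hash is a strict local maximum within `radius` positions on either
    # side. Local maxima shift only near an edit, so boundaries are
    # stable under small mutations of the input.
    if len(data) <= window:
        return [len(data)]
    scores = [int.from_bytes(hashlib.blake2b(data[i - window:i],
                                             digest_size=8).digest(), "big")
              for i in range(window, len(data) + 1)]
    cuts = []
    for i, s in enumerate(scores):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        neighborhood = scores[lo:hi]
        if s == max(neighborhood) and neighborhood.count(s) == 1:
            cuts.append(i + window)  # cut after the window ending here
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))
    return cuts

data = bytes((i * 31 + (i >> 3)) % 256 for i in range(4096))
cuts = chunk_boundaries(data)
```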

Would it also be possible to model trees around this approach as well? If so, would this lead to better deduplication rates than Prolly/geometric search trees?

carsonfarmer 3 hours ago | prev | next |

Pretty cool to see this here. My co-author and I haven't actually officially "released" this yet, so quite neat to see where it is organically showing up! Feedback appreciated y'all!

zermelo44 3 days ago | prev | next |

The presentation of the webpage is really really nice. Especially the highlighting/linking of mathematical definitions and bound variables.

How did you achieve this?

randomizedalgs 7 hours ago | prev |

Cool paper!

As a small comment, this seems closely related to another recent paper: History-Independent Dynamic Partitioning: Operation-Order Privacy in Ordered Data Structures (PODS 2024, Best Paper).

I'm not sure how they compare, since neither paper seems to know about the other. And I'm also not sure which paper came first, since the geometric search tree paper does not seem to state a publication date.

carsonfarmer 4 hours ago | root | parent |

Whoah, cool. I'm one of the authors of the geometric search tree paper, and we totally hadn't seen that paper, but will for sure dig in! Thanks for mentioning it.