Tiny Pointers(arxiv.org) |
Tiny Pointers(arxiv.org) |
I was struggling a bit to come up with good use cases for it. Their examples are all around combining them with existing data structures and they show that the space complexity is smaller, but it wasn't completely clear to me how feasible this would actually be in practice.
https://www.quantamagazine.org/undergraduate-upends-a-40-yea...
"The dict type now uses a “compact” representation based on a proposal by Raymond Hettinger which was first implemented by PyPy. The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5." -- https://docs.python.org/3.6/whatsnew/3.6.html#whatsnew36-com...
"Note, the sizeof(index) can be as small as a single byte for small dicts, two bytes for bigger dicts and up to sizeof(Py_ssize_t) for huge dict." -- https://mail.python.org/pipermail/python-dev/2012-December/1...
The "tiny pointers" are in the _make_index method in the proof of concept code. -- https://code.activestate.com/recipes/578375-proof-of-concept...
@staticmethod
def _make_index(n):
'New sequence of indices using the smallest possible datatype'
if n <= 2**7: return array.array('b', [FREE]) * n # signed char
if n <= 2**15: return array.array('h', [FREE]) * n # signed short
if n <= 2**31: return array.array('l', [FREE]) * n # signed long
return [FREE] * n
The logic is still present today in CPython. -- https://raw.githubusercontent.com/python/cpython/3e222e3a159... dk_indices is actual hashtable. It holds index in entries, or DKIX_EMPTY(-1)
or DKIX_DUMMY(-2).
Size of indices is dk_size. Type of each index in indices varies with dk_size:
* int8 for dk_size <= 128
* int16 for 256 <= dk_size <= 2**15
* int32 for 2**16 <= dk_size <= 2**31
* int64 for 2**32 <= dk_sizeWhat if you have to debug the whole situation, such that you don't always know who is the owner of a pointer you are looking at?
> A user k can call Allocate(k) in order to get a tiny pointer p; they can dereference the tiny pointer pby computing a function Dereference(k,p) whose value depends only on k, p, and random bits; and they can free a tiny pointer p by calling a function Free(k,p).
That is tantamount ot saying that the pointer is not actually p but the tuple <k, p>, and so its size consists of the number of bits in k, the number of bits in p plus an indication of where the division between these bits lie: where k ends and p begins.
We can abbreviate <k, p> to p in contexts where k can be implicitly understood.
Places it would likely impact more than you'd realize is in higher level language arrays where each item in the array is a pointer. For similar reasons, many datastructures can be coded such that each "pointer" is an array index instead.
So, extending all of that, what if you could make your pointers even smaller than 32 bits? If you know the addressable need for where the pointer is used, there is no reason you can't go smaller.
I mean I presume it's something cleverer than just "let's bit-pack our indexes to save space".
..."and make it dynamic" (arenas)
Oh wait, variable-size tiny pointers. Yes. Now it's getting satisfyingly complicated. Maybe like how UTF-8 works?
The pointers can be dereferenced into an array index (when given the key), but they are smaller in size than a "bit-packed index", although I'm not quite sure what you mean by this. If you want to to represent an index into an array of size `n` there's no representation where all indexes take less than log(n). You can make some of them smaller, but then you make other bigger (like in UTF-8). The Tiny Pointers are all smaller than log(n) (and some of them would be exactly equal, despite representing different indexes!), but the only way of determining the actual index is to dereference it together with original key that was used to create it.
To rephrase, this is how to mechanically bit-pack indexes to save space. Doing it in a pointer means you are likely wanting to share these pointers with other parts of the code, so you also need to have them be self contained. Naive way would be for each pointer to be a tuple identifying arena and having its offset. But there is still a lot of effort that has to be done for that to work well. This paper looks at that effort and gives a pass at how worthwhile it is.
This is like using a fixed sharding scheme plus delta-compression for indices relative to shard addresses plus varint encoding the deltas.
One scheme proves tiny pointer size bounds for fixed length tiny pointers. The other proves bounds for variable length pointers.