Short Message Compression Using LLMs

Short Message Compression Using LLMs (bellard.org)

261 points by chunkles 1 year ago | 116 comments

antirez 1 year ago |

The way this works is awesome. If I understand correctly, given (part of) a sentence, the actual next token in the sequence will usually be one of the model's top-scoring predictions, so most next tokens can be mapped to very low numbers (0 if the actual next token is the model's best prediction, 1 if it is the second best, ...). These small numbers can be encoded very efficiently using trivial old techniques. And boom: done.

So for instance:

> In my pasta I put a lot of [cheese]

LLM top N tokens for "In my pasta I put a lot of" will be [0:tomato, 1:cheese, 2:oil]

The real next token is "cheese" so I'll store "1".

Well, this is neat, but also very computationally expensive :D So for my small ESP32 LoRa devices I used this: https://github.com/antirez/smaz2 And so forth.
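A minimal sketch of the rank mapping described in this comment, with a hard-coded table standing in for the LLM. The `TOY_MODEL` table and both helper names are hypothetical; a real model would compute the ranking from the context, and nothing here is taken from ts_sms itself.

```python
# Toy stand-in for an LLM: a fixed ranking of candidate next tokens per context.
# The context and ranking are the made-up ones from the pasta example.
TOY_MODEL = {
    "In my pasta I put a lot of": ["tomato", "cheese", "oil"],
}

def rank_of(context: str, actual: str) -> int:
    """Compressor side: position of the actual next token in the ranking."""
    return TOY_MODEL[context].index(actual)

def token_at(context: str, rank: int) -> str:
    """Decompressor side: recover the token from the same ranking."""
    return TOY_MODEL[context][rank]

context = "In my pasta I put a lot of"
rank = rank_of(context, "cheese")
print(rank)                     # 1
print(token_at(context, rank))  # cheese
```

Both sides must run the same model on the same context for the rank to mean the same thing.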

gliptic 1 year ago | |

I'm pretty sure it doesn't use ranking. That leaves a lot of performance on the table. Instead you would use the actual predicted token probabilities and arithmetic coding.
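To illustrate why the probabilities carry more information than the rank, the sketch below computes the ideal code length, -log2(p), for the same top-ranked token under two invented distributions. Both contexts give rank 0, yet the ideal cost differs a lot, which a rank alone cannot express. The distributions and names are assumptions made up for this example.

```python
import math

def ideal_bits(p: float) -> float:
    """Ideal code length, in bits, for an event of probability p."""
    return -math.log2(p)

# Hypothetical next-token distributions for two different contexts.
confident = {"cheese": 0.99, "tomato": 0.005, "oil": 0.005}
uncertain = {"cheese": 0.40, "tomato": 0.35, "oil": 0.25}

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    ranking = sorted(dist, key=dist.get, reverse=True)
    print(name,
          "rank:", ranking.index("cheese"),
          "ideal bits:", round(ideal_bits(dist["cheese"]), 3))
```

An arithmetic coder driven by the per-context probabilities can approach these costs; a coder that only sees "rank 0" has to charge both cases the same.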

antirez 1 year ago | | |

I supposed it used arithmetic coding with the ranking because the ranks have a distribution that is easy to exploit: zero most likely, one a bit less, and so forth. What's your guess? Unfortunately Bellard is as hermetic as he is smart. We are here guessing at what should be in a README file.

amelius 1 year ago | |

This is very similar to how many compression schemes work. Look up Huffman coding to begin with.

https://en.wikipedia.org/wiki/Huffman_coding

nzach 1 year ago | | |

For anyone interested in this topic, Primeagen has a pretty great video on how he used several encoding schemes to save bandwidth in one of his projects.

https://www.youtube.com/watch?v=3f9tbqSIm-E

ai-christianson 1 year ago | |

Seems like an ideal compression method for LoRa/Meshtastic-style communication. An LLM wouldn't run on an ESP32, but there are several that could run on a Raspberry Pi.

It's not just natural language that could be compressed this way, either. Code (HTML, JS, etc) could be compressed with the same technique/models. I bet that the same general idea could work for image compression as well, using an image/diffusion model (or perhaps a multimodal model for everything.)

This could lead to an entire internet of content using just a few bits.

soulofmischief 1 year ago | | |

The key insight is that the larger the shared context between parties, the more efficient communication can be, as communication tends towards a purely relational construct. The limit of this is two parties that share the exact same context and inputs: the inputs should produce the same hidden state within both parties, and communication is not even necessary because both parties have the same knowledge and state.

That's not new to anyone familiar with compression or information theory, but the novelty here is the LLM itself. It's absolutely plausible that, given an already highly compressed relationally-encoded context like a trained LLM, very few bits could be communicated to communicate very abstract and complex ideas, letting the LLM recontextualize information which has been compressed across several semantic and contextual layers, effectively leveraging a complete (but lossy) history of human knowledge against every single bit of information communicated.

nullc 1 year ago | | |

LoRa is also pretty slow: the 'long fast' mode that most Meshtastic users use is about a kilobit per second... and presumably only a small percentage of the traffic at any time is in channels that you're monitoring.

Probably decoding a few tokens per second is fast enough to deliver more goodput than the existing uncompressed usage.

Retr0id 1 year ago | |

It'd be a fun experiment to try making it lossy.

You could adjust tokens towards what's more statistically probable, and therefore more compressible (in your example, it'd be picking tomato instead of cheese)

lxgr 1 year ago | | |

I could see that as a plot point in a science fiction story: Intergalactic telegrams are prohibitively expensive, so before sending one you're offered various variants of your text that amount to the same thing but save data due to using more generic (per zeitgeist) language :)

Compare also with commercial code [1], a close historical analog, albeit with handcrafted, as opposed to ML-derived, compression tables. (There was a single code point for "twins, born alive and well, one boy and one girl", for example! [2])

[1] https://en.wikipedia.org/wiki/Commercial_code_(communication...

[2] https://archive.org/details/unicodeuniversa00unkngoog/

antirez 1 year ago | | |

Yep. For lossy what could work even better is an encoder-decoder model, so that it is possible to just save the embedding, and later the embedding will be turned back into the meaning.

__MatrixMan__ 1 year ago | |

Seems like an opportunity to do some steganography. If the model isn't known by the attacker (or perhaps they can be "salted" to become unknown) then the actual message can be encoded in the offsets.

This would be much nicer than text-in-image steganography because services often alter images before displaying them, but they rarely do that to text (assuming the usual charset and no consecutive whitespace).

vasco 1 year ago | | |

There's already research into steganography for LLM-generated text, for fingerprinting and identifying the source: https://www.nature.com/articles/s41586-024-08025-4

The idea seems similar enough that I wanted to share. The same way you can hide information in the text to prove it was generated by a specific model and version, of course you can use this for secrets as well.

Groxx 1 year ago | | |

tbh I'm not sure this would qualify as steganography - the message doesn't exist at all in the encoded form. It's not hidden, it's completely gone, the information is now split into two pieces.

So it's cryptography. With a shared dictionary. Basically just ECB, though with an unbelievably large and complicated code book.

userbinator 1 year ago | |

very computationally expensive

The same goes for all the other higher-order probability models, which are used in what is currently the best known compression algorithm:

https://en.wikipedia.org/wiki/PAQ

LLMs are just another way to do the probability modeling.

userbinator 1 year ago |

The download is 153MB, compressed... didn't even bother to wait for it to finish once I saw the size.

The brotli comparison is IMHO slightly misleading. Yes, it "embeds a dictionary to optimize the compression of small messages", but that dictionary is a few orders of magnitude smaller than the embedded "dictionary" which is the LLM in ts_sms.

There's a reason the Hutter Prize (and the demoscene) counts all the data necessary to reproduce its output. In other words, ts_sms took around 18 bytes + ~152MB while brotli took around 70 bytes + ~128KB (approximately the size of its dictionary and decompressor).
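One way to read these figures is to ask how many messages it takes before the one-time cost is paid back. The sketch below does that arithmetic under loud assumptions: MB and KB are taken as powers of ten, and every message is assumed to compress like the single example quoted (18 vs 70 bytes), which real traffic would not do exactly.

```python
# Approximate figures quoted in the comment above.
LLM_MSG, LLM_FIXED = 18, 152 * 10**6        # bytes per message, one-time bytes
BROTLI_MSG, BROTLI_FIXED = 70, 128 * 10**3  # bytes per message, one-time bytes

def total_bytes(per_msg: int, fixed: int, n: int) -> int:
    """Fixed cost plus n messages at a constant per-message size."""
    return fixed + per_msg * n

# Smallest n for which the LLM total is strictly below the brotli total.
break_even = (LLM_FIXED - BROTLI_FIXED) // (BROTLI_MSG - LLM_MSG) + 1
print(break_even)  # 2920616
```

Under these assumptions the model pays for itself only after roughly three million messages, if the model download is counted at all.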

theamk 1 year ago | |

Life is more than those competitions?

For example, antirez mentioned LoRa in the earlier thread - that's a cheap, license-free radio, which achieves a large range at the expense of low rate (250 bit/sec). That's 30 bytes/second, not including framing overhead and retransmission.

If you wanted to build a communication system out of those, this compression method would be great. You'd have a LoRa device that connects to a regular cell phone and provides connectivity, and all the compression/decompression and UI happens on the cell phone. 150MB is nothing for modern phones, but you'd see a real improvement in message speed.
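A rough sketch of what that means in airtime, using the 250 bit/sec rate from this comment and the 18-byte and 70-byte sizes quoted earlier in the thread. Framing overhead and retransmission are ignored, and the function name is made up for the example.

```python
LINK_BITS_PER_SEC = 250  # rate quoted above; framing and retries ignored

def airtime_seconds(payload_bytes: int, rate: int = LINK_BITS_PER_SEC) -> float:
    """Time to push a payload over the link at a constant bit rate."""
    return payload_bytes * 8 / rate

for label, size in [("ts_sms", 18), ("brotli", 70)]:
    print(f"{label}: {airtime_seconds(size):.2f} s")
```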

f33d5173 1 year ago | |

You can compress arbitrarily many messages with it, and the dictionary remains 153MB. Why it's worth pointing out that brotli already uses a dictionary is that otherwise it would be generating the dictionary as it compressed, meaning that short messages would be pessimized. So brotli is in some sense the state of the art for short messages.
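The effect of a pre-shared dictionary on short messages can be sketched with zlib's preset-dictionary support in Python. This is zlib rather than brotli, and `SHARED_DICT` is a hypothetical phrase list invented for the example, so it only illustrates the general idea.

```python
import zlib
from typing import Optional

# Hypothetical shared dictionary: phrases both sides agree on in advance.
SHARED_DICT = b"Merry Christmas Happy Birthday I love you see you tomorrow"

def compress(msg: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate a message, optionally primed with a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(msg) + c.flush()

def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inverse of compress(); needs the same dictionary the sender used."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()

msg = b"Merry Christmas, I love you, see you tomorrow"
plain = compress(msg)
primed = compress(msg, SHARED_DICT)
print(len(msg), len(plain), len(primed))
```

Without the dictionary, the compressor has nothing to refer back to at the start of a short message; with it, whole phrases become back-references from the first byte.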

tshaddox 1 year ago |

This is obviously relevant to the Hutter Prize, which is intended to incentivize AI research by awarding cash to people who can losslessly compress a large English text corpus:

https://en.wikipedia.org/wiki/Hutter_Prize

From a cursory web search it doesn't appear that LLMs have been useful for this particular challenge, presumably because the challenge imposes rather strict size, CPU, and memory constraints.

kianN 1 year ago |

For those wondering how it works:

> The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities. [1]

It’s also mentioned that the model is configured to be deterministic, which is how I would guess the decompression is able to map a set of token likelihoods to the original token?

[1] https://bellard.org/ts_zip/
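A minimal sketch of that pipeline, assuming a toy deterministic predictor (add-one symbol counts) in place of the language model and exact fractions in place of the finite-precision integer arithmetic a practical coder would use. All names are hypothetical and this is not Bellard's implementation. It shows why determinism matters: the decoder re-runs the same model on the same prefix to rebuild the same intervals.

```python
from fractions import Fraction
from math import ceil

EOS = "<eos>"
ALPHABET = list("abcdefghijklmnopqrstuvwxyz ") + [EOS]  # lowercase and space only

def model(prefix):
    """Deterministic toy predictor: add-one counts over the symbols seen so far."""
    counts = {s: 1 for s in ALPHABET}
    for s in prefix:
        counts[s] += 1
    total = sum(counts.values())
    return [(s, Fraction(c, total)) for s, c in counts.items()]

def encode(text):
    symbols = list(text) + [EOS]
    low, width = Fraction(0), Fraction(1)
    for i, sym in enumerate(symbols):
        cum = Fraction(0)
        for s, p in model(symbols[:i]):
            if s == sym:
                # Narrow the interval to this symbol's slice.
                low, width = low + cum * width, p * width
                break
            cum += p
    # Emit the shortest bit string whose value lies in [low, low + width).
    k = 0
    while True:
        m = ceil(low * 2**k)
        if Fraction(m, 2**k) < low + width:
            return format(m, "b").zfill(k) if k else ""
        k += 1

def decode(bits):
    value = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    low, width = Fraction(0), Fraction(1)
    out = []
    while True:
        cum = Fraction(0)
        for s, p in model(out):
            start = low + cum * width
            if start <= value < start + p * width:
                break
            cum += p
        if s == EOS:
            return "".join(out)
        out.append(s)
        low, width = start, p * width

bits = encode("a bird in the hand")
print(len(bits), decode(bits))
```

The better the model's probability for the symbol that actually occurs, the less the interval shrinks and the fewer bits are needed, which is where a strong language model pays off.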

kvemkon 1 year ago | |

> ts_zip

Discussed (once more) in a neighbor thread: https://news.ycombinator.com/item?id=42549083

cyptus 1 year ago | |

Isn't an LLM itself basically a compression of the texts from the internet? You can download the model and decompress the (larger) content with compute power (lossily).

kianN 1 year ago | | |

Yeah, that's exactly how I think of LLMs in my head: lossy compression that interpolates in order to fill in gaps. Hallucination is simply interpolation error, which is guaranteed in lossy compression.

giovannibonetti 1 year ago |

Regarding lossless text compression, does anyone know a simple way to compress repetitive JSON(B) data in a regular Postgres table? Ideally I would use columnar compression [1], but I'm limited to the extensions supported by Google Cloud SQL [2].

Since my JSON(B) data is fairly repetitive, my bet would be to store some sort of JSON schema in a parent table. I'm storing the response body from an API call to a third-party API, so normalizing it by hand is probably out of the question.

I wonder if Avro can be helpful for storing the JSON schema. Even if I had to create custom PL/SQL functions for my top 10 JSON schemas it would be ok, since the data is growing very quickly and I imagine it could be compressed at least 10x compared to regular JSON or JSONB columns.

[1] https://github.com/citusdata/citus?tab=readme-ov-file#creati... [2] https://cloud.google.com/sql/docs/postgres/extensions

tdiff 1 year ago |

How is an LLM here better than Markov chains created from a corpus of English text? I guess a similar idea must have been explored a million times in traditional compression studies.

max_ 1 year ago |

Does this guy (Fabrice Bellard) have a podcast interview anyone would recommend?

silisili 1 year ago | |

AFAIK, he doesn't do videos or interviews. His web presence is pretty sparse. I remember trying to dig something up last year and coming up blank. Totally respect that, but a bummer for folks hoping to get a peek inside his mind.

If nothing else, I hope he finds time to write his thoughts into a book at some point.

usr1106 1 year ago | | |

He seems to spend all his time writing truly amazing software.

mNovak 1 year ago |

I recall someone using one of the image generation models for pretty impressive (lossy) compression as well -- I wonder if AI data compression/inflation will be a viable concept in the future; the cost of inference right now is high, but it feels similar to the way cryptographic functions were more expensive before they got universal hardware acceleration.

hangonhn 1 year ago | |

At a startup where I worked many years ago, they trained a model to take the image and screen size as the input, and it would output the JPG compression level to use so that the image appears the same to people. It worked so well that a major software company offered to acquire the startup just for that. Alas, the founders were too ambitious/greedy and said no. It all burned down.

kevmo314 1 year ago | | |

That seems like a fun project to replicate independently. You didn't want to rebuild it?

stabbles 1 year ago |

It's a bit confusing to show the output as multibyte utf-8 characters and compare that to a base64 string

Retr0id 1 year ago | |

The comparison example uses base64 too

stabbles 1 year ago | | |

Ah, my mistake. I thought that was meant to show a dictionary and brotli encoded string separately.

slater 1 year ago |

i always wondered if e.g. telcos had special short codes for stuff people often send, like at xmas many people write "merry christmas" in an SMS, and the telco just sends out "[code:mx]" to all recipient phones, to save on bandwidth and disk space?

qingcharles 1 year ago | |

No, the systems are not that sophisticated, from having worked on them in the past.

j_juggernaut 1 year ago |

Made a quick and dirty Streamlit app to play around with encrypt/decrypt: https://llmencryptdecrypt-euyfofcjh8bf2utuha2zox.streamlit.a...

lxgr 1 year ago |

Impressive!

I wonder if this is at all similar to what Apple uses for their satellite iMessage/SMS service, as that's a domain where it's probably worth spending significant compute on both sides to shave off even a single byte to transmit.

Retr0id 1 year ago |

What's the throughput like, for both compression and decompression?

crazygringo 1 year ago |

What is this encoding scheme that produces Chinese characters from binary data? E.g. from the first example:

> 뮭䅰㼦覞㻪紹陠聚牊

I've never seen that before. The base64 below it, in contrast, is quite familiar.

lxgr 1 year ago | |

There's a family of encodings optimized for fitting the most information possible into a Unicode string of a given length, e.g. for gimmicks like fitting the most possible binary data into tweets.

For example: https://github.com/qntm/base65536
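A toy sketch of the general idea, packing two bytes into each code point. The code point blocks below are arbitrary choices made for this example and are not the alphabet base65536 uses; real encodings in this family pick their code points with more care.

```python
PAIR_BASE = 0x20000  # hypothetical block: one code point per 16-bit value
TAIL_BASE = 0x30000  # hypothetical block: one code point for a trailing odd byte

def encode(data: bytes) -> str:
    chars = []
    # Pack each full pair of bytes into a single code point.
    for i in range(0, len(data) - 1, 2):
        chars.append(chr(PAIR_BASE + (data[i] << 8 | data[i + 1])))
    # An odd-length input leaves one byte over; give it its own block.
    if len(data) % 2:
        chars.append(chr(TAIL_BASE + data[-1]))
    return "".join(chars)

def decode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PAIR_BASE <= cp < PAIR_BASE + 0x10000:
            value = cp - PAIR_BASE
            out += bytes([value >> 8, value & 0xFF])
        elif TAIL_BASE <= cp < TAIL_BASE + 0x100:
            out.append(cp - TAIL_BASE)
        else:
            raise ValueError(f"unexpected code point U+{cp:04X}")
    return bytes(out)

packed = encode(b"hello world")
print(len(b"hello world"), len(packed))  # 11 6
```

The string is shorter when counted in characters, but not in bytes on the wire, which is why this mostly helps where the limit is a character count.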

For short messages in the mobile phone (i.e. GSM/3GPP) sense, which was my first association for "short message compression", I doubt that it works better than just sending binary messages with the appropriate header, but if that's not an option, it might just beat a custom alphabet based on the 7-bit GSM charset [1] (since that allows 100% of possible 7-bit characters to be used, whereas UTF-16 probably has at least some reserved codepoints that might be causing problems).

[1] https://en.wikipedia.org/wiki/GSM_03.38

SeptiumMMX 1 year ago |

The practical use for this could be satellite messaging (e.g. InReach) where a message is limited to ~160 characters, and costs about a dollar per message.

deadbabe 1 year ago |

Could this become an attack vector somehow? The greatest minds could probably find a way to get a malicious payload decompressed into the output.

Retr0id 1 year ago | |

It's lossless, at worst you'd make the compression ratio worse for certain inputs.

deadbabe 1 year ago | | |

With LLM based compression, could we get something like the opposite of lossless, like hallucinatory? All the original content, plus more?

the5avage 1 year ago |

Is there a paper explaining it in more detail? I also saw on his website he has a similar algorithm for audio compression...

gcr 1 year ago |

Decoding random gibberish into semantically meaningful sentences is fascinating.

It's really fun to see what happens when you feed the model keysmash! Each part of the input space seems highly semantically meaningful.

Here's a few decompressions of short strings (in base64):

    $ ./ts_sms.exe d -F base64 sAbC
    Functional improvements of the wva
    $ ./ts_sms.exe d -F base64 aBcDefGh
    In the Case of Detained Van Vliet {#
    $ ./ts_sms.exe d -F base64 yolo9000
    Give the best tendering
    $ ./ts_sms.exe d -F base64 elonMuskSuckss=
    As a result, there are safety mandates on radium-based medical devices
    $ ./ts_sms.exe d -F base64 trump4Prezident=
    Order Fostering Actions Supported in May

    In our yellow
    $ ./ts_sms.exe d -F base64 harris4Prezident=
    Colleges Beto O'Rourke voted with Cher ¡La
    $ ./ts_sms.exe d -F base64 obama4Prezident=
    2018 AFC Champions League activity televised live on Telegram:

    $ ./ts_sms.exe d -F base64 hunter2=
    All contact and birthday parties

    $ ./ts_sms.exe d -F base64 'correctHorseBatteryStaples='
    ---
    author:
    - Stefano Vezzalini
    - Paolo Di Rio
    - Petros Maev
    - Chris Copi
    - Andreas Smit
    bibliography:

    $ ./ts_sms.exe d -F base64 'https//news/ycombinator/com/item/id/42517035'

    Allergen-specific Tregs or Treg used in cancer immunotherapy.
    Tregs are a critical feature of immunotherapies for cancer. Our previous 
    studies indicated a role of Tregs in multiple
    cancers such as breast, liver, prostate, lung, renal and pancreatitis. Ten years ago, most clinical studies were positive, and zero percent response rates

    $ ./ts_sms.exe d -F base64 'helloWorld='
    US Internal Revenue Service (IRS) seized $1.6 billion worth of bitcoin and

In terms of compressions, set phrases are pretty short:

    $ ./ts_sms.exe c -F base64 'I love you'
    G5eY
    $ ./ts_sms.exe c -F base64 'Happy Birthday'
    6C+g

Common mutations lead to much shorter output than uncommon mutations / typos, as expected:

    $ ./ts_sms.exe c -F base64 'one in the hand is worth two in the bush'
    Y+ox+lmtc++G
    $ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush'
    kC4Y5cUJgL3s
    $ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush.'
    kC4Y5cUJgL3b
    $ ./ts_sms.exe c -F base64 'One in the hand .is worth two in the bush.'
    kC4Y5c+urSDmrod4

Note that the correct version of this idiom is a couple bits shorter:

    $ ./ts_sms.exe c -F base64 'A bird in the hand is worth two in the bush.'
    ERdNZC0WYw==

Slight corruptions at different points lead to wildly different (but meaningful) output:

    $ ./ts_sms.exe d -F base64 FRdNZC0WYw==
    Dionis Ellison

    Dionis Ellison is an American film director,
    $ ./ts_sms.exe d -F base64 ERcNZC0WYw==
    A preliminary assessment of an endodontic periapical fluor
    $ ./ts_sms.exe d -F base64 ERdNYC0WYw==
    A bird in the hand and love of the divine
    $ ./ts_sms.exe d -F base64 ERdNZC1WYw==
    A bird in the hand is worth thinking about
    $ ./ts_sms.exe d -F base64 ERdNZD0WYw==
    A bird in the hand is nearly as big as the human body
    $ ./ts_sms.exe d -F base64 ERdNZC0wYw==
    A bird in the hand is worth something!
    
    Friday
    $ ./ts_sms.exe d -F base64 ERdNZC0XYw==
    A bird in the hand is worth two studies

yalok 1 year ago |

What’s the size of the model used here?

GaggiX 1 year ago | |

The model used is RWKV 169M v4.

jonplackett 1 year ago |

Would this also work for video encoding using something like Sora?

Get Sora to guess the next frame and then correct any parts that are wrong?

I mean, it would be an absolutely insane waste of power, but maybe one day it’ll make sense!

MPSimmons 1 year ago |

Cool, now all I need is the tiny encoded message and a 7+B weight model, plus some electricity.

This is more like a book cipher than a compression algorithm.

mlok 1 year ago |

LLMs, and now this, make me think of the (non-existent) "Sloot Digital Coding System" that could be viewed as a form of "compression".

https://en.m.wikipedia.org/wiki/Sloot_Digital_Coding_System

bongodongobob 1 year ago | |

I view it as a form of fraud. There's no way that worked or could have worked.

dekhn 1 year ago | | |

Perhaps not literally, but you can easily imagine training an embedding on a large amount of existing video, and then delivering somebody "the point in space that decodes to the video with the least residual compared to the original".

Conceptually, most modern movies are just linear combinations of basis tropes (tvtropes.org).

mlok 1 year ago | | |

Yes that is why I specified it was non-existent. But the idea behind it is in the same vein somehow. Maybe what Sloot envisioned was something similar to LLMs.

perching_aix 1 year ago | | |

I thought I had come up with something with a similar performance once. Then a couple hours later I realized that I just (still) suck at combinatorics :)

RandomThoughts3 1 year ago |

It’s a very clever idea.

I could see it becoming very useful if on device LLM becomes a thing. That might allow storing a lot of original sources for not much additional data. We might be able to get an on device chat bot sending you to a copy of Wikipedia/reference material all stored on device and working fully offline.

zamadatix 1 year ago | |

If mobile phone conversations over the last 2 decades have taught me anything it's that people talk about anything but battery life and ultimately the crowd ends up doing "whatever means I don't have to put it on the charger twice a day". Especially when the base iPhone SE already has enough storage to fit more text than one could read in their life anyways.

lxgr 1 year ago | |

If you like that idea, give Kiwix a try! Best ~60 GB I have stored on my phone :) And it comes in handy more often than initially expected.