Short Message Compression Using LLMs(bellard.org) |
Short Message Compression Using LLMs(bellard.org) |
So for instance:
> In my pasta I put a lot of [cheese]
LLM top N tokens for "In my pasta I put a lot of" will be [0:tomato, 1:cheese, 2:oil]
The real next token is "cheese" so I'll store "1".
Well, this is neat, but also very computationally expensive :D So for my small ESP32 LoRa devices I used this: https://github.com/antirez/smaz2 And so forth.
It's not just natural language that could be compressed this way, either. Code (HTML, JS, etc) could be compressed with the same technique/models. I bet that the same general idea could work for image compression as well, using an image/diffusion model (or perhaps a multimodal model for everything.)
This could lead to an entire internet of content using just a few bits.
That's not new to anyone familiar with compression or information theory, but the novelty here is the LLM itself. It's absolutely plausible that, given an already highly compressed relationally-encoded context like a trained LLM, very few bits could be communicated to communicate very abstract and complex ideas, letting the LLM recontextualize information which has been compressed across several semantic and contextual layers, effectively leveraging a complete (but lossy) history of human knowledge against every single bit of information communicated.
Probably decoding few tokens per second is fast enough to deliver more goodput than the existing uncompressed usage.
You could adjust tokens towards what's more statistically probable, and therefore more compressible (in your example, it'd be picking tomato instead of cheese)
Compare also with commercial code [1], a close historical analog, albeit with handcrafted, as opposed to ML-derived, compression tables. (There was a single code point for "twins, born alive and well, one boy and one girl", for example! [2])
[1] https://en.wikipedia.org/wiki/Commercial_code_(communication...
This would be much nicer than text-in-image steganography because services often alter images before displaying them, but they rarely do that to text (assuming the usual charset and no consecutive whitespace).
The idea seems similar enough that I wanted to share. The same way you can hide information in the text to prove it was generated by a specific model and version, of course you can use this for secrets as well.
So it's cryptography. With a shared dictionary. Basically just ECB, though with an unbelievably large and complicated code book.
The same goes for all the other higher-order probability models, which are used in what is currently the best known compression algorithm:
https://en.wikipedia.org/wiki/PAQ
LLMs are just another way to do the probability modeling.
The brotli comparison is IMHO slightly misleading. Yes, it "embeds a dictionary to optimize the compression of small messages", but that dictionary is a few orders of magnitude smaller than the embedded "dictionary" which is the LLM in ts_sms.
There's a reason the Hutter Prize (and the demoscene) counts the whole data necessary to reproduce its output. In other words, ts_sms took around 18 bytes + ~152MB while brotli took around 70 bytes + ~128KB (approximately size of its dictionary and decompressor.)
For example, antirez mentioned LoRa in the earlier thread - that's a cheap, license-free radio, which achieves a large range at the expense of low rate (250 bit/sec). That's 30 bytes/second, not including framing overhead and retransmission.
If you wanted to build a communication system out of those, this compression method would be great. You'd have LORA device that connects to a regular cell phone and provides connectivity, and all the compression/decompression and UI happens on the cell phone. 150MB is nothing for modern phones, but you'd see a real improvement in message speed.
https://en.wikipedia.org/wiki/Hutter_Prize
From a cursory web search it doesn't appear that LLMs have been useful for this particular challenge, presumably because the challenge imposes rather strict size, CPU, and memory constraints.
> The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities. [1]
It’s also mentioned that the model is configured to be deterministic, which is how I would guess the decompression is able to map a set of token likelihoods to the original token?
Discussed (once more) in a neighbor thread: https://news.ycombinator.com/item?id=42549083
Since my JSON(B) data is fairly repetitive, my bet would be to store some sort of JSON schema in a parent table. I'm storing the response body from a API call to a third-party API, so normalizing it by hand is probably out of the question.
I wonder if Avro can be helpful for storing the JSON schema. Even if I had to create custom PL/SQL functions for my top 10 JSON schemas it would be ok, since the data is growing very quickly and I imagine it could be compressed at least 10x compared to regular JSON or JSONB columns.
[1] https://github.com/citusdata/citus?tab=readme-ov-file#creati... [2] https://cloud.google.com/sql/docs/postgres/extensions
If nothing else, I hope he finds time to write his thoughts into a book at some point.
I wonder if this is at all similar to what Apple uses for their satellite iMessage/SMS service, as that's a domain where it's probably worth spending significant compute on both sides to shave off even a single byte to transmit.
> 뮭䅰㼦覞㻪紹陠聚牊
I've never seen that before. The base64 below it, in contrast, is quite familiar.
For example: https://github.com/qntm/base65536
For short messages in the mobile phone (i.e. GSM/3GPP) sense, which was my first association for "short message compression", it doubt that it works better than just sending binary messages with the appropriate header, but if that's not an option, it might just beat a custom alphabet based on the 7-bit GSM charset [1] (since that allows 100% of possible 7-bit characters to be used, whereas UTF-16 probably has at least some reserved codepoints that might be causing problems).
It's really fun to see what happens when you feed the model keysmash! Each part of the input space seems highly semantically meaningful.
Here's a few decompressions of short strings (in base64):
$ ./ts_sms.exe d -F base64 sAbC
Functional improvements of the wva
$ ./ts_sms.exe d -F base64 aBcDefGh
In the Case of Detained Van Vliet {#
$ ./ts_sms.exe d -F base64 yolo9000
Give the best tendering
$ ./ts_sms.exe d -F base64 elonMuskSuckss=
As a result, there are safety mandates on radium-based medical devices
$ ./ts_sms.exe d -F base64 trump4Prezident=
Order Fostering Actions Supported in May
In our yellow
$ ./ts_sms.exe d -F base64 harris4Prezident=
Colleges Beto O'Rourke voted with Cher ¡La
$ ./ts_sms.exe d -F base64 obama4Prezident=
2018 AFC Champions League activity televised live on Telegram:
$ ./ts_sms.exe d -F base64 hunter2=
All contact and birthday parties
$ ./ts_sms.exe d -F base64 'correctHorseBatteryStaples='
---
author:
- Stefano Vezzalini
- Paolo Di Rio
- Petros Maev
- Chris Copi
- Andreas Smit
bibliography:
$ ./ts_sms.exe d -F base64 'https//news/ycombinator/com/item/id/42517035'
Allergen-specific Tregs or Treg used in cancer immunotherapy.
Tregs are a critical feature of immunotherapies for cancer. Our previous
studies indicated a role of Tregs in multiple
cancers such as breast, liver, prostate, lung, renal and pancreatitis. Ten years ago, most clinical studies were positi
ve, and zero percent response rates
$ ./ts_sms.exe d -F base64 'helloWorld='
US Internal Revenue Service (IRS) seized $1.6 billion worth of bitcoin and
In terms of compressions, set phrases are pretty short: $ ./ts_sms.exe c -F base64 'I love you'
G5eY
$ ./ts_sms.exe c -F base64 'Happy Birthday'
6C+g
Common mutations lead to much shorter output than uncommon mutations / typos, as expected: $ ./ts_sms.exe c -F base64 'one in the hand is worth two in the bush'
Y+ox+lmtc++G
$ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush'
kC4Y5cUJgL3s
$ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush.'
kC4Y5cUJgL3b
$ ./ts_sms.exe c -F base64 'One in the hand .is worth two in the bush.'
kC4Y5c+urSDmrod4
Note that the correct version of this idiom is a couple bits shorter: $ ./ts_sms.exe c -F base64 'A bird in the hand is worth two in the bush.'
ERdNZC0WYw==
Slight corruptions at different points lead to wildly different (but meaningful) output: $ ./ts_sms.exe d -F base64 FRdNZC0WYw==
Dionis Ellison
Dionis Ellison is an American film director,
$ ./ts_sms.exe d -F base64 ERcNZC0WYw==
A preliminary assessment of an endodontic periapical fluor
$ ./ts_sms.exe d -F base64 ERdNYC0WYw==
A bird in the hand and love of the divine
$ ./ts_sms.exe d -F base64 ERdNZC1WYw==
A bird in the hand is worth thinking about
$ ./ts_sms.exe d -F base64 ERdNZD0WYw==
A bird in the hand is nearly as big as the human body
$ ./ts_sms.exe d -F base64 ERdNZC0wYw==
A bird in the hand is worth something!
Friday
$ ./ts_sms.exe d -F base64 ERdNZC0XYw==
A bird in the hand is worth two studiesGet Sora to guess the next frame and then correct any parts that are wrong?
I mean, it would be an absolutely insane waste of power, but maybe one day it’ll make sense!
This is more like a book cipher than a compression algorithm.
Conceptually, most modern movies are just linear combinations of basis tropes (tvtropes.org).
I could see it becoming very useful if on device LLM becomes a thing. That might allow storing a lot of original sources for not much additional data. We might be able to get an on device chat bot sending you to a copy of Wikipedia/reference material all stored on device and working fully offline.
Fabrice also makes some programs you might use, like FFMEG and QEMU
For those unaware, it's a typo. willvarfar meant FFMPEG.
This would be the one sentence that wouldn't cause me to look down on somoene, if used as a third-person humble-brag.
llms are generally large
Using CLP: https://www.uber.com/blog/reducing-logging-cost-by-two-order...
Rewrite the log to extract and group by common identifiers:
I gather that you’d supply the same “seed” during both compression and decompression, and this would reduce the amount of information embedded into the compressed result.
It's easy to implement in LZ-style compressors - it amounts to injecting the dictionary as context, as if it had been previously output by the decompressor. (There's a striking parallel to how LLM prompting works.)
$ ./ts_sms.exe d -F base64 elonMuskSuckss=
As a result, there are safety mandates on radium-based medical devices
LOL!That "decompression" is reminiscent of Terry Davis of TempleOS fame, who had written a random sentence generator that he interpreted as "speaking to God".
> “And from your standpoint, Hamid, there’s one big drawback. The mean bandwidth of this thing [an ansible, more or less] is just under six bits per minute.”
> “Huh? Ten seconds to send a single bit?”
> “Yup. Skandr left three protocols at the Lothlrimarre end: ASCII, a Hamming map to a subset of English, and an AI scheme that guesses what you’d say if you used more bits. The first is Skandr’s idea of a joke, and I wouldn’t trust the third more than wishful thinking.”
(Good advice at the end there.)
One viewer stumbles onto a key insight about the struggle taking place, but they only have evocations so they’re not sure. And they sound like a total kook so everyone ignores them.
I used a word embedding to convert the text to a space where similar tokens had similar semantic meaning, then I modified an ordinary LZ encoder to choose cheaper tokens if they were 'close enough' according to some tunable loss parameter.
It "worked", but was better at producing amusing outputs than any other purpose. Perhaps you wouldn't have considered that working!
In terms of a modern implementation using an LLM, I would think that I could improve the retention of details like that by adapting the loss parameter based on the flatness of the model. E.g. for a date the model may be confident that the figures are numbers but pretty uniform among the numbers. Though I bet those details you want to preserve have a lot of the document's actual entropy.
E.g., making a greeting card with somebody's name spelled correctly.
Incidentally, this technique is actually an old one that has already seen use in magic shows and confidence tricks; two people who wish to communicate covertly, such as a magician and an "audience member", can exchange small amounts of information by varying the wording of simple responses: "yes", "that's right", "quite so". This can be used to, for instance, discreetly reveal the value of a card that the magician is "guessing" through "ESP".
you could hide that as text in other text, and that'd be steganography.
So you'd feed the algorithm some starter text like: "Here's my favorite recipe for brownies", and then you'd give it some data to encode, and depending on which data you gave it, you'd get a different, but "plausible", recipe for brownies. The recipient could reverse the recpie back into numbers, and from that they'd decode the hidden message.
The trick would be balancing the LLM's attempt to make sense against whatever additional constraints came along with your data encoding scheme. If you tried to encode too much cyphertext into a too-short brownies recipe, the recipe would fail to be convincingly a recipe. Conveniently, it's conventional to prefix recipes with a tremendous amount of text that nobody reads, so you've got a lot of entropy to play in.
They have different goals and utilize completely different techniques.
At most lossy techniques leverage lossless techniques (eg to compress non-perceptual binary headers) not the other way round.
This isn't normally what people mean by lossy compression, though. In lossy compression (e.g. mainstream media compression like JPEG) you work out what the user doesn't value and throw it away.
And that still doesn’t show how lossless compression is tied to intelligence. The example I always like to give is, “What’s more intelligent? Reciting the US war of independence Wikipedia page verbatim every time or being able to synthesize a useful summary in your own words and provide relevant contextual information such as it’s role in the French Revolution?”
I would definitely expect something like this to happen at some point. as long as people use LLMs with a non-zero temperature, you'd expect variation from anyone's output, so it'd be super hard to detect / super deniable with just the output.
EDIT: Hm, or maybe ts_zip uses just the token probabilities directly. I thought it was slightly more efficient about it.
"The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities."
There's a pretty straight line from assigning probabilities (to a sequence of tokens) to arithmetic compression as an optimal compression algorithm for that distribution.
If you are going to send the result as is, Huffman coding (with some escape for unusal words(?)) will be better. I think even better than the other method that forgets the probabilities and then tries to compresd it.
IIRC the only reason it wasn't always used for everything is patents. A secondary reason being that it can be slow if you design it without thinking about performance, eg if you use divisions.
These lossless compression algorithms compress a large corpus of English text from an encyclopedia. The idea is that you can compress this text more if you know more about English grammar, the subject matter of the text, logic, etc.
I think you’re distracted by the lossless part. The only difference here between lossy and lossless compression is that the lossy algorithm also needs to generate the diff between its initial output and the real target text. Clearly a lossy algorithm with lower error needs to waste fewer bits representing that error.
(aFutD also had a case of high-stakes suspicion of highly compressed messages during Zone turbulence, as I think the GP meant.)