The essence of Reed-Solomon coding

The essence of Reed-Solomon coding (mazzo.li)

168 points by rostayob 3 years ago | 41 comments

I've found that, just as with CRCs, there's an abundance of articles that show the theoretical explanation of RS, but aren't much help for those wanting to actually implement it. Here's a good practical explanation of implementing RS, including the GF operations: https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_f...
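For anyone who wants to see what "the GF operations" amount to in practice, here is a minimal sketch of GF(2^8) multiplication and inversion. It assumes the reducing polynomial 0x11D and the generator 2, which is one common choice but not the only one; the function names are mine, not taken from the linked articles.

```python
# GF(2^8) arithmetic sketch. Assumption: reducing polynomial 0x11D
# (x^8 + x^4 + x^3 + x^2 + 1) with generator 2; other fields are possible.
PRIM_POLY = 0x11D


def gf_mul_slow(a: int, b: int) -> int:
    """Shift-and-add multiply with reduction, no tables needed."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:            # degree 8 reached: reduce
            a ^= PRIM_POLY
        b >>= 1
    return result


# Build exp/log tables so multiplication becomes an addition of logarithms.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x = gf_mul_slow(_x, 2)
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]      # doubled table avoids a modulo in gf_mul


def gf_mul(a: int, b: int) -> int:
    """Table-based multiply."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_inv(a: int) -> int:
    """Multiplicative inverse; zero has none."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    return EXP[255 - LOG[a]]
```

Addition and subtraction are both plain XOR, so these two functions plus XOR are enough to build the encoder's matrix arithmetic on top.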

corsix 3 years ago

Alternatively, the following pair of articles, the first of which is already referenced as a footnote in the OP: http://www.corsix.org/content/galois-field-instructions-2021... http://www.corsix.org/content/reed-solomon-for-software-raid

nayuki 3 years ago

Using Reed-Solomon coding to recover from erasures (at known positions) is relatively straightforward. It can be understood with a basic knowledge of linear algebra and finite field arithmetic.
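To make the erasure case concrete, here is a rough sketch that treats the data as values of a polynomial and rebuilds erased symbols by Lagrange interpolation. It assumes the prime field GF(257) purely to keep the arithmetic readable (real implementations normally use GF(2^8), and a parity symbol here can take the value 256, which does not fit in a byte). The names `encode`, `recover` and `interpolate_at` are hypothetical.

```python
# Erasure recovery sketch over the prime field GF(257) (an assumption made
# for readability; production code usually works in GF(2^8)).
P = 257


def interpolate_at(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Systematic encoding: data at x = 0..k-1, parity at x = k..n-1."""
    k = len(data)
    points = list(enumerate(data))
    return list(data) + [interpolate_at(points, x) for x in range(k, n)]


def recover(shards, k):
    """Rebuild the k data symbols; erased shards are marked with None."""
    known = [(x, y) for x, y in enumerate(shards) if y is not None][:k]
    if len(known) < k:
        raise ValueError("need at least k surviving shards")
    return [interpolate_at(known, x) for x in range(k)]


data = [72, 101, 108, 108, 111]
code = encode(data, 8)            # 5 data symbols + 3 parity symbols
damaged = list(code)
damaged[0] = damaged[3] = damaged[6] = None   # three known erasures
restored = recover(damaged, 5)    # equals `data` again
```

Any 5 of the 8 symbols are enough here because a degree-4 polynomial is fixed by 5 points, which is the linear-algebra fact the comment refers to.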

Using RS for error correction (at initially unknown positions) is quite difficult. I wrote a step-by-step guide on it including demo code, and it doesn't even cover the most efficient decoding algorithm (I used PGZ instead of Berlekamp-Massey): https://www.nayuki.io/page/reed-solomon-error-correcting-cod...

Hanschri 3 years ago

To understand how error correction works and to learn more about Hamming codes & Reed-Solomon, 3Blue1Brown and Ben Eater were invaluable. They are among the best educational content creators in their fields, mathematics and computer engineering respectively.

I would strongly recommend anyone interested in the topic to check out any of these videos:

How to send a self-correcting message (Hamming codes): https://www.youtube.com/watch?v=X8jsijhllIA

Hamming codes part 2, the elegance of it all: https://www.youtube.com/watch?v=b3NxrZOu_CE

And any of Ben Eater's five videos on error correction: https://eater.net/crc

As an aside, Ben Eater does all of his videos and demonstrations using an 8-bit computer he has built step by step in videos on a breadboard. Very impressive and inspiring.

aortega 3 years ago

One important stage of error correction is spreading, which uses a pseudo-random function to 'spread' the damaged data over several blocks that can then be completely recovered, instead of concentrating the damage in a single block (as often happens with erasures) that cannot be recovered.

userbinator 3 years ago

Optical media like CDs and DVDs use a similar interleaving scheme, allowing for large amounts of errors to be corrected; in fact, even during normal operation with what appears to be a completely clean and perfect disc and drive, read errors are always occurring but corrected silently so there is no loss of data:

https://en.wikipedia.org/wiki/Cross-interleaved_Reed%E2%80%9...

https://en.wikipedia.org/wiki/Optical_disc#Surface_error_sca...

nayuki 3 years ago

By spreading, I think you're referring to interleaving https://en.wikipedia.org/wiki/Burst_error-correcting_code#In... , which doesn't need randomness.
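A small sketch of what deterministic interleaving looks like, assuming equal-length codewords and a simple block interleaver (real systems such as CIRC on CDs use more elaborate arrangements). The function names are made up for illustration.

```python
def interleave(codewords):
    """Send symbol 0 of every codeword, then symbol 1 of every codeword, ..."""
    length = len(codewords[0])
    return [cw[i] for i in range(length) for cw in codewords]


def deinterleave(stream, count):
    """Undo interleave(): codeword c owns every count-th symbol from offset c."""
    return [stream[c::count] for c in range(count)]


codewords = [list("AAAAAA"), list("BBBBBB"), list("CCCCCC"), list("DDDDDD")]
stream = interleave(codewords)

# A burst that wipes out 4 consecutive symbols on the channel...
damaged = list(stream)
damaged[9:13] = ["?"] * 4

# ...lands as a single bad symbol in each of the 4 codewords.
received = deinterleave(damaged, 4)
```

No randomness is involved: the receiver only needs to know the number of interleaved codewords.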

Pseudo-random scrambling is used in places like https://en.wikipedia.org/wiki/64b/66b_encoding to reduce the chances of many consecutive 0s or 1s which could cause clock desynchronization between the sender and receiver.
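For contrast with interleaving, here is a sketch of a self-synchronizing scrambler using the polynomial 1 + x^39 + x^58 that 64b/66b is usually described with. It works on a plain list of bits and ignores the sync headers and framing, so treat it as an illustration of the idea rather than a conforming implementation; the all-ones initial state is an arbitrary assumption.

```python
MASK = (1 << 58) - 1   # 58 bits of history


def scramble(bits, state=MASK):
    """out[n] = in[n] ^ out[n-39] ^ out[n-58]; state holds past outputs."""
    out = []
    for b in bits:
        s = b ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(s)
        state = ((state << 1) | s) & MASK
    return out


def descramble(bits, state=MASK):
    """in[n] = rx[n] ^ rx[n-39] ^ rx[n-58]; state holds past received bits."""
    out = []
    for s in bits:
        b = s ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(b)
        state = ((state << 1) | s) & MASK
    return out


zeros = [0] * 200                 # a long run that would starve clock recovery
on_the_wire = scramble(zeros)     # now contains transitions
back = descramble(on_the_wire)    # equals `zeros`
```

Because the descrambler's state is made only of received bits, it recovers by itself after 58 bits even if it starts from the wrong state.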

rogers18445 3 years ago

Usually, you would see RS used in a setting where any error is going to be an erasure (non-erasure error -> failed checksum -> erasure).

Then you use something like LDPC codes for in-packet ECC, and RS for multi-packet ECC.
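The "failed checksum -> erasure" step can be sketched like this, assuming a CRC-32 trailer on each packet (the packet layout and function names are hypothetical, not any particular protocol's format):

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return the payload, or None to mark the packet as an erasure."""
    payload, trailer = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload


good = make_packet(b"shard 3 of 8")
corrupt = bytearray(good)
corrupt[2] ^= 0x10                      # a single flipped bit in transit

received = [check_packet(good), check_packet(bytes(corrupt))]
# received[1] is None, so the RS layer sees a missing shard at a known position
```

The RS decoder across packets then only has to solve the easier erasure problem.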

less_less 3 years ago

RS codes are also good for contexts where you don't know where the errors are. The codes are always maximum distance separable, and the algorithms to locate the errors are quite efficient.

In my experience, LDPC is usually better for soft-decision decoding, where you have information about how reliable each bit is. RS is usually better for hard-decision decoding, where you don't know anything about the error locations. Also, LDPC is usually bit-oriented, and RS is always a symbol-oriented code, so RS works well for burst errors.

abdullahkhalids 3 years ago

Is there a resource which lists out different codes and the decoding algorithms that work with them?

terramars 3 years ago

I implemented the RS encoder that ended up being used on the production satellite bus for the telemetry system in Space Systems/Loral's upgraded bus during my engineering co-op in school! All in Verilog. It's complicated, especially the decoder, but understandable with time. The turbo codes and more modern extensions that enable LTE / 5G and other low-power / small-antenna applications are absolute black magic though. Such a cool field!

jl2718 3 years ago

Similar story. My first job assignment was to do a comparison of hardware for the EUMETSAT METOP ground station. This was 2004, and the leading options were custom hardware ASICs in the $1-10M range. There was a snowstorm and I had ridden my bike to work, and actually got hit by a car on the way, so I didn’t really feel like trying to make it home, and $10M seemed like a lot of money when I didn’t even know how fast my new Pentium 4 desktop could do it. I slept in my office for the next three days while most of the lab took snow days, trying to write the decoder in C.

How hard could it be, right? I’d never taken a CS course, and assumed it was just a little bit of bit-banging to decode a structure or whatever. It was, until I came across this error-correction thing called Reed-Solomon. I knew about CRC, and thought that was the end of the story, but this used a bunch of totally unintelligible ‘new math’ called Galois fields. It just happened to have been invented at the place I was working at, and by the way, the guy who gave me the assignment was named Solomon (unsure of any relation to Gustave), and I felt I was setting myself up for great embarrassment.

Well anyway, this place had a great library, so I read the original paper, still getting nowhere, and then a book on the subject, also unintelligible, except that it included lots of very detailed pseudocode. When the lab returned, I had something working really slowly, but the hardware engineer thought it was cool enough to keep going, and built a PCI DAQ card to read the I/Q inputs directly into the computer. Eventually I made it fast enough, on the order of 10Mb, and that’s how NOAA made our weather reports during the decade when NPOESS VIIRS was delayed. I adapted it to VIIRS too, but only stuck around long enough to see it used in testing.

I was very proud of myself, and was recognized with an award of zero dollars and the non-exclusive right to post a reply about it on Hacker News when the topic finally came up.

tromp 3 years ago

This is also the essence behind Shamir's Secret Sharing [1], where the sample at x=0 is a secret shared by k+t parties, any k of which can recover the secret.

[1] https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing
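A minimal sketch of that idea, assuming a prime field modulo the Mersenne prime 2^127 - 1 and random coefficients from Python's `secrets` module. The names `split` and `combine` are mine, and this is an illustration rather than something to protect real secrets with.

```python
import secrets

P = 2**127 - 1   # a Mersenne prime; any prime larger than the secret works


def split(secret: int, k: int, n: int):
    """Hide `secret` at x=0 of a random degree-(k-1) polynomial; hand out n points."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]

    def f(x):
        y = 0
        for c in reversed(coeffs):   # Horner's rule
            y = (y * x + c) % P
        return y

    return [(x, f(x)) for x in range(1, n + 1)]


def combine(shares):
    """Lagrange-interpolate the polynomial at x=0 from k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret


shares = split(123456789, k=3, n=5)           # k + t = 5 parties
recovered = combine([shares[0], shares[2], shares[4]])
```

The difference from RS is mostly in intent: the extra coefficients are random rather than data, so fewer than k shares reveal nothing about the value at x=0.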

aortega 3 years ago

There are many types of Reed-Solomon codes, each with a different amount of redundancy. They are a special case of BCH codes, a kind of code that approaches the Shannon limit (the maximum rate of error-free data that can be transferred over a noisy channel). The data rate of a channel (like radio, or optical fiber) is limited not so much by the medium as by the noise it has. Optical fiber has almost zero noise, which is why it is so fast.

There are much better codes nowadays that are closer to the Shannon limit, like LDPC, or convolutional codes. But they are usually much more computationally intensive. They are used in space probes, where computation time doesn't matter but you often have channels with much more noise than signal.
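For reference, the Shannon limit for a channel with additive white Gaussian noise is C = B * log2(1 + S/N). A quick sketch, with the function name and example numbers chosen by me for illustration:

```python
import math


def shannon_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Upper bound on error-free bits per second for an AWGN channel."""
    return bandwidth_hz * math.log2(1 + snr_linear)


def db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio in dB to a plain power ratio."""
    return 10 ** (snr_db / 10)


# Signal as strong as the noise (0 dB): one bit per second per hertz.
equal_power = shannon_capacity(1e6, db_to_linear(0))

# Signal 20 dB *below* the noise: still a nonzero capacity, just a small one.
buried = shannon_capacity(1e6, db_to_linear(-20))
```

The second case is the deep-space situation: capacity never drops to zero, but getting close to it takes a strong code.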

I keep a repo of C implementations of several error-correction codes, including Reed-Solomon, that can be used as standard Unix filters like gzip: https://github.com/ortegaalfredo/eccchain

vmilner 3 years ago

LDPC coding (combined with BCH coding) is now used for most Western European digital terrestrial and satellite TV, in the DVB-T2 and DVB-S2 standards.

vbuterin 3 years ago

Shameless plug of my own explanation on how binary fields work:

https://vitalik.ca/general/2019/05/12/fft.html

Ctrl+F for "binary fields".

nsteel 3 years ago

I think my first exposure to this was Parchive files from Usenet. That feeling when your 3-month download of some "huge" 700MB ISO was corrupt, you load up QuickPar and suddenly (quite a few minutes later) it's all fixed! No idea how it worked at the time, just magic.

klodolph 3 years ago

Some additions, as "exercise for the reader":

1. The finite field you choose has a minimum size. What is the minimum size field 2^bits for an RS(N,K) coding system? What happens when you try to construct a Reed-Solomon code with a finite field that is too small?

2. Consider a Reed-Solomon coding system which uses a lookup table for the finite field multiplication operation that fits in L1 cache. Given that the table already fits in L1 cache, how could you make the encoder/decoder faster, if you had a smaller finite field?

nayuki 3 years ago

> What is the minimum size field 2^bits for an RS(N,K) coding system?

Your field size must be at least N+1, noting that you shouldn't use the value 0 in the encoding matrix.

> What happens when you try to construct a Reed-Solomon code with a finite field that is too small?

Your system of linear equations doesn't have enough linearly independent equations.
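One way to see the loss of independence: in a Vandermonde-style encoding matrix the determinant is the product of differences between evaluation points, so once the field runs out of distinct points, two of them collide and the determinant is zero. A sketch over a tiny prime field (GF(7) is an assumption for readability; RS normally uses GF(2^bits)):

```python
def vandermonde_det(xs, p):
    """Determinant of the Vandermonde matrix on points xs, modulo prime p."""
    det = 1
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            det = det * (xs[j] - xs[i]) % p
    return det


P = 7
fits = list(range(1, 7))       # 6 distinct nonzero points: all GF(7) has
too_many = list(range(1, 9))   # 8 points: 8 wraps around to 1 modulo 7

ok = vandermonde_det(fits, P)           # nonzero, rows are independent
broken = vandermonde_det(too_many, P)   # zero, two rows are identical
```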

> how could you make the encoder/decoder faster

Maybe put the lookup table in a 64-bit general-purpose register and use bit shifting, or in a 128/256/512-bit SIMD register and use extraction instructions (shuffle bytes, etc.).
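The SIMD variant usually relies on multiplication being linear over XOR: split each byte into two 4-bit halves and use two 16-entry tables per constant, which is what a byte-shuffle instruction can look up in one go. Below is a scalar Python emulation of that trick, assuming GF(2^8) with polynomial 0x11D; the function names are hypothetical and no actual SIMD is involved.

```python
PRIM_POLY = 0x11D   # assumed reducing polynomial for GF(2^8)


def gf_mul(a: int, b: int) -> int:
    """Reference shift-and-add multiply in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def nibble_tables(c: int):
    """Two 16-entry tables: c * low_nibble and c * (high_nibble << 4)."""
    low = [gf_mul(c, x) for x in range(16)]
    high = [gf_mul(c, x << 4) for x in range(16)]
    return low, high


def mul_bytes_by_constant(data: bytes, c: int) -> bytes:
    """Multiply every byte by c using only the two small tables."""
    low, high = nibble_tables(c)
    return bytes(low[b & 0x0F] ^ high[b >> 4] for b in data)


scaled = mul_bytes_by_constant(bytes(range(256)), 0x53)
```

Each table is 16 bytes, so both fit in one 128-bit register, and a vector unit can process 16 or more input bytes per pair of lookups.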

shachaf 3 years ago

Hmm, do you mean N+K+1 (to have enough points for both the data and parity shards)? Why isn't N+K sufficient, fitting a polynomial to N points and emitting K more?

sizzzzlerz 3 years ago

On its two Voyager missions, NASA used (still uses!) an RS encoder/decoder that can correct up to 16 symbol (byte) errors per codeword. It became a standard for a number of subsequent missions. This was in an era before ASICs and FPGAs, yet the hardware still had to meet strict size, weight, and power requirements. Not to mention maintaining comms over distances greater than 14 billion miles and received signal strengths of 10^-16 watts. Truly humbling.
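The figure of 16 follows from the code parameters, assuming the commonly cited RS(255,223) code with 8-bit symbols: an RS(n,k) code corrects up to (n-k)/2 symbol errors at unknown positions, or n-k erasures at known positions. A small sketch (the helper name is mine):

```python
def rs_params(n: int, k: int, symbol_bits: int = 8) -> dict:
    """Basic capabilities of an RS(n, k) code."""
    parity = n - k
    return {
        "parity_symbols": parity,
        "correctable_errors": parity // 2,   # positions unknown
        "correctable_erasures": parity,      # positions known
        "code_rate": k / n,
        "block_bits": n * symbol_bits,
    }


voyager_like = rs_params(255, 223)
```

Since a corrupted symbol counts once no matter how many of its bits are wrong, a burst that stays within 16 bytes is fully correctable.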

est 3 years ago

I think Reed-Solomon should be considered in future network protocol designs to combat censorship. Every byte should be demuxed into bits and transferred in independent data streams, so MITM boxes can only intercept incomplete streams, and aggregating the streams back into the original would be insanely difficult. Let transport layers do only one job and not distinguish whatever the content inside might be.

Currently H2 does support M:N stream muxing but popular browsers only support N:1 mode.

vlovich123 3 years ago

It’s a comparatively expensive operation (CPU and memory) compared with just encrypting the information which also blinds the network operator to the same extent. Unless you’re saying that you’d send the stream across multiple disparate networks. But if you’re able to get packets out of one, what’s stopping you from getting the whole stream out that network?

est 3 years ago

> comparatively expensive operation (CPU and memory)

Which is good, because it means a higher cost for middleboxes.

> But if you’re able to get packets out of one, what’s stopping you from getting the whole stream out that network?

It's practically impossible, unless the MITM box is set up very close to both ends, at the edge. In the real world, packets are routed slightly differently and the server might have several IPs or CDNs, so a middlebox placed in the backbone will be useless, as packets are transferred out of order and not in the same stream.

> just encrypting the information which also blinds the network operator

Yes, but the network operator can be sure that all the information is inside one exact stream, just with a thick layer of protection. State-of-the-art classifiers are able to match metadata patterns to individual websites, so protocol designers then have to spend a huge amount of time fighting that. You either have a very fast TTFB protocol, or you have to add some padding redundancy (noise) to disguise the metadata. By metadata I mean packet length and frequency patterns.

not2b 3 years ago

That seems an inferior approach to just using encryption.

vlovich123 3 years ago

I really wish physical and OS network stacks would be able to give you the ability to send uncorrected bit streams. That way you can tune the error rate at the application level that makes sense. For example, with video streaming, you probably don’t need much error correction on the data stream as periodic i-frames would correct any transient glitches (you’d only bother to EC the control headers for the video). Then WiFi networks maybe wouldn’t have to be as careful about time multiplexing all the coex streams and some noise due to conflicts would be fine and not require retransmission (because the application layer could handle it).

This does come with tradeoffs (eg it may take your application longer to recover from the noise than a quick retransmit at the physical layer).

From a cost perspective it’s also maybe impractical because the computer industry gets efficiency gains by solving a problem for everyone at some quality threshold by giving up optimality for applications that could do something with it. Also you would still need to correct the control layer of the network (IP + MAC) just to make it work at all so it may be a wash (ie the incremental cost of correcting the data vs control + data may be insignificant).

Still, at least having the option as a switch that could be flipped for experimentation purposes would be quite neat to allow the curious to find new techniques / layers of abstractions vs what’s orthodoxy today.

agsamek 3 years ago

Reed-Solomon is the foundation of today's computing. It is used in data storage (HDD, SSD) and in data transfer protocols. It allows a reliable system to be built on top of unreliable real-life phenomena with a desired level of certainty. This is such incredible tech that, once implemented, we can just forget about it in the higher-level abstractions.

QuinnyPig 3 years ago

This is, for example, how Amazon S3 works.

kccqzy 3 years ago

It's also how GCS (and the underlying Colossus system) works.

https://news.ycombinator.com/item?id=11713406

nayuki 3 years ago

This is how Backblaze works. https://www.backblaze.com/blog/reed-solomon/ , https://github.com/Backblaze/JavaReedSolomon , https://news.ycombinator.com/item?id=9726890

markc 3 years ago

And Sia distributed storage network https://gitlab.com/NebulousLabs/Sia/-/blob/master/modules/er...

aortega 3 years ago

Also, how RAID works, and how ECC memory works.
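The simplest instance of this is single-parity RAID, where the parity block is just the XOR of the data blocks. As far as I know, RAID 6 adds a second, Reed-Solomon-style syndrome on top of it. A sketch of the XOR case only, with made-up block contents:

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


disks = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(disks)

# Disk 1 dies; XOR of the survivors and the parity block brings it back.
rebuilt = xor_blocks([disks[0], disks[2], parity])
```

XOR parity can only repair one missing block per stripe, which is why tolerating two failures needs the extra finite-field arithmetic.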