Not sure how this works for larger data structures, but my first thought was that this should be implemented as microcode or a dedicated instruction.
Most computation is not that jitter-sensitive, and perception does not really operate at the nanosecond-to-microsecond scale, but it could be a cool gadget for things like dtrace or interrupt handlers.
But otherwise, nice work tying all the concepts together. You might want to get some better model trains though.
From a narrative standpoint, I agree it makes more sense to focus on a duplicated lookup table where the fastest read wins. From an engineering standpoint, however, framing it in terms of channel-de-correlated reads opens up more possibilities. For example, if you need to evaluate multiple parallel ML models to get a result, then by intentionally partitioning your models by channel you could ensure that a given model reads only fast data or only slow data. ML models might not be the most interesting example, though, since they are good candidates for staying resident in L3.
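To make the "duplicated lookup table, fastest wins" idea concrete, here is a rough sketch of the control flow only. It is not how the original project does it: threads and `time.sleep` stand in for per-channel memory latency, and `make_replicas`, `delayed_read`, `hedged_read`, and the delay values are all hypothetical names and numbers made up for illustration.

```python
import concurrent.futures as cf
import time


def make_replicas(table, n):
    """Return n independent copies of the table (stand-ins for per-channel copies)."""
    return [list(table) for _ in range(n)]


def delayed_read(replica, index, delay_s):
    """Read one entry after a simulated stall (assumption: sleep models channel latency)."""
    time.sleep(delay_s)
    return replica[index]


def hedged_read(replicas, index, delays):
    """Issue the same read against every replica; return (value, winning replica index)."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [
        pool.submit(delayed_read, rep, index, d)
        for rep, d in zip(replicas, delays)
    ]
    # Block only until the first replica answers.
    done, _pending = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    # Do not wait for the slower replicas; their results are simply discarded.
    pool.shutdown(wait=False)
    winner = next(iter(done))
    return winner.result(), futures.index(winner)


if __name__ == "__main__":
    table = [10, 20, 30]
    replicas = make_replicas(table, 2)
    # Replica 0 is "stalled", replica 1 answers immediately.
    value, winner = hedged_read(replicas, 1, delays=[0.2, 0.0])
    print(value, winner)
```

The point of the sketch is just that the caller's latency is the minimum over the replicas rather than the latency of any single copy, at the cost of doing every read N times.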
But practically speaking, in a real application, isn’t any performance benefit going to be lost to the reduced cache hit rate caused by the larger working set? Or are the reads of all but one of the replicas non-cached?
Apologies if I am missing something.
Additionally, you are going to starve every other thread/process of memory bandwidth because you are hogging all the memory channels, and you are making an already bad L3 cache situation worse.
Outside of extremely niche realtime use cases (which would generally fit in L3 cache) I can’t see how this would improve overall throughput, once you take into account other processes running on the same box.
Do you have an example use case?
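The working-set worry raised in the comments above can be put in rough numbers. This is only back-of-the-envelope arithmetic, assuming every replica competes for cache like ordinary data (which is exactly the open question about non-cached reads); the function names and the sizes used are hypothetical.

```python
MIB = 1024 * 1024


def replicated_working_set(table_bytes, replicas):
    """Total bytes touched if every replica is read through the cache."""
    return table_bytes * replicas


def fits_in_cache(table_bytes, replicas, cache_bytes):
    """True if the replicated table could, at best, stay resident in the cache."""
    return replicated_working_set(table_bytes, replicas) <= cache_bytes


if __name__ == "__main__":
    table = 8 * MIB   # hypothetical lookup table size
    l3 = 32 * MIB     # hypothetical L3 size, not a figure from the thread
    for n in (1, 2, 4, 8):
        print(n, replicated_working_set(table, n) // MIB, fits_in_cache(table, n, l3))
```

With these made-up sizes a single copy fits comfortably, but eight copies need twice the cache, so the answer depends heavily on whether the extra replicas bypass the cache.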
Source: not only do I have an R720xd (and two regular R720s), I also checked the Intel Xeon E5-2600 v2 reference manuals.
OT: Tail Slayer. Not Tails Layer. My brain took longer to parse that than I’d have wanted.
The one that comes to mind is HPC, where you avoid over-allocating the physical cores. If the process has the whole node to itself for a brief period, inefficient memory access might have a bigger impact than memory starvation.
IBM also has RAID-like memory for its mainframes that might be able to do something similar. This feels like software-implemented RAID-1.