Thanks!
Most (definitely not all) of the mis-alignment problems were in the network stack, and were centered around the fact that ethernet headers are 14 bytes, while nearly all other protocols had headers that were a multiple of at least 4 bytes.
I've said it before, and I'll say it again: If I had a time machine, I would not kill Hitler. I'd go back to the 70s and make the ethernet header be 16 bytes long, rather than 14.
While you've got the time machine, can you fix it so that "network byte order" and Intel endianness are the same too?
Hindsight is 20/20, and a lot of times people don't appreciate the constraints these old systems had. This was being developed decade before the Commodore 64 came out with its luxurious 64 kilobytes of memory (39k usable).
Also, was a "byte" standardized at the time? Didn't they still have systems working in not-8-bit "byte", nibbles, byte, and 2-byte boundaries?
PowerPC, and really, most non-x86 architectures, do this one way or another.
ARM v6-A and later (except for some microcontrollers, like Cortex M0/R0, that don't support hardware unaligned access at all, triggering a exception) is similar to the Intel x86 case (reference in transparent unaligned memory access -except for SIMD, where x86 can raise exceptions, too, in the case of unaligned load/store opcodes-), where there is hardware support for unaligned memory access.
For software that uses intensive non-aligned data access, e.g. data compression algorithms doing string search, PowerPC, ARM v6-A (and later ARM Application processors), new MIPS with HW support for unaligned memory access, and Intel are pretty much the same (i.e. b = * (uint32_t * )(a + 23) will take 1-2 cycles, not requiring doing a memcpy(&b, a + 23, sizeof(uint32_t))).
For SIMD, though, there is no transparent fix, although there are specific opcodes for aligned and unaligned memory access (e.g. load/store, unaligned load/store).
Which brings me to padding. I wonder what percentage of memory of the average 64-bit user's system is padding? I'm afraid of the answer. The heroes of yesteryear could've coded miracles in the ignored spaces in our data.
This sounds like it could be really exciting. Is there anywhere I can find out more?
Specifically, I've been struggling to find an appropriate backend for HTTP Server-Sent Events, could this feature help with that?
For posterity, here's the referenced videos:
General overview: https://youtu.be/U7J33pd3hLU?t=23m54s
Implementation details: https://youtu.be/Wzy8dIjsY6Y
https://www.reddit.com/r/redis/comments/4mmrgr/stream_data_s...
"Unaligned word or halfword loads or stores add penalty cycles. A byte aligned halfword load or store adds one extra cycle to perform the operation as two bytes. A halfword aligned word load or store adds one extra cycle to perform the operation as two halfwords. A byte-aligned word load or store adds two extra cycles to perform the operation as a byte, a halfword, and a byte. These numbers increase if the memory stalls."
However, multi-word memory instructions (LDRD, STRD, LDM, STM, etc.) always require their arguments to be word-aligned.
Turn on "show dead comments" and see how many greens are deleted. I screenshot many examples.
And to concentrate all my meta in one place... Is shadow banning a thing at HN? I thought they just, well, banned you.
All ARM processors do this. The concept is called "natural alignment" and it's pretty common on non-x86. See e.g. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.... . The problem here is that a lot of code written for x86 wants more than that, e.g. byte addressing for non-byte-wide values.
I'm also the weirdo that feels process isolation, memory management, and I/O mechanisms need a rethink. But that's something that would take me forever to get into.
One thing I will say, though, is alignment issues "infect" everything. Assume your architecture doesn't allow misaligned access. Now, all your data has to be naturally aligned. Your structs now have to be aligned to the alignment of the largest sub-structure within them. This is all because code is alignment sensitive. Given a pointer to a struct, generic code is unnecessarily larger. Any why would we care? Communication, of course. If we're exchanging data between systems then idiosyncrasies such as this suddenly become globally visible.
Endian-ness must be little. Byte-aligment a non-issue, and network-bit order should be from bit zero up, with any upper layer need, say for cut-through forwarding, expressed as a data ordering requirement, so for example an IP4 address is not a blind 32-bit word, but specifies the structure of those 32-bits.
Come to think of it, the Vax was little endian (like Intel).
(The rules are all fairly clearly documented in the Architecture Reference Manual.)
Practically speaking, the world is not going to converge on a single endianness or on a no-alignment-restrictions setup any time soon, so we have to deal with the world as it is. If you're programming in a sensible high-level language, it will deal with this kind of low-level nit for you. If you're programming in a low-level language (like C), well, I think you wouldn't be doing that if you didn't have fun at some level in feeling like you had a mastery of the low-level nits :-)
Why registers? I haven't studied the Tomasulo algorithm in any detail, but if you're going to do "register renaming", why have registers at all? You could, for example, treat memory as a if-needed-only backing store, and then add a "commit" instruction that commits memory (takes an address, or a range). Sure you need to make changes with how you do mm i/o and protection, but at a basic level: why registers?
I'm glad FPGA's are becoming a thing, and I think we're about a decade or two away from ASICs as a service, because if you're not beholden to tradition, you really can work some magic. Of course I'll be pretty rusty by then, but who knows, maybe medicine will keep me feisty.
Instructions on a CPU are something like to following (this is based on MIPS since x86 is a mess) The first 6 bits are the instruction, the rest is command specific. For add the next 12 would be 4 bits for each of the source registers and then the destination register and then various flags (overflow for example).
If instead they only worked on memory they would have a lot more possible instructions - but there isn't enough room on CPUs to design that many instructions anyway so who cares, followed by the all three memory addresses. This means that every CPU instruction needs to read 3 times as much memory before doing anything. Worse, most of those are pointers: when you compile the code you don't know the location of those address, so in most cases it is read the instruction from the program, then go back to the stack to read the address of the next values, then read those locations. That is a lot of memory access and memory access is expensive. Of course as you can say you can just use caching, but cache is expensive and now you need to add 3 times as much - this is too big for the fast level one cache so now you are expanding level two cache and seeing a lot more cache misses in the level one cache.
The above would all be okay, but it turns out that given enough registers (x86 fails here) in most cases you are operating on the same set of values all the time, (indeed the stack locations each of the above is referring too is probably a small set of variables) so if the compiler is careful it can manage all that. The compiler has better information on when things need to be committed to memory anyway so let it handle that.
I'd venture to guess green users could use this ambiguity to play subtle rhetorical tricks on users with a moderate number of points here.