Apple M2 Die Shot and Architecture Analysis – Big Cost Increase and A15 Based IP (semianalysis.substack.com)
But their rate of improvement on the A series has been slowing on general tasks. They’re on the same process node, and only increased frequency a bit.
Is it really that surprising that performance didn’t take a massive jump? You can’t keep up a 20% increase in normal stuff every release for long.
You can use accelerators like they do for video and ML to help some tasks. You can improve your GPU some and make it a little bigger.
It seems like in some places people are trying to push a “the M2 is a failure because it’s not a huge leap above the M1” narrative. But no one expects that from Intel or AMD every year anymore. Or from Apple’s A-series.
So why here?
It's going from N5 to N5P, chosen by Apple over N4.
> But no one [expects] that from Intel or AMD every year anymore.
That's not accurate: a minor performance upgrade after almost 2 years is exactly what Intel has gotten a lot of flak for in recent years. That people are willing to defend it is really exclusive to Apple and their unbeatable marketing.
Zen 4, on an almost identical timeline, will bring ~30-40% more performance, and people were widely disappointed by the announcement of ">15%" S/T, very close to Apple's +18% M/T. Intel will have gone from Rocket Lake to (almost) Raptor Lake, doubling performance.
Any info as to why?
The big M1 numbers came from getting ahead of the rest of the market on 5nm TSMC, and critically from packing everything into the SOC so physical distances were reduced by multiple orders of magnitude (which has already been the case for the A series). That’s been done now, so the low hanging fruit is gone there.
Performance gains from here should be expected to be identical to AMD as they’ll be moving on TSMC’s cadence (it’s AMD who might actually see similar jumps on the low end if they go the Apple route and move everything to the package).
I wouldn’t be surprised if Apple has already started looking to stand up its own fab. They have large and very predictable needs now, and could likely get ahead of ASML’s queue by throwing money and scale at the problem, not least because it would further muddy the waters as to what Apple Silicon actually is, which fits the marketing better.
I don't think that reduces chip power so much as it reduces latency. Apple's "power" here comes entirely from using the 5nm node and refining a stupid-high IPC.
> it’s AMD who might actually see similar jumps on the low end if they go the Apple route and move everything to the package
No? Again, making everything an SOC has advantages and disadvantages, but your raw performance metrics are almost never significantly influenced by the distance between components (unless that distance is large enough to matter). AMD's real advantage will be jumping ahead 1.5 generations at TSMC, and then later it will be an architectural change (e.g. big.LITTLE). I think Apple is the only one interested in shipping computers with SOCs.
But this chip is literally 18% faster “in normal stuff”
I think this expectation of sustained performance gains was a part of some of the more glowing reviews, rather than a narrow evaluation of M1 itself.
Anyway, it wasn’t a reasonable expectation. But I think people expected it anyway.
99.999999% of internet comments on the M2, or on anything hardware related, are pretty much junk. AnandTech used to explain these sorts of things, but as it turns out people aren't interested in in-depth analysis; they just want benchmarks. You end up with them drifting towards mass/mainstream media or LinusTechTips-type content. RealWorldTech doesn't do any of this anymore, partly because there is very little money to be made on the consumer side of things. There are other sites that cover semiconductor engineering and business, but unfortunately every time they are posted on HN, either no one shows any interest or people complain that the content looks too "enterprisey" or "corporate", because it is intended for B2B settings.
It also isn't just consumers or enthusiasts. Viewed from the outside, most people would expect programmers, or what are now called Software Engineers, to have some sort of high-level understanding of hardware. But most developers, especially web developers, are so abstracted from hardware that they don't know or don't care about it.
Broadly speaking this isn't just with hardware, but also every other subject.
I've been hearing that since Skylake, at every single processor generation that the gains are too modest. That's more than a decade at this point.
Performance Cores:
>Apple A15 performance cores are extremely impressive here – usually increases in performance always come with some sort of deficit in efficiency, or at least flat efficiency. Apple here instead has managed to reduce power whilst increasing performance, meaning energy efficiency is improved by 17% on the peak performance states versus the A14. If we had been able to measure both SoCs at the same performance level, this efficiency advantage of the A15 would grow even larger. In our initial coverage of Apple’s announcement, we theorised that the company might possibly invested into energy efficiency rather than performance increases this year, and I’m glad to see that seemingly this is exactly what has happened, explaining some of the more conservative (at least for Apple) performance improvements.
Efficiency Cores:
>The A15’s efficiency cores are also massively impressive – at peak performance, efficiency is flat, but they’re also +28% faster.
The comparison against the little Cortex-A55 cores is more absurd though, as the A15’s E-core is 3.5x faster on average, yet only consuming 32% more power, so energy efficiency is 60% better.
Conclusions:
>In our extensive testing, we’re elated to see that it was actually mostly an efficiency focus this year, with the new performance cores showcasing adequate performance improvements, while at the same time reducing power consumption, as well as significantly improving energy efficiency.
The efficiency cores of the A15 have also seen massive gains, this time around with Apple mostly investing them back into performance, with the new cores showcasing +23-28% absolute performance improvements, something that isn’t easily identified by popular benchmarking. This large performance increase further helps the SoC improve energy efficiency, and our initial battery life figures of the new 13 series showcase that the chip has a very large part into the vastly longer longevity of the new devices.
https://www.anandtech.com/show/16983/the-apple-a15-soc-perfo...
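The "3.5x faster, 32% more power" figures in the quote above can be turned into an energy-per-task number with a little arithmetic. A minimal sketch, assuming a fixed workload and that efficiency is read as energy per task; the function and variable names are illustrative and this is not AnandTech's actual methodology:

```python
def relative_energy_per_task(speedup: float, power_ratio: float) -> float:
    """Energy per fixed task relative to the baseline core.

    Energy = power * time, and a core that is `speedup` times faster
    finishes the same task in 1/speedup of the time.
    """
    return power_ratio / speedup


# Figures from the quote above: 3.5x faster, 32% more power (vs. Cortex-A55).
energy = relative_energy_per_task(speedup=3.5, power_ratio=1.32)
perf_per_watt = 3.5 / 1.32

print(f"energy per task: {energy:.2f}x baseline ({(1 - energy) * 100:.0f}% less)")
print(f"performance per watt: {perf_per_watt:.2f}x baseline")
```

Under these assumptions the E-core needs roughly 0.38x the energy for the same work, which lines up with the quoted "60% better".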
How true is this? If they're on the money it's an excellent example of a talent retention miss leading to a demonstrable mediocrity in delivery.
If they were a CPU “arms dealer” like Intel or AMD it’d matter more I think.
What would Rivos's business model be? I’m genuinely curious; it seems interesting to me.
Would they be positioning themselves as the next Qualcomm?
Or perhaps sell a superior chip to Apple at some point?
Realistically speaking, most likely to be ARM, IMG or CEVA type of IP companies.
But since it is RISC-V, and we are on HN, I won't be surprised if some people expect them to give out their design for free.
I really hate that word “poach”. Using “attract” works much better in that sentence.
I find it appalling how entering into a free contract with someone to give them more money for their work is called “poaching”.
Words matter, and how we describe something has an impact in how it is viewed (“piracy” is another example).
> A group of big tech companies, including Apple, Google, Adobe, and Intel, recently settled a lawsuit over their "no poach" agreement for $324 million. The CEOs of those companies had agreed not to do "cold call" recruiting of each others' engineers until they were busted by the Department of Justice, which saw the deal as an antitrust violation. The government action was followed up by a class-action lawsuit from the affected workers, who claimed the deal suppressed their wages.
On poaching though, Apple could choose to respond to the market signal by improving their work culture and policies to retain talent. I do wonder about the supply for these highly skilled hardware engineers though.
How many really move from Apple to Amazon for the work culture?
The Apple Product Cycle (https://misterbg.org/AppleProductCycle.html)
still sums it all up pretty well.
I would love to live in world where 10x engineers are rewarded 10x. Right now it's 25% better pay than median.
With tech it's a bit of a catch 22, most engineers really become effective after 18 to 24 months. At this point you know how to really get things done in your org.
But after 2 years you can job hop and make significantly more, so your interest might not align 100% with the company's
Then I won't have the energy for a few months.
Probably evens out to an employer, but imo you do better when the snowball is rolling longer without breaks.
That's the secret to 10x. ADHD and don't stop. Stopping is the enemy, you can't see the forest.
I hear regulation is nice, but tell my mind that.
Been at my employer for 10 years making cool shit if that matters.
With every SoC I keep my eyes open for MTE being used in a mainstream ARMv8.5 processor... And if we're to believe the M3 is slated to use ARMv9 as well, maybe 2024 is the year?
> The overall performance gains are quite disappointing when you factor in the raw cost increase that comes with this new M2 and the fact that it has been nearly 2 years since the M1’s introduction.
Also, the logic of the article's title is a little weird to me. M1 was introduced in the same year as A14, and they use the same core, while M2 uses the same core as A15, which was introduced 1 year after M1. So technically M2 increased performance by 18% in one year, not two years.
Though I'm curious why Apple didn't use A16's core in M2.
Probably the smaller process node. There's low capacity and low yields for the first year or two of the smaller node. It might not be an issue for the base-level M2s, but they'll be expected to update the Pro/Max/Ultra line up as well in the next 8 months which have much larger die sizes and they'd end up throwing away most of the wafer.
Available volume on the new node will be much smaller, so they had to prioritize. This is likely why only the iPhone pro will get the A16.
We will know soon enough. My guess is that A16 is designed with TSMC 3nm in mind, which is why (rumour) only the new iPhone Pro will get A16, and iPhone 14 will stick to A15.
These days I care more about which TSMC process node my chips came from than which company designed them. I need a new computer but I'm waiting until next year because there will be a wave of new CPUs and GPUs coming out with much better performance. Better designs? Maybe a little, but it's really because they're all moving to TSMC N4.
I really hope Pat Gelsinger can save Intel's fab business because we really need another company that can compete in fabs and Samsung isn't doing too hot either.
The fixation on the fab process is bewildering. Yes, it does help, but it is also an optimisation step that is decoupled from and that bears no relevance on the chip design. Yes, the smaller node size also brings the increased density along and an increased number of things that can be whacked into the same sized piece of silicon, but it will not magically improve the overall system performance or result in the linear architecture scalability.
The article is specifically calling out a potentially decreased ROB size in M2 cores, and ARMv9 also potentially not arriving until M3 which are crucial to the speed or software performance. There is absolutely nothing the fab process can do to make SVE2 and matrix instructions automagically appear in lithographic chip designs – those are the «silicon» design time decisions. As we have recently been seeing more and more practical, mainstream use cases of the advanced use of the SIMD instructions at the C/C++/Rust runtime level that bring an order of magnitude level performance gains, having the SVE2 implementation at the ISA level is becoming somewhat critical.
I would recommend not taking their business conjecture without a giant pinch of salt. Just today they were claiming Apple has lost hundreds of engineers in the chip division. The idea that a single division somehow lost hundreds without the industry noticing is ridiculous.
To quote from that article:
"SemiAnalysis believes that the next generation core was delayed out of 2021 into 2022 due to CPU engineer resource problems. In 2019, Nuvia was founded and later acquired by Qualcomm for $1.4B. Apple’s Chief CPU Architect, Gerard Williams, as well as over a 100 other Apple engineers left to join this firm. More recently, SemiAnalysis broke the news about Rivos Inc, a new high performance RISC V startup which includes many senior Apple engineers. The brain drain continues and impacts will be more apparent as time moves on. As Apple once drained resources out of Intel and others through the industry, the reverse seems to be happening now."
I was very optimistic on Apple on the CPU front until I read this today. Now I'm waiting to see how the A16 pans out for them to see if it's a two generation loss of progress, or just a single generation stumble.
0: https://semianalysis.substack.com/p/apple-cpu-gains-grind-to...
Nuvia started early enough to be a factor here. But Rivos wasn’t even founded until June 2021. To release now, M2 would already have been finished with design by then.
There is an excellent video on this for anyone interested in Japanese culture and the war against USA via semiconductors:
I think there's always a desire to work at a startup in SV and in a low/zero interest rate environment - VCs could probably fund something in the chip design space.
But now that interest rates are going up, I think that will be a lot tougher and Apple will be in a better position due to their direct access to free cashflow, either to compete or to acquire them at a later date.
It's also an observation that w.r.t. chip design and consumer electronics, the pay is generally lower than at, say, Google, Facebook, Salesforce, Web 2.0 based startups (i.e. AirBnb, Uber, DoorDash), etc.
My presumption is that this is because as a chip designer or embedded software/hardware engineer, the capital costs to do anything interesting on your own as a startup (i.e. tape out a chip, mass production in Asia, etc.) are very very high and very fixed and very up-front. Even fabless semiconductors and factory-less product design companies that outsource manufacturing to Asia would need to go find outside capital for IC masks or HW prototypes. You also need a cadre of supply chain, biz dev, marketing, ad spend, channel sales distribution.
Compare that to AirBnb or Dropbox, where you need a good idea, a handful of 10x SW engineers and an AWS account that can scale as you grow, plus a free tier for onboarding customers. Therefore, Google/FB etc. need to pay more to prevent these folks from going off and starting their own disruptor (i.e. Insta, WhatsApp, SalesForce).
The author's argument here about talent leaving after having "gotten Apple off x64" is such an odd take. It's not as if Apple started designing these chips after the M1 launched—the pipeline for even small SoCs is often five or more years. The bit about Rivos is especially bizarre because that company was founded in 2021, well after this chip must have been taped out.
Few people are going to upgrade from the M1 to the M2 anyway, so it makes sense to keep powder dry for the M3.
It looks like M2 is neither of those, and it's already been 2 years.
And while Apple isn't the max payer in SV, I'm sure they pay fine compared to other big tech. The issue is, chips are big right now and no existing big tech can compete compensation-wise with shares in a growth chip startup. With VC drying up, I expect this to change back in Apple's (and other big tech's) favor.
[1], I hope that is not him.
[1] https://www.reuters.com/legal/litigation/apple-lawsuit-says-...
- boy, I got a lot of money Really, a lot. You know how I got them? I never gave anything away for free. Hand that change over.
“The bleeding hasn’t stopped in recent years as Apple’s work culture simply isn’t the best and other firms, namely the hyperscalers such as Google, Microsoft, Amazon, and Meta, are paying more than Apple was to poach talent.”
Big management negotiates higher pay all the time, not because there is a difference between earning 30 or 40 million, but because they need to feel their own "value" go up.
This is discussed at length in "Thinking, Fast and Slow", iirc.
That aside once people have kids the sky is the limit for giving them the "best life possible." Nanny, private tutors, private schools (and/or a house in Cupertino since it's Apple), college funds, house with good amenities nearby, etc. I probably missed some costs in there.
Nuvia was purchased for $1.4 billion by Qualcomm, a couple of years after being started.
* you attract/retain more people that are interested in money/status.
* the employees become entitled.
Also, just like Apple's customers are OK with paying a premium price because it's Apple, employees are OK with paying a premium price to be an employee of Apple (by accepting lower salaries).
N4 increases the number of EUV layers so the main improvements should be in cost and yield which would have been interesting to Apple, but N5P hit volume manufacturing earlier allowing Apple to ship the M2 earlier and with more capacity.
Waiting for N3 would have offered a considerable performance and efficiency boost but that’d realistically have delayed M2 to the first half of 2023.
They're more normal in SF because of the school lottery system which can assign you to a school whether or not you can actually get there on time every day.
With respect to Rivos, reading the about page - it seems an interesting take on RISC-V.
My take is that this will be rolled back into either Apple or Google at a later date, mostly as a hedge against someone (like Nvidia) acquiring the ARM IP now that it's in play, or to provide some realistic alternative that can be used as a counter bid in licensing discussions with ARM.
Two of the founders of Rivos were involved in PA Semi, which was acquired by Apple, and Agnilux, which was acquired by Google's Chromebook team.
* - including you
It's there.
Apple will eventually be overtaken by another company at some point, but there's a world of journalists and pundits who continue to cry wolf every day.
2017 Kaby Lake i5-7600k single core 1157 [2]
2017 Coffee Lake i5-8600k single core 1206 [3]
2018 Coffee Lake i5-9600k single core 1233 [4]
2020 Comet Lake i5-10600k single core 1307 [5]
16% improvement over 5 years, average 4.6% improvement per release, range from 2.2% to 6% per step. I didn't realize that they released 2 Coffee Lakes.
[1] https://browser.geekbench.com/processors/intel-core-i5-6600k
[2] https://browser.geekbench.com/processors/intel-core-i5-7600k
[3] https://browser.geekbench.com/processors/intel-core-i5-8600k
[4] https://browser.geekbench.com/processors/intel-core-i5-9600k
[5] https://browser.geekbench.com/processors/intel-core-i5-10600...
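The per-step gains can be recomputed from the scores listed above. A minimal sketch using only the four single-core numbers that appear in the comment; the Skylake i5-6600k baseline behind reference [1] is not given in the text, so it is left out rather than guessed:

```python
# Single-core Geekbench scores as listed above (the Skylake i5-6600k
# baseline score is not given in the comment, so it is left out here).
scores = [
    ("i5-7600k", 1157),
    ("i5-8600k", 1206),
    ("i5-9600k", 1233),
    ("i5-10600k", 1307),
]


def pct_gain(old: int, new: int) -> float:
    """Percentage improvement going from `old` to `new`."""
    return (new / old - 1) * 100


# Gain for each consecutive pair of releases.
steps = [
    (prev_name, name, pct_gain(prev_score, score))
    for (prev_name, prev_score), (name, score) in zip(scores, scores[1:])
]

for prev_name, name, gain in steps:
    print(f"{prev_name} -> {name}: +{gain:.1f}%")

total = pct_gain(scores[0][1], scores[-1][1])
print(f"{scores[0][0]} -> {scores[-1][0]}: +{total:.1f}% overall")
```

The three steps come out at roughly +4.2%, +2.2% and +6.0%, matching the quoted 2.2% to 6% range. The overall figure here starts from Kaby Lake, so it is smaller than the 16% quoted from the unlisted Skylake baseline.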
When studying the evolution of Intel CPUs over many years, it is obvious that most of the time they could have made greater improvements, but as long as their competition was weak they spread the improvements they could have made in a single year over 2 or 3 yearly CPU generations, in order to minimize their manufacturing costs and therefore maximize their profits.
Only during the many years between Skylake and Alder Lake was Intel no longer able to implement all the improvements they wanted, due to failures in the development of the new CMOS processes. They were forced to make random minor improvements because greater ones were impossible and they had no good Plan B as an alternative to the erroneous Plan A, which was to assume every year that the next year would be the year when the Intel "10 nm" CMOS process became competitive.
It looks as though Apple are gearing up for armv9 and smaller process node for the next round of chips which would be more of the "large jump" people are expecting. I think as long as Apple alternate the big jumps with the small jumps then they're not doing anything different from anyone else.
They needed to deliver M2 to show they're not resting on their laurels. If M3 is a similar kind of improvement then that's when to be worried.
Seems some employees took more than themselves to Rivos. "at least two former Apple engineers took gigabytes of confidential information with them to Rivos."
https://www.macrotrends.net/stocks/charts/AAPL/apple/long-te...
They made a big splash with the M1 macbook air, which was at the time an incredible value, and the clear best laptop on the market in terms of price/performance hands down. Apple was able to get splashy headlines, and assert their silicon was not just competitive with, but better than Intel and AMD. That's the critical goal they had to reach to validate Apple Silicon as a valid contender in the market.
This year, they're iterating on the design, and getting the market to accept a 20% price increase on the macbook air, which is their mass-market product.
Does anything they do from here on out actually depend on them continuing to win in the semiconductor space? It's not as if these chips are competing for server slots, where winning comes down to raw numbers in terms of performance/Watt.
These macbooks are going to be absolutely fine for the foreseeable future for everything anyone needs a mac to do: video editing, coding, content consumption etc. run absolutely great on these devices which have excellent battery life and great user experience.
California has a total ban on non-competes.
A very small handful of other states put restrictions on non-competes, but even those generally allow non-competition agreements if time limited, and the employee makes over ~$100k.
It’s widely accepted that prohibiting non-competes has been a significant factor in the tech industry success in California.
As one example, it is well known that Amazon aggressively enforces non-competes, even against line engineers.
But yes, I wish all states would just ban them outright. Or at least make them require compensation. If an employee is important enough to require a non-compete, then they are important enough to pay during the non-compete time period.
Does CA also ban them as part of an acquisition? I've seen them as part of the sale so everyone doesn't quit the day after the acquisition and start a competitor.
And even after the mobile revolution shrank the demand for X86 PCs, the cloud revolution further entrenched X86 in the cloud.
It is not. A recent paper (https://arxiv.org/pdf/2205.05982.pdf) from Google engineering compared the performance of a vectorised (SIMD) vs non-vectorised implementation of quicksort in the Highway library, as well as the performance difference between the AVX-512 and NEON/SVE1 implementations. Switching to SIMD processing alone yielded a reported 9-19x speedup, depending on the SIMD unit size (32/64/128-bit numbers were sampled and measured). Even the smaller of the two, the 9x performance gain, is far from marginal.
As for SIMD unit size, throughput for the NEON/SVE1 implementation (478 Mb/sec on average) is 2.4x lower than for AVX-512 (1120 Mb/sec on average), largely due to the narrower processing units. Again, a 2.4x factor is not in marginal territory.
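Taking the two averages quoted above at face value, the ratio is easy to check, and it can be set against what a pure vector-width argument would predict. A rough sketch; the "pure width scaling" comparison is a simplifying assumption for illustration, not something the paper claims:

```python
avx512_throughput = 1120  # average reported for AVX-512, as quoted above
neon_throughput = 478     # average reported for NEON, as quoted above

ratio = avx512_throughput / neon_throughput
print(f"AVX-512 / NEON throughput: {ratio:.2f}x")

# Hypothetical comparison point: if throughput scaled purely with vector
# width (512-bit AVX-512 vs 128-bit NEON), the ratio would be 4x.
width_ratio = 512 / 128
print(f"fraction of the pure width ratio: {ratio / width_ratio:.0%}")
```

The quoted averages give a ratio of about 2.3x, in the same ballpark as the 2.4x cited, and well short of the 4x that width alone would suggest, so other factors (clocks, memory system, instruction mix) are clearly in play too.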
> What's not marginal is the improvements in power efficiency that come with new process nodes.
And that is an optimisation step, albeit a very important one. However, it will not make a quick sort implementation run 2.4x faster alone.
(I suspect any application doing enough quicksort that the 2x speedup is significant, would be even happier going slightly off-core to a coprocessor more specialized in vector processing, like Hwacha. There's plenty of space between "tightly-coupled CPU SIMD" and "GPU" that I think makes more sense than needing to implement 512-bit registers in little cores.)
Depends on the applications, I suppose. But did you know that (at least on OoO x86), the energy cost of scheduling an instruction dwarfs that of the actual computation? That is why SIMD, including SVE2, can be so important - it amortizes that cost over several elements. Let's spend (more of) our energy budget on actual work.
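The amortization argument can be made concrete with a toy model. A minimal sketch with made-up cost units (the 10:1 scheduling-to-compute ratio is purely hypothetical, chosen only to show the shape of the curve):

```python
def energy_per_element(sched_cost: float, op_cost: float, lanes: int) -> float:
    """Energy spent per data element for one instruction covering `lanes` elements.

    The fixed front-end cost (fetch, decode, schedule) is paid once per
    instruction; the arithmetic cost is paid once per element.
    """
    return sched_cost / lanes + op_cost


# Hypothetical costs in arbitrary units: scheduling is 10x the ALU work.
SCHED, OP = 10.0, 1.0

for lanes in (1, 4, 16):
    e = energy_per_element(SCHED, OP, lanes)
    print(f"{lanes:2d} lanes: {e:.3f} units/element")
```

In this model a scalar instruction spends most of its energy on overhead, while at 16 lanes most of the budget goes to actual computation, which is the point being made about SIMD.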
Is it really just "very few things that actually start using new SIMD"? I'm not a huge fan of autovectorization, but even that is able to vectorize some fraction of STL algorithms. And there are several widely used libraries, including image/video codecs and encryption, that use SIMD and wouldn't be feasible otherwise.
Take OpenSSL as an isolated example. By simply fiddling with the C compiler flags to allow it to use NEON on M1, the sha256 calculation speed-up is 4x for 128 and 256 block sizes, with the gains quickly tapering off for larger block sizes and resulting in only a modest 10% increase. And that performance increase happens without the hash functions having been manually optimised for NEON/SVE1.
SVE2, with its variable vector size support, could improve performance for larger unit sizes. Perhaps it is time to spin up a Graviton3 instance and poke around with clang/gcc to see how good or how much faster SVE2 actually is.
Because outside of servers where little cores don't exist, 256b ALUs in big cores mean 256b registers in little cores, and Cortex-A510 is way smaller than Gracemont. And then you're giving Samsung another opportunity to screw up big.LITTLE...
And even the server CPUs with SVE are 2x256b except A64FX which is HPC exclusive, so no better than 4x128b.
So, rather than bring the money home at the high rate, Apple has been taking on debt for U.S. operations while waiting for (and perhaps lobbying for) another tax holiday.
My observation is, the assholes are everywhere but also the nice and polite people. I can't really generalize it for rich or poor, I did not see that simple pattern.
At that time my hourly wage was about 8 pounds, and a lady at an extravagant event gave me 5 pounds and told me to take extra good care of the table. She somehow expected to have a private waiter for the night for 5 pounds sterling, but I took extra good care for about 45 minutes, and when she asked me why I wasn't working for her specifically any longer, I explained that 5 pounds will only go so far, and she agreed.
I recall once a very rich person screaming at the waiter because they did not like the foam on the coffee, and a few instances of rudeness, but overall these were rarities.
If anything, the managers were much, much bigger assholes towards the employees because they could afford to be (the employees were mostly students or immigrants like me who needed the money to get by until they found a proper job). Employees with higher status were big assholes towards the more junior ones.
Most social interactions with the rich or famous that I had or have seen were very positive and polite.
In some instances I was at fault and they were very understanding and tolerant. Once I failed to deliver the coffee of a famous F1 racer at breakfast and he didn't make a big deal of it (if I were him, I would probably have been much more rude). Victoria's Secret models were just fine too when they received flat champagne.
I'm not convinced that rich people being assholes in social interactions is a real thing. IMHO the pattern is, people who are privileged in their own social group are the assholes.
My SO works as a consultant in a bank here in Rome, Italy.
She moved from a bank in the periphery to a very central one in the Parioli neighborhood.
There was a night and day difference between her old and new clients in wealth (with the Parioli ones being largely millionaires).
Old clients would treat her with the utmost respect, call her doctor ("dottoressa"), and always listen to what she had to say. New ones were on average much more rude, demanding and overall uneducated. She would have to explain to them that she couldn't activate some service for them because she needed their signatures, and they would go mad and call her director or some friend in the bank.
They are on average much worse people and they're also much more money aware.
Another anecdote she told me was how some rich woman wanted to set up a bank account for a nonprofit to send money to some African country. Not only was there no way to explain to her that such operations are not that easy, especially for large sums, because they automatically trigger money laundering controls (she would just not listen, and blamed her), but the client was MAD that she had to pay 8 euros in commission on a 60k+ euro wire transfer, expecting it to be free because it was a "nonprofit".
Yes, there's good and bad people in each wealth tier, but rich people on average are much worse assholes. There's no comparison.
There's an old family in my town that came from the kind of wealth that had each of their children for a few generations married into important or powerful families across the state. Today, the main family has no income other than from what they inherited, but they maintain their position and membership in society through being horrible to deal with. The center of the family is a vile gossip and has nothing but time to hear about everything that happens and think up ways to use it to her advantage.
They're notorious for showing up to functions uninvited, sitting at your table and ordering, and leaving before the bill comes. They hire the best local artisans and builders, complain to everyone about how shoddy the work is until they get extra for free, and then never pay, threatening to sue for imagined problems. When the grand children were in school, the family would try to walk into functions without tickets because "their child was performing", as if no one else's were.
When their daughter married a pro athlete, no one in town would build them a house, so they had to hire from other parts of the state. Their reasoning? No one in town was skilled enough to build them what they wanted.
They wrote a letter of complaint to the White House about a cavalcade driving through town during a family member's wedding reception and were sent an apology and a bottle of champagne by the POTUS. The family apparently sent back a letter letting him know that they didn't vote for them.
No one here even needs TV. Just hold a dinner party at a place they like and they'll show up and entertain for the cost of a few drinks and a meal.
How would you know that the person in the normal family car is rich?
summer child labor conscript: your total is $15 and your change is $85, lemme keep that
you: ….. uhhhh you kidding me?
audience: rich people are assholes!
Why the fsck? Is it normal to beg during work where you’re from?
Yes - but they call it tipping 'round these parts. They even have prominently displayed tip jars and everything.
"Well I didn't get rich by writing a lot of cheques!"
The purpose of SVE2 is to simplify the writing of the software that exploits the data parallelism, both when that is done manually and when that is done automatically by an autovectorizing compiler.
With SVE2 it should become much easier to deal with data structures where the sizes and the alignments are not multiples of the ALU width and it will also no longer be necessary to write many alternative code paths, to take advantage of any future better CPUs, like when optimizing for Intel SSE/AVX/AVX2/AVX-512.
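The "one code path for any vector width, no special tail handling" idea can be illustrated without real intrinsics. A minimal sketch in plain Python that simulates a vector-length-agnostic loop with a per-iteration predicate; `vla_add` and `vector_lanes` are hypothetical names, and this models the programming pattern only, not actual SVE2 instructions:

```python
from typing import List


def vla_add(a: List[int], b: List[int], vector_lanes: int) -> List[int]:
    """Element-wise add written once for any vector width.

    `vector_lanes` stands in for the hardware vector length, which an
    SVE-style loop queries at run time instead of hard-coding.
    Assumes `a` and `b` have the same length.
    """
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Predicate: which lanes of this iteration hold valid elements.
        # On the last iteration some lanes are simply switched off, so no
        # separate scalar tail loop is needed.
        active = [i + lane < n for lane in range(vector_lanes)]
        for lane, on in enumerate(active):
            if on:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += vector_lanes
    return out


a = list(range(10))
b = [10 * x for x in a]
# Same code, three different simulated hardware widths, identical results.
results = [vla_add(a, b, lanes) for lanes in (4, 8, 16)]
print(results[0])
```

The array length (10) is not a multiple of any of the simulated widths, yet the same loop handles all three, which is the contrast being drawn with maintaining separate SSE/AVX/AVX2/AVX-512 paths.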
There are still a majority of programs that do not utilize as frequently as possible the existing SIMD units. With SVE2, their number should diminish.
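To make the point about odd sizes and multiple code paths concrete, here is a minimal scalar sketch in Python of a vector-length-agnostic loop. It only models the idea: the `whilelt` helper is loosely modelled on SVE's WHILELT predicate, and `saxpy_vla` and the `lanes` parameter are names made up for illustration, not real intrinsics.

```python
def whilelt(start, n, lanes):
    """Predicate: lane i is active while start + i < n (modelled on SVE WHILELT)."""
    return [start + i < n for i in range(lanes)]


def saxpy_vla(a, x, y, lanes):
    """Compute y[i] + a * x[i], written once for any vector width `lanes`."""
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)
        for lane, active in enumerate(pred):
            if active:  # inactive lanes neither load nor store
                out[i + lane] += a * x[i + lane]
        i += lanes  # advance by however many lanes the hardware has
    return out
```

The same function handles a length of 7 with 2, 4, 8 or 16 lanes, with no separate tail loop and no per-width code path, which is the property the comment above attributes to SVE2.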
Even the web browser you are using right now to comment on HN likely uses the very same Highway library for which the speedups were reported (Chrome and Firefox certainly do; I am unsure about Safari). Overall browser performance will also improve as a result, because the gains arrive transparently when an optimised implementation is dropped into the browser build.
And an optimised QuickSort can also come in handy if one pokes around a large browser history or uses it as a knowledge base, which I do on a regular basis. My browser keeps an uninterrupted record of all visited websites over the last 15+ years, and being able to quickly zoom in on a particular time span to find something within that range is important to me. I am almost certain that a sorting of sorts is involved somewhere behind the scenes.
Getting kicked out of your Google account can mean losing access to all your other online accounts. That’s disruptive to your life and you might waste weeks or months dealing with a situation where you have no recourse because you can’t reason with a human.
It’s routine that people get their Google accounts banned without understanding why, and thus can’t fix it. When you’re kicked out of a physical location, you’ll know why.
A 2.4x difference was, in fact, reported; however, I still find it somewhat difficult to interpret the reported results. The processing unit size difference alone and the number of LUs can't account for such a big difference in transfer speeds, as the M1 Max used in the assessment has a very wide memory bus (256 bits wide for a performance core cluster, or 512 bits wide for the entire SoC) as well as an unusually large L1-D cache and a large L2 cache, with both caches having deep TLBs. The test set they used could also fully fit into the L2 cache. I have asked the Google engineer a question in a separate thread about what else could influence the observed performance difference but have not received a satisfactory explanation.
The key bottleneck is partitioning. AVX-512 does really well there because it has dedicated compressstore instructions, and it's actually even faster to partition a vector via vperm* (because we only need to do that once, whereas two compressstore are required to partition). So AVX-512 reaches >25 GB/s partition throughput per core; it's instead limited by the memory bandwidth each core can access (around 11 GB/s if a single core is active, less when all are competing for the total "128 GB/s").
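For readers unfamiliar with compress-store, here is a rough scalar model in Python of the two-compress-store partition step described above. It is a sketch of the idea only, not the actual sort implementation: `compress`, `partition_compress` and the `lanes` parameter are hypothetical names, and the vperm*-based variant is not modelled.

```python
def compress(values, mask):
    """Pack the selected lanes contiguously, as a compress-store would."""
    return [v for v, keep in zip(values, mask) if keep]


def partition_compress(data, pivot, lanes=16):
    """Split data into (< pivot, >= pivot), one vector of `lanes` items at a time."""
    left, right = [], []
    for i in range(0, len(data), lanes):
        vec = data[i:i + lanes]
        mask = [v < pivot for v in vec]  # one vector comparison
        left.extend(compress(vec, mask))  # first compress-store
        right.extend(compress(vec, [not m for m in mask]))  # second compress-store
    return left, right
```

Each vector costs one comparison and two compress-stores, which is why a dedicated compress instruction matters so much for partition throughput.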
By contrast, NEON for example in the M1 has 128-bit vectors. Its "4 vector units" (even if they can actually execute all instructions concurrently, which is not clear to me and unlikely - Intel can also only execute some instructions on certain ports) are definitely not as good as actual 512 bit vectors, because partitioning only has a left and right side, and we don't have enough ILP for each of those to keep 2 vector units busy. Hence NEON reaches 11 GB/s partition throughput. It would seem like this matches Skylake, but no: once a subarray fits into cache, Skylake is freed from the memory bottleneck and is at least twice as fast there (which is a sizable fraction of the total sort time).
Does this help explain the results?
> The test set they used could also fully fit into the L2 cache.
This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?
It's pretty symmetrical, more so than Cortex-X2's 4 pipelines; there's analysis that on M1 only some floating point and crypto instructions can't execute on all 4 pipelines. [1] (TP in that table is inverse throughput)
Which means that, for example, byte permutes from tables 256b or less can actually achieve the same throughput on M1 as with Intel's AVX-512, since M1 can sustain 4x 2-register TBL per cycle. And doing the exact equivalent of a 512b vpermb (3 cycle latency, 1/cycle throughput) can be done with 5 cycles latency and 0.33/cycle throughput on M1, via 4x 4-register TBL.
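As a rough illustration of the 4x 4-register TBL equivalence, here is a scalar Python model. The `tbl` function follows NEON TBL semantics (out-of-range indices yield 0); `vpermb_512_via_tbl` is a made-up name, and masking the indices to 6 bits is an assumption added here so the model matches vpermb for any index value. Whether real code needs that extra step depends on whether the indices are already known to be in range.

```python
def tbl(table_regs, indices):
    """Model of NEON TBL: byte lookup in the concatenated table registers.

    Indices past the end of the table produce 0.
    """
    table = [b for reg in table_regs for b in reg]
    return [table[i] if i < len(table) else 0 for i in indices]


def vpermb_512_via_tbl(src, idx):
    """Permute 64 bytes of `src` by 64 byte indices using four 4-register TBLs."""
    assert len(src) == 64 and len(idx) == 64
    regs = [src[k:k + 16] for k in range(0, 64, 16)]  # four 128-bit registers
    out = []
    for k in range(0, 64, 16):  # one TBL per 128-bit slice of the indices
        masked = [i & 63 for i in idx[k:k + 16]]  # vpermb uses only the low 6 bits
        out.extend(tbl(regs, masked))
    return out
```

All four lookups read the same 64-byte table and are independent of each other, so they can issue in parallel across the vector pipelines.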
Well, a vpermd in NEON would need an extra MLA to convert indexes, and vpermi2* equivalents fall off a cliff. And Intel still has p01 free, and COMPACT is SVE. But in general, a lot of the parallelism that enables AVX-512 implementations will convert directly into ILP across 128b vectors.
> This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?
Anandtech [2] measured the same L2 latency up to about 8MB single-core, so regardless of the details, 8MB is a pretty significant cliff on M1. In any case, RAM bandwidth is ~60GB/s and, unlike on Intel, can be just about saturated by a single core.
[1] https://dougallj.github.io/applecpu/firestorm-simd.html
[2] https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
Also, there are now several RISC-V CPUs with 512-bit vectors, and it seems fair to call them little cores especially compared to x86 and M1/M2. Perhaps 512-bit is more feasible (and sensible) than is widely believed?
huh, that's surprising, that plot indeed looks like a core might be grabbing more than 'its share' of L2, though not all. The 'full random' curve starts creeping up after ~3MB as expected, so the situation seems to be even more complex than "use up to 8MB".
For completeness I'll also measure for 100M elements single core, though on M1 that wouldn't make a difference because as you say, a single core can drive a lot of memory bandwidth, enough that NEON becomes the bottleneck.
I share your concern about new SIMD instructions not being used. It seems to me we're at an inflection point, though. ISAs such as RISC-V and SVE will enable (properly written) software to benefit from future wider vectors without even recompiling. github.com/google/highway (disclosure: I am the main author) lets you write your code only once, and target newer instructions whenever they are available, with transparent fallback to other codepaths for CPUs that lack them.
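The "write once, pick the best available target, fall back otherwise" idea can be sketched in a few lines of Python. This is not Highway's API: the feature names, `CANDIDATES` and `choose` are all hypothetical, and `sum_chunked` merely stands in for a SIMD code path.

```python
def sum_scalar(xs):
    """Plain fallback path with no CPU feature requirement."""
    total = 0
    for x in xs:
        total += x
    return total


def sum_chunked(xs, lanes):
    """Stand-in for a SIMD path: keep `lanes` partial sums, then reduce them."""
    partial = [0] * lanes
    for i, x in enumerate(xs):
        partial[i % lanes] += x
    return sum(partial)


# (required feature, implementation), best first. Feature names are hypothetical.
CANDIDATES = [
    ("wide512", lambda xs: sum_chunked(xs, 16)),
    ("wide128", lambda xs: sum_chunked(xs, 4)),
    (None, sum_scalar),  # always-available fallback
]


def choose(cpu_features):
    """Return (feature, implementation) for the best path this CPU supports."""
    for feature, impl in CANDIDATES:
        if feature is None or feature in cpu_features:
            return feature, impl
    raise RuntimeError("no candidate matched")  # unreachable while a fallback exists
```

A caller would run `feature, impl = choose(detected_features)` once and then call `impl(data)`; every path returns the same result, so the choice only affects speed.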
Given the various physical realities including power efficiency, I believe there will be considerably more SIMD usage within the next few years.
Just as I, as a programmer, go mental when encountering absurd and ineffective account password rules (say, one special char, one upper case, one non-letter, etc.), while a lay person would just sigh and comply.
most “anti money laundering” or “security” stuff is actually just that one bank’s poor and inaccurate implementation of a law. most of it is just company policy and nothing related to the law.
with electronic funds, the entire banking system relies on assuming that the prior and next bank has already done the checks necessary
because the law only creates a firewall of reporting at the deposit and withdrawal of physical notes (it's the same across europe, across the us, and elsewhere)
There are laws in Italy, and there are very specific amounts you can move per month before controls have to be triggered.
60k in transactions abroad is 12 times what you can transfer without declaring exactly what the money is from and where it comes from. Especially when sending money to and receiving money from African countries.
Tax evasion and laundering are high in Italy, and banks easily deny you their services if they smell something.
your bank did not
one of my biggest pet peeves is how low-level employees can't tell that their organization isn’t doing the normal thing