AMD Prepares 32-Core Naples CPUs for 1P and 2P Servers: Coming in Q2 (anandtech.com)

341 points by BlackMonday 9 years ago | 160 comments

I think Naples is a very exciting development, because:

- 1S/2S is obviously where the pie is. Few servers are 4S.

- 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still more than LGA-36712312whateverthenumberwas

- First x86 server platform with SHA1/2 acceleration

- 128 PCIe lanes in a 1S system is unprecedented

All in all, Naples seems like a very interesting platform for throughput-intensive applications. Overall it seems that Sun, with its Niagara approach (massive number of threads, lots of I/O on-chip), was just a few years too early (and likely a few thousand per system too expensive ;)

semi-extrinsic 9 years ago | |

> 128 PCIe lanes in a 1S system is unprecedented

Yes, definitely drooling at this. Assuming a workload that doesn't eat too much CPU, this would make for a relatively cheap and hassle-free non-blocking 8 GPU @ 16x PCIe workstation. I wants one.
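To make the lane arithmetic concrete, here is a minimal sketch. The 128-lane figure comes from the article; the assumption that every lane is free for GPUs (nothing reserved for storage, NICs or other devices) is mine, and the function names are made up for illustration.

```python
def max_full_width_devices(total_lanes: int, lanes_per_device: int = 16) -> int:
    """How many devices can each get a dedicated full-width link, with no lane sharing."""
    if total_lanes < 0 or lanes_per_device <= 0:
        raise ValueError("lane counts must be positive")
    return total_lanes // lanes_per_device


def lanes_left_over(total_lanes: int, devices: int, lanes_per_device: int = 16) -> int:
    """Lanes still free after giving every device a full-width link."""
    used = devices * lanes_per_device
    if used > total_lanes:
        raise ValueError("not enough lanes for that many full-width devices")
    return total_lanes - used
```

With 128 lanes and x16 links, `max_full_width_devices(128)` gives 8, and `lanes_left_over(128, 8)` gives 0, so a real build would have to take lanes for storage and networking from somewhere.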

sorenjan 9 years ago | | |

That does sound pretty spectacular, and really loud. What kind of case would you put that in? Would you work with ear protection?

AnthonyMouse 9 years ago | |

> 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still more than LGA-36712312whateverthenumberwas

This one will be interesting. The current Ryzen (like most of the Intel desktop range) has two channels, but everyone has been benchmarking it against the i7-6900K because they both have eight cores. The i7-6900K is the workstation LGA 2011 with four channels. If the workstation Ryzen will have eight channels...

gigatexal 9 years ago | |

Let's hope this isn't Niagara again: it needs to have decent clock speeds, as IPC is still worth something today. But yes, I totally agree, this is an exciting chip.

binarycrusader 9 years ago | | |

It's not. Not only did AMD move away from the CMT (clustered multithreading) design used in the previous Bulldozer microarchitecture, they now have an SMT (simultaneous multithreading) architecture allowing for 2 threads per core.

By comparison, the performance of SPARC substantially improved moving from the T1 and T2 to the T3+. The T1 used a round-robin policy to issue instructions from the next active thread each cycle, supporting up to 8 fine-grained threads in total. That made it more like a barrel processor.

Starting with the T3, two of the threads could be executed simultaneously. Then, starting with the T4, SPARC added dynamic threading and out-of-order execution. Later versions are even faster and clock speeds have also risen considerably.
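For anyone unfamiliar with barrel-style issue, a toy sketch of the round-robin policy described above: each cycle, the core issues from the next active thread after the one that issued last. This is a simplified model for illustration only, not a description of the actual T1 pipeline; the function name and the fixed per-thread "active" flags are assumptions.

```python
def barrel_issue_order(active, cycles):
    """Simulate fine-grained round-robin issue.

    active: list of booleans, one per hardware thread (True = has work to issue).
    cycles: number of cycles to simulate.
    Returns the thread id that issues in each cycle, or None for an idle cycle.
    """
    n = len(active)
    order = []
    last = -1  # no thread has issued yet
    for _ in range(cycles):
        chosen = None
        # Scan the threads after the last issuer, wrapping around once.
        for step in range(1, n + 1):
            candidate = (last + step) % n
            if active[candidate]:
                chosen = candidate
                break
        order.append(chosen)
        if chosen is not None:
            last = chosen
    return order
```

With four active threads the issue order simply rotates 0, 1, 2, 3, 0, 1, ...; stalled threads are skipped, which is how this style of design hides latency without needing high single-thread performance.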

alimbada 9 years ago | | |

Naples is based on Ryzen, which, if you look at early benchmarks, is beating the competition on all fronts except gaming (suspected to be due to software optimisation and motherboard issues).

tyingq 9 years ago | |

32 cores in one socket may also take a bite out of some servers that are currently 2 sockets.

agumonkey 9 years ago | |

My shallow understanding of big servers and IBM Z series amounted to "lots of dedicated IO processors". Seems like "mainstream" caught up with big blue.

kev009 9 years ago | | |

Sort of. It ebbs and flows. For PCs it's generally more maintainable to do more in the CPU/kernel and less in HW/firmware, and of course price runs the market, so there's a race to do less. Part of the mainframe price tag is getting long-term support on the whole system stack, whereas PC vendors actively abandon stuff after a few years. That is a big risk for something like a TCP offload engine.

Every mainframe interface is basically an offload interface: "computers" DMAing and processing to the CPs and each other. Every I/O device has a command processor, so it can handle channel errors and integrated PCIe errors in a way PCs cannot.

A PC with Chelsio NICs doing TCP offload with direct data placement or RDMA, as well as Fibre Channel storage, would be mini/mainframe-ish.

PeCaN 9 years ago | | |

Pretty much. Mainframes have been very I/O oriented from the start. Channel I/O (more or less DMA) with dedicated channel programs and processors can be very high-throughput.

mtgx 9 years ago | |

Intel doesn't have SHA2 acceleration? ARMv8 has had it for like 2-3 years now...

And AMD should dump SHA1 acceleration in the next generation.

drzaiusapelord 9 years ago | | |

>And AMD should dump SHA1 acceleration in the next generation.

The cost to have that on silicon is probably close to zero. If you think SHA1 is just going to magically disappear because you want it to, well, you'll be in for a SHA1 sized surprise. Our grandkids will still have SHA1 acceleration.

>ARMv8 has had it for like 2-3 years now...

Because ARM cores don't remotely have the CPU heft an Intel x86-64 chip has, so ARM needs all this acceleration because it's typically used in very low-power mobile scenarios. On top of that, Intel claims AES-NI can be used to accelerate SHA1.

https://software.intel.com/en-us/articles/improving-the-perf...

throwawayish 9 years ago | | |

ARM cores are much weaker; crypto performance without NEON is abysmal across the board. Of course, compared to hardware acceleration, software always seems slow; Haswell manages AES-OCB at <1 cpb.

yuhong 9 years ago | |

As a side note, XOP had rotate instructions. Sadly it is no longer supported in Ryzen.

tw04 9 years ago | |

Intel has had SHA1/2 acceleration for YEARS via the AES-NI instruction set.

https://en.wikipedia.org/wiki/Intel_SHA_extensions

>There are seven new SSE-based instructions, four supporting SHA-1 and three for SHA-256:

>SHA1RNDS4, SHA1NEXTE, SHA1MSG1, SHA1MSG2, SHA256RNDS2, SHA256MSG1, SHA256MSG2

throwawayish 9 years ago | | |

This is not part of AES-NI and has never been released in a mid-range+ server/desktop CPU, only part of some Atom parts (Goldmont). Therefore software support is poor (I think OpenSSL does not support it). It is said to be included in 2018+ Cannonlake, though.

floatboth 9 years ago | | |

haha nope. This is not a part of AES-NI.

The only processors so far with these extensions are low power Goldmont chips.

https://github.com/weidai11/cryptopp/issues/139
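If you want to check a given Linux x86 box yourself, here is a minimal sketch that looks for a feature flag in the text of /proc/cpuinfo. As far as I know, the kernel reports the SHA extensions as `sha_ni` and AES-NI as `aes`, which also shows they are separate features. The helper names are mine, and this only works where /proc/cpuinfo exists.

```python
def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears on any 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Flags are whitespace-separated tokens; match whole tokens only.
            if flag in value.split():
                return True
    return False


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the raw cpuinfo text (Linux only)."""
    with open(path) as f:
        return f.read()
```

Usage on a Linux machine would be `has_cpu_flag(read_cpuinfo(), "sha_ni")`, and the same call with `"aes"` for AES-NI.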

arca_vorago 9 years ago |

This is what I have really been looking forward to. I theorycrafted a more ideal system for the genetics work a former employer was doing, but didn't get to build it until after I had left there: a quad 16-core Opteron system for a total of 64 cores (for physics calculations in COMSOL). I think that there is more potential use for servers with high actual core counts than many people realize, so I can't wait to build one. (My purpose these days is a game server in a colo; one of my projects is a multiplayer UE4 game.)

At the previous job where I built the 64-core system, I even emailed the AMD marketing department to see if we could do some PR campaign together, but I think it was too soon before the Naples drop, because I never got a response. Here's hoping Supermicro does a 4-CPU board for this... 128 cores would be amazing. (But I'll take 64 Naples cores as long as it gets rid of the bugs and issues I found with the Opterons.)

deepnotderp 9 years ago | |

Out of curiosity, I thought that genetics was the domain of gpus?

kannanvijayan 9 years ago | | |

I did sequence-based bioinformatics back around 2006 or so.

Very few of the operations used GPU. Things may have changed since I was working there, but the work at the time wasn't suited for a GPU architecture.

The initial step was sequence cleanup, which is a hidden Markov model executed over a collection of sequences of varying length, so hard to parallelize. Sequence annotation is embarrassingly parallel on a per-library basis (each sequence can be annotated independently of the others), but the computational work is fuzzy string matching, which is once again hard to GPU-ize. Another major computational job was contig assembly, which is somewhat parallelizable (pairwise sequence comparisons), but once again involves fuzzy string matching, so not GPU-izable.

So that's just sequence genetics. Don't know if GPUs are used in other areas.

Lots of cores, lots of threads, and lots of main memory. That was the key.

keth 9 years ago |

I'm looking forward to the benchmarks since the performance per watt of the desktop parts (Ryzen R7) seems to be really good. Quite curious how it will compare against Skylake-EP.

A quote from an AnandTech forum post [0] sounds promising:

"850 points in Cinebench 15 at 30W is quite telling. Or not telling, but absolutely massive. Zeppelin can reach absolutely monstrous and unseen levels of efficiency, as long as it operates within its ideal frequency range."

A comparison against a Xeon D at 30W would be interesting.

The possibility of this monster maybe coming out sometime in the future is also quite nice: http://www.computermachines.org/joe/publications/pdfs/hpca20...

[0] https://forums.anandtech.com/threads/ryzen-strictly-technica...

drewg123 9 years ago |

The important thing here, from my perspective, is how NUMA-ish a single socket configuration will be. According to the article, a single package is actually made up of 4 dies, each with its own memory (and presumably cache hierarchy, etc). While trivially parallelizable workloads (like HPC benchmarks) scale quite well regardless of system topology, not all workloads do so. And teaching kernel schedulers about 2 levels of numa affinity may not be trivial.

With that said, I'm looking forward to these systems.
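For checking how NUMA-ish a box actually looks to the OS: Linux exposes each node's CPUs as a range list (e.g. `0-7,32-39`) in /sys/devices/system/node/nodeN/cpulist. A small sketch that parses that format and maps CPUs to nodes follows. The 4-node layout in the usage note is a made-up illustration of "4 dies per package", not a known Naples topology, and the function names are mine.

```python
from typing import Dict, List


def parse_cpulist(text: str) -> List[int]:
    """Parse a Linux cpulist string such as '0-7,32-39' into a list of CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpu_to_node(node_cpulists: Dict[int, str]) -> Dict[int, int]:
    """Build a CPU id -> NUMA node id map from per-node cpulist strings."""
    mapping: Dict[int, int] = {}
    for node, text in node_cpulists.items():
        for cpu in parse_cpulist(text):
            mapping[cpu] = node
    return mapping


def same_node(mapping: Dict[int, int], cpu_a: int, cpu_b: int) -> bool:
    """True if two CPUs share a NUMA node (and so share local memory)."""
    return mapping[cpu_a] == mapping[cpu_b]
```

With a hypothetical layout like `{0: "0-7,32-39", 1: "8-15,40-47", 2: "16-23,48-55", 3: "24-31,56-63"}`, CPUs 3 and 35 land on the same node while 3 and 8 do not, which is the kind of distinction a scheduler has to respect.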

wtallis 9 years ago | |

Intel's largest CPUs are already explicitly NUMA on a single socket. They call it Cluster On Die: http://images.anandtech.com/doci/10401/03%20-%20Architectura...

drewg123 9 years ago | | |

Very true, I should have mentioned that. At least for us, COD doesn't seem to impact our performance at all, while NUMA does. I'm hoping that Naples is the same for us.

However, there is an important difference. AMD seems to be putting multiple dies into the same package, whereas Intel seems to have (as the Cluster on Die name implies) everything on the same die. So my fear is that the interconnect between dies may not be fast enough to paper-over our NUMA weaknesses.

kiddico 9 years ago |

Sorry, my google-fu isn't on point today; what's the difference between 1P and 1U, or 2P and 2U? My nomenclature knowledge is lacking ...

sp332 9 years ago | |

P = Processor and S = Socket (they're pretty interchangeable). U = rack Unit https://en.wikipedia.org/wiki/Rack_unit

throwawayish 9 years ago | |

n-P / n-S / n-way = how many sockets/processors a system has. A 1S system has one socket / processor, a 2S system two, a 4S four and so on.

x U (or x HE, if you're talking with a German manufacturer, they like to make that mistake ... ;) are rack-units, i.e. how large the case is.

astrodust 9 years ago | |

The title should be re-written to say "single and dual socket" not "1P and 2P".

daemonk 9 years ago |

Nice. This is the more interesting market for AMD rather than the gaming market in my opinion. 128 PCIe lanes and up to 4TB of ram will be awesome.

ptrptr 9 years ago | |

Gaming? More like the consumer market. Ryzen 7 is definitely not suited for gamers; advertising it as such was IMO a mistake. Nevertheless, Naples can be a big innovation in the server segment.

Also, what about ECC? Does Ryzen support it or not?

mrb 9 years ago | | |

"Ryzen 7 is definitely not suited for gamers"

The underperformance in gaming was tracked down to software issues according to AMD. Namely:

- bugs in the Windows process scheduler (scheduling 2 threads on same core, and moving threads across CPU complexes which loses all L3 cache data since each CCX has its own cache)

- buggy BIOS accidentally disabling Boost or the High Performance mode (feature that lets the processor adjust voltage and clock every 1 ms instead of every 40 ms.)

- games containing Intel-optimized code

More info: http://wccftech.com/amd-ryzen-launch-aftermath-gaming-perfor...

Furthermore hardcore gamers usually play at 1440p or higher in which case there is no difference in perf between Intel or AMD, as demonstrated by the many benchmarks (because the GPU is always the bottleneck at such high resolutions.)

floatboth 9 years ago | | |

Not being the top single-threaded performer which is required to push many many hundreds of frames per second != "not suited for gamers". Games in general are more likely to be GPU-bound!! Intel's quad cores are only really required for the pro Counter-Strike players who want 600fps at 1080p just to get the absolute latest frame.

BTW they advertised it as good for gaming + streaming (h264 CPU encoding at the same time on the same machine). And "content creation", which pretty much always means video editing.

IIRC Ryzen supports unbuffered ECC if the mainboard supports it.

alimbada 9 years ago | | |

It's just as suited for gaming as it is for anything else. The problem is everyone expected all games to run buttery smooth on day one with no hiccups. Ryzen specific game engine optimisations are coming according to AMD, as well as a Windows 10 scheduler patch. There are also other issues on the motherboard/BIOS side which manufacturers are working on.

ksec 9 years ago |

1. Most of the benchmarks are not even compiled or made with Zen Optimization in mind. But the results are already promising, or even Surprising.

2. Compared to the Desktop / Windows ecosystem, there is much more Open Source Software on the server side, along with the usual Open Source compilers. That means any AMD Zen optimization will be far easier to deploy than for games and apps on the desktop that were coded and compiled with Intel / ICC.

3. The sweet spot for Server Memory is still at 16GB DIMMs. A 256GB Memory for your caching needs or In-memory Database will now be much cheaper.

4. When are we going to get much cheaper 128GB DIMMs? Fitting 2TB of memory per socket, and 4TB per U, along with 128 lanes for NVMe SSD storage: the definition of Big Data has just grown a little bigger.

5. Between now and 2020, the roadmap has Zen+ and 7nm. Along with PCI-E 4.0. I am very excited!
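The memory figures in points 3 and 4 can be sanity-checked with a quick sketch. The 8 channels per socket come from the article; 2 DIMMs per channel is my assumption (it is what makes 16GB DIMMs come out at 256GB and 128GB DIMMs at 2TB), and the function name is made up.

```python
def socket_capacity_gb(dimm_gb: int, channels: int = 8, dimms_per_channel: int = 2) -> int:
    """Total memory per socket in GB, assuming every DIMM slot is populated."""
    return dimm_gb * channels * dimms_per_channel


def system_capacity_gb(dimm_gb: int, sockets: int = 2, **kwargs) -> int:
    """Total memory for a multi-socket system built from identical sockets."""
    return sockets * socket_capacity_gb(dimm_gb, **kwargs)
```

Under those assumptions, `socket_capacity_gb(16)` is 256, `socket_capacity_gb(128)` is 2048 (2TB), and a 2P box with 128GB DIMMs reaches 4096 (4TB).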

keth 9 years ago | |

> 5. Between now and 2020, the roadmap has Zen+ and 7nm. Along with PCI-E 4.0. I am very excited!

Yes, and it's rumored that the top end 7nm chip will be 48 cores (codename starship). Exciting times ahead now that the competition is back.

rl3 9 years ago |

In previous threads there was discussion about Intel processors, specifically Skylake (which is a desktop processor), being superior for server workloads involving vectorization.

How will Naples fare on this front?

quickben 9 years ago | |

That front remains to be seen. However, with 128 lanes and 8-channel RAM, it will make a mess out of Intel in the VM hosting arena.

I'm glad I don't own any Intel stock atm :)

greggyb 9 years ago | | |

The VM hosting arena is exactly where cloud providers play.

A high core count, energy efficient CPU with IO out the wazoo?

I'm happy I bought AMD stock over the summer (:

astrodust 9 years ago | |

Outside of specialized workloads, not a lot of software is vectorized. Maybe your database server can take advantage, but your application server will probably not benefit one bit.

wtallis 9 years ago | |

Desktop Skylake doesn't support AVX-512. Server Skylake will, when it ships. (The Xeon E3 v5 doesn't, because it's the same chip as desktop Skylake.)

rl3 9 years ago | | |

Removed the incorrect information from my post. Thanks for the correction.

sp332 9 years ago | |

Naples might not fare well, but AMD is betting on vector operations being offloaded to a GPU-like accelerator connected via Infinity Fabric.

Tuna-Fish 9 years ago | |

Badly, but it doesn't matter because it's still just a tiny portion of the market.

deepnotderp 9 years ago |

I've long been advocating for a high i/o cpu with several pcie lanes. 128 lanes will support 8 GPUs at max bandwidth. AMD has positioned itself well.

andy_ppp 9 years ago |

How well does, say, Postgres scale on such hardware? Is anything more than 8 cores overkill, or can we assume good linear increases in queries per second...

eis 9 years ago | |

Depends on your queries. I am looking at a server right now that uses 80% of 32 cores with Postgres 9.6. It's doing lots of upserts and small selects. Averages 76k transactions per second. I think it could easily take advantage of a 64 core system.

The main scalability issue I have with Postgres is its horrible layout of data pages on disk. You can't order rows to be laid out on disk according to primary key. You can CLUSTER the table every now and then, but that's not really practical for most production loads.
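A toy model of why physical row order matters for range scans: count how many pages a primary-key range touches when rows are stored in key order versus in effectively random order. This is an illustration only, not how Postgres stores or reads data; the row counts, the rows-per-page figure and the function names are all assumptions (real Postgres pages are 8 kB by default, and rows per page depends on row width).

```python
import random


def pages_touched(row_order, keys_wanted, rows_per_page):
    """Count distinct pages holding the wanted keys.

    row_order: primary keys in physical (on-disk) order.
    """
    wanted = set(keys_wanted)
    return len({pos // rows_per_page
                for pos, key in enumerate(row_order)
                if key in wanted})


def compare_layouts(n_rows=10_000, rows_per_page=100, range_size=200, seed=0):
    """Return (pages for key-ordered layout, pages for shuffled layout)."""
    keys = list(range(n_rows))
    clustered = keys[:]                  # rows laid out in primary-key order
    shuffled = keys[:]
    random.Random(seed).shuffle(shuffled)  # rows laid out in arbitrary order
    wanted = range(1000, 1000 + range_size)
    return (pages_touched(clustered, wanted, rows_per_page),
            pages_touched(shuffled, wanted, rows_per_page))
```

In the key-ordered layout a 200-row range sits on 2 pages of 100 rows; in the shuffled layout the same rows are scattered over most of the table's pages, which is the I/O cost being complained about.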

koolba 9 years ago | | |

I think I saw a proposal recently for something that would cover this use case. IIRC it was for an index organized table that stores the entire contents in a btree (so it would naturally be stored in primary key order).

I don't think there's been any work on it yet though.

pg314 9 years ago | | |

Have you looked into pg_repack [1]? It's a PostgreSQL extension that can CLUSTER online, without holding an exclusive lock. I haven't used it, but it looks interesting as an alternative to the built-in CLUSTER.

[1] http://reorg.github.io/pg_repack/

brianwawok 9 years ago | |

This is from 2012: http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-abo...

My guess is the 1-socket option scales great. 2 sockets are less than ideal, and you will not double the 1-socket performance.

mtgx 9 years ago |

If they have a much better performance/$ than Intel, which they likely will have, it sounds like a good opportunity for AWS to significantly undercut Microsoft and Google (which recently bragged about purchasing expensive Skylake-E chips).

chx 9 years ago | |

There's opportunity cost to consider. Google has Skylake-E now which is not even available at retail yet.

mtgx 9 years ago | | |

Well, it also seems that Intel prioritized its customers. If I were Amazon or Microsoft (the rumors said Google and Facebook were the priority customers), I would get Naples just to spite Intel (it doesn't hurt that AMD's Naples likely offers better perf/$, too, though):

https://semiaccurate.com/2016/11/17/intel-preferentially-off...

ajaimk 9 years ago |

This is the first I'm reading about the 32 cores being 4 dies on a package - Not sure how well that will work out in practice. IBM does something similar with Power servers where 2 dies on a package are used for lower end chips.

Basically, using multiple dies increases latency significantly between the cores on different dies. This will affect performance. I will not judge till I see the benchmark though :-)

Coding_Cat 9 years ago |

With how big these chips are getting, I wonder if the next iteration will have an HBM last-level cache on chip.

phkahler 9 years ago | |

That's the old EHP concept.

http://wccftech.com/amd-exascale-heterogeneous-processor-ehp...

I'd like to have that in the old project quantum package: http://wccftech.com/amd-project-quantum-not-dead-zen-cpu-veg...

That would be a TFLOPS level supercomputer on your desk.

keth 9 years ago | |

Here is the newest PDF about something like that: http://www.computermachines.org/joe/publications/pdfs/hpca20...

throwawayish 9 years ago | |

"IBM did it first"

Well not with HBM (which is DRAM), but huge amounts of L3 SRAM on a MCM... POWER5 I believe.

Demcox 9 years ago |

Just having one of those in a workstation gets me all warm and fuzzy.

HippoBaro 9 years ago |

I think Naples will be a very serious threat to Intel in the server market. As Ryzen benchmarks & reviews have shown, Zen really shines in heavy-multithreaded applications. The typical workload of a server.

Though I am kind of worried concerning memory access. Latency penalties when accessing non-local memory are very high on Zen CPUs due to the multi-die architecture design.

Does that mean we will finally see some serious interest in Shared-Nothing designs and the like in the future?

Symmetry 9 years ago |

Semi-ironically this looks like just the thing to use in a supercomputer controlling a good number of NVidia GPUs.

gbrown_ 9 years ago | |

Was thinking the same thing. As in the CPU market, it's good to have competition in GPUs, but it would be interesting if Nvidia picked up / partnered with AMD. Oh well, let's see how OpenPOWER pans out.

PeCaN 9 years ago | |

What would be really interesting is if AMD CPUs and GPUs both supported their Infinity Fabric concept. Heterogeneous systems with high-performance direct memory access are a huge deal.

galeos 9 years ago |

This is a multi-chip module (MCM). Are the high-core-count Xeons now all single-die? It will be interesting to see what impact the MCM approach has on benchmarks, as I suppose it could have a latency impact in certain use cases.

m3kw9 9 years ago |

In other words, we have a faster server chip coming

deelowe 9 years ago |

This is when things will get interesting. Ryzen appears to do better with hot and server workloads than gaming.

deelowe 9 years ago | |

Should read HPC instead of "hot"

emcrazyone 9 years ago |

Can anyone chime in as to why you'd use PCIe over something more direct, core to core? As I understand it, the CPU still needs to talk to a PCIe host/bridge controller. Why not have something that is more direct between processors?

sliken 9 years ago | |

Hypertransport is an AMD technology that's high bandwidth per line, low latency, and scalable. It's also cache-coherent (well there's a version that is), so it's great for connecting CPUs. But the AMD hardware is flexible and can use the same pins for either.

So the single-socket systems can have more PCIe lanes available, but the dual-socket ones have fewer per socket because some of those lanes are used for HyperTransport.

What I can't figure out is why Intel and AMD aren't using something similar (HyperTransport for AMD and QPI for Intel) to connect directly to GPUs in a cache-coherent way. These days the faster interconnects spend a decent fraction of their latency just getting across the PCIe bus twice.

So 100 Gbit networks, InfiniBand, GPUs, etc. could all take advantage of a lower-latency cache-coherent interface, but it's not available.

I suspect that's mainly because QPI and HyperTransport are incompatible and PCIe is good enough for the high-volume cases.

jabl 9 years ago | | |

Well, AMD is one of the founding members of OpenCAPI, http://opencapi.org/ , so I guess there's some hope. It seems they haven't talked about it wrt Zen/Naples, maybe some later iteration will have it?

rosege 9 years ago |

Licensing Windows 2016 Datacenter would cost a fortune for the 2P server.

__mp 9 years ago |

I'm wondering how they will stack up against XeonPhi.

hossbeast 9 years ago |

How feasible will a Naples desktop build be?