- 1S/2S is obviously where the pie is. Few servers are 4S.
- 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still more than LGA-36712312whateverthenumberwas
- First x86 server platform with SHA1/2 acceleration
- 128 PCIe lanes in a 1S system is unprecedented
All in all Naples seems like a very interesting platform for throughput-intensive applications. Overall it seems that Sun with its Niagara approach (massive number of threads, lots of I/O on-chip) was just a few years too early (and likely a few thousand per system too expensive ;)
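For a rough sense of the memory bandwidth bullet above: theoretical peak bandwidth is channels × transfer rate × 8 bytes per transfer. A minimal sketch; the DDR4-2400 speed is an assumption for illustration only, since nothing here states what speed either platform actually runs.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth of one socket, in GB/s (decimal)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9

# Assumed DDR4-2400 on both platforms, purely so the comparison is apples to apples.
naples = peak_bandwidth_gbs(8, 2400)    # 8 channels per socket
lga2011 = peak_bandwidth_gbs(4, 2400)   # 4 channels per socket
print(f"{naples:.1f} GB/s vs {lga2011:.1f} GB/s, ratio {naples / lga2011:.1f}x")
```

These are ceiling numbers; sustained bandwidth in real workloads will be lower.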
Yes, definitely drooling at this. Assuming a workload that doesn't eat too much CPU, this would make for a relatively cheap and hassle-free non-blocking 8 GPU @ 16x PCIe workstation. I wants one.
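The lane budget behind that 8 GPU idea is simple division. A minimal sketch, taking the 128-lane 1S figure from the list above and assuming every GPU gets a dedicated x16 link with nothing held back for storage or networking:

```python
def full_width_devices(total_lanes: int, lanes_per_device: int = 16) -> int:
    """How many devices can each get a dedicated link of the given width."""
    return total_lanes // lanes_per_device

print(full_width_devices(128))        # 8 GPUs at x16, leaving 0 lanes spare
print(full_width_devices(128 - 16))   # 7 GPUs if 16 lanes are kept for NVMe or a NIC
```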
This one will be interesting. The current Ryzen (like most of the Intel desktop range) has two channels, but everyone has been benchmarking it against the i7-6900K because they both have eight cores. The i7-6900K is the workstation LGA 2011 with four channels. If the workstation Ryzen will have eight channels...
By comparison, the performance of sparc substantially improved moving from the T1, T2 to T3+. The T1 used a round-robin policy to issue instructions from the next active thread each cycle, supporting up to 8 fine-grained threads in total. That made it more like a barrel processor.
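To make the round-robin idea concrete, here is a toy model of issuing one instruction per cycle from the next thread that still has work. It is only a sketch: the thread names and instruction strings are made up, and it ignores stalls, cache misses and everything else a real T1 pipeline deals with.

```python
from collections import deque

def round_robin_issue(threads: dict[str, list[str]], cycles: int) -> list[tuple[int, str, str]]:
    """Issue one instruction per cycle, rotating over threads that still have work."""
    pending = {name: deque(instrs) for name, instrs in threads.items()}
    ready = deque(name for name, instrs in threads.items() if instrs)
    trace = []
    for cycle in range(cycles):
        if not ready:
            break                      # every thread has run out of instructions
        name = ready.popleft()         # next thread in rotation
        trace.append((cycle, name, pending[name].popleft()))
        if pending[name]:
            ready.append(name)         # back of the queue if it still has work
    return trace

# Hypothetical threads and instructions, for illustration only.
threads = {"t0": ["ld", "add"], "t1": ["mul"], "t2": ["st", "sub"]}
for cycle, name, instr in round_robin_issue(threads, cycles=8):
    print(cycle, name, instr)
```

No single thread gets to issue on consecutive cycles while others are ready, which is the barrel-processor flavour the comment describes.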
Starting with the T3, two of the threads could be executed simultaneously. Then, starting with the T4, sparc added dynamic threading and out-of-order execution. Later versions are even faster and clock speeds have also risen considerably.
Every mainframe interface is basically an offload interface: "computers" DMAing and processing to the CPs and each other. Every I/O device has a command processor, so it can handle channel errors and integrated PCIe errors in a way PCs cannot.
A PC with Chelsio NICs doing TCP offload with direct data placement or RDMA as well as Fiber Channel storage would be mini/mainframe-ish.
And AMD should dump SHA1 acceleration in the next generation.
The cost to have that on silicon is probably close to zero. If you think SHA1 is just going to magically disappear because you want it to, well, you'll be in for a SHA1 sized surprise. Our grandkids will still have SHA1 acceleration.
>ARMv8 has had it for like 2-3 years now...
Because ARM cores don't remotely have the CPU heft an Intel x86/64 chip has, so ARM needs all this acceleration because it's typically used in very low power mobile scenarios. On top of that, Intel claims AES-NI can be used to accelerate SHA1.
https://software.intel.com/en-us/articles/improving-the-perf...
https://en.wikipedia.org/wiki/Intel_SHA_extensions
>There are seven new SSE-based instructions, four supporting SHA-1 and three for SHA-256:
>SHA1RNDS4, SHA1NEXTE, SHA1MSG1, SHA1MSG2, SHA256RNDS2, SHA256MSG1, SHA256MSG2
The only processors so far with these extensions are low power Goldmont chips.
At the previous job where I built the 64-core system, I even emailed the AMD marketing department to see if we could do some PR campaign together, but I think it was too soon before the Naples drop, because I never got a response. Here's to hoping Supermicro does a 4 CPU board for this... 128 cores would be amazing. (But I'll take 64 Naples cores as long as it gets rid of the bugs and issues I found with the Opterons).
Very few of the operations used GPU. Things may have changed since I was working there, but the work at the time wasn't suited for a GPU architecture.
The initial step was sequence cleanup, which is a hidden Markov model executed over a collection of sequences of varying length, so hard to parallelize. Sequence annotation is embarrassingly parallel on a per-library basis (each sequence can be annotated independently of the others), but the computational work is fuzzy string matching, which is once again hard to GPU-ize. Another major computational job was contig assembly, which is somewhat parallelizable (pairwise sequence comparisons), but once again involves fuzzy string matching so not GPU-izable.
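As an illustration of what "fuzzy string matching" can look like, here is a minimal edit-distance sketch. The comment does not say which algorithm the pipeline used, so plain Levenshtein distance is an assumed stand-in. Note how each cell depends on its left, upper and upper-left neighbours, and how the work varies with the sequence lengths.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))           # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))    # 3
print(edit_distance("GATTACA", "GATTACA"))   # 0
```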
So that's just sequence genetics. Don't know if GPUs are used in other areas.
Lots of cores, lots of threads, and lots of main memory. That was the key.
A quote from an AnandTech forum post [0] sounds promising:
"850 points in Cinebench 15 at 30W is quite telling. Or not telling, but absolutely massive. Zeppelin can reach absolutely monstrous and unseen levels of efficiency, as long as it operates within its ideal frequency range."
A comparison against a Xeon D at 30W would be interesting.
The possibility of this monster maybe coming out sometime in the future is also quite nice: http://www.computermachines.org/joe/publications/pdfs/hpca20...
[0] https://forums.anandtech.com/threads/ryzen-strictly-technica...
With that said, I'm looking forward to these systems.
However, there is an important difference. AMD seems to be putting multiple dies into the same package, whereas Intel seems to have (as the Cluster on Die name implies) everything on the same die. So my fear is that the interconnect between dies may not be fast enough to paper over our NUMA weaknesses.
x U (or x HE, if you're talking with a German manufacturer, they like to make that mistake ... ;) are rack-units, i.e. how large the case is.
Also, what about ECC? Does Ryzen support it or not?
The underperformance in gaming was tracked down to software issues according to AMD. Namely:
- bugs in the Windows process scheduler (scheduling 2 threads on same core, and moving threads across CPU complexes which loses all L3 cache data since each CCX has its own cache)
- buggy BIOS accidentally disabling Boost or the High Performance mode (feature that lets the processor adjust voltage and clock every 1 ms instead of every 40 ms.)
- games containing Intel-optimized code
More info: http://wccftech.com/amd-ryzen-launch-aftermath-gaming-perfor...
Furthermore, hardcore gamers usually play at 1440p or higher, in which case there is no difference in perf between Intel and AMD, as demonstrated by the many benchmarks (because the GPU is always the bottleneck at such high resolutions).
BTW they advertised it as good for gaming + streaming (h264 CPU encoding at the same time on the same machine). And "content creation", which pretty much always means video editing.
IIRC Ryzen supports unbuffered ECC if the mainboard supports it.
2. Compared to the Desktop / Windows ecosystem, there is much more Open Source software on the Server side, along with the usual Open Source compilers. Which means any AMD Zen optimization will be far easier to deploy compared to Games and Apps on Desktop coded and compiled with Intel / ICC.
3. The sweet spot for Server Memory is still at 16GB DIMMs. A 256GB Memory for your caching needs or In-memory Database will now be much cheaper.
4. When are we going to get much cheaper 128GB DIMMs? Fitting 2TB of memory per socket, and 4TB per U, along with 128 lanes for NVMe SSD storage, the definition of Big Data has just grown a little bigger.
5. Between now and 2020, the roadmap has Zen+ and 7nm. Along with PCI-E 4.0. I am very excited!
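The capacity figures in points 3 and 4 above line up with 16 DIMM slots per socket. A quick check, assuming two DIMMs on each of the eight channels (the two-per-channel figure is inferred from the 256GB and 2TB numbers, not stated in the thread):

```python
CHANNELS_PER_SOCKET = 8   # from the thread: 8 DDR4 channels per socket
DIMMS_PER_CHANNEL = 2     # assumption, inferred from the 256GB / 2TB figures

def socket_capacity_gb(dimm_gb: int) -> int:
    """Total memory per socket with every slot populated by the same DIMM size."""
    return CHANNELS_PER_SOCKET * DIMMS_PER_CHANNEL * dimm_gb

print(socket_capacity_gb(16))        # 256 GB with 16GB DIMMs
print(socket_capacity_gb(128))       # 2048 GB = 2TB with 128GB DIMMs
print(2 * socket_capacity_gb(128))   # 4096 GB = 4TB for two sockets
```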
Yes, and it's rumored that the top end 7nm chip will be 48 cores (codename starship). Exciting times ahead now that the competition is back.
How will Naples fare on this front?
I'm glad I don't own any Intel stock atm :)
A high core count, energy efficient CPU with IO out the wazoo?
I'm happy I bought AMD stock over the summer (:
The main scalability issue I have with Postgres is its horrible layout of data pages on disk. You can't order rows to be laid out on disk according to primary key. You can CLUSTER the table every now and then but that's not really practical for most production loads.
I don't think there's been any work on it yet though.
My guess is the 1 socket option scales great. 2 sockets are less than ideal, and you will not double the 1 socket performance.
https://semiaccurate.com/2016/11/17/intel-preferentially-off...
Basically, using multiple dies increases latency significantly between the cores on different dies. This will affect performance. I will not judge till I see the benchmark though :-)
http://wccftech.com/amd-exascale-heterogeneous-processor-ehp...
I'd like to have that in the old project quantum package: http://wccftech.com/amd-project-quantum-not-dead-zen-cpu-veg...
That would be a TFLOPS level supercomputer on your desk.
Well not with HBM (which is DRAM), but huge amounts of L3 SRAM on a MCM... POWER5 I believe.
Though I am kind of worried concerning memory access. Latency penalties when accessing non-local memory are very high on Zen CPUs due to the multi-die architecture design.
Does that mean we will finally see some serious interest in Shared-Nothing design and the like in the future?
So the single socket systems can have more pci-e lanes available, but the dual socket has less per socket because some of those lanes are used for hypertransport.
What I can't figure out is why Intel and AMD aren't using similar links (HyperTransport for AMD and QPI for Intel) to connect directly to GPUs in a cache coherent way. These days the faster interconnects spend a decent fraction of their latency just getting across the PCI-e bus twice.
So 100 Gbit networks, Infiniband, GPUs, etc all could take advantage of a lower latency cache coherent interface, but it's not available.
I suspect mainly because qpi and hypertransport are incompatible and pci-e is good enough for the high volume cases.
http://ce-publications.et.tudelft.nl/publications/1520_gpuac...
I've seen benchmarks on the -hackers mailing list with 88 core Intel servers (4s 22c) in regard to eliminating bottlenecks when you have that many cores. So even if it's not 100% there yet, it will be soon.
Skylake can compute SHA1 at 3.4-4.3 cycles/B and SHA256 at 7-9 cycles/B [1]. That's ~1GB/s SHA1 and ~500MB/s SHA256.
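Converting cycles per byte to throughput is just clock rate divided by cycles per byte. A minimal sketch; the 4.0 GHz clock is an assumption picked for illustration (the comment doesn't say what clock its figures use), and the numbers are per core.

```python
def throughput_mb_s(cycles_per_byte: float, clock_ghz: float) -> float:
    """Single-core hashing throughput in MB/s (decimal) at a given clock."""
    return clock_ghz * 1e9 / cycles_per_byte / 1e6

ASSUMED_CLOCK_GHZ = 4.0  # assumption, not taken from the thread

for name, cpb in [("SHA1", 3.4), ("SHA1", 4.3), ("SHA256", 7.0), ("SHA256", 9.0)]:
    print(f"{name} at {cpb} cycles/B: {throughput_mb_s(cpb, ASSUMED_CLOCK_GHZ):.0f} MB/s")
```

At that assumed clock the results land in the same ballpark as the ~1GB/s and ~500MB/s figures quoted above.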
A B-tree can in no terms be described as being laid out on disc in primary key order. The individual pages of the tree are placed on disc randomly, as they are allocated. Therefore an index scan won't return the rows in index order as quickly as the current scheme of having the rows separate from the index and sorting them every now and again.
Ultimately, for the goal of fast in-order scan of a table while adding/removing rows, you need the rows to be laid out on disc in that order, so that a sequential scan of the disc can be performed with few seeks. This requires that inserted rows are actually inserted in the space they should be, which is not always possible - often there isn't space in the page, and you don't want to spend lots of time shifting the rest of the rows rightwards a little bit to make space. To a certain extent Postgres already does insert in the right place if there is space in the right disc page (from deleted rows), but because this is not always possible, the solution is to re-CLUSTER the table every now and again.
I think the Postgres way is actually very well thought out.
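The seek argument can be illustrated with a toy model: count how often a scan in key order has to jump to a different page, once with rows physically in key order (as right after a CLUSTER) and once with rows in arbitrary order. This is only a sketch; the page size and row count are made up and it does not model Postgres's actual storage format.

```python
import random

def page_switches(layout: list[int], rows_per_page: int) -> int:
    """Count page changes during a scan in key order.

    layout[i] is the key stored in slot i; slot i lives on page i // rows_per_page.
    """
    page_of_key = {key: slot // rows_per_page for slot, key in enumerate(layout)}
    switches, current = 0, None
    for key in sorted(page_of_key):
        page = page_of_key[key]
        if page != current:            # scan has to move to another page
            switches += 1
            current = page
    return switches

keys = list(range(10_000))
clustered = keys[:]                    # physically in key order
scattered = keys[:]
random.Random(0).shuffle(scattered)    # physically in arbitrary order

print(page_switches(clustered, rows_per_page=100))   # 100, one visit per page
print(page_switches(scattered, rows_per_page=100))   # far more, close to one per row
```

Rows inserted after a CLUSTER push the layout from the first case toward the second, which is why the re-CLUSTER has to be repeated.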
https://www.starwindsoftware.com/blog/numa-and-cluster-on-di...
There's not much difference in memory bandwidth between crossing domains on the same die (COD) vs crossing domains system wide (accessing memory for a different socket). What kind of computation are you running?
We're not latency sensitive at all. The problem we run into with NUMA is that we totally saturate QPI due to FreeBSD's lack of NUMA awareness.
The results you link to don't match with what we've seen on our HCC Broadwell CPUs, at least with COD disabled. Though we only really look at aggregate system bandwidth, so potentially the slowness accessing the "far" memory on the same socket is latency driven, and falls away in aggregate.
Most uses of special instructions will check feature bits or CPU version, but not all will do so correctly.
(I'd say that the additional area cost of something like this is small, and the big cost of special instructions is reserving opcodes and feature bits)
But for all practical purposes, SHA1 isn't about to disappear. MD5 has been shown to be broken since forever and people still write new code using it today.
Very much this. Which is why I ended up theorycrafting that the AMD many core CPUs would be so useful.
Then a lot of code is very branchy but massively parallel, which makes clusters of pure CPUs more flexible (important in research settings) and gives them higher utilization than mixed CPU/GPU clusters.
GPU code takes longer to get to market and requires more specialized skills than standard CPU-oriented programming. Late to market means you miss a whole wave of experimental methods from the lab, e.g. GPU short read aligners arrived just as long reads started to come out of the sequencing lab, leading people to stop doing short reads, or at least stop doing pure short reads.
Secondly, quite a bunch of the key staff at the large research institutes had been burned by previous hardware acceleration attempts and were not going to throw money at it until it was market proven.
Bioinformatics tends to be cutting edge (the hemorrhaging kind) on the bio/lab tech side, yet the production IT tends to balance that by doing the things we know, as we already have enough risks, i.e. focus on the algorithms and robustness, not on pure power.
Not necessarily in comp. genetics / sequencing.. / the DNA stuff..
The source engine isn't exactly the pinnacle of engine development.
It doesn't really know what to do with more than 2ish cores, so you probably get more FPS by using a dual core instead of a quad core, since dual cores tend to go farther in terms of overclocking.
Try running on a cheap i3 from a few years ago and you'll understand your pain quickly.
My opinion: if Microsoft is able to pivot the Scorpio over to the Ryzen (or indeed, any CPU with more than 4C/8T) it will drastically alter the lowest common denominator in terms of what game developers target, i.e. we'll see games moving towards more modern threading architectures (e.g. futures/jobs as in Star Citizen, which more thoroughly exploit CPU resources).
Furthermore, there is hearsay evidence that supports AMD's claims. Ashes of the Singularity currently runs better on Intel but the developers claim:
[2]> Oxide games is incredibly excited with what we are seeing from the Ryzen CPU. Using our Nitrous game engine, we are working to scale our existing and future game title performance to take full advantage of Ryzen and its 8-core, 16-thread architecture, and the results thus far are impressive.
In addition to that, if you look at the CPU usage/saturation alongside the benchmarks (13:08 in [1]) it's strikingly obvious that the CPU is not the bottleneck - Intel is upwards of 90% on all cores while the Ryzen hovers around ~60%. I'm holding my credit card close until the aforementioned optimizations and rumored bios patches land, but I'm willing to give AMD a little benefit of the doubt - what we're seeing largely matches what they are saying.
[1]: https://youtu.be/ylvdSnEbL50 [2]: http://wccftech.com/amd-ryzen-launch-aftermath-gaming-perfor...
Hell in 30 million households there are 8 jaguar x86 core gaming machines active now with an IPC that is probably (I assume) atrocious.
I built my i7 4770 4 years ago and the sad part is that it will probably still take a lot of time for it to become a bottleneck in 90% of the games.
That said, completely tangential to what you're saying: Ryzen may (at worst) perform like an i5 in gaming but it has more than 8 threads. I do everything with my machine and I'm going with an R7 1700 overclocked.
Hardcore is that guy who plays Call of Duty 24x7 on his Xbox 360 and mediocre 720p television. You can't deny the determination or enthusiasm. Hardware's irrelevant.
Blaming Windows is just a desperate excuse from AMD to justify its lack of performance. Don't be tricked by that.
It's possible, and rather common, that there are motherboard issues on the first generation of motherboards, which again is not a valid excuse but a bad thing that desperately needs fixing from AMD and a sign that it's still in the testing phase.
This is nothing new or outstanding at all.
Now, the other approach is that you have a CPU with out-of-order (OoO) execution, meaning that the CPU contains a scheduler that handles a queue of instructions, and any instruction that has all its dependencies satisfied can be submitted for execution. And then later on a bunch of magic happens so that externally to the CPU it still looks like everything was executed in order like the program code specified.
This is pretty good for getting good single thread performance, and can exploit some amount of MLP as well, e.g. if a bunch of instructions are waiting for a memory operation to complete, some other instructions can still proceed (perhaps executing a memory op themselves). So in this model the amount of MLP is limited by the inherent serial dependencies in the code, and by the length of the instruction queues that the scheduler maintains. The downside of this is that the OoO logic takes up quite a bit of chip area (making it more expensive), and also tends to be one of the more power-hungry parts of the chip. But, if you want good single-thread performance, that's the price you have to pay.
Anyway, now that you have this OoO CPU, what about adding hardware threads? Well, now that you already have all this scheduling logic, it turns out it's relatively easy. Just "tag" each instruction with a thread ID, and let the scheduler sort it all out. This is what is called Simultaneous Multi-Threading (SMT). So in a way it's a pretty different way of doing threading compared to the Niagara-style in-order processor. Also, since you already have all this OoO logic that is able to exploit some MLP within each thread, you don't need as many threads as the Niagara-style CPU to saturate the memory subsystem.
So, this SMT style of threading is what you see in contemporary Intel x86 processors (they call it hyperthreading (HT)), IBM POWER, and now also AMD Zen cores.
As for benchmarks, I'm too lazy to search, but I'm sure you can find e.g. some speccpu results for Niagara.
So although separated by time but not by clocks (the Intel setup has roughly the same base clocks and the same RAM as the T4 setup), the 40 thread Xeon system had roughly double the perf of the 128 thread T4 setup running SPECjvm2008 https://www.spec.org/jvm2008/results/jvm2008.html
Wrt. cases: I think a regular E-ATX compatible case should be enough, but it all depends on the motherboard, and those don't exist yet. Existing 8x GPU servers have been 4U rack mount dual socket affairs; you can also already get 7x GPU dual socket "EEB" motherboards and workstation style cases, but none that will do full 16x for all the GPUs.
For comparison, Notebookcheck's system noise scale is 30dB=silent, 40dB=audible, 50dB=loud.
I'd quit if I had to work in an office space with 60 dB noise. That's like sitting next to a rack of 1U servers at "moderately angry bee swarm" fan level.
I personally cannot stand to be near a noise source above 40 dB for any extended length of time (more than a few hours).
But 60 dB... wow. Can't imagine how shitty that must be to work in for 8 hours per day.
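Since decibels are logarithmic, the gap between those figures is bigger than the numbers suggest: a difference of N dB corresponds to a sound intensity ratio of 10^(N/10). A quick sketch of that arithmetic:

```python
def intensity_ratio(db_a: float, db_b: float) -> float:
    """How many times more sound intensity db_a carries than db_b."""
    return 10 ** ((db_a - db_b) / 10)

print(intensity_ratio(60, 40))   # 100.0: 60 dB is 100x the intensity of 40 dB
print(intensity_ratio(50, 40))   # 10.0
```

Perceived loudness grows more slowly than intensity; a common rule of thumb (not from this thread) is that each 10 dB step sounds roughly twice as loud.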
I don't know if AMD will make a new architecture or not, but I can't see why they wouldn't just release 32 Ryzen cores side-by-side and underclocked at the stock configuration.
https://forums.anandtech.com/threads/ryzen-strictly-technica...
AMD might well steal some of the dual socket market with a single socket, and maybe some of the quad socket market with dual sockets.
Considering that the current Ryzen at $500 is relatively competitive with the $1,000 Intel (basically a relabeled Xeon with 4 memory buses in the LGA2011 server socket), a quad module (32 core/64 thread) in a socket sounds pretty good. Even if it's more watts than the Intel.
One reason, perhaps, is if my binaries are compiled with Intel-specific optimizations and it's inconvenient to deploy separate AMD-optimized binaries.
However, high end systems don't lend themselves well to mass-deployment (i.e. scale out).
I know very little about computation genetics/biology but it sounds interesting.
On the other hand, many bioinformatics tools solve a specific scientific question and are usually written by people with a mostly non-computational background. They use higher level languages such as Python/Perl/R, and people often don't have the expertise or time to implement them for GPUs.
However, now that machine learning and deep neural network approaches are being picked up by the field, the workloads might change, and there are also frameworks that make it easier to leverage GPUs (TensorFlow, etc.)
That's an interesting thought, has anyone ever attempted to get 'regular' programmers interested in this stuff as a 'game'/code golf kind of thing?
(Too many) years ago, one of the programming channels I was active in got distracted for 3 weeks while everyone tried to come up with the fastest way to search a 10Mb random string for a substring, not in the theoretical sense but in the actual "how fast can this be done" sense. That was the point I found out that Delphi (which was my tool of choice at the time) had relatively slow string functions in its 'standard' library, and I ended up writing KMP in assembly or something equally insane. I got my ass handed to me by someone who'd written a bunch of books on C, but eh, it was damn good fun. It was also one of the first realizations I had of just how fast machines (back then) had gotten and just how slow 'standard' (but very flexible) libraries could be.
Obviously the total scope of rewriting researchers' code would probably be far, far beyond that, but if they could define the parts they know are slow, with their code and some sample data, I know a few programmers who would find that an interesting challenge.
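For anyone who hasn't met KMP (Knuth-Morris-Pratt), here is a minimal Python sketch of the algorithm. It is not the assembly version from the story, just the textbook form: precompute how far the pattern can fall back on a mismatch, so the text is never re-read.

```python
def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once; k is how much of the pattern currently matches.
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(kmp_find("abracadabra", "cad"))   # 4
print(kmp_find("aaaaab", "aab"))        # 3
```

In practice a well-tuned library search can beat a naive KMP, which is rather the point of the anecdote.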
Thanks for the response.
(More than a decade ago, I struggled to / barely succeeded in building a Beowulf cluster; I am just amazed at how far both the hardware & the software tools have come..)
In other areas of comp bio though, I think GPUs are finding use: protein folding, molecular dynamics. Also, with STORM & such: super resolution microscopy? I think GPUs will become increasingly important.
Also, whole cell simulations?
You are also right that some of the comp bio areas (CryoEM, protein folding, molecular dynamics) are well suited for GPUs
One of the nice things about HN is you get to look outside your own bubble (I mostly do Line of Business/SME stuff so this stuff isn't just outside my wheelhouse it's on the other side of the ocean).
GPUs excel at problems where you can apply exactly the same logic to lots of data in parallel. CPUs can handle branching cases, where each operation requires a lot of decisions, a lot better.
Sufficiently large FPGA chips could accelerate certain parts of the workflow, if not the whole thing, since they're extremely good at branching in parallel. This is why early FPGA Bitcoin implementations blew the doors off of any GPU solution, each round of the SHA hashing process can be run in parallel on sequentially ordered data if you organize it correctly.
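For reference, the Bitcoin work in question is a double SHA-256 over a candidate that differs only in its nonce. Every nonce is independent of the others, which is what lets a hardware pipeline keep each round stage busy with a different candidate. A minimal software sketch; the `prefix` bytes and the "leading zero bytes" target are simplified stand-ins, not the real block header format or difficulty rule.

```python
import hashlib
from typing import Optional

def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as Bitcoin does."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def search(prefix: bytes, zero_bytes: int, max_nonce: int) -> Optional[int]:
    """Return the first nonce whose double SHA-256 starts with zero_bytes zero bytes."""
    target = b"\x00" * zero_bytes
    for nonce in range(max_nonce):
        # Each candidate depends only on its own nonce, never on earlier results.
        candidate = prefix + nonce.to_bytes(4, "little")
        if double_sha256(candidate).startswith(target):
            return nonce
    return None

# Hypothetical header bytes, for illustration only.
print(search(b"example-header", zero_bytes=1, max_nonce=100_000))
```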
FPGAs run hot, don't have many transistors, have limited clock rates, and are a pain to program.
So yeah a "Sufficiently large" chip, a "sufficiently fast clock", and a "sufficiently well written app" could theoretically do well. Problem is in the real world they aren't and developers aren't targeting them.
Your user name: a fan of the cre-lox system, or the enzyme itself?
Cool uid!
In my past life, I've used flp/frt & cre/lox; and studied mismatch repair enzymes. And topoisomerases.. :)
Where do you think the heat comes from? Or where do you think the power that doesn't turn into heat goes?
That the FPGAs use this proprietary and for all intents opaque binary format is not very helpful and is probably the biggest barrier.