AMD Patent Reveals Hybrid CPU-FPGA Design That Could Be Enabled by Xilinx Tech

AMD Patent Reveals Hybrid CPU-FPGA Design That Could Be Enabled by Xilinx Tech (hothardware.com)

278 points by craigjb 5 years ago | 155 comments

dhruvdh 5 years ago |

I can't help but think most commentators haven't actually read the article or the patent. This isn't about having an FPGA embedded into the CPU or near the CPU; it's about having a programmable, FPGA-like execution unit that can be programmed to be, say, a 4-bit floating point adder, or any other weird execution unit one might need.

Why is this important? Have a program that does a lot of integer multiplications? Let's program all of these programmable execution units to multiply integers on the fly, etc. Now your integer multiply throughput is higher, as per the current program's needs.

Have lots of weird old x86 instructions you are forced to support but no one actually uses? Don't waste transistors on them just program an execution unit to execute that instruction on the fly, etc.

I think it's great, and that most people are missing the point.
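To make the "programmable execution unit" idea concrete, here is a toy sketch in Python. It is not the patent's design: it only models a unit whose behaviour is a lookup table that can be rewritten at run time, so the same "hardware" acts as an adder and then as a multiplier. The class name `ProgrammableUnit` and the 4-bit operand width are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

class ProgrammableUnit:
    """Toy model of a reconfigurable execution unit for two 4-bit operands.

    'Programming' fills a truth table, loosely like loading FPGA LUTs.
    Hypothetical illustration only; not taken from the AMD patent.
    """
    WIDTH = 4

    def __init__(self) -> None:
        self.table: Dict[Tuple[int, int], int] = {}

    def program(self, op: Callable[[int, int], int]) -> None:
        # Precompute the result for every operand pair (the "bitstream").
        limit = 1 << self.WIDTH
        self.table = {(a, b): op(a, b) for a in range(limit) for b in range(limit)}

    def execute(self, a: int, b: int) -> int:
        # Execution is a pure table lookup; no arithmetic happens here.
        return self.table[(a, b)]

unit = ProgrammableUnit()
unit.program(lambda a, b: (a + b) & 0xF)   # behave as a 4-bit wrapping adder
added = unit.execute(9, 8)
unit.program(lambda a, b: a * b)           # reprogram the same unit as a multiplier
multiplied = unit.execute(9, 8)
```

The point of the sketch is only that the unit's function is data, not wiring; a real design would trade table size against logic depth very differently.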

ajross 5 years ago | |

> Have lots of weird old x86 instructions you are forced to support but no one actually uses? Don't waste transistors on them

That's been the role of microcode for like three decades now. Why does it matter if the instruction no one uses is implemented with FPGA gates or uops? No one uses them.

webmobdev 5 years ago | | |

Theoretically, now you can create "micro-codes" in the CPU for your specific needs - e.g. scientists do a lot of calculation and would like a processor optimised for that. Now they can use the FPGA to do it. You want a CPU instruction that is optimised for something else - you can program the FPGA for that.

pjmlp 5 years ago | | |

Six actually, given that microcode was the approach taken on most mainframes that started around Burroughs timeframe.

mhh__ 5 years ago | |

> Don't waste transistors on them just program an execution unit to execute that instruction on the fly, etc.

Possible, but the "x86" part is already a big decoder in front of a murky processor underneath so this is already what the CPU does - if you removed the reference to an FPGA, rewriting old x86 instructions in terms of "new" ones is microcode.

justinclift 5 years ago | |

Wonder how that'd work - in practical terms - with modern systems?

eg, you could be running (say) 3 or 4 primary applications at the same time. Which one gets to use the FPGA pieces, or are they re-written every time, on every context switch? ;)

Re-writing them on every context switch sounds extremely unlikely, so it'd be more some kind of resource locking thing instead. Which could mean that FPGA-using applications at least start out being fairly niche, as only one could run "per core" or something.

Maybe dedicated cores per application instead or something?
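A rough sketch of the "resource locking" option, in Python. It assumes one programmable slot per core and an OS-side table of owners; the class `FabricSlots` and the application names are hypothetical, and nothing here reflects how AMD or any OS actually schedules such hardware.

```python
from typing import Dict, Optional

class FabricSlots:
    """Hypothetical OS-side bookkeeping: one programmable slot per core."""

    def __init__(self, cores: int) -> None:
        self.owner: Dict[int, Optional[str]] = {c: None for c in range(cores)}
        self.reconfigurations = 0

    def acquire(self, app: str) -> Optional[int]:
        # Reuse a slot the app already owns: no reprogramming needed.
        for core, who in self.owner.items():
            if who == app:
                return core
        # Otherwise claim a free slot and pay the reprogramming cost once.
        for core, who in self.owner.items():
            if who is None:
                self.owner[core] = app
                self.reconfigurations += 1
                return core
        return None  # all slots locked; caller falls back to plain CPU code

    def release(self, app: str) -> None:
        for core, who in self.owner.items():
            if who == app:
                self.owner[core] = None

slots = FabricSlots(cores=2)
a = slots.acquire("video-encoder")
b = slots.acquire("ml-training")
c = slots.acquire("game")              # no slot left, returns None
again = slots.acquire("video-encoder") # already configured, no extra cost
```

With this policy the fabric is never rewritten on a context switch; the cost is that a third application simply does not get acceleration until a slot is released.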

freeqaz 5 years ago | | |

The work is already being put in with modern NUMA (non-uniform memory access) systems to pin apps to specific cores. This seems like it would overlap if this ended up being used in production.

apsient 5 years ago | | |

How about having FPGA execution units in addition to the normal units, with the OS deciding how and by whom these new EUs are used, based on the most CPU-intensive apps currently running?

laydn 5 years ago | |

I think the point is that what you're describing has existed for years. Any Xilinx Zynq chip or Altera SoC chip can do this already. Just because the data doesn't travel through the AXI/AMBA bus does not make this novel.

andy_ppp 5 years ago | | |

Of course it does because you get access to the CPU as well so you can hop from an instruction you built on the FPGA to another “silicon” instruction with the same registers and processor state. This is extremely clever and doesn’t involve shuffling code from the main processor over a slow bus, executing some stuff all on the fpga and shuffling it back.

nitrogen 5 years ago | |

It sounds like that's exactly how another processor worked: https://news.ycombinator.com/item?id=25623763

shuringai 5 years ago | |

This won't be a question of what the user wants to do with these parts. I bet it won't even be accessible to common programmers. Applications will simply be constantly racing each other to reprogram the field-programmable part of my CPU at every startup.

av3csr 5 years ago | |

Sounds like a more general version of what Sambanova is doing with their Dataflow unit.

ajb 5 years ago |

How patents happen:

An alarm goes out: our company has fewer patents than company X! In fact, we have the smallest patent hoard of our competitor group. If they sue us we might not have enough patents to sue them back! We must have more patents! Everyone who gets a patent gets a bonus! ( Exit CEO, trailing exclamation marks. All the engineers file their pet idea as a patent, hoping management will be interested in building it).

(Some years later) Okay, some of those patents we filed are a bit silly. But at least we now have a huge, intimidating patent pile! No one will dare sue us now! Mua ha ha! But let's be a bit more careful what we give those patent bonuses for. (Meanwhile at company X: our company has fewer patents than company Y!...)

The above is a true story, happened to me. Well, apart from the moustache twirling. My name is on some not very practical patents. So I'm not very convinced by stories which read the tea leaves from patents as to what a company intends (or economists trying to infer innovation rate from patent filing rate). Another problem is that the patent office is slow. Unless the company is General Fusion, most probably the product will be out before the patent.

skybrian 5 years ago | |

I almost got my name on a patent that way for sharing an idea in an internal forum. I refused (well, I asked politely) and they took my name off it. (There was someone else in the conversation as well.)

pnw_hazor 5 years ago | | |

Omitting an inventor from a patent is usually grounds for invalidating the patent.

(Unless they are careful to exclude all of your contributions from the claims -- which is almost impossible)

gaudat 5 years ago | | |

Sounds like it's a bad idea to have our names on a patent from the way you said it. Can you enlighten us on that?

debug-desperado 5 years ago | |

Similar story with a patent that a friend applied for at National Instruments some years ago for his work on LabView software. As far as I could tell from his description it was more of an implementation detail rather than a patentable product. NI was pushing for patents though, and he obliged. Ended up getting a nice little bonus for his young family!

leecb 5 years ago |

Everything described in the article sounds exactly like some of the Virtex*-FX products from more than 10 years ago.

For instance, the Virtex4-FX had either one or two 450MHz PowerPC cores embedded in it, where you could implement 8 of your own additional instructions in the FPGA. This is effectively now a CPU where you can extend the instruction set, and design your own instructions specific to your application. For example, you might make special instructions using the onboard logic to accelerate video compression, or math operations; I know of one application that was designed to do a 4x4 matrix multiply per cycle.

https://www.digikey.com/catalog/en/partgroup/virtex-4-fx-ser... https://www.xilinx.com/support/documentation/data_sheets/ds1...
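As a rough illustration of an extensible instruction set, the Python sketch below models a tiny register machine with a fixed ISA plus eight user-definable opcode slots. The eight-slot limit mirrors the number mentioned above; everything else (the `ToyCPU` class, the opcode names, the tuple encoding) is invented for the example and has nothing to do with the actual PowerPC interface.

```python
from typing import Callable, Dict, List

class ToyCPU:
    """Toy register machine with a fixed ISA plus 8 user-defined opcode slots."""
    CUSTOM_SLOTS = 8

    def __init__(self) -> None:
        self.regs: List[int] = [0] * 4
        self.custom: Dict[int, Callable[[int, int], int]] = {}

    def define_custom(self, slot: int, fn: Callable[[int, int], int]) -> None:
        if not 0 <= slot < self.CUSTOM_SLOTS:
            raise ValueError("only 8 custom instruction slots are available")
        self.custom[slot] = fn

    def run(self, program) -> None:
        for op, rd, ra, rb in program:
            if op == "add":
                self.regs[rd] = self.regs[ra] + self.regs[rb]
            elif op == "li":                      # load immediate: rd <- ra
                self.regs[rd] = ra
            elif op.startswith("custom"):
                slot = int(op[len("custom"):])
                self.regs[rd] = self.custom[slot](self.regs[ra], self.regs[rb])
            else:
                raise ValueError(f"unknown opcode {op}")

cpu = ToyCPU()
cpu.define_custom(0, lambda a, b: abs(a - b))     # e.g. absolute difference
cpu.run([
    ("li", 0, 7, 0),
    ("li", 1, 10, 0),
    ("custom0", 2, 0, 1),   # custom op reads the same registers
    ("add", 3, 2, 1),       # hard-wired op consumes its result directly
])
```

The custom instruction and the built-in `add` share one register file, which is the property that distinguishes this style from shipping work to a separate accelerator.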

thrtythreeforty 5 years ago | |

For those curious, Xtensa is a similar embeddable architecture (known especially for its use in the ESP32 microcontroller) that allows broad latitude to the designer to customize its instruction set with custom acceleration. The integration is very good, the compiler recognizes the new intrinsics and the designer has control over how the instruction is pipelined into the main processor.

Unfortunately it's very proprietary, and as far as I know there isn't an at-home version you can play with on FPGAs. But this kind of thing does exist if you can afford it - you don't have to roll your own RTL.

jng 5 years ago | |

I am very familiar with the new Zynq family, embedding ARM cores on the same die together with FPGA fabric. I didn't know that the PowerPC version allowed such a tight coupling as handing off an instruction to programmable logic, the current Zynq models are much more lightly coupled, using AXI buses to connect the ARM cores with the PL (and many other components on the same SoC).

mhh__ 5 years ago | |

What was the latency like to actually get data into your shiny new instruction e.g. do I get a 14 stage pipeline stall to actually use the instruction?

rowanG077 5 years ago | | |

That depends on how you designed your instruction.

Traster 5 years ago |

I hate to be that bucket of cold water, but there's multiple reasons FPGAs haven't been successful in package with CPUs. Firstly, the costs of embedding the FPGA - FPGAs are relatively large and power hungry (for what they can do), if you're sticking one on a CPU die, you're seriously talking about trading that against other extremely useful logic. You really need to make a judgement at purchase time whether you want that dark piece of silicon instead of CPU cores for day to day use.

Secondly, whilst they're reconfigurable, they're not reconfigurable in the time scales it takes to spawn a thread, it's more like the same scale of time to compile a program (this is getting a little better over time). Which makes it a difficult system design problem to make sure your FPGA is programmed with the right image to run the software programme you want. If you're at that level of optimization, why not just design your system to use a PCI-E board, it'll give you more CPU, and way more FPGA compute and both will be cheaper because you get a stock CPU and stock FPGA, not some super custom FPGA-CPU hybrid chip.

Thirdly, the programming model for FPGAs is fundamentally very different from that of CPUs: it's dataflow, and generally the FPGA is completely deterministic. We really don't have a good answer for writing FPGA logic to handle the sort of cache hierarchy, out of order execution that CPUs do. So you're not getting the same sort of advantage that you'd expect from that data locality. It's very difficult to write CPU/FPGA programs that run concurrently, almost all solutions today run in parallel - you package up your work, send it off to the FPGA and wait for it to finish.

Finally, as others have said - the tools are bad. That's relatively solvable.

For me, it boils down to this, if you have an application that you think would be good on the same package as a CPU, it's probably worth hardening it into ASIC (see: error correction, Apple's AI stuff). If you have an application that isn't, then a PCI-E card is probably a better bet - you get more FPGA, more CPU and you're not trading the two off.
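The reconfiguration-time argument can be put as a back-of-the-envelope break-even calculation. The Python sketch below assumes a simple cost model (a one-off reconfiguration cost, then a fixed compute and transfer cost per call); the function name and all the numbers are made up for illustration and are not measurements of any real device.

```python
import math

def break_even_calls(t_reconfig: float, t_cpu: float,
                     t_fpga: float, t_transfer: float) -> float:
    """Number of calls after which offloading beats staying on the CPU.

    Offload cost for n calls: t_reconfig + n * (t_fpga + t_transfer)
    CPU cost for n calls:     n * t_cpu
    Returns math.inf when the FPGA path is never faster per call.
    """
    saving_per_call = t_cpu - (t_fpga + t_transfer)
    if saving_per_call <= 0:
        return math.inf
    return t_reconfig / saving_per_call

# Made-up numbers (seconds), for illustration only.
n = break_even_calls(t_reconfig=0.1, t_cpu=50e-6, t_fpga=5e-6, t_transfer=20e-6)
never = break_even_calls(t_reconfig=0.1, t_cpu=10e-6, t_fpga=5e-6, t_transfer=20e-6)
```

Under these assumed numbers the offload only pays off after roughly 4000 calls, and if the per-call transfer overhead exceeds the compute saving it never pays off at all.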

GuB-42 5 years ago |

Everyone seems to be talking about accelerated instructions but how about I/O?

FPGAs are awesome at asynchronous I/O and low latency. We could implement network stacks, sound and video processing, etc... It can start a TLS handshake as soon as the electrical signal hits the ethernet port, while the CPU is not even aware of it happening. It can timestamp MIDI input down to the microsecond and replay with the same precision. It can process position data from a VR headset at the very last moment in the graphics pipeline. Maybe even do something like a software defined radio.

Basically any simple but latency-critical operation. Of course, embedded/realtime systems are a prime target.

slimsag 5 years ago | |

A fair amount of enterprise NICs in data centers do exactly this, e.g. Intel FPGA smart NICs

I don't know enough to know how this being on the CPU would affect performance in this scenario, but I'd love to learn more!

AlotOfReading 5 years ago | |

That's pretty much what Xilinx's Zynq product lines are already targeting, including embedded. They're comparatively nice boards to work on, as long as you can swallow the BOM cost.

Izikiel43 5 years ago | |

Microsoft has already done the networking thing with project catapult as of a few years ago. I think they also use it for ai.

https://www.microsoft.com/en-us/research/project/project-cat...

andoriyu 5 years ago | |

The network example is wasteful. There are already NICs that have an FPGA and a whole set of Linux kernel features. You wouldn't want that to be that far away from the NIC itself.

PTP works just like that - it timestamps incoming and outgoing packets right after/before the packet hits the wire. There is eXpress Data Path, which can offload eBPF programs to NICs and deal with packets without them even reaching the kernel at all.

High Frequency Traders do exactly that IIRC today.

As for video processing, codecs today are way too complex to be run there. Well, no one will stop you from running something like an integer DCT part on an FPGA.

VR thing... Generally, aside from Nvidia, companies don't want to ship an entire FPGA to end customers (guess why Nvidia G-Sync monitors used to be so expensive). Something like Snapdragon XR2 "solves" VR. Also, in order to render a picture you need to know headset position early, not at the last moment. How would you know what to render?

How useful this is depends entirely on the FPGA's capability and its size. I bet it will be more useful for things like implementing some hash function there or something like that.

IMO this will be a very niche product inside an already niche market.
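For context on why PTP cares about timestamps taken right at the wire: the standard offset calculation uses four timestamps from a Sync / Delay_Req exchange, and any software jitter in those timestamps goes straight into the result. A minimal Python sketch follows; the function name and sample values are illustrative, and the formula assumes the path delay is the same in both directions.

```python
def ptp_offset_and_delay(t1: int, t2: int, t3: int, t4: int):
    """t1: master sends Sync, t2: slave receives it,
    t3: slave sends Delay_Req, t4: master receives it (all in ns).
    Assumes the path delay is the same in both directions."""
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

# Illustrative values: slave clock runs 500 ns ahead, one-way delay is 100 ns.
offset, delay = ptp_offset_and_delay(t1=1_000, t2=1_600, t3=2_000, t4=1_600)
```

Here the exchange recovers the 500 ns clock offset and the 100 ns wire delay exactly, which only works because the timestamps are assumed to be free of queuing or software delay.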

Scene_Cast2 5 years ago |

A killer tech for this would be a framework that automatically reprograms the FPGA and offloads the work if it makes sense. For example - running k-means? Have your FPGA automatically (with minimal dev effort) flash to be a Nearest Neighbor accelerator.

The problem is finding a way to make that translation happen with minimal dev effort, as software is written rather differently from hardware.
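For reference, the part of k-means that such a framework would want to offload is the assignment step: for every point, find the nearest centroid. A plain Python sketch of that inner loop is below; the function name and the sample data are made up, and this says nothing about how an FPGA version would be generated.

```python
from typing import List, Sequence

def assign_to_nearest(points: Sequence[Sequence[float]],
                      centroids: Sequence[Sequence[float]]) -> List[int]:
    """For each point, return the index of the closest centroid
    (squared Euclidean distance). This is the k-means assignment step."""
    labels = []
    for p in points:
        best, best_d = 0, float("inf")
        for i, c in enumerate(centroids):
            d = sum((x - y) ** 2 for x, y in zip(p, c))
            if d < best_d:
                best, best_d = i, d
        labels.append(best)
    return labels

labels = assign_to_nearest([(0, 0), (9, 9), (1, 2), (8, 7)], [(0, 0), (10, 10)])
```

Every point's distance computation is independent of the others, which is what makes this loop a natural candidate for a wide, pipelined hardware implementation.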

cashsterling 5 years ago | |

I recommend checking out CacheQ: https://cacheq.com/

they are working on almost exactly this. If I was an investor, or Intel or AMD, I would buy them and/or invest heavily.

therealcamino 5 years ago | | |

Their web site is very sparse on what programming models the tool supports. Traditionally, the things you can easily accelerate automatically are algorithms you can write naturally in Fortran 77 (lots of arrays, no pointers), and that's one limit on the applicability of these automatic tools. (Other limits that other posters have pointed out are compilation+place+route runtime, and reconfiguration time.)

They are claiming you can use malloc and make "extensive" use of pointers in C programs and still have them automatically compiled for the FPGA. That's where details are needed and they are mostly missing.

I watched their 30 minute demo film. The speedups are impressive, and on the small example it's impressive that it does the partitioning automatically. However, the program contains only a single call to malloc, and all pointers are derived from that address, so it doesn't do much to convince us that the memory model and alias analysis give you more flexibility than the F77 model.

d_tr 5 years ago | |

You might want to check the "Warp Processing" project out: http://www.cs.ucr.edu/~vahid/warp/. It is probably exactly what you are thinking about. Transparent analysis of the instruction stream at runtime and synthesis and offloading of hot spots to the FPGA.
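One simple way to find hot spots in an instruction stream is to count taken backward branches, since those usually mark loop back-edges. The Python sketch below applies that heuristic to a synthetic trace of program-counter values; it is an illustration of the general idea, not a description of the Warp Processing implementation, and the function name, trace, and threshold are all made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def hot_loops(pc_trace: Iterable[int], threshold: int) -> List[Tuple[int, int]]:
    """Count taken backward branches in a trace of program-counter values.

    A jump from a higher address to a lower one is treated as a loop
    back-edge; (target, source) pairs seen at least `threshold` times are hot.
    """
    counts: Counter = Counter()
    prev = None
    for pc in pc_trace:
        if prev is not None and pc < prev:
            counts[(pc, prev)] += 1   # (loop start, branch address)
        prev = pc
    return [edge for edge, n in counts.items() if n >= threshold]

# Synthetic trace: a 3-instruction loop at 0x10..0x18 executed 5 times.
trace = [0x00, 0x04] + [0x10, 0x14, 0x18] * 5 + [0x1C]
hot = hot_loops(trace, threshold=4)
```

Five iterations produce four back-edges from 0x18 to 0x10, so that loop is the one region reported as a candidate for synthesis.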

Scene_Cast2 5 years ago | | |

Huh, interesting. It seems that the work doesn't have to be explicitly parallel for this to work, which is a surprise.

rch 5 years ago | |

I recall reading papers about doing this by profiling Java apps a decade or so ago, but I would have to dig pretty deep in my HN comment history to find them.

The approach seems conceptually similar to the optimizations available via the enterprise version of GraalVM.

d_tr 5 years ago |

The main reason I am interested in this acquisition is a (faint) hope that they open some specs up to help projects like SymbiFlow.

ohazi 5 years ago |

For decades, the FPGA vendors have had this fever dream of "an FPGA in every PC" -- either as an add-on card, or as part of the chipset on a motherboard -- that would enable a compiler or operating system to seamlessly accelerate arbitrary tasks on demand.

In my opinion, the problem has always been their software: the FPGA vendor tools are slow, bloated monstrosities. The core of these tools are written by the big three EDA vendors (Cadence, Synopsys, and Mentor Graphics) rather than the FPGA vendors themselves. The licenses include ridiculous, paranoid restrictions [1] and force the FPGA vendors to keep their bitstream formats and timing databases secret [2] in order to prevent competition from other tool vendors. Most FPGA vendors didn't see this as a problem, but even the ones that did didn't have much of a choice, because the tool market is a cartel.

Thankfully, we now have an open source toolchain [3] with support for a growing number of FPGA architectures [4], and using it vs. the vendor tools is like using gcc or llvm vs. a '90s era, non-compliant C++ compiler. It even has a real IR that isn't Verilog, which has made it easier to design new HDLs [5].

I don't see how a dynamic FPGA accelerator platform can be even remotely viable without this. It's the difference between a developer getting to choose between one of a few dozen pre-baked designs that lock up the entire FPGA (and needing to learn how to shovel data into it), vs. a compiler flag that can give you the option of unrolling any loop directly into any inactive region of FPGA fabric.

It would be quite the cherry on top to see AMD build something interesting in this space. But unless they're willing to fully unencumber at least this one design, I think the effort is likely to fail. The open source guys are chomping at the bit to make this work, and have been making real progress lately. Meanwhile, the EDA vendors have been making promises, failing, and throwing tantrums for the last 20 years. It's time to write them off.

[1] https://twitter.com/OlofKindgren/status/1052822081652617221?...

[2] Imagine trying to write an assembler without being allowed to see the manual that tells you how instructions are encoded. It's like that, but the state-space is hundreds to thousands of bytes in multiple configurations rather than a few dozen bits.

[3] https://github.com/YosysHQ/yosys

[4] https://symbiflow.github.io/

[5] https://github.com/m-labs/nmigen

CoffeeDregs 5 years ago |

This looks a bit like the old (2000s) work of Leopard Logic or Tensilica. Exciting stuff.

One important note (based on some comments here): generally, these in-CPU FPGAs have very fast reconfiguration. Not sure if it's 1, 10 or 100 cycles but it's not milliseconds. Actually, (in past examples) configuration might take milliseconds but it would load a number of planes of configurations: plane 0 might be MP3 audio device; plane 1 might be MPEG2 video device. Then reconfiguration is: switch to plane 1.

This AMD proposal looks like it's much more tightly integrated into the CPU so it's got to be even faster. Combine that with the deep knowledge of processor internals you'll have to have to code for this thing and I'm having a hard time seeing you and me having much luck tinkering. This is probably 99.99% data center with gnarly NDAs and field support.
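The "planes" idea can be modelled as a small state machine: loading a configuration is slow, switching between already loaded ones is fast. The Python sketch below uses invented cycle counts purely to show the asymmetry; the class name `MultiContextFabric` and the costs are assumptions, not figures from any real part.

```python
from typing import Dict

class MultiContextFabric:
    """Toy model: several configurations are preloaded (slow), and only one
    is active at a time; switching the active plane is cheap."""
    LOAD_COST = 1_000_000   # made-up cycle counts, for illustration only
    SWITCH_COST = 1

    def __init__(self) -> None:
        self.planes: Dict[int, str] = {}
        self.active = None
        self.cycles = 0

    def load_plane(self, index: int, config: str) -> None:
        self.planes[index] = config
        self.cycles += self.LOAD_COST

    def switch_to(self, index: int) -> str:
        if index not in self.planes:
            raise KeyError(f"plane {index} has not been loaded")
        self.active = index
        self.cycles += self.SWITCH_COST
        return self.planes[index]

fabric = MultiContextFabric()
fabric.load_plane(0, "mp3-audio")
fabric.load_plane(1, "mpeg2-video")
setup_cycles = fabric.cycles
current = fabric.switch_to(1)
switch_cycles = fabric.cycles - setup_cycles
```

The expensive part is paid once up front; after that, changing what the fabric does costs about as much as selecting a different register bank.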

ineedasername 5 years ago |

Sounds like spending a few hours a month learning an HDL could be a good long-term career decision.

nsajko 5 years ago | |

I think the right way isn't "learn an HDL", it's "learn digital electronics design". Hardware description languages enable succinct hardware description, but it's still necessary to keep an image of the actual hardware in mind.

ip26 5 years ago | | |

HDL is really just ascii schematics.

seabird 5 years ago | |

You're going to need to commit a lot more time than that. HDLs and the surrounding concepts have key fundamental differences from software that a lot of developers have a hard time stomaching. That's why high-level synthesis is the FPGA industry's City of El Dorado; software developers would be able to create acceleration designs without having to build up a fairly large new skillset.

signa11 5 years ago |

this approach is not new, and has been toyed with since the 1960s (!), see G. Estrin's work on adaptive architectures for example.

i got to know about this as part of PRISM (processor reconfiguration through-instruction set metamorphosis) work in the early 90's. there is a very cool paper by the same name. check it out !

ps : PRISM Paper (http://class.ece.iastate.edu/tyagi/cpre583/documents/prism.p...)

qwerty456127 5 years ago |

I could never stop wondering why is this not a norm yet. Why doesn't every computer have an FPGA.

whatever1 5 years ago |

How fast can an FPGA be reprogrammed? If I close my FPGA accelerated machine learning training algorithm, and then open a PC game, would it be feasible to load the new gaming-oriented instructions in the ~10-30 seconds that a PC game takes to open?

dewhelmed 5 years ago | |

What sort of gaming-related workload do you think an FPGA would be suitable for? I don't know much about the gaming world, but isn't the majority of the computational workload graphics-rendering related, in which case, the GPU architecture is the best candidate to iterate on?

whatever1 5 years ago | | |

Not a game developer, but I believe that game mechanics, lighting and "AI" are all handled by the CPU.

gh02t 5 years ago | |

Programming a bitstream onto the FPGA is relatively quick, it's perfectly feasible. The time consuming part is in development and synthesis.

BryanBeshore 5 years ago |

Lisa Su is a fantastic CEO. Time will tell what the impact of AMD’s acquisition of Xilinx will be (should it close), but this shows the strategy and execution behind Su and team.

While a lot of acquisitions don’t pan out, this seems great.

nynx 5 years ago |

This is exciting! Would be cool if it could access some sort of gpio as well!

rwmj 5 years ago |

About *!$% time! I was hoping Intel would do something like this when they acquired Altera a few years back. Does anyone know why Intel acquired Altera?

PedroBatista 5 years ago | |

Almost the same reason someone buys a Peloton bike or rusted old Porsche. Because someone had a dream last night and has the money.

sbrorson 5 years ago | | |

chuckle This is true. As far as I can see (as a hardware engineer frequently doing FPGA stuff), the Intel/Altera combo has not produced any new products nor yielded any customer benefit beyond what would have happened if the two companies had remained independent. But I'll bet the "business strategists" at each company who thought this one up made a pile of money from the deal.

d_tr 5 years ago | |

AFAIK there exist some Xeon + FPGA chips. No clue about availability though...

harry8 5 years ago | | |

Are they connected by the pcie bus though?

https://www.xes-inc.com/

AlphaSite 5 years ago |

This seems more appropriate for GPUs than for CPUs where it’s high throughput and you can eat the latency cost of reconfiguring the node.

mhh__ 5 years ago |

Xilinx already have ARM cores in their FPGAs so I wonder which way they'll go - I'd honestly prefer a Neoverse core to an x86

sbrorson 5 years ago | |

You are right about the ARM cores, mostly. Xilinx Zynq devices have ARM A devices built into them as "hard" cores. That is, the ARMs are instantiated directly in silicon, not as "soft" cores which take LUTs (gates) from the FPGA fabric. The ARM A is a microprocessor (not a microcontroller) powerful enough to run Linux.

The ARM connects to the FPGA fabric using a so-called AXI bus, which is a local bus defined by ARM. Xilinx supplies a bunch of "soft" cores which you can instantiate in the FPGA and integrate with the ARM. Of course, you can write your own logic for the FPGA too, as long as you can figure out how to interface to it using one of the AXI bus variants.

Several vendors offer experimenter platforms which are affordable enough for hobbyists and folks making engineering prototypes. Examples are Avnet's Zed board and Digilent's Zybo board.

The biggest problem with the Zynq ecosystem is that the Xilinx tools -- Vivado/SDK and whatever they renamed it to last year -- are steaming piles of smelly brown stoff. Vivado is buggy, poorly supported, has bad documentation, and the supplied examples typically don't work in the latest version of Vivado since they were written long ago and have been made obsolete via version skew. An absolute disgrace compared to what software engineers are used to. The SDK is basically Eclipse which has its own problems, but is not as bad as Vivado. Ask me how I know.

I think AMD and Xilinx have a long way to go before they can satisfy the hype and speculation I see in all the posts here. I suppose one could shell out $20K for a seat of Synopsys if one wanted a decent set of dev tools, but that's not the direction most software engineers are going nowadays.

Also, assuming NVidia completes its acquisition of ARM, the whole Zynq ecosystem is imperiled since it pits ARM against NVidia.

jagger27 5 years ago | |

AMD already has full-on Arm products.

https://www.amd.com/en/amd-opteron-a1100

efferifick 5 years ago | |

Not sure how realistic it would be, but I would like to see a RISC-V base core, and the FPGA implementing the extensions. Why? Because it would be cool! Also, I don't really see a use case except for debugging compilers supporting multiple RISC-V extensions and what not.

galaxyLogic 5 years ago |

I think neural networks and AI might be a good application area for this.

economusty 5 years ago |

Computronium

user5994461 5 years ago |

Yet another patent that should never have been granted.

SoC have been a thing for a long time. SoC = CPU + FPGA on a single chip.

Looking at the patent, the list of 20 claims is absurd. The title says it all "... PROGRAMMABLE INSTRUCTIONS IN COMPUTER SYSTEMS", they're trying to patent anything that can run or dispatch instructions.

refulgentis 5 years ago | |

>> the list of 20 claims is absurd.

Claims are a union - each individual claim may sound simple, what matters is the combination.

>> The title says it all "... PROGRAMMABLE INSTRUCTIONS IN COMPUTER SYSTEMS", they're trying to patent anything that can run or dispatch instructions.

No. The title of a patent is not a patent.

user5994461 5 years ago | | |

Every claim is almost a patent on its own. Submit 20 claims that are progressively more specific, so if one claim is denied during the patent application or afterwards, the other claims can still stand.

Typical strategy is to claim as many things as you can imagine, like inventing CPU and anything that can evaluate an instruction and instructions themselves, then remove any claim that the patent office refuses to grant.