Why is this important? Have a program that does a lot of integer multiplications? Let's program all of these programmable execution units to multiply integers on the fly, etc. Now your integer multiply throughput is higher, as per the current program's needs.
Have lots of weird old x86 instructions you are forced to support but no one actually uses? Don't waste transistors on them; just program an execution unit to execute that instruction on the fly.
I think it's great, and that most people are missing the point.
That's been the role of microcode for like three decades now. Why does it matter if the instruction no one uses is implemented with FPGA gates or uops? No one uses them.
Possible, but the "x86" part is already a big decoder in front of a murky processor underneath so this is already what the CPU does - if you removed the reference to an FPGA, rewriting old x86 instructions in terms of "new" ones is microcode.
eg, you could be running (say) 3 or 4 primary applications at the same time. Which one gets to use the FPGA pieces, or are they re-written every time, on every context switch? ;)
Re-writing them on every context switch sounds extremely unlikely, so it'd be more some kind of resource locking thing instead. Which could mean that FPGA-using applications at least start out being fairly niche, as only one could run "per core" or something.
Maybe dedicated cores per application instead or something?
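To make the "resource locking" idea above a bit more concrete, here's a rough Python sketch. Everything in it (FabricSlot, the image names, the never-evict policy) is made up for illustration; it isn't any real OS or driver interface. The first process to ask gets the per-core fabric, and anyone else falls back to the normal software path instead of forcing a rewrite on a context switch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FabricSlot:
    """One per-core block of FPGA fabric (hypothetical model, not a real API)."""
    owner: Optional[int] = None   # pid currently holding the fabric
    image: Optional[str] = None   # name of the loaded configuration

    def acquire(self, pid: int, image: str) -> bool:
        """Claim the fabric if it is free or already ours; never evict."""
        if self.owner is None:
            # The one-time configuration load would be paid here.
            self.owner, self.image = pid, image
            return True
        return self.owner == pid and self.image == image

    def release(self, pid: int) -> None:
        """Give the fabric back; only the current owner can release it."""
        if self.owner == pid:
            self.owner, self.image = None, None


def run(slot: FabricSlot, pid: int, image: str) -> str:
    """Use the fabric when we hold it, otherwise fall back to plain CPU code."""
    return "fpga" if slot.acquire(pid, image) else "software"
```

With this policy, a second application asking for a different image simply runs in software until the first one releases the slot, which is the "fairly niche, one per core" outcome described above.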
An alarm goes out: our company has fewer patents than company X! In fact, we have the smallest patent hoard of our competitor group. If they sue us we might not have enough patents to sue them back! We must have more patents! Everyone who gets a patent gets a bonus! (Exit CEO, trailing exclamation marks. All the engineers file their pet ideas as patents, hoping management will be interested in building them.)
(Some years later) Okay, some of those patents we filed are a bit silly. But at least we now have a huge, intimidating patent pile! No one will dare sue us now! Mua ha ha! But let's be a bit more careful what we give those patent bonuses for. (Meanwhile at company X: our company has fewer patents than company Y!...)
The above is a true story, happened to me. Well, apart from the moustache twirling. My name is on some not very practical patents. So I'm not very convinced by stories which read the tea leaves from patents as to what a company intends (or by economists trying to infer innovation rate from patent filing rate). Another problem is that the patent office is slow. Unless the company is General Fusion, the product will most probably be out before the patent.
(Unless they are careful to exclude all of your contributions from the claims -- which is almost impossible)
For instance, the Virtex4-FX had either one or two 450MHz PowerPC cores embedded in it, where you could implement 8 of your own additional instructions in the FPGA. This is effectively a CPU where you can extend the instruction set and design your own instructions specific to your application. For example, you might make special instructions using the onboard logic to accelerate video compression or math operations; I know of one application that was designed to do a 4x4 matrix multiply per cycle.
https://www.digikey.com/catalog/en/partgroup/virtex-4-fx-ser... https://www.xilinx.com/support/documentation/data_sheets/ds1...
Unfortunately it's very proprietary, and as far as I know there isn't an at-home version you can play with on FPGAs. But this kind of thing does exist if you can afford it - you don't have to roll your own RTL.
Secondly, whilst they're reconfigurable, they're not reconfigurable on the time scale it takes to spawn a thread; it's closer to the time it takes to compile a program (this is getting a little better over time). That makes it a difficult system design problem to make sure your FPGA is programmed with the right image to run the software you want. If you're at that level of optimization, why not just design your system to use a PCI-E board? It'll give you more CPU and way more FPGA compute, and both will be cheaper because you get a stock CPU and a stock FPGA, not some super custom FPGA-CPU hybrid chip.
Thirdly, the programming model for FPGAs is fundamentally very different from that of CPUs: it's dataflow, and generally the FPGA is completely deterministic. We really don't have a good answer for writing FPGA logic to handle the sort of cache hierarchy and out-of-order execution that CPUs have. So you're not getting the sort of advantage that you'd expect from that data locality. It's very difficult to write CPU/FPGA programs that run concurrently; almost all solutions today run in parallel: you package up your work, send it off to the FPGA and wait for it to finish.
Finally, as others have said - the tools are bad. That's relatively solvable.
For me, it boils down to this, if you have an application that you think would be good on the same package as a CPU, it's probably worth hardening it into ASIC (see: error correction, Apple's AI stuff). If you have an application that isn't, then a PCI-E card is probably a better bet - you get more FPGA, more CPU and you're not trading the two off.
FPGAs are awesome at asynchronous I/O and low latency. We could implement network stacks, sound and video processing, etc... It can start a TLS handshake as soon as the electrical signal hits the ethernet port, while the CPU is not even aware of it happening. It can timestamp MIDI input down to the microsecond and replay with the same precision. It can process position data from a VR headset at the very last moment in the graphics pipeline. Maybe even do something like a software defined radio.
Basically, every simple but latency-critical operation. Of course, embedded/realtime systems are a prime target.
I don't know enough to know how this being on the CPU would affect performance in this scenario, but I'd love to learn more!
https://www.microsoft.com/en-us/research/project/project-cat...
PTP works just like that: it timestamps incoming and outgoing packets right after/before they hit the wire. There is also eXpress Data Path (XDP), which can offload eBPF programs to NICs and deal with packets without them ever reaching the kernel.
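For anyone curious what those hardware timestamps are used for: the standard PTP delay request-response exchange yields four timestamps, and the clock offset falls out of simple arithmetic, assuming the path delay is the same in both directions. A minimal sketch (the function and variable names are mine, and the sample numbers in the comment are invented):

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Estimate slave clock offset and one-way path delay.

    t1: master sends Sync            (master clock)
    t2: slave receives Sync          (slave clock)
    t3: slave sends Delay_Req        (slave clock)
    t4: master receives Delay_Req    (master clock)

    Assumes a symmetric path; any asymmetry shows up as offset error.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay


# Invented example in nanoseconds: slave runs 500 ns ahead, wire delay is 100 ns.
offset_ns, delay_ns = ptp_offset_and_delay(t1=1000, t2=1600, t3=2000, t4=1600)
```

The closer to the wire the four timestamps are taken, the less software jitter leaks into the estimate, which is the whole argument for doing it in hardware.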
High Frequency Traders do exactly that IIRC today.
As for video processing, codecs today are way too complex to be run there. Well, no one will stop you from running something like an integer DCT part on an FPGA.
VR thing... Generally, aside from Nvidia, companies don't want to ship an entire FPGA to end customers (guess why Nvidia G-Sync monitors used to be so expensive). Something like Snapdragon XR2 "solves" VR. Also, in order to render a picture you need to know the headset position early, not at the last moment. How would you know what to render?
How useful this is depends entirely on the FPGA's capability and its size. I bet it will be more useful for things like implementing some hash function there or something like that.
IMO this will be a very niche product inside an already niche market.
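As one example of what an "integer DCT part" can look like, here is the 4x4 forward core transform used by H.264, written in plain Python. The comment above doesn't say which transform it has in mind, so picking this one is my assumption; the helper names are mine too. It needs only integer adds and small constant multiplies, which is why this kind of kernel maps nicely onto FPGA logic.

```python
# 4x4 forward core transform matrix from H.264 (an integer DCT approximation).
C = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    """Return the transpose of a 4x4 matrix."""
    return [list(row) for row in zip(*m)]


def forward_core_transform(block):
    """Compute Y = C * X * C^T with integers only; scaling is left to the quantizer."""
    return matmul(matmul(C, block), transpose(C))
```

A flat block puts all of its energy into the top-left (DC) coefficient and leaves the rest at zero, which is an easy sanity check for a hardware implementation as well.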
The problem is finding a way to make that translation happen with minimal dev effort, as software is written rather differently from hardware.
they are working on almost exactly this. If I was an investor, or Intel or AMD, I would buy them and/or invest heavily.
They are claiming you can use malloc and make "extensive" use of pointers in C programs and still have them automatically compiled for the FPGA. That's where details are needed and they are mostly missing.
I watched their 30 minute demo film. The speedups are impressive, and on the small example it's impressive that it does the partitioning automatically. However, the program contains only a single call to malloc, and all pointers are derived from that address, so it doesn't do much to convince us that the memory model and alias analysis give you more flexibility than the F77 model.
The approach seems conceptually similar to the optimizations available via the enterprise version of GraalVM.
In my opinion, the problem has always been their software: the FPGA vendor tools are slow, bloated monstrosities. The core of these tools is written by the big three EDA vendors (Cadence, Synopsys, and Mentor Graphics) rather than the FPGA vendors themselves. The licenses include ridiculous, paranoid restrictions [1] and force the FPGA vendors to keep their bitstream formats and timing databases secret [2] in order to prevent competition from other tool vendors. Most FPGA vendors didn't see this as a problem, but even the ones that did didn't have much of a choice, because the tool market is a cartel.
Thankfully, we now have an open source toolchain [3] with support for a growing number of FPGA architectures [4], and using it vs. the vendor tools is like using gcc or llvm vs. a '90s era, non-compliant C++ compiler. It even has a real IR that isn't Verilog, which has made it easier to design new HDLs [5].
I don't see how a dynamic FPGA accelerator platform can be even remotely viable without this. It's the difference between a developer getting to choose between one of a few dozen pre-baked designs that lock up the entire FPGA (and needing to learn how to shovel data into it), vs. a compiler flag that can give you the option of unrolling any loop directly into any inactive region of FPGA fabric.
It would be quite the cherry on top to see AMD build something interesting in this space. But unless they're willing to fully unencumber at least this one design, I think the effort is likely to fail. The open source guys are chomping at the bit to make this work, and have been making real progress lately. Meanwhile, the EDA vendors have been making promises, failing, and throwing tantrums for the last 20 years. It's time to write them off.
[1] https://twitter.com/OlofKindgren/status/1052822081652617221?...
[2] Imagine trying to write an assembler without being allowed to see the manual that tells you how instructions are encoded. It's like that, but the state-space is hundreds to thousands of bytes in multiple configurations rather than a few dozen bits.
[3] https://github.com/YosysHQ/yosys
One important note (based on some comments here): generally, these in-CPU FPGAs have very fast reconfiguration. Not sure if it's 1, 10 or 100 cycles but it's not milliseconds. Actually, (in past examples) configuration might take milliseconds but it would load a number of planes of configurations: plane 0 might be MP3 audio device; plane 1 might be MPEG2 video device. Then reconfiguration is: switch to plane 1.
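A toy model of the "planes" idea, to show why switching can be cheap even when loading is not. The class name and both cycle counts are invented placeholders, not measurements from any real part:

```python
class MultiContextFabric:
    """Toy model of an FPGA that holds several configurations at once."""

    LOAD_CYCLES = 1_000_000   # illustrative guess: filling a plane is slow
    SWITCH_CYCLES = 10        # illustrative guess: selecting a plane is fast

    def __init__(self, n_planes: int) -> None:
        self.planes = [None] * n_planes   # one stored configuration per plane
        self.active = 0                   # index of the plane driving the fabric
        self.cycles = 0                   # running total of cycles spent

    def load(self, index: int, image: str) -> None:
        """Write a configuration into a plane (the expensive step)."""
        self.planes[index] = image
        self.cycles += self.LOAD_CYCLES

    def switch(self, index: int) -> str:
        """Make an already-loaded plane active (the cheap step)."""
        if self.planes[index] is None:
            raise ValueError(f"plane {index} has no configuration loaded")
        self.active = index
        self.cycles += self.SWITCH_CYCLES
        return self.planes[index]
```

Load "mp3" into plane 0 and "mpeg2" into plane 1 once, and every later change of function is just a switch() call; the slow load only matters when you need more configurations than you have planes.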
This AMD proposal looks like it's much more tightly integrated into the CPU so it's got to be even faster. Combine that with the deep knowledge of processor internals you'll have to have to code for this thing and I'm having a hard time seeing you and me having much luck tinkering. This is probably 99.99% data center with gnarly NDAs and field support.
I got to know about this as part of the PRISM (Processor Reconfiguration through Instruction-Set Metamorphosis) work in the early '90s. There is a very cool paper by the same name. Check it out!
PS: PRISM paper (http://class.ece.iastate.edu/tyagi/cpre583/documents/prism.p...)
While a lot of acquisitions don’t pan out, this seems great.
The ARM connects to the FPGA fabric using a so-called AXI bus, which is a local bus defined by ARM. Xilinx supplies a bunch of "soft" cores which you can instantiate in the FPGA and integrate with the ARM. Of course, you can write your own logic for the FPGA too, as long as you can figure out how to interface to it using one of the AXI bus variants.
Several vendors offer experimenter platforms which are affordable enough for hobbyists and folks making engineering prototypes. Examples are Avnet's ZedBoard and Digilent's Zybo board.
The biggest problem with the Zynq ecosystem is that the Xilinx tools -- Vivado/SDK and whatever they renamed it to last year -- are steaming piles of smelly brown stoff. Vivado is buggy, poorly supported, has bad documentation, and the supplied examples typically don't work in the latest version of Vivado since they were written long ago and have been made obsolete via version skew. An absolute disgrace compared to what software engineers are used to. The SDK is basically Eclipse which has its own problems, but is not as bad as Vivado. Ask me how I know.
I think AMD and Xilinx have a long way to go before they can satisfy the hype and speculation I see in all the posts here. I suppose one could shell out $20K for a seat of Synopsys if one wanted a decent set of dev tools, but that's not the direction most software engineers are going nowadays.
Also, assuming NVidia completes its acquisition of ARM, the whole Zynq ecosystem is imperiled since it pits ARM against NVidia.
SoCs have been a thing for a long time. SoC = CPU + FPGA on a single chip.
Looking at the patent, the list of 20 claims is absurd. The title says it all "... PROGRAMMABLE INSTRUCTIONS IN COMPUTER SYSTEMS", they're trying to patent anything that can run or dispatch instructions.
Claims are a union - each individual claim may sound simple, what matters is the combination.
>> The title says it all "... PROGRAMMABLE INSTRUCTIONS IN COMPUTER SYSTEMS", they're trying to patent anything that can run or dispatch instructions.
No. The title of a patent is not a patent.
Typical strategy is to claim as many things as you can imagine, like inventing CPU and anything that can evaluate an instruction and instructions themselves, then remove any claim that the patent office refuses to grant.
Also (not disagreeing but I'm curious), last time I checked FPGAs could pull off some level of partial reconfiguration in the millisecond and sub millisecond ranges. I may be a bit off on these times but I saw them in a research paper a few years back. What types of speed would be necessary for CPUs to actually be able to benefit from a small FPGA onboard (rather than on an expansion card) with all the context switching.
Unless latency is so critical that the speed of light is the limiting factor, partial reconfiguration just replaces PCIe with a much harder to work with AXI interconnect (or similar, but it always ends up being AXI...).
Lattice is seemingly at "wink wink, nudge nudge" levels of support -- their lawyers won't allow them to say anything because they're afraid of pissing off Synopsys, but they also know that they're currently the best supported platform, and don't seem interested in deliberately making things difficult.
I'm really liking Clash and Bluespec (Bluespec is completely open source now) but I don't want to write any conventional languages.
Of course, using it in industry is presumably pretty different from using it for a few school courses.
I could be totally wrong, though.
Also, FPGAs can't be reasonably context-switched. Flashing them takes a significant amount of time, so forget about time-multiplexing access to the FPGA among different applications.
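A back-of-the-envelope way to put numbers on that claim: if reloading the fabric costs r seconds and the accelerated code runs s times faster, then a chunk of work that takes W seconds on the CPU only benefits when r + W/s < W, i.e. when W > r*s/(s-1). A small sketch (the function name and the example figures are mine, not measurements):

```python
def break_even_work(reconfig_s: float, speedup: float) -> float:
    """Smallest CPU-only runtime (seconds) for which reconfiguring pays off.

    Derived from r + W/s < W  =>  W > r * s / (s - 1).
    Returns infinity when there is no speedup to recover the cost with.
    """
    if speedup <= 1.0:
        return float("inf")
    return reconfig_s * speedup / (speedup - 1.0)


# Hypothetical numbers: a 1 ms reload and a 2x speedup need more than
# 2 ms of CPU-equivalent work per reload before the swap is worth it.
threshold = break_even_work(1e-3, 2.0)
```

If the scheduler's timeslice is shorter than that threshold, reconfiguring on every context switch is a net loss, which is the point being made above.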
I definitely want one but any common task worth having on an FPGA is probably common enough to justify either a GPU or actual silicon.
Intel and AMD both have the IP to do it, and iPhones apparently do have a Lattice chip in them.
In practice you see FPGAs mostly in two areas: specialised embedded applications which benefit from heavy custom I/O and/or some efficient specific DSP but don't have enough volume to justify an ASIC design, or in accelerators for simulating ASIC design.
Every other year or so someone "rediscovers" FPGA and thinks this niche architecture is poised for a total revolution of how computing works, think drag&drop hardware and super fast custom everything. It never happens and it will never happen because customization, much like premature optimization, is the root of all evil and also just.. see the first paragraph.
AMD bought ATI while promising the same integration "synergies". GPU-style compute was going to be completely woven into the CPU: "AMD Fusion". Sounds great, but they ended up being beaten to the CPU-with-integrated-GPU market by Intel by over a year (Intel Clarkdale launched January 2010, AMD Llano in mid-2011). 14 years after the acquisition, AMD's iGPU integration is not much different from any other iGPU integration, their raw performance lead over Intel is shrinking, and they're beaten by Apple. Radeon Technologies Group functionally operates independently within the company, and for some reason AMD won't use its more performant new RDNA architecture in iGPUs until two years after its launch; even their 2021 APUs still use the 2017 Vega architecture (fundamentally based on 2012 GCN technology). In the intervening years they've screwed up their processor architecture and market share by going all in on the terrible Bulldozer architecture, which was designed around the broken promises of far-reaching GPU integration.
Given all that, the ATI acquisition might still have been worth it (in hindsight AMD needed a competent GPU architecture one way or another), but the mismanagement of this acquisition nearly killed the company. I hope better leadership can do something here but I'm not really holding my breath.
They screwed up majorly with software, and they may have the same problem with an FPGA acquisition as well. AMD failed big time to capitalize on GPUs the way Nvidia did, and that's really almost entirely down to lack of good software solutions. There's ROCm now and it seems plausible that the gap is going to narrow further with AMD GPUs deployed to big HPC clusters, but a gap remains.
I don't like patents in general (and especially in software), but this patent is not as general as you claim.
Companies who grow to a certain size look to be acquired by larger firms with bigger war chests.
Sometimes companies recognize patents are stifling progress and engage in cross licensing or pooling of patents. Sometimes they do it to gang up on a new rival.
That's what's interesting about the article, because that's what the patent is about: "implementing as part of a processor pipeline a reprogrammable execution unit capable of executing specialized instructions".
It compiles to Verilog, but the stack is much more integrated than other similar compile-to-verilog HDLs - the simulator is similar to verilator and much easier to get started with.
I'm kind of beginning to feel that Haskell isn't a good medium for HDL code - Verilog already encourages unreadable names like "mem_chk_sig_state" and Haskell code is almost unstructured to my eye (I like functional programming but it seems hard to keep it readable because of the style it imposes - the flow is there but the names are usually way too short for my taste)
I think it's the bits around the outside of the (say) math kernel which will trip up an "ah it's just like C!"-thinking programmer.
The vast majority of the hosting market still goes to bog-standard servers, not even blades.
I'll wait for "clouds" to get to significant double-digit market share first.
I would like to see a FAANG try and support some open tools - it doesn't have to be anything legally sketchy like reverse engineering bitstreams - for example, Yosys only has limited SystemVerilog support
At some point I'd like to see it integrated as the frontend to tools like Yosys to get best-in-class SystemVerilog support in open tools.
Does anyone know of any open source software for taking smallish chunks of Verilog/VHDL and making a visual representation/schematic?
There is an online version here: http://www.clifford.at/yosys/nogit/YosysJS/snapshot/demo02.h... which uses YosysJS. Hopefully someone can port the Compiler Explorer UI to this.
Since applications do all their rendering via the GPU these days, desktop multi-tasking requires reasonably time-sliced access to the GPU. GPUs have proper memory protection these days (GPU-side page tables for each process). That's big progress over 10 years ago.
Never heard of putting it into the socket, would be a real pain to attach JTAG to program/debug your design...
As far as OpenJDK is concerned, that is configurable via the -XX flags.
I don't think it would happen in a general purpose chip but I could see it happening in a smaller one, like the exploits Christopher Domas demonstrated against some embedded x86 cores.