FuryGpu – Custom PCIe FPGA GPU (furygpu.com)

446 points by argulane 2 years ago | 126 comments

So, this is my project! Was somewhat hoping to wait until there was a bit more content up on the site before it started doing the rounds, but here we are! :)

To answer what seems to be the most common question I get asked about this, I am intending on open-sourcing the entire stack (PCB schematic/layout, all the HDL, Windows WDDM drivers, API runtime drivers, and Quake ported to use the API) at some point, but there are a number of legal issues that need to be cleared (with respect to my job) and I need to decide the rest of the particulars (license, etc.) - this stuff is not what I do for a living, but it's tangentially-related enough that I need to cover my ass.

The first commit for this project was on August 22, 2021. It's been a bit over two and a half years I've been working on this, and while I didn't write anything up during that process, there are a fair number of videos in my YouTube FuryGpu playlist (https://www.youtube.com/playlist?list=PL4FPA1MeZF440A9CFfMJ7...) that can kind of give you an idea of how things progressed.

The next set of blog posts that are in the works concern the PCIe interface. It'll probably be a multi-part series starting at the PCB schematic/layout and moving through the FPGA design and ending with the Windows drivers. No timeline on when that'll be done, though. After having written just that post on how the Texture Units work, I've got even more respect for those that can write up technical stuff like that with any sort of timing consistency.

I'll answer the remaining questions in the threads where they were asked.

Thanks for the interest!

rustybolt 2 years ago | |

I have seen semi-regular updates from you on Discord and it is awesome to see how far this project has come (and also a bit frustrating to see how relatively little progress I have made on my FPGA projects in the same time!). I was hoping you'd do a writeup, can't wait!

michaelt 2 years ago | |

Googling the Xilinx Zynq UltraScale+ it seems kinda expensive.

Of course plenty of hobbies let people spend thousands (or more) so there's nothing wrong with that if you've got the money. But is it the end target for your project? Or do you have ambitions to go beyond that?

PfhorSlayer 2 years ago | | |

Let's be clear here, this is a toy. Beyond being a fun project to work on that could maybe get my foot in the door were I ever to decide to change careers and move into hardware design, this is not going to change the GPU landscape or compete with any of the commercial players. What it might do is pave the way for others to do interesting things in this space. A board with all of the video hardware that you can plug into a computer with all the infrastructure available to play around with accelerating graphics could be a fun, if extremely niche, product. That would also require a *significant* time and money investment from me, and that's not something I necessarily want to deal with. When this is eventually open-sourced, those who really are interested could make their own boards.

One thing to note is that while the US+ line is generally quite expensive (the higher end parts sit in the five-figures range for a one-off purchase! No one actually buying these is paying that price, but still!), the Kria SOMs are quite cheap in comparison. They've got a reasonably-powerful Zynq US+ for about $400, or just $350ish for the dev boards (which do not expose some of the high-speed interfaces like PCIe). I'm starting to sound like a Xilinx shill given how many times I've re-stated this, but for anyone serious about getting into this kind of thing, those devboards are an amazing deal.

0xcde4c3db 2 years ago | | |

I've been told by several people that distributor pricing for FPGAs is ridiculously inflated compared to what direct customers pay, and considering that one can apparently get a dev board on AliExpress for about $110 [1] while Digikey lists the FPGA alone for about $1880 [2], I believe it (this example isn't an UltraScale chip, but it is significantly bigger than the usual low-end Zynq 7000 boards sold to undergrads and tinkerers).

[1] https://www.aliexpress.us/item/3256806069467487.html

[2] https://www.digikey.com/en/products/detail/amd/XC7K325T-1FFG...

kanetw 2 years ago | | |

The Kria SOM in use here is like $300.

ruslan 2 years ago | |

How much does it depend on hard IP blocks? I mean, can it be ported to FPGAs from other vendors, like the Lattice ECP5? Did you implement PCIe in HDL or use a vendor-specific IP block? Please provide some resource utilization statistics. Thanks.

alexforencich 2 years ago | | |

The GPU uses https://github.com/alexforencich/verilog-pcie + the Xilinx PCIe hard IP core. When using the device-independent DMA engine, that library supports both Xilinx and Intel FPGAs.

PfhorSlayer 2 years ago | | |

Implementing PCIe in the fabric without using the hard IP would be foolish, and definitely not the kind of thing I'd enjoy spending my time on! The design makes extensive use of the DSP48E2 and various BRAM/URAM blocks available in the fabric. I don't have exact numbers off the top of my head, but roughly it's ~500 DSP units (primarily for multiplication), ~70k LUTs, ~135k FFs, and ~90 BRAMs. Porting it to a different device would be a pretty significant undertaking, but would not be impossible. Many of the DSP resources are inferred, but there is a lot of timing stuff that depends on the DSP48E2's behavior - multiple register stages following the multiplies, the inputs are sized appropriately for those specific DSP capabilities, etc.

pocak 2 years ago | |

In the post about the texture unit, that ROM table for mip level address offsets seems to use quite a bit of space. Have you considered making the mip base addresses a part of the texture spec instead?

PfhorSlayer 2 years ago | | |

The problem with doing that is it would require significantly more space in that spec. At a minimum, one offset for each possible mip level. That data needs to be moved around the GPU internally quite a bit, crossing clock domains and everything else, and would require a ton of extra registers to keep track of. Putting it in a ROM is basically free - a pair of BRAM versus a ton of registers (and the associated timing considerations), the BRAM wins almost every time.
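To illustrate why such a table can live in a ROM: if mips are packed back to back, the offset of each level depends only on the texture size, so it never has to travel with the texture spec. A minimal sketch, assuming square power-of-two textures stored largest mip first with offsets counted in texels. The layout and the name `mip_offset_table` are made up for illustration and are not taken from the FuryGpu design.

```python
def mip_offset_table(max_log2_size: int) -> list[list[int]]:
    """Return table[log2_size][level] = texel offset of that mip level.

    Assumed layout (hypothetical): square power-of-two textures,
    mip levels stored contiguously, largest level first.
    """
    table = []
    for log2_size in range(max_log2_size + 1):
        offsets = []
        offset = 0
        for level in range(log2_size + 1):
            offsets.append(offset)
            side = 1 << (log2_size - level)  # side length of this mip level
            offset += side * side            # the next level starts after it
        table.append(offsets)
    return table


rom = mip_offset_table(3)
print(rom[3])  # offsets for each mip level of an 8x8 texture
```

Because the table is a pure function of size and level, it can be generated once at build time and indexed with a few bits, which is the trade-off described above: a small block RAM instead of extra registers carried through every pipeline stage.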

billconan 2 years ago | |

This is very awesome! Could you recommend some books if I want to do something similar? For example, on how to design a PCB for pluggable PCIe hardware.

Thank you very much!

MalphasWats 2 years ago |

It's incredible how influential Ben Eater's breadboard computer series has been in hobby electronics. I've been similarly inspired to try to design my own "retro" CPU.

I desperately want something as easy to plug into things as the 6502, but with jussst a little more capability - few more registers, hardware division, that sort of thing. It's a really daunting task.

I always end up coming back to just use an MCU and be done with it, and then I hit the How To Generate Graphics problem.

gchadwick 2 years ago |

Cool! I found the hello blog here illuminating to understand the creator's intentions: https://www.furygpu.com/blog/hello

As I read it, it's just a fun hobby project for them first and foremost and looks like they're intending to write a whole bunch more about how they built it.

It's certainly an impressive piece of work, in particular as they've got the full stack working: a Windows driver implementing a custom graphics API, and then Quake running on top of that. A shame they've not got some DX/GL support, but I can certainly understand why they went the custom API route.

I wonder if they'll open source the design?

PfhorSlayer 2 years ago | |

I'm in the process of actually trying to work out what would be feasible performance-wise if I were to spend the considerable effort to add the features required for base D3D support. It's not looking good, unfortunately. Beyond just "shaders", there is a significant number of other requirements that even just the OS's window manager needs to function at all. It's all built up on 20+ years of evolving tech and for the normal players in this space (AMD, Nvidia, Intel, Imagination, etc.) it's always been an iterative process.

KallDrexx 2 years ago |

This is my dream!

The last year I've been working on a 2D-focused GPU for I/O constrained microcontrollers (https://github.com/KallDrexx/microgpu). I've been able to utilize this to get user interfaces on slow SPI machines to render on large displays, and it's been fascinating to work on.

But seeing the limitation of processor pipelines I've had the thought for a while that FPGAs could make this faster. I've recently gotten some low end FPGAs to start learning to try and turn my microgpu from an ESP32 based one to an FPGA one.

I don't know if I'll ever get to this level due to kids and free time constraints, but man, I would love to get even a hundredth of this level.

Chabsff 2 years ago | |

You probably know this already, but for anyone else curious about going down that road: For this type of use, it's definitely worth it to constrain yourself to FPGAs with dedicated high-bandwidth transceivers. A "basic" 1080p RGB signal at 60 Hz requires some high-frequency signal processing that's really hard to contend with in pure FPGA-land.
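For a rough sense of the numbers involved, here is a back-of-the-envelope sketch. It assumes a DVI/HDMI-style TMDS output (10 bits on each of three data lanes per pixel clock) and the standard total frame sizes including blanking: 2200x1125 for 1080p and 800x525 for 640x480. The function names are made up for illustration.

```python
def pixel_clock_hz(h_total: int, v_total: int, refresh_hz: int) -> int:
    """Pixel clock = total pixels per frame (including blanking) * frames per second."""
    return h_total * v_total * refresh_hz


def tmds_lane_rate_bps(pixel_clock: int) -> int:
    """TMDS sends one 10-bit symbol per pixel clock on each data lane."""
    return pixel_clock * 10


modes = {
    "1920x1080 @ 60 Hz": (2200, 1125, 60),
    "640x480 @ 60 Hz": (800, 525, 60),
}
for name, (h_total, v_total, refresh) in modes.items():
    clock = pixel_clock_hz(h_total, v_total, refresh)
    lane = tmds_lane_rate_bps(clock)
    print(f"{name}: pixel clock {clock / 1e6:.1f} MHz, {lane / 1e6:.0f} Mbit/s per lane")
```

Under these assumptions 1080p60 needs about 1.5 Gbit/s on each lane, roughly six times what 640x480 needs, which is why dedicated serializers or transceivers matter at the higher resolution.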

KallDrexx 2 years ago | | |

That's good to know actually. I'm still very very early in my FPGA adoption (learning the FPGA basics) and I am intending to start with standard 640x480 VGA before expanding.

snvzz 2 years ago |

Pipeline seems retro, but far better than nothing.

There's no open hardware GPU to speak of. Depending on license (can't find information?), this could be the first, and a starting point for more.

detuur 2 years ago |

I can't believe that this is the closest we have to a compact, stand-alone GPU option. There's nothing like an M.2 format GPU out there. All I want is a stand-alone M.2 GPU with modest performance, something on the level of embedded GPUs like Intel UHD Graphics, AMD Radeon, or Qualcomm's Adreno.

I have an idea for a small embedded product which needs a lot of compute and networking, but only very modest graphical capabilities. The NXP Layerscape LX2160A [1] would be perfect, but I have to pass on it because it doesn't come with an embedded GPU. I just want a small GPU!

[1]: https://www.nxp.com/products/processors-and-microcontrollers...

cpgxiii 2 years ago | |

There's at least one m.2 GPU based on the Silicon Motion SM750 controller made by Asrock Rack. Similar products exist for mPCIe form factor.

Performance is nowhere near a modern iGPU, because an iGPU has access to all of the system memory and caches and power budget, and a simple m.2 device has none of that. Even low-end PCIe GPUs (single slot, half-length/half-height) struggle to outperform better iGPUs and really only make sense when you have to use them for basic display functionality.

magixx 2 years ago | |

What about the MXM GPUs that used to be found in gaming laptops? I know the standard is very niche and thus expensive ($400 for a used 3080M on eBay), but it does exist, and you could convert them to PCI-E and thus M.2.

t-3 2 years ago | |

Maybe a little bit too low-powered for you, but: https://www.matrixorbital.com/ftdi-eve

raphlinus 2 years ago |

Very cool project, and I love to see more work in this space.

Something else to look at is the Vortex project from Georgia Tech[1]. Rather than recapitulating the fixed-function past of GPU design, I think it looks toward the future, as it's at heart a highly parallel computer, based on RISC-V with some extensions to handle GPU workloads better. The boards it runs on are a few thousand dollars, so it's not exactly hobbyist-friendly, but it certainly is more accessible than closed, proprietary development. There's a 2.0 release that just landed a few months ago.

[1]: https://vortex.cc.gatech.edu/

spuz 2 years ago |

This looks like an incredible achievement. I'd love to see some photos of the physical device. I'm also slightly confused about which FPGA module is being used. The blog mentions the Xilinx Kria SoMs but if you follow the links to the specs of those modules, you see they have ARM SoCs rather than Xilinx FPGAs. The whole world of FPGAs is pretty unfamiliar to me so maybe I'm missing something.

https://www.amd.com/en/products/system-on-modules/kria/k26/k...

PfhorSlayer 2 years ago | |

You're in luck! https://imgur.com/a/BE0h9cZ

As mentioned in the rest of this thread, the Kria SoMs are FPGA fabric with hardened ARM cores running the show. Beyond just being what was available (for oh so cheap, the Kria devboards are like $350!), these devices also include things like hardened DisplayPort IP attached to the ARM cores allowing me to offload things like video output and audio to the firmware. A previous version of this project was running on a Zynq 7020, for which I needed to write my own HDMI stuff that, while not super complicated, takes up a fair amount of logic and also gets way more complex if it needs to be configurable.

crote 2 years ago | |

> you see they have ARM SoCs rather than Xilinx FPGAs

It's a mixed chip: FPGA and traditional SoC glued together. This means you don't have a softcore MCU taking up precious FPGA resources just to do some basic management tasks.

spuz 2 years ago | | |

Ah that makes sense. It's slightly ironic then that the ARM SoC includes a Mali GPU which presumably easily outperforms what can be achieved with the FPGA.

chrsw 2 years ago | | |

I didn't see any mention of what the software on the Zynq's ARM core is doing, which made me wonder why use Zynq at all.

chiral-anomaly 2 years ago | |

Xilinx doesn't mention the exact FPGA p/n used in the Kria SoMs. However according to their public specs they appear to match [1] the ZU3EG-UBVA530-2L and ZU5EV-SFVC784-2L devices, with the latter being the only one featuring PCIe support.

Designing and bringing-up the FPGA board as described in the blog post is already a high bar to clear. I hope the author will at some point publish schematics and sources.

[1] https://docs.amd.com/v/u/en-US/zynq-ultrascale-plus-product-...

userbinator 2 years ago |

> Supporting hardware features equivalent to a high-end graphics card of the mid 1990s

I see no one else has asked this question yet, so I will: How VGA-compatible is it? Would I be able to e.g. plug it into any PC with a PCIe slot, boot to DOS and play DOOM with it?

nxobject 2 years ago |

I hope the author goes into some detail about how he implements the PCIe interface! I doubt I'll ever do hardware work at that level of sophistication, but for general cultural awareness I think it's worth looking under the hood of PCIe.

PfhorSlayer 2 years ago | |

Next blog post will be covering exactly that! Probably going to do a multi-part series - first one will be the PCB schematic/layout, then the FPGA interfaces and testing, followed by Windows drivers.

gorkish 2 years ago | |

The FPGA he is using has native PCIe, so usually all you get on this front is an interface to a vendor-proprietary IP block. The state of open interfaces in FPGA land is abysmal. I think the best I’ve seen fully open source is a gigabit MAC.

0xcde4c3db 2 years ago | | |

There is an open-source DisplayPort transmitter [1] that apparently supports multiple 2.7 Gbps lanes (albeit using family-specific SERDES/differential transceiver blocks, but I doubt that's avoidable at these speeds). This isn't PCIe, but it's also surprisingly close to PCIe 1.0 (2.5 Gbps/lane, and IIRC they use the same 8b/10b code and scrambling algorithm).

[1] https://github.com/hamsternz/FPGA_DisplayPort
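To make the comparison concrete: with 8b/10b coding, every 8 data bits occupy 10 bits on the wire, so the usable rate is 80% of the line rate. A small sketch of that arithmetic, using the two line rates mentioned above. The function name is made up for illustration.

```python
def payload_rate_mbps(line_rate_mbps: int) -> int:
    """Usable bit rate of an 8b/10b-coded lane: 8 data bits per 10 line bits."""
    return line_rate_mbps * 8 // 10


links = {"PCIe 1.0": 2500, "DisplayPort HBR": 2700}
for name, rate in links.items():
    payload = payload_rate_mbps(rate)
    print(f"{name}: {rate} Mbit/s on the wire, {payload} Mbit/s payload "
          f"({payload // 8} MB/s) per lane")
```

This ignores packet and protocol overhead above the line code, so real throughput is lower on both links.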

alexforencich 2 years ago | | |

The GPU uses this: https://github.com/alexforencich/verilog-pcie . And there is an open-source 100G NIC here, including open source 10G/25G MACs: https://github.com/corundum/corundum

alexforencich 2 years ago | |

It uses https://github.com/alexforencich/verilog-pcie on top of the Xilinx PCIe hard IP core, which provides everything below the transaction layer.

sylware 2 years ago |

Hopefully their hardware programming model is going full hardware circular command/interrupt buffers (even for GPU register programming).

That is how it is done on AMD GPUs; that said, I have no idea what the Nvidia hardware programming model is.
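For anyone unfamiliar with the idea, a circular command buffer is a fixed-size ring where the driver advances a write pointer and the device advances a read pointer. The sketch below is a generic software model of that scheme, not AMD's or FuryGpu's actual format; the class and method names are hypothetical.

```python
class CommandRing:
    """Fixed-size circular command buffer with one producer and one consumer."""

    def __init__(self, size: int):
        # One slot always stays empty so "full" and "empty" are distinguishable.
        self.slots = [None] * size
        self.size = size
        self.write_ptr = 0  # advanced by the driver (producer)
        self.read_ptr = 0   # advanced by the device (consumer)

    def is_empty(self) -> bool:
        return self.read_ptr == self.write_ptr

    def is_full(self) -> bool:
        return (self.write_ptr + 1) % self.size == self.read_ptr

    def submit(self, command) -> bool:
        """Driver side: queue a command, or report that the ring is full."""
        if self.is_full():
            return False
        self.slots[self.write_ptr] = command
        self.write_ptr = (self.write_ptr + 1) % self.size
        return True

    def consume(self):
        """Device side: fetch the next command, or None if the ring is empty."""
        if self.is_empty():
            return None
        command = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.size
        return command


ring = CommandRing(4)
for cmd in ("set_texture", "draw", "present"):
    ring.submit(cmd)
print(ring.consume())  # oldest command comes out first
```

In real hardware the ring would live in memory visible to both sides and the pointers would be device registers, but the wrap-around bookkeeping is the same.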

jamesu 2 years ago |

Similarly there is this: https://github.com/ToNi3141/Rasterix

Would be neat if someone made an FPGA GPU which had a shader pipeline honestly.

wpwpwpw 2 years ago |

Excellent job. Would be amazing if this became an open source hardware project.

bobharris 2 years ago |

beyond amazing. i've dreamt of this. so inspiring. it reminds me of a lot of time i spent thinking about this: https://rcl.ece.iastate.edu/sites/default/files/papers/SteJo... i actually wrote one of the professors asking for more info. didn't get a reply. my dream EE class I never got to take.

bloatfish 2 years ago |

This is insane! As a hobby hardware designer myself, I can imagine how much work must have gone into reaching this stage. Well done!

codedokode 2 years ago |

"UltraScale" in the name implies an ultra price? FPGAs seem to be an expensive toy.

nxobject 2 years ago | |

It's worth mentioning that it's easy enough to find absurdly cheap (~$20) early-generation dev boards for Zynq FPGAs with embedded ARM cores on Aliexpress, shucked from obsolete Bitcoin miners [1]. Interfaces include SD, Ethernet, 3 banks of GPIO.

[1] https://github.com/xjtuecho/EBAZ4205

thrtythreeforty 2 years ago | | |

Zynq is deeply annoying to work with, though. Unfortunately the hard ARM core bootloads the FPGA fabric, rather than the other way around (or having the option to initialize both separately). This means you have to muck with software on the target to update FPGA bitstreams.

mattalex 2 years ago | |

Not in the grand scheme of things: you can get FPGA dev boards for $50 that are already usable for this type of thing (you can go even lower, but those aren't really usable for "CPU like" operation and are closer to "a whole lot of logic gates in a single chip"). Of course the "industry grade" solutions pack significantly more of a punch, but they can also be had for <$500.

PfhorSlayer 2 years ago | |

In general, yes. However, the Kria series are amazingly good deals for what you get - a quite powerful Zynq US+ part and a dev board for like $350.

varispeed 2 years ago | |

Ages ago I bought a TinyFPGA, which is like £40, and I was able to synthesize a RISC-V CPU on it. It was fun.

allanrbo 2 years ago |

What an inspiring passion project! Very ambitious first Verilog project.

iAkashPaul 2 years ago |

FPGAs for native FP4 will change the entire landscape

blacklion 2 years ago | |

Entire landscape of open graphic chips?

Not every GPU should be used to train or infer so-called AI.

Please, stop, we need some hardware to put images on the screens.

Y_Y 2 years ago | |

Four-bit floats are not as useful as Nvidia would have you believe. Like structured sparsity it's mainly a trick to make newer-gen cards look faster in the absence of an improvement in the underlying tech. If you're using it for NN inference you have to carefully tune the weights to get good accuracy and it offers nothing over fixed-point.

imtringued 2 years ago | | |

The actual problem is that nobody uses these low precision floats for training their models. When you do quantization you are merely compressing the weights to minimize memory usage and to use memory bandwidth more efficiently. You still have to run the model at the original precision for the calculations so nobody gives a damn about the low precision floats for now.

jsheard 2 years ago | |

Very briefly, until someone makes an ASIC that does the same thing and FPGAs are relegated to niche use-cases once again.

FPGAs only make long-term sense in applications that are so low-volume that it's not worth spinning an ASIC for them.

iAkashPaul 2 years ago | | |

Absolutely

imtringued 2 years ago | |

How? NPUs are going to be included in every PC in 2025. The only differentiators will be how much SRAM and memory bandwidth you have or whether you use processing in memory or not. AMD is already shipping APUs with 16 TOPS or 4 TFLOPS (bfloat16) and that is more than enough for inference considering the limited memory bandwidth. Strix Halo will have around 12 TFLOPS (bfloat16) and four memory channels.

llama.cpp already supports 4 bit quantization. They unpack the quantization back to bfloat16 at runtime for better accuracy. The best use case for an FPGA I have seen so far was to pair it with SK Hynix's AI GDDR and even that could be replaced by an even cheaper inference chip specializing in multi board communication and as many memory channels as possible.
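As a rough illustration of "store in 4 bits, compute in higher precision": the sketch below quantizes a block of weights to signed 4-bit codes with one shared scale, then expands them back to floats before use. This is a generic symmetric scheme written for illustration, not llama.cpp's actual storage format, and the function names are made up.

```python
def quantize_block(values: list[float]) -> tuple[float, list[int]]:
    """Map floats to signed 4-bit codes in [-8, 7] sharing one scale per block."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale: float, codes: list[int]) -> list[float]:
    """Expand 4-bit codes back to floats before doing any arithmetic with them."""
    return [scale * c for c in codes]


weights = [0.0, 7.0, -7.0, 3.0]
scale, codes = quantize_block(weights)
print(scale, codes)
print(dequantize_block(scale, codes))
```

The memory saving comes from storing only the codes plus one scale per block; the multiply-accumulate work still happens on the expanded values.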

luma 2 years ago | |

How so?

anon115 2 years ago |

can you run valorant on it?

notorandit 2 years ago |

It needs to be very fancy to write text in light gray on white.

I am not sure your product will be a success.

I am sure your web design skills need a good overhaul.

nicolas_17 2 years ago | |

It's not a "product" that will be "sold" or has intention of being "successful" in a commercial sense.