FuryGpu – Custom PCIe FPGA GPU (furygpu.com)
To answer what seems to be the most common question I get asked about this, I am intending on open-sourcing the entire stack (PCB schematic/layout, all the HDL, Windows WDDM drivers, API runtime drivers, and Quake ported to use the API) at some point, but there are a number of legal issues that need to be cleared (with respect to my job) and I need to decide the rest of the particulars (license, etc.) - this stuff is not what I do for a living, but it's tangentially-related enough that I need to cover my ass.
The first commit for this project was on August 22, 2021. It's been a bit over two and a half years I've been working on this, and while I didn't write anything up during that process, there are a fair number of videos in my YouTube FuryGpu playlist (https://www.youtube.com/playlist?list=PL4FPA1MeZF440A9CFfMJ7...) that can kind of give you an idea of how things progressed.
The next set of blog posts that are in the works concern the PCIe interface. It'll probably be a multi-part series starting at the PCB schematic/layout and moving through the FPGA design and ending with the Windows drivers. No timeline on when that'll be done, though. After having written just that post on how the Texture Units work, I've got even more respect for those that can write up technical stuff like that with any sort of timing consistency.
I'll answer the remaining questions in the threads where they were asked.
Thanks for the interest!
Of course plenty of hobbies let people spend thousands (or more) so there's nothing wrong with that if you've got the money. But is it the end target for your project? Or do you have ambitions to go beyond that?
One thing to note is that while the US+ line is generally quite expensive (the higher end parts sit in the five-figures range for a one-off purchase! No one actually buying these is paying that price, but still!), the Kria SOMs are quite cheap in comparison. They've got a reasonably-powerful Zynq US+ for about $400, or just $350ish for the dev boards (which do not expose some of the high-speed interfaces like PCIe). I'm starting to sound like a Xilinx shill given how many times I've re-stated this, but for anyone serious about getting into this kind of thing, those devboards are an amazing deal.
[1] https://www.aliexpress.us/item/3256806069467487.html
[2] https://www.digikey.com/en/products/detail/amd/XC7K325T-1FFG...
Thank you very much!
I desperately want something as easy to plug into things as the 6502, but with jussst a little more capability - few more registers, hardware division, that sort of thing. It's a really daunting task.
I always end up coming back to just use an MCU and be done with it, and then I hit the How To Generate Graphics problem.
As I read it, it's just a fun hobby project for them first and foremost and looks like they're intending to write a whole bunch more about how they built it.
It's certainly an impressive piece of work, in particular as they've got the full stack working: a Windows driver implementing a custom graphics API, and then Quake running on top of that. A shame they've not got some DX/GL support, but I can certainly understand why they went the custom API route.
I wonder if they'll open source the design?
The last year I've been working on a 2d focused GPU for I/O constrained microcontrollers (https://github.com/KallDrexx/microgpu). I've been able to utilize this to get user interfaces on slow SPI machines to render on large displays, and it's been fascinating to work on.
But seeing the limitation of processor pipelines I've had the thought for a while that FPGAs could make this faster. I've recently gotten some low end FPGAs to start learning to try and turn my microgpu from an ESP32 based one to an FPGA one.
I don't know if I'll ever get to this level due to kids and free time constraints, but man, I would love to get even a hundredth of this level.
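To put a rough number on the "slow SPI" problem mentioned above, here is a back-of-the-envelope sketch. The display size, pixel format, SPI clock, and the 16-byte command size are all illustrative assumptions, not figures taken from the microgpu project.

```python
def framebuffer_bytes(width, height, bytes_per_pixel):
    """Size of one full frame in bytes."""
    return width * height * bytes_per_pixel


def transfer_seconds(num_bytes, spi_clock_hz):
    """Time to clock num_bytes over a 1-bit SPI link, ignoring protocol overhead."""
    return num_bytes * 8 / spi_clock_hz


# Assumed example: 320x240 display, 16 bits per pixel, 10 MHz SPI clock.
frame = framebuffer_bytes(320, 240, 2)            # 153600 bytes
full_frame_s = transfer_seconds(frame, 10_000_000)  # 0.12288 s, about 8 frames/s

# Hypothetical 16-byte "draw rectangle" command sent instead of raw pixels.
command_s = transfer_seconds(16, 10_000_000)      # 12.8 microseconds
```

Under those assumptions, pushing raw pixels caps out at roughly 8 full frames per second, while a short drawing command takes microseconds, which is the general argument for putting the rendering on the far side of the slow link.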
There's no open hardware GPU to speak of. Depending on license (can't find information?), this could be the first, and a starting point for more.
I have an idea for a small embedded product which needs a lot of compute and networking, but only very modest graphical capabilities. The NXP Layerscape LX2160A [1] would be perfect, but I have to pass on it because it doesn't come with an embedded GPU. I just want a small GPU!
[1]: https://www.nxp.com/products/processors-and-microcontrollers...
Performance is nowhere near a modern iGPU, because an iGPU has access to all of the system memory and caches and power budget, and a simple m.2 device has none of that. Even low-end PCIe GPUs (single slot, half-length/half-height) struggle to outperform better iGPUs and really only make sense when you have to use them for basic display functionality.
Something else to look at is the Vortex project from Georgia Tech[1]. Rather than recapitulating the fixed-function past of GPU design, I think it looks toward the future, as it's at heart a highly parallel computer, based on RISC-V with some extensions to handle GPU workloads better. The boards it runs on are a few thousand dollars, so it's not exactly hobbyist friendly, but it certainly is more accessible than closed, proprietary development. There's a 2.0 release that just landed a few months ago.
https://www.amd.com/en/products/system-on-modules/kria/k26/k...
As mentioned in the rest of this thread, the Kria SoMs are FPGA fabric with hardened ARM cores running the show. Beyond just being what was available (for oh so cheap, the Kria devboards are like $350!), these devices also include things like hardened DisplayPort IP attached to the ARM cores allowing me to offload things like video output and audio to the firmware. A previous version of this project was running on a Zynq 7020, for which I needed to write my own HDMI stuff that, while not super complicated, takes up a fair amount of logic and also gets way more complex if it needs to be configurable.
It's a mixed chip: FPGA and traditional SoC glued together. This means you don't have a softcore MCU taking up precious FPGA resources just to do some basic management tasks.
Designing and bringing-up the FPGA board as described in the blog post is already a high bar to clear. I hope the author will at some point publish schematics and sources.
[1] https://docs.amd.com/v/u/en-US/zynq-ultrascale-plus-product-...
I see no one else has asked this question yet, so I will: How VGA-compatible is it? Would I be able to e.g. plug it into any PC with a PCIe slot, boot to DOS and play DOOM with it?
That is how it is done on AMD GPUs; that said, I have no idea what the Nvidia hardware programming model is.
Would be neat if someone made an FPGA GPU which had a shader pipeline honestly.
Not every GPU should be used to train or infer so-called AI.
Please, stop, we need some hardware to put images on the screens.
FPGAs only make long-term sense in applications that are so low-volume that it's not worth spinning an ASIC for them.
llama.cpp already supports 4 bit quantization. They unpack the quantization back to bfloat16 at runtime for better accuracy. The best use case for an FPGA I have seen so far was to pair it with SK Hynix's AI GDDR and even that could be replaced by an even cheaper inference chip specializing in multi board communication and as many memory channels as possible.
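For anyone wondering what "unpacking" 4-bit weights involves, here is a minimal sketch of generic block-wise 4-bit quantization. This is not llama.cpp's actual storage format: the block size of 32, the per-block scale, the nibble packing order, and the function names are all assumptions made for illustration, and it dequantizes to float32 rather than bfloat16.

```python
import numpy as np

BLOCK = 32  # assumed block size; each block shares one float scale


def quantize_4bit(weights):
    """Quantize a 1-D float array (length divisible by BLOCK) to packed 4-bit values."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                      # avoid dividing by zero for all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    nib = (q + 8).astype(np.uint8)               # shift to unsigned range 0..15
    packed = nib[:, 0::2] | (nib[:, 1::2] << 4)  # two 4-bit values per byte
    return packed, scale.astype(np.float32)


def dequantize_4bit(packed, scale):
    """Unpack the nibbles and rescale back to float32."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty((packed.shape[0], BLOCK), dtype=np.int8)
    q[:, 0::2] = lo
    q[:, 1::2] = hi
    return (q.astype(np.float32) * scale).reshape(-1)
```

The unpack step is just shifts, masks, and one multiply per weight, so it is cheap compared with the memory traffic it saves. The reconstruction error per weight is at most half of that block's scale.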
I am not sure your product will be a success.
I am sure your web design skills need a good overhaul.
Regarding graphics, initially output serial. Abstract the problem away until you are ready to deal with it. If you sneak up on an Arduino and make it scream, you can make it into a very basic VGA graphics card [1]. Even easier is ESP32 to VGA (also gives keyboard and mouse) [2].
[1] https://www.instructables.com/Arduino-Basic-PC-With-VGA-Outp...
And yeah, video output is a significant issue because of the required bandwidth for digital outputs (unless you're okay with composite or VGA outputs, I guess they can still be done with readily available chips?). The recent Commander X16 settled for an FPGA for this.
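A quick worked calculation of the bandwidth involved. The total (active plus blanking) timings used here are the common ones for 640x480 and 1080p at 60 Hz, and the factor of 10 is the TMDS 8b/10b-style encoding used by DVI and HDMI 1.x; the function names are just for illustration.

```python
def pixel_clock_hz(h_total, v_total, refresh_hz):
    """Pixel clock = total pixels per line * total lines per frame * frames per second."""
    return h_total * v_total * refresh_hz


def tmds_bitrate_per_lane(pixel_clock):
    """DVI/HDMI 1.x send 10 serial bits per pixel clock on each of 3 data lanes."""
    return pixel_clock * 10


vga_clock = pixel_clock_hz(800, 525, 60)     # 640x480 active: 25_200_000 Hz
hd_clock = pixel_clock_hz(2200, 1125, 60)    # 1920x1080 active: 148_500_000 Hz
hd_lane = tmds_bitrate_per_lane(hd_clock)    # 1_485_000_000 bit/s per lane
```

A ~25 MHz analog VGA signal is within reach of a fast microcontroller, but roughly 1.5 Gbit/s per lane for 1080p is not, which is where FPGAs with dedicated serializers or external transmitter chips come in.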
I always got the impression that David sort of got railroaded by the other members of the team that wanted to keep adding features and MOAR POWAH, and didn't have a huge amount of choice because those features quickly scoped out of his own areas of knowledge.
He started posting videos again recently with some regularity after a lull. Audience is in the low hundreds of thousands. I assume fewer than 100k actually finish videos and fewer still do anything with it.
Hobby electronics seems surprisingly small in this era.
I've built stuff with microcontrollers (partially aided by techniques learned here), but that was very purpose-driven and I'm not super interested in just messing around for fun.
I’m having trouble wrapping my head around how / why you’d use youtube to present analog electrical engineering formulas and pin out diagrams instead of using latex or a diagram.
I wrote a couple of articles on how to do bit banged VGA on the RP2040 from scratch: https://gregchadwick.co.uk/blog/playing-with-the-pico-pt5/ and https://gregchadwick.co.uk/blog/playing-with-the-pico-pt6/ plus an intro to PIO https://gregchadwick.co.uk/blog/playing-with-the-pico-pt4/
There's this which is about the same kind of GPU
Lattice ECP5 (which goes up to 85k LUT or so?) and Nexus have more than decent support.
Gowin FPGAs are supported via project apicula up to 20k LUT models. Some new models go above 200k LUT so there's hope there.
chip: https://colognechip.com/programmable-logic/gatemate/ board: https://www.olimex.com/Products/FPGA/GateMate/GateMateA1-EVB...
https://github.com/schlae/graphics-gremlin is an MDA/CGA compatible adapter
https://github.com/OmarMongy/VGA is a VGA core
https://github.com/archlabo/Frix is a whole IBM PC compatible SoC, including a VGA.
So my guess is that it would be quite challenging to implement a modern GPU in an affordable FPGA if you want more than a proof of concept.
I do not doubt that a shader core could be built, but I have reservations about the ability to run it fast enough or have as many of them as would be needed to get similar performance out of them. FuryGpu does its front-end (everything up through primitive assembly) in full fp32. Because that's just a simple fixed modelview-projection matrix transform it can be done relatively quickly, but having every single vertex/pixel able to run full fp32 shader instructions requires the ability to cover instruction latency with additional data sets - it gets complicated, fast!
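For readers unfamiliar with the fixed front-end transform described above, here is a minimal software sketch of a modelview-projection transform followed by the perspective divide, in fp32. The OpenGL-style matrix convention and the function names are assumptions for illustration; this is not FuryGpu's actual pipeline or API.

```python
import numpy as np


def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection matrix (assumed convention)."""
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    m = np.zeros((4, 4), dtype=np.float32)
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = (2.0 * far * near) / (near - far)
    m[3, 2] = -1.0
    return m


def translate(x, y, z):
    """Simple modelview matrix: translation only."""
    m = np.eye(4, dtype=np.float32)
    m[:3, 3] = (x, y, z)
    return m


def transform_vertices(mvp, positions):
    """Apply one 4x4 matrix to (N, 3) positions and return NDC coordinates."""
    n = positions.shape[0]
    ones = np.ones((n, 1), dtype=np.float32)
    homo = np.hstack([positions.astype(np.float32), ones])
    clip = homo @ mvp.T                    # one matrix multiply per vertex
    return clip[:, :3] / clip[:, 3:4]      # perspective divide


mvp = perspective(90.0, 1.0, 1.0, 100.0) @ translate(0.0, 0.0, 0.0)
ndc = transform_vertices(mvp, np.array([[0, 0, -1], [0, 0, -100]]))
# Points on the near and far planes land at z = -1 and z = +1 respectively.
```

Every vertex goes through the same fixed sequence of multiplies and adds, which is why it pipelines well in hardware. A programmable shader has no such fixed sequence, so latency has to be hidden some other way.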
Cheaper boards are definitely possible since there are smaller parts in that family, but they need to offer support for some of them in the free version of Vivado...
It's terrible use of the hardware and the performance is far from stellar, but you can!
You're right that low-precision training still doesn't seem to work, presumably because you lose the smoothness required for SGD-type optimization.
Cool project though.
The best compromise seems to be webpages with readable technical info and animated video illustrations - such as the one posted here yesterday about how radio works.
There have been a lot of times where I am showing someone new to my field something and they stop me before I get to what I thought was the "educational" point and ask what I just did.
Video can portray that pretty well because the information is there for you to see. With a schematic or write-up, if the author didn't put it there, the information isn't there.
Pictures of the output here: https://github.com/PhobGCC/PhobGCC-doc/blob/main/For_Users/P...
Open source GPUs won't threaten Nvidia/AMD/Intel anytime soon, or ever. They're way too far ahead in the game and also backed by patents if any new player were to become a threat.
With a project like this I think you're well past a "foot in the door".
That devboard is using recycled chips 100 percent. Their cost is almost nothing.
The kintex-7 part in question can probably be bought in volume quantities for around $190. Think 100kEAU.
This kind of price break comes with volume and is common with many other kinds of silicon besides FPGAs. Some product lines have more pricing pressure than others. For example, very popular MCUs may not get as wide of a price break. Some manufacturers price more fairly to distributors, some allow very large discounts.
Basically, as George Carlin put it, "it's a big club, and you ain't in it".
I don't have exact numbers, but I'm pretty sure you can get significant discounts starting around 100 parts. So not much at all.
Another thing to note is you can already get parts for significant discounts in 1-off quantities through legit Chinese distributors like LCSC. For example, a XC7A35T-2FGG484I is 90$ on Digikey and 20$ at LCSC. I think a personalized deal for that part would be cheaper than 20$ though...
Imagine for instance hard real time tasks, each one task running on its own separate core.
He also did run into a similar problem that I ran into when I tried something like that as well: Sound Chips. Building a system around a Yamaha FM Synthesizer is perfect, but I found as well that most of the chips out there are broken, fake, or both and that no one else makes them anymore. Which makes sense because if you want a sound chip in this day, you use an AC97 or HD Audio codec and call it a day, but that goes against that spirit.
I think that the spirit of hobby electronics is really found in FPGAs these days instead of rarer and rarer DIP parts. Which is a bit sad, but I guess that's just the passage of time. I wonder if that's how some people felt in the 70s when CPUs replaced many distinct layouts, or if they rejoiced and embraced it instead.
I've given up trying to build a system on a breadboard and think that MiSTer is the modern equivalent of that.
Microcontrollers have taken over. When 8kB SRAM and 20MHz microcontrollers exist below 50-cents and at minuscule 25mm^2 chip sizes drawing only 500uA of current... there's very little reason to use a collection of 30 chips to do equivalent functionality.
Except performance. If you need performance then bam, FPGA land comes in and Zynq just has too much performance at too low a cost (though not quite as low as the microcontroller gang).
----------
Hobby Electronics is great now. You have so many usable parts at very low costs. A lot of problems are "solved" yes, but that's a good thing. That means you can focus on solving your hobby problem rather than trying to invent a new display driver or something.
I do think some people that remember fondly the user experience of those old machines might be better served by using modern machines (like a raspberry pi or even a standard pc) in a different way instead of trying to use old hardware. That's from the good old Turing machine universality (you can simulate practically any machine you like using newer hardware, if what you're interested in is software). You can even add artificial limitations like PICO-8 or TIC-80 does.
See also uxn:
and (WIP) picotron:
https://www.lexaloffle.com/picotron.php
I think there's a general concept here of making 'Operating environments' that are pleasant to work within (or have fun limitations), which I think are more practical than a dedicated Operating System optionally with a dedicated machine. Plus (unless you particularly want to!) you don't need to worry about all the complex parts of operating systems like network stacks, drivers and such.
[1] Maybe we should call that Hobby universality (or immortality?) :P If it's already been made/discovered, you can always make it again just for fun.
Edit: found it! https://excamera.com/sphinx/gameduino/index.html
An FPGA is really just the right tool for solving the video problem. Or some projects do it with a micro-controller. But it's sort of too bad as it kind of undercuts the spirit of the whole design. If your video processor is orders of magnitude more powerful than the rest of the computer, then one starts to ask why not just implement the entire computer inside the video processor?
And yeah, you can't really buy sprite-based video chips anymore, and you don't even have to worry about stuff like "Sprites per Scanline" because you can get a proper framebuffer for essentially free - but now you might as well go further and use one microprocessor to be the CPU, GPU, and FM Synthesizer Sound Chip and "just" add the logic to generate the actual video/audio signals.
https://github.com/studio8502/Sentinel-65X
It's not yet a deliverable product but watching the developers work on it has been an entertaining part of my doomscrolling diet.