Why libvirt supports only 14 PCIe hotplugged devices on x86-64

Why libvirt supports only 14 PCIe hotplugged devices on x86-64 (dottedmag.net)

249 points by andreyvit 2 years ago | 68 comments

> It already supports a number of obscure options (you can make QEMU claim to support a CPU feature regardless of whether the host CPU supports it, really?), so adding one more would fit in just fine.

> Nope. “there are no plans to address it further or fix it in an upcoming release”.

<https://bugzilla.redhat.com/show_bug.cgi?id=1408810>

I could see that being the response of an individual open-source developer working for free. But that was IBM saying that, and people pay big bucks to IBM to fix things like this.

rwmj 2 years ago | |

It's a bug filed against RHEL 7 originally, by someone working at Red Hat, who suggested that we add the qemu disable-io feature to libvirt. There was no customer case behind either the original RHEL 7 bug or this cloned RHEL 8 bug, so we simply didn't think it was important to implement this. Five years after the original bug was filed, with no customer coming along and no one having done the work upstream, the bug was auto-closed.

However if someone came along and did the work upstream to fix it, I'm sure that would be accepted.

Or if a customer turned up who wanted this, that would also be implemented.

emmelaich 2 years ago | | |

WONTFIX sounds so final.

Though, reading the closing comment, this is really CLOSED-WONTFIXYET, as in no plans.

Maybe it'd be nice to introduce a WONTFIXYET. Might be useful to fossick among features abandoned that someday become feasible.

Volundr 2 years ago | |

I expect that if it comes from one of their $X million support contracts, their answer will be very different.

bombcar 2 years ago | | |

You’d like to think so but often even at that level you’re paying for it not to be your fault.

Now if your CTO golfs with an exec at IBM, you might get somewhere.

josephcsible 2 years ago | | |

Having worked for an organization with one of said $X million support contracts with IBM, it often is not.

gnufied 2 years ago | |

> I guess SeaBIOS can't figure out how to assign I/O space to all devices that want some, and so it simply gives up?

It appears that the VM works fine for some devices, but for others it refuses to boot (esp. e100).

So the answer might be more nuanced than it seems?

chociej 2 years ago | | |

The devices for which it works fine are the ones that don't request any I/O space.
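For anyone who wants to check which devices in a Linux guest actually request I/O space, here is a minimal sketch. It assumes the sysfs `resource` file format (one `start end flags` line of hex values per region) and the kernel's `IORESOURCE_IO` flag bit (0x100). The sample text and the function name `io_port_ranges` are made up for illustration, not taken from a real device.

```python
IORESOURCE_IO = 0x100  # kernel flag bit marking a port I/O region

def io_port_ranges(resource_text):
    """Return (start, end) pairs for regions flagged as port I/O."""
    ranges = []
    for line in resource_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, flags = (int(field, 16) for field in fields)
        if flags & IORESOURCE_IO:
            ranges.append((start, end))
    return ranges

# Hypothetical contents of /sys/bus/pci/devices/<address>/resource:
# one memory BAR, one 64-byte I/O BAR, one unused slot.
SAMPLE = """\
0x00000000febc0000 0x00000000febdffff 0x0000000000040200
0x000000000000c000 0x000000000000c03f 0x0000000000040101
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

for start, end in io_port_ranges(SAMPLE):
    print(f"I/O ports {start:#06x}-{end:#06x}")
```

A device whose file yields an empty list requests no I/O space, which by the explanation above is the kind that keeps working.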

wkat4242 2 years ago | |

As Red Hat becomes more commercial, it's imperative we don't let them be stewards of open source anymore. Too many times to their corporate strategy.

For example they took ownership of X11 only so they can let it die in favor of their preferred Wayland. While Wayland is not bad, it's not covering everything.

But anyway I don't really care anymore, I'm less and less invested in the Linux ecosystem. It's too commercial now, I just stick with the BSDs <3

dark-star 2 years ago | |

Because, if you read the report, such an option is not needed:

- if you disable IO port allocation and plug in a card that requires it, that card cannot possibly work

- if you don't disable it but use only cards that don't require IO ports, you might get an error in your dmesg but the card will still work just fine

So, why would you need to specify this option in the first place?

formerly_proven 2 years ago |

> So if you wish to have more than 14 PCIe slots in your VM, you’ll have to use QEMU directly.

No need, libvirt can pass arbitrary options to QEMU.

https://libvirt.org/kbase/qemu-passthrough-security.html
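A rough sketch of what that passthrough looks like in the domain XML, built with Python's standard library so it can be checked. The `qemu` XML namespace and the `<qemu:commandline>`/`<qemu:arg>` elements are libvirt's passthrough mechanism. The argument shown, `-global pcie-root-port.io-reserve=0`, is only an assumption about how the I/O reservation might be turned off for every root port, and is untested here. The helper name `add_qemu_args` and the domain name `demo` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace libvirt uses for QEMU command-line passthrough.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"
ET.register_namespace("qemu", QEMU_NS)

def add_qemu_args(domain, args):
    """Append a <qemu:commandline> element with one <qemu:arg> per argument."""
    cmdline = ET.SubElement(domain, f"{{{QEMU_NS}}}commandline")
    for arg in args:
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", {"value": arg})
    return domain

# Skeleton domain; a real definition needs memory, devices, and so on.
domain = ET.Element("domain", {"type": "kvm"})
ET.SubElement(domain, "name").text = "demo"

# Assumption: this property exists on pcie-root-port and has the desired effect.
add_qemu_args(domain, ["-global", "pcie-root-port.io-reserve=0"])

xml_text = ET.tostring(domain, encoding="unicode")
print(xml_text)
```

libvirt does not validate or track whatever is passed this way, so anything it manages itself can end up disagreeing with the extra arguments.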

anonymousiam 2 years ago | |

Back before libvirt made it trivial, I used QEMU/KVM directly to map PCI devices to VMs. It's a little tricky because you must first unmap the device from the host/hypervisor, and you need to unmap the whole bus that the device is on. So if there are other PCI devices on the same bus as that device you want to map, they must all go along, which is often impossible for things like the USB controller for your keyboard/mouse.

These days, instead of crafting a custom script to launch QEMU/KVM for PCI mapping, it's just a few clicks in virt-manager. Note that the first time you launch a VM with a mapped PCI device, the launch will often fail with an error, but it will work on a subsequent retry and thereafter.

Also, I've tinkered with lots of VMs over the past 15 years and I've NEVER had a need for more than 14 buses. Hopefully I never will.

dottedmag 2 years ago | |

Author here. Thanks, I'll give it a whirl!

Not sure it will work though: I need to add an option to a `pcie-root-port` command-line argument managed by libvirt.

I can try to stop libvirt from creating the `pcie-root-port`s at all and add them manually using options passthrough, but I'm not sure the rest of libvirt won't throw a fit when it finds other devices that refer to these (unknown to libvirt) PCIe slots.

tedunangst 2 years ago |

I'm curious to know more about the VM host machine that they plugged 15 e1000 cards into to test this limitation. And even more curious about the non-test environment in which somebody ran into this limitation.

I can only imagine trying to pass through 20 NVMe devices to a guest, but it seems like a very weird configuration.

derefr 2 years ago | |

> but it seems like a very weird configuration

On IaaS providers, you get "local scratch NVMe" presented to the guest as individual fixed-sized disks — presumably because they're being IOMMU-pass-through'ed from the host (or a JBOD direct-attached to the host.)

The sizes for these disks were standardized several generations ago, so they're at least presented to the guest as 375G slices (I'm guessing they might actually be partitions of a larger disk nowadays.) To get "decent" amounts of local scratch storage for e.g. a serverless data-warehouse instance, you need "all you can get" of these small volumes — which on at least AWS and GCP, is 24 of them (equalling ~9TB.)

And that's just one guest. The host might have several such guests.

(To be clear, neither AWS nor GCP is likely to be using libvirt anywhere in their stack. This is just to demonstrate the use-case.)

candiddevmike 2 years ago | | |

A serverless data warehouse instance sounds like an oxymoron

adql 2 years ago | | |

...the use case of "our architecture's idiotic limitations made it hit hypervisor limitations" ?

simcop2387 2 years ago | | |

Probably not normal partitions but NVMe namespaces instead, since that will also allow them to balance IOPS and such so that one customer doesn't affect another as much.

karavelov 2 years ago | |

These are emulated `e1000` devices, not pass-through

EmilioPeJu 2 years ago |

If I'm not wrong, the pre-allocation of I/O ranges in PCIe bridges is needed only if you intend to hot-plug devices that were not present in the first enumeration... but in VMs the hardware is known from the start, and the PCIe enumeration can assign I/O ranges only if the devices underneath actually need them... is there a reason why hot-plugging is needed in VMs?

magicalhippo 2 years ago |

I ran into this on FreeNAS which uses Bhyve. Not sure if it's FreeNAS' way of doing things, but adding a virtual disk using VirtIO creates a separate SATA controller.

I tried forwarding quad NVMe's and couldn't get it working until I discovered I was hitting this limitation between the existing disks and VirtIO network card.

wkat4242 2 years ago | |

Are you sure that's the same issue? Bhyve doesn't share an awful lot of code with KVM/Qemu.

magicalhippo 2 years ago | | |

Perhaps I am slightly misremembering and it was incidental to the NVMe's, but it did fail due to this 14 PCIe device limit because the virtual disks did not share a controller, and I had to change to using Bhyve's AHCI driver for some disks to get the VM running again.

I even did a test adding one disk at a time until the VM stopped booting.

mixmastamyk 2 years ago |

Would like to hear more about why i/o ports stayed fixed and "usage decreased over time." USB/TB devices must not use them, right?

joelhaasnoot 2 years ago |

But this doesn't answer the question why 14 and not 16. There's a diff of two there...
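The 16 itself is plain arithmetic: x86 port I/O addresses are 16 bits wide, and a PCI bridge forwards I/O in 4 KiB-aligned windows. A quick sketch of that calculation follows. The constant names are made up for illustration, and nothing in this thread says where the remaining two windows go, so the code only reports the gap.

```python
IO_SPACE_BYTES = 1 << 16        # 16-bit port I/O address space: 64 KiB
BRIDGE_WINDOW_BYTES = 4 << 10   # bridge I/O windows are 4 KiB in granularity

max_windows = IO_SPACE_BYTES // BRIDGE_WINDOW_BYTES
observed_limit = 14             # the figure from the article title

print(f"windows that fit in I/O space: {max_windows}")
print(f"windows unaccounted for: {max_windows - observed_limit}")
```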

dottedmag 2 years ago | |

Author here. D'oh, you're right. Added it to QEMU section.

joelhaasnoot 2 years ago | | |

Amazing, thanks, that closes the loop!