Date: Sun, 05 Apr 2026 16:20:51 -0700
From: "Scott J. Goldman"
To: "Mohamed Mediouni", "Scott J. Goldman"
Subject: Re: [RFC PATCH 00/10] vfio: PCI device passthrough on Apple Silicon Macs
References: <20260405072857.66484-1-scottjgo@gmail.com>
List-Id: qemu development

On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote:
>
>
>> On 5. Apr 2026, at 10:01, Mohamed Mediouni wrote:
>>
>>>
>>> On 5. Apr 2026, at 09:28, Scott J. Goldman wrote:
>>>
>>> This series adds VFIO PCI device passthrough support for Apple Silicon
>>> Macs running macOS, using a DriverKit extension (dext) as the host
>>> backend instead of the Linux VFIO kernel driver.
>>>
>>> I'm sending this as an RFC because I'd like feedback before investing
>>> further in upstreaming. The code is functional. I've tested it with
>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air. GPU
>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
>>> [1]), likely due to the BAR access penalty described below. AI
>>> inference workloads appear less affected. Ollama with Qwen 3.5
>>> generates around 140 tok/sec on the same setup [2].
>>>
>>> How it works:
>>>
>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
>>> for device access and DMA mapping. On macOS, there is no equivalent
>>> kernel interface.
>>> Instead, a userspace DriverKit extension
>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through
>>> IOKit's IOUserClient and PCIDriverKit APIs.
>>>
>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's
>>> passthrough infrastructure. A few ioctl callsites are refactored into
>>> io_ops callbacks, the build system is extended for Darwin, and the
>>> Apple-specific backend plugs in behind those abstractions.
>>>
>>> The guest sees two PCI devices: the passthrough device itself
>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
>>> DMA mapping device (apple-dma-pci). On the QEMU side, an
>>> AppleVFIOContainer implements the IOMMU backend, and a C client
>>> library wraps the IOUserClient calls to the dext for config space,
>>> BAR MMIO, interrupts, reset, and DMA.
>>>
>>> DMA limitations:
>>>
>>> This is the biggest platform constraint. Unlike a typical IOMMU
>>> mapping operation where the caller specifies the IOVA, the
>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>> system-assigned IOVA. There is no way to request a specific address.
>>> This means the guest's requested DMA addresses cannot be used
>>> directly. The guest kernel module must intercept DMA mapping calls
>>> and forward them through the companion device to get the actual
>>> hardware IOVA.
>>
>> Hello,
>>
>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>> API used by Virtualization.framework but that’s a different design.

This is really interesting and I had not heard about this. Are you able
to elaborate on this one at all? Maybe this is something where an
internal API to manipulate the DART is available inside
Virtualization.framework?

>> Would bounce buffering using something akin the confidential compute path and
>> a pre-defined chunk of host memory accessible from the device, and then managing
>> the guest address map work? (see swiotlb).

I tested this approach early on, but ran into a couple of issues:

1. Not only does PrepareForDMA() limit the total size of the pool, but
   it also limits the size of individual allocations. IIRC the limit is
   not very large, at around 16MB. Thankfully, I found that the
   allocator seemed to just keep allocating contiguously across
   multiple allocations, so maybe that's fine?
2. The default Linux swiotlb configuration is too small for GPU
   drivers. The max single mapping is 256KB and the total pool size is
   64MB. The overall pool size is configurable, but the max single
   mapping is derived from IO_TLB_SEGSIZE and IO_TLB_SIZE, which are
   compile-time constants. During games, I have seen roughly 900MB of
   active DMA mappings, and individual mappings much larger than 256KB.

I abandoned this approach because it seemed like the CPU penalty of
bouncing all the DMA buffers would be pretty severe, and the swiotlb
allocator just didn't seem designed for this much memory pressure. I
was also hoping to avoid requiring a recompile of the entire guest
kernel as a prerequisite for guests to use this passthrough feature. On
top of that, I wasn't sure if upstream would even be willing to take
changes to support this use case, since it's so far outside what the
existing swiotlb allocator would normally be doing.

That said, you were saying that CoCo is fine with this restriction? Do
other devices just not have drivers that are doing so much allocation?
I didn't actually try changing the swiotlb constants and recompiling
the guest kernel to make the pool big enough for it to work at all with
the NVIDIA guest driver. I will have to see what happens.

>
> see restricted-dma-pool
>
> I think in this specific case that ACPI support isn’t worth it and that FDT
> will be good enough.

Yes, this seems fine to me as well if we went the swiotlb route. It
could be a different `-machine` type, or perhaps a machine-specific
parameter.
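
For reference, here is a minimal sketch of where the 256KB per-mapping
limit in point 2 above comes from, and how far the default pool is from
a ~900MB working set. The constant values are assumed to match
include/linux/swiotlb.h in mainline Linux (IO_TLB_SHIFT of 11,
IO_TLB_SEGSIZE of 128, 64MB default pool) and should be checked against
the guest kernel tree in use. The helper functions are hypothetical
names for illustration only, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror include/linux/swiotlb.h in mainline Linux; verify
 * against the guest kernel tree actually in use. */
#define IO_TLB_SHIFT        11                     /* log2 of one bounce slot */
#define IO_TLB_SIZE         (1ULL << IO_TLB_SHIFT) /* 2 KiB per slot */
#define IO_TLB_SEGSIZE      128ULL                 /* max contiguous slots per mapping */
#define IO_TLB_DEFAULT_SIZE (64ULL << 20)          /* default pool: 64 MiB */

/* 128 slots * 2 KiB = 256 KiB, fixed at kernel compile time. */
static_assert(IO_TLB_SEGSIZE * IO_TLB_SIZE == 256 * 1024,
              "max single swiotlb mapping should be 256 KiB");

/* Hypothetical helper: largest single mapping the pool can bounce. */
static inline uint64_t swiotlb_max_mapping_bytes(void)
{
    return IO_TLB_SEGSIZE * IO_TLB_SIZE;
}

/* Hypothetical helper: slots consumed by a mapping of `bytes`, rounded up. */
static inline uint64_t swiotlb_slots_for(uint64_t bytes)
{
    return (bytes + IO_TLB_SIZE - 1) >> IO_TLB_SHIFT;
}

/* Hypothetical helper: can a single mapping of `bytes` be bounced at all? */
static inline bool swiotlb_fits_single_mapping(uint64_t bytes)
{
    return bytes <= swiotlb_max_mapping_bytes();
}
```

Under these assumptions, swiotlb_fits_single_mapping(16ULL << 20) is
false, and swiotlb_slots_for(900ULL << 20) is 460800 slots, compared
with the 32768 slots in the default 64MB pool. Raising the pool size
helps with the second number, but not with the per-mapping cap.
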

>
> The limitation that I can see there if if you can’t match IOVA and GPA for that
> restricted DMA pool, then you’ll need a small (and hopefully easy to merge) kernel
> change.
>> If the last part isn’t possible, something minimal to export an swiotlb window
>> through device tree with giving the IOVA there would be good too.
>>
>> And that will get rid of a need for a apple-dma-pci device.

I am not 100% sure since I didn't try this exactly, but it seems like
you could have the DriverKit side allocate a big DMA buffer before the
guest starts, and then identity-map the region somewhere inside the
guest with the `restricted-dma-pool` attribute attached to it. The
caveat is that you might have to pray that the region is contiguous, or
introduce a much more complicated swiotlb subsystem allocator.

WRT a kernel patch to make it easier, can you elaborate on what you
were thinking there?

>>> There are also hard platform limits: approximately 1.5 GB total
>>> mapped memory and roughly 64k concurrent mappings. Not all
>>> workloads will fit within these limits, though GPU gaming and LLM
>>> inference have worked in practice.
>>
>> That’s not too dissimilar from the confidential compute limitations.
>>
>>>
>>> BAR access has performance issues as well. HVF does not expose
>>> controls to map device memory as cacheable in the guest, creating a
>>> significant performance penalty on BAR MMIO. Uncached mappings work
>>> correctly but slowly compared to what the hardware could do.
>>
>> That’s not a macOS limitation and not an Apple hardware limitation, but
>> it’s more fundamental to how PCIe works.
>>
>> Unlike CXL, PCIe doesn’t have a coherency protocol story, and the alternative
>> of uncached and doing manual software-managed flushes isn’t really tenable.

Apologies, I misspoke. It's not cacheability that's the issue. I think
it's write-combining.
Specifically, the question is how HVF sets the attributes in the
stage-2 page tables. The behavior is observable by looking at the
performance of sweeping writes across the BARs. As part of the work to
implement and test this change, I wrote such a benchmark as a client of
the dext in the host, and as a Linux kernel module that runs in the
guest. It takes BAR1 (the VRAM aperture), does a write sweep of 8MB
with 4 passes, and measures the throughput:

  Host  (mapped with kIOWriteCombineCache): 386 MB/s
  Host  (mapped with kIOInhibitCache):       46 MB/s
  Guest (mapped with ioremap_wc):            31 MB/s
  Guest (mapped with ioremap):               31 MB/s

In the case of BAR1, it is marked prefetchable, so I believe you would
usually want to map it with write-combining. I'm not sure why the case
without write-combining is worse in the guest than on the host, but it's
the same order of magnitude. I think the really interesting thing is
that the write-combining mapping in the guest performs identically to
the one without.

To me, that indicates that perhaps the stage-2 bits are not set
properly. Even though the host has mapped the memory with
kIOWriteCombineCache, this isn't propagated when HVF maps it into the
guest, so the effective attribute probably falls back to the more
restrictive of the stage-1 and stage-2 mappings (i.e. write-combining
is disabled).

>>
>>>
>>> What works:
>>> - PCI config space passthrough
>>> - BAR MMIO via direct-mapped device memory
>>> - MSI/MSI-X interrupts via async notification from the dext
>>> - Device reset (FLR with hot-reset fallback)
>>> - DMA mapping for guest device drivers
>>>
>> This is very interesting to see :)

Thanks! It's always nice to catch some interest/advice for a strange
project like this.

>>
>>> What doesn't work:
>>> - Expansion ROM / VBIOS passthrough
>>> - PCI BAR quirks
>>> - VGA region passthrough
>>> - Migration and dirty page tracking
>>> - Hot-unplug
>>>
>>
>>
>>> Questions for reviewers:
>>>
>>> 1. Is this something the VFIO maintainers would consider carrying
>>>    upstream? The refactoring patches (3-6) are benign, but the Apple
>>>    backend is a new platform with real limitations. That said, if Apple
>>>    lifts some of the DART/HVF restrictions in a future macOS release, the
>>>    code changes to take advantage would likely be minor. I'd like to
>>>    understand whether this is in scope before doing the work to
>>>    address review feedback on the full series.
>>>
>>> 2. The apple-dma-pci companion device: should this be a virtio device
>>>    instead? I went with a simple custom PCI device because the virtio
>>>    infrastructure didn't buy much for what is essentially a {map, unmap}
>>>    register interface, but if virtio is preferred, what is the process
>>>    for allocating a device ID? If a custom PCI device is the right
>>>    approach, I've tentatively allocated 1b36:0015. Is there a process
>>>    for reserving a device ID under the Red Hat PCI vendor, or is
>>>    claiming it in pci-ids.rst sufficient? The guest-side kernel module
>>>    hooks all DMA mapping functions for passed-through devices, which is
>>>    unusual enough that I'm not sure it's upstreamable in the Linux
>>>    kernel. I can maintain it out of tree if needed.
>>
>> I’d recommend using bounce buffers like the CoCo case if possible. I don’t
>> think that the apple-dma-pci definitely-not-an-IOMMU is a good idea.

To be clear, it definitely is weird and bad, but it was seemingly the
least bad option that I was able to get working with minimal guest
changes (just one guest kmod).

>>
>>>
>>> 3. Should the macOS host-side DriverKit extension live in the QEMU
>>>    tree? It's not included in this series and requires Apple code
>>>    signing. I'm happy to keep it out of tree if that's preferred,
>>>    or include the source if reviewers want it co-located.
>>
>> Both are fine I think.
>> Could you share compatibility with the tinygrad
>> one at https://github.com/tinygrad/tinygrad/tree/7e54992bf600789dbe5d37b99fe12a19c32e36a1/extra/usbgpu/tbgpu/installer and prebuilt at https://raw.githubusercontent.com/tinygrad/tinygpu_releases/refs/heads/main/TinyGPU.zip?

This is a good question and not something I had considered. My module
probably works a little differently from theirs. It's possible I'm
wrong, but my understanding was:

1. They got Apple entitlements for AMD/NVIDIA vendor IDs only. That
   said, if it became compatible with QEMU, I suppose it would be easy
   to make the case that it should be expanded to a wildcard (another
   developer indicated to me that Apple was willing to grant the
   wildcard entitlement if the use case was justifiable).
2. The architecture of their driver is a little different. I believe
   they are allocating DMA-able memory in the driver and mapping it
   down to userland, so it's kind of the reverse of what I'm doing now.
   I guess, conceivably, they could change how they are doing this to
   unify our efforts.

Thanks,
-sjg