From: "Scott J. Goldman" <scottjgo@gmail.com>
To: "Mohamed Mediouni" <mohamed@unpredictable.fr>,
	"Scott J. Goldman" <scottjgo@gmail.com>
Cc: <qemu-devel@nongnu.org>, <alex@shazbot.org>, <clg@redhat.com>,
	<pbonzini@redhat.com>, <rbolshakov@ddn.com>, <phil@philjordan.eu>,
	<mst@redhat.com>, <john.levon@nutanix.com>,
	<thanos.makatos@nutanix.com>, <qemu-s390x@nongnu.org>
Subject: Re: [RFC PATCH 00/10] vfio: PCI device passthrough on Apple Silicon Macs
Date: Sun, 05 Apr 2026 16:20:51 -0700
Message-ID: <DHLLUP2OY18O.1HYC4V18K7IH1@gmail.com>
In-Reply-To: <EE710653-F4AF-4C1B-A9B0-C9ADE7EB01F1@unpredictable.fr>

On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote:
>
>
>> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
>> 
>>> 
>>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>> 
>>> This series adds VFIO PCI device passthrough support for Apple Silicon
>>> Macs running macOS, using a DriverKit extension (dext) as the host
>>> backend instead of the Linux VFIO kernel driver.
>>> 
>>> I'm sending this as an RFC because I'd like feedback before investing
>>> further in upstreaming.  The code is functional.  I've tested it with
>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air.  GPU
>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
>>> [1]), likely due to the BAR access penalty described below.  AI
>>> inference workloads appear less affected.  Ollama with Qwen 3.5
>>> generates around 140 tok/sec on the same setup [2].
>>> 
>>> How it works:
>>> 
>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
>>> for device access and DMA mapping.  On macOS, there is no equivalent
>>> kernel interface.  Instead, a userspace DriverKit extension
>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through
>>> IOKit's IOUserClient and PCIDriverKit APIs.
>>> 
>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's
>>> passthrough infrastructure.  A few ioctl callsites are refactored into
>>> io_ops callbacks, the build system is extended for Darwin, and the
>>> Apple-specific backend plugs in behind those abstractions.
>>> 
>>> The guest sees two PCI devices: the passthrough device itself
>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
>>> DMA mapping device (apple-dma-pci).  On the QEMU side, an
>>> AppleVFIOContainer implements the IOMMU backend, and a C client
>>> library wraps the IOUserClient calls to the dext for config space,
>>> BAR MMIO, interrupts, reset, and DMA.
>>> 
>>> DMA limitations:
>>> 
>>> This is the biggest platform constraint.  Unlike a typical IOMMU
>>> mapping operation where the caller specifies the IOVA, the
>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>> system-assigned IOVA.  There is no way to request a specific address.
>>> This means the guest's requested DMA addresses cannot be used
>>> directly.  The guest kernel module must intercept DMA mapping calls
>>> and forward them through the companion device to get the actual
>>> hardware IOVA.
>> 
>> Hello,
>> 
>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>> API used by Virtualization.framework but that’s a different design.

This is really interesting and I had not heard about this. Are you
able to elaborate on this one at all? Maybe this is something where an
internal API to manipulate the DART is available inside
Virtualization.framework?

>> Would bounce buffering using something akin to the confidential compute path and
>> a pre-defined chunk of host memory accessible from the device, and then managing
>> the guest address map work? (see swiotlb).

I tested this approach early on, but ran into a couple issues:

1. Not only does PrepareForDMA() limit the total size of the pool, but
   it also limits the size of individual allocations. IIRC the limit is
   not very large, at around 16MB. Thankfully, I found that the
   allocator seemed to just keep allocating contiguously across
   multiple allocations, so maybe that's fine?
2. The default Linux swiotlb configuration is too small for GPU
   drivers. The max single mapping is 256KB and the total pool size is
   64MB. The overall pool size is configurable, but the max single
   mapping is derived from IO_TLB_SEGSIZE and IO_TLB_SIZE, which are
   compile-time constants. During games, I have seen roughly 900MB of
   active DMA mappings and individual mappings much larger than 256KB.

I abandoned this approach because it seemed like the CPU penalty of
bouncing all the DMA buffers would be pretty severe and the swiotlb
allocator just didn't seem designed for this much memory pressure. I
also was hoping to avoid the requirement of recompiling the entire guest
kernel as a prerequisite for guests to use this passthrough feature. On
top of that, I wasn't sure if upstream would even be willing to take
changes to support this use case, since it's so far outside what the
existing swiotlb allocator would normally be doing.

That said, you were saying that CoCo is fine with this restriction? Do
other devices just not have drivers that do so much allocation? I
didn't actually try changing the swiotlb constants and recompiling the
guest kernel to make the pool big enough to work with the nvidia guest
driver, so I will have to see what happens.

>
> see restricted-dma-pool
>
> I think in this specific case that ACPI support isn’t worth it and that FDT
> will be good enough.

Yes, this seems fine to me as well if we went the swiotlb route. It
could be a different `-machine` type, or perhaps a machine-specific
parameter.

>
> The limitation that I can see there is that if you can’t match IOVA and GPA for that
> restricted DMA pool, then you’ll need a small (and hopefully easy to merge) kernel
> change.
>> If the last part isn’t possible, something minimal to export an swiotlb window
>> through device tree with giving the IOVA there would be good too.
>> 
>> And that will get rid of a need for a apple-dma-pci device.

I am not 100% sure since I didn't try this exactly, but it seems like
you could have the DriverKit side allocate a big DMA buffer before the
guest starts, and then identity map the region somewhere inside the
guest with the `restricted-dma-pool` attribute attached to it. The
caveat being that you might have to pray that the region is contiguous
or introduce a much more complicated swiotlb subsystem allocator.

WRT a kernel patch to make it easier, can you elaborate on what you were
thinking there?

>>> There are also hard platform limits: approximately 1.5 GB total
>>> mapped memory and roughly 64k concurrent mappings.  Not all
>>> workloads will fit within these limits, though GPU gaming and LLM
>>> inference have worked in practice.
>> 
>> That’s not too dissimilar from the confidential compute limitations.
>> 
>>> 
>>> BAR access has performance issues as well.  HVF does not expose
>>> controls to map device memory as cacheable in the guest, creating a
>>> significant performance penalty on BAR MMIO.  Uncached mappings work
>>> correctly but slowly compared to what the hardware could do.
>> 
>> That’s not a macOS limitation and not an Apple hardware limitation, but
>> it’s more fundamental to how PCIe works.
>> 
>> Unlike CXL, PCIe doesn’t have a coherency protocol story, and the alternative
>> of uncached and doing manual software-managed flushes isn’t really tenable.

Apologies, I misspoke. It's not cacheability that's the issue. I think
it's write-combining. Specifically, the question is how HVF sets the
attributes in the stage-2 page tables. The behavior is observable by
looking at the performance of sweeping writes across the BARs.

As part of the work to implement and test this change, I wrote such a
benchmark, both as a client of the dext on the host and as a Linux kernel
module that runs in the guest. It takes BAR1 (the VRAM aperture), does a
write sweep of 8MB with 4 passes, and measures the throughput.

Host (mapped with kIOWriteCombineCache): 386 MB/s
Host (mapped with kIOInhibitCache): 46 MB/s

Guest (mapped with ioremap_wc): 31 MB/s
Guest (mapped with ioremap): 31 MB/s
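
The shape of such a write sweep can be sketched roughly as follows.
This is not the actual benchmark code: write_sweep_mb_per_sec() and
now_seconds() are hypothetical names, the 32-bit store width is an
assumption, and MB here means 2^20 bytes. In a real run, base would be
the mapped BAR1 aperture; any writable buffer works for trying it out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* C11 wall-clock time in seconds; a real benchmark would prefer a
 * monotonic clock (or ktime in a kernel module). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Sequentially write 32-bit words across [base, base + bytes) for the
 * given number of passes and return the throughput in MB/s. The
 * volatile qualifier keeps the compiler from dropping the stores. */
double write_sweep_mb_per_sec(volatile uint32_t *base, size_t bytes, int passes)
{
    assert(base != NULL && bytes >= sizeof(uint32_t) && passes > 0);

    size_t words = bytes / sizeof(uint32_t);
    double start = now_seconds();

    for (int pass = 0; pass < passes; pass++)
        for (size_t i = 0; i < words; i++)
            base[i] = (uint32_t)pass;

    double elapsed = now_seconds() - start;
    if (elapsed <= 0.0)
        return 0.0; /* clock too coarse to measure this sweep */

    return ((double)bytes * passes) / (1024.0 * 1024.0) / elapsed;
}
```

Calling write_sweep_mb_per_sec(bar1, 8u << 20, 4) would correspond to
the 8MB, 4-pass configuration used for the numbers above.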

In the case of BAR1, it is marked prefetchable, so I believe you would
usually want to map it with write-combining. I'm not sure why the case
without write-combining is worse in the guest than on the host, but
it's the same order of magnitude. I think the really interesting thing
is that the write-combining mapping in the guest performs identically
to the one without. To me, that indicates that perhaps the stage-2 bits
are not set properly. Even though the host has mapped the memory with
kIOWriteCombineCache, this wasn't propagated when HVF mapped it into the
guest, which probably falls back to the more restrictive of the stage-1
and stage-2 mappings (i.e. disabling write-combining).
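
To illustrate the hypothesis, here is a simplified model of how the
Arm architecture combines stage-1 and stage-2 memory types when
FEAT_S2FWB is not in use: the more restrictive type wins. The enum and
the combine_s1_s2() helper are hypothetical names for this sketch, and
what HVF actually programs at stage 2 is an assumption, not something
that has been confirmed. On arm64 Linux, ioremap() yields Device-nGnRE
and ioremap_wc() yields Normal Non-cacheable at stage 1.

```c
#include <assert.h>

/* Memory types ordered from most to least restrictive (simplified:
 * Normal Write-Through and shareability are left out). */
enum mem_type {
    DEVICE_nGnRnE,
    DEVICE_nGnRE,   /* arm64 Linux ioremap() */
    DEVICE_nGRE,
    DEVICE_GRE,
    NORMAL_NC,      /* arm64 Linux ioremap_wc() */
    NORMAL_WB
};

static_assert(DEVICE_nGnRnE < NORMAL_NC, "Device must order before Normal");

/* Without FEAT_S2FWB, the effective type is the more restrictive of
 * the stage-1 (guest) and stage-2 (hypervisor) types. */
enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
    return s1 < s2 ? s1 : s2;
}
```

In this model, if stage 2 were a Device type, both combine_s1_s2(NORMAL_NC,
DEVICE_nGnRE) and combine_s1_s2(DEVICE_nGnRE, DEVICE_nGnRE) would give
DEVICE_nGnRE, which would be consistent with ioremap_wc and ioremap
performing identically in the guest. With stage 2 set to NORMAL_NC, the
ioremap_wc mapping would keep its write-combining behavior.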


>> 
>>> 
>>> What works:
>>> - PCI config space passthrough
>>> - BAR MMIO via direct-mapped device memory
>>> - MSI/MSI-X interrupts via async notification from the dext
>>> - Device reset (FLR with hot-reset fallback)
>>> - DMA mapping for guest device drivers
>>> 
>> This is very interesting to see :)

Thanks! It's always nice to catch some interest/advice for a strange
project like this.

>> 
>>> What doesn't work:
>>> - Expansion ROM / VBIOS passthrough
>>> - PCI BAR quirks
>>> - VGA region passthrough
>>> - Migration and dirty page tracking
>>> - Hot-unplug
>>> 
>>
>> 
>> 
>>> Questions for reviewers:
>>> 
>>> 1. Is this something the VFIO maintainers would consider carrying
>>>  upstream?  The refactoring patches (3-6) are benign, but the Apple
>>>  backend is a new platform with real limitations.  That said, if Apple
>>>  lifts some of the DART/HVF restrictions in a future macOS release, the
>>>  code changes to take advantage would likely be minor.  I'd like to
>>>  understand whether this is in scope before doing the work to
>>>  address review feedback on the full series.
>>> 
>>> 2. The apple-dma-pci companion device: should this be a virtio device
>>>  instead?  I went with a simple custom PCI device because the virtio
>>>  infrastructure didn't buy much for what is essentially a {map, unmap}
>>>  register interface, but if virtio is preferred, what is the process
>>>  for allocating a device ID?  If a custom PCI device is the right
>>>  approach, I've tentatively allocated 1b36:0015.  Is there a process
>>>  for reserving a device ID under the Red Hat PCI vendor, or is
>>>  claiming it in pci-ids.rst sufficient?  The guest-side kernel module
>>>  hooks all DMA mapping functions for passed-through devices, which is
>>>  unusual enough that I'm not sure it's upstreamable in the Linux
>>>  kernel.  I can maintain it out of tree if needed.
>> 
>> I’d recommend using bounce buffers like the CoCo case if possible. I don’t
>> think that the apple-dma-pci definitely-not-an-IOMMU is a good idea.

To be clear, it definitely is weird and bad, but it was seemingly the
least bad option that I was able to get working with minimal guest
changes (just one guest kmod).

>> 
>>> 
>>> 3. Should the macOS host-side DriverKit extension live in the QEMU
>>>  tree?  It's not included in this series and requires Apple code
>>>  signing.  I'm happy to keep it out of tree if that's preferred,
>>>  or include the source if reviewers want it co-located.
>> 
>> Both are fine I think. Could you share compatibility with the tinygrad
>> one at https://github.com/tinygrad/tinygrad/tree/7e54992bf600789dbe5d37b99fe12a19c32e36a1/extra/usbgpu/tbgpu/installer and prebuilt at https://raw.githubusercontent.com/tinygrad/tinygpu_releases/refs/heads/main/TinyGPU.zip?

This is a good question and not something I had considered. My module
probably works a little differently than their module. It's possible I'm
wrong, but my understanding was:

1. They got Apple entitlements for the AMD/NVIDIA vendor IDs only.
   That said, if it became compatible with QEMU, I suppose it would be
   an easy case to make that it could be expanded to a wildcard (another
   developer indicated to me that Apple was willing to grant the
   wildcard entitlement if the use case was justifiable).
2. The architecture of their driver is a little different. I believe
   they are allocating DMA-able memory in the driver and mapping it down
   to userland, so it's kind of the reverse of what I'm doing now. I
   guess they could conceivably change how they are doing this to unify
   our efforts.

Thanks,
-sjg

