Re: [RFC PATCH 00/10] vfio: PCI device passthrough on Apple Silicon Macs

From: "Scott J. Goldman" <scottjgo@gmail.com>
To: "Mohamed Mediouni" <mohamed@unpredictable.fr>,
	"Scott J. Goldman" <scottjgo@gmail.com>
Cc: <qemu-devel@nongnu.org>, <alex@shazbot.org>, <clg@redhat.com>,
	<pbonzini@redhat.com>, <rbolshakov@ddn.com>, <phil@philjordan.eu>,
	<mst@redhat.com>, <john.levon@nutanix.com>,
	<thanos.makatos@nutanix.com>, <qemu-s390x@nongnu.org>
Subject: Re: [RFC PATCH 00/10] vfio: PCI device passthrough on Apple Silicon Macs
Date: Wed, 08 Apr 2026 00:02:55 -0700
Message-ID: <DHNKXKNWFI3S.P78ERGSFU3RQ@gmail.com>
In-Reply-To: <67BA415D-4D1A-4F68-9429-284309EE96C0@unpredictable.fr>

On Sun Apr 5, 2026 at 5:16 PM PDT, Mohamed Mediouni wrote:
>
>
>> On 6. Apr 2026, at 01:20, Scott J. Goldman <scottjgo@gmail.com> wrote:
>> 
>> On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote:
>>> 
>>> 
>>>> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
>>>> 
>>>>> 
>>>>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>>> 
>>>>> This series adds VFIO PCI device passthrough support for Apple Silicon
>>>>> Macs running macOS, using a DriverKit extension (dext) as the host
>>>>> backend instead of the Linux VFIO kernel driver.
>>>>> 
>>>>> I'm sending this as an RFC because I'd like feedback before investing
>>>>> further in upstreaming.  The code is functional.  I've tested it with
>>>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air.  GPU
>>>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
>>>>> [1]), likely due to the BAR access penalty described below.  AI
>>>>> inference workloads appear less affected.  Ollama with Qwen 3.5
>>>>> generates around 140 tok/sec on the same setup [2].
>>>>> 
>>>>> How it works:
>>>>> 
>>>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
>>>>> for device access and DMA mapping.  On macOS, there is no equivalent
>>>>> kernel interface.  Instead, a userspace DriverKit extension
>>>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through
>>>>> IOKit's IOUserClient and PCIDriverKit APIs.
>>>>> 
>>>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's
>>>>> passthrough infrastructure.  A few ioctl callsites are refactored into
>>>>> io_ops callbacks, the build system is extended for Darwin, and the
>>>>> Apple-specific backend plugs in behind those abstractions.
>>>>> 
>>>>> The guest sees two PCI devices: the passthrough device itself
>>>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
>>>>> DMA mapping device (apple-dma-pci).  On the QEMU side, an
>>>>> AppleVFIOContainer implements the IOMMU backend, and a C client
>>>>> library wraps the IOUserClient calls to the dext for config space,
>>>>> BAR MMIO, interrupts, reset, and DMA.
>>>>> 
>>>>> DMA limitations:
>>>>> 
>>>>> This is the biggest platform constraint.  Unlike a typical IOMMU
>>>>> mapping operation where the caller specifies the IOVA, the
>>>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>>>> system-assigned IOVA.  There is no way to request a specific address.
>>>>> This means the guest's requested DMA addresses cannot be used
>>>>> directly.  The guest kernel module must intercept DMA mapping calls
>>>>> and forward them through the companion device to get the actual
>>>>> hardware IOVA.
>>>> 
>>>> Hello,
>>>> 
>>>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>>>> API used by Virtualization.framework but that's a different design.
>> 
>> This is really interesting and I had not heard about this. Are you
>> able to elaborate on this one at all? Maybe this is something where an
>> internal API to manipulate the DART is available inside
>> Virtualization.framework?
>
> Hello,
>
> All of it needs using private entitlements currently.
>
> It's _VZPCIPassthroughDeviceConfiguration, a private class needing com.apple.private.virtualization to use.
>
> The VMM process itself then uses the com.apple.private.PCIPassthrough.access entitlement. I'm not
> sure whether OS versions even have all the code currently though.
>

Appreciate the pointers here. It looks like, as you said, the framework
taps into a bunch of code that isn't shipped to us mere mortals. I can
see from some of the code in Virtualization.framework the general shape
of what they're doing, though.

It looks like they implement a virtio-iommu device that ultimately calls
into the host kernel with some internal APIs to do the DART mappings. 

>>>> Would bounce buffering using something akin to the confidential compute path and
>>>> a pre-defined chunk of host memory accessible from the device, and then managing
>>>> the guest address map work? (see swiotlb).
>> 
>> I tested this approach early on, but ran into a couple issues:
>> 
>> 1. Not only does PrepareForDMA() limit the total size of the pool, but
>>   it also limits the size of individual allocations. IIRC it's not very
>>   large, at around 16MB.
> Sigh.
>
>> Thankfully, I found that the allocator seemed
>>   to just keep allocating contiguously across multiple allocations, so
>>   maybe that's fine?
> That's good… but it sounds brittle…
>
>> 2. Linux swiotlb default configuration is too small for GPU drivers. The
>>   max single mapping is 256KB and the total pool size is 64MB. The
>>   overall pool size is configurable but the max single mapping is
>>   derived from IO_TLB_SEGSIZE and IO_TLB_SIZE which are compile-time
>>   constants. During games, I have seen roughly 900MB of active DMA
>>   mappings and mappings much larger than 256KB.
>
> Pre-defined mappings with restricted-dma-pool sound like a good idea there.
>> 
>> I abandoned this approach because it seemed like the CPU penalty of
>> bouncing all the DMA buffers would be pretty severe and the swiotlb
>> allocator just didn't seem designed for this much memory pressure. I
>> also was hoping to avoid the requirement of recompiling the entire guest
>> kernel as a prerequisite for guests to use this passthrough feature. On
>> top of that, I wasn't sure if upstream would even be willing to take
>> changes to support this use case, since it's so far outside what the
>> existing swiotlb allocator would normally be doing.
>> 
>> That said, you were saying that CoCo is fine with this restriction? Do
>> other devices just not have drivers that are doing so much allocation? I
>> didn't actually try changing the constants and recompiling the guest
>> kernel in swiotlb to make the pool big enough for it to really work at
>> all with the nvidia guest driver, I will have to see what happens.
>
> CoCo with bounce buffering works with NVIDIA GPUs. It had to be done because
> there is no trusted I/O path (and implementing that is a quagmire).
>
> A recent Intel post about it claiming production-readiness:
>
> https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Confidential-AI-with-GPU-Acceleration-Bounce-Buffers-Offer-a/post/1740417

I dug in here and implemented the restricted-dma-pool solution. It still
needs some cleanup, but it works well enough to test. To start with the
bad news:

1. As I mentioned previously, the mainline kernel caps any single swiotlb
mapping at 256KB. This has been debated a few times on LKML, but the
general consensus has been that the limit should not be changed or made
configurable. You can see the threads:
  - https://lkml.org/lkml/2015/3/3/84
  - https://patchwork.kernel.org/project/linux-mips/patch/20210914151016.3174924-1-Roman_Skakun@epam.com/

2. NVIDIA drivers immediately make a contiguous 528384-byte allocation,
at least on my hardware (NVIDIA RTX 5090), which is required as part of
initializing the firmware on the card. This exceeds the 256KB limit, so
it fails right away. It happens with both the NVIDIA-provided "open"
drivers [1] and the in-tree `nouveau` [2], so it's more a
hardware-specific issue than just a driver problem. If you hack around
that (allocate 3 smaller buffers and hope they are contiguous), you'll
see that both drivers assume coherent DMA memory (more so in the nvidia
driver than nouveau, but it's a problem in both). They map DMA buffers
and then write data into the buffers afterward. So you end up sending
empty swiotlb buffers to the card and it ultimately fails to
initialize.
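
To make the size mismatch in the two items above concrete, here is a
small sketch of the arithmetic. The constant values are the ones in
mainline include/linux/swiotlb.h as far as I know; the helper names are
made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Values as defined in mainline include/linux/swiotlb.h (assumed current). */
#define IO_TLB_SHIFT   11                     /* one slot is 2 KiB */
#define IO_TLB_SIZE    (1UL << IO_TLB_SHIFT)
#define IO_TLB_SEGSIZE 128                    /* max contiguous slots per mapping */

/* Hypothetical helper: largest single swiotlb mapping, in bytes. */
static size_t swiotlb_max_mapping(void)
{
    return (size_t)IO_TLB_SEGSIZE * IO_TLB_SIZE;
}

/* Hypothetical helper: how many max-size bounce buffers cover len bytes. */
static size_t swiotlb_chunks_needed(size_t len)
{
    size_t max = swiotlb_max_mapping();

    return (len + max - 1) / max;   /* round up */
}

/* Sanity check against the numbers quoted in this thread. */
static void swiotlb_selfcheck(void)
{
    assert(swiotlb_max_mapping() == 256 * 1024);
    /* The 528384-byte firmware allocation needs three 256KB pieces. */
    assert(swiotlb_chunks_needed(528384) == 3);
}
```

The 528384-byte request is just over twice the per-mapping limit, which
is why the workaround needs three smaller buffers rather than two.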

It's possible the Intel post was referring to the closed NVIDIA
drivers, but those are now deprecated and don't support my newer GPU.
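
The coherency problem from item 2 can be shown with a userspace toy
model. This is not kernel code and every name in it is hypothetical; it
only assumes that a bounce buffer is filled at map time, so a write made
after mapping never reaches the device unless the driver syncs
explicitly (the analogue of dma_sync_single_for_device()).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of one streaming DMA mapping that goes through a bounce buffer. */
struct toy_mapping {
    const unsigned char *cpu_buf;   /* the driver's own buffer */
    unsigned char *bounce;          /* what the device actually reads */
    size_t len;
};

/* Map for device reads: the contents are copied into the bounce buffer now. */
static int toy_map(struct toy_mapping *m, const unsigned char *cpu_buf,
                   size_t len)
{
    m->bounce = malloc(len);
    if (!m->bounce)
        return -1;
    m->cpu_buf = cpu_buf;
    m->len = len;
    memcpy(m->bounce, cpu_buf, len);
    return 0;
}

/* Explicit sync: copy the CPU-side contents into the bounce buffer again. */
static void toy_sync_for_device(struct toy_mapping *m)
{
    assert(m->bounce != NULL);
    memcpy(m->bounce, m->cpu_buf, m->len);
}

static void toy_unmap(struct toy_mapping *m)
{
    free(m->bounce);
    m->bounce = NULL;
}

/* Returns 1 if the device sees a write made after mapping, 0 if not. */
static int toy_device_sees_late_write(int do_sync)
{
    unsigned char buf[16] = {0};
    struct toy_mapping m;
    int seen;

    if (toy_map(&m, buf, sizeof(buf)) != 0)
        return -1;
    buf[0] = 0xAB;                  /* driver fills the buffer after mapping */
    if (do_sync)
        toy_sync_for_device(&m);
    seen = (m.bounce[0] == 0xAB);
    toy_unmap(&m);
    return seen;
}
```

Without the sync the device side still holds the zeroes copied at map
time, which matches the "empty swiotlb buffers" behavior described
above.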

But, there is good news:

1. The IOVA range that seems to always come from PCIDriverKit is pretty
far outside the default QEMU mapping from `-machine virt`, so the range
can be cleanly identity mapped in the VM without overlap. One of the
restrictions I noted earlier (16MB max contiguous mapping) was actually
just a bug in my code. A large contiguous mapping seems to work fine,
though the ~1.5GB limit is still real.

2. The restricted-dma-pool DT attribute can be assigned per-device, so it
doesn't affect other drivers on the system. That potentially means you
can have different pools for multiple devices (I have not actually tried
this yet, but it seems like it would work).

3. More normal devices can work. I purchased a Thunderbolt NVMe
enclosure and it works with the swiotlb bounce buffering with no kernel
modifications.

4. With a sufficient amount of hacks in the driver, the NVIDIA "open"
driver can be made to work, albeit with the already slow gaming
performance reduced to about a third (~10 fps with bounce buffering vs
~30 fps with paravirt DMA mapping). I wasn't able to get CUDA working,
but presumably that just needs more elbow grease.
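
The "identity mapped without overlap" condition in item 1 boils down to
an interval check. A minimal sketch follows, assuming half-open address
ranges; the struct and function names are hypothetical, and the actual
IOVA window and guest memory layout are not stated in this thread, so
the caller has to supply them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open range [base, base + size). */
struct addr_range {
    uint64_t base;
    uint64_t size;
};

/* True if the two ranges share at least one address. */
static bool ranges_overlap(struct addr_range a, struct addr_range b)
{
    /* Reject ranges whose end would wrap around 64 bits. */
    assert(a.size <= UINT64_MAX - a.base);
    assert(b.size <= UINT64_MAX - b.base);
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

/*
 * An IOVA window can be identity mapped into the guest only if it avoids
 * every range the guest address map already uses (RAM, MMIO, and so on).
 */
static bool can_identity_map(struct addr_range iova,
                             const struct addr_range *guest, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ranges_overlap(iova, guest[i]))
            return false;
    }
    return true;
}
```

If the check ever failed for a given host, the pool could not simply be
identity mapped and the guest address map would have to be adjusted
instead.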

After sleeping on this a bit, I think my proposal would be:

- The `restricted-dma-pool` method can be the default. For most devices
  this will work seamlessly, though users may have to specify a size for
  the pool, since the optimal size will vary for each device.

- The apple-dma-pci thing can be downgraded from an actual device to an
  out-of-tree workaround. I have not yet tested it, but presumably it
  can use ivshmem or a virtual serial port to communicate the mappings.
  It's mostly a guest-side hack, so it doesn't necessarily need QEMU
  involvement.

- I doubt Apple will actually approve this for distribution, but I can
  write a kext that uses the kernel API to manipulate the DART directly.
  I didn't realize this was an option before. This can act as kind of a
  companion for my dext and, as a follow-on to this patchset, I can teach
  the vIOMMU device to use it. Eventually if Apple exposes this as
  something you can use in a dext, then the functionality can be moved
  into the dext and all of these concerns become moot. Until then, it
  can be an optimization if you're willing to run without SIP.

If you think this is OK, I can prepare a new version of the patchset.

Thanks,
-sjg

[1] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/gpu/gsp/kernel_gsp.c#L5404
[2] https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c#L1827
