public inbox for kvm@vger.kernel.org
* [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
@ 2025-09-18 21:44 Alex Mastro
  2025-09-18 22:57 ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Mastro @ 2025-09-18 21:44 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe, Kevin Tian
  Cc: Bjorn Helgaas, David Reiss, Joerg Roedel, Keith Busch,
	Leon Romanovsky, Li Zhe, Mahmoud Adam, Philipp Stanner,
	Robin Murphy, Vivek Kasireddy, Will Deacon, Yunxiang Li,
	linux-kernel, iommu, kvm

Hello,

We've been running user space drivers (USD) in production built on top of VFIO,
and have come to value the operational and development benefits of being able to
deploy updates to device policy by simply shipping user space binaries.

In our architecture, a long-running USD process bootstraps and manages
background device operations related to supporting client process workloads.
Client processes communicate with the USD over Unix domain sockets to acquire
the device resources necessary to dispatch work to the device.

We anticipate a growing need to provide more granular access to device resources
beyond what the kernel currently affords to user space drivers similar to our
model.

The purpose of this email is to:
- Gauge the extent to which ongoing work in VFIO and IOMMUFD can meet those
  needs.
- Seek guidance from VFIO and IOMMUFD maintainers about whether there is a path
  to supporting the remaining pieces across the VFIO and IOMMUFD UAPIs.
- Describe our current approach and get feedback on whether there are existing
  solutions that we've missed.
- Figure out the most useful places we can help contribute.

Inter-process communication (between client processes and the USD) is
prohibitively slow for hot-path communication between the client and device,
which targets round-trip times on the order of microseconds. To address
this, we need to allow client processes and the device to access each other's
memory directly, bypassing IPC with the USD and kernel syscalls.
a) For host-initiated access into device memory, this means mmap-ing BAR
   sub-regions into the client process.
b) For device-initiated access into host memory, it means establishing IOMMU
   mappings to memory underlying the client process address space.

Such things are more straightforward for in-kernel device drivers to accomplish:
they are free to define customized semantics for their associated fds and
syscall handlers. Today, user space driver processes have fewer tools at their
disposal for controlling these types of access.

----------
BAR Access
----------

To achieve (a), the USD sends the VFIO device fd to the client over Unix domain
sockets using SCM_RIGHTS, along with descriptions of which device regions are
for what. While this allows the client to mmap BARs into its address space,
it comes at the cost of exposing more access to device BAR regions than is
necessary or appropriate. In our use case, we don't need to contend with
adversarial client processes, so the current situation is tenable, but not
ideal.
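
For reference, the fd handoff described above is the standard SCM_RIGHTS cmsg
pattern; a minimal sketch (the helper names are ours, not from any VFIO
header, and error handling is abbreviated):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one fd (e.g. the VFIO device fd) over a connected Unix domain
 * socket. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd_to_send)
{
	char data = 'F';	/* must send at least one byte of real data */
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from the socket; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received fd refers to the same open file description, which is exactly why
the whole device UAPI travels with it.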

Ongoing efforts to add dma-buf exporting to VFIO [1] seem relevant here. Though
its current intent is around enabling peer-to-peer access, the fact that only
a subset of device regions are bound to this fd could be useful for controlling
access granularity to device regions.

Instead of vending the VFIO device fd to the client process, the USD could bind
the necessary BAR regions to a dma-buf fd and share that with the client. If
VFIO supported dma_buf_ops.mmap, the client could mmap those into its address
space.

Adding such capability would mean that there would be two paths for mmap-ing
device regions: VFIO device fd and dma-buf fd. I imagine this could be
contentious; people may not like that there are two ways to achieve the same
thing. It also seems complicated by the fact that there are ongoing discussions
about how to extend the VFIO device fd UAPI to support features like write
combining [2]. It would feel incomplete for such features to be available
through one mmap-ing modality but not the other. This would have implications
for how the “special regions” should be communicated across the UAPI.

The VFIO dma-buf UAPI currently being proposed [3] takes a region_index and an
array of (offset, length) intervals within the region to assign to the dma-buf.
From what I can tell, that seems coherent with the latest direction from [2],
which will enable the creation of new region indices with special properties,
which are aliases to the default BAR regions. The USD could theoretically create
a dma-buf backed by "the region index corresponding to write-combined BAR 4" to
share with the client.

Given some of the considerations above, would there be line of sight for adding
support for dma_buf_ops.mmap to VFIO?

-------------
IOMMU Mapping
-------------

To achieve (b), we have been using the (now legacy) VFIO container interface
to manage access to the IOMMU. We understand that new feature development has
moved to IOMMUFD, and intend to migrate to using it when it's ready (we have
some use cases that require P2P). We are not expecting to add features to VFIO
containers. I will describe what we are doing today first.

In order to enable a device to access memory in multiple processes, we also
share the VFIO container fd using SCM_RIGHTS between the USD and client
processes. In this scheme, we partition the I/O address space (IOAS) for a
given device's container to be used cooperatively amongst each process. The
only enforcement of the partitioning convention is that each process only
VFIO_IOMMU_{MAP,UNMAP}_DMA's to the IOVA ranges which have been assigned to it.
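
Concretely, the partitioning convention just means each process only issues
VFIO_IOMMU_MAP_DMA with IOVAs inside its assigned slice; a sketch of the
per-process check (the slice bounds and helper are our convention, not part of
any UAPI, and the kernel enforces none of it):

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map [vaddr, vaddr+size) into the container's IOAS at 'iova', but only
 * if the target range lies inside the IOVA slice assigned to this
 * process by the USD. The slice bounds are purely a user space
 * convention; nothing stops a buggy process from ignoring them. */
static int map_into_slice(int container_fd, void *vaddr, __u64 iova,
			  __u64 size, __u64 slice_start, __u64 slice_end)
{
	if (iova < slice_start || iova + size > slice_end)
		return -EINVAL;	/* outside our cooperative partition */

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)vaddr,
		.iova = iova,
		.size = size,
	};
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

Unmap is symmetric with struct vfio_iommu_type1_dma_unmap over the same
(iova, size) range.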

When the USD detects that the client process has exited, it is able to unmap any
leftover dirty mappings with VFIO_IOMMU_UNMAP_DMA. This became possible after
[4], which allowed one process to free the mappings created by another process.
That patch's intent was to enable QEMU live update use cases, but benefited our
use case as well.

Again, we don't have to contend with adversarial client processes, so this has
been OK for now, but not ideal.

We are interested in the following incremental capabilities:
- We want the USD to be able to create and vend fds which provide restricted
  mapping access to the device's IOAS to the client, while preserving
  the ability of the USD to revoke device access to client memory via
  VFIO_IOMMU_UNMAP_DMA (or IOMMUFD_CMD_IOAS_UNMAP for IOMMUFD). Alternatively,
  to forcefully invalidate the entire restricted IOMMU fd, including mappings.
- It would be nice if mappings created with the restricted IOMMU fd were
  automatically freed when the underlying kernel object was freed (if the client
  process were to exit ungracefully without explicitly performing unmap cleanup
  after itself).

Some of those things sound very similar to the direction of vIOMMU, but it is
difficult to tell if that could meet our needs exactly. The kinds of features
I think we want should be achievable purely in software without any dedicated
hardware support.

This is an area we are less familiar with, since we haven't been living on the
IOMMUFD UAPI or following its development as closely yet. Perhaps we have missed
something more obvious?

Overall, I'm curious to hear feedback on this. Allowing user space drivers
to vend more granular device access would certainly benefit our use case, and
perhaps others as well.

[1] https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org/
[2] https://lore.kernel.org/all/20250804104012.87915-1-mngyadam@amazon.de/
[3] https://lore.kernel.org/all/5e043d8b95627441db6156e7f15e6e1658e9d537.1754311439.git.leon@kernel.org/
[4] https://lore.kernel.org/all/20220627035109.73745-1-lizhe.67@bytedance.com/

Thanks,
Alex


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-18 21:44 [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes Alex Mastro
@ 2025-09-18 22:57 ` Jason Gunthorpe
  2025-09-18 23:24   ` Keith Busch
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-18 22:57 UTC (permalink / raw)
  To: Alex Mastro
  Cc: Alex Williamson, Kevin Tian, Bjorn Helgaas, David Reiss,
	Joerg Roedel, Keith Busch, Leon Romanovsky, Li Zhe, Mahmoud Adam,
	Philipp Stanner, Robin Murphy, Vivek Kasireddy, Will Deacon,
	Yunxiang Li, linux-kernel, iommu, kvm

On Thu, Sep 18, 2025 at 02:44:07PM -0700, Alex Mastro wrote:

> We anticipate a growing need to provide more granular access to device resources
> beyond what the kernel currently affords to user space drivers similar to our
> model.

I'm having a somewhat hard time wrapping my head around the security
model that says you trust your related processes not to use DMA in a way
that is hostile to their peers, but you don't trust them not to issue
hostile ioctls..

> To achieve (a), the USD sends the VFIO device fd to the client over Unix domain
> sockets using SCM_RIGHTS, along with descriptions of which device regions are
> for what. While this allows the client to mmap BARs into its address space,
> it comes at the cost of exposing more access to device BAR regions than is
> necessary or appropriate. 

IIRC VFIO should allow partial BAR mappings, so the client process can
robustly have a subset mapped if you trust it to perform the unix
SCM_RIGHTS/mapping ioctl/close() sequence.

> Instead of vending the VFIO device fd to the client process, the USD could bind
> the necessary BAR regions to a dma-buf fd and share that with the client. If
> VFIO supported dma_buf_ops.mmap, the client could mmap those into its address
> space.

I wouldn't object to this, I think it is not too complicated at all.

And the idea to add some 'use writecombining' to the create dmabuf ioctl is
certainly a novel and simple way to solve that problem.

> We are interested in the following incremental capabilities:
> - We want the USD to be able to create and vend fds which provide restricted
>   mapping access to the device's IOAS to the client, while preserving
>   the ability of the USD to revoke device access to client memory via
>   VFIO_IOMMU_UNMAP_DMA (or IOMMUFD_CMD_IOAS_UNMAP for IOMMUFD). Alternatively,
>   to forcefully invalidate the entire restricted IOMMU fd, including mappings.

I've had similarish requests for fwctl.. 

What I've been thinking is if the vending process could "dup" the FD
and permanently attach a BPF program to the new FD that sits right
after ioctl. The BPF program would inspect each ioctl when it is
issued and enforce whatever policy the vending process wants.

Sort of like seccomp.

iommufd and fwctl have a similar ioctl design, so I would have no
issue with something that could be easily reused for both.

What would give me a lot of pause is your proposal where we effectively
have the kernel enforce some arbitrary policy, and I know from
experience there will be endless asks for more and more policy
options.

> - It would be nice if mappings created with the restricted IOMMU fd were
>   automatically freed when the underlying kernel object was freed (if the client
>   process were to exit ungracefully without explicitly performing unmap cleanup
>   after itself).

Maybe the BPF could trigger an eventfd or something when the FD closes?

> Some of those things sound very similar to the direction of vIOMMU, but it is
> difficult to tell if that could meet our needs exactly. The kinds of features
> I think we want should be achievable purely in software without any dedicated
> hardware support.

I don't think viommu is really related to this, viommu is more about
multiple physical devices.

Jason


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-18 22:57 ` Jason Gunthorpe
@ 2025-09-18 23:24   ` Keith Busch
  2025-09-19  7:00     ` Tian, Kevin
  2025-09-19 11:56     ` Jason Gunthorpe
  2025-09-19 15:57   ` Alex Williamson
  2025-09-19 16:13   ` Alex Mastro
  2 siblings, 2 replies; 12+ messages in thread
From: Keith Busch @ 2025-09-18 23:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Mastro, Alex Williamson, Kevin Tian, Bjorn Helgaas,
	David Reiss, Joerg Roedel, Leon Romanovsky, Li Zhe, Mahmoud Adam,
	Philipp Stanner, Robin Murphy, Vivek Kasireddy, Will Deacon,
	Yunxiang Li, linux-kernel, iommu, kvm

On Thu, Sep 18, 2025 at 07:57:39PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 18, 2025 at 02:44:07PM -0700, Alex Mastro wrote:
> 
> > We anticipate a growing need to provide more granular access to device resources
> > beyond what the kernel currently affords to user space drivers similar to our
> > model.
> 
> I'm having a somewhat hard time wrapping my head around the security
> model that says you trust your related processes not to use DMA in a way
> that is hostile to their peers, but you don't trust them not to issue
> hostile ioctls..

I read this as more about having the granularity to automatically
release resources associated with a client process when it dies (as
mentioned below) rather than relying on the bootstrapping process to
manage it all. Not really about hostile ioctls, but that an ungraceful
ending of some client workload doesn't even send them.
 
> > - It would be nice if mappings created with the restricted IOMMU fd were
> >   automatically freed when the underlying kernel object was freed (if the client
> >   process were to exit ungracefully without explicitly performing unmap cleanup
> >   after itself).
> 
> Maybe the BPF could trigger an eventfd or something when the FD closes?

I wouldn't have considered a BPF dependency for this. I'll need to think
about that one for a moment.


* RE: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-18 23:24   ` Keith Busch
@ 2025-09-19  7:00     ` Tian, Kevin
  2025-09-19 11:58       ` Jason Gunthorpe
  2025-09-22  9:14       ` Mostafa Saleh
  2025-09-19 11:56     ` Jason Gunthorpe
  1 sibling, 2 replies; 12+ messages in thread
From: Tian, Kevin @ 2025-09-19  7:00 UTC (permalink / raw)
  To: Keith Busch, Jason Gunthorpe
  Cc: Alex Mastro, Alex Williamson, Bjorn Helgaas, David Reiss,
	Joerg Roedel, Leon Romanovsky, Li Zhe, Mahmoud Adam,
	Philipp Stanner, Robin Murphy, Kasireddy, Vivek, Will Deacon,
	Yunxiang Li, linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
	kvm@vger.kernel.org

> From: Keith Busch <kbusch@kernel.org>
> Sent: Friday, September 19, 2025 7:25 AM
> 
> On Thu, Sep 18, 2025 at 07:57:39PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 18, 2025 at 02:44:07PM -0700, Alex Mastro wrote:
> >
> > > We anticipate a growing need to provide more granular access to device
> resources
> > > beyond what the kernel currently affords to user space drivers similar to
> our
> > > model.
> >
> > I'm having a somewhat hard time wrapping my head around the security
> > model that says you trust your related processes not to use DMA in a way
> > that is hostile to their peers, but you don't trust them not to issue
> > hostile ioctls..
> 
> I read this as more about having the granularity to automatically
> release resources associated with a client process when it dies (as
> mentioned below) rather than relying on the bootstrapping process to
> manage it all. Not really about hostile ioctls, but that an ungraceful
> ending of some client workload doesn't even send them.
> 

the proposal includes two parts: BAR access and IOMMU mapping. For
the latter it looks like the intention is more around releasing resources.
But the former sounds more like a security enhancement - instead of
granting the client full access to the entire device, it aims to expose
only the necessary region of BAR resources to the client. Then, as Jason
questioned, what is the value of doing so when one client can program an
arbitrary DMA address into the exposed BAR region to attack the mapped
memory of other clients and the USD... there is no hw isolation
within a partitioned IOAS unless the device supports PASID, in which case
each client can be associated with its own IOAS space.


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-18 23:24   ` Keith Busch
  2025-09-19  7:00     ` Tian, Kevin
@ 2025-09-19 11:56     ` Jason Gunthorpe
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-19 11:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: Alex Mastro, Alex Williamson, Kevin Tian, Bjorn Helgaas,
	David Reiss, Joerg Roedel, Leon Romanovsky, Li Zhe, Mahmoud Adam,
	Philipp Stanner, Robin Murphy, Vivek Kasireddy, Will Deacon,
	Yunxiang Li, linux-kernel, iommu, kvm

On Thu, Sep 18, 2025 at 05:24:54PM -0600, Keith Busch wrote:
> I read this as more about having the granularity to automatically
> release resources associated with a client process when it dies (as
> mentioned below) rather than relying on the bootstrapping process to
> manage it all. Not really about hostile ioctls, but that an ungraceful
> ending of some client workload doesn't even send them.

You could achieve this between co-operating processes by monitoring
the child with a pidfd, or handing it a pipe and watching for the pipe
to close..

> > > - It would be nice if mappings created with the restricted IOMMU fd were
> > >   automatically freed when the underlying kernel object was freed (if the client
> > >   process were to exit ungracefully without explicitly performing unmap cleanup
> > >   after itself).
> > 
> > Maybe the BPF could trigger an eventfd or something when the FD closes?
> 
> I wouldn't have considered a BPF dependency for this. I'll need to think
> about that one for a moment.

Well, if you are going to be using BPF for policy then you may as well use
it for all policy. It would not be hard to also invoke the BPF during
the file descriptor close, and presumably it can somehow signal the
vending process in some easy BPF way?

Jason


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-19  7:00     ` Tian, Kevin
@ 2025-09-19 11:58       ` Jason Gunthorpe
  2025-09-22  9:14       ` Mostafa Saleh
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-19 11:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Keith Busch, Alex Mastro, Alex Williamson, Bjorn Helgaas,
	David Reiss, Joerg Roedel, Leon Romanovsky, Li Zhe, Mahmoud Adam,
	Philipp Stanner, Robin Murphy, Kasireddy, Vivek, Will Deacon,
	Yunxiang Li, linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
	kvm@vger.kernel.org

On Fri, Sep 19, 2025 at 07:00:04AM +0000, Tian, Kevin wrote:
> memory of other clients and the USD... there is no hw isolation 
> within a partitioned IOAS unless the device supports PASID then 
> each client can be associated to its own IOAS space.

If the device does support pasid then both of the suggestions make
a lot more security sense..

Jason


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-18 22:57 ` Jason Gunthorpe
  2025-09-18 23:24   ` Keith Busch
@ 2025-09-19 15:57   ` Alex Williamson
  2025-09-19 17:14     ` Alex Mastro
  2025-09-19 16:13   ` Alex Mastro
  2 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2025-09-19 15:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Mastro, Kevin Tian, Bjorn Helgaas, David Reiss, Joerg Roedel,
	Keith Busch, Leon Romanovsky, Li Zhe, Mahmoud Adam,
	Philipp Stanner, Robin Murphy, Vivek Kasireddy, Will Deacon,
	Yunxiang Li, linux-kernel, iommu, kvm

On Thu, 18 Sep 2025 19:57:39 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> What I've been thinking is if the vending process could "dup" the FD
> and permanently attach a BPF program to the new FD that sits right
> after ioctl. The BPF program would inspect each ioctl when it is
> issued and enforce whatever policy the vending process wants.

Promising idea.

> What would give me a lot of pause is your proposal where we effectively
> have the kernel enforce some arbitrary policy, and I know from
> experience there will be endless asks for more and more policy
> options.

Definitely.  Also, is this at all considering the work that's gone into
vfio-user?  The long running USD sounds a lot like a vfio-user server,
where if we're using vfio-user's socket interface we'd have a lot of
opportunity to implement policy there and dma-bufs might be a means to
expose direct, restricted access.  Thanks,

Alex



* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-18 22:57 ` Jason Gunthorpe
  2025-09-18 23:24   ` Keith Busch
  2025-09-19 15:57   ` Alex Williamson
@ 2025-09-19 16:13   ` Alex Mastro
  2 siblings, 0 replies; 12+ messages in thread
From: Alex Mastro @ 2025-09-19 16:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Mastro, Alex Williamson, Kevin Tian, Bjorn Helgaas,
	David Reiss, Joerg Roedel, Keith Busch, Leon Romanovsky, Li Zhe,
	Mahmoud Adam, Philipp Stanner, Robin Murphy, Vivek Kasireddy,
	Will Deacon, Yunxiang Li, linux-kernel, iommu, kvm

On Thu, Sep 18, 2025 at 07:57:39PM -0300, Jason Gunthorpe wrote:
> I'm having a somewhat hard time wrapping my head around the security
> model that says you trust your related processes not to use DMA in a way
> that is hostile to their peers, but you don't trust them not to issue
> hostile ioctls..

Ah, yea. In my original message, I should have emphasized that vending the
entire vfio device fd confers access to inappropriate ioctls *in addition to*
inappropriate BAR regions that the client should be restricted from accessing.

Assuming we make headway on dma_buf_ops.mmap, granting a client process access
to a dma-buf's worth of BAR space does not feel spiritually different than
granting it to a peer device. The onus is on the combination of driver + device
policy to constrain the side-effects of foreign access to the exposed BAR
sub-regions.

Please let me know if I misunderstood your meaning.

> IIRC VFIO should allow partial BAR mappings, so the client process can
> robustly have a subset mapped if you trust it to perform the unix
> SCM_RIGHTS/mapping ioctl/close() sequence.

Yes -- we already do this today actually. The USD just tells the client "these
are the specific set of (offset, length) within the vfio device fd you should
mmap". Those intervals are slices within BARs.

> > Instead of vending the VFIO device fd to the client process, the USD could bind
> > the necessary BAR regions to a dma-buf fd and share that with the client. If
> > VFIO supported dma_buf_ops.mmap, the client could mmap those into its address
> > space.
> 
> I wouldn't object to this, I think it is not too complicated at all.

That's encouraging to hear! Thank you.

> What I've been thinking is if the vending process could "dup" the FD
> and permanently attach a BPF program to the new FD that sits right
> after ioctl. The BPF program would inspect each ioctl when it is
> issued and enforce whatever policy the vending process wants.

This seems totally reasonable to me.

> What would give me a lot of pause is your proposal where we effectively
> have the kernel enforce some arbitrary policy, and I know from
> experience there will be endless asks for more and more policy
> options.

Agreed. If we can engineer BPF to be able to interact with those ioctls to hoist
these kinds of policy decisions up into user space, I can't argue with that.

> I don't think viommu is really related to this, viommu is more about
> multiple physical devices.

Ack. I wasn't sure how much to read into the "representing a slice of the
physical IOMMU instance" comment [1].

[1] https://docs.kernel.org/userspace-api/iommufd.html

Thanks,
Alex


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-19 15:57   ` Alex Williamson
@ 2025-09-19 17:14     ` Alex Mastro
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Mastro @ 2025-09-19 17:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alex Mastro, Jason Gunthorpe, Kevin Tian, Bjorn Helgaas,
	David Reiss, Joerg Roedel, Keith Busch, Leon Romanovsky, Li Zhe,
	Mahmoud Adam, Philipp Stanner, Robin Murphy, Vivek Kasireddy,
	Will Deacon, Yunxiang Li, linux-kernel, iommu, kvm

On Fri, Sep 19, 2025 at 09:57:43AM -0600, Alex Williamson wrote:
> Definitely.  Also, is this at all considering the work that's gone into
> vfio-user?  The long running USD sounds a lot like a vfio-user server,
> where if we're using vfio-user's socket interface we'd have a lot of
> opportunity to implement policy there and dma-bufs might be a means to
> expose direct, restricted access.  Thanks,

Possibly. Though I think the USD's responsibilities and the semantics for
how clients would negotiate various forms of device access would be very
application-dependent. In addition to just vending vfio and iommu-related fds,
our USD needs to do things like bootstrap the device by loading firmwares,
collect metrics, and other background functionality.

I'm not sure if I'm addressing your point though.

We actually do use libvfio-user [1] for user space simulation of PCI devices,
but it's not a part of our USD today.

[1] https://github.com/nutanix/libvfio-user

Alex


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-19  7:00     ` Tian, Kevin
  2025-09-19 11:58       ` Jason Gunthorpe
@ 2025-09-22  9:14       ` Mostafa Saleh
  2025-09-22 17:46         ` Alex Mastro
  1 sibling, 1 reply; 12+ messages in thread
From: Mostafa Saleh @ 2025-09-22  9:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Keith Busch, Jason Gunthorpe, Alex Mastro, Alex Williamson,
	Bjorn Helgaas, David Reiss, Joerg Roedel, Leon Romanovsky, Li Zhe,
	Mahmoud Adam, Philipp Stanner, Robin Murphy, Kasireddy, Vivek,
	Will Deacon, Yunxiang Li, linux-kernel@vger.kernel.org,
	iommu@lists.linux.dev, kvm@vger.kernel.org

On Fri, Sep 19, 2025 at 07:00:04AM +0000, Tian, Kevin wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > Sent: Friday, September 19, 2025 7:25 AM
> > 
> > On Thu, Sep 18, 2025 at 07:57:39PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 18, 2025 at 02:44:07PM -0700, Alex Mastro wrote:
> > >
> > > > We anticipate a growing need to provide more granular access to device
> > resources
> > > > beyond what the kernel currently affords to user space drivers similar to
> > our
> > > > model.
> > >
> > > I'm having a somewhat hard time wrapping my head around the security
> > > model that says you trust your related processes not to use DMA in a way
> > > that is hostile to their peers, but you don't trust them not to issue
> > > hostile ioctls..
> > 
> > I read this as more about having the granularity to automatically
> > release resources associated with a client process when it dies (as
> > mentioned below) rather than relying on the bootstrapping process to
> > manage it all. Not really about hostile ioctls, but that an ungraceful
> > ending of some client workload doesn't even send them.
> > 
> 
> the proposal includes two parts: BAR access and IOMMU mapping. For
> the latter it looks like the intention is more around releasing resources.
> But the former sounds more like a security enhancement - instead of
> granting the client full access to the entire device, it aims to expose
> only the necessary region of BAR resources to the client. Then, as Jason
> questioned, what is the value of doing so when one client can program an
> arbitrary DMA address into the exposed BAR region to attack the mapped
> memory of other clients and the USD... there is no hw isolation
> within a partitioned IOAS unless the device supports PASID, in which case
> each client can be associated with its own IOAS space.

That's also my opinion. It seems that PASIDs are not supported in
that case, which is why the clients share the same IOVA address space
instead of each one having their own.
In that case, since all of this is cooperative and can't be enforced, I
think one process can corrupt another process's memory that is mapped in
the IOMMU.

It seems to me that any memory mapped in the IOMMU in that situation
has to be explicitly shared between processes first through the kernel,
so such memory can be accessed both by CPU and DMA by both processes.

Thanks,
Mostafa


* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-22  9:14       ` Mostafa Saleh
@ 2025-09-22 17:46         ` Alex Mastro
  2025-09-22 17:51           ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Mastro @ 2025-09-22 17:46 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Alex Mastro, Tian, Kevin, Keith Busch, Jason Gunthorpe,
	Alex Williamson, Bjorn Helgaas, David Reiss, Joerg Roedel,
	Leon Romanovsky, Li Zhe, Mahmoud Adam, Philipp Stanner,
	Robin Murphy, Kasireddy, Vivek, Will Deacon, Yunxiang Li,
	linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
	kvm@vger.kernel.org

On Mon, Sep 22, 2025 at 09:14:24AM +0000, Mostafa Saleh wrote:
> On Fri, Sep 19, 2025 at 07:00:04AM +0000, Tian, Kevin wrote:
> > the proposal includes two parts: BAR access and IOMMU mapping. For
> > the latter it looks like the intention is more around releasing resources.
> > But the former sounds more like a security enhancement - instead of
> > granting the client full access to the entire device, it aims to expose
> > only the necessary region of BAR resources to the client. Then, as Jason
> > questioned, what is the value of doing so when one client can program an
> > arbitrary DMA address into the exposed BAR region to attack the mapped
> > memory of other clients and the USD... there is no hw isolation
> > within a partitioned IOAS unless the device supports PASID, in which case
> > each client can be associated with its own IOAS space.
> 
> That's also my opinion. It seems that PASIDs are not supported in
> that case, which is why the clients share the same IOVA address space
> instead of each one having their own.

Yes, we do have cases where PASID is not supported by our hardware.

> In that case, since all of this is cooperative and can't be enforced, I
> think one process can corrupt another process's memory that is mapped in
> the IOMMU.

In systems lacking PASID, some degree of enforcement would still be possible
via USD and device policies, much as an in-kernel driver wanting to accomplish
the same goals (enabling a client and device to access each other's memory
directly) would presumably need to enforce this.

I have been thinking along the following lines:

Imagine that we want the client and device to communicate with each other
directly via queues in each other's memory, bypassing interaction with the
driver.

As part of granting access to a client process:
- The USD allocates some IOAS slice for the client.
- The USD prepares a restricted IOMMU fd to be shared with the client which
  only has mapping permissions to the IOAS slice.
- The USD configures the device: "DMA initiated across this region of
  client-accessible BAR is only allowed to target the client's IOAS slice."
- The USD vends the client a dma-buf exposing a view of only that client's queue
  space, along with the restricted IOMMU fd.

Given the above setup, barring bugs in the USD, or the device hardware/firmware,
it should be impossible for one client to corrupt another client's address
space, since the side-effects it is able to create by accessing its BAR slice
have been constrained by a combination of USD + device policy.

Next, we need to address revocation. The USD needs to be able to revoke:
1) client access to BAR memory
2) device access to client memory

Issue (2) was touched on in the original tech topic email, but we haven't
covered (1) yet.

For (1) to be possible, I think we need to grant the VFIO user (USD in this
specific case) the ability to revoke a dma-buf in a way that prevents "peer"
access to the device -- whether the peer is some other device, or a user space
client process.

Following a dma_buf_ops.mmap, I suppose that revocation would mean:
- Poisoning the dma-buf fd to disallow the creation of additional mmaps.
- Zapping the PTEs backing existing mmaps. Subsequent access to the unmapped
  client address space should trigger page faults.

Thanks,
Alex



* Re: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
  2025-09-22 17:46         ` Alex Mastro
@ 2025-09-22 17:51           ` Jason Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-22 17:51 UTC (permalink / raw)
  To: Alex Mastro
  Cc: Mostafa Saleh, Tian, Kevin, Keith Busch, Alex Williamson,
	Bjorn Helgaas, David Reiss, Joerg Roedel, Leon Romanovsky, Li Zhe,
	Mahmoud Adam, Philipp Stanner, Robin Murphy, Kasireddy, Vivek,
	Will Deacon, Yunxiang Li, linux-kernel@vger.kernel.org,
	iommu@lists.linux.dev, kvm@vger.kernel.org

On Mon, Sep 22, 2025 at 10:46:23AM -0700, Alex Mastro wrote:

> Following a dma_buf_ops.mmap, I suppose that revocation would mean:

I'd investigate adding some ioctl to the dmabuf fd to permanently
revoke it. The zapping/etc already has to be done just to get mmap in
the first place. The vending process would retain an FD on the dmabuf,
and when it is time to revoke, it can call the ioctl directly on that fd.

Jason

