The Linux Kernel Mailing List
* Re: [PATCH 4/9] vfio/pci: Convert BAR mmap() to use a DMABUF
       [not found]     ` <afhNeYS174EW7RYp@nvidia.com>
@ 2026-05-05 10:49       ` Leon Romanovsky
  2026-05-05 14:50         ` Alex Williamson
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2026-05-05 10:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Matt Evans, Alex Mastro, Christian König,
	Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

On Mon, May 04, 2026 at 04:40:41AM -0300, Jason Gunthorpe wrote:
> On Fri, May 01, 2026 at 04:19:15PM -0600, Alex Williamson wrote:
> 
> > Exporting dma-bufs from vfio-pci is a feature, but mmap of MMIO BARs is
> > a legacy requirement.  That legacy requirement now depends on
> > PCI_P2PDMA, which depends on 64BIT and ZONE_DEVICE.
> 
> That should be split up now, Leon missed it when he added the new
> APIs that didn't require ZONE_DEVICE..

Sorry, what did I miss here?  
VFIO_DMABUF is an optional feature and is enabled only when P2P support is  
available. It does not affect legacy systems where P2P cannot be enabled.

Thanks

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/9] vfio/pci: Clean up BAR zap and revocation
       [not found]   ` <20260501171919.42659174@shazbot.org>
@ 2026-05-05 10:58     ` Leon Romanovsky
  0 siblings, 0 replies; 19+ messages in thread
From: Leon Romanovsky @ 2026-05-05 10:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Matt Evans, Jason Gunthorpe, Alex Mastro, Christian König,
	Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

On Fri, May 01, 2026 at 05:19:19PM -0600, Alex Williamson wrote:
> On Thu, 16 Apr 2026 06:17:49 -0700
> Matt Evans <mattev@meta.com> wrote:
> 
> > Previously, vfio_pci_zap_bars() (and the wrapper
> > vfio_pci_zap_and_down_write_memory_lock()) calls were paired with
> > calls of vfio_pci_dma_buf_move().
> > 
> > This commit replaces them with a unified new function,
> > vfio_pci_zap_revoke_bars() containing both the vfio_pci_dma_buf_move()
> > and the unmap_mapping_range(), making it harder for callers to omit
> > one.  It adds a wrapper, vfio_pci_lock_zap_revoke_bars(), which takes
> > the write memory_lock before zapping, and adds a new
> > vfio_pci_unrevoke_bars() for the re-enable path.
> > 
> > However, as of "vfio/pci: Convert BAR mmap() to use a DMABUF" the
> > unmap_mapping_range() to zap is entirely redundant for plain vfio-pci,
> > since the DMABUFs used for BAR mappings already zap PTEs when the
> > vfio_pci_dma_buf_move() occurs.
> > 
> > One exception remains as a FIXME: in nvgrace-gpu, some BAR VMAs
> > conditionally use custom vm_ops, which have not moved to be backed by
> > DMABUFs.  If these BARs are mmap()ed, the vdev enables the existing
> > behaviour of unmap_mapping_range() for the device fd address space.
> 
> What's the plan here?  Is this a temporary FIXME or a place to prove
> that dmabuf for mmap works beyond the core use case?
> 
> > 
> > Signed-off-by: Matt Evans <mattev@meta.com>
> > ---
> >  drivers/vfio/pci/nvgrace-gpu/main.c |  5 +++
> >  drivers/vfio/pci/vfio_pci_config.c  | 30 ++++++--------
> >  drivers/vfio/pci/vfio_pci_core.c    | 62 +++++++++++++++++++----------
> >  drivers/vfio/pci/vfio_pci_priv.h    |  3 +-
> >  include/linux/vfio_pci_core.h       |  1 +
> >  5 files changed, 62 insertions(+), 39 deletions(-)
> > 
> ...  
> > @@ -1229,7 +1228,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
> >  	if (!vdev->reset_works)
> >  		return -EINVAL;
> >  
> > -	vfio_pci_zap_and_down_write_memory_lock(vdev);
> > +	vfio_pci_lock_zap_revoke_bars(vdev);
> >  
> >  	/*
> >  	 * This function can be invoked while the power state is non-D0. If
> > @@ -1242,10 +1241,9 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
> >  	 */
> >  	vfio_pci_set_power_state(vdev, PCI_D0);
> >  
> > -	vfio_pci_dma_buf_move(vdev, true);
> 
> This seems subtle enough to be troublesome.  I wonder if Leon didn't
> intentionally place the dmabuf revoke after the device is in D0 to
> allow the driver to interact with the device. 

My intention was to place vfio_pci_dma_buf_move() as close as possible to
pci_try_reset_function(), so the device is known to be fully operational
at that point. It looks like calling it after the transition to D0 is the
right ordering.

Thanks

> I think the lock needs to come before the power state change to avoid
> racing a user-induced state change.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/9] vfio/pci: Convert BAR mmap() to use a DMABUF
  2026-05-05 10:49       ` [PATCH 4/9] vfio/pci: Convert BAR mmap() to use a DMABUF Leon Romanovsky
@ 2026-05-05 14:50         ` Alex Williamson
  2026-05-05 14:59           ` Jason Gunthorpe
  2026-05-06  5:35           ` Leon Romanovsky
  0 siblings, 2 replies; 19+ messages in thread
From: Alex Williamson @ 2026-05-05 14:50 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Matt Evans, Alex Mastro, Christian König,
	Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm, alex

On Tue, 5 May 2026 13:49:11 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> On Mon, May 04, 2026 at 04:40:41AM -0300, Jason Gunthorpe wrote:
> > On Fri, May 01, 2026 at 04:19:15PM -0600, Alex Williamson wrote:
> >   
> > > Exporting dma-bufs from vfio-pci is a feature, but mmap of MMIO BARs is
> > > a legacy requirement.  That legacy requirement now depends on
> > > PCI_P2PDMA, which depends on 64BIT and ZONE_DEVICE.  
> > 
> > That should be split up now, Leon missed it when he added the new
> > APIs that didn't require ZONE_DEVICE..  
> 
> Sorry, what did I miss here?  
> VFIO_DMABUF is an optional feature and is enabled only when P2P support is  
> available. It does not affect legacy systems where P2P cannot be enabled.

If we look at the long term view of moving exclusively to cdev/iommufd,
where VFIO_DMABUF becomes the mechanism for implementing P2P DMA
mappings, VFIO_DMABUF may be optional, but it's highly desirable for
legacy compatibility.  There's an argument though that providing P2P
compatibility on platforms that support PCI_P2PDMA is probably
sufficient.

However, in providing mmap of dmabufs as a feature, this series is
wiring all mmaps through dmabufs and therefore that dependency becomes
fundamental to the use of vfio-pci.  Thus the discussion of whether the
noted config requirements could be lifted.  Thanks,

Alex

PS - Please also weigh in on the dmabuf underflow[1]

[1]https://lore.kernel.org/all/20260501131236.278ac431@shazbot.org/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/9] vfio/pci: Convert BAR mmap() to use a DMABUF
  2026-05-05 14:50         ` Alex Williamson
@ 2026-05-05 14:59           ` Jason Gunthorpe
  2026-05-06  5:35           ` Leon Romanovsky
  1 sibling, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2026-05-05 14:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Matt Evans, Alex Mastro, Christian König,
	Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

On Tue, May 05, 2026 at 08:50:58AM -0600, Alex Williamson wrote:
> On Tue, 5 May 2026 13:49:11 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
> 
> > On Mon, May 04, 2026 at 04:40:41AM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 01, 2026 at 04:19:15PM -0600, Alex Williamson wrote:
> > >   
> > > > Exporting dma-bufs from vfio-pci is a feature, but mmap of MMIO BARs is
> > > > a legacy requirement.  That legacy requirement now depends on
> > > > PCI_P2PDMA, which depends on 64BIT and ZONE_DEVICE.  
> > > 
> > > That should be split up now, Leon missed it when he added the new
> > > APIs that didn't require ZONE_DEVICE..  
> > 
> > Sorry, what did I miss here?  
> > VFIO_DMABUF is an optional feature and is enabled only when P2P support is  
> > available. It does not affect legacy systems where P2P cannot be enabled.
> 
> If we look at the long term view of moving exclusively to cdev/iommufd,
> where VFIO_DMABUF becomes the mechanism for implementing P2P DMA
> mappings, VFIO_DMABUF may be optional, but it's highly desirable for
> legacy compatibility.  There's an argument though that providing P2P
> compatibility on platforms that support PCI_P2PDMA is probably
> sufficient.

The whole reason we developed the P2PDMA stuff the way we did was so
that all VFIO platforms could use it and get P2P.  The code is fine;
there is a kconfig/kbuild issue that we can't enable P2PDMA without
also ZONE_DEVICE, and those need to be split up.  Once P2PDMA is
available on all arches it is no longer a concern..

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/9] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
       [not found]       ` <20260430171106.GA6829@nvidia.com>
@ 2026-05-05 18:13         ` Matt Evans
  2026-05-06 19:03           ` Matt Evans
  0 siblings, 1 reply; 19+ messages in thread
From: Matt Evans @ 2026-05-05 18:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Hi Jason,

On 30/04/2026 18:11, Jason Gunthorpe wrote:
> 
> On Thu, Apr 30, 2026 at 05:47:49PM +0100, Matt Evans wrote:
>>> On Thu, Apr 16, 2026 at 06:17:46AM -0700, Matt Evans wrote:
>>>> +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
>>>> +				   struct vm_area_struct *vma,
>>>> +				   u64 phys_start, u64 req_len,
>>>> +				   unsigned int res_index)
>>>> +{
>>>> +	struct vfio_pci_dma_buf *priv;
>>>> +	const unsigned int nr_ranges = 1;
>>>> +	int ret;
>>>> +
>>>> +	priv = kzalloc_obj(*priv);
>>>> +	if (!priv)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	priv->phys_vec = kzalloc_obj(*priv->phys_vec);
>>>> +	if (!priv->phys_vec) {
>>>> +		ret = -ENOMEM;
>>>> +		goto err_free_priv;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * The mmap() request's vma->vm_offs might be non-zero, but
>>>> +	 * the DMABUF is created from _offset zero_ of the BAR.  The
>>>> +	 * portion between zero and the vm_offs is inaccessible
>>>> +	 * through this VMA, but this approach keeps the
>>>> +	 * /proc/<pid>/maps offset somewhat consistent with the
>>>> +	 * pre-DMABUF code.  Size includes the offset portion.
>>>
>>> I'm not sure I understand this comment?
>>>
>>> For the old path vm_pgoff for byte 0 of the bar starts at some large
>>> offset
>>>
>>> For the new path vm_pgoff for byte 0 of the first range starts at 0
>>
>> Glad you asked.  :)
>>
>> This is trying to achieve keeping /proc/<pid>/maps (or similar) somewhat
>> as informative as pre-DMABUF BAR mmap, in terms of keeping the VMA
>> vm_offs column useful.  Before this patch, say you mmap() two slices A
>> and B of the same BAR:
>>
>>   struct vfio_region_info bar_region;
>>
>>   vm_a = mmap(0, 0x1000, ..., device_fd, bar_region.offset + 0);
>>   vm_b = mmap(0, 0x1000, ..., device_fd, bar_region.offset + 0x4000);
>>
>> ...you'd see something like this in /proc/blah/maps:
>>
>> fffff4000000-fffff4001000 rw-s 10000000000 00:07 148     /dev/vfio/devices/vfio0
>> fffff5000000-fffff5001000 rw-s 10000004000 00:07 148     /dev/vfio/devices/vfio0
> 
> 
>> then the VMA's vm_offs would need to be thunked back down to 0 (since
>> the fault handler then treats vm_b + 0 as the first byte of the DMABUF).
>> That works/adds up, but then the vm_offs of both VMAs A & B both have
>> offset 0, and it's harder to differentiate in /proc/blah/maps.
> 
> Yes, and that would be correct.
> 
> The VMA output of lspci should show the exact pgoff passed to mmap and
> nothing else. Do not mangle it for "debugging".
> 
> pgoff is not to be used to show random internal FD details..
> 
>> We could possibly stash the original offset somewhere and then render it
>> in the name string, but the name's already about the max size and using
>> the existing vm_offs column is nicer IMO, doesn't need a new field, etc.
> 
>> I need to work on this comment then!  What this is trying to say is that
>> the DMABUF is made artificially larger than the part that is visible
>> through the VMA.
> 
> Yuk, that's another reason not to do this.

OK, fair enough.  I'll rework this and remove this one weird trick, 
thanks for the input.

Matt


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/9] vfio/pci: Convert BAR mmap() to use a DMABUF
  2026-05-05 14:50         ` Alex Williamson
  2026-05-05 14:59           ` Jason Gunthorpe
@ 2026-05-06  5:35           ` Leon Romanovsky
  1 sibling, 0 replies; 19+ messages in thread
From: Leon Romanovsky @ 2026-05-06  5:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Matt Evans, Alex Mastro, Christian König,
	Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

On Tue, May 05, 2026 at 08:50:58AM -0600, Alex Williamson wrote:
> On Tue, 5 May 2026 13:49:11 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
> 
> > On Mon, May 04, 2026 at 04:40:41AM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 01, 2026 at 04:19:15PM -0600, Alex Williamson wrote:
> > >   
> > > > Exporting dma-bufs from vfio-pci is a feature, but mmap of MMIO BARs is
> > > > a legacy requirement.  That legacy requirement now depends on
> > > > PCI_P2PDMA, which depends on 64BIT and ZONE_DEVICE.  
> > > 
> > > That should be split up now, Leon missed it when he added the new
> > > APIs that didn't require ZONE_DEVICE..  
> > 
> > Sorry, what did I miss here?  
> > VFIO_DMABUF is an optional feature and is enabled only when P2P support is  
> > available. It does not affect legacy systems where P2P cannot be enabled.
> 
> If we look at the long term view of moving exclusively to cdev/iommufd,
> where VFIO_DMABUF becomes the mechanism for implementing P2P DMA
> mappings, VFIO_DMABUF may be optional, but it's highly desirable for
> legacy compatibility.  There's an argument though that providing P2P
> compatibility on platforms that support PCI_P2PDMA is probably
> sufficient.
> 
> However, in providing mmap of dmabufs as a feature, this series is
> wiring all mmaps through dmabufs and therefore that dependency becomes
> fundamental to the use of vfio-pci.  Thus the discussion of whether the
> noted config requirements could be lifted.  Thanks,

Right, there was no need to remove ZONE_DEVICE when I added my code, and I
left the task of cleaning it out of is_pci_p2pdma_page() for another day.
Without ZONE_DEVICE, all pages are treated as non-P2P.

> 
> Alex
> 
> PS - Please also weigh in on the dmabuf underflow[1]
> 
> [1]https://lore.kernel.org/all/20260501131236.278ac431@shazbot.org/

will do.

Thanks

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/9] vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
       [not found]   ` <20260501131236.278ac431@shazbot.org>
@ 2026-05-06 13:53     ` Matt Evans
  2026-05-06 15:29       ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Matt Evans @ 2026-05-06 13:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm,
	Carlos López

Hi Alex,

On 01/05/2026 20:12, Alex Williamson wrote:
> 
> On Thu, 16 Apr 2026 06:17:44 -0700
> Matt Evans <mattev@meta.com> wrote:
> 
>> vfio_pci_dma_buf_cleanup() assumed all VFIO device DMABUFs need to be
>> revoked.  However, if vfio_pci_dma_buf_move() revokes DMABUFs before
>> the fd/device closes, then vfio_pci_dma_buf_cleanup() would do a
>> second/underflowing kref_put() then wait_for_completion() on a
>> completion that never fires.  Fixed by predicating on revocation
>> status.
>>
>> This could happen if PCI_COMMAND_MEMORY is cleared before closing the
>> device fd (but the scenario is more likely to hit when future commits
>> add more methods to revoke DMABUFs).
>>
>> Fixes: 1a8a5227f2299 ("vfio: Wait for dma-buf invalidation to complete")
>> Signed-off-by: Matt Evans <mattev@meta.com>
>> ---
>>
>> (Just a fix, but later "vfio/pci: Convert BAR mmap() to use a DMABUF"
>> and "vfio/pci: Permanently revoke a DMABUF on request" depend on this
>> context, so including in this series.)
> 
> We really need a fix for this split out from this series.  It's already
> been shown[1] that this is trivially reachable.  Carlos proposed[2] a
> similar solution to the one below.  I was concurrently working on the
> issue and suggested an alternative[3].  Let's pick a solution for
> 7.1-rc.  Thanks,

It looks like [3] is progressing, so I'll drop this one when I can 
rebase onto it.

I noticed [3] removes the dma_resv_lock(priv->dmabuf->resv) around the 
priv->vdev = NULL, and this series' vfio_pci_mmap_huge_fault() relies on 
vdev only changing whilst resv is held to resolve a race between a fault 
and cleanup (see patch 7 of this series).  The handler takes resv so 
that it can stably test vdev in order to take memory_lock.

Must your fix change vdev outside of holding resv?  I'm still sketching 
alternatives; at first glance perhaps the fault handler could rely on 
vdev being valid if !revoked, which can be tested holding [only] resv.


Thanks,

Matt

> 
> Alex
> 
> [1]https://lore.kernel.org/all/GVXPR02MB12019AA6014F27EF5D773E89BFB372@GVXPR02MB12019.eurprd02.prod.outlook.com/
> [2]https://lore.kernel.org/all/20260429182736.409323-2-clopez@suse.de/
> [3]https://lore.kernel.org/all/20260429142242.70f746b4@nvidia.com/
> 
>   
>> drivers/vfio/pci/vfio_pci_dmabuf.c | 9 +++++++--
>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
>> index 281ba7d69567..04478b7415a0 100644
>> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
>> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
>> @@ -395,20 +395,25 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>>   
>>   	down_write(&vdev->memory_lock);
>>   	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
>> +		bool was_revoked;
>> +
>>   		if (!get_file_active(&priv->dmabuf->file))
>>   			continue;
>>   
>>   		dma_resv_lock(priv->dmabuf->resv, NULL);
>>   		list_del_init(&priv->dmabufs_elm);
>>   		priv->vdev = NULL;
>> +		was_revoked = priv->revoked;
>>   		priv->revoked = true;
>>   		dma_buf_invalidate_mappings(priv->dmabuf);
>>   		dma_resv_wait_timeout(priv->dmabuf->resv,
>>   				      DMA_RESV_USAGE_BOOKKEEP, false,
>>   				      MAX_SCHEDULE_TIMEOUT);
>>   		dma_resv_unlock(priv->dmabuf->resv);
>> -		kref_put(&priv->kref, vfio_pci_dma_buf_done);
>> -		wait_for_completion(&priv->comp);
>> +		if (!was_revoked) {
>> +			kref_put(&priv->kref, vfio_pci_dma_buf_done);
>> +			wait_for_completion(&priv->comp);
>> +		}
>>   		vfio_device_put_registration(&vdev->vdev);
>>   		fput(priv->dmabuf->file);
>>   	}
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/9] vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
  2026-05-06 13:53     ` [PATCH 1/9] vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put Matt Evans
@ 2026-05-06 15:29       ` Leon Romanovsky
  2026-05-06 15:55         ` Matt Evans
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2026-05-06 15:29 UTC (permalink / raw)
  To: Matt Evans
  Cc: Alex Williamson, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm,
	Carlos López

On Wed, May 06, 2026 at 02:53:31PM +0100, Matt Evans wrote:
> Hi Alex,
> 
> On 01/05/2026 20:12, Alex Williamson wrote:
> > 
> > On Thu, 16 Apr 2026 06:17:44 -0700
> > Matt Evans <mattev@meta.com> wrote:
> > 
> > > vfio_pci_dma_buf_cleanup() assumed all VFIO device DMABUFs need to be
> > > revoked.  However, if vfio_pci_dma_buf_move() revokes DMABUFs before
> > > the fd/device closes, then vfio_pci_dma_buf_cleanup() would do a
> > > second/underflowing kref_put() then wait_for_completion() on a
> > > completion that never fires.  Fixed by predicating on revocation
> > > status.
> > > 
> > > This could happen if PCI_COMMAND_MEMORY is cleared before closing the
> > > device fd (but the scenario is more likely to hit when future commits
> > > add more methods to revoke DMABUFs).
> > > 
> > > Fixes: 1a8a5227f2299 ("vfio: Wait for dma-buf invalidation to complete")
> > > Signed-off-by: Matt Evans <mattev@meta.com>
> > > ---
> > > 
> > > (Just a fix, but later "vfio/pci: Convert BAR mmap() to use a DMABUF"
> > > and "vfio/pci: Permanently revoke a DMABUF on request" depend on this
> > > context, so including in this series.)
> > 
> > We really need a fix for this split out from this series.  It's already
> > been shown[1] that this is trivially reachable.  Carlos proposed[2] a
> > similar solution to the one below.  I was concurrently working on the
> > issue and suggested an alternative[3].  Let's pick a solution for
> > 7.1-rc.  Thanks,
> 
> It looks like [3] is progressing, so I'll drop this one when I can rebase
> onto it.
> 
> I noticed [3] removes the dma_resv_lock(priv->dmabuf->resv) around the
> priv->vdev = NULL, and this series' vfio_pci_mmap_huge_fault() relies on
> vdev only changing whilst resv is held to resolve a race between a fault and
> cleanup (see patch 7 of this series).  The handler takes resv so that it can
> stably test vdev in order to take memory_lock.

I think that you should rely on priv->revoked and not on priv->vdev.

Thanks

> 
> Must your fix change vdev outside of holding resv?  I'm still sketching
> alternatives; at first glance perhaps the fault handler could rely on vdev
> being valid if !revoked, which can be tested holding [only] resv.
> 
> 
> Thanks,
> 
> Matt
> 
> > 
> > Alex
> > 
> > [1]https://lore.kernel.org/all/GVXPR02MB12019AA6014F27EF5D773E89BFB372@GVXPR02MB12019.eurprd02.prod.outlook.com/
> > [2]https://lore.kernel.org/all/20260429182736.409323-2-clopez@suse.de/
> > [3]https://lore.kernel.org/all/20260429142242.70f746b4@nvidia.com/
> > 
> > > drivers/vfio/pci/vfio_pci_dmabuf.c | 9 +++++++--
> > >   1 file changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > > index 281ba7d69567..04478b7415a0 100644
> > > --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> > > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > > @@ -395,20 +395,25 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> > >   	down_write(&vdev->memory_lock);
> > >   	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> > > +		bool was_revoked;
> > > +
> > >   		if (!get_file_active(&priv->dmabuf->file))
> > >   			continue;
> > >   		dma_resv_lock(priv->dmabuf->resv, NULL);
> > >   		list_del_init(&priv->dmabufs_elm);
> > >   		priv->vdev = NULL;
> > > +		was_revoked = priv->revoked;
> > >   		priv->revoked = true;
> > >   		dma_buf_invalidate_mappings(priv->dmabuf);
> > >   		dma_resv_wait_timeout(priv->dmabuf->resv,
> > >   				      DMA_RESV_USAGE_BOOKKEEP, false,
> > >   				      MAX_SCHEDULE_TIMEOUT);
> > >   		dma_resv_unlock(priv->dmabuf->resv);
> > > -		kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > > -		wait_for_completion(&priv->comp);
> > > +		if (!was_revoked) {
> > > +			kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > > +			wait_for_completion(&priv->comp);
> > > +		}
> > >   		vfio_device_put_registration(&vdev->vdev);
> > >   		fput(priv->dmabuf->file);
> > >   	}
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/9] vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
  2026-05-06 15:29       ` Leon Romanovsky
@ 2026-05-06 15:55         ` Matt Evans
  2026-05-06 16:14           ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Matt Evans @ 2026-05-06 15:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm,
	Carlos López

Hi Leon,

On 06/05/2026 16:29, Leon Romanovsky wrote:
> 
> On Wed, May 06, 2026 at 02:53:31PM +0100, Matt Evans wrote:
>> Hi Alex,
>>
>> On 01/05/2026 20:12, Alex Williamson wrote:
>>>
>>> On Thu, 16 Apr 2026 06:17:44 -0700
>>> Matt Evans <mattev@meta.com> wrote:
>>>
>>>> vfio_pci_dma_buf_cleanup() assumed all VFIO device DMABUFs need to be
>>>> revoked.  However, if vfio_pci_dma_buf_move() revokes DMABUFs before
>>>> the fd/device closes, then vfio_pci_dma_buf_cleanup() would do a
>>>> second/underflowing kref_put() then wait_for_completion() on a
>>>> completion that never fires.  Fixed by predicating on revocation
>>>> status.
>>>>
>>>> This could happen if PCI_COMMAND_MEMORY is cleared before closing the
>>>> device fd (but the scenario is more likely to hit when future commits
>>>> add more methods to revoke DMABUFs).
>>>>
>>>> Fixes: 1a8a5227f2299 ("vfio: Wait for dma-buf invalidation to complete")
>>>> Signed-off-by: Matt Evans <mattev@meta.com>
>>>> ---
>>>>
>>>> (Just a fix, but later "vfio/pci: Convert BAR mmap() to use a DMABUF"
>>>> and "vfio/pci: Permanently revoke a DMABUF on request" depend on this
>>>> context, so including in this series.)
>>>
>>> We really need a fix for this split out from this series.  It's already
>>> been shown[1] that this is trivially reachable.  Carlos proposed[2] a
>>> similar solution to the one below.  I was concurrently working on the
>>> issue and suggested an alternative[3].  Let's pick a solution for
>>> 7.1-rc.  Thanks,
>>
>> It looks like [3] is progressing, so I'll drop this one when I can rebase
>> onto it.
>>
>> I noticed [3] removes the dma_resv_lock(priv->dmabuf->resv) around the
>> priv->vdev = NULL, and this series' vfio_pci_mmap_huge_fault() relies on
>> vdev only changing whilst resv is held to resolve a race between a fault and
>> cleanup (see patch 7 of this series).  The handler takes resv so that it can
>> stably test vdev in order to take memory_lock.
> 
> I think that you should rely on priv->revoked and not on priv->vdev.

Needs both unfortunately, as the fault handler ultimately needs to take
vdev->memory_lock.


Matt

> 
> Thanks
> 
>>
>> Must your fix change vdev outside of holding resv?  I'm still sketching
>> alternatives; at first glance perhaps the fault handler could rely on vdev
>> being valid if !revoked, which can be tested holding [only] resv.
>>
>>
>> Thanks,
>>
>> Matt
>>
>>>
>>> Alex
>>>
>>> [1]https://lore.kernel.org/all/GVXPR02MB12019AA6014F27EF5D773E89BFB372@GVXPR02MB12019.eurprd02.prod.outlook.com/
>>> [2]https://lore.kernel.org/all/20260429182736.409323-2-clopez@suse.de/
>>> [3]https://lore.kernel.org/all/20260429142242.70f746b4@nvidia.com/
>>>
>>>> drivers/vfio/pci/vfio_pci_dmabuf.c | 9 +++++++--
>>>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
>>>> index 281ba7d69567..04478b7415a0 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
>>>> @@ -395,20 +395,25 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>>>>    	down_write(&vdev->memory_lock);
>>>>    	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
>>>> +		bool was_revoked;
>>>> +
>>>>    		if (!get_file_active(&priv->dmabuf->file))
>>>>    			continue;
>>>>    		dma_resv_lock(priv->dmabuf->resv, NULL);
>>>>    		list_del_init(&priv->dmabufs_elm);
>>>>    		priv->vdev = NULL;
>>>> +		was_revoked = priv->revoked;
>>>>    		priv->revoked = true;
>>>>    		dma_buf_invalidate_mappings(priv->dmabuf);
>>>>    		dma_resv_wait_timeout(priv->dmabuf->resv,
>>>>    				      DMA_RESV_USAGE_BOOKKEEP, false,
>>>>    				      MAX_SCHEDULE_TIMEOUT);
>>>>    		dma_resv_unlock(priv->dmabuf->resv);
>>>> -		kref_put(&priv->kref, vfio_pci_dma_buf_done);
>>>> -		wait_for_completion(&priv->comp);
>>>> +		if (!was_revoked) {
>>>> +			kref_put(&priv->kref, vfio_pci_dma_buf_done);
>>>> +			wait_for_completion(&priv->comp);
>>>> +		}
>>>>    		vfio_device_put_registration(&vdev->vdev);
>>>>    		fput(priv->dmabuf->file);
>>>>    	}
>>>
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/9] vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
  2026-05-06 15:55         ` Matt Evans
@ 2026-05-06 16:14           ` Leon Romanovsky
  2026-05-06 16:42             ` Matt Evans
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2026-05-06 16:14 UTC (permalink / raw)
  To: Matt Evans
  Cc: Alex Williamson, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm,
	Carlos López

On Wed, May 06, 2026 at 04:55:27PM +0100, Matt Evans wrote:
> Hi Leon,
> 
> On 06/05/2026 16:29, Leon Romanovsky wrote:
> > 
> > On Wed, May 06, 2026 at 02:53:31PM +0100, Matt Evans wrote:
> > > Hi Alex,
> > > 
> > > On 01/05/2026 20:12, Alex Williamson wrote:
> > > > 
> > > > On Thu, 16 Apr 2026 06:17:44 -0700
> > > > Matt Evans <mattev@meta.com> wrote:
> > > > 
> > > > > vfio_pci_dma_buf_cleanup() assumed all VFIO device DMABUFs need to be
> > > > > revoked.  However, if vfio_pci_dma_buf_move() revokes DMABUFs before
> > > > > the fd/device closes, then vfio_pci_dma_buf_cleanup() would do a
> > > > > second/underflowing kref_put() then wait_for_completion() on a
> > > > > completion that never fires.  Fixed by predicating on revocation
> > > > > status.
> > > > > 
> > > > > This could happen if PCI_COMMAND_MEMORY is cleared before closing the
> > > > > device fd (but the scenario is more likely to hit when future commits
> > > > > add more methods to revoke DMABUFs).
> > > > > 
> > > > > Fixes: 1a8a5227f2299 ("vfio: Wait for dma-buf invalidation to complete")
> > > > > Signed-off-by: Matt Evans <mattev@meta.com>
> > > > > ---
> > > > > 
> > > > > (Just a fix, but later "vfio/pci: Convert BAR mmap() to use a DMABUF"
> > > > > and "vfio/pci: Permanently revoke a DMABUF on request" depend on this
> > > > > context, so including in this series.)
> > > > 
> > > > We really need a fix for this split out from this series.  It's already
> > > > been shown[1] that this is trivially reachable.  Carlos proposed[2] a
> > > > similar solution to the one below.  I was concurrently working on the
> > > > issue and suggested an alternative[3].  Let's pick a solution for
> > > > 7.1-rc.  Thanks,
> > > 
> > > It looks like [3] is progressing, so I'll drop this one when I can rebase
> > > onto it.
> > > 
> > > I noticed [3] removes the dma_resv_lock(priv->dmabuf->resv) around the
> > > priv->vdev = NULL, and this series' vfio_pci_mmap_huge_fault() relies on
> > > vdev only changing whilst resv is held to resolve a race between a fault and
> > > cleanup (see patch 7 of this series).  The handler takes resv so that it can
> > > stably test vdev in order to take memory_lock.
> > 
> > I think that you should rely on priv->revoked and not on priv->vdev.
> 
> Needs both unfortunately, as the fault handler ultimately needs to take
> vdev->memory_lock.

One can argue that if priv->revoked == true, all accesses to the device
should be denied and treated as if priv->vdev == NULL.

Thanks

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/9] vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
  2026-05-06 16:14           ` Leon Romanovsky
@ 2026-05-06 16:42             ` Matt Evans
  0 siblings, 0 replies; 19+ messages in thread
From: Matt Evans @ 2026-05-06 16:42 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm,
	Carlos López

Hi Leon,

On 06/05/2026 17:14, Leon Romanovsky wrote:
> 
> On Wed, May 06, 2026 at 04:55:27PM +0100, Matt Evans wrote:
>> Hi Leon,
>>
>> On 06/05/2026 16:29, Leon Romanovsky wrote:
>>>
>>> On Wed, May 06, 2026 at 02:53:31PM +0100, Matt Evans wrote:
>>>> Hi Alex,
>>>>
>>>> On 01/05/2026 20:12, Alex Williamson wrote:
>>>>>
>>>>> On Thu, 16 Apr 2026 06:17:44 -0700
>>>>> Matt Evans <mattev@meta.com> wrote:
>>>>>
>>>>>> vfio_pci_dma_buf_cleanup() assumed all VFIO device DMABUFs need to be
>>>>>> revoked.  However, if vfio_pci_dma_buf_move() revokes DMABUFs before
>>>>>> the fd/device closes, then vfio_pci_dma_buf_cleanup() would do a
>>>>>> second/underflowing kref_put() then wait_for_completion() on a
>>>>>> completion that never fires.  Fixed by predicating on revocation
>>>>>> status.
>>>>>>
>>>>>> This could happen if PCI_COMMAND_MEMORY is cleared before closing the
>>>>>> device fd (but the scenario is more likely to hit when future commits
>>>>>> add more methods to revoke DMABUFs).
>>>>>>
>>>>>> Fixes: 1a8a5227f2299 ("vfio: Wait for dma-buf invalidation to complete")
>>>>>> Signed-off-by: Matt Evans <mattev@meta.com>
>>>>>> ---
>>>>>>
>>>>>> (Just a fix, but later "vfio/pci: Convert BAR mmap() to use a DMABUF"
>>>>>> and "vfio/pci: Permanently revoke a DMABUF on request" depend on this
>>>>>> context, so including in this series.)
>>>>>
> >>>> We really need a fix for this split out from this series.  It's already
> >>>> been shown[1] that this is trivially reachable.  Carlos proposed[2] a
> >>>> similar solution to the one below.  I was concurrently working on the
> >>>> issue and suggested an alternative[3].  Let's pick a solution for
> >>>> 7.1-rc.  Thanks,
>>>>
>>>> It looks like [3] is progressing, so I'll drop this one when I can rebase
>>>> onto it.
>>>>
>>>> I noticed [3] removes the dma_resv_lock(priv->dmabuf->resv) around the
>>>> priv->vdev = NULL, and this series' vfio_pci_mmap_huge_fault() relies on
>>>> vdev only changing whilst resv is held to resolve a race between a fault and
>>>> cleanup (see patch 7 of this series).  The handler takes resv so that it can
>>>> stably test vdev in order to take memory_lock.
>>>
>>> I think that you should rely on priv->revoked and not on priv->vdev.
>>
>> Needs both unfortunately, as the fault handler ultimately needs to take
>> vdev->memory_lock.
> 
> One can argue that if priv->revoked == true, all accesses to the device
> should be denied and treated as if priv->vdev == NULL.

I agree, the handler will early-exit when a buffer is revoked.  Though 
when it _isn't_ revoked, it still needs to go through a careful set of 
steps to keep vdev around long enough to take the lock (and ensure it 
still isn't revoked, etc.).

I think the sequence in patch 7 still works (with Alex's patch in [3]), 
since the invariants still hold:

- if not-revoked then vdev is still valid (IOW, vdev = NULL only happens 
after revoked = true)
- revoke is only changed when holding priv->dmabuf->resv

OK, [3] doesn't seem to break this series (just context/rebase).  Sorry 
for thinking out loud; it'd be good if someone could point out any flaw 
in my reasoning, though.

[3] was https://lore.kernel.org/all/20260429142242.70f746b4@nvidia.com/


Matt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/9] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  2026-05-05 18:13         ` [PATCH 3/9] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
@ 2026-05-06 19:03           ` Matt Evans
  0 siblings, 0 replies; 19+ messages in thread
From: Matt Evans @ 2026-05-06 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Hi again Jason,

On 05/05/2026 19:13, Matt Evans wrote:
> Hi Jason,
> 
> On 30/04/2026 18:11, Jason Gunthorpe wrote:
>>
>> On Thu, Apr 30, 2026 at 05:47:49PM +0100, Matt Evans wrote:
>>>> On Thu, Apr 16, 2026 at 06:17:46AM -0700, Matt Evans wrote:
>>>>> +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
>>>>> +                   struct vm_area_struct *vma,
>>>>> +                   u64 phys_start, u64 req_len,
>>>>> +                   unsigned int res_index)
>>>>> +{
>>>>> +    struct vfio_pci_dma_buf *priv;
>>>>> +    const unsigned int nr_ranges = 1;
>>>>> +    int ret;
>>>>> +
>>>>> +    priv = kzalloc_obj(*priv);
>>>>> +    if (!priv)
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    priv->phys_vec = kzalloc_obj(*priv->phys_vec);
>>>>> +    if (!priv->phys_vec) {
>>>>> +        ret = -ENOMEM;
>>>>> +        goto err_free_priv;
>>>>> +    }
>>>>> +
>>>>> +    /*
>>>>> +     * The mmap() request's vma->vm_offs might be non-zero, but
>>>>> +     * the DMABUF is created from _offset zero_ of the BAR.  The
>>>>> +     * portion between zero and the vm_offs is inaccessible
>>>>> +     * through this VMA, but this approach keeps the
>>>>> +     * /proc/<pid>/maps offset somewhat consistent with the
>>>>> +     * pre-DMABUF code.  Size includes the offset portion.
>>>>
>>>> I'm not sure I understand this comment?
>>>>
>>>> For the old path vm_pgoff for byte 0 of the bar starts at some large
>>>> offset
>>>>
>>>> For the new path vm_pgoff for byte 0 of the first range starts at 0
>>>
>>> Glad you asked.  :)
>>>
>>> This is trying to achieve keeping /proc/<pid>/maps (or similar) somewhat
>>> as informative as pre-DMABUF BAR mmap, in terms of keeping the VMA
>>> vm_offs column useful.  Before this patch, say you mmap() two slices A
>>> and B of the same BAR:
>>>
>>>   struct vfio_region_info bar_region;
>>>
>>>   vm_a = mmap(0, 0x1000, ..., device_fd, bar_region.offset + 0);
>>>   vm_b = mmap(0, 0x1000, ..., device_fd, bar_region.offset + 0x4000);
>>>
>>> ...you'd see something like this in /proc/blah/maps:
>>>
>>> fffff4000000-fffff4001000 rw-s 10000000000 00:07 148     /dev/vfio/ 
>>> devices/vfio0
>>> fffff5000000-fffff5001000 rw-s 10000004000 00:07 148     /dev/vfio/ 
>>> devices/vfio0

Looking at this again, I/we got this backwards and I mixed up two things:

The goal of this patch _is already_ to make sure the VMA's vm_pgoff 
(whether viewed in /proc/<pid>/maps or elsewhere) still matches the 
mmap()'s offset.

(For a mo, ignore the resource index encoded into the offset.  Consider 
just the offset into the BAR itself, inside the VFIO_PCI_OFFSET_MASK. 
I'll come back to the index encoded into the upper bits.)

>>> then the VMA's vm_offs would need to be thunked back down to 0 (since
>>> the fault handler then treats vm_b + 0 as the first byte of the DMABUF).
>>> That works/adds up, but then the vm_offs of both VMAs A & B both have
>>> offset 0, and it's harder to differentiate in /proc/blah/maps.
>>
>> Yes, and that would be correct.

Why?  This paragraph was outlining a hypothetical alternative 
implementation that creates a DMABUF the size of the VMA, starting from 
an offset into the BAR derived from vm_pgoff, and then compensates by 
setting vma->vm_pgoff = 0 so that the fault handler doesn't re-apply 
the offset.  That would make byte 0 of the VMA access the correct 
address:

   BAR_start + (vma->vm_pgoff << PAGE_SHIFT)   [1]

But it would...

>> The VMA output of lspci should show the exact pgoff passed to mmap and
>> nothing else. Do not mangle it for "debugging".
>> pgoff is not to be used to show random internal FD details..

...definitely break this property, no?

This patch supports that property by instead creating the DMABUF such 
that the VMA's vm_pgoff (which is maintained, and the same* as passed 
from mmap()!) indexes the DMABUF, so byte 0 of the VMA accesses the 
same address as in [1].  The DMABUF spans from the start of the BAR, 
so the fault handler maths (which indexes the DMABUF by vm_pgoff) is 
common for all buffers.

  a = mmap(0, 0x10000, ..., device_fd, 0x4000);

          +0           +0x4000
          +------------v------------------------------------------+
          |               BAR                                     |
          |                                                       |
          +------------^------------------------------------------+
          .            .
          .            +--------------------------+
          .            |  VMA                     |
          .            |  vma->vm_pgoff = 4       |
          .            +--------------------------+
          .            .                          .
          +------------+--------------------------+
          | invisible  |  DMABUF                  |
          |            |                          |
          +------------+--------------------------+

Same* externally-observable behaviour as the old mmap().

>>> We could possibly stash the original offset somewhere and then render it
>>> in the name string, but the name's already about the max size and using
>>> the existing vm_offs column is nicer IMO, doesn't need a new field, etc.
>>
>>> I need to work on this comment then!  What this is trying to say is that
>>> the DMABUF is made artificially larger than the part that is visible
>>> through the VMA.
>>
>> Yuk, that's another reason not to do this.

Apart from the yuk part, do we have a specific concern with the 
invisible portion?  Perhaps if one could fish out a DMABUF fd somehow 
(it's a file, but no scalar fd is returned to userspace) then the lower 
BAR addresses could get mmap()ed.  Isn't that at worst as permissive as 
a "closed" VFIO device fd which could be fished out, e.g. via 
/proc/<pid>/map_files, and mmap()ed again?

I went through other approaches, but they either need special-casing in 
the fault handler or DMABUF-to-PFN helper, or as above would modify the 
vma->vm_offs.  This seemed best overall; as ever, open to ideas.


*: Region offset:

OK, so this patch strips out the high bits of the offset early, so 
that's disappeared from /proc/<pid>/maps etc.  You're right to point out 
that the resource index could be carried such that the vma->vm_pgoff 
really is identical throughout.  I'll restore that so that the VMA's 
vm_offs is identical to that passed via mmap().

Does that seem reasonable?


Thanks,


Matt


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/9] vfio/pci: Add a helper to look up PFNs for DMABUFs
       [not found]   ` <20260424181510.GF3444440@nvidia.com>
@ 2026-05-07 15:48     ` Matt Evans
  0 siblings, 0 replies; 19+ messages in thread
From: Matt Evans @ 2026-05-07 15:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Hi Jason,

On 24/04/2026 19:15, Jason Gunthorpe wrote:
> 
> On Thu, Apr 16, 2026 at 06:17:45AM -0700, Matt Evans wrote:
>> Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to
>> find a PFN.
>>
>> This supports multi-range DMABUFs, which typically would be used to
>> represent scattered spans but might even represent overlapping or
>> aliasing spans of PFNs.
>>
>> Because this is intended to be used in vfio_pci_core.c, we also need
>> to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
>>
>> Signed-off-by: Matt Evans <mattev@meta.com>
>> ---
>>   drivers/vfio/pci/vfio_pci_dmabuf.c | 124 ++++++++++++++++++++++++++---
>>   drivers/vfio/pci/vfio_pci_priv.h   |  19 +++++
>>   2 files changed, 130 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
>> index 04478b7415a0..8b6bae56bbf2 100644
>> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
>> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
>> @@ -9,19 +9,6 @@
>>   
>>   MODULE_IMPORT_NS("DMA_BUF");
>>   
>> -struct vfio_pci_dma_buf {
>> -	struct dma_buf *dmabuf;
>> -	struct vfio_pci_core_device *vdev;
>> -	struct list_head dmabufs_elm;
>> -	size_t size;
>> -	struct phys_vec *phys_vec;
>> -	struct p2pdma_provider *provider;
>> -	u32 nr_ranges;
>> -	struct kref kref;
>> -	struct completion comp;
>> -	u8 revoked : 1;
>> -};
>> -
>>   static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
>>   				   struct dma_buf_attachment *attachment)
>>   {
>> @@ -106,6 +93,117 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
>>   	.release = vfio_pci_dma_buf_release,
>>   };
>>   
>> +int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
>> +			      struct vm_area_struct *vma,
>> +			      unsigned long address,
>> +			      unsigned int order,
>> +			      unsigned long *out_pfn)
>> +{
>> +	/*
>> +	 * Given a VMA (start, end, pgoffs) and a fault address,
>> +	 * search the corresponding DMABUF's phys_vec[] to find the
>> +	 * range representing the address's offset into the VMA, and
>> +	 * its PFN.
>> +	 *
>> +	 * The phys_vec[] ranges represent contiguous spans of VAs
>> +	 * upwards from the buffer offset 0; the actual PFNs might be
>> +	 * in any order, overlap/alias, etc.  Calculate an offset of
>> +	 * the desired page given VMA start/pgoff and address, then
>> +	 * search upwards from 0 to find which span contains it.
>> +	 *
>> +	 * On success, a valid PFN for a page sized by 'order' is
>> +	 * returned into out_pfn.
>> +	 *
>> +	 * Failure occurs if:
>> +	 * - The page would cross the edge of the VMA
>> +	 * - The page isn't entirely contained within a range
>> +	 * - We find a range, but the final PFN isn't aligned to the
>> +	 *   requested order.
>> +	 *
>> +	 * (Upon failure, the caller is expected to try again with a
>> +	 * smaller order; the tests above will always succeed for
>> +	 * order=0 as the limit case.)
>> +	 *
>> +	 * It's suboptimal if DMABUFs are created with neighbouring
>> +	 * ranges that are physically contiguous, since hugepages
>> +	 * can't straddle range boundaries.  (The construction of the
>> +	 * ranges vector should merge such ranges.)
>> +	 */
>> +
>> +	const unsigned long pagesize = PAGE_SIZE << order;
>> +	unsigned long rounded_page_addr = address & ~(pagesize - 1);
> 
> ALIGN_DOWN(address, pagesize);

Oops, right, fixed.

>> +	unsigned long rounded_page_end = rounded_page_addr + pagesize;
>> +	unsigned long buf_page_offset;
>> +	unsigned long buf_offset = 0;
>> +	unsigned int i;
>> +
>> +	if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) {
>> +		if (order > 0)
>> +			return -EAGAIN;
>> +
>> +		/* A fault address outside of the VMA is absurd. */
>> +		WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",
>> +		     address, vma->vm_start, vma->vm_end);
>> +		return -EFAULT;
>> +	}
>> +
>> +	if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
>> +					vma->vm_pgoff << PAGE_SHIFT, &buf_page_offset)))
>> +		return -EFAULT;
> 
>> +
>> +	for (i = 0; i < vpdmabuf->nr_ranges; i++) {
>> +		size_t range_len = vpdmabuf->phys_vec[i].len;
>> +		phys_addr_t range_start = vpdmabuf->phys_vec[i].paddr;
>> +
>> +		/*
>> +		 * If the current range starts after the page's span,
>> +		 * this and any future range won't match.  Bail early.
>> +		 */
>> +		if (buf_page_offset + pagesize <= buf_offset)
>> +			break;
> 
> No overflow check on this +? If we are worried order is so large that
> the first needs a check then this would too..

In the earlier check it's not order being large but the vm_pgoff, but 
yes an overflow check wouldn't hurt here.  Added.

I've found (my) choice of variable names here awkward, and have renamed 
them to make it a bit clearer as to what's the page being searched for 
and what's the range, etc.

>> +
>> +		if (buf_page_offset >= buf_offset &&
>> +		    buf_page_offset + pagesize <= buf_offset + range_len) {
> 
>> +			/*
>> +			 * The faulting page is wholly contained
>> +			 * within the span represented by the range.
>> +			 * Validate PFN alignment for the order:
>> +			 */
>> +			unsigned long pfn = (range_start >> PAGE_SHIFT) +
>> +				((buf_page_offset - buf_offset) >> PAGE_SHIFT);
> 
> (range_start + (buf_page_offset - buf_offset)) / PAGE_SIZE;

WFM, done.

Thank you,

Matt


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/9] vfio/pci: Support mmap() of a VFIO DMABUF
       [not found]   ` <20260424183006.GI3444440@nvidia.com>
@ 2026-05-07 16:09     ` Matt Evans
  0 siblings, 0 replies; 19+ messages in thread
From: Matt Evans @ 2026-05-07 16:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Hi Jason,

On 24/04/2026 19:30, Jason Gunthorpe wrote:
> 
> On Thu, Apr 16, 2026 at 06:17:50AM -0700, Matt Evans wrote:
>> +
>> +	dma_resv_lock(priv->dmabuf->resv, NULL);
>>   	vdev = READ_ONCE(priv->vdev);
>>
>> +	if (READ_ONCE(priv->revoked) || !vdev) {
> 
> Why is this read once? It is inside the resv lock so it is stable?
> 
>> +		pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",
>> +				     __func__, vmf->address, vma->vm_pgoff);
>> +		dma_resv_unlock(priv->dmabuf->resv);
>> +		return VM_FAULT_SIGBUS;
>> +	}
>> +	/* vdev is usable */
>> +
>> +	if (!vfio_device_try_get_registration(&vdev->vdev)) {
>> +		/*
>> +		 * If vdev != NULL (above), the registration should
>> +		 * already be >0 and so this try_get should never
>> +		 * fail.
>> +		 */
>> +		dev_warn(&vdev->pdev->dev, "%s: Unexpected registration failure\n",
>> +			 __func__);
>> +		dma_resv_unlock(priv->dmabuf->resv);
>> +		return VM_FAULT_SIGBUS;
>> +	}
>> +	dma_resv_unlock(priv->dmabuf->resv);
>> +
>>   	scoped_guard(rwsem_read, &vdev->memory_lock) {
>> -		if (!priv->revoked) {
>> +		if (!READ_ONCE(priv->revoked)) {
> 
> Same here, it is not read once since you hold the memory_lock it is
> stable.

I used them more as an 'eyecatcher' to complement the comment.  Although 
they're not strictly required (the lock/unlocks already act as compiler 
barriers), they stand out to say "something's going on here".

The revoked status is first read while holding resv, and _must be read 
a second time_ once that's released and memory_lock is taken, so two 
READ_ONCEs seemed appropriate to highlight this.

But if you feel strongly, sure, I can remove it (though I'd add, say, a 
/* Re-read status value */ comment).


Matt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/9] vfio/pci: Provide a user-facing name for BAR mappings
       [not found]   ` <20260501164430.5d3ea683@shazbot.org>
@ 2026-05-07 16:56     ` Matt Evans
  2026-05-07 17:17       ` Matt Evans
  0 siblings, 1 reply; 19+ messages in thread
From: Matt Evans @ 2026-05-07 16:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Hi Alex,

On 01/05/2026 23:44, Alex Williamson wrote:
> 
> On Thu, 16 Apr 2026 06:17:48 -0700
> Matt Evans <mattev@meta.com> wrote:
> 
>> Since converting BAR mmap()s to using DMABUFs, we lose the original
>> device path in /proc/<pid>/maps, lsof, etc.  Generate a debug-oriented
>> synthetic 'filename' based on the cdev, plus BDF, plus resource index.
>>
>> This applies only to BAR mappings via the VFIO device fd, as
>> explicitly-exported DMABUFs are named by userspace via the
>> DMA_BUF_SET_NAME ioctl.
>>
>> Signed-off-by: Matt Evans <mattev@meta.com>
>> ---
>>   drivers/vfio/pci/vfio_pci_dmabuf.c | 27 +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
>> index a12432825e5e..04c7733fe712 100644
>> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
>> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
>> @@ -4,6 +4,7 @@
>>   #include <linux/dma-buf-mapping.h>
>>   #include <linux/pci-p2pdma.h>
>>   #include <linux/dma-resv.h>
>> +#include <uapi/linux/dma-buf.h>
>>   
>>   #include "vfio_pci_priv.h"
>>   
>> @@ -467,6 +468,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
>>   {
>>   	struct vfio_pci_dma_buf *priv;
>>   	const unsigned int nr_ranges = 1;
>> +	char *bufname;
>>   	int ret;
>>   
>>   	priv = kzalloc_obj(*priv);
>> @@ -479,6 +481,20 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
>>   		goto err_free_priv;
>>   	}
>>   
>> +	bufname = kzalloc(DMA_BUF_NAME_LEN, GFP_KERNEL);
>> +	if (!bufname) {
>> +		ret = -ENOMEM;
>> +		goto err_free_phys;
>> +	}
>> +
>> +	/*
>> +	 * Maximum size of the friendly debug name is
>> +	 * vfio1234567890:ffff:ff:3f.7-9 = 30, which fits within
>> +	 * DMA_BUF_NAME_LEN.
>> +	 */
>> +	snprintf(bufname, DMA_BUF_NAME_LEN, "%s:%s/%x",
>> +		 dev_name(&vdev->vdev.device), pci_name(vdev->pdev), res_index);
> 
> Comment suggests 9 is the max res_index that can be printed, but mmap
> only directly supports standard BARs 0-5.  Comment also uses a '-'
> while the code uses a '/'.  Thanks,

Right you are.  Fixed, but, since...
https://lore.kernel.org/kvm/52162da4-e1cc-4f90-a95a-218d6089cd71@meta.com/

...I'm keeping the resource index encoded in the vm_pgoff, and as that's
in /proc/<pid>/maps it doesn't need to be in the name.  I.e., an example
mapping of BAR 2 looks like:

ffffa9330000-ffffad300000 rw-s 20000030000 00:0b 12 /dmabuf:vfio0:0000:00:03.0

Thanks,

Matt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/9] vfio/pci: Provide a user-facing name for BAR mappings
  2026-05-07 16:56     ` [PATCH 5/9] vfio/pci: Provide a user-facing name for BAR mappings Matt Evans
@ 2026-05-07 17:17       ` Matt Evans
  0 siblings, 0 replies; 19+ messages in thread
From: Matt Evans @ 2026-05-07 17:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Sent too soon, :|

On 07/05/2026 17:56, Matt Evans wrote:
> Hi Alex,
> 
> On 01/05/2026 23:44, Alex Williamson wrote:
>>
>> On Thu, 16 Apr 2026 06:17:48 -0700
>> Matt Evans <mattev@meta.com> wrote:
>>
>>> Since converting BAR mmap()s to using DMABUFs, we lose the original
>>> device path in /proc/<pid>/maps, lsof, etc.  Generate a debug-oriented
>>> synthetic 'filename' based on the cdev, plus BDF, plus resource index.
>>>
>>> This applies only to BAR mappings via the VFIO device fd, as
>>> explicitly-exported DMABUFs are named by userspace via the
>>> DMA_BUF_SET_NAME ioctl.
>>>
>>> Signed-off-by: Matt Evans <mattev@meta.com>
>>> ---
>>>   drivers/vfio/pci/vfio_pci_dmabuf.c | 27 +++++++++++++++++++++++++--
>>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/ 
>>> vfio_pci_dmabuf.c
>>> index a12432825e5e..04c7733fe712 100644
>>> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
>>> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
>>> @@ -4,6 +4,7 @@
>>>   #include <linux/dma-buf-mapping.h>
>>>   #include <linux/pci-p2pdma.h>
>>>   #include <linux/dma-resv.h>
>>> +#include <uapi/linux/dma-buf.h>
>>>   #include "vfio_pci_priv.h"
>>> @@ -467,6 +468,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct 
>>> vfio_pci_core_device *vdev,
>>>   {
>>>       struct vfio_pci_dma_buf *priv;
>>>       const unsigned int nr_ranges = 1;
>>> +    char *bufname;
>>>       int ret;
>>>       priv = kzalloc_obj(*priv);
>>> @@ -479,6 +481,20 @@ int vfio_pci_core_mmap_prep_dmabuf(struct 
>>> vfio_pci_core_device *vdev,
>>>           goto err_free_priv;
>>>       }
>>> +    bufname = kzalloc(DMA_BUF_NAME_LEN, GFP_KERNEL);
>>> +    if (!bufname) {
>>> +        ret = -ENOMEM;
>>> +        goto err_free_phys;
>>> +    }
>>> +
>>> +    /*
>>> +     * Maximum size of the friendly debug name is
>>> +     * vfio1234567890:ffff:ff:3f.7-9 = 30, which fits within
>>> +     * DMA_BUF_NAME_LEN.
>>> +     */
>>> +    snprintf(bufname, DMA_BUF_NAME_LEN, "%s:%s/%x",
>>> +         dev_name(&vdev->vdev.device), pci_name(vdev->pdev), 
>>> res_index);
>>
>> Comment suggests 9 is the max res_index that can be printed, but mmap
>> only directly supports standard BARs 0-5.  Comment also uses a '-'
>> while the code uses a '/'.  Thanks,
> 
> Right you are.  Fixed, but, since...
> https://lore.kernel.org/kvm/52162da4-e1cc-4f90-a95a-218d6089cd71@meta.com/
> 
> ...I'm keeping the resource index encoded in the vm_pgoffs and as that's
> in /proc/<pid>/maps it doesn't need to be in the name. I.e., an example
> mapping of BAR 2 looks like:
> 
> ffffa9330000-ffffad300000 rw-s 20000030000 00:0b 12 / 
> dmabuf:vfio0:0000:00:03.0

BUT, the name's visible via paths other than just /proc/<pid>/maps, e.g.
/sys/kernel/debug/dma_buf/bufinfo or /proc/<pid>/map_files, which don't 
show the vm_offs, so it's back to plan A.  I've just made the comment 
consistent.

Matt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 9/9] vfio/pci: Add mmap() attributes to DMABUF feature
       [not found]       ` <20260427083644.4ee174cd@shazbot.org>
@ 2026-05-11 15:30         ` Matt Evans
  2026-05-11 17:51           ` Leon Romanovsky
  2026-05-11 20:09           ` Alex Williamson
  0 siblings, 2 replies; 19+ messages in thread
From: Matt Evans @ 2026-05-11 15:30 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky
  Cc: Jason Gunthorpe, Alex Mastro, Christian König, Mahmoud Adam,
	David Matlack, Björn Töpel, Sumit Semwal, Kevin Tian,
	Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

Hi Alex, Leon,

On 27/04/2026 15:36, Alex Williamson wrote:
> 
> On Sun, 26 Apr 2026 13:52:15 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
> 
>> On Fri, Apr 24, 2026 at 03:31:53PM -0300, Jason Gunthorpe wrote:
>>> On Thu, Apr 16, 2026 at 06:17:52AM -0700, Matt Evans wrote:
>>>> A new field is reserved in vfio_device_feature_dma_buf.flags to
>>>> request CPU-facing memory type attributes for mmap()s of the buffer.
>>>> Add a flag VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC, which results in WC
>>>> PTEs for the DMABUF's BAR region.
>>>>
>>>> Signed-off-by: Matt Evans <mattev@meta.com>
>>>> ---
>>>>   drivers/vfio/pci/vfio_pci_dmabuf.c | 15 +++++++++++++--
>>>>   drivers/vfio/pci/vfio_pci_priv.h   |  1 +
>>>>   include/uapi/linux/vfio.h          | 12 +++++++++---
>>>>   3 files changed, 23 insertions(+), 5 deletions(-)
>>>
>>> Nice and simple
>>>
>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>    
>>>> @@ -1549,8 +1551,12 @@ struct vfio_region_dma_range {
>>>>   struct vfio_device_feature_dma_buf {
>>>>   	__u32	region_index;
>>>>   	__u32	open_flags;
>>>> -	__u32   flags;
>>>> -	__u32   nr_ranges;
>>>> +	__u32	flags;
>>>> +	/* Flags sub-field reserved for attribute enum */
>>>> +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_MASK		(0xfU << 28)
>>>> +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_UC		(0 << 28)
>>>> +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC		(1 << 28)
>>>> +	__u32	nr_ranges;
>>
>> Alex,
>>
>> The TPH proposal extends the flags field in a similar way, but I suggested
>> a different approach to conserve bits. At the moment, we spend three bits
>> on a single feature, which feels wasteful.
>>
>> What do you think?
>> https://lore.kernel.org/all/20260409120415.GF86584@unreal/
> 
> I already proposed a very different interface for TPH that decouples
> the dma-buf creation from setting the TPH values:
> 
> https://lore.kernel.org/all/20260423132016.4a25e074@shazbot.org/
> 
> This is overall less intrusive than the TPH change proposed, but it
> could still make sense to align this as an operation on the dma-buf,
> that can be probed as a separate feature.  Thanks,

I'll add a VFIO_DEVICE_FEATURE_DMA_BUF_ATTRS in a v2 instead to get in 
line with the TPH work, no worries.

For the benefit of future hackers, how would you describe the criteria 
for adding flags to this existing field?  What hypothetical feature 
characteristics would be appropriate?  (Maybe it's that these attrs & 
TPH add scalar fields spanning several bits rather than a simple 
boolean.)  Two of us have independently added something that's turned 
out to be inappropriate, so some guidance would be good.

Thanks!


Matt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 9/9] vfio/pci: Add mmap() attributes to DMABUF feature
  2026-05-11 15:30         ` [PATCH 9/9] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
@ 2026-05-11 17:51           ` Leon Romanovsky
  2026-05-11 20:09           ` Alex Williamson
  1 sibling, 0 replies; 19+ messages in thread
From: Leon Romanovsky @ 2026-05-11 17:51 UTC (permalink / raw)
  To: Matt Evans
  Cc: Alex Williamson, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

On Mon, May 11, 2026 at 04:30:39PM +0100, Matt Evans wrote:
> Hi Alex, Leon,
> 
> On 27/04/2026 15:36, Alex Williamson wrote:
> > 
> > On Sun, 26 Apr 2026 13:52:15 +0300
> > Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > > On Fri, Apr 24, 2026 at 03:31:53PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Apr 16, 2026 at 06:17:52AM -0700, Matt Evans wrote:
> > > > > A new field is reserved in vfio_device_feature_dma_buf.flags to
> > > > > request CPU-facing memory type attributes for mmap()s of the buffer.
> > > > > Add a flag VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC, which results in WC
> > > > > PTEs for the DMABUF's BAR region.
> > > > > 
> > > > > Signed-off-by: Matt Evans <mattev@meta.com>
> > > > > ---
> > > > >   drivers/vfio/pci/vfio_pci_dmabuf.c | 15 +++++++++++++--
> > > > >   drivers/vfio/pci/vfio_pci_priv.h   |  1 +
> > > > >   include/uapi/linux/vfio.h          | 12 +++++++++---
> > > > >   3 files changed, 23 insertions(+), 5 deletions(-)
> > > > 
> > > > Nice and simple
> > > > 
> > > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > @@ -1549,8 +1551,12 @@ struct vfio_region_dma_range {
> > > > >   struct vfio_device_feature_dma_buf {
> > > > >   	__u32	region_index;
> > > > >   	__u32	open_flags;
> > > > > -	__u32   flags;
> > > > > -	__u32   nr_ranges;
> > > > > +	__u32	flags;
> > > > > +	/* Flags sub-field reserved for attribute enum */
> > > > > +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_MASK		(0xfU << 28)
> > > > > +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_UC		(0 << 28)
> > > > > +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC		(1 << 28)
> > > > > +	__u32	nr_ranges;
> > > 
> > > Alex,
> > > 
> > > The TPH proposal extends the flags field in a similar way, but I suggested
> > > a different approach to conserve bits. At the moment, we spend three bits
> > > on a single feature, which feels wasteful.
> > > 
> > > What do you think?
> > > https://lore.kernel.org/all/20260409120415.GF86584@unreal/
> > 
> > I already proposed a very different interface for TPH that decouples
> > the dma-buf creation from setting the TPH values:
> > 
> > https://lore.kernel.org/all/20260423132016.4a25e074@shazbot.org/
> > 
> > This is overall less intrusive than the TPH change proposed, but it
> > could still make sense to align this as an operation on the dma-buf,
> > that can be probed as a separate feature.  Thanks,
> 
> I'll add a VFIO_DEVICE_FEATURE_DMA_BUF_ATTRS in a v2 instead to get in line
> with the TPH work, no worries.
> 
> For the benefit of future hackers, how would you describe the criteria for
> adding flags to this existing field?  

One bit per-feature.

> What hypothetical feature characteristics would be appropriate?  (Maybe it's that these attrs & TPH
> add scalar fields in several bits rather than a simple boolean.) Two of us
> have independently added something that's turned out to be inappropriate so
> some guidance would be good.

Both of you intertwined the signaling bit with the actual data, and that is
what led me to prefer a different approach.

Thanks

> 
> Thanks!
> 
> 
> Matt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 9/9] vfio/pci: Add mmap() attributes to DMABUF feature
  2026-05-11 15:30         ` [PATCH 9/9] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
  2026-05-11 17:51           ` Leon Romanovsky
@ 2026-05-11 20:09           ` Alex Williamson
  1 sibling, 0 replies; 19+ messages in thread
From: Alex Williamson @ 2026-05-11 20:09 UTC (permalink / raw)
  To: Matt Evans
  Cc: Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Christian König, Mahmoud Adam, David Matlack,
	Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm, alex

On Mon, 11 May 2026 16:30:39 +0100
Matt Evans <mattev@meta.com> wrote:

> Hi Alex, Leon,
> 
> On 27/04/2026 15:36, Alex Williamson wrote:
> > 
> > On Sun, 26 Apr 2026 13:52:15 +0300
> > Leon Romanovsky <leon@kernel.org> wrote:
> >   
> >> On Fri, Apr 24, 2026 at 03:31:53PM -0300, Jason Gunthorpe wrote:  
> >>> On Thu, Apr 16, 2026 at 06:17:52AM -0700, Matt Evans wrote:  
> >>>> A new field is reserved in vfio_device_feature_dma_buf.flags to
> >>>> request CPU-facing memory type attributes for mmap()s of the buffer.
> >>>> Add a flag VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC, which results in WC
> >>>> PTEs for the DMABUF's BAR region.
> >>>>
> >>>> Signed-off-by: Matt Evans <mattev@meta.com>
> >>>> ---
> >>>>   drivers/vfio/pci/vfio_pci_dmabuf.c | 15 +++++++++++++--
> >>>>   drivers/vfio/pci/vfio_pci_priv.h   |  1 +
> >>>>   include/uapi/linux/vfio.h          | 12 +++++++++---
> >>>>   3 files changed, 23 insertions(+), 5 deletions(-)  
> >>>
> >>> Nice and simple
> >>>
> >>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> >>>      
> >>>> @@ -1549,8 +1551,12 @@ struct vfio_region_dma_range {
> >>>>   struct vfio_device_feature_dma_buf {
> >>>>   	__u32	region_index;
> >>>>   	__u32	open_flags;
> >>>> -	__u32   flags;
> >>>> -	__u32   nr_ranges;
> >>>> +	__u32	flags;
> >>>> +	/* Flags sub-field reserved for attribute enum */
> >>>> +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_MASK		(0xfU << 28)
> >>>> +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_UC		(0 << 28)
> >>>> +#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC		(1 << 28)
> >>>> +	__u32	nr_ranges;  
> >>
> >> Alex,
> >>
> >> The TPH proposal extends the flags field in a similar way, but I suggested
> >> a different approach to conserve bits. At the moment, we spend three bits
> >> on a single feature, which feels wasteful.
> >>
> >> What do you think?
> >> https://lore.kernel.org/all/20260409120415.GF86584@unreal/  
> > 
> > I already proposed a very different interface for TPH that decouples
> > the dma-buf creation from setting the TPH values:
> > 
> > https://lore.kernel.org/all/20260423132016.4a25e074@shazbot.org/
> > 
> > This is overall less intrusive than the TPH change proposed, but it
> > could still make sense to align this as an operation on the dma-buf,
> > that can be probed as a separate feature.  Thanks,  
> 
> I'll add a VFIO_DEVICE_FEATURE_DMA_BUF_ATTRS in a v2 instead to get in 
> line with the TPH work, no worries.
> 
> For the benefit of future hackers, how would you describe the criteria 
> for adding flags to this existing field?  What hypothetical feature 
> characteristics would be appropriate?  (Maybe it's that these attrs & 
> TPH add scalar fields in several bits rather than a simple boolean.) 
> Two of us have independently added something that's turned out to be 
> inappropriate so some guidance would be good.

I think the question of how we actually expand an arbitrary grab bag of
"ATTRS" is the central question in whether we should implement the
interface.  If we follow the direction I suggested for TPH, maybe this
is just a VFIO_DEVICE_FEATURE_DMA_BUF_WC, where it supports only PROBE
and SET, with SET taking only the dma-buf fd to implement the one-way
promotion from UC -> WC.

If we support a generic SET ATTRS feature, we really need to map out how
flag bits are indicated as supported and how a user untangles failures
from trying to set various attributes.  If we end up with a feature
indicating each ATTR is available, we might as well have just
implemented a feature for each attribute.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-05-11 20:10 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20260416131815.2729131-1-mattev@meta.com>
     [not found] ` <20260416131815.2729131-5-mattev@meta.com>
     [not found]   ` <20260501161915.75525c15@shazbot.org>
     [not found]     ` <afhNeYS174EW7RYp@nvidia.com>
2026-05-05 10:49       ` [PATCH 4/9] vfio/pci: Convert BAR mmap() to use a DMABUF Leon Romanovsky
2026-05-05 14:50         ` Alex Williamson
2026-05-05 14:59           ` Jason Gunthorpe
2026-05-06  5:35           ` Leon Romanovsky
     [not found] ` <20260416131815.2729131-7-mattev@meta.com>
     [not found]   ` <20260501171919.42659174@shazbot.org>
2026-05-05 10:58     ` [PATCH 6/9] vfio/pci: Clean up BAR zap and revocation Leon Romanovsky
     [not found] ` <20260416131815.2729131-4-mattev@meta.com>
     [not found]   ` <20260424182426.GG3444440@nvidia.com>
     [not found]     ` <c598a21e-ee50-42d9-98dc-2959e84ace50@meta.com>
     [not found]       ` <20260430171106.GA6829@nvidia.com>
2026-05-05 18:13         ` [PATCH 3/9] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
2026-05-06 19:03           ` Matt Evans
     [not found] ` <20260416131815.2729131-2-mattev@meta.com>
     [not found]   ` <20260501131236.278ac431@shazbot.org>
2026-05-06 13:53     ` [PATCH 1/9] vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put Matt Evans
2026-05-06 15:29       ` Leon Romanovsky
2026-05-06 15:55         ` Matt Evans
2026-05-06 16:14           ` Leon Romanovsky
2026-05-06 16:42             ` Matt Evans
     [not found] ` <20260416131815.2729131-3-mattev@meta.com>
     [not found]   ` <20260424181510.GF3444440@nvidia.com>
2026-05-07 15:48     ` [PATCH 2/9] vfio/pci: Add a helper to look up PFNs for DMABUFs Matt Evans
     [not found] ` <20260416131815.2729131-8-mattev@meta.com>
     [not found]   ` <20260424183006.GI3444440@nvidia.com>
2026-05-07 16:09     ` [PATCH 7/9] vfio/pci: Support mmap() of a VFIO DMABUF Matt Evans
     [not found] ` <20260416131815.2729131-6-mattev@meta.com>
     [not found]   ` <20260501164430.5d3ea683@shazbot.org>
2026-05-07 16:56     ` [PATCH 5/9] vfio/pci: Provide a user-facing name for BAR mappings Matt Evans
2026-05-07 17:17       ` Matt Evans
     [not found] ` <20260416131815.2729131-10-mattev@meta.com>
     [not found]   ` <20260424183153.GJ3444440@nvidia.com>
     [not found]     ` <20260426105215.GA440345@unreal>
     [not found]       ` <20260427083644.4ee174cd@shazbot.org>
2026-05-11 15:30         ` [PATCH 9/9] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
2026-05-11 17:51           ` Leon Romanovsky
2026-05-11 20:09           ` Alex Williamson
