Kernel KVM virtualization development
From: Leon Romanovsky <leon@kernel.org>
To: Alex Williamson <alex.williamson@nvidia.com>
Cc: "Carlos López" <clopez@suse.de>,
	kvm@vger.kernel.org, joonas.kylmala@netum.fi,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"open list" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] vfio/pci: Fix double-put during dma-buf cleanup
Date: Wed, 6 May 2026 08:45:46 +0300	[thread overview]
Message-ID: <20260506054546.GH11063@unreal> (raw)
In-Reply-To: <20260429142242.70f746b4@nvidia.com>

On Wed, Apr 29, 2026 at 02:22:42PM -0600, Alex Williamson wrote:
> On Wed, 29 Apr 2026 20:27:36 +0200
> Carlos López <clopez@suse.de> wrote:
> 
> > When a dmabuf is created for a VFIO PCI device BAR, it is added
> > to the device's list of dmabufs. If PCI memory access is disabled,
> > vfio_pci_dma_buf_move() is called to revoke the dma-buf, dropping a
> > reference via kref_put(), and setting its revoked field to true.
> > 
> > Currently, vfio_pci_dma_buf_cleanup() does not check if the buffer
> > was already revoked, calling kref_put() on all dmabufs for the device,
> > potentially leading to a refcount underflow and use-after-free, as
> > reported by Joonas Kylmälä.
> > 
> > Check priv->revoked before calling kref_put() to avoid underflowing the
> > reference count.
> > 
> > [  216.397532] ------------[ cut here ]------------
> > [  216.397540] refcount_t: underflow; use-after-free.
> > [  216.397542] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x59/0x90, CPU#5: python3/3269
> > [ ... ]
> > [  216.397851] RIP: 0010:refcount_warn_saturate+0x59/0x90
> > [  216.397859] Code: 44 48 8d 3d 09 bc 35 01 67 48 0f b9 3a c3 cc cc cc cc 48 8d 3d 08 bc 35 01 67 48 0f b9 3a c3 cc cc cc cc 48 8d 3d 07 bc 35 01 <67> 48 0f b9 3a e9 4d b4 70 00 48 8d 3d 06 bc 35 01 67 48 0f b9 3a
> > [  216.397862] RSP: 0018:ffffd05f83a03c60 EFLAGS: 00010246
> > [  216.397867] RAX: 0000000000000000 RBX: ffff8db0425a49c0 RCX: 0000000000000000
> > [  216.397871] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffffb0364910
> > [  216.397873] RBP: ffff8dafc9fc0550 R08: ffff8dafca8f90a8 R09: 0000000000000000
> > [  216.397876] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8dafc9fc0000
> > [  216.397878] R13: ffff8dafc9fc0560 R14: 7fffffffffffffff R15: ffff8db0425a4980
> > [  216.397882] FS:  00007fc096c06780(0000) GS:ffff8db74ea75000(0000) knlGS:0000000000000000
> > [  216.397886] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  216.397889] CR2: 00007fc096932010 CR3: 000000014040b005 CR4: 0000000000f72ef0
> > [  216.397892] PKRU: 55555554
> > [  216.397894] Call Trace:
> > [  216.397898]  <TASK>
> > [  216.397904]  vfio_pci_dma_buf_cleanup+0x163/0x168 [vfio_pci_core]
> > [  216.397923]  vfio_pci_core_close_device+0x67/0xe0 [vfio_pci_core]
> > [  216.397935]  vfio_df_close+0x4c/0x80 [vfio]
> > [  216.397946]  vfio_df_group_close+0x36/0x80 [vfio]
> > [  216.397956]  vfio_device_fops_release+0x21/0x40 [vfio]
> > [  216.397965]  __fput+0xe6/0x2b0
> > [  216.397972]  __x64_sys_close+0x3d/0x80
> > [  216.397979]  do_syscall_64+0xea/0x15d0
> > [  216.397988]  ? ksys_write+0x6b/0xe0
> > [  216.397996]  ? __x64_sys_pread64+0x91/0xc0
> > [  216.398003]  ? do_syscall_64+0x128/0x15d0
> > [  216.398010]  ? do_syscall_64+0x128/0x15d0
> > [  216.398017]  ? ksys_write+0x6b/0xe0
> > [  216.398023]  ? do_syscall_64+0x128/0x15d0
> > [  216.398029]  ? __x64_sys_ioctl+0x96/0xe0
> > [  216.398036]  ? do_syscall_64+0x128/0x15d0
> > [  216.398042]  ? do_syscall_64+0x9f/0x15d0
> > [  216.398048]  ? clear_bhb_loop+0x30/0x80
> > [  216.398054]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  216.398059] RIP: 0033:0x7fc096c9a687
> > [  216.398063] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
> > [  216.398067] RSP: 002b:00007ffe422781f0 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
> > [  216.398071] RAX: ffffffffffffffda RBX: 00007fc096c06780 RCX: 00007fc096c9a687
> > [  216.398074] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
> > [  216.398076] RBP: 0000000000a83590 R08: 0000000000000000 R09: 0000000000000000
> > [  216.398079] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fc096f43160
> > [  216.398081] R13: 0000000000000135 R14: 0000000000a83590 R15: 00007fc096f43168
> > [  216.398087]  </TASK>
> > [  216.398089] ---[ end trace 0000000000000000 ]---
> > 
> > Fixes: 1a8a5227f229 ("vfio: Wait for dma-buf invalidation to complete")
> > Closes: https://lore.kernel.org/kvm/GVXPR02MB12019AA6014F27EF5D773E89BFB372@GVXPR02MB12019.eurprd02.prod.outlook.com/
> > Reported-by: Joonas Kylmälä <joonas.kylmala@netum.fi>
> > Assisted-by: Gemini:gemini-3.1-flash-lite-preview
> > Signed-off-by: Carlos López <clopez@suse.de>
> > ---
> >  drivers/vfio/pci/vfio_pci_dmabuf.c | 22 ++++++++++++++--------
> >  1 file changed, 14 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > index f87fd32e4a01..deb9c351c4a6 100644
> > --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > @@ -389,14 +389,20 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> >  		dma_resv_lock(priv->dmabuf->resv, NULL);
> >  		list_del_init(&priv->dmabufs_elm);
> >  		priv->vdev = NULL;
> > -		priv->revoked = true;
> > -		dma_buf_invalidate_mappings(priv->dmabuf);
> > -		dma_resv_wait_timeout(priv->dmabuf->resv,
> > -				      DMA_RESV_USAGE_BOOKKEEP, false,
> > -				      MAX_SCHEDULE_TIMEOUT);
> > -		dma_resv_unlock(priv->dmabuf->resv);
> > -		kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > -		wait_for_completion(&priv->comp);
> > +
> > +		if (!priv->revoked) {
> > +			priv->revoked = true;
> > +			dma_buf_invalidate_mappings(priv->dmabuf);
> > +			dma_resv_wait_timeout(priv->dmabuf->resv,
> > +					      DMA_RESV_USAGE_BOOKKEEP, false,
> > +					      MAX_SCHEDULE_TIMEOUT);
> > +			dma_resv_unlock(priv->dmabuf->resv);
> > +			kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > +			wait_for_completion(&priv->comp);
> > +		} else {
> > +			dma_resv_unlock(priv->dmabuf->resv);
> > +		}
> > +
> >  		vfio_device_put_registration(&vdev->vdev);
> >  		fput(priv->dmabuf->file);
> >  	}
> 
> Thanks for taking a look at this.  I was putting some AI to work on
> this AI-sourced issue as well.  We found a similar solution, but
> ultimately decided it would be best to return the device to the
> dma-buf creation state rather than allow this inconsistent refcount
> state to exist.  There's an alternate option below that Leon and
> Christian can weigh in on.  I'll post some selftest additions that
> trigger this and exercise the expected flows regardless.  Thanks,
> 
> Alex
> 
> commit 744d9704fb83e843d511c0c668c16d43175c6ad7
> Author: Alex Williamson <alex.williamson@nvidia.com>
> Date:   Wed Apr 29 12:19:34 2026 -0600
> 
>     vfio/pci: fix dma-buf kref underflow after revoke
>     
>     vfio_pci_dma_buf_move(revoked=true) and vfio_pci_dma_buf_cleanup()
>     ran the same drain sequence: set priv->revoked, invalidate mappings,
>     wait for fences, drop the registered kref, wait for completion.
>     When the VFIO device fd was closed after PCI_COMMAND_MEMORY had been
>     cleared, both ran in turn -- the second kref_put underflowed and the
>     subsequent wait_for_completion() blocked on a completion that the
>     first run had already consumed:
>     
>       refcount_t: underflow; use-after-free.
>       WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x59/0x90
>       Call Trace:
>        vfio_pci_dma_buf_cleanup+0x163/0x168 [vfio_pci_core]
>        vfio_pci_core_close_device+0x67/0xe0 [vfio_pci_core]
>        vfio_df_close+0x4c/0x80 [vfio]
>        vfio_df_group_close+0x36/0x80 [vfio]
>        vfio_device_fops_release+0x21/0x40 [vfio]
>        __fput+0xe6/0x2b0
>        __x64_sys_close+0x3d/0x80
>     
>     Collapse the duplication: vfio_pci_dma_buf_cleanup() now delegates
>     the drain to vfio_pci_dma_buf_move(true), which is idempotent for
>     already-revoked dma-bufs.  Cleanup retains only list removal and
>     the device registration drop; the dma_resv_lock that bracketed
>     those is dropped along with the in-line drain that required it,
>     while memory_lock continues to protect them.
>     
>     Re-arm the kref and the completion at the end of move()'s revoke
>     branch so post-revoke state matches post-creation (kref == 1,
>     completion ready).  This keeps cleanup's call into move() a no-op
>     when revoke already ran, and replaces the explicit kref_init() that
>     the un-revoke branch used to perform for the un-revoke -> remap
>     path.
>     
>     Fixes: 1a8a5227f229 ("vfio: Wait for dma-buf invalidation to complete")
>     Reported-by: Joonas Kylmälä <joonas.kylmala@netum.fi>
>     Closes: https://lore.kernel.org/all/GVXPR02MB12019AA6014F27EF5D773E89BFB372@GVXPR02MB12019.eurprd02.prod.outlook.com/
>     Assisted-by: Claude:claude-opus-4-7
>     Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>

Alex,

I think that your fix is the cleanest one.

Reviewed-by: Leon Romanovsky <leon@kernel.org>

Thanks

> 
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index f87fd32e4a01..fdc22e8b4656 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -354,19 +354,18 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>  			if (revoked) {
>  				kref_put(&priv->kref, vfio_pci_dma_buf_done);
>  				wait_for_completion(&priv->comp);
> -			} else {
>  				/*
> -				 * Kref is initialize again, because when revoke
> -				 * was performed the reference counter was decreased
> -				 * to zero to trigger completion.
> +				 * Re-arm the registered kref reference and the
> +				 * completion so the post-revoke state matches the
> +				 * post-creation state.  An un-revoke followed by a
> +				 * new mapping needs the kref to be non-zero before
> +				 * kref_get(), and vfio_pci_dma_buf_cleanup()
> +				 * delegates its drain back through this revoke
> +				 * path on a possibly-already-revoked dma-buf.
>  				 */
>  				kref_init(&priv->kref);
> -				/*
> -				 * There is no need to wait as no mapping was
> -				 * performed when the previous status was
> -				 * priv->revoked == true.
> -				 */
>  				reinit_completion(&priv->comp);
> +			} else {
>  				dma_resv_lock(priv->dmabuf->resv, NULL);
>  				priv->revoked = false;
>  				dma_resv_unlock(priv->dmabuf->resv);
> @@ -382,21 +381,22 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>  	struct vfio_pci_dma_buf *tmp;
>  
>  	down_write(&vdev->memory_lock);
> +
> +	/*
> +	 * Drain any active mappings via the revoke path.  The move is
> +	 * idempotent for dma-bufs already in the revoked state and
> +	 * leaves every priv with the kref re-armed and the completion
> +	 * ready, so cleanup itself does not need to participate in kref
> +	 * bookkeeping.
> +	 */
> +	vfio_pci_dma_buf_move(vdev, true);
> +
>  	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
>  		if (!get_file_active(&priv->dmabuf->file))
>  			continue;
>  
> -		dma_resv_lock(priv->dmabuf->resv, NULL);
>  		list_del_init(&priv->dmabufs_elm);
>  		priv->vdev = NULL;
> -		priv->revoked = true;
> -		dma_buf_invalidate_mappings(priv->dmabuf);
> -		dma_resv_wait_timeout(priv->dmabuf->resv,
> -				      DMA_RESV_USAGE_BOOKKEEP, false,
> -				      MAX_SCHEDULE_TIMEOUT);
> -		dma_resv_unlock(priv->dmabuf->resv);
> -		kref_put(&priv->kref, vfio_pci_dma_buf_done);
> -		wait_for_completion(&priv->comp);
>  		vfio_device_put_registration(&vdev->vdev);
>  		fput(priv->dmabuf->file);
>  	}
