Re: [PATCH] drm/xe: Fix slab-out-of-bounds on PT update ops retry


From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: intel-xe@lists.freedesktop.org,
	Matthew Auld <matthew.auld@intel.com>,
	 stable@vger.kernel.org
Subject: Re: [PATCH] drm/xe: Fix slab-out-of-bounds on PT update ops retry
Date: Fri, 03 Apr 2026 12:00:48 +0200
Message-ID: <de63e900ec874a731b1a9423ec45f1984e33b367.camel@linux.intel.com>
In-Reply-To: <ac8pYV9MTav7nmZu@gsse-cloud1.jf.intel.com>

On Thu, 2026-04-02 at 19:43 -0700, Matthew Brost wrote:
> On Thu, Apr 02, 2026 at 07:42:06PM -0700, Matthew Brost wrote:
> > On Thu, Apr 02, 2026 at 11:15:39AM +0200, Thomas Hellström wrote:
> > > xe_pt_update_ops_prepare() calls xe_pt_update_ops_init() at the
> > > start of
> > > each invocation to reset per-attempt state, but current_op was
> > > not
> > > included in that reset. When vm_bind_ioctl_ops_execute() retries
> > > due to
> > > ww-mutex contention (drm_exec_retry_on_contention), ops_execute()
> > > calls
> > 
> > I'm failing to see retry path around vm_bind_ioctl_ops_execute
> > related
> > to drm_exec_retry_on_contention... Also by the time we get to
> > vm_bind_ioctl_ops_execute we have all dma-resv, right?
> 
> s/vm_bind_ioctl_ops_execute/ops_execute here...
> 
> Matt

So indeed the commit message is wrong in stating that the retry happens
earlier, but the KASAN report indicates that ops_execute() had already
been started with the same vops. The patch does fix the KASAN splat.
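
As a minimal user-space illustration of the failure mode described in
the quoted commit message (this is not the xe code; the struct, the
function names and the reset_cursor flag are all hypothetical), a
cursor that survives a second prepare pass walks off the end of an
array sized for a single pass:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-bind update state; not the xe structs. */
struct update_ops {
	size_t num_ops;     /* capacity of the ops[] array, sized for one pass */
	size_t current_op;  /* cursor: next slot to fill */
	int last;           /* example of per-attempt state that init resets */
};

/* Reset per-attempt state. reset_cursor models the one-line fix. */
static void update_ops_init(struct update_ops *u, bool reset_cursor)
{
	u->last = 0;
	if (reset_cursor)
		u->current_op = 0;
}

/*
 * One prepare pass over n_list_ops operations. Returns the number of slots
 * that would be written past the end of ops[] (0 means the pass stayed in
 * bounds). Nothing is actually written, so the sketch is safe to run.
 */
static size_t update_ops_prepare(struct update_ops *u, size_t n_list_ops,
				 bool reset_cursor)
{
	size_t overflow = 0;

	update_ops_init(u, reset_cursor);
	for (size_t i = 0; i < n_list_ops; i++) {
		if (u->current_op >= u->num_ops)
			overflow++;   /* real code would write out of bounds here */
		u->current_op++;
	}
	return overflow;
}

/* Run prepare twice on a one-element array, as a rerun would. */
size_t overflow_on_retry(bool reset_cursor)
{
	struct update_ops u = { .num_ops = 1, .current_op = 0, .last = 0 };
	size_t first = update_ops_prepare(&u, 1, reset_cursor);

	assert(first == 0);   /* the first pass always fits */
	return update_ops_prepare(&u, 1, reset_cursor);
}
```

In this model overflow_on_retry(false) reports one out-of-bounds slot on
the second pass, and overflow_on_retry(true) reports none, which matches
the effect the patch is expected to have.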

We might be looking at a bigger issue here, since once we call
xe_vm_set_validation_exec() we need to be prepared to handle -EDEADLK
(and -ENOMEM, for that matter).

I guess in this situation those would primarily come from allocating
and validating page-table bos, and if contention arises on *any* ww
lock in ops_execute() (for example from eviction in the future), that
contention affects the __until_all_locked() loop and causes an implicit
rerun.
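
To make the "implicit rerun" concrete, here is a rough user-space sketch
of a rerun-on-contention loop. It is not drm_exec and does not reflect
how __until_all_locked() is implemented; run_until_uncontended(),
struct attempt_state and contended_body() are hypothetical names. The
point is only that anything the body mutates has to be reset at the top
of each attempt, and that errors other than -EDEADLK (such as -ENOMEM)
fall out of the loop and must be handled by the caller:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-attempt state; stands in for whatever the body mutates. */
struct attempt_state {
	int steps_done;
	int attempts;
};

/* Hypothetical body: returns 0, -EDEADLK (contention) or another -errno. */
typedef int (*body_fn)(struct attempt_state *st, void *arg);

/*
 * Rerun body until it stops reporting contention. State is reset at the top
 * of every attempt so a rerun never sees leftovers from the previous one.
 * Any error other than -EDEADLK is returned to the caller unchanged.
 */
int run_until_uncontended(body_fn body, struct attempt_state *st, void *arg,
			  int max_attempts)
{
	int ret;

	st->attempts = 0;
	do {
		st->steps_done = 0;   /* per-attempt reset */
		st->attempts++;
		ret = body(st, arg);
	} while (ret == -EDEADLK && st->attempts < max_attempts);

	return ret;
}

/* Example body: reports contention on the first *arg calls, then succeeds. */
int contended_body(struct attempt_state *st, void *arg)
{
	int *remaining = arg;

	st->steps_done++;             /* work done before the contended lock */
	if (*remaining > 0) {
		(*remaining)--;
		return -EDEADLK;
	}
	assert(st->steps_done == 1);  /* holds only because of the reset */
	return 0;
}
```

With a counter of 2 passed as arg, the body runs three times and the
final attempt still sees steps_done == 1; dropping the per-attempt reset
would trip the assertion on the third run.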

So I need to dig into what's actually causing the rerun in this case,
and we need to make sure we properly handle -EDEADLK and -ENOMEM after
the regions enclosed by xe_vm_set_validation_exec().

/Thomas.


> 
> > 
> > I believe the Kasan report but I just can't spot the bug - can you
> > point
> > out the retry path to me?
> > 
> > Matt
> > 
> > > xe_pt_update_ops_prepare() again. The second call walks the same
> > > op list
> > > and fills ops[] starting from current_op, which still holds the
> > > value
> > > from the first attempt. This indexes past the end of the ops
> > > array
> > > allocated by xe_vma_ops_alloc(), whose size was computed for a
> > > single
> > > pass.
> > > 
> > > KASAN reported:
> > >   BUG: KASAN: slab-out-of-bounds in bind_op_prepare+0x89c/0xae0
> > > [xe]
> > >   Write of size 8 at addr ffff88812e72bae8 by task xe_evict/2848
> > >   [...]
> > >   bind_op_prepare+0x89c/0xae0 [xe]
> > >   xe_pt_update_ops_prepare+0xbd0/0x1570 [xe]
> > >   ops_execute+0x3ae/0x2030 [xe]
> > >   vm_bind_ioctl_ops_execute+0x4d5/0xed0 [xe]
> > > 
> > > The write lands at ops[1].vma (offset 360 into the second element
> > > of a
> > > one-element 384-byte allocation) because entries[] is exactly 360
> > > bytes
> > > and current_op was 1 at the start of the retried prepare pass.
> > > 
> > > Fix by resetting current_op to 0 in xe_pt_update_ops_init().
> > > 
> > > Fixes: e8babb280b5e ("drm/xe: Convert multiple bind ops into
> > > single job")
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > Cc: <stable@vger.kernel.org> # v6.12+
> > > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > > Signed-off-by: Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_pt.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c
> > > b/drivers/gpu/drm/xe/xe_pt.c
> > > index 8e5f4f0dea3f..3607cd57fc4c 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -2291,6 +2291,7 @@ xe_pt_update_ops_init(struct
> > > xe_vm_pgtable_update_ops *pt_update_ops)
> > >  	init_llist_head(&pt_update_ops->deferred);
> > >  	pt_update_ops->start = ~0x0ull;
> > >  	pt_update_ops->last = 0x0ull;
> > > +	pt_update_ops->current_op = 0;
> > >  	xe_page_reclaim_list_init(&pt_update_ops->prl);
> > >  }
> > >  
> > > -- 
> > > 2.53.0
> > >

Thread overview: 5+ messages
2026-04-02  9:15 [PATCH] drm/xe: Fix slab-out-of-bounds on PT update ops retry Thomas Hellström
2026-04-02  9:27 ` Matthew Auld
2026-04-03  2:42 ` Matthew Brost
2026-04-03  2:43   ` Matthew Brost
2026-04-03 10:00     ` Thomas Hellström [this message]
