Subject: Re: [PATCH] drm/xe: Fix slab-out-of-bounds on PT update ops retry
From: Thomas Hellström
To: Matthew Brost
Cc: intel-xe@lists.freedesktop.org, Matthew Auld, stable@vger.kernel.org
Date: Fri, 03 Apr 2026 12:00:48 +0200
References: <20260402091539.4114-1-thomas.hellstrom@linux.intel.com>
Organization: Intel Sweden AB, Registration Number: 556189-6027
X-Mailing-List: stable@vger.kernel.org

On Thu, 2026-04-02 at 19:43 -0700, Matthew Brost wrote:
> On Thu, Apr 02, 2026 at 07:42:06PM -0700, Matthew Brost wrote:
> > On Thu, Apr 02, 2026 at 11:15:39AM +0200, Thomas Hellström wrote:
> > > xe_pt_update_ops_prepare() calls
> > > xe_pt_update_ops_init() at the start of each invocation to reset
> > > per-attempt state, but current_op was not included in that reset.
> > > When vm_bind_ioctl_ops_execute() retries due to ww-mutex contention
> > > (drm_exec_retry_on_contention), ops_execute() calls
> >
> > I'm failing to see the retry path around vm_bind_ioctl_ops_execute
> > related to drm_exec_retry_on_contention... Also by the time we get
> > to vm_bind_ioctl_ops_execute we have all dma-resv, right?
>
> s/vm_bind_ioctl_ops_execute/ops_execute here...
>
> Matt

So indeed, the commit message incorrectly states that the retry happens
earlier, but the KASAN report indicates that ops_execute() had already
been started with the same vops. The patch does fix the KASAN splat.

We might be looking at a bigger issue here, since whenever we call
xe_vm_set_validation_exec() we need to be prepared to handle -EDEADLK
(and -ENOMEM, for that matter). I guess in this situation those would
primarily come from allocating and validating page-table bos, and if
contention arises from *any* ww lock in ops_execute() (for example from
eviction in the future), that contention affects the
__until_all_locked() loop and causes an implicit rerun.

So I need to dig into what is actually causing the rerun in this case,
and we need to make sure we properly handle -EDEADLK and -ENOMEM after
the regions enclosed by xe_vm_set_validation_exec().

/Thomas.

>
> >
> > I believe the Kasan report but I just can't spot the bug - can you
> > point out the retry path to me?
> >
> > Matt
> >
> > > xe_pt_update_ops_prepare() again. The second call walks the same
> > > op list and fills ops[] starting from current_op, which still
> > > holds the value from the first attempt. This indexes past the end
> > > of the ops array allocated by xe_vma_ops_alloc(), whose size was
> > > computed for a single pass.
> > >
> > > KASAN reported:
> > >   BUG: KASAN: slab-out-of-bounds in bind_op_prepare+0x89c/0xae0 [xe]
> > >   Write of size 8 at addr ffff88812e72bae8 by task xe_evict/2848
> > >   [...]
> > >   bind_op_prepare+0x89c/0xae0 [xe]
> > >   xe_pt_update_ops_prepare+0xbd0/0x1570 [xe]
> > >   ops_execute+0x3ae/0x2030 [xe]
> > >   vm_bind_ioctl_ops_execute+0x4d5/0xed0 [xe]
> > >
> > > The write lands at ops[1].vma (offset 360 into the second element
> > > of a one-element 384-byte allocation) because entries[] is exactly
> > > 360 bytes and current_op was 1 at the start of the retried prepare
> > > pass.
> > >
> > > Fix by resetting current_op to 0 in xe_pt_update_ops_init().
> > >
> > > Fixes: e8babb280b5e ("drm/xe: Convert multiple bind ops into single job")
> > > Cc: Matthew Brost
> > > Cc: Matthew Auld
> > > Cc: # v6.12+
> > > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > > Signed-off-by: Thomas Hellström
> > >
> > > ---
> > >  drivers/gpu/drm/xe/xe_pt.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > index 8e5f4f0dea3f..3607cd57fc4c 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -2291,6 +2291,7 @@ xe_pt_update_ops_init(struct xe_vm_pgtable_update_ops *pt_update_ops)
> > >  	init_llist_head(&pt_update_ops->deferred);
> > >  	pt_update_ops->start = ~0x0ull;
> > >  	pt_update_ops->last = 0x0ull;
> > > +	pt_update_ops->current_op = 0;
> > >  	xe_page_reclaim_list_init(&pt_update_ops->prl);
> > >  }
> > >
> > > --
> > > 2.53.0
> > >