From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9BE27202F65; Mon, 29 Dec 2025 16:22:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767025321; cv=none; b=sdeFn/G5MvJ57wVdMyQ0as1DJ87iWgVs4ZpI+dhiDP4az6gnSb/tDKF8bxPAGyI9qTvqIyMpoh8+k7wjG/Nm2o5ilKDv3zisIngZvstibY8Lbhl5acMwFQqYm3pcxGQz+YmUk8ExumhyPgvBcY+iEEbytmfHe8MMW5nqNn7ZwT8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767025321; c=relaxed/simple; bh=4oz0nAfBixP1Him6illebraudrhy6n17ima7weSk8aQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=si5h8WgiJ7btlPUzlJTgBc4f9OoClhLR9JmkG8IFvoqFLsuOwFd9oNvF89BXqwQX2J8TI/uTmxD/tqiBp4xqBcpF/fvmjZzRJvsjtaay1GS6ojmIwiQdG29QmknrTXeCPvwWIZSxPwykJyzFR2Wfr4egiABMRcBKX/WfDy31Ua8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b=rSidqGir; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b="rSidqGir" Received: by smtp.kernel.org (Postfix) with ESMTPSA id E9A1DC4CEF7; Mon, 29 Dec 2025 16:22:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1767025321; bh=4oz0nAfBixP1Him6illebraudrhy6n17ima7weSk8aQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=rSidqGirQ/JwWlhECSqODybHSs0vDxoT3AMeOY39qXekB27d1WKEBGfmZJdzA6Lfb 3STbOK5qHyoPW4esIxGbi6lyZpq7Auwb/coNbBn4ZneXwRScSEm6js/9mbVbclW3uE NjQOBJWOD5Xj3TnviM6iFLmawtuQr03AIgCHx8vs= From: Greg Kroah-Hartman To: stable@vger.kernel.org Cc: Greg Kroah-Hartman , patches@lists.linux.dev, SRINIVASAN.SHANMUGAM@amd.com, vitaly.prosyak@amd.com, christian.koenig@amd.com, Matthew Brost , Srinivasan Shanmugam , Lijo Lazar , Alex Deucher , Sasha Levin Subject: [PATCH 6.18 129/430] drm/amdgpu: fix a job->pasid access race in gpu recovery Date: Mon, 29 Dec 2025 17:08:51 +0100 Message-ID: <20251229160729.113000800@linuxfoundation.org> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20251229160724.139406961@linuxfoundation.org> References: <20251229160724.139406961@linuxfoundation.org> User-Agent: quilt/0.69 X-stable: review X-Patchwork-Hint: ignore Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 6.18-stable review patch. If anyone has any objections, please let me know. ------------------ From: Alex Deucher [ Upstream commit 77f73253015cbc7893fca1821ac3eae9eb4bc943 ] Avoid a possible UAF in GPU recovery due to a race between the sched timeout callback and the tdr work queue. The gpu recovery function calls drm_sched_stop() and later drm_sched_start(). drm_sched_start() restarts the tdr queue which will eventually free the job. If the tdr queue frees the job before time out callback completes, the job will be freed and we'll get a UAF when accessing the pasid. Cache it early to avoid the UAF. Example KASAN trace: [ 493.058141] BUG: KASAN: slab-use-after-free in amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.067530] Read of size 4 at addr ffff88b0ce3f794c by task kworker/u128:1/323 [ 493.074892] [ 493.076485] CPU: 9 UID: 0 PID: 323 Comm: kworker/u128:1 Tainted: G E 6.16.0-1289896.2.zuul.bf4f11df81c1410bbe901c4373305a31 #1 PREEMPT(voluntary) [ 493.076493] Tainted: [E]=UNSIGNED_MODULE [ 493.076495] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019 [ 493.076500] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched] [ 493.076512] Call Trace: [ 493.076515] [ 493.076518] dump_stack_lvl+0x64/0x80 [ 493.076529] print_report+0xce/0x630 [ 493.076536] ? _raw_spin_lock_irqsave+0x86/0xd0 [ 493.076541] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 493.076545] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077253] kasan_report+0xb8/0xf0 [ 493.077258] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077965] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.078672] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu] [ 493.079378] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu] [ 493.080111] amdgpu_job_timedout+0x642/0x1400 [amdgpu] [ 493.080903] ? pick_task_fair+0x24e/0x330 [ 493.080910] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu] [ 493.081702] ? _raw_spin_lock+0x75/0xc0 [ 493.081708] ? __pfx__raw_spin_lock+0x10/0x10 [ 493.081712] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched] [ 493.081721] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081725] process_one_work+0x679/0xff0 [ 493.081732] worker_thread+0x6ce/0xfd0 [ 493.081736] ? __pfx_worker_thread+0x10/0x10 [ 493.081739] kthread+0x376/0x730 [ 493.081744] ? __pfx_kthread+0x10/0x10 [ 493.081748] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081751] ? __pfx_kthread+0x10/0x10 [ 493.081755] ret_from_fork+0x247/0x330 [ 493.081761] ? __pfx_kthread+0x10/0x10 [ 493.081764] ret_from_fork_asm+0x1a/0x30 [ 493.081771] Fixes: a72002cb181f ("drm/amdgpu: Make use of drm_wedge_task_info") Link: https://github.com/HansKristian-Work/vkd3d-proton/pull/2670 Cc: SRINIVASAN.SHANMUGAM@amd.com Cc: vitaly.prosyak@amd.com Cc: christian.koenig@amd.com Suggested-by: Matthew Brost Reviewed-by: Srinivasan Shanmugam Reviewed-by: Lijo Lazar Reviewed-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit 20880a3fd5dd7bca1a079534cf6596bda92e107d) Signed-off-by: Sasha Levin --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 96b6738e6252..843770e61e42 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -6476,6 +6476,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, struct amdgpu_hive_info *hive = NULL; int r = 0; bool need_emergency_restart = false; + /* save the pasid here as the job may be freed before the end of the reset */ + int pasid = job ? job->pasid : -EINVAL; /* * If it reaches here because of hang/timeout and a RAS error is @@ -6572,8 +6574,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, if (!r) { struct amdgpu_task_info *ti = NULL; - if (job) - ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid); + /* + * The job may already be freed at this point via the sched tdr workqueue so + * use the cached pasid. + */ + if (pasid >= 0) + ti = amdgpu_vm_get_task_info_pasid(adev, pasid); drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, ti ? &ti->task : NULL); -- 2.51.0