From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4E2958F57 for ; Mon, 13 Feb 2023 15:00:10 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C814DC4339E; Mon, 13 Feb 2023 15:00:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1676300410; bh=AfAgJSPFfNm29Np0MSgLD3bVzgcEPT7DGf37ZfOqqGk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=dza8Zs/m85qy0L6ZkOiA4gHH5ngDsJXuvCx4t3Fa9lxiw3WA+tcLkF0Dui2hYw20G IypzcziLREUuXOTWpDyDUOAjvJPkMUtio5yIHvaBrZxhAHfJjl4vVysg/bIsRH8Lp0 sNx2hvQUoIeYS8Kqtyb1DBjii4dzNOS5sz5faR3E= From: Greg Kroah-Hartman To: stable@vger.kernel.org Cc: Greg Kroah-Hartman , patches@lists.linux.dev, =?UTF-8?q?Christian=20K=C3=B6nig?= , Guchun Chen , Luben Tuikov , Mario Limonciello , "Guilherme G. Piccoli" , Alex Deucher Subject: [PATCH 5.15 64/67] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Date: Mon, 13 Feb 2023 15:49:45 +0100 Message-Id: <20230213144735.441231203@linuxfoundation.org> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230213144732.336342050@linuxfoundation.org> References: <20230213144732.336342050@linuxfoundation.org> User-Agent: quilt/0.67 Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From: Guilherme G. Piccoli commit 5ad7bbf3dba5c4a684338df1f285080f2588b535 upstream. Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - such function is expected to be called only after the respective init function - drm_sched_init() - was executed successfully. Happens that we faced a driver probe failure in the Steam Deck recently, and the function drm_sched_fini() was called even without its counter-part had been previously called, causing the following oops: amdgpu: probe of 0000:04:00.0 failed with error -110 BUG: kernel NULL pointer dereference, address: 0000000000000090 PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] devm_drm_dev_init_release+0x49/0x70 [...] To prevent that, check if the drm_sched was properly initialized for a given ring before calling its fini counter-part. Notice ideally we'd use sched.ready for that; such field is set as the latest thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such field - in the above oops for example, it was a GFX ring causing the crash, and the sched.ready field was set to true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using sched.ops as per Christian's suggestion [0], and also removed the no_scheduler check [1]. [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/ [1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/ Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") Suggested-by: Christian König Cc: Guchun Chen Cc: Luben Tuikov Cc: Mario Limonciello Reviewed-by: Luben Tuikov Signed-off-by: Guilherme G. Piccoli Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -579,7 +579,13 @@ void amdgpu_fence_driver_sw_fini(struct if (!ring || !ring->fence_drv.initialized) continue; - if (!ring->no_scheduler) + /* + * Notice we check for sched.ops since there's some + * override on the meaning of sched.ready by amdgpu. + * The natural check would be sched.ready, which is + * set as drm_sched_init() finishes... + */ + if (ring->sched.ops) drm_sched_fini(&ring->sched); for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)