[PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

dri-devel.lists.freedesktop.org archive mirror

* [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
@ 2025-01-07 14:02 Philipp Reisner
  2025-01-07 14:08 ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Reisner @ 2025-01-07 14:02 UTC (permalink / raw)
  To: dri-devel
  Cc: linux-kernel, Christian König, Nirmoy Das, Simona Vetter,
	Philipp Reisner

The following OOPS plagues me on about every 10th suspend and resume:

[160640.791304] BUG: kernel NULL pointer dereference, address: 0000000000000008
[160640.791309] #PF: supervisor read access in kernel mode
[160640.791311] #PF: error_code(0x0000) - not-present page
[160640.791313] PGD 0 P4D 0
[160640.791316] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[160640.791320] CPU: 12 UID: 1001 PID: 648526 Comm: kscreenloc:cs0 Tainted: G           OE      6.11.7-300.fc41.x86_64 #1
[160640.791324] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[160640.791325] Hardware name: Micro-Star International Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.B0 02/03/2021
[160640.791327] RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
[160640.791337] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 31 39 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
[160640.791340] RSP: 0018:ffffb2ef5e6cb9b8 EFLAGS: 00010206
[160640.791342] RAX: 0000000000000000 RBX: ffff9d804cc62800 RCX: ffff9d784020f0d0
[160640.791344] RDX: 0000000000000000 RSI: ffff9d784d3b9cd0 RDI: ffff9d784020f638
[160640.791345] RBP: ffff9d784020f610 R08: ffff9d78414e4268 R09: 2072656c75646568
[160640.791346] R10: 686373205d6d7264 R11: 632072656c756465 R12: 0000000000000000
[160640.791347] R13: 0000000000000001 R14: ffffb2ef5e6cba38 R15: 0000000000000000
[160640.791349] FS:  00007f8f30aca6c0(0000) GS:ffff9d873ec00000(0000) knlGS:0000000000000000
[160640.791351] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[160640.791352] CR2: 0000000000000008 CR3: 000000069de82000 CR4: 0000000000350ef0
[160640.791354] Call Trace:
[160640.791357]  <TASK>
[160640.791360]  ? __die_body.cold+0x19/0x27
[160640.791367]  ? page_fault_oops+0x15a/0x2f0
[160640.791372]  ? exc_page_fault+0x7e/0x180
[160640.791376]  ? asm_exc_page_fault+0x26/0x30
[160640.791380]  ? drm_sched_job_arm+0x23/0x60 [gpu_sched]
[160640.791384]  ? drm_sched_job_arm+0x1f/0x60 [gpu_sched]
[160640.791390]  amdgpu_cs_ioctl+0x170c/0x1e40 [amdgpu]
[160640.792011]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[160640.792341]  drm_ioctl_kernel+0xb0/0x100
[160640.792346]  drm_ioctl+0x28b/0x540
[160640.792349]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[160640.792673]  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[160640.792994]  __x64_sys_ioctl+0x94/0xd0
[160640.792999]  do_syscall_64+0x82/0x160
[160640.793006]  ? __count_memcg_events+0x75/0x130
[160640.793009]  ? count_memcg_events.constprop.0+0x1a/0x30
[160640.793014]  ? handle_mm_fault+0x21b/0x330
[160640.793016]  ? do_user_addr_fault+0x55a/0x7b0
[160640.793020]  ? exc_page_fault+0x7e/0x180
[160640.793023]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The OOPS happens because the rq member of entity is NULL in
drm_sched_job_arm() after the call to drm_sched_entity_select_rq().

In drm_sched_entity_select_rq(), the code considers that
drm_sched_pick_best() might return a NULL value. When NULL, it assigns
NULL to entity->rq even if it had a non-NULL value before.

drm_sched_job_arm() does not deal with entities having a rq of NULL.

Fix this by leaving the entity on the engine it was on instead of
assigning NULL to its run queue member.

Link: https://retrace.fedoraproject.org/faf/reports/1038619/
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index a75eede8bf8d..495bc087588b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -557,10 +557,12 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 
 	spin_lock(&entity->rq_lock);
 	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
-	rq = sched ? sched->sched_rq[entity->priority] : NULL;
-	if (rq != entity->rq) {
-		drm_sched_rq_remove_entity(entity->rq, entity);
-		entity->rq = rq;
+	if (sched) {
+		rq = sched->sched_rq[entity->priority];
+		if (rq != entity->rq) {
+			drm_sched_rq_remove_entity(entity->rq, entity);
+			entity->rq = rq;
+		}
 	}
 	spin_unlock(&entity->rq_lock);
 
-- 
2.47.1
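
Illustrative note, not part of the patch: the behavioural difference can
be sketched with a small userspace model. All model_* names are
hypothetical stand-ins, locking and the drm_sched_rq_remove_entity() call
are left out, and the model simply picks the first ready scheduler, which
is a simplification of drm_sched_pick_best().

```c
#include <stddef.h>

/* Hypothetical stand-ins; only the fields the patch touches are modelled. */
struct model_rq { int id; };

struct model_sched {
	int ready;              /* models the scheduler's ready flag */
	struct model_rq *rq;    /* stands in for sched->sched_rq[priority] */
};

struct model_entity {
	struct model_rq *rq;
};

/* Simplified stand-in for drm_sched_pick_best(): first ready one, or NULL. */
struct model_sched *model_pick_best(struct model_sched **list, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (list[i]->ready)
			return list[i];
	return NULL;
}

/* Before the patch: a NULL pick overwrites a previously valid entity->rq. */
void select_rq_before(struct model_entity *e,
		      struct model_sched **list, size_t n)
{
	struct model_sched *sched = model_pick_best(list, n);
	struct model_rq *rq = sched ? sched->rq : NULL;

	if (rq != e->rq)
		e->rq = rq;
}

/* With the patch: a NULL pick leaves entity->rq untouched. */
void select_rq_after(struct model_entity *e,
		     struct model_sched **list, size_t n)
{
	struct model_sched *sched = model_pick_best(list, n);

	if (sched) {
		struct model_rq *rq = sched->rq;

		if (rq != e->rq)
			e->rq = rq;
	}
}
```

With a single scheduler that is not ready, select_rq_before() leaves the
entity with rq == NULL, while select_rq_after() keeps the old run queue.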



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-07 14:02 [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume Philipp Reisner
@ 2025-01-07 14:08 ` Christian König
  2025-01-07 15:21   ` Philipp Reisner
  2025-01-08 14:26   ` Alex Deucher
  0 siblings, 2 replies; 17+ messages in thread
From: Christian König @ 2025-01-07 14:08 UTC (permalink / raw)
  To: Philipp Reisner, dri-devel; +Cc: linux-kernel, Nirmoy Das, Simona Vetter

On 07.01.25 at 15:02, Philipp Reisner wrote:
> The following OOPS plagues me on about every 10th suspend and resume:
>
> [...]
>
> The OOPS happens because the rq member of entity is NULL in
> drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
>
> In drm_sched_entity_select_rq(), the code considers that
> drm_sched_pick_best() might return a NULL value. When NULL, it assigns
> NULL to entity->rq even if it had a non-NULL value before.
>
> drm_sched_job_arm() does not deal with entities having a rq of NULL.
>
> Fix this by leaving the entity on the engine it was instead of
> assigning a NULL to its run queue member.

Well that is clearly not the correct approach to fixing this. So clearly 
a NAK from my side.

The real question is why is amdgpu_cs_ioctl() called when all of 
userspace should be frozen?

Regards,
Christian.

>
> Link: https://retrace.fedoraproject.org/faf/reports/1038619/
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index a75eede8bf8d..495bc087588b 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -557,10 +557,12 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>   
>   	spin_lock(&entity->rq_lock);
>   	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> -	rq = sched ? sched->sched_rq[entity->priority] : NULL;
> -	if (rq != entity->rq) {
> -		drm_sched_rq_remove_entity(entity->rq, entity);
> -		entity->rq = rq;
> +	if (sched) {
> +		rq = sched->sched_rq[entity->priority];
> +		if (rq != entity->rq) {
> +			drm_sched_rq_remove_entity(entity->rq, entity);
> +			entity->rq = rq;
> +		}
>   	}
>   	spin_unlock(&entity->rq_lock);
>   



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-07 14:08 ` Christian König
@ 2025-01-07 15:21   ` Philipp Reisner
  2025-01-08  8:19     ` Christian König
  2025-01-08 14:26   ` Alex Deucher
  1 sibling, 1 reply; 17+ messages in thread
From: Philipp Reisner @ 2025-01-07 15:21 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, linux-kernel, Simona Vetter

[...]
> > The OOPS happens because the rq member of entity is NULL in
> > drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
> >
> > In drm_sched_entity_select_rq(), the code considers that
> > drm_sched_pick_best() might return a NULL value. When NULL, it assigns
> > NULL to entity->rq even if it had a non-NULL value before.
> >
> > drm_sched_job_arm() does not deal with entities having a rq of NULL.
> >
> > Fix this by leaving the entity on the engine it was instead of
> > assigning a NULL to its run queue member.
>
> Well that is clearly not the correct approach to fixing this. So clearly
> a NAK from my side.
>
> The real question is why is amdgpu_cs_ioctl() called when all of
> userspace should be frozen?
>
> Regards,
> Christian.
>

Could the OOPS happen at resume time? Might it be that the kernel
activates user-space before all the components of the GPU have finished
their wakeup?

Maybe drm_sched_pick_best() returns NULL since no scheduler is ready yet?

Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the
suggested change improves the code. drm_sched_pick_best() can return NULL.
drm_sched_entity_select_rq() can handle the NULL (partially).

drm_sched_job_arm() crashes on an entity that has rq set to NULL.

The handling of NULL values is half-baked.

In my opinion, you should define if drm_sched_pick_best() may put a NULL into
rq. If your answer is yes, it might put a NULL there; then, there should be a
BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq().
If your answer is no, the BUG_ON() should be in drm_sched_pick_best().

That helps guys with zero domain knowledge, like me, to figure out how
this is all supposed to work.

best regards,
 Philipp
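
Illustrative note: the "yes, rq may become NULL" branch of the suggestion
above could look roughly like this in a userspace model. MODEL_BUG_ON()
and the model_* types are hypothetical stand-ins for BUG_ON() and the
scheduler structures; the real drm_sched_job_arm() does considerably more.

```c
#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's BUG_ON(): report and abort. */
#define MODEL_BUG_ON(cond)						\
	do {								\
		if (cond) {						\
			fprintf(stderr, "BUG: %s at %s:%d\n",		\
				#cond, __FILE__, __LINE__);		\
			abort();					\
		}							\
	} while (0)

/* Hypothetical, heavily reduced structures. */
struct model_rq { int id; };
struct model_entity { struct model_rq *rq; };

/*
 * Models the step right after run queue selection: the invariant
 * "entity->rq is valid" is stated explicitly instead of being discovered
 * through a NULL pointer dereference a few instructions later.
 */
int model_job_arm(struct model_entity *e)
{
	MODEL_BUG_ON(!e->rq);
	return e->rq->id;
}
```

The outcome for the submitting thread is the same as today, but the
failure names the violated assumption and its location directly.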


* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-07 15:21   ` Philipp Reisner
@ 2025-01-08  8:19     ` Christian König
  2025-01-13  8:43       ` Philipp Stanner
  0 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2025-01-08  8:19 UTC (permalink / raw)
  To: Philipp Reisner; +Cc: dri-devel, linux-kernel, Simona Vetter

On 07.01.25 at 16:21, Philipp Reisner wrote:
> [...]
>>> The OOPS happens because the rq member of entity is NULL in
>>> drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
>>>
>>> In drm_sched_entity_select_rq(), the code considers that
>>> drm_sched_pick_best() might return a NULL value. When NULL, it assigns
>>> NULL to entity->rq even if it had a non-NULL value before.
>>>
>>> drm_sched_job_arm() does not deal with entities having a rq of NULL.
>>>
>>> Fix this by leaving the entity on the engine it was instead of
>>> assigning a NULL to its run queue member.
>> Well that is clearly not the correct approach to fixing this. So clearly
>> a NAK from my side.
>>
>> The real question is why is amdgpu_cs_ioctl() called when all of
>> userspace should be frozen?
>>
>> Regards,
>> Christian.
>>
> Could the OOPS happen at resume time? Might it be that the kernel
> activates user-space before all the components of the GPU have finished
> their wakeup?
>
> Maybe drm_sched_pick_best() returns NULL since no scheduler is ready yet?

Yeah that is exactly what I meant. It looks like either the suspend or 
the resume order is somehow messed up.

In other words, either some application tries to submit GPU work while it 
should already have been stopped, or it tries to submit GPU work before 
it is started.

> Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the
> suggested change improves the code. drm_sched_pick_best() can return NULL.
> drm_sched_entity_select_rq() can handle the NULL (partially).
>
> drm_sched_job_arm() crashes on an entity that has rq set to NULL.

Which is actually not the worst outcome :)

With your patch applied we don't immediately crash any more in the 
submission path, but the whole system could then later deadlock because 
the core memory management waits for a GPU submission which never returns.

That is an even worse situation because you then can't pinpoint any more 
where that is coming from.

> The handling of NULL values is half-baked.
>
> In my opinion, you should define if drm_sched_pick_best() may put a NULL into
> rq. If your answer is yes, it might put a NULL there; then, there should be a
> BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq().
> If your answer is no, the BUG_ON() should be in drm_sched_pick_best().

Yeah good point.

We might not want a BUG_ON(), that is only justified when we prevent 
further damage (e.g. random data corruption or similar).

I suggest using a WARN(!sched, "Submission without activated scheduler!"). 
This way the system has at least a chance of survival should the 
scheduler become ready later on.

On the other hand the BUG_ON() or the NULL pointer deref should only 
kill the application thread which is submitting something before the 
driver is resumed. So that might help to pinpoint where the actual 
issue is.

Regards,
Christian.

>
> That helps guys with zero domain knowledge, like me, to figure out how
> this is all supposed to work.
>
> best regards,
>   Philipp
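
Illustrative note: a warning-based check along the lines suggested above
could be modelled in userspace as follows. model_warn() is a hypothetical
stand-in for WARN(), which likewise evaluates to the truth value of its
condition. Returning an error to the caller is an assumption of this
sketch; the thread does not settle what the submission path should do
after the warning.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, reduced structures. */
struct model_rq { int id; };
struct model_sched { struct model_rq *rq; };
struct model_entity { struct model_rq *rq; };

/* Counts emitted warnings so that callers and tests can observe them. */
int model_warn_count;

/* Stand-in for WARN(): logs when cond is true and returns cond. */
int model_warn(int cond, const char *msg)
{
	if (cond) {
		model_warn_count++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/*
 * Returns 0 on success. Returns -1 when no scheduler was picked, after
 * warning, and leaves the entity unchanged (assumption: the caller then
 * rejects the submission instead of arming the job).
 */
int model_select_rq(struct model_entity *e, struct model_sched *picked)
{
	if (model_warn(!picked, "Submission without activated scheduler!"))
		return -1;

	e->rq = picked->rq;
	return 0;
}
```

Unlike a BUG_ON(), this leaves the submitting thread alive, which matches
the "chance of survival" argument but also means every caller has to
handle the failure.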


* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-07 14:08 ` Christian König
  2025-01-07 15:21   ` Philipp Reisner
@ 2025-01-08 14:26   ` Alex Deucher
  2025-01-08 14:35     ` Christian König
  1 sibling, 1 reply; 17+ messages in thread
From: Alex Deucher @ 2025-01-08 14:26 UTC (permalink / raw)
  To: Christian König
  Cc: Philipp Reisner, dri-devel, linux-kernel, Nirmoy Das,
	Simona Vetter

On Tue, Jan 7, 2025 at 9:09 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 07.01.25 at 15:02, Philipp Reisner wrote:
> > The following OOPS plagues me on about every 10th suspend and resume:
> >
> > [...]
> >
> > The OOPS happens because the rq member of entity is NULL in
> > drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
> >
> > In drm_sched_entity_select_rq(), the code considers that
> > drm_sched_pick_best() might return a NULL value. When NULL, it assigns
> > NULL to entity->rq even if it had a non-NULL value before.
> >
> > drm_sched_job_arm() does not deal with entities having a rq of NULL.
> >
> > Fix this by leaving the entity on the engine it was instead of
> > assigning a NULL to its run queue member.
>
> Well that is clearly not the correct approach to fixing this. So clearly
> a NAK from my side.
>
> The real question is why is amdgpu_cs_ioctl() called when all of
> userspace should be frozen?
>

Could this be due to amdgpu setting sched->ready when the rings are
finished initializing from long ago rather than when the scheduler has
been armed?

Alex


> Regards,
> Christian.
>
> >
> > Link: https://retrace.fedoraproject.org/faf/reports/1038619/
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
> > Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
> >   1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index a75eede8bf8d..495bc087588b 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -557,10 +557,12 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> >
> >       spin_lock(&entity->rq_lock);
> >       sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> > -     rq = sched ? sched->sched_rq[entity->priority] : NULL;
> > -     if (rq != entity->rq) {
> > -             drm_sched_rq_remove_entity(entity->rq, entity);
> > -             entity->rq = rq;
> > +     if (sched) {
> > +             rq = sched->sched_rq[entity->priority];
> > +             if (rq != entity->rq) {
> > +                     drm_sched_rq_remove_entity(entity->rq, entity);
> > +                     entity->rq = rq;
> > +             }
> >       }
> >       spin_unlock(&entity->rq_lock);
> >
>


* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-08 14:26   ` Alex Deucher
@ 2025-01-08 14:35     ` Christian König
  2025-01-10  7:37       ` Philipp Reisner
  0 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2025-01-08 14:35 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Philipp Reisner, dri-devel, linux-kernel, Nirmoy Das,
	Simona Vetter

On 08.01.25 at 15:26, Alex Deucher wrote:
> On Tue, Jan 7, 2025 at 9:09 AM Christian König <christian.koenig@amd.com> wrote:
>> On 07.01.25 at 15:02, Philipp Reisner wrote:
>>> The following OOPS plagues me on about every 10th suspend and resume:
>>>
>>> [...]
>>>
>>> The OOPS happens because the rq member of entity is NULL in
>>> drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
>>>
>>> In drm_sched_entity_select_rq(), the code considers that
>>> drm_sched_pick_best() might return a NULL value. When NULL, it assigns
>>> NULL to entity->rq even if it had a non-NULL value before.
>>>
>>> drm_sched_job_arm() does not deal with entities having a rq of NULL.
>>>
>>> Fix this by leaving the entity on the engine it was instead of
>>> assigning a NULL to its run queue member.
>> Well that is clearly not the correct approach to fixing this. So clearly
>> a NAK from my side.
>>
>> The real question is why is amdgpu_cs_ioctl() called when all of
>> userspace should be frozen?
>>
> Could this be due to amdgpu setting sched->ready when the rings are
> finished initializing from long ago rather than when the scheduler has
> been armed?

Yes and that is absolutely intentional.

Either the driver is not done with its resume yet, or it has already 
started its suspend handler. So the scheduler backends are not started 
and so the ready flag is false.

But some userspace application still tries to submit work.

If we now waited for this work to finish we would deadlock, so 
crashing on the NULL pointer deref is actually the less bad outcome.

Christian.

>
> Alex
>
>
>> Regards,
>> Christian.
>>
>>> Link: https://retrace.fedoraproject.org/faf/reports/1038619/
>>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
>>> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
>>>    1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index a75eede8bf8d..495bc087588b 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -557,10 +557,12 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>>
>>>        spin_lock(&entity->rq_lock);
>>>        sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
>>> -     rq = sched ? sched->sched_rq[entity->priority] : NULL;
>>> -     if (rq != entity->rq) {
>>> -             drm_sched_rq_remove_entity(entity->rq, entity);
>>> -             entity->rq = rq;
>>> +     if (sched) {
>>> +             rq = sched->sched_rq[entity->priority];
>>> +             if (rq != entity->rq) {
>>> +                     drm_sched_rq_remove_entity(entity->rq, entity);
>>> +                     entity->rq = rq;
>>> +             }
>>>        }
>>>        spin_unlock(&entity->rq_lock);
>>>



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-08 14:35     ` Christian König
@ 2025-01-10  7:37       ` Philipp Reisner
  2025-01-10  8:44         ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Reisner @ 2025-01-10  7:37 UTC (permalink / raw)
  To: Christian König; +Cc: Alex Deucher, dri-devel, linux-kernel, Simona Vetter

[...]
> > Could this be due to amdgpu setting sched->ready when the rings are
> > finished initializing from long ago rather than when the scheduler has
> > been armed?
>
> Yes and that is absolutely intentional.
>
> Either the driver is not done with its resume yet, or it has already
> started its suspend handler. So the scheduler backends are not started
> and so the ready flag is false.
>
> But some userspace application still tries to submit work.
>
> If we now waited for this work to finish we would deadlock, so
> crashing on the NULL pointer deref is actually the less bad outcome.
>
> Christian.

Hi Christian,

This morning, when I woke up my workstation, I was greeted with a black
screen on which I could still move my mouse pointer. The OOPS happens at
resume time, not at suspend time:

...
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, skipping
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Jän 10 07:58:14 ryzen9 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Jän 10 07:58:14 ryzen9 kernel: #PF: supervisor read access in kernel mode
Jän 10 07:58:14 ryzen9 kernel: #PF: error_code(0x0000) - not-present page
Jän 10 07:58:14 ryzen9 kernel: PGD 0 P4D 0
Jän 10 07:58:14 ryzen9 kernel: Oops: Oops: 0000 [#2] PREEMPT SMP NOPTI
Jän 10 07:58:14 ryzen9 kernel: CPU: 2 UID: 1001 PID: 4961 Comm: chrome:cs0 Tainted: G      D    OE      6.12.5-200.fc41.x86_64 #1
Jän 10 07:58:14 ryzen9 kernel: Tainted: [D]=DIE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Jän 10 07:58:14 ryzen9 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.B0 02/03/2021
Jän 10 07:58:14 ryzen9 kernel: RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
Jän 10 07:58:14 ryzen9 kernel: Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 e1 38 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
Jän 10 07:58:14 ryzen9 kernel: RSP: 0018:ffffa52510cf7758 EFLAGS: 00010206
...

Can we conclude that "the driver is not yet done with its resume"?
Can you point me to where I could add instrumentation code to dig deeper?

Thanks,
 Philipp
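
Illustrative note: both traces report the fault address 0x8. The marked
instruction bytes `<48> 8b 50 08` decode to a load from 8(%rax), and the
first trace shows RAX = 0, so the CPU read offset 8 from a NULL pointer.
The sketch below shows how such an address arises. The struct layout is
an assumption made for illustration (a 4-byte lock word followed by a
pointer) and is not copied from the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

struct model_sched;

/* Hypothetical layout; the field order and sizes are assumptions. */
struct model_rq {
	uint32_t lock;              /* assumed 4-byte lock word */
	struct model_sched *sched;  /* pointer, naturally aligned */
};

/*
 * Address that a load of rq->sched touches when rq is NULL: the base
 * address 0 plus the offset of the member within the structure.
 */
size_t fault_address_for_null_rq(void)
{
	return offsetof(struct model_rq, sched);
}
```

On a 64-bit target the pointer member is aligned to 8 bytes, so the
function returns 8, the same value the kernel reports in CR2.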


* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-10  7:37       ` Philipp Reisner
@ 2025-01-10  8:44         ` Christian König
  2025-01-10 14:32           ` Philipp Reisner
  0 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2025-01-10  8:44 UTC (permalink / raw)
  To: Philipp Reisner; +Cc: Alex Deucher, dri-devel, linux-kernel, Simona Vetter

On 10.01.25 at 08:37, Philipp Reisner wrote:
> [...]
>>> Could this be due to amdgpu setting sched->ready when the rings are
>>> finished initializing from long ago rather than when the scheduler has
>>> been armed?
>> Yes and that is absolutely intentional.
>>
>> Either the driver is not done with its resume yet, or it has already
>> started its suspend handler. So the scheduler backends are not started
>> and so the ready flag is false.
>>
>> But some userspace application still tries to submit work.
>>
>> If we now waited for this work to finish we would deadlock, so
>> crashing on the NULL pointer deref is actually the less bad outcome.
>>
>> Christian.
> Hi Christian,
>
> Today in the morning, when I woke up my workstation, I was greeted
> with a black screen, on which I still could move my mouse pointer. The
> OOPS happens at resume time, not at suspend time:
>
> [...]
>
> Can we conclude that "the driver is not yet ready with it's resume"?

Take a look at those messages right before the crash:

Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, 
skipping
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, 
skipping

That is basically 100% certain confirmation that an application tries to
use the device before those compute queues are resumed.
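
For readers following along: the "not ready, skipping" lines are printed
by the scheduler selection step. The following is a minimal userspace
sketch of that selection logic, not the kernel source. struct model_sched
and model_pick_best() are hypothetical stand-ins for struct
drm_gpu_scheduler and drm_sched_pick_best(), and the field layout is an
assumption made for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-in for struct drm_gpu_scheduler; not the kernel layout. */
struct model_sched {
	const char *name;
	bool ready;		/* false while the ring has not come up */
	unsigned int score;	/* current load; lower is better */
};

/*
 * Return the least-loaded ready scheduler, or NULL if none is ready.
 * A caller that stores the result without checking it ends up with a
 * NULL run queue.
 */
static struct model_sched *
model_pick_best(struct model_sched **list, size_t n)
{
	struct model_sched *picked = NULL;
	unsigned int min_score = UINT_MAX;

	for (size_t i = 0; i < n; i++) {
		struct model_sched *s = list[i];

		if (!s->ready) {
			fprintf(stderr, "scheduler %s is not ready, skipping\n",
				s->name);
			continue;
		}
		if (s->score < min_score) {
			min_score = s->score;
			picked = s;
		}
	}
	return picked;
}
```

If every scheduler in the list is marked not ready, the function returns
NULL, which corresponds to the case where entity->rq ends up NULL.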

Can I have a full dmesg? Maybe the resume is canceled or aborted for 
some reason.

Or we have a bug in the driver where we forget to resume the compute
queues and only do that on demand later on, or something like that.

> Can you point me to where I could add instrumentation code to dig deeper?

Let me take a look at the full dmesg; something really fishy is going on
here.

Thanks,
Christian.

>
> Thanks,
>   Philipp



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-10  8:44         ` Christian König
@ 2025-01-10 14:32           ` Philipp Reisner
  2025-01-10 14:47             ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Reisner @ 2025-01-10 14:32 UTC (permalink / raw)
  To: Christian König; +Cc: Alex Deucher, dri-devel, linux-kernel, Simona Vetter

[...]
> Take a look at those messages right before the crash:
>
> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
> skipping
> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
> skipping
>
> That is basically a 100% certain confirm that an application tries to
> use the device before before those compute queues are resumed.
>
> Can I have a full dmesg? Maybe the resume is canceled or aborted for
> some reason.
>

Yes, of course. I have made the files available here:
https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa

best regards,
 Philipp


* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-10 14:32           ` Philipp Reisner
@ 2025-01-10 14:47             ` Christian König
  2025-01-10 15:10               ` Alex Deucher
  0 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2025-01-10 14:47 UTC (permalink / raw)
  To: Philipp Reisner; +Cc: Alex Deucher, dri-devel, linux-kernel, Simona Vetter

On 10.01.25 at 15:32, Philipp Reisner wrote:
> [...]
>> Take a look at those messages right before the crash:
>>
>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
>> skipping
>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
>> skipping
>>
>> That is basically a 100% certain confirm that an application tries to
>> use the device before before those compute queues are resumed.
>>
>> Can I have a full dmesg? Maybe the resume is canceled or aborted for
>> some reason.
>>
> Yes, of course. I have made the files available here:
> https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa

Ah! That suddenly makes much more sense.

Here is the root cause:

[111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper 
[amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
[111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper 
[amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
[111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper 
[amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
[111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper 
[amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
[111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper 
[amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
[111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper 
[amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
[111315.207293] [drm] UVD and UVD ENC initialized successfully.
[111315.308270] [drm] VCE initialized successfully.
[111315.447494] PM: resume devices took 2.306 seconds
[111315.447865] OOM killer enabled.

I'm surprised that this works at all. For some reason the graphics queue 
works, but the compute queues fail to resume.
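
A note on the (-110) in the ring test messages: it is a negative errno
value, and in Linux's generic errno numbering 110 is ETIMEDOUT, i.e. the
ring tests timed out. A small userspace sketch for decoding such values
follows; the helper names are made up for illustration.

```c
#include <errno.h>
#include <string.h>

/* Return a human-readable description of a kernel-style negative errno. */
static const char *describe_kernel_error(int err)
{
	return strerror(err < 0 ? -err : err);
}

/* True if a ring test result is a timeout (-ETIMEDOUT). */
static int is_ring_test_timeout(int err)
{
	return err == -ETIMEDOUT;
}
```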

@Alex what do we do about that? We could return an error when not all
rings come up again after resume, but that will probably result in a
number of complaints.

Regards,
Christian.


>
> best regards,
>   Philipp



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-10 14:47             ` Christian König
@ 2025-01-10 15:10               ` Alex Deucher
  2025-01-13  8:32                 ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Deucher @ 2025-01-10 15:10 UTC (permalink / raw)
  To: Christian König
  Cc: Philipp Reisner, dri-devel, linux-kernel, Simona Vetter

On Fri, Jan 10, 2025 at 9:48 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 10.01.25 um 15:32 schrieb Philipp Reisner:
> > [...]
> >> Take a look at those messages right before the crash:
> >>
> >> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
> >> skipping
> >> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
> >> skipping
> >>
> >> That is basically a 100% certain confirm that an application tries to
> >> use the device before before those compute queues are resumed.
> >>
> >> Can I have a full dmesg? Maybe the resume is canceled or aborted for
> >> some reason.
> >>
> > Yes, of course. I have made the files available here:
> > https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa
>
> Ah! That suddenly makes much more sense.
>
> Here is the root cause:
>
> [111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
> [111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
> [111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
> [111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
> [111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
> [111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
> [111315.207293] [drm] UVD and UVD ENC initialized successfully.
> [111315.308270] [drm] VCE initialized successfully.
> [111315.447494] PM: resume devices took 2.306 seconds
> [111315.447865] OOM killer enabled.
>
> I'm surprised that this works at all. For some reason the graphics queue
> works, but the compute queues fail to resume.
>
> @Alex what do we do about that? We could return an error when not all
> rings come up again after resume, but that will probably result in a
> number of complains.

Maybe return an error if all of the rings of a particular type fail,
but if only some do, we should be able to deal with that.  We
currently set up 8 compute rings.  We probably don't need that many.
Maybe just two (high and low priority).
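
The policy suggested above (fail resume only if every ring of some type
failed) can be sketched in plain C. This is an illustration under stated
assumptions, not amdgpu code: the ring types, struct model_ring and
model_resume_result() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_ring_type {
	MODEL_RING_GFX,
	MODEL_RING_COMPUTE,
	MODEL_RING_TYPE_COUNT
};

struct model_ring {
	enum model_ring_type type;
	bool test_ok;	/* result of the ring test after resume */
};

/*
 * Return 0 if every ring type that has rings kept at least one working
 * ring, -1 if some type lost all of its rings.
 */
static int model_resume_result(const struct model_ring *rings, size_t n)
{
	size_t total[MODEL_RING_TYPE_COUNT] = { 0 };
	size_t ok[MODEL_RING_TYPE_COUNT] = { 0 };

	for (size_t i = 0; i < n; i++) {
		total[rings[i].type]++;
		if (rings[i].test_ok)
			ok[rings[i].type]++;
	}
	for (int t = 0; t < MODEL_RING_TYPE_COUNT; t++) {
		if (total[t] > 0 && ok[t] == 0)
			return -1;
	}
	return 0;
}
```

The log quoted above lists six failing compute rings. If the remaining
compute rings passed their test, this policy would still report success,
so the submission path would also have to cope with the rings that are
not ready.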

Alex

>
> Regards,
> Christian.
>
>
> >
> > best regards,
> >   Philipp
>


* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-10 15:10               ` Alex Deucher
@ 2025-01-13  8:32                 ` Christian König
  0 siblings, 0 replies; 17+ messages in thread
From: Christian König @ 2025-01-13  8:32 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Philipp Reisner, dri-devel, linux-kernel, Simona Vetter

On 10.01.25 at 16:10, Alex Deucher wrote:
> On Fri, Jan 10, 2025 at 9:48 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 10.01.25 um 15:32 schrieb Philipp Reisner:
>>> [...]
>>>> Take a look at those messages right before the crash:
>>>>
>>>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
>>>> skipping
>>>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
>>>> skipping
>>>>
>>>> That is basically a 100% certain confirm that an application tries to
>>>> use the device before before those compute queues are resumed.
>>>>
>>>> Can I have a full dmesg? Maybe the resume is canceled or aborted for
>>>> some reason.
>>>>
>>> Yes, of course. I have made the files available here:
>>> https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa
>> Ah! That suddenly makes much more sense.
>>
>> Here is the root cause:
>>
>> [111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
>> [111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
>> [111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
>> [111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
>> [111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
>> [111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
>> [111315.207293] [drm] UVD and UVD ENC initialized successfully.
>> [111315.308270] [drm] VCE initialized successfully.
>> [111315.447494] PM: resume devices took 2.306 seconds
>> [111315.447865] OOM killer enabled.
>>
>> I'm surprised that this works at all. For some reason the graphics queue
>> works, but the compute queues fail to resume.
>>
>> @Alex what do we do about that? We could return an error when not all
>> rings come up again after resume, but that will probably result in a
>> number of complains.
> Maybe return an error if all of the rings of a particular type fail,
> but if only some do, we should be able to deal with that.  We
> currently set up 8 compute rings.  We probably don't need that many.
> Maybe just two (high and low priority).

Reducing the number of queues would make the problem even more severe
instead of helping, since you then have an even smaller chance of
successfully resuming.

Currently we don't abort resume when the compute queues don't resume, 
but this leads to a crash later on.

The issue is that when we start to abort resume, the end-user experience
doesn't really improve; we just avoid the crash.

Either we need to tell Mesa to stop using the compute queues by default
(what is that good for anyway?) or we need to get the compute queues
working reliably after a resume.

Christian.

>
> Alex
>
>> Regards,
>> Christian.
>>
>>
>>> best regards,
>>>    Philipp



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-08  8:19     ` Christian König
@ 2025-01-13  8:43       ` Philipp Stanner
  2025-01-13  9:55         ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Stanner @ 2025-01-13  8:43 UTC (permalink / raw)
  To: Christian König, Philipp Reisner
  Cc: dri-devel, linux-kernel, Simona Vetter, Danilo Krummrich,
	Philipp Stanner

+cc Danilo
+cc myself

On Wed, 2025-01-08 at 09:19 +0100, Christian König wrote:
> Am 07.01.25 um 16:21 schrieb Philipp Reisner:
> > [...]
> > > > The OOPS happens because the rq member of entity is NULL in
> > > > drm_sched_job_arm() after the call to
> > > > drm_sched_entity_select_rq().
> > > > 
> > > > In drm_sched_entity_select_rq(), the code considers that
> > > > drm_sched_pick_best() might return a NULL value. When NULL, it
> > > > assigns
> > > > NULL to entity->rq even if it had a non-NULL value before.
> > > > 
> > > > drm_sched_job_arm() does not deal with entities having a rq of
> > > > NULL.
> > > > 
> > > > Fix this by leaving the entity on the engine it was instead of
> > > > assigning a NULL to its run queue member.
> > > Well that is clearly not the correct approach to fixing this. So
> > > clearly
> > > a NAK from my side.
> > > 
> > > The real question is why is amdgpu_cs_ioctl() called when all of
> > > userspace should be frozen?
> > > 
> > > Regards,
> > > Christian.
> > > 
> > Could the OOPS happen at resume time? Might it be that the kernel
> > activates user-space
> > before all the components of the GPU finished their wakeup?
> > 
> > Maybe drm_sched_pick_best() returns NULL since no scheduler is
> > ready yet?
> 
> Yeah that is exactly what I meant. It looks like either the suspend
> or 
> the resume order is somehow messed up.
> 
> In other words either some application tries to submit GPU work while
> it 
> should already been stopped, or it tries to submit GPU work before it
> is 
> started.
> 
> > Apart from whether amdgpu_cs_ioctl() should run at this point, I
> > still think the
> > suggested change improves the code. drm_sched_pick_best() can
> > return NULL.
> > drm_sched_entity_select_rq() can handle the NULL (partially).
> > 
> > drm_sched_job_arm() crashes on an entity that has rq set to NULL.
> 
> Which is actually not the worst outcome :)
> 
> With your patch applied we don't immediately crash any more in the 
> submission path, but the whole system could then later deadlock
> because 
> the core memory management waits for a GPU submission which never
> returns.
> 
> That is an even worse situation because you then can't pinpoint any
> more 
> where that is coming from.
> 
> > The handling of NULL values is half-baked.
> > 
> > In my opinion, you should define if drm_sched_pick_best() may put a
> > NULL into
> > rq. If your answer is yes, it might put a NULL there; then, there
> > should be a
> > BUG_ON(!entity->rq) after the invocation of
> > drm_sched_entity_select_rq().
> > If your answer is no, the BUG_ON() should be in
> > drm_sched_pick_best().
> 
> Yeah good point.
> 
> We might not want a BUG_ON(), that is only justified when we prevent 
> further damage (e.g. random data corruption or similar).
> 
> I suggest using a WARN(!sched, "Submission without activated
> scheduler!").
> This way the system has at least a chance of survival should the 
> scheduler become ready later on.
> 
> On the other hand the BUG_ON() or the NULL pointer deref should only 
> kill the application thread which is submitting something before the 
> driver is resumed. So that might help to pinpoint where the actually 
> issue is.

As I see it, the BUG_ON() would just be a prettier NULL pointer
deref. If we agree that this is effectively a misuse of the scheduler
API, we probably want to add it to make it prettier, though?
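
To make the idea of an explicit check concrete, here is a userspace
sketch. It is not a proposed patch: the struct layouts and
model_job_arm_checked() are hypothetical, and the real
drm_sched_job_arm() returns void, so the error return is an assumption
made purely for illustration.

```c
#include <errno.h>
#include <stddef.h>

struct model_rq {
	int sched_id;
};

struct model_entity {
	struct model_rq *rq;	/* may be NULL after run queue selection */
};

struct model_job {
	struct model_entity *entity;
	int sched_id;
};

/*
 * Hypothetical checked variant of the arm step: report the API misuse
 * instead of dereferencing a NULL run queue.
 */
static int model_job_arm_checked(struct model_job *job)
{
	struct model_rq *rq = job->entity->rq;

	if (!rq)
		return -ENODEV;	/* no ready scheduler was selected */

	job->sched_id = rq->sched_id;
	return 0;
}
```

As noted earlier in the thread, a check like this only makes the failure
explicit. It does not by itself prevent a later deadlock if something
keeps waiting for a submission that never completes.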

@Philipp:
BTW, I only just discovered this thread by coincidence. Please use
get_maintainer. The scheduler currently has 4 maintainers, and none of
them is on CC.

Thanks,
P.

> 
> Regards,
> Christian.
> 
> > 
> > That helps guys with zero domain knowledge, like me, to figure out
> > how
> > this is all
> > supposed to work.
> > 
> > best regards,
> >   Philipp
> 



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-13  8:43       ` Philipp Stanner
@ 2025-01-13  9:55         ` Christian König
  2025-05-28  9:55           ` Christopher Snowhill
  0 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2025-01-13  9:55 UTC (permalink / raw)
  To: Philipp Stanner, Philipp Reisner
  Cc: dri-devel, linux-kernel, Simona Vetter, Danilo Krummrich,
	Philipp Stanner


On 13.01.25 at 09:43, Philipp Stanner wrote:
> [SNIP]
>>> The handling of NULL values is half-baked.
>>>
>>> In my opinion, you should define if drm_sched_pick_best() may put a
>>> NULL into
>>> rq. If your answer is yes, it might put a NULL there; then, there
>>> should be a
>>> BUG_ON(!entity->rq) after the invocation of
>>> drm_sched_entity_select_rq().
>>> If your answer is no, the BUG_ON() should be in
>>> drm_sched_pick_best().
>> Yeah good point.
>>
>> We might not want a BUG_ON(), that is only justified when we prevent
>> further damage (e.g. random data corruption or similar).
>>
>> I suggest using a WARN(!sched, "Submission without activated
>> scheduler!").
>> This way the system has at least a chance of survival should the
>> scheduler become ready later on.
>>
>> On the other hand the BUG_ON() or the NULL pointer deref should only
>> kill the application thread which is submitting something before the
>> driver is resumed. So that might help to pinpoint where the actually
>> issue is.
> As I see it the BUG_ON() would just be a more pretty NULL pointer
> deref. If we agree that this is effectively a misuse of the scheduler
> API we probably want to add it to make it more pretty, though?

The only alternative I can see is that the scheduler API gracefully
handles submits to non-ready schedulers. E.g. that
drm_sched_entity_push_job() detects this condition and, instead of
pushing the job, sets an error code and signals the fences.

But that might not be a good idea.

It just moves the crash from one place to another, and in general I fully
agree the driver is misusing the scheduler API to do something which
won't work and could potentially crash the whole system.
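
For illustration, the alternative mentioned above (complete the fence
with an error instead of queueing the job) could look roughly like this
as a userspace model. All types and model_push_job() are hypothetical;
in the kernel the two fence steps would presumably map to
dma_fence_set_error() followed by dma_fence_signal(), but where exactly
that would happen is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_fence {
	bool signaled;
	int error;
};

struct model_backend {
	bool ready;
	size_t queued;	/* number of jobs handed to the backend */
};

struct model_job {
	struct model_fence finished;
};

/*
 * Hypothetical push step: if the backend is missing or not ready,
 * complete the job's fence with an error so that waiters wake up;
 * otherwise queue the job.
 */
static void model_push_job(struct model_backend *be, struct model_job *job)
{
	if (!be || !be->ready) {
		job->finished.error = -ENODEV;
		job->finished.signaled = true;
		return;
	}
	be->queued++;
}
```

In this model a waiter on the fence would see it signaled with an error
rather than block forever, but the rejected work is still lost.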

> @Philipp:
> BTW, I only just discovered this thread by coincidence. Please use
> get_maintainer. The scheduler currently has 4 maintainers, and none of
> them is on CC.

Oh, good point. I was already wondering why nobody else commented and
didn't realize that nobody was on CC.

Thanks,
Christian.

>
> Danke,
> P.
>
>> Regards,
>> Christian.
>>
>>> That helps guys with zero domain knowledge, like me, to figure out
>>> how
>>> this is all
>>> supposed to work.
>>>
>>> best regards,
>>>    Philipp



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-01-13  9:55         ` Christian König
@ 2025-05-28  9:55           ` Christopher Snowhill
  2025-06-02 10:25             ` Philipp Reisner
  0 siblings, 1 reply; 17+ messages in thread
From: Christopher Snowhill @ 2025-05-28  9:55 UTC (permalink / raw)
  To: Christian König, Philipp Stanner, Philipp Reisner
  Cc: dri-devel, linux-kernel, Simona Vetter, Danilo Krummrich,
	Philipp Stanner, dri-devel

On Mon Jan 13, 2025 at 1:55 AM PST, Christian König wrote:
> Am 13.01.25 um 09:43 schrieb Philipp Stanner:
>> [SNIP]
>>>> The handling of NULL values is half-baked.
>>>>
>>>> In my opinion, you should define if drm_sched_pick_best() may put a
>>>> NULL into
>>>> rq. If your answer is yes, it might put a NULL there; then, there
>>>> should be a
>>>> BUG_ON(!entity->rq) after the invocation of
>>>> drm_sched_entity_select_rq().
>>>> If your answer is no, the BUG_ON() should be in
>>>> drm_sched_pick_best().
>>> Yeah good point.
>>>
>>> We might not want a BUG_ON(), that is only justified when we prevent
>>> further damage (e.g. random data corruption or similar).
>>>
>>> I suggest using a WARN(!sched, "Submission without activated
>>> scheduler!").
>>> This way the system has at least a chance of survival should the
>>> scheduler become ready later on.
>>>
>>> On the other hand the BUG_ON() or the NULL pointer deref should only
>>> kill the application thread which is submitting something before the
>>> driver is resumed. So that might help to pinpoint where the actually
>>> issue is.
>> As I see it the BUG_ON() would just be a more pretty NULL pointer
>> deref. If we agree that this is effectively a misuse of the scheduler
>> API we probably want to add it to make it more pretty, though?
>
> The only alternative I can see is that the scheduler API gracefully 
> handles submits to non-ready schedulers. E.g. that 
> drm_sched_entity_push_job() detects this condition and instead of 
> pushing the job sets and error code and signals the fences.
>
> But that might not be a good idea.
>
> It just moves the crash from one place to another and in general I fully 
> agree the driver is misusing the scheduler API to do something which 
> won't work and potentially crash the whole system.
>
>> @Philipp:
>> BTW, I only just discovered this thread by coincidence. Please use
>> get_maintainer. The scheduler currently has 4 maintainers, and none of
>> them is on CC.
>
> Oh good, point I was already wondering why nobody else commented and 
> didn't realized that nobody was on CC.
>
> Thanks,
> Christian.

I'm only seeing this mail exchange months after the fact because I was
linked to it by someone on IRC, and I am making a wild guess here.

Could this sleep/wake issue also be caused by something similar to the
panics and SMU hangs I was experiencing with my own issue? That issue is
known to have the same workaround for both 6000 and 7000 series users. A
specific kernel commit seems to affect it as well.

Could you test whether you can still reproduce the error after
disabling GFXOFF states with the following kernel command-line override,
and report back?

amdgpu.ppfeaturemask=0xfff73fff

Unless this has already been solved? Since this particular thread died
back in January, I guess nothing has happened since.

>
>>
>> Danke,
>> P.
>>
>>> Regards,
>>> Christian.
>>>
>>>> That helps guys with zero domain knowledge, like me, to figure out
>>>> how
>>>> this is all
>>>> supposed to work.
>>>>
>>>> best regards,
>>>>    Philipp



* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-05-28  9:55           ` Christopher Snowhill
@ 2025-06-02 10:25             ` Philipp Reisner
  2025-06-04 10:19               ` Christopher Snowhill
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Reisner @ 2025-06-02 10:25 UTC (permalink / raw)
  To: Christopher Snowhill
  Cc: Christian König, Philipp Stanner, dri-devel, linux-kernel,
	Simona Vetter, Danilo Krummrich, Philipp Stanner, dri-devel

Hi Christopher,

Thanks for following up. The bug still annoys me from time to time.
It triggered last on May 8, May 12, and May 18.
The crash on May 18 was already with the 6.14.5 kernel.

> Could this sleep wake issue also be caused by a similar thing to the
> panics and SMU hangs I was experiencing with my own issue? It's an issue
> known to have the same workaround for both 6000 and 7000 series users. A
> specific kernel commit seems to affect it as well.
>

I posted the stack trace earlier in the thread. The question is, what
was the stack trace of the issue you are referring to?

>
> If you could test whether you can still reproduce the error after
> disabling GFXOFF states with the following kernel commandline override:
>
> amdgpu.ppfeaturemask=0xfff73fff
>

That disables PP_OVERDRIVE_MASK, PP_GFXOFF_MASK,
and PP_GFX_DCS_MASK.
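
The decoding can be checked with a few lines of C. The bit values below
are what drivers/gpu/drm/amd/include/amd_shared.h appears to define for
these three flags; treat them as an assumption and verify them against
your own kernel tree. The MODEL_ prefix and cleared_feature_bits() are
made up for this sketch.

```c
#include <stdint.h>

/* Assumed values of the corresponding PP_*_MASK flags; verify locally. */
#define MODEL_PP_OVERDRIVE_MASK	0x4000u
#define MODEL_PP_GFXOFF_MASK	0x8000u
#define MODEL_PP_GFX_DCS_MASK	0x80000u

/* Return the feature bits that a given ppfeaturemask leaves cleared. */
static uint32_t cleared_feature_bits(uint32_t ppfeaturemask)
{
	return ~ppfeaturemask;
}
```

For 0xfff73fff the cleared bits are 0x0008c000, which is exactly the OR
of the three values above.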

IMHO, that looks like a mitigation for something different than the non-ready
compute schedulers that seem to be the root cause for the NULL pointer derefs
in my case.

Anyhow, I will give it a try, and will report back if my workstation
does not deref NULL pointers for more than three weeks with that
amdgpu.ppfeaturemask set.

Best regards,
 Philipp


* Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
  2025-06-02 10:25             ` Philipp Reisner
@ 2025-06-04 10:19               ` Christopher Snowhill
  0 siblings, 0 replies; 17+ messages in thread
From: Christopher Snowhill @ 2025-06-04 10:19 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: Christian König, Philipp Stanner, dri-devel, linux-kernel,
	Simona Vetter, Danilo Krummrich, Philipp Stanner, dri-devel

On Mon Jun 2, 2025 at 3:25 AM PDT, Philipp Reisner wrote:
> Hi Christopher,
>
> Thanks for following up. The bug still annoys me from time to time.
> It triggered last on May 8, May 12, and May 18.
> The crash on May 18 was already with the 6.14.5 kernel.
>
>> Could this sleep wake issue also be caused by a similar thing to the
>> panics and SMU hangs I was experiencing with my own issue? It's an issue
>> known to have the same workaround for both 6000 and 7000 series users. A
>> specific kernel commit seems to affect it as well.
>>
>
> I posted the stack trace earlier in the thread. The question is, what
> was the stack
> trace of the issue you are referring to?
>
>>
>> If you could test whether you can still reproduce the error after
>> disabling GFXOFF states with the following kernel commandline override:
>>
>> amdgpu.ppfeaturemask=0xfff73fff
>>
>
> that disables PP_OVERDRIVE_MASK, PP_GFXOFF_MASK,
> and PP_GFX_DCS_MASK.
>
> IMHO, that looks like a mitigation for something different than the non-ready
> compute schedulers that seem to be the root cause for the NULL pointer derefs
> in my case.

Indeed, it's mitigating something that leads to SMU firmware hangs. I
guessed, probably poorly, that your compute units may be failing to wake
up due to an SMU hang. But you have no SMU hang log notices, so it's
probably not that. Oh well.

>
> Anyhow, I will give it a try, and will report back if my workstation
> does not deref
> NULL pointers for more than three weeks with that amdgpu.ppfeaturemask set.
>
> Best regards,
>  Philipp



end of thread (newest message: 2025-06-04 10:19 UTC)

Thread overview: 17+ messages
2025-01-07 14:02 [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume Philipp Reisner
2025-01-07 14:08 ` Christian König
2025-01-07 15:21   ` Philipp Reisner
2025-01-08  8:19     ` Christian König
2025-01-13  8:43       ` Philipp Stanner
2025-01-13  9:55         ` Christian König
2025-05-28  9:55           ` Christopher Snowhill
2025-06-02 10:25             ` Philipp Reisner
2025-06-04 10:19               ` Christopher Snowhill
2025-01-08 14:26   ` Alex Deucher
2025-01-08 14:35     ` Christian König
2025-01-10  7:37       ` Philipp Reisner
2025-01-10  8:44         ` Christian König
2025-01-10 14:32           ` Philipp Reisner
2025-01-10 14:47             ` Christian König
2025-01-10 15:10               ` Alex Deucher
2025-01-13  8:32                 ` Christian König
