Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT

AMD-GFX Archive on lore.kernel.org

* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Thomas Gleixner @ 2026-05-21 10:17 UTC
  To: Bert Karwatzki, Mateusz Guzik, Christian Brauner
  Cc: linux-kernel, linux-next, linux-rt-devel, linux-fsdevel,
	adobriyan, jack, viro, Sebastian Andrzej Siewior, spasswolf,
	Alex Deucher, amd-gfx

On Thu, May 21 2026 at 11:20, Bert Karwatzki wrote:
> On Thursday, 21.05.2026 at 11:09 +0200, Mateusz Guzik wrote:
>
> with next-20260519 (no RT, no LOCKDEP) and got no crash so far (4 boots only though (next-20260619
> crashed in 2 out of 3 boots without RT)) but I get this warning on every boot:
>
> [    2.793416] [    T331] ------------[ cut here ]------------
> [    2.793433] [    T331] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> [    2.793434] [    T331] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331

So either the mutex is corrupted or was never initialized.

> [    2.793463] [    T331] Modules linked in: amdgpu(+) hid_generic usbhid drm_client_lib i2c_algo_bit drm_buddy hid drm_ttm_helper ttm drm_exec
> drm_suballoc_helper mfd_core drm_panel_backlight_quirks gpu_sched amdxcp drm_display_helper drm_kms_helper ahci libahci xhci_pci libata xhci_hcd drm nvme
> scsi_mod igc usbcore nvme_core scsi_common video nvme_keyring i2c_piix4 cec nvme_auth usb_common crc16 i2c_smbus wmi gpio_amdpt gpio_generic
> [    2.793518] [    T331] CPU: 17 UID: 0 PID: 331 Comm: (udev-worker) Not tainted 7.1.0-rc4-next-20260519-rcunortlockdep-dirty #465 PREEMPT 
> [    2.793534] [    T331] Hardware name: ASUS System Product Name/ROG STRIX B850-F GAMING WIFI, BIOS 1627 02/05/2026
> [    2.793547] [    T331] RIP: 0010:__mutex_lock+0x58d/0x10c0
> [    2.793555] [    T331] Code: 4c 8b 4d 88 85 c0 0f 84 f8 fa ff ff 44 8b 15 ca 9b 81 00 45 85 d2 0f 85 e8 fa ff ff 48 8d 3d 1a 57 82 00 48 c7 c6 a6 51 9e 83
> <67> 48 0f b9 3a 4c 8b 4d 88 e9 cc fa ff ff 48 8b bd 78 ff ff ff e8
> [    2.793579] [    T331] RSP: 0018:ffffa497016c3510 EFLAGS: 00010246
> [    2.793588] [    T331] RAX: 0000000000000001 RBX: ffff88c33a4c2ad8 RCX: 0000000000000000
> [    2.793598] [    T331] RDX: 0000000000000001 RSI: ffffffff839e51a6 RDI: ffffffff83de3c00
> [    2.793609] [    T331] RBP: ffffa497016c35c0 R08: ffffffffc0a55d92 R09: 0000000000000000
> [    2.793619] [    T331] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [    2.793629] [    T331] R13: 0000000000000002 R14: ffffa497016c3550 R15: 0000000000268000
> [    2.793641] [    T331] FS:  00007f1f32e5b9c0(0000) GS:ffff88d23b2ca000(0000) knlGS:0000000000000000
> [    2.793653] [    T331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.793662] [    T331] CR2: 000055cdfa28f588 CR3: 0000000112e73000 CR4: 0000000000f50ef0
> [    2.793673] [    T331] PKRU: 55555554
> [    2.793678] [    T331] Call Trace:
> [    2.793683] [    T331]  <TASK>
> [    2.793687] [    T331]  ? lock_acquire+0xbe/0x2d0
> [    2.793696] [    T331]  ? init_mqd+0x122/0x190 [amdgpu]
> [    2.793809] [    T331]  ? lock_release+0xc6/0x2a0
> [    2.793816] [    T331]  ? init_mqd+0x122/0x190 [amdgpu]
> [    2.793902] [    T331]  init_mqd+0x122/0x190 [amdgpu]
> [    2.793961] [    T331]  init_mqd_hiq+0xd/0x20 [amdgpu]
> [    2.794015] [    T331]  kq_initialize.constprop.0+0x2b8/0x370 [amdgpu]
> [    2.794071] [    T331]  kernel_queue_init+0x3f/0x60 [amdgpu]
> [    2.794125] [    T331]  pm_init+0x6b/0x100 [amdgpu]
> [    2.794178] [    T331]  start_cpsch+0x1d6/0x270 [amdgpu]
> [    2.794234] [    T331]  kgd2kfd_device_init.cold+0x7b9/0xa1a [amdgpu]
> [    2.794365] [    T331]  amdgpu_amdkfd_device_init+0x190/0x260 [amdgpu]

amdgpu_amdkfd_device_init()
  kgd2kfd_device_init() {
      ....
        init_mqd()
          mutex_lock(... profiler_lock); <- FAIL

      mutex_init(...profiler_lock);
  }
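
For readers following along, the ordering problem in the call chain above
can be modelled in userspace. This is only a sketch under stated
assumptions: toy_mutex, toy_dev and all toy_* functions are hypothetical
stand-ins, not amdkfd or kernel locking code. The magic field mimics the
self-pointer that the DEBUG_LOCKS_WARN_ON(lock->magic != lock) check in
the splat compares against.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a debug-enabled mutex (hypothetical, not struct mutex). */
struct toy_mutex {
	struct toy_mutex *magic;	/* set to the lock's own address by init */
	bool locked;
};

/* Toy stand-in for the device structure that embeds the lock. */
struct toy_dev {
	struct toy_mutex profiler_lock;
};

static void toy_mutex_init(struct toy_mutex *m)
{
	m->magic = m;
	m->locked = false;
}

/* Returns false when the lock was never initialized or got corrupted. */
static bool toy_mutex_lock(struct toy_mutex *m)
{
	if (m->magic != m)
		return false;	/* models the lock->magic != lock warning */
	m->locked = true;
	return true;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m->locked);
	m->locked = false;
}

/* Models a callee that briefly takes the profiler lock. */
static bool toy_init_mqd(struct toy_dev *dev)
{
	if (!toy_mutex_lock(&dev->profiler_lock))
		return false;
	toy_mutex_unlock(&dev->profiler_lock);
	return true;
}

/* Broken ordering: the lock is used before it is initialized. */
static bool toy_device_init_late(struct toy_dev *dev)
{
	bool ok = toy_init_mqd(dev);

	toy_mutex_init(&dev->profiler_lock);
	return ok;
}

/* Fixed ordering: initialize first, then run code that may take the lock. */
static bool toy_device_init_early(struct toy_dev *dev)
{
	toy_mutex_init(&dev->profiler_lock);
	return toy_init_mqd(dev);
}
```

Calling toy_device_init_late() on a zero-filled toy_dev reports the
failure, while toy_device_init_early() on the same kind of object
succeeds. That is the same reordering the patch below applies.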

Seems the famous graphics CI failed to catch this...

Thanks,

        tglx
---
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
 			KGD_ENGINE_SDMA1);
 	kfd->shared_resources = *gpu_resources;
 
+	kfd->profiler_process = NULL;
+	mutex_init(&kfd->profiler_lock);
+
 	kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
 
 	if (kfd->num_nodes == 0) {
@@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
 
 	svm_range_set_max_pages(kfd->adev);
 
-	kfd->profiler_process = NULL;
-	mutex_init(&kfd->profiler_lock);
-
 	kfd->init_complete = true;
 	dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
 		 kfd->adev->pdev->device);


* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Bert Karwatzki @ 2026-05-21 10:21 UTC
  To: Thomas Gleixner, Mateusz Guzik, Christian Brauner
  Cc: linux-kernel, linux-next, linux-rt-devel, linux-fsdevel,
	adobriyan, jack, viro, Sebastian Andrzej Siewior, spasswolf,
	Alex Deucher, amd-gfx

On Thursday, 21.05.2026 at 12:17 +0200, Thomas Gleixner wrote:
> On Thu, May 21 2026 at 11:20, Bert Karwatzki wrote:
> > On Thursday, 21.05.2026 at 11:09 +0200, Mateusz Guzik wrote:
> > 
> > with next-20260519 (no RT, no LOCKDEP) and got no crash so far (4 boots only though (next-20260619
> > crashed in 2 out of 3 boots without RT)) but I get this warning on every boot:
> > 
> > [    2.793416] [    T331] ------------[ cut here ]------------
> > [    2.793433] [    T331] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > [    2.793434] [    T331] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331
> 
> So either the mutex is corrupted or was never initialized.
> 
> > [...]
> 
> amdgpu_amdkfd_device_init()
>   kgd2kfd_device_init() {
>       ....
>         init_mqd()
>           mutex_lock(... profiler_lock); <- FAIL
> 
>       mutex_init(...profiler_lock);
>   }
> 
> Seems the famous graphics CI failed to catch this...
> 
> Thanks,
> 
>         tglx
> ---
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
>  			KGD_ENGINE_SDMA1);
>  	kfd->shared_resources = *gpu_resources;
>  
> +	kfd->profiler_process = NULL;
> +	mutex_init(&kfd->profiler_lock);
> +
>  	kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
>  
>  	if (kfd->num_nodes == 0) {
> @@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
>  
>  	svm_range_set_max_pages(kfd->adev);
>  
> -	kfd->profiler_process = NULL;
> -	mutex_init(&kfd->profiler_lock);
> -
>  	kfd->init_complete = true;
>  	dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
>  		 kfd->adev->pdev->device);

Actually, when I test next-20260519 with the improved fix, I do not see
the warning from amdgpu.

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 771fc31a69b8..712316a1e3e0 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -269,7 +269,7 @@ static __cold noinline int regen_filesystems_string(void)
 	hlist_for_each_entry_rcu(p, &file_systems, list) {
 		if (!(p->fs_flags & FS_REQUIRES_DEV))
 			newlen += strlen("nodev");
-		newlen += strlen("\t") + strlen(p->name) +  strlen("\n");
+		newlen += strlen("\t") + strlen(p->name) + strlen("\n");
 	}
 	spin_unlock(&file_systems_lock);
 
@@ -289,6 +289,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did someone beat us to it?
 	 */
 	if (old && old->gen == file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		return 0;
 	}
@@ -297,6 +298,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did the list change in the meantime?
 	 */
 	if (gen != file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		goto retry;
 	}
@@ -321,13 +323,12 @@ static __cold noinline int regen_filesystems_string(void)
 		 * generation above and messes it up.
 		 */
 		spin_unlock(&file_systems_lock);
-		if (old)
-			kfree_rcu(old, rcu);
+		kfree(new);
 		return -EINVAL;
 	}
 
 	/*
-	 * Paired with consume fence in READ_ONCE() in filesystems_proc_show()
+	 * Paired with consume fence in rcu_dereference() in filesystems_proc_show()
 	 */
 	smp_store_release(&file_systems_string, new);
 	spin_unlock(&file_systems_lock);

Bert Karwatzki


* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Mateusz Guzik @ 2026-05-21 10:33 UTC
  To: Bert Karwatzki
  Cc: Thomas Gleixner, Christian Brauner, linux-kernel, linux-next,
	linux-rt-devel, linux-fsdevel, adobriyan, jack, viro,
	Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thu, May 21, 2026 at 12:22 PM Bert Karwatzki <spasswolf@web.de> wrote:
>
> On Thursday, 21.05.2026 at 12:17 +0200, Thomas Gleixner wrote:
> > On Thu, May 21 2026 at 11:20, Bert Karwatzki wrote:
> > > On Thursday, 21.05.2026 at 11:09 +0200, Mateusz Guzik wrote:
> > >
> > > with next-20260519 (no RT, no LOCKDEP) and got no crash so far (4 boots only though (next-20260619
> > > crashed in 2 out of 3 boots without RT)) but I get this warning on every boot:
> > >
> > > [    2.793416] [    T331] ------------[ cut here ]------------
> > > [    2.793433] [    T331] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > > [    2.793434] [    T331] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331
> >
> > So either the mutex is corrupted or was never initialized.
> >
> > > [...]
> >
> > amdgpu_amdkfd_device_init()
> >   kgd2kfd_device_init() {
> >       ....
> >         init_mqd()
> >           mutex_lock(... profiler_lock); <- FAIL
> >
> >       mutex_init(...profiler_lock);
> >   }
> >
> > Seems the famous graphics CI failed to catch this...
> >
> > Thanks,
> >
> >         tglx
> > ---
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > @@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
> >                       KGD_ENGINE_SDMA1);
> >       kfd->shared_resources = *gpu_resources;
> >
> > +     kfd->profiler_process = NULL;
> > +     mutex_init(&kfd->profiler_lock);
> > +
> >       kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
> >
> >       if (kfd->num_nodes == 0) {
> > @@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
> >
> >       svm_range_set_max_pages(kfd->adev);
> >
> > -     kfd->profiler_process = NULL;
> > -     mutex_init(&kfd->profiler_lock);
> > -
> >       kfd->init_complete = true;
> >       dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
> >                kfd->adev->pdev->device);
>
> Actually, when I test next-20260519 with the improved fix, I do not see
> the warning from amdgpu.
>

Can you please do the following:
1. go back to the known crashing-tag, add my fix, verify you still get
the amd splat and then try out the fix provided by Thomas
2. regardless if the above helps, can you boot a kernel built with
CONFIG_KASAN=y

fwiw I verified my patch works fine with KASAN, including by
intentionally miscalculating the size of the target buffer and seeing
a nice splat from it so I'm confident I'm not corrupting anything.
However, as there are new mallocs + free flying around at early boot,
it is *plausible* amd was getting zeroed memory without asking for it
and it worked by accident.
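
To illustrate the "worked by accident" hypothesis above, here is a toy
model, assuming a lock whose all-zero state means "unlocked". The names
toy_lock, toy_trylock and lock_without_init are hypothetical, and this is
not the layout of the kernel's struct mutex. It only shows why a
never-initialized lock can appear to work on zeroed memory and stop
working once the allocator hands back stale bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy non-debug lock: a state of 0 means "unlocked" (an assumption). */
struct toy_lock {
	unsigned int state;
};

/* Uncontended acquire: succeeds only if the lock looks unlocked. */
static bool toy_trylock(struct toy_lock *l)
{
	assert(l != NULL);
	if (l->state != 0)
		return false;
	l->state = 1;
	return true;
}

/*
 * Simulates taking a lock that nobody initialized. 'fill' stands for
 * whatever bytes the memory happened to contain: 0x00 for zeroed memory,
 * anything else for stale contents left by an earlier user.
 */
static bool lock_without_init(unsigned char fill)
{
	struct toy_lock l;

	memset(&l, fill, sizeof(l));
	return toy_trylock(&l);
}
```

In this model lock_without_init(0x00) succeeds and lock_without_init(0x5a)
fails, although the code path is identical. A change in early-boot
allocation patterns could expose such a latent bug without causing it.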


* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Bert Karwatzki @ 2026-05-21 11:50 UTC
  To: Mateusz Guzik
  Cc: Thomas Gleixner, spasswolf, Christian Brauner,
	linux-kernel, linux-next, linux-rt-devel, linux-fsdevel,
	adobriyan, jack, viro, Sebastian Andrzej Siewior, Alex Deucher,
	amd-gfx


> 
> Can you please do the following:
> 1. go back to the known crashing-tag, add my fix, verify you still get
> the amd splat and then try out the fix provided by Thomas
> 2. regardless if the above helps, can you boot a kernel built with
> CONFIG_KASAN=y
> 
> fwiw I verified my patch works fine with KASAN, including by
> intentionally miscalculating the size of the target buffer and seeing
> a nice splat from it so I'm confident I'm not corrupting anything.
> However, as there are new mallocs + free flying around at early boot,
> it is *plausible* amd was getting zeroed memory without asking for it
> and it worked by accident.

I think the warning from amdgpu is only displayed with CONFIG_LOCKDEP=y, so
your "improved fix" does not silence the warning from amdgpu.

The additional fix from Thomas fixes the amdgpu warning.

I also built the kernel with CONFIG_KASAN and get no error messages.

Bert Karwatzki


* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Mateusz Guzik @ 2026-05-21 12:01 UTC
  To: Bert Karwatzki
  Cc: Thomas Gleixner, Christian Brauner, linux-kernel, linux-next,
	linux-rt-devel, linux-fsdevel, adobriyan, jack, viro,
	Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thu, May 21, 2026 at 1:51 PM Bert Karwatzki <spasswolf@web.de> wrote:
>
>
> >
> > Can you please do the following:
> > 1. go back to the known crashing-tag, add my fix, verify you still get
> > the amd splat and then try out the fix provided by Thomas
> > 2. regardless if the above helps, can you boot a kernel built with
> > CONFIG_KASAN=y
> >
> > fwiw I verified my patch works fine with KASAN, including by
> > intentionally miscalculating the size of the target buffer and seeing
> > a nice splat from it so I'm confident I'm not corrupting anything.
> > However, as there are new mallocs + free flying around at early boot,
> > it is *plausible* amd was getting zeroed memory without asking for it
> > and it worked by accident.
>
> I think the warning from amdgpu is only displayed with CONFIG_LOCKDEP=y, so
> your "improved fix" does not silence the warning from amdgpu.
>
> The additional fix from Thomas fixes the amdgpu warning.
>

I just wanted to confirm my patch does not *cause* issues, at worst
uncovers them.

> I also built the kernel with CONFIG_KASAN and get no error messages.
>

nice

I presume Thomas will handle getting the amdgpu patch to the right
people; I think it will be fine to drop all the mailing lists and the
cc's. :-)

So overall I think we are done here.

Thank you for testing and sorry for the breakage.


* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Bert Karwatzki @ 2026-05-28 17:59 UTC
  To: Mateusz Guzik
  Cc: Thomas Gleixner, spasswolf, Christian Brauner, linux-kernel,
	linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro,
	Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thursday, 21.05.2026 at 14:01 +0200, Mateusz Guzik wrote:
> 
> So overall I think we are done here.
> 
> Thank you for testing and sorry for the breakage.

Just as a reminder, this has not been fixed in linux-next, yet,
up to version next-20260528.

Bert Karwatzki


* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Mateusz Guzik @ 2026-05-29 17:20 UTC
  To: Bert Karwatzki
  Cc: Thomas Gleixner, Christian Brauner, linux-kernel, linux-next,
	linux-rt-devel, linux-fsdevel, adobriyan, jack, viro,
	Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thu, May 28, 2026 at 7:59 PM Bert Karwatzki <spasswolf@web.de> wrote:
>
> On Thursday, 21.05.2026 at 14:01 +0200, Mateusz Guzik wrote:
> >
> > So overall I think we are done here.
> >
> > Thank you for testing and sorry for the breakage.
>
> Just as a reminder, this has not been fixed in linux-next, yet,
> up to version next-20260528.
>

I sent a v4 of the patchset with some extra touch ups:
https://lore.kernel.org/linux-fsdevel/20260529171840.2576445-1-mjguzik@gmail.com/T/#t


* [PATCH] amd/amdkfd: Initialize kfd_dev::profiler lock early
From: Thomas Gleixner @ 2026-05-21 12:55 UTC
  To: Bert Karwatzki; +Cc: linux-kernel, Alex Deucher, amd-gfx, Felix Kuehling

Bert reported the following lockdep splat:

 DEBUG_LOCKS_WARN_ON(lock->magic != lock)
 WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331
 RIP: 0010:__mutex_lock+0x58d/0x10c0
  init_mqd+0x122/0x190 [amdgpu]
  init_mqd_hiq+0xd/0x20 [amdgpu]
  kq_initialize.constprop.0+0x2b8/0x370 [amdgpu]
  kernel_queue_init+0x3f/0x60 [amdgpu]
  pm_init+0x6b/0x100 [amdgpu]
  start_cpsch+0x1d6/0x270 [amdgpu]
  kgd2kfd_device_init.cold+0x7b9/0xa1a [amdgpu]
  amdgpu_amdkfd_device_init+0x190/0x260 [amdgpu]
  amdgpu_device_init.cold+0x1952/0x1c79 [amdgpu]
  amdgpu_driver_load_kms+0x14/0x80 [amdgpu]

Some implementations of init_mqd() acquire kfd_dev->profiler_lock, which is
initialized in kgd2kfd_device_init() after init_mqd() was invoked via the
above callchain. So init_mqd() tries to lock an uninitialized mutex.

Move the initialization to the beginning of kgd2kfd_device_init() to cure
that.

Fixes: a789761de305 ("amd/amdkfd: Add kfd_ioctl_profiler to contain profiler kernel driver changes")
Reported-by: Bert Karwatzki <spasswolf@web.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Benjamin Welton <bewelton@amd.com>
Closes: https://lore.kernel.org/lkml/4f548d61b2dd12e01f401ce4b8c865f238f7b23c.camel@web.de/
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
 			KGD_ENGINE_SDMA1);
 	kfd->shared_resources = *gpu_resources;
 
+	kfd->profiler_process = NULL;
+	mutex_init(&kfd->profiler_lock);
+
 	kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
 
 	if (kfd->num_nodes == 0) {
@@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
 
 	svm_range_set_max_pages(kfd->adev);
 
-	kfd->profiler_process = NULL;
-	mutex_init(&kfd->profiler_lock);
-
 	kfd->init_complete = true;
 	dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
 		 kfd->adev->pdev->device);
