* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Thomas Gleixner @ 2026-05-21 10:17 UTC
To: Bert Karwatzki, Mateusz Guzik, Christian Brauner
Cc: linux-kernel, linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro, Sebastian Andrzej Siewior, spasswolf, Alex Deucher, amd-gfx

On Thu, May 21 2026 at 11:20, Bert Karwatzki wrote:
> On Thursday, 21.05.2026 at 11:09 +0200, Mateusz Guzik wrote:
>
> with next-20260519 (no RT, no LOCKDEP) and got no crash so far (4 boots only though (next-20260619
> crashed in 2 out of 3 boots without RT)) but I get this warning on every boot:
>
> [ 2.793416] [ T331] ------------[ cut here ]------------
> [ 2.793433] [ T331] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> [ 2.793434] [ T331] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331

So either the mutex is corrupted or was never initialized.

> [ 2.793463] [ T331] Modules linked in: amdgpu(+) hid_generic usbhid drm_client_lib i2c_algo_bit drm_buddy hid drm_ttm_helper ttm drm_exec
> drm_suballoc_helper mfd_core drm_panel_backlight_quirks gpu_sched amdxcp drm_display_helper drm_kms_helper ahci libahci xhci_pci libata xhci_hcd drm nvme
> scsi_mod igc usbcore nvme_core scsi_common video nvme_keyring i2c_piix4 cec nvme_auth usb_common crc16 i2c_smbus wmi gpio_amdpt gpio_generic
> [ 2.793518] [ T331] CPU: 17 UID: 0 PID: 331 Comm: (udev-worker) Not tainted 7.1.0-rc4-next-20260519-rcunortlockdep-dirty #465 PREEMPT
> [ 2.793534] [ T331] Hardware name: ASUS System Product Name/ROG STRIX B850-F GAMING WIFI, BIOS 1627 02/05/2026
> [ 2.793547] [ T331] RIP: 0010:__mutex_lock+0x58d/0x10c0
> [ 2.793555] [ T331] Code: 4c 8b 4d 88 85 c0 0f 84 f8 fa ff ff 44 8b 15 ca 9b 81 00 45 85 d2 0f 85 e8 fa ff ff 48 8d 3d 1a 57 82 00 48 c7 c6 a6 51 9e 83
> <67> 48 0f b9 3a 4c 8b 4d 88 e9 cc fa ff ff 48 8b bd 78 ff ff ff e8
> [ 2.793579] [ T331] RSP: 0018:ffffa497016c3510 EFLAGS: 00010246
> [ 2.793588] [ T331] RAX: 0000000000000001 RBX: ffff88c33a4c2ad8 RCX: 0000000000000000
> [ 2.793598] [ T331] RDX: 0000000000000001 RSI: ffffffff839e51a6 RDI: ffffffff83de3c00
> [ 2.793609] [ T331] RBP: ffffa497016c35c0 R08: ffffffffc0a55d92 R09: 0000000000000000
> [ 2.793619] [ T331] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 2.793629] [ T331] R13: 0000000000000002 R14: ffffa497016c3550 R15: 0000000000268000
> [ 2.793641] [ T331] FS: 00007f1f32e5b9c0(0000) GS:ffff88d23b2ca000(0000) knlGS:0000000000000000
> [ 2.793653] [ T331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2.793662] [ T331] CR2: 000055cdfa28f588 CR3: 0000000112e73000 CR4: 0000000000f50ef0
> [ 2.793673] [ T331] PKRU: 55555554
> [ 2.793678] [ T331] Call Trace:
> [ 2.793683] [ T331] <TASK>
> [ 2.793687] [ T331] ? lock_acquire+0xbe/0x2d0
> [ 2.793696] [ T331] ? init_mqd+0x122/0x190 [amdgpu]
> [ 2.793809] [ T331] ? lock_release+0xc6/0x2a0
> [ 2.793816] [ T331] ? init_mqd+0x122/0x190 [amdgpu]
> [ 2.793902] [ T331] init_mqd+0x122/0x190 [amdgpu]
> [ 2.793961] [ T331] init_mqd_hiq+0xd/0x20 [amdgpu]
> [ 2.794015] [ T331] kq_initialize.constprop.0+0x2b8/0x370 [amdgpu]
> [ 2.794071] [ T331] kernel_queue_init+0x3f/0x60 [amdgpu]
> [ 2.794125] [ T331] pm_init+0x6b/0x100 [amdgpu]
> [ 2.794178] [ T331] start_cpsch+0x1d6/0x270 [amdgpu]
> [ 2.794234] [ T331] kgd2kfd_device_init.cold+0x7b9/0xa1a [amdgpu]
> [ 2.794365] [ T331] amdgpu_amdkfd_device_init+0x190/0x260 [amdgpu]

amdgpu_amdkfd_device_init()
  kgd2kfd_device_init() {
     ....
     init_mqd()
       mutex_lock(... profiler_lock);   <- FAIL

     mutex_init(...profiler_lock);
  }

Seems the famous graphics CI failed to catch this...

Thanks,

        tglx
---
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
 			KGD_ENGINE_SDMA1);
 	kfd->shared_resources = *gpu_resources;
 
+	kfd->profiler_process = NULL;
+	mutex_init(&kfd->profiler_lock);
+
 	kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
 
 	if (kfd->num_nodes == 0) {
@@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
 
 	svm_range_set_max_pages(kfd->adev);
 
-	kfd->profiler_process = NULL;
-	mutex_init(&kfd->profiler_lock);
-
 	kfd->init_complete = true;
 	dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
 		 kfd->adev->pdev->device);
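
The check that fires in the splat, DEBUG_LOCKS_WARN_ON(lock->magic != lock), can be modelled in userspace. The following is only a rough sketch under stated assumptions: struct dbg_mutex, struct fake_dev, init_late() and init_early() are hypothetical names invented for illustration, and the kernel's real mutex debugging code does considerably more. It shows why locking before initializing trips the check while the reordered sequence does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a debug-enabled mutex. */
struct dbg_mutex {
	bool locked;
	struct dbg_mutex *magic;	/* points at the lock itself once initialized */
};

static void dbg_mutex_init(struct dbg_mutex *lock)
{
	lock->locked = false;
	lock->magic = lock;
}

/* Returns false where a debug kernel would emit the warning. */
static bool dbg_mutex_lock(struct dbg_mutex *lock)
{
	if (lock->magic != lock)
		return false;
	lock->locked = true;
	return true;
}

/* Hypothetical device structure, loosely modelled on the report. */
struct fake_dev {
	struct dbg_mutex profiler_lock;
};

/* Broken ordering: the lock is taken before it is initialized. */
static bool init_late(void)
{
	struct fake_dev *dev = calloc(1, sizeof(*dev));
	bool ok;

	if (!dev)
		return false;
	ok = dbg_mutex_lock(&dev->profiler_lock);	/* magic is still NULL */
	dbg_mutex_init(&dev->profiler_lock);
	free(dev);
	return ok;
}

/* Fixed ordering: initialize first, then let callees take the lock. */
static bool init_early(void)
{
	struct fake_dev *dev = calloc(1, sizeof(*dev));
	bool ok;

	if (!dev)
		return false;
	dbg_mutex_init(&dev->profiler_lock);
	ok = dbg_mutex_lock(&dev->profiler_lock);
	free(dev);
	return ok;
}
```

Calling init_late() reports a failed magic check even though the memory is zeroed, while init_early() succeeds. That matches the observation that the splat needs lock debugging enabled to become visible.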
* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Bert Karwatzki @ 2026-05-21 10:21 UTC
To: Thomas Gleixner, Mateusz Guzik, Christian Brauner
Cc: linux-kernel, linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro, Sebastian Andrzej Siewior, spasswolf, Alex Deucher, amd-gfx

On Thursday, 21.05.2026 at 12:17 +0200, Thomas Gleixner wrote:
> On Thu, May 21 2026 at 11:20, Bert Karwatzki wrote:
> > On Thursday, 21.05.2026 at 11:09 +0200, Mateusz Guzik wrote:
> >
> > with next-20260519 (no RT, no LOCKDEP) and got no crash so far (4 boots only though (next-20260619
> > crashed in 2 out of 3 boots without RT)) but I get this warning on every boot:
> >
> > [ 2.793416] [ T331] ------------[ cut here ]------------
> > [ 2.793433] [ T331] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > [ 2.793434] [ T331] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331
>
> So either the mutex is corrupted or was never initialized.
>
> > [ 2.793463] [ T331] Modules linked in: amdgpu(+) hid_generic usbhid drm_client_lib i2c_algo_bit drm_buddy hid drm_ttm_helper ttm drm_exec
> > drm_suballoc_helper mfd_core drm_panel_backlight_quirks gpu_sched amdxcp drm_display_helper drm_kms_helper ahci libahci xhci_pci libata xhci_hcd drm nvme
> > scsi_mod igc usbcore nvme_core scsi_common video nvme_keyring i2c_piix4 cec nvme_auth usb_common crc16 i2c_smbus wmi gpio_amdpt gpio_generic
> > [ 2.793518] [ T331] CPU: 17 UID: 0 PID: 331 Comm: (udev-worker) Not tainted 7.1.0-rc4-next-20260519-rcunortlockdep-dirty #465 PREEMPT
> > [ 2.793534] [ T331] Hardware name: ASUS System Product Name/ROG STRIX B850-F GAMING WIFI, BIOS 1627 02/05/2026
> > [ 2.793547] [ T331] RIP: 0010:__mutex_lock+0x58d/0x10c0
> > [ 2.793555] [ T331] Code: 4c 8b 4d 88 85 c0 0f 84 f8 fa ff ff 44 8b 15 ca 9b 81 00 45 85 d2 0f 85 e8 fa ff ff 48 8d 3d 1a 57 82 00 48 c7 c6 a6 51 9e 83
> > <67> 48 0f b9 3a 4c 8b 4d 88 e9 cc fa ff ff 48 8b bd 78 ff ff ff e8
> > [ 2.793579] [ T331] RSP: 0018:ffffa497016c3510 EFLAGS: 00010246
> > [ 2.793588] [ T331] RAX: 0000000000000001 RBX: ffff88c33a4c2ad8 RCX: 0000000000000000
> > [ 2.793598] [ T331] RDX: 0000000000000001 RSI: ffffffff839e51a6 RDI: ffffffff83de3c00
> > [ 2.793609] [ T331] RBP: ffffa497016c35c0 R08: ffffffffc0a55d92 R09: 0000000000000000
> > [ 2.793619] [ T331] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > [ 2.793629] [ T331] R13: 0000000000000002 R14: ffffa497016c3550 R15: 0000000000268000
> > [ 2.793641] [ T331] FS: 00007f1f32e5b9c0(0000) GS:ffff88d23b2ca000(0000) knlGS:0000000000000000
> > [ 2.793653] [ T331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 2.793662] [ T331] CR2: 000055cdfa28f588 CR3: 0000000112e73000 CR4: 0000000000f50ef0
> > [ 2.793673] [ T331] PKRU: 55555554
> > [ 2.793678] [ T331] Call Trace:
> > [ 2.793683] [ T331] <TASK>
> > [ 2.793687] [ T331] ? lock_acquire+0xbe/0x2d0
> > [ 2.793696] [ T331] ? init_mqd+0x122/0x190 [amdgpu]
> > [ 2.793809] [ T331] ? lock_release+0xc6/0x2a0
> > [ 2.793816] [ T331] ? init_mqd+0x122/0x190 [amdgpu]
> > [ 2.793902] [ T331] init_mqd+0x122/0x190 [amdgpu]
> > [ 2.793961] [ T331] init_mqd_hiq+0xd/0x20 [amdgpu]
> > [ 2.794015] [ T331] kq_initialize.constprop.0+0x2b8/0x370 [amdgpu]
> > [ 2.794071] [ T331] kernel_queue_init+0x3f/0x60 [amdgpu]
> > [ 2.794125] [ T331] pm_init+0x6b/0x100 [amdgpu]
> > [ 2.794178] [ T331] start_cpsch+0x1d6/0x270 [amdgpu]
> > [ 2.794234] [ T331] kgd2kfd_device_init.cold+0x7b9/0xa1a [amdgpu]
> > [ 2.794365] [ T331] amdgpu_amdkfd_device_init+0x190/0x260 [amdgpu]
>
> amdgpu_amdkfd_device_init()
>   kgd2kfd_device_init() {
>      ....
>      init_mqd()
>        mutex_lock(... profiler_lock);   <- FAIL
>
>      mutex_init(...profiler_lock);
>   }
>
> Seems the famous graphics CI failed to catch this...
>
> Thanks,
>
>         tglx
> ---
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
>  			KGD_ENGINE_SDMA1);
>  	kfd->shared_resources = *gpu_resources;
>  
> +	kfd->profiler_process = NULL;
> +	mutex_init(&kfd->profiler_lock);
> +
>  	kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
>  
>  	if (kfd->num_nodes == 0) {
> @@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
>  
>  	svm_range_set_max_pages(kfd->adev);
>  
> -	kfd->profiler_process = NULL;
> -	mutex_init(&kfd->profiler_lock);
> -
>  	kfd->init_complete = true;
>  	dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
>  		 kfd->adev->pdev->device);

Actually, when I test next-20260519 with the improved fix, I do not see
the warning from amdgpu.

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 771fc31a69b8..712316a1e3e0 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -269,7 +269,7 @@ static __cold noinline int regen_filesystems_string(void)
 	hlist_for_each_entry_rcu(p, &file_systems, list) {
 		if (!(p->fs_flags & FS_REQUIRES_DEV))
 			newlen += strlen("nodev");
-		newlen += strlen("\t") + strlen(p->name) + strlen("\n");
+		newlen += strlen("\t") + strlen(p->name) + strlen("\n");
 	}
 	spin_unlock(&file_systems_lock);
 
@@ -289,6 +289,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did someone beat us to it?
 	 */
 	if (old && old->gen == file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		return 0;
 	}
@@ -297,6 +298,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did the list change in the meantime?
 	 */
 	if (gen != file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		goto retry;
 	}
@@ -321,13 +323,12 @@ static __cold noinline int regen_filesystems_string(void)
 		 * generation above and messes it up.
 		 */
 		spin_unlock(&file_systems_lock);
-		if (old)
-			kfree_rcu(old, rcu);
+		kfree(new);
 		return -EINVAL;
 	}
 
 	/*
-	 * Paired with consume fence in READ_ONCE() in filesystems_proc_show()
+	 * Paired with consume fence in rcu_dereference() in filesystems_proc_show()
 	 */
 	smp_store_release(&file_systems_string, new);
 	spin_unlock(&file_systems_lock);

Bert Karwatzki
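
The pattern this diff touches (measure under the lock, allocate without it, retake the lock, recheck the generation, and drop the lock on every exit path) can be sketched in userspace. This is a single-threaded model, not the kernel code: fake_lock(), fake_unlock(), names[], cached and regen_string() are hypothetical names chosen for illustration, and plain free() stands in for the RCU-deferred freeing the real function would need.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int lock_depth;			/* hypothetical stand-in for file_systems_lock */
static unsigned long list_gen = 1;	/* bumped whenever the list changes */
static const char *names[] = { "ext4", "proc", "tmpfs" };
static char *cached;			/* last generated string */
static unsigned long cached_gen;	/* generation the cached string was built for */

static void fake_lock(void)   { assert(lock_depth == 0); lock_depth++; }
static void fake_unlock(void) { assert(lock_depth == 1); lock_depth--; }

static int regen_string(void)
{
	size_t i, len, n = sizeof(names) / sizeof(names[0]);
	unsigned long gen;
	char *new;

retry:
	/* Pass 1: measure the list and remember its generation. */
	fake_lock();
	gen = list_gen;
	len = 1;			/* trailing NUL */
	for (i = 0; i < n; i++)
		len += strlen(names[i]) + 1;	/* name plus newline */
	fake_unlock();

	/* Allocate without holding the lock. */
	new = malloc(len);
	if (!new)
		return -1;

	fake_lock();
	/* Did someone beat us to it? Unlock before leaving. */
	if (cached && cached_gen == list_gen) {
		fake_unlock();
		free(new);
		return 0;
	}
	/* Did the list change in the meantime? Unlock before retrying. */
	if (gen != list_gen) {
		fake_unlock();
		free(new);
		goto retry;
	}

	/* Pass 2: fill the buffer and publish it. */
	new[0] = '\0';
	for (i = 0; i < n; i++) {
		strcat(new, names[i]);
		strcat(new, "\n");
	}
	free(cached);		/* the kernel would defer this past readers */
	cached = new;
	cached_gen = gen;
	fake_unlock();
	return 0;
}
```

The asserts inside fake_lock() and fake_unlock() abort on a double lock or a missing unlock, so an early return that forgets to unlock is caught on the next call rather than going unnoticed.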
* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Mateusz Guzik @ 2026-05-21 10:33 UTC
To: Bert Karwatzki
Cc: Thomas Gleixner, Christian Brauner, linux-kernel, linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro, Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thu, May 21, 2026 at 12:22 PM Bert Karwatzki <spasswolf@web.de> wrote:
>
> On Thursday, 21.05.2026 at 12:17 +0200, Thomas Gleixner wrote:
> > On Thu, May 21 2026 at 11:20, Bert Karwatzki wrote:
> > > On Thursday, 21.05.2026 at 11:09 +0200, Mateusz Guzik wrote:
> > >
> > > with next-20260519 (no RT, no LOCKDEP) and got no crash so far (4 boots only though (next-20260619
> > > crashed in 2 out of 3 boots without RT)) but I get this warning on every boot:
> > >
> > > [ 2.793416] [ T331] ------------[ cut here ]------------
> > > [ 2.793433] [ T331] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > > [ 2.793434] [ T331] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331
> >
> > So either the mutex is corrupted or was never initialized.
> >
> > > [ 2.793463] [ T331] Modules linked in: amdgpu(+) hid_generic usbhid drm_client_lib i2c_algo_bit drm_buddy hid drm_ttm_helper ttm drm_exec
> > > drm_suballoc_helper mfd_core drm_panel_backlight_quirks gpu_sched amdxcp drm_display_helper drm_kms_helper ahci libahci xhci_pci libata xhci_hcd drm nvme
> > > scsi_mod igc usbcore nvme_core scsi_common video nvme_keyring i2c_piix4 cec nvme_auth usb_common crc16 i2c_smbus wmi gpio_amdpt gpio_generic
> > > [ 2.793518] [ T331] CPU: 17 UID: 0 PID: 331 Comm: (udev-worker) Not tainted 7.1.0-rc4-next-20260519-rcunortlockdep-dirty #465 PREEMPT
> > > [ 2.793534] [ T331] Hardware name: ASUS System Product Name/ROG STRIX B850-F GAMING WIFI, BIOS 1627 02/05/2026
> > > [ 2.793547] [ T331] RIP: 0010:__mutex_lock+0x58d/0x10c0
> > > [ 2.793555] [ T331] Code: 4c 8b 4d 88 85 c0 0f 84 f8 fa ff ff 44 8b 15 ca 9b 81 00 45 85 d2 0f 85 e8 fa ff ff 48 8d 3d 1a 57 82 00 48 c7 c6 a6 51 9e 83
> > > <67> 48 0f b9 3a 4c 8b 4d 88 e9 cc fa ff ff 48 8b bd 78 ff ff ff e8
> > > [ 2.793579] [ T331] RSP: 0018:ffffa497016c3510 EFLAGS: 00010246
> > > [ 2.793588] [ T331] RAX: 0000000000000001 RBX: ffff88c33a4c2ad8 RCX: 0000000000000000
> > > [ 2.793598] [ T331] RDX: 0000000000000001 RSI: ffffffff839e51a6 RDI: ffffffff83de3c00
> > > [ 2.793609] [ T331] RBP: ffffa497016c35c0 R08: ffffffffc0a55d92 R09: 0000000000000000
> > > [ 2.793619] [ T331] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > [ 2.793629] [ T331] R13: 0000000000000002 R14: ffffa497016c3550 R15: 0000000000268000
> > > [ 2.793641] [ T331] FS: 00007f1f32e5b9c0(0000) GS:ffff88d23b2ca000(0000) knlGS:0000000000000000
> > > [ 2.793653] [ T331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 2.793662] [ T331] CR2: 000055cdfa28f588 CR3: 0000000112e73000 CR4: 0000000000f50ef0
> > > [ 2.793673] [ T331] PKRU: 55555554
> > > [ 2.793678] [ T331] Call Trace:
> > > [ 2.793683] [ T331] <TASK>
> > > [ 2.793687] [ T331] ? lock_acquire+0xbe/0x2d0
> > > [ 2.793696] [ T331] ? init_mqd+0x122/0x190 [amdgpu]
> > > [ 2.793809] [ T331] ? lock_release+0xc6/0x2a0
> > > [ 2.793816] [ T331] ? init_mqd+0x122/0x190 [amdgpu]
> > > [ 2.793902] [ T331] init_mqd+0x122/0x190 [amdgpu]
> > > [ 2.793961] [ T331] init_mqd_hiq+0xd/0x20 [amdgpu]
> > > [ 2.794015] [ T331] kq_initialize.constprop.0+0x2b8/0x370 [amdgpu]
> > > [ 2.794071] [ T331] kernel_queue_init+0x3f/0x60 [amdgpu]
> > > [ 2.794125] [ T331] pm_init+0x6b/0x100 [amdgpu]
> > > [ 2.794178] [ T331] start_cpsch+0x1d6/0x270 [amdgpu]
> > > [ 2.794234] [ T331] kgd2kfd_device_init.cold+0x7b9/0xa1a [amdgpu]
> > > [ 2.794365] [ T331] amdgpu_amdkfd_device_init+0x190/0x260 [amdgpu]
> >
> > amdgpu_amdkfd_device_init()
> >   kgd2kfd_device_init() {
> >      ....
> >      init_mqd()
> >        mutex_lock(... profiler_lock);   <- FAIL
> >
> >      mutex_init(...profiler_lock);
> >   }
> >
> > Seems the famous graphics CI failed to catch this...
> >
> > Thanks,
> >
> >         tglx
> > ---
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > @@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
> >  			KGD_ENGINE_SDMA1);
> >  	kfd->shared_resources = *gpu_resources;
> >  
> > +	kfd->profiler_process = NULL;
> > +	mutex_init(&kfd->profiler_lock);
> > +
> >  	kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
> >  
> >  	if (kfd->num_nodes == 0) {
> > @@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
> >  
> >  	svm_range_set_max_pages(kfd->adev);
> >  
> > -	kfd->profiler_process = NULL;
> > -	mutex_init(&kfd->profiler_lock);
> > -
> >  	kfd->init_complete = true;
> >  	dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
> >  		 kfd->adev->pdev->device);
>
> Actually, when I test next-20260519 with the improved fix, I do not see
> the warning from amdgpu.
>

Can you please do the following:
1. go back to the known crashing-tag, add my fix, verify you still get
the amd splat and then try out the fix provided by Thomas
2. regardless if the above helps, can you boot a kernel built with
CONFIG_KASAN=y

fwiw I verified my patch works fine with KASAN, including by
intentionally miscalculating the size of the target buffer and seeing
a nice splat from it so I'm confident I'm not corrupting anything.
However, as there are new mallocs + free flying around at early boot,
it is *plausible* amd was getting zeroed memory without asking for it
and it worked by accident.
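
The "worked by accident" scenario can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: struct widget, widget_looks_idle() and poisoned_alloc() are hypothetical, and the 0x6b fill byte is just a non-zero pattern picked for this example. The point is only that code which never initializes a structure may pass as long as the allocator happens to hand back zeroed memory, and breaks as soon as it does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0x6b	/* arbitrary non-zero fill pattern for this sketch */

/* Hypothetical object whose fields are expected to start out as zero/NULL. */
struct widget {
	int ready;
	void *owner;
};

/* Allocate memory and fill it with a non-zero pattern, like a debug allocator might. */
static void *poisoned_alloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		memset(p, POISON_BYTE, size);
	return p;
}

/* Consumer that silently relies on the object having been zeroed. */
static int widget_looks_idle(const struct widget *w)
{
	return w->ready == 0 && w->owner == NULL;
}
```

With calloc() the missing initialization stays hidden; with poisoned_alloc() the same consumer sees garbage immediately, which is roughly what debug options such as KASAN or lock debugging make visible in the kernel.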
* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Bert Karwatzki @ 2026-05-21 11:50 UTC
To: Mateusz Guzik
Cc: Thomas Gleixner, spasswolf, Christian Brauner, spasswolf, linux-kernel, linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro, Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

>
> Can you please do the following:
> 1. go back to the known crashing-tag, add my fix, verify you still get
> the amd splat and then try out the fix provided by Thomas
> 2. regardless if the above helps, can you boot a kernel built with
> CONFIG_KASAN=y
>
> fwiw I verified my patch works fine with KASAN, including by
> intentionally miscalculating the size of the target buffer and seeing
> a nice splat from it so I'm confident I'm not corrupting anything.
> However, as there are new mallocs + free flying around at early boot,
> it is *plausible* amd was getting zeroed memory without asking for it
> and it worked by accident.

I think the warning from amdgpu is only displayed with CONFIG_LOCKDEP=y, so
your "improved fix" does not silence the warning from amdgpu.

The additional fix from Thomas fixes the amdgpu warning.

I also built the kernel with CONFIG_KASAN and get no error messages.

Bert Karwatzki
* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Mateusz Guzik @ 2026-05-21 12:01 UTC
To: Bert Karwatzki
Cc: Thomas Gleixner, Christian Brauner, linux-kernel, linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro, Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thu, May 21, 2026 at 1:51 PM Bert Karwatzki <spasswolf@web.de> wrote:
>
> >
> > Can you please do the following:
> > 1. go back to the known crashing-tag, add my fix, verify you still get
> > the amd splat and then try out the fix provided by Thomas
> > 2. regardless if the above helps, can you boot a kernel built with
> > CONFIG_KASAN=y
> >
> > fwiw I verified my patch works fine with KASAN, including by
> > intentionally miscalculating the size of the target buffer and seeing
> > a nice splat from it so I'm confident I'm not corrupting anything.
> > However, as there are new mallocs + free flying around at early boot,
> > it is *plausible* amd was getting zeroed memory without asking for it
> > and it worked by accident.
>
> I think the warning from amdgpu is only displayed with CONFIG_LOCKDEP=y, so
> your "improved fix" does not silence the warning from amdgpu.
>
> The additional fix from Thomas fixes the amdgpu warning.
>

I just wanted to confirm my patch does not *cause* issues, at worst
uncovers them.

> I also built the kernel with CONFIG_KASAN and get no error messages.
>

nice

I presume Thomas will handle getting the amdgpu patch to the right
people, I think it will be fine to drop all the mailing lists and the
cc's. :-)

So overall I think we are done here.

Thank you for testing and sorry for the breakage.
* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Bert Karwatzki @ 2026-05-28 17:59 UTC
To: Mateusz Guzik
Cc: Thomas Gleixner, spasswolf, Christian Brauner, linux-kernel, linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro, Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thursday, 21.05.2026 at 14:01 +0200, Mateusz Guzik wrote:
>
> So overall I think we are done here.
>
> Thank you for testing and sorry for the breakage.

Just as a reminder, this has not been fixed in linux-next yet,
up to version next-20260528.

Bert Karwatzki
* Re: context switch within RCU read-side critical section in next-20260518+ with PREEMPT_RT
From: Mateusz Guzik @ 2026-05-29 17:20 UTC
To: Bert Karwatzki
Cc: Thomas Gleixner, Christian Brauner, linux-kernel, linux-next, linux-rt-devel, linux-fsdevel, adobriyan, jack, viro, Sebastian Andrzej Siewior, Alex Deucher, amd-gfx

On Thu, May 28, 2026 at 7:59 PM Bert Karwatzki <spasswolf@web.de> wrote:
>
> On Thursday, 21.05.2026 at 14:01 +0200, Mateusz Guzik wrote:
> >
> > So overall I think we are done here.
> >
> > Thank you for testing and sorry for the breakage.
>
> Just as a reminder, this has not been fixed in linux-next, yet,
> up to version next-20260528.
>

I sent a v4 of the patchset with some extra touch ups:
https://lore.kernel.org/linux-fsdevel/20260529171840.2576445-1-mjguzik@gmail.com/T/#t
* [PATCH] amd/amdkfd: Initialize kfd_dev::profiler lock early
From: Thomas Gleixner @ 2026-05-21 12:55 UTC
To: Bert Karwatzki
Cc: linux-kernel, Alex Deucher, amd-gfx, Felix Kuehling

Bert reported the following lockdep splat:

  DEBUG_LOCKS_WARN_ON(lock->magic != lock)
  WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331
  RIP: 0010:__mutex_lock+0x58d/0x10c0

  init_mqd+0x122/0x190 [amdgpu]
  init_mqd_hiq+0xd/0x20 [amdgpu]
  kq_initialize.constprop.0+0x2b8/0x370 [amdgpu]
  kernel_queue_init+0x3f/0x60 [amdgpu]
  pm_init+0x6b/0x100 [amdgpu]
  start_cpsch+0x1d6/0x270 [amdgpu]
  kgd2kfd_device_init.cold+0x7b9/0xa1a [amdgpu]
  amdgpu_amdkfd_device_init+0x190/0x260 [amdgpu]
  amdgpu_device_init.cold+0x1952/0x1c79 [amdgpu]
  amdgpu_driver_load_kms+0x14/0x80 [amdgpu]

Some implementations of init_mqd() acquire kfd_dev->profiler_lock, which is
initialized in kgd2kfd_device_init() after init_mqd() was invoked via the
above callchain. So init_mqd() tries to lock an uninitialized mutex.

Move the initialization to the beginning of kgd2kfd_device_init() to cure
that.

Fixes: a789761de305 ("amd/amdkfd: Add kfd_ioctl_profiler to contain profiler kernel driver changes")
Reported-by: Bert Karwatzki <spasswolf@web.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Benjamin Welton <bewelton@amd.com>
Closes: https://lore.kernel.org/lkml/4f548d61b2dd12e01f401ce4b8c865f238f7b23c.camel@web.de/
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
 			KGD_ENGINE_SDMA1);
 	kfd->shared_resources = *gpu_resources;
 
+	kfd->profiler_process = NULL;
+	mutex_init(&kfd->profiler_lock);
+
 	kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
 
 	if (kfd->num_nodes == 0) {
@@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
 
 	svm_range_set_max_pages(kfd->adev);
 
-	kfd->profiler_process = NULL;
-	mutex_init(&kfd->profiler_lock);
-
 	kfd->init_complete = true;
 	dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
 		 kfd->adev->pdev->device);