* Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex (was: 6.9.3 Hung tasks) [not found] ` <ZliHhebSGQYZ/0S0@shell.armlinux.org.uk> @ 2024-05-31 8:39 ` Linux regression tracking (Thorsten Leemhuis) 2024-05-31 9:50 ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede 0 siblings, 1 reply; 17+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2024-05-31 8:39 UTC (permalink / raw) To: Pavel Machek, Lee Jones, Linux LEDs Cc: linux-kernel, netdev, andrew, hkallweit1, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists, Linux kernel regressions list [adding the LED folks and the regressions list to the list of recipients] Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting for once, to make this easily accessible to everyone. Lee, Pavel, could you look into below regression report please? Thread starts here: https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ Another report with somewhat similar symptom can be found here: https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ See also Russell's analysis of that report below (many many thx for that, much appreciated Russel!). To my untrained eyes all of this sounds a lot like we still have a 6.9 regression related to the LED code somewhere. Reminder, we had earlier trouble, but that was avoided through other measures: * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") / https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/ * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") / https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/ * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal") That iwlwifi commit even calls it self "work around". The developer that submitted it bisected the problem to a LED merge, but sadly that was the end of it. :-/ Ciao, Thorsten On 30.05.24 16:04, Russell King (Oracle) wrote: > On Thu, May 30, 2024 at 09:36:45AM -0400, Genes Lists wrote: >> On Thu, 2024-05-30 at 08:53 -0400, Genes Lists wrote: >> This report for 6.9.1 could well be the same issue: >> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ > > The reg_check_chans_work() thing in pid 285 is likely stuck on the > rtnl lock. The same is true of pid 287. > > That will be because of the thread (pid 663) that's stuck in > __dev_open()...led_trigger_register(), where the rtnl lock will have > been taken in that path. It looks to me like led_trigger_register() > is stuck waiting for read access with the leds_list_lock rwsem. > > There are only two places that take that rwsem in write mode, which > are led_classdev_register_ext() and led_classdev_unregister(). None > of these paths are blocking in v6.9. > > Pid 641 doesn't look significant (its probably waiting for either > pid 285 or 287 to complete its work.) > > Pid 666 looks like it is blocked waiting for exclusive write-access > on the leds_list_lock - but it isn't holding that lock. This means > there must already be some other reader or writer holding this lock. > > Pid 722 doesn't look sigificant (same as pid 641). > > Pid 760 is also waiting for the rtnl lock. > > Pid 854, 855 also doesn't look sigificant (as pid 641). > > And then we get to pid 858. This is in set_device_name(), which > was called from led_trigger_set() and led_trigger_register(). > We know from pid 663 that led_trigger_register() can take a read > on leds_list_lock, and indeed it does and then calls > led_match_default_trigger(), which then goes on to call > led_trigger_set(). Bingo, this is why pid 666 is blocked, which > then blocks pid 663. pid 663 takes the rtnl lock, which blocks > everything else _and_ also blocks pid 858 in set_device_name(). > > Lockdep would've found this... this is a classic AB-BA deadlock > between the leds_list_lock rwsem and the rtnl mutex. > > I haven't checked to see how that deadlock got introduced, that's > for someone else to do. P.S.: #regzbot report: / #regzbot introduced: f5c31bcf604d #regzbot duplicate: https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ #regzbot summary: leds: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 8:39 ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex (was: 6.9.3 Hung tasks) Linux regression tracking (Thorsten Leemhuis) @ 2024-05-31 9:50 ` Hans de Goede 2024-05-31 10:22 ` Hans de Goede ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Hans de Goede @ 2024-05-31 9:50 UTC (permalink / raw) To: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit Cc: linux-kernel, netdev, andrew, hkallweit1, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists Hi, On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote: > [adding the LED folks and the regressions list to the list of recipients] > > Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting > for once, to make this easily accessible to everyone. > > Lee, Pavel, could you look into below regression report please? Thread > starts here: > https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ > > Another report with somewhat similar symptom can be found here: > https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ > > See also Russell's analysis of that report below (many many thx for > that, much appreciated Russel!). > > To my untrained eyes all of this sounds a lot like we still have a 6.9 > regression related to the LED code somewhere. Reminder, we had earlier > trouble, but that was avoided through other measures: > > * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") / > https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/ > > * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") / > https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/ > > * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal") > > That iwlwifi commit even calls it self "work around". The developer that > submitted it bisected the problem to a LED merge, but sadly that was the > end of it. :-/ I actually have been looking at a ledtrig-netdev lockdep warning yesterday which I believe is the same thing. I'll include the lockdep trace below. According to lockdep there indeed is a ABBA (ish) cyclic deadlock with the rtnl mutex vs led-triggers related locks. I believe that this problem may be a pre-existing problem but this now actually gets hit in kernels >= 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is given, make hw_control trigger the default trigger"). Before that commit the "netdev" trigger would not be bound / set as phy LEDs trigger by default. +Cc Heiner Kallweit who authored that commit. The netdev trigger typically is not needed because the PHY LEDs are typically under hw-control and the netdev trigger even tries to leave things that way so setting it as the active trigger for the LED class device is basically a no-op. I guess the goal of that commit is correctly have the triggers file content reflect that the LED is controlled by a netdev and to allow changing the hw-control mode without the user first needing to set netdev as trigger before being able to change the mode. But there is a price to this, besides the locking problem this also causes the ledtrig-netdev module to load on pretty much everyones systems (when build as a module) even though 99.999% of our users likely does not need this at all... Given this price and the troubles this is causing I think it might be best to revert 66601a29bb23. There might still be a locking issue when setting the trigger to netdev manually (I'll check and follow up) but this should fix the regression users are hitting since typically users do not set the trigger manually. Gene, as the original reporter of this can you do "modinfo ledtrig_netdev" and if this shows that ledtrig_netdev is a module for you try blacklisting ledtrig_netdev ? And if it is not a module can you try building a 6.9 kernel with commit 66601a29bb23 reverted and see if that helps ? Regards, Hans Here is the promised lockdep report: [ 73.959672] ====================================================== [ 73.959675] WARNING: possible circular locking dependency detected [ 73.959677] 6.10.0-rc1+ #94 Not tainted [ 73.959680] ------------------------------------------------------ [ 73.959682] NetworkManager/1815 is trying to acquire lock: [ 73.959685] ffffffffb2145170 (triggers_list_lock){++++}-{3:3}, at: led_trigger_register+0x40/0x1b0 [ 73.959695] but task is already holding lock: [ 73.959697] ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630 [ 73.959706] which lock already depends on the new lock. [ 73.959708] the existing dependency chain (in reverse order) is: [ 73.959710] -> #2 (rtnl_mutex){+.+.}-{3:3}: [ 73.959716] __mutex_lock+0x8c/0xc10 [ 73.959720] set_device_name+0x2d/0x140 [ledtrig_netdev] [ 73.959725] netdev_trig_activate+0x197/0x210 [ledtrig_netdev] [ 73.959728] led_trigger_set+0x1e1/0x2e0 [ 73.959731] led_trigger_register+0x170/0x1b0 [ 73.959734] do_one_initcall+0x5e/0x3a0 [ 73.959738] do_init_module+0x60/0x220 [ 73.959741] __do_sys_init_module+0x15f/0x190 [ 73.959745] do_syscall_64+0x93/0x180 [ 73.959748] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 73.959752] -> #1 (&led_cdev->trigger_lock){+.+.}-{3:3}: [ 73.959758] down_write+0x3b/0xd0 [ 73.959761] led_trigger_set_default+0x34/0xe0 [ 73.959764] led_classdev_register_ext+0x311/0x3a0 [ 73.959767] input_leds_connect+0x139/0x260 [ 73.959770] input_attach_handler.isra.0+0x75/0x90 [ 73.959773] input_register_device.cold+0xa1/0x150 [ 73.959776] hidinput_connect+0x848/0xb00 [ 73.959779] hid_connect+0x567/0x5a0 [ 73.959783] hid_hw_start+0x3f/0x60 [ 73.959787] hid_device_probe+0x10d/0x190 [ 73.959789] really_probe+0xde/0x340 [ 73.959793] __driver_probe_device+0x78/0x110 [ 73.959796] driver_probe_device+0x1f/0xa0 [ 73.959798] __device_attach_driver+0x85/0x110 [ 73.959801] bus_for_each_drv+0x78/0xc0 [ 73.959805] __device_attach+0xb0/0x1b0 [ 73.959808] bus_probe_device+0x94/0xb0 [ 73.959810] device_add+0x64a/0x860 [ 73.959814] hid_add_device+0xe5/0x240 [ 73.959817] usbhid_probe+0x4bb/0x600 [ 73.959821] usb_probe_interface+0xea/0x2b0 [ 73.959824] really_probe+0xde/0x340 [ 73.959827] __driver_probe_device+0x78/0x110 [ 73.959830] driver_probe_device+0x1f/0xa0 [ 73.959833] __device_attach_driver+0x85/0x110 [ 73.959835] bus_for_each_drv+0x78/0xc0 [ 73.959839] __device_attach+0xb0/0x1b0 [ 73.959842] bus_probe_device+0x94/0xb0 [ 73.959844] device_add+0x64a/0x860 [ 73.959847] usb_set_configuration+0x5e8/0x880 [ 73.959850] usb_generic_driver_probe+0x3e/0x60 [ 73.959854] usb_probe_device+0x3d/0x120 [ 73.959857] really_probe+0xde/0x340 [ 73.959859] __driver_probe_device+0x78/0x110 [ 73.959862] driver_probe_device+0x1f/0xa0 [ 73.959865] __device_attach_driver+0x85/0x110 [ 73.959868] bus_for_each_drv+0x78/0xc0 [ 73.959871] __device_attach+0xb0/0x1b0 [ 73.959874] bus_probe_device+0x94/0xb0 [ 73.959876] device_add+0x64a/0x860 [ 73.959880] usb_new_device.cold+0x141/0x38f [ 73.959883] hub_event+0x1166/0x1980 [ 73.959886] process_one_work+0x21a/0x590 [ 73.959889] worker_thread+0x1d1/0x3e0 [ 73.959891] kthread+0xee/0x120 [ 73.959895] ret_from_fork+0x30/0x50 [ 73.959898] ret_from_fork_asm+0x1a/0x30 [ 73.959902] -> #0 (triggers_list_lock){++++}-{3:3}: [ 73.959908] __lock_acquire+0x11c6/0x1f20 [ 73.959911] lock_acquire+0xc8/0x2b0 [ 73.959914] down_write+0x3b/0xd0 [ 73.959916] led_trigger_register+0x40/0x1b0 [ 73.959919] phy_led_triggers_register+0xb4/0x240 [ 73.959923] phy_attach_direct+0x378/0x380 [ 73.959926] phy_connect_direct+0x21/0x70 [ 73.959929] rtl_open+0x2e1/0x490 [r8169] [ 73.959934] __dev_open+0xea/0x1b0 [ 73.959937] __dev_change_flags+0x201/0x240 [ 73.959939] dev_change_flags+0x22/0x60 [ 73.959942] do_setlink+0x32d/0x1070 [ 73.959945] __rtnl_newlink+0x516/0x9d0 [ 73.959948] rtnl_newlink+0x43/0x70 [ 73.959951] rtnetlink_rcv_msg+0x159/0x630 [ 73.959953] netlink_rcv_skb+0x4f/0x100 [ 73.959956] netlink_unicast+0x18c/0x260 [ 73.959959] netlink_sendmsg+0x207/0x420 [ 73.959962] ____sys_sendmsg+0x364/0x3a0 [ 73.959965] ___sys_sendmsg+0x84/0xd0 [ 73.959967] __sys_sendmsg+0x8e/0xd0 [ 73.959971] do_syscall_64+0x93/0x180 [ 73.959974] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 73.959977] other info that might help us debug this: [ 73.959979] Chain exists of: triggers_list_lock --> &led_cdev->trigger_lock --> rtnl_mutex [ 73.959987] Possible unsafe locking scenario: [ 73.959989] CPU0 CPU1 [ 73.959991] ---- ---- [ 73.959993] lock(rtnl_mutex); [ 73.959996] lock(&led_cdev->trigger_lock); [ 73.960000] lock(rtnl_mutex); [ 73.960004] lock(triggers_list_lock); [ 73.960007] *** DEADLOCK *** [ 73.960009] 1 lock held by NetworkManager/1815: [ 73.960012] #0: ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630 [ 73.960020] stack backtrace: [ 73.960023] CPU: 5 PID: 1815 Comm: NetworkManager Not tainted 6.10.0-rc1+ #94 [ 73.960026] Hardware name: Micro-Star International Co., Ltd. MS-7C95/B550M PRO-VDH WIFI (MS-7C95), BIOS 2.J0 01/02/2024 [ 73.960029] Call Trace: [ 73.960031] <TASK> [ 73.960034] dump_stack_lvl+0x68/0x90 [ 73.960039] check_noncircular+0x10d/0x120 [ 73.960044] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960048] ? __lock_acquire+0xc19/0x1f20 [ 73.960055] __lock_acquire+0x11c6/0x1f20 [ 73.960062] lock_acquire+0xc8/0x2b0 [ 73.960065] ? led_trigger_register+0x40/0x1b0 [ 73.960073] down_write+0x3b/0xd0 [ 73.960076] ? led_trigger_register+0x40/0x1b0 [ 73.960079] led_trigger_register+0x40/0x1b0 [ 73.960083] phy_led_triggers_register+0xb4/0x240 [ 73.960089] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960091] ? schedule_timeout+0xc1/0x1b0 [ 73.960095] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960097] ? find_held_lock+0x2b/0x80 [ 73.960101] ? phy_attach_direct+0x1b0/0x380 [ 73.960105] ? phy_attach_direct+0x1b0/0x380 [ 73.960108] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960110] ? lock_release+0x169/0x2b0 [ 73.960115] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960118] ? __mutex_unlock_slowpath+0x2d/0x260 [ 73.960124] phy_attach_direct+0x378/0x380 [ 73.960128] ? __pfx_r8169_phylink_handler+0x10/0x10 [r8169] [ 73.960134] phy_connect_direct+0x21/0x70 [ 73.960139] rtl_open+0x2e1/0x490 [r8169] [ 73.960147] __dev_open+0xea/0x1b0 [ 73.960152] __dev_change_flags+0x201/0x240 [ 73.960155] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960160] dev_change_flags+0x22/0x60 [ 73.960165] do_setlink+0x32d/0x1070 [ 73.960172] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960175] ? __lock_acquire+0x3ce/0x1f20 [ 73.960178] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960180] ? __nla_validate_parse+0x52/0xc70 [ 73.960185] ? __entry_text_end+0x1025c9/0x1025cd [ 73.960190] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960193] ? __lock_acquire+0x3ce/0x1f20 [ 73.960199] __rtnl_newlink+0x516/0x9d0 [ 73.960204] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960211] ? __entry_text_end+0x1025c9/0x1025cd [ 73.960214] ? rcu_is_watching+0xd/0x40 [ 73.960217] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960220] ? kmalloc_trace_noprof+0x25c/0x320 [ 73.960224] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960230] rtnl_newlink+0x43/0x70 [ 73.960235] rtnetlink_rcv_msg+0x159/0x630 [ 73.960241] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 73.960246] netlink_rcv_skb+0x4f/0x100 [ 73.960257] netlink_unicast+0x18c/0x260 [ 73.960262] netlink_sendmsg+0x207/0x420 [ 73.960270] ____sys_sendmsg+0x364/0x3a0 [ 73.960273] ? import_iovec+0x17/0x20 [ 73.960277] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960280] ? copy_msghdr_from_user+0xd9/0x150 [ 73.960286] ___sys_sendmsg+0x84/0xd0 [ 73.960292] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960294] ? __lock_acquire+0x3ce/0x1f20 [ 73.960299] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960304] ? find_held_lock+0x2b/0x80 [ 73.960307] ? __fget_files+0xc3/0x190 [ 73.960312] ? __fget_files+0xc3/0x190 [ 73.960315] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960317] ? lock_release+0x169/0x2b0 [ 73.960322] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960327] __sys_sendmsg+0x8e/0xd0 [ 73.960337] do_syscall_64+0x93/0x180 [ 73.960340] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960345] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960349] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960352] ? find_held_lock+0x2b/0x80 [ 73.960356] ? __memcg_slab_free_hook+0x56/0x230 [ 73.960360] ? __memcg_slab_free_hook+0x56/0x230 [ 73.960363] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960366] ? lock_release+0x169/0x2b0 [ 73.960370] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960373] ? __memcg_slab_free_hook+0x60/0x230 [ 73.960378] ? kmem_cache_free+0x144/0x430 [ 73.960382] ? syscall_exit_to_user_mode+0x11/0x280 [ 73.960387] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960390] ? do_syscall_64+0x9f/0x180 [ 73.960393] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960396] ? lockdep_hardirqs_on+0x78/0x100 [ 73.960400] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960403] ? do_syscall_64+0x9f/0x180 [ 73.960406] ? fd_install+0xba/0x310 [ 73.960411] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960414] ? kmem_cache_free+0x144/0x430 [ 73.960420] ? do_sys_openat2+0x64/0xb0 [ 73.960423] ? syscall_exit_to_user_mode+0x11/0x280 [ 73.960428] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960431] ? do_syscall_64+0x9f/0x180 [ 73.960435] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960437] ? lockdep_hardirqs_on+0x78/0x100 [ 73.960441] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960444] ? do_syscall_64+0x9f/0x180 [ 73.960448] ? srso_alias_return_thunk+0x5/0xfbef5 [ 73.960453] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 73.960457] RIP: 0033:0x7f857a9db79b [ 73.960461] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 49 7a f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 a1 7a f7 ff 48 8b [ 73.960464] RSP: 002b:00007ffcf5d3a660 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [ 73.960468] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f857a9db79b [ 73.960471] RDX: 0000000000000000 RSI: 00007ffcf5d3a6a0 RDI: 000000000000000d [ 73.960473] RBP: 00007ffcf5d3a680 R08: 0000000000000000 R09: 0000000000000000 [ 73.960476] R10: 0000000000000000 R11: 0000000000000293 R12: 00005632de03d9d0 [ 73.960478] R13: 000000000000000a R14: 00007ffcf5d3a83c R15: 0000000000000000 [ 73.960488] </TASK> [ 73.960561] Generic FE-GE Realtek PHY r8169-0-2a00:00: attached PHY driver (mii_bus:phy_addr=r8169-0-2a00:00, irq=MAC) > On 30.05.24 16:04, Russell King (Oracle) wrote: >> On Thu, May 30, 2024 at 09:36:45AM -0400, Genes Lists wrote: >>> On Thu, 2024-05-30 at 08:53 -0400, Genes Lists wrote: >>> This report for 6.9.1 could well be the same issue: >>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ >> >> The reg_check_chans_work() thing in pid 285 is likely stuck on the >> rtnl lock. The same is true of pid 287. >> >> That will be because of the thread (pid 663) that's stuck in >> __dev_open()...led_trigger_register(), where the rtnl lock will have >> been taken in that path. It looks to me like led_trigger_register() >> is stuck waiting for read access with the leds_list_lock rwsem. >> >> There are only two places that take that rwsem in write mode, which >> are led_classdev_register_ext() and led_classdev_unregister(). None >> of these paths are blocking in v6.9. >> >> Pid 641 doesn't look significant (its probably waiting for either >> pid 285 or 287 to complete its work.) >> >> Pid 666 looks like it is blocked waiting for exclusive write-access >> on the leds_list_lock - but it isn't holding that lock. This means >> there must already be some other reader or writer holding this lock. >> >> Pid 722 doesn't look sigificant (same as pid 641). >> >> Pid 760 is also waiting for the rtnl lock. >> >> Pid 854, 855 also doesn't look sigificant (as pid 641). >> >> And then we get to pid 858. This is in set_device_name(), which >> was called from led_trigger_set() and led_trigger_register(). >> We know from pid 663 that led_trigger_register() can take a read >> on leds_list_lock, and indeed it does and then calls >> led_match_default_trigger(), which then goes on to call >> led_trigger_set(). Bingo, this is why pid 666 is blocked, which >> then blocks pid 663. pid 663 takes the rtnl lock, which blocks >> everything else _and_ also blocks pid 858 in set_device_name(). >> >> Lockdep would've found this... this is a classic AB-BA deadlock >> between the leds_list_lock rwsem and the rtnl mutex. >> >> I haven't checked to see how that deadlock got introduced, that's >> for someone else to do. > > P.S.: > > #regzbot report: / > #regzbot introduced: f5c31bcf604d > #regzbot duplicate: > https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ > #regzbot summary: leds: Hung tasks due to a AB-BA deadlock between the > leds_list_lock rwsem and the rtnl mutex > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 9:50 ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede @ 2024-05-31 10:22 ` Hans de Goede 2024-06-01 20:05 ` Hans de Goede 2024-05-31 11:55 ` Genes Lists 2024-05-31 12:54 ` Andrew Lunn 2 siblings, 1 reply; 17+ messages in thread From: Hans de Goede @ 2024-05-31 10:22 UTC (permalink / raw) To: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit Cc: linux-kernel, netdev, andrew, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists Hi, On 5/31/24 11:50 AM, Hans de Goede wrote: > Hi, > > On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote: >> [adding the LED folks and the regressions list to the list of recipients] >> >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting >> for once, to make this easily accessible to everyone. >> >> Lee, Pavel, could you look into below regression report please? Thread >> starts here: >> https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ >> >> Another report with somewhat similar symptom can be found here: >> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ >> >> See also Russell's analysis of that report below (many many thx for >> that, much appreciated Russel!). >> >> To my untrained eyes all of this sounds a lot like we still have a 6.9 >> regression related to the LED code somewhere. Reminder, we had earlier >> trouble, but that was avoided through other measures: >> >> * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") / >> https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/ >> >> * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") / >> https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/ >> >> * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal") >> >> That iwlwifi commit even calls it self "work around". The developer that >> submitted it bisected the problem to a LED merge, but sadly that was the >> end of it. :-/ > > I actually have been looking at a ledtrig-netdev lockdep warning yesterday > which I believe is the same thing. I'll include the lockdep trace below. > > According to lockdep there indeed is a ABBA (ish) cyclic deadlock with > the rtnl mutex vs led-triggers related locks. I believe that this problem > may be a pre-existing problem but this now actually gets hit in kernels >= > 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is > given, make hw_control trigger the default trigger"). Before that commit > the "netdev" trigger would not be bound / set as phy LEDs trigger by default. > > +Cc Heiner Kallweit who authored that commit. > > The netdev trigger typically is not needed because the PHY LEDs are typically > under hw-control and the netdev trigger even tries to leave things that way > so setting it as the active trigger for the LED class device is basically > a no-op. I guess the goal of that commit is correctly have the triggers > file content reflect that the LED is controlled by a netdev and to allow > changing the hw-control mode without the user first needing to set netdev > as trigger before being able to change the mode. > > But there is a price to this, besides the locking problem this also > causes the ledtrig-netdev module to load on pretty much everyones > systems (when build as a module) even though 99.999% of our users > likely does not need this at all... > > Given this price and the troubles this is causing I think it might be best > to revert 66601a29bb23. There might still be a locking issue when setting > the trigger to netdev manually (I'll check and follow up) but this should > fix the regression users are hitting since typically users do not set > the trigger manually. Ok, I can confirm that the lockdep warning is gone for me with 66601a29bb23 reverted. Unfortunately it does still happen after a "modprobe ledtrig_netdev" (to add it to the list of available triggers) and then setting the trigger for /sys/class/leds/enp42s0-0::lan to netdev manually. Still reverting 66601a29bb23 should avoid the problem getting triggered and this would seem like a safe fix especially for the 6.9 series and then the necessary time can be taken to fix the actual underlying locking issue which 66601a29bb23 exposes. > Gene, as the original reporter of this can you do "modinfo ledtrig_netdev" > and if this shows that ledtrig_netdev is a module for you try blacklisting > ledtrig_netdev ? And if it is not a module can you try building a 6.9 > kernel with commit 66601a29bb23 reverted and see if that helps ? Regards, Hans > Here is the promised lockdep report: > [ 73.959672] ====================================================== > [ 73.959675] WARNING: possible circular locking dependency detected > [ 73.959677] 6.10.0-rc1+ #94 Not tainted > [ 73.959680] ------------------------------------------------------ > [ 73.959682] NetworkManager/1815 is trying to acquire lock: > [ 73.959685] ffffffffb2145170 (triggers_list_lock){++++}-{3:3}, at: led_trigger_register+0x40/0x1b0 > [ 73.959695] > but task is already holding lock: > [ 73.959697] ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630 > [ 73.959706] > which lock already depends on the new lock. > > [ 73.959708] > the existing dependency chain (in reverse order) is: > [ 73.959710] > -> #2 (rtnl_mutex){+.+.}-{3:3}: > [ 73.959716] __mutex_lock+0x8c/0xc10 > [ 73.959720] set_device_name+0x2d/0x140 [ledtrig_netdev] > [ 73.959725] netdev_trig_activate+0x197/0x210 [ledtrig_netdev] > [ 73.959728] led_trigger_set+0x1e1/0x2e0 > [ 73.959731] led_trigger_register+0x170/0x1b0 > [ 73.959734] do_one_initcall+0x5e/0x3a0 > [ 73.959738] do_init_module+0x60/0x220 > [ 73.959741] __do_sys_init_module+0x15f/0x190 > [ 73.959745] do_syscall_64+0x93/0x180 > [ 73.959748] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 73.959752] > -> #1 (&led_cdev->trigger_lock){+.+.}-{3:3}: > [ 73.959758] down_write+0x3b/0xd0 > [ 73.959761] led_trigger_set_default+0x34/0xe0 > [ 73.959764] led_classdev_register_ext+0x311/0x3a0 > [ 73.959767] input_leds_connect+0x139/0x260 > [ 73.959770] input_attach_handler.isra.0+0x75/0x90 > [ 73.959773] input_register_device.cold+0xa1/0x150 > [ 73.959776] hidinput_connect+0x848/0xb00 > [ 73.959779] hid_connect+0x567/0x5a0 > [ 73.959783] hid_hw_start+0x3f/0x60 > [ 73.959787] hid_device_probe+0x10d/0x190 > [ 73.959789] really_probe+0xde/0x340 > [ 73.959793] __driver_probe_device+0x78/0x110 > [ 73.959796] driver_probe_device+0x1f/0xa0 > [ 73.959798] __device_attach_driver+0x85/0x110 > [ 73.959801] bus_for_each_drv+0x78/0xc0 > [ 73.959805] __device_attach+0xb0/0x1b0 > [ 73.959808] bus_probe_device+0x94/0xb0 > [ 73.959810] device_add+0x64a/0x860 > [ 73.959814] hid_add_device+0xe5/0x240 > [ 73.959817] usbhid_probe+0x4bb/0x600 > [ 73.959821] usb_probe_interface+0xea/0x2b0 > [ 73.959824] really_probe+0xde/0x340 > [ 73.959827] __driver_probe_device+0x78/0x110 > [ 73.959830] driver_probe_device+0x1f/0xa0 > [ 73.959833] __device_attach_driver+0x85/0x110 > [ 73.959835] bus_for_each_drv+0x78/0xc0 > [ 73.959839] __device_attach+0xb0/0x1b0 > [ 73.959842] bus_probe_device+0x94/0xb0 > [ 73.959844] device_add+0x64a/0x860 > [ 73.959847] usb_set_configuration+0x5e8/0x880 > [ 73.959850] usb_generic_driver_probe+0x3e/0x60 > [ 73.959854] usb_probe_device+0x3d/0x120 > [ 73.959857] really_probe+0xde/0x340 > [ 73.959859] __driver_probe_device+0x78/0x110 > [ 73.959862] driver_probe_device+0x1f/0xa0 > [ 73.959865] __device_attach_driver+0x85/0x110 > [ 73.959868] bus_for_each_drv+0x78/0xc0 > [ 73.959871] __device_attach+0xb0/0x1b0 > [ 73.959874] bus_probe_device+0x94/0xb0 > [ 73.959876] device_add+0x64a/0x860 > [ 73.959880] usb_new_device.cold+0x141/0x38f > [ 73.959883] hub_event+0x1166/0x1980 > [ 73.959886] process_one_work+0x21a/0x590 > [ 73.959889] worker_thread+0x1d1/0x3e0 > [ 73.959891] kthread+0xee/0x120 > [ 73.959895] ret_from_fork+0x30/0x50 > [ 73.959898] ret_from_fork_asm+0x1a/0x30 > [ 73.959902] > -> #0 (triggers_list_lock){++++}-{3:3}: > [ 73.959908] __lock_acquire+0x11c6/0x1f20 > [ 73.959911] lock_acquire+0xc8/0x2b0 > [ 73.959914] down_write+0x3b/0xd0 > [ 73.959916] led_trigger_register+0x40/0x1b0 > [ 73.959919] phy_led_triggers_register+0xb4/0x240 > [ 73.959923] phy_attach_direct+0x378/0x380 > [ 73.959926] phy_connect_direct+0x21/0x70 > [ 73.959929] rtl_open+0x2e1/0x490 [r8169] > [ 73.959934] __dev_open+0xea/0x1b0 > [ 73.959937] __dev_change_flags+0x201/0x240 > [ 73.959939] dev_change_flags+0x22/0x60 > [ 73.959942] do_setlink+0x32d/0x1070 > [ 73.959945] __rtnl_newlink+0x516/0x9d0 > [ 73.959948] rtnl_newlink+0x43/0x70 > [ 73.959951] rtnetlink_rcv_msg+0x159/0x630 > [ 73.959953] netlink_rcv_skb+0x4f/0x100 > [ 73.959956] netlink_unicast+0x18c/0x260 > [ 73.959959] netlink_sendmsg+0x207/0x420 > [ 73.959962] ____sys_sendmsg+0x364/0x3a0 > [ 73.959965] ___sys_sendmsg+0x84/0xd0 > [ 73.959967] __sys_sendmsg+0x8e/0xd0 > [ 73.959971] do_syscall_64+0x93/0x180 > [ 73.959974] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 73.959977] > other info that might help us debug this: > > [ 73.959979] Chain exists of: > triggers_list_lock --> &led_cdev->trigger_lock --> rtnl_mutex > > [ 73.959987] Possible unsafe locking scenario: > > [ 73.959989] CPU0 CPU1 > [ 73.959991] ---- ---- > [ 73.959993] lock(rtnl_mutex); > [ 73.959996] lock(&led_cdev->trigger_lock); > [ 73.960000] lock(rtnl_mutex); > [ 73.960004] lock(triggers_list_lock); > [ 73.960007] > *** DEADLOCK *** > > [ 73.960009] 1 lock held by NetworkManager/1815: > [ 73.960012] #0: ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630 > [ 73.960020] > stack backtrace: > [ 73.960023] CPU: 5 PID: 1815 Comm: NetworkManager Not tainted 6.10.0-rc1+ #94 > [ 73.960026] Hardware name: Micro-Star International Co., Ltd. MS-7C95/B550M PRO-VDH WIFI (MS-7C95), BIOS 2.J0 01/02/2024 > [ 73.960029] Call Trace: > [ 73.960031] <TASK> > [ 73.960034] dump_stack_lvl+0x68/0x90 > [ 73.960039] check_noncircular+0x10d/0x120 > [ 73.960044] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960048] ? __lock_acquire+0xc19/0x1f20 > [ 73.960055] __lock_acquire+0x11c6/0x1f20 > [ 73.960062] lock_acquire+0xc8/0x2b0 > [ 73.960065] ? led_trigger_register+0x40/0x1b0 > [ 73.960073] down_write+0x3b/0xd0 > [ 73.960076] ? led_trigger_register+0x40/0x1b0 > [ 73.960079] led_trigger_register+0x40/0x1b0 > [ 73.960083] phy_led_triggers_register+0xb4/0x240 > [ 73.960089] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960091] ? schedule_timeout+0xc1/0x1b0 > [ 73.960095] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960097] ? find_held_lock+0x2b/0x80 > [ 73.960101] ? phy_attach_direct+0x1b0/0x380 > [ 73.960105] ? phy_attach_direct+0x1b0/0x380 > [ 73.960108] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960110] ? lock_release+0x169/0x2b0 > [ 73.960115] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960118] ? __mutex_unlock_slowpath+0x2d/0x260 > [ 73.960124] phy_attach_direct+0x378/0x380 > [ 73.960128] ? __pfx_r8169_phylink_handler+0x10/0x10 [r8169] > [ 73.960134] phy_connect_direct+0x21/0x70 > [ 73.960139] rtl_open+0x2e1/0x490 [r8169] > [ 73.960147] __dev_open+0xea/0x1b0 > [ 73.960152] __dev_change_flags+0x201/0x240 > [ 73.960155] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960160] dev_change_flags+0x22/0x60 > [ 73.960165] do_setlink+0x32d/0x1070 > [ 73.960172] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960175] ? __lock_acquire+0x3ce/0x1f20 > [ 73.960178] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960180] ? __nla_validate_parse+0x52/0xc70 > [ 73.960185] ? __entry_text_end+0x1025c9/0x1025cd > [ 73.960190] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960193] ? __lock_acquire+0x3ce/0x1f20 > [ 73.960199] __rtnl_newlink+0x516/0x9d0 > [ 73.960204] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960211] ? __entry_text_end+0x1025c9/0x1025cd > [ 73.960214] ? rcu_is_watching+0xd/0x40 > [ 73.960217] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960220] ? kmalloc_trace_noprof+0x25c/0x320 > [ 73.960224] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960230] rtnl_newlink+0x43/0x70 > [ 73.960235] rtnetlink_rcv_msg+0x159/0x630 > [ 73.960241] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 > [ 73.960246] netlink_rcv_skb+0x4f/0x100 > [ 73.960257] netlink_unicast+0x18c/0x260 > [ 73.960262] netlink_sendmsg+0x207/0x420 > [ 73.960270] ____sys_sendmsg+0x364/0x3a0 > [ 73.960273] ? import_iovec+0x17/0x20 > [ 73.960277] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960280] ? copy_msghdr_from_user+0xd9/0x150 > [ 73.960286] ___sys_sendmsg+0x84/0xd0 > [ 73.960292] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960294] ? __lock_acquire+0x3ce/0x1f20 > [ 73.960299] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960304] ? find_held_lock+0x2b/0x80 > [ 73.960307] ? __fget_files+0xc3/0x190 > [ 73.960312] ? __fget_files+0xc3/0x190 > [ 73.960315] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960317] ? lock_release+0x169/0x2b0 > [ 73.960322] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960327] __sys_sendmsg+0x8e/0xd0 > [ 73.960337] do_syscall_64+0x93/0x180 > [ 73.960340] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960345] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960349] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960352] ? find_held_lock+0x2b/0x80 > [ 73.960356] ? __memcg_slab_free_hook+0x56/0x230 > [ 73.960360] ? __memcg_slab_free_hook+0x56/0x230 > [ 73.960363] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960366] ? lock_release+0x169/0x2b0 > [ 73.960370] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960373] ? __memcg_slab_free_hook+0x60/0x230 > [ 73.960378] ? kmem_cache_free+0x144/0x430 > [ 73.960382] ? syscall_exit_to_user_mode+0x11/0x280 > [ 73.960387] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960390] ? do_syscall_64+0x9f/0x180 > [ 73.960393] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960396] ? lockdep_hardirqs_on+0x78/0x100 > [ 73.960400] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960403] ? do_syscall_64+0x9f/0x180 > [ 73.960406] ? fd_install+0xba/0x310 > [ 73.960411] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960414] ? kmem_cache_free+0x144/0x430 > [ 73.960420] ? do_sys_openat2+0x64/0xb0 > [ 73.960423] ? syscall_exit_to_user_mode+0x11/0x280 > [ 73.960428] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960431] ? do_syscall_64+0x9f/0x180 > [ 73.960435] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960437] ? lockdep_hardirqs_on+0x78/0x100 > [ 73.960441] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960444] ? do_syscall_64+0x9f/0x180 > [ 73.960448] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 73.960453] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 73.960457] RIP: 0033:0x7f857a9db79b > [ 73.960461] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 49 7a f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 a1 7a f7 ff 48 8b > [ 73.960464] RSP: 002b:00007ffcf5d3a660 EFLAGS: 00000293 ORIG_RAX: 000000000000002e > [ 73.960468] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f857a9db79b > [ 73.960471] RDX: 0000000000000000 RSI: 00007ffcf5d3a6a0 RDI: 000000000000000d > [ 73.960473] RBP: 00007ffcf5d3a680 R08: 0000000000000000 R09: 0000000000000000 > [ 73.960476] R10: 0000000000000000 R11: 0000000000000293 R12: 00005632de03d9d0 > [ 73.960478] R13: 000000000000000a R14: 00007ffcf5d3a83c R15: 0000000000000000 > [ 73.960488] </TASK> > [ 73.960561] Generic FE-GE Realtek PHY r8169-0-2a00:00: attached PHY driver (mii_bus:phy_addr=r8169-0-2a00:00, irq=MAC) > > > > > > > > > > > >> On 30.05.24 16:04, Russell King (Oracle) wrote: >>> On Thu, May 30, 2024 at 09:36:45AM -0400, Genes Lists wrote: >>>> On Thu, 2024-05-30 at 08:53 -0400, Genes Lists wrote: >>>> This report for 6.9.1 could well be the same issue: >>>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ >>> >>> The reg_check_chans_work() thing in pid 285 is likely stuck on the >>> rtnl lock. The same is true of pid 287. >>> >>> That will be because of the thread (pid 663) that's stuck in >>> __dev_open()...led_trigger_register(), where the rtnl lock will have >>> been taken in that path. It looks to me like led_trigger_register() >>> is stuck waiting for read access with the leds_list_lock rwsem. >>> >>> There are only two places that take that rwsem in write mode, which >>> are led_classdev_register_ext() and led_classdev_unregister(). None >>> of these paths are blocking in v6.9. >>> >>> Pid 641 doesn't look significant (its probably waiting for either >>> pid 285 or 287 to complete its work.) >>> >>> Pid 666 looks like it is blocked waiting for exclusive write-access >>> on the leds_list_lock - but it isn't holding that lock. This means >>> there must already be some other reader or writer holding this lock. >>> >>> Pid 722 doesn't look sigificant (same as pid 641). >>> >>> Pid 760 is also waiting for the rtnl lock. >>> >>> Pid 854, 855 also doesn't look sigificant (as pid 641). >>> >>> And then we get to pid 858. This is in set_device_name(), which >>> was called from led_trigger_set() and led_trigger_register(). >>> We know from pid 663 that led_trigger_register() can take a read >>> on leds_list_lock, and indeed it does and then calls >>> led_match_default_trigger(), which then goes on to call >>> led_trigger_set(). Bingo, this is why pid 666 is blocked, which >>> then blocks pid 663. pid 663 takes the rtnl lock, which blocks >>> everything else _and_ also blocks pid 858 in set_device_name(). >>> >>> Lockdep would've found this... this is a classic AB-BA deadlock >>> between the leds_list_lock rwsem and the rtnl mutex. >>> >>> I haven't checked to see how that deadlock got introduced, that's >>> for someone else to do. >> >> P.S.: >> >> #regzbot report: / >> #regzbot introduced: f5c31bcf604d >> #regzbot duplicate: >> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ >> #regzbot summary: leds: Hung tasks due to a AB-BA deadlock between the >> leds_list_lock rwsem and the rtnl mutex >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 10:22 ` Hans de Goede @ 2024-06-01 20:05 ` Hans de Goede 0 siblings, 0 replies; 17+ messages in thread From: Hans de Goede @ 2024-06-01 20:05 UTC (permalink / raw) To: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit Cc: linux-kernel, netdev, andrew, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists Hi All, On 5/31/24 12:22 PM, Hans de Goede wrote: > Hi, > > On 5/31/24 11:50 AM, Hans de Goede wrote: >> Hi, >> >> On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote: >>> [adding the LED folks and the regressions list to the list of recipients] >>> >>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting >>> for once, to make this easily accessible to everyone. >>> >>> Lee, Pavel, could you look into below regression report please? Thread >>> starts here: >>> https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ >>> >>> Another report with somewhat similar symptom can be found here: >>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ >>> >>> See also Russell's analysis of that report below (many many thx for >>> that, much appreciated Russel!). >>> >>> To my untrained eyes all of this sounds a lot like we still have a 6.9 >>> regression related to the LED code somewhere. Reminder, we had earlier >>> trouble, but that was avoided through other measures: >>> >>> * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") / >>> https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/ >>> >>> * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") / >>> https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/ >>> >>> * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal") >>> >>> That iwlwifi commit even calls it self "work around". The developer that >>> submitted it bisected the problem to a LED merge, but sadly that was the >>> end of it. :-/ >> >> I actually have been looking at a ledtrig-netdev lockdep warning yesterday >> which I believe is the same thing. I'll include the lockdep trace below. >> >> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with >> the rtnl mutex vs led-triggers related locks. I believe that this problem >> may be a pre-existing problem but this now actually gets hit in kernels >= >> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is >> given, make hw_control trigger the default trigger"). Before that commit >> the "netdev" trigger would not be bound / set as phy LEDs trigger by default. >> >> +Cc Heiner Kallweit who authored that commit. >> >> The netdev trigger typically is not needed because the PHY LEDs are typically >> under hw-control and the netdev trigger even tries to leave things that way >> so setting it as the active trigger for the LED class device is basically >> a no-op. I guess the goal of that commit is correctly have the triggers >> file content reflect that the LED is controlled by a netdev and to allow >> changing the hw-control mode without the user first needing to set netdev >> as trigger before being able to change the mode. >> >> But there is a price to this, besides the locking problem this also >> causes the ledtrig-netdev module to load on pretty much everyones >> systems (when build as a module) even though 99.999% of our users >> likely does not need this at all... >> >> Given this price and the troubles this is causing I think it might be best >> to revert 66601a29bb23. There might still be a locking issue when setting >> the trigger to netdev manually (I'll check and follow up) but this should >> fix the regression users are hitting since typically users do not set >> the trigger manually. > > Ok, I can confirm that the lockdep warning is gone for me with 66601a29bb23 > reverted. Unfortunately it does still happen after a "modprobe ledtrig_netdev" > (to add it to the list of available triggers) and then setting the trigger > for /sys/class/leds/enp42s0-0::lan to netdev manually. > > Still reverting 66601a29bb23 should avoid the problem getting triggered > and this would seem like a safe fix especially for the 6.9 series and > then the necessary time can be taken to fix the actual underlying locking > issue which 66601a29bb23 exposes. I recently wrote a new input-events LED trigger: https://lore.kernel.org/linux-leds/20240531135910.168965-2-hdegoede@redhat.com/ and I just found out this has a very similar deadlock problem. It seems there it a generic problem with LEDs or LED triggers getting registered by subsystems while holding a global lock from that subsystem vs the activate / deactivate method of the trigger calling functions of that same subsystem which require that same global lock. I came up with a fix but that just fixed the activate() path leaving a similar deadlock in place in the deactivate path: https://lore.kernel.org/linux-leds/20240601195528.48308-1-hdegoede@redhat.com/ https://lore.kernel.org/linux-leds/20240601195528.48308-3-hdegoede@redhat.com/ So this seems to be a non trivial problem to solve. For the new input-events trigger I plan to solve this by switching to a single input_handler which will be registered at module_init() time instead of having a handler per LED and registering that handler at activate() time. But I have no idea how to solve this for the netdev trigger I just wanted to share my experience with the input-events trigger in case it is useful. Regards, Hans ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 9:50 ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede 2024-05-31 10:22 ` Hans de Goede @ 2024-05-31 11:55 ` Genes Lists 2024-05-31 12:54 ` Andrew Lunn 2 siblings, 0 replies; 17+ messages in thread From: Genes Lists @ 2024-05-31 11:55 UTC (permalink / raw) To: Hans de Goede, Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit Cc: linux-kernel, netdev, andrew, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle) [-- Attachment #1: Type: text/plain, Size: 2389 bytes --] On Fri, 2024-05-31 at 11:50 +0200, Hans de Goede wrote: > Hi, > > ... > I actually have been looking at a ledtrig-netdev lockdep warning > yesterday > which I believe is the same thing. I'll include the lockdep trace > below. > > According to lockdep there indeed is a ABBA (ish) cyclic deadlock > with > the rtnl mutex vs led-triggers related locks. I believe that this > problem > may be a pre-existing problem but this now actually gets hit in > kernels >= > 6.9 because of commit 66601a29bb23 ("leds: class: If no default > trigger is > given, make hw_control trigger the default trigger"). Before that > commit > the "netdev" trigger would not be bound / set as phy LEDs trigger by > default. > > +Cc Heiner Kallweit who authored that commit. > > The netdev trigger typically is not needed because the PHY LEDs are > typically > under hw-control and the netdev trigger even tries to leave things > that way > so setting it as the active trigger for the LED class device is > basically > a no-op. I guess the goal of that commit is correctly have the > triggers > file content reflect that the LED is controlled by a netdev and to > allow > changing the hw-control mode without the user first needing to set > netdev > as trigger before being able to change the mode. > > But there is a price to this, besides the locking problem this also > causes the ledtrig-netdev module to load on pretty much everyones > systems (when build as a module) even though 99.999% of our users > likely does not need this at all... > > Given this price and the troubles this is causing I think it might be > best > to revert 66601a29bb23. There might still be a locking issue when > setting > the trigger to netdev manually (I'll check and follow up) but this > should > fix the regression users are hitting since typically users do not set > the trigger manually. > > Gene, as the original reporter of this can you do "modinfo > ledtrig_netdev" > and if this shows that ledtrig_netdev is a module for you try > blacklisting > ledtrig_netdev ? And if it is not a module can you try building a > 6.9 > kernel with commit 66601a29bb23 reverted and see if that helps ? Thank you - I've blacklisted ledtrig_netdev and will report back if anything interesting happens. best gene > > > Regards, > > Hans -- Gene [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 9:50 ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede 2024-05-31 10:22 ` Hans de Goede 2024-05-31 11:55 ` Genes Lists @ 2024-05-31 12:54 ` Andrew Lunn 2024-05-31 13:11 ` Hans de Goede ` (2 more replies) 2 siblings, 3 replies; 17+ messages in thread From: Andrew Lunn @ 2024-05-31 12:54 UTC (permalink / raw) To: Hans de Goede Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists > I actually have been looking at a ledtrig-netdev lockdep warning yesterday > which I believe is the same thing. I'll include the lockdep trace below. > > According to lockdep there indeed is a ABBA (ish) cyclic deadlock with > the rtnl mutex vs led-triggers related locks. I believe that this problem > may be a pre-existing problem but this now actually gets hit in kernels >= > 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is > given, make hw_control trigger the default trigger"). Before that commit > the "netdev" trigger would not be bound / set as phy LEDs trigger by default. > > +Cc Heiner Kallweit who authored that commit. > > The netdev trigger typically is not needed because the PHY LEDs are typically > under hw-control and the netdev trigger even tries to leave things that way > so setting it as the active trigger for the LED class device is basically > a no-op. I guess the goal of that commit is correctly have the triggers > file content reflect that the LED is controlled by a netdev and to allow > changing the hw-control mode without the user first needing to set netdev > as trigger before being able to change the mode. It was not the intention that this triggers is loaded for all systems. It should only be those that actually have LEDs which can be controlled: drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; drivers/net/ethernet/intel/igc/igc_leds.c: led_cdev->hw_control_trigger = "netdev"; drivers/net/dsa/qca/qca8k-leds.c: port_led->cdev.hw_control_trigger = "netdev"; drivers/net/phy/phy_device.c: cdev->hw_control_trigger = "netdev"; Reverting this patch does seem like a good way forward, but i would also like to give Heiner a little bit of time to see if he has a quick real fix. Andrew ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 12:54 ` Andrew Lunn @ 2024-05-31 13:11 ` Hans de Goede 2024-05-31 13:29 ` Andrew Lunn 2024-06-06 12:01 ` Hans de Goede 2024-06-24 19:57 ` Heiner Kallweit 2 siblings, 1 reply; 17+ messages in thread From: Hans de Goede @ 2024-05-31 13:11 UTC (permalink / raw) To: Andrew Lunn Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists Hi, On 5/31/24 2:54 PM, Andrew Lunn wrote: >> I actually have been looking at a ledtrig-netdev lockdep warning yesterday >> which I believe is the same thing. I'll include the lockdep trace below. >> >> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with >> the rtnl mutex vs led-triggers related locks. I believe that this problem >> may be a pre-existing problem but this now actually gets hit in kernels >= >> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is >> given, make hw_control trigger the default trigger"). Before that commit >> the "netdev" trigger would not be bound / set as phy LEDs trigger by default. >> >> +Cc Heiner Kallweit who authored that commit. >> >> The netdev trigger typically is not needed because the PHY LEDs are typically >> under hw-control and the netdev trigger even tries to leave things that way >> so setting it as the active trigger for the LED class device is basically >> a no-op. I guess the goal of that commit is correctly have the triggers >> file content reflect that the LED is controlled by a netdev and to allow >> changing the hw-control mode without the user first needing to set netdev >> as trigger before being able to change the mode. > > It was not the intention that this triggers is loaded for all > systems. Right note there are really 2 separate issues (or 1 issue and one question) here: 1. The locking issue which this commit has exposed (but existed before) 2. If it is desirable to load / activate ledtrig-netdev by default on quite a lot of machines where it does not really gain us anything ? For now I think we should focus on 1. Still about 2: > It should only be those that actually have LEDs which can be > controlled: > > drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; > drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; > drivers/net/ethernet/intel/igc/igc_leds.c: led_cdev->hw_control_trigger = "netdev"; > drivers/net/dsa/qca/qca8k-leds.c: port_led->cdev.hw_control_trigger = "netdev"; > drivers/net/phy/phy_device.c: cdev->hw_control_trigger = "netdev"; Well those drivers combined, esp. with the generic phy_device in there does mean that the ledtrig-netdev module now gets loaded on a whole lot of x86 machines where before it would not. On one hand those machines are plenty powerful typically, so what is one more module. OTOH I don't think many users if any at all want to change the hwcontrol mode for those LEDs... > Reverting this patch does seem like a good way forward, but i would > also like to give Heiner a little bit of time to see if he has a quick > real fix. Ack. Regards, Hans ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 13:11 ` Hans de Goede @ 2024-05-31 13:29 ` Andrew Lunn 2024-05-31 13:36 ` Hans de Goede 0 siblings, 1 reply; 17+ messages in thread From: Andrew Lunn @ 2024-05-31 13:29 UTC (permalink / raw) To: Hans de Goede Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists > > drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; > > drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; > > drivers/net/ethernet/intel/igc/igc_leds.c: led_cdev->hw_control_trigger = "netdev"; > > drivers/net/dsa/qca/qca8k-leds.c: port_led->cdev.hw_control_trigger = "netdev"; > > drivers/net/phy/phy_device.c: cdev->hw_control_trigger = "netdev"; > > Well those drivers combined, esp. with the generic phy_device in there > does mean that the ledtrig-netdev module now gets loaded on a whole lot > of x86 machines where before it would not. phy_device will only do something if there is the needed Device Tree properties. Given that very few systems use DT on x86, that should not be an issue. So only x86 systems with r8169 and igc should have the trigger module loaded because of this. It would be good to understand why other systems have the trigger loaded. However, as you say, this will not fix the underlying deadlock, it will just limit it to systems with r8169 and igc... Andrew ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 13:29 ` Andrew Lunn @ 2024-05-31 13:36 ` Hans de Goede 0 siblings, 0 replies; 17+ messages in thread From: Hans de Goede @ 2024-05-31 13:36 UTC (permalink / raw) To: Andrew Lunn Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists Hi, On 5/31/24 3:29 PM, Andrew Lunn wrote: >>> drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; >>> drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; >>> drivers/net/ethernet/intel/igc/igc_leds.c: led_cdev->hw_control_trigger = "netdev"; >>> drivers/net/dsa/qca/qca8k-leds.c: port_led->cdev.hw_control_trigger = "netdev"; >>> drivers/net/phy/phy_device.c: cdev->hw_control_trigger = "netdev"; >> >> Well those drivers combined, esp. with the generic phy_device in there >> does mean that the ledtrig-netdev module now gets loaded on a whole lot >> of x86 machines where before it would not. > > phy_device will only do something if there is the needed Device Tree > properties. Given that very few systems use DT on x86, that should not > be an issue. That is good to know. > So only x86 systems with r8169 and igc should have the > trigger module loaded because of this. Those are very popular NICs though, so that is still a lot of systems. > It would be good to understand > why other systems have the trigger loaded. Actually my system has a RTL8168 ethernet NIC so the netdev trigger getting loaded there is expected. > However, as you say, this > will not fix the underlying deadlock, it will just limit it to systems with r8169 > and igc... Right, given on the above discussion I believe that it likely already is limited to systems with Realtek r8169 or Intel i225 / i226 NICs. Regards, Hans ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 12:54 ` Andrew Lunn 2024-05-31 13:11 ` Hans de Goede @ 2024-06-06 12:01 ` Hans de Goede 2024-06-06 13:12 ` Andrew Lunn 2024-06-24 19:57 ` Heiner Kallweit 2 siblings, 1 reply; 17+ messages in thread From: Hans de Goede @ 2024-06-06 12:01 UTC (permalink / raw) To: Andrew Lunn Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists Hi all, On 5/31/24 2:54 PM, Andrew Lunn wrote: >> I actually have been looking at a ledtrig-netdev lockdep warning yesterday >> which I believe is the same thing. I'll include the lockdep trace below. >> >> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with >> the rtnl mutex vs led-triggers related locks. I believe that this problem >> may be a pre-existing problem but this now actually gets hit in kernels >= >> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is >> given, make hw_control trigger the default trigger"). Before that commit >> the "netdev" trigger would not be bound / set as phy LEDs trigger by default. >> >> +Cc Heiner Kallweit who authored that commit. >> >> The netdev trigger typically is not needed because the PHY LEDs are typically >> under hw-control and the netdev trigger even tries to leave things that way >> so setting it as the active trigger for the LED class device is basically >> a no-op. I guess the goal of that commit is correctly have the triggers >> file content reflect that the LED is controlled by a netdev and to allow >> changing the hw-control mode without the user first needing to set netdev >> as trigger before being able to change the mode. > > It was not the intention that this triggers is loaded for all > systems. <snip> > Reverting this patch does seem like a good way forward, but i would > also like to give Heiner a little bit of time to see if he has a quick > real fix. So it has been almost a week and no reply from Heiner. Since this is causing real issues for users out there I think a revert of 66601a29bb23 should be submitted to Linus and then backported to the stable kernels. to fix the immediate issue at hand. Once the underlying locking issue which is the real root cause here is fixed then we can reconsider re-applying 66601a29bb23. Regards, Hans ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-06-06 12:01 ` Hans de Goede @ 2024-06-06 13:12 ` Andrew Lunn 2024-06-06 13:39 ` Jakub Kicinski 0 siblings, 1 reply; 17+ messages in thread From: Andrew Lunn @ 2024-06-06 13:12 UTC (permalink / raw) To: Hans de Goede Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists > So it has been almost a week and no reply from Heiner. Since this is > causing real issues for users out there I think a revert of 66601a29bb23 > should be submitted to Linus and then backported to the stable kernels. > to fix the immediate issue at hand. Agreed. Andrew ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-06-06 13:12 ` Andrew Lunn @ 2024-06-06 13:39 ` Jakub Kicinski 2024-06-06 14:25 ` Genes Lists 2024-06-07 10:19 ` Hans de Goede 0 siblings, 2 replies; 17+ messages in thread From: Jakub Kicinski @ 2024-06-06 13:39 UTC (permalink / raw) To: Andrew Lunn, Hans de Goede Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, pabeni, johanneswueller, Russell King (Oracle), Genes Lists On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote: > > So it has been almost a week and no reply from Heiner. Since this is > > causing real issues for users out there I think a revert of 66601a29bb23 > > should be submitted to Linus and then backported to the stable kernels. > > to fix the immediate issue at hand. > > Agreed. Please submit.. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-06-06 13:39 ` Jakub Kicinski @ 2024-06-06 14:25 ` Genes Lists 2024-06-07 10:22 ` Hans de Goede 2024-06-07 10:19 ` Hans de Goede 1 sibling, 1 reply; 17+ messages in thread From: Genes Lists @ 2024-06-06 14:25 UTC (permalink / raw) To: Jakub Kicinski, Andrew Lunn, Hans de Goede Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, pabeni, johanneswueller, Russell King (Oracle) [-- Attachment #1.1: Type: text/plain, Size: 697 bytes --] On Thu, 2024-06-06 at 06:39 -0700, Jakub Kicinski wrote: > On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote: > > > So it has been almost a week and no reply from Heiner. Since this > > > is > > > causing real issues for users out there I think a revert of > > > 66601a29bb23 > > > should be submitted to Linus and then backported to the stable > > > kernels. > > > to fix the immediate issue at hand. > > > > Agreed. > > Please submit.. I assume this deadlock is unrelated to the filesystem stalls reported here: https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/ but thought it best to ask. thank you. -- Gene [-- Attachment #1.2: Type: text/html, Size: 1700 bytes --] [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-06-06 14:25 ` Genes Lists @ 2024-06-07 10:22 ` Hans de Goede 2024-06-07 11:27 ` Genes Lists 0 siblings, 1 reply; 17+ messages in thread From: Hans de Goede @ 2024-06-07 10:22 UTC (permalink / raw) To: Genes Lists, Jakub Kicinski, Andrew Lunn Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, pabeni, johanneswueller, Russell King (Oracle) Hi Gene, On 6/6/24 4:25 PM, Genes Lists wrote: > On Thu, 2024-06-06 at 06:39 -0700, Jakub Kicinski wrote: >> On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote: >>>> So it has been almost a week and no reply from Heiner. Since this is >>>> causing real issues for users out there I think a revert of 66601a29bb23 >>>> should be submitted to Linus and then backported to the stable kernels. >>>> to fix the immediate issue at hand. >>> >>> Agreed. >> >> Please submit.. > > I assume this deadlock is unrelated to the filesystem stalls reported here: > > https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/ <https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/> Right that one is unrelated, but the revert which I just submitted should fix your other report: https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ I believe you have already worked around that one by blacklisting the ledtrig-netdev module. Regards, Hans ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-06-07 10:22 ` Hans de Goede @ 2024-06-07 11:27 ` Genes Lists 0 siblings, 0 replies; 17+ messages in thread From: Genes Lists @ 2024-06-07 11:27 UTC (permalink / raw) To: Hans de Goede, Jakub Kicinski, Andrew Lunn Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, pabeni, johanneswueller, Russell King (Oracle) [-- Attachment #1.1: Type: text/plain, Size: 710 bytes --] On Fri, 2024-06-07 at 12:22 +0200, Hans de Goede wrote: > Hi Gene, ... > > > > > > https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/ > > < > > https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/ > > > > > Right that one is unrelated, but the revert which I just submitted > should > fix your other report: > > https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ Very much appreciated thank you. > > I believe you have already worked around that one by blacklisting > the ledtrig-netdev module. Correct. > Thank you very much. -- Gene [-- Attachment #1.2: Type: text/html, Size: 2375 bytes --] [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-06-06 13:39 ` Jakub Kicinski 2024-06-06 14:25 ` Genes Lists @ 2024-06-07 10:19 ` Hans de Goede 1 sibling, 0 replies; 17+ messages in thread From: Hans de Goede @ 2024-06-07 10:19 UTC (permalink / raw) To: Jakub Kicinski, Andrew Lunn Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem, edumazet, pabeni, johanneswueller, Russell King (Oracle), Genes Lists Hi, On 6/6/24 3:39 PM, Jakub Kicinski wrote: > On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote: >>> So it has been almost a week and no reply from Heiner. Since this is >>> causing real issues for users out there I think a revert of 66601a29bb23 >>> should be submitted to Linus and then backported to the stable kernels. >>> to fix the immediate issue at hand. >> >> Agreed. > > Please submit.. Done: https://lore.kernel.org/linux-leds/20240607101847.23037-1-hdegoede@redhat.com/ Regards, Hans ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex 2024-05-31 12:54 ` Andrew Lunn 2024-05-31 13:11 ` Hans de Goede 2024-06-06 12:01 ` Hans de Goede @ 2024-06-24 19:57 ` Heiner Kallweit 2 siblings, 0 replies; 17+ messages in thread From: Heiner Kallweit @ 2024-06-24 19:57 UTC (permalink / raw) To: Andrew Lunn, Hans de Goede Cc: Linux regressions mailing list, Pavel Machek, Lee Jones, Linux LEDs, linux-kernel, netdev, davem, edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle), Genes Lists On 31.05.2024 14:54, Andrew Lunn wrote: >> I actually have been looking at a ledtrig-netdev lockdep warning yesterday >> which I believe is the same thing. I'll include the lockdep trace below. >> >> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with >> the rtnl mutex vs led-triggers related locks. I believe that this problem >> may be a pre-existing problem but this now actually gets hit in kernels >= >> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is >> given, make hw_control trigger the default trigger"). Before that commit >> the "netdev" trigger would not be bound / set as phy LEDs trigger by default. >> >> +Cc Heiner Kallweit who authored that commit. >> >> The netdev trigger typically is not needed because the PHY LEDs are typically >> under hw-control and the netdev trigger even tries to leave things that way >> so setting it as the active trigger for the LED class device is basically >> a no-op. I guess the goal of that commit is correctly have the triggers >> file content reflect that the LED is controlled by a netdev and to allow >> changing the hw-control mode without the user first needing to set netdev >> as trigger before being able to change the mode. > > It was not the intention that this triggers is loaded for all > systems. It should only be those that actually have LEDs which can be > controlled: > > drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; > drivers/net/ethernet/realtek/r8169_leds.c: led_cdev->hw_control_trigger = "netdev"; > drivers/net/ethernet/intel/igc/igc_leds.c: led_cdev->hw_control_trigger = "netdev"; > drivers/net/dsa/qca/qca8k-leds.c: port_led->cdev.hw_control_trigger = "netdev"; > drivers/net/phy/phy_device.c: cdev->hw_control_trigger = "netdev"; > > Reverting this patch does seem like a good way forward, but i would > also like to give Heiner a little bit of time to see if he has a quick > real fix. > Sorry, I was on vacation and can reply only now. I agree that a revert is appropriate because the actual issue that this change revealed seems to be non-trivial to fix. Drawback is that now a user who wants to control LEDs has to know that he first has to manually load the netdev trigger module and set the netdev trigger. That the netdev trigger module is now loaded by default on systems with r8169 I don't really see as an issue. I think it's normal that this is opt-out. If a user has a problem with this for whatever reason he has several options to prevent this: - disable R8169 LED support in kernel config - disable netdev trigger in kernel config - remove netdev trigger module - blacklist netdev trigger module > Andrew Heiner ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-06-24 19:56 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com>
[not found] ` <15a0bbd24cd01bd0b60b7047958a2e3ab556ea6f.camel@sapience.com>
[not found] ` <ZliHhebSGQYZ/0S0@shell.armlinux.org.uk>
2024-05-31 8:39 ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex (was: 6.9.3 Hung tasks) Linux regression tracking (Thorsten Leemhuis)
2024-05-31 9:50 ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede
2024-05-31 10:22 ` Hans de Goede
2024-06-01 20:05 ` Hans de Goede
2024-05-31 11:55 ` Genes Lists
2024-05-31 12:54 ` Andrew Lunn
2024-05-31 13:11 ` Hans de Goede
2024-05-31 13:29 ` Andrew Lunn
2024-05-31 13:36 ` Hans de Goede
2024-06-06 12:01 ` Hans de Goede
2024-06-06 13:12 ` Andrew Lunn
2024-06-06 13:39 ` Jakub Kicinski
2024-06-06 14:25 ` Genes Lists
2024-06-07 10:22 ` Hans de Goede
2024-06-07 11:27 ` Genes Lists
2024-06-07 10:19 ` Hans de Goede
2024-06-24 19:57 ` Heiner Kallweit
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).