linux-leds.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex (was: 6.9.3 Hung tasks)
       [not found]   ` <ZliHhebSGQYZ/0S0@shell.armlinux.org.uk>
@ 2024-05-31  8:39     ` Linux regression tracking (Thorsten Leemhuis)
  2024-05-31  9:50       ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede
  0 siblings, 1 reply; 17+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-05-31  8:39 UTC (permalink / raw)
  To: Pavel Machek, Lee Jones, Linux LEDs
  Cc: linux-kernel, netdev, andrew, hkallweit1, davem, edumazet, kuba,
	pabeni, johanneswueller, Russell King (Oracle), Genes Lists,
	Linux kernel regressions list

[adding the LED folks and the regressions list to the list of recipients]

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Lee, Pavel, could you look into below regression report please? Thread
starts here:
https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/

Another report with somewhat similar symptom can be found here:
https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/

See also Russell's analysis of that report below (many many thx for
that, much appreciated Russel!).

To my untrained eyes all of this sounds a lot like we still have a 6.9
regression related to the LED code somewhere. Reminder, we had earlier
trouble, but that was avoided through other measures:

* 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") /
https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/

* c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") /
https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/

* 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal")

That iwlwifi commit even calls it self "work around". The developer that
submitted it bisected the problem to a LED merge, but sadly that was the
end of it. :-/

Ciao, Thorsten

On 30.05.24 16:04, Russell King (Oracle) wrote:
> On Thu, May 30, 2024 at 09:36:45AM -0400, Genes Lists wrote:
>> On Thu, 2024-05-30 at 08:53 -0400, Genes Lists wrote:
>> This report for 6.9.1 could well be the same issue:
>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
> 
> The reg_check_chans_work() thing in pid 285 is likely stuck on the
> rtnl lock. The same is true of pid 287.
> 
> That will be because of the thread (pid 663) that's stuck in
> __dev_open()...led_trigger_register(), where the rtnl lock will have
> been taken in that path. It looks to me like led_trigger_register()
> is stuck waiting for read access with the leds_list_lock rwsem.
> 
> There are only two places that take that rwsem in write mode, which
> are led_classdev_register_ext() and led_classdev_unregister(). None
> of these paths are blocking in v6.9.
> 
> Pid 641 doesn't look significant (its probably waiting for either
> pid 285 or 287 to complete its work.)
> 
> Pid 666 looks like it is blocked waiting for exclusive write-access
> on the leds_list_lock - but it isn't holding that lock. This means
> there must already be some other reader or writer holding this lock.
> 
> Pid 722 doesn't look sigificant (same as pid 641).
> 
> Pid 760 is also waiting for the rtnl lock.
> 
> Pid 854, 855 also doesn't look sigificant (as pid 641).
> 
> And then we get to pid 858. This is in set_device_name(), which
> was called from led_trigger_set() and led_trigger_register().
> We know from pid 663 that led_trigger_register() can take a read
> on leds_list_lock, and indeed it does and then calls
> led_match_default_trigger(), which then goes on to call
> led_trigger_set(). Bingo, this is why pid 666 is blocked, which
> then blocks pid 663. pid 663 takes the rtnl lock, which blocks
> everything else _and_ also blocks pid 858 in set_device_name().
> 
> Lockdep would've found this... this is a classic AB-BA deadlock
> between the leds_list_lock rwsem and the rtnl mutex.
> 
> I haven't checked to see how that deadlock got introduced, that's
> for someone else to do.

P.S.:

#regzbot report: /
#regzbot introduced: f5c31bcf604d
#regzbot duplicate:
https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
#regzbot summary: leds: Hung tasks due to a AB-BA deadlock between the
leds_list_lock rwsem and the rtnl mutex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31  8:39     ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex (was: 6.9.3 Hung tasks) Linux regression tracking (Thorsten Leemhuis)
@ 2024-05-31  9:50       ` Hans de Goede
  2024-05-31 10:22         ` Hans de Goede
                           ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Hans de Goede @ 2024-05-31  9:50 UTC (permalink / raw)
  To: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit
  Cc: linux-kernel, netdev, andrew, hkallweit1, davem, edumazet, kuba,
	pabeni, johanneswueller, Russell King (Oracle), Genes Lists

Hi,

On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote:
> [adding the LED folks and the regressions list to the list of recipients]
> 
> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
> 
> Lee, Pavel, could you look into below regression report please? Thread
> starts here:
> https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/
> 
> Another report with somewhat similar symptom can be found here:
> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
> 
> See also Russell's analysis of that report below (many many thx for
> that, much appreciated Russel!).
> 
> To my untrained eyes all of this sounds a lot like we still have a 6.9
> regression related to the LED code somewhere. Reminder, we had earlier
> trouble, but that was avoided through other measures:
> 
> * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") /
> https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/
> 
> * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") /
> https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/
> 
> * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal")
> 
> That iwlwifi commit even calls it self "work around". The developer that
> submitted it bisected the problem to a LED merge, but sadly that was the
> end of it. :-/

I actually have been looking at a ledtrig-netdev lockdep warning yesterday
which I believe is the same thing. I'll include the lockdep trace below.

According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
the rtnl mutex vs led-triggers related locks. I believe that this problem
may be a pre-existing problem but this now actually gets hit in kernels >=
6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
given, make hw_control trigger the default trigger"). Before that commit
the "netdev" trigger would not be bound / set as phy LEDs trigger by default.

+Cc Heiner Kallweit who authored that commit.

The netdev trigger typically is not needed because the PHY LEDs are typically
under hw-control and the netdev trigger even tries to leave things that way
so setting it as the active trigger for the LED class device is basically
a no-op. I guess the goal of that commit is correctly have the triggers
file content reflect that the LED is controlled by a netdev and to allow
changing the hw-control mode without the user first needing to set netdev
as trigger before being able to change the mode.

But there is a price to this, besides the locking problem this also
causes the ledtrig-netdev module to load on pretty much everyones
systems (when build as a module) even though 99.999% of our users
likely does not need this at all...

Given this price and the troubles this is causing I think it might be best
to revert 66601a29bb23. There might still be a locking issue when setting
the trigger to netdev manually (I'll check and follow up) but this should
fix the regression users are hitting since typically users do not set
the trigger manually.

Gene, as the original reporter of this can you do "modinfo ledtrig_netdev"
and if this shows that ledtrig_netdev is a module for you try blacklisting
ledtrig_netdev ?  And if it is not a module can you try building a 6.9
kernel with commit 66601a29bb23 reverted and see if that helps ?

Regards,

Hans


Here is the promised lockdep report:
[   73.959672] ======================================================
[   73.959675] WARNING: possible circular locking dependency detected
[   73.959677] 6.10.0-rc1+ #94 Not tainted
[   73.959680] ------------------------------------------------------
[   73.959682] NetworkManager/1815 is trying to acquire lock:
[   73.959685] ffffffffb2145170 (triggers_list_lock){++++}-{3:3}, at: led_trigger_register+0x40/0x1b0
[   73.959695] 
               but task is already holding lock:
[   73.959697] ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630
[   73.959706] 
               which lock already depends on the new lock.

[   73.959708] 
               the existing dependency chain (in reverse order) is:
[   73.959710] 
               -> #2 (rtnl_mutex){+.+.}-{3:3}:
[   73.959716]        __mutex_lock+0x8c/0xc10
[   73.959720]        set_device_name+0x2d/0x140 [ledtrig_netdev]
[   73.959725]        netdev_trig_activate+0x197/0x210 [ledtrig_netdev]
[   73.959728]        led_trigger_set+0x1e1/0x2e0
[   73.959731]        led_trigger_register+0x170/0x1b0
[   73.959734]        do_one_initcall+0x5e/0x3a0
[   73.959738]        do_init_module+0x60/0x220
[   73.959741]        __do_sys_init_module+0x15f/0x190
[   73.959745]        do_syscall_64+0x93/0x180
[   73.959748]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   73.959752] 
               -> #1 (&led_cdev->trigger_lock){+.+.}-{3:3}:
[   73.959758]        down_write+0x3b/0xd0
[   73.959761]        led_trigger_set_default+0x34/0xe0
[   73.959764]        led_classdev_register_ext+0x311/0x3a0
[   73.959767]        input_leds_connect+0x139/0x260
[   73.959770]        input_attach_handler.isra.0+0x75/0x90
[   73.959773]        input_register_device.cold+0xa1/0x150
[   73.959776]        hidinput_connect+0x848/0xb00
[   73.959779]        hid_connect+0x567/0x5a0
[   73.959783]        hid_hw_start+0x3f/0x60
[   73.959787]        hid_device_probe+0x10d/0x190
[   73.959789]        really_probe+0xde/0x340
[   73.959793]        __driver_probe_device+0x78/0x110
[   73.959796]        driver_probe_device+0x1f/0xa0
[   73.959798]        __device_attach_driver+0x85/0x110
[   73.959801]        bus_for_each_drv+0x78/0xc0
[   73.959805]        __device_attach+0xb0/0x1b0
[   73.959808]        bus_probe_device+0x94/0xb0
[   73.959810]        device_add+0x64a/0x860
[   73.959814]        hid_add_device+0xe5/0x240
[   73.959817]        usbhid_probe+0x4bb/0x600
[   73.959821]        usb_probe_interface+0xea/0x2b0
[   73.959824]        really_probe+0xde/0x340
[   73.959827]        __driver_probe_device+0x78/0x110
[   73.959830]        driver_probe_device+0x1f/0xa0
[   73.959833]        __device_attach_driver+0x85/0x110
[   73.959835]        bus_for_each_drv+0x78/0xc0
[   73.959839]        __device_attach+0xb0/0x1b0
[   73.959842]        bus_probe_device+0x94/0xb0
[   73.959844]        device_add+0x64a/0x860
[   73.959847]        usb_set_configuration+0x5e8/0x880
[   73.959850]        usb_generic_driver_probe+0x3e/0x60
[   73.959854]        usb_probe_device+0x3d/0x120
[   73.959857]        really_probe+0xde/0x340
[   73.959859]        __driver_probe_device+0x78/0x110
[   73.959862]        driver_probe_device+0x1f/0xa0
[   73.959865]        __device_attach_driver+0x85/0x110
[   73.959868]        bus_for_each_drv+0x78/0xc0
[   73.959871]        __device_attach+0xb0/0x1b0
[   73.959874]        bus_probe_device+0x94/0xb0
[   73.959876]        device_add+0x64a/0x860
[   73.959880]        usb_new_device.cold+0x141/0x38f
[   73.959883]        hub_event+0x1166/0x1980
[   73.959886]        process_one_work+0x21a/0x590
[   73.959889]        worker_thread+0x1d1/0x3e0
[   73.959891]        kthread+0xee/0x120
[   73.959895]        ret_from_fork+0x30/0x50
[   73.959898]        ret_from_fork_asm+0x1a/0x30
[   73.959902] 
               -> #0 (triggers_list_lock){++++}-{3:3}:
[   73.959908]        __lock_acquire+0x11c6/0x1f20
[   73.959911]        lock_acquire+0xc8/0x2b0
[   73.959914]        down_write+0x3b/0xd0
[   73.959916]        led_trigger_register+0x40/0x1b0
[   73.959919]        phy_led_triggers_register+0xb4/0x240
[   73.959923]        phy_attach_direct+0x378/0x380
[   73.959926]        phy_connect_direct+0x21/0x70
[   73.959929]        rtl_open+0x2e1/0x490 [r8169]
[   73.959934]        __dev_open+0xea/0x1b0
[   73.959937]        __dev_change_flags+0x201/0x240
[   73.959939]        dev_change_flags+0x22/0x60
[   73.959942]        do_setlink+0x32d/0x1070
[   73.959945]        __rtnl_newlink+0x516/0x9d0
[   73.959948]        rtnl_newlink+0x43/0x70
[   73.959951]        rtnetlink_rcv_msg+0x159/0x630
[   73.959953]        netlink_rcv_skb+0x4f/0x100
[   73.959956]        netlink_unicast+0x18c/0x260
[   73.959959]        netlink_sendmsg+0x207/0x420
[   73.959962]        ____sys_sendmsg+0x364/0x3a0
[   73.959965]        ___sys_sendmsg+0x84/0xd0
[   73.959967]        __sys_sendmsg+0x8e/0xd0
[   73.959971]        do_syscall_64+0x93/0x180
[   73.959974]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   73.959977] 
               other info that might help us debug this:

[   73.959979] Chain exists of:
                 triggers_list_lock --> &led_cdev->trigger_lock --> rtnl_mutex

[   73.959987]  Possible unsafe locking scenario:

[   73.959989]        CPU0                    CPU1
[   73.959991]        ----                    ----
[   73.959993]   lock(rtnl_mutex);
[   73.959996]                                lock(&led_cdev->trigger_lock);
[   73.960000]                                lock(rtnl_mutex);
[   73.960004]   lock(triggers_list_lock);
[   73.960007] 
                *** DEADLOCK ***

[   73.960009] 1 lock held by NetworkManager/1815:
[   73.960012]  #0: ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630
[   73.960020] 
               stack backtrace:
[   73.960023] CPU: 5 PID: 1815 Comm: NetworkManager Not tainted 6.10.0-rc1+ #94
[   73.960026] Hardware name: Micro-Star International Co., Ltd. MS-7C95/B550M PRO-VDH WIFI (MS-7C95), BIOS 2.J0 01/02/2024
[   73.960029] Call Trace:
[   73.960031]  <TASK>
[   73.960034]  dump_stack_lvl+0x68/0x90
[   73.960039]  check_noncircular+0x10d/0x120
[   73.960044]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960048]  ? __lock_acquire+0xc19/0x1f20
[   73.960055]  __lock_acquire+0x11c6/0x1f20
[   73.960062]  lock_acquire+0xc8/0x2b0
[   73.960065]  ? led_trigger_register+0x40/0x1b0
[   73.960073]  down_write+0x3b/0xd0
[   73.960076]  ? led_trigger_register+0x40/0x1b0
[   73.960079]  led_trigger_register+0x40/0x1b0
[   73.960083]  phy_led_triggers_register+0xb4/0x240
[   73.960089]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960091]  ? schedule_timeout+0xc1/0x1b0
[   73.960095]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960097]  ? find_held_lock+0x2b/0x80
[   73.960101]  ? phy_attach_direct+0x1b0/0x380
[   73.960105]  ? phy_attach_direct+0x1b0/0x380
[   73.960108]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960110]  ? lock_release+0x169/0x2b0
[   73.960115]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960118]  ? __mutex_unlock_slowpath+0x2d/0x260
[   73.960124]  phy_attach_direct+0x378/0x380
[   73.960128]  ? __pfx_r8169_phylink_handler+0x10/0x10 [r8169]
[   73.960134]  phy_connect_direct+0x21/0x70
[   73.960139]  rtl_open+0x2e1/0x490 [r8169]
[   73.960147]  __dev_open+0xea/0x1b0
[   73.960152]  __dev_change_flags+0x201/0x240
[   73.960155]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960160]  dev_change_flags+0x22/0x60
[   73.960165]  do_setlink+0x32d/0x1070
[   73.960172]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960175]  ? __lock_acquire+0x3ce/0x1f20
[   73.960178]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960180]  ? __nla_validate_parse+0x52/0xc70
[   73.960185]  ? __entry_text_end+0x1025c9/0x1025cd
[   73.960190]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960193]  ? __lock_acquire+0x3ce/0x1f20
[   73.960199]  __rtnl_newlink+0x516/0x9d0
[   73.960204]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960211]  ? __entry_text_end+0x1025c9/0x1025cd
[   73.960214]  ? rcu_is_watching+0xd/0x40
[   73.960217]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960220]  ? kmalloc_trace_noprof+0x25c/0x320
[   73.960224]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960230]  rtnl_newlink+0x43/0x70
[   73.960235]  rtnetlink_rcv_msg+0x159/0x630
[   73.960241]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   73.960246]  netlink_rcv_skb+0x4f/0x100
[   73.960257]  netlink_unicast+0x18c/0x260
[   73.960262]  netlink_sendmsg+0x207/0x420
[   73.960270]  ____sys_sendmsg+0x364/0x3a0
[   73.960273]  ? import_iovec+0x17/0x20
[   73.960277]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960280]  ? copy_msghdr_from_user+0xd9/0x150
[   73.960286]  ___sys_sendmsg+0x84/0xd0
[   73.960292]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960294]  ? __lock_acquire+0x3ce/0x1f20
[   73.960299]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960304]  ? find_held_lock+0x2b/0x80
[   73.960307]  ? __fget_files+0xc3/0x190
[   73.960312]  ? __fget_files+0xc3/0x190
[   73.960315]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960317]  ? lock_release+0x169/0x2b0
[   73.960322]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960327]  __sys_sendmsg+0x8e/0xd0
[   73.960337]  do_syscall_64+0x93/0x180
[   73.960340]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960345]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960349]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960352]  ? find_held_lock+0x2b/0x80
[   73.960356]  ? __memcg_slab_free_hook+0x56/0x230
[   73.960360]  ? __memcg_slab_free_hook+0x56/0x230
[   73.960363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960366]  ? lock_release+0x169/0x2b0
[   73.960370]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960373]  ? __memcg_slab_free_hook+0x60/0x230
[   73.960378]  ? kmem_cache_free+0x144/0x430
[   73.960382]  ? syscall_exit_to_user_mode+0x11/0x280
[   73.960387]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960390]  ? do_syscall_64+0x9f/0x180
[   73.960393]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960396]  ? lockdep_hardirqs_on+0x78/0x100
[   73.960400]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960403]  ? do_syscall_64+0x9f/0x180
[   73.960406]  ? fd_install+0xba/0x310
[   73.960411]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960414]  ? kmem_cache_free+0x144/0x430
[   73.960420]  ? do_sys_openat2+0x64/0xb0
[   73.960423]  ? syscall_exit_to_user_mode+0x11/0x280
[   73.960428]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960431]  ? do_syscall_64+0x9f/0x180
[   73.960435]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960437]  ? lockdep_hardirqs_on+0x78/0x100
[   73.960441]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960444]  ? do_syscall_64+0x9f/0x180
[   73.960448]  ? srso_alias_return_thunk+0x5/0xfbef5
[   73.960453]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   73.960457] RIP: 0033:0x7f857a9db79b
[   73.960461] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 49 7a f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 a1 7a f7 ff 48 8b
[   73.960464] RSP: 002b:00007ffcf5d3a660 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[   73.960468] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f857a9db79b
[   73.960471] RDX: 0000000000000000 RSI: 00007ffcf5d3a6a0 RDI: 000000000000000d
[   73.960473] RBP: 00007ffcf5d3a680 R08: 0000000000000000 R09: 0000000000000000
[   73.960476] R10: 0000000000000000 R11: 0000000000000293 R12: 00005632de03d9d0
[   73.960478] R13: 000000000000000a R14: 00007ffcf5d3a83c R15: 0000000000000000
[   73.960488]  </TASK>
[   73.960561] Generic FE-GE Realtek PHY r8169-0-2a00:00: attached PHY driver (mii_bus:phy_addr=r8169-0-2a00:00, irq=MAC)











> On 30.05.24 16:04, Russell King (Oracle) wrote:
>> On Thu, May 30, 2024 at 09:36:45AM -0400, Genes Lists wrote:
>>> On Thu, 2024-05-30 at 08:53 -0400, Genes Lists wrote:
>>> This report for 6.9.1 could well be the same issue:
>>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
>>
>> The reg_check_chans_work() thing in pid 285 is likely stuck on the
>> rtnl lock. The same is true of pid 287.
>>
>> That will be because of the thread (pid 663) that's stuck in
>> __dev_open()...led_trigger_register(), where the rtnl lock will have
>> been taken in that path. It looks to me like led_trigger_register()
>> is stuck waiting for read access with the leds_list_lock rwsem.
>>
>> There are only two places that take that rwsem in write mode, which
>> are led_classdev_register_ext() and led_classdev_unregister(). None
>> of these paths are blocking in v6.9.
>>
>> Pid 641 doesn't look significant (its probably waiting for either
>> pid 285 or 287 to complete its work.)
>>
>> Pid 666 looks like it is blocked waiting for exclusive write-access
>> on the leds_list_lock - but it isn't holding that lock. This means
>> there must already be some other reader or writer holding this lock.
>>
>> Pid 722 doesn't look sigificant (same as pid 641).
>>
>> Pid 760 is also waiting for the rtnl lock.
>>
>> Pid 854, 855 also doesn't look sigificant (as pid 641).
>>
>> And then we get to pid 858. This is in set_device_name(), which
>> was called from led_trigger_set() and led_trigger_register().
>> We know from pid 663 that led_trigger_register() can take a read
>> on leds_list_lock, and indeed it does and then calls
>> led_match_default_trigger(), which then goes on to call
>> led_trigger_set(). Bingo, this is why pid 666 is blocked, which
>> then blocks pid 663. pid 663 takes the rtnl lock, which blocks
>> everything else _and_ also blocks pid 858 in set_device_name().
>>
>> Lockdep would've found this... this is a classic AB-BA deadlock
>> between the leds_list_lock rwsem and the rtnl mutex.
>>
>> I haven't checked to see how that deadlock got introduced, that's
>> for someone else to do.
> 
> P.S.:
> 
> #regzbot report: /
> #regzbot introduced: f5c31bcf604d
> #regzbot duplicate:
> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
> #regzbot summary: leds: Hung tasks due to a AB-BA deadlock between the
> leds_list_lock rwsem and the rtnl mutex
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31  9:50       ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede
@ 2024-05-31 10:22         ` Hans de Goede
  2024-06-01 20:05           ` Hans de Goede
  2024-05-31 11:55         ` Genes Lists
  2024-05-31 12:54         ` Andrew Lunn
  2 siblings, 1 reply; 17+ messages in thread
From: Hans de Goede @ 2024-05-31 10:22 UTC (permalink / raw)
  To: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit
  Cc: linux-kernel, netdev, andrew, davem, edumazet, kuba, pabeni,
	johanneswueller, Russell King (Oracle), Genes Lists

Hi,

On 5/31/24 11:50 AM, Hans de Goede wrote:
> Hi,
> 
> On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote:
>> [adding the LED folks and the regressions list to the list of recipients]
>>
>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Lee, Pavel, could you look into below regression report please? Thread
>> starts here:
>> https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/
>>
>> Another report with somewhat similar symptom can be found here:
>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
>>
>> See also Russell's analysis of that report below (many many thx for
>> that, much appreciated Russel!).
>>
>> To my untrained eyes all of this sounds a lot like we still have a 6.9
>> regression related to the LED code somewhere. Reminder, we had earlier
>> trouble, but that was avoided through other measures:
>>
>> * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") /
>> https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/
>>
>> * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") /
>> https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/
>>
>> * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal")
>>
>> That iwlwifi commit even calls it self "work around". The developer that
>> submitted it bisected the problem to a LED merge, but sadly that was the
>> end of it. :-/
> 
> I actually have been looking at a ledtrig-netdev lockdep warning yesterday
> which I believe is the same thing. I'll include the lockdep trace below.
> 
> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
> the rtnl mutex vs led-triggers related locks. I believe that this problem
> may be a pre-existing problem but this now actually gets hit in kernels >=
> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
> given, make hw_control trigger the default trigger"). Before that commit
> the "netdev" trigger would not be bound / set as phy LEDs trigger by default.
> 
> +Cc Heiner Kallweit who authored that commit.
> 
> The netdev trigger typically is not needed because the PHY LEDs are typically
> under hw-control and the netdev trigger even tries to leave things that way
> so setting it as the active trigger for the LED class device is basically
> a no-op. I guess the goal of that commit is correctly have the triggers
> file content reflect that the LED is controlled by a netdev and to allow
> changing the hw-control mode without the user first needing to set netdev
> as trigger before being able to change the mode.
> 
> But there is a price to this, besides the locking problem this also
> causes the ledtrig-netdev module to load on pretty much everyones
> systems (when build as a module) even though 99.999% of our users
> likely does not need this at all...
> 
> Given this price and the troubles this is causing I think it might be best
> to revert 66601a29bb23. There might still be a locking issue when setting
> the trigger to netdev manually (I'll check and follow up) but this should
> fix the regression users are hitting since typically users do not set
> the trigger manually.

Ok, I can confirm that the lockdep warning is gone for me with 66601a29bb23
reverted. Unfortunately it does still happen after a "modprobe ledtrig_netdev"
(to add it to the list of available triggers) and then setting the trigger
for /sys/class/leds/enp42s0-0::lan to netdev manually.

Still reverting 66601a29bb23 should avoid the problem getting triggered
and this would seem like a safe fix especially for the 6.9 series and
then the necessary time can be taken to fix the actual underlying locking
issue which 66601a29bb23 exposes.

> Gene, as the original reporter of this can you do "modinfo ledtrig_netdev"
> and if this shows that ledtrig_netdev is a module for you try blacklisting
> ledtrig_netdev ?  And if it is not a module can you try building a 6.9
> kernel with commit 66601a29bb23 reverted and see if that helps ?

Regards,
 
Hans




> Here is the promised lockdep report:
> [   73.959672] ======================================================
> [   73.959675] WARNING: possible circular locking dependency detected
> [   73.959677] 6.10.0-rc1+ #94 Not tainted
> [   73.959680] ------------------------------------------------------
> [   73.959682] NetworkManager/1815 is trying to acquire lock:
> [   73.959685] ffffffffb2145170 (triggers_list_lock){++++}-{3:3}, at: led_trigger_register+0x40/0x1b0
> [   73.959695] 
>                but task is already holding lock:
> [   73.959697] ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630
> [   73.959706] 
>                which lock already depends on the new lock.
> 
> [   73.959708] 
>                the existing dependency chain (in reverse order) is:
> [   73.959710] 
>                -> #2 (rtnl_mutex){+.+.}-{3:3}:
> [   73.959716]        __mutex_lock+0x8c/0xc10
> [   73.959720]        set_device_name+0x2d/0x140 [ledtrig_netdev]
> [   73.959725]        netdev_trig_activate+0x197/0x210 [ledtrig_netdev]
> [   73.959728]        led_trigger_set+0x1e1/0x2e0
> [   73.959731]        led_trigger_register+0x170/0x1b0
> [   73.959734]        do_one_initcall+0x5e/0x3a0
> [   73.959738]        do_init_module+0x60/0x220
> [   73.959741]        __do_sys_init_module+0x15f/0x190
> [   73.959745]        do_syscall_64+0x93/0x180
> [   73.959748]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   73.959752] 
>                -> #1 (&led_cdev->trigger_lock){+.+.}-{3:3}:
> [   73.959758]        down_write+0x3b/0xd0
> [   73.959761]        led_trigger_set_default+0x34/0xe0
> [   73.959764]        led_classdev_register_ext+0x311/0x3a0
> [   73.959767]        input_leds_connect+0x139/0x260
> [   73.959770]        input_attach_handler.isra.0+0x75/0x90
> [   73.959773]        input_register_device.cold+0xa1/0x150
> [   73.959776]        hidinput_connect+0x848/0xb00
> [   73.959779]        hid_connect+0x567/0x5a0
> [   73.959783]        hid_hw_start+0x3f/0x60
> [   73.959787]        hid_device_probe+0x10d/0x190
> [   73.959789]        really_probe+0xde/0x340
> [   73.959793]        __driver_probe_device+0x78/0x110
> [   73.959796]        driver_probe_device+0x1f/0xa0
> [   73.959798]        __device_attach_driver+0x85/0x110
> [   73.959801]        bus_for_each_drv+0x78/0xc0
> [   73.959805]        __device_attach+0xb0/0x1b0
> [   73.959808]        bus_probe_device+0x94/0xb0
> [   73.959810]        device_add+0x64a/0x860
> [   73.959814]        hid_add_device+0xe5/0x240
> [   73.959817]        usbhid_probe+0x4bb/0x600
> [   73.959821]        usb_probe_interface+0xea/0x2b0
> [   73.959824]        really_probe+0xde/0x340
> [   73.959827]        __driver_probe_device+0x78/0x110
> [   73.959830]        driver_probe_device+0x1f/0xa0
> [   73.959833]        __device_attach_driver+0x85/0x110
> [   73.959835]        bus_for_each_drv+0x78/0xc0
> [   73.959839]        __device_attach+0xb0/0x1b0
> [   73.959842]        bus_probe_device+0x94/0xb0
> [   73.959844]        device_add+0x64a/0x860
> [   73.959847]        usb_set_configuration+0x5e8/0x880
> [   73.959850]        usb_generic_driver_probe+0x3e/0x60
> [   73.959854]        usb_probe_device+0x3d/0x120
> [   73.959857]        really_probe+0xde/0x340
> [   73.959859]        __driver_probe_device+0x78/0x110
> [   73.959862]        driver_probe_device+0x1f/0xa0
> [   73.959865]        __device_attach_driver+0x85/0x110
> [   73.959868]        bus_for_each_drv+0x78/0xc0
> [   73.959871]        __device_attach+0xb0/0x1b0
> [   73.959874]        bus_probe_device+0x94/0xb0
> [   73.959876]        device_add+0x64a/0x860
> [   73.959880]        usb_new_device.cold+0x141/0x38f
> [   73.959883]        hub_event+0x1166/0x1980
> [   73.959886]        process_one_work+0x21a/0x590
> [   73.959889]        worker_thread+0x1d1/0x3e0
> [   73.959891]        kthread+0xee/0x120
> [   73.959895]        ret_from_fork+0x30/0x50
> [   73.959898]        ret_from_fork_asm+0x1a/0x30
> [   73.959902] 
>                -> #0 (triggers_list_lock){++++}-{3:3}:
> [   73.959908]        __lock_acquire+0x11c6/0x1f20
> [   73.959911]        lock_acquire+0xc8/0x2b0
> [   73.959914]        down_write+0x3b/0xd0
> [   73.959916]        led_trigger_register+0x40/0x1b0
> [   73.959919]        phy_led_triggers_register+0xb4/0x240
> [   73.959923]        phy_attach_direct+0x378/0x380
> [   73.959926]        phy_connect_direct+0x21/0x70
> [   73.959929]        rtl_open+0x2e1/0x490 [r8169]
> [   73.959934]        __dev_open+0xea/0x1b0
> [   73.959937]        __dev_change_flags+0x201/0x240
> [   73.959939]        dev_change_flags+0x22/0x60
> [   73.959942]        do_setlink+0x32d/0x1070
> [   73.959945]        __rtnl_newlink+0x516/0x9d0
> [   73.959948]        rtnl_newlink+0x43/0x70
> [   73.959951]        rtnetlink_rcv_msg+0x159/0x630
> [   73.959953]        netlink_rcv_skb+0x4f/0x100
> [   73.959956]        netlink_unicast+0x18c/0x260
> [   73.959959]        netlink_sendmsg+0x207/0x420
> [   73.959962]        ____sys_sendmsg+0x364/0x3a0
> [   73.959965]        ___sys_sendmsg+0x84/0xd0
> [   73.959967]        __sys_sendmsg+0x8e/0xd0
> [   73.959971]        do_syscall_64+0x93/0x180
> [   73.959974]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   73.959977] 
>                other info that might help us debug this:
> 
> [   73.959979] Chain exists of:
>                  triggers_list_lock --> &led_cdev->trigger_lock --> rtnl_mutex
> 
> [   73.959987]  Possible unsafe locking scenario:
> 
> [   73.959989]        CPU0                    CPU1
> [   73.959991]        ----                    ----
> [   73.959993]   lock(rtnl_mutex);
> [   73.959996]                                lock(&led_cdev->trigger_lock);
> [   73.960000]                                lock(rtnl_mutex);
> [   73.960004]   lock(triggers_list_lock);
> [   73.960007] 
>                 *** DEADLOCK ***
> 
> [   73.960009] 1 lock held by NetworkManager/1815:
> [   73.960012]  #0: ffffffffb2158fe8 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x133/0x630
> [   73.960020] 
>                stack backtrace:
> [   73.960023] CPU: 5 PID: 1815 Comm: NetworkManager Not tainted 6.10.0-rc1+ #94
> [   73.960026] Hardware name: Micro-Star International Co., Ltd. MS-7C95/B550M PRO-VDH WIFI (MS-7C95), BIOS 2.J0 01/02/2024
> [   73.960029] Call Trace:
> [   73.960031]  <TASK>
> [   73.960034]  dump_stack_lvl+0x68/0x90
> [   73.960039]  check_noncircular+0x10d/0x120
> [   73.960044]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960048]  ? __lock_acquire+0xc19/0x1f20
> [   73.960055]  __lock_acquire+0x11c6/0x1f20
> [   73.960062]  lock_acquire+0xc8/0x2b0
> [   73.960065]  ? led_trigger_register+0x40/0x1b0
> [   73.960073]  down_write+0x3b/0xd0
> [   73.960076]  ? led_trigger_register+0x40/0x1b0
> [   73.960079]  led_trigger_register+0x40/0x1b0
> [   73.960083]  phy_led_triggers_register+0xb4/0x240
> [   73.960089]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960091]  ? schedule_timeout+0xc1/0x1b0
> [   73.960095]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960097]  ? find_held_lock+0x2b/0x80
> [   73.960101]  ? phy_attach_direct+0x1b0/0x380
> [   73.960105]  ? phy_attach_direct+0x1b0/0x380
> [   73.960108]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960110]  ? lock_release+0x169/0x2b0
> [   73.960115]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960118]  ? __mutex_unlock_slowpath+0x2d/0x260
> [   73.960124]  phy_attach_direct+0x378/0x380
> [   73.960128]  ? __pfx_r8169_phylink_handler+0x10/0x10 [r8169]
> [   73.960134]  phy_connect_direct+0x21/0x70
> [   73.960139]  rtl_open+0x2e1/0x490 [r8169]
> [   73.960147]  __dev_open+0xea/0x1b0
> [   73.960152]  __dev_change_flags+0x201/0x240
> [   73.960155]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960160]  dev_change_flags+0x22/0x60
> [   73.960165]  do_setlink+0x32d/0x1070
> [   73.960172]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960175]  ? __lock_acquire+0x3ce/0x1f20
> [   73.960178]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960180]  ? __nla_validate_parse+0x52/0xc70
> [   73.960185]  ? __entry_text_end+0x1025c9/0x1025cd
> [   73.960190]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960193]  ? __lock_acquire+0x3ce/0x1f20
> [   73.960199]  __rtnl_newlink+0x516/0x9d0
> [   73.960204]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960211]  ? __entry_text_end+0x1025c9/0x1025cd
> [   73.960214]  ? rcu_is_watching+0xd/0x40
> [   73.960217]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960220]  ? kmalloc_trace_noprof+0x25c/0x320
> [   73.960224]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960230]  rtnl_newlink+0x43/0x70
> [   73.960235]  rtnetlink_rcv_msg+0x159/0x630
> [   73.960241]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
> [   73.960246]  netlink_rcv_skb+0x4f/0x100
> [   73.960257]  netlink_unicast+0x18c/0x260
> [   73.960262]  netlink_sendmsg+0x207/0x420
> [   73.960270]  ____sys_sendmsg+0x364/0x3a0
> [   73.960273]  ? import_iovec+0x17/0x20
> [   73.960277]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960280]  ? copy_msghdr_from_user+0xd9/0x150
> [   73.960286]  ___sys_sendmsg+0x84/0xd0
> [   73.960292]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960294]  ? __lock_acquire+0x3ce/0x1f20
> [   73.960299]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960304]  ? find_held_lock+0x2b/0x80
> [   73.960307]  ? __fget_files+0xc3/0x190
> [   73.960312]  ? __fget_files+0xc3/0x190
> [   73.960315]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960317]  ? lock_release+0x169/0x2b0
> [   73.960322]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960327]  __sys_sendmsg+0x8e/0xd0
> [   73.960337]  do_syscall_64+0x93/0x180
> [   73.960340]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960345]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960349]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960352]  ? find_held_lock+0x2b/0x80
> [   73.960356]  ? __memcg_slab_free_hook+0x56/0x230
> [   73.960360]  ? __memcg_slab_free_hook+0x56/0x230
> [   73.960363]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960366]  ? lock_release+0x169/0x2b0
> [   73.960370]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960373]  ? __memcg_slab_free_hook+0x60/0x230
> [   73.960378]  ? kmem_cache_free+0x144/0x430
> [   73.960382]  ? syscall_exit_to_user_mode+0x11/0x280
> [   73.960387]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960390]  ? do_syscall_64+0x9f/0x180
> [   73.960393]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960396]  ? lockdep_hardirqs_on+0x78/0x100
> [   73.960400]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960403]  ? do_syscall_64+0x9f/0x180
> [   73.960406]  ? fd_install+0xba/0x310
> [   73.960411]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960414]  ? kmem_cache_free+0x144/0x430
> [   73.960420]  ? do_sys_openat2+0x64/0xb0
> [   73.960423]  ? syscall_exit_to_user_mode+0x11/0x280
> [   73.960428]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960431]  ? do_syscall_64+0x9f/0x180
> [   73.960435]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960437]  ? lockdep_hardirqs_on+0x78/0x100
> [   73.960441]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960444]  ? do_syscall_64+0x9f/0x180
> [   73.960448]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   73.960453]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   73.960457] RIP: 0033:0x7f857a9db79b
> [   73.960461] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 49 7a f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 a1 7a f7 ff 48 8b
> [   73.960464] RSP: 002b:00007ffcf5d3a660 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
> [   73.960468] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f857a9db79b
> [   73.960471] RDX: 0000000000000000 RSI: 00007ffcf5d3a6a0 RDI: 000000000000000d
> [   73.960473] RBP: 00007ffcf5d3a680 R08: 0000000000000000 R09: 0000000000000000
> [   73.960476] R10: 0000000000000000 R11: 0000000000000293 R12: 00005632de03d9d0
> [   73.960478] R13: 000000000000000a R14: 00007ffcf5d3a83c R15: 0000000000000000
> [   73.960488]  </TASK>
> [   73.960561] Generic FE-GE Realtek PHY r8169-0-2a00:00: attached PHY driver (mii_bus:phy_addr=r8169-0-2a00:00, irq=MAC)
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>> On 30.05.24 16:04, Russell King (Oracle) wrote:
>>> On Thu, May 30, 2024 at 09:36:45AM -0400, Genes Lists wrote:
>>>> On Thu, 2024-05-30 at 08:53 -0400, Genes Lists wrote:
>>>> This report for 6.9.1 could well be the same issue:
>>>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
>>>
>>> The reg_check_chans_work() thing in pid 285 is likely stuck on the
>>> rtnl lock. The same is true of pid 287.
>>>
>>> That will be because of the thread (pid 663) that's stuck in
>>> __dev_open()...led_trigger_register(), where the rtnl lock will have
>>> been taken in that path. It looks to me like led_trigger_register()
>>> is stuck waiting for read access with the leds_list_lock rwsem.
>>>
>>> There are only two places that take that rwsem in write mode, which
>>> are led_classdev_register_ext() and led_classdev_unregister(). None
>>> of these paths are blocking in v6.9.
>>>
>>> Pid 641 doesn't look significant (its probably waiting for either
>>> pid 285 or 287 to complete its work.)
>>>
>>> Pid 666 looks like it is blocked waiting for exclusive write-access
>>> on the leds_list_lock - but it isn't holding that lock. This means
>>> there must already be some other reader or writer holding this lock.
>>>
>>> Pid 722 doesn't look sigificant (same as pid 641).
>>>
>>> Pid 760 is also waiting for the rtnl lock.
>>>
>>> Pid 854, 855 also doesn't look sigificant (as pid 641).
>>>
>>> And then we get to pid 858. This is in set_device_name(), which
>>> was called from led_trigger_set() and led_trigger_register().
>>> We know from pid 663 that led_trigger_register() can take a read
>>> on leds_list_lock, and indeed it does and then calls
>>> led_match_default_trigger(), which then goes on to call
>>> led_trigger_set(). Bingo, this is why pid 666 is blocked, which
>>> then blocks pid 663. pid 663 takes the rtnl lock, which blocks
>>> everything else _and_ also blocks pid 858 in set_device_name().
>>>
>>> Lockdep would've found this... this is a classic AB-BA deadlock
>>> between the leds_list_lock rwsem and the rtnl mutex.
>>>
>>> I haven't checked to see how that deadlock got introduced, that's
>>> for someone else to do.
>>
>> P.S.:
>>
>> #regzbot report: /
>> #regzbot introduced: f5c31bcf604d
>> #regzbot duplicate:
>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
>> #regzbot summary: leds: Hung tasks due to a AB-BA deadlock between the
>> leds_list_lock rwsem and the rtnl mutex
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31  9:50       ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede
  2024-05-31 10:22         ` Hans de Goede
@ 2024-05-31 11:55         ` Genes Lists
  2024-05-31 12:54         ` Andrew Lunn
  2 siblings, 0 replies; 17+ messages in thread
From: Genes Lists @ 2024-05-31 11:55 UTC (permalink / raw)
  To: Hans de Goede, Linux regressions mailing list, Pavel Machek,
	Lee Jones, Linux LEDs, Heiner Kallweit
  Cc: linux-kernel, netdev, andrew, davem, edumazet, kuba, pabeni,
	johanneswueller, Russell King (Oracle)

[-- Attachment #1: Type: text/plain, Size: 2389 bytes --]

On Fri, 2024-05-31 at 11:50 +0200, Hans de Goede wrote:
> Hi,
> 
> ...

> I actually have been looking at a ledtrig-netdev lockdep warning
> yesterday
> which I believe is the same thing. I'll include the lockdep trace
> below.
> 
> According to lockdep there indeed is a ABBA (ish) cyclic deadlock
> with
> the rtnl mutex vs led-triggers related locks. I believe that this
> problem
> may be a pre-existing problem but this now actually gets hit in
> kernels >=
> 6.9 because of commit 66601a29bb23 ("leds: class: If no default
> trigger is
> given, make hw_control trigger the default trigger"). Before that
> commit
> the "netdev" trigger would not be bound / set as phy LEDs trigger by
> default.
> 
> +Cc Heiner Kallweit who authored that commit.
> 
> The netdev trigger typically is not needed because the PHY LEDs are
> typically
> under hw-control and the netdev trigger even tries to leave things
> that way
> so setting it as the active trigger for the LED class device is
> basically
> a no-op. I guess the goal of that commit is correctly have the
> triggers
> file content reflect that the LED is controlled by a netdev and to
> allow
> changing the hw-control mode without the user first needing to set
> netdev
> as trigger before being able to change the mode.
> 
> But there is a price to this, besides the locking problem this also
> causes the ledtrig-netdev module to load on pretty much everyones
> systems (when build as a module) even though 99.999% of our users
> likely does not need this at all...
> 
> Given this price and the troubles this is causing I think it might be
> best
> to revert 66601a29bb23. There might still be a locking issue when
> setting
> the trigger to netdev manually (I'll check and follow up) but this
> should
> fix the regression users are hitting since typically users do not set
> the trigger manually.
> 
> Gene, as the original reporter of this can you do "modinfo
> ledtrig_netdev"
> and if this shows that ledtrig_netdev is a module for you try
> blacklisting
> ledtrig_netdev ?  And if it is not a module can you try building a
> 6.9
> kernel with commit 66601a29bb23 reverted and see if that helps ?

Thank you - I've blacklisted ledtrig_netdev and will report back if
anything interesting happens.

best

gene

> 
> 
> Regards,
> 
> Hans

-- 
Gene


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31  9:50       ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede
  2024-05-31 10:22         ` Hans de Goede
  2024-05-31 11:55         ` Genes Lists
@ 2024-05-31 12:54         ` Andrew Lunn
  2024-05-31 13:11           ` Hans de Goede
                             ` (2 more replies)
  2 siblings, 3 replies; 17+ messages in thread
From: Andrew Lunn @ 2024-05-31 12:54 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

> I actually have been looking at a ledtrig-netdev lockdep warning yesterday
> which I believe is the same thing. I'll include the lockdep trace below.
> 
> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
> the rtnl mutex vs led-triggers related locks. I believe that this problem
> may be a pre-existing problem but this now actually gets hit in kernels >=
> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
> given, make hw_control trigger the default trigger"). Before that commit
> the "netdev" trigger would not be bound / set as phy LEDs trigger by default.
> 
> +Cc Heiner Kallweit who authored that commit.
> 
> The netdev trigger typically is not needed because the PHY LEDs are typically
> under hw-control and the netdev trigger even tries to leave things that way
> so setting it as the active trigger for the LED class device is basically
> a no-op. I guess the goal of that commit is correctly have the triggers
> file content reflect that the LED is controlled by a netdev and to allow
> changing the hw-control mode without the user first needing to set netdev
> as trigger before being able to change the mode.

It was not the intention that this triggers is loaded for all
systems. It should only be those that actually have LEDs which can be
controlled:

drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
drivers/net/ethernet/intel/igc/igc_leds.c:	led_cdev->hw_control_trigger = "netdev";
drivers/net/dsa/qca/qca8k-leds.c:		port_led->cdev.hw_control_trigger = "netdev";
drivers/net/phy/phy_device.c:		cdev->hw_control_trigger = "netdev";

Reverting this patch does seem like a good way forward, but i would
also like to give Heiner a little bit of time to see if he has a quick
real fix.

     Andrew

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31 12:54         ` Andrew Lunn
@ 2024-05-31 13:11           ` Hans de Goede
  2024-05-31 13:29             ` Andrew Lunn
  2024-06-06 12:01           ` Hans de Goede
  2024-06-24 19:57           ` Heiner Kallweit
  2 siblings, 1 reply; 17+ messages in thread
From: Hans de Goede @ 2024-05-31 13:11 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

Hi,

On 5/31/24 2:54 PM, Andrew Lunn wrote:
>> I actually have been looking at a ledtrig-netdev lockdep warning yesterday
>> which I believe is the same thing. I'll include the lockdep trace below.
>>
>> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
>> the rtnl mutex vs led-triggers related locks. I believe that this problem
>> may be a pre-existing problem but this now actually gets hit in kernels >=
>> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
>> given, make hw_control trigger the default trigger"). Before that commit
>> the "netdev" trigger would not be bound / set as phy LEDs trigger by default.
>>
>> +Cc Heiner Kallweit who authored that commit.
>>
>> The netdev trigger typically is not needed because the PHY LEDs are typically
>> under hw-control and the netdev trigger even tries to leave things that way
>> so setting it as the active trigger for the LED class device is basically
>> a no-op. I guess the goal of that commit is correctly have the triggers
>> file content reflect that the LED is controlled by a netdev and to allow
>> changing the hw-control mode without the user first needing to set netdev
>> as trigger before being able to change the mode.
> 
> It was not the intention that this triggers is loaded for all
> systems.

Right note there are really 2 separate issues (or 1 issue
and one question) here:

1. The locking issue which this commit has exposed (but existed before)

2. If it is desirable to load / activate ledtrig-netdev by default on
   quite a lot of machines where it does not really gain us anything ?

For now I think we should focus on 1.

Still about 2:

> It should only be those that actually have LEDs which can be
> controlled:
> 
> drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
> drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
> drivers/net/ethernet/intel/igc/igc_leds.c:	led_cdev->hw_control_trigger = "netdev";
> drivers/net/dsa/qca/qca8k-leds.c:		port_led->cdev.hw_control_trigger = "netdev";
> drivers/net/phy/phy_device.c:		cdev->hw_control_trigger = "netdev";

Well those drivers combined, esp. with the generic phy_device in there
does mean that the ledtrig-netdev module now gets loaded on a whole lot
of x86 machines where before it would not. On one hand those machines
are plenty powerful typically, so what is one more module. OTOH I don't
think many users if any at all want to change the hwcontrol mode for
those LEDs...

> Reverting this patch does seem like a good way forward, but i would
> also like to give Heiner a little bit of time to see if he has a quick
> real fix.

Ack.

Regards,

Hans




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31 13:11           ` Hans de Goede
@ 2024-05-31 13:29             ` Andrew Lunn
  2024-05-31 13:36               ` Hans de Goede
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Lunn @ 2024-05-31 13:29 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

> > drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
> > drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
> > drivers/net/ethernet/intel/igc/igc_leds.c:	led_cdev->hw_control_trigger = "netdev";
> > drivers/net/dsa/qca/qca8k-leds.c:		port_led->cdev.hw_control_trigger = "netdev";
> > drivers/net/phy/phy_device.c:		cdev->hw_control_trigger = "netdev";
> 
> Well those drivers combined, esp. with the generic phy_device in there
> does mean that the ledtrig-netdev module now gets loaded on a whole lot
> of x86 machines where before it would not.

phy_device will only do something if there is the needed Device Tree
properties. Given that very few systems use DT on x86, that should not
be an issue. So only x86 systems with r8169 and igc should have the
trigger module loaded because of this. It would be good to understand
why other systems have the trigger loaded. However, as you say, this
will not fix the underlying deadlock, it will just limit it to systems with r8169
and igc...

    Andrew

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31 13:29             ` Andrew Lunn
@ 2024-05-31 13:36               ` Hans de Goede
  0 siblings, 0 replies; 17+ messages in thread
From: Hans de Goede @ 2024-05-31 13:36 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

Hi,

On 5/31/24 3:29 PM, Andrew Lunn wrote:
>>> drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
>>> drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
>>> drivers/net/ethernet/intel/igc/igc_leds.c:	led_cdev->hw_control_trigger = "netdev";
>>> drivers/net/dsa/qca/qca8k-leds.c:		port_led->cdev.hw_control_trigger = "netdev";
>>> drivers/net/phy/phy_device.c:		cdev->hw_control_trigger = "netdev";
>>
>> Well those drivers combined, esp. with the generic phy_device in there
>> does mean that the ledtrig-netdev module now gets loaded on a whole lot
>> of x86 machines where before it would not.
> 
> phy_device will only do something if there is the needed Device Tree
> properties. Given that very few systems use DT on x86, that should not
> be an issue.

That is good to know.

> So only x86 systems with r8169 and igc should have the
> trigger module loaded because of this.

Those are very popular NICs though, so that is still a lot of
systems.

> It would be good to understand
> why other systems have the trigger loaded.

Actually my system has a RTL8168 ethernet NIC so the netdev trigger
getting loaded there is expected.

> However, as you say, this
> will not fix the underlying deadlock, it will just limit it to systems with r8169
> and igc...

Right, given on the above discussion I believe that it likely already
is limited to systems with Realtek r8169 or Intel i225 / i226 NICs.

Regards,

Hans




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31 10:22         ` Hans de Goede
@ 2024-06-01 20:05           ` Hans de Goede
  0 siblings, 0 replies; 17+ messages in thread
From: Hans de Goede @ 2024-06-01 20:05 UTC (permalink / raw)
  To: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit
  Cc: linux-kernel, netdev, andrew, davem, edumazet, kuba, pabeni,
	johanneswueller, Russell King (Oracle), Genes Lists

Hi All,

On 5/31/24 12:22 PM, Hans de Goede wrote:
> Hi,
> 
> On 5/31/24 11:50 AM, Hans de Goede wrote:
>> Hi,
>>
>> On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> [adding the LED folks and the regressions list to the list of recipients]
>>>
>>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>>> for once, to make this easily accessible to everyone.
>>>
>>> Lee, Pavel, could you look into below regression report please? Thread
>>> starts here:
>>> https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/
>>>
>>> Another report with somewhat similar symptom can be found here:
>>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/
>>>
>>> See also Russell's analysis of that report below (many many thx for
>>> that, much appreciated Russel!).
>>>
>>> To my untrained eyes all of this sounds a lot like we still have a 6.9
>>> regression related to the LED code somewhere. Reminder, we had earlier
>>> trouble, but that was avoided through other measures:
>>>
>>> * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") /
>>> https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/
>>>
>>> * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") /
>>> https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/
>>>
>>> * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal")
>>>
>>> That iwlwifi commit even calls it self "work around". The developer that
>>> submitted it bisected the problem to a LED merge, but sadly that was the
>>> end of it. :-/
>>
>> I actually have been looking at a ledtrig-netdev lockdep warning yesterday
>> which I believe is the same thing. I'll include the lockdep trace below.
>>
>> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
>> the rtnl mutex vs led-triggers related locks. I believe that this problem
>> may be a pre-existing problem but this now actually gets hit in kernels >=
>> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
>> given, make hw_control trigger the default trigger"). Before that commit
>> the "netdev" trigger would not be bound / set as phy LEDs trigger by default.
>>
>> +Cc Heiner Kallweit who authored that commit.
>>
>> The netdev trigger typically is not needed because the PHY LEDs are typically
>> under hw-control and the netdev trigger even tries to leave things that way
>> so setting it as the active trigger for the LED class device is basically
>> a no-op. I guess the goal of that commit is correctly have the triggers
>> file content reflect that the LED is controlled by a netdev and to allow
>> changing the hw-control mode without the user first needing to set netdev
>> as trigger before being able to change the mode.
>>
>> But there is a price to this, besides the locking problem this also
>> causes the ledtrig-netdev module to load on pretty much everyones
>> systems (when build as a module) even though 99.999% of our users
>> likely does not need this at all...
>>
>> Given this price and the troubles this is causing I think it might be best
>> to revert 66601a29bb23. There might still be a locking issue when setting
>> the trigger to netdev manually (I'll check and follow up) but this should
>> fix the regression users are hitting since typically users do not set
>> the trigger manually.
> 
> Ok, I can confirm that the lockdep warning is gone for me with 66601a29bb23
> reverted. Unfortunately it does still happen after a "modprobe ledtrig_netdev"
> (to add it to the list of available triggers) and then setting the trigger
> for /sys/class/leds/enp42s0-0::lan to netdev manually.
> 
> Still reverting 66601a29bb23 should avoid the problem getting triggered
> and this would seem like a safe fix especially for the 6.9 series and
> then the necessary time can be taken to fix the actual underlying locking
> issue which 66601a29bb23 exposes.

I recently wrote a new input-events LED trigger:
https://lore.kernel.org/linux-leds/20240531135910.168965-2-hdegoede@redhat.com/

and I just found out this has a very similar deadlock problem. It seems
there it a generic problem with LEDs or LED triggers getting registered
by subsystems while holding a global lock from that subsystem vs
the activate / deactivate method of the trigger calling functions of that
same subsystem which require that same global lock.

I came up with a fix but that just fixed the activate() path leaving
a similar deadlock in place in the deactivate path:

https://lore.kernel.org/linux-leds/20240601195528.48308-1-hdegoede@redhat.com/
https://lore.kernel.org/linux-leds/20240601195528.48308-3-hdegoede@redhat.com/

So this seems to be a non trivial problem to solve. For the new input-events
trigger I plan to solve this by switching to a single input_handler which will
be registered at module_init() time instead of having a handler per LED and
registering that handler at activate() time.

But I have no idea how to solve this for the netdev trigger I just wanted
to share my experience with the input-events trigger in case it is useful.

Regards,

Hans



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31 12:54         ` Andrew Lunn
  2024-05-31 13:11           ` Hans de Goede
@ 2024-06-06 12:01           ` Hans de Goede
  2024-06-06 13:12             ` Andrew Lunn
  2024-06-24 19:57           ` Heiner Kallweit
  2 siblings, 1 reply; 17+ messages in thread
From: Hans de Goede @ 2024-06-06 12:01 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

Hi all,

On 5/31/24 2:54 PM, Andrew Lunn wrote:
>> I actually have been looking at a ledtrig-netdev lockdep warning yesterday
>> which I believe is the same thing. I'll include the lockdep trace below.
>>
>> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
>> the rtnl mutex vs led-triggers related locks. I believe that this problem
>> may be a pre-existing problem but this now actually gets hit in kernels >=
>> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
>> given, make hw_control trigger the default trigger"). Before that commit
>> the "netdev" trigger would not be bound / set as phy LEDs trigger by default.
>>
>> +Cc Heiner Kallweit who authored that commit.
>>
>> The netdev trigger typically is not needed because the PHY LEDs are typically
>> under hw-control and the netdev trigger even tries to leave things that way
>> so setting it as the active trigger for the LED class device is basically
>> a no-op. I guess the goal of that commit is correctly have the triggers
>> file content reflect that the LED is controlled by a netdev and to allow
>> changing the hw-control mode without the user first needing to set netdev
>> as trigger before being able to change the mode.
> 
> It was not the intention that this triggers is loaded for all
> systems.

<snip>

> Reverting this patch does seem like a good way forward, but i would
> also like to give Heiner a little bit of time to see if he has a quick
> real fix.

So it has been almost a week and no reply from Heiner. Since this is
causing real issues for users out there I think a revert of 66601a29bb23
should be submitted to Linus and then backported to the stable kernels.
to fix the immediate issue at hand.

Once the underlying locking issue which is the real root cause here
is fixed then we can reconsider re-applying 66601a29bb23.

Regards,

Hans






^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-06-06 12:01           ` Hans de Goede
@ 2024-06-06 13:12             ` Andrew Lunn
  2024-06-06 13:39               ` Jakub Kicinski
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Lunn @ 2024-06-06 13:12 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, kuba, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

> So it has been almost a week and no reply from Heiner. Since this is
> causing real issues for users out there I think a revert of 66601a29bb23
> should be submitted to Linus and then backported to the stable kernels.
> to fix the immediate issue at hand.

Agreed.

	Andrew

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-06-06 13:12             ` Andrew Lunn
@ 2024-06-06 13:39               ` Jakub Kicinski
  2024-06-06 14:25                 ` Genes Lists
  2024-06-07 10:19                 ` Hans de Goede
  0 siblings, 2 replies; 17+ messages in thread
From: Jakub Kicinski @ 2024-06-06 13:39 UTC (permalink / raw)
  To: Andrew Lunn, Hans de Goede
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote:
> > So it has been almost a week and no reply from Heiner. Since this is
> > causing real issues for users out there I think a revert of 66601a29bb23
> > should be submitted to Linus and then backported to the stable kernels.
> > to fix the immediate issue at hand.  
> 
> Agreed.

Please submit..

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-06-06 13:39               ` Jakub Kicinski
@ 2024-06-06 14:25                 ` Genes Lists
  2024-06-07 10:22                   ` Hans de Goede
  2024-06-07 10:19                 ` Hans de Goede
  1 sibling, 1 reply; 17+ messages in thread
From: Genes Lists @ 2024-06-06 14:25 UTC (permalink / raw)
  To: Jakub Kicinski, Andrew Lunn, Hans de Goede
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, pabeni, johanneswueller, Russell King (Oracle)


[-- Attachment #1.1: Type: text/plain, Size: 697 bytes --]

On Thu, 2024-06-06 at 06:39 -0700, Jakub Kicinski wrote:
> On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote:
> > > So it has been almost a week and no reply from Heiner. Since this
> > > is
> > > causing real issues for users out there I think a revert of
> > > 66601a29bb23
> > > should be submitted to Linus and then backported to the stable
> > > kernels.
> > > to fix the immediate issue at hand.  
> > 
> > Agreed.
> 
> Please submit..

I assume this deadlock is unrelated to the filesystem stalls reported
here:

 
 https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/

but thought it best to ask.

thank you.

-- 
Gene


[-- Attachment #1.2: Type: text/html, Size: 1700 bytes --]

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-06-06 13:39               ` Jakub Kicinski
  2024-06-06 14:25                 ` Genes Lists
@ 2024-06-07 10:19                 ` Hans de Goede
  1 sibling, 0 replies; 17+ messages in thread
From: Hans de Goede @ 2024-06-07 10:19 UTC (permalink / raw)
  To: Jakub Kicinski, Andrew Lunn
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, pabeni, johanneswueller, Russell King (Oracle),
	Genes Lists

Hi,

On 6/6/24 3:39 PM, Jakub Kicinski wrote:
> On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote:
>>> So it has been almost a week and no reply from Heiner. Since this is
>>> causing real issues for users out there I think a revert of 66601a29bb23
>>> should be submitted to Linus and then backported to the stable kernels.
>>> to fix the immediate issue at hand.  
>>
>> Agreed.
> 
> Please submit..

Done: https://lore.kernel.org/linux-leds/20240607101847.23037-1-hdegoede@redhat.com/

Regards,

Hans





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-06-06 14:25                 ` Genes Lists
@ 2024-06-07 10:22                   ` Hans de Goede
  2024-06-07 11:27                     ` Genes Lists
  0 siblings, 1 reply; 17+ messages in thread
From: Hans de Goede @ 2024-06-07 10:22 UTC (permalink / raw)
  To: Genes Lists, Jakub Kicinski, Andrew Lunn
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, pabeni, johanneswueller, Russell King (Oracle)

Hi Gene,

On 6/6/24 4:25 PM, Genes Lists wrote:
> On Thu, 2024-06-06 at 06:39 -0700, Jakub Kicinski wrote:
>> On Thu, 6 Jun 2024 15:12:54 +0200 Andrew Lunn wrote:
>>>> So it has been almost a week and no reply from Heiner. Since this is
>>>> causing real issues for users out there I think a revert of 66601a29bb23
>>>> should be submitted to Linus and then backported to the stable kernels.
>>>> to fix the immediate issue at hand. 
>>>
>>> Agreed.
>>
>> Please submit..
> 
> I assume this deadlock is unrelated to the filesystem stalls reported here:
> 
>    https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/ <https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/>

Right that one is unrelated, but the revert which I just submitted should
fix your other report:

https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/

I believe you have already worked around that one by blacklisting
the ledtrig-netdev module.

Regards,

Hans




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-06-07 10:22                   ` Hans de Goede
@ 2024-06-07 11:27                     ` Genes Lists
  0 siblings, 0 replies; 17+ messages in thread
From: Genes Lists @ 2024-06-07 11:27 UTC (permalink / raw)
  To: Hans de Goede, Jakub Kicinski, Andrew Lunn
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, Heiner Kallweit, linux-kernel, netdev, davem,
	edumazet, pabeni, johanneswueller, Russell King (Oracle)


[-- Attachment #1.1: Type: text/plain, Size: 710 bytes --]

On Fri, 2024-06-07 at 12:22 +0200, Hans de Goede wrote:
> Hi Gene,

...

> > 
> >  
> >  https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/
> >  <
> > https://lore.kernel.org/lkml/da8710eddca32677cf3c195000416121045eb811.camel@sapience.com/
> > >
> 
> Right that one is unrelated, but the revert which I just submitted
> should
> fix your other report:
> 
> https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/

Very much appreciated thank you.

> 
> I believe you have already worked around that one by blacklisting
> the ledtrig-netdev module.

Correct.

> 

Thank you very much.




-- 
Gene


[-- Attachment #1.2: Type: text/html, Size: 2375 bytes --]

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
  2024-05-31 12:54         ` Andrew Lunn
  2024-05-31 13:11           ` Hans de Goede
  2024-06-06 12:01           ` Hans de Goede
@ 2024-06-24 19:57           ` Heiner Kallweit
  2 siblings, 0 replies; 17+ messages in thread
From: Heiner Kallweit @ 2024-06-24 19:57 UTC (permalink / raw)
  To: Andrew Lunn, Hans de Goede
  Cc: Linux regressions mailing list, Pavel Machek, Lee Jones,
	Linux LEDs, linux-kernel, netdev, davem, edumazet, kuba, pabeni,
	johanneswueller, Russell King (Oracle), Genes Lists

On 31.05.2024 14:54, Andrew Lunn wrote:
>> I actually have been looking at a ledtrig-netdev lockdep warning yesterday
>> which I believe is the same thing. I'll include the lockdep trace below.
>>
>> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
>> the rtnl mutex vs led-triggers related locks. I believe that this problem
>> may be a pre-existing problem but this now actually gets hit in kernels >=
>> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
>> given, make hw_control trigger the default trigger"). Before that commit
>> the "netdev" trigger would not be bound / set as phy LEDs trigger by default.
>>
>> +Cc Heiner Kallweit who authored that commit.
>>
>> The netdev trigger typically is not needed because the PHY LEDs are typically
>> under hw-control and the netdev trigger even tries to leave things that way
>> so setting it as the active trigger for the LED class device is basically
>> a no-op. I guess the goal of that commit is correctly have the triggers
>> file content reflect that the LED is controlled by a netdev and to allow
>> changing the hw-control mode without the user first needing to set netdev
>> as trigger before being able to change the mode.
> 
> It was not the intention that this triggers is loaded for all
> systems. It should only be those that actually have LEDs which can be
> controlled:
> 
> drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
> drivers/net/ethernet/realtek/r8169_leds.c:	led_cdev->hw_control_trigger = "netdev";
> drivers/net/ethernet/intel/igc/igc_leds.c:	led_cdev->hw_control_trigger = "netdev";
> drivers/net/dsa/qca/qca8k-leds.c:		port_led->cdev.hw_control_trigger = "netdev";
> drivers/net/phy/phy_device.c:		cdev->hw_control_trigger = "netdev";
> 
> Reverting this patch does seem like a good way forward, but i would
> also like to give Heiner a little bit of time to see if he has a quick
> real fix.
> 
Sorry, I was on vacation and can reply only now.
I agree that a revert is appropriate because the actual issue that this change revealed
seems to be non-trivial to fix.
Drawback is that now a user who wants to control LEDs has to know that he first has to
manually load the netdev trigger module and set the netdev trigger.

That the netdev trigger module is now loaded by default on systems with r8169 I don't
really see as an issue. I think it's normal that this is opt-out.
If a user has a problem with this for whatever reason he has several options to
prevent this:
- disable R8169 LED support in kernel config
- disable netdev trigger in kernel config
- remove netdev trigger module
- blacklist netdev trigger module

>      Andrew

Heiner

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-06-24 19:56 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com>
     [not found] ` <15a0bbd24cd01bd0b60b7047958a2e3ab556ea6f.camel@sapience.com>
     [not found]   ` <ZliHhebSGQYZ/0S0@shell.armlinux.org.uk>
2024-05-31  8:39     ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex (was: 6.9.3 Hung tasks) Linux regression tracking (Thorsten Leemhuis)
2024-05-31  9:50       ` Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex Hans de Goede
2024-05-31 10:22         ` Hans de Goede
2024-06-01 20:05           ` Hans de Goede
2024-05-31 11:55         ` Genes Lists
2024-05-31 12:54         ` Andrew Lunn
2024-05-31 13:11           ` Hans de Goede
2024-05-31 13:29             ` Andrew Lunn
2024-05-31 13:36               ` Hans de Goede
2024-06-06 12:01           ` Hans de Goede
2024-06-06 13:12             ` Andrew Lunn
2024-06-06 13:39               ` Jakub Kicinski
2024-06-06 14:25                 ` Genes Lists
2024-06-07 10:22                   ` Hans de Goede
2024-06-07 11:27                     ` Genes Lists
2024-06-07 10:19                 ` Hans de Goede
2024-06-24 19:57           ` Heiner Kallweit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).