public inbox for linux-kernel@vger.kernel.org
* [Regression] wifi problems since tg3 started throwing rcu stall warnings
@ 2024-10-23  8:27 Linux regression tracking (Thorsten Leemhuis)
  2024-10-23  9:11 ` Linux regression tracking (Thorsten Leemhuis)
  2024-10-23 10:09 ` Frederic Weisbecker
  0 siblings, 2 replies; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-10-23  8:27 UTC
  To: Frederic Weisbecker
  Cc: Linux kernel regressions list, LKML, Mingcong Bai,
	Paul E. McKenney, rcu

Hi, Thorsten here, the Linux kernel's regression tracker.

Frederic, I noticed a report about a regression in bugzilla.kernel.org
that appears to be caused by the following change of yours:

55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
invocation")

As many (most?) kernel developers don't keep an eye on the bug tracker,
I decided to write this mail. To quote from
https://bugzilla.kernel.org/show_bug.cgi?id=219390:

>  Mingcong Bai 2024-10-15 13:32:35 UTC
> 
> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
> v6.10.5, the Broadcom Tigon3 Ethernet interface (tg3) found on Apple
> MacBook Pro (15'', Mid 2010) would throw many rcu stall errors during
> boot up, causing peripherals such as the wireless card to misbehave:
> 
> [   24.153855] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 2-.... } 21 jiffies s: 973 root: 0x4/.
> [   24.166938] rcu: blocking rcu_node structures (internal RCU debug):
> [   24.177800] Sending NMI from CPU 3 to CPUs 2:
> [   24.183113] NMI backtrace for cpu 2
> [   24.183119] CPU: 2 PID: 1049 Comm: NetworkManager Not tainted 6.10.5-aosc-main #1
> [   24.183123] Hardware name: Apple Inc. MacBookPro6,2/Mac-F22586C8, BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
> [   24.183125] RIP: 0010:__this_module+0x2d3d1/0x4f310 [tg3]
> [   24.183135] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48 03 77 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90
> [   24.183138] RSP: 0018:ffffbf1a011d75e8 EFLAGS: 00000082
> [   24.183141] RAX: 0000000000000000 RBX: ffffa04ec78f8a00 RCX: 0000000000000000
> [   24.183143] RDX: 0000000000000000 RSI: ffffbf1a00fb007c RDI: ffffa04ec78f8a00
> [   24.183145] RBP: 0000000000000b50 R08: 0000000000000000 R09: 0000000000000000
> [   24.183147] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000216
> [   24.183148] R13: ffffbf1a011d7624 R14: ffffa04ec78f8a08 R15: ffffa04ec78f8b40
> [   24.183151] FS:  00007f4c524b2140(0000) GS:ffffa05007d00000(0000) knlGS:0000000000000000
> [   24.183153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   24.183155] CR2: 00007f7025eae3e8 CR3: 00000001040f8000 CR4: 00000000000006f0
> [   24.183157] Call Trace:
> [   24.183162]  <NMI>
> [   24.183167]  ? nmi_cpu_backtrace+0xbf/0x140
> [   24.183175]  ? nmi_cpu_backtrace_handler+0x11/0x20
> [   24.183181]  ? nmi_handle+0x61/0x160
> [   24.183186]  ? default_do_nmi+0x42/0x110
> [   24.183191]  ? exc_nmi+0x1bd/0x290
> [   24.183194]  ? end_repeat_nmi+0xf/0x53
> [   24.183203]  ? __this_module+0x2d3d1/0x4f310 [tg3]
> [   24.183207]  ? __this_module+0x2d3d1/0x4f310 [tg3]
> [   24.183210]  ? __this_module+0x2d3d1/0x4f310 [tg3]
> [   24.183213]  </NMI>
> [   24.183214]  <TASK>
> [   24.183215]  __this_module+0x31828/0x4f310 [tg3]
> [   24.183218]  ? __this_module+0x2d390/0x4f310 [tg3]
> [   24.183221]  __this_module+0x398e6/0x4f310 [tg3]
> [   24.183225]  __this_module+0x3baf8/0x4f310 [tg3]
> [   24.183229]  __this_module+0x4733f/0x4f310 [tg3]
> [   24.183233]  ? _raw_spin_unlock_irqrestore+0x25/0x70
> [   24.183237]  ? __this_module+0x398e6/0x4f310 [tg3]
> [   24.183241]  __this_module+0x4b943/0x4f310 [tg3]
> [   24.183244]  ? delay_tsc+0x89/0xf0
> [   24.183249]  ? preempt_count_sub+0x51/0x60
> [   24.183254]  __this_module+0x4be4b/0x4f310 [tg3]
> [   24.183258]  __dev_open+0x103/0x1c0
> [   24.183265]  __dev_change_flags+0x1bd/0x230
> [   24.183269]  ? rtnl_getlink+0x362/0x400
> [   24.183276]  dev_change_flags+0x26/0x70
> [   24.183280]  do_setlink+0xe16/0x11f0
> [   24.183286]  ? __nla_validate_parse+0x61/0xd40
> [   24.183295]  __rtnl_newlink+0x63d/0x9f0
> [   24.183301]  ? kmem_cache_alloc_node_noprof+0x12b/0x360
> [   24.183308]  ? kmalloc_trace_noprof+0x11e/0x350
> [   24.183312]  ? rtnl_newlink+0x2e/0x70
> [   24.183316]  rtnl_newlink+0x47/0x70
> [   24.183320]  rtnetlink_rcv_msg+0x152/0x400
> [   24.183324]  ? __netlink_sendskb+0x68/0x90
> [   24.183329]  ? netlink_unicast+0x237/0x290
> [   24.183333]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
> [   24.183336]  netlink_rcv_skb+0x5b/0x110
> [   24.183343]  netlink_unicast+0x1a4/0x290
> [   24.183347]  netlink_sendmsg+0x222/0x4a0
> [   24.183350]  ? proc_get_long.constprop.0+0x116/0x210
> [   24.183358]  ____sys_sendmsg+0x379/0x3b0
> [   24.183363]  ? copy_msghdr_from_user+0x6d/0xb0
> [   24.183368]  ___sys_sendmsg+0x86/0xe0
> [   24.183372]  ? addrconf_sysctl_forward+0xf3/0x270
> [   24.183378]  ? _copy_from_iter+0x8b/0x570
> [   24.183384]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
> [   24.183388]  ? _raw_spin_unlock+0x19/0x50
> [   24.183392]  ? proc_sys_call_handler+0xf3/0x2f0
> [   24.183397]  ? trace_hardirqs_on+0x29/0x90
> [   24.183401]  ? __fdget+0xc2/0xf0
> [   24.183405]  __sys_sendmsg+0x5b/0xc0
> [   24.183410]  ? syscall_trace_enter+0x110/0x1b0
> [   24.183416]  do_syscall_64+0x64/0x150
> [   24.183423]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> I have bisected the error to this commit. Reverting it caused no new or
> perceivable issues on both the MacBook and a Zen4-based laptop.

[...]

>> Ohh, and when you say "causing peripherals such as the wireless card to
>> misbehave" what exactly do you mean?
> 
> When the kernel throws rcu stall messages, the wireless card on the
> MacBook may fail to discover and/or connect to wireless networks - not a
> consistent behaviour but I suppose that something in the kernel got stuck.

See the ticket for more details and dmesg logs; the problem still
happens with 6.12-rc. The reporter is CCed.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

P.S.: let me use this mail to also add the report to the list of tracked
regressions to ensure it doesn't fall through the cracks:

#regzbot introduced: 55d4669ef1b76823083caecfab12a8bd2ccdcf64
#regzbot from: Mingcong Bai <jeffbai@aosc.io>
#regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=219390
#regzbot title: rcu: wifi problems since tg3 started throwing rcu stall
warnings
#regzbot ignore-activity


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-10-23  8:27 [Regression] wifi problems since tg3 started throwing rcu stall warnings Linux regression tracking (Thorsten Leemhuis)
@ 2024-10-23  9:11 ` Linux regression tracking (Thorsten Leemhuis)
  2024-10-23 10:09 ` Frederic Weisbecker
  1 sibling, 0 replies; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-10-23  9:11 UTC
  To: Frederic Weisbecker
  Cc: Linux kernel regressions list, LKML, Mingcong Bai,
	Paul E. McKenney, rcu, Pavan Chebbi, Michael Chan, netdev,
	linux-wireless@vger.kernel.org

[reply to self to CC the tg3 maintainers, netdev, and linux-wireless;
sorry, forgot them earlier, but they should be involved, as I guess this
is a problem in tg3 interfering with wifi drivers that the rcu change
just exposed]

On 23.10.24 10:27, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
> 
> Frederic, I noticed a report about a regression in bugzilla.kernel.org
> that appears to be caused by the following change of yours:
> 
> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
> invocation")
> 
> As many (most?) kernel developers don't keep an eye on the bug tracker,
> I decided to write this mail. To quote from
> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
> 
>>  Mingcong Bai 2024-10-15 13:32:35 UTC
>>
>> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
>> v6.10.5, the Broadcom Tigon3 Ethernet interface (tg3) found on Apple
>> MacBook Pro (15'', Mid 2010) would throw many rcu stall errors during
>> boot up, causing peripherals such as the wireless card to misbehave:
>>
>> [   24.153855] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 2-.... } 21 jiffies s: 973 root: 0x4/.
>> [   24.166938] rcu: blocking rcu_node structures (internal RCU debug):
>> [   24.177800] Sending NMI from CPU 3 to CPUs 2:
>> [   24.183113] NMI backtrace for cpu 2
>> [   24.183119] CPU: 2 PID: 1049 Comm: NetworkManager Not tainted 6.10.5-aosc-main #1
>> [   24.183123] Hardware name: Apple Inc. MacBookPro6,2/Mac-F22586C8, BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
>> [   24.183125] RIP: 0010:__this_module+0x2d3d1/0x4f310 [tg3]
>> [   24.183135] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48 03 77 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90
>> [   24.183138] RSP: 0018:ffffbf1a011d75e8 EFLAGS: 00000082
>> [   24.183141] RAX: 0000000000000000 RBX: ffffa04ec78f8a00 RCX: 0000000000000000
>> [   24.183143] RDX: 0000000000000000 RSI: ffffbf1a00fb007c RDI: ffffa04ec78f8a00
>> [   24.183145] RBP: 0000000000000b50 R08: 0000000000000000 R09: 0000000000000000
>> [   24.183147] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000216
>> [   24.183148] R13: ffffbf1a011d7624 R14: ffffa04ec78f8a08 R15: ffffa04ec78f8b40
>> [   24.183151] FS:  00007f4c524b2140(0000) GS:ffffa05007d00000(0000) knlGS:0000000000000000
>> [   24.183153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   24.183155] CR2: 00007f7025eae3e8 CR3: 00000001040f8000 CR4: 00000000000006f0
>> [   24.183157] Call Trace:
>> [   24.183162]  <NMI>
>> [   24.183167]  ? nmi_cpu_backtrace+0xbf/0x140
>> [   24.183175]  ? nmi_cpu_backtrace_handler+0x11/0x20
>> [   24.183181]  ? nmi_handle+0x61/0x160
>> [   24.183186]  ? default_do_nmi+0x42/0x110
>> [   24.183191]  ? exc_nmi+0x1bd/0x290
>> [   24.183194]  ? end_repeat_nmi+0xf/0x53
>> [   24.183203]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>> [   24.183207]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>> [   24.183210]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>> [   24.183213]  </NMI>
>> [   24.183214]  <TASK>
>> [   24.183215]  __this_module+0x31828/0x4f310 [tg3]
>> [   24.183218]  ? __this_module+0x2d390/0x4f310 [tg3]
>> [   24.183221]  __this_module+0x398e6/0x4f310 [tg3]
>> [   24.183225]  __this_module+0x3baf8/0x4f310 [tg3]
>> [   24.183229]  __this_module+0x4733f/0x4f310 [tg3]
>> [   24.183233]  ? _raw_spin_unlock_irqrestore+0x25/0x70
>> [   24.183237]  ? __this_module+0x398e6/0x4f310 [tg3]
>> [   24.183241]  __this_module+0x4b943/0x4f310 [tg3]
>> [   24.183244]  ? delay_tsc+0x89/0xf0
>> [   24.183249]  ? preempt_count_sub+0x51/0x60
>> [   24.183254]  __this_module+0x4be4b/0x4f310 [tg3]
>> [   24.183258]  __dev_open+0x103/0x1c0
>> [   24.183265]  __dev_change_flags+0x1bd/0x230
>> [   24.183269]  ? rtnl_getlink+0x362/0x400
>> [   24.183276]  dev_change_flags+0x26/0x70
>> [   24.183280]  do_setlink+0xe16/0x11f0
>> [   24.183286]  ? __nla_validate_parse+0x61/0xd40
>> [   24.183295]  __rtnl_newlink+0x63d/0x9f0
>> [   24.183301]  ? kmem_cache_alloc_node_noprof+0x12b/0x360
>> [   24.183308]  ? kmalloc_trace_noprof+0x11e/0x350
>> [   24.183312]  ? rtnl_newlink+0x2e/0x70
>> [   24.183316]  rtnl_newlink+0x47/0x70
>> [   24.183320]  rtnetlink_rcv_msg+0x152/0x400
>> [   24.183324]  ? __netlink_sendskb+0x68/0x90
>> [   24.183329]  ? netlink_unicast+0x237/0x290
>> [   24.183333]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
>> [   24.183336]  netlink_rcv_skb+0x5b/0x110
>> [   24.183343]  netlink_unicast+0x1a4/0x290
>> [   24.183347]  netlink_sendmsg+0x222/0x4a0
>> [   24.183350]  ? proc_get_long.constprop.0+0x116/0x210
>> [   24.183358]  ____sys_sendmsg+0x379/0x3b0
>> [   24.183363]  ? copy_msghdr_from_user+0x6d/0xb0
>> [   24.183368]  ___sys_sendmsg+0x86/0xe0
>> [   24.183372]  ? addrconf_sysctl_forward+0xf3/0x270
>> [   24.183378]  ? _copy_from_iter+0x8b/0x570
>> [   24.183384]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
>> [   24.183388]  ? _raw_spin_unlock+0x19/0x50
>> [   24.183392]  ? proc_sys_call_handler+0xf3/0x2f0
>> [   24.183397]  ? trace_hardirqs_on+0x29/0x90
>> [   24.183401]  ? __fdget+0xc2/0xf0
>> [   24.183405]  __sys_sendmsg+0x5b/0xc0
>> [   24.183410]  ? syscall_trace_enter+0x110/0x1b0
>> [   24.183416]  do_syscall_64+0x64/0x150
>> [   24.183423]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> I have bisected the error to this commit. Reverting it caused no new or
>> perceivable issues on both the MacBook and a Zen4-based laptop.
> 
> [...]
> 
>>> Ohh, and when you say "causing peripherals such as the wireless card to
>>> misbehave" what exactly do you mean?
>>
>> When the kernel throws rcu stall messages, the wireless card on the
>> MacBook may fail to discover and/or connect to wireless networks - not a
>> consistent behaviour but I suppose that something in the kernel got stuck.
> 
> See the ticket for more details and dmesg logs; the problem still
> happens with 6.12-rc. The reporter is CCed.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> P.S.: let me use this mail to also add the report to the list of tracked
> regressions to ensure it doesn't fall through the cracks:
> 
> #regzbot introduced: 55d4669ef1b76823083caecfab12a8bd2ccdcf64
> #regzbot from: Mingcong Bai <jeffbai@aosc.io>
> #regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=219390
> #regzbot title: rcu: wifi problems since tg3 started throwing rcu stall
> warnings
> #regzbot ignore-activity
> 
> 



* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-10-23  8:27 [Regression] wifi problems since tg3 started throwing rcu stall warnings Linux regression tracking (Thorsten Leemhuis)
  2024-10-23  9:11 ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-10-23 10:09 ` Frederic Weisbecker
  2024-10-23 10:22   ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2024-10-23 10:09 UTC
  To: Linux regressions mailing list; +Cc: LKML, Mingcong Bai, Paul E. McKenney, rcu

Hi Thorsten,

First, thanks for letting us know.

On Wed, Oct 23, 2024 at 10:27:18AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
> 
> Frederic, I noticed a report about a regression in bugzilla.kernel.org
> that appears to be caused by the following change of yours:
> 
> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
> invocation")

Are you sure about the commit? Below it says:

> 
> As many (most?) kernel developers don't keep an eye on the bug tracker,
> I decided to write this mail. To quote from
> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
> 
> >  Mingcong Bai 2024-10-15 13:32:35 UTC
> > 
> > Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and

Now that's aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2, which I can't find in the
vanilla tree.

Also, I fail to see an immediate connection between the stack trace below
and the commit above. So are we sure about that reference?

Thanks.


> > v6.10.5, the Broadcom Tigon3 Ethernet interface (tg3) found on Apple
> > MacBook Pro (15'', Mid 2010) would throw many rcu stall errors during
> > boot up, causing peripherals such as the wireless card to misbehave:
> > 
> > [   24.153855] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 2-.... } 21 jiffies s: 973 root: 0x4/.
> > [   24.166938] rcu: blocking rcu_node structures (internal RCU debug):
> > [   24.177800] Sending NMI from CPU 3 to CPUs 2:
> > [   24.183113] NMI backtrace for cpu 2
> > [   24.183119] CPU: 2 PID: 1049 Comm: NetworkManager Not tainted 6.10.5-aosc-main #1
> > [   24.183123] Hardware name: Apple Inc. MacBookPro6,2/Mac-F22586C8, BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
> > [   24.183125] RIP: 0010:__this_module+0x2d3d1/0x4f310 [tg3]
> > [   24.183135] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48 03 77 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90
> > [   24.183138] RSP: 0018:ffffbf1a011d75e8 EFLAGS: 00000082
> > [   24.183141] RAX: 0000000000000000 RBX: ffffa04ec78f8a00 RCX: 0000000000000000
> > [   24.183143] RDX: 0000000000000000 RSI: ffffbf1a00fb007c RDI: ffffa04ec78f8a00
> > [   24.183145] RBP: 0000000000000b50 R08: 0000000000000000 R09: 0000000000000000
> > [   24.183147] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000216
> > [   24.183148] R13: ffffbf1a011d7624 R14: ffffa04ec78f8a08 R15: ffffa04ec78f8b40
> > [   24.183151] FS:  00007f4c524b2140(0000) GS:ffffa05007d00000(0000) knlGS:0000000000000000
> > [   24.183153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   24.183155] CR2: 00007f7025eae3e8 CR3: 00000001040f8000 CR4: 00000000000006f0
> > [   24.183157] Call Trace:
> > [   24.183162]  <NMI>
> > [   24.183167]  ? nmi_cpu_backtrace+0xbf/0x140
> > [   24.183175]  ? nmi_cpu_backtrace_handler+0x11/0x20
> > [   24.183181]  ? nmi_handle+0x61/0x160
> > [   24.183186]  ? default_do_nmi+0x42/0x110
> > [   24.183191]  ? exc_nmi+0x1bd/0x290
> > [   24.183194]  ? end_repeat_nmi+0xf/0x53
> > [   24.183203]  ? __this_module+0x2d3d1/0x4f310 [tg3]
> > [   24.183207]  ? __this_module+0x2d3d1/0x4f310 [tg3]
> > [   24.183210]  ? __this_module+0x2d3d1/0x4f310 [tg3]
> > [   24.183213]  </NMI>
> > [   24.183214]  <TASK>
> > [   24.183215]  __this_module+0x31828/0x4f310 [tg3]
> > [   24.183218]  ? __this_module+0x2d390/0x4f310 [tg3]
> > [   24.183221]  __this_module+0x398e6/0x4f310 [tg3]
> > [   24.183225]  __this_module+0x3baf8/0x4f310 [tg3]
> > [   24.183229]  __this_module+0x4733f/0x4f310 [tg3]
> > [   24.183233]  ? _raw_spin_unlock_irqrestore+0x25/0x70
> > [   24.183237]  ? __this_module+0x398e6/0x4f310 [tg3]
> > [   24.183241]  __this_module+0x4b943/0x4f310 [tg3]
> > [   24.183244]  ? delay_tsc+0x89/0xf0
> > [   24.183249]  ? preempt_count_sub+0x51/0x60
> > [   24.183254]  __this_module+0x4be4b/0x4f310 [tg3]
> > [   24.183258]  __dev_open+0x103/0x1c0
> > [   24.183265]  __dev_change_flags+0x1bd/0x230
> > [   24.183269]  ? rtnl_getlink+0x362/0x400
> > [   24.183276]  dev_change_flags+0x26/0x70
> > [   24.183280]  do_setlink+0xe16/0x11f0
> > [   24.183286]  ? __nla_validate_parse+0x61/0xd40
> > [   24.183295]  __rtnl_newlink+0x63d/0x9f0
> > [   24.183301]  ? kmem_cache_alloc_node_noprof+0x12b/0x360
> > [   24.183308]  ? kmalloc_trace_noprof+0x11e/0x350
> > [   24.183312]  ? rtnl_newlink+0x2e/0x70
> > [   24.183316]  rtnl_newlink+0x47/0x70
> > [   24.183320]  rtnetlink_rcv_msg+0x152/0x400
> > [   24.183324]  ? __netlink_sendskb+0x68/0x90
> > [   24.183329]  ? netlink_unicast+0x237/0x290
> > [   24.183333]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
> > [   24.183336]  netlink_rcv_skb+0x5b/0x110
> > [   24.183343]  netlink_unicast+0x1a4/0x290
> > [   24.183347]  netlink_sendmsg+0x222/0x4a0
> > [   24.183350]  ? proc_get_long.constprop.0+0x116/0x210
> > [   24.183358]  ____sys_sendmsg+0x379/0x3b0
> > [   24.183363]  ? copy_msghdr_from_user+0x6d/0xb0
> > [   24.183368]  ___sys_sendmsg+0x86/0xe0
> > [   24.183372]  ? addrconf_sysctl_forward+0xf3/0x270
> > [   24.183378]  ? _copy_from_iter+0x8b/0x570
> > [   24.183384]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
> > [   24.183388]  ? _raw_spin_unlock+0x19/0x50
> > [   24.183392]  ? proc_sys_call_handler+0xf3/0x2f0
> > [   24.183397]  ? trace_hardirqs_on+0x29/0x90
> > [   24.183401]  ? __fdget+0xc2/0xf0
> > [   24.183405]  __sys_sendmsg+0x5b/0xc0
> > [   24.183410]  ? syscall_trace_enter+0x110/0x1b0
> > [   24.183416]  do_syscall_64+0x64/0x150
> > [   24.183423]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > 
> > I have bisected the error to this commit. Reverting it caused no new or
> > perceivable issues on both the MacBook and a Zen4-based laptop.
> 
> [...]
> 
> >> Ohh, and when you say "causing peripherals such as the wireless card to
> >> misbehave" what exactly do you mean?
> > 
> > When the kernel throws rcu stall messages, the wireless card on the
> > MacBook may fail to discover and/or connect to wireless networks - not a
> > consistent behaviour but I suppose that something in the kernel got stuck.
> 
> See the ticket for more details and dmesg logs; the problem still
> happens with 6.12-rc. The reporter is CCed.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> P.S.: let me use this mail to also add the report to the list of tracked
> regressions to ensure it doesn't fall through the cracks:
> 
> #regzbot introduced: 55d4669ef1b76823083caecfab12a8bd2ccdcf64
> #regzbot from: Mingcong Bai <jeffbai@aosc.io>
> #regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=219390
> #regzbot title: rcu: wifi problems since tg3 started throwing rcu stall
> warnings
> #regzbot ignore-activity
> 


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-10-23 10:09 ` Frederic Weisbecker
@ 2024-10-23 10:22   ` Linux regression tracking (Thorsten Leemhuis)
  2024-11-05  7:17     ` Mingcong Bai
  0 siblings, 1 reply; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-10-23 10:22 UTC
  To: Frederic Weisbecker, Linux regressions mailing list
  Cc: LKML, Mingcong Bai, Paul E. McKenney, rcu

On 23.10.24 12:09, Frederic Weisbecker wrote:
> On Wed, Oct 23, 2024 at 10:27:18AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> Hi, Thorsten here, the Linux kernel's regression tracker.
>>
>> Frederic, I noticed a report about a regression in bugzilla.kernel.org
>> that appears to be caused by the following change of yours:
>>
>> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
>> invocation")

> Are you sure about the commit? Below it says:

Not totally, but...

>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>> I decided to write this mail. To quote from
>> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
>>
>>>  Mingcong Bai 2024-10-15 13:32:35 UTC
>>>
>>> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
> 
> Now that's aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 which I can't find in vanilla
> tree.

...quite, as that is the commit-id of the backport to v6.10.5; and the
reporter reverted it there. Ideally of course that would have happened
on recent mainline. If in doubt, ask Mingcong Bai to check whether a
revert there helps, too.
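
Stable backports conventionally cite their mainline origin in the commit
message, either as "commit <id> upstream." or as "[ Upstream commit <id> ]",
so the mapping between the two ids can be checked mechanically. A minimal
sketch in Python, assuming those two conventions; the function name
`upstream_commit` and the sample message are made up for illustration and
are not the actual text of the v6.10.5 backport:

```python
import re

# The two conventional ways a stable backport cites its mainline origin.
# These are conventions, not an enforced format, so a miss proves nothing.
_UPSTREAM_PATTERNS = (
    re.compile(r"^\s*commit ([0-9a-f]{12,40}) upstream\.?\s*$", re.I | re.M),
    re.compile(r"^\s*\[ Upstream commit ([0-9a-f]{12,40}) \]\s*$", re.I | re.M),
)


def upstream_commit(message):
    """Return the mainline commit id cited in a stable commit message, or None."""
    for pattern in _UPSTREAM_PATTERNS:
        match = pattern.search(message)
        if match:
            return match.group(1).lower()
    return None


# Hypothetical commit message, built only from the ids and subject quoted
# in this thread; the real backport message may be worded differently.
sample_message = """rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation

[ Upstream commit 55d4669ef1b76823083caecfab12a8bd2ccdcf64 ]

(rest of the commit message)
"""

if __name__ == "__main__":
    print(upstream_commit(sample_message))
```

The message body of a real commit can be obtained with
`git log -1 --format=%B <commit-id>` in a checkout of the stable tree and
passed to the function.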

HTH, Ciao, Thorsten

> Also I'm failing to see an immediate issue between the below stacktrace
> and the above commit. So are we sure about that reference?
> 
> Thanks.
> 
> 
>>> v6.10.5, the Broadcom Tigon3 Ethernet interface (tg3) found on Apple
>>> MacBook Pro (15'', Mid 2010) would throw many rcu stall errors during
>>> boot up, causing peripherals such as the wireless card to misbehave:
>>>
>>> [   24.153855] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 2-.... } 21 jiffies s: 973 root: 0x4/.
>>> [   24.166938] rcu: blocking rcu_node structures (internal RCU debug):
>>> [   24.177800] Sending NMI from CPU 3 to CPUs 2:
>>> [   24.183113] NMI backtrace for cpu 2
>>> [   24.183119] CPU: 2 PID: 1049 Comm: NetworkManager Not tainted 6.10.5-aosc-main #1
>>> [   24.183123] Hardware name: Apple Inc. MacBookPro6,2/Mac-F22586C8, BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
>>> [   24.183125] RIP: 0010:__this_module+0x2d3d1/0x4f310 [tg3]
>>> [   24.183135] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48 03 77 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90
>>> [   24.183138] RSP: 0018:ffffbf1a011d75e8 EFLAGS: 00000082
>>> [   24.183141] RAX: 0000000000000000 RBX: ffffa04ec78f8a00 RCX: 0000000000000000
>>> [   24.183143] RDX: 0000000000000000 RSI: ffffbf1a00fb007c RDI: ffffa04ec78f8a00
>>> [   24.183145] RBP: 0000000000000b50 R08: 0000000000000000 R09: 0000000000000000
>>> [   24.183147] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000216
>>> [   24.183148] R13: ffffbf1a011d7624 R14: ffffa04ec78f8a08 R15: ffffa04ec78f8b40
>>> [   24.183151] FS:  00007f4c524b2140(0000) GS:ffffa05007d00000(0000) knlGS:0000000000000000
>>> [   24.183153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   24.183155] CR2: 00007f7025eae3e8 CR3: 00000001040f8000 CR4: 00000000000006f0
>>> [   24.183157] Call Trace:
>>> [   24.183162]  <NMI>
>>> [   24.183167]  ? nmi_cpu_backtrace+0xbf/0x140
>>> [   24.183175]  ? nmi_cpu_backtrace_handler+0x11/0x20
>>> [   24.183181]  ? nmi_handle+0x61/0x160
>>> [   24.183186]  ? default_do_nmi+0x42/0x110
>>> [   24.183191]  ? exc_nmi+0x1bd/0x290
>>> [   24.183194]  ? end_repeat_nmi+0xf/0x53
>>> [   24.183203]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>> [   24.183207]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>> [   24.183210]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>> [   24.183213]  </NMI>
>>> [   24.183214]  <TASK>
>>> [   24.183215]  __this_module+0x31828/0x4f310 [tg3]
>>> [   24.183218]  ? __this_module+0x2d390/0x4f310 [tg3]
>>> [   24.183221]  __this_module+0x398e6/0x4f310 [tg3]
>>> [   24.183225]  __this_module+0x3baf8/0x4f310 [tg3]
>>> [   24.183229]  __this_module+0x4733f/0x4f310 [tg3]
>>> [   24.183233]  ? _raw_spin_unlock_irqrestore+0x25/0x70
>>> [   24.183237]  ? __this_module+0x398e6/0x4f310 [tg3]
>>> [   24.183241]  __this_module+0x4b943/0x4f310 [tg3]
>>> [   24.183244]  ? delay_tsc+0x89/0xf0
>>> [   24.183249]  ? preempt_count_sub+0x51/0x60
>>> [   24.183254]  __this_module+0x4be4b/0x4f310 [tg3]
>>> [   24.183258]  __dev_open+0x103/0x1c0
>>> [   24.183265]  __dev_change_flags+0x1bd/0x230
>>> [   24.183269]  ? rtnl_getlink+0x362/0x400
>>> [   24.183276]  dev_change_flags+0x26/0x70
>>> [   24.183280]  do_setlink+0xe16/0x11f0
>>> [   24.183286]  ? __nla_validate_parse+0x61/0xd40
>>> [   24.183295]  __rtnl_newlink+0x63d/0x9f0
>>> [   24.183301]  ? kmem_cache_alloc_node_noprof+0x12b/0x360
>>> [   24.183308]  ? kmalloc_trace_noprof+0x11e/0x350
>>> [   24.183312]  ? rtnl_newlink+0x2e/0x70
>>> [   24.183316]  rtnl_newlink+0x47/0x70
>>> [   24.183320]  rtnetlink_rcv_msg+0x152/0x400
>>> [   24.183324]  ? __netlink_sendskb+0x68/0x90
>>> [   24.183329]  ? netlink_unicast+0x237/0x290
>>> [   24.183333]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
>>> [   24.183336]  netlink_rcv_skb+0x5b/0x110
>>> [   24.183343]  netlink_unicast+0x1a4/0x290
>>> [   24.183347]  netlink_sendmsg+0x222/0x4a0
>>> [   24.183350]  ? proc_get_long.constprop.0+0x116/0x210
>>> [   24.183358]  ____sys_sendmsg+0x379/0x3b0
>>> [   24.183363]  ? copy_msghdr_from_user+0x6d/0xb0
>>> [   24.183368]  ___sys_sendmsg+0x86/0xe0
>>> [   24.183372]  ? addrconf_sysctl_forward+0xf3/0x270
>>> [   24.183378]  ? _copy_from_iter+0x8b/0x570
>>> [   24.183384]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
>>> [   24.183388]  ? _raw_spin_unlock+0x19/0x50
>>> [   24.183392]  ? proc_sys_call_handler+0xf3/0x2f0
>>> [   24.183397]  ? trace_hardirqs_on+0x29/0x90
>>> [   24.183401]  ? __fdget+0xc2/0xf0
>>> [   24.183405]  __sys_sendmsg+0x5b/0xc0
>>> [   24.183410]  ? syscall_trace_enter+0x110/0x1b0
>>> [   24.183416]  do_syscall_64+0x64/0x150
>>> [   24.183423]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> I have bisected the error to this commit. Reverting it caused no new or
>>> perceivable issues on both the MacBook and a Zen4-based laptop.
>>
>> [...]
>>
>>>> Ohh, and when you say "causing peripherals such as the wireless card to
>>>> misbehave" what exactly do you mean?
>>>
>>> When the kernel throws rcu stall messages, the wireless card on the
>>> MacBook may fail to discover and/or connect to wireless networks - not a
>>> consistent behaviour but I suppose that something in the kernel got stuck.
>>
>> See the ticket for more details and dmesg logs; the problem still
>> happens with 6.12-rc. The reporter is CCed.
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> P.S.: let me use this mail to also add the report to the list of tracked
>> regressions to ensure it doesn't fall through the cracks:
>>
>> #regzbot introduced: 55d4669ef1b76823083caecfab12a8bd2ccdcf64
>> #regzbot from: Mingcong Bai <jeffbai@aosc.io>
>> #regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=219390
>> #regzbot title: rcu: wifi problems since tg3 started throwing rcu stall
>> warnings
>> #regzbot ignore-activity
>>
> 
> 



* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-10-23 10:22   ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-11-05  7:17     ` Mingcong Bai
  2024-11-07  9:10       ` Thorsten Leemhuis
  0 siblings, 1 reply; 14+ messages in thread
From: Mingcong Bai @ 2024-11-05  7:17 UTC
  To: Linux regressions mailing list
  Cc: Frederic Weisbecker, LKML, Paul E. McKenney, rcu, sakiiily

Hi Frederic and Thorsten,

(CC-ing the laptop's owner so that she might help with further 
testing...)

On 2024-10-23 18:22, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 23.10.24 12:09, Frederic Weisbecker wrote:
>> On Wed, Oct 23, 2024 at 10:27:18AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> Hi, Thorsten here, the Linux kernel's regression tracker.
>>> 
>>> Frederic, I noticed a report about a regression in 
>>> bugzilla.kernel.org
>>> that appears to be caused by the following change of yours:
>>> 
>>> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
>>> invocation")
> 
>> Are you sure about the commit? Below it says:
> 
> Not totally, but...
> 
>>> As many (most?) kernel developers don't keep an eye on the bug 
>>> tracker,
>>> I decided to write this mail. To quote from
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
>>> 
>>>>  Mingcong Bai 2024-10-15 13:32:35 UTC
>>>> 
>>>> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
>> 
>> Now that's aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 which I can't find 
>> in vanilla
>> tree.
> 
> ...quite, as that is the commit-id of the backport to v6.10.5; and the
> reporter reverted it there. Ideally of course that would have happened
> on recent mainline. If in doubt, ask Mingcong Bai to check whether a
> revert there helps, too.

Do we need any further information/testing on this issue? Please let me 
know if there's anything we can do as the issue still persists in 6.12.

Best Regards,
Mingcong Bai

> 
> HTH, Ciao, Thorsten
> 
>> Also I'm failing to see an immediate issue between the below 
>> stacktrace
>> and the above commit. So are we sure about that reference?
>> 
>> Thanks.
>> 
>> 
>>>> v6.10.5, the Broadcom Tigon3 Ethernet interface (tg3) found on Apple
>>>> MacBook Pro (15'', Mid 2010) would throw many rcu stall errors 
>>>> during
>>>> boot up, causing peripherals such as the wireless card to misbehave:
>>>> 
>>>> [   24.153855] rcu: INFO: rcu_preempt detected expedited stalls on 
>>>> CPUs/tasks: { 2-.... } 21 jiffies s: 973 root: 0x4/.
>>>> [   24.166938] rcu: blocking rcu_node structures (internal RCU 
>>>> debug):
>>>> [   24.177800] Sending NMI from CPU 3 to CPUs 2:
>>>> [   24.183113] NMI backtrace for cpu 2
>>>> [   24.183119] CPU: 2 PID: 1049 Comm: NetworkManager Not tainted 
>>>> 6.10.5-aosc-main #1
>>>> [   24.183123] Hardware name: Apple Inc. MacBookPro6,2/Mac-F22586C8, 
>>>> BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
>>>> [   24.183125] RIP: 0010:__this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183135] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 
>>>> 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48 03 77 
>>>> 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 
>>>> 90 90 90
>>>> [   24.183138] RSP: 0018:ffffbf1a011d75e8 EFLAGS: 00000082
>>>> [   24.183141] RAX: 0000000000000000 RBX: ffffa04ec78f8a00 RCX: 
>>>> 0000000000000000
>>>> [   24.183143] RDX: 0000000000000000 RSI: ffffbf1a00fb007c RDI: 
>>>> ffffa04ec78f8a00
>>>> [   24.183145] RBP: 0000000000000b50 R08: 0000000000000000 R09: 
>>>> 0000000000000000
>>>> [   24.183147] R10: 0000000000000000 R11: 0000000000000000 R12: 
>>>> 0000000000000216
>>>> [   24.183148] R13: ffffbf1a011d7624 R14: ffffa04ec78f8a08 R15: 
>>>> ffffa04ec78f8b40
>>>> [   24.183151] FS:  00007f4c524b2140(0000) GS:ffffa05007d00000(0000) 
>>>> knlGS:0000000000000000
>>>> [   24.183153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   24.183155] CR2: 00007f7025eae3e8 CR3: 00000001040f8000 CR4: 
>>>> 00000000000006f0
>>>> [   24.183157] Call Trace:
>>>> [   24.183162]  <NMI>
>>>> [   24.183167]  ? nmi_cpu_backtrace+0xbf/0x140
>>>> [   24.183175]  ? nmi_cpu_backtrace_handler+0x11/0x20
>>>> [   24.183181]  ? nmi_handle+0x61/0x160
>>>> [   24.183186]  ? default_do_nmi+0x42/0x110
>>>> [   24.183191]  ? exc_nmi+0x1bd/0x290
>>>> [   24.183194]  ? end_repeat_nmi+0xf/0x53
>>>> [   24.183203]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183207]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183210]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183213]  </NMI>
>>>> [   24.183214]  <TASK>
>>>> [   24.183215]  __this_module+0x31828/0x4f310 [tg3]
>>>> [   24.183218]  ? __this_module+0x2d390/0x4f310 [tg3]
>>>> [   24.183221]  __this_module+0x398e6/0x4f310 [tg3]
>>>> [   24.183225]  __this_module+0x3baf8/0x4f310 [tg3]
>>>> [   24.183229]  __this_module+0x4733f/0x4f310 [tg3]
>>>> [   24.183233]  ? _raw_spin_unlock_irqrestore+0x25/0x70
>>>> [   24.183237]  ? __this_module+0x398e6/0x4f310 [tg3]
>>>> [   24.183241]  __this_module+0x4b943/0x4f310 [tg3]
>>>> [   24.183244]  ? delay_tsc+0x89/0xf0
>>>> [   24.183249]  ? preempt_count_sub+0x51/0x60
>>>> [   24.183254]  __this_module+0x4be4b/0x4f310 [tg3]
>>>> [   24.183258]  __dev_open+0x103/0x1c0
>>>> [   24.183265]  __dev_change_flags+0x1bd/0x230
>>>> [   24.183269]  ? rtnl_getlink+0x362/0x400
>>>> [   24.183276]  dev_change_flags+0x26/0x70
>>>> [   24.183280]  do_setlink+0xe16/0x11f0
>>>> [   24.183286]  ? __nla_validate_parse+0x61/0xd40
>>>> [   24.183295]  __rtnl_newlink+0x63d/0x9f0
>>>> [   24.183301]  ? kmem_cache_alloc_node_noprof+0x12b/0x360
>>>> [   24.183308]  ? kmalloc_trace_noprof+0x11e/0x350
>>>> [   24.183312]  ? rtnl_newlink+0x2e/0x70
>>>> [   24.183316]  rtnl_newlink+0x47/0x70
>>>> [   24.183320]  rtnetlink_rcv_msg+0x152/0x400
>>>> [   24.183324]  ? __netlink_sendskb+0x68/0x90
>>>> [   24.183329]  ? netlink_unicast+0x237/0x290
>>>> [   24.183333]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
>>>> [   24.183336]  netlink_rcv_skb+0x5b/0x110
>>>> [   24.183343]  netlink_unicast+0x1a4/0x290
>>>> [   24.183347]  netlink_sendmsg+0x222/0x4a0
>>>> [   24.183350]  ? proc_get_long.constprop.0+0x116/0x210
>>>> [   24.183358]  ____sys_sendmsg+0x379/0x3b0
>>>> [   24.183363]  ? copy_msghdr_from_user+0x6d/0xb0
>>>> [   24.183368]  ___sys_sendmsg+0x86/0xe0
>>>> [   24.183372]  ? addrconf_sysctl_forward+0xf3/0x270
>>>> [   24.183378]  ? _copy_from_iter+0x8b/0x570
>>>> [   24.183384]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
>>>> [   24.183388]  ? _raw_spin_unlock+0x19/0x50
>>>> [   24.183392]  ? proc_sys_call_handler+0xf3/0x2f0
>>>> [   24.183397]  ? trace_hardirqs_on+0x29/0x90
>>>> [   24.183401]  ? __fdget+0xc2/0xf0
>>>> [   24.183405]  __sys_sendmsg+0x5b/0xc0
>>>> [   24.183410]  ? syscall_trace_enter+0x110/0x1b0
>>>> [   24.183416]  do_syscall_64+0x64/0x150
>>>> [   24.183423]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> 
>>>> I have bisected the error to this commit. Reverting it caused no new 
>>>> or
>>>> perceivable issues on both the MacBook and a Zen4-based laptop.
>>> 
>>> [...]
>>> 
>>>>> Ohh, and when you say "causing peripherals such as the wireless 
>>>>> card to
>>>>> misbehave" what exactly do you mean?
>>>> 
>>>> When the kernel throws rcu stall messages, the wireless card on the
>>>> MacBook may fail to discover and/or connect to wireless networks - 
>>>> not a
>>>> consistent behaviour but I suppose that something in the kernel got 
>>>> stuck.
>>> 
>>> See the ticket for more details and dmesg logs; the problem still
>>> happens with 6.12-rc. The reporter is CCed.
>>> 
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' 
>>> hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>> 
>>> P.S.: let me use this mail to also add the report to the list of 
>>> tracked
>>> regressions to ensure it doesn't fall through the cracks:
>>> 
>>> #regzbot introduced: 55d4669ef1b76823083caecfab12a8bd2ccdcf64
>>> #regzbot from: Mingcong Bai <jeffbai@aosc.io>
>>> #regzbot duplicate: 
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219390
>>> #regzbot title: rcu: wifi problems since tg3 started throwing rcu 
>>> stall
>>> warnings
>>> #regzbot ignore-activity
>>> 
>> 
>> 


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-05  7:17     ` Mingcong Bai
@ 2024-11-07  9:10       ` Thorsten Leemhuis
  2024-11-07 10:04         ` Frederic Weisbecker
  0 siblings, 1 reply; 14+ messages in thread
From: Thorsten Leemhuis @ 2024-11-07  9:10 UTC (permalink / raw)
  To: Mingcong Bai, Linux regressions mailing list
  Cc: Frederic Weisbecker, LKML, Paul E. McKenney, rcu, sakiiily

On 05.11.24 08:17, Mingcong Bai wrote:
> (CC-ing the laptop's owner so that she might help with further testing...)
> On 2024-10-23 18:22, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 23.10.24 12:09, Frederic Weisbecker wrote:
>>> On Wed, Oct 23, 2024 at 10:27:18AM +0200, Linux regression tracking
>>> (Thorsten Leemhuis) wrote:
>>>>
>>>> Frederic, I noticed a report about a regression in bugzilla.kernel.org
>>>> that appears to be caused by the following change of yours:
>>>> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
>>>> invocation")
>>> Are you sure about the commit? Below it says:
>> Not totally, but...
>>
>>>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>>>> I decided to write this mail. To quote from
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
>>>>
>>>>>  Mingcong Bai 2024-10-15 13:32:35 UTC
>>>>> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
>>> Now that's aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 which I can't
>>> find in vanilla
>>> tree.
>> ...quite, as that is the commit-id of the backport to v6.10.5; and the
>> reporter reverted it there. Ideally of course that would have happened
>> on recent mainline. If in doubt, ask Mingcong Bai to check if a revert
>> there helps, too.
> Do we need any further information/testing on this issue? Please let me
> know if there's anything we can do as the issue still persists in 6.12.

Hmm, no reply from Frederic. Not sure why, maybe he is just away from
the keyboard for a few days. But if the reporter has a minute, it might
be wise to check if reverting that commit on top of 6.12-rc6 or newer
also fixes the problem, to rule out any interference from changes
specific to the stable series.

Ciao, Thorsten

>>> Also I'm failing to see an immediate issue between the below stacktrace
>>> and the above commit. So are we sure about that reference?
>>>
>>> Thanks.
>>>
>>>
>>>>> v6.10.5, the Broadcom Tigon3 Ethernet interface (tg3) found on Apple
>>>>> MacBook Pro (15'', Mid 2010) would throw many rcu stall errors during
>>>>> boot up, causing peripherals such as the wireless card to misbehave:
>>>>>
>>>>> [   24.153855] rcu: INFO: rcu_preempt detected expedited stalls on
>>>>> CPUs/tasks: { 2-.... } 21 jiffies s: 973 root: 0x4/.
>>>>> [   24.166938] rcu: blocking rcu_node structures (internal RCU debug):
>>>>> [   24.177800] Sending NMI from CPU 3 to CPUs 2:
>>>>> [   24.183113] NMI backtrace for cpu 2
>>>>> [   24.183119] CPU: 2 PID: 1049 Comm: NetworkManager Not tainted
>>>>> 6.10.5-aosc-main #1
>>>>> [   24.183123] Hardware name: Apple Inc. MacBookPro6,2/Mac-
>>>>> F22586C8, BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
>>>>> [   24.183125] RIP: 0010:__this_module+0x2d3d1/0x4f310 [tg3]
>>>>> [   24.183135] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90
>>>>> 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48
>>>>> 03 77 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90
>>>>> 90 90 90 90 90
>>>>> [   24.183138] RSP: 0018:ffffbf1a011d75e8 EFLAGS: 00000082
>>>>> [   24.183141] RAX: 0000000000000000 RBX: ffffa04ec78f8a00 RCX:
>>>>> 0000000000000000
>>>>> [   24.183143] RDX: 0000000000000000 RSI: ffffbf1a00fb007c RDI:
>>>>> ffffa04ec78f8a00
>>>>> [   24.183145] RBP: 0000000000000b50 R08: 0000000000000000 R09:
>>>>> 0000000000000000
>>>>> [   24.183147] R10: 0000000000000000 R11: 0000000000000000 R12:
>>>>> 0000000000000216
>>>>> [   24.183148] R13: ffffbf1a011d7624 R14: ffffa04ec78f8a08 R15:
>>>>> ffffa04ec78f8b40
>>>>> [   24.183151] FS:  00007f4c524b2140(0000)
>>>>> GS:ffffa05007d00000(0000) knlGS:0000000000000000
>>>>> [   24.183153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [   24.183155] CR2: 00007f7025eae3e8 CR3: 00000001040f8000 CR4:
>>>>> 00000000000006f0
>>>>> [   24.183157] Call Trace:
>>>>> [   24.183162]  <NMI>
>>>>> [   24.183167]  ? nmi_cpu_backtrace+0xbf/0x140
>>>>> [   24.183175]  ? nmi_cpu_backtrace_handler+0x11/0x20
>>>>> [   24.183181]  ? nmi_handle+0x61/0x160
>>>>> [   24.183186]  ? default_do_nmi+0x42/0x110
>>>>> [   24.183191]  ? exc_nmi+0x1bd/0x290
>>>>> [   24.183194]  ? end_repeat_nmi+0xf/0x53
>>>>> [   24.183203]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>>> [   24.183207]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>>> [   24.183210]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>>> [   24.183213]  </NMI>
>>>>> [   24.183214]  <TASK>
>>>>> [   24.183215]  __this_module+0x31828/0x4f310 [tg3]
>>>>> [   24.183218]  ? __this_module+0x2d390/0x4f310 [tg3]
>>>>> [   24.183221]  __this_module+0x398e6/0x4f310 [tg3]
>>>>> [   24.183225]  __this_module+0x3baf8/0x4f310 [tg3]
>>>>> [   24.183229]  __this_module+0x4733f/0x4f310 [tg3]
>>>>> [   24.183233]  ? _raw_spin_unlock_irqrestore+0x25/0x70
>>>>> [   24.183237]  ? __this_module+0x398e6/0x4f310 [tg3]
>>>>> [   24.183241]  __this_module+0x4b943/0x4f310 [tg3]
>>>>> [   24.183244]  ? delay_tsc+0x89/0xf0
>>>>> [   24.183249]  ? preempt_count_sub+0x51/0x60
>>>>> [   24.183254]  __this_module+0x4be4b/0x4f310 [tg3]
>>>>> [   24.183258]  __dev_open+0x103/0x1c0
>>>>> [   24.183265]  __dev_change_flags+0x1bd/0x230
>>>>> [   24.183269]  ? rtnl_getlink+0x362/0x400
>>>>> [   24.183276]  dev_change_flags+0x26/0x70
>>>>> [   24.183280]  do_setlink+0xe16/0x11f0
>>>>> [   24.183286]  ? __nla_validate_parse+0x61/0xd40
>>>>> [   24.183295]  __rtnl_newlink+0x63d/0x9f0
>>>>> [   24.183301]  ? kmem_cache_alloc_node_noprof+0x12b/0x360
>>>>> [   24.183308]  ? kmalloc_trace_noprof+0x11e/0x350
>>>>> [   24.183312]  ? rtnl_newlink+0x2e/0x70
>>>>> [   24.183316]  rtnl_newlink+0x47/0x70
>>>>> [   24.183320]  rtnetlink_rcv_msg+0x152/0x400
>>>>> [   24.183324]  ? __netlink_sendskb+0x68/0x90
>>>>> [   24.183329]  ? netlink_unicast+0x237/0x290
>>>>> [   24.183333]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
>>>>> [   24.183336]  netlink_rcv_skb+0x5b/0x110
>>>>> [   24.183343]  netlink_unicast+0x1a4/0x290
>>>>> [   24.183347]  netlink_sendmsg+0x222/0x4a0
>>>>> [   24.183350]  ? proc_get_long.constprop.0+0x116/0x210
>>>>> [   24.183358]  ____sys_sendmsg+0x379/0x3b0
>>>>> [   24.183363]  ? copy_msghdr_from_user+0x6d/0xb0
>>>>> [   24.183368]  ___sys_sendmsg+0x86/0xe0
>>>>> [   24.183372]  ? addrconf_sysctl_forward+0xf3/0x270
>>>>> [   24.183378]  ? _copy_from_iter+0x8b/0x570
>>>>> [   24.183384]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
>>>>> [   24.183388]  ? _raw_spin_unlock+0x19/0x50
>>>>> [   24.183392]  ? proc_sys_call_handler+0xf3/0x2f0
>>>>> [   24.183397]  ? trace_hardirqs_on+0x29/0x90
>>>>> [   24.183401]  ? __fdget+0xc2/0xf0
>>>>> [   24.183405]  __sys_sendmsg+0x5b/0xc0
>>>>> [   24.183410]  ? syscall_trace_enter+0x110/0x1b0
>>>>> [   24.183416]  do_syscall_64+0x64/0x150
>>>>> [   24.183423]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>
>>>>> I have bisected the error to this commit. Reverting it caused no
>>>>> new or
>>>>> perceivable issues on both the MacBook and a Zen4-based laptop.
>>>>
>>>> [...]
>>>>
>>>>>> Ohh, and when you say "causing peripherals such as the wireless
>>>>>> card to
>>>>>> misbehave" what exactly do you mean?
>>>>>
>>>>> When the kernel throws rcu stall messages, the wireless card on the
>>>>> MacBook may fail to discover and/or connect to wireless networks -
>>>>> not a
>>>>> consistent behaviour but I suppose that something in the kernel got
>>>>> stuck.
>>>>
>>>> See the ticket for more details and dmesg logs; the problem still
>>>> happens with 6.12-rc. The reporter is CCed.
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker'
>>>> hat)
>>>> -- 
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> If I did something stupid, please tell me, as explained on that page.
>>>>
>>>> P.S.: let me use this mail to also add the report to the list of
>>>> tracked
>>>> regressions to ensure it doesn't fall through the cracks:
>>>>
>>>> #regzbot introduced: 55d4669ef1b76823083caecfab12a8bd2ccdcf64
>>>> #regzbot from: Mingcong Bai <jeffbai@aosc.io>
>>>> #regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=219390
>>>> #regzbot title: rcu: wifi problems since tg3 started throwing rcu stall
>>>> warnings
>>>> #regzbot ignore-activity
>>>>
>>>
>>>
> 
> 



* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-07  9:10       ` Thorsten Leemhuis
@ 2024-11-07 10:04         ` Frederic Weisbecker
  2024-11-07 10:33           ` Mingcong Bai
  2024-11-07 16:29           ` Mingcong Bai
  0 siblings, 2 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2024-11-07 10:04 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Mingcong Bai, Linux regressions mailing list, LKML,
	Paul E. McKenney, rcu, sakiiily

On Thu, Nov 07, 2024 at 10:10:37AM +0100, Thorsten Leemhuis wrote:
> On 05.11.24 08:17, Mingcong Bai wrote:
> > (CC-ing the laptop's owner so that she might help with further testing...)
> > On 2024-10-23 18:22, Linux regression tracking (Thorsten Leemhuis) wrote:
> >> On 23.10.24 12:09, Frederic Weisbecker wrote:
> >>> On Wed, Oct 23, 2024 at 10:27:18AM +0200, Linux regression tracking
> >>> (Thorsten Leemhuis) wrote:
> >>>>
> >>>> Frederic, I noticed a report about a regression in bugzilla.kernel.org
> >>>> that appears to be caused by the following change of yours:
> >>>> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
> >>>> invocation")
> >>> Are you sure about the commit? Below it says:
> >> Not totally, but...
> >>
> >>>> As many (most?) kernel developers don't keep an eye on the bug tracker,
> >>>> I decided to write this mail. To quote from
> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
> >>>>
> >>>>>  Mingcong Bai 2024-10-15 13:32:35 UTC
> >>>>> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
> >>> Now that's aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 which I can't
> >>> find in vanilla
> >>> tree.
> >> ...quite, as that is the commit-id of the backport to v6.10.5; and the
> >> reporter reverted it there. Ideally of course that would have happened
> >> on recent mainline. If in doubt, ask Mingcong Bai to check if a revert
> >> there helps, too.
> > Do we need any further information/testing on this issue? Please let me
> > know if there's anything we can do as the issue still persists in 6.12.
> 
> Hmm, no reply from Frederic. Not sure why, maybe he is just away from
> the keyboard for a few days. But if the reporter has a minute, it might
> be wise to check if reverting that commit on top of 6.12-rc6 or newer
> also fixes the problem, to rule out any interference from changes
> specific to the stable series.
> 
> Ciao, Thorsten

Sorry for the lag, I still don't understand how this specific commit
can produce this issue. Can you please retry with and without this commit
reverted?

Thanks.


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-07 10:04         ` Frederic Weisbecker
@ 2024-11-07 10:33           ` Mingcong Bai
  2024-11-07 16:29           ` Mingcong Bai
  1 sibling, 0 replies; 14+ messages in thread
From: Mingcong Bai @ 2024-11-07 10:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thorsten Leemhuis, Linux regressions mailing list, LKML,
	Paul E. McKenney, rcu, sakiiily

Hi all,

On 2024-11-07 18:04, Frederic Weisbecker wrote:
> On Thu, Nov 07, 2024 at 10:10:37AM +0100, Thorsten Leemhuis wrote:
>> On 05.11.24 08:17, Mingcong Bai wrote:
>> > (CC-ing the laptop's owner so that she might help with further testing...)
>> > On 2024-10-23 18:22, Linux regression tracking (Thorsten Leemhuis) wrote:
>> >> On 23.10.24 12:09, Frederic Weisbecker wrote:
>> >>> On Wed, Oct 23, 2024 at 10:27:18AM +0200, Linux regression tracking
>> >>> (Thorsten Leemhuis) wrote:
>> >>>>
>> >>>> Frederic, I noticed a report about a regression in bugzilla.kernel.org
>> >>>> that appears to be caused by the following change of yours:
>> >>>> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
>> >>>> invocation")
>> >>> Are you sure about the commit? Below it says:
>> >> Not totally, but...
>> >>
>> >>>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>> >>>> I decided to write this mail. To quote from
>> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
>> >>>>
>> >>>>>  Mingcong Bai 2024-10-15 13:32:35 UTC
>> >>>>> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
>> >>> Now that's aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 which I can't
>> >>> find in vanilla
>> >>> tree.
>> >> ...quite, as that is the commit-id of the backport to v6.10.5; and the
>> >> reporter reverted it there. Ideally of course that would have happened
>> >> on recent mainline. If in doubt, ask Mingcong Bai to check if a revert
>> >> there helps, too.
>> > Do we need any further information/testing on this issue? Please let me
>> > know if there's anything we can do as the issue still persists in 6.12.
>> 
>> Hmm, no reply from Frederic. Not sure why, maybe he is just away from
>> the keyboard for a few days. But if the reporter has a minute, it 
>> might
>> be wise to check if reverting that commit on top of 6.12-rc6 or newer
>> also fixes the problem, to rule out any interference from changes
>> specific to the stable series.
>> 
>> Ciao, Thorsten
> 
> Sorry for the lag, I still don't understand how this specific commit
> can produce this issue. Can you please retry with and without this 
> commit
> reverted?
> 
> Thanks.

Yes, we are on it. Should report back in six hours or so.

Best Regards,
Mingcong Bai


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-07 10:04         ` Frederic Weisbecker
  2024-11-07 10:33           ` Mingcong Bai
@ 2024-11-07 16:29           ` Mingcong Bai
  2024-11-08 13:46             ` Frederic Weisbecker
  1 sibling, 1 reply; 14+ messages in thread
From: Mingcong Bai @ 2024-11-07 16:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thorsten Leemhuis, Linux regressions mailing list, LKML,
	Paul E. McKenney, rcu, sakiiily, Kexy Biscuit

[-- Attachment #1: Type: text/plain, Size: 4883 bytes --]

Hi Frederic,

<snip>

> Sorry for the lag, I still don't understand how this specific commit
> can produce this issue. Can you please retry with and without this 
> commit
> reverted?

Just tested v6.12-rc6 with and without the revert. Without the revert, 
the touchpad and the wireless adapter both stopped working, whereas with 
the revert, both devices function as normal.

I have attached the dmesg for both kernels below. Unlike the log we got 
last time, there is no direct reference to tg3 any more, but the NMI 
backtrace still pointed to NetworkManager and net/netlink-related 
functions (perhaps a debug kernel would be more helpful?). Here's a 
snippet:

[   10.337720] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P683 } 21 jiffies s: 781 root: 0x0/T
[   10.339168] rcu: blocking rcu_node structures (internal RCU debug):
[   10.591480] loop0: detected capacity change from 0 to 8
[   11.777733] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 3-.... } 21 jiffies s: 1077 root: 0x8/.
[   11.779210] rcu: blocking rcu_node structures (internal RCU debug):
[   11.780630] Sending NMI from CPU 1 to CPUs 3:
[   11.780659] NMI backtrace for cpu 3
[   11.780663] CPU: 3 UID: 0 PID: 1027 Comm: NetworkManager Not tainted 6.12.0-aosc-main #1
[   11.780667] Hardware name: Apple Inc. MacBookPro6,2/Mac-F22586C8, BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
[   11.780670] RIP: 0010:0xffffffffc0482051
[   11.780679] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48 03 77 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90
[   11.780682] RSP: 0018:ffffb39a8131f5e8 EFLAGS: 00000082
[   11.780685] RAX: 0000000000000000 RBX: ffffa0f4bbd6aa40 RCX: 0000000000000000
[   11.780687] RDX: 0000000000000000 RSI: ffffb39a804b007c RDI: ffffa0f4bbd6aa40
[   11.780689] RBP: 0000000000000b50 R08: 0000000000000000 R09: 0000000000000000
[   11.780690] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000216
[   11.780692] R13: ffffb39a8131f624 R14: ffffa0f4bbd6aa48 R15: ffffa0f4bbd6ab80
[   11.780694] FS:  00007fd9da58d140(0000) GS:ffffa0f5c7d80000(0000) knlGS:0000000000000000
[   11.780696] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.780698] CR2: 00007fbbd4b681b0 CR3: 0000000104986000 CR4: 00000000000006f0
[   11.780700] Call Trace:
[   11.780706]  <NMI>
[   11.780710]  ? nmi_cpu_backtrace+0xbf/0x140
[   11.780719]  ? nmi_cpu_backtrace_handler+0x11/0x20
[   11.780725]  ? nmi_handle+0x61/0x160
[   11.780731]  ? default_do_nmi+0x42/0x110
[   11.780736]  ? exc_nmi+0x1bd/0x290
[   11.780740]  ? end_repeat_nmi+0xf/0x53
[   11.780748]  ? 0xffffffffc0482051
[   11.780752]  ? 0xffffffffc0482051
[   11.780754]  ? 0xffffffffc0482051
[   11.780756]  </NMI>
[   11.780757]  <TASK>
[   11.780758]  0xffffffffc0486508
[   11.780762]  ? 0xffffffffc0482010
[   11.780764]  0xffffffffc048e5b6
[   11.780767]  0xffffffffc04907b8
[   11.780770]  0xffffffffc049c01f
[   11.780773]  ? _raw_spin_unlock_irqrestore+0x25/0x70
[   11.780777]  ? 0xffffffffc048e5b6
[   11.780779]  0xffffffffc04a0a53
[   11.780782]  ? delay_tsc+0x89/0xf0
[   11.780786]  ? preempt_count_sub+0x51/0x60
[   11.780792]  0xffffffffc04a0f5b
[   11.780795]  __dev_open+0x103/0x1c0
[   11.780803]  __dev_change_flags+0x1bd/0x230
[   11.780806]  ? rtnl_getlink+0x364/0x400
[   11.780811]  dev_change_flags+0x26/0x70
[   11.780815]  do_setlink+0xe19/0x11f0
[   11.780820]  ? __nla_validate_parse+0x61/0xd40
[   11.780826]  __rtnl_newlink+0x5e7/0x990
[   11.780831]  ? kmem_cache_alloc_node_noprof+0x11d/0x350
[   11.780835]  ? __kmalloc_cache_noprof+0x10c/0x330
[   11.780839]  rtnl_newlink+0x47/0x70
[   11.780842]  rtnetlink_rcv_msg+0x152/0x400
[   11.780846]  ? __netlink_sendskb+0x68/0x90
[   11.780851]  ? netlink_unicast+0x23b/0x290
[   11.780856]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   11.780859]  netlink_rcv_skb+0x5b/0x110
[   11.780865]  netlink_unicast+0x1a6/0x290
[   11.780870]  netlink_sendmsg+0x222/0x4b0
[   11.780873]  ? proc_get_long.constprop.0+0x116/0x210
[   11.780879]  ____sys_sendmsg+0x379/0x3b0
[   11.780885]  ? copy_msghdr_from_user+0x6d/0xb0
[   11.780891]  ___sys_sendmsg+0x86/0xe0
[   11.780897]  ? addrconf_sysctl_forward+0xf3/0x270
[   11.780902]  ? _copy_from_iter+0x8b/0x6b0
[   11.780906]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
[   11.780911]  ? _raw_spin_unlock+0x19/0x50
[   11.780914]  ? proc_sys_call_handler+0xf0/0x2f0
[   11.780922]  ? trace_hardirqs_on+0x29/0x90
[   11.780927]  ? mod_objcg_state+0x102/0x300
[   11.780932]  ? fdget+0xd2/0x100
[   11.780938]  __sys_sendmsg+0x5b/0xc0
[   11.780944]  ? syscall_trace_enter+0x110/0x1b0
[   11.780951]  do_syscall_64+0x64/0x150
[   11.780957]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Also looping our distro kernel maintainer here.

Best Regards,
Mingcong Bai
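
For readers comparing the two attached dmesg logs, the expedited-stall summary lines quoted in this message can be pulled out mechanically. The following is a minimal Python sketch, not something posted in the thread: the function name, the SAMPLE string, and the returned field names are all hypothetical, and the regular expression is modelled only on the two stall lines shown in the snippet. It treats entries starting with "P" as blocked task PIDs and entries of the form "N-...." as CPU numbers followed by state flags, which are not decoded here.

```python
import re

# Pattern modelled on the two stall lines quoted in this message (an assumption,
# not a complete grammar for every RCU stall report). \s+ tolerates mail wrapping.
STALL_RE = re.compile(
    r"rcu: INFO: (?P<flavor>\w+) detected expedited stalls on\s+"
    r"CPUs/tasks: \{(?P<entries>[^}]*)\}\s+(?P<jiffies>\d+) jiffies"
    r"\s+s: (?P<seq>\d+)\s+root: (?P<root>\S+)"
)

# Hypothetical sample: the two summary lines from the v6.12-rc6 log snippet.
SAMPLE = (
    "[   10.337720] rcu: INFO: rcu_preempt detected expedited stalls on \n"
    "CPUs/tasks: { P683 } 21 jiffies s: 781 root: 0x0/T\n"
    "[   11.777733] rcu: INFO: rcu_preempt detected expedited stalls on \n"
    "CPUs/tasks: { 3-.... } 21 jiffies s: 1077 root: 0x8/.\n"
)


def parse_expedited_stalls(dmesg_text):
    """Return one dict per expedited-stall summary line found in dmesg_text."""
    reports = []
    for match in STALL_RE.finditer(dmesg_text):
        cpus, tasks = [], []
        for entry in match.group("entries").split():
            if entry.startswith("P"):
                # "P683" style entry: a task blocking the grace period.
                tasks.append(int(entry[1:]))
            else:
                # "3-...." style entry: CPU number, then flags we do not decode.
                cpus.append(int(entry.split("-", 1)[0]))
        reports.append({
            "flavor": match.group("flavor"),
            "cpus": cpus,
            "tasks": tasks,
            "jiffies": int(match.group("jiffies")),
            "seq": int(match.group("seq")),
            "root": match.group("root"),
        })
    return reports
```

Running parse_expedited_stalls() over each attached log and comparing the resulting lists is one way to check whether the reverted kernel reports any stalls at all; it says nothing about their cause.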

[-- Attachment #2: dmesg-6.12-rc6-revert-55d4669ef1b768.log --]
[-- Type: application/json, Size: 104184 bytes --]

[-- Attachment #3: dmesg-6.12-rc6-vanilla.log --]
[-- Type: application/json, Size: 137449 bytes --]


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-07 16:29           ` Mingcong Bai
@ 2024-11-08 13:46             ` Frederic Weisbecker
  2024-11-08 15:14               ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2024-11-08 13:46 UTC (permalink / raw)
  To: Mingcong Bai
  Cc: Thorsten Leemhuis, Linux regressions mailing list, LKML,
	Paul E. McKenney, rcu, sakiiily, Kexy Biscuit

On Fri, Nov 08, 2024 at 12:29:40AM +0800, Mingcong Bai wrote:
> Hi Frederic,
> 
> <snip>
> 
> > Sorry for the lag, I still don't understand how this specific commit
> > can produce this issue. Can you please retry with and without this
> > commit
> > reverted?
> 
> Just tested v6.12-rc6 with and without the revert. Without the revert, the
> touchpad and the wireless adapter both stopped working, whereas with the
> revert, both devices function as normal.
> 
> I have attached the dmesg for both kernels below. Unlike the log we got last
> time, there is no direct reference to tg3 any more, but the NMI backtrace
> still pointed to NetworkManager and net/netlink-related functions (perhaps a
> debug kernel would be more helpful?). Here's a snippet:
> 
> [   10.337720] rcu: INFO: rcu_preempt detected expedited stalls on
> CPUs/tasks: { P683 } 21 jiffies s: 781 root: 0x0/T
> [   10.339168] rcu: blocking rcu_node structures (internal RCU debug):
> [   10.591480] loop0: detected capacity change from 0 to 8
> [   11.777733] rcu: INFO: rcu_preempt detected expedited stalls on
> CPUs/tasks: { 3-.... } 21 jiffies s: 1077 root: 0x8/.
> [   11.779210] rcu: blocking rcu_node structures (internal RCU debug):
> [   11.780630] Sending NMI from CPU 1 to CPUs 3:
> [   11.780659] NMI backtrace for cpu 3
> [   11.780663] CPU: 3 UID: 0 PID: 1027 Comm: NetworkManager Not tainted
> 6.12.0-aosc-main #1

Funny, this happens on bootup and no CPU has ever gone offline, so the path
modified by this patch shouldn't have been taken. And yet this commit has
an influence to the point of reliably triggering that stall.

I'm running out of ideas. Paul, any clue?

Thanks.


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-08 13:46             ` Frederic Weisbecker
@ 2024-11-08 15:14               ` Paul E. McKenney
  2024-11-12 12:50                 ` Frederic Weisbecker
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2024-11-08 15:14 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mingcong Bai, Thorsten Leemhuis, Linux regressions mailing list,
	LKML, rcu, sakiiily, Kexy Biscuit

On Fri, Nov 08, 2024 at 02:46:16PM +0100, Frederic Weisbecker wrote:
> On Fri, Nov 08, 2024 at 12:29:40AM +0800, Mingcong Bai wrote:
> > Hi Frederic,
> > 
> > <snip>
> > 
> > > Sorry for the lag, I still don't understand how this specific commit
> > > can produce this issue. Can you please retry with and without this
> > > commit
> > > reverted?
> > 
> > Just tested v6.12-rc6 with and without the revert. Without the revert, the
> > touchpad and the wireless adapter both stopped working, whereas with the
> > revert, both devices function as normal.
> > 
> > I have attached the dmesg for both kernels below. Unlike the log we got last
> > time, there is no direct reference to tg3 any more, but the NMI backtrace
> > still pointed to NetworkManager and net/netlink-related functions (perhaps a
> > debug kernel would be more helpful?). Here's a snippet:
> > 
> > [   10.337720] rcu: INFO: rcu_preempt detected expedited stalls on
> > CPUs/tasks: { P683 } 21 jiffies s: 781 root: 0x0/T
> > [   10.339168] rcu: blocking rcu_node structures (internal RCU debug):
> > [   10.591480] loop0: detected capacity change from 0 to 8
> > [   11.777733] rcu: INFO: rcu_preempt detected expedited stalls on
> > CPUs/tasks: { 3-.... } 21 jiffies s: 1077 root: 0x8/.
> > [   11.779210] rcu: blocking rcu_node structures (internal RCU debug):
> > [   11.780630] Sending NMI from CPU 1 to CPUs 3:
> > [   11.780659] NMI backtrace for cpu 3
> > [   11.780663] CPU: 3 UID: 0 PID: 1027 Comm: NetworkManager Not tainted
> > 6.12.0-aosc-main #1
> 
> Funny, this happens on bootup and no CPU has ever gone offline, so the path
> modified by this patch shouldn't have been taken. And yet this commit has
> an influence to the point of reliably triggering that stall.
> 
> I'm running out of ideas. Paul, any clue?

Here is one straw to grasp at...

Is it possible that one of the CPUs had a problem coming online at boot,
and therefore backed out of the online process, thus appearing to at
least some of the CPU-hotplug notifiers to have gone offline?

							Thanx, Paul
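
One rough way to look for evidence of this from user space is to compare the kernel's "present" and "online" CPU lists in sysfs after boot. The following is a minimal Python sketch, not something posted in the thread: the function names are hypothetical, and it assumes the standard /sys/devices/system/cpu/present and /sys/devices/system/cpu/online files. It only shows the final state, so a CPU that failed to come online once and then succeeded on a retry would not show up.

```python
from pathlib import Path

# Standard sysfs location of the CPU masks (assumed to be mounted at /sys).
SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,5' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file or stray comma yields no CPUs
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_not_online(present_text, online_text):
    """Return the sorted CPUs that are present but not currently online."""
    return sorted(parse_cpulist(present_text) - parse_cpulist(online_text))


def read_cpus_not_online(sysfs_root=SYSFS_CPU):
    """Read both lists from sysfs and report present-but-offline CPUs."""
    present = (sysfs_root / "present").read_text()
    online = (sysfs_root / "online").read_text()
    return cpus_not_online(present, online)
```

On the affected machine, a non-empty result from read_cpus_not_online() would be a hint worth following up in the dmesg; an empty result does not rule the hypothesis out.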


* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-08 15:14               ` Paul E. McKenney
@ 2024-11-12 12:50                 ` Frederic Weisbecker
  2024-11-15  3:01                   ` Mingcong Bai
  0 siblings, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2024-11-12 12:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mingcong Bai, Thorsten Leemhuis, Linux regressions mailing list,
	LKML, rcu, sakiiily, Kexy Biscuit

On Fri, Nov 08, 2024 at 07:14:41AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 08, 2024 at 02:46:16PM +0100, Frederic Weisbecker wrote:
> > On Fri, Nov 08, 2024 at 12:29:40AM +0800, Mingcong Bai wrote:
> > > Hi Frederic,
> > > 
> > > <snip>
> > > 
> > > > Sorry for the lag, I still don't understand how this specific commit
> > > > can produce this issue. Can you please retry with and without this
> > > > commit
> > > > reverted?
> > > 
> > > Just tested v6.12-rc6 with and without the revert. Without the revert, the
> > > touchpad and the wireless adapter both stopped working, whereas with the
> > > revert, both devices function as normal.
> > > 
> > > I have attached the dmesg for both kernels below. Unlike the log we got last
> > > time, there is no direct reference to tg3 any more, but the NMI backtrace
> > > still pointed to NetworkManager and net/netlink-related functions (perhaps a
> > > debug kernel would be more helpful?). Here's a snippet:
> > > 
> > > [   10.337720] rcu: INFO: rcu_preempt detected expedited stalls on
> > > CPUs/tasks: { P683 } 21 jiffies s: 781 root: 0x0/T
> > > [   10.339168] rcu: blocking rcu_node structures (internal RCU debug):
> > > [   10.591480] loop0: detected capacity change from 0 to 8
> > > [   11.777733] rcu: INFO: rcu_preempt detected expedited stalls on
> > > CPUs/tasks: { 3-.... } 21 jiffies s: 1077 root: 0x8/.
> > > [   11.779210] rcu: blocking rcu_node structures (internal RCU debug):
> > > [   11.780630] Sending NMI from CPU 1 to CPUs 3:
> > > [   11.780659] NMI backtrace for cpu 3
> > > [   11.780663] CPU: 3 UID: 0 PID: 1027 Comm: NetworkManager Not tainted
> > > 6.12.0-aosc-main #1
> > 
> > Funny, this happens on bootup and no CPU has ever gone offline, so the path
> > modified by this patch shouldn't have been taken. And yet this commit has
> > an influence to the point of reliably triggering that stall.
> > 
> > > I'm running out of ideas. Paul, any clue?
> 
> Here is one straw to grasp at...
> 
> Is it possible that one of the CPUs had a problem coming online at boot,
> and therefore backed out of the online process, thus appearing to at
> least some of the CPU-hotplug notifiers to have gone offline?

I looked for it in the dmesg and there are indeed rejected CPUs, but only very
early, before secondary CPU boot-up.

Just in case, Mingcong Bai can you test the following patch without the
revert and see if it triggers something?

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 35949ec1f935..b4f8ed8138d3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -5170,6 +5170,7 @@ void rcutree_migrate_callbacks(int cpu)
 	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 	bool needwake;
 
+	WARN_ON_ONCE(1);
 	if (rcu_rdp_is_offloaded(rdp))
 		return;
 


Thanks.

> 
> 							Thanx, Paul

* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-12 12:50                 ` Frederic Weisbecker
@ 2024-11-15  3:01                   ` Mingcong Bai
  2024-11-19 10:43                     ` Frederic Weisbecker
  0 siblings, 1 reply; 14+ messages in thread
From: Mingcong Bai @ 2024-11-15  3:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Thorsten Leemhuis,
	Linux regressions mailing list, LKML, rcu, sakiiily, Kexy Biscuit

[-- Attachment #1: Type: text/plain, Size: 659 bytes --]

Hi Frederic,

<snip>

> Just in case, Mingcong Bai can you test the following patch without the
> revert and see if it triggers something?
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 35949ec1f935..b4f8ed8138d3 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -5170,6 +5170,7 @@ void rcutree_migrate_callbacks(int cpu)
>  	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
>  	bool needwake;
> 
> +	WARN_ON_ONCE(1);
>  	if (rcu_rdp_is_offloaded(rdp))
>  		return;
> 

Please find attached the dmesg with your patch (and no revert) against 
6.12-rc7.

Best Regards,
Mingcong Bai

> 
> 
> Thanks.
> 
>> 
>> 							Thanx, Paul

[-- Attachment #2: dmesg-tg3-warn-on-once.log --]
[-- Type: application/json, Size: 150070 bytes --]

* Re: [Regression] wifi problems since tg3 started throwing rcu stall warnings
  2024-11-15  3:01                   ` Mingcong Bai
@ 2024-11-19 10:43                     ` Frederic Weisbecker
  0 siblings, 0 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2024-11-19 10:43 UTC (permalink / raw)
  To: Mingcong Bai
  Cc: Paul E. McKenney, Thorsten Leemhuis,
	Linux regressions mailing list, LKML, rcu, sakiiily, Kexy Biscuit

On Fri, Nov 15, 2024 at 11:01:25AM +0800, Mingcong Bai wrote:
> Hi Frederic,
> 
> <snip>
> 
> > Just in case, Mingcong Bai can you test the following patch without the
> > revert and see if it triggers something?
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 35949ec1f935..b4f8ed8138d3 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -5170,6 +5170,7 @@ void rcutree_migrate_callbacks(int cpu)
> >  	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> >  	bool needwake;
> > 
> > +	WARN_ON_ONCE(1);
> >  	if (rcu_rdp_is_offloaded(rdp))
> >  		return;
> > 
> 
> Please find attached the dmesg with your patch (and no revert) against
> 6.12-rc7.

The added WARN_ON_ONCE() doesn't trigger, so the function/path changed by this
patch isn't taken.

My only guess is that the patch changes the code layout in a way that makes a
bug more likely to appear in networking...
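
For illustration only, here is a minimal userspace sketch of the probe idea
behind that test: drop a once-only warning at the entry of the suspected
function, run the scenario, and check whether it fired. This is not the
kernel's WARN_ON_ONCE() implementation. PROBE_ONCE, probe_hits,
migrate_callbacks() and boot() are hypothetical names made up for the sketch,
and the "CPU went offline" flag is an assumed stand-in for the hotplug
teardown path.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter: number of probe call sites that have reported. */
static int probe_hits;

/*
 * Userspace stand-in for a once-only probe (an assumption, not the kernel
 * macro). Each expansion gets its own static flag, so a given call site
 * reports at most once, however often it is reached.
 */
#define PROBE_ONCE(cond)                                        \
	do {                                                    \
		static bool warned_;                            \
		if ((cond) && !warned_) {                       \
			warned_ = true;                         \
			probe_hits++;                           \
			fprintf(stderr, "probe hit at %s:%d\n", \
				__FILE__, __LINE__);            \
		}                                               \
	} while (0)

/* Hypothetical suspected path: only reached when a CPU goes offline. */
static void migrate_callbacks(int cpu)
{
	PROBE_ONCE(1);
	(void)cpu;	/* the real work is irrelevant to the sketch */
}

/* Hypothetical boot scenario; the flag models a CPU-hotplug teardown. */
static void boot(bool cpu_went_offline)
{
	if (cpu_went_offline)
		migrate_callbacks(1);
}
```

Calling boot(false) leaves probe_hits at 0, which is the situation seen in
the attached dmesg: a silent probe means the instrumented path was never
entered. Calling boot(true) any number of times raises probe_hits to 1 only.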

Thanks.

> 
> Best Regards,
> Mingcong Bai
> 
> > 
> > 
> > Thanks.
> > 
> > > 
> > > 							Thanx, Paul



end of thread, other threads:[~2024-11-19 10:44 UTC | newest]

Thread overview: 14+ messages
2024-10-23  8:27 [Regression] wifi problems since tg3 started throwing rcu stall warnings Linux regression tracking (Thorsten Leemhuis)
2024-10-23  9:11 ` Linux regression tracking (Thorsten Leemhuis)
2024-10-23 10:09 ` Frederic Weisbecker
2024-10-23 10:22   ` Linux regression tracking (Thorsten Leemhuis)
2024-11-05  7:17     ` Mingcong Bai
2024-11-07  9:10       ` Thorsten Leemhuis
2024-11-07 10:04         ` Frederic Weisbecker
2024-11-07 10:33           ` Mingcong Bai
2024-11-07 16:29           ` Mingcong Bai
2024-11-08 13:46             ` Frederic Weisbecker
2024-11-08 15:14               ` Paul E. McKenney
2024-11-12 12:50                 ` Frederic Weisbecker
2024-11-15  3:01                   ` Mingcong Bai
2024-11-19 10:43                     ` Frederic Weisbecker
