netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mlx5e warnings on 6.10
@ 2024-07-17 11:41 Breno Leitao
  2024-07-18 11:00 ` Dragos Tatulea
  0 siblings, 1 reply; 4+ messages in thread
From: Breno Leitao @ 2024-07-17 11:41 UTC (permalink / raw)
  To: saeedm, tariqt; +Cc: netdev

I am seeing the following warning in 6.10
(0c3836482481200ead7b416ca80c68a29cfdaabd) in some hosts we have with
6.10.

This kernel was built with some debug options enabled (lockdep, kasamn,
kmemleak, etc) and it is slower than usual.

Sharing in case you find it useful.

	mlx5_core 0000:01:00.0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 15535 ms
	mlx5_core 0000:01:00.0 eth0: TX timeout detected
	mlx5_core 0000:01:00.0 eth0: TX timeout on queue: 0, SQ: 0x2e9, CQ: 0xa5, SQ Cons: 0x22ee SQ Prod: 0x3290, usecs since last trans: 15570000
	mlx5_core 0000:01:00.0 eth0: EQ 0x7: Cons = 0x56838, irqn = 0x67
	------------[ cut here ]------------
	WARNING: CPU: 15 PID: 48322 at drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78 mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
	Modules linked in: sunrpc(E) veth(E) bpf_preload(E) sch_fq(E) squashfs(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) kvm_amd(E) kvm(E) mlx5_ib(E) ib_uverbs(E) acpi_cpufreq(E) ccp(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) evdev(E) button(E) sch_fq_codel(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) amd_hsmp(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) efivarfs(E)
	Hardware name: Wiwynn HalfDome/HalfDome MP, BIOS HD404.sign 09/22/2023
	Workqueue: mlx5e mlx5e_tx_timeout_work
	RIP: 0010:mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
	Code: 05 18 ef 4d 03 01 48 c7 c7 c0 52 bb 84 be 54 00 00 00 48 c7 c2 00 53 bb 84 e8 f7 e5 d2 fd e9 34 fe ff ff 0f 0b e9 e0 fd ff ff <0f> 0b e9 b7 fd ff ff 48 c7 c1 cc 3e ba 86 80 e1 07 80 c1 03 38 c1
	All code
	========
	   0:   05 18 ef 4d 03          add    $0x34def18,%eax
	   5:   01 48 c7                add    %ecx,-0x39(%rax)
	   8:   c7 c0 52 bb 84 be       mov    $0xbe84bb52,%eax
	   e:   54                      push   %rsp
	   f:   00 00                   add    %al,(%rax)
	  11:   00 48 c7                add    %cl,-0x39(%rax)
	  14:   c2 00 53                ret    $0x5300
	  17:   bb 84 e8 f7 e5          mov    $0xe5f7e884,%ebx
	  1c:   d2 fd                   sar    %cl,%ch
	  1e:   e9 34 fe ff ff          jmp    0xfffffffffffffe57
	  23:   0f 0b                   ud2
	  25:   e9 e0 fd ff ff          jmp    0xfffffffffffffe0a
	  2a:*  0f 0b                   ud2             <-- trapping instruction
	  2c:   e9 b7 fd ff ff          jmp    0xfffffffffffffde8
	  31:   48 c7 c1 cc 3e ba 86    mov    $0xffffffff86ba3ecc,%rcx
	  38:   80 e1 07                and    $0x7,%cl
	  3b:   80 c1 03                add    $0x3,%cl
	  3e:   38 c1                   cmp    %al,%cl

	Code starting with the faulting instruction
	===========================================
	   0:   0f 0b                   ud2
	   2:   e9 b7 fd ff ff          jmp    0xfffffffffffffdbe
	   7:   48 c7 c1 cc 3e ba 86    mov    $0xffffffff86ba3ecc,%rcx
	   e:   80 e1 07                and    $0x7,%cl
	  11:   80 c1 03                add    $0x3,%cl
	  14:   38 c1                   cmp    %al,%cl
	RSP: 0018:ffff8883d45877b0 EFLAGS: 00010246
	RAX: 0000000000000000 RBX: ffff8882d3050a90 RCX: 0000000000000000
	RDX: 0000000000000160 RSI: ffffffff846c5900 RDI: ffffffff8495f700
	RBP: ffff8882d3050af0 R08: ffff8888c7cc0177 R09: 1ffff11118f9802e
	R10: dffffc0000000000 R11: ffffed1118f9802f R12: dffffc0000000000
	R13: ffff8882d3050a80 R14: ffff8888c7cc0018 R15: ffff8882d3050a80
	FS:  0000000000000000(0000) GS:ffff88bee8780000(0000) knlGS:0000000000000000
	CR2: 00007f33afc33000 CR3: 00000004df52c001 CR4: 0000000000770ef0
	PKRU: 55555554
	Call Trace:
	<TASK>
	? __warn (kernel/panic.c:239 kernel/panic.c:693)
	? mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
	? mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
	? report_bug (lib/bug.c:? lib/bug.c:219)
	? handle_bug (arch/x86/kernel/traps.c:239)
	? exc_invalid_op (arch/x86/kernel/traps.c:260)
	? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
	? mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
	mlx5e_safe_switch_params (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3275)
	? mark_lock (./arch/x86/include/asm/bitops.h:227 ./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4656)
	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:67 ./arch/x86/include/asm/irqflags.h:127 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
	? lockdep_hardirqs_on (kernel/locking/lockdep.c:4421)
	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
	? __irq_put_desc_unlock (kernel/irq/irqdesc.c:908)
	? enable_irq (kernel/irq/internals.h:?)
	? mlx5_eq_poll_irq_disabled (drivers/net/ethernet/mellanox/mlx5/core/eq.c:171)
	? mlx5e_health_channel_eq_recover (drivers/net/ethernet/mellanox/mlx5/core/en/health.c:139)
	mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
	devlink_health_report (net/devlink/health.c:539 net/devlink/health.c:639)
	mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
	? mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:132)
	? mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:398)
	mlx5e_tx_timeout_work (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:4943)
	? process_scheduled_works (kernel/workqueue.c:3224 kernel/workqueue.c:3329)
	process_scheduled_works (kernel/workqueue.c:? kernel/workqueue.c:3329)
	worker_thread (./include/linux/list.h:373 kernel/workqueue.c:947 kernel/workqueue.c:3410)
	kthread (kernel/kthread.c:390)
	? pr_cont_work (kernel/workqueue.c:3356)
	? kthread_blkcg (kernel/kthread.c:342)
	ret_from_fork (arch/x86/kernel/process.c:153)
	? kthread_blkcg (kernel/kthread.c:342)
	ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
	</TASK>
	irq event stamp: 843317
	hardirqs last enabled at (843331): console_unlock (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:67 ./arch/x86/include/asm/irqflags.h:127 kernel/printk/printk.c:341 kernel/printk/printk.c:2731 kernel/printk/printk.c:3050)
	hardirqs last disabled at (843346): console_unlock (kernel/printk/printk.c:339 kernel/printk/printk.c:2731 kernel/printk/printk.c:3050)
	softirqs last enabled at (843360): __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639)
	softirqs last disabled at (843355): __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639)
	---[ end trace 0000000000000000 ]---

	=============================
	WARNING: suspicious RCU usage
	6.10.0-0_fbk700_debug_rc0_kbuilder_7_gcce2edc9a03a #1 Tainted: G S      W   EL   N
	-----------------------------
	drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:84 suspicious rcu_dereference_protected() usage!

	other info that might help us debug this:


	rcu_scheduler_active = 2, debug_locks = 1
	4 locks held by kworker/u704:0/48322:
	#0: ffff8882d2a6e948 ((wq_completion)mlx5e){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3223 kernel/workqueue.c:3329)
	#1: ffff8883d4587d68 ((work_completion)(&priv->tx_timeout_work)){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3224 kernel/workqueue.c:3329)
	#2: ffffffff86a75e08 (rtnl_mutex){+.+.}-{3:3}, at: mlx5e_tx_timeout_work (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:4926)
	#3: ffff8882d2a61250 (&devlink->lock_key#2){+.+.}-{3:3}, at: devlink_health_report (net/devlink/health.c:?)

	stack backtrace:
	Hardware name: Wiwynn HalfDome/HalfDome MP, BIOS HD404.sign 09/22/2023
	Workqueue: mlx5e mlx5e_tx_timeout_work
	Call Trace:
	<TASK>
	dump_stack_lvl (lib/dump_stack.c:116)
	lockdep_rcu_suspicious (./include/linux/context_tracking.h:122 ./include/linux/context_tracking.h:143 kernel/locking/lockdep.c:6672)
	mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:?)
	mlx5e_safe_switch_params (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3275)
	? mark_lock (./arch/x86/include/asm/bitops.h:227 ./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4656)
	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:67 ./arch/x86/include/asm/irqflags.h:127 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
	? lockdep_hardirqs_on (kernel/locking/lockdep.c:4421)
	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
	? __irq_put_desc_unlock (kernel/irq/irqdesc.c:908)
	? enable_irq (kernel/irq/internals.h:?)
	? mlx5_eq_poll_irq_disabled (drivers/net/ethernet/mellanox/mlx5/core/eq.c:171)
	? mlx5e_health_channel_eq_recover (drivers/net/ethernet/mellanox/mlx5/core/en/health.c:139)
	mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
	devlink_health_report (net/devlink/health.c:539 net/devlink/health.c:639)
	mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
	? mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:132)
	? mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:398)
	mlx5e_tx_timeout_work (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:4943)
	? process_scheduled_works (kernel/workqueue.c:3224 kernel/workqueue.c:3329)
	process_scheduled_works (kernel/workqueue.c:? kernel/workqueue.c:3329)
	worker_thread (./include/linux/list.h:373 kernel/workqueue.c:947 kernel/workqueue.c:3410)
	kthread (kernel/kthread.c:390)
	? pr_cont_work (kernel/workqueue.c:3356)
	? kthread_blkcg (kernel/kthread.c:342)
	ret_from_fork (arch/x86/kernel/process.c:153)
	? kthread_blkcg (kernel/kthread.c:342)
	ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
	</TASK>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: mlx5e warnings on 6.10
  2024-07-17 11:41 mlx5e warnings on 6.10 Breno Leitao
@ 2024-07-18 11:00 ` Dragos Tatulea
  2024-07-18 19:12   ` Breno Leitao
  0 siblings, 1 reply; 4+ messages in thread
From: Dragos Tatulea @ 2024-07-18 11:00 UTC (permalink / raw)
  To: Tariq Toukan, leitao@debian.org, Saeed Mahameed; +Cc: netdev@vger.kernel.org

Hi Breno,

On Wed, 2024-07-17 at 04:41 -0700, Breno Leitao wrote:
> I am seeing the following warning in 6.10
> (0c3836482481200ead7b416ca80c68a29cfdaabd) in some hosts we have with
> 6.10.
> 
> This kernel was built with some debug options enabled (lockdep, kasamn,
> kmemleak, etc) and it is slower than usual.
> 
> Sharing in case you find it useful.
> 
> 	mlx5_core 0000:01:00.0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 15535 ms
> 	mlx5_core 0000:01:00.0 eth0: TX timeout detected
> 	mlx5_core 0000:01:00.0 eth0: TX timeout on queue: 0, SQ: 0x2e9, CQ: 0xa5, SQ Cons: 0x22ee SQ Prod: 0x3290, usecs since last trans: 15570000
> 	mlx5_core 0000:01:00.0 eth0: EQ 0x7: Cons = 0x56838, irqn = 0x67
> 	------------[ cut here ]------------
> 	WARNING: CPU: 15 PID: 48322 at drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78 mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
> 	Modules linked in: sunrpc(E) veth(E) bpf_preload(E) sch_fq(E) squashfs(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) kvm_amd(E) kvm(E) mlx5_ib(E) ib_uverbs(E) acpi_cpufreq(E) ccp(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) evdev(E) button(E) sch_fq_codel(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) amd_hsmp(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) efivarfs(E)
> 	Hardware name: Wiwynn HalfDome/HalfDome MP, BIOS HD404.sign 09/22/2023
> 	Workqueue: mlx5e mlx5e_tx_timeout_work
> 	RIP: 0010:mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
> 	Code: 05 18 ef 4d 03 01 48 c7 c7 c0 52 bb 84 be 54 00 00 00 48 c7 c2 00 53 bb 84 e8 f7 e5 d2 fd e9 34 fe ff ff 0f 0b e9 e0 fd ff ff <0f> 0b e9 b7 fd ff ff 48 c7 c1 cc 3e ba 86 80 e1 07 80 c1 03 38 c1
> 	All code
> 	========
> 	   0:   05 18 ef 4d 03          add    $0x34def18,%eax
> 	   5:   01 48 c7                add    %ecx,-0x39(%rax)
> 	   8:   c7 c0 52 bb 84 be       mov    $0xbe84bb52,%eax
> 	   e:   54                      push   %rsp
> 	   f:   00 00                   add    %al,(%rax)
> 	  11:   00 48 c7                add    %cl,-0x39(%rax)
> 	  14:   c2 00 53                ret    $0x5300
> 	  17:   bb 84 e8 f7 e5          mov    $0xe5f7e884,%ebx
> 	  1c:   d2 fd                   sar    %cl,%ch
> 	  1e:   e9 34 fe ff ff          jmp    0xfffffffffffffe57
> 	  23:   0f 0b                   ud2
> 	  25:   e9 e0 fd ff ff          jmp    0xfffffffffffffe0a
> 	  2a:*  0f 0b                   ud2             <-- trapping instruction
> 	  2c:   e9 b7 fd ff ff          jmp    0xfffffffffffffde8
> 	  31:   48 c7 c1 cc 3e ba 86    mov    $0xffffffff86ba3ecc,%rcx
> 	  38:   80 e1 07                and    $0x7,%cl
> 	  3b:   80 c1 03                add    $0x3,%cl
> 	  3e:   38 c1                   cmp    %al,%cl
> 
> 	Code starting with the faulting instruction
> 	===========================================
> 	   0:   0f 0b                   ud2
> 	   2:   e9 b7 fd ff ff          jmp    0xfffffffffffffdbe
> 	   7:   48 c7 c1 cc 3e ba 86    mov    $0xffffffff86ba3ecc,%rcx
> 	   e:   80 e1 07                and    $0x7,%cl
> 	  11:   80 c1 03                add    $0x3,%cl
> 	  14:   38 c1                   cmp    %al,%cl
> 	RSP: 0018:ffff8883d45877b0 EFLAGS: 00010246
> 	RAX: 0000000000000000 RBX: ffff8882d3050a90 RCX: 0000000000000000
> 	RDX: 0000000000000160 RSI: ffffffff846c5900 RDI: ffffffff8495f700
> 	RBP: ffff8882d3050af0 R08: ffff8888c7cc0177 R09: 1ffff11118f9802e
> 	R10: dffffc0000000000 R11: ffffed1118f9802f R12: dffffc0000000000
> 	R13: ffff8882d3050a80 R14: ffff8888c7cc0018 R15: ffff8882d3050a80
> 	FS:  0000000000000000(0000) GS:ffff88bee8780000(0000) knlGS:0000000000000000
> 	CR2: 00007f33afc33000 CR3: 00000004df52c001 CR4: 0000000000770ef0
> 	PKRU: 55555554
> 	Call Trace:
> 	<TASK>
> 	? __warn (kernel/panic.c:239 kernel/panic.c:693)
> 	? mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
> 	? mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
> 	? report_bug (lib/bug.c:? lib/bug.c:219)
> 	? handle_bug (arch/x86/kernel/traps.c:239)
> 	? exc_invalid_op (arch/x86/kernel/traps.c:260)
> 	? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
> 	? mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:78)
> 	mlx5e_safe_switch_params (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3275)
> 	? mark_lock (./arch/x86/include/asm/bitops.h:227 ./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4656)
> 	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:67 ./arch/x86/include/asm/irqflags.h:127 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> 	? lockdep_hardirqs_on (kernel/locking/lockdep.c:4421)
> 	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
> 	? __irq_put_desc_unlock (kernel/irq/irqdesc.c:908)
> 	? enable_irq (kernel/irq/internals.h:?)
> 	? mlx5_eq_poll_irq_disabled (drivers/net/ethernet/mellanox/mlx5/core/eq.c:171)
> 	? mlx5e_health_channel_eq_recover (drivers/net/ethernet/mellanox/mlx5/core/en/health.c:139)
> 	mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
> 	devlink_health_report (net/devlink/health.c:539 net/devlink/health.c:639)
> 	mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
> 	? mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:132)
> 	? mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:398)
> 	mlx5e_tx_timeout_work (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:4943)
> 	? process_scheduled_works (kernel/workqueue.c:3224 kernel/workqueue.c:3329)
> 	process_scheduled_works (kernel/workqueue.c:? kernel/workqueue.c:3329)
> 	worker_thread (./include/linux/list.h:373 kernel/workqueue.c:947 kernel/workqueue.c:3410)
> 	kthread (kernel/kthread.c:390)
> 	? pr_cont_work (kernel/workqueue.c:3356)
> 	? kthread_blkcg (kernel/kthread.c:342)
> 	ret_from_fork (arch/x86/kernel/process.c:153)
> 	? kthread_blkcg (kernel/kthread.c:342)
> 	ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
> 	</TASK>
> 	irq event stamp: 843317
> 	hardirqs last enabled at (843331): console_unlock (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:67 ./arch/x86/include/asm/irqflags.h:127 kernel/printk/printk.c:341 kernel/printk/printk.c:2731 kernel/printk/printk.c:3050)
> 	hardirqs last disabled at (843346): console_unlock (kernel/printk/printk.c:339 kernel/printk/printk.c:2731 kernel/printk/printk.c:3050)
> 	softirqs last enabled at (843360): __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639)
> 	softirqs last disabled at (843355): __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639)
> 	---[ end trace 0000000000000000 ]---
> 
> 	=============================
> 	WARNING: suspicious RCU usage
> 	6.10.0-0_fbk700_debug_rc0_kbuilder_7_gcce2edc9a03a #1 Tainted: G S      W   EL   N
> 	-----------------------------
> 	drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:84 suspicious rcu_dereference_protected() usage!
> 
> 	other info that might help us debug this:
> 
> 
> 	rcu_scheduler_active = 2, debug_locks = 1
> 	4 locks held by kworker/u704:0/48322:
> 	#0: ffff8882d2a6e948 ((wq_completion)mlx5e){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3223 kernel/workqueue.c:3329)
> 	#1: ffff8883d4587d68 ((work_completion)(&priv->tx_timeout_work)){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3224 kernel/workqueue.c:3329)
> 	#2: ffffffff86a75e08 (rtnl_mutex){+.+.}-{3:3}, at: mlx5e_tx_timeout_work (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:4926)
> 	#3: ffff8882d2a61250 (&devlink->lock_key#2){+.+.}-{3:3}, at: devlink_health_report (net/devlink/health.c:?)
> 
> 	stack backtrace:
> 	Hardware name: Wiwynn HalfDome/HalfDome MP, BIOS HD404.sign 09/22/2023
> 	Workqueue: mlx5e mlx5e_tx_timeout_work
> 	Call Trace:
> 	<TASK>
> 	dump_stack_lvl (lib/dump_stack.c:116)
> 	lockdep_rcu_suspicious (./include/linux/context_tracking.h:122 ./include/linux/context_tracking.h:143 kernel/locking/lockdep.c:6672)
> 	mlx5e_selq_prepare_params (drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:?)
> 	mlx5e_safe_switch_params (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3275)
> 	? mark_lock (./arch/x86/include/asm/bitops.h:227 ./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4656)
> 	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:67 ./arch/x86/include/asm/irqflags.h:127 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> 	? lockdep_hardirqs_on (kernel/locking/lockdep.c:4421)
> 	? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
> 	? __irq_put_desc_unlock (kernel/irq/irqdesc.c:908)
> 	? enable_irq (kernel/irq/internals.h:?)
> 	? mlx5_eq_poll_irq_disabled (drivers/net/ethernet/mellanox/mlx5/core/eq.c:171)
> 	? mlx5e_health_channel_eq_recover (drivers/net/ethernet/mellanox/mlx5/core/en/health.c:139)
> 	mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
> 	devlink_health_report (net/devlink/health.c:539 net/devlink/health.c:639)
> 	mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:?)
> 	? mlx5e_reporter_tx_timeout (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:132)
> 	? mlx5e_tx_reporter_timeout_recover (drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c:398)
> 	mlx5e_tx_timeout_work (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:4943)
> 	? process_scheduled_works (kernel/workqueue.c:3224 kernel/workqueue.c:3329)
> 	process_scheduled_works (kernel/workqueue.c:? kernel/workqueue.c:3329)
> 	worker_thread (./include/linux/list.h:373 kernel/workqueue.c:947 kernel/workqueue.c:3410)
> 	kthread (kernel/kthread.c:390)
> 	? pr_cont_work (kernel/workqueue.c:3356)
> 	? kthread_blkcg (kernel/kthread.c:342)
> 	ret_from_fork (arch/x86/kernel/process.c:153)
> 	? kthread_blkcg (kernel/kthread.c:342)
> 	ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
> 	</TASK>
> 
Thanks for the report. The output, it is very useful. The problem seems to be
that mlx5e_tx_reporter_timeout_recover() should take a state lock and doesn't.

I wonder why this happened only in 6.10. There were no relevant changes in 6.10.
Or is it maybe that until now you didn't run into the tx queue timeout issue?

Would you have the possibility and willingness to test the below fix?

---
 drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index 22918b2ef7f1..fe29be7aa70f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -146,7 +146,9 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
 		return err;
 	}
 
+	mutex_lock(&priv->state_lock);
 	err = mlx5e_safe_reopen_channels(priv);
+	mutex_lock(&priv->state_lock);
 	if (!err) {
 		to_ctx->status = 1; /* all channels recovered */
 		return err;
-- 
2.45.2

Thanks,
Dragos

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: mlx5e warnings on 6.10
  2024-07-18 11:00 ` Dragos Tatulea
@ 2024-07-18 19:12   ` Breno Leitao
  2024-07-19 16:57     ` Dragos Tatulea
  0 siblings, 1 reply; 4+ messages in thread
From: Breno Leitao @ 2024-07-18 19:12 UTC (permalink / raw)
  To: Dragos Tatulea; +Cc: Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org

On Thu, Jul 18, 2024 at 11:00:00AM +0000, Dragos Tatulea wrote:
> Hi Breno,
> 
> On Wed, 2024-07-17 at 04:41 -0700, Breno Leitao wrote:

> > Sharing in case you find it useful.

> Thanks for the report. The output, it is very useful. The problem seems to be
> that mlx5e_tx_reporter_timeout_recover() should take a state lock and doesn't.

Right. I've looked at other cases where mlx5e_safe_reopen_channels() is
called, and priv->state_lock is, in fact, hold before calling it.

So, independent if this fix the problem or not, it seems the right thing
to do.

Feel free to add a "Reviewed-by: Breno Leitao <leitao@debian.org>" when
you send it.

> I wonder why this happened only in 6.10. There were no relevant changes in 6.10.
> Or is it maybe that until now you didn't run into the tx queue timeout issue?

I don't have a reproducer for it, so, i just got it in 6.10. Maybe just
a coincidence?

> Would you have the possibility and willingness to test the below fix?

Sure. I have two hosts running with your patch, but, it is hard to make
them timeout.

Let me know if you have any trick I can explore and force the card to
time out.

Thanks for the quick reply!
--breno

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: mlx5e warnings on 6.10
  2024-07-18 19:12   ` Breno Leitao
@ 2024-07-19 16:57     ` Dragos Tatulea
  0 siblings, 0 replies; 4+ messages in thread
From: Dragos Tatulea @ 2024-07-19 16:57 UTC (permalink / raw)
  To: leitao@debian.org; +Cc: Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org

On Thu, 2024-07-18 at 12:12 -0700, Breno Leitao wrote:
> On Thu, Jul 18, 2024 at 11:00:00AM +0000, Dragos Tatulea wrote:
> > Hi Breno,
> > 
> > On Wed, 2024-07-17 at 04:41 -0700, Breno Leitao wrote:
> 
> > > Sharing in case you find it useful.
> 
> > Thanks for the report. The output, it is very useful. The problem seems to be
> > that mlx5e_tx_reporter_timeout_recover() should take a state lock and doesn't.
> 
> Right. I've looked at other cases where mlx5e_safe_reopen_channels() is
> called, and priv->state_lock is, in fact, hold before calling it.
> 
> So, independent if this fix the problem or not, it seems the right thing
> to do.
> 
> Feel free to add a "Reviewed-by: Breno Leitao <leitao@debian.org>" when
> you send it.
> 
Thanks!

> > I wonder why this happened only in 6.10. There were no relevant changes in 6.10.
> > Or is it maybe that until now you didn't run into the tx queue timeout issue?
> 
> I don't have a reproducer for it, so, i just got it in 6.10. Maybe just
> a coincidence?
> 
It would be interesting to find out if you are suddenly seeing more tx queue
timeout events than usual in 6.10.

> > Would you have the possibility and willingness to test the below fix?
> 
> Sure. I have two hosts running with your patch, but, it is hard to make
> them timeout.
> 
> Let me know if you have any trick I can explore and force the card to
> time out.
> 
I'd guess that there must be a way. I'm in the best position to find out :).
Once I know I can also test it myself.


Thanks,
Dragos

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-07-19 16:57 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-17 11:41 mlx5e warnings on 6.10 Breno Leitao
2024-07-18 11:00 ` Dragos Tatulea
2024-07-18 19:12   ` Breno Leitao
2024-07-19 16:57     ` Dragos Tatulea

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).