"Dead loop on virtual device" error without softirq-BKL on PREEMPT

public inbox for linux-rt-devel@lists.linux.dev

* "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
@ 2026-02-16 13:43 Bert Karwatzki
  2026-02-16 15:32 ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-16 13:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Bert Karwatzki, Thomas Gleixner, linux-kernel, linux-rt-devel

Starting with linux-6.18, I see the following log messages

2026-02-15T23:50:17.558716+01:00 [ T1559] Dead loop on virtual device wlp4s0, fix it urgently!
2026-02-15T23:50:17.558737+01:00 [ T1559] Dead loop on virtual device wlp4s0, fix it urgently!
[...]

regarding my wireless network device

04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]

I bisected this (from v6.17 to v6.18) and got this as the first bad commit:
3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")

Using CONFIG_PREEMPT_RT_NEEDS_BH_LOCK=y in v6.18.10 fixes the issue.

Bert Karwatzki


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-16 13:43 "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT Bert Karwatzki
@ 2026-02-16 15:32 ` Bert Karwatzki
  2026-02-16 15:37   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-16 15:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, spasswolf

On Monday, 16.02.2026 at 14:43 +0100, Bert Karwatzki wrote:
> Starting with linux-6.18, I see the following log messages
> 
> 2026-02-15T23:50:17.558716+01:00 [ T1559] Dead loop on virtual device wlp4s0, fix it urgently!
> 2026-02-15T23:50:17.558737+01:00 [ T1559] Dead loop on virtual device wlp4s0, fix it urgently!
> [...]
> 
> regarding my wireless network device
> 
> 04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
> 
> I bisected this (from v6.17 to v6.18) and got this as the first bad commit:
> 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
> 
> Using CONFIG_PREEMPT_RT_NEEDS_BH_LOCK=y in v6.18.10 fixes the issue.
> 
> Bert Karwatzki

May I presume that to fix this properly at least some of the spin_lock_bh()s in
net/mac80211/ need to be converted to local_lock_nested_bh()s?

Bert Karwatzki


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-16 15:32 ` Bert Karwatzki
@ 2026-02-16 15:37   ` Sebastian Andrzej Siewior
  2026-02-16 23:48     ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-16 15:37 UTC (permalink / raw)
  To: Bert Karwatzki; +Cc: Thomas Gleixner, linux-kernel, linux-rt-devel

On 2026-02-16 16:32:29 [+0100], Bert Karwatzki wrote:
> On Monday, 16.02.2026 at 14:43 +0100, Bert Karwatzki wrote:
> > Starting with linux-6.18, I see the following log messages
> > 
> > 2026-02-15T23:50:17.558716+01:00 [ T1559] Dead loop on virtual device wlp4s0, fix it urgently!
> > 2026-02-15T23:50:17.558737+01:00 [ T1559] Dead loop on virtual device wlp4s0, fix it urgently!
> > [...]
> > 
> > regarding my wireless network device
> > 
> > 04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
> > 
> > I bisected this (from v6.17 to v6.18) and got this as the first bad commit:
> > 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
> > 
> > Using CONFIG_PREEMPT_RT_NEEDS_BH_LOCK=y in v6.18.10 fixes the issue.
> > 
> > Bert Karwatzki
> 
> May I presume that to fix this properly at least some of the spin_lock_bh()s in
> net/mac80211/ need to be converted to local_lock_nested_bh()s?

I am not sure what the issue is, so I can't tell. The dev_xmit_recursion*()
based counters are per-task, so that should be fine. Yet the wifi
managed to repeatedly enqueue packets. This might be a real recursion; a
stack trace should tell. And then synchronisation is missing somewhere.

> Bert Karwatzki

Sebastian


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-16 15:37   ` Sebastian Andrzej Siewior
@ 2026-02-16 23:48     ` Bert Karwatzki
  2026-02-17  7:19       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-16 23:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, spasswolf

On Monday, 16.02.2026 at 16:37 +0100, Sebastian Andrzej Siewior wrote:
> 
> I am not sure what the issue is, so I can't tell. The dev_xmit_recursion*()
> based counters are per-task, so that should be fine. Yet the wifi
> managed to repeatedly enqueue packets. This might be a real recursion; a
> stack trace should tell. And then synchronisation is missing somewhere.
> 
> > Bert Karwatzki
> 
> Sebastian

The problem seems to be that different preemptible threads try to send skbs.

I used this debug patch for 6.18.10:

diff --git a/net/core/dev.c b/net/core/dev.c
index 5b536860138d..ecfdd8e3dc99 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4704,6 +4704,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	qdisc_pkt_len_init(skb);
 	tcx_set_ingress(skb, false);
+	printk(KERN_INFO "%s 0: skb = %px dev = %s\n", __func__, skb, dev->name);
 #ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
 		if (nf_hook_egress_active()) {
@@ -4739,10 +4740,12 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	trace_net_dev_queue(skb);
 	if (q->enqueue) {
+		printk(KERN_INFO "%s 1: skb = %px dev = %s\n", __func__, skb, dev->name);
 		rc = __dev_xmit_skb(skb, q, dev, txq);
 		goto out;
 	}
 
+	printk(KERN_INFO "%s 2: skb = %px dev = %s txq = %px\n", __func__, skb, dev->name, txq);
 	/* The device has no queue. Common case for software devices:
 	 * loopback, all the sorts of tunnels...
 
@@ -4761,15 +4764,20 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 		/* Other cpus might concurrently change txq->xmit_lock_owner
 		 * to -1 or to their cpu id, but not to our id.
 		 */
+		printk(KERN_INFO "%s: cpu = %d xmit_lock_owner = %d\n", __func__, cpu, READ_ONCE(txq->xmit_lock_owner));
 		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
-			if (dev_xmit_recursion())
+			printk(KERN_INFO "%s 3: skb = %px dev = %s txq = %px\n", __func__, skb, dev->name, txq);
+			if (dev_xmit_recursion()) {
+				printk(KERN_INFO "%s: recursion alert for device %s!\n", __func__, dev->name);
 				goto recursion_alert;
+			}
 
 			skb = validate_xmit_skb(skb, dev, &again);
 			if (!skb)
 				goto out;
 
 			HARD_TX_LOCK(dev, txq, cpu);
+			printk(KERN_INFO "%s 4: skb = %px dev = %s txq = %px\n", __func__, skb, dev->name, txq);
 
 			if (!netif_xmit_stopped(txq)) {
 				dev_xmit_recursion_inc();
@@ -4777,6 +4785,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 				dev_xmit_recursion_dec();
 				if (dev_xmit_complete(rc)) {
 					HARD_TX_UNLOCK(dev, txq);
+					printk(KERN_INFO "%s 5: skb = %px dev = %s txq = %px\n", __func__, skb, dev->name, txq);
 					goto out;
 				}
 			}

The normal path of an skb is this:
2026-02-17T00:29:24.124757+01:00 [ T1522] __dev_queue_xmit 0: skb = ffff8c11ed06bd00 dev = wlp4s0
2026-02-17T00:29:24.124845+01:00 [ T1522] __dev_queue_xmit 2: skb = ffff8c11ed06bd00 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.124851+01:00 [ T1522] __dev_queue_xmit: cpu = 7 xmit_lock_owner = -1
2026-02-17T00:29:24.124853+01:00 [ T1522] __dev_queue_xmit 3: skb = ffff8c11ed06bd00 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.124855+01:00 [ T1522] __dev_queue_xmit 4: skb = ffff8c11ed06bd00 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.124857+01:00 [ T1522] __dev_queue_xmit 5: skb = 0000000000000000 dev = wlp4s0 txq = ffff8c1145320200

This is the situation which produces the error messages:

T1522 tries to send an skb on CPU 7:
2026-02-17T00:29:24.212215+01:00 [ T1522] __dev_queue_xmit 0: skb = ffff8c11ed06b100 dev = wlp4s0
2026-02-17T00:29:24.212217+01:00 [ T1522] __dev_queue_xmit 2: skb = ffff8c11ed06b100 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.212219+01:00 [ T1522] __dev_queue_xmit: cpu = 7 xmit_lock_owner = -1
2026-02-17T00:29:24.212221+01:00 [ T1522] __dev_queue_xmit 3: skb = ffff8c11ed06b100 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.212223+01:00 [ T1522] __dev_queue_xmit 4: skb = ffff8c11ed06b100 dev = wlp4s0 txq = ffff8c1145320200 
Here T1522 gets preempted and T1513 is executed on CPU 7 and also tries to send an skb:
2026-02-17T00:29:24.212225+01:00 [ T1513] __dev_queue_xmit 0: skb = ffff8c11ed06a300 dev = wlp4s0
2026-02-17T00:29:24.212228+01:00 [ T1513] __dev_queue_xmit 2: skb = ffff8c11ed06a300 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.212230+01:00 [ T1513] __dev_queue_xmit: cpu = 7 xmit_lock_owner = 7
2026-02-17T00:29:24.212231+01:00 [ T1513] Dead loop on virtual device wlp4s0, fix it urgently!
2026-02-17T00:29:24.212234+01:00 [ T1513] __dev_queue_xmit 0: skb = ffff8c11ed06a300 dev = wlp4s0
2026-02-17T00:29:24.212236+01:00 [ T1513] __dev_queue_xmit 2: skb = ffff8c11ed06a300 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.212238+01:00 [ T1513] __dev_queue_xmit: cpu = 7 xmit_lock_owner = 7
2026-02-17T00:29:24.212240+01:00 [ T1513] Dead loop on virtual device wlp4s0, fix it urgently!
2026-02-17T00:29:24.212242+01:00 [ T1513] __dev_queue_xmit 0: skb = ffff8c11ed06a300 dev = wlp4s0
2026-02-17T00:29:24.212244+01:00 [ T1513] __dev_queue_xmit 2: skb = ffff8c11ed06a300 dev = wlp4s0 txq = ffff8c1145320200
2026-02-17T00:29:24.212246+01:00 [ T1513] __dev_queue_xmit: cpu = 7 xmit_lock_owner = 7
2026-02-17T00:29:24.212247+01:00 [ T1513] Dead loop on virtual device wlp4s0, fix it urgently!
T1513 gets preempted and T1522 finishes processing the skb from above:
2026-02-17T00:29:24.212249+01:00 [ T1522] __dev_queue_xmit 5: skb = 0000000000000000 dev = wlp4s0 txq = ffff8c1145320200
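
The interleaving in the log above can be replayed in a small userspace model. This is only a sketch under stated assumptions: all names (model_txq, model_xmit_begin(), ...) are hypothetical and not kernel APIs, the "lock" is reduced to the owner field, and preemption is modelled by simply not calling the unlock step before the second task runs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_txq {
	int xmit_lock_owner;	/* CPU id of the lock holder, -1 if unlocked */
};

enum xmit_result { XMIT_SENT, XMIT_DEAD_LOOP };

/* First half of a transmit: owner check, then "take" the lock. */
static enum xmit_result model_xmit_begin(struct model_txq *txq, int cpu)
{
	if (txq->xmit_lock_owner == cpu)
		return XMIT_DEAD_LOOP;	/* treated as recursion */
	txq->xmit_lock_owner = cpu;	/* models HARD_TX_LOCK() */
	return XMIT_SENT;
}

/* Second half: release the lock, models HARD_TX_UNLOCK(). */
static void model_xmit_end(struct model_txq *txq)
{
	txq->xmit_lock_owner = -1;
}

/* Replays the interleaving from the log; returns the number of false alerts. */
int model_replay_preemption(void)
{
	struct model_txq txq = { .xmit_lock_owner = -1 };
	int cpu = 7, alerts = 0;

	/* Task A locks the queue and is preempted before unlocking. */
	enum xmit_result a = model_xmit_begin(&txq, cpu);
	assert(a == XMIT_SENT);

	/* Task B runs on the same CPU and tries to send three times. */
	for (int i = 0; i < 3; i++)
		if (model_xmit_begin(&txq, cpu) == XMIT_DEAD_LOOP)
			alerts++;

	/* Task A resumes and finishes its transmit. */
	model_xmit_end(&txq);
	printf("false alerts: %d\n", alerts);
	return alerts;
}
```

Every attempt by the second task is flagged although it never held the lock, because the owner field only records the CPU.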

Bert Karwatzki


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-16 23:48     ` Bert Karwatzki
@ 2026-02-17  7:19       ` Sebastian Andrzej Siewior
  2026-02-17  8:56         ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17  7:19 UTC (permalink / raw)
  To: Bert Karwatzki; +Cc: Thomas Gleixner, linux-kernel, linux-rt-devel

On 2026-02-17 00:48:25 [+0100], Bert Karwatzki wrote:
> The problem seems to be that different preemptible threads try to send skbs.

This does not matter because the counter is per-thread not per-CPU. 

> 2026-02-17T00:29:24.212231+01:00 [ T1513] Dead loop on virtual device wlp4s0, fix it urgently!

Could you please do a backtrace here, for instance via WARN_ON_ONCE()?

> Bert Karwatzki

Sebastian


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-17  7:19       ` Sebastian Andrzej Siewior
@ 2026-02-17  8:56         ` Bert Karwatzki
  2026-02-17  9:57           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-17  8:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, spasswolf

On Tuesday, 17.02.2026 at 08:19 +0100, Sebastian Andrzej Siewior wrote:
> On 2026-02-17 00:48:25 [+0100], Bert Karwatzki wrote:
> > The problem seems to be that different preemptible threads try to send skbs.
> 
> This does not matter because the counter is per-thread not per-CPU. 

The "Dead loop on virtual device" message is printed not because dev_xmit_recursion()
returns true, but because READ_ONCE(txq->xmit_lock_owner) == cpu.
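
The two ways of reaching the alert can be told apart with a small classifier that mirrors the branch structure visible in the debug patch. It is a sketch only: the enum, the function name and the depth/limit parameters are hypothetical stand-ins for the kernel's state, not its real interfaces.

```c
#include <assert.h>

/* Hypothetical names; this only mirrors the branch structure in the patch. */
enum model_alert {
	MODEL_NO_ALERT,
	MODEL_ALERT_OWNER,	/* xmit_lock_owner == cpu */
	MODEL_ALERT_DEPTH,	/* recursion counter above its limit */
};

enum model_alert model_classify(int xmit_lock_owner, int cpu,
				int depth, int limit)
{
	if (xmit_lock_owner != cpu) {
		/* Only in this branch is the recursion counter consulted. */
		if (depth > limit)
			return MODEL_ALERT_DEPTH;
		return MODEL_NO_ALERT;
	}
	/* Owner matches the current CPU: alert without looking at depth. */
	return MODEL_ALERT_OWNER;
}
```

With the values from the log (owner 7, cpu 7, no nesting) the result is MODEL_ALERT_OWNER, so a per-task recursion counter cannot prevent the message.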
> 
> > 2026-02-17T00:29:24.212231+01:00 [ T1513] Dead loop on virtual device wlp4s0, fix it urgently!
> 
> Could you please do a backtrace here, for instance via WARN_ON_ONCE()?
> 
> > Bert Karwatzki
> 
> Sebastian

Here's the backtrace:

2026-02-17T09:48:49.553225+01:00 lisa kernel: [ T1521] __dev_queue_xmit 0: skb = ffff9ee68afd3a00 dev = wlp4s0
2026-02-17T09:48:49.553227+01:00 lisa kernel: [ T1521] __dev_queue_xmit 2: skb = ffff9ee68afd3a00 dev = wlp4s0 txq = ffff9ee6820b5a00
2026-02-17T09:48:49.553229+01:00 lisa kernel: [ T1521] __dev_queue_xmit: cpu = 7 xmit_lock_owner = -1
2026-02-17T09:48:49.553232+01:00 lisa kernel: [ T1521] __dev_queue_xmit 3: skb = ffff9ee68afd3a00 dev = wlp4s0 txq = ffff9ee6820b5a00
2026-02-17T09:48:49.553234+01:00 lisa kernel: [ T1521] __dev_queue_xmit 4: skb = ffff9ee68afd3a00 dev = wlp4s0 txq = ffff9ee6820b5a00

2026-02-17T09:48:49.553235+01:00 lisa kernel: [ T1538] __dev_queue_xmit 0: skb = ffff9ee68afd3b00 dev = wlp4s0
2026-02-17T09:48:49.553238+01:00 lisa kernel: [ T1538] __dev_queue_xmit 2: skb = ffff9ee68afd3b00 dev = wlp4s0 txq = ffff9ee6820b5a00
2026-02-17T09:48:49.553241+01:00 lisa kernel: [ T1538] __dev_queue_xmit: cpu = 7 xmit_lock_owner = 7
2026-02-17T09:48:49.553243+01:00 lisa kernel: [ T1538] ------------[ cut here ]------------
2026-02-17T09:48:49.553245+01:00 lisa kernel: [ T1538] WARNING: CPU: 7 PID: 1538 at net/core/dev.c:4800 __dev_queue_xmit.cold+0x163/0x5bd
2026-02-17T09:48:49.553248+01:00 lisa kernel: [ T1538] Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq
snd_seq_device rfcomm bnep nls_ascii nls_cp437 vfat fat snd_hda_codec_generic snd_hda_codec_atihdmi snd_hda_codec_hdmi snd_hda_intel btusb snd_intel_dspcfg
btrtl uvcvideo btintel snd_hda_codec btbcm videobuf2_vmalloc snd_acp3x_pdm_dma snd_soc_dmic snd_acp3x_rn btmtk videobuf2_memops snd_soc_core snd_hda_core uvc
videobuf2_v4l2 bluetooth snd_hwdep videodev snd_pcm_oss snd_mixer_oss snd_pcm snd_rn_pci_acp3x snd_acp_config videobuf2_common msi_wmi snd_soc_acpi ecdh_generic
ecc mc sparse_keymap wmi_bmof snd_timer snd k10temp ccp soundcore snd_pci_acp3x battery ac button joydev evdev amd_pmc mt7921e mt7921_common mt792x_lib
mt76_connac_lib mt76 mac80211 libarc4 cfg80211 rfkill msr fuse nvme_fabrics efi_pstore configfs efivarfs autofs4 ext4 mbcache jbd2 usbhid amdgpu drm_client_lib
i2c_algo_bit drm_ttm_helper ttm drm_exec drm_suballoc_helper drm_buddy drm_panel_backlight_quirks gpu_sched xhci_pci amdxcp xhci_hcd
2026-02-17T09:48:49.553251+01:00 lisa kernel: [ T1538]  drm_display_helper hid_sensor_hub hid_multitouch mfd_core hid_generic i2c_hid_acpi psmouse
drm_kms_helper usbcore nvme amd_sfh i2c_hid hid cec nvme_core serio_raw i2c_piix4 r8169 crc16 usb_common i2c_smbus i2c_designware_platform i2c_designware_core
2026-02-17T09:48:49.553254+01:00 lisa kernel: [ T1538] CPU: 7 UID: 122 PID: 1538 Comm: isc-loop-0014 Not tainted 6.18.10-deadloop-00005-g56089d3b695a #1153
PREEMPT_{RT,(full)} 
2026-02-17T09:48:49.553256+01:00 lisa kernel: [ T1538] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.10F 11/11/2024
2026-02-17T09:48:49.553258+01:00 lisa kernel: [ T1538] RIP: 0010:__dev_queue_xmit.cold+0x163/0x5bd
2026-02-17T09:48:49.553261+01:00 lisa kernel: [ T1538] Code: 3d 3e 88 5a 01 66 41 83 bf ca 04 00 00 08 0f 86 4d 02 00 00 4c 89 f2 48 c7 c6 80 b3 0c b9 48 c7 c7
b0 b3 2f b9 e8 49 43 fd ff <0f> 0b e8 72 8f 75 00 85 c0 74 0f 4c 89 f6 48 c7 c7 58 b4 2f b9 e8
2026-02-17T09:48:49.553262+01:00 lisa kernel: [ T1538] RSP: 0018:ffffbd6506d53990 EFLAGS: 00010246
2026-02-17T09:48:49.553264+01:00 lisa kernel: [ T1538] RAX: 0000000000000007 RBX: ffff9ee6a089c000 RCX: 0000000000000027
2026-02-17T09:48:49.553265+01:00 lisa kernel: [ T1538] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9ee95e7d6d80
2026-02-17T09:48:49.553266+01:00 lisa kernel: [ T1538] RBP: ffff9ee68afd3b00 R08: 0000000000000000 R09: ffffffffb9685590
2026-02-17T09:48:49.553268+01:00 lisa kernel: [ T1538] R10: ffffffffb96a4210 R11: 0000000000000003 R12: ffff9ee6820b5a00
2026-02-17T09:48:49.553270+01:00 lisa kernel: [ T1538] R13: 0000000000000007 R14: ffff9ee6a089c118 R15: ffff9ee6b455f000
2026-02-17T09:48:49.553273+01:00 lisa kernel: [ T1538] FS:  00007f3ad75ff680(0000) GS:ffff9ee9a4bf3000(0000) knlGS:0000000000000000
2026-02-17T09:48:49.553275+01:00 lisa kernel: [ T1538] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2026-02-17T09:48:49.553277+01:00 lisa kernel: [ T1538] CR2: 00007f3d54a24080 CR3: 0000000102f73000 CR4: 0000000000750ef0
2026-02-17T09:48:49.553279+01:00 lisa kernel: [ T1538] PKRU: 55555554
2026-02-17T09:48:49.553282+01:00 lisa kernel: [ T1538] Call Trace:
2026-02-17T09:48:49.553284+01:00 lisa kernel: [ T1538]  <TASK>
2026-02-17T09:48:49.553285+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553287+01:00 lisa kernel: [ T1538]  ? __ip_make_skb+0x325/0x560
2026-02-17T09:48:49.553289+01:00 lisa kernel: [ T1538]  ip_finish_output2+0x2c8/0x600
2026-02-17T09:48:49.553292+01:00 lisa kernel: [ T1538]  ip_local_out+0xc6/0xf0
2026-02-17T09:48:49.553294+01:00 lisa kernel: [ T1538]  ip_send_skb+0x14/0x50
2026-02-17T09:48:49.553296+01:00 lisa kernel: [ T1538]  udp_send_skb+0x181/0x370
2026-02-17T09:48:49.553299+01:00 lisa kernel: [ T1538]  udp_sendmsg+0x8e8/0xbc0
2026-02-17T09:48:49.553301+01:00 lisa kernel: [ T1538]  ? ip_frag_init+0x60/0x60
2026-02-17T09:48:49.553302+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553304+01:00 lisa kernel: [ T1538]  ? aa_sk_perm+0x8e/0x210
2026-02-17T09:48:49.553306+01:00 lisa kernel: [ T1538]  __sock_sendmsg+0x60/0x80
2026-02-17T09:48:49.553308+01:00 lisa kernel: [ T1538]  ____sys_sendmsg+0x21d/0x2b0
2026-02-17T09:48:49.553310+01:00 lisa kernel: [ T1538]  ? import_iovec+0x1b/0x30
2026-02-17T09:48:49.553312+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553315+01:00 lisa kernel: [ T1538]  ? copy_msghdr_from_user+0xe5/0x170
2026-02-17T09:48:49.553316+01:00 lisa kernel: [ T1538]  ___sys_sendmsg+0x7e/0xc0
2026-02-17T09:48:49.553318+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553318+01:00 lisa kernel: [ T1538]  ? rt_spin_lock+0x38/0x110
2026-02-17T09:48:49.553321+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553323+01:00 lisa kernel: [ T1538]  ? rt_spin_lock+0x38/0x110
2026-02-17T09:48:49.553325+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553327+01:00 lisa kernel: [ T1538]  ? ipv4_dst_check+0x36/0x60
2026-02-17T09:48:49.553330+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553332+01:00 lisa kernel: [ T1538]  ? ip4_datagram_release_cb+0x45/0x1d0
2026-02-17T09:48:49.553334+01:00 lisa kernel: [ T1538]  ? rt_spin_lock+0x38/0x110
2026-02-17T09:48:49.553336+01:00 lisa kernel: [ T1538]  ? get_random_u16+0xc6/0x1c0
2026-02-17T09:48:49.553338+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553340+01:00 lisa kernel: [ T1538]  ? rt_spin_lock+0x38/0x110
2026-02-17T09:48:49.553342+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553344+01:00 lisa kernel: [ T1538]  ? rt_spin_unlock+0x5a/0xa0
2026-02-17T09:48:49.553346+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553347+01:00 lisa kernel: [ T1538]  ? rt_spin_unlock+0x5a/0xa0
2026-02-17T09:48:49.553349+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553351+01:00 lisa kernel: [ T1538]  ? __local_bh_enable_ip+0x73/0xa0
2026-02-17T09:48:49.553353+01:00 lisa kernel: [ T1538]  ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-17T09:48:49.553355+01:00 lisa kernel: [ T1538]  __sys_sendmsg+0x68/0xc0
2026-02-17T09:48:49.553357+01:00 lisa kernel: [ T1538]  do_syscall_64+0x65/0x2f0
2026-02-17T09:48:49.553359+01:00 lisa kernel: [ T1538]  entry_SYSCALL_64_after_hwframe+0x55/0x5d
2026-02-17T09:48:49.553361+01:00 lisa kernel: [ T1538] RIP: 0033:0x7f3ae6d239ee
2026-02-17T09:48:49.553363+01:00 lisa kernel: [ T1538] Code: 08 0f 85 f5 4b ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c
24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 80 00 00 00 00 48 83 ec 08
2026-02-17T09:48:49.553365+01:00 lisa kernel: [ T1538] RSP: 002b:00007f3ad75fd0a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
2026-02-17T09:48:49.553367+01:00 lisa kernel: [ T1538] RAX: ffffffffffffffda RBX: 00007f3ad75ff680 RCX: 00007f3ae6d239ee
2026-02-17T09:48:49.553367+01:00 lisa kernel: [ T1538] RDX: 0000000000000000 RSI: 00007f3ad75fd110 RDI: 0000000000000105
2026-02-17T09:48:49.553369+01:00 lisa kernel: [ T1538] RBP: 00007f3ad75fd110 R08: 0000000000000000 R09: 0000000000000000
2026-02-17T09:48:49.553371+01:00 lisa kernel: [ T1538] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
2026-02-17T09:48:49.553373+01:00 lisa kernel: [ T1538] R13: 00007f3ad75fd750 R14: 00007f3ad644c8d0 R15: 00007f3ad75fd750
2026-02-17T09:48:49.553375+01:00 lisa kernel: [ T1538]  </TASK>
2026-02-17T09:48:49.553377+01:00 lisa kernel: [ T1538] ---[ end trace 0000000000000000 ]---
2026-02-17T09:48:49.553379+01:00 lisa kernel: [ T1538] Dead loop on virtual device wlp4s0, fix it urgently!
2026-02-17T09:48:49.553380+01:00 lisa kernel: [ T1538] __dev_queue_xmit 0: skb = ffff9ee68afd3b00 dev = wlp4s0
2026-02-17T09:48:49.553381+01:00 lisa kernel: [ T1538] __dev_queue_xmit 2: skb = ffff9ee68afd3b00 dev = wlp4s0 txq = ffff9ee6820b5a00
2026-02-17T09:48:49.553383+01:00 lisa kernel: [ T1538] __dev_queue_xmit: cpu = 7 xmit_lock_owner = 7
2026-02-17T09:48:49.553385+01:00 lisa kernel: [ T1538] Dead loop on virtual device wlp4s0, fix it urgently!

2026-02-17T09:48:49.553387+01:00 lisa kernel: [ T1521] __dev_queue_xmit 5: skb = 0000000000000000 dev = wlp4s0 txq = ffff9ee6820b5a00

Bert Karwatzki


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-17  8:56         ` Bert Karwatzki
@ 2026-02-17  9:57           ` Sebastian Andrzej Siewior
  2026-02-17 10:42             ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17  9:57 UTC (permalink / raw)
  To: Bert Karwatzki; +Cc: Thomas Gleixner, linux-kernel, linux-rt-devel

On 2026-02-17 09:56:48 [+0100], Bert Karwatzki wrote:
> On Tuesday, 17.02.2026 at 08:19 +0100, Sebastian Andrzej Siewior wrote:
> > On 2026-02-17 00:48:25 [+0100], Bert Karwatzki wrote:
> > > The problem seems to be that different preemptible threads try to send skbs.
> > 
> > This does not matter because the counter is per-thread not per-CPU. 
> 
> The "Dead loop on virtual device" message is printed not because dev_xmit_recursion()
> returns true, but because READ_ONCE(txq->xmit_lock_owner) == cpu.

Ah, so it is not the recursion, it is the assigned CPU.
This is assigned via __netif_tx_lock(). Here we somehow lack the
expected synchronisation. So the queue should be locked, but not by the
caller.

> Here's the backtrace:

thanks.

> Bert Karwatzki

Sebastian


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-17  9:57           ` Sebastian Andrzej Siewior
@ 2026-02-17 10:42             ` Bert Karwatzki
  2026-02-17 11:24               ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-17 10:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Thomas Gleixner, linux-kernel, linux-rt-devel

On Tuesday, 17.02.2026 at 10:57 +0100, Sebastian Andrzej Siewior wrote:
> On 2026-02-17 09:56:48 [+0100], Bert Karwatzki wrote:
> > On Tuesday, 17.02.2026 at 08:19 +0100, Sebastian Andrzej Siewior wrote:
> > > On 2026-02-17 00:48:25 [+0100], Bert Karwatzki wrote:
> > > > The problem seems to be that different preemptible threads try to send skbs.
> > > 
> > > This does not matter because the counter is per-thread not per-CPU. 
> > 
> > The "Dead loop on virtual device" message is printed not because dev_xmit_recursion()
> > returns true, but because READ_ONCE(txq->xmit_lock_owner) == cpu.
> 
> Ah, so it is not the recursion, it is the assigned CPU.
> This is assigned via __netif_tx_lock(). Here we somehow lack the
> expected synchronisation. So the queue should be locked, but not by the
> caller.

Yes, the queue gets locked by the first thread (via HARD_TX_LOCK), then that thread gets
preempted before the processing of the skb is complete, and then the next thread on the same
CPU calls __dev_queue_xmit() and finds that the lock owner has the same CPU id.

I just wondered if we can completely skip the

	if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
		[...]
	} else {
		/* "Recursion" alert */
	}

check, as the synchronization will be provided by HARD_TX_{LOCK,UNLOCK}.

The comment

		/* Other cpus might concurrently change txq->xmit_lock_owner
		 * to -1 or to their cpu id, but not to our id.
		 */
suggests that the case of a thread being preempted while holding the lock was
not taken into account here. In the non-RT case this is correct, as spin_lock()
disables preemption there.

Bert Karwatzki


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-17 10:42             ` Bert Karwatzki
@ 2026-02-17 11:24               ` Bert Karwatzki
  2026-02-17 16:52                 ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-17 11:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, spasswolf,
	Jakub Kicinski, Eric Dumazet, netdev

On Tuesday, 17.02.2026 at 11:42 +0100, Bert Karwatzki wrote:
> 
> I just wondered if we can completely skip the
> 
> 	if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> 		[...]
> 	} else 
> 	{
> 		/* "Recursion" alert */
> 	}
>  
> check, as the synchronization will be provided by HARD_TX_{LOCK,UNLOCK}.
> 

I thought about that again, and it seems like a bad idea, as (in the non-preempt case)
other threads trying to access the queue would wait for the spinlock to be freed. Perhaps
one can just change the code like this:

commit 05026868843a4eea51d45811d87706f36896e828
Author: Bert Karwatzki <spasswolf@web.de>
Date:   Tue Feb 17 12:08:35 2026 +0100

    net: core: dev: don't warn about recursion when on same CPU
    
    This prints a message if we're on the same CPU and the lock is
    already taken, in a production use we would of course skip this
    message.
    
    Signed-off-by: Bert Karwatzki <spasswolf@web.de>

diff --git a/net/core/dev.c b/net/core/dev.c
index 5b536860138d..cac5588640b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4784,15 +4784,18 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
 					     dev->name);
 		} else {
-			/* Recursion is detected! It is possible,
-			 * unfortunately
-			 */
-recursion_alert:
-			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
-					     dev->name);
+			net_crit_ratelimited("Lock taken already on %s!\n", dev->name);
+			goto lock_taken;
 		}
 	}
 
+	/* Recursion is detected! It is possible,
+	 * unfortunately
+	 */
+recursion_alert:
+	net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
+					     dev->name);
+lock_taken:
 	rc = -ENETDOWN;
 	rcu_read_unlock_bh();
 

With this I get these messages (to be skipped in production code)
instead of the "Dead loop on virtual device" messages:

[   51.994435] [   T1525] Lock taken already on wlp4s0!
[   51.994546] [   T1525] Lock taken already on wlp4s0!
[   51.994650] [   T1525] Lock taken already on wlp4s0!
[   51.994746] [   T1525] Lock taken already on wlp4s0!
[   51.994845] [   T1525] Lock taken already on wlp4s0!
[   51.994948] [   T1525] Lock taken already on wlp4s0!
[   51.995037] [   T1525] Lock taken already on wlp4s0!
[   51.995128] [   T1525] Lock taken already on wlp4s0!
[   51.995220] [   T1525] Lock taken already on wlp4s0!


Bert Karwatzki


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-17 11:24               ` Bert Karwatzki
@ 2026-02-17 16:52                 ` Bert Karwatzki
  2026-02-17 19:10                   ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-17 16:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, Jakub Kicinski,
	Eric Dumazet, netdev, spasswolf

On Tuesday, 17.02.2026 at 12:24 +0100, Bert Karwatzki wrote:
> On Tuesday, 17.02.2026 at 11:42 +0100, Bert Karwatzki wrote:
> > 
> > I just wondered if we can completely skip the
> > 
> > 	if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> > 		[...]
> > 	} else 
> > 	{
> > 		/* "Recursion" alert */
> > 	}
> >  
> > check, as the synchronization will be provided by HARD_TX_{LOCK,UNLOCK}.
> > 
> 
> I thought about that again, and it seems like a bad idea, as (in the non-preempt case)
> other threads trying to access the queue would wait for the spinlock to be freed. Perhaps
> one can just change the code like this:

My argument above seems wrong: in the non-preempt case we cannot have another thread accessing the txq
from the same CPU while the lock is taken (the holder is not preemptible), and for threads
accessing the txq from different CPUs the check (txq->xmit_lock_owner != cpu) would succeed
and they would try to take the spinlock anyway. So this does not speak against removing the lock owner
check. As for the recursion detection, perhaps dev_xmit_recursion() is enough?

Another idea (more vague ...):

Using the CPU ID as the lock owner seems to make sense for locks that are not
preemptible (raw_spinlock in the RT case). For preemptible locks, a thread ID (which one exactly I'm not sure ...)
would perhaps make a better lock owner ...
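
A rough userspace sketch of that idea, with the owner field keyed by a task id instead of a CPU id. Everything here is hypothetical: the struct, the helper names and the use of plain integers as task ids are assumptions for illustration, and which kernel identifier would actually serve as the owner is left open above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NO_OWNER (-1)

/* Hypothetical model: the owner field stores a task id, not a CPU id. */
struct task_owned_txq {
	int owner_task;
};

/* True only when the calling task itself already holds the queue lock. */
static bool model_is_recursion(const struct task_owned_txq *txq, int task)
{
	return txq->owner_task == task;
}

static void model_lock(struct task_owned_txq *txq, int task)
{
	assert(txq->owner_task == MODEL_NO_OWNER);
	txq->owner_task = task;
}

/* Task `holder` holds the lock; is a transmit attempt by `caller` flagged? */
bool model_flagged_while_held(int holder, int caller)
{
	struct task_owned_txq txq = { .owner_task = MODEL_NO_OWNER };

	model_lock(&txq, holder);
	return model_is_recursion(&txq, caller);
}
```

In this model a second task on the same CPU is not flagged while the first holds the lock, whereas a re-entry by the holder itself still is.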

Bert Karwatzki


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-17 16:52                 ` Bert Karwatzki
@ 2026-02-17 19:10                   ` Bert Karwatzki
  2026-02-18  7:30                     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-17 19:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, Jakub Kicinski,
	Eric Dumazet, netdev, spasswolf

On Tuesday, 17.02.2026 at 17:52 +0100, Bert Karwatzki wrote:
> On Tuesday, 17.02.2026 at 12:24 +0100, Bert Karwatzki wrote:
> > On Tuesday, 17.02.2026 at 11:42 +0100, Bert Karwatzki wrote:
> > > 
> > > I just wondered if we can completely skip the
> > > 
> > > 	if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> > > 		[...]
> > > 	} else 
> > > 	{
> > > 		/* "Recursion" alert */
> > > 	}
> > >  
> > > check, as the synchronization will be provided by HARD_TX_{LOCK,UNLOCK}.
> > > 
> > 
> > I thought about that again, and it seems like a bad idea, as (in the non-preempt case)
> > other threads trying to access the queue would wait for the spinlock to be freed. Perhaps
> > one can just change the code like this:
> 
> My argument above seems wrong: in the non-preempt case we cannot have another thread accessing the txq
> from the same CPU while the lock is taken (the holder is not preemptible), and for threads
> accessing the txq from different CPUs the check (txq->xmit_lock_owner != cpu) would succeed
> and they would try to take the spinlock anyway. So this does not speak against removing the lock owner
> check. As for the recursion detection, perhaps dev_xmit_recursion() is enough?

I tried to research the original commit which introduced the xmit_lock_owner check, but
it is present since linux 2.3.6 (released 19990610) (when __dev_queue_xmit() was still called dev_queue_xmit()),
so I can't tell the original idea behind that check (perhaps recursion detection ...), so I'm
not completely sure if it can be omitted (and just let dev_xmit_recursion() do the recursion checking).
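
For illustration, the interplay of the two checks discussed here can be modelled in userspace. This is only a sketch: every name below is hypothetical (none is a kernel API), the depth limit of 8 is an assumed value, and the real __dev_queue_xmit() does considerably more.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_RECURSION_LIMIT 8		/* assumed value, for illustration only */
#define MODEL_NO_OWNER (-1)

enum model_verdict {
	MODEL_XMIT,		/* caller may take the queue lock and transmit */
	MODEL_RECURSION_ALERT,	/* transmit nesting is too deep */
	MODEL_DEAD_LOOP,	/* this CPU already holds the queue lock */
};

struct model_txq {
	int xmit_lock_owner;	/* CPU id of the lock holder, or MODEL_NO_OWNER */
};

/*
 * Mirrors the shape of the check quoted in this thread: the owner test
 * comes first, the nesting-depth test (standing in for
 * dev_xmit_recursion()) is only consulted when the owner test passes.
 */
static enum model_verdict model_xmit_check(const struct model_txq *txq,
					    int cpu, int depth)
{
	assert(txq != 0);

	if (txq->xmit_lock_owner != cpu) {
		if (depth > MODEL_RECURSION_LIMIT)
			return MODEL_RECURSION_ALERT;
		return MODEL_XMIT;
	}
	return MODEL_DEAD_LOOP;
}
```

In this model the owner test fires on the first nested call into a queue whose lock the same CPU already holds, whereas the depth counter alone would only trip after several levels of nesting.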

Bert Karwatzki

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-17 19:10                   ` Bert Karwatzki
@ 2026-02-18  7:30                     ` Sebastian Andrzej Siewior
  2026-02-18 12:50                       ` Bert Karwatzki
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-18  7:30 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, Jakub Kicinski,
	Eric Dumazet, netdev

On 2026-02-17 20:10:09 [+0100], Bert Karwatzki wrote:
> 
> I tried to research the original commit which introduced the xmit_lock_owner check, but
> it is present since linux 2.3.6 (released 19990610) (when __dev_queue_xmit() was still called dev_queue_xmit()),
> so I can't tell the original idea behind that check (perhaps recursion detection ...), so I'm
> not completely sure if it can be omitted (and just let dev_xmit_recursion() do the recursion checking).

Okay. Thank you. I'll add it to my list.

> Bert Karwatzki

Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-18  7:30                     ` Sebastian Andrzej Siewior
@ 2026-02-18 12:50                       ` Bert Karwatzki
  2026-02-26 17:29                         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Bert Karwatzki @ 2026-02-18 12:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, Jakub Kicinski,
	Eric Dumazet, netdev, spasswolf

Am Mittwoch, dem 18.02.2026 um 08:30 +0100 schrieb Sebastian Andrzej Siewior:
> On 2026-02-17 20:10:09 [+0100], Bert Karwatzki wrote:
> > 
> > I tried to research the original commit which introduced the xmit_lock_owner check, but
> > it is present since linux 2.3.6 (released 19990610) (when __dev_queue_xmit() was still called dev_queue_xmit()),
> > so I can't tell the original idea behind that check (perhaps recursion detection ...), so I'm
> > not completely sure if it can be omitted (and just let dev_xmit_recursion() do the recursion checking).
> 
> Okay. Thank you. I'll add it to my list.
> 
I've thought about it again and I now think the xmit_lock_owner check IS necessary to
avoid deadlocks on recursion in the non-RT case.

My idea to use get_current()->tgid as lock owner also does not work as interrupts are still enabled
and __dev_queue_xmit() can be called from interrupt context. So in a situation where an interrupt occurs
after the lock has been taken and the interrupt handler calls __dev_queue_xmit() on the same CPU a deadlock
would occur.

Bert Karwatzki

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-18 12:50                       ` Bert Karwatzki
@ 2026-02-26 17:29                         ` Sebastian Andrzej Siewior
  2026-03-18 10:30                           ` Daniel Vacek
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-26 17:29 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Thomas Gleixner, linux-kernel, linux-rt-devel, Jakub Kicinski,
	Eric Dumazet, netdev

On 2026-02-18 13:50:14 [+0100], Bert Karwatzki wrote:
> Am Mittwoch, dem 18.02.2026 um 08:30 +0100 schrieb Sebastian Andrzej Siewior:
> > On 2026-02-17 20:10:09 [+0100], Bert Karwatzki wrote:
> > > 
> > > I tried to research the original commit which introduced the xmit_lock_owner check, but
> > > it is present since linux 2.3.6 (released 19990610) (when __dev_queue_xmit() was still called dev_queue_xmit()),
> > > so I can't tell the original idea behind that check (perhaps recursion detection ...), so I'm
> > > not completely sure if it can be omitted (and just let dev_xmit_recursion() do the recursion checking).
> > 
> > Okay. Thank you. I'll add it to my list.
> > 
> I've thought about it again and I now think the xmit_lock_owner check IS necessary to
> avoid deadlocks on recursion in the non-RT case.
> 
> My idea to use get_current()->tgid as lock owner also does not work as interrupts are still enabled
> and __dev_queue_xmit() can be called from interrupt context. So in a situation where an interrupt occurs
> after the lock has been taken and the interrupt handler calls __dev_queue_xmit() on the same CPU a deadlock
> would occur.

The warning happens because taskA on cpuX goes through
HARD_TX_LOCK(), gets preempted and then taskB on cpuX also wants to send
a packet. The second one throws the warning.

We could ignore this check because a deadlock will throw a warning and
"halt" the task that runs into the deadlock.
But then we could be smart about this in the same way !RT is and
proactively check for the simple deadlock. The lock owner of
netdev_queue::_xmit_lock is recorded and can be checked against current.

The snippet below should work. I need to see if tomorrow this is still a
good idea.

diff --git a/net/core/dev.c b/net/core/dev.c
index 6ff4256700e60..de342ceb17201 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4821,7 +4821,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 		/* Other cpus might concurrently change txq->xmit_lock_owner
 		 * to -1 or to their cpu id, but not to our id.
 		 */
-		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+		if (rt_mutex_owner(&txq->_xmit_lock.lock) != current) {
 			if (dev_xmit_recursion())
 				goto recursion_alert;
 

> Bert Karwatzki

Sebastian
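
A userspace sketch of the scenario described above, assuming a preemptible lock and two tasks that happen to run on the same CPU. All names are hypothetical and the model ignores real locking entirely; it only contrasts the two ways of recording the owner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task: only the CPU it currently runs on matters. */
struct model_task {
	int cpu;
};

struct model_lock {
	int owner_cpu;				/* CPU-id record, -1 when free */
	const struct model_task *owner_task;	/* task record, NULL when free */
};

/* Record both kinds of owner so the two checks can be compared. */
static void model_lock_take(struct model_lock *l, const struct model_task *t)
{
	assert(l->owner_task == NULL);
	l->owner_cpu = t->cpu;
	l->owner_task = t;
}

/* CPU-based check: "the owner is on my CPU, so it must be me". */
static bool model_dead_loop_by_cpu(const struct model_lock *l,
				   const struct model_task *t)
{
	return l->owner_cpu == t->cpu;
}

/* Task-based check: only true when the very same task re-enters. */
static bool model_dead_loop_by_task(const struct model_lock *l,
				    const struct model_task *t)
{
	return l->owner_task == t;
}
```

With taskA holding the lock on CPU 3 and taskB also running on CPU 3 after preempting it, model_dead_loop_by_cpu() reports a dead loop for taskB although taskB never took the lock, while model_dead_loop_by_task() reports one only for taskA.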

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-02-26 17:29                         ` Sebastian Andrzej Siewior
@ 2026-03-18 10:30                           ` Daniel Vacek
  2026-03-18 11:18                             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Vacek @ 2026-03-18 10:30 UTC (permalink / raw)
  To: bigeasy
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx, Daniel Vacek

On Thu, 26 Feb 2026 18:29:27 +0100, Sebastian Andrzej Siewior wrote:
> On 2026-02-18 13:50:14 [+0100], Bert Karwatzki wrote:
> > Am Mittwoch, dem 18.02.2026 um 08:30 +0100 schrieb Sebastian Andrzej Siewior:
> > > On 2026-02-17 20:10:09 [+0100], Bert Karwatzki wrote:
> > > > 
> > > > I tried to research the original commit which introduced the xmit_lock_owner check, but
> > > > it is present since linux 2.3.6 (released 19990610) (when __dev_queue_xmit() was still called dev_queue_xmit()),
> > > > so I can't tell the original idea behind that check (perhaps recursion detection ...), so I'm
> > > > not completely sure if it can be omitted (and just let dev_xmit_recursion() do the recursion checking).
> > > 
> > > Okay. Thank you. I'll add it to my list.
> > > 
> > I've thought about it again and I now think the xmit_lock_owner check IS necessary to
> > avoid deadlocks on recursion in the non-RT case.
> > 
> > My idea to use get_current()->tgid as lock owner also does not work as interrupts are still enabled
> > and __dev_queue_xmit() can be called from interrupt context. So in a situation where an interrupt occurs
> > after the lock has been taken and the interrupt handler calls __dev_queue_xmit() on the same CPU a deadlock
> > would occur.
> 
> The warning happens because taskA on cpuX goes through
> HARD_TX_LOCK(), gets preempted and then taskB on cpuX also wants to send
> a packet. The second one throws the warning.
> 
> We could ignore this check because a deadlock will throw a warning and
> "halt" the task that runs into the deadlock.
> But then we could be smart about this in the same way !RT is and
> proactively check for the simple deadlock. The lock owner of
> netdev_queue::_xmit_lock is recorded and can be checked against current.
> 
> The snippet below should work. I need to see if tomorrow this is still a
> good idea.
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 6ff4256700e60..de342ceb17201 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4821,7 +4821,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>  		/* Other cpus might concurrently change txq->xmit_lock_owner
>  		 * to -1 or to their cpu id, but not to our id.
>  		 */
> -		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> +		if (rt_mutex_owner(&txq->_xmit_lock.lock) != current) {

Ain't this changing the behavior for !RT case? Previously, if it was the same thread
which has already locked the queue (and hence the same CPU) evaluating this condition,
the condition was skipped, which is no longer the case with this change.

How about something like this instead?:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d99b0fbc1942..27d090b03493 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -705,7 +705,7 @@ struct netdev_queue {
 	struct dql		dql;
 #endif
 	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
-	int			xmit_lock_owner;
+	struct task_struct	*xmit_lock_owner;
 	/*
 	 * Time (in jiffies) of last Tx
 	 */
@@ -4709,7 +4709,7 @@ static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
 {
 	spin_lock(&txq->_xmit_lock);
 	/* Pairs with READ_ONCE() in __dev_queue_xmit() */
-	WRITE_ONCE(txq->xmit_lock_owner, cpu);
+	WRITE_ONCE(txq->xmit_lock_owner, current);
 }
 
 static inline bool __netif_tx_acquire(struct netdev_queue *txq)
@@ -4727,7 +4727,7 @@ static inline void __netif_tx_lock_bh(struct netdev_queue *txq)
 {
 	spin_lock_bh(&txq->_xmit_lock);
 	/* Pairs with READ_ONCE() in __dev_queue_xmit() */
-	WRITE_ONCE(txq->xmit_lock_owner, smp_processor_id());
+	WRITE_ONCE(txq->xmit_lock_owner, current);
 }
 
 static inline bool __netif_tx_trylock(struct netdev_queue *txq)
@@ -4736,7 +4736,7 @@ static inline bool __netif_tx_trylock(struct netdev_queue *txq)
 
 	if (likely(ok)) {
 		/* Pairs with READ_ONCE() in __dev_queue_xmit() */
-		WRITE_ONCE(txq->xmit_lock_owner, smp_processor_id());
+		WRITE_ONCE(txq->xmit_lock_owner, current);
 	}
 	return ok;
 }
@@ -4744,14 +4744,14 @@ static inline bool __netif_tx_trylock(struct netdev_queue *txq)
 static inline void __netif_tx_unlock(struct netdev_queue *txq)
 {
 	/* Pairs with READ_ONCE() in __dev_queue_xmit() */
-	WRITE_ONCE(txq->xmit_lock_owner, -1);
+	WRITE_ONCE(txq->xmit_lock_owner, NULL);
 	spin_unlock(&txq->_xmit_lock);
 }
 
 static inline void __netif_tx_unlock_bh(struct netdev_queue *txq)
 {
 	/* Pairs with READ_ONCE() in __dev_queue_xmit() */
-	WRITE_ONCE(txq->xmit_lock_owner, -1);
+	WRITE_ONCE(txq->xmit_lock_owner, NULL);
 	spin_unlock_bh(&txq->_xmit_lock);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index ccef685023c2..f62bffb8edf6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4817,7 +4817,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 		/* Other cpus might concurrently change txq->xmit_lock_owner
 		 * to -1 or to their cpu id, but not to our id.
 		 */
-		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+		if (READ_ONCE(txq->xmit_lock_owner) != current) {
 			if (dev_xmit_recursion())
 				goto recursion_alert;
 
@@ -11178,7 +11178,7 @@ static void netdev_init_one_queue(struct net_device *dev,
 	/* Initialize queue lock */
 	spin_lock_init(&queue->_xmit_lock);
 	netdev_set_xmit_lockdep_class(&queue->_xmit_lock, dev->type);
-	queue->xmit_lock_owner = -1;
+	queue->xmit_lock_owner = NULL;
 	netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
 	queue->dev = dev;
 #ifdef CONFIG_BQL

Daniel

>  			if (dev_xmit_recursion())
>  				goto recursion_alert;
>  
> 
> > Bert Karwatzki
> 
> Sebastian

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-03-18 10:30                           ` Daniel Vacek
@ 2026-03-18 11:18                             ` Sebastian Andrzej Siewior
  2026-03-18 14:43                               ` Daniel Vacek
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 11:18 UTC (permalink / raw)
  To: Daniel Vacek
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx

On 2026-03-18 11:30:09 [+0100], Daniel Vacek wrote:
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4821,7 +4821,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >  		/* Other cpus might concurrently change txq->xmit_lock_owner
> >  		 * to -1 or to their cpu id, but not to our id.
> >  		 */
> > -		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> > +		if (rt_mutex_owner(&txq->_xmit_lock.lock) != current) {
> 
> Ain't this changing the behavior for !RT case? Previously, if it was the same thread
> which has already locked the queue (and hence the same CPU) evaluating this condition,
> the condition was skipped, which is no longer the case with this change.

The above was me thinking and does not even compile for !RT. Commit
b824c3e16c190 ("net: Provide a PREEMPT_RT specific check for
netdev_queue::_xmit_lock") is what was merged in the end.

Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-03-18 11:18                             ` Sebastian Andrzej Siewior
@ 2026-03-18 14:43                               ` Daniel Vacek
  2026-03-18 14:51                                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Vacek @ 2026-03-18 14:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx

On Wed, 18 Mar 2026 at 12:18, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2026-03-18 11:30:09 [+0100], Daniel Vacek wrote:
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4821,7 +4821,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > >             /* Other cpus might concurrently change txq->xmit_lock_owner
> > >              * to -1 or to their cpu id, but not to our id.
> > >              */
> > > -           if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> > > +           if (rt_mutex_owner(&txq->_xmit_lock.lock) != current) {
> >
> > Ain't this changing the behavior for !RT case? Previously, if it was the same thread
> > which has already locked the queue (and hence the same CPU) evaluating this condition,
> > the condition was skipped, which is no longer the case with this change.
>
> The above was me thinking and does not even compile for !RT. Commit
> b824c3e16c190 ("net: Provide a PREEMPT_RT specific check for
> netdev_queue::_xmit_lock") is what was merged in the end.

Hmm, that means txq->xmit_lock_owner is not used at all for PREEMPT_RT.
It's pointless to even store it. Shall we care?

--nX

> Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-03-18 14:43                               ` Daniel Vacek
@ 2026-03-18 14:51                                 ` Sebastian Andrzej Siewior
  2026-03-18 14:58                                   ` Daniel Vacek
  2026-04-01 16:55                                   ` Daniel Vacek
  0 siblings, 2 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 14:51 UTC (permalink / raw)
  To: Daniel Vacek
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx

On 2026-03-18 15:43:52 [+0100], Daniel Vacek wrote:
> On Wed, 18 Mar 2026 at 12:18, Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > On 2026-03-18 11:30:09 [+0100], Daniel Vacek wrote:
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -4821,7 +4821,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > >             /* Other cpus might concurrently change txq->xmit_lock_owner
> > > >              * to -1 or to their cpu id, but not to our id.
> > > >              */
> > > > -           if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> > > > +           if (rt_mutex_owner(&txq->_xmit_lock.lock) != current) {
> > >
> > > Ain't this changing the behavior for !RT case? Previously, if it was the same thread
> > > which has already locked the queue (and hence the same CPU) evaluating this condition,
> > > the condition was skipped, which is no longer the case with this change.
> >
> > The above was me thinking and does not even compile for !RT. Commit
> > b824c3e16c190 ("net: Provide a PREEMPT_RT specific check for
> > netdev_queue::_xmit_lock") is what was merged in the end.
> 
> Hmm, that means txq->xmit_lock_owner is not used at all for PREEMPT_RT.
> It's pointless to even store it. Shall we care?

For PREEMPT_RT the xmit_lock_owner member is only stored and not used
otherwise. It could be removed as in ifdef-ed away but I do not care
enough to sprinkle it and the gain is little (prove me wrong). So I am
happy as-is.

As for the check itself as we have now, we detect the deadlock before it
happens and this is nice to have. The alternative (in case of a
deadlock) would be the deadlock detection in rtmutex code which would
freeze the thread.

> --nX
> 
Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-03-18 14:51                                 ` Sebastian Andrzej Siewior
@ 2026-03-18 14:58                                   ` Daniel Vacek
  2026-04-01 16:55                                   ` Daniel Vacek
  1 sibling, 0 replies; 26+ messages in thread
From: Daniel Vacek @ 2026-03-18 14:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx

On Wed, 18 Mar 2026 at 15:51, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> On 2026-03-18 15:43:52 [+0100], Daniel Vacek wrote:
> > On Wed, 18 Mar 2026 at 12:18, Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > >
> > > On 2026-03-18 11:30:09 [+0100], Daniel Vacek wrote:
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > @@ -4821,7 +4821,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > > >             /* Other cpus might concurrently change txq->xmit_lock_owner
> > > > >              * to -1 or to their cpu id, but not to our id.
> > > > >              */
> > > > > -           if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> > > > > +           if (rt_mutex_owner(&txq->_xmit_lock.lock) != current) {
> > > >
> > > > Ain't this changing the behavior for !RT case? Previously, if it was the same thread
> > > > which has already locked the queue (and hence the same CPU) evaluating this condition,
> > > > the condition was skipped, which is no longer the case with this change.
> > >
> > > The above was me thinking and does not even compile for !RT. Commit
> > > b824c3e16c190 ("net: Provide a PREEMPT_RT specific check for
> > > netdev_queue::_xmit_lock") is what was merged in the end.
> >
> > Hmm, that means txq->xmit_lock_owner is not used at all for PREEMPT_RT.
> > It's pointless to even store it. Shall we care?
>
> For PREEMPT_RT the xmit_lock_owner member is only stored and not used
> otherwise. It could be removed as in ifdef-ed away but I do not care
> enough to sprinkle it and the gain is little (prove me wrong). So I am
> happy as-is.

Right. Sorry for bothering then. I missed the merged commit.

--nX

> As for the check itself as we have now, we detect the deadlock before it
> happens and this is nice to have. The alternative (in case of a
> deadlock) would be the deadlock detection in rtmutex code which would
> freeze the thread.
>
> > --nX
> >
> Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-03-18 14:51                                 ` Sebastian Andrzej Siewior
  2026-03-18 14:58                                   ` Daniel Vacek
@ 2026-04-01 16:55                                   ` Daniel Vacek
  2026-04-02  7:03                                     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 26+ messages in thread
From: Daniel Vacek @ 2026-04-01 16:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx

On Wed, 18 Mar 2026 at 15:51, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> On 2026-03-18 15:43:52 [+0100], Daniel Vacek wrote:
> > On Wed, 18 Mar 2026 at 12:18, Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > >
> > > On 2026-03-18 11:30:09 [+0100], Daniel Vacek wrote:
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > @@ -4821,7 +4821,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > > >             /* Other cpus might concurrently change txq->xmit_lock_owner
> > > > >              * to -1 or to their cpu id, but not to our id.
> > > > >              */
> > > > > -           if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> > > > > +           if (rt_mutex_owner(&txq->_xmit_lock.lock) != current) {
> > > >
> > > > Ain't this changing the behavior for !RT case? Previously, if it was the same thread
> > > > which has already locked the queue (and hence the same CPU) evaluating this condition,
> > > > the condition was skipped, which is no longer the case with this change.
> > >
> > > The above was me thinking and does not even compile for !RT. Commit
> > > b824c3e16c190 ("net: Provide a PREEMPT_RT specific check for
> > > netdev_queue::_xmit_lock") is what was merged in the end.

Thinking about it again, wouldn't it be better to have one generic
solution rather than special-casing for PREEMPT_RT vs. !PREEMPT_RT?

--nX

> >
> > Hmm, that means txq->xmit_lock_owner is not used at all for PREEMPT_RT.
> > It's pointless to even store it. Shall we care?
>
> For PREEMPT_RT the xmit_lock_owner member is only stored and not used
> otherwise. It could be removed as in ifdef-ed away but I do not care
> enough to sprinkle it and the gain is little (prove me wrong). So I am
> happy as-is.
>
> As for the check itself as we have now, we detect the deadlock before it
> happens and this is nice to have. The alternative (in case of a
> deadlock) would be the deadlock detection in rtmutex code which would
> freeze the thread.
>
> > --nX
> >
> Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-04-01 16:55                                   ` Daniel Vacek
@ 2026-04-02  7:03                                     ` Sebastian Andrzej Siewior
  2026-04-02  7:50                                       ` Daniel Vacek
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-04-02  7:03 UTC (permalink / raw)
  To: Daniel Vacek
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx

On 2026-04-01 18:55:43 [+0200], Daniel Vacek wrote:
> > > > The above was me thinking and does not even compile for !RT. Commit
> > > > b824c3e16c190 ("net: Provide a PREEMPT_RT specific check for
> > > > netdev_queue::_xmit_lock") is what was merged in the end.
> 
> Thinking about it again, wouldn't it be better to have one generic
> solution rather than special-casing for PREEMPT_RT vs. !PREEMPT_RT?

PREEMPT_RT and !PREEMPT_RT are fundamentally different here. The one is
not preemptible and records the CPU of the lock owner to detect a
recursive deadlock.
The other is preemptible, uses a different locking type/ class which
records the lock owner which can be utilised for this purpose.

A generic thing would be to remove this and rely on lockdep. This could
work if it is only a devel thing and never "I setup something and make a
loop" sort of thing.

> --nX

Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-04-02  7:03                                     ` Sebastian Andrzej Siewior
@ 2026-04-02  7:50                                       ` Daniel Vacek
  2026-04-02  8:31                                         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Vacek @ 2026-04-02  7:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx, Aaron Tomlin

On Thu, 2 Apr 2026 at 09:03, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> On 2026-04-01 18:55:43 [+0200], Daniel Vacek wrote:
> > > > > The above was me thinking and does not even compile for !RT. Commit
> > > > > b824c3e16c190 ("net: Provide a PREEMPT_RT specific check for
> > > > > netdev_queue::_xmit_lock") is what was merged in the end.
> >
> > Thinking about it again, wouldn't it be better to have one generic
> > solution rather than special-casing for PREEMPT_RT vs. !PREEMPT_RT?
>
> PREEMPT_RT and !PREEMPT_RT are fundamentally different here. The one is
> not preemptible and records the CPU of the lock owner to detect a
> recursive deadlock.
> The other is preemptible, uses a different locking type/ class which
> records the lock owner which can be utilised for this purpose.

I understand that (or at least I think I do).
My idea was that the non-preemptible one can record `current` task
instead of the CPU to detect the deadlock. And that would also work
for the preemptible case (it would actually match the lock owner
approach as you did for the PREEMPT_RT case).
One code for both configurations, no special-casing. I'd argue that's
a better result. Am I missing something?

The size of the netdev_queue structure would grow by 8 bytes for !RT
case, but that's not a big deal, IMO. For RT case it would just fill
the hole.

--nX


> A generic thing would be to remove this and rely on lockdep. This could
> work if it is only a devel thing and never "I setup something and make a
> loop" sort of thing.
>
> > --nX
>
> Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-04-02  7:50                                       ` Daniel Vacek
@ 2026-04-02  8:31                                         ` Sebastian Andrzej Siewior
  2026-04-02  9:21                                           ` Daniel Vacek
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-04-02  8:31 UTC (permalink / raw)
  To: Daniel Vacek
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx, Aaron Tomlin

On 2026-04-02 09:50:35 [+0200], Daniel Vacek wrote:
> My idea was that the non-preemptible one can record `current` task
> instead of the CPU to detect the deadlock. And that would also work
> for the preemptible case (it would actually match the lock owner
> approach as you did for the PREEMPT_RT case).
> One code for both configurations, no special-casing. I'd argue that's
> a better result. Am I missing something?
> 
> The size of the netdev_queue structure would grow by 8 bytes for !RT
> case, but that's not a big deal, IMO. For RT case it would just fill
> the hole.

We have xmit_lock_owner as int. If you replace it with task_struct *
then on 64bit the size of the struct netdev_queue will remain unchanged
as it fills the hole before the following long.
Then you could record `current' as the lock owner in both cases. This
should work.
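
The padding argument can be checked with a small userspace sketch. It assumes an LP64 target and a lock whose size is a multiple of the pointer size; the struct and function names are made up for illustration and do not reproduce the real struct netdev_queue layout.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a lock whose size is a multiple of the pointer size (assumption). */
struct model_lock {
	void *words[4];
};

/* Owner recorded as an int, followed by a long, as in the current layout. */
struct model_queue_int {
	struct model_lock lock;
	int owner;
	unsigned long trans_start;
};

/* Owner recorded as a pointer instead; this is the hypothetical replacement. */
struct model_queue_ptr {
	struct model_lock lock;
	void *owner;
	unsigned long trans_start;
};

/* Number of padding bytes between the int owner and the following long. */
static size_t model_hole_after_int_owner(void)
{
	return offsetof(struct model_queue_int, trans_start)
	     - offsetof(struct model_queue_int, owner)
	     - sizeof(int);
}
```

On an LP64 build both structs have the same size, because the pointer occupies the 4 bytes of the int plus the 4-byte hole that preceded the long. If the stand-in lock were only 4 bytes, the int would pack directly behind it and the pointer variant would be larger in this model; whether that matters for the real structure depends on its configuration and alignment, which this sketch does not verify.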

> --nX

Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-04-02  8:31                                         ` Sebastian Andrzej Siewior
@ 2026-04-02  9:21                                           ` Daniel Vacek
  2026-04-02 13:46                                             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Vacek @ 2026-04-02  9:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx, Aaron Tomlin

On Thu, 2 Apr 2026 at 10:31, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> On 2026-04-02 09:50:35 [+0200], Daniel Vacek wrote:
> > My idea was that the non-preemptible one can record `current` task
> > instead of the CPU to detect the deadlock. And that would also work
> > for the preemptible case (it would actually match the lock owner
> > approach as you did for the PREEMPT_RT case).
> > One code for both configurations, no special-casing. I'd argue that's
> > a better result. Am I missing something?
> >
> > The size of the netdev_queue structure would grow by 8 bytes for !RT
> > case, but that's not a big deal, IMO. For RT case it would just fill
> > the hole.
>
> We have xmit_lock_owner as int. If you replace it with task_struct *
> then on 64bit the size of the struct netdev_queue will remain unchanged
> as it fills the hole before the following long.
> Then you could record `current' as the lock owner in both cases. This
> should work.

Well, that's the patch I originally sent then.

https://lore.kernel.org/linux-rt-devel/20260318103009.2120920-1-neelx@suse.com/

--nX

>
> Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-04-02  9:21                                           ` Daniel Vacek
@ 2026-04-02 13:46                                             ` Sebastian Andrzej Siewior
  2026-04-02 13:58                                               ` Daniel Vacek
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-04-02 13:46 UTC (permalink / raw)
  To: Daniel Vacek
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx, Aaron Tomlin

On 2026-04-02 11:21:20 [+0200], Daniel Vacek wrote:
> 
> Well, that's the patch I originally sent then.
> 
> https://lore.kernel.org/linux-rt-devel/20260318103009.2120920-1-neelx@suse.com/

That was an alternative after everything was already done.

> --nX

Sebastian


* Re: "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT
  2026-04-02 13:46                                             ` Sebastian Andrzej Siewior
@ 2026-04-02 13:58                                               ` Daniel Vacek
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel Vacek @ 2026-04-02 13:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: edumazet, kuba, linux-kernel, linux-rt-devel, netdev, spasswolf,
	tglx, Aaron Tomlin

On Thu, 2 Apr 2026 at 15:46, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> On 2026-04-02 11:21:20 [+0200], Daniel Vacek wrote:
> >
> > Well, that's the patch I originally sent then.
> >
> > https://lore.kernel.org/linux-rt-devel/20260318103009.2120920-1-neelx@suse.com/
>
> That was an alternative after everything was already done.

That is true, and I don't mean to complain. I just haven't seen the
applied version on the mailing list, which is why I originally sent
my patch as a reply to this thread.

I'm simply asking whether we should proceed with the discussed
approach. Would that be better in the end?

If you don't object, I can send a rebased version along with a
follow-up cleanup.

--nX

> > --nX
>
> Sebastian
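
For readers following the "discussed approach", here is a minimal userspace sketch of recording the current task as the lock owner and using it to detect a recursive transmit. It is not the patch from this thread: struct task, struct txq and the txq_* helpers are hypothetical names, the "lock" is just the owner field, and the current task is passed explicitly instead of being read from `current`. The point it illustrates is that a task pointer tells two tasks apart even when they run on the same CPU, which a CPU number cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task { int pid; };                 /* stand-in for struct task_struct */

struct txq {
	const struct task *lock_owner;    /* NULL while the queue is unlocked */
};

/* True if `cur` already holds the queue lock, i.e. a recursive transmit. */
static bool txq_owned_by(const struct txq *q, const struct task *cur)
{
	return q->lock_owner == cur;
}

static void txq_lock(struct txq *q, const struct task *cur)
{
	assert(q->lock_owner == NULL);    /* sketch only: no contention is modelled */
	q->lock_owner = cur;
}

static void txq_unlock(struct txq *q)
{
	q->lock_owner = NULL;
}

/*
 * Transmit on `q` as task `cur`. A depth greater than 0 simulates a device
 * whose transmit path re-enters the same queue. Returns 0 on success and
 * -1 when the recursion is detected.
 */
static int txq_xmit(struct txq *q, const struct task *cur, int depth)
{
	int ret = 0;

	if (txq_owned_by(q, cur))
		return -1;                /* same task owns the lock: dead loop */

	txq_lock(q, cur);
	if (depth > 0)
		ret = txq_xmit(q, cur, depth - 1);
	txq_unlock(q);
	return ret;
}
```

A second task that finds the queue locked by someone else is not reported as a loop here; in real code it would contend for the lock instead.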


end of thread, other threads:[~2026-04-02 13:58 UTC | newest]

Thread overview: 26+ messages
2026-02-16 13:43 "Dead loop on virtual device" error without softirq-BKL on PREEMPT_RT Bert Karwatzki
2026-02-16 15:32 ` Bert Karwatzki
2026-02-16 15:37   ` Sebastian Andrzej Siewior
2026-02-16 23:48     ` Bert Karwatzki
2026-02-17  7:19       ` Sebastian Andrzej Siewior
2026-02-17  8:56         ` Bert Karwatzki
2026-02-17  9:57           ` Sebastian Andrzej Siewior
2026-02-17 10:42             ` Bert Karwatzki
2026-02-17 11:24               ` Bert Karwatzki
2026-02-17 16:52                 ` Bert Karwatzki
2026-02-17 19:10                   ` Bert Karwatzki
2026-02-18  7:30                     ` Sebastian Andrzej Siewior
2026-02-18 12:50                       ` Bert Karwatzki
2026-02-26 17:29                         ` Sebastian Andrzej Siewior
2026-03-18 10:30                           ` Daniel Vacek
2026-03-18 11:18                             ` Sebastian Andrzej Siewior
2026-03-18 14:43                               ` Daniel Vacek
2026-03-18 14:51                                 ` Sebastian Andrzej Siewior
2026-03-18 14:58                                   ` Daniel Vacek
2026-04-01 16:55                                   ` Daniel Vacek
2026-04-02  7:03                                     ` Sebastian Andrzej Siewior
2026-04-02  7:50                                       ` Daniel Vacek
2026-04-02  8:31                                         ` Sebastian Andrzej Siewior
2026-04-02  9:21                                           ` Daniel Vacek
2026-04-02 13:46                                             ` Sebastian Andrzej Siewior
2026-04-02 13:58                                               ` Daniel Vacek