Netdev List
* Re: [patch V2 18/25] timekeeping: Prepare for cross timestamps on arbitrary clock IDs
From: David Woodhouse @ 2026-06-22 12:34 UTC
  To: Thomas Gleixner, LKML
  Cc: Miroslav Lichvar, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, thomas.weissschuh, Arthur Kiyanovski,
	Rodolfo Giometti, Vincent Donnefort, Marc Zyngier, Oliver Upton,
	kvmarm, Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, Vadim Fedorenko
In-Reply-To: <87se6eltod.ffs@fw13>

On Mon, 2026-06-22 at 13:07 +0200, Thomas Gleixner wrote:
> On Mon, Jun 22 2026 at 09:55, David Woodhouse wrote:
> > We ended up with ktime_get_snapshot_id() also supporting CLOCK_BOOTTIME
> > and CLOCK_MONOTONIC_RAW, but not get_device_system_crosststamp().
> > Should we make that consistent?
> 
> Maybe. The BOOTTIME support is only there for that ARM64 hyper trace muck,
> but has no other relevance.
> 
> MONORAW is there for the PTP EXTENDED IOCTL, but with PRECISE the
> snapshot already contains the raw value and you'd have to prevent the
> historical adjustment part for RAW. So I don't see the actual value, but
> I don't have a strong opinion either.

Yeah, I'm not sure I see the need for it; it's just the consistency
thing that slightly bothered me once I had them both in my sights doing
the snapshot_ntp_error() thing in both.

* Re: [Kernel Bug] INFO: task hung in xt_find_table
From: Longxing Li @ 2026-06-22 12:33 UTC
  To: Jiayuan Chen
  Cc: Pablo Neira Ayuso, syzkaller, edumazet, kuba, pabeni, horms,
	netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <d26c8934-6d4c-4171-9e6f-f58a249dd9ff@linux.dev>

Hi Jiayuan,
Thanks for explaining the situation. I will double check this problem.

Best regards,
Longxing Li

Jiayuan Chen <jiayuan.chen@linux.dev> wrote on Wed, Jun 10, 2026 at 17:26:
>
>
> On 6/10/26 3:14 PM, Longxing Li wrote:
> > Sorry for not including the report as plain text in my last email. The
> > report is as follows:
> >
> > INFO: task syz-executor.4:42949 blocked for more than 143 seconds.
> >        Not tainted 7.0.6 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:syz-executor.4  state:D stack:26456 pid:42949 tgid:42937
> > ppid:9759   task_flags:0x400140 flags:0x00080002
> > Call Trace:
> >   <TASK>
> >   context_switch kernel/sched/core.c:5298 [inline]
> >   __schedule+0x1006/0x5f00 kernel/sched/core.c:6911
> >   __schedule_loop kernel/sched/core.c:6993 [inline]
> >   schedule+0xe7/0x3a0 kernel/sched/core.c:7008
> >   schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7065
> >   __mutex_lock_common kernel/locking/mutex.c:692 [inline]
> >   __mutex_lock+0xd9e/0x1df0 kernel/locking/mutex.c:776
> >   xt_find_table+0x59/0x1a0 net/netfilter/x_tables.c:1245
> >   ip6t_unregister_table_exit+0x22/0x50 net/ipv6/netfilter/ip6_tables.c:1808
> >   ops_exit_list net/core/net_namespace.c:199 [inline]
> >   ops_undo_list+0x2dd/0xa50 net/core/net_namespace.c:252
> >   setup_net+0x1f3/0x3a0 net/core/net_namespace.c:462
> >   copy_net_ns+0x351/0x7c0 net/core/net_namespace.c:579
> >   create_new_namespaces+0x3f6/0xac0 kernel/nsproxy.c:130
> >   copy_namespaces+0x45c/0x580 kernel/nsproxy.c:195
> >   copy_process+0x30cc/0x76d0 kernel/fork.c:2227
> >   kernel_clone+0xea/0x8f0 kernel/fork.c:2655
> >   __do_sys_clone+0xce/0x120 kernel/fork.c:2796
> >   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> >   do_syscall_64+0x11b/0xf80 arch/x86/entry/syscall_64.c:94
> >   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x471ecd
> > RSP: 002b:00007f51f163e008 EFLAGS: 00000202 ORIG_RAX: 0000000000000038
> > RAX: ffffffffffffffda RBX: 000000000059bf80 RCX: 0000000000471ecd
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040080020
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000202 R12: 000000000059bf8c
> > R13: 000000000000000b R14: 000000000059bf80 R15: 00007f51f161e000
> >   </TASK>
>
>
>
> This is not a deadlock — there's no lock cycle.
>
> The runner is simply under heavy pressure on all three axes: CPU (zswap
> compression) + memory (direct reclaim) + IO (swap).
>
> The hung task is just a victim. The actual holder is another task that
> took the mutex and then fell into direct reclaim.
>
> Likely stack of the holder:
> get_entries
>    xt_find_table_lock
>    copy_entries_to_user
>      alloc_counters
>         vzalloc  -> direct reclaim
>
> "INFO: task hung" reports of this kind are common on the official
> syzkaller dashboard https://syzkaller.appspot.com/upstream/

* Re: [PATCH net v2 2/2] net: airoha: fix netif_set_real_num_tx_queues for sparse QoS channels
From: Simon Horman @ 2026-06-22 12:31 UTC
  To: Lorenzo Bianconi
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Wayen Yan, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260619-airoha-qos-fixes-v2-2-5c43485038f9@kernel.org>

On Fri, Jun 19, 2026 at 01:37:14PM +0200, Lorenzo Bianconi wrote:
> airoha_tc_htb_alloc_leaf_queue() assigns queue IDs based on the channel
> index (opt->qid = AIROHA_NUM_TX_RING + channel), but updates
> real_num_tx_queues with a simple increment (num_tx_queues + 1). When QoS
> channels are allocated sparsely (e.g., channels 0 and 3 without 1 and
> 2), the returned qid can exceed real_num_tx_queues, causing out-of-bounds
> accesses in the networking stack.
> For example, allocating channel 0 then channel 3 results in
> real_num_tx_queues = 34 but qid = 35, which is out of range [0, 34).
> Fix this by computing real_num_tx_queues based on the highest active
> channel index rather than using a simple counter, in both the allocation
> and deletion paths.
> 
> Fixes: ef1ca9271313b ("net: airoha: Add sched HTB offload support")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

Thanks for the update since v1.

Reviewed-by: Simon Horman <horms@kernel.org>

FTR, there is an AI-generated review of this patch on sashiko.dev.
I do not think that should impede the progress of this patch but
you may want to consider it in the context of follow-up.
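
For illustration only, the mismatch described in the quoted commit
message can be reproduced in a small standalone C sketch. This is not
the driver code: NUM_TX_RING = 32 is inferred from the example (channel 3
maps to qid 35), while NUM_QOS_CHANNELS and the function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_TX_RING      32 /* inferred from the example: qid 35 for channel 3 */
#define NUM_QOS_CHANNELS 8  /* hypothetical value, for illustration only */

/* Queue ID handed back for a QoS channel: base ring count plus channel index. */
static int channel_to_qid(int channel)
{
	assert(channel >= 0 && channel < NUM_QOS_CHANNELS);
	return NUM_TX_RING + channel;
}

/* Counter-based sizing: one extra queue per active channel. */
static int real_num_by_count(const bool active[NUM_QOS_CHANNELS])
{
	int n = NUM_TX_RING;

	for (int i = 0; i < NUM_QOS_CHANNELS; i++)
		if (active[i])
			n++;
	return n;
}

/* Index-based sizing: cover every qid up to the highest active channel. */
static int real_num_by_highest(const bool active[NUM_QOS_CHANNELS])
{
	int n = NUM_TX_RING;

	for (int i = 0; i < NUM_QOS_CHANNELS; i++)
		if (active[i])
			n = NUM_TX_RING + i + 1;
	return n;
}
```

With only channels 0 and 3 active, real_num_by_count() yields 34 while
channel 3 maps to qid 35, which is out of range. real_num_by_highest()
yields 36, so qid 35 falls inside [0, 36).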

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Menglong Dong @ 2026-06-22 12:28 UTC
  To: Menglong Dong, Xuan Zhuo
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, virtualization, linux-kernel, eperezma
In-Reply-To: <1782096043.3540094-1-xuanzhuo@linux.alibaba.com>

On 2026/6/22 10:40 Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> On Tue, 16 Jun 2026 19:59:12 +0800, Menglong Dong <menglong8.dong@gmail.com> wrote:
> > For now, XDP_RING_NEED_WAKEUP is not properly supported by virtio-net
> > in the tx path. For example, we call xsk_set_tx_need_wakeup() in
> > virtnet_xsk_xmit(), but we never call xsk_clear_tx_need_wakeup()
> > anywhere, which means the user will call send() for every packet.
> >
> > We call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
> > is empty, as we can't be woken up by skb_xmit_done() in this case.
> > Otherwise, we clear the wakeup flag.
> >
> > The race condition in the tx path is taken into account.
> >
> > Fixes: 89f86675cb03 ("virtio_net: xsk: tx: support xmit xsk buffer")
> 
> This is not a bug, so we do not need this.
> And you posted this to net-next.

Okay, I'll remove this tag in V4.

> 
> 
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> > ---
> > v3:
[...]
> > +
> > +	if (need_wakeup && vring_size == sq->vq->num_free)
> > +		xsk_set_tx_need_wakeup(pool);
> 
> You need to comment this.

Ack!

> 
> 
> > +
[...]
> > +
> >  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
> >  		check_sq_full_and_disable(vi, vi->dev, sq);
> 
> 
> After fixing the above comments, you can add:
> 
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

OK! Thanks for the review :)

> 
> Thanks.
> 
> 
> >
> > @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> >  	u64_stats_add(&sq->stats.xdp_tx,  sent);
> >  	u64_stats_update_end(&sq->stats.syncp);
> >
> > -	if (xsk_uses_need_wakeup(pool))
> > -		xsk_set_tx_need_wakeup(pool);
> > -
> >  	return sent;
> >  }
> >
> > --
> > 2.54.0
> >

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Menglong Dong @ 2026-06-22 12:27 UTC
  To: Menglong Dong, Michael S. Tsirkin
  Cc: xuanzhuo, eperezma, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260621182119-mutt-send-email-mst@kernel.org>

On 2026/6/22 06:31 Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 16, 2026 at 07:59:12PM +0800, Menglong Dong wrote:
[...]
> >  
> > +	vring_size = virtqueue_get_vring_size(sq->vq);
> > +	need_wakeup = xsk_uses_need_wakeup(pool);
> > +
> > +	if (need_wakeup && vring_size == sq->vq->num_free)
> > +		xsk_set_tx_need_wakeup(pool);
> > +
> 
> Why are we doing this here?
> Is the check after virtnet_xsk_xmit_batch() not enough?
> I vaguely think it's some kind of race we are closing?
> Pls add a comment to explain.

Hi, Michael. Thanks for your review.

Yeah, it's for a race condition between user space and kernel
space. I added a comment in V2, but it was too confusing, so
I removed it 😢. I'll make it clearer and add it back in V4. The
original comment was:

 * If the sq->vq is empty, and the tx ring is empty, and the user
 * submit an entry to the tx ring after virtnet_xsk_xmit_batch() and
 * before xsk_set_tx_need_wakeup(), we will lose the chance to wake
 * up the tx napi, so we have to set the need_wakeup flag here.

And the logic is like this:

Kernel: tx NAPI is woken up from skb_xmit_done() ->
Kernel: sq->vq and xsk->tx_ring are both empty ->
Kernel: call virtnet_xsk_xmit_batch()

    User: submits an entry to the xsk->tx_ring
    User: checks the wakeup flag
    User: wakeup flag is not set, skips send()

Kernel: call xsk_set_tx_need_wakeup(), because sq->vq is empty

If we don't send more data, the data in the xsk->tx_ring will
never be sent.
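
The flag handling discussed above can be modelled in plain C, as a
rough sketch rather than the actual virtio-net code. struct tx_model
and the helper names are hypothetical; the model assumes the pool uses
need_wakeup and only tracks whether sq->vq is empty (num_free equal to
the vring size).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the tx state; these are not the kernel structures. */
struct tx_model {
	unsigned int vring_size; /* total descriptors in sq->vq */
	unsigned int num_free;   /* descriptors currently free */
	bool need_wakeup;        /* flag user space checks before send() */
};

/* sq->vq is empty when every descriptor is free. */
static bool vq_empty(const struct tx_model *m)
{
	assert(m->num_free <= m->vring_size);
	return m->num_free == m->vring_size;
}

/*
 * Before the batch: if nothing is in flight, no skb_xmit_done() will
 * arrive, so set the flag early to cover entries submitted while the
 * batch is running.
 */
static void before_xmit_batch(struct tx_model *m)
{
	if (vq_empty(m))
		m->need_wakeup = true;
}

/*
 * After the batch: keep the flag only if the vq is still empty;
 * otherwise the completion path is expected to reschedule tx napi.
 */
static void after_xmit_batch(struct tx_model *m)
{
	m->need_wakeup = vq_empty(m);
}

/* User space only calls send() when the flag is set. */
static bool user_calls_send(const struct tx_model *m)
{
	return m->need_wakeup;
}
```

In this model, a user who checks the flag between before_xmit_batch()
and after_xmit_batch() with an empty vq sees it set and calls send(),
which is the window the early check is meant to close.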

> 
> >  	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> >  
> > +	if (need_wakeup) {
> > +		if (vring_size == sq->vq->num_free)
> > +			/* we can't wake up by ourself, and it should be done
> > +			 * by the user.
> > +			 */
> > +			xsk_set_tx_need_wakeup(pool);
> > +		else
> > +			/* we can wake up from skb_xmit_done() */
> > +			xsk_clear_tx_need_wakeup(pool);
> 
> But what if we don't have get tx napi so no wakeup in skb_xmit_done?

Sorry, I'm not sure what "get tx napi" means here ;(

There are entries in sq->vq, so skb_xmit_done() will be called after
the entries in the ring are consumed by the host, right?
Then the corresponding sq->napi will be scheduled, as we ensure
that tx napi is always enabled (which means napi->weight is not
zero) in this commit:
1df5116a41a8 ("virtio_net: xsk: prevent disable tx napi")

Right?

Thanks!
Menglong Dong

> 
> 
> > +	}
> > +
> >  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
> >  		check_sq_full_and_disable(vi, vi->dev, sq);
> >  
> > @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> >  	u64_stats_add(&sq->stats.xdp_tx,  sent);
> >  	u64_stats_update_end(&sq->stats.syncp);
> >  
> > -	if (xsk_uses_need_wakeup(pool))
> > -		xsk_set_tx_need_wakeup(pool);
> > -
> >  	return sent;
> >  }
> >  
> > -- 
> > 2.54.0

* Re: [REGRESSION 6.12.90 -> 6.12.94] vsock/virtio: large AF_VSOCK transfers reset under backpressure
From: Stefano Garzarella @ 2026-06-22 12:22 UTC
  To: Brien Oberstein; +Cc: netdev, regressions, stable
In-Reply-To: <618701dd023e$063de350$12b9a9f0$@gmail.com>

On Mon, Jun 22, 2026 at 07:55:30AM -0400, Brien Oberstein wrote:
>Hi Stefano,
>
>Thanks, that matches what I'm seeing: large transfers reset mid-stream
>instead of the sender being throttled (reliable above ~1.5 MB, fine below
>~90 KB).
>
>The bind for me: it's not just this mail bridge -- I use AF_VSOCK for a few
>host/guest services, some of which open their own sockets, so the per-socket
>buffer workaround can't cover them all. That leaves pinning 6.12.90 (losing
>the DoS fix and further kernel updates) as the only blanket option.

Okay, but in that case did it work?

>
>A few quick questions:
>
>1. Is a -stable backport of the merging fix likely, and roughly when?

We don't have a fix yet.

>2. Could a smaller interim land in -stable sooner (e.g. more default
>   headroom) without reopening the DoS?

What we've merged so far is the best we can do for now, but anyone who 
wants to help improve the situation is welcome to submit patches.

>3. Will the fix guarantee backpressure for any packet size, or just widen
>   the margin?

It should fix STREAM sockets for any packet size.
SEQPACKET/DGRAM is a bit different since we need to keep boundaries, so 
it will come later if needed.

>
>Happy to test any patch

Thanks, I'll ask you to test.

>I have a solid reproducer and can turn it around
>in a day. I'll also file this as a tracked regression so it's not lost.

Unfortunately, it's always been partially broken, using more memory than 
specified, so I don't know if this is actually a full regression, but I 
understand.

Thanks,
Stefano

* Re: [PATCH net v2 7/7] ipv6: reset position for force_forwarding sysctl restart
From: Fernando Fernandez Mancera @ 2026-06-22 12:19 UTC
  To: Ido Schimmel
  Cc: netdev, nicolas.dichtel, stephen, brian.haley, horms, pabeni,
	kuba, edumazet, davem, dsahern
In-Reply-To: <20260622114223.GA233619@shredder>

On 6/22/26 1:42 PM, Ido Schimmel wrote:
> On Sat, Jun 20, 2026 at 06:18:50PM +0200, Fernando Fernandez Mancera wrote:
>> When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
> 
> s/proxy_ndp/force_forwarding/
> 
>> retried but the position pointer was already advanced meaning that the
>> restarted sysctl will read from an incorrect offset.
>>
>> Fix this by restoring the original position pointer before restarting
>> the syscall.
>>
>> In addition, remove the redundant position pointer restoration at the
>> end of the function.
>>
>> Fixes: f24987ef6959 ("ipv6: add `force_forwarding` sysctl to enable per-interface forwarding")
>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>> ---
>>   net/ipv6/addrconf.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>> index cbe681de3818..8c0741e9dfcc 100644
>> --- a/net/ipv6/addrconf.c
>> +++ b/net/ipv6/addrconf.c
>> @@ -6825,8 +6825,10 @@ static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int wri
>>   	ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos);
>>   
>>   	if (write && old_val != new_val) {
>> -		if (!rtnl_net_trylock(net))
>> +		if (!rtnl_net_trylock(net)) {
>> +			*ppos = pos;
>>   			return restart_syscall();
>> +		}
> 
> Are you sure that this is needed?
> 
> AFAICT, the position pointer is only advanced if the return value is
> positive. From new_sync_write():
> 
> kiocb.ki_pos = (ppos ? *ppos : 0);
> [...]
> ret = filp->f_op->write_iter(&kiocb, &iter);
> [...]
> if (ret > 0 && ppos)
>          *ppos = kiocb.ki_pos;
> 
> And restart_syscall() returns '-ERESTARTNOINTR'.
> 

Hm, I think you are right. I was not aware of this check, thanks for 
pointing it out. That means we can get rid of the position pointer reset 
in the rest of the code; there are plenty of sysctls following this 
pattern. I will prepare a batch for net-next.
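
As a rough userspace sketch of the check Ido quotes from
new_sync_write() (not kernel code): the handler works on a local copy of
the position, and the caller's position is only updated when the return
value is positive. sim_sync_write() and the two handlers are
hypothetical names; SIM_ERESTARTNOINTR stands in for the kernel's
ERESTARTNOINTR.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_ERESTARTNOINTR 513 /* stand-in for the kernel-internal errno */

/* Hypothetical handler type: may advance *ki_pos, returns bytes or -errno. */
typedef long (*write_fn)(long long *ki_pos, size_t len);

/* Mirrors the quoted logic: copy the position in, copy it back only on ret > 0. */
static long sim_sync_write(write_fn fn, size_t len, long long *ppos)
{
	long long ki_pos = ppos ? *ppos : 0;
	long ret;

	assert(fn != NULL);
	ret = fn(&ki_pos, len);
	if (ret > 0 && ppos)
		*ppos = ki_pos;
	return ret;
}

/* Handler that consumes the input and succeeds. */
static long handler_ok(long long *ki_pos, size_t len)
{
	*ki_pos += (long long)len;
	return (long)len;
}

/* Handler that advances the position but then asks for a restart. */
static long handler_restart(long long *ki_pos, size_t len)
{
	*ki_pos += (long long)len;
	return -SIM_ERESTARTNOINTR;
}
```

In this model, handler_restart() leaves the caller's position untouched
even though it advanced its local copy, which is why an explicit reset
before restart_syscall() would be redundant.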

I am sending a v3 dropping this patch.

Thank you Ido!

>>   
>>   		WRITE_ONCE(*valp, new_val);
>>   
>> @@ -6851,8 +6853,6 @@ static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int wri
>>   		rtnl_net_unlock(net);
>>   	}
>>   
>> -	if (ret)
>> -		*ppos = pos;
>>   	return ret;
>>   }
>>   
>> -- 
>> 2.54.0
>>

* [syzbot] [wireless?] KASAN: slab-use-after-free Read in ath9k_hif_request_firmware (2)
From: syzbot @ 2026-06-22 12:15 UTC
  To: linux-kernel, linux-wireless, netdev, syzkaller-bugs, toke

Hello,

syzbot found the following issue on:

HEAD commit:    1a3746ccbb0a Merge tag 'strncpy-removal-v7.2-rc1' of git:/..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=153b07f2580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=26c7945305cfa3b1
dashboard link: https://syzkaller.appspot.com/bug?extid=cb7ed9d85261445a0201
compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/634e430ffbca/disk-1a3746cc.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/b11553afbbe2/vmlinux-1a3746cc.xz
kernel image: https://storage.googleapis.com/syzbot-assets/1fa9342aa2a9/bzImage-1a3746cc.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+cb7ed9d85261445a0201@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: slab-use-after-free in ath9k_hif_request_firmware+0x416/0x450 drivers/net/wireless/ath/ath9k/hif_usb.c:1219
Read of size 8 at addr ffff888053c45000 by task kworker/1:8/11284

CPU: 1 UID: 0 PID: 11284 Comm: kworker/1:8 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Workqueue: events request_firmware_work_func
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0x13d/0x4b0 mm/kasan/report.c:482
 kasan_report+0xdf/0x1c0 mm/kasan/report.c:595
 ath9k_hif_request_firmware+0x416/0x450 drivers/net/wireless/ath/ath9k/hif_usb.c:1219
 ath9k_hif_usb_firmware_cb+0x3f9/0x530 drivers/net/wireless/ath/ath9k/hif_usb.c:1237
 request_firmware_work_func+0x13f/0x440 drivers/base/firmware_loader/main.c:1164
 process_one_work+0xa23/0x1940 kernel/workqueue.c:3322
 process_scheduled_works kernel/workqueue.c:3405 [inline]
 worker_thread+0x5ef/0xe50 kernel/workqueue.c:3486
 kthread+0x370/0x450 kernel/kthread.c:436
 ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Allocated by task 11281:
 kasan_save_stack+0x30/0x50 mm/kasan/common.c:57
 kasan_save_track+0x14/0x30 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:415
 _kmalloc_noprof include/linux/slab.h:969 [inline]
 _kzalloc_noprof include/linux/slab.h:1286 [inline]
 ath9k_hif_usb_probe+0x30e/0x830 drivers/net/wireless/ath/ath9k/hif_usb.c:1369
 usb_probe_interface+0x303/0x8f0 drivers/usb/core/driver.c:396
 call_driver_probe drivers/base/dd.c:628 [inline]
 really_probe+0x241/0xa60 drivers/base/dd.c:706
 __driver_probe_device+0x20e/0x450 drivers/base/dd.c:868
 driver_probe_device+0x4a/0x140 drivers/base/dd.c:898
 __device_attach_driver+0x1df/0x320 drivers/base/dd.c:1026
 bus_for_each_drv+0x159/0x1e0 drivers/base/bus.c:500
 __device_attach+0x1e4/0x4d0 drivers/base/dd.c:1098
 device_initial_probe+0xaf/0xd0 drivers/base/dd.c:1153
 bus_probe_device+0x64/0x160 drivers/base/bus.c:620
 device_add+0x121d/0x1970 drivers/base/core.c:3772
 usb_set_configuration+0xd97/0x1c60 drivers/usb/core/message.c:2268
 usb_generic_driver_probe+0xa1/0xe0 drivers/usb/core/generic.c:250
 usb_probe_device+0xef/0x400 drivers/usb/core/driver.c:291
 call_driver_probe drivers/base/dd.c:628 [inline]
 really_probe+0x241/0xa60 drivers/base/dd.c:706
 __driver_probe_device+0x20e/0x450 drivers/base/dd.c:868
 driver_probe_device+0x4a/0x140 drivers/base/dd.c:898
 __device_attach_driver+0x1df/0x320 drivers/base/dd.c:1026
 bus_for_each_drv+0x159/0x1e0 drivers/base/bus.c:500
 __device_attach+0x1e4/0x4d0 drivers/base/dd.c:1098
 device_initial_probe+0xaf/0xd0 drivers/base/dd.c:1153
 bus_probe_device+0x64/0x160 drivers/base/bus.c:620
 device_add+0x121d/0x1970 drivers/base/core.c:3772
 usb_new_device.cold+0x685/0x115c drivers/usb/core/hub.c:2695
 hub_port_connect drivers/usb/core/hub.c:5567 [inline]
 hub_port_connect_change drivers/usb/core/hub.c:5707 [inline]
 port_event drivers/usb/core/hub.c:5871 [inline]
 hub_event+0x314d/0x4af0 drivers/usb/core/hub.c:5953
 process_one_work+0xa23/0x1940 kernel/workqueue.c:3322
 process_scheduled_works kernel/workqueue.c:3405 [inline]
 worker_thread+0x5ef/0xe50 kernel/workqueue.c:3486
 kthread+0x370/0x450 kernel/kthread.c:436
 ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Freed by task 5704:
 kasan_save_stack+0x30/0x50 mm/kasan/common.c:57
 kasan_save_track+0x14/0x30 mm/kasan/common.c:78
 kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x5f/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2700 [inline]
 slab_free mm/slub.c:6310 [inline]
 kfree+0x22b/0x6c0 mm/slub.c:6625
 ath9k_hif_usb_disconnect+0x207/0x3c0 drivers/net/wireless/ath/ath9k/hif_usb.c:1439
 usb_unbind_interface+0x1dd/0x9e0 drivers/usb/core/driver.c:458
 device_remove drivers/base/dd.c:618 [inline]
 device_remove+0x12a/0x180 drivers/base/dd.c:610
 __device_release_driver drivers/base/dd.c:1349 [inline]
 device_release_driver_internal+0x44e/0x620 drivers/base/dd.c:1372
 bus_remove_device+0x2bc/0x560 drivers/base/bus.c:664
 device_del+0x376/0x9b0 drivers/base/core.c:3961
 usb_disable_device+0x367/0x810 drivers/usb/core/message.c:1478
 usb_disconnect+0x2e2/0x9a0 drivers/usb/core/hub.c:2345
 hub_port_connect drivers/usb/core/hub.c:5407 [inline]
 hub_port_connect_change drivers/usb/core/hub.c:5707 [inline]
 port_event drivers/usb/core/hub.c:5871 [inline]
 hub_event+0x1d0c/0x4af0 drivers/usb/core/hub.c:5953
 process_one_work+0xa23/0x1940 kernel/workqueue.c:3322
 process_scheduled_works kernel/workqueue.c:3405 [inline]
 worker_thread+0x5ef/0xe50 kernel/workqueue.c:3486
 kthread+0x370/0x450 kernel/kthread.c:436
 ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

The buggy address belongs to the object at ffff888053c45000
 which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 0 bytes inside of
 freed 2048-byte region [ffff888053c45000, ffff888053c45800)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x53c40
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88813fe40000 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
head: 00fff00000000040 ffff88813fe40000 dead000000000100 dead000000000122
head: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
head: 00fff00000000003 fffffffffffffe01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4965, tgid 4965 (klogd), ts 316220316300, free_ts 316211784975
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0xfd/0x120 mm/page_alloc.c:1859
 prep_new_page mm/page_alloc.c:1867 [inline]
 get_page_from_freelist+0xf48/0x3530 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x299/0x2dc0 mm/page_alloc.c:5304
 alloc_slab_page mm/slub.c:3289 [inline]
 allocate_slab mm/slub.c:3404 [inline]
 new_slab+0xa2/0x670 mm/slub.c:3447
 refill_objects+0xe3/0x430 mm/slub.c:7241
 refill_sheaf mm/slub.c:2827 [inline]
 __pcs_replace_empty_main+0x375/0x660 mm/slub.c:4692
 alloc_from_pcs mm/slub.c:4790 [inline]
 slab_alloc_node mm/slub.c:4924 [inline]
 __kmalloc_cache_noprof+0x48d/0x6e0 mm/slub.c:5446
 _kmalloc_noprof include/linux/slab.h:969 [inline]
 syslog_print+0xf8/0x620 kernel/printk/printk.c:1585
 do_syslog+0x5bd/0x6d0 kernel/printk/printk.c:1763
 __do_sys_syslog kernel/printk/printk.c:1855 [inline]
 __se_sys_syslog kernel/printk/printk.c:1853 [inline]
 __x64_sys_syslog+0x74/0xb0 kernel/printk/printk.c:1853
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x115/0x870 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 11284 tgid 11284 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1406 [inline]
 free_pages_prepare+0x586/0xd80 mm/page_alloc.c:1451
 __free_contig_range_common+0x14f/0x250 mm/page_alloc.c:6895
 __free_contig_range mm/page_alloc.c:6940 [inline]
 free_pages_bulk+0x12a/0x200 mm/page_alloc.c:5257
 vm_area_free_pages+0xad/0x2b0 mm/vmalloc.c:3439
 vfree mm/vmalloc.c:3488 [inline]
 vfree+0x107/0x750 mm/vmalloc.c:3462
 delayed_vfree_work+0x56/0x80 mm/vmalloc.c:3392
 process_one_work+0xa23/0x1940 kernel/workqueue.c:3322
 process_scheduled_works kernel/workqueue.c:3405 [inline]
 worker_thread+0x5ef/0xe50 kernel/workqueue.c:3486
 kthread+0x370/0x450 kernel/kthread.c:436
 ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Memory state around the buggy address:
 ffff888053c44f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff888053c44f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888053c45000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff888053c45080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888053c45100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

* [syzbot] [wireless?] divide error in mac80211_hwsim_write_tsf
From: syzbot @ 2026-06-22 12:15 UTC
  To: johannes, linux-kernel, linux-wireless, netdev, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    83f1454877cc Merge tag 'ext4_for_linus-7.2-rc1' of git://g..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17956aae580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8deb4438448ed47a
dashboard link: https://syzkaller.appspot.com/bug?extid=21629c14aa749636db9d
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/d900f083ada3/non_bootable_disk-83f14548.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/06b66919e887/vmlinux-83f14548.xz
kernel image: https://storage.googleapis.com/syzbot-assets/3dedd791b7cd/bzImage-83f14548.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+21629c14aa749636db9d@syzkaller.appspotmail.com

Oops: divide error: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:mac80211_hwsim_write_tsf+0x3a3/0x590 drivers/net/wireless/virtual/mac80211_hwsim_main.c:1628
Code: 81 c4 e8 49 00 00 4c 89 e0 48 c1 e8 03 42 80 3c 30 00 74 08 4c 89 e7 e8 1b bb 22 fb 48 8b 34 24 41 03 34 24 66 b8 20 03 31 d2 <66> f7 f5 0f b7 d8 4d 8d 65 0a 49 83 c5 0d 4c 89 e0 48 c1 e8 03 42
RSP: 0018:ffffc900037aedf0 EFLAGS: 00010246
RAX: 1ffff110080a0320 RBX: 000000000000001c RCX: 0000000000100000
RDX: 0000000000000000 RSI: 0000000005e6b00c RDI: 0000000000000230
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff520006f5dac R12: ffff888040547c08
R13: ffff88803d7fadda R14: dffffc0000000000 R15: 0000000000000020
FS:  00007f2f6aff66c0(0000) GS:ffff88808c852000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000002280 CR3: 0000000013282000 CR4: 0000000000352ef0
Call Trace:
 <TASK>
 mac80211_hwsim_tx_frame_no_nl+0x16b/0x1760 drivers/net/wireless/virtual/mac80211_hwsim_main.c:1902
 mac80211_hwsim_tx+0x1784/0x2500 drivers/net/wireless/virtual/mac80211_hwsim_main.c:2261
 drv_tx net/mac80211/driver-ops.h:38 [inline]
 ieee80211_tx_frags+0x3df/0x890 net/mac80211/tx.c:1746
 __ieee80211_tx+0x267/0x580 net/mac80211/tx.c:1801
 ieee80211_tx+0x312/0x4b0 net/mac80211/tx.c:1984
 ieee80211_monitor_start_xmit+0xb33/0x1280 net/mac80211/tx.c:2479
 __netdev_start_xmit include/linux/netdevice.h:5387 [inline]
 netdev_start_xmit include/linux/netdevice.h:5396 [inline]
 xmit_one net/core/dev.c:3889 [inline]
 dev_hard_start_xmit+0x2cd/0x830 net/core/dev.c:3905
 __dev_queue_xmit+0x1435/0x37f0 net/core/dev.c:4872
 packet_snd net/packet/af_packet.c:3082 [inline]
 packet_sendmsg+0x3d95/0x5040 net/packet/af_packet.c:3114
 sock_sendmsg_nosec net/socket.c:775 [inline]
 __sock_sendmsg net/socket.c:790 [inline]
 __sys_sendto+0x626/0x6c0 net/socket.c:2252
 __do_sys_sendto net/socket.c:2259 [inline]
 __se_sys_sendto net/socket.c:2255 [inline]
 __x64_sys_sendto+0xde/0x100 net/socket.c:2255
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f2f6a19ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f2f6aff5fe8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f2f6a415fa0 RCX: 00007f2f6a19ce59
RDX: 0000000000000026 RSI: 0000200000000640 RDI: 0000000000000007
RBP: 00007f2f6a232e6f R08: 0000200000000380 R09: 0000000000000014
R10: 0000000004000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f2f6a416038 R14: 00007f2f6a415fa0 R15: 00007ffff9cddab8
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:mac80211_hwsim_write_tsf+0x3a3/0x590 drivers/net/wireless/virtual/mac80211_hwsim_main.c:1628
Code: 81 c4 e8 49 00 00 4c 89 e0 48 c1 e8 03 42 80 3c 30 00 74 08 4c 89 e7 e8 1b bb 22 fb 48 8b 34 24 41 03 34 24 66 b8 20 03 31 d2 <66> f7 f5 0f b7 d8 4d 8d 65 0a 49 83 c5 0d 4c 89 e0 48 c1 e8 03 42
RSP: 0018:ffffc900037aedf0 EFLAGS: 00010246
RAX: 1ffff110080a0320 RBX: 000000000000001c RCX: 0000000000100000
RDX: 0000000000000000 RSI: 0000000005e6b00c RDI: 0000000000000230
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff520006f5dac R12: ffff888040547c08
R13: ffff88803d7fadda R14: dffffc0000000000 R15: 0000000000000020
FS:  00007f2f6aff66c0(0000) GS:ffff88808c852000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000002280 CR3: 0000000013282000 CR4: 0000000000352ef0
----------------
Code disassembly (best guess):
   0:	81 c4 e8 49 00 00    	add    $0x49e8,%esp
   6:	4c 89 e0             	mov    %r12,%rax
   9:	48 c1 e8 03          	shr    $0x3,%rax
   d:	42 80 3c 30 00       	cmpb   $0x0,(%rax,%r14,1)
  12:	74 08                	je     0x1c
  14:	4c 89 e7             	mov    %r12,%rdi
  17:	e8 1b bb 22 fb       	call   0xfb22bb37
  1c:	48 8b 34 24          	mov    (%rsp),%rsi
  20:	41 03 34 24          	add    (%r12),%esi
  24:	66 b8 20 03          	mov    $0x320,%ax
  28:	31 d2                	xor    %edx,%edx
* 2a:	66 f7 f5             	div    %bp <-- trapping instruction
  2d:	0f b7 d8             	movzwl %ax,%ebx
  30:	4d 8d 65 0a          	lea    0xa(%r13),%r12
  34:	49 83 c5 0d          	add    $0xd,%r13
  38:	4c 89 e0             	mov    %r12,%rax
  3b:	48 c1 e8 03          	shr    $0x3,%rax
  3f:	42                   	rex.X


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH net v2] net/smc: fix out-of-bounds read when sk_user_data holds a sk_psock
From: Jiayuan Chen @ 2026-06-22 12:11 UTC (permalink / raw)
  To: Sechang Lim, D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Ursula Braun,
	Karsten Graul, Guvenc Gulce, linux-rdma, linux-s390, netdev,
	linux-kernel, bpf
In-Reply-To: <20260619150342.3626224-1-rhkrqnwk98@gmail.com>


On 6/19/26 11:03 PM, Sechang Lim wrote:
> SMC stores its smc_sock in the clcsock's sk_user_data tagged
> SK_USER_DATA_NOCOPY and reads it back with smc_clcsock_user_data(), which
> only strips that flag. sockmap stores a sk_psock in the same field tagged
> SK_USER_DATA_NOCOPY | SK_USER_DATA_PSOCK. Nothing keeps both off one
> socket, and SMC then casts the sk_psock to an smc_sock.

How about SK_USER_DATA_BPF?
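
To make the concern concrete, here is a minimal userspace sketch of a
tagged pointer whose low bits record the owner. The TAG_* names and their
values are illustrative assumptions that only mirror the flag names used
in this thread, and both helpers are hypothetical, not the SMC or sockmap
code. It shows why stripping a single flag is not enough to recover the
pointer, and how a checked accessor can refuse a value tagged for another
owner:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative tag bits; names mirror the thread, values are assumptions. */
#define TAG_NOCOPY	((uintptr_t)1)
#define TAG_BPF		((uintptr_t)2)
#define TAG_PSOCK	((uintptr_t)4)
#define TAG_MASK	(TAG_NOCOPY | TAG_BPF | TAG_PSOCK)

/* Store tags in the low bits of a pointer that is at least 8-byte aligned. */
uintptr_t tag_ptr(void *p, uintptr_t tags)
{
	return (uintptr_t)p | tags;
}

/* Strips only NOCOPY: any other tag bit stays set in the result. */
void *untag_nocopy_only(uintptr_t v)
{
	return (void *)(v & ~TAG_NOCOPY);
}

/* Returns the pointer only if NOCOPY is the sole tag, NULL otherwise. */
void *untag_checked(uintptr_t v)
{
	if ((v & TAG_MASK) != TAG_NOCOPY)
		return NULL;
	return (void *)(v & ~TAG_MASK);
}
```

With this layout, a value tagged TAG_NOCOPY | TAG_BPF is rejected by
untag_checked() in the same way as one tagged TAG_NOCOPY | TAG_PSOCK.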




* [PATCH bpf-next v8 7/7] selftests/bpf: add bpf_icmp_send recursion test
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy
In-Reply-To: <20260622120515.137082-1-mahe.tardy@gmail.com>

This test is similar to test_icmp_send_unreach_cgroup but covers the
recursion case: when the icmp_send done by the kfunc re-triggers the BPF
program that called it, the nested kfunc call must stop early and return
-EBUSY.

The test attaches to the root cgroup to ensure the ICMP packet generated
by the kfunc re-triggers the BPF program. Since it's attached only for
this recursion test, it should not disrupt the whole network.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
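The order in which the two return values land in the array can be
sketched in plain userspace C. This is only a model: fake_icmp_send(),
prog() and the in_send flag are hypothetical stand-ins (the kfunc detects
recursion by inspecting the socket, not with a flag), and rec_rets and
rec_count mirror the globals added to progs/icmp_send.c.

```c
#include <errno.h>

/* Stand-ins for rec_count and rec_kfunc_rets in progs/icmp_send.c. */
unsigned int rec_count;
int rec_rets[2] = { -1, -1 };

/* Models the recursion guard; an assumption, not the kernel mechanism. */
static int in_send;

void prog(void);

/* Models the kfunc: the packet it generates re-enters the program. */
int fake_icmp_send(void)
{
	if (in_send)
		return -EBUSY;	/* nested call stops early */

	in_send = 1;
	prog();			/* nested run triggered by the ICMP error */
	in_send = 0;
	return 0;
}

/* Models the BPF program: record the result in completion order. */
void prog(void)
{
	int ret = fake_icmp_send();

	rec_rets[rec_count & 1] = ret;
	rec_count++;
}
```

Calling prog() once runs the nested instance to completion first, so it
stores -EBUSY at index 0; the outer instance then stores 0 at index 1.
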
 .../bpf/prog_tests/icmp_send_kfunc.c          | 45 +++++++++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c | 56 +++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index 66447681f72d..fd4b8fa78a01 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -1,8 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
 #include <network_helpers.h>
+#include <cgroup_helpers.h>
 #include <linux/errqueue.h>
 #include <poll.h>
+#include <unistd.h>
 #include "icmp_send.skel.h"

 #define TIMEOUT_MS 1000
@@ -10,6 +12,7 @@
 #define ICMP_DEST_UNREACH 3
 #define ICMPV6_DEST_UNREACH 1

+#define ICMP_HOST_UNREACH 1
 #define ICMP_FRAG_NEEDED 4
 #define NR_ICMP_UNREACH 15
 #define ICMPV6_REJECT_ROUTE 6
@@ -203,3 +206,45 @@ void test_icmp_send_unreach_tc(void)
 	bpf_link__destroy(link);
 	icmp_send__destroy(skel);
 }
+
+void test_icmp_send_unreach_recursion(void)
+{
+	struct icmp_send *skel;
+	int cgroup_fd = -1;
+
+	skel = icmp_send__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	if (setup_cgroup_environment()) {
+		fprintf(stderr, "Failed to setup cgroup environment\n");
+		goto cleanup;
+	}
+
+	cgroup_fd = get_root_cgroup();
+	if (!ASSERT_OK_FD(cgroup_fd, "get_root_cgroup"))
+		goto cleanup;
+
+	skel->data->target_pid = getpid();
+	skel->links.recursion =
+		bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.recursion, "prog_attach_cgroup"))
+		goto cleanup;
+
+	trigger_prog_read_icmp_errqueue(skel, ICMP_HOST_UNREACH, AF_INET,
+					"127.0.0.1");
+
+	/*
+	 * Because there's recursion involved, the outer (first) call returns
+	 * last and stores its result at index 1, while the nested (second)
+	 * call returns first and stores its result at index 0.
+	 */
+	ASSERT_EQ(skel->data->rec_kfunc_rets[0], -EBUSY, "kfunc_rets[0]");
+	ASSERT_EQ(skel->data->rec_kfunc_rets[1], 0, "kfunc_rets[1]");
+
+cleanup:
+	cleanup_cgroup_environment();
+	icmp_send__destroy(skel);
+	if (cgroup_fd >= 0)
+		close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
index 5fa5467bdb70..fd9c7684797b 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -13,6 +13,10 @@ __u16 server_port = 0;
 int unreach_type = 0;
 int unreach_code = 0;
 int kfunc_ret = -1;
+int target_pid = -1;
+
+unsigned int rec_count = 0;
+int rec_kfunc_rets[] = { -1, -1 };

 SEC("cgroup_skb/egress")
 int egress(struct __sk_buff *skb)
@@ -125,4 +129,56 @@ int tc_egress(struct __sk_buff *skb)
 	return TCX_DROP;
 }

+SEC("cgroup_skb/egress")
+int recursion(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct icmphdr *icmph;
+	struct tcphdr *tcph;
+	struct iphdr *iph;
+	int ret;
+
+	if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+		return SK_PASS;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4)
+		return SK_PASS;
+
+	if (iph->daddr != bpf_htonl(SERVER_IP))
+		return SK_PASS;
+
+	if (iph->protocol == IPPROTO_TCP) {
+		tcph = (void *)iph + iph->ihl * 4;
+		if ((void *)(tcph + 1) > data_end ||
+		    tcph->dest != bpf_htons(server_port))
+			return SK_PASS;
+	} else if (iph->protocol == IPPROTO_ICMP) {
+		icmph = (void *)iph + iph->ihl * 4;
+		if ((void *)(icmph + 1) > data_end ||
+		    icmph->type != unreach_type ||
+		    icmph->code != unreach_code)
+			return SK_PASS;
+	} else {
+		return SK_PASS;
+	}
+
+	/*
+	 * This call will provoke a recursion: the ICMP packet generated by the
+	 * kfunc will re-trigger this program since we are in the root cgroup in
+	 * which the kernel ICMP socket belongs. However when re-entering the
+	 * kfunc, it should return EBUSY.
+	 */
+	ret = bpf_icmp_send(skb, unreach_type, unreach_code);
+	rec_kfunc_rets[rec_count & 1] = ret;
+	__sync_fetch_and_add(&rec_count, 1);
+
+	/* Let the first ICMP error message pass */
+	if (iph->protocol == IPPROTO_ICMP)
+		return SK_PASS;
+
+	return SK_DROP;
+}
+
 char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1



* [PATCH bpf-next v8 6/7] selftests/bpf: add bpf_icmp_send kfunc tc tests
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy
In-Reply-To: <20260622120515.137082-1-mahe.tardy@gmail.com>

This test is similar to the one with cgroup_skb programs but uses tc
egress instead.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
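The main parsing difference from the cgroup_skb program is that the tc
program starts at the Ethernet header rather than at the IP header. A
minimal userspace sketch of that extra step follows; l3_from_eth() is a
hypothetical helper working on a raw byte buffer, it assumes an untagged
Ethernet II frame, and it ignores VLAN tags.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN	14	/* dst MAC (6) + src MAC (6) + EtherType (2) */

/*
 * Return 4 or 6 for an IPv4 or IPv6 frame and point *l3 at the network
 * header; return 0 (leaving *l3 untouched) for anything else.
 */
int l3_from_eth(const uint8_t *data, const uint8_t *data_end,
		const uint8_t **l3)
{
	uint16_t proto;

	/* Bounds check before touching the EtherType field. */
	if (data_end - data < ETH_HDR_LEN)
		return 0;

	/* EtherType is big-endian on the wire. */
	proto = (uint16_t)((data[12] << 8) | data[13]);
	if (proto != 0x0800 && proto != 0x86DD)
		return 0;

	*l3 = data + ETH_HDR_LEN;
	return proto == 0x0800 ? 4 : 6;
}
```
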
 .../bpf/prog_tests/icmp_send_kfunc.c          | 25 ++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c | 60 +++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index a5ac1a6ea77a..66447681f72d 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -178,3 +178,28 @@ void test_icmp_send_unreach_cgroup(void)
 	if (cgroup_fd >= 0)
 		close(cgroup_fd);
 }
+
+void test_icmp_send_unreach_tc(void)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, opts);
+	struct icmp_send *skel;
+	struct bpf_link *link = NULL;
+
+	skel = icmp_send__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	link = bpf_program__attach_tcx(skel->progs.tc_egress, 1, &opts);
+	if (!ASSERT_OK_PTR(link, "prog_attach"))
+		goto cleanup;
+
+	if (test__start_subtest("ipv4"))
+		run_icmp_test(skel, AF_INET, "127.0.0.1", NR_ICMP_UNREACH);
+
+	if (test__start_subtest("ipv6"))
+		run_icmp_test(skel, AF_INET6, "::1", ICMPV6_REJECT_ROUTE);
+
+cleanup:
+	bpf_link__destroy(link);
+	icmp_send__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
index 6e1ba539eeb0..5fa5467bdb70 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -2,6 +2,7 @@
 #include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include "bpf_tracing_net.h"

 /* 127.0.0.1 in host byte order */
 #define SERVER_IP 0x7F000001
@@ -65,4 +66,63 @@ int egress(struct __sk_buff *skb)
 	return SK_DROP;
 }

+SEC("tc/egress")
+int tc_egress(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	struct ipv6hdr *ip6h;
+	struct tcphdr *tcph;
+
+	eth = data;
+	if ((void *)(eth + 1) > data_end)
+		return TCX_PASS;
+
+	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+		iph = (void *)(eth + 1);
+		if ((void *)(iph + 1) > data_end)
+			return TCX_PASS;
+
+		if (iph->protocol != IPPROTO_TCP ||
+		    iph->daddr != bpf_htonl(SERVER_IP))
+			return TCX_PASS;
+
+		tcph = (void *)iph + iph->ihl * 4;
+		if ((void *)(tcph + 1) > data_end)
+			return TCX_PASS;
+
+		if (tcph->dest != bpf_htons(server_port))
+			return TCX_PASS;
+
+	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+		ip6h = (void *)(eth + 1);
+		if ((void *)(ip6h + 1) > data_end)
+			return TCX_PASS;
+
+		if (ip6h->nexthdr != IPPROTO_TCP)
+			return TCX_PASS;
+
+		if (ip6h->daddr.in6_u.u6_addr32[0] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[1] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[2] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[3] != bpf_htonl(SERVER_IP6_LO))
+			return TCX_PASS;
+
+		tcph = (void *)(ip6h + 1);
+		if ((void *)(tcph + 1) > data_end)
+			return TCX_PASS;
+
+		if (tcph->dest != bpf_htons(server_port))
+			return TCX_PASS;
+	} else {
+		return TCX_PASS;
+	}
+
+	kfunc_ret = bpf_icmp_send(skb, unreach_type, unreach_code);
+
+	return TCX_DROP;
+}
+
 char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1



* [PATCH bpf-next v8 5/7] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy
In-Reply-To: <20260622120515.137082-1-mahe.tardy@gmail.com>

This test extends the existing cgroup_skb tests with IPv6 support.

Note that we need to set IPV6_RECVERR on the socket for IPv6 in
connect_to_fd_nonblock; otherwise the error will be ignored even if we
are in the middle of the TCP handshake. See ipv6_icmp_error() in
net/ipv6/datagram.c for more details.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
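For reference, opting a socket in to error-queue delivery looks roughly
like the sketch below. enable_recverr() is a hypothetical helper written
for illustration; the test itself only sets IPV6_RECVERR, and only for
AF_INET6 sockets.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Ask the kernel to queue ICMP errors on this socket's error queue.
 * The option level and name depend on the address family.
 * Returns 0 on success, -1 with errno set on failure.
 */
int enable_recverr(int fd, int family)
{
	int on = 1;

	if (family == AF_INET6)
		return setsockopt(fd, IPPROTO_IPV6, IPV6_RECVERR,
				  &on, sizeof(on));

	return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}
```

The option has to be set before connect() so that an error arriving
during the handshake is not missed.
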
 .../bpf/prog_tests/icmp_send_kfunc.c          | 77 +++++++++++++------
 tools/testing/selftests/bpf/progs/icmp_send.c | 48 +++++++++---
 2 files changed, 92 insertions(+), 33 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index f4e5b883d4c8..a5ac1a6ea77a 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -8,15 +8,17 @@
 #define TIMEOUT_MS 1000

 #define ICMP_DEST_UNREACH 3
+#define ICMPV6_DEST_UNREACH 1

 #define ICMP_FRAG_NEEDED 4
 #define NR_ICMP_UNREACH 15
+#define ICMPV6_REJECT_ROUTE 6

 static int connect_to_fd_nonblock(int server_fd)
 {
 	struct sockaddr_storage addr;
 	socklen_t len = sizeof(addr);
-	int fd, err;
+	int fd, err, on = 1;

 	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
 		return -1;
@@ -25,6 +27,12 @@ static int connect_to_fd_nonblock(int server_fd)
 	if (fd < 0)
 		return -1;

+	if (addr.ss_family == AF_INET6 &&
+	    setsockopt(fd, IPPROTO_IPV6, IPV6_RECVERR, &on, sizeof(on)) < 0) {
+		close(fd);
+		return -1;
+	}
+
 	err = connect(fd, (struct sockaddr *)&addr, len);
 	if (err < 0 && errno != EINPROGRESS) {
 		close(fd);
@@ -34,8 +42,14 @@ static int connect_to_fd_nonblock(int server_fd)
 	return fd;
 }

-static void read_icmp_errqueue(int sockfd, int expected_code)
+static void read_icmp_errqueue(int sockfd, int expected_code, int af)
 {
+	int expected_ee_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+						 ICMPV6_DEST_UNREACH;
+	int expected_origin = (af == AF_INET) ? SO_EE_ORIGIN_ICMP :
+						SO_EE_ORIGIN_ICMP6;
+	int expected_level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
+	int expected_type = (af == AF_INET) ? IP_RECVERR : IPV6_RECVERR;
 	struct sock_extended_err *sock_err;
 	char ctrl_buf[512];
 	struct msghdr msg = {
@@ -61,15 +75,16 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
 		return;

 	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
-		if (cm->cmsg_level != IPPROTO_IP || cm->cmsg_type != IP_RECVERR)
+		if (cm->cmsg_level != expected_level ||
+		    cm->cmsg_type != expected_type)
 			continue;

 		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);

-		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
-			       "sock_err_origin_icmp"))
+		if (!ASSERT_EQ(sock_err->ee_origin, expected_origin,
+			       "sock_err_origin"))
 			return;
-		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+		if (!ASSERT_EQ(sock_err->ee_type, expected_ee_type,
 			       "sock_err_type_dest_unreach"))
 			return;
 		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
@@ -79,13 +94,14 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
 	ASSERT_FAIL("no IP_RECVERR/IPV6_RECVERR control message found");
 }

-static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
+static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code,
+					    int af, const char *ip)
 {
 	int srv_fd = -1, client_fd = -1;
 	struct sockaddr_in addr;
 	socklen_t len = sizeof(addr);

-	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0, TIMEOUT_MS);
+	srv_fd = start_server(af, SOCK_STREAM, ip, 0, TIMEOUT_MS);
 	if (!ASSERT_OK_FD(srv_fd, "start_server"))
 		return;

@@ -94,6 +110,8 @@ static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
 		return;
 	}
 	skel->bss->server_port = ntohs(addr.sin_port);
+	skel->bss->unreach_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+						    ICMPV6_DEST_UNREACH;
 	skel->bss->unreach_code = code;

 	client_fd = connect_to_fd_nonblock(srv_fd);
@@ -103,13 +121,34 @@ static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
 	}

 	/* Skip reading ICMP error queue if code is invalid */
-	if (code >= 0 && code <= NR_ICMP_UNREACH)
-		read_icmp_errqueue(client_fd, code);
+	if (code >= 0 && ((af == AF_INET && code <= NR_ICMP_UNREACH) ||
+			  (af == AF_INET6 && code <= ICMPV6_REJECT_ROUTE)))
+		read_icmp_errqueue(client_fd, code, af);

 	close(client_fd);
 	close(srv_fd);
 }

+static void run_icmp_test(struct icmp_send *skel, int af, const char *ip,
+			  int max_code)
+{
+	for (int code = 0; code <= max_code; code++) {
+		/*
+		 * The TCP stack reacts differently when asking for
+		 * fragmentation, let's ignore it for now.
+		 */
+		if (af == AF_INET && code == ICMP_FRAG_NEEDED)
+			continue;
+
+		trigger_prog_read_icmp_errqueue(skel, code, af, ip);
+		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+	}
+
+	/* Test an invalid code */
+	trigger_prog_read_icmp_errqueue(skel, -1, af, ip);
+	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+}
+
 void test_icmp_send_unreach_cgroup(void)
 {
 	struct icmp_send *skel;
@@ -128,21 +167,11 @@ void test_icmp_send_unreach_cgroup(void)
 	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
 		goto cleanup;

-	for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
-		/*
-		 * The TCP stack reacts differently when asking for
-		 * fragmentation, let's ignore it for now.
-		 */
-		if (code == ICMP_FRAG_NEEDED)
-			continue;
-
-		trigger_prog_read_icmp_errqueue(skel, code);
-		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
-	}
+	if (test__start_subtest("ipv4"))
+		run_icmp_test(skel, AF_INET, "127.0.0.1", NR_ICMP_UNREACH);

-	/* Test an invalid code */
-	trigger_prog_read_icmp_errqueue(skel, -1);
-	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+	if (test__start_subtest("ipv6"))
+		run_icmp_test(skel, AF_INET6, "::1", ICMPV6_REJECT_ROUTE);

 cleanup:
 	icmp_send__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
index 6d0be0a9afe1..6e1ba539eeb0 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -5,10 +5,11 @@

 /* 127.0.0.1 in host byte order */
 #define SERVER_IP 0x7F000001
-
-#define ICMP_DEST_UNREACH 3
+/* ::1 in host byte order (last 32-bit word) */
+#define SERVER_IP6_LO 0x00000001

 __u16 server_port = 0;
+int unreach_type = 0;
 int unreach_code = 0;
 int kfunc_ret = -1;

@@ -18,19 +19,48 @@ int egress(struct __sk_buff *skb)
 	void *data = (void *)(long)skb->data;
 	void *data_end = (void *)(long)skb->data_end;
 	struct iphdr *iph;
+	struct ipv6hdr *ip6h;
 	struct tcphdr *tcph;
+	__u8 version;

-	iph = data;
-	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
-	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+	if (data + 1 > data_end)
 		return SK_PASS;

-	tcph = (void *)iph + iph->ihl * 4;
-	if ((void *)(tcph + 1) > data_end ||
-	    tcph->dest != bpf_htons(server_port))
+	version = (*((__u8 *)data)) >> 4;
+
+	if (version == 4) {
+		iph = data;
+		if ((void *)(iph + 1) > data_end ||
+		    iph->protocol != IPPROTO_TCP ||
+		    iph->daddr != bpf_htonl(SERVER_IP))
+			return SK_PASS;
+
+		tcph = (void *)iph + iph->ihl * 4;
+		if ((void *)(tcph + 1) > data_end ||
+		    tcph->dest != bpf_htons(server_port))
+			return SK_PASS;
+
+	} else if (version == 6) {
+		ip6h = data;
+		if ((void *)(ip6h + 1) > data_end ||
+		    ip6h->nexthdr != IPPROTO_TCP)
+			return SK_PASS;
+
+		if (ip6h->daddr.in6_u.u6_addr32[0] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[1] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[2] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[3] != bpf_htonl(SERVER_IP6_LO))
+			return SK_PASS;
+
+		tcph = (void *)(ip6h + 1);
+		if ((void *)(tcph + 1) > data_end ||
+		    tcph->dest != bpf_htons(server_port))
+			return SK_PASS;
+	} else {
 		return SK_PASS;
+	}

-	kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
+	kfunc_ret = bpf_icmp_send(skb, unreach_type, unreach_code);

 	return SK_DROP;
 }
--
2.34.1



* [PATCH bpf-next v8 4/7] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy
In-Reply-To: <20260622120515.137082-1-mahe.tardy@gmail.com>

This test opens a server and client, enters a new cgroup, attaches a
cgroup_skb program on egress and calls the bpf_icmp_send function from
the client egress so that an ICMP unreach control message is sent back
to the client. It then fetches the message from the error queue to
confirm the correct ICMP unreach code has been sent.

Note that, for the client, we have to connect in non-blocking mode to
let the test execute faster. Otherwise, we need to wait for the TCP
three-way handshake to time out in the kernel before reading the errno.

Also note that we don't set IP_RECVERR on the socket in
connect_to_fd_nonblock since the error will be transferred anyway in our
test because the connection is rejected at the beginning of the TCP
handshake. See tcp_v4_err() in net/ipv4/tcp_ipv4.c for more details.

Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
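The filter in the egress program boils down to "check bounds, then
compare fields". The same pattern can be sketched on a plain byte buffer
in userspace. match_ipv4_tcp() is a hypothetical helper, not part of the
selftest; it uses raw header offsets instead of struct iphdr/tcphdr, and
unlike the BPF program it also rejects an IHL below 5.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if [data, data_end) starts with an IPv4/TCP packet addressed
 * to dst_ip:dst_port (both in host byte order), 0 otherwise. Every read
 * is preceded by a bounds check, as the BPF verifier would require.
 */
int match_ipv4_tcp(const uint8_t *data, const uint8_t *data_end,
		   uint32_t dst_ip, uint16_t dst_port)
{
	const uint8_t *tcp;
	uint32_t daddr;
	size_t ihl;

	if (data_end - data < 20)		/* minimal IPv4 header */
		return 0;
	if ((data[0] >> 4) != 4 || data[9] != 6)	/* version, TCP */
		return 0;

	/* Destination address: bytes 16..19, big-endian on the wire. */
	daddr = ((uint32_t)data[16] << 24) | ((uint32_t)data[17] << 16) |
		((uint32_t)data[18] << 8) | data[19];
	if (daddr != dst_ip)
		return 0;

	/* The TCP header starts after IHL 32-bit words. */
	ihl = (size_t)(data[0] & 0x0f) * 4;
	if (ihl < 20 || (size_t)(data_end - data) < ihl + 20)
		return 0;

	/* Destination port: bytes 2..3 of the TCP header. */
	tcp = data + ihl;
	return (((uint16_t)tcp[2] << 8) | tcp[3]) == dst_port;
}
```
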
 .../bpf/prog_tests/icmp_send_kfunc.c          | 151 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c |  38 +++++
 2 files changed, 189 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
new file mode 100644
index 000000000000..f4e5b883d4c8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include <poll.h>
+#include "icmp_send.skel.h"
+
+#define TIMEOUT_MS 1000
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static int connect_to_fd_nonblock(int server_fd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len = sizeof(addr);
+	int fd, err;
+
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
+		return -1;
+
+	fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	if (fd < 0)
+		return -1;
+
+	err = connect(fd, (struct sockaddr *)&addr, len);
+	if (err < 0 && errno != EINPROGRESS) {
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+	struct sock_extended_err *sock_err;
+	char ctrl_buf[512];
+	struct msghdr msg = {
+		.msg_control = ctrl_buf,
+		.msg_controllen = sizeof(ctrl_buf),
+	};
+	struct pollfd pfd = {
+		.fd = sockfd,
+		.events = POLLERR,
+	};
+	struct cmsghdr *cm;
+	ssize_t n;
+
+	if (!ASSERT_GE(poll(&pfd, 1, TIMEOUT_MS), 1, "poll_errqueue"))
+		return;
+
+	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+		return;
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
+		return;
+
+	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (cm->cmsg_level != IPPROTO_IP || cm->cmsg_type != IP_RECVERR)
+			continue;
+
+		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+			       "sock_err_origin_icmp"))
+			return;
+		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+			       "sock_err_type_dest_unreach"))
+			return;
+		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+		return;
+	}
+
+	ASSERT_FAIL("no IP_RECVERR/IPV6_RECVERR control message found");
+}
+
+static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
+{
+	int srv_fd = -1, client_fd = -1;
+	struct sockaddr_in addr;
+	socklen_t len = sizeof(addr);
+
+	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0, TIMEOUT_MS);
+	if (!ASSERT_OK_FD(srv_fd, "start_server"))
+		return;
+
+	if (getsockname(srv_fd, (struct sockaddr *)&addr, &len)) {
+		close(srv_fd);
+		return;
+	}
+	skel->bss->server_port = ntohs(addr.sin_port);
+	skel->bss->unreach_code = code;
+
+	client_fd = connect_to_fd_nonblock(srv_fd);
+	if (!ASSERT_OK_FD(client_fd, "client_connect_nonblock")) {
+		close(srv_fd);
+		return;
+	}
+
+	/* Skip reading ICMP error queue if code is invalid */
+	if (code >= 0 && code <= NR_ICMP_UNREACH)
+		read_icmp_errqueue(client_fd, code);
+
+	close(client_fd);
+	close(srv_fd);
+}
+
+void test_icmp_send_unreach_cgroup(void)
+{
+	struct icmp_send *skel;
+	int cgroup_fd = -1;
+
+	skel = icmp_send__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+	if (!ASSERT_OK_FD(cgroup_fd, "join_cgroup"))
+		goto cleanup;
+
+	skel->links.egress =
+		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+		goto cleanup;
+
+	for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
+		/*
+		 * The TCP stack reacts differently when asking for
+		 * fragmentation, let's ignore it for now.
+		 */
+		if (code == ICMP_FRAG_NEEDED)
+			continue;
+
+		trigger_prog_read_icmp_errqueue(skel, code);
+		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+	}
+
+	/* Test an invalid code */
+	trigger_prog_read_icmp_errqueue(skel, -1);
+	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+cleanup:
+	icmp_send__destroy(skel);
+	if (cgroup_fd >= 0)
+		close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
new file mode 100644
index 000000000000..6d0be0a9afe1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+/* 127.0.0.1 in host byte order */
+#define SERVER_IP 0x7F000001
+
+#define ICMP_DEST_UNREACH 3
+
+__u16 server_port = 0;
+int unreach_code = 0;
+int kfunc_ret = -1;
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+	struct tcphdr *tcph;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+		return SK_PASS;
+
+	tcph = (void *)iph + iph->ihl * 4;
+	if ((void *)(tcph + 1) > data_end ||
+	    tcph->dest != bpf_htons(server_port))
+		return SK_PASS;
+
+	kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
+
+	return SK_DROP;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1



* [PATCH bpf-next v8 3/7] bpf: add bpf_icmp_send kfunc
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy
In-Reply-To: <20260622120515.137082-1-mahe.tardy@gmail.com>

This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs. We also extend this kfunc to tc
programs as a convenience.

This reuses concepts from the netfilter reject target code path, with
the following differences:
* Packets are cloned, since the BPF user can still let the packet pass
  (SK_PASS from the cgroup_skb progs, for example) and the current skb
  needs to stay untouched (cgroup_skb hooks only allow read-only access
  to the skb payload).
* We protect against recursion since the kfunc, by generating an ICMP
  error message, could retrigger the BPF prog that invoked it.

For now, we support cgroup_skb and tc program types. For cgroup_skb and
tc egress, almost everything should be good. However, for tc ingress:
- the packet is not routed yet: the net device must be set for
  icmp_send, hence the call to ip[6]_route_reply_fill_dst.
- fragments could trigger the hook: icmp_send only replies to fragment 0.
- the IP header must be linearized before processing, and the skb control
  block zeroed after cloning, to prevent icmp_send()/icmpv6_send() from
  misinterpreting garbage data as IP options.

Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
The interface accepts a type parameter to facilitate future extension to
other ICMP control message types.

Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
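The argument checks and their error codes can be summarized with a small
userspace sketch. icmp_send_check_args() is a hypothetical helper that
covers only the type/code validation, not the clone, routing or
recursion handling. The constants are redefined locally with the values
used in this series, and l3_proto is taken in host byte order here.

```c
#include <errno.h>
#include <stdint.h>

#define ETH_P_IP		0x0800
#define ETH_P_IPV6		0x86DD
#define ICMP_DEST_UNREACH	3
#define NR_ICMP_UNREACH		15
#define ICMPV6_DEST_UNREACH	1
#define ICMPV6_REJECT_ROUTE	6

/* Return 0 if (type, code) is accepted for l3_proto, else a -errno. */
int icmp_send_check_args(uint16_t l3_proto, int type, int code)
{
	switch (l3_proto) {
	case ETH_P_IP:
		if (type != ICMP_DEST_UNREACH)
			return -EOPNOTSUPP;	/* unsupported ICMP type */
		if (code < 0 || code > NR_ICMP_UNREACH)
			return -EINVAL;		/* code out of range */
		return 0;
	case ETH_P_IPV6:
		if (type != ICMPV6_DEST_UNREACH)
			return -EOPNOTSUPP;
		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
			return -EINVAL;
		return 0;
	default:
		return -EPROTONOSUPPORT;	/* non-IP protocol */
	}
}
```

The type is checked before the code, so an unsupported type yields
-EOPNOTSUPP even when the code is also out of range.
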
 net/core/filter.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2e96b4b847ce..fc69a14650e4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -84,6 +84,8 @@
 #include <linux/un.h>
 #include <net/xdp_sock_drv.h>
 #include <net/inet_dscp.h>
+#include <linux/icmpv6.h>
+#include <net/icmp.h>

 #include "dev.h"

@@ -12546,6 +12548,101 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
 	return 0;
 }

+/**
+ * bpf_icmp_send - Send an ICMP control message
+ * @skb_ctx: Packet that triggered the control message
+ * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
+ * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
+ *
+ * Sends an ICMP control message in response to the packet. The original packet
+ * is cloned before sending the ICMP message, so the BPF program can still let
+ * the packet pass if desired.
+ *
+ * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
+ * supported.
+ *
+ * Return: 0 on success, negative error code on failure:
+ *         -EINVAL: Invalid code parameter
+ *         -EBADMSG: Packet too short or malformed
+ *         -ENOMEM: Memory allocation failed
+ *         -EBUSY: Recursion detected
+ *         -EHOSTUNREACH: Routing failed
+ *         -EPROTONOSUPPORT: Non-IP protocol
+ *         -EOPNOTSUPP: Unsupported ICMP type
+ */
+__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct sk_buff *nskb;
+	struct sock *sk;
+
+	sk = skb_to_full_sk(skb);
+	if (sk && sk->sk_kern_sock &&
+	    (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
+		return -EBUSY;
+
+	switch (skb->protocol) {
+#if IS_ENABLED(CONFIG_INET)
+	case htons(ETH_P_IP):
+		if (type != ICMP_DEST_UNREACH)
+			return -EOPNOTSUPP;
+		if (code < 0 || code > NR_ICMP_UNREACH)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
+			kfree_skb(nskb);
+			return -EBADMSG;
+		}
+
+		if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
+
+		icmp_send(nskb, type, code, 0);
+		consume_skb(nskb);
+		break;
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		if (type != ICMPV6_DEST_UNREACH)
+			return -EOPNOTSUPP;
+		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (!pskb_network_may_pull(nskb, sizeof(struct ipv6hdr))) {
+			kfree_skb(nskb);
+			return -EBADMSG;
+		}
+
+		if (!skb_dst(nskb) && ip6_route_reply_fill_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		memset(IP6CB(nskb), 0, sizeof(struct inet6_skb_parm));
+
+		icmpv6_send(nskb, type, code, 0);
+		consume_skb(nskb);
+		break;
+#endif
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();

 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12588,6 +12685,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
 BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
 BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)

+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send)
+BTF_ID_FLAGS(func, bpf_icmp_send)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -12618,6 +12719,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
 	.set = &bpf_kfunc_check_set_sock_ops,
 };

+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_icmp_send,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -12639,6 +12745,9 @@ static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 					       &bpf_kfunc_set_sock_addr);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_icmp_send);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_icmp_send);
 	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
 }
 late_initcall(bpf_kfunc_init);
--
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v8 2/7] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy
In-Reply-To: <20260622120515.137082-1-mahe.tardy@gmail.com>

Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fill_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_output so this patch inlines it.

Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/ip6_route.h             |  2 ++
 net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
 net/ipv6/route.c                    | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 09ffe0f13ce7..eb5a60d3babe 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -100,6 +100,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
 	return ip6_route_output_flags(net, sk, fl6, 0);
 }

+int ip6_route_reply_fill_dst(struct sk_buff *skb);
+
 /* Only conditionally release dst if flags indicates
  * !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
  */
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index ef5b7e85cffa..7d2f577e72b8 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -293,21 +293,6 @@ nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
 						   sizeof(struct tcphdr), 0));
 }

-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
-	nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		    int hook)
 {
@@ -440,7 +425,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
 	if (hooknum == NF_INET_LOCAL_OUT && skb_in->dev == NULL)
 		skb_in->dev = net->loopback_dev;

-	if (!skb_dst(skb_in) && nf_reject6_fill_skb_dst(skb_in) < 0)
+	if (!skb_dst(skb_in) && ip6_route_reply_fill_dst(skb_in) < 0)
 		return;

 	icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 6361ad2fcf77..0fa56c801178 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2732,6 +2732,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
 }
 EXPORT_SYMBOL_GPL(ip6_route_output_flags);

+int ip6_route_reply_fill_dst(struct sk_buff *skb)
+{
+	struct dst_entry *result;
+	struct flowi6 fl = {
+		.daddr = ipv6_hdr(skb)->saddr
+	};
+	int err;
+
+	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+	err = result->error;
+	if (err)
+		dst_release(result);
+	else
+		skb_dst_set(skb, result);
+	return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fill_dst);
+
 struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
 	struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v8 1/7] net: move netfilter nf_reject_fill_skb_dst to core ipv4
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy
In-Reply-To: <20260622120515.137082-1-mahe.tardy@gmail.com>

Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fill_dst in ipv4/route.c
so that it can be reused in the following patches by BPF kfuncs.

Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.

Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/route.h                 |  1 +
 net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
 net/ipv4/route.c                    | 15 +++++++++++++++
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index f90106f383c5..300d292cd9a1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    const struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
 				       struct dst_entry *dst_orig);
+int ip_route_reply_fill_dst(struct sk_buff *skb);

 static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
 {
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index fecf6621f679..c1c0724e4d4d 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -252,21 +252,6 @@ static void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *
 	nskb->csum_offset = offsetof(struct tcphdr, check);
 }

-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
-	nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		   int hook)
@@ -279,7 +264,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 	if (!oth)
 		return;

-	if (!skb_dst(oldskb) && nf_reject_fill_skb_dst(oldskb) < 0)
+	if (!skb_dst(oldskb) && ip_route_reply_fill_dst(oldskb) < 0)
 		return;

 	if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -352,7 +337,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
 	if (iph->frag_off & htons(IP_OFFSET))
 		return;

-	if (!skb_dst(skb_in) && nf_reject_fill_skb_dst(skb_in) < 0)
+	if (!skb_dst(skb_in) && ip_route_reply_fill_dst(skb_in) < 0)
 		return;

 	if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 3f3de5164d6e..f24609933fbe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2942,6 +2942,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);

+int ip_route_reply_fill_dst(struct sk_buff *skb)
+{
+	struct rtable *rt;
+	struct flowi4 fl4 = {
+		.daddr = ip_hdr(skb)->saddr
+	};
+
+	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+	skb_dst_set(skb, &rt->dst);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fill_dst);
+
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 			struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v8 0/7] bpf: add icmp_send kfunc
From: Mahe Tardy @ 2026-06-22 12:05 UTC (permalink / raw)
  To: bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song,
	Mahe Tardy

Hello,

This is v8 of adding the icmp_send kfunc, as suggested during LSF/MM/BPF
2025[^1]. The goal is to allow cgroup_skb programs to actively reject
east-west traffic, similarly to what is possible to do with netfilter
reject target. Applications can receive early feedback that something
went wrong during the TCP handshake.

The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type and one of its codes (ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc.). This is easier to implement
than a TCP RST reply and already hints to the client TCP stack that it
should abort the connection and not retry extensively.
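
For orientation, the argument checks done by the kfunc in patch 3/7 can
be modelled in plain user-space C. This is only a sketch under stated
assumptions: model_icmp_send_check() and the MODEL_* constants are
made-up stand-ins that carry the usual values of the kernel constants,
the ethertype is taken in host byte order (the kfunc compares
skb->protocol against htons() values), and the skb clone, header pull,
route lookup and recursion guard of the real kfunc are not modelled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel constants used by the kfunc. */
enum {
	MODEL_ETH_P_IP            = 0x0800,
	MODEL_ETH_P_IPV6          = 0x86DD,
	MODEL_ICMP_DEST_UNREACH   = 3,	/* ICMP_DEST_UNREACH */
	MODEL_NR_ICMP_UNREACH     = 15,	/* NR_ICMP_UNREACH */
	MODEL_ICMPV6_DEST_UNREACH = 1,	/* ICMPV6_DEST_UNREACH */
	MODEL_ICMPV6_REJECT_ROUTE = 6,	/* ICMPV6_REJECT_ROUTE */
};

/*
 * Mirror the order of the checks in the kfunc: protocol first, then
 * the ICMP type, then the code range. Returns 0 or a negative errno.
 */
int model_icmp_send_check(int ethertype, int type, int code)
{
	switch (ethertype) {
	case MODEL_ETH_P_IP:
		if (type != MODEL_ICMP_DEST_UNREACH)
			return -EOPNOTSUPP;
		if (code < 0 || code > MODEL_NR_ICMP_UNREACH)
			return -EINVAL;
		return 0;
	case MODEL_ETH_P_IPV6:
		if (type != MODEL_ICMPV6_DEST_UNREACH)
			return -EOPNOTSUPP;
		if (code < 0 || code > MODEL_ICMPV6_REJECT_ROUTE)
			return -EINVAL;
		return 0;
	default:
		return -EPROTONOSUPPORT;
	}
}
```

With this model, an IPv4 destination-unreachable request with code 1
passes, code 16 is rejected with -EINVAL, and an ARP ethertype is
rejected with -EPROTONOSUPPORT.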

Note that this is different from the sock_destroy kfunc, which calls
tcp_abort and thus sends a reset, destroying the underlying socket.

Caveats of this kfunc design are that a program can call this function N
times, thus sending N ICMP unreach control messages, and that the program
can return from the BPF filter with pass, leading to a potentially
confusing situation where the TCP connection was established while the
client received ICMP_DEST_UNREACH messages.

v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.

v3 update:
- fix an undefined reference build error.

v4 updates:
- prevent the kfunc from being called recursively and add a test (thanks
  to Martin).
- do not fetch dst route when unnecessary (thanks to Martin).
- extend the test for IPv6 (thanks to Martin).
- use SK_DROP in examples and use non blocking sockets for testing
  (thanks to Martin).
- test when the kfunc returns -EINVAL (thanks to Jordan).
- add the kfunc to bpf_kfunc_set_skb as suggested by Alexei.
- guard the IPv4 parts with IS_ENABLED(CONFIG_INET).
- fix a wrong initial value for client_fd (thanks to Yonghong).
- add documentation to the kfunc.
- to Jordan: I couldn't include <linux/icmp.h> because of redefines from
  <network_helpers.h>.

v5 updates:
- kfunc name is now icmp_send and takes the control message type as
  parameter for future potential extension (daniel)
- drop the net patches to route packet since now the kfunc is limited to
  cgroup_skb and tc progs (daniel & martin)
- linearize skb headers (sashiko)
- zero SKB control block (sashiko)
- bind to port 0 instead of fixed port (sashiko)
- poll to wait for POLLERR event (sashiko)
- do not use ASSERT_EQ in CMSG_NXTHDR loop (sashiko)
- fix comment about byte order (sashiko)
- fix endianness IP address issue (sashiko)
- add forgotten cleanup_cgroup_environment (sashiko)
- let packets pass in recursion test (sashiko)
- clarify evaluation order for recursion test (sashiko)

v6 updates (all from sashiko):
- bring back the net patches to route packet since tc ingress needs it.
- rename the ip_route_reply helpers from fetch to fill.
- call pskb_network_may_pull on the cloned pkt.
- check explicitly that we received one and only one ICMP err ctrl msg.

v7 updates:
- use consume_skb on success path (stanislav)
- replace recursion protection with CPU_ARRAY by checking the nature of
  the sk (daniel, offline)
- use reverse xmas tree in read_icmp_errqueue (jordan)
- use ASSERT_OK_FD instead of ASSERT_GE whenever possible (jordan)
- add a test for tc (jordan)
- better filtering from host cgroup test progs (sashiko)

v8 updates:
- mostly a resend as it's been sitting as "New" in the queue for almost
  one month, fixed a few nits.
- on new bpf_icmp_send kfunc cgroup_skb test (patch 4/7):
  - guard a close fd with fd >= 0 (jordan)
  - use ASSERT_OK_FD instead of ASSERT_GE (jordan)
  - fixed comment style (sashiko)
- on recursion test (patch 7/7):
  - guard a close fd with fd >= 0 (jordan)
  - fixed comments style (sashiko)
  - filter bpf prog on pid and ICMP message types (sashiko)

[^1]: https://lwn.net/Articles/1022034/

Link to v7: https://lore.kernel.org/bpf/20260526153708.279717-1-mahe.tardy@gmail.com/

Mahe Tardy (7):
  net: move netfilter nf_reject_fill_skb_dst to core ipv4
  net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  bpf: add bpf_icmp_send kfunc
  selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests
  selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests
  selftests/bpf: add bpf_icmp_send kfunc tc tests
  selftests/bpf: add bpf_icmp_send recursion test

 include/net/ip6_route.h                       |   2 +
 include/net/route.h                           |   1 +
 net/core/filter.c                             | 109 ++++++++
 net/ipv4/netfilter/nf_reject_ipv4.c           |  19 +-
 net/ipv4/route.c                              |  15 ++
 net/ipv6/netfilter/nf_reject_ipv6.c           |  17 +-
 net/ipv6/route.c                              |  18 ++
 .../bpf/prog_tests/icmp_send_kfunc.c          | 248 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c | 184 +++++++++++++
 9 files changed, 580 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c

--
2.34.1


Mahe Tardy (7):
  net: move netfilter nf_reject_fill_skb_dst to core ipv4
  net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  bpf: add bpf_icmp_send kfunc
  selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests
  selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests
  selftests/bpf: add bpf_icmp_send kfunc tc tests
  selftests/bpf: add bpf_icmp_send recursion test

 include/net/ip6_route.h                       |   2 +
 include/net/route.h                           |   1 +
 net/core/filter.c                             | 109 ++++++++
 net/ipv4/netfilter/nf_reject_ipv4.c           |  19 +-
 net/ipv4/route.c                              |  15 ++
 net/ipv6/netfilter/nf_reject_ipv6.c           |  17 +-
 net/ipv6/route.c                              |  18 ++
 .../bpf/prog_tests/icmp_send_kfunc.c          | 250 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c | 184 +++++++++++++
 9 files changed, 582 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c

--
2.34.1


^ permalink raw reply

* RE: [REGRESSION 6.12.90 -> 6.12.94] vsock/virtio: large AF_VSOCK transfers reset under backpressure
From: Brien Oberstein @ 2026-06-22 11:55 UTC (permalink / raw)
  To: 'Stefano Garzarella'; +Cc: netdev, regressions, stable
In-Reply-To: <ajkAlpiyPWmNPWfx@sgarzare-redhat>

Hi Stefano,

Thanks, that matches what I'm seeing: large transfers reset mid-stream
instead of the sender being throttled (reliable above ~1.5 MB, fine below
~90 KB).

The bind for me: it's not just this mail bridge -- I use AF_VSOCK for a few
host/guest services, some of which open their own sockets, so the per-socket
buffer workaround can't cover them all. That leaves pinning 6.12.90 (losing
the DoS fix and further kernel updates) as the only blanket option.

A few quick questions:

1. Is a -stable backport of the merging fix likely, and roughly when?
2. Could a smaller interim land in -stable sooner (e.g. more default
   headroom) without reopening the DoS?
3. Will the fix guarantee backpressure for any packet size, or just widen
   the margin?

Happy to test any patch -- I have a solid reproducer and can turn it around
in a day. I'll also file this as a tracked regression so it's not lost.

Thanks again,
Brien

#regzbot introduced: v6.12.90..v6.12.94

-----Original Message-----
From: Stefano Garzarella <sgarzare@redhat.com> 
Sent: Monday, June 22, 2026 6:08 AM
To: Brien Oberstein <brienpub@gmail.com>
Cc: edumazet@google.com
Subject: Re: [REGRESSION 6.12.90 -> 6.12.94] vsock/virtio: large AF_VSOCK transfers reset under backpressure

On Sun, Jun 21, 2026 at 08:42:41AM -0400, Brien Oberstein wrote:
>Hi Stefano, Eric,

Hi Brien,

>
>I'm hitting a regression in the 6.12.y stable series: a bulk transfer over
>AF_VSOCK is torn down mid-stream once the message is large enough to
>exercise receiver-side backpressure. By stable version it lands on
>6.12.94; 6.12.90 is fine.
>
>Setup
>-----
>A host process mails a guest's postfix over an AF_VSOCK bridge:
>
>  host msmtp --(unix sock)--> socat --(AF_VSOCK: host CID 2 ->
>    guest CID 101, port 20025)--> [guest] socat --(TCP 127.0.0.1:25)-->
>    postfix
>
>postfix (TLS-terminating, then writing to its queue) drains the stream
>slower than the host writes it, so the per-socket vsock buffer fills
>during a large message.
>
>Symptom (guest, 6.12.94)
>------------------------
>The guest-side socat exits status=1 mid-transfer and postfix logs:
>
>  postfix/smtpd: NNN: lost connection after DATA (153330 bytes)
>    from localhost[127.0.0.1]
>  postfix/smtpd: disconnect ... data=0/1 commands=5/6
>
>On the host, msmtp reports:
>
>  msmtp: cannot write to TLS connection: The TLS connection was
>    non-properly terminated.        (sendmail exit 74 / EX_TEMPFAIL)
>
>So the AF_VSOCK connection is dropped while data is still flowing, rather
>than the sender being throttled by the credit-based flow control.
>
>Reproduction
>------------
>Send messages of increasing size through the bridge:
>
>  body <= ~88 KB : always succeeds
>  body ~354 KB   : intermittent failure
>  body >= 1.5 MB : fails 12/12
>
>On 6.12.90 the identical test passes 20/20, including 1.5 MB x12,
>2.4 MB x3, 4 MB x3 and 8 MB x2. The only variable is the guest kernel.
>
>Bisection
>---------
>6.12.91, .92 and .93 carry no vsock changes. 6.12.94 pulled in three
>vsock/virtio commits:
>
>  1eca304f  vsock/virtio: fix potential unbounded skb queue
>  f3bf0f3b  vsock/virtio: fix skb overhead accounting to preserve
>            full buf_alloc
>  149205a1  vsock/virtio: fix skb overhead overflow on 32-bit builds
>
>The behaviour (drop/reset under a fast sender + slow receiver instead of
>applying backpressure) makes 1eca304f the prime suspect, but I have only
>A/B tested whole stable releases, not the individual commits.

Yep, I'm working on a followup to improve the status.

Basically, the memory management in AF_VSOCK has always been broken. The 
patches you mentioned are designed to prevent one peer from consuming 
all of the other peer’s memory.
Instead of counting only the payload bytes, we now also take the packet 
metadata into account, using a socket buffer that is double the size set 
(default 256 KB).

So if your system is sending small packets, then this is likely hitting 
this issue.

My advice for now is to increase the socket buffer size. Thanks to 
VMware, AF_VSOCK has specific sockopts :-(:
- SO_VM_SOCKETS_BUFFER_SIZE (0)
- SO_VM_SOCKETS_BUFFER_MAX_SIZE (2)

I suggest setting both to 16 MB (MAX should be set first).
I tried this with socat and it seems to work:
   socat VSOCK-LISTEN:4242,setsockopt=40:2:x0000000001000000,setsockopt=40:0:x0000000001000000
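
For services that open their own sockets rather than going through
socat, the same two options could be set from C roughly as follows. This
is a sketch, not a tested fix: vsock_set_buffer_sizes() is a made-up
helper name, the option value is assumed to be passed as a 64-bit
integer, and the order (MAX first) simply follows the advice above.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/*
 * Hypothetical helper: raise the per-socket vsock buffer limits on an
 * already created AF_VSOCK socket. Returns 0 on success, -1 on error
 * (errno is left as set by setsockopt()).
 */
int vsock_set_buffer_sizes(int fd, unsigned long long bytes)
{
	/* Raise the ceiling first, as suggested in the mail. */
	if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
		       &bytes, sizeof(bytes)) < 0)
		return -1;

	/* Then request the actual buffer size. */
	if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
		       &bytes, sizeof(bytes)) < 0)
		return -1;

	return 0;
}
```

A caller would invoke it right after socket(AF_VSOCK, SOCK_STREAM, 0),
for example as vsock_set_buffer_sizes(fd, 16ULL << 20) for 16 MB.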

Hope this helps.

In the meantime, I'm working on a follow-up for net-next to ensure that 
packets are merged when we exceed a threshold; we might be able to 
backport this to stable, but I'm not sure.

Thanks,
Stefano



^ permalink raw reply

* [PATCH v2] net: mdio: airoha: fix reset control leak in error path
From: Wentao Liang @ 2026-06-22 11:54 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, Wentao Liang

In airoha_mdio_probe(), after calling reset_control_deassert(),
if clk_set_rate() fails, the function returns immediately without
calling reset_control_assert(). This leaves the reset line
deasserted and causes a reference count leak on shared reset
controllers.

Fix this by reorganizing the error handling to use a goto label,
ensuring reset_control_assert() is called on all error paths
before returning.
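
The shape of the fix can be illustrated with a small user-space model.
Everything here is hypothetical (struct model_reset, model_probe() and
the injected error codes stand in for the reset control, the probe
function and the results of clk_set_rate() and
devm_of_mdiobus_register()); it only shows that every failure after the
deassert funnels through one label that re-asserts the reset.

```c
#include <assert.h>

/* Hypothetical stand-in for the reset line state. */
struct model_reset {
	int deasserted;
};

/*
 * Model of the probe flow: clk_err and register_err are the results
 * the two later steps would return (0 on success, negative on error).
 */
int model_probe(struct model_reset *rst, int clk_err, int register_err)
{
	int ret;

	rst->deasserted = 1;		/* models reset_control_deassert() */

	ret = clk_err;			/* models clk_set_rate() */
	if (ret)
		goto err_reset_assert;

	ret = register_err;		/* models devm_of_mdiobus_register() */
	if (ret)
		goto err_reset_assert;

	return 0;

err_reset_assert:
	rst->deasserted = 0;		/* models reset_control_assert() */
	return ret;
}
```

In this model a failure in either step leaves the reset asserted again,
while a successful probe keeps it deasserted.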

Also add error checking for reset_control_deassert().

Fixes: 67e3ba978361 ("net: mdio: Add MDIO bus controller for Airoha AN7583")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 drivers/net/mdio/mdio-airoha.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mdio/mdio-airoha.c b/drivers/net/mdio/mdio-airoha.c
index 52e7475121ea..4c1b2415687c 100644
--- a/drivers/net/mdio/mdio-airoha.c
+++ b/drivers/net/mdio/mdio-airoha.c
@@ -246,15 +246,17 @@ static int airoha_mdio_probe(struct platform_device *pdev)
 
 	ret = clk_set_rate(priv->clk, freq);
 	if (ret)
-		return ret;
+		goto err_reset_assert;
 
 	ret = devm_of_mdiobus_register(dev, bus, dev->of_node);
-	if (ret) {
-		reset_control_assert(priv->reset);
-		return ret;
-	}
+	if (ret)
+		goto err_reset_assert;
 
 	return 0;
+
+err_reset_assert:
+	reset_control_assert(priv->reset);
+	return ret;
 }
 
 static const struct of_device_id airoha_mdio_dt_ids[] = {
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related

* Re: [PATCH net v2 5/7] ipv6: reset value and position for proxy_ndp sysctl restart
From: Nicolas Dichtel @ 2026-06-22 11:48 UTC (permalink / raw)
  To: Fernando Fernandez Mancera, netdev
  Cc: stephen, brian.haley, horms, pabeni, kuba, edumazet, davem,
	idosch, dsahern
In-Reply-To: <20260620161850.7114-6-fmancera@suse.de>

Le 20/06/2026 à 18:18, Fernando Fernandez Mancera a écrit :
> When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
> retried but as the value was already modified by the initial
> proc_dointvec() call, the restarted syscall will read the newly modified
> value as the 'old' state.
> 
> Fix this by taking the RTNL lock before parsing the input value if the
> operation is a write.
> 
> Fixes: c92d5491a6d9 ("netconf: add support for IPv6 proxy_ndp")
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

^ permalink raw reply

* Re: [PATCH net v2 7/7] ipv6: reset position for force_forwarding sysctl restart
From: Ido Schimmel @ 2026-06-22 11:42 UTC (permalink / raw)
  To: Fernando Fernandez Mancera
  Cc: netdev, nicolas.dichtel, stephen, brian.haley, horms, pabeni,
	kuba, edumazet, davem, dsahern
In-Reply-To: <20260620161850.7114-8-fmancera@suse.de>

On Sat, Jun 20, 2026 at 06:18:50PM +0200, Fernando Fernandez Mancera wrote:
> When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is

s/proxy_ndp/force_forwarding/

> retried but the position pointer was already advanced meaning that the
> restarted sysctl will read from an incorrect offset.
> 
> Fix this by restoring the original position pointer before restarting
> the syscall.
> 
> In addition, remove the redundant position pointer restoration at the
> end of the function.
> 
> Fixes: f24987ef6959 ("ipv6: add `force_forwarding` sysctl to enable per-interface forwarding")
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> ---
>  net/ipv6/addrconf.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index cbe681de3818..8c0741e9dfcc 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -6825,8 +6825,10 @@ static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int wri
>  	ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos);
>  
>  	if (write && old_val != new_val) {
> -		if (!rtnl_net_trylock(net))
> +		if (!rtnl_net_trylock(net)) {
> +			*ppos = pos;
>  			return restart_syscall();
> +		}

Are you sure that this is needed?

AFAICT, the position pointer is only advanced if the return value is
positive. From new_sync_write():

kiocb.ki_pos = (ppos ? *ppos : 0);
[...]
ret = filp->f_op->write_iter(&kiocb, &iter);
[...]
if (ret > 0 && ppos)
        *ppos = kiocb.ki_pos;

And restart_syscall() returns '-ERESTARTNOINTR'.
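
To make the point concrete, here is a tiny user-space model of the
quoted new_sync_write() logic. It is a sketch only: the model_* names
are made up, MODEL_ERESTARTNOINTR is a local copy of the kernel-internal
value 513, and whether the sysctl write path behaves exactly like this
is the open question above.

```c
#include <assert.h>

typedef long long model_loff_t;

/* Local copy of the kernel-internal ERESTARTNOINTR value. */
#define MODEL_ERESTARTNOINTR 513

/* A handler gets a private position and returns bytes written or -errno. */
typedef long (*model_write_fn)(model_loff_t *pos, long len);

/* Model of new_sync_write(): copy *ppos in, copy back only if ret > 0. */
long model_sync_write(model_write_fn fn, model_loff_t *ppos, long len)
{
	model_loff_t ki_pos = ppos ? *ppos : 0;
	long ret = fn(&ki_pos, len);

	if (ret > 0 && ppos)
		*ppos = ki_pos;
	return ret;
}

/* Handler that consumes the input and succeeds. */
long model_handler_ok(model_loff_t *pos, long len)
{
	*pos += len;
	return len;
}

/* Handler that advances its position and then asks for a restart. */
long model_handler_restart(model_loff_t *pos, long len)
{
	*pos += len;
	return -MODEL_ERESTARTNOINTR;
}
```

In the model, the caller's position is untouched after the restarting
handler and only moves once a handler returns a positive byte count.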

>  
>  		WRITE_ONCE(*valp, new_val);
>  
> @@ -6851,8 +6853,6 @@ static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int wri
>  		rtnl_net_unlock(net);
>  	}
>  
> -	if (ret)
> -		*ppos = pos;
>  	return ret;
>  }
>  
> -- 
> 2.54.0
> 

^ permalink raw reply

* Re: [PATCH net v6 3/4] iavf: send MAC change request synchronously
From: Przemek Kitszel @ 2026-06-22 11:38 UTC (permalink / raw)
  To: Jose Ignacio Tornos Martinez, netdev
  Cc: intel-wired-lan, aleksandr.loktionov, jacob.e.keller, horms,
	anthony.l.nguyen, davem, edumazet, kuba, pabeni, stable
In-Reply-To: <20260619061321.8554-4-jtornosm@redhat.com>

[-Jesse, he moved to another company a while ago]

> v6: Address edge cases found by AI review (Jakub Kicinski):
>      Although unlikely in practice, v6 adds robustness for corner cases:
>      - Allocation failure after message sent: allocate event buffer BEFORE
>        sending to PF (theoretical - allocation rarely fails for small buffers)
>      - Multi-batch scenario: add loop to send all batches when >200 MACs pending
>        (rare - most configurations have far fewer MACs)
>      - Timeout rollback: only rollback on send failure (ret != -EAGAIN), not on
>        timeout where PF response handler will sync state (transient inconsistency
>        during timeout is acceptable and will be resolved by response)
> v5: https://lore.kernel.org/all/20260429102426.210750-4-jtornosm@redhat.com/
> 
>   drivers/net/ethernet/intel/iavf/iavf.h        | 11 ++-
>   drivers/net/ethernet/intel/iavf/iavf_main.c   | 91 +++++++++++++----
>   .../net/ethernet/intel/iavf/iavf_virtchnl.c   | 99 +++++++++++++++++--
>   3 files changed, 171 insertions(+), 30 deletions(-)
> 

[...]

> +static bool iavf_mac_change_done(struct iavf_adapter *adapter,
> +				 const void *data, enum virtchnl_ops v_op)
> +{
> +	const u8 *addr = data;
> +
> +	return iavf_is_mac_set_handled(adapter->netdev, addr);
> +}

[...]

> +static int iavf_set_mac_sync(struct iavf_adapter *adapter, const u8 *addr)
> +{
> +	struct iavf_arq_event_info event;
> +	int ret;
> +
> +	netdev_assert_locked(adapter->netdev);
> +
> +	event.buf_len = IAVF_MAX_AQ_BUF_SIZE;
> +	event.msg_buf = kzalloc(event.buf_len, GFP_KERNEL);
> +	if (!event.msg_buf)
> +		return -ENOMEM;
> +
> +	while (adapter->aq_required & IAVF_FLAG_AQ_ADD_MAC_FILTER) {
> +		ret = iavf_add_ether_addrs(adapter);

I believe that this change (made in v6) is wrong.
(just an observation: AI review made this series worse vs v5).

From the second iteration onward, the call would fail the check
"if (adapter->current_op != VIRTCHNL_OP_UNKNOWN)" and thus return
-EBUSY.

The watchdog would not kick the VC/AQ queue since we hold the netdev lock
here, so there is a need to manually ensure forward progress by calling
iavf_poll_virtchnl_response() within the loop.

I think it should be fine to stop when the "iavf_mac_change_done"
condition is met, this will simply leave the rest of the changes
for watchdog (as we do now).

> +		if (ret)
> +			goto out;
> +	}
> +
> +	ret = iavf_poll_virtchnl_response(adapter, &event,
> +					  iavf_mac_change_done, addr, 2500);
> +
> +out:
> +	kfree(event.msg_buf);
> +	return ret;
> +}


^ permalink raw reply

* [PATCH iwl-net v2 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Petr Oros @ 2026-06-22 11:34 UTC (permalink / raw)
  To: netdev; +Cc: Petr Oros, Aleksandr Loktionov
In-Reply-To: <20260622113428.2565255-1-poros@redhat.com>

When an ice port is part of a vlan-filtering bridge with a wide VLAN
trunk and the netdev is in IFF_PROMISC (typical for bond slaves
attached to a bridge), the driver installs per-VLAN
ICE_SW_LKUP_PROMISC_VLAN entries (recipe 9) in addition to the broad
ICE_SW_LKUP_DFLT VSI Rx rule (recipe 5). Each per-VLAN rule consumes
one Flow Lookup Unit (FLU) entry from a fixed hardware pool of "up to
32K FLU entries" per device, documented in the E810 datasheet
(613875-009 section 7.8.10, Table 7-18, page 1015).

With three active PFs sharing one switch context and a bridge trunk of
vid 2-4094, the configuration would require roughly

  3 PFs * 4093 VLANs * 3 rules per VLAN per PF ~= 36,800 rules

which exceeds the 32K FLU budget. Firmware then responds to further
Add Switch Rules requests with AQ retval 0x10 (LIBIE_AQ_RC_ENOSPC) and
the user-visible failure surfaces as

  ice 0000:5c:00.1: Failed to set VSI 14 as the default forwarding
                    VSI, error -5
  ice 0000:5c:00.1 ens1f1: Error -5 setting default VSI 14 Rx rule
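
The rule estimate above can be checked with a few lines of C. This is
only a sketch: flu_rules_needed() is a made-up helper, and reading "32K"
as 32 * 1024 entries is an assumption.

```c
#include <assert.h>

/* Assumed reading of the "up to 32K FLU entries" figure. */
enum { FLU_BUDGET = 32 * 1024 };

/*
 * Rules needed when every PF installs rules_per_vlan rules for each
 * VID in the inclusive range [first_vid, last_vid].
 */
long flu_rules_needed(long pfs, long first_vid, long last_vid,
		      long rules_per_vlan)
{
	long vlans = last_vid - first_vid + 1;

	return pfs * vlans * rules_per_vlan;
}
```

For 3 PFs and vid 2-4094 with 3 rules per VLAN this gives
3 * 4093 * 3 = 36,837 rules, which is above the assumed 32,768 budget.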

After a switch context has been driven into overrun, subsequent
retries can come back as AQ retval 0x2 (LIBIE_AQ_RC_ENOENT), which has
misled triage attempts toward a perceived recipe binding defect
rather than a capacity issue.

When the DFLT VSI Rx rule is in place it catches every packet on the
lport regardless of VLAN tag, so the per-VLAN PROMISC_VLAN expansion
is redundant. The recipe 4 VLAN prune entries are still installed
per VLAN and continue to track the allowed VID set, but the
IFF_PROMISC sync path disables their enforcement on the VSI via
vlan_ops->dis_rx_filtering() before ice_set_promisc() runs.
ena_rx_filtering() is restored when IFF_PROMISC is cleared.

Skip the per-VLAN expansion at the two call sites that drive it:
ice_set_promisc() falls through to ice_fltr_set_vsi_promisc() and
ice_vlan_rx_add_vid() omits the per-VLAN ICE_MCAST_VLAN_PROMISC_BITS
add. Plain IFF_ALLMULTI without an installed DFLT VSI rule is
unchanged and still installs per-VLAN multicast promisc rules.

Both checks use ice_is_vsi_dflt_vsi() which inspects the recipe
filter list for an installed DFLT rule on this VSI, not
netdev->flags & IFF_PROMISC. The HW-state predicate avoids two
regression vectors that a user-intent predicate would introduce:

1. ice_lag_is_switchdev_running() short-circuits ice_set_dflt_vsi()
   to return 0 without installing the DFLT rule for a PF in
   switchdev LAG mode. An IFF_PROMISC-only check would also
   suppress the per-VLAN fallback, leaving the PF with no rule.

2. When ice_set_dflt_vsi() returns a non-EEXIST error (FLU
   exhausted, switch context divergence), the driver clears
   IFF_PROMISC from vsi->current_netdev_flags but the netdev's own
   flags retain IFF_PROMISC. The user-intent predicate would still
   suppress the per-VLAN fallback even though DFLT failed to
   install.

The predicate is install-time only. The IFF_PROMISC off path closes
the lifecycle gap in ice_vsi_exit_dflt_promisc(): for an IFF_ALLMULTI
VSI with VLANs it reinstates the per-VID rules before clearing the
default rule, so multicast coverage never lapses. If that AQ call
fails the default rule is left in place, ice_vsi_exit_dflt_promisc()
returns the error, and the sync_fltr pass bails with
vsi->current_netdev_flags |= IFF_PROMISC; the current/netdev flag
mismatch re-fires the IFF_PROMISC off path on the next sync. Clearing
the default rule first would instead expose a window where neither
the default rule nor the per-VID rules carry multicast.

If ice_clear_dflt_vsi() fails after the per-VID rules were reinstated
they are deliberately not rolled back. Clearing the default rule is a
removal that frees an FLU entry rather than allocating one, so it
cannot fail for lack of space; a failure is a transient AdminQ error.
The per-VID rules are the steady state for an IFF_ALLMULTI VLAN VSI,
so the only redundant entry left behind is the single un-removed
default rule, not the per-VID set. The retry re-enters this path,
ice_fltr_set_vlan_vsi_promisc() returns -EEXIST for the rules that
already exist so nothing is reallocated, and the default rule is
removed on the next attempt. Rolling the per-VID rules back here would
instead churn thousands of removes and re-adds on every retry.

After the default rule is gone the vid=0 PROMISC rule that paired
with it is redundant and is dropped, but only to reclaim a filter
entry, so a failure there is logged and does not abort the
transition.

ice_set_vsi_promisc() and ice_clear_vsi_promisc() dispatch the
recipe based on whether ICE_PROMISC_VLAN_RX/TX bits are present in
the mask: with the bits set, recipe ICE_SW_LKUP_PROMISC_VLAN is
used; otherwise ICE_SW_LKUP_PROMISC. The else branch in
ice_set_promisc() installs the vid=0 rule in ICE_SW_LKUP_PROMISC.
Because ice_clear_promisc() with VLANs present adds the VLAN bits
and would search ICE_SW_LKUP_PROMISC_VLAN, the recipe mismatch
would leave the vid=0 ICE_SW_LKUP_PROMISC rule orphaned when VLANs
are configured. This is a single stale rule, not a per-cycle leak:
re-adding it the next time promisc is enabled returns -EEXIST rather
than allocating a new entry. The set-time recipe is not recorded, so
ice_clear_promisc() clears both recipes; clearing a rule that is not
present succeeds, both clears run unconditionally, and the first
error is returned.

The two VLAN-0 recipe transition blocks in ice_vlan_rx_add_vid()
and ice_vlan_rx_kill_vid() that promote / demote the vid=0 rule
between ICE_SW_LKUP_PROMISC and ICE_SW_LKUP_PROMISC_VLAN are
likewise guarded by !ice_is_vsi_dflt_vsi(). With DFLT in place the
vid=0 rule already covers every VID and a recipe swap would only
install a redundant rule.

Lab reproduction on an E810-C with the same firmware family (4.80,
NVM 1.3805.0, DDP 1.3.43.0) using four PFs in vlan-filtering bridges
with vid 2-4094 and the slaves brought to IFF_PROMISC before the
bridge VLAN bulk add:

  before fix:  ~12,279 AQ Add Switch Rules per PF, ENOSPC and ENOENT
               responses in dmesg, DFLT VSI Rx rule install fails on
               the affected PF
  after fix:   ~4,093 AQ Add Switch Rules per PF, no AQ errors, DFLT
               VSI Rx rule installs on every PF

The 66.7% reduction in installed switch rules per PF matches the
expected per-VLAN saving: a single DFLT rule replaces the per-VID
PROMISC_VLAN expansion.

Functional regression test with vid 2-100 trunk between two ice
ports through the lab switch (40/40 PASS, 0 AQ errors, 0 ENOSPC
at 4093-VID customer scale):

  vid 50 unicast, vid 100 unicast, vid 50 broadcast ARP,
    vid 100 multicast IPv6 ND
  vid 200/500/1500/4000 isolation (out-of-trunk) and untagged not
    leaked: 0 packets reach any bridge endpoint
  IGMP/MLD snooping, Jumbo MTU 9000, reserved-multicast STP BPDU
  IFF_PROMISC + IFF_ALLMULTI transition (off while allmulti stays)
  Regression reproducer for commit 1273f89578f2 ("ice: Fix broken
    IFF_ALLMULTI handling"): allmulti on -> add vid -> allmulti off
    -> allmulti on plus the orphan-rule Scenario 2; both converge
    with no stale rules
  100-VID, 1000-VID, 4093-VID stress cycles (5/3/2 iterations each)
  switchdev mode toggle preserves IFF_PROMISC pruning state across
    the session (vid 999 multicast received before and after the
    legacy -> switchdev -> legacy cycle)
  SR-IOV: VFs unaffected because ice_set_promisc() early-returns
    for non-PF VSI and VF representors do not register
    ndo_vlan_rx_add_vid

Fixes: 1273f89578f2 ("ice: Fix broken IFF_ALLMULTI handling")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Petr Oros <poros@redhat.com>
---
v2:
- No functional changes; collected the Reviewed-by.

v1: https://lore.kernel.org/all/89efbea9831175e6f57e9fe8557f7a0e48e050b7.1781786935.git.poros@redhat.com/
---
 drivers/net/ethernet/intel/ice/ice_main.c | 90 ++++++++++++++++++-----
 1 file changed, 70 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 6d24056c247cf4..af8df81fc45623 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -274,7 +274,8 @@ static int ice_set_promisc(struct ice_vsi *vsi, u8 promisc_m)
 	if (vsi->type != ICE_VSI_PF)
 		return 0;
 
-	if (ice_vsi_has_non_zero_vlans(vsi)) {
+	/* skip per-VID expansion; the DFLT Rx rule already covers every VID */
+	if (ice_vsi_has_non_zero_vlans(vsi) && !ice_is_vsi_dflt_vsi(vsi)) {
 		promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
 		status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw, vsi,
 						       promisc_m);
@@ -304,9 +305,19 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
 		return 0;
 
 	if (ice_vsi_has_non_zero_vlans(vsi)) {
-		promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
+		int vid0_status;
+
+		/* set time used either recipe (per-VID PROMISC_VLAN, or vid=0
+		 * PROMISC via the ice_set_promisc() else branch), so clear
+		 * both; clearing an absent rule succeeds
+		 */
 		status = ice_fltr_clear_vlan_vsi_promisc(&vsi->back->hw, vsi,
-							 promisc_m);
+				promisc_m | ICE_PROMISC_VLAN_RX |
+				ICE_PROMISC_VLAN_TX);
+		vid0_status = ice_fltr_clear_vsi_promisc(&vsi->back->hw,
+							 vsi->idx, promisc_m, 0);
+		if (!status)
+			status = vid0_status;
 	} else {
 		status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
 						    promisc_m, 0);
@@ -317,6 +328,49 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
 	return status;
 }
 
+/**
+ * ice_vsi_exit_dflt_promisc - drop the default VSI Rx rule on promisc off
+ * @vsi: the VSI leaving promiscuous mode
+ *
+ * For an IFF_ALLMULTI VSI with VLANs the per-VID multicast rules are
+ * reinstated before the default rule is cleared so coverage never lapses;
+ * the then redundant vid=0 rule is dropped best-effort. The callees log
+ * their own failures, so error returns are not re-logged here.
+ *
+ * Return: 0 on success, negative on error with the default rule left in place.
+ */
+static int ice_vsi_exit_dflt_promisc(struct ice_vsi *vsi)
+{
+	struct ice_vsi_vlan_ops *vlan_ops = ice_get_compat_vsi_vlan_ops(vsi);
+	struct net_device *netdev = vsi->netdev;
+	struct ice_hw *hw = &vsi->back->hw;
+	bool restore_mc;
+	int err;
+
+	restore_mc = (vsi->current_netdev_flags & IFF_ALLMULTI) &&
+		     ice_vsi_has_non_zero_vlans(vsi);
+
+	if (restore_mc) {
+		err = ice_fltr_set_vlan_vsi_promisc(hw, vsi,
+						    ICE_MCAST_VLAN_PROMISC_BITS);
+		if (err && err != -EEXIST)
+			return err;
+	}
+
+	err = ice_clear_dflt_vsi(vsi);
+	if (err)
+		return err;
+
+	if (netdev->features & NETIF_F_HW_VLAN_CTAG_FILTER)
+		vlan_ops->ena_rx_filtering(vsi);
+
+	if (restore_mc)
+		ice_fltr_clear_vsi_promisc(hw, vsi->idx, ICE_MCAST_PROMISC_BITS,
+					   0);
+
+	return 0;
+}
+
 /**
  * ice_vsi_sync_fltr - Update the VSI filter list to the HW
  * @vsi: ptr to the VSI
@@ -442,17 +496,12 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
 		} else {
 			/* Clear Rx filter to remove traffic from wire */
 			if (ice_is_vsi_dflt_vsi(vsi)) {
-				err = ice_clear_dflt_vsi(vsi);
+				err = ice_vsi_exit_dflt_promisc(vsi);
 				if (err) {
-					netdev_err(netdev, "Error %d clearing default VSI %i Rx rule\n",
-						   err, vsi->vsi_num);
 					vsi->current_netdev_flags |=
 						IFF_PROMISC;
 					goto out_promisc;
 				}
-				if (vsi->netdev->features &
-				    NETIF_F_HW_VLAN_CTAG_FILTER)
-					vlan_ops->ena_rx_filtering(vsi);
 			}
 
 			/* disable allmulti here, but only if allmulti is not
@@ -3676,10 +3725,9 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
 	while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
 		usleep_range(1000, 2000);
 
-	/* Add multicast promisc rule for the VLAN ID to be added if
-	 * all-multicast is currently enabled.
-	 */
-	if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+	/* skip the per-VID rule when the DFLT Rx rule already covers this VID */
+	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+	    !ice_is_vsi_dflt_vsi(vsi)) {
 		ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
 					       ICE_MCAST_VLAN_PROMISC_BITS,
 					       vid);
@@ -3697,11 +3745,12 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
 	if (ret)
 		goto finish;
 
-	/* If all-multicast is currently enabled and this VLAN ID is only one
-	 * besides VLAN-0 we have to update look-up type of multicast promisc
-	 * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
+	/* On the first non-zero VLAN, promote the VLAN-0 multicast promisc
+	 * rule from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN. Skip when
+	 * the DFLT Rx rule is installed; it already covers every VID.
 	 */
 	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+	    !ice_is_vsi_dflt_vsi(vsi) &&
 	    ice_vsi_num_non_zero_vlans(vsi) == 1) {
 		ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
 					   ICE_MCAST_PROMISC_BITS, 0);
@@ -3764,11 +3813,12 @@ int ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
 					   ICE_MCAST_VLAN_PROMISC_BITS, vid);
 
 	if (!ice_vsi_has_non_zero_vlans(vsi)) {
-		/* Update look-up type of multicast promisc rule for VLAN 0
-		 * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
-		 * all-multicast is enabled and VLAN 0 is the only VLAN rule.
+		/* Last non-zero VLAN gone: demote the VLAN-0 multicast promisc
+		 * rule back to ICE_SW_LKUP_PROMISC. Skip when the DFLT Rx rule
+		 * is installed; no recipe swap is needed.
 		 */
-		if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+		if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+		    !ice_is_vsi_dflt_vsi(vsi)) {
 			ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
 						   ICE_MCAST_VLAN_PROMISC_BITS,
 						   0);
-- 
2.53.0


^ permalink raw reply related

* [PATCH iwl-net v2 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Petr Oros @ 2026-06-22 11:34 UTC (permalink / raw)
  To: netdev; +Cc: Petr Oros
In-Reply-To: <20260622113428.2565255-1-poros@redhat.com>

ice_eswitch_setup_env() calls ice_set_dflt_vsi() to install the
ICE_SW_LKUP_DFLT Rx rule on the uplink VSI. The helper returns 0 even
when the rule is already in place, so the call is a no-op if
ice_vsi_sync_fltr() had previously installed the DFLT rule in response
to IFF_PROMISC on the uplink netdev. ice_remove_vsi_fltr(), called
earlier in ice_eswitch_setup_env(), does not affect this rule because
ice_remove_vsi_lkup_fltr() lacks a case for ICE_SW_LKUP_DFLT and falls
into its default branch, which only logs. Switchdev mode then adds an
ICE_FLTR_TX leg via ice_cfg_dflt_vsi() on the same VSI handle.

ice_eswitch_release_env() unconditionally removed both the Rx and Tx
DFLT rules. When the Rx DFLT was installed by ice_vsi_sync_fltr()
before the switchdev session started, this clobbered promisc state the
operator had asked for: the DFLT Rx rule disappeared while IFF_PROMISC
was still set on the netdev, and the IFF_PROMISC sync path was not
retriggered, so the uplink ended the session without the catch-all
rule the netdev flags requested.

Skip the Rx DFLT removal when the uplink is promiscuous, both in
ice_eswitch_release_env() and in the err_def_tx unwind of
ice_eswitch_setup_env(). The Tx leg installed by switchdev is always
removed since switchdev owns it.

Test the live netdev->flags for this decision. The ena_rx_filtering()
call right above in ice_eswitch_release_env() reaches
ice_cfg_vlan_pruning(), which already keys on the live netdev->flags
IFF_PROMISC bit, so reusing the same value keeps the preserved DFLT
rule and the VLAN pruning state mutually consistent across every
promisc transition, including one the operator made while switchdev
ran: ice_set_rx_mode() is gated off for the uplink during the session,
so such a change never reaches the filter sync, but it is reflected in
netdev->flags and is therefore honored here on release.

Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
Signed-off-by: Petr Oros <poros@redhat.com>
---
v2:
- Reworked the fix to avoid the service task entirely. v1 scheduled a
  filter sync in ice_eswitch_disable_switchdev() to reconcile the uplink
  DFLT Rx rule; that work could run after ice_remove() freed the uplink
  VSI (use-after-free) and was not guaranteed to fire if ice_set_rx_mode()
  never ran again. v2 keeps or drops the DFLT Rx rule synchronously in
  ice_eswitch_release_env() (and the setup_env error unwind) by testing
  the live netdev->flags IFF_PROMISC, the same value ice_cfg_vlan_pruning()
  already keys on, so the preserved rule and the pruning state stay
  consistent. No service task is scheduled and no symbol is exported.
- Dropped the Reviewed-by since the fix mechanism changed.

v1: https://lore.kernel.org/all/deef5756e534ef06c12d910c5305d3fd205d30a0.1781786935.git.poros@redhat.com/
---
 drivers/net/ethernet/intel/ice/ice_eswitch.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f77..48273ef9f69dc8 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -66,8 +66,10 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
 	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
 			 ICE_FLTR_TX);
 err_def_tx:
-	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
-			 ICE_FLTR_RX);
+	/* keep the Rx DFLT rule if the uplink is promiscuous (see release_env) */
+	if (!(uplink_vsi->netdev->flags & IFF_PROMISC))
+		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
+				 false, ICE_FLTR_RX);
 err_def_rx:
 	ice_vsi_del_vlan_zero(uplink_vsi);
 err_vlan_zero:
@@ -278,8 +280,16 @@ static void ice_eswitch_release_env(struct ice_pf *pf)
 	vlan_ops->ena_rx_filtering(uplink_vsi);
 	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
 			 ICE_FLTR_TX);
-	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
-			 ICE_FLTR_RX);
+
+	/* Keep the Rx DFLT rule if the uplink is promiscuous; it must outlive
+	 * the session. Test the live netdev->flags, the same value
+	 * ena_rx_filtering() -> ice_cfg_vlan_pruning() above keys its decision
+	 * on, so the preserved DFLT rule and the pruning state stay consistent.
+	 */
+	if (!(uplink_vsi->netdev->flags & IFF_PROMISC))
+		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
+				 false, ICE_FLTR_RX);
+
 	ice_fltr_add_mac_and_broadcast(uplink_vsi,
 				       uplink_vsi->port_info->mac.perm_addr,
 				       ICE_FWD_TO_VSI);
-- 
2.53.0


^ permalink raw reply related

