Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: the confusing 10000base_CR. Shouldn't it be 10000_SFI_DA?
From: Andrew Lunn @ 2026-06-26 18:18 UTC (permalink / raw)
  To: D H, Siddaraju
  Cc: netdev@vger.kernel.org, Das, Shubham, Chintalapalle, Balaji,
	Srinivasan, Vijay
In-Reply-To: <MW4PR11MB6912BE7FEDE5C509B6C5D5DF9AEB2@MW4PR11MB6912.namprd11.prod.outlook.com>

On Fri, Jun 26, 2026 at 06:12:15PM +0000, D H, Siddaraju wrote:
> Sorry Andrew, carry-forwarded IEEE's PMD naming convention of separating media "-CR" :D. Will stick to Ethtool :).
> I was referring to "ETHTOOL_LINK_MODE_10000baseCR_FULL_Full_BIT" and its associated documentations Andrew.

Please don't top post. And set your mail client to wrap to around 75
characters. All standard netiquette things.

It is a good idea to be specific when asking questions. Asking about
something which does not exist in the kernel just makes it a guessing
game.

As Maxime pointed out, this is uAPI, so cannot be changed.

	Andrew

^ permalink raw reply

* RE: the confusing 10000base_CR. Shouldn't it be 10000_SFI_DA?
From: D H, Siddaraju @ 2026-06-26 18:12 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev@vger.kernel.org, Das, Shubham, Chintalapalle, Balaji,
	Srinivasan, Vijay
In-Reply-To: <2994edfc-0894-49c9-abac-6daea5e12075@lunn.ch>

Sorry Andrew, carry-forwarded IEEE's PMD naming convention of separating media "-CR" :D. Will stick to Ethtool :).
I was referring to "ETHTOOL_LINK_MODE_10000baseCR_FULL_Full_BIT" and its associated documentations Andrew.

- Thank you,
Siddaraju D H

-----Original Message-----
From: Andrew Lunn <andrew@lunn.ch> 
Sent: Friday, June 26, 2026 8:02 PM
To: D H, Siddaraju <siddaraju.dh@intel.com>
Cc: netdev@vger.kernel.org; Das, Shubham <shubham.das@intel.com>; Chintalapalle, Balaji <balaji.chintalapalle@intel.com>; Srinivasan, Vijay <vijay.srinivasan@intel.com>
Subject: Re: the confusing 10000base_CR. Shouldn't it be 10000_SFI_DA?

On Fri, Jun 26, 2026 at 02:15:27PM +0000, D H, Siddaraju wrote:
> Hi Linux Ethernet Team,
> 
> We explored Ethtool's "10000base_CR" PMD PHY type and we think it 
> might be just a wrong name.

~/linux$ grep -r 10000Base_CR
~/linux$ 

~/ethtool$ grep -r 10000Base_CR
~/ethtool$ 

What exactly do you mean by 10000Base_CR?

     Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: John Fastabend @ 2026-06-26 18:04 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Alexei Starovoitov, Jiayuan Chen, Amery Hung, Kuniyuki Iwashima,
	bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Network Development, kernel-team
In-Reply-To: <87cxxey09d.fsf@cloudflare.com>

On Thu, Jun 25, 2026 at 07:53:50PM +0200, Jakub Sitnicki wrote:
>On Wed, Jun 24, 2026 at 01:57 PM -07, Alexei Starovoitov wrote:
>> On Tue Jun 23, 2026 at 6:32 PM PDT, Jiayuan Chen wrote:
>>>
>>> Hi Alexei and Jakub,
>>>
>>> skmsg is actually still pretty useful for gateways.
>>> I started with bpf by integrating skmsg into nginx as a module and envoy
>>> has something similar.
>>> The usual setup is cgroup/sk for L4 bypass (reject SYN), and skmsg for
>>> L7, redirecting
>>> between local apps by looking at the payload. So there are real users.

Interesting.

>>
>> ...
>>
>>> Agree, just like we remove skmsg from KTLS which is rarely used.
>>
>> ...
>>
>>> Hope not have skmsg disabled by default.
>>
>> I wasn't suggesting to delete the whole skmsg,
>> but to disable combinations that are causing issues.
>> Like what was done for skmsg and ktls.
>> I'd allow plain tcp and udp sockets only.
>> Allowing unix sockets was fishy. I think we should reject it too.
>
>For unix & vsock we know Bytedance built a proxy using it.
>We've been showcasing it as one of sockmap use cases [1].
>That said, I don't know if it's still being used or not.
>
>If we don't want to go through the config-knob-then-deprecate process,
>then I guess the only option is to kill it and see if anyone complains.
>
>[1] Slide 117, https://github.com/sockmap-project/sockmap-project/blob/810d259af6e7a5793922af3991c9dc7ff502fe19/talks/2024-09%20-%20NDC%20TechTown%20-%20Splicing%20Sockets%20with%20SOCKMAP.pdf

Most the bugs I'm seeing are combinations of push/pop/pull/tail/head
calls over sk_msg scatter gather list no one ever considered. Or at
least I never considered. Add in additional socket combinations that
came later and we get bugs.

Do the nginx/envoy offloads manipulate the scatter gather list as
well?

Do we need all these helpers? At some point I thought I was going to
build a real kernel proxy with this, but never did it. Another option
would be better test framework.

.John

^ permalink raw reply

* Re: [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
From: Nabil S. Alramli @ 2026-06-26 18:02 UTC (permalink / raw)
  To: Dragos Tatulea, saeedm, tariqt, mbloch
  Cc: nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel
In-Reply-To: <9f150145-d95c-4a90-a358-5b33ab78a8ef@nvidia.com>

On 6/26/26 09:12, Dragos Tatulea wrote:
> 
> 
> On 25.06.26 19:40, Nabil S. Alramli wrote:
>> Under memory pressure, mlx5 driver has WARNING during fragmented page
>> release. This happens because there is a discrepency between what mlx5
>> thinks the page fragment counter is vs what the page_pool actually says it
>> is.
>>
> The mlx5 frag counter is not the same as pp_ref_count. The page gets
> split into 64 parts during page allocation. The frag counter tracks how
> many of those frags have been used.
> 

Thank you for explaining that to me. I thought it was the same because in
mlx5e_page_release_fragmented, drain_count is computed from frag_page->frags
and then passed into page_pool_unref_page as the nr parameter which is then
compared to the refcount in the page_pool.

>> The cause of the issue is page allocations on concurrent cpus, which
>> increment the non-atomic u16 page counter mlx5e_frag_page.frags, while at
>> the same time the page reference counter net_iov.pp_ref_count is atomically
>> incremented. That sometimes leads to a difference in the counts and
>> therefore triggers the warning in page_pool_unref_netmem:
>>
> page_pool page allocations must not happen in parallel on different CPUs.
> Each queue has its own page_pool and allocation happens within the NAPI of
> that queue which sticks to a single CPU. The release path does support
> releasing on another CPU (release to ring).
> 
> How did you encounter this scenario of having parallel allocations on
> different CPUs from the same page_pool?
> 

Perhaps we didn't, I had assumed that the numbers must match. What we did
encounter is these WARNINGs, and they seemed to go away with this patch but
maybe it is a coincidence.

>> ```
>> 	ret = atomic_long_sub_return(nr, pp_ref_count);
>> 	WARN_ON(ret < 0);
>> ```
>>
>> The actual stack trace looks like this:
>>
>> ```
>> WARNING: CPU: 37 PID: 447795 at include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
>> Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
>> Hardware name: *
>> RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
>> RSP: 0018:ffffc90019814d98 EFLAGS: 00010293
>> RAX: 000000000000003f RBX: ffff88c0993d0a10 RCX: ffffea02424592c0
>> RDX: 0000000000000001 RSI: ffffea02424592c0 RDI: ffff88c090e20000
>> RBP: 000000000000000a R08: 0000000000001409 R09: 0000000000000006
>> R10: 0000000000000000 R11: ffff88c095fbc040 R12: 000000000000141f
>> R13: 0000000000000009 R14: ffff88c090e20000 R15: 0000000000000001
>> FS:  00007f34149fa6c0(0000) GS:ffff89200fa40000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007ed0265eb000 CR3: 0000005091cbe000 CR4: 0000000000350ef0
>> Call Trace:
>>  <IRQ>
>>  mlx5e_free_rx_wqes+0x7b/0xa0 [mlx5_core]
>>  mlx5e_post_rx_wqes+0x1ac/0x5a0 [mlx5_core]
>>  mlx5e_napi_poll+0x5e5/0x6f0 [mlx5_core]
>>  __napi_poll+0x2b/0x1a0
>>  net_rx_action+0x30e/0x370
>>  ? sched_clock+0x9/0x10
>>  ? sched_clock_cpu+0xf/0x170
>>  handle_softirqs+0xe2/0x2a0
>>  common_interrupt+0x85/0xa0
>>  </IRQ>
>>  <TASK>
>>  asm_common_interrupt+0x26/0x40
>> RIP: 0010:page_counter_uncharge+0x34/0x90
>> RSP: 0018:ffffc900e728bb00 EFLAGS: 00000213
>> RAX: ffff88aff4762000 RBX: ffff88aff4762100 RCX: 0000000000000304
>> RDX: 0000000000000001 RSI: 00000000004e9e1a RDI: ffff88aff4762100
>> RBP: 0000000000000001 R08: ffff891ea0560048 R09: 00007ffffffff000
>> R10: 0000000000001000 R11: ffff891ae8061b00 R12: ffffffffffffffff
>> R13: ffff89107fcfd4c0 R14: ffff891ae8061b00 R15: ffff892002fe1400
>>  uncharge_batch+0x40/0xd0
>> ```
>>
> Can you provide more data on how you reproduced this? This helps to
> narrow down the bug. Reproduction steps would be ideal.
> 

I don't have clear steps to reproduce it, we just have seen it randomly on
some servers that were under memory pressure. I will try to look into it more
and find a way to reliably reproduce it. I agree that would be ideal to find a
proper fix.

>> The fix is to use an atomic page fragment counter, so it will always match
>> the number of references held in the page_pool.
>>
> This is not the right fix. The mlx5 page frag counter is not atomic
> on purpose because all changes to it happen only within the NAPI
> context.
> 

That was a question that I had, is it ever possible for frag_page->frags to be
incremented / set outside of NAPI context? I tried to answer that by looking
at code and by tracing it but could not get a clear picture. If it's not
possible then I agree, this is not the right fix.

> Thanks,
> Dragos

Thanks again for your guidance.

Nabil S. Alramli


^ permalink raw reply

* Re: [PATCH 2/2] net/sched: sch_multiq: Replace direct dequeue call with peek and qdisc_dequeue_peeked
From: Jamal Hadi Salim @ 2026-06-26 17:47 UTC (permalink / raw)
  To: Victor Nogueira
  Cc: hexlabsecurity, Vinicius Costa Gomes, Paolo Abeni, Jiri Pirko,
	Jakub Kicinski, David S. Miller, Eric Dumazet, Simon Horman,
	netdev, Jarek Poplawski, Vladimir Oltean, linux-kernel
In-Reply-To: <fc1b9963-2f69-4e47-bf59-6b668faedc0e@mojatatu.com>

On Fri, Jun 26, 2026 at 1:18 PM Victor Nogueira <victor@mojatatu.com> wrote:
>
> On 25/06/2026 06:51, Bryam Vargas via B4 Relay wrote:
> > From: Bryam Vargas <hexlabsecurity@proton.me>
> >
> > multiq_dequeue() takes a packet from a band's child with a direct
> > ->dequeue() call after multiq_peek() peeked it. When the child is
> > non-work-conserving the peek stashes the skb in the child's gso_skb, so
> > the direct dequeue returns a different skb and orphans the stash,
> > desyncing the child's qlen/backlog. With a qfq child reached through a
> > peeking parent (e.g. tbf) this re-enters the child on an emptied list and
> > dereferences NULL, panicking the kernel from softirq on ordinary egress.
> >
> > Take the packet through qdisc_dequeue_peeked(), as sch_prio already does
> > and as sch_red and sch_sfb were just fixed to do. The helper is a no-op
> > when the child has no stash, so a work-conserving child is unaffected.
> >
> > Fixes: 77be155cba4e ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>
> Reviewed-by: Victor Nogueira <victor@mojatatu.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH 1/2] net/sched: sch_taprio: Replace direct dequeue call with peek and qdisc_dequeue_peeked
From: Jamal Hadi Salim @ 2026-06-26 17:46 UTC (permalink / raw)
  To: Victor Nogueira
  Cc: hexlabsecurity, Vinicius Costa Gomes, Paolo Abeni, Jiri Pirko,
	Jakub Kicinski, David S. Miller, Eric Dumazet, Simon Horman,
	netdev, Jarek Poplawski, Vladimir Oltean, linux-kernel
In-Reply-To: <5d57e4ce-0460-4ad2-9891-1bc7d24020d8@mojatatu.com>

On Fri, Jun 26, 2026 at 1:16 PM Victor Nogueira <victor@mojatatu.com> wrote:
>
> On 25/06/2026 06:51, Bryam Vargas via B4 Relay wrote:
> > From: Bryam Vargas <hexlabsecurity@proton.me>
> >
> > When taprio's software path peeks a non-work-conserving child qdisc, the
> > child stashes the peeked skb in its gso_skb; taprio_dequeue_from_txq()
> > then takes the packet with a direct child ->dequeue() call, which ignores
> > that stash, orphans the peeked skb and desyncs the child's qlen/backlog.
> > With a qfq child this re-enters the child on an emptied list and
> > dereferences NULL, panicking the kernel from softirq on ordinary egress.
> >
> > Take the packet through qdisc_dequeue_peeked(), as sch_red and sch_sfb
> > now do. The helper returns the child's stashed skb first and is a no-op
> > when there is none, so a work-conserving child is unaffected and the
> > gated path now consumes the skb whose length was charged to the budget.
> >
> > Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
> > Cc: stable@vger.kernel.org
> > Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
> > Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>
> Reviewed-by: Victor Nogueira <victor@mojatatu.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [BUG] netdevsim: KASAN slab-use-after-free in ref_tracker_free
From: Shuangpeng @ 2026-06-26 17:41 UTC (permalink / raw)
  To: saeed bishara
  Cc: netdev, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-kernel
In-Reply-To: <CAHfVqdXqZkrp5Dm0a8BJU0Tbc_T09A=e02PenwxrZFDMQrEmyQ@mail.gmail.com>



> On Jun 16, 2026, at 05:34, saeed bishara <saeed.bishara.os@gmail.com> wrote:
> 
> I tried gemini, here its analysis and a fix suggestion:
> 
> This is a brilliant, subtle bug. Let's break this down with a rigorous
> audit of the Linux network device refcounting architecture to see
> exactly why this Use-After-Free is occurring.
> The root cause is a classic "Reverse-Teardown Fallacy" colliding with
> the kernel's dual-refcount lifecycle for network devices (pcpu_refcnt
> vs. kobject refcount).
> 
> The Dual-Lifecycle Trap
> In modern kernel networking, a struct net_device is kept alive by two
> distinct mechanisms that must be carefully orchestrated:
> dev->pcpu_refcnt (Operational Lifecycle): Managed by dev_hold() and
> dev_put(). This tracks active operational references. The
> unregister_netdevice sequence blocks in netdev_run_todo() waiting for
> this to hit zero before it invokes free_netdev().
> dev->dev.kobj (Memory Lifecycle): The underlying memory structure
> manager. free_netdev() drops a reference to this. When the kobject
> refcount hits 0, device_release() is synchronously triggered, and the
> kmalloc-cg-8k slab is actually freed.
> 
> The Exact Race Sequence
> Look at the teardown order in the error path of sysfs_rtnl_lock():
> 
> unbreak:
>    sysfs_unbreak_active_protection(kn);  // LINE 121
>    dev_put(ndev);                        // LINE 122
> 
> Normally, developers write error paths in the exact reverse order of
> initialization. However, in this specific case, adhering to that
> convention is fatal.
> The Setup: sysfs_rtnl_lock starts by calling dev_hold(ndev),
> incrementing the pcpu_refcnt. It then calls
> sysfs_break_active_protection(), which explicitly calls kobject_get(),
> elevating the kobj memory refcount.
> The Unregister Race: In parallel, another CPU initiates device
> teardown (e.g., echo 1 > del_device on netdevsim). The sysfs entries
> are deleted, and device_del() drops its primary kobj reference. The
> teardown thread hits netdev_run_todo() and blocks, waiting for
> sysfs_rtnl_lock to release its dev_hold.
> The Lock Fails: sysfs_rtnl_lock fails to get the rtnl lock and jumps to unbreak.
> The Fatal Drop (Line 121): sysfs_unbreak_active_protection(kn)
> executes. It calls kobject_put(). Because the unregistration path
> already dropped the standard kobject references, this is the absolute
> last kobject reference. device_release() is synchronously invoked, and
> the net_device memory is instantly freed to the slab.
> The KASAN Trigger (Line 122): dev_put(ndev) executes. Under the hood,
> it attempts to call netdev_tracker_free(&ndev->ref_tracker, ...). It
> reads from the ndev structure that was just destroyed microseconds
> earlier. KASAN screams.
> 
> To fix this, the teardown convention must be deliberately violated to
> respect the underlying memory dependencies.
> 

Thanks for the analysis and patch suggestion.

I tested the patch together with the local delay/debug patch that was
used to trigger the original report.  I rebuilt the kernel, booted a
fresh VM, and reran the same netdevsim reproducer.

With the patched kernel, the original KASAN UAF was not reproduced in
this run:

  BUG: KASAN: slab-use-after-free in ref_tracker_free

However, the same rx-0 create_dir failure path was still reached, and
the run still produced a ref tracker warning:

  kobject_add_internal create_dir failed for rx-0 ...
  WARNING: lib/ref_tracker.c:295 at ref_tracker_free
  sysfs_rtnl_lock
  phys_port_name_show


> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index a1b2c3d4e5f6..7f8e9d0c1b2a 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -118,8 +118,8 @@ static int sysfs_rtnl_lock(struct kobject *kobj,
> struct attribute *attr,
>  return 0;
> 
> unbreak:
> - sysfs_unbreak_active_protection(kn);
>  dev_put(ndev);
> + sysfs_unbreak_active_protection(kn);
>  return ret;
> }
> 
> 
> On Mon, Jun 15, 2026 at 4:18 AM Shuangpeng Bai
> <shuangpeng.kernel@gmail.com> wrote:
>> 
>> Hi netdev maintainers,
>> 
>> I hit the following KASAN report while testing an upstream kernel.
>> 
>> The issue was reproduced with netdevsim. I have not confirmed whether this is
>> specific to netdevsim or whether other net devices can trigger a similar issue.
>> 
>> The KASAN report shows a slab-use-after-free in ref_tracker_free(), reached from
>> sysfs_rtnl_lock() while reading phys_port_name.
>> 
>> I reproduced this on commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7 (May 25 2026)
>> 
>> To help trigger the bug more reliably, we applied a minimal diagnostic patch
>> that only adds delays and print statements.
>> 
>> The reproducer and .config files are here.
>> https://gist.github.com/shuangpengbai/b49765d646ec4610917015371aa1c3ca
>> 
>> I'm happy to test debug patches or provide additional information.
>> 
>> Reported-by: Shuangpeng Bai <shuangpeng.kernel@gmail.com>
>> 
>> [ 3145.449971][T17497] BUG: KASAN: slab-use-after-free in ref_tracker_free (lib/ref_tracker.c:295)
>> [ 3145.452089][T17497] Read of size 1 at addr ffff888107678598 by task cat/17497
>> [ 3145.454439][T17497]
>> [ 3145.454977][T17497] Tainted: [W]=WARN
>> [ 3145.454980][T17497] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> [ 3145.454985][T17497] Call Trace:
>> [ 3145.454991][T17497]  <TASK>
>> [ 3145.454994][T17497]  dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
>> [ 3145.455002][T17497]  print_report (mm/kasan/report.c:378 mm/kasan/report.c:482)
>> [ 3145.455028][T17497]  kasan_report (mm/kasan/report.c:595)
>> [ 3145.455046][T17497]  ref_tracker_free (lib/ref_tracker.c:295)
>> [ 3145.455083][T17497]  sysfs_rtnl_lock (include/linux/netdevice.h:4491 include/linux/netdevice.h:4508 include/linux/netdevice.h:4534 net/core/net-sysfs.c:122)
>> [ 3145.455091][T17497]  phys_port_name_show (net/core/net-sysfs.c:665)
>> [ 3145.455118][T17497]  dev_attr_show (drivers/base/core.c:2421)
>> [ 3145.455128][T17497]  sysfs_kf_seq_show (fs/sysfs/file.c:65)
>> [ 3145.455135][T17497]  seq_read_iter (fs/seq_file.c:231)
>> [ 3145.455144][T17497]  vfs_read (fs/read_write.c:493 fs/read_write.c:574)
>> [ 3145.455169][T17497]  ksys_read (fs/read_write.c:717)
>> [ 3145.455181][T17497]  do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
>> [ 3145.455188][T17497]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
>> [ 3145.455193][T17497] RIP: 0033:0x7fcf098c43ce
>> [ 3145.455200][T17497] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 08 0b 00 e8 69 01 02 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
>> [ 3145.455204][T17497] RSP: 002b:00007ffd05e76b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [ 3145.455211][T17497] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fcf098c43ce
>> [ 3145.455214][T17497] RDX: 0000000000020000 RSI: 00007fcf095e4000 RDI: 0000000000000003
>> [ 3145.455217][T17497] RBP: 00007fcf095e4000 R08: 00007fcf095e3010 R09: 0000000000000000
>> [ 3145.455219][T17497] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
>> [ 3145.455222][T17497] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
>> [ 3145.455227][T17497]  </TASK>
>> [ 3145.455229][T17497]
>> [ 3145.479014][T17497] Freed by task 17497 on cpu 0 at 3145.447575s:
>> [ 3145.479559][T17497]  kasan_save_track (mm/kasan/common.c:57 mm/kasan/common.c:78)
>> [ 3145.479963][T17497]  kasan_save_free_info (mm/kasan/generic.c:584)
>> [ 3145.480411][T17497]  __kasan_slab_free (mm/kasan/common.c:253 mm/kasan/common.c:285)
>> [ 3145.480813][T17497]  kfree (include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6251 mm/slub.c:6566)
>> [ 3145.481148][T17497]  device_release (drivers/base/core.c:2542)
>> [ 3145.481567][T17497]  kobject_put (lib/kobject.c:689 lib/kobject.c:720 include/linux/kref.h:65 lib/kobject.c:737)
>> [ 3145.481951][T17497]  sysfs_rtnl_lock (net/core/net-sysfs.c:121)
>> [ 3145.482351][T17497]  phys_port_name_show (net/core/net-sysfs.c:665)
>> [ 3145.482782][T17497]  dev_attr_show (drivers/base/core.c:2421)
>> [ 3145.483154][T17497]  sysfs_kf_seq_show (fs/sysfs/file.c:65)
>> [ 3145.483586][T17497]  seq_read_iter (fs/seq_file.c:231)
>> [ 3145.483975][T17497]  vfs_read (fs/read_write.c:493 fs/read_write.c:574)
>> [ 3145.484334][T17497]  ksys_read (fs/read_write.c:717)
>> [ 3145.484701][T17497]  do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
>> [ 3145.485092][T17497]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
>> [ 3145.485592][T17497]
>> [ 3145.485794][T17497] The buggy address belongs to the object at ffff888107678000
>> [ 3145.485794][T17497]  which belongs to the cache kmalloc-cg-8k of size 8192
>> [ 3145.486991][T17497] The buggy address is located 1432 bytes inside of
>> [ 3145.486991][T17497]  freed 8192-byte region [ffff888107678000, ffff88810767a000)
>> [ 3145.488159][T17497]
>> [ 3145.488367][T17497] The buggy address belongs to the physical page:
>> 
>> 
>> Best,
>> Shuangpeng
>> 


^ permalink raw reply

* Re: [PATCH net v2 1/1] net: sched: ets: avoid deficit wrap and bound empty dequeue rounds
From: Yuan Tan @ 2026-06-26 17:34 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Ren Wei, netdev, jiri, davem, petrm, yifanwucs, tomapufckgml,
	zcliangcn, bird, bronzed_45_vested
In-Reply-To: <CAM0EoM=FvNKemH4Z5dL_pFQZcGWxLb4c_AMahOePLY7s6Pkhcg@mail.gmail.com>

On Fri, Jun 26, 2026 at 2:54 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Fri, Jun 26, 2026 at 4:32 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
> >
> > From: Wyatt Feng <bronzed_45_vested@icloud.com>
> >
> > ETS keeps each DRR-style deficit in a u32 and replenishes it with
> > the configured quantum whenever the head packet is too large. Both
> > the quantum and qdisc_pkt_len() are user-controlled inputs: a large
> > quantum can wrap the deficit counter, while a tiny quantum combined
> > with an inflated qdisc_pkt_len() can force billions of iterations in
> > softirq context before any packet becomes eligible.
> >
> > Store the deficit in u64 so replenishment cannot wrap the counter.
> > This keeps the existing dequeue logic unchanged while fixing the
> > overflow condition.
> >
> > Bound one dequeue attempt to at most nbands * 2 ETS rotations, as
> > suggested in review. This avoids the livelock without adding heavier
> > logic to the fast path.
> >
> > Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc")
> > Cc: stable@vger.kernel.org
> > Reported-by: Yuan Tan <yuantan098@gmail.com>
> > Reported-by: Yifan Wu <yifanwucs@gmail.com>
> > Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> > Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
> > Reported-by: Xin Liu <bird@lzu.edu.cn>
> > Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > Assisted-by: Codex:GPT-5.4
> > Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
> > Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
>
> Note, you did not Cc many maintainers. Next time, or if you have to
> resend this patch make sure you Cc the stakeholders (see
> scripts/get_maintainers.pl)

Thank you very much for your advice. We were previously concerned
about bothering too many maintainers, so we used the pattern-depth=1
option when running scripts/get_maintainers.pl. We’ll relax this
parameter a bit in future submissions.

>
> cheers,
> jamal
>
> > ---
> > changes in v2:
> >   - Instead of doing a div() in the fast path, simply bound the loop per
> >     dequeue
> >   - v1 Link: https://lore.kernel.org/all/20260615103759.2404228-2-n05ec@lzu.edu.cn/
> >
> >
> >  net/sched/sch_ets.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
> > index cb8cf437ce87..12a156ccb0a6 100644
> > --- a/net/sched/sch_ets.c
> > +++ b/net/sched/sch_ets.c
> > @@ -40,7 +40,7 @@ struct ets_class {
> >         struct list_head alist; /* In struct ets_sched.active. */
> >         struct Qdisc *qdisc;
> >         u32 quantum;
> > -       u32 deficit;
> > +       u64 deficit;
> >         struct gnet_stats_basic_sync bstats;
> >         struct gnet_stats_queue qstats;
> >  };
> > @@ -463,6 +463,8 @@ ets_qdisc_dequeue_skb(struct Qdisc *sch, struct sk_buff *skb)
> >  static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
> >  {
> >         struct ets_sched *q = qdisc_priv(sch);
> > +       unsigned int max_loops = READ_ONCE(q->nbands) * 2;
> > +       unsigned int loops = 0;
> >         struct ets_class *cl;
> >         struct sk_buff *skb;
> >         unsigned int band;
> > @@ -499,6 +501,8 @@ static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
> >
> >                 cl->deficit += READ_ONCE(cl->quantum);
> >                 list_move_tail(&cl->alist, &q->active);
> > +               if (++loops > max_loops)
> > +                       goto out;
> >         }
> >  out:
> >         return NULL;
> > --
> > 2.47.3
> >

^ permalink raw reply

* Re: [PATCH 2/2] net/sched: sch_multiq: Replace direct dequeue call with peek and qdisc_dequeue_peeked
From: Victor Nogueira @ 2026-06-26 17:18 UTC (permalink / raw)
  To: hexlabsecurity, Vinicius Costa Gomes, Paolo Abeni,
	Jamal Hadi Salim, Jiri Pirko, Jakub Kicinski, David S. Miller,
	Eric Dumazet
  Cc: Simon Horman, netdev, Jarek Poplawski, Vladimir Oltean,
	linux-kernel
In-Reply-To: <20260625-b4-disp-31bcb279-v1-2-85c40b83c529@proton.me>

On 25/06/2026 06:51, Bryam Vargas via B4 Relay wrote:
> From: Bryam Vargas <hexlabsecurity@proton.me>
> 
> multiq_dequeue() takes a packet from a band's child with a direct
> ->dequeue() call after multiq_peek() peeked it. When the child is
> non-work-conserving the peek stashes the skb in the child's gso_skb, so
> the direct dequeue returns a different skb and orphans the stash,
> desyncing the child's qlen/backlog. With a qfq child reached through a
> peeking parent (e.g. tbf) this re-enters the child on an emptied list and
> dereferences NULL, panicking the kernel from softirq on ordinary egress.
> 
> Take the packet through qdisc_dequeue_peeked(), as sch_prio already does
> and as sch_red and sch_sfb were just fixed to do. The helper is a no-op
> when the child has no stash, so a work-conserving child is unaffected.
> 
> Fixes: 77be155cba4e ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>

Reviewed-by: Victor Nogueira <victor@mojatatu.com>

^ permalink raw reply

* Re: [PATCH 1/2] net/sched: sch_taprio: Replace direct dequeue call with peek and qdisc_dequeue_peeked
From: Victor Nogueira @ 2026-06-26 17:16 UTC (permalink / raw)
  To: hexlabsecurity, Vinicius Costa Gomes, Paolo Abeni,
	Jamal Hadi Salim, Jiri Pirko, Jakub Kicinski, David S. Miller,
	Eric Dumazet
  Cc: Simon Horman, netdev, Jarek Poplawski, Vladimir Oltean,
	linux-kernel
In-Reply-To: <20260625-b4-disp-31bcb279-v1-1-85c40b83c529@proton.me>

On 25/06/2026 06:51, Bryam Vargas via B4 Relay wrote:
> From: Bryam Vargas <hexlabsecurity@proton.me>
> 
> When taprio's software path peeks a non-work-conserving child qdisc, the
> child stashes the peeked skb in its gso_skb; taprio_dequeue_from_txq()
> then takes the packet with a direct child ->dequeue() call, which ignores
> that stash, orphans the peeked skb and desyncs the child's qlen/backlog.
> With a qfq child this re-enters the child on an emptied list and
> dereferences NULL, panicking the kernel from softirq on ordinary egress.
> 
> Take the packet through qdisc_dequeue_peeked(), as sch_red and sch_sfb
> now do. The helper returns the child's stashed skb first and is a no-op
> when there is none, so a work-conserving child is unaffected and the
> gated path now consumes the skb whose length was charged to the budget.
> 
> Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
> Cc: stable@vger.kernel.org
> Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>

Reviewed-by: Victor Nogueira <victor@mojatatu.com>

^ permalink raw reply

* Re: [PATCH v8 02/14] firmware: qcom_scm: Migrate to generic PAS service
From: Julian Braha @ 2026-06-26 17:05 UTC (permalink / raw)
  To: Sumit Garg, andersson
  Cc: linux-arm-msm, dri-devel, freedreno, linux-media, netdev,
	linux-wireless, ath12k, linux-remoteproc, konradybcio, robh,
	krzk+dt, conor+dt, robin.clark, sean, akhilpo, lumag,
	abhinav.kumar, jesszhan0024, marijn.suijten, airlied, simona,
	vikash.garodia, bod, mchehab, elder, andrew+netdev, davem,
	edumazet, kuba, pabeni, jjohnson, mathieu.poirier,
	trilokkumar.soni, mukesh.ojha, pavan.kondeti, jorge.ramirez,
	tonyh, vignesh.viswanathan, srinivas.kandagatla, amirreza.zarrabi,
	jens.wiklander, op-tee, apurupa, skare, linux-kernel, Sumit Garg,
	Harshal Dev
In-Reply-To: <20260626133440.692849-3-sumit.garg@kernel.org>

Hi Sumit,

On 6/26/26 14:34, Sumit Garg wrote:

>  config QCOM_SCM
> +	tristate "Qualcomm PAS SCM interface driver"
> +	select QCOM_PAS
>  	select QCOM_TZMEM
> -	tristate
I think QCOM_SCM is missing a 'select IRQ_DOMAIN'. Right now I get a
build error without it:

drivers/firmware/qcom/qcom_scm.c: In function ‘qcom_scm_get_waitq_irq’:
  drivers/firmware/qcom/qcom_scm.c:2512:16: error: implicit declaration
of function ‘irq_create_fwspec_mapping’; did you mean
‘irq_create_of_mapping’? [-Wimplicit-function-declaration]
   2512 |         return irq_create_fwspec_mapping(&fwspec);
        |                ^~~~~~~~~~~~~~~~~~~~~~~~~
        |                irq_create_of_mapping

- Julian Braha

^ permalink raw reply

* [PATCH net-next v3 2/3] net: pse-pd: fire lifecycle events on controller register/unregister
From: Carlo Szelinsky @ 2026-06-26 16:59 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, Simon Horman, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260626165929.2908782-1-github@szelinsky.de>

From: Corey Leavitt <corey@leavitt.info>

Hook the newly-introduced pse_controller_notifier chain so that
pse_controller_register() fires PSE_REGISTERED after the controller
has been added to pse_controller_list (i.e. is now resolvable by
of_pse_control_get()), and pse_controller_unregister() fires
PSE_UNREGISTERED before the controller is removed from the list
(while it is still valid to dereference from a subscriber's
pse_control pointer targeting it).

With no subscribers yet, this is observably a no-op. A later change
wires the phy subsystem in as the first subscriber.

Signed-off-by: Corey Leavitt <corey@leavitt.info>
Signed-off-by: Carlo Szelinsky <github@szelinsky.de>
---
 drivers/net/pse-pd/pse_core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 3c4d09f1d6e4..0190fac33c78 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -1138,6 +1138,9 @@ int pse_controller_register(struct pse_controller_dev *pcdev)
 	list_add(&pcdev->list, &pse_controller_list);
 	mutex_unlock(&pse_list_mutex);
 
+	blocking_notifier_call_chain(&pse_controller_notifier,
+				     PSE_REGISTERED, pcdev);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(pse_controller_register);
@@ -1148,6 +1151,9 @@ EXPORT_SYMBOL_GPL(pse_controller_register);
  */
 void pse_controller_unregister(struct pse_controller_dev *pcdev)
 {
+	blocking_notifier_call_chain(&pse_controller_notifier,
+				     PSE_UNREGISTERED, pcdev);
+
 	pse_flush_pw_ds(pcdev);
 	pse_release_pis(pcdev);
 	if (pcdev->irq)
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v3 1/3] net: pse-pd: add notifier chain for controller lifecycle events
From: Carlo Szelinsky @ 2026-06-26 16:59 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, Simon Horman, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260626165929.2908782-1-github@szelinsky.de>

From: Corey Leavitt <corey@leavitt.info>

Introduce a blocking notifier chain that allows other subsystems to be
informed when a PSE controller is registered or unregistered, and
provide pse_register_notifier() / pse_unregister_notifier() as the
subscriber interface.

Subsequent patches will use this to let the phy subsystem own the
phydev->psec lifecycle directly, decoupling PSE lookup from
fwnode_mdiobus_register_phy() and removing the probe-time
-EPROBE_DEFER coupling that currently exists between mdio, phy and
pse-pd when the PSE controller driver is modular.

A blocking chain (rather than atomic) is used because callbacks will
take rtnl_lock and call back into pse_core via of_pse_control_get().

The enum pse_controller_event is placed outside the
IS_ENABLED(CONFIG_PSE_CONTROLLER) guard so that subscribers compiled
into a kernel without PSE support can still reference the event
values in dead-code paths without breaking the build.

This patch is pure infrastructure: nothing fires events yet, and
nothing subscribes. No observable behavior change.

Signed-off-by: Corey Leavitt <corey@leavitt.info>
Signed-off-by: Carlo Szelinsky <github@szelinsky.de>
---
 drivers/net/pse-pd/pse_core.c | 34 ++++++++++++++++++++++++++++++++++
 include/linux/pse-pd/pse.h    | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 69dbdbde9d71..3c4d09f1d6e4 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -8,6 +8,7 @@
 #include <linux/device.h>
 #include <linux/ethtool.h>
 #include <linux/ethtool_netlink.h>
+#include <linux/notifier.h>
 #include <linux/of.h>
 #include <linux/phy.h>
 #include <linux/pse-pd/pse.h>
@@ -23,6 +24,39 @@ static LIST_HEAD(pse_controller_list);
 static DEFINE_XARRAY_ALLOC(pse_pw_d_map);
 static DEFINE_MUTEX(pse_pw_d_mutex);
 
+static BLOCKING_NOTIFIER_HEAD(pse_controller_notifier);
+
+/**
+ * pse_register_notifier - register a callback for PSE controller events
+ * @nb: notifier block to register
+ *
+ * See enum pse_controller_event for events fired and their subscriber
+ * contract. Callbacks run in process context; they may sleep, take
+ * rtnl, and call of_pse_control_get(). The chain fires synchronously,
+ * so a PSE controller driver's probe/unbind path must not hold any
+ * such lock when calling pse_controller_register() or
+ * pse_controller_unregister().
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int pse_register_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&pse_controller_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(pse_register_notifier);
+
+/**
+ * pse_unregister_notifier - unregister a previously registered callback
+ * @nb: notifier block previously passed to pse_register_notifier()
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int pse_unregister_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&pse_controller_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(pse_unregister_notifier);
+
 /**
  * struct pse_control - a PSE control
  * @pcdev: a pointer to the PSE controller device
diff --git a/include/linux/pse-pd/pse.h b/include/linux/pse-pd/pse.h
index 4e5696cfade7..78fe3a2b1ea8 100644
--- a/include/linux/pse-pd/pse.h
+++ b/include/linux/pse-pd/pse.h
@@ -21,6 +21,7 @@ struct net_device;
 struct phy_device;
 struct pse_controller_dev;
 struct netlink_ext_ack;
+struct notifier_block;
 
 /* C33 PSE extended state and substate. */
 struct ethtool_c33_pse_ext_state_info {
@@ -337,6 +338,24 @@ enum pse_budget_eval_strategies {
 	PSE_BUDGET_EVAL_STRAT_DYNAMIC	= 1 << 2,
 };
 
+/**
+ * enum pse_controller_event - PSE controller lifecycle events
+ *
+ * Event data in callbacks is always a pointer to the struct
+ * pse_controller_dev firing the event.
+ *
+ * @PSE_REGISTERED: controller added to pse_controller_list and
+ *	resolvable by of_pse_control_get().
+ * @PSE_UNREGISTERED: controller about to be removed from
+ *	pse_controller_list. Subscribers holding pse_control references
+ *	targeting it must drop them before returning and must not
+ *	acquire new references for it.
+ */
+enum pse_controller_event {
+	PSE_REGISTERED,
+	PSE_UNREGISTERED,
+};
+
 #if IS_ENABLED(CONFIG_PSE_CONTROLLER)
 int pse_controller_register(struct pse_controller_dev *pcdev);
 void pse_controller_unregister(struct pse_controller_dev *pcdev);
@@ -366,6 +385,9 @@ int pse_ethtool_set_prio(struct pse_control *psec,
 bool pse_has_podl(struct pse_control *psec);
 bool pse_has_c33(struct pse_control *psec);
 
+int pse_register_notifier(struct notifier_block *nb);
+int pse_unregister_notifier(struct notifier_block *nb);
+
 #else
 
 static inline struct pse_control *of_pse_control_get(struct device_node *node,
@@ -416,6 +438,16 @@ static inline bool pse_has_c33(struct pse_control *psec)
 	return false;
 }
 
+static inline int pse_register_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int pse_unregister_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+
 #endif
 
 #endif
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v3 3/3] net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio hook
From: Carlo Szelinsky @ 2026-06-26 16:59 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, Simon Horman, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260626165929.2908782-1-github@szelinsky.de>

From: Corey Leavitt <corey@leavitt.info>

Transfer ownership of phydev->psec from fwnode_mdio to the phy
subsystem itself. The phy subsystem now subscribes to the pse-pd
notifier chain and manages psec attach/detach in response to PSE
controller lifecycle events, while fwnode_mdio loses its PSE awareness
entirely.

phydev->psec is attached after device_add() has made the phy visible
on mdio_bus_type, under a narrow rtnl_lock() that covers only
phy_try_attach_pse(). Ordering the attach after registration closes
the race that would otherwise leave a phy unattached: a PSE_REGISTERED
event firing during registration walks mdio_bus_type and either finds
the phy already added (and attaches it) or runs before device_add(),
in which case the post-add attach resolves it. The phydev->psec check
in phy_try_attach_pse() makes the two paths idempotent. Holding rtnl
across of_pse_control_get() is safe because pse_list_mutex is never
taken in the opposite order.

device_add() is deliberately left outside rtnl. Binding a phy that
itself provides an SFP cage reaches sfp_bus_add_upstream() through
phy_probe() -> phy_setup_ports() -> phy_sfp_probe(), and
sfp_bus_add_upstream() takes rtnl_lock(); holding rtnl across
device_add() would deadlock such phys (reported on RTL8214FC).

phy_device_register() is split into the public form, which takes the
narrow rtnl_lock() around the attach, and a phy_device_register_locked()
form for callers that already hold rtnl (the SFP module state machine
via __sfp_sm_event). This pair mirrors the register_netdevice() /
register_netdev() split convention already established in the core
networking stack. The _locked form runs device_add() under the
caller's rtnl, which is safe because a phy resident on an SFP module
does not itself provide a downstream cage, so phy_sfp_probe() is a
no-op there.

  - On PSE_REGISTERED: an rtnl-guarded bus walk retries the attach for
    every registered phy whose psec is still NULL. This is the "phy
    was enumerated before the PSE controller loaded" case, the root
    cause of the boot-time probe-retry storm on systems with a modular
    PSE controller driver.

  - On PSE_UNREGISTERED: an rtnl-guarded bus walk releases every
    phydev->psec that targets the departing controller before
    pse_release_pis() frees pcdev->pi. Without this, a phy still
    holding a pse_control reference would cause a use-after-free in
    __pse_control_release()'s pcdev->pi[psec->id] access, and the PSE
    driver module could not finish unloading while any phy still held a
    reference.

A bad `pses` binding -- an error from of_pse_control_get() other than
-ENOENT (no phandle) or -EPROBE_DEFER (controller not yet registered)
-- is reported with phydev_warn() rather than silently dropped,
preserving the diagnostic that the removed fwnode_mdio lookup used to
provide.

The final pse_control_put() of phydev->psec moves from
phy_device_remove() to phy_device_release(), so it runs only after
every reference on the device -- including the bus-iterator references
taken by bus_for_each_dev() in the notifier walk -- has been dropped.

Finally, delete fwnode_find_pse_control() and its call site in
fwnode_mdiobus_register_phy(), and drop the PSE header from
fwnode_mdio.c. The MDIO/DSA probe no longer sees any PSE-originated
-EPROBE_DEFER, so the probe-retry storm is gone and fwnode_mdio is
now PSE-agnostic.

Reported-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Closes: https://lore.kernel.org/netdev/e00048dd-1ed3-40c3-9912-59bccf015ad5@gmail.com/
Signed-off-by: Corey Leavitt <corey@leavitt.info>
Co-developed-by: Carlo Szelinsky <github@szelinsky.de>
Signed-off-by: Carlo Szelinsky <github@szelinsky.de>
---
 drivers/net/mdio/fwnode_mdio.c |  34 -------
 drivers/net/phy/phy_device.c   | 168 +++++++++++++++++++++++++++++++--
 drivers/net/phy/sfp.c          |   2 +-
 drivers/net/pse-pd/pse_core.c  |  14 +++
 include/linux/phy.h            |   2 +
 include/linux/pse-pd/pse.h     |   9 ++
 6 files changed, 186 insertions(+), 43 deletions(-)

diff --git a/drivers/net/mdio/fwnode_mdio.c b/drivers/net/mdio/fwnode_mdio.c
index ba7091518265..7bd979b59f49 100644
--- a/drivers/net/mdio/fwnode_mdio.c
+++ b/drivers/net/mdio/fwnode_mdio.c
@@ -11,33 +11,11 @@
 #include <linux/fwnode_mdio.h>
 #include <linux/of.h>
 #include <linux/phy.h>
-#include <linux/pse-pd/pse.h>
 
 MODULE_AUTHOR("Calvin Johnson <calvin.johnson@oss.nxp.com>");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("FWNODE MDIO bus (Ethernet PHY) accessors");
 
-static struct pse_control *
-fwnode_find_pse_control(struct fwnode_handle *fwnode,
-			struct phy_device *phydev)
-{
-	struct pse_control *psec;
-	struct device_node *np;
-
-	if (!IS_ENABLED(CONFIG_PSE_CONTROLLER))
-		return NULL;
-
-	np = to_of_node(fwnode);
-	if (!np)
-		return NULL;
-
-	psec = of_pse_control_get(np, phydev);
-	if (PTR_ERR(psec) == -ENOENT)
-		return NULL;
-
-	return psec;
-}
-
 static struct mii_timestamper *
 fwnode_find_mii_timestamper(struct fwnode_handle *fwnode)
 {
@@ -118,7 +96,6 @@ int fwnode_mdiobus_register_phy(struct mii_bus *bus,
 				struct fwnode_handle *child, u32 addr)
 {
 	struct mii_timestamper *mii_ts = NULL;
-	struct pse_control *psec = NULL;
 	struct phy_device *phy;
 	bool is_c45;
 	u32 phy_id;
@@ -159,14 +136,6 @@ int fwnode_mdiobus_register_phy(struct mii_bus *bus,
 			goto clean_phy;
 	}
 
-	psec = fwnode_find_pse_control(child, phy);
-	if (IS_ERR(psec)) {
-		rc = PTR_ERR(psec);
-		goto unregister_phy;
-	}
-
-	phy->psec = psec;
-
 	/* phy->mii_ts may already be defined by the PHY driver. A
 	 * mii_timestamper probed via the device tree will still have
 	 * precedence.
@@ -176,9 +145,6 @@ int fwnode_mdiobus_register_phy(struct mii_bus *bus,
 
 	return 0;
 
-unregister_phy:
-	if (is_acpi_node(child) || is_of_node(child))
-		phy_device_remove(phy);
 clean_phy:
 	phy_device_free(phy);
 clean_mii_ts:
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 0615228459ef..f5febff4b00b 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -223,8 +223,19 @@ static void phy_mdio_device_free(struct mdio_device *mdiodev)
 
 static void phy_device_release(struct device *dev)
 {
+	struct phy_device *phydev = to_phy_device(dev);
+
+	/* bus_for_each_dev() holds get_device() across each iteration
+	 * step, deferring this release callback until any in-flight PSE
+	 * notifier walk has advanced past this phy. pse_control_put()
+	 * takes pse_list_mutex, so this path must run in sleepable
+	 * context.
+	 */
+	might_sleep();
+	pse_control_put(phydev->psec);
+
 	fwnode_handle_put(dev->fwnode);
-	kfree(to_phy_device(dev));
+	kfree(phydev);
 }
 
 static void phy_mdio_device_remove(struct mdio_device *mdiodev)
@@ -1102,11 +1113,103 @@ struct phy_device *get_phy_device(struct mii_bus *bus, int addr, bool is_c45)
 }
 EXPORT_SYMBOL(get_phy_device);
 
-/**
- * phy_device_register - Register the phy device on the MDIO bus
- * @phydev: phy_device structure to be added to the MDIO bus
+/* Best-effort attach of phydev->psec from a DT `pses = <&...>` phandle.
+ * Caller must hold rtnl. A missing phandle (-ENOENT) or a not-yet-registered
+ * controller (-EPROBE_DEFER) is silent; the notifier retries the latter at
+ * PSE_REGISTERED time. Any other error means a broken binding and is warned
+ * about, but left non-fatal so the phy still registers.
  */
-int phy_device_register(struct phy_device *phydev)
+static void phy_try_attach_pse(struct phy_device *phydev)
+{
+	struct pse_control *psec;
+	struct device_node *np;
+
+	ASSERT_RTNL();
+
+	np = phydev->mdio.dev.of_node;
+	if (!np)
+		return;
+
+	if (phydev->psec)
+		return;
+
+	psec = of_pse_control_get(np, phydev);
+	if (IS_ERR(psec)) {
+		if (PTR_ERR(psec) != -EPROBE_DEFER && PTR_ERR(psec) != -ENOENT)
+			phydev_warn(phydev, "failed to get PSE control: %pe\n",
+				    psec);
+		return;
+	}
+
+	phydev->psec = psec;
+}
+
+static int phy_pse_attach_one(struct device *dev, void *data __maybe_unused)
+{
+	ASSERT_RTNL();
+
+	if (dev->type != &mdio_bus_phy_type)
+		return 0;
+
+	phy_try_attach_pse(to_phy_device(dev));
+	return 0;
+}
+
+static int phy_pse_detach_one(struct device *dev, void *data)
+{
+	struct pse_controller_dev *pcdev = data;
+	struct phy_device *phydev;
+	struct pse_control *psec;
+
+	ASSERT_RTNL();
+
+	if (dev->type != &mdio_bus_phy_type)
+		return 0;
+
+	phydev = to_phy_device(dev);
+	psec = phydev->psec;
+	if (!psec || !pse_control_matches_pcdev(psec, pcdev))
+		return 0;
+
+	phydev->psec = NULL;
+	pse_control_put(psec);
+	return 0;
+}
+
+static int phy_pse_notifier_event(struct notifier_block *nb,
+				  unsigned long event, void *data)
+{
+	switch (event) {
+	case PSE_REGISTERED:
+		rtnl_lock();
+		bus_for_each_dev(&mdio_bus_type, NULL, NULL,
+				 phy_pse_attach_one);
+		rtnl_unlock();
+		return NOTIFY_OK;
+	case PSE_UNREGISTERED:
+		rtnl_lock();
+		bus_for_each_dev(&mdio_bus_type, NULL, data,
+				 phy_pse_detach_one);
+		rtnl_unlock();
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block phy_pse_notifier __read_mostly = {
+	.notifier_call = phy_pse_notifier_event,
+};
+
+/* Core registration: add the phy to the MDIO bus. Does not touch rtnl or
+ * PSE. phydev->psec is attached by the callers below, after device_add()
+ * has made the phy visible on mdio_bus_type, so that a concurrent PSE
+ * notifier walk and the attach can never leave the phy unattached. Keeping
+ * device_add() out of rtnl also avoids deadlocking when binding a phy that
+ * itself provides an SFP cage (phy_probe() -> phy_sfp_probe() ->
+ * sfp_bus_add_upstream() takes rtnl).
+ */
+static int __phy_device_register(struct phy_device *phydev)
 {
 	int err;
 
@@ -1135,10 +1238,54 @@ int phy_device_register(struct phy_device *phydev)
  out:
 	/* Assert the reset signal */
 	phy_device_reset(phydev, 1);
-
 	mdiobus_unregister_device(&phydev->mdio);
 	return err;
 }
+
+/**
+ * phy_device_register_locked - Register the phy device on the MDIO bus
+ * @phydev: phy_device structure to be added to the MDIO bus
+ *
+ * Same as phy_device_register() but caller must already hold rtnl_lock().
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int phy_device_register_locked(struct phy_device *phydev)
+{
+	int err;
+
+	ASSERT_RTNL();
+
+	err = __phy_device_register(phydev);
+	if (err)
+		return err;
+
+	phy_try_attach_pse(phydev);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_device_register_locked);
+
+/**
+ * phy_device_register - Register the phy device on the MDIO bus
+ * @phydev: phy_device structure to be added to the MDIO bus
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int phy_device_register(struct phy_device *phydev)
+{
+	int err;
+
+	err = __phy_device_register(phydev);
+	if (err)
+		return err;
+
+	rtnl_lock();
+	phy_try_attach_pse(phydev);
+	rtnl_unlock();
+
+	return 0;
+}
 EXPORT_SYMBOL(phy_device_register);
 
 /**
@@ -1152,8 +1299,6 @@ EXPORT_SYMBOL(phy_device_register);
 void phy_device_remove(struct phy_device *phydev)
 {
 	unregister_mii_timestamper(phydev->mii_ts);
-	pse_control_put(phydev->psec);
-
 	device_del(&phydev->mdio.dev);
 
 	/* Assert the reset signal */
@@ -3981,8 +4126,14 @@ static int __init phy_init(void)
 	if (rc)
 		goto err_c45;
 
+	rc = pse_register_notifier(&phy_pse_notifier);
+	if (rc)
+		goto err_genphy;
+
 	return 0;
 
+err_genphy:
+	phy_driver_unregister(&genphy_driver);
 err_c45:
 	phy_driver_unregister(&genphy_c45_driver);
 err_ethtool_phy_ops:
@@ -3999,6 +4150,7 @@ static int __init phy_init(void)
 
 static void __exit phy_exit(void)
 {
+	pse_unregister_notifier(&phy_pse_notifier);
 	phy_driver_unregister(&genphy_c45_driver);
 	phy_driver_unregister(&genphy_driver);
 	rtnl_lock();
diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
index 03bfd8640db9..18868bdd6485 100644
--- a/drivers/net/phy/sfp.c
+++ b/drivers/net/phy/sfp.c
@@ -2083,7 +2083,7 @@ static int sfp_sm_probe_phy(struct sfp *sfp, int addr, bool is_c45)
 	/* Mark this PHY as being on a SFP module */
 	phy->is_on_sfp_module = true;
 
-	err = phy_device_register(phy);
+	err = phy_device_register_locked(phy);
 	if (err) {
 		phy_device_free(phy);
 		dev_err(sfp->dev, "phy_device_register failed: %pe\n",
diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 0190fac33c78..611540df43bb 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -2021,3 +2021,17 @@ bool pse_has_c33(struct pse_control *psec)
 	return psec->pcdev->types & ETHTOOL_PSE_C33;
 }
 EXPORT_SYMBOL_GPL(pse_has_c33);
+
+/**
+ * pse_control_matches_pcdev - Test whether a pse_control targets a controller
+ * @psec: pse_control obtained from of_pse_control_get()
+ * @pcdev: PSE controller to compare against
+ *
+ * Return: %true if @psec was obtained from @pcdev, %false otherwise.
+ */
+bool pse_control_matches_pcdev(struct pse_control *psec,
+			       struct pse_controller_dev *pcdev)
+{
+	return psec->pcdev == pcdev;
+}
+EXPORT_SYMBOL_GPL(pse_control_matches_pcdev);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 199a7aaa341b..865b9baddb85 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -2158,6 +2158,8 @@ struct phy_device *fwnode_phy_find_device(struct fwnode_handle *phy_fwnode);
 struct fwnode_handle *fwnode_get_phy_node(const struct fwnode_handle *fwnode);
 struct phy_device *get_phy_device(struct mii_bus *bus, int addr, bool is_c45);
 int phy_device_register(struct phy_device *phy);
+/* Caller must hold rtnl_lock(); see phy_device_register() for the public form. */
+int phy_device_register_locked(struct phy_device *phy);
 void phy_device_free(struct phy_device *phydev);
 void phy_device_remove(struct phy_device *phydev);
 int phy_get_c45_ids(struct phy_device *phydev);
diff --git a/include/linux/pse-pd/pse.h b/include/linux/pse-pd/pse.h
index 78fe3a2b1ea8..d4310ca71a3e 100644
--- a/include/linux/pse-pd/pse.h
+++ b/include/linux/pse-pd/pse.h
@@ -385,6 +385,9 @@ int pse_ethtool_set_prio(struct pse_control *psec,
 bool pse_has_podl(struct pse_control *psec);
 bool pse_has_c33(struct pse_control *psec);
 
+bool pse_control_matches_pcdev(struct pse_control *psec,
+			       struct pse_controller_dev *pcdev);
+
 int pse_register_notifier(struct notifier_block *nb);
 int pse_unregister_notifier(struct notifier_block *nb);
 
@@ -438,6 +441,12 @@ static inline bool pse_has_c33(struct pse_control *psec)
 	return false;
 }
 
+static inline bool pse_control_matches_pcdev(struct pse_control *psec,
+					     struct pse_controller_dev *pcdev)
+{
+	return false;
+}
+
 static inline int pse_register_notifier(struct notifier_block *nb)
 {
 	return 0;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v3 0/3] net: pse-pd: decouple controller lookup from MDIO probe
From: Carlo Szelinsky @ 2026-06-26 16:59 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, Simon Horman, netdev, linux-kernel,
	Carlo Szelinsky

This is v3 of Corey's series [1]. It takes the PSE controller lookup out
of the MDIO probe path, so a modular PSE driver no longer makes the
PHY/DSA probe spin on -EPROBE_DEFER until the PSE module loads.

v2 was four patches. The first one (a regulator handle fix) is a
self-contained bug fix, so on Jakub's suggestion it is going to net on
its own [2] and is not part of this series. The three patches here are
the notifier rework and target net-next. net-next was closed for the
merge window when v2 was posted; it is open again now, so here they are.

How it works: pse_core gets a notifier chain (REGISTERED / UNREGISTERED).
The phy layer subscribes, owns phydev->psec, and attaches the PSE handle
when the controller shows up instead of during probe. fwnode_mdio loses
its PSE awareness, so no -EPROBE_DEFER leaves it and the probe-retry loop
is gone.

Tested on a Realtek rtl93xx PoE switch with two HS104 PSE controllers on
i2c:

 - clean boot, no probe-retry loop, no watchdog reset
 - 10G SFP+ port: module hotplug works, no deadlock
 - ethtool --set-pse enable/disable cuts and restores power to a PD
 - i2c unbind -> rmmod -> modprobe: PSE detaches on unbind and re-attaches
   on reload with power restored, no reboot. No lockdep splats.

Tested-by: Carlo Szelinsky <github@szelinsky.de>

Changes in v3:
 - Drop patch 1 (regulator handle fix); it goes to net separately [2].
 - Rebase on current net-next. No code changes to the three patches.

v1 was an RFC by Corey [3].

[1] https://lore.kernel.org/netdev/20260620112440.1734404-1-github@szelinsky.de/
[2] https://lore.kernel.org/netdev/20260624204017.2752934-1-github@szelinsky.de/
[3] https://lore.kernel.org/netdev/20260423-pse-notifier-decouple-v1-0-86ed750a9d62@leavitt.info/

Corey Leavitt (3):
  net: pse-pd: add notifier chain for controller lifecycle events
  net: pse-pd: fire lifecycle events on controller register/unregister
  net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio
    hook

 drivers/net/mdio/fwnode_mdio.c |  34 -------
 drivers/net/phy/phy_device.c   | 168 +++++++++++++++++++++++++++++++--
 drivers/net/phy/sfp.c          |   2 +-
 drivers/net/pse-pd/pse_core.c  |  54 +++++++++++
 include/linux/phy.h            |   2 +
 include/linux/pse-pd/pse.h     |  41 ++++++++
 6 files changed, 258 insertions(+), 43 deletions(-)

base-commit: 805185b7c7a1069e407b6f7b3bc98e44d415f484
-- 
2.43.0

^ permalink raw reply

* [PATCH net 3/3] selftests/tc-testing: Verify bpf redirect on RED block with preceding clsact (egress) classifier
From: Jamal Hadi Salim @ 2026-06-26 16:51 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, edumazet, kuba, pabeni, horms, toke, jiri, bigeasy,
	clrkwllms, rostedt, kuniyu, sdf.kernel, skhawaja, liuhangbin,
	krikku, mkarsten, victor, ast, hawk, john.fastabend, daniel,
	Jamal Hadi Salim
In-Reply-To: <20260626165156.169012-1-jhs@mojatatu.com>

From: Victor Nogueira <victor@mojatatu.com>

The bpf_net_context used by sch_handle_egress() is stack-allocated and torn
down in that function. By the time tcf_qevent_handle() runs
current->bpf_net_context is NULL.

When a filter attached to a qevent block (e.g. RED's early_drop or mark
qevents, which always uses shared blocks) returns TC_ACT_REDIRECT,
tcf_qevent_handle() calls skb_do_redirect(), which in turn calls bpf helper
bpf_net_ctx_get_ri(). That helper unconditionally dereferences
current->bpf_net_context resulting in a NULL pointer dereference.

Add a test case that reproduces this scenario by attaching a filter to
clsact (egress) and a bpf filter to a block attached to RED. Use TBF as
red's parent, so that  a traffic burst builds backlog and RED early-drops
triggers the block filter.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
---
 .../testing/selftests/tc-testing/action-ebpf  | Bin 856 -> 9072 bytes
 tools/testing/selftests/tc-testing/action.c   |   5 +++
 .../tc-testing/tc-tests/infra/qdiscs.json     |  32 ++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/action-ebpf b/tools/testing/selftests/tc-testing/action-ebpf
index 4879479b2ee5c046279be0fe8f9ca313dfb7e618..52c47e42bf0af024a073cacc823c8270f906a8df 100644
GIT binary patch
literal 9072
zcmb<-^>JfjWMqH=MuzVU2p&w7fnkFTg6#liIxt8xFfwchvl$qsLh0>H5Js}1514@=
z&%nUI&VW$w9^k{kD9EU)D$L5PS|lzYF0Cra7^++>ULwxGz+}R}tm-LjFKNYX&CMji
zz`)GN=qb#=z@o_DDQwQoz`&})z^rP=&CSigzy@M+bK7w<FtF<}3Q7yHIY?AVGOL2L
zs!M_lVPN23Wq=5P4B=#DV3I&^x%e4CqTIra%&OenR@~OC3=BNHVEaKF3vLDmUS0-I
zVGyT-ksrk86K8~|>|o?)VBi;H@Dzra$G{)}QwmZi2vf(vAS4Xc!oa}rf|-GVm4T51
z6i$o`vQQQS0|PV&85kHqay%drW;2i~8K#8{%uWmp3@mOSf`OHVjggI&gPomGfPsO5
zF^Y|WX9`Fc7X!}>ka~6|1+X{=gCIzplQEEsK@cLt4AH^KAP$n@;9?L5i?gz`vT)61
zU|`@5Il#13f`?m*iGhJ>nFIq5ADFdVf`x}4%vvGA!6N`>t(4&55d^bVNeJ)=fmy31
zM0kY3tThr6JR)G$S_v5*Q7~(rgaVHkn6+L)g-0CB+9099BLQY@l+fXkR0G+&Ny30f
z3M{r+!i7f~%-SO1!6O4^ZI$rhkp;81Nd)l7fmz!nLU`oCtX&cjJPKgeZiyHkMKEiR
zL;{bJ5y<4d5-B{&VAei~5*`(>?0$(B9#t^wfJ6t68kluhqK9`QBLf4|5ebe7d>kN(
zN8Ju&!Vw7u1|EBRW(EePqY^WCoWRPDNi5)T2D6S!Ea80x(s)9GV+9`v(+LR<9v5$r
z>JuQ1fnY@^B{uK`DT4%0No?T>1{-!pVh01i5)%UhFQYUo4?7DpNF_MFSs4&)76vY7
zCI$v>I}4^~GCUgMATyrJFz{%DSubRmcyz$5moh9ox?me#$*}PlfLX6)*m(@WtT!?o
zJVs#FTNzFsV||b*?_{`mOu?-8GCVwHVAcm2K7nl@Pk)pV5L96LC?jwL#QP+}AjHA+
zNruPV9HjHJ3<HlPnDs@5g+bssNXa)D1|bEeZ!$bq;K2Sa!@y$=X8n*6U|`^}0eOz;
zw~PUgEm-3p850J6d1eL%Ek+4eO?D=JZDs}reMV7MJq|{GkcUi|6~M{Qf?0{*otc5b
zkx`!2ft`aZfSG}TJ0O6GSCYpSY$l&112iG<OS15|fyD$QIiLwuP?86ljD;ixpovmg
zQiR7HtWH!?g2w~wN-;?p9#62CxTFq`7dS8^Bn^1H!D3R9COkf1b<&a+JicHt8A%Tw
zzsI29kd^dd;0I+ce?}<=F$Pdx=KvS2EDWH6On{k5ftgu=A%YPk1In!o45ADS3~~$%
z44_=A$-uy%$H2e<%I=_|G=PDDA&P;4A&Y^5A%}s1p@4ybp_YMxp_ze!p@)HiVIl(q
z!+Zt?h7}A93|ko(81^wRFq~vyV7SD<!0?EHf#DSc1H)$q28M483=F>+7#P?X85p=3
z85l$u85m?385ooq85r~!85k@W85o=y85n#S85klN85mL+85r^y85k-U85rsq85nvQ
z85kxrGBC_!WMEjr$iT3Mk%3_sBLl;JMh1qnj0_Cd7#SGuGcqtdXJlY_$H>6&g^_{b
z7b61$GZO;?7ZU@6FcSlV3=;!`DiZ^PHWLGb850A83ljr_HxmOx91{aW3KIiEE)xSo
z8509TB@+Wf8xsRVHxmQHWF`iN*-Q)!OPClK)-o|LY-M6#*vrJgaF~gK;R+K2!!0HT
zhQ~|{3~!hi7=AD@FfcMRFeK+B=A|o?r4|)u=I1FG8R;47nKC3Mmt^MW=_NDhF~sL&
zCa2~Vr!pjGBo;Bm$2$fEIY!0@dq%m&heQUr#>Yby$LD7=WagE?c-i?dR#9q7W>IQ#
z2}3bMPHG-QX<l(=dR}UZ0!VRue5tV!LqT>)d`V?NDno8!Q8q(iX=-U|d~RYvL1tb$
zLqSn~Nq%yE4ntW^VqSbfQEG8&UI~O#lAH-)fYmS*6lLZYWtLPjWagz8r4|>*XQpN5
zrKDCc!03|Xc!)r95<^B}aRx(4a(r@5VsUY13PVa_Ng|ktPt8kV$V)89jL%GANK4Gk
z%&BB3O3lqLNsZ4eFk#5aPfpAMv*3bea6vPe%7Xl&5~wJc2{JuCH?<^@AuT7rJU%<M
zvX~(+BR?$-5gNrAAU*N%rG{n<C19z<l$4@)h}SZU<I{=~(-EqnaZzf)0FufqDlUO2
z$SjUe%}Y)!V8|?hY6XQ^en~z<e0)->p&3Il#64g#v!Ki*zPKnEEN5)Q0OqF@mw*^%
zV2R9vGP8J)NLo%}dNIWDIf+TBISfe!Y4HfZloXdF<`y8Fmy@5Dt^gt!;^RxrOc=^D
zi&Eo3k)K|iA77lBUd&LO&5)E|nwJuvl3Es@nZ^K){^Fu!aL__%GX@Y1c4<m+Nj#hZ
ziUyECW`P+)aY<rHDnn64JZhqek1sYh0=uy|KRKHLY-?s!Dg(rwkhGRj4&gDx#}{YE
zCzYn9F{nTbAV@)jo1FiekwF3~ox;q(0K&@=4H7006VyIbVqjo70BR?I3NxsBekdDM
ze1XhhW?*0dwH0Nd;t3244BAk30|Ntt36#Bnfq}sh$_AO~1!X^AU|<M^vOy(h9F(m9
zDw3dV2Sx^lGAKKNk%6He>`w-U21W*kCaAaq69Yp#l<mO8z%U8QPGDkSm=0w(FflNI
z+yQd)0wzc?3Su8%VqjPgRr7#}fngVv&A`mSaD)NUmQ`S8U^owDFJNY1xCvFGz{0@r
z6v}pBVPN<GWha1Y08l%Pfq|icg@NH00|NsW0|UbW76t}TQygS311kdq7pT|-RR^q)
z00y-L9atF{grVXMtdOFckAZ<<0V@N81|!rQRt5$`P$V%haDc2bhp-tzY*5@lOau9V
z0Zgz#!_0*Ubs#f9`a!i8sC5dezBoYw2+SnSz`&3MF^hwnfq@|d!Unf=LFoac6sEtL
zk%561B!Iw7psog}ssRNNC`~m(^@Avoogk$kZGxa`kC6eCG{997sI3cfp8^8|<8+8P
z52%@oP5nuzyf8=-q>+Jv7uh^(P!+<!zyqq1t3Zh!5;dR}Imj6xUEpXE2UVru#yA5*
zJwzQJDD8t3bwJDjd4WL^qywsM0z@6C8^kaViH($9b5iq0Qq=)1t}x3|Ld6@Xibbn+
zF)MLIVGb#>;Tk~2IHW8u&IT3d7KmaVTniN=*ZTR{&{|(NKbt{MAKr}MEJ`gYEy_~}
zagKL%4vF{owuY*Uhqn`Svq6<qVo6C+W>RTMYJ9wgMsX^*8KR*CF-JE$UrAG^v^X_I
zQxnvBP=E->XXk4amlTyImnguCas{noO$N?lT{}?Ct6-~OP+<VK5#AnwXxD}F6(9iB
zsX?wTo<Xk8A=c3L53Dr=qfskD5D#4CFo5a^SUCbJ!$BC-Bn6e9pkxlpqYEMV3&dp*
zVqjpnjKmjVfHe0&c?6^Y)Hnv^Q)E4>85tP1f%;Dt5WV2c#=yV;62A-5=mO~-gZQ9k
zB$p2Z1IP>zAFM9`)eew&G!p{@M+(SB1_n@B`v3p`|11m~3>-Dw3?Mf_WFRESED#G_
z9OiFyahQ5gbDNccgP{&ocrh~m=Hq2xWCRuJpi+vFkrC9nf%Nb}r6e<>JQF{w91EzS
z&)_R@sm`yAaZAD<73VVx=lwJX6-zq=J{1((2_#EytNSW+OeGK09bh`wvnwt9_@k2y
z0!I>LJ>^b$fQASTtbMp}G3T|;oM#+dfr~D(vM{hRaWQa0$`?@0!^_CT#J~uu1&{;<
z8Ckiw6j_-Rp>nJoD0(Cydh{6dON)#2GxL&jN>ftx6N__o(^K<Oi!zf@C2}(JN-Lnr
zUoRPydvtSh%uMt$Kn)hX3~*DZST6$<usARaBLf2q!^i*Fq?y6B6{ZX`1E|!;CJwDQ
z(WIEcBT#69pb!Gr!q~(?wHP*Ww3-X5gqZ<UBSSez6f;IDgGe%iyN3`qGRX{Yi6KOH
zFfcF(A%z1h-GMrqLP+5NOQ%X;^@wnQslUa*z#zx~s`(*i!$J_0cR?bc{07nmsuN*+
zP=y7m4`F;zs3?Ot;ILz00JQ_uk@z6fk<A0KLHQ1(1Y|yl4|5NU56f3DKBz85Ru5_$
zAoF4VgsBI$bCKmWKpd$1K=$dN@eR=Upt1}k1T)V9#6i+;gT{xoXJG1G(Bwf`6C?za
z2X*C<`Jiq+GCu^wL9#CbjUR)?2eqj|LNN28#Tz(<fcy)~YcP3GIgP9yG>m}EF9C6o
z?5{xM*P!u1?Rbz7%={J*2T6Yi8Xq)N0TP0#p8(<@sRyNFkPu8BJSvGKKL;d<#D~>6
zF!f8&<UzwBAR(Ci8W0Cb{{}Sv7BoI6zk!5c`uBi1Ncutj6_5~29@Hm5=AQs@kkp?+
z<6l7IUqR#FK;z#*<AeH0AR(CjPe2?b`(B{&-=Oh9eG-rmO#c@U2TA`AH2xnnKB!s)
z3BmM(%5P*o2dJ(_GLHw1FM!4e4M!pCmq3#T4OJn_gZhHVd=)hHpgs|@ybhYY0UF-~
zjSm_QL)LGDCJ*X=BFjVjQ1JZcfu`OEjURx<4?*LD`d7&2gYq<p531HdWhV&3_#iP@
zc?n{J;sjPcg7}~^J_aNR?T3K$!pcLC97qkUd;{@8N@3*{h!4WB@&?2PsfU#hAU;Sv
zEWg9}u>1?-!}1%555mahJ*W?i?0%5>F!zDX2Fb(9YmgjB47t1o$-~N9kUYqIQ2hrA
zACOv5-J=HLK+_|LuZ6}3&$A%Od!xz2{0}p)6ips9wg3_W*$=|YK^#!?3~J^vBtFP}
zynOru40<W4Nu}xWiAhOCsbvg$C8-r940=VWIeJbZZh9aNq&XiCZ_Y#bh~_=ifTFzg
zoXp~qVu)slp~WRd@%d?K#i<}+xDd?BoXot`_~McxWF4TvIcOUWwF?7w62yqiyfpYA
zC~C(jC#Nho9%MvuW;$Yk6-g_|N@VNOiV|~Eq4t4BWs6ISN)nS8^olEU!89}+py5U-
z#S0xfK{uxb)EsAEU{HhRKbSlkmjTq72Z@2&HZV0H8rBW~v5_$-j*<C?K#h9nm;tgl
zOg|`3B8$WFJ4|dI4*jrp1T4S9)T8S+!l5754j^RzZK!@w`iI#8qG9a-bpL|F3M3E1
zAU+7g_%IsQZukxpfYRt{LG?Gf7)U8-90^3fL30?oT2Olc)J}zkH%Jc%3qT7ZSU5_7
z`fpGSmIo2g=@@h~K~m|U#xer~14ti;55wqc!=T{@lZVlbQ2jAz8ql~5K@5;~HoE&0
zVD5*S19AtbeGZF%SiELJ?T3}4F#GR96EruD@PoMnRKCO5Fufr91t?G%7#LvfZIBoU
zqpJnQGe{}A`!_%b$YA0i_k(&-=<ZiU8b<(`55oeW!k>YGK^f`~m^gZRlx4&oejlLr
z!@>z<KZuQPKS&w6pFnn*fE0tqa6ud-jBY=u|Afu|u!IB4M<Dw_<0k0#gYpM9`+q>~
zhxHd=PJro$@j+97*z9irS;)Y^0IT0XTu_pN>4zH6@En?cVCKQ-0BA??8cYC6qr3Gn
zj`aHgWFcrA8>$b)1&s@#+YjoWfXqZsziObsVqjo^<zJ{Vu;wY82DQJLeg5ZzhTdUn
sKy(AN{D;LC$bOJG$Sx2K!=j)uDHsQdu7KJ<1F8W;fkp>l?uWH&07wxN(*OVf

delta 317
zcmez1c7tt#hG+yM0~|PjSq=;w6K#!|-2;3kf8>;vbYoy(U}5<9A1sGNNKf7<Ebhb3
zz`!8HzycRnfU;~E7#IW@SfM<S2@oa|GYf-WNoqw2Lt=7CW`16Lc0QD)n?2b|MpptN
zte4E7S6ot5l9<GxS6rD}l9)7kqm2C|E)G_I1_lP^$saj|Cx1|60h=E`IZ#f0vW2V#
zqw3^BS$jso$s1+u8SN$;%84@;O!kyhXVjnkQAu3%1H=Uk%upKSbcV@=a>A_P3=9lR
YATu>8pmH!86gW%_3=AAlaS1350I>8ljQ{`u

diff --git a/tools/testing/selftests/tc-testing/action.c b/tools/testing/selftests/tc-testing/action.c
index c32b99b80e19..350f2d36a773 100644
--- a/tools/testing/selftests/tc-testing/action.c
+++ b/tools/testing/selftests/tc-testing/action.c
@@ -20,4 +20,9 @@ __attribute__((section("action-ko"),used)) int action_ko(struct __sk_buff *s)
 	return TC_ACT_OK;
 }
 
+__attribute__((section("action-redirect"), used)) int action_redirect(struct __sk_buff *s)
+{
+	return TC_ACT_REDIRECT;
+}
+
 char _license[] __attribute__((section("license"),used)) = "GPL";
diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index a1f97a4b606e..762f86ceab1c 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -1540,5 +1540,37 @@
             "$TC qdisc del dev $DUMMY root",
             "$IP addr del 10.10.10.10/24 dev $DUMMY || true"
         ]
+    },
+    {
+        "id": "fb8d",
+        "name": "Verify bpf redirect on RED block with preceding clsact (egress) classifier",
+        "category": [
+            "qdisc",
+            "red",
+            "qevent",
+            "clsact"
+        ],
+        "plugins": {
+            "requires": "nsPlugin"
+        },
+        "setup": [
+            "$IP addr add 10.10.10.1/24 dev $DUMMY",
+            "$IP neigh add 10.10.10.2 lladdr 02:00:00:00:00:01 dev $DUMMY nud permanent",
+            "$TC qdisc add dev $DUMMY handle 1: root tbf rate 1Mbit burst 10K limit 1M",
+            "$TC qdisc add dev $DUMMY parent 1:1 handle 11: red limit 1M avpkt 1400 probability 1 burst 38 harddrop min 30000 max 30001 qevent early_drop block 10",
+            "$TC qdisc add dev $DUMMY clsact",
+            "$TC filter add dev $DUMMY egress protocol ip prio 1 matchall action gact pass",
+            "$TC filter add block 10 protocol ip prio 1 matchall action bpf obj action-ebpf sec action-redirect"
+        ],
+        "cmdUnderTest": "bash -c 'data=$(head -c 1400 /dev/zero | tr \"\\0\" \"x\"); exec 3>/dev/udp/10.10.10.2/12345; for i in $(seq 1 8000); do printf \"%s\" \"$data\" >&3; done; exit 0'",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s filter show block 10",
+        "matchPattern": "Sent [1-9][0-9]* bytes [1-9][0-9]* pkt",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY clsact",
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP addr del 10.10.10.1/24 dev $DUMMY"
+        ]
     }
 ]
-- 
2.54.0


^ permalink raw reply related

* [PATCH net 2/3] net/sched: Handle TC_ACT_REDIRECT from qdisc filter chains
From: Jamal Hadi Salim @ 2026-06-26 16:51 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, edumazet, kuba, pabeni, horms, toke, jiri, bigeasy,
	clrkwllms, rostedt, kuniyu, sdf.kernel, skhawaja, liuhangbin,
	krikku, mkarsten, victor, ast, hawk, john.fastabend, daniel,
	Jamal Hadi Salim
In-Reply-To: <20260626165156.169012-1-jhs@mojatatu.com>

When a TC filter attached to a qdisc filter chain returns
TC_ACT_REDIRECT (ex: via an eBPF program calling bpf_redirect() or an
act_bpf action), the redirect was silently lost i.e no qdisc classify
function handled TC_ACT_REDIRECT, so the packet fell through the
switch and was enqueued normally instead of being redirected.

This has been broken since bpf_redirect() was introduced for TC in
commit 27b29f63058d ("bpf: add bpf_redirect() helper"). We got lucky
for a long time because bpf_net_context was a per-CPU variable that
was always available.
commit 401cb7dae813 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
turned bpf_net_context into a task_struct member that is only set up by
explicit callers. Without a caller setting it up, bpf_redirect() itself
crashes with a NULL pointer dereference in bpf_net_ctx_get_ri().

The NULL deref is fixed separately by extending the bpf_net_context
lifetime to cover qdisc enqueue. However, even with bpf_net_context
available, TC_ACT_REDIRECT from qdisc filter chains cannot be honored
without adding skb_do_redirect() calls to every qdisc classify
function, which would require changes across net/sched/.

Instead, add a tcf_classify_qdisc() inline helper in pkt_cls.h, as a
wrapper around tcf_classify() for use by qdisc classify functions and
tcf_qevent_handle(). When the classify verdict is TC_ACT_REDIRECT,
the wrapper converts it to TC_ACT_SHOT, dropping the packet rather
than letting it continue silently.  Dropping is preferred over
letting the packet through because the user immediately sees packet
loss and, with the help of the rate-limited warning in the log
("use eBPF with clsact or mirred redirect instead"), can quickly
identify and fix the misconfiguration. Silently passing the packet
through would hide the problem and leave the user wondering why their
redirect is not working.

The clsact fast path, tc_run() continues to call tcf_classify() directly
and is unaffected: TC_ACT_REDIRECT is returned as-is and handled by
sch_handle_egress/ingress() calling skb_do_redirect() as before.

Remove the TC_ACT_REDIRECT case from tcf_qevent_handle() since
tcf_classify_qdisc() now converts it to TC_ACT_SHOT, and the existing
SHOT case handles it.

Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
Fixes: 401cb7dae813 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/pkt_cls.h    | 13 +++++++++++++
 net/sched/cls_api.c      |  6 +-----
 net/sched/sch_cake.c     |  2 +-
 net/sched/sch_drr.c      |  2 +-
 net/sched/sch_dualpi2.c  |  2 +-
 net/sched/sch_ets.c      |  2 +-
 net/sched/sch_fq_codel.c |  2 +-
 net/sched/sch_fq_pie.c   |  2 +-
 net/sched/sch_hfsc.c     |  2 +-
 net/sched/sch_htb.c      |  2 +-
 net/sched/sch_multiq.c   |  2 +-
 net/sched/sch_prio.c     |  2 +-
 net/sched/sch_qfq.c      |  2 +-
 net/sched/sch_sfb.c      |  2 +-
 net/sched/sch_sfq.c      |  2 +-
 15 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 3bd08d7f39c1..3a542a72e9a5 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -159,6 +159,19 @@ static inline int tcf_classify(struct sk_buff *skb,
 
 #endif
 
+static inline int tcf_classify_qdisc(struct sk_buff *skb,
+				     const struct tcf_proto *tp,
+				     struct tcf_result *res, bool compat_mode)
+{
+	int ret = tcf_classify(skb, NULL, tp, res, compat_mode);
+
+	if (unlikely(ret == TC_ACT_REDIRECT)) {
+		pr_warn_once("TC_ACT_REDIRECT from qdisc filter chains is not supported; use eBPF with clsact or mirred redirect instead\n");
+		ret = TC_ACT_SHOT;
+	}
+	return ret;
+}
+
 static inline unsigned long
 __cls_set_class(unsigned long *clp, unsigned long cl)
 {
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3e67600a4a1a..3ca56d060e28 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -4033,7 +4033,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
 
 	fl = rcu_dereference_bh(qe->filter_chain);
 
-	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
+	switch (tcf_classify_qdisc(skb, fl, &cl_res, false)) {
 	case TC_ACT_SHOT:
 		qdisc_qstats_drop(sch);
 		__qdisc_drop(skb, to_free);
@@ -4045,10 +4045,6 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
 		__qdisc_drop(skb, to_free);
 		*ret = __NET_XMIT_STOLEN;
 		return NULL;
-	case TC_ACT_REDIRECT:
-		skb_do_redirect(skb);
-		*ret = __NET_XMIT_STOLEN;
-		return NULL;
 	case TC_ACT_CONSUMED:
 		*ret = __NET_XMIT_STOLEN;
 		return NULL;
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index a3c185505afc..94eb47ac54ee 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1730,7 +1730,7 @@ static u32 cake_classify(struct Qdisc *sch, struct cake_tin_data **t,
 		goto hash;
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	result = tcf_classify(skb, NULL, filter, &res, false);
+	result = tcf_classify_qdisc(skb, filter, &res, false);
 
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
index 020657f959b5..91b1ef824afa 100644
--- a/net/sched/sch_drr.c
+++ b/net/sched/sch_drr.c
@@ -312,7 +312,7 @@ static struct drr_class *drr_classify(struct sk_buff *skb, struct Qdisc *sch,
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	fl = rcu_dereference_bh(q->filter_list);
-	result = tcf_classify(skb, NULL, fl, &res, false);
+	result = tcf_classify_qdisc(skb, fl, &res, false);
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index 5434df6ca8ef..98364f74211e 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -364,7 +364,7 @@ static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
 		return NET_XMIT_SUCCESS;
 	}
 
-	result = tcf_classify(skb, NULL, fl, &res, false);
+	result = tcf_classify_qdisc(skb, fl, &res, false);
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index cb8cf437ce87..25fcf4079fec 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -391,7 +391,7 @@ static struct ets_class *ets_classify(struct sk_buff *skb, struct Qdisc *sch,
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	if (TC_H_MAJ(skb->priority) != sch->handle) {
 		fl = rcu_dereference_bh(q->filter_list);
-		err = tcf_classify(skb, NULL, fl, &res, false);
+		err = tcf_classify_qdisc(skb, fl, &res, false);
 #ifdef CONFIG_NET_CLS_ACT
 		switch (err) {
 		case TC_ACT_STOLEN:
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index cafd1f943d99..6cce86ba383c 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -91,7 +91,7 @@ static unsigned int fq_codel_classify(struct sk_buff *skb, struct Qdisc *sch,
 		return fq_codel_hash(q, skb) + 1;
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	result = tcf_classify(skb, NULL, filter, &res, false);
+	result = tcf_classify_qdisc(skb, filter, &res, false);
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
diff --git a/net/sched/sch_fq_pie.c b/net/sched/sch_fq_pie.c
index 72f48fa4010b..069e1facd413 100644
--- a/net/sched/sch_fq_pie.c
+++ b/net/sched/sch_fq_pie.c
@@ -96,7 +96,7 @@ static unsigned int fq_pie_classify(struct sk_buff *skb, struct Qdisc *sch,
 		return fq_pie_hash(q, skb) + 1;
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	result = tcf_classify(skb, NULL, filter, &res, false);
+	result = tcf_classify_qdisc(skb, filter, &res, false);
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 7e537295b8b6..e87f5021a199 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1143,7 +1143,7 @@ hfsc_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	head = &q->root;
 	tcf = rcu_dereference_bh(q->root.filter_list);
-	while (tcf && (result = tcf_classify(skb, NULL, tcf, &res, false)) >= 0) {
+	while (tcf && (result = tcf_classify_qdisc(skb, tcf, &res, false)) >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
 		case TC_ACT_QUEUED:
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 908b9ba9ba2e..fdac0dc8f35a 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -243,7 +243,7 @@ static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
 	}
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	while (tcf && (result = tcf_classify(skb, NULL, tcf, &res, false)) >= 0) {
+	while (tcf && (result = tcf_classify_qdisc(skb, tcf, &res, false)) >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
 		case TC_ACT_QUEUED:
diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 4e465d11e3d7..004f0d275caf 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -36,7 +36,7 @@ multiq_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 	int err;
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	err = tcf_classify(skb, NULL, fl, &res, false);
+	err = tcf_classify_qdisc(skb, fl, &res, false);
 #ifdef CONFIG_NET_CLS_ACT
 	switch (err) {
 	case TC_ACT_STOLEN:
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index e4dd56a89072..79437c587e7e 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -39,7 +39,7 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	if (TC_H_MAJ(skb->priority) != sch->handle) {
 		fl = rcu_dereference_bh(q->filter_list);
-		err = tcf_classify(skb, NULL, fl, &res, false);
+		err = tcf_classify_qdisc(skb, fl, &res, false);
 #ifdef CONFIG_NET_CLS_ACT
 		switch (err) {
 		case TC_ACT_STOLEN:
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index cb56787e1d25..6f3b7273cb16 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -709,7 +709,7 @@ static struct qfq_class *qfq_classify(struct sk_buff *skb, struct Qdisc *sch,
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	fl = rcu_dereference_bh(q->filter_list);
-	result = tcf_classify(skb, NULL, fl, &res, false);
+	result = tcf_classify_qdisc(skb, fl, &res, false);
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c
index b1d465094276..ed39869199c0 100644
--- a/net/sched/sch_sfb.c
+++ b/net/sched/sch_sfb.c
@@ -260,7 +260,7 @@ static bool sfb_classify(struct sk_buff *skb, struct tcf_proto *fl,
 	struct tcf_result res;
 	int result;
 
-	result = tcf_classify(skb, NULL, fl, &res, false);
+	result = tcf_classify_qdisc(skb, fl, &res, false);
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 758b88f21865..77675f9a4c46 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -171,7 +171,7 @@ static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
 		return sfq_hash(q, skb) + 1;
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	result = tcf_classify(skb, NULL, fl, &res, false);
+	result = tcf_classify_qdisc(skb, fl, &res, false);
 	if (result >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
-- 
2.54.0


^ permalink raw reply related

* [PATCH net 1/3] net: Extend bpf_net_context lifetime to cover qdisc enqueue
From: Jamal Hadi Salim @ 2026-06-26 16:51 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, edumazet, kuba, pabeni, horms, toke, jiri, bigeasy,
	clrkwllms, rostedt, kuniyu, sdf.kernel, skhawaja, liuhangbin,
	krikku, mkarsten, victor, ast, hawk, john.fastabend, daniel,
	Jamal Hadi Salim, Sashiko
In-Reply-To: <20260626165156.169012-1-jhs@mojatatu.com>

The bpf_net_context used by sch_handle_egress() is stack-allocated and torn
down in that function returned. By the time tcf_qevent_handle() runs
current->bpf_net_context is NULL.

When a filter attached to a qevent block (e.g. RED's early_drop or mark
qevents, which always use shared blocks) returns TC_ACT_REDIRECT,
tcf_qevent_handle() calls skb_do_redirect(), which in turn calls bpf helper
bpf_net_ctx_get_ri(). That helper unconditionally dereferences
current->bpf_net_context resulting in a NULL pointer dereference.

Note: The same holds for actions that invoke BPF redirect helpers
(e.g. act_bpf running a program that calls bpf_redirect()) during qevent
classification itself. And as a matter of fact the same assumption is
made in the code outside of tc.

Fix:
Move the bpf_net_context lifecycle out of sch_handle_egress() into
__dev_queue_xmit(), so that it spans both the egress TC fast path and the
qdisc enqueue. The setup is placed outside the egress_needed_key static
branch because qevents are independent of clsact/NF egress hooks and
that key may stay disabled when only a qevent-bearing qdisc is
configured. Unfortunately this adds a small unconditional penalty to the
code path _per packet_ only guarded by CONFIG_NET_XGRESS (two writes and
one read for bpf_net_ctx_set, plus one write for bpf_net_ctx_clear).

This keeps all bpf_net_context management in net/core/dev.c i.e the
existing boundary between tc core and BPF without requiring any net/sched/
code to know about BPF plumbing.

Reproducer (see the accompanying tdc test):

  tc qdisc add dev eth0 root handle 1: red limit 1MB min 10KB max 20KB \
      avpkt 1000 burst 100 qevent early_drop block 10
  tc qdisc add dev eth0 clsact
  tc filter add block 10 pref 1 bpf obj redirect.o
  tc filter add dev eth0 egress protocol ip prio 1 matchall \
      action gact pass

  traffic through eth0 triggers red_enqueue() -> tcf_qevent_handle() and,
  on a redirect verdict, a NULL deref in skb_do_redirect().

Fixes: 401cb7dae813 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260620130749.226642-1-jhs%40mojatatu.com
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/core/dev.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4b3d5cfdf6e0..8c214bfff8aa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4527,14 +4527,11 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
-	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int sch_ret;

 	if (!entry)
 		return skb;

-	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
-
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -4550,12 +4547,10 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		/* No need to push/pop skb's mac_header here on egress! */
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
-		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	case TC_ACT_SHOT:
 		kfree_skb_reason(skb, drop_reason);
 		*ret = NET_XMIT_DROP;
-		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	/* used by tc_run */
 	case TC_ACT_STOLEN:
@@ -4565,10 +4560,8 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		fallthrough;
 	case TC_ACT_CONSUMED:
 		*ret = NET_XMIT_SUCCESS;
-		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	}
-	bpf_net_ctx_clear(bpf_net_ctx);

 	return skb;
 }
@@ -4767,6 +4760,9 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
  */
 int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
+#ifdef CONFIG_NET_XGRESS
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx = NULL;
+#endif
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq = NULL;
 	enum skb_drop_reason reason;
@@ -4795,6 +4791,9 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	skb_update_prio(skb);

 	tcx_set_ingress(skb, false);
+#ifdef CONFIG_NET_XGRESS
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+#endif
 #ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
 		if (nf_hook_egress_active()) {
@@ -4898,12 +4897,18 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)

 	reason = SKB_DROP_REASON_RECURSION_LIMIT;
 drop:
+#ifdef CONFIG_NET_XGRESS
+	bpf_net_ctx_clear(bpf_net_ctx);
+#endif
 	rcu_read_unlock_bh();

 	dev_core_stats_tx_dropped_inc(dev);
 	kfree_skb_list_reason(skb, reason);
 	return rc;
 out:
+#ifdef CONFIG_NET_XGRESS
+	bpf_net_ctx_clear(bpf_net_ctx);
+#endif
 	rcu_read_unlock_bh();
 	return rc;
 }
-- 
2.54.0

^ permalink raw reply related

* [PATCH net 0/3] Fix broken TC_ACT_REDIRECT
From: Jamal Hadi Salim @ 2026-06-26 16:51 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, edumazet, kuba, pabeni, horms, toke, jiri, bigeasy,
	clrkwllms, rostedt, kuniyu, sdf.kernel, skhawaja, liuhangbin,
	krikku, mkarsten, victor, ast, hawk, john.fastabend, daniel,
	Jamal Hadi Salim

When sashiko-gemini[1] reviewed commit a8a02897f2b4
("net/sched: cls_api: Handle TC_ACT_CONSUMED in tcf_qevent_handle") it
 correctly pointed out the following:

"
This is a pre-existing issue, but does executing a redirect via a qevent
filter cause a NULL pointer dereference?
When tcf_qevent_handle() processes a TC_ACT_REDIRECT, it calls
skb_do_redirect(). This eventually calls bpf_net_ctx_get_ri() which
dereferences the task bpf_net_context:
include/linux/filter.h:bpf_net_ctx_get_ri() {
    ...
    struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
    if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_RI_INIT)) {
    ...
}
Since qevents are evaluated during enqueue, which runs within
__dev_queue_xmit() after sch_handle_egress() has already executed and
cleared the bpf_net_context pointer, will this dereference a NULL pointer?
"

That issue is fixed in patch 1. See the commit log for details.

Upon further investigation it turns out that TC_ACT_REDIRECT being returned
from the egress qdiscs never actually worked. When an action returns that
code we would silently loose it and the packet will never be redirected.
After all those years, if nobody complained, my gut feel is it was never
used by anyone with serious need for it.
Patch 2 fixes it by 1) putting a warning out when someone does and 2) asking
the core to drop the packet. At least this would help whoever is
misconfiguring to diagnose the issue much faster.
I had initially attempted to "fix" this and make it work, but unfortunately
it's a bit ugly so i left i didnt think it was worth fixing

Apologies for the shotgun Cc - its what get_maintainer.pl told me to use.

[1] https://sashiko.dev/#/patchset/20260620130749.226642-1-jhs%40mojatatu.com

^ permalink raw reply

* Re: [PATCH net v3] net: wwan: iosm: bound device offsets in the MUX downlink decoder
From: Simon Horman @ 2026-06-26 16:37 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: Loic Poulain, Sergey Ryazanov, Johannes Berg, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, linux-kernel, stable
In-Reply-To: <178236824878.3259367.5389624724479864947@maoyixie.com>

On Thu, Jun 25, 2026 at 02:17:28PM +0800, Maoyi Xie wrote:
> mux_dl_adb_decode() walks a chain of aggregated datagram tables using
> offsets and lengths taken from the modem. first_table_index,
> next_table_index, table_length, datagram_index and datagram_length are
> all device supplied le values. Only first_table_index was checked, and
> only for being non zero. The decoder then formed adth = block +
> adth_index and read the table header and the datagram entries with no
> bound against the received skb. A modem that reports an index or a
> length past the downlink buffer makes the decoder read out of bounds.
> 
> The buffer is IPC_MEM_MAX_DL_MUX_LITE_BUF_SIZE and skb->len is at most
> that, so skb->len is the real limit, but none of these in band offsets
> were checked against it.
> 
> The table chain is also followed with no forward progress check. The loop
> takes the next table from adth->next_table_index and stops only when that
> reaches zero. A modem can stage two tables that point at each other, so
> the loop never ends. It runs in softirq and clones the skb on every pass.
> 
> Validate every device offset and length against skb->len before use.
> The block header must fit. Each table header, on entry and after every
> next_table_index, must lie inside the skb. The datagram table must fit.
> Each datagram index and length must stay inside the skb. The header
> padding must not exceed the datagram length so the receive length does
> not wrap. Require each next_table_index to move forward so the chain
> cannot cycle.
> 
> This was reproduced under KASAN as a slab out of bounds read on a normal
> downlink receive once the iosm net device is up.
> 
> Fixes: 1f52d7b62285 ("net: wwan: iosm: Enable M.2 7360 WWAN card support")
> Suggested-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH net v4 2/2] net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
From: Petr Wozniak @ 2026-06-26 16:35 UTC (permalink / raw)
  To: maxime.chevallier, olek2, linux, andrew, hkallweit1
  Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
	bjorn, kabel, Petr Wozniak
In-Reply-To: <20260624084814.20972-3-petr.wozniak@gmail.com>

Maxime Chevallier wrote:
> I finally got time to test this with a RollBall module, and I
> confirm what Aleksander says, the RollBall module's PHY doesn't
> get detected even with this patch. It does work on v7.0 though,
> so before the bridge probing was introduced.

Thanks a lot for taking the time to test, Maxime - and Aleksander
for the original report.

That settles it: the deferred-probe approach in patch 2/2 doesn't
actually restore genuine RollBall PHY detection, and as you both
confirm it worked before 8fe125892f40 introduced the bridge probing.
Sashiko's static review flagged the same thing (the probe bypasses the
PHY discovery retry loop for slow-initializing modules), so the static
analysis and the two hardware reports all point at the same flaw.

I only have RTL8261BE-based copper modules here, not a genuine RollBall
one, so I can't develop and verify a proper slow-init timing fix
(module_t_wait / a retry that waits for the bridge) without the
hardware to test against.

Given that, my suggestion:

- Please drop patch 2/2 from the series.

- Since 8fe125892f40 regressed genuine RollBall detection and the
  deferred probe doesn't restore it, I think the cleanest fix is to
  revert 8fe125892f40. I'm happy to send that revert if you'd prefer.
  The 5-minute RTL8261BE probe loop it was addressing is handled in our
  downstream tree, so reverting it upstream is fine on our side.

- Patch 1/2 (the mii_bus leak fix) is independent of all this and
  already has Reviewed-by from Maxime and Larysa - it would be good to
  take that one regardless. I can resend it standalone if that's easier.

A proper fix covering slow-firmware modules really needs a genuine
RollBall module to validate, so it's better owned by someone who has
that hardware - happy to help review.

Thanks again,
Petr

^ permalink raw reply

* Re: [RFC net-next 00/17] MPTCP KTLS support
From: Jakub Kicinski @ 2026-06-26 16:32 UTC (permalink / raw)
  To: Geliang Tang
  Cc: Matthieu Baerts, Mat Martineau, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Neal Cardwell, Kuniyuki Iwashima,
	John Fastabend, Sabrina Dubroca, Hannes Reinecke, Geliang Tang,
	netdev, mptcp, Gang Yan, Zqiang
In-Reply-To: <a0358bb3aa8e2d45bd1e74851416728c3f08ddff.camel@kernel.org>

On Fri, 26 Jun 2026 15:56:53 +0800 Geliang Tang wrote:
> On Mon, 2026-06-22 at 09:00 -0700, Jakub Kicinski wrote:
> > On Mon, 22 Jun 2026 18:43:20 +0800 Geliang Tang wrote:  
> > > Subject: [RFC net-next 00/17] MPTCP KTLS support  
> > 
> > Please no. We have a ton of unfixed bugs and may have to revert some
> > of
> > the features we dropped back in. I'd prefer to avoid large new bug
> > surfaces until we reach an LTS release.  
> 
> Sure, I can wait. During this time, I'll go over the implementation
> more carefully to make sure there are no issues on the MPTCP side.

The two things that I'd be most interested in understanding better are
1) perf impact on all the pointer chasing and indirect calls you're
   adding
2) what's your plan on implementing HW offload? I understand that _you_
   may not care, but if you follow the ML you'll discover that sooner
   or later some well meaning person will slop code such support.
   We have pretty solid (by kernel standards) documentation for the
   offload. Please read if you haven't already. And then you should
   hopefully realize the huge issue with the layering you chose...

^ permalink raw reply

* [PATCH] net: lan743x: Initialize eth_syslock spinlock before use
From: Andrea Righi @ 2026-06-26 16:32 UTC (permalink / raw)
  To: Bryan Whitehead, UNGLinuxDriver
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Raju Lakkaraju, David Thompson, netdev, linux-kernel

lan743x_hardware_init() calls pci11x1x_strap_get_status() during the
PCI11x1x probe sequence. That helper acquires the Ethernet subsystem
hardware lock via lan743x_hs_syslock_acquire(), which relies on
adapter->eth_syslock_spinlock to serialize access.

The spinlock is currently initialized only after the strap status is
read. With CONFIG_DEBUG_SPINLOCK enabled, taking the zeroed initialized
spinlock can trip the spinlock debug check.

Fix by initializing adapter->eth_syslock_spinlock before reading the
strap status so the probe path never attempts to lock an uninitialized
spinlock.

Fixes: 46b777ad9a8c ("net: lan743x: Add support to SGMII 1G and 2.5G")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 drivers/net/ethernet/microchip/lan743x_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index 1cdce35e14239..e759171bfd766 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -3541,8 +3541,8 @@ static int lan743x_hardware_init(struct lan743x_adapter *adapter,
 		adapter->max_tx_channels = PCI11X1X_MAX_TX_CHANNELS;
 		adapter->used_tx_channels = PCI11X1X_USED_TX_CHANNELS;
 		adapter->max_vector_count = PCI11X1X_MAX_VECTOR_COUNT;
-		pci11x1x_strap_get_status(adapter);
 		spin_lock_init(&adapter->eth_syslock_spinlock);
+		pci11x1x_strap_get_status(adapter);
 		mutex_init(&adapter->sgmii_rw_lock);
 		pci11x1x_set_rfe_rd_fifo_threshold(adapter);
 		sgmii_ctl = lan743x_csr_read(adapter, SGMII_CTL);
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Stanislav Fomichev @ 2026-06-26 16:30 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Jason Xing, netdev, bpf, magnus.karlsson, stfomichev, kuba,
	pabeni, horms, bjorn
In-Reply-To: <aj4tUMqAXmQskO6r@boxer>

On 06/26, Maciej Fijalkowski wrote:
> On Thu, Jun 25, 2026 at 09:05:28AM -0700, Stanislav Fomichev wrote:
> > On 06/25, Jason Xing wrote:
> > > On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > > > On 06/23, Maciej Fijalkowski wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > > > consumed from the Tx ring are not consistently returned to userspace
> > > > > > through the completion ring when the packet is later dropped as invalid.
> > > > > >
> > > > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > > > consume one or more Tx descriptors while building or validating a
> > > > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > > > addresses are returned through the CQ. Missing completions therefore
> > > > > > make userspace lose track of those buffers.
> > > > > >
> > > > > > The generic path fixes cover three related cases:
> > > > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > > >   continuation descriptors left in the Tx ring after xsk_build_skb()
> > > > > >   reports overflow;
> > > > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > > > >   packet, including the offending invalid descriptor itself.
> > > > > >
> > > > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > > > distinguishes descriptors that can be passed to the driver from
> > > > > > descriptors that are consumed only because they belong to an invalid
> > > > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > > > address area and published in completion order, after any earlier
> > > > > > driver-visible Tx descriptors.
> > > > > >
> > > > > > The ZC batching path can also retain drain state when userspace has not
> > > > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > > > state local to the singular batched path, the series prevents a second
> > > > > > Tx socket from joining the same pool while such drain state exists.
> > > > > > During the singular-to-shared transition, Tx batching is gated,
> > > > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > > > existing socket still has pending drain state. This avoids adding
> > > > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > > > >
> > > > > > The last two patches update xskxceiver so the tests account invalid
> > > > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > > > still not expecting those invalid packets on the Rx side.
> > > > > >
> > > > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > > > against ice driver.
> > > > >
> > > > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > > > question from me is: is it time to scrap our current approach where
> > > > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > > > half-consumed descriptor queues)
> > > > >
> > > > > Should we:
> > > > >
> > > > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > > > and do a full stop here)
> > > > > 2. now that we really know the number of valid descriptors -> reserve
> > > > > the cq space (if not -> EAGAIN)
> > > > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > > > locally, don't ever create semi-initialized skb)
> > > > > 4. construct the skb
> > > > > 5. xmit
> > > >
> > > > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > > > reviews yet, but bare in mind this set also aligns zc side to what was
> > > > previously being addressed by Jason.
> > > >
> > > > I believe planned logistics were to get these fixes onto net and then
> > > > Jason had an implementation of batching on generic xmit, directed towards
> > > > -next and that's where we could address current flow.
> > > 
> > > Agreed. That's what I'm hoping for. There would be much more
> > > discussion on how to do batch xmit in an elegant way, I believe.
> > 
> > This doesn't have to depend on the batch rewrite, we should be able to rewrite
> > this non-zc in net, this is still technically fixes, not feature work..
> > 
> > There was already a couple of revisions with this drain_cont approach
> > and every time I look at it feels like the cure is worse than the
> > decease :-( Obviously not gonna stop you from going with the current approach,
> > but these fixes feel a bit of a wasted effort to me (since the bugs keep
> > coming and we are piling more complexity).
> 
> Well this is my fault as I took Jason's patches as-is and did not realize
> Sashiko had issues with it. I *think* I got ZC side almost right so I'd
> like to have at least one last round with trying to make the generic side
> right...

You can have my ack on the series since I already did a few revisions
on the original series from Jason, but it is still a very low confidence
ack:

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply

* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Stanislav Fomichev @ 2026-06-26 16:25 UTC (permalink / raw)
  To: Jason Xing
  Cc: Maciej Fijalkowski, netdev, bpf, magnus.karlsson, stfomichev,
	kuba, pabeni, horms, bjorn
In-Reply-To: <CAL+tcoDOFtTKavz8TWpU6BPrHCcL6TjY8xrmFA5y3P9Wn0T4ZA@mail.gmail.com>

On 06/26, Jason Xing wrote:
> On Fri, Jun 26, 2026 at 12:05 AM Stanislav Fomichev
> <sdf.kernel@gmail.com> wrote:
> >
> > On 06/25, Jason Xing wrote:
> > > On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > > > On 06/23, Maciej Fijalkowski wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > > > consumed from the Tx ring are not consistently returned to userspace
> > > > > > through the completion ring when the packet is later dropped as invalid.
> > > > > >
> > > > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > > > consume one or more Tx descriptors while building or validating a
> > > > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > > > addresses are returned through the CQ. Missing completions therefore
> > > > > > make userspace lose track of those buffers.
> > > > > >
> > > > > > The generic path fixes cover three related cases:
> > > > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > > >   continuation descriptors left in the Tx ring after xsk_build_skb()
> > > > > >   reports overflow;
> > > > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > > > >   packet, including the offending invalid descriptor itself.
> > > > > >
> > > > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > > > distinguishes descriptors that can be passed to the driver from
> > > > > > descriptors that are consumed only because they belong to an invalid
> > > > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > > > address area and published in completion order, after any earlier
> > > > > > driver-visible Tx descriptors.
> > > > > >
> > > > > > The ZC batching path can also retain drain state when userspace has not
> > > > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > > > state local to the singular batched path, the series prevents a second
> > > > > > Tx socket from joining the same pool while such drain state exists.
> > > > > > During the singular-to-shared transition, Tx batching is gated,
> > > > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > > > existing socket still has pending drain state. This avoids adding
> > > > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > > > >
> > > > > > The last two patches update xskxceiver so the tests account invalid
> > > > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > > > still not expecting those invalid packets on the Rx side.
> > > > > >
> > > > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > > > against ice driver.
> > > > >
> > > > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > > > question from me is: is it time to scrap our current approach where
> > > > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > > > half-consumed descriptor queues)
> > > > >
> > > > > Should we:
> > > > >
> > > > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > > > and do a full stop here)
> > > > > 2. now that we really know the number of valid descriptors -> reserve
> > > > > the cq space (if not -> EAGAIN)
> > > > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > > > locally, don't ever create semi-initialized skb)
> > > > > 4. construct the skb
> > > > > 5. xmit
> > > >
> > > > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > > > reviews yet, but bare in mind this set also aligns zc side to what was
> > > > previously being addressed by Jason.
> > > >
> > > > I believe planned logistics were to get these fixes onto net and then
> > > > Jason had an implementation of batching on generic xmit, directed towards
> > > > -next and that's where we could address current flow.
> > >
> > > Agreed. That's what I'm hoping for. There would be much more
> > > discussion on how to do batch xmit in an elegant way, I believe.
> >
> > This doesn't have to depend on the batch rewrite, we should be able to rewrite
> > this non-zc in net, this is still technically fixes, not feature work..
> >
> > There was already a couple of revisions with this drain_cont approach
> > and every time I look at it feels like the cure is worse than the
> > decease :-( Obviously not gonna stop you from going with the current approach,
> > but these fixes feel a bit of a wasted effort to me (since the bugs keep
> > coming and we are piling more complexity).
> 
> I see your point, but rewriting is something that cannot be easily
> applied to the stable branches? Until now, we fix issues one by one
> which have an explicit target branch (because of the fixes tag). Cross
> fingers :(
> 
> Sashiko has the magic to find out the hidden bugs more than ever and
> AF_XDP is not the only place where a pile of reports are coming in.

net vs net-next is fixes vs feature work. If we can't fix the current
code, I think we can justify a rewrite using a better approach and
route it via net. This series is 7 patches anyway, it's not like
it is a quick short fix :-) But I'm ok with pushing it as it, I'm just
trying to see if someone on your side is fed up with that part as well
and wants to fix it "properly" :-p

> My take is that batch xmit has been appending too long and at least so
> far less and less bugs are found by sashiko. I believe if the mode is
> changed to batch xmit, there are likely to be new and challenging
> problems to discuss. I prefer to solve questions of the batch xmit
> series.

We can redo this part separately, without batching. Move from "read
one chunk at a time" to "pre-read all chunks". Batching vs current issue
are separate.

> BTW, would you both come to Netdev 0x1a next month? I believe we could
> sit around the table and discuss some future plans there (in xdp
> workshop?).
> https://netdevconf.info/0x1A/sessions/workshop/xdp-workshop.html

Yes, I plan to be there in person.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox