Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/3] net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio LAWSON @ 2026-04-16 14:25 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Marek Vasut, Woojung Huh, UNGLinuxDriver, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Chevallier, Simon Horman, Heiner Kallweit, Russell King,
	netdev, linux-kernel, Fidelio Lawson
In-Reply-To: <03040421-89e7-4422-9fb5-0367a34323e4@lunn.ch>

On 4/16/26 14:25, Andrew Lunn wrote:
>> Yes, I think a reasonable compromise could be to expose three tunables:
>>
>> - a boolean "short-cable" tunable, which applies the known good settings
>>    (LPF 62 MHz BW, DSP EQ initial value 0).
>>
>> - an integer LPF bandwidth tunable, for advanced use cases where further
>>    tuning is needed;
>>
>> - an integer DSP EQ initial value tunable, for the same advanced cases.
>>
>> The boolean tunable would follow the KISS principle and cover the common
>> scenario, while the more granular controls would remain optional.
> 
> How do the three interact? Do you need to first enable short-cable
> before you set LPG bandwidth or DSP EQ? If it is not enabled, do you
> get -EINVAL?
> 
> It seems like having extack would be useful to return informative
> error messages to user space, however, that requires netlink
> ethtool. And ETHTOOL_PHY_STUNABLE has not been added to netlink
> ethtool yet :-(
> 
> 	Andrew

My intention would be to keep the interactions as simple and
non-surprising as possible, and avoid requiring any particular ordering
or state machine between the tunables.

The boolean short-cable tunable would simply apply the preset
in one step. The LPF bandwidth and DSP EQ initial value tunables would 
be orthogonal knobs which can be set independently at any time, 
regardless of whether short-cable is enabled or not.

With this model, we don’t need to return -EINVAL for combinations or
ordering, and userspace does not need detailed error reporting. The
tunables behave more like simple setters than a mode switch, which
keeps the API predictable and avoids the need for extack or netlink
ethtool support at this point.

Fidelio


^ permalink raw reply

* Re: [PATCH v1 net 1/1] net/sched: sch_dualpi2: fix limit/memlimit enforcement when dequeueing L-queue
From: Victor Nogueira @ 2026-04-16 14:26 UTC (permalink / raw)
  To: chia-yu.chang, linux-hardening, kees, gustavoars, jhs, jiri,
	davem, edumazet, kuba, pabeni, linux-kernel, netdev, horms, ij,
	ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
In-Reply-To: <20260413163711.56191-1-chia-yu.chang@nokia-bell-labs.com>

On 13/04/2026 13:37, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> 
> Fix dualpi2_change() to correctly enforce updated limit and memlimit values
> after a configuration change of the dualpi2 qdisc.
> 
> Before this patch, dualpi2_change() always attempted to dequeue packets via
> the root qdisc (C-queue) when reducing backlog or memory usage, and
> unconditionally assumed that a valid skb will be returned. When traffic
> classification results in packets being queued in the L-queue while the
> C-queue is empty, this leads to a NULL skb dereference during limit or
> memlimit enforcement.
> 
> This is fixed by first dequeuing from the C-queue path if it is non-empty.
> Once the C-queue is empty, packets are dequeued directly from the L-queue.s
> Return values from qdisc_dequeue_internal() are checked for both queues. When
> dequeuing from the L-queue, the parent qdisc qlen and backlog counters are
> updated explicitly to keep overall qdisc statistics consistent.
> [...]
> ---
>   net/sched/sch_dualpi2.c | 24 +++++++++++++++++++-----
>   1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> index 6d7e6389758d..56d4422970b6 100644
> --- a/net/sched/sch_dualpi2.c
> +++ b/net/sched/sch_dualpi2.c
> @@ -872,11 +872,25 @@ static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
>   	old_backlog = sch->qstats.backlog;
>   	while (qdisc_qlen(sch) > sch->limit ||
>   	       q->memory_used > q->memory_limit) {
> -		struct sk_buff *skb = qdisc_dequeue_internal(sch, true);
> -
> -		q->memory_used -= skb->truesize;
> -		qdisc_qstats_backlog_dec(sch, skb);
> -		rtnl_qdisc_drop(skb, sch);
> +		int c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> +		struct sk_buff *skb = NULL;
> +
> +		if (c_len) {
> +			skb = qdisc_dequeue_internal(sch, true);
> +			if (!skb)
> +				break;
> +			q->memory_used -= skb->truesize;
> +			rtnl_qdisc_drop(skb, sch);
> +		} else if (qdisc_qlen(q->l_queue)) {
> +			skb = qdisc_dequeue_internal(q->l_queue, true);
> +			if (!skb)
> +				break;
> +			q->memory_used -= skb->truesize;
> +			rtnl_qdisc_drop(skb, q->l_queue);
> +			/* Keep the overall qdisc stats consistent */
> +			--sch->q.qlen;
> +			qdisc_qstats_backlog_dec(sch, skb);

Sashiko is hallucinating saying this will cause a UAF, it won't.
However it is good to maintain a consistent order here.
For example, see how sch_choke is doing [1].

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/sch_choke.c?id=1f5ffc672165ff851063a5fd044b727ab2517ae3#n394

cheers,
Victor

^ permalink raw reply

* Re: [PATCH v3 1/3] net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Andrew Lunn @ 2026-04-16 14:30 UTC (permalink / raw)
  To: Fidelio LAWSON
  Cc: Marek Vasut, Woojung Huh, UNGLinuxDriver, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Chevallier, Simon Horman, Heiner Kallweit, Russell King,
	netdev, linux-kernel, Fidelio Lawson
In-Reply-To: <bf93f620-1288-46a3-8798-6d70aeb2c7c6@gmail.com>

> With this model, we don’t need to return -EINVAL for combinations or
> ordering, and userspace does not need detailed error reporting. The
> tunables behave more like simple setters than a mode switch, which
> keeps the API predictable and avoids the need for extack or netlink
> ethtool support at this point.

Sounds good.

       Andrew
       

^ permalink raw reply

* Re: [PATCH bpf] bpf: Fix precedence bug in convert_bpf_ld_abs alignment check
From: patchwork-bot+netdevbpf @ 2026-04-16 14:40 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: bpf, netdev, edumazet, willemdebruijn.kernel
In-Reply-To: <20260416122719.661033-1-daniel@iogearbox.net>

Hello:

This patch was applied to bpf/bpf.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Thu, 16 Apr 2026 14:27:19 +0200 you wrote:
> Fix an operator precedence issue in convert_bpf_ld_abs() where the
> expression offset + ip_align % size evaluates as offset + (ip_align % size)
> due to % having higher precedence than +. That latter evaluation does
> not make any sense. The intended check is (offset + ip_align) % size == 0
> to verify that the packet load offset is properly aligned for direct
> access.
> 
> [...]

Here is the summary with links:
  - [bpf] bpf: Fix precedence bug in convert_bpf_ld_abs alignment check
    https://git.kernel.org/bpf/bpf/c/e5f635edd393

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Mladek @ 2026-04-16 14:49 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Song Chen, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
	snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
	danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
	frederic, mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <1db425bf-58a9-4768-8c38-3ae25d7662a5@suse.com>

On Thu 2026-04-16 13:18:30, Petr Pavlu wrote:
> On 4/15/26 8:43 AM, Song Chen wrote:
> > On 4/14/26 22:33, Petr Pavlu wrote:
> >> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> >>> diff --git a/include/linux/module.h b/include/linux/module.h
> >>> index 14f391b186c6..0bdd56f9defd 100644
> >>> --- a/include/linux/module.h
> >>> +++ b/include/linux/module.h
> >>> @@ -308,6 +308,14 @@ enum module_state {
> >>>       MODULE_STATE_COMING,    /* Full formed, running module_init. */
> >>>       MODULE_STATE_GOING,    /* Going away. */
> >>>       MODULE_STATE_UNFORMED,    /* Still setting it up. */
> >>> +    MODULE_STATE_FORMED,
> >>
> >> I don't see a reason to add a new module state. Why is it necessary and
> >> how does it fit with the existing states?
> >>
> > because once notifier fails in state MODULE_STATE_UNFORMED (now only ftrace has someting to do in this state), notifier chain will roll back by calling blocking_notifier_call_chain_robust, i'm afraid MODULE_STATE_GOING is going to jeopardise the notifers which don't handle it appropriately, like:
> > 
> > case MODULE_STATE_COMING:
> >      kmalloc();
> > case MODULE_STATE_GOING:
> >      kfree();
> 
> My understanding is that the current module "state machine" operates as
> follows. Transitions marked with an asterisk (*) are announced via the
> module notifier.
> 
> ---> UNFORMED --*> COMING --*> LIVE --*> GOING -.
>         ^            |                     ^    |
>         |            '---------------------*    |
>         '---------------------------------------'
> 
> The new code aims to replace the current ftrace_module_init() call in
> load_module(). To achieve this, it adds a notification for the UNFORMED
> state (only when loading a module) and introduces a new FORMED state for
> rollback. FORMED is purely a fake state because it never appears in
> module::state. The new structure is as follows:
> 
>         ,--*> (FORMED)
>         |
> --*> UNFORMED --*> COMING --*> LIVE --*> GOING -.
>         ^            |                     ^    |
>         |            '---------------------*    |
>         '---------------------------------------'
> 
> I'm afraid this is quite complex and inconsistent. Unless it can be kept
> simple, we would be just replacing one special handling with a different
> complexity, which is not worth it.

> >>
> >>> +    if (err)
> >>> +        goto ddebug_cleanup;
> >>>         /* Finally it's fully formed, ready to start executing. */
> >>>       err = complete_formation(mod, info);
> >>> -    if (err)
> >>> +    if (err) {
> >>> +        blocking_notifier_call_chain_reverse(&module_notify_list,
> >>> +                MODULE_STATE_FORMED, mod);
> >>>           goto ddebug_cleanup;
> >>> +    }
> >>>   -    err = prepare_coming_module(mod);
> >>> +    err = prepare_module_state_transaction(mod,
> >>> +                MODULE_STATE_COMING, MODULE_STATE_GOING);
> >>>       if (err)
> >>>           goto bug_cleanup;
> >>>   @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
> >>>       destroy_params(mod->kp, mod->num_kp);
> >>>       blocking_notifier_call_chain(&module_notify_list,
> >>>                        MODULE_STATE_GOING, mod);
> >>
> >> My understanding is that all notifier chains for MODULE_STATE_GOING
> >> should be reversed.
> > yes, all, from lowest priority notifier to highest.
> > I will resend patch 1 which was failed due to my proxy setting.
> 
> What I meant here is that the call:
> 
> blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod);
> 
> should be replaced with:
> 
> blocking_notifier_call_chain_reverse(&module_notify_list, MODULE_STATE_GOING, mod);
> 
> > 
> >>
> >>> -    klp_module_going(mod);
> >>>    bug_cleanup:
> >>>       mod->state = MODULE_STATE_GOING;
> >>>       /* module_bug_cleanup needs module_mutex protection */
> >>
> >> The patch removes the klp_module_going() cleanup call in load_module().
> >> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
> >> should be removed and appropriately replaced with a cleanup via
> >> a notifier.
> >>
> >     err = prepare_module_state_transaction(mod,
> >                 MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> >     if (err)
> >         goto ddebug_cleanup;
> > 
> > ftrace will be cleanup in blocking_notifier_call_chain_robust rolling back.
> > 
> >     err = prepare_module_state_transaction(mod,
> >                 MODULE_STATE_COMING, MODULE_STATE_GOING);
> > 
> > each notifier including ftrace and klp will be cleanup in blocking_notifier_call_chain_robust rolling back.
> > 
> > if all notifiers are successful in MODULE_STATE_COMING, they all will be clean up in
> >  coming_cleanup:
> >     mod->state = MODULE_STATE_GOING;
> >     destroy_params(mod->kp, mod->num_kp);
> >     blocking_notifier_call_chain(&module_notify_list,
> >                      MODULE_STATE_GOING, mod);
> > 
> > if  something wrong underneath.
> 
> My point is that the patch leaves a call to ftrace_release_mod() in
> load_module(), which I expected to be handled via a notifier.

I think that I have got it. The ftrace code needs two notifiers when
the module is being loaded and two when it is going.

This is why Sond added the new state. But I think that we would
need two new states to call:

    + ftrace_module_init() in MODULE_STATE_UNFORMED
    + ftrace_module_enable() in MODULE_STATE_FORMED

and

    + ftrace_free_mem() in MODULE_STATE_PRE_GOING
    + ftrace_free_mem() in MODULE_STATE_GOING


By using the ascii art:

 -*> UNFORMED -*> FORMED -> COMING -*> LIVE -*> PRE_GOING -*> GOING -.
              |          |         |                ^           ^    ^
              |          |         '----------------'           |    |
              |          '--------------------------------------'    |
              '------------------------------------------------------'


But I think that this is not worth it.

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH] mISDN: socket: drop device references acquired by get_mdevice()
From: Shuvam Pandey @ 2026-04-16 14:50 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, linux-kernel
In-Reply-To: <20260416125029.851168-2-horms@kernel.org>

> Could this introduce a use-after-free regression if the underlying
> hardware is removed before the socket is closed?
>
> [...]
>
> This isn't a regression, but is there a use-after-free in the stack
> teardown when the hardware is removed?

Thanks for the review.

Looking at it more closely, I agree the close-path part of this patch is
not safe as posted.

While get_mdevice() does return a referenced device, the sockets also
keep raw mISDNdevice / mISDNstack pointers across bind, and
mISDN_unregister_device() tears the stack down while those pointers can
still be reached later from socket release. In addition, several mISDN
drivers free the enclosing allocation immediately after
mISDN_unregister_device() returns, so adding put_device() in the socket
release paths can turn this into a close-time UAF. The delete_channel()
/ ch->st case on the data-socket side is the same underlying lifetime
problem.

I'll drop this version and revisit it after reworking the unregister /
socket lifetime handling first. I also want to re-check whether the
temporary get_mdevice() lookup references should be fixed separately or
only once that lifetime side is addressed.

Thanks,
Shuvam

^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Petr Mladek @ 2026-04-16 14:54 UTC (permalink / raw)
  To: David Laight
  Cc: chensong_2000, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
	snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
	danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
	frederic, mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin,
	jpoimboe, jikos, mbenes, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260416133004.07bd2886@pumpkin>

On Thu 2026-04-16 13:30:04, David Laight wrote:
> On Wed, 15 Apr 2026 15:01:37 +0800
> chensong_2000@189.cn wrote:
> 
> > From: Song Chen <chensong_2000@189.cn>
> > 
> > The current notifier chain implementation uses a single-linked list
> > (struct notifier_block *next), which only supports forward traversal
> > in priority order. This makes it difficult to handle cleanup/teardown
> > scenarios that require notifiers to be called in reverse priority order.
> 
> If it is only cleanup/teardown then the list can be order-reversed
> as part of that process at the same time as the list is deleted.

Interesting idea. But it won't work in all situations.

Note that the motivation for this update are the module loader
notifiers which are called several times for each loaded/removed module.

Best Regards,
Petr

^ permalink raw reply

* Re: [RFC v2 1/2] vfio: add callback to get tph info for dmabuf
From: Leon Romanovsky @ 2026-04-16 15:02 UTC (permalink / raw)
  To: Keith Busch
  Cc: Zhiping Zhang, Jason Gunthorpe, Bjorn Helgaas, linux-rdma,
	linux-pci, netdev, dri-devel, Yochai Cohen, Yishai Hadas,
	Bjorn Helgaas
In-Reply-To: <ad56liSM4zI2SWWp@kbusch-mbp>

On Tue, Apr 14, 2026 at 11:34:14AM -0600, Keith Busch wrote:
> On Thu, Apr 09, 2026 at 03:04:15PM +0300, Leon Romanovsky wrote:
> > Something like that, on top of this proposal:
> 
> ...
>   
> > +struct vfio_region_dma_tph {
> > +	u16 tag;
> > +	u8 ph;
> > +};
> > +
> >  struct vfio_region_dma_range {
> > -	__u64 offset;
> > -	__u64 length;
> > +	union {
> > +		__u64 offset;
> > +		struct vfio_region_dma_tph tph;
> > +	};
> > +	union {
> > +		__u64 length;
> > +		__u64 reserved;
> > +	};
> > +};
> > +
> > +enum {
> > +	VFIO_DMABUF_FLAG_TPH = 1 << 0,
> >  };
> 
> Okay, so you have the hints as a separate action from the dmabuf
> creation. 

I'm not sure I fully understood your point, but in my proposal the entire
operation is done in one pass, with the hints provided as the final entry in
the dma_range array.

Thanks

^ permalink raw reply

* Re: [PATCH v3 1/3] net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Marek Vasut @ 2026-04-16 15:05 UTC (permalink / raw)
  To: Andrew Lunn, Fidelio LAWSON
  Cc: Woojung Huh, UNGLinuxDriver, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Chevallier,
	Simon Horman, Heiner Kallweit, Russell King, netdev, linux-kernel,
	Fidelio Lawson
In-Reply-To: <6a8fc2e6-830f-4f57-869b-41efe70557e5@lunn.ch>

On 4/16/26 4:30 PM, Andrew Lunn wrote:
>> With this model, we don’t need to return -EINVAL for combinations or
>> ordering, and userspace does not need detailed error reporting. The
>> tunables behave more like simple setters than a mode switch, which
>> keeps the API predictable and avoids the need for extack or netlink
>> ethtool support at this point.
> 
> Sounds good.
Fine by me too.

^ permalink raw reply

* Re: [PATCH] net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow
From: Jamal Hadi Salim @ 2026-04-16 15:23 UTC (permalink / raw)
  To: Simon Horman
  Cc: hxzene, jiri, davem, edumazet, kuba, pabeni, victor, pctammela,
	netdev, linux-kernel, Stephen Hemminger
In-Reply-To: <20260415141931.808383-1-horms@kernel.org>

On Wed, Apr 15, 2026 at 10:19 AM Simon Horman <horms@kernel.org> wrote:
>
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow
>
> This patch fixes a stack overflow bug where mirred blockcast actions would
> bypass the recursion limit check because the recursion counter was not
> incremented for blockcast actions.
>
> > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > index 05e0b14b57731..5928fcf3e6514 100644
> > --- a/net/sched/act_mirred.c
> > +++ b/net/sched/act_mirred.c
> > @@ -444,8 +444,12 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
> >       tcf_action_update_bstats(&m->common, skb);
> >
> >       blockid = READ_ONCE(m->tcfm_blockid);
> > -     if (blockid)
> > -             return tcf_blockcast(skb, m, blockid, res, retval);
> > +     if (blockid) {
> > +             xmit->sched_mirred_nest++;
>
> Could this leave a stale pointer in sched_mirred_dev since netdev_xmit is
> persistent per-CPU (or per-task in PREEMPT_RT) memory?
>
> If a nested mirred action later iterates over sched_mirred_dev up to the new
> nesting depth to detect routing loops, could it match a stale pointer from
> whatever unrelated packet previously reached this nesting level?
>
> This might cause a false-positive loop detection and incorrectly drop a valid
> packet. Should we set xmit->sched_mirred_dev[xmit->sched_mirred_nest] = NULL
> before incrementing the counter?
>
> > +             retval = tcf_blockcast(skb, m, blockid, res, retval);
>
> This isn't a regression introduced by this patch, but it exists in the early
> error returns of tcf_mirred_act() (e.g., when the target device is gone, or a
> routing loop is detected) and in tcf_blockcast() (e.g., when
> xa_empty(&block->ports) is true).
>
> In these error paths, the functions simply return retval. For mirred redirect
> actions, retval is initialized to m->tcf_action, which is typically
> TC_ACT_STOLEN.
>
> When TC_ACT_STOLEN is returned, the TC framework assumes the action has taken
> ownership of the packet and will not free the sk_buff. Since the code does
> not free the sk_buff before returning in these error paths, does this leak
> the sk_buff?
>
> Could an unprivileged user inside a network namespace trigger this
> intentionally by creating a routing loop with mirred redirect rules, leading
> to continuous sk_buff leaks and system memory exhaustion?
>
> Should these error paths ensure the sk_buff is freed or force the return value
> to TC_ACT_SHOT if the action is a redirect, mirroring the existing behavior in
> tcf_mirred_to_dev()?
>
> > +             xmit->sched_mirred_nest--;
> > +             return retval;
> > +     }

There's a dependency with the skb bits patch[1] with this one.
Stephen, I will take this set and consider what Sashiko is saying here
[2]. I just became aware of this - but it seems to have some merit.
I will also review this patch and sashiko's input from that
perspective. If it merits it, I will resend the set incorporating this
fix and Sashiko's suggestions.

cheers,
jamal

[1]https://lore.kernel.org/netdev/20260326181701.308275-1-stephen@networkplumber.org/
[2] https://sashiko.dev/#/patchset/20260326181701.308275-1-stephen%40networkplumber.org

^ permalink raw reply

* Re: [PATCH v2] net: phy: air_en8811h: add AN8811HB MCU assert/deassert support
From: Lucien.Jheng @ 2026-04-16 15:24 UTC (permalink / raw)
  To: Eric Woudstra, andrew, hkallweit1, linux, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel, bjorn
  Cc: frank-w, daniel, lucien.jheng
In-Reply-To: <5de3f9ca-ffe7-40ca-93a7-fb7e8e513ec2@gmail.com>

Hi Eric

Lucien.Jheng 於 2026/3/29 下午 01:04 寫道:
>
> Eric Woudstra 於 2026/3/27 下午 12:41 寫道:
>>
>> On 3/26/26 4:35 PM, Lucien.Jheng wrote:
>>> AN8811HB needs a MCU soft-reset cycle before firmware loading begins.
>>> Assert the MCU (hold it in reset) and immediately deassert (release)
>>> via a dedicated PBUS register pair (0x5cf9f8 / 0x5cf9fc), accessed
>>> through the PHY-addr+8 MDIO bus node rather than the BUCKPBUS indirect
>>> path.  This clears the MCU state before the firmware loading sequence
>>> starts.
>>>
>>> Add __air_pbus_reg_write() as a low-level helper for this access, then
>>> implement an8811hb_mcu_assert() / _deassert() on top of it. Wire both
>>> into an8811hb_load_firmware() and en8811h_restart_mcu() so every
>>> firmware load or MCU restart on AN8811HB correctly sequences the reset
>>> control registers.
>>>
>>> Fixes: 0a55766b7711 ("net: phy: air_en8811h: add Airoha AN8811HB 
>>> support")
>>> Signed-off-by: Lucien Jheng <lucienzx159@gmail.com>
>>> ---
>>> Changes in v2:
>>> - Rewrite commit message: The previous wording was ambiguous,
>>>    as it suggested the MCU remains asserted during the entire
>>>    firmware loading process, rather than a quick reset cycle
>>>    before it starts.
>>>    Clarify that assert and deassert is an immediate soft-reset
>>>    cycle to clear MCU state before firmware loading begins.
>>> - Add Fixes: 0a55766b7711 ("net: phy: air_en8811h: add Airoha AN8811HB
>>>    support").
>>> - Change phydev_info() to phydev_dbg() in an8811hb_mcu_assert() and
>>>    an8811hb_mcu_deassert() to avoid noise during normal boot.
>>>
>>>   drivers/net/phy/air_en8811h.c | 105 
>>> ++++++++++++++++++++++++++++++++++
>>>   1 file changed, 105 insertions(+)
>>>
>>> diff --git a/drivers/net/phy/air_en8811h.c 
>>> b/drivers/net/phy/air_en8811h.c
>>> index 29ae73e65caa..01fce1b93618 100644
>>> --- a/drivers/net/phy/air_en8811h.c
>>> +++ b/drivers/net/phy/air_en8811h.c
>>> @@ -170,6 +170,16 @@
>>>   #define   AN8811HB_CLK_DRV_CKO_LDPWD        BIT(13)
>>>   #define   AN8811HB_CLK_DRV_CKO_LPPWD        BIT(14)
>>>
>>> +#define AN8811HB_MCU_SW_RST        0x5cf9f8
>>> +#define   AN8811HB_MCU_SW_RST_HOLD        BIT(16)
>>> +#define   AN8811HB_MCU_SW_RST_RUN        (BIT(16) | BIT(0))
>>> +#define AN8811HB_MCU_SW_START        0x5cf9fc
>>> +#define   AN8811HB_MCU_SW_START_EN        BIT(16)
>>> +
>>> +/* MII register constants for PBUS access (PHY addr + 8) */
>>> +#define AIR_PBUS_ADDR_HIGH        0x1c
>>> +#define AIR_PBUS_DATA_HIGH        0x10
>>> +
>>>   /* Led definitions */
>>>   #define EN8811H_LED_COUNT    3
>>>
>>> @@ -254,6 +264,36 @@ static int air_phy_write_page(struct phy_device 
>>> *phydev, int page)
>>>       return __phy_write(phydev, AIR_EXT_PAGE_ACCESS, page);
>>>   }
>>>
>>> +static int __air_pbus_reg_write(struct phy_device *phydev,
>>> +                u32 pbus_reg, u32 pbus_data)
>>> +{
>>> +    struct mii_bus *bus = phydev->mdio.bus;
>>> +    int pbus_addr = phydev->mdio.addr + 8;
>>> +    int ret;
>>> +
>>> +    ret = __mdiobus_write(bus, pbus_addr, AIR_EXT_PAGE_ACCESS,
>>> +                  upper_16_bits(pbus_reg));
>>> +    if (ret < 0)
>>> +        return ret;
>>> +
>>> +    ret = __mdiobus_write(bus, pbus_addr, AIR_PBUS_ADDR_HIGH,
>>> +                  (pbus_reg & GENMASK(15, 6)) >> 6);
>>> +    if (ret < 0)
>>> +        return ret;
>>> +
>>> +    ret = __mdiobus_write(bus, pbus_addr, (pbus_reg & GENMASK(5, 
>>> 2)) >> 2,
>>> +                  lower_16_bits(pbus_data));
>>> +    if (ret < 0)
>>> +        return ret;
>>> +
>>> +    ret = __mdiobus_write(bus, pbus_addr, AIR_PBUS_DATA_HIGH,
>>> +                  upper_16_bits(pbus_data));
>>> +    if (ret < 0)
>>> +        return ret;
>>> +
>>> +    return 0;
>>> +}
>>> +
>> Maybe you can use mutex_lock() and mutex_unlock() here also, like so:

After checking, I decided not to implement mutex_lock in pbus_read/write.

This is because these functions are only used within 
an8811hb_mcu_assert/deassert.

Since these sequences cannot be interrupted, a mutex lock is not required.

Thanks

>>
>> static int __air_pbus_reg_write(struct mdio_device *mdio, u32 
>> pbus_address,
>>                 u32 pbus_data)
>> {
>>     int ret;
>>
>>     ret = __mdiobus_write(mdio->bus, mdio->addr, AIR_EXT_PAGE_ACCESS,
>>                   (u16)(pbus_address >> 6));
>>     if (ret < 0)
>>         return ret;
>>
>>     ret = __mdiobus_write(mdio->bus, mdio->addr, (pbus_address >> 2) 
>> & 0xf,
>>                   (u16)pbus_data);
>>     if (ret < 0)
>>         return ret;
>>
>>     ret = __mdiobus_write(mdio->bus, mdio->addr, AIR_PBUS_DATA_HIGH,
>>                   (u16)(pbus_data >> 16));
>>     if (ret < 0)
>>         return ret;
>>
>>     return 0;
>> }
>>
>> static int air_pbus_reg_write(struct mdio_device *mdio, u32 
>> pbus_address,
>>                   u32 pbus_data)
>> {
>>     int ret;
>>
>>     mutex_lock(&mdio->bus->mdio_lock);
>>
>>     ret = __air_pbus_reg_write(mdio, pbus_address, pbus_data);
>>     if (ret < 0)
>>         dev_err(&mdio->dev, "%s 0x%08x failed: %d\n", __func__,
>>             pbus_address, ret);
>>
>>     mutex_unlock(&mdio->bus->mdio_lock);
>>
>>     return ret;
>> }
>>
>> static int __air_pbus_reg_read(struct mdio_device *mdio, u32 
>> pbus_address,
>>                    u32 *pbus_data)
>> {
>>     int ret, pbus_data_low;
>>
>>     ret = __mdiobus_write(mdio->bus, mdio->addr, AIR_EXT_PAGE_ACCESS,
>>                   (u16)(pbus_address >> 6));
>>     if (ret < 0)
>>         return ret;
>>
>>     ret = __mdiobus_read(mdio->bus, mdio->addr, (pbus_address >> 2) & 
>> 0xf);
>>     if (ret < 0)
>>         return ret;
>>     pbus_data_low = ret;
>>
>>     ret = __mdiobus_read(mdio->bus, mdio->addr, AIR_PBUS_DATA_HIGH);
>>     if (ret < 0)
>>         return ret;
>>
>>     *pbus_data = (ret << 16) + pbus_data_low;
>>
>>     return 0;
>> }
>>
>> static int air_pbus_reg_read(struct mdio_device *mdio, u32 pbus_address,
>>                  u32 *pbus_data)
>> {
>>     int ret;
>>
>>     mutex_lock(&mdio->bus->mdio_lock);
>>
>>     ret = __air_pbus_reg_read(mdio, pbus_address, pbus_data);
>>     if (ret < 0)
>>         dev_err(&mdio->dev, "%s 0x%08x failed: %d\n", __func__,
>>             pbus_address, ret);
>>
>>     mutex_unlock(&mdio->bus->mdio_lock);
>>
>>     return ret;
>> }
>>
>> With:
>>
>> #define AIR_PBUS_DATA_HIGH        0x10
>>
> Thank you for your feedback.
>
> I should add mutex_lock() and mutex_unlock() to protect these operations.
>
> I will include this in the next version.
>
>>>   static int __air_buckpbus_reg_write(struct phy_device *phydev,
>>>                       u32 pbus_address, u32 pbus_data)
>>>   {
>>> @@ -570,10 +610,65 @@ static int an8811hb_load_file(struct 
>>> phy_device *phydev, const char *name,
>>>       return ret;
>>>   }
>>>
>>> +static int an8811hb_mcu_assert(struct phy_device *phydev)
>>> +{
>>> +    int ret;
>>> +
>>> +    phy_lock_mdio_bus(phydev);
>>> +
>>> +    ret = __air_pbus_reg_write(phydev, AN8811HB_MCU_SW_RST,
>>> +                   AN8811HB_MCU_SW_RST_HOLD);
>>> +    if (ret < 0)
>>> +        goto unlock;
>>> +
>>> +    ret = __air_pbus_reg_write(phydev, AN8811HB_MCU_SW_START, 0);
>>> +    if (ret < 0)
>>> +        goto unlock;
>>> +
>>> +    msleep(50);
>>> +    phydev_dbg(phydev, "MCU asserted\n");
>>> +
>>> +unlock:
>>> +    phy_unlock_mdio_bus(phydev);
>>> +    return ret;
>>> +}
>>> +
>>> +static int an8811hb_mcu_deassert(struct phy_device *phydev)
>>> +{
>>> +    int ret;
>>> +
>>> +    phy_lock_mdio_bus(phydev);
>>> +
>>> +    ret = __air_pbus_reg_write(phydev, AN8811HB_MCU_SW_START,
>>> +                   AN8811HB_MCU_SW_START_EN);
>>> +    if (ret < 0)
>>> +        goto unlock;
>>> +
>>> +    ret = __air_pbus_reg_write(phydev, AN8811HB_MCU_SW_RST,
>>> +                   AN8811HB_MCU_SW_RST_RUN);
>>> +    if (ret < 0)
>>> +        goto unlock;
>>> +
>>> +    msleep(50);
>>> +    phydev_dbg(phydev, "MCU deasserted\n");
>>> +
>>> +unlock:
>>> +    phy_unlock_mdio_bus(phydev);
>>> +    return ret;
>>> +}
>>> +
>>>   static int an8811hb_load_firmware(struct phy_device *phydev)
>>>   {
>>>       int ret;
>>>
>>> +    ret = an8811hb_mcu_assert(phydev);
>>> +    if (ret < 0)
>>> +        return ret;
>>> +
>>> +    ret = an8811hb_mcu_deassert(phydev);
>>> +    if (ret < 0)
>>> +        return ret;
>>> +
>>>       ret = air_buckpbus_reg_write(phydev, EN8811H_FW_CTRL_1,
>>>                        EN8811H_FW_CTRL_1_START);
>>>       if (ret < 0)
>>> @@ -662,6 +757,16 @@ static int en8811h_restart_mcu(struct 
>>> phy_device *phydev)
>>>   {
>>>       int ret;
>>>
>>> +    if (phy_id_compare_model(phydev->phy_id, AN8811HB_PHY_ID)) {
>>> +        ret = an8811hb_mcu_assert(phydev);
>>> +        if (ret < 0)
>>> +            return ret;
>>> +
>>> +        ret = an8811hb_mcu_deassert(phydev);
>>> +        if (ret < 0)
>>> +            return ret;
>>> +    }
>>> +
>>>       ret = air_buckpbus_reg_write(phydev, EN8811H_FW_CTRL_1,
>>>                        EN8811H_FW_CTRL_1_START);
>>>       if (ret < 0)
>>> -- 
>>> 2.34.1
>>>

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH iwl-net 2/2] idpf: fix skb datapath queue based scheduling crashes and timeouts
From: Salin, Samuel @ 2026-04-16 15:24 UTC (permalink / raw)
  To: Loktionov, Aleksandr, Hay, Joshua A,
	intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org
In-Reply-To: <IA3PR11MB8986B2581A07EC756D0D9186E55BA@IA3PR11MB8986.namprd11.prod.outlook.com>

> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of
> Loktionov, Aleksandr
> Sent: Tuesday, April 7, 2026 11:37 PM
> To: Hay, Joshua A <joshua.a.hay@intel.com>; intel-wired-lan@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Subject: Re: [Intel-wired-lan] [PATCH iwl-net 2/2] idpf: fix skb datapath queue
> based scheduling crashes and timeouts
> 
> 
> 
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> > Of Joshua Hay
> > Sent: Tuesday, April 7, 2026 1:33 AM
> > To: intel-wired-lan@lists.osuosl.org
> > Cc: netdev@vger.kernel.org
> > Subject: [Intel-wired-lan] [PATCH iwl-net 2/2] idpf: fix skb datapath
> > queue based scheduling crashes and timeouts
> >
> > The splitq Tx resource checks were assuming that the queues were using
> > flow based scheduling and checking the refillqs for free buffers.
> > However, the Tx refillqs are not allocated when using queue based
> > scheduling resulting in a NULL ptr dereference. Adjust the Tx resource
> > checks to only check available descriptor resources when using queue
> > based scheduling. Because queue based scheduling does not have any
> > notion of descriptor only completions, there cannot be any packets in
> > flight, meaning there is no need to check for pending completions.
> >
> > The driver also only supported 8 byte completion descriptors in the
> > skb datapath previously. However, currently the FW only supports 4
> > byte completion descriptors when using queue based scheduling. This
> > meant we were skipping over completions, resulting in Tx timeouts.
> > Add support to process both 4 and 8 byte completion descriptors,
> > depending on the scheduling mode. Cache the next_to_clean completion
> > descriptor in the completion queue struct, and fetch this descriptor
> > before the start of each cleaning loop. Access the next descriptor in
> > the loop by calculating the index based on raw byte count.
> >
> > Fixes: 0c3f135e840d ("idpf: stop Tx if there are insufficient buffer
> > resources")
> > Fixes: 1c325aac10a8 ("idpf: configure resources for TX queues")
> > Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
> > Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
> > ---
> >  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 49 +++++++++++++-------
> > -  drivers/net/ethernet/intel/idpf/idpf_txrx.h |  6 ++-
> >  2 files changed, 36 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> > b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> > index f6b3b15364ff..4fc0bb14c5b1 100644
> > --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> > +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> > @@ -270,11 +270,9 @@ static int idpf_tx_desc_alloc(const struct
> > idpf_vport *vport,  static int idpf_compl_desc_alloc(const struct
> > idpf_vport *vport,
> >  				 struct idpf_compl_queue *complq)  {
> 
> ...
> 
> >
> >  /**
> >   * struct idpf_sw_queue
> > --
> > 2.39.2
> 
> 
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

Tested-by: Samuel Salin <Samuel.salin@intel.com>

^ permalink raw reply

* Re: [PATCH v5 net-next 0/8] dpll/ice: Add TXC DPLL type and full TX reference clock control for E825
From: Jakub Kicinski @ 2026-04-16 15:27 UTC (permalink / raw)
  To: Kubalewski, Arkadiusz
  Cc: Nitka, Grzegorz, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	Oros, Petr, richardcochran@gmail.com, andrew+netdev@lunn.ch,
	Kitszel, Przemyslaw, Nguyen, Anthony L,
	Prathosh.Satish@microchip.com, Vecera, Ivan, jiri@resnulli.us,
	vadim.fedorenko@linux.dev, donald.hunter@gmail.com,
	horms@kernel.org, pabeni@redhat.com, davem@davemloft.net,
	edumazet@google.com
In-Reply-To: <IA0PR11MB737842E2098D0952A8BA1FE29B222@IA0PR11MB7378.namprd11.prod.outlook.com>

On Wed, 15 Apr 2026 13:23:22 +0000 Kubalewski, Arkadiusz wrote:
> >> Well, the true is that we did not anticipated per-port control of the
> >> TX clock source, as a single DPLL device could drive multiple of such.
> >>
> >> This is not true, that we pretend there is a second PLL - there is a
> >> PLL on each TX clock, maybe not a full DPLL, but still the loop with
> >> a control over it's sources is there and it has the same 2 external
> >> sources + default XO.  
> >
> >Don't we put that MAC PLL into bypass mode if we feed a clock from
> >the EEC DPLL?  
> 
> This HW doesn't use EEC DPLL signal to feed MAC clock, as DPLL is
> external from NIC point of view. Only 2 signals from such external DPLL
> device are used by NIC:
> - synce (a single source for all those TXC per-port DPLL device)
> - time_ref (a source for the TS_PLL - which drives PTP timer)

No bypass? The PLL is actually in the loop? oof, this is beyond 
my understanding of clocks and signals :S

> >> A mentioned try of adding per port MUX-type pin, just to give some
> >>control
> >> to the user, is where we wanted to simplify things, but in the end the
> >>API
> >> would have to be modified in significant way, various paths related to
> >>pin
> >> registration and keeping correct references, just to make working case
> >> for the pin_on_pin_register and it's internals. We decided that the
> >>burden
> >> and impact for existing design was to high.
> >>
> >> And that is why the TXC approach emerged, the change of DPLL is minimal,
> >> The model is still correct from user perspective, SyncE SW controller
> >>shall
> >> anticipate possibility that per-port TXC dpll is there  
> >
> >We are starting to push into what was previously the domain of
> >drivers/clk, tho. IIUC the "ASIC PLL"s are usually integrated with
> >clock dividers. And cannot be "configured" after chip init / async
> >reset (which is why I presume you whack a reset in patch 7?).  
> 
> Well, we need CGU-dividers change for a frequency-compliance with lower
> link speeds, the link reset which is required as part of tx-clk switch
> and link establishment on a new clock.
> 
> >  
> >> This particular device and driver doesn't implement any EEC-type DPLL
> >> device, the one could think that we can just change the type here and
> >>use
> >> EEC type instead of new one TXC - since we share pins from external dpll
> >> driver, which is EEC type, and our DPLL device would have different
> >>clock_id
> >> and module. But, further designs, where a single NIC is having control
> >>over
> >> both a EEC DPLL and ability to control each source per-port this would
> >>be
> >> problematic. At least one NIC Port driver would have to have 2 EEC-type
> >>DPLLs
> >> leaving user with extra confusion.  
> >
> >The distinction between TXC and EEC dpll is confusing.
> >I thought EEC one _was_supposed_to_ drive the Tx clock?
> >What PPS means is obvious, what EEC means if not driving Tx clock is
> >unclear to me..
> >  
> 
> Yes, correct, EEC DPLL main task would be to drive TX clocks of NIC
> ports, but if there is a per-port control something extra is required.
> 
> >Let me summarize my concerns - we need to navigate the split between
> >drivers/clk and dpll. We need a distinction on what goes where, because
> >every ASIC has a bunch of PLLs which until now have been controlled by
> >device tree (if at all). If the main question we want to answer is
> >"which clock ref is used to drive internal clock" all we need is a MUX.
> >If we want to make dpll cover also ASIC PLLs for platforms without
> >device tree we need a more generic name than TXC, IMHO.  
> 
> Well, 'floating' MUX type pin not connected to any dpll would require a
> lot of additional implementations, just to allow source selection, as we
> have tried it already.
> 
> Wouldn't more generic name cause a DPLL purpose problem?

The old proposal in netdev family was to to have source selection
without creating a real mux. Not saying I'm dead set on that direction.

> We still want to make sure that given DPLL device would serve the role
> of source selection for particular port where a source pin should be an
> output either on EEC dpll or some external signal generator but somehow
> related to SyncE or similar solutions.

Right, but adding a new "type" per location of the PLL (especially if
we lean into covering any ASIC PLL) may not scale, and opens us up to
"vendor X calls it Y" and "in design A clock is fed by pll type X and
in design B by type Y".

IIUC you do provide "linking" of the pins? netdev will have the MAC pin
assigned. Is the pin that connects the PLLs also annotated so that user
knows what's on the "other side"? Maybe the topology would be clear
enough from just that, and we don't have to add a TXC type.
Call the PLL "integrated" or something generic. User should be able to
trace the path of the signals?

^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-16 15:31 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <ad5kuCZz+gR1TlSh@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
> I still see roughly a 5% overhead from the atomic refcount operation
> itself, but on that platform there is no throughput drop when using
> page fragments versus full-page mode.

That seems to contradict your claim that it's a problem with a specific
platform.. Since we're in the merge window I asked David Wei to try to
experiment with disabling page fragmentation on the ARM64 platforms we
have at Meta. If it repros we should use the generic rx-buf-len
ringparam because more NICs may want to implement this strategy.

^ permalink raw reply

* Re: [PATCH net] net/sched: act_ct: fix skb leak on fragment check failure
From: Jamal Hadi Salim @ 2026-04-16 15:31 UTC (permalink / raw)
  To: Dudu Lu; +Cc: netdev, jiri, horms
In-Reply-To: <20260416130138.38050-1-phx0fer@gmail.com>

On Thu, Apr 16, 2026 at 9:01 AM Dudu Lu <phx0fer@gmail.com> wrote:
>
> When tcf_ct_handle_fragments() returns an error other than -EINPROGRESS
> (e.g. -EINVAL from malformed fragments), tcf_ct_act() jumps to out_frag
> which unconditionally returns TC_ACT_CONSUMED. This tells the caller the
> skb was consumed, but it was not freed, leaking one skb per malformed
> fragment.
>
> TC_ACT_CONSUMED is only correct for -EINPROGRESS, where defragmentation
> is genuinely in progress and the skb has been queued. For all other
> errors the skb is still owned by the caller and must be freed via
> TC_ACT_SHOT.
>
> Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags")
> Signed-off-by: Dudu Lu <phx0fer@gmail.com>

Do you have a reproducer? Always helps adding at least a tdc test.
Also: How did you find this issue? was it AI? If yes, please add the
tag "Assisted-by:<AI name here>"

cheers,
jamal

> ---
>  net/sched/act_ct.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 7d5e50c921a0..870655f682bd 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -1107,8 +1107,10 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>         return retval;
>
>  out_frag:
> -       if (err != -EINPROGRESS)
> +       if (err != -EINPROGRESS) {
>                 tcf_action_inc_drop_qstats(&c->common);
> +               return TC_ACT_SHOT;
> +       }
>         return TC_ACT_CONSUMED;
>
>  drop:
> --
> 2.39.3 (Apple Git-145)
>

^ permalink raw reply

* Re: [PATCH net v3 2/5] i40e: skip unnecessary VF reset when setting trust
From: Simon Horman @ 2026-04-16 15:31 UTC (permalink / raw)
  To: jtornosm
  Cc: 'Simon Horman', netdev, intel-wired-lan, jesse.brandeburg,
	anthony.l.nguyen, davem, edumazet, kuba, pabeni,
	przemyslaw.kitszel
In-Reply-To: <20260414110006.124286-3-jtornosm@redhat.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
i40e: skip unnecessary VF reset when setting trust

This commit skips the VF reset when setting VF trust, unless trust is
being removed from a VF with ADQ cloud filters. The goal is to avoid
a 10-second delay caused by the reset during trust changes.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> index a26c3d47ec156..fea267af7afe6 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> @@ -4987,16 +4987,21 @@ int i40e_ndo_set_vf_trust(struct net_device *netdev, int vf_id, bool setting)
>  	set_bit(__I40E_MACVLAN_SYNC_PENDING, pf->state);
>  	pf->vsi[vf->lan_vsi_idx]->flags |= I40E_VSI_FLAG_FILTER_CHANGED;
>  
> -	i40e_vc_reset_vf(vf, true);

By skipping the VF reset here, does the VF's privilege capability flag
become permanently desynchronized from its trust state?

The underlying capability flag I40E_VIRTCHNL_VF_CAP_PRIVILEGE is used by
Virtchnl handlers like i40e_check_vf_permission() to authorize privileged
actions. This flag is exclusively updated in i40e_alloc_vf_res(), which is
only invoked during a VF reset.

Without the reset, it appears that granting trust fails to give the VF
actual privileged capabilities, and revoking trust leaves the privilege
bit set indefinitely, allowing an untrusted VF to perform unauthorized
privileged operations.

Additionally, could revoking trust fail to scrub the VF's actively
configured privileged hardware state, such as promiscuous mode and excess
MAC/VLAN filters?

Trusted VFs can explicitly enable promiscuous mode and configure MAC/VLAN
filters beyond strict limits. In the original code, i40e_vc_reset_vf()
dismantled these states by calling i40e_cleanup_reset_vf(), which
explicitly invokes i40e_config_vf_promiscuous_mode(..., false, false) and
destroys the VSI.

By skipping the reset, an untrusted VF retains its previously configured
promiscuous mode, allowing it to continue intercepting unauthorized traffic.

>  	dev_info(&pf->pdev->dev, "VF %u is now %strusted\n",
>  		 vf_id, setting ? "" : "un");
>  
> +	/* Only reset VF if we're removing trust and it has ADQ cloud filters.
> +	 * Cloud filters can only be added when trusted, so they must be
> +	 * removed when trust is revoked. Other trust changes don't require
> +	 * reset - MAC/VLAN filter sync happens through normal operation.
> +	 */
>  	if (vf->adq_enabled) {
>  		if (!vf->trusted) {
>  			dev_info(&pf->pdev->dev,
>  				 "VF %u no longer Trusted, deleting all cloud filters\n",
>  				 vf_id);
>  			i40e_del_all_cloud_filters(vf);
> +			i40e_vc_reset_vf(vf, true);

Is it possible that reordering i40e_del_all_cloud_filters() before
i40e_vc_reset_vf() creates a race condition on vf->cloud_filter_list?

i40e_del_all_cloud_filters() performs lockless modifications on
vf->cloud_filter_list while running from the netlink context.

In the original code, i40e_vc_reset_vf() ran first, which immediately
cleared I40E_VF_STATE_ACTIVE. This prevented concurrent Virtchnl requests
like VIRTCHNL_OP_ADD_CLOUD_FILTER from modifying the list, because their
handlers in i40e_vc_process_vf_msg() abort if I40E_VF_STATE_ACTIVE is
not set.

By calling i40e_del_all_cloud_filters() first, the VF is left in the
active state. The PF service task could concurrently process an
ADD_CLOUD_FILTER message, executing hlist_add_head() simultaneously with
hlist_for_each_entry_safe(), which might cause list corruption.

>  		}
>  	}
>

^ permalink raw reply

* Re: [PATCH] net: sched: teql: fix use-after-free in teql_master_xmit
From: Jamal Hadi Salim @ 2026-04-16 15:43 UTC (permalink / raw)
  To: Kito Xu (veritas501)
  Cc: Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev, linux-kernel
In-Reply-To: <20260413094448.2263828-1-hxzene@gmail.com>

On Mon, Apr 13, 2026 at 5:44 AM Kito Xu (veritas501) <hxzene@gmail.com> wrote:
>
> teql_destroy() traverses the circular slave list to unlink a slave
> qdisc. It reads master->slaves as both the starting point and the
> do-while termination sentinel. However, the data-path writers
> teql_dequeue() and teql_master_xmit() concurrently modify
> master->slaves without holding RTNL, running in softirq or process
> context respectively. If master->slaves is overwritten to an
> already-visited node mid-traversal, the loop exits early without
> finding the target slave. The slave is never unlinked, but its memory
> is freed by call_rcu() -- leaving a dangling pointer in the circular
> list. The next teql_master_xmit() traversal dereferences freed memory.
>


Do you have a reproducer?
And it would be nice to get a tdc test from you.

Also, please use assisted-by: or even suggested-by if this patch was
suggested by AI.

cheers,
jamal

> race condition:
>
>   CPU 0 (teql_destroy, RTNL held)       CPU 1 (teql_dequeue, softirq)
>   -----------------------------------   -----------------------------
>   prev = master->slaves;  // = A
>   q = NEXT_SLAVE(A);      // = B
>   B == A? No.
>   prev = B;
>                                          /* slave C's queue drains */
>                                          skb == NULL ->
>                                          dat->m->slaves = C; /* write! */
>
>   q = NEXT_SLAVE(B);      // = C
>   C == A? No.
>   prev = C;
>   /* check: (prev=C) != master->slaves(C)?
>      FALSE -> loop exits! */
>   /* A never unlinked, freed by call_rcu */
>
>   CPU 0 (teql_master_xmit, later)
>   -----------------------------------
>   q = NEXT_SLAVE(C);      // = A (freed!)
>   slave = qdisc_dev(A);   // UAF!
>
> Fix this by saving master->slaves into a local `head` variable at the
> start of teql_destroy() and using it as a stable sentinel for the
> entire traversal. Also annotate all data-path accesses to
> master->slaves with READ_ONCE/WRITE_ONCE to prevent store-tearing and
> compiler-introduced re-reads.
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in teql_master_xmit+0xeae/0x14a0
> Read of size 8 at addr ffff888018074040 by task poc/162
>
> CPU: 2 UID: 0 PID: 162 Comm: poc Not tainted 7.0.0-rc7-next-20260410 #10 PREEMPTLAZY
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x64/0x80
>  print_report+0xd0/0x5e0
>  kasan_report+0xce/0x100
>  teql_master_xmit+0xeae/0x14a0
>  dev_hard_start_xmit+0xcd/0x5b0
>  sch_direct_xmit+0x12e/0xac0
>  __qdisc_run+0x3b1/0x1a70
>  __dev_queue_xmit+0x2257/0x3100
>  ip_finish_output2+0x615/0x19c0
>  ip_output+0x158/0x2b0
>  ip_send_skb+0x11b/0x160
>  udp_send_skb+0x64b/0xd80
>  udp_sendmsg+0x138c/0x1ec0
>  __sys_sendto+0x331/0x3a0
>  __x64_sys_sendto+0xe0/0x1c0
>  do_syscall_64+0x64/0x680
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The buggy address belongs to the object at ffff888018074000
>  which belongs to the cache kmalloc-512 of size 512
> The buggy address is located 64 bytes inside of
>  freed 512-byte region [ffff888018074000, ffff888018074200)
> ==================================================================
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
> ---
>  net/sched/sch_teql.c | 24 ++++++++++++++----------
>  1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
> index ec4039a201a2..2e86397a5219 100644
> --- a/net/sched/sch_teql.c
> +++ b/net/sched/sch_teql.c
> @@ -101,7 +101,7 @@ teql_dequeue(struct Qdisc *sch)
>         if (skb == NULL) {
>                 struct net_device *m = qdisc_dev(q);
>                 if (m) {
> -                       dat->m->slaves = sch;
> +                       WRITE_ONCE(dat->m->slaves, sch);
>                         netif_wake_queue(m);
>                 }
>         } else {
> @@ -136,19 +136,23 @@ teql_destroy(struct Qdisc *sch)
>         if (!master)
>                 return;
>
> -       prev = master->slaves;
> +       prev = READ_ONCE(master->slaves);
>         if (prev) {
> +               struct Qdisc *head = prev;
> +
>                 do {
>                         q = NEXT_SLAVE(prev);
>                         if (q == sch) {
>                                 NEXT_SLAVE(prev) = NEXT_SLAVE(q);
> -                               if (q == master->slaves) {
> -                                       master->slaves = NEXT_SLAVE(q);
> -                                       if (q == master->slaves) {
> +                               if (q == head) {
> +                                       WRITE_ONCE(master->slaves,
> +                                                  NEXT_SLAVE(q));
> +                                       if (q == NEXT_SLAVE(q)) {
>                                                 struct netdev_queue *txq;
>
>                                                 txq = netdev_get_tx_queue(master->dev, 0);
> -                                               master->slaves = NULL;
> +                                               WRITE_ONCE(master->slaves,
> +                                                          NULL);
>
>                                                 dev_reset_queue(master->dev,
>                                                                 txq, NULL);
> @@ -158,7 +162,7 @@ teql_destroy(struct Qdisc *sch)
>                                 break;
>                         }
>
> -               } while ((prev = q) != master->slaves);
> +               } while ((prev = q) != head);
>         }
>  }
>
> @@ -285,7 +289,7 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
>         int subq = skb_get_queue_mapping(skb);
>         struct sk_buff *skb_res = NULL;
>
> -       start = master->slaves;
> +       start = READ_ONCE(master->slaves);
>
>  restart:
>         nores = 0;
> @@ -317,7 +321,7 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
>                                     netdev_start_xmit(skb, slave, slave_txq, false) ==
>                                     NETDEV_TX_OK) {
>                                         __netif_tx_unlock(slave_txq);
> -                                       master->slaves = NEXT_SLAVE(q);
> +                                       WRITE_ONCE(master->slaves, NEXT_SLAVE(q));
>                                         netif_wake_queue(dev);
>                                         master->tx_packets++;
>                                         master->tx_bytes += length;
> @@ -329,7 +333,7 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
>                                 busy = 1;
>                         break;
>                 case 1:
> -                       master->slaves = NEXT_SLAVE(q);
> +                       WRITE_ONCE(master->slaves, NEXT_SLAVE(q));
>                         return NETDEV_TX_OK;
>                 default:
>                         nores = 1;
> --
> 2.43.0
>

^ permalink raw reply

* Re: [PATCH v11 net-next 3/5] psp: add a new netdev event for dev unregister
From: Wei Wang @ 2026-04-16 15:50 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Jakub Kicinski, Daniel Zahka, Willem de Bruijn, David Wei,
	Andrew Lunn, David S . Miller, Eric Dumazet, Simon Horman,
	Wei Wang
In-Reply-To: <cb5bf340-e5fa-4866-b5c6-565709ba633c@redhat.com>

On Mon, Apr 13, 2026 at 3:48 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 4/9/26 1:14 AM, Wei Wang wrote:
> > +static int psp_netdev_event(struct notifier_block *nb, unsigned long event,
> > +                         void *ptr)
> > +{
> > +     struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> > +     struct psp_dev *psd;
> > +
> > +     if (event != NETDEV_UNREGISTER)
> > +             return NOTIFY_DONE;
> > +
> > +     rcu_read_lock();
> > +     psd = rcu_dereference(dev->psp_dev);
> > +     if (psd && psp_dev_tryget(psd)) {
> > +             rcu_read_unlock();
> > +             mutex_lock(&psd->lock);
> > +             psp_dev_disassoc_one(psd, dev);
> > +             mutex_unlock(&psd->lock);
> > +             psp_dev_put(psd);
>
> Sashiko notes that the above is racy:
>
> ---
> Can this code race with psp_nl_dev_assoc_doit() and permanently leak a
> net_device reference?
> If CPU1 is executing psp_nl_dev_assoc_doit() and CPU2 is unregistering the
> device, the following interleaving could happen:
> CPU1 (psp_nl_dev_assoc_doit)
>     assoc_dev = dev_get_by_index(...) // acquires a reference
> CPU2 (unregister_netdevice)
>     psp_netdev_event()
>         psd = rcu_dereference(dev->psp_dev); // sees NULL, returns
> NOTIFY_DONE
> CPU1 (psp_nl_dev_assoc_doit)
>     cmpxchg(&assoc_dev->psp_dev, NULL, psd); // succeeds!
>     list_add(...) // adds to psd->assoc_dev_list
> If this occurs, the notifier misses the unregistration event since it runs
> before the device is fully associated. The unregistering thread will then
> enter netdev_wait_allrefs() and wait indefinitely because the reference
> held in assoc_dev_list is never released.
> ---
>

Hmm... Good catch. I think I could do the following to fix this:

On psp_nl_dev_assoc_doit():
- grab psd->lock
- Do cmpxchg(&assoc_dev->psp_dev, NULL, psd)
- Do if (READ_ONCE(assoc_dev->reg_state) != NETREG_REGISTERED) {
          err = -ENODEV;
          goto cleanup;
       }

On unregister_netdevice side, the sequence of calls are as follows in
unregister_netdevice_many_notify():
- WRITE_ONCE(reg_state, NETREG_UNREGISTERING);
- synchronize_net()
- call_netdevice_notifiers(NETDEV_UNREGISTER, dev) =>
psp_netdev_event() => Read psd => grab psd->lock and do clean up if
psd != NULL

With this logic, at least 1 side would do the cleanup. And the clean
up path is guarded by psd->lock so it is also sequenced and should not
have double free issue.
Let me know what you think.


> /P
>

^ permalink raw reply

* Re: [PATCH net v3] net/sched: taprio: fix NULL pointer dereference in class dump
From: Jamal Hadi Salim @ 2026-04-16 15:50 UTC (permalink / raw)
  To: Weiming Shi
  Cc: Vinicius Costa Gomes, Jiri Pirko, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, Xiang Mei
In-Reply-To: <20260414104311.74115-2-bestswngs@gmail.com>

On Tue, Apr 14, 2026 at 6:43 AM Weiming Shi <bestswngs@gmail.com> wrote:
>
> When a TAPRIO child qdisc is deleted via RTM_DELQDISC, taprio_graft()
> is called with new == NULL and stores NULL into q->qdiscs[cl - 1].
> Subsequent RTM_GETTCLASS dump operations walk all classes via
> taprio_walk() and call taprio_dump_class(), which calls taprio_leaf()
> returning the NULL pointer, then dereferences it to read child->handle,
> causing a kernel NULL pointer dereference.
>
> The bug is reachable with namespace-scoped CAP_NET_ADMIN on any kernel
> with CONFIG_NET_SCH_TAPRIO enabled. On systems with unprivileged user
> namespaces enabled, an unprivileged local user can trigger a kernel
> panic by creating a taprio qdisc inside a new network namespace,
> grafting an explicit child qdisc, deleting it, and requesting a class
> dump. The RTM_GETTCLASS dump itself requires no capability.
>

While i would say this looks good to me, I hate to sound like a broken
record but:

Do you have a reproducer?
And it would be nice to get a tdc test from you.

Also, please use assisted-by: or even suggested-by if this patch was
suggested by AI.

cheers,
jamal


>  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000007: 0000 [#1] SMP KASAN NOPTI
>  KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f]
>  RIP: 0010:taprio_dump_class (net/sched/sch_taprio.c:2475)
>  Call Trace:
>   <TASK>
>   tc_fill_tclass (net/sched/sch_api.c:1966)
>   qdisc_class_dump (net/sched/sch_api.c:2329)
>   taprio_walk (net/sched/sch_taprio.c:2510)
>   tc_dump_tclass_qdisc (net/sched/sch_api.c:2353)
>   tc_dump_tclass_root (net/sched/sch_api.c:2370)
>   tc_dump_tclass (net/sched/sch_api.c:2431)
>   rtnl_dumpit (net/core/rtnetlink.c:6827)
>   netlink_dump (net/netlink/af_netlink.c:2325)
>   rtnetlink_rcv_msg (net/core/rtnetlink.c:6927)
>   netlink_rcv_skb (net/netlink/af_netlink.c:2550)
>   </TASK>
>
> Fix this by substituting &noop_qdisc when new is NULL in
> taprio_graft(), following the same pattern used by multiq_graft() and
> prio_graft(). This ensures q->qdiscs[] slots are never NULL, making
> control-plane dump paths safe without requiring individual NULL checks.
>
> Since the data-plane paths (taprio_enqueue and taprio_dequeue_from_txq)
> previously had explicit NULL guards that would drop/skip the packet
> cleanly, update those checks to test for &noop_qdisc instead. Without
> this, packets would reach taprio_enqueue_one() which increments the root
> qdisc's qlen and backlog before calling the child's enqueue; noop_qdisc
> drops the packet but those counters are never rolled back, permanently
> inflating the root qdisc's statistics.
>
> Fixes: 665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]")
> Reported-by: Xiang Mei <xmei5@asu.edu>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> ---
> v3: fix broken patch
> v2: Also update NULL guards in taprio_enqueue() and
>     taprio_dequeue_from_txq() to avoid qlen/backlog inflation (Paolo).
> ---
>  net/sched/sch_taprio.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index f721c03514f60..07723b156c5b3 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -634,7 +634,7 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>         queue = skb_get_queue_mapping(skb);
>
>         child = q->qdiscs[queue];
> -       if (unlikely(!child))
> +       if (unlikely(child == &noop_qdisc))
>                 return qdisc_drop(skb, sch, to_free);
>
>         if (taprio_skb_exceeds_queue_max_sdu(sch, skb)) {
> @@ -717,7 +717,7 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
>         int len;
>         u8 tc;
>
> -       if (unlikely(!child))
> +       if (unlikely(child == &noop_qdisc))
>                 return NULL;
>
>         if (TXTIME_ASSIST_IS_ENABLED(q->flags))
> @@ -2183,6 +2183,9 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
>         if (!dev_queue)
>                 return -EINVAL;
>
> +       if (!new)
> +               new = &noop_qdisc;
> +
>         if (dev->flags & IFF_UP)
>                 dev_deactivate(dev);
>
> @@ -2196,14 +2199,14 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
>         *old = q->qdiscs[cl - 1];
>         if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
>                 WARN_ON_ONCE(dev_graft_qdisc(dev_queue, new) != *old);
> -               if (new)
> +               if (new != &noop_qdisc)
>                         qdisc_refcount_inc(new);
>                 if (*old)
>                         qdisc_put(*old);
>         }
>
>         q->qdiscs[cl - 1] = new;
> -       if (new)
> +       if (new != &noop_qdisc)
>                 new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
>
>         if (dev->flags & IFF_UP)
> --
> 2.43.0
>

^ permalink raw reply

* Re: [PATCH net-next 5/6] net: stmmac: move PHY handling out of __stmmac_open()/release()
From: Jakub Kicinski @ 2026-04-16 16:08 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Alexander Stein, Andrew Lunn, Heiner Kallweit, Alexandre Torgue,
	Andrew Lunn, David S. Miller, Eric Dumazet, linux-arm-kernel,
	linux-stm32, Maxime Coquelin, netdev, Paolo Abeni
In-Reply-To: <aeDojTdDTELfpT0X@shell.armlinux.org.uk>

On Thu, 16 Apr 2026 14:47:57 +0100 Russell King (Oracle) wrote:
> The next problem will be netdev's policy over reviews vs patches
> balance which I'm already in deficit, and I have *NO* *TIME*
> what so ever to review patches - let alone propose patches to
> fix people's problems.
> 
> So I'm going to say this plainly: if netdev wants to enforce that
> rule, then I won't be fixing people's problems.

Do you have a better proposal?
I'm under the same pressure of million stupid projects from my employer
as you are. Do y'all think that upstream maintainers have time given by
their employers to do the reviews? SMH.

^ permalink raw reply

* Re: [PATCH v3 0/4] Rust netlink support + use in Rust Binder
From: Jakub Kicinski @ 2026-04-16 16:11 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Trevor Gross, Danilo Krummrich,
	Donald Hunter, David S. Miller, Eric Dumazet, Paolo Abeni,
	Simon Horman, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Christian Brauner, Carlos Llamas, linux-kernel,
	rust-for-linux, netdev
In-Reply-To: <20260415-binder-netlink-v3-0-84be9ba63ee2@google.com>

On Wed, 15 Apr 2026 09:37:50 +0000 Alice Ryhl wrote:
> The C Binder driver exposes messages over netlink when transactions
> fail, so that a userpace daemon can respond to processes with many
> failing transactions.

net-next is closed for the duration of the merge window, so that 
people can take a mental break and increasingly catch up on all 
the accumulated CI/AI backed work. Please come back in 2 weeks.

^ permalink raw reply

* Re: [PATCH net v6 1/2] flow_dissector: do not dissect PPPoE PFC frames
From: Jakub Kicinski @ 2026-04-16 16:17 UTC (permalink / raw)
  To: Qingfang Deng
  Cc: linux-ppp, David S. Miller, Eric Dumazet, Paolo Abeni,
	Simon Horman, netdev, linux-kernel
In-Reply-To: <e1fd74dd-4fd5-4a67-b4c5-5911395b9dbe@linux.dev>

On Wed, 15 Apr 2026 21:42:09 +0800 Qingfang Deng wrote:
> The patch state is "Changes Requested" in patchwork but I haven't 
> received any feedback. Was it set by mistake?

Fixed, thanks for the report.

^ permalink raw reply

* Re: [PATCH RFC net-next v2 2/2] af_packet: Add port specific handling for HSR
From: Sebastian Andrzej Siewior @ 2026-04-16 16:18 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, (JC), Jayachandran, David S. Miller, Andrew Lunn,
	Chintan Vankar, Danish Anwar, Daolin Qiu, Eric Dumazet,
	Felix Maurer, Jakub Kicinski, Paolo Abeni, Richard Cochran,
	Simon Horman, Raghavendra, Vignesh, Bajjuri, Praneeth,
	TK, Pratheesh Gangadhar, Muralidharan, Neelima
In-Reply-To: <willemdebruijn.kernel.1f3d8356e4a5a@gmail.com>

On 2026-04-06 10:47:56 [-0400], Willem de Bruijn wrote:
Hi Willem,

> So the requirement is for a communication path between userspace and
> the driver over packet sockets.
> 
> Existing options that work for both rx and tx are
> 
> - in-band: a packet header or footer
> - mark, metadata
> - maybe: vlan tags
> 
> These require changes in the HSR driver to use them, but no changes in
> the protocol independent core logic, which includes packet sockets.
> 
> As I mentioned before we cannot sprinkle protocol specific code
> throughout protocol independent core code. That quickly leads to an
> unmaintainable mess. PTP over HSR is a particular small niche case,
> nowhere near first in line to get an exception to this guideline.

I understand your concern. I tried to make as self contained as
possible and little runtime overhead as possible.

> One perhaps interesting Rx only option I had missed before is
> SOF_TIMESTAMPING_OPT_PKTINFO. Would that give you the original
> device ifindex today?

The upper logic expects to poll() on the fd. If I need to filter the
device based on this then breaks the expectations.
I need also to receive packets without a timestamp so I don't think this
works.

> If so, we now only have to consider the Tx path to the HSR driver
> (the Tx path directly to the other drivers do not need this metadata).
> 
> I'm not convinced that it is hard to come up with a way to send
> a packet to the HSR driver with an optional header or footer or
> vlan data (or skb->protocol perhaps?) that cannot be
> differentiated from other traffic arriving at that ndo_start_xmit.

I've been looking to skb->protocol. Maybe if the packet has ether type
set to PTP then the HSR layer could consider everything before it (the
two MAC address fields) as internal header and the actual packet starts
after that. Reasoning would be that you shouldn't send a PTP packet over
HSR without dealing with the restrictions. So this could work.

Then the question remains how to do the filtering on RX side. For the
so_mark I did open additional two sockets…

> If all this fails, we can look into a protocol independent approach
> to passing other metadata in packet sockets. to/from skb_ext or cb[],
> say.

I will try the above but it looks very hackish.
cb[] is limited to one layer. I do have a skb_ext variant working but
this requires cmsg to set it. Do you think about generic skb_ext which
is set from af_packet? But I don't think it brings much value if I can't
filter on the RX side before returning the packet to userland.

> But at this point I see enough options that do not require changes
> to packet sockets.
> 
> To get back to the simplest approach: skb->mark. Is there any
> concrete risk that on this path that would conflict with other
> uses of that field? If packet sockets inject directly into this
> driver (possibly even with PACKET_QDISC_BYPASS)?

So I have a skb->mark variant working. I do read on the ethX interface
and write on the hsr0 interface (so I need two extra fd per interface).
The only concern here is that the mark value is hardcoded and could
collide with an existing firewall setup or so.
This field needs also be evaluated by the ethernet driver in case of
hw-offloading for HSR.
So far, this is the only working solution I have which does not touch
af_packet.

Let me try the header with the PTP hedaer type and the additional
sockets for RX.
It will not win a beauty contest but maybe I judge too harsh…

Sebastian

^ permalink raw reply

* Re: [PATCH net v2] net: reduce RFS/ARFS flow updates by checking LLC affinity
From: Jakub Kicinski @ 2026-04-16 16:26 UTC (permalink / raw)
  To: Chuang Wang
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Stanislav Fomichev, Kuniyuki Iwashima, Samiullah Khawaja,
	Hangbin Liu, Krishna Kumar, Neal Cardwell, Willem de Bruijn,
	netdev, linux-kernel
In-Reply-To: <20260414035931.45692-1-nashuiliang@gmail.com>

On Tue, 14 Apr 2026 11:59:20 +0800 Chuang Wang wrote:
> Subject: [PATCH net v2] net: reduce RFS/ARFS flow updates by checking LLC affinity

"net" is for fixes, I think you want to target the "net-next" tree.
Unfortunately that tree is closed for the duration of the merge window,
please come back in 2 weeks:
https://netdev.bots.linux.dev/net-next.html

^ permalink raw reply

* RE: [PATCH v1 net 1/1] net/sched: sch_dualpi2: fix limit/memlimit enforcement when dequeueing L-queue
From: Chia-Yu Chang (Nokia) @ 2026-04-16 16:36 UTC (permalink / raw)
  To: Victor Nogueira, linux-hardening@vger.kernel.org, kees@kernel.org,
	gustavoars@kernel.org, jhs@mojatatu.com, jiri@resnulli.us,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, horms@kernel.org, ij@kernel.org,
	ncardwell@google.com, Koen De Schepper (Nokia),
	g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
	mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
	Jason_Livingood@comcast.com, vidhi_goel@apple.com
In-Reply-To: <8070149c-87c8-4d9c-ae12-6b9a956fb763@mojatatu.com>

> -----Original Message-----
> From: Victor Nogueira <victor@mojatatu.com>
> Sent: Thursday, April 16, 2026 4:27 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; linux-hardening@vger.kernel.org; kees@kernel.org; gustavoars@kernel.org; jhs@mojatatu.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; horms@kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> Subject: Re: [PATCH v1 net 1/1] net/sched: sch_dualpi2: fix limit/memlimit enforcement when dequeueing L-queue
>
>
> CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information.
>
>
>
> On 13/04/2026 13:37, chia-yu.chang@nokia-bell-labs.com wrote:
> > From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> >
> > Fix dualpi2_change() to correctly enforce updated limit and memlimit
> > values after a configuration change of the dualpi2 qdisc.
> >
> > Before this patch, dualpi2_change() always attempted to dequeue
> > packets via the root qdisc (C-queue) when reducing backlog or memory
> > usage, and unconditionally assumed that a valid skb will be returned.
> > When traffic classification results in packets being queued in the
> > L-queue while the C-queue is empty, this leads to a NULL skb
> > dereference during limit or memlimit enforcement.
> >
> > This is fixed by first dequeuing from the C-queue path if it is non-empty.
> > Once the C-queue is empty, packets are dequeued directly from the
> > L-queue.s Return values from qdisc_dequeue_internal() are checked for
> > both queues. When dequeuing from the L-queue, the parent qdisc qlen
> > and backlog counters are updated explicitly to keep overall qdisc statistics consistent.
> > [...]
> > ---
> >   net/sched/sch_dualpi2.c | 24 +++++++++++++++++++-----
> >   1 file changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c index
> > 6d7e6389758d..56d4422970b6 100644
> > --- a/net/sched/sch_dualpi2.c
> > +++ b/net/sched/sch_dualpi2.c
> > @@ -872,11 +872,25 @@ static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
> >       old_backlog = sch->qstats.backlog;
> >       while (qdisc_qlen(sch) > sch->limit ||
> >              q->memory_used > q->memory_limit) {
> > -             struct sk_buff *skb = qdisc_dequeue_internal(sch, true);
> > -
> > -             q->memory_used -= skb->truesize;
> > -             qdisc_qstats_backlog_dec(sch, skb);
> > -             rtnl_qdisc_drop(skb, sch);
> > +             int c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> > +             struct sk_buff *skb = NULL;
> > +
> > +             if (c_len) {
> > +                     skb = qdisc_dequeue_internal(sch, true);
> > +                     if (!skb)
> > +                             break;
> > +                     q->memory_used -= skb->truesize;
> > +                     rtnl_qdisc_drop(skb, sch);
> > +             } else if (qdisc_qlen(q->l_queue)) {
> > +                     skb = qdisc_dequeue_internal(q->l_queue, true);
> > +                     if (!skb)
> > +                             break;
> > +                     q->memory_used -= skb->truesize;
> > +                     rtnl_qdisc_drop(skb, q->l_queue);
> > +                     /* Keep the overall qdisc stats consistent */
> > +                     --sch->q.qlen;
> > +                     qdisc_qstats_backlog_dec(sch, skb);
>
> Sashiko is hallucinating saying this will cause a UAF, it won't.
> However it is good to maintain a consistent order here.
> For example, see how sch_choke is doing [1].
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/sch_choke.c?id=1f5ffc672165ff851063a5fd044b727ab2517ae3#n394
>
> cheers,
> Victor

Hi Victor,

Thanks for the pointer to sch_choke, it follows the order: (1) qdisc_qstats_backlog_dec(), (2) reduce qlen, and (3) rtnl_qdisc_drop().

But I've also checked sch_codel, its order is: (1) reduce qlen, (2) qdisc_qstats_backlog_dec(), and (3) rtnl_qdisc_drop().

So, the key is to place rtnl_qdisc_drop() after the reduction of qstats_backlog as well as qlen.

Then, I will follow the same order for dualpi2 in next version:
1. qdisc_dequeue_internal(q->l_queue), including (a) --q->l_queue->q.qlen, and (2) qdisc_qstats_backlog_dec(q->l_queue)
2. --sch->q.qlen
3. qdisc_qstats_backlog_dec(sch)
4. rtnl_qdisc_drop(skb, q->l_queue), which will do "qdisc_qstats_drop(q->l_queue)"
5. qdisc_qstats_drop(sch)

Thanks,
Chia-Yu

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox