* Re: [PATCH net-next v5 1/4] dpll: add DPLL_PIN_TYPE_INT_NCO pin type
From: Ivan Vecera @ 2026-06-19 17:07 UTC (permalink / raw)
To: Kubalewski, Arkadiusz, Jiri Pirko, Vadim Fedorenko,
Jakub Kicinski
Cc: netdev@vger.kernel.org, Jiri Pirko, David S. Miller,
Donald Hunter, Eric Dumazet, Schmidt, Michal, Paolo Abeni,
Vaananen, Pasi, Oros, Petr, Prathosh Satish, Simon Horman,
linux-kernel@vger.kernel.org
In-Reply-To: <CH3PR11MB8749910F17977B951A8B12CA9BE42@CH3PR11MB8749.namprd11.prod.outlook.com>
On 6/17/26 1:59 PM, Kubalewski, Arkadiusz wrote:
>> From: Ivan Vecera <ivecera@redhat.com>
>> Sent: Monday, June 15, 2026 2:00 PM
>>
>> On 6/11/26 2:09 PM, Jiri Pirko wrote:
>>> Wed, Jun 10, 2026 at 05:45:46PM +0200, ivecera@redhat.com wrote:
>>>> On 6/10/26 3:04 PM, Kubalewski, Arkadiusz wrote:
>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>> Sent: Tuesday, June 9, 2026 4:59 PM
>>>>>>
>>>>>> On 6/9/26 4:00 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>> Sent: Tuesday, June 9, 2026 10:51 AM
>>>>>>>>
>>>>>>>> Mon, Jun 08, 2026 at 07:03:46PM +0200,
>>>>>>>> arkadiusz.kubalewski@intel.com
>>>>>>>> wrote:
>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>> Sent: Monday, June 8, 2026 5:48 PM
>>>>>>>>>>
>>>>>>>>>> On 6/8/26 4:43 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>>>> Sent: Sunday, May 31, 2026 9:44 PM ...
>>>>>>>>>>>> -
>>>>>>>>>>>> name: gnss
>>>>>>>>>>>> doc: GNSS recovered clock
>>>>>>>>>>>> + -
>>>>>>>>>>>> + name: int-nco
>>>>>>>>>>>> + doc: |
>>>>>>>>>>>> + Device internal numerically controlled oscillator.
>>>>>>>>>>>> + When connected as a DPLL input, the DPLL enters NCO
>>>>>>>>>>>> mode
>>>>>>>>>>>> + where the output frequency is adjusted by the host
>>>>>>>>>>>> via
>>>>>>>>>>>> + the PTP clock interface.
>>>>>>>>>>>
>>>>>>>>>>> Hi Ivan!
>>>>>>>>>>>
>>>>>>>>>>> How would you control this in case of automatic mode dpll?
>>>>>>>>>>> Automatic mode DPLL shall be controlled on HW level, such pin
>>>>>>>>>>> breaks that rule and requires some driver magic to show it is
>>>>>>>>>>> higher priority than the rest of the pins?
>>>>>>>>>>
>>>>>>>>>> The NCO pin can be connected only in manual mode. In other words
>>>>>>>>>> a
>>>>>>>>>> DPLL in automatic mode cannot select NCO pin (switch to NCO mode)
>>>>>>>>>> by
>>>>>>>>>> its own.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Being picky on DPLL_MODE for enabling feature is not something we
>>>>>>>>> can allow if it is not related to HW limitation, is it?
>>>>>>>>> Could you please elaborate why it is not possible for AUTOMATIC
>>>>>>>>> mode?
>>>>>>>>
>>>>>>>> In automatic mode, the pin selection logic is defined upon prio. I
>>>>>>>> can imagine that if NCO pin has the highest prio of the available
>>>>>>>> ones, it gets picked. I would be aligned 100% with automatic mode
>>>>>>>> behaviour.
>>>>>>>> Is there a real usecase for it?
>>>>>>>>
>>>>>>>> [..]
>>>>>>>
>>>>>>> This is not true. AUTOMATIC mode is HW solution, SW driver ONLY
>>>>>>> configures priorities on the inputs, not manages the active inputs.
>>>>>>> This breaks that behavior, the SW driver would have to manually
>>>>>>> override the AUTOMATIC mode to be fed from such NCO pin as it doesn't
>>>>>>> exist on its priority list, HW cannot pick or use it.
>>>>>>
>>>>>> Correct, AUTO mode is hardware feature and it should not be emulated
>>>>>> by a
>>>>>> driver. If the hardware does not support it then the switching
>>>>>> between
>>>>>> input references should be done by userspace (by monitoring ffo,
>>>>>> phase_offset, operstate).
>>>>>>
>>>>>
>>>>> Yes, exactly, so for AUTOMATIC mode HW it will not be possible to
>>>>> create
>>>>> such pin, which means that NCO pin would serve only a MANUAL mode
>>>>> implementation.
>>>>> Basically this is something we shall not allow to happen. DPLL API
>>>>> should be designed to cover the case where AUTO mode is able to
>>>>> implement
>>>>> all features consistently.
>>>>
>>>> If you don't like the proposal from Jiri (NCO switch driven by NCO pin
>>>> priority -> highest==enter_nco else leave_nco) then it could be
>>>> possible
>>>> to handle the switching by allowing the state 'connected' in AUTO mode
>>>> for the NCO pin type. Then the implementation will be the same for both
>>>> selection modes.
>>>>
>>>> Only difference would be that a user does not need to switch the device
>>>> from the AUTO to MANUAL mode.
>>>>
>>>>>>> The real use case is that any DPLL can switch the mode to this one
>>>>>>> instead of implementing MANUAL mode just to use the feature with a
>>>>>>> 'virtual' pin.
>>>>>>
>>>>>> I don't expect this... but it is up to a driver. I don't plan such
>>>>>> functionality in zl3073x as the NCO pin does not expose prio_get()
>>>>>> and
>>>>>> prio_set() callbacks - so it is clear that this pin cannot be part of
>>>>>> the
>>>>>> automatic selection.
>>>>>>
>>>>>> Ivan
>>>>>
>>>>> There is a difference between particular HW and API capabilities, with
>>>>> the
>>>>> proposed API we would disallow the possibility of such implementation
>>>>> for
>>>>> existing HW variants.
>>>>>
>>>>> DPLL NCO MODE would allow that but as pointed here by Ivan and by Jiri
>>>>> in
>>>>> the other email it would also require the extra implementation for
>>>>> some
>>>>> configuration - device level phase/ffo handling.
>>>>>
>>>>> To summarize it all, I don't have such simple solution for it.
>>>>>
>>>>> First thing that comes to my mind is to combine both approaches.
>>>>> Make it possible for AUTOMATIC mode to also set "CONNECTED" state
>>>>> on certain kind of "OVERRIDE" pins, where it could be determined by
>>>>> the type of PIN and embed that logic into the DPLL subsystem.
>>>>
>>>> The possible states for particular pins are now handled at a driver
>>>> level
>>>> so the driver decides if the requested state is correct or not. So it
>>>> could be easy to implement this.
>>>>
>>>> For auto mode allowed states:
>>>> - input references: selectable / disconnected
>>>> - nco pin: connected / disconnected
>>>>
>>>>> Basically, if driver registers such NCO pin it would be always
>>>>> selected
>>>>> manually, and in such case all the other pins are going to
>>>>> disconnected
>>>>> state while DPLL mode is also a "OVERRIDE" or something like it.
>>>>
>>>> I would leave this decision on the driver level... Imagine the
>>>> potential
>>>> HW that would allow to switch NCO mode if there is no valid input
>>>> reference.
>>>>
>>>> Example:
>>>>
>>>> REF0 (prio 0) -> +------+ -> OUT0
>>>> REF1 (prio 1) -> | DPLL | -> ...
>>>> NCO (prio 2) -> +------+ -> OUTn
>>>>
>>>> Such HW would prefer REF0 or REF1 and lock to one of them if they are
>>>> qualified. But if they are NOT, then it switches to NCO mode.
>
> Now you said yourself "NCO mode" ... I agree that it would be a mode in
> that case. Where instead of running on regular/built in XO dpll would run
> on NCO and user could select it, and this would be addition to regular
> behavior.
>
> I also agree that the pin approach might be better/easier to use, assuming
> frequency offset for all the outputs given dpll drives, it makes more sense
> to have it configurable on input side.
+1
>>>>
>>>> In this situation the relevant driver would allow to configure priority
>>>> and state 'selectable' for this NCO pin.
>>>>
>>>>> Perhaps the pin type could include OVERRIDE in its name to make it
>>>>> less
>>>>> confusing and needs some extra documentation.
>>>>>
>>>>> Thoughts?
>>>> I think _INT_ is ok. In the case of TYPE_INT_OSCILLATOR it is also
>>>> obvious that it is not a standard input reference.
>>>>
>>>> Jiri, Vadim, Arek, thoughts?
>>>
>>> I agree with you, the driver should have the flexibility to implement
>>> this according to his/hw's needs/capabilities. If it implements prio
>>> selection in AUTO mode, let it have it. If it implements manual NCO pin
>>> selection in AUTO mode using connected/disconnected override, let it
>>> have it.
>
> I don't know 'current' HW that is capable of using AUTO mode as a part of
> HW-based priority source selection and use such NCO input..
> But as already explained above, this is special mode of regular XO, which
> allows DPLL's output frequency offset configuration.
Let's keep this available for potential future HW. I can imagine a
situation where a user will prefer an automatic switch to NCO mode
if there is no qualified input reference - automatic switch means
that HW will support this (not emulated by the driver).
>>>
>>> Moreover, I actually like the "override" capability for pins in AUTO
>>> mode in general. It may be handy for other usecases as well.
>>>
>> Arek? Vadim?
>>
>> Thanks,
>> Ivan
>
> Agree, 'override' capability of a pin would be the way to go for this and
> other similar further cases.
>
> I believe a single approach on this would be best, I mean if AUTO mode
> needs a capability, to switch from regular behavior to 'OVERRIDE', and
> 'OVERRIDE' is only pin capability that allows such behavior for AUTO
> mode, then similar approach should be used on MANUAL mode, to make
> userspace know that such pin is always available to set "CONNECTED"
> and make the userspace implementation consistent on enabling it no matter
> if AUTO or MANUAL mode dpll.
Proposal:
1) new pin capability
- name: state-connected-override
- doc: pin state can be changed to connected in any DPLL mode
2) new NCO pin type to switch the DPLL to NCO mode when connected
3) automatic-only DPLL
- should expose NCO pin with state-connected-override capability
4) manual-only DPLL
- does not need to expose NCO pin with state-connected-override cap
5) dual-mode DPLL (supporting mode switching)
- if it exposes NCO pin with the override cap then it has to support
switching to NCO mode directly from AUTO mode
- if it does not expose NCO pin with the override cap then a user MUST
switch the DPLL mode from AUTO to MANUAL to be able to make NCO
pin connected to the DPLL
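The five points above can be read as a small decision table. The following is only a sketch of that reading, not code from the patch series: the names Mode, nco_connect_allowed and steps_to_enter_nco are hypothetical, and the capability is modelled as a plain boolean standing in for the proposed state-connected-override.

```python
from enum import Enum


class Mode(Enum):
    MANUAL = "manual"
    AUTOMATIC = "automatic"


def nco_connect_allowed(mode, has_override_cap):
    """True if the NCO pin may be set to 'connected' in the given mode.

    Manual mode always allows it; automatic mode allows it only when the
    pin carries the (proposed) state-connected-override capability.
    """
    return mode is Mode.MANUAL or has_override_cap


def steps_to_enter_nco(current_mode, supported_modes, has_override_cap):
    """List the user-visible steps needed to enter NCO mode."""
    if nco_connect_allowed(current_mode, has_override_cap):
        return ["set NCO pin state to connected"]
    if Mode.MANUAL in supported_modes:
        # Dual-mode DPLL without the override cap: switch mode first.
        return ["switch DPLL mode to manual",
                "set NCO pin state to connected"]
    # Automatic-only DPLL without the capability cannot reach NCO mode.
    raise ValueError("automatic-only DPLL needs the override capability")
```

Under this reading, a dual-mode DPLL in AUTO mode needs one step when the pin has the override capability and two steps when it does not, which is the difference described in point 5.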
Vadim, Jiri, Arek - thoughts?
Thanks,
Ivan
^ permalink raw reply
* Re: [PATCH net 0/6] ipv6: fix sysctl error handling and missing notifications
From: Fernando Fernandez Mancera @ 2026-06-19 16:42 UTC (permalink / raw)
To: netdev
Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
horms, pabeni, kuba, edumazet, davem, idosch, dsahern
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>
On 6/18/26 6:22 PM, Fernando Fernandez Mancera wrote:
> While working on a different IPv6 patch series I have spotted multiple
> minor bugs around sysctl error handling and notifications. In general,
> they are not serious issues.
>
> In addition, there is one more issue in forwarding sysctl as it does not
> check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
> of this series and I am aiming it at the net-next tree once it re-opens.
>
> Fernando Fernandez Mancera (6):
> ipv6: fix error handling in disable_ipv6 sysctl
> ipv6: fix error handling in ignore_routes_with_linkdown sysctl
> ipv6: fix error handling in forwarding sysctl
> ipv6: fix error handling in disable_policy sysctl
> ipv6: reset value and position for proxy_ndp sysctl restart
> ipv6: fix missing notification for ignore_routes_with_linkdown
>
> net/ipv6/addrconf.c | 35 +++++++++++++++++++++++++++--------
> 1 file changed, 27 insertions(+), 8 deletions(-)
>
So I have checked the feedback from sashiko.dev [1]. In general, the
findings are real, but I have doubts about this one: while the race is
real, I am not sure we actually care about this scenario. In any case,
this pattern is common in kernel code and fixing it would require changes
everywhere (IPv4/IPv6/other places (?)).
The easier thing to do is what Nicolas suggested: take the lock
before parsing the value, i.e. before calling proc_dointvec().
I also wonder whether I should fix them in individual commits or together
in the same commit, as they have different Fixes tags. Or maybe we can aim
this last part at net-next. Thoughts?
Please, see the feedback attached below.
Thanks!
[1] https://sashiko.dev/#/patchset/20260618162225.4588-1-fmancera%40suse.de
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 8ff015975e279..1cfb223476bd6 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
[ ... ]
> @@ -6493,8 +6494,12 @@ static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
> if (write && old != new) {
> struct net *net = ctl->extra2;
>
> - if (!rtnl_net_trylock(net))
> + if (!rtnl_net_trylock(net)) {
> + /* Restore the original values before restarting */
> + *valp = old;
> + *ppos = pos;
> return restart_syscall();
> + }
>
Does this rollback introduce a race condition that can silently discard
concurrent writes?
The sysctl core does not serialize concurrent writes to the same sysctl
parameter, and proc_dointvec() modifies the global sysctl memory pointed to
by ctl->data directly.
If Thread A and Thread B concurrently write to proxy_ndp, and Thread A fails
to acquire the RTNL lock, setting *valp = old will unconditionally reset the
global memory back to what Thread A observed before its write. This seems
like it would clobber Thread B's concurrent update.
If Thread B successfully acquires the RTNL lock, it will evaluate its local
old != new stack variables as true and emit a RTM_NEWNETCONF netlink
notification. However, the notification will read the corrupted *valp memory,
leaving the emitted event and internal sysctl state out of sync.
Could this be avoided by duplicating the ctl_table, pointing the data field
to a local stack variable, calling proc_dointvec() on the local variable,
and only updating the global *valp after successfully acquiring the lock,
similar to how it is handled in addrconf_sysctl_forward()?
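The interleaving described in the review can be replayed sequentially. This is a toy model only, assuming the ordering the review describes; it is not kernel code. A one-element list stands in for the global *valp, and both function names are made up for the illustration.

```python
def rollback_interleaving():
    """Replay the race: A rolls back *valp after B has already written."""
    valp = [0]            # stands in for the global sysctl value (*valp)
    notifications = []

    # Thread A: the parser stores the new value straight into *valp.
    a_old = valp[0]
    valp[0] = 1

    # Thread B: same thing, racing with A.
    b_old = valp[0]
    valp[0] = 2
    b_new = valp[0]

    # Thread A: trylock fails, so it restores what it saw before its write.
    valp[0] = a_old       # this clobbers B's 2

    # Thread B: trylock succeeds; old != new, so it notifies using *valp.
    if b_old != b_new:
        notifications.append(valp[0])

    return b_new, valp[0], notifications


def local_copy_interleaving():
    """Same ordering, but values are parsed into locals and committed
    to *valp only while the lock is held (the suggested alternative)."""
    valp = [0]
    notifications = []

    a_new = 1             # Thread A parses into a local; *valp untouched
    b_new = 2             # Thread B parses into a local; *valp untouched

    # Thread A: trylock fails; there is nothing global to roll back,
    # so a_new is simply dropped and the syscall restarts.

    # Thread B: trylock succeeds; compare and commit under the lock.
    b_old = valp[0]
    if b_old != b_new:
        valp[0] = b_new
        notifications.append(valp[0])

    return b_new, valp[0], notifications
```

In the first function Thread B wrote 2 but both the stored value and the notification carry 0. In the second, the stored value and the notification agree with what B wrote.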
^ permalink raw reply
* Re: [PATCH] net: add sock_open() for unified socket creation
From: Al Viro @ 2026-06-19 16:34 UTC (permalink / raw)
To: Alex Goltsev; +Cc: davem, netdev, linux-kernel
In-Reply-To: <CAEKmD4JfM5GWSiRMUn6NK+kKFeyXA8i3A9gthDz3hVKFcR1YDA@mail.gmail.com>
On Fri, Jun 19, 2026 at 01:35:56PM +0300, Alex Goltsev wrote:
> > What's the point (and why not make it inline, while we are at it)?
>
> > Are there really callers that would pass a non-constant value as the last argument,
> > and if so, what are they doing next?
>
>
> As for `inline`: in this case, it would have no practical significance.
> The compiler already treats a simple inline function as a regular
> symbol within the `EXPORT_SYMBOL` context, whereas a static inline
> function (the standard kernel template for helper functions) would
> completely break the export to the LKM.
How so? All three underlying primitives are exported, so static inline
in whatever include/*/*.h you put it in would work just fine.
> As for the last argument, yes, today it is usually a constant,
> but that’s not the point. The purpose of the enumeration is to provide
> a unified, explicit control interface. It’s important that if, in the future,
> someone adds a new type of socket creation, existing calling programs won’t
> panic or throw a compilation error, but will smoothly fall back to
> the default case and return -EINVAL, which is a safe failure mode.
Collapsing several functions together is worthless unless the combination
can be _used_ as something other than (questionable) syntactic sugar.
kmalloc() can; something that would only result in trading multiple
identifiers for functions for multiple identifiers for "which function to
call" is not an improvement.
^ permalink raw reply
* RE: Ethtool : PRBS feature
From: Das, Shubham @ 2026-06-19 16:26 UTC (permalink / raw)
To: Alexander H Duyck, Andrew Lunn, lee@trager.us
Cc: netdev@vger.kernel.org, mkubecek@suse.cz, D H, Siddaraju,
Chintalapalle, Balaji, Lindberg, Magnus,
niklas.damberg@ericsson.com
In-Reply-To: <06d8c98da24e80d148ede4e933bb621c5515a7a2.camel@gmail.com>
> Also do you know what layer in the PHY you are injecting this PRBS at?
> I would be curious if this is PCS or at the PMD level?
In our case PRBS functionality is implemented in the PHY firmware at the PCS (TX/RX) + PMA (FEC Error Injection) layer.
Andrew, Alexander, Lee,
The host driver does not directly access any registers but requests the PHY FW to manage PRBS on its behalf.
Because of this, the implementation does not naturally fit the traditional PHYLIB model, where Linux PHY drivers directly manage PHY registers.
The functionality is closer to a firmware-managed service exposed through the PCIe driver, so we thought the right place would be to extend ethtool.
We come from the Ethernet PHY field, and I believe that attempting to generalize PRBS for generic PHYs to accommodate all bus types might distract us.
The existing ethtool user application interface will give a quick start for Ethernet PHY PRBS management.
When we need other buses or when we have another model implementation, then we can abstract the commonalities into a framework.
Should we proceed with implementing "ethtool --phy-test"?
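For readers unfamiliar with the prbs7 pattern named in the proposed command line, here is a minimal software sketch. It assumes the conventional PRBS7 polynomial x^7 + x^6 + 1; the seed, bit order and checker are illustrative choices, and nothing here reflects the PHY firmware or any ethtool interface discussed in this thread.

```python
def prbs7_step(state):
    """Advance a 7-bit LFSR for x^7 + x^6 + 1; return (new_state, bit)."""
    newbit = ((state >> 6) ^ (state >> 5)) & 1
    return ((state << 1) | newbit) & 0x7F, newbit


def prbs7_bits(count, seed=0x7F):
    """Return the first `count` PRBS7 bits for a non-zero 7-bit seed."""
    if not 0 < seed <= 0x7F:
        raise ValueError("seed must be a non-zero 7-bit value")
    bits = []
    state = seed
    for _ in range(count):
        state, bit = prbs7_step(state)
        bits.append(bit)
    return bits


def prbs7_period(seed=0x7F):
    """Count steps until the LFSR state returns to the seed."""
    state, steps = seed, 0
    while True:
        state, _ = prbs7_step(state)
        steps += 1
        if state == seed:
            return steps


def count_bit_errors(received, seed=0x7F):
    """Compare received bits against a locally generated reference."""
    expected = prbs7_bits(len(received), seed)
    return sum(1 for r, e in zip(received, expected) if r != e)
```

A TX test would emit prbs7_bits() on the lane, and an RX test would run something like count_bit_errors() against the same sequence. Real checkers usually self-synchronize to the incoming stream rather than sharing a seed, which this sketch leaves out.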
> -----Original Message-----
> From: Alexander H Duyck <alexander.duyck@gmail.com>
> Sent: 16 June 2026 21:45
> To: Das, Shubham <shubham.das@intel.com>; Andrew Lunn <andrew@lunn.ch>
> Cc: netdev@vger.kernel.org; mkubecek@suse.cz; D H, Siddaraju
> <siddaraju.dh@intel.com>; Chintalapalle, Balaji <balaji.chintalapalle@intel.com>
> Subject: Re: Ethtool : PRBS feature
>
> On Tue, 2026-06-16 at 12:14 +0000, Das, Shubham wrote:
> > Hi Andrew,
> >
> > Thanks for the feedback.
> >
> > Yes, for multi-lane ports we can accept the lane number as an argument like:
> >
> > ethtool --phy-test eth1 lane 0 tx-prbs prbs7
> > ethtool --phy-test eth2 lane 0 rx-prbs prbs7
> >
> > We referred to "Lee Trager's" "Open-Source Tooling for PHY Management and Testing" session:
> > https://netdevconf.info/0x19/sessions/talk/open-source-tooling-for-phy-management-and-testing.html?.
> > We have been trying to reach "Lee Trager" to seek more input, latest update on the approach and understand if there is a parallel effort in active so we can collaborate.
> > If you can, please help me connect with "Lee Trager" and others who expressed interest in Ethernet PRBS. We are happy to align and start implementation.
> >
>
> You aren't going to have much luck if you are trying to reach out via his Meta
> address as he has moved onto Nvidia so he is no longer working on the fbnic
> driver.
>
> As far as the work done most of it was internal and making use of debugfs. I don't
> believe any of the work for fbnic began to approach the suggested methods for
> upstreaming the feature as Lee had been pulled into other efforts.
>
> > About standardizing across other bus like PCIe and USB, I had a quick discussion with our internal designers, but I didn't observe any such SW-level config knobs interest.
> > Looks like Ethernet has clear interest and we are joining that Ethernet PRBS community too.
>
> I think it largely depends on what your implementation looks like. The point being
> made was that many of the SerDes PHYs out there are capable of use in multiple
> applications. So instead of being a networking device you would be looking at a
> SerDes PHY such as those in "/drivers/phy/".
>
> Also do you know what layer in the PHY you are injecting this PRBS at?
> I would be curious if this is PCS or at the PMD level?
>
> If you are referring to the PCS level then yes, it would make sense to have it in the
> networking subsystem as the PCS at this point is more a netdev specific set of
> drivers, see "/drivers/net/pcs/".
>
> In the case of the PMD that is where things get a bit more interesting.
> There is an IEEE c45 register definition that includes PRBS testing registers,
> however in the case of our implementation the PMD doesn't follow that
> specification and follows more the "/drivers/phy/" model.
>
> > Ethernet PRBS configuration and diagnostics support is well established and already widely used in existing Ethernet SERDES deployments.
> > We think Ethernet is the most natural starting point within netdev, as
> > it aligns with current driver practice and existing validation workflows.
>
> The problem is many of these parts used as an Ethernet Serdes PMD are really a
> multiuse part. So for example in the case of the hardware in FBNIC we use the
> same part on the Ethernet PHY as we do for the PCIe
> Gen5 PHY.
>
> The complication in our case is that both are buried behind our FW due to the fact
> that both are shared between slices. However for testing purposes and such we
> could look at disabling the odd slices to essentially unshare the hardware if you
> need another platform to test something like this with.
^ permalink raw reply
* RE: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500 MAC-PHY
From: Selvamani Rajagopal @ 2026-06-19 16:05 UTC (permalink / raw)
To: Uwe Kleine-König
Cc: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, Parthiban Veerasooran, Richard Cochran, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Simon Horman, Jonathan Corbet,
Shuah Khan, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
devicetree@vger.kernel.org, linux-doc@vger.kernel.org, Jerry Ray
In-Reply-To: <ajVKfBKPuNk9zN7b@monoceros>
Thanks for your feedback. Will take care of all three comments.
> -----Original Message-----
> Subject: Re: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500
> MAC-PHY
>
> On Sun, Jun 14, 2026 at 10:00:28AM -0700, Selvamani Rajagopal via B4 Relay wrote:
> > +static const struct of_device_id s2500_of_match[] = {
> > + { .compatible = "onnn,s2500" },
> > + {}
>
> s/{}/{ }/
>
> > +};
> > +
> > +static const struct spi_device_id s2500_ids[] = {
> > + { "s2500" },
> > + {}
> > +};
>
> Please make this:
>
> static const struct spi_device_id s2500_ids[] = {
> { .name = "s2500" },
> { }
> };
>
> > +MODULE_DEVICE_TABLE(spi, s2500_ids);
> > +
> > +static struct spi_driver s2500_driver = {
> > + .driver = {
> > + .name = DRV_NAME,
> > + .of_match_table = s2500_of_match,
> > + },
> > + .probe = s2500_probe,
> > + .remove = s2500_remove,
> > + .id_table = s2500_ids,
>
> Tastes are different, but the idea to align = is usually screwed by
> follow up patches. Here it's broken from the start. If you ask me: Use a
> single space before each =.
>
> > +};
> > +
> > +module_spi_driver(s2500_driver);
>
> Usually there is no empty line between the driver struct and the macro
> registering it.
>
>
> Best regards
> Uwe
^ permalink raw reply
* Wireguard head of line blocking when CPUs saturate
From: Toke Høiland-Jørgensen @ 2026-06-19 15:56 UTC (permalink / raw)
To: wireguard; +Cc: netdev
Hey everyone
I'm running Wireguard on my main gateway, which is a not-super-high
powered ARM box with eight cores (based on the NXP LS1088A SoC). The box
does, however, also have eight hardware queues for its networking, which
means regular network traffic can be spread nicely across the cores.
However, the per-core performance is limited, making it pretty trivial
to saturate a single core by just running a fat TCP flow through it. And
when this happens, Wireguard traffic just... stalls. I.e., no traffic
gets through the Wireguard interface until the (unrelated) flow
saturating one of the cores subsides.
I suspect what happens is that Wireguard spreads out traffic to all
cores for encryption, but has to wait for the respective CPUs to finish
encrypting the packets in order before they can actually be transmitted.
And because one CPU is now suddenly saturated in softirq context, the
Wireguard work queue never gets a chance to run on that CPU, stalling TX
progress for the Wireguard device entirely.
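The suspected mechanism can be illustrated with a toy model. It is a sketch of the hypothesis above only, not of the actual WireGuard queueing code: packets are assumed to be spread round-robin over CPUs, and the function name and parameters are made up for the example.

```python
def transmitted_in_order(num_packets, num_cpus, stalled_cpus):
    """Return the sequence numbers that can be sent, in order.

    Packet `seq` is assumed to be encrypted on CPU `seq % num_cpus`.
    A CPU listed in `stalled_cpus` never gets to run its encryption
    worker. Transmission must follow sequence order, so it stops at
    the first packet whose encryption has not finished.
    """
    sent = []
    for seq in range(num_packets):
        if seq % num_cpus in stalled_cpus:
            break  # head of line is stuck; everything behind it waits
        sent.append(seq)
    return sent


def encrypted_count(num_packets, num_cpus, stalled_cpus):
    """Count packets whose encryption finished, sendable or not."""
    return sum(1 for seq in range(num_packets)
               if seq % num_cpus not in stalled_cpus)
```

With 16 packets on 8 CPUs and only CPU 3 stalled, 14 packets finish encryption but only packets 0 to 2 can be transmitted. That matches the symptom of a complete stall even though seven cores are idle.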
I'm sending this message to (a) see if anyone else is seeing the same
kind of stalling, and (b) to get input on whether the explanation
outlined above seems plausible. And, in the case of affirmative answers
to both (a) and (b), to hopefully start a discussion on what to do about
this :)
-Toke
^ permalink raw reply
* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: David Woodhouse @ 2026-06-19 15:34 UTC (permalink / raw)
To: Thomas Gleixner, John Stultz, Stephen Boyd, Miroslav Lichvar,
Richard Cochran, linux-kernel, netdev
Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <87h5myd56x.ffs@fw13>
[-- Attachment #1: Type: text/plain, Size: 3150 bytes --]
On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
> On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> > @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
> >
> > nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> > nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> > +
> > + /*
> > + * For the NTP-disciplined mono-based clocks, report how far
> > + * @systime is from the ideal NTP time at @now, in signed ns,
> > + * so a caller can land on the ideal line by adding it. Four
> > + * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> > + *
> > + * - tk->ntp_error, the deviation as of the last update;
> > + * - (cycle_delta * ntp_err_frac), the fractional-mult drift
> > + * accrued since then (cycle_delta is at most a tick on a
> > + * tickful kernel, but many ticks' worth under NO_HZ);
> > + * - (cycle_delta * ntp_err_mult), subtracting the applied +1
> > + * mult dither over the same span;
> > + * - the sub-ns fraction @systime dropped when the read was
> > + * truncated to whole ns (low @shift bits, exact despite the
> > + * multiply overflowing).
> > + *
> > + * RAW is undisciplined and AUX has its own discipline, so they
> > + * carry no ntp_error.
>
> AUX has ntp_error too. AUX clocks have a per clock NTP instance, which
> work exactly like the main timerkeeper's one. Only CLOCK_MONOTONIC_RAW
> needs to be excluded.
Ack.
> > + */
> > + if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> > + clock_id == CLOCK_BOOTTIME) {
> > + u32 nes = tk->ntp_error_shift;
> > + u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> > + tk->tkr_mono.mask;
> > + s64 err = tk->ntp_error +
> > + (((s64)mul_u64_u64_shr(cycle_delta,
> > + tk->ntp_err_frac, 32) -
> > + (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> > +
> > + err += (s64)((cycle_delta * tk->tkr_mono.mult +
> > + tk->tkr_mono.xtime_nsec) &
> > + ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> > + systime_snapshot->ntp_error =
> > + (err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> > + NTP_SCALE_SHIFT;
>
> This formatting makes my brain hurt. Can you please split that out into
> a separate function?
Yep. There's also a potential error there — an *additional* discrepancy
comes from the enforced monotonicity that timekeeping_cycles_to_ns()
applies (the case where it just returns tkr->xtime_nsec >> tkr->shift).
I couldn't work out if I cared about the clocksource-is-non-monotonic
case, and even if I did, what I should do about it.
I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
or something like that, such that e.g. PTP clients could *ask* for it.
It's all very well hard-coding it in pps_get_ts() and unconditionally
changing the behaviour... I *think* we could justify that. But the
example I actually used in the patch was PTP, and that's slightly
harder to justify the behavioural change.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* [PATCH net] net: au1000: move free_irq out of the close-time spinlocked section
From: Runyu Xiao @ 2026-06-19 15:18 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Runyu Xiao, stable
au1000_close() calls free_irq() while aup->lock is still held with
spin_lock_irqsave(). free_irq() can sleep because it takes the IRQ
descriptor request mutex, so it does not belong inside the close-time
spinlocked section.
This was found by our static analysis tool and then confirmed by manual
review of the in-tree au1000_close() .ndo_stop path. The reviewed path
keeps aup->lock held across the MAC reset, queue stop and
free_irq(dev->irq, dev).
A directed runtime validation kept that ndo_stop carrier and the same
free_irq(dev->irq, dev) operation under the driver lock. Lockdep reported
"BUG: sleeping function called from invalid context" and "Invalid wait
context" while free_irq() was taking desc->request_mutex, with
au1000_close() and free_irq() on the stack.
Drop aup->lock before freeing the IRQ. The protected close-time work still
stops the device and queue before IRQ teardown, but the sleepable IRQ core
path now runs outside the spinlocked section.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/net/ethernet/amd/au1000_eth.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/amd/au1000_eth.c b/drivers/net/ethernet/amd/au1000_eth.c
index 9d35ac348ebe..5a04056e38fa 100644
--- a/drivers/net/ethernet/amd/au1000_eth.c
+++ b/drivers/net/ethernet/amd/au1000_eth.c
@@ -943,9 +943,10 @@ static int au1000_close(struct net_device *dev)
/* stop the device */
netif_stop_queue(dev);
+ spin_unlock_irqrestore(&aup->lock, flags);
+
/* disable the interrupt */
free_irq(dev->irq, dev);
- spin_unlock_irqrestore(&aup->lock, flags);
return 0;
}
--
2.34.1
^ permalink raw reply related
* [PATCH v3 2/2] selftests/tc-testing: Add DualPI2 GSO backlog accounting test
From: Xingquan Liu @ 2026-06-19 15:13 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, Xingquan Liu,
stable
In-Reply-To: <20260619151447.223640-1-b1n@b1n.io>
Add a regression test for DualPI2 GSO backlog accounting when it is
used as a child qdisc of QFQ.
The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
the leaf qdisc. DualPI2 splits the skb into two segments. After the
traffic drains, both QFQ and DualPI2 must report zero backlog and zero
qlen.
On kernels with the broken accounting, QFQ can keep a stale non-zero
qlen after all real packets have been dequeued.
Signed-off-by: Xingquan Liu <b1n@b1n.io>
---
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 44 +++++++++++++++++++
tools/testing/selftests/tc-testing/tdc_gso.py | 43 ++++++++++++++++++
2 files changed, 87 insertions(+)
create mode 100755 tools/testing/selftests/tc-testing/tdc_gso.py
diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
index cd1f2ee8f354..ed6a900bb568 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
@@ -250,5 +250,49 @@
"teardown": [
"$TC qdisc del dev $DUMMY handle 1: root"
]
+ },
+ {
+ "id": "891f",
+ "name": "Verify DualPI2 GSO backlog accounting with QFQ parent",
+ "category": [
+ "qdisc",
+ "dualpi2",
+ "qfq",
+ "gso"
+ ],
+ "plugins": {
+ "requires": "nsPlugin"
+ },
+ "setup": [
+ "$IP link set dev $DUMMY up || true",
+ "$IP addr add 10.10.10.10/24 dev $DUMMY || true",
+ "$TC qdisc add dev $DUMMY root handle 1: qfq",
+ "$TC class add dev $DUMMY parent 1: classid 1:1 qfq weight 1 maxpkt 4096",
+ "$TC qdisc add dev $DUMMY parent 1:1 handle 2: dualpi2",
+ "$TC filter add dev $DUMMY parent 1: matchall classid 1:1"
+ ],
+ "cmdUnderTest": "./tdc_gso.py 10.10.10.10 10.10.10.1 9000 1200 2400",
+ "expExitCode": "0",
+ "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY",
+ "matchJSON": [
+ {
+ "kind": "qfq",
+ "handle": "1:",
+ "packets": 2,
+ "backlog": 0,
+ "qlen": 0
+ },
+ {
+ "kind": "dualpi2",
+ "handle": "2:",
+ "packets": 2,
+ "backlog": 0,
+ "qlen": 0
+ }
+ ],
+ "teardown": [
+ "$TC qdisc del dev $DUMMY root",
+ "$IP addr del 10.10.10.10/24 dev $DUMMY || true"
+ ]
}
]
diff --git a/tools/testing/selftests/tc-testing/tdc_gso.py b/tools/testing/selftests/tc-testing/tdc_gso.py
new file mode 100755
index 000000000000..b66528ea4b68
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tdc_gso.py
@@ -0,0 +1,43 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+tdc_gso.py - send a UDP GSO datagram
+
+Copyright (C) 2026 Xingquan Liu <b1n@b1n.io>
+"""
+
+import argparse
+import socket
+import struct
+import sys
+
+UDP_MAX_SEGMENTS = 1 << 7
+
+
+parser = argparse.ArgumentParser(description="UDP GSO datagram sender")
+parser.add_argument("src", help="source IPv4 address")
+parser.add_argument("dst", help="destination IPv4 address")
+parser.add_argument("port", type=int, help="destination UDP port")
+parser.add_argument("gso_size", type=int, help="UDP GSO segment payload size")
+parser.add_argument("payload_len", type=int, help="total UDP payload length")
+args = parser.parse_args()
+
+if args.gso_size <= 0 or args.gso_size > 0xFFFF:
+ parser.error("gso_size must fit in an unsigned 16-bit integer")
+if args.payload_len <= args.gso_size:
+ parser.error("payload_len must be larger than gso_size")
+if args.payload_len > args.gso_size * UDP_MAX_SEGMENTS:
+ parser.error("payload_len exceeds UDP_MAX_SEGMENTS")
+
+SOL_UDP = getattr(socket, "SOL_UDP", socket.IPPROTO_UDP)
+UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)
+
+sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+sock.bind((args.src, 0))
+
+payload = b"b" * args.payload_len
+cmsg = [(SOL_UDP, UDP_SEGMENT, struct.pack("=H", args.gso_size))]
+
+sent = sock.sendmsg([payload], cmsg, 0, (args.dst, args.port))
+sys.exit(sent != len(payload))
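For reference, the "packets": 2 expectation in the new tdc test can be derived from the arguments passed to tdc_gso.py (gso_size 1200, payload_len 2400). A rough sketch, assuming the payload is cut into gso_size chunks with a possibly shorter final chunk; expected_segments() is a hypothetical helper and not part of the patch:

```python
UDP_MAX_SEGMENTS = 1 << 7  # same limit tdc_gso.py checks against


def expected_segments(gso_size, payload_len):
    """Hypothetical helper: segments one UDP GSO send should be split into."""
    if not 0 < gso_size <= 0xFFFF:
        raise ValueError("gso_size must fit in an unsigned 16-bit integer")
    if payload_len <= gso_size:
        raise ValueError("payload_len must be larger than gso_size")
    # Ceiling division: a trailing partial chunk still needs its own segment.
    segments = -(-payload_len // gso_size)
    if segments > UDP_MAX_SEGMENTS:
        raise ValueError("payload_len exceeds UDP_MAX_SEGMENTS")
    return segments


# Arguments used by the tdc test case above.
print(expected_segments(1200, 2400))
```

The validation mirrors the checks in the script, so a value rejected here should also be rejected by tdc_gso.py itself.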
--
Xingquan Liu
* [PATCH v3 1/2] net/sched: dualpi2: fix GSO backlog accounting
From: Xingquan Liu @ 2026-06-19 15:13 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, Xingquan Liu,
stable
When DualPI2 splits a GSO skb into N segments, it propagates N
additional packets to its parent before returning NET_XMIT_SUCCESS.
The parent then accounts for the original skb once more, leaving its
qlen one larger than the number of packets actually queued.
With QFQ as the parent, after all real packets are dequeued, QFQ still
has a non-zero qlen while its in-service aggregate has no active
classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
the result to qfq_peek_skb(), causing a NULL pointer dereference.
Follow the same pattern used by tbf_segment() and taprio: count only
successfully queued segments, propagate the difference between the
original skb and those segments, and return NET_XMIT_SUCCESS whenever
at least one segment was queued.
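The off-by-one can be checked with a toy model of the parent's qlen. This is a userspace sketch under assumptions, not kernel code: tree_reduce_backlog() is a stand-in for qdisc_tree_reduce_backlog() that only tracks the packet count of a single parent, and only the case where at least one segment is queued is modelled.

```python
def parent_qlen(segments_queued, fixed):
    """Toy model: parent's qlen after DualPI2 splits one GSO skb."""
    if segments_queued < 1:
        raise ValueError("only the 'at least one segment queued' case is modelled")
    qlen = 0

    def tree_reduce_backlog(n):
        # Stand-in for qdisc_tree_reduce_backlog(): the parent loses n
        # packets, so a negative n adds packets.
        nonlocal qlen
        qlen -= n

    if fixed:
        # Patched code: propagate the difference between the original skb
        # and the queued segments.
        tree_reduce_backlog(1 - segments_queued)
    else:
        # Old code: propagate all queued segments as additional packets.
        tree_reduce_backlog(-segments_queued)

    # Enqueue returned NET_XMIT_SUCCESS, so the parent counts the skb once.
    qlen += 1
    return qlen


for n in (2, 8):
    print(n, parent_qlen(n, fixed=False), parent_qlen(n, fixed=True))
```

With the old accounting the parent ends up one packet above the number of segments actually queued, which is the stale qlen QFQ trips over; with the patched accounting the two match.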
Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
Cc: stable@vger.kernel.org
Signed-off-by: Xingquan Liu <b1n@b1n.io>
---
v3:
- Move the UDP GSO sender into tdc_gso.py.
v2:
- Change patch commit message.
- Add tdc test.
net/sched/sch_dualpi2.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index d7c3254ef800..5434df6ca8ef 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -461,7 +461,7 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
if (IS_ERR_OR_NULL(nskb))
return qdisc_drop(skb, sch, to_free);
- cnt = 1;
+ cnt = 0;
byte_len = 0;
orig_len = qdisc_pkt_len(skb);
skb_list_walk_safe(nskb, nskb, next) {
@@ -488,16 +488,15 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
byte_len += nskb->len;
}
}
- if (cnt > 1) {
+ if (cnt > 0) {
/* The caller will add the original skb stats to its
* backlog, compensate this if any nskb is enqueued.
*/
- --cnt;
- byte_len -= orig_len;
+ qdisc_tree_reduce_backlog(sch, 1 - cnt,
+ orig_len - byte_len);
}
- qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
consume_skb(skb);
- return err;
+ return cnt > 0 ? NET_XMIT_SUCCESS : err;
}
return dualpi2_enqueue_skb(skb, sch, to_free);
}
base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
--
Xingquan Liu
* [PATCH net v2] net/smc: fix out-of-bounds read when sk_user_data holds a sk_psock
From: Sechang Lim @ 2026-06-19 15:03 UTC (permalink / raw)
To: D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Ursula Braun,
Karsten Graul, Guvenc Gulce, linux-rdma, linux-s390, netdev,
linux-kernel, bpf
SMC stores its smc_sock in the clcsock's sk_user_data tagged
SK_USER_DATA_NOCOPY and reads it back with smc_clcsock_user_data(), which
only strips that flag. sockmap stores a sk_psock in the same field tagged
SK_USER_DATA_NOCOPY | SK_USER_DATA_PSOCK. Nothing keeps both off one
socket, and SMC then casts the sk_psock to an smc_sock.
A passive-open child hits this. It inherits the listener's
smc_clcsock_data_ready(), but sk_clone_lock() clears its NOCOPY
sk_user_data, and a BPF sock_ops program then adds the child to a sockmap,
installing a sk_psock in that field. The inherited callback reads it as an
smc_sock and dereferences a clcsk_* pointer past the end of the sk_psock:
BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
<IRQ>
smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
[...]
</IRQ>
Allocated by task 67930:
sk_psock_init+0x142/0x740 net/core/skmsg.c:766
sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
__cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
[...]
sk_psock() already guards the other side, returning NULL unless
SK_USER_DATA_PSOCK is set. Make smc_clcsock_user_data() and its RCU
variant return the smc_sock only when sk_user_data carries SMC's tag
alone. A sk_psock then reads back as NULL, which the data_ready and
fallback callbacks already handle.
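The tag check can be illustrated with plain integers. A minimal sketch, assuming the tag bit values below match include/net/sock.h (NOCOPY = 1, BPF = 2, PSOCK = 4) and using a made-up pointer value; the function names are illustrative, not the kernel helpers themselves:

```python
# Assumed values of the sk_user_data tag bits (low bits of the pointer).
SK_USER_DATA_NOCOPY = 0x1
SK_USER_DATA_BPF = 0x2
SK_USER_DATA_PSOCK = 0x4
TAG_MASK = SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF | SK_USER_DATA_PSOCK


def old_user_data(raw):
    """Model of the old helper: strips only the NOCOPY bit."""
    return raw & ~SK_USER_DATA_NOCOPY


def new_user_data(raw):
    """Model of the patched helper: pointer only if NOCOPY is the sole tag."""
    if (raw & TAG_MASK) != SK_USER_DATA_NOCOPY:
        return None
    return raw & ~TAG_MASK


ptr = 0xFFFF888001234560  # made-up, 8-byte aligned address
smc_slot = ptr | SK_USER_DATA_NOCOPY
psock_slot = ptr | SK_USER_DATA_NOCOPY | SK_USER_DATA_PSOCK

print(hex(old_user_data(psock_slot)))  # non-NULL, would be cast to smc_sock
print(new_user_data(psock_slot))       # rejected
print(hex(new_user_data(smc_slot)))    # SMC's own pointer is still returned
```

In the model, a cleared slot (0) is also rejected by the new helper, which corresponds to the NULL case the callbacks already handle.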
Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
net/smc/smc.h | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 52145df83f6e..88dfb459b7cc 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -342,13 +342,25 @@ static inline void smc_init_saved_callbacks(struct smc_sock *smc)
static inline struct smc_sock *smc_clcsock_user_data(const struct sock *clcsk)
{
- return (struct smc_sock *)
- ((uintptr_t)clcsk->sk_user_data & ~SK_USER_DATA_NOCOPY);
+ uintptr_t data = (uintptr_t)clcsk->sk_user_data;
+
+ /*
+ * Return the smc_sock only if the slot carries SMC's tag alone.
+ * sockmap stores a sk_psock here tagged SK_USER_DATA_PSOCK; it is
+ * not an smc_sock and must not be dereferenced as one.
+ */
+ if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+ return NULL;
+ return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
}
static inline struct smc_sock *smc_clcsock_user_data_rcu(const struct sock *clcsk)
{
- return (struct smc_sock *)rcu_dereference_sk_user_data(clcsk);
+ uintptr_t data = (uintptr_t)rcu_dereference(__sk_user_data(clcsk));
+
+ if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+ return NULL;
+ return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
}
/* save target_cb in saved_cb, and replace target_cb with new_cb */
--
2.43.0
* Re: [PATCH net] net/smc: fix out-of-bounds read in smc_clcsock_data_ready()
From: Sechang Lim @ 2026-06-19 14:59 UTC (permalink / raw)
To: D. Wythe
Cc: Dust Li, Sidraya Jayagond, Wenjia Zhang, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, David S . Miller, Mahanta Jambigi,
Tony Lu, Wen Gu, Simon Horman, Ursula Braun, Karsten Graul,
Guvenc Gulce, netdev, linux-rdma, linux-s390, bpf, linux-kernel
In-Reply-To: <20260616071639.GA104390@j66a10360.sqa.eu95>
On Tue, Jun 16, 2026 at 03:16:39PM +0800, D. Wythe wrote:
>On Sun, Jun 14, 2026 at 12:09:30PM +0000, Sechang Lim wrote:
>> smc_clcsock_data_ready() is installed on the listen socket and reads its
>> sk_user_data as an smc_sock. A passive-open child inherits this callback,
>> but sk_clone_lock() clears the child's sk_user_data because it is tagged
>> SK_USER_DATA_NOCOPY. smc_tcp_syn_recv_sock() restores the child's af_ops,
>> but the inherited sk_data_ready() is left in place until accept.
>>
>> In that window the child is established. A cgroup sock_ops program can run
>> bpf_sock_hash_update() on it from tcp_init_transfer(); sk_psock_init()
>> stores a sk_psock in the NULL sk_user_data. The inherited callback then
>> reads sk_user_data via smc_clcsock_user_data(), which masks only
>> SK_USER_DATA_NOCOPY, mistakes the sk_psock for an smc_sock, and reads a
>> callback pointer past the end of the sk_psock:
>>
>> BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>> Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
>> <IRQ>
>> smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>> tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
>> tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
>> tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>> tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
>> ip_protocol_deliver_rcu+0x226/0x420 net/ipv4/ip_input.c:207
>> ip_local_deliver_finish+0x35a/0x5f0 net/ipv4/ip_input.c:241
>> __netif_receive_skb_one_core+0x1e5/0x210 net/core/dev.c:6216
>> process_backlog+0x631/0x1470 net/core/dev.c:6682
>> __napi_poll+0xb3/0x320 net/core/dev.c:7749
>> net_rx_action+0x4fa/0xcb0 net/core/dev.c:7969
>> handle_softirqs+0x236/0x800 kernel/softirq.c:622
>> </IRQ>
>>
>> Allocated by task 67930:
>> sk_psock_init+0x142/0x740 net/core/skmsg.c:766
>> sock_map_link+0x646/0xdf0 net/core/sock_map.c:279
>> sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
>> bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
>> __cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
>> tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
>> tcp_rcv_state_process+0x241e/0x4940 net/ipv4/tcp_input.c:7231
>> tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>>
>> Restore the inherited sk_data_ready() in smc_tcp_syn_recv_sock(), where the
>> child's sk_user_data is already cleared, rather than only at accept.
>>
>> Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
>> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
>> ---
>> net/smc/af_smc.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
>> index b5db69073e20..152971e8ad17 100644
>> --- a/net/smc/af_smc.c
>> +++ b/net/smc/af_smc.c
>> @@ -156,6 +156,12 @@ static struct sock *smc_tcp_syn_recv_sock(const struct sock *sk,
>> if (child) {
>> rcu_assign_sk_user_data(child, NULL);
>>
>> + /*
>> + * the child inherited the listen-specific sk_data_ready();
>> + * restore it here, as sk_user_data may be reused before accept
>> + */
>> + child->sk_data_ready = smc->clcsk_data_ready;
>
>One concern:
>
>smc_clcsock_user_data_rcu() together with refcount_inc_not_zero() only
>pins the smc_sock; it does not guarantee anything about the lifetime or
>consistency of smc->clcsk_data_ready. In the listen-close path,
>smc_clcsock_restore_cb() clears that field under sk_callback_lock,
>while smc_tcp_syn_recv_sock() reads it without any lock. These are
>independent protection domains. If close wins the race,
>child->sk_data_ready can end up NULL and the next data arrival will
>crash.
>
Will drop the syn_recv restore in v2. Thanks for your review.
>Also, I don't object to this fix, but I'd rather see the underlying cause
>addressed directly. The real issue seems to be the conflict between
>SMC's sk_user_data and sk_psock. Maybe there is a cleaner solution, e.g.
>always setting user_data.
>
Agreed.
Thanks, will send v2.
Best,
Sechang
* Re: AW: AW: AW: [PATCH net] net: usb: lan78xx: restore VLAN filter table after device reset
From: Nicolai Buchwitz @ 2026-06-19 14:01 UTC (permalink / raw)
To: Sven Schuchmann
Cc: Thangaraj Samynathan, Rengarajan Sundararajan, UNGLinuxDriver,
Woojung.Huh, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, netdev, linux-usb, linux-kernel
In-Reply-To: <BEZP281MB224523ADACDB48D8E3974D4AD9E22@BEZP281MB2245.DEUP281.PROD.OUTLOOK.COM>
Hi Sven
On 19.6.2026 15:31, Sven Schuchmann wrote:
> Hello Nicolai,
>
> looks good from my point of view
> (Calling the lan78xx_write_vlan_table() from
> lan78xx_mac_link_up() and from lan78xx_reset()).
Thanks.
> But I investigated a little more and it seems the hash table
> (which is right behind the VLAN table in the controller's memory)
> also gets cleared. I wrote some random data into this table and saw
> that it gets cleared as well. I think this needs to be fixed too.
Something like
static int lan78xx_write_mchash_table(struct lan78xx_net *dev)
{
	struct lan78xx_priv *pdata = (struct lan78xx_priv *)(dev->data[0]);
	/* same arguments as in lan78xx_deferred_multicast_write() */
	return lan78xx_dataport_write(dev, DP_SEL_RSEL_VLAN_DA_,
				      DP_SEL_VHF_VLAN_LEN,
				      DP_SEL_VHF_HASH_LEN,
				      pdata->mchash_table);
}
with callers in lan78xx_deferred_multicast_write() and
lan78xx_mac_link_up(), should do the trick?
>
> In the Datasheet from the LAN7801 I can read:
> "After a reset event, the RFE will automatically initialize the
> contents of the VHF to 0h."
> Where VHF also refers to the hash table.
> But I still do not understand what reset is happening when I just
> unplug the network cable....
I suspect it is triggered from the PHY:
8.10 (MAC Reset Watchdog Timer):
"A portion of the MAC operates on clocks generated by the Ethernet PHY
[...] PHY Reset (PHY_RST) results in resetting the portion of the MAC
operating on the PHY receive and transmit clocks."
So which PHY are you using?
> [...]
Thanks,
Nicolai
* Re: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500 MAC-PHY
From: Uwe Kleine-König @ 2026-06-19 13:59 UTC (permalink / raw)
To: Selvamani.Rajagopal
Cc: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, Parthiban Veerasooran, Richard Cochran, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Simon Horman, Jonathan Corbet,
Shuah Khan, netdev, linux-kernel, devicetree, linux-doc,
Jerry Ray
In-Reply-To: <20260614-s2500-mac-phy-support-v5-12-89874b72f725@onsemi.com>
On Sun, Jun 14, 2026 at 10:00:28AM -0700, Selvamani Rajagopal via B4 Relay wrote:
> +static const struct of_device_id s2500_of_match[] = {
> + { .compatible = "onnn,s2500" },
> + {}
s/{}/{ }/
> +};
> +
> +static const struct spi_device_id s2500_ids[] = {
> + { "s2500" },
> + {}
> +};
Please make this:
static const struct spi_device_id s2500_ids[] = {
{ .name = "s2500" },
{ }
};
> +MODULE_DEVICE_TABLE(spi, s2500_ids);
> +
> +static struct spi_driver s2500_driver = {
> + .driver = {
> + .name = DRV_NAME,
> + .of_match_table = s2500_of_match,
> + },
> + .probe = s2500_probe,
> + .remove = s2500_remove,
> + .id_table = s2500_ids,
Tastes are different, but the idea to align = is usually screwed by
follow up patches. Here it's broken from the start. If you ask me: Use a
single space before each =.
> +};
> +
> +module_spi_driver(s2500_driver);
Usually there is no empty line between the driver struct and the macro
registering it.
> +
> +MODULE_AUTHOR("Piergiorgio Beruto <pier.beruto@onsemi.com>");
> +MODULE_AUTHOR("Selva Rajagopal <selvamani.rajagopal@onsemi.com>");
> +MODULE_DESCRIPTION("onsemi MACPHY ethernet driver");
> +MODULE_LICENSE("GPL");
Best regards
Uwe
* Re: [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-06-19 13:55 UTC (permalink / raw)
To: Shradha Gupta
Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
Saurabh Singh Sengar, stable
In-Reply-To: <20260619073338.481035-1-shradhagupta@linux.microsoft.com>
On Fri, Jun 19, 2026 at 12:33:35AM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
>
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
>
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
>
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
>
> We also studied the results of setting the affinity and hint to
> NULL in these cases, and observed that, with this logic if there are
> pre existing IRQs allocated on the VM(apart from MANA), during MANA
> IRQs allocation, it leads to clustering of the MANA queue IRQs again.
> These results can be seen through case 3 in the following data.
>
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
> TYPE effective vCPU aff
> =======================================================
> IRQ0: HWC 0
> IRQ1: mana_q1 0
> IRQ2: mana_q2 2
> IRQ3: mana_q3 0
> IRQ4: mana_q4 3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU 0 1 2 3
> =======================================================
> pass 1: 38.85 0.03 24.89 24.65
> pass 2: 39.15 0.03 24.57 25.28
> pass 3: 40.36 0.03 23.20 23.17
>
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
> TYPE effective vCPU aff
> =======================================================
> IRQ0: HWC 0
> IRQ1: mana_q1 0
> IRQ2: mana_q2 1
> IRQ3: mana_q3 2
> IRQ4: mana_q4 3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU 0 1 2 3
> =======================================================
> pass 1: 15.42 15.85 14.99 14.51
> pass 2: 15.53 15.94 15.81 15.93
> pass 3: 16.41 16.35 16.40 16.36
>
> =======================================================
> Case 3: with affinity set to NULL
> =======================================================
> 4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
> TYPE effective vCPU aff
> =======================================================
> IRQ0: HWC 0
> IRQ1: mana_q1 2
> IRQ2: mana_q2 3
> IRQ3: mana_q3 2
> IRQ4: mana_q4 3
>
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn with patch w/o patch aff NULL
> 20480 15.65 7.73 5.25
> 10240 15.63 8.93 5.77
> 8192 15.64 9.69 7.16
> 6144 15.64 13.16 9.33
> 4096 15.69 15.75 13.50
> 2048 15.69 15.83 13.61
> 1024 15.71 15.28 13.60
>
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
> ---
> Changes in v4
> * Add mana prefix on irq_affinity_*() in mana driver
> * Corrected grammar, comment for mana_irq_setup_linear()
> * added new line as per guidelines
> * added case 3 in commit message for when affinity is NULL
> ---
> Changes in v3
> * Optimize the comments in mana_gd_setup_dyn_irqs()
> * add more details in the dev_dbg for extra IRQs
> ---
> Changes in v2
> * Removed the unused skip_first_cpu variable
> * fixed exit condition in irq_setup_linear() with len == 0
> * changed return type of irq_setup_linear() as it will always be 0
> * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> * added appropriate comments to indicate expected behaviour when
> IRQs are more than or equal to num_online_cpus()
> ---
> .../net/ethernet/microsoft/mana/gdma_main.c | 78 +++++++++++++++----
> 1 file changed, 64 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index a0fdd052d7f1..e8b7ffb47eb9 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> } else {
> /* If dynamic allocation is enabled we have already allocated
> * hwc msi
> + * Also, we make sure in this case the following is always true
> + * (num_msix_usable - 1 HWC) <= num_online_cpus()
> */
> gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> }
> @@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
> * do the same thing.
> */
>
> -static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> - bool skip_first_cpu)
> +static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
> + int node, bool skip_first_cpu)
> {
> const struct cpumask *next, *prev = cpu_none_mask;
> cpumask_var_t cpus __free(free_cpumask_var);
> @@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> return 0;
> }
>
> +/* must be called with cpus_read_lock() held */
> +static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> + int cpu;
> +
> + for_each_online_cpu(cpu) {
> + if (len == 0)
> + break;
> +
> + irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> + len--;
> + }
> +}
> +
> static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> {
> struct gdma_context *gc = pci_get_drvdata(pdev);
> struct gdma_irq_context *gic;
> - bool skip_first_cpu = false;
> int *irqs, err, i, msi;
>
> irqs = kmalloc_objs(int, nvec);
> @@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> return -ENOMEM;
>
> /*
> + * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> + * nvec is only Queue IRQ (HWC already setup).
> * While processing the next pci irq vector, we start with index 1,
> * as IRQ vector at index 0 is already processed for HWC.
> * However, the population of irqs array starts with index 0, to be
> - * further used in irq_setup()
> + * further used in mana_irq_setup_numa_aware()
> */
> for (i = 1; i <= nvec; i++) {
> msi = i;
> @@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> }
>
> /*
> - * When calling irq_setup() for dynamically added IRQs, if number of
> - * CPUs is more than or equal to allocated MSI-X, we need to skip the
> - * first CPU sibling group since they are already affinitized to HWC IRQ
> + * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
> + * if number of CPUs is more than or equal to allocated MSI-X, we need to
> + * skip the first CPU sibling group since they are already affinitized to
> + * HWC IRQ
> */
> cpus_read_lock();
> - if (gc->num_msix_usable <= num_online_cpus())
> - skip_first_cpu = true;
> + if (gc->num_msix_usable <= num_online_cpus()) {
> + err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
> + true);
> + if (err) {
> + cpus_read_unlock();
> + goto free_irq;
> + }
> + } else {
> + /*
> + * When num_msix_usable are more than num_online_cpus, our
> + * queue IRQs should be equal to num of online vCPUs.
> + * We try to make sure queue IRQs spread across all vCPUs.
> + * In such a case NUMA or CPU core affinity does not matter.
> + * Note: in this case the total mana IRQ should always be
> + * num_online_cpus + 1. The first HWC IRQ is already handled
> + * in HWC setup calls
> + * However, if CPUs went offline since num_msix_usable was
> + * computed, queue IRQs will be more than num_online_cpus().
> + * In such cases remaining extra IRQs will retain their default
> + * affinity.
> + */
> + int first_unassigned = num_online_cpus();
>
> - err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> - if (err) {
> - cpus_read_unlock();
> - goto free_irq;
> + if (nvec > first_unassigned) {
> + char buf[32];
> +
> + if (first_unassigned == nvec - 1)
> + snprintf(buf, sizeof(buf), "%d",
> + first_unassigned);
> + else
> + snprintf(buf, sizeof(buf), "%d-%d",
> + first_unassigned, nvec - 1);
> +
> + dev_dbg(&pdev->dev,
> + "MANA IRQ indices #%s will retain the default CPU affinity\n",
> + buf);
> + }
> +
> + mana_irq_setup_linear(irqs, nvec);
> }
>
> cpus_read_unlock();
> @@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
> nvec -= 1;
> }
>
> - err = irq_setup(irqs, nvec, gc->numa_node, false);
> + err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
> if (err) {
> cpus_read_unlock();
> goto free_irq;
>
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> --
> 2.34.1
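The linear spreading in the else branch is easy to model outside the kernel. A rough sketch, assuming mana_irq_setup_linear() simply pairs the IRQs with the online CPUs in order; the IRQ numbers are made up and spread_linear() is a hypothetical name:

```python
def spread_linear(irqs, online_cpus):
    """Model of the linear setup: one IRQ per online CPU, in order.

    Returns (assigned, leftover): assigned maps IRQ -> CPU, leftover lists
    the IRQs that would keep their default affinity.
    """
    # zip() stops at the shorter input, mirroring the len == 0 exit and the
    # end of the online-CPU iteration.
    assigned = dict(zip(irqs, online_cpus))
    leftover = irqs[len(assigned):]
    return assigned, leftover


queue_irqs = [101, 102, 103, 104]  # made-up IRQ numbers for mana_q1..q4

# 4 online vCPUs: every queue IRQ gets its own vCPU, as in case 2.
print(spread_linear(queue_irqs, [0, 1, 2, 3]))

# One vCPU went offline after num_msix_usable was computed: the last IRQ
# is left with its default affinity, which is what the dev_dbg() reports.
print(spread_linear(queue_irqs, [0, 1, 2]))
```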
* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: Thomas Gleixner @ 2026-06-19 13:34 UTC (permalink / raw)
To: David Woodhouse, John Stultz, Stephen Boyd, Miroslav Lichvar,
Richard Cochran, linux-kernel, netdev
Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <3616fc9718614bf11915569599038a5bcb268c02.camel@infradead.org>
On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
>
> nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +
> + /*
> + * For the NTP-disciplined mono-based clocks, report how far
> + * @systime is from the ideal NTP time at @now, in signed ns,
> + * so a caller can land on the ideal line by adding it. Four
> + * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> + *
> + * - tk->ntp_error, the deviation as of the last update;
> + * - (cycle_delta * ntp_err_frac), the fractional-mult drift
> + * accrued since then (cycle_delta is at most a tick on a
> + * tickful kernel, but many ticks' worth under NO_HZ);
> + * - (cycle_delta * ntp_err_mult), subtracting the applied +1
> + * mult dither over the same span;
> + * - the sub-ns fraction @systime dropped when the read was
> + * truncated to whole ns (low @shift bits, exact despite the
> + * multiply overflowing).
> + *
> + * RAW is undisciplined and AUX has its own discipline, so they
> + * carry no ntp_error.
AUX has ntp_error too. AUX clocks have a per-clock NTP instance, which
works exactly like the main timekeeper's one. Only CLOCK_MONOTONIC_RAW
needs to be excluded.
> + */
> + if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> + clock_id == CLOCK_BOOTTIME) {
> + u32 nes = tk->ntp_error_shift;
> + u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> + tk->tkr_mono.mask;
> + s64 err = tk->ntp_error +
> + (((s64)mul_u64_u64_shr(cycle_delta,
> + tk->ntp_err_frac, 32) -
> + (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> +
> + err += (s64)((cycle_delta * tk->tkr_mono.mult +
> + tk->tkr_mono.xtime_nsec) &
> + ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> + systime_snapshot->ntp_error =
> + (err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> + NTP_SCALE_SHIFT;
This formatting makes my brain hurt. Can you please split that out into
a separate function?
/*
* Big fat comment....
*/
static void snapshot_ntp_error(clockid_t clock_id, struct system_time_snapshot *snap,
			       struct timekeeper *tk, u64 now)
{
	if (clock_id == CLOCK_MONOTONIC_RAW) {
		snap->ntp_error = 0;
		return;
	}
	u64 cycle_delta = (now - tk->tkr_mono.cycle_last) & tk->tkr_mono.mask;
	u32 nes = tk->ntp_error_shift;
	s64 tmp, err = tk->ntp_error;
	err += ((s64)mul_u64_u64_shr(cycle_delta, tk->ntp_err_frac, 32) -
		(s64)(cycle_delta * tk->ntp_err_mult)) << nes;
	tmp = (s64)(cycle_delta * tk->tkr_mono.mult + tk->tkr_mono.xtime_nsec);
	tmp &= (1ULL << tk->tkr_mono.shift) - 1;
	err += tmp << nes;
	snap->ntp_error = (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}
or something readable like that.
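To sanity-check the arithmetic outside the kernel, the same expression can be modelled with Python integers. This is only a sketch: NTP_SCALE_SHIFT = 32 is assumed to match the kernel, the inputs are made-up numbers, and overflow is modelled only for the masked sub-ns term.

```python
NTP_SCALE_SHIFT = 32  # assumed to match the kernel's definition
U64_MASK = (1 << 64) - 1


def snapshot_ntp_error(ntp_error, nes, cycle_delta, err_frac, err_mult,
                       mult, shift, xtime_nsec):
    """Model of the proposed helper; returns the error in whole ns."""
    err = ntp_error
    # Fractional-mult drift accrued since the last update, minus the
    # applied mult dither over the same span.
    drift = ((cycle_delta * err_frac) >> 32) - cycle_delta * err_mult
    err += drift << nes
    # Sub-ns fraction dropped by the truncated read. u64 wraparound leaves
    # the low `shift` bits intact, so masking after the wrap is exact.
    frac = ((cycle_delta * mult + xtime_nsec) & U64_MASK) & ((1 << shift) - 1)
    err += frac << nes
    # Add half a ns before shifting so the result is rounded, not truncated.
    return (err + (1 << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT


# 1.5 ns of accumulated error and nothing else rounds to 2 ns.
print(snapshot_ntp_error(3 << 31, 0, 0, 0, 0, 1, 8, 0))
```

Python's >> on negative integers is an arithmetic shift, which is what the s64 shift in the C version is expected to do as well.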
* AW: AW: AW: [PATCH net] net: usb: lan78xx: restore VLAN filter table after device reset
From: Sven Schuchmann @ 2026-06-19 13:31 UTC (permalink / raw)
To: Nicolai Buchwitz
Cc: Thangaraj Samynathan, Rengarajan Sundararajan,
UNGLinuxDriver@microchip.com, Woojung.Huh@microchip.com,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev@vger.kernel.org, linux-usb@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <4abfc9b1e8860da93c03639863bd0232@tipi-net.de>
Hello Nicolai,
looks good from my point of view
(Calling the lan78xx_write_vlan_table() from
lan78xx_mac_link_up() and from lan78xx_reset()).
But I investigated a little more and it seems the hash table
(which is right behind the VLAN table in the controller's memory)
also gets cleared. I wrote some random data into this table and saw
that it gets cleared as well. I think this needs to be fixed too.
In the LAN7801 datasheet I can read:
"After a reset event, the RFE will automatically initialize the contents of the VHF to 0h."
Here VHF also refers to the hash table.
But I still do not understand what reset is happening when I just unplug the network cable...
Regards,
Sven
On 19.6.2026 11:53, Nicolai Buchwitz wrote:
> Hi Sven
>
> On 19.6.2026 11:18, Sven Schuchmann wrote:
> > Hello Nicolai,
> >
> > my first observation is that calling lan78xx_write_vlan_table()
> > at the end of lan78xx_start_rx_path() fixes the problem. I was able
> > to do over 200 connect/disconnects without any problem.
>
> Thanks, that's the right direction. For the final patch I'd move it
> to lan78xx_mac_link_up(), which is IMHO a bit "cleaner":
>
> [...]
> static void lan78xx_rx_urb_submit_all(struct lan78xx_net *dev);
> +static int lan78xx_write_vlan_table(struct lan78xx_net *dev);
> [...]
> static void lan78xx_mac_link_up(struct phylink_config *config,
> [...]
> if (ret < 0)
> goto link_up_fail;
>
> + ret = lan78xx_write_vlan_table(dev);
> + if (ret < 0)
> + goto link_up_fail;
> +
> netif_start_queue(net);
> [...]
>
> Could you give this version a quick test and confirm? Then I'll add
> your Tested-by.
>
> > [...]
>
> Thanks
> Nicolai
* [Bug ?] Packet with End.X segment not correctly forwarded to nexthop
From: Anthony Doeraene @ 2026-06-19 13:25 UTC (permalink / raw)
To: andrea.mayer; +Cc: netdev
Hello,
I am currently experimenting with SRv6 and VRFs, and I found some weird
interactions between the two.
For the context, I need routers to have multiple VRFs, with each VRF
having different routes to reach destinations.
Our routers not only send packets to a specific nexthop, but also
specify the VRF that the nexthop
should use to forward these packets.
To achieve this goal, routes in these VRFs push two segments: a local
End.X segment, and a End.DT46 segment.
Due to some implementation constraints, I want to have a single End.DT46
segment shared by
all routers in the network.
Once packets are encapsulated by the VRF, the packet is sent in the main
table to do a lookup for the nexthop.
As the End.DT46 segment is shared between routers and can not be used to
learn the nexthop, I decided to
use an End.X segment to specify it.
However, what I observe in this scenario is that End.X segment
processing function is never called, resulting
in the packet not being sent to the correct nexthop.
I am wondering if this is an expected behavior (i.e. a node should never
push a local segment), or if it is a real bug ?
I am not well versed into the implementation details of SRv6 in the
kernel, but I'm suspecting that this "bug" comes
from the fact that seg6_output_core calls dst_output, which does not
allow an SRv6 segment function to be called.
A minimal example is given below, which creates two namespaces (r1, r2)
and allows to reproduce this behavior.
(tested on a kernel compiled on virtme-ng from commit
e771677c937da5808f7b6c1f0e4a97ec1a84f8a8)
Thank you in advance for the help and thanks for the SRv6 support on Linux,
Doeraene Anthony
File setup.sh
```
# Topology under test:
#
# fc00::1:1 fc00::1:2
# fc00::1 [ r1 ] ------------------------- [ r2 ] fc00::2
#
# Description:
# ============
#
# Each node has an additional VRF, which it can use to provide different
# routing decisions based on arbitrary rules (e.g. QoS aware forwarding)
# Routes in this VRF will encapsulate the packets and push segments to
# specify the nexthop (End.X) and the VRF the nexthop should use
# (End.DT46). The same End.DT46 segment is shared by all nodes
#
# Problem:
# ========
#
# Once segments are pushed, the End.X segment is never applied. As a
# result, the segment is not popped from the SL, and the packet is sent
# on an incorrect interface.
#
# Forwarding steps:
# =================
#
# - R1 sends the packet to fc00::2 in its VRF `myvrf`
# - This VRF encapsulates the packet and adds two segments:
# 1) End.X segment to force the transmission of the packet on r1-r2
# 2) End.DT46 segment allowing r2 to know which VRF it should use
# to forward the packet.
# - After encapsulation, r1 does a lookup in its main table for the
# End.X segment, but does not pop the segment. The packet is thus
# sent incorrectly on the dummy interface
#
# Running the example (with sudo):
# ================================
#
# 1) Start the topology
#
# bash setup.sh
#
# 2) Start pinging (leave in the background)
#
# ip netns exec r1 ping -I fc00::1 fc00::2
#
# 3) Check with tcpdump. We should see packets on r1-r2, and should not
# see any packet on dum0
#
# ip netns exec r1 tcpdump -i dum0 -n
# ip netns exec r1 tcpdump -i r1-r2 -n
if [ -z "$(lsmod | grep vrf)" ]; then
echo "Run modprobe vrf"
exit 1
fi
nodes="r1 r2"
vrftable=10
localsid=90
# Create nodes
for node in $nodes; do
ip netns add $node
ip -n $node link set lo up
done
# Create loopback addresses
ip -n r1 addr add fc00::1 dev lo
ip -n r2 addr add fc00::2 dev lo
# Create links
ip link add r1-r2 type veth peer name r2-r1
ip link set r1-r2 netns r1
ip link set r2-r1 netns r2
ip -n r1 link set r1-r2 up
ip -n r2 link set r2-r1 up
# Configure IPs
ip -n r1 addr add dev r1-r2 fc00::1:1/112
ip -n r2 addr add dev r2-r1 fc00::1:2/112
# Add default routes
ip -n r1 -6 route add default via fc00::1:2
ip -n r2 -6 route add default via fc00::1:1
# Configure sysctls
for node in $nodes; do
ip netns exec $node sysctl -w net.ipv6.conf.all.forwarding=1
ip netns exec $node sysctl -w net.ipv6.conf.all.seg6_enabled=1
ip netns exec $node sysctl -w net.vrf.strict_mode=1
for itf in $(ip netns exec $node ls /sys/class/net); do
ip netns exec $node sysctl net.ipv6.conf.$itf.seg6_enabled=1
done
done
for node in $nodes; do
# Create a dummy interface for End.X segments
ip -n $node link add dum0 type dummy
ip -n $node link set dum0 up
# Create VRF
ip -n $node link add myvrf type vrf table $vrftable
ip -n $node link set dev myvrf up
done
# Create SID table route
ip -n r1 -6 rule add to fc00:1::/32 lookup $localsid prio 998
ip -n r1 -6 rule add to fc00:ffff::/32 lookup $localsid prio 999
ip -n r2 -6 rule add to fc00:2::/32 lookup $localsid prio 998
ip -n r2 -6 rule add to fc00:ffff::/32 lookup $localsid prio 999
# Create the DT46 segment associated with the VRF
ip -n r1 route add table $localsid fc00:ffff:: \
encap seg6local \
action End.DT46 vrftable $vrftable dev myvrf
ip -n r2 route add table $localsid fc00:ffff:: \
encap seg6local \
action End.DT46 vrftable $vrftable dev myvrf
# Create the End.X segment
ip -n r1 route add table $localsid fc00:1:2:: \
encap seg6local action End.X nh6 fc00::1:2 oif r1-r2 dev dum0
ip -n r2 route add table $localsid fc00:2:1:: \
encap seg6local action End.X nh6 fc00::1:1 oif r2-r1 dev dum0
# Setup routes (main table)
ip -n r1 route add fc00::2 dev myvrf
# Setup routes (VRF). R1 pushes an End.X segment followed by the
# End.DT46 segment
ip -n r1 route add fc00::2 encap seg6 \
mode encap \
segs fc00:1:2::,fc00:ffff:: \
dev r1-r2 via fc00::1:2 \
table $vrftable
```
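To make the expectation in the report concrete, here is a toy Python model
of the segment handling. It is only a sketch, not kernel code: the
Srv6Packet class and the encap() and end_x() helpers are hypothetical
names, the SRH is reduced to a segment list plus a Segments Left index,
and the SL == 0 case of End.X is simplified to an error.

```python
from dataclasses import dataclass


@dataclass
class Srv6Packet:
    dst: str              # outer IPv6 destination address
    segment_list: list    # SRH order: last segment to visit comes first
    segments_left: int    # index of the active segment in segment_list


def encap(segs):
    """Model 'encap seg6 mode encap segs a,b': the first SID becomes the DA."""
    return Srv6Packet(dst=segs[0],
                      segment_list=list(reversed(segs)),
                      segments_left=len(segs) - 1)


def end_x(pkt, nh6, oif):
    """Model End.X: advance to the next segment, forward to a fixed adjacency."""
    if pkt.segments_left == 0:
        # Simplification: the real behavior for SL == 0 is not modeled here.
        raise ValueError("End.X needs another segment to advance to")
    sl = pkt.segments_left - 1
    advanced = Srv6Packet(pkt.segment_list[sl], pkt.segment_list, sl)
    return advanced, (nh6, oif)


# State right after the VRF route in setup.sh encapsulates the packet.
pkt = encap(["fc00:1:2::", "fc00:ffff::"])
# State the reporter expects once the local End.X SID has been processed.
expected, adjacency = end_x(pkt, nh6="fc00::1:2", oif="r1-r2")

print("after encap :", pkt.dst, "SL =", pkt.segments_left)
print("after End.X :", expected.dst, "SL =", expected.segments_left,
      "via", adjacency)
```

In the reported scenario the packet leaves r1 in the "after encap" state,
on dum0, instead of the "after End.X" state on r1-r2.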
^ permalink raw reply
* Re: [PATCH v28 5/5] sfc: support pio mapping based on cxl
From: Edward Cree @ 2026-06-19 13:23 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero
In-Reply-To: <20260618181806.118745-6-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> A PIO buffer is a region of device memory to which the driver can write a
> packet for TX, with the device handling the transmit doorbell without
> requiring a DMA for getting the packet data, which helps reducing latency
> in certain exchanges. With CXL mem protocol this latency can be lowered
> further.
>
> With a device supporting CXL and successfully initialised, use the cxl
> region to map the memory range and use this mapping for PIO buffers.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
One nit:
> diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
> index 45e191686625..057d30090894 100644
> --- a/drivers/net/ethernet/sfc/efx.h
> +++ b/drivers/net/ethernet/sfc/efx.h
> @@ -236,5 +236,4 @@ static inline bool efx_rwsem_assert_write_locked(struct rw_semaphore *sem)
>
> int efx_xdp_tx_buffers(struct efx_nic *efx, int n, struct xdp_frame **xdpfs,
> bool flush);
> -
> #endif /* EFX_EFX_H */
This looks like a stray changebar, clean it up if respinning.
-ed
^ permalink raw reply
* Re: [PATCH v28 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Edward Cree @ 2026-06-19 13:20 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero
In-Reply-To: <20260618181806.118745-5-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Use core API for safely obtain the CXL range linked to an HDM committed
> by the BIOS. Map such a range for being used as the ctpio buffer.
>
> A potential user space action through sysfs unbinding or core cxl
> modules remove will trigger sfc driver device detachment, with that case
> not racing with this mapping as this is done during driver probe and
> therefore protected with device lock against those user space actions.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
^ permalink raw reply
* Re: [PATCH v28 3/5] cxl/sfc: Initialize dpa without a mailbox
From: Edward Cree @ 2026-06-19 13:15 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero, Dan Williams, Ben Cheatham, Jonathan Cameron
In-Reply-To: <20260618181806.118745-4-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Type3 relies on mailbox CXL_MBOX_OP_IDENTIFY command for initializing
> memdev state params which end up being used for DPA initialization.
>
> Allow a Type2 driver to initialize DPA simply by giving the size of its
> volatile hardware partition.
>
> Move related functions to memdev.
>
> Add sfc driver as the client.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
^ permalink raw reply
* Re: [PATCH v28 2/5] cxl/sfc: Map cxl regs
From: Edward Cree @ 2026-06-19 13:14 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero, Dan Williams, Jonathan Cameron, Ben Cheatham
In-Reply-To: <20260618181806.118745-3-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Export cxl core functions for a Type2 driver being able to discover and
> map the device registers.
>
> Use it in sfc driver cxl initialization.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
^ permalink raw reply
* [PATCH v3 net] net: airoha: Fix TX scheduler queue mask loop upper bound
From: Wayen Yan @ 2026-06-19 13:12 UTC (permalink / raw)
To: netdev
Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
linux-mediatek
In airoha_qdma_set_chan_tx_sched(), the loop clearing queue mask was
using AIROHA_NUM_TX_RING (32) instead of AIROHA_NUM_QOS_QUEUES (8).
Each channel has 8 queues, and TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i)
computes BIT(i + (channel * 8)). With i ranging 0..31, this causes:
- channel 0: clears bits 0..31 (all 4 channels) instead of 0..7
- channel 1: clears bits 8..31 (channels 1-3) instead of 8..15
- channel 2: clears bits 16..31 (channels 2-3) instead of 16..23
- channel 3: clears bits 24..31 (channel 3 only) - correct by accident
While BIT(32+) on arm64 produces 64-bit values that are truncated to 0
in the u32 mask parameter, the loop still incorrectly clears bits
belonging to the queues of higher-numbered channels.
Even though this is functionally harmless (the register resets to 0
and is only ever cleared, never set — so clearing extra bits is a
no-op), the loop bound is semantically wrong and should be fixed for
correctness and clarity.
Fix by using AIROHA_NUM_QOS_QUEUES (8) as the loop upper bound.
Fixes: ef1ca9271313 ("net: airoha: Add sched HTB offload support")
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Wayen Yan <win847@gmail.com>
---
Changes in v3:
- Rebase on top of current net tree (Lorenzo pointed out v2 was
not based on latest net HEAD).
- No code changes from v2.
drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 64dde6464f..47fb32517a 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -2395,7 +2395,7 @@ static int airoha_qdma_set_chan_tx_sched(struct net_device *netdev,
struct airoha_gdm_dev *dev = netdev_priv(netdev);
int i;
- for (i = 0; i < AIROHA_NUM_TX_RING; i++)
+ for (i = 0; i < AIROHA_NUM_QOS_QUEUES; i++)
airoha_qdma_clear(dev->qdma, REG_QUEUE_CLOSE_CFG(channel),
TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i));
--
2.51.0
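The bit ranges listed in the commit message can be reproduced with a small
Python model of the loop. This is only a sketch: cleared_bits() is a
hypothetical helper, the mask formula BIT(i + channel * 8) and the
constants 32 and 8 are taken from the commit message, and the u32
truncation is modeled with an explicit mask.

```python
QUEUES_PER_CHANNEL = 8   # AIROHA_NUM_QOS_QUEUES per the commit message
TX_RINGS = 32            # AIROHA_NUM_TX_RING per the commit message
U32_MASK = 0xFFFFFFFF


def cleared_bits(channel, loop_bound):
    """Return the set of register bits the loop clears for one channel."""
    mask = 0
    for i in range(loop_bound):
        # The mask argument is a u32, so bits at 32 and above are dropped.
        mask |= (1 << (i + channel * QUEUES_PER_CHANNEL)) & U32_MASK
    return {bit for bit in range(32) if mask & (1 << bit)}


for channel in range(4):
    old = cleared_bits(channel, TX_RINGS)
    new = cleared_bits(channel, QUEUES_PER_CHANNEL)
    print(f"channel {channel}: old bound clears {min(old)}..{max(old)}, "
          f"new bound clears {min(new)}..{max(new)}")
```

With the old bound only channel 3 ends up with the intended range; with
the new bound each channel clears exactly its own 8 bits.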
^ permalink raw reply related
* Re: [PATCH v28 1/5] sfc: add cxl support
From: Edward Cree @ 2026-06-19 13:12 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero, Jonathan Cameron, Alison Schofield,
Dan Williams
In-Reply-To: <20260618181806.118745-2-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Add CXL initialization based on new CXL API for accel drivers and make
> it dependent on kernel CXL configuration.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Acked-by: Edward Cree <ecree.xilinx@gmail.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
...
> diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
> index b98c259f672d..de3fc9537662 100644
> --- a/drivers/net/ethernet/sfc/net_driver.h
> +++ b/drivers/net/ethernet/sfc/net_driver.h
> @@ -1197,14 +1197,23 @@ struct efx_nic {
> atomic_t n_rx_noskb_drops;
> };
>
> +#ifdef CONFIG_SFC_CXL
> +struct efx_cxl;
> +#endif
> +
> /**
> * struct efx_probe_data - State after hardware probe
> * @pci_dev: The PCI device
> * @efx: Efx NIC details
> + * @cxl: details of related cxl objects
> + * @cxl_pio_initialised: cxl initialization outcome.
> */
> struct efx_probe_data {
> struct pci_dev *pci_dev;
> struct efx_nic efx;
> +#ifdef CONFIG_SFC_CXL
> + struct efx_cxl *cxl;
> +#endif
> };
The documented cxl_pio_initialised member does not appear to exist.
Will this not cause a kerneldoc build error?
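A rough way to spot this kind of mismatch is to compare the @member tags
in the comment with the members declared in the struct. The Python sketch
below does that for the hunk quoted above; it is a toy check written for
illustration, not how scripts/kernel-doc works, and the two string
constants are simply copied from the quoted diff.

```python
import re

# kerneldoc comment body, copied from the quoted hunk
KERNELDOC = """
 * struct efx_probe_data - State after hardware probe
 * @pci_dev: The PCI device
 * @efx: Efx NIC details
 * @cxl: details of related cxl objects
 * @cxl_pio_initialised: cxl initialization outcome.
"""

# struct body, copied from the quoted hunk
STRUCT_BODY = """
    struct pci_dev *pci_dev;
    struct efx_nic efx;
#ifdef CONFIG_SFC_CXL
    struct efx_cxl *cxl;
#endif
"""

# Names documented with "@name:" in the comment.
documented = set(re.findall(r"^\s*\*\s*@(\w+):", KERNELDOC, re.MULTILINE))
# Names declared in the struct: the identifier right before each ';'.
declared = set(re.findall(r"(\w+)\s*;", STRUCT_BODY))

excess = sorted(documented - declared)
print("documented but not declared:", excess)
```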
^ permalink raw reply
* Re: [PATCH v2 2/2] selftests/tc-testing: Add DualPI2 GSO backlog accounting test
From: Victor Nogueira @ 2026-06-19 13:10 UTC (permalink / raw)
To: Xingquan Liu; +Cc: Jamal Hadi Salim, netdev, Jiri Pirko, Chia-Yu Chang
In-Reply-To: <20260619073211.637928-2-b1n@b1n.io>
On Fri, Jun 19, 2026 at 4:32 AM Xingquan Liu <b1n@b1n.io> wrote:
>
> Add a regression test for DualPI2 GSO backlog accounting when it is
> used as a child qdisc of QFQ.
>
> The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
> the leaf qdisc. DualPI2 splits the skb into two segments. After the
> traffic drains, both QFQ and DualPI2 must report zero backlog and zero
> qlen.
>
> On kernels with the broken accounting, QFQ can keep a stale non-zero
> qlen after all real packets have been dequeued.
>
> Signed-off-by: Xingquan Liu <b1n@b1n.io>
> ---
> .../tc-testing/tc-tests/qdiscs/dualpi2.json | 44 +++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> index cd1f2ee8f354..ffd6fd5ba8f7 100644
> --- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> +++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> + {
> + "id": "891f",
> [...]
> + "cmdUnderTest": "python3 -c 'import socket,struct; SOL_UDP=getattr(socket,\"SOL_UDP\",socket.IPPROTO_UDP); UDP_SEGMENT=getattr(socket,\"UDP_SEGMENT\",103); s=socket.socket(socket.AF_INET,socket.SOCK_DGRAM); s.bind((\"10.10.10.10\",0)); p=b\"X\"*2400; n=s.sendmsg([p],[(SOL_UDP,UDP_SEGMENT,struct.pack(\"=H\",1200))],0,(\"10.10.10.1\",9000)); raise SystemExit(n != len(p))'",
Can you make this a separate Python script?
Something similar to what the flower tests did [1] with tdc_batch.py [2].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/tools/testing/selftests/tc-testing/tc-tests/filters/flower.json#n205
[2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/tools/testing/selftests/tc-testing/tdc_batch.py
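For illustration, the one-liner could be unfolded into a standalone
helper roughly like the sketch below. The function names and the
command-line options are assumptions made for this example and are not
part of the posted patch; the addresses, port, payload length and segment
size are the values from the quoted cmdUnderTest, and the fallback value
103 for UDP_SEGMENT is the one used there.

```python
#!/usr/bin/env python3
# Hypothetical standalone version of the quoted cmdUnderTest one-liner.
import argparse
import socket
import struct

SOL_UDP = getattr(socket, "SOL_UDP", socket.IPPROTO_UDP)
UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)


def gso_ancdata(segment_size):
    """Build the ancillary data asking the kernel to segment the datagram."""
    return [(SOL_UDP, UDP_SEGMENT, struct.pack("=H", segment_size))]


def send_gso(src, dst, port, payload_len, segment_size):
    """Send one UDP GSO datagram; return True if the whole payload was sent."""
    payload = b"X" * payload_len
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((src, 0))
        sent = sock.sendmsg([payload], gso_ancdata(segment_size), 0,
                            (dst, port))
    return sent == payload_len


def main(argv=None):
    parser = argparse.ArgumentParser(description="Send one UDP GSO datagram")
    parser.add_argument("--src", default="10.10.10.10")
    parser.add_argument("--dst", default="10.10.10.1")
    parser.add_argument("--port", type=int, default=9000)
    parser.add_argument("--length", type=int, default=2400)
    parser.add_argument("--segment", type=int, default=1200)
    args = parser.parse_args(argv)
    ok = send_gso(args.src, args.dst, args.port, args.length, args.segment)
    return 0 if ok else 1
```

As a script it would end with the usual guard calling
raise SystemExit(main()), so that the exit status is non-zero on a short
send, as in the one-liner.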
^ permalink raw reply