Netdev List
* AW: AW: AW: [PATCH net] net: usb: lan78xx: restore VLAN filter table after device reset
From: Sven Schuchmann @ 2026-06-19 13:31 UTC (permalink / raw)
  To: Nicolai Buchwitz
  Cc: Thangaraj Samynathan, Rengarajan Sundararajan,
	UNGLinuxDriver@microchip.com, Woojung.Huh@microchip.com,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev@vger.kernel.org, linux-usb@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4abfc9b1e8860da93c03639863bd0232@tipi-net.de>

Hello Nicolai,

looks good from my point of view
(calling lan78xx_write_vlan_table() from
lan78xx_mac_link_up() and from lan78xx_reset()).

But I investigated a little more and it seems the hash table
(which sits right behind the VLAN table in the controller's memory)
also gets cleared. I wrote some random data into this table and
saw that it was cleared as well. I think this needs to be fixed too.

The LAN7801 datasheet says:
"After a reset event, the RFE will automatically initialize the contents of the VHF to 0h."
Here VHF also covers the hash table.
But I still do not understand which reset happens when I just unplug the network cable.

Regards,
   Sven


On 19.6.2026 11:53, Nicolai Buchwitz wrote:
> Hi Sven
> 
> On 19.6.2026 11:18, Sven Schuchmann wrote:
> > Hello Nicolai,
> >
> > my first observation is that calling lan78xx_write_vlan_table()
> > at the end of lan78xx_start_rx_path() fixes the problem. I was able
> > to do over 200 connect/disconnects without any problem.
> 
> Thanks, that's the right direction. For the final patch I'd move it
> to lan78xx_mac_link_up(), which is IMHO a bit "cleaner":
> 
> [...]
>   static void lan78xx_rx_urb_submit_all(struct lan78xx_net *dev);
> +static int lan78xx_write_vlan_table(struct lan78xx_net *dev);
> [...]
> static void lan78xx_mac_link_up(struct phylink_config *config,
> [...]
>          if (ret < 0)
>                  goto link_up_fail;
> 
> +       ret = lan78xx_write_vlan_table(dev);
> +       if (ret < 0)
> +               goto link_up_fail;
> +
>          netif_start_queue(net);
> [...]
> 
> Could you give this version a quick test and confirm? Then I'll add
> your Tested-by.
> 
> > [...]
> 
> Thanks
> Nicolai


* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: Thomas Gleixner @ 2026-06-19 13:34 UTC (permalink / raw)
  To: David Woodhouse, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <3616fc9718614bf11915569599038a5bcb268c02.camel@infradead.org>

On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
>  
>  		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
>  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +
> +		/*
> +		 * For the NTP-disciplined mono-based clocks, report how far
> +		 * @systime is from the ideal NTP time at @now, in signed ns,
> +		 * so a caller can land on the ideal line by adding it. Four
> +		 * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> +		 *
> +		 *  - tk->ntp_error, the deviation as of the last update;
> +		 *  - (cycle_delta * ntp_err_frac), the fractional-mult drift
> +		 *    accrued since then (cycle_delta is at most a tick on a
> +		 *    tickful kernel, but many ticks' worth under NO_HZ);
> +		 *  - (cycle_delta * ntp_err_mult), subtracting the applied +1
> +		 *    mult dither over the same span;
> +		 *  - the sub-ns fraction @systime dropped when the read was
> +		 *    truncated to whole ns (low @shift bits, exact despite the
> +		 *    multiply overflowing).
> +		 *
> +		 * RAW is undisciplined and AUX has its own discipline, so they
> +		 * carry no ntp_error.

AUX has ntp_error too. AUX clocks have a per-clock NTP instance, which
works exactly like the main timekeeper's one. Only CLOCK_MONOTONIC_RAW
needs to be excluded.

> +		 */
> +		if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> +		    clock_id == CLOCK_BOOTTIME) {
> +			u32 nes = tk->ntp_error_shift;
> +			u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> +					  tk->tkr_mono.mask;
> +			s64 err = tk->ntp_error +
> +				(((s64)mul_u64_u64_shr(cycle_delta,
> +						       tk->ntp_err_frac, 32) -
> +				  (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> +
> +			err += (s64)((cycle_delta * tk->tkr_mono.mult +
> +				      tk->tkr_mono.xtime_nsec) &
> +				     ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> +			systime_snapshot->ntp_error =
> +				(err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> +				NTP_SCALE_SHIFT;

This formatting makes my brain hurt. Can you please split that out into
a separate function?


/*
 * Big fat comment....
 */
static void snapshot_ntp_error(clockid_t clock_id, struct system_time_snapshot *snap,
			       struct timekeeper *tk, u64 now)
{
	if (clock_id == CLOCK_MONOTONIC_RAW) {
		snap->ntp_error = 0;
		return;
	}

	u64 cycle_delta = (now - tk->tkr_mono.cycle_last) & tk->tkr_mono.mask;
	u32 nes = tk->ntp_error_shift;
	s64 tmp, err = tk->ntp_error;

	err += ((s64)mul_u64_u64_shr(cycle_delta, tk->ntp_err_frac, 32) -
		(s64)(cycle_delta * tk->ntp_err_mult)) << nes;

	tmp = (s64)(cycle_delta * tk->tkr_mono.mult + tk->tkr_mono.xtime_nsec);
	tmp &= (1ULL << tk->tkr_mono.shift) - 1;
	err += tmp << nes;
	snap->ntp_error = (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}

or something readable like that.
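
The two bit tricks in that helper can be tried in isolation. What follows is
only a user-space sketch, not kernel code: low_bits() and
scaled_to_ns_round() are hypothetical names, and NTP_SCALE_SHIFT is assumed
to be 32 here. It shows the mask that keeps the sub-ns fraction and the
"add half an LSB, then shift" rounding used on the last line.

```c
#include <stdint.h>

/* Assumption for this sketch: the scale shift is 32 bits. */
#define NTP_SCALE_SHIFT 32

/* Keep only the low @shift bits (shift < 64), i.e. the dropped fraction. */
static inline uint64_t low_bits(uint64_t v, unsigned int shift)
{
	return v & ((1ULL << shift) - 1);
}

/*
 * Convert a signed value in (ns << NTP_SCALE_SHIFT) to whole ns, rounding
 * to nearest by adding half an LSB before shifting. This relies on >> of a
 * negative signed value being an arithmetic shift, which gcc and clang
 * provide.
 */
static inline int64_t scaled_to_ns_round(int64_t err)
{
	return (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}
```

With these definitions a scaled 1.5 ns rounds to 2 and a scaled 1.25 ns
rounds to 1, so the reported error stays within half a nanosecond of the
scaled value.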
                      


* Re: [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-06-19 13:55 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260619073338.481035-1-shradhagupta@linux.microsoft.com>

On Fri, Jun 19, 2026 at 12:33:35AM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
> 
> We also studied the results of setting the affinity and hint to
> NULL in these cases, and observed that, with this logic if there are
> pre existing IRQs allocated on the VM(apart from MANA), during MANA
> IRQs allocation, it leads to clustering of the MANA queue IRQs again.
> These results can be seen through case 3 in the following data.
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Case 3: with affinity set to NULL
> =======================================================
> 4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC			0
> IRQ1:	mana_q1			2
> IRQ2:	mana_q2			3
> IRQ3:	mana_q3			2
> IRQ4:	mana_q4			3
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch	aff NULL
> 20480		15.65		7.73		5.25
> 10240		15.63		8.93		5.77
> 8192		15.64		9.69		7.16
> 6144		15.64		13.16		9.33
> 4096		15.69		15.75		13.50
> 2048		15.69		15.83		13.61
> 1024		15.71		15.28		13.60
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: Simon Horman <horms@kernel.org>

Reviewed-by: Yury Norov <ynorov@nvidia.com>

> ---
> Changes in v4
>  * Add mana prefix on irq_affinity_*() in mana driver
>  * Corrected grammar, comment for mana_irq_setup_linear()
>  * added new line as per guidelines
>  * added case 3 in commit message for when affinity is NULL
> ---
> Changes in v3
>  * Optimize the comments in mana_gd_setup_dyn_irqs()
>  * add more details in the dev_dbg for extra IRQs
> ---
> Changes in v2
>  * Removed the unused skip_first_cpu variable
>  * fixed exit condition in irq_setup_linear() with len == 0
>  * changed return type of irq_setup_linear() as it will always be 0
>  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
>  * added appropriate comments to indicate expected behaviour when
>    IRQs are more than or equal to num_online_cpus()
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 78 +++++++++++++++----
>  1 file changed, 64 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index a0fdd052d7f1..e8b7ffb47eb9 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  	} else {
>  		/* If dynamic allocation is enabled we have already allocated
>  		 * hwc msi
> +		 * Also, we make sure in this case the following is always true
> +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
>  		 */
>  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
>  	}
> @@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
>   * do the same thing.
>   */
>  
> -static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> -		     bool skip_first_cpu)
> +static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
> +				     int node, bool skip_first_cpu)
>  {
>  	const struct cpumask *next, *prev = cpu_none_mask;
>  	cpumask_var_t cpus __free(free_cpumask_var);
> @@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +/* must be called with cpus_read_lock() held */
> +static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (len == 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
>  	struct gdma_irq_context *gic;
> -	bool skip_first_cpu = false;
>  	int *irqs, err, i, msi;
>  
>  	irqs = kmalloc_objs(int, nvec);
> @@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  		return -ENOMEM;
>  
>  	/*
> +	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> +	 * nvec is only Queue IRQ (HWC already setup).
>  	 * While processing the next pci irq vector, we start with index 1,
>  	 * as IRQ vector at index 0 is already processed for HWC.
>  	 * However, the population of irqs array starts with index 0, to be
> -	 * further used in irq_setup()
> +	 * further used in mana_irq_setup_numa_aware()
>  	 */
>  	for (i = 1; i <= nvec; i++) {
>  		msi = i;
> @@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	}
>  
>  	/*
> -	 * When calling irq_setup() for dynamically added IRQs, if number of
> -	 * CPUs is more than or equal to allocated MSI-X, we need to skip the
> -	 * first CPU sibling group since they are already affinitized to HWC IRQ
> +	 * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
> +	 * if number of CPUs is more than or equal to allocated MSI-X, we need to
> +	 * skip the first CPU sibling group since they are already affinitized to
> +	 * HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> -		skip_first_cpu = true;
> +	if (gc->num_msix_usable <= num_online_cpus()) {
> +		err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
> +						true);
> +		if (err) {
> +			cpus_read_unlock();
> +			goto free_irq;
> +		}
> +	} else {
> +		/*
> +		 * When num_msix_usable are more than num_online_cpus, our
> +		 * queue IRQs should be equal to num of online vCPUs.
> +		 * We try to make sure queue IRQs spread across all vCPUs.
> +		 * In such a case NUMA or CPU core affinity does not matter.
> +		 * Note: in this case the total mana IRQ should always be
> +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * However, if CPUs went offline since num_msix_usable was
> +		 * computed, queue IRQs will be more than num_online_cpus().
> +		 * In such cases remaining extra IRQs will retain their default
> +		 * affinity.
> +		 */
> +		int first_unassigned = num_online_cpus();
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> -	if (err) {
> -		cpus_read_unlock();
> -		goto free_irq;
> +		if (nvec > first_unassigned) {
> +			char buf[32];
> +
> +			if (first_unassigned == nvec - 1)
> +				snprintf(buf, sizeof(buf), "%d",
> +					 first_unassigned);
> +			else
> +				snprintf(buf, sizeof(buf), "%d-%d",
> +					 first_unassigned, nvec - 1);
> +
> +			dev_dbg(&pdev->dev,
> +				"MANA IRQ indices #%s will retain the default CPU affinity\n",
> +				buf);
> +		}
> +
> +		mana_irq_setup_linear(irqs, nvec);
>  	}
>  
>  	cpus_read_unlock();
> @@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
>  		nvec -= 1;
>  	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, false);
> +	err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> -- 
> 2.34.1
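
For readers skimming the thread, the linear placement in the else branch can
be modelled in a few lines of user-space C. This is a sketch under
assumptions, not the driver code: model_irq_setup_linear() and
IRQ_DEFAULT_AFFINITY are hypothetical names, and the online CPUs are passed
in as a plain array instead of being walked with for_each_online_cpu().

```c
#include <stddef.h>

/* Hypothetical marker: this IRQ's affinity was left at its default. */
#define IRQ_DEFAULT_AFFINITY (-1)

/*
 * Pin queue IRQ i to the i-th online CPU. IRQs beyond the number of
 * online CPUs are not touched and keep IRQ_DEFAULT_AFFINITY.
 */
static void model_irq_setup_linear(const int *online_cpus, size_t ncpus,
				   int *irq_cpu, size_t nirqs)
{
	size_t i;

	for (i = 0; i < nirqs; i++)
		irq_cpu[i] = i < ncpus ? online_cpus[i] : IRQ_DEFAULT_AFFINITY;
}
```

With four online CPUs and four queue IRQs this gives the 0, 1, 2, 3
placement shown for mana_q1..mana_q4 in case 2. A fifth queue IRQ would be
the one reported by the dev_dbg() message.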


* Re: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500 MAC-PHY
From: Uwe Kleine-König @ 2026-06-19 13:59 UTC (permalink / raw)
  To: Selvamani.Rajagopal
  Cc: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Richard Cochran, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Simon Horman, Jonathan Corbet,
	Shuah Khan, netdev, linux-kernel, devicetree, linux-doc,
	Jerry Ray
In-Reply-To: <20260614-s2500-mac-phy-support-v5-12-89874b72f725@onsemi.com>


On Sun, Jun 14, 2026 at 10:00:28AM -0700, Selvamani Rajagopal via B4 Relay wrote:
> +static const struct of_device_id s2500_of_match[] = {
> +	{ .compatible = "onnn,s2500" },
> +	{}

s/{}/{ }/

> +};
> +
> +static const struct spi_device_id s2500_ids[] = {
> +	{ "s2500" },
> +	{}
> +};

Please make this:

static const struct spi_device_id s2500_ids[] = {
	{ .name = "s2500" },
	{ }
};

> +MODULE_DEVICE_TABLE(spi, s2500_ids);
> +
> +static struct spi_driver s2500_driver = {
> +	.driver = {
> +		.name	= DRV_NAME,
> +		.of_match_table = s2500_of_match,
> +	},
> +	.probe		= s2500_probe,
> +	.remove		= s2500_remove,
> +	.id_table	= s2500_ids,

Tastes differ, but aligning the = signs usually gets broken by
follow-up patches. Here it's broken from the start. If you ask me: use a
single space before each =.

> +};
> +
> +module_spi_driver(s2500_driver);

Usually there is no empty line between the driver struct and the macro
registering it.

> +
> +MODULE_AUTHOR("Piergiorgio Beruto <pier.beruto@onsemi.com>");
> +MODULE_AUTHOR("Selva Rajagopal <selvamani.rajagopal@onsemi.com>");
> +MODULE_DESCRIPTION("onsemi MACPHY ethernet driver");
> +MODULE_LICENSE("GPL");

Best regards
Uwe



* Re: AW: AW: AW: [PATCH net] net: usb: lan78xx: restore VLAN filter table after device reset
From: Nicolai Buchwitz @ 2026-06-19 14:01 UTC (permalink / raw)
  To: Sven Schuchmann
  Cc: Thangaraj Samynathan, Rengarajan Sundararajan, UNGLinuxDriver,
	Woojung.Huh, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-usb, linux-kernel
In-Reply-To: <BEZP281MB224523ADACDB48D8E3974D4AD9E22@BEZP281MB2245.DEUP281.PROD.OUTLOOK.COM>

Hi Sven

On 19.6.2026 15:31, Sven Schuchmann wrote:
> Hello Nicolai,
> 
> looks good from my point of view
> (Calling the lan78xx_write_vlan_table() from
> lan78xx_mac_link_up() and from lan78xx_reset()).

Thanks.

> But I investigated a little more and it seems the hash table
> (which is right behind the vlan table in the controllers memory)
> also gets cleared. I wrote some random data into this table and have
> seen that it gets also cleared. I think this needs to be fixed too.

Something like

static int lan78xx_write_mchash_table(struct lan78xx_net *dev)
{
        struct lan78xx_priv *pdata = (struct lan78xx_priv *)(dev->data[0]);

        /* same call as in lan78xx_deferred_multicast_write() */
        return lan78xx_dataport_write(dev, DP_SEL_RSEL_VLAN_DA_,
                                      DP_SEL_VHF_VLAN_LEN,
                                      DP_SEL_VHF_HASH_LEN, pdata->mchash_table);
}

with callers in lan78xx_deferred_multicast_write() and
lan78xx_mac_link_up(), should do the trick?
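
The underlying idea, keeping driver-side shadow copies and rewriting the
device memory after every reset, can be modelled in user space. This is only
a sketch: struct vhf_model, model_reset() and model_restore() are
hypothetical names, and the table sizes are assumptions made for the
example.

```c
#include <stdint.h>
#include <string.h>

/* Table sizes in 32-bit words, assumed here for illustration only. */
#define VLAN_WORDS 128
#define HASH_WORDS 16

struct vhf_model {
	uint32_t hw[VLAN_WORDS + HASH_WORDS];	/* device memory: VLAN table, then hash table */
	uint32_t vlan_shadow[VLAN_WORDS];	/* driver-side copy of the VLAN table */
	uint32_t hash_shadow[HASH_WORDS];	/* driver-side copy of the hash table */
};

/* A reset zeroes the whole device-side memory, as the datasheet quote says. */
static void model_reset(struct vhf_model *m)
{
	memset(m->hw, 0, sizeof(m->hw));
}

/* Rewrite both tables from the shadow copies, e.g. on link up. */
static void model_restore(struct vhf_model *m)
{
	memcpy(&m->hw[0], m->vlan_shadow, sizeof(m->vlan_shadow));
	memcpy(&m->hw[VLAN_WORDS], m->hash_shadow, sizeof(m->hash_shadow));
}
```

Restoring only the VLAN part would leave the hash entries at zero after a
reset, which is the behaviour Sven observed.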

> 
> In the Datasheet from the LAN7801 I can read:
> "After a reset event, the RFE will automatically initialize the 
> contents of the VHF to 0h."
> Where VHF also refers to the hash table.
> But I still do not understand what reset is happening when I just 
> unplug the network cable....

I suspect it is triggered from the PHY:

8.10 (MAC Reset Watchdog Timer):
"A portion of the MAC operates on clocks generated by the Ethernet PHY
[...] PHY Reset (PHY_RST) results in resetting the portion of the MAC
operating on the PHY receive and transmit clocks."

So which PHY are you using?

> [...]

Thanks,
Nicolai


* Re: [PATCH net] net/smc: fix out-of-bounds read in smc_clcsock_data_ready()
From: Sechang Lim @ 2026-06-19 14:59 UTC (permalink / raw)
  To: D. Wythe
  Cc: Dust Li, Sidraya Jayagond, Wenjia Zhang, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David S . Miller, Mahanta Jambigi,
	Tony Lu, Wen Gu, Simon Horman, Ursula Braun, Karsten Graul,
	Guvenc Gulce, netdev, linux-rdma, linux-s390, bpf, linux-kernel
In-Reply-To: <20260616071639.GA104390@j66a10360.sqa.eu95>

On Tue, Jun 16, 2026 at 03:16:39PM +0800, D. Wythe wrote:
>On Sun, Jun 14, 2026 at 12:09:30PM +0000, Sechang Lim wrote:
>> smc_clcsock_data_ready() is installed on the listen socket and reads its
>> sk_user_data as an smc_sock. A passive-open child inherits this callback,
>> but sk_clone_lock() clears the child's sk_user_data because it is tagged
>> SK_USER_DATA_NOCOPY. smc_tcp_syn_recv_sock() restores the child's af_ops,
>> but the inherited sk_data_ready() is left in place until accept.
>>
>> In that window the child is established. A cgroup sock_ops program can run
>> bpf_sock_hash_update() on it from tcp_init_transfer(); sk_psock_init()
>> stores a sk_psock in the NULL sk_user_data. The inherited callback then
>> reads sk_user_data via smc_clcsock_user_data(), which masks only
>> SK_USER_DATA_NOCOPY, mistakes the sk_psock for an smc_sock, and reads a
>> callback pointer past the end of the sk_psock:
>>
>>   BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>>   Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
>>    <IRQ>
>>    smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>>    tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
>>    tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
>>    tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>>    tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
>>    ip_protocol_deliver_rcu+0x226/0x420 net/ipv4/ip_input.c:207
>>    ip_local_deliver_finish+0x35a/0x5f0 net/ipv4/ip_input.c:241
>>    __netif_receive_skb_one_core+0x1e5/0x210 net/core/dev.c:6216
>>    process_backlog+0x631/0x1470 net/core/dev.c:6682
>>    __napi_poll+0xb3/0x320 net/core/dev.c:7749
>>    net_rx_action+0x4fa/0xcb0 net/core/dev.c:7969
>>    handle_softirqs+0x236/0x800 kernel/softirq.c:622
>>    </IRQ>
>>
>>   Allocated by task 67930:
>>    sk_psock_init+0x142/0x740 net/core/skmsg.c:766
>>    sock_map_link+0x646/0xdf0 net/core/sock_map.c:279
>>    sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
>>    bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
>>    __cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
>>    tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
>>    tcp_rcv_state_process+0x241e/0x4940 net/ipv4/tcp_input.c:7231
>>    tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>>
>> Restore the inherited sk_data_ready() in smc_tcp_syn_recv_sock(), where the
>> child's sk_user_data is already cleared, rather than only at accept.
>>
>> Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
>> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
>> ---
>>  net/smc/af_smc.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
>> index b5db69073e20..152971e8ad17 100644
>> --- a/net/smc/af_smc.c
>> +++ b/net/smc/af_smc.c
>> @@ -156,6 +156,12 @@ static struct sock *smc_tcp_syn_recv_sock(const struct sock *sk,
>>  	if (child) {
>>  		rcu_assign_sk_user_data(child, NULL);
>>
>> +		/*
>> +		 * the child inherited the listen-specific sk_data_ready();
>> +		 * restore it here, as sk_user_data may be reused before accept
>> +		 */
>> +		child->sk_data_ready = smc->clcsk_data_ready;
>
>One concern:
>
>smc_clcsock_user_data_rcu() together with refcount_inc_not_zero() only
>pins the smc_sock; it does not guarantee anything about the lifetime or
>consistency of smc->clcsk_data_ready. In the listen-close path,
>smc_clcsock_restore_cb() clears that field under sk_callback_lock,
>while smc_tcp_syn_recv_sock() reads it without any lock. These are
>independent protection domains. If close wins the race,
>child->sk_data_ready can end up NULL and the next data arrival will
>crash.
>

Will drop the syn_recv restore in v2. Thanks for your review.

>Also, I don't object to this fix, but I'd rather see the underlying cause
>addressed directly. The real issue seems to be the conflict between
>SMC's sk_user_data and sk_psock. Maybe there is a cleaner solution, e.g.
>always setting user_data.
>

Agreed. 

Thanks, will send v2.

Best,
Sechang


* [PATCH net v2] net/smc: fix out-of-bounds read when sk_user_data holds a sk_psock
From: Sechang Lim @ 2026-06-19 15:03 UTC (permalink / raw)
  To: D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Ursula Braun,
	Karsten Graul, Guvenc Gulce, linux-rdma, linux-s390, netdev,
	linux-kernel, bpf

SMC stores its smc_sock in the clcsock's sk_user_data tagged
SK_USER_DATA_NOCOPY and reads it back with smc_clcsock_user_data(), which
only strips that flag. sockmap stores a sk_psock in the same field tagged
SK_USER_DATA_NOCOPY | SK_USER_DATA_PSOCK. Nothing keeps both off one
socket, and SMC then casts the sk_psock to an smc_sock.

A passive-open child hits this. It inherits the listener's
smc_clcsock_data_ready(), but sk_clone_lock() clears its NOCOPY
sk_user_data, and a BPF sock_ops program then adds the child to a sockmap,
installing a sk_psock in that field. The inherited callback reads it as an
smc_sock and dereferences a clcsk_* pointer past the end of the sk_psock:

  BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
  Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
   <IRQ>
   smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
   tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
   tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
   tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
   tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
   [...]
   </IRQ>

  Allocated by task 67930:
   sk_psock_init+0x142/0x740 net/core/skmsg.c:766
   sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
   bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
   __cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
   tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
   [...]

sk_psock() already guards the other side, returning NULL unless
SK_USER_DATA_PSOCK is set. Make smc_clcsock_user_data() and its RCU
variant return the smc_sock only when sk_user_data carries SMC's tag
alone. A sk_psock then reads back as NULL, which the data_ready and
fallback callbacks already handle.

Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/smc/smc.h | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 52145df83f6e..88dfb459b7cc 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -342,13 +342,25 @@ static inline void smc_init_saved_callbacks(struct smc_sock *smc)
 
 static inline struct smc_sock *smc_clcsock_user_data(const struct sock *clcsk)
 {
-	return (struct smc_sock *)
-	       ((uintptr_t)clcsk->sk_user_data & ~SK_USER_DATA_NOCOPY);
+	uintptr_t data = (uintptr_t)clcsk->sk_user_data;
+
+	/*
+	 * Return the smc_sock only if the slot carries SMC's tag alone.
+	 * sockmap stores a sk_psock here tagged SK_USER_DATA_PSOCK; it is
+	 * not an smc_sock and must not be dereferenced as one.
+	 */
+	if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+		return NULL;
+	return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
 }
 
 static inline struct smc_sock *smc_clcsock_user_data_rcu(const struct sock *clcsk)
 {
-	return (struct smc_sock *)rcu_dereference_sk_user_data(clcsk);
+	uintptr_t data = (uintptr_t)rcu_dereference(__sk_user_data(clcsk));
+
+	if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+		return NULL;
+	return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
 }
 
 /* save target_cb in saved_cb, and replace target_cb with new_cb */
-- 
2.43.0
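
The tag check can be tried outside the kernel. The sketch below is a
user-space model, not the patched header: decode_smc_user_data() is a
hypothetical name, and the flag values are assumptions that mirror how I
understand the SK_USER_DATA_* definitions (three low tag bits, with the
pointer in the remaining bits).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed flag layout for this sketch: three tag bits in the low bits. */
#define SK_USER_DATA_NOCOPY	((uintptr_t)1)
#define SK_USER_DATA_BPF	((uintptr_t)2)
#define SK_USER_DATA_PSOCK	((uintptr_t)4)
#define SK_USER_DATA_PTRMASK \
	(~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF | SK_USER_DATA_PSOCK))

/*
 * Return the pointer only when the tag bits are exactly NOCOPY. Any other
 * combination (untagged, or carrying the PSOCK or BPF bit) yields NULL.
 */
static inline void *decode_smc_user_data(uintptr_t data)
{
	if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
		return NULL;
	return (void *)(data & SK_USER_DATA_PTRMASK);
}
```

A value tagged NOCOPY | PSOCK, which is what sockmap stores according to the
commit message, decodes to NULL here instead of being treated as an
smc_sock.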



* [PATCH v3 1/2] net/sched: dualpi2: fix GSO backlog accounting
From: Xingquan Liu @ 2026-06-19 15:13 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, Xingquan Liu,
	stable

When DualPI2 splits a GSO skb into N segments, it propagates N
additional packets to its parent before returning NET_XMIT_SUCCESS.
The parent then accounts for the original skb once more, leaving its
qlen one larger than the number of packets actually queued.

With QFQ as the parent, after all real packets are dequeued, QFQ still
has a non-zero qlen while its in-service aggregate has no active
classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
the result to qfq_peek_skb(), causing a NULL pointer dereference.

Follow the same pattern used by tbf_segment() and taprio: count only
successfully queued segments, propagate the difference between the
original skb and those segments, and return NET_XMIT_SUCCESS whenever
at least one segment was queued.

Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
Cc: stable@vger.kernel.org
Signed-off-by: Xingquan Liu <b1n@b1n.io>
---
v3:
- Move the UDP GSO sender into tdc_gso.py.

v2:
- Change patch commit message.
- Add tdc test.

 net/sched/sch_dualpi2.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index d7c3254ef800..5434df6ca8ef 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -461,7 +461,7 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		if (IS_ERR_OR_NULL(nskb))
 			return qdisc_drop(skb, sch, to_free);
 
-		cnt = 1;
+		cnt = 0;
 		byte_len = 0;
 		orig_len = qdisc_pkt_len(skb);
 		skb_list_walk_safe(nskb, nskb, next) {
@@ -488,16 +488,15 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 				byte_len += nskb->len;
 			}
 		}
-		if (cnt > 1) {
+		if (cnt > 0) {
 			/* The caller will add the original skb stats to its
 			 * backlog, compensate this if any nskb is enqueued.
 			 */
-			--cnt;
-			byte_len -= orig_len;
+			qdisc_tree_reduce_backlog(sch, 1 - cnt,
+						  orig_len - byte_len);
 		}
-		qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
 		consume_skb(skb);
-		return err;
+		return cnt > 0 ? NET_XMIT_SUCCESS : err;
 	}
 	return dualpi2_enqueue_skb(skb, sch, to_free);
 }

base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
-- 
Xingquan Liu
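
The accounting arithmetic can be checked with a small user-space model. This
is a sketch based on my reading of the diff, not the qdisc code:
model_parent_qlen() is a hypothetical name, it tracks only the packet count
(not bytes), and it assumes at least one segment was queued and the child
returned NET_XMIT_SUCCESS.

```c
/*
 * Parent qlen after one GSO skb is split and @queued segments (>= 1) are
 * enqueued in the child. @fixed selects the accounting from this patch,
 * otherwise the previous accounting is modelled.
 */
static int model_parent_qlen(int queued, int fixed)
{
	int parent_qlen = 0;
	int cnt;

	if (fixed) {
		cnt = queued;			/* cnt starts at 0, +1 per queued segment */
		parent_qlen -= 1 - cnt;		/* reduce_backlog(sch, 1 - cnt, ...) */
	} else {
		cnt = 1 + queued;		/* cnt starts at 1, +1 per queued segment */
		if (cnt > 1)
			--cnt;
		parent_qlen -= -cnt;		/* reduce_backlog(sch, -cnt, ...) */
	}

	parent_qlen += 1;	/* the parent counts the original skb on success */
	return parent_qlen;
}
```

For two queued segments the old accounting leaves the parent at 3 while the
fixed one leaves it at 2, which matches the "one larger than the number of
packets actually queued" description above.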



* [PATCH v3 2/2] selftests/tc-testing: Add DualPI2 GSO backlog accounting test
From: Xingquan Liu @ 2026-06-19 15:13 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, Xingquan Liu,
	stable
In-Reply-To: <20260619151447.223640-1-b1n@b1n.io>

Add a regression test for DualPI2 GSO backlog accounting when it is
used as a child qdisc of QFQ.

The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
the leaf qdisc. DualPI2 splits the skb into two segments. After the
traffic drains, both QFQ and DualPI2 must report zero backlog and zero
qlen.

On kernels with the broken accounting, QFQ can keep a stale non-zero
qlen after all real packets have been dequeued.

Signed-off-by: Xingquan Liu <b1n@b1n.io>
---
 .../tc-testing/tc-tests/qdiscs/dualpi2.json   | 44 +++++++++++++++++++
 tools/testing/selftests/tc-testing/tdc_gso.py | 43 ++++++++++++++++++
 2 files changed, 87 insertions(+)
 create mode 100755 tools/testing/selftests/tc-testing/tdc_gso.py

diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
index cd1f2ee8f354..ed6a900bb568 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
@@ -250,5 +250,49 @@
         "teardown": [
             "$TC qdisc del dev $DUMMY handle 1: root"
         ]
+    },
+    {
+        "id": "891f",
+        "name": "Verify DualPI2 GSO backlog accounting with QFQ parent",
+        "category": [
+            "qdisc",
+            "dualpi2",
+            "qfq",
+            "gso"
+        ],
+        "plugins": {
+            "requires": "nsPlugin"
+        },
+        "setup": [
+            "$IP link set dev $DUMMY up || true",
+            "$IP addr add 10.10.10.10/24 dev $DUMMY || true",
+            "$TC qdisc add dev $DUMMY root handle 1: qfq",
+            "$TC class add dev $DUMMY parent 1: classid 1:1 qfq weight 1 maxpkt 4096",
+            "$TC qdisc add dev $DUMMY parent 1:1 handle 2: dualpi2",
+            "$TC filter add dev $DUMMY parent 1: matchall classid 1:1"
+        ],
+        "cmdUnderTest": "./tdc_gso.py 10.10.10.10 10.10.10.1 9000 1200 2400",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY",
+        "matchJSON": [
+            {
+                "kind": "qfq",
+                "handle": "1:",
+                "packets": 2,
+                "backlog": 0,
+                "qlen": 0
+            },
+            {
+                "kind": "dualpi2",
+                "handle": "2:",
+                "packets": 2,
+                "backlog": 0,
+                "qlen": 0
+            }
+        ],
+        "teardown": [
+            "$TC qdisc del dev $DUMMY root",
+            "$IP addr del 10.10.10.10/24 dev $DUMMY || true"
+        ]
     }
 ]
diff --git a/tools/testing/selftests/tc-testing/tdc_gso.py b/tools/testing/selftests/tc-testing/tdc_gso.py
new file mode 100755
index 000000000000..b66528ea4b68
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tdc_gso.py
@@ -0,0 +1,43 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+tdc_gso.py - send a UDP GSO datagram
+
+Copyright (C) 2026 Xingquan Liu <b1n@b1n.io>
+"""
+
+import argparse
+import socket
+import struct
+import sys
+
+UDP_MAX_SEGMENTS = 1 << 7
+
+
+parser = argparse.ArgumentParser(description="UDP GSO datagram sender")
+parser.add_argument("src", help="source IPv4 address")
+parser.add_argument("dst", help="destination IPv4 address")
+parser.add_argument("port", type=int, help="destination UDP port")
+parser.add_argument("gso_size", type=int, help="UDP GSO segment payload size")
+parser.add_argument("payload_len", type=int, help="total UDP payload length")
+args = parser.parse_args()
+
+if args.gso_size <= 0 or args.gso_size > 0xFFFF:
+    parser.error("gso_size must fit in an unsigned 16-bit integer")
+if args.payload_len <= args.gso_size:
+    parser.error("payload_len must be larger than gso_size")
+if args.payload_len > args.gso_size * UDP_MAX_SEGMENTS:
+    parser.error("payload_len exceeds UDP_MAX_SEGMENTS")
+
+SOL_UDP = getattr(socket, "SOL_UDP", socket.IPPROTO_UDP)
+UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)
+
+sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+sock.bind((args.src, 0))
+
+payload = b"b" * args.payload_len
+cmsg = [(SOL_UDP, UDP_SEGMENT, struct.pack("=H", args.gso_size))]
+
+sent = sock.sendmsg([payload], cmsg, 0, (args.dst, args.port))
+sys.exit(sent != len(payload))
-- 
Xingquan Liu


^ permalink raw reply related

* [PATCH net] net: au1000: move free_irq out of the close-time spinlocked section
From: Runyu Xiao @ 2026-06-19 15:18 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Runyu Xiao, stable

au1000_close() calls free_irq() while aup->lock is still held with
spin_lock_irqsave(). free_irq() can sleep because it takes the IRQ
descriptor request mutex, so it does not belong inside the close-time
spinlocked section.

This was found by our static analysis tool and then confirmed by manual
review of the in-tree au1000_close() .ndo_stop path. The reviewed path
keeps aup->lock held across the MAC reset, queue stop and
free_irq(dev->irq, dev).

A directed runtime validation kept that ndo_stop carrier and the same
free_irq(dev->irq, dev) operation under the driver lock. Lockdep reported
"BUG: sleeping function called from invalid context" and "Invalid wait
context" while free_irq() was taking desc->request_mutex, with
au1000_close() and free_irq() on the stack.

Drop aup->lock before freeing the IRQ. The protected close-time work still
stops the device and queue before IRQ teardown, but the sleepable IRQ core
path now runs outside the spinlocked section.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
 drivers/net/ethernet/amd/au1000_eth.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/au1000_eth.c b/drivers/net/ethernet/amd/au1000_eth.c
index 9d35ac348ebe..5a04056e38fa 100644
--- a/drivers/net/ethernet/amd/au1000_eth.c
+++ b/drivers/net/ethernet/amd/au1000_eth.c
@@ -943,9 +943,10 @@ static int au1000_close(struct net_device *dev)
 	/* stop the device */
 	netif_stop_queue(dev);
 
+	spin_unlock_irqrestore(&aup->lock, flags);
+
 	/* disable the interrupt */
 	free_irq(dev->irq, dev);
-	spin_unlock_irqrestore(&aup->lock, flags);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related

* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: David Woodhouse @ 2026-06-19 15:34 UTC (permalink / raw)
  To: Thomas Gleixner, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <87h5myd56x.ffs@fw13>

[-- Attachment #1: Type: text/plain, Size: 3150 bytes --]

On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
> On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> > @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
> >  
> >  		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> >  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> > +
> > +		/*
> > +		 * For the NTP-disciplined mono-based clocks, report how far
> > +		 * @systime is from the ideal NTP time at @now, in signed ns,
> > +		 * so a caller can land on the ideal line by adding it. Four
> > +		 * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> > +		 *
> > +		 *  - tk->ntp_error, the deviation as of the last update;
> > +		 *  - (cycle_delta * ntp_err_frac), the fractional-mult drift
> > +		 *    accrued since then (cycle_delta is at most a tick on a
> > +		 *    tickful kernel, but many ticks' worth under NO_HZ);
> > +		 *  - (cycle_delta * ntp_err_mult), subtracting the applied +1
> > +		 *    mult dither over the same span;
> > +		 *  - the sub-ns fraction @systime dropped when the read was
> > +		 *    truncated to whole ns (low @shift bits, exact despite the
> > +		 *    multiply overflowing).
> > +		 *
> > +		 * RAW is undisciplined and AUX has its own discipline, so they
> > +		 * carry no ntp_error.
> 
> AUX has ntp_error too. AUX clocks have a per clock NTP instance, which
> work exactly like the main timerkeeper's one. Only CLOCK_MONOTONIC_RAW
> needs to be excluded.

Ack.

> > +		 */
> > +		if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> > +		    clock_id == CLOCK_BOOTTIME) {
> > +			u32 nes = tk->ntp_error_shift;
> > +			u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> > +					  tk->tkr_mono.mask;
> > +			s64 err = tk->ntp_error +
> > +				(((s64)mul_u64_u64_shr(cycle_delta,
> > +						       tk->ntp_err_frac, 32) -
> > +				  (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> > +
> > +			err += (s64)((cycle_delta * tk->tkr_mono.mult +
> > +				      tk->tkr_mono.xtime_nsec) &
> > +				     ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> > +			systime_snapshot->ntp_error =
> > +				(err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> > +				NTP_SCALE_SHIFT;
> 
> This formatting makes my brain hurt. Can you please split that out into
> a separate function?

Yep. There's also a potential error there — an *additional* discrepancy
comes from the enforced monotonicity that timekeeping_cycles_to_ns()
applies (the case where it just returns tkr->xtime_nsec >> tkr->shift).

I couldn't work out if I cared about the clocksource-is-non-monotonic
case, and even if I did, what I should do about it.

I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
or something like that, such that e.g. PTP clients could *ask* for it.

It's all very well hard-coding it in pps_get_ts() and unconditionally
changing the behaviour... I *think* we could justify that. But the
example I actually used in the patch was PTP, and that's slightly
harder to justify the behavioural change.
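
As a side note on the quoted hunk: its last step converts the accumulated error from ns << NTP_SCALE_SHIFT back to whole nanoseconds with round-to-nearest. A small Python sketch of just that step, assuming NTP_SCALE_SHIFT is 32 as in include/linux/timex.h. The helper name is hypothetical and this is not the proposed kernel function.

```python
NTP_SCALE_SHIFT = 32  # assumed value, as defined in include/linux/timex.h


def scaled_err_to_ns(err_scaled):
    """Round a (ns << NTP_SCALE_SHIFT) error to the nearest whole ns."""
    half = 1 << (NTP_SCALE_SHIFT - 1)
    # Python's >> on negative ints floors, like an arithmetic right shift.
    return (err_scaled + half) >> NTP_SCALE_SHIFT


# 1.5 ns expressed in the scaled representation.
one_and_a_half_ns = 3 << (NTP_SCALE_SHIFT - 1)
rounded = scaled_err_to_ns(one_and_a_half_ns)
```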

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Wireguard head of line blocking when CPUs saturate
From: Toke Høiland-Jørgensen @ 2026-06-19 15:56 UTC (permalink / raw)
  To: wireguard; +Cc: netdev

Hey everyone

I'm running Wireguard on my main gateway, which is a not-super-high
powered ARM box with eight cores (based on the NXP LS1088A SoC). The box
does, however, also have eight hardware queues for its networking, which
means regular network traffic can be spread nicely across the cores.

However, the per-core performance is limited, making it pretty trivial
to saturate a single core by just running a fat TCP flow through it. And
when this happens, Wireguard traffic just... stalls. I.e., no traffic
gets through the Wireguard interface until the (unrelated) flow
saturating one of the cores subsides.

I suspect what happens is that Wireguard spreads out traffic to all
cores for encryption, but has to wait for the respective CPUs to finish
encrypting the packets in order before they can actually be transmitted.
And because one CPU is now suddenly saturated in softirq context, the
Wireguard work queue never gets a chance to run on that CPU, stalling TX
progress for the Wireguard device entirely.
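
The suspected mechanism can be illustrated with a toy model. This is only a sketch of the hypothesis above, not of Wireguard's actual queueing code: it assumes packets are handed round-robin to CPUs and may only be transmitted in their original order. The function name and the timings are invented.

```python
def in_order_release_times(completion_times):
    """Return when each packet can be sent if packets must leave in order.

    completion_times[i] is when the CPU handling packet i finishes
    encrypting it; a packet cannot leave before all earlier ones have.
    """
    release = []
    latest = 0.0
    for done in completion_times:
        latest = max(latest, done)
        release.append(latest)
    return release


# Eight packets spread over eight CPUs; the CPU handling packet 3 is
# saturated by unrelated softirq work and finishes much later.
completion = [1.0] * 8
completion[3] = 100.0
release = in_order_release_times(completion)
```

In this model packets 0 to 2 leave at t=1, while everything from packet 3 onwards waits until t=100, even though the other seven CPUs finished long before.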

I'm sending this message to (a) see if anyone else is seeing the same
kind of stalling, and (b) to get input on whether the explanation
outlined above seems plausible. And, in the case of affirmative answers
to both (a) and (b), to hopefully start a discussion on what to do about
this :)

-Toke

^ permalink raw reply

* RE: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500 MAC-PHY
From: Selvamani Rajagopal @ 2026-06-19 16:05 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Richard Cochran, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Simon Horman, Jonathan Corbet,
	Shuah Khan, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	devicetree@vger.kernel.org, linux-doc@vger.kernel.org, Jerry Ray
In-Reply-To: <ajVKfBKPuNk9zN7b@monoceros>


Thanks for your feedback. Will take care of all three comments.

> -----Original Message-----
> Subject: Re: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500
> MAC-PHY
> 
> On Sun, Jun 14, 2026 at 10:00:28AM -0700, Selvamani Rajagopal via B4 Relay wrote:
> > +static const struct of_device_id s2500_of_match[] = {
> > +	{ .compatible = "onnn,s2500" },
> > +	{}
> 
> s/{}/{ }/
> 
> > +};
> > +
> > +static const struct spi_device_id s2500_ids[] = {
> > +	{ "s2500" },
> > +	{}
> > +};
> 
> Please make this:
> 
> static const struct spi_device_id s2500_ids[] = {
> 	{ .name = "s2500" },
> 	{ }
> };
> 
> > +MODULE_DEVICE_TABLE(spi, s2500_ids);
> > +
> > +static struct spi_driver s2500_driver = {
> > +	.driver = {
> > +		.name	= DRV_NAME,
> > +		.of_match_table = s2500_of_match,
> > +	},
> > +	.probe		= s2500_probe,
> > +	.remove		= s2500_remove,
> > +	.id_table	= s2500_ids,
> 
> Tastes are different, but the idea to align = is usually screwed by
> follow up patches. Here it's broken from the start. If you ask me: Use a
> single space before each =.
> 
> > +};
> > +
> > +module_spi_driver(s2500_driver);
> 
> Usually there is no empty line between the driver struct and the macro
> registering it.
> 
>
> Best regards
> Uwe


^ permalink raw reply

* RE: Ethtool : PRBS feature
From: Das, Shubham @ 2026-06-19 16:26 UTC (permalink / raw)
  To: Alexander H Duyck, Andrew Lunn, lee@trager.us
  Cc: netdev@vger.kernel.org, mkubecek@suse.cz, D H, Siddaraju,
	Chintalapalle, Balaji, Lindberg, Magnus,
	niklas.damberg@ericsson.com
In-Reply-To: <06d8c98da24e80d148ede4e933bb621c5515a7a2.camel@gmail.com>

> Also do you know what layer in the PHY you are injecting this PRBS at?
> I would be curious if this is PCS or at the PMD level?

In our case PRBS functionality is implemented in the PHY firmware at the PCS (TX/RX) + PMA (FEC Error Injection) layer.


Andrew,  Alexander, Lee,

The host driver does not directly access any registers but requests the PHY FW to manage PRBS on behalf of it.
Because of this, the implementation does not naturally fit the traditional PHYLIB model, where Linux PHY drivers directly manage PHY registers. 
The functionality is closer to a firmware-managed service exposed through the PCIe driver, so we thought the right place would be to extend ethtool.

We come from the Ethernet PHY field and are attempting to generalize PRBS for generic PHYs to accommodate all bus types, which might distract us, I believe.
The existing ethtool user application interface will give a quick start for Ethernet PHY PRBS management. 
When we need other buses or when we have another model implementation, then we can abstract the commonalities into a framework.

Should we proceed with implementing "ethtool --phy-test"?


> -----Original Message-----
> From: Alexander H Duyck <alexander.duyck@gmail.com>
> Sent: 16 June 2026 21:45
> To: Das, Shubham <shubham.das@intel.com>; Andrew Lunn <andrew@lunn.ch>
> Cc: netdev@vger.kernel.org; mkubecek@suse.cz; D H, Siddaraju
> <siddaraju.dh@intel.com>; Chintalapalle, Balaji <balaji.chintalapalle@intel.com>
> Subject: Re: Ethtool : PRBS feature
> 
> On Tue, 2026-06-16 at 12:14 +0000, Das, Shubham wrote:
> > Hi Andrew,
> >
> > Thanks for the feedback.
> >
> > Yes, for multi-lane ports we can accept the lane number as an argument like:
> >
> > ethtool --phy-test eth1 lane 0 tx-prbs prbs7
> > ethtool --phy-test eth2 lane 0 rx-prbs prbs7
> >
> > We referred to "Lee Trager's" "Open-Source Tooling for PHY Management and
> > Testing" session:
> > https://netdevconf.info/0x19/sessions/talk/open-source-tooling-for-phy-management-and-testing.html?.
> > We have been trying to reach "Lee Trager" to seek more input, latest update on
> > the approach and understand if there is a parallel effort in active so we can
> > collaborate.
> > If you can, please help me connect with "Lee Trager" and others who expressed
> > interest in Ethernet PRBS. We are happy to align and start implementation.
> >
> 
> You aren't going to have much luck if you are trying to reach out via his Meta
> address as he has moved onto Nvidia so he is no longer working on the fbnic
> driver.
> 
> As far as the work done most of it was internal and making use of debugfs. I don't
> believe any of the work for fbnic began to approach the suggested methods for
> upstreamming the feature as Lee had been pulled into other efforts.
> 
> > About standardizing across other bus like PCIe and USB, I had a quick discussion
> > with our internal designers, but I didn't observe any such SW-level config knobs
> > interest.
> > Looks like Ethernet has clear interest and we are joining that Ethernet PRBS
> > community too.
> 
> I think it largely depends on what your implementation looks like. The point being
> made was that many of the SerDes PHYs out there are capable of use in multiple
> applications. So instead of being a networking device you would be looking at a
> SerDes PHY such as those in "/drivers/phy/".
> 
> Also do you know what layer in the PHY you are injecting this PRBS at?
> I would be curious if this is PCS or at the PMD level?
> 
> If you are referring to the PCS level then yes, it would make sense to have it in the
> networking subsystem as the PCS at this point is more a netdev specific set of
> drivers, see "/drivers/net/pcs/".
> 
> In the case of the PMD that is where things get a bit more interesting.
> There is an IEEE c45 register definition that includes PRBS testing registers,
> however in the case of our implementation the PMD doesn't follow that
> specification and follows more the "/drivers/phy/" model.
> 
> > Ethernet PRBS configuration and diagnostics support is well established and
> > already widely used in existing Ethernet SERDES deployments.
> > We think Ethernet is the most natural starting point within netdev, as
> > it aligns with current driver practice and existing validation workflows.
> 
> The problem is many of these parts used as an Ethernet Serdes PMD are really a
> multiuse part. So for example in the case of the hardware in FBNIC we use the
> same part on the Ethernet PHY as we do for the PCIe
> Gen5 PHY.
> 
> The complication in our case is that both are buried behind our FW due to the fact
> that both are shared between slices. However for testing purposes and such we
> could look at disabling the odd slices to essentially unshare the hardware if you
> need another platform to test something like this with.

^ permalink raw reply

* Re: [PATCH] net: add sock_open() for unified socket creation
From: Al Viro @ 2026-06-19 16:34 UTC (permalink / raw)
  To: Alex Goltsev; +Cc: davem, netdev, linux-kernel
In-Reply-To: <CAEKmD4JfM5GWSiRMUn6NK+kKFeyXA8i3A9gthDz3hVKFcR1YDA@mail.gmail.com>

On Fri, Jun 19, 2026 at 01:35:56PM +0300, Alex Goltsev wrote:
> > What's the point (and why not make it inline, while we are at it)?
> 
> > Are there really callers that would pass a non-constant value as the last argument,
> > and if so, what are they doing next?
> 
> 
> As for `inline`: in this case, it would have no practical significance.
> 
> The compiler already treats a simple inline function as a regular
> 
> symbol within the `EXPORT_SYMBOL` context, whereas a static inline
> function (the standard
> 
> kernel template for helper functions) would completely break the
> export to the LKM.

How so?  All three underlying primitives are exported, so static inline
in whatever include/*/*.h you put it in would work just fine.

> As for the last argument, yes, today it is usually a constant,
> 
> but that’s not the point. The purpose of the enumeration is to provide
> 
> a unified, explicit control interface. It’s important that if, in the future,
> 
> someone adds a new type of socket creation, existing calling programs won’t
> 
> panic or throw a compilation error, but will smoothly fall back to
> 
> the default case and return -EINVAL, which is a safe failure mode.

Collapsing several functions together is worthless unless the combination
can be _used_ other than a (questionable) syntax sugar.  kmalloc() can;
something that would only result in trading multiple identifiers for
functions for multiple identifiers for "which function to call" is not
an improvement.

^ permalink raw reply

* Re: [PATCH net 0/6] ipv6: fix sysctl error handling and missing notifications
From: Fernando Fernandez Mancera @ 2026-06-19 16:42 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

On 6/18/26 6:22 PM, Fernando Fernandez Mancera wrote:
> While working on a different IPv6 patch series I have spotted multiple
> minor bugs around sysctl error handling and notifications. In general,
> they are not serious issues.
> 
> In addition, there is one more issue in forwarding sysctl as it does not
> check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
> of this series and I am aiming it at the net-next tree once it re-opens.
> 
> Fernando Fernandez Mancera (6):
>    ipv6: fix error handling in disable_ipv6 sysctl
>    ipv6: fix error handling in ignore_routes_with_linkdown sysctl
>    ipv6: fix error handling in forwarding sysctl
>    ipv6: fix error handling in disable_policy sysctl
>    ipv6: reset value and position for proxy_ndp sysctl restart
>    ipv6: fix missing notification for ignore_routes_with_linkdown
> 
>   net/ipv6/addrconf.c | 35 +++++++++++++++++++++++++++--------
>   1 file changed, 27 insertions(+), 8 deletions(-)
> 

So I have checked the feedback from sashiko.dev [1]. In general, it is 
real but I have doubts about this one. While it is real, I am not so 
sure we actually care about this scenario. In any case, this is a 
pattern around kernel code and it would require fixes everywhere 
(IPv4/IPv6/other places (?)).

The easier thing to do is what Nicolas suggested about taking the lock
before parsing the value, i.e. before calling proc_dointvec().

I also do wonder if I should fix them in individual commits or together
in the same commit, as they have different Fixes tags. Or maybe we can aim
this last part at net-next. Thoughts?

Please, see the feedback attached below.

Thanks!

[1] https://sashiko.dev/#/patchset/20260618162225.4588-1-fmancera%40suse.de

 > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
 > index 8ff015975e279..1cfb223476bd6 100644
 > --- a/net/ipv6/addrconf.c
 > +++ b/net/ipv6/addrconf.c
[ ... ]
 > @@ -6493,8 +6494,12 @@ static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 >  	if (write && old != new) {
 >  		struct net *net = ctl->extra2;
 >
 > -		if (!rtnl_net_trylock(net))
 > +		if (!rtnl_net_trylock(net)) {
 > +			/* Restore the original values before restarting */
 > +			*valp = old;
 > +			*ppos = pos;
 >  			return restart_syscall();
 > +		}
 >

Does this rollback introduce a race condition that can silently discard
concurrent writes?
The sysctl core does not serialize concurrent writes to the same sysctl
parameter, and proc_dointvec() modifies the global sysctl memory pointed to
by ctl->data directly.
If Thread A and Thread B concurrently write to proxy_ndp, and Thread A fails
to acquire the RTNL lock, setting *valp = old will unconditionally reset the
global memory back to what Thread A observed before its write. This seems
like it would clobber Thread B's concurrent update.
If Thread B successfully acquires the RTNL lock, it will evaluate its local
old != new stack variables as true and emit a RTM_NEWNETCONF netlink
notification. However, the notification will read the corrupted *valp memory,
leaving the emitted event and internal sysctl state out of sync.
Could this be avoided by duplicating the ctl_table, pointing the data field
to a local stack variable, calling proc_dointvec() on the local variable,
and only updating the global *valp after successfully acquiring the lock,
similar to how it is handled in addrconf_sysctl_forward()?
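
The interleaving described in that feedback can be replayed step by step. The sketch below is a deliberately sequential Python model, not kernel code: a dict stands in for the shared sysctl storage, and the exact ordering of reads and writes is an assumption chosen to show the lost update.

```python
# Shared sysctl storage that proc_dointvec() would write to directly.
state = {"proxy_ndp": 0}

# Both writers sample the old value before either write lands.
old_a = state["proxy_ndp"]
old_b = state["proxy_ndp"]

# Thread A parses its input: the shared value becomes 1.
state["proxy_ndp"] = 1
new_a = state["proxy_ndp"]

# Thread B parses its input: it also writes 1.
state["proxy_ndp"] = 1
new_b = state["proxy_ndp"]

# Thread A loses the trylock and restores what it saw before its write.
state["proxy_ndp"] = old_a

# Thread B wins the lock; old_b != new_b, so it sends a notification
# carrying whatever the shared storage holds at that moment.
notified = state["proxy_ndp"] if old_b != new_b else None
```

In this replay Thread B wrote 1 but the notification carries 0. Parsing into a local variable and committing only after the lock is held would avoid that window.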


^ permalink raw reply

* Re: [PATCH net-next v5 1/4] dpll: add DPLL_PIN_TYPE_INT_NCO pin type
From: Ivan Vecera @ 2026-06-19 17:07 UTC (permalink / raw)
  To: Kubalewski, Arkadiusz, Jiri Pirko, Vadim Fedorenko,
	Jakub Kicinski
  Cc: netdev@vger.kernel.org, Jiri Pirko, David S. Miller,
	Donald Hunter, Eric Dumazet, Schmidt, Michal, Paolo Abeni,
	Vaananen, Pasi, Oros, Petr, Prathosh Satish, Simon Horman,
	linux-kernel@vger.kernel.org
In-Reply-To: <CH3PR11MB8749910F17977B951A8B12CA9BE42@CH3PR11MB8749.namprd11.prod.outlook.com>

On 6/17/26 1:59 PM, Kubalewski, Arkadiusz wrote:
>> From: Ivan Vecera <ivecera@redhat.com>
>> Sent: Monday, June 15, 2026 2:00 PM
>>
>> On 6/11/26 2:09 PM, Jiri Pirko wrote:
>>> Wed, Jun 10, 2026 at 05:45:46PM +0200, ivecera@redhat.com wrote:
>>>> On 6/10/26 3:04 PM, Kubalewski, Arkadiusz wrote:
>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>> Sent: Tuesday, June 9, 2026 4:59 PM
>>>>>>
>>>>>> On 6/9/26 4:00 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>> Sent: Tuesday, June 9, 2026 10:51 AM
>>>>>>>>
>>>>>>>> Mon, Jun 08, 2026 at 07:03:46PM +0200,
>>>>>>>> arkadiusz.kubalewski@intel.com
>>>>>>>> wrote:
>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>> Sent: Monday, June 8, 2026 5:48 PM
>>>>>>>>>>
>>>>>>>>>> On 6/8/26 4:43 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>>>> Sent: Sunday, May 31, 2026 9:44 PM ...
>>>>>>>>>>>>            -
>>>>>>>>>>>>              name: gnss
>>>>>>>>>>>>              doc: GNSS recovered clock
>>>>>>>>>>>> +      -
>>>>>>>>>>>> +        name: int-nco
>>>>>>>>>>>> +        doc: |
>>>>>>>>>>>> +          Device internal numerically controlled oscillator.
>>>>>>>>>>>> +          When connected as a DPLL input, the DPLL enters NCO
>>>>>>>>>>>> mode
>>>>>>>>>>>> +          where the output frequency is adjusted by the host
>>>>>>>>>>>> via
>>>>>>>>>>>> +          the PTP clock interface.
>>>>>>>>>>>
>>>>>>>>>>> Hi Ivan!
>>>>>>>>>>>
>>>>>>>>>>> How would you control this in case of automatic mode dpll?
>>>>>>>>>>> Automatic mode DPLL shall be controlled on HW level, such pin
>>>>>>>>>>> breaks that rule and requires some driver magic to show it is
>>>>>>>>>>> higher priority than the rest of the pins?
>>>>>>>>>>
>>>>>>>>>> The NCO pin can be connected only in manual mode. In other words
>>>>>>>>>> a
>>>>>>>>>> DPLL in automatic mode cannot select NCO pin (switch to NCO mode)
>>>>>>>>>> by
>>>>>>>>>> its own.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Being picky on DPLL_MODE for enabling feature is not something we
>>>>>>>>> can allow if it is not related to HW limitation, is it?
>>>>>>>>> Could you please elaborate why it is not possible for AUTOMATIC
>>>>>>>>> mode?
>>>>>>>>
>>>>>>>> In automatic mode, the pin selection logic is defined upon prio. I
>>>>>>>> can imagine that if NCO pin has the highest prio of the available
>>>>>>>> ones, it gets picked. I would be aligned 100% with automatic mode
>>>>>>>> behaviour.
>>>>>>>> Is there a real usecase for it?
>>>>>>>>
>>>>>>>> [..]
>>>>>>>
>>>>>>> This is not true. AUTOMATIC mode is HW solution, SW driver ONLY
>>>>>>> configures priorities on the inputs, not manages the active inputs.
>>>>>>> This breaks that behavior, the SW driver would have to manually
>>>>>>> override the AUTOMATIC mode to be fed from such NCO pin as it doesn't
>>>>>>> exist on its priority list, HW cannot pick or use it.
>>>>>>
>>>>>> Correct, AUTO mode is hardware feature and it should not be emulated
>>>>>> by a
>>>>>> driver. If the hardware does not support it then the switching
>>>>>> between
>>>>>> input references should be done by userspace (by monitoring ffo,
>>>>>> phase_offset, operstate).
>>>>>>
>>>>>
>>>>> Yes, exactly, so for AUTOMATIC mode HW it will not be possible to
>>>>> create
>>>>> such pin, which means that NCO pin would serve only a MANUAL mode
>>>>> implementation.
>>>>> Basically this is something we shall not allow to happen. DPLL API
>>>>> should be designed to cover the case where AUTO mode is able to
>>>>> implement
>>>>> all features consistently.
>>>>
>>>> If you don't like the proposal from Jiri (NCO switch driven by NCO pin
>>>> priority -> highest==enter_nco else leave_nco) then it could be
>>>> possible
>>>> to handle the switching by allowing the state 'connected' in AUTO mode
>>>> for the NCO pin type. Then the implementation will be the same for both
>>>> selection modes.
>>>>
>>>> Only difference would be that a user does not need to switch the device
>>>> from the AUTO to MANUAL mode.
>>>>
>>>>>>> The real use case is that any DPLL can switch the mode to this one
>>>>>>> instead of implementing MANUAL mode just to use the feature with a
>>>>>>> 'virtual' pin.
>>>>>>
>>>>>> I don't expect this... but it is up to a driver. I don't plan such
>>>>>> functionality in zl3073x as the NCO pin does not expose prio_get()
>>>>>> and
>>>>>> prio_set() callbacks - so it is clear that this pin cannot be part of
>>>>>> the
>>>>>> automatic selection.
>>>>>>
>>>>>> Ivan
>>>>>
>>>>> There is a difference between particular HW and API capabilities, with
>>>>> the
>>>>> proposed API we would disallow the possibility of such implementation
>>>>> for
>>>>> existing HW variants.
>>>>>
>>>>> DPLL NCO MODE would allow that but as pointed here by Ivan and by Jiri
>>>>> in
>>>>> the other email it would also require the extra implementation for
>>>>> some
>>>>> configuration - device level phase/ffo handling.
>>>>>
>>>>> To summarize it all, I don't have such simple solution for it.
>>>>>
>>>>> First thing that comes to my mind is to combine both approaches.
>>>>> Make it possible for AUTOMATIC mode to also set "CONNECTED" state
>>>>> on certain kind of "OVERRIDE" pins, where it could be determined by
>>>>> the type of PIN and embed that logic into the DPLL subsystem.
>>>>
>>>> The possible states for particular pins are now handled at a driver
>>>> level
>>>> so the driver decides if the requested state is correct or not. So it
>>>> could be easy to implement this.
>>>>
>>>> For auto mode allowed states:
>>>> - input references: selectable / disconnected
>>>> - nco pin: connected / disconnected
>>>>
>>>>> Basically, if driver registers such NCO pin it would be always
>>>>> selected
>>>>> manually, and in such case all the other pins are going to
>>>>> disconnected
>>>>> state while DPLL mode is also a "OVERRIDE" or something like it.
>>>>
>>>> I would leave this decision on the driver level... Imagine the
>>>> potential
>>>> HW that would allow to switch NCO mode if there is no valid input
>>>> reference.
>>>>
>>>> Example:
>>>>
>>>> REF0 (prio 0) -> +------+ -> OUT0
>>>> REF1 (prio 1) -> | DPLL | -> ...
>>>> NCO  (prio 2) -> +------+ -> OUTn
>>>>
>>>> Such HW would prefer REF0 or REF1 and lock to one of them if they are
>>>> qualified. But if they are NOT, then it switches to NCO mode.
> 
> Now you said yourself "NCO mode" ... I agree that it would be a mode in
> that case. Where instead of running on regular/built in XO dpll would run
> on NCO and user could select it, and this would be addition to regular
> behavior.
> 
> I also agree that the pin approach might be better/easier to use, assuming
> frequency offset for all the outputs given dpll drives, it makes more sense
> to have it configurable on input side.

+1

>>>>
>>>> In this situation the relevant driver would allow to configure priority
>>>> and state 'selectable' for this NCO pin.
>>>>
>>>>> Perhaps the pin type could include OVERRIDE in its name to make it
>>>>> less
>>>>> confusing and needs some extra documentation.
>>>>>
>>>>> Thoughts?
>>>> I think _INT_ is ok. In the case of TYPE_INT_OSCILLATOR it is also
>>>> obvious that it is not a standard input reference.
>>>>
>>>> Jiri, Vadim, Arek, thoughts?
>>>
>>> I agree with you, the driver should have the flexibility to implement
>>> this according to his/hw's needs/capabilities. If it implements prio
>>> selection in AUTO mode, let it have it. If it implements manual NCO pin
>>> selection in AUTO mode using connected/disconnected override, let it
>>> have it.
> 
> I don't know 'current' HW that is capable of using AUTO mode as a part of
> HW-based priority source selection and use such NCO input.
> But as already explained above, this is special mode of regular XO, which
> allows DPLL's output frequency offset configuration.

Let's keep this available for potential future HW. I can imagine a
situation where a user will prefer an automatic switch to NCO mode
if there is no qualified input reference - automatic switch means
that HW will support this (not emulated by the driver).

>>>
>>> Moreover, I actually like the "override" capability for pins in AUTO
>>> mode in general. It may be handy for other usecases as well.
>>>
>> Arek? Vadim?
>>
>> Thanks,
>> Ivan
> 
> Agree, 'override' capability of a pin would be the way to go for this and
> other similar further cases.
> 
> I believe a single approach on this would be best, I mean if AUTO mode
> needs a capability, to switch from regular behavior to 'OVERRIDE', and
> 'OVERRIDE' is only pin capability that allows such behavior for AUTO
> mode, then similar approach should be used on MANUAL mode, to make
> userspace know that such pin is always available to set "CONNECTED"
> and make the userspace implementation consistent on enabling it no matter
> if AUTO or MANUAL mode dpll.

Proposal:
1) new pin capability
    - name: state-connected-override
    - doc: pin state can be changed to connected in any DPLL mode

2) new NCO pin type to switch the DPLL to NCO mode when connected

3) automatic-only DPLL
    - should expose NCO pin with state-connected-override capability

4) manual-only DPLL
   - does not need to expose NCO pin with state-connected-override cap

5) dual-mode DPLL (supporting mode switching)
   - if it exposes NCO pin with the override cap then it has to support
     switching to NCO mode directly from AUTO mode
   - if it does not expose NCO pin with the override cap then a user
     MUST switch the DPLL mode from AUTO to MANUAL to be able to make
     NCO pin connected to the DPLL
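
Items 3 to 5 above could be modelled roughly as follows. This is only a
sketch to make the intended semantics concrete: the enum, the struct and
nco_can_connect() are hypothetical names, not part of the DPLL subsystem
or its netlink spec.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the proposal; none of these names exist in the kernel. */
enum dpll_mode_model { MODE_MANUAL, MODE_AUTO };

struct nco_pin_model {
	bool exposed;            /* driver exposes an NCO pin at all */
	bool connected_override; /* pin has the state-connected-override cap */
};

/*
 * Can userspace set the NCO pin to "connected" while the DPLL is in @mode?
 * MANUAL mode: any exposed pin may be connected (regular behaviour).
 * AUTO mode: only a pin carrying the override capability may be connected.
 */
static bool nco_can_connect(enum dpll_mode_model mode,
			    const struct nco_pin_model *pin)
{
	assert(pin);
	if (!pin->exposed)
		return false;
	if (mode == MODE_MANUAL)
		return true;
	return pin->connected_override;
}
```

In this model a dual-mode DPLL whose NCO pin lacks the capability gets
false in AUTO mode, which corresponds to the "switch to MANUAL first"
case in item 5.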

Vadim, Jiri, Arek - thoughts?

Thanks,
Ivan


^ permalink raw reply

* [PATCH net v3 0/2] Drop skb metadata before LWT encapsulation
From: Jakub Sitnicki @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Martin KaFai Lau
  Cc: netdev, bpf, kernel-team

See description for patch 1.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Changes in v3:
- Clear metadata for non-BPF LWT encaps as well (Sashiko)
- Add selftests for LWT encap + XDP metadata
- Link to v2: https://lore.kernel.org/r/20260514-bpf-lwt-drop-skb-metadata-v2-1-458664edc2b5@cloudflare.com

Changes in v2:
- Clear metadata in bpf_xmit to allow access from tc(x) egress (Daniel)
- Add WARNING snippet to the description
- Link to v1: https://lore.kernel.org/r/20260428-wip-skb-local-storage-from-scratch-v1-1-8f7ca9b378ce@cloudflare.com

---
Jakub Sitnicki (2):
      net: lwtunnel: Drop skb metadata before LWT encapsulation
      selftests/bpf: Add LWT encap tests for skb metadata

 net/core/lwtunnel.c                                |   6 +
 tools/testing/selftests/bpf/config                 |   3 +
 .../bpf/prog_tests/xdp_context_test_run.c          | 175 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_xdp_meta.c  | 123 +++++++++------
 4 files changed, 255 insertions(+), 52 deletions(-)


^ permalink raw reply

* [PATCH net v3 1/2] net: lwtunnel: Drop skb metadata before LWT encapsulation
From: Jakub Sitnicki @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Martin KaFai Lau
  Cc: netdev, bpf, kernel-team
In-Reply-To: <20260619-bpf-lwt-drop-skb-metadata-v3-0-71d6a33ab76b@cloudflare.com>

skb metadata is meant for passing information between XDP and TC. It lives
in the skb headroom, immediately before skb->data. LWT programs cannot
access the __sk_buff->data_meta pseudo-pointer to metadata.

However, LWT encapsulation prepends outer headers, moving skb->data back
over the headroom where the metadata sits. On an RX-originated (forwarded)
packet that still carries XDP metadata this goes wrong in two different
ways, depending on the encap type:

1. Non-BPF LWT encaps (mpls, seg6, ioam6 ...) call skb_push()/skb_pull()
   and silently overwrite the metadata that sits in the headroom.

2. BPF LWT xmit calls bpf_skb_change_head(), which uses skb_data_move().
   That helper expects metadata immediately before skb->data. But since
   the IP output path runs LWT xmit before neighbour output has built
   the outgoing L2 header, for forwarded packets skb->data points at the
   L3 header while skb_mac_header() still points at the old L2 header.
   skb_data_move() sees metadata ending at skb_mac_header(), not before
   skb->data, warns and clears metadata:

  WARNING: CPU: 21 PID: 454557 at include/linux/skbuff.h:4609 skb_data_move+0x47/0x90
  CPU: 21 UID: 0 PID: 454557 Comm: napi/iconduit-g Tainted: G           O        6.18.21 #1
  RIP: 0010:skb_data_move+0x47/0x90
  Call Trace:
   <IRQ>
   bpf_skb_change_head+0xe6/0x1a0
   bpf_prog_...+0x213/0x2e3
   run_lwt_bpf.isra.0+0x1d3/0x360
   bpf_xmit+0x46/0xe0
   lwtunnel_xmit+0xa1/0xf0
   ip_finish_output2+0x1e7/0x5e0
   ip_output+0x63/0x100
   __netif_receive_skb_one_core+0x85/0xa0
   process_backlog+0x9c/0x150
   __napi_poll+0x2b/0x190
   net_rx_action+0x40b/0x7f0
   handle_softirqs+0xd2/0x270
   do_softirq+0x3f/0x60
   </IRQ>
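
The overwrite in case 1 can be illustrated with a small userspace model.
This is a sketch only, not kernel code: toy_skb, toy_push(),
toy_metadata_clear() and toy_encap() are hypothetical stand-ins, and the
model assumes metadata occupies the bytes immediately before the data
offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUF_LEN 64

/* Hypothetical stand-in for an skb: one flat buffer plus a data offset. */
struct toy_skb {
	unsigned char buf[TOY_BUF_LEN];
	size_t data;     /* offset of the packet data, like skb->data */
	size_t meta_len; /* metadata sits in buf[data - meta_len, data) */
};

/* Like skb_push(): grow the packet towards the start of the buffer. */
static unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
	assert(len <= skb->data);
	skb->data -= len;
	return skb->buf + skb->data;
}

/* Like skb_metadata_clear(): forget the metadata before encapsulating. */
static void toy_metadata_clear(struct toy_skb *skb)
{
	skb->meta_len = 0;
}

/* Prepend a @hdr_len byte outer header filled with @fill. */
static void toy_encap(struct toy_skb *skb, size_t hdr_len, unsigned char fill)
{
	memset(toy_push(skb, hdr_len), fill, hdr_len);
}
```

With data at offset 32 and 4 bytes of metadata at [28, 32), an 8-byte
toy_encap() writes over those 4 bytes while meta_len still claims they
are valid. Calling toy_metadata_clear() first makes the stale length
go away, which is the idea behind dropping metadata before the encap op.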

That is what happens. As for how to fix it: a received packet that
carries metadata can reach an encap through any of the three LWT
redirect modes:

  LWTUNNEL_STATE_INPUT_REDIRECT
   ip6_rcv_finish
     dst_input
       lwtunnel_input

  LWTUNNEL_STATE_OUTPUT_REDIRECT
   ip6_rcv_finish
     dst_input
       ip6_forward
         ip6_forward_finish
           dst_output
             lwtunnel_output

  LWTUNNEL_STATE_XMIT_REDIRECT
   ip6_rcv_finish
     dst_input
       ip6_forward
         ip6_forward_finish
           dst_output
             ip6_output
               ip6_finish_output
                 ip6_finish_output2
                   lwtunnel_xmit

Every encap funnels through the three LWT dispatch helpers, so drop the
metadata there, right before handing the skb to the encap op. This
single chokepoint covers all encap types and all three redirect modes:

  - lwtunnel_input():  seg6, rpl, ila, seg6_local
  - lwtunnel_output(): ioam6
  - lwtunnel_xmit():   mpls, LWT BPF xmit

Alternatively, we could clear the metadata right after the TC ingress
hook. That would require a compromise, however. Metadata would become
inaccessible from TC egress (in setups where it actually reaches the
hook intact, that is, without any L2 tunnels on the path).

Fixes: 8989d328dfe7 ("net: Helper to move packet data and metadata after skb_push/pull")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/lwtunnel.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index f9d76d85d04f..b01a395d9a96 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -350,6 +350,8 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	rcu_read_lock();
 	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
 	if (likely(ops && ops->output)) {
+		/* Encap pushes outer headers over the metadata; drop it. */
+		skb_metadata_clear(skb);
 		dev_xmit_recursion_inc();
 		ret = ops->output(net, sk, skb);
 		dev_xmit_recursion_dec();
@@ -404,6 +406,8 @@ int lwtunnel_xmit(struct sk_buff *skb)
 	rcu_read_lock();
 	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
 	if (likely(ops && ops->xmit)) {
+		/* Encap pushes outer headers over the metadata; drop it. */
+		skb_metadata_clear(skb);
 		dev_xmit_recursion_inc();
 		ret = ops->xmit(skb);
 		dev_xmit_recursion_dec();
@@ -455,6 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
 	rcu_read_lock();
 	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
 	if (likely(ops && ops->input)) {
+		/* Encap pushes outer headers over the metadata; drop it. */
+		skb_metadata_clear(skb);
 		dev_xmit_recursion_inc();
 		ret = ops->input(skb);
 		dev_xmit_recursion_dec();

-- 
2.43.0


^ permalink raw reply related

* [PATCH net v3 2/2] selftests/bpf: Add LWT encap tests for skb metadata
From: Jakub Sitnicki @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Martin KaFai Lau
  Cc: netdev, bpf, kernel-team
In-Reply-To: <20260619-bpf-lwt-drop-skb-metadata-v3-0-71d6a33ab76b@cloudflare.com>

Test that an LWT encapsulation does not silently corrupt XDP metadata
sitting in the skb headroom. Exercise all three LWT dispatch paths using
four encap types:

- BPF LWT xmit prog reserves headroom on the LWT .xmit redirect,
- mpls pushes an MPLS label on the LWT .xmit redirect,
- seg6 in encap mode runs on the LWT .input redirect,
- ioam6 encap inserts an IOAM Hop-by-Hop option on the LWT .output redirect.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/config                 |   3 +
 .../bpf/prog_tests/xdp_context_test_run.c          | 175 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_xdp_meta.c  | 123 +++++++++------
 3 files changed, 249 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index bac60b444551..adb25146e88c 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -45,13 +45,16 @@ CONFIG_IPV6=y
 CONFIG_IPV6_FOU=y
 CONFIG_IPV6_FOU_TUNNEL=y
 CONFIG_IPV6_GRE=y
+CONFIG_IPV6_IOAM6_LWTUNNEL=y
 CONFIG_IPV6_SEG6_BPF=y
+CONFIG_IPV6_SEG6_LWTUNNEL=y
 CONFIG_IPV6_SIT=y
 CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LIVEPATCH=y
 CONFIG_LWTUNNEL=y
+CONFIG_LWTUNNEL_BPF=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
index 26159e0499c7..448807676176 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
 #include <network_helpers.h>
+#include <linux/ipv6.h>
+#include <arpa/inet.h>
 #include "test_xdp_context_test_run.skel.h"
 #include "test_xdp_meta.skel.h"
 
@@ -8,9 +10,12 @@
 #define TX_NAME "veth1"
 #define TX_NETNS "xdp_context_tx"
 #define RX_NETNS "xdp_context_rx"
+#define RX_MAC "02:00:00:00:00:01"
+#define TX_MAC "02:00:00:00:00:02"
 #define TAP_NAME "tap0"
 #define DUMMY_NAME "dum0"
 #define TAP_NETNS "xdp_context_tuntap"
+#define LWT_NETNS "xdp_context_lwt"
 
 #define TEST_PAYLOAD_LEN 32
 static const __u8 test_payload[TEST_PAYLOAD_LEN] = {
@@ -187,6 +192,42 @@ static int write_test_packet(int tap_fd)
 	return 0;
 }
 
+/* Inject Ethernet+IPv6+UDP frame into TAP */
+static int write_test_packet_udp(int tap_fd)
+{
+	__u8 pkt[sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
+		 sizeof(struct udphdr) + TEST_PAYLOAD_LEN] = {};
+	struct ethhdr *eth = (void *)pkt;
+	struct ipv6hdr *ip6 = (void *)(eth + 1);
+	struct udphdr *udp = (void *)(ip6 + 1);
+	__u8 *payload = (void *)(udp + 1);
+	const __u8 tap_mac[ETH_ALEN] = { 0x02, 0, 0, 0, 0, 0x01 };
+	int n;
+
+	memcpy(eth->h_dest, tap_mac, ETH_ALEN);
+	eth->h_proto = htons(ETH_P_IPV6);
+
+	ip6->version = 6;
+	ip6->hop_limit = 64;
+	ip6->nexthdr = IPPROTO_UDP;
+	ip6->payload_len = htons(sizeof(*udp) + TEST_PAYLOAD_LEN);
+	inet_pton(AF_INET6, "fd00::2", &ip6->saddr);
+	inet_pton(AF_INET6, "fd00:1::1", &ip6->daddr);
+
+	udp->source = htons(42);
+	udp->dest = htons(42);
+	udp->len = htons(sizeof(*udp) + TEST_PAYLOAD_LEN);
+	/* UDP checksum is not validated on the forwarding path. */
+
+	memcpy(payload, test_payload, TEST_PAYLOAD_LEN);
+
+	n = write(tap_fd, pkt, sizeof(pkt));
+	if (!ASSERT_EQ(n, sizeof(pkt), "write frame"))
+		return -1;
+
+	return 0;
+}
+
 static void dump_err_stream(const struct bpf_program *prog)
 {
 	char buf[512];
@@ -518,3 +559,137 @@ void test_xdp_context_tuntap(void)
 
 	test_xdp_meta__destroy(skel);
 }
+
+/*
+ * Test topology:
+ *
+ *	tap0 fd00::1
+ *	  RX:  injected IPv6 UDP frame, XDP ingress sets metadata
+ *	  fwd: encap route prepends outer header(s)
+ *	  TX:  TC egress validates metadata
+ *
+ * A routable IPv6 UDP frame is written into the tap fd, so it enters the RX
+ * path where XDP stores metadata. Routing then forwards it back out the same
+ * tap through an encapsulating route that prepends outer header(s). The TC
+ * egress program checks that the pushed header did not silently corrupt
+ * metadata.
+ */
+#define LWT_PIN_PATH "/sys/fs/bpf/xdp_context_lwt_xmit"
+
+enum lwt_encap_type {
+	LWT_ENCAP_BPF,
+	LWT_ENCAP_MPLS,
+	LWT_ENCAP_SEG6,
+	LWT_ENCAP_IOAM6,
+};
+
+static void test_lwt_encap(struct test_xdp_meta *skel,
+			   enum lwt_encap_type type)
+{
+	LIBBPF_OPTS(bpf_tc_hook, tc_hook, .attach_point = BPF_TC_EGRESS);
+	LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
+	struct bpf_program *lwt_prog = NULL;
+	struct netns_obj *ns = NULL;
+	const char *encap;
+	bool pinned = false;
+	int tap_ifindex;
+	int tap_fd = -1;
+	int ret;
+
+	skel->bss->test_pass = false;
+
+	switch (type) {
+	case LWT_ENCAP_BPF:
+		encap = "encap bpf xmit pinned " LWT_PIN_PATH " via fd00::2";
+		lwt_prog = skel->progs.dummy_lwt_xmit;
+		break;
+	case LWT_ENCAP_MPLS:
+		encap = "encap mpls 100 via inet6 fd00::2";
+		break;
+	case LWT_ENCAP_SEG6:
+		encap = "encap seg6 mode encap segs fd00::2";
+		break;
+	case LWT_ENCAP_IOAM6:
+		encap = "encap ioam6 mode encap tundst fd00::2 "
+			"trace prealloc type 0x800000 ns 0 size 4 via fd00::2";
+		break;
+	default:
+		return;
+	}
+
+	if (lwt_prog) {
+		unlink(LWT_PIN_PATH);
+		ret = bpf_program__pin(lwt_prog, LWT_PIN_PATH);
+		if (!ASSERT_OK(ret, "pin lwt prog"))
+			return;
+		pinned = true;
+	}
+
+	ns = netns_new(LWT_NETNS, true);
+	if (!ASSERT_OK_PTR(ns, "netns_new"))
+		goto close;
+
+	tap_fd = open_tuntap(TAP_NAME, true);
+	if (!ASSERT_GE(tap_fd, 0, "open_tuntap"))
+		goto close;
+
+	SYS(close, "ip link set dev " TAP_NAME " address " RX_MAC);
+	SYS(close, "sysctl -wq net.ipv6.conf.all.forwarding=1");
+	SYS(close, "ip addr add fd00::1/64 dev " TAP_NAME " nodad");
+	SYS(close, "ip link set dev " TAP_NAME " up");
+	SYS(close, "ip neigh add fd00::2 lladdr " TX_MAC " nud permanent dev " TAP_NAME);
+	SYS(close, "ip -6 route add fd00:1::/64 %s dev %s", encap, TAP_NAME);
+
+	tap_ifindex = if_nametoindex(TAP_NAME);
+	if (!ASSERT_NEQ(tap_ifindex, 0, "if_nametoindex"))
+		goto close;
+
+	ret = bpf_xdp_attach(tap_ifindex, bpf_program__fd(skel->progs.ing_xdp),
+			     0, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto close;
+
+	tc_hook.ifindex = tap_ifindex;
+	ret = bpf_tc_hook_create(&tc_hook);
+	if (!ASSERT_OK(ret, "bpf_tc_hook_create"))
+		goto close;
+
+	tc_opts.prog_fd = bpf_program__fd(skel->progs.tc_is_meta_empty);
+	ret = bpf_tc_attach(&tc_hook, &tc_opts);
+	if (!ASSERT_OK(ret, "bpf_tc_attach"))
+		goto close;
+
+	ret = write_test_packet_udp(tap_fd);
+	if (!ASSERT_OK(ret, "write_test_packet_udp"))
+		goto close;
+
+	if (!ASSERT_TRUE(skel->bss->test_pass, "test_pass"))
+		dump_err_stream(skel->progs.tc_is_meta_empty);
+
+close:
+	if (tap_fd >= 0)
+		close(tap_fd);
+	netns_free(ns);
+	if (pinned)
+		unlink(LWT_PIN_PATH);
+}
+
+void test_xdp_context_lwt_encap(void)
+{
+	struct test_xdp_meta *skel;
+
+	skel = test_xdp_meta__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open and load skeleton"))
+		return;
+
+	if (test__start_subtest("bpf_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_BPF);
+	if (test__start_subtest("mpls_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_MPLS);
+	if (test__start_subtest("seg6_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_SEG6);
+	if (test__start_subtest("ioam6_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_IOAM6);
+
+	test_xdp_meta__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_meta.c b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
index fa73b17cb999..08b03be0b891 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_meta.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
@@ -21,10 +21,6 @@
 
 bool test_pass;
 
-static const __u8 smac_want[ETH_ALEN] = {
-	0x12, 0x34, 0xDE, 0xAD, 0xBE, 0xEF,
-};
-
 static const __u8 meta_want[META_SIZE] = {
 	0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,
 	0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18,
@@ -32,11 +28,6 @@ static const __u8 meta_want[META_SIZE] = {
 	0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38,
 };
 
-static bool check_smac(const struct ethhdr *eth)
-{
-	return !__builtin_memcmp(eth->h_source, smac_want, ETH_ALEN);
-}
-
 static bool check_metadata(const char *file, int line, __u8 *meta_have)
 {
 	if (!__builtin_memcmp(meta_have, meta_want, META_SIZE))
@@ -280,18 +271,47 @@ int ing_cls_dynptr_offset_oob(struct __sk_buff *ctx)
 	return TC_ACT_SHOT;
 }
 
+/* Test packets carry test metadata pattern as payload. */
+static bool is_test_packet_xdp(struct xdp_md *ctx)
+{
+	__u8 meta_have[META_SIZE];
+	__u32 len;
+
+	len = bpf_xdp_get_buff_len(ctx);
+	if (len < META_SIZE)
+		return false;
+	if (bpf_xdp_load_bytes(ctx, len - META_SIZE, meta_have, META_SIZE))
+		return false;
+	if (__builtin_memcmp(meta_have, meta_want, META_SIZE))
+		return false;
+
+	return true;
+}
+
+/* Test packets carry test metadata pattern as payload. */
+static bool is_test_packet_tc(struct __sk_buff *ctx)
+{
+	__u8 meta_have[META_SIZE];
+
+	if (ctx->len < META_SIZE)
+		return false;
+	if (bpf_skb_load_bytes(ctx, ctx->len - META_SIZE, meta_have, META_SIZE))
+		return false;
+	if (__builtin_memcmp(meta_have, meta_want, META_SIZE))
+		return false;
+
+	return true;
+}
+
 /* Reserve and clear space for metadata but don't populate it */
 SEC("xdp")
 int ing_xdp_zalloc_meta(struct xdp_md *ctx)
 {
-	struct ethhdr *eth = ctx_ptr(ctx, data);
 	__u8 *meta;
 	int ret;
 
 	/* Drop any non-test packets */
-	if (eth + 1 > ctx_ptr(ctx, data_end))
-		return XDP_DROP;
-	if (!check_smac(eth))
+	if (!is_test_packet_xdp(ctx))
 		return XDP_DROP;
 
 	ret = bpf_xdp_adjust_meta(ctx, -META_SIZE);
@@ -310,33 +330,24 @@ int ing_xdp_zalloc_meta(struct xdp_md *ctx)
 SEC("xdp")
 int ing_xdp(struct xdp_md *ctx)
 {
-	__u8 *data, *data_meta, *data_end, *payload;
-	struct ethhdr *eth;
+	__u8 *data, *data_meta;
 	int ret;
 
+	/* Drop any non-test packets */
+	if (!is_test_packet_xdp(ctx))
+		return XDP_DROP;
+
 	ret = bpf_xdp_adjust_meta(ctx, -META_SIZE);
 	if (ret < 0)
 		return XDP_DROP;
 
 	data_meta = ctx_ptr(ctx, data_meta);
-	data_end  = ctx_ptr(ctx, data_end);
 	data      = ctx_ptr(ctx, data);
 
-	eth = (struct ethhdr *)data;
-	payload = data + sizeof(struct ethhdr);
-
-	if (payload + META_SIZE > data_end ||
-	    data_meta + META_SIZE > data)
+	if (data_meta + META_SIZE > data)
 		return XDP_DROP;
 
-	/* The Linux networking stack may send other packets on the test
-	 * interface that interfere with the test. Just drop them.
-	 * The test packets can be recognized by their source MAC address.
-	 */
-	if (!check_smac(eth))
-		return XDP_DROP;
-
-	__builtin_memcpy(data_meta, payload, META_SIZE);
+	__builtin_memcpy(data_meta, meta_want, META_SIZE);
 	return XDP_PASS;
 }
 
@@ -353,7 +364,7 @@ int clone_data_meta_survives_data_write(struct __sk_buff *ctx)
 	if (eth + 1 > ctx_ptr(ctx, data_end))
 		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	if (meta_have + META_SIZE > eth)
@@ -383,7 +394,7 @@ int clone_data_meta_survives_meta_write(struct __sk_buff *ctx)
 	if (eth + 1 > ctx_ptr(ctx, data_end))
 		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	if (meta_have + META_SIZE > eth)
@@ -416,7 +427,7 @@ int clone_meta_dynptr_survives_data_slice_write(struct __sk_buff *ctx)
 	if (!eth)
 		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
@@ -436,16 +447,11 @@ int clone_meta_dynptr_survives_data_slice_write(struct __sk_buff *ctx)
 SEC("tc")
 int clone_meta_dynptr_survives_meta_slice_write(struct __sk_buff *ctx)
 {
-	struct bpf_dynptr data, meta;
-	const struct ethhdr *eth;
+	struct bpf_dynptr meta;
 	__u8 *meta_have;
 
-	bpf_dynptr_from_skb(ctx, 0, &data);
-	eth = bpf_dynptr_slice(&data, 0, NULL, sizeof(*eth));
-	if (!eth)
-		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
@@ -471,15 +477,10 @@ int clone_meta_dynptr_rw_before_data_dynptr_write(struct __sk_buff *ctx)
 {
 	struct bpf_dynptr data, meta;
 	__u8 meta_have[META_SIZE];
-	const struct ethhdr *eth;
 	int err;
 
-	bpf_dynptr_from_skb(ctx, 0, &data);
-	eth = bpf_dynptr_slice(&data, 0, NULL, sizeof(*eth));
-	if (!eth)
-		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	/* Expect read-write metadata before unclone */
@@ -492,6 +493,7 @@ int clone_meta_dynptr_rw_before_data_dynptr_write(struct __sk_buff *ctx)
 		goto out;
 
 	/* Helper write to payload will unclone the packet */
+	bpf_dynptr_from_skb(ctx, 0, &data);
 	bpf_dynptr_write(&data, offsetof(struct ethhdr, h_proto), "x", 1, 0);
 
 	err = bpf_dynptr_read(meta_have, META_SIZE, &meta, 0, 0);
@@ -511,17 +513,12 @@ int clone_meta_dynptr_rw_before_data_dynptr_write(struct __sk_buff *ctx)
 SEC("tc")
 int clone_meta_dynptr_rw_before_meta_dynptr_write(struct __sk_buff *ctx)
 {
-	struct bpf_dynptr data, meta;
+	struct bpf_dynptr meta;
 	__u8 meta_have[META_SIZE];
-	const struct ethhdr *eth;
 	int err;
 
-	bpf_dynptr_from_skb(ctx, 0, &data);
-	eth = bpf_dynptr_slice(&data, 0, NULL, sizeof(*eth));
-	if (!eth)
-		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	/* Expect read-write metadata before unclone */
@@ -545,6 +542,28 @@ int clone_meta_dynptr_rw_before_meta_dynptr_write(struct __sk_buff *ctx)
 	return TC_ACT_SHOT;
 }
 
+SEC("lwt_xmit")
+int dummy_lwt_xmit(struct __sk_buff *ctx)
+{
+	if (bpf_skb_change_head(ctx, sizeof(struct ipv6hdr), 0))
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+SEC("tc")
+int tc_is_meta_empty(struct __sk_buff *ctx)
+{
+	if (!is_test_packet_tc(ctx))
+		return TC_ACT_OK;
+
+	if (ctx->data_meta != ctx->data)
+		return TC_ACT_OK;
+
+	test_pass = true;
+	return TC_ACT_OK;
+}
+
 SEC("tc")
 int helper_skb_vlan_push_pop(struct __sk_buff *ctx)
 {

-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v7 04/11] net: Enable BIG TCP with partial GSO
From: Alice Mikityanska @ 2026-06-19 17:21 UTC (permalink / raw)
  To: Paolo Abeni, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Xin Long, Willem de Bruijn, Willem de Bruijn,
	David Ahern, Nikolay Aleksandrov
  Cc: Shuah Khan, Stanislav Fomichev, Andrew Lunn, Simon Horman,
	Florian Westphal, netdev, Alice Mikityanska
In-Reply-To: <554ff2bd-e4d7-4e64-8ec4-86dc7da85992@redhat.com>

On Sun, Jun 14, 2026, at 14:19, Paolo Abeni wrote:
> On 6/11/26 9:29 PM, Alice Mikityanska wrote:
>> From: Alice Mikityanska <alice@isovalent.com>
>> 
>> skb_segment is called for partial GSO, when netif_needs_gso returns true
>> in validate_xmit_skb. Partial GSO is needed, for example, when
>> segmentation of tunneled traffic is offloaded to a NIC that only
>> supports inner checksum offload.
>> 
>> Currently, skb_segment clamps the segment length to 65534 bytes, because
>> gso_size == 65535 is a special value GSO_BY_FRAGS, and we don't want
>> to accidentally assign mss = 65535, as it would fall into the
>> GSO_BY_FRAGS check further in the function.
>> 
>> This implementation, however, artificially blocks len > 65534, which is
>> possible since the introduction of BIG TCP. To allow bigger lengths and
>> avoid resegmentation of BIG TCP packets, store the gso_by_frags flag in
>> the beginning and don't use a special value of mss for this purpose
>> after mss was modified.
>> 
>> Signed-off-by: Alice Mikityanska <alice@isovalent.com>
>> Reviewed-by: Willem de Bruijn <willemb@google.com>
>> ---
>>  drivers/net/netdevsim/psp.c |  2 +-
>>  net/core/skbuff.c           | 10 +++++-----
>>  2 files changed, 6 insertions(+), 6 deletions(-)
>> 
>> diff --git a/drivers/net/netdevsim/psp.c b/drivers/net/netdevsim/psp.c
>> index d3e36c74be62..6b3532b5e360 100644
>> --- a/drivers/net/netdevsim/psp.c
>> +++ b/drivers/net/netdevsim/psp.c
>> @@ -92,7 +92,7 @@ nsim_do_psp(struct sk_buff *skb, struct netdevsim *ns,
>>  		 * provide a valid checksum here, so the skb isn't dropped.
>>  		 */
>>  		uh = udp_hdr(skb);
>> -		udplen = ntohs(uh->len) ?: skb->len - skb_transport_offset(skb);
>> +		udplen = udp_get_len(skb, uh, skb_transport_offset(skb));
>>  		csum = skb_checksum(skb, skb_transport_offset(skb),
>>  				    udplen, 0);
>>  
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index c64693fcb2d1..5dcee79df8cf 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4773,6 +4773,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  	struct sk_buff *tail = NULL;
>>  	struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>>  	unsigned int mss = skb_shinfo(head_skb)->gso_size;
>> +	bool gso_by_frags = mss == GSO_BY_FRAGS;
>>  	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
>>  	unsigned int offset = doffset;
>>  	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
>> @@ -4788,7 +4789,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  	int nfrags, pos;
>>  
>>  	if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) &&
>> -	    mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) {
>> +	    !gso_by_frags && mss != skb_headlen(head_skb)) {
>>  		struct sk_buff *check_skb;
>>  
>>  		for (check_skb = list_skb; check_skb; check_skb = check_skb->next) {
>> @@ -4816,7 +4817,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  	sg = !!(features & NETIF_F_SG);
>>  	csum = !!can_checksum_protocol(features, proto);
>>  
>> -	if (sg && csum && (mss != GSO_BY_FRAGS))  {
>> +	if (sg && csum && !gso_by_frags)  {
>>  		if (!(features & NETIF_F_GSO_PARTIAL)) {
>>  			struct sk_buff *iter;
>>  			unsigned int frag_len;
>> @@ -4850,9 +4851,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  		/* GSO partial only requires that we trim off any excess that
>>  		 * doesn't fit into an MSS sized block, so take care of that
>>  		 * now.
>> -		 * Cap len to not accidentally hit GSO_BY_FRAGS.
>>  		 */
>> -		partial_segs = min(len, GSO_BY_FRAGS - 1) / mss;
>> +		partial_segs = len / mss;
>
> Sashiko/gemini says the above can lead to hit BUG_ON() later.
>
> I *think* it's not a false positive, as it looks like skb_segment()
> assumes an skb can hold `mss` bytes without resorting to frag_list
> usage, and mss > MAX_SKB_FRAGS * PAGE_SIZE  breaks such assumption.
>
> I think handling correctly this case will requires some non trivial
> surgery to skb_segment: both `while (pos < offset + len) {` loops must
> be updated to feed data from `frags` as needed instead of
> BUG_ON()/net_warn_ratelimited(skb_shinfo(nskb)->nr_frags >= MAX_SKB_FRAGS);

Thanks, these are valid points. I took a closer look at skb_segment,
and from what I can tell, this issue existed before my changes, even
without BIG TCP (I have a reproduction). Moreover, I don't think my
changes make the situation worse, because what truly matters is the
geometry of the SKB coming into skb_segment, not the sizes of the
frags as such.

The root of the issue is that skb_segment attempts to produce SKBs of
the same size and without frag_list, but the incoming SKB can have a
frag_list and more than 17 frags, each much smaller than PAGE_SIZE. For
example, if the incoming SKB has 30 frags (using frag_list) of 500 bytes
each, that's just 15000 bytes (much smaller than 64k or MAX_SKB_FRAGS *
PAGE_SIZE), but the partial GSO flow will try to fit almost all of them
(mss will be almost 15000) into a single non-frag_list SKB. Since
skb_segment just reuses the existing frags without combining them in any
way, this ends up with "too many frags".

This means that the assumption that an SKB can hold mss bytes without
frag_list is wrong, and skb_segment is already broken as it relies on
this assumption.

As for the solution, we can't just break the loop early before we put
len bytes into an output SKB, because it will break the guarantee that
the output SKBs have the same size. The only way I can imagine is to
dry-run the loop over all frags in advance and determine the smallest
length that any 17 consecutive frags can take together, and limit len
to that value.
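
The dry run could look roughly like the sketch below. It is a standalone
userspace model, not a patch: min_window_bytes() is a hypothetical
helper, the frag sizes come from a plain array instead of walking
head_skb and its frag_list, and it ignores that a frag may be split at a
segment boundary.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest total size of any @window consecutive frags.
 * Returns the sum of all frags when there are fewer than @window of them,
 * since such an input never needs more than @window frags per output skb.
 */
static size_t min_window_bytes(const unsigned int *frag_size, size_t nfrags,
			       size_t window)
{
	size_t sum = 0, best, i;

	assert(window > 0);

	/* Sum of the first window (or of everything, if shorter). */
	for (i = 0; i < nfrags && i < window; i++)
		sum += frag_size[i];
	best = sum;

	/* Slide the window by one frag at a time. */
	for (i = window; i < nfrags; i++) {
		sum += frag_size[i];
		sum -= frag_size[i - window];
		if (sum < best)
			best = sum;
	}
	return best;
}
```

For the example of 30 frags of 500 bytes each and a window of 17, this
returns 8500, so len would be limited to 8500 instead of almost 15000.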

Why, in my opinion, doesn't adding BIG TCP into the picture make the
situation worse? There are two main practical ways to get into
skb_segment with BIG TCP:

1. Sending a TCP stream from an application. tcp_sendmsg_locked doesn't
   build frag_list SKBs, stopping at sysctl_max_skb_frags. If a non-
   frag_list SKB enters skb_segment, it can handle it even if it's well
   above 64k. From my observation, 32768-byte high order pages are
   allocated to back the frags.

2. Receiving a TCP stream with GRO, then forwarding it out of another
   network interface, e.g., that uses GSO partial. This is indeed broken
   if GRO resorts to frag_list, but, as shown above, it's broken even
   without BIG TCP.

Below, I'll attach a reproducer patch (mostly generated with AI)
that adds debug prints to skb_segment and a script that sets up
forwarding to a GSO partial VXLAN netdev. Setting
net.core.high_order_alloc_disable is not even necessary: I can still
observe a 29026-byte SKB with 21 frags and frag_list (most frags are
1448 bytes); it's just that these frags are allocated from two order-3
pages.

Here are example logs that show the geometry of an SKB that comes from
regular GRO (no BIG TCP) and breaks skb_segment:

skbuff: skb_segment: start head=ffff888101008a00 len=29026 doffset=66 mss=28960 gso_by_frags=0 features=0x0000010e501d0049
skbuff: skb_segment: input head[0] skb=ffff888101008a00 len=29026 data_len=28960 headlen=66 nr_frags=16 frag_list=ffff888104b7e900 head_frag=0 gso_size=1448 gso_segs=20 gso_type=0x801
skbuff: skb_segment: input head[0] frag[0] page=ffffea000420f7c0 head=ffffea000420f7c0 order=0 off=2480 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[1] page=ffffea000420f7c0 head=ffffea000420f7c0 order=0 off=3928 size=168 base_pages=1
skbuff: skb_segment: input head[0] frag[2] page=ffffea00048261c0 head=ffffea00048261c0 order=0 off=0 size=1280 base_pages=1
skbuff: skb_segment: input head[0] frag[3] page=ffffea00048261c0 head=ffffea00048261c0 order=0 off=1280 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[4] page=ffffea00048261c0 head=ffffea00048261c0 order=0 off=2728 size=1368 base_pages=1
skbuff: skb_segment: input head[0] frag[5] page=ffffea0004886400 head=ffffea0004886400 order=0 off=0 size=80 base_pages=1
skbuff: skb_segment: input head[0] frag[6] page=ffffea0004886400 head=ffffea0004886400 order=0 off=80 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[7] page=ffffea0004886400 head=ffffea0004886400 order=0 off=1528 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[8] page=ffffea0004886400 head=ffffea0004886400 order=0 off=2976 size=1120 base_pages=1
skbuff: skb_segment: input head[0] frag[9] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=0 size=328 base_pages=1
skbuff: skb_segment: input head[0] frag[10] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=328 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[11] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=1776 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[12] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=3224 size=872 base_pages=1
skbuff: skb_segment: input head[0] frag[13] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=0 size=576 base_pages=1
skbuff: skb_segment: input head[0] frag[14] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=576 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[15] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=2024 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] skb=ffff888104b7e900 len=11584 data_len=11584 headlen=0 nr_frags=11 frag_list=0000000000000000 head_frag=0 gso_size=0 gso_segs=0 gso_type=0x0
skbuff: skb_segment: input frag_list[0] frag[0] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=3472 size=624 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[1] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=0 size=824 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[2] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=824 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[3] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=2272 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[4] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=3720 size=376 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[5] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=0 size=1072 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[6] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=1072 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[7] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=2520 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[8] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=3968 size=128 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[9] page=ffffea0004826440 head=ffffea0004826440 order=0 off=0 size=1320 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[10] page=ffffea0004826440 head=ffffea0004826440 order=0 off=1320 size=1448 base_pages=1
skbuff: skb_segment: setup after push head=ffff888101008a00 len=29026 payload=28960 offset=66 mss=28960 partial_segs=20 sg=1 csum=1 list_skb=ffff888104b7e900 features=0x0000010e501d0049
skbuff: skb_segment: segment begin offset=66 len=28960 end=29026 pos=66 i=0 nfrags=16 hsize=0 list_skb=ffff888104b7e900 frag_skb=ffff888101008a00
skbuff: skb_segment: alloc nskb=ffff888101cf5100 hsize=0 doffset=66 headroom=190
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=0 source_skb=ffff888101008a00 source=frag[0] page=ffffea000420f7c0 off=2480 size=1448 pos=66 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=1 pos=66 next_pos=1514 next_i=1
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=1 source_skb=ffff888101008a00 source=frag[1] page=ffffea000420f7c0 off=3928 size=168 pos=1514 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=2 pos=1514 next_pos=1682 next_i=2
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=2 source_skb=ffff888101008a00 source=frag[2] page=ffffea00048261c0 off=0 size=1280 pos=1682 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=3 pos=1682 next_pos=2962 next_i=3
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=3 source_skb=ffff888101008a00 source=frag[3] page=ffffea00048261c0 off=1280 size=1448 pos=2962 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=4 pos=2962 next_pos=4410 next_i=4
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=4 source_skb=ffff888101008a00 source=frag[4] page=ffffea00048261c0 off=2728 size=1368 pos=4410 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=5 pos=4410 next_pos=5778 next_i=5
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=5 source_skb=ffff888101008a00 source=frag[5] page=ffffea0004886400 off=0 size=80 pos=5778 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=6 pos=5778 next_pos=5858 next_i=6
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=6 source_skb=ffff888101008a00 source=frag[6] page=ffffea0004886400 off=80 size=1448 pos=5858 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=7 pos=5858 next_pos=7306 next_i=7
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=7 source_skb=ffff888101008a00 source=frag[7] page=ffffea0004886400 off=1528 size=1448 pos=7306 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=8 pos=7306 next_pos=8754 next_i=8
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=8 source_skb=ffff888101008a00 source=frag[8] page=ffffea0004886400 off=2976 size=1120 pos=8754 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=9 pos=8754 next_pos=9874 next_i=9
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=9 source_skb=ffff888101008a00 source=frag[9] page=ffffea00048883c0 off=0 size=328 pos=9874 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=10 pos=9874 next_pos=10202 next_i=10
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=10 source_skb=ffff888101008a00 source=frag[10] page=ffffea00048883c0 off=328 size=1448 pos=10202 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=11 pos=10202 next_pos=11650 next_i=11
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=11 source_skb=ffff888101008a00 source=frag[11] page=ffffea00048883c0 off=1776 size=1448 pos=11650 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=12 pos=11650 next_pos=13098 next_i=12
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=12 source_skb=ffff888101008a00 source=frag[12] page=ffffea00048883c0 off=3224 size=872 pos=13098 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=13 pos=13098 next_pos=13970 next_i=13
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=13 source_skb=ffff888101008a00 source=frag[13] page=ffffea00041bf840 off=0 size=576 pos=13970 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=14 pos=13970 next_pos=14546 next_i=14
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=14 source_skb=ffff888101008a00 source=frag[14] page=ffffea00041bf840 off=576 size=1448 pos=14546 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=15 pos=14546 next_pos=15994 next_i=15
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=15 source_skb=ffff888101008a00 source=frag[15] page=ffffea00041bf840 off=2024 size=1448 pos=15994 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=16 pos=15994 next_pos=17442 next_i=16
skbuff: skb_segment: sg source exhausted pos=17442 target_end=29026 old_frag_skb=ffff888101008a00 old_nfrags=16 next_list=ffff888104b7e900
skbuff: skb_segment: sg entered frag_list skb=ffff888104b7e900 len=11584 headlen=0 nr_frags=11 head_frag=0 start_i=0 next_list=0000000000000000
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=16 source_skb=ffff888104b7e900 source=frag[0] page=ffffea00041bf840 off=3472 size=624 pos=17442 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=17 pos=17442 next_pos=18066 next_i=1
skbuff: skb_segment: sg output full nskb=ffff888101cf5100 nr_frags=17 max=17 pos=18066 target_end=29026 source_i=1 source_nfrags=11 source_skb=ffff888104b7e900
skbuff: skb_segment: too many frags: 18066 28960
skbuff: skb_segment: error err=-22 segs=ffff888101cf5100 tail=0000000019d2a099 offset=66 len=28960 pos=18066 i=1 nfrags=11 list_skb=0000000000000000 frag_skb=ffff888104b7e900
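
The trace above can be replayed in user space. The sketch below is only a
simplified model of the sg loop, not the kernel code: sketch_fill_segment()
and SKETCH_MAX_SKB_FRAGS are hypothetical names, the limit of 17 is taken from
"max=17" in the log, and the frag sizes are copied from the "sg add" and
"input frag_list[0]" lines. It assumes every source frag becomes exactly one
output frag.

```c
#include <stddef.h>

/* Assumption: 17 output frags per skb, as reported by "max=17" in the trace. */
#define SKETCH_MAX_SKB_FRAGS 17

/* Frag sizes from the trace: 16 frags of the head skb, then 11 of frag_list[0]. */
static const unsigned int sketch_trace_sizes[] = {
	1448, 168, 1280, 1448, 1368, 80, 1448, 1448,
	1120, 328, 1448, 1448, 872, 576, 1448, 1448,
	624, 824, 1448, 1448, 376, 1072, 1448, 1448, 128, 1320, 1448,
};

/*
 * Hypothetical helper: add one output frag per source frag, starting at byte
 * position 'start', until 'target_end' is reached.  Returns 0 when the
 * segment can be built and -1 when the output frag array fills up (or the
 * source runs out) first.  *stop_pos receives the position where the walk
 * stopped.
 */
static int sketch_fill_segment(const unsigned int *sizes, size_t n,
			       unsigned int start, unsigned int target_end,
			       unsigned int *stop_pos)
{
	unsigned int pos = start;
	unsigned int out_frags = 0;
	size_t i = 0;

	while (pos < target_end) {
		if (out_frags >= SKETCH_MAX_SKB_FRAGS || i >= n) {
			*stop_pos = pos;
			return -1;
		}
		out_frags++;
		if (pos + sizes[i] > target_end) {
			/* The last frag is trimmed at the segment end. */
			pos = target_end;
			break;
		}
		pos += sizes[i++];
	}
	*stop_pos = pos;
	return 0;
}
```

Called with start=66 and target_end=29026 on the 27 sizes above, the model
stops at the same position as the "too many frags" warning: the 28960-byte
segment would need 27 frags, but only 17 fit.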

Here goes the reproducer patch (applies on top of net-next):

--cut--
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 18dabb4e9cfa..f23aff92f857 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -65,6 +65,7 @@
 #include <linux/kcov.h>
 #include <linux/iov_iter.h>
 #include <linux/crc32.h>
+#include <linux/ratelimit.h>
 
 #include <net/protocol.h>
 #include <net/dst.h>
@@ -4759,6 +4760,77 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_segment_list);
 
+static DEFINE_RATELIMIT_STATE(skb_segment_trace_ratelimit, 5 * HZ, 1);
+
+//#define SKB_SEGMENT_TRACE_THRESHOLD GSO_LEGACY_MAX_SIZE
+#define SKB_SEGMENT_TRACE_THRESHOLD 20000
+
+static bool skb_segment_trace(const struct sk_buff *skb,
+			      unsigned int mss)
+{
+	if (!mss || mss == GSO_BY_FRAGS)
+		return false;
+
+	if (mss <= SKB_SEGMENT_TRACE_THRESHOLD && skb->len <= SKB_SEGMENT_TRACE_THRESHOLD)
+		return false;
+
+	return __ratelimit(&skb_segment_trace_ratelimit);
+}
+
+static void skb_segment_dbg_dump_skb(const char *stage, const char *role,
+				     unsigned int idx,
+				     const struct sk_buff *skb)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	unsigned int nr_frags = shinfo->nr_frags;
+	unsigned int i;
+
+	pr_info("skb_segment: %s %s[%u] skb=%px len=%u data_len=%u headlen=%u nr_frags=%u frag_list=%px head_frag=%u gso_size=%u gso_segs=%u gso_type=0x%x\n",
+		stage, role, idx, skb, skb->len, skb->data_len,
+		skb_headlen(skb), nr_frags, shinfo->frag_list,
+		skb->head_frag, shinfo->gso_size, shinfo->gso_segs,
+		shinfo->gso_type);
+
+	for (i = 0; i < nr_frags; i++) {
+		const skb_frag_t *frag = &shinfo->frags[i];
+
+		struct page *page = skb_frag_page(frag);
+		struct page *head = compound_head(page);
+		unsigned int off = skb_frag_off(frag);
+		unsigned int size = skb_frag_size(frag);
+		unsigned int base_pages;
+
+		base_pages = ((off + size - 1) >> PAGE_SHIFT) - (off >> PAGE_SHIFT) + 1;
+
+		pr_info("skb_segment: %s %s[%u] frag[%u] page=%px head=%px order=%u off=%u size=%u base_pages=%u\n",
+			stage, role, idx, i, skb_frag_page(frag),
+			head, compound_order(head),
+			skb_frag_off(frag), skb_frag_size(frag),
+			base_pages);
+	}
+}
+
+static void skb_segment_dbg_dump_tree(const char *stage,
+				      const struct sk_buff *skb)
+{
+	struct sk_buff *iter;
+	unsigned int i = 0;
+
+	skb_segment_dbg_dump_skb(stage, "head", 0, skb);
+	skb_walk_frags(skb, iter)
+		skb_segment_dbg_dump_skb(stage, "frag_list", i++, iter);
+}
+
+static void skb_segment_dbg_dump_list(const char *stage,
+				      const struct sk_buff *skb)
+{
+	const struct sk_buff *iter;
+	unsigned int i = 0;
+
+	for (iter = skb; iter; iter = iter->next)
+		skb_segment_dbg_dump_skb(stage, "out", i++, iter);
+}
+
 /**
  *	skb_segment - Perform protocol segmentation on skb.
  *	@head_skb: buffer to segment
@@ -4775,6 +4847,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *tail = NULL;
 	struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
 	unsigned int mss = skb_shinfo(head_skb)->gso_size;
+	bool gso_by_frags = mss == GSO_BY_FRAGS;
 	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
 	unsigned int offset = doffset;
 	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
@@ -4784,6 +4857,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *frag_skb;
 	skb_frag_t *frag;
 	__be16 proto;
+	bool trace = false;
 	bool csum, sg;
 	int err = -ENOMEM;
 	int i = 0;
@@ -4861,6 +4935,18 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			partial_segs = 0;
 	}
 
+	trace = skb_segment_trace(head_skb, mss);
+	if (trace) {
+		pr_info("skb_segment: start head=%px len=%u doffset=%u mss=%u gso_by_frags=%u features=%pNF\n",
+			head_skb, head_skb->len, doffset, mss, gso_by_frags,
+			&features);
+		skb_segment_dbg_dump_tree("input", head_skb);
+		pr_info("skb_segment: setup after push head=%px len=%u payload=%u offset=%u mss=%u partial_segs=%u sg=%u csum=%u list_skb=%px features=%pNF\n",
+			head_skb, head_skb->len, head_skb->len - offset,
+			offset, mss, partial_segs, sg, csum, list_skb,
+			&features);
+	}
+
 normal:
 	headroom = skb_headroom(head_skb);
 	pos = skb_headlen(head_skb);
@@ -4888,10 +4974,22 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		hsize = skb_headlen(head_skb) - offset;
 
+		if (trace)
+			pr_info("skb_segment: segment begin offset=%u len=%u end=%u pos=%d i=%d nfrags=%d hsize=%d list_skb=%px frag_skb=%px\n",
+				offset, len, offset + len, pos, i, nfrags,
+				hsize, list_skb, frag_skb);
+
 		if (hsize <= 0 && i >= nfrags && skb_headlen(list_skb) &&
 		    (skb_headlen(list_skb) == len || sg)) {
 			BUG_ON(skb_headlen(list_skb) > len);
 
+			if (trace)
+				pr_info("skb_segment: clone frag_list skb=%px headlen=%u len=%u nr_frags=%u pos=%d target_end=%u sg=%u\n",
+					list_skb, skb_headlen(list_skb),
+					list_skb->len,
+					skb_shinfo(list_skb)->nr_frags,
+					pos, offset + len, sg);
+
 			nskb = skb_clone(list_skb, GFP_ATOMIC);
 			if (unlikely(!nskb))
 				goto err;
@@ -4903,9 +5001,22 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			pos += skb_headlen(list_skb);
 
 			while (pos < offset + len) {
+				if (trace)
+					pr_info("skb_segment: clone walk pos=%d target_end=%u i=%d nfrags=%d frag_skb=%px\n",
+						pos, offset + len, i, nfrags,
+						frag_skb);
+				if (trace && i >= nfrags)
+					pr_info("skb_segment: clone walk would BUG: pos=%d target_end=%u i=%d nfrags=%d list_skb=%px\n",
+						pos, offset + len, i, nfrags,
+						list_skb);
 				BUG_ON(i >= nfrags);
 
 				size = skb_frag_size(frag);
+				if (trace)
+					pr_info("skb_segment: clone walk frag[%d] page=%px off=%u size=%d pos=%d next=%d target_end=%u\n",
+						i, skb_frag_page(frag),
+						skb_frag_off(frag), size, pos,
+						pos + size, offset + len);
 				if (pos + size > offset + len)
 					break;
 
@@ -4916,6 +5027,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			list_skb = list_skb->next;
 
+			if (trace)
+				pr_info("skb_segment: clone walk done nskb=%px pos=%d i=%d next_list=%px\n",
+					nskb, pos, i, list_skb);
+
 			if (unlikely(pskb_trim(nskb, len))) {
 				kfree_skb(nskb);
 				goto err;
@@ -4945,6 +5060,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			skb_reserve(nskb, headroom);
 			__skb_put(nskb, doffset);
+
+			if (trace)
+				pr_info("skb_segment: alloc nskb=%px hsize=%d doffset=%u headroom=%u\n",
+					nskb, hsize, doffset, headroom);
 		}
 
 		if (segs)
@@ -4997,6 +5116,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
+				if (trace)
+					pr_info("skb_segment: sg source exhausted pos=%d target_end=%u old_frag_skb=%px old_nfrags=%d next_list=%px\n",
+						pos, offset + len, frag_skb,
+						nfrags, list_skb);
 				if (skb_orphan_frags(list_skb, GFP_ATOMIC) ||
 				    skb_zerocopy_clone(nskb, list_skb,
 						       GFP_ATOMIC))
@@ -5019,11 +5142,24 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 					frag--;
 				}
 
+				if (trace)
+					pr_info("skb_segment: sg entered frag_list skb=%px len=%u headlen=%u nr_frags=%d head_frag=%u start_i=%d next_list=%px\n",
+						frag_skb, frag_skb->len,
+						skb_headlen(frag_skb), nfrags,
+						frag_skb->head_frag, i,
+						list_skb->next);
+
 				list_skb = list_skb->next;
 			}
 
 			if (unlikely(skb_shinfo(nskb)->nr_frags >=
 				     MAX_SKB_FRAGS)) {
+				if (trace)
+					pr_info("skb_segment: sg output full nskb=%px nr_frags=%u max=%u pos=%d target_end=%u source_i=%d source_nfrags=%d source_skb=%px\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						MAX_SKB_FRAGS, pos,
+						offset + len, i, nfrags,
+						frag_skb);
 				net_warn_ratelimited(
 					"skb_segment: too many frags: %u %u\n",
 					pos, mss);
@@ -5035,18 +5171,46 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			__skb_frag_ref(nskb_frag);
 			size = skb_frag_size(nskb_frag);
 
+			if (trace)
+				pr_info("skb_segment: sg add nskb=%px out_frag=%u source_skb=%px source=%s[%d] page=%px off=%u size=%d pos=%d target=[%u,%u)\n",
+					nskb, skb_shinfo(nskb)->nr_frags,
+					frag_skb, i < 0 ? "head_frag" : "frag",
+					i, skb_frag_page(nskb_frag),
+					skb_frag_off(nskb_frag), size, pos,
+					offset, offset + len);
+
 			if (pos < offset) {
+				if (trace)
+					pr_info("skb_segment: sg trim front nskb=%px out_frag=%u trim=%u old_off=%u old_size=%d\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						offset - pos,
+						skb_frag_off(nskb_frag), size);
 				skb_frag_off_add(nskb_frag, offset - pos);
 				skb_frag_size_sub(nskb_frag, offset - pos);
+				if (trace)
+					pr_info("skb_segment: sg trim front done nskb=%px out_frag=%u new_off=%u new_size=%u\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						skb_frag_off(nskb_frag),
+						skb_frag_size(nskb_frag));
 			}
 
 			skb_shinfo(nskb)->nr_frags++;
 
 			if (pos + size <= offset + len) {
+				if (trace)
+					pr_info("skb_segment: sg consumed source nskb=%px out_nr_frags=%u pos=%d next_pos=%d next_i=%d\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						pos, pos + size, i + 1);
 				i++;
 				frag++;
 				pos += size;
 			} else {
+				if (trace)
+					pr_info("skb_segment: sg trim tail nskb=%px out_frag=%u trim=%u pos=%d size=%d target_end=%u\n",
+						nskb,
+						skb_shinfo(nskb)->nr_frags - 1,
+						pos + size - (offset + len),
+						pos, size, offset + len);
 				skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
 				goto skip_fraglist;
 			}
@@ -5059,6 +5223,12 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 		nskb->len += nskb->data_len;
 		nskb->truesize += nskb->data_len;
 
+		if (trace)
+			pr_info("skb_segment: segment built nskb=%px len=%u data_len=%u nr_frags=%u hsize=%d offset=%u seg_len=%u pos=%d i=%d next_offset=%u\n",
+				nskb, nskb->len, nskb->data_len,
+				skb_shinfo(nskb)->nr_frags, hsize, offset,
+				len, pos, i, offset + len);
+
 perform_csum_check:
 		if (!csum) {
 			if (skb_has_shared_frag(nskb) &&
@@ -5106,6 +5276,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			skb_shinfo(tail)->gso_segs = DIV_ROUND_UP(tail->len - doffset, gso_size);
 	}
 
+	if (trace)
+		skb_segment_dbg_dump_list("output", segs);
+
 	/* Following permits correct backpressure, for protocols
 	 * using skb_set_owner_w().
 	 * Idea is to tranfert ownership from head_skb to last segment.
@@ -5118,6 +5291,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	return segs;
 
 err:
+	if (trace)
+		pr_info("skb_segment: error err=%d segs=%px tail=%p offset=%u len=%u pos=%d i=%d nfrags=%d list_skb=%px frag_skb=%px\n",
+			err, segs, tail, offset, len, pos, i, nfrags,
+			list_skb, frag_skb);
 	kfree_skb_list(segs);
 	return ERR_PTR(err);
 }
diff --git a/tools/testing/selftests/net/big_tcp_repro.sh b/tools/testing/selftests/net/big_tcp_repro.sh
new file mode 100755
index 000000000000..7c465938526d
--- /dev/null
+++ b/tools/testing/selftests/net/big_tcp_repro.sh
@@ -0,0 +1,226 @@
+#!/usr/bin/env bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Reproducer for GRO frag_list failure in skb_segment.
+#
+# Topology:
+#
+#   client ns                 router ns                         server ns
+#   c0 10.0.0.1  <--veth-->  r0 10.0.0.2
+#                             r1 10.0.1.1  <--veth-->          s1 10.0.1.2
+#                             ptun0 10.0.2.1 <--outer vxlan--> ptun1 10.0.2.2
+#                             tun0 192.0.2.1 <--inner vxlan--> tun1 192.0.2.2
+#
+# The client emits non-GSO TCP packets. The router receives them on r0 with
+# plain GRO enabled, forwards the GRO skb into the inner VXLAN tunnel, then
+# sends the already-encapsulated skb through the outer VXLAN tunnel with
+# tx-gso-partial enabled. The outer tunnel forces the SKB into skb_segment for
+# partial GSO, where "skb_segment: too many frags" can be caught.
+
+set -euo pipefail
+
+CLIENT_NS=$(mktemp -u btcp-client-XXXXXXXX)
+ROUTER_NS=$(mktemp -u btcp-router-XXXXXXXX)
+SERVER_NS=$(mktemp -u btcp-server-XXXXXXXX)
+
+CLIENT_IP=10.0.0.1
+ROUTER_CLIENT_IP=10.0.0.2
+ROUTER_UNDERLAY_IP=10.0.1.1
+SERVER_UNDERLAY_IP=10.0.1.2
+ROUTER_OUTER_IP=10.0.2.1
+SERVER_OUTER_IP=10.0.2.2
+ROUTER_INNER_IP=192.0.2.1
+SERVER_INNER_IP=192.0.2.2
+
+NETPERF_TIME=${NETPERF_TIME:-10}
+NETPERF_WRITE=${NETPERF_WRITE:-262144}
+LOWER_MTU=${LOWER_MTU:-9000}
+SHOW_DMESG=${SHOW_DMESG:-1}
+CLEAR_DMESG=${CLEAR_DMESG:-0}
+
+OLD_HIGH_ORDER_ALLOC_DISABLE=
+
+require_command()
+{
+	if ! command -v "$1" >/dev/null 2>&1; then
+		echo "SKIP: missing $1"
+		exit 4
+	fi
+}
+
+cleanup()
+{
+	for ns in "$SERVER_NS" "$ROUTER_NS" "$CLIENT_NS"; do
+		ip netns pids "$ns" 2>/dev/null | xargs -r kill 2>/dev/null || true
+	done
+
+	ip netns del "$SERVER_NS" 2>/dev/null || true
+	ip netns del "$ROUTER_NS" 2>/dev/null || true
+	ip netns del "$CLIENT_NS" 2>/dev/null || true
+
+	if [ -n "$OLD_HIGH_ORDER_ALLOC_DISABLE" ]; then
+		sysctl -qw "net.core.high_order_alloc_disable=$OLD_HIGH_ORDER_ALLOC_DISABLE" || true
+	fi
+}
+
+ethtool_must()
+{
+	local ns=$1
+	local dev=$2
+
+	shift 2
+	ip netns exec "$ns" ethtool -K "$dev" "$@"
+}
+
+ethtool_try()
+{
+	local ns=$1
+	local dev=$2
+
+	shift 2
+	ip netns exec "$ns" ethtool -K "$dev" "$@" >/dev/null 2>&1 || true
+}
+
+setup_namespaces()
+{
+	ip netns add "$CLIENT_NS"
+	ip netns add "$ROUTER_NS"
+	ip netns add "$SERVER_NS"
+
+	for ns in "$CLIENT_NS" "$ROUTER_NS" "$SERVER_NS"; do
+		ip -n "$ns" link set lo up
+		ip netns exec "$ns" sysctl -qw net.ipv4.conf.all.rp_filter=0
+		ip netns exec "$ns" sysctl -qw net.ipv4.conf.default.rp_filter=0
+	done
+	ip netns exec "$ROUTER_NS" sysctl -qw net.ipv4.ip_forward=1
+	ip netns exec "$CLIENT_NS" sysctl -qw net.ipv4.tcp_wmem="4096 4194304 4194304"
+	ip netns exec "$SERVER_NS" sysctl -qw net.ipv4.tcp_rmem="4096 4194304 4194304"
+
+	ip -n "$CLIENT_NS" link add c0 type veth peer name r0 netns "$ROUTER_NS"
+	ip -n "$ROUTER_NS" link add r1 type veth peer name s1 netns "$SERVER_NS"
+
+	ip -n "$CLIENT_NS" addr add "$CLIENT_IP/24" dev c0
+	ip -n "$ROUTER_NS" addr add "$ROUTER_CLIENT_IP/24" dev r0
+	ip -n "$ROUTER_NS" addr add "$ROUTER_UNDERLAY_IP/24" dev r1
+	ip -n "$SERVER_NS" addr add "$SERVER_UNDERLAY_IP/24" dev s1
+
+	ip -n "$ROUTER_NS" link set dev r1 mtu "$LOWER_MTU"
+	ip -n "$SERVER_NS" link set dev s1 mtu "$LOWER_MTU"
+
+	ip -n "$CLIENT_NS" link set c0 up
+	ip -n "$ROUTER_NS" link set r0 up
+	ip -n "$ROUTER_NS" link set r1 up
+	ip -n "$SERVER_NS" link set s1 up
+
+	ip -n "$CLIENT_NS" route add "$SERVER_INNER_IP/32" via "$ROUTER_CLIENT_IP" dev c0
+}
+
+setup_tunnels()
+{
+	ip -n "$ROUTER_NS" link add ptun0 type vxlan \
+		id 100 local "$ROUTER_UNDERLAY_IP" remote "$SERVER_UNDERLAY_IP" dev r1 dstport 4790
+	ip -n "$SERVER_NS" link add ptun1 type vxlan \
+		id 100 local "$SERVER_UNDERLAY_IP" remote "$ROUTER_UNDERLAY_IP" dev s1 dstport 4790
+
+	ip -n "$ROUTER_NS" addr add "$ROUTER_OUTER_IP/24" dev ptun0
+	ip -n "$SERVER_NS" addr add "$SERVER_OUTER_IP/24" dev ptun1
+	ip -n "$ROUTER_NS" link set ptun0 up
+	ip -n "$SERVER_NS" link set ptun1 up
+
+	ip -n "$ROUTER_NS" link add tun0 type vxlan \
+		id 200 local "$ROUTER_OUTER_IP" remote "$SERVER_OUTER_IP" dev ptun0 dstport 4789
+	ip -n "$SERVER_NS" link add tun1 type vxlan \
+		id 200 local "$SERVER_OUTER_IP" remote "$ROUTER_OUTER_IP" dev ptun1 dstport 4789
+
+	ip -n "$ROUTER_NS" addr add "$ROUTER_INNER_IP/24" dev tun0
+	ip -n "$SERVER_NS" addr add "$SERVER_INNER_IP/24" dev tun1
+	ip -n "$ROUTER_NS" link set tun0 up
+	ip -n "$SERVER_NS" link set tun1 up
+
+	ip -n "$SERVER_NS" route add "$CLIENT_IP/32" via "$ROUTER_INNER_IP" dev tun1
+}
+
+setup_offloads()
+{
+	# Client must put non-GSO packets onto c0 so router-side r0 can GRO them.
+	ethtool_must "$CLIENT_NS" c0 tso off gso off
+	ethtool_try "$CLIENT_NS" c0 tx-gso-partial off
+	ethtool_try "$CLIENT_NS" c0 tx-udp_tnl-segmentation off
+	ethtool_try "$CLIENT_NS" c0 tx-udp_tnl-csum-segmentation off
+
+	# Router ingress: normal GRO, not rx-gro-list.  We want skb_gro_receive()
+	# to fill order-0 frags and then use frag_list as the fallback.
+	ethtool_must "$ROUTER_NS" r0 gro on
+	ethtool_try "$ROUTER_NS" r0 rx-gro-list off
+
+	# Outer tunnel: this is the partial-GSO-capable software egress.
+	ethtool_must "$ROUTER_NS" ptun0 \
+		tx-gso-partial on \
+		tx-udp_tnl-segmentation on \
+		tx-udp_tnl-csum-segmentation on
+
+	# Lower veth must not absorb the whole outer tunnel GSO packet.
+	ethtool_must "$ROUTER_NS" r1 tso off gso off
+	ethtool_try "$ROUTER_NS" r1 tx-gso-partial off
+	ethtool_try "$ROUTER_NS" r1 tx-udp_tnl-segmentation off
+	ethtool_try "$ROUTER_NS" r1 tx-udp_tnl-csum-segmentation off
+}
+
+show_relevant_state()
+{
+	echo "Client TX offloads:"
+	ip netns exec "$CLIENT_NS" ethtool -k c0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|tx-gso-partial|tx-udp_tnl' || true
+	echo
+	echo "Router GRO ingress:"
+	ip -n "$ROUTER_NS" -d link show r0 | grep -E 'gso|gro' || true
+	ip netns exec "$ROUTER_NS" ethtool -k r0 | grep -E 'generic-receive-offload|rx-gro-list' || true
+	echo
+	echo "Router partial-GSO outer tunnel:"
+	ip netns exec "$ROUTER_NS" ethtool -k ptun0 | grep -E 'tx-gso-partial|tx-udp_tnl' || true
+}
+
+run_traffic()
+{
+	ip netns exec "$SERVER_NS" netserver >/dev/null
+	sleep 1
+
+	if [ "$CLEAR_DMESG" = 1 ]; then
+		dmesg -C || true
+	fi
+
+	echo "Running TCP_STREAM from $CLIENT_NS to $SERVER_INNER_IP for ${NETPERF_TIME}s"
+	ip netns exec "$CLIENT_NS" netperf -H "$SERVER_INNER_IP" \
+		-t TCP_STREAM -l "$NETPERF_TIME" -- -m "$NETPERF_WRITE" >/dev/null
+
+	if [ "$SHOW_DMESG" = 1 ]; then
+		echo
+		echo "Recent skb_segment logs:"
+		dmesg | grep 'skb_segment' || true
+	fi
+}
+
+require_command ip
+require_command ethtool
+require_command netperf
+require_command netserver
+
+if [ "$(id -u)" -ne 0 ]; then
+	echo "SKIP: must be run as root"
+	exit 4
+fi
+
+if ! { ip link help 2>&1 || :; } | grep -q gso_ipv4_max_size; then
+	echo "SKIP: iproute2 does not support gso/gro IPv4 max size knobs"
+	exit 4
+fi
+
+trap cleanup EXIT
+
+OLD_HIGH_ORDER_ALLOC_DISABLE=$(sysctl -n net.core.high_order_alloc_disable 2>/dev/null || true)
+#sysctl -qw net.core.high_order_alloc_disable=1
+
+setup_namespaces
+setup_tunnels
+setup_offloads
+show_relevant_state
+run_traffic
--cut--
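
For reference, the base_pages value printed by skb_segment_dbg_dump_skb()
can be checked outside the kernel. This is a minimal sketch that assumes
4 KiB pages (PAGE_SHIFT == 12, which is common but not universal);
sketch_base_pages() and SKETCH_PAGE_SHIFT are hypothetical names.

```c
/* Assumption: 4 KiB pages, i.e. PAGE_SHIFT == 12. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Number of base pages touched by a frag that starts at byte offset 'off'
 * within its (possibly compound) page and spans 'size' bytes.  Same
 * expression as in the debug patch; 'size' must be non-zero.
 */
static unsigned int sketch_base_pages(unsigned int off, unsigned int size)
{
	return ((off + size - 1) >> SKETCH_PAGE_SHIFT) -
	       (off >> SKETCH_PAGE_SHIFT) + 1;
}
```

With the frag at off=3472 size=624 from the log this yields 1, matching
base_pages=1 in the output. A frag at off=4000 size=200 would cross a page
boundary and yield 2.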

Thanks,
Alice


* Re: [PATCH] net: add sock_open() for unified socket creation
From: Alex Goltsev @ 2026-06-19 17:54 UTC (permalink / raw)
  To: Al Viro; +Cc: davem, netdev, linux-kernel
In-Reply-To: <20260619163421.GD2636677@ZenIV>

On Fri, 19 Jun 2026 at 19:34, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Jun 19, 2026 at 01:35:56PM +0300, Alex Goltsev wrote:
> > > What's the point (and why not make it inline, while we are at it)?
> >
> > > Are there really callers that would pass a non-constant value as the last argument,
> > > and if so, what are they doing next?
> >
> >
> > As for `inline`: in this case, it would have no practical significance.
> > The compiler already treats a simple inline function as a regular
> > symbol within the `EXPORT_SYMBOL` context, whereas a static inline
> > function (the standard kernel template for helper functions) would
> > completely break the export to the LKM.
>
> How so?  All three underlying primitives are exported, so static inline
> in whatever include/*/*.h you put it in would work just fine.
>
> > As for the last argument, yes, today it is usually a constant,
> > but that’s not the point. The purpose of the enumeration is to provide
> > a unified, explicit control interface. It’s important that if, in the future,
> > someone adds a new type of socket creation, existing calling programs won’t
> > panic or throw a compilation error, but will smoothly fall back to
> > the default case and return -EINVAL, which is a safe failure mode.
>
> Collapsing several functions together is worthless unless the combination
> can be _used_ other than a (questionable) syntax sugar.  kmalloc() can;
> something that would only result in trading multiple identifiers for
> functions for multiple identifiers for "which function to call" is not
> an improvement.

Thank you for the detailed overview. I understand your point of view:
standardization without adding new features isn’t an improvement. I’ll
consider a v2 in which flags can be combined to produce unique
behavior, so that the API offers more than just syntactic sugar.


* Re: Ethtool : PRBS feature
From: Andrew Lunn @ 2026-06-19 18:37 UTC (permalink / raw)
  To: Das, Shubham
  Cc: Alexander H Duyck, lee@trager.us, netdev@vger.kernel.org,
	mkubecek@suse.cz, D H, Siddaraju, Chintalapalle, Balaji,
	Lindberg, Magnus, niklas.damberg@ericsson.com
In-Reply-To: <SN7PR11MB8109C173933D08F994FBB084FFE22@SN7PR11MB8109.namprd11.prod.outlook.com>

> The host driver does not directly access any registers but requests
> the PHY FW to manage PRBS on behalf of it.

Maybe a dumb question. Why?

Can you change the firmware to expose the 802.3 registers for PRBS?
You can then write a library which both phylib and your driver can
use.

	Andrew


* Re: [ANN] Google's Netdev-CI for IDPF and GVE
From: Jakub Kicinski @ 2026-06-19 18:59 UTC (permalink / raw)
  To: Sheena Mohan
  Cc: netdev, andrew+netdev, davem, Eric Dumazet, pabeni, horms,
	Willem de Bruijn, Max Yuan, Pin-yen Lin, Harshitha Ramamurthy,
	Joshua Washington, Danny Gonzalez, David Decotigny, Brian Vazquez
In-Reply-To: <CADWJPTsg5G21=hybo81+QHv0+g64d3a+6gGUaJSm1i7EttCUcw@mail.gmail.com>

On Fri, 29 May 2026 13:44:48 -0700 Sheena Mohan wrote:
> Hi everyone,
> 
> We are happy to share that Netdev-CI testing on both IDPF (running on
> Google Bare Metal) and GVE (running on Google Virtual Machines) is now
> up and running.
> This NIPA integration work enables executing kselftests against the
> current proposed net-next kernel branch on real hardware.
> 
> Thanks to Danny, Max, and Pin-yen for their contributions!
> 
> The test results and logs are available in:
> 
> IDPF Results: https://idpf-netdev-nipa.static.usercontent.goog/json/results.json
> GVE Results: https://gve-netdev-nipa.static.usercontent.goog/json/results.json

Hi Sheena!

The Google runners do not report device info. The results should
contain a "device" object that identifies external components that
may cause regressions (like device FW version), see:
https://github.com/linux-netdev/nipa/wiki/Netdev-CI-system/#device-information
In practice the main use we currently have for it is to auto-categorize
the results as executing on a real driver rather than netdevsim.


* [PATCH net v2] eth: bnxt: improve the timing of stats
From: Jakub Kicinski @ 2026-06-19 19:15 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
	michael.chan, pavan.chebbi

Kernel selftests wait 1.25x of the promised stats refresh time
(as read from ethtool -c). bnxt reports 1sec by default, but
the stats update process has two steps. First the device DMAs the
new values, then the service task updates the full-width SW
counters. So the worst case delay is actually 2x.

Note that the behavior is different for ring stats and port stats.
Port stats are fetched synchronously by the service worker, so
there's no risk of doubling up the delay there.

The problem of stale stats impacts not only tests but real workloads
which monitor egress bandwidth of a NIC. The inaccuracy causes double
counting in the next cycle and spurious overload alarms.

Try to read from the DMA buffer more aggressively, to mitigate
timing issues between DMA and service task. The SW update should
be cheap.

Fixes: 51f307856b60 ("bnxt_en: Allow statistics DMA to be configurable using ethtool -C.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: michael.chan@broadcom.com
CC: pavan.chebbi@broadcom.com

v2:
 - split the accumulate into port and ring
 - make the sync only cover rings
 - remove sync from callbacks which use port stats (which are fetched
   synchronously by the service worker)
v1: https://lore.kernel.org/20260618181358.3037661-1-kuba@kernel.org

With this patch I had 50 clean runs of ntuple.py in a row.
Previously it'd fail within 5 runs at most.

Hopefully this is good enough. In the past I sent an RFC to
convert the driver to use SW stats for everything. That felt
a little drastic.
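
The timing argument can be written down as a small user-space sketch.
The helper names are hypothetical, and it assumes the coalescing value is in
microseconds (1000000 for the 1sec default) and that the two steps can each
lag by up to one full interval, which is where the 2x worst case comes from.

```c
/* Time a selftest waits before reading stats: 1.25x the reported interval. */
static unsigned long sketch_selftest_wait_us(unsigned long coal_us)
{
	return coal_us + coal_us / 4;
}

/*
 * Assumed worst-case age of a ring counter without this patch: one interval
 * until the device DMAs the value, plus one more until the service task
 * copies it into the SW counters.
 */
static unsigned long sketch_worst_case_age_us(unsigned long coal_us)
{
	return 2 * coal_us;
}

/* Non-zero when the selftest wait covers the worst case. */
static int sketch_wait_is_enough(unsigned long coal_us)
{
	return sketch_selftest_wait_us(coal_us) >=
	       sketch_worst_case_age_us(coal_us);
}
```

For the 1sec default the test waits 1.25 sec while the counter may be up to
2 sec old, so the check fails for any non-zero interval.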
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  5 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 48 ++++++++++++++++++-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  1 +
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 6d312259f852..6335dfc14c98 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2620,6 +2620,10 @@ struct bnxt {
 #define BNXT_MIN_STATS_COAL_TICKS	  250000
 #define BNXT_MAX_STATS_COAL_TICKS	 1000000
 
+	/* Protects stats_updated_jiffies and writes to sw_stats */
+	spinlock_t		stats_lock;
+	unsigned long		stats_updated_jiffies;
+
 	struct work_struct	sp_task;
 	unsigned long		sp_event;
 #define BNXT_RX_NTP_FLTR_SP_EVENT	1
@@ -3027,6 +3031,7 @@ void bnxt_reenable_sriov(struct bnxt *bp);
 void bnxt_close_nic(struct bnxt *, bool, bool);
 void bnxt_get_ring_drv_stats(struct bnxt *bp,
 			     struct bnxt_total_ring_drv_stats *stats);
+void bnxt_sync_ring_stats(struct bnxt *bp);
 bool bnxt_rfs_capable(struct bnxt *bp, bool new_rss_ctx);
 int bnxt_dbg_hwrm_rd_reg(struct bnxt *bp, u32 reg_off, u16 num_words,
 			 u32 *reg_buf);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 055e93a417b6..7513618793da 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -10530,7 +10530,7 @@ static void bnxt_accumulate_stats(struct bnxt_stats_mem *stats)
 				stats->hw_masks, stats->len / 8, false);
 }
 
-static void bnxt_accumulate_all_stats(struct bnxt *bp)
+static void bnxt_accumulate_ring_stats(struct bnxt *bp)
 {
 	struct bnxt_stats_mem *ring0_stats;
 	bool ignore_zero = false;
@@ -10553,6 +10553,10 @@ static void bnxt_accumulate_all_stats(struct bnxt *bp)
 					ring0_stats->hw_masks,
 					ring0_stats->len / 8, ignore_zero);
 	}
+}
+
+static void bnxt_accumulate_port_stats(struct bnxt *bp)
+{
 	if (bp->flags & BNXT_FLAG_PORT_STATS) {
 		struct bnxt_stats_mem *stats = &bp->port_stats;
 		__le64 *hw_stats = stats->hw_stats;
@@ -10575,6 +10579,41 @@ static void bnxt_accumulate_all_stats(struct bnxt *bp)
 	}
 }
 
+static void bnxt_accumulate_all_stats(struct bnxt *bp)
+{
+	bnxt_accumulate_ring_stats(bp);
+	bnxt_accumulate_port_stats(bp);
+}
+
+/* Re-accumulate ring stats from DMA buffers if stale.
+ * uAPIs for reading sw_stats should call this first.
+ *
+ * We promise user space update frequency of bp->stats_coal_ticks but
+ * the update is a two step process - first device updates the DMA buffer,
+ * then we have to update from that buffer to driver stats in the service work.
+ * Worst case we would be 2x off from the desired frequency.
+ * Sync the stats sooner, if stale. The 20% threshold was chosen arbitrarily.
+ *
+ * Ideally we would split the user-configured time into two portions,
+ * i.e. also lower the DMA period by the 20%. But the DMA timer seems to have
+ * too coarse granularity to play such tricks.
+ */
+void bnxt_sync_ring_stats(struct bnxt *bp)
+{
+	unsigned long stale;
+
+	if (!netif_running(bp->dev) || !bp->stats_coal_ticks)
+		return;
+
+	spin_lock(&bp->stats_lock);
+	stale = usecs_to_jiffies(bp->stats_coal_ticks / 5);
+	if (time_after_eq(jiffies, bp->stats_updated_jiffies + stale)) {
+		bnxt_accumulate_ring_stats(bp);
+		bp->stats_updated_jiffies = jiffies;
+	}
+	spin_unlock(&bp->stats_lock);
+}
+
 static int bnxt_hwrm_port_qstats(struct bnxt *bp, u8 flags)
 {
 	struct hwrm_port_qstats_input *req;
@@ -13577,6 +13616,7 @@ bnxt_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 		return;
 	}
 
+	bnxt_sync_ring_stats(bp);
 	bnxt_get_ring_stats(bp, stats);
 	bnxt_add_prev_stats(bp, stats);
 
@@ -14753,7 +14793,10 @@ static void bnxt_sp_task(struct work_struct *work)
 	if (test_and_clear_bit(BNXT_PERIODIC_STATS_SP_EVENT, &bp->sp_event)) {
 		bnxt_hwrm_port_qstats(bp, 0);
 		bnxt_hwrm_port_qstats_ext(bp, 0);
+		spin_lock(&bp->stats_lock);
 		bnxt_accumulate_all_stats(bp);
+		bp->stats_updated_jiffies = jiffies;
+		spin_unlock(&bp->stats_lock);
 	}
 
 	if (test_and_clear_bit(BNXT_LINK_CHNG_SP_EVENT, &bp->sp_event)) {
@@ -15488,6 +15531,7 @@ static int bnxt_init_board(struct pci_dev *pdev, struct net_device *dev)
 	INIT_DELAYED_WORK(&bp->fw_reset_task, bnxt_fw_reset_task);
 
 	spin_lock_init(&bp->ntp_fltr_lock);
+	spin_lock_init(&bp->stats_lock);
 #if BITS_PER_LONG == 32
 	spin_lock_init(&bp->db_lock);
 #endif
@@ -16056,6 +16100,7 @@ static void bnxt_get_queue_stats_rx(struct net_device *dev, int i,
 	if (!bp->bnapi)
 		return;
 
+	bnxt_sync_ring_stats(bp);
 	cpr = &bp->bnapi[i]->cp_ring;
 	sw = cpr->stats.sw_stats;
 
@@ -16084,6 +16129,7 @@ static void bnxt_get_queue_stats_tx(struct net_device *dev, int i,
 	if (!bp->tx_ring)
 		return;
 
+	bnxt_sync_ring_stats(bp);
 	bnapi = bp->tx_ring[bp->tx_ring_map[i]].bnapi;
 	sw = bnapi->cp_ring.stats.sw_stats;
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 56d74a3c24b7..62bc9cae613c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -606,6 +606,7 @@ static void bnxt_get_ethtool_stats(struct net_device *dev,
 		goto skip_ring_stats;
 	}
 
+	bnxt_sync_ring_stats(bp);
 	tpa_stats = bnxt_get_num_tpa_ring_stats(bp);
 	for (i = 0; i < bp->cp_nr_rings; i++) {
 		struct bnxt_napi *bnapi = bp->bnapi[i];
-- 
2.54.0

