Netdev List
* TCP default settings (bugzilla)
From: Stephen Hemminger @ 2026-04-15 14:14 UTC (permalink / raw)
  To: netdev

A pair of TCP configuration related bug reports just showed up in bugzilla.
Getting the right time values here seems like a trade off between fast
failover and not dropping crappy connections.

Given how well formatted the bugs are, they look AI generated.

https://bugzilla.kernel.org/show_bug.cgi?id=221366

The default value of net.ipv4.tcp_retries2 (15 retries, resulting in
~924 seconds / ~15.4 minutes before TCP abandons a dead connection) is
far too high for modern data center environments. When a remote host
becomes unreachable (server crash, failover, network partition),
applications are stuck for up to 16 minutes before receiving an error
and taking recovery action. This causes cascading failures, connection
pool exhaustion, and prolonged service outages.
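
The ~924 second figure can be reproduced with a little arithmetic. The sketch below assumes the retransmission timer starts at the 200 ms minimum RTO, doubles after every timeout, and is capped at 120 s; real connections start from a measured RTO, so this is only the hypothetical lower bound. The helper name is made up for illustration.

```c
/* Assumed Linux defaults: minimum RTO 200 ms, maximum RTO 120 s. */
#define RTO_MIN_MS 200L
#define RTO_MAX_MS 120000L

/*
 * Hypothetical helper: total time in milliseconds before the connection is
 * abandoned. It counts the wait after the original transmission and after
 * each of the `retries` retransmissions, doubling the RTO up to RTO_MAX_MS.
 */
static long retries2_timeout_ms(int retries)
{
	long rto = RTO_MIN_MS;
	long total = 0;
	int i;

	for (i = 0; i <= retries; i++) {
		total += rto;
		rto *= 2;
		if (rto > RTO_MAX_MS)
			rto = RTO_MAX_MS;
	}
	return total;
}
```

Under these assumptions retries2_timeout_ms(15) is 924600 ms, which matches the ~924 seconds in the report, and a value of 8 comes to 102200 ms.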

https://bugzilla.kernel.org/show_bug.cgi?id=221365

The default value of net.ipv4.tcp_keepalive_time (7200 seconds / 2
hours) is incompatible with virtually all modern network
infrastructure, causing silent connection failures. Intermediate
stateful devices (load balancers, firewalls, NAT gateways) routinely
expire idle TCP connections after 300-1800 seconds — long before the
first keepalive probe is ever sent.
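
Both defaults can already be overridden per socket, which is one way an application can work around them without touching the sysctls. A minimal sketch, assuming a Linux TCP socket; the helper name and its parameters are hypothetical, while the socket options themselves are the standard Linux ones.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: override the system-wide defaults on one socket.
 * idle_s, intvl_s and cnt take the place of tcp_keepalive_time,
 * tcp_keepalive_intvl and tcp_keepalive_probes; user_timeout_ms bounds how
 * long transmitted data may remain unacknowledged before the connection is
 * dropped. Returns 0 on success, -1 if any setsockopt() fails (errno set).
 */
static int tune_tcp_timeouts(int fd, int idle_s, int intvl_s, int cnt,
			     unsigned int user_timeout_ms)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) ||
	    setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout_ms,
		       sizeof(user_timeout_ms)))
		return -1;
	return 0;
}
```

For example, tune_tcp_timeouts(fd, 60, 10, 5, 30000) would send the first probe after 60 idle seconds, well inside the 300-1800 second idle windows mentioned above.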


* Re: [PATCH net-next 1/3] net/ethernet: add ZTE network driver support
From: Andrew Lunn @ 2026-04-15 14:10 UTC (permalink / raw)
  To: Junyang Han
  Cc: netdev, davem, andrew+netdev, edumazet, kuba, pabeni, ran.ming,
	han.chengfei, zhang.yanze
In-Reply-To: <20260415015334.2018453-1-han.junyang@zte.com.cn>

On Wed, Apr 15, 2026 at 09:53:32AM +0800, Junyang Han wrote:
> Add basic framework for ZTE DingHai ethernet PF driver, including
> Kconfig/Makefile build support and PCIe device probe/remove skeleton.

Please take a read on

https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html

and

https://docs.kernel.org/process/submitting-patches.html

Please always include a patch 0/X in a patch set, explaining the big
picture.

Thanks for keeping the driver small and easy to review.

> + * ZTE DingHai Ethernet driver
> + * Copyright (c) 2022-2024, ZTE Corporation.

And the last two years?

> +#define DRV_VERSION "1.0-1"

Driver versions are generally useless. What does this actually mean
for the given very limited driver? Are you going to change the version
with each patchset?

> +#define DRV_SUMMARY "ZTE(R) zxdh-net driver"
> +
> +const char zxdh_pf_driver_version[] = DRV_VERSION;
> +static const char zxdh_pf_driver_string[] = DRV_SUMMARY;
> +static const char zxdh_pf_copyright[] = "Copyright (c) 2022-2024, ZTE Corporation.";

You don't need this, you have the copyright above.

> +MODULE_AUTHOR("ZTE");

Author is a person, with an email address.

> +MODULE_DESCRIPTION(DRV_SUMMARY);

Please just put the string here, not #define.

> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE("GPL");
> +static int dh_pf_pci_init(struct dh_core_dev *dev)
> +{
> +    int ret = 0;
> +    struct zxdh_pf_device *pf_dev = NULL;

Reverse Christmas tree. This applies everywhere for a netdev driver.

> +static int dh_pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{

> +err_cfg_init:
> +    mutex_destroy(&pf_dev->irq_lock);
> +    mutex_destroy(&dh_dev->lock);
> +    devlink_free(devlink);
> +    pf_dev = NULL;

Since this is a probe function, do you really need to set pf_dev to
NULL? How is it going to keep a value over EPROBE_DEFER cycles?

> +static void dh_pf_remove(struct pci_dev *pdev)
> +{
> +    struct dh_core_dev *dh_dev = pci_get_drvdata(pdev);
> +    struct devlink *devlink = priv_to_devlink(dh_dev);
> +    struct zxdh_pf_device *pf_dev = dh_core_priv(dh_dev);
> +
> +    if (!dh_dev)
> +        return;

How does that happen?

> +    dh_pf_pci_close(dh_dev);
> +    mutex_destroy(&pf_dev->irq_lock);
> +    mutex_destroy(&dh_dev->lock);
> +    devlink_free(devlink);
> +    pci_set_drvdata(pdev, NULL);
> +}

> +static int dh_pf_suspend(struct pci_dev *pdev, pm_message_t state)
> +{
> +    return 0;
> +}
> +
> +static int dh_pf_resume(struct pci_dev *pdev)
> +{
> +    return 0;
> +}

If they do nothing, don't provide them. You can add them when you add
suspend/resume support.

> +static int __init dh_pf_pci_init_module(void)
> +{
> +    return pci_register_driver(&dh_pf_driver);
> +}
> +
> +static void __exit dh_pf_pci_exit_module(void)
> +{
> +    pci_unregister_driver(&dh_pf_driver);
> +}
> +
> +module_init(dh_pf_pci_init_module);
> +module_exit(dh_pf_pci_exit_module);

The PCI subsystem offers a wrapper to do this.

> +struct dh_core_dev {
> +    struct device *device;
> +    enum dh_coredev_type coredev_type;
> +    struct pci_dev *pdev;
> +    struct devlink *devlink;
> +    struct mutex lock; /* Protects device configuration */
> +    char priv[] __aligned(32);

That is unusual. priv is usually a void * and allocated. If you want
an actual array, you might want to have a second member indicate the
size of the array, look at all the work done recently on flexible
arrays.
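
As a rough userspace illustration of the "second member indicates the size" idea (in the kernel this is usually written with struct_size() for the allocation and __counted_by() on the array), something along these lines. The struct and function names are invented for the sketch and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace model only: priv_len records how many bytes follow the header. */
struct core_dev_model {
	size_t priv_len;
	unsigned char priv[];	/* flexible array member, priv_len bytes */
};

/* Allocate the header plus priv_len zeroed bytes of private data. */
static struct core_dev_model *core_dev_alloc(size_t priv_len)
{
	struct core_dev_model *dev;

	/* Refuse sizes where header + payload would overflow size_t. */
	if (priv_len > SIZE_MAX - sizeof(*dev))
		return NULL;

	dev = calloc(1, sizeof(*dev) + priv_len);
	if (dev)
		dev->priv_len = priv_len;
	return dev;
}
```

Keeping the length next to the array lets bounds checkers and readers see how large priv really is.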
	
    Andrew

---
pw-bot: cr


* Re: [PATCH] net/sched: act_ct: fix skb leak on fragment check failure
From: Simon Horman @ 2026-04-15 14:10 UTC (permalink / raw)
  To: phx0fer; +Cc: Simon Horman, jiri, netdev, jhs
In-Reply-To: <20260413084609.69560-1-phx0fer@gmail.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net/sched: act_ct: fix skb leak on fragment check failure

This commit fixes a socket buffer leak in the act_ct traffic control action.
When tcf_ct_handle_fragments() encounters malformed fragments and returns an
error, the skb was not being freed, leading to memory exhaustion under
sustained traffic. The fix changes the return value to TC_ACT_SHOT to ensure
proper cleanup.

> net/sched: act_ct: fix skb leak on fragment check failure
>
> tcf_ct_handle_fragments() returns TC_ACT_CONSUMED when
> tcf_ct_ipv4/6_is_fragment() fails. This causes the caller to

The commit message contains a technical inaccuracy. It states
'tcf_ct_handle_fragments() returns TC_ACT_CONSUMED when
tcf_ct_ipv4/6_is_fragment() fails', but tcf_ct_handle_fragments() returns
integer error codes (like -EINVAL, -ENOMEM, -EPROTO), not TC action codes.

The function that returns TC_ACT_CONSUMED is tcf_ct_act(), which interprets
the error codes from tcf_ct_handle_fragments(). The code fix itself is
correct - only the commit message attribution is wrong.

More accurate would be: 'tcf_ct_act() returns TC_ACT_CONSUMED when
tcf_ct_handle_fragments() fails'.

> believe the skb was consumed, but it was not freed. Each
> malformed fragment leaks one skb, leading to OOM DoS under
> sustained traffic.
>
> Change the return value to TC_ACT_SHOT so the skb is properly
> freed by the caller.
>
> Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")

The existing Fixes: tag points to commit b57dc7c13ea9 ("net/sched: Introduce
action ct"), but the bug was actually introduced by commit 3f14b377d01d
("net/sched: act_ct: fix skb leak and crash on ooo frags").

The original commit b57dc7c13ea9 had correct error handling where fragment
check failures went to the 'drop' label. Commit 3f14b377d01d introduced the
'out_frag' label that always returns TC_ACT_CONSUMED, causing the skb leak
when fragment checks fail with errors other than -EINPROGRESS.

Suggested: Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags")

> Signed-off-by: Dudu Lu <phx0fer@gmail.com>
>
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 7d5e50c921a0..870655f682bd 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -1107,8 +1107,10 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
>  	return retval;
>
>  out_frag:
> -	if (err != -EINPROGRESS)
> +	if (err != -EINPROGRESS) {
>  		tcf_action_inc_drop_qstats(&c->common);
> +		return TC_ACT_SHOT;
> +	}
>  	return TC_ACT_CONSUMED;
>
>  drop:


* Re: [PATCH net-next v2 12/14] net: macb: re-read ISR inside IRQ handler locked section
From: Théo Lebrun @ 2026-04-15 14:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: nicolas.ferre, claudiu.beznea, andrew+netdev, davem, edumazet,
	pabeni, richardcochran, linux, pvalerio, conor, nb,
	vladimir.kondratiev, gregory.clement, benoit.monin, tawfik.bayouk,
	thomas.petazzoni, maxime.chevallier, netdev, linux-kernel
In-Reply-To: <20260414005244.3758889-1-kuba@kernel.org>

Hello Jakub's LLM,

On Tue Apr 14, 2026 at 2:52 AM CEST, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: macb: re-read ISR inside IRQ handler locked section
>
> This commit adds a second read of the ISR register inside the lock in
> macb_interrupt() to handle a race condition where the status might have
> been updated by concurrent hardware resets.
>
>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>> index 2633234012bae..3deab87c565b5 100644
>> --- a/drivers/net/ethernet/cadence/macb_main.c
>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>> @@ -2190,6 +2190,13 @@ static irqreturn_t macb_interrupt(int irq, void *dev_id)
>>  
>>  	spin_lock(&bp->lock);
>>  
>> +	/* `status` stack variable might be stalled => re-read it */
>> +	status = queue_readl(queue, ISR);
>
> Will this second read discard interrupt events on older MACB variants that
> use clear-on-read for the ISR?
>
> If the hardware does not have the MACB_CAPS_ISR_CLEAR_ON_WRITE capability,
> the first queue_readl call before the lock already cleared the active bits.
>
> Overwriting the local status variable here sets it to 0, which permanently
> drops the initially captured Rx/Tx completions and stalls the network
> interface.

The AI is correct. For R1C, we must check ISR inside the bp->lock
critical section. For W1C, we can check it outside to detect spurious
interrupts without grabbing bp->lock, and then check again under bp->lock.

So:

static irqreturn_t macb_interrupt(int irq, void *dev_id)
{
	// ...

	/* detect spurious interrupts without grabbing bp->lock */
	if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE) {
		status = queue_readl(queue, ISR);
		if (unlikely(!status))
			return IRQ_NONE;
	}

	spin_lock(&bp->lock);

	status = queue_readl(queue, ISR);
	if (unlikely(!status)) {
		spin_unlock(&bp->lock);
		return IRQ_NONE;
	}

	// ...
}

Or we could drop the readl(ISR) from outside bp->lock.
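
To make the R1C/W1C difference concrete, here is a toy model of the two register behaviours. It is not the MACB register interface; the struct and function names are invented for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy interrupt status register, not the real hardware interface. */
struct isr_model {
	uint32_t pending;	/* latched event bits */
	bool clear_on_read;	/* true: R1C behaviour, false: W1C behaviour */
};

/* A read returns the pending bits; an R1C register also clears them. */
static uint32_t isr_read(struct isr_model *isr)
{
	uint32_t status = isr->pending;

	if (isr->clear_on_read)
		isr->pending = 0;
	return status;
}

/* W1C acknowledge: only the bits written as 1 are cleared. */
static void isr_ack(struct isr_model *isr, uint32_t bits)
{
	if (!isr->clear_on_read)
		isr->pending &= ~bits;
}
```

With clear_on_read set, a second isr_read() returns 0 and the events seen by the first read are gone, which is why an unlocked first read is only safe for the W1C case.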

>> +	if (unlikely(!status)) {
>> +		spin_unlock(&bp->lock);
>> +		return IRQ_NONE;
>
> If the function reaches this point, the initial status read before the lock
> was non-zero, meaning the device legitimately asserted the interrupt.
>
> If a concurrent operation cleared the hardware state while this thread
> waited for the lock, will returning IRQ_NONE falsely flag this as a
> spurious interrupt?
>
> Since the interrupt was actually raised by the device, doing this repeatedly
> can cause the core IRQ subsystem to permanently disable the IRQ line.
>
> Could this return IRQ_HANDLED instead?

Disagreed on this one. This codepath, if it happens, is expected to be
infrequent and to protect against races by the swap operation disabling
the device. There is no chance we often land here and end up disabling
the IRQ line.

Thanks,

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com



* Re: [PATCH net-next v2 09/14] net: macb: change caps helpers signatures
From: Théo Lebrun @ 2026-04-15 13:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Richard Cochran, Russell King,
	Paolo Valerio, Conor Dooley, Nicolai Buchwitz,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, netdev,
	linux-kernel
In-Reply-To: <20260413174716.39915c92@kernel.org>

Hello Jakub,

On Tue Apr 14, 2026 at 2:47 AM CEST, Jakub Kicinski wrote:
> On Fri, 10 Apr 2026 21:51:57 +0200 Théo Lebrun wrote:
>> For parallel MACB context to start becoming a reality, many functions will
>> soon not have access to `struct macb *bp`. Those will still have access
>> to caps through ctx->info->caps.
>> 
>> Change all caps helpers signatures, from taking `struct macb *bp` to
>> taking `u32 caps`.
>
> Subjectively I feel like this is a slight loss of type safety.
> Someone may pass the wrong u32 and compiler will not help?

Agreed. We can solve that using our new `struct macb` subset called
`struct macb_info`. It contains the caps and is available from both bp
and ctx.

So it will be one of:

   macb_is_gem(bp->info)
   macb_is_gem(ctx->info)

No more obscure u32 argument.
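
A minimal standalone sketch of what that buys, assuming the helper takes a pointer to the info struct. The names and the flag value below are illustrative stand-ins, not the driver's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag value; the driver defines its own MACB_CAPS_* bits. */
#define MODEL_CAPS_IS_GEM (1u << 31)

/* Hypothetical stand-in for the caps-carrying subset struct. */
struct macb_info_model {
	uint32_t caps;
};

/*
 * The helper takes the wrapper type, so handing it an unrelated integer
 * is diagnosed by the compiler instead of being silently accepted.
 */
static inline bool model_is_gem(const struct macb_info_model *info)
{
	return !!(info->caps & MODEL_CAPS_IS_GEM);
}
```

The caller writes model_is_gem(&info) and can no longer pass some other u32 by accident.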

Thanks,

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com



* Re: [PATCH iwl-net 5/5] iavf: return 0 when TC flower filter not found after qdisc teardown
From: Simon Horman @ 2026-04-15 13:53 UTC (permalink / raw)
  To: Aleksandr Loktionov
  Cc: intel-wired-lan, anthony.l.nguyen, netdev, Kiran Patil
In-Reply-To: <20260413073035.4082204-6-aleksandr.loktionov@intel.com>

On Mon, Apr 13, 2026 at 09:30:35AM +0200, Aleksandr Loktionov wrote:
> From: Kiran Patil <kiran.patil@intel.com>
> 
> When an egress qdisc is destroyed, the driver proactively deletes all
> associated cloud filters to prevent stale hardware state, decrementing
> num_cloud_filters to zero in the process.
> 
> The kernel netdev layer is unaware of this implicit cleanup and may
> still try to delete the same filters individually. If the filter is
> not found in the driver's list and num_cloud_filters is already zero,
> return 0 instead of -EINVAL to avoid confusing upper layers that
> believe the filter is still offloaded in hardware.
> 
> Fixes: 0075fa0fadd0 ("i40evf: Add support to apply cloud filters")
> Signed-off-by: Kiran Patil <kiran.patil@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

Reviewed-by: Simon Horman <horms@kernel.org>

Sashiko has some comments on this function - which do not
seem related to the logic this patch touches.

I'd encourage you to take a look at some point as a follow-up activity.

...


* Re: [PATCH iwl-net 4/5] iavf: fix TC boundary check in iavf_handle_tclass
From: Simon Horman @ 2026-04-15 13:46 UTC (permalink / raw)
  To: Aleksandr Loktionov
  Cc: intel-wired-lan, anthony.l.nguyen, netdev, Avinash Dayanand
In-Reply-To: <20260413073035.4082204-5-aleksandr.loktionov@intel.com>

On Mon, Apr 13, 2026 at 09:30:34AM +0200, Aleksandr Loktionov wrote:
> From: Avinash Dayanand <avinash.dayanand@intel.com>
> 
> The condition `tc < adapter->num_tc` admits any tc value equal to or
> greater than num_tc, bypassing the destination-port validation and
> allowing traffic to be steered to a non-existent traffic class. Change
> the comparison to `tc > adapter->num_tc` to correctly reject
> out-of-range TC values.
> 
> Fixes: 0075fa0fadd0 ("i40evf: Add support to apply cloud filters")
> Signed-off-by: Avinash Dayanand <avinash.dayanand@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

I am a bit confused by this logic.

With this patch applied:

1) For tc <= adapter->num_tc, which I assume is valid TCs (other than 0,
   in which case the function returns earlier), the filter destination port
   check is skipped.

   But the failure path for that check logs:
   "Specify destination port to redirect to traffic class other than TC0\n"

   This does not seem consistent.

2) For tc > adapter->num_tc, which I assume is invalid TCs,
   the function will eventually assign fields of filter->f and succeed
   if filter has a valid destination port.

   This doesn't seem to be in keeping with the patch description.

3) The above two points aside, is there an off-by-one error in
   the condition tc > adapter->num_tc? It seems to imply
   that tc == adapter->num_tc is a valid tc. But I suspect that
   is not the case.

In short, I'm wondering if the function should look something like this
(completely untested):

/**
 * iavf_handle_tclass - Forward to a traffic class on the device
 * @adapter: board private structure
 * @tc: traffic class index on the device
 * @filter: pointer to cloud filter structure
 */
static int iavf_handle_tclass(struct iavf_adapter *adapter, u32 tc,
			      struct iavf_cloud_filter *filter)
{
	if (tc == 0)
		return 0;

	if (tc >= adapter->num_tc) {
		// dev_err(...);
		return -EINVAL;
	}

	if (!filter->f.data.tcp_spec.dst_port) {
		dev_err(&adapter->pdev->dev,
			"Specify destination port to redirect to traffic class other than TC0\n");
		return -EINVAL;
	}

	/* redirect to a traffic class on the same device */
	filter->f.action = VIRTCHNL_ACTION_TC_REDIRECT;
	filter->f.action_meta = tc;

	return 0;
}

> ---
>  drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> index ab5f5adc..5e4035b 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
> @@ -4062,7 +4062,7 @@ static int iavf_handle_tclass(struct iavf_adapter *adapter, u32 tc,
>  {
>  	if (tc == 0)
>  		return 0;
> -	if (tc < adapter->num_tc) {
> +	if (tc > adapter->num_tc) {
>  		if (!filter->f.data.tcp_spec.dst_port) {
>  			dev_err(&adapter->pdev->dev,
>  				"Specify destination port to redirect to traffic class other than TC0\n");
> -- 
> 2.52.0
> 


* Re: [PATCH net v6 1/2] flow_dissector: do not dissect PPPoE PFC frames
From: Qingfang Deng @ 2026-04-15 13:42 UTC (permalink / raw)
  To: linux-ppp, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev, linux-kernel
In-Reply-To: <20260415022456.141758-1-qingfang.deng@linux.dev>

The patch state is "Changes Requested" in patchwork but I haven't 
received any feedback. Was it set by mistake?


* Re: [BUG] KASAN: slab-use-after-free in sctp_addto_chunk
From: Xin Long @ 2026-04-15 13:40 UTC (permalink / raw)
  To: 许东洁
  Cc: marcelo.leitner, linux-sctp, netdev, zhaoruilin22
In-Reply-To: <7e897c44.4cb35.19d8f29411c.Coremail.xudongjie25@mails.ucas.ac.cn>

On Tue, Apr 14, 2026 at 11:23 PM 许东洁 <xudongjie25@mails.ucas.ac.cn> wrote:
>
> Hi,
>
> While running fuzzing tests on 6.19.0-rc5, we hit a slab-use-after-free in the SCTP module. The crash occurs in skb_put_data() when processing an incoming chunk and appending data via sctp_addto_chunk().
>
> Looking at the trace and the code, it seems to be an skb reallocation issue. In sctp_sf_beat_8_3(), a pointer to the payload is extracted from the incoming chunk's skb. Later, a pull operation (e.g., pskb_pull) might trigger pskb_expand_head(), which frees the original skb->head and reallocates a larger one. However, the previously extracted payload pointer becomes dangling but is still passed down to sctp_make_heartbeat_ack(), eventually being read by memcpy() in skb_put_data().
>
Hi, Dongjie,

Normally this shouldn't happen, as all incoming skbs must have already
been linearized in sctp_rcv() before coming to sctp_sf_beat_8_3().
For a linearized skb, pskb_pull() will not trigger the skb
reallocation, but only reduce skb->len and advance skb->data.
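
As a simplified userspace picture of that last point (a toy buffer, not struct sk_buff; all names here are invented), a pull on a linear buffer amounts to:

```c
#include <stddef.h>

/* Toy stand-in for a linear skb: one allocation and a movable data pointer. */
struct linear_buf {
	unsigned char *head;	/* start of the allocation */
	unsigned char *data;	/* start of the not-yet-parsed bytes */
	size_t len;		/* bytes remaining from data */
};

/*
 * Pull n bytes off the front. For a linear buffer this only moves `data`
 * and shrinks `len`; `head` is never reallocated, so pointers taken into
 * the buffer earlier stay valid. Returns the new data pointer, or NULL if
 * fewer than n bytes remain.
 */
static unsigned char *linear_pull(struct linear_buf *b, size_t n)
{
	if (n > b->len)
		return NULL;
	b->data += n;
	b->len -= n;
	return b->data;
}
```

In this model a payload pointer computed before the pull still points into the same allocation afterwards.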

Do you have a reproducer to trigger this issue? We need to check how a
non-linearized skb arrives in sctp_sf_beat_8_3().

Thanks.

> It seems we need to either ensure pull operations are completed before taking the payload pointer, or recalculate the pointer immediately after the pull.
>
> We haven't prepared a patch for this yet, but we are glad to help test any proposed fixes.
>
> Crash log, call trace, and machine info are as follows:
>
> [Machine Info]
> QEMU emulator version 6.2.0
> CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (4 cores)
> Kernel Version: 6.19.0-rc5-00042-g944aacb68baf
>
> [Crash Report & Call Trace]
> BUG: KASAN: slab-use-after-free in skb_put_data include/linux/skbuff.h:2800 [inline]
> BUG: KASAN: slab-use-after-free in sctp_addto_chunk+0xfa/0x2a0 net/sctp/sm_make_chunk.c:1535
> Read of size 56 at addr ffff88804878bb68 by task syz.6.114/15386
>
> CPU: 3 UID: 0 PID: 15386 Comm: syz.6.114 Not tainted 6.19.0-rc5-00042-g944aacb68baf #1 PREEMPT(full)
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:94 [inline]
> dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
> print_address_description mm/kasan/report.c:378 [inline]
> print_report+0xca/0x5f0 mm/kasan/report.c:482
> kasan_report+0xca/0x100 mm/kasan/report.c:595
> check_region_inline mm/kasan/generic.c:194 [inline]
> kasan_check_range+0x39/0x1c0 mm/kasan/generic.c:200
> __asan_memcpy+0x24/0x60 mm/kasan/shadow.c:105
> skb_put_data include/linux/skbuff.h:2800 [inline]
> sctp_addto_chunk+0xfa/0x2a0 net/sctp/sm_make_chunk.c:1535
> sctp_make_heartbeat_ack+0x54/0x110 net/sctp/sm_make_chunk.c:1198
> sctp_sf_beat_8_3+0x4f6/0x7a0 net/sctp/sm_statefuns.c:1201
> sctp_do_sm+0x172/0x5520 net/sctp/sm_sideeffect.c:1172
> sctp_assoc_bh_rcv+0x38a/0x6c0 net/sctp/associola.c:1034
> sctp_inq_push+0x1dc/0x270 net/sctp/inqueue.c:88
> sctp_backlog_rcv+0x167/0x5a0 net/sctp/input.c:331
> sk_backlog_rcv include/net/sock.h:1177 [inline]
> __release_sock+0x397/0x430 net/core/sock.c:3213
> release_sock+0x5a/0x220 net/core/sock.c:3795
> ...
> </TASK>
>
> Freed by task 15386:
> kasan_save_stack+0x24/0x50 mm/kasan/common.c:57
> kasan_save_track+0x14/0x30 mm/kasan/common.c:78
> kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:584
> poison_slab_object mm/kasan/common.c:253 [inline]
> __kasan_slab_free+0x61/0x80 mm/kasan/common.c:285
> kasan_slab_free include/linux/kasan.h:235 [inline]
> slab_free mm/slub.c:6670 [inline]
> kmem_cache_free+0x15f/0x780 mm/slub.c:6781
> skb_kfree_head net/core/skbuff.c:1066 [inline]
> skb_free_head+0x1b7/0x210 net/core/skbuff.c:1080
> pskb_expand_head+0x3b1/0xf80 net/core/skbuff.c:2314
> skb_might_realloc+0xb1/0xd0 net/core/skb_fault_injection.c:33
> pskb_may_pull_reason include/linux/skbuff.h:2850 [inline]
> pskb_pull include/linux/skbuff.h:2871 [inline]
> sctp_sf_beat_8_3+0x419/0x7a0 net/sctp/sm_statefuns.c:1198
> ...
>
> Xu Dongjie
> University of Chinese Academy of Sciences


* Re: [PATCH v2] net: wwan: t7xx: validate port_count against message length in t7xx_port_enum_msg_handler
From: kernel test robot @ 2026-04-15 13:37 UTC (permalink / raw)
  To: Pavitra Jha, pabeni
  Cc: oe-kbuild-all, w, chandrashekar.devegowda, linux-wwan, netdev,
	stable, Pavitra Jha
In-Reply-To: <20260414153201.1633720-1-jhapavitra98@gmail.com>

Hi Pavitra,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]
[also build test WARNING on net-next/main linus/master v7.0 next-20260415]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pavitra-Jha/net-wwan-t7xx-validate-port_count-against-message-length-in-t7xx_port_enum_msg_handler/20260415-014321
base:   net/main
patch link:    https://lore.kernel.org/r/20260414153201.1633720-1-jhapavitra98%40gmail.com
patch subject: [PATCH v2] net: wwan: t7xx: validate port_count against message length in t7xx_port_enum_msg_handler
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20260415/202604151531.ClMVCCxv-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260415/202604151531.ClMVCCxv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604151531.ClMVCCxv-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c:127 function parameter 'msg_len' not described in 't7xx_port_enum_msg_handler'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net] net: phy: motorcomm: use device properties for firmware tuning
From: Andrew Lunn @ 2026-04-15 13:37 UTC (permalink / raw)
  To: chunzhi.lin
  Cc: Frank.Sae, hkallweit1, linux, davem, edumazet, kuba, pabeni,
	netdev, linux-kernel, chunzhi.lin
In-Reply-To: <20260415131452.3492671-1-linchunzhi0@gmail.com>

On Wed, Apr 15, 2026 at 09:14:52PM +0800, chunzhi.lin wrote:
> The Motorcomm PHY driver reads optional firmware properties via
> of_property_read_*() from phydev->mdio.dev.of_node. This works for
> Device Tree based systems, but causes ACPI platforms to ignore the same
> properties when they are supplied through _DSD.
> 
> As a result, ACPI-described Motorcomm PHY devices fall back to default
> settings instead of applying firmware-provided tuning such as
> rx/tx internal delay, drive strength, clock output frequency, and
> optional boolean controls like auto-sleep-disabled,
> keep-pll-enabled, and tx clock inversion.
> 
> Switch these lookups to device_property_read_*() so the driver uses the
> generic firmware node interface and can consume the same property names
> from either Device Tree or ACPI.
> 
> This keeps the existing DT behavior unchanged while allowing ACPI
> platforms to honor PHY configuration from firmware.

Please document the new ACPI binding in
Documentation/firmware-guide/acpi/dsd and Cc: the ACPI list so they
can review the binding, same as a DT binding would be reviewed.

The Subject line is wrong. This patch is for net-next. Please read

https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html

and since the merge window is open at the moment, you will need to
wait two weeks before resubmitting.

    Andrew

---
pw-bot: cr


* [linus:master] [selftest]  400e658aa0: kernel-selftests-bpf.net.tun.fail
From: kernel test robot @ 2026-04-15 13:36 UTC (permalink / raw)
  To: Xu Du; +Cc: oe-lkp, lkp, linux-kernel, Jakub Kicinski, netdev, oliver.sang



Hello,

kernel test robot noticed "kernel-selftests-bpf.net.tun.fail" on:

commit: 400e658aa096cda99b37ce806ed63cfe894c9566 ("selftest: tun: Add test for sending gso packet into tun")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: kernel-selftests-bpf
version: 
with following parameters:

	group: net


config: x86_64-rhel-9.4-bpf
compiler: gcc-14
test machine: 16 threads Intel(R) Core(TM) i7-13620H (Raptor Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202604151525.8305306f-lkp@intel.com


# timeout set to 3600
# selftests: net: tun
# TAP version 13
# 1..9
# # Starting 9 tests from 5 test cases.
# #  RUN           tun.delete_detach_close ...
# #            OK  tun.delete_detach_close
# ok 1 tun.delete_detach_close
# #  RUN           tun.detach_delete_close ...
# #            OK  tun.detach_delete_close
# ok 2 tun.detach_delete_close
# #  RUN           tun.detach_close_delete ...
# #            OK  tun.detach_close_delete
# ok 3 tun.detach_close_delete
# #  RUN           tun.reattach_delete_close ...
# #            OK  tun.reattach_delete_close
# ok 4 tun.reattach_delete_close
# #  RUN           tun.reattach_close_delete ...
# #            OK  tun.reattach_close_delete
# ok 5 tun.reattach_close_delete
# #  RUN           tun_vnet_udptnl.4in4_1mss.send_gso_packet ...
# #            OK  tun_vnet_udptnl.4in4_1mss.send_gso_packet
# ok 6 tun_vnet_udptnl.4in4_1mss.send_gso_packet
# #  RUN           tun_vnet_udptnl.6in4_1mss.send_gso_packet ...
# # tun.c:679:send_gso_packet:Expected ret (0) == variant->data_size (1402)
# # send_gso_packet: Test terminated by assertion
# #          FAIL  tun_vnet_udptnl.6in4_1mss.send_gso_packet
# not ok 7 tun_vnet_udptnl.6in4_1mss.send_gso_packet
# #  RUN           tun_vnet_udptnl.4in6_1mss.send_gso_packet ...
# # tun.c:679:send_gso_packet:Expected ret (0) == variant->data_size (1402)
# # send_gso_packet: Test terminated by assertion
# #          FAIL  tun_vnet_udptnl.4in6_1mss.send_gso_packet
# not ok 8 tun_vnet_udptnl.4in6_1mss.send_gso_packet
# #  RUN           tun_vnet_udptnl.6in6_1mss.send_gso_packet ...
# # tun.c:679:send_gso_packet:Expected ret (0) == variant->data_size (1382)
# # send_gso_packet: Test terminated by assertion
# #          FAIL  tun_vnet_udptnl.6in6_1mss.send_gso_packet
# not ok 9 tun_vnet_udptnl.6in6_1mss.send_gso_packet
# # FAILED: 6 / 9 tests passed.
# # Totals: pass:6 fail:3 xfail:0 xpass:0 skip:0 error:0
not ok 19 selftests: net: tun # exit=1



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260415/202604151525.8305306f-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH iwl-net 3/5] iavf: prevent VSI corruption when ring params changed during reset
From: Simon Horman @ 2026-04-15 13:28 UTC (permalink / raw)
  To: aleksandr.loktionov
  Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen, netdev,
	kiran.patil
In-Reply-To: <20260413073035.4082204-4-aleksandr.loktionov@intel.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
iavf: prevent VSI corruption when ring params changed during reset

Changing ring parameters via ethtool triggers a VF reset and queue
reconfiguration. If ethtool is called again before the first reset
completes, the second reset races with uninitialised queue state and
can corrupt the VSI resource tree on the PF side. Return -EAGAIN from
iavf_set_ringparam() when the adapter is already resetting or its queues
are disabled.

> diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
> index 1cd1f3f2930a0..390913170315f 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
> @@ -495,6 +495,11 @@ static int iavf_set_ringparam(struct net_device *netdev,
>  	if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
>  		return -EINVAL;
>  
> +	if (adapter->state == __IAVF_RESETTING ||
> +	    (adapter->state == __IAVF_RUNNING &&
> +	     adapter->flags & IAVF_FLAG_QUEUES_DISABLED))
> +		return -EAGAIN;
> +

Is there a lockless Time-Of-Check to Time-Of-Use race here?

iavf_set_ringparam() is called from ethtool under rtnl_lock(), but the
background iavf_reset_task() synchronizes using netdev_lock(). Because iavf
does not set netdev->request_ops_lock = true, the ethtool core does not
acquire netdev_lock() for iavf_set_ringparam().

This allows adapter->state to transition to __IAVF_RESETTING immediately
after this lockless check.

Furthermore, calling iavf_reset_step() later in this function without
holding netdev_lock() allows the hardware reset sequence to execute
concurrently from both the ethtool thread and the workqueue, which could
lead to VSI resource corruption.

Could this open-coded state check miss pending resets?

If a reset is scheduled (e.g., by a watchdog timeout setting
IAVF_FLAG_RESET_NEEDED) but the worker task hasn't started yet,
adapter->state will still be __IAVF_RUNNING. This check will pass, and
iavf_set_ringparam() will proceed to mutate ring parameters and execute
iavf_reset_step(), racing with the previously scheduled iavf_reset_task.

Can this be addressed by also checking for IAVF_FLAG_RESET_NEEDED and
IAVF_FLAG_RESET_PENDING?

Also, the commit message notes that triggering a VF reset before the first
one completes causes VSI resource tree corruption. Both iavf_set_channels()
and iavf_change_mtu() also dynamically reconfigure queues and
unconditionally trigger resets by calling iavf_reset_step(adapter).

Should these functions be updated to include a similar state check to
prevent the same VSI resource tree corruption on the PF side?
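
A toy userspace model of the wider guard asked about above, offered only
as a sketch. The enum and flag values are hypothetical stand-ins (the
real driver's definitions differ), and the model does not address the
locking question; it only shows the shape of a check that also covers a
reset that has been scheduled but has not started yet.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's state and flag values. */
enum model_state { MODEL_RUNNING, MODEL_RESETTING };

#define MODEL_FLAG_QUEUES_DISABLED (1u << 0)
#define MODEL_FLAG_RESET_NEEDED    (1u << 1)
#define MODEL_FLAG_RESET_PENDING   (1u << 2)

/* Return true when a ring-parameter change should be refused (-EAGAIN). */
static bool model_ringparam_busy(enum model_state state, unsigned int flags)
{
	/* A reset is already in progress. */
	if (state == MODEL_RESETTING)
		return true;
	/* A reset has been requested but the worker has not run yet. */
	if (flags & (MODEL_FLAG_RESET_NEEDED | MODEL_FLAG_RESET_PENDING))
		return true;
	/* Running, but the queues are not usable yet. */
	return state == MODEL_RUNNING && (flags & MODEL_FLAG_QUEUES_DISABLED);
}
```

The same predicate could in principle be shared by the other callers
that trigger a reset, so the three paths cannot drift apart.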

>  	if (ring->tx_pending > IAVF_MAX_TXD ||
>  	    ring->tx_pending < IAVF_MIN_TXD ||
>  	    ring->rx_pending > IAVF_MAX_RXD ||

^ permalink raw reply

* Re: [PATCH iwl-net 2/5] iavf: fix error path in iavf_request_misc_irq
From: Simon Horman @ 2026-04-15 13:26 UTC (permalink / raw)
  To: aleksandr.loktionov
  Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen, netdev,
	kiran.patil
In-Reply-To: <20260413073035.4082204-3-aleksandr.loktionov@intel.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
iavf: fix error path in iavf_request_misc_irq

This patch drops the free_irq() call from the error path of
iavf_request_misc_irq() when request_irq() fails. This is intended to
prevent a kernel warning when freeing an unrequested vector.

> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> index dad001abc9086..ab5f5adc2d3c2 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
> @@ -587,7 +587,6 @@ static int iavf_request_misc_irq(struct iavf_adapter *adapter)
>  		dev_err(&adapter->pdev->dev,
>  			"request_irq for %s failed: %d\n",
>  			adapter->misc_vector_name, err);
> -		free_irq(adapter->msix_entries[0].vector, netdev);

Does this patch completely fix the warning?

If request_irq() fails, iavf_request_misc_irq() returns an error but leaves
adapter->msix_entries allocated.

In caller error handling paths like iavf_disable_vf() which clean up after a
failure, iavf_free_misc_irq() is called:

static void iavf_free_misc_irq(struct iavf_adapter *adapter)
{
	struct net_device *netdev = adapter->netdev;

	if (!adapter->msix_entries)
		return;

	free_irq(adapter->msix_entries[0].vector, netdev);
}

Since adapter->msix_entries is still allocated, the check passes and free_irq()
is called on the unrequested vector anyway, triggering the exact same warning.

Does the driver need to explicitly track whether the IRQ was successfully
requested?
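
One possible shape for such tracking, as a toy userspace model rather
than driver code. The struct, its fields and both helpers are
hypothetical; free_calls merely counts how often the modelled free_irq()
would have run.

```c
#include <stdbool.h>

/* Hypothetical adapter that remembers whether its misc IRQ was requested. */
struct model_adapter {
	bool misc_irq_requested;
	int free_calls;	/* how often the modelled free_irq() ran */
};

/* Record the outcome of the request; err == 0 means success. */
static int model_request_misc_irq(struct model_adapter *a, int err)
{
	a->misc_irq_requested = (err == 0);
	return err;
}

/* Free only what was actually requested, and only once. */
static void model_free_misc_irq(struct model_adapter *a)
{
	if (!a->misc_irq_requested)
		return;
	a->free_calls++;
	a->misc_irq_requested = false;
}
```

With a flag like this, a cleanup path that runs after a failed request
becomes a no-op instead of freeing a vector that was never requested.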

>  	}
>  	return err;
>  }

^ permalink raw reply

* RE: [PATCH v5 net-next 0/8] dpll/ice: Add TXC DPLL type and full TX reference clock control for E825
From: Kubalewski, Arkadiusz @ 2026-04-15 13:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Nitka, Grzegorz, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	Oros, Petr, richardcochran@gmail.com, andrew+netdev@lunn.ch,
	Kitszel, Przemyslaw, Nguyen, Anthony L,
	Prathosh.Satish@microchip.com, Vecera, Ivan, jiri@resnulli.us,
	vadim.fedorenko@linux.dev, donald.hunter@gmail.com,
	horms@kernel.org, pabeni@redhat.com, davem@davemloft.net,
	edumazet@google.com
In-Reply-To: <20260414145835.07fbe355@kernel.org>

>From: Jakub Kicinski <kuba@kernel.org>
>Sent: Tuesday, April 14, 2026 11:59 PM
>
>On Mon, 13 Apr 2026 08:19:30 +0000 Kubalewski, Arkadiusz wrote:
>>> My concern is that I think this is a pretty run of the mill SyncE
>>> design. If we need to pretend we have two DPLLs here if we really
>>> only have one and a mux - then our APIs are mis-designed :(
>>
>> Well, the truth is that we did not anticipate per-port control of the
>> TX clock source, as a single DPLL device could drive multiple of such.
>>
>> It is not true that we pretend there is a second PLL - there is a
>> PLL on each TX clock, maybe not a full DPLL, but still the loop with
>> a control over its sources is there and it has the same 2 external
>> sources + default XO.
>
>Don't we put that MAC PLL into bypass mode if we feed a clock from
>the EEC DPLL?

This HW doesn't use the EEC DPLL signal to feed the MAC clock, as the
DPLL is external from the NIC's point of view. Only 2 signals from such
an external DPLL device are used by the NIC:
- synce (a single source for all those per-port TXC DPLL devices)
- time_ref (a source for the TS_PLL, which drives the PTP timer)

Grzegorz is now working on submitting the patches for the latter one.

>
>> A mentioned try of adding per port MUX-type pin, just to give some control
>> to the user, is where we wanted to simplify things, but in the end the API
>> would have to be modified in significant way, various paths related to pin
>> registration and keeping correct references, just to make working case
>> for the pin_on_pin_register and its internals. We decided that the burden
>> and impact for existing design was too high.
>>
>> And that is why the TXC approach emerged, the change of DPLL is minimal,
>> The model is still correct from user perspective, SyncE SW controller shall
>> anticipate possibility that per-port TXC dpll is there
>
>We are starting to push into what was previously the domain of
>drivers/clk, tho. IIUC the "ASIC PLL"s are usually integrated with
>clock dividers. And cannot be "configured" after chip init / async
>reset (which is why I presume you whack a reset in patch 7?).

Well, we need the CGU divider change for frequency compliance with lower
link speeds; the link reset is required as part of the tx-clk switch
and link establishment on the new clock.

>
>> This particular device and driver doesn't implement any EEC-type DPLL
>> device, the one could think that we can just change the type here and use
>> EEC type instead of new one TXC - since we share pins from external dpll
>> driver, which is EEC type, and our DPLL device would have different clock_id
>> and module. But, further designs, where a single NIC is having control over
>> both a EEC DPLL and ability to control each source per-port this would be
>> problematic. At least one NIC Port driver would have to have 2 EEC-type DPLLs
>> leaving user with extra confusion.
>
>The distinction between TXC and EEC dpll is confusing.
>I thought EEC one _was_supposed_to_ drive the Tx clock?
>What PPS means is obvious, what EEC means if not driving Tx clock is
>unclear to me..
>

Yes, correct, EEC DPLL main task would be to drive TX clocks of NIC
ports, but if there is a per-port control something extra is required.

>Let me summarize my concerns - we need to navigate the split between
>drivers/clk and dpll. We need a distinction on what goes where, because
>every ASIC has a bunch of PLLs which until now have been controlled by
>device tree (if at all). If the main question we want to answer is
>"which clock ref is used to drive internal clock" all we need is a MUX.
>If we want to make dpll cover also ASIC PLLs for platforms without
>device tree we need a more generic name than TXC, IMHO.

Well, a 'floating' MUX-type pin not connected to any dpll would require a
lot of additional implementation just to allow source selection, as we
have already tried.

Wouldn't a more generic name cause a DPLL purpose problem?
We still want to make sure that a given DPLL device serves the role
of source selection for a particular port, where a source pin should be an
output of either an EEC dpll or some external signal generator, but somehow
related to SyncE or similar solutions.

Thanks,
Arkadiusz

^ permalink raw reply

* Re: rust: net: phy: intent for MAE0621A (out-of-tree C -> Rust), request for target guidance
From: Andrew Lunn @ 2026-04-15 13:20 UTC (permalink / raw)
  To: wenzhaoliao
  Cc: hkallweit1, fujita.tomonori, linux, tmgross, ojeda, netdev,
	rust-for-linux
In-Reply-To: <AFkAcAA-KLHz8L2oAyS3qqrb.1.1776247198201.Hmail.2023000929@ruc.edu.cn>

On Wed, Apr 15, 2026 at 05:59:58PM +0800, wenzhaoliao wrote:
> 
> Hello PHY and Rust maintainers,
> 
> 
> I am a PhD student working on a C-to-Rust migration tool for systems code.
> We would like to validate it in Linux with one concrete PHY target and would
> like to confirm direction before posting a larger RFC series.
> 
> 
> Scope of this intent:
> - Initial target: MAE0621A (currently out-of-tree C driver).
> - We do NOT intend to submit a duplicate Rust rewrite of an existing in-tree C PHY driver.
> - Goal: evaluate a semi-automatic abstraction completion workflow:
>   reuse existing Rust PHY abstractions where possible, and add only minimal missing abstractions.
> 
> 
> Planned deliverables:
> - A gap analysis between MAE0621A C callbacks and current rust/kernel/net/phy.rs coverage.
> - A small RFC patch series with minimal abstraction additions (if needed).
> - A MAE0621A Rust driver prototype on top of those abstractions for linux-next/rust-next evaluation.

When done correctly, this sounds reasonable. However, i do have some
further questions.

Do you have hardware? What board do you intend to test this on? Does
the board boot using Mainline?

Do you have the datasheet?

What out of tree C driver do you intend to start from? I had a quick
look around and the first one i found is:

https://github.com/CoreELEC/linux-amlogic/blob/amlogic-5.4.210/drivers/net/phy/maxio.c

As is often the case of an out of tree driver, it is not up to the
quality of a Mainline driver. Doing a tool based C to Rust migration
based on this code will just give you a poor quality Rust driver,
which will not be accepted. Do you have the knowledge to fix all the
issues?

Maybe you can tell us what C driver you plan to use, and do a
review of it, list all the issues you see with it, what needs
fixing. That will give us an idea if you can produce a Mainline
quality driver.

	Andrew

^ permalink raw reply

* [PATCH net] net: phy: motorcomm: use device properties for firmware tuning
From: chunzhi.lin @ 2026-04-15 13:14 UTC (permalink / raw)
  To: Frank.Sae
  Cc: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel, chunzhi.lin, chunzhi.lin

The Motorcomm PHY driver reads optional firmware properties via
of_property_read_*() from phydev->mdio.dev.of_node. This works for
Device Tree based systems, but causes ACPI platforms to ignore the same
properties when they are supplied through _DSD.

As a result, ACPI-described Motorcomm PHY devices fall back to default
settings instead of applying firmware-provided tuning such as
rx/tx internal delay, drive strength, clock output frequency, and
optional boolean controls like auto-sleep-disabled,
keep-pll-enabled, and tx clock inversion.

Switch these lookups to device_property_read_*() so the driver uses the
generic firmware node interface and can consume the same property names
from either Device Tree or ACPI.

This keeps the existing DT behavior unchanged while allowing ACPI
platforms to honor PHY configuration from firmware.

We have completed testing on the Sophgo SD3-10 RISC-V server. It has a
64-core Thead C920 CPU whose DWMAC is connected to a Motorcomm YT8531
PHY. The server supports UEFI boot, and we would like it to use ACPI
tables.

Signed-off-by: chunzhi.lin <linchunzhi0@gmail.com>
---
 drivers/net/phy/motorcomm.c | 41 ++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/drivers/net/phy/motorcomm.c b/drivers/net/phy/motorcomm.c
index 4d62f7b36212..708491bc198a 100644
--- a/drivers/net/phy/motorcomm.c
+++ b/drivers/net/phy/motorcomm.c
@@ -10,7 +10,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/phy.h>
-#include <linux/of.h>
+#include <linux/property.h>
 
 #define PHY_ID_YT8511		0x0000010a
 #define PHY_ID_YT8521		0x0000011a
@@ -843,12 +843,12 @@ static u32 ytphy_get_delay_reg_value(struct phy_device *phydev,
 				     u16 *rxc_dly_en,
 				     u32 dflt)
 {
-	struct device_node *node = phydev->mdio.dev.of_node;
+	struct device *dev = &phydev->mdio.dev;
 	int tb_size_half = tb_size / 2;
 	u32 val;
 	int i;
 
-	if (of_property_read_u32(node, prop_name, &val))
+	if (device_property_read_u32(dev, prop_name, &val))
 		goto err_dts_val;
 
 	/* when rxc_dly_en is NULL, it is get the delay for tx, only half of
@@ -996,12 +996,12 @@ static int yt8531_get_ds_map(struct phy_device *phydev, u32 cur)
 
 static int yt8531_set_ds(struct phy_device *phydev)
 {
-	struct device_node *node = phydev->mdio.dev.of_node;
+	struct device *dev = &phydev->mdio.dev;
 	u32 ds_field_low, ds_field_hi, val;
 	int ret, ds;
 
 	/* set rgmii rx clk driver strength */
-	if (!of_property_read_u32(node, "motorcomm,rx-clk-drv-microamp", &val)) {
+	if (!device_property_read_u32(dev, "motorcomm,rx-clk-drv-microamp", &val)) {
 		ds = yt8531_get_ds_map(phydev, val);
 		if (ds < 0)
 			return dev_err_probe(&phydev->mdio.dev, ds,
@@ -1018,7 +1018,7 @@ static int yt8531_set_ds(struct phy_device *phydev)
 		return ret;
 
 	/* set rgmii rx data driver strength */
-	if (!of_property_read_u32(node, "motorcomm,rx-data-drv-microamp", &val)) {
+	if (!device_property_read_u32(dev, "motorcomm,rx-data-drv-microamp", &val)) {
 		ds = yt8531_get_ds_map(phydev, val);
 		if (ds < 0)
 			return dev_err_probe(&phydev->mdio.dev, ds,
@@ -1051,7 +1051,6 @@ static int yt8531_set_ds(struct phy_device *phydev)
  */
 static int yt8521_probe(struct phy_device *phydev)
 {
-	struct device_node *node = phydev->mdio.dev.of_node;
 	struct device *dev = &phydev->mdio.dev;
 	struct yt8521_priv *priv;
 	int chip_config;
@@ -1101,7 +1100,7 @@ static int yt8521_probe(struct phy_device *phydev)
 			return ret;
 	}
 
-	if (of_property_read_u32(node, "motorcomm,clk-out-frequency-hz", &freq))
+	if (device_property_read_u32(dev, "motorcomm,clk-out-frequency-hz", &freq))
 		freq = YTPHY_DTS_OUTPUT_CLK_DIS;
 
 	if (phydev->drv->phy_id == PHY_ID_YT8521) {
@@ -1169,11 +1168,11 @@ static int yt8521_probe(struct phy_device *phydev)
 
 static int yt8531_probe(struct phy_device *phydev)
 {
-	struct device_node *node = phydev->mdio.dev.of_node;
+	struct device *dev = &phydev->mdio.dev;
 	u16 mask, val;
 	u32 freq;
 
-	if (of_property_read_u32(node, "motorcomm,clk-out-frequency-hz", &freq))
+	if (device_property_read_u32(dev, "motorcomm,clk-out-frequency-hz", &freq))
 		freq = YTPHY_DTS_OUTPUT_CLK_DIS;
 
 	switch (freq) {
@@ -1665,7 +1664,7 @@ static int yt8521_resume(struct phy_device *phydev)
  */
 static int yt8521_config_init(struct phy_device *phydev)
 {
-	struct device_node *node = phydev->mdio.dev.of_node;
+	struct device *dev = &phydev->mdio.dev;
 	int old_page;
 	int ret = 0;
 
@@ -1680,7 +1679,7 @@ static int yt8521_config_init(struct phy_device *phydev)
 			goto err_restore_page;
 	}
 
-	if (of_property_read_bool(node, "motorcomm,auto-sleep-disabled")) {
+	if (device_property_read_bool(dev, "motorcomm,auto-sleep-disabled")) {
 		/* disable auto sleep */
 		ret = ytphy_modify_ext(phydev, YT8521_EXTREG_SLEEP_CONTROL1_REG,
 				       YT8521_ESC1R_SLEEP_SW, 0);
@@ -1688,7 +1687,7 @@ static int yt8521_config_init(struct phy_device *phydev)
 			goto err_restore_page;
 	}
 
-	if (of_property_read_bool(node, "motorcomm,keep-pll-enabled")) {
+	if (device_property_read_bool(dev, "motorcomm,keep-pll-enabled")) {
 		/* enable RXC clock when no wire plug */
 		ret = ytphy_modify_ext(phydev, YT8521_CLOCK_GATING_REG,
 				       YT8521_CGR_RX_CLK_EN, 0);
@@ -1801,14 +1800,14 @@ static int yt8521_led_hw_control_get(struct phy_device *phydev, u8 index,
 
 static int yt8531_config_init(struct phy_device *phydev)
 {
-	struct device_node *node = phydev->mdio.dev.of_node;
+	struct device *dev = &phydev->mdio.dev;
 	int ret;
 
 	ret = ytphy_rgmii_clk_delay_config_with_lock(phydev);
 	if (ret < 0)
 		return ret;
 
-	if (of_property_read_bool(node, "motorcomm,auto-sleep-disabled")) {
+	if (device_property_read_bool(dev, "motorcomm,auto-sleep-disabled")) {
 		/* disable auto sleep */
 		ret = ytphy_modify_ext_with_lock(phydev,
 						 YT8521_EXTREG_SLEEP_CONTROL1_REG,
@@ -1817,7 +1816,7 @@ static int yt8531_config_init(struct phy_device *phydev)
 			return ret;
 	}
 
-	if (of_property_read_bool(node, "motorcomm,keep-pll-enabled")) {
+	if (device_property_read_bool(dev, "motorcomm,keep-pll-enabled")) {
 		/* enable RXC clock when no wire plug */
 		ret = ytphy_modify_ext_with_lock(phydev,
 						 YT8521_CLOCK_GATING_REG,
@@ -1844,7 +1843,7 @@ static int yt8531_config_init(struct phy_device *phydev)
  */
 static void yt8531_link_change_notify(struct phy_device *phydev)
 {
-	struct device_node *node = phydev->mdio.dev.of_node;
+	struct device *dev = &phydev->mdio.dev;
 	bool tx_clk_1000_inverted = false;
 	bool tx_clk_100_inverted = false;
 	bool tx_clk_10_inverted = false;
@@ -1852,17 +1851,17 @@ static void yt8531_link_change_notify(struct phy_device *phydev)
 	u16 val = 0;
 	int ret;
 
-	if (of_property_read_bool(node, "motorcomm,tx-clk-adj-enabled"))
+	if (device_property_read_bool(dev, "motorcomm,tx-clk-adj-enabled"))
 		tx_clk_adj_enabled = true;
 
 	if (!tx_clk_adj_enabled)
 		return;
 
-	if (of_property_read_bool(node, "motorcomm,tx-clk-10-inverted"))
+	if (device_property_read_bool(dev, "motorcomm,tx-clk-10-inverted"))
 		tx_clk_10_inverted = true;
-	if (of_property_read_bool(node, "motorcomm,tx-clk-100-inverted"))
+	if (device_property_read_bool(dev, "motorcomm,tx-clk-100-inverted"))
 		tx_clk_100_inverted = true;
-	if (of_property_read_bool(node, "motorcomm,tx-clk-1000-inverted"))
+	if (device_property_read_bool(dev, "motorcomm,tx-clk-1000-inverted"))
 		tx_clk_1000_inverted = true;
 
 	if (phydev->speed < 0)
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net v3 2/3] vsock/test: fix MSG_PEEK handling in recv_buf()
From: Luigi Leonardi @ 2026-04-15 13:11 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Arseniy Krasnov, kvm, virtualization,
	netdev, linux-kernel
In-Reply-To: <ad96TgXHW_jKitls@sgarzare-redhat>

On Wed, Apr 15, 2026 at 01:54:43PM +0200, Stefano Garzarella wrote:
>On Wed, Apr 15, 2026 at 01:31:11PM +0200, Stefano Garzarella wrote:
>>On Tue, Apr 14, 2026 at 06:10:22PM +0200, Luigi Leonardi wrote:
>>>`recv_buf` does not handle the MSG_PEEK flag correctly: it keeps calling
>>>`recv` until all requested bytes are available or an error occurs.
>>>
>>>The problem is how it calculates the amount of bytes read: MSG_PEEK
>>>doesn't consume any bytes, will re-read the same bytes from the buffer
>>>head, so, summing the return value every time is wrong.
>>>
>>>Moreover, MSG_PEEK doesn't consume the bytes in the buffer, so if the
>>>requested amount is more than the bytes available, the loop will never
>>>terminate, because `recv` will never return EOF. For this reason we need
>>>to compare the amount of read bytes with the number of bytes expected.
>>>
>>>Add a check, and if the MSG_PEEK flag is present, update the counter of
>>>read bytes differently, and break if we read the expected amount.
>>
>>nit: "..., update the counter for bytes read only after all expected
>>bytes have been read and break out of the loop; otherwise, try again
>>after a short delay to avoid consuming too many CPU cycles."
>>
>>>
>>>This allows us to simplify the `test_stream_credit_update_test`, by
>>>reusing `recv_buf`, like some other tests already do.
>>>
>>>This also fixes callers that pass MSG_PEEK to recv_buf().
>>
>>nit: this is implicit from the first part of the description.
>>
>>>
>>>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>>>Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
>>>---
>>>tools/testing/vsock/util.c       | 15 +++++++++++++++
>>>tools/testing/vsock/vsock_test.c | 13 +------------
>>>2 files changed, 16 insertions(+), 12 deletions(-)
>>>
>>>diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
>>>index 1fe1338c79cd..2c9ee3210090 100644
>>>--- a/tools/testing/vsock/util.c
>>>+++ b/tools/testing/vsock/util.c
>>>@@ -381,7 +381,13 @@ void send_buf(int fd, const void *buf, size_t len, int flags,
>>>	}
>>>}
>>>
>>>+#define RECV_PEEK_RETRY_USEC 10
>>
>>10 usec IMO are a bit low, it could be the same order of the 
>>syscalls involved in the loop, I'd go to some milliseconds like we 
>>do for SEND_SLEEP_USEC.
>>
>>>+
>>>/* Receive bytes in a buffer and check the return value.
>>>+ *
>>>+ * MSG_PEEK note: MSG_PEEK doesn't consume bytes from the buffer, so partial
>>>+ * reads cannot be summed. Instead, the function retries until recv() returns
>>>+ * exactly expected_ret bytes in a single call.
>>
>>I'd replace with something like this:
>>
>>  * When MSG_PEEK is set, recv() is retried until it returns exactly
>>  * expected_ret bytes. The function returns on error, EOF, or timeout
>>  * as usual.
>>
>>Thanks,
>>Stefano
>>
>>>*
>>>* expected_ret:
>>>*  <0 Negative errno (for testing errors)
>>>@@ -403,6 +409,15 @@ void recv_buf(int fd, void *buf, size_t len, int flags, ssize_t expected_ret)
>>>		if (ret <= 0)
>>>			break;
>>>
>>>+		if (flags & MSG_PEEK) {
>>>+			if (ret == expected_ret) {
>
>On second thought, I think it would be more appropriate to check for
>`ret >= expected_ret` here, because all subsequent recv() will
>definitely return more bytes, so there’s no point in continuing the
>loop... and anyway, we’ll check the result later, so just that change
>should be fine.
>
>And of course I'd update the comment on top in this way:
>
>   * When MSG_PEEK is set, recv() is retried until it returns at least
>   * expected_ret bytes. The function returns on error, EOF, or timeout
>   * as usual.
>
>Thanks,
>Stefano
>

Good idea, will do.

Thanks!
Luigi
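
For reference, the agreed behaviour can be sketched as a standalone
helper. This is not the util.c code: peek_at_least(), its max_tries
parameter, the -2 return value and the 1 ms delay are all hypothetical,
chosen only to show the "at least expected_ret bytes" stop condition
together with a short retry delay.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: peek until at least `expected` bytes are queued.
 * Returns the last recv() result, or -2 if max_tries is exhausted.
 */
static ssize_t peek_at_least(int fd, void *buf, size_t len,
			     size_t expected, int max_tries)
{
	while (max_tries-- > 0) {
		/* MSG_PEEK leaves the data queued, so each call re-reads
		 * from the head of the buffer; results must not be summed.
		 */
		ssize_t ret = recv(fd, buf, len, MSG_PEEK);

		if (ret <= 0)
			return ret;	/* error or EOF */
		if ((size_t)ret >= expected)
			return ret;	/* enough data is visible */

		/* Sleep for 1 ms (hypothetical value) before retrying. */
		poll(NULL, 0, 1);
	}
	return -2;
}
```

A later plain recv() still returns the same bytes, since peeking never
consumed them.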


^ permalink raw reply

* Re: [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support
From: Aleksandr Nogikh @ 2026-04-15 13:05 UTC (permalink / raw)
  To: syzbot+cib904ea9ebb647254, hawk
  Cc: netdev, linux-kernel, syzkaller-bugs, syzbot
In-Reply-To: <69dd48c2.a00a0220.468cb.004e.GAE@google.com>

... okay, one more fixed bug, one more try.


#syz test

---
  drivers/net/veth.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 911e7e36e166..9d7b085c9548 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1138,7 +1138,9 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
         */
        peer = rtnl_dereference(priv->peer);
        if (peer) {
-               for (i = start; i < end; i++)
+               int peer_end = min(end, (int)peer->real_num_tx_queues);
+
+               for (i = start; i < peer_end; i++)
                        netdev_tx_reset_queue(netdev_get_tx_queue(peer, i));
        }


^ permalink raw reply related

* [PATCH net v2] net: pse-pd: fix out-of-bounds bitmap access in pse_isr() on 32-bit
From: Kory Maincent @ 2026-04-15 13:02 UTC (permalink / raw)
  To: Jakub Kicinski, Kory Maincent (Dent Project), netdev,
	linux-kernel
  Cc: Carlo Szelinsky, thomas.petazzoni, Oleksij Rempel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni

In pse_isr(), notifs_mask was declared as a single unsigned long on the
stack (32 bits on 32-bit architectures). For PSE controllers with more
than 32 ports, this causes two problems:

- map_event callbacks could write bit positions >= 32 via
  *notifs_mask |= BIT(i), which is undefined behaviour on a 32-bit
  unsigned long and corrupts adjacent stack memory.

- for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) treats
  &notifs_mask as a multi-word bitmap and reads beyond the single
  unsigned long when nr_lines > BITS_PER_LONG.

Fix this by moving notifs_mask out of the stack and into struct pse_irq
as a dynamically allocated bitmap. It is sized with
BITS_TO_LONGS(pcdev->nr_lines) words in devm_pse_irq_helper(), so it
is always wide enough regardless of the host word size.

Fixes: fc0e6db30941a ("net: pse-pd: Add support for reporting events")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---

Changes in v2:
- Use devm_bitmap_zalloc() instead of devm_kcalloc().
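
For anyone who wants to check the sizing arithmetic outside the kernel,
here is a small userspace sketch. The model_* names are hypothetical
stand-ins, not the kernel helpers; the point is only that a line index
at or above the word size lands in a second word, which a single
unsigned long does not have.

```c
#include <limits.h>
#include <stddef.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold nbits bits. */
static size_t model_bits_to_longs(size_t nbits)
{
	return (nbits + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG;
}

/* Set bit i in a multi-word bitmap; the caller guarantees i < nbits. */
static void model_set_bit(unsigned long *map, size_t i)
{
	map[i / MODEL_BITS_PER_LONG] |= 1UL << (i % MODEL_BITS_PER_LONG);
}

/* Test bit i in a multi-word bitmap. */
static int model_test_bit(const unsigned long *map, size_t i)
{
	return (map[i / MODEL_BITS_PER_LONG] >> (i % MODEL_BITS_PER_LONG)) & 1UL;
}
```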
---
 drivers/net/pse-pd/pse_core.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index f6b94ac7a68a4..87aa4f4e97249 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -1170,6 +1170,7 @@ struct pse_irq {
 	struct pse_controller_dev *pcdev;
 	struct pse_irq_desc desc;
 	unsigned long *notifs;
+	unsigned long *notifs_mask;
 };
 
 /**
@@ -1247,7 +1248,6 @@ static int pse_set_config_isr(struct pse_controller_dev *pcdev, int id,
 static irqreturn_t pse_isr(int irq, void *data)
 {
 	struct pse_controller_dev *pcdev;
-	unsigned long notifs_mask = 0;
 	struct pse_irq_desc *desc;
 	struct pse_irq *h = data;
 	int ret, i;
@@ -1257,14 +1257,15 @@ static irqreturn_t pse_isr(int irq, void *data)
 
 	/* Clear notifs mask */
 	memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs));
+	bitmap_zero(h->notifs_mask, pcdev->nr_lines);
 	mutex_lock(&pcdev->lock);
-	ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask);
-	if (ret || !notifs_mask) {
+	ret = desc->map_event(irq, pcdev, h->notifs, h->notifs_mask);
+	if (ret || bitmap_empty(h->notifs_mask, pcdev->nr_lines)) {
 		mutex_unlock(&pcdev->lock);
 		return IRQ_NONE;
 	}
 
-	for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) {
+	for_each_set_bit(i, h->notifs_mask, pcdev->nr_lines) {
 		unsigned long notifs, rnotifs;
 		struct pse_ntf ntf = {};
 
@@ -1340,6 +1341,10 @@ int devm_pse_irq_helper(struct pse_controller_dev *pcdev, int irq,
 	if (!h->notifs)
 		return -ENOMEM;
 
+	h->notifs_mask = devm_bitmap_zalloc(dev, pcdev->nr_lines, GFP_KERNEL);
+	if (!h->notifs_mask)
+		return -ENOMEM;
+
 	ret = devm_request_threaded_irq(dev, irq, NULL, pse_isr,
 					IRQF_ONESHOT | irq_flags,
 					irq_name, h);
-- 
2.43.0


^ permalink raw reply related

* [PATCH iproute2] ss: force a flush in monitor mode
From: Eric Dumazet @ 2026-04-15 13:03 UTC (permalink / raw)
  To: David Ahern, Stephen Hemminger
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Kuniyuki Iwashima,
	netdev, eric.dumazet, Eric Dumazet

Call fflush() from generic_show_sock() in order to work
with pipes and redirects.

After this patch, "ss -E &>log_file" works as expected.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 misc/ss.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/misc/ss.c b/misc/ss.c
index 1ea804ad549e23f767633e07efdd9adf1277af18..39b109276ffa83f12d1e1e9f8f2cf58c25737b4b 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -5534,6 +5534,7 @@ static int generic_show_sock(struct nlmsghdr *nlh, void *arg)
 
 	render();
 
+	fflush(stdout);
 	return ret;
 }
 
-- 
2.54.0.rc1.513.gad8abe7a5a-goog


^ permalink raw reply related

* Re: [PATCH net-next 5/6] net: stmmac: move PHY handling out of __stmmac_open()/release()
From: Russell King (Oracle) @ 2026-04-15 12:59 UTC (permalink / raw)
  To: Alexander Stein
  Cc: Andrew Lunn, Heiner Kallweit, Alexandre Torgue, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, linux-arm-kernel,
	linux-stm32, Maxime Coquelin, netdev, Paolo Abeni
In-Reply-To: <8409022.LvFx2qVVIh@steina-w>

On Wed, Apr 15, 2026 at 08:08:40AM +0200, Alexander Stein wrote:
> Hi,
> 
> Am Dienstag, 23. September 2025, 13:26:19 CEST schrieb Russell King (Oracle):
> > Move the PHY attachment/detachment from the network driver out of
> > __stmmac_open() and __stmmac_release() into stmmac_open() and
> > stmmac_release() where these actions will only happen when the
> > interface is administratively brought up or down. It does not make
> > sense to detach and re-attach the PHY during a change of MTU.
> 
> Sorry for coming up now. But I recently noticed this commit breaks changing
> the MTU on i.MX8MP. Once I simply change the MTU I run into some DMA error:
> $ ip link set dev end1 mtu 1400
> imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0
> imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-1
> imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-2
> imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-3
> imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-4
> imx-dwmac 30bf0000.ethernet end1: Link is Down
> imx-dwmac 30bf0000.ethernet end1: Failed to reset the dma
> imx-dwmac 30bf0000.ethernet end1: stmmac_hw_setup: DMA engine initialization failed

This basically means that a clock is missing. Please provide more
information:

- what kernel version are you using?
- has EEE been negotiated?
- does the problem persist when EEE is disabled?
- which PHY is attached to stmmac?
- which PHY interface mode is being used to connect the PHY to stmmac?

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH net v5] net: stmmac: Prevent NULL deref when RX memory exhausted
From: Russell King (Oracle) @ 2026-04-15 12:56 UTC (permalink / raw)
  To: Sam Edwards
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, Alexandre Torgue, Maxime Chevallier,
	Ovidiu Panait, Vladimir Oltean, Baruch Siach, Serge Semin,
	Giuseppe Cavallaro, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, stable
In-Reply-To: <20260415023947.7627-1-CFSworks@gmail.com>

On Tue, Apr 14, 2026 at 07:39:47PM -0700, Sam Edwards wrote:
> The CPU receives frames from the MAC through conventional DMA: the CPU
> allocates buffers for the MAC, then the MAC fills them and returns
> ownership to the CPU. For each hardware RX queue, the CPU and MAC
> coordinate through a shared ring array of DMA descriptors: one
> descriptor per DMA buffer. Each descriptor includes the buffer's
> physical address and a status flag ("OWN") indicating which side owns
> the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set
> the flag and the MAC is only allowed to clear it, and both must move
> through the ring in sequence: thus the ring is used for both
> "submissions" and "completions."
> 
> In the stmmac driver, stmmac_rx() bookmarks its position in the ring
> with the `cur_rx` index. The main receive loop in that function checks
> for rx_descs[cur_rx].own=0, gives the corresponding buffer to the
> network stack (NULLing the pointer), and increments `cur_rx` modulo the
> ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its
> position with `dirty_rx`, allocates fresh buffers and rearms the
> descriptors (setting OWN=1). If it fails any allocation, it simply stops
> early (leaving OWN=0) and will retry where it left off when next called.
> 
> This means descriptors have a three-stage lifecycle (terms my own):
> - `empty` (OWN=1, buffer valid)
> - `full` (OWN=0, buffer valid and populated)
> - `dirty` (OWN=0, buffer NULL)
> 
> But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In
> the past (see 'Fixes:'), there was a bug where the loop could cycle
> `cur_rx` all the way back to the first descriptor it dirtied, resulting
> in a NULL dereference when mistaken for `full`. The aforementioned
> commit resolved that *specific* failure by capping the loop's iteration
> limit at `dma_rx_size - 1`, but this is only a partial fix: if the
> previous stmmac_rx_refill() didn't complete, then there are leftover
> `dirty` descriptors that the loop might encounter without needing to
> cycle fully around. The current code therefore panics (see 'Closes:')
> when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to
> catch up to `dirty_rx`.
> 
> Fix this by further tightening the clamp from `dma_rx_size - 1` to
> `dma_rx_size - stmmac_rx_dirty() - 1`, subtracting any remnant dirty
> entries and limiting the loop so that `cur_rx` cannot catch back up to
> `dirty_rx`. This carries no risk of arithmetic underflow: since the
> maximum possible return value of stmmac_rx_dirty() is `dma_rx_size - 1`,
> the worst the clamp can do is prevent the loop from running at all.
> 
> Fixes: b6cb4541853c7 ("net: stmmac: avoid rx queue overrun")
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221010
> Cc: stable@vger.kernel.org
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>

Locally, while debugging my issues, I used this to prevent cur_rx
catching up with dirty_rx:

                status = stmmac_rx_status(priv, &priv->xstats, p);
                /* check if managed by the DMA otherwise go ahead */
                if (unlikely(status & dma_own))
                        break;

                next_entry = STMMAC_NEXT_ENTRY(rx_q->cur_rx,
                                               priv->dma_conf.dma_rx_size);
                if (unlikely(next_entry == rx_q->dirty_rx))
                        break;

                rx_q->cur_rx = next_entry;

If we care about the cost of reloading rx_q->dirty_rx on every
iteration, then I'd suggest that the cost we already incur reading and
writing rx_q->cur_rx is something that should be addressed, and
eliminating that would counter the cost of reading rx_q->dirty_rx. I
suspect, however, that the cost is minimal, as cur_rx and dirty_rx are
likely in the same cache line.

It looks like any fix to stmmac_rx() will also need a corresponding
fix for stmmac_rx_zc().

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: KaFai Wan @ 2026-04-15 12:52 UTC (permalink / raw)
  To: Jiayuan Chen, bpf
  Cc: Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, netdev, linux-doc, linux-kernel
In-Reply-To: <0b3a3a41-f709-4414-8a5d-d2eb4959db3f@linux.dev>

On Wed, 2026-04-15 at 09:47 +0800, Jiayuan Chen wrote:
> 
> On 4/14/26 11:37 PM, mkf wrote:
> > On Tue, 2026-04-14 at 18:57 +0800, Jiayuan Chen wrote:
> 
> Hi Martin, I saw your patch. Your solution is better, please ignore mine :)
> 
I'm not Martin, we just have the same first name :). OK, I'll continue.
> 
> 

-- 
Thanks,
KaFai


* [PATCH net v4] openvswitch: cap upcall PID array size and pre-size vport replies
From: Weiming Shi @ 2026-04-15 12:51 UTC (permalink / raw)
  To: Aaron Conole, Eelco Chaudron, Ilya Maximets, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Pravin B Shelar, Alex Wang, Thomas Graf, netdev,
	dev, Xiang Mei, Weiming Shi

The vport netlink reply helpers allocate a fixed-size skb with
nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID
array via ovs_vport_get_upcall_portids().  Since
ovs_vport_set_upcall_portids() accepts any non-zero multiple of
sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID
array large enough to overflow the reply buffer, causing nla_put() to
fail with -EMSGSIZE and hitting BUG_ON(err < 0).  On systems with
unprivileged user namespaces enabled (e.g., Ubuntu default), this is
reachable via unshare -Urn since OVS vport mutation operations use
GENL_UNS_ADMIN_PERM.

 kernel BUG at net/openvswitch/datapath.c:2414!
 Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
 CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1
 RIP: 0010:ovs_vport_cmd_set+0x34c/0x400
 Call Trace:
  <TASK>
  genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116)
  genl_rcv_msg (net/netlink/genetlink.c:1194)
  netlink_rcv_skb (net/netlink/af_netlink.c:2550)
  genl_rcv (net/netlink/genetlink.c:1219)
  netlink_unicast (net/netlink/af_netlink.c:1344)
  netlink_sendmsg (net/netlink/af_netlink.c:1894)
  __sys_sendto (net/socket.c:2206)
  __x64_sys_sendto (net/socket.c:2209)
  do_syscall_64 (arch/x86/entry/syscall_64.c:63)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  </TASK>
 Kernel panic - not syncing: Fatal exception

Reject attempts to set more PIDs than nr_cpu_ids in
ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply
size in ovs_vport_cmd_msg_size() based on that bound, similar to the
existing ovs_dp_cmd_msg_size().  nr_cpu_ids matches the cap already
used by the per-CPU dispatch configuration on the datapath side
(ovs_dp_cmd_fill_info() serialises at most nr_cpu_ids PIDs), so the
two sides stay consistent.

Fixes: 5cd667b0a456 ("openvswitch: Allow each vport to have an array of 'port_id's.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
v4 (per Ilya):
- Use nr_cpu_ids instead of num_possible_cpus() for consistency with
  the per-CPU dispatch on the datapath side.
- Annotate ovs_vport_cmd_msg_size() per-attribute; split nested sums.
v3: Cap at num_possible_cpus(); add ovs_vport_cmd_msg_size(); keep
    BUG_ON(); fix Fixes tag.
v2: Dynamically size reply skb; drop WARN_ON_ONCE, return plain errors.
---
 net/openvswitch/datapath.c | 33 +++++++++++++++++++++++++++++++--
 net/openvswitch/vport.c    |  3 +++
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index e209099218b4..35e67e51b0d2 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2184,9 +2184,38 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb,
 	return err;
 }
 
+static size_t ovs_vport_cmd_msg_size(void)
+{
+	size_t msgsize = NLMSG_ALIGN(sizeof(struct ovs_header));
+
+	msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_PORT_NO */
+	msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_TYPE */
+	msgsize += nla_total_size(IFNAMSIZ);    /* OVS_VPORT_ATTR_NAME */
+	msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_IFINDEX */
+	msgsize += nla_total_size(sizeof(s32)); /* OVS_VPORT_ATTR_NETNSID */
+	/* OVS_VPORT_ATTR_STATS */
+	msgsize += nla_total_size_64bit(sizeof(struct ovs_vport_stats));
+	/* OVS_VPORT_ATTR_UPCALL_STATS(OVS_VPORT_UPCALL_ATTR_SUCCESS +
+	 *                             OVS_VPORT_UPCALL_ATTR_FAIL)
+	 */
+	msgsize += nla_total_size(nla_total_size_64bit(sizeof(u64)) +
+				  nla_total_size_64bit(sizeof(u64)));
+	/* OVS_VPORT_ATTR_UPCALL_PID (capped at nr_cpu_ids by
+	 * ovs_vport_set_upcall_portids())
+	 */
+	msgsize += nla_total_size(nr_cpu_ids * sizeof(u32));
+	/* OVS_VPORT_ATTR_OPTIONS(OVS_TUNNEL_ATTR_DST_PORT +
+	 *                        OVS_TUNNEL_ATTR_EXTENSION(OVS_VXLAN_EXT_GBP))
+	 */
+	msgsize += nla_total_size(nla_total_size(sizeof(u16)) +
+				  nla_total_size(nla_total_size(0)));
+
+	return msgsize;
+}
+
 static struct sk_buff *ovs_vport_cmd_alloc_info(void)
 {
-	return nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	return genlmsg_new(ovs_vport_cmd_msg_size(), GFP_KERNEL);
 }
 
 /* Called with ovs_mutex, only via ovs_dp_notify_wq(). */
@@ -2196,7 +2225,7 @@ struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, struct net *net,
 	struct sk_buff *skb;
 	int retval;
 
-	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	skb = ovs_vport_cmd_alloc_info();
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 23f629e94a36..56b2e2d1a749 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -406,6 +406,9 @@ int ovs_vport_set_upcall_portids(struct vport *vport, const struct nlattr *ids)
 	if (!nla_len(ids) || nla_len(ids) % sizeof(u32))
 		return -EINVAL;
 
+	if (nla_len(ids) / sizeof(u32) > nr_cpu_ids)
+		return -EINVAL;
+
 	old = ovsl_dereference(vport->upcall_portids);
 
 	vport_portids = kmalloc(sizeof(*vport_portids) + nla_len(ids),
-- 
2.43.0

