Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v2 13/14] net: macb: use context swapping in .set_ringparam()
From: Jakub Kicinski @ 2026-04-14  0:50 UTC (permalink / raw)
  To: Théo Lebrun
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Richard Cochran, Russell King,
	Paolo Valerio, Conor Dooley, Nicolai Buchwitz,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-13-af39f71d40b6@bootlin.com>

On Fri, 10 Apr 2026 21:52:01 +0200 Théo Lebrun wrote:
> ethtool_ops.set_ringparam() is implemented using the primitive close /
> update ring size / reopen sequence. Under memory pressure this does not
> fly: we free our buffers at close and cannot reallocate new ones at
> open. Also, it triggers a slow PHY reinit.
> 
> Instead, exploit the new context mechanism and improve our sequence to:
>  - allocate a new context (including buffers) first
>  - if it fails, early return without any impact to the interface
>  - stop interface
>  - update global state (bp, netdev, etc)
>  - pass buffer pointers to the hardware
>  - start interface
>  - free old context.
> 
> The HW disable sequence is inspired by macb_reset_hw() but avoids
> (1) setting NCR bit CLRSTAT and (2) clearing register PBUFRXCUT.
> 
> The HW re-enable sequence is inspired by macb_mac_link_up(), skipping
> over register writes which would be redundant (because values have not
> changed).
> 
> The generic context swapping parts are isolated into helper functions
> macb_context_swap_start|end(), reusable by other operations (change_mtu,
> set_channels, etc).

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 81beb67b206a..340ae7d881c6 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3081,6 +3081,89 @@ static void macb_configure_dma(struct macb *bp)
>  	}
>  }
>  
> +static void macb_context_swap_start(struct macb *bp)
> +{
> +	struct macb_queue *queue;
> +	unsigned long flags;
> +	unsigned int q;
> +	u32 ctrl;
> +
> +	/* Disable software Tx, disable HW Tx/Rx and disable NAPI. */
> +
> +	netif_tx_disable(bp->netdev);

AFAIR netif_tx_disable() just stops all the queues, if the NAPIs and
whatever else may wake queues is still running the queues may get
restarted right away.

> +	spin_lock_irqsave(&bp->lock, flags);
> +
> +	ctrl = macb_readl(bp, NCR);
> +	macb_writel(bp, NCR, ctrl & ~(MACB_BIT(RE) | MACB_BIT(TE)));
> +
> +	macb_writel(bp, TSR, -1);
> +	macb_writel(bp, RSR, -1);
> +
> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
> +		queue_writel(queue, IDR, -1);
> +		queue_readl(queue, ISR);
> +		if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
> +			queue_writel(queue, ISR, -1);
> +	}
> +

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH iwl-net v2 1/6] ixgbe: fix SWFW semaphore timeout for X550 family
From: Jacob Keller @ 2026-04-14  0:56 UTC (permalink / raw)
  To: Aleksandr Loktionov, intel-wired-lan, anthony.l.nguyen; +Cc: netdev
In-Reply-To: <20260408131154.2661818-2-aleksandr.loktionov@intel.com>

On 4/8/2026 6:11 AM, Aleksandr Loktionov wrote:
> According to FW documentation, the most time-consuming FW operation is
> Shadow RAM (SR) dump which takes up to 3.2 seconds.  For X550 family
> devices the module-update FW command can take over 4.5 s.  The default
> semaphore loop runs 200 iterations with a 5 ms sleep each, giving a
> maximum wait of 1 s -- not "200 ms" as previously stated in error.
> This is insufficient for X550 family FW update operations and causes
> spurious EBUSY failures.
> 
> Extend the SW/FW semaphore timeout from 1 s to 5 s (1000 iterations x
> 5 ms) for all three X550 variants: ixgbe_mac_X550, ixgbe_mac_X550EM_x,
> and ixgbe_mac_x550em_a.  All three share the same FW and exhibit the
> same worst-case latency.  Use three explicit mac.type comparisons rather
> than a range check so future MAC additions are not inadvertently
> captured.
> 
> The timeout variable is set immediately before the loop so the intent
> is clear, with an inline comment stating the resulting maximum delay.
> 
> Suggested-by: Soumen Karmakar <soumen.karmakar@intel.com>
> Cc: stable@vger.kernel.org
> Suggested-by: Marta Plantykow <marta.a.plantykow@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v1 -> v2:
>  - Squash with 0015 (X550EM extension); fix commit message ("200ms" was
>    wrong, actual default is 1 s); replace >= / <= range check with three
>    explicit mac.type == comparisons per Tony Nguyen.
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> index e67e2fe..a3c8f51 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> @@ -577,6 +577,15 @@ int ixgbe_acquire_swfw_sync_X540(struct ixgbe_hw *hw, u32 mask)
>  
>  	swmask |= swi2c_mask;
>  	fwmask |= swi2c_mask << 2;
> +	/* Extend to 5 s (1000 x 5 ms) for X550 family; default is 1 s
> +	 * (200 x 5 ms).  FW SR-dump takes up to 3.2 s; module-update up
> +	 * to 4.5 s.
> +	 */
> +	if (hw->mac.type == ixgbe_mac_X550 ||
> +	    hw->mac.type == ixgbe_mac_X550EM_x ||
> +	    hw->mac.type == ixgbe_mac_x550em_a)
> +		timeout = 1000;
> +

Typically, I would request and prefer if we would refactor timeout loops
like this to use read_poll_timeout() instead of open coding the loop
like we do here. The current loop is somewhat complicated so I can
understand it might be tricky to refactor.

The issue with open coded loops like this is that usleep_range has
variable length waiting time, so we sleep for anywhere between 5 and 6
milliseconds in this case. This makes the total amount of time waiting
cap at 6 seconds and not the expected 5, with the actual amount of time
waiting being variable based on when the usleep_range wakes up.

Perhaps this specific loop is a bit more complicated and not worth the
effort to refactor to read_poll_timeout, but its something I've been
trying to get us to cleanup (both inside Intel drivers and in other
places in the kernel) when modifying such open-coded timeout loops.

Since this loop body is a bit more complicated (it has to take and
release the semaphore to check the condition) I can accept it doesn't
make sense to modify it here for net.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

Thanks,
Jake

>  	for (i = 0; i < timeout; i++) {
>  		/* SW NVM semaphore bit is used for access to all
>  		 * SW_FW_SYNC bits (not just NVM)


^ permalink raw reply

* Re: [PATCH net-next v2 12/14] net: macb: re-read ISR inside IRQ handler locked section
From: Jakub Kicinski @ 2026-04-14  0:52 UTC (permalink / raw)
  To: theo.lebrun
  Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
	davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
	nb, vladimir.kondratiev, gregory.clement, benoit.monin,
	tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-12-af39f71d40b6@bootlin.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: re-read ISR inside IRQ handler locked section

This commit adds a second read of the ISR register inside the lock in
macb_interrupt() to handle a race condition where the status might have
been updated by concurrent hardware resets.

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 2633234012bae..3deab87c565b5 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -2190,6 +2190,13 @@ static irqreturn_t macb_interrupt(int irq, void *dev_id)
>  
>  	spin_lock(&bp->lock);
>  
> +	/* `status` stack variable might be stalled => re-read it */
> +	status = queue_readl(queue, ISR);

Will this second read discard interrupt events on older MACB variants that
use clear-on-read for the ISR?

If the hardware does not have the MACB_CAPS_ISR_CLEAR_ON_WRITE capability,
the first queue_readl call before the lock already cleared the active bits.

Overwriting the local status variable here sets it to 0, which permanently
drops the initially captured Rx/Tx completions and stalls the network
interface.

> +	if (unlikely(!status)) {
> +		spin_unlock(&bp->lock);
> +		return IRQ_NONE;

If the function reaches this point, the initial status read before the lock
was non-zero, meaning the device legitimately asserted the interrupt.

If a concurrent operation cleared the hardware state while this thread
waited for the lock, will returning IRQ_NONE falsely flag this as a
spurious interrupt?

Since the interrupt was actually raised by the device, doing this repeatedly
can cause the core IRQ subsystem to permanently disable the IRQ line.

Could this return IRQ_HANDLED instead?

> +	}
> +
>  	while (status) {
>  		/* close possible race with dev_close */
>  		if (unlikely(!netif_running(netdev))) {
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH net-next v2 13/14] net: macb: use context swapping in .set_ringparam()
From: Jakub Kicinski @ 2026-04-14  0:56 UTC (permalink / raw)
  To: theo.lebrun
  Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
	davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
	nb, vladimir.kondratiev, gregory.clement, benoit.monin,
	tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-13-af39f71d40b6@bootlin.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: use context swapping in .set_ringparam()

This patch modifies the ethtool set_ringparam implementation to use a
context swapping mechanism. Instead of closing and reopening the interface,
it allocates new buffers, stops the interface temporarily, swaps the active
context, and then frees the old one.

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 3deab87c565b5..d1536cb1e40bb 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3081,6 +3081,89 @@ static void macb_configure_dma(struct macb *bp)
[ ... ]
> +static void macb_context_swap_start(struct macb *bp)
> +{
[ ... ]
> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
> +		napi_disable(&queue->napi_rx);
> +		napi_disable(&queue->napi_tx);

This isn't a regression from this patch, but does calling napi_disable()
unconditionally crash on EMAC devices?

EMAC devices appear to not initialize NAPI, which leaves the NAPI struct
zero-initialized (n->dev is NULL), potentially causing a NULL pointer
dereference in napi_disable(). Additionally, does this path also blindly
call bp->macbgem_ops.mog_init_rings() which is NULL on EMAC?

> +		netdev_tx_reset_queue(netdev_get_tx_queue(bp->netdev, q));
> +		cancel_work_sync(&queue->tx_error_task);
> +	}
[ ... ]
> +static void macb_context_swap_end(struct macb *bp,
> +				  struct macb_context *new_ctx)
> +{
[ ... ]
> +	/* Swap contexts & give buffer pointers to HW. */
> +
> +	old_ctx = bp->ctx;
> +	bp->ctx = new_ctx;
> +	macb_init_buffers(bp);

Since the new context swap doesn't call phylink_stop(), could a concurrent
macb_mac_link_up() execution set the RE and TE bits in NCR during the window
where bp->lock is dropped?

If so, macb_init_buffers() would write new DMA ring base addresses to RBQP
and TBQP while the MAC is actively enabled, which violates hardware
specifications and could cause DMA corruption.

[ ... ]
> +	if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
> +		for (q = 0, queue = bp->queues; q < bp->num_queues;
> +		     ++q, ++queue) {
> +			queue_writel(queue, IER,
> +				     bp->rx_intr_mask |
> +				     MACB_TX_INT_FLAGS |
> +				     MACB_BIT(HRESP));
> +		}
> +	}

Hardware interrupts are disabled unconditionally for all devices in
macb_context_swap_start() by writing -1 to IDR.

Will this explicit check prevent EMAC devices from having their interrupts
re-enabled?

Since EMAC devices do not use NAPI and rely on hardware interrupts, leaving
them masked might silently halt all network traffic after a ring parameter
change.

> +
> +	ctrl = macb_readl(bp, NCR);
> +	macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));

Can this unprotected read-modify-write of the NCR register race with phylink
state changes?

If an asynchronous phylink callback like macb_mac_config() modifies NCR
while holding bp->lock, this unlocked write could overwrite and drop those
concurrent updates.

> +
> +	netif_tx_start_all_queues(bp->netdev);
> +
> +	/* Free old context. */
> +
> +	macb_free_consistent(old_ctx);

This isn't a regression from this patch, but does this context swap leak
pending TX packets and their DMA mappings?

When freeing the old context, macb_free_consistent() appears to only free
the pointer array. If it doesn't iterate from txq->tail to txq->head to
unmap the active DMA mappings and free the sk_buff structs, repeatedly
modifying ring parameters could cause a memory and DMA mapping leak.

^ permalink raw reply

* Re: [PATCH net-next v2 14/14] net: macb: use context swapping in .ndo_change_mtu()
From: Jakub Kicinski @ 2026-04-14  0:56 UTC (permalink / raw)
  To: theo.lebrun
  Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
	davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
	nb, vladimir.kondratiev, gregory.clement, benoit.monin,
	tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-14-af39f71d40b6@bootlin.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: use context swapping in .ndo_change_mtu()

This commit introduces a mechanism to perform context swapping when changing
the MTU while the interface is running. It allocates a new context,
reconfigures the hardware via a start and end swap process, and then frees
the old context to avoid PHY reinitialization and gracefully handle memory
pressure.

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index d1536cb1e40bb..79371718f8392 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3430,11 +3430,25 @@ static int macb_close(struct net_device *netdev)
>  
>  static int macb_change_mtu(struct net_device *netdev, int new_mtu)
>  {
> -	if (netif_running(netdev))
> -		return -EBUSY;
> +	struct macb *bp = netdev_priv(netdev);
> +	bool running = netif_running(netdev);
> +	struct macb_context *new_ctx;
> +
> +	if (running) {
> +		new_ctx = macb_context_alloc(bp, new_mtu,
> +					     bp->configured_rx_ring_size,
> +					     bp->configured_tx_ring_size);
> +		if (IS_ERR(new_ctx))
> +			return PTR_ERR(new_ctx);
> +
> +		macb_context_swap_start(bp);

Does macb_context_swap_start() gracefully halt the transmitter before the
context is freed?

It appears to attempt disabling the hardware transmitter by clearing the
TE and RE bits in the NCR register, but does not poll the hardware to ensure
it has reached an idle state.

Immediately after, macb_context_swap_end() frees the descriptor rings and data
buffers via macb_free_consistent().

If the MAC DMA engine is still mid-transfer, could it read or write to kernel
memory that has already been freed or reallocated?

Can this code cause an AB-BA deadlock between napi_disable() and
cancel_work_sync()?

In macb_context_swap_start(), the code calls napi_disable() for all queues,
and then immediately calls cancel_work_sync() on the error tasks.

Both of these work items internally invoke napi_disable().

If either error task runs concurrently, it will block forever inside its
napi_disable() waiting for the NAPI state bits (which were already set by the
first call in macb_context_swap_start()) to clear.

Consequently, could cancel_work_sync() hang indefinitely waiting for the
blocked work task to finish?

>  	}
>  
>  	WRITE_ONCE(netdev->mtu, new_mtu);
>  
> +	if (running)
> +		macb_context_swap_end(bp, new_ctx);

Does macb_context_swap_end() leak pending TX buffers and their DMA mappings?

When changing the MTU while the interface is running, macb_context_swap_end()
calls macb_free_consistent() to free the old DMA context. While RX buffers are
properly freed, for the TX queues it appears to only free the tracking array
pointer without iterating through the ring to unmap the DMA mappings and free
the actual buffers.

Any packets queued in the hardware ring at the moment of the MTU swap might
be permanently leaked, which could lead to memory and IOMMU space exhaustion
under active traffic.

Will MTU changes on EMAC devices permanently disable hardware interrupts?

macb_context_swap_start() disables all interrupts unconditionally by writing
-1 to the IDR register.

However, macb_context_swap_end() restores interrupts via the IER register
inside an explicit check:

    if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
        ...
    }

Since there is no else branch to re-enable interrupts for EMAC devices,
could changing the MTU on these devices leave interrupts permanently masked
and cause the network interface to silently stall?

Is there a missing spinlock when modifying the IER and NCR registers in
macb_context_swap_end()?

Modifications to the NCR and interrupt registers in macb_context_swap_start()
are protected by the spinlock. However, macb_context_swap_end() performs
lockless read-modify-write operations on both IER and NCR:

    ctrl = macb_readl(bp, NCR);
    macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));

If concurrent hardware events or background tasks take the lock and modify
NCR during this window, could their updates be silently overwritten?

> +
>  	return 0;
>  }

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH iwl-net v2 2/6] ixgbe: add bounds check for debugfs register access
From: Jacob Keller @ 2026-04-14  1:00 UTC (permalink / raw)
  To: Simon Horman, Aleksandr Loktionov
  Cc: intel-wired-lan, anthony.l.nguyen, netdev, Paul Greenwalt
In-Reply-To: <20260413103050.GL469338@kernel.org>

On 4/13/2026 3:30 AM, Simon Horman wrote:
> On Wed, Apr 08, 2026 at 03:11:50PM +0200, Aleksandr Loktionov wrote:
>> From: Paul Greenwalt <paul.greenwalt@intel.com>
>>
>> Prevent out-of-bounds MMIO accesses triggered through user-controlled
>> register offsets.  IXGBE_HFDR (0x15FE8) is the highest valid MMIO
>> register in the ixgbe register map; any offset beyond it would address
>> unmapped memory.
>>
>> Add a defense-in-depth check at two levels:
>>
>> 1. ixgbe_read_reg() -- the noinline register read accessor.  A
>>    WARN_ON_ONCE() guard here catches any future code path (including
>>    ioctl extensions) that might inadvertently pass an out-of-range
>>    offset without relying on higher layers to catch it first.
>>    ixgbe_write_reg() is a static inline called from the TX/RX hot path;
>>    adding WARN_ON_ONCE there would inline the check at every call site,
>>    so only the read path gets this guard.
>>
>> 2. ixgbe_dbg_reg_ops_write() -- the debugfs 'reg_ops' interface is the
>>    only current path where a raw, user-supplied offset enters the driver.
>>    Gating it before invoking the register accessors provides a clean,
>>    user-visible failure (silent ignore with no kernel splat) for
>>    deliberately malformed debugfs writes.
>>
>> Add a reg <= IXGBE_HFDR guard to both the read and write paths in
>> ixgbe_dbg_reg_ops_write(), and a WARN_ON_ONCE + early-return guard to
>> ixgbe_read_reg().
>>
>> Fixes: 91fbd8f081e2 ("ixgbe: added reg_ops file to debugfs")
>> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> ---
>> v1 -> v2:
>>  - Add Fixes: tag; reroute from iwl-next to iwl-net (security-relevant
>>    hardening for user-controllable out-of-bounds MMIO).
> 
> Thanks for the update.
> 
> And sorry for not thinking to ask this earlier: this patch
> addresses possible overruns of the mapped address space if the
> supplied value for reg is too large. But do we also need a
> guard against underrun if the value for reg is too small?
> 

I don't think so. This is bounds checking a register offset which is an
unsigned 32-bit value and begins at 0, so the map goes from 0 to
IXGBE_HFDR. Since the value is unsigned, if it does underflow somehow it
would then get caught by the check for IXGBE_HFDR right?

Thanks,
Jake

> ...


^ permalink raw reply

* Re: [RFC] Proposal: Add sysfs interface for PCIe TPH Steering Tag retrieval and configuration
From: fengchengwen @ 2026-04-14  1:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Bjorn Helgaas, linux-rdma, linux-pci, netdev,
	dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang
In-Reply-To: <20260413191930.GP21470@unreal>

On 4/14/2026 3:19 AM, Leon Romanovsky wrote:
> On Mon, Apr 13, 2026 at 08:04:10PM +0800, fengchengwen wrote:
>> On 4/13/2026 6:01 PM, Leon Romanovsky wrote:
>>> On Fri, Apr 10, 2026 at 10:30:52PM +0800, fengchengwen wrote:
>>>> Hi all,
>>>>
>>>> I'm writing to propose adding a sysfs interface to expose and configure the
>>>> PCIe TPH
>>>> Steering Tag for PCIe devices, which is retrieved inside the kernel.
>>>>
>>>>
>>>> Background: The TPH Steering Tag is tightly coupled with both a PCIe device
>>>> (identified
>>>> by its BDF) and a CPU core. It can only be obtained in kernel mode. To allow
>>>> user-space
>>>> applications to fetch and set this value securely and conveniently, we need
>>>> a standard
>>>> kernel-to-user interface.
>>>>
>>>>
>>>> Proposed Solution: Add several sysfs attributes under each PCIe device's
>>>> sysfs directory:
>>>> 1. /sys/bus/pci/devices/<BDF>/tph_mode to query the TPH mode (interrupt or
>>>> device specific)
>>>> 2. /sys/bus/pci/devices/<BDF>/tph_enable to control the TPH feature
>>>> 3. /sys/bus/pci/devices/<BDF>/tph_st to support both read and write
>>>> operations, e.g.:
>>>>    Read operation:
>>>>      echo "cpu=3" > /sys/bus/pci/devices/0000:01:00.0/tph_st
>>>>      cat /sys/bus/pci/devices/0000:01:00.0/tph_st
>>>>    Write operation:
>>>>      echo "index=10 st=123" > /sys/bus/pci/devices/0000:01:00.0/tph_st
>>>>
>>>>
>>>> The design strictly follows PCI subsystem sysfs standards and has the
>>>> following key properties:
>>>>
>>>> 1. Dynamic Visibility: The sysfs attributes will only be present for PCIe
>>>> devices that
>>>>    support TPH Steering Tag. Devices without TPH capability will not show
>>>> these nodes,
>>>>    avoiding unnecessary user confusion.
>>>>
>>>> 2. Permission Control: The attributes will use 0600 file permissions,
>>>> ensuring only
>>>>    privileged root users can read or write them, which satisfies security
>>>> requirements
>>>>    for hardware configuration interfaces.
>>>>
>>>> 3. Standard Implementation Location: The interface will be implemented in
>>>>    drivers/pci/pci-sysfs.c, the canonical location for all PCI device sysfs
>>>> attributes,
>>>>    ensuring consistency and maintainability within the PCI subsystem.
>>>>
>>>>
>>>> Why sysfs instead of alternatives like VFIO-PCI ioctl:
>>>>
>>>> - Universality: sysfs does not require binding the device to a special
>>>> driver such as
>>>>   vfio-pci. It is available to any privileged user-space component,
>>>> including system
>>>>   utilities, daemons, and monitoring tools.
>>>>
>>>> - Simplicity: Both user-space usage (cat/echo) and kernel implementation are
>>>>   straightforward, reducing code complexity and long-term maintenance cost.
>>>>
>>>> - Design Alignment: TPH Steering Tag is a generic PCIe device feature, not
>>>> specific to
>>>>   user-space drivers like DPDK or VFIO. Exposing it via sysfs matches the
>>>> kernel's
>>>>   standard pattern for hardware capabilities.
>>>>
>>>>
>>>> I look forward to your comments about this design before submitting the
>>>> final patch.
>>>
>>> You need to explain more clearly why this write functionality is useful
>>> and necessary outside the VFIO/RDMA context:
>>> https://lore.kernel.org/all/20260324234615.3731237-1-zhipingz@meta.com/
>>>
>>> AFAIK, for non-VFIO TPH callers, kernel has enough knowledge to set
>>> right ST values.
>>>
>>> There are several comments regarding the implementation, but those can wait
>>> until the rationale behind the proposal is fully clarified.
>>
>> Thanks for your review and comments.
>>
>> Let me clarify the rationale behind this user-space sysfs interface:
>>
>> 1. VFIO is just one of the user-space device access frameworks.
>>    There are many other in-kernel frameworks that expose devices
>>    to user space, such as UIO, UACCE, etc., which may also require
>>    TPH Steering Tag support.
>>
>> 2. The kernel can automatically program Steering Tags only when
>>    the device provides a standard ST table in MSI-X or config space.
>>    However, many devices implement vendor-specific or platform-specific
>>    Steering Tag programming methods that cannot be fully handled
>>    by the generic kernel code.
>>
>> 3. For such devices, user-space applications or framework drivers
>>    need to retrieve and configure TPH Steering Tags directly.
>>    A unified sysfs interface allows all user-space frameworks
>>    (not just VFIO) to use a common, standard way to manage
>>    TPH Steering Tags, rather than implementing duplicated logic
>>    in each subsystem.
>>
>> This interface provides a uniform method for any user-space
>> device access solution to work with TPH, which is why I believe
>> it is useful and necessary beyond the VFIO/RDMA case.
> 
> I understand the rationale for providing a read interface, for example for
> debugging, but I do not see any justification for a write interface.

Thank you for the comment!

As I explained, read interface is not only for debugging. It was used to
such device who don't declare ST location in MSI-X or config-space, the following
is Intel X710 NIC device's lspci output (only TPH part):

	Capabilities: [1a0 v1] Transaction Processing Hints
		Device specific mode supported
		No steering table available

So we could not config the ST for device on kernel because it's vendor specific.
But we could configure ST by it's vendor user-space driver, in this case, we
should get ST from kernel to user-space.


As for write interface, which was used to devices whose ST location is known, I
think we could simple it, and only passing <index, cpu>, then kernel query cpu's ST
and set to corresponding index.

> 
> TPH is defined by the PCI specification. If a device intends to support it,
> then it should conform to the specification.

According to the PCI specification 6.17.3 ST Modes of Operation:

  Device Specific Mode: It is recommended for the Function to use a Steering Tag value from an ST Table entry, but it is not required.

  In the Device Specific Mode of operation, the assignment of the Steering Tags to Requests is device specific. The number
  of Steering Tags used by the Function is permitted to be different than the number of interrupt vectors allocated for the
  Function, irrespective of the ST Table location, and Steering Tag values used in Requests are not required to come from
  the architected ST Table.

Thanks

> 
> Thanks
> 
> 
>>
>> Thanks
>>
>>>
>>> Thanks
>>>
>>>>
>>>> Best regards,
>>>> Chengwen Feng
>>>>
>>
>>


^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH iwl-next 1/2] i40e: implement basic per-queue stats
From: Jacob Keller @ 2026-04-14  1:22 UTC (permalink / raw)
  To: Paolo Abeni, Loktionov, Aleksandr,
	intel-wired-lan@lists.osuosl.org
  Cc: Nguyen, Anthony L, Kitszel, Przemyslaw, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev@vger.kernel.org
In-Reply-To: <2bae0dc2-4035-4fe2-a87e-dc5dae6c7df5@redhat.com>

On 4/8/2026 7:44 AM, Paolo Abeni wrote:
> On 4/8/26 2:07 PM, Loktionov, Aleksandr wrote:
>>> -----Original Message-----
>>> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
>>> Of Paolo Abeni
>>> Sent: Wednesday, April 8, 2026 1:44 PM
>>> To: intel-wired-lan@lists.osuosl.org
>>> Cc: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel,
>>> Przemyslaw <przemyslaw.kitszel@intel.com>; Andrew Lunn
>>> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
>>> Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>;
>>> Alexei Starovoitov <ast@kernel.org>; Daniel Borkmann
>>> <daniel@iogearbox.net>; Jesper Dangaard Brouer <hawk@kernel.org>; John
>>> Fastabend <john.fastabend@gmail.com>; Stanislav Fomichev
>>> <sdf@fomichev.me>; netdev@vger.kernel.org
>>> Subject: [Intel-wired-lan] [PATCH iwl-next 1/2] i40e: implement basic
>>> per-queue stats
>>>
>>> Only expose the counters currently available (bytes, packets); add
>>> account for base stats to deal with ring clear.
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>>  drivers/net/ethernet/intel/i40e/i40e.h      |   7 ++
>>>  drivers/net/ethernet/intel/i40e/i40e_main.c | 133
>>> ++++++++++++++++++++
>>>  2 files changed, 140 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h
>>> b/drivers/net/ethernet/intel/i40e/i40e.h
>>> index dcb50c2e1aa2..fe642c464e9c 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>>> @@ -836,16 +836,23 @@ struct i40e_vsi {
>>>  	struct i40e_eth_stats eth_stats;
>>>  	struct i40e_eth_stats eth_stats_offsets;
>>>  	u64 tx_restart;
>>
>> ...
>>
>>> +static void i40e_zero_tx_ring_stats(struct netdev_queue_stats_tx *tx)
>>> {
>>> +	tx->bytes = 0;
>>> +	tx->packets = 0;
>>> +	tx->stop = 0;
>>> +	tx->wake = 0;
>>> +	tx->hw_drops = 0;
>>> +}
>>> +
>>> +static void i40e_add_tx_ring_stats(struct i40e_ring *tx_ring,
>>> +				   struct netdev_queue_stats_tx *tx) {
>>> +	u64 bytes, packets;
>>> +	unsigned int start;
>>> +
>>> +	do {
>>> +		start = u64_stats_fetch_begin(&tx_ring->syncp);
>>> +		bytes = tx_ring->stats.bytes;
>>> +		packets = tx_ring->stats.packets;
>>> +	} while (u64_stats_fetch_retry(&tx_ring->syncp, start));
>>> +
>>> +	tx->bytes += bytes;
>>> +	tx->packets += packets;
>>> +
>>> +	tx->stop += tx_ring->tx_stats.tx_stopped;
>>> +	tx->wake += tx_ring->tx_stats.restart_queue;
>>> +	tx->hw_drops += tx_ring->tx_stats.tx_busy; }
>> Why the reads are outside the seqlock region? 
>> On 32-bit kernels, unprotected u64 reads can tear IMHO
> 

Paolo is correct that just moving these into the do/while loop is
useless, since the increments aren't protected properly.

> Currently there is no seqlock on the write side; to keep the series
> small I preferred avoid fixing the pre-existing issue. In any case I
> think moving stop, wake, hw_drops (and others) under seqlock protection
> is an orthogonal change.
> 
> /P

I ended up doing some work on ice to fix a lot of similar issues a few
months ago.. The intel drivers weren't using u64_stats_t, and several
error/debug counters were not being handled appropriately.

I'd personally prefer fixing existing issues before we compound them by
adding even more incorrect code. Even on 64bit systems we need to use
READ_ONCE/WRITE_ONCE or local64_t, which the u64_stats_t type uses
internally.

I can understand the desire to limit scope of work, and the issues may
feel "minor" but ultimately I'd rather not see us continue making the
problem bigger instead of fixing it.

However.. if other maintainers feel strongly that the additions are
acceptable despite being incorrect w.r.t. the stats logic, I suppose we
can continue this and have someone from Intel look into cleaning up the
mess like I did for ice.

It looks like the series has some other requested changes either way though.

^ permalink raw reply

* [PATCH net] tcp: make probe0 timer handle expired user timeout
From: Altan Hacigumus @ 2026-04-14  1:36 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S . Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, Enke Chen, Altan Hacigumus

tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue.  If elapsed probing time
already exceeds the configured TCP_USER_TIMEOUT, the subtraction
underflows and yields a large value.

Handle this expiration case similarly to tcp_clamp_rto_to_user_timeout().

Fixes: 344db93ae3ee ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
---
 net/ipv4/tcp_timer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 5a14a53a3c9e..4a43356a4e06 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -50,7 +50,8 @@ static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	u32 remaining, user_timeout;
+	u32 user_timeout;
+	s32 remaining;
 	s32 elapsed;
 
 	user_timeout = READ_ONCE(icsk->icsk_user_timeout);
@@ -61,6 +62,8 @@ u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
 	if (unlikely(elapsed < 0))
 		elapsed = 0;
 	remaining = msecs_to_jiffies(user_timeout) - elapsed;
+	if (remaining <= 0)
+		return 1;
 	remaining = max_t(u32, remaining, TCP_TIMEOUT_MIN);
 
 	return min_t(u32, remaining, when);
-- 
2.43.0


^ permalink raw reply related

* Re: commit 0c4f1c02d27a880b cause a deadlock issue
From: He, Guocai (CN) @ 2026-04-14  1:38 UTC (permalink / raw)
  To: Greg KH, Thorsten Leemhuis
  Cc: Berg, Johannes, Friend, Linux kernel regressions list,
	Korenblit, Miriam Rachel, stable@vger.kernel.org
In-Reply-To: <2026041330-groggy-ruse-5e27@gregkh>

OK, I will send a patch later.

________________________________________
From: Greg KH <gregkh@linuxfoundation.org>
Sent: Monday, April 13, 2026 8:55 PM
To: Thorsten Leemhuis
Cc: He, Guocai (CN); Berg, Johannes; Friend; Linux kernel regressions list; Korenblit, Miriam Rachel; stable@vger.kernel.org
Subject: Re: commit 0c4f1c02d27a880b cause a deadlock issue

CAUTION: This email comes from a non Wind River email account!
Do not click links or open attachments unless you recognize the sender and know the content is safe.

On Mon, Apr 13, 2026 at 01:58:56PM +0200, Thorsten Leemhuis wrote:
> On 4/3/26 15:00, Korenblit, Miriam Rachel wrote:
> >> From: Greg KH <gregkh@linuxfoundation.org>
> >> On Fri, Apr 03, 2026 at 12:44:48PM +0000, Korenblit, Miriam Rachel wrote:
> >>>> -----Original Message-----
> >>>> From: Greg KH <gregkh@linuxfoundation.org>
> >>>> On Fri, Apr 03, 2026 at 11:08:46AM +0000, He, Guocai (CN) wrote:
> >>>>> No, The mainline have no this issue.
> >>>>> The changes of 0c4f1c02d27a880b is not in mainline.
> >>>>
> >>>> That does not make sense, that commit is really commit e1696c8bd005
> >>>> ("wifi: cfg80211: stop NAN and P2P in cfg80211_leave") which is in
> >>>> all of the following releases:
> >>>>  5.10.252 5.15.202 6.1.165 6.6.128 6.12.75 6.18.14 6.19.4 7.0-rc1
> >>>> confused,
> >>> The change is indeed in mainline, but the locking situation in
> >>> mainline is totally different (that mutex does not even exist there)
> >>> Therefore, the issue is not supposed to happen in mainline.
> >>
> >> Ok, does that commit now need to be reverted from some of the stable branches?
> >> If so, which ones?
> >
> > From every version which is < 6.7.
>
> Greg, do you still have this in your todo mail queue somewhere? Just
> wondering, as last weeks 6.6.y released afics lacked a revert of
> e1696c8bd0056b ("wifi: cfg80211: stop NAN and P2P in cfg80211_leave") --
> and I cannot spot one in your public stable queue either.
>
> These are the commits that according to Miri need to be reverted if I
> understood things right:
>
> v6.6.128 (4d7a05da767e5c), v6.1.165 (0c4f1c02d27a88), v5.15.202
> (31344ffecd7a34), v5.10.252 (d91240f24e831d)

It is, yes, my queue is huge :(

It's fastest if someone sends me the reverts and I can easily apply them
that way.  Otherwise it takes me a bit to do each one manually :(

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH net-next] pppoe: optimize hash with word access
From: Qingfang Deng @ 2026-04-14  1:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Guillaume Nault, Kees Cook, Eric Woudstra, netdev, linux-kernel
In-Reply-To: <CANn89iKmxWiCTt7nVk-DZ4R_KsYDYbPwQ5f7Hp5F8hWmr+zc=g@mail.gmail.com>

April 13, 2026 at 4:42 PM, Eric Dumazet wrote:

> 
> On Sun, Apr 12, 2026 at 8:52 PM Qingfang Deng <qingfang.deng@linux.dev> wrote:
> 
> net-next is closed.
> 
> https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
> 
> Also I would suggest using hash32(hash, PPPOE_HASH_BITS)

Thanks for the info, but I would like to keep the same algorithm.

^ permalink raw reply

* Re: [PATCH net-next v7 00/15] net: sleepable ndo_set_rx_mode
From: Jakub Kicinski @ 2026-04-14  2:08 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: netdev, davem, edumazet, pabeni
In-Reply-To: <20260413171131.550126-1-sdf@fomichev.me>

On Mon, 13 Apr 2026 10:11:16 -0700 Stanislav Fomichev wrote:
> This series adds a new ndo_set_rx_mode_async callback that enables
> drivers to handle address list updates in a sleepable context. The
> current ndo_set_rx_mode is called under the netif_addr_lock spinlock
> with BHs disabled, which prevents drivers from sleeping. This is
> problematic for ops-locked drivers that need to sleep.

Hi hi, hit this on the new(ish) queue leasing test with bnxt (debug
kernel):

| [ 1148.733157] kselftest: Running tests in drivers/net/hw
[ 1151.485032] ref_tracker: reference already released.
[ 1151.491522] ref_tracker: allocated in:
[ 1151.496526]  __dev_set_rx_mode+0x398/0x4f0
[ 1151.501923]  dev_mc_add+0xe7/0x100
[ 1151.506537]  igmp6_group_added+0x31c/0x400
[ 1151.511930]  __ipv6_dev_mc_inc+0x282/0x590
[ 1151.517320]  __ipv6_sock_mc_join+0x40d/0x7c0
[ 1151.522908]  do_ipv6_setsockopt+0x3504/0x3700
[ 1151.528594]  ipv6_setsockopt+0x7e/0xf0
[ 1151.533596]  do_sock_setsockopt+0x164/0x3b0
[ 1151.539089]  __sys_setsockopt+0xe4/0x150
[ 1151.544276]  __x64_sys_setsockopt+0xbd/0x180
[ 1151.549866]  do_syscall_64+0xf3/0x5e0
[ 1151.554763]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1151.561229] ref_tracker: freed in:
[ 1151.565840]  netdev_rx_mode_work+0x205/0x410
[ 1151.571429]  process_one_work+0xdf5/0x1410
[ 1151.576824]  worker_thread+0x4f1/0xd60
[ 1151.581827]  kthread+0x364/0x460
[ 1151.586246]  ret_from_fork+0x4a4/0x720
[ 1151.591251]  ret_from_fork_asm+0x11/0x20
[ 1151.596704] ------------[ cut here ]------------
[ 1151.602747] WARNING: lib/ref_tracker.c:322 at ref_tracker_free.cold+0x10a/0x1f9, CPU#7: kworker/7:2/5262
[ 1151.614292] Modules linked in:
[ 1151.618626] CPU: 7 UID: 0 PID: 5262 Comm: kworker/7:2 Not tainted 7.0.0-rc7-hmtc-g3cb8f09d448a #1 PREEMPT(full) 
[ 1151.630928] Hardware name: Giga Computing E163-Z34-AAH1-000/MZ33-DC1-000, BIOS R30_F44 12/24/2025
[ 1151.641762] Workqueue: events netdev_rx_mode_work
[ 1151.647920] RIP: 0010:ref_tracker_free.cold+0x10a/0x1f9
[ 1151.654646] Code: e0 2a 0f b6 04 02 84 c0 74 04 3c 03 7e 78 8b 7b 18 4c 89 04 24 e8 71 b2 e4 01 4c 8b 04 24 4c 89 c6 48 89 ef e8 32 92 05 04 90 <0f> 0b 90 ba ea ff ff ff e9 06 de e4 01 4c 89 ea 48 89 df 4c 89 04
[ 1151.676612] RSP: 0018:ffa0000035f6fb18 EFLAGS: 00010296
[ 1151.683343] RAX: 0000000000000000 RBX: ff110001f7ae9460 RCX: 0000000000000001
[ 1151.692224] RDX: 00000000ffffffff RSI: 1ffffffff1841334 RDI: 0000000000000001
[ 1151.701102] RBP: ff110001a3c12618 R08: 0000000000000000 R09: 0000000000000000
[ 1151.709973] R10: 0000000000000007 R11: 0000000000000001 R12: 1ff4000006bedf67
[ 1151.718835] R13: ff110001f7ae9478 R14: ff110001fd1949fc R15: ffffffff8a80b9a0
[ 1151.727814] FS:  0000000000000000(0000) GS:ff11001882899000(0000) knlGS:0000000000000000
[ 1151.737696] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1151.744947] CR2: 00007f83287c3574 CR3: 00000017e26b0002 CR4: 0000000000771ef0
[ 1151.753758] PKRU: 55555554
[ 1151.757588] Call Trace:
[ 1151.761130]  <TASK>
[ 1151.764282]  ? ref_tracker_alloc+0x460/0x460
[ 1151.769872]  ? netdev_rx_mode_work+0x205/0x410
[ 1151.775659]  ? process_one_work+0xdf5/0x1410
[ 1151.781249]  ? worker_thread+0x4f1/0xd60
[ 1151.786450]  ? kthread+0x364/0x460
[ 1151.791066]  ? ret_from_fork+0x4a4/0x720
[ 1151.796267]  ? ret_from_fork_asm+0x11/0x20
[ 1151.801666]  ? mark_held_locks+0x40/0x70
[ 1151.806870]  netdev_rx_mode_work+0x205/0x410
[ 1151.812461]  ? trace_workqueue_execute_start+0x9b/0x180
[ 1151.819124]  ? process_one_work+0xdb4/0x1410
[ 1151.824712]  process_one_work+0xdf5/0x1410
[ 1151.830111]  ? pwq_dec_nr_in_flight+0x710/0x710
[ 1151.835992]  ? lock_acquire.part.0+0xbc/0x260
[ 1151.841686]  worker_thread+0x4f1/0xd60
[ 1151.846694]  ? rescuer_thread+0x1320/0x1320
[ 1151.852188]  kthread+0x364/0x460
[ 1151.856609]  ? trace_irq_enable.constprop.0+0x9b/0x180
[ 1151.863180]  ? kthread_affine_node+0x330/0x330
[ 1151.868965]  ret_from_fork+0x4a4/0x720
[ 1151.873965]  ? arch_exit_to_user_mode_prepare.isra.0+0xb0/0xb0
[ 1151.881315]  ? __switch_to+0x540/0xd00
[ 1151.886320]  ? kthread_affine_node+0x330/0x330
[ 1151.892108]  ret_from_fork_asm+0x11/0x20
| [ 1151.905076] hardirqs last  enabled at (7665): [<ffffffff847c61ea>] __up_console_sem+0x5a/0x70
| [ 1151.915450] hardirqs last disabled at (7676): [<ffffffff847c61cf>] __up_console_sem+0x3f/0x70
| [ 1151.925823] softirqs last  enabled at (7600): [<ffffffff8462d50e>] handle_softirqs+0x60e/0x920
| [ 1151.936291] softirqs last disabled at (7595): [<ffffffff8462dca2>] irq_exit_rcu+0xa2/0xf0
| [ 1151.946275] ---[ end trace 0000000000000000 ]---
[ 1152.247817] ref_tracker: netdev@ff110001a3c12618 has 1/1 users at\x0a     __dev_set_rx_mode+0x398/0x4f0\x0a     dev_mc_add+0xe7/0x100\x0a     igmp6_group_added+0x31c/0x400\x0a     __ipv6_dev_mc_inc+0x282/0x590\x0a     addrconf_dad_begin+0x13c/0x540\x0a     addrconf_dad_work+0x170/0x930\x0a     process_one_work+0xdf5/0x1410\x0a     worker_thread+0x4f1/0xd60\x0a     kthread+0x364/0x460\x0a     ret_from_fork+0x4a4/0x720\x0a     ret_from_fork_asm+0x11/0x20\x0a
[ 1152.318767] ------------[ cut here ]------------
[ 1152.325512] WARNING: lib/ref_tracker.c:246 at ref_tracker_dir_exit+0x466/0x7e0, CPU#27: ip/15056
[ 1152.336246] Modules linked in:
[ 1152.340578] CPU: 27 UID: 0 PID: 15056 Comm: ip Tainted: G        W           7.0.0-rc7-hmtc-g3cb8f09d448a #1 PREEMPT(full) 
[ 1152.353871] Tainted: [W]=WARN
[ 1152.357997] Hardware name: Giga Computing E163-Z34-AAH1-000/MZ33-DC1-000, BIOS R30_F44 12/24/2025
[ 1152.368759] RIP: 0010:ref_tracker_dir_exit+0x466/0x7e0
[ 1152.375326] Code: e8 03 42 80 3c 38 00 0f 85 57 02 00 00 48 8b 03 49 89 de 49 39 dc 0f 85 2e ff ff ff 48 8b 74 24 10 48 89 ef e8 cb aa 20 02 90 <0f> 0b 90 48 8d 5d 44 be 04 00 00 00 48 89 df e8 a6 ee ec fe 48 89
[ 1152.397212] RSP: 0018:ffa00000380470d0 EFLAGS: 00010286
[ 1152.403876] RAX: 0000000000000000 RBX: ff110001a3c12668 RCX: 0000000000000001
[ 1152.412685] RDX: 00000000ffffffff RSI: 1ffffffff1841334 RDI: 0000000000000001
[ 1152.421496] RBP: ff110001a3c12618 R08: 0000000000000000 R09: 0000000000000000
[ 1152.430303] R10: 000000000000000b R11: 0000000000000001 R12: ff110001a3c12668
[ 1152.439103] R13: dead000000000100 R14: ff110001a3c12668 R15: dffffc0000000000
[ 1152.447911] FS:  00007f05f6331840(0000) GS:ff11001883299000(0000) knlGS:0000000000000000
[ 1152.457811] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1152.465057] CR2: 00007f68f567f440 CR3: 00000001a9d12003 CR4: 0000000000771ef0
[ 1152.473863] PKRU: 55555554
[ 1152.477694] Call Trace:
[ 1152.481231]  <TASK>
[ 1152.484384]  ? rcu_is_watching+0x15/0xd0
[ 1152.489585]  ? ref_tracker_free+0x870/0x870
[ 1152.495077]  ? lockdep_hardirqs_on+0x8c/0x130
[ 1152.500764]  ? __stack_depot_get_stack_record+0x10/0x10
[ 1152.507427]  ? kfree+0x151/0x610
[ 1152.511847]  free_netdev+0x402/0x870
[ 1152.516648]  ? do_raw_spin_unlock+0x59/0x250
[ 1152.522241]  ? _raw_spin_unlock_irqrestore+0x40/0x80
[ 1152.528613]  netdev_run_todo+0x83b/0xbf0
[ 1152.533804]  ? unregister_netdevice_queued+0x80/0x80
[ 1152.540172]  ? mutex_is_locked+0x1c/0x50
[ 1152.545373]  ? generic_xdp_install+0x400/0x400
[ 1152.551158]  ? mutex_is_locked+0x1c/0x50
[ 1152.556349]  ? rtnl_is_locked+0x15/0x20
[ 1152.561450]  ? unregister_netdevice_queued+0x16/0x80
[ 1152.567826]  rtnl_dellink+0x4a5/0xae0
[ 1152.572737]  ? rtnl_mdb_del+0x580/0x580
[ 1152.577839]  ? __lock_acquire+0x508/0xc10
[ 1152.583148]  ? avc_has_perm_noaudit+0xf6/0x300
[ 1152.588935]  ? sched_init_numa+0x7f7/0xc30
[ 1152.594331]  ? mark_usage+0x61/0x170
[ 1152.599142]  ? __lock_acquire+0x508/0xc10
[ 1152.604443]  ? lock_acquire.part.0+0xbc/0x260
[ 1152.610130]  ? find_held_lock+0x2b/0x80
[ 1152.615237]  ? rtnl_mdb_del+0x580/0x580
[ 1152.620339]  ? __lock_release.isra.0+0x6b/0x1a0
[ 1152.626215]  ? rtnl_mdb_del+0x580/0x580
[ 1152.631319]  rtnetlink_rcv_msg+0x6fd/0xbd0
[ 1152.636718]  ? rtnl_fdb_dump+0x690/0x690
[ 1152.641917]  ? __lock_acquire+0x508/0xc10
[ 1152.647218]  ? lock_acquire.part.0+0xbc/0x260
[ 1152.652905]  ? find_held_lock+0x2b/0x80
[ 1152.658008]  netlink_rcv_skb+0x14e/0x3a0
[ 1152.663210]  ? rtnl_fdb_dump+0x690/0x690
[ 1152.668412]  ? netlink_ack+0xcd0/0xcd0
[ 1152.673422]  ? netlink_deliver_tap+0xc5/0x330
[ 1152.679110]  ? netlink_deliver_tap+0x13c/0x330
[ 1152.684894]  netlink_unicast+0x47c/0x740
[ 1152.690098]  ? netlink_attachskb+0x800/0x800
[ 1152.695687]  ? sock_has_perm+0x283/0x3f0
[ 1152.700879]  netlink_sendmsg+0x75b/0xc90
[ 1152.706081]  ? netlink_unicast+0x740/0x740
[ 1152.711476]  ? __lock_release.isra.0+0x6b/0x1a0
[ 1152.717357]  ? __import_iovec+0x36c/0x620
[ 1152.722655]  ? __might_fault+0x97/0x140
[ 1152.727749]  __sock_sendmsg+0xca/0x180
[ 1152.732747]  ? move_addr_to_kernel+0x36/0xf0
[ 1152.738328]  ____sys_sendmsg+0x609/0x830
[ 1152.743529]  ? copy_msghdr_from_user+0x2a0/0x460
[ 1152.749510]  ? kernel_sendmsg+0x30/0x30
[ 1152.754611]  ? move_addr_to_kernel+0xf0/0xf0
[ 1152.760204]  ? kasan_save_stack+0x3d/0x50
[ 1152.765503]  ? kasan_save_stack+0x2f/0x50
[ 1152.770799]  ? kasan_record_aux_stack+0x9b/0xc0
[ 1152.776681]  ? __call_rcu_common.constprop.0+0xb2/0xa10
[ 1152.783337]  ? kmem_cache_free+0x3d0/0x5f0
[ 1152.788733]  ? fput_close_sync+0xde/0x1b0
[ 1152.794032]  ? __x64_sys_close+0x8b/0xf0
[ 1152.799235]  ___sys_sendmsg+0x14e/0x1d0
[ 1152.804337]  ? copy_msghdr_from_user+0x460/0x460
[ 1152.810330]  ? rcu_is_watching+0x15/0xd0
[ 1152.815532]  ? trace_irq_enable.constprop.0+0x9b/0x180
[ 1152.822106]  __sys_sendmsg+0x145/0x1f0
[ 1152.827110]  ? __sys_sendmsg_sock+0x20/0x20
[ 1152.832608]  ? do_raw_spin_unlock+0x59/0x250
[ 1152.838196]  ? rcu_is_watching+0x15/0xd0
[ 1152.843398]  do_syscall_64+0xf3/0x5e0
[ 1152.848307]  ? trace_hardirqs_off+0xd/0x30
[ 1152.853703]  ? exc_page_fault+0xda/0xf0
[ 1152.858805]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1152.865267] RIP: 0033:0x7f05f656b22e
[ 1152.870077] Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa
[ 1152.891961] RSP: 002b:00007fff388be520 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 1152.901262] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f05f656b22e
[ 1152.910070] RDX: 0000000000000000 RSI: 00007fff388be5d0 RDI: 0000000000000003
[ 1152.918880] RBP: 00007fff388be530 R08: 0000000000000000 R09: 0000000000000000
[ 1152.927690] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 1152.936499] R13: 0000000069dd8ee8 R14: 000056113dabf040 R15: 0000000000000000


Either decoding failed or I forgot where NIPA-HW puts the output :S
But I think it's clear enough..

^ permalink raw reply

* [PATCH ipsec-next v3] xfrm: cleanup error path in xfrm_add_policy()
From: Deepanshu Kartikey @ 2026-04-14  2:09 UTC (permalink / raw)
  To: steffen.klassert, herbert, davem, edumazet, kuba, pabeni, horms,
	sd
  Cc: netdev, linux-kernel, Deepanshu Kartikey

Replace the open-coded manual cleanup in the error path of
xfrm_add_policy() with xfrm_policy_destroy(), which already
handles all the necessary cleanup internally. This is consistent
with how xfrm_policy_construct() handles its own error paths.

The walk.dead flag must be set before calling xfrm_policy_destroy()
as required by BUG_ON(!policy->walk.dead).

Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
---
v3:
  - Changed prefix to ipsec-next as this is a cleanup
  - Dropped syzbot references as suggested by Sabrina Dubroca
v2:
  - Reworded commit message to reflect cleanup rather than bugfix
    as suggested by Sabrina Dubroca
  - Removed incorrect Fixes: and Closes: tags
  - Corrected subject prefix to PATCH ipsec
---
 net/xfrm/xfrm_user.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index d56450f61669..ae144d1e4a65 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2267,9 +2267,8 @@ static int xfrm_add_policy(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	if (err) {
 		xfrm_dev_policy_delete(xp);
-		xfrm_dev_policy_free(xp);
-		security_xfrm_policy_free(xp->security);
-		kfree(xp);
+		xp->walk.dead = 1;
+		xfrm_policy_destroy(xp);
 		return err;
 	}
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v2] net/smc: cap allocation order for SMC-R physically contiguous buffers
From: D. Wythe @ 2026-04-14  2:10 UTC (permalink / raw)
  To: Simon Horman
  Cc: D. Wythe, David S. Miller, Dust Li, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi,
	Tony Lu, Wen Gu, linux-kernel, linux-rdma, linux-s390, netdev,
	oliver.yang, pasic
In-Reply-To: <20260410151631.GY469338@kernel.org>

On Fri, Apr 10, 2026 at 04:16:31PM +0100, Simon Horman wrote:
> On Tue, Apr 07, 2026 at 08:43:37PM +0800, D. Wythe wrote:
> > The alloc_pages() cannot satisfy requests exceeding MAX_PAGE_ORDER,
> > and attempting such allocations will lead to guaranteed failures
> > and potential kernel warnings.
> > 
> > For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER.
> > This ensures the attempts to allocate the largest possible physically
> > contiguous chunk succeed, instead of failing with an invalid order.
> > This also avoids redundant "try-fail-degrade" cycles in
> > __smc_buf_create().
> > 
> > For SMCR_MIXED_BUFS, no cap is needed: if the order exceeds
> > MAX_PAGE_ORDER, alloc_pages() will silently fail (__GFP_NOWARN)
> > and automatically fall back to virtual memory.
> > 
> > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > ---
> > Changes v1 -> v2:
> > https://lore.kernel.org/netdev/20260312082154.36971-1-alibuda@linux.alibaba.com/
> > 
> > - Move the bufsize cap from smcr_new_buf_create() up to
> >   __smc_buf_create(), which is simpler and avoids touching
> >   the allocation logic itself.
> 
> The nit below notwithstanding, this looks good to me.
> 
> Reviewed-by: Simon Horman <horms@kernel.org>
> 
> > ---
> >  net/smc/smc_core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> > index e2d083daeb7e..cdd881746e21 100644
> > --- a/net/smc/smc_core.c
> > +++ b/net/smc/smc_core.c
> > @@ -2440,6 +2440,10 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> >  		/* use socket send buffer size (w/o overhead) as start value */
> >  		bufsize = smc->sk.sk_sndbuf / 2;
> >  
> > +	/* limit bufsize for physically contiguous buffers */
> > +	if (!is_smcd && lgr->buf_type == SMCR_PHYS_CONT_BUFS)
> > +		bufsize = min_t(int, bufsize, (PAGE_SIZE << MAX_PAGE_ORDER));
> 
> nit: I think min() is sufficient here, and the inner parentheses are
>      unnecessary

Hi Simon,

I think min_t is required here because min() triggers a signedness
error:

././include/linux/compiler_types.h:706:38: error: call to
‘__compiletime_assert_950’ declared with attribute error: min(bufsize,
((1UL) << 12) << 10) signedness error

The inner parentheses can be removed, though.

D. Wythe

> 
> > +
> >  	for (bufsize_comp = smc_compress_bufsize(bufsize, is_smcd, is_rmb);
> >  	     bufsize_comp >= 0; bufsize_comp--) {
> >  		if (is_rmb) {
> > -- 
> > 2.45.0
> > 

^ permalink raw reply

* Re: [PATCH net-next v2 0/3] Follow-ups to nk_qlease net selftests
From: Jakub Kicinski @ 2026-04-14  2:12 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, dw, pabeni, razor
In-Reply-To: <20260413220809.604592-1-daniel@iogearbox.net>

On Tue, 14 Apr 2026 00:08:03 +0200 Daniel Borkmann wrote:
> This is a set of follow-ups addressing [0]:
> 
> - Split netdevsim tests from HW tests in nk_qlease and move the SW
>   tests under selftests/net/
> - Remove multiple ksft_run()s to fix the recently enforced hard-fail
> - Move all the setup inside the test cases for the ones under
>   selftests/net/ (I'll defer the HW ones to David)
> - Add more test coverage related to queue leasing behavior and corner
>   cases, so now we have 45 tests in nk_qlease.py with netdevsim
>   which does not need special HW

LGTM, thanks!

I'll let it run overnight in the CI to shake out any latent flakiness
(and the crash which I think is from Stan's series).

Could you cook up one more follow up to enable VETH in the config?
We're getting:

# # Exception| Traceback (most recent call last):
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/ksft.py", line 420, in ksft_run
# # Exception|     func(*args)
# # Exception|     ~~~~^^^^^^^
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/drivers/net/hw/./nk_qlease.py", line 393, in test_veth_queue_create
# # Exception|     ip("link add veth0 type veth peer name veth1")
# # Exception|     ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 238, in ip
# # Exception|     return tool('ip', args, json=json, host=host)
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 225, in tool
# # Exception|     cmd_obj = cmd(cmd_str, ns=ns, host=host)
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 91, in __init__
# # Exception|     self.process(terminate=False, fail=fail, timeout=timeout)
# # Exception|     ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 117, in process
# # Exception|     raise CmdExitFailure("Command failed", self)
# # Exception| net.lib.py.utils.CmdExitFailure: Command failed
# # Exception| CMD: ip link add veth0 type veth peer name veth1
# # Exception|   EXIT: 2
# # Exception|   STDERR: Error: Unknown device type.
# # Exception| 
# not ok 27 nk_qlease.test_veth_queue_create


I guess you can post it without waiting for this to be merged, it won't
conflict.

^ permalink raw reply

* Re: [PATCH] xfrm: fix memory leak in xfrm_add_policy()
From: Deepanshu Kartikey @ 2026-04-14  2:12 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: steffen.klassert, herbert, davem, edumazet, kuba, pabeni, horms,
	leon, netdev, linux-kernel, syzbot+901d48e0b95aed4a2548
In-Reply-To: <adz_CeDItDjznfWo@krikkit>

On Mon, Apr 13, 2026 at 8:04 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> Ok. Then you should wait 2 weeks until the merge window is over:
> https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
>
> and use "[PATCH ipsec-next]" as prefix for the cleanup patch (+ drop
> the syzbot references).
>

Hi Sabrina,

Thanks for the guidance.  I have submitted patch v3.

Thanks

Deepanshu

^ permalink raw reply

* [PATCH net v5 2/2] pppoe: drop PFC frames
From: Qingfang Deng @ 2026-04-14  2:13 UTC (permalink / raw)
  To: linux-ppp, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Qingfang Deng, Breno Leitao,
	Sebastian Andrzej Siewior, Kees Cook, Kuniyuki Iwashima,
	Guillaume Nault, Eric Woudstra, Simon Horman, Sam Protsenko,
	netdev, linux-kernel
  Cc: Paul Mackerras, Jaco Kroon, James Carlson, Wojciech Drewek,
	Marcin Szycik
In-Reply-To: <20260414021353.23471-1-qingfang.deng@linux.dev>

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the current PPPoE driver assumes an
uncompressed (2-byte) protocol field. However, the generic PPP layer
function ppp_input() is not aware of the negotiation result, and still
accepts PFC frames.

If a peer with a broken implementation or an attacker sends a frame with
a compressed (1-byte) protocol field, the subsequent PPP payload is
shifted by one byte. This causes the network header to be 4-byte
misaligned, which may trigger unaligned access exceptions on some
architectures.

To reduce the attack surface, drop PPPoE PFC frames. Introduce
ppp_skb_is_compressed_proto() helper function to be used in both
ppp_generic.c and pppoe.c to avoid open-coding.

Fixes: 7fb1b8ca8fa1 ("ppp: Move PFC decompression to PPP generic layer")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
---
Changes in v5: add Reviewed-by tag from Simon
 Link to v4: https://lore.kernel.org/netdev/20260410033627.93786-2-qingfang.deng@linux.dev/

 drivers/net/ppp/ppp_generic.c |  2 +-
 drivers/net/ppp/pppoe.c       |  8 +++++++-
 include/linux/ppp_defs.h      | 16 ++++++++++++++++
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index b097d1b38ac9..853da966ad46 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -2242,7 +2242,7 @@ ppp_do_recv(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
  */
 static void __ppp_decompress_proto(struct sk_buff *skb)
 {
-	if (skb->data[0] & 0x01)
+	if (ppp_skb_is_compressed_proto(skb))
 		*(u8 *)skb_push(skb, 1) = 0x00;
 }
 
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index d546a7af0d54..bdd61c504a1c 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -393,7 +393,7 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (skb_mac_header_len(skb) < ETH_HLEN)
 		goto drop;
 
-	if (!pskb_may_pull(skb, sizeof(struct pppoe_hdr)))
+	if (!pskb_may_pull(skb, PPPOE_SES_HLEN))
 		goto drop;
 
 	ph = pppoe_hdr(skb);
@@ -403,6 +403,12 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (skb->len < len)
 		goto drop;
 
+	/* skb->data points to the PPP protocol header after skb_pull_rcsum.
+	 * Drop PFC frames.
+	 */
+	if (ppp_skb_is_compressed_proto(skb))
+		goto drop;
+
 	if (pskb_trim_rcsum(skb, len))
 		goto drop;
 
diff --git a/include/linux/ppp_defs.h b/include/linux/ppp_defs.h
index b7e57fdbd413..b1d1f46d7d3b 100644
--- a/include/linux/ppp_defs.h
+++ b/include/linux/ppp_defs.h
@@ -8,6 +8,7 @@
 #define _PPP_DEFS_H_
 
 #include <linux/crc-ccitt.h>
+#include <linux/skbuff.h>
 #include <uapi/linux/ppp_defs.h>
 
 #define PPP_FCS(fcs, c) crc_ccitt_byte(fcs, c)
@@ -25,4 +26,19 @@ static inline bool ppp_proto_is_valid(u16 proto)
 	return !!((proto & 0x0101) == 0x0001);
 }
 
+/**
+ * ppp_skb_is_compressed_proto - checks if PPP protocol in a skb is compressed
+ * @skb: skb to check
+ *
+ * Check if the PPP protocol field is compressed (the least significant
+ * bit of the most significant octet is 1). skb->data must point to the PPP
+ * protocol header.
+ *
+ * Return: Whether the PPP protocol field is compressed.
+ */
+static inline bool ppp_skb_is_compressed_proto(const struct sk_buff *skb)
+{
+	return unlikely(skb->data[0] & 0x01);
+}
+
 #endif /* _PPP_DEFS_H_ */
-- 
2.43.0


^ permalink raw reply related

* [PATCH net v5 1/2] flow_dissector: do not dissect PPPoE PFC frames
From: Qingfang Deng @ 2026-04-14  2:13 UTC (permalink / raw)
  To: linux-ppp, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Qingfang Deng, Tony Nguyen,
	Guillaume Nault, Wojciech Drewek, netdev, linux-kernel
  Cc: Paul Mackerras, Jaco Kroon, James Carlson, Marcin Szycik

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the flow dissector driver has assumed an
uncompressed frame until the blamed commit.

During the review process of that commit [1], support for PFC is
suggested. However, having a compressed (1-byte) protocol field means
the subsequent PPP payload is shifted by one byte, causing 4-byte
misalignment for the network header and an unaligned access exception
on some architectures.

The exception can be reproduced by sending a PPPoE PFC frame to an
ethernet interface of a MIPS board, with RPS enabled, even if no PPPoE
session is active on that interface:

$ 0   : 00000000 80c40000 00000000 85144817
$ 4   : 00000008 00000100 80a75758 81dc9bb8
$ 8   : 00000010 8087ae2c 0000003d 00000000
$12   : 000000e0 00000039 00000000 00000000
$16   : 85043240 80a75758 81dc9bb8 00006488
$20   : 0000002f 00000007 85144810 80a70000
$24   : 81d1bda0 00000000
$28   : 81dc8000 81dc9aa8 00000000 805ead08
Hi    : 00009d51
Lo    : 2163358a
epc   : 805e91f0 __skb_flow_dissect+0x1b0/0x1b50
ra    : 805ead08 __skb_get_hash_net+0x74/0x12c
Status: 11000403        KERNEL EXL IE
Cause : 40800010 (ExcCode 04)
BadVA : 85144817
PrId  : 0001992f (MIPS 1004Kc)
Call Trace:
[<805e91f0>] __skb_flow_dissect+0x1b0/0x1b50
[<805ead08>] __skb_get_hash_net+0x74/0x12c
[<805ef330>] get_rps_cpu+0x1b8/0x3fc
[<805fca70>] netif_receive_skb_list_internal+0x324/0x364
[<805fd120>] napi_complete_done+0x68/0x2a4
[<8058de5c>] mtk_napi_rx+0x228/0xfec
[<805fd398>] __napi_poll+0x3c/0x1c4
[<805fd754>] napi_threaded_poll_loop+0x234/0x29c
[<805fd848>] napi_threaded_poll+0x8c/0xb0
[<80053544>] kthread+0x104/0x12c
[<80002bd8>] ret_from_kernel_thread+0x14/0x1c

Code: 02d51821  1060045b  00000000 <8c640000> 3084000f  2c820005  144001a2  00042080  8e220000

To reduce the attack surface and maintain performance, do not process
PPPoE PFC frames.

[1] https://patch.msgid.link/20220630231016.GA392@debian.home
Fixes: 46126db9c861 ("flow_dissector: Add PPPoE dissectors")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
---
Changes in v5: drop byte-swap change
 Link to v4: https://lore.kernel.org/netdev/20260410033627.93786-1-qingfang.deng@linux.dev/

 net/core/flow_dissector.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 1b61bb25ba0e..f9aaba554128 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1374,16 +1374,8 @@ bool __skb_flow_dissect(const struct net *net,
 			break;
 		}
 
-		/* least significant bit of the most significant octet
-		 * indicates if protocol field was compressed
-		 */
 		ppp_proto = ntohs(hdr->proto);
-		if (ppp_proto & 0x0100) {
-			ppp_proto = ppp_proto >> 8;
-			nhoff += PPPOE_SES_HLEN - 1;
-		} else {
-			nhoff += PPPOE_SES_HLEN;
-		}
+		nhoff += PPPOE_SES_HLEN;
 
 		if (ppp_proto == PPP_IP) {
 			proto = htons(ETH_P_IP);
-- 
2.43.0


^ permalink raw reply related

* [PATCH 6.1.y] Revert "wifi: cfg80211: stop NAN and P2P in cfg80211_leave"
From: guocai.he.cn @ 2026-04-14  2:16 UTC (permalink / raw)
  To: stable; +Cc: gregkh, johannes.berg, netdev, regressions,
	miriam.rachel.korenblit

From: Guocai He <guocai.he.cn@windriver.com>

This reverts commit 0c4f1c02d27a880b10b58c63f574f13bed4f711d which is commit 
e1696c8bd0056bc1a5f7766f58ac333adc203e8a upstream.

The reverted patch introduced a deadlock. The locking situation in mainline is 
totally different, so it is incorrect to directly backport the commit from mainline.

Signed-off-by: Guocai He <guocai.he.cn@windriver.com>
---
 net/wireless/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/wireless/core.c b/net/wireless/core.c
index e75326932c32..2a6a8bdfa724 100644
--- a/net/wireless/core.c
+++ b/net/wireless/core.c
@@ -1328,10 +1328,8 @@ void __cfg80211_leave(struct cfg80211_registered_device *rdev,
 		__cfg80211_leave_ocb(rdev, dev);
 		break;
 	case NL80211_IFTYPE_P2P_DEVICE:
-		cfg80211_stop_p2p_device(rdev, wdev);
-		break;
 	case NL80211_IFTYPE_NAN:
-		cfg80211_stop_nan(rdev, wdev);
+		/* cannot happen, has no netdev */
 		break;
 	case NL80211_IFTYPE_AP_VLAN:
 	case NL80211_IFTYPE_MONITOR:
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net-next] net: shaper: Reject zero weight in shaper config
From: Mohsin Bashir @ 2026-04-14  2:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, ast, chuck.lever, davem, donald.hunter, edumazet, horms,
	linux-kernel, matttbe, pabeni
In-Reply-To: <20260413145039.43f7b162@kernel.org>



On 4/13/26 2:50 PM, Jakub Kicinski wrote:
> On Fri, 10 Apr 2026 15:51:23 -0700 Mohsin Bashir wrote:
>> A zero weight is meaningless for DWRR scheduling and can cause
>> starvation of the affected node. Add a min-value constraint to
>> the weight attribute in the net_shaper netlink spec so that zero
>> is rejected at the netlink policy level.
>>
>> Found while prototyping a new driver, existing drivers are not
>> affected.
> 
> AI review points out that if the netlink attr is not present core will
> leave the DWRR weight as 0 in the struct. I guess we need to think this
> thru a little more carefully. What should the "default" weight be?
> What if user specifies weights only for subset of leaves?
> 
> This part of the uAPI seems under-defined.
> 
> Maybe a better adjustment would be to make core set the weight to 1
> automatically if the user has not defined it? Only when sending it to
> the driver tho, because we'd still want it to not be reported back to
> user space. Not sure how hairy it'd get code-wise.

Interesting!!
Let me look at the big picture here and re-spin.

^ permalink raw reply

* [PATCH 6.6.y] Revert "wifi: cfg80211: stop NAN and P2P in cfg80211_leave"
From: guocai.he.cn @ 2026-04-14  2:46 UTC (permalink / raw)
  To: gregkh
  Cc: stable, johannes.berg, netdev, regressions,
	miriam.rachel.korenblit, linux-kernel

From: Guocai He <guocai.he.cn@windriver.com>

This reverts commit 4d7a05da767e5cbcf4db511b9289d7ebd380dc56 which is commit
e1696c8bd0056bc1a5f7766f58ac333adc203e8a upstream.

The reverted patch introduced a deadlock. The locking situation in mainline is
totally different, so it is incorrect to directly backport the commit from mainline.

Signed-off-by: Guocai He <guocai.he.cn@windriver.com>
---
 net/wireless/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/wireless/core.c b/net/wireless/core.c
index fac19dab23c6..d07c4baa32d9 100644
--- a/net/wireless/core.c
+++ b/net/wireless/core.c
@@ -1332,10 +1332,8 @@ void __cfg80211_leave(struct cfg80211_registered_device *rdev,
 		__cfg80211_leave_ocb(rdev, dev);
 		break;
 	case NL80211_IFTYPE_P2P_DEVICE:
-		cfg80211_stop_p2p_device(rdev, wdev);
-		break;
 	case NL80211_IFTYPE_NAN:
-		cfg80211_stop_nan(rdev, wdev);
+		/* cannot happen, has no netdev */
 		break;
 	case NL80211_IFTYPE_AP_VLAN:
 	case NL80211_IFTYPE_MONITOR:
-- 
2.34.1


^ permalink raw reply related

* [PATCH 6.1.y] netfilter: conntrack: add missing netlink policy validations
From: Li hongliang @ 2026-04-14  2:59 UTC (permalink / raw)
  To: gregkh, stable, fw
  Cc: patches, linux-kernel, pablo, kadlec, davem, edumazet, kuba,
	pabeni, horms, kaber, netfilter-devel, coreteam, netdev, imv4bel

From: Florian Westphal <fw@strlen.de>

[ Upstream commit f900e1d77ee0ef87bfb5ab3fe60f0b3d8ad5ba05 ]

Hyunwoo Kim reports out-of-bounds access in sctp and ctnetlink.

These attributes are used by the kernel without any validation.
Extend the netlink policies accordingly.

Quoting the reporter:
  nlattr_to_sctp() assigns the user-supplied CTA_PROTOINFO_SCTP_STATE
  value directly to ct->proto.sctp.state without checking that it is
  within the valid range. [..]

  and: ... with exp->dir = 100, the access at
  ct->master->tuplehash[100] reads 5600 bytes past the start of a
  320-byte nf_conn object, causing a slab-out-of-bounds read confirmed by
  UBSAN.

Fixes: 076a0ca02644 ("netfilter: ctnetlink: add NAT support for expectations")
Fixes: a258860e01b8 ("netfilter: ctnetlink: add full support for SCTP to ctnetlink")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Li hongliang <1468888505@139.com>
---
 net/netfilter/nf_conntrack_netlink.c    | 2 +-
 net/netfilter/nf_conntrack_proto_sctp.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 89cec02de68b..bcbd77608365 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3458,7 +3458,7 @@ ctnetlink_change_expect(struct nf_conntrack_expect *x,
 
 #if IS_ENABLED(CONFIG_NF_NAT)
 static const struct nla_policy exp_nat_nla_policy[CTA_EXPECT_NAT_MAX+1] = {
-	[CTA_EXPECT_NAT_DIR]	= { .type = NLA_U32 },
+	[CTA_EXPECT_NAT_DIR]	= NLA_POLICY_MAX(NLA_BE32, IP_CT_DIR_REPLY),
 	[CTA_EXPECT_NAT_TUPLE]	= { .type = NLA_NESTED },
 };
 #endif
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index 7ffd698497f2..90458799324e 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -600,7 +600,8 @@ static int sctp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
 }
 
 static const struct nla_policy sctp_nla_policy[CTA_PROTOINFO_SCTP_MAX+1] = {
-	[CTA_PROTOINFO_SCTP_STATE]	    = { .type = NLA_U8 },
+	[CTA_PROTOINFO_SCTP_STATE]	    = NLA_POLICY_MAX(NLA_U8,
+							 SCTP_CONNTRACK_HEARTBEAT_SENT),
 	[CTA_PROTOINFO_SCTP_VTAG_ORIGINAL]  = { .type = NLA_U32 },
 	[CTA_PROTOINFO_SCTP_VTAG_REPLY]     = { .type = NLA_U32 },
 };
-- 
2.34.1



^ permalink raw reply related

* [PATCH 6.6.y] netfilter: conntrack: add missing netlink policy validations
From: Li hongliang @ 2026-04-14  2:59 UTC (permalink / raw)
  To: gregkh, stable, fw
  Cc: patches, linux-kernel, pablo, kadlec, davem, edumazet, kuba,
	pabeni, horms, kaber, netfilter-devel, coreteam, netdev, imv4bel

From: Florian Westphal <fw@strlen.de>

[ Upstream commit f900e1d77ee0ef87bfb5ab3fe60f0b3d8ad5ba05 ]

Hyunwoo Kim reports out-of-bounds access in sctp and ctnetlink.

These attributes are used by the kernel without any validation.
Extend the netlink policies accordingly.

Quoting the reporter:
  nlattr_to_sctp() assigns the user-supplied CTA_PROTOINFO_SCTP_STATE
  value directly to ct->proto.sctp.state without checking that it is
  within the valid range. [..]

  and: ... with exp->dir = 100, the access at
  ct->master->tuplehash[100] reads 5600 bytes past the start of a
  320-byte nf_conn object, causing a slab-out-of-bounds read confirmed by
  UBSAN.

Fixes: 076a0ca02644 ("netfilter: ctnetlink: add NAT support for expectations")
Fixes: a258860e01b8 ("netfilter: ctnetlink: add full support for SCTP to ctnetlink")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Li hongliang <1468888505@139.com>
---
 net/netfilter/nf_conntrack_netlink.c    | 2 +-
 net/netfilter/nf_conntrack_proto_sctp.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 9b089cdfcd35..255996f43d85 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3454,7 +3454,7 @@ ctnetlink_change_expect(struct nf_conntrack_expect *x,
 
 #if IS_ENABLED(CONFIG_NF_NAT)
 static const struct nla_policy exp_nat_nla_policy[CTA_EXPECT_NAT_MAX+1] = {
-	[CTA_EXPECT_NAT_DIR]	= { .type = NLA_U32 },
+	[CTA_EXPECT_NAT_DIR]	= NLA_POLICY_MAX(NLA_BE32, IP_CT_DIR_REPLY),
 	[CTA_EXPECT_NAT_TUPLE]	= { .type = NLA_NESTED },
 };
 #endif
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index 4cc97f971264..fabb2c1ca00a 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -587,7 +587,8 @@ static int sctp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
 }
 
 static const struct nla_policy sctp_nla_policy[CTA_PROTOINFO_SCTP_MAX+1] = {
-	[CTA_PROTOINFO_SCTP_STATE]	    = { .type = NLA_U8 },
+	[CTA_PROTOINFO_SCTP_STATE]	    = NLA_POLICY_MAX(NLA_U8,
+							 SCTP_CONNTRACK_HEARTBEAT_SENT),
 	[CTA_PROTOINFO_SCTP_VTAG_ORIGINAL]  = { .type = NLA_U32 },
 	[CTA_PROTOINFO_SCTP_VTAG_REPLY]     = { .type = NLA_U32 },
 };
-- 
2.34.1



^ permalink raw reply related

* [PATCH iwl-next v2 0/2] Introduce IDPF PCI callbacks
From: Emil Tantilov @ 2026-04-14  3:16 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, przemyslaw.kitszel, jay.bhat, ivan.d.barrera,
	aleksandr.loktionov, larysa.zaremba, anthony.l.nguyen,
	andrew+netdev, davem, edumazet, kuba, pabeni, aleksander.lobakin,
	linux-pci, madhu.chittim, decot, willemb, sheenamo, lukas

This series implements PCI callbacks for the purpose of handling FLR and
PCI errors in the IDPF driver.

The first patch removes the conditional deinitialization of the mailbox in
the idpf_vc_core_deinit() function. Aside from being redundant, due to the
shutdown of the mailbox after a reset is detected, the check was also
preventing the driver from sending messages to stop and disable the vports
and queues on FW side, which is needed for the prepare phase of the FLR
handling.

The second patch implements the PCI callbacks. The logic here follows
the reset handling done in idpf_init_hard_reset(), but is split in
prepare and resume phases, where idpf_reset_prepare() stops all driver
operations and the resume callback attempt to recover following the
reset or the PCI error event.

Testing hints:
1. FLR via sysfs:
echo 1 > /sys/class/net/<ifname>/device/reset

Previously this would have been handled by idpf_init_hard_reset() as the
driver detects the reset. Now it will be done by the PCI err callbacks,
so this is the easiest way to test the reset_prepare/resume path.

2. PCI errors can be tested with aer-inject:
./aer-inject -s 83:00.0 examples/<error_type>

3. Stress testing can be done by combining various callbacks with the
reset from step 1:
echo 1 > /sys/class/net/<if>/device/reset& ethtool -L <if> combined 8
ethtool -L <if> combined 16& echo 1 > /sys/class/net/<if>/device/reset

Changelog:
v1->v2:
- Removed the call to pci_save_state() from idpf_pci_err_slot_reset(),
  as it is no longer needed after pci_restore_state(). Suggested by
  Lukas Wunner.

v1:
https://lore.kernel.org/netdev/20260411003959.30959-1-emil.s.tantilov@intel.com/

Emil Tantilov (2):
  idpf: remove conditonal MBX deinit from idpf_vc_core_deinit()
  idpf: implement pci error handlers

 drivers/net/ethernet/intel/idpf/idpf.h        |   3 +
 drivers/net/ethernet/intel/idpf/idpf_lib.c    |  13 +-
 drivers/net/ethernet/intel/idpf/idpf_main.c   | 112 ++++++++++++++++++
 .../net/ethernet/intel/idpf/idpf_virtchnl.c   |  11 +-
 4 files changed, 127 insertions(+), 12 deletions(-)

-- 
2.37.3


^ permalink raw reply

* [PATCH iwl-next v2 1/2] idpf: remove conditonal MBX deinit from idpf_vc_core_deinit()
From: Emil Tantilov @ 2026-04-14  3:16 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, przemyslaw.kitszel, jay.bhat, ivan.d.barrera,
	aleksandr.loktionov, larysa.zaremba, anthony.l.nguyen,
	andrew+netdev, davem, edumazet, kuba, pabeni, aleksander.lobakin,
	linux-pci, madhu.chittim, decot, willemb, sheenamo, lukas
In-Reply-To: <20260414031631.2107-1-emil.s.tantilov@intel.com>

Previously it was assumed that idpf_vc_core_deinit() is always being
called during reset handling, with remove being an exception. Ideally
the driver needs to communicate the changes to FW in all instances where
the MBX is not already disabled. Remove the remove_in_prog check from
idpf_vc_core_deinit() as the MBX was already disabled while handling the
reset via libie_ctlq_xn_shutdown() by the service task. This is also
needed by the following patch, introducing PCI callbacks support.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Jay Bhat <jay.bhat@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
index 129c8f6b0faa..fceaf3ec1cd4 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
@@ -3229,24 +3229,15 @@ int idpf_vc_core_init(struct idpf_adapter *adapter)
  */
 void idpf_vc_core_deinit(struct idpf_adapter *adapter)
 {
-	bool remove_in_prog;
-
 	if (!test_bit(IDPF_VC_CORE_INIT, adapter->flags))
 		return;
 
-	/* Avoid transaction timeouts when called during reset */
-	remove_in_prog = test_bit(IDPF_REMOVE_IN_PROG, adapter->flags);
-	if (!remove_in_prog)
-		idpf_deinit_dflt_mbx(adapter);
-
 	idpf_ptp_release(adapter);
 	idpf_deinit_task(adapter);
 	idpf_idc_deinit_core_aux_device(adapter);
 	idpf_rel_rx_pt_lkup(adapter);
 	idpf_intr_rel(adapter);
-
-	if (remove_in_prog)
-		idpf_deinit_dflt_mbx(adapter);
+	idpf_deinit_dflt_mbx(adapter);
 
 	cancel_delayed_work_sync(&adapter->serv_task);
 
-- 
2.37.3


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox