Netdev List
* Re: [PATCH net] nfc: st-nci: use unaligned accessors for frame length
From: David Laight @ 2026-06-20  9:29 UTC
  To: Runyu Xiao
  Cc: Krzysztof Kozlowski, netdev, Samuel Ortiz, Christophe Ricard,
	linux-kernel, Jianhao Xu, stable
In-Reply-To: <20260620090536.1701282-1-runyu.xiao@seu.edu.cn>

On Sat, 20 Jun 2026 17:05:36 +0800
Runyu Xiao <runyu.xiao@seu.edu.cn> wrote:

> The ST NCI I2C and SPI transports parse a frame length from bytes
> received from the controller. Both paths first read the frame header into
> a local u8 buffer and then cast buf + 2 to __be16 * before converting it
> from big endian.

Then align the local buffer.

	David

> 
> These are transport byte buffers, not __be16 objects. Use
> get_unaligned_be16() for the NCI frame length field in both the I2C and
> SPI transports.
> 
> This issue was detected by our static analysis tool and confirmed by
> manual audit. A focused UBSAN alignment validation kept the original
> access shape, be16_to_cpu(*(__be16 *)(buf + 2)), and ran it on an NCI
> frame byte buffer with buf + 2 at an odd address. UBSAN reported a
> misaligned-access load of type '__be16', and the trace contained
> st_nci_i2c_read().
> 
> The driver has the same source-level issue: the transport helpers fill
> u8 buffers, and the length checks only prove that the bytes are present.
> They do not establish a __be16 object at buf + 2 or a 2-byte alignment
> guarantee before the typed load.
> 
> Fixes: ed06aeefdac3 ("nfc: st-nci: Rename st21nfcb to st-nci")
> Fixes: 2bc4d4f8c8f3 ("nfc: st-nci: Add spi phy support for st21nfcb")
> Cc: stable@vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
> ---
>  drivers/nfc/st-nci/i2c.c | 3 ++-
>  drivers/nfc/st-nci/spi.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nfc/st-nci/i2c.c b/drivers/nfc/st-nci/i2c.c
> index 9ae839a6f5cc..29fdb4ae56e0 100644
> --- a/drivers/nfc/st-nci/i2c.c
> +++ b/drivers/nfc/st-nci/i2c.c
> @@ -14,6 +14,7 @@
>  #include <linux/delay.h>
>  #include <linux/nfc.h>
>  #include <linux/of.h>
> +#include <linux/unaligned.h>
>  
>  #include "st-nci.h"
>  
> @@ -120,7 +121,7 @@ static int st_nci_i2c_read(struct st_nci_i2c_phy *phy,
>  	if (r != ST_NCI_I2C_MIN_SIZE)
>  		return -EREMOTEIO;
>  
> -	len = be16_to_cpu(*(__be16 *) (buf + 2));
> +	len = get_unaligned_be16(buf + 2);
>  	if (len > ST_NCI_I2C_MAX_SIZE) {
>  		nfc_err(&client->dev, "invalid frame len\n");
>  		return -EBADMSG;
> diff --git a/drivers/nfc/st-nci/spi.c b/drivers/nfc/st-nci/spi.c
> index 169eacc0a32a..1326c20e43fc 100644
> --- a/drivers/nfc/st-nci/spi.c
> +++ b/drivers/nfc/st-nci/spi.c
> @@ -14,6 +14,7 @@
>  #include <linux/delay.h>
>  #include <linux/nfc.h>
>  #include <linux/of.h>
> +#include <linux/unaligned.h>
>  #include <net/nfc/nci.h>
>  
>  #include "st-nci.h"
> @@ -130,7 +131,7 @@ static int st_nci_spi_read(struct st_nci_spi_phy *phy,
>  	if (r < 0)
>  		return -EREMOTEIO;
>  
> -	len = be16_to_cpu(*(__be16 *) (buf + 2));
> +	len = get_unaligned_be16(buf + 2);
>  	if (len > ST_NCI_SPI_MAX_SIZE) {
>  		nfc_err(&dev->dev, "invalid frame len\n");
>  		phy->ndlc->hard_fault = 1;



* Re: [PATCH net v2 07/10] rxrpc: Fix oob challenge leak in cleanup after notification failure
From: Simon Horman @ 2026-06-20  9:17 UTC
  To: dhowells
  Cc: Simon Horman, netdev, marc.dionne, kuba, davem, edumazet, pabeni,
	linux-afs, linux-kernel, jaltman, stable
In-Reply-To: <20260618134802.2477777-8-dhowells@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
rxrpc: Fix oob challenge leak in cleanup after notification failure

Fix rxrpc_notify_socket_oob() to return an indication of failure in the
event that it failed to queue a packet and fix rxrpc_post_challenge() to
clean up the connection ref in such an event.

Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com

> diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
> index c96ca615b787c..611c790bc6d0c 100644
> --- a/net/rxrpc/conn_event.c
> +++ b/net/rxrpc/conn_event.c
> @@ -436,7 +436,7 @@ static bool rxrpc_post_challenge(struct rxrpc_connection *conn,
>  	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
>  	struct rxrpc_call *call = NULL;
>  	struct rxrpc_sock *rx;
> -	bool respond = false;
> +	bool respond = false, queued = false;

[Severity: Critical]
This is a pre-existing issue, but can the lockless read of
conn->channels[i].call in net/rxrpc/conn_event.c lead to a NULL pointer
dereference?

Looking at rxrpc_post_challenge(), the code does:

    if (conn->channels[i].call) {
        call = conn->channels[i].call;
        rx = rcu_dereference(call->socket);

Without READ_ONCE(), the compiler could issue two separate loads. If a
concurrent thread clears the channel's call pointer after the if-check but
before the assignment, call becomes NULL.

Does this cause the subsequent rcu_dereference(call->socket) to panic by
dereferencing NULL->socket?
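
As a rough userspace illustration of the single-load pattern the question
points at (not the actual rxrpc code): the READ_ONCE() macro below is a
stand-in built on the GCC/Clang __typeof__ extension, and struct demo_call,
struct demo_channel and demo_channel_socket() are hypothetical names. The
sketch only covers the double-load concern; it says nothing about object
lifetime, which in the kernel would still need RCU or a reference.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's READ_ONCE(); illustrative only. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, simplified stand-ins for the rxrpc structures. */
struct demo_call {
	int socket_id;
};

struct demo_channel {
	struct demo_call *call;	/* may be cleared by another thread */
};

/*
 * Load the pointer exactly once, then test and use the same local copy.
 * Returns the socket id, or -1 if the channel has no call attached.
 */
static int demo_channel_socket(struct demo_channel *chan)
{
	struct demo_call *call = READ_ONCE(chan->call);

	if (!call)
		return -1;
	return call->socket_id;
}
```

Because the NULL test and the dereference both use the local copy, a
concurrent clear of chan->call cannot make the two disagree.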

[Severity: High]
This isn't a bug introduced by this patch, but does taking a connection
reference here in rxrpc_post_challenge() lead to a leak if the packet is
later discarded due to connection abort or teardown?

If rxrpc_process_event() in net/rxrpc/conn_event.c encounters 
conn->state == RXRPC_CONN_ABORTED, it returns -ECONNABORTED early:

    if (conn->state == RXRPC_CONN_ABORTED)
        return -ECONNABORTED;

This bypasses the cleanup, and the skb is consumed by
rxrpc_do_process_connection() without dropping the sp->chall.conn reference.

Similarly, during teardown, rxrpc_purge_queue() in net/rxrpc/skbuff.c
consumes skbs via:

    while ((skb = skb_dequeue((list))) != NULL) {
        ...
        consume_skb(skb);
    }

Since it doesn't inspect the cb scratch space, doesn't this permanently leak
the connection references stored in sp->chall.conn?

[ ... ]

>  	if (call)
> -		rxrpc_notify_socket_oob(call, skb);
> +		queued = rxrpc_notify_socket_oob(call, skb);

[Severity: High]
This is a pre-existing issue, but does failing to set skb->mark to
RXRPC_OOB_CHALLENGE here in rxrpc_post_challenge() break OOB challenge
processing?

The skb->mark remains RXRPC_SKB_MARK_PACKET as set during ingress.
When the application attempts to respond via rxrpc_respond_to_oob() or read
via rxrpc_recvmsg_oob(), won't the operation fail with -EPROTO because it
expects skb->mark == RXRPC_OOB_CHALLENGE?

Additionally, when the OOB message is freed via rxrpc_kernel_free_oob(),
doesn't the switch statement on oob->mark fail to match, bypassing
rxrpc_put_connection() and leaking the connection reference stored in
sp->chall.conn?

>  	rcu_read_unlock();
> +	if (call && !queued) {
> +		rxrpc_put_connection(conn, rxrpc_conn_put_challenge_input);
> +		sp->chall.conn = NULL;
> +		return false;
> +	}
>  
>  	if (!call)
>  		rxrpc_post_packet_to_conn(conn, skb);
> diff --git a/net/rxrpc/oob.c b/net/rxrpc/oob.c
> index 3318c8bd82ad5..c80ee2487d094 100644
> --- a/net/rxrpc/oob.c
> +++ b/net/rxrpc/oob.c

[ ... ]

> @@ -49,6 +50,7 @@ void rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb)
>  			skb->skb_mstamp_ns = rx->oob_id_counter++;
>  			rxrpc_get_skb(skb, rxrpc_skb_get_post_oob);
>  			skb_queue_tail(&rx->recvmsg_oobq, skb);
> +			queued = true;

[Severity: High]
This isn't a bug introduced by this patch, but does unbounded queuing of
challenge packets onto recvmsg_oobq risk a kernel OOM?

In net/rxrpc/oob.c, rxrpc_notify_socket_oob() uses skb_queue_tail() without
checking memory limits (like sk_rmem_alloc vs sk_rcvbuf) or restricting the
total number of challenges.

Since rxrpc_encap_rcv() intercepts these packets before standard UDP memory
accounting occurs, could a remote attacker flood the host with valid
CHALLENGE packets?

If the application is slow to process them or ignores them, will these
accumulate infinitely in recvmsg_oobq and cause a Denial of Service?


* Re: [PATCH net v2 06/10] rxrpc: Fix the reception of a reply packet before data transmission
From: Simon Horman @ 2026-06-20  9:17 UTC
  To: dhowells
  Cc: Simon Horman, netdev, marc.dionne, kuba, davem, edumazet, pabeni,
	linux-afs, linux-kernel, jaltman, stable
In-Reply-To: <20260618134802.2477777-7-dhowells@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
rxrpc: Fix the reception of a reply packet before data transmission

This commit fixes rxrpc_receiving_reply() to handle the reception of an
apparent reply DATA packet before rxrpc has had a chance to send any request
DATA packets. It also adjusts rxrpc_rotate_tx_window() to handle an empty
Tx queue by changing the do...while loop into a while loop.

Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com

> diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
> index 37881dffa898b..01ccd2d2fe92c 100644
> --- a/net/rxrpc/input.c
> +++ b/net/rxrpc/input.c
> @@ -247,7 +247,7 @@ static bool rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
>  		tq = call->tx_queue;
>  	}
>  
> -	do {
> +	while (before_eq(seq, to)) {

[Severity: Critical]
This is a pre-existing issue, but can this trigger a use-after-free write
during a concurrent sendmsg() if the remote peer sends an early reply?

Right before this loop in rxrpc_rotate_tx_window(), the code unconditionally
frees the active transmission queue buffer:

	if (seq == call->tx_qbase + RXRPC_NR_TXQUEUE) {
		call->tx_qbase += RXRPC_NR_TXQUEUE;
		call->tx_queue = tq->next;
		...
		kfree(tq);
		tq = call->tx_queue;
	}

If a full block of 63 packets is transmitted and ACKed, call->tx_bottom
reaches 63. If the remote peer then sends an early reply,
rxrpc_receiving_reply() calls rxrpc_rotate_tx_window() with to = 63 and seq
starting at 64.

This triggers the condition above and frees the active block (tq) without
verifying if it is still being written to. If it is still the tail
(tq == call->tx_qtail), call->send_queue and call->tx_qtail are left as
dangling pointers.

Concurrently, if the client is still sending data via sendmsg(), it might call
rxrpc_alloc_txqueue() for the 64th packet:

net/rxrpc/sendmsg.c:rxrpc_alloc_txqueue() {
	...
	if (call->send_queue) {
		tq->qbase = call->send_top + 1;
		call->send_queue->next = tq;
		call->send_queue = tq;
	}
	...
}

Will call->send_queue->next = tq write a newly allocated pointer into the
recently freed memory block?

[ ... ]


* Re: [PATCH net v2 05/10] afs: Fix uncancelled rxrpc OOB message handler
From: Simon Horman @ 2026-06-20  9:13 UTC
  To: dhowells
  Cc: Simon Horman, netdev, marc.dionne, kuba, davem, edumazet, pabeni,
	linux-afs, linux-kernel, jaltman, stable
In-Reply-To: <20260618134802.2477777-6-dhowells@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
afs: Fix uncancelled rxrpc OOB message handler

This patch modifies AFS to cancel its OOB message processing during
shutdown. It moves OOB message processing to afs_wq and adds checks
for the net namespace's live state to prevent processing when no
longer live.

Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com

> diff --git a/fs/afs/cm_security.c b/fs/afs/cm_security.c
> index edcbd249d2024..103168c70dd4d 100644
> --- a/fs/afs/cm_security.c
> +++ b/fs/afs/cm_security.c

[ ... ]

> @@ -101,7 +101,8 @@ void afs_process_oob_queue(struct work_struct *work)
>  	struct sk_buff *oob;
>  	enum rxrpc_oob_type type;
>  
> -	while ((oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
> +	while (READ_ONCE(net->live) &&
> +	       (oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
>  		switch (type) {
>  		case RXRPC_OOB_CHALLENGE:
>  			afs_respond_to_challenge(oob);

[Severity: Critical]
This isn't a bug introduced by this patch, but can afs_respond_to_challenge()
cause a null pointer dereference or use-after-free here?

When afs_respond_to_challenge() queries peer_data and casts it to an
afs_server pointer, it unconditionally dereferences it and locks its mutex:

fs/afs/cm_security.c:afs_respond_to_challenge() {
    ...
    server = (struct afs_server *)peer_data;
    if (!server->cm_rxgk_appdata.data) {
        mutex_lock(&server->cm_token_lock);
    ...
}

If a challenge is received for an unknown peer, or if afs_server_destroyer()
concurrently clears the peer app_data to null during teardown, could 
peer_data be null?

If the pointer was read just before being cleared, could the server object
be freed concurrently since no RCU read lock or refcount is held?

> diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
> index 6714a189d58fb..e8af2a6614405 100644
> --- a/fs/afs/rxrpc.c
> +++ b/fs/afs/rxrpc.c

[ ... ]

> @@ -989,5 +991,6 @@ static void afs_rx_notify_oob(struct sock *sk, struct sk_buff *oob)
>  {
>  	struct afs_net *net = sk->sk_user_data;
>  
> -	schedule_work(&net->rx_oob_work);
> +	if (net->live)
> +		queue_work(afs_wq, &net->rx_oob_work);
>  }

[Severity: Medium]
Should net->live be read with READ_ONCE() here?

Since net->live can be modified concurrently by namespace teardown code
such as afs_close_socket() or afs_net_exit(), reading it without READ_ONCE()
might allow the compiler to optimize the read, leading to a stale evaluation.

The updated afs_process_oob_queue() function correctly uses READ_ONCE().
Should afs_rx_notify_oob() do the same to prevent data races?

[Severity: High]
This isn't a bug introduced by this patch, but there appears to be a similar
issue in fs/afs/server.c when retrieving the server pointer.

Can afs_find_server() increment the refcount of a freed object?

afs_find_server() fetches the pointer and increments its refcount:

fs/afs/server.c:afs_find_server() {
    ...
    struct afs_server *server = (struct afs_server *)rxrpc_kernel_get_peer_data(peer);
    
    if (!server)
        return NULL;
    return afs_use_server(server, false, afs_server_trace_use_cm_call);
    ...
}

If afs_server_destroyer() runs concurrently, unbinds the peer, and frees
the server, could afs_find_server() fetch the pointer just before it is
cleared?

This lookup appears to lack RCU read lock protection, so if it is preempted
between fetching the pointer and taking the reference, it could increment
the refcount of a freed object, since refcount_inc_not_zero() is not used.
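
For readers unfamiliar with the "increment unless zero" idea mentioned
above, here is a minimal userspace sketch using C11 atomics. It is loosely
modelled on the semantics of refcount_inc_not_zero() but is not the kernel
implementation (no saturation handling), and demo_ref_inc_not_zero() is a
hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is still non-zero.
 * Returns false when the object is already on its way to being freed.
 */
static bool demo_ref_inc_not_zero(atomic_uint *refs)
{
	unsigned int old = atomic_load(refs);

	do {
		if (old == 0)
			return false;	/* never resurrect a dead object */
		/* On failure the CAS reloads 'old' and the loop re-checks it. */
	} while (!atomic_compare_exchange_weak(refs, &old, old + 1));

	return true;
}
```

A lookup path built this way fails cleanly when it races with the final
put, instead of bumping a counter that has already reached zero. The memory
holding the counter must of course still be valid at that point, which is
what RCU protection would provide in the kernel.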


* [PATCH net] nfc: st-nci: use unaligned accessors for frame length
From: Runyu Xiao @ 2026-06-20  9:05 UTC
  To: Krzysztof Kozlowski, netdev
  Cc: Samuel Ortiz, Christophe Ricard, linux-kernel, Runyu Xiao,
	Jianhao Xu, stable

The ST NCI I2C and SPI transports parse a frame length from bytes
received from the controller. Both paths first read the frame header into
a local u8 buffer and then cast buf + 2 to __be16 * before converting it
from big endian.

These are transport byte buffers, not __be16 objects. Use
get_unaligned_be16() for the NCI frame length field in both the I2C and
SPI transports.

This issue was detected by our static analysis tool and confirmed by
manual audit. A focused UBSAN alignment validation kept the original
access shape, be16_to_cpu(*(__be16 *)(buf + 2)), and ran it on an NCI
frame byte buffer with buf + 2 at an odd address. UBSAN reported a
misaligned-access load of type '__be16', and the trace contained
st_nci_i2c_read().

The driver has the same source-level issue: the transport helpers fill
u8 buffers, and the length checks only prove that the bytes are present.
They do not establish a __be16 object at buf + 2 or a 2-byte alignment
guarantee before the typed load.

Fixes: ed06aeefdac3 ("nfc: st-nci: Rename st21nfcb to st-nci")
Fixes: 2bc4d4f8c8f3 ("nfc: st-nci: Add spi phy support for st21nfcb")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
 drivers/nfc/st-nci/i2c.c | 3 ++-
 drivers/nfc/st-nci/spi.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nfc/st-nci/i2c.c b/drivers/nfc/st-nci/i2c.c
index 9ae839a6f5cc..29fdb4ae56e0 100644
--- a/drivers/nfc/st-nci/i2c.c
+++ b/drivers/nfc/st-nci/i2c.c
@@ -14,6 +14,7 @@
 #include <linux/delay.h>
 #include <linux/nfc.h>
 #include <linux/of.h>
+#include <linux/unaligned.h>
 
 #include "st-nci.h"
 
@@ -120,7 +121,7 @@ static int st_nci_i2c_read(struct st_nci_i2c_phy *phy,
 	if (r != ST_NCI_I2C_MIN_SIZE)
 		return -EREMOTEIO;
 
-	len = be16_to_cpu(*(__be16 *) (buf + 2));
+	len = get_unaligned_be16(buf + 2);
 	if (len > ST_NCI_I2C_MAX_SIZE) {
 		nfc_err(&client->dev, "invalid frame len\n");
 		return -EBADMSG;
diff --git a/drivers/nfc/st-nci/spi.c b/drivers/nfc/st-nci/spi.c
index 169eacc0a32a..1326c20e43fc 100644
--- a/drivers/nfc/st-nci/spi.c
+++ b/drivers/nfc/st-nci/spi.c
@@ -14,6 +14,7 @@
 #include <linux/delay.h>
 #include <linux/nfc.h>
 #include <linux/of.h>
+#include <linux/unaligned.h>
 #include <net/nfc/nci.h>
 
 #include "st-nci.h"
@@ -130,7 +131,7 @@ static int st_nci_spi_read(struct st_nci_spi_phy *phy,
 	if (r < 0)
 		return -EREMOTEIO;
 
-	len = be16_to_cpu(*(__be16 *) (buf + 2));
+	len = get_unaligned_be16(buf + 2);
 	if (len > ST_NCI_SPI_MAX_SIZE) {
 		nfc_err(&dev->dev, "invalid frame len\n");
 		phy->ndlc->hard_fault = 1;
-- 
2.34.1
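
The difference between the typed load and a byte-wise load can be sketched
in portable userspace C. This is only an illustration of what an unaligned
big-endian accessor does; demo_get_be16() and demo_frame_len() are
hypothetical names, not the kernel helpers.

```c
#include <stdint.h>

/*
 * Assemble a big-endian 16-bit value from two bytes. Only byte loads are
 * issued, so the address of 'p' needs no particular alignment.
 */
static uint16_t demo_get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Parse the length field at byte offset 2 of a frame header, mirroring
 * the buf + 2 access in the patch.
 */
static uint16_t demo_frame_len(const uint8_t *hdr)
{
	return demo_get_be16(hdr + 2);
}
```

With this shape the result is the same whether the header starts at an
even or an odd address, which is the property the typed *(__be16 *) load
does not guarantee.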



* Re: [PATCH iwl-next v2] ixgbe: Implement PCI reset handler
From: Simon Horman @ 2026-06-20  8:53 UTC
  To: sergey.temerkhanov; +Cc: Simon Horman, intel-wired-lan, netdev, pmenzel
In-Reply-To: <20260618142212.310475-1-sergey.temerkhanov@intel.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ixgbe: Implement PCI reset handler

Implement PCI device reset handler to allow the network device to
get re-initialized and function after a PCI-level reset.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
[ ... ]
> @@ -12338,6 +12338,86 @@ static pci_ers_result_t ixgbe_io_slot_reset(struct pci_dev *pdev)
> +static void ixgbe_reset_prep(struct pci_dev *pdev)
> +{
> +	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
> +
> +	if (!adapter)
> +		return;
> +
> +	if (poll_timeout_us(test_and_set_bit(__IXGBE_RESETTING, &adapter->state),
> +			    test_bit(__IXGBE_RESETTING, &adapter->state),

[Severity: Medium]
Does this introduce a new instance of an ad-hoc bit lock?

According to the networking drivers guidelines, using bit operations like
test_and_set_bit() to guard critical sections instead of proper
synchronization primitives like a mutex or spinlock can hide issues from
lockdep and lacks deadlock detection.

[Severity: Critical]
Does this misuse the poll_timeout_us() macro?

The macro executes the operation before evaluating the condition. Since
test_and_set_bit() unconditionally sets the bit to 1, test_bit() will always
evaluate to true on the very first loop iteration, regardless of whether the
lock was already held.

Doesn't this completely break mutual exclusion?
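
The concern can be modelled in userspace with C11 atomics. This does not
reproduce poll_timeout_us() itself; it only emulates the "run the
operation, then evaluate the condition" order described above. The
DEMO_RESETTING bit and both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_RESETTING 0x1u	/* hypothetical flag bit */

/*
 * "Set the bit, then test the bit": the result is always true, so it says
 * nothing about whether another context already owned the flag.
 */
static bool demo_set_then_test(atomic_uint *state)
{
	atomic_fetch_or(state, DEMO_RESETTING);
	return atomic_load(state) & DEMO_RESETTING;
}

/*
 * Acquire attempt that looks at the value the read-modify-write returned:
 * true only if the bit was clear beforehand.
 */
static bool demo_try_acquire(atomic_uint *state)
{
	return !(atomic_fetch_or(state, DEMO_RESETTING) & DEMO_RESETTING);
}
```

Only the second form distinguishes "I took the flag" from "somebody else
already holds it", which is the information a lock-like use needs.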

> +			    IXGBE_RESET_PREP_POLL_INTERVAL_US,
> +			    IXGBE_RESET_PREP_TIMEOUT_US, false)) {
> +		/* ixgbe_reset_done() will exit early if this happens.
> +		 * A retry will be needed
> +		 */
> +		e_err(drv, "Timed out waiting for __IXGBE_RESETTING to be released. Reset is needed\n");
> +		return;

[Severity: High]
If this times out and returns early, won't it abandon the software
reinitialization process while the PCI subsystem proceeds with the physical
reset?

The .reset_prepare callback returns void, so the reset continues. By
returning early without setting __IXGBE_PCIE_RESET_IN_PROGRESS, critical
teardowns are skipped, and ixgbe_reset_done() will also bail out early:

ixgbe_reset_done() {
    if (!test_and_clear_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state)) {
        e_err(drv, "Reset done called without PCIe reset in progress\n");
        return;
    }
}

Could this permanently break the network interface until the module is
reloaded?

> +	}
> +
> +	/* Sync __IXGBE_RESETTING */
> +	smp_mb__after_atomic();
> +
> +	if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state)) {
> +		/* Prevent the service task from being requeued in the timer callback */
> +		timer_delete_sync(&adapter->service_timer);
> +		/* Cancel any possibly queued service task */
> +		cancel_work_sync(&adapter->service_task);

[Severity: Critical]
Can this sequence cause a deadlock with the service task?

Here in ixgbe_reset_prep(), __IXGBE_RESETTING has just been acquired,
and then cancel_work_sync(&adapter->service_task) blocks waiting for
the service task to complete.

However, if the active service task reaches ixgbe_reinit_locked():

ixgbe_reinit_locked() {
    while (test_and_set_bit(__IXGBE_RESETTING, &adapter->state))
        usleep_range(1000, 2000);
    ...
}

The service task will loop indefinitely waiting for __IXGBE_RESETTING to be
cleared. Since ixgbe_reset_prep() holds the bit lock and is blocked waiting
for the service task, neither thread can make progress.

> +	}
> +
> +	pci_clear_master(pdev);
> +
> +	set_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state);
> +}
[ ... ]
> +static void ixgbe_reset_done(struct pci_dev *pdev)
> +{
> +	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
[ ... ]
> +	/* Allow the service task to run */
> +	if (!test_bit(__IXGBE_REMOVING, &adapter->state)) {
> +		clear_bit(__IXGBE_RESETTING, &adapter->state);
> +		/* Sync __IXGBE_RESETTING */
> +		smp_mb__after_atomic();

[Severity: Medium]
Does this provide the correct release semantics for the ad-hoc lock?

The clear_bit() operation does not imply a memory barrier, and
smp_mb__after_atomic() only orders memory operations that occur after the
atomic operation.

Could memory accesses from within the critical section (like the
pci_set_master() call earlier) bleed past the lock release, allowing other
threads to acquire the lock too early?

Is it possible to use clear_bit_unlock() or place an smp_mb__before_atomic()
prior to the clear_bit() instead?
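
As a userspace analogy for the ordering being asked about, the sketch
below pairs an acquire on taking a bit flag with a release on clearing it,
using C11 atomics. The mapping onto clear_bit_unlock() is an assumption
made for illustration, and DEMO_LOCK_BIT and the function names are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_LOCK_BIT 0x1u	/* hypothetical flag bit */

/* Acquire: later accesses cannot be reordered before a successful take. */
static bool demo_bit_trylock(atomic_uint *state)
{
	unsigned int old = atomic_fetch_or_explicit(state, DEMO_LOCK_BIT,
						    memory_order_acquire);

	return !(old & DEMO_LOCK_BIT);
}

/* Release: earlier accesses cannot be reordered after the bit is cleared. */
static void demo_bit_unlock(atomic_uint *state)
{
	atomic_fetch_and_explicit(state, ~DEMO_LOCK_BIT,
				  memory_order_release);
}
```

The point of the release on the clearing side is that everything done
while the flag was held is visible before another thread can observe the
flag as free.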

> +	}
[ ... ]


* [PATCH] net: liquidio: fix BAR resource leak on PF number failure
From: Haoxiang Li @ 2026-06-20  8:37 UTC
  To: andrew+netdev, davem, kuba, pabeni, felix.manlunas,
	ricardo.farrington
  Cc: netdev, linux-kernel, Haoxiang Li, stable

If cn23xx_get_pf_num() fails, the function returns without
unmapping either BAR. Unmap both BARs before returning from
the error path.

Fixes: 0c45d7fe12c7 ("liquidio: fix use of pf in pass-through mode in a virtual machine")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
 drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 75f22f74774c..a1548ca81ecd 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -1167,8 +1167,11 @@ int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
 		return 1;
 	}
 
-	if (cn23xx_get_pf_num(oct) != 0)
+	if (cn23xx_get_pf_num(oct) != 0) {
+		octeon_unmap_pci_barx(oct, 0);
+		octeon_unmap_pci_barx(oct, 1);
 		return 1;
+	}
 
 	if (cn23xx_sriov_config(oct)) {
 		octeon_unmap_pci_barx(oct, 0);
-- 
2.25.1
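
The rule the patch restores, that every exit taken after a resource is
acquired must release it, can be sketched with a toy model. The patch
itself open-codes the two unmap calls; the goto-based unwinding below is
just one common way to express the same thing. struct demo_dev and the
demo_* helpers are hypothetical and only count mappings.

```c
#include <stdbool.h>

/* Hypothetical device model: counts how many BARs are currently mapped. */
struct demo_dev {
	int mapped_bars;
	bool pf_num_fails;	/* forces the step after mapping to fail */
};

static int demo_map_bar(struct demo_dev *d)
{
	d->mapped_bars++;
	return 0;
}

static void demo_unmap_bar(struct demo_dev *d)
{
	d->mapped_bars--;
}

/* Returns 0 on success, 1 on failure with nothing left mapped. */
static int demo_setup(struct demo_dev *d)
{
	if (demo_map_bar(d))
		return 1;
	if (demo_map_bar(d))
		goto unmap_bar0;
	if (d->pf_num_fails)	/* stands in for the PF number lookup */
		goto unmap_bar1;
	return 0;

unmap_bar1:
	demo_unmap_bar(d);
unmap_bar0:
	demo_unmap_bar(d);
	return 1;
}
```

After a failed demo_setup() the mapping count is back to zero, which is
the invariant the original error path was missing.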



* [PATCH net v5] net: airoha: Fix skb->priority underflow in airoha_dev_select_queue()
From: Wayen Yan @ 2026-06-20  8:17 UTC
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek, Joe Damato

In airoha_dev_select_queue(), the expression:

  queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES;

implicitly converts to unsigned arithmetic: when skb->priority is 0
(the default for unclassified traffic), (0u - 1u) wraps to UINT_MAX,
and UINT_MAX % 8 = 7, routing default best-effort packets to the
highest-priority QoS queue. This causes QoS inversion where the
majority of traffic on a PON gateway starves actual high-priority
flows (VoIP, gaming, etc.).

The "- 1" offset was a leftover from the ETS offload implementation
that has since been removed. The correct mapping is a direct modulo:

  queue = skb->priority % AIROHA_NUM_QOS_QUEUES;

This maps priority 0 → queue 0 (lowest), priority 7 → queue 7
(highest), with higher priorities wrapping around. This is the
standard Linux sk_prio → HW queue mapping used by other drivers.

Fixes: 2b288b81560b ("net: airoha: Introduce ndo_select_queue callback")
Link: https://lore.kernel.org/netdev/178185573207.2378135.3729126358670287878@gmail.com/
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Joe Damato <joe@dama.to>
---
Changes in v5:
- Rebase on net/main (previous version was incorrectly based on
  net-next/origin/master, causing Patchwork CI apply failure).

Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 64dde6464f3f..3370c3df7c10 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -2110,7 +2110,7 @@ static u16 airoha_dev_select_queue(struct net_device *netdev,
 	 */
 	channel = netdev_uses_dsa(netdev) ? skb_get_queue_mapping(skb) : port->id;
 	channel = channel % AIROHA_NUM_QOS_CHANNELS;
-	queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES; /* QoS queue */
+	queue = skb->priority % AIROHA_NUM_QOS_QUEUES;
 	queue = channel * AIROHA_NUM_QOS_QUEUES + queue;
 
 	return queue < netdev->num_tx_queues ? queue : 0;
-- 
2.51.0
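
The wraparound described in the commit message is easy to reproduce in
userspace. In this sketch DEMO_NUM_QOS_QUEUES is assumed to be 8, matching
the "UINT_MAX % 8" arithmetic above, and the two function names are
hypothetical.

```c
#include <stdint.h>

#define DEMO_NUM_QOS_QUEUES 8u	/* assumed from the commit message */

/* Old mapping: priority 0 wraps to UINT32_MAX before the modulo. */
static uint32_t demo_queue_old(uint32_t priority)
{
	return (priority - 1) % DEMO_NUM_QOS_QUEUES;
}

/* New mapping: direct modulo, so priority 0 stays in queue 0. */
static uint32_t demo_queue_new(uint32_t priority)
{
	return priority % DEMO_NUM_QOS_QUEUES;
}
```

With these definitions demo_queue_old(0) is 7 while demo_queue_new(0) is
0, and priorities of 8 and above wrap around in the new mapping.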




* Re: [RFC net-next 3/4] net: dsa: motorcomm: Dynamically allocate port structures
From: Andrew Lunn @ 2026-06-20  8:03 UTC
  To: David Yang
  Cc: netdev, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <CAAXyoMN9a6nncr-C1UDuUQMf4i5FR-9s-sfhSpYLYs5nmh9Uhg@mail.gmail.com>

On Fri, Jun 19, 2026 at 02:46:24PM +0800, David Yang wrote:
> On Fri, Jun 19, 2026 at 2:06 PM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > On Fri, Jun 19, 2026 at 04:26:31AM +0800, David Yang wrote:
> > > With support for LED introduced later, struct yt921x_priv will be 17k
> > > which is not very good for a single kmalloc(). Convert the ports array
> > > to an array of pointers to stop bloating the priv struct.
> > >
> > > Signed-off-by: David Yang <mmyangfl@gmail.com>
> > > ---
> > >  drivers/net/dsa/motorcomm/chip.c | 95 ++++++++++++++++++++++++--------
> > >  drivers/net/dsa/motorcomm/chip.h |  3 +-
> > >  2 files changed, 75 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/drivers/net/dsa/motorcomm/chip.c b/drivers/net/dsa/motorcomm/chip.c
> > > index 6dee25b6754a..d44f7749de02 100644
> > > --- a/drivers/net/dsa/motorcomm/chip.c
> > > +++ b/drivers/net/dsa/motorcomm/chip.c
> > > @@ -548,11 +548,14 @@ yt921x_mbus_ext_init(struct yt921x_priv *priv, struct device_node *mnp)
> > >  /* Read and handle overflow of 32bit MIBs. MIB buffer must be zeroed before. */
> > >  static int yt921x_read_mib(struct yt921x_priv *priv, int port)
> > >  {
> > > -     struct yt921x_port *pp = &priv->ports[port];
> > > +     struct yt921x_port *pp = priv->ports[port];
> > >       struct device *dev = to_device(priv);
> > >       struct yt921x_mib *mib = &pp->mib;
> > >       int res = 0;
> > >
> > > +     if (!pp)
> > > +             return -ENODEV;
> > > +
> >
> > Are all these tests actually needed? If you cannot allocate the
> > memory, I would expect the probe to fail, so you can never get here.
> >
> >         Andrew
> 
> Dummy ports are no longer assigned control blocks (in yt921x_dsa_setup).

This seems pretty error-prone. A missing check will result in an
oops. At least it will be obvious. How big is each port structure? Is
the memory saving worth it?

    Andrew


* [PATCH] netdevsim: fix use-after-free in __nsim_dev_port_del
From: Hrushiraj Gandhi @ 2026-06-20  6:49 UTC
  To: Jakub Kicinski
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Paolo Abeni,
	Jiri Pirko, netdev, linux-kernel, Hrushiraj Gandhi,
	syzbot+6c25f4750230faf70be9

debugfs files created under a port's ddir (ethtool/get_err,
ethtool/set_err, ring params, bpf_offloaded_id, udp_ports/inject_error,
etc.) store raw pointers directly into the netdevsim struct, which lives
in the net_device private data kmalloc slab.

If these files outlive the netdevsim struct, a concurrent reader can
trigger a slab-use-after-free by passing debugfs_file_get() (which only
checks dentry lifetime) and then dereferencing the freed data pointer
in debugfs_u32_get().

In __nsim_dev_port_del(), nsim_destroy() is called before
nsim_dev_port_debugfs_exit(). However, nsim_destroy() calls free_netdev()
at its end, while nsim_dev_port_debugfs_exit() removes the port's
debugfs directory. This means the slab is freed before the debugfs
files are removed.

Fix by calling debugfs_remove_recursive(ns->nsim_dev_port->ddir) in
nsim_destroy() right before free_netdev(). This ensures all per-port
debugfs files are destroyed synchronously before the backing memory is
freed. The subsequent call to nsim_dev_port_debugfs_exit() in
__nsim_dev_port_del() becomes a harmless no-op.

Reported-by: syzbot+6c25f4750230faf70be9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6c25f4750230faf70be9
Fixes: e05b2d141fef ("netdevsim: move netdev creation/destruction to dev probe")
Signed-off-by: Hrushiraj Gandhi <hrushirajg23@gmail.com>
---
 drivers/net/netdevsim/netdev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 27e5f109f933..08136e7990cb 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -1214,6 +1214,13 @@ void nsim_destroy(struct netdevsim *ns)
 		ns->page = NULL;
 	}
 
+	/*
+	 * Remove per-port debugfs files before free_netdev() releases the
+	 * netdevsim struct to prevent use-after-free in concurrent readers.
+	 */
+	debugfs_remove_recursive(ns->nsim_dev_port->ddir);
+	ns->nsim_dev_port->ddir = NULL;
+
 	free_netdev(dev);
 }
 
-- 
2.47.3



* Re: [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
From: Markus Stockhausen @ 2026-06-20  6:45 UTC
  To: 'Jan Klos', 'Heiner Kallweit',
	'Andrew Lunn', 'Russell King', netdev
  Cc: 'Maxime Chevallier', 'David S. Miller',
	'Eric Dumazet', 'Jakub Kicinski',
	'Paolo Abeni', 'Daniel Golle',
	'Vladimir Oltean', 'Aleksander Jan Bajkowski',
	'Jan Hoffmann', 'Issam Hamdi',
	'Chukun Pan', 'Russell King (Oracle)',
	'ChunHao Lin', linux-kernel
In-Reply-To: <20260620011956.37181-1-honza.klos@gmail.com>

> From: Jan Klos <honza.klos@gmail.com>
> Sent: Saturday, 20 June 2026 03:20
> Subject: [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
>
> On an RTL8127A connected to a link partner that advertises 10000baseT,
> the speed cannot be changed to anything other than 10000baseT, as 10GbE
> is always advertised regardless of any setting. Fix this by
> clearing MDIO_AN_10GBT_CTRL_ADV10G bit in rtl822x_config_aneg()'s
> call to phy_modify_mmd_changed().

As you are enhancing the mask, shouldn't this be "... by respecting ..."?

Markus



* [PATCH net] bnx2x: fix potential memory leak in bnx2x_alloc_mem_bp()
From: Abdun Nihaal @ 2026-06-20  6:23 UTC
  To: skalluru
  Cc: Abdun Nihaal, manishc, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel, barak, stable

If the allocation of fp[i].tpa_info fails, the error path will not free
the struct bnx2x_fastpath allocated earlier, as it is not linked to the
bp structure yet. Fix that by linking it immediately after allocation.
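
The ordering issue can be modelled in userspace. The sketch below is only an illustration, not the driver code: struct adapter, struct fastpath, adapter_alloc(), adapter_free() and the fail_at parameter are hypothetical stand-ins. It shows that once the array is linked to the parent before the per-entry allocations, a cleanup routine that only looks at the parent can release it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct bnx2x_fastpath and struct bnx2x. */
struct fastpath {
	void *tpa_info;
};

struct adapter {
	struct fastpath *fp;
	size_t fp_array_size;
};

/* Cleanup can only free what is reachable from the adapter. */
static void adapter_free(struct adapter *bp)
{
	size_t i;

	if (!bp->fp)
		return;
	for (i = 0; i < bp->fp_array_size; i++)
		free(bp->fp[i].tpa_info);	/* free(NULL) is a no-op */
	free(bp->fp);
	bp->fp = NULL;
}

/*
 * fail_at forces the tpa_info allocation at that index to fail
 * (an assumption of this sketch, used to exercise the error path).
 * Pass a value >= fp_array_size for no forced failure.
 */
static int adapter_alloc(struct adapter *bp, size_t fail_at)
{
	struct fastpath *fp;
	size_t i;

	fp = calloc(bp->fp_array_size, sizeof(*fp));
	if (!fp)
		goto alloc_err;
	bp->fp = fp;	/* link first, so alloc_err can reach the array */

	for (i = 0; i < bp->fp_array_size; i++) {
		fp[i].tpa_info = (i == fail_at) ? NULL : malloc(64);
		if (!fp[i].tpa_info)
			goto alloc_err;
	}
	return 0;

alloc_err:
	adapter_free(bp);
	return -1;
}
```

If the assignment to bp->fp were moved below the loop, adapter_free() would return early on the failure path and the calloc()ed array would be lost.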

Cc: stable@vger.kernel.org
Fixes: 15192a8cf8a8 ("bnx2x: Split the FP structure")
Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
---
Compile tested only. Issue found using static analysis.

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 19e078479b0d..5b2640bd31c3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4748,6 +4748,7 @@ int bnx2x_alloc_mem_bp(struct bnx2x *bp)
 	fp = kzalloc_objs(*fp, bp->fp_array_size);
 	if (!fp)
 		goto alloc_err;
+	bp->fp = fp;
 	for (i = 0; i < bp->fp_array_size; i++) {
 		fp[i].tpa_info =
 			kzalloc_objs(struct bnx2x_agg_info,
@@ -4756,8 +4757,6 @@ int bnx2x_alloc_mem_bp(struct bnx2x *bp)
 			goto alloc_err;
 	}
 
-	bp->fp = fp;
-
 	/* allocate sp objs */
 	bp->sp_objs = kzalloc_objs(struct bnx2x_sp_objs, bp->fp_array_size);
 	if (!bp->sp_objs)
-- 
2.43.0



* [syzbot] [fs?] [mm?] INFO: rcu detected stall in dentry_kill
From: syzbot @ 2026-06-20  3:58 UTC (permalink / raw)
  To: brauner, jack, linux-fsdevel, linux-kernel, linux-mm, netdev,
	syzkaller-bugs, viro

Hello,

syzbot found the following issue on:

HEAD commit:    b85966adbf5d Merge tag 'net-next-7.2' of git://git.kernel...
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15ffe3a1580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9a9f723a32776544
dashboard link: https://syzkaller.appspot.com/bug?extid=0635dc2e2c3c21a6aa04
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1192ccfe580000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10dec2ae580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/d65306d96573/disk-b85966ad.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/ef43139aab0e/vmlinux-b85966ad.xz
kernel image: https://storage.googleapis.com/syzbot-assets/26d4d1ab67c3/bzImage-b85966ad.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+0635dc2e2c3c21a6aa04@syzkaller.appspotmail.com

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 	0-...!: (1 GPs behind) idle=8aec/1/0x4000000000000000 softirq=15232/15238 fqs=0
rcu: 	(detected by 1, t=10502 jiffies, g=12001, q=779 ncpus=2)
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 UID: 0 PID: 5691 Comm: udevd Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:lock_release+0x2d3/0x3c0 kernel/locking/lockdep.c:5893
Code: 65 c7 05 2c 91 98 11 00 00 00 00 eb b5 e8 45 d1 05 0a f7 c3 00 02 00 00 74 b9 65 48 8b 05 45 4c 98 11 48 3b 44 24 28 75 44 fb <48> 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 48 8d
RSP: 0018:ffffc90000007c98 EFLAGS: 00000046
RAX: 2f357cb7f4202a00 RBX: ffff88803147f2a8 RCX: 0000000000010002
RDX: 0000000000010000 RSI: ffffffff8c291100 RDI: ffffffff8c2910c0
RBP: dffffc0000000000 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff52000000f90 R12: ffff8880611c6000
R13: ffffffff89b61a3a R14: ffff88803147f2c0 R15: ffff88803147f300
FS:  0000000000000000(0000) GS:ffff88812527c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564961a89a38 CR3: 000000000e746000 CR4: 00000000003526f0
Call Trace:
 <IRQ>
 __raw_spin_unlock include/linux/spinlock_api_smp.h:167 [inline]
 _raw_spin_unlock+0x16/0x50 kernel/locking/spinlock.c:190
 spin_unlock include/linux/spinlock.h:390 [inline]
 advance_sched+0x99a/0xc80 net/sched/sch_taprio.c:988
 __run_hrtimer kernel/time/hrtimer.c:2032 [inline]
 __hrtimer_run_queues+0x3bc/0xa10 kernel/time/hrtimer.c:2096
 hrtimer_interrupt+0x448/0x910 kernel/time/hrtimer.c:2215
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1051 [inline]
 __sysvec_apic_timer_interrupt+0x102/0x430 arch/x86/kernel/apic/apic.c:1068
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1062 [inline]
 sysvec_apic_timer_interrupt+0xa1/0xc0 arch/x86/kernel/apic/apic.c:1062
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:674
RIP: 0010:__unwind_start+0x514/0x660 arch/x86/kernel/unwind_orc.c:-1
Code: 10 42 80 3c 28 00 4c 8d 7b 38 74 08 4c 89 ff e8 12 7a ba 00 48 8b 44 24 08 49 39 07 0f 87 b6 fb ff ff 48 89 df e8 cc d0 ff ff <48> 8b 04 24 42 0f b6 04 28 84 c0 75 11 83 3b 00 4c 89 f1 0f 85 5b
RSP: 0018:ffffc9000432f590 EFLAGS: 00000282
RAX: 00000000f218b401 RBX: ffffc9000432f5e8 RCX: 0000000080000001
RDX: ffffc9000432f601 RSI: ffffffff8c291100 RDI: ffff888034f03e00
RBP: 1ffff92000865ebf R08: ffffc9000432f5d8 R09: 0000000000000000
R10: ffffc9000432f638 R11: fffff52000865ec9 R12: 1ffff92000865ebe
R13: dffffc0000000000 R14: ffffc9000432f5f8 R15: ffffc9000432f620
 unwind_start arch/x86/include/asm/unwind.h:64 [inline]
 arch_stack_walk+0xe3/0x150 arch/x86/kernel/stacktrace.c:24
 stack_trace_save+0xa9/0x100 kernel/stacktrace.c:122
 kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
 __call_rcu_common kernel/rcu/tree.c:3159 [inline]
 call_rcu+0xee/0x8b0 kernel/rcu/tree.c:3279
 __destroy_inode+0x2a1/0x630 fs/inode.c:365
 destroy_inode fs/inode.c:388 [inline]
 evict+0x8d4/0xb50 fs/inode.c:852
 dentry_kill+0x1b9/0x880 fs/dcache.c:826
 finish_dput+0x1a/0x260 fs/dcache.c:1001
 __fput+0x675/0xa50 fs/file_table.c:520
 task_work_run+0x1d9/0x270 kernel/task_work.c:233
 exit_task_work include/linux/task_work.h:40 [inline]
 do_exit+0x73a/0x2360 kernel/exit.c:1004
 do_group_exit+0x22d/0x2f0 kernel/exit.c:1147
 __do_sys_exit_group kernel/exit.c:1158 [inline]
 __se_sys_exit_group kernel/exit.c:1156 [inline]
 __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1156
 x64_sys_call+0x221a/0x2240 arch/x86/include/generated/asm/syscalls_64.h:232
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fbd5bcf16c5
Code: Unable to access opcode bytes at 0x7fbd5bcf169b.
RSP: 002b:00007ffe420f4688 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000564961aa4f80 RCX: 00007fbd5bcf16c5
RDX: 00000000000000e7 RSI: fffffffffffffe68 RDI: 0000000000000000
RBP: 0000564961a80910 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe420f46d0 R14: 0000000000000000 R15: 0000000000000000
 </TASK>
rcu: rcu_preempt kthread starved for 10502 jiffies! g12001 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:R  running task     stack:28040 pid:16    tgid:16    ppid:2      task_flags:0x208040 flags:0x00080000
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5504 [inline]
 __schedule+0x17d9/0x56c0 kernel/sched/core.c:7228
 __schedule_loop kernel/sched/core.c:7307 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7322
 schedule_timeout+0x152/0x2c0 kernel/time/sleep_timeout.c:99
 rcu_gp_fqs_loop+0x30c/0x11f0 kernel/rcu/tree.c:2123
 rcu_gp_kthread+0x9e/0x2b0 kernel/rcu/tree.c:2325
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
rcu: Stack dump where RCU GP kthread last ran:
CPU: 1 UID: 0 PID: 5689 Comm: udevd Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:csd_lock_wait kernel/smp.c:342 [inline]
RIP: 0010:smp_call_function_many_cond+0x10b0/0x14b0 kernel/smp.c:892
Code: c0 75 73 41 8b 1e 89 de 83 e6 01 31 ff e8 98 02 0c 00 83 e3 01 48 bb 00 00 00 00 00 fc ff df 75 07 e8 44 fe 0b 00 eb 37 f3 90 <41> 0f b6 04 1c 84 c0 75 10 41 f7 06 01 00 00 00 74 1e e8 29 fe 0b
RSP: 0000:ffffc9000430f840 EFLAGS: 00000293
RAX: ffffffff81b9f7f7 RBX: dffffc0000000000 RCX: ffff88807f020000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffffc9000430f970 R08: ffffffff903116f7 R09: 1ffffffff20622de
R10: dffffc0000000000 R11: fffffbfff20622df R12: 1ffff110170c85c5
R13: ffff8880b873c2c8 R14: ffff8880b8642e28 R15: 0000000000000000
FS:  00007fbd5c388880(0000) GS:ffff88812537c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564961a89a38 CR3: 0000000044280000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 on_each_cpu_cond_mask+0x3f/0x80 kernel/smp.c:1057
 __flush_tlb_multi arch/x86/include/asm/paravirt.h:46 [inline]
 flush_tlb_multi arch/x86/mm/tlb.c:1361 [inline]
 flush_tlb_mm_range+0x5c4/0x1090 arch/x86/mm/tlb.c:1451
 flush_tlb_page arch/x86/include/asm/tlbflush.h:345 [inline]
 ptep_clear_flush+0x120/0x170 mm/pgtable-generic.c:104
 wp_page_copy mm/memory.c:3941 [inline]
 do_wp_page+0x3d52/0x4c70 mm/memory.c:4336
 handle_pte_fault mm/memory.c:6443 [inline]
 __handle_mm_fault mm/memory.c:6565 [inline]
 handle_mm_fault+0x1490/0x3080 mm/memory.c:6734
 do_user_addr_fault+0xa4d/0x1340 arch/x86/mm/fault.c:1339
 handle_page_fault arch/x86/mm/fault.c:1479 [inline]
 exc_page_fault+0x6a/0xc0 arch/x86/mm/fault.c:1532
 asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:595
RIP: 0033:0x7fbd5c3ada9a
Code: 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 53 48 85 ff 74 2f 48 8b 47 08 48 39 c7 74 21 48 8b 1f 48 39 df 74 19 48 89 18 <48> 89 43 08 e8 8d d9 ff ff 48 89 d8 5b c3 0f 1f 84 00 00 00 00 00
RSP: 002b:00007ffe420f4620 EFLAGS: 00010202
RAX: 0000564961a8a0b0 RBX: 0000564961a89a30 RCX: 0000000000000000
RDX: 0000564961a95430 RSI: 0000564961a91f60 RDI: 0000564961a8f4e0
RBP: 0000564961a8f4e0 R08: 0000564961a91f70 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000297 R12: 0000564958c24588
R13: 00007ffe420f46d0 R14: 0000000000000000 R15: 0000000000000000
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* [PATCH bpf v2] bpf, sockmap: disallow update and delete from tc, xdp and flow_dissector
From: Sechang Lim @ 2026-06-20  3:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Stanislav Fomichev, Lorenz Bauer, Jiayuan Chen, bpf, linux-kernel,
	netdev

sock_map_update_common() and __sock_map_delete() hold stab->lock and call
sock_map_unref() -> sock_map_del_link(), which takes sk_callback_lock for
write. That gives the order stab->lock -> sk_callback_lock.

The reverse order comes from the SK_SKB stream parser.
sk_psock_strp_data_ready() holds sk_callback_lock for read, and after the
verdict, tcp_bpf_strp_read_sock() acks the consumed data inline via
__tcp_cleanup_rbuf(). The ACK goes out through egress, where a sched_cls
program deletes from the sockmap and takes stab->lock:

  WARNING: possible circular locking dependency detected
  7.1.0-rc6 Not tainted
  ------------------------------------------------------
  syz.9.8824 is trying to acquire lock:
  (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
  but task is already holding lock:
  (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173

  -> #1 (clock-AF_INET){++.-}-{3:3}:
         _raw_write_lock_bh
         sock_map_del_link net/core/sock_map.c:167
         sock_map_unref net/core/sock_map.c:184
         sock_map_update_common net/core/sock_map.c:509
         sock_map_update_elem_sys net/core/sock_map.c:588
         map_update_elem kernel/bpf/syscall.c:1805

  -> #0 (&stab->lock){+.-.}-{3:3}:
         _raw_spin_lock_bh
         __sock_map_delete net/core/sock_map.c:421
         sock_map_delete_elem net/core/sock_map.c:452
         bpf_prog_06044d24140080b6
         tcx_run net/core/dev.c:4451
         sch_handle_egress net/core/dev.c:4541
         __dev_queue_xmit net/core/dev.c:4808
         ...
         tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701
         strp_data_ready net/strparser/strparser.c:402
         sk_psock_strp_data_ready net/core/skmsg.c:1174
         tcp_data_queue net/ipv4/tcp_input.c:5661

  Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    rlock(clock-AF_INET);
                                 lock(&stab->lock);
                                 lock(clock-AF_INET);
    lock(&stab->lock);

   *** DEADLOCK ***

A tc, xdp or flow_dissector program has no reason to update or delete a
sockmap, and redirect does not go through here. Drop these program types
from may_update_sockmap() so the verifier rejects such calls. This also
closes the matching sockhash inversion.
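
For readers less familiar with this class of report, the inversion can be reduced to a two-lock ordering table. The following is a minimal userspace sketch and not how lockdep is implemented: struct lock_order, record_order() and the LOCK_* names are hypothetical. It records which lock was taken while another was held and flags the first time the reverse order shows up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two locks in the report. */
enum { LOCK_STAB, LOCK_SK_CALLBACK, NR_LOCKS };

/* seen[a][b] is true once lock b was taken while lock a was held. */
struct lock_order {
	bool seen[NR_LOCKS][NR_LOCKS];
};

/*
 * Record "outer held, then inner taken". Returns true if the reverse
 * order (inner held, then outer taken) was recorded earlier, i.e. the
 * two paths together form an ABBA inversion.
 */
static bool record_order(struct lock_order *lo, int outer, int inner)
{
	lo->seen[outer][inner] = true;
	return lo->seen[inner][outer];
}
```

The update path records stab->lock followed by sk_callback_lock; the stream parser path with an egress sched_cls program records the reverse, which is the point where the checker reports the inversion.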

Fixes: 0126240f448d ("bpf: sockmap: Allow update from BPF")
Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
v2:
 - reject sockmap update/delete from tc, xdp and flow_dissector (John
   Fastabend)
 - fix the changelog (Jiayuan Chen)

v1:
 - https://lore.kernel.org/all/20260616091153.2966617-1-rhkrqnwk98@gmail.com/

 kernel/bpf/verifier.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7fb88e1cd7c4..94d225521b5a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8766,11 +8766,7 @@ static bool may_update_sockmap(struct bpf_verifier_env *env, int func_id)
 			return true;
 		break;
 	case BPF_PROG_TYPE_SOCKET_FILTER:
-	case BPF_PROG_TYPE_SCHED_CLS:
-	case BPF_PROG_TYPE_SCHED_ACT:
-	case BPF_PROG_TYPE_XDP:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
-	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return true;
 	default:
-- 
2.43.0



* Re: [PATCH 1/2] fs: Add bpf_sock_read_xattr() kfunc to read socket xattrs
From: Alexei Starovoitov @ 2026-06-20  3:20 UTC (permalink / raw)
  To: Christian Brauner, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann
  Cc: Alexander Viro, Jan Kara, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, linux-fsdevel, netdev, bpf, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa
In-Reply-To: <20260617-work-bpf-sock-xattr-v1-1-a1276f7c9da3@kernel.org>

On Wed Jun 17, 2026 at 4:18 AM PDT, Christian Brauner wrote:
> In c8db08110cbe ("Merge tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
> we added support for extended attributes for sockets. This comes in two
> flavors: sockfs and non-sockfs/filesystem sockets. Filesystem sockets
> are actual filesystem objects so reading xattrs must use dedicated fs
> helpers such as bpf_get_dentry_xattr() and bpf_get_file_xattr(). Those
> are inherently sleeping operations. Sockfs sockets on the other hand
> don't need to use sleeping operations as the underlying data structure
> is lockless. In addition, retrieval of sockfs extended attributes often
> happens from LSM hooks that only provide struct socket and it's
> completely nonsensical to grab a reference to a file, then force a
> sleeping operation to retrieve the xattr and drop the reference. We know
> that the sockfs file cannot go away while the LSM hook runs.
>
> This series adds a bpf_sock_read_xattr() kfunc that, given a struct
> socket, reads a user.* extended attribute from the socket's sockfs inode
> into a bpf_dynptr. Together with fsetxattr() from userspace this lets a
> process label a socket with a user.* xattr and have a BPF LSM program
> retrieve that label locklessly. The kfunc mirrors the existing
> bpf_cgroup_read_xattr(), including the restriction to the user.*
> namespace.
>
> systemd uses user.* xattrs on sockets to implement socket rate limiting
> and to tag sockets for other purposes [1] such as implementing a varlink
> registry. There is currently no efficient way for a BPF program to read
> those labels back. The new helper allows a listening socket marked with
> an extended attribute to be read back during bind/connect and then act
> on the connect()ing socket. Extended attributes make it possible to
> allow an unprivileged user manager such as systemd --user to mark
> sockets from userspace and then rediscover them or implement policies.
>
> The kfunc is registered KF_RCU and only for BPF LSM programs. A struct
> socket is only guaranteed to live in sockfs when an LSM socket hook hands
> it out, which is what keeps SOCK_INODE() valid. Sockets that embed struct
> socket outside sockfs (tun, tap) are only reachable from tracing programs
> and are excluded by the registration. (Btw, for consistency it would
> be nice to force allocation of struct socket from sockfs instead of
> simply embedding it in e.g., struct tun_file which makes the SOCKFS_I()
> pattern a hazard - at least outside of sockfs functions.)
>
> The read never sleeps and takes no lock. For sockfs the value lives in
> the inode's in-memory xattr store and simple_xattr_get() resolves it
> with an RCU-protected rhashtable lookup, taking neither the inode lock
> nor any xattr lock. The kfunc is therefore usable from both sleepable
> and non-sleepable LSM hooks.
>
> Link: https://github.com/systemd/systemd/pull/40559 [1]
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  fs/bpf_fs_kfuncs.c  | 37 +++++++++++++++++++++++++++++++++++++
>  include/linux/net.h |  1 +
>  net/socket.c        | 25 +++++++++++++++++++++++++
>  3 files changed, 63 insertions(+)
>
> diff --git a/fs/bpf_fs_kfuncs.c b/fs/bpf_fs_kfuncs.c
> index 11841c3d4260..85fc9519d1ff 100644
> --- a/fs/bpf_fs_kfuncs.c
> +++ b/fs/bpf_fs_kfuncs.c
> @@ -11,6 +11,7 @@
>  #include <linux/file.h>
>  #include <linux/kernfs.h>
>  #include <linux/mm.h>
> +#include <linux/net.h>
>  #include <linux/xattr.h>
>  
>  __bpf_kfunc_start_defs();
> @@ -359,6 +360,39 @@ __bpf_kfunc int bpf_cgroup_read_xattr(struct cgroup *cgroup, const char *name__s
>  }
>  #endif /* CONFIG_CGROUPS */
>  
> +#ifdef CONFIG_NET
> +/**
> + * bpf_sock_read_xattr - read xattr of a socket's inode in sockfs
> + * @sock: socket to get xattr from
> + * @name__str: name of the xattr
> + * @value_p: output buffer of the xattr value
> + *
> + * Get xattr *name__str* of *sock* and store the output in *value_p*.
> + *
> + * For security reasons, only *name__str* with prefix "user." is allowed.
> + *
> + * Return: length of the xattr value on success, a negative value on error.
> + */
> +__bpf_kfunc int bpf_sock_read_xattr(struct socket *sock, const char *name__str,
> +				    struct bpf_dynptr *value_p)
> +{
> +	struct bpf_dynptr_kern *value_ptr = (struct bpf_dynptr_kern *)value_p;
> +	u32 value_len;
> +	void *value;
> +
> +	/* Only allow reading "user.*" xattrs */
> +	if (strncmp(name__str, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN))
> +		return -EPERM;
> +
> +	value_len = __bpf_dynptr_size(value_ptr);
> +	value = __bpf_dynptr_data_rw(value_ptr, value_len);
> +	if (!value)
> +		return -EINVAL;
> +
> +	return sock_read_xattr(sock, name__str, value, value_len);
> +}
> +#endif /* CONFIG_NET */

lgtm.
How do you want to route it? Through the vfs tree for the next merge window?
If so
Acked-by: Alexei Starovoitov <ast@kernel.org>


* [PATCH bpf-next v5 3/3] selftests/bpf: test rejection of a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel
In-Reply-To: <20260620024423.4141004-1-rhkrqnwk98@gmail.com>

Verify that attaching an SK_SKB stream parser that can modify the packet
is rejected, while a read-only parser still attaches.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 2 files changed, 38 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
index 621b3b71888e..1d7231728eaf 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
@@ -431,6 +431,35 @@ static void test_sockmap_strp_verdict(int family, int sotype)
 	test_sockmap_strp__destroy(strp);
 }
 
+static void test_sockmap_strp_parser_reject(void)
+{
+	struct test_sockmap_strp *strp = NULL;
+	int parser_mod, parser_ro, link;
+	int err, map;
+
+	strp = test_sockmap_strp__open_and_load();
+	if (!ASSERT_OK_PTR(strp, "test_sockmap_strp__open_and_load"))
+		return;
+
+	map = bpf_map__fd(strp->maps.sock_map);
+	parser_mod = bpf_program__fd(strp->progs.prog_skb_parser_resize);
+	parser_ro = bpf_program__fd(strp->progs.prog_skb_parser);
+
+	err = bpf_prog_attach(parser_mod, map, BPF_SK_SKB_STREAM_PARSER, 0);
+	ASSERT_ERR(err, "bpf_prog_attach parser_mod");
+
+	link = bpf_link_create(parser_ro, map, BPF_SK_SKB_STREAM_PARSER, NULL);
+	if (!ASSERT_GE(link, 0, "bpf_link_create parser_ro"))
+		goto out;
+
+	err = bpf_link_update(link, parser_mod, NULL);
+	ASSERT_ERR(err, "bpf_link_update parser_mod");
+out:
+	if (link >= 0)
+		close(link);
+	test_sockmap_strp__destroy(strp);
+}
+
 void test_sockmap_strp(void)
 {
 	if (test__start_subtest("sockmap strp tcp pass"))
@@ -451,4 +480,6 @@ void test_sockmap_strp(void)
 		test_sockmap_strp_multiple_pkt(AF_INET, SOCK_STREAM);
 	if (test__start_subtest("sockmap strp tcp dispatch"))
 		test_sockmap_strp_dispatch_pkt(AF_INET, SOCK_STREAM);
+	if (test__start_subtest("sockmap strp parser reject pkt mod"))
+		test_sockmap_strp_parser_reject();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
index dde3d5bec515..fe88fa6d40bc 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
@@ -50,4 +50,11 @@ int prog_skb_parser_partial(struct __sk_buff *skb)
 	return 10;
 }
 
+SEC("sk_skb/stream_parser")
+int prog_skb_parser_resize(struct __sk_buff *skb)
+{
+	bpf_skb_change_tail(skb, skb->len, 0);
+	return skb->len;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0



* [PATCH bpf-next v5 2/3] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel
In-Reply-To: <20260620024423.4141004-1-rhkrqnwk98@gmail.com>

sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
to find the length of the next message. strparser assembles a message out
of several received skbs by chaining them onto the head's frag_list and
recording where to append the next one in strp->skb_nextp:

	*strp->skb_nextp = skb;
	strp->skb_nextp = &skb->next;

and then calls the parser on the head:

	len = (*strp->cb.parse_msg)(strp, head);

The parser is only meant to inspect the skb, but the program may call
bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
Once the head carries a frag_list these go

	... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail

and __pskb_pull_tail() frees the frag_list skbs that strparser still
tracks through skb_nextp:

	while ((list = skb_shinfo(skb)->frag_list) != insp) {
		skb_shinfo(skb)->frag_list = list->next;
		consume_skb(list);
	}

strp->skb_nextp now points into a freed sk_buff. The next segment of
the same message arrives in __strp_recv(), which links it with
*strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
and the write happen in different __strp_recv() calls, so the message
has to span at least three segments before it triggers.
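
The stale pointer can be shown with a small userspace model. This is a simplified sketch under stated assumptions, not strparser itself: struct seg, struct parser_state and the helpers are hypothetical, the head skb is left out, and segments are marked with a freed flag instead of being released so the model stays well defined. The append step mirrors the two skb_nextp statements quoted above, and drop_frag_list() stands in for the frag_list walk in __pskb_pull_tail().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an sk_buff on the frag_list. */
struct seg {
	struct seg *next;
	bool freed;		/* set instead of calling free() */
};

struct parser_state {
	struct seg *frag_list;	/* chain hanging off the message head */
	struct seg **nextp;	/* where the next segment gets linked */
};

static void state_init(struct parser_state *st)
{
	st->frag_list = NULL;
	st->nextp = &st->frag_list;
}

/* Mirrors: *strp->skb_nextp = skb; strp->skb_nextp = &skb->next; */
static void append(struct parser_state *st, struct seg *s)
{
	*st->nextp = s;
	st->nextp = &s->next;
}

/* Walks the chain like the quoted loop, but nextp is never updated. */
static void drop_frag_list(struct parser_state *st)
{
	struct seg *s;

	while ((s = st->frag_list) != NULL) {
		st->frag_list = s->next;
		s->freed = true;
	}
}

/* True if nextp points into a segment of segs[] that was dropped. */
static bool nextp_is_stale(struct parser_state *st, struct seg *segs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (st->nextp == &segs[i].next && segs[i].freed)
			return true;
	return false;
}
```

In this model the next append() after drop_frag_list() would store through nextp into a dropped segment, which corresponds to the 8-byte write KASAN reports.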

  BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
  Write of size 8 at addr ffff88810db86140 by task repro/349

  Call Trace:
   <IRQ>
   __strp_recv+0x447/0xda0
   __tcp_read_sock+0x13d/0x590
   tcp_bpf_strp_read_sock+0x195/0x320
   strp_data_ready+0x267/0x340
   sk_psock_strp_data_ready+0x1ce/0x350
   tcp_data_queue+0x1364/0x2fd0
   tcp_rcv_established+0xe07/0x1640
   [...]

  Allocated by task 349:
   skb_clone+0x17b/0x210
   __strp_recv+0x2c3/0xda0
   __tcp_read_sock+0x13d/0x590
   [...]

  Freed by task 349:
   kmem_cache_free+0x150/0x570
   __pskb_pull_tail+0x57b/0xc20
   skb_ensure_writable+0x236/0x260
   __bpf_skb_change_tail+0x1d4/0x590
   sk_skb_change_tail+0x2a/0x40
   bpf_prog_1b285dcd6c41373e+0x27/0x30
   bpf_prog_run_pin_on_cpu+0xf3/0x260
   sk_psock_strp_parse+0x118/0x1e0
   __strp_recv+0x4f6/0xda0
   [...]

The same resize also leaves the head's length inconsistent with its
frags, so a later __pskb_pull_tail() can instead hit the
BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.

A stream parser is only meant to measure the next message, not to modify
the packet. Reject a parser whose program can change packet data
(prog->aux->changes_pkt_data) at attach time. The check is shared by
sock_map_prog_update() and sock_map_link_update_prog(), which between them
cover prog attach, link create and link update. Verdict programs are
unaffected and may still modify the skb.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/core/sock_map.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99e3789492a0..c60ba6d292f9 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1515,6 +1515,17 @@ static int sock_map_prog_link_lookup(struct bpf_map *map, struct bpf_prog ***ppr
 	return 0;
 }
 
+static int sock_map_prog_attach_check(enum bpf_attach_type attach_type,
+				      struct bpf_prog *prog)
+{
+	/* A stream parser must not modify the skb, only measure it. */
+	if (prog && attach_type == BPF_SK_SKB_STREAM_PARSER &&
+	    prog->aux->changes_pkt_data)
+		return -EINVAL;
+
+	return 0;
+}
+
 /* Handle the following four cases:
  * prog_attach: prog != NULL, old == NULL, link == NULL
  * prog_detach: prog == NULL, old != NULL, link == NULL
@@ -1533,6 +1544,10 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 	if (ret)
 		return ret;
 
+	ret = sock_map_prog_attach_check(which, prog);
+	if (ret)
+		return ret;
+
 	/* for prog_attach/prog_detach/link_attach, return error if a bpf_link
 	 * exists for that prog.
 	 */
@@ -1776,6 +1791,11 @@ static int sock_map_link_update_prog(struct bpf_link *link,
 		ret = -EINVAL;
 		goto out;
 	}
+
+	ret = sock_map_prog_attach_check(link->attach_type, prog);
+	if (ret)
+		goto out;
+
 	if (!sockmap_link->map) {
 		ret = -ENOLINK;
 		goto out;
-- 
2.43.0



* [PATCH bpf-next v5 1/3] selftests/bpf: don't modify the skb in the strparser parser prog
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel
In-Reply-To: <20260620024423.4141004-1-rhkrqnwk98@gmail.com>

sockmap_parse_prog.c is attached as an SK_SKB stream parser and modifies
the skb: it calls bpf_skb_pull_data() and writes a byte into the packet.
A stream parser runs on strparser's message head and must not modify it.
A resize frees the frag_list segments strparser still tracks, leading to
a use-after-free.

Make the parser read-only. It only needs to return the message length,
so it still attaches once packet-modifying parsers are rejected.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 .../selftests/bpf/progs/sockmap_parse_prog.c  | 22 -------------------
 1 file changed, 22 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c b/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
index c9abfe3a11af..56e9aebf05f2 100644
--- a/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
+++ b/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
@@ -5,28 +5,6 @@
 SEC("sk_skb1")
 int bpf_prog1(struct __sk_buff *skb)
 {
-	void *data_end = (void *)(long) skb->data_end;
-	void *data = (void *)(long) skb->data;
-	__u8 *d = data;
-	int err;
-
-	if (data + 10 > data_end) {
-		err = bpf_skb_pull_data(skb, 10);
-		if (err)
-			return SK_DROP;
-
-		data_end = (void *)(long)skb->data_end;
-		data = (void *)(long)skb->data;
-		if (data + 10 > data_end)
-			return SK_DROP;
-	}
-
-	/* This write/read is a bit pointless but tests the verifier and
-	 * strparser handler for read/write pkt data and access into sk
-	 * fields.
-	 */
-	d = data;
-	d[7] = 1;
 	return skb->len;
 }
 
-- 
2.43.0



* [PATCH bpf-next v5 0/3] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel

A BPF_PROG_TYPE_SK_SKB stream parser runs on strparser's message head,
which can chain skbs through frag_list. A parser that resizes the skb
frees the frag_list segments that strparser still tracks through
skb_nextp, leading to a use-after-free.

A stream parser is only meant to measure the next message, not to modify
the packet, so reject a packet-modifying parser at attach time.

v5:
 - target bpf-next instead of bpf
 - add Reviewed-by tag (Jiayuan Chen)

v4:
 - https://lore.kernel.org/all/20260619062959.3277612-1-rhkrqnwk98@gmail.com/

v3:
 - https://lore.kernel.org/all/20260618102718.2331468-1-rhkrqnwk98@gmail.com/

v2:
 - https://lore.kernel.org/all/20260612123553.2724240-1-rhkrqnwk98@gmail.com/

v1:
 - https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/

Sechang Lim (3):
  selftests/bpf: don't modify the skb in the strparser parser prog
  bpf, sockmap: reject a packet-modifying SK_SKB stream parser
  selftests/bpf: test rejection of a packet-modifying SK_SKB stream
    parser

 net/core/sock_map.c                           | 20 ++++++++++++
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/sockmap_parse_prog.c  | 22 -------------
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 4 files changed, 58 insertions(+), 22 deletions(-)

-- 
2.43.0



* [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
From: Jan Klos @ 2026-06-20  1:19 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn, Russell King, netdev
  Cc: Jan Klos, Maxime Chevallier, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Golle, Vladimir Oltean,
	Aleksander Jan Bajkowski, Markus Stockhausen, Jan Hoffmann,
	Issam Hamdi, Chukun Pan, Russell King (Oracle), ChunHao Lin,
	linux-kernel

On an RTL8127A connected to a link partner that advertises 10000baseT,
the speed cannot be changed to anything other than 10000baseT, because
10GbE is always advertised regardless of any setting. Fix this by
including MDIO_AN_10GBT_CTRL_ADV10G in the mask that
rtl822x_config_aneg() passes to phy_modify_mmd_changed(), so the bit is
cleared when 10G is not requested.
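
As a rough illustration of why the mask matters, here is a minimal
userspace sketch of a masked read-modify-write. It assumes that
phy_modify_mmd_changed() computes new = (old & ~mask) | set. The helper
name modify_bits() is hypothetical, and the bit values are assumed to
match include/uapi/linux/mdio.h rather than being taken from this patch.

```c
#include <stdint.h>

/* Assumed to match include/uapi/linux/mdio.h; illustration only. */
#define ADV2_5G 0x0080
#define ADV5G   0x0100
#define ADV10G  0x1000

/*
 * Hypothetical model of a masked register update: only bits that are
 * present in mask can be cleared, everything else keeps its old value.
 */
static uint16_t modify_bits(uint16_t old, uint16_t mask, uint16_t set)
{
	return (uint16_t)((old & ~mask) | set);
}
```

Starting from a register that advertises 2.5G, 5G and 10G (0x1180) and
requesting 2.5G only, the old mask (ADV2_5G | ADV5G) leaves 0x1080, so
10G stays advertised. With ADV10G added to the mask the result is 0x0080.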

Fixes: 83d962316128 ("net: phy: realtek: add RTL8127-internal PHY")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Jan Klos <honza.klos@gmail.com>
---
v2: Patch formalities (rebase, tree name, tags, ccs)

 drivers/net/phy/realtek/realtek_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/realtek/realtek_main.c b/drivers/net/phy/realtek/realtek_main.c
index 27268811f564..b65d0f5fa1a0 100644
--- a/drivers/net/phy/realtek/realtek_main.c
+++ b/drivers/net/phy/realtek/realtek_main.c
@@ -1802,7 +1802,8 @@ static int rtl822x_config_aneg(struct phy_device *phydev)
 		ret = phy_modify_mmd_changed(phydev, MDIO_MMD_VEND2,
 					     RTL_MDIO_AN_10GBT_CTRL,
 					     MDIO_AN_10GBT_CTRL_ADV2_5G |
-					     MDIO_AN_10GBT_CTRL_ADV5G, adv);
+					     MDIO_AN_10GBT_CTRL_ADV5G |
+					     MDIO_AN_10GBT_CTRL_ADV10G, adv);
 		if (ret < 0)
 			return ret;
 	}
-- 
2.54.0



* [PATCH bpf] bpf: tcp: Fix use-after-free in bpf_iter_tcp_established_batch()
From: Jose Fernandez (Anthropic) @ 2026-06-20  0:32 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrii Nakryiko,
	Yonghong Song, Martin KaFai Lau
  Cc: netdev, linux-kernel, bpf, Ben Cressey,
	Jose Fernandez (Anthropic)

reqsk_queue_hash_req() publishes a TCP_NEW_SYN_RECV request_sock onto
the ehash chain (via inet_ehash_insert(), which drops the bucket lock on
return) and only afterwards refcount_set()s rsk_refcnt to 3.

Lockless readers such as __inet_lookup_established() account for this by
using refcount_inc_not_zero(), but bpf_iter_tcp_established_batch() uses
plain sock_hold() while holding the bucket lock, on the assumption that
the lock guarantees sk_refcnt > 0. That assumption does not hold for
request_sock:

  CPU 0                                CPU 1
  -----                                -----
  tcp_conn_request()
   reqsk_queue_hash_req()
    inet_ehash_insert(req)
     spin_lock(bucket)
     __sk_nulls_add_node_rcu(req)      // rsk_refcnt == 0
     spin_unlock(bucket)
                                       bpf_iter_tcp_established_batch()
                                        spin_lock(bucket)
                                        sock_hold(req)   <-- addition on 0
                                        spin_unlock(bucket)
    refcount_set(&req->rsk_refcnt, 3)  // clobbers saturated value

which surfaces as:

  refcount_t: addition on 0; use-after-free.
  WARNING: lib/refcount.c:25 at refcount_warn_saturate+0x48/0x90, CPU#1
  Call Trace:
   bpf_iter_tcp_established_batch+0x14e/0x170
   bpf_iter_tcp_batch+0x53/0x200
   bpf_iter_tcp_seq_next+0x27/0x70
   bpf_seq_read+0x107/0x410
   vfs_read+0xb9/0x380

refcount_warn_saturate() then saturates the count, the publishing CPU's
refcount_set() clobbers it, and the socket is left one reference short.
When the last legitimate owner drops its reference the reqsk is freed
while still reachable, leading to use-after-free panics in e.g.
inet_csk_accept() or inet_csk_listen_stop().

This reproduces in seconds with tcp_syncookies=0, a handful of threads
doing connect()/close() to a local listener while others read an
iter/tcp link in a tight loop.

Use refcount_inc_not_zero() and skip the socket on failure, the same way
every other ehash walker does. The listening hash is unaffected as
listeners are always inserted into lhash2 with sk_refcnt >= 1, so
bpf_iter_tcp_listening_batch() is left as-is.
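
The difference between the two ways of taking a reference can be
sketched in userspace with C11 atomics. This is only a model under
stated assumptions: struct ref, ref_inc_not_zero() and batch_try_add()
are hypothetical names, not the kernel's refcount_t API or the iterator
code in this patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a reference counter (not refcount_t). */
struct ref {
	atomic_uint count;
};

/* Take a reference only if the object is already live (count > 0). */
static bool ref_inc_not_zero(struct ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Hypothetical walker step: add the object to the batch only if a
 * reference could be taken; an object whose count is still 0 is skipped.
 */
static bool batch_try_add(struct ref *r, struct ref **batch,
			  unsigned int *end, unsigned int max)
{
	if (*end >= max)
		return false;
	if (!ref_inc_not_zero(r))
		return false;
	batch[(*end)++] = r;
	return true;
}
```

An object that is visible in the chain but whose count has not yet been
set is left untouched and does not enter the batch, so a later
initialising store cannot overwrite a reference the walker believes it
holds.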

If every matching socket in a bucket is mid-init, end_sk can stay at 0;
advance to the next bucket in that case rather than terminating the
whole iteration on a stale batch[0].

Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock")
Reviewed-by: Ben Cressey <ben@cressey.dev>
Assisted-by: Claude:unspecified
Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
---
 net/ipv4/tcp_ipv4.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fdc81150ff6c..92342dcc6892 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3074,25 +3074,25 @@ static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
 {
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct hlist_nulls_node *node;
-	unsigned int expected = 1;
-	struct sock *sk;
+	unsigned int expected = 0;
+	struct sock *sk = *start_sk;
 
-	sock_hold(*start_sk);
-	iter->batch[iter->end_sk++].sk = *start_sk;
-
-	sk = sk_nulls_next(*start_sk);
 	*start_sk = NULL;
 	sk_nulls_for_each_from(sk, node) {
-		if (seq_sk_match(seq, sk)) {
-			if (iter->end_sk < iter->max_sk) {
-				sock_hold(sk);
-				iter->batch[iter->end_sk++].sk = sk;
-			} else if (!*start_sk) {
-				/* Remember where we left off. */
-				*start_sk = sk;
-			}
-			expected++;
+		if (!seq_sk_match(seq, sk))
+			continue;
+		if (iter->end_sk < iter->max_sk) {
+			/* reqsk_queue_hash_req() inserts with sk_refcnt == 0
+			 * and refcount_set()s it after the bucket lock drops.
+			 */
+			if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+				continue;
+			iter->batch[iter->end_sk++].sk = sk;
+		} else if (!*start_sk) {
+			/* Remember where we left off. */
+			*start_sk = sk;
 		}
+		expected++;
 	}
 
 	return expected;
@@ -3129,6 +3129,7 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 	struct sock *sk;
 	int err;
 
+again:
 	sk = bpf_iter_tcp_resume(seq);
 	if (!sk)
 		return NULL; /* Done */
@@ -3167,6 +3168,10 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 	WARN_ON_ONCE(iter->end_sk != expected);
 done:
 	bpf_iter_tcp_unlock_bucket(seq);
+	if (unlikely(!iter->end_sk)) {
+		++iter->state.bucket;
+		goto again;
+	}
 	return iter->batch[0].sk;
 }
 

---
base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
change-id: 20260619-bpf-iter-tcp-refcnt-107d52b238da

Best regards,
--  
Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>



* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Ilya Maximets @ 2026-06-19 22:59 UTC (permalink / raw)
  To: Johan Thomsen
  Cc: i.maximets, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, Gustavo A. R. Silva,
	Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-kernel, linux-hardening, llvm
In-Reply-To: <CAKv6aAMTqSo0qSng2Kv4=i6BSvJuUy9KSfeFRYx5JsYuo9=kqQ@mail.gmail.com>

On 6/18/26 1:43 PM, Johan Thomsen wrote:
>> Johan, if you can test this one in your setup as well, that would
>> be great.  Thanks.
>>
>>  include/net/dst_metadata.h | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>> --- a/include/net/dst_metadata.h
>> +++ b/include/net/dst_metadata.h
>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>         if (!new_md)
>>                 return ERR_PTR(-ENOMEM);
>>
>> -       memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>> -              sizeof(struct ip_tunnel_info) + md_size);
>> +       /* Copy in two stages to keep the __counted_by happy. */
>> +       new_md->u.tun_info = md_dst->u.tun_info;
>> +       memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
>> +              ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
>> +
>>  #ifdef CONFIG_DST_CACHE
>>         /* Unclone the dst cache if there is one */
>>         if (new_md->u.tun_info.dst_cache.cache) {
> 
> Hi Ilya,
> 
> Sure. Just stressed it for 24 hours and - I cannot trigger the bug
> with this patch applied.

Thanks, Johan!

Best regards, Ilya Maximets.


* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Ilya Maximets @ 2026-06-19 22:58 UTC (permalink / raw)
  To: Gustavo A. R. Silva, netdev
  Cc: i.maximets, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, Gustavo A. R. Silva,
	Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-kernel, linux-hardening, llvm, Johan Thomsen
In-Reply-To: <13b922ce-8450-48fd-adf7-5377989fb6e4@embeddedor.com>

On 6/18/26 6:02 AM, Gustavo A. R. Silva wrote:
> 
> 
> On 6/17/26 16:59, Gustavo A. R. Silva wrote:
>>
>>
>> On 6/17/26 16:01, Ilya Maximets wrote:
>>> On 6/17/26 10:08 PM, Gustavo A. R. Silva wrote:
>>>> Hi,
>>>>
>>>> On 6/16/26 04:03, Ilya Maximets wrote:
>>>>> kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
>>>>> structure to the options_len, which is then initialized to zero.
>>>>> Later, we're initializing the structure by copying the tunnel info
>>>>> together with the options, and this triggers a warning for a potential
>>>>> memcpy overflow, since the compiler estimates that the options can't
>>>>> fit into the structure, even though the memory for them is actually
>>>>> allocated.
>>>>>
>>>>>    memcpy: detected buffer overflow: 104 byte write of buffer size 96
>>>>>    WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
>>>>>     skb_tunnel_info_unclone+0x179/0x190
>>>>>     geneve_xmit+0x7fe/0xe00
>>>>
>>>> This warning has nothing to do with counted_by. See below for more
>>>> comments.
>>>>
>>>>>
>>>>> The issue is triggered when built with clang and source fortification.
>>>>>
>>>>> Fix that by doing the copy in two stages: first - the main data with
>>>>> the options_len, then the options.  This way the correct length should
>>>>> be known at the time of the copy.
>>>>>
>>>>> It would be better if the options_len never changed after allocation,
>>>>> but the allocation code is a little separate from the initialization
>>>>> and it would be awkward and potentially dangerous to return a struct
>>>>> with options_len set to a non-zero value from the metadata_dst_alloc().
>>>>>
>>>>> Another option would be to use ip_tunnel_info_opts_set(), but it is
>>>>> doing too many unnecessary operations for the use case here.
>>>>>
>>>>> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
>>>>> Reported-by: Johan Thomsen <write@ownrisk.dk>
>>>>> Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>>> ---
>>>>>
>>>>> Johan, if you can test this one in your setup as well, that would
>>>>> be great.  Thanks.
>>>>>
>>>>>    include/net/dst_metadata.h | 7 +++++--
>>>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>>>>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>>>>> --- a/include/net/dst_metadata.h
>>>>> +++ b/include/net/dst_metadata.h
>>>>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>>>>        if (!new_md)
>>>>>            return ERR_PTR(-ENOMEM);
>>>>> -    memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>>>>> -           sizeof(struct ip_tunnel_info) + md_size);
>>>>
>>>> What's going on here is that, internally, fortified memcpy() retrieves
>>>> the destination size via __builtin_dynamic_object_size() in mode 1.
>>>>
>>>> That is:
>>>>
>>>> __builtin_dynamic_object_size(&new_md->u.tun_info, 1)
>>>>
>>>> For the above case, Clang returns sizeof(new_md->u.tun_info) == 96.
>>>>
>>>> So the warning is reporting that 104 bytes don't fit in an object of
>>>> size 96 bytes, regardless of any counted_by annotation or allocation.
>>>
>>> Hmm.  Does __builtin_dynamic_object_size(&new_md->u.tun_info, 1) return
>>> 104 when the options_len is 8?  If so, isn't that because it is counted
>>> by that field?  Asking because the fortification doesn't complain if we
>>> keep the full 104-byte copy as-is, but set the options_len beforehand,
>>> as tested by Johan.
>>
>> I see. If that is the case, then, internally, fortified memcpy() ends up
>> using mode 0 instead of mode 1. Something like this:
>>
>> __builtin_dynamic_object_size(&new_md->u.tun_info, 0)
>>
>> The above will effectively consider the allocation and counted_by because
>> it will interpret new_md->u.tun_info as an open-ended object due to the
>> flexible-array member (in struct ip_tunnel_info) whose size is determined
>> by counted_by.
> 
> Indeed. The execution stops here:
> 
> fortify_memcpy_chk():
> 588         /*
> 589          * Always stop accesses beyond the struct that contains the
> 590          * field, when the buffer's remaining size is known.
> 591          * (The SIZE_MAX test is to optimize away checks where the buffer
> 592          * lengths are unknown.)
> 593          */
> 594         if (p_size != SIZE_MAX && p_size < size)
> 595                 fortify_panic(func, FORTIFY_WRITE, p_size, size, true);
> 
> with p_size = __builtin_dynamic_object_size(&new_md->u.tun_info, 0)
> 
> The code never reaches the part where p_size_field (__bdos(&new_md->u.tun_info, 1))
> is checked at runtime because there is no need for that.
> 
> So yep, this patch is okay as-is.

Ack.  Thanks for looking into this!
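
For readers following along, the two object-size modes discussed above
can be compared with a small sketch. It uses the compile-time
__builtin_object_size() on a fixed-size layout, so it does not reproduce
the flexible-array and counted_by case from this thread; struct inner,
struct outer and the helper names are hypothetical. The builtin may also
return (size_t)-1 when the compiler cannot determine a size.

```c
#include <stddef.h>

/* Hypothetical layout: a field embedded in the middle of a larger object. */
struct inner {
	int len;
	char data[8];
};

struct outer {
	int hdr;
	struct inner field;
	char tail[16];
};

static struct outer obj;

/* Mode 1: size of the closest enclosing subobject, i.e. the field itself. */
static size_t field_size_mode1(void)
{
	return __builtin_object_size(&obj.field, 1);
}

/* Mode 0: bytes from the field to the end of the whole enclosing object. */
static size_t field_size_mode0(void)
{
	return __builtin_object_size(&obj.field, 0);
}
```

When the compiler can resolve both, mode 1 is expected to give
sizeof(struct inner) and mode 0 to give
sizeof(struct outer) - offsetof(struct outer, field), which is why a
copy that overruns the field can still pass a mode 0 check.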

Best regards, Ilya Maximets.


* Re: [BUG] net: tcp: SO_LINGER with l_linger=0 leaks memory when closing sockets with pending send data
From: Ahmed, Aaron @ 2026-06-19 22:58 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: stable@vger.kernel.org, netdev@vger.kernel.org,
	ncardwell@google.com, edumazet@google.com
In-Reply-To: <CAAVpQUBtKBzq36Wz9p3MaHR=G10-NFBtQXgGW3S3QV5THW2iCg@mail.gmail.com>



Hi Kuniyuki,

Sorry to keep asking, were you able to take a look at the updated reproducer? I've still been able to repro with the latest 6.18 LTS.

Thanks,
Aaron 




* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Bryam Vargas @ 2026-06-19 22:17 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajS4BgnyzRsa7HVm@linux.alibaba.com>

On Fri, 19 Jun 2026 11:31:18 +0800, Dust Li wrote:
> I think we can decide after we see the real issue.

Here it is, as a truth table over the real smc_curs_diff. cons is fixed (the app
isn't reading), bytes_to_rcv is the running sum of per-CDC smc_curs_diff(prod_old,
prod_new), len = 65504:

  scenario                     b2r       count>=len  diff>len  occ>len  OOB no-clamp  OOB clamp
  honest steady / full / wrap  <= len    no          no        no       no            no
  attack single big diff       131007    no          yes       yes      yes           no
  attack count=len-1 wrapflip  327519    no          yes       yes      yes           no
  attack wrap++ count=0        327520    no          no        no       yes           no

Every attack row has count < len, so an input count check accepts it. The last
row is the one that matters: a peer that just increments prod.wrap with count=0
adds len to bytes_to_rcv every CDC, unbounded, and no cursor-level check sees it.
The per-CDC diff is exactly len, and smc_curs_diff(cons, prod) stays at len
because it can't see the wrap accumulation. The only thing that bounds it is
clamping bytes_to_rcv at the consumer. So #2 isn't subsumed by validating cursors
at the input -- the cursor view can't see the accumulator.

> should we also abort the connection like what we did in patch #1 ?

Yes for net-next, with two caveats. First, the detection has to be on
bytes_to_rcv itself, not on a cursor recompute: the wrap++ row walks past
every cursor check, so an occupancy gate at the input wouldn't catch it.
Second, the abort supplements the clamp rather than replacing it: the clamp
is synchronous, the abort via queue_work isn't. The producer add runs in the
tasklet under bh_lock_sock, and the consumer sub runs in smc_recvmsg under
lock_sock, which drops the spinlock, so they race. Between queue_work and
abort_work running smc_conn_kill, smc_recvmsg can read the inflated
bytes_to_rcv and copy past the RMB. The clamp at the consumer is what closes
that window.
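
A minimal sketch of that consumer-side shape, assuming a simplified
model: struct rx_model, its fields and rx_readable() are hypothetical
names, and the abort_queued flag only stands in for queueing the
asynchronous abort. It is not the code of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receive-side state; names are illustrative only. */
struct rx_model {
	long bytes_to_rcv;	/* running sum fed by peer-controlled cursors */
	size_t rmb_len;		/* size of the receive buffer */
	bool abort_queued;	/* stands in for the queued, asynchronous abort */
};

/* Number of bytes the consumer may copy out right now. */
static size_t rx_readable(struct rx_model *rx)
{
	if (rx->bytes_to_rcv <= 0)
		return 0;
	if ((size_t)rx->bytes_to_rcv > rx->rmb_len) {
		/* Asynchronous part: only takes effect some time later. */
		rx->abort_queued = true;
		/* Synchronous part: bounds this copy immediately. */
		return rx->rmb_len;
	}
	return (size_t)rx->bytes_to_rcv;
}
```

With bytes_to_rcv = 327520 and rmb_len = 65504 (the wrap++ row), the
helper returns 65504 and flags the abort; an honest value such as 5000 is
returned unchanged and nothing is flagged.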

So v4: -stable keeps the consumer-side clamp on #2, and the same shape on #3 for
sndbuf_space and peer_rmbe_space -- no control-flow change. net-next keeps the
clamp and, when bytes_to_rcv goes over len (which an honest peer never does),
queues the abort the way patch #1 does. Patch #1 keeps its count-based abort for
the urgent index.

Bryam

The table above is this program (gcc -O2 -Wall -Wextra -fwrapv; self-checks, exit 0):

  #include <stdio.h>
  #include <stdint.h>
  typedef uint16_t u16; typedef uint32_t u32;
  union hc { struct { u16 reserved; u16 wrap; u32 count; }; };

  /* verbatim net/smc/smc_cdc.h:149-158 */
  static int smc_curs_diff(unsigned int size, const union hc *old, const union hc *new)
  {
          if (old->wrap != new->wrap) {
                  int v = (int)((size - old->count) + new->count);
                  return v > 0 ? v : 0;
          }
          { int v = (int)(new->count - old->count); return v > 0 ? v : 0; }
  }

  #define LEN 65504
  struct cur { u16 w; u32 c; };

  /* prod[]/cons[]: cursor positions after each CDC. honest=app drains so
   * occupancy stays <= len; attack=cons stuck. */
  static int run(const char *name, int honest, int n,
                 const struct cur *prod, const struct cur *cons)
  {
          union hc po = {0}, co = {0};
          long b2r = 0; int i, cnt_rej = 0, raw_rej = 0, occ_rej = 0, fail = 0;
          for (i = 0; i < n; i++) {
                  union hc p = { .wrap = prod[i].w, .count = prod[i].c };
                  union hc c = { .wrap = cons[i].w, .count = cons[i].c };
                  int dp = smc_curs_diff(LEN, &po, &p);
                  if (prod[i].c >= (u32)LEN) cnt_rej = 1;
                  if (dp > LEN) raw_rej = 1;
                  if (smc_curs_diff(LEN, &c, &p) > LEN) occ_rej = 1;
                  b2r += dp; b2r -= smc_curs_diff(LEN, &co, &c);
                  po = p; co = c;
          }
          int oob_noclamp = b2r > LEN;
          int oob_clamp   = (b2r > LEN ? LEN : b2r) > LEN;   /* always 0 */
          printf("  %-30s b2r=%-8ld cnt_rej=%d raw_rej=%d occ_rej=%d oob_noclamp=%d oob_clamp=%d\n",
                 name, b2r, cnt_rej, raw_rej, occ_rej, oob_noclamp, oob_clamp);
          if (honest) fail = (cnt_rej || raw_rej || occ_rej || oob_noclamp);
          else        fail = (oob_clamp || !oob_noclamp);
          return fail;
  }

  int main(void)
  {
          struct cur ps[][5] = {
                  {{0,5000}}, {{1,0}}, {{0,30000},{0,60000},{1,10000}},
                  {{1,LEN-1}},
                  {{1,LEN-1},{0,LEN-1},{1,LEN-1},{0,LEN-1}},
                  {{1,0},{2,0},{3,0},{4,0},{5,0}},
          };
          struct cur cs[][5] = {
                  {{0,4000}}, {{0,0}}, {{0,0},{0,30000},{0,50000}},
                  {{0,0}},
                  {{0,0},{0,0},{0,0},{0,0}},
                  {{0,0},{0,0},{0,0},{0,0},{0,0}},
          };
          const char *nm[] = { "honest: steady", "honest: full ring",
                  "honest: wrapping", "attack: single big diff",
                  "attack: count=len-1 wrapflip", "attack: wrap++ count=0" };
          int hon[] = { 1,1,1,0,0,0 };
          int nc[]  = { 1,1,3,1,4,5 };
          int i, fails = 0;
          for (i = 0; i < 6; i++)
                  fails += run(nm[i], hon[i], nc[i], ps[i], cs[i]);
          printf("RESULT: %s\n", fails ? "FAIL" : "PASS");
          return fails ? 1 : 0;
  }

(In-kernel KASAN confirmation of the over-read at count=65503 is available on
request: a small out-of-tree module drives the same smc_curs_diff over a real
rmb_desc->len allocation, with bytes_to_rcv going 131007 -> 327519, and hits a
slab-out-of-bounds in the recv copy. It is clean with the clamp.)



