Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net] net: wwan: iosm: fix potential memory leaks in ipc_imem_init()
From: Abdun Nihaal @ 2026-05-08  9:21 UTC (permalink / raw)
  To: loic.poulain
  Cc: Abdun Nihaal, ryazanov.s.a, johannes, andrew+netdev, davem,
	edumazet, kuba, pabeni, netdev, linux-kernel, m.chetan.kumar,
	stable

The memory allocated in ipc_protocol_init() is not freed on the error
paths that follow in ipc_imem_init(). Fix that by calling the
corresponding release function ipc_protocol_deinit() in the error path.

Fixes: 3670970dd8c6 ("net: iosm: shared memory IPC interface")
Cc: stable@vger.kernel.org
Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
---
Compile tested only. Issue found using static analysis.

 drivers/net/wwan/iosm/iosm_ipc_imem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wwan/iosm/iosm_ipc_imem.c b/drivers/net/wwan/iosm/iosm_ipc_imem.c
index 1b7bc7d63a2e..f4edb277efd9 100644
--- a/drivers/net/wwan/iosm/iosm_ipc_imem.c
+++ b/drivers/net/wwan/iosm/iosm_ipc_imem.c
@@ -1422,6 +1422,7 @@ struct iosm_imem *ipc_imem_init(struct iosm_pcie *pcie, unsigned int device_id,
 	hrtimer_cancel(&ipc_imem->fast_update_timer);
 	hrtimer_cancel(&ipc_imem->tdupdate_timer);
 	hrtimer_cancel(&ipc_imem->startup_timer);
+	ipc_protocol_deinit(ipc_imem->ipc_protocol);
 protocol_init_fail:
 	cancel_work_sync(&ipc_imem->run_state_worker);
 	ipc_task_deinit(ipc_imem->ipc_task);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net] vsock/virtio: fix skb overhead accounting to preserve full buf_alloc
From: Stefano Garzarella @ 2026-05-08  9:23 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Michael S. Tsirkin, Stefan Hajnoczi, virtualization,
	David S. Miller, Jason Wang, Simon Horman, linux-kernel,
	Paolo Abeni, Xuan Zhuo, kvm, Jakub Kicinski, Stefano Garzarella,
	Eugenio Pérez

From: Stefano Garzarella <sgarzare@redhat.com>

After commit 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb
queue"), virtio_transport_inc_rx_pkt() subtracts per-skb overhead from
buf_alloc when checking whether a new packet fits. This reduces the
effective receive buffer below what the user configured via
SO_VM_SOCKETS_BUFFER_SIZE, causing legitimate data packets to be
silently dropped and applications that rely on the full buffer size
to deadlock.

Also, the reduced space is not communicated to the remote peer, so
its credit calculation accounts more credit than the receiver will
actually accept, causing data loss (there is no retransmission).

This also causes failures in tools/testing/vsock/vsock_test.c.
Test 18 sometimes fails, while test 22 always fails in this way:
    18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch

    22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed:
    Resource temporarily unavailable

Fix this by introducing virtio_transport_rx_buf_size() to calculate the
size of the RX buffer based on the overhead. Using it in the acceptance
check, the advertised buf_alloc, and the credit update decision.
Use buf_alloc * 2 as total budget (payload + overhead), similar to how
SO_RCVBUF is doubled to reserve space for sk_buff metadata.
The function returns buf_alloc as long as overhead fits within the
reservation, then gradually reduces toward 0 as overhead exceeds
buf_alloc (e.g. under small-packet flooding), informing the peer to
slow down.

Fixes: 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
---
 net/vmw_vsock/virtio_transport_common.c | 31 +++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 9b8014516f4f..94a4beb8fd61 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -444,12 +444,32 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	return ret;
 }
 
+/* vvs->rx_lock held by the caller */
+static u32 virtio_transport_rx_buf_size(struct virtio_vsock_sock *vvs)
+{
+	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+	/* Use buf_alloc * 2 as total budget (payload + overhead), similar to
+	 * how SO_RCVBUF is doubled to reserve space for sk_buff metadata.
+	 */
+	u64 total_budget = (u64)vvs->buf_alloc * 2;
+
+	/* Overhead within buf_alloc: full buf_alloc available for payload */
+	if (skb_overhead < vvs->buf_alloc)
+		return vvs->buf_alloc;
+
+	/* Overhead exceeded buf_alloc: gradually reduce to bound skb queue */
+	if (skb_overhead < total_budget)
+		return total_budget - skb_overhead;
+
+	return 0;
+}
+
 static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
 					u32 len)
 {
-	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+	u32 rx_buf_size = virtio_transport_rx_buf_size(vvs);
 
-	if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
+	if (!rx_buf_size || vvs->buf_used + len > rx_buf_size)
 		return false;
 
 	vvs->rx_bytes += len;
@@ -472,7 +492,7 @@ void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *
 	spin_lock_bh(&vvs->rx_lock);
 	vvs->last_fwd_cnt = vvs->fwd_cnt;
 	hdr->fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
-	hdr->buf_alloc = cpu_to_le32(vvs->buf_alloc);
+	hdr->buf_alloc = cpu_to_le32(virtio_transport_rx_buf_size(vvs));
 	spin_unlock_bh(&vvs->rx_lock);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
@@ -594,6 +614,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 	bool low_rx_bytes;
 	int err = -EFAULT;
 	size_t total = 0;
+	u32 rx_buf_size;
 	u32 free_space;
 
 	spin_lock_bh(&vvs->rx_lock);
@@ -639,7 +660,9 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 	}
 
 	fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
-	free_space = vvs->buf_alloc - fwd_cnt_delta;
+	rx_buf_size = virtio_transport_rx_buf_size(vvs);
+	free_space = rx_buf_size > fwd_cnt_delta ?
+		     rx_buf_size - fwd_cnt_delta : 0;
 	low_rx_bytes = (vvs->rx_bytes <
 			sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net-next v2] net/sched: sch_dualpi2: annotate lockless stats reads in dump path
From: Breno Leitao @ 2026-05-08  9:26 UTC (permalink / raw)
  To: Vineet Agarwal
  Cc: netdev, jhs, jiri, davem, edumazet, kuba, pabeni, horms,
	linux-kernel
In-Reply-To: <20260508072918.324797-1-agarwal.vineet2006@gmail.com>

On Fri, May 08, 2026 at 12:59:18PM +0530, Vineet Agarwal wrote:
> No WRITE_ONCE() annotations are added because these statistics are
> maintained as best-effort counters

Adding READ_ONCE() on the dumper side without matching WRITE_ONCE() on
the writer side does prevent torn stores, and also not actually silence
KCSAN, last time I checked.

> , and the update paths already use simple non-synchronized increments
> consistent with existing qdisc statistics patterns. The intent of this
> change is only to make the lockless read semantics explicit.

I am not sure READ_ONCE() does that. Usually data_race is best used for
this.

> diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c index
> 241e6a46bd00..40035f70db80 100644 --- a/net/sched/sch_dualpi2.c +++
> b/net/sched/sch_dualpi2.c @@ -1046,14 +1046,14 @@ static int
> dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d) struct
> dualpi2_sched_data *q = qdisc_priv(sch); struct tc_dualpi2_xstats st
> = { .prob			= READ_ONCE(q->pi2_prob),
...
> +		.credit                 = q->c_protection_credit,

Why this is unnanotated. isn't it uupdated locklessly in
dualpi2_qdisc_dequeue()?

Also, you have spaces other than tab for this instance.

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Stefano Garzarella @ 2026-05-08  9:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, Michael S. Tsirkin, Arseniy Krasnov, Bobby Eshleman,
	Stefan Hajnoczi, David S . Miller, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, kvm, virtualization
In-Reply-To: <20260507101805.6c3335d0@kernel.org>

On Thu, May 07, 2026 at 10:18:05AM -0700, Jakub Kicinski wrote:
>On Thu, 7 May 2026 09:32:24 -0700 Eric Dumazet wrote:
>> On Thu, May 7, 2026 at 9:05 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>> > On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote:
>> > >We can revert if you think that the risk of regression is high..
>> > >Please LMK soon, we can do it before patch reaches Linus.
>> >
>> > Some tests in tools/testing/vsock/vsock_test.c are failing with this
>> > patch applied.
>> >
>> > Test 18 are failing sometime in this way (I guess because we are
>> > dropping packets):
>> >
>> > 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch
>> >
>> > Test 22 is failing 100% in this way:
>> >
>> > 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed:
>> > Resource temporarily unavailable
>> >
>> >
>> > With my followup patch adding also advertisement to the other peer
>> > (still draft locally, waiting for Michael proposal) I saw 22 failing,
>> > because tests expects that can use the entire buf_alloc, but now we are
>> > reducing it.  So IMO we should do like in `__sock_set_rcvbuf()` and
>> > double the buffer size, or at least digest an overhead equal to the
>> > buffer size set by the user via SO_VM_SOCKETS_BUFFER_SIZE (yeah,
>> > AF_VSOCK has it owns sockopt since the beginning :-().
>> >
>> > With that approach tests are passing, but I'd like to stress a bit more
>> > that patch. I'll send it tomorrow as fixup of this patch, or if you
>> > prefer to revert, I'll send as standalone.
>>
>> A plain revert is a big issue, now users now how to crash hypervisors.
>>
>> This vulnerability allows a compromised guest (controlling
>> virtio_vsock_hdr fields)
>> to continuously flood the host's vsock receive queue without
>> triggering any memory
>>  accounting limits or reader wakeups, resulting in unbounded host
>> kernel memory consumption (Host DoS via OOM).
>>
>> A vulnerability where a KVM guest can crash or deadlock its host is
>> classified as a KVM DoS.
>>
>> Am I missing something?
>
>Alright, let's leave it.
>

I posted a potential fixup: 
https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/

Thanks,
Stefano


^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH iwl-next] ice: fix AQ error code comparison in ice_set_pauseparam()
From: Rinitha, SX @ 2026-05-08  9:27 UTC (permalink / raw)
  To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org,
	Nguyen, Anthony L, Loktionov, Aleksandr
  Cc: netdev@vger.kernel.org, Czapnik, Lukasz
In-Reply-To: <20260327072236.129802-4-aleksandr.loktionov@intel.com>

> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Aleksandr Loktionov
> Sent: 27 March 2026 12:53
> To: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Loktionov, Aleksandr <aleksandr.loktionov@intel.com>
> Cc: netdev@vger.kernel.org; Czapnik, Lukasz <lukasz.czapnik@intel.com>
> Subject: [Intel-wired-lan] [PATCH iwl-next] ice: fix AQ error code comparison in ice_set_pauseparam()
>
> From: Lukasz Czapnik <lukasz.czapnik@intel.com>
>
> Fix unreachable code: the conditionals in ice_set_pauseparam() used the bitwise-AND operator suggesting aq_failures is a bitmap, but it is actually an enum, making the third condition logically unreachable.
>
> Replace the if-else ladder with a switch statement.  Also move the aq_failures initialization to the variable declaration and remove the redundant zeroing from ice_set_fc().
>
> Fixes: fcea6f3da546 ("ice: Add stats and ethtool support")
> Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
>
> drivers/net/ethernet/intel/ice/ice_common.c  |  1 -  drivers/net/ethernet/intel/ice/ice_ethtool.c | 12 ++++++++----
> 2 files changed, 8 insertions(+), 5 deletions(-)
>

Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)

^ permalink raw reply

* Re: [PATCH net] net: ena: PHC: Fix potential use-after-free in get_timestamp
From: Vadim Fedorenko @ 2026-05-08  9:28 UTC (permalink / raw)
  To: Arthur Kiyanovski, David Miller, Jakub Kicinski, netdev
  Cc: Richard Cochran, Eric Dumazet, Paolo Abeni, David Woodhouse,
	Thomas Gleixner, Miroslav Lichvar, Andrew Lunn, Wen Gu, Xuan Zhuo,
	David Woodhouse, Yonatan Sarna, Zorik Machulsky,
	Alexander Matushevsky, Saeed Bshara, Matt Wilson, Anthony Liguori,
	Nafea Bshara, Evgeny Schmeilin, Netanel Belgazal, Ali Saidi,
	Benjamin Herrenschmidt, Noam Dagan, David Arinzon,
	Evgeny Ostrovsky, Ofir Tabachnik, Amit Bernstein, stable
In-Reply-To: <20260508062126.7273-1-akiyano@amazon.com>

On 08/05/2026 07:21, Arthur Kiyanovski wrote:
> Move the phc->active check and resp pointer assignment to after
> acquiring the spinlock. Previously, phc->active was checked without
> holding the lock, and resp was cached from ena_dev->phc.virt_addr
> before the lock was acquired.
> 
> If ena_com_phc_destroy() runs between the lockless active check and
> the lock acquisition, it sets active=false, releases the lock, frees
> the DMA memory, and sets virt_addr=NULL. The get_timestamp path would
> then read a NULL virt_addr and dereference it.
> 
> With both the active check and the pointer read under the lock,
> destroy cannot free the memory while get_timestamp is using it.
> 
> Fixes: e0ea34158ee8 ("net: ena: Add PHC support in the ENA driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
> ---
>   drivers/net/ethernet/amazon/ena/ena_com.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c
> index e67b592..8c86789 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> @@ -1782,20 +1782,23 @@ void ena_com_phc_destroy(struct ena_com_dev *ena_dev)
>   
>   int ena_com_phc_get_timestamp(struct ena_com_dev *ena_dev, u64 *timestamp)
>   {
> -	volatile struct ena_admin_phc_resp *resp = ena_dev->phc.virt_addr;
>   	const ktime_t zero_system_time = ktime_set(0, 0);
>   	struct ena_com_phc_info *phc = &ena_dev->phc;
> +	volatile struct ena_admin_phc_resp *resp;
>   	ktime_t expire_time;
>   	ktime_t block_time;
>   	unsigned long flags = 0;
>   	int ret = 0;
>   
> +	spin_lock_irqsave(&phc->lock, flags);
> +
>   	if (!phc->active) {
> +		spin_unlock_irqrestore(&phc->lock, flags);
>   		netdev_err(ena_dev->net_device, "PHC feature is not active in the device\n");
>   		return -EOPNOTSUPP;
>   	}
>   
> -	spin_lock_irqsave(&phc->lock, flags);
> +	resp = ena_dev->phc.virt_addr;
>   
>   	/* Check if PHC is in blocked state */
>   	if (unlikely(ktime_compare(phc->system_time, zero_system_time))) {

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>

^ permalink raw reply

* Re: [PATCH] net: ethtool: fix missing closing paren in rings_reply_size()
From: Vadim Fedorenko @ 2026-05-08  9:30 UTC (permalink / raw)
  To: Tao Cui, andrew, kuba, davem, edumazet, pabeni; +Cc: horms, netdev
In-Reply-To: <20260508071402.183909-1-cuitao@kylinos.cn>

On 08/05/2026 08:14, Tao Cui wrote:
> sizeof(u32) on the _RINGS_CQE_SIZE line is missing its closing
> parenthesis, causing nla_total_size() to absorb the subsequent
> _TX_PUSH and _RX_PUSH entries. The resulting size estimate
> happens to be numerically identical due to NLA alignment, but
> the nesting is wrong and misleading.
> 
> Signed-off-by: Tao Cui <cuitao@kylinos.cn>
> ---
>   net/ethtool/rings.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
> index 0fd5dcc3729f..9054c89c5d7b 100644
> --- a/net/ethtool/rings.c
> +++ b/net/ethtool/rings.c
> @@ -63,9 +63,9 @@ static int rings_reply_size(const struct ethnl_req_info *req_base,
>   	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX */
>   	       nla_total_size(sizeof(u32)) +	/* _RINGS_RX_BUF_LEN */
>   	       nla_total_size(sizeof(u8))  +	/* _RINGS_TCP_DATA_SPLIT */
> -	       nla_total_size(sizeof(u32)  +	/* _RINGS_CQE_SIZE */
> +	       nla_total_size(sizeof(u32)) +	/* _RINGS_CQE_SIZE */
>   	       nla_total_size(sizeof(u8))  +	/* _RINGS_TX_PUSH */
> -	       nla_total_size(sizeof(u8))) +	/* _RINGS_RX_PUSH */
> +	       nla_total_size(sizeof(u8))  +	/* _RINGS_RX_PUSH */
>   	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX_PUSH_BUF_LEN */
>   	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX_PUSH_BUF_LEN_MAX */
>   	       nla_total_size(sizeof(u32)) +	/* _RINGS_HDS_THRESH */

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>

^ permalink raw reply

* Re: [PATCH] net: ethtool: fix missing closing paren in rings_reply_size()
From: Breno Leitao @ 2026-05-08  9:33 UTC (permalink / raw)
  To: Tao Cui; +Cc: andrew, kuba, davem, edumazet, pabeni, horms, netdev
In-Reply-To: <20260508071402.183909-1-cuitao@kylinos.cn>

On Fri, May 08, 2026 at 03:14:02PM +0800, Tao Cui wrote:
> sizeof(u32) on the _RINGS_CQE_SIZE line is missing its closing
> parenthesis, causing nla_total_size() to absorb the subsequent
> _TX_PUSH and _RX_PUSH entries. The resulting size estimate
> happens to be numerically identical due to NLA alignment, but
> the nesting is wrong and misleading.
> 
> Signed-off-by: Tao Cui <cuitao@kylinos.cn>

Reviewed-by: Breno Leitao <leitao@debian.org>

Even thought the resulting size is identical, I am wondering if you
should add the Fixes: here. (or target net-next instead).

I think this is a net material, which will require a Fixes: tag.

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Stefano Garzarella @ 2026-05-08  9:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
	David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, kvm, virtualization
In-Reply-To: <20260507163710-mutt-send-email-mst@kernel.org>

On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote:
>On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote:
>> On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote:
>> > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote:

[...]

>> > > For now, we're already doing something:
>> > > merging the skuffs if they don't have EOM set.
>> >
>> >
>> > Right that's good. You could go further and merge with EOM too
>> > if you stick the info about message boundaries somewhere else.
>>
>> This adds a lot of complexity IMO, but we can try.
>>
>> Do you have something in mind?
>
>BER is clearly overkill but here's a POC that claude made for me,
>just to give u an idea. It's clearly has a ton of issues,
>for example I dislike how GFP_ATOMIC is handled.

Okay, I somewhat understand, but clearly this isn't net material, so for 
now I think the best thing to do is to merge the fixup I sent (or 
something similar): 
https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/

This is a major change that should be merged with more caution.
Could this have too much of an impact on performance?

Thanks,
Stefano

>Yet it seems to work fine in light testing.
>
>-->
>
>
>vsock/virtio: use DWARF ULEB128 to record EOM boundaries, enable cross-EOM skb coalescing
>
>virtio_transport_recv_enqueue() currently refuses to coalesce an
>incoming skb with the previous one when the previous skb carries
>VIRTIO_VSOCK_SEQ_EOM.  This forces one skb per seqpacket message.
>For workloads with many small or zero-byte messages the per-skb
>overhead (~960 bytes) dominates, causing unbounded memory growth.
>
>Decouple message boundary tracking from the skb structure: store
>boundary offsets in a compact side buffer using DWARF ULEB128
>encoding with the EOR flag folded into the low bit, then allow
>the data of multiple complete messages to be coalesced into a single
>skb.
>
>Cross-EOM coalescing fires only when:
>- both the tail skb and the incoming packet carry EOM (complete msgs)
>- the incoming packet fits in the tail skb's tailroom
>- no BPF psock is attached (read_skb expects one msg per skb)
>
>On allocation failure the code falls back to separate skbs (existing
>behaviour).  Credit accounting is unchanged; the boundary buffer is
>capped at PAGE_SIZE.
>
>Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
>
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index f91704731057..e36b9ab28372 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -12,6 +12,7 @@
> struct virtio_vsock_skb_cb {
> 	bool reply;
> 	bool tap_delivered;
>+	bool has_boundary_entries;
> 	u32 offset;
> };
>
>@@ -167,6 +168,12 @@ struct virtio_vsock_sock {
> 	u32 buf_used;
> 	struct sk_buff_head rx_queue;
> 	u32 msg_count;
>+
>+	/* ULEB128-encoded seqpacket message boundary buffer */
>+	u8 *boundary_buf;
>+	u32 boundary_len;
>+	u32 boundary_alloc;
>+	u32 boundary_off;
> };
>
> struct virtio_vsock_pkt_info {
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 416d533f493d..81654f70f72c 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -11,6 +11,7 @@
> #include <linux/sched/signal.h>
> #include <linux/ctype.h>
> #include <linux/list.h>
>+#include <linux/skmsg.h>
> #include <linux/virtio_vsock.h>
> #include <uapi/linux/vsockmon.h>
>
>@@ -26,6 +27,91 @@
> /* Threshold for detecting small packets to copy */
> #define GOOD_COPY_LEN  128
>
>+#define VSOCK_BOUNDARY_BUF_INIT	64
>+#define VSOCK_BOUNDARY_BUF_MAX	PAGE_SIZE
>+
>+/* ULEB128 boundary encoding: value = (msg_len << 1) | eor.
>+ * Each byte carries 7 data bits; bit 7 is set on all but the last byte.
>+ * Max 5 bytes for a u32 msg_len (33 bits with eor shift).
>+ */
>+static int vsock_uleb_encode_boundary(u8 *buf, u32 msg_len, bool eor)
>+{
>+	u64 val = ((u64)msg_len << 1) | eor;
>+	int n = 0;
>+
>+	do {
>+		buf[n] = val & 0x7f;
>+		val >>= 7;
>+		if (val)
>+			buf[n] |= 0x80;
>+		n++;
>+	} while (val);
>+
>+	return n;
>+}
>+
>+static int vsock_uleb_decode_boundary(const u8 *buf, u32 avail,
>+				      u32 *msg_len, bool *eor)
>+{
>+	u64 val = 0;
>+	int shift = 0;
>+	int n = 0;
>+
>+	do {
>+		if (n >= avail || shift >= 35)
>+			return -EINVAL;
>+		val |= (u64)(buf[n] & 0x7f) << shift;
>+		shift += 7;
>+	} while (buf[n++] & 0x80);
>+
>+	*eor = val & 1;
>+	*msg_len = val >> 1;
>+	return n;
>+}
>+
>+static void vsock_boundary_buf_compact(struct virtio_vsock_sock *vvs)
>+{
>+	if (vvs->boundary_off == 0)
>+		return;
>+
>+	vvs->boundary_len -= vvs->boundary_off;
>+	memmove(vvs->boundary_buf, vvs->boundary_buf + vvs->boundary_off,
>+		vvs->boundary_len);
>+	vvs->boundary_off = 0;
>+}
>+
>+static int vsock_boundary_buf_ensure(struct virtio_vsock_sock *vvs, u32 needed)
>+{
>+	u32 new_alloc;
>+	u8 *new_buf;
>+
>+	if (vvs->boundary_alloc >= needed)
>+		return 0;
>+
>+	/* Reclaim consumed space before growing */
>+	if (vvs->boundary_off) {
>+		needed -= vvs->boundary_off;
>+		vsock_boundary_buf_compact(vvs);
>+		if (vvs->boundary_alloc >= needed)
>+			return 0;
>+	}
>+
>+	new_alloc = max(needed, vvs->boundary_alloc ? vvs->boundary_alloc * 2
>+						    : VSOCK_BOUNDARY_BUF_INIT);
>+	if (new_alloc > VSOCK_BOUNDARY_BUF_MAX)
>+		new_alloc = VSOCK_BOUNDARY_BUF_MAX;
>+	if (new_alloc < needed)
>+		return -ENOMEM;
>+
>+	new_buf = krealloc(vvs->boundary_buf, new_alloc, GFP_ATOMIC);
>+	if (!new_buf)
>+		return -ENOMEM;
>+
>+	vvs->boundary_buf = new_buf;
>+	vvs->boundary_alloc = new_alloc;
>+	return 0;
>+}
>+
> static void virtio_transport_cancel_close_work(struct vsock_sock *vsk,
> 					       bool cancel_timeout);
> static s64 virtio_transport_has_space(struct virtio_vsock_sock *vvs);
>@@ -682,41 +768,74 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk,
> 	total = 0;
> 	len = msg_data_left(msg);
>
>-	skb_queue_walk(&vvs->rx_queue, skb) {
>-		struct virtio_vsock_hdr *hdr;
>+	skb = skb_peek(&vvs->rx_queue);
>+	if (skb && VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
>+		u32 msg_len, offset;
>+		size_t bytes;
>+		bool eor;
>+		int ret;
>
>-		if (total < len) {
>-			size_t bytes;
>+		ret = vsock_uleb_decode_boundary(
>+			vvs->boundary_buf + vvs->boundary_off,
>+			vvs->boundary_len - vvs->boundary_off,
>+			&msg_len, &eor);
>+		if (ret < 0)
>+			goto unlock;
>+
>+		offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
>+		bytes = min(len, (size_t)msg_len);
>+
>+		if (bytes) {
> 			int err;
>
>-			bytes = len - total;
>-			if (bytes > skb->len)
>-				bytes = skb->len;
>-
> 			spin_unlock_bh(&vvs->rx_lock);
>-
>-			/* sk_lock is held by caller so no one else can dequeue.
>-			 * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
>-			 */
>-			err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->offset,
>+			err = skb_copy_datagram_iter(skb, offset,
> 						     &msg->msg_iter, bytes);
> 			if (err)
> 				return err;
>-
> 			spin_lock_bh(&vvs->rx_lock);
> 		}
>
>-		total += skb->len;
>-		hdr = virtio_vsock_hdr(skb);
>+		total = msg_len;
>+		if (eor)
>+			msg->msg_flags |= MSG_EOR;
>+	} else {
>+		skb_queue_walk(&vvs->rx_queue, skb) {
>+			struct virtio_vsock_hdr *hdr;
>
>-		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>-			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
>-				msg->msg_flags |= MSG_EOR;
>+			if (total < len) {
>+				size_t bytes;
>+				int err;
>
>-			break;
>+				bytes = len - total;
>+				if (bytes > skb->len)
>+					bytes = skb->len;
>+
>+				spin_unlock_bh(&vvs->rx_lock);
>+
>+				err = skb_copy_datagram_iter(
>+					skb,
>+					VIRTIO_VSOCK_SKB_CB(skb)->offset,
>+					&msg->msg_iter, bytes);
>+				if (err)
>+					return err;
>+
>+				spin_lock_bh(&vvs->rx_lock);
>+			}
>+
>+			total += skb->len;
>+			hdr = virtio_vsock_hdr(skb);
>+
>+			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>+				if (le32_to_cpu(hdr->flags) &
>+				    VIRTIO_VSOCK_SEQ_EOR)
>+					msg->msg_flags |= MSG_EOR;
>+				break;
>+			}
> 		}
> 	}
>
>+unlock:
> 	spin_unlock_bh(&vvs->rx_lock);
>
> 	return total;
>@@ -740,57 +859,105 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> 	}
>
> 	while (!msg_ready) {
>-		struct virtio_vsock_hdr *hdr;
>-		size_t pkt_len;
>-
>-		skb = __skb_dequeue(&vvs->rx_queue);
>+		skb = skb_peek(&vvs->rx_queue);
> 		if (!skb)
> 			break;
>-		hdr = virtio_vsock_hdr(skb);
>-		pkt_len = (size_t)le32_to_cpu(hdr->len);
>
>-		if (dequeued_len >= 0) {
>+		if (VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
> 			size_t bytes_to_copy;
>+			u32 msg_len, offset;
>+			bool eor;
>+			int ret;
>
>-			bytes_to_copy = min(user_buf_len, pkt_len);
>+			ret = vsock_uleb_decode_boundary(
>+				vvs->boundary_buf + vvs->boundary_off,
>+				vvs->boundary_len - vvs->boundary_off,
>+				&msg_len, &eor);
>+			if (ret < 0)
>+				break;
>+			vvs->boundary_off += ret;
>
>-			if (bytes_to_copy) {
>+			offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
>+			bytes_to_copy = min(user_buf_len, (size_t)msg_len);
>+
>+			if (bytes_to_copy && dequeued_len >= 0) {
> 				int err;
>
>-				/* sk_lock is held by caller so no one else can dequeue.
>-				 * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
>-				 */
> 				spin_unlock_bh(&vvs->rx_lock);
>-
>-				err = skb_copy_datagram_iter(skb, 0,
>+				err = skb_copy_datagram_iter(skb, offset,
> 							     &msg->msg_iter,
> 							     bytes_to_copy);
>-				if (err) {
>-					/* Copy of message failed. Rest of
>-					 * fragments will be freed without copy.
>-					 */
>-					dequeued_len = err;
>-				} else {
>-					user_buf_len -= bytes_to_copy;
>-				}
>-
> 				spin_lock_bh(&vvs->rx_lock);
>+				if (err)
>+					dequeued_len = err;
>+				else
>+					user_buf_len -= bytes_to_copy;
> 			}
>
> 			if (dequeued_len >= 0)
>-				dequeued_len += pkt_len;
>-		}
>+				dequeued_len += msg_len;
>
>-		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>+			VIRTIO_VSOCK_SKB_CB(skb)->offset += msg_len;
> 			msg_ready = true;
> 			vvs->msg_count--;
>
>-			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
>+			if (eor)
> 				msg->msg_flags |= MSG_EOR;
>-		}
>
>-		virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
>-		kfree_skb(skb);
>+			virtio_transport_dec_rx_pkt(vvs, msg_len, msg_len);
>+
>+			if (VIRTIO_VSOCK_SKB_CB(skb)->offset >= skb->len) {
>+				__skb_unlink(skb, &vvs->rx_queue);
>+				kfree_skb(skb);
>+			}
>+
>+			if (vvs->boundary_off >= vvs->boundary_len / 2)
>+				vsock_boundary_buf_compact(vvs);
>+		} else {
>+			struct virtio_vsock_hdr *hdr;
>+			size_t pkt_len;
>+
>+			skb = __skb_dequeue(&vvs->rx_queue);
>+			if (!skb)
>+				break;
>+			hdr = virtio_vsock_hdr(skb);
>+			pkt_len = (size_t)le32_to_cpu(hdr->len);
>+
>+			if (dequeued_len >= 0) {
>+				size_t bytes_to_copy;
>+
>+				bytes_to_copy = min(user_buf_len, pkt_len);
>+
>+				if (bytes_to_copy) {
>+					int err;
>+
>+					spin_unlock_bh(&vvs->rx_lock);
>+					err = skb_copy_datagram_iter(
>+						skb, 0, &msg->msg_iter,
>+						bytes_to_copy);
>+					if (err)
>+						dequeued_len = err;
>+					else
>+						user_buf_len -= bytes_to_copy;
>+					spin_lock_bh(&vvs->rx_lock);
>+				}
>+
>+				if (dequeued_len >= 0)
>+					dequeued_len += pkt_len;
>+			}
>+
>+			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>+				msg_ready = true;
>+				vvs->msg_count--;
>+
>+				if (le32_to_cpu(hdr->flags) &
>+				    VIRTIO_VSOCK_SEQ_EOR)
>+					msg->msg_flags |= MSG_EOR;
>+			}
>+
>+			virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
>+			kfree_skb(skb);
>+		}
> 	}
>
> 	spin_unlock_bh(&vvs->rx_lock);
>@@ -1132,6 +1299,7 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
>
> 	virtio_transport_cancel_close_work(vsk, true);
>
>+	kfree(vvs->boundary_buf);
> 	kfree(vvs);
> 	vsk->trans = NULL;
> }
>@@ -1224,6 +1392,11 @@ static void virtio_transport_remove_sock(struct vsock_sock *vsk)
> 	 * removing it.
> 	 */
> 	__skb_queue_purge(&vvs->rx_queue);
>+	kfree(vvs->boundary_buf);
>+	vvs->boundary_buf = NULL;
>+	vvs->boundary_len = 0;
>+	vvs->boundary_alloc = 0;
>+	vvs->boundary_off = 0;
> 	vsock_remove_sock(vsk);
> }
>
>@@ -1395,23 +1568,62 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> 	    !skb_is_nonlinear(skb)) {
> 		struct virtio_vsock_hdr *last_hdr;
> 		struct sk_buff *last_skb;
>+		bool last_has_eom;
>+		bool has_eom;
>
> 		last_skb = skb_peek_tail(&vvs->rx_queue);
> 		last_hdr = virtio_vsock_hdr(last_skb);
>+		last_has_eom = le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
>+		has_eom = le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
>
>-		/* If there is space in the last packet queued, we copy the
>-		 * new packet in its buffer. We avoid this if the last packet
>-		 * queued has VIRTIO_VSOCK_SEQ_EOM set, because this is
>-		 * delimiter of SEQPACKET message, so 'pkt' is the first packet
>-		 * of a new message.
>-		 */
>-		if (skb->len < skb_tailroom(last_skb) &&
>-		    !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)) {
>-			memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
>-			free_pkt = true;
>-			last_hdr->flags |= hdr->flags;
>-			le32_add_cpu(&last_hdr->len, len);
>-			goto out;
>+		if (skb->len < skb_tailroom(last_skb)) {
>+			if (!last_has_eom) {
>+				/* Same-message coalescing (existing path) */
>+				memcpy(skb_put(last_skb, skb->len),
>+				       skb->data, skb->len);
>+				free_pkt = true;
>+				last_hdr->flags |= hdr->flags;
>+				le32_add_cpu(&last_hdr->len, len);
>+				goto out;
>+			}
>+
>+			/* Cross-EOM: coalesce complete messages into one skb,
>+			 * recording message boundaries in a compact BER buffer.
>+			 * Only when incoming packet also has EOM (complete msg).
>+			 */
>+			if (has_eom && !sk_psock(sk_vsock(vsk))) {
>+				bool prev_eor, cur_eor;
>+				u8 tmp[12];
>+				int n = 0;
>+
>+				cur_eor = le32_to_cpu(hdr->flags) &
>+					  VIRTIO_VSOCK_SEQ_EOR;
>+
>+				if (!VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries) {
>+					u32 prev_len = le32_to_cpu(last_hdr->len);
>+
>+					prev_eor = le32_to_cpu(last_hdr->flags) &
>+						   VIRTIO_VSOCK_SEQ_EOR;
>+					n += vsock_uleb_encode_boundary(
>+						tmp + n, prev_len, prev_eor);
>+				}
>+				n += vsock_uleb_encode_boundary(
>+					tmp + n, len, cur_eor);
>+
>+				if (!vsock_boundary_buf_ensure(
>+					    vvs, vvs->boundary_len + n)) {
>+					memcpy(vvs->boundary_buf +
>+					       vvs->boundary_len, tmp, n);
>+					vvs->boundary_len += n;
>+					VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries = true;
>+					memcpy(skb_put(last_skb, skb->len),
>+					       skb->data, skb->len);
>+					free_pkt = true;
>+					last_hdr->flags |= hdr->flags;
>+					le32_add_cpu(&last_hdr->len, len);
>+					goto out;
>+				}
>+			}
> 		}
> 	}
>
>


^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Michael S. Tsirkin @ 2026-05-08  9:43 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
	David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, kvm, virtualization
In-Reply-To: <af2sXvVCBT05XF0D@sgarzare-redhat>

On Fri, May 08, 2026 at 11:41:21AM +0200, Stefano Garzarella wrote:
> On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote:
> > On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote:
> > > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote:
> 
> [...]
> 
> > > > > For now, we're already doing something:
> > > > > merging the skuffs if they don't have EOM set.
> > > >
> > > >
> > > > Right that's good. You could go further and merge with EOM too
> > > > if you stick the info about message boundaries somewhere else.
> > > 
> > > This adds a lot of complexity IMO, but we can try.
> > > 
> > > Do you have something in mind?
> > 
> > BER is clearly overkill but here's a POC that claude made for me,
> > just to give u an idea. It's clearly has a ton of issues,
> > for example I dislike how GFP_ATOMIC is handled.
> 
> Okay, I somewhat understand, but clearly this isn't net material, so for now
> I think the best thing to do is to merge the fixup I sent (or something
> similar):
> https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/


will respond there.

> This is a major change that should be merged with more caution.
> Could this have too much of an impact on performance?
> 
> Thanks,
> Stefano

Upstream is so broken now, I'd not worry about perf even.


> > Yet it seems to work fine in light testing.
> > 
> > -->
> > 
> > 
> > vsock/virtio: use DWARF ULEB128 to record EOM boundaries, enable cross-EOM skb coalescing
> > 
> > virtio_transport_recv_enqueue() currently refuses to coalesce an
> > incoming skb with the previous one when the previous skb carries
> > VIRTIO_VSOCK_SEQ_EOM.  This forces one skb per seqpacket message.
> > For workloads with many small or zero-byte messages the per-skb
> > overhead (~960 bytes) dominates, causing unbounded memory growth.
> > 
> > Decouple message boundary tracking from the skb structure: store
> > boundary offsets in a compact side buffer using DWARF ULEB128
> > encoding with the EOR flag folded into the low bit, then allow
> > the data of multiple complete messages to be coalesced into a single
> > skb.
> > 
> > Cross-EOM coalescing fires only when:
> > - both the tail skb and the incoming packet carry EOM (complete msgs)
> > - the incoming packet fits in the tail skb's tailroom
> > - no BPF psock is attached (read_skb expects one msg per skb)
> > 
> > On allocation failure the code falls back to separate skbs (existing
> > behaviour).  Credit accounting is unchanged; the boundary buffer is
> > capped at PAGE_SIZE.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> > 
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index f91704731057..e36b9ab28372 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -12,6 +12,7 @@
> > struct virtio_vsock_skb_cb {
> > 	bool reply;
> > 	bool tap_delivered;
> > +	bool has_boundary_entries;
> > 	u32 offset;
> > };
> > 
> > @@ -167,6 +168,12 @@ struct virtio_vsock_sock {
> > 	u32 buf_used;
> > 	struct sk_buff_head rx_queue;
> > 	u32 msg_count;
> > +
> > +	/* ULEB128-encoded seqpacket message boundary buffer */
> > +	u8 *boundary_buf;
> > +	u32 boundary_len;
> > +	u32 boundary_alloc;
> > +	u32 boundary_off;
> > };
> > 
> > struct virtio_vsock_pkt_info {
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index 416d533f493d..81654f70f72c 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -11,6 +11,7 @@
> > #include <linux/sched/signal.h>
> > #include <linux/ctype.h>
> > #include <linux/list.h>
> > +#include <linux/skmsg.h>
> > #include <linux/virtio_vsock.h>
> > #include <uapi/linux/vsockmon.h>
> > 
> > @@ -26,6 +27,91 @@
> > /* Threshold for detecting small packets to copy */
> > #define GOOD_COPY_LEN  128
> > 
> > +#define VSOCK_BOUNDARY_BUF_INIT	64
> > +#define VSOCK_BOUNDARY_BUF_MAX	PAGE_SIZE
> > +
> > +/* ULEB128 boundary encoding: value = (msg_len << 1) | eor.
> > + * Each byte carries 7 data bits; bit 7 is set on all but the last byte.
> > + * Max 5 bytes for a u32 msg_len (33 bits with eor shift).
> > + */
> > +static int vsock_uleb_encode_boundary(u8 *buf, u32 msg_len, bool eor)
> > +{
> > +	u64 val = ((u64)msg_len << 1) | eor;
> > +	int n = 0;
> > +
> > +	do {
> > +		buf[n] = val & 0x7f;
> > +		val >>= 7;
> > +		if (val)
> > +			buf[n] |= 0x80;
> > +		n++;
> > +	} while (val);
> > +
> > +	return n;
> > +}
> > +
> > +static int vsock_uleb_decode_boundary(const u8 *buf, u32 avail,
> > +				      u32 *msg_len, bool *eor)
> > +{
> > +	u64 val = 0;
> > +	int shift = 0;
> > +	int n = 0;
> > +
> > +	do {
> > +		if (n >= avail || shift >= 35)
> > +			return -EINVAL;
> > +		val |= (u64)(buf[n] & 0x7f) << shift;
> > +		shift += 7;
> > +	} while (buf[n++] & 0x80);
> > +
> > +	*eor = val & 1;
> > +	*msg_len = val >> 1;
> > +	return n;
> > +}
> > +
> > +static void vsock_boundary_buf_compact(struct virtio_vsock_sock *vvs)
> > +{
> > +	if (vvs->boundary_off == 0)
> > +		return;
> > +
> > +	vvs->boundary_len -= vvs->boundary_off;
> > +	memmove(vvs->boundary_buf, vvs->boundary_buf + vvs->boundary_off,
> > +		vvs->boundary_len);
> > +	vvs->boundary_off = 0;
> > +}
> > +
> > +static int vsock_boundary_buf_ensure(struct virtio_vsock_sock *vvs, u32 needed)
> > +{
> > +	u32 new_alloc;
> > +	u8 *new_buf;
> > +
> > +	if (vvs->boundary_alloc >= needed)
> > +		return 0;
> > +
> > +	/* Reclaim consumed space before growing */
> > +	if (vvs->boundary_off) {
> > +		needed -= vvs->boundary_off;
> > +		vsock_boundary_buf_compact(vvs);
> > +		if (vvs->boundary_alloc >= needed)
> > +			return 0;
> > +	}
> > +
> > +	new_alloc = max(needed, vvs->boundary_alloc ? vvs->boundary_alloc * 2
> > +						    : VSOCK_BOUNDARY_BUF_INIT);
> > +	if (new_alloc > VSOCK_BOUNDARY_BUF_MAX)
> > +		new_alloc = VSOCK_BOUNDARY_BUF_MAX;
> > +	if (new_alloc < needed)
> > +		return -ENOMEM;
> > +
> > +	new_buf = krealloc(vvs->boundary_buf, new_alloc, GFP_ATOMIC);
> > +	if (!new_buf)
> > +		return -ENOMEM;
> > +
> > +	vvs->boundary_buf = new_buf;
> > +	vvs->boundary_alloc = new_alloc;
> > +	return 0;
> > +}
> > +
> > static void virtio_transport_cancel_close_work(struct vsock_sock *vsk,
> > 					       bool cancel_timeout);
> > static s64 virtio_transport_has_space(struct virtio_vsock_sock *vvs);
> > @@ -682,41 +768,74 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk,
> > 	total = 0;
> > 	len = msg_data_left(msg);
> > 
> > -	skb_queue_walk(&vvs->rx_queue, skb) {
> > -		struct virtio_vsock_hdr *hdr;
> > +	skb = skb_peek(&vvs->rx_queue);
> > +	if (skb && VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
> > +		u32 msg_len, offset;
> > +		size_t bytes;
> > +		bool eor;
> > +		int ret;
> > 
> > -		if (total < len) {
> > -			size_t bytes;
> > +		ret = vsock_uleb_decode_boundary(
> > +			vvs->boundary_buf + vvs->boundary_off,
> > +			vvs->boundary_len - vvs->boundary_off,
> > +			&msg_len, &eor);
> > +		if (ret < 0)
> > +			goto unlock;
> > +
> > +		offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
> > +		bytes = min(len, (size_t)msg_len);
> > +
> > +		if (bytes) {
> > 			int err;
> > 
> > -			bytes = len - total;
> > -			if (bytes > skb->len)
> > -				bytes = skb->len;
> > -
> > 			spin_unlock_bh(&vvs->rx_lock);
> > -
> > -			/* sk_lock is held by caller so no one else can dequeue.
> > -			 * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
> > -			 */
> > -			err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->offset,
> > +			err = skb_copy_datagram_iter(skb, offset,
> > 						     &msg->msg_iter, bytes);
> > 			if (err)
> > 				return err;
> > -
> > 			spin_lock_bh(&vvs->rx_lock);
> > 		}
> > 
> > -		total += skb->len;
> > -		hdr = virtio_vsock_hdr(skb);
> > +		total = msg_len;
> > +		if (eor)
> > +			msg->msg_flags |= MSG_EOR;
> > +	} else {
> > +		skb_queue_walk(&vvs->rx_queue, skb) {
> > +			struct virtio_vsock_hdr *hdr;
> > 
> > -		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > -			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
> > -				msg->msg_flags |= MSG_EOR;
> > +			if (total < len) {
> > +				size_t bytes;
> > +				int err;
> > 
> > -			break;
> > +				bytes = len - total;
> > +				if (bytes > skb->len)
> > +					bytes = skb->len;
> > +
> > +				spin_unlock_bh(&vvs->rx_lock);
> > +
> > +				err = skb_copy_datagram_iter(
> > +					skb,
> > +					VIRTIO_VSOCK_SKB_CB(skb)->offset,
> > +					&msg->msg_iter, bytes);
> > +				if (err)
> > +					return err;
> > +
> > +				spin_lock_bh(&vvs->rx_lock);
> > +			}
> > +
> > +			total += skb->len;
> > +			hdr = virtio_vsock_hdr(skb);
> > +
> > +			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +				if (le32_to_cpu(hdr->flags) &
> > +				    VIRTIO_VSOCK_SEQ_EOR)
> > +					msg->msg_flags |= MSG_EOR;
> > +				break;
> > +			}
> > 		}
> > 	}
> > 
> > +unlock:
> > 	spin_unlock_bh(&vvs->rx_lock);
> > 
> > 	return total;
> > @@ -740,57 +859,105 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> > 	}
> > 
> > 	while (!msg_ready) {
> > -		struct virtio_vsock_hdr *hdr;
> > -		size_t pkt_len;
> > -
> > -		skb = __skb_dequeue(&vvs->rx_queue);
> > +		skb = skb_peek(&vvs->rx_queue);
> > 		if (!skb)
> > 			break;
> > -		hdr = virtio_vsock_hdr(skb);
> > -		pkt_len = (size_t)le32_to_cpu(hdr->len);
> > 
> > -		if (dequeued_len >= 0) {
> > +		if (VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
> > 			size_t bytes_to_copy;
> > +			u32 msg_len, offset;
> > +			bool eor;
> > +			int ret;
> > 
> > -			bytes_to_copy = min(user_buf_len, pkt_len);
> > +			ret = vsock_uleb_decode_boundary(
> > +				vvs->boundary_buf + vvs->boundary_off,
> > +				vvs->boundary_len - vvs->boundary_off,
> > +				&msg_len, &eor);
> > +			if (ret < 0)
> > +				break;
> > +			vvs->boundary_off += ret;
> > 
> > -			if (bytes_to_copy) {
> > +			offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
> > +			bytes_to_copy = min(user_buf_len, (size_t)msg_len);
> > +
> > +			if (bytes_to_copy && dequeued_len >= 0) {
> > 				int err;
> > 
> > -				/* sk_lock is held by caller so no one else can dequeue.
> > -				 * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
> > -				 */
> > 				spin_unlock_bh(&vvs->rx_lock);
> > -
> > -				err = skb_copy_datagram_iter(skb, 0,
> > +				err = skb_copy_datagram_iter(skb, offset,
> > 							     &msg->msg_iter,
> > 							     bytes_to_copy);
> > -				if (err) {
> > -					/* Copy of message failed. Rest of
> > -					 * fragments will be freed without copy.
> > -					 */
> > -					dequeued_len = err;
> > -				} else {
> > -					user_buf_len -= bytes_to_copy;
> > -				}
> > -
> > 				spin_lock_bh(&vvs->rx_lock);
> > +				if (err)
> > +					dequeued_len = err;
> > +				else
> > +					user_buf_len -= bytes_to_copy;
> > 			}
> > 
> > 			if (dequeued_len >= 0)
> > -				dequeued_len += pkt_len;
> > -		}
> > +				dequeued_len += msg_len;
> > 
> > -		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +			VIRTIO_VSOCK_SKB_CB(skb)->offset += msg_len;
> > 			msg_ready = true;
> > 			vvs->msg_count--;
> > 
> > -			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
> > +			if (eor)
> > 				msg->msg_flags |= MSG_EOR;
> > -		}
> > 
> > -		virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
> > -		kfree_skb(skb);
> > +			virtio_transport_dec_rx_pkt(vvs, msg_len, msg_len);
> > +
> > +			if (VIRTIO_VSOCK_SKB_CB(skb)->offset >= skb->len) {
> > +				__skb_unlink(skb, &vvs->rx_queue);
> > +				kfree_skb(skb);
> > +			}
> > +
> > +			if (vvs->boundary_off >= vvs->boundary_len / 2)
> > +				vsock_boundary_buf_compact(vvs);
> > +		} else {
> > +			struct virtio_vsock_hdr *hdr;
> > +			size_t pkt_len;
> > +
> > +			skb = __skb_dequeue(&vvs->rx_queue);
> > +			if (!skb)
> > +				break;
> > +			hdr = virtio_vsock_hdr(skb);
> > +			pkt_len = (size_t)le32_to_cpu(hdr->len);
> > +
> > +			if (dequeued_len >= 0) {
> > +				size_t bytes_to_copy;
> > +
> > +				bytes_to_copy = min(user_buf_len, pkt_len);
> > +
> > +				if (bytes_to_copy) {
> > +					int err;
> > +
> > +					spin_unlock_bh(&vvs->rx_lock);
> > +					err = skb_copy_datagram_iter(
> > +						skb, 0, &msg->msg_iter,
> > +						bytes_to_copy);
> > +					if (err)
> > +						dequeued_len = err;
> > +					else
> > +						user_buf_len -= bytes_to_copy;
> > +					spin_lock_bh(&vvs->rx_lock);
> > +				}
> > +
> > +				if (dequeued_len >= 0)
> > +					dequeued_len += pkt_len;
> > +			}
> > +
> > +			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +				msg_ready = true;
> > +				vvs->msg_count--;
> > +
> > +				if (le32_to_cpu(hdr->flags) &
> > +				    VIRTIO_VSOCK_SEQ_EOR)
> > +					msg->msg_flags |= MSG_EOR;
> > +			}
> > +
> > +			virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
> > +			kfree_skb(skb);
> > +		}
> > 	}
> > 
> > 	spin_unlock_bh(&vvs->rx_lock);
> > @@ -1132,6 +1299,7 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
> > 
> > 	virtio_transport_cancel_close_work(vsk, true);
> > 
> > +	kfree(vvs->boundary_buf);
> > 	kfree(vvs);
> > 	vsk->trans = NULL;
> > }
> > @@ -1224,6 +1392,11 @@ static void virtio_transport_remove_sock(struct vsock_sock *vsk)
> > 	 * removing it.
> > 	 */
> > 	__skb_queue_purge(&vvs->rx_queue);
> > +	kfree(vvs->boundary_buf);
> > +	vvs->boundary_buf = NULL;
> > +	vvs->boundary_len = 0;
> > +	vvs->boundary_alloc = 0;
> > +	vvs->boundary_off = 0;
> > 	vsock_remove_sock(vsk);
> > }
> > 
> > @@ -1395,23 +1568,62 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> > 	    !skb_is_nonlinear(skb)) {
> > 		struct virtio_vsock_hdr *last_hdr;
> > 		struct sk_buff *last_skb;
> > +		bool last_has_eom;
> > +		bool has_eom;
> > 
> > 		last_skb = skb_peek_tail(&vvs->rx_queue);
> > 		last_hdr = virtio_vsock_hdr(last_skb);
> > +		last_has_eom = le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
> > +		has_eom = le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
> > 
> > -		/* If there is space in the last packet queued, we copy the
> > -		 * new packet in its buffer. We avoid this if the last packet
> > -		 * queued has VIRTIO_VSOCK_SEQ_EOM set, because this is
> > -		 * delimiter of SEQPACKET message, so 'pkt' is the first packet
> > -		 * of a new message.
> > -		 */
> > -		if (skb->len < skb_tailroom(last_skb) &&
> > -		    !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)) {
> > -			memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
> > -			free_pkt = true;
> > -			last_hdr->flags |= hdr->flags;
> > -			le32_add_cpu(&last_hdr->len, len);
> > -			goto out;
> > +		if (skb->len < skb_tailroom(last_skb)) {
> > +			if (!last_has_eom) {
> > +				/* Same-message coalescing (existing path) */
> > +				memcpy(skb_put(last_skb, skb->len),
> > +				       skb->data, skb->len);
> > +				free_pkt = true;
> > +				last_hdr->flags |= hdr->flags;
> > +				le32_add_cpu(&last_hdr->len, len);
> > +				goto out;
> > +			}
> > +
> > +			/* Cross-EOM: coalesce complete messages into one skb,
> > +			 * recording message boundaries in a compact BER buffer.
> > +			 * Only when incoming packet also has EOM (complete msg).
> > +			 */
> > +			if (has_eom && !sk_psock(sk_vsock(vsk))) {
> > +				bool prev_eor, cur_eor;
> > +				u8 tmp[12];
> > +				int n = 0;
> > +
> > +				cur_eor = le32_to_cpu(hdr->flags) &
> > +					  VIRTIO_VSOCK_SEQ_EOR;
> > +
> > +				if (!VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries) {
> > +					u32 prev_len = le32_to_cpu(last_hdr->len);
> > +
> > +					prev_eor = le32_to_cpu(last_hdr->flags) &
> > +						   VIRTIO_VSOCK_SEQ_EOR;
> > +					n += vsock_uleb_encode_boundary(
> > +						tmp + n, prev_len, prev_eor);
> > +				}
> > +				n += vsock_uleb_encode_boundary(
> > +					tmp + n, len, cur_eor);
> > +
> > +				if (!vsock_boundary_buf_ensure(
> > +					    vvs, vvs->boundary_len + n)) {
> > +					memcpy(vvs->boundary_buf +
> > +					       vvs->boundary_len, tmp, n);
> > +					vvs->boundary_len += n;
> > +					VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries = true;
> > +					memcpy(skb_put(last_skb, skb->len),
> > +					       skb->data, skb->len);
> > +					free_pkt = true;
> > +					last_hdr->flags |= hdr->flags;
> > +					le32_add_cpu(&last_hdr->len, len);
> > +					goto out;
> > +				}
> > +			}
> > 		}
> > 	}
> > 
> > 


^ permalink raw reply

* Re: [net-next v3 4/5] net: stmmac: starfive: Add jhb100 SGMII interface
From: Minda Chen @ 2026-05-08  9:35 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Alexandre Torgue, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin,
	Emil Renner Berthing, Rob Herring, Krzysztof Kozlowski,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	devicetree@vger.kernel.org
In-Reply-To: <20260507-reoccur-underfoot-bac75e8e454d@spud>


> 
> On Thu, May 07, 2026 at 05:41:14PM +0800, Minda Chen wrote:
> > Add jhb100 compatible and SGMII support. jhb100 soc contains
> > 2 SGMII interfaces and integrated with serdes PHY. SGMII with split
> > TX/RX MAC clock and need to set 2.5M/25M/125M TX/RX clock rate in
> > 10M/100M/1000M speed mode.
> >
> > Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
> > Reviewed-by: Sai Krishna <saikrishnag@marvell.com> @@ -130,6 +160,7 @@
> > static const struct starfive_dwmac_data jh7100_data = {  static const
> > struct of_device_id starfive_dwmac_match[] = {
> >  	{ .compatible = "starfive,jh7100-dwmac", .data = &jh7100_data },
> >  	{ .compatible = "starfive,jh7110-dwmac" },
> > +	{ .compatible = "starfive,jhb100-dwmac" },
> 
> You've declared compatibility with the jh7110, why do you also need to add the
> new comaptible?
> 

I am sorry, I forget to remove this. 

> >  	{ /* sentinel */ }
> >  };
> >  MODULE_DEVICE_TABLE(of, starfive_dwmac_match);
> > --
> > 2.17.1
> >

^ permalink raw reply

* Re: [net-next v3 4/5] net: stmmac: starfive: Add jhb100 SGMII interface
From: Minda Chen @ 2026-05-08  9:35 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Alexandre Torgue, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin,
	Emil Renner Berthing, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	devicetree@vger.kernel.org
In-Reply-To: <98266e3a-a23b-4283-a493-65deb8c2a1e7@lunn.ch>



> 
> > +static int stmmac_starfive_sgmii_set_clk_rate(void *bsp_priv, struct clk
> *clk_tx_i,
> > +					      phy_interface_t __maybe_unused interface,
> > +					      int speed)
> > +{
> > +	struct starfive_dwmac *dwmac = (void *)bsp_priv;
> 
> Why (void *) when it is already a void *?
> 
>     Andrew
> 
Okay, Thanks.

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix skb overhead accounting to preserve full buf_alloc
From: Michael S. Tsirkin @ 2026-05-08  9:53 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, Eric Dumazet, Stefan Hajnoczi, virtualization,
	David S. Miller, Jason Wang, Simon Horman, linux-kernel,
	Paolo Abeni, Xuan Zhuo, kvm, Jakub Kicinski, Eugenio Pérez
In-Reply-To: <20260508092330.69690-1-sgarzare@redhat.com>

On Fri, May 08, 2026 at 11:23:30AM +0200, Stefano Garzarella wrote:
> From: Stefano Garzarella <sgarzare@redhat.com>
> 
> After commit 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb
> queue"), virtio_transport_inc_rx_pkt() subtracts per-skb overhead from
> buf_alloc when checking whether a new packet fits. This reduces the
> effective receive buffer below what the user configured via
> SO_VM_SOCKETS_BUFFER_SIZE, causing legitimate data packets to be
> silently dropped and applications that rely on the full buffer size
> to deadlock.
> 
> Also, the reduced space is not communicated to the remote peer, so
> its credit calculation accounts more credit than the receiver will
> actually accept, causing data loss (there is no retransmission).
> 
> This also causes failures in tools/testing/vsock/vsock_test.c.
> Test 18 sometimes fails, while test 22 always fails in this way:
>     18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch
> 
>     22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed:
>     Resource temporarily unavailable
> 
> Fix this by introducing virtio_transport_rx_buf_size() to calculate the
> size of the RX buffer based on the overhead. Using it in the acceptance
> check, the advertised buf_alloc, and the credit update decision.
> Use buf_alloc * 2 as total budget (payload + overhead), similar to how
> SO_RCVBUF is doubled to reserve space for sk_buff metadata.
> The function returns buf_alloc as long as overhead fits within the
> reservation, then gradually reduces toward 0 as overhead exceeds
> buf_alloc (e.g. under small-packet flooding), informing the peer to
> slow down.
> 
> Fixes: 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue")
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>


unfortunately, this is a bit of a spec violation and there is no guarantee
it helps.

a spec violation because the spec says:
Only payload bytes are counted and header bytes are not
included

and the implication is that a side can not reduce its own buf_alloc.

no guarantee because the other side is not required to process your
packets, so it might not see your buf alloc reduction.

as designed in the current spec, you can only increase your buf alloc,
not decrease it.

what can be done:
- more efficient storage for small packets (poc i posted)
- reduce buf alloc ahead of the time

> ---
>  net/vmw_vsock/virtio_transport_common.c | 31 +++++++++++++++++++++----
>  1 file changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 9b8014516f4f..94a4beb8fd61 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -444,12 +444,32 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	return ret;
>  }
>  
> +/* vvs->rx_lock held by the caller */
> +static u32 virtio_transport_rx_buf_size(struct virtio_vsock_sock *vvs)
> +{
> +	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
> +	/* Use buf_alloc * 2 as total budget (payload + overhead), similar to
> +	 * how SO_RCVBUF is doubled to reserve space for sk_buff metadata.
> +	 */
> +	u64 total_budget = (u64)vvs->buf_alloc * 2;
> +
> +	/* Overhead within buf_alloc: full buf_alloc available for payload */
> +	if (skb_overhead < vvs->buf_alloc)
> +		return vvs->buf_alloc;
> +
> +	/* Overhead exceeded buf_alloc: gradually reduce to bound skb queue */
> +	if (skb_overhead < total_budget)
> +		return total_budget - skb_overhead;
> +
> +	return 0;
> +}
> +
>  static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
>  					u32 len)
>  {
> -	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
> +	u32 rx_buf_size = virtio_transport_rx_buf_size(vvs);
>  
> -	if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
> +	if (!rx_buf_size || vvs->buf_used + len > rx_buf_size)
>  		return false;
>  
>  	vvs->rx_bytes += len;
> @@ -472,7 +492,7 @@ void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *
>  	spin_lock_bh(&vvs->rx_lock);
>  	vvs->last_fwd_cnt = vvs->fwd_cnt;
>  	hdr->fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
> -	hdr->buf_alloc = cpu_to_le32(vvs->buf_alloc);
> +	hdr->buf_alloc = cpu_to_le32(virtio_transport_rx_buf_size(vvs));
>  	spin_unlock_bh(&vvs->rx_lock);
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
> @@ -594,6 +614,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>  	bool low_rx_bytes;
>  	int err = -EFAULT;
>  	size_t total = 0;
> +	u32 rx_buf_size;
>  	u32 free_space;
>  
>  	spin_lock_bh(&vvs->rx_lock);
> @@ -639,7 +660,9 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>  	}
>  
>  	fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
> -	free_space = vvs->buf_alloc - fwd_cnt_delta;
> +	rx_buf_size = virtio_transport_rx_buf_size(vvs);
> +	free_space = rx_buf_size > fwd_cnt_delta ?
> +		     rx_buf_size - fwd_cnt_delta : 0;
>  	low_rx_bytes = (vvs->rx_bytes <
>  			sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
>  
> -- 
> 2.54.0


^ permalink raw reply

* Re: [PATCH net 1/1] igb: Return state in pm_runtime_idle instead of power-down
From: Simon Horman @ 2026-05-08  9:56 UTC (permalink / raw)
  To: Felix Moessbauer
  Cc: intel-wired-lan, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Yan, Zheng, netdev, vivek.behera, Ertman, David M, Sasha Neftin,
	Heiner Kallweit
In-Reply-To: <20260505101141.2657169-1-felix.moessbauer@siemens.com>

+ David Ertman, Sasha Neftin, Heiner Kallweit

On Tue, May 05, 2026 at 12:11:34PM +0200, Felix Moessbauer wrote:
> The PM runtime_idle API expects to get an indication if the device can
> be powered down. Instead of returning the appropriate state, we
> currently directly power down the device (if not active) and return
> that the device is busy.
> 
> We change this by making the function side-effect free and just return
> the state.
> 
> Fixes: 749ab2cd12704 ("igb: add basic runtime PM support")
> Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index ce91dda00ec0e..e8ab0b506a104 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -9652,7 +9652,7 @@ static int igb_runtime_idle(struct device *dev)
>  	struct igb_adapter *adapter = netdev_priv(netdev);
>  
>  	if (!igb_has_link(adapter))
> -		pm_schedule_suspend(dev, MSEC_PER_SEC * 5);
> +		return 0;
>  
>  	return -EBUSY;
>  }

Hi Felix,

I am not sure this is the right approach, are you seeing a behavioural
problem?


This pattern seems to also be present in at least e1000e, igc and r8169.

The igb and e1000e implementations seem to have co-evolved [1][2],
possibly in conjunction with OOT versions of each driver.
The igc implementation came later, perhaps copying e1000e or igb [3].

The git log for r8169 seems to provide a justification for why that
driver users this approach [4].

    - Let the idle notification check whether we can suspend and let it
      schedule the suspend. This way we don't need to have calls to
      pm_schedule_suspend in different places.

While the current e1000e implementation seems to address some reliability
issues [1], although it's not entirely clear to me how that relates to the
issue at hand.

    Fix issues with:
    RuntimePM causing the device to repeatedly flip between suspend and resume
    with the interface administratively downed.
    Having RuntimePM enabled interfering with the functionality of Energy
    Efficient Ethernet.

    Added checks to disallow functions that should not be executed if the
    device is currently runtime suspended

    Make runtime_idle callback to use same deterministic behavior as the igb
    driver.

[1] 63eb48f151b5 ("e1000e Refactor of Runtime Power Management")
    Fri Feb 14 07:16:46 2014 +0000
[2] 749ab2cd1270 ("igb: add basic runtime PM support")
    Wed Jan 4 20:23:37 2012 +0000
[3] 9513d2a5dc7f ("igc: Add legacy power management support")
    Thu Nov 14 09:54:46 2019 +020
[4] a92a08499b1f ("r8169: improve runtime pm in general and suspend unused ports")
    Mon Jan 8 21:39:13 2018 +0100

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Michael S. Tsirkin @ 2026-05-08  9:58 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
	David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, kvm, virtualization
In-Reply-To: <af2sXvVCBT05XF0D@sgarzare-redhat>

On Fri, May 08, 2026 at 11:41:21AM +0200, Stefano Garzarella wrote:
> On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote:
> > On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote:
> > > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote:
> 
> [...]
> 
> > > > > For now, we're already doing something:
> > > > > merging the skuffs if they don't have EOM set.
> > > >
> > > >
> > > > Right that's good. You could go further and merge with EOM too
> > > > if you stick the info about message boundaries somewhere else.
> > > 
> > > This adds a lot of complexity IMO, but we can try.
> > > 
> > > Do you have something in mind?
> > 
> > BER is clearly overkill but here's a POC that claude made for me,
> > just to give u an idea. It's clearly has a ton of issues,
> > for example I dislike how GFP_ATOMIC is handled.
> 
> Okay, I somewhat understand, but clearly this isn't net material

I doubt we have many other options given reverting the regression was
ruled out.


> so for now
> I think the best thing to do is to merge the fixup I sent (or something
> similar):
> https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/

I reviewed that one, problem is it's a spec violation/change that we'll
have to support forever. 

> This is a major change that should be merged with more caution.
> Could this have too much of an impact on performance?
> 
> Thanks,
> Stefano

It's really a POC, real patch is left as an excersise for the reader :).
The correct approach IMHO is to only start using this
when we wasted a lot of memory on small packets.

For example, if sum(truesize) >= buf size.

then we'll not see a perf impact unless it's already pathological.


> > Yet it seems to work fine in light testing.
> > 
> > -->
> > 
> > 
> > vsock/virtio: use DWARF ULEB128 to record EOM boundaries, enable cross-EOM skb coalescing
> > 
> > virtio_transport_recv_enqueue() currently refuses to coalesce an
> > incoming skb with the previous one when the previous skb carries
> > VIRTIO_VSOCK_SEQ_EOM.  This forces one skb per seqpacket message.
> > For workloads with many small or zero-byte messages the per-skb
> > overhead (~960 bytes) dominates, causing unbounded memory growth.
> > 
> > Decouple message boundary tracking from the skb structure: store
> > boundary offsets in a compact side buffer using DWARF ULEB128
> > encoding with the EOR flag folded into the low bit, then allow
> > the data of multiple complete messages to be coalesced into a single
> > skb.
> > 
> > Cross-EOM coalescing fires only when:
> > - both the tail skb and the incoming packet carry EOM (complete msgs)
> > - the incoming packet fits in the tail skb's tailroom
> > - no BPF psock is attached (read_skb expects one msg per skb)
> > 
> > On allocation failure the code falls back to separate skbs (existing
> > behaviour).  Credit accounting is unchanged; the boundary buffer is
> > capped at PAGE_SIZE.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> > 
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index f91704731057..e36b9ab28372 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -12,6 +12,7 @@
> > struct virtio_vsock_skb_cb {
> > 	bool reply;
> > 	bool tap_delivered;
> > +	bool has_boundary_entries;
> > 	u32 offset;
> > };
> > 
> > @@ -167,6 +168,12 @@ struct virtio_vsock_sock {
> > 	u32 buf_used;
> > 	struct sk_buff_head rx_queue;
> > 	u32 msg_count;
> > +
> > +	/* ULEB128-encoded seqpacket message boundary buffer */
> > +	u8 *boundary_buf;
> > +	u32 boundary_len;
> > +	u32 boundary_alloc;
> > +	u32 boundary_off;
> > };
> > 
> > struct virtio_vsock_pkt_info {
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index 416d533f493d..81654f70f72c 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -11,6 +11,7 @@
> > #include <linux/sched/signal.h>
> > #include <linux/ctype.h>
> > #include <linux/list.h>
> > +#include <linux/skmsg.h>
> > #include <linux/virtio_vsock.h>
> > #include <uapi/linux/vsockmon.h>
> > 
> > @@ -26,6 +27,91 @@
> > /* Threshold for detecting small packets to copy */
> > #define GOOD_COPY_LEN  128
> > 
> > +#define VSOCK_BOUNDARY_BUF_INIT	64
> > +#define VSOCK_BOUNDARY_BUF_MAX	PAGE_SIZE
> > +
> > +/* ULEB128 boundary encoding: value = (msg_len << 1) | eor.
> > + * Each byte carries 7 data bits; bit 7 is set on all but the last byte.
> > + * Max 5 bytes for a u32 msg_len (33 bits with eor shift).
> > + */
> > +static int vsock_uleb_encode_boundary(u8 *buf, u32 msg_len, bool eor)
> > +{
> > +	u64 val = ((u64)msg_len << 1) | eor;
> > +	int n = 0;
> > +
> > +	do {
> > +		buf[n] = val & 0x7f;
> > +		val >>= 7;
> > +		if (val)
> > +			buf[n] |= 0x80;
> > +		n++;
> > +	} while (val);
> > +
> > +	return n;
> > +}
> > +
> > +static int vsock_uleb_decode_boundary(const u8 *buf, u32 avail,
> > +				      u32 *msg_len, bool *eor)
> > +{
> > +	u64 val = 0;
> > +	int shift = 0;
> > +	int n = 0;
> > +
> > +	do {
> > +		if (n >= avail || shift >= 35)
> > +			return -EINVAL;
> > +		val |= (u64)(buf[n] & 0x7f) << shift;
> > +		shift += 7;
> > +	} while (buf[n++] & 0x80);
> > +
> > +	*eor = val & 1;
> > +	*msg_len = val >> 1;
> > +	return n;
> > +}
> > +
> > +static void vsock_boundary_buf_compact(struct virtio_vsock_sock *vvs)
> > +{
> > +	if (vvs->boundary_off == 0)
> > +		return;
> > +
> > +	vvs->boundary_len -= vvs->boundary_off;
> > +	memmove(vvs->boundary_buf, vvs->boundary_buf + vvs->boundary_off,
> > +		vvs->boundary_len);
> > +	vvs->boundary_off = 0;
> > +}
> > +
> > +static int vsock_boundary_buf_ensure(struct virtio_vsock_sock *vvs, u32 needed)
> > +{
> > +	u32 new_alloc;
> > +	u8 *new_buf;
> > +
> > +	if (vvs->boundary_alloc >= needed)
> > +		return 0;
> > +
> > +	/* Reclaim consumed space before growing */
> > +	if (vvs->boundary_off) {
> > +		needed -= vvs->boundary_off;
> > +		vsock_boundary_buf_compact(vvs);
> > +		if (vvs->boundary_alloc >= needed)
> > +			return 0;
> > +	}
> > +
> > +	new_alloc = max(needed, vvs->boundary_alloc ? vvs->boundary_alloc * 2
> > +						    : VSOCK_BOUNDARY_BUF_INIT);
> > +	if (new_alloc > VSOCK_BOUNDARY_BUF_MAX)
> > +		new_alloc = VSOCK_BOUNDARY_BUF_MAX;
> > +	if (new_alloc < needed)
> > +		return -ENOMEM;
> > +
> > +	new_buf = krealloc(vvs->boundary_buf, new_alloc, GFP_ATOMIC);
> > +	if (!new_buf)
> > +		return -ENOMEM;
> > +
> > +	vvs->boundary_buf = new_buf;
> > +	vvs->boundary_alloc = new_alloc;
> > +	return 0;
> > +}
> > +
> > static void virtio_transport_cancel_close_work(struct vsock_sock *vsk,
> > 					       bool cancel_timeout);
> > static s64 virtio_transport_has_space(struct virtio_vsock_sock *vvs);
> > @@ -682,41 +768,74 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk,
> > 	total = 0;
> > 	len = msg_data_left(msg);
> > 
> > -	skb_queue_walk(&vvs->rx_queue, skb) {
> > -		struct virtio_vsock_hdr *hdr;
> > +	skb = skb_peek(&vvs->rx_queue);
> > +	if (skb && VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
> > +		u32 msg_len, offset;
> > +		size_t bytes;
> > +		bool eor;
> > +		int ret;
> > 
> > -		if (total < len) {
> > -			size_t bytes;
> > +		ret = vsock_uleb_decode_boundary(
> > +			vvs->boundary_buf + vvs->boundary_off,
> > +			vvs->boundary_len - vvs->boundary_off,
> > +			&msg_len, &eor);
> > +		if (ret < 0)
> > +			goto unlock;
> > +
> > +		offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
> > +		bytes = min(len, (size_t)msg_len);
> > +
> > +		if (bytes) {
> > 			int err;
> > 
> > -			bytes = len - total;
> > -			if (bytes > skb->len)
> > -				bytes = skb->len;
> > -
> > 			spin_unlock_bh(&vvs->rx_lock);
> > -
> > -			/* sk_lock is held by caller so no one else can dequeue.
> > -			 * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
> > -			 */
> > -			err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->offset,
> > +			err = skb_copy_datagram_iter(skb, offset,
> > 						     &msg->msg_iter, bytes);
> > 			if (err)
> > 				return err;
> > -
> > 			spin_lock_bh(&vvs->rx_lock);
> > 		}
> > 
> > -		total += skb->len;
> > -		hdr = virtio_vsock_hdr(skb);
> > +		total = msg_len;
> > +		if (eor)
> > +			msg->msg_flags |= MSG_EOR;
> > +	} else {
> > +		skb_queue_walk(&vvs->rx_queue, skb) {
> > +			struct virtio_vsock_hdr *hdr;
> > 
> > -		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > -			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
> > -				msg->msg_flags |= MSG_EOR;
> > +			if (total < len) {
> > +				size_t bytes;
> > +				int err;
> > 
> > -			break;
> > +				bytes = len - total;
> > +				if (bytes > skb->len)
> > +					bytes = skb->len;
> > +
> > +				spin_unlock_bh(&vvs->rx_lock);
> > +
> > +				err = skb_copy_datagram_iter(
> > +					skb,
> > +					VIRTIO_VSOCK_SKB_CB(skb)->offset,
> > +					&msg->msg_iter, bytes);
> > +				if (err)
> > +					return err;
> > +
> > +				spin_lock_bh(&vvs->rx_lock);
> > +			}
> > +
> > +			total += skb->len;
> > +			hdr = virtio_vsock_hdr(skb);
> > +
> > +			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +				if (le32_to_cpu(hdr->flags) &
> > +				    VIRTIO_VSOCK_SEQ_EOR)
> > +					msg->msg_flags |= MSG_EOR;
> > +				break;
> > +			}
> > 		}
> > 	}
> > 
> > +unlock:
> > 	spin_unlock_bh(&vvs->rx_lock);
> > 
> > 	return total;
> > @@ -740,57 +859,105 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> > 	}
> > 
> > 	while (!msg_ready) {
> > -		struct virtio_vsock_hdr *hdr;
> > -		size_t pkt_len;
> > -
> > -		skb = __skb_dequeue(&vvs->rx_queue);
> > +		skb = skb_peek(&vvs->rx_queue);
> > 		if (!skb)
> > 			break;
> > -		hdr = virtio_vsock_hdr(skb);
> > -		pkt_len = (size_t)le32_to_cpu(hdr->len);
> > 
> > -		if (dequeued_len >= 0) {
> > +		if (VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
> > 			size_t bytes_to_copy;
> > +			u32 msg_len, offset;
> > +			bool eor;
> > +			int ret;
> > 
> > -			bytes_to_copy = min(user_buf_len, pkt_len);
> > +			ret = vsock_uleb_decode_boundary(
> > +				vvs->boundary_buf + vvs->boundary_off,
> > +				vvs->boundary_len - vvs->boundary_off,
> > +				&msg_len, &eor);
> > +			if (ret < 0)
> > +				break;
> > +			vvs->boundary_off += ret;
> > 
> > -			if (bytes_to_copy) {
> > +			offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
> > +			bytes_to_copy = min(user_buf_len, (size_t)msg_len);
> > +
> > +			if (bytes_to_copy && dequeued_len >= 0) {
> > 				int err;
> > 
> > -				/* sk_lock is held by caller so no one else can dequeue.
> > -				 * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
> > -				 */
> > 				spin_unlock_bh(&vvs->rx_lock);
> > -
> > -				err = skb_copy_datagram_iter(skb, 0,
> > +				err = skb_copy_datagram_iter(skb, offset,
> > 							     &msg->msg_iter,
> > 							     bytes_to_copy);
> > -				if (err) {
> > -					/* Copy of message failed. Rest of
> > -					 * fragments will be freed without copy.
> > -					 */
> > -					dequeued_len = err;
> > -				} else {
> > -					user_buf_len -= bytes_to_copy;
> > -				}
> > -
> > 				spin_lock_bh(&vvs->rx_lock);
> > +				if (err)
> > +					dequeued_len = err;
> > +				else
> > +					user_buf_len -= bytes_to_copy;
> > 			}
> > 
> > 			if (dequeued_len >= 0)
> > -				dequeued_len += pkt_len;
> > -		}
> > +				dequeued_len += msg_len;
> > 
> > -		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +			VIRTIO_VSOCK_SKB_CB(skb)->offset += msg_len;
> > 			msg_ready = true;
> > 			vvs->msg_count--;
> > 
> > -			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
> > +			if (eor)
> > 				msg->msg_flags |= MSG_EOR;
> > -		}
> > 
> > -		virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
> > -		kfree_skb(skb);
> > +			virtio_transport_dec_rx_pkt(vvs, msg_len, msg_len);
> > +
> > +			if (VIRTIO_VSOCK_SKB_CB(skb)->offset >= skb->len) {
> > +				__skb_unlink(skb, &vvs->rx_queue);
> > +				kfree_skb(skb);
> > +			}
> > +
> > +			if (vvs->boundary_off >= vvs->boundary_len / 2)
> > +				vsock_boundary_buf_compact(vvs);
> > +		} else {
> > +			struct virtio_vsock_hdr *hdr;
> > +			size_t pkt_len;
> > +
> > +			skb = __skb_dequeue(&vvs->rx_queue);
> > +			if (!skb)
> > +				break;
> > +			hdr = virtio_vsock_hdr(skb);
> > +			pkt_len = (size_t)le32_to_cpu(hdr->len);
> > +
> > +			if (dequeued_len >= 0) {
> > +				size_t bytes_to_copy;
> > +
> > +				bytes_to_copy = min(user_buf_len, pkt_len);
> > +
> > +				if (bytes_to_copy) {
> > +					int err;
> > +
> > +					spin_unlock_bh(&vvs->rx_lock);
> > +					err = skb_copy_datagram_iter(
> > +						skb, 0, &msg->msg_iter,
> > +						bytes_to_copy);
> > +					if (err)
> > +						dequeued_len = err;
> > +					else
> > +						user_buf_len -= bytes_to_copy;
> > +					spin_lock_bh(&vvs->rx_lock);
> > +				}
> > +
> > +				if (dequeued_len >= 0)
> > +					dequeued_len += pkt_len;
> > +			}
> > +
> > +			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +				msg_ready = true;
> > +				vvs->msg_count--;
> > +
> > +				if (le32_to_cpu(hdr->flags) &
> > +				    VIRTIO_VSOCK_SEQ_EOR)
> > +					msg->msg_flags |= MSG_EOR;
> > +			}
> > +
> > +			virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
> > +			kfree_skb(skb);
> > +		}
> > 	}
> > 
> > 	spin_unlock_bh(&vvs->rx_lock);
> > @@ -1132,6 +1299,7 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
> > 
> > 	virtio_transport_cancel_close_work(vsk, true);
> > 
> > +	kfree(vvs->boundary_buf);
> > 	kfree(vvs);
> > 	vsk->trans = NULL;
> > }
> > @@ -1224,6 +1392,11 @@ static void virtio_transport_remove_sock(struct vsock_sock *vsk)
> > 	 * removing it.
> > 	 */
> > 	__skb_queue_purge(&vvs->rx_queue);
> > +	kfree(vvs->boundary_buf);
> > +	vvs->boundary_buf = NULL;
> > +	vvs->boundary_len = 0;
> > +	vvs->boundary_alloc = 0;
> > +	vvs->boundary_off = 0;
> > 	vsock_remove_sock(vsk);
> > }
> > 
> > @@ -1395,23 +1568,62 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> > 	    !skb_is_nonlinear(skb)) {
> > 		struct virtio_vsock_hdr *last_hdr;
> > 		struct sk_buff *last_skb;
> > +		bool last_has_eom;
> > +		bool has_eom;
> > 
> > 		last_skb = skb_peek_tail(&vvs->rx_queue);
> > 		last_hdr = virtio_vsock_hdr(last_skb);
> > +		last_has_eom = le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
> > +		has_eom = le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
> > 
> > -		/* If there is space in the last packet queued, we copy the
> > -		 * new packet in its buffer. We avoid this if the last packet
> > -		 * queued has VIRTIO_VSOCK_SEQ_EOM set, because this is
> > -		 * delimiter of SEQPACKET message, so 'pkt' is the first packet
> > -		 * of a new message.
> > -		 */
> > -		if (skb->len < skb_tailroom(last_skb) &&
> > -		    !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)) {
> > -			memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
> > -			free_pkt = true;
> > -			last_hdr->flags |= hdr->flags;
> > -			le32_add_cpu(&last_hdr->len, len);
> > -			goto out;
> > +		if (skb->len < skb_tailroom(last_skb)) {
> > +			if (!last_has_eom) {
> > +				/* Same-message coalescing (existing path) */
> > +				memcpy(skb_put(last_skb, skb->len),
> > +				       skb->data, skb->len);
> > +				free_pkt = true;
> > +				last_hdr->flags |= hdr->flags;
> > +				le32_add_cpu(&last_hdr->len, len);
> > +				goto out;
> > +			}
> > +
> > +			/* Cross-EOM: coalesce complete messages into one skb,
> > +			 * recording message boundaries in a compact BER buffer.
> > +			 * Only when incoming packet also has EOM (complete msg).
> > +			 */
> > +			if (has_eom && !sk_psock(sk_vsock(vsk))) {
> > +				bool prev_eor, cur_eor;
> > +				u8 tmp[12];
> > +				int n = 0;
> > +
> > +				cur_eor = le32_to_cpu(hdr->flags) &
> > +					  VIRTIO_VSOCK_SEQ_EOR;
> > +
> > +				if (!VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries) {
> > +					u32 prev_len = le32_to_cpu(last_hdr->len);
> > +
> > +					prev_eor = le32_to_cpu(last_hdr->flags) &
> > +						   VIRTIO_VSOCK_SEQ_EOR;
> > +					n += vsock_uleb_encode_boundary(
> > +						tmp + n, prev_len, prev_eor);
> > +				}
> > +				n += vsock_uleb_encode_boundary(
> > +					tmp + n, len, cur_eor);
> > +
> > +				if (!vsock_boundary_buf_ensure(
> > +					    vvs, vvs->boundary_len + n)) {
> > +					memcpy(vvs->boundary_buf +
> > +					       vvs->boundary_len, tmp, n);
> > +					vvs->boundary_len += n;
> > +					VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries = true;
> > +					memcpy(skb_put(last_skb, skb->len),
> > +					       skb->data, skb->len);
> > +					free_pkt = true;
> > +					last_hdr->flags |= hdr->flags;
> > +					le32_add_cpu(&last_hdr->len, len);
> > +					goto out;
> > +				}
> > +			}
> > 		}
> > 	}
> > 
> > 


^ permalink raw reply

* [net-next v3 1/3] net: phy: motorcomm: move mdio lock out from yt8531_set_ds()
From: Minda Chen @ 2026-05-08  9:45 UTC (permalink / raw)
  To: Frank, Andrew Lunn, Heiner Kallweit, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev
  Cc: linux-kernel, Minda Chen
In-Reply-To: <20260508094522.3952-1-minda.chen@starfivetech.com>

yt8531_set_ds() default set register with mdio lock and only called
with YT8531 PHY. But new type YT8531s support RGMII and has the same
pin strength setting with YT8531, YT8531s need to call yt8531_set_ds()
setting pin drive strength. But YT8531s config init function
yt8521_config_init() already get the mdio lock with phy_select_page().
If calling yt8521_config_init() with mdio lock will cause dead lock.

Need to get the lock before calling yt8531_get_ds() and move mdio
lock out from it for YT8531s.

Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
---
 drivers/net/phy/motorcomm.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/phy/motorcomm.c b/drivers/net/phy/motorcomm.c
index 4d62f7b36212..c66804537aa2 100644
--- a/drivers/net/phy/motorcomm.c
+++ b/drivers/net/phy/motorcomm.c
@@ -974,7 +974,8 @@ static u32 yt8531_get_ldo_vol(struct phy_device *phydev)
 {
 	u32 val;
 
-	val = ytphy_read_ext_with_lock(phydev, YT8521_CHIP_CONFIG_REG);
+	val = ytphy_read_ext(phydev, YT8521_CHIP_CONFIG_REG);
+
 	val = FIELD_GET(YT8531_RGMII_LDO_VOL_MASK, val);
 
 	return val <= YT8531_LDO_VOL_1V8 ? val : YT8531_LDO_VOL_1V8;
@@ -1010,10 +1011,11 @@ static int yt8531_set_ds(struct phy_device *phydev)
 		ds = YT8531_RGMII_RX_DS_DEFAULT;
 	}
 
-	ret = ytphy_modify_ext_with_lock(phydev,
-					 YTPHY_PAD_DRIVE_STRENGTH_REG,
-					 YT8531_RGMII_RXC_DS_MASK,
-					 FIELD_PREP(YT8531_RGMII_RXC_DS_MASK, ds));
+	ret = ytphy_modify_ext(phydev,
+			       YTPHY_PAD_DRIVE_STRENGTH_REG,
+			       YT8531_RGMII_RXC_DS_MASK,
+			       FIELD_PREP(YT8531_RGMII_RXC_DS_MASK, ds));
+
 	if (ret < 0)
 		return ret;
 
@@ -1033,10 +1035,11 @@ static int yt8531_set_ds(struct phy_device *phydev)
 	ds_field_low = FIELD_GET(GENMASK(1, 0), ds);
 	ds_field_low = FIELD_PREP(YT8531_RGMII_RXD_DS_LOW_MASK, ds_field_low);
 
-	ret = ytphy_modify_ext_with_lock(phydev,
-					 YTPHY_PAD_DRIVE_STRENGTH_REG,
-					 YT8531_RGMII_RXD_DS_LOW_MASK | YT8531_RGMII_RXD_DS_HI_MASK,
-					 ds_field_low | ds_field_hi);
+	ret = ytphy_modify_ext(phydev,
+			       YTPHY_PAD_DRIVE_STRENGTH_REG,
+			       YT8531_RGMII_RXD_DS_LOW_MASK | YT8531_RGMII_RXD_DS_HI_MASK,
+			       ds_field_low | ds_field_hi);
+
 	if (ret < 0)
 		return ret;
 
@@ -1826,7 +1829,9 @@ static int yt8531_config_init(struct phy_device *phydev)
 			return ret;
 	}
 
+	phy_lock_mdio_bus(phydev);
 	ret = yt8531_set_ds(phydev);
+	phy_unlock_mdio_bus(phydev);
 	if (ret < 0)
 		return ret;
 
-- 
2.17.1


^ permalink raw reply related

* [net-next v3 3/3] net: phy: motorcomm: Add YT8522 100M RMII PHY support
From: Minda Chen @ 2026-05-08  9:45 UTC (permalink / raw)
  To: Frank, Andrew Lunn, Heiner Kallweit, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev
  Cc: linux-kernel, Minda Chen
In-Reply-To: <20260508094522.3952-1-minda.chen@starfivetech.com>

Add YT8522 100M RMII ethernet PHY base driver support, including
PHY ID and base config init function.

Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
---
 drivers/net/phy/motorcomm.c | 49 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/motorcomm.c b/drivers/net/phy/motorcomm.c
index ebc24f51e626..13d57aba5487 100644
--- a/drivers/net/phy/motorcomm.c
+++ b/drivers/net/phy/motorcomm.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0+
 /*
- * Motorcomm 8511/8521/8531/8531S/8821 PHY driver.
+ * Motorcomm 8511/8521/8522/8531/8531S/8821 PHY driver.
  *
  * Author: Peter Geis <pgwipeout@gmail.com>
  * Author: Frank <Frank.Sae@motor-comm.com>
@@ -14,6 +14,7 @@
 
 #define PHY_ID_YT8511		0x0000010a
 #define PHY_ID_YT8521		0x0000011a
+#define PHY_ID_YT8522		0x4f51e928
 #define PHY_ID_YT8531		0x4f51e91b
 #define PHY_ID_YT8531S		0x4f51e91a
 #define PHY_ID_YT8821		0x4f51ea19
@@ -227,6 +228,13 @@
 #define YT8521_LED_100_ON_EN			BIT(5)
 #define YT8521_LED_10_ON_EN			BIT(4)
 
+#define YT8522_EXTREG_SLEEP_CONTROL		0x2027
+#define YT8522_EN_SLEEP_SW			15
+
+#define YT8522_EXTENDED_COMBO_CTRL		0x4000
+#define YT8522_RXDV_SEL				BIT(4)
+#define YT8522_RMII_EN				BIT(1)
+
 #define YTPHY_MISC_CONFIG_REG			0xA006
 #define YTPHY_MCR_FIBER_SPEED_MASK		BIT(0)
 #define YTPHY_MCR_FIBER_1000BX			(0x1 << 0)
@@ -1843,6 +1851,36 @@ static int yt8531_config_init(struct phy_device *phydev)
 	return 0;
 }
 
+static int yt8522_config_init(struct phy_device *phydev)
+{
+	struct device_node *node = phydev->mdio.dev.of_node;
+	int ret, val;
+
+	val = ytphy_read_ext_with_lock(phydev, YT8522_EXTENDED_COMBO_CTRL);
+	if (val < 0)
+		return val;
+
+	if (val & YT8522_RMII_EN) {
+		val |= YT8522_RXDV_SEL;
+		ret = ytphy_write_ext_with_lock(phydev,
+						YT8522_EXTENDED_COMBO_CTRL,
+						val);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (of_property_read_bool(node, "motorcomm,auto-sleep-disabled")) {
+		/* disable auto sleep */
+		ret = ytphy_modify_ext_with_lock(phydev,
+						 YT8522_EXTREG_SLEEP_CONTROL,
+						 YT8522_EN_SLEEP_SW, 0);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
 /**
  * yt8531_link_change_notify() - Adjust the tx clock direction according to
  * the current speed and dts config.
@@ -3052,6 +3090,14 @@ static struct phy_driver motorcomm_phy_drvs[] = {
 		.led_hw_control_set = yt8521_led_hw_control_set,
 		.led_hw_control_get = yt8521_led_hw_control_get,
 	},
+	{
+		PHY_ID_MATCH_EXACT(PHY_ID_YT8522),
+		.name		= "YT8522 100 Megabit Ethernet",
+		.config_aneg	= genphy_config_aneg,
+		.config_init	= yt8522_config_init,
+		.suspend	= genphy_suspend,
+		.resume		= genphy_resume,
+	},
 	{
 		PHY_ID_MATCH_EXACT(PHY_ID_YT8531),
 		.name		= "YT8531 Gigabit Ethernet",
@@ -3112,6 +3158,7 @@ MODULE_LICENSE("GPL");
 static const struct mdio_device_id __maybe_unused motorcomm_tbl[] = {
 	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8511) },
 	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8521) },
+	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8522) },
 	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8531) },
 	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8531S) },
 	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8821) },
-- 
2.17.1


^ permalink raw reply related

* [net-next v3 0/3] Add motorcomm 8531s set ds func and 8522 driver
From: Minda Chen @ 2026-05-08  9:45 UTC (permalink / raw)
  To: Frank, Andrew Lunn, Heiner Kallweit, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev
  Cc: linux-kernel, Minda Chen

This patch is for Starfive JHB100 EVB board. JHB100  contain
1 RGMII/RMII and 1 RMII synopsys GMAC cores. In the EVB board, RGMII
interface connect with YT8531s Ethernet PHY. RMII interface connect
with YT8522 ethernet PHY. So patch 1-2 is for RGMII interface
patch 3 is RMII is for RMII interface.

JHB100 is a Starfive new RISC-V SoC for datacenter BMC (BaseBoard
Managent Controller). Similar with Aspeed 27x0.

The JHB100 minimal system upstream is in progress:
https://patchwork.kernel.org/project/linux-riscv/cover/20260403054945.467700-1-changhuang.liang@starfivetech.com/

The patch base net-next (base commit list below)

previous commit
v1: https://patchwork.kernel.org/project/netdevbpf/cover/20260415092654.64907-1-minda.chen@starfivetech.com/
v2:https://patchwork.kernel.org/project/netdevbpf/cover/20260422083255.29692-1-minda.chen@starfivetech.com/

The change list:
v3:
1. change some commit message.
v2:
1. patch1 move mdio lock out from  yt8531_set_ds().
2. patch2 changed to phy_interface_is_rgmii().

Minda Chen (3):
  net: phy: motorcomm: move mdio lock out from yt8531_set_ds()
  net: motorcomm: phy: set drive strength in YT8531s RGMII
  net: phy: motorcomm: Add YT8522 100M RMII PHY support

 drivers/net/phy/motorcomm.c | 77 ++++++++++++++++++++++++++++++++-----
 1 file changed, 67 insertions(+), 10 deletions(-)

base-commit: 6a4c4656b0d2d4056a1f0c35442db4e8a5cf8021
-- 
2.17.1

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix skb overhead accounting to preserve full buf_alloc
From: Stefano Garzarella @ 2026-05-08 10:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, Eric Dumazet, Stefan Hajnoczi, virtualization,
	David S. Miller, Jason Wang, Simon Horman, linux-kernel,
	Paolo Abeni, Xuan Zhuo, kvm, Jakub Kicinski, Eugenio Pérez
In-Reply-To: <20260508055125-mutt-send-email-mst@kernel.org>

On Fri, May 08, 2026 at 05:53:07AM -0400, Michael S. Tsirkin wrote:
>On Fri, May 08, 2026 at 11:23:30AM +0200, Stefano Garzarella wrote:
>> From: Stefano Garzarella <sgarzare@redhat.com>
>>
>> After commit 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb
>> queue"), virtio_transport_inc_rx_pkt() subtracts per-skb overhead from
>> buf_alloc when checking whether a new packet fits. This reduces the
>> effective receive buffer below what the user configured via
>> SO_VM_SOCKETS_BUFFER_SIZE, causing legitimate data packets to be
>> silently dropped and applications that rely on the full buffer size
>> to deadlock.
>>
>> Also, the reduced space is not communicated to the remote peer, so
>> its credit calculation accounts more credit than the receiver will
>> actually accept, causing data loss (there is no retransmission).
>>
>> This also causes failures in tools/testing/vsock/vsock_test.c.
>> Test 18 sometimes fails, while test 22 always fails in this way:
>>     18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch
>>
>>     22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed:
>>     Resource temporarily unavailable
>>
>> Fix this by introducing virtio_transport_rx_buf_size() to calculate the
>> size of the RX buffer based on the overhead. Using it in the acceptance
>> check, the advertised buf_alloc, and the credit update decision.
>> Use buf_alloc * 2 as total budget (payload + overhead), similar to how
>> SO_RCVBUF is doubled to reserve space for sk_buff metadata.
>> The function returns buf_alloc as long as overhead fits within the
>> reservation, then gradually reduces toward 0 as overhead exceeds
>> buf_alloc (e.g. under small-packet flooding), informing the peer to
>> slow down.
>>
>> Fixes: 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue")
>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>
>
>unfortunately, this is a bit of a spec violation and there is no guarantee
>it helps.

Loosing data like we are doing in 059b7dbd20a6 is even worse IMHO.

>
>a spec violation because the spec says:
>Only payload bytes are counted and header bytes are not
>included
>
>and the implication is that a side can not reduce its own buf_alloc.
>
>no guarantee because the other side is not required to process your
>packets, so it might not see your buf alloc reduction.
>
>as designed in the current spec, you can only increase your buf alloc,
>not decrease it.

We never enforced this, currently an user can reduce it by 
SO_VM_SOCKETS_BUFFER_SIZE and we haven't blocked it since virtio-vsock 
was introduced, should we update the spec?

>
>what can be done:
>- more efficient storage for small packets (poc i posted)
>- reduce buf alloc ahead of the time

That's basically what I'm doing here: I'm using twice the size of 
`buf_alloc` (just like `SO_RCVBUF` does for other socket types) and 
telling the other peer just `buf_alloc`.

But then, somehow, we have to let the other person know that we're 
running out of space. With this patch that only happens when the other 
peer isn't behaving properly, sending so many small packets that the 
overhead exceeds `buf_alloc`.

Stefano

>
>> ---
>>  net/vmw_vsock/virtio_transport_common.c | 31 +++++++++++++++++++++----
>>  1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 9b8014516f4f..94a4beb8fd61 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -444,12 +444,32 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  	return ret;
>>  }
>>
>> +/* vvs->rx_lock held by the caller */
>> +static u32 virtio_transport_rx_buf_size(struct virtio_vsock_sock *vvs)
>> +{
>> +	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
>> +	/* Use buf_alloc * 2 as total budget (payload + overhead), similar to
>> +	 * how SO_RCVBUF is doubled to reserve space for sk_buff metadata.
>> +	 */
>> +	u64 total_budget = (u64)vvs->buf_alloc * 2;
>> +
>> +	/* Overhead within buf_alloc: full buf_alloc available for payload */
>> +	if (skb_overhead < vvs->buf_alloc)
>> +		return vvs->buf_alloc;
>> +
>> +	/* Overhead exceeded buf_alloc: gradually reduce to bound skb queue */
>> +	if (skb_overhead < total_budget)
>> +		return total_budget - skb_overhead;
>> +
>> +	return 0;
>> +}
>> +
>>  static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
>>  					u32 len)
>>  {
>> -	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
>> +	u32 rx_buf_size = virtio_transport_rx_buf_size(vvs);
>>
>> -	if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
>> +	if (!rx_buf_size || vvs->buf_used + len > rx_buf_size)
>>  		return false;
>>
>>  	vvs->rx_bytes += len;
>> @@ -472,7 +492,7 @@ void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *
>>  	spin_lock_bh(&vvs->rx_lock);
>>  	vvs->last_fwd_cnt = vvs->fwd_cnt;
>>  	hdr->fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
>> -	hdr->buf_alloc = cpu_to_le32(vvs->buf_alloc);
>> +	hdr->buf_alloc = cpu_to_le32(virtio_transport_rx_buf_size(vvs));
>>  	spin_unlock_bh(&vvs->rx_lock);
>>  }
>>  EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
>> @@ -594,6 +614,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>>  	bool low_rx_bytes;
>>  	int err = -EFAULT;
>>  	size_t total = 0;
>> +	u32 rx_buf_size;
>>  	u32 free_space;
>>
>>  	spin_lock_bh(&vvs->rx_lock);
>> @@ -639,7 +660,9 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>>  	}
>>
>>  	fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
>> -	free_space = vvs->buf_alloc - fwd_cnt_delta;
>> +	rx_buf_size = virtio_transport_rx_buf_size(vvs);
>> +	free_space = rx_buf_size > fwd_cnt_delta ?
>> +		     rx_buf_size - fwd_cnt_delta : 0;
>>  	low_rx_bytes = (vvs->rx_bytes <
>>  			sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
>>
>> --
>> 2.54.0
>


^ permalink raw reply

* [PATCH net v2] net: ethtool: fix missing closing paren in rings_reply_size()
From: Tao Cui @ 2026-05-08 10:04 UTC (permalink / raw)
  To: leitao, vadim.fedorenko
  Cc: andrew, cuitao, davem, edumazet, horms, kuba, netdev, pabeni
In-Reply-To: <af2tj9AkacSUfsp9@gmail.com>

sizeof(u32) on the _RINGS_CQE_SIZE line is missing its closing
parenthesis, causing nla_total_size() to absorb the subsequent
_TX_PUSH and _RX_PUSH entries. The resulting size estimate
happens to be numerically identical due to NLA alignment, but
the nesting is wrong and misleading.

Fixes: 4dc84c06a343 ("net: ethtool: extend ringparam set/get APIs for tx_push")
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Breno Leitao <leitao@debian.org>
---
v2:
- Added Fixes: tag as suggested by Breno Leitao.
- Added Reviewed-by tags from Vadim Fedorenko and Breno Leitao.
---
 net/ethtool/rings.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 0fd5dcc3729f..9054c89c5d7b 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -63,9 +63,9 @@ static int rings_reply_size(const struct ethnl_req_info *req_base,
 	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX */
 	       nla_total_size(sizeof(u32)) +	/* _RINGS_RX_BUF_LEN */
 	       nla_total_size(sizeof(u8))  +	/* _RINGS_TCP_DATA_SPLIT */
-	       nla_total_size(sizeof(u32)  +	/* _RINGS_CQE_SIZE */
+	       nla_total_size(sizeof(u32)) +	/* _RINGS_CQE_SIZE */
 	       nla_total_size(sizeof(u8))  +	/* _RINGS_TX_PUSH */
-	       nla_total_size(sizeof(u8))) +	/* _RINGS_RX_PUSH */
+	       nla_total_size(sizeof(u8))  +	/* _RINGS_RX_PUSH */
 	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX_PUSH_BUF_LEN */
 	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX_PUSH_BUF_LEN_MAX */
 	       nla_total_size(sizeof(u32)) +	/* _RINGS_HDS_THRESH */
-- 
2.43.0


^ permalink raw reply related

* Re: [net-next v40] mctp pcc: Implement MCTP over PCC Transport
From: Jeremy Kerr @ 2026-05-08 10:11 UTC (permalink / raw)
  To: Adam Young, Matt Johnston, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-kernel, Sudeep Holla, Jonathan Cameron, Huisong Li
In-Reply-To: <20260508032953.337036-1-admiyo@os.amperecomputing.com>

Hi Adam,

> +static void mctp_pcc_client_rx_callback(struct mbox_client *cl, void *mssg)
> +{
> +	struct acpi_pcct_ext_pcc_shared_memory pcc_header;
> +	struct mctp_pcc_ndev *mctp_pcc_ndev;
> +	struct mctp_pcc_mailbox *inbox;
> +	struct mctp_skb_cb *cb;
> +	struct sk_buff *skb;
> +	u32 header_length;
> +	int size;
> +
> +	mctp_pcc_ndev = container_of(cl, struct mctp_pcc_ndev, inbox.client);
> +	inbox = &mctp_pcc_ndev->inbox;
> +	memcpy_fromio(&pcc_header, inbox->chan->shmem, sizeof(pcc_header));
> +
> +	// The message must at least have the PCC command indicating it is an MCTP
> +	// message followed by the MCTP header, or we have a malformed message.
> +	// This may be run on big endian system, but the data in the buffer is
> +	// explicitly little endian.
> +	header_length = le32_to_cpu(pcc_header.length);
> +
> +	if (header_length < sizeof(pcc_header.command) + sizeof(struct mctp_hdr))
> +		goto error;
> +	// If the reported size is larger than the shared memory minus headers,
> +	// something is wrong and treat the buffer as corrupted data.
> +	if (header_length > inbox->chan->shmem_size - PCC_EXTRA_LEN)
> +		goto error;
> +	if (memcmp(&pcc_header.command, MCTP_SIGNATURE, MCTP_SIGNATURE_LENGTH) != 0)
> +		goto error;
> +
> +	size = header_length + PCC_EXTRA_LEN;
> +	skb = netdev_alloc_skb(mctp_pcc_ndev->ndev, size);
> +	if (!skb)
> +		goto error;
> +	skb_put(skb, size);
> +	skb->protocol = htons(ETH_P_MCTP);
> +	memcpy_fromio(skb->data, inbox->chan->shmem, size);
> +	dev_dstats_rx_add(mctp_pcc_ndev->ndev, size);
> +	skb_pull(skb, sizeof(pcc_header));
> +	skb_reset_mac_header(skb);
> +	skb_reset_network_header(skb);
> +	cb = __mctp_cb(skb);
> +	cb->halen = 0;
> +	netif_rx(skb);
> +	return;
> +error:
> +	dev_dstats_rx_dropped(mctp_pcc_ndev->ndev);
> +}

Super minor, but odd spacing here. This should probably be:

            if (!skb)
                goto error;

            [...]
            netif_rx(skb);
            return;
    
    error:
            dev_dstats_rx_dropped(..);
    }

(ie, with a bit of spacing after the goto and the return)

> +
> +static netdev_tx_t mctp_pcc_tx(struct sk_buff *skb, struct net_device *ndev)
> +{
> +	struct acpi_pcct_ext_pcc_shared_memory *pcc_header;
> +	struct mctp_pcc_ndev *mpnd = netdev_priv(ndev);
> +	int len = skb->len;
> +
> +	/* Consolidated a fragmented packet into contiguous memory */
> +	if (skb_is_nonlinear(skb)) {
> +		if (skb_linearize(skb))
> +			goto error;
> +	}

This is still not needed.

> +
> +	if (skb_cow_head(skb, sizeof(*pcc_header)))
> +		goto error;
> +	/**
> +	 * This code could potentially be run on A Big Endian
> +	 * System.  The ACPI specification requires that values
> +	 * in the shared buffer be little endian.
> +	 */

Minor: You've taken out BE support, but left this comment in.

> +	pcc_header = skb_push(skb, sizeof(*pcc_header));
> +	pcc_header->signature = PCC_SIGNATURE | mpnd->outbox.index;
> +	pcc_header->flags = PCC_CMD_COMPLETION_NOTIFY;
> +	memcpy(&pcc_header->command, MCTP_SIGNATURE, MCTP_SIGNATURE_LENGTH);
> +	pcc_header->length = len + MCTP_SIGNATURE_LENGTH;
> +
> +	if (mbox_send_message(mpnd->outbox.chan->mchan, skb) < 0) {
> +		// Remove the header in case it gets sent again
> +		skb_pull(skb, sizeof(*pcc_header));
> +		netif_stop_queue(ndev);
> +		return NETDEV_TX_BUSY;
> +	}
> +
> +	return NETDEV_TX_OK;
> +error:
> +	dev_dstats_tx_dropped(ndev);
> +	kfree_skb(skb);
> +	return NETDEV_TX_OK;
> +}

Cheers,


Jeremy

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Stefano Garzarella @ 2026-05-08 10:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
	David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, kvm, virtualization
In-Reply-To: <20260508055343-mutt-send-email-mst@kernel.org>

On Fri, May 08, 2026 at 05:58:06AM -0400, Michael S. Tsirkin wrote:
>On Fri, May 08, 2026 at 11:41:21AM +0200, Stefano Garzarella wrote:
>> On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote:
>> > On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote:
>> > > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote:
>> > > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote:
>>
>> [...]
>>
>> > > > > For now, we're already doing something:
>> > > > > merging the skuffs if they don't have EOM set.
>> > > >
>> > > >
>> > > > Right that's good. You could go further and merge with EOM too
>> > > > if you stick the info about message boundaries somewhere else.
>> > >
>> > > This adds a lot of complexity IMO, but we can try.
>> > >
>> > > Do you have something in mind?
>> >
>> > BER is clearly overkill but here's a POC that claude made for me,
>> > just to give u an idea. It's clearly has a ton of issues,
>> > for example I dislike how GFP_ATOMIC is handled.
>>
>> Okay, I somewhat understand, but clearly this isn't net material
>
>I doubt we have many other options given reverting the regression was
>ruled out.

As Eric pointed out, we can't revert it.

>
>
>> so for now
>> I think the best thing to do is to merge the fixup I sent (or something
>> similar):
>> https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/
>
>I reviewed that one, problem is it's a spec violation/change that we'll
>have to support forever.

I have a few points to make on this, but let's discuss them there.

>
>> This is a major change that should be merged with more caution.
>> Could this have too much of an impact on performance?
>>
>> Thanks,
>> Stefano
>
>It's really a POC, real patch is left as an excersise for the reader 
>:).

eheh, I see, but honestly, this overcomplication scares me. I'll try to 
think it over.

>The correct approach IMHO is to only start using this
>when we wasted a lot of memory on small packets.
>
>For example, if sum(truesize) >= buf size.
>
>then we'll not see a perf impact unless it's already pathological.

Agree on this, which is similar to what I'm doing in that patch.  
Reducing the advertised buf_alloc only in pathological cases (e.g.  
overhead > buf_alloc).

Stefano


^ permalink raw reply

* [PATCH net-next v4] hsr: Allow to send a specific port and with HSR header
From: Sebastian Andrzej Siewior @ 2026-05-08 10:17 UTC (permalink / raw)
  To: netdev
  Cc: Jayachandran, Andrew Lunn, Chintan Vankar, Danish Anwar,
	Daolin Qiu, David S. Miller, Eric Dumazet, Felix Maurer,
	Jakub Kicinski, Neelima Muralidharan, Paolo Abeni,
	Praneeth Bajjuri, Pratheesh Gangadhar TK, Richard Cochran,
	Simon Horman, Vignesh Raghavendra, Willem de Bruijn

HSR forwards all packets it received on slave port 1 to slave port 2 and
one of the two copies to the user (master) interface.
In terms of PTP this is not good because the latency introduced by
forwarding makes the timestamp in the PTP packet inaccurate.
The PTP packets should not be forwarded like regular packets.

In order to work with PTP over HSR the following has been done:
- PTP packets which are received are dropped within the HSR stack. That
  means they are not forwarded or injected into the master port. If the
  user requires them, then they need to be obtained directly from the
  SLAVE interface.

- Sending packets. If the ethernet type of the packet is ETH_P_1588 then
  the stack assumes a header of type struct hsr_inline_header. The size
  of this header is the same as ethhdr. As a safeguard, the header
  contains a magic field which matches the position of h_source and it
  needs to match HSR_INLINE_HDR.
  Once this is verified, the header contains the port on which this
  packet needs to be sent and if system's HSR header should be added.
  This information is used with the HSR stack. The packet is then
  pulled passed the custom header so the remaining stack will see the
  actual data.

- HSR HW offloading. The stack passes the skb to the requested port. The
  driver needs to be aware of the mode (HSR/ PRP). In PRP mode, there
  must be no offloading if the ether type is ETH_P_1588. In HSR mode it
  needs to add a HSR header for the ETH_P_1588 ether type but none if it
  is already ETH_P_HSR.

The originally submitted skb is freed and only the (altered) clone is
submitted to the slave interface for sending. Therefore on cloning, the
socket and tx_flags/ tskey are copied so that the PTP timestamp is
forwarded to the socket submitting the skb.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
I am trying to extend linuxptp to support PTP over a HSR network.

This is the kernel side of the changes. In short PTP over HSR sends its
packets to a multicast address and every node needs to forward the PTP
packet (SYNC and FOLLOW-UP for instance) within the HSR ring.
In order to achieve this, the HSR stack must not duplicate and forward
the PTP packets as it would do with other packets. The delay caused by
the duplication and forwarding adds overhead which in turn makes the
timing information within the PTP packet inaccurate.

My current approach is to open the slave devices (eth0/ eth1) from
userland in order to receive the PTP packets. Sending happens from the
hsr0 device. The actual packet can have an optional inline header
prepended of type struct hsr_inline_header. The size of the header is
equivalent to ethhdr. The header has a type (h_proto) at the same
position as ethhdr and expects it to be ETH_P_1588 as this extra meta
information is only relevant for PTP packets. It makes no sense to send
PTP packets via the HSR interface because it gets duplicated and the
timestamp information is lost so this should not break anything. As an
additional safe guard there is a magic value at h_source position. The
value has '0xaf' at the most significant byte which makes the address a
locally administered multicast address.

The header passes two information from userland: On which slave port
the packet has to be sent and does the HSR stack need to prepend a
header or not.
The header is skipped so that the remaining stack sees the actual data
and can send it as requested.

The PRP packets are sent directly via the SLAVE interface. The standard
mandates not add a PRP trailer (PRP, redundancy control trailer) to PTP
packets. There is not really a reason to use hsr interface.

HSR hardware offloading is optional. The driver needs to know if the
operating mode is HSR or PRP. In PRP mode it needs to check the ether
type and for ETH_P_1588 it must not perform any offloading.
In HSR mode, for ether-type ETH_P_1588 there must be no offloading. If
the ether-type is ETH_P_HSR there must be no offloading if the
encapsulated protocol is ETH_P_1588.

This has been tested in a pure software environment and in an HW-assisted
environment where the HW is able to duplicate and duplicate packets
but does not do it for PTP packets.
It has not been tested within an environment where the HW is able to
forward the PTP packet and correctly update the timing information.
---
v3…v4: https://lore.kernel.org/r/20260429-hsr_ptp-v3-1-afbf8f200f48@linutronix.de
- Removed skb extention. The information within HSR is passed via
  struct hsr_frame_info. Driver with HSR-offloading capabilities need to
  know the HSR mode (HSR or PRP) and parse the skb to decide what needs
  to be done (whether to send on both ports and if adding a header is
  needed).

v2…v3: https://patch.msgid.link/20260309-hsr_ptp-v2-0-798262aad3a4@linutronix.de
- Remove af_packet changes entirely.
- Add an internal header to pass additional information for HSR-PTP
  packets.
- Remove PRP, userland will use slave devices directly.
- Drop all received PTP packets. Userland needs to use the slave device
  for RX.

v1…v2: https://patch.msgid.link/20260204-hsr_ptp-v1-0-b421c69a77da@linutronix.de
- Added PRP support
- skb extention is used instead of extending struct skb_shared_info
- in af_packet
  - packet_sendmsg_spkt() is no longer extended
  - jump labels are used to avoid the overhead if there no socket that
    is using this HSR extension.
---
 include/linux/if_hsr.h | 11 +++++++++
 net/hsr/hsr_device.c   | 49 +++++++++++++++++++++++++++----------
 net/hsr/hsr_forward.c  | 65 +++++++++++++++++++++++++++++++++++++++++---------
 net/hsr/hsr_forward.h  |  3 ++-
 net/hsr/hsr_framereg.h |  2 ++
 net/hsr/hsr_slave.c    | 37 +++++++++++++++++++++-------
 6 files changed, 134 insertions(+), 33 deletions(-)

diff --git a/include/linux/if_hsr.h b/include/linux/if_hsr.h
index f4cf2dd36d193..0dda83511804e 100644
--- a/include/linux/if_hsr.h
+++ b/include/linux/if_hsr.h
@@ -3,6 +3,7 @@
 #define _LINUX_IF_HSR_H_
 
 #include <linux/types.h>
+#include <linux/skbuff.h>
 
 struct net_device;
 
@@ -22,6 +23,16 @@ enum hsr_port_type {
 	HSR_PT_PORTS,	/* This must be the last item in the enum */
 };
 
+#define HSR_INLINE_HDR	0xaf485352
+struct hsr_inline_header {
+	uint8_t tx_port;
+	uint8_t hsr_hdr;
+	uint8_t __pad0[4];
+	uint32_t magic;
+	uint8_t __pad1[2];
+	uint16_t eth_type;
+} __packed;
+
 /* HSR Tag.
  * As defined in IEC-62439-3:2010, the HSR tag is really { ethertype = 0x88FB,
  * path, LSDU_size, sequence Nr }. But we let eth_header() create { h_dest,
diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index 5555b71ab19b5..cab4f7c601257 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -223,24 +223,49 @@ static netdev_features_t hsr_fix_features(struct net_device *dev,
 
 static netdev_tx_t hsr_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 {
+	enum hsr_port_type tx_port = HSR_PT_NONE;
 	struct hsr_priv *hsr = netdev_priv(dev);
 	struct hsr_port *master;
+	bool has_header = false;
 
 	rcu_read_lock();
 	master = hsr_port_get_hsr(hsr, HSR_PT_MASTER);
-	if (master) {
-		skb->dev = master->dev;
-		skb_reset_mac_header(skb);
-		skb_reset_mac_len(skb);
-		spin_lock_bh(&hsr->seqnr_lock);
-		hsr_forward_skb(skb, master);
-		spin_unlock_bh(&hsr->seqnr_lock);
-	} else {
-		dev_core_stats_tx_dropped_inc(dev);
-		dev_kfree_skb_any(skb);
+	if (!master)
+		goto drop;
+
+	skb->dev = master->dev;
+	if (skb->len > ETH_HLEN * 2) {
+		struct hsr_inline_header *hsr_opt;
+
+		BUILD_BUG_ON(sizeof(struct hsr_inline_header) != sizeof(struct ethhdr));
+		hsr_opt = (struct hsr_inline_header *)skb_mac_header(skb);
+		if (hsr_opt->eth_type == htons(ETH_P_1588) &&
+		    hsr_opt->magic == htonl(HSR_INLINE_HDR)) {
+			has_header = hsr_opt->hsr_hdr;
+			tx_port = hsr_opt->tx_port;
+			if (tx_port != HSR_PT_SLAVE_A && tx_port != HSR_PT_SLAVE_B)
+				goto drop;
+
+			skb_pull(skb, sizeof(struct hsr_inline_header));
+			if (has_header)
+				skb_set_network_header(skb, ETH_HLEN + HSR_HLEN);
+			else
+				skb_set_network_header(skb, ETH_HLEN);
+		}
 	}
+
+	skb_reset_mac_header(skb);
+	skb_reset_mac_len(skb);
+	spin_lock_bh(&hsr->seqnr_lock);
+	hsr_forward_skb(skb, master, tx_port, has_header);
+	spin_unlock_bh(&hsr->seqnr_lock);
 	rcu_read_unlock();
 
+	return NETDEV_TX_OK;
+drop:
+	rcu_read_unlock();
+	dev_core_stats_tx_dropped_inc(dev);
+	dev_kfree_skb_any(skb);
 	return NETDEV_TX_OK;
 }
 
@@ -361,7 +386,7 @@ static void send_hsr_supervision_frame(struct hsr_port *port,
 		return;
 	}
 
-	hsr_forward_skb(skb, port);
+	hsr_forward_skb(skb, port, HSR_PT_NONE, false);
 	spin_unlock_bh(&hsr->seqnr_lock);
 	return;
 }
@@ -402,7 +427,7 @@ static void send_prp_supervision_frame(struct hsr_port *master,
 		return;
 	}
 
-	hsr_forward_skb(skb, master);
+	hsr_forward_skb(skb, master, HSR_PT_NONE, false);
 	spin_unlock_bh(&hsr->seqnr_lock);
 }
 
diff --git a/net/hsr/hsr_forward.c b/net/hsr/hsr_forward.c
index 0aca859c88cbb..7f92314934c98 100644
--- a/net/hsr/hsr_forward.c
+++ b/net/hsr/hsr_forward.c
@@ -12,6 +12,7 @@
 #include <linux/skbuff.h>
 #include <linux/etherdevice.h>
 #include <linux/if_vlan.h>
+#include <net/sock.h>
 #include "hsr_main.h"
 #include "hsr_framereg.h"
 
@@ -343,7 +344,10 @@ struct sk_buff *hsr_create_tagged_frame(struct hsr_frame_info *frame,
 		hsr_set_path_id(frame, hsr_ethhdr, port);
 		return skb_clone(frame->skb_hsr, GFP_ATOMIC);
 	} else if (port->dev->features & NETIF_F_HW_HSR_TAG_INS) {
-		return skb_clone(frame->skb_std, GFP_ATOMIC);
+		skb = skb_clone(frame->skb_std, GFP_ATOMIC);
+		if (frame->req_tx_port != HSR_PT_NONE)
+			skb_set_owner_w(skb, frame->skb_std->sk);
+		return skb;
 	}
 
 	/* Create the new skb with enough headroom to fit the HSR tag */
@@ -365,6 +369,15 @@ struct sk_buff *hsr_create_tagged_frame(struct hsr_frame_info *frame,
 	memmove(dst, src, movelen);
 	skb_reset_mac_header(skb);
 
+	if (frame->req_tx_port != HSR_PT_NONE) {
+		/* Packets are bound to a port and the sender may expect time
+		 * information.
+		 */
+		skb_shinfo(skb)->tx_flags = skb_shinfo(frame->skb_std)->tx_flags;
+		skb_shinfo(skb)->tskey = skb_shinfo(frame->skb_std)->tskey;
+		skb_set_owner_w(skb, frame->skb_std->sk);
+	}
+
 	/* skb_put_padto free skb on error and hsr_fill_tag returns NULL in
 	 * that case
 	 */
@@ -420,7 +433,7 @@ static void hsr_deliver_master(struct sk_buff *skb, struct net_device *dev,
 static int hsr_xmit(struct sk_buff *skb, struct hsr_port *port,
 		    struct hsr_frame_info *frame)
 {
-	if (frame->port_rcv->type == HSR_PT_MASTER) {
+	if (frame->port_rcv->type == HSR_PT_MASTER && !frame->has_foreign_header) {
 		hsr_addr_subst_dest(frame->node_src, skb, port);
 
 		/* Address substitution (IEC62439-3 pp 26, 50): replace mac
@@ -519,11 +532,12 @@ bool hsr_drop_frame(struct hsr_frame_info *frame, struct hsr_port *port)
 static void hsr_forward_do(struct hsr_frame_info *frame)
 {
 	struct hsr_port *port;
-	struct sk_buff *skb;
 	bool sent = false;
 
 	hsr_for_each_port(frame->port_rcv->hsr, port) {
 		struct hsr_priv *hsr = port->hsr;
+		struct sk_buff *skb = NULL;
+
 		/* Don't send frame back the way it came */
 		if (port == frame->port_rcv)
 			continue;
@@ -542,6 +556,19 @@ static void hsr_forward_do(struct hsr_frame_info *frame)
 		if ((port->dev->features & NETIF_F_HW_HSR_DUP) && sent)
 			continue;
 
+		/* PTP TX packets have an outgoing port specified */
+		if (frame->req_tx_port != HSR_PT_NONE && frame->req_tx_port != port->type)
+			continue;
+		/* PTP TX packets may already have a HSR header which needs to
+		 * be preserved
+		 */
+		if (frame->has_foreign_header && frame->skb_std) {
+			skb = skb_clone(frame->skb_std, GFP_ATOMIC);
+			if (skb)
+				skb_set_owner_w(skb, frame->skb_std->sk);
+			goto inject_into_stack;
+		}
+
 		/* Don't send frame over port where it has been sent before.
 		 * Also for SAN, this shouldn't be done.
 		 */
@@ -569,6 +596,7 @@ static void hsr_forward_do(struct hsr_frame_info *frame)
 		else
 			skb = hsr->proto_ops->get_untagged_frame(frame, port);
 
+inject_into_stack:
 		if (!skb) {
 			frame->port_rcv->dev->stats.rx_dropped++;
 			continue;
@@ -633,6 +661,13 @@ int hsr_fill_frame_info(__be16 proto, struct sk_buff *skb,
 	struct hsr_port *port = frame->port_rcv;
 	struct hsr_priv *hsr = port->hsr;
 
+	if (frame->has_foreign_header) {
+		frame->skb_std = skb;
+
+		WARN_ON_ONCE(port->type != HSR_PT_MASTER);
+		WARN_ON_ONCE(skb->mac_len < sizeof(struct hsr_ethhdr));
+		return 0;
+	}
 	/* HSRv0 supervisory frames double as a tag so treat them as tagged. */
 	if ((!hsr->prot_version && proto == htons(ETH_P_PRP)) ||
 	    proto == htons(ETH_P_HSR)) {
@@ -674,7 +709,8 @@ int prp_fill_frame_info(__be16 proto, struct sk_buff *skb,
 }
 
 static int fill_frame_info(struct hsr_frame_info *frame,
-			   struct sk_buff *skb, struct hsr_port *port)
+			   struct sk_buff *skb, struct hsr_port *port,
+			   enum hsr_port_type tx_port, bool has_hsr_header)
 {
 	struct hsr_priv *hsr = port->hsr;
 	struct hsr_vlan_ethhdr *vlan_hdr;
@@ -697,10 +733,15 @@ static int fill_frame_info(struct hsr_frame_info *frame,
 	if (port->type == HSR_PT_INTERLINK)
 		n_db = &hsr->proxy_node_db;
 
-	frame->node_src = hsr_get_node(port, n_db, skb,
-				       frame->is_supervision, port->type);
-	if (!frame->node_src)
-		return -1; /* Unknown node and !is_supervision, or no mem */
+	frame->req_tx_port = tx_port;
+	frame->has_foreign_header = has_hsr_header;
+
+	if (!frame->has_foreign_header) {
+		frame->node_src = hsr_get_node(port, n_db, skb,
+					       frame->is_supervision, port->type);
+		if (!frame->node_src)
+			return -1; /* Unknown node and !is_supervision, or no mem */
+	}
 
 	ethhdr = (struct ethhdr *)skb_mac_header(skb);
 	frame->is_vlan = false;
@@ -731,15 +772,17 @@ static int fill_frame_info(struct hsr_frame_info *frame,
 }
 
 /* Must be called holding rcu read lock (because of the port parameter) */
-void hsr_forward_skb(struct sk_buff *skb, struct hsr_port *port)
+void hsr_forward_skb(struct sk_buff *skb, struct hsr_port *port,
+		     enum hsr_port_type tx_port, bool has_hsr_header)
 {
 	struct hsr_frame_info frame;
 
 	rcu_read_lock();
-	if (fill_frame_info(&frame, skb, port) < 0)
+	if (fill_frame_info(&frame, skb, port, tx_port, has_hsr_header) < 0)
 		goto out_drop;
 
-	hsr_register_frame_in(frame.node_src, port, frame.sequence_nr);
+	if (!frame.has_foreign_header)
+		hsr_register_frame_in(frame.node_src, port, frame.sequence_nr);
 	hsr_forward_do(&frame);
 	rcu_read_unlock();
 	/* Gets called for ingress frames as well as egress from master port.
diff --git a/net/hsr/hsr_forward.h b/net/hsr/hsr_forward.h
index 206636750b300..e64b0358907a9 100644
--- a/net/hsr/hsr_forward.h
+++ b/net/hsr/hsr_forward.h
@@ -13,7 +13,8 @@
 #include <linux/netdevice.h>
 #include "hsr_main.h"
 
-void hsr_forward_skb(struct sk_buff *skb, struct hsr_port *port);
+void hsr_forward_skb(struct sk_buff *skb, struct hsr_port *port,
+		     enum hsr_port_type tx_port, bool has_hsr_header);
 struct sk_buff *prp_create_tagged_frame(struct hsr_frame_info *frame,
 					struct hsr_port *port);
 struct sk_buff *hsr_create_tagged_frame(struct hsr_frame_info *frame,
diff --git a/net/hsr/hsr_framereg.h b/net/hsr/hsr_framereg.h
index c65ecb9257348..7aeb59beb7fd8 100644
--- a/net/hsr/hsr_framereg.h
+++ b/net/hsr/hsr_framereg.h
@@ -27,6 +27,8 @@ struct hsr_frame_info {
 	bool is_local_dest;
 	bool is_local_exclusive;
 	bool is_from_san;
+	bool has_foreign_header;
+	enum hsr_port_type req_tx_port;
 };
 
 void hsr_del_self_node(struct hsr_priv *hsr);
diff --git a/net/hsr/hsr_slave.c b/net/hsr/hsr_slave.c
index d9af9e65f72f0..7cc0641dbd137 100644
--- a/net/hsr/hsr_slave.c
+++ b/net/hsr/hsr_slave.c
@@ -44,8 +44,7 @@ static rx_handler_result_t hsr_handle_frame(struct sk_buff **pskb)
 
 	if (hsr_addr_is_self(port->hsr, eth_hdr(skb)->h_source)) {
 		/* Directly kill frames sent by ourselves */
-		kfree_skb(skb);
-		goto finish_consume;
+		goto finish_free_consume;
 	}
 
 	/* For HSR, only tagged frames are expected (unless the device offloads
@@ -64,10 +63,8 @@ static rx_handler_result_t hsr_handle_frame(struct sk_buff **pskb)
 	skb_reset_mac_header(skb);
 	if ((!hsr->prot_version && protocol == htons(ETH_P_PRP)) ||
 	    protocol == htons(ETH_P_HSR)) {
-		if (!pskb_may_pull(skb, ETH_HLEN + HSR_HLEN)) {
-			kfree_skb(skb);
-			goto finish_consume;
-		}
+		if (!pskb_may_pull(skb, ETH_HLEN + HSR_HLEN))
+			goto finish_free_consume;
 
 		skb_set_network_header(skb, ETH_HLEN + HSR_HLEN);
 	}
@@ -78,13 +75,35 @@ static rx_handler_result_t hsr_handle_frame(struct sk_buff **pskb)
 	 */
 	if (port->type == HSR_PT_INTERLINK) {
 		spin_lock_bh(&hsr->seqnr_lock);
-		hsr_forward_skb(skb, port);
+		hsr_forward_skb(skb, port, HSR_PT_NONE, false);
 		spin_unlock_bh(&hsr->seqnr_lock);
 	} else {
-		hsr_forward_skb(skb, port);
+		struct hsr_ethhdr *hsr_ethhdr;
+
+		/* PTP packets are not supposed to be forwarded via HSR as-is.
+		 * The latency introduced by forwarding renders the time
+		 * information useless. Userland needs to capture the packet on
+		 * the original interface instead of hsr.
+		 */
+		if ((!hsr->prot_version && protocol == htons(ETH_P_PRP)) ||
+		    protocol == htons(ETH_P_HSR)) {
+			/* HSR */
+			hsr_ethhdr = (struct hsr_ethhdr *)skb_mac_header(skb);
+			if (hsr_ethhdr->hsr_tag.encap_proto == htons(ETH_P_1588))
+				goto finish_free_consume;
+		} else {
+			/* PRP */
+			if (protocol == htons(ETH_P_1588))
+				goto finish_free_consume;
+		}
+
+		hsr_forward_skb(skb, port, HSR_PT_NONE, false);
 	}
 
-finish_consume:
+	return RX_HANDLER_CONSUMED;
+
+finish_free_consume:
+	kfree_skb(skb);
 	return RX_HANDLER_CONSUMED;
 
 finish_pass:

---
base-commit: 9e0898f1c0f134c6bad146ca8578f73c3e40ac0a
change-id: 20260204-hsr_ptp-1f6380f1d35f

Best regards,
-- 
Sebastian Andrzej Siewior <bigeasy@linutronix.de>

^ permalink raw reply related

* [PATCH iwl-net v2] ice: support SBQ posted writes with non-posted support for CGU
From: Przemyslaw Korba @ 2026-05-08 10:20 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, anthony.l.nguyen, przemyslaw.kitszel, aleksandr.loktionov,
	arkadiusz.kubalewski, Przemyslaw Korba

From: Karol Kolacinski <karol.kolacinski@intel.com>

Sideband queue (SBQ) is a HW queue with very short completion time. All
SBQ writes were posted by default, which means that the driver did not
have to wait for completion from the neighbor device, because there was
none. This introduced unnecessary delays, where only those delays were
"ensuring" that the command is "completed" and this was a potential race
condition.

Add the possibility to perform non-posted writes where it's necessary to
wait for completion, instead of relying on fake completion from the FW,
where only the delays are guarding the writes.

Flush the SBQ by reading address 0 from the PHY 0 before issuing SYNC
command to ensure that writes to all PHYs were completed and skip SBQ
message completion if it's posted.

To analyze if delays are gone, look for and compare time spent in
ice_sq_send_cmd — posted writes should return immediately after the wr32.
That can be done for example by adjusting phc time with phc_ctl on E830
device, for less than 2 seconds to use this new mechanism. Without it,
command below will fail.

Reproduction steps:
phc_ctl eth13 adj 1
phc_ctl[4478170.994]: adjusted clock by 1.000000 seconds

Check trace for timing for comparisions:
echo ice_sbq_send_cmd > /sys/kernel/debug/tracing/set_ftrace_filter
echo function_graph > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace

Tested on:
  - Intel E830 NIC (FW version 1.00)
  - Kernel 6.19.0+

Fixes: 8f5ee3c477a8 ("ice: add support for sideband messages")
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
---
v2:
- fix minor issues for E810 devices
v1:
https://lore.kernel.org/intel-wired-lan/20260507135110.809367-1-przemyslaw.korba@intel.com/
---
 drivers/net/ethernet/intel/ice/ice_common.c   | 21 ++++--
 drivers/net/ethernet/intel/ice/ice_controlq.c |  4 ++
 drivers/net/ethernet/intel/ice/ice_controlq.h |  1 +
 drivers/net/ethernet/intel/ice/ice_ptp_hw.c   | 65 +++++++++++--------
 drivers/net/ethernet/intel/ice/ice_sbq_cmd.h  |  5 +-
 5 files changed, 63 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 0ec65007d672..d5007f6c9d6e 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -1762,6 +1762,7 @@ int ice_sbq_rw_reg(struct ice_hw *hw, struct ice_sbq_msg_input *in, u16 flags)
 {
 	struct ice_sbq_cmd_desc desc = {0};
 	struct ice_sbq_msg_req msg = {0};
+	struct ice_sq_cd cd = {};
 	u16 msg_len;
 	int status;
 
@@ -1774,19 +1775,29 @@ int ice_sbq_rw_reg(struct ice_hw *hw, struct ice_sbq_msg_input *in, u16 flags)
 	msg.msg_addr_low = cpu_to_le16(in->msg_addr_low);
 	msg.msg_addr_high = cpu_to_le32(in->msg_addr_high);
 
-	if (in->opcode)
+	switch (in->opcode) {
+	case ice_sbq_msg_wr_p:
+	case ice_sbq_msg_wr_np:
 		msg.data = cpu_to_le32(in->data);
-	else
+		break;
+	case ice_sbq_msg_rd:
 		/* data read comes back in completion, so shorten the struct by
 		 * sizeof(msg.data)
 		 */
 		msg_len -= sizeof(msg.data);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	cd.posted = in->opcode == ice_sbq_msg_wr_p;
 
 	desc.flags = cpu_to_le16(flags);
 	desc.opcode = cpu_to_le16(ice_sbq_opc_neigh_dev_req);
 	desc.param0.cmd_len = cpu_to_le16(msg_len);
-	status = ice_sbq_send_cmd(hw, &desc, &msg, msg_len, NULL);
-	if (!status && !in->opcode)
+	status = ice_sbq_send_cmd(hw, &desc, &msg, msg_len, &cd);
+
+	if (!status && in->opcode == ice_sbq_msg_rd)
 		in->data = le32_to_cpu
 			(((struct ice_sbq_msg_cmpl *)&msg)->data);
 	return status;
@@ -6557,7 +6568,7 @@ int ice_write_cgu_reg(struct ice_hw *hw, u32 addr, u32 val)
 {
 	struct ice_sbq_msg_input cgu_msg = {
 		.dest_dev = ice_get_dest_cgu(hw),
-		.opcode = ice_sbq_msg_wr,
+		.opcode = ice_sbq_msg_wr_np,
 		.msg_addr_low = addr,
 		.data = val
 	};
diff --git a/drivers/net/ethernet/intel/ice/ice_controlq.c b/drivers/net/ethernet/intel/ice/ice_controlq.c
index dcb837cadd18..a6008dc77fa4 100644
--- a/drivers/net/ethernet/intel/ice/ice_controlq.c
+++ b/drivers/net/ethernet/intel/ice/ice_controlq.c
@@ -1086,6 +1086,10 @@ ice_sq_send_cmd(struct ice_hw *hw, struct ice_ctl_q_info *cq,
 	wr32(hw, cq->sq.tail, cq->sq.next_to_use);
 	ice_flush(hw);
 
+	/* If the message is posted, don't wait for completion. */
+	if (cd && cd->posted)
+		goto sq_send_command_error;
+
 	/* Wait for the command to complete. If it finishes within the
 	 * timeout, copy the descriptor back to temp.
 	 */
diff --git a/drivers/net/ethernet/intel/ice/ice_controlq.h b/drivers/net/ethernet/intel/ice/ice_controlq.h
index 788040dd662e..c50d6fcbacba 100644
--- a/drivers/net/ethernet/intel/ice/ice_controlq.h
+++ b/drivers/net/ethernet/intel/ice/ice_controlq.h
@@ -77,6 +77,7 @@ struct ice_ctl_q_ring {
 /* sq transaction details */
 struct ice_sq_cd {
 	struct libie_aq_desc *wb_desc;
+	u8 posted : 1;
 };
 
 /* rq event information */
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
index 2c18e16fe053..698e65c1209c 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
@@ -352,6 +352,17 @@ void ice_ptp_src_cmd(struct ice_hw *hw, enum ice_ptp_tmr_cmd cmd)
 static void ice_ptp_exec_tmr_cmd(struct ice_hw *hw)
 {
 	struct ice_pf *pf = container_of(hw, struct ice_pf, hw);
+	struct ice_sbq_msg_input msg = {
+		.dest_dev = ice_sbq_dev_phy_0,
+		.opcode = ice_sbq_msg_rd,
+	};
+	int err;
+
+	if (hw->mac_type != ICE_MAC_E810) {
+		err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
+		if (err)
+			dev_warn(ice_hw_to_dev(hw), "Failed to flush SBQ: %d\n", err);
+	}
 
 	if (!ice_is_primary(hw))
 		hw = ice_get_primary_hw(pf);
@@ -442,7 +453,7 @@ static int ice_write_phy_eth56g(struct ice_hw *hw, u8 port, u32 addr, u32 val)
 {
 	struct ice_sbq_msg_input msg = {
 		.dest_dev = ice_ptp_get_dest_dev_e825(hw, port),
-		.opcode = ice_sbq_msg_wr,
+		.opcode = ice_sbq_msg_wr_p,
 		.msg_addr_low = lower_16_bits(addr),
 		.msg_addr_high = upper_16_bits(addr),
 		.data = val
@@ -2504,11 +2515,12 @@ static bool ice_is_40b_phy_reg_e82x(u16 low_addr, u16 *high_addr)
 static int
 ice_read_phy_reg_e82x(struct ice_hw *hw, u8 port, u16 offset, u32 *val)
 {
-	struct ice_sbq_msg_input msg = {0};
+	struct ice_sbq_msg_input msg = {
+		.opcode = ice_sbq_msg_rd,
+	};
 	int err;
 
 	ice_fill_phy_msg_e82x(hw, &msg, port, offset);
-	msg.opcode = ice_sbq_msg_rd;
 
 	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
 	if (err) {
@@ -2581,12 +2593,13 @@ ice_read_64b_phy_reg_e82x(struct ice_hw *hw, u8 port, u16 low_addr, u64 *val)
 static int
 ice_write_phy_reg_e82x(struct ice_hw *hw, u8 port, u16 offset, u32 val)
 {
-	struct ice_sbq_msg_input msg = {0};
+	struct ice_sbq_msg_input msg = {
+		.opcode = ice_sbq_msg_wr_p,
+		.data = val
+	};
 	int err;
 
 	ice_fill_phy_msg_e82x(hw, &msg, port, offset);
-	msg.opcode = ice_sbq_msg_wr;
-	msg.data = val;
 
 	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
 	if (err) {
@@ -2740,15 +2753,15 @@ static int ice_fill_quad_msg_e82x(struct ice_hw *hw,
 int
 ice_read_quad_reg_e82x(struct ice_hw *hw, u8 quad, u16 offset, u32 *val)
 {
-	struct ice_sbq_msg_input msg = {0};
+	struct ice_sbq_msg_input msg = {
+		.opcode = ice_sbq_msg_rd,
+	};
 	int err;
 
 	err = ice_fill_quad_msg_e82x(hw, &msg, quad, offset);
 	if (err)
 		return err;
 
-	msg.opcode = ice_sbq_msg_rd;
-
 	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
 	if (err) {
 		ice_debug(hw, ICE_DBG_PTP, "Failed to send message to PHY, err %d\n",
@@ -2774,16 +2787,16 @@ ice_read_quad_reg_e82x(struct ice_hw *hw, u8 quad, u16 offset, u32 *val)
 int
 ice_write_quad_reg_e82x(struct ice_hw *hw, u8 quad, u16 offset, u32 val)
 {
-	struct ice_sbq_msg_input msg = {0};
+	struct ice_sbq_msg_input msg = {
+		.opcode = ice_sbq_msg_wr_p,
+		.data = val
+	};
 	int err;
 
 	err = ice_fill_quad_msg_e82x(hw, &msg, quad, offset);
 	if (err)
 		return err;
 
-	msg.opcode = ice_sbq_msg_wr;
-	msg.data = val;
-
 	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
 	if (err) {
 		ice_debug(hw, ICE_DBG_PTP, "Failed to send message to PHY, err %d\n",
@@ -4450,14 +4463,14 @@ static void ice_ptp_init_phy_e82x(struct ice_ptp_hw *ptp)
  */
 static int ice_read_phy_reg_e810(struct ice_hw *hw, u32 addr, u32 *val)
 {
-	struct ice_sbq_msg_input msg = {0};
+	struct ice_sbq_msg_input msg = {
+		.dest_dev = ice_sbq_dev_phy_0,
+		.opcode = ice_sbq_msg_rd,
+		.msg_addr_low = lower_16_bits(addr),
+		.msg_addr_high = upper_16_bits(addr),
+	};
 	int err;
 
-	msg.msg_addr_low = lower_16_bits(addr);
-	msg.msg_addr_high = upper_16_bits(addr);
-	msg.opcode = ice_sbq_msg_rd;
-	msg.dest_dev = ice_sbq_dev_phy_0;
-
 	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
 	if (err) {
 		ice_debug(hw, ICE_DBG_PTP, "Failed to send message to PHY, err %d\n",
@@ -4480,15 +4493,15 @@ static int ice_read_phy_reg_e810(struct ice_hw *hw, u32 addr, u32 *val)
  */
 static int ice_write_phy_reg_e810(struct ice_hw *hw, u32 addr, u32 val)
 {
-	struct ice_sbq_msg_input msg = {0};
+	struct ice_sbq_msg_input msg = {
+		.dest_dev = ice_sbq_dev_phy_0,
+		.opcode = ice_sbq_msg_wr_p,
+		.msg_addr_low = lower_16_bits(addr),
+		.msg_addr_high = upper_16_bits(addr),
+		.data = val
+	};
 	int err;
 
-	msg.msg_addr_low = lower_16_bits(addr);
-	msg.msg_addr_high = upper_16_bits(addr);
-	msg.opcode = ice_sbq_msg_wr;
-	msg.dest_dev = ice_sbq_dev_phy_0;
-	msg.data = val;
-
 	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
 	if (err) {
 		ice_debug(hw, ICE_DBG_PTP, "Failed to send message to PHY, err %d\n",
diff --git a/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h b/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h
index 21bb861febbf..86a143ebf089 100644
--- a/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h
@@ -54,8 +54,9 @@ enum ice_sbq_dev_id {
 };
 
 enum ice_sbq_msg_opcode {
-	ice_sbq_msg_rd	= 0x00,
-	ice_sbq_msg_wr	= 0x01
+	ice_sbq_msg_rd		= 0x00,
+	ice_sbq_msg_wr_p	= 0x01,
+	ice_sbq_msg_wr_np	= 0x02,
 };
 
 #define ICE_SBQ_MSG_FLAGS	0x40

base-commit: 80b47e88f7ead00b0795e9f2833f1d0cafe11d90
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v2] mptcp: do not drop partial packets
From: Paolo Abeni @ 2026-05-08 10:23 UTC (permalink / raw)
  To: Shardul Bankar, matttbe, martineau
  Cc: geliang, davem, edumazet, kuba, horms, netdev, mptcp,
	linux-kernel, janak, kalpan.jani, Shardul Bankar
In-Reply-To: <20260422143931.43281-1-shardul.b@mpiricsoftware.com>

On 4/22/26 4:39 PM, Shardul Bankar wrote:
> When a packet arrives with map_seq < ack_seq < end_seq, the beginning
> of the packet has already been acknowledged but the end contains new
> data.  Currently the entire packet is dropped as "old data," forcing
> the sender to retransmit.
> 
> Instead, skip the already-acked bytes by adjusting the skb offset and
> enqueue only the new portion.  Update bytes_received and ack_seq to
> reflect the new data consumed.
> 
> A previous attempt at this fix (commit 1d2ce718811a ("mptcp: do not
> drop partial packets"), reverted in commit bf39160c4218 ("Revert
> "mptcp: do not drop partial packets"")) also added a zero-window
> check and changed rcv_wnd_sent initialization, which caused test
> regressions.  This version addresses only the partial packet handling
> without modifying receive window accounting.
> 
> Fixes: ab174ad8ef76 ("mptcp: move ooo skbs into msk out of order queue.")
> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/600
> Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>

It would be great if you could send a v3 addressing the AI comment.

If you don't have time or capacity, please LMK, I can send v3 with your
SoB and the needed editing.

Thanks,

Paolo


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox