* [PATCH net] vsock/virtio: fix potential unbounded skb queue
@ 2026-04-30 12:26 Eric Dumazet
2026-05-05 2:20 ` patchwork-bot+netdevbpf
2026-05-05 13:51 ` Stefano Garzarella
0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2026-04-30 12:26 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, netdev, eric.dumazet, Eric Dumazet, Arseniy Krasnov,
Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization
virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.

virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.

If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.

Fix this by estimating the skb metadata size:

  (Number of skbs in the queue) * SKB_TRUESIZE(0)
Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux.dev
---
net/vmw_vsock/virtio_transport_common.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 416d533f493d7b07e9c77c43f741d28cfcd0953e..9b8014516f4fb1130ae184635fbba4dfee58bd64 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
 					u32 len)
 {
-	if (vvs->rx_bytes + len > vvs->buf_alloc)
+	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+
+	if (skb_overhead + vvs->rx_bytes + len > vvs->buf_alloc)
 		return false;
 
 	vvs->rx_bytes += len;
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-04-30 12:26 [PATCH net] vsock/virtio: fix potential unbounded skb queue Eric Dumazet
@ 2026-05-05  2:20 ` patchwork-bot+netdevbpf
  2026-05-05 13:51 ` Stefano Garzarella
  1 sibling, 0 replies; 26+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-05-05  2:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, horms, netdev, eric.dumazet, AVKrasnov,
      stefanha, sgarzare, mst, jasowang, xuanzhuo, eperezma, kvm,
      virtualization

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 30 Apr 2026 12:26:52 +0000 you wrote:
> virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
>
> virtio_transport_recv_enqueue() skips coalescing for packets
> with VIRTIO_VSOCK_SEQ_EOM.
>
> If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
> a very large number of packets can be queued
> because vvs->rx_bytes stays at 0.
>
> [...]

Here is the summary with links:
  - [net] vsock/virtio: fix potential unbounded skb queue
    https://git.kernel.org/netdev/net/c/059b7dbd20a6

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-04-30 12:26 [PATCH net] vsock/virtio: fix potential unbounded skb queue Eric Dumazet
  2026-05-05  2:20 ` patchwork-bot+netdevbpf
@ 2026-05-05 13:51 ` Stefano Garzarella
  2026-05-05 14:14   ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread
From: Stefano Garzarella @ 2026-05-05 13:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
      netdev, eric.dumazet, Arseniy Krasnov, Stefan Hajnoczi,
      Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm,
      virtualization

On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote:
>virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
>
>virtio_transport_recv_enqueue() skips coalescing for packets
>with VIRTIO_VSOCK_SEQ_EOM.
>
>If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
>a very large number of packets can be queued
>because vvs->rx_bytes stays at 0.
>
>Fix this by estimating the skb metadata size:
>
>  (Number of skbs in the queue) * SKB_TRUESIZE(0)
>
>[...]
>
>@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
> 					u32 len)
> {
>-	if (vvs->rx_bytes + len > vvs->buf_alloc)
>+	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
>+
>+	if (skb_overhead + vvs->rx_bytes + len > vvs->buf_alloc)
> 		return false;

I'm not sure about this fix, I mean that maybe this is incomplete.
In virtio-vsock, there is a credit mechanism between the two peers:
https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003

This takes only the payload into account, so it’s true that this problem
exists; however, perhaps we should also inform the other peer of a lower
credit balance, otherwise the other peer will believe it has much more
credit than it actually does, send a large payload, and then the packet
will be discarded and the data lost (there are no retransmissions,
etc.).

Thanks,
Stefano

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-05 13:51 ` Stefano Garzarella
@ 2026-05-05 14:14   ` Eric Dumazet
  2026-05-05 16:11     ` Stefano Garzarella
  2026-05-06 15:15     ` Michael S. Tsirkin
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2026-05-05 14:14 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
      netdev, eric.dumazet, Arseniy Krasnov, Stefan Hajnoczi,
      Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm,
      virtualization

On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote:
> >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
> >
> >[...]
> >
> >-	if (vvs->rx_bytes + len > vvs->buf_alloc)
> >+	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
> >+
> >+	if (skb_overhead + vvs->rx_bytes + len > vvs->buf_alloc)
> > 		return false;
>
> I'm not sure about this fix, I mean that maybe this is incomplete.
> In virtio-vsock, there is a credit mechanism between the two peers:
> https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003
>
> This takes only the payload into account, so it’s true that this problem
> exists; however, perhaps we should also inform the other peer of a lower
> credit balance, otherwise the other peer will believe it has much more
> credit than it actually does, send a large payload, and then the packet
> will be discarded and the data lost (there are no retransmissions,
> etc.).

I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff
state to account credit") and find a better fix then?

There is always a discrepancy between skb->len and skb->truesize.
You will not be able to announce a 1MB window, and accept one million
skbs of 1 byte each.

This kind of contract is broken.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-05 14:14   ` Eric Dumazet
@ 2026-05-05 16:11     ` Stefano Garzarella
  2026-05-05 16:37       ` Bobby Eshleman
  2026-05-06 15:37       ` Michael S. Tsirkin
  1 sibling, 2 replies; 26+ messages in thread
From: Stefano Garzarella @ 2026-05-05 16:11 UTC (permalink / raw)
  To: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
      Michael S. Tsirkin
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
      netdev, eric.dumazet, Arseniy Krasnov, Stefan Hajnoczi,
      Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm,
      virtualization

On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote:
>On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote:
>> >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
>> >
>> >[...]
>>
>> I'm not sure about this fix, I mean that maybe this is incomplete.
>> In virtio-vsock, there is a credit mechanism between the two peers:
>> https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003
>>
>> This takes only the payload into account, so it’s true that this problem
>> exists; however, perhaps we should also inform the other peer of a lower
>> credit balance, otherwise the other peer will believe it has much more
>> credit than it actually does, send a large payload, and then the packet
>> will be discarded and the data lost (there are no retransmissions,
>> etc.).
>
>I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff
>state to account credit")
>and find a better fix then?

IIRC the same issue was there before the commit fixed by that one (commit
71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")), so
not sure about reverting it TBH.

CCing Arseniy and Bobby.

>
>There is always a discrepancy between skb->len and skb->truesize.
>You will not be able to announce a 1MB window, and accept one milliion
>skb of 1-byte each.
>
>This kind of contract is broken.
>

Yep, I agree, but before we start discarding data (and losing it), IMHO we
should at least inform the other peer that we're out of space.

@Stefan, @Michael, do you think we can do something in the spec to avoid
this issue and in some way take into account also the metadata in the
credit. I mean to avoid the 1-byte packets flooding.

Thanks,
Stefano

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-05 16:11     ` Stefano Garzarella
@ 2026-05-05 16:37       ` Bobby Eshleman
  2026-05-06  9:50         ` Arseniy Krasnov
  1 sibling, 1 reply; 26+ messages in thread
From: Bobby Eshleman @ 2026-05-05 16:37 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
      Michael S. Tsirkin, David S . Miller, Jakub Kicinski, Paolo Abeni,
      Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang,
      Xuan Zhuo, Eugenio Pérez, kvm, virtualization

On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote:
> On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote:
> > [...]
> >
> > There is always a discrepancy between skb->len and skb->truesize.
> > You will not be able to announce a 1MB window, and accept one milliion
> > skb of 1-byte each.
> >
> > This kind of contract is broken.
> >
>
> Yep, I agree, but before we start discarding data (and losing it), IMHO we
> should at least inform the other peer that we're out of space.
>
> @Stefan, @Michael, do you think we can do something in the spec to avoid
> this issue and in some way take into account also the metadata in the
> credit. I mean to avoid the 1-byte packets flooding.
>
> Thanks,
> Stefano
>

Indeed the old pre-fix skb code would have the same issue.

I can't think of any way around this without extending the spec.

Best,
Bobby

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-05 16:37       ` Bobby Eshleman
@ 2026-05-06  9:50         ` Arseniy Krasnov
  2026-05-06 14:00           ` Stefano Garzarella
  0 siblings, 1 reply; 26+ messages in thread
From: Arseniy Krasnov @ 2026-05-06  9:50 UTC (permalink / raw)
  To: Bobby Eshleman, Stefano Garzarella
  Cc: Eric Dumazet, Bobby Eshleman, Stefan Hajnoczi, Michael S. Tsirkin,
      David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
      netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
      Eugenio Pérez, kvm, virtualization

05.05.2026 19:37, Bobby Eshleman wrote:
> On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote:
>> [...]
>>
>> CCing Arseniy and Bobby.

Thanks!

>>
>>> There is always a discrepancy between skb->len and skb->truesize.
>>> You will not be able to announce a 1MB window, and accept one milliion
>>> skb of 1-byte each.
>>>
>>> This kind of contract is broken.
>>>
>>
>> Yep, I agree, but before we start discarding data (and losing it), IMHO we
>> should at least inform the other peer that we're out of space.
>>
>> @Stefan, @Michael, do you think we can do something in the spec to avoid
>> this issue and in some way take into account also the metadata in the
>> credit. I mean to avoid the 1-byte packets flooding.
>>
>> Thanks,
>> Stefano
>>
>
> Indeed the old pre-fix skb code would have the same issue.
>
> I can't think of any way around this without extending the spec.

Hi, thanks. I agree with Bobby: accounting for metadata (e.g. the skb size
here) was left out of the credit logic "by design", since the other side of
the data exchange knows nothing about it. The same situation also existed
before the skb implementation was added by Bobby. So it looks like the spec
may need to be updated.

Thanks!

>
> Best,
> Bobby

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-06  9:50         ` Arseniy Krasnov
@ 2026-05-06 14:00           ` Stefano Garzarella
  0 siblings, 0 replies; 26+ messages in thread
From: Stefano Garzarella @ 2026-05-06 14:00 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, Eric Dumazet, Bobby Eshleman, Stefan Hajnoczi,
      Michael S. Tsirkin, David S . Miller, Jakub Kicinski, Paolo Abeni,
      Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang,
      Xuan Zhuo, Eugenio Pérez, kvm, virtualization

On Wed, May 06, 2026 at 12:50:04PM +0300, Arseniy Krasnov wrote:
>05.05.2026 19:37, Bobby Eshleman wrote:
>> On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote:
>>> [...]
>>
>> Indeed the old pre-fix skb code would have the same issue.
>>
>> I can't think of any way around this without extending the spec.
>
>Hi, thanks, agree with Bobby, that accounting metadata (e.g. skb size here)
>was not implemented "by design" in credit logic - another side of data
>exchange knows nothing about that. Also the same situation was before skb
>implementation was added by Bobby. So looks like need to update spec may be.
>

Even if we change the specifications, we still need to work with older
devices, so we should find a solution for this as well.

My main concern is data loss, so I'm considering the following options:

1. Notify the other peer of a smaller buf_alloc from the start, leave some
   room for overhead, and when it's running out, notify them that
   buf_alloc = 0. This way, the peer realizes it can’t send anything else.

2. Or update buf_alloc each time by removing the overhead, similar to
   what’s currently done in virtio_transport_inc_rx_pkt(), but also do it
   in virtio_transport_inc_tx_pkt().

As I said, IMO this patch alone is incomplete; we need to communicate with
the peer somehow regarding space. I don’t think including the overhead in
fwd_cnt is spec compliant, since the other peer has no idea how much
overhead is needed, but reducing buf_alloc should be okay, even though I’m
concerned about packets in flight.

As a quick fix, I think option 2 might be the easiest; I'll run some tests
and send over a patch.

But in the long run, I think we absolutely need to improve memory
management in vsock, perhaps by avoiding custom solutions.

Thanks,
Stefano

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-05 16:11 ` Stefano Garzarella 2026-05-05 16:37 ` Bobby Eshleman @ 2026-05-06 15:37 ` Michael S. Tsirkin 2026-05-07 9:09 ` Stefano Garzarella 1 sibling, 1 reply; 26+ messages in thread From: Michael S. Tsirkin @ 2026-05-06 15:37 UTC (permalink / raw) To: Stefano Garzarella Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote: > On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote: > > On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > > > > > On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote: > > > >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. > > > > > > > >virtio_transport_recv_enqueue() skips coalescing for packets > > > >with VIRTIO_VSOCK_SEQ_EOM. > > > > > > > >If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, > > > >a very large number of packets can be queued > > > >because vvs->rx_bytes stays at 0. > > > > > > > >Fix this by estimating the skb metadata size: > > > > > > > > (Number of skbs in the queue) * SKB_TRUESIZE(0) > > > > > > > >Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") > > > >Signed-off-by: Eric Dumazet <edumazet@google.com> > > > >Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> > > > >Cc: Stefan Hajnoczi <stefanha@redhat.com> > > > >Cc: Stefano Garzarella <sgarzare@redhat.com> > > > >Cc: "Michael S. 
Tsirkin" <mst@redhat.com> > > > >Cc: Jason Wang <jasowang@redhat.com> > > > >Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > >Cc: "Eugenio Pérez" <eperezma@redhat.com> > > > >Cc: kvm@vger.kernel.org > > > >Cc: virtualization@lists.linux.dev > > > >--- > > > > net/vmw_vsock/virtio_transport_common.c | 4 +++- > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c > > > >index 416d533f493d7b07e9c77c43f741d28cfcd0953e..9b8014516f4fb1130ae184635fbba4dfee58bd64 100644 > > > >--- a/net/vmw_vsock/virtio_transport_common.c > > > >+++ b/net/vmw_vsock/virtio_transport_common.c > > > >@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, > > > > static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, > > > > u32 len) > > > > { > > > >- if (vvs->buf_used + len > vvs->buf_alloc) > > > >+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0); > > > >+ > > > >+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc) > > > > return false; > > > > > > I'm not sure about this fix, I mean that maybe this is incomplete. > > > In virtio-vsock, there is a credit mechanism between the two peers: > > > https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003 > > > > > > This takes only the payload into account, so it’s true that this problem > > > exists; however, perhaps we should also inform the other peer of a lower > > > credit balance, otherwise the other peer will believe it has much more > > > credit than it actually does, send a large payload, and then the packet > > > will be discarded and the data lost (there are no retransmissions, > > > etc.). > > > > I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff > > state to account credit") > > and find a better fix then? 
> > IIRC the same issue was there before the commit fixed by that one (commit > 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")), so > not sure about reverting it TBH. > > CCing Arseniy and Bobby. > > > > > There is always a discrepancy between skb->len and skb->truesize. > > You will not be able to announce a 1MB window, and accept one milliion > > skb of 1-byte each. > > > > This kind of contract is broken. > > > > Yep, I agree, but before we start discarding data (and losing it), IMHO we > should at least inform the other peer that we're out of space. > > @Stefan, @Michael, do you think we can do something in the spec to avoid > this issue and in some way take into account also the metadata in the > credit. I mean to avoid the 1-byte packets flooding. > > Thanks, > Stefano Why do we need the metadata? Just don't keep it around if you begin running low on memory. -- MST ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-06 15:37 ` Michael S. Tsirkin @ 2026-05-07 9:09 ` Stefano Garzarella 2026-05-07 11:45 ` Michael S. Tsirkin 0 siblings, 1 reply; 26+ messages in thread From: Stefano Garzarella @ 2026-05-07 9:09 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Wed, May 06, 2026 at 11:37:45AM -0400, Michael S. Tsirkin wrote: >On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote: >> On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote: >> > On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote: >> > > >> > > On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote: >> > > >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. >> > > > >> > > >virtio_transport_recv_enqueue() skips coalescing for packets >> > > >with VIRTIO_VSOCK_SEQ_EOM. >> > > > >> > > >If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, >> > > >a very large number of packets can be queued >> > > >because vvs->rx_bytes stays at 0. >> > > > >> > > >Fix this by estimating the skb metadata size: >> > > > >> > > > (Number of skbs in the queue) * SKB_TRUESIZE(0) >> > > > >> > > >Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") >> > > >Signed-off-by: Eric Dumazet <edumazet@google.com> >> > > >Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> >> > > >Cc: Stefan Hajnoczi <stefanha@redhat.com> >> > > >Cc: Stefano Garzarella <sgarzare@redhat.com> >> > > >Cc: "Michael S. 
Tsirkin" <mst@redhat.com> >> > > >Cc: Jason Wang <jasowang@redhat.com> >> > > >Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >> > > >Cc: "Eugenio Pérez" <eperezma@redhat.com> >> > > >Cc: kvm@vger.kernel.org >> > > >Cc: virtualization@lists.linux.dev >> > > >--- >> > > > net/vmw_vsock/virtio_transport_common.c | 4 +++- >> > > > 1 file changed, 3 insertions(+), 1 deletion(-) >> > > > >> > > >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c >> > > >index 416d533f493d7b07e9c77c43f741d28cfcd0953e..9b8014516f4fb1130ae184635fbba4dfee58bd64 100644 >> > > >--- a/net/vmw_vsock/virtio_transport_common.c >> > > >+++ b/net/vmw_vsock/virtio_transport_common.c >> > > >@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, >> > > > static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, >> > > > u32 len) >> > > > { >> > > >- if (vvs->buf_used + len > vvs->buf_alloc) >> > > >+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0); >> > > >+ >> > > >+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc) >> > > > return false; >> > > >> > > I'm not sure about this fix, I mean that maybe this is incomplete. >> > > In virtio-vsock, there is a credit mechanism between the two peers: >> > > https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003 >> > > >> > > This takes only the payload into account, so it’s true that this problem >> > > exists; however, perhaps we should also inform the other peer of a lower >> > > credit balance, otherwise the other peer will believe it has much more >> > > credit than it actually does, send a large payload, and then the packet >> > > will be discarded and the data lost (there are no retransmissions, >> > > etc.). >> > >> > I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff >> > state to account credit") >> > and find a better fix then? 
>> >> IIRC the same issue was there before the commit fixed by that one (commit >> 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")), so >> not sure about reverting it TBH. >> >> CCing Arseniy and Bobby. >> >> > >> > There is always a discrepancy between skb->len and skb->truesize. >> > You will not be able to announce a 1MB window, and accept one milliion >> > skb of 1-byte each. >> > >> > This kind of contract is broken. >> > >> >> Yep, I agree, but before we start discarding data (and losing it), IMHO we >> should at least inform the other peer that we're out of space. >> >> @Stefan, @Michael, do you think we can do something in the spec to avoid >> this issue and in some way take into account also the metadata in the >> credit. I mean to avoid the 1-byte packets flooding. >> >> Thanks, >> Stefano > >Why do we need the metadata? Just don't keep it around if you begin >running low on memory. I don't think removing the skbuffs will be easy; we added them for eBPF, zero-copy, and seqpacket as well. For now, we're already doing something: merging the skbuffs if they don't have EOM set. As a quick fix, I'm thinking of reducing the `buf_alloc` value to account for the overhead and notifying the other peer, at least until we find a better solution. Stefano ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 9:09 ` Stefano Garzarella @ 2026-05-07 11:45 ` Michael S. Tsirkin 2026-05-07 12:59 ` Stefano Garzarella 0 siblings, 1 reply; 26+ messages in thread From: Michael S. Tsirkin @ 2026-05-07 11:45 UTC (permalink / raw) To: Stefano Garzarella Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote: > On Wed, May 06, 2026 at 11:37:45AM -0400, Michael S. Tsirkin wrote: > > On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote: > > > On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote: > > > > On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > > > > > > > > > On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote: > > > > > >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. > > > > > > > > > > > >virtio_transport_recv_enqueue() skips coalescing for packets > > > > > >with VIRTIO_VSOCK_SEQ_EOM. > > > > > > > > > > > >If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, > > > > > >a very large number of packets can be queued > > > > > >because vvs->rx_bytes stays at 0. > > > > > > > > > > > >Fix this by estimating the skb metadata size: > > > > > > > > > > > > (Number of skbs in the queue) * SKB_TRUESIZE(0) > > > > > > > > > > > >Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") > > > > > >Signed-off-by: Eric Dumazet <edumazet@google.com> > > > > > >Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> > > > > > >Cc: Stefan Hajnoczi <stefanha@redhat.com> > > > > > >Cc: Stefano Garzarella <sgarzare@redhat.com> > > > > > >Cc: "Michael S. 
Tsirkin" <mst@redhat.com> > > > > > >Cc: Jason Wang <jasowang@redhat.com> > > > > > >Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > > >Cc: "Eugenio Pérez" <eperezma@redhat.com> > > > > > >Cc: kvm@vger.kernel.org > > > > > >Cc: virtualization@lists.linux.dev > > > > > >--- > > > > > > net/vmw_vsock/virtio_transport_common.c | 4 +++- > > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > > > > >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c > > > > > >index 416d533f493d7b07e9c77c43f741d28cfcd0953e..9b8014516f4fb1130ae184635fbba4dfee58bd64 100644 > > > > > >--- a/net/vmw_vsock/virtio_transport_common.c > > > > > >+++ b/net/vmw_vsock/virtio_transport_common.c > > > > > >@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, > > > > > > static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, > > > > > > u32 len) > > > > > > { > > > > > >- if (vvs->buf_used + len > vvs->buf_alloc) > > > > > >+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0); > > > > > >+ > > > > > >+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc) > > > > > > return false; > > > > > > > > > > I'm not sure about this fix, I mean that maybe this is incomplete. > > > > > In virtio-vsock, there is a credit mechanism between the two peers: > > > > > https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003 > > > > > > > > > > This takes only the payload into account, so it’s true that this problem > > > > > exists; however, perhaps we should also inform the other peer of a lower > > > > > credit balance, otherwise the other peer will believe it has much more > > > > > credit than it actually does, send a large payload, and then the packet > > > > > will be discarded and the data lost (there are no retransmissions, > > > > > etc.). 
> > > > > > > > I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff > > > > state to account credit") > > > > and find a better fix then? > > > > > > IIRC the same issue was there before the commit fixed by that one (commit > > > 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")), so > > > not sure about reverting it TBH. > > > > > > CCing Arseniy and Bobby. > > > > > > > > > > > There is always a discrepancy between skb->len and skb->truesize. > > > > You will not be able to announce a 1MB window, and accept one milliion > > > > skb of 1-byte each. > > > > > > > > This kind of contract is broken. > > > > > > > > > > Yep, I agree, but before we start discarding data (and losing it), IMHO we > > > should at least inform the other peer that we're out of space. > > > > > > @Stefan, @Michael, do you think we can do something in the spec to avoid > > > this issue and in some way take into account also the metadata in the > > > credit. I mean to avoid the 1-byte packets flooding. > > > > > > Thanks, > > > Stefano > > > > Why do we need the metadata? Just don't keep it around if you begin > > running low on memory. > > I don't think removing the skuffs will be easy; we added them for ebpf, > zero-copy, and seqpacket as well. You do not need to remove them completely. > For now, we're already doing something: > merging the skuffs if they don't have EOM set. Right that's good. You could go further and merge with EOM too if you stick the info about message boundaries somewhere else. > As a quick fix, I'm thinking of reducing the `buf_alloc` value to account > for the overhead and notifying the other peer, at least until we find a > better solution. > > Stefano well if you want to support pathological cases such as 1 byte messages that would mean like 100x reduction no? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 11:45 ` Michael S. Tsirkin @ 2026-05-07 12:59 ` Stefano Garzarella 2026-05-07 14:33 ` Jakub Kicinski ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Stefano Garzarella @ 2026-05-07 12:59 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote: >On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote: >> On Wed, May 06, 2026 at 11:37:45AM -0400, Michael S. Tsirkin wrote: >> > On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote: >> > > On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote: >> > > > On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote: >> > > > > >> > > > > On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote: >> > > > > >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. >> > > > > > >> > > > > >virtio_transport_recv_enqueue() skips coalescing for packets >> > > > > >with VIRTIO_VSOCK_SEQ_EOM. >> > > > > > >> > > > > >If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, >> > > > > >a very large number of packets can be queued >> > > > > >because vvs->rx_bytes stays at 0. >> > > > > > >> > > > > >Fix this by estimating the skb metadata size: >> > > > > > >> > > > > > (Number of skbs in the queue) * SKB_TRUESIZE(0) >> > > > > > >> > > > > >Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") >> > > > > >Signed-off-by: Eric Dumazet <edumazet@google.com> >> > > > > >Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> >> > > > > >Cc: Stefan Hajnoczi <stefanha@redhat.com> >> > > > > >Cc: Stefano Garzarella <sgarzare@redhat.com> >> > > > > >Cc: "Michael S. 
Tsirkin" <mst@redhat.com> >> > > > > >Cc: Jason Wang <jasowang@redhat.com> >> > > > > >Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >> > > > > >Cc: "Eugenio Pérez" <eperezma@redhat.com> >> > > > > >Cc: kvm@vger.kernel.org >> > > > > >Cc: virtualization@lists.linux.dev >> > > > > >--- >> > > > > > net/vmw_vsock/virtio_transport_common.c | 4 +++- >> > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) >> > > > > > >> > > > > >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c >> > > > > >index 416d533f493d7b07e9c77c43f741d28cfcd0953e..9b8014516f4fb1130ae184635fbba4dfee58bd64 100644 >> > > > > >--- a/net/vmw_vsock/virtio_transport_common.c >> > > > > >+++ b/net/vmw_vsock/virtio_transport_common.c >> > > > > >@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, >> > > > > > static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, >> > > > > > u32 len) >> > > > > > { >> > > > > >- if (vvs->buf_used + len > vvs->buf_alloc) >> > > > > >+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0); >> > > > > >+ >> > > > > >+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc) >> > > > > > return false; >> > > > > >> > > > > I'm not sure about this fix, I mean that maybe this is incomplete. >> > > > > In virtio-vsock, there is a credit mechanism between the two peers: >> > > > > https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003 >> > > > > >> > > > > This takes only the payload into account, so it’s true that this problem >> > > > > exists; however, perhaps we should also inform the other peer of a lower >> > > > > credit balance, otherwise the other peer will believe it has much more >> > > > > credit than it actually does, send a large payload, and then the packet >> > > > > will be discarded and the data lost (there are no retransmissions, >> > > > > etc.). 
>> > > > >> > > > I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff >> > > > state to account credit") >> > > > and find a better fix then? >> > > >> > > IIRC the same issue was there before the commit fixed by that one (commit >> > > 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")), so >> > > not sure about reverting it TBH. >> > > >> > > CCing Arseniy and Bobby. >> > > >> > > > >> > > > There is always a discrepancy between skb->len and skb->truesize. >> > > > You will not be able to announce a 1MB window, and accept one milliion >> > > > skb of 1-byte each. >> > > > >> > > > This kind of contract is broken. >> > > > >> > > >> > > Yep, I agree, but before we start discarding data (and losing it), IMHO we >> > > should at least inform the other peer that we're out of space. >> > > >> > > @Stefan, @Michael, do you think we can do something in the spec to avoid >> > > this issue and in some way take into account also the metadata in the >> > > credit. I mean to avoid the 1-byte packets flooding. >> > > >> > > Thanks, >> > > Stefano >> > >> > Why do we need the metadata? Just don't keep it around if you begin >> > running low on memory. >> >> I don't think removing the skuffs will be easy; we added them for ebpf, >> zero-copy, and seqpacket as well. > >You do not need to remove them completely. > >> For now, we're already doing something: >> merging the skuffs if they don't have EOM set. > > >Right that's good. You could go further and merge with EOM too >if you stick the info about message boundaries somewhere else. This adds a lot of complexity IMO, but we can try. Do you have something in mind? > >> As a quick fix, I'm thinking of reducing the `buf_alloc` value to account >> for the overhead and notifying the other peer, at least until we find a >> better solution. >> >> Stefano > >well if you want to support pathological cases such as 1 byte messages >that would mean like 100x reduction no? 
> Yep, but since this patch is already merged, IMHO that is better than losing data in those pathological cases. Thanks, Stefano ^ permalink raw reply [flat|nested] 26+ messages in thread
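[Editorial note] The credit mechanism the thread keeps referring to counts payload bytes only; on the sender side it boils down to arithmetic like the following (field names follow `struct virtio_vsock_sock` in the spirit of `virtio_transport_get_credit()`, but this is an illustration, not kernel code):

```c
#include <stdint.h>

/* Free credit as seen by the sender: the peer's advertised buffer
 * minus bytes in flight (sent but not yet forwarded by the peer).
 * Unsigned 32-bit subtraction keeps this correct across counter
 * wraparound. */
static uint32_t peer_free_credit(uint32_t peer_buf_alloc,
				 uint32_t tx_cnt, uint32_t peer_fwd_cnt)
{
	return peer_buf_alloc - (tx_cnt - peer_fwd_cnt);
}
```

Because only payload counts here, a sender that fully respects this credit can still pin far more receiver memory than `peer_buf_alloc` by splitting its payload into many tiny messages, which is the gap under discussion.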
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 12:59 ` Stefano Garzarella @ 2026-05-07 14:33 ` Jakub Kicinski 2026-05-07 16:05 ` Stefano Garzarella 2026-05-07 14:52 ` Michael S. Tsirkin 2026-05-07 22:48 ` Michael S. Tsirkin 2 siblings, 1 reply; 26+ messages in thread From: Jakub Kicinski @ 2026-05-07 14:33 UTC (permalink / raw) To: Stefano Garzarella Cc: Michael S. Tsirkin, Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, 7 May 2026 14:59:13 +0200 Stefano Garzarella wrote: > >well if you want to support pathological cases such as 1 byte messages > >that would mean like 100x reduction no? > > Yep, but since this patch is already merged, IMHO that is better than > losing data in those pathological cases. We can revert if you think that the risk of regression is high.. Please LMK soon, we can do it before patch reaches Linus. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 14:33 ` Jakub Kicinski @ 2026-05-07 16:05 ` Stefano Garzarella 2026-05-07 16:32 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Stefano Garzarella @ 2026-05-07 16:05 UTC (permalink / raw) To: Jakub Kicinski Cc: Michael S. Tsirkin, Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote: >On Thu, 7 May 2026 14:59:13 +0200 Stefano Garzarella wrote: >> >well if you want to support pathological cases such as 1 byte messages >> >that would mean like 100x reduction no? >> >> Yep, but since this patch is already merged, IMHO that is better than >> losing data in those pathological cases. > >We can revert if you think that the risk of regression is high.. >Please LMK soon, we can do it before patch reaches Linus. > Some tests in tools/testing/vsock/vsock_test.c are failing with this patch applied. Test 18 is failing sometimes in this way (I guess because we are dropping packets): 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch Test 22 is failing 100% in this way: 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed: Resource temporarily unavailable With my followup patch that also advertises the reduced credit to the other peer (still a local draft, waiting for Michael's proposal) I saw test 22 failing, because the test expects that it can use the entire buf_alloc, but now we are reducing it. So IMO we should do as in `__sock_set_rcvbuf()` and double the buffer size, or at least absorb an overhead equal to the buffer size set by the user via SO_VM_SOCKETS_BUFFER_SIZE (yeah, AF_VSOCK has had its own sockopt since the beginning :-(). With that approach the tests are passing, but I'd like to stress that patch a bit more. 
I'll send it tomorrow as a fixup of this patch, or if you prefer to revert, I'll send it as a standalone patch. Thanks, Stefano ^ permalink raw reply [flat|nested] 26+ messages in thread
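[Editorial note] The two options Stefano floats (double the allocation like `__sock_set_rcvbuf()`, or advertise less credit than is allocated) can be sketched as plain arithmetic; both helper names are invented for illustration:

```c
#include <stdint.h>

/* Option A, in the spirit of __sock_set_rcvbuf(): allocate twice the
 * user-requested size, so metadata has headroom equal to the request. */
static uint64_t budget_doubled(uint32_t requested)
{
	return (uint64_t)requested * 2;
}

/* Option B: keep the allocation, but shrink the credit advertised to
 * the peer by the estimated metadata of the skbs already queued
 * (truesize0 approximates SKB_TRUESIZE(0)). */
static uint32_t advertised_credit(uint32_t buf_alloc, uint32_t queued_skbs,
				  uint32_t truesize0)
{
	uint64_t overhead = (uint64_t)queued_skbs * truesize0;

	return overhead >= buf_alloc ? 0 : buf_alloc - (uint32_t)overhead;
}
```

Option A keeps tests that consume the whole advertised `buf_alloc` passing, at the cost of real memory headroom; Option B keeps the allocation honest but shrinks what the peer may send.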
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 16:05 ` Stefano Garzarella @ 2026-05-07 16:32 ` Eric Dumazet 2026-05-07 17:18 ` Jakub Kicinski 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2026-05-07 16:32 UTC (permalink / raw) To: Stefano Garzarella Cc: Jakub Kicinski, Michael S. Tsirkin, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 7, 2026 at 9:05 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote: > >On Thu, 7 May 2026 14:59:13 +0200 Stefano Garzarella wrote: > >> >well if you want to support pathological cases such as 1 byte messages > >> >that would mean like 100x reduction no? > >> > >> Yep, but since this patch is already merged, IMHO that is better than > >> losing data in those pathological cases. > > > >We can revert if you think that the risk of regression is high.. > >Please LMK soon, we can do it before patch reaches Linus. > > > > Some tests in tools/testing/vsock/vsock_test.c are failing with this > patch applied. > > Test 18 are failing sometime in this way (I guess because we are > dropping packets): > > 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch > > Test 22 is failing 100% in this way: > > 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed: > Resource temporarily unavailable > > > With my followup patch adding also advertisement to the other peer > (still draft locally, waiting for Michael proposal) I saw 22 failing, > because tests expects that can use the entire buf_alloc, but now we are > reducing it. So IMO we should do like in `__sock_set_rcvbuf()` and > double the buffer size, or at least digest an overhead equal to the > buffer size set by the user via SO_VM_SOCKETS_BUFFER_SIZE (yeah, > AF_VSOCK has it owns sockopt since the beginning :-(). 
> > With that approach tests are passing, but I'd like to stress a bit more > that patch. I'll send it tomorrow as fixup of this patch, or if you > prefer to revert, I'll send as standalone. > A plain revert is a big issue, now users know how to crash hypervisors. This vulnerability allows a compromised guest (controlling virtio_vsock_hdr fields) to continuously flood the host's vsock receive queue without triggering any memory accounting limits or reader wakeups, resulting in unbounded host kernel memory consumption (Host DoS via OOM). A vulnerability where a KVM guest can crash or deadlock its host is classified as a KVM DoS. Am I missing something? ^ permalink raw reply [flat|nested] 26+ messages in thread
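[Editorial note] The flood Eric describes is easy to quantify: each len == 0, EOM-flagged packet pins roughly one skb's worth of kernel memory while vvs->rx_bytes stays 0, so under the pre-patch check the cost grows linearly in the number of packets with no bound. The 768 below is an assumed stand-in for `SKB_TRUESIZE(0)`:

```c
#include <stdint.h>

/* Memory pinned by a flood of zero-length, EOM-flagged packets under
 * the old check: rx_bytes never grows, so nothing limits `packets`.
 * 768 is an assumed per-skb cost standing in for SKB_TRUESIZE(0). */
static uint64_t flood_memory_bytes(uint64_t packets)
{
	return packets * 768u;
}
```

Ten million such packets, well within what a guest can push, would pin on the order of 7 GiB of host memory, which is the OOM scenario described above.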
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 16:32 ` Eric Dumazet @ 2026-05-07 17:18 ` Jakub Kicinski 2026-05-08 9:26 ` Stefano Garzarella 0 siblings, 1 reply; 26+ messages in thread From: Jakub Kicinski @ 2026-05-07 17:18 UTC (permalink / raw) To: Eric Dumazet Cc: Stefano Garzarella, Michael S. Tsirkin, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, 7 May 2026 09:32:24 -0700 Eric Dumazet wrote: > On Thu, May 7, 2026 at 9:05 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote: > > >We can revert if you think that the risk of regression is high.. > > >Please LMK soon, we can do it before patch reaches Linus. > > > > Some tests in tools/testing/vsock/vsock_test.c are failing with this > > patch applied. > > > > Test 18 are failing sometime in this way (I guess because we are > > dropping packets): > > > > 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch > > > > Test 22 is failing 100% in this way: > > > > 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed: > > Resource temporarily unavailable > > > > > > With my followup patch adding also advertisement to the other peer > > (still draft locally, waiting for Michael proposal) I saw 22 failing, > > because tests expects that can use the entire buf_alloc, but now we are > > reducing it. So IMO we should do like in `__sock_set_rcvbuf()` and > > double the buffer size, or at least digest an overhead equal to the > > buffer size set by the user via SO_VM_SOCKETS_BUFFER_SIZE (yeah, > > AF_VSOCK has it owns sockopt since the beginning :-(). > > > > With that approach tests are passing, but I'd like to stress a bit more > > that patch. I'll send it tomorrow as fixup of this patch, or if you > > prefer to revert, I'll send as standalone. 
> > A plain revert is a big issue, now users now how to crash hypervisors. > > This vulnerability allows a compromised guest (controlling > virtio_vsock_hdr fields) > to continuously flood the host's vsock receive queue without > triggering any memory > accounting limits or reader wakeups, resulting in unbounded host > kernel memory consumption (Host DoS via OOM). > > A vulnerability where a KVM guest can crash or deadlock its host is > classified as a KVM DoS. > > Am I missing something? Alright, let's leave it. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 17:18 ` Jakub Kicinski @ 2026-05-08 9:26 ` Stefano Garzarella 0 siblings, 0 replies; 26+ messages in thread From: Stefano Garzarella @ 2026-05-08 9:26 UTC (permalink / raw) To: Jakub Kicinski Cc: Eric Dumazet, Michael S. Tsirkin, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 07, 2026 at 10:18:05AM -0700, Jakub Kicinski wrote: >On Thu, 7 May 2026 09:32:24 -0700 Eric Dumazet wrote: >> On Thu, May 7, 2026 at 9:05 AM Stefano Garzarella <sgarzare@redhat.com> wrote: >> > On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote: >> > >We can revert if you think that the risk of regression is high.. >> > >Please LMK soon, we can do it before patch reaches Linus. >> > >> > Some tests in tools/testing/vsock/vsock_test.c are failing with this >> > patch applied. >> > >> > Test 18 are failing sometime in this way (I guess because we are >> > dropping packets): >> > >> > 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch >> > >> > Test 22 is failing 100% in this way: >> > >> > 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed: >> > Resource temporarily unavailable >> > >> > >> > With my followup patch adding also advertisement to the other peer >> > (still draft locally, waiting for Michael proposal) I saw 22 failing, >> > because tests expects that can use the entire buf_alloc, but now we are >> > reducing it. So IMO we should do like in `__sock_set_rcvbuf()` and >> > double the buffer size, or at least digest an overhead equal to the >> > buffer size set by the user via SO_VM_SOCKETS_BUFFER_SIZE (yeah, >> > AF_VSOCK has it owns sockopt since the beginning :-(). >> > >> > With that approach tests are passing, but I'd like to stress a bit more >> > that patch. 
I'll send it tomorrow as fixup of this patch, or if you >> > prefer to revert, I'll send as standalone. >> >> A plain revert is a big issue, now users now how to crash hypervisors. >> >> This vulnerability allows a compromised guest (controlling >> virtio_vsock_hdr fields) >> to continuously flood the host's vsock receive queue without >> triggering any memory >> accounting limits or reader wakeups, resulting in unbounded host >> kernel memory consumption (Host DoS via OOM). >> >> A vulnerability where a KVM guest can crash or deadlock its host is >> classified as a KVM DoS. >> >> Am I missing something? > >Alright, let's leave it. > I posted a potential fixup: https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/ Thanks, Stefano ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 12:59 ` Stefano Garzarella 2026-05-07 14:33 ` Jakub Kicinski @ 2026-05-07 14:52 ` Michael S. Tsirkin 2026-05-07 22:48 ` Michael S. Tsirkin 2 siblings, 0 replies; 26+ messages in thread From: Michael S. Tsirkin @ 2026-05-07 14:52 UTC (permalink / raw) To: Stefano Garzarella Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote: > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote: > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote: > > > On Wed, May 06, 2026 at 11:37:45AM -0400, Michael S. Tsirkin wrote: > > > > On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote: > > > > > On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote: > > > > > > On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > > > > > > > > > > > > > On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote: > > > > > > > >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. > > > > > > > > > > > > > > > >virtio_transport_recv_enqueue() skips coalescing for packets > > > > > > > >with VIRTIO_VSOCK_SEQ_EOM. > > > > > > > > > > > > > > > >If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, > > > > > > > >a very large number of packets can be queued > > > > > > > >because vvs->rx_bytes stays at 0. 
> > > > > > > > > > > > > > > >Fix this by estimating the skb metadata size: > > > > > > > > > > > > > > > > (Number of skbs in the queue) * SKB_TRUESIZE(0) > > > > > > > > > > > > > > > >Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") > > > > > > > >Signed-off-by: Eric Dumazet <edumazet@google.com> > > > > > > > >Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> > > > > > > > >Cc: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > >Cc: Stefano Garzarella <sgarzare@redhat.com> > > > > > > > >Cc: "Michael S. Tsirkin" <mst@redhat.com> > > > > > > > >Cc: Jason Wang <jasowang@redhat.com> > > > > > > > >Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > > > > >Cc: "Eugenio Pérez" <eperezma@redhat.com> > > > > > > > >Cc: kvm@vger.kernel.org > > > > > > > >Cc: virtualization@lists.linux.dev > > > > > > > >--- > > > > > > > > net/vmw_vsock/virtio_transport_common.c | 4 +++- > > > > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c > > > > > > > >index 416d533f493d7b07e9c77c43f741d28cfcd0953e..9b8014516f4fb1130ae184635fbba4dfee58bd64 100644 > > > > > > > >--- a/net/vmw_vsock/virtio_transport_common.c > > > > > > > >+++ b/net/vmw_vsock/virtio_transport_common.c > > > > > > > >@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, > > > > > > > > static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, > > > > > > > > u32 len) > > > > > > > > { > > > > > > > >- if (vvs->buf_used + len > vvs->buf_alloc) > > > > > > > >+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0); > > > > > > > >+ > > > > > > > >+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc) > > > > > > > > return false; > > > > > > > > > > > > > > I'm not sure about this fix, I mean that maybe this is incomplete. 
> > > > > > > In virtio-vsock, there is a credit mechanism between the two peers: > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003 > > > > > > > > > > > > > > This takes only the payload into account, so it’s true that this problem > > > > > > > exists; however, perhaps we should also inform the other peer of a lower > > > > > > > credit balance, otherwise the other peer will believe it has much more > > > > > > > credit than it actually does, send a large payload, and then the packet > > > > > > > will be discarded and the data lost (there are no retransmissions, > > > > > > > etc.). > > > > > > > > > > > > I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff > > > > > > state to account credit") > > > > > > and find a better fix then? > > > > > > > > > > IIRC the same issue was there before the commit fixed by that one (commit > > > > > 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")), so > > > > > not sure about reverting it TBH. > > > > > > > > > > CCing Arseniy and Bobby. > > > > > > > > > > > > > > > > > There is always a discrepancy between skb->len and skb->truesize. > > > > > > You will not be able to announce a 1MB window, and accept one milliion > > > > > > skb of 1-byte each. > > > > > > > > > > > > This kind of contract is broken. > > > > > > > > > > > > > > > > Yep, I agree, but before we start discarding data (and losing it), IMHO we > > > > > should at least inform the other peer that we're out of space. > > > > > > > > > > @Stefan, @Michael, do you think we can do something in the spec to avoid > > > > > this issue and in some way take into account also the metadata in the > > > > > credit. I mean to avoid the 1-byte packets flooding. > > > > > > > > > > Thanks, > > > > > Stefano > > > > > > > > Why do we need the metadata? Just don't keep it around if you begin > > > > running low on memory. 
> > > > > > I don't think removing the skbuffs will be easy; we added them for ebpf, > > > zero-copy, and seqpacket as well. > > > > You do not need to remove them completely. > > > > > For now, we're already doing something: > > > merging the skbuffs if they don't have EOM set. > > > > > > Right, that's good. You could go further and merge with EOM too > > if you stick the info about message boundaries somewhere else. > > This adds a lot of complexity IMO, but we can try. > > Do you have something in mind? I'll send something shortly just to give you an idea. > > > > > As a quick fix, I'm thinking of reducing the `buf_alloc` value to account > > > for the overhead and notifying the other peer, at least until we find a > > > better solution. > > > > > > Stefano > > > > well if you want to support pathological cases such as 1-byte messages > > that would mean like a 100x reduction, no? > > > > Yep, but since this patch is already merged, IMHO that is better than losing > data in those pathological cases. > > Thanks, > Stefano ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 12:59 ` Stefano Garzarella 2026-05-07 14:33 ` Jakub Kicinski 2026-05-07 14:52 ` Michael S. Tsirkin @ 2026-05-07 22:48 ` Michael S. Tsirkin 2026-05-08 9:41 ` Stefano Garzarella 2 siblings, 1 reply; 26+ messages in thread From: Michael S. Tsirkin @ 2026-05-07 22:48 UTC (permalink / raw) To: Stefano Garzarella Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote: > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote: > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote: > > > On Wed, May 06, 2026 at 11:37:45AM -0400, Michael S. Tsirkin wrote: > > > > On Tue, May 05, 2026 at 06:11:13PM +0200, Stefano Garzarella wrote: > > > > > On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote: > > > > > > On Tue, May 5, 2026 at 6:52 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > > > > > > > > > > > > > On Thu, Apr 30, 2026 at 12:26:52PM +0000, Eric Dumazet wrote: > > > > > > > >virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. > > > > > > > > > > > > > > > >virtio_transport_recv_enqueue() skips coalescing for packets > > > > > > > >with VIRTIO_VSOCK_SEQ_EOM. > > > > > > > > > > > > > > > >If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, > > > > > > > >a very large number of packets can be queued > > > > > > > >because vvs->rx_bytes stays at 0. 
> > > > > > > > > > > > > > > >Fix this by estimating the skb metadata size: > > > > > > > > > > > > > > > > (Number of skbs in the queue) * SKB_TRUESIZE(0) > > > > > > > > > > > > > > > >Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") > > > > > > > >Signed-off-by: Eric Dumazet <edumazet@google.com> > > > > > > > >Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> > > > > > > > >Cc: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > >Cc: Stefano Garzarella <sgarzare@redhat.com> > > > > > > > >Cc: "Michael S. Tsirkin" <mst@redhat.com> > > > > > > > >Cc: Jason Wang <jasowang@redhat.com> > > > > > > > >Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > > > > >Cc: "Eugenio Pérez" <eperezma@redhat.com> > > > > > > > >Cc: kvm@vger.kernel.org > > > > > > > >Cc: virtualization@lists.linux.dev > > > > > > > >--- > > > > > > > > net/vmw_vsock/virtio_transport_common.c | 4 +++- > > > > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c > > > > > > > >index 416d533f493d7b07e9c77c43f741d28cfcd0953e..9b8014516f4fb1130ae184635fbba4dfee58bd64 100644 > > > > > > > >--- a/net/vmw_vsock/virtio_transport_common.c > > > > > > > >+++ b/net/vmw_vsock/virtio_transport_common.c > > > > > > > >@@ -447,7 +447,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, > > > > > > > > static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, > > > > > > > > u32 len) > > > > > > > > { > > > > > > > >- if (vvs->buf_used + len > vvs->buf_alloc) > > > > > > > >+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0); > > > > > > > >+ > > > > > > > >+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc) > > > > > > > > return false; > > > > > > > > > > > > > > I'm not sure about this fix, I mean that maybe this is incomplete. 
> > > > > > > In virtio-vsock, there is a credit mechanism between the two peers: > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4850003 > > > > > > > > > > > > > > This takes only the payload into account, so it’s true that this problem > > > > > > > exists; however, perhaps we should also inform the other peer of a lower > > > > > > > credit balance, otherwise the other peer will believe it has much more > > > > > > > credit than it actually does, send a large payload, and then the packet > > > > > > > will be discarded and the data lost (there are no retransmissions, > > > > > > > etc.). > > > > > > > > > > > > I dunno, perhaps revert 077706165717 ("virtio/vsock: don't use skbuff > > > > > > state to account credit") > > > > > > and find a better fix then? > > > > > > > > > > IIRC the same issue was there before the commit fixed by that one (commit > > > > > 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")), so > > > > > not sure about reverting it TBH. > > > > > > > > > > CCing Arseniy and Bobby. > > > > > > > > > > > > > > > > > There is always a discrepancy between skb->len and skb->truesize. > > > > > > You will not be able to announce a 1MB window, and accept one milliion > > > > > > skb of 1-byte each. > > > > > > > > > > > > This kind of contract is broken. > > > > > > > > > > > > > > > > Yep, I agree, but before we start discarding data (and losing it), IMHO we > > > > > should at least inform the other peer that we're out of space. > > > > > > > > > > @Stefan, @Michael, do you think we can do something in the spec to avoid > > > > > this issue and in some way take into account also the metadata in the > > > > > credit. I mean to avoid the 1-byte packets flooding. > > > > > > > > > > Thanks, > > > > > Stefano > > > > > > > > Why do we need the metadata? Just don't keep it around if you begin > > > > running low on memory. 
> > > > > > I don't think removing the skbuffs will be easy; we added them for ebpf, > > > zero-copy, and seqpacket as well. > > > > You do not need to remove them completely. > > > > > For now, we're already doing something: > > > merging the skbuffs if they don't have EOM set. > > > > > > Right, that's good. You could go further and merge with EOM too > > if you stick the info about message boundaries somewhere else. > > This adds a lot of complexity IMO, but we can try. > > Do you have something in mind? BER is clearly overkill, but here's a POC that Claude made for me, just to give you an idea. It clearly has a ton of issues; for example, I dislike how GFP_ATOMIC is handled. Yet it seems to work fine in light testing. --> vsock/virtio: use DWARF ULEB128 to record EOM boundaries, enable cross-EOM skb coalescing virtio_transport_recv_enqueue() currently refuses to coalesce an incoming skb with the previous one when the previous skb carries VIRTIO_VSOCK_SEQ_EOM. This forces one skb per seqpacket message. For workloads with many small or zero-byte messages the per-skb overhead (~960 bytes) dominates, causing unbounded memory growth. Decouple message boundary tracking from the skb structure: store boundary offsets in a compact side buffer using DWARF ULEB128 encoding with the EOR flag folded into the low bit, then allow the data of multiple complete messages to be coalesced into a single skb. Cross-EOM coalescing fires only when: - both the tail skb and the incoming packet carry EOM (complete msgs) - the incoming packet fits in the tail skb's tailroom - no BPF psock is attached (read_skb expects one msg per skb) On allocation failure the code falls back to separate skbs (existing behaviour). Credit accounting is unchanged; the boundary buffer is capped at PAGE_SIZE. Signed-off-by: Michael S.
Tsirkin <mst@redhat.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h index f91704731057..e36b9ab28372 100644 --- a/include/linux/virtio_vsock.h +++ b/include/linux/virtio_vsock.h @@ -12,6 +12,7 @@ struct virtio_vsock_skb_cb { bool reply; bool tap_delivered; + bool has_boundary_entries; u32 offset; }; @@ -167,6 +168,12 @@ struct virtio_vsock_sock { u32 buf_used; struct sk_buff_head rx_queue; u32 msg_count; + + /* ULEB128-encoded seqpacket message boundary buffer */ + u8 *boundary_buf; + u32 boundary_len; + u32 boundary_alloc; + u32 boundary_off; }; struct virtio_vsock_pkt_info { diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 416d533f493d..81654f70f72c 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -11,6 +11,7 @@ #include <linux/sched/signal.h> #include <linux/ctype.h> #include <linux/list.h> +#include <linux/skmsg.h> #include <linux/virtio_vsock.h> #include <uapi/linux/vsockmon.h> @@ -26,6 +27,91 @@ /* Threshold for detecting small packets to copy */ #define GOOD_COPY_LEN 128 +#define VSOCK_BOUNDARY_BUF_INIT 64 +#define VSOCK_BOUNDARY_BUF_MAX PAGE_SIZE + +/* ULEB128 boundary encoding: value = (msg_len << 1) | eor. + * Each byte carries 7 data bits; bit 7 is set on all but the last byte. + * Max 5 bytes for a u32 msg_len (33 bits with eor shift). 
+ */ +static int vsock_uleb_encode_boundary(u8 *buf, u32 msg_len, bool eor) +{ + u64 val = ((u64)msg_len << 1) | eor; + int n = 0; + + do { + buf[n] = val & 0x7f; + val >>= 7; + if (val) + buf[n] |= 0x80; + n++; + } while (val); + + return n; +} + +static int vsock_uleb_decode_boundary(const u8 *buf, u32 avail, + u32 *msg_len, bool *eor) +{ + u64 val = 0; + int shift = 0; + int n = 0; + + do { + if (n >= avail || shift >= 35) + return -EINVAL; + val |= (u64)(buf[n] & 0x7f) << shift; + shift += 7; + } while (buf[n++] & 0x80); + + *eor = val & 1; + *msg_len = val >> 1; + return n; +} + +static void vsock_boundary_buf_compact(struct virtio_vsock_sock *vvs) +{ + if (vvs->boundary_off == 0) + return; + + vvs->boundary_len -= vvs->boundary_off; + memmove(vvs->boundary_buf, vvs->boundary_buf + vvs->boundary_off, + vvs->boundary_len); + vvs->boundary_off = 0; +} + +static int vsock_boundary_buf_ensure(struct virtio_vsock_sock *vvs, u32 needed) +{ + u32 new_alloc; + u8 *new_buf; + + if (vvs->boundary_alloc >= needed) + return 0; + + /* Reclaim consumed space before growing */ + if (vvs->boundary_off) { + needed -= vvs->boundary_off; + vsock_boundary_buf_compact(vvs); + if (vvs->boundary_alloc >= needed) + return 0; + } + + new_alloc = max(needed, vvs->boundary_alloc ? 
vvs->boundary_alloc * 2 + : VSOCK_BOUNDARY_BUF_INIT); + if (new_alloc > VSOCK_BOUNDARY_BUF_MAX) + new_alloc = VSOCK_BOUNDARY_BUF_MAX; + if (new_alloc < needed) + return -ENOMEM; + + new_buf = krealloc(vvs->boundary_buf, new_alloc, GFP_ATOMIC); + if (!new_buf) + return -ENOMEM; + + vvs->boundary_buf = new_buf; + vvs->boundary_alloc = new_alloc; + return 0; +} + static void virtio_transport_cancel_close_work(struct vsock_sock *vsk, bool cancel_timeout); static s64 virtio_transport_has_space(struct virtio_vsock_sock *vvs); @@ -682,41 +768,74 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk, total = 0; len = msg_data_left(msg); - skb_queue_walk(&vvs->rx_queue, skb) { - struct virtio_vsock_hdr *hdr; + skb = skb_peek(&vvs->rx_queue); + if (skb && VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) { + u32 msg_len, offset; + size_t bytes; + bool eor; + int ret; - if (total < len) { - size_t bytes; + ret = vsock_uleb_decode_boundary( + vvs->boundary_buf + vvs->boundary_off, + vvs->boundary_len - vvs->boundary_off, + &msg_len, &eor); + if (ret < 0) + goto unlock; + + offset = VIRTIO_VSOCK_SKB_CB(skb)->offset; + bytes = min(len, (size_t)msg_len); + + if (bytes) { int err; - bytes = len - total; - if (bytes > skb->len) - bytes = skb->len; - spin_unlock_bh(&vvs->rx_lock); - - /* sk_lock is held by caller so no one else can dequeue. - * Unlock rx_lock since skb_copy_datagram_iter() may sleep. 
- */ - err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->offset, + err = skb_copy_datagram_iter(skb, offset, &msg->msg_iter, bytes); if (err) return err; - spin_lock_bh(&vvs->rx_lock); } - total += skb->len; - hdr = virtio_vsock_hdr(skb); + total = msg_len; + if (eor) + msg->msg_flags |= MSG_EOR; + } else { + skb_queue_walk(&vvs->rx_queue, skb) { + struct virtio_vsock_hdr *hdr; - if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) { - if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) - msg->msg_flags |= MSG_EOR; + if (total < len) { + size_t bytes; + int err; - break; + bytes = len - total; + if (bytes > skb->len) + bytes = skb->len; + + spin_unlock_bh(&vvs->rx_lock); + + err = skb_copy_datagram_iter( + skb, + VIRTIO_VSOCK_SKB_CB(skb)->offset, + &msg->msg_iter, bytes); + if (err) + return err; + + spin_lock_bh(&vvs->rx_lock); + } + + total += skb->len; + hdr = virtio_vsock_hdr(skb); + + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) { + if (le32_to_cpu(hdr->flags) & + VIRTIO_VSOCK_SEQ_EOR) + msg->msg_flags |= MSG_EOR; + break; + } } } +unlock: spin_unlock_bh(&vvs->rx_lock); return total; @@ -740,57 +859,105 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk, } while (!msg_ready) { - struct virtio_vsock_hdr *hdr; - size_t pkt_len; - - skb = __skb_dequeue(&vvs->rx_queue); + skb = skb_peek(&vvs->rx_queue); if (!skb) break; - hdr = virtio_vsock_hdr(skb); - pkt_len = (size_t)le32_to_cpu(hdr->len); - if (dequeued_len >= 0) { + if (VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) { size_t bytes_to_copy; + u32 msg_len, offset; + bool eor; + int ret; - bytes_to_copy = min(user_buf_len, pkt_len); + ret = vsock_uleb_decode_boundary( + vvs->boundary_buf + vvs->boundary_off, + vvs->boundary_len - vvs->boundary_off, + &msg_len, &eor); + if (ret < 0) + break; + vvs->boundary_off += ret; - if (bytes_to_copy) { + offset = VIRTIO_VSOCK_SKB_CB(skb)->offset; + bytes_to_copy = min(user_buf_len, (size_t)msg_len); + + if (bytes_to_copy && 
dequeued_len >= 0) { int err; - /* sk_lock is held by caller so no one else can dequeue. - * Unlock rx_lock since skb_copy_datagram_iter() may sleep. - */ spin_unlock_bh(&vvs->rx_lock); - - err = skb_copy_datagram_iter(skb, 0, + err = skb_copy_datagram_iter(skb, offset, &msg->msg_iter, bytes_to_copy); - if (err) { - /* Copy of message failed. Rest of - * fragments will be freed without copy. - */ - dequeued_len = err; - } else { - user_buf_len -= bytes_to_copy; - } - spin_lock_bh(&vvs->rx_lock); + if (err) + dequeued_len = err; + else + user_buf_len -= bytes_to_copy; } if (dequeued_len >= 0) - dequeued_len += pkt_len; - } + dequeued_len += msg_len; - if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) { + VIRTIO_VSOCK_SKB_CB(skb)->offset += msg_len; msg_ready = true; vvs->msg_count--; - if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) + if (eor) msg->msg_flags |= MSG_EOR; - } - virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len); - kfree_skb(skb); + virtio_transport_dec_rx_pkt(vvs, msg_len, msg_len); + + if (VIRTIO_VSOCK_SKB_CB(skb)->offset >= skb->len) { + __skb_unlink(skb, &vvs->rx_queue); + kfree_skb(skb); + } + + if (vvs->boundary_off >= vvs->boundary_len / 2) + vsock_boundary_buf_compact(vvs); + } else { + struct virtio_vsock_hdr *hdr; + size_t pkt_len; + + skb = __skb_dequeue(&vvs->rx_queue); + if (!skb) + break; + hdr = virtio_vsock_hdr(skb); + pkt_len = (size_t)le32_to_cpu(hdr->len); + + if (dequeued_len >= 0) { + size_t bytes_to_copy; + + bytes_to_copy = min(user_buf_len, pkt_len); + + if (bytes_to_copy) { + int err; + + spin_unlock_bh(&vvs->rx_lock); + err = skb_copy_datagram_iter( + skb, 0, &msg->msg_iter, + bytes_to_copy); + if (err) + dequeued_len = err; + else + user_buf_len -= bytes_to_copy; + spin_lock_bh(&vvs->rx_lock); + } + + if (dequeued_len >= 0) + dequeued_len += pkt_len; + } + + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) { + msg_ready = true; + vvs->msg_count--; + + if (le32_to_cpu(hdr->flags) & + VIRTIO_VSOCK_SEQ_EOR) + 
msg->msg_flags |= MSG_EOR; + } + + virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len); + kfree_skb(skb); + } } spin_unlock_bh(&vvs->rx_lock); @@ -1132,6 +1299,7 @@ void virtio_transport_destruct(struct vsock_sock *vsk) virtio_transport_cancel_close_work(vsk, true); + kfree(vvs->boundary_buf); kfree(vvs); vsk->trans = NULL; } @@ -1224,6 +1392,11 @@ static void virtio_transport_remove_sock(struct vsock_sock *vsk) * removing it. */ __skb_queue_purge(&vvs->rx_queue); + kfree(vvs->boundary_buf); + vvs->boundary_buf = NULL; + vvs->boundary_len = 0; + vvs->boundary_alloc = 0; + vvs->boundary_off = 0; vsock_remove_sock(vsk); } @@ -1395,23 +1568,62 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk, !skb_is_nonlinear(skb)) { struct virtio_vsock_hdr *last_hdr; struct sk_buff *last_skb; + bool last_has_eom; + bool has_eom; last_skb = skb_peek_tail(&vvs->rx_queue); last_hdr = virtio_vsock_hdr(last_skb); + last_has_eom = le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM; + has_eom = le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM; - /* If there is space in the last packet queued, we copy the - * new packet in its buffer. We avoid this if the last packet - * queued has VIRTIO_VSOCK_SEQ_EOM set, because this is - * delimiter of SEQPACKET message, so 'pkt' is the first packet - * of a new message. - */ - if (skb->len < skb_tailroom(last_skb) && - !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)) { - memcpy(skb_put(last_skb, skb->len), skb->data, skb->len); - free_pkt = true; - last_hdr->flags |= hdr->flags; - le32_add_cpu(&last_hdr->len, len); - goto out; + if (skb->len < skb_tailroom(last_skb)) { + if (!last_has_eom) { + /* Same-message coalescing (existing path) */ + memcpy(skb_put(last_skb, skb->len), + skb->data, skb->len); + free_pkt = true; + last_hdr->flags |= hdr->flags; + le32_add_cpu(&last_hdr->len, len); + goto out; + } + + /* Cross-EOM: coalesce complete messages into one skb, + * recording message boundaries in a compact BER buffer. 
+ * Only when incoming packet also has EOM (complete msg). + */ + if (has_eom && !sk_psock(sk_vsock(vsk))) { + bool prev_eor, cur_eor; + u8 tmp[12]; + int n = 0; + + cur_eor = le32_to_cpu(hdr->flags) & + VIRTIO_VSOCK_SEQ_EOR; + + if (!VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries) { + u32 prev_len = le32_to_cpu(last_hdr->len); + + prev_eor = le32_to_cpu(last_hdr->flags) & + VIRTIO_VSOCK_SEQ_EOR; + n += vsock_uleb_encode_boundary( + tmp + n, prev_len, prev_eor); + } + n += vsock_uleb_encode_boundary( + tmp + n, len, cur_eor); + + if (!vsock_boundary_buf_ensure( + vvs, vvs->boundary_len + n)) { + memcpy(vvs->boundary_buf + + vvs->boundary_len, tmp, n); + vvs->boundary_len += n; + VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries = true; + memcpy(skb_put(last_skb, skb->len), + skb->data, skb->len); + free_pkt = true; + last_hdr->flags |= hdr->flags; + le32_add_cpu(&last_hdr->len, len); + goto out; + } + } } } ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue 2026-05-07 22:48 ` Michael S. Tsirkin @ 2026-05-08 9:41 ` Stefano Garzarella 2026-05-08 9:43 ` Michael S. Tsirkin 2026-05-08 9:58 ` Michael S. Tsirkin 0 siblings, 2 replies; 26+ messages in thread From: Stefano Garzarella @ 2026-05-08 9:41 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote: >On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote: >> On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote: >> > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote: [...] >> > > For now, we're already doing something: >> > > merging the skuffs if they don't have EOM set. >> > >> > >> > Right that's good. You could go further and merge with EOM too >> > if you stick the info about message boundaries somewhere else. >> >> This adds a lot of complexity IMO, but we can try. >> >> Do you have something in mind? > >BER is clearly overkill but here's a POC that claude made for me, >just to give u an idea. It's clearly has a ton of issues, >for example I dislike how GFP_ATOMIC is handled. Okay, I somewhat understand, but clearly this isn't net material, so for now I think the best thing to do is to merge the fixup I sent (or something similar): https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/ This is a major change that should be merged with more caution. Could this have too much of an impact on performance? Thanks, Stefano >Yet it seems to work fine in light testing. 
> >--> > > >vsock/virtio: use DWARF ULEB128 to record EOM boundaries, enable cross-EOM skb coalescing > >virtio_transport_recv_enqueue() currently refuses to coalesce an >incoming skb with the previous one when the previous skb carries >VIRTIO_VSOCK_SEQ_EOM. This forces one skb per seqpacket message. >For workloads with many small or zero-byte messages the per-skb >overhead (~960 bytes) dominates, causing unbounded memory growth. > >Decouple message boundary tracking from the skb structure: store >boundary offsets in a compact side buffer using DWARF ULEB128 >encoding with the EOR flag folded into the low bit, then allow >the data of multiple complete messages to be coalesced into a single >skb. > >Cross-EOM coalescing fires only when: >- both the tail skb and the incoming packet carry EOM (complete msgs) >- the incoming packet fits in the tail skb's tailroom >- no BPF psock is attached (read_skb expects one msg per skb) > >On allocation failure the code falls back to separate skbs (existing >behaviour). Credit accounting is unchanged; the boundary buffer is >capped at PAGE_SIZE. > >Signed-off-by: Michael S. 
Tsirkin <mst@redhat.com> >Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> > >diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h >index f91704731057..e36b9ab28372 100644 >--- a/include/linux/virtio_vsock.h >+++ b/include/linux/virtio_vsock.h >@@ -12,6 +12,7 @@ > struct virtio_vsock_skb_cb { > bool reply; > bool tap_delivered; >+ bool has_boundary_entries; > u32 offset; > }; > >@@ -167,6 +168,12 @@ struct virtio_vsock_sock { > u32 buf_used; > struct sk_buff_head rx_queue; > u32 msg_count; >+ >+ /* ULEB128-encoded seqpacket message boundary buffer */ >+ u8 *boundary_buf; >+ u32 boundary_len; >+ u32 boundary_alloc; >+ u32 boundary_off; > }; > > struct virtio_vsock_pkt_info { >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c >index 416d533f493d..81654f70f72c 100644 >--- a/net/vmw_vsock/virtio_transport_common.c >+++ b/net/vmw_vsock/virtio_transport_common.c >@@ -11,6 +11,7 @@ > #include <linux/sched/signal.h> > #include <linux/ctype.h> > #include <linux/list.h> >+#include <linux/skmsg.h> > #include <linux/virtio_vsock.h> > #include <uapi/linux/vsockmon.h> > >@@ -26,6 +27,91 @@ > /* Threshold for detecting small packets to copy */ > #define GOOD_COPY_LEN 128 > >+#define VSOCK_BOUNDARY_BUF_INIT 64 >+#define VSOCK_BOUNDARY_BUF_MAX PAGE_SIZE >+ >+/* ULEB128 boundary encoding: value = (msg_len << 1) | eor. >+ * Each byte carries 7 data bits; bit 7 is set on all but the last byte. >+ * Max 5 bytes for a u32 msg_len (33 bits with eor shift). 
>+ */ >+static int vsock_uleb_encode_boundary(u8 *buf, u32 msg_len, bool eor) >+{ >+ u64 val = ((u64)msg_len << 1) | eor; >+ int n = 0; >+ >+ do { >+ buf[n] = val & 0x7f; >+ val >>= 7; >+ if (val) >+ buf[n] |= 0x80; >+ n++; >+ } while (val); >+ >+ return n; >+} >+ >+static int vsock_uleb_decode_boundary(const u8 *buf, u32 avail, >+ u32 *msg_len, bool *eor) >+{ >+ u64 val = 0; >+ int shift = 0; >+ int n = 0; >+ >+ do { >+ if (n >= avail || shift >= 35) >+ return -EINVAL; >+ val |= (u64)(buf[n] & 0x7f) << shift; >+ shift += 7; >+ } while (buf[n++] & 0x80); >+ >+ *eor = val & 1; >+ *msg_len = val >> 1; >+ return n; >+} >+ >+static void vsock_boundary_buf_compact(struct virtio_vsock_sock *vvs) >+{ >+ if (vvs->boundary_off == 0) >+ return; >+ >+ vvs->boundary_len -= vvs->boundary_off; >+ memmove(vvs->boundary_buf, vvs->boundary_buf + vvs->boundary_off, >+ vvs->boundary_len); >+ vvs->boundary_off = 0; >+} >+ >+static int vsock_boundary_buf_ensure(struct virtio_vsock_sock *vvs, u32 needed) >+{ >+ u32 new_alloc; >+ u8 *new_buf; >+ >+ if (vvs->boundary_alloc >= needed) >+ return 0; >+ >+ /* Reclaim consumed space before growing */ >+ if (vvs->boundary_off) { >+ needed -= vvs->boundary_off; >+ vsock_boundary_buf_compact(vvs); >+ if (vvs->boundary_alloc >= needed) >+ return 0; >+ } >+ >+ new_alloc = max(needed, vvs->boundary_alloc ? 
vvs->boundary_alloc * 2 >+ : VSOCK_BOUNDARY_BUF_INIT); >+ if (new_alloc > VSOCK_BOUNDARY_BUF_MAX) >+ new_alloc = VSOCK_BOUNDARY_BUF_MAX; >+ if (new_alloc < needed) >+ return -ENOMEM; >+ >+ new_buf = krealloc(vvs->boundary_buf, new_alloc, GFP_ATOMIC); >+ if (!new_buf) >+ return -ENOMEM; >+ >+ vvs->boundary_buf = new_buf; >+ vvs->boundary_alloc = new_alloc; >+ return 0; >+} >+ > static void virtio_transport_cancel_close_work(struct vsock_sock *vsk, > bool cancel_timeout); > static s64 virtio_transport_has_space(struct virtio_vsock_sock *vvs); >@@ -682,41 +768,74 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk, > total = 0; > len = msg_data_left(msg); > >- skb_queue_walk(&vvs->rx_queue, skb) { >- struct virtio_vsock_hdr *hdr; >+ skb = skb_peek(&vvs->rx_queue); >+ if (skb && VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) { >+ u32 msg_len, offset; >+ size_t bytes; >+ bool eor; >+ int ret; > >- if (total < len) { >- size_t bytes; >+ ret = vsock_uleb_decode_boundary( >+ vvs->boundary_buf + vvs->boundary_off, >+ vvs->boundary_len - vvs->boundary_off, >+ &msg_len, &eor); >+ if (ret < 0) >+ goto unlock; >+ >+ offset = VIRTIO_VSOCK_SKB_CB(skb)->offset; >+ bytes = min(len, (size_t)msg_len); >+ >+ if (bytes) { > int err; > >- bytes = len - total; >- if (bytes > skb->len) >- bytes = skb->len; >- > spin_unlock_bh(&vvs->rx_lock); >- >- /* sk_lock is held by caller so no one else can dequeue. >- * Unlock rx_lock since skb_copy_datagram_iter() may sleep. 
>- */
>- err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->offset,
>+ err = skb_copy_datagram_iter(skb, offset,
> &msg->msg_iter, bytes);
> if (err)
> return err;
>-
> spin_lock_bh(&vvs->rx_lock);
> }
>
>- total += skb->len;
>- hdr = virtio_vsock_hdr(skb);
>+ total = msg_len;
>+ if (eor)
>+ msg->msg_flags |= MSG_EOR;
>+ } else {
>+ skb_queue_walk(&vvs->rx_queue, skb) {
>+ struct virtio_vsock_hdr *hdr;
>
>- if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>- if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
>- msg->msg_flags |= MSG_EOR;
>+ if (total < len) {
>+ size_t bytes;
>+ int err;
>
>- break;
>+ bytes = len - total;
>+ if (bytes > skb->len)
>+ bytes = skb->len;
>+
>+ spin_unlock_bh(&vvs->rx_lock);
>+
>+ err = skb_copy_datagram_iter(
>+ skb,
>+ VIRTIO_VSOCK_SKB_CB(skb)->offset,
>+ &msg->msg_iter, bytes);
>+ if (err)
>+ return err;
>+
>+ spin_lock_bh(&vvs->rx_lock);
>+ }
>+
>+ total += skb->len;
>+ hdr = virtio_vsock_hdr(skb);
>+
>+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>+ if (le32_to_cpu(hdr->flags) &
>+ VIRTIO_VSOCK_SEQ_EOR)
>+ msg->msg_flags |= MSG_EOR;
>+ break;
>+ }
> }
> }
>
>+unlock:
> spin_unlock_bh(&vvs->rx_lock);
>
> return total;
>@@ -740,57 +859,105 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> }
>
> while (!msg_ready) {
>- struct virtio_vsock_hdr *hdr;
>- size_t pkt_len;
>-
>- skb = __skb_dequeue(&vvs->rx_queue);
>+ skb = skb_peek(&vvs->rx_queue);
> if (!skb)
> break;
>- hdr = virtio_vsock_hdr(skb);
>- pkt_len = (size_t)le32_to_cpu(hdr->len);
>
>- if (dequeued_len >= 0) {
>+ if (VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
> size_t bytes_to_copy;
>+ u32 msg_len, offset;
>+ bool eor;
>+ int ret;
>
>- bytes_to_copy = min(user_buf_len, pkt_len);
>+ ret = vsock_uleb_decode_boundary(
>+ vvs->boundary_buf + vvs->boundary_off,
>+ vvs->boundary_len - vvs->boundary_off,
>+ &msg_len, &eor);
>+ if (ret < 0)
>+ break;
>+ vvs->boundary_off += ret;
>
>- if (bytes_to_copy) {
>+ offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
>+ bytes_to_copy = min(user_buf_len, (size_t)msg_len);
>+
>+ if (bytes_to_copy && dequeued_len >= 0) {
> int err;
>
>- /* sk_lock is held by caller so no one else can dequeue.
>- * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
>- */
> spin_unlock_bh(&vvs->rx_lock);
>-
>- err = skb_copy_datagram_iter(skb, 0,
>+ err = skb_copy_datagram_iter(skb, offset,
> &msg->msg_iter,
> bytes_to_copy);
>- if (err) {
>- /* Copy of message failed. Rest of
>- * fragments will be freed without copy.
>- */
>- dequeued_len = err;
>- } else {
>- user_buf_len -= bytes_to_copy;
>- }
>-
> spin_lock_bh(&vvs->rx_lock);
>+ if (err)
>+ dequeued_len = err;
>+ else
>+ user_buf_len -= bytes_to_copy;
> }
>
> if (dequeued_len >= 0)
>- dequeued_len += pkt_len;
>- }
>+ dequeued_len += msg_len;
>
>- if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>+ VIRTIO_VSOCK_SKB_CB(skb)->offset += msg_len;
> msg_ready = true;
> vvs->msg_count--;
>
>- if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
>+ if (eor)
> msg->msg_flags |= MSG_EOR;
>- }
>
>- virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
>- kfree_skb(skb);
>+ virtio_transport_dec_rx_pkt(vvs, msg_len, msg_len);
>+
>+ if (VIRTIO_VSOCK_SKB_CB(skb)->offset >= skb->len) {
>+ __skb_unlink(skb, &vvs->rx_queue);
>+ kfree_skb(skb);
>+ }
>+
>+ if (vvs->boundary_off >= vvs->boundary_len / 2)
>+ vsock_boundary_buf_compact(vvs);
>+ } else {
>+ struct virtio_vsock_hdr *hdr;
>+ size_t pkt_len;
>+
>+ skb = __skb_dequeue(&vvs->rx_queue);
>+ if (!skb)
>+ break;
>+ hdr = virtio_vsock_hdr(skb);
>+ pkt_len = (size_t)le32_to_cpu(hdr->len);
>+
>+ if (dequeued_len >= 0) {
>+ size_t bytes_to_copy;
>+
>+ bytes_to_copy = min(user_buf_len, pkt_len);
>+
>+ if (bytes_to_copy) {
>+ int err;
>+
>+ spin_unlock_bh(&vvs->rx_lock);
>+ err = skb_copy_datagram_iter(
>+ skb, 0, &msg->msg_iter,
>+ bytes_to_copy);
>+ if (err)
>+ dequeued_len = err;
>+ else
>+ user_buf_len -= bytes_to_copy;
>+ spin_lock_bh(&vvs->rx_lock);
>+ }
>+
>+ if (dequeued_len >= 0)
>+ dequeued_len += pkt_len;
>+ }
>+
>+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>+ msg_ready = true;
>+ vvs->msg_count--;
>+
>+ if (le32_to_cpu(hdr->flags) &
>+ VIRTIO_VSOCK_SEQ_EOR)
>+ msg->msg_flags |= MSG_EOR;
>+ }
>+
>+ virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
>+ kfree_skb(skb);
>+ }
> }
>
> spin_unlock_bh(&vvs->rx_lock);
>@@ -1132,6 +1299,7 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
>
> virtio_transport_cancel_close_work(vsk, true);
>
>+ kfree(vvs->boundary_buf);
> kfree(vvs);
> vsk->trans = NULL;
> }
>@@ -1224,6 +1392,11 @@ static void virtio_transport_remove_sock(struct vsock_sock *vsk)
> * removing it.
> */
> __skb_queue_purge(&vvs->rx_queue);
>+ kfree(vvs->boundary_buf);
>+ vvs->boundary_buf = NULL;
>+ vvs->boundary_len = 0;
>+ vvs->boundary_alloc = 0;
>+ vvs->boundary_off = 0;
> vsock_remove_sock(vsk);
> }
>
>@@ -1395,23 +1568,62 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> !skb_is_nonlinear(skb)) {
> struct virtio_vsock_hdr *last_hdr;
> struct sk_buff *last_skb;
>+ bool last_has_eom;
>+ bool has_eom;
>
> last_skb = skb_peek_tail(&vvs->rx_queue);
> last_hdr = virtio_vsock_hdr(last_skb);
>+ last_has_eom = le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
>+ has_eom = le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
>
>- /* If there is space in the last packet queued, we copy the
>- * new packet in its buffer. We avoid this if the last packet
>- * queued has VIRTIO_VSOCK_SEQ_EOM set, because this is
>- * delimiter of SEQPACKET message, so 'pkt' is the first packet
>- * of a new message.
>- */
>- if (skb->len < skb_tailroom(last_skb) &&
>- !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)) {
>- memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
>- free_pkt = true;
>- last_hdr->flags |= hdr->flags;
>- le32_add_cpu(&last_hdr->len, len);
>- goto out;
>+ if (skb->len < skb_tailroom(last_skb)) {
>+ if (!last_has_eom) {
>+ /* Same-message coalescing (existing path) */
>+ memcpy(skb_put(last_skb, skb->len),
>+ skb->data, skb->len);
>+ free_pkt = true;
>+ last_hdr->flags |= hdr->flags;
>+ le32_add_cpu(&last_hdr->len, len);
>+ goto out;
>+ }
>+
>+ /* Cross-EOM: coalesce complete messages into one skb,
>+ * recording message boundaries in a compact BER buffer.
>+ * Only when incoming packet also has EOM (complete msg).
>+ */
>+ if (has_eom && !sk_psock(sk_vsock(vsk))) {
>+ bool prev_eor, cur_eor;
>+ u8 tmp[12];
>+ int n = 0;
>+
>+ cur_eor = le32_to_cpu(hdr->flags) &
>+ VIRTIO_VSOCK_SEQ_EOR;
>+
>+ if (!VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries) {
>+ u32 prev_len = le32_to_cpu(last_hdr->len);
>+
>+ prev_eor = le32_to_cpu(last_hdr->flags) &
>+ VIRTIO_VSOCK_SEQ_EOR;
>+ n += vsock_uleb_encode_boundary(
>+ tmp + n, prev_len, prev_eor);
>+ }
>+ n += vsock_uleb_encode_boundary(
>+ tmp + n, len, cur_eor);
>+
>+ if (!vsock_boundary_buf_ensure(
>+ vvs, vvs->boundary_len + n)) {
>+ memcpy(vvs->boundary_buf +
>+ vvs->boundary_len, tmp, n);
>+ vvs->boundary_len += n;
>+ VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries = true;
>+ memcpy(skb_put(last_skb, skb->len),
>+ skb->data, skb->len);
>+ free_pkt = true;
>+ last_hdr->flags |= hdr->flags;
>+ le32_add_cpu(&last_hdr->len, len);
>+ goto out;
>+ }
>+ }
> }
> }
>

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
2026-05-08 9:41 ` Stefano Garzarella
@ 2026-05-08 9:43 ` Michael S. Tsirkin
2026-05-08 9:58 ` Michael S. Tsirkin
1 sibling, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-05-08 9:43 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev,
eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
Eugenio Pérez, kvm, virtualization

On Fri, May 08, 2026 at 11:41:21AM +0200, Stefano Garzarella wrote:
> On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote:
> > On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote:
> > > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote:
> [...]
>
> > > > > For now, we're already doing something:
> > > > > merging the skbuffs if they don't have EOM set.
> > > >
> > > > Right that's good. You could go further and merge with EOM too
> > > > if you stick the info about message boundaries somewhere else.
> > >
> > > This adds a lot of complexity IMO, but we can try.
> > >
> > > Do you have something in mind?
> >
> > BER is clearly overkill but here's a POC that claude made for me,
> > just to give you an idea. It clearly has a ton of issues,
> > for example I dislike how GFP_ATOMIC is handled.
>
> Okay, I somewhat understand, but clearly this isn't net material, so for now
> I think the best thing to do is to merge the fixup I sent (or something
> similar):
> https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/

will respond there.

> This is a major change that should be merged with more caution.
> Could this have too much of an impact on performance?
>
> Thanks,
> Stefano

Upstream is so broken now, I'd not worry about perf even.

> > Yet it seems to work fine in light testing.
> >
> > -->
> >
> > vsock/virtio: use DWARF ULEB128 to record EOM boundaries, enable cross-EOM skb coalescing
> >
> > virtio_transport_recv_enqueue() currently refuses to coalesce an
> > incoming skb with the previous one when the previous skb carries
> > VIRTIO_VSOCK_SEQ_EOM. This forces one skb per seqpacket message.
> > For workloads with many small or zero-byte messages the per-skb
> > overhead (~960 bytes) dominates, causing unbounded memory growth.
> >
> > Decouple message boundary tracking from the skb structure: store
> > boundary offsets in a compact side buffer using DWARF ULEB128
> > encoding with the EOR flag folded into the low bit, then allow
> > the data of multiple complete messages to be coalesced into a single
> > skb.
> >
> > Cross-EOM coalescing fires only when:
> > - both the tail skb and the incoming packet carry EOM (complete msgs)
> > - the incoming packet fits in the tail skb's tailroom
> > - no BPF psock is attached (read_skb expects one msg per skb)
> >
> > On allocation failure the code falls back to separate skbs (existing
> > behaviour). Credit accounting is unchanged; the boundary buffer is
> > capped at PAGE_SIZE.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> >
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index f91704731057..e36b9ab28372 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -12,6 +12,7 @@
> > struct virtio_vsock_skb_cb {
> > bool reply;
> > bool tap_delivered;
> > + bool has_boundary_entries;
> > u32 offset;
> > };
> >
> > @@ -167,6 +168,12 @@ struct virtio_vsock_sock {
> > u32 buf_used;
> > struct sk_buff_head rx_queue;
> > u32 msg_count;
> > +
> > + /* ULEB128-encoded seqpacket message boundary buffer */
> > + u8 *boundary_buf;
> > + u32 boundary_len;
> > + u32 boundary_alloc;
> > + u32 boundary_off;
> > };
> >
> > struct virtio_vsock_pkt_info {
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index 416d533f493d..81654f70f72c 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -11,6 +11,7 @@
> > #include <linux/sched/signal.h>
> > #include <linux/ctype.h>
> > #include <linux/list.h>
> > +#include <linux/skmsg.h>
> > #include <linux/virtio_vsock.h>
> > #include <uapi/linux/vsockmon.h>
> >
> > @@ -26,6 +27,91 @@
> > /* Threshold for detecting small packets to copy */
> > #define GOOD_COPY_LEN 128
> >
> > +#define VSOCK_BOUNDARY_BUF_INIT 64
> > +#define VSOCK_BOUNDARY_BUF_MAX PAGE_SIZE
> > +
> > +/* ULEB128 boundary encoding: value = (msg_len << 1) | eor.
> > + * Each byte carries 7 data bits; bit 7 is set on all but the last byte.
> > + * Max 5 bytes for a u32 msg_len (33 bits with eor shift).
> > + */
> > +static int vsock_uleb_encode_boundary(u8 *buf, u32 msg_len, bool eor)
> > +{
> > + u64 val = ((u64)msg_len << 1) | eor;
> > + int n = 0;
> > +
> > + do {
> > + buf[n] = val & 0x7f;
> > + val >>= 7;
> > + if (val)
> > + buf[n] |= 0x80;
> > + n++;
> > + } while (val);
> > +
> > + return n;
> > +}
> > +
> > +static int vsock_uleb_decode_boundary(const u8 *buf, u32 avail,
> > + u32 *msg_len, bool *eor)
> > +{
> > + u64 val = 0;
> > + int shift = 0;
> > + int n = 0;
> > +
> > + do {
> > + if (n >= avail || shift >= 35)
> > + return -EINVAL;
> > + val |= (u64)(buf[n] & 0x7f) << shift;
> > + shift += 7;
> > + } while (buf[n++] & 0x80);
> > +
> > + *eor = val & 1;
> > + *msg_len = val >> 1;
> > + return n;
> > +}
> > +
> > +static void vsock_boundary_buf_compact(struct virtio_vsock_sock *vvs)
> > +{
> > + if (vvs->boundary_off == 0)
> > + return;
> > +
> > + vvs->boundary_len -= vvs->boundary_off;
> > + memmove(vvs->boundary_buf, vvs->boundary_buf + vvs->boundary_off,
> > + vvs->boundary_len);
> > + vvs->boundary_off = 0;
> > +}
> > +
> > +static int vsock_boundary_buf_ensure(struct virtio_vsock_sock *vvs, u32 needed)
> > +{
> > + u32 new_alloc;
> > + u8 *new_buf;
> > +
> > + if (vvs->boundary_alloc >= needed)
> > + return 0;
> > +
> > + /* Reclaim consumed space before growing */
> > + if (vvs->boundary_off) {
> > + needed -= vvs->boundary_off;
> > + vsock_boundary_buf_compact(vvs);
> > + if (vvs->boundary_alloc >= needed)
> > + return 0;
> > + }
> > +
> > + new_alloc = max(needed, vvs->boundary_alloc ? vvs->boundary_alloc * 2
> > + : VSOCK_BOUNDARY_BUF_INIT);
> > + if (new_alloc > VSOCK_BOUNDARY_BUF_MAX)
> > + new_alloc = VSOCK_BOUNDARY_BUF_MAX;
> > + if (new_alloc < needed)
> > + return -ENOMEM;
> > +
> > + new_buf = krealloc(vvs->boundary_buf, new_alloc, GFP_ATOMIC);
> > + if (!new_buf)
> > + return -ENOMEM;
> > +
> > + vvs->boundary_buf = new_buf;
> > + vvs->boundary_alloc = new_alloc;
> > + return 0;
> > +}
> > +
> > static void virtio_transport_cancel_close_work(struct vsock_sock *vsk,
> > bool cancel_timeout);
> > static s64 virtio_transport_has_space(struct virtio_vsock_sock *vvs);
> > @@ -682,41 +768,74 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk,
> > total = 0;
> > len = msg_data_left(msg);
> >
> > - skb_queue_walk(&vvs->rx_queue, skb) {
> > - struct virtio_vsock_hdr *hdr;
> > + skb = skb_peek(&vvs->rx_queue);
> > + if (skb && VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) {
> > + u32 msg_len, offset;
> > + size_t bytes;
> > + bool eor;
> > + int ret;
> >
> > - if (total < len) {
> > - size_t bytes;
> > + ret = vsock_uleb_decode_boundary(
> > + vvs->boundary_buf + vvs->boundary_off,
> > + vvs->boundary_len - vvs->boundary_off,
> > + &msg_len, &eor);
> > + if (ret < 0)
> > + goto unlock;
> > +
> > + offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
> > + bytes = min(len, (size_t)msg_len);
> > +
> > + if (bytes) {
> > int err;
> >
> > - bytes = len - total;
> > - if (bytes > skb->len)
> > - bytes = skb->len;
> > -
> > spin_unlock_bh(&vvs->rx_lock);
> > -
> > - /* sk_lock is held by caller so no one else can dequeue.
> > - * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
> > - */
> > [...]

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
2026-05-08 9:41 ` Stefano Garzarella
2026-05-08 9:43 ` Michael S. Tsirkin
@ 2026-05-08 9:58 ` Michael S. Tsirkin
2026-05-08 10:11 ` Stefano Garzarella
1 sibling, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-05-08 9:58 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev,
eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
Eugenio Pérez, kvm, virtualization

On Fri, May 08, 2026 at 11:41:21AM +0200, Stefano Garzarella wrote:
> On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote:
> > On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote:
> > > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote:
> [...]
>
> > > > > For now, we're already doing something:
> > > > > merging the skbuffs if they don't have EOM set.
> > > >
> > > > Right that's good. You could go further and merge with EOM too
> > > > if you stick the info about message boundaries somewhere else.
> > >
> > > This adds a lot of complexity IMO, but we can try.
> > >
> > > Do you have something in mind?
> >
> > BER is clearly overkill but here's a POC that claude made for me,
> > just to give you an idea. It clearly has a ton of issues,
> > for example I dislike how GFP_ATOMIC is handled.
>
> Okay, I somewhat understand, but clearly this isn't net material

I doubt we have many other options, given that reverting the regression
was ruled out.

> so for now
> I think the best thing to do is to merge the fixup I sent (or something
> similar):
> https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/

I reviewed that one; the problem is it's a spec violation/change that
we'll have to support forever.

> This is a major change that should be merged with more caution.
> Could this have too much of an impact on performance? > > Thanks, > Stefano It's really a POC, real patch is left as an excersise for the reader :). The correct approach IMHO is to only start using this when we wasted a lot of memory on small packets. For example, if sum(truesize) >= buf size. then we'll not see a perf impact unless it's already pathological. > > Yet it seems to work fine in light testing. > > > > --> > > > > > > vsock/virtio: use DWARF ULEB128 to record EOM boundaries, enable cross-EOM skb coalescing > > > > virtio_transport_recv_enqueue() currently refuses to coalesce an > > incoming skb with the previous one when the previous skb carries > > VIRTIO_VSOCK_SEQ_EOM. This forces one skb per seqpacket message. > > For workloads with many small or zero-byte messages the per-skb > > overhead (~960 bytes) dominates, causing unbounded memory growth. > > > > Decouple message boundary tracking from the skb structure: store > > boundary offsets in a compact side buffer using DWARF ULEB128 > > encoding with the EOR flag folded into the low bit, then allow > > the data of multiple complete messages to be coalesced into a single > > skb. > > > > Cross-EOM coalescing fires only when: > > - both the tail skb and the incoming packet carry EOM (complete msgs) > > - the incoming packet fits in the tail skb's tailroom > > - no BPF psock is attached (read_skb expects one msg per skb) > > > > On allocation failure the code falls back to separate skbs (existing > > behaviour). Credit accounting is unchanged; the boundary buffer is > > capped at PAGE_SIZE. > > > > Signed-off-by: Michael S. 
Tsirkin <mst@redhat.com> > > Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> > > > > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h > > index f91704731057..e36b9ab28372 100644 > > --- a/include/linux/virtio_vsock.h > > +++ b/include/linux/virtio_vsock.h > > @@ -12,6 +12,7 @@ > > struct virtio_vsock_skb_cb { > > bool reply; > > bool tap_delivered; > > + bool has_boundary_entries; > > u32 offset; > > }; > > > > @@ -167,6 +168,12 @@ struct virtio_vsock_sock { > > u32 buf_used; > > struct sk_buff_head rx_queue; > > u32 msg_count; > > + > > + /* ULEB128-encoded seqpacket message boundary buffer */ > > + u8 *boundary_buf; > > + u32 boundary_len; > > + u32 boundary_alloc; > > + u32 boundary_off; > > }; > > > > struct virtio_vsock_pkt_info { > > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c > > index 416d533f493d..81654f70f72c 100644 > > --- a/net/vmw_vsock/virtio_transport_common.c > > +++ b/net/vmw_vsock/virtio_transport_common.c > > @@ -11,6 +11,7 @@ > > #include <linux/sched/signal.h> > > #include <linux/ctype.h> > > #include <linux/list.h> > > +#include <linux/skmsg.h> > > #include <linux/virtio_vsock.h> > > #include <uapi/linux/vsockmon.h> > > > > @@ -26,6 +27,91 @@ > > /* Threshold for detecting small packets to copy */ > > #define GOOD_COPY_LEN 128 > > > > +#define VSOCK_BOUNDARY_BUF_INIT 64 > > +#define VSOCK_BOUNDARY_BUF_MAX PAGE_SIZE > > + > > +/* ULEB128 boundary encoding: value = (msg_len << 1) | eor. > > + * Each byte carries 7 data bits; bit 7 is set on all but the last byte. > > + * Max 5 bytes for a u32 msg_len (33 bits with eor shift). 
> > + */ > > +static int vsock_uleb_encode_boundary(u8 *buf, u32 msg_len, bool eor) > > +{ > > + u64 val = ((u64)msg_len << 1) | eor; > > + int n = 0; > > + > > + do { > > + buf[n] = val & 0x7f; > > + val >>= 7; > > + if (val) > > + buf[n] |= 0x80; > > + n++; > > + } while (val); > > + > > + return n; > > +} > > + > > +static int vsock_uleb_decode_boundary(const u8 *buf, u32 avail, > > + u32 *msg_len, bool *eor) > > +{ > > + u64 val = 0; > > + int shift = 0; > > + int n = 0; > > + > > + do { > > + if (n >= avail || shift >= 35) > > + return -EINVAL; > > + val |= (u64)(buf[n] & 0x7f) << shift; > > + shift += 7; > > + } while (buf[n++] & 0x80); > > + > > + *eor = val & 1; > > + *msg_len = val >> 1; > > + return n; > > +} > > + > > +static void vsock_boundary_buf_compact(struct virtio_vsock_sock *vvs) > > +{ > > + if (vvs->boundary_off == 0) > > + return; > > + > > + vvs->boundary_len -= vvs->boundary_off; > > + memmove(vvs->boundary_buf, vvs->boundary_buf + vvs->boundary_off, > > + vvs->boundary_len); > > + vvs->boundary_off = 0; > > +} > > + > > +static int vsock_boundary_buf_ensure(struct virtio_vsock_sock *vvs, u32 needed) > > +{ > > + u32 new_alloc; > > + u8 *new_buf; > > + > > + if (vvs->boundary_alloc >= needed) > > + return 0; > > + > > + /* Reclaim consumed space before growing */ > > + if (vvs->boundary_off) { > > + needed -= vvs->boundary_off; > > + vsock_boundary_buf_compact(vvs); > > + if (vvs->boundary_alloc >= needed) > > + return 0; > > + } > > + > > + new_alloc = max(needed, vvs->boundary_alloc ? 
vvs->boundary_alloc * 2 > > + : VSOCK_BOUNDARY_BUF_INIT); > > + if (new_alloc > VSOCK_BOUNDARY_BUF_MAX) > > + new_alloc = VSOCK_BOUNDARY_BUF_MAX; > > + if (new_alloc < needed) > > + return -ENOMEM; > > + > > + new_buf = krealloc(vvs->boundary_buf, new_alloc, GFP_ATOMIC); > > + if (!new_buf) > > + return -ENOMEM; > > + > > + vvs->boundary_buf = new_buf; > > + vvs->boundary_alloc = new_alloc; > > + return 0; > > +} > > + > > static void virtio_transport_cancel_close_work(struct vsock_sock *vsk, > > bool cancel_timeout); > > static s64 virtio_transport_has_space(struct virtio_vsock_sock *vvs); > > @@ -682,41 +768,74 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk, > > total = 0; > > len = msg_data_left(msg); > > > > - skb_queue_walk(&vvs->rx_queue, skb) { > > - struct virtio_vsock_hdr *hdr; > > + skb = skb_peek(&vvs->rx_queue); > > + if (skb && VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) { > > + u32 msg_len, offset; > > + size_t bytes; > > + bool eor; > > + int ret; > > > > - if (total < len) { > > - size_t bytes; > > + ret = vsock_uleb_decode_boundary( > > + vvs->boundary_buf + vvs->boundary_off, > > + vvs->boundary_len - vvs->boundary_off, > > + &msg_len, &eor); > > + if (ret < 0) > > + goto unlock; > > + > > + offset = VIRTIO_VSOCK_SKB_CB(skb)->offset; > > + bytes = min(len, (size_t)msg_len); > > + > > + if (bytes) { > > int err; > > > > - bytes = len - total; > > - if (bytes > skb->len) > > - bytes = skb->len; > > - > > spin_unlock_bh(&vvs->rx_lock); > > - > > - /* sk_lock is held by caller so no one else can dequeue. > > - * Unlock rx_lock since skb_copy_datagram_iter() may sleep. 
> > - */ > > - err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->offset, > > + err = skb_copy_datagram_iter(skb, offset, > > &msg->msg_iter, bytes); > > if (err) > > return err; > > - > > spin_lock_bh(&vvs->rx_lock); > > } > > > > - total += skb->len; > > - hdr = virtio_vsock_hdr(skb); > > + total = msg_len; > > + if (eor) > > + msg->msg_flags |= MSG_EOR; > > + } else { > > + skb_queue_walk(&vvs->rx_queue, skb) { > > + struct virtio_vsock_hdr *hdr; > > > > - if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) { > > - if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) > > - msg->msg_flags |= MSG_EOR; > > + if (total < len) { > > + size_t bytes; > > + int err; > > > > - break; > > + bytes = len - total; > > + if (bytes > skb->len) > > + bytes = skb->len; > > + > > + spin_unlock_bh(&vvs->rx_lock); > > + > > + err = skb_copy_datagram_iter( > > + skb, > > + VIRTIO_VSOCK_SKB_CB(skb)->offset, > > + &msg->msg_iter, bytes); > > + if (err) > > + return err; > > + > > + spin_lock_bh(&vvs->rx_lock); > > + } > > + > > + total += skb->len; > > + hdr = virtio_vsock_hdr(skb); > > + > > + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) { > > + if (le32_to_cpu(hdr->flags) & > > + VIRTIO_VSOCK_SEQ_EOR) > > + msg->msg_flags |= MSG_EOR; > > + break; > > + } > > } > > } > > > > +unlock: > > spin_unlock_bh(&vvs->rx_lock); > > > > return total; > > @@ -740,57 +859,105 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk, > > } > > > > while (!msg_ready) { > > - struct virtio_vsock_hdr *hdr; > > - size_t pkt_len; > > - > > - skb = __skb_dequeue(&vvs->rx_queue); > > + skb = skb_peek(&vvs->rx_queue); > > if (!skb) > > break; > > - hdr = virtio_vsock_hdr(skb); > > - pkt_len = (size_t)le32_to_cpu(hdr->len); > > > > - if (dequeued_len >= 0) { > > + if (VIRTIO_VSOCK_SKB_CB(skb)->has_boundary_entries) { > > size_t bytes_to_copy; > > + u32 msg_len, offset; > > + bool eor; > > + int ret; > > > > - bytes_to_copy = min(user_buf_len, pkt_len); > > + ret = 
vsock_uleb_decode_boundary(
> > +					vvs->boundary_buf + vvs->boundary_off,
> > +					vvs->boundary_len - vvs->boundary_off,
> > +					&msg_len, &eor);
> > +			if (ret < 0)
> > +				break;
> > +			vvs->boundary_off += ret;
> >
> > -		if (bytes_to_copy) {
> > +			offset = VIRTIO_VSOCK_SKB_CB(skb)->offset;
> > +			bytes_to_copy = min(user_buf_len, (size_t)msg_len);
> > +
> > +			if (bytes_to_copy && dequeued_len >= 0) {
> >  				int err;
> >
> > -			/* sk_lock is held by caller so no one else can dequeue.
> > -			 * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
> > -			 */
> >  				spin_unlock_bh(&vvs->rx_lock);
> > -
> > -			err = skb_copy_datagram_iter(skb, 0,
> > +				err = skb_copy_datagram_iter(skb, offset,
> >  							     &msg->msg_iter,
> >  							     bytes_to_copy);
> > -			if (err) {
> > -				/* Copy of message failed. Rest of
> > -				 * fragments will be freed without copy.
> > -				 */
> > -				dequeued_len = err;
> > -			} else {
> > -				user_buf_len -= bytes_to_copy;
> > -			}
> > -
> >  				spin_lock_bh(&vvs->rx_lock);
> > +				if (err)
> > +					dequeued_len = err;
> > +				else
> > +					user_buf_len -= bytes_to_copy;
> >  			}
> >
> >  			if (dequeued_len >= 0)
> > -				dequeued_len += pkt_len;
> > -		}
> > +				dequeued_len += msg_len;
> >
> > -		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +			VIRTIO_VSOCK_SKB_CB(skb)->offset += msg_len;
> >  			msg_ready = true;
> >  			vvs->msg_count--;
> >
> > -			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
> > +			if (eor)
> >  				msg->msg_flags |= MSG_EOR;
> > -		}
> >
> > -		virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
> > -		kfree_skb(skb);
> > +			virtio_transport_dec_rx_pkt(vvs, msg_len, msg_len);
> > +
> > +			if (VIRTIO_VSOCK_SKB_CB(skb)->offset >= skb->len) {
> > +				__skb_unlink(skb, &vvs->rx_queue);
> > +				kfree_skb(skb);
> > +			}
> > +
> > +			if (vvs->boundary_off >= vvs->boundary_len / 2)
> > +				vsock_boundary_buf_compact(vvs);
> > +		} else {
> > +			struct virtio_vsock_hdr *hdr;
> > +			size_t pkt_len;
> > +
> > +			skb = __skb_dequeue(&vvs->rx_queue);
> > +			if (!skb)
> > +				break;
> > +			hdr = virtio_vsock_hdr(skb);
> > +			pkt_len = (size_t)le32_to_cpu(hdr->len);
> > +
> > +			if (dequeued_len >= 0) {
> > +				size_t bytes_to_copy;
> > +
> > +				bytes_to_copy = min(user_buf_len, pkt_len);
> > +
> > +				if (bytes_to_copy) {
> > +					int err;
> > +
> > +					spin_unlock_bh(&vvs->rx_lock);
> > +					err = skb_copy_datagram_iter(
> > +						skb, 0, &msg->msg_iter,
> > +						bytes_to_copy);
> > +					if (err)
> > +						dequeued_len = err;
> > +					else
> > +						user_buf_len -= bytes_to_copy;
> > +					spin_lock_bh(&vvs->rx_lock);
> > +				}
> > +
> > +				if (dequeued_len >= 0)
> > +					dequeued_len += pkt_len;
> > +			}
> > +
> > +			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> > +				msg_ready = true;
> > +				vvs->msg_count--;
> > +
> > +				if (le32_to_cpu(hdr->flags) &
> > +				    VIRTIO_VSOCK_SEQ_EOR)
> > +					msg->msg_flags |= MSG_EOR;
> > +			}
> > +
> > +			virtio_transport_dec_rx_pkt(vvs, pkt_len, pkt_len);
> > +			kfree_skb(skb);
> > +		}
> >  	}
> >
> >  	spin_unlock_bh(&vvs->rx_lock);
> > @@ -1132,6 +1299,7 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
> >
> >  	virtio_transport_cancel_close_work(vsk, true);
> >
> > +	kfree(vvs->boundary_buf);
> >  	kfree(vvs);
> >  	vsk->trans = NULL;
> >  }
> > @@ -1224,6 +1392,11 @@ static void virtio_transport_remove_sock(struct vsock_sock *vsk)
> >  	 * removing it.
> >  	 */
> >  	__skb_queue_purge(&vvs->rx_queue);
> > +	kfree(vvs->boundary_buf);
> > +	vvs->boundary_buf = NULL;
> > +	vvs->boundary_len = 0;
> > +	vvs->boundary_alloc = 0;
> > +	vvs->boundary_off = 0;
> >  	vsock_remove_sock(vsk);
> >  }
> >
> > @@ -1395,23 +1568,62 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> >  	    !skb_is_nonlinear(skb)) {
> >  		struct virtio_vsock_hdr *last_hdr;
> >  		struct sk_buff *last_skb;
> > +		bool last_has_eom;
> > +		bool has_eom;
> >
> >  		last_skb = skb_peek_tail(&vvs->rx_queue);
> >  		last_hdr = virtio_vsock_hdr(last_skb);
> > +		last_has_eom = le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
> > +		has_eom = le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM;
> >
> > -		/* If there is space in the last packet queued, we copy the
> > -		 * new packet in its buffer. We avoid this if the last packet
> > -		 * queued has VIRTIO_VSOCK_SEQ_EOM set, because this is
> > -		 * delimiter of SEQPACKET message, so 'pkt' is the first packet
> > -		 * of a new message.
> > -		 */
> > -		if (skb->len < skb_tailroom(last_skb) &&
> > -		    !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)) {
> > -			memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
> > -			free_pkt = true;
> > -			last_hdr->flags |= hdr->flags;
> > -			le32_add_cpu(&last_hdr->len, len);
> > -			goto out;
> > +		if (skb->len < skb_tailroom(last_skb)) {
> > +			if (!last_has_eom) {
> > +				/* Same-message coalescing (existing path) */
> > +				memcpy(skb_put(last_skb, skb->len),
> > +				       skb->data, skb->len);
> > +				free_pkt = true;
> > +				last_hdr->flags |= hdr->flags;
> > +				le32_add_cpu(&last_hdr->len, len);
> > +				goto out;
> > +			}
> > +
> > +			/* Cross-EOM: coalesce complete messages into one skb,
> > +			 * recording message boundaries in a compact BER buffer.
> > +			 * Only when incoming packet also has EOM (complete msg).
> > +			 */
> > +			if (has_eom && !sk_psock(sk_vsock(vsk))) {
> > +				bool prev_eor, cur_eor;
> > +				u8 tmp[12];
> > +				int n = 0;
> > +
> > +				cur_eor = le32_to_cpu(hdr->flags) &
> > +					  VIRTIO_VSOCK_SEQ_EOR;
> > +
> > +				if (!VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries) {
> > +					u32 prev_len = le32_to_cpu(last_hdr->len);
> > +
> > +					prev_eor = le32_to_cpu(last_hdr->flags) &
> > +						   VIRTIO_VSOCK_SEQ_EOR;
> > +					n += vsock_uleb_encode_boundary(
> > +						tmp + n, prev_len, prev_eor);
> > +				}
> > +				n += vsock_uleb_encode_boundary(
> > +					tmp + n, len, cur_eor);
> > +
> > +				if (!vsock_boundary_buf_ensure(
> > +					    vvs, vvs->boundary_len + n)) {
> > +					memcpy(vvs->boundary_buf +
> > +					       vvs->boundary_len, tmp, n);
> > +					vvs->boundary_len += n;
> > +					VIRTIO_VSOCK_SKB_CB(last_skb)->has_boundary_entries = true;
> > +					memcpy(skb_put(last_skb, skb->len),
> > +					       skb->data, skb->len);
> > +					free_pkt = true;
> > +					last_hdr->flags |= hdr->flags;
> > +					le32_add_cpu(&last_hdr->len, len);
> > +					goto out;
> > +				}
> > +			}
> >  		}
> >  	}
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread
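[Editor's sketch] To make the quoted POC concrete, here is a standalone sketch of what a `vsock_uleb_encode_boundary()`/`vsock_uleb_decode_boundary()` pair could look like: one record per coalesced message, ULEB128-encoded, with the low bit of the value carrying the EOR flag and the remaining bits the message length. The names and encoding details are inferred from the diff above, not taken from an actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the boundary helpers in the quoted POC.
 * Encoded value: (len << 1) | eor, emitted as standard ULEB128. */

/* Returns the number of bytes written to out[] (at most 5 for a u32 len). */
int uleb_encode_boundary(uint8_t *out, uint32_t len, int eor)
{
	uint64_t v = ((uint64_t)len << 1) | (eor ? 1 : 0);
	int n = 0;

	do {
		uint8_t b = v & 0x7f;

		v >>= 7;
		if (v)
			b |= 0x80;	/* continuation bit */
		out[n++] = b;
	} while (v);
	return n;
}

/* Returns bytes consumed, or -1 if the record is truncated. */
int uleb_decode_boundary(const uint8_t *in, size_t avail,
			 uint32_t *len, int *eor)
{
	uint64_t v = 0;
	int shift = 0, n = 0;

	while ((size_t)n < avail) {
		uint8_t b = in[n++];

		v |= (uint64_t)(b & 0x7f) << shift;
		if (!(b & 0x80)) {
			*eor = v & 1;
			*len = (uint32_t)(v >> 1);
			return n;
		}
		shift += 7;
	}
	return -1;	/* incomplete entry, caller must wait for more */
}
```

With this layout a message of up to 63 bytes needs a single record byte, which is where the "needn't exceed 50%" overhead figure later in the thread comes from.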
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-08  9:58 ` Michael S. Tsirkin
@ 2026-05-08 10:11   ` Stefano Garzarella
  0 siblings, 0 replies; 26+ messages in thread

From: Stefano Garzarella @ 2026-05-08 10:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric Dumazet, Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
	David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, kvm, virtualization

On Fri, May 08, 2026 at 05:58:06AM -0400, Michael S. Tsirkin wrote:
>On Fri, May 08, 2026 at 11:41:21AM +0200, Stefano Garzarella wrote:
>> On Thu, May 07, 2026 at 06:48:47PM -0400, Michael S. Tsirkin wrote:
>> > On Thu, May 07, 2026 at 02:59:13PM +0200, Stefano Garzarella wrote:
>> > > On Thu, May 07, 2026 at 07:45:10AM -0400, Michael S. Tsirkin wrote:
>> > > > On Thu, May 07, 2026 at 11:09:47AM +0200, Stefano Garzarella wrote:
>>
>> [...]
>>
>> > > > > For now, we're already doing something:
>> > > > > merging the skbuffs if they don't have EOM set.
>> > > >
>> > > > Right, that's good. You could go further and merge with EOM too
>> > > > if you stick the info about message boundaries somewhere else.
>> > >
>> > > This adds a lot of complexity IMO, but we can try.
>> > >
>> > > Do you have something in mind?
>> >
>> > BER is clearly overkill, but here's a POC that Claude made for me,
>> > just to give you an idea. It clearly has a ton of issues;
>> > for example, I dislike how GFP_ATOMIC is handled.
>>
>> Okay, I somewhat understand, but clearly this isn't net material
>
>I doubt we have many other options given reverting the regression was
>ruled out.

As Eric pointed out, we can't revert it.

>
>> so for now
>> I think the best thing to do is to merge the fixup I sent (or something
>> similar):
>> https://lore.kernel.org/netdev/20260508092330.69690-1-sgarzare@redhat.com/
>
>I reviewed that one; the problem is it's a spec violation/change that
>we'll have to support forever.

I have a few points to make on this, but let's discuss them there.

>
>> This is a major change that should be merged with more caution.
>> Could this have too much of an impact on performance?
>>
>> Thanks,
>> Stefano
>
>It's really a POC; the real patch is left as an exercise for the reader
>:).

eheh, I see, but honestly, this overcomplication scares me.
I'll try to think it over.

>The correct approach IMHO is to only start using this
>when we have wasted a lot of memory on small packets.
>
>For example, if sum(truesize) >= buf size.
>
>Then we'll not see a perf impact unless it's already pathological.

Agree on this, which is similar to what I'm doing in that patch:
reducing the advertised buf_alloc only in pathological cases
(e.g. overhead > buf_alloc).

Stefano
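[Editor's sketch] The "reduce the advertised buf_alloc only in pathological cases" idea Stefano describes can be modeled in a few lines: leave the advertised buffer untouched while metadata overhead stays within the configured window, and shrink it by the excess only once the overhead exceeds it. The function name and thresholds here are assumptions for illustration, not the actual fixup.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative credit-window shrink: `overhead` is the estimated total
 * skb metadata cost of the rx queue. Normal traffic sees the full
 * window; only pathological traffic (overhead > buf_alloc) loses credit. */
uint32_t advertised_buf_alloc(uint32_t buf_alloc, uint64_t overhead)
{
	if (overhead <= buf_alloc)		/* normal case: no change */
		return buf_alloc;
	if (overhead - buf_alloc >= buf_alloc)	/* fully pathological */
		return 0;
	return buf_alloc - (uint32_t)(overhead - buf_alloc);
}
```

The appeal of this shape is that the fast path is a single comparison, so well-behaved workloads pay nothing.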
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-05 14:14   ` Eric Dumazet
  2026-05-05 16:11     ` Stefano Garzarella
@ 2026-05-06 15:15     ` Michael S. Tsirkin
  2026-05-06 15:38       ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread

From: Michael S. Tsirkin @ 2026-05-06 15:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stefano Garzarella, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Stefan Hajnoczi,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization

On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote:
> There is always a discrepancy between skb->len and skb->truesize.
> You will not be able to announce a 1MB window, and accept one million
> skbs of 1 byte each.

We can if we copy.

> This kind of contract is broken.
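[Editor's sketch] Eric's contract argument is exactly what the fix at the head of this thread enforces: charge an estimated metadata cost per queued skb, so that even zero-length VIRTIO_VSOCK_SEQ_EOM packets consume the window. A userspace model of that check follows; the 768-byte constant merely stands in for SKB_TRUESIZE(0), whose real value depends on the architecture and kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder for SKB_TRUESIZE(0); illustrative value only. */
#define FAKE_SKB_TRUESIZE_0 768u

/* Returns 1 if a packet of `len` payload bytes may be queued given the
 * current payload bytes (rx_bytes), queue depth, and advertised buffer. */
int rx_pkt_fits(uint32_t rx_bytes, uint32_t len,
		uint32_t queue_len, uint32_t buf_alloc)
{
	/* One metadata charge per queued skb, plus one for this packet. */
	uint64_t skb_overhead =
		((uint64_t)queue_len + 1) * FAKE_SKB_TRUESIZE_0;

	return (uint64_t)rx_bytes + len + skb_overhead <= buf_alloc;
}
```

With a 1 MiB buf_alloc, a flood of zero-length EOM packets is cut off after roughly buf_alloc / SKB_TRUESIZE(0) skbs instead of growing without bound.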
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-06 15:15     ` Michael S. Tsirkin
@ 2026-05-06 15:38       ` Eric Dumazet
  2026-05-06 15:49         ` Michael S. Tsirkin
  0 siblings, 1 reply; 26+ messages in thread

From: Eric Dumazet @ 2026-05-06 15:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefano Garzarella, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Stefan Hajnoczi,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization

On Wed, May 6, 2026 at 8:15 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote:
> > There is always a discrepancy between skb->len and skb->truesize.
> > You will not be able to announce a 1MB window, and accept one million
> > skbs of 1 byte each.
>
> We can if we copy.

You mean, ignore VIRTIO_VSOCK_SEQ_EOM?
* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
  2026-05-06 15:38       ` Eric Dumazet
@ 2026-05-06 15:49         ` Michael S. Tsirkin
  0 siblings, 0 replies; 26+ messages in thread

From: Michael S. Tsirkin @ 2026-05-06 15:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stefano Garzarella, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Stefan Hajnoczi,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, kvm, virtualization

On Wed, May 06, 2026 at 08:38:39AM -0700, Eric Dumazet wrote:
> On Wed, May 6, 2026 at 8:15 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, May 05, 2026 at 07:14:36AM -0700, Eric Dumazet wrote:
> > > There is always a discrepancy between skb->len and skb->truesize.
> > > You will not be able to announce a 1MB window, and accept one million
> > > skbs of 1 byte each.
> >
> > We can if we copy.
>
> You mean, ignore VIRTIO_VSOCK_SEQ_EOM?

No, I mean saving it in a compact form. If packet boundaries are the
only concern, the overhead needn't exceed 50% even for 1-byte messages.
ASN.1 BER, for an over-engineered example?
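[Editor's sketch] Michael's 50% figure can be checked for a ULEB128-style boundary record (length shifted left one bit to fold in an EOR flag, as in the POC quoted earlier in the thread). The record for an n-byte message takes ceil(bits(2n+1)/7) bytes, so the worst case is one record byte per one payload byte: half of the combined stream. The helper below is hypothetical, for the arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed by one ULEB128-style boundary record for a message of
 * `len` payload bytes, assuming the EOR flag bit is set (worst case). */
int boundary_entry_bytes(uint32_t len)
{
	uint64_t v = ((uint64_t)len << 1) | 1;
	int n = 0;

	do {
		v >>= 7;	/* one output byte per 7 value bits */
		n++;
	} while (v);
	return n;
}
```

Messages up to 63 bytes fit in a single record byte, and the record grows only logarithmically after that, so the relative overhead shrinks quickly for larger messages.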
end of thread, other threads:[~2026-05-08 10:11 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-04-30 12:26 [PATCH net] vsock/virtio: fix potential unbounded skb queue Eric Dumazet
2026-05-05  2:20 ` patchwork-bot+netdevbpf
2026-05-05 13:51 ` Stefano Garzarella
2026-05-05 14:14 ` Eric Dumazet
2026-05-05 16:11 ` Stefano Garzarella
2026-05-05 16:37 ` Bobby Eshleman
2026-05-06  9:50 ` Arseniy Krasnov
2026-05-06 14:00 ` Stefano Garzarella
2026-05-06 15:37 ` Michael S. Tsirkin
2026-05-07  9:09 ` Stefano Garzarella
2026-05-07 11:45 ` Michael S. Tsirkin
2026-05-07 12:59 ` Stefano Garzarella
2026-05-07 14:33 ` Jakub Kicinski
2026-05-07 16:05 ` Stefano Garzarella
2026-05-07 16:32 ` Eric Dumazet
2026-05-07 17:18 ` Jakub Kicinski
2026-05-08  9:26 ` Stefano Garzarella
2026-05-07 14:52 ` Michael S. Tsirkin
2026-05-07 22:48 ` Michael S. Tsirkin
2026-05-08  9:41 ` Stefano Garzarella
2026-05-08  9:43 ` Michael S. Tsirkin
2026-05-08  9:58 ` Michael S. Tsirkin
2026-05-08 10:11 ` Stefano Garzarella
2026-05-06 15:15 ` Michael S. Tsirkin
2026-05-06 15:38 ` Eric Dumazet
2026-05-06 15:49 ` Michael S. Tsirkin