* [PATCH net-next v4 1/4] vsock/virtio/vhost: read data from non-linear skb
2023-07-27 22:26 [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
@ 2023-07-27 22:26 ` Arseniy Krasnov
2023-07-27 22:26 ` [PATCH net-next v4 2/4] vsock/virtio: support to send " Arseniy Krasnov
` (3 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-27 22:26 UTC (permalink / raw)
To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
Jason Wang, Bobby Eshleman
Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
avkrasnov, Arseniy Krasnov
This is a preparation patch for MSG_ZEROCOPY support. It adds handling
of non-linear skbs by replacing direct calls of 'memcpy_to_msg()' with
'skb_copy_datagram_iter()'. The main advantage of the latter is that it
can handle the paged part of an skb by using 'kmap()' on each page; if
there are no pages in the skb, it behaves like a simple copy to an iov
iterator. This patch also adds a new field to the control block of the
skb - its value is the current offset in the skb at which to read the
next portion of data (no matter whether the skb is linear or not). The
idea behind this field is that 'skb_copy_datagram_iter()' handles both
types of skb internally - it just needs an offset from which to copy
data from the given skb. This offset is incremented on each read from
the skb. This approach avoids special handling of non-linear skbs:
1) We can't call 'skb_pull()' on them, because it updates the 'data'
pointer.
2) We would also need to update 'data_len' on each read from such an
skb.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
Changelog:
v5(big patchset) -> v1:
* Merge 'virtio_transport_common.c' and 'vhost/vsock.c' patches into
this single patch.
* Commit message update: grammar fix and remark that this patch is
MSG_ZEROCOPY preparation.
* Use 'min_t()' instead of comparison using '<>' operators.
v1 -> v2:
* R-b tag added.
v3 -> v4:
* R-b tag removed due to rebase:
* Part for 'virtio_transport_stream_do_peek()' is changed.
* Part for 'virtio_transport_seqpacket_do_peek()' is added.
* Comments about sleep in 'memcpy_to_msg()' now describe sleep in
'skb_copy_datagram_iter()'.
drivers/vhost/vsock.c | 14 +++++++----
include/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport_common.c | 32 +++++++++++++++----------
3 files changed, 29 insertions(+), 18 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 817d377a3f36..8c917be32b5d 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -114,6 +114,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct sk_buff *skb;
unsigned out, in;
size_t nbytes;
+ u32 frag_off;
int head;
skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
@@ -156,7 +157,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}
iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
- payload_len = skb->len;
+ frag_off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+ payload_len = skb->len - frag_off;
hdr = virtio_vsock_hdr(skb);
/* If the packet is greater than the space available in the
@@ -197,8 +199,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
- if (nbytes != payload_len) {
+ if (skb_copy_datagram_iter(skb,
+ frag_off,
+ &iov_iter,
+ payload_len)) {
kfree_skb(skb);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
@@ -212,13 +216,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
added = true;
- skb_pull(skb, payload_len);
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
total_len += payload_len;
/* If we didn't send all the payload we can requeue the packet
* to send it with the next available buffer.
*/
- if (skb->len > 0) {
+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
hdr->flags |= cpu_to_le32(flags_to_restore);
/* We are queueing the same skb to handle
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c58453699ee9..17dbb7176e37 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -12,6 +12,7 @@
struct virtio_vsock_skb_cb {
bool reply;
bool tap_delivered;
+ u32 frag_off;
};
#define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 352d042b130b..0b6a89139810 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -364,9 +364,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
spin_unlock_bh(&vvs->rx_lock);
/* sk_lock is held by caller so no one else can dequeue.
- * Unlock rx_lock since memcpy_to_msg() may sleep.
+ * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
*/
- err = memcpy_to_msg(msg, skb->data, bytes);
+ err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &msg->msg_iter, bytes);
if (err)
goto out;
@@ -410,25 +411,27 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
while (total < len && !skb_queue_empty(&vvs->rx_queue)) {
skb = skb_peek(&vvs->rx_queue);
- bytes = len - total;
- if (bytes > skb->len)
- bytes = skb->len;
+ bytes = min_t(size_t, len - total,
+ skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off);
/* sk_lock is held by caller so no one else can dequeue.
- * Unlock rx_lock since memcpy_to_msg() may sleep.
+ * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, skb->data, bytes);
+ err = skb_copy_datagram_iter(skb,
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &msg->msg_iter, bytes);
if (err)
goto out;
spin_lock_bh(&vvs->rx_lock);
total += bytes;
- skb_pull(skb, bytes);
- if (skb->len == 0) {
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
+
+ if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);
virtio_transport_dec_rx_pkt(vvs, pkt_len);
@@ -492,9 +495,10 @@ virtio_transport_seqpacket_do_peek(struct vsock_sock *vsk,
spin_unlock_bh(&vvs->rx_lock);
/* sk_lock is held by caller so no one else can dequeue.
- * Unlock rx_lock since memcpy_to_msg() may sleep.
+ * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
*/
- err = memcpy_to_msg(msg, skb->data, bytes);
+ err = skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &msg->msg_iter, bytes);
if (err)
return err;
@@ -553,11 +557,13 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
int err;
/* sk_lock is held by caller so no one else can dequeue.
- * Unlock rx_lock since memcpy_to_msg() may sleep.
+ * Unlock rx_lock since skb_copy_datagram_iter() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
+ err = skb_copy_datagram_iter(skb, 0,
+ &msg->msg_iter,
+ bytes_to_copy);
if (err) {
/* Copy of message failed. Rest of
* fragments will be freed without copy.
--
2.25.1
* [PATCH net-next v4 2/4] vsock/virtio: support to send non-linear skb
2023-07-27 22:26 [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
2023-07-27 22:26 ` [PATCH net-next v4 1/4] vsock/virtio/vhost: read data from non-linear skb Arseniy Krasnov
@ 2023-07-27 22:26 ` Arseniy Krasnov
2023-07-27 22:26 ` [PATCH net-next v4 3/4] vsock/virtio: non-linear skb handling for tap Arseniy Krasnov
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-27 22:26 UTC (permalink / raw)
To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
Jason Wang, Bobby Eshleman
Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
avkrasnov, Arseniy Krasnov
For a non-linear skb, use the pages from its fragment array as buffers
in the virtio tx queue. These pages are already pinned by
'get_user_pages()' when such an skb is created.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
Changelog:
v2 -> v3:
* Comment about 'page_to_virt()' is updated. I didn't remove the R-b
tag, as this change is quite small I guess.
net/vmw_vsock/virtio_transport.c | 41 +++++++++++++++++++++++++++-----
1 file changed, 35 insertions(+), 6 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..7bbcc8093e51 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];
for (;;) {
- struct scatterlist hdr, buf, *sgs[2];
+ /* +1 is for packet header. */
+ struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
+ struct scatterlist bufs[MAX_SKB_FRAGS + 1];
int ret, in_sg = 0, out_sg = 0;
struct sk_buff *skb;
bool reply;
@@ -111,12 +113,39 @@ virtio_transport_send_pkt_work(struct work_struct *work)
virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
+ sg_init_one(&bufs[out_sg], virtio_vsock_hdr(skb),
+ sizeof(*virtio_vsock_hdr(skb)));
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+
+ if (!skb_is_nonlinear(skb)) {
+ if (skb->len > 0) {
+ sg_init_one(&bufs[out_sg], skb->data, skb->len);
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+ }
+ } else {
+ struct skb_shared_info *si;
+ int i;
+
+ si = skb_shinfo(skb);
+
+ for (i = 0; i < si->nr_frags; i++) {
+ skb_frag_t *skb_frag = &si->frags[i];
+ void *va;
- sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
- sgs[out_sg++] = &hdr;
- if (skb->len > 0) {
- sg_init_one(&buf, skb->data, skb->len);
- sgs[out_sg++] = &buf;
+ /* We will use 'page_to_virt()' for the userspace page
+ * here, because virtio or dma-mapping layers will call
+ * 'virt_to_phys()' later to fill the buffer descriptor.
+ * We don't touch memory at "virtual" address of this page.
+ */
+ va = page_to_virt(skb_frag->bv_page);
+ sg_init_one(&bufs[out_sg],
+ va + skb_frag->bv_offset,
+ skb_frag->bv_len);
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+ }
}
ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
--
2.25.1
* [PATCH net-next v4 3/4] vsock/virtio: non-linear skb handling for tap
2023-07-27 22:26 [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
2023-07-27 22:26 ` [PATCH net-next v4 1/4] vsock/virtio/vhost: read data from non-linear skb Arseniy Krasnov
2023-07-27 22:26 ` [PATCH net-next v4 2/4] vsock/virtio: support to send " Arseniy Krasnov
@ 2023-07-27 22:26 ` Arseniy Krasnov
2023-07-27 22:26 ` [PATCH net-next v4 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
2023-07-28 5:45 ` [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Michael S. Tsirkin
4 siblings, 0 replies; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-27 22:26 UTC (permalink / raw)
To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
Jason Wang, Bobby Eshleman
Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
avkrasnov, Arseniy Krasnov
For the tap device, a new skb is created and data from the current skb
is copied to it. This patch adds copying of data from a non-linear skb
to the new skb.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 0b6a89139810..de7b44675254 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
return NULL;
}
+static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
+ void *dst,
+ size_t len)
+{
+ struct iov_iter iov_iter = { 0 };
+ struct kvec kvec;
+ size_t to_copy;
+
+ kvec.iov_base = dst;
+ kvec.iov_len = len;
+
+ iov_iter.iter_type = ITER_KVEC;
+ iov_iter.kvec = &kvec;
+ iov_iter.nr_segs = 1;
+
+ to_copy = min_t(size_t, len, skb->len);
+
+ skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &iov_iter, to_copy);
+}
+
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
size_t payload_len;
- void *payload_buf;
/* A packet could be split to fit the RX buffer, so we can retrieve
* the payload length from the header and the buffer pointer taking
@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
*/
pkt_hdr = virtio_vsock_hdr(pkt);
payload_len = pkt->len;
- payload_buf = pkt->data;
skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
GFP_ATOMIC);
@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
if (payload_len) {
- skb_put_data(skb, payload_buf, payload_len);
+ if (skb_is_nonlinear(pkt)) {
+ void *data = skb_put(skb, payload_len);
+
+ virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
+ } else {
+ skb_put_data(skb, pkt->data, payload_len);
+ }
}
return skb;
--
2.25.1
* [PATCH net-next v4 4/4] vsock/virtio: MSG_ZEROCOPY flag support
2023-07-27 22:26 [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
` (2 preceding siblings ...)
2023-07-27 22:26 ` [PATCH net-next v4 3/4] vsock/virtio: non-linear skb handling for tap Arseniy Krasnov
@ 2023-07-27 22:26 ` Arseniy Krasnov
2023-07-28 8:03 ` Arseniy Krasnov
2023-07-28 5:45 ` [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Michael S. Tsirkin
4 siblings, 1 reply; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-27 22:26 UTC (permalink / raw)
To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
Jason Wang, Bobby Eshleman
Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
avkrasnov, Arseniy Krasnov
This adds handling of the MSG_ZEROCOPY flag on the transmission path:
if this flag is set and zerocopy transmission is possible (enabled in
socket options and the transport allows zerocopy), then a non-linear
skb will be created and filled with the pages of the user's buffer.
The pages of the user's buffer are locked in memory by
'get_user_pages()'. The second thing this patch does is change the way
the skb is owned: instead of calling 'skb_set_owner_sk_safe()' it calls
'skb_set_owner_w()'. The reason for this change is that
'__zerocopy_sg_from_iter()' increments the 'sk_wmem_alloc' of the
socket, so to decrease this field correctly a proper skb destructor is
needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
Changelog:
v5(big patchset) -> v1:
* Refactorings of 'if' conditions.
* Remove extra blank line.
* Remove unneeded init of the 'frag_off' field.
* Add function 'virtio_transport_fill_skb()' which fills both linear
and non-linear skb with provided data.
v1 -> v2:
* Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
v2 -> v3:
* Add new transport callback: 'msgzerocopy_check_iov'. It checks that
the provided 'iov_iter' with data could be sent in zerocopy mode.
If this callback is not set in the transport, the transport allows
sending any 'iov_iter' in zerocopy mode. Otherwise, zerocopy is
allowed only if the callback returns 'true'. The reason for this
callback is that in case of G2H transmission we insert the whole skb
into the tx virtio queue, and such an skb must fit into the virtio
queue to be sent in a single iteration (maybe the tx logic in
'virtio_transport.c' could be reworked as in vhost to support partial
send of the current skb). This callback will be enabled only for the
G2H path. For details please see the comment 'Check that tx queue...'
below.
v3 -> v4:
* 'msgzerocopy_check_iov' moved from 'struct vsock_transport' to
'struct virtio_transport' as it is virtio specific callback and
never needed in other transports.
include/linux/virtio_vsock.h | 5 +
net/vmw_vsock/virtio_transport.c | 38 ++++
net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
3 files changed, 242 insertions(+), 58 deletions(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 17dbb7176e37..33d878abbfcb 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -160,6 +160,11 @@ struct virtio_transport {
/* Takes ownership of the packet */
int (*send_pkt)(struct sk_buff *skb);
+
+ /* Used in MSG_ZEROCOPY mode. Checks that provided data
+ * could be transmitted with zerocopy mode.
+ */
+ bool (*msgzerocopy_check_iov)(const struct iov_iter *iov);
};
ssize_t
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 7bbcc8093e51..f0abcaad5cfd 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}
+static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
+{
+ struct virtio_vsock *vsock;
+ bool res = false;
+
+ rcu_read_lock();
+
+ vsock = rcu_dereference(the_virtio_vsock);
+ if (vsock) {
+ struct virtqueue *vq;
+ int iov_pages;
+
+ vq = vsock->vqs[VSOCK_VQ_TX];
+
+ iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
+
+ /* Check that tx queue is large enough to keep whole
+ * data to send. This is needed, because when there is
+ * not enough free space in the queue, current skb to
+ * send will be reinserted to the head of tx list of
+ * the socket to retry transmission later, so if skb
+ * is bigger than whole queue, it will be reinserted
+ * again and again, thus blocking other skbs to be sent.
+ * Each page of the user provided buffer will be added
+ * as a single buffer to the tx virtqueue, so compare
+ * number of pages against maximum capacity of the queue.
+ * +1 means buffer for the packet header.
+ */
+ if (iov_pages + 1 <= vq->num_max)
+ res = true;
+ }
+
+ rcu_read_unlock();
+
+ return res;
+}
+
static bool virtio_transport_seqpacket_allow(u32 remote_cid);
static struct virtio_transport virtio_transport = {
@@ -491,6 +528,7 @@ static struct virtio_transport virtio_transport = {
},
.send_pkt = virtio_transport_send_pkt,
+ .msgzerocopy_check_iov = virtio_transport_msgzerocopy_check_iov,
};
static bool virtio_transport_seqpacket_allow(u32 remote_cid)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index de7b44675254..659f9d70266b 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
return container_of(t, struct virtio_transport, transport);
}
-/* Returns a new packet on success, otherwise returns NULL.
- *
- * If NULL is returned, errp is set to a negative errno.
- */
-static struct sk_buff *
-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
- size_t len,
- u32 src_cid,
- u32 src_port,
- u32 dst_cid,
- u32 dst_port)
-{
- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
- struct virtio_vsock_hdr *hdr;
- struct sk_buff *skb;
- void *payload;
- int err;
+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
+ size_t max_to_send)
+{
+ const struct virtio_transport *t_ops;
+ struct iov_iter *iov_iter;
- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
- if (!skb)
- return NULL;
+ if (!info->msg)
+ return false;
- hdr = virtio_vsock_hdr(skb);
- hdr->type = cpu_to_le16(info->type);
- hdr->op = cpu_to_le16(info->op);
- hdr->src_cid = cpu_to_le64(src_cid);
- hdr->dst_cid = cpu_to_le64(dst_cid);
- hdr->src_port = cpu_to_le32(src_port);
- hdr->dst_port = cpu_to_le32(dst_port);
- hdr->flags = cpu_to_le32(info->flags);
- hdr->len = cpu_to_le32(len);
+ iov_iter = &info->msg->msg_iter;
- if (info->msg && len > 0) {
- payload = skb_put(skb, len);
- err = memcpy_from_msg(payload, info->msg, len);
- if (err)
- goto out;
+ t_ops = virtio_transport_get_ops(info->vsk);
- if (msg_data_left(info->msg) == 0 &&
- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+ if (t_ops->msgzerocopy_check_iov &&
+ !t_ops->msgzerocopy_check_iov(iov_iter))
+ return false;
- if (info->msg->msg_flags & MSG_EOR)
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
- }
+ /* Data is simple buffer. */
+ if (iter_is_ubuf(iov_iter))
+ return true;
+
+ if (!iter_is_iovec(iov_iter))
+ return false;
+
+ if (iov_iter->iov_offset)
+ return false;
+
+ /* We can't send whole iov. */
+ if (iov_iter->count > max_to_send)
+ return false;
+
+ return true;
+}
+
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+ struct sk_buff *skb,
+ struct msghdr *msg,
+ bool zerocopy)
+{
+ struct ubuf_info *uarg;
+
+ if (msg->msg_ubuf) {
+ uarg = msg->msg_ubuf;
+ net_zcopy_get(uarg);
+ } else {
+ struct iov_iter *iter = &msg->msg_iter;
+ struct ubuf_info_msgzc *uarg_zc;
+ int len;
+
+ /* Only ITER_IOVEC or ITER_UBUF are allowed and
+ * checked before.
+ */
+ if (iter_is_iovec(iter))
+ len = iov_length(iter->__iov, iter->nr_segs);
+ else
+ len = iter->count;
+
+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+ len,
+ NULL);
+ if (!uarg)
+ return -1;
+
+ uarg_zc = uarg_to_msgzc(uarg);
+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
}
- if (info->reply)
- virtio_vsock_skb_set_reply(skb);
+ skb_zcopy_init(skb, uarg);
- trace_virtio_transport_alloc_pkt(src_cid, src_port,
- dst_cid, dst_port,
- len,
- info->type,
- info->op,
- info->flags);
+ return 0;
+}
- if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
- WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
- goto out;
+static int virtio_transport_fill_skb(struct sk_buff *skb,
+ struct virtio_vsock_pkt_info *info,
+ size_t len,
+ bool zcopy)
+{
+ if (zcopy) {
+ return __zerocopy_sg_from_iter(info->msg, NULL, skb,
+ &info->msg->msg_iter,
+ len);
+ } else {
+ void *payload;
+ int err;
+
+ payload = skb_put(skb, len);
+ err = memcpy_from_msg(payload, info->msg, len);
+ if (err)
+ return -1;
+
+ if (msg_data_left(info->msg))
+ return 0;
+
+ return 0;
}
+}
- return skb;
+static void virtio_transport_init_hdr(struct sk_buff *skb,
+ struct virtio_vsock_pkt_info *info,
+ u32 src_cid,
+ u32 src_port,
+ u32 dst_cid,
+ u32 dst_port,
+ size_t len)
+{
+ struct virtio_vsock_hdr *hdr;
-out:
- kfree_skb(skb);
- return NULL;
+ hdr = virtio_vsock_hdr(skb);
+ hdr->type = cpu_to_le16(info->type);
+ hdr->op = cpu_to_le16(info->op);
+ hdr->src_cid = cpu_to_le64(src_cid);
+ hdr->dst_cid = cpu_to_le64(dst_cid);
+ hdr->src_port = cpu_to_le32(src_port);
+ hdr->dst_port = cpu_to_le32(dst_port);
+ hdr->flags = cpu_to_le32(info->flags);
+ hdr->len = cpu_to_le32(len);
}
static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
@@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
return VIRTIO_VSOCK_TYPE_SEQPACKET;
}
+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t payload_len,
+ bool zcopy,
+ u32 src_cid,
+ u32 src_port,
+ u32 dst_cid,
+ u32 dst_port)
+{
+ struct sk_buff *skb;
+ size_t skb_len;
+
+ skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
+
+ if (!zcopy)
+ skb_len += payload_len;
+
+ skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+ if (!skb)
+ return NULL;
+
+ virtio_transport_init_hdr(skb, info, src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len);
+
+ /* Set owner here, because '__zerocopy_sg_from_iter()' uses
+ * owner of skb without check to update 'sk_wmem_alloc'.
+ */
+ if (vsk)
+ skb_set_owner_w(skb, sk_vsock(vsk));
+
+ if (info->msg && payload_len > 0) {
+ int err;
+
+ err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
+ if (err)
+ goto out;
+
+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
+ struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
+
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+
+ if (info->msg->msg_flags & MSG_EOR)
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+ }
+ }
+
+ if (info->reply)
+ virtio_vsock_skb_set_reply(skb);
+
+ trace_virtio_transport_alloc_pkt(src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len,
+ info->type,
+ info->op,
+ info->flags);
+
+ return skb;
+out:
+ kfree_skb(skb);
+ return NULL;
+}
+
/* This function can only be used on connecting/connected sockets,
* since a socket assigned to a transport is required.
*
@@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
struct virtio_vsock_pkt_info *info)
{
+ u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
u32 src_cid, src_port, dst_cid, dst_port;
const struct virtio_transport *t_ops;
struct virtio_vsock_sock *vvs;
u32 pkt_len = info->pkt_len;
+ bool can_zcopy = false;
u32 rest_len;
int ret;
@@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
return pkt_len;
+ if (info->msg) {
+ /* If zerocopy is not enabled by 'setsockopt()', we behave as
+ * there is no MSG_ZEROCOPY flag set.
+ */
+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
+ info->msg->msg_flags &= ~MSG_ZEROCOPY;
+
+ if (info->msg->msg_flags & MSG_ZEROCOPY)
+ can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
+
+ if (can_zcopy)
+ max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
+ (MAX_SKB_FRAGS * PAGE_SIZE));
+ }
+
rest_len = pkt_len;
do {
struct sk_buff *skb;
size_t skb_len;
- skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
+ skb_len = min(max_skb_len, rest_len);
- skb = virtio_transport_alloc_skb(info, skb_len,
+ skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
src_cid, src_port,
dst_cid, dst_port);
if (!skb) {
@@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
break;
}
+ /* This is last skb to send this portion of data. */
+ if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
+ skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
+ if (virtio_transport_init_zcopy_skb(vsk, skb,
+ info->msg,
+ can_zcopy)) {
+ ret = -ENOMEM;
+ break;
+ }
+ }
+
virtio_transport_inc_tx_pkt(vvs, skb);
ret = t_ops->send_pkt(skb);
@@ -985,7 +1126,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
if (!t)
return -ENOTCONN;
- reply = virtio_transport_alloc_skb(&info, 0,
+ reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
le64_to_cpu(hdr->dst_cid),
le32_to_cpu(hdr->dst_port),
le64_to_cpu(hdr->src_cid),
--
2.25.1
* Re: [PATCH net-next v4 4/4] vsock/virtio: MSG_ZEROCOPY flag support
2023-07-27 22:26 ` [PATCH net-next v4 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
@ 2023-07-28 8:03 ` Arseniy Krasnov
0 siblings, 0 replies; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-28 8:03 UTC (permalink / raw)
To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
Jason Wang, Bobby Eshleman
Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa
On 28.07.2023 01:26, Arseniy Krasnov wrote:
> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> flag is set and zerocopy transmission is possible (enabled in socket
> options and transport allows zerocopy), then non-linear skb will be
> created and filled with the pages of user's buffer. Pages of user's
> buffer are locked in memory by 'get_user_pages()'. Second thing that
> this patch does is replace type of skb owning: instead of calling
> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> of socket, so to decrease this field correctly proper skb destructor is
> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>
> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> ---
> Changelog:
> v5(big patchset) -> v1:
> * Refactorings of 'if' conditions.
> * Remove extra blank line.
> * Remove 'frag_off' field unneeded init.
> * Add function 'virtio_transport_fill_skb()' which fills both linear
> and non-linear skb with provided data.
> v1 -> v2:
> * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> v2 -> v3:
> * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> provided 'iov_iter' with data could be sent in a zerocopy mode.
> If this callback is not set in transport - transport allows to send
> any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> then zerocopy is allowed. Reason of this callback is that in case of
> G2H transmission we insert whole skb to the tx virtio queue and such
> skb must fit to the size of the virtio queue to be sent in a single
> iteration (may be tx logic in 'virtio_transport.c' could be reworked
> as in vhost to support partial send of current skb). This callback
> will be enabled only for G2H path. For details pls see comment
> 'Check that tx queue...' below.
> v3 -> v4:
> * 'msgzerocopy_check_iov' moved from 'struct vsock_transport' to
> 'struct virtio_transport' as it is virtio specific callback and
> never needed in other transports.
>
> include/linux/virtio_vsock.h | 5 +
> net/vmw_vsock/virtio_transport.c | 38 ++++
> net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> 3 files changed, 242 insertions(+), 58 deletions(-)
>
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 17dbb7176e37..33d878abbfcb 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -160,6 +160,11 @@ struct virtio_transport {
>
> /* Takes ownership of the packet */
> int (*send_pkt)(struct sk_buff *skb);
> +
> + /* Used in MSG_ZEROCOPY mode. Checks that provided data
> + * could be transmitted with zerocopy mode.
> + */
> + bool (*msgzerocopy_check_iov)(const struct iov_iter *iov);
> };
>
> ssize_t
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 7bbcc8093e51..f0abcaad5cfd 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> }
>
> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> +{
> + struct virtio_vsock *vsock;
> + bool res = false;
> +
> + rcu_read_lock();
> +
> + vsock = rcu_dereference(the_virtio_vsock);
> + if (vsock) {
> + struct virtqueue *vq;
> + int iov_pages;
> +
> + vq = vsock->vqs[VSOCK_VQ_TX];
> +
> + iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> +
> + /* Check that tx queue is large enough to keep whole
> + * data to send. This is needed, because when there is
> + * not enough free space in the queue, current skb to
> + * send will be reinserted to the head of tx list of
> + * the socket to retry transmission later, so if skb
> + * is bigger than whole queue, it will be reinserted
> + * again and again, thus blocking other skbs to be sent.
> + * Each page of the user provided buffer will be added
> + * as a single buffer to the tx virtqueue, so compare
> + * number of pages against maximum capacity of the queue.
> + * +1 means buffer for the packet header.
> + */
> + if (iov_pages + 1 <= vq->num_max)
^^^
Oops, there is a bug here, it must be
if (min(iov_pages, MAX_SKB_FRAGS) + 1 <= vq->num_max)
This is because the whole iov is not inserted into the tx queue. The
iov is split by the MAX_SKB_FRAGS limit (in the tx loop in
virtio_transport_common.c) and then every resulting skb is inserted
into the tx queue. I'll fix it in v5, sorry
Thanks, Arseniy
> + res = true;
> + }
> +
> + rcu_read_unlock();
> +
> + return res;
> +}
> +
> static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>
> static struct virtio_transport virtio_transport = {
> @@ -491,6 +528,7 @@ static struct virtio_transport virtio_transport = {
> },
>
> .send_pkt = virtio_transport_send_pkt,
> + .msgzerocopy_check_iov = virtio_transport_msgzerocopy_check_iov,
> };
>
> static bool virtio_transport_seqpacket_allow(u32 remote_cid)
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index de7b44675254..659f9d70266b 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> return container_of(t, struct virtio_transport, transport);
> }
>
> -/* Returns a new packet on success, otherwise returns NULL.
> - *
> - * If NULL is returned, errp is set to a negative errno.
> - */
> -static struct sk_buff *
> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> - size_t len,
> - u32 src_cid,
> - u32 src_port,
> - u32 dst_cid,
> - u32 dst_port)
> -{
> - const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> - struct virtio_vsock_hdr *hdr;
> - struct sk_buff *skb;
> - void *payload;
> - int err;
> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> + size_t max_to_send)
> +{
> + const struct virtio_transport *t_ops;
> + struct iov_iter *iov_iter;
>
> - skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> - if (!skb)
> - return NULL;
> + if (!info->msg)
> + return false;
>
> - hdr = virtio_vsock_hdr(skb);
> - hdr->type = cpu_to_le16(info->type);
> - hdr->op = cpu_to_le16(info->op);
> - hdr->src_cid = cpu_to_le64(src_cid);
> - hdr->dst_cid = cpu_to_le64(dst_cid);
> - hdr->src_port = cpu_to_le32(src_port);
> - hdr->dst_port = cpu_to_le32(dst_port);
> - hdr->flags = cpu_to_le32(info->flags);
> - hdr->len = cpu_to_le32(len);
> + iov_iter = &info->msg->msg_iter;
>
> - if (info->msg && len > 0) {
> - payload = skb_put(skb, len);
> - err = memcpy_from_msg(payload, info->msg, len);
> - if (err)
> - goto out;
> + t_ops = virtio_transport_get_ops(info->vsk);
>
> - if (msg_data_left(info->msg) == 0 &&
> - info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> - hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> + if (t_ops->msgzerocopy_check_iov &&
> + !t_ops->msgzerocopy_check_iov(iov_iter))
> + return false;
>
> - if (info->msg->msg_flags & MSG_EOR)
> - hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> - }
> + /* Data is simple buffer. */
> + if (iter_is_ubuf(iov_iter))
> + return true;
> +
> + if (!iter_is_iovec(iov_iter))
> + return false;
> +
> + if (iov_iter->iov_offset)
> + return false;
> +
> + /* We can't send whole iov. */
> + if (iov_iter->count > max_to_send)
> + return false;
> +
> + return true;
> +}
> +
> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> + struct sk_buff *skb,
> + struct msghdr *msg,
> + bool zerocopy)
> +{
> + struct ubuf_info *uarg;
> +
> + if (msg->msg_ubuf) {
> + uarg = msg->msg_ubuf;
> + net_zcopy_get(uarg);
> + } else {
> + struct iov_iter *iter = &msg->msg_iter;
> + struct ubuf_info_msgzc *uarg_zc;
> + int len;
> +
> + /* Only ITER_IOVEC or ITER_UBUF are allowed and
> + * checked before.
> + */
> + if (iter_is_iovec(iter))
> + len = iov_length(iter->__iov, iter->nr_segs);
> + else
> + len = iter->count;
> +
> + uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> + len,
> + NULL);
> + if (!uarg)
> + return -1;
> +
> + uarg_zc = uarg_to_msgzc(uarg);
> + uarg_zc->zerocopy = zerocopy ? 1 : 0;
> }
>
> - if (info->reply)
> - virtio_vsock_skb_set_reply(skb);
> + skb_zcopy_init(skb, uarg);
>
> - trace_virtio_transport_alloc_pkt(src_cid, src_port,
> - dst_cid, dst_port,
> - len,
> - info->type,
> - info->op,
> - info->flags);
> + return 0;
> +}
>
> - if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> - WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> - goto out;
> +static int virtio_transport_fill_skb(struct sk_buff *skb,
> + struct virtio_vsock_pkt_info *info,
> + size_t len,
> + bool zcopy)
> +{
> + if (zcopy) {
> + return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> + &info->msg->msg_iter,
> + len);
> + } else {
> + void *payload;
> + int err;
> +
> + payload = skb_put(skb, len);
> + err = memcpy_from_msg(payload, info->msg, len);
> + if (err)
> + return -1;
> +
> + if (msg_data_left(info->msg))
> + return 0;
> +
> + return 0;
> }
> +}
>
> - return skb;
> +static void virtio_transport_init_hdr(struct sk_buff *skb,
> + struct virtio_vsock_pkt_info *info,
> + u32 src_cid,
> + u32 src_port,
> + u32 dst_cid,
> + u32 dst_port,
> + size_t len)
> +{
> + struct virtio_vsock_hdr *hdr;
>
> -out:
> - kfree_skb(skb);
> - return NULL;
> + hdr = virtio_vsock_hdr(skb);
> + hdr->type = cpu_to_le16(info->type);
> + hdr->op = cpu_to_le16(info->op);
> + hdr->src_cid = cpu_to_le64(src_cid);
> + hdr->dst_cid = cpu_to_le64(dst_cid);
> + hdr->src_port = cpu_to_le32(src_port);
> + hdr->dst_port = cpu_to_le32(dst_port);
> + hdr->flags = cpu_to_le32(info->flags);
> + hdr->len = cpu_to_le32(len);
> }
>
> static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
> return VIRTIO_VSOCK_TYPE_SEQPACKET;
> }
>
> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> + struct virtio_vsock_pkt_info *info,
> + size_t payload_len,
> + bool zcopy,
> + u32 src_cid,
> + u32 src_port,
> + u32 dst_cid,
> + u32 dst_port)
> +{
> + struct sk_buff *skb;
> + size_t skb_len;
> +
> + skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> +
> + if (!zcopy)
> + skb_len += payload_len;
> +
> + skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> + if (!skb)
> + return NULL;
> +
> + virtio_transport_init_hdr(skb, info, src_cid, src_port,
> + dst_cid, dst_port,
> + payload_len);
> +
> + /* Set owner here, because '__zerocopy_sg_from_iter()' uses
> + * owner of skb without check to update 'sk_wmem_alloc'.
> + */
> + if (vsk)
> + skb_set_owner_w(skb, sk_vsock(vsk));
> +
> + if (info->msg && payload_len > 0) {
> + int err;
> +
> + err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> + if (err)
> + goto out;
> +
> + if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> + struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> +
> + hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> +
> + if (info->msg->msg_flags & MSG_EOR)
> + hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> + }
> + }
> +
> + if (info->reply)
> + virtio_vsock_skb_set_reply(skb);
> +
> + trace_virtio_transport_alloc_pkt(src_cid, src_port,
> + dst_cid, dst_port,
> + payload_len,
> + info->type,
> + info->op,
> + info->flags);
> +
> + return skb;
> +out:
> + kfree_skb(skb);
> + return NULL;
> +}
> +
> /* This function can only be used on connecting/connected sockets,
> * since a socket assigned to a transport is required.
> *
> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
> static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> struct virtio_vsock_pkt_info *info)
> {
> + u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> u32 src_cid, src_port, dst_cid, dst_port;
> const struct virtio_transport *t_ops;
> struct virtio_vsock_sock *vvs;
> u32 pkt_len = info->pkt_len;
> + bool can_zcopy = false;
> u32 rest_len;
> int ret;
>
> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> return pkt_len;
>
> + if (info->msg) {
> + /* If zerocopy is not enabled by 'setsockopt()', we behave as
> + * there is no MSG_ZEROCOPY flag set.
> + */
> + if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> + info->msg->msg_flags &= ~MSG_ZEROCOPY;
> +
> + if (info->msg->msg_flags & MSG_ZEROCOPY)
> + can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> +
> + if (can_zcopy)
> + max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> + (MAX_SKB_FRAGS * PAGE_SIZE));
> + }
> +
> rest_len = pkt_len;
>
> do {
> struct sk_buff *skb;
> size_t skb_len;
>
> - skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> + skb_len = min(max_skb_len, rest_len);
>
> - skb = virtio_transport_alloc_skb(info, skb_len,
> + skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
> src_cid, src_port,
> dst_cid, dst_port);
> if (!skb) {
> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> break;
> }
>
> + /* This is last skb to send this portion of data. */
> + if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> + skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> + if (virtio_transport_init_zcopy_skb(vsk, skb,
> + info->msg,
> + can_zcopy)) {
> + ret = -ENOMEM;
> + break;
> + }
> + }
> +
> virtio_transport_inc_tx_pkt(vvs, skb);
>
> ret = t_ops->send_pkt(skb);
> @@ -985,7 +1126,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> if (!t)
> return -ENOTCONN;
>
> - reply = virtio_transport_alloc_skb(&info, 0,
> + reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> le64_to_cpu(hdr->dst_cid),
> le32_to_cpu(hdr->dst_port),
> le64_to_cpu(hdr->src_cid),
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations
2023-07-27 22:26 [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
` (3 preceding siblings ...)
2023-07-27 22:26 ` [PATCH net-next v4 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
@ 2023-07-28 5:45 ` Michael S. Tsirkin
2023-07-28 8:00 ` Arseniy Krasnov
2023-07-30 8:57 ` Arseniy Krasnov
4 siblings, 2 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2023-07-28 5:45 UTC (permalink / raw)
To: Arseniy Krasnov
Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
Bobby Eshleman, kvm, virtualization, netdev, linux-kernel, kernel,
oxffffaa
On Fri, Jul 28, 2023 at 01:26:23AM +0300, Arseniy Krasnov wrote:
> Hello,
>
> this patchset is first of three parts of another big patchset for
> MSG_ZEROCOPY flag support:
> https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/
overall looks good. Two points I'd like to see addressed:
- what's the performance with all these changes - still the same?
- most systems have a copybreak scheme where buffers smaller than a
  given size are copied directly. This will address the regression you
  see with small buffers - but we need to find that value. We know it's
  between 4k and 32k :)
> During review of this series, Stefano Garzarella <sgarzare@redhat.com>
> suggested splitting it into three parts to simplify review and merging:
>
> 1) virtio and vhost updates (for fragged skbs) <--- this patchset
> 2) AF_VSOCK updates (allows to enable MSG_ZEROCOPY mode and read
> tx completions) and update for Documentation/.
> 3) Updates for tests and utils.
>
> This series enables handling of fragged skbs in virtio and vhost parts.
> The new logic won't be triggered yet, because the SO_ZEROCOPY option is
> still impossible to enable at this moment (the next bunch of patches
> from the big set above will enable it).
>
> I've included changelog to some patches anyway, because there were some
> comments during review of last big patchset from the link above.
>
> Head for this patchset is 9d0cd5d25f7d45bce01bbb3193b54ac24b3a60f3
>
> Link to v1:
> https://lore.kernel.org/netdev/20230717210051.856388-1-AVKrasnov@sberdevices.ru/
> Link to v2:
> https://lore.kernel.org/netdev/20230718180237.3248179-1-AVKrasnov@sberdevices.ru/
> Link to v3:
> https://lore.kernel.org/netdev/20230720214245.457298-1-AVKrasnov@sberdevices.ru/
>
> Changelog:
> * Patchset rebased and tested on new HEAD of net-next (see hash above).
> * See per-patch changelog after ---.
>
> Arseniy Krasnov (4):
> vsock/virtio/vhost: read data from non-linear skb
> vsock/virtio: support to send non-linear skb
> vsock/virtio: non-linear skb handling for tap
> vsock/virtio: MSG_ZEROCOPY flag support
>
> drivers/vhost/vsock.c | 14 +-
> include/linux/virtio_vsock.h | 6 +
> net/vmw_vsock/virtio_transport.c | 79 +++++-
> net/vmw_vsock/virtio_transport_common.c | 312 ++++++++++++++++++------
> 4 files changed, 330 insertions(+), 81 deletions(-)
>
> --
> 2.25.1
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations
2023-07-28 5:45 ` [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Michael S. Tsirkin
@ 2023-07-28 8:00 ` Arseniy Krasnov
2023-07-28 8:05 ` Arseniy Krasnov
2023-07-30 8:57 ` Arseniy Krasnov
1 sibling, 1 reply; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-28 8:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
Bobby Eshleman, kvm, virtualization, netdev, linux-kernel, kernel,
oxffffaa
On 28.07.2023 08:45, Michael S. Tsirkin wrote:
> On Fri, Jul 28, 2023 at 01:26:23AM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>> this patchset is first of three parts of another big patchset for
>> MSG_ZEROCOPY flag support:
>> https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/
>
> overall looks good. Two points I'd like to see addressed:
Thanks!
> - what's the performance with all these changes - still same?
Yes, I performed quick tests and the results seem to be the same. This is
because the recently implemented logic, which compares the payload size
against 'num_max', is for the "emergency" case and is not triggered in
the default environment. Anyway, I'll retest at least in the nested guest
case.
> - most systems have a copybreak scheme where buffers
> smaller than a given size are copied directly.
> This will address regression you see with small buffers -
> but need to find that value. we know it's between 4k and 32k :)
I see. You suggest finding this value and adding this check to decide
whether to use zerocopy or copy?
Thanks, Arseniy
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations
2023-07-28 8:00 ` Arseniy Krasnov
@ 2023-07-28 8:05 ` Arseniy Krasnov
0 siblings, 0 replies; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-28 8:05 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
Bobby Eshleman, kvm, virtualization, netdev, linux-kernel, kernel,
oxffffaa
On 28.07.2023 11:00, Arseniy Krasnov wrote:
>
>
> On 28.07.2023 08:45, Michael S. Tsirkin wrote:
>> On Fri, Jul 28, 2023 at 01:26:23AM +0300, Arseniy Krasnov wrote:
>>> Hello,
>>>
>>> this patchset is first of three parts of another big patchset for
>>> MSG_ZEROCOPY flag support:
>>> https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/
>>
>> overall looks good. Two points I'd like to see addressed:
>
> Thanks!
>
>> - what's the performance with all these changes - still same?
>
> Yes, I perform quick tests and seems result are same. This is because last
> implemented logic when I compare size of payload against 'num_max' is
> for "emergency" case and not triggered in default environment. Anyway, I'll
> perform retest at least in nested guest case.
The "default environment" is vanilla QEMU, where the queue size is 128
elements. To test this logic, I rebuilt QEMU with, for example, a queue
of 8 elements.
Thanks, Arseniy
>
>> - most systems have a copybreak scheme where buffers
>> smaller than a given size are copied directly.
>> This will address regression you see with small buffers -
>> but need to find that value. we know it's between 4k and 32k :)
>
> I see, You suggest to find this value and add this check for decision to
> use zerocopy or copy ?
>
> Thanks, Arseniy
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations
2023-07-28 5:45 ` [PATCH net-next v4 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Michael S. Tsirkin
2023-07-28 8:00 ` Arseniy Krasnov
@ 2023-07-30 8:57 ` Arseniy Krasnov
1 sibling, 0 replies; 10+ messages in thread
From: Arseniy Krasnov @ 2023-07-30 8:57 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
Bobby Eshleman, kvm, virtualization, netdev, linux-kernel, kernel,
oxffffaa
On 28.07.2023 08:45, Michael S. Tsirkin wrote:
> On Fri, Jul 28, 2023 at 01:26:23AM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>> this patchset is first of three parts of another big patchset for
>> MSG_ZEROCOPY flag support:
>> https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/
>
> overall looks good. Two points I'd like to see addressed:
> - what's the performance with all these changes - still same?
Hello Michael,
here are the results for the latest version:

There is some difference between these numbers and the numbers from the
link above (which was v3). It looks like the new version of zerocopy has
become slower on big buffers, but it is still faster than copy mode in
all cases (except the line marked with <<<<<< below, but I had the same
result for this testcase in v3 as well).

I tried to find the reason for this difference by switching back to v3,
but it seems that is not easy - I get the current results again. I guess
the reason may be:
1) My environment changed - I run this test in nested virtualization
   mode, so the host OS may also affect performance.
2) A mistake of mine in v3 :(

Anyway:
1) MSG_ZEROCOPY is still faster than copy, as expected.
2) I've added a column with a benchmark on 'net-next' without the
   MSG_ZEROCOPY patchset. It seems the patchset doesn't affect copy
   performance. Differences like 26 against 29 are not a big deal - the
   result is unstable with some error, e.g. if you run the same test
   again, you can get the opposite result, like 29 against 26.
3) The numbers below can be considered valid. This is the newest
   measurement.
G2H transmission (values are Gbit/s):
Core i7 with nested guest.
*----------*------*----------*-----------------------*
| buf size | copy | zerocopy | copy w/o MSG_ZEROCOPY |
|          |      |          | patchset              |
*----------*------*----------*-----------------------*
| 4KB      |    3 |       11 |                     3 |
| 32KB     |    9 |       70 |                    10 |
| 256KB    |   30 |      224 |                    29 |
| 1M       |   27 |      285 |                    30 |
| 8M       |   26 |      365 |                    29 |
*----------*------*----------*-----------------------*
H2G:
Core i7 with nested guest.
*----------*------*----------*-----------------------*
| buf size | copy | zerocopy | copy w/o MSG_ZEROCOPY |
|          |      |          | patchset              |
*----------*------*----------*-----------------------*
| 4KB      |   17 |       10 |                    17 | <<<<<<
| 32KB     |   30 |       61 |                    31 |
| 256KB    |   35 |      214 |                    30 |
| 1M       |   29 |      292 |                    28 |
| 8M       |   28 |      341 |                    28 |
*----------*------*----------*-----------------------*
Loopback:
Core i7 with nested guest.
*----------*------*----------*-----------------------*
| buf size | copy | zerocopy | copy w/o MSG_ZEROCOPY |
|          |      |          | patchset              |
*----------*------*----------*-----------------------*
| 4KB      |    8 |        7 |                     8 |
| 32KB     |   27 |       43 |                    30 |
| 256KB    |   38 |      100 |                    39 |
| 1M       |   37 |      141 |                    39 |
| 8M       |   40 |      201 |                    36 |
*----------*------*----------*-----------------------*
Thanks, Arseniy
> - most systems have a copybreak scheme where buffers
> smaller than a given size are copied directly.
> This will address regression you see with small buffers -
> but need to find that value. we know it's between 4k and 32k :)
^ permalink raw reply [flat|nested] 10+ messages in thread