* [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx)
@ 2024-10-30 8:24 Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 01/13] virtio_ring: introduce vring_need_unmap_buffer Xuan Zhuo
` (12 more replies)
0 siblings, 13 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
v2:
1. use a new API to submit premapped buffers instead of using the sgs to pass this info
2. some small fixes for http://lore.kernel.org/all/20240924013204.13763-1-xuanzhuo@linux.alibaba.com

v1:
1. some small fixes for http://lore.kernel.org/all/20240820073330.9161-1-xuanzhuo@linux.alibaba.com:
   1. fix the titles of commits #2 and #3
   2. fix the gcc error in commit #3
   3. use virtqueue_dma_xxxx for the tx hdr
   4. rename virtnet_ptr_to_xsk to virtnet_ptr_to_xsk_buff_len
   5. squash #11 of the last patch set into #10
================================================================================
## AF_XDP
XDP socket (AF_XDP) is an excellent kernel-bypass networking framework. The
zero copy feature of xsk (XDP socket) needs to be supported by the driver,
and its performance is very good. mlx5 and Intel ixgbe already support this
feature; this patch set allows virtio-net to support xsk's zerocopy xmit
feature.
At present, we have completed some preparation:
1. vq-reset (virtio spec and kernel code)
2. virtio-core premapped dma
3. virtio-net xdp refactor
So it is time for virtio-net to complete its support for XDP socket
zerocopy.
Virtio-net cannot increase the number of queues at will, so the xsk shares
queues with the kernel.
This patch set includes some refactoring of virtio-net to support AF_XDP.
## About virtio premapped mode
The current configuration sets the virtqueue (vq) to premapped mode,
implying that all buffers submitted to this queue must be mapped ahead
of time. This presents a challenge for the virtnet send queue (sq): the
virtnet driver would be required to keep track of dma information for vq
size * 17, which can be substantial. However, if the premapped mode were
applied on a per-buffer basis, the complexity would be greatly reduced.
With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
skb buffers could remain unmapped.
In this version the driver marks this per buffer through a new 'premapped'
argument to the add APIs (instead of passing the info through the sgs).
When a buffer is submitted as premapped, the virtio core directly uses
sg_dma_address(sg) without doing any internal DMA mapping, and DMA unmap
operations for that buffer are bypassed.
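For illustration, a minimal sketch (using virtqueue_add_outbuf_premapped(),
the API added in patch 5 of this series) of how a driver could submit one
buffer it has already DMA-mapped:

/* Illustrative sketch only: submit one buffer whose DMA mapping was
 * already done by the driver (e.g. an AF_XDP frame). The sg carries
 * only the DMA address and length; premapped=true tells the virtio
 * core to use sg_dma_address(sg) directly and to skip its own
 * map/unmap for this buffer.
 */
static int submit_premapped_tx(struct virtqueue *vq, struct scatterlist *sg,
			       dma_addr_t addr, u32 len, void *token)
{
	sg_init_table(sg, 1);
	sg->dma_address = addr;
	sg->length = len;

	return virtqueue_add_outbuf_premapped(vq, sg, 1, token, true,
					      GFP_ATOMIC);
}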
## performance
ENV: Qemu with vhost-user (polling mode).
Host CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
### virtio PMD in guest with testpmd
testpmd> show port stats all
######################## NIC statistics for port 0 ########################
RX-packets: 19531092064 RX-missed: 0 RX-bytes: 1093741155584
RX-errors: 0
RX-nombuf: 0
TX-packets: 5959955552 TX-errors: 0 TX-bytes: 371030645664
Throughput (since last show)
Rx-pps: 8861574 Rx-bps: 3969985208
Tx-pps: 8861493 Tx-bps: 3969962736
############################################################################
### AF_XDP PMD in guest with testpmd
testpmd> show port stats all
######################## NIC statistics for port 0 ########################
RX-packets: 68152727 RX-missed: 0 RX-bytes: 3816552712
RX-errors: 0
RX-nombuf: 0
TX-packets: 68114967 TX-errors: 33216 TX-bytes: 3814438152
Throughput (since last show)
Rx-pps: 6333196 Rx-bps: 2837272088
Tx-pps: 6333227 Tx-bps: 2837285936
############################################################################
But AF_XDP consumes more CPU for the tx and rx napi (100% and 86%, respectively).
Please review.
Thanks.
Xuan Zhuo (13):
virtio_ring: introduce vring_need_unmap_buffer
virtio_ring: split: record extras for indirect buffers
virtio_ring: packed: record extras for indirect buffers
virtio_ring: perform premapped operations based on per-buffer
virtio_ring: introduce add api for premapped
virtio-net: rq submits premapped per-buffer
virtio_ring: remove API virtqueue_set_dma_premapped
virtio_net: refactor the xmit type
virtio_net: xsk: bind/unbind xsk for tx
virtio_net: xsk: prevent disable tx napi
virtio_net: xsk: tx: support xmit xsk buffer
virtio_net: update tx timeout record
virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY
drivers/net/virtio_net.c | 358 +++++++++++++++++++++++++++++------
drivers/virtio/virtio_ring.c | 340 ++++++++++++++++-----------------
include/linux/virtio.h | 15 +-
3 files changed, 483 insertions(+), 230 deletions(-)
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 01/13] virtio_ring: introduce vring_need_unmap_buffer
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers Xuan Zhuo
` (11 subsequent siblings)
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
To make the code readable, introduce vring_need_unmap_buffer() to
replace do_unmap.
   use_dma_api  premapped  ->  vring_need_unmap_buffer()
1. false        false          false
2. true         false          true
3. true         true           false
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/virtio/virtio_ring.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 98374ed7c577..97590c201aa2 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -175,11 +175,6 @@ struct vring_virtqueue {
/* Do DMA mapping by driver */
bool premapped;
- /* Do unmap or not for desc. Just when premapped is False and
- * use_dma_api is true, this is true.
- */
- bool do_unmap;
-
/* Head of free buffer list. */
unsigned int free_head;
/* Number we've added since last sync. */
@@ -297,6 +292,11 @@ static bool vring_use_dma_api(const struct virtio_device *vdev)
return false;
}
+static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring)
+{
+ return vring->use_dma_api && !vring->premapped;
+}
+
size_t virtio_max_dma_size(const struct virtio_device *vdev)
{
size_t max_segment_size = SIZE_MAX;
@@ -445,7 +445,7 @@ static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
{
u16 flags;
- if (!vq->do_unmap)
+ if (!vring_need_unmap_buffer(vq))
return;
flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
@@ -475,7 +475,7 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
(flags & VRING_DESC_F_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
- if (!vq->do_unmap)
+ if (!vring_need_unmap_buffer(vq))
goto out;
dma_unmap_page(vring_dma_dev(vq),
@@ -643,7 +643,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
}
/* Last one doesn't continue. */
desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
- if (!indirect && vq->do_unmap)
+ if (!indirect && vring_need_unmap_buffer(vq))
vq->split.desc_extra[prev & (vq->split.vring.num - 1)].flags &=
~VRING_DESC_F_NEXT;
@@ -802,7 +802,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
VRING_DESC_F_INDIRECT));
BUG_ON(len == 0 || len % sizeof(struct vring_desc));
- if (vq->do_unmap) {
+ if (vring_need_unmap_buffer(vq)) {
for (j = 0; j < len / sizeof(struct vring_desc); j++)
vring_unmap_one_split_indirect(vq, &indir_desc[j]);
}
@@ -1236,7 +1236,7 @@ static void vring_unmap_extra_packed(const struct vring_virtqueue *vq,
(flags & VRING_DESC_F_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
- if (!vq->do_unmap)
+ if (!vring_need_unmap_buffer(vq))
return;
dma_unmap_page(vring_dma_dev(vq),
@@ -1251,7 +1251,7 @@ static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
{
u16 flags;
- if (!vq->do_unmap)
+ if (!vring_need_unmap_buffer(vq))
return;
flags = le16_to_cpu(desc->flags);
@@ -1632,7 +1632,7 @@ static void detach_buf_packed(struct vring_virtqueue *vq,
if (!desc)
return;
- if (vq->do_unmap) {
+ if (vring_need_unmap_buffer(vq)) {
len = vq->packed.desc_extra[id].len;
for (i = 0; i < len / sizeof(struct vring_packed_desc);
i++)
@@ -2091,7 +2091,6 @@ static struct virtqueue *vring_create_virtqueue_packed(
vq->dma_dev = dma_dev;
vq->use_dma_api = vring_use_dma_api(vdev);
vq->premapped = false;
- vq->do_unmap = vq->use_dma_api;
vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC) &&
!context;
@@ -2636,7 +2635,6 @@ static struct virtqueue *__vring_new_virtqueue(unsigned int index,
vq->dma_dev = dma_dev;
vq->use_dma_api = vring_use_dma_api(vdev);
vq->premapped = false;
- vq->do_unmap = vq->use_dma_api;
vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC) &&
!context;
@@ -2799,7 +2797,6 @@ int virtqueue_set_dma_premapped(struct virtqueue *_vq)
}
vq->premapped = true;
- vq->do_unmap = false;
END_USE(vq);
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 01/13] virtio_ring: introduce vring_need_unmap_buffer Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-11-05 3:42 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 03/13] virtio_ring: packed: " Xuan Zhuo
` (10 subsequent siblings)
12 siblings, 1 reply; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
The subsequent commit needs to know whether every indirect buffer is
premapped or not. So we need to introduce an extra struct for every
indirect buffer to record this info.
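For illustration, a sketch of the single allocation used below (mirroring
alloc_indirect_split() in this patch): the extra table sits directly behind
the descriptor table, so one kmalloc() covers both.

/* Sketch of the layout of the indirect table allocation:
 *
 *   desc                         extra = (void *)&desc[total_sg]
 *   |                            |
 *   v                            v
 *   +----------------------------+-------------------------------+
 *   | total_sg * vring_desc      | total_sg * vring_desc_extra   |
 *   +----------------------------+-------------------------------+
 */
size = sizeof(struct vring_desc) * total_sg +
       sizeof(struct vring_desc_extra) * total_sg;
desc = kmalloc(size, gfp);
extra = (struct vring_desc_extra *)&desc[total_sg];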
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
drivers/virtio/virtio_ring.c | 112 ++++++++++++++++-------------------
1 file changed, 52 insertions(+), 60 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 97590c201aa2..dca093744fe1 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -69,7 +69,11 @@
struct vring_desc_state_split {
void *data; /* Data for callback. */
- struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
+
+ /* Indirect extra table and desc table, if any. These two will be
+ * allocated together. So we won't stress more to the memory allocator.
+ */
+ struct vring_desc *indir_desc;
};
struct vring_desc_state_packed {
@@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue *vq, u32 num)
* Split ring specific functions - *_split().
*/
-static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
- const struct vring_desc *desc)
-{
- u16 flags;
-
- if (!vring_need_unmap_buffer(vq))
- return;
-
- flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
-
- dma_unmap_page(vring_dma_dev(vq),
- virtio64_to_cpu(vq->vq.vdev, desc->addr),
- virtio32_to_cpu(vq->vq.vdev, desc->len),
- (flags & VRING_DESC_F_WRITE) ?
- DMA_FROM_DEVICE : DMA_TO_DEVICE);
-}
-
static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
- unsigned int i)
+ struct vring_desc_extra *extra)
{
- struct vring_desc_extra *extra = vq->split.desc_extra;
u16 flags;
- flags = extra[i].flags;
+ flags = extra->flags;
if (flags & VRING_DESC_F_INDIRECT) {
if (!vq->use_dma_api)
goto out;
dma_unmap_single(vring_dma_dev(vq),
- extra[i].addr,
- extra[i].len,
+ extra->addr,
+ extra->len,
(flags & VRING_DESC_F_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
@@ -479,22 +465,23 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
goto out;
dma_unmap_page(vring_dma_dev(vq),
- extra[i].addr,
- extra[i].len,
+ extra->addr,
+ extra->len,
(flags & VRING_DESC_F_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE);
}
out:
- return extra[i].next;
+ return extra->next;
}
static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
unsigned int total_sg,
gfp_t gfp)
{
+ struct vring_desc_extra *extra;
struct vring_desc *desc;
- unsigned int i;
+ unsigned int i, size;
/*
* We require lowmem mappings for the descriptors because
@@ -503,40 +490,41 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
*/
gfp &= ~__GFP_HIGHMEM;
- desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
+ size = sizeof(*desc) * total_sg + sizeof(*extra) * total_sg;
+
+ desc = kmalloc(size, gfp);
if (!desc)
return NULL;
+ extra = (struct vring_desc_extra *)&desc[total_sg];
+
for (i = 0; i < total_sg; i++)
- desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
+ extra[i].next = i + 1;
+
return desc;
}
static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
struct vring_desc *desc,
+ struct vring_desc_extra *extra,
unsigned int i,
dma_addr_t addr,
unsigned int len,
- u16 flags,
- bool indirect)
+ u16 flags)
{
- struct vring_virtqueue *vring = to_vvq(vq);
- struct vring_desc_extra *extra = vring->split.desc_extra;
u16 next;
desc[i].flags = cpu_to_virtio16(vq->vdev, flags);
desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
desc[i].len = cpu_to_virtio32(vq->vdev, len);
- if (!indirect) {
- next = extra[i].next;
- desc[i].next = cpu_to_virtio16(vq->vdev, next);
+ extra[i].addr = addr;
+ extra[i].len = len;
+ extra[i].flags = flags;
+
+ next = extra[i].next;
- extra[i].addr = addr;
- extra[i].len = len;
- extra[i].flags = flags;
- } else
- next = virtio16_to_cpu(vq->vdev, desc[i].next);
+ desc[i].next = cpu_to_virtio16(vq->vdev, next);
return next;
}
@@ -551,6 +539,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
+ struct vring_desc_extra *extra;
struct scatterlist *sg;
struct vring_desc *desc;
unsigned int i, n, avail, descs_used, prev, err_idx;
@@ -586,9 +575,11 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
/* Set up rest to use this indirect table. */
i = 0;
descs_used = 1;
+ extra = (struct vring_desc_extra *)&desc[total_sg];
} else {
indirect = false;
desc = vq->split.vring.desc;
+ extra = vq->split.desc_extra;
i = head;
descs_used = total_sg;
}
@@ -618,9 +609,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
/* Note that we trust indirect descriptor
* table since it use stream DMA mapping.
*/
- i = virtqueue_add_desc_split(_vq, desc, i, addr, sg->length,
- VRING_DESC_F_NEXT,
- indirect);
+ i = virtqueue_add_desc_split(_vq, desc, extra, i, addr, sg->length,
+ VRING_DESC_F_NEXT);
}
}
for (; n < (out_sgs + in_sgs); n++) {
@@ -634,11 +624,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
/* Note that we trust indirect descriptor
* table since it use stream DMA mapping.
*/
- i = virtqueue_add_desc_split(_vq, desc, i, addr,
+ i = virtqueue_add_desc_split(_vq, desc, extra, i, addr,
sg->length,
VRING_DESC_F_NEXT |
- VRING_DESC_F_WRITE,
- indirect);
+ VRING_DESC_F_WRITE);
}
}
/* Last one doesn't continue. */
@@ -660,10 +649,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
}
virtqueue_add_desc_split(_vq, vq->split.vring.desc,
+ vq->split.desc_extra,
head, addr,
total_sg * sizeof(struct vring_desc),
- VRING_DESC_F_INDIRECT,
- false);
+ VRING_DESC_F_INDIRECT);
}
/* We're using some buffers from the free list. */
@@ -716,11 +705,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
for (n = 0; n < total_sg; n++) {
if (i == err_idx)
break;
- if (indirect) {
- vring_unmap_one_split_indirect(vq, &desc[i]);
- i = virtio16_to_cpu(_vq->vdev, desc[i].next);
- } else
- i = vring_unmap_one_split(vq, i);
+
+ i = vring_unmap_one_split(vq, &extra[i]);
}
free_indirect:
@@ -765,22 +751,25 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
void **ctx)
{
+ struct vring_desc_extra *extra;
unsigned int i, j;
__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
/* Clear data ptr. */
vq->split.desc_state[head].data = NULL;
+ extra = vq->split.desc_extra;
+
/* Put back on free list: unmap first-level descriptors and find end */
i = head;
while (vq->split.vring.desc[i].flags & nextflag) {
- vring_unmap_one_split(vq, i);
+ vring_unmap_one_split(vq, &extra[i]);
i = vq->split.desc_extra[i].next;
vq->vq.num_free++;
}
- vring_unmap_one_split(vq, i);
+ vring_unmap_one_split(vq, &extra[i]);
vq->split.desc_extra[i].next = vq->free_head;
vq->free_head = head;
@@ -790,21 +779,24 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
if (vq->indirect) {
struct vring_desc *indir_desc =
vq->split.desc_state[head].indir_desc;
- u32 len;
+ u32 len, num;
/* Free the indirect table, if any, now that it's unmapped. */
if (!indir_desc)
return;
-
len = vq->split.desc_extra[head].len;
BUG_ON(!(vq->split.desc_extra[head].flags &
VRING_DESC_F_INDIRECT));
BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+ num = len / sizeof(struct vring_desc);
+
+ extra = (struct vring_desc_extra *)&indir_desc[num];
+
if (vring_need_unmap_buffer(vq)) {
- for (j = 0; j < len / sizeof(struct vring_desc); j++)
- vring_unmap_one_split_indirect(vq, &indir_desc[j]);
+ for (j = 0; j < num; j++)
+ vring_unmap_one_split(vq, &extra[j]);
}
kfree(indir_desc);
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 03/13] virtio_ring: packed: record extras for indirect buffers
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 01/13] virtio_ring: introduce vring_need_unmap_buffer Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer Xuan Zhuo
` (9 subsequent siblings)
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
The subsequent commit needs to know whether every indirect buffer is
premapped or not. So we need to introduce an extra struct for every
indirect buffer to record this info.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
drivers/virtio/virtio_ring.c | 60 +++++++++++++++++++++---------------
1 file changed, 36 insertions(+), 24 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index dca093744fe1..628e01af1c9a 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -78,7 +78,11 @@ struct vring_desc_state_split {
struct vring_desc_state_packed {
void *data; /* Data for callback. */
- struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
+
+ /* Indirect extra table and desc table, if any. These two will be
+ * allocated together. So we won't stress more to the memory allocator.
+ */
+ struct vring_packed_desc *indir_desc;
u16 num; /* Descriptor list length. */
u16 last; /* The last desc state in a list. */
};
@@ -1238,27 +1242,12 @@ static void vring_unmap_extra_packed(const struct vring_virtqueue *vq,
}
}
-static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
- const struct vring_packed_desc *desc)
-{
- u16 flags;
-
- if (!vring_need_unmap_buffer(vq))
- return;
-
- flags = le16_to_cpu(desc->flags);
-
- dma_unmap_page(vring_dma_dev(vq),
- le64_to_cpu(desc->addr),
- le32_to_cpu(desc->len),
- (flags & VRING_DESC_F_WRITE) ?
- DMA_FROM_DEVICE : DMA_TO_DEVICE);
-}
-
static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg,
gfp_t gfp)
{
+ struct vring_desc_extra *extra;
struct vring_packed_desc *desc;
+ int i, size;
/*
* We require lowmem mappings for the descriptors because
@@ -1267,7 +1256,16 @@ static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg,
*/
gfp &= ~__GFP_HIGHMEM;
- desc = kmalloc_array(total_sg, sizeof(struct vring_packed_desc), gfp);
+ size = (sizeof(*desc) + sizeof(*extra)) * total_sg;
+
+ desc = kmalloc(size, gfp);
+ if (!desc)
+ return NULL;
+
+ extra = (struct vring_desc_extra *)&desc[total_sg];
+
+ for (i = 0; i < total_sg; i++)
+ extra[i].next = i + 1;
return desc;
}
@@ -1280,6 +1278,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
void *data,
gfp_t gfp)
{
+ struct vring_desc_extra *extra;
struct vring_packed_desc *desc;
struct scatterlist *sg;
unsigned int i, n, err_idx;
@@ -1291,6 +1290,8 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
if (!desc)
return -ENOMEM;
+ extra = (struct vring_desc_extra *)&desc[total_sg];
+
if (unlikely(vq->vq.num_free < 1)) {
pr_debug("Can't add buf len 1 - avail = 0\n");
kfree(desc);
@@ -1312,6 +1313,13 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
0 : VRING_DESC_F_WRITE);
desc[i].addr = cpu_to_le64(addr);
desc[i].len = cpu_to_le32(sg->length);
+
+ if (unlikely(vq->use_dma_api)) {
+ extra[i].addr = addr;
+ extra[i].len = sg->length;
+ extra[i].flags = n < out_sgs ? 0 : VRING_DESC_F_WRITE;
+ }
+
i++;
}
}
@@ -1381,7 +1389,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
err_idx = i;
for (i = 0; i < err_idx; i++)
- vring_unmap_desc_packed(vq, &desc[i]);
+ vring_unmap_extra_packed(vq, &extra[i]);
free_desc:
kfree(desc);
@@ -1617,7 +1625,8 @@ static void detach_buf_packed(struct vring_virtqueue *vq,
}
if (vq->indirect) {
- u32 len;
+ struct vring_desc_extra *extra;
+ u32 len, num;
/* Free the indirect table, if any, now that it's unmapped. */
desc = state->indir_desc;
@@ -1626,9 +1635,12 @@ static void detach_buf_packed(struct vring_virtqueue *vq,
if (vring_need_unmap_buffer(vq)) {
len = vq->packed.desc_extra[id].len;
- for (i = 0; i < len / sizeof(struct vring_packed_desc);
- i++)
- vring_unmap_desc_packed(vq, &desc[i]);
+ num = len / sizeof(struct vring_packed_desc);
+
+ extra = (struct vring_desc_extra *)&desc[num];
+
+ for (i = 0; i < num; i++)
+ vring_unmap_extra_packed(vq, &extra[i]);
}
kfree(desc);
state->indir_desc = NULL;
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (2 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 03/13] virtio_ring: packed: " Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-11-05 4:25 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped Xuan Zhuo
` (8 subsequent siblings)
12 siblings, 1 reply; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
The current configuration sets the virtqueue (vq) to premapped mode,
implying that all buffers submitted to this queue must be mapped ahead
of time. This presents a challenge for the virtnet send queue (sq): the
virtnet driver would be required to keep track of dma information for vq
size * 17, which can be substantial. However, if the premapped mode were
applied on a per-buffer basis, the complexity would be greatly reduced.
With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
skb buffers could remain unmapped.
Also, some sgs are not generated by the virtio driver and may be passed in
from, for example, the block layer, so we cannot change the sgs to carry
this info; new APIs are the better way.
So we pass the new argument 'premapped' to indicate the buffers
submitted to virtio are premapped in advance. Additionally,
DMA unmap operations for these buffers will be bypassed.
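Concretely (a simplified restatement of the logic added below), a premapped
buffer records DMA_MAPPING_ERROR instead of a real DMA address in its
desc_extra entry, and the unmap helper treats that value as "nothing to
unmap":

/* Simplified restatement of the per-buffer logic added below:
 *
 *	extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr;
 */
static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring,
				    const struct vring_desc_extra *extra)
{
	return vring->use_dma_api && (extra->addr != DMA_MAPPING_ERROR);
}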
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
drivers/virtio/virtio_ring.c | 79 ++++++++++++++++++------------------
1 file changed, 40 insertions(+), 39 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 628e01af1c9a..a89295b79e66 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -300,9 +300,10 @@ static bool vring_use_dma_api(const struct virtio_device *vdev)
return false;
}
-static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring)
+static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring,
+ const struct vring_desc_extra *extra)
{
- return vring->use_dma_api && !vring->premapped;
+ return vring->use_dma_api && (extra->addr != DMA_MAPPING_ERROR);
}
size_t virtio_max_dma_size(const struct virtio_device *vdev)
@@ -372,9 +373,10 @@ static struct device *vring_dma_dev(const struct vring_virtqueue *vq)
/* Map one sg entry. */
static int vring_map_one_sg(const struct vring_virtqueue *vq, struct scatterlist *sg,
- enum dma_data_direction direction, dma_addr_t *addr)
+ enum dma_data_direction direction, dma_addr_t *addr,
+ bool premapped)
{
- if (vq->premapped) {
+ if (premapped) {
*addr = sg_dma_address(sg);
return 0;
}
@@ -465,7 +467,7 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
(flags & VRING_DESC_F_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
- if (!vring_need_unmap_buffer(vq))
+ if (!vring_need_unmap_buffer(vq, extra))
goto out;
dma_unmap_page(vring_dma_dev(vq),
@@ -514,7 +516,7 @@ static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
unsigned int i,
dma_addr_t addr,
unsigned int len,
- u16 flags)
+ u16 flags, bool premapped)
{
u16 next;
@@ -522,7 +524,7 @@ static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
desc[i].len = cpu_to_virtio32(vq->vdev, len);
- extra[i].addr = addr;
+ extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr;
extra[i].len = len;
extra[i].flags = flags;
@@ -540,6 +542,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
unsigned int in_sgs,
void *data,
void *ctx,
+ bool premapped,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
@@ -606,7 +609,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
dma_addr_t addr;
- if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr))
+ if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr, premapped))
goto unmap_release;
prev = i;
@@ -614,14 +617,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
* table since it use stream DMA mapping.
*/
i = virtqueue_add_desc_split(_vq, desc, extra, i, addr, sg->length,
- VRING_DESC_F_NEXT);
+ VRING_DESC_F_NEXT,
+ premapped);
}
}
for (; n < (out_sgs + in_sgs); n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
dma_addr_t addr;
- if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr))
+ if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr, premapped))
goto unmap_release;
prev = i;
@@ -631,12 +635,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
i = virtqueue_add_desc_split(_vq, desc, extra, i, addr,
sg->length,
VRING_DESC_F_NEXT |
- VRING_DESC_F_WRITE);
+ VRING_DESC_F_WRITE,
+ premapped);
}
}
/* Last one doesn't continue. */
desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
- if (!indirect && vring_need_unmap_buffer(vq))
+ if (!indirect && vring_need_unmap_buffer(vq, &extra[prev]))
vq->split.desc_extra[prev & (vq->split.vring.num - 1)].flags &=
~VRING_DESC_F_NEXT;
@@ -645,18 +650,14 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
dma_addr_t addr = vring_map_single(
vq, desc, total_sg * sizeof(struct vring_desc),
DMA_TO_DEVICE);
- if (vring_mapping_error(vq, addr)) {
- if (vq->premapped)
- goto free_indirect;
-
+ if (vring_mapping_error(vq, addr))
goto unmap_release;
- }
virtqueue_add_desc_split(_vq, vq->split.vring.desc,
vq->split.desc_extra,
head, addr,
total_sg * sizeof(struct vring_desc),
- VRING_DESC_F_INDIRECT);
+ VRING_DESC_F_INDIRECT, false);
}
/* We're using some buffers from the free list. */
@@ -713,7 +714,6 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
i = vring_unmap_one_split(vq, &extra[i]);
}
-free_indirect:
if (indirect)
kfree(desc);
@@ -798,7 +798,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
extra = (struct vring_desc_extra *)&indir_desc[num];
- if (vring_need_unmap_buffer(vq)) {
+ if (vq->use_dma_api) {
for (j = 0; j < num; j++)
vring_unmap_one_split(vq, &extra[j]);
}
@@ -1232,7 +1232,7 @@ static void vring_unmap_extra_packed(const struct vring_virtqueue *vq,
(flags & VRING_DESC_F_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
- if (!vring_need_unmap_buffer(vq))
+ if (!vring_need_unmap_buffer(vq, extra))
return;
dma_unmap_page(vring_dma_dev(vq),
@@ -1276,6 +1276,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
unsigned int out_sgs,
unsigned int in_sgs,
void *data,
+ bool premapped,
gfp_t gfp)
{
struct vring_desc_extra *extra;
@@ -1306,7 +1307,8 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
for (n = 0; n < out_sgs + in_sgs; n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
if (vring_map_one_sg(vq, sg, n < out_sgs ?
- DMA_TO_DEVICE : DMA_FROM_DEVICE, &addr))
+ DMA_TO_DEVICE : DMA_FROM_DEVICE,
+ &addr, premapped))
goto unmap_release;
desc[i].flags = cpu_to_le16(n < out_sgs ?
@@ -1315,7 +1317,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
desc[i].len = cpu_to_le32(sg->length);
if (unlikely(vq->use_dma_api)) {
- extra[i].addr = addr;
+ extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr;
extra[i].len = sg->length;
extra[i].flags = n < out_sgs ? 0 : VRING_DESC_F_WRITE;
}
@@ -1328,12 +1330,8 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
addr = vring_map_single(vq, desc,
total_sg * sizeof(struct vring_packed_desc),
DMA_TO_DEVICE);
- if (vring_mapping_error(vq, addr)) {
- if (vq->premapped)
- goto free_desc;
-
+ if (vring_mapping_error(vq, addr))
goto unmap_release;
- }
vq->packed.vring.desc[head].addr = cpu_to_le64(addr);
vq->packed.vring.desc[head].len = cpu_to_le32(total_sg *
@@ -1391,7 +1389,6 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
for (i = 0; i < err_idx; i++)
vring_unmap_extra_packed(vq, &extra[i]);
-free_desc:
kfree(desc);
END_USE(vq);
@@ -1405,6 +1402,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
unsigned int in_sgs,
void *data,
void *ctx,
+ bool premapped,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
@@ -1431,7 +1429,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
if (virtqueue_use_indirect(vq, total_sg)) {
err = virtqueue_add_indirect_packed(vq, sgs, total_sg, out_sgs,
- in_sgs, data, gfp);
+ in_sgs, data, premapped, gfp);
if (err != -ENOMEM) {
END_USE(vq);
return err;
@@ -1466,7 +1464,8 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
dma_addr_t addr;
if (vring_map_one_sg(vq, sg, n < out_sgs ?
- DMA_TO_DEVICE : DMA_FROM_DEVICE, &addr))
+ DMA_TO_DEVICE : DMA_FROM_DEVICE,
+ &addr, premapped))
goto unmap_release;
flags = cpu_to_le16(vq->packed.avail_used_flags |
@@ -1482,7 +1481,8 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
desc[i].id = cpu_to_le16(id);
if (unlikely(vq->use_dma_api)) {
- vq->packed.desc_extra[curr].addr = addr;
+ vq->packed.desc_extra[curr].addr = premapped ?
+ DMA_MAPPING_ERROR : addr;
vq->packed.desc_extra[curr].len = sg->length;
vq->packed.desc_extra[curr].flags =
le16_to_cpu(flags);
@@ -1633,7 +1633,7 @@ static void detach_buf_packed(struct vring_virtqueue *vq,
if (!desc)
return;
- if (vring_need_unmap_buffer(vq)) {
+ if (vq->use_dma_api) {
len = vq->packed.desc_extra[id].len;
num = len / sizeof(struct vring_packed_desc);
@@ -2204,14 +2204,15 @@ static inline int virtqueue_add(struct virtqueue *_vq,
unsigned int in_sgs,
void *data,
void *ctx,
+ bool premapped,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
return vq->packed_ring ? virtqueue_add_packed(_vq, sgs, total_sg,
- out_sgs, in_sgs, data, ctx, gfp) :
+ out_sgs, in_sgs, data, ctx, premapped, gfp) :
virtqueue_add_split(_vq, sgs, total_sg,
- out_sgs, in_sgs, data, ctx, gfp);
+ out_sgs, in_sgs, data, ctx, premapped, gfp);
}
/**
@@ -2245,7 +2246,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
total_sg++;
}
return virtqueue_add(_vq, sgs, total_sg, out_sgs, in_sgs,
- data, NULL, gfp);
+ data, NULL, false, gfp);
}
EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
@@ -2267,7 +2268,7 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
void *data,
gfp_t gfp)
{
- return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, gfp);
+ return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, false, gfp);
}
EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
@@ -2289,7 +2290,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
void *data,
gfp_t gfp)
{
- return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, gfp);
+ return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, false, gfp);
}
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
@@ -2313,7 +2314,7 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
void *ctx,
gfp_t gfp)
{
- return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, gfp);
+ return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, false, gfp);
}
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (3 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-31 2:19 ` kernel test robot
2024-11-05 4:28 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer Xuan Zhuo
` (7 subsequent siblings)
12 siblings, 2 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
Two APIs are introduced to submit buffers that are premapped on a per-buffer basis:
int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
struct scatterlist *sg, unsigned int num,
void *data,
void *ctx,
bool premapped,
gfp_t gfp);
int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
struct scatterlist *sg, unsigned int num,
void *data,
bool premapped,
gfp_t gfp);
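A hypothetical usage sketch for the inbuf variant ('addr', 'len', 'buf' and
'ctx' are placeholders; the rq path in patch 6 follows this shape):

	/* The sg carries only a driver-made DMA mapping; premapped=true
	 * tells the core not to map or unmap this buffer itself.
	 */
	sg_init_table(rq->sg, 1);
	rq->sg[0].dma_address = addr;
	rq->sg[0].length = len;

	err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
					    true, GFP_ATOMIC);

On failure the buffer is not queued and the caller keeps ownership of it
(and of its DMA mapping).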
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
drivers/virtio/virtio_ring.c | 48 ++++++++++++++++++++++++++++++++++++
include/linux/virtio.h | 13 ++++++++++
2 files changed, 61 insertions(+)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index a89295b79e66..525308d82728 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -2272,6 +2272,29 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
}
EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
+/**
+ * virtqueue_add_outbuf_premapped - expose output buffers to other end
+ * @vq: the struct virtqueue we're talking about.
+ * @sg: scatterlist (must be well-formed and terminated!)
+ * @num: the number of entries in @sg readable by other side
+ * @data: the token identifying the buffer.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
+ struct scatterlist *sg, unsigned int num,
+ void *data,
+ bool premapped,
+ gfp_t gfp)
+{
+ return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, premapped, gfp);
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_premapped);
+
/**
* virtqueue_add_inbuf - expose input buffers to other end
* @vq: the struct virtqueue we're talking about.
@@ -2318,6 +2341,31 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
}
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
+/**
+ * virtqueue_add_inbuf_premapped - expose input buffers to other end
+ * @vq: the struct virtqueue we're talking about.
+ * @sg: scatterlist (must be well-formed and terminated!)
+ * @num: the number of entries in @sg writable by other side
+ * @data: the token identifying the buffer.
+ * @ctx: extra context for the token
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
+ struct scatterlist *sg, unsigned int num,
+ void *data,
+ void *ctx,
+ bool premapped,
+ gfp_t gfp)
+{
+ return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, premapped, gfp);
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_premapped);
+
/**
* virtqueue_dma_dev - get the dma dev
* @_vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 306137a15d07..19afa49b92d0 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -56,6 +56,19 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
void *ctx,
gfp_t gfp);
+int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
+ struct scatterlist *sg, unsigned int num,
+ void *data,
+ void *ctx,
+ bool premapped,
+ gfp_t gfp);
+
+int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
+ struct scatterlist *sg, unsigned int num,
+ void *data,
+ bool premapped,
+ gfp_t gfp);
+
int virtqueue_add_sgs(struct virtqueue *vq,
struct scatterlist *sgs[],
unsigned int out_sgs,
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (4 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-11-05 3:23 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped Xuan Zhuo
` (6 subsequent siblings)
12 siblings, 1 reply; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
The virtio-net rq now submits premapped buffers on a per-buffer basis, using
the new virtqueue_add_inbuf_premapped() API.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
drivers/net/virtio_net.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 792e9eadbfc3..09757fa408bd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
}
+static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
+{
+ sg->dma_address = addr;
+ sg->length = len;
+}
+
static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
bool in_napi, struct virtnet_sq_free_stats *stats)
{
@@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
addr = dma->addr - sizeof(*dma) + offset;
sg_init_table(rq->sg, 1);
- rq->sg[0].dma_address = addr;
- rq->sg[0].length = len;
+ sg_fill_dma(rq->sg, addr, len);
}
static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
@@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct virtnet_info *vi,
}
}
-static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
-{
- sg->dma_address = addr;
- sg->length = len;
-}
-
static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
struct receive_queue *rq, void *buf, u32 len)
{
@@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
sg_init_table(rq->sg, 1);
sg_fill_dma(rq->sg, addr, len);
- err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
+ err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, xsk_buffs[i],
+ NULL, true, gfp);
if (err)
goto err;
}
@@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
- err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
+ err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
+ rq->do_dma, gfp);
if (err < 0) {
if (rq->do_dma)
virtnet_rq_unmap(rq, buf, 0);
@@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
virtnet_rq_init_one_sg(rq, buf, len);
ctx = mergeable_len_to_ctx(len + room, headroom);
- err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
+ err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
+ rq->do_dma, gfp);
if (err < 0) {
if (rq->do_dma)
virtnet_rq_unmap(rq, buf, 0);
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (5 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-11-05 4:29 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 08/13] virtio_net: refactor the xmit type Xuan Zhuo
` (5 subsequent siblings)
12 siblings, 1 reply; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
Now this API is unused, so remove it.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
drivers/virtio/virtio_ring.c | 48 ------------------------------------
include/linux/virtio.h | 2 --
2 files changed, 50 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 525308d82728..4c1c6a3c7c43 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -180,9 +180,6 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
- /* Do DMA mapping by driver */
- bool premapped;
-
/* Head of free buffer list. */
unsigned int free_head;
/* Number we've added since last sync. */
@@ -2094,7 +2091,6 @@ static struct virtqueue *vring_create_virtqueue_packed(
vq->packed_ring = true;
vq->dma_dev = dma_dev;
vq->use_dma_api = vring_use_dma_api(vdev);
- vq->premapped = false;
vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC) &&
!context;
@@ -2687,7 +2683,6 @@ static struct virtqueue *__vring_new_virtqueue(unsigned int index,
#endif
vq->dma_dev = dma_dev;
vq->use_dma_api = vring_use_dma_api(vdev);
- vq->premapped = false;
vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC) &&
!context;
@@ -2814,49 +2809,6 @@ int virtqueue_resize(struct virtqueue *_vq, u32 num,
}
EXPORT_SYMBOL_GPL(virtqueue_resize);
-/**
- * virtqueue_set_dma_premapped - set the vring premapped mode
- * @_vq: the struct virtqueue we're talking about.
- *
- * Enable the premapped mode of the vq.
- *
- * The vring in premapped mode does not do dma internally, so the driver must
- * do dma mapping in advance. The driver must pass the dma_address through
- * dma_address of scatterlist. When the driver got a used buffer from
- * the vring, it has to unmap the dma address.
- *
- * This function must be called immediately after creating the vq, or after vq
- * reset, and before adding any buffers to it.
- *
- * Caller must ensure we don't call this with other virtqueue operations
- * at the same time (except where noted).
- *
- * Returns zero or a negative error.
- * 0: success.
- * -EINVAL: too late to enable premapped mode, the vq already contains buffers.
- */
-int virtqueue_set_dma_premapped(struct virtqueue *_vq)
-{
- struct vring_virtqueue *vq = to_vvq(_vq);
- u32 num;
-
- START_USE(vq);
-
- num = vq->packed_ring ? vq->packed.vring.num : vq->split.vring.num;
-
- if (num != vq->vq.num_free) {
- END_USE(vq);
- return -EINVAL;
- }
-
- vq->premapped = true;
-
- END_USE(vq);
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(virtqueue_set_dma_premapped);
-
/**
* virtqueue_reset - detach and recycle all unused buffers
* @_vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 19afa49b92d0..8ca09ba165a9 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -95,8 +95,6 @@ bool virtqueue_enable_cb(struct virtqueue *vq);
unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
-int virtqueue_set_dma_premapped(struct virtqueue *_vq);
-
bool virtqueue_poll(struct virtqueue *vq, unsigned);
bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 08/13] virtio_net: refactor the xmit type
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (6 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 09/13] virtio_net: xsk: bind/unbind xsk for tx Xuan Zhuo
` (4 subsequent siblings)
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
Because af-xdp will introduce a new xmit type, refactor the xmit type
mechanism first.
We know both xdp_frame and sk_buff are at least 4 bytes aligned.
For the xsk tx, we do not pass any pointer to the virtio core as data; we
just need to pass the len of the packet. So we push the len into the void
pointer, and we can make sure that pointer is 4 bytes aligned.
The data structures of AF_XDP are also at least 4 bytes aligned.
So the last two bits of the pointers are free, and we can use them to
distinguish the xmit types:
00 for skb
01 for SKB_ORPHAN
10 for XDP
11 for AF-XDP tx
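For illustration only (not part of the patch), how a tagged pointer
round-trips: virtnet_xmit_ptr_pack() ORs the 2-bit type into the low bits
of a 4-byte-aligned pointer, and virtnet_xmit_ptr_unpack() clears those
bits (restoring the original pointer) and returns the type.

	void *token = virtnet_xmit_ptr_pack(xdpf, VIRTNET_XMIT_TYPE_XDP);

	/* later, on tx completion: */
	if (virtnet_xmit_ptr_unpack(&token) == VIRTNET_XMIT_TYPE_XDP)
		xdp_return_frame(token);	/* token == xdpf again */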
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/virtio_net.c | 90 +++++++++++++++++++++++-----------------
1 file changed, 51 insertions(+), 39 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 09757fa408bd..37e86e0a1d8e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -45,9 +45,6 @@ module_param(napi_tx, bool, 0644);
#define VIRTIO_XDP_TX BIT(0)
#define VIRTIO_XDP_REDIR BIT(1)
-#define VIRTIO_XDP_FLAG BIT(0)
-#define VIRTIO_ORPHAN_FLAG BIT(1)
-
/* RX packet size EWMA. The average packet size is used to determine the packet
* buffer size when refilling RX rings. As the entire RX ring may be refilled
* at once, the weight is chosen so that the EWMA will be insensitive to short-
@@ -512,34 +509,35 @@ static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
struct page *page, void *buf,
int len, int truesize);
-static bool is_xdp_frame(void *ptr)
-{
- return (unsigned long)ptr & VIRTIO_XDP_FLAG;
-}
+enum virtnet_xmit_type {
+ VIRTNET_XMIT_TYPE_SKB,
+ VIRTNET_XMIT_TYPE_SKB_ORPHAN,
+ VIRTNET_XMIT_TYPE_XDP,
+};
-static void *xdp_to_ptr(struct xdp_frame *ptr)
-{
- return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
-}
+/* We use the last two bits of the pointer to distinguish the xmit type. */
+#define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1))
-static struct xdp_frame *ptr_to_xdp(void *ptr)
+static enum virtnet_xmit_type virtnet_xmit_ptr_unpack(void **ptr)
{
- return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
-}
+ unsigned long p = (unsigned long)*ptr;
-static bool is_orphan_skb(void *ptr)
-{
- return (unsigned long)ptr & VIRTIO_ORPHAN_FLAG;
+ *ptr = (void *)(p & ~VIRTNET_XMIT_TYPE_MASK);
+
+ return p & VIRTNET_XMIT_TYPE_MASK;
}
-static void *skb_to_ptr(struct sk_buff *skb, bool orphan)
+static void *virtnet_xmit_ptr_pack(void *ptr, enum virtnet_xmit_type type)
{
- return (void *)((unsigned long)skb | (orphan ? VIRTIO_ORPHAN_FLAG : 0));
+ return (void *)((unsigned long)ptr | type);
}
-static struct sk_buff *ptr_to_skb(void *ptr)
+static int virtnet_add_outbuf(struct send_queue *sq, int num, void *data,
+ enum virtnet_xmit_type type)
{
- return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
+ return virtqueue_add_outbuf(sq->vq, sq->sg, num,
+ virtnet_xmit_ptr_pack(data, type),
+ GFP_ATOMIC);
}
static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
@@ -551,29 +549,37 @@ static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
bool in_napi, struct virtnet_sq_free_stats *stats)
{
+ struct xdp_frame *frame;
+ struct sk_buff *skb;
unsigned int len;
void *ptr;
while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
- if (!is_xdp_frame(ptr)) {
- struct sk_buff *skb = ptr_to_skb(ptr);
+ switch (virtnet_xmit_ptr_unpack(&ptr)) {
+ case VIRTNET_XMIT_TYPE_SKB:
+ skb = ptr;
pr_debug("Sent skb %p\n", skb);
+ stats->napi_packets++;
+ stats->napi_bytes += skb->len;
+ napi_consume_skb(skb, in_napi);
+ break;
- if (is_orphan_skb(ptr)) {
- stats->packets++;
- stats->bytes += skb->len;
- } else {
- stats->napi_packets++;
- stats->napi_bytes += skb->len;
- }
+ case VIRTNET_XMIT_TYPE_SKB_ORPHAN:
+ skb = ptr;
+
+ stats->packets++;
+ stats->bytes += skb->len;
napi_consume_skb(skb, in_napi);
- } else {
- struct xdp_frame *frame = ptr_to_xdp(ptr);
+ break;
+
+ case VIRTNET_XMIT_TYPE_XDP:
+ frame = ptr;
stats->packets++;
stats->bytes += xdp_get_frame_len(frame);
xdp_return_frame(frame);
+ break;
}
}
netdev_tx_completed_queue(txq, stats->napi_packets, stats->napi_bytes);
@@ -1431,8 +1437,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
skb_frag_size(frag), skb_frag_off(frag));
}
- err = virtqueue_add_outbuf(sq->vq, sq->sg, nr_frags + 1,
- xdp_to_ptr(xdpf), GFP_ATOMIC);
+ err = virtnet_add_outbuf(sq, nr_frags + 1, xdpf, VIRTNET_XMIT_TYPE_XDP);
if (unlikely(err))
return -ENOSPC; /* Caller handle free/refcnt */
@@ -3048,8 +3053,9 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan)
return num_sg;
num_sg++;
}
- return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
- skb_to_ptr(skb, orphan), GFP_ATOMIC);
+
+ return virtnet_add_outbuf(sq, num_sg, skb,
+ orphan ? VIRTNET_XMIT_TYPE_SKB_ORPHAN : VIRTNET_XMIT_TYPE_SKB);
}
static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -5926,10 +5932,16 @@ static void free_receive_page_frags(struct virtnet_info *vi)
static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
{
- if (!is_xdp_frame(buf))
+ switch (virtnet_xmit_ptr_unpack(&buf)) {
+ case VIRTNET_XMIT_TYPE_SKB:
+ case VIRTNET_XMIT_TYPE_SKB_ORPHAN:
dev_kfree_skb(buf);
- else
- xdp_return_frame(ptr_to_xdp(buf));
+ break;
+
+ case VIRTNET_XMIT_TYPE_XDP:
+ xdp_return_frame(buf);
+ break;
+ }
}
static void free_unused_bufs(struct virtnet_info *vi)
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 09/13] virtio_net: xsk: bind/unbind xsk for tx
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (7 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 08/13] virtio_net: refactor the xmit type Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 10/13] virtio_net: xsk: prevent disable tx napi Xuan Zhuo
` (3 subsequent siblings)
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
This patch implements the logic of binding/unbinding the xsk pool to/from the sq and rq.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/virtio_net.c | 53 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 37e86e0a1d8e..2d2658fcad07 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -295,6 +295,10 @@ struct send_queue {
/* Record whether sq is in reset state. */
bool reset;
+
+ struct xsk_buff_pool *xsk_pool;
+
+ dma_addr_t xsk_hdr_dma_addr;
};
/* Internal representation of a receive virtqueue */
@@ -497,6 +501,8 @@ struct virtio_net_common_hdr {
};
};
+static struct virtio_net_common_hdr xsk_hdr;
+
static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
struct net_device *dev,
@@ -5496,6 +5502,29 @@ static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct receive_queu
return err;
}
+static int virtnet_sq_bind_xsk_pool(struct virtnet_info *vi,
+ struct send_queue *sq,
+ struct xsk_buff_pool *pool)
+{
+ int err, qindex;
+
+ qindex = sq - vi->sq;
+
+ virtnet_tx_pause(vi, sq);
+
+ err = virtqueue_reset(sq->vq, virtnet_sq_free_unused_buf);
+ if (err) {
+ netdev_err(vi->dev, "reset tx fail: tx queue index: %d err: %d\n", qindex, err);
+ pool = NULL;
+ }
+
+ sq->xsk_pool = pool;
+
+ virtnet_tx_resume(vi, sq);
+
+ return err;
+}
+
static int virtnet_xsk_pool_enable(struct net_device *dev,
struct xsk_buff_pool *pool,
u16 qid)
@@ -5504,6 +5533,7 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
struct receive_queue *rq;
struct device *dma_dev;
struct send_queue *sq;
+ dma_addr_t hdr_dma;
int err, size;
if (vi->hdr_len > xsk_pool_get_headroom(pool))
@@ -5541,6 +5571,11 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
if (!rq->xsk_buffs)
return -ENOMEM;
+ hdr_dma = virtqueue_dma_map_single_attrs(sq->vq, &xsk_hdr, vi->hdr_len,
+ DMA_TO_DEVICE, 0);
+ if (virtqueue_dma_mapping_error(sq->vq, hdr_dma))
+ return -ENOMEM;
+
err = xsk_pool_dma_map(pool, dma_dev, 0);
if (err)
goto err_xsk_map;
@@ -5549,11 +5584,24 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
if (err)
goto err_rq;
+ err = virtnet_sq_bind_xsk_pool(vi, sq, pool);
+ if (err)
+ goto err_sq;
+
+ /* Now, we do not support tx offload(such as tx csum), so all the tx
+ * virtnet hdr is zero. So all the tx packets can share a single hdr.
+ */
+ sq->xsk_hdr_dma_addr = hdr_dma;
+
return 0;
+err_sq:
+ virtnet_rq_bind_xsk_pool(vi, rq, NULL);
err_rq:
xsk_pool_dma_unmap(pool, 0);
err_xsk_map:
+ virtqueue_dma_unmap_single_attrs(rq->vq, hdr_dma, vi->hdr_len,
+ DMA_TO_DEVICE, 0);
return err;
}
@@ -5562,19 +5610,24 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
struct virtnet_info *vi = netdev_priv(dev);
struct xsk_buff_pool *pool;
struct receive_queue *rq;
+ struct send_queue *sq;
int err;
if (qid >= vi->curr_queue_pairs)
return -EINVAL;
+ sq = &vi->sq[qid];
rq = &vi->rq[qid];
pool = rq->xsk_pool;
err = virtnet_rq_bind_xsk_pool(vi, rq, NULL);
+ err |= virtnet_sq_bind_xsk_pool(vi, sq, NULL);
xsk_pool_dma_unmap(pool, 0);
+ virtqueue_dma_unmap_single_attrs(sq->vq, sq->xsk_hdr_dma_addr,
+ vi->hdr_len, DMA_TO_DEVICE, 0);
kvfree(rq->xsk_buffs);
return err;
--
2.32.0.3.g01195cf9f
* [PATCH net-next v2 10/13] virtio_net: xsk: prevent disable tx napi
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (8 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 09/13] virtio_net: xsk: bind/unbind xsk for tx Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 11/13] virtio_net: xsk: tx: support xmit xsk buffer Xuan Zhuo
` (2 subsequent siblings)
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
Since xsk's TX queue is consumed by TX NAPI, if an sq is bound to xsk, we
must prevent tx napi from being disabled.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/virtio_net.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2d2658fcad07..a5bad8a5f642 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -5025,7 +5025,7 @@ static int virtnet_set_coalesce(struct net_device *dev,
struct netlink_ext_ack *extack)
{
struct virtnet_info *vi = netdev_priv(dev);
- int ret, queue_number, napi_weight;
+ int ret, queue_number, napi_weight, i;
bool update_napi = false;
/* Can't change NAPI weight if the link is up */
@@ -5054,6 +5054,14 @@ static int virtnet_set_coalesce(struct net_device *dev,
return ret;
if (update_napi) {
+ /* xsk xmit depends on the tx napi. So if xsk is active,
+ * prevent modifications to tx napi.
+ */
+ for (i = queue_number; i < vi->max_queue_pairs; i++) {
+ if (vi->sq[i].xsk_pool)
+ return -EBUSY;
+ }
+
for (; queue_number < vi->max_queue_pairs; queue_number++)
vi->sq[queue_number].napi.weight = napi_weight;
}
--
2.32.0.3.g01195cf9f
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH net-next v2 11/13] virtio_net: xsk: tx: support xmit xsk buffer
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (9 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 10/13] virtio_net: xsk: prevent disable tx napi Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 12/13] virtio_net: update tx timeout record Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 13/13] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY Xuan Zhuo
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
The driver's tx napi is essential for XSK: it is responsible for fetching
descriptors from the XSK tx queue and sending them out.
To start transmission, we need to trigger the tx napi.
virtnet_free_old_xmit() distinguishes the three pointer types (skb, xdp
frame, xsk buffer) by the low bits of the pointer; see the sketch below.
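For reference, a minimal user-space sketch of that tagging scheme (the names
here are illustrative, not the driver's own symbols):

#include <assert.h>
#include <stdint.h>

#define XMIT_TYPE_MASK   0x3UL   /* low two bits carry the xmit type */
#define XSK_FLAG_OFFSET  2       /* the xsk buffer length lives above the tag */
#define XMIT_TYPE_XSK    0x3UL

/* Pack an xsk buffer length and the XSK type tag into one pointer-sized value. */
static void *xsk_len_to_ptr(uint32_t len)
{
	return (void *)(((unsigned long)len << XSK_FLAG_OFFSET) | XMIT_TYPE_XSK);
}

/* Recover the length by stripping off the tag bits. */
static uint32_t ptr_to_xsk_len(void *ptr)
{
	return (unsigned long)ptr >> XSK_FLAG_OFFSET;
}

int main(void)
{
	void *p = xsk_len_to_ptr(1500);

	assert(((unsigned long)p & XMIT_TYPE_MASK) == XMIT_TYPE_XSK);
	assert(ptr_to_xsk_len(p) == 1500);
	return 0;
}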
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/virtio_net.c | 179 ++++++++++++++++++++++++++++++++++++---
1 file changed, 168 insertions(+), 11 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a5bad8a5f642..4d00d73d8088 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -83,6 +83,7 @@ struct virtnet_sq_free_stats {
u64 bytes;
u64 napi_packets;
u64 napi_bytes;
+ u64 xsk;
};
struct virtnet_sq_stats {
@@ -514,16 +515,20 @@ static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
struct sk_buff *curr_skb,
struct page *page, void *buf,
int len, int truesize);
+static void virtnet_xsk_completed(struct send_queue *sq, int num);
enum virtnet_xmit_type {
VIRTNET_XMIT_TYPE_SKB,
VIRTNET_XMIT_TYPE_SKB_ORPHAN,
VIRTNET_XMIT_TYPE_XDP,
+ VIRTNET_XMIT_TYPE_XSK,
};
/* We use the last two bits of the pointer to distinguish the xmit type. */
#define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1))
+#define VIRTIO_XSK_FLAG_OFFSET 2
+
static enum virtnet_xmit_type virtnet_xmit_ptr_unpack(void **ptr)
{
unsigned long p = (unsigned long)*ptr;
@@ -546,6 +551,11 @@ static int virtnet_add_outbuf(struct send_queue *sq, int num, void *data,
GFP_ATOMIC);
}
+static u32 virtnet_ptr_to_xsk_buff_len(void *ptr)
+{
+ return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
+}
+
static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
{
sg->dma_address = addr;
@@ -586,11 +596,27 @@ static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
stats->bytes += xdp_get_frame_len(frame);
xdp_return_frame(frame);
break;
+
+ case VIRTNET_XMIT_TYPE_XSK:
+ stats->bytes += virtnet_ptr_to_xsk_buff_len(ptr);
+ stats->xsk++;
+ break;
}
}
netdev_tx_completed_queue(txq, stats->napi_packets, stats->napi_bytes);
}
+static void virtnet_free_old_xmit(struct send_queue *sq,
+ struct netdev_queue *txq,
+ bool in_napi,
+ struct virtnet_sq_free_stats *stats)
+{
+ __free_old_xmit(sq, txq, in_napi, stats);
+
+ if (stats->xsk)
+ virtnet_xsk_completed(sq, stats->xsk);
+}
+
/* Converting between virtqueue no. and kernel tx/rx queue no.
* 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
*/
@@ -1018,7 +1044,7 @@ static void free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
{
struct virtnet_sq_free_stats stats = {0};
- __free_old_xmit(sq, txq, in_napi, &stats);
+ virtnet_free_old_xmit(sq, txq, in_napi, &stats);
/* Avoid overhead when no packets have been processed
* happens when called speculatively from start_xmit.
@@ -1380,6 +1406,113 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
return err;
}
+static void *virtnet_xsk_to_ptr(u32 len)
+{
+ unsigned long p;
+
+ p = len << VIRTIO_XSK_FLAG_OFFSET;
+
+ return virtnet_xmit_ptr_pack((void *)p, VIRTNET_XMIT_TYPE_XSK);
+}
+
+static int virtnet_xsk_xmit_one(struct send_queue *sq,
+ struct xsk_buff_pool *pool,
+ struct xdp_desc *desc)
+{
+ struct virtnet_info *vi;
+ dma_addr_t addr;
+
+ vi = sq->vq->vdev->priv;
+
+ addr = xsk_buff_raw_get_dma(pool, desc->addr);
+ xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len);
+
+ sg_init_table(sq->sg, 2);
+ sg_fill_dma(sq->sg, sq->xsk_hdr_dma_addr, vi->hdr_len);
+ sg_fill_dma(sq->sg + 1, addr, desc->len);
+
+ return virtqueue_add_outbuf_premapped(sq->vq, sq->sg, 2,
+ virtnet_xsk_to_ptr(desc->len),
+ true, GFP_ATOMIC);
+}
+
+static int virtnet_xsk_xmit_batch(struct send_queue *sq,
+ struct xsk_buff_pool *pool,
+ unsigned int budget,
+ u64 *kicks)
+{
+ struct xdp_desc *descs = pool->tx_descs;
+ bool kick = false;
+ u32 nb_pkts, i;
+ int err;
+
+ budget = min_t(u32, budget, sq->vq->num_free);
+
+ nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
+ if (!nb_pkts)
+ return 0;
+
+ for (i = 0; i < nb_pkts; i++) {
+ err = virtnet_xsk_xmit_one(sq, pool, &descs[i]);
+ if (unlikely(err)) {
+ xsk_tx_completed(sq->xsk_pool, nb_pkts - i);
+ break;
+ }
+
+ kick = true;
+ }
+
+ if (kick && virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
+ (*kicks)++;
+
+ return i;
+}
+
+static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
+ int budget)
+{
+ struct virtnet_info *vi = sq->vq->vdev->priv;
+ struct virtnet_sq_free_stats stats = {};
+ struct net_device *dev = vi->dev;
+ u64 kicks = 0;
+ int sent;
+
+ /* Avoid waking up napi needlessly, so call __free_old_xmit() instead
+ * of free_old_xmit().
+ */
+ __free_old_xmit(sq, netdev_get_tx_queue(dev, sq - vi->sq), true, &stats);
+
+ if (stats.xsk)
+ xsk_tx_completed(sq->xsk_pool, stats.xsk);
+
+ sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
+
+ if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
+ check_sq_full_and_disable(vi, vi->dev, sq);
+
+ u64_stats_update_begin(&sq->stats.syncp);
+ u64_stats_add(&sq->stats.packets, stats.packets);
+ u64_stats_add(&sq->stats.bytes, stats.bytes);
+ u64_stats_add(&sq->stats.kicks, kicks);
+ u64_stats_add(&sq->stats.xdp_tx, sent);
+ u64_stats_update_end(&sq->stats.syncp);
+
+ if (xsk_uses_need_wakeup(pool))
+ xsk_set_tx_need_wakeup(pool);
+
+ return sent;
+}
+
+static void xsk_wakeup(struct send_queue *sq)
+{
+ if (napi_if_scheduled_mark_missed(&sq->napi))
+ return;
+
+ local_bh_disable();
+ virtqueue_napi_schedule(&sq->napi, sq->vq);
+ local_bh_enable();
+}
+
static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
{
struct virtnet_info *vi = netdev_priv(dev);
@@ -1393,14 +1526,19 @@ static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
sq = &vi->sq[qid];
- if (napi_if_scheduled_mark_missed(&sq->napi))
- return 0;
+ xsk_wakeup(sq);
+ return 0;
+}
- local_bh_disable();
- virtqueue_napi_schedule(&sq->napi, sq->vq);
- local_bh_enable();
+static void virtnet_xsk_completed(struct send_queue *sq, int num)
+{
+ xsk_tx_completed(sq->xsk_pool, num);
- return 0;
+ /* If this is called from rx poll, start_xmit or xdp xmit, we should
+ * wake up the tx napi to consume the xsk tx queue, because the tx
+ * interrupt may not be triggered.
+ */
+ xsk_wakeup(sq);
}
static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
@@ -1516,8 +1654,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
}
/* Free up any pending old buffers before queueing new ones. */
- __free_old_xmit(sq, netdev_get_tx_queue(dev, sq - vi->sq),
- false, &stats);
+ virtnet_free_old_xmit(sq, netdev_get_tx_queue(dev, sq - vi->sq),
+ false, &stats);
for (i = 0; i < n; i++) {
struct xdp_frame *xdpf = frames[i];
@@ -2969,7 +3107,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
struct virtnet_info *vi = sq->vq->vdev->priv;
unsigned int index = vq2txq(sq->vq);
struct netdev_queue *txq;
- int opaque;
+ int opaque, xsk_done = 0;
bool done;
if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
@@ -2981,7 +3119,11 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
txq = netdev_get_tx_queue(vi->dev, index);
__netif_tx_lock(txq, raw_smp_processor_id());
virtqueue_disable_cb(sq->vq);
- free_old_xmit(sq, txq, !!budget);
+
+ if (sq->xsk_pool)
+ xsk_done = virtnet_xsk_xmit(sq, sq->xsk_pool, budget);
+ else
+ free_old_xmit(sq, txq, !!budget);
if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) {
if (netif_tx_queue_stopped(txq)) {
@@ -2992,6 +3134,11 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
netif_tx_wake_queue(txq);
}
+ if (xsk_done >= budget) {
+ __netif_tx_unlock(txq);
+ return budget;
+ }
+
opaque = virtqueue_enable_cb_prepare(sq->vq);
done = napi_complete_done(napi, 0);
@@ -5993,6 +6140,12 @@ static void free_receive_page_frags(struct virtnet_info *vi)
static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
{
+ struct virtnet_info *vi = vq->vdev->priv;
+ struct send_queue *sq;
+ int i = vq2txq(vq);
+
+ sq = &vi->sq[i];
+
switch (virtnet_xmit_ptr_unpack(&buf)) {
case VIRTNET_XMIT_TYPE_SKB:
case VIRTNET_XMIT_TYPE_SKB_ORPHAN:
@@ -6002,6 +6155,10 @@ static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
case VIRTNET_XMIT_TYPE_XDP:
xdp_return_frame(buf);
break;
+
+ case VIRTNET_XMIT_TYPE_XSK:
+ xsk_tx_completed(sq->xsk_pool, 1);
+ break;
}
}
--
2.32.0.3.g01195cf9f
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH net-next v2 12/13] virtio_net: update tx timeout record
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (10 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 11/13] virtio_net: xsk: tx: support xmit xsk buffer Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 13/13] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY Xuan Zhuo
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
If the send queue has sent some packets, update the tx timeout
record to prevent spurious tx timeouts.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/virtio_net.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4d00d73d8088..091e3ed0cafa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1490,6 +1490,13 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
check_sq_full_and_disable(vi, vi->dev, sq);
+ if (sent) {
+ struct netdev_queue *txq;
+
+ txq = netdev_get_tx_queue(vi->dev, sq - vi->sq);
+ txq_trans_cond_update(txq);
+ }
+
u64_stats_update_begin(&sq->stats.syncp);
u64_stats_add(&sq->stats.packets, stats.packets);
u64_stats_add(&sq->stats.bytes, stats.bytes);
--
2.32.0.3.g01195cf9f
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH net-next v2 13/13] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
` (11 preceding siblings ...)
2024-10-30 8:24 ` [PATCH net-next v2 12/13] virtio_net: update tx timeout record Xuan Zhuo
@ 2024-10-30 8:24 ` Xuan Zhuo
12 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-10-30 8:24 UTC (permalink / raw)
To: netdev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
Now that we support AF_XDP (xsk), add NETDEV_XDP_ACT_XSK_ZEROCOPY to
xdp_features.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/virtio_net.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 091e3ed0cafa..43e5234af699 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -6614,7 +6614,8 @@ static int virtnet_probe(struct virtio_device *vdev)
dev->hw_features |= NETIF_F_GRO_HW;
dev->vlan_features = dev->features;
- dev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT;
+ dev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
+ NETDEV_XDP_ACT_XSK_ZEROCOPY;
/* MTU range: 68 - 65535 */
dev->min_mtu = MIN_MTU;
--
2.32.0.3.g01195cf9f
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped
2024-10-30 8:24 ` [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped Xuan Zhuo
@ 2024-10-31 2:19 ` kernel test robot
2024-11-05 4:28 ` Jason Wang
1 sibling, 0 replies; 28+ messages in thread
From: kernel test robot @ 2024-10-31 2:19 UTC (permalink / raw)
To: Xuan Zhuo, netdev
Cc: llvm, oe-kbuild-all, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Andrew Lunn, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
Hi Xuan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/virtio_ring-introduce-vring_need_unmap_buffer/20241030-162739
base: net-next/main
patch link: https://lore.kernel.org/r/20241030082453.97310-6-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped
config: x86_64-buildonly-randconfig-003-20241031 (https://download.01.org/0day-ci/archive/20241031/202410310925.LuCycrTj-lkp@intel.com/config)
compiler: clang version 19.1.2 (https://github.com/llvm/llvm-project 7ba7d8e2f7b6445b60679da826210cdde29eaf8b)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241031/202410310925.LuCycrTj-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410310925.LuCycrTj-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/virtio/virtio_ring.c:2293: warning: Function parameter or struct member 'premapped' not described in 'virtqueue_add_outbuf_premapped'
>> drivers/virtio/virtio_ring.c:2364: warning: Function parameter or struct member 'premapped' not described in 'virtqueue_add_inbuf_premapped'
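A likely fix for both warnings (an editorial suggestion, not part of the posted
series) is to document the new parameter in each kernel-doc block, e.g. for
virtqueue_add_outbuf_premapped():

/**
 * virtqueue_add_outbuf_premapped - expose output buffers to other end
 * @vq: the struct virtqueue we're talking about.
 * @sg: scatterlist (must be well-formed and terminated!)
 * @num: the number of entries in @sg readable by other side
 * @data: the token identifying the buffer.
 * @premapped: true if the buffers in @sg were already DMA mapped by the caller.
 * @gfp: how to do memory allocations (if necessary).
 *
 * Caller must ensure we don't call this with other virtqueue operations
 * at the same time (except where noted).
 *
 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
 */

with an analogous @premapped line added for virtqueue_add_inbuf_premapped().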
vim +2293 drivers/virtio/virtio_ring.c
2274
2275 /**
2276 * virtqueue_add_outbuf_premapped - expose output buffers to other end
2277 * @vq: the struct virtqueue we're talking about.
2278 * @sg: scatterlist (must be well-formed and terminated!)
2279 * @num: the number of entries in @sg readable by other side
2280 * @data: the token identifying the buffer.
2281 * @gfp: how to do memory allocations (if necessary).
2282 *
2283 * Caller must ensure we don't call this with other virtqueue operations
2284 * at the same time (except where noted).
2285 *
2286 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
2287 */
2288 int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
2289 struct scatterlist *sg, unsigned int num,
2290 void *data,
2291 bool premapped,
2292 gfp_t gfp)
> 2293 {
2294 return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, premapped, gfp);
2295 }
2296 EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_premapped);
2297
2298 /**
2299 * virtqueue_add_inbuf - expose input buffers to other end
2300 * @vq: the struct virtqueue we're talking about.
2301 * @sg: scatterlist (must be well-formed and terminated!)
2302 * @num: the number of entries in @sg writable by other side
2303 * @data: the token identifying the buffer.
2304 * @gfp: how to do memory allocations (if necessary).
2305 *
2306 * Caller must ensure we don't call this with other virtqueue operations
2307 * at the same time (except where noted).
2308 *
2309 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
2310 */
2311 int virtqueue_add_inbuf(struct virtqueue *vq,
2312 struct scatterlist *sg, unsigned int num,
2313 void *data,
2314 gfp_t gfp)
2315 {
2316 return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, false, gfp);
2317 }
2318 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
2319
2320 /**
2321 * virtqueue_add_inbuf_ctx - expose input buffers to other end
2322 * @vq: the struct virtqueue we're talking about.
2323 * @sg: scatterlist (must be well-formed and terminated!)
2324 * @num: the number of entries in @sg writable by other side
2325 * @data: the token identifying the buffer.
2326 * @ctx: extra context for the token
2327 * @gfp: how to do memory allocations (if necessary).
2328 *
2329 * Caller must ensure we don't call this with other virtqueue operations
2330 * at the same time (except where noted).
2331 *
2332 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
2333 */
2334 int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
2335 struct scatterlist *sg, unsigned int num,
2336 void *data,
2337 void *ctx,
2338 gfp_t gfp)
2339 {
2340 return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, false, gfp);
2341 }
2342 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
2343
2344 /**
2345 * virtqueue_add_inbuf_premapped - expose input buffers to other end
2346 * @vq: the struct virtqueue we're talking about.
2347 * @sg: scatterlist (must be well-formed and terminated!)
2348 * @num: the number of entries in @sg writable by other side
2349 * @data: the token identifying the buffer.
2350 * @ctx: extra context for the token
2351 * @gfp: how to do memory allocations (if necessary).
2352 *
2353 * Caller must ensure we don't call this with other virtqueue operations
2354 * at the same time (except where noted).
2355 *
2356 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
2357 */
2358 int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
2359 struct scatterlist *sg, unsigned int num,
2360 void *data,
2361 void *ctx,
2362 bool premapped,
2363 gfp_t gfp)
> 2364 {
2365 return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, premapped, gfp);
2366 }
2367 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_premapped);
2368
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
2024-10-30 8:24 ` [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer Xuan Zhuo
@ 2024-11-05 3:23 ` Jason Wang
2024-11-05 7:09 ` Xuan Zhuo
0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2024-11-05 3:23 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> virtio-net rq submits premapped buffers per-buffer by setting the sg page to NULL.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> drivers/net/virtio_net.c | 24 +++++++++++++-----------
> 1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 792e9eadbfc3..09757fa408bd 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
> }
>
> +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> +{
> + sg->dma_address = addr;
> + sg->length = len;
This may work, but I think it's better to reuse the existing dma sg helpers:
sg_dma_address(sg) = addr;
sg_dma_len(sg) = len;
And we probably need to fix the virtio core, which only uses
sg_dma_address() but not sg_dma_len().
This helps us avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set.
Others look good.
Thanks
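For reference, a minimal sketch of the suggested helper (a sketch only,
assuming the standard accessors from <linux/scatterlist.h>):

#include <linux/scatterlist.h>

/* Fill a premapped sg entry via the standard DMA accessors instead of
 * writing sg->dma_address / sg->length directly.
 */
static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
{
	sg_dma_address(sg) = addr;
	sg_dma_len(sg) = len;
}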
> +}
> +
> static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> bool in_napi, struct virtnet_sq_free_stats *stats)
> {
> @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
> addr = dma->addr - sizeof(*dma) + offset;
>
> sg_init_table(rq->sg, 1);
> - rq->sg[0].dma_address = addr;
> - rq->sg[0].length = len;
> + sg_fill_dma(rq->sg, addr, len);
> }
>
> static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
> @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct virtnet_info *vi,
> }
> }
>
> -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> -{
> - sg->dma_address = addr;
> - sg->length = len;
> -}
> -
> static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
> struct receive_queue *rq, void *buf, u32 len)
> {
> @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> sg_init_table(rq->sg, 1);
> sg_fill_dma(rq->sg, addr, len);
>
> - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
> + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, xsk_buffs[i],
> + NULL, true, gfp);
> if (err)
> goto err;
> }
> @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>
> virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
>
> - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> + rq->do_dma, gfp);
> if (err < 0) {
> if (rq->do_dma)
> virtnet_rq_unmap(rq, buf, 0);
> @@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
> virtnet_rq_init_one_sg(rq, buf, len);
>
> ctx = mergeable_len_to_ctx(len + room, headroom);
> - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> + rq->do_dma, gfp);
> if (err < 0) {
> if (rq->do_dma)
> virtnet_rq_unmap(rq, buf, 0);
> --
> 2.32.0.3.g01195cf9f
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
2024-10-30 8:24 ` [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers Xuan Zhuo
@ 2024-11-05 3:42 ` Jason Wang
2024-11-05 6:51 ` Xuan Zhuo
0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2024-11-05 3:42 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> The subsequent commit needs to know whether every indirect buffer is
> premapped or not. So we need to introduce an extra struct for every
> indirect buffer to record this info.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> drivers/virtio/virtio_ring.c | 112 ++++++++++++++++-------------------
> 1 file changed, 52 insertions(+), 60 deletions(-)
Do we have a performance impact for this patch?
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 97590c201aa2..dca093744fe1 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -69,7 +69,11 @@
>
> struct vring_desc_state_split {
> void *data; /* Data for callback. */
> - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
> +
> + /* Indirect extra table and desc table, if any. These two will be
> + * allocated together. So we won't stress more to the memory allocator.
> + */
> + struct vring_desc *indir_desc;
So it looks like we put a descriptor table after the extra table. Can
this lead to more crossing page mappings for the indirect descriptors?
If yes, it seems expensive so we probably need to make the descriptor
table come first.
> };
>
> struct vring_desc_state_packed {
> @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue *vq, u32 num)
> * Split ring specific functions - *_split().
> */
>
> -static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
> - const struct vring_desc *desc)
> -{
> - u16 flags;
> -
> - if (!vring_need_unmap_buffer(vq))
> - return;
> -
> - flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> -
> - dma_unmap_page(vring_dma_dev(vq),
> - virtio64_to_cpu(vq->vq.vdev, desc->addr),
> - virtio32_to_cpu(vq->vq.vdev, desc->len),
> - (flags & VRING_DESC_F_WRITE) ?
> - DMA_FROM_DEVICE : DMA_TO_DEVICE);
> -}
> -
> static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> - unsigned int i)
> + struct vring_desc_extra *extra)
> {
> - struct vring_desc_extra *extra = vq->split.desc_extra;
> u16 flags;
>
> - flags = extra[i].flags;
> + flags = extra->flags;
>
> if (flags & VRING_DESC_F_INDIRECT) {
> if (!vq->use_dma_api)
> goto out;
>
> dma_unmap_single(vring_dma_dev(vq),
> - extra[i].addr,
> - extra[i].len,
> + extra->addr,
> + extra->len,
> (flags & VRING_DESC_F_WRITE) ?
> DMA_FROM_DEVICE : DMA_TO_DEVICE);
> } else {
> @@ -479,22 +465,23 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> goto out;
>
> dma_unmap_page(vring_dma_dev(vq),
> - extra[i].addr,
> - extra[i].len,
> + extra->addr,
> + extra->len,
> (flags & VRING_DESC_F_WRITE) ?
> DMA_FROM_DEVICE : DMA_TO_DEVICE);
> }
>
> out:
> - return extra[i].next;
> + return extra->next;
> }
>
> static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> unsigned int total_sg,
> gfp_t gfp)
> {
> + struct vring_desc_extra *extra;
> struct vring_desc *desc;
> - unsigned int i;
> + unsigned int i, size;
>
> /*
> * We require lowmem mappings for the descriptors because
> @@ -503,40 +490,41 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> */
> gfp &= ~__GFP_HIGHMEM;
>
> - desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
> + size = sizeof(*desc) * total_sg + sizeof(*extra) * total_sg;
> +
> + desc = kmalloc(size, gfp);
> if (!desc)
> return NULL;
>
> + extra = (struct vring_desc_extra *)&desc[total_sg];
> +
> for (i = 0; i < total_sg; i++)
> - desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> + extra[i].next = i + 1;
> +
> return desc;
> }
>
> static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
> struct vring_desc *desc,
> + struct vring_desc_extra *extra,
> unsigned int i,
> dma_addr_t addr,
> unsigned int len,
> - u16 flags,
> - bool indirect)
> + u16 flags)
> {
> - struct vring_virtqueue *vring = to_vvq(vq);
> - struct vring_desc_extra *extra = vring->split.desc_extra;
> u16 next;
>
> desc[i].flags = cpu_to_virtio16(vq->vdev, flags);
> desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
> desc[i].len = cpu_to_virtio32(vq->vdev, len);
>
> - if (!indirect) {
> - next = extra[i].next;
> - desc[i].next = cpu_to_virtio16(vq->vdev, next);
> + extra[i].addr = addr;
> + extra[i].len = len;
> + extra[i].flags = flags;
> +
> + next = extra[i].next;
>
> - extra[i].addr = addr;
> - extra[i].len = len;
> - extra[i].flags = flags;
> - } else
> - next = virtio16_to_cpu(vq->vdev, desc[i].next);
> + desc[i].next = cpu_to_virtio16(vq->vdev, next);
>
> return next;
> }
> @@ -551,6 +539,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> gfp_t gfp)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> + struct vring_desc_extra *extra;
> struct scatterlist *sg;
> struct vring_desc *desc;
> unsigned int i, n, avail, descs_used, prev, err_idx;
> @@ -586,9 +575,11 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> /* Set up rest to use this indirect table. */
> i = 0;
> descs_used = 1;
> + extra = (struct vring_desc_extra *)&desc[total_sg];
> } else {
> indirect = false;
> desc = vq->split.vring.desc;
> + extra = vq->split.desc_extra;
> i = head;
> descs_used = total_sg;
> }
> @@ -618,9 +609,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> /* Note that we trust indirect descriptor
> * table since it use stream DMA mapping.
> */
> - i = virtqueue_add_desc_split(_vq, desc, i, addr, sg->length,
> - VRING_DESC_F_NEXT,
> - indirect);
> + i = virtqueue_add_desc_split(_vq, desc, extra, i, addr, sg->length,
> + VRING_DESC_F_NEXT);
> }
> }
> for (; n < (out_sgs + in_sgs); n++) {
> @@ -634,11 +624,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> /* Note that we trust indirect descriptor
> * table since it use stream DMA mapping.
> */
> - i = virtqueue_add_desc_split(_vq, desc, i, addr,
> + i = virtqueue_add_desc_split(_vq, desc, extra, i, addr,
> sg->length,
> VRING_DESC_F_NEXT |
> - VRING_DESC_F_WRITE,
> - indirect);
> + VRING_DESC_F_WRITE);
> }
> }
> /* Last one doesn't continue. */
> @@ -660,10 +649,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> }
>
> virtqueue_add_desc_split(_vq, vq->split.vring.desc,
> + vq->split.desc_extra,
> head, addr,
> total_sg * sizeof(struct vring_desc),
> - VRING_DESC_F_INDIRECT,
> - false);
> + VRING_DESC_F_INDIRECT);
> }
>
> /* We're using some buffers from the free list. */
> @@ -716,11 +705,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> for (n = 0; n < total_sg; n++) {
> if (i == err_idx)
> break;
> - if (indirect) {
> - vring_unmap_one_split_indirect(vq, &desc[i]);
> - i = virtio16_to_cpu(_vq->vdev, desc[i].next);
> - } else
> - i = vring_unmap_one_split(vq, i);
> +
> + i = vring_unmap_one_split(vq, &extra[i]);
> }
>
> free_indirect:
> @@ -765,22 +751,25 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
> static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> void **ctx)
> {
> + struct vring_desc_extra *extra;
> unsigned int i, j;
> __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>
> /* Clear data ptr. */
> vq->split.desc_state[head].data = NULL;
>
> + extra = vq->split.desc_extra;
> +
> /* Put back on free list: unmap first-level descriptors and find end */
> i = head;
>
> while (vq->split.vring.desc[i].flags & nextflag) {
> - vring_unmap_one_split(vq, i);
> + vring_unmap_one_split(vq, &extra[i]);
Not sure if I've asked this before. But this part seems to deserve an
independent fix for -stable.
> i = vq->split.desc_extra[i].next;
> vq->vq.num_free++;
> }
>
> - vring_unmap_one_split(vq, i);
> + vring_unmap_one_split(vq, &extra[i]);
> vq->split.desc_extra[i].next = vq->free_head;
> vq->free_head = head;
>
> @@ -790,21 +779,24 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> if (vq->indirect) {
> struct vring_desc *indir_desc =
> vq->split.desc_state[head].indir_desc;
> - u32 len;
> + u32 len, num;
>
> /* Free the indirect table, if any, now that it's unmapped. */
> if (!indir_desc)
> return;
> -
> len = vq->split.desc_extra[head].len;
>
> BUG_ON(!(vq->split.desc_extra[head].flags &
> VRING_DESC_F_INDIRECT));
> BUG_ON(len == 0 || len % sizeof(struct vring_desc));
>
> + num = len / sizeof(struct vring_desc);
> +
> + extra = (struct vring_desc_extra *)&indir_desc[num];
> +
> if (vring_need_unmap_buffer(vq)) {
> - for (j = 0; j < len / sizeof(struct vring_desc); j++)
> - vring_unmap_one_split_indirect(vq, &indir_desc[j]);
> + for (j = 0; j < num; j++)
> + vring_unmap_one_split(vq, &extra[j]);
> }
>
> kfree(indir_desc);
> --
> 2.32.0.3.g01195cf9f
>
Thanks
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer
2024-10-30 8:24 ` [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer Xuan Zhuo
@ 2024-11-05 4:25 ` Jason Wang
0 siblings, 0 replies; 28+ messages in thread
From: Jason Wang @ 2024-11-05 4:25 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> The current configuration sets the virtqueue (vq) to premapped mode,
> implying that all buffers submitted to this queue must be mapped ahead
> of time. This presents a challenge for the virtnet send queue (sq): the
> virtnet driver would be required to keep track of dma information for vq
> size * 17,
Even worse, as MAX_SKB_FRAGS could be even larger.
> which can be substantial. However, if the premapped mode were
> applied on a per-buffer basis, the complexity would be greatly reduced.
> With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
> skb buffers could remain unmapped.
>
> And consider that some sgs are not generated by the virtio driver; they
> may be passed in from the block stack. So we can not change the sgs,
> and new APIs are the better way.
I had some new thoughts on this.
Pre-mapping makes a lot of sense, as it allows us to move the
expensive DMA mapping operations (swiotlb, IOMMU or VDUSE) out of the
per-vq lock. I wonder what we would do if someday we want to convert
the skb TX path to be premapped (or even all virtio drivers).
Considering we've already used skb_to_sgvec() in start_xmit(), it looks
like we need to allocate sg[queue_num][MAX_SKB_FRAGS + 2] and store
the dma mappings there. Then we probably don't even need to duplicate
the dma mapping information in the core.
That means it is the driver's responsibility to store the dma information via
the sg, so we can simply use dma_map_sg() in add_sgs(), which allows various
optimizations in the IOMMU layers. A rough sketch of that idea follows.
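(A purely illustrative sketch of that direction; struct tx_premap and
premap_skb() are hypothetical names, and error handling is kept minimal.)

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <linux/skbuff.h>

/* Hypothetical per-queue scratch sg table, large enough for one skb plus
 * the virtio-net header.
 */
struct tx_premap {
	struct scatterlist sg[MAX_SKB_FRAGS + 2];
};

/* Map an skb outside the vq lock. The mapped sg entries themselves carry
 * the dma addresses, so nothing has to be duplicated in the virtio core.
 */
static int premap_skb(struct device *dma_dev, struct tx_premap *p,
		      struct sk_buff *skb, int *nents)
{
	int n;

	sg_init_table(p->sg, ARRAY_SIZE(p->sg));
	n = skb_to_sgvec(skb, p->sg, 0, skb->len);
	if (n < 0)
		return n;

	*nents = dma_map_sg(dma_dev, p->sg, n, DMA_TO_DEVICE);
	return *nents ? 0 : -ENOMEM;
}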
>
> So we pass the new argument 'premapped' to indicate the buffers
> submitted to virtio are premapped in advance. Additionally,
> DMA unmap operations for these buffers will be bypassed.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Anyhow
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks
> ---
> drivers/virtio/virtio_ring.c | 79 ++++++++++++++++++------------------
> 1 file changed, 40 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 628e01af1c9a..a89295b79e66 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -300,9 +300,10 @@ static bool vring_use_dma_api(const struct virtio_device *vdev)
> return false;
> }
>
> -static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring)
> +static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring,
> + const struct vring_desc_extra *extra)
> {
> - return vring->use_dma_api && !vring->premapped;
> + return vring->use_dma_api && (extra->addr != DMA_MAPPING_ERROR);
> }
>
> size_t virtio_max_dma_size(const struct virtio_device *vdev)
> @@ -372,9 +373,10 @@ static struct device *vring_dma_dev(const struct vring_virtqueue *vq)
>
> /* Map one sg entry. */
> static int vring_map_one_sg(const struct vring_virtqueue *vq, struct scatterlist *sg,
> - enum dma_data_direction direction, dma_addr_t *addr)
> + enum dma_data_direction direction, dma_addr_t *addr,
> + bool premapped)
> {
> - if (vq->premapped) {
> + if (premapped) {
> *addr = sg_dma_address(sg);
> return 0;
> }
> @@ -465,7 +467,7 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> (flags & VRING_DESC_F_WRITE) ?
> DMA_FROM_DEVICE : DMA_TO_DEVICE);
> } else {
> - if (!vring_need_unmap_buffer(vq))
> + if (!vring_need_unmap_buffer(vq, extra))
> goto out;
>
> dma_unmap_page(vring_dma_dev(vq),
> @@ -514,7 +516,7 @@ static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
> unsigned int i,
> dma_addr_t addr,
> unsigned int len,
> - u16 flags)
> + u16 flags, bool premapped)
> {
> u16 next;
>
> @@ -522,7 +524,7 @@ static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
> desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
> desc[i].len = cpu_to_virtio32(vq->vdev, len);
>
> - extra[i].addr = addr;
> + extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr;
> extra[i].len = len;
> extra[i].flags = flags;
>
> @@ -540,6 +542,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> unsigned int in_sgs,
> void *data,
> void *ctx,
> + bool premapped,
> gfp_t gfp)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -606,7 +609,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> dma_addr_t addr;
>
> - if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr))
> + if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr, premapped))
> goto unmap_release;
>
> prev = i;
> @@ -614,14 +617,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> * table since it use stream DMA mapping.
> */
> i = virtqueue_add_desc_split(_vq, desc, extra, i, addr, sg->length,
> - VRING_DESC_F_NEXT);
> + VRING_DESC_F_NEXT,
> + premapped);
> }
> }
> for (; n < (out_sgs + in_sgs); n++) {
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> dma_addr_t addr;
>
> - if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr))
> + if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr, premapped))
> goto unmap_release;
>
> prev = i;
> @@ -631,12 +635,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> i = virtqueue_add_desc_split(_vq, desc, extra, i, addr,
> sg->length,
> VRING_DESC_F_NEXT |
> - VRING_DESC_F_WRITE);
> + VRING_DESC_F_WRITE,
> + premapped);
> }
> }
> /* Last one doesn't continue. */
> desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> - if (!indirect && vring_need_unmap_buffer(vq))
> + if (!indirect && vring_need_unmap_buffer(vq, &extra[prev]))
> vq->split.desc_extra[prev & (vq->split.vring.num - 1)].flags &=
> ~VRING_DESC_F_NEXT;
>
> @@ -645,18 +650,14 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> dma_addr_t addr = vring_map_single(
> vq, desc, total_sg * sizeof(struct vring_desc),
> DMA_TO_DEVICE);
> - if (vring_mapping_error(vq, addr)) {
> - if (vq->premapped)
> - goto free_indirect;
> -
> + if (vring_mapping_error(vq, addr))
> goto unmap_release;
> - }
>
> virtqueue_add_desc_split(_vq, vq->split.vring.desc,
> vq->split.desc_extra,
> head, addr,
> total_sg * sizeof(struct vring_desc),
> - VRING_DESC_F_INDIRECT);
> + VRING_DESC_F_INDIRECT, false);
> }
>
> /* We're using some buffers from the free list. */
> @@ -713,7 +714,6 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> i = vring_unmap_one_split(vq, &extra[i]);
> }
>
> -free_indirect:
> if (indirect)
> kfree(desc);
>
> @@ -798,7 +798,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
>
> extra = (struct vring_desc_extra *)&indir_desc[num];
>
> - if (vring_need_unmap_buffer(vq)) {
> + if (vq->use_dma_api) {
> for (j = 0; j < num; j++)
> vring_unmap_one_split(vq, &extra[j]);
> }
> @@ -1232,7 +1232,7 @@ static void vring_unmap_extra_packed(const struct vring_virtqueue *vq,
> (flags & VRING_DESC_F_WRITE) ?
> DMA_FROM_DEVICE : DMA_TO_DEVICE);
> } else {
> - if (!vring_need_unmap_buffer(vq))
> + if (!vring_need_unmap_buffer(vq, extra))
> return;
>
> dma_unmap_page(vring_dma_dev(vq),
> @@ -1276,6 +1276,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
> unsigned int out_sgs,
> unsigned int in_sgs,
> void *data,
> + bool premapped,
> gfp_t gfp)
> {
> struct vring_desc_extra *extra;
> @@ -1306,7 +1307,8 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
> for (n = 0; n < out_sgs + in_sgs; n++) {
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> if (vring_map_one_sg(vq, sg, n < out_sgs ?
> - DMA_TO_DEVICE : DMA_FROM_DEVICE, &addr))
> + DMA_TO_DEVICE : DMA_FROM_DEVICE,
> + &addr, premapped))
> goto unmap_release;
>
> desc[i].flags = cpu_to_le16(n < out_sgs ?
> @@ -1315,7 +1317,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
> desc[i].len = cpu_to_le32(sg->length);
>
> if (unlikely(vq->use_dma_api)) {
> - extra[i].addr = addr;
> + extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr;
> extra[i].len = sg->length;
> extra[i].flags = n < out_sgs ? 0 : VRING_DESC_F_WRITE;
> }
> @@ -1328,12 +1330,8 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
> addr = vring_map_single(vq, desc,
> total_sg * sizeof(struct vring_packed_desc),
> DMA_TO_DEVICE);
> - if (vring_mapping_error(vq, addr)) {
> - if (vq->premapped)
> - goto free_desc;
> -
> + if (vring_mapping_error(vq, addr))
> goto unmap_release;
> - }
>
> vq->packed.vring.desc[head].addr = cpu_to_le64(addr);
> vq->packed.vring.desc[head].len = cpu_to_le32(total_sg *
> @@ -1391,7 +1389,6 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
> for (i = 0; i < err_idx; i++)
> vring_unmap_extra_packed(vq, &extra[i]);
>
> -free_desc:
> kfree(desc);
>
> END_USE(vq);
> @@ -1405,6 +1402,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> unsigned int in_sgs,
> void *data,
> void *ctx,
> + bool premapped,
> gfp_t gfp)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -1431,7 +1429,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
>
> if (virtqueue_use_indirect(vq, total_sg)) {
> err = virtqueue_add_indirect_packed(vq, sgs, total_sg, out_sgs,
> - in_sgs, data, gfp);
> + in_sgs, data, premapped, gfp);
> if (err != -ENOMEM) {
> END_USE(vq);
> return err;
> @@ -1466,7 +1464,8 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> dma_addr_t addr;
>
> if (vring_map_one_sg(vq, sg, n < out_sgs ?
> - DMA_TO_DEVICE : DMA_FROM_DEVICE, &addr))
> + DMA_TO_DEVICE : DMA_FROM_DEVICE,
> + &addr, premapped))
> goto unmap_release;
>
> flags = cpu_to_le16(vq->packed.avail_used_flags |
> @@ -1482,7 +1481,8 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> desc[i].id = cpu_to_le16(id);
>
> if (unlikely(vq->use_dma_api)) {
> - vq->packed.desc_extra[curr].addr = addr;
> + vq->packed.desc_extra[curr].addr = premapped ?
> + DMA_MAPPING_ERROR : addr;
> vq->packed.desc_extra[curr].len = sg->length;
> vq->packed.desc_extra[curr].flags =
> le16_to_cpu(flags);
> @@ -1633,7 +1633,7 @@ static void detach_buf_packed(struct vring_virtqueue *vq,
> if (!desc)
> return;
>
> - if (vring_need_unmap_buffer(vq)) {
> + if (vq->use_dma_api) {
> len = vq->packed.desc_extra[id].len;
> num = len / sizeof(struct vring_packed_desc);
>
> @@ -2204,14 +2204,15 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> unsigned int in_sgs,
> void *data,
> void *ctx,
> + bool premapped,
> gfp_t gfp)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
>
> return vq->packed_ring ? virtqueue_add_packed(_vq, sgs, total_sg,
> - out_sgs, in_sgs, data, ctx, gfp) :
> + out_sgs, in_sgs, data, ctx, premapped, gfp) :
> virtqueue_add_split(_vq, sgs, total_sg,
> - out_sgs, in_sgs, data, ctx, gfp);
> + out_sgs, in_sgs, data, ctx, premapped, gfp);
> }
>
> /**
> @@ -2245,7 +2246,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
> total_sg++;
> }
> return virtqueue_add(_vq, sgs, total_sg, out_sgs, in_sgs,
> - data, NULL, gfp);
> + data, NULL, false, gfp);
> }
> EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
>
> @@ -2267,7 +2268,7 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
> void *data,
> gfp_t gfp)
> {
> - return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, gfp);
> + return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, false, gfp);
> }
> EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>
> @@ -2289,7 +2290,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
> void *data,
> gfp_t gfp)
> {
> - return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, gfp);
> + return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, false, gfp);
> }
> EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
>
> @@ -2313,7 +2314,7 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
> void *ctx,
> gfp_t gfp)
> {
> - return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, gfp);
> + return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, false, gfp);
> }
> EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
>
> --
> 2.32.0.3.g01195cf9f
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped
2024-10-30 8:24 ` [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped Xuan Zhuo
2024-10-31 2:19 ` kernel test robot
@ 2024-11-05 4:28 ` Jason Wang
1 sibling, 0 replies; 28+ messages in thread
From: Jason Wang @ 2024-11-05 4:28 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> Two APIs are introduced to submit premapped per-buffers.
>
> int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
> struct scatterlist *sg, unsigned int num,
> void *data,
> void *ctx,
> bool premapped,
> gfp_t gfp);
>
> int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
> struct scatterlist *sg, unsigned int num,
> void *data,
> bool premapped,
> gfp_t gfp);
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> drivers/virtio/virtio_ring.c | 48 ++++++++++++++++++++++++++++++++++++
> include/linux/virtio.h | 13 ++++++++++
> 2 files changed, 61 insertions(+)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index a89295b79e66..525308d82728 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -2272,6 +2272,29 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
> }
> EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>
> +/**
> + * virtqueue_add_outbuf_premapped - expose output buffers to other end
> + * @vq: the struct virtqueue we're talking about.
> + * @sg: scatterlist (must be well-formed and terminated!)
> + * @num: the number of entries in @sg readable by other side
> + * @data: the token identifying the buffer.
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
> + struct scatterlist *sg, unsigned int num,
> + void *data,
> + bool premapped,
We don't need this parameter, considering:
1) we already have virtqueue_add_outbuf(), which implies the buf has
been mapped
2) there is no explanation for "premapped" in the function doc
> + gfp_t gfp)
> +{
> + return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, premapped, gfp);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_premapped);
> +
> /**
> * virtqueue_add_inbuf - expose input buffers to other end
> * @vq: the struct virtqueue we're talking about.
> @@ -2318,6 +2341,31 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
> }
> EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
>
> +/**
> + * virtqueue_add_inbuf_premapped - expose input buffers to other end
> + * @vq: the struct virtqueue we're talking about.
> + * @sg: scatterlist (must be well-formed and terminated!)
> + * @num: the number of entries in @sg writable by other side
> + * @data: the token identifying the buffer.
> + * @ctx: extra context for the token
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
> + struct scatterlist *sg, unsigned int num,
> + void *data,
> + void *ctx,
> + bool premapped,
> + gfp_t gfp)
> +{
> + return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, premapped, gfp);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_premapped);
> +
> /**
> * virtqueue_dma_dev - get the dma dev
> * @_vq: the struct virtqueue we're talking about.
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 306137a15d07..19afa49b92d0 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -56,6 +56,19 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
> void *ctx,
> gfp_t gfp);
>
> +int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
> + struct scatterlist *sg, unsigned int num,
> + void *data,
> + void *ctx,
> + bool premapped,
> + gfp_t gfp);
> +
> +int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
> + struct scatterlist *sg, unsigned int num,
> + void *data,
> + bool premapped,
> + gfp_t gfp);
> +
> int virtqueue_add_sgs(struct virtqueue *vq,
> struct scatterlist *sgs[],
> unsigned int out_sgs,
> --
> 2.32.0.3.g01195cf9f
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped
2024-10-30 8:24 ` [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped Xuan Zhuo
@ 2024-11-05 4:29 ` Jason Wang
0 siblings, 0 replies; 28+ messages in thread
From: Jason Wang @ 2024-11-05 4:29 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> Now this API is useless. Remove it.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
2024-11-05 3:42 ` Jason Wang
@ 2024-11-05 6:51 ` Xuan Zhuo
2024-11-06 1:44 ` Jason Wang
0 siblings, 1 reply; 28+ messages in thread
From: Xuan Zhuo @ 2024-11-05 6:51 UTC (permalink / raw)
To: Jason Wang
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > The subsequent commit needs to know whether every indirect buffer is
> > premapped or not. So we need to introduce an extra struct for every
> > indirect buffer to record this info.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> > drivers/virtio/virtio_ring.c | 112 ++++++++++++++++-------------------
> > 1 file changed, 52 insertions(+), 60 deletions(-)
>
> Do we have a performance impact for this patch?
>
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 97590c201aa2..dca093744fe1 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -69,7 +69,11 @@
> >
> > struct vring_desc_state_split {
> > void *data; /* Data for callback. */
> > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
> > +
> > + /* Indirect extra table and desc table, if any. These two will be
> > + * allocated together. So we won't stress more to the memory allocator.
> > + */
> > + struct vring_desc *indir_desc;
>
> So it looks like we put a descriptor table after the extra table. Can
> this lead to more crossing page mappings for the indirect descriptors?
>
> If yes, it seems expensive so we probably need to make the descriptor
> table come first.
No, the descriptor table comes before the extra table (see the layout sketch
below), so there is no performance impact.
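For illustration, the single allocation in alloc_indirect_split() is laid out
as below (a sketch of the hunk above), so the descriptor table itself gains no
extra page crossings:

/*
 *   desc[0] ... desc[total_sg - 1] | extra[0] ... extra[total_sg - 1]
 *   ^ desc = kmalloc(size, gfp)      ^ extra = (struct vring_desc_extra *)&desc[total_sg]
 *
 *   size = total_sg * (sizeof(*desc) + sizeof(*extra));
 */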
>
> > };
> >
> > struct vring_desc_state_packed {
> > @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue *vq, u32 num)
> > * Split ring specific functions - *_split().
> > */
> >
> > -static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
> > - const struct vring_desc *desc)
> > -{
> > - u16 flags;
> > -
> > - if (!vring_need_unmap_buffer(vq))
> > - return;
> > -
> > - flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> > -
> > - dma_unmap_page(vring_dma_dev(vq),
> > - virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > - virtio32_to_cpu(vq->vq.vdev, desc->len),
> > - (flags & VRING_DESC_F_WRITE) ?
> > - DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > -}
> > -
> > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> > - unsigned int i)
> > + struct vring_desc_extra *extra)
> > {
> > - struct vring_desc_extra *extra = vq->split.desc_extra;
> > u16 flags;
> >
> > - flags = extra[i].flags;
> > + flags = extra->flags;
> >
> > if (flags & VRING_DESC_F_INDIRECT) {
> > if (!vq->use_dma_api)
> > goto out;
> >
> > dma_unmap_single(vring_dma_dev(vq),
> > - extra[i].addr,
> > - extra[i].len,
> > + extra->addr,
> > + extra->len,
> > (flags & VRING_DESC_F_WRITE) ?
> > DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > } else {
> > @@ -479,22 +465,23 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> > goto out;
> >
> > dma_unmap_page(vring_dma_dev(vq),
> > - extra[i].addr,
> > - extra[i].len,
> > + extra->addr,
> > + extra->len,
> > (flags & VRING_DESC_F_WRITE) ?
> > DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > }
> >
> > out:
> > - return extra[i].next;
> > + return extra->next;
> > }
> >
> > static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> > unsigned int total_sg,
> > gfp_t gfp)
> > {
> > + struct vring_desc_extra *extra;
> > struct vring_desc *desc;
> > - unsigned int i;
> > + unsigned int i, size;
> >
> > /*
> > * We require lowmem mappings for the descriptors because
> > @@ -503,40 +490,41 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> > */
> > gfp &= ~__GFP_HIGHMEM;
> >
> > - desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
> > + size = sizeof(*desc) * total_sg + sizeof(*extra) * total_sg;
> > +
> > + desc = kmalloc(size, gfp);
> > if (!desc)
> > return NULL;
> >
> > + extra = (struct vring_desc_extra *)&desc[total_sg];
> > +
> > for (i = 0; i < total_sg; i++)
> > - desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> > + extra[i].next = i + 1;
> > +
> > return desc;
> > }
> >
> > static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
> > struct vring_desc *desc,
> > + struct vring_desc_extra *extra,
> > unsigned int i,
> > dma_addr_t addr,
> > unsigned int len,
> > - u16 flags,
> > - bool indirect)
> > + u16 flags)
> > {
> > - struct vring_virtqueue *vring = to_vvq(vq);
> > - struct vring_desc_extra *extra = vring->split.desc_extra;
> > u16 next;
> >
> > desc[i].flags = cpu_to_virtio16(vq->vdev, flags);
> > desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
> > desc[i].len = cpu_to_virtio32(vq->vdev, len);
> >
> > - if (!indirect) {
> > - next = extra[i].next;
> > - desc[i].next = cpu_to_virtio16(vq->vdev, next);
> > + extra[i].addr = addr;
> > + extra[i].len = len;
> > + extra[i].flags = flags;
> > +
> > + next = extra[i].next;
> >
> > - extra[i].addr = addr;
> > - extra[i].len = len;
> > - extra[i].flags = flags;
> > - } else
> > - next = virtio16_to_cpu(vq->vdev, desc[i].next);
> > + desc[i].next = cpu_to_virtio16(vq->vdev, next);
> >
> > return next;
> > }
> > @@ -551,6 +539,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > gfp_t gfp)
> > {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > + struct vring_desc_extra *extra;
> > struct scatterlist *sg;
> > struct vring_desc *desc;
> > unsigned int i, n, avail, descs_used, prev, err_idx;
> > @@ -586,9 +575,11 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > /* Set up rest to use this indirect table. */
> > i = 0;
> > descs_used = 1;
> > + extra = (struct vring_desc_extra *)&desc[total_sg];
> > } else {
> > indirect = false;
> > desc = vq->split.vring.desc;
> > + extra = vq->split.desc_extra;
> > i = head;
> > descs_used = total_sg;
> > }
> > @@ -618,9 +609,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > /* Note that we trust indirect descriptor
> > * table since it use stream DMA mapping.
> > */
> > - i = virtqueue_add_desc_split(_vq, desc, i, addr, sg->length,
> > - VRING_DESC_F_NEXT,
> > - indirect);
> > + i = virtqueue_add_desc_split(_vq, desc, extra, i, addr, sg->length,
> > + VRING_DESC_F_NEXT);
> > }
> > }
> > for (; n < (out_sgs + in_sgs); n++) {
> > @@ -634,11 +624,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > /* Note that we trust indirect descriptor
> > * table since it use stream DMA mapping.
> > */
> > - i = virtqueue_add_desc_split(_vq, desc, i, addr,
> > + i = virtqueue_add_desc_split(_vq, desc, extra, i, addr,
> > sg->length,
> > VRING_DESC_F_NEXT |
> > - VRING_DESC_F_WRITE,
> > - indirect);
> > + VRING_DESC_F_WRITE);
> > }
> > }
> > /* Last one doesn't continue. */
> > @@ -660,10 +649,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > }
> >
> > virtqueue_add_desc_split(_vq, vq->split.vring.desc,
> > + vq->split.desc_extra,
> > head, addr,
> > total_sg * sizeof(struct vring_desc),
> > - VRING_DESC_F_INDIRECT,
> > - false);
> > + VRING_DESC_F_INDIRECT);
> > }
> >
> > /* We're using some buffers from the free list. */
> > @@ -716,11 +705,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > for (n = 0; n < total_sg; n++) {
> > if (i == err_idx)
> > break;
> > - if (indirect) {
> > - vring_unmap_one_split_indirect(vq, &desc[i]);
> > - i = virtio16_to_cpu(_vq->vdev, desc[i].next);
> > - } else
> > - i = vring_unmap_one_split(vq, i);
> > +
> > + i = vring_unmap_one_split(vq, &extra[i]);
> > }
> >
> > free_indirect:
> > @@ -765,22 +751,25 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
> > static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> > void **ctx)
> > {
> > + struct vring_desc_extra *extra;
> > unsigned int i, j;
> > __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
> >
> > /* Clear data ptr. */
> > vq->split.desc_state[head].data = NULL;
> >
> > + extra = vq->split.desc_extra;
> > +
> > /* Put back on free list: unmap first-level descriptors and find end */
> > i = head;
> >
> > while (vq->split.vring.desc[i].flags & nextflag) {
> > - vring_unmap_one_split(vq, i);
> > + vring_unmap_one_split(vq, &extra[i]);
>
> Not sure if I've asked this before. But this part seems to deserve an
> independent fix for -stable.
What fix?
Thanks.
>
> > i = vq->split.desc_extra[i].next;
> > vq->vq.num_free++;
> > }
> >
> > - vring_unmap_one_split(vq, i);
> > + vring_unmap_one_split(vq, &extra[i]);
> > vq->split.desc_extra[i].next = vq->free_head;
> > vq->free_head = head;
> >
> > @@ -790,21 +779,24 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> > if (vq->indirect) {
> > struct vring_desc *indir_desc =
> > vq->split.desc_state[head].indir_desc;
> > - u32 len;
> > + u32 len, num;
> >
> > /* Free the indirect table, if any, now that it's unmapped. */
> > if (!indir_desc)
> > return;
> > -
> > len = vq->split.desc_extra[head].len;
> >
> > BUG_ON(!(vq->split.desc_extra[head].flags &
> > VRING_DESC_F_INDIRECT));
> > BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> >
> > + num = len / sizeof(struct vring_desc);
> > +
> > + extra = (struct vring_desc_extra *)&indir_desc[num];
> > +
> > if (vring_need_unmap_buffer(vq)) {
> > - for (j = 0; j < len / sizeof(struct vring_desc); j++)
> > - vring_unmap_one_split_indirect(vq, &indir_desc[j]);
> > + for (j = 0; j < num; j++)
> > + vring_unmap_one_split(vq, &extra[j]);
> > }
> >
> > kfree(indir_desc);
> > --
> > 2.32.0.3.g01195cf9f
> >
>
> Thanks
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
2024-11-05 3:23 ` Jason Wang
@ 2024-11-05 7:09 ` Xuan Zhuo
2024-11-06 1:56 ` Jason Wang
0 siblings, 1 reply; 28+ messages in thread
From: Xuan Zhuo @ 2024-11-05 7:09 UTC (permalink / raw)
To: Jason Wang
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > virtio-net rq submits premapped per-buffer by setting sg page to NULL;
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> > drivers/net/virtio_net.c | 24 +++++++++++++-----------
> > 1 file changed, 13 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 792e9eadbfc3..09757fa408bd 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> > return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
> > }
> >
> > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > +{
> > + sg->dma_address = addr;
> > + sg->length = len;
>
> This may work but I think it's better to reuse existing dma sg helpers like:
>
> sg_dma_address(sg) = addr;
> sg_dma_length(sg) = len;
>
> And we probably need to fix the virtio core which only uses
> sg_dma_address() but not sg_dma_length().
>
> This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set.
I don't think so.
In no-premapped mode, we pass the sg as a non-DMA sg to the virtio core,
so the virtio core uses sg->length directly.
When the virtio core does the DMA mapping for the sg, it does not use
dma_map_sg_attrs(), so it must use sg->length directly.
In this case the driver cannot use sg_dma_len(): if
CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_len() writes sg->dma_length,
but the virtio core reads sg->length.
For sg->dma_address, it is fine to use sg_dma_address() or not.
But for consistency with sg->length, I use sg->dma_address directly.
I noticed this is special, so I put them into an independent function.
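For reference, this is how sg_dma_len() is defined in
include/linux/scatterlist.h (simplified excerpt from current kernels;
check the tree you build against):

    #ifdef CONFIG_NEED_SG_DMA_LENGTH
    #define sg_dma_len(sg)          ((sg)->dma_length)
    #else
    #define sg_dma_len(sg)          ((sg)->length)
    #endif

So on configs with CONFIG_NEED_SG_DMA_LENGTH, a driver that writes the
length through sg_dma_len() updates sg->dma_length, while the virtio core
keeps reading sg->length.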
Thanks.
>
> Others look good.
>
> Thanks
>
> > +}
> > +
> > static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> > bool in_napi, struct virtnet_sq_free_stats *stats)
> > {
> > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
> > addr = dma->addr - sizeof(*dma) + offset;
> >
> > sg_init_table(rq->sg, 1);
> > - rq->sg[0].dma_address = addr;
> > - rq->sg[0].length = len;
> > + sg_fill_dma(rq->sg, addr, len);
> > }
> >
> > static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
> > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct virtnet_info *vi,
> > }
> > }
> >
> > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > -{
> > - sg->dma_address = addr;
> > - sg->length = len;
> > -}
> > -
> > static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
> > struct receive_queue *rq, void *buf, u32 len)
> > {
> > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> > sg_init_table(rq->sg, 1);
> > sg_fill_dma(rq->sg, addr, len);
> >
> > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
> > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, xsk_buffs[i],
> > + NULL, true, gfp);
> > if (err)
> > goto err;
> > }
> > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
> >
> > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
> >
> > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > + rq->do_dma, gfp);
> > if (err < 0) {
> > if (rq->do_dma)
> > virtnet_rq_unmap(rq, buf, 0);
> > @@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
> > virtnet_rq_init_one_sg(rq, buf, len);
> >
> > ctx = mergeable_len_to_ctx(len + room, headroom);
> > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > + rq->do_dma, gfp);
> > if (err < 0) {
> > if (rq->do_dma)
> > virtnet_rq_unmap(rq, buf, 0);
> > --
> > 2.32.0.3.g01195cf9f
> >
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
2024-11-05 6:51 ` Xuan Zhuo
@ 2024-11-06 1:44 ` Jason Wang
2024-11-06 5:58 ` Xuan Zhuo
2024-11-06 7:42 ` Michael S. Tsirkin
0 siblings, 2 replies; 28+ messages in thread
From: Jason Wang @ 2024-11-06 1:44 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Tue, Nov 5, 2024 at 2:53 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > The subsequent commit needs to know whether every indirect buffer is
> > > premapped or not. So we need to introduce an extra struct for every
> > > indirect buffer to record this info.
> > >
> > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > ---
> > > drivers/virtio/virtio_ring.c | 112 ++++++++++++++++-------------------
> > > 1 file changed, 52 insertions(+), 60 deletions(-)
> >
> > Do we have a performance impact for this patch?
> >
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 97590c201aa2..dca093744fe1 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -69,7 +69,11 @@
> > >
> > > struct vring_desc_state_split {
> > > void *data; /* Data for callback. */
> > > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
> > > +
> > > + /* Indirect extra table and desc table, if any. These two will be
> > > + * allocated together. So we won't stress more to the memory allocator.
> > > + */
> > > + struct vring_desc *indir_desc;
> >
> > So it looks like we put a descriptor table after the extra table. Can
> > this lead to more crossing page mappings for the indirect descriptors?
> >
> > If yes, it seems expensive so we probably need to make the descriptor
> > table come first.
>
> No, the descriptors are before extra table.
Well, then you need to tweak the above comment; it says
"Indirect extra table and desc table".
> So, there is not performance impact.
>
>
> >
> > > };
> > >
[...]
> > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > - vring_unmap_one_split(vq, i);
> > > + vring_unmap_one_split(vq, &extra[i]);
> >
> > Not sure if I've asked this before. But this part seems to deserve an
> > independent fix for -stable.
>
> What fix?
I meant that for hardening we need to check the flags stored in the extra
structure instead of the descriptor itself, as the descriptor could be
mangled by the device.
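Something like this in detach_buf_split(), as a sketch only (same field
names as this series, not a final patch):

    extra = vq->split.desc_extra;
    i = head;

    /* Trust the flags the driver recorded in desc_extra, not the ones
     * in the ring descriptor, which the device can rewrite.
     */
    while (extra[i].flags & VRING_DESC_F_NEXT) {
            vring_unmap_one_split(vq, &extra[i]);
            i = extra[i].next;
            vq->vq.num_free++;
    }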
Thanks
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
2024-11-05 7:09 ` Xuan Zhuo
@ 2024-11-06 1:56 ` Jason Wang
2024-11-06 5:55 ` Xuan Zhuo
0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2024-11-06 1:56 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Tue, Nov 5, 2024 at 3:23 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > virtio-net rq submits premapped per-buffer by setting sg page to NULL;
> > >
> > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > ---
> > > drivers/net/virtio_net.c | 24 +++++++++++++-----------
> > > 1 file changed, 13 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 792e9eadbfc3..09757fa408bd 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> > > return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
> > > }
> > >
> > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > +{
> > > + sg->dma_address = addr;
> > > + sg->length = len;
> >
> > This may work but I think it's better to reuse existing dma sg helpers like:
> >
> > sg_dma_address(sg) = addr;
> > sg_dma_length(sg) = len;
> >
> > And we probably need to fix the virtio core which only uses
> > sg_dma_address() but not sg_dma_length().
> >
> > This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set.
>
>
> I don't think so.
>
> For no-premapped mode, we pass the sg as no-dma sg to virtio core,
> so the virtio core uses the sg->length directly.
This is fine.
> If virtio core do dma map for sg, we do not use the dma_mag_sg_attrs(),
> so we must use sg->length directly.
I meant it's a hack. It may work now but will become a bug in the future.
For example, I'm playing with a prototype that does premapping for
virtio-blk; the idea is to move the DMA mappings that are expensive in
the swiotlb case, etc., outside the virtqueue lock. In that case, the
driver may want to use dma_map_sg() instead of dma_map_page().
I'd expect we will eventually go with the approach where DMA mappings are
handled by the driver, and dma_map_sg() is faster than per-sg
dma_map_page() anyhow.
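Roughly this direction (a sketch only; the submit step is whatever
premapped add API this series ends up with, the rest are standard DMA
helpers):

    /* Done once per request, outside the virtqueue lock, so the
     * swiotlb bounce-buffer cost does not serialize other submitters.
     */
    nents = dma_map_sg(dev, sgl, num, DMA_TO_DEVICE);
    if (!nents)
            return -ENOMEM;

    /* Under the virtqueue lock the driver then only submits buffers
     * that already carry sg_dma_address()/sg_dma_len(); the core must
     * not remap them with per-sg dma_map_page().
     */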
>
> In this case, for the driver, we can not use sg_dma_length(),
> if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set sg->dma_length,
> but virtio core use sg->length.
Well, we just need a minor tweak to get the length from
vring_map_one_sg(), then everything should be fine?
if (sg_is_premapped) {
*addr = sg_dma_address(sg);
*len = sg_dma_len(sg);
}
>
> For sg->dma_address, it is ok for me to use sg_dma_address or not.
> But for consistency to sg->length, I use the sg->dma_address directly.
>
> I noticed this is special, so I put them into an independent function.
>
> Thanks.
Actually, code like sg_fill_dma() calls for a virtqueue DMA mapping
helper; I think we've agreed that the core needs to hide DMA details
from the driver. That would be something like virtqueue_dma_map_sg() etc.
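As an illustration of the shape such a helper could take (the name and
signature are just a sketch assuming it lives in virtio_ring.c, not an
existing API):

    static int virtqueue_dma_map_sg(struct virtqueue *_vq,
                                    struct scatterlist *sg, int nents,
                                    enum dma_data_direction dir)
    {
            struct vring_virtqueue *vq = to_vvq(_vq);

            /* Without the DMA API the core translates sg entries itself
             * (sg_phys()), so there is nothing to map here.
             */
            if (!vq->use_dma_api)
                    return nents;

            return dma_map_sg(vring_dma_dev(vq), sg, nents, dir);
    }

That way the driver never has to touch sg->dma_address or sg->length by
hand.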
Thanks
>
> >
> > Others look good.
> >
> > Thanks
> >
> > > +}
> > > +
> > > static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> > > bool in_napi, struct virtnet_sq_free_stats *stats)
> > > {
> > > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
> > > addr = dma->addr - sizeof(*dma) + offset;
> > >
> > > sg_init_table(rq->sg, 1);
> > > - rq->sg[0].dma_address = addr;
> > > - rq->sg[0].length = len;
> > > + sg_fill_dma(rq->sg, addr, len);
> > > }
> > >
> > > static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
> > > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct virtnet_info *vi,
> > > }
> > > }
> > >
> > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > -{
> > > - sg->dma_address = addr;
> > > - sg->length = len;
> > > -}
> > > -
> > > static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
> > > struct receive_queue *rq, void *buf, u32 len)
> > > {
> > > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> > > sg_init_table(rq->sg, 1);
> > > sg_fill_dma(rq->sg, addr, len);
> > >
> > > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
> > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, xsk_buffs[i],
> > > + NULL, true, gfp);
> > > if (err)
> > > goto err;
> > > }
> > > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
> > >
> > > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
> > >
> > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > > + rq->do_dma, gfp);
> > > if (err < 0) {
> > > if (rq->do_dma)
> > > virtnet_rq_unmap(rq, buf, 0);
> > > @@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
> > > virtnet_rq_init_one_sg(rq, buf, len);
> > >
> > > ctx = mergeable_len_to_ctx(len + room, headroom);
> > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > > + rq->do_dma, gfp);
> > > if (err < 0) {
> > > if (rq->do_dma)
> > > virtnet_rq_unmap(rq, buf, 0);
> > > --
> > > 2.32.0.3.g01195cf9f
> > >
> >
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
2024-11-06 1:56 ` Jason Wang
@ 2024-11-06 5:55 ` Xuan Zhuo
0 siblings, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-11-06 5:55 UTC (permalink / raw)
To: Jason Wang
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, 6 Nov 2024 09:56:55 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Tue, Nov 5, 2024 at 3:23 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > virtio-net rq submits premapped per-buffer by setting sg page to NULL;
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > > drivers/net/virtio_net.c | 24 +++++++++++++-----------
> > > > 1 file changed, 13 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index 792e9eadbfc3..09757fa408bd 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> > > > return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
> > > > }
> > > >
> > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > > +{
> > > > + sg->dma_address = addr;
> > > > + sg->length = len;
> > >
> > > This may work but I think it's better to reuse existing dma sg helpers like:
> > >
> > > sg_dma_address(sg) = addr;
> > > sg_dma_length(sg) = len;
> > >
> > > And we probably need to fix the virtio core which only uses
> > > sg_dma_address() but not sg_dma_length().
> > >
> > > This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set.
> >
> >
> > I don't think so.
> >
> > For no-premapped mode, we pass the sg as no-dma sg to virtio core,
> > so the virtio core uses the sg->length directly.
>
> This is fine.
>
> > If virtio core do dma map for sg, we do not use the dma_mag_sg_attrs(),
> > so we must use sg->length directly.
>
> I meant it's a hack. It may work now but will be a bug in the future.
>
> For example, I'm playing a prototype to do pre mapping for virtio-blk,
> the idea is to move the expensive DMA mappings in the case of swiotlb
> etc to be done outside the pre virtqueue lock. In that case, the
> driver may want to use dma_map_sg() instead of dma_map_page().
>
> I'd suppose we will finally go with the way where DMA mappings needs
> to be handled by the driver, and dma_map_sg() is faster than per sg
> dma_map_page() anyhow.
>
> >
> > In this case, for the driver, we can not use sg_dma_length(),
> > if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set sg->dma_length,
> > but virtio core use sg->length.
>
> Well, we just need a minor tweak to get the length from
> vring_map_one_sg(), then everything should be fine?
>
> if (sg_is_premapped) {
> *addr = sg_dma_address(sg);
> *len = sg_dma_len(sg);
> }
For now, let us start from:
if (sg_is_premapped) {
*addr = sg_dma_address(sg);
sg->length = sg_dma_len(sg);
}
Then the virtio core can be refactored to use dma_map_sg() in the future.
Thanks.
>
> >
> > For sg->dma_address, it is ok for me to use sg_dma_address or not.
> > But for consistency to sg->length, I use the sg->dma_address directly.
> >
> > I noticed this is special, so I put them into an independent function.
> >
> > Thanks.
>
> Actually, the code like sg_fill_dma() calls for a virtqueue dma
> mapping helper, I think we've agreed that core needs to hide DMA
> details from the driver. That is something like
> virtqueue_dma_map_sg() etc.
>
> Thanks
>
> >
> > >
> > > Others look good.
> > >
> > > Thanks
> > >
> > > > +}
> > > > +
> > > > static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> > > > bool in_napi, struct virtnet_sq_free_stats *stats)
> > > > {
> > > > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
> > > > addr = dma->addr - sizeof(*dma) + offset;
> > > >
> > > > sg_init_table(rq->sg, 1);
> > > > - rq->sg[0].dma_address = addr;
> > > > - rq->sg[0].length = len;
> > > > + sg_fill_dma(rq->sg, addr, len);
> > > > }
> > > >
> > > > static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
> > > > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct virtnet_info *vi,
> > > > }
> > > > }
> > > >
> > > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > > -{
> > > > - sg->dma_address = addr;
> > > > - sg->length = len;
> > > > -}
> > > > -
> > > > static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
> > > > struct receive_queue *rq, void *buf, u32 len)
> > > > {
> > > > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> > > > sg_init_table(rq->sg, 1);
> > > > sg_fill_dma(rq->sg, addr, len);
> > > >
> > > > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
> > > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, xsk_buffs[i],
> > > > + NULL, true, gfp);
> > > > if (err)
> > > > goto err;
> > > > }
> > > > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
> > > >
> > > > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
> > > >
> > > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > > > + rq->do_dma, gfp);
> > > > if (err < 0) {
> > > > if (rq->do_dma)
> > > > virtnet_rq_unmap(rq, buf, 0);
> > > > @@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
> > > > virtnet_rq_init_one_sg(rq, buf, len);
> > > >
> > > > ctx = mergeable_len_to_ctx(len + room, headroom);
> > > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > > > + rq->do_dma, gfp);
> > > > if (err < 0) {
> > > > if (rq->do_dma)
> > > > virtnet_rq_unmap(rq, buf, 0);
> > > > --
> > > > 2.32.0.3.g01195cf9f
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
2024-11-06 1:44 ` Jason Wang
@ 2024-11-06 5:58 ` Xuan Zhuo
2024-11-06 7:42 ` Michael S. Tsirkin
1 sibling, 0 replies; 28+ messages in thread
From: Xuan Zhuo @ 2024-11-06 5:58 UTC (permalink / raw)
To: Jason Wang
Cc: netdev, Michael S. Tsirkin, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, 6 Nov 2024 09:44:39 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Tue, Nov 5, 2024 at 2:53 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > The subsequent commit needs to know whether every indirect buffer is
> > > > premapped or not. So we need to introduce an extra struct for every
> > > > indirect buffer to record this info.
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > > drivers/virtio/virtio_ring.c | 112 ++++++++++++++++-------------------
> > > > 1 file changed, 52 insertions(+), 60 deletions(-)
> > >
> > > Do we have a performance impact for this patch?
> > >
> > > >
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 97590c201aa2..dca093744fe1 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -69,7 +69,11 @@
> > > >
> > > > struct vring_desc_state_split {
> > > > void *data; /* Data for callback. */
> > > > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
> > > > +
> > > > + /* Indirect extra table and desc table, if any. These two will be
> > > > + * allocated together. So we won't stress more to the memory allocator.
> > > > + */
> > > > + struct vring_desc *indir_desc;
> > >
> > > So it looks like we put a descriptor table after the extra table. Can
> > > this lead to more crossing page mappings for the indirect descriptors?
> > >
> > > If yes, it seems expensive so we probably need to make the descriptor
> > > table come first.
> >
> > No, the descriptors are before extra table.
>
> Well, you need then tweak the above comment, it said
>
> "Indirect extra table and desc table".
>
> > So, there is not performance impact.
> >
> >
> > >
> > > > };
> > > >
>
> [...]
>
> > > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > > - vring_unmap_one_split(vq, i);
> > > > + vring_unmap_one_split(vq, &extra[i]);
> > >
> > > Not sure if I've asked this before. But this part seems to deserve an
> > > independent fix for -stable.
> >
> > What fix?
>
> I meant for hardening we need to check the flags stored in the extra
> instead of the descriptor itself as it could be mangled by the device.
I see, we can do that in the future.
Thanks.
>
> Thanks
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
2024-11-06 1:44 ` Jason Wang
2024-11-06 5:58 ` Xuan Zhuo
@ 2024-11-06 7:42 ` Michael S. Tsirkin
2024-11-11 1:27 ` Jason Wang
1 sibling, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2024-11-06 7:42 UTC (permalink / raw)
To: Jason Wang
Cc: Xuan Zhuo, netdev, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, Nov 06, 2024 at 09:44:39AM +0800, Jason Wang wrote:
> > > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > > - vring_unmap_one_split(vq, i);
> > > > + vring_unmap_one_split(vq, &extra[i]);
> > >
> > > Not sure if I've asked this before. But this part seems to deserve an
> > > independent fix for -stable.
> >
> > What fix?
>
> I meant for hardening we need to check the flags stored in the extra
> instead of the descriptor itself as it could be mangled by the device.
>
> Thanks
Good point. Jason, want to cook up a patch?
--
MST
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
2024-11-06 7:42 ` Michael S. Tsirkin
@ 2024-11-11 1:27 ` Jason Wang
0 siblings, 0 replies; 28+ messages in thread
From: Jason Wang @ 2024-11-11 1:27 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Xuan Zhuo, netdev, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, virtualization, bpf
On Wed, Nov 6, 2024 at 3:42 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Nov 06, 2024 at 09:44:39AM +0800, Jason Wang wrote:
> > > > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > > > - vring_unmap_one_split(vq, i);
> > > > > + vring_unmap_one_split(vq, &extra[i]);
> > > >
> > > > Not sure if I've asked this before. But this part seems to deserve an
> > > > independent fix for -stable.
> > >
> > > What fix?
> >
> > I meant for hardening we need to check the flags stored in the extra
> > instead of the descriptor itself as it could be mangled by the device.
> >
> > Thanks
>
> Good point. Jason, want to cook up a patch?
Will do.
Thanks
>
> --
> MST
>
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread
Thread overview: 28+ messages
2024-10-30 8:24 [PATCH net-next v2 00/13] virtio-net: support AF_XDP zero copy (tx) Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 01/13] virtio_ring: introduce vring_need_unmap_buffer Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers Xuan Zhuo
2024-11-05 3:42 ` Jason Wang
2024-11-05 6:51 ` Xuan Zhuo
2024-11-06 1:44 ` Jason Wang
2024-11-06 5:58 ` Xuan Zhuo
2024-11-06 7:42 ` Michael S. Tsirkin
2024-11-11 1:27 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 03/13] virtio_ring: packed: " Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer Xuan Zhuo
2024-11-05 4:25 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped Xuan Zhuo
2024-10-31 2:19 ` kernel test robot
2024-11-05 4:28 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer Xuan Zhuo
2024-11-05 3:23 ` Jason Wang
2024-11-05 7:09 ` Xuan Zhuo
2024-11-06 1:56 ` Jason Wang
2024-11-06 5:55 ` Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped Xuan Zhuo
2024-11-05 4:29 ` Jason Wang
2024-10-30 8:24 ` [PATCH net-next v2 08/13] virtio_net: refactor the xmit type Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 09/13] virtio_net: xsk: bind/unbind xsk for tx Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 10/13] virtio_net: xsk: prevent disable tx napi Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 11/13] virtio_net: xsk: tx: support xmit xsk buffer Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 12/13] virtio_net: update tx timeout record Xuan Zhuo
2024-10-30 8:24 ` [PATCH net-next v2 13/13] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY Xuan Zhuo