* [PATCH net-next 2/6] page_frag: unify gfp bit for order 3 page allocation
[not found] <20231205113444.63015-1-linyunsheng@huawei.com>
@ 2023-12-05 11:34 ` Yunsheng Lin
2023-12-07 3:15 ` Jakub Kicinski
2023-12-05 11:34 ` [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-05 11:34 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
virtualization, linux-mm
Currently there seem to be three page frag implementations
which all try to allocate an order 3 page; if that fails,
they fall back to allocating an order 0 page, and each of
them allows the order 3 page allocation to fail under
certain conditions by using specific gfp bits.
The gfp bits used for the order 3 page allocation differ
between the implementations: __GFP_NOMEMALLOC is or'd to
forbid access to the emergency reserve memory for
__page_frag_cache_refill(), but it is not or'd in the other
implementations; __GFP_DIRECT_RECLAIM is xor'd to avoid
direct reclaim in skb_page_frag_refill(), but it is not
xor'd in __page_frag_cache_refill().
This patch unifies the gfp bits used between the different
implementations by or'ing __GFP_NOMEMALLOC and xor'ing
__GFP_DIRECT_RECLAIM for the order 3 page allocation to
avoid putting possible pressure on mm.
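With this change, all three callers end up using the same set
of bits for the order 3 attempt; roughly (a sketch matching the
hunks below, 'order3_gfp' is only an illustrative name):

	/* illustrative only: the unified mask for the order 3 attempt */
	gfp_t order3_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
			   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;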
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
drivers/vhost/net.c | 2 +-
mm/page_alloc.c | 4 ++--
net/core/sock.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..e574e21cc0ca 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
/* Avoid direct reclaim but allow kswapd to wake */
pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
__GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
+ __GFP_NORETRY | __GFP_NOMEMALLOC,
SKB_FRAG_PAGE_ORDER);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a16305cf985..1f0b36dd81b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_t gfp = gfp_mask;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
- __GFP_NOMEMALLOC;
+ gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+ __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
PAGE_FRAG_CACHE_MAX_ORDER);
nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
diff --git a/net/core/sock.c b/net/core/sock.c
index fef349dd72fa..4efa9cae4b0d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2904,7 +2904,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
/* Avoid direct reclaim but allow kswapd to wake */
pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
__GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
+ __GFP_NORETRY | __GFP_NOMEMALLOC,
SKB_FRAG_PAGE_ORDER);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
--
2.33.0
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH net-next 2/6] page_frag: unify gfp bit for order 3 page allocation
2023-12-05 11:34 ` [PATCH net-next 2/6] page_frag: unify gfp bit for order 3 page allocation Yunsheng Lin
@ 2023-12-07 3:15 ` Jakub Kicinski
2023-12-07 11:27 ` Yunsheng Lin
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2023-12-07 3:15 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, pabeni, netdev, linux-kernel, Alexander Duyck,
Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
virtualization, linux-mm
On Tue, 5 Dec 2023 19:34:40 +0800 Yunsheng Lin wrote:
> __GFP_DIRECT_RECLAIM is xor'd to avoid
> direct reclaim in skb_page_frag_refill(), but it is not
> xor'd in __page_frag_cache_refill().
xor is not the same thing as masking a bit off.
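The distinction matters because xor toggles the bit rather than
clearing it; a minimal illustration:

	gfp &= ~__GFP_DIRECT_RECLAIM;	/* mask off: bit is always cleared */
	gfp ^= __GFP_DIRECT_RECLAIM;	/* xor: bit is toggled, i.e. it would
					 * be set if the caller did not pass
					 * it in the first place
					 */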
The patch itself LGTM.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 2/6] page_frag: unify gfp bit for order 3 page allocation
2023-12-07 3:15 ` Jakub Kicinski
@ 2023-12-07 11:27 ` Yunsheng Lin
0 siblings, 0 replies; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-07 11:27 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, pabeni, netdev, linux-kernel, Alexander Duyck,
Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
virtualization, linux-mm
On 2023/12/7 11:15, Jakub Kicinski wrote:
> On Tue, 5 Dec 2023 19:34:40 +0800 Yunsheng Lin wrote:
>> __GFP_DIRECT_RECLAIM is xor'd to avoid
>> direct reclaim in skb_page_frag_refill(), but it is not
>> xor'd in __page_frag_cache_refill().
>
> xor is not the same thing as masking a bit off.
You are right.
Will use 'mask off', thanks.
> The patch itself LGTM.
> .
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()
[not found] <20231205113444.63015-1-linyunsheng@huawei.com>
2023-12-05 11:34 ` [PATCH net-next 2/6] page_frag: unify gfp bit for order 3 page allocation Yunsheng Lin
@ 2023-12-05 11:34 ` Yunsheng Lin
2023-12-07 5:52 ` Jason Wang
2023-12-05 11:34 ` [PATCH net-next 5/6] net: introduce page_frag_cache_drain() Yunsheng Lin
2023-12-05 11:34 ` [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test Yunsheng Lin
3 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-05 11:34 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: netdev, linux-kernel, Yunsheng Lin, Michael S. Tsirkin,
Jason Wang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, kvm, virtualization, bpf
The page frag in vhost_net_page_frag_refill() uses the
'struct page_frag' from skb_page_frag_refill(), but its
implementation is similar to page_frag_alloc_align() now.
This patch removes vhost_net_page_frag_refill() by using
'struct page_frag_cache' instead of 'struct page_frag',
and allocating frags using page_frag_alloc_align().
The added benefit is that it not only unifies the page frag
implementation a little, but also gives about a 0.5% performance
boost when testing with the vhost_net_test introduced in the
last patch.
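In essence, the refcount-bias bookkeeping moves into the page
frag core and the TX buffer allocation reduces to something like
this (a sketch of the new flow; the full change is in the diff
below):

	buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
				    SMP_CACHE_BYTES);
	if (unlikely(!buf))
		return -ENOMEM;

	/* ... build the xdp buff in 'buf' ...; error paths now undo the
	 * allocation with page_frag_free(buf)
	 */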
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
drivers/vhost/net.c | 93 ++++++++++++++-------------------------------
1 file changed, 29 insertions(+), 64 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e574e21cc0ca..805e11d598e4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -141,10 +141,8 @@ struct vhost_net {
unsigned tx_zcopy_err;
/* Flush in progress. Protected by tx vq lock. */
bool tx_flush;
- /* Private page frag */
- struct page_frag page_frag;
- /* Refcount bias of page frag */
- int refcnt_bias;
+ /* Private page frag cache */
+ struct page_frag_cache pf_cache;
};
static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
!vhost_vq_avail_empty(vq->dev, vq);
}
-static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
- struct page_frag *pfrag, gfp_t gfp)
-{
- if (pfrag->page) {
- if (pfrag->offset + sz <= pfrag->size)
- return true;
- __page_frag_cache_drain(pfrag->page, net->refcnt_bias);
- }
-
- pfrag->offset = 0;
- net->refcnt_bias = 0;
- if (SKB_FRAG_PAGE_ORDER) {
- /* Avoid direct reclaim but allow kswapd to wake */
- pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
- __GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY | __GFP_NOMEMALLOC,
- SKB_FRAG_PAGE_ORDER);
- if (likely(pfrag->page)) {
- pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
- goto done;
- }
- }
- pfrag->page = alloc_page(gfp);
- if (likely(pfrag->page)) {
- pfrag->size = PAGE_SIZE;
- goto done;
- }
- return false;
-
-done:
- net->refcnt_bias = USHRT_MAX;
- page_ref_add(pfrag->page, USHRT_MAX - 1);
- return true;
-}
-
#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
@@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
struct vhost_net *net = container_of(vq->dev, struct vhost_net,
dev);
struct socket *sock = vhost_vq_get_backend(vq);
- struct page_frag *alloc_frag = &net->page_frag;
struct virtio_net_hdr *gso;
struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
struct tun_xdp_hdr *hdr;
@@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
int sock_hlen = nvq->sock_hlen;
void *buf;
int copied;
+ int ret;
if (unlikely(len < nvq->sock_hlen))
return -EFAULT;
@@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
return -ENOSPC;
buflen += SKB_DATA_ALIGN(len + pad);
- alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
- if (unlikely(!vhost_net_page_frag_refill(net, buflen,
- alloc_frag, GFP_KERNEL)))
+ buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
+ SMP_CACHE_BYTES);
+ if (unlikely(!buf))
return -ENOMEM;
- buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
- copied = copy_page_from_iter(alloc_frag->page,
- alloc_frag->offset +
- offsetof(struct tun_xdp_hdr, gso),
- sock_hlen, from);
- if (copied != sock_hlen)
- return -EFAULT;
+ copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
+ sock_hlen, from);
+ if (copied != sock_hlen) {
+ ret = -EFAULT;
+ goto err;
+ }
hdr = buf;
gso = &hdr->gso;
@@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
vhost16_to_cpu(vq, gso->csum_start) +
vhost16_to_cpu(vq, gso->csum_offset) + 2);
- if (vhost16_to_cpu(vq, gso->hdr_len) > len)
- return -EINVAL;
+ if (vhost16_to_cpu(vq, gso->hdr_len) > len) {
+ ret = -EINVAL;
+ goto err;
+ }
}
len -= sock_hlen;
- copied = copy_page_from_iter(alloc_frag->page,
- alloc_frag->offset + pad,
- len, from);
- if (copied != len)
- return -EFAULT;
+ copied = copy_from_iter(buf + pad, len, from);
+ if (copied != len) {
+ ret = -EFAULT;
+ goto err;
+ }
xdp_init_buff(xdp, buflen, NULL);
xdp_prepare_buff(xdp, buf, pad, len, true);
hdr->buflen = buflen;
- --net->refcnt_bias;
- alloc_frag->offset += buflen;
-
++nvq->batched_xdp;
return 0;
+
+err:
+ page_frag_free(buf);
+ return ret;
}
static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
@@ -1353,8 +1318,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vqs[VHOST_NET_VQ_RX]);
f->private_data = n;
- n->page_frag.page = NULL;
- n->refcnt_bias = 0;
+ n->pf_cache.va = NULL;
return 0;
}
@@ -1422,8 +1386,9 @@ static int vhost_net_release(struct inode *inode, struct file *f)
kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
kfree(n->dev.vqs);
- if (n->page_frag.page)
- __page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
+ if (n->pf_cache.va)
+ __page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
+ n->pf_cache.pagecnt_bias);
kvfree(n);
return 0;
}
--
2.33.0
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()
2023-12-05 11:34 ` [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
@ 2023-12-07 5:52 ` Jason Wang
0 siblings, 0 replies; 15+ messages in thread
From: Jason Wang @ 2023-12-07 5:52 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, kvm, virtualization, bpf
On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> The page frag in vhost_net_page_frag_refill() uses the
> 'struct page_frag' from skb_page_frag_refill(), but it's
> implementation is similar to page_frag_alloc_align() now.
>
> This patch removes vhost_net_page_frag_refill() by using
> 'struct page_frag_cache' instead of 'struct page_frag',
> and allocating frag using page_frag_alloc_align().
>
> The added benefit is that not only unifying the page frag
> implementation a little, but also having about 0.5% performance
> boost testing by using the vhost_net_test introduced in the
> last patch.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net-next 5/6] net: introduce page_frag_cache_drain()
[not found] <20231205113444.63015-1-linyunsheng@huawei.com>
2023-12-05 11:34 ` [PATCH net-next 2/6] page_frag: unify gfp bit for order 3 page allocation Yunsheng Lin
2023-12-05 11:34 ` [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
@ 2023-12-05 11:34 ` Yunsheng Lin
2023-12-07 5:46 ` Jason Wang
2023-12-05 11:34 ` [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test Yunsheng Lin
3 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-05 11:34 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: netdev, linux-kernel, Yunsheng Lin, Jeroen de Borst,
Praveen Kaligineedi, Shailend Chand, Eric Dumazet, Felix Fietkau,
John Crispin, Sean Wang, Mark Lee, Lorenzo Bianconi,
Matthias Brugger, AngeloGioacchino Del Regno, Keith Busch,
Jens Axboe, Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
Michael S. Tsirkin, Jason Wang, Andrew Morton, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm
When draining a page_frag_cache, most users are doing
similar steps, so introduce an API to avoid code
duplication.
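The duplicated steps being factored out look roughly like this
(a sketch; see the conversions and the new helper below):

	/* open-coded today in each caller */
	if (nc->va) {
		__page_frag_cache_drain(virt_to_head_page(nc->va),
					nc->pagecnt_bias);
		nc->va = NULL;
	}

	/* after this patch */
	page_frag_cache_drain(nc);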
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
drivers/net/ethernet/google/gve/gve_main.c | 11 ++---------
drivers/net/ethernet/mediatek/mtk_wed_wo.c | 17 ++---------------
drivers/nvme/host/tcp.c | 7 +------
drivers/nvme/target/tcp.c | 4 +---
drivers/vhost/net.c | 4 +---
include/linux/gfp.h | 2 ++
mm/page_alloc.c | 10 ++++++++++
7 files changed, 19 insertions(+), 36 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 619bf63ec935..d976190b0f4d 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1278,17 +1278,10 @@ static void gve_unreg_xdp_info(struct gve_priv *priv)
static void gve_drain_page_cache(struct gve_priv *priv)
{
- struct page_frag_cache *nc;
int i;
- for (i = 0; i < priv->rx_cfg.num_queues; i++) {
- nc = &priv->rx[i].page_cache;
- if (nc->va) {
- __page_frag_cache_drain(virt_to_page(nc->va),
- nc->pagecnt_bias);
- nc->va = NULL;
- }
- }
+ for (i = 0; i < priv->rx_cfg.num_queues; i++)
+ page_frag_cache_drain(&priv->rx[i].page_cache);
}
static int gve_open(struct net_device *dev)
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index 7ffbd4fca881..df0a3ceaf59b 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -286,7 +286,6 @@ mtk_wed_wo_queue_free(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
static void
mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
{
- struct page *page;
int i;
for (i = 0; i < q->n_desc; i++) {
@@ -298,19 +297,12 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
entry->buf = NULL;
}
- if (!q->cache.va)
- return;
-
- page = virt_to_page(q->cache.va);
- __page_frag_cache_drain(page, q->cache.pagecnt_bias);
- memset(&q->cache, 0, sizeof(q->cache));
+ page_frag_cache_drain(&q->cache);
}
static void
mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
{
- struct page *page;
-
for (;;) {
void *buf = mtk_wed_wo_dequeue(wo, q, NULL, true);
@@ -320,12 +312,7 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
skb_free_frag(buf);
}
- if (!q->cache.va)
- return;
-
- page = virt_to_page(q->cache.va);
- __page_frag_cache_drain(page, q->cache.pagecnt_bias);
- memset(&q->cache, 0, sizeof(q->cache));
+ page_frag_cache_drain(&q->cache);
}
static void
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d79811cfa0ce..1c85e1398e4e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1344,7 +1344,6 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
{
- struct page *page;
struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
struct nvme_tcp_queue *queue = &ctrl->queues[qid];
unsigned int noreclaim_flag;
@@ -1355,11 +1354,7 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
if (queue->hdr_digest || queue->data_digest)
nvme_tcp_free_crypto(queue);
- if (queue->pf_cache.va) {
- page = virt_to_head_page(queue->pf_cache.va);
- __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
- queue->pf_cache.va = NULL;
- }
+ page_frag_cache_drain(&queue->pf_cache);
noreclaim_flag = memalloc_noreclaim_save();
/* ->sock will be released by fput() */
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 4cc27856aa8f..11237557cfc5 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1576,7 +1576,6 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
static void nvmet_tcp_release_queue_work(struct work_struct *w)
{
- struct page *page;
struct nvmet_tcp_queue *queue =
container_of(w, struct nvmet_tcp_queue, release_work);
@@ -1600,8 +1599,7 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
if (queue->hdr_digest || queue->data_digest)
nvmet_tcp_free_crypto(queue);
ida_free(&nvmet_tcp_queue_ida, queue->idx);
- page = virt_to_head_page(queue->pf_cache.va);
- __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
+ page_frag_cache_drain(&queue->pf_cache);
kfree(queue);
}
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 805e11d598e4..4b2fcb228a0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1386,9 +1386,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
kfree(n->dev.vqs);
- if (n->pf_cache.va)
- __page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
- n->pf_cache.pagecnt_bias);
+ page_frag_cache_drain(&n->pf_cache);
kvfree(n);
return 0;
}
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index bbd75976541e..03ba079655d3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -316,6 +316,8 @@ extern void *page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask,
unsigned int align);
+void page_frag_cache_drain(struct page_frag_cache *nc);
+
static inline void *page_frag_alloc(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 083e0c38fb62..5a0e68edcb05 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4716,6 +4716,16 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
}
EXPORT_SYMBOL(__page_frag_cache_drain);
+void page_frag_cache_drain(struct page_frag_cache *nc)
+{
+ if (!nc->va)
+ return;
+
+ __page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias);
+ nc->va = NULL;
+}
+EXPORT_SYMBOL(page_frag_cache_drain);
+
void *page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask,
unsigned int align)
--
2.33.0
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH net-next 5/6] net: introduce page_frag_cache_drain()
2023-12-05 11:34 ` [PATCH net-next 5/6] net: introduce page_frag_cache_drain() Yunsheng Lin
@ 2023-12-07 5:46 ` Jason Wang
0 siblings, 0 replies; 15+ messages in thread
From: Jason Wang @ 2023-12-07 5:46 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Jeroen de Borst,
Praveen Kaligineedi, Shailend Chand, Eric Dumazet, Felix Fietkau,
John Crispin, Sean Wang, Mark Lee, Lorenzo Bianconi,
Matthias Brugger, AngeloGioacchino Del Regno, Keith Busch,
Jens Axboe, Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
Michael S. Tsirkin, Andrew Morton, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm
On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> When draining a page_frag_cache, most user are doing
> the similar steps, so introduce an API to avoid code
> duplication.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
For vhost part:
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
[not found] <20231205113444.63015-1-linyunsheng@huawei.com>
` (2 preceding siblings ...)
2023-12-05 11:34 ` [PATCH net-next 5/6] net: introduce page_frag_cache_drain() Yunsheng Lin
@ 2023-12-05 11:34 ` Yunsheng Lin
2023-12-07 6:00 ` Jason Wang
3 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-05 11:34 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: netdev, linux-kernel, Yunsheng Lin, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, virtualization
introduce vhost_net_test based on virtio_test to test
vhost_net changes in the kernel.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
tools/virtio/Makefile | 8 +-
tools/virtio/vhost_net_test.c | 441 ++++++++++++++++++++++++++++++++++
2 files changed, 446 insertions(+), 3 deletions(-)
create mode 100644 tools/virtio/vhost_net_test.c
diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index d128925980e0..e25e99c1c3b7 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -1,8 +1,9 @@
# SPDX-License-Identifier: GPL-2.0
all: test mod
-test: virtio_test vringh_test
+test: virtio_test vringh_test vhost_net_test
virtio_test: virtio_ring.o virtio_test.o
vringh_test: vringh_test.o vringh.o virtio_ring.o
+vhost_net_test: virtio_ring.o vhost_net_test.o
try-run = $(shell set -e; \
if ($(1)) >/dev/null 2>&1; \
@@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
.PHONY: all test mod clean vhost oot oot-clean oot-build
clean:
- ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
- vhost_test/Module.symvers vhost_test/modules.order *.d
+ ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
+ vhost_test/.*.cmd vhost_test/Module.symvers \
+ vhost_test/modules.order *.d
-include *.d
diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
new file mode 100644
index 000000000000..7e7b7aba3668
--- /dev/null
+++ b/tools/virtio/vhost_net_test.c
@@ -0,0 +1,441 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <getopt.h>
+#include <limits.h>
+#include <string.h>
+#include <poll.h>
+#include <sys/eventfd.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <linux/virtio_types.h>
+#include <linux/vhost.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ring.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+
+#define RANDOM_BATCH -1
+
+static int tun_alloc(void)
+{
+ struct ifreq ifr;
+ int fd, e;
+
+ fd = open("/dev/net/tun", O_RDWR);
+ if (fd < 0) {
+ perror("Cannot open /dev/net/tun");
+ return fd;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+
+ ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+ strncpy(ifr.ifr_name, "tun0", IFNAMSIZ);
+
+ e = ioctl(fd, TUNSETIFF, (void *) &ifr);
+ if (e < 0) {
+ perror("ioctl[TUNSETIFF]");
+ close(fd);
+ return e;
+ }
+
+ return fd;
+}
+
+/* Unused */
+void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
+
+struct vq_info {
+ int kick;
+ int call;
+ int num;
+ int idx;
+ void *ring;
+ /* copy used for control */
+ struct vring vring;
+ struct virtqueue *vq;
+};
+
+struct vdev_info {
+ struct virtio_device vdev;
+ int control;
+ struct pollfd fds[1];
+ struct vq_info vqs[1];
+ int nvqs;
+ void *buf;
+ size_t buf_size;
+ struct vhost_memory *mem;
+};
+
+static struct vhost_vring_file no_backend = { .index = 1, .fd = -1 },
+ backend = { .index = 1, .fd = 1 };
+static const struct vhost_vring_state null_state = {};
+
+bool vq_notify(struct virtqueue *vq)
+{
+ struct vq_info *info = vq->priv;
+ unsigned long long v = 1;
+ int r;
+ r = write(info->kick, &v, sizeof v);
+ assert(r == sizeof v);
+ return true;
+}
+
+void vq_callback(struct virtqueue *vq)
+{
+}
+
+
+void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
+{
+ struct vhost_vring_state state = { .index = info->idx };
+ struct vhost_vring_file file = { .index = info->idx };
+ unsigned long long features = dev->vdev.features;
+ struct vhost_vring_addr addr = {
+ .index = info->idx,
+ .desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
+ .avail_user_addr = (uint64_t)(unsigned long)info->vring.avail,
+ .used_user_addr = (uint64_t)(unsigned long)info->vring.used,
+ };
+ int r;
+ r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
+ assert(r >= 0);
+ state.num = info->vring.num;
+ r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
+ assert(r >= 0);
+ state.num = 0;
+ r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
+ assert(r >= 0);
+ r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
+ assert(r >= 0);
+ file.fd = info->kick;
+ r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
+ assert(r >= 0);
+ file.fd = info->call;
+ r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
+ assert(r >= 0);
+}
+
+static void vq_reset(struct vq_info *info, int num, struct virtio_device *vdev)
+{
+ if (info->vq)
+ vring_del_virtqueue(info->vq);
+
+ memset(info->ring, 0, vring_size(num, 4096));
+ vring_init(&info->vring, num, info->ring, 4096);
+ info->vq = vring_new_virtqueue(info->idx, num, 4096, vdev, true, false,
+ info->ring, vq_notify, vq_callback, "test");
+ assert(info->vq);
+ info->vq->priv = info;
+}
+
+static void vq_info_add(struct vdev_info *dev, int num)
+{
+ struct vq_info *info = &dev->vqs[dev->nvqs];
+ int r;
+
+ /* use VHOST_NET_VQ_TX for testing */
+ info->idx = 1;
+ info->kick = eventfd(0, EFD_NONBLOCK);
+ info->call = eventfd(0, EFD_NONBLOCK);
+ r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
+ assert(r >= 0);
+ vq_reset(info, num, &dev->vdev);
+ vhost_vq_setup(dev, info);
+ dev->fds[0].fd = info->call;
+ dev->fds[0].events = POLLIN;
+ dev->nvqs++;
+}
+
+static void vdev_info_init(struct vdev_info* dev, unsigned long long features)
+{
+ int r;
+ memset(dev, 0, sizeof *dev);
+ dev->vdev.features = features;
+ INIT_LIST_HEAD(&dev->vdev.vqs);
+ spin_lock_init(&dev->vdev.vqs_list_lock);
+ dev->buf_size = 1024;
+ dev->buf = malloc(dev->buf_size);
+ assert(dev->buf);
+ dev->control = open("/dev/vhost-net", O_RDWR);
+ assert(dev->control >= 0);
+ r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
+ assert(r >= 0);
+ dev->mem = malloc(offsetof(struct vhost_memory, regions) +
+ sizeof dev->mem->regions[0]);
+ assert(dev->mem);
+ memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
+ sizeof dev->mem->regions[0]);
+ dev->mem->nregions = 1;
+ dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
+ dev->mem->regions[0].userspace_addr = (long)dev->buf;
+ dev->mem->regions[0].memory_size = dev->buf_size;
+ r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
+ assert(r >= 0);
+}
+
+/* TODO: this is pretty bad: we get a cache line bounce
+ * for the wait queue on poll and another one on read,
+ * plus the read which is there just to clear the
+ * current state. */
+static void wait_for_interrupt(struct vdev_info *dev)
+{
+ int i;
+ unsigned long long val;
+ poll(dev->fds, dev->nvqs, -1);
+ for (i = 0; i < dev->nvqs; ++i)
+ if (dev->fds[i].revents & POLLIN) {
+ read(dev->fds[i].fd, &val, sizeof val);
+ }
+}
+
+static void run_test(struct vdev_info *dev, struct vq_info *vq,
+ bool delayed, int batch, int reset_n, int bufs)
+{
+ struct scatterlist sl;
+ long started = 0, completed = 0, next_reset = reset_n;
+ long completed_before, started_before;
+ int r;
+ unsigned int len;
+ long long spurious = 0;
+ const bool random_batch = batch == RANDOM_BATCH;
+
+ r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
+ assert(!r);
+
+ if (!reset_n) {
+ next_reset = INT_MAX;
+ }
+
+ for (;;) {
+ virtqueue_disable_cb(vq->vq);
+ completed_before = completed;
+ started_before = started;
+ do {
+ const bool reset = completed > next_reset;
+ if (random_batch)
+ batch = (random() % vq->vring.num) + 1;
+
+ while (started < bufs &&
+ (started - completed) < batch) {
+ sg_init_one(&sl, dev->buf, dev->buf_size);
+ r = virtqueue_add_outbuf(vq->vq, &sl, 1,
+ dev->buf + started,
+ GFP_ATOMIC);
+ if (unlikely(r != 0)) {
+ if (r == -ENOSPC &&
+ started > started_before)
+ r = 0;
+ else
+ r = -1;
+ break;
+ }
+
+ ++started;
+
+ if (unlikely(!virtqueue_kick(vq->vq))) {
+ r = -1;
+ break;
+ }
+ }
+
+ if (started >= bufs)
+ r = -1;
+
+ if (reset) {
+ r = ioctl(dev->control, VHOST_NET_SET_BACKEND,
+ &no_backend);
+ assert(!r);
+ }
+
+ /* Flush out completed bufs if any */
+ while (virtqueue_get_buf(vq->vq, &len)) {
+ ++completed;
+ r = 0;
+ }
+
+ if (reset) {
+ struct vhost_vring_state s = { .index = 0 };
+
+ vq_reset(vq, vq->vring.num, &dev->vdev);
+
+ r = ioctl(dev->control, VHOST_GET_VRING_BASE,
+ &s);
+ assert(!r);
+
+ s.num = 0;
+ r = ioctl(dev->control, VHOST_SET_VRING_BASE,
+ &null_state);
+ assert(!r);
+
+ r = ioctl(dev->control, VHOST_NET_SET_BACKEND,
+ &backend);
+ assert(!r);
+
+ started = completed;
+ while (completed > next_reset)
+ next_reset += completed;
+ }
+ } while (r == 0);
+ if (completed == completed_before && started == started_before)
+ ++spurious;
+ assert(completed <= bufs);
+ assert(started <= bufs);
+ if (completed == bufs)
+ break;
+ if (delayed) {
+ if (virtqueue_enable_cb_delayed(vq->vq))
+ wait_for_interrupt(dev);
+ } else {
+ if (virtqueue_enable_cb(vq->vq))
+ wait_for_interrupt(dev);
+ }
+ }
+ fprintf(stderr,
+ "spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
+ spurious, started, completed);
+}
+
+const char optstring[] = "h";
+const struct option longopts[] = {
+ {
+ .name = "help",
+ .val = 'h',
+ },
+ {
+ .name = "event-idx",
+ .val = 'E',
+ },
+ {
+ .name = "no-event-idx",
+ .val = 'e',
+ },
+ {
+ .name = "indirect",
+ .val = 'I',
+ },
+ {
+ .name = "no-indirect",
+ .val = 'i',
+ },
+ {
+ .name = "virtio-1",
+ .val = '1',
+ },
+ {
+ .name = "no-virtio-1",
+ .val = '0',
+ },
+ {
+ .name = "delayed-interrupt",
+ .val = 'D',
+ },
+ {
+ .name = "no-delayed-interrupt",
+ .val = 'd',
+ },
+ {
+ .name = "buf-num",
+ .val = 'n',
+ .has_arg = required_argument,
+ },
+ {
+ .name = "batch",
+ .val = 'b',
+ .has_arg = required_argument,
+ },
+ {
+ .name = "reset",
+ .val = 'r',
+ .has_arg = optional_argument,
+ },
+ {
+ }
+};
+
+static void help(int status)
+{
+ fprintf(stderr, "Usage: virtio_test [--help]"
+ " [--no-indirect]"
+ " [--no-event-idx]"
+ " [--no-virtio-1]"
+ " [--delayed-interrupt]"
+ " [--batch=random/N]"
+ " [--reset=N]"
+ "\n");
+
+ exit(status);
+}
+
+int main(int argc, char **argv)
+{
+ struct vdev_info dev;
+ unsigned long long features = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+ (1ULL << VIRTIO_RING_F_EVENT_IDX) | (1ULL << VIRTIO_F_VERSION_1);
+ long batch = 1, reset = 0, nbufs = 0x100000;
+ int o;
+ bool delayed = false;
+
+ for (;;) {
+ o = getopt_long(argc, argv, optstring, longopts, NULL);
+ switch (o) {
+ case -1:
+ goto done;
+ case '?':
+ help(2);
+ case 'e':
+ features &= ~(1ULL << VIRTIO_RING_F_EVENT_IDX);
+ break;
+ case 'h':
+ help(0);
+ case 'i':
+ features &= ~(1ULL << VIRTIO_RING_F_INDIRECT_DESC);
+ break;
+ case '0':
+ features &= ~(1ULL << VIRTIO_F_VERSION_1);
+ break;
+ case 'D':
+ delayed = true;
+ break;
+ case 'b':
+ if (0 == strcmp(optarg, "random")) {
+ batch = RANDOM_BATCH;
+ } else {
+ batch = strtol(optarg, NULL, 10);
+ assert(batch > 0);
+ assert(batch < (long)INT_MAX + 1);
+ }
+ break;
+ case 'r':
+ if (!optarg) {
+ reset = 1;
+ } else {
+ reset = strtol(optarg, NULL, 10);
+ assert(reset > 0);
+ assert(reset < (long)INT_MAX + 1);
+ }
+ break;
+ case 'n':
+ nbufs = strtol(optarg, NULL, 10);
+ assert(nbufs > 0);
+ break;
+ default:
+ assert(0);
+ break;
+ }
+ }
+
+done:
+ backend.fd = tun_alloc();
+ assert(backend.fd >= 0);
+ vdev_info_init(&dev, features);
+ vq_info_add(&dev, 256);
+ run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
+ return 0;
+}
--
2.33.0
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
2023-12-05 11:34 ` [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test Yunsheng Lin
@ 2023-12-07 6:00 ` Jason Wang
2023-12-07 11:28 ` Yunsheng Lin
0 siblings, 1 reply; 15+ messages in thread
From: Jason Wang @ 2023-12-07 6:00 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Xuan Zhuo, virtualization
On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> introduce vhost_net_test basing on virtio_test to test
> vhost_net changing in the kernel.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> tools/virtio/Makefile | 8 +-
> tools/virtio/vhost_net_test.c | 441 ++++++++++++++++++++++++++++++++++
> 2 files changed, 446 insertions(+), 3 deletions(-)
> create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
> # SPDX-License-Identifier: GPL-2.0
> all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
> virtio_test: virtio_ring.o virtio_test.o
> vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
> try-run = $(shell set -e; \
> if ($(1)) >/dev/null 2>&1; \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
> .PHONY: all test mod clean vhost oot oot-clean oot-build
> clean:
> - ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> - vhost_test/Module.symvers vhost_test/modules.order *.d
> + ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> + vhost_test/.*.cmd vhost_test/Module.symvers \
> + vhost_test/modules.order *.d
> -include *.d
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index 000000000000..7e7b7aba3668
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,441 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <getopt.h>
> +#include <limits.h>
> +#include <string.h>
> +#include <poll.h>
> +#include <sys/eventfd.h>
> +#include <stdlib.h>
> +#include <assert.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <fcntl.h>
> +#include <stdbool.h>
> +#include <linux/virtio_types.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_ring.h>
> +#include <linux/if.h>
> +#include <linux/if_tun.h>
> +
> +#define RANDOM_BATCH -1
> +
> +static int tun_alloc(void)
> +{
> + struct ifreq ifr;
> + int fd, e;
> +
> + fd = open("/dev/net/tun", O_RDWR);
> + if (fd < 0) {
> + perror("Cannot open /dev/net/tun");
> + return fd;
> + }
> +
> + memset(&ifr, 0, sizeof(ifr));
> +
> + ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
Why did you use IFF_TUN but not IFF_TAP here?
> + strncpy(ifr.ifr_name, "tun0", IFNAMSIZ);
tun0 is pretty common if there's a VPN. Do we need some randomized name here?
> +
> + e = ioctl(fd, TUNSETIFF, (void *) &ifr);
> + if (e < 0) {
> + perror("ioctl[TUNSETIFF]");
> + close(fd);
> + return e;
> + }
> +
> + return fd;
> +}
> +
> +/* Unused */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
Why do we need trick like these here?
> +
> +struct vq_info {
> + int kick;
> + int call;
> + int num;
> + int idx;
> + void *ring;
> + /* copy used for control */
> + struct vring vring;
> + struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> + struct virtio_device vdev;
> + int control;
> + struct pollfd fds[1];
> + struct vq_info vqs[1];
> + int nvqs;
> + void *buf;
> + size_t buf_size;
> + struct vhost_memory *mem;
> +};
> +
> +static struct vhost_vring_file no_backend = { .index = 1, .fd = -1 },
> + backend = { .index = 1, .fd = 1 };
A magic number like fd = 1 is pretty confusing.
And I don't see why we need global variables here.
> +static const struct vhost_vring_state null_state = {};
> +
> +bool vq_notify(struct virtqueue *vq)
> +{
> + struct vq_info *info = vq->priv;
> + unsigned long long v = 1;
> + int r;
> + r = write(info->kick, &v, sizeof v);
> + assert(r == sizeof v);
> + return true;
> +}
> +
> +void vq_callback(struct virtqueue *vq)
> +{
> +}
> +
> +
> +void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
> +{
> + struct vhost_vring_state state = { .index = info->idx };
> + struct vhost_vring_file file = { .index = info->idx };
> + unsigned long long features = dev->vdev.features;
> + struct vhost_vring_addr addr = {
> + .index = info->idx,
> + .desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
> + .avail_user_addr = (uint64_t)(unsigned long)info->vring.avail,
> + .used_user_addr = (uint64_t)(unsigned long)info->vring.used,
> + };
> + int r;
> + r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
> + assert(r >= 0);
> + state.num = info->vring.num;
> + r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
> + assert(r >= 0);
> + state.num = 0;
> + r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
> + assert(r >= 0);
> + r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
> + assert(r >= 0);
> + file.fd = info->kick;
> + r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
> + assert(r >= 0);
> + file.fd = info->call;
> + r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
> + assert(r >= 0);
> +}
> +
> +static void vq_reset(struct vq_info *info, int num, struct virtio_device *vdev)
> +{
> + if (info->vq)
> + vring_del_virtqueue(info->vq);
> +
> + memset(info->ring, 0, vring_size(num, 4096));
> + vring_init(&info->vring, num, info->ring, 4096);
> + info->vq = vring_new_virtqueue(info->idx, num, 4096, vdev, true, false,
> + info->ring, vq_notify, vq_callback, "test");
> + assert(info->vq);
> + info->vq->priv = info;
> +}
> +
> +static void vq_info_add(struct vdev_info *dev, int num)
> +{
> + struct vq_info *info = &dev->vqs[dev->nvqs];
> + int r;
> +
> + /* use VHOST_NET_VQ_TX for testing */
> + info->idx = 1;
> + info->kick = eventfd(0, EFD_NONBLOCK);
> + info->call = eventfd(0, EFD_NONBLOCK);
> + r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
> + assert(r >= 0);
> + vq_reset(info, num, &dev->vdev);
> + vhost_vq_setup(dev, info);
> + dev->fds[0].fd = info->call;
> + dev->fds[0].events = POLLIN;
> + dev->nvqs++;
> +}
> +
> +static void vdev_info_init(struct vdev_info* dev, unsigned long long features)
> +{
> + int r;
> + memset(dev, 0, sizeof *dev);
> + dev->vdev.features = features;
> + INIT_LIST_HEAD(&dev->vdev.vqs);
> + spin_lock_init(&dev->vdev.vqs_list_lock);
> + dev->buf_size = 1024;
> + dev->buf = malloc(dev->buf_size);
> + assert(dev->buf);
> + dev->control = open("/dev/vhost-net", O_RDWR);
> + assert(dev->control >= 0);
> + r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
> + assert(r >= 0);
> + dev->mem = malloc(offsetof(struct vhost_memory, regions) +
> + sizeof dev->mem->regions[0]);
> + assert(dev->mem);
> + memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
> + sizeof dev->mem->regions[0]);
> + dev->mem->nregions = 1;
> + dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
> + dev->mem->regions[0].userspace_addr = (long)dev->buf;
> + dev->mem->regions[0].memory_size = dev->buf_size;
> + r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
> + assert(r >= 0);
> +}
> +
> +/* TODO: this is pretty bad: we get a cache line bounce
> + * for the wait queue on poll and another one on read,
> + * plus the read which is there just to clear the
> + * current state. */
> +static void wait_for_interrupt(struct vdev_info *dev)
> +{
> + int i;
> + unsigned long long val;
> + poll(dev->fds, dev->nvqs, -1);
> + for (i = 0; i < dev->nvqs; ++i)
> + if (dev->fds[i].revents & POLLIN) {
> + read(dev->fds[i].fd, &val, sizeof val);
> + }
> +}
> +
> +static void run_test(struct vdev_info *dev, struct vq_info *vq,
> + bool delayed, int batch, int reset_n, int bufs)
> +{
> + struct scatterlist sl;
> + long started = 0, completed = 0, next_reset = reset_n;
> + long completed_before, started_before;
> + int r;
> + unsigned int len;
> + long long spurious = 0;
> + const bool random_batch = batch == RANDOM_BATCH;
> +
> + r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
> + assert(!r);
> +
> + if (!reset_n) {
> + next_reset = INT_MAX;
> + }
> +
> + for (;;) {
> + virtqueue_disable_cb(vq->vq);
> + completed_before = completed;
> + started_before = started;
> + do {
> + const bool reset = completed > next_reset;
> + if (random_batch)
> + batch = (random() % vq->vring.num) + 1;
> +
> + while (started < bufs &&
> + (started - completed) < batch) {
> + sg_init_one(&sl, dev->buf, dev->buf_size);
> + r = virtqueue_add_outbuf(vq->vq, &sl, 1,
> + dev->buf + started,
> + GFP_ATOMIC);
> + if (unlikely(r != 0)) {
> + if (r == -ENOSPC &&
> + started > started_before)
> + r = 0;
> + else
> + r = -1;
> + break;
> + }
> +
> + ++started;
> +
> + if (unlikely(!virtqueue_kick(vq->vq))) {
> + r = -1;
> + break;
> + }
> + }
> +
> + if (started >= bufs)
> + r = -1;
> +
> + if (reset) {
> + r = ioctl(dev->control, VHOST_NET_SET_BACKEND,
> + &no_backend);
> + assert(!r);
> + }
> +
> + /* Flush out completed bufs if any */
> + while (virtqueue_get_buf(vq->vq, &len)) {
> + ++completed;
> + r = 0;
> + }
> +
> + if (reset) {
> + struct vhost_vring_state s = { .index = 0 };
> +
> + vq_reset(vq, vq->vring.num, &dev->vdev);
> +
> + r = ioctl(dev->control, VHOST_GET_VRING_BASE,
> + &s);
> + assert(!r);
> +
> + s.num = 0;
> + r = ioctl(dev->control, VHOST_SET_VRING_BASE,
> + &null_state);
> + assert(!r);
> +
> + r = ioctl(dev->control, VHOST_NET_SET_BACKEND,
> + &backend);
> + assert(!r);
> +
> + started = completed;
> + while (completed > next_reset)
> + next_reset += completed;
> + }
> + } while (r == 0);
> + if (completed == completed_before && started == started_before)
> + ++spurious;
> + assert(completed <= bufs);
> + assert(started <= bufs);
> + if (completed == bufs)
> + break;
> + if (delayed) {
> + if (virtqueue_enable_cb_delayed(vq->vq))
> + wait_for_interrupt(dev);
> + } else {
> + if (virtqueue_enable_cb(vq->vq))
> + wait_for_interrupt(dev);
> + }
> + }
> + fprintf(stderr,
> + "spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
> + spurious, started, completed);
> +}
> +
> +const char optstring[] = "h";
> +const struct option longopts[] = {
> + {
> + .name = "help",
> + .val = 'h',
> + },
> + {
> + .name = "event-idx",
> + .val = 'E',
> + },
> + {
> + .name = "no-event-idx",
> + .val = 'e',
> + },
> + {
> + .name = "indirect",
> + .val = 'I',
> + },
> + {
> + .name = "no-indirect",
> + .val = 'i',
> + },
> + {
> + .name = "virtio-1",
> + .val = '1',
> + },
> + {
> + .name = "no-virtio-1",
> + .val = '0',
> + },
> + {
> + .name = "delayed-interrupt",
> + .val = 'D',
> + },
> + {
> + .name = "no-delayed-interrupt",
> + .val = 'd',
> + },
> + {
> + .name = "buf-num",
> + .val = 'n',
> + .has_arg = required_argument,
> + },
> + {
> + .name = "batch",
> + .val = 'b',
> + .has_arg = required_argument,
> + },
> + {
> + .name = "reset",
> + .val = 'r',
> + .has_arg = optional_argument,
> + },
> + {
> + }
> +};
> +
> +static void help(int status)
> +{
> + fprintf(stderr, "Usage: virtio_test [--help]"
> + " [--no-indirect]"
> + " [--no-event-idx]"
> + " [--no-virtio-1]"
> + " [--delayed-interrupt]"
> + " [--batch=random/N]"
> + " [--reset=N]"
> + "\n");
> +
> + exit(status);
> +}
> +
> +int main(int argc, char **argv)
> +{
> + struct vdev_info dev;
> + unsigned long long features = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
> + (1ULL << VIRTIO_RING_F_EVENT_IDX) | (1ULL << VIRTIO_F_VERSION_1);
> + long batch = 1, reset = 0, nbufs = 0x100000;
> + int o;
> + bool delayed = false;
> +
> + for (;;) {
> + o = getopt_long(argc, argv, optstring, longopts, NULL);
> + switch (o) {
> + case -1:
> + goto done;
> + case '?':
> + help(2);
> + case 'e':
> + features &= ~(1ULL << VIRTIO_RING_F_EVENT_IDX);
> + break;
> + case 'h':
> + help(0);
> + case 'i':
> + features &= ~(1ULL << VIRTIO_RING_F_INDIRECT_DESC);
> + break;
> + case '0':
> + features &= ~(1ULL << VIRTIO_F_VERSION_1);
> + break;
> + case 'D':
> + delayed = true;
> + break;
> + case 'b':
> + if (0 == strcmp(optarg, "random")) {
> + batch = RANDOM_BATCH;
> + } else {
> + batch = strtol(optarg, NULL, 10);
> + assert(batch > 0);
> + assert(batch < (long)INT_MAX + 1);
> + }
> + break;
> + case 'r':
> + if (!optarg) {
> + reset = 1;
> + } else {
> + reset = strtol(optarg, NULL, 10);
> + assert(reset > 0);
> + assert(reset < (long)INT_MAX + 1);
> + }
> + break;
> + case 'n':
> + nbufs = strtol(optarg, NULL, 10);
> + assert(nbufs > 0);
> + break;
> + default:
> + assert(0);
> + break;
> + }
> + }
> +
> +done:
> + backend.fd = tun_alloc();
> + assert(backend.fd >= 0);
> + vdev_info_init(&dev, features);
> + vq_info_add(&dev, 256);
> + run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
I'd expect we are testing some basic traffic here. E.g can we use a
packet socket then we can test both tx and rx?
Thanks
> + return 0;
> +}
> --
> 2.33.0
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
2023-12-07 6:00 ` Jason Wang
@ 2023-12-07 11:28 ` Yunsheng Lin
2023-12-12 4:35 ` Jason Wang
0 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-07 11:28 UTC (permalink / raw)
To: Jason Wang
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Xuan Zhuo, virtualization
On 2023/12/7 14:00, Jason Wang wrote:
> On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
...
>> +
>> +static int tun_alloc(void)
>> +{
>> + struct ifreq ifr;
>> + int fd, e;
>> +
>> + fd = open("/dev/net/tun", O_RDWR);
>> + if (fd < 0) {
>> + perror("Cannot open /dev/net/tun");
>> + return fd;
>> + }
>> +
>> + memset(&ifr, 0, sizeof(ifr));
>> +
>> + ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
>
> Why did you use IFF_TUN but not IFF_TAP here?
To be honest, no particular reason, I just picked IFF_TUN and it happened
to work for me to test the changes in vhost_net_build_xdp().
Is there a particular reason you prefer IFF_TAP over IFF_TUN?
>
>> + strncpy(ifr.ifr_name, "tun0", IFNAMSIZ);
>
> tun0 is pretty common if there's a VPN. Do we need some randomized name here?
How about something like below?
snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
>
>
>> +
>> + e = ioctl(fd, TUNSETIFF, (void *) &ifr);
>> + if (e < 0) {
>> + perror("ioctl[TUNSETIFF]");
>> + close(fd);
>> + return e;
>> + }
>> +
>> + return fd;
>> +}
>> +
>> +/* Unused */
>> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
>
> Why do we need trick like these here?
That is because of the below error:
tools/virtio/./linux/kernel.h:58: undefined reference to `__kmalloc_fake'
when virtio_ring.c is compiled in userspace, the kmalloc related functions
are implemented in tools/virtio/./linux/kernel.h, which requires those
variables to be defined.
>
>> +
>> +struct vq_info {
>> + int kick;
>> + int call;
>> + int num;
>> + int idx;
>> + void *ring;
>> + /* copy used for control */
>> + struct vring vring;
>> + struct virtqueue *vq;
>> +};
>> +
>> +struct vdev_info {
>> + struct virtio_device vdev;
>> + int control;
>> + struct pollfd fds[1];
>> + struct vq_info vqs[1];
>> + int nvqs;
>> + void *buf;
>> + size_t buf_size;
>> + struct vhost_memory *mem;
>> +};
>> +
>> +static struct vhost_vring_file no_backend = { .index = 1, .fd = -1 },
>> + backend = { .index = 1, .fd = 1 };
>
> A magic number like fd = 1 is pretty confusing.
>
> And I don't see why we need global variables here.
I was using virtio_test.c as a reference, will try to remove it
if possible.
>
>> +static const struct vhost_vring_state null_state = {};
>> +
..
>> +
>> +done:
>> + backend.fd = tun_alloc();
>> + assert(backend.fd >= 0);
>> + vdev_info_init(&dev, features);
>> + vq_info_add(&dev, 256);
>> + run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
>
> I'd expect we are testing some basic traffic here. E.g can we use a
> packet socket then we can test both tx and rx?
Yes, only rx for tun is tested.
Do you have an idea how to test the tx too? As I am not familiar enough
with vhost_net and tun yet.
>
> Thanks
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
2023-12-07 11:28 ` Yunsheng Lin
@ 2023-12-12 4:35 ` Jason Wang
2023-12-20 12:44 ` Yunsheng Lin
0 siblings, 1 reply; 15+ messages in thread
From: Jason Wang @ 2023-12-12 4:35 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Xuan Zhuo, virtualization
On Thu, Dec 7, 2023 at 7:28 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/12/7 14:00, Jason Wang wrote:
> > On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> ...
>
> >> +
> >> +static int tun_alloc(void)
> >> +{
> >> + struct ifreq ifr;
> >> + int fd, e;
> >> +
> >> + fd = open("/dev/net/tun", O_RDWR);
> >> + if (fd < 0) {
> >> + perror("Cannot open /dev/net/tun");
> >> + return fd;
> >> + }
> >> +
> >> + memset(&ifr, 0, sizeof(ifr));
> >> +
> >> + ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
> >
> > Why did you use IFF_TUN but not IFF_TAP here?
>
> To be honest, no particular reason, I just picked IFF_TUN and it happened
> to work for me to test changing in vhost_net_build_xdp().
>
> Is there a particular reason you perfer the IFF_TAP over IFF_TUN?
No preference here. It only matters if you want to test L2 or L3.
>
> >
> >> + strncpy(ifr.ifr_name, "tun0", IFNAMSIZ);
> >
> > tun0 is pretty common if there's a VPN. Do we need some randomized name here?
>
> How about something like below?
>
> snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
>
> >
> >
> >> +
> >> + e = ioctl(fd, TUNSETIFF, (void *) &ifr);
> >> + if (e < 0) {
> >> + perror("ioctl[TUNSETIFF]");
> >> + close(fd);
> >> + return e;
> >> + }
> >> +
> >> + return fd;
> >> +}
> >> +
> >> +/* Unused */
> >> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
> >
> > Why do we need trick like these here?
>
> That is because of the below error:
> tools/virtio/./linux/kernel.h:58: undefined reference to `__kmalloc_fake'
>
> when virtio_ring.c is compiled in the userspace, the kmalloc raleted function
> is implemented in tools/virtio/./linux/kernel.h, which requires those varibles
> to be defined.
>
> >
> >> +
> >> +struct vq_info {
> >> + int kick;
> >> + int call;
> >> + int num;
> >> + int idx;
> >> + void *ring;
> >> + /* copy used for control */
> >> + struct vring vring;
> >> + struct virtqueue *vq;
> >> +};
> >> +
> >> +struct vdev_info {
> >> + struct virtio_device vdev;
> >> + int control;
> >> + struct pollfd fds[1];
> >> + struct vq_info vqs[1];
> >> + int nvqs;
> >> + void *buf;
> >> + size_t buf_size;
> >> + struct vhost_memory *mem;
> >> +};
> >> +
> >> +static struct vhost_vring_file no_backend = { .index = 1, .fd = -1 },
> >> + backend = { .index = 1, .fd = 1 };
> >
> > A magic number like fd = 1 is pretty confusing.
> >
> > And I don't see why we need global variables here.
>
> I was using the virtio_test.c as reference, will try to remove it
> if it is possible.
>
> >
> >> +static const struct vhost_vring_state null_state = {};
> >> +
>
> ..
>
> >> +
> >> +done:
> >> + backend.fd = tun_alloc();
> >> + assert(backend.fd >= 0);
> >> + vdev_info_init(&dev, features);
> >> + vq_info_add(&dev, 256);
> >> + run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
> >
> > I'd expect we are testing some basic traffic here. E.g can we use a
> > packet socket then we can test both tx and rx?
>
> Yes, only rx for tun is tested.
> Do you have an idea how to test the tx too? As I am not familar enough
> with vhost_net and tun yet.
Maybe you can have a packet socket to bind to the tun/tap. Then you can test:
1) TAP RX: by write a packet via virtqueue through vhost_net and read
it from packet socket
2) TAP TX: by write via packet socket and read it from the virtqueue
through vhost_net
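A rough, untested sketch of the packet socket side (packet_sock_open()
and the exact header list are just for illustration here; 'ifname' would
be whatever name tun_alloc() ends up configuring):

	#include <assert.h>
	#include <net/if.h>
	#include <netinet/in.h>
	#include <sys/socket.h>
	#include <linux/if_ether.h>
	#include <linux/if_packet.h>

	static int packet_sock_open(const char *ifname)
	{
		struct sockaddr_ll addr = {
			.sll_family = AF_PACKET,
			.sll_protocol = htons(ETH_P_ALL),
			.sll_ifindex = if_nametoindex(ifname),
		};
		int fd;

		/* raw packet socket bound to the test tun/tap device */
		fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
		assert(fd >= 0);
		assert(addr.sll_ifindex > 0);
		assert(!bind(fd, (struct sockaddr *)&addr, sizeof(addr)));
		return fd;
	}

	/* TAP RX: add a buffer to the virtqueue, then recv() it on this fd;
	 * TAP TX: send() a packet on this fd, then pick it up from the
	 * virtqueue through vhost_net.
	 */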
Thanks
>
> >
> > Thanks
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
2023-12-12 4:35 ` Jason Wang
@ 2023-12-20 12:44 ` Yunsheng Lin
2023-12-21 2:33 ` Jason Wang
0 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-20 12:44 UTC (permalink / raw)
To: Jason Wang
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Xuan Zhuo, virtualization
On 2023/12/12 12:35, Jason Wang wrote:>>>> +done:
>>>> + backend.fd = tun_alloc();
>>>> + assert(backend.fd >= 0);
>>>> + vdev_info_init(&dev, features);
>>>> + vq_info_add(&dev, 256);
>>>> + run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
>>>
>>> I'd expect we are testing some basic traffic here. E.g can we use a
>>> packet socket then we can test both tx and rx?
>>
>> Yes, only rx for tun is tested.
>> Do you have an idea how to test the tx too? As I am not familar enough
>> with vhost_net and tun yet.
>
> Maybe you can have a packet socket to bind to the tun/tap. Then you can test:
>
> 1) TAP RX: by write a packet via virtqueue through vhost_net and read
> it from packet socket
> 2) TAP TX: by write via packet socket and read it from the virtqueue
> through vhost_net
When implementing the TAP TX by adding VHOST_NET_F_VIRTIO_NET_HDR,
I found one possible use of uninitialized data in vhost_net_build_xdp().
And vhost_hlen is set to sizeof(struct virtio_net_hdr_mrg_rxbuf) and
sock_hlen is set to zero in vhost_net_set_features() for both the tx and
rx queues.
For vhost_net_build_xdp() called by handle_tx_copy():
The (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) check below may cause a
read of uninitialized data if sock_hlen is zero.
And it seems vhost_hdr is skipped in get_tx_bufs():
https://elixir.bootlin.com/linux/latest/source/drivers/vhost/net.c#L616
static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
struct iov_iter *from)
{
...
buflen += SKB_DATA_ALIGN(len + pad);
alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
if (unlikely(!vhost_net_page_frag_refill(net, buflen,
alloc_frag, GFP_KERNEL)))
return -ENOMEM;
buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
copied = copy_page_from_iter(alloc_frag->page,
alloc_frag->offset +
offsetof(struct tun_xdp_hdr, gso),
sock_hlen, from);
if (copied != sock_hlen)
return -EFAULT;
hdr = buf;
gso = &hdr->gso;
if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
vhost16_to_cpu(vq, gso->csum_start) +
vhost16_to_cpu(vq, gso->csum_offset) + 2 >
vhost16_to_cpu(vq, gso->hdr_len)) {
...
}
It seems handle_tx_copy() does not handle the VHOST_NET_F_VIRTIO_NET_HDR
case correctly, or do I miss something obvious here?
>
> Thanks
>
>>
>>>
>>> Thanks
>>
>
> .
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
2023-12-20 12:44 ` Yunsheng Lin
@ 2023-12-21 2:33 ` Jason Wang
2023-12-21 2:47 ` Yunsheng Lin
0 siblings, 1 reply; 15+ messages in thread
From: Jason Wang @ 2023-12-21 2:33 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Xuan Zhuo, virtualization
On Wed, Dec 20, 2023 at 8:45 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/12/12 12:35, Jason Wang wrote:>>>> +done:
> >>>> + backend.fd = tun_alloc();
> >>>> + assert(backend.fd >= 0);
> >>>> + vdev_info_init(&dev, features);
> >>>> + vq_info_add(&dev, 256);
> >>>> + run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
> >>>
> >>> I'd expect we are testing some basic traffic here. E.g can we use a
> >>> packet socket then we can test both tx and rx?
> >>
> >> Yes, only rx for tun is tested.
> >> Do you have an idea how to test the tx too? As I am not familar enough
> >> with vhost_net and tun yet.
> >
> > Maybe you can have a packet socket to bind to the tun/tap. Then you can test:
> >
> > 1) TAP RX: by write a packet via virtqueue through vhost_net and read
> > it from packet socket
> > 2) TAP TX: by write via packet socket and read it from the virtqueue
> > through vhost_net
>
> When implementing the TAP TX by adding VHOST_NET_F_VIRTIO_NET_HDR,
> I found one possible use of uninitialized data in vhost_net_build_xdp().
>
> And vhost_hlen is set to sizeof(struct virtio_net_hdr_mrg_rxbuf) and
> sock_hlen is set to zero in vhost_net_set_features() for both tx and rx
> queue.
>
> For vhost_net_build_xdp() called by handle_tx_copy():
>
> The (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) checking below may cause a
> read of uninitialized data if sock_hlen is zero.
Which data is uninitialized here?
>
> And it seems vhost_hdr is skipped in get_tx_bufs():
> https://elixir.bootlin.com/linux/latest/source/drivers/vhost/net.c#L616
>
> static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> struct iov_iter *from)
> {
> ...
> buflen += SKB_DATA_ALIGN(len + pad);
> alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> if (unlikely(!vhost_net_page_frag_refill(net, buflen,
> alloc_frag, GFP_KERNEL)))
> return -ENOMEM;
>
> buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> copied = copy_page_from_iter(alloc_frag->page,
> alloc_frag->offset +
> offsetof(struct tun_xdp_hdr, gso),
> sock_hlen, from);
> if (copied != sock_hlen)
> return -EFAULT;
>
> hdr = buf;
> gso = &hdr->gso;
>
> if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> vhost16_to_cpu(vq, gso->csum_start) +
> vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> vhost16_to_cpu(vq, gso->hdr_len)) {
> ...
> }
>
> It seems handle_tx_copy() does not handle the VHOST_NET_F_VIRTIO_NET_HDR
> case correctly, or am I missing something obvious here?
In get_tx_bufs() we did:
*len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
Which covers this case?
Thanks
>
> >
> > Thanks
> >
> >>
> >>>
> >>> Thanks
> >>
> >
> > .
> >
>
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
2023-12-21 2:33 ` Jason Wang
@ 2023-12-21 2:47 ` Yunsheng Lin
2023-12-21 2:56 ` Jason Wang
0 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2023-12-21 2:47 UTC (permalink / raw)
To: Jason Wang
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Xuan Zhuo, virtualization
On 2023/12/21 10:33, Jason Wang wrote:
> On Wed, Dec 20, 2023 at 8:45 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/12/12 12:35, Jason Wang wrote:>>>> +done:
>>>>>> + backend.fd = tun_alloc();
>>>>>> + assert(backend.fd >= 0);
>>>>>> + vdev_info_init(&dev, features);
>>>>>> + vq_info_add(&dev, 256);
>>>>>> + run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
>>>>>
>>>>> I'd expect we are testing some basic traffic here. E.g can we use a
>>>>> packet socket then we can test both tx and rx?
>>>>
>>>> Yes, only rx for tun is tested.
>>>> Do you have an idea how to test the tx too? As I am not familiar enough
>>>> with vhost_net and tun yet.
>>>
>>> Maybe you can bind a packet socket to the tun/tap. Then you can test:
>>>
>>> 1) TAP RX: by writing a packet via the virtqueue through vhost_net and
>>> reading it from the packet socket
>>> 2) TAP TX: by writing via the packet socket and reading it from the
>>> virtqueue through vhost_net
>>
>> When implementing the TAP TX by adding VHOST_NET_F_VIRTIO_NET_HDR,
>> I found one possible use of uninitialized data in vhost_net_build_xdp().
>>
>> And vhost_hlen is set to sizeof(struct virtio_net_hdr_mrg_rxbuf) and
>> sock_hlen is set to zero in vhost_net_set_features() for both tx and rx
>> queue.
>>
>> For vhost_net_build_xdp() called by handle_tx_copy():
>>
>> The (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) checking below may cause a
>> read of uninitialized data if sock_hlen is zero.
>
> Which data is uninitialized here?
The 'gso': since sock_hlen is zero, nothing is copied by:

	copied = copy_page_from_iter(alloc_frag->page,
				     alloc_frag->offset +
				     offsetof(struct tun_xdp_hdr, gso),
				     sock_hlen, from);
>
>>
>> And it seems vhost_hdr is skipped in get_tx_bufs():
>> https://elixir.bootlin.com/linux/latest/source/drivers/vhost/net.c#L616
>>
>> static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>> struct iov_iter *from)
>> {
>> ...
>> buflen += SKB_DATA_ALIGN(len + pad);
>> alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
>> if (unlikely(!vhost_net_page_frag_refill(net, buflen,
>> alloc_frag, GFP_KERNEL)))
>> return -ENOMEM;
>>
>> buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
>> copied = copy_page_from_iter(alloc_frag->page,
>> alloc_frag->offset +
>> offsetof(struct tun_xdp_hdr, gso),
>> sock_hlen, from);
>> if (copied != sock_hlen)
>> return -EFAULT;
>>
>> hdr = buf;
>> gso = &hdr->gso;
>>
>> if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
>> vhost16_to_cpu(vq, gso->csum_start) +
>> vhost16_to_cpu(vq, gso->csum_offset) + 2 >
>> vhost16_to_cpu(vq, gso->hdr_len)) {
>> ...
>> }
>>
>> It seems handle_tx_copy() does not handle the VHOST_NET_F_VIRTIO_NET_HDR
>> case correctly, or am I missing something obvious here?
>
> In get_tx_bufs() we did:
>
> *len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
>
> Which covers this case?
It does not seem to cover it, as the vhost_hdr is just skipped without any
handling in get_tx_bufs():
https://elixir.bootlin.com/linux/v6.7-rc6/source/drivers/vhost/net.c#L616
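(For reference, init_iov_iter() only advances the iterator past the vhost
header, roughly as below; this is paraphrased from drivers/vhost/net.c
rather than quoted verbatim, and details may differ between kernel
versions. Nothing on this path writes or validates the gso area that
vhost_net_build_xdp() later reads:)

static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
			    size_t hdr_size, int out)
{
	/* Skip the vhost header; nothing is copied or zeroed here. */
	size_t len = iov_length(vq->iov, out);

	iov_iter_init(iter, ITER_SOURCE, vq->iov, out, len);
	iov_iter_advance(iter, hdr_size);

	return iov_iter_count(iter);
}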
>
> Thanks
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
2023-12-21 2:47 ` Yunsheng Lin
@ 2023-12-21 2:56 ` Jason Wang
0 siblings, 0 replies; 15+ messages in thread
From: Jason Wang @ 2023-12-21 2:56 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
Xuan Zhuo, virtualization
On Thu, Dec 21, 2023 at 10:48 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/12/21 10:33, Jason Wang wrote:
> > On Wed, Dec 20, 2023 at 8:45 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2023/12/12 12:35, Jason Wang wrote:>>>> +done:
> >>>>>> + backend.fd = tun_alloc();
> >>>>>> + assert(backend.fd >= 0);
> >>>>>> + vdev_info_init(&dev, features);
> >>>>>> + vq_info_add(&dev, 256);
> >>>>>> + run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
> >>>>>
> >>>>> I'd expect we are testing some basic traffic here. E.g can we use a
> >>>>> packet socket then we can test both tx and rx?
> >>>>
> >>>> Yes, only rx for tun is tested.
> >>>> Do you have an idea how to test the tx too? As I am not familiar enough
> >>>> with vhost_net and tun yet.
> >>>
> >>> Maybe you can bind a packet socket to the tun/tap. Then you can test:
> >>>
> >>> 1) TAP RX: by writing a packet via the virtqueue through vhost_net and
> >>> reading it from the packet socket
> >>> 2) TAP TX: by writing via the packet socket and reading it from the
> >>> virtqueue through vhost_net
> >>
> >> When implementing the TAP TX by adding VHOST_NET_F_VIRTIO_NET_HDR,
> >> I found one possible use of uninitialized data in vhost_net_build_xdp().
> >>
> >> And vhost_hlen is set to sizeof(struct virtio_net_hdr_mrg_rxbuf) and
> >> sock_hlen is set to zero in vhost_net_set_features() for both tx and rx
> >> queue.
> >>
> >> For vhost_net_build_xdp() called by handle_tx_copy():
> >>
> >> The (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) checking below may cause a
> >> read of uninitialized data if sock_hlen is zero.
> >
> > Which data is uninitialized here?
>
> The 'gso': since sock_hlen is zero, nothing is copied by:
>
> copied = copy_page_from_iter(alloc_frag->page,
> alloc_frag->offset +
> offsetof(struct tun_xdp_hdr, gso),
> sock_hlen, from);
I think you're right. This is something we need to fix.
Or we can drop VHOST_NET_F_VIRTIO_NET_HDR as we managed to survive for years:
https://patchwork.ozlabs.org/project/netdev/patch/1528429842-22835-1-git-send-email-jasowang@redhat.com/#1930760
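(To make the "fix" option concrete, one untested direction would be to treat
a missing socket-supplied header as all-zero before it is inspected, e.g.
something along these lines in vhost_net_build_xdp(); just a sketch, not a
proposed patch:)

	hdr = buf;
	gso = &hdr->gso;

	/* untested sketch: a socket that does not supply a vnet header
	 * (sock_hlen == 0) leaves this area unwritten, so treat it as an
	 * all-zero header before the VIRTIO_NET_HDR_F_NEEDS_CSUM check.
	 */
	if (sock_hlen < sizeof(*gso))
		memset(gso, 0, sizeof(*gso));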
>
> >
> >>
> >> And it seems vhost_hdr is skipped in get_tx_bufs():
> >> https://elixir.bootlin.com/linux/latest/source/drivers/vhost/net.c#L616
> >>
> >> static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> >> struct iov_iter *from)
> >> {
> >> ...
> >> buflen += SKB_DATA_ALIGN(len + pad);
> >> alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> >> if (unlikely(!vhost_net_page_frag_refill(net, buflen,
> >> alloc_frag, GFP_KERNEL)))
> >> return -ENOMEM;
> >>
> >> buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> >> copied = copy_page_from_iter(alloc_frag->page,
> >> alloc_frag->offset +
> >> offsetof(struct tun_xdp_hdr, gso),
> >> sock_hlen, from);
> >> if (copied != sock_hlen)
> >> return -EFAULT;
> >>
> >> hdr = buf;
> >> gso = &hdr->gso;
> >>
> >> if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> >> vhost16_to_cpu(vq, gso->csum_start) +
> >> vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> >> vhost16_to_cpu(vq, gso->hdr_len)) {
> >> ...
> >> }
> >>
> >> It seems handle_tx_copy() does not handle the VHOST_NET_F_VIRTIO_NET_HDR
> >> case correctly, or am I missing something obvious here?
> >
> > In get_tx_bufs() we did:
> >
> > *len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
> >
> > Which covers this case?
>
> It does not seem to cover it, as the vhost_hdr is just skipped without any
> handling in get_tx_bufs():
> https://elixir.bootlin.com/linux/v6.7-rc6/source/drivers/vhost/net.c#L616
My understanding is that in this case vhost can't do more than this as
the socket doesn't know vnet_hdr.
Let's see if Michael is ok with this.
Thanks
>
> >
> > Thanks
>
^ permalink raw reply [flat|nested] 15+ messages in thread