* [RFC v2 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
@ 2023-08-07 12:15 ` Albert Huang
2023-08-07 12:19 ` [RFC v2 Optimizing veth xsk performance 2/9] xsk: add dma_check_skip for skipping dma check Albert Huang
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:15 UTC (permalink / raw)
Cc: Albert Huang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, open list:NETWORKING DRIVERS, open list
Some xsk libraries call the get_ringparam() API to get the queue length
and use it to initialize the xsk umem.
Implement this callback in veth so that those scenarios work properly.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
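For reference, a minimal user-space sketch (not part of this patch) of how an
application could read the ring size that veth_get_ringparam() now reports
before sizing its umem; the helper name and error handling are illustrative:

#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* sketch: fetch tx_pending via ETHTOOL_GRINGPARAM, which is served by
 * veth_get_ringparam() after this patch
 */
static int get_tx_ring_size(const char *ifname, __u32 *tx_pending)
{
	struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
	struct ifreq ifr = {0};
	int fd, err;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -errno;

	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ring;

	err = ioctl(fd, SIOCETHTOOL, &ifr);
	close(fd);
	if (err < 0)
		return -errno;

	*tx_pending = ring.tx_pending;	/* VETH_RING_SIZE with this patch */
	return 0;
}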
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..77e12d52ca2b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -255,6 +255,17 @@ static void veth_get_channels(struct net_device *dev,
static int veth_set_channels(struct net_device *dev,
struct ethtool_channels *ch);
+static void veth_get_ringparam(struct net_device *dev,
+ struct ethtool_ringparam *ring,
+ struct kernel_ethtool_ringparam *kernel_ring,
+ struct netlink_ext_ack *extack)
+{
+ ring->rx_max_pending = VETH_RING_SIZE;
+ ring->tx_max_pending = VETH_RING_SIZE;
+ ring->rx_pending = VETH_RING_SIZE;
+ ring->tx_pending = VETH_RING_SIZE;
+}
+
static const struct ethtool_ops veth_ethtool_ops = {
.get_drvinfo = veth_get_drvinfo,
.get_link = ethtool_op_get_link,
@@ -265,6 +276,7 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_channels = veth_get_channels,
.set_channels = veth_set_channels,
+ .get_ringparam = veth_get_ringparam,
};
/* general routines */
--
2.20.1
* [RFC v2 Optimizing veth xsk performance 2/9] xsk: add dma_check_skip for skipping dma check
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
2023-08-07 12:15 ` [RFC v2 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback Albert Huang
@ 2023-08-07 12:19 ` Albert Huang
2023-08-07 12:22 ` [RFC v2 Optimizing veth xsk performance 3/9] veth: add support for send queue Albert Huang
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:19 UTC (permalink / raw)
Cc: Albert Huang, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend,
open list:XDP SOCKETS (AF_XDP), open list:XDP SOCKETS (AF_XDP),
open list
For a virtual net device such as veth, there is no need to do the DMA
check if we support zero copy.
Add this flag after 'unaligned', because there is a 4-byte hole there, as shown by:
pahole -V ./net/xdp/xsk_buff_pool.o:
-----------
...
/* --- cacheline 3 boundary (192 bytes) --- */
u32 chunk_size; /* 192 4 */
u32 frame_len; /* 196 4 */
u8 cached_need_wakeup; /* 200 1 */
bool uses_need_wakeup; /* 201 1 */
bool dma_need_sync; /* 202 1 */
bool unaligned; /* 203 1 */
/* XXX 4 bytes hole, try to pack */
void * addrs; /* 208 8 */
spinlock_t cq_lock; /* 216 4 */
...
-----------
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
include/net/xsk_buff_pool.h | 1 +
net/xdp/xsk_buff_pool.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
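A driver for a virtual device opts in by setting the flag from its
XDP_SETUP_XSK_POOL handler, which xp_assign_dev() invokes before the dma_pages
check above; patch 5/9 does exactly this for veth. A minimal sketch with a
hypothetical handler name, for illustration only:

/* hypothetical copy-mode virtual driver: no DMA mapping is done,
 * so tell the pool to skip the dma_pages sanity check
 */
static int dummy_xsk_pool_enable(struct net_device *dev,
				 struct xsk_buff_pool *pool, u16 qid)
{
	/* runs before the WARN() in xp_assign_dev() */
	pool->dma_check_skip = true;
	/* ... driver-specific per-queue/NAPI setup ... */
	return 0;
}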
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..fe31097dc11b 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -81,6 +81,7 @@ struct xsk_buff_pool {
bool uses_need_wakeup;
bool dma_need_sync;
bool unaligned;
+ bool dma_check_skip;
void *addrs;
/* Mutual exclusion of the completion ring in the SKB mode. Two cases to protect:
* NAPI TX thread and sendmsg error paths in the SKB destructor callback and when
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b3f7b310811e..ed251b8e8773 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
XDP_PACKET_HEADROOM;
pool->umem = umem;
pool->addrs = umem->addrs;
+ pool->dma_check_skip = false;
INIT_LIST_HEAD(&pool->free_list);
INIT_LIST_HEAD(&pool->xskb_list);
INIT_LIST_HEAD(&pool->xsk_tx_list);
@@ -202,7 +203,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
if (err)
goto err_unreg_pool;
- if (!pool->dma_pages) {
+ if (!pool->dma_pages && !pool->dma_check_skip) {
WARN(1, "Driver did not DMA map zero-copy buffers");
err = -EINVAL;
goto err_unreg_xsk;
--
2.20.1
* [RFC v2 Optimizing veth xsk performance 3/9] veth: add support for send queue
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
2023-08-07 12:15 ` [RFC v2 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback Albert Huang
2023-08-07 12:19 ` [RFC v2 Optimizing veth xsk performance 2/9] xsk: add dma_check_skip for skipping dma check Albert Huang
@ 2023-08-07 12:22 ` Albert Huang
2023-08-07 12:23 ` [RFC v2 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function Albert Huang
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:22 UTC (permalink / raw)
Cc: Albert Huang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
In order to support native AF_XDP for veth, we need a send queue for
NAPI tx.
The upcoming patches will make use of it.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 77e12d52ca2b..25faba879505 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -56,6 +56,11 @@ struct veth_rq_stats {
struct u64_stats_sync syncp;
};
+struct veth_sq_stats {
+ struct veth_stats vs;
+ struct u64_stats_sync syncp;
+};
+
struct veth_rq {
struct napi_struct xdp_napi;
struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,11 +74,25 @@ struct veth_rq {
struct page_pool *page_pool;
};
+struct veth_sq {
+ struct napi_struct xdp_napi;
+ struct net_device *dev;
+ struct xdp_mem_info xdp_mem;
+ struct veth_sq_stats stats;
+ u32 queue_index;
+ /* for xsk */
+ struct {
+ struct xsk_buff_pool __rcu *pool;
+ u32 last_cpu;
+ } xsk;
+};
+
struct veth_priv {
struct net_device __rcu *peer;
atomic64_t dropped;
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
+ struct veth_sq *sq;
unsigned int requested_headroom;
};
@@ -1495,6 +1514,15 @@ static int veth_alloc_queues(struct net_device *dev)
u64_stats_init(&priv->rq[i].stats.syncp);
}
+ priv->sq = kcalloc(dev->num_tx_queues, sizeof(*priv->sq), GFP_KERNEL);
+ if (!priv->sq)
+ return -ENOMEM;
+
+ for (i = 0; i < dev->num_tx_queues; i++) {
+ priv->sq[i].dev = dev;
+ u64_stats_init(&priv->sq[i].stats.syncp);
+ }
+
return 0;
}
@@ -1503,6 +1531,7 @@ static void veth_free_queues(struct net_device *dev)
struct veth_priv *priv = netdev_priv(dev);
kfree(priv->rq);
+ kfree(priv->sq);
}
static int veth_dev_init(struct net_device *dev)
--
2.20.1
* [RFC v2 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
` (2 preceding siblings ...)
2023-08-07 12:22 ` [RFC v2 Optimizing veth xsk performance 3/9] veth: add support for send queue Albert Huang
@ 2023-08-07 12:23 ` Albert Huang
2023-08-07 12:24 ` [RFC v2 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc Albert Huang
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:23 UTC (permalink / raw)
Cc: Albert Huang, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend,
open list:XDP SOCKETS (AF_XDP), open list:XDP SOCKETS (AF_XDP),
open list
Return a descriptor to the completion queue (cq) by the descriptor address.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
include/net/xdp_sock_drv.h | 1 +
net/xdp/xsk.c | 6 ++++++
net/xdp/xsk_queue.h | 10 ++++++++++
3 files changed, 17 insertions(+)
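On the user-space side, the address passed to xsk_tx_completed_addr() is what
eventually shows up in the completion ring. A libxdp-style drain loop (sketch,
not part of the patch; free_addrs/free_cnt are placeholders for the
application's buffer accounting) looks like:

#include <xdp/xsk.h>

/* sketch: recycle completed tx addresses back into a user-managed free list */
static void drain_completions(struct xsk_ring_cons *comp,
			      __u64 *free_addrs, unsigned int *free_cnt,
			      unsigned int batch)
{
	unsigned int i, idx = 0;
	unsigned int done = xsk_ring_cons__peek(comp, batch, &idx);

	for (i = 0; i < done; i++)
		free_addrs[(*free_cnt)++] = *xsk_ring_cons__comp_addr(comp, idx + i);

	if (done)
		xsk_ring_cons__release(comp, done);
}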
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..5220454bff5c 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -15,6 +15,7 @@
#ifdef CONFIG_XDP_SOCKETS
void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr);
bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
void xsk_tx_release(struct xsk_buff_pool *pool);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4f1e0599146e..b2b8aa7b0bcf 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -396,6 +396,12 @@ void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
}
EXPORT_SYMBOL(xsk_tx_completed);
+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr)
+{
+ xskq_prod_submit_addr(pool->cq, addr);
+}
+EXPORT_SYMBOL(xsk_tx_completed_addr);
+
void xsk_tx_release(struct xsk_buff_pool *pool)
{
struct xdp_sock *xs;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..3a5e26a81dc2 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -428,6 +428,16 @@ static inline void __xskq_prod_submit(struct xsk_queue *q, u32 idx)
smp_store_release(&q->ring->producer, idx); /* B, matches C */
}
+static inline void xskq_prod_submit_addr(struct xsk_queue *q, u64 addr)
+{
+ struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+ u32 idx = q->ring->producer;
+
+ ring->desc[idx++ & q->ring_mask] = addr;
+
+ __xskq_prod_submit(q, idx);
+}
+
static inline void xskq_prod_submit(struct xsk_queue *q)
{
__xskq_prod_submit(q, q->cached_prod);
--
2.20.1
* [RFC v2 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
` (3 preceding siblings ...)
2023-08-07 12:23 ` [RFC v2 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function Albert Huang
@ 2023-08-07 12:24 ` Albert Huang
2023-08-07 12:25 ` [RFC v2 Optimizing veth xsk performance 6/9] veth: add ndo_xsk_wakeup callback for veth Albert Huang
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:24 UTC (permalink / raw)
Cc: Albert Huang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
Use the send queue tx NAPI to transmit xsk tx descriptors.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 230 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 229 insertions(+), 1 deletion(-)
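With this in place, an AF_XDP socket can be bound to a veth queue in zero-copy
mode from user space. A minimal libxdp sketch (not part of the patch; the
interface name, queue id and umem setup are assumed to exist already):

#include <xdp/xsk.h>
#include <linux/if_xdp.h>

/* sketch: bind an already-created umem to queue 0 of "veth0" with
 * XDP_ZEROCOPY, which lands in veth_xsk_pool_setup() via ndo_bpf
 */
static struct xsk_socket *veth_zc_socket(struct xsk_umem *umem,
					 struct xsk_ring_cons *rx,
					 struct xsk_ring_prod *tx)
{
	struct xsk_socket_config cfg = {
		.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
		.bind_flags = XDP_ZEROCOPY,
	};
	struct xsk_socket *xsk = NULL;

	if (xsk_socket__create(&xsk, "veth0", 0, umem, rx, tx, &cfg))
		return NULL;

	return xsk;
}

Adding XDP_USE_NEED_WAKEUP to bind_flags becomes useful once patch 9/9 is
applied.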
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 25faba879505..28b891dd8dc9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,8 @@
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
#include <net/page_pool.h>
+#include <net/xdp_sock_drv.h>
+#include <net/xdp.h>
#define DRV_NAME "veth"
#define DRV_VERSION "1.0"
@@ -1061,6 +1063,141 @@ static int veth_poll(struct napi_struct *napi, int budget)
return done;
}
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+ struct sk_buff *skb;
+
+ skb = build_skb(head, buflen);
+ if (!skb)
+ return NULL;
+
+ skb_reserve(skb, headroom);
+ skb_put(skb, len);
+
+ return skb;
+}
+
+static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
+{
+ struct veth_priv *priv, *peer_priv;
+ struct net_device *dev, *peer_dev;
+ struct veth_stats stats = {};
+ struct sk_buff *skb = NULL;
+ struct veth_rq *peer_rq;
+ struct xdp_desc desc;
+ int done = 0;
+
+ dev = sq->dev;
+ priv = netdev_priv(dev);
+ peer_dev = priv->peer;
+ peer_priv = netdev_priv(peer_dev);
+
+ /* todo: queue index must set before this */
+ peer_rq = &peer_priv->rq[sq->queue_index];
+
+ /* set xsk wake up flag, to do: where to disable */
+ if (xsk_uses_need_wakeup(xsk_pool))
+ xsk_set_tx_need_wakeup(xsk_pool);
+
+ while (budget-- > 0) {
+ unsigned int truesize = 0;
+ struct page *page;
+ void *vaddr;
+ void *addr;
+
+ if (!xsk_tx_peek_desc(xsk_pool, &desc))
+ break;
+
+ addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
+
+ /* cannot hold all the data in one page */
+ truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ truesize += desc.len + xsk_pool->headroom;
+ if (truesize > PAGE_SIZE) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ stats.xdp_drops++;
+ break;
+ }
+
+ page = dev_alloc_page();
+ if (!page) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ stats.xdp_drops++;
+ break;
+ }
+ vaddr = page_to_virt(page);
+
+ memcpy(vaddr + xsk_pool->headroom, addr, desc.len);
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+
+ skb = veth_build_skb(vaddr, xsk_pool->headroom, desc.len, PAGE_SIZE);
+ if (!skb) {
+ put_page(page);
+ stats.xdp_drops++;
+ break;
+ }
+ skb->protocol = eth_type_trans(skb, peer_dev);
+ napi_gro_receive(&peer_rq->xdp_napi, skb);
+
+ stats.xdp_bytes += desc.len;
+ done++;
+ }
+
+ /* release, move consumer,and wakeup the producer */
+ if (done) {
+ napi_schedule(&peer_rq->xdp_napi);
+ xsk_tx_release(xsk_pool);
+ }
+
+ u64_stats_update_begin(&sq->stats.syncp);
+ sq->stats.vs.xdp_packets += done;
+ sq->stats.vs.xdp_bytes += stats.xdp_bytes;
+ sq->stats.vs.xdp_drops += stats.xdp_drops;
+ u64_stats_update_end(&sq->stats.syncp);
+
+ return done;
+}
+
+static int veth_poll_tx(struct napi_struct *napi, int budget)
+{
+ struct veth_sq *sq = container_of(napi, struct veth_sq, xdp_napi);
+ struct xsk_buff_pool *pool;
+ int done = 0;
+
+ sq->xsk.last_cpu = smp_processor_id();
+
+ /* xmit for tx queue */
+ rcu_read_lock();
+ pool = rcu_dereference(sq->xsk.pool);
+ if (pool)
+ done = veth_xsk_tx_xmit(sq, pool, budget);
+
+ rcu_read_unlock();
+
+ if (done < budget) {
+ /* if done < budget, the tx ring has no more buffers */
+ napi_complete_done(napi, done);
+ }
+
+ return done;
+}
+
+static int veth_napi_add_tx(struct net_device *dev)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+ int i;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ struct veth_sq *sq = &priv->sq[i];
+
+ netif_napi_add(dev, &sq->xdp_napi, veth_poll_tx);
+ napi_enable(&sq->xdp_napi);
+ }
+
+ return 0;
+}
+
static int veth_create_page_pool(struct veth_rq *rq)
{
struct page_pool_params pp_params = {
@@ -1153,6 +1290,19 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
}
}
+static void veth_napi_del_tx(struct net_device *dev)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+ int i;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ struct veth_sq *sq = &priv->sq[i];
+
+ napi_disable(&sq->xdp_napi);
+ __netif_napi_del(&sq->xdp_napi);
+ }
+}
+
static void veth_napi_del(struct net_device *dev)
{
veth_napi_del_range(dev, 0, dev->real_num_rx_queues);
@@ -1360,7 +1510,7 @@ static void veth_set_xdp_features(struct net_device *dev)
struct veth_priv *priv_peer = netdev_priv(peer);
xdp_features_t val = NETDEV_XDP_ACT_BASIC |
NETDEV_XDP_ACT_REDIRECT |
- NETDEV_XDP_ACT_RX_SG;
+ NETDEV_XDP_ACT_RX_SG | NETDEV_XDP_ACT_XSK_ZEROCOPY;
if (priv_peer->_xdp_prog || veth_gro_requested(peer))
val |= NETDEV_XDP_ACT_NDO_XMIT |
@@ -1737,11 +1887,89 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
return err;
}
+static int veth_xsk_pool_enable(struct net_device *dev, struct xsk_buff_pool *pool, u16 qid)
+{
+ struct veth_priv *peer_priv;
+ struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer_dev = priv->peer;
+ int err = 0;
+
+ if (qid >= dev->real_num_tx_queues)
+ return -EINVAL;
+
+ if (!peer_dev)
+ return -EINVAL;
+
+ /* no DMA, so we just skip the DMA check in xsk zero copy */
+ pool->dma_check_skip = true;
+
+ peer_priv = netdev_priv(peer_dev);
+
+ /* enable peer tx xdp here; this side's
+ * xdp is enabled by veth_xdp_set
+ * to do: we need to check whether this side has already enabled xdp,
+ * maybe it does not have an xdp prog
+ */
+ if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+ /* peer should enable napi */
+ err = veth_napi_enable(peer_dev);
+ if (err)
+ return err;
+ }
+
+ /* This path is already protected by rtnl_lock, so rcu_assign_pointer
+ * is safe.
+ */
+ rcu_assign_pointer(priv->sq[qid].xsk.pool, pool);
+
+ veth_napi_add_tx(dev);
+
+ return err;
+}
+
+static int veth_xsk_pool_disable(struct net_device *dev, u16 qid)
+{
+ struct veth_priv *peer_priv;
+ struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer_dev = priv->peer;
+ int err = 0;
+
+ if (qid >= dev->real_num_tx_queues)
+ return -EINVAL;
+
+ if (!peer_dev)
+ return -EINVAL;
+
+ peer_priv = netdev_priv(peer_dev);
+
+ /* to do: this may fail */
+ if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+ /* disable peer napi */
+ veth_napi_del(peer_dev);
+ }
+
+ veth_napi_del_tx(dev);
+
+ rcu_assign_pointer(priv->sq[qid].xsk.pool, NULL);
+ return err;
+}
+
+/* this is for setting up the xsk pool */
+static int veth_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
+{
+ if (xdp->xsk.pool)
+ return veth_xsk_pool_enable(dev, xdp->xsk.pool, xdp->xsk.queue_id);
+ else
+ return veth_xsk_pool_disable(dev, xdp->xsk.queue_id);
+}
+
static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
switch (xdp->command) {
case XDP_SETUP_PROG:
return veth_xdp_set(dev, xdp->prog, xdp->extack);
+ case XDP_SETUP_XSK_POOL:
+ return veth_xsk_pool_setup(dev, xdp);
default:
return -EINVAL;
}
--
2.20.1
* [RFC v2 Optimizing veth xsk performance 6/9] veth: add ndo_xsk_wakeup callback for veth
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
` (4 preceding siblings ...)
2023-08-07 12:24 ` [RFC v2 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc Albert Huang
@ 2023-08-07 12:25 ` Albert Huang
2023-08-07 12:26 ` [RFC v2 Optimizing veth xsk performance 8/9] veth: af_xdp tx batch support for ipv4 udp Albert Huang
2023-08-07 12:26 ` [RFC v2 Optimizing veth xsk performance 9/9] veth: add support for AF_XDP tx need_wakup feature Albert Huang
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:25 UTC (permalink / raw)
Cc: Albert Huang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
Add the ndo_xsk_wakeup callback for veth; it is used to wake up
the tx NAPI.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
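From user space, the wakeup is issued with a dummy sendto() on the xsk socket
fd; that syscall reaches veth_xsk_wakeup() and schedules the tx NAPI (via an
IPI to the last polling CPU if needed). A kick helper sketch, similar to the
kernel's xdpsock sample (not part of the patch):

#include <errno.h>
#include <sys/socket.h>
#include <xdp/xsk.h>

/* sketch: after submitting tx descriptors, poke the kernel; this ends
 * up in veth_xsk_wakeup() through the ndo_xsk_wakeup callback
 */
static int kick_tx(struct xsk_socket *xsk)
{
	int ret = sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);

	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN ||
	    errno == EBUSY || errno == ENETDOWN)
		return 0;

	return -errno;
}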
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 28b891dd8dc9..ac78d6a87416 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1805,6 +1805,44 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
rcu_read_unlock();
}
+static void veth_xsk_remote_trigger_napi(void *info)
+{
+ struct veth_sq *sq = info;
+
+ napi_schedule(&sq->xdp_napi);
+}
+
+static int veth_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
+{
+ struct veth_priv *priv;
+ struct veth_sq *sq;
+ u32 last_cpu, cur_cpu;
+
+ if (!netif_running(dev))
+ return -ENETDOWN;
+
+ if (qid >= dev->real_num_rx_queues)
+ return -EINVAL;
+
+ priv = netdev_priv(dev);
+ sq = &priv->sq[qid];
+
+ if (napi_if_scheduled_mark_missed(&sq->xdp_napi))
+ return 0;
+
+ last_cpu = sq->xsk.last_cpu;
+ cur_cpu = get_cpu();
+
+ /* raise a napi */
+ if (last_cpu == cur_cpu)
+ napi_schedule(&sq->xdp_napi);
+ else
+ smp_call_function_single(last_cpu, veth_xsk_remote_trigger_napi, sq, true);
+
+ put_cpu();
+ return 0;
+}
+
static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
struct netlink_ext_ack *extack)
{
@@ -2019,6 +2057,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_set_rx_headroom = veth_set_rx_headroom,
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
+ .ndo_xsk_wakeup = veth_xsk_wakeup,
.ndo_get_peer_dev = veth_peer_dev,
};
--
2.20.1
* [RFC v2 Optimizing veth xsk performance 8/9] veth: af_xdp tx batch support for ipv4 udp
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
` (5 preceding siblings ...)
2023-08-07 12:25 ` [RFC v2 Optimizing veth xsk performance 6/9] veth: add ndo_xsk_wakeup callback for veth Albert Huang
@ 2023-08-07 12:26 ` Albert Huang
2023-08-07 12:26 ` [RFC v2 Optimizing veth xsk performance 9/9] veth: add support for AF_XDP tx need_wakup feature Albert Huang
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:26 UTC (permalink / raw)
Cc: Albert Huang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
A typical topology is shown below:
veth <---1---> veth-peer
                   |
                   | 2
                   |
                bridge <-------> eth0 (such as a mlx5 NIC)
If you use AF_XDP to send packets from veth to a physical NIC, they
have to traverse several software paths, so we can borrow from the
kernel GSO implementation: when AF_XDP sends packets out of veth,
aggregate them and send one large packet from the veth virtual NIC
to the physical NIC.
Performance (tested with the libxdp library):
AF_XDP without batch: 480 Kpps (ksoftirqd at 100% CPU)
AF_XDP with batch:    1.5 Mpps (ksoftirqd at 15% CPU)
With AF_XDP batching, the libxdp user-space program becomes the
bottleneck, so the softirq does not reach its limit.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 408 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 387 insertions(+), 21 deletions(-)
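To benefit from the aggregation, the sender has to produce bursts of
equal-length frames belonging to the same UDP 5-tuple, since that is what
veth_segment_match() keys on. A user-space burst sketch (illustrative only;
umem_area, frame, base_addr and frame_size are assumptions of the caller's
umem layout):

#include <string.h>
#include <xdp/xsk.h>

/* sketch: enqueue 'n' copies of one prebuilt Ethernet+IPv4+UDP frame so
 * the veth tx path can merge them into a single large GSO-style skb
 */
static unsigned int send_burst(struct xsk_ring_prod *tx, void *umem_area,
			       const void *frame, __u32 frame_len,
			       __u64 base_addr, __u32 frame_size,
			       unsigned int n)
{
	unsigned int i, idx = 0;

	if (xsk_ring_prod__reserve(tx, n, &idx) != n)
		return 0;	/* tx ring full, caller retries later */

	for (i = 0; i < n; i++) {
		struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx + i);

		desc->addr = base_addr + (__u64)i * frame_size;
		desc->len = frame_len;
		/* same 5-tuple and same length in every frame */
		memcpy(xsk_umem__get_data(umem_area, desc->addr), frame, frame_len);
	}

	xsk_ring_prod__submit(tx, n);
	return n;
}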
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ac78d6a87416..70489d017b51 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -29,6 +29,7 @@
#include <net/page_pool.h>
#include <net/xdp_sock_drv.h>
#include <net/xdp.h>
+#include <net/udp.h>
#define DRV_NAME "veth"
#define DRV_VERSION "1.0"
@@ -103,6 +104,23 @@ struct veth_xdp_tx_bq {
unsigned int count;
};
+struct veth_batch_tuple {
+ __u8 protocol;
+ __be32 saddr;
+ __be32 daddr;
+ __be16 source;
+ __be16 dest;
+ __be16 batch_size;
+ __be16 batch_segs;
+ bool batch_enable;
+ bool batch_flush;
+};
+
+struct veth_seg_info {
+ u32 segs;
+ u64 desc[] ____cacheline_aligned_in_smp;
+};
+
/*
* ethtool interface
*/
@@ -1078,11 +1096,340 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
return skb;
}
+static void veth_xsk_destruct_skb(struct sk_buff *skb)
+{
+ struct skb_shared_info *si = skb_shinfo(skb);
+ struct xsk_buff_pool *pool = (struct xsk_buff_pool *)si->destructor_arg_xsk_pool;
+ struct veth_seg_info *seg_info = (struct veth_seg_info *)si->destructor_arg;
+ unsigned long flags;
+ u32 index = 0;
+ u64 addr;
+
+ /* release cq */
+ spin_lock_irqsave(&pool->cq_lock, flags);
+ for (index = 0; index < seg_info->segs; index++) {
+ addr = (u64)(long)seg_info->desc[index];
+ xsk_tx_completed_addr(pool, addr);
+ }
+ spin_unlock_irqrestore(&pool->cq_lock, flags);
+
+ kfree(seg_info);
+ si->destructor_arg = NULL;
+ si->destructor_arg_xsk_pool = NULL;
+}
+
+static struct sk_buff *veth_build_gso_head_skb(struct net_device *dev,
+ char *buff, u32 tot_len,
+ u32 headroom, u32 iph_len,
+ u32 th_len)
+{
+ struct sk_buff *skb = NULL;
+ int err = 0;
+
+ skb = alloc_skb(tot_len, GFP_KERNEL);
+ if (unlikely(!skb))
+ return NULL;
+
+ /* header room contains the eth header */
+ skb_reserve(skb, headroom - ETH_HLEN);
+ skb_put(skb, ETH_HLEN + iph_len + th_len);
+ skb_shinfo(skb)->gso_segs = 0;
+
+ err = skb_store_bits(skb, 0, buff, ETH_HLEN + iph_len + th_len);
+ if (unlikely(err)) {
+ kfree_skb(skb);
+ return NULL;
+ }
+
+ skb->protocol = eth_type_trans(skb, dev);
+ skb->network_header = skb->mac_header + ETH_HLEN;
+ skb->transport_header = skb->network_header + iph_len;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ return skb;
+}
+
+/* only ipv4 udp match
+ * to do: tcp and ipv6
+ */
+static inline bool veth_segment_match(struct veth_batch_tuple *tuple,
+ struct iphdr *iph, struct udphdr *udph)
+{
+ if (tuple->protocol == iph->protocol &&
+ tuple->saddr == iph->saddr &&
+ tuple->daddr == iph->daddr &&
+ tuple->source == udph->source &&
+ tuple->dest == udph->dest &&
+ tuple->batch_size == ntohs(udph->len)) {
+ tuple->batch_flush = false;
+ return true;
+ }
+
+ tuple->batch_flush = true;
+ return false;
+}
+
+static inline void veth_tuple_init(struct veth_batch_tuple *tuple,
+ struct iphdr *iph, struct udphdr *udph)
+{
+ tuple->protocol = iph->protocol;
+ tuple->saddr = iph->saddr;
+ tuple->daddr = iph->daddr;
+ tuple->source = udph->source;
+ tuple->dest = udph->dest;
+ tuple->batch_flush = false;
+ tuple->batch_size = ntohs(udph->len);
+ tuple->batch_segs = 0;
+}
+
+static inline bool veth_batch_ip_check_v4(struct iphdr *iph, u32 len)
+{
+ if (len <= (ETH_HLEN + sizeof(*iph)))
+ return false;
+
+ if (iph->ihl < 5 || iph->version != 4 || len < (iph->ihl * 4 + ETH_HLEN))
+ return false;
+
+ return true;
+}
+
+static struct sk_buff *veth_build_skb_batch_udp(struct net_device *dev,
+ struct xsk_buff_pool *pool,
+ struct xdp_desc *desc,
+ struct veth_batch_tuple *tuple,
+ struct sk_buff *prev_skb)
+{
+ u32 hr, len, ts, index, iph_len, th_len, data_offset, data_len, tot_len;
+ struct veth_seg_info *seg_info;
+ void *buffer;
+ struct udphdr *udph;
+ struct iphdr *iph;
+ struct sk_buff *skb;
+ struct page *page;
+ u32 seg_len = 0;
+ int hh_len = 0;
+ u64 addr;
+
+ addr = desc->addr;
+ len = desc->len;
+
+ /* l2 reserved len */
+ hh_len = LL_RESERVED_SPACE(dev);
+ hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(hh_len));
+
+ /* data points to eth header */
+ buffer = (unsigned char *)xsk_buff_raw_get_data(pool, addr);
+
+ iph = (struct iphdr *)(buffer + ETH_HLEN);
+ iph_len = iph->ihl * 4;
+
+ udph = (struct udphdr *)(buffer + ETH_HLEN + iph_len);
+ th_len = sizeof(struct udphdr);
+
+ if (tuple->batch_flush)
+ veth_tuple_init(tuple, iph, udph);
+
+ ts = pool->unaligned ? len : pool->chunk_size;
+
+ data_offset = offset_in_page(buffer) + ETH_HLEN + iph_len + th_len;
+ data_len = len - (ETH_HLEN + iph_len + th_len);
+
+ /* head is null or this is a new 5 tuple */
+ if (!prev_skb || !veth_segment_match(tuple, iph, udph)) {
+ tot_len = hr + iph_len + th_len;
+ skb = veth_build_gso_head_skb(dev, buffer, tot_len, hr, iph_len, th_len);
+ if (!skb) {
+ /* to do: handle here for skb */
+ return NULL;
+ }
+
+ /* store information for gso */
+ seg_len = struct_size(seg_info, desc, MAX_SKB_FRAGS);
+ seg_info = kmalloc(seg_len, GFP_KERNEL);
+ if (!seg_info) {
+ /* to do */
+ kfree_skb(skb);
+ return NULL;
+ }
+ } else {
+ skb = prev_skb;
+ skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4 | SKB_GSO_PARTIAL;
+ skb_shinfo(skb)->gso_size = data_len;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ /* max segment is MAX_SKB_FRAGS */
+ if (skb_shinfo(skb)->gso_segs >= MAX_SKB_FRAGS - 1)
+ tuple->batch_flush = true;
+
+ seg_info = (struct veth_seg_info *)skb_shinfo(skb)->destructor_arg;
+ }
+
+ /* offset in umem pool buffer */
+ addr = buffer - pool->addrs;
+
+ /* get the page of the desc */
+ page = pool->umem->pgs[addr >> PAGE_SHIFT];
+
+ /* take a page reference so it does not get freed by kfree_skb */
+ get_page(page);
+
+ /* the desc data cannot span two pages */
+ skb_fill_page_desc(skb, skb_shinfo(skb)->gso_segs, page, data_offset, data_len);
+
+ skb->len += data_len;
+ skb->data_len += data_len;
+ skb->truesize += ts;
+ skb->dev = dev;
+
+ /* later we will support gso for this */
+ index = skb_shinfo(skb)->gso_segs;
+ seg_info->desc[index] = desc->addr;
+ seg_info->segs = ++index;
+ skb_shinfo(skb)->gso_segs++;
+
+ skb_shinfo(skb)->destructor_arg = (void *)(long)seg_info;
+ skb_shinfo(skb)->destructor_arg_xsk_pool = (void *)(long)pool;
+ skb->destructor = veth_xsk_destruct_skb;
+
+ /* to do:
+ * add the skb to the sock. maybe there is no need to do this,
+ * and there might be multiple xsk sockets involved, so it's
+ * difficult to determine which socket is sending the data.
+ * refcount_add(ts, &xs->sk.sk_wmem_alloc);
+ */
+ return skb;
+}
+
+static inline struct sk_buff *veth_build_skb_def(struct net_device *dev,
+ struct xsk_buff_pool *pool, struct xdp_desc *desc)
+{
+ struct sk_buff *skb = NULL;
+ struct page *page;
+ void *buffer;
+ void *vaddr;
+
+ page = dev_alloc_page();
+ if (!page)
+ return NULL;
+
+ buffer = (unsigned char *)xsk_buff_raw_get_data(pool, desc->addr);
+
+ vaddr = page_to_virt(page);
+ memcpy(vaddr + pool->headroom, buffer, desc->len);
+ skb = veth_build_skb(vaddr, pool->headroom, desc->len, PAGE_SIZE);
+ if (!skb) {
+ put_page(page);
+ return NULL;
+ }
+
+ skb->protocol = eth_type_trans(skb, dev);
+
+ return skb;
+}
+
+/* To call the following function, the following conditions must be met:
+ * 1. The data packet must be a standard Ethernet data packet
+ * 2. Data packets support batch sending
+ */
+static inline struct sk_buff *veth_build_skb_batch_v4(struct net_device *dev,
+ struct xsk_buff_pool *pool,
+ struct xdp_desc *desc,
+ struct veth_batch_tuple *tuple,
+ struct sk_buff *prev_skb)
+{
+ struct iphdr *iph;
+ void *buffer;
+ u64 addr;
+
+ addr = desc->addr;
+ buffer = (unsigned char *)xsk_buff_raw_get_data(pool, addr);
+ iph = (struct iphdr *)(buffer + ETH_HLEN);
+ if (!veth_batch_ip_check_v4(iph, desc->len))
+ goto normal;
+
+ switch (iph->protocol) {
+ case IPPROTO_UDP:
+ return veth_build_skb_batch_udp(dev, pool, desc, tuple, prev_skb);
+ default:
+ break;
+ }
+normal:
+ tuple->batch_enable = false;
+ return veth_build_skb_def(dev, pool, desc);
+}
+
+/* Zero copy needs to meet the following conditions:
+ * 1. The data content of the tx desc must be within one page
+ * 2. The tx desc must support batch xmit, which is set by userspace
+ */
+static inline bool veth_batch_desc_check(void *buff, u32 len)
+{
+ u32 offset;
+
+ offset = offset_in_page(buff);
+ if (PAGE_SIZE - offset < len)
+ return false;
+
+ return true;
+}
+
+/* here must be a ipv4 or ipv6 packet */
+static inline struct sk_buff *veth_build_skb_batch(struct net_device *dev,
+ struct xsk_buff_pool *pool,
+ struct xdp_desc *desc,
+ struct veth_batch_tuple *tuple,
+ struct sk_buff *prev_skb)
+{
+ const struct ethhdr *eth;
+ void *buffer;
+
+ buffer = xsk_buff_raw_get_data(pool, desc->addr);
+ if (!veth_batch_desc_check(buffer, desc->len))
+ goto normal;
+
+ eth = (struct ethhdr *)buffer;
+ switch (ntohs(eth->h_proto)) {
+ case ETH_P_IP:
+ tuple->batch_enable = true;
+ return veth_build_skb_batch_v4(dev, pool, desc, tuple, prev_skb);
+ /* to do: not supported yet, just build the skb, no batching */
+ case ETH_P_IPV6:
+ fallthrough;
+ default:
+ break;
+ }
+
+normal:
+ tuple->batch_flush = false;
+ tuple->batch_enable = false;
+ return veth_build_skb_def(dev, pool, desc);
+}
+
+/* just support ipv4 udp batch
+ * to do: ipv4 tcp and ipv6
+ */
+static inline void veth_skb_batch_checksum(struct sk_buff *skb)
+{
+ struct iphdr *iph = ip_hdr(skb);
+ struct udphdr *uh = udp_hdr(skb);
+ int ip_tot_len = skb->len;
+ int udp_len = skb->len - (skb->transport_header - skb->network_header);
+
+ iph->tot_len = htons(ip_tot_len);
+ ip_send_check(iph);
+ uh->len = htons(udp_len);
+ uh->check = 0;
+
+ udp4_hwcsum(skb, iph->saddr, iph->daddr);
+}
+
static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
{
struct veth_priv *priv, *peer_priv;
struct net_device *dev, *peer_dev;
+ struct veth_batch_tuple tuple;
struct veth_stats stats = {};
+ struct sk_buff *prev_skb = NULL;
struct sk_buff *skb = NULL;
struct veth_rq *peer_rq;
struct xdp_desc desc;
@@ -1093,24 +1440,23 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
peer_dev = priv->peer;
peer_priv = netdev_priv(peer_dev);
- /* todo: queue index must set before this */
+ /* queue_index is set in napi enable
+ * to do: maybe we should select the rq by 5-tuple or hash
+ */
peer_rq = &peer_priv->rq[sq->queue_index];
+ memset(&tuple, 0, sizeof(tuple));
+
/* set xsk wake up flag, to do: where to disable */
if (xsk_uses_need_wakeup(xsk_pool))
xsk_set_tx_need_wakeup(xsk_pool);
while (budget-- > 0) {
unsigned int truesize = 0;
- struct page *page;
- void *vaddr;
- void *addr;
if (!xsk_tx_peek_desc(xsk_pool, &desc))
break;
- addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
-
/* cannot hold all the data in one page */
truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
truesize += desc.len + xsk_pool->headroom;
@@ -1120,30 +1466,50 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
break;
}
- page = dev_alloc_page();
- if (!page) {
+ skb = veth_build_skb_batch(peer_dev, xsk_pool, &desc, &tuple, prev_skb);
+ if (!skb) {
+ stats.rx_drops++;
xsk_tx_completed_addr(xsk_pool, desc.addr);
- stats.xdp_drops++;
- break;
+ if (prev_skb != skb) {
+ napi_gro_receive(&peer_rq->xdp_napi, prev_skb);
+ prev_skb = NULL;
+ }
+ continue;
}
- vaddr = page_to_virt(page);
-
- memcpy(vaddr + xsk_pool->headroom, addr, desc.len);
- xsk_tx_completed_addr(xsk_pool, desc.addr);
- skb = veth_build_skb(vaddr, xsk_pool->headroom, desc.len, PAGE_SIZE);
- if (!skb) {
- put_page(page);
- stats.xdp_drops++;
- break;
+ if (!tuple.batch_enable) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ /* flush the prev skb first to avoid out of order */
+ if (prev_skb != skb && prev_skb) {
+ veth_skb_batch_checksum(prev_skb);
+ napi_gro_receive(&peer_rq->xdp_napi, prev_skb);
+ prev_skb = NULL;
+ }
+ napi_gro_receive(&peer_rq->xdp_napi, skb);
+ skb = NULL;
+ } else {
+ if (prev_skb && tuple.batch_flush) {
+ veth_skb_batch_checksum(prev_skb);
+ napi_gro_receive(&peer_rq->xdp_napi, prev_skb);
+ if (prev_skb == skb)
+ prev_skb = skb = NULL;
+ else
+ prev_skb = skb;
+ } else {
+ prev_skb = skb;
+ }
}
- skb->protocol = eth_type_trans(skb, peer_dev);
- napi_gro_receive(&peer_rq->xdp_napi, skb);
stats.xdp_bytes += desc.len;
done++;
}
+ /* there is still a batched skb that needs to be sent to peer_rq */
+ if (skb) {
+ veth_skb_batch_checksum(skb);
+ napi_gro_receive(&peer_rq->xdp_napi, skb);
+ }
+
/* release, move consumer,and wakeup the producer */
if (done) {
napi_schedule(&peer_rq->xdp_napi);
--
2.20.1
* [RFC v2 Optimizing veth xsk performance 9/9] veth: add support for AF_XDP tx need_wakup feature
2023-08-07 12:04 [RFC v2 Optimizing veth xsk performance 0/9] Albert Huang
` (6 preceding siblings ...)
2023-08-07 12:26 ` [RFC v2 Optimizing veth xsk performance 8/9] veth: af_xdp tx batch support for ipv4 udp Albert Huang
@ 2023-08-07 12:26 ` Albert Huang
7 siblings, 0 replies; 9+ messages in thread
From: Albert Huang @ 2023-08-07 12:26 UTC (permalink / raw)
Cc: Albert Huang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, open list:NETWORKING DRIVERS, open list
This patch only adds support for the tx need_wakeup feature.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
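With the flag now toggled by the driver, user space can skip the tx wakeup
syscall while the NAPI is still running and only kick when the kernel asks
for it. A sketch of the check, pairing with the kick helper shown in the note
to patch 6/9 (illustrative, not part of the patch):

#include <sys/socket.h>
#include <xdp/xsk.h>

/* sketch: only issue the wakeup syscall when the driver has set the
 * need_wakeup flag on the tx ring (which this patch now toggles)
 */
static void kick_tx_if_needed(struct xsk_socket *xsk, struct xsk_ring_prod *tx)
{
	if (xsk_ring_prod__needs_wakeup(tx))
		sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
}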
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 70489d017b51..7c60c64ef10b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1447,9 +1447,9 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
memset(&tuple, 0, sizeof(tuple));
- /* set xsk wake up flag, to do: where to disable */
+ /* clear xsk wake up flag */
if (xsk_uses_need_wakeup(xsk_pool))
- xsk_set_tx_need_wakeup(xsk_pool);
+ xsk_clear_tx_need_wakeup(xsk_pool);
while (budget-- > 0) {
unsigned int truesize = 0;
@@ -1539,12 +1539,15 @@ static int veth_poll_tx(struct napi_struct *napi, int budget)
if (pool)
done = veth_xsk_tx_xmit(sq, pool, budget);
- rcu_read_unlock();
-
if (done < budget) {
+ /* set xsk wake up flag */
+ if (xsk_uses_need_wakeup(pool))
+ xsk_set_tx_need_wakeup(pool);
+
/* if done < budget, the tx ring has no more buffers */
napi_complete_done(napi, done);
}
+ rcu_read_unlock();
return done;
}
--
2.20.1