* [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA
@ 2025-07-17 15:28 Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 1/5] gve: deduplicate xdp info and xsk pool registration logic Jeroen de Borst
` (5 more replies)
0 siblings, 6 replies; 12+ messages in thread
From: Jeroen de Borst @ 2025-07-17 15:28 UTC (permalink / raw)
To: netdev
Cc: hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington
From: Joshua Washington <joshwash@google.com>
This patch series adds support for AF_XDP zero-copy in the DQO RDA queue
format.
XSK infrastructure is updated to re-post buffers when adding XSK pools,
because XSK umem will be posted directly to the NIC, a departure from
the bounce-buffer model used in GQI QPL. A registry of XSK pools is
introduced to prevent XSK pools from being used while the driver is in
copy mode.
---
v2:
- Remove instance of unused variable
v1: https://lore.kernel.org/netdev/20250714160451.124671-1-jeroendb@google.com/
Joshua Washington (5):
gve: deduplicate xdp info and xsk pool registration logic
gve: merge xdp and xsk registration
gve: keep registry of zc xsk pools in netdev_priv
gve: implement DQO TX datapath for AF_XDP zero-copy
gve: implement DQO RX datapath and control path for AF_XDP zero-copy
drivers/net/ethernet/google/gve/gve.h | 24 +-
.../ethernet/google/gve/gve_buffer_mgmt_dqo.c | 24 +-
drivers/net/ethernet/google/gve/gve_dqo.h | 1 +
drivers/net/ethernet/google/gve/gve_main.c | 233 +++++++++++-------
drivers/net/ethernet/google/gve/gve_rx_dqo.c | 94 ++++++-
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 148 +++++++++++
6 files changed, 423 insertions(+), 101 deletions(-)
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v2 1/5] gve: deduplicate xdp info and xsk pool registration logic
2025-07-17 15:28 [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA Jeroen de Borst
@ 2025-07-17 15:28 ` Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 2/5] gve: merge xdp and xsk registration Jeroen de Borst
` (4 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Jeroen de Borst @ 2025-07-17 15:28 UTC (permalink / raw)
To: netdev
Cc: hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington, Jeroen de Borst
From: Joshua Washington <joshwash@google.com>
The XDP registration path currently has a lot of duplicated logic, which
makes changes to these codepaths unnecessarily complex. gve_reg_xsk_pool
extracts the logic of registering an XSK pool with a queue into a helper
that can be used by both XDP_SETUP_XSK_POOL and gve_reg_xdp_info.
gve_unreg_xdp_info is used to undo XDP info registration in the error
path instead of explicitly unregistering the XDP info, as it is more
complete and idempotent.
This patch will be followed by further changes to the XDP registration
logic, which these common helpers will simplify.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
---
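As a rough, self-contained sketch of the structure this patch moves toward
(illustrative userspace code only; reg_xsk_pool(), unreg_xsk_pool(),
reg_xdp_info() and xsk_pool_enable() are stand-ins, not the driver's
functions, and the pool-exists check is omitted): both the ring-setup path
and the XDP_SETUP_XSK_POOL path call one shared register helper, and the
error path relies on an idempotent teardown.

#include <stdio.h>

static int  reg_xsk_pool(int qid)   { printf("reg pool on q%d\n", qid);   return 0; }
static void unreg_xsk_pool(int qid) { printf("unreg pool on q%d\n", qid); }

/* Ring-setup path: registers per-queue pools, unwinds everything on error. */
static int reg_xdp_info(int nqueues)
{
    int q;

    for (q = 0; q < nqueues; q++)
        if (reg_xsk_pool(q))
            goto err;
    return 0;
err:
    for (q = 0; q < nqueues; q++)   /* idempotent: safe to tear down all queues */
        unreg_xsk_pool(q);
    return -1;
}

/* XDP_SETUP_XSK_POOL path: reuses the same helper for a single queue. */
static int xsk_pool_enable(int qid)
{
    return reg_xsk_pool(qid);
}

int main(void)
{
    if (reg_xdp_info(2))
        return 1;
    return xsk_pool_enable(1) ? 1 : 0;
}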
drivers/net/ethernet/google/gve/gve_main.c | 169 ++++++++++-----------
1 file changed, 83 insertions(+), 86 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index be461751ff31..5aca3145e6ab 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1158,13 +1158,75 @@ static int gve_reset_recovery(struct gve_priv *priv, bool was_up);
static void gve_turndown(struct gve_priv *priv);
static void gve_turnup(struct gve_priv *priv);
+static void gve_unreg_xsk_pool(struct gve_priv *priv, u16 qid)
+{
+ struct gve_rx_ring *rx;
+
+ if (!priv->rx)
+ return;
+
+ rx = &priv->rx[qid];
+ rx->xsk_pool = NULL;
+ if (xdp_rxq_info_is_reg(&rx->xsk_rxq))
+ xdp_rxq_info_unreg(&rx->xsk_rxq);
+
+ if (!priv->tx)
+ return;
+ priv->tx[gve_xdp_tx_queue_id(priv, qid)].xsk_pool = NULL;
+}
+
+static int gve_reg_xsk_pool(struct gve_priv *priv, struct net_device *dev,
+ struct xsk_buff_pool *pool, u16 qid)
+{
+ struct napi_struct *napi;
+ struct gve_rx_ring *rx;
+ u16 tx_qid;
+ int err;
+
+ rx = &priv->rx[qid];
+ napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
+ err = xdp_rxq_info_reg(&rx->xsk_rxq, dev, qid, napi->napi_id);
+ if (err)
+ return err;
+
+ err = xdp_rxq_info_reg_mem_model(&rx->xsk_rxq,
+ MEM_TYPE_XSK_BUFF_POOL, pool);
+ if (err) {
+ gve_unreg_xsk_pool(priv, qid);
+ return err;
+ }
+
+ rx->xsk_pool = pool;
+
+ tx_qid = gve_xdp_tx_queue_id(priv, qid);
+ priv->tx[tx_qid].xsk_pool = pool;
+
+ return 0;
+}
+
+static void gve_unreg_xdp_info(struct gve_priv *priv)
+{
+ int i;
+
+ if (!priv->tx_cfg.num_xdp_queues || !priv->rx)
+ return;
+
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ struct gve_rx_ring *rx = &priv->rx[i];
+
+ if (xdp_rxq_info_is_reg(&rx->xdp_rxq))
+ xdp_rxq_info_unreg(&rx->xdp_rxq);
+
+ gve_unreg_xsk_pool(priv, i);
+ }
+}
+
static int gve_reg_xdp_info(struct gve_priv *priv, struct net_device *dev)
{
struct napi_struct *napi;
struct gve_rx_ring *rx;
int err = 0;
- int i, j;
- u32 tx_qid;
+ int i;
if (!priv->tx_cfg.num_xdp_queues)
return 0;
@@ -1188,59 +1250,20 @@ static int gve_reg_xdp_info(struct gve_priv *priv, struct net_device *dev)
if (err)
goto err;
rx->xsk_pool = xsk_get_pool_from_qid(dev, i);
- if (rx->xsk_pool) {
- err = xdp_rxq_info_reg(&rx->xsk_rxq, dev, i,
- napi->napi_id);
- if (err)
- goto err;
- err = xdp_rxq_info_reg_mem_model(&rx->xsk_rxq,
- MEM_TYPE_XSK_BUFF_POOL, NULL);
- if (err)
- goto err;
- xsk_pool_set_rxq_info(rx->xsk_pool,
- &rx->xsk_rxq);
- }
- }
+ if (!rx->xsk_pool)
+ continue;
- for (i = 0; i < priv->tx_cfg.num_xdp_queues; i++) {
- tx_qid = gve_xdp_tx_queue_id(priv, i);
- priv->tx[tx_qid].xsk_pool = xsk_get_pool_from_qid(dev, i);
+ err = gve_reg_xsk_pool(priv, dev, rx->xsk_pool, i);
+ if (err)
+ goto err;
}
return 0;
err:
- for (j = i; j >= 0; j--) {
- rx = &priv->rx[j];
- if (xdp_rxq_info_is_reg(&rx->xdp_rxq))
- xdp_rxq_info_unreg(&rx->xdp_rxq);
- if (xdp_rxq_info_is_reg(&rx->xsk_rxq))
- xdp_rxq_info_unreg(&rx->xsk_rxq);
- }
+ gve_unreg_xdp_info(priv);
return err;
}
-static void gve_unreg_xdp_info(struct gve_priv *priv)
-{
- int i, tx_qid;
-
- if (!priv->tx_cfg.num_xdp_queues || !priv->rx || !priv->tx)
- return;
-
- for (i = 0; i < priv->rx_cfg.num_queues; i++) {
- struct gve_rx_ring *rx = &priv->rx[i];
-
- xdp_rxq_info_unreg(&rx->xdp_rxq);
- if (rx->xsk_pool) {
- xdp_rxq_info_unreg(&rx->xsk_rxq);
- rx->xsk_pool = NULL;
- }
- }
-
- for (i = 0; i < priv->tx_cfg.num_xdp_queues; i++) {
- tx_qid = gve_xdp_tx_queue_id(priv, i);
- priv->tx[tx_qid].xsk_pool = NULL;
- }
-}
static void gve_drain_page_cache(struct gve_priv *priv)
{
@@ -1555,9 +1578,6 @@ static int gve_xsk_pool_enable(struct net_device *dev,
u16 qid)
{
struct gve_priv *priv = netdev_priv(dev);
- struct napi_struct *napi;
- struct gve_rx_ring *rx;
- int tx_qid;
int err;
if (qid >= priv->rx_cfg.num_queues) {
@@ -1579,30 +1599,12 @@ static int gve_xsk_pool_enable(struct net_device *dev,
if (!priv->xdp_prog || !netif_running(dev))
return 0;
- rx = &priv->rx[qid];
- napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
- err = xdp_rxq_info_reg(&rx->xsk_rxq, dev, qid, napi->napi_id);
+ err = gve_reg_xsk_pool(priv, dev, pool, qid);
if (err)
- goto err;
-
- err = xdp_rxq_info_reg_mem_model(&rx->xsk_rxq,
- MEM_TYPE_XSK_BUFF_POOL, NULL);
- if (err)
- goto err;
-
- xsk_pool_set_rxq_info(pool, &rx->xsk_rxq);
- rx->xsk_pool = pool;
-
- tx_qid = gve_xdp_tx_queue_id(priv, qid);
- priv->tx[tx_qid].xsk_pool = pool;
+ xsk_pool_dma_unmap(pool,
+ DMA_ATTR_SKIP_CPU_SYNC |
+ DMA_ATTR_WEAK_ORDERING);
- return 0;
-err:
- if (xdp_rxq_info_is_reg(&rx->xsk_rxq))
- xdp_rxq_info_unreg(&rx->xsk_rxq);
-
- xsk_pool_dma_unmap(pool,
- DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
return err;
}
@@ -1615,17 +1617,17 @@ static int gve_xsk_pool_disable(struct net_device *dev,
struct xsk_buff_pool *pool;
int tx_qid;
- pool = xsk_get_pool_from_qid(dev, qid);
- if (!pool)
- return -EINVAL;
if (qid >= priv->rx_cfg.num_queues)
return -EINVAL;
- /* If XDP prog is not installed or interface is down, unmap DMA and
- * return.
- */
- if (!priv->xdp_prog || !netif_running(dev))
- goto done;
+ pool = xsk_get_pool_from_qid(dev, qid);
+ if (pool)
+ xsk_pool_dma_unmap(pool,
+ DMA_ATTR_SKIP_CPU_SYNC |
+ DMA_ATTR_WEAK_ORDERING);
+
+ if (!netif_running(dev) || !priv->tx_cfg.num_xdp_queues)
+ return 0;
napi_rx = &priv->ntfy_blocks[priv->rx[qid].ntfy_id].napi;
napi_disable(napi_rx); /* make sure current rx poll is done */
@@ -1634,9 +1636,7 @@ static int gve_xsk_pool_disable(struct net_device *dev,
napi_tx = &priv->ntfy_blocks[priv->tx[tx_qid].ntfy_id].napi;
napi_disable(napi_tx); /* make sure current tx poll is done */
- priv->rx[qid].xsk_pool = NULL;
- xdp_rxq_info_unreg(&priv->rx[qid].xsk_rxq);
- priv->tx[tx_qid].xsk_pool = NULL;
+ gve_unreg_xsk_pool(priv, qid);
smp_mb(); /* Make sure it is visible to the workers on datapath */
napi_enable(napi_rx);
@@ -1647,9 +1647,6 @@ static int gve_xsk_pool_disable(struct net_device *dev,
if (gve_tx_clean_pending(priv, &priv->tx[tx_qid]))
napi_schedule(napi_tx);
-done:
- xsk_pool_dma_unmap(pool,
- DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
return 0;
}
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v2 2/5] gve: merge xdp and xsk registration
2025-07-17 15:28 [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 1/5] gve: deduplicate xdp info and xsk pool registration logic Jeroen de Borst
@ 2025-07-17 15:28 ` Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 3/5] gve: keep registry of zc xsk pools in netdev_priv Jeroen de Borst
` (3 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Jeroen de Borst @ 2025-07-17 15:28 UTC (permalink / raw)
To: netdev
Cc: hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington, Jeroen de Borst
From: Joshua Washington <joshwash@google.com>
Keeping both xdp_rxq and xsk_rxq is redundant: xdp_rxq can be used in
both the zero-copy and the copy-mode cases. XSK pool memory model
registration is prioritized over normal memory model registration to
ensure that memory model registration happens only once per queue.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
---
v2: Remove unused napi_struct pointer
---
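A minimal sketch of that registration priority, with made-up names (the
real logic is in gve_reg_xdp_info() below): the XSK pool memory model wins
when a pool is present, otherwise the queue falls back to the page-shared
or page-pool model.

#include <stdio.h>

enum mem_model { XSK_BUFF_POOL, PAGE_SHARED, PAGE_POOL };

static enum mem_model pick_mem_model(int has_xsk_pool, int is_qpl)
{
    if (has_xsk_pool)       /* zero-copy: register the XSK pool's memory model */
        return XSK_BUFF_POOL;
    if (is_qpl)             /* GQI QPL: shared driver pages */
        return PAGE_SHARED;
    return PAGE_POOL;       /* DQO RDA copy mode: page pool */
}

int main(void)
{
    printf("%d %d %d\n", pick_mem_model(1, 0), pick_mem_model(0, 1),
           pick_mem_model(0, 0));
    return 0;
}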
drivers/net/ethernet/google/gve/gve.h | 1 -
drivers/net/ethernet/google/gve/gve_main.c | 27 ++++++++--------------
2 files changed, 10 insertions(+), 18 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 53899096e89e..b2be3fca4125 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -331,7 +331,6 @@ struct gve_rx_ring {
/* XDP stuff */
struct xdp_rxq_info xdp_rxq;
- struct xdp_rxq_info xsk_rxq;
struct xsk_buff_pool *xsk_pool;
struct page_frag_cache page_cache; /* Page cache to allocate XDP frames */
};
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 5aca3145e6ab..cf8e1abdfa8e 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1167,8 +1167,8 @@ static void gve_unreg_xsk_pool(struct gve_priv *priv, u16 qid)
rx = &priv->rx[qid];
rx->xsk_pool = NULL;
- if (xdp_rxq_info_is_reg(&rx->xsk_rxq))
- xdp_rxq_info_unreg(&rx->xsk_rxq);
+ if (xdp_rxq_info_is_reg(&rx->xdp_rxq))
+ xdp_rxq_info_unreg_mem_model(&rx->xdp_rxq);
if (!priv->tx)
return;
@@ -1178,18 +1178,12 @@ static void gve_unreg_xsk_pool(struct gve_priv *priv, u16 qid)
static int gve_reg_xsk_pool(struct gve_priv *priv, struct net_device *dev,
struct xsk_buff_pool *pool, u16 qid)
{
- struct napi_struct *napi;
struct gve_rx_ring *rx;
u16 tx_qid;
int err;
rx = &priv->rx[qid];
- napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
- err = xdp_rxq_info_reg(&rx->xsk_rxq, dev, qid, napi->napi_id);
- if (err)
- return err;
-
- err = xdp_rxq_info_reg_mem_model(&rx->xsk_rxq,
+ err = xdp_rxq_info_reg_mem_model(&rx->xdp_rxq,
MEM_TYPE_XSK_BUFF_POOL, pool);
if (err) {
gve_unreg_xsk_pool(priv, qid);
@@ -1232,6 +1226,8 @@ static int gve_reg_xdp_info(struct gve_priv *priv, struct net_device *dev)
return 0;
for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ struct xsk_buff_pool *xsk_pool;
+
rx = &priv->rx[i];
napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
@@ -1239,7 +1235,11 @@ static int gve_reg_xdp_info(struct gve_priv *priv, struct net_device *dev)
napi->napi_id);
if (err)
goto err;
- if (gve_is_qpl(priv))
+
+ xsk_pool = xsk_get_pool_from_qid(dev, i);
+ if (xsk_pool)
+ err = gve_reg_xsk_pool(priv, dev, xsk_pool, i);
+ else if (gve_is_qpl(priv))
err = xdp_rxq_info_reg_mem_model(&rx->xdp_rxq,
MEM_TYPE_PAGE_SHARED,
NULL);
@@ -1249,13 +1249,6 @@ static int gve_reg_xdp_info(struct gve_priv *priv, struct net_device *dev)
rx->dqo.page_pool);
if (err)
goto err;
- rx->xsk_pool = xsk_get_pool_from_qid(dev, i);
- if (!rx->xsk_pool)
- continue;
-
- err = gve_reg_xsk_pool(priv, dev, rx->xsk_pool, i);
- if (err)
- goto err;
}
return 0;
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v2 3/5] gve: keep registry of zc xsk pools in netdev_priv
2025-07-17 15:28 [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 1/5] gve: deduplicate xdp info and xsk pool registration logic Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 2/5] gve: merge xdp and xsk registration Jeroen de Borst
@ 2025-07-17 15:28 ` Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy Jeroen de Borst
` (2 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Jeroen de Borst @ 2025-07-17 15:28 UTC (permalink / raw)
To: netdev
Cc: hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington, Jeroen de Borst
From: Joshua Washington <joshwash@google.com>
Relying on xsk_get_pool_from_qid to determine whether zero-copy is
enabled on a queue is erroneous, as an XSK pool is registered in
xp_assign_dev whether or not AF_XDP zero-copy is enabled. This becomes
problematic when queues are restarted in copy mode, as all RX queues
with XSKs will register a pool, causing the driver to exercise the
zero-copy codepath.
This patch adds a bitmap to keep track of which queues have zero-copy
enabled.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
---
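A toy userspace model of the gating this patch adds (illustrative only;
get_zc_pool() mirrors the new gve_get_xsk_pool(), and zc_bitmap stands in
for priv->xsk_pools): a pool registered by the core in copy mode is
ignored unless the queue's zero-copy bit has been set.

#include <stdio.h>

#define MAX_QUEUES    64
#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long zc_bitmap[(MAX_QUEUES + BITS_PER_LONG - 1) / BITS_PER_LONG];
static void *pool_from_qid[MAX_QUEUES];  /* stands in for xsk_get_pool_from_qid() */

static void set_zc(unsigned int q)   { zc_bitmap[q / BITS_PER_LONG] |=  1UL << (q % BITS_PER_LONG); }
static void clear_zc(unsigned int q) { zc_bitmap[q / BITS_PER_LONG] &= ~(1UL << (q % BITS_PER_LONG)); }
static int  test_zc(unsigned int q)  { return !!(zc_bitmap[q / BITS_PER_LONG] & (1UL << (q % BITS_PER_LONG))); }

/* Mirrors gve_get_xsk_pool(): only report a pool if zero-copy was enabled. */
static void *get_zc_pool(unsigned int q)
{
    return test_zc(q) ? pool_from_qid[q] : NULL;
}

int main(void)
{
    static int dummy_pool;

    pool_from_qid[3] = &dummy_pool;  /* pool exists even in copy mode (xp_assign_dev) */
    printf("zc? %s\n", get_zc_pool(3) ? "yes" : "no");  /* no: stay on the copy path   */
    set_zc(3);                       /* XDP_SETUP_XSK_POOL enabled zero-copy */
    printf("zc? %s\n", get_zc_pool(3) ? "yes" : "no");  /* yes: take the zero-copy path */
    clear_zc(3);
    return 0;
}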
drivers/net/ethernet/google/gve/gve.h | 1 +
drivers/net/ethernet/google/gve/gve_main.c | 37 +++++++++++++++++++---
2 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index b2be3fca4125..9925c08e595e 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -802,6 +802,7 @@ struct gve_priv {
struct gve_tx_queue_config tx_cfg;
struct gve_rx_queue_config rx_cfg;
+ unsigned long *xsk_pools; /* bitmap of RX queues with XSK pools */
u32 num_ntfy_blks; /* split between TX and RX so must be even */
int numa_node;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index cf8e1abdfa8e..d5953f5d1895 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -4,6 +4,7 @@
* Copyright (C) 2015-2024 Google LLC
*/
+#include <linux/bitmap.h>
#include <linux/bpf.h>
#include <linux/cpumask.h>
#include <linux/etherdevice.h>
@@ -1215,6 +1216,14 @@ static void gve_unreg_xdp_info(struct gve_priv *priv)
}
}
+static struct xsk_buff_pool *gve_get_xsk_pool(struct gve_priv *priv, int qid)
+{
+ if (!test_bit(qid, priv->xsk_pools))
+ return NULL;
+
+ return xsk_get_pool_from_qid(priv->dev, qid);
+}
+
static int gve_reg_xdp_info(struct gve_priv *priv, struct net_device *dev)
{
struct napi_struct *napi;
@@ -1236,7 +1245,7 @@ static int gve_reg_xdp_info(struct gve_priv *priv, struct net_device *dev)
if (err)
goto err;
- xsk_pool = xsk_get_pool_from_qid(dev, i);
+ xsk_pool = gve_get_xsk_pool(priv, i);
if (xsk_pool)
err = gve_reg_xsk_pool(priv, dev, xsk_pool, i);
else if (gve_is_qpl(priv))
@@ -1588,15 +1597,19 @@ static int gve_xsk_pool_enable(struct net_device *dev,
if (err)
return err;
+ set_bit(qid, priv->xsk_pools);
+
/* If XDP prog is not installed or interface is down, return. */
if (!priv->xdp_prog || !netif_running(dev))
return 0;
err = gve_reg_xsk_pool(priv, dev, pool, qid);
- if (err)
+ if (err) {
+ clear_bit(qid, priv->xsk_pools);
xsk_pool_dma_unmap(pool,
DMA_ATTR_SKIP_CPU_SYNC |
DMA_ATTR_WEAK_ORDERING);
+ }
return err;
}
@@ -1613,6 +1626,8 @@ static int gve_xsk_pool_disable(struct net_device *dev,
if (qid >= priv->rx_cfg.num_queues)
return -EINVAL;
+ clear_bit(qid, priv->xsk_pools);
+
pool = xsk_get_pool_from_qid(dev, qid);
if (pool)
xsk_pool_dma_unmap(pool,
@@ -2360,10 +2375,22 @@ static int gve_init_priv(struct gve_priv *priv, bool skip_describe_device)
priv->ts_config.rx_filter = HWTSTAMP_FILTER_NONE;
setup_device:
+ priv->xsk_pools = bitmap_zalloc(priv->rx_cfg.max_queues, GFP_KERNEL);
+ if (!priv->xsk_pools) {
+ err = -ENOMEM;
+ goto err;
+ }
+
gve_set_netdev_xdp_features(priv);
err = gve_setup_device_resources(priv);
- if (!err)
- return 0;
+ if (err)
+ goto err_free_xsk_bitmap;
+
+ return 0;
+
+err_free_xsk_bitmap:
+ bitmap_free(priv->xsk_pools);
+ priv->xsk_pools = NULL;
err:
gve_adminq_free(&priv->pdev->dev, priv);
return err;
@@ -2373,6 +2400,8 @@ static void gve_teardown_priv_resources(struct gve_priv *priv)
{
gve_teardown_device_resources(priv);
gve_adminq_free(&priv->pdev->dev, priv);
+ bitmap_free(priv->xsk_pools);
+ priv->xsk_pools = NULL;
}
static void gve_trigger_reset(struct gve_priv *priv)
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy
2025-07-17 15:28 [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA Jeroen de Borst
` (2 preceding siblings ...)
2025-07-17 15:28 ` [PATCH net-next v2 3/5] gve: keep registry of zc xsk pools in netdev_priv Jeroen de Borst
@ 2025-07-17 15:28 ` Jeroen de Borst
2025-07-21 23:37 ` Jason Xing
2025-07-17 15:28 ` [PATCH net-next v2 5/5] gve: implement DQO RX datapath and control path " Jeroen de Borst
2025-07-22 9:50 ` [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA patchwork-bot+netdevbpf
5 siblings, 1 reply; 12+ messages in thread
From: Jeroen de Borst @ 2025-07-17 15:28 UTC (permalink / raw)
To: netdev
Cc: hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington, Praveen Kaligineedi, Jeroen de Borst
From: Joshua Washington <joshwash@google.com>
In the descriptor clean path, a number of changes need to be made to
accommodate out of order completions and double completions.
The XSK stack can only handle completions being processed in order, as a
single counter is incremented in xsk_tx_completed to signify how many XSK
descriptors have been completed. Because completions can come back out
of order in DQ, a separate queue of XSK descriptors must be maintained.
This queue keeps the pending packets in the order that they were written
so that the descriptors can be counted in xsk_tx_completed in the same
order.
For double completions, a new pending packet state and type are
introduced. The new type, GVE_TX_PENDING_PACKET_DQO_XSK, plays an
anlogous role to pre-existing _SKB and _XDP_FRAME pending packet types
for XSK descriptors. The new state, GVE_PACKET_STATE_XSK_COMPLETE,
represents packets for which no more completions are expected. This
includes packets which have received a packet completion or reinjection
completion, as well as packets whose reinjection completion timers have
timed out. At this point, such packets can be counted as part of
xsk_tx_completed() and freed.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
---
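A minimal, self-contained userspace model of the reorder queue described
above (illustrative only; sizes, names and the main() driver are made up,
the driver's actual implementation is in the diff below): tags are pushed
in transmit order, completions may arrive out of order and only mark
entries complete, and the consumer pops strictly from the head so
completed packets are counted in order, as xsk_tx_completed() requires.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define RING_SZ   8                 /* power of two, like complq_mask + 1 */
#define RING_MASK (RING_SZ - 1)

struct pending_pkt { bool complete; };

static struct pending_pkt pkts[RING_SZ];     /* indexed by completion tag */
static unsigned short ring[RING_SZ];         /* tags in transmit order */
static atomic_uint tail;                     /* producer side (TX path) */
static unsigned int head, cached_tail;       /* consumer side (clean path) */

static void push_tag(unsigned short tag)     /* TX path, under the TX lock */
{
    unsigned int t = atomic_load_explicit(&tail, memory_order_relaxed);

    ring[t] = tag;
    atomic_store_explicit(&tail, (t + 1) & RING_MASK, memory_order_release);
}

static unsigned int reap_completed(void)     /* clean path */
{
    unsigned int done = 0;

    while (true) {
        if (head == cached_tail) {
            cached_tail = atomic_load_explicit(&tail, memory_order_acquire);
            if (head == cached_tail)
                break;                       /* nothing pending */
        }
        if (!pkts[ring[head]].complete)
            break;                           /* head not done yet: stay in order */
        head = (head + 1) & RING_MASK;
        done++;                              /* would feed xsk_tx_completed() */
    }
    return done;
}

int main(void)
{
    push_tag(0); push_tag(1); push_tag(2);
    pkts[1].complete = true;                 /* completion arrives out of order */
    printf("%u\n", reap_completed());        /* 0: tag 0 is still pending */
    pkts[0].complete = true;
    pkts[2].complete = true;
    printf("%u\n", reap_completed());        /* 3: now reported in order */
    return 0;
}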
drivers/net/ethernet/google/gve/gve.h | 19 ++-
drivers/net/ethernet/google/gve/gve_dqo.h | 1 +
drivers/net/ethernet/google/gve/gve_main.c | 6 +
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 148 +++++++++++++++++++
4 files changed, 171 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 9925c08e595e..ff7dc06e7fa4 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -399,11 +399,17 @@ enum gve_packet_state {
GVE_PACKET_STATE_PENDING_REINJECT_COMPL,
/* No valid completion received within the specified timeout. */
GVE_PACKET_STATE_TIMED_OUT_COMPL,
+ /* XSK pending packet has received a packet/reinjection completion, or
+ * has timed out. At this point, the pending packet can be counted by
+ * xsk_tx_complete and freed.
+ */
+ GVE_PACKET_STATE_XSK_COMPLETE,
};
enum gve_tx_pending_packet_dqo_type {
GVE_TX_PENDING_PACKET_DQO_SKB,
- GVE_TX_PENDING_PACKET_DQO_XDP_FRAME
+ GVE_TX_PENDING_PACKET_DQO_XDP_FRAME,
+ GVE_TX_PENDING_PACKET_DQO_XSK,
};
struct gve_tx_pending_packet_dqo {
@@ -440,10 +446,10 @@ struct gve_tx_pending_packet_dqo {
/* Identifies the current state of the packet as defined in
* `enum gve_packet_state`.
*/
- u8 state : 2;
+ u8 state : 3;
/* gve_tx_pending_packet_dqo_type */
- u8 type : 1;
+ u8 type : 2;
/* If packet is an outstanding miss completion, then the packet is
* freed if the corresponding re-injection completion is not received
@@ -512,6 +518,8 @@ struct gve_tx_ring {
/* Cached value of `dqo_compl.free_tx_qpl_buf_cnt` */
u32 free_tx_qpl_buf_cnt;
};
+
+ atomic_t xsk_reorder_queue_tail;
} dqo_tx;
};
@@ -545,6 +553,9 @@ struct gve_tx_ring {
/* Last TX ring index fetched by HW */
atomic_t hw_tx_head;
+ u16 xsk_reorder_queue_head;
+ u16 xsk_reorder_queue_tail;
+
/* List to track pending packets which received a miss
* completion but not a corresponding reinjection.
*/
@@ -598,6 +609,8 @@ struct gve_tx_ring {
struct gve_tx_pending_packet_dqo *pending_packets;
s16 num_pending_packets;
+ u16 *xsk_reorder_queue;
+
u32 complq_mask; /* complq size is complq_mask + 1 */
/* QPL fields */
diff --git a/drivers/net/ethernet/google/gve/gve_dqo.h b/drivers/net/ethernet/google/gve/gve_dqo.h
index bb278727f4d9..6eb442096e02 100644
--- a/drivers/net/ethernet/google/gve/gve_dqo.h
+++ b/drivers/net/ethernet/google/gve/gve_dqo.h
@@ -38,6 +38,7 @@ netdev_features_t gve_features_check_dqo(struct sk_buff *skb,
netdev_features_t features);
bool gve_tx_poll_dqo(struct gve_notify_block *block, bool do_clean);
bool gve_xdp_poll_dqo(struct gve_notify_block *block);
+bool gve_xsk_tx_poll_dqo(struct gve_notify_block *block, int budget);
int gve_rx_poll_dqo(struct gve_notify_block *block, int budget);
int gve_tx_alloc_rings_dqo(struct gve_priv *priv,
struct gve_tx_alloc_rings_cfg *cfg);
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index d5953f5d1895..c6ccc0bb40c9 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -427,6 +427,12 @@ int gve_napi_poll_dqo(struct napi_struct *napi, int budget)
if (block->rx) {
work_done = gve_rx_poll_dqo(block, budget);
+
+ /* Poll XSK TX as part of RX NAPI. Setup re-poll based on if
+ * either datapath has more work to do.
+ */
+ if (priv->xdp_prog)
+ reschedule |= gve_xsk_tx_poll_dqo(block, budget);
reschedule |= work_done == budget;
}
diff --git a/drivers/net/ethernet/google/gve/gve_tx_dqo.c b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
index ce5370b741ec..6f1d515673d2 100644
--- a/drivers/net/ethernet/google/gve/gve_tx_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
@@ -13,6 +13,7 @@
#include <linux/tcp.h>
#include <linux/slab.h>
#include <linux/skbuff.h>
+#include <net/xdp_sock_drv.h>
/* Returns true if tx_bufs are available. */
static bool gve_has_free_tx_qpl_bufs(struct gve_tx_ring *tx, int count)
@@ -241,6 +242,9 @@ static void gve_tx_free_ring_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
tx->dqo.tx_ring = NULL;
}
+ kvfree(tx->dqo.xsk_reorder_queue);
+ tx->dqo.xsk_reorder_queue = NULL;
+
kvfree(tx->dqo.pending_packets);
tx->dqo.pending_packets = NULL;
@@ -345,6 +349,17 @@ static int gve_tx_alloc_ring_dqo(struct gve_priv *priv,
tx->dqo.pending_packets[tx->dqo.num_pending_packets - 1].next = -1;
atomic_set_release(&tx->dqo_compl.free_pending_packets, -1);
+
+ /* Only alloc xsk pool for XDP queues */
+ if (idx >= cfg->qcfg->num_queues && cfg->num_xdp_rings) {
+ tx->dqo.xsk_reorder_queue =
+ kvcalloc(tx->dqo.complq_mask + 1,
+ sizeof(tx->dqo.xsk_reorder_queue[0]),
+ GFP_KERNEL);
+ if (!tx->dqo.xsk_reorder_queue)
+ goto err;
+ }
+
tx->dqo_compl.miss_completions.head = -1;
tx->dqo_compl.miss_completions.tail = -1;
tx->dqo_compl.timed_out_completions.head = -1;
@@ -992,6 +1007,38 @@ static int gve_try_tx_skb(struct gve_priv *priv, struct gve_tx_ring *tx,
return 0;
}
+static void gve_xsk_reorder_queue_push_dqo(struct gve_tx_ring *tx,
+ u16 completion_tag)
+{
+ u32 tail = atomic_read(&tx->dqo_tx.xsk_reorder_queue_tail);
+
+ tx->dqo.xsk_reorder_queue[tail] = completion_tag;
+ tail = (tail + 1) & tx->dqo.complq_mask;
+ atomic_set_release(&tx->dqo_tx.xsk_reorder_queue_tail, tail);
+}
+
+static struct gve_tx_pending_packet_dqo *
+gve_xsk_reorder_queue_head(struct gve_tx_ring *tx)
+{
+ u32 head = tx->dqo_compl.xsk_reorder_queue_head;
+
+ if (head == tx->dqo_compl.xsk_reorder_queue_tail) {
+ tx->dqo_compl.xsk_reorder_queue_tail =
+ atomic_read_acquire(&tx->dqo_tx.xsk_reorder_queue_tail);
+
+ if (head == tx->dqo_compl.xsk_reorder_queue_tail)
+ return NULL;
+ }
+
+ return &tx->dqo.pending_packets[tx->dqo.xsk_reorder_queue[head]];
+}
+
+static void gve_xsk_reorder_queue_pop_dqo(struct gve_tx_ring *tx)
+{
+ tx->dqo_compl.xsk_reorder_queue_head++;
+ tx->dqo_compl.xsk_reorder_queue_head &= tx->dqo.complq_mask;
+}
+
/* Transmit a given skb and ring the doorbell. */
netdev_tx_t gve_tx_dqo(struct sk_buff *skb, struct net_device *dev)
{
@@ -1015,6 +1062,62 @@ netdev_tx_t gve_tx_dqo(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_OK;
}
+static bool gve_xsk_tx_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
+ int budget)
+{
+ struct xsk_buff_pool *pool = tx->xsk_pool;
+ struct xdp_desc desc;
+ bool repoll = false;
+ int sent = 0;
+
+ spin_lock(&tx->dqo_tx.xdp_lock);
+ for (; sent < budget; sent++) {
+ struct gve_tx_pending_packet_dqo *pkt;
+ s16 completion_tag;
+ dma_addr_t addr;
+ u32 desc_idx;
+
+ if (unlikely(!gve_has_avail_slots_tx_dqo(tx, 1, 1))) {
+ repoll = true;
+ break;
+ }
+
+ if (!xsk_tx_peek_desc(pool, &desc))
+ break;
+
+ pkt = gve_alloc_pending_packet(tx);
+ pkt->type = GVE_TX_PENDING_PACKET_DQO_XSK;
+ pkt->num_bufs = 0;
+ completion_tag = pkt - tx->dqo.pending_packets;
+
+ addr = xsk_buff_raw_get_dma(pool, desc.addr);
+ xsk_buff_raw_dma_sync_for_device(pool, addr, desc.len);
+
+ desc_idx = tx->dqo_tx.tail;
+ gve_tx_fill_pkt_desc_dqo(tx, &desc_idx,
+ true, desc.len,
+ addr, completion_tag, true,
+ false);
+ ++pkt->num_bufs;
+ gve_tx_update_tail(tx, desc_idx);
+ tx->dqo_tx.posted_packet_desc_cnt += pkt->num_bufs;
+ gve_xsk_reorder_queue_push_dqo(tx, completion_tag);
+ }
+
+ if (sent) {
+ gve_tx_put_doorbell_dqo(priv, tx->q_resources, tx->dqo_tx.tail);
+ xsk_tx_release(pool);
+ }
+
+ spin_unlock(&tx->dqo_tx.xdp_lock);
+
+ u64_stats_update_begin(&tx->statss);
+ tx->xdp_xsk_sent += sent;
+ u64_stats_update_end(&tx->statss);
+
+ return (sent == budget) || repoll;
+}
+
static void add_to_list(struct gve_tx_ring *tx, struct gve_index_list *list,
struct gve_tx_pending_packet_dqo *pending_packet)
{
@@ -1152,6 +1255,9 @@ static void gve_handle_packet_completion(struct gve_priv *priv,
pending_packet->xdpf = NULL;
gve_free_pending_packet(tx, pending_packet);
break;
+ case GVE_TX_PENDING_PACKET_DQO_XSK:
+ pending_packet->state = GVE_PACKET_STATE_XSK_COMPLETE;
+ break;
default:
WARN_ON_ONCE(1);
}
@@ -1251,8 +1357,34 @@ static void remove_timed_out_completions(struct gve_priv *priv,
remove_from_list(tx, &tx->dqo_compl.timed_out_completions,
pending_packet);
+
+ /* Need to count XSK packets in xsk_tx_completed. */
+ if (pending_packet->type == GVE_TX_PENDING_PACKET_DQO_XSK)
+ pending_packet->state = GVE_PACKET_STATE_XSK_COMPLETE;
+ else
+ gve_free_pending_packet(tx, pending_packet);
+ }
+}
+
+static void gve_tx_process_xsk_completions(struct gve_tx_ring *tx)
+{
+ u32 num_xsks = 0;
+
+ while (true) {
+ struct gve_tx_pending_packet_dqo *pending_packet =
+ gve_xsk_reorder_queue_head(tx);
+
+ if (!pending_packet ||
+ pending_packet->state != GVE_PACKET_STATE_XSK_COMPLETE)
+ break;
+
+ num_xsks++;
+ gve_xsk_reorder_queue_pop_dqo(tx);
gve_free_pending_packet(tx, pending_packet);
}
+
+ if (num_xsks)
+ xsk_tx_completed(tx->xsk_pool, num_xsks);
}
int gve_clean_tx_done_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
@@ -1333,6 +1465,9 @@ int gve_clean_tx_done_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
remove_miss_completions(priv, tx);
remove_timed_out_completions(priv, tx);
+ if (tx->xsk_pool)
+ gve_tx_process_xsk_completions(tx);
+
u64_stats_update_begin(&tx->statss);
tx->bytes_done += pkt_compl_bytes + reinject_compl_bytes;
tx->pkt_done += pkt_compl_pkts + reinject_compl_pkts;
@@ -1365,6 +1500,19 @@ bool gve_tx_poll_dqo(struct gve_notify_block *block, bool do_clean)
return compl_desc->generation != tx->dqo_compl.cur_gen_bit;
}
+bool gve_xsk_tx_poll_dqo(struct gve_notify_block *rx_block, int budget)
+{
+ struct gve_rx_ring *rx = rx_block->rx;
+ struct gve_priv *priv = rx->gve;
+ struct gve_tx_ring *tx;
+
+ tx = &priv->tx[gve_xdp_tx_queue_id(priv, rx->q_num)];
+ if (tx->xsk_pool)
+ return gve_xsk_tx_dqo(priv, tx, budget);
+
+ return 0;
+}
+
bool gve_xdp_poll_dqo(struct gve_notify_block *block)
{
struct gve_tx_compl_desc *compl_desc;
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v2 5/5] gve: implement DQO RX datapath and control path for AF_XDP zero-copy
2025-07-17 15:28 [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA Jeroen de Borst
` (3 preceding siblings ...)
2025-07-17 15:28 ` [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy Jeroen de Borst
@ 2025-07-17 15:28 ` Jeroen de Borst
2025-07-22 9:50 ` [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA patchwork-bot+netdevbpf
5 siblings, 0 replies; 12+ messages in thread
From: Jeroen de Borst @ 2025-07-17 15:28 UTC (permalink / raw)
To: netdev
Cc: hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington, Praveen Kaligineedi, Jeroen de Borst
From: Joshua Washington <joshwash@google.com>
Add the RX datapath for AF_XDP zero-copy for DQO RDA. The RX path is
quite similar to that of the normal XDP case; parallel methods are
introduced to properly handle XSKs instead of normal driver buffers.
To properly support posting from XSKs, queues are destroyed and
recreated, as the driver was previously using page pool buffers rather
than XSK pool memory.
Expose support for AF_XDP zero-copy now that both the TX and RX
datapaths exist.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
---
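A toy model of the enable-path ordering and unwind described above
(illustrative only; the stub functions are stand-ins for the real
gve_xsk_pool_enable() flow below): DMA-map the pool, register it, then
recreate the RDA queues so RX buffers are reposted from the XSK pool,
unwinding in reverse order on failure.

#include <stdbool.h>
#include <stdio.h>

static bool dma_map_pool(void)      { puts("dma map pool");        return true; }
static void dma_unmap_pool(void)    { puts("dma unmap pool"); }
static bool reg_xsk_pool(void)      { puts("register pool");       return true; }
static void unreg_xsk_pool(void)    { puts("unregister pool"); }
static bool reconfigure_rings(void) { puts("recreate RDA queues"); return false; /* pretend failure */ }

static int xsk_pool_enable(void)
{
    if (!dma_map_pool())
        return -1;
    /* mark the queue as zero-copy here (set_bit in the real driver) */
    if (!reg_xsk_pool())
        goto err_mapped;
    if (!reconfigure_rings())       /* repost RX buffers from the XSK pool */
        goto err_registered;
    return 0;

err_registered:
    unreg_xsk_pool();
err_mapped:
    /* clear_bit + DMA unmap in the real driver */
    dma_unmap_pool();
    return -1;
}

int main(void)
{
    return xsk_pool_enable() ? 1 : 0;
}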
drivers/net/ethernet/google/gve/gve.h | 3 +
.../ethernet/google/gve/gve_buffer_mgmt_dqo.c | 24 ++++-
drivers/net/ethernet/google/gve/gve_main.c | 42 +++++++--
drivers/net/ethernet/google/gve/gve_rx_dqo.c | 94 ++++++++++++++++++-
4 files changed, 149 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index ff7dc06e7fa4..bceaf9b05cb4 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -190,6 +190,9 @@ struct gve_rx_buf_state_dqo {
/* The page posted to HW. */
struct gve_rx_slot_page_info page_info;
+ /* XSK buffer */
+ struct xdp_buff *xsk_buff;
+
/* The DMA address corresponding to `page_info`. */
dma_addr_t addr;
diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index 6c3c459a1b5e..8f5021e59e0a 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -4,6 +4,7 @@
* Copyright (C) 2015-2024 Google, Inc.
*/
+#include <net/xdp_sock_drv.h>
#include "gve.h"
#include "gve_utils.h"
@@ -29,6 +30,10 @@ struct gve_rx_buf_state_dqo *gve_alloc_buf_state(struct gve_rx_ring *rx)
/* Point buf_state to itself to mark it as allocated */
buf_state->next = buffer_id;
+ /* Clear the buffer pointers */
+ buf_state->page_info.page = NULL;
+ buf_state->xsk_buff = NULL;
+
return buf_state;
}
@@ -286,7 +291,24 @@ int gve_alloc_buffer(struct gve_rx_ring *rx, struct gve_rx_desc_dqo *desc)
{
struct gve_rx_buf_state_dqo *buf_state;
- if (rx->dqo.page_pool) {
+ if (rx->xsk_pool) {
+ buf_state = gve_alloc_buf_state(rx);
+ if (unlikely(!buf_state))
+ return -ENOMEM;
+
+ buf_state->xsk_buff = xsk_buff_alloc(rx->xsk_pool);
+ if (unlikely(!buf_state->xsk_buff)) {
+ xsk_set_rx_need_wakeup(rx->xsk_pool);
+ gve_free_buf_state(rx, buf_state);
+ return -ENOMEM;
+ }
+ /* Allocated xsk buffer. Clear wakeup in case it was set. */
+ xsk_clear_rx_need_wakeup(rx->xsk_pool);
+ desc->buf_id = cpu_to_le16(buf_state - rx->dqo.buf_states);
+ desc->buf_addr =
+ cpu_to_le64(xsk_buff_xdp_get_dma(buf_state->xsk_buff));
+ return 0;
+ } else if (rx->dqo.page_pool) {
buf_state = gve_alloc_buf_state(rx);
if (WARN_ON_ONCE(!buf_state))
return -ENOMEM;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index c6ccc0bb40c9..6ea306947417 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1610,13 +1610,24 @@ static int gve_xsk_pool_enable(struct net_device *dev,
return 0;
err = gve_reg_xsk_pool(priv, dev, pool, qid);
- if (err) {
- clear_bit(qid, priv->xsk_pools);
- xsk_pool_dma_unmap(pool,
- DMA_ATTR_SKIP_CPU_SYNC |
- DMA_ATTR_WEAK_ORDERING);
+ if (err)
+ goto err_xsk_pool_dma_mapped;
+
+ /* Stop and start RDA queues to repost buffers. */
+ if (!gve_is_qpl(priv)) {
+ err = gve_configure_rings_xdp(priv, priv->rx_cfg.num_queues);
+ if (err)
+ goto err_xsk_pool_registered;
}
+ return 0;
+err_xsk_pool_registered:
+ gve_unreg_xsk_pool(priv, qid);
+err_xsk_pool_dma_mapped:
+ clear_bit(qid, priv->xsk_pools);
+ xsk_pool_dma_unmap(pool,
+ DMA_ATTR_SKIP_CPU_SYNC |
+ DMA_ATTR_WEAK_ORDERING);
return err;
}
@@ -1628,6 +1639,7 @@ static int gve_xsk_pool_disable(struct net_device *dev,
struct napi_struct *napi_tx;
struct xsk_buff_pool *pool;
int tx_qid;
+ int err;
if (qid >= priv->rx_cfg.num_queues)
return -EINVAL;
@@ -1643,6 +1655,13 @@ static int gve_xsk_pool_disable(struct net_device *dev,
if (!netif_running(dev) || !priv->tx_cfg.num_xdp_queues)
return 0;
+ /* Stop and start RDA queues to repost buffers. */
+ if (!gve_is_qpl(priv) && priv->xdp_prog) {
+ err = gve_configure_rings_xdp(priv, priv->rx_cfg.num_queues);
+ if (err)
+ return err;
+ }
+
napi_rx = &priv->ntfy_blocks[priv->rx[qid].ntfy_id].napi;
napi_disable(napi_rx); /* make sure current rx poll is done */
@@ -1654,12 +1673,14 @@ static int gve_xsk_pool_disable(struct net_device *dev,
smp_mb(); /* Make sure it is visible to the workers on datapath */
napi_enable(napi_rx);
- if (gve_rx_work_pending(&priv->rx[qid]))
- napi_schedule(napi_rx);
-
napi_enable(napi_tx);
- if (gve_tx_clean_pending(priv, &priv->tx[tx_qid]))
- napi_schedule(napi_tx);
+ if (gve_is_gqi(priv)) {
+ if (gve_rx_work_pending(&priv->rx[qid]))
+ napi_schedule(napi_rx);
+
+ if (gve_tx_clean_pending(priv, &priv->tx[tx_qid]))
+ napi_schedule(napi_tx);
+ }
return 0;
}
@@ -2286,6 +2307,7 @@ static void gve_set_netdev_xdp_features(struct gve_priv *priv)
} else if (priv->queue_format == GVE_DQO_RDA_FORMAT) {
xdp_features = NETDEV_XDP_ACT_BASIC;
xdp_features |= NETDEV_XDP_ACT_REDIRECT;
+ xdp_features |= NETDEV_XDP_ACT_XSK_ZEROCOPY;
} else {
xdp_features = 0;
}
diff --git a/drivers/net/ethernet/google/gve/gve_rx_dqo.c b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
index afaa822b1227..7380c2b7a2d8 100644
--- a/drivers/net/ethernet/google/gve/gve_rx_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
@@ -16,6 +16,7 @@
#include <net/ip6_checksum.h>
#include <net/ipv6.h>
#include <net/tcp.h>
+#include <net/xdp_sock_drv.h>
static void gve_rx_free_hdr_bufs(struct gve_priv *priv, struct gve_rx_ring *rx)
{
@@ -149,6 +150,10 @@ void gve_rx_free_ring_dqo(struct gve_priv *priv, struct gve_rx_ring *rx,
gve_free_to_page_pool(rx, bs, false);
else
gve_free_qpl_page_dqo(bs);
+ if (gve_buf_state_is_allocated(rx, bs) && bs->xsk_buff) {
+ xsk_buff_free(bs->xsk_buff);
+ bs->xsk_buff = NULL;
+ }
}
if (rx->dqo.qpl) {
@@ -580,8 +585,11 @@ static int gve_xdp_tx_dqo(struct gve_priv *priv, struct gve_rx_ring *rx,
int err;
xdpf = xdp_convert_buff_to_frame(xdp);
- if (unlikely(!xdpf))
+ if (unlikely(!xdpf)) {
+ if (rx->xsk_pool)
+ xsk_buff_free(xdp);
return -ENOSPC;
+ }
tx_qid = gve_xdp_tx_queue_id(priv, rx->q_num);
tx = &priv->tx[tx_qid];
@@ -592,6 +600,41 @@ static int gve_xdp_tx_dqo(struct gve_priv *priv, struct gve_rx_ring *rx,
return err;
}
+static void gve_xsk_done_dqo(struct gve_priv *priv, struct gve_rx_ring *rx,
+ struct xdp_buff *xdp, struct bpf_prog *xprog,
+ int xdp_act)
+{
+ switch (xdp_act) {
+ case XDP_ABORTED:
+ case XDP_DROP:
+ default:
+ xsk_buff_free(xdp);
+ break;
+ case XDP_TX:
+ if (unlikely(gve_xdp_tx_dqo(priv, rx, xdp)))
+ goto err;
+ break;
+ case XDP_REDIRECT:
+ if (unlikely(xdp_do_redirect(priv->dev, xdp, xprog)))
+ goto err;
+ break;
+ }
+
+ u64_stats_update_begin(&rx->statss);
+ if ((u32)xdp_act < GVE_XDP_ACTIONS)
+ rx->xdp_actions[xdp_act]++;
+ u64_stats_update_end(&rx->statss);
+ return;
+
+err:
+ u64_stats_update_begin(&rx->statss);
+ if (xdp_act == XDP_TX)
+ rx->xdp_tx_errors++;
+ if (xdp_act == XDP_REDIRECT)
+ rx->xdp_redirect_errors++;
+ u64_stats_update_end(&rx->statss);
+}
+
static void gve_xdp_done_dqo(struct gve_priv *priv, struct gve_rx_ring *rx,
struct xdp_buff *xdp, struct bpf_prog *xprog,
int xdp_act,
@@ -633,6 +676,48 @@ static void gve_xdp_done_dqo(struct gve_priv *priv, struct gve_rx_ring *rx,
return;
}
+static int gve_rx_xsk_dqo(struct napi_struct *napi, struct gve_rx_ring *rx,
+ struct gve_rx_buf_state_dqo *buf_state, int buf_len,
+ struct bpf_prog *xprog)
+{
+ struct xdp_buff *xdp = buf_state->xsk_buff;
+ struct gve_priv *priv = rx->gve;
+ int xdp_act;
+
+ xdp->data_end = xdp->data + buf_len;
+ xsk_buff_dma_sync_for_cpu(xdp);
+
+ if (xprog) {
+ xdp_act = bpf_prog_run_xdp(xprog, xdp);
+ buf_len = xdp->data_end - xdp->data;
+ if (xdp_act != XDP_PASS) {
+ gve_xsk_done_dqo(priv, rx, xdp, xprog, xdp_act);
+ gve_free_buf_state(rx, buf_state);
+ return 0;
+ }
+ }
+
+ /* Copy the data to skb */
+ rx->ctx.skb_head = gve_rx_copy_data(priv->dev, napi,
+ xdp->data, buf_len);
+ if (unlikely(!rx->ctx.skb_head)) {
+ xsk_buff_free(xdp);
+ gve_free_buf_state(rx, buf_state);
+ return -ENOMEM;
+ }
+ rx->ctx.skb_tail = rx->ctx.skb_head;
+
+ /* Free XSK buffer and Buffer state */
+ xsk_buff_free(xdp);
+ gve_free_buf_state(rx, buf_state);
+
+ /* Update Stats */
+ u64_stats_update_begin(&rx->statss);
+ rx->xdp_actions[XDP_PASS]++;
+ u64_stats_update_end(&rx->statss);
+ return 0;
+}
+
/* Returns 0 if descriptor is completed successfully.
* Returns -EINVAL if descriptor is invalid.
* Returns -ENOMEM if data cannot be copied to skb.
@@ -671,7 +756,11 @@ static int gve_rx_dqo(struct napi_struct *napi, struct gve_rx_ring *rx,
buf_len = compl_desc->packet_len;
hdr_len = compl_desc->header_len;
- /* Page might have not been used for a while and was likely last written
+ xprog = READ_ONCE(priv->xdp_prog);
+ if (buf_state->xsk_buff)
+ return gve_rx_xsk_dqo(napi, rx, buf_state, buf_len, xprog);
+
+ /* Page might have not been used for awhile and was likely last written
* by a different thread.
*/
if (rx->dqo.page_pool) {
@@ -721,7 +810,6 @@ static int gve_rx_dqo(struct napi_struct *napi, struct gve_rx_ring *rx,
return 0;
}
- xprog = READ_ONCE(priv->xdp_prog);
if (xprog) {
struct xdp_buff xdp;
void *old_data;
--
2.50.0.727.gbf7dc18ff4-goog
* Re: [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy
2025-07-17 15:28 ` [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy Jeroen de Borst
@ 2025-07-21 23:37 ` Jason Xing
2025-07-21 23:58 ` Jason Xing
2025-07-22 9:32 ` Paolo Abeni
0 siblings, 2 replies; 12+ messages in thread
From: Jason Xing @ 2025-07-21 23:37 UTC (permalink / raw)
To: Jeroen de Borst
Cc: netdev, hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington, Praveen Kaligineedi
Hi Jeroen,
On Thu, Jul 17, 2025 at 11:29 PM Jeroen de Borst <jeroendb@google.com> wrote:
>
> From: Joshua Washington <joshwash@google.com>
>
> In the descriptor clean path, a number of changes need to be made to
> accommodate out of order completions and double completions.
>
> The XSK stack can only handle completions being processed in order, as a
> single counter is incremented in xsk_tx_completed to signify how many XSK
> descriptors have been completed. Because completions can come back out
> of order in DQ, a separate queue of XSK descriptors must be maintained.
> This queue keeps the pending packets in the order that they were written
> so that the descriptors can be counted in xsk_tx_completed in the same
> order.
>
> For double completions, a new pending packet state and type are
> introduced. The new type, GVE_TX_PENDING_PACKET_DQO_XSK, plays an
> analogous role to pre-existing _SKB and _XDP_FRAME pending packet types
> for XSK descriptors. The new state, GVE_PACKET_STATE_XSK_COMPLETE,
> represents packets for which no more completions are expected. This
> includes packets which have received a packet completion or reinjection
> completion, as well as packets whose reinjection completion timers have
> timed out. At this point, such packets can be counted as part of
> xsk_tx_completed() and freed.
>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
> Signed-off-by: Joshua Washington <joshwash@google.com>
> Signed-off-by: Jeroen de Borst <jeroendb@google.com>
> ---
> drivers/net/ethernet/google/gve/gve.h | 19 ++-
> drivers/net/ethernet/google/gve/gve_dqo.h | 1 +
> drivers/net/ethernet/google/gve/gve_main.c | 6 +
> drivers/net/ethernet/google/gve/gve_tx_dqo.c | 148 +++++++++++++++++++
> 4 files changed, 171 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> index 9925c08e595e..ff7dc06e7fa4 100644
> --- a/drivers/net/ethernet/google/gve/gve.h
> +++ b/drivers/net/ethernet/google/gve/gve.h
> @@ -399,11 +399,17 @@ enum gve_packet_state {
> GVE_PACKET_STATE_PENDING_REINJECT_COMPL,
> /* No valid completion received within the specified timeout. */
> GVE_PACKET_STATE_TIMED_OUT_COMPL,
> + /* XSK pending packet has received a packet/reinjection completion, or
> + * has timed out. At this point, the pending packet can be counted by
> + * xsk_tx_complete and freed.
> + */
> + GVE_PACKET_STATE_XSK_COMPLETE,
> };
>
> enum gve_tx_pending_packet_dqo_type {
> GVE_TX_PENDING_PACKET_DQO_SKB,
> - GVE_TX_PENDING_PACKET_DQO_XDP_FRAME
> + GVE_TX_PENDING_PACKET_DQO_XDP_FRAME,
> + GVE_TX_PENDING_PACKET_DQO_XSK,
> };
>
> struct gve_tx_pending_packet_dqo {
> @@ -440,10 +446,10 @@ struct gve_tx_pending_packet_dqo {
> /* Identifies the current state of the packet as defined in
> * `enum gve_packet_state`.
> */
> - u8 state : 2;
> + u8 state : 3;
>
> /* gve_tx_pending_packet_dqo_type */
> - u8 type : 1;
> + u8 type : 2;
>
> /* If packet is an outstanding miss completion, then the packet is
> * freed if the corresponding re-injection completion is not received
> @@ -512,6 +518,8 @@ struct gve_tx_ring {
> /* Cached value of `dqo_compl.free_tx_qpl_buf_cnt` */
> u32 free_tx_qpl_buf_cnt;
> };
> +
> + atomic_t xsk_reorder_queue_tail;
> } dqo_tx;
> };
>
> @@ -545,6 +553,9 @@ struct gve_tx_ring {
> /* Last TX ring index fetched by HW */
> atomic_t hw_tx_head;
>
> + u16 xsk_reorder_queue_head;
> + u16 xsk_reorder_queue_tail;
> +
> /* List to track pending packets which received a miss
> * completion but not a corresponding reinjection.
> */
> @@ -598,6 +609,8 @@ struct gve_tx_ring {
> struct gve_tx_pending_packet_dqo *pending_packets;
> s16 num_pending_packets;
>
> + u16 *xsk_reorder_queue;
> +
> u32 complq_mask; /* complq size is complq_mask + 1 */
>
> /* QPL fields */
> diff --git a/drivers/net/ethernet/google/gve/gve_dqo.h b/drivers/net/ethernet/google/gve/gve_dqo.h
> index bb278727f4d9..6eb442096e02 100644
> --- a/drivers/net/ethernet/google/gve/gve_dqo.h
> +++ b/drivers/net/ethernet/google/gve/gve_dqo.h
> @@ -38,6 +38,7 @@ netdev_features_t gve_features_check_dqo(struct sk_buff *skb,
> netdev_features_t features);
> bool gve_tx_poll_dqo(struct gve_notify_block *block, bool do_clean);
> bool gve_xdp_poll_dqo(struct gve_notify_block *block);
> +bool gve_xsk_tx_poll_dqo(struct gve_notify_block *block, int budget);
> int gve_rx_poll_dqo(struct gve_notify_block *block, int budget);
> int gve_tx_alloc_rings_dqo(struct gve_priv *priv,
> struct gve_tx_alloc_rings_cfg *cfg);
> diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> index d5953f5d1895..c6ccc0bb40c9 100644
> --- a/drivers/net/ethernet/google/gve/gve_main.c
> +++ b/drivers/net/ethernet/google/gve/gve_main.c
> @@ -427,6 +427,12 @@ int gve_napi_poll_dqo(struct napi_struct *napi, int budget)
>
> if (block->rx) {
> work_done = gve_rx_poll_dqo(block, budget);
> +
> + /* Poll XSK TX as part of RX NAPI. Setup re-poll based on if
> + * either datapath has more work to do.
> + */
> + if (priv->xdp_prog)
> + reschedule |= gve_xsk_tx_poll_dqo(block, budget);
> reschedule |= work_done == budget;
> }
>
> diff --git a/drivers/net/ethernet/google/gve/gve_tx_dqo.c b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
> index ce5370b741ec..6f1d515673d2 100644
> --- a/drivers/net/ethernet/google/gve/gve_tx_dqo.c
> +++ b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
> @@ -13,6 +13,7 @@
> #include <linux/tcp.h>
> #include <linux/slab.h>
> #include <linux/skbuff.h>
> +#include <net/xdp_sock_drv.h>
>
> /* Returns true if tx_bufs are available. */
> static bool gve_has_free_tx_qpl_bufs(struct gve_tx_ring *tx, int count)
> @@ -241,6 +242,9 @@ static void gve_tx_free_ring_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
> tx->dqo.tx_ring = NULL;
> }
>
> + kvfree(tx->dqo.xsk_reorder_queue);
> + tx->dqo.xsk_reorder_queue = NULL;
> +
> kvfree(tx->dqo.pending_packets);
> tx->dqo.pending_packets = NULL;
>
> @@ -345,6 +349,17 @@ static int gve_tx_alloc_ring_dqo(struct gve_priv *priv,
>
> tx->dqo.pending_packets[tx->dqo.num_pending_packets - 1].next = -1;
> atomic_set_release(&tx->dqo_compl.free_pending_packets, -1);
> +
> + /* Only alloc xsk pool for XDP queues */
> + if (idx >= cfg->qcfg->num_queues && cfg->num_xdp_rings) {
> + tx->dqo.xsk_reorder_queue =
> + kvcalloc(tx->dqo.complq_mask + 1,
> + sizeof(tx->dqo.xsk_reorder_queue[0]),
> + GFP_KERNEL);
> + if (!tx->dqo.xsk_reorder_queue)
> + goto err;
> + }
> +
> tx->dqo_compl.miss_completions.head = -1;
> tx->dqo_compl.miss_completions.tail = -1;
> tx->dqo_compl.timed_out_completions.head = -1;
> @@ -992,6 +1007,38 @@ static int gve_try_tx_skb(struct gve_priv *priv, struct gve_tx_ring *tx,
> return 0;
> }
>
> +static void gve_xsk_reorder_queue_push_dqo(struct gve_tx_ring *tx,
> + u16 completion_tag)
> +{
> + u32 tail = atomic_read(&tx->dqo_tx.xsk_reorder_queue_tail);
> +
> + tx->dqo.xsk_reorder_queue[tail] = completion_tag;
> + tail = (tail + 1) & tx->dqo.complq_mask;
> + atomic_set_release(&tx->dqo_tx.xsk_reorder_queue_tail, tail);
> +}
> +
> +static struct gve_tx_pending_packet_dqo *
> +gve_xsk_reorder_queue_head(struct gve_tx_ring *tx)
> +{
> + u32 head = tx->dqo_compl.xsk_reorder_queue_head;
> +
> + if (head == tx->dqo_compl.xsk_reorder_queue_tail) {
> + tx->dqo_compl.xsk_reorder_queue_tail =
> + atomic_read_acquire(&tx->dqo_tx.xsk_reorder_queue_tail);
> +
> + if (head == tx->dqo_compl.xsk_reorder_queue_tail)
> + return NULL;
> + }
> +
> + return &tx->dqo.pending_packets[tx->dqo.xsk_reorder_queue[head]];
> +}
> +
> +static void gve_xsk_reorder_queue_pop_dqo(struct gve_tx_ring *tx)
> +{
> + tx->dqo_compl.xsk_reorder_queue_head++;
> + tx->dqo_compl.xsk_reorder_queue_head &= tx->dqo.complq_mask;
> +}
> +
> /* Transmit a given skb and ring the doorbell. */
> netdev_tx_t gve_tx_dqo(struct sk_buff *skb, struct net_device *dev)
> {
> @@ -1015,6 +1062,62 @@ netdev_tx_t gve_tx_dqo(struct sk_buff *skb, struct net_device *dev)
> return NETDEV_TX_OK;
> }
>
> +static bool gve_xsk_tx_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
> + int budget)
> +{
> + struct xsk_buff_pool *pool = tx->xsk_pool;
> + struct xdp_desc desc;
> + bool repoll = false;
> + int sent = 0;
> +
> + spin_lock(&tx->dqo_tx.xdp_lock);
> + for (; sent < budget; sent++) {
> + struct gve_tx_pending_packet_dqo *pkt;
> + s16 completion_tag;
> + dma_addr_t addr;
> + u32 desc_idx;
> +
> + if (unlikely(!gve_has_avail_slots_tx_dqo(tx, 1, 1))) {
> + repoll = true;
> + break;
If the whole 'for' loop breaks here, 'sent' should not be incremented,
right?
> + }
> +
> + if (!xsk_tx_peek_desc(pool, &desc))
> + break;
The same logic applies here.
I would use a while() or do..while() loop, similar to how
i40e_clean_tx_irq() manipulates the budget.
> +
> + pkt = gve_alloc_pending_packet(tx);
I checked gve_alloc_pending_packet() and noticed there is a slight
chance it returns NULL. If so, it will trigger a NULL dereference.
Thanks,
Jason
> + pkt->type = GVE_TX_PENDING_PACKET_DQO_XSK;
> + pkt->num_bufs = 0;
> + completion_tag = pkt - tx->dqo.pending_packets;
> +
> + addr = xsk_buff_raw_get_dma(pool, desc.addr);
> + xsk_buff_raw_dma_sync_for_device(pool, addr, desc.len);
> +
> + desc_idx = tx->dqo_tx.tail;
> + gve_tx_fill_pkt_desc_dqo(tx, &desc_idx,
> + true, desc.len,
> + addr, completion_tag, true,
> + false);
> + ++pkt->num_bufs;
> + gve_tx_update_tail(tx, desc_idx);
> + tx->dqo_tx.posted_packet_desc_cnt += pkt->num_bufs;
> + gve_xsk_reorder_queue_push_dqo(tx, completion_tag);
> + }
> +
> + if (sent) {
> + gve_tx_put_doorbell_dqo(priv, tx->q_resources, tx->dqo_tx.tail);
> + xsk_tx_release(pool);
> + }
> +
> + spin_unlock(&tx->dqo_tx.xdp_lock);
> +
> + u64_stats_update_begin(&tx->statss);
> + tx->xdp_xsk_sent += sent;
> + u64_stats_update_end(&tx->statss);
> +
> + return (sent == budget) || repoll;
> +}
> +
> static void add_to_list(struct gve_tx_ring *tx, struct gve_index_list *list,
> struct gve_tx_pending_packet_dqo *pending_packet)
> {
> @@ -1152,6 +1255,9 @@ static void gve_handle_packet_completion(struct gve_priv *priv,
> pending_packet->xdpf = NULL;
> gve_free_pending_packet(tx, pending_packet);
> break;
> + case GVE_TX_PENDING_PACKET_DQO_XSK:
> + pending_packet->state = GVE_PACKET_STATE_XSK_COMPLETE;
> + break;
> default:
> WARN_ON_ONCE(1);
> }
> @@ -1251,8 +1357,34 @@ static void remove_timed_out_completions(struct gve_priv *priv,
>
> remove_from_list(tx, &tx->dqo_compl.timed_out_completions,
> pending_packet);
> +
> + /* Need to count XSK packets in xsk_tx_completed. */
> + if (pending_packet->type == GVE_TX_PENDING_PACKET_DQO_XSK)
> + pending_packet->state = GVE_PACKET_STATE_XSK_COMPLETE;
> + else
> + gve_free_pending_packet(tx, pending_packet);
> + }
> +}
> +
> +static void gve_tx_process_xsk_completions(struct gve_tx_ring *tx)
> +{
> + u32 num_xsks = 0;
> +
> + while (true) {
> + struct gve_tx_pending_packet_dqo *pending_packet =
> + gve_xsk_reorder_queue_head(tx);
> +
> + if (!pending_packet ||
> + pending_packet->state != GVE_PACKET_STATE_XSK_COMPLETE)
> + break;
> +
> + num_xsks++;
> + gve_xsk_reorder_queue_pop_dqo(tx);
> gve_free_pending_packet(tx, pending_packet);
> }
> +
> + if (num_xsks)
> + xsk_tx_completed(tx->xsk_pool, num_xsks);
> }
>
> int gve_clean_tx_done_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
> @@ -1333,6 +1465,9 @@ int gve_clean_tx_done_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
> remove_miss_completions(priv, tx);
> remove_timed_out_completions(priv, tx);
>
> + if (tx->xsk_pool)
> + gve_tx_process_xsk_completions(tx);
> +
> u64_stats_update_begin(&tx->statss);
> tx->bytes_done += pkt_compl_bytes + reinject_compl_bytes;
> tx->pkt_done += pkt_compl_pkts + reinject_compl_pkts;
> @@ -1365,6 +1500,19 @@ bool gve_tx_poll_dqo(struct gve_notify_block *block, bool do_clean)
> return compl_desc->generation != tx->dqo_compl.cur_gen_bit;
> }
>
> +bool gve_xsk_tx_poll_dqo(struct gve_notify_block *rx_block, int budget)
> +{
> + struct gve_rx_ring *rx = rx_block->rx;
> + struct gve_priv *priv = rx->gve;
> + struct gve_tx_ring *tx;
> +
> + tx = &priv->tx[gve_xdp_tx_queue_id(priv, rx->q_num)];
> + if (tx->xsk_pool)
> + return gve_xsk_tx_dqo(priv, tx, budget);
> +
> + return 0;
> +}
> +
> bool gve_xdp_poll_dqo(struct gve_notify_block *block)
> {
> struct gve_tx_compl_desc *compl_desc;
> --
> 2.50.0.727.gbf7dc18ff4-goog
>
>
* Re: [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy
2025-07-21 23:37 ` Jason Xing
@ 2025-07-21 23:58 ` Jason Xing
2025-07-22 9:32 ` Paolo Abeni
1 sibling, 0 replies; 12+ messages in thread
From: Jason Xing @ 2025-07-21 23:58 UTC (permalink / raw)
To: Jeroen de Borst
Cc: netdev, hramamurthy, davem, edumazet, kuba, willemb, pabeni,
Joshua Washington, Praveen Kaligineedi
On Tue, Jul 22, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hi Jeroen,
>
> On Thu, Jul 17, 2025 at 11:29 PM Jeroen de Borst <jeroendb@google.com> wrote:
> >
> > From: Joshua Washington <joshwash@google.com>
> >
> > In the descriptor clean path, a number of changes need to be made to
> > accommodate out of order completions and double completions.
> >
> > The XSK stack can only handle completions being processed in order, as a
> > single counter is incremented in xsk_tx_completed to signify how many XSK
> > descriptors have been completed. Because completions can come back out
> > of order in DQ, a separate queue of XSK descriptors must be maintained.
> > This queue keeps the pending packets in the order that they were written
> > so that the descriptors can be counted in xsk_tx_completed in the same
> > order.
> >
> > For double completions, a new pending packet state and type are
> > introduced. The new type, GVE_TX_PENDING_PACKET_DQO_XSK, plays an
> > analogous role to pre-existing _SKB and _XDP_FRAME pending packet types
> > for XSK descriptors. The new state, GVE_PACKET_STATE_XSK_COMPLETE,
> > represents packets for which no more completions are expected. This
> > includes packets which have received a packet completion or reinjection
> > completion, as well as packets whose reinjection completion timers have
> > timed out. At this point, such packets can be counted as part of
> > xsk_tx_completed() and freed.
> >
> > Reviewed-by: Willem de Bruijn <willemb@google.com>
> > Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
> > Signed-off-by: Joshua Washington <joshwash@google.com>
> > Signed-off-by: Jeroen de Borst <jeroendb@google.com>
> > ---
> > drivers/net/ethernet/google/gve/gve.h | 19 ++-
> > drivers/net/ethernet/google/gve/gve_dqo.h | 1 +
> > drivers/net/ethernet/google/gve/gve_main.c | 6 +
> > drivers/net/ethernet/google/gve/gve_tx_dqo.c | 148 +++++++++++++++++++
> > 4 files changed, 171 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> > index 9925c08e595e..ff7dc06e7fa4 100644
> > --- a/drivers/net/ethernet/google/gve/gve.h
> > +++ b/drivers/net/ethernet/google/gve/gve.h
> > @@ -399,11 +399,17 @@ enum gve_packet_state {
> > GVE_PACKET_STATE_PENDING_REINJECT_COMPL,
> > /* No valid completion received within the specified timeout. */
> > GVE_PACKET_STATE_TIMED_OUT_COMPL,
> > + /* XSK pending packet has received a packet/reinjection completion, or
> > + * has timed out. At this point, the pending packet can be counted by
> > + * xsk_tx_complete and freed.
> > + */
> > + GVE_PACKET_STATE_XSK_COMPLETE,
> > };
> >
> > enum gve_tx_pending_packet_dqo_type {
> > GVE_TX_PENDING_PACKET_DQO_SKB,
> > - GVE_TX_PENDING_PACKET_DQO_XDP_FRAME
> > + GVE_TX_PENDING_PACKET_DQO_XDP_FRAME,
> > + GVE_TX_PENDING_PACKET_DQO_XSK,
> > };
> >
> > struct gve_tx_pending_packet_dqo {
> > @@ -440,10 +446,10 @@ struct gve_tx_pending_packet_dqo {
> > /* Identifies the current state of the packet as defined in
> > * `enum gve_packet_state`.
> > */
> > - u8 state : 2;
> > + u8 state : 3;
> >
> > /* gve_tx_pending_packet_dqo_type */
> > - u8 type : 1;
> > + u8 type : 2;
> >
> > /* If packet is an outstanding miss completion, then the packet is
> > * freed if the corresponding re-injection completion is not received
> > @@ -512,6 +518,8 @@ struct gve_tx_ring {
> > /* Cached value of `dqo_compl.free_tx_qpl_buf_cnt` */
> > u32 free_tx_qpl_buf_cnt;
> > };
> > +
> > + atomic_t xsk_reorder_queue_tail;
> > } dqo_tx;
> > };
> >
> > @@ -545,6 +553,9 @@ struct gve_tx_ring {
> > /* Last TX ring index fetched by HW */
> > atomic_t hw_tx_head;
> >
> > + u16 xsk_reorder_queue_head;
> > + u16 xsk_reorder_queue_tail;
> > +
> > /* List to track pending packets which received a miss
> > * completion but not a corresponding reinjection.
> > */
> > @@ -598,6 +609,8 @@ struct gve_tx_ring {
> > struct gve_tx_pending_packet_dqo *pending_packets;
> > s16 num_pending_packets;
> >
> > + u16 *xsk_reorder_queue;
> > +
> > u32 complq_mask; /* complq size is complq_mask + 1 */
> >
> > /* QPL fields */
> > diff --git a/drivers/net/ethernet/google/gve/gve_dqo.h b/drivers/net/ethernet/google/gve/gve_dqo.h
> > index bb278727f4d9..6eb442096e02 100644
> > --- a/drivers/net/ethernet/google/gve/gve_dqo.h
> > +++ b/drivers/net/ethernet/google/gve/gve_dqo.h
> > @@ -38,6 +38,7 @@ netdev_features_t gve_features_check_dqo(struct sk_buff *skb,
> > netdev_features_t features);
> > bool gve_tx_poll_dqo(struct gve_notify_block *block, bool do_clean);
> > bool gve_xdp_poll_dqo(struct gve_notify_block *block);
> > +bool gve_xsk_tx_poll_dqo(struct gve_notify_block *block, int budget);
> > int gve_rx_poll_dqo(struct gve_notify_block *block, int budget);
> > int gve_tx_alloc_rings_dqo(struct gve_priv *priv,
> > struct gve_tx_alloc_rings_cfg *cfg);
> > diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> > index d5953f5d1895..c6ccc0bb40c9 100644
> > --- a/drivers/net/ethernet/google/gve/gve_main.c
> > +++ b/drivers/net/ethernet/google/gve/gve_main.c
> > @@ -427,6 +427,12 @@ int gve_napi_poll_dqo(struct napi_struct *napi, int budget)
> >
> > if (block->rx) {
> > work_done = gve_rx_poll_dqo(block, budget);
> > +
> > + /* Poll XSK TX as part of RX NAPI. Setup re-poll based on if
> > + * either datapath has more work to do.
> > + */
> > + if (priv->xdp_prog)
> > + reschedule |= gve_xsk_tx_poll_dqo(block, budget);
> > reschedule |= work_done == budget;
> > }
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve_tx_dqo.c b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
> > index ce5370b741ec..6f1d515673d2 100644
> > --- a/drivers/net/ethernet/google/gve/gve_tx_dqo.c
> > +++ b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
> > @@ -13,6 +13,7 @@
> > #include <linux/tcp.h>
> > #include <linux/slab.h>
> > #include <linux/skbuff.h>
> > +#include <net/xdp_sock_drv.h>
> >
> > /* Returns true if tx_bufs are available. */
> > static bool gve_has_free_tx_qpl_bufs(struct gve_tx_ring *tx, int count)
> > @@ -241,6 +242,9 @@ static void gve_tx_free_ring_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
> > tx->dqo.tx_ring = NULL;
> > }
> >
> > + kvfree(tx->dqo.xsk_reorder_queue);
> > + tx->dqo.xsk_reorder_queue = NULL;
> > +
> > kvfree(tx->dqo.pending_packets);
> > tx->dqo.pending_packets = NULL;
> >
> > @@ -345,6 +349,17 @@ static int gve_tx_alloc_ring_dqo(struct gve_priv *priv,
> >
> > tx->dqo.pending_packets[tx->dqo.num_pending_packets - 1].next = -1;
> > atomic_set_release(&tx->dqo_compl.free_pending_packets, -1);
> > +
> > + /* Only alloc xsk pool for XDP queues */
> > + if (idx >= cfg->qcfg->num_queues && cfg->num_xdp_rings) {
> > + tx->dqo.xsk_reorder_queue =
> > + kvcalloc(tx->dqo.complq_mask + 1,
> > + sizeof(tx->dqo.xsk_reorder_queue[0]),
> > + GFP_KERNEL);
> > + if (!tx->dqo.xsk_reorder_queue)
> > + goto err;
> > + }
> > +
> > tx->dqo_compl.miss_completions.head = -1;
> > tx->dqo_compl.miss_completions.tail = -1;
> > tx->dqo_compl.timed_out_completions.head = -1;
> > @@ -992,6 +1007,38 @@ static int gve_try_tx_skb(struct gve_priv *priv, struct gve_tx_ring *tx,
> > return 0;
> > }
> >
> > +static void gve_xsk_reorder_queue_push_dqo(struct gve_tx_ring *tx,
> > + u16 completion_tag)
> > +{
> > + u32 tail = atomic_read(&tx->dqo_tx.xsk_reorder_queue_tail);
> > +
> > + tx->dqo.xsk_reorder_queue[tail] = completion_tag;
> > + tail = (tail + 1) & tx->dqo.complq_mask;
> > + atomic_set_release(&tx->dqo_tx.xsk_reorder_queue_tail, tail);
> > +}
> > +
> > +static struct gve_tx_pending_packet_dqo *
> > +gve_xsk_reorder_queue_head(struct gve_tx_ring *tx)
> > +{
> > + u32 head = tx->dqo_compl.xsk_reorder_queue_head;
> > +
> > + if (head == tx->dqo_compl.xsk_reorder_queue_tail) {
> > + tx->dqo_compl.xsk_reorder_queue_tail =
> > + atomic_read_acquire(&tx->dqo_tx.xsk_reorder_queue_tail);
> > +
> > + if (head == tx->dqo_compl.xsk_reorder_queue_tail)
> > + return NULL;
> > + }
> > +
> > + return &tx->dqo.pending_packets[tx->dqo.xsk_reorder_queue[head]];
> > +}
> > +
> > +static void gve_xsk_reorder_queue_pop_dqo(struct gve_tx_ring *tx)
> > +{
> > + tx->dqo_compl.xsk_reorder_queue_head++;
> > + tx->dqo_compl.xsk_reorder_queue_head &= tx->dqo.complq_mask;
> > +}
> > +
> > /* Transmit a given skb and ring the doorbell. */
> > netdev_tx_t gve_tx_dqo(struct sk_buff *skb, struct net_device *dev)
> > {
> > @@ -1015,6 +1062,62 @@ netdev_tx_t gve_tx_dqo(struct sk_buff *skb, struct net_device *dev)
> > return NETDEV_TX_OK;
> > }
> >
> > +static bool gve_xsk_tx_dqo(struct gve_priv *priv, struct gve_tx_ring *tx,
> > + int budget)
> > +{
> > + struct xsk_buff_pool *pool = tx->xsk_pool;
> > + struct xdp_desc desc;
> > + bool repoll = false;
> > + int sent = 0;
> > +
> > + spin_lock(&tx->dqo_tx.xdp_lock);
> > + for (; sent < budget; sent++) {
> > + struct gve_tx_pending_packet_dqo *pkt;
> > + s16 completion_tag;
> > + dma_addr_t addr;
> > + u32 desc_idx;
> > +
> > + if (unlikely(!gve_has_avail_slots_tx_dqo(tx, 1, 1))) {
> > + repoll = true;
> > + break;
>
> If the whole 'for' loop breaks here, 'sent' should not be increased
> by one, right?
Oops, I was wrong here: 'sent' starts at zero and is only incremented
after an iteration completes, so even when the loop breaks early, 'sent'
still counts the descriptors that were actually processed. The sent++ is
fine as it is.
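To make the counting behaviour concrete, here is a minimal standalone
sketch (plain C, not the gve code) showing that the increment clause of
a for loop is skipped when the body breaks, so 'sent' ends up equal to
the number of fully processed iterations:

  #include <stdio.h>

  int main(void)
  {
          int budget = 8, free_slots = 3, sent;

          for (sent = 0; sent < budget; sent++) {
                  if (free_slots == 0)
                          break;  /* increment for this iteration is skipped */
                  free_slots--;   /* stand-in for posting one descriptor */
          }
          printf("sent = %d\n", sent);    /* prints 3 */
          return 0;
  }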
Thanks,
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy
2025-07-21 23:37 ` Jason Xing
2025-07-21 23:58 ` Jason Xing
@ 2025-07-22 9:32 ` Paolo Abeni
2025-07-22 11:50 ` Jason Xing
2025-07-22 17:32 ` Harshitha Ramamurthy
1 sibling, 2 replies; 12+ messages in thread
From: Paolo Abeni @ 2025-07-22 9:32 UTC (permalink / raw)
To: Jason Xing, Jeroen de Borst
Cc: netdev, hramamurthy, davem, edumazet, kuba, willemb,
Joshua Washington, Praveen Kaligineedi
On 7/22/25 1:37 AM, Jason Xing wrote:
> On Thu, Jul 17, 2025 at 11:29 PM Jeroen de Borst <jeroendb@google.com> wrote:
>> +
>> + pkt = gve_alloc_pending_packet(tx);
>
> I checked gve_alloc_pending_packet() and noticed there is a slight
> chance it returns NULL? If so, it would trigger a NULL dereference
> issue.
IIRC, a similar thing was already mentioned on an older patch. Still
IIRC, the trick is that there is a previous, successful call to
gve_has_avail_slots_tx_dqo() that ensures there are free TX packets
avail and operations racing in between on other CPUs could only add more
free pkts.
So the above looks safe to me.
Side note: since this looks like a recurring topic, I guess a follow-up
adding some comments above the relevant functions would be nice.
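For instance, something along these lines (just a sketch of the kind of
comment meant here; the exact wording and signature may differ):

  /* Callers must first check gve_has_avail_slots_tx_dqo(): free pending
   * packets are only consumed on the TX path, and completion processing
   * on other CPUs can only return entries to the free list, so after a
   * successful check this function never returns NULL.
   */
  struct gve_tx_pending_packet_dqo *
  gve_alloc_pending_packet(struct gve_tx_ring *tx);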
Thanks,
Paolo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA
2025-07-17 15:28 [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA Jeroen de Borst
` (4 preceding siblings ...)
2025-07-17 15:28 ` [PATCH net-next v2 5/5] gve: implement DQO RX datapath and control path " Jeroen de Borst
@ 2025-07-22 9:50 ` patchwork-bot+netdevbpf
5 siblings, 0 replies; 12+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-07-22 9:50 UTC (permalink / raw)
To: Jeroen de Borst
Cc: netdev, hramamurthy, davem, edumazet, kuba, willemb, pabeni,
joshwash
Hello:
This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Thu, 17 Jul 2025 08:28:34 -0700 you wrote:
> From: Joshua Washington <joshwash@google.com>
>
> This patch series adds support for AF_XDP zero-copy in the DQO RDA queue
> format.
>
> XSK infrastructure is updated to re-post buffers when adding XSK pools
> because XSK umem will be posted directly to the NIC, a departure from
> the bounce buffer model used in GQI QPL. A registry of XSK pools is
> introduced to prevent the usage of XSK pools when in copy mode.
>
> [...]
Here is the summary with links:
- [net-next,v2,1/5] gve: deduplicate xdp info and xsk pool registration logic
https://git.kernel.org/netdev/net-next/c/d57ae093c887
- [net-next,v2,2/5] gve: merge xdp and xsk registration
https://git.kernel.org/netdev/net-next/c/077f7153fd25
- [net-next,v2,3/5] gve: keep registry of zc xsk pools in netdev_priv
https://git.kernel.org/netdev/net-next/c/652fe13b1fd8
- [net-next,v2,4/5] gve: implement DQO TX datapath for AF_XDP zero-copy
https://git.kernel.org/netdev/net-next/c/2236836eab26
- [net-next,v2,5/5] gve: implement DQO RX datapath and control path for AF_XDP zero-copy
https://git.kernel.org/netdev/net-next/c/c1fffc5d66a7
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy
2025-07-22 9:32 ` Paolo Abeni
@ 2025-07-22 11:50 ` Jason Xing
2025-07-22 17:32 ` Harshitha Ramamurthy
1 sibling, 0 replies; 12+ messages in thread
From: Jason Xing @ 2025-07-22 11:50 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jeroen de Borst, netdev, hramamurthy, davem, edumazet, kuba,
willemb, Joshua Washington, Praveen Kaligineedi
On Tue, Jul 22, 2025 at 5:32 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/22/25 1:37 AM, Jason Xing wrote:
> > On Thu, Jul 17, 2025 at 11:29 PM Jeroen de Borst <jeroendb@google.com> wrote:
> >> +
> >> + pkt = gve_alloc_pending_packet(tx);
> >
> > I checked gve_alloc_pending_packet() and noticed there is a slight
> > chance it returns NULL? If so, it would trigger a NULL dereference
> > issue.
>
> IIRC, a similar thing was already mentioned on an older patch. Still
> IIRC, the trick is that there is a previous, successful call to
> gve_has_avail_slots_tx_dqo() that ensures there are free TX packets
> avail and operations racing in between on other CPUs could only add more
> free pkts.
Oh, thanks for clarifying this. It does make sense :)
Thanks,
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy
2025-07-22 9:32 ` Paolo Abeni
2025-07-22 11:50 ` Jason Xing
@ 2025-07-22 17:32 ` Harshitha Ramamurthy
1 sibling, 0 replies; 12+ messages in thread
From: Harshitha Ramamurthy @ 2025-07-22 17:32 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jason Xing, Jeroen de Borst, netdev, davem, edumazet, kuba,
willemb, Joshua Washington, Praveen Kaligineedi
On Tue, Jul 22, 2025 at 2:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/22/25 1:37 AM, Jason Xing wrote:
> > On Thu, Jul 17, 2025 at 11:29 PM Jeroen de Borst <jeroendb@google.com> wrote:
> >> +
> >> + pkt = gve_alloc_pending_packet(tx);
> >
> > I checked gve_alloc_pending_packet() and noticed there is a slight
> > chance it returns NULL? If so, it would trigger a NULL dereference
> > issue.
>
> IIRC, a similar thing was already mentioned on an older patch. Still
> IIRC, the trick is that there is a previous, successful call to
> gve_has_avail_slots_tx_dqo() that ensures there are free TX packets
> avail and operations racing in between on other CPUs could only add more
> free pkts.
>
> So the above looks safe to me.
>
> Side note: since this looks like a recurring topic, I guess a follow-up
> adding some comments above the relevant functions would be nice.
Sounds good, will do in a follow up patch.
Thanks for the discussion and feedback.
>
> Thanks,
>
> Paolo
>
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-07-22 17:32 UTC | newest]
Thread overview: 12+ messages
-- links below jump to the message on this page --
2025-07-17 15:28 [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 1/5] gve: deduplicate xdp info and xsk pool registration logic Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 2/5] gve: merge xdp and xsk registration Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 3/5] gve: keep registry of zc xsk pools in netdev_priv Jeroen de Borst
2025-07-17 15:28 ` [PATCH net-next v2 4/5] gve: implement DQO TX datapath for AF_XDP zero-copy Jeroen de Borst
2025-07-21 23:37 ` Jason Xing
2025-07-21 23:58 ` Jason Xing
2025-07-22 9:32 ` Paolo Abeni
2025-07-22 11:50 ` Jason Xing
2025-07-22 17:32 ` Harshitha Ramamurthy
2025-07-17 15:28 ` [PATCH net-next v2 5/5] gve: implement DQO RX datapath and control path " Jeroen de Borst
2025-07-22 9:50 ` [PATCH net-next v2 0/5] gve: AF_XDP zero-copy for DQO RDA patchwork-bot+netdevbpf