netdev.vger.kernel.org archive mirror
* [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
@ 2025-06-19  9:04 Jason Xing
  2025-06-19 13:53 ` Willem de Bruijn
  2025-06-19 15:09 ` Jakub Kicinski
From: Jason Xing @ 2025-06-19  9:04 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

The patch does the following things:
- Add the XDP_MAX_TX_BUDGET socket option.
- Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into a single per-socket
  field, max_tx_budget.
- max_tx_budget is set to 32 by default in the initialization phase,
  giving per-socket granular control.

The idea behind this comes from real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize syscalls. When packets are aggregated, it is not hard to hit
the upper bound (namely, 32). The moment the user-space stack sees the
-EAGAIN error number returned from sendto(), it loops to try again until
all the expected descs from the tx ring have been sent out to the driver.
Enlarging the XDP_MAX_TX_BUDGET value reduces the frequency of sendto()
calls. Besides, applications leveraging this setsockopt can adjust the
value in time after noticing the upper-bound issue happening.
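
As an illustration, a user-space sender could use the option roughly as
below (a minimal sketch: the budget value of 128, the helper names and
the already-bound xsk_fd are examples, not part of this patch):

	#include <stdio.h>
	#include <errno.h>
	#include <sys/socket.h>
	#include <linux/if_xdp.h>

	/* raise the per-socket budget once at setup time */
	static void xsk_setup_budget(int xsk_fd)
	{
		int budget = 128;	/* example value; tune per workload */

		if (setsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET,
			       &budget, sizeof(budget)))
			perror("setsockopt(XDP_MAX_TX_BUDGET)");
	}

	/* the retry loop described above: a larger budget means fewer
	 * -EAGAIN round trips before the whole batch reaches the driver
	 */
	static void xsk_kick_tx(int xsk_fd)
	{
		while (sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0) < 0 &&
		       errno == EAGAIN)
			;
	}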

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
V3
Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
1. use a per-socket control (suggested by Stanislav)
2. unify both definitions into one
3. support setsockopt and getsockopt
4. add more description in commit message

V2
Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
1. use a per-netns sysctl knob
2. use sysctl_xsk_max_tx_budget to unify both definitions.
---
 include/net/xdp_sock.h            |  3 ++-
 include/uapi/linux/if_xdp.h       |  1 +
 net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
 tools/include/uapi/linux/if_xdp.h |  1 +
 4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e8bd6ddb7b12..8eecafad92c0 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -65,11 +65,12 @@ struct xdp_sock {
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
 	struct list_head tx_list;
 	/* record the number of tx descriptors sent by this xsk and
-	 * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
+	 * when it exceeds max_tx_budget, an opportunity needs
 	 * to be given to other xsks for sending tx descriptors, thereby
 	 * preventing other XSKs from being starved.
 	 */
 	u32 tx_budget_spent;
+	u32 max_tx_budget;
 
 	/* Statistics */
 	u64 rx_dropped;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 44f2bb93e7e6..07c6d21c2f1c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_MAX_TX_BUDGET		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 72c000c0ae5f..7c47f665e9d1 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -33,9 +33,6 @@
 #include "xdp_umem.h"
 #include "xsk.h"
 
-#define TX_BATCH_SIZE 32
-#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
-
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
 	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
 	rcu_read_lock();
 again:
 	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
-		if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
+		int max_budget = READ_ONCE(xs->max_tx_budget);
+
+		if (xs->tx_budget_spent >= max_budget) {
 			budget_exhausted = true;
 			continue;
 		}
@@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 static int __xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
-	u32 max_batch = TX_BATCH_SIZE;
+	u32 max_budget = READ_ONCE(xs->max_tx_budget);
 	bool sent_frame = false;
 	struct xdp_desc desc;
 	struct sk_buff *skb;
@@ -797,7 +796,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 		goto out;
 
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-		if (max_batch-- == 0) {
+		if (max_budget-- == 0) {
 			err = -EAGAIN;
 			goto out;
 		}
@@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_MAX_TX_BUDGET:
+	{
+		unsigned int budget;
+
+		if (optlen < sizeof(budget))
+			return -EINVAL;
+		if (copy_from_sockptr(&budget, optval, sizeof(budget)))
+			return -EFAULT;
+
+		WRITE_ONCE(xs->max_tx_budget, budget);
+		return 0;
+	}
 	default:
 		break;
 	}
@@ -1588,6 +1599,18 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
 
 		return 0;
 	}
+	case XDP_MAX_TX_BUDGET:
+	{
+		unsigned int budget = READ_ONCE(xs->max_tx_budget);
+
+		if (copy_to_user(optval, &budget, sizeof(budget)))
+			return -EFAULT;
+		if (put_user(sizeof(budget), optlen))
+			return -EFAULT;
+
+		return 0;
+	}
+
 	default:
 		break;
 	}
@@ -1734,6 +1757,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
 	xs = xdp_sk(sk);
 	xs->state = XSK_READY;
+	xs->max_tx_budget = 32;
 	mutex_init(&xs->mutex);
 
 	INIT_LIST_HEAD(&xs->map_list);
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 44f2bb93e7e6..07c6d21c2f1c 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_MAX_TX_BUDGET		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
-- 
2.43.5



* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-19  9:04 [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
@ 2025-06-19 13:53 ` Willem de Bruijn
  2025-06-19 23:53   ` Jason Xing
  2025-06-19 15:09 ` Jakub Kicinski
From: Willem de Bruijn @ 2025-06-19 13:53 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> The patch does the following things:
> - Add XDP_MAX_TX_BUDGET socket option.
> - Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into single one
>   tx_budget_spent.
> - tx_budget_spent is set to 32 by default in the initialization phase.
>   It's a per-socket granular control.
> 
> The idea behind this comes out of real workloads in production. We use a
> user-level stack with xsk support to accelerate sending packets and
> minimize triggering syscall. When the packets are aggregated, it's not
> hard to hit the upper bound (namely, 32). The moment user-space stack
> fetches the -EAGAIN error number passed from sendto(), it will loop to try
> again until all the expected descs from tx ring are sent out to the driver.
> Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequencies of
> sendto(). Besides, applications leveraging this setsockopt can adjust
> its proper value in time after noticing the upper bound issue happening.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> V3
> Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> 1. use a per-socket control (suggested by Stanislav)
> 2. unify both definitions into one
> 3. support setsockopt and getsockopt
> 4. add more description in commit message

+1 on an XSK setsockopt only

> 
> V2
> Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> 1. use a per-netns sysctl knob
> 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> ---
>  include/net/xdp_sock.h            |  3 ++-
>  include/uapi/linux/if_xdp.h       |  1 +
>  net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
>  tools/include/uapi/linux/if_xdp.h |  1 +
>  4 files changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index e8bd6ddb7b12..8eecafad92c0 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -65,11 +65,12 @@ struct xdp_sock {
>  	struct xsk_queue *tx ____cacheline_aligned_in_smp;
>  	struct list_head tx_list;
>  	/* record the number of tx descriptors sent by this xsk and
> -	 * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
> +	 * when it exceeds max_tx_budget, an opportunity needs
>  	 * to be given to other xsks for sending tx descriptors, thereby
>  	 * preventing other XSKs from being starved.
>  	 */
>  	u32 tx_budget_spent;
> +	u32 max_tx_budget;

This probably does not need to be a u32?

It does fit in an existing hole. Is it also a warm cacheline wherever
this is touched in the hot path?

>  
>  	/* Statistics */
>  	u64 rx_dropped;
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 44f2bb93e7e6..07c6d21c2f1c 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_COMPLETION_RING	6
>  #define XDP_STATISTICS			7
>  #define XDP_OPTIONS			8
> +#define XDP_MAX_TX_BUDGET		9
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 72c000c0ae5f..7c47f665e9d1 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -33,9 +33,6 @@
>  #include "xdp_umem.h"
>  #include "xsk.h"
>  
> -#define TX_BATCH_SIZE 32
> -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
> -
>  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
>  {
>  	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
> @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
>  	rcu_read_lock();
>  again:
>  	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> -		if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> +		int max_budget = READ_ONCE(xs->max_tx_budget);
> +
> +		if (xs->tx_budget_spent >= max_budget) {
>  			budget_exhausted = true;
>  			continue;
>  		}
> @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  static int __xsk_generic_xmit(struct sock *sk)
>  {
>  	struct xdp_sock *xs = xdp_sk(sk);
> -	u32 max_batch = TX_BATCH_SIZE;
> +	u32 max_budget = READ_ONCE(xs->max_tx_budget);
>  	bool sent_frame = false;
>  	struct xdp_desc desc;
>  	struct sk_buff *skb;
> @@ -797,7 +796,7 @@ static int __xsk_generic_xmit(struct sock *sk)
>  		goto out;
>  
>  	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> -		if (max_batch-- == 0) {
> +		if (max_budget-- == 0) {
>  			err = -EAGAIN;
>  			goto out;
>  		}
> @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return err;
>  	}
> +	case XDP_MAX_TX_BUDGET:
> +	{
> +		unsigned int budget;
> +
> +		if (optlen < sizeof(budget))
> +			return -EINVAL;
> +		if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> +			return -EFAULT;
> +
> +		WRITE_ONCE(xs->max_tx_budget, budget);

Sanitize input: bounds check

> +		return 0;
> +	}
>  	default:
>  		break;
>  	}
> @@ -1588,6 +1599,18 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
>  
>  		return 0;
>  	}
> +	case XDP_MAX_TX_BUDGET:
> +	{
> +		unsigned int budget = READ_ONCE(xs->max_tx_budget);
> +
> +		if (copy_to_user(optval, &budget, sizeof(budget)))
> +			return -EFAULT;
> +		if (put_user(sizeof(budget), optlen))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
> +
>  	default:
>  		break;
>  	}
> @@ -1734,6 +1757,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
>  
>  	xs = xdp_sk(sk);
>  	xs->state = XSK_READY;
> +	xs->max_tx_budget = 32;
>  	mutex_init(&xs->mutex);
>  
>  	INIT_LIST_HEAD(&xs->map_list);
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index 44f2bb93e7e6..07c6d21c2f1c 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_COMPLETION_RING	6
>  #define XDP_STATISTICS			7
>  #define XDP_OPTIONS			8
> +#define XDP_MAX_TX_BUDGET		9
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> -- 
> 2.43.5
> 




* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-19  9:04 [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
  2025-06-19 13:53 ` Willem de Bruijn
@ 2025-06-19 15:09 ` Jakub Kicinski
  2025-06-20  0:17   ` Jason Xing
  2025-06-20 14:25   ` Stanislav Fomichev
From: Jakub Kicinski @ 2025-06-19 15:09 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
>  	rcu_read_lock();
>  again:
>  	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> -		if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> +		int max_budget = READ_ONCE(xs->max_tx_budget);
> +
> +		if (xs->tx_budget_spent >= max_budget) {
>  			budget_exhausted = true;
>  			continue;
>  		}
> @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  static int __xsk_generic_xmit(struct sock *sk)
>  {
>  	struct xdp_sock *xs = xdp_sk(sk);
> -	u32 max_batch = TX_BATCH_SIZE;
> +	u32 max_budget = READ_ONCE(xs->max_tx_budget);

Hm, maybe a question to Stan / Willem & other XSK experts but are these
two max values / code paths really related? Question 2 -- is generic
XSK a legit optimization target, legit enough to add uAPI?

Jason, I think some additions to Documentation/ and quantification of
the benefits would be needed as well.
-- 
pw-bot: cr


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-19 13:53 ` Willem de Bruijn
@ 2025-06-19 23:53   ` Jason Xing
  2025-06-20  0:02     ` Jason Xing
  2025-06-20 13:43     ` Willem de Bruijn
From: Jason Xing @ 2025-06-19 23:53 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

On Thu, Jun 19, 2025 at 9:53 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > The patch does the following things:
> > - Add XDP_MAX_TX_BUDGET socket option.
> > - Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into single one
> >   tx_budget_spent.
> > - tx_budget_spent is set to 32 by default in the initialization phase.
> >   It's a per-socket granular control.
> >
> > The idea behind this comes out of real workloads in production. We use a
> > user-level stack with xsk support to accelerate sending packets and
> > minimize triggering syscall. When the packets are aggregated, it's not
> > hard to hit the upper bound (namely, 32). The moment user-space stack
> > fetches the -EAGAIN error number passed from sendto(), it will loop to try
> > again until all the expected descs from tx ring are sent out to the driver.
> > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequencies of
> > sendto(). Besides, applications leveraging this setsockopt can adjust
> > its proper value in time after noticing the upper bound issue happening.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > V3
> > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > 1. use a per-socket control (suggested by Stanislav)
> > 2. unify both definitions into one
> > 3. support setsockopt and getsockopt
> > 4. add more description in commit message
>
> +1 on an XSK setsockopt only

May I ask why only a setsockopt? Traditionally, dev_tx_weight can be
read and written through sysctl. Aren't these two cases the same?

>
> >
> > V2
> > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > 1. use a per-netns sysctl knob
> > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > ---
> >  include/net/xdp_sock.h            |  3 ++-
> >  include/uapi/linux/if_xdp.h       |  1 +
> >  net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
> >  tools/include/uapi/linux/if_xdp.h |  1 +
> >  4 files changed, 34 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index e8bd6ddb7b12..8eecafad92c0 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -65,11 +65,12 @@ struct xdp_sock {
> >       struct xsk_queue *tx ____cacheline_aligned_in_smp;
> >       struct list_head tx_list;
> >       /* record the number of tx descriptors sent by this xsk and
> > -      * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
> > +      * when it exceeds max_tx_budget, an opportunity needs
> >        * to be given to other xsks for sending tx descriptors, thereby
> >        * preventing other XSKs from being starved.
> >        */
> >       u32 tx_budget_spent;
> > +     u32 max_tx_budget;
>
> This probably does not need to be a u32?

From what I've seen, it's not practical to set a very large value like
1000, which would probably bring side effects.

But it seems we'd better not limit the use of this max_tx_budget? We
don't know what the best fit is for various use cases. If the type
needs to be downsized to a smaller one like u16, a related question is
whether dev_tx_weight deserves the same treatment?

Or shall I adjust it to 'int' then?

> It does fit in an existing hole. Is it also a warm cacheline wherever
> this is touched in the hot path?

Oh, right.  max_tx_budget is almost a read-only member while the rest
of the same cacheline is frequently changed. I'm going to change it as
below:
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 8eecafad92c0..fca7723ad8b3 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -70,7 +70,6 @@ struct xdp_sock {
         * preventing other XSKs from being starved.
         */
        u32 tx_budget_spent;
-       u32 max_tx_budget;

        /* Statistics */
        u64 rx_dropped;
@@ -85,6 +84,7 @@ struct xdp_sock {
        struct list_head map_list;
        /* Protects map_list */
        spinlock_t map_list_lock;
+       u32 max_tx_budget;
        /* Protects multiple processes in the control path */
        struct mutex mutex;
        struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */

Then it will sit in a read-mostly cacheline and also not add an
extra hole :)

>
> >
> >       /* Statistics */
> >       u64 rx_dropped;
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> >  #define XDP_UMEM_COMPLETION_RING     6
> >  #define XDP_STATISTICS                       7
> >  #define XDP_OPTIONS                  8
> > +#define XDP_MAX_TX_BUDGET            9
> >
> >  struct xdp_umem_reg {
> >       __u64 addr; /* Start of packet data area */
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 72c000c0ae5f..7c47f665e9d1 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -33,9 +33,6 @@
> >  #include "xdp_umem.h"
> >  #include "xsk.h"
> >
> > -#define TX_BATCH_SIZE 32
> > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
> > -
> >  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
> >  {
> >       if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
> > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> >       rcu_read_lock();
> >  again:
> >       list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > -             if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > +             int max_budget = READ_ONCE(xs->max_tx_budget);
> > +
> > +             if (xs->tx_budget_spent >= max_budget) {
> >                       budget_exhausted = true;
> >                       continue;
> >               }
> > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >  static int __xsk_generic_xmit(struct sock *sk)
> >  {
> >       struct xdp_sock *xs = xdp_sk(sk);
> > -     u32 max_batch = TX_BATCH_SIZE;
> > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> >       bool sent_frame = false;
> >       struct xdp_desc desc;
> >       struct sk_buff *skb;
> > @@ -797,7 +796,7 @@ static int __xsk_generic_xmit(struct sock *sk)
> >               goto out;
> >
> >       while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > -             if (max_batch-- == 0) {
> > +             if (max_budget-- == 0) {
> >                       err = -EAGAIN;
> >                       goto out;
> >               }
> > @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> >               mutex_unlock(&xs->mutex);
> >               return err;
> >       }
> > +     case XDP_MAX_TX_BUDGET:
> > +     {
> > +             unsigned int budget;
> > +
> > +             if (optlen < sizeof(budget))
> > +                     return -EINVAL;
> > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > +                     return -EFAULT;
> > +
> > +             WRITE_ONCE(xs->max_tx_budget, budget);
>
> Sanitize input: bounds check

Thanks for catching this.

I will change it like this:
    WRITE_ONCE(xs->max_tx_budget, min_t(int, budget, INT_MAX));?

Thanks,
Jason

>
> > +             return 0;
> > +     }
> >       default:
> >               break;
> >       }
> > @@ -1588,6 +1599,18 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
> >
> >               return 0;
> >       }
> > +     case XDP_MAX_TX_BUDGET:
> > +     {
> > +             unsigned int budget = READ_ONCE(xs->max_tx_budget);
> > +
> > +             if (copy_to_user(optval, &budget, sizeof(budget)))
> > +                     return -EFAULT;
> > +             if (put_user(sizeof(budget), optlen))
> > +                     return -EFAULT;
> > +
> > +             return 0;
> > +     }
> > +
> >       default:
> >               break;
> >       }
> > @@ -1734,6 +1757,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
> >
> >       xs = xdp_sk(sk);
> >       xs->state = XSK_READY;
> > +     xs->max_tx_budget = 32;
> >       mutex_init(&xs->mutex);
> >
> >       INIT_LIST_HEAD(&xs->map_list);
> > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > --- a/tools/include/uapi/linux/if_xdp.h
> > +++ b/tools/include/uapi/linux/if_xdp.h
> > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> >  #define XDP_UMEM_COMPLETION_RING     6
> >  #define XDP_STATISTICS                       7
> >  #define XDP_OPTIONS                  8
> > +#define XDP_MAX_TX_BUDGET            9
> >
> >  struct xdp_umem_reg {
> >       __u64 addr; /* Start of packet data area */
> > --
> > 2.43.5
> >
>
>


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-19 23:53   ` Jason Xing
@ 2025-06-20  0:02     ` Jason Xing
  2025-06-20 13:43     ` Willem de Bruijn
From: Jason Xing @ 2025-06-20  0:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

On Fri, Jun 20, 2025 at 7:53 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Jun 19, 2025 at 9:53 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > The patch does the following things:
> > > - Add XDP_MAX_TX_BUDGET socket option.
> > > - Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into single one
> > >   tx_budget_spent.
> > > - tx_budget_spent is set to 32 by default in the initialization phase.
> > >   It's a per-socket granular control.
> > >
> > > The idea behind this comes out of real workloads in production. We use a
> > > user-level stack with xsk support to accelerate sending packets and
> > > minimize triggering syscall. When the packets are aggregated, it's not
> > > hard to hit the upper bound (namely, 32). The moment user-space stack
> > > fetches the -EAGAIN error number passed from sendto(), it will loop to try
> > > again until all the expected descs from tx ring are sent out to the driver.
> > > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequencies of
> > > sendto(). Besides, applications leveraging this setsockopt can adjust
> > > its proper value in time after noticing the upper bound issue happening.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > V3
> > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > > 1. use a per-socket control (suggested by Stanislav)
> > > 2. unify both definitions into one
> > > 3. support setsockopt and getsockopt
> > > 4. add more description in commit message
> >
> > +1 on an XSK setsockopt only
>
> May I ask why only setsockopt? In tradition, dev_tx_weight can be read
> and written through running sysctl. I think they are the same?
>
> >
> > >
> > > V2
> > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > > 1. use a per-netns sysctl knob
> > > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > > ---
> > >  include/net/xdp_sock.h            |  3 ++-
> > >  include/uapi/linux/if_xdp.h       |  1 +
> > >  net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
> > >  tools/include/uapi/linux/if_xdp.h |  1 +
> > >  4 files changed, 34 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index e8bd6ddb7b12..8eecafad92c0 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -65,11 +65,12 @@ struct xdp_sock {
> > >       struct xsk_queue *tx ____cacheline_aligned_in_smp;
> > >       struct list_head tx_list;
> > >       /* record the number of tx descriptors sent by this xsk and
> > > -      * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
> > > +      * when it exceeds max_tx_budget, an opportunity needs
> > >        * to be given to other xsks for sending tx descriptors, thereby
> > >        * preventing other XSKs from being starved.
> > >        */
> > >       u32 tx_budget_spent;
> > > +     u32 max_tx_budget;
> >
> > This probably does not need to be a u32?
>
> From what I've known, it's not possible to set a very large value like
> 1000 which probably brings side effects.
>
> But it seems we'd better not limit the use of this max_tx_budget? We
> don't know what the best fit for various use cases is. If the type
> needs to be downsized to a smaller one like u16, another related
> consideration is that dev_tx_weight deserves the same treatment?
>
> Or let me adjust to 'int' then?
>
> > It does fit in an existing hole. Is it also a warm cacheline wherever
> > this is touched in the hot path?
>
> Oh, right.  max_tx_budget is almost a read-only member while the rest
> of the same cacheline are frequently changed. I'm going to change as
> below:
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 8eecafad92c0..fca7723ad8b3 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -70,7 +70,6 @@ struct xdp_sock {
>          * preventing other XSKs from being starved.
>          */
>         u32 tx_budget_spent;
> -       u32 max_tx_budget;
>
>         /* Statistics */
>         u64 rx_dropped;
> @@ -85,6 +84,7 @@ struct xdp_sock {
>         struct list_head map_list;
>         /* Protects map_list */
>         spinlock_t map_list_lock;
> +       u32 max_tx_budget;
>         /* Protects multiple processes in the control path */
>         struct mutex mutex;
>         struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
>
> Then it will be put into one read-mostly cacheline and also not add an
> extra hole :)
>
> >
> > >
> > >       /* Statistics */
> > >       u64 rx_dropped;
> > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > > --- a/include/uapi/linux/if_xdp.h
> > > +++ b/include/uapi/linux/if_xdp.h
> > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> > >  #define XDP_UMEM_COMPLETION_RING     6
> > >  #define XDP_STATISTICS                       7
> > >  #define XDP_OPTIONS                  8
> > > +#define XDP_MAX_TX_BUDGET            9
> > >
> > >  struct xdp_umem_reg {
> > >       __u64 addr; /* Start of packet data area */
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 72c000c0ae5f..7c47f665e9d1 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -33,9 +33,6 @@
> > >  #include "xdp_umem.h"
> > >  #include "xsk.h"
> > >
> > > -#define TX_BATCH_SIZE 32
> > > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
> > > -
> > >  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
> > >  {
> > >       if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
> > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > >       rcu_read_lock();
> > >  again:
> > >       list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > -             if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > +             int max_budget = READ_ONCE(xs->max_tx_budget);
> > > +
> > > +             if (xs->tx_budget_spent >= max_budget) {
> > >                       budget_exhausted = true;
> > >                       continue;
> > >               }
> > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >  static int __xsk_generic_xmit(struct sock *sk)
> > >  {
> > >       struct xdp_sock *xs = xdp_sk(sk);
> > > -     u32 max_batch = TX_BATCH_SIZE;
> > > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > >       bool sent_frame = false;
> > >       struct xdp_desc desc;
> > >       struct sk_buff *skb;
> > > @@ -797,7 +796,7 @@ static int __xsk_generic_xmit(struct sock *sk)
> > >               goto out;
> > >
> > >       while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > > -             if (max_batch-- == 0) {
> > > +             if (max_budget-- == 0) {
> > >                       err = -EAGAIN;
> > >                       goto out;
> > >               }
> > > @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > >               mutex_unlock(&xs->mutex);
> > >               return err;
> > >       }
> > > +     case XDP_MAX_TX_BUDGET:
> > > +     {
> > > +             unsigned int budget;
> > > +
> > > +             if (optlen < sizeof(budget))
> > > +                     return -EINVAL;
> > > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > > +                     return -EFAULT;
> > > +
> > > +             WRITE_ONCE(xs->max_tx_budget, budget);
> >
> > Sanitize input: bounds check
>
> Thanks for catching this.
>
> I will change it like this:
>     WRITE_ONCE(xs->max_tx_budget, min_t(int, budget, INT_MAX));?

Oh, never mind modifying the above line. Changing only the
if-statement at the beginning would suffice:

if (optlen != sizeof(budget))
        return -EINVAL;

Thanks,
Jason


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-19 15:09 ` Jakub Kicinski
@ 2025-06-20  0:17   ` Jason Xing
  2025-06-20 13:50     ` Willem de Bruijn
  2025-06-21 14:43     ` Jakub Kicinski
  2025-06-20 14:25   ` Stanislav Fomichev
From: Jason Xing @ 2025-06-20  0:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Thu, Jun 19, 2025 at 11:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> >       rcu_read_lock();
> >  again:
> >       list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > -             if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > +             int max_budget = READ_ONCE(xs->max_tx_budget);
> > +
> > +             if (xs->tx_budget_spent >= max_budget) {
> >                       budget_exhausted = true;
> >                       continue;
> >               }
> > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >  static int __xsk_generic_xmit(struct sock *sk)
> >  {
> >       struct xdp_sock *xs = xdp_sk(sk);
> > -     u32 max_batch = TX_BATCH_SIZE;
> > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
>
> Hm, maybe a question to Stan / Willem & other XSK experts but are these
> two max values / code paths really related? Question 2 -- is generic
> XSK a legit optimization target, legit enough to add uAPI?

I'm not an expert but my take is:
#1, I don't actually see a correlation, but I also don't see any reason
to use different values for the two.
#2, These two definitions are worthwhile tuning points because the real
send is driven by calling sendto(). Enlarging this value a little could
save many sendto() calls. As for the uAPI, I don't know if it's worth
it, sorry. If not, the previous version 2 patch (regarding a per-netns
policy) will be revived.

So I will leave those two questions to XSK experts as well.

>
> Jason, I think some additions to Documentation/ and quantification of
> the benefits would be needed as well.

Got it.

#1 Documentation. I would add a small section 'XDP_MAX_TX_BUDGET
setsockopt' to Documentation/networking/af_xdp.rst.
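
A rough sketch of what that section could say (draft wording only):

	XDP_MAX_TX_BUDGET setsockopt
	----------------------------

	Sets the maximum number of tx descriptors an AF_XDP socket may
	process per sendto() call in copy mode, and the per-socket budget
	used for fairness between sockets sharing a pool in zero-copy
	mode. The default is 32. A larger value reduces the number of
	sendto() calls needed to drain a heavily batched tx ring, at the
	cost of fairness towards other sockets.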

#2 Quantification
It's really hard to do, mainly because of the variety of stacks
implemented in user space. AF_XDP only provides the fundamental
mechanism; the layers built on top of it are diverse.

Thanks,
Jason

> --
> pw-bot: cr

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-19 23:53   ` Jason Xing
  2025-06-20  0:02     ` Jason Xing
@ 2025-06-20 13:43     ` Willem de Bruijn
  2025-06-20 13:58       ` Willem de Bruijn
From: Willem de Bruijn @ 2025-06-20 13:43 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

Jason Xing wrote:
> On Thu, Jun 19, 2025 at 9:53 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > The patch does the following things:
> > > - Add XDP_MAX_TX_BUDGET socket option.
> > > - Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into single one
> > >   tx_budget_spent.
> > > - tx_budget_spent is set to 32 by default in the initialization phase.
> > >   It's a per-socket granular control.
> > >
> > > The idea behind this comes out of real workloads in production. We use a
> > > user-level stack with xsk support to accelerate sending packets and
> > > minimize triggering syscall. When the packets are aggregated, it's not
> > > hard to hit the upper bound (namely, 32). The moment user-space stack
> > > fetches the -EAGAIN error number passed from sendto(), it will loop to try
> > > again until all the expected descs from tx ring are sent out to the driver.
> > > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequencies of
> > > sendto(). Besides, applications leveraging this setsockopt can adjust
> > > its proper value in time after noticing the upper bound issue happening.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > V3
> > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > > 1. use a per-socket control (suggested by Stanislav)
> > > 2. unify both definitions into one
> > > 3. support setsockopt and getsockopt
> > > 4. add more description in commit message
> >
> > +1 on an XSK setsockopt only
> 
> May I ask why only setsockopt? In tradition, dev_tx_weight can be read
> and written through running sysctl. I think they are the same?

This is not dev_tx_weight, which is per device.

This is a per-socket choice. Per the rationale you gave for adding it,
a specific application that is known to be able to batch more than 32
can tune this knob in the application itself.

I see no immediate need to set this at a per-netns or global level.
If so, the extra cacheline space in those structs is not warranted.

> >
> > >
> > > V2
> > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > > 1. use a per-netns sysctl knob
> > > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > > ---
> > >  include/net/xdp_sock.h            |  3 ++-
> > >  include/uapi/linux/if_xdp.h       |  1 +
> > >  net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
> > >  tools/include/uapi/linux/if_xdp.h |  1 +
> > >  4 files changed, 34 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index e8bd6ddb7b12..8eecafad92c0 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -65,11 +65,12 @@ struct xdp_sock {
> > >       struct xsk_queue *tx ____cacheline_aligned_in_smp;
> > >       struct list_head tx_list;
> > >       /* record the number of tx descriptors sent by this xsk and
> > > -      * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
> > > +      * when it exceeds max_tx_budget, an opportunity needs
> > >        * to be given to other xsks for sending tx descriptors, thereby
> > >        * preventing other XSKs from being starved.
> > >        */
> > >       u32 tx_budget_spent;
> > > +     u32 max_tx_budget;
> >
> > This probably does not need to be a u32?
> 
> From what I've known, it's not possible to set a very large value like
> 1000 which probably brings side effects.
> 
> But it seems we'd better not limit the use of this max_tx_budget? We
> don't know what the best fit for various use cases is. If the type
> needs to be downsized to a smaller one like u16, another related
> consideration is that dev_tx_weight deserves the same treatment?

If the current constant is 32, is U16_MAX really a limiting factor?
See also the next point.

> Or let me adjust to 'int' then?
> > > @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > >               mutex_unlock(&xs->mutex);
> > >               return err;
> > >       }
> > > +     case XDP_MAX_TX_BUDGET:
> > > +     {
> > > +             unsigned int budget;
> > > +
> > > +             if (optlen < sizeof(budget))
> > > +                     return -EINVAL;
> > > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > > +                     return -EFAULT;
> > > +
> > > +             WRITE_ONCE(xs->max_tx_budget, budget);
> >
> > Sanitize input: bounds check
> 
> Thanks for catching this.
> 
> I will change it like this:
>     WRITE_ONCE(xs->max_tx_budget, min_t(int, budget, INT_MAX));?

INT_MAX is not a valid upper bound. The current constant is 32.
I would expect an upper bound to perhaps be a few orders larger.


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20  0:17   ` Jason Xing
@ 2025-06-20 13:50     ` Willem de Bruijn
  2025-06-20 15:03       ` Jason Xing
  2025-06-21 14:43     ` Jakub Kicinski
From: Willem de Bruijn @ 2025-06-20 13:50 UTC (permalink / raw)
  To: Jason Xing, Jakub Kicinski
  Cc: davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Thu, Jun 19, 2025 at 11:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > >       rcu_read_lock();
> > >  again:
> > >       list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > -             if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > +             int max_budget = READ_ONCE(xs->max_tx_budget);
> > > +
> > > +             if (xs->tx_budget_spent >= max_budget) {
> > >                       budget_exhausted = true;
> > >                       continue;
> > >               }
> > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >  static int __xsk_generic_xmit(struct sock *sk)
> > >  {
> > >       struct xdp_sock *xs = xdp_sk(sk);
> > > -     u32 max_batch = TX_BATCH_SIZE;
> > > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> >
> > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > two max values / code paths really related? Question 2 -- is generic
> > XSK a legit optimization target, legit enough to add uAPI?
> 
> I'm not an expert but my take is:
> #1, I don't see the correlation actually while I don't see any reason
> to use the different values for both of them.
> #2, These two definitions are improvement points because whether to do
> the real send is driven by calling sendto(). Enlarging a little bit of
> this value could save many times of calling sendto(). As for the uAPI,
> I don't know if it's worth it, sorry. If not, the previous version 2
> patch (regarding per-netns policy) will be revived.
> 
> So I will leave those two questions to XSK experts as well.

You're proposing the code change, so I think it's on you to make
this argument?
 
> #2 quantification
> It's really hard to do so mainly because of various stacks implemented
> in the user-space. AF_XDP is providing a fundamental mechanism only
> and its upper layer is prosperous.

I think it's a hard sell to argue for adding a tunable if no plausible
recommendation can be given on how the tunable is to be used.

It's not necessary, and in most cases infeasible, to give a heuristic
that fits all possible users. But give one at a minimum for the
workload that prompted the patch: what value do you set it to, and how
did you arrive at that number?


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 13:43     ` Willem de Bruijn
@ 2025-06-20 13:58       ` Willem de Bruijn
  2025-06-20 14:37         ` Jason Xing
From: Willem de Bruijn @ 2025-06-20 13:58 UTC (permalink / raw)
  To: Willem de Bruijn, Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

Willem de Bruijn wrote:
> Jason Xing wrote:
> > On Thu, Jun 19, 2025 at 9:53 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > The patch does the following things:
> > > > - Add XDP_MAX_TX_BUDGET socket option.
> > > > - Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into single one
> > > >   tx_budget_spent.
> > > > - tx_budget_spent is set to 32 by default in the initialization phase.
> > > >   It's a per-socket granular control.
> > > >
> > > > The idea behind this comes out of real workloads in production. We use a
> > > > user-level stack with xsk support to accelerate sending packets and
> > > > minimize triggering syscall. When the packets are aggregated, it's not
> > > > hard to hit the upper bound (namely, 32). The moment user-space stack
> > > > fetches the -EAGAIN error number passed from sendto(), it will loop to try
> > > > again until all the expected descs from tx ring are sent out to the driver.
> > > > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequencies of
> > > > sendto(). Besides, applications leveraging this setsockopt can adjust
> > > > its proper value in time after noticing the upper bound issue happening.
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > V3
> > > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > > > 1. use a per-socket control (suggested by Stanislav)
> > > > 2. unify both definitions into one
> > > > 3. support setsockopt and getsockopt
> > > > 4. add more description in commit message
> > >
> > > +1 on an XSK setsockopt only
> > 
> > May I ask why only setsockopt? In tradition, dev_tx_weight can be read
> > and written through running sysctl. I think they are the same?
> 
> This is not dev_tx_weight, which is per device.
> 
> This is a per-socket choice. The reason for adding it that you gave,
> a specific application that is known to be able to batch more than 32,
> can tune this configurable in the application.
> 
> I see no immediately need to set this at a per netns or global level.
> If so, the extra cacheline space in those structs is not warranted.
> 
> > >
> > > >
> > > > V2
> > > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > > > 1. use a per-netns sysctl knob
> > > > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > > > ---
> > > >  include/net/xdp_sock.h            |  3 ++-
> > > >  include/uapi/linux/if_xdp.h       |  1 +
> > > >  net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
> > > >  tools/include/uapi/linux/if_xdp.h |  1 +
> > > >  4 files changed, 34 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > index e8bd6ddb7b12..8eecafad92c0 100644
> > > > --- a/include/net/xdp_sock.h
> > > > +++ b/include/net/xdp_sock.h
> > > > @@ -65,11 +65,12 @@ struct xdp_sock {
> > > >       struct xsk_queue *tx ____cacheline_aligned_in_smp;
> > > >       struct list_head tx_list;
> > > >       /* record the number of tx descriptors sent by this xsk and
> > > > -      * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
> > > > +      * when it exceeds max_tx_budget, an opportunity needs
> > > >        * to be given to other xsks for sending tx descriptors, thereby
> > > >        * preventing other XSKs from being starved.
> > > >        */
> > > >       u32 tx_budget_spent;
> > > > +     u32 max_tx_budget;
> > >
> > > This probably does not need to be a u32?
> > 
> > From what I've known, it's not possible to set a very large value like
> > 1000 which probably brings side effects.
> > 
> > But it seems we'd better not limit the use of this max_tx_budget? We
> > don't know what the best fit for various use cases is. If the type
> > needs to be downsized to a smaller one like u16, another related
> > consideration is that dev_tx_weight deserves the same treatment?
> 
> If the current constant is 32, is U16_MAX really a limiting factor.
> See also the next point.
> 
> > Or let me adjust to 'int' then?
> > > > @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > > >               mutex_unlock(&xs->mutex);
> > > >               return err;
> > > >       }
> > > > +     case XDP_MAX_TX_BUDGET:
> > > > +     {
> > > > +             unsigned int budget;
> > > > +
> > > > +             if (optlen < sizeof(budget))
> > > > +                     return -EINVAL;
> > > > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > > > +                     return -EFAULT;
> > > > +
> > > > +             WRITE_ONCE(xs->max_tx_budget, budget);
> > >
> > > Sanitize input: bounds check
> > 
> > Thanks for catching this.
> > 
> > I will change it like this:
> >     WRITE_ONCE(xs->max_tx_budget, min_t(int, budget, INT_MAX));?
> 
> INT_MAX is not a valid upper bound. The current constant is 32.
> I would expect an upper bound to perhaps be a few orders larger.

And this would need a clamp to also set a lower bound greater than 0.
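
E.g. something like this in the setsockopt handler (the bounds below
are placeholders to show the shape of the check, not a recommendation):

	/* hypothetical bounds: at least 1, upper cap to be decided */
	budget = clamp_t(u32, budget, 1, 4096);
	WRITE_ONCE(xs->max_tx_budget, budget);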


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-19 15:09 ` Jakub Kicinski
  2025-06-20  0:17   ` Jason Xing
@ 2025-06-20 14:25   ` Stanislav Fomichev
  2025-06-20 16:30     ` Jason Xing
  2025-06-20 22:20     ` Willem de Bruijn
From: Stanislav Fomichev @ 2025-06-20 14:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jason Xing, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On 06/19, Jakub Kicinski wrote:
> On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> >  	rcu_read_lock();
> >  again:
> >  	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > -		if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > +		int max_budget = READ_ONCE(xs->max_tx_budget);
> > +
> > +		if (xs->tx_budget_spent >= max_budget) {
> >  			budget_exhausted = true;
> >  			continue;
> >  		}
> > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >  static int __xsk_generic_xmit(struct sock *sk)
> >  {
> >  	struct xdp_sock *xs = xdp_sk(sk);
> > -	u32 max_batch = TX_BATCH_SIZE;
> > +	u32 max_budget = READ_ONCE(xs->max_tx_budget);
> 
> Hm, maybe a question to Stan / Willem & other XSK experts but are these
> two max values / code paths really related? Question 2 -- is generic
> XSK a legit optimization target, legit enough to add uAPI?

1) xsk_tx_peek_desc is for the zc case and xsk_build_skb is for copy
mode; whether we want to affect the zc case, given that Jason seemingly
cares about copy mode, is a good question.

2) I do find it surprising as well. Recent busy polling patches were
also using/targeting copy mode. But from my pov, if people use it, I see
no reason not to make it more usable.


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 13:58       ` Willem de Bruijn
@ 2025-06-20 14:37         ` Jason Xing
  2025-06-20 22:21           ` Willem de Bruijn
From: Jason Xing @ 2025-06-20 14:37 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

On Fri, Jun 20, 2025 at 9:58 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Willem de Bruijn wrote:
> > Jason Xing wrote:
> > > On Thu, Jun 19, 2025 at 9:53 PM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > >
> > > > Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > The patch does the following things:
> > > > > - Add XDP_MAX_TX_BUDGET socket option.
> > > > > - Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into single one
> > > > >   tx_budget_spent.
> > > > > - tx_budget_spent is set to 32 by default in the initialization phase.
> > > > >   It's a per-socket granular control.
> > > > >
> > > > > The idea behind this comes out of real workloads in production. We use a
> > > > > user-level stack with xsk support to accelerate sending packets and
> > > > > minimize triggering syscall. When the packets are aggregated, it's not
> > > > > hard to hit the upper bound (namely, 32). The moment user-space stack
> > > > > fetches the -EAGAIN error number passed from sendto(), it will loop to try
> > > > > again until all the expected descs from tx ring are sent out to the driver.
> > > > > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequencies of
> > > > > sendto(). Besides, applications leveraging this setsockopt can adjust
> > > > > its proper value in time after noticing the upper bound issue happening.
> > > > >
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > > V3
> > > > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > > > > 1. use a per-socket control (suggested by Stanislav)
> > > > > 2. unify both definitions into one
> > > > > 3. support setsockopt and getsockopt
> > > > > 4. add more description in commit message
> > > >
> > > > +1 on an XSK setsockopt only
> > >
> > > May I ask why only setsockopt? In tradition, dev_tx_weight can be read
> > > and written through running sysctl. I think they are the same?
> >
> > This is not dev_tx_weight, which is per device.
> >
> > This is a per-socket choice. The reason for adding it that you gave,
> > a specific application that is known to be able to batch more than 32,
> > can tune this configurable in the application.

I was thinking a get/set pair is needed, like some existing options I'm
familiar with such as TCP_RTO_MAX_MS. As I said, it's just a feeling.

Okay, I have no strong opinion on this. I will remove it then.

> >
> > I see no immediately need to set this at a per netns or global level.
> > If so, the extra cacheline space in those structs is not warranted.
> >
> > > >
> > > > >
> > > > > V2
> > > > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > > > > 1. use a per-netns sysctl knob
> > > > > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > > > > ---
> > > > >  include/net/xdp_sock.h            |  3 ++-
> > > > >  include/uapi/linux/if_xdp.h       |  1 +
> > > > >  net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
> > > > >  tools/include/uapi/linux/if_xdp.h |  1 +
> > > > >  4 files changed, 34 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > > index e8bd6ddb7b12..8eecafad92c0 100644
> > > > > --- a/include/net/xdp_sock.h
> > > > > +++ b/include/net/xdp_sock.h
> > > > > @@ -65,11 +65,12 @@ struct xdp_sock {
> > > > >       struct xsk_queue *tx ____cacheline_aligned_in_smp;
> > > > >       struct list_head tx_list;
> > > > >       /* record the number of tx descriptors sent by this xsk and
> > > > > -      * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
> > > > > +      * when it exceeds max_tx_budget, an opportunity needs
> > > > >        * to be given to other xsks for sending tx descriptors, thereby
> > > > >        * preventing other XSKs from being starved.
> > > > >        */
> > > > >       u32 tx_budget_spent;
> > > > > +     u32 max_tx_budget;
> > > >
> > > > This probably does not need to be a u32?
> > >
> > > From what I've known, it's not possible to set a very large value like
> > > 1000 which probably brings side effects.
> > >
> > > But it seems we'd better not limit the use of this max_tx_budget? We
> > > don't know what the best fit for various use cases is. If the type
> > > needs to be downsized to a smaller one like u16, another related
> > > consideration is that dev_tx_weight deserves the same treatment?
> >
> > If the current constant is 32, is U16_MAX really a limiting factor.
> > See also the next point.
> >
> > > Or let me adjust to 'int' then?
> > > > > @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > > > >               mutex_unlock(&xs->mutex);
> > > > >               return err;
> > > > >       }
> > > > > +     case XDP_MAX_TX_BUDGET:
> > > > > +     {
> > > > > +             unsigned int budget;
> > > > > +
> > > > > +             if (optlen < sizeof(budget))
> > > > > +                     return -EINVAL;
> > > > > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > > > > +                     return -EFAULT;
> > > > > +
> > > > > +             WRITE_ONCE(xs->max_tx_budget, budget);
> > > >
> > > > Sanitize input: bounds check
> > >
> > > Thanks for catching this.
> > >
> > > I will change it like this:
> > >     WRITE_ONCE(xs->max_tx_budget, min_t(int, budget, INT_MAX));?
> >
> > INT_MAX is not a valid upper bound. The current constant is 32.
> > I would expect an upper bound to perhaps be a few orders larger.
>
> And this would need a clamp to also set a lower bound greater than 0.

Sorry, I don't fully follow here. I'm worried I may be understanding
it the wrong way.

In this patch, max_tx_budget is u32. If we do this:
        case XDP_MAX_TX_BUDGET:
        {
                unsigned int budget;

                /* this check filters out mismatched optlen values, right? */
                if (optlen != sizeof(budget))
                        return -EINVAL;
                if (copy_from_sockptr(&budget, optval, sizeof(budget)))
                        return -EFAULT;

                WRITE_ONCE(xs->max_tx_budget, budget);

                return 0;
        }
I'm wondering whether that is sufficient, since u32 already guarantees
zero as the lower bound?

Or are you suggesting something like this:
        case XDP_MAX_TX_BUDGET:
        {
                unsigned int budget;

                if (optlen != sizeof(budget))
                        return -EINVAL;
                if (copy_from_sockptr(&budget, optval, sizeof(budget)))
                        return -EFAULT;

                WRITE_ONCE(xs->max_tx_budget,
                           clamp_t(unsigned int, budget, 0, U16_MAX));
                return 0;
        }
?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 13:50     ` Willem de Bruijn
@ 2025-06-20 15:03       ` Jason Xing
  2025-06-20 22:24         ` Willem de Bruijn
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Xing @ 2025-06-20 15:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

On Fri, Jun 20, 2025 at 9:50 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Thu, Jun 19, 2025 at 11:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > >       rcu_read_lock();
> > > >  again:
> > > >       list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > -             if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > +             int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > +
> > > > +             if (xs->tx_budget_spent >= max_budget) {
> > > >                       budget_exhausted = true;
> > > >                       continue;
> > > >               }
> > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > >  {
> > > >       struct xdp_sock *xs = xdp_sk(sk);
> > > > -     u32 max_batch = TX_BATCH_SIZE;
> > > > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > >
> > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > two max values / code paths really related? Question 2 -- is generic
> > > XSK a legit optimization target, legit enough to add uAPI?
> >
> > I'm not an expert but my take is:
> > #1, I don't see the correlation actually while I don't see any reason
> > to use the different values for both of them.
> > #2, These two definitions are improvement points because whether to do
> > the real send is driven by calling sendto(). Enlarging a little bit of
> > this value could save many times of calling sendto(). As for the uAPI,
> > I don't know if it's worth it, sorry. If not, the previous version 2
> > patch (regarding per-netns policy) will be revived.
> >
> > So I will leave those two questions to XSK experts as well.
>
> You're proposing the code change, so I think it's on you to make
> this argument?
>
> > #2 quantification
> > It's really hard to do so mainly because of various stacks implemented
> > in the user-space. AF_XDP is providing a fundamental mechanism only
> > and its upper layer is prosperous.
>
> I think it's a hard sell to argue adding a tunable, if no plausible
> recommendation can be given on how the tunable is to be used.

Actually I mentioned it in the commit message: one of the advantages
is fewer sendto() calls and overall higher transmission speed.

>
> It's not necessary, and most cases infeasible, to give a heuristic
> that fits all possible users. But at a minimum the one workload that
> prompted the patch. What value do you set it to and how did you
> arrive at that number?

One naive question from me: why must the number of packets to be sent
be limited to a small number by default? Take TCP as an example: a
single sendmsg() call is not cut short by a hardcoded limit like this.

For one application I saw, I suggested using 128 because, without
changing any default configuration, I saw two limits being hit: 1)
XDP_MAX_TX_BUDGET, 2) the socket sndbuf, which is 212992 as decided by
net.core.wmem_default. As to XDP_MAX_TX_BUDGET, with the [1] patch I
counted how many descs are passed to the driver per sendto() call and
then calculated the probability of hitting the upper bound. I finally
chose 128 as a suitable value because 1) it covers most of the cases,
and 2) a higher number did not bring evident gains.

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
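
For illustration, a minimal user-space sketch of the tuning described
above (xsk_fd is assumed to be an already-created AF_XDP socket; error
handling is elided):

        unsigned int budget = 128;      /* the value that worked for this workload */

        if (setsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET,
                       &budget, sizeof(budget)) < 0)
                perror("setsockopt(XDP_MAX_TX_BUDGET)");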

Thanks,
Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 14:25   ` Stanislav Fomichev
@ 2025-06-20 16:30     ` Jason Xing
  2025-06-20 16:47       ` Stanislav Fomichev
  2025-06-20 22:20     ` Willem de Bruijn
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Xing @ 2025-06-20 16:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
<stfomichev@gmail.com> wrote:
>
> On 06/19, Jakub Kicinski wrote:
> > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > >     rcu_read_lock();
> > >  again:
> > >     list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > -           if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > +           int max_budget = READ_ONCE(xs->max_tx_budget);
> > > +
> > > +           if (xs->tx_budget_spent >= max_budget) {
> > >                     budget_exhausted = true;
> > >                     continue;
> > >             }
> > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >  static int __xsk_generic_xmit(struct sock *sk)
> > >  {
> > >     struct xdp_sock *xs = xdp_sk(sk);
> > > -   u32 max_batch = TX_BATCH_SIZE;
> > > +   u32 max_budget = READ_ONCE(xs->max_tx_budget);
> >
> > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > two max values / code paths really related? Question 2 -- is generic
> > XSK a legit optimization target, legit enough to add uAPI?
>
> 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> whether we want to affect zc case given the fact that Jason seemingly
> cares about copy mode is a good question.

Allow me to ask a similar question to the one you asked me before: even
though I didn't see the necessity of setting the max budget for zc mode
(just because I didn't spot it happening), would it be better to keep
the two separate, since this is a uAPI interface? IIUC, once the
setsockopt is merged, we will not be able to separate them later?

We could keep using the hardcoded value (32) in zc mode as before and
__only__ touch the copy mode, as sketched below. Later, if someone (or
I) finds it worthwhile to make zc mode tunable too, another setsockopt
parameter can be added. Does that make sense?
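
Roughly, a sketch of that split against the v3 diff (untested, just to
show the idea):

        /* zc path in xsk_tx_peek_desc(): keep the hardcoded constant */
        if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
                budget_exhausted = true;
                continue;
        }

        /* copy path in __xsk_generic_xmit(): honour the per-socket value */
        u32 max_budget = READ_ONCE(xs->max_tx_budget);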

Thanks,
Jason

>
> 2) I do find it surprising as well. Recent busy polling patches were
> also using/targeting copy mode. But from my pow, if people use it, I see
> no reason to make it more usable.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 16:30     ` Jason Xing
@ 2025-06-20 16:47       ` Stanislav Fomichev
  2025-06-20 17:46         ` Jason Xing
  0 siblings, 1 reply; 26+ messages in thread
From: Stanislav Fomichev @ 2025-06-20 16:47 UTC (permalink / raw)
  To: Jason Xing
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On 06/21, Jason Xing wrote:
> On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> <stfomichev@gmail.com> wrote:
> >
> > On 06/19, Jakub Kicinski wrote:
> > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > >     rcu_read_lock();
> > > >  again:
> > > >     list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > -           if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > +           int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > +
> > > > +           if (xs->tx_budget_spent >= max_budget) {
> > > >                     budget_exhausted = true;
> > > >                     continue;
> > > >             }
> > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > >  {
> > > >     struct xdp_sock *xs = xdp_sk(sk);
> > > > -   u32 max_batch = TX_BATCH_SIZE;
> > > > +   u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > >
> > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > two max values / code paths really related? Question 2 -- is generic
> > > XSK a legit optimization target, legit enough to add uAPI?
> >
> > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > whether we want to affect zc case given the fact that Jason seemingly
> > cares about copy mode is a good question.
> 
> Allow me to ask the similar question that you asked me before: even though I
> didn't see the necessity to set the max budget for zc mode (just
> because I didn't spot it happening), would it be better if we separate
> both of them because it's an uAPI interface. IIUC, if the setsockopt
> is set, we will not separate it any more in the future?
> 
> We can keep using the hardcoded value (32) in the zc mode like
> before and __only__ touch the copy mode? Later if someone or I found
> the significance of making it tunable, then another parameter of
> setsockopt can be added? Does it make sense?

Related suggestion: maybe we don't need this limit at all for the copy mode?
If the user, with a socket option, can arbitrarily change it, what is the
point of the limit? Keep it on the zc side to make sure one socket doesn't
starve the rest, and drop it from the copy mode. Any reason not to do it?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 16:47       ` Stanislav Fomichev
@ 2025-06-20 17:46         ` Jason Xing
  2025-06-23 14:18           ` Stanislav Fomichev
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Xing @ 2025-06-20 17:46 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Sat, Jun 21, 2025 at 12:47 AM Stanislav Fomichev
<stfomichev@gmail.com> wrote:
>
> On 06/21, Jason Xing wrote:
> > On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> > <stfomichev@gmail.com> wrote:
> > >
> > > On 06/19, Jakub Kicinski wrote:
> > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > >     rcu_read_lock();
> > > > >  again:
> > > > >     list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > -           if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > +           int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > +
> > > > > +           if (xs->tx_budget_spent >= max_budget) {
> > > > >                     budget_exhausted = true;
> > > > >                     continue;
> > > > >             }
> > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > >  {
> > > > >     struct xdp_sock *xs = xdp_sk(sk);
> > > > > -   u32 max_batch = TX_BATCH_SIZE;
> > > > > +   u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > >
> > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > two max values / code paths really related? Question 2 -- is generic
> > > > XSK a legit optimization target, legit enough to add uAPI?
> > >
> > > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > > whether we want to affect zc case given the fact that Jason seemingly
> > > cares about copy mode is a good question.
> >
> > Allow me to ask the similar question that you asked me before: even though I
> > didn't see the necessity to set the max budget for zc mode (just
> > because I didn't spot it happening), would it be better if we separate
> > both of them because it's an uAPI interface. IIUC, if the setsockopt
> > is set, we will not separate it any more in the future?
> >
> > We can keep using the hardcoded value (32) in the zc mode like
> > before and __only__ touch the copy mode? Later if someone or I found
> > the significance of making it tunable, then another parameter of
> > setsockopt can be added? Does it make sense?
>
> Related suggestion: maybe we don't need this limit at all for the copy mode?
> If the user, with a socket option, can arbitrarily change it, what is the
> point of this limit? Keep it on the zc side to make sure one socket doesn't
> starve the rest and drop from the copy mode.. Any reason not to do it?

Thanks for bringing up the same question I had in this thread. I saw
that commit [1] mentions the limit is used to avoid bursts, as DPDK
does, so my thought is that it might be there to prevent the case
where multiple sockets try to send packets through a shared umem at
nearly the same time?

Making it tunable gives users a chance to seek out the value that best
fits their workload. It doesn't mean we allow/expect to see the burst
situation.

[1]
commit e7a1c1300891d8f11d05b42665e299cc22a4b383
Author: Li RongQing <lirongqing@baidu.com>
Date:   Wed Apr 14 13:39:12 2021 +0800

    xsk: Align XDP socket batch size with DPDK

    DPDK default burst size is 32, however, kernel xsk sendto
    syscall can not handle all 32 at one time, and return with
    error.

    So make kernel XDP socket batch size larger to avoid
    unnecessary syscall fail and context switch which will help
    to increase performance.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 14:25   ` Stanislav Fomichev
  2025-06-20 16:30     ` Jason Xing
@ 2025-06-20 22:20     ` Willem de Bruijn
  2025-06-21  1:06       ` Jason Xing
  1 sibling, 1 reply; 26+ messages in thread
From: Willem de Bruijn @ 2025-06-20 22:20 UTC (permalink / raw)
  To: Stanislav Fomichev, Jakub Kicinski
  Cc: Jason Xing, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing, magnus.karlsson, skhawaja

Stanislav Fomichev wrote:
> On 06/19, Jakub Kicinski wrote:
> > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > >  	rcu_read_lock();
> > >  again:
> > >  	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > -		if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > +		int max_budget = READ_ONCE(xs->max_tx_budget);
> > > +
> > > +		if (xs->tx_budget_spent >= max_budget) {
> > >  			budget_exhausted = true;
> > >  			continue;
> > >  		}
> > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >  static int __xsk_generic_xmit(struct sock *sk)
> > >  {
> > >  	struct xdp_sock *xs = xdp_sk(sk);
> > > -	u32 max_batch = TX_BATCH_SIZE;
> > > +	u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > 
> > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > two max values / code paths really related? Question 2 -- is generic
> > XSK a legit optimization target, legit enough to add uAPI?
> 
> 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> whether we want to affect zc case given the fact that Jason seemingly
> cares about copy mode is a good question.

The two constants seem to be only tangentially related.

If there is fear that one tunable modifies both, it is simple enough
to remove the unnecessary dependency and only tune the first.
 
> 2) I do find it surprising as well. Recent busy polling patches were
> also using/targeting copy mode. But from my pow, if people use it, I see
> no reason to make it more usable.

That's a very fair question.

Jason, have you tried XDP_ZEROCOPY? It's quite plausible that that
would address your issue.

I have had a use for copy mode, but that was rather obscure: a small
packet workload where copy cost is negligible, and with copy mode it
was easy to reinstate flow steering in XDP to specific XSKs, akin to

https://lore.kernel.org/netdev/65c0f032ac71a_7396029419@willemb.c.googlers.com.notmuch/

The main issue with that remained that driver copy mode also implies
the slower generic copy based Tx path, which goes through the full
dev_queue_xmit stack.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 14:37         ` Jason Xing
@ 2025-06-20 22:21           ` Willem de Bruijn
  0 siblings, 0 replies; 26+ messages in thread
From: Willem de Bruijn @ 2025-06-20 22:21 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

Jason Xing wrote:
> On Fri, Jun 20, 2025 at 9:58 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Willem de Bruijn wrote:
> > > Jason Xing wrote:
> > > > On Thu, Jun 19, 2025 at 9:53 PM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > >
> > > > > Jason Xing wrote:
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > The patch does the following things:
> > > > > > - Add XDP_MAX_TX_BUDGET socket option.
> > > > > > - Unify TX_BATCH_SIZE and MAX_PER_SOCKET_BUDGET into single one
> > > > > >   tx_budget_spent.
> > > > > > - tx_budget_spent is set to 32 by default in the initialization phase.
> > > > > >   It's a per-socket granular control.
> > > > > >
> > > > > > The idea behind this comes out of real workloads in production. We use a
> > > > > > user-level stack with xsk support to accelerate sending packets and
> > > > > > minimize triggering syscall. When the packets are aggregated, it's not
> > > > > > hard to hit the upper bound (namely, 32). The moment user-space stack
> > > > > > fetches the -EAGAIN error number passed from sendto(), it will loop to try
> > > > > > again until all the expected descs from tx ring are sent out to the driver.
> > > > > > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequencies of
> > > > > > sendto(). Besides, applications leveraging this setsockopt can adjust
> > > > > > its proper value in time after noticing the upper bound issue happening.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > > V3
> > > > > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > > > > > 1. use a per-socket control (suggested by Stanislav)
> > > > > > 2. unify both definitions into one
> > > > > > 3. support setsockopt and getsockopt
> > > > > > 4. add more description in commit message
> > > > >
> > > > > +1 on an XSK setsockopt only
> > > >
> > > > May I ask why only setsockopt? In tradition, dev_tx_weight can be read
> > > > and written through running sysctl. I think they are the same?
> > >
> > > This is not dev_tx_weight, which is per device.
> > >
> > > This is a per-socket choice. The reason for adding it that you gave,
> > > a specific application that is known to be able to batch more than 32,
> > > can tune this configurable in the application.
> 
> I was thinking a pair is needed like some existing options I'm
> familiar with like TCP_RTO_MAX_MS. As I said, it's just a feeling.
> 
> Okay, I have no strong opinion on this. I will remove it then.
> 
> > >
> > > I see no immediately need to set this at a per netns or global level.
> > > If so, the extra cacheline space in those structs is not warranted.
> > >
> > > > >
> > > > > >
> > > > > > V2
> > > > > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > > > > > 1. use a per-netns sysctl knob
> > > > > > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > > > > > ---
> > > > > >  include/net/xdp_sock.h            |  3 ++-
> > > > > >  include/uapi/linux/if_xdp.h       |  1 +
> > > > > >  net/xdp/xsk.c                     | 36 +++++++++++++++++++++++++------
> > > > > >  tools/include/uapi/linux/if_xdp.h |  1 +
> > > > > >  4 files changed, 34 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > > > index e8bd6ddb7b12..8eecafad92c0 100644
> > > > > > --- a/include/net/xdp_sock.h
> > > > > > +++ b/include/net/xdp_sock.h
> > > > > > @@ -65,11 +65,12 @@ struct xdp_sock {
> > > > > >       struct xsk_queue *tx ____cacheline_aligned_in_smp;
> > > > > >       struct list_head tx_list;
> > > > > >       /* record the number of tx descriptors sent by this xsk and
> > > > > > -      * when it exceeds MAX_PER_SOCKET_BUDGET, an opportunity needs
> > > > > > +      * when it exceeds max_tx_budget, an opportunity needs
> > > > > >        * to be given to other xsks for sending tx descriptors, thereby
> > > > > >        * preventing other XSKs from being starved.
> > > > > >        */
> > > > > >       u32 tx_budget_spent;
> > > > > > +     u32 max_tx_budget;
> > > > >
> > > > > This probably does not need to be a u32?
> > > >
> > > > From what I've known, it's not possible to set a very large value like
> > > > 1000 which probably brings side effects.
> > > >
> > > > But it seems we'd better not limit the use of this max_tx_budget? We
> > > > don't know what the best fit for various use cases is. If the type
> > > > needs to be downsized to a smaller one like u16, another related
> > > > consideration is that dev_tx_weight deserves the same treatment?
> > >
> > > If the current constant is 32, is U16_MAX really a limiting factor.
> > > See also the next point.
> > >
> > > > Or let me adjust to 'int' then?
> > > > > > @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > > > > >               mutex_unlock(&xs->mutex);
> > > > > >               return err;
> > > > > >       }
> > > > > > +     case XDP_MAX_TX_BUDGET:
> > > > > > +     {
> > > > > > +             unsigned int budget;
> > > > > > +
> > > > > > +             if (optlen < sizeof(budget))
> > > > > > +                     return -EINVAL;
> > > > > > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > > > > > +                     return -EFAULT;
> > > > > > +
> > > > > > +             WRITE_ONCE(xs->max_tx_budget, budget);
> > > > >
> > > > > Sanitize input: bounds check
> > > >
> > > > Thanks for catching this.
> > > >
> > > > I will change it like this:
> > > >     WRITE_ONCE(xs->max_tx_budget, min_t(int, budget, INT_MAX));?
> > >
> > > INT_MAX is not a valid upper bound. The current constant is 32.
> > > I would expect an upper bound to perhaps be a few orders larger.
> >
> > And this would need a clamp to also set a lower bound greater than 0.
> 
> Sorry, I don't fully follow here. I'm worried if I understand it in
> the wrong way.
> 
> In this patch, max_tx_budget is u32. If we're doing this this:
>         case XDP_MAX_TX_BUDGET:
>         {
>                 unsigned int budget;
> 
>                 if (optlen != sizeof(budget))    // this line can
> filter out those unmatched numbers, right?
>                         return -EINVAL;
>                 if (copy_from_sockptr(&budget, optval,
> sizeof(budget)))
>                         return -EFAULT;
> 
>                 WRITE_ONCE(xs->max_tx_budget, budget);
> 
>                 return 0;
>         }
> , I'm thinking, is it sufficient because u32 makes sure of zero as its
> lower bound?

The issue is that 0 is not a valid budget.

__xsk_generic_xmit will not make any progress as it will break out
immediately at the start of the while loop.
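
A minimal sketch of the kind of bounds check being suggested (U16_MAX
is only an assumed cap here, not a value settled in this thread):

        case XDP_MAX_TX_BUDGET:
        {
                unsigned int budget;

                if (optlen != sizeof(budget))
                        return -EINVAL;
                if (copy_from_sockptr(&budget, optval, sizeof(budget)))
                        return -EFAULT;

                /* 0 would make no progress, so clamp to [1, U16_MAX] */
                WRITE_ONCE(xs->max_tx_budget,
                           clamp_t(unsigned int, budget, 1, U16_MAX));
                return 0;
        }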

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 15:03       ` Jason Xing
@ 2025-06-20 22:24         ` Willem de Bruijn
  2025-06-21  0:40           ` Jason Xing
  0 siblings, 1 reply; 26+ messages in thread
From: Willem de Bruijn @ 2025-06-20 22:24 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

Jason Xing wrote:
> On Fri, Jun 20, 2025 at 9:50 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > On Thu, Jun 19, 2025 at 11:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > >       rcu_read_lock();
> > > > >  again:
> > > > >       list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > -             if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > +             int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > +
> > > > > +             if (xs->tx_budget_spent >= max_budget) {
> > > > >                       budget_exhausted = true;
> > > > >                       continue;
> > > > >               }
> > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > >  {
> > > > >       struct xdp_sock *xs = xdp_sk(sk);
> > > > > -     u32 max_batch = TX_BATCH_SIZE;
> > > > > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > >
> > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > two max values / code paths really related? Question 2 -- is generic
> > > > XSK a legit optimization target, legit enough to add uAPI?
> > >
> > > I'm not an expert but my take is:
> > > #1, I don't see the correlation actually while I don't see any reason
> > > to use the different values for both of them.
> > > #2, These two definitions are improvement points because whether to do
> > > the real send is driven by calling sendto(). Enlarging a little bit of
> > > this value could save many times of calling sendto(). As for the uAPI,
> > > I don't know if it's worth it, sorry. If not, the previous version 2
> > > patch (regarding per-netns policy) will be revived.
> > >
> > > So I will leave those two questions to XSK experts as well.
> >
> > You're proposing the code change, so I think it's on you to make
> > this argument?
> >
> > > #2 quantification
> > > It's really hard to do so mainly because of various stacks implemented
> > > in the user-space. AF_XDP is providing a fundamental mechanism only
> > > and its upper layer is prosperous.
> >
> > I think it's a hard sell to argue adding a tunable, if no plausible
> > recommendation can be given on how the tunable is to be used.
> 
> Actually I mentioned it in the commit message. One of advantages is to
> contribute to less frequencies of sendto() and overall higher
> transmission speed.

Understood. It is just informative to have more quantitative data:
what value worked for you?
 
> >
> > It's not necessary, and most cases infeasible, to give a heuristic
> > that fits all possible users. But at a minimum the one workload that
> > prompted the patch. What value do you set it to and how did you
> > arrive at that number?
> 
> One naive question from me is why the number of packets to be sent is
> definitely required to be limited within a small number by default?
> Let me set tcp as an example, a simple sendmsg call will not be
> stopped because of the hardcoded limitation.
> 
> For one application I saw, I suggested using 128 because I saw two
> limitations without changing any default configuration: 1)
> XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> this was I counted how many desc are transmitted to the driver at one
> time of sendto() based on [1] patch and then I calculated the
> possibility of hitting the upper bound. Finally I chose 128 as a
> suitable value because 1) it covers most of the cases, 2) a higher
> number would not bring evident results.
> 
> [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

This is indeed helpful context.

Another limiting factor is the XSK TX queue length?

So even if a user passes UINT_MAX, nothing terrible will happen.

Still, it is better to not accept obviously bad input to begin with.

Normal packet processing loops give up control after tens or maybe
a few hundred packets at a time.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 22:24         ` Willem de Bruijn
@ 2025-06-21  0:40           ` Jason Xing
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Xing @ 2025-06-21  0:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, bpf, netdev, Jason Xing

On Sat, Jun 21, 2025 at 6:24 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Fri, Jun 20, 2025 at 9:50 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > On Thu, Jun 19, 2025 at 11:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > >
> > > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > > >       rcu_read_lock();
> > > > > >  again:
> > > > > >       list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > > -             if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > > +             int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > +
> > > > > > +             if (xs->tx_budget_spent >= max_budget) {
> > > > > >                       budget_exhausted = true;
> > > > > >                       continue;
> > > > > >               }
> > > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > > >  {
> > > > > >       struct xdp_sock *xs = xdp_sk(sk);
> > > > > > -     u32 max_batch = TX_BATCH_SIZE;
> > > > > > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > > >
> > > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > > two max values / code paths really related? Question 2 -- is generic
> > > > > XSK a legit optimization target, legit enough to add uAPI?
> > > >
> > > > I'm not an expert but my take is:
> > > > #1, I don't see the correlation actually while I don't see any reason
> > > > to use the different values for both of them.
> > > > #2, These two definitions are improvement points because whether to do
> > > > the real send is driven by calling sendto(). Enlarging a little bit of
> > > > this value could save many times of calling sendto(). As for the uAPI,
> > > > I don't know if it's worth it, sorry. If not, the previous version 2
> > > > patch (regarding per-netns policy) will be revived.
> > > >
> > > > So I will leave those two questions to XSK experts as well.
> > >
> > > You're proposing the code change, so I think it's on you to make
> > > this argument?
> > >
> > > > #2 quantification
> > > > It's really hard to do so mainly because of various stacks implemented
> > > > in the user-space. AF_XDP is providing a fundamental mechanism only
> > > > and its upper layer is prosperous.
> > >
> > > I think it's a hard sell to argue adding a tunable, if no plausible
> > > recommendation can be given on how the tunable is to be used.
> >
> > Actually I mentioned it in the commit message. One of advantages is to
> > contribute to less frequencies of sendto() and overall higher
> > transmission speed.
>
> Understood. It is just informative to have more quantitative data.
> What value worked for you.

I see what you mean. I think I had better add more details like the
following in the next version, to show how I arrived at the chosen
value.

>
> > >
> > > It's not necessary, and most cases infeasible, to give a heuristic
> > > that fits all possible users. But at a minimum the one workload that
> > > prompted the patch. What value do you set it to and how did you
> > > arrive at that number?
> >
> > One naive question from me is why the number of packets to be sent is
> > definitely required to be limited within a small number by default?
> > Let me set tcp as an example, a simple sendmsg call will not be
> > stopped because of the hardcoded limitation.
> >
> > For one application I saw, I suggested using 128 because I saw two
> > limitations without changing any default configuration: 1)
> > XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> > this was I counted how many desc are transmitted to the driver at one
> > time of sendto() based on [1] patch and then I calculated the
> > possibility of hitting the upper bound. Finally I chose 128 as a
> > suitable value because 1) it covers most of the cases, 2) a higher
> > number would not bring evident results.
> >
> > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
>
> This is indeed helpful context.
>
> Another limiting factor is the XSK TX queue length?

Right, using setsockopt(SO_SNDBUF) to increase the queue length can
avoid frequent premature exits from __xsk_generic_xmit(). FYI, the
call trace is:
__xsk_generic_xmit()
    ->xsk_build_skb()
        ->sock_alloc_send_skb()
            -> if (sk_wmem_alloc_get(sk) < READ_ONCE(sk->sk_sndbuf))
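
For illustration, raising the sndbuf from user space would look roughly
like this (the 4x multiplier is only an example value):

        int sndbuf = 4 * 212992;        /* example: 4x net.core.wmem_default */

        if (setsockopt(xsk_fd, SOL_SOCKET, SO_SNDBUF,
                       &sndbuf, sizeof(sndbuf)) < 0)
                perror("setsockopt(SO_SNDBUF)");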

>
> So even if a user passes UINT_MAX, nothing terrible will happen.

Right. And the BQL feature is another possible limit.

>
> Still, it is better to not accept obviously bad input to begin with.

Sure, I can do that. But I wonder what the exact upper bound should
be? It's not easy to set a hard limit.

Another thing: the lower bound you mentioned in the previous email is
what I missed in the current patch. Thanks for the reminder, and
sorry: I forgot to set it to 1 as my first two patches did. At the
least, a lower bound is required, since 0 is explicitly unexpected
behaviour.

I'm about to set _only_ the lower bound in V4 to see if that works for everyone.

Thanks,
Jason

>
> Normal packet processing loops give up control after tens or maybe
> a few hundred packets at a time.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 22:20     ` Willem de Bruijn
@ 2025-06-21  1:06       ` Jason Xing
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Xing @ 2025-06-21  1:06 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Jakub Kicinski, davem, edumazet, pabeni,
	bjorn, magnus.karlsson, maciej.fijalkowski, jonathan.lemon, sdf,
	ast, daniel, hawk, john.fastabend, joe, bpf, netdev, Jason Xing,
	magnus.karlsson, skhawaja

On Sat, Jun 21, 2025 at 6:20 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > On 06/19, Jakub Kicinski wrote:
> > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > >   rcu_read_lock();
> > > >  again:
> > > >   list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > -         if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > +         int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > +
> > > > +         if (xs->tx_budget_spent >= max_budget) {
> > > >                   budget_exhausted = true;
> > > >                   continue;
> > > >           }
> > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > >  {
> > > >   struct xdp_sock *xs = xdp_sk(sk);
> > > > - u32 max_batch = TX_BATCH_SIZE;
> > > > + u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > >
> > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > two max values / code paths really related? Question 2 -- is generic
> > > XSK a legit optimization target, legit enough to add uAPI?
> >
> > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > whether we want to affect zc case given the fact that Jason seemingly
> > cares about copy mode is a good question.
>
> The two constants seem to be only tangentially created.
>
> If there is fear that one tunable modifies both, it is simple enough
> to remove the unnecessary dependency and only tune the first.

Last night I was thinking of only letting the first one take effect
and keeping the zc mode untouched. IIUC, once we publish the uAPI
(setsockopt), we're _not_ allowed to remove the unwanted part when we
spot problems later? If so, I will only make the copy mode honour it :)
If zc mode needs this in the future, another setsockopt can easily be
added. Am I right?

>
> > 2) I do find it surprising as well. Recent busy polling patches were
> > also using/targeting copy mode. But from my pow, if people use it, I see
> > no reason to make it more usable.
>
> That's a very fair question.
>
> Jason, have you tried XDP_ZEROCOPY? It's quite plausible that that
> would address your issue.

Not yet, but I will try and verify maybe at the beginning of next
month as you suggested.

>
> I have had a use for copy mode, but that was rather obscure. A small
> packet workload where copy cost is negligible, and with copy mode it
> was easy to make to reinstate flow steering in XDP to specific XSKs,
> akin to
>
> https://lore.kernel.org/netdev/65c0f032ac71a_7396029419@willemb.c.googlers.com.notmuch/

Interesting. Thanks for the pointer :)

>
> The main issue with that remained that driver copy mode also implies
> the slower generic copy based Tx path, which goes through the full
> dev_queue_xmit stack.

To be more specific, it is not the full dev_queue_xmit stack, as
__xsk_generic_xmit() bypasses the qdisc layer.
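
For reference, the generic path hands the skb straight to the driver
queue, roughly (a sketch of the relevant call, no qdisc involved):

        err = __dev_direct_xmit(skb, xs->queue_id);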

Thanks,
Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20  0:17   ` Jason Xing
  2025-06-20 13:50     ` Willem de Bruijn
@ 2025-06-21 14:43     ` Jakub Kicinski
  2025-06-22  0:05       ` Jason Xing
  1 sibling, 1 reply; 26+ messages in thread
From: Jakub Kicinski @ 2025-06-21 14:43 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Fri, 20 Jun 2025 08:17:48 +0800 Jason Xing wrote:
> > Jason, I think some additions to Documentation/ and quantification of
> > the benefits would be needed as well.  
> 
> Got it.
> 
> #1 Documentation. I would add one small section 'XDP_MAX_TX_BUDGET
> setsockopt' in Documentation/networking/af_xdp.rst.
> 
> #2 quantification
> It's really hard to do so mainly because of various stacks implemented
> in the user-space. AF_XDP is providing a fundamental mechanism only
> and its upper layer is prosperous.

Sorry for the awkward phrase. By "quantification of the benefits"
I meant what performance improvement you have seen from increasing 
this "budget".

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-21 14:43     ` Jakub Kicinski
@ 2025-06-22  0:05       ` Jason Xing
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Xing @ 2025-06-22  0:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Sat, Jun 21, 2025 at 10:43 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 20 Jun 2025 08:17:48 +0800 Jason Xing wrote:
> > > Jason, I think some additions to Documentation/ and quantification of
> > > the benefits would be needed as well.
> >
> > Got it.
> >
> > #1 Documentation. I would add one small section 'XDP_MAX_TX_BUDGET
> > setsockopt' in Documentation/networking/af_xdp.rst.
> >
> > #2 quantification
> > It's really hard to do so mainly because of various stacks implemented
> > in the user-space. AF_XDP is providing a fundamental mechanism only
> > and its upper layer is prosperous.
>
> Sorry for the awkward phrase. By "quantification of the benefits"
> I meant what performance improvement you have seen from increasing
> this "budget".

I see. A week ago, I saw a stable improvement of around 4% in both
PPS and throughput. Special notice: not in all cases, just the one I
traced...

Lower resource consumption was also observed via strace -c -p xxx:
1) %time decreased by 7.8%
2) the error counter decreased from 18367 to 572

I will add the numbers to the commit message next time.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-20 17:46         ` Jason Xing
@ 2025-06-23 14:18           ` Stanislav Fomichev
  2025-06-23 23:54             ` Jason Xing
  0 siblings, 1 reply; 26+ messages in thread
From: Stanislav Fomichev @ 2025-06-23 14:18 UTC (permalink / raw)
  To: Jason Xing
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On 06/21, Jason Xing wrote:
> On Sat, Jun 21, 2025 at 12:47 AM Stanislav Fomichev
> <stfomichev@gmail.com> wrote:
> >
> > On 06/21, Jason Xing wrote:
> > > On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> > > <stfomichev@gmail.com> wrote:
> > > >
> > > > On 06/19, Jakub Kicinski wrote:
> > > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > > >     rcu_read_lock();
> > > > > >  again:
> > > > > >     list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > > -           if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > > +           int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > +
> > > > > > +           if (xs->tx_budget_spent >= max_budget) {
> > > > > >                     budget_exhausted = true;
> > > > > >                     continue;
> > > > > >             }
> > > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > > >  {
> > > > > >     struct xdp_sock *xs = xdp_sk(sk);
> > > > > > -   u32 max_batch = TX_BATCH_SIZE;
> > > > > > +   u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > > >
> > > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > > two max values / code paths really related? Question 2 -- is generic
> > > > > XSK a legit optimization target, legit enough to add uAPI?
> > > >
> > > > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > > > whether we want to affect zc case given the fact that Jason seemingly
> > > > cares about copy mode is a good question.
> > >
> > > Allow me to ask the similar question that you asked me before: even though I
> > > didn't see the necessity to set the max budget for zc mode (just
> > > because I didn't spot it happening), would it be better if we separate
> > > both of them because it's an uAPI interface. IIUC, if the setsockopt
> > > is set, we will not separate it any more in the future?
> > >
> > > We can keep using the hardcoded value (32) in the zc mode like
> > > before and __only__ touch the copy mode? Later if someone or I found
> > > the significance of making it tunable, then another parameter of
> > > setsockopt can be added? Does it make sense?
> >
> > Related suggestion: maybe we don't need this limit at all for the copy mode?
> > If the user, with a socket option, can arbitrarily change it, what is the
> > point of this limit? Keep it on the zc side to make sure one socket doesn't
> > starve the rest and drop from the copy mode.. Any reason not to do it?
> 
> Thanks for bringing up the same question that I had in this thread. I
> saw the commit[1] mentioned it is used to avoid the burst as DPDK
> does, so my thought is that it might be used to prevent such a case
> where multiple sockets try to send packets through a shared umem
> nearly at the same time?
>
> Making it tunable is to provide a chance to let users seek for a good
> solution that is the best fit for them. It doesn't mean we
> allow/expect to see the burst situation.

The users can choose to moderate their batches by submitting fewer
descriptors with each sendmsg call. I see why having a batch limit
might be useful for zerocopy, to tx in batches that interleave
multiple sockets, but I'm not sure how this limit helps in the copy
mode. Since we are not running the qdisc layer on tx, we don't really
have a good answer for multiple sockets sharing the same device/queue.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-23 14:18           ` Stanislav Fomichev
@ 2025-06-23 23:54             ` Jason Xing
  2025-06-24  0:48               ` Stanislav Fomichev
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Xing @ 2025-06-23 23:54 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Mon, Jun 23, 2025 at 10:18 PM Stanislav Fomichev
<stfomichev@gmail.com> wrote:
>
> On 06/21, Jason Xing wrote:
> > On Sat, Jun 21, 2025 at 12:47 AM Stanislav Fomichev
> > <stfomichev@gmail.com> wrote:
> > >
> > > On 06/21, Jason Xing wrote:
> > > > On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> > > > <stfomichev@gmail.com> wrote:
> > > > >
> > > > > On 06/19, Jakub Kicinski wrote:
> > > > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > > > >     rcu_read_lock();
> > > > > > >  again:
> > > > > > >     list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > > > -           if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > > > +           int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > > +
> > > > > > > +           if (xs->tx_budget_spent >= max_budget) {
> > > > > > >                     budget_exhausted = true;
> > > > > > >                     continue;
> > > > > > >             }
> > > > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > > > >  {
> > > > > > >     struct xdp_sock *xs = xdp_sk(sk);
> > > > > > > -   u32 max_batch = TX_BATCH_SIZE;
> > > > > > > +   u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > >
> > > > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > > > two max values / code paths really related? Question 2 -- is generic
> > > > > > XSK a legit optimization target, legit enough to add uAPI?
> > > > >
> > > > > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > > > > whether we want to affect zc case given the fact that Jason seemingly
> > > > > cares about copy mode is a good question.
> > > >
> > > > Allow me to ask the similar question that you asked me before: even though I
> > > > didn't see the necessity to set the max budget for zc mode (just
> > > > because I didn't spot it happening), would it be better if we separate
> > > > both of them because it's an uAPI interface. IIUC, if the setsockopt
> > > > is set, we will not separate it any more in the future?
> > > >
> > > > We can keep using the hardcoded value (32) in the zc mode like
> > > > before and __only__ touch the copy mode? Later if someone or I found
> > > > the significance of making it tunable, then another parameter of
> > > > setsockopt can be added? Does it make sense?
> > >
> > > Related suggestion: maybe we don't need this limit at all for the copy mode?
> > > If the user, with a socket option, can arbitrarily change it, what is the
> > > point of this limit? Keep it on the zc side to make sure one socket doesn't
> > > starve the rest and drop from the copy mode.. Any reason not to do it?
> >
> > Thanks for bringing up the same question that I had in this thread. I
> > saw the commit[1] mentioned it is used to avoid the burst as DPDK
> > does, so my thought is that it might be used to prevent such a case
> > where multiple sockets try to send packets through a shared umem
> > nearly at the same time?
> >
> > Making it tunable is to provide a chance to let users seek for a good
> > solution that is the best fit for them. It doesn't mean we
> > allow/expect to see the burst situation.
>
> The users can choose to moderate their batches by submitting less
> with each sendmsg call. I see why having a batch limit might be useful for
> zerocopy to tx in batches to interleave multiple sockets, but not
> sure how this limit helps for the copy mode. Since we are not running
> qdisc layer on tx, we don't really have a good answer for multiple
> sockets sharing the same device/queue..

It's worth mentioning that the xsk path still holds the tx queue lock
in non-zc mode, so I assume getting rid of the limit might be harmful
to other non-xsk flows. That is what I know about the burst concern.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-23 23:54             ` Jason Xing
@ 2025-06-24  0:48               ` Stanislav Fomichev
  2025-06-24  2:47                 ` Jason Xing
  0 siblings, 1 reply; 26+ messages in thread
From: Stanislav Fomichev @ 2025-06-24  0:48 UTC (permalink / raw)
  To: Jason Xing
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On 06/24, Jason Xing wrote:
> On Mon, Jun 23, 2025 at 10:18 PM Stanislav Fomichev
> <stfomichev@gmail.com> wrote:
> >
> > On 06/21, Jason Xing wrote:
> > > On Sat, Jun 21, 2025 at 12:47 AM Stanislav Fomichev
> > > <stfomichev@gmail.com> wrote:
> > > >
> > > > On 06/21, Jason Xing wrote:
> > > > > On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> > > > > <stfomichev@gmail.com> wrote:
> > > > > >
> > > > > > On 06/19, Jakub Kicinski wrote:
> > > > > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > > > > >     rcu_read_lock();
> > > > > > > >  again:
> > > > > > > >     list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > > > > -           if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > > > > +           int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > > > +
> > > > > > > > +           if (xs->tx_budget_spent >= max_budget) {
> > > > > > > >                     budget_exhausted = true;
> > > > > > > >                     continue;
> > > > > > > >             }
> > > > > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > > > > >  {
> > > > > > > >     struct xdp_sock *xs = xdp_sk(sk);
> > > > > > > > -   u32 max_batch = TX_BATCH_SIZE;
> > > > > > > > +   u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > >
> > > > > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > > > > two max values / code paths really related? Question 2 -- is generic
> > > > > > > XSK a legit optimization target, legit enough to add uAPI?
> > > > > >
> > > > > > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > > > > > whether we want to affect zc case given the fact that Jason seemingly
> > > > > > cares about copy mode is a good question.
> > > > >
> > > > > Allow me to ask the similar question that you asked me before: even though I
> > > > > didn't see the necessity to set the max budget for zc mode (just
> > > > > because I didn't spot it happening), would it be better if we separate
> > > > > both of them because it's an uAPI interface. IIUC, if the setsockopt
> > > > > is set, we will not separate it any more in the future?
> > > > >
> > > > > We can keep using the hardcoded value (32) in the zc mode like
> > > > > before and __only__ touch the copy mode? Later if someone or I found
> > > > > the significance of making it tunable, then another parameter of
> > > > > setsockopt can be added? Does it make sense?
> > > >
> > > > Related suggestion: maybe we don't need this limit at all for the copy mode?
> > > > If the user, with a socket option, can arbitrarily change it, what is the
> > > > point of this limit? Keep it on the zc side to make sure one socket doesn't
> > > > starve the rest and drop from the copy mode.. Any reason not to do it?
> > >
> > > Thanks for bringing up the same question that I had in this thread. I
> > > saw the commit[1] mentioned it is used to avoid the burst as DPDK
> > > does, so my thought is that it might be used to prevent such a case
> > > where multiple sockets try to send packets through a shared umem
> > > nearly at the same time?
> > >
> > > Making it tunable is to provide a chance to let users seek for a good
> > > solution that is the best fit for them. It doesn't mean we
> > > allow/expect to see the burst situation.
> >
> > Users can choose to moderate their batches by submitting fewer
> > descriptors with each sendmsg call. I can see why a batch limit might
> > be useful for zerocopy, to tx in batches and interleave multiple
> > sockets, but I'm not sure how this limit helps in copy mode. Since we
> > are not running the qdisc layer on tx, we don't really have a good
> > answer for multiple sockets sharing the same device/queue.
> 
> It's worth mentioning that the xsk still holds the tx queue lock in
> non-zc mode, so I assume removing the limit might be harmful to other
> non-xsk flows. That is what I know about the burst concern.

But one still needs CAP_NET_RAW to use it, right? So it's not like some
random process will suddenly start DoS-ing tx queues. Maybe we should
add a need_resched() / signal_pending() check to the loop to break out
of it?
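
A minimal sketch of where such a bail-out could sit in the copy-mode
send loop (illustrative only: the loop is condensed from
__xsk_generic_xmit and the exact placement is an assumption, not code
from the posted patch):

	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
		if (max_budget-- == 0) {
			err = -EAGAIN;
			break;
		}
		/* Yield or bail out on a pending signal instead of
		 * relying solely on the fixed per-socket budget.
		 */
		if (need_resched() || signal_pending(current)) {
			err = -EAGAIN;	/* user space retries sendto() */
			break;
		}
		/* ... build the skb and hand it to the driver ... */
	}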


* Re: [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-24  0:48               ` Stanislav Fomichev
@ 2025-06-24  2:47                 ` Jason Xing
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Xing @ 2025-06-24  2:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, davem, edumazet, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel, bpf, netdev,
	Jason Xing

On Tue, Jun 24, 2025 at 8:48 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 06/24, Jason Xing wrote:
> > On Mon, Jun 23, 2025 at 10:18 PM Stanislav Fomichev
> > <stfomichev@gmail.com> wrote:
> > >
> > > On 06/21, Jason Xing wrote:
> > > > On Sat, Jun 21, 2025 at 12:47 AM Stanislav Fomichev
> > > > <stfomichev@gmail.com> wrote:
> > > > >
> > > > > On 06/21, Jason Xing wrote:
> > > > > > On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> > > > > > <stfomichev@gmail.com> wrote:
> > > > > > >
> > > > > > > On 06/19, Jakub Kicinski wrote:
> > > > > > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > > > > > >     rcu_read_lock();
> > > > > > > > >  again:
> > > > > > > > >     list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > > > > > -           if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > > > > > +           int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > > > > +
> > > > > > > > > +           if (xs->tx_budget_spent >= max_budget) {
> > > > > > > > >                     budget_exhausted = true;
> > > > > > > > >                     continue;
> > > > > > > > >             }
> > > > > > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > > > > > >  {
> > > > > > > > >     struct xdp_sock *xs = xdp_sk(sk);
> > > > > > > > > -   u32 max_batch = TX_BATCH_SIZE;
> > > > > > > > > +   u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > > >
> > > > > > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > > > > > two max values / code paths really related? Question 2 -- is generic
> > > > > > > > XSK a legit optimization target, legit enough to add uAPI?
> > > > > > >
> > > > > > > 1) xsk_tx_peek_desc is for the zc case and xsk_build_skb is for
> > > > > > > copy mode; whether we want to affect the zc case, given that
> > > > > > > Jason seemingly cares about copy mode, is a good question.
> > > > > >
> > > > > > Allow me to ask a similar question to the one you asked me
> > > > > > before: even though I don't see the necessity of setting the max
> > > > > > budget for zc mode (just because I haven't spotted it mattering
> > > > > > there), would it be better to separate the two, since this is a
> > > > > > uAPI interface? IIUC, once the setsockopt is merged, we will not
> > > > > > be able to separate them later.
> > > > > >
> > > > > > We could keep using the hardcoded value (32) in zc mode as before
> > > > > > and __only__ touch copy mode. Later, if someone (or I) finds it
> > > > > > worthwhile to make zc mode tunable as well, another setsockopt
> > > > > > parameter can be added. Does that make sense?
> > > > >
> > > > > Related suggestion: maybe we don't need this limit at all for the
> > > > > copy mode? If the user can arbitrarily change it with a socket
> > > > > option, what is the point of the limit? Keep it on the zc side, to
> > > > > make sure one socket doesn't starve the rest, and drop it from the
> > > > > copy mode. Any reason not to do that?
> > > >
> > > > Thanks for bringing up the same question I had in this thread. I saw
> > > > that commit[1] mentioned the limit is there to avoid bursts, as DPDK
> > > > does, so my thought is that it might be meant to prevent the case
> > > > where multiple sockets try to send packets through a shared umem at
> > > > nearly the same time.
> > > >
> > > > Making it tunable gives users a chance to find the value that best
> > > > fits their workload; it doesn't mean we allow or expect the burst
> > > > situation.
> > >
> > > Users can choose to moderate their batches by submitting fewer
> > > descriptors with each sendmsg call. I can see why a batch limit might
> > > be useful for zerocopy, to tx in batches and interleave multiple
> > > sockets, but I'm not sure how this limit helps in copy mode. Since we
> > > are not running the qdisc layer on tx, we don't really have a good
> > > answer for multiple sockets sharing the same device/queue.
> >
> > It's worth mentioning that the xsk still holds the tx queue lock in
> > non-zc mode, so I assume removing the limit might be harmful to other
> > non-xsk flows. That is what I know about the burst concern.
>
> But one still needs CAP_NET_RAW to use it, right? So it's not like some
> random process will suddenly start DoS-ing tx queues. Maybe we should
> add a need_resched() / signal_pending() check to the loop to break out
> of it?

I'm not referring to deliberate attacks. TSQ is an example of preventing
a single TCP flow from being so aggressive that it occupies the majority
of the available bandwidth, which leads to unfairness and bufferbloat.
IMHO, the limit here works very similarly. We cannot control how the
application or the upper-layer stack uses the AF_XDP feature, and the
fact is that too many packets sent from an xsk can cause high latency
for other normal packets in the same queue. TSQ has a tunable value;
BQL has something similar too. So IMHO the tx budget for xsk makes
sense as well :)

If an application wants to effectively remove the limit, the setsockopt
is surely an option, because the v4 patch doesn't enforce an upper
bound.
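
For reference, a minimal user-space sketch of how the proposed option
would be driven (xsk_fd is a hypothetical, already-created AF_XDP
socket; 128 is an arbitrary example value and error handling is
omitted):

	#include <sys/socket.h>
	#include <linux/if_xdp.h>

	/* XDP_MAX_TX_BUDGET is the option added by this patch. */
	static void set_tx_budget(int xsk_fd)
	{
		__u32 budget = 128;	/* arbitrary example value */
		socklen_t len = sizeof(budget);

		setsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET,
			   &budget, sizeof(budget));

		/* read the effective value back */
		getsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET,
			   &budget, &len);
	}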

Thanks,
Jason


end of thread

Thread overview: 26+ messages
2025-06-19  9:04 [PATCH net-next v3] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
2025-06-19 13:53 ` Willem de Bruijn
2025-06-19 23:53   ` Jason Xing
2025-06-20  0:02     ` Jason Xing
2025-06-20 13:43     ` Willem de Bruijn
2025-06-20 13:58       ` Willem de Bruijn
2025-06-20 14:37         ` Jason Xing
2025-06-20 22:21           ` Willem de Bruijn
2025-06-19 15:09 ` Jakub Kicinski
2025-06-20  0:17   ` Jason Xing
2025-06-20 13:50     ` Willem de Bruijn
2025-06-20 15:03       ` Jason Xing
2025-06-20 22:24         ` Willem de Bruijn
2025-06-21  0:40           ` Jason Xing
2025-06-21 14:43     ` Jakub Kicinski
2025-06-22  0:05       ` Jason Xing
2025-06-20 14:25   ` Stanislav Fomichev
2025-06-20 16:30     ` Jason Xing
2025-06-20 16:47       ` Stanislav Fomichev
2025-06-20 17:46         ` Jason Xing
2025-06-23 14:18           ` Stanislav Fomichev
2025-06-23 23:54             ` Jason Xing
2025-06-24  0:48               ` Stanislav Fomichev
2025-06-24  2:47                 ` Jason Xing
2025-06-20 22:20     ` Willem de Bruijn
2025-06-21  1:06       ` Jason Xing
