* [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
@ 2025-06-27 11:01 Jason Xing
  2025-06-29  2:51 ` Jason Xing
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Jason Xing @ 2025-06-27 11:01 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This patch provides a setsockopt method that lets applications adjust
how many descs can be handled at most in one send syscall. It mitigates
the situation where the default value (32) is too small and leads to a
higher frequency of send syscalls.

Considering the diversity/complexity of applications, there is no
single ideal value that fits all cases, so keep 32 as the default
value as before.

The patch does the following things:
- Add the XDP_MAX_TX_BUDGET socket option.
- Rename TX_BATCH_SIZE to TX_BUDGET_SIZE.
- Initialize the per-socket max_tx_budget to 32 by default at socket
  creation; 32 is also the minimum value for max_tx_budget.
- Clamp max_tx_budget to the range [32, xs->tx->nentries].

The idea behind this comes from real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize syscalls. When packets are aggregated, it is not hard to hit
the upper bound (namely, 32). The moment the user-space stack sees the
-EAGAIN error returned by sendto(), it loops to try again until all the
expected descs in the tx ring have been sent out to the driver.
Enlarging XDP_MAX_TX_BUDGET reduces the frequency of sendto() calls and
raises throughput/PPS.
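
As an illustration, here is a minimal sketch of the flush loop described
above (the helper name is hypothetical, not taken from our real stack):
it kicks the kernel with sendto() and retries while the budget keeps
running out:

    #include <errno.h>
    #include <sys/socket.h>

    /* Kick the kernel to drain the AF_XDP tx ring. The kernel handles
     * at most max_tx_budget descs per syscall and returns -EAGAIN when
     * the budget is exhausted, so a full ring may need several kicks.
     */
    static void kick_tx_until_done(int xsk_fd)
    {
            while (sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0) < 0) {
                    if (errno != EAGAIN)
                            break;  /* real error: let the caller handle it */
                    /* -EAGAIN: budget exhausted, descs remain; kick again */
            }
    }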

Here is what I did in production, along with some numbers:
For one application I looked at recently, I suggested using 128 as
max_tx_budget because I saw two limitations without changing any
default configuration: 1) XDP_MAX_TX_BUDGET, 2) the socket sndbuf,
which is 212992 as decided by net.core.wmem_default. As to
XDP_MAX_TX_BUDGET, I counted how many descs were transmitted to the
driver per sendto() call based on patch [1], and then calculated the
probability of hitting the upper bound. I finally chose 128 as a
suitable value because 1) it covers most of the cases, and 2) a higher
number would not bring evident gains. After tuning the parameters, a
stable improvement of around 4% in both PPS and throughput was
observed, along with lower resource consumption, as reported by
strace -c -p xxx:
1) %time was decreased by 7.8%
2) the error counter was decreased from 18367 to 572
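
For completeness, a minimal sketch (assuming fd is an AF_XDP socket
whose tx ring has already been configured; error handling elided) of how
an application would request a larger budget and read back the value
that took effect:

    #include <linux/if_xdp.h>
    #include <sys/socket.h>

    unsigned int budget = 128, effective;
    socklen_t optlen = sizeof(effective);

    /* Request a larger per-socket tx budget. */
    setsockopt(fd, SOL_XDP, XDP_MAX_TX_BUDGET, &budget, sizeof(budget));
    /* Read back what actually took effect; the kernel clamps the
     * value to [32, tx ring entries]. */
    getsockopt(fd, SOL_XDP, XDP_MAX_TX_BUDGET, &effective, &optlen);

Note that the setsockopt has to come after the tx ring is created, since
the kernel validates the request against xs->tx->nentries.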

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
v6
Link: https://lore.kernel.org/all/20250625123527.98209-1-kerneljasonxing@gmail.com/
1. use [32, xs->tx->nentries] range
2. Since setsockopt may apply a different (clamped) value, add getsockopt
   so applications can learn which value finally takes effect.

v5
Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@gmail.com/
1. remove changes around zc mode

v4
Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@gmail.com/
1. remove getsockopt as there seemed to be no real use case.
2. adjust the position of max_tx_budget to make sure it stays with other
   read-mostly fields in one cacheline.
3. set one as the lower bound of max_tx_budget
4. add more descriptions/performance data in Documentation and commit message.

V3
Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
1. use a per-socket control (suggested by Stanislav)
2. unify both definitions into one
3. support setsockopt and getsockopt
4. add more description in commit message

V2
Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
1. use a per-netns sysctl knob
2. use sysctl_xsk_max_tx_budget to unify both definitions.
---
 Documentation/networking/af_xdp.rst |  9 +++++++
 include/net/xdp_sock.h              |  1 +
 include/uapi/linux/if_xdp.h         |  1 +
 net/xdp/xsk.c                       | 39 ++++++++++++++++++++++++++---
 tools/include/uapi/linux/if_xdp.h   |  1 +
 5 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index dceeb0d763aa..253afee00162 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -442,6 +442,15 @@ is created by a privileged process and passed to a non-privileged one.
 Once the option is set, kernel will refuse attempts to bind that socket
 to a different interface.  Updating the value requires CAP_NET_RAW.
 
+XDP_MAX_TX_BUDGET setsockopt
+----------------------------
+
+This setsockopt sets the maximum number of descriptors that can be handled
+and passed to the driver in one send syscall. It applies to copy mode (as
+opposed to zero-copy mode) and allows applications to tune the per-socket
+maximum iteration count for better throughput and fewer send syscalls.
+The allowed range is [32, xs->tx->nentries].
+
 XDP_STATISTICS getsockopt
 -------------------------
 
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e8bd6ddb7b12..ce587a225661 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -84,6 +84,7 @@ struct xdp_sock {
 	struct list_head map_list;
 	/* Protects map_list */
 	spinlock_t map_list_lock;
+	u32 max_tx_budget;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 44f2bb93e7e6..07c6d21c2f1c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_MAX_TX_BUDGET		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 72c000c0ae5f..41efe7b27b0e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -33,8 +33,8 @@
 #include "xdp_umem.h"
 #include "xsk.h"
 
-#define TX_BATCH_SIZE 32
-#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
+#define TX_BUDGET_SIZE 32
+#define MAX_PER_SOCKET_BUDGET 32
 
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
@@ -779,7 +779,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 static int __xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
-	u32 max_batch = TX_BATCH_SIZE;
+	u32 max_budget = READ_ONCE(xs->max_tx_budget);
 	bool sent_frame = false;
 	struct xdp_desc desc;
 	struct sk_buff *skb;
@@ -797,7 +797,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 		goto out;
 
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-		if (max_batch-- == 0) {
+		if (max_budget-- == 0) {
 			err = -EAGAIN;
 			goto out;
 		}
@@ -1437,6 +1437,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_MAX_TX_BUDGET:
+	{
+		unsigned int budget;
+
+		if (optlen != sizeof(budget))
+			return -EINVAL;
+		if (copy_from_sockptr(&budget, optval, sizeof(budget)))
+			return -EFAULT;
+		if (!xs->tx || xs->tx->nentries < TX_BUDGET_SIZE)
+			return -EACCES;
+
+		WRITE_ONCE(xs->max_tx_budget,
+			   clamp(budget, TX_BUDGET_SIZE, xs->tx->nentries));
+		return 0;
+	}
 	default:
 		break;
 	}
@@ -1588,6 +1603,21 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
 
 		return 0;
 	}
+	case XDP_MAX_TX_BUDGET:
+	{
+		unsigned int budget;
+
+		if (len < sizeof(budget))
+			return -EINVAL;
+
+		budget = READ_ONCE(xs->max_tx_budget);
+		if (copy_to_user(optval, &budget, sizeof(budget)))
+			return -EFAULT;
+		if (put_user(sizeof(budget), optlen))
+			return -EFAULT;
+
+		return 0;
+	}
 	default:
 		break;
 	}
@@ -1734,6 +1764,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
 	xs = xdp_sk(sk);
 	xs->state = XSK_READY;
+	xs->max_tx_budget = TX_BUDGET_SIZE;
 	mutex_init(&xs->mutex);
 
 	INIT_LIST_HEAD(&xs->map_list);
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 44f2bb93e7e6..07c6d21c2f1c 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_MAX_TX_BUDGET		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
-- 
2.41.3



* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-27 11:01 [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
@ 2025-06-29  2:51 ` Jason Xing
  2025-06-29 10:43   ` Jason Xing
  2025-07-02 11:33 ` Jason Xing
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-06-29  2:51 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel
  Cc: bpf, netdev, Jason Xing

On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> [...]

More interesting numbers are arriving here as I run some benchmarks
from xdp-project/bpf-examples/AF_XDP-example/ in my VM.

Running "sudo taskset -c 2 ./xdpsock -i eth0 -q 1 -l -N -t -b 256"

Using the default value of 32 as the max budget:
 sock0@eth0:1 txonly xdp-drv
                   pps            pkts           1.01
rx                 0              0
tx                 48,574         49,152

Enlarging the value to 256:
 sock0@eth0:1 txonly xdp-drv
                   pps            pkts           1.00
rx                 0              0
tx                 148,277        148,736

Enlarging the value to 512:
 sock0@eth0:1 txonly xdp-drv
                   pps            pkts           1.00
rx                 0              0
tx                 226,306        227,072

The pps goes up by 365% (with the max budget set to 512), which is an
incredible number :)

If another respin is needed, I will include these convincing numbers
in the commit message.

Thanks,
Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-29  2:51 ` Jason Xing
@ 2025-06-29 10:43   ` Jason Xing
  2025-06-30 11:46     ` Maciej Fijalkowski
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-06-29 10:43 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel
  Cc: bpf, netdev, Jason Xing

On Sun, Jun 29, 2025 at 10:51 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > [...]
>
> More interesting numbers are arriving here as I run some benchmarks
> from xdp-project/bpf-examples/AF_XDP-example/ in my VM.
>
> Running "sudo taskset -c 2 ./xdpsock -i eth0 -q 1 -l -N -t -b 256"
>
> [...]
>
> The pps goes up by 365% (with the max budget set to 512), which is an
> incredible number :)

Weird thing: I purchased another VM and didn't manage to see such a
huge improvement... Luckily, I still have the good machine where this
is reproducible, and I'm still digging into it. So please ignore this
noise for now :|

Thanks,
Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-29 10:43   ` Jason Xing
@ 2025-06-30 11:46     ` Maciej Fijalkowski
  2025-06-30 12:07       ` Jason Xing
  0 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-06-30 11:46 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Sun, Jun 29, 2025 at 06:43:05PM +0800, Jason Xing wrote:
> On Sun, Jun 29, 2025 at 10:51 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > [...]
> >
> > More interesting numbers are arriving here as I run some benchmarks
> > from xdp-project/bpf-examples/AF_XDP-example/ in my VM.
> >
> > Running "sudo taskset -c 2 ./xdpsock -i eth0 -q 1 -l -N -t -b 256"

do you have a patch against xdpsock that does setsockopt you're
introducing here?

-B -b 256 was for enabling busy polling and giving it 256 budget, which is
not what you wanted to achieve.


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-30 11:46     ` Maciej Fijalkowski
@ 2025-06-30 12:07       ` Jason Xing
  2025-06-30 12:25         ` Maciej Fijalkowski
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-06-30 12:07 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Mon, Jun 30, 2025 at 7:47 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Sun, Jun 29, 2025 at 06:43:05PM +0800, Jason Xing wrote:
> > On Sun, Jun 29, 2025 at 10:51 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > > [...]
> > >
> > > More interesting numbers are arriving here as I run some benchmarks
> > > from xdp-project/bpf-examples/AF_XDP-example/ in my VM.
> > >
> > > Running "sudo taskset -c 2 ./xdpsock -i eth0 -q 1 -l -N -t -b 256"
>
> do you have a patch against xdpsock that does setsockopt you're
> introducing here?

Sure, I added the following code in the apply_setsockopt():
if (setsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, 9, &a, sizeof(a)) < 0)
...

>
> -B -b 256 was for enabling busy polling and giving it 256 budget, which is
> not what you wanted to achieve.

I checked that getsockopt returns the same budget value that I
passed via setsockopt().

Sorry, I don't know what you meant here. Could you say more about it?

Thanks,
Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-30 12:07       ` Jason Xing
@ 2025-06-30 12:25         ` Maciej Fijalkowski
  2025-06-30 12:38           ` Jason Xing
  2025-07-01  0:20           ` Jason Xing
  0 siblings, 2 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-06-30 12:25 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Mon, Jun 30, 2025 at 08:07:01PM +0800, Jason Xing wrote:
> On Mon, Jun 30, 2025 at 7:47 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Sun, Jun 29, 2025 at 06:43:05PM +0800, Jason Xing wrote:
> > > On Sun, Jun 29, 2025 at 10:51 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > [...]
> > > >
> > > > More interesting numbers are arriving here as I run some benchmarks
> > > > from xdp-project/bpf-examples/AF_XDP-example/ in my VM.
> > > >
> > > > Running "sudo taskset -c 2 ./xdpsock -i eth0 -q 1 -l -N -t -b 256"
> >
> > do you have a patch against xdpsock that does setsockopt you're
> > introducing here?
> 
> Sure, I added the following code in the apply_setsockopt():
> if (setsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, 9, &a, sizeof(a)) < 0)
> ...
> 
> >
> > -B -b 256 was for enabling busy polling and giving it 256 budget, which is
> > not what you wanted to achieve.
> 
> > I checked that getsockopt returns the same budget value that I
> > passed via setsockopt().
> 
> Sorry, I don't know what you meant here. Could you say more about it?

I meant that -b is for setting SO_BUSY_POLL_BUDGET. Just pick a
different knob for your use case.
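
For clarity, the two budgets are independent socket options. A minimal
sketch, assuming kernel headers that define both:

    #include <linux/if_xdp.h>
    #include <sys/socket.h>

    /* busy-poll budget: how many packets one busy-poll cycle may
     * process; this is what xdpsock's -B -b <n> ends up setting */
    int bp_budget = 256;
    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
               &bp_budget, sizeof(bp_budget));

    /* per-socket tx budget: how many descs one send syscall may
     * handle; the new knob introduced by this patch */
    unsigned int tx_budget = 256;
    setsockopt(fd, SOL_XDP, XDP_MAX_TX_BUDGET,
               &tx_budget, sizeof(tx_budget));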

> 
> Thanks,
> Jason
> 
> >
> > > >
> > > > Using the default configure 32 as the max budget iteration:
> > > >  sock0@eth0:1 txonly xdp-drv
> > > >                    pps            pkts           1.01
> > > > rx                 0              0
> > > > tx                 48,574         49,152
> > > >
> > > > Enlarging the value to 256:
> > > >  sock0@eth0:1 txonly xdp-drv
> > > >                    pps            pkts           1.00
> > > > rx                 0              0
> > > > tx                 148,277        148,736
> > > >
> > > > Enlarging the value to 512:
> > > >  sock0@eth0:1 txonly xdp-drv
> > > >                    pps            pkts           1.00
> > > > rx                 0              0
> > > > tx                 226,306        227,072
> > > >
> > > > The performance of pps goes up by 365% (with max budget set as 512)
> > > > which is an incredible number :)
> > >
> > > Weird thing. I purchased another VM and didn't manage to see such a
> > > huge improvement.... Good luck is that I own that good machine which
> > > is still reproducible and I'm still digging in it. So please ignore
> > > this noise for now :|
> > >
> > > Thanks,
> > > Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-30 12:25         ` Maciej Fijalkowski
@ 2025-06-30 12:38           ` Jason Xing
  2025-07-01  0:20           ` Jason Xing
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-06-30 12:38 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Mon, Jun 30, 2025 at 8:25 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Mon, Jun 30, 2025 at 08:07:01PM +0800, Jason Xing wrote:
> > On Mon, Jun 30, 2025 at 7:47 PM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Sun, Jun 29, 2025 at 06:43:05PM +0800, Jason Xing wrote:
> > > > On Sun, Jun 29, 2025 at 10:51 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > [...]
> > > > >
> > > > > More interesting numbers are arriving here as I run some benchmarks
> > > > > from xdp-project/bpf-examples/AF_XDP-example/ in my VM.
> > > > >
> > > > > Running "sudo taskset -c 2 ./xdpsock -i eth0 -q 1 -l -N -t -b 256"
> > >
> > > do you have a patch against xdpsock that does setsockopt you're
> > > introducing here?
> >
> > Sure, I added the following code in the apply_setsockopt():
> > if (setsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, 9, &a, sizeof(a)) < 0)
> > ...
> >
> > >
> > > -B -b 256 was for enabling busy polling and giving it 256 budget, which is
> > > not what you wanted to achieve.
> >
> > I checked that getsockopt returns the same budget value that I
> > passed via setsockopt().
> >
> > Sorry, I don't know what you meant here. Could you say more about it?
>
> I meant that -b is for setting SO_BUSY_POLL_BUDGET. Just pick a
> different knob for your use case.

Sorry again. I still don't follow you. Do you mean this patch doesn't
work as expected in busy polling mode? I tried to use '-B -b 256' and
I managed to see the correct budget value returned from getsockopt().

How is this max tx budget related to the busy polling mode, I wonder?

Or are you talking about the performance test I mentioned? I probably
know what happened there (due to some limit on the host side), but I
still need more time.

Thanks,
Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-30 12:25         ` Maciej Fijalkowski
  2025-06-30 12:38           ` Jason Xing
@ 2025-07-01  0:20           ` Jason Xing
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-07-01  0:20 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Mon, Jun 30, 2025 at 8:25 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Mon, Jun 30, 2025 at 08:07:01PM +0800, Jason Xing wrote:
> > On Mon, Jun 30, 2025 at 7:47 PM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Sun, Jun 29, 2025 at 06:43:05PM +0800, Jason Xing wrote:
> > > > On Sun, Jun 29, 2025 at 10:51 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > [...]
> > > > >
> > > > > More interesting numbers are arriving here as I run some benchmarks
> > > > > from xdp-project/bpf-examples/AF_XDP-example/ in my VM.
> > > > >
> > > > > Running "sudo taskset -c 2 ./xdpsock -i eth0 -q 1 -l -N -t -b 256"
> > >
> > > do you have a patch against xdpsock that does setsockopt you're
> > > introducing here?
> >
> > Sure, I added the following code in the apply_setsockopt():
> > if (setsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, 9, &a, sizeof(a)) < 0)
> > ...
> >
> > >
> > > -B -b 256 was for enabling busy polling and giving it 256 budget, which is
> > > not what you wanted to achieve.
> >
> > I checked that getsockopt returns the same budget value that I
> > passed via setsockopt().
> >
> > Sorry, I don't know what you meant here. Could you say more about it?
>
> I meant that -b is for setting SO_BUSY_POLL_BUDGET. Just pick a
> different knob for your use case.

After sleeping on it, I now clearly see what that is... I will try to
test with other parameters. But for me, the primary factor is the
security interception on the host side, triggered by a larger number of
descs carrying useless information. Well, I will find a good way to
avoid this in the future...

And what is your opinion of the current patch? I used [32, nentries]
as the min/max range as you advised :)

Thanks,
Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-27 11:01 [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
  2025-06-29  2:51 ` Jason Xing
@ 2025-07-02 11:33 ` Jason Xing
  2025-07-03  8:15 ` Paolo Abeni
  2025-07-03 12:25 ` Maciej Fijalkowski
  3 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-07-02 11:33 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel
  Cc: bpf, netdev, Jason Xing

On Fri, Jun 27, 2025 at 7:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> [...]

Hi Maciej, Stan, Willem, and other maintainers,

I wonder if you have further suggestions on the current patch?

Thanks in advance!


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-27 11:01 [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
  2025-06-29  2:51 ` Jason Xing
  2025-07-02 11:33 ` Jason Xing
@ 2025-07-03  8:15 ` Paolo Abeni
  2025-07-03  8:22   ` Jason Xing
  2025-07-03 12:25 ` Maciej Fijalkowski
  3 siblings, 1 reply; 17+ messages in thread
From: Paolo Abeni @ 2025-07-03  8:15 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, joe, willemdebruijn.kernel
  Cc: bpf, netdev, Jason Xing

On 6/27/25 1:01 PM, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> [...]

LGTM, waiting a little more for an explicit ack from the XDP maintainers.

Side note: it could be useful to extend the xdp selftest to trigger the
new code path.

/P



* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-07-03  8:15 ` Paolo Abeni
@ 2025-07-03  8:22   ` Jason Xing
  2025-07-03 12:29     ` Maciej Fijalkowski
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-07-03  8:22 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: davem, edumazet, kuba, bjorn, magnus.karlsson, maciej.fijalkowski,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Thu, Jul 3, 2025 at 4:15 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 6/27/25 1:01 PM, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > [...]
>
> LGTM, waiting a little more for an explicit ack from the XDP maintainers.

Thanks. No problem.

>
> Side note: it could be useful to extend the xdp selftest to trigger the
> new code path.

Roger that, sir. I will do it after this gets merged, maybe later this
month; I am still studying the various tests these days :)

Thanks,
Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-06-27 11:01 [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
                   ` (2 preceding siblings ...)
  2025-07-03  8:15 ` Paolo Abeni
@ 2025-07-03 12:25 ` Maciej Fijalkowski
  2025-07-03 13:09   ` Jason Xing
  3 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-07-03 12:25 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Fri, Jun 27, 2025 at 07:01:21PM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> This patch provides a setsockopt method to let applications leverage to
> adjust how many descs to be handled at most in one send syscall. It
> mitigates the situation where the default value (32) that is too small
> leads to higher frequency of triggering send syscall.
> 
> Considering the prosperity/complexity the applications have, there is no
> absolutely ideal suggestion fitting all cases. So keep 32 as its default
> value like before.
> 
> The patch does the following things:
> - Add XDP_MAX_TX_BUDGET socket option.
> - Convert TX_BATCH_SIZE to tx_budget_spent.
> - Set tx_budget_spent to 32 by default in the initialization phase as a
>   per-socket granular control. 32 is also the min value for
>   tx_budget_spent.
> - Set the range of tx_budget_spent as [32, xs->tx->nentries].
> 
> The idea behind this comes out of real workloads in production. We use a
> user-level stack with xsk support to accelerate sending packets and
> minimize triggering syscalls. When the packets are aggregated, it's not
> hard to hit the upper bound (namely, 32). The moment the user-space stack
> sees the -EAGAIN error returned from sendto(), it will loop and try
> again until all the expected descs from the tx ring are sent out to the driver.
> Enlarging the XDP_MAX_TX_BUDGET value contributes to a lower frequency of
> sendto() calls and higher throughput/PPS.
> 
> Here is what I did in production, along with some numbers as follows:
> For one application I saw lately, I suggested using 128 as max_tx_budget
> because I saw two limitations without changing any default configuration:
> 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> this was I counted how many descs are transmitted to the driver at one
> time of sendto() based on [1] patch and then I calculated the
> possibility of hitting the upper bound. Finally I chose 128 as a
> suitable value because 1) it covers most of the cases, 2) a higher
> number would not bring evident results. After tweaking the parameters,
> a stable improvement of around 4% for both PPS and throughput, plus lower
> resource consumption, was observed via strace -c -p xxx:
> 1) %time was decreased by 7.8%
> 2) error counter was decreased from 18367 to 572
> 
> [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> v6
> Link: https://lore.kernel.org/all/20250625123527.98209-1-kerneljasonxing@gmail.com/
> 1. use [32, xs->tx->nentries] range
> 2. Since setsockopt may generate a different value, add getsockopt to help
>    the application know which value takes effect in the end.
> 
> v5
> Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@gmail.com/
> 1. remove changes around zc mode
> 
> v4
> Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@gmail.com/
> 1. remove getsockopt as it seems no real use case.
> 2. adjust the position of max_tx_budget to make sure it stays with other
> read-most fields in one cacheline.
> 3. set one as the lower bound of max_tx_budget
> 4. add more descriptions/performance data in Documentation and commit message.
> 
> V3
> Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> 1. use a per-socket control (suggested by Stanislav)
> 2. unify both definitions into one
> 3. support setsockopt and getsockopt
> 4. add more description in commit message
> 
> V2
> Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> 1. use a per-netns sysctl knob
> 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> ---
>  Documentation/networking/af_xdp.rst |  9 +++++++
>  include/net/xdp_sock.h              |  1 +
>  include/uapi/linux/if_xdp.h         |  1 +
>  net/xdp/xsk.c                       | 39 ++++++++++++++++++++++++++---
>  tools/include/uapi/linux/if_xdp.h   |  1 +
>  5 files changed, 47 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> index dceeb0d763aa..253afee00162 100644
> --- a/Documentation/networking/af_xdp.rst
> +++ b/Documentation/networking/af_xdp.rst
> @@ -442,6 +442,15 @@ is created by a privileged process and passed to a non-privileged one.
>  Once the option is set, kernel will refuse attempts to bind that socket
>  to a different interface.  Updating the value requires CAP_NET_RAW.
>  
> +XDP_MAX_TX_BUDGET setsockopt
> +----------------------------
> +
> +This setsockopt sets the maximum number of descriptors that can be handled
> +and passed to the driver at one send syscall. It is applied in the non-zero

just 'copy mode' would be enough IMHO.

> +copy mode to allow application to tune the per-socket maximum iteration for
> +better throughput and less frequency of send syscall.
> +Allowed range is [32, xs->tx->nentries].
> +
>  XDP_STATISTICS getsockopt
>  -------------------------
>  
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index e8bd6ddb7b12..ce587a225661 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -84,6 +84,7 @@ struct xdp_sock {
>  	struct list_head map_list;
>  	/* Protects map_list */
>  	spinlock_t map_list_lock;
> +	u32 max_tx_budget;
>  	/* Protects multiple processes in the control path */
>  	struct mutex mutex;
>  	struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 44f2bb93e7e6..07c6d21c2f1c 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_COMPLETION_RING	6
>  #define XDP_STATISTICS			7
>  #define XDP_OPTIONS			8
> +#define XDP_MAX_TX_BUDGET		9
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 72c000c0ae5f..41efe7b27b0e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -33,8 +33,8 @@
>  #include "xdp_umem.h"
>  #include "xsk.h"
>  
> -#define TX_BATCH_SIZE 32
> -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
> +#define TX_BUDGET_SIZE 32
> +#define MAX_PER_SOCKET_BUDGET 32

that could have stayed as it was before?

>  
>  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
>  {
> @@ -779,7 +779,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  static int __xsk_generic_xmit(struct sock *sk)
>  {
>  	struct xdp_sock *xs = xdp_sk(sk);
> -	u32 max_batch = TX_BATCH_SIZE;
> +	u32 max_budget = READ_ONCE(xs->max_tx_budget);

you're breaking RCT here...maybe you could READ_ONCE() the budget right
before the while() loop and keep the declaration only at the top of func?

also nothing wrong with keeping @max_batch as a name for this budget.
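
("RCT" is the netdev reverse-Christmas-tree convention: local variable
declarations ordered from the longest line down to the shortest.) A
minimal sketch of the shape being suggested, as an illustration only and
not the actual follow-up diff:

static int __xsk_generic_xmit(struct sock *sk)
{
	struct xdp_sock *xs = xdp_sk(sk);
	bool sent_frame = false;
	struct xdp_desc desc;
	struct sk_buff *skb;
	u32 max_budget;
	int err = 0;

	/* ... locking and state checks elided ... */

	/* read the per-socket budget once, right before the loop */
	max_budget = READ_ONCE(xs->max_tx_budget);
	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
		if (max_budget-- == 0) {
			err = -EAGAIN;
			goto out;
		}
		/* ... build and send the skb ... */
	}
out:
	/* ... wakeup and unlock elided ... */
	return err;
}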

>  	bool sent_frame = false;
>  	struct xdp_desc desc;
>  	struct sk_buff *skb;
> @@ -797,7 +797,7 @@ static int __xsk_generic_xmit(struct sock *sk)
>  		goto out;
>  
>  	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> -		if (max_batch-- == 0) {
> +		if (max_budget-- == 0) {
>  			err = -EAGAIN;
>  			goto out;
>  		}
> @@ -1437,6 +1437,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return err;
>  	}
> +	case XDP_MAX_TX_BUDGET:
> +	{
> +		unsigned int budget;
> +
> +		if (optlen != sizeof(budget))
> +			return -EINVAL;
> +		if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> +			return -EFAULT;
> +		if (!xs->tx || xs->tx->nentries < TX_BUDGET_SIZE)
> +			return -EACCES;
> +
> +		WRITE_ONCE(xs->max_tx_budget,
> +			   clamp(budget, TX_BUDGET_SIZE, xs->tx->nentries));

I think it would be better to throw an errno when the budget set by the
user is bigger than nentries. You do it for the min case, let's do it for
the max case as well and skip the clamp() altogether?

It is rather speculative that someone would ever set such a big budget,
but silently setting a lower value than provided might be confusing to me.
I'd rather get an error thrown in my face and find out the valid range.
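
Something along these lines, i.e. validate and reject rather than clamp
(sketch only; whether -EINVAL is the right errno for every case is an
open choice here, the v6 code used -EACCES for a too-small ring):

	case XDP_MAX_TX_BUDGET:
	{
		unsigned int budget;

		if (optlen != sizeof(budget))
			return -EINVAL;
		if (copy_from_sockptr(&budget, optval, sizeof(budget)))
			return -EFAULT;
		/* reject out-of-range values instead of clamping, so the
		 * caller finds out about the valid range
		 */
		if (!xs->tx || budget < TX_BUDGET_SIZE ||
		    budget > xs->tx->nentries)
			return -EINVAL;

		WRITE_ONCE(xs->max_tx_budget, budget);
		return 0;
	}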

> +		return 0;
> +	}
>  	default:
>  		break;
>  	}
> @@ -1588,6 +1603,21 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
>  
>  		return 0;
>  	}
> +	case XDP_MAX_TX_BUDGET:
> +	{
> +		unsigned int budget;
> +
> +		if (len < sizeof(budget))
> +			return -EINVAL;
> +
> +		budget = READ_ONCE(xs->max_tx_budget);
> +		if (copy_to_user(optval, &budget, sizeof(budget)))
> +			return -EFAULT;
> +		if (put_user(sizeof(budget), optlen))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
>  	default:
>  		break;
>  	}
> @@ -1734,6 +1764,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
>  
>  	xs = xdp_sk(sk);
>  	xs->state = XSK_READY;
> +	xs->max_tx_budget = TX_BUDGET_SIZE;
>  	mutex_init(&xs->mutex);
>  
>  	INIT_LIST_HEAD(&xs->map_list);
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index 44f2bb93e7e6..07c6d21c2f1c 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_COMPLETION_RING	6
>  #define XDP_STATISTICS			7
>  #define XDP_OPTIONS			8
> +#define XDP_MAX_TX_BUDGET		9

could we have 'SKB' in name?

>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> -- 
> 2.41.3
> 


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-07-03  8:22   ` Jason Xing
@ 2025-07-03 12:29     ` Maciej Fijalkowski
  2025-07-03 13:11       ` Jason Xing
  0 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-07-03 12:29 UTC (permalink / raw)
  To: Jason Xing
  Cc: Paolo Abeni, davem, edumazet, kuba, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Thu, Jul 03, 2025 at 04:22:21PM +0800, Jason Xing wrote:
> On Thu, Jul 3, 2025 at 4:15 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 6/27/25 1:01 PM, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > This patch provides a setsockopt method that lets applications adjust
> > > how many descs are handled at most in one send syscall. It mitigates
> > > the situation where the default value (32) is too small and leads to a
> > > higher frequency of send syscalls.
> > >
> > > Considering the diversity/complexity of applications, there is no
> > > single ideal value that fits all cases. So keep 32 as its default
> > > value like before.
> > >
> > > The patch does the following things:
> > > - Add XDP_MAX_TX_BUDGET socket option.
> > > - Convert TX_BATCH_SIZE to tx_budget_spent.
> > > - Set tx_budget_spent to 32 by default in the initialization phase as a
> > >   per-socket granular control. 32 is also the min value for
> > >   tx_budget_spent.
> > > - Set the range of tx_budget_spent as [32, xs->tx->nentries].
> > >
> > > The idea behind this comes out of real workloads in production. We use a
> > > user-level stack with xsk support to accelerate sending packets and
> > > minimize triggering syscalls. When the packets are aggregated, it's not
> > > hard to hit the upper bound (namely, 32). The moment the user-space stack
> > > sees the -EAGAIN error returned from sendto(), it will loop and try
> > > again until all the expected descs from the tx ring are sent out to the driver.
> > > Enlarging the XDP_MAX_TX_BUDGET value contributes to a lower frequency of
> > > sendto() calls and higher throughput/PPS.
> > >
> > > Here is what I did in production, along with some numbers as follows:
> > > For one application I saw lately, I suggested using 128 as max_tx_budget
> > > because I saw two limitations without changing any default configuration:
> > > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> > > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> > > this was I counted how many descs are transmitted to the driver at one
> > > time of sendto() based on [1] patch and then I calculated the
> > > possibility of hitting the upper bound. Finally I chose 128 as a
> > > suitable value because 1) it covers most of the cases, 2) a higher
> > > number would not bring evident results. After tweaking the parameters,
> > > a stable improvement of around 4% for both PPS and throughput, plus lower
> > > resource consumption, was observed via strace -c -p xxx:
> > > 1) %time was decreased by 7.8%
> > > 2) error counter was decreased from 18367 to 572
> > >
> > > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> >
> > LGTM, waiting a little longer for an explicit ack from the XDP maintainers.
> 
> Thanks. No problem.

Hey! I did review it. Sorry Jason, I was under the impression that you
still needed to sort out the performance results on your side, hence the
silence.

> 
> >
> > Side note: it could be useful to extend the xdp selftest to trigger the
> > new code path.
> 
> > Roger that, sir. I will do it after this gets merged, maybe later this
> > month; I am still studying the various tests these days :)

IMHO there is nothing worth testing with this patch per se; it's rather a
matter of performance.

I would, however, like to ask you to follow up with a patch against
xdpsock that adds support for using this new setsockopt (once we accept
this into the kernel).
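
For reference, the userspace side is a plain AF_XDP sockopt call. A
minimal sketch (the helper name is made up here; the option id is
whatever lands in the final uapi header):

	#include <errno.h>
	#include <sys/socket.h>
	#include <linux/if_xdp.h>

	static int xsk_set_tx_budget(int xsk_fd, unsigned int budget)
	{
		/* SOL_XDP level, option id from linux/if_xdp.h */
		if (setsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET,
			       &budget, sizeof(budget)))
			return -errno;
		return 0;
	}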

> 
> Thanks,
> Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-07-03 12:25 ` Maciej Fijalkowski
@ 2025-07-03 13:09   ` Jason Xing
  2025-07-03 14:25     ` Jason Xing
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-07-03 13:09 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

Hi Maciej,

Thanks for the review.

On Thu, Jul 3, 2025 at 8:26 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Fri, Jun 27, 2025 at 07:01:21PM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This patch provides a setsockopt method that lets applications adjust
> > how many descs are handled at most in one send syscall. It mitigates
> > the situation where the default value (32) is too small and leads to a
> > higher frequency of send syscalls.
> >
> > Considering the diversity/complexity of applications, there is no
> > single ideal value that fits all cases. So keep 32 as its default
> > value like before.
> >
> > The patch does the following things:
> > - Add XDP_MAX_TX_BUDGET socket option.
> > - Convert TX_BATCH_SIZE to tx_budget_spent.
> > - Set tx_budget_spent to 32 by default in the initialization phase as a
> >   per-socket granular control. 32 is also the min value for
> >   tx_budget_spent.
> > - Set the range of tx_budget_spent as [32, xs->tx->nentries].
> >
> > The idea behind this comes out of real workloads in production. We use a
> > user-level stack with xsk support to accelerate sending packets and
> > minimize triggering syscalls. When the packets are aggregated, it's not
> > hard to hit the upper bound (namely, 32). The moment the user-space stack
> > sees the -EAGAIN error returned from sendto(), it will loop and try
> > again until all the expected descs from the tx ring are sent out to the driver.
> > Enlarging the XDP_MAX_TX_BUDGET value contributes to a lower frequency of
> > sendto() calls and higher throughput/PPS.
> >
> > Here is what I did in production, along with some numbers as follows:
> > For one application I saw lately, I suggested using 128 as max_tx_budget
> > because I saw two limitations without changing any default configuration:
> > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> > this was I counted how many descs are transmitted to the driver at one
> > time of sendto() based on [1] patch and then I calculated the
> > possibility of hitting the upper bound. Finally I chose 128 as a
> > suitable value because 1) it covers most of the cases, 2) a higher
> > number would not bring evident results. After tweaking the parameters,
> > a stable improvement of around 4% for both PPS and throughput, plus lower
> > resource consumption, was observed via strace -c -p xxx:
> > 1) %time was decreased by 7.8%
> > 2) error counter was decreased from 18367 to 572
> >
> > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > v6
> > Link: https://lore.kernel.org/all/20250625123527.98209-1-kerneljasonxing@gmail.com/
> > 1. use [32, xs->tx->nentries] range
> > 2. Since setsockopt may generate a different value, add getsockopt to help
> >    the application know which value takes effect in the end.
> >
> > v5
> > Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@gmail.com/
> > 1. remove changes around zc mode
> >
> > v4
> > Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@gmail.com/
> > 1. remove getsockopt as it seems no real use case.
> > 2. adjust the position of max_tx_budget to make sure it stays with other
> > read-most fields in one cacheline.
> > 3. set one as the lower bound of max_tx_budget
> > 4. add more descriptions/performance data in Documentation and commit message.
> >
> > V3
> > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > 1. use a per-socket control (suggested by Stanislav)
> > 2. unify both definitions into one
> > 3. support setsockopt and getsockopt
> > 4. add more description in commit message
> >
> > V2
> > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > 1. use a per-netns sysctl knob
> > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > ---
> >  Documentation/networking/af_xdp.rst |  9 +++++++
> >  include/net/xdp_sock.h              |  1 +
> >  include/uapi/linux/if_xdp.h         |  1 +
> >  net/xdp/xsk.c                       | 39 ++++++++++++++++++++++++++---
> >  tools/include/uapi/linux/if_xdp.h   |  1 +
> >  5 files changed, 47 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> > index dceeb0d763aa..253afee00162 100644
> > --- a/Documentation/networking/af_xdp.rst
> > +++ b/Documentation/networking/af_xdp.rst
> > @@ -442,6 +442,15 @@ is created by a privileged process and passed to a non-privileged one.
> >  Once the option is set, kernel will refuse attempts to bind that socket
> >  to a different interface.  Updating the value requires CAP_NET_RAW.
> >
> > +XDP_MAX_TX_BUDGET setsockopt
> > +----------------------------
> > +
> > +This setsockopt sets the maximum number of descriptors that can be handled
> > +and passed to the driver at one send syscall. It is applied in the non-zero
>
> just 'copy mode' would be enough IMHO.

Okay.

>
> > +copy mode to allow application to tune the per-socket maximum iteration for
> > +better throughput and less frequency of send syscall.
> > +Allowed range is [32, xs->tx->nentries].
> > +
> >  XDP_STATISTICS getsockopt
> >  -------------------------
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index e8bd6ddb7b12..ce587a225661 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -84,6 +84,7 @@ struct xdp_sock {
> >       struct list_head map_list;
> >       /* Protects map_list */
> >       spinlock_t map_list_lock;
> > +     u32 max_tx_budget;
> >       /* Protects multiple processes in the control path */
> >       struct mutex mutex;
> >       struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> >  #define XDP_UMEM_COMPLETION_RING     6
> >  #define XDP_STATISTICS                       7
> >  #define XDP_OPTIONS                  8
> > +#define XDP_MAX_TX_BUDGET            9
> >
> >  struct xdp_umem_reg {
> >       __u64 addr; /* Start of packet data area */
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 72c000c0ae5f..41efe7b27b0e 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -33,8 +33,8 @@
> >  #include "xdp_umem.h"
> >  #include "xsk.h"
> >
> > -#define TX_BATCH_SIZE 32
> > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
> > +#define TX_BUDGET_SIZE 32
> > +#define MAX_PER_SOCKET_BUDGET 32
>
> that could have stayed as it was before?

Sorry, I don't think so because one definition has nothing to do with
the other one. Why do we need to bind them together?

>
> >
> >  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
> >  {
> > @@ -779,7 +779,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >  static int __xsk_generic_xmit(struct sock *sk)
> >  {
> >       struct xdp_sock *xs = xdp_sk(sk);
> > -     u32 max_batch = TX_BATCH_SIZE;
> > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
>
> you're breaking RCT here...

What's the RCT? Never heard of this abbreviation.

> maybe you could READ_ONCE() the budget right
> before the while() loop and keep the declaration only at the top of func?
>
> also nothing wrong with keeping @max_batch as a name for this budget.
>
> >       bool sent_frame = false;
> >       struct xdp_desc desc;
> >       struct sk_buff *skb;
> > @@ -797,7 +797,7 @@ static int __xsk_generic_xmit(struct sock *sk)
> >               goto out;
> >
> >       while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > -             if (max_batch-- == 0) {
> > +             if (max_budget-- == 0) {
> >                       err = -EAGAIN;
> >                       goto out;
> >               }
> > @@ -1437,6 +1437,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> >               mutex_unlock(&xs->mutex);
> >               return err;
> >       }
> > +     case XDP_MAX_TX_BUDGET:
> > +     {
> > +             unsigned int budget;
> > +
> > +             if (optlen != sizeof(budget))
> > +                     return -EINVAL;
> > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > +                     return -EFAULT;
> > +             if (!xs->tx || xs->tx->nentries < TX_BUDGET_SIZE)
> > +                     return -EACCES;
> > +
> > +             WRITE_ONCE(xs->max_tx_budget,
> > +                        clamp(budget, TX_BUDGET_SIZE, xs->tx->nentries));
>
> I think it would be better to throw an errno when the budget set by the
> user is bigger than nentries. You do it for the min case, let's do it for
> the max case as well and skip the clamp() altogether?

Okay.

After this, do you think it's necessary to keep the getsockopt()?
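
My thinking was that it still lets an application confirm the value that
actually took effect without tracking it on its own, e.g. (fragment of a
sketch; xsk_fd is an already-created AF_XDP socket and <stdio.h> is
assumed):

	unsigned int budget = 0;
	socklen_t len = sizeof(budget);

	if (getsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET, &budget, &len) == 0)
		printf("effective tx budget: %u\n", budget);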

>
> It is rather speculative that someone would ever set such a big budget,
> but silently setting a lower value than provided might be confusing to me.
> I'd rather get an error thrown in my face and find out the valid range.

On point.

>
> > +             return 0;
> > +     }
> >       default:
> >               break;
> >       }
> > @@ -1588,6 +1603,21 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
> >
> >               return 0;
> >       }
> > +     case XDP_MAX_TX_BUDGET:
> > +     {
> > +             unsigned int budget;
> > +
> > +             if (len < sizeof(budget))
> > +                     return -EINVAL;
> > +
> > +             budget = READ_ONCE(xs->max_tx_budget);
> > +             if (copy_to_user(optval, &budget, sizeof(budget)))
> > +                     return -EFAULT;
> > +             if (put_user(sizeof(budget), optlen))
> > +                     return -EFAULT;
> > +
> > +             return 0;
> > +     }
> >       default:
> >               break;
> >       }
> > @@ -1734,6 +1764,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
> >
> >       xs = xdp_sk(sk);
> >       xs->state = XSK_READY;
> > +     xs->max_tx_budget = TX_BUDGET_SIZE;
> >       mutex_init(&xs->mutex);
> >
> >       INIT_LIST_HEAD(&xs->map_list);
> > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > --- a/tools/include/uapi/linux/if_xdp.h
> > +++ b/tools/include/uapi/linux/if_xdp.h
> > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> >  #define XDP_UMEM_COMPLETION_RING     6
> >  #define XDP_STATISTICS                       7
> >  #define XDP_OPTIONS                  8
> > +#define XDP_MAX_TX_BUDGET            9
>
> could we have 'SKB' in name?

Like XDP_MAX_TX_BUDGET_SKB ?

Thanks,
Jason

>
> >
> >  struct xdp_umem_reg {
> >       __u64 addr; /* Start of packet data area */
> > --
> > 2.41.3
> >


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-07-03 12:29     ` Maciej Fijalkowski
@ 2025-07-03 13:11       ` Jason Xing
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-07-03 13:11 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Paolo Abeni, davem, edumazet, kuba, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Thu, Jul 3, 2025 at 8:29 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Thu, Jul 03, 2025 at 04:22:21PM +0800, Jason Xing wrote:
> > On Thu, Jul 3, 2025 at 4:15 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On 6/27/25 1:01 PM, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This patch provides a setsockopt method that lets applications adjust
> > > > how many descs are handled at most in one send syscall. It mitigates
> > > > the situation where the default value (32) is too small and leads to a
> > > > higher frequency of send syscalls.
> > > >
> > > > Considering the diversity/complexity of applications, there is no
> > > > single ideal value that fits all cases. So keep 32 as its default
> > > > value like before.
> > > >
> > > > The patch does the following things:
> > > > - Add XDP_MAX_TX_BUDGET socket option.
> > > > - Convert TX_BATCH_SIZE to tx_budget_spent.
> > > > - Set tx_budget_spent to 32 by default in the initialization phase as a
> > > >   per-socket granular control. 32 is also the min value for
> > > >   tx_budget_spent.
> > > > - Set the range of tx_budget_spent as [32, xs->tx->nentries].
> > > >
> > > > The idea behind this comes out of real workloads in production. We use a
> > > > user-level stack with xsk support to accelerate sending packets and
> > > > minimize triggering syscalls. When the packets are aggregated, it's not
> > > > hard to hit the upper bound (namely, 32). The moment the user-space stack
> > > > sees the -EAGAIN error returned from sendto(), it will loop and try
> > > > again until all the expected descs from the tx ring are sent out to the driver.
> > > > Enlarging the XDP_MAX_TX_BUDGET value contributes to a lower frequency of
> > > > sendto() calls and higher throughput/PPS.
> > > >
> > > > Here is what I did in production, along with some numbers as follows:
> > > > For one application I saw lately, I suggested using 128 as max_tx_budget
> > > > because I saw two limitations without changing any default configuration:
> > > > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> > > > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> > > > this was I counted how many descs are transmitted to the driver at one
> > > > time of sendto() based on [1] patch and then I calculated the
> > > > possibility of hitting the upper bound. Finally I chose 128 as a
> > > > suitable value because 1) it covers most of the cases, 2) a higher
> > > > number would not bring evident results. After tweaking the parameters,
> > > > a stable improvement of around 4% for both PPS and throughput, plus lower
> > > > resource consumption, was observed via strace -c -p xxx:
> > > > 1) %time was decreased by 7.8%
> > > > 2) error counter was decreased from 18367 to 572
> > > >
> > > > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > >
> > > LGTM, waiting a little longer for an explicit ack from the XDP maintainers.
> >
> > Thanks. No problem.
>
> Hey! I did review it. Sorry Jason, I was under the impression that you
> still needed to sort out the performance results on your side, hence the
> silence.

Thanks for the review. My environment doesn't allow me to continue the
xdpsock experiment because of several limitations on the host side.
>
> >
> > >
> > > Side note: it could be useful to extend the xdp selftest to trigger the
> > > new code path.
> >
> > Roger that, sir. I will do it after this gets merged, maybe later this
> > month; I am still studying the various tests these days :)
>
> IMHO there is nothing worth testing with this patch per se; it's rather a
> matter of performance.
>
> I would, however, like to ask you to follow up with a patch against
> xdpsock that adds support for using this new setsockopt (once we accept
> this into the kernel).

That lives in the xdp-project repo on GitHub. I will finish it once this
patch is accepted.

Thanks,
Jason

>
> >
> > Thanks,
> > Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-07-03 13:09   ` Jason Xing
@ 2025-07-03 14:25     ` Jason Xing
  2025-07-03 14:53       ` Jason Xing
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-07-03 14:25 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Thu, Jul 3, 2025 at 9:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hi Maciej,
>
> Thanks for the review.
>
> On Thu, Jul 3, 2025 at 8:26 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Fri, Jun 27, 2025 at 07:01:21PM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > This patch provides a setsockopt method that lets applications adjust
> > > how many descs are handled at most in one send syscall. It mitigates
> > > the situation where the default value (32) is too small and leads to a
> > > higher frequency of send syscalls.
> > >
> > > Considering the diversity/complexity of applications, there is no
> > > single ideal value that fits all cases. So keep 32 as its default
> > > value like before.
> > >
> > > The patch does the following things:
> > > - Add XDP_MAX_TX_BUDGET socket option.
> > > - Convert TX_BATCH_SIZE to tx_budget_spent.
> > > - Set tx_budget_spent to 32 by default in the initialization phase as a
> > >   per-socket granular control. 32 is also the min value for
> > >   tx_budget_spent.
> > > - Set the range of tx_budget_spent as [32, xs->tx->nentries].
> > >
> > > The idea behind this comes out of real workloads in production. We use a
> > > user-level stack with xsk support to accelerate sending packets and
> > > minimize triggering syscalls. When the packets are aggregated, it's not
> > > hard to hit the upper bound (namely, 32). The moment the user-space stack
> > > sees the -EAGAIN error returned from sendto(), it will loop and try
> > > again until all the expected descs from the tx ring are sent out to the driver.
> > > Enlarging the XDP_MAX_TX_BUDGET value contributes to a lower frequency of
> > > sendto() calls and higher throughput/PPS.
> > >
> > > Here is what I did in production, along with some numbers as follows:
> > > For one application I saw lately, I suggested using 128 as max_tx_budget
> > > because I saw two limitations without changing any default configuration:
> > > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> > > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> > > this was I counted how many descs are transmitted to the driver at one
> > > time of sendto() based on [1] patch and then I calculated the
> > > possibility of hitting the upper bound. Finally I chose 128 as a
> > > suitable value because 1) it covers most of the cases, 2) a higher
> > > number would not bring evident results. After tweaking the parameters,
> > > a stable improvement of around 4% for both PPS and throughput, plus lower
> > > resource consumption, was observed via strace -c -p xxx:
> > > 1) %time was decreased by 7.8%
> > > 2) error counter was decreased from 18367 to 572
> > >
> > > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > v6
> > > Link: https://lore.kernel.org/all/20250625123527.98209-1-kerneljasonxing@gmail.com/
> > > 1. use [32, xs->tx->nentries] range
> > > 2. Since setsockopt may generate a different value, add getsockopt to help
> > >    the application know which value takes effect in the end.
> > >
> > > v5
> > > Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@gmail.com/
> > > 1. remove changes around zc mode
> > >
> > > v4
> > > Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@gmail.com/
> > > 1. remove getsockopt as it seems no real use case.
> > > 2. adjust the position of max_tx_budget to make sure it stays with other
> > > read-most fields in one cacheline.
> > > 3. set one as the lower bound of max_tx_budget
> > > 4. add more descriptions/performance data in Documentation and commit message.
> > >
> > > V3
> > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > > 1. use a per-socket control (suggested by Stanislav)
> > > 2. unify both definitions into one
> > > 3. support setsockopt and getsockopt
> > > 4. add more description in commit message
> > >
> > > V2
> > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > > 1. use a per-netns sysctl knob
> > > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > > ---
> > >  Documentation/networking/af_xdp.rst |  9 +++++++
> > >  include/net/xdp_sock.h              |  1 +
> > >  include/uapi/linux/if_xdp.h         |  1 +
> > >  net/xdp/xsk.c                       | 39 ++++++++++++++++++++++++++---
> > >  tools/include/uapi/linux/if_xdp.h   |  1 +
> > >  5 files changed, 47 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> > > index dceeb0d763aa..253afee00162 100644
> > > --- a/Documentation/networking/af_xdp.rst
> > > +++ b/Documentation/networking/af_xdp.rst
> > > @@ -442,6 +442,15 @@ is created by a privileged process and passed to a non-privileged one.
> > >  Once the option is set, kernel will refuse attempts to bind that socket
> > >  to a different interface.  Updating the value requires CAP_NET_RAW.
> > >
> > > +XDP_MAX_TX_BUDGET setsockopt
> > > +----------------------------
> > > +
> > > +This setsockopt sets the maximum number of descriptors that can be handled
> > > +and passed to the driver at one send syscall. It is applied in the non-zero
> >
> > just 'copy mode' would be enough IMHO.
>
> Okay.
>
> >
> > > +copy mode to allow application to tune the per-socket maximum iteration for
> > > +better throughput and less frequency of send syscall.
> > > +Allowed range is [32, xs->tx->nentries].
> > > +
> > >  XDP_STATISTICS getsockopt
> > >  -------------------------
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index e8bd6ddb7b12..ce587a225661 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -84,6 +84,7 @@ struct xdp_sock {
> > >       struct list_head map_list;
> > >       /* Protects map_list */
> > >       spinlock_t map_list_lock;
> > > +     u32 max_tx_budget;
> > >       /* Protects multiple processes in the control path */
> > >       struct mutex mutex;
> > >       struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
> > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > > --- a/include/uapi/linux/if_xdp.h
> > > +++ b/include/uapi/linux/if_xdp.h
> > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> > >  #define XDP_UMEM_COMPLETION_RING     6
> > >  #define XDP_STATISTICS                       7
> > >  #define XDP_OPTIONS                  8
> > > +#define XDP_MAX_TX_BUDGET            9
> > >
> > >  struct xdp_umem_reg {
> > >       __u64 addr; /* Start of packet data area */
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 72c000c0ae5f..41efe7b27b0e 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -33,8 +33,8 @@
> > >  #include "xdp_umem.h"
> > >  #include "xsk.h"
> > >
> > > -#define TX_BATCH_SIZE 32
> > > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
> > > +#define TX_BUDGET_SIZE 32
> > > +#define MAX_PER_SOCKET_BUDGET 32
> >
> > that could have stayed as it was before?
>
> Sorry, I don't think so because one definition has nothing to do with
> the other one. Why do we need to bind them together?
>
> >
> > >
> > >  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
> > >  {
> > > @@ -779,7 +779,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >  static int __xsk_generic_xmit(struct sock *sk)
> > >  {
> > >       struct xdp_sock *xs = xdp_sk(sk);
> > > -     u32 max_batch = TX_BATCH_SIZE;
> > > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> >
> > you're breaking RCT here...
>
> What's the RCT? Never heard of this abbreviation.
>
> > maybe you could READ_ONCE() the budget right
> > before the while() loop and keep the declaration only at the top of func?
> >
> > also nothing wrong with keeping @max_batch as a name for this budget.
> >
> > >       bool sent_frame = false;
> > >       struct xdp_desc desc;
> > >       struct sk_buff *skb;
> > > @@ -797,7 +797,7 @@ static int __xsk_generic_xmit(struct sock *sk)
> > >               goto out;
> > >
> > >       while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > > -             if (max_batch-- == 0) {
> > > +             if (max_budget-- == 0) {
> > >                       err = -EAGAIN;
> > >                       goto out;
> > >               }
> > > @@ -1437,6 +1437,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > >               mutex_unlock(&xs->mutex);
> > >               return err;
> > >       }
> > > +     case XDP_MAX_TX_BUDGET:
> > > +     {
> > > +             unsigned int budget;
> > > +
> > > +             if (optlen != sizeof(budget))
> > > +                     return -EINVAL;
> > > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > > +                     return -EFAULT;
> > > +             if (!xs->tx || xs->tx->nentries < TX_BUDGET_SIZE)
> > > +                     return -EACCES;
> > > +
> > > +             WRITE_ONCE(xs->max_tx_budget,
> > > +                        clamp(budget, TX_BUDGET_SIZE, xs->tx->nentries));
> >
> > I think it would be better to throw an errno when the budget set by the
> > user is bigger than nentries. You do it for the min case, let's do it for
> > the max case as well and skip the clamp() altogether?
>
> Okay.
>
> After this, do you think it's necessary to keep the getsockopt()?
>
> >
> > It is rather speculative that someone would ever set such a big budget,
> > but silently setting a lower value than provided might be confusing to me.
> > I'd rather get an error thrown in my face and find out the valid range.
>
> On point.
>
> >
> > > +             return 0;
> > > +     }
> > >       default:
> > >               break;
> > >       }
> > > @@ -1588,6 +1603,21 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
> > >
> > >               return 0;
> > >       }
> > > +     case XDP_MAX_TX_BUDGET:
> > > +     {
> > > +             unsigned int budget;
> > > +
> > > +             if (len < sizeof(budget))
> > > +                     return -EINVAL;
> > > +
> > > +             budget = READ_ONCE(xs->max_tx_budget);
> > > +             if (copy_to_user(optval, &budget, sizeof(budget)))
> > > +                     return -EFAULT;
> > > +             if (put_user(sizeof(budget), optlen))
> > > +                     return -EFAULT;
> > > +
> > > +             return 0;
> > > +     }
> > >       default:
> > >               break;
> > >       }
> > > @@ -1734,6 +1764,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
> > >
> > >       xs = xdp_sk(sk);
> > >       xs->state = XSK_READY;
> > > +     xs->max_tx_budget = TX_BUDGET_SIZE;
> > >       mutex_init(&xs->mutex);
> > >
> > >       INIT_LIST_HEAD(&xs->map_list);
> > > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> > > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > > --- a/tools/include/uapi/linux/if_xdp.h
> > > +++ b/tools/include/uapi/linux/if_xdp.h
> > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> > >  #define XDP_UMEM_COMPLETION_RING     6
> > >  #define XDP_STATISTICS                       7
> > >  #define XDP_OPTIONS                  8
> > > +#define XDP_MAX_TX_BUDGET            9
> >
> > could we have 'SKB' in name?
>
> Like XDP_MAX_TX_BUDGET_SKB ?

XDP_MAX_TX_SKB_BUDGET sounds better, I assume.
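
i.e. in both uapi headers, the rename would just be (sketch, assuming the
option id stays 9):

	#define XDP_MAX_TX_SKB_BUDGET		9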

Thanks,
Jason


* Re: [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt
  2025-07-03 14:25     ` Jason Xing
@ 2025-07-03 14:53       ` Jason Xing
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-07-03 14:53 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, joe,
	willemdebruijn.kernel, bpf, netdev, Jason Xing

On Thu, Jul 3, 2025 at 10:25 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Jul 3, 2025 at 9:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hi Maciej,
> >
> > Thanks for the review.
> >
> > On Thu, Jul 3, 2025 at 8:26 PM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Fri, Jun 27, 2025 at 07:01:21PM +0800, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This patch provides a setsockopt method that lets applications adjust
> > > > how many descs are handled at most in one send syscall. It mitigates
> > > > the situation where the default value (32) is too small and leads to a
> > > > higher frequency of send syscalls.
> > > >
> > > > Considering the diversity/complexity of applications, there is no
> > > > single ideal value that fits all cases. So keep 32 as its default
> > > > value like before.
> > > >
> > > > The patch does the following things:
> > > > - Add XDP_MAX_TX_BUDGET socket option.
> > > > - Convert TX_BATCH_SIZE to tx_budget_spent.
> > > > - Set tx_budget_spent to 32 by default in the initialization phase as a
> > > >   per-socket granular control. 32 is also the min value for
> > > >   tx_budget_spent.
> > > > - Set the range of tx_budget_spent as [32, xs->tx->nentries].
> > > >
> > > > The idea behind this comes out of real workloads in production. We use a
> > > > user-level stack with xsk support to accelerate sending packets and
> > > > minimize triggering syscalls. When the packets are aggregated, it's not
> > > > hard to hit the upper bound (namely, 32). The moment the user-space stack
> > > > sees the -EAGAIN error returned from sendto(), it will loop and try
> > > > again until all the expected descs from the tx ring are sent out to the driver.
> > > > Enlarging the XDP_MAX_TX_BUDGET value contributes to a lower frequency of
> > > > sendto() calls and higher throughput/PPS.
> > > >
> > > > Here is what I did in production, along with some numbers as follows:
> > > > For one application I saw lately, I suggested using 128 as max_tx_budget
> > > > because I saw two limitations without changing any default configuration:
> > > > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by
> > > > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind
> > > > this was I counted how many descs are transmitted to the driver at one
> > > > time of sendto() based on [1] patch and then I calculated the
> > > > possibility of hitting the upper bound. Finally I chose 128 as a
> > > > suitable value because 1) it covers most of the cases, 2) a higher
> > > > number would not bring evident results. After tweaking the parameters,
> > > > a stable improvement of around 4% for both PPS and throughput, plus lower
> > > > resource consumption, was observed via strace -c -p xxx:
> > > > 1) %time was decreased by 7.8%
> > > > 2) error counter was decreased from 18367 to 572
> > > >
> > > > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > v6
> > > > Link: https://lore.kernel.org/all/20250625123527.98209-1-kerneljasonxing@gmail.com/
> > > > 1. use [32, xs->tx->nentries] range
> > > > 2. Since setsockopt may generate a different value, add getsockopt to help
> > > >    the application know which value takes effect in the end.
> > > >
> > > > v5
> > > > Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@gmail.com/
> > > > 1. remove changes around zc mode
> > > >
> > > > v4
> > > > Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@gmail.com/
> > > > 1. remove getsockopt as it seems no real use case.
> > > > 2. adjust the position of max_tx_budget to make sure it stays with other
> > > > read-most fields in one cacheline.
> > > > 3. set one as the lower bound of max_tx_budget
> > > > 4. add more descriptions/performance data in Documentation and commit message.
> > > >
> > > > V3
> > > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@gmail.com/
> > > > 1. use a per-socket control (suggested by Stanislav)
> > > > 2. unify both definitions into one
> > > > 3. support setsockopt and getsockopt
> > > > 4. add more description in commit message
> > > >
> > > > V2
> > > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@gmail.com/
> > > > 1. use a per-netns sysctl knob
> > > > 2. use sysctl_xsk_max_tx_budget to unify both definitions.
> > > > ---
> > > >  Documentation/networking/af_xdp.rst |  9 +++++++
> > > >  include/net/xdp_sock.h              |  1 +
> > > >  include/uapi/linux/if_xdp.h         |  1 +
> > > >  net/xdp/xsk.c                       | 39 ++++++++++++++++++++++++++---
> > > >  tools/include/uapi/linux/if_xdp.h   |  1 +
> > > >  5 files changed, 47 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> > > > index dceeb0d763aa..253afee00162 100644
> > > > --- a/Documentation/networking/af_xdp.rst
> > > > +++ b/Documentation/networking/af_xdp.rst
> > > > @@ -442,6 +442,15 @@ is created by a privileged process and passed to a non-privileged one.
> > > >  Once the option is set, kernel will refuse attempts to bind that socket
> > > >  to a different interface.  Updating the value requires CAP_NET_RAW.
> > > >
> > > > +XDP_MAX_TX_BUDGET setsockopt
> > > > +----------------------------
> > > > +
> > > > +This setsockopt sets the maximum number of descriptors that can be handled
> > > > +and passed to the driver at one send syscall. It is applied in the non-zero
> > >
> > > just 'copy mode' would be enough IMHO.
> >
> > Okay.
> >
> > >
> > > > +copy mode to allow application to tune the per-socket maximum iteration for
> > > > +better throughput and less frequency of send syscall.
> > > > +Allowed range is [32, xs->tx->nentries].
> > > > +
> > > >  XDP_STATISTICS getsockopt
> > > >  -------------------------
> > > >
> > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > index e8bd6ddb7b12..ce587a225661 100644
> > > > --- a/include/net/xdp_sock.h
> > > > +++ b/include/net/xdp_sock.h
> > > > @@ -84,6 +84,7 @@ struct xdp_sock {
> > > >       struct list_head map_list;
> > > >       /* Protects map_list */
> > > >       spinlock_t map_list_lock;
> > > > +     u32 max_tx_budget;
> > > >       /* Protects multiple processes in the control path */
> > > >       struct mutex mutex;
> > > >       struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
> > > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > > > --- a/include/uapi/linux/if_xdp.h
> > > > +++ b/include/uapi/linux/if_xdp.h
> > > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> > > >  #define XDP_UMEM_COMPLETION_RING     6
> > > >  #define XDP_STATISTICS                       7
> > > >  #define XDP_OPTIONS                  8
> > > > +#define XDP_MAX_TX_BUDGET            9
> > > >
> > > >  struct xdp_umem_reg {
> > > >       __u64 addr; /* Start of packet data area */
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 72c000c0ae5f..41efe7b27b0e 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -33,8 +33,8 @@
> > > >  #include "xdp_umem.h"
> > > >  #include "xsk.h"
> > > >
> > > > -#define TX_BATCH_SIZE 32
> > > > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
> > > > +#define TX_BUDGET_SIZE 32
> > > > +#define MAX_PER_SOCKET_BUDGET 32
> > >
> > > that could have stayed as it was before?
> >
> > Sorry, I don't think so because one definition has nothing to do with
> > the other one. Why do we need to bind them together?
> >
> > >
> > > >
> > > >  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
> > > >  {
> > > > @@ -779,7 +779,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > >  {
> > > >       struct xdp_sock *xs = xdp_sk(sk);
> > > > -     u32 max_batch = TX_BATCH_SIZE;
> > > > +     u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > >
> > > you're breaking RCT here...
> >
> > What's the RCT? Never heard of this abbreviation.
> >
> > > maybe you could READ_ONCE() the budget right
> > > before the while() loop and keep the declaration only at the top of func?
> > >
> > > also nothing wrong with keeping @max_batch as a name for this budget.
> > >
> > > >       bool sent_frame = false;
> > > >       struct xdp_desc desc;
> > > >       struct sk_buff *skb;
> > > > @@ -797,7 +797,7 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > >               goto out;
> > > >
> > > >       while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > > > -             if (max_batch-- == 0) {
> > > > +             if (max_budget-- == 0) {
> > > >                       err = -EAGAIN;
> > > >                       goto out;
> > > >               }
> > > > @@ -1437,6 +1437,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > > >               mutex_unlock(&xs->mutex);
> > > >               return err;
> > > >       }
> > > > +     case XDP_MAX_TX_BUDGET:
> > > > +     {
> > > > +             unsigned int budget;
> > > > +
> > > > +             if (optlen != sizeof(budget))
> > > > +                     return -EINVAL;
> > > > +             if (copy_from_sockptr(&budget, optval, sizeof(budget)))
> > > > +                     return -EFAULT;
> > > > +             if (!xs->tx || xs->tx->nentries < TX_BUDGET_SIZE)
> > > > +                     return -EACCES;
> > > > +
> > > > +             WRITE_ONCE(xs->max_tx_budget,
> > > > +                        clamp(budget, TX_BUDGET_SIZE, xs->tx->nentries));
> > >
> > > I think it would be better to throw an errno when the budget set by the
> > > user is bigger than nentries. You do it for the min case, let's do it for
> > > the max case as well and skip the clamp() altogether?
> >
> > Okay.
> >
> > After this, do you think it's necessary to keep the getsockopt()?
> >
> > >
> > > It is rather speculative that someone would ever set such a big budget,
> > > but silently setting a lower value than provided might be confusing to me.
> > > I'd rather get an error thrown in my face and find out the valid range.
> >
> > On point.
> >
> > >
> > > > +             return 0;
> > > > +     }
> > > >       default:
> > > >               break;
> > > >       }
> > > > @@ -1588,6 +1603,21 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
> > > >
> > > >               return 0;
> > > >       }
> > > > +     case XDP_MAX_TX_BUDGET:
> > > > +     {
> > > > +             unsigned int budget;
> > > > +
> > > > +             if (len < sizeof(budget))
> > > > +                     return -EINVAL;
> > > > +
> > > > +             budget = READ_ONCE(xs->max_tx_budget);
> > > > +             if (copy_to_user(optval, &budget, sizeof(budget)))
> > > > +                     return -EFAULT;
> > > > +             if (put_user(sizeof(budget), optlen))
> > > > +                     return -EFAULT;
> > > > +
> > > > +             return 0;
> > > > +     }
> > > >       default:
> > > >               break;
> > > >       }
> > > > @@ -1734,6 +1764,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
> > > >
> > > >       xs = xdp_sk(sk);
> > > >       xs->state = XSK_READY;
> > > > +     xs->max_tx_budget = TX_BUDGET_SIZE;
> > > >       mutex_init(&xs->mutex);
> > > >
> > > >       INIT_LIST_HEAD(&xs->map_list);
> > > > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> > > > index 44f2bb93e7e6..07c6d21c2f1c 100644
> > > > --- a/tools/include/uapi/linux/if_xdp.h
> > > > +++ b/tools/include/uapi/linux/if_xdp.h
> > > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets {
> > > >  #define XDP_UMEM_COMPLETION_RING     6
> > > >  #define XDP_STATISTICS                       7
> > > >  #define XDP_OPTIONS                  8
> > > > +#define XDP_MAX_TX_BUDGET            9
> > >
> > > could we have 'SKB' in name?
> >
> > Like XDP_MAX_TX_BUDGET_SKB ?
>
> XDP_MAX_TX_SKB_BUDGET sounds better, I assume.

Updated v7 at https://lore.kernel.org/all/20250703145045.58271-1-kerneljasonxing@gmail.com/
.

Fingers crossed :S

Thanks,
Jason


end of thread, other threads:[~2025-07-03 14:54 UTC | newest]

Thread overview: 17+ messages
2025-06-27 11:01 [PATCH net-next v6] net: xsk: introduce XDP_MAX_TX_BUDGET set/getsockopt Jason Xing
2025-06-29  2:51 ` Jason Xing
2025-06-29 10:43   ` Jason Xing
2025-06-30 11:46     ` Maciej Fijalkowski
2025-06-30 12:07       ` Jason Xing
2025-06-30 12:25         ` Maciej Fijalkowski
2025-06-30 12:38           ` Jason Xing
2025-07-01  0:20           ` Jason Xing
2025-07-02 11:33 ` Jason Xing
2025-07-03  8:15 ` Paolo Abeni
2025-07-03  8:22   ` Jason Xing
2025-07-03 12:29     ` Maciej Fijalkowski
2025-07-03 13:11       ` Jason Xing
2025-07-03 12:25 ` Maciej Fijalkowski
2025-07-03 13:09   ` Jason Xing
2025-07-03 14:25     ` Jason Xing
2025-07-03 14:53       ` Jason Xing
