netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next 0/3] xsk: refactors around generic xmit side
@ 2025-09-22 15:25 Maciej Fijalkowski
  2025-09-22 15:25 ` [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic Maciej Fijalkowski
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2025-09-22 15:25 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, stfomichev, kerneljasonxing,
	Maciej Fijalkowski

Hi,

this small patchset is about refactoring code around xsk_build_skb() as
it became pretty heavy. Generic xmit is a bit hard to follow so here are
three clean ups to start with making this code more friendly.

Thanks,
Maciej

Maciej Fijalkowski (3):
  xsk: avoid overwriting skb fields for multi-buffer traffic
  xsk: remove @first_frag from xsk_build_skb()
  xsk: wrap generic metadata handling onto separate function

 net/xdp/xsk.c | 109 +++++++++++++++++++++++++++-----------------------
 1 file changed, 60 insertions(+), 49 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic
  2025-09-22 15:25 [PATCH bpf-next 0/3] xsk: refactors around generic xmit side Maciej Fijalkowski
@ 2025-09-22 15:25 ` Maciej Fijalkowski
  2025-09-22 17:25   ` Stanislav Fomichev
  2025-09-22 15:25 ` [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb() Maciej Fijalkowski
  2025-09-22 15:26 ` [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function Maciej Fijalkowski
  2 siblings, 1 reply; 15+ messages in thread
From: Maciej Fijalkowski @ 2025-09-22 15:25 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, stfomichev, kerneljasonxing,
	Maciej Fijalkowski

We are unnecessarily setting a bunch of skb fields per each processed
descriptor, which is redundant for fragmented frames.

Let us set these respective members for first fragment only.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 72e34bd2d925..72194f0a3fc0 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -758,6 +758,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				goto free_err;
 
 			xsk_set_destructor_arg(skb, desc->addr);
+			skb->dev = dev;
+			skb->priority = READ_ONCE(xs->sk.sk_priority);
+			skb->mark = READ_ONCE(xs->sk.sk_mark);
+			skb->destructor = xsk_destruct_skb;
 		} else {
 			int nr_frags = skb_shinfo(skb)->nr_frags;
 			struct xsk_addr_node *xsk_addr;
@@ -826,14 +830,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 			if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
 				skb->skb_mstamp_ns = meta->request.launch_time;
+			xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
 		}
 	}
 
-	skb->dev = dev;
-	skb->priority = READ_ONCE(xs->sk.sk_priority);
-	skb->mark = READ_ONCE(xs->sk.sk_mark);
-	skb->destructor = xsk_destruct_skb;
-	xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
 	xsk_inc_num_desc(skb);
 
 	return skb;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb()
  2025-09-22 15:25 [PATCH bpf-next 0/3] xsk: refactors around generic xmit side Maciej Fijalkowski
  2025-09-22 15:25 ` [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic Maciej Fijalkowski
@ 2025-09-22 15:25 ` Maciej Fijalkowski
  2025-09-22 17:48   ` Stanislav Fomichev
  2025-09-22 15:26 ` [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function Maciej Fijalkowski
  2 siblings, 1 reply; 15+ messages in thread
From: Maciej Fijalkowski @ 2025-09-22 15:25 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, stfomichev, kerneljasonxing,
	Maciej Fijalkowski

Devices that set IFF_TX_SKB_NO_LINEAR will not execute branch that
handles metadata, as we set @first_frag only for !IFF_TX_SKB_NO_LINEAR
code in xsk_build_skb().

Same functionality can be achieved with checking if xsk_get_num_desc()
returns 0. To replace current usage of @first_frag with
XSKCB(skb)->num_descs check, pull out the code from
xsk_set_destructor_arg() that initializes sk_buff::cb and call it before
skb_store_bits() in branch that creates skb against first processed
frag. This so error path has the XSKCB(skb)->num_descs initialized and
can free skb in case skb_store_bits() failed.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 72194f0a3fc0..064238400036 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -605,6 +605,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
 	return XSKCB(skb)->num_descs;
 }
 
+static void xsk_init_cb(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
+	INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
+	XSKCB(skb)->num_descs = 0;
+}
+
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
 	struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
@@ -620,9 +627,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 
 static void xsk_set_destructor_arg(struct sk_buff *skb, u64 addr)
 {
-	BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
-	INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
-	XSKCB(skb)->num_descs = 0;
 	skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
 }
 
@@ -672,7 +676,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 			return ERR_PTR(err);
 
 		skb_reserve(skb, hr);
-
+		xsk_init_cb(skb);
 		xsk_set_destructor_arg(skb, desc->addr);
 	} else {
 		xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
@@ -725,7 +729,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 	struct xsk_tx_metadata *meta = NULL;
 	struct net_device *dev = xs->dev;
 	struct sk_buff *skb = xs->skb;
-	bool first_frag = false;
 	int err;
 
 	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
@@ -742,8 +745,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		len = desc->len;
 
 		if (!skb) {
-			first_frag = true;
-
 			hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
 			tr = dev->needed_tailroom;
 			skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
@@ -752,6 +753,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 			skb_reserve(skb, hr);
 			skb_put(skb, len);
+			xsk_init_cb(skb);
 
 			err = skb_store_bits(skb, 0, buffer, len);
 			if (unlikely(err))
@@ -797,7 +799,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
 		}
 
-		if (first_frag && desc->options & XDP_TX_METADATA) {
+		if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
 			if (unlikely(xs->pool->tx_metadata_len == 0)) {
 				err = -EINVAL;
 				goto free_err;
@@ -839,7 +841,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 	return skb;
 
 free_err:
-	if (first_frag && skb)
+	if (skb && !xsk_get_num_desc(skb))
 		kfree_skb(skb);
 
 	if (err == -EOVERFLOW) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function
  2025-09-22 15:25 [PATCH bpf-next 0/3] xsk: refactors around generic xmit side Maciej Fijalkowski
  2025-09-22 15:25 ` [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic Maciej Fijalkowski
  2025-09-22 15:25 ` [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb() Maciej Fijalkowski
@ 2025-09-22 15:26 ` Maciej Fijalkowski
  2025-09-22 17:51   ` Stanislav Fomichev
  2025-09-23  9:43   ` Jason Xing
  2 siblings, 2 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2025-09-22 15:26 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, stfomichev, kerneljasonxing,
	Maciej Fijalkowski

xsk_build_skb() has gone wild with its size and one of the things we can
do about it is to pull out a branch that takes care of metadata handling
and make it a separate function. Consider this as a good start of
cleanup.

No functional changes here.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 83 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 37 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 064238400036..7121d4f99915 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -723,10 +723,48 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 	return skb;
 }
 
+static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
+			    struct xdp_desc *desc, struct xsk_buff_pool *pool,
+			    u32 hr)
+{
+	struct xsk_tx_metadata *meta = NULL;
+
+	if (unlikely(pool->tx_metadata_len == 0))
+		return -EINVAL;
+
+	meta = buffer - pool->tx_metadata_len;
+	if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
+		return -EINVAL;
+
+	if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
+		if (unlikely(meta->request.csum_start +
+			     meta->request.csum_offset +
+			     sizeof(__sum16) > desc->len))
+			return -EINVAL;
+
+		skb->csum_start = hr + meta->request.csum_start;
+		skb->csum_offset = meta->request.csum_offset;
+		skb->ip_summed = CHECKSUM_PARTIAL;
+
+		if (unlikely(pool->tx_sw_csum)) {
+			int err;
+
+			err = skb_checksum_help(skb);
+			if (err)
+				return err;
+		}
+	}
+
+	if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
+		skb->skb_mstamp_ns = meta->request.launch_time;
+	xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
+
+	return 0;
+}
+
 static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				     struct xdp_desc *desc)
 {
-	struct xsk_tx_metadata *meta = NULL;
 	struct net_device *dev = xs->dev;
 	struct sk_buff *skb = xs->skb;
 	int err;
@@ -764,6 +802,13 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			skb->priority = READ_ONCE(xs->sk.sk_priority);
 			skb->mark = READ_ONCE(xs->sk.sk_mark);
 			skb->destructor = xsk_destruct_skb;
+
+			if (desc->options & XDP_TX_METADATA) {
+				err = xsk_skb_metadata(skb, buffer, desc,
+						       xs->pool, hr);
+				if (unlikely(err))
+					goto free_err;
+			}
 		} else {
 			int nr_frags = skb_shinfo(skb)->nr_frags;
 			struct xsk_addr_node *xsk_addr;
@@ -798,42 +843,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			xsk_addr->addr = desc->addr;
 			list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
 		}
-
-		if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
-			if (unlikely(xs->pool->tx_metadata_len == 0)) {
-				err = -EINVAL;
-				goto free_err;
-			}
-
-			meta = buffer - xs->pool->tx_metadata_len;
-			if (unlikely(!xsk_buff_valid_tx_metadata(meta))) {
-				err = -EINVAL;
-				goto free_err;
-			}
-
-			if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
-				if (unlikely(meta->request.csum_start +
-					     meta->request.csum_offset +
-					     sizeof(__sum16) > len)) {
-					err = -EINVAL;
-					goto free_err;
-				}
-
-				skb->csum_start = hr + meta->request.csum_start;
-				skb->csum_offset = meta->request.csum_offset;
-				skb->ip_summed = CHECKSUM_PARTIAL;
-
-				if (unlikely(xs->pool->tx_sw_csum)) {
-					err = skb_checksum_help(skb);
-					if (err)
-						goto free_err;
-				}
-			}
-
-			if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
-				skb->skb_mstamp_ns = meta->request.launch_time;
-			xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
-		}
 	}
 
 	xsk_inc_num_desc(skb);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic
  2025-09-22 15:25 ` [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic Maciej Fijalkowski
@ 2025-09-22 17:25   ` Stanislav Fomichev
  2025-09-23  8:39     ` Jason Xing
  2025-09-23 15:43     ` Maciej Fijalkowski
  0 siblings, 2 replies; 15+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 17:25 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson,
	kerneljasonxing

On 09/22, Maciej Fijalkowski wrote:
> We are unnecessarily setting a bunch of skb fields per each processed
> descriptor, which is redundant for fragmented frames.
> 
> Let us set these respective members for first fragment only.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  net/xdp/xsk.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 72e34bd2d925..72194f0a3fc0 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -758,6 +758,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  				goto free_err;
>  
>  			xsk_set_destructor_arg(skb, desc->addr);
> +			skb->dev = dev;
> +			skb->priority = READ_ONCE(xs->sk.sk_priority);
> +			skb->mark = READ_ONCE(xs->sk.sk_mark);
> +			skb->destructor = xsk_destruct_skb;
>  		} else {
>  			int nr_frags = skb_shinfo(skb)->nr_frags;
>  			struct xsk_addr_node *xsk_addr;
> @@ -826,14 +830,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  
>  			if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
>  				skb->skb_mstamp_ns = meta->request.launch_time;
> +			xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
>  		}
>  	}
>  
> -	skb->dev = dev;
> -	skb->priority = READ_ONCE(xs->sk.sk_priority);
> -	skb->mark = READ_ONCE(xs->sk.sk_mark);
> -	skb->destructor = xsk_destruct_skb;
> -	xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
>  	xsk_inc_num_desc(skb);

What about IFF_TX_SKB_NO_LINEAR case? I'm not super familiar with
it, but I don't see priority/mark being set over there after this change.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb()
  2025-09-22 15:25 ` [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb() Maciej Fijalkowski
@ 2025-09-22 17:48   ` Stanislav Fomichev
  2025-09-23  9:25     ` Jason Xing
  0 siblings, 1 reply; 15+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 17:48 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson,
	kerneljasonxing

On 09/22, Maciej Fijalkowski wrote:
> Devices that set IFF_TX_SKB_NO_LINEAR will not execute branch that
> handles metadata, as we set @first_frag only for !IFF_TX_SKB_NO_LINEAR
> code in xsk_build_skb().
> 
> Same functionality can be achieved with checking if xsk_get_num_desc()
> returns 0. To replace current usage of @first_frag with
> XSKCB(skb)->num_descs check, pull out the code from
> xsk_set_destructor_arg() that initializes sk_buff::cb and call it before
> skb_store_bits() in branch that creates skb against first processed
> frag. This so error path has the XSKCB(skb)->num_descs initialized and
> can free skb in case skb_store_bits() failed.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  net/xdp/xsk.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 72194f0a3fc0..064238400036 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -605,6 +605,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
>  	return XSKCB(skb)->num_descs;
>  }
>  
> +static void xsk_init_cb(struct sk_buff *skb)
> +{
> +	BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> +	INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> +	XSKCB(skb)->num_descs = 0;
> +}
> +
>  static void xsk_destruct_skb(struct sk_buff *skb)
>  {
>  	struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> @@ -620,9 +627,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
>  
>  static void xsk_set_destructor_arg(struct sk_buff *skb, u64 addr)
>  {
> -	BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> -	INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> -	XSKCB(skb)->num_descs = 0;
>  	skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
>  }
>  
> @@ -672,7 +676,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
>  			return ERR_PTR(err);
>  
>  		skb_reserve(skb, hr);
> -
> +		xsk_init_cb(skb);
>  		xsk_set_destructor_arg(skb, desc->addr);
>  	} else {
>  		xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> @@ -725,7 +729,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  	struct xsk_tx_metadata *meta = NULL;
>  	struct net_device *dev = xs->dev;
>  	struct sk_buff *skb = xs->skb;
> -	bool first_frag = false;
>  	int err;
>  
>  	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> @@ -742,8 +745,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  		len = desc->len;
>  
>  		if (!skb) {
> -			first_frag = true;
> -
>  			hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
>  			tr = dev->needed_tailroom;
>  			skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
> @@ -752,6 +753,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  
>  			skb_reserve(skb, hr);
>  			skb_put(skb, len);
> +			xsk_init_cb(skb);
>  
>  			err = skb_store_bits(skb, 0, buffer, len);
>  			if (unlikely(err))
> @@ -797,7 +799,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  			list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
>  		}
>  
> -		if (first_frag && desc->options & XDP_TX_METADATA) {
> +		if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
>  			if (unlikely(xs->pool->tx_metadata_len == 0)) {
>  				err = -EINVAL;
>  				goto free_err;
> @@ -839,7 +841,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  	return skb;
>  
>  free_err:
> -	if (first_frag && skb)

[..]

> +	if (skb && !xsk_get_num_desc(skb))
>  		kfree_skb(skb);
>  
>  	if (err == -EOVERFLOW) {

For IFF_TX_SKB_NO_LINEAR case, the 'goto free_err' is super confusing.
xsk_build_skb_zerocopy either returns skb or an IS_ERR. Can we
add a separate label to jump directly to 'if err == -EOVERFLOW' for
the IFF_TX_SKB_NO_LINEAR case?

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 72e34bd2d925..f56182c61c99 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -732,7 +732,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		skb = xsk_build_skb_zerocopy(xs, desc);
 		if (IS_ERR(skb)) {
 			err = PTR_ERR(skb);
-			goto free_err;
+			goto out;
 		}
 	} else {
 		u32 hr, tr, len;
@@ -842,6 +842,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 	if (first_frag && skb)
 		kfree_skb(skb);
 
+out:
 	if (err == -EOVERFLOW) {
 		/* Drop the packet */
 		xsk_inc_num_desc(xs->skb);

After that, it seems we can look at skb_shinfo(skb)->nr_frags? Instead
of adding new xsk_init_cb, seems more robust?

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function
  2025-09-22 15:26 ` [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function Maciej Fijalkowski
@ 2025-09-22 17:51   ` Stanislav Fomichev
  2025-09-23  9:43   ` Jason Xing
  1 sibling, 0 replies; 15+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 17:51 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson,
	kerneljasonxing

On 09/22, Maciej Fijalkowski wrote:
> xsk_build_skb() has gone wild with its size and one of the things we can
> do about it is to pull out a branch that takes care of metadata handling
> and make it a separate function. Consider this as a good start of
> cleanup.
> 
> No functional changes here.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic
  2025-09-22 17:25   ` Stanislav Fomichev
@ 2025-09-23  8:39     ` Jason Xing
  2025-09-23 15:43     ` Maciej Fijalkowski
  1 sibling, 0 replies; 15+ messages in thread
From: Jason Xing @ 2025-09-23  8:39 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Maciej Fijalkowski, bpf, ast, daniel, andrii, netdev,
	magnus.karlsson

On Tue, Sep 23, 2025 at 1:25 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 09/22, Maciej Fijalkowski wrote:
> > We are unnecessarily setting a bunch of skb fields per each processed
> > descriptor, which is redundant for fragmented frames.
> >
> > Let us set these respective members for first fragment only.
> >
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  net/xdp/xsk.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 72e34bd2d925..72194f0a3fc0 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -758,6 +758,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >                               goto free_err;
> >
> >                       xsk_set_destructor_arg(skb, desc->addr);
> > +                     skb->dev = dev;
> > +                     skb->priority = READ_ONCE(xs->sk.sk_priority);
> > +                     skb->mark = READ_ONCE(xs->sk.sk_mark);
> > +                     skb->destructor = xsk_destruct_skb;
> >               } else {
> >                       int nr_frags = skb_shinfo(skb)->nr_frags;
> >                       struct xsk_addr_node *xsk_addr;
> > @@ -826,14 +830,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >
> >                       if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
> >                               skb->skb_mstamp_ns = meta->request.launch_time;
> > +                     xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
> >               }
> >       }
> >
> > -     skb->dev = dev;
> > -     skb->priority = READ_ONCE(xs->sk.sk_priority);
> > -     skb->mark = READ_ONCE(xs->sk.sk_mark);
> > -     skb->destructor = xsk_destruct_skb;
> > -     xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
> >       xsk_inc_num_desc(skb);
>
> What about IFF_TX_SKB_NO_LINEAR case? I'm not super familiar with
> it, but I don't see priority/mark being set over there after this change.

Agreed. NO_LINEAR is used for VM. Aside from what you mentioned, with
this adjustment the initialization of skb is not finished here, which
leads to 1) failure in __dev_direct_xmit() due to unknown skb->dev, 2)
losing the chance to set its own destructor, etc. Those fields work
for either linear drivers (like virtio_net) or physical drivers.

Testing on my VM, I saw the following splat appearing on the screen as
I pointed out earlier:
[   91.389269] RIP: 0010:__dev_direct_xmit+0x32/0x1e0
[   91.389659] Code: e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 83 ec
18 89 75 c4 48 8b 5f 10 65 48 8b 05 d0 f3 b7 01 48 89 45 d0 31 c0 c6
45 cf 00 <48> 8b 83 a8 00 00 00 a8 01 0f 84 90 01 00 00 48 8b 83 a8 00
00 00
[   91.391095] RSP: 0018:ffffc9000482bce8 EFLAGS: 00010246
[   91.391538] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[   91.392107] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888120440700
[   91.392663] RBP: ffffc9000482bd28 R08: 000000000000003c R09: 0000000000000001
[   91.393230] R10: 0000000000001000 R11: 0000000000000000 R12: ffff888120440700
[   91.393800] R13: ffff888101ed4e00 R14: ffff888120440700 R15: ffff888123bf2c00
[   91.394360] FS:  00007f2094609540(0000) GS:ffff88907bec2000(0000)
knlGS:0000000000000000
[   91.394992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   91.395447] CR2: 00000000000000a8 CR3: 0000000113d0d000 CR4: 00000000003506f0
[   91.396015] Call Trace:
[   91.396226]  <TASK>
[   91.396415]  __xsk_generic_xmit+0x315/0x3c0
[   91.396767]  __xsk_sendmsg.constprop.0.isra.0+0x16f/0x1a0
[   91.397208]  xsk_sendmsg+0x25/0x40
[   91.397496]  __sys_sendto+0x210/0x220
[   91.397811]  ? srso_return_thunk+0x5/0x5f
[   91.398343]  ? _sched_setscheduler.isra.0+0x7b/0xb0
[   91.398935]  __x64_sys_sendto+0x24/0x30
[   91.399433]  x64_sys_call+0x8d4/0x1fc0
[   91.399937]  do_syscall_64+0x5d/0x2e0
[   91.400437]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Thanks,
Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb()
  2025-09-22 17:48   ` Stanislav Fomichev
@ 2025-09-23  9:25     ` Jason Xing
  2025-09-23  9:36       ` Jason Xing
  2025-09-24 14:35       ` Maciej Fijalkowski
  0 siblings, 2 replies; 15+ messages in thread
From: Jason Xing @ 2025-09-23  9:25 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Maciej Fijalkowski, bpf, ast, daniel, andrii, netdev,
	magnus.karlsson

On Tue, Sep 23, 2025 at 1:48 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 09/22, Maciej Fijalkowski wrote:
> > Devices that set IFF_TX_SKB_NO_LINEAR will not execute branch that
> > handles metadata, as we set @first_frag only for !IFF_TX_SKB_NO_LINEAR
> > code in xsk_build_skb().
> >
> > Same functionality can be achieved with checking if xsk_get_num_desc()
> > returns 0. To replace current usage of @first_frag with
> > XSKCB(skb)->num_descs check, pull out the code from
> > xsk_set_destructor_arg() that initializes sk_buff::cb and call it before
> > skb_store_bits() in branch that creates skb against first processed
> > frag. This so error path has the XSKCB(skb)->num_descs initialized and
> > can free skb in case skb_store_bits() failed.
> >
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  net/xdp/xsk.c | 20 +++++++++++---------
> >  1 file changed, 11 insertions(+), 9 deletions(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 72194f0a3fc0..064238400036 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -605,6 +605,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
> >       return XSKCB(skb)->num_descs;
> >  }
> >
> > +static void xsk_init_cb(struct sk_buff *skb)
> > +{
> > +     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > +     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > +     XSKCB(skb)->num_descs = 0;
> > +}
> > +
> >  static void xsk_destruct_skb(struct sk_buff *skb)
> >  {
> >       struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> > @@ -620,9 +627,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> >
> >  static void xsk_set_destructor_arg(struct sk_buff *skb, u64 addr)
> >  {
> > -     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > -     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > -     XSKCB(skb)->num_descs = 0;
> >       skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
> >  }
> >
> > @@ -672,7 +676,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> >                       return ERR_PTR(err);
> >
> >               skb_reserve(skb, hr);
> > -
> > +             xsk_init_cb(skb);
> >               xsk_set_destructor_arg(skb, desc->addr);
> >       } else {
> >               xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > @@ -725,7 +729,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >       struct xsk_tx_metadata *meta = NULL;
> >       struct net_device *dev = xs->dev;
> >       struct sk_buff *skb = xs->skb;
> > -     bool first_frag = false;
> >       int err;
> >
> >       if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> > @@ -742,8 +745,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >               len = desc->len;
> >
> >               if (!skb) {
> > -                     first_frag = true;
> > -
> >                       hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> >                       tr = dev->needed_tailroom;
> >                       skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
> > @@ -752,6 +753,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >
> >                       skb_reserve(skb, hr);
> >                       skb_put(skb, len);
> > +                     xsk_init_cb(skb);
> >
> >                       err = skb_store_bits(skb, 0, buffer, len);
> >                       if (unlikely(err))
> > @@ -797,7 +799,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >                       list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> >               }
> >
> > -             if (first_frag && desc->options & XDP_TX_METADATA) {
> > +             if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
> >                       if (unlikely(xs->pool->tx_metadata_len == 0)) {
> >                               err = -EINVAL;
> >                               goto free_err;
> > @@ -839,7 +841,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >       return skb;
> >
> >  free_err:
> > -     if (first_frag && skb)
>
> [..]
>
> > +     if (skb && !xsk_get_num_desc(skb))
> >               kfree_skb(skb);
> >
> >       if (err == -EOVERFLOW) {
>
> For IFF_TX_SKB_NO_LINEAR case, the 'goto free_err' is super confusing.
> xsk_build_skb_zerocopy either returns skb or an IS_ERR. Can we
> add a separate label to jump directly to 'if err == -EOVERFLOW' for
> the IFF_TX_SKB_NO_LINEAR case?
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 72e34bd2d925..f56182c61c99 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -732,7 +732,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>                 skb = xsk_build_skb_zerocopy(xs, desc);
>                 if (IS_ERR(skb)) {
>                         err = PTR_ERR(skb);
> -                       goto free_err;
> +                       goto out;
>                 }
>         } else {
>                 u32 hr, tr, len;
> @@ -842,6 +842,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>         if (first_frag && skb)
>                 kfree_skb(skb);
>
> +out:
>         if (err == -EOVERFLOW) {
>                 /* Drop the packet */
>                 xsk_inc_num_desc(xs->skb);
>
> After that, it seems we can look at skb_shinfo(skb)->nr_frags? Instead
> of adding new xsk_init_cb, seems more robust?

+1. It would be simpler.

And I think this patch should be a standalone one because it actually
supports the missing feature for the VM scenario.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb()
  2025-09-23  9:25     ` Jason Xing
@ 2025-09-23  9:36       ` Jason Xing
  2025-09-24 14:35       ` Maciej Fijalkowski
  1 sibling, 0 replies; 15+ messages in thread
From: Jason Xing @ 2025-09-23  9:36 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Maciej Fijalkowski, bpf, ast, daniel, andrii, netdev,
	magnus.karlsson

On Tue, Sep 23, 2025 at 5:25 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Sep 23, 2025 at 1:48 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 09/22, Maciej Fijalkowski wrote:
> > > Devices that set IFF_TX_SKB_NO_LINEAR will not execute branch that
> > > handles metadata, as we set @first_frag only for !IFF_TX_SKB_NO_LINEAR
> > > code in xsk_build_skb().
> > >
> > > Same functionality can be achieved with checking if xsk_get_num_desc()
> > > returns 0. To replace current usage of @first_frag with
> > > XSKCB(skb)->num_descs check, pull out the code from
> > > xsk_set_destructor_arg() that initializes sk_buff::cb and call it before
> > > skb_store_bits() in branch that creates skb against first processed
> > > frag. This so error path has the XSKCB(skb)->num_descs initialized and
> > > can free skb in case skb_store_bits() failed.
> > >
> > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > ---
> > >  net/xdp/xsk.c | 20 +++++++++++---------
> > >  1 file changed, 11 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 72194f0a3fc0..064238400036 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -605,6 +605,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
> > >       return XSKCB(skb)->num_descs;
> > >  }
> > >
> > > +static void xsk_init_cb(struct sk_buff *skb)
> > > +{
> > > +     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > +     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > +     XSKCB(skb)->num_descs = 0;
> > > +}
> > > +
> > >  static void xsk_destruct_skb(struct sk_buff *skb)
> > >  {
> > >       struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> > > @@ -620,9 +627,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > >
> > >  static void xsk_set_destructor_arg(struct sk_buff *skb, u64 addr)
> > >  {
> > > -     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > -     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > -     XSKCB(skb)->num_descs = 0;
> > >       skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
> > >  }
> > >
> > > @@ -672,7 +676,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > >                       return ERR_PTR(err);
> > >
> > >               skb_reserve(skb, hr);
> > > -
> > > +             xsk_init_cb(skb);
> > >               xsk_set_destructor_arg(skb, desc->addr);
> > >       } else {
> > >               xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > > @@ -725,7 +729,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >       struct xsk_tx_metadata *meta = NULL;
> > >       struct net_device *dev = xs->dev;
> > >       struct sk_buff *skb = xs->skb;
> > > -     bool first_frag = false;
> > >       int err;
> > >
> > >       if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> > > @@ -742,8 +745,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >               len = desc->len;
> > >
> > >               if (!skb) {
> > > -                     first_frag = true;
> > > -
> > >                       hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> > >                       tr = dev->needed_tailroom;
> > >                       skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
> > > @@ -752,6 +753,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >
> > >                       skb_reserve(skb, hr);
> > >                       skb_put(skb, len);
> > > +                     xsk_init_cb(skb);
> > >
> > >                       err = skb_store_bits(skb, 0, buffer, len);
> > >                       if (unlikely(err))
> > > @@ -797,7 +799,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >                       list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> > >               }
> > >
> > > -             if (first_frag && desc->options & XDP_TX_METADATA) {
> > > +             if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
> > >                       if (unlikely(xs->pool->tx_metadata_len == 0)) {
> > >                               err = -EINVAL;
> > >                               goto free_err;
> > > @@ -839,7 +841,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >       return skb;
> > >
> > >  free_err:
> > > -     if (first_frag && skb)
> >
> > [..]
> >
> > > +     if (skb && !xsk_get_num_desc(skb))
> > >               kfree_skb(skb);
> > >
> > >       if (err == -EOVERFLOW) {
> >
> > For IFF_TX_SKB_NO_LINEAR case, the 'goto free_err' is super confusing.
> > xsk_build_skb_zerocopy either returns skb or an IS_ERR. Can we
> > add a separate label to jump directly to 'if err == -EOVERFLOW' for
> > the IFF_TX_SKB_NO_LINEAR case?
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 72e34bd2d925..f56182c61c99 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -732,7 +732,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >                 skb = xsk_build_skb_zerocopy(xs, desc);
> >                 if (IS_ERR(skb)) {
> >                         err = PTR_ERR(skb);
> > -                       goto free_err;
> > +                       goto out;
> >                 }
> >         } else {
> >                 u32 hr, tr, len;
> > @@ -842,6 +842,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >         if (first_frag && skb)
> >                 kfree_skb(skb);
> >
> > +out:
> >         if (err == -EOVERFLOW) {
> >                 /* Drop the packet */
> >                 xsk_inc_num_desc(xs->skb);
> >
> > After that, it seems we can look at skb_shinfo(skb)->nr_frags? Instead
> > of adding new xsk_init_cb, seems more robust?
>
> +1. It would be simpler.
>
> And I think this patch should be a standalone one because it actually
> supports the missing feature for the VM scenario.

Reading the commit message one more time, previously I wrongly
considered the original goal you wanted to achieve was to support the
VM case. So please ignore my previous suggestion since it's a simple
cleanup.

Thanks,
Jason

>
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function
  2025-09-22 15:26 ` [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function Maciej Fijalkowski
  2025-09-22 17:51   ` Stanislav Fomichev
@ 2025-09-23  9:43   ` Jason Xing
  1 sibling, 0 replies; 15+ messages in thread
From: Jason Xing @ 2025-09-23  9:43 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, stfomichev

On Mon, Sep 22, 2025 at 11:26 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> xsk_build_skb() has gone wild with its size and one of the things we can
> do about it is to pull out a branch that takes care of metadata handling
> and make it a separate function. Consider this as a good start of
> cleanup.
>
> No functional changes here.
>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks,
Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic
  2025-09-22 17:25   ` Stanislav Fomichev
  2025-09-23  8:39     ` Jason Xing
@ 2025-09-23 15:43     ` Maciej Fijalkowski
  1 sibling, 0 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2025-09-23 15:43 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson,
	kerneljasonxing

On Mon, Sep 22, 2025 at 10:25:32AM -0700, Stanislav Fomichev wrote:
> On 09/22, Maciej Fijalkowski wrote:
> > We are unnecessarily setting a bunch of skb fields per each processed
> > descriptor, which is redundant for fragmented frames.
> > 
> > Let us set these respective members for first fragment only.
> > 
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  net/xdp/xsk.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 72e34bd2d925..72194f0a3fc0 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -758,6 +758,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >  				goto free_err;
> >  
> >  			xsk_set_destructor_arg(skb, desc->addr);
> > +			skb->dev = dev;
> > +			skb->priority = READ_ONCE(xs->sk.sk_priority);
> > +			skb->mark = READ_ONCE(xs->sk.sk_mark);
> > +			skb->destructor = xsk_destruct_skb;
> >  		} else {
> >  			int nr_frags = skb_shinfo(skb)->nr_frags;
> >  			struct xsk_addr_node *xsk_addr;
> > @@ -826,14 +830,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >  
> >  			if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
> >  				skb->skb_mstamp_ns = meta->request.launch_time;
> > +			xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
> >  		}
> >  	}
> >  
> > -	skb->dev = dev;
> > -	skb->priority = READ_ONCE(xs->sk.sk_priority);
> > -	skb->mark = READ_ONCE(xs->sk.sk_mark);
> > -	skb->destructor = xsk_destruct_skb;
> > -	xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
> >  	xsk_inc_num_desc(skb);
> 
> What about IFF_TX_SKB_NO_LINEAR case? I'm not super familiar with
> it, but I don't see priority/mark being set over there after this change.

The thing is I tricked myself with running xskxceiver against the changes
and not seeing issues :< so IFF_TX_SKB_NO_LINEAR case needs a test
coverage pretty badly I'd say...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb()
  2025-09-23  9:25     ` Jason Xing
  2025-09-23  9:36       ` Jason Xing
@ 2025-09-24 14:35       ` Maciej Fijalkowski
  2025-09-25  0:17         ` Jason Xing
  1 sibling, 1 reply; 15+ messages in thread
From: Maciej Fijalkowski @ 2025-09-24 14:35 UTC (permalink / raw)
  To: Jason Xing
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, netdev,
	magnus.karlsson

On Tue, Sep 23, 2025 at 05:25:01PM +0800, Jason Xing wrote:
> On Tue, Sep 23, 2025 at 1:48 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 09/22, Maciej Fijalkowski wrote:
> > > Devices that set IFF_TX_SKB_NO_LINEAR will not execute branch that
> > > handles metadata, as we set @first_frag only for !IFF_TX_SKB_NO_LINEAR
> > > code in xsk_build_skb().
> > >
> > > Same functionality can be achieved with checking if xsk_get_num_desc()
> > > returns 0. To replace current usage of @first_frag with
> > > XSKCB(skb)->num_descs check, pull out the code from
> > > xsk_set_destructor_arg() that initializes sk_buff::cb and call it before
> > > skb_store_bits() in branch that creates skb against first processed
> > > frag. This so error path has the XSKCB(skb)->num_descs initialized and
> > > can free skb in case skb_store_bits() failed.
> > >
> > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > ---
> > >  net/xdp/xsk.c | 20 +++++++++++---------
> > >  1 file changed, 11 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 72194f0a3fc0..064238400036 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -605,6 +605,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
> > >       return XSKCB(skb)->num_descs;
> > >  }
> > >
> > > +static void xsk_init_cb(struct sk_buff *skb)
> > > +{
> > > +     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > +     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > +     XSKCB(skb)->num_descs = 0;
> > > +}
> > > +
> > >  static void xsk_destruct_skb(struct sk_buff *skb)
> > >  {
> > >       struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> > > @@ -620,9 +627,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > >
> > >  static void xsk_set_destructor_arg(struct sk_buff *skb, u64 addr)
> > >  {
> > > -     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > -     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > -     XSKCB(skb)->num_descs = 0;
> > >       skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
> > >  }
> > >
> > > @@ -672,7 +676,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > >                       return ERR_PTR(err);
> > >
> > >               skb_reserve(skb, hr);
> > > -
> > > +             xsk_init_cb(skb);
> > >               xsk_set_destructor_arg(skb, desc->addr);
> > >       } else {
> > >               xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > > @@ -725,7 +729,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >       struct xsk_tx_metadata *meta = NULL;
> > >       struct net_device *dev = xs->dev;
> > >       struct sk_buff *skb = xs->skb;
> > > -     bool first_frag = false;
> > >       int err;
> > >
> > >       if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> > > @@ -742,8 +745,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >               len = desc->len;
> > >
> > >               if (!skb) {
> > > -                     first_frag = true;
> > > -
> > >                       hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> > >                       tr = dev->needed_tailroom;
> > >                       skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
> > > @@ -752,6 +753,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >
> > >                       skb_reserve(skb, hr);
> > >                       skb_put(skb, len);
> > > +                     xsk_init_cb(skb);
> > >
> > >                       err = skb_store_bits(skb, 0, buffer, len);
> > >                       if (unlikely(err))
> > > @@ -797,7 +799,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >                       list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> > >               }
> > >
> > > -             if (first_frag && desc->options & XDP_TX_METADATA) {
> > > +             if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
> > >                       if (unlikely(xs->pool->tx_metadata_len == 0)) {
> > >                               err = -EINVAL;
> > >                               goto free_err;
> > > @@ -839,7 +841,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >       return skb;
> > >
> > >  free_err:
> > > -     if (first_frag && skb)
> >
> > [..]
> >
> > > +     if (skb && !xsk_get_num_desc(skb))
> > >               kfree_skb(skb);
> > >
> > >       if (err == -EOVERFLOW) {
> >
> > For IFF_TX_SKB_NO_LINEAR case, the 'goto free_err' is super confusing.
> > xsk_build_skb_zerocopy either returns skb or an IS_ERR. Can we
> > add a separate label to jump directly to 'if err == -EOVERFLOW' for
> > the IFF_TX_SKB_NO_LINEAR case?

Right, I got hit with this when running xdpsock within VM now against
virtio-net driver. Since I removed @first_frag and sock_alloc_send_skb()
managed to give me -EAGAIN at start, skb was treated as valid pointer and
then I got splat when accessing either cb or skb_shared_info.

So either we NULL the skb for xsk_build_skb_zerocopy() error path (which
would be fine even for -EOVERFLOW as error path uses xs->skb pointer, not
the local one) or we introduce separate label as you suggest. No strong
opinions here.

> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 72e34bd2d925..f56182c61c99 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -732,7 +732,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >                 skb = xsk_build_skb_zerocopy(xs, desc);
> >                 if (IS_ERR(skb)) {
> >                         err = PTR_ERR(skb);
> > -                       goto free_err;
> > +                       goto out;
> >                 }
> >         } else {
> >                 u32 hr, tr, len;
> > @@ -842,6 +842,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >         if (first_frag && skb)
> >                 kfree_skb(skb);
> >
> > +out:
> >         if (err == -EOVERFLOW) {
> >                 /* Drop the packet */
> >                 xsk_inc_num_desc(xs->skb);
> >
> > After that, it seems we can look at skb_shinfo(skb)->nr_frags? Instead
> > of adding new xsk_init_cb, seems more robust?

Thanks! I'll do it.

> 
> +1. It would be simpler.
> 
> And I think this patch should be a standalone one because it actually
> supports the missing feature for the VM scenario.

Hi Jason,
in commit message, I wrote about enabling tx metadata for
xsk_build_skb_zerocopy() but code did not reflect that as you point out in
your later reply.

Unless there are any objections I will actually enable it there.

> 
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb()
  2025-09-24 14:35       ` Maciej Fijalkowski
@ 2025-09-25  0:17         ` Jason Xing
  2025-09-25  8:17           ` Jason Xing
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Xing @ 2025-09-25  0:17 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, netdev,
	magnus.karlsson

On Wed, Sep 24, 2025 at 10:36 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Tue, Sep 23, 2025 at 05:25:01PM +0800, Jason Xing wrote:
> > On Tue, Sep 23, 2025 at 1:48 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 09/22, Maciej Fijalkowski wrote:
> > > > Devices that set IFF_TX_SKB_NO_LINEAR will not execute branch that
> > > > handles metadata, as we set @first_frag only for !IFF_TX_SKB_NO_LINEAR
> > > > code in xsk_build_skb().
> > > >
> > > > Same functionality can be achieved with checking if xsk_get_num_desc()
> > > > returns 0. To replace current usage of @first_frag with
> > > > XSKCB(skb)->num_descs check, pull out the code from
> > > > xsk_set_destructor_arg() that initializes sk_buff::cb and call it before
> > > > skb_store_bits() in branch that creates skb against first processed
> > > > frag. This so error path has the XSKCB(skb)->num_descs initialized and
> > > > can free skb in case skb_store_bits() failed.
> > > >
> > > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > > ---
> > > >  net/xdp/xsk.c | 20 +++++++++++---------
> > > >  1 file changed, 11 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 72194f0a3fc0..064238400036 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -605,6 +605,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
> > > >       return XSKCB(skb)->num_descs;
> > > >  }
> > > >
> > > > +static void xsk_init_cb(struct sk_buff *skb)
> > > > +{
> > > > +     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > > +     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > > +     XSKCB(skb)->num_descs = 0;
> > > > +}
> > > > +
> > > >  static void xsk_destruct_skb(struct sk_buff *skb)
> > > >  {
> > > >       struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> > > > @@ -620,9 +627,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > >
> > > >  static void xsk_set_destructor_arg(struct sk_buff *skb, u64 addr)
> > > >  {
> > > > -     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > > -     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > > -     XSKCB(skb)->num_descs = 0;
> > > >       skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
> > > >  }
> > > >
> > > > @@ -672,7 +676,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > >                       return ERR_PTR(err);
> > > >
> > > >               skb_reserve(skb, hr);
> > > > -
> > > > +             xsk_init_cb(skb);
> > > >               xsk_set_destructor_arg(skb, desc->addr);
> > > >       } else {
> > > >               xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > > > @@ -725,7 +729,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >       struct xsk_tx_metadata *meta = NULL;
> > > >       struct net_device *dev = xs->dev;
> > > >       struct sk_buff *skb = xs->skb;
> > > > -     bool first_frag = false;
> > > >       int err;
> > > >
> > > >       if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> > > > @@ -742,8 +745,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >               len = desc->len;
> > > >
> > > >               if (!skb) {
> > > > -                     first_frag = true;
> > > > -
> > > >                       hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> > > >                       tr = dev->needed_tailroom;
> > > >                       skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
> > > > @@ -752,6 +753,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >
> > > >                       skb_reserve(skb, hr);
> > > >                       skb_put(skb, len);
> > > > +                     xsk_init_cb(skb);
> > > >
> > > >                       err = skb_store_bits(skb, 0, buffer, len);
> > > >                       if (unlikely(err))
> > > > @@ -797,7 +799,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >                       list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> > > >               }
> > > >
> > > > -             if (first_frag && desc->options & XDP_TX_METADATA) {
> > > > +             if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
> > > >                       if (unlikely(xs->pool->tx_metadata_len == 0)) {
> > > >                               err = -EINVAL;
> > > >                               goto free_err;
> > > > @@ -839,7 +841,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >       return skb;
> > > >
> > > >  free_err:
> > > > -     if (first_frag && skb)
> > >
> > > [..]
> > >
> > > > +     if (skb && !xsk_get_num_desc(skb))
> > > >               kfree_skb(skb);
> > > >
> > > >       if (err == -EOVERFLOW) {
> > >
> > > For IFF_TX_SKB_NO_LINEAR case, the 'goto free_err' is super confusing.
> > > xsk_build_skb_zerocopy either returns skb or an IS_ERR. Can we
> > > add a separate label to jump directly to 'if err == -EOVERFLOW' for
> > > the IFF_TX_SKB_NO_LINEAR case?
>
> Right, I got hit with this when running xdpsock within VM now against
> virtio-net driver. Since I removed @first_frag and sock_alloc_send_skb()
> managed to give me -EAGAIN at start, skb was treated as valid pointer and
> then I got splat when accessing either cb or skb_shared_info.
>
> So either we NULL the skb for xsk_build_skb_zerocopy() error path (which
> would be fine even for -EOVERFLOW as error path uses xs->skb pointer, not
> the local one) or we introduce separate label as you suggest. No strong
> opinions here.
>
> > >
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 72e34bd2d925..f56182c61c99 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -732,7 +732,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >                 skb = xsk_build_skb_zerocopy(xs, desc);
> > >                 if (IS_ERR(skb)) {
> > >                         err = PTR_ERR(skb);
> > > -                       goto free_err;
> > > +                       goto out;
> > >                 }
> > >         } else {
> > >                 u32 hr, tr, len;
> > > @@ -842,6 +842,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >         if (first_frag && skb)
> > >                 kfree_skb(skb);
> > >
> > > +out:
> > >         if (err == -EOVERFLOW) {
> > >                 /* Drop the packet */
> > >                 xsk_inc_num_desc(xs->skb);
> > >
> > > After that, it seems we can look at skb_shinfo(skb)->nr_frags? Instead
> > > of adding new xsk_init_cb, seems more robust?
>
> Thanks! I'll do it.
>
> >
> > +1. It would be simpler.
> >
> > And I think this patch should be a standalone one because it actually
> > supports the missing feature for the VM scenario.
>
> Hi Jason,
> in commit message, I wrote about enabling tx metadata for
> xsk_build_skb_zerocopy() but code did not reflect that as you point out in
> your later reply.
>
> Unless there are any objections I will actually enable it there.

Oh, if you made up your mind enabling it in the next version, how
about changing the title of this series because the core changes are
all about tx metadata rather than simply cleaning up the codes. See,
patch 1, I think, will be scrapped; patch 2 is used to support that
missing feature; patch 3 is a follow-up to patch 2. WDYT?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb()
  2025-09-25  0:17         ` Jason Xing
@ 2025-09-25  8:17           ` Jason Xing
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Xing @ 2025-09-25  8:17 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, netdev,
	magnus.karlsson

On Thu, Sep 25, 2025 at 8:17 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Sep 24, 2025 at 10:36 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Tue, Sep 23, 2025 at 05:25:01PM +0800, Jason Xing wrote:
> > > On Tue, Sep 23, 2025 at 1:48 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > >
> > > > On 09/22, Maciej Fijalkowski wrote:
> > > > > Devices that set IFF_TX_SKB_NO_LINEAR will not execute branch that
> > > > > handles metadata, as we set @first_frag only for !IFF_TX_SKB_NO_LINEAR
> > > > > code in xsk_build_skb().
> > > > >
> > > > > Same functionality can be achieved with checking if xsk_get_num_desc()
> > > > > returns 0. To replace current usage of @first_frag with
> > > > > XSKCB(skb)->num_descs check, pull out the code from
> > > > > xsk_set_destructor_arg() that initializes sk_buff::cb and call it before
> > > > > skb_store_bits() in branch that creates skb against first processed
> > > > > frag. This so error path has the XSKCB(skb)->num_descs initialized and
> > > > > can free skb in case skb_store_bits() failed.
> > > > >
> > > > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > > > ---
> > > > >  net/xdp/xsk.c | 20 +++++++++++---------
> > > > >  1 file changed, 11 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > index 72194f0a3fc0..064238400036 100644
> > > > > --- a/net/xdp/xsk.c
> > > > > +++ b/net/xdp/xsk.c
> > > > > @@ -605,6 +605,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
> > > > >       return XSKCB(skb)->num_descs;
> > > > >  }
> > > > >
> > > > > +static void xsk_init_cb(struct sk_buff *skb)
> > > > > +{
> > > > > +     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > > > +     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > > > +     XSKCB(skb)->num_descs = 0;
> > > > > +}
> > > > > +
> > > > >  static void xsk_destruct_skb(struct sk_buff *skb)
> > > > >  {
> > > > >       struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> > > > > @@ -620,9 +627,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > >
> > > > >  static void xsk_set_destructor_arg(struct sk_buff *skb, u64 addr)
> > > > >  {
> > > > > -     BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> > > > > -     INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > > > -     XSKCB(skb)->num_descs = 0;
> > > > >       skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
> > > > >  }
> > > > >
> > > > > @@ -672,7 +676,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > >                       return ERR_PTR(err);
> > > > >
> > > > >               skb_reserve(skb, hr);
> > > > > -
> > > > > +             xsk_init_cb(skb);
> > > > >               xsk_set_destructor_arg(skb, desc->addr);
> > > > >       } else {
> > > > >               xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > > > > @@ -725,7 +729,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >       struct xsk_tx_metadata *meta = NULL;
> > > > >       struct net_device *dev = xs->dev;
> > > > >       struct sk_buff *skb = xs->skb;
> > > > > -     bool first_frag = false;
> > > > >       int err;
> > > > >
> > > > >       if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> > > > > @@ -742,8 +745,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >               len = desc->len;
> > > > >
> > > > >               if (!skb) {
> > > > > -                     first_frag = true;
> > > > > -
> > > > >                       hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> > > > >                       tr = dev->needed_tailroom;
> > > > >                       skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
> > > > > @@ -752,6 +753,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >
> > > > >                       skb_reserve(skb, hr);
> > > > >                       skb_put(skb, len);
> > > > > +                     xsk_init_cb(skb);
> > > > >
> > > > >                       err = skb_store_bits(skb, 0, buffer, len);
> > > > >                       if (unlikely(err))
> > > > > @@ -797,7 +799,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >                       list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> > > > >               }
> > > > >
> > > > > -             if (first_frag && desc->options & XDP_TX_METADATA) {
> > > > > +             if (!xsk_get_num_desc(skb) && desc->options & XDP_TX_METADATA) {
> > > > >                       if (unlikely(xs->pool->tx_metadata_len == 0)) {
> > > > >                               err = -EINVAL;
> > > > >                               goto free_err;
> > > > > @@ -839,7 +841,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >       return skb;
> > > > >
> > > > >  free_err:
> > > > > -     if (first_frag && skb)
> > > >
> > > > [..]
> > > >
> > > > > +     if (skb && !xsk_get_num_desc(skb))
> > > > >               kfree_skb(skb);
> > > > >
> > > > >       if (err == -EOVERFLOW) {
> > > >
> > > > For IFF_TX_SKB_NO_LINEAR case, the 'goto free_err' is super confusing.
> > > > xsk_build_skb_zerocopy either returns skb or an IS_ERR. Can we
> > > > add a separate label to jump directly to 'if err == -EOVERFLOW' for
> > > > the IFF_TX_SKB_NO_LINEAR case?
> >
> > Right, I got hit with this when running xdpsock within VM now against
> > virtio-net driver. Since I removed @first_frag and sock_alloc_send_skb()
> > managed to give me -EAGAIN at start, skb was treated as valid pointer and
> > then I got splat when accessing either cb or skb_shared_info.
> >
> > So either we NULL the skb for xsk_build_skb_zerocopy() error path (which
> > would be fine even for -EOVERFLOW as error path uses xs->skb pointer, not
> > the local one) or we introduce separate label as you suggest. No strong
> > opinions here.
> >
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 72e34bd2d925..f56182c61c99 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -732,7 +732,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >                 skb = xsk_build_skb_zerocopy(xs, desc);
> > > >                 if (IS_ERR(skb)) {
> > > >                         err = PTR_ERR(skb);
> > > > -                       goto free_err;
> > > > +                       goto out;
> > > >                 }
> > > >         } else {
> > > >                 u32 hr, tr, len;
> > > > @@ -842,6 +842,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >         if (first_frag && skb)
> > > >                 kfree_skb(skb);
> > > >
> > > > +out:
> > > >         if (err == -EOVERFLOW) {
> > > >                 /* Drop the packet */
> > > >                 xsk_inc_num_desc(xs->skb);
> > > >
> > > > After that, it seems we can look at skb_shinfo(skb)->nr_frags? Instead
> > > > of adding new xsk_init_cb, seems more robust?
> >
> > Thanks! I'll do it.
> >
> > >
> > > +1. It would be simpler.
> > >
> > > And I think this patch should be a standalone one because it actually
> > > supports the missing feature for the VM scenario.
> >
> > Hi Jason,
> > in commit message, I wrote about enabling tx metadata for
> > xsk_build_skb_zerocopy() but code did not reflect that as you point out in
> > your later reply.
> >
> > Unless there are any objections I will actually enable it there.
>
> Oh, if you made up your mind enabling it in the next version, how
> about changing the title of this series because the core changes are

Well, don't bother to modify it since in fact the related non linear
driver (for now that is only virtio_net) doesn't support it, I assume
:)

Please go ahead.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-09-25  8:17 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-22 15:25 [PATCH bpf-next 0/3] xsk: refactors around generic xmit side Maciej Fijalkowski
2025-09-22 15:25 ` [PATCH bpf-next 1/3] xsk: avoid overwriting skb fields for multi-buffer traffic Maciej Fijalkowski
2025-09-22 17:25   ` Stanislav Fomichev
2025-09-23  8:39     ` Jason Xing
2025-09-23 15:43     ` Maciej Fijalkowski
2025-09-22 15:25 ` [PATCH bpf-next 2/3] xsk: remove @first_frag from xsk_build_skb() Maciej Fijalkowski
2025-09-22 17:48   ` Stanislav Fomichev
2025-09-23  9:25     ` Jason Xing
2025-09-23  9:36       ` Jason Xing
2025-09-24 14:35       ` Maciej Fijalkowski
2025-09-25  0:17         ` Jason Xing
2025-09-25  8:17           ` Jason Xing
2025-09-22 15:26 ` [PATCH bpf-next 3/3] xsk: wrap generic metadata handling onto separate function Maciej Fijalkowski
2025-09-22 17:51   ` Stanislav Fomichev
2025-09-23  9:43   ` Jason Xing

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).