netdev.vger.kernel.org archive mirror
* [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes
@ 2024-01-22 22:15 Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 01/11] xsk: recycle buffer in case Rx queue was full Maciej Fijalkowski
                   ` (11 more replies)
  0 siblings, 12 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:15 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

Hey,

after a break followed by dealing with sickness, here is a v5 that
makes bpf_xdp_adjust_tail() actually usable for ZC drivers that support
XDP multi-buffer. Since v4 I have also tried using bpf_xdp_adjust_tail()
with a positive offset, which exposed yet more issues, as can be seen
from the increased commit count compared to v3.

John, in the end I think we should remove the handling of
MEM_TYPE_XSK_BUFF_POOL from __xdp_return(), but that is out of scope for
a fixes set, IMHO.

Thanks,
Maciej

v5:
- pick correct version of patch 5 [Simon]
- elaborate a bit more on what patch 2 fixes

v4:
- do not clear frags flag when deleting tail; xsk_buff_pool now does
  that
- skip some NULL tests for xsk_buff_get_tail [Martin, John]
- address problems around registering xdp_rxq_info
- fix bpf_xdp_frags_increase_tail() for ZC mbuf

v3:
- add acks
- s/xsk_buff_tail_del/xsk_buff_del_tail
- address i40e as well (thanks Tirthendu)

v2:
- fix !CONFIG_XDP_SOCKETS builds
- add reviewed-by tag to patch 3


Maciej Fijalkowski (10):
  xsk: recycle buffer in case Rx queue was full
  xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
  xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  ice: work on pre-XDP prog frag count
  ice: remove redundant xdp_rxq_info registration
  intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
  ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
  i40e: set xdp_rxq_info::frag_size
  i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue

Tirthendu Sarkar (1):
  i40e: handle multi-buffer packets that are shrunk by xdp prog

 drivers/net/ethernet/intel/i40e/i40e_main.c   | 47 ++++++++++++------
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 49 +++++++++----------
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  4 +-
 drivers/net/ethernet/intel/ice/ice_base.c     |  7 ++-
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 19 ++++---
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 31 ++++++++----
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  4 +-
 include/net/xdp_sock_drv.h                    | 26 ++++++++++
 net/core/filter.c                             | 43 ++++++++++++----
 net/xdp/xsk.c                                 | 12 +++--
 net/xdp/xsk_buff_pool.c                       |  3 ++
 12 files changed, 167 insertions(+), 79 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v5 bpf 01/11] xsk: recycle buffer in case Rx queue was full
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags Maciej Fijalkowski
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

Add a missing xsk_buff_free() call for the case when __xsk_rcv_zc()
fails to produce a descriptor to the XSK Rx queue.

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 9f13aa3353e3..1eadfac03cc4 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -167,8 +167,10 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 		contd = XDP_PKT_CONTD;
 
 	err = __xsk_rcv_zc(xs, xskb, len, contd);
-	if (err || likely(!frags))
-		goto out;
+	if (err)
+		goto err;
+	if (likely(!frags))
+		return 0;
 
 	xskb_list = &xskb->pool->xskb_list;
 	list_for_each_entry_safe(pos, tmp, xskb_list, xskb_list_node) {
@@ -177,11 +179,13 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 		len = pos->xdp.data_end - pos->xdp.data;
 		err = __xsk_rcv_zc(xs, pos, len, contd);
 		if (err)
-			return err;
+			goto err;
 		list_del(&pos->xskb_list_node);
 	}
 
-out:
+	return 0;
+err:
+	xsk_buff_free(xdp);
 	return err;
 }
 
-- 
2.34.1


* [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 01/11] xsk: recycle buffer in case Rx queue was full Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-24  8:20   ` Magnus Karlsson
  2024-01-22 22:16 ` [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP Maciej Fijalkowski
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

XDP multi-buffer support introduced the XDP_FLAGS_HAS_FRAGS flag, which
drivers use to tell the data path whether an xdp_buff contains fragments
or not. The data path looks up this flag on the first buffer, the one
that occupies the linear part of the xdp_buff, so drivers only modify it
there. This is sufficient for SKB and XDP_DRV modes, as the xdp_buff is
usually allocated on the stack or resides within the struct representing
the driver's queue, and fragments are carried via skb_frag_t structs.
IOW, we are dealing with only one xdp_buff.

ZC mode, though, relies on a list of xdp_buff structs that is carried
via xsk_buff_pool::xskb_list, so the ZC data path has to make sure that
fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
xsk_buff_free() could misbehave if it were executed against an xdp_buff
that carries a frag with the XDP_FLAGS_HAS_FRAGS flag set. Such a
scenario can take place when bpf_xdp_adjust_tail() is used with a
negative offset within the supplied XDP program, which in turn releases
the tail fragment from the multi-buffer frame.

Calling xsk_buff_free() on a tail fragment with XDP_FLAGS_HAS_FRAGS set
would result in releasing all the nodes from xskb_list that were
produced by the driver before XDP program execution, which is not what
is intended - only the tail fragment should be deleted from xskb_list
and then put onto xsk_buff_pool::free_list. Such a multi-buffer frame
will never make it up to user space, so from the AF_XDP application's
POV there would be no traffic running; however, since free_list is
constantly getting new nodes, the driver would still be able to feed the
HW Rx queue with recycled buffers. Bottom line: instead of traffic being
redirected to user space, it would be continuously dropped.

To fix this, let us clear the mentioned flag on the xsk_buff_pool side
at allocation time, which is what should have been done right from the
start of XSK multi-buffer support.

Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 1 -
 drivers/net/ethernet/intel/ice/ice_xsk.c   | 1 -
 net/xdp/xsk_buff_pool.c                    | 3 +++
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e99fa854d17f..fede0bb3e047 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -499,7 +499,6 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 		xdp_res = i40e_run_xdp_zc(rx_ring, first, xdp_prog);
 		i40e_handle_xdp_result_zc(rx_ring, first, rx_desc, &rx_packets,
 					  &rx_bytes, xdp_res, &failure);
-		first->flags = 0;
 		next_to_clean = next_to_process;
 		if (failure)
 			break;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 5d1ae8e4058a..d9073a618ad6 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -895,7 +895,6 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 
 		if (!first) {
 			first = xdp;
-			xdp_buff_clear_frags_flag(first);
 		} else if (ice_add_xsk_frag(rx_ring, first, xdp, size)) {
 			break;
 		}
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 28711cc44ced..dc5659da6728 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -555,6 +555,7 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
 
 	xskb->xdp.data = xskb->xdp.data_hard_start + XDP_PACKET_HEADROOM;
 	xskb->xdp.data_meta = xskb->xdp.data;
+	xskb->xdp.flags = 0;
 
 	if (pool->dma_need_sync) {
 		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
@@ -601,6 +602,7 @@ static u32 xp_alloc_new_from_fq(struct xsk_buff_pool *pool, struct xdp_buff **xd
 		}
 
 		*xdp = &xskb->xdp;
+		xskb->xdp.flags = 0;
 		xdp++;
 	}
 
@@ -621,6 +623,7 @@ static u32 xp_alloc_reused(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u3
 		list_del_init(&xskb->free_list_node);
 
 		*xdp = &xskb->xdp;
+		xskb->xdp.flags = 0;
 		xdp++;
 	}
 	pool->free_list_cnt -= nb_entries;
-- 
2.34.1


* [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 01/11] xsk: recycle buffer in case Rx queue was full Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-24  1:53   ` Jakub Kicinski
  2024-01-22 22:16 ` [PATCH v5 bpf 04/11] ice: work on pre-XDP prog frag count Maciej Fijalkowski
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

Currently, when a packet is shrunk via bpf_xdp_adjust_tail() and the
memory type is set to MEM_TYPE_XSK_BUFF_POOL, a NULL pointer dereference
happens:

[1136314.192256] BUG: kernel NULL pointer dereference, address: 0000000000000034
[1136314.203943] #PF: supervisor read access in kernel mode
[1136314.213768] #PF: error_code(0x0000) - not-present page
[1136314.223550] PGD 0 P4D 0
[1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
[1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
[1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
[1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
[1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
[1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX: 0000000000000000
[1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9003168c000
[1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09: 0000000000010000
[1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12: 0000000000000001
[1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15: 0000000000000001
[1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000) knlGS:0000000000000000
[1136314.386105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4: 00000000007706f0
[1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1136314.431890] PKRU: 55555554
[1136314.439143] Call Trace:
[1136314.446058]  <IRQ>
[1136314.452465]  ? __die+0x20/0x70
[1136314.459881]  ? page_fault_oops+0x15b/0x440
[1136314.468305]  ? exc_page_fault+0x6a/0x150
[1136314.476491]  ? asm_exc_page_fault+0x22/0x30
[1136314.484927]  ? __xdp_return+0x6c/0x210
[1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
[1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
[1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
[1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
[1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
[1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
[1136314.546010]  __napi_poll+0x29/0x1b0
[1136314.553462]  net_rx_action+0x133/0x270
[1136314.561619]  __do_softirq+0xbe/0x28e
[1136314.569303]  do_softirq+0x3f/0x60

This comes from the __xdp_return() call with the xdp_buff argument
passed as NULL, which is supposed to be consumed by an xsk_buff_free()
call.

To address this properly, in the ZC case, the node that represents the
frag being removed has to be pulled out of xskb_list. Introduce
appropriate xsk helpers to do such a node operation and use them
accordingly within bpf_xdp_adjust_tail().

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/net/xdp_sock_drv.h | 26 ++++++++++++++++++++++++
 net/core/filter.c          | 41 +++++++++++++++++++++++++++++---------
 2 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index b62bb8525a5f..3d35ac0f838b 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -159,6 +159,23 @@ static inline struct xdp_buff *xsk_buff_get_frag(struct xdp_buff *first)
 	return ret;
 }
 
+static inline void xsk_buff_del_tail(struct xdp_buff *tail)
+{
+	struct xdp_buff_xsk *xskb = container_of(tail, struct xdp_buff_xsk, xdp);
+
+	list_del(&xskb->xskb_list_node);
+}
+
+static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first)
+{
+	struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp);
+	struct xdp_buff_xsk *frag;
+
+	frag = list_last_entry(&xskb->pool->xskb_list, struct xdp_buff_xsk,
+			       xskb_list_node);
+	return &frag->xdp;
+}
+
 static inline void xsk_buff_set_size(struct xdp_buff *xdp, u32 size)
 {
 	xdp->data = xdp->data_hard_start + XDP_PACKET_HEADROOM;
@@ -350,6 +367,15 @@ static inline struct xdp_buff *xsk_buff_get_frag(struct xdp_buff *first)
 	return NULL;
 }
 
+static inline void xsk_buff_del_tail(struct xdp_buff *tail)
+{
+}
+
+static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first)
+{
+	return NULL;
+}
+
 static inline void xsk_buff_set_size(struct xdp_buff *xdp, u32 size)
 {
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index 24061f29c9dd..8e0cd8ca461e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -83,6 +83,7 @@
 #include <net/netfilter/nf_conntrack_bpf.h>
 #include <net/netkit.h>
 #include <linux/un.h>
+#include <net/xdp_sock_drv.h>
 
 #include "dev.h"
 
@@ -4096,6 +4097,35 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
 	return 0;
 }
 
+static void __shrink_data(struct xdp_buff *xdp, struct xdp_mem_info *mem_info,
+			  skb_frag_t *frag, int shrink)
+{
+	if (mem_info->type == MEM_TYPE_XSK_BUFF_POOL)
+		xsk_buff_get_tail(xdp)->data_end -= shrink;
+	skb_frag_size_sub(frag, shrink);
+}
+
+static bool shrink_data(struct xdp_buff *xdp, skb_frag_t *frag, int shrink)
+{
+	struct xdp_mem_info *mem_info = &xdp->rxq->mem;
+
+	if (skb_frag_size(frag) == shrink) {
+		struct page *page = skb_frag_page(frag);
+		struct xdp_buff *zc_frag = NULL;
+
+		if (mem_info->type == MEM_TYPE_XSK_BUFF_POOL) {
+			zc_frag = xsk_buff_get_tail(xdp);
+
+			xsk_buff_del_tail(zc_frag);
+		}
+
+		__xdp_return(page_address(page), mem_info, false, zc_frag);
+		return true;
+	}
+	__shrink_data(xdp, mem_info, frag, shrink);
+	return false;
+}
+
 static int bpf_xdp_frags_shrink_tail(struct xdp_buff *xdp, int offset)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
@@ -4110,17 +4140,10 @@ static int bpf_xdp_frags_shrink_tail(struct xdp_buff *xdp, int offset)
 
 		len_free += shrink;
 		offset -= shrink;
-
-		if (skb_frag_size(frag) == shrink) {
-			struct page *page = skb_frag_page(frag);
-
-			__xdp_return(page_address(page), &xdp->rxq->mem,
-				     false, NULL);
+		if (shrink_data(xdp, frag, shrink))
 			n_frags_free++;
-		} else {
-			skb_frag_size_sub(frag, shrink);
+		else
 			break;
-		}
 	}
 	sinfo->nr_frags -= n_frags_free;
 	sinfo->xdp_frags_size -= len_free;
-- 
2.34.1


* [PATCH v5 bpf 04/11] ice: work on pre-XDP prog frag count
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-24  8:37   ` Magnus Karlsson
  2024-01-22 22:16 ` [PATCH v5 bpf 05/11] i40e: handle multi-buffer packets that are shrunk by xdp prog Maciej Fijalkowski
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

Fix an OOM panic in XDP_DRV mode when an XDP program shrinks a
multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
socket.

Since support for handling multi-buffer frames was added to XDP, usage
of the bpf_xdp_adjust_tail() helper within an XDP program can free the
page that a given fragment occupies and in turn decrease the fragment
count within the skb_shared_info that is embedded in the xdp_buff
struct. In the current ice driver codebase, this becomes problematic
when the page recycling logic decides not to reuse the page. In such a
case, __page_frag_cache_drain() is used with an ice_rx_buf::pagecnt_bias
that was not adjusted after the page's refcount was changed by the XDP
prog, which in turn does not drain the refcount to 0, so the page is
never freed.

To address this, let us store on the Rx ring struct the frag count as
it was before the XDP program was executed. It will be compared against
the current frag count from the skb_shared_info embedded in the
xdp_buff; a smaller value in the latter indicates that the XDP prog
freed frag(s). Then, for the given delta, decrement pagecnt_bias for the
XDP_DROP verdict.

While at it, let us also handle the EOP frag within
ice_set_rx_bufs_act() to make our life easier, so that all of the
adjustments needed for freed frags are performed in a single place.

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 14 ++++++---
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 31 +++++++++++++------
 3 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 59617f055e35..1760e81379cc 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -603,9 +603,7 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 		ret = ICE_XDP_CONSUMED;
 	}
 exit:
-	rx_buf->act = ret;
-	if (unlikely(xdp_buff_has_frags(xdp)))
-		ice_set_rx_bufs_act(xdp, rx_ring, ret);
+	ice_set_rx_bufs_act(xdp, rx_ring, ret);
 }
 
 /**
@@ -893,14 +891,17 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	}
 
 	if (unlikely(sinfo->nr_frags == MAX_SKB_FRAGS)) {
-		if (unlikely(xdp_buff_has_frags(xdp)))
-			ice_set_rx_bufs_act(xdp, rx_ring, ICE_XDP_CONSUMED);
+		ice_set_rx_bufs_act(xdp, rx_ring, ICE_XDP_CONSUMED);
 		return -ENOMEM;
 	}
 
 	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
 				   rx_buf->page_offset, size);
 	sinfo->xdp_frags_size += size;
+	/* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
+	 * can pop off frags but driver has to handle it on its own
+	 */
+	rx_ring->nr_frags = sinfo->nr_frags;
 
 	if (page_is_pfmemalloc(rx_buf->page))
 		xdp_buff_set_frag_pfmemalloc(xdp);
@@ -1251,6 +1252,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 
 		xdp->data = NULL;
 		rx_ring->first_desc = ntc;
+		rx_ring->nr_frags = 0;
 		continue;
 construct_skb:
 		if (likely(ice_ring_uses_build_skb(rx_ring)))
@@ -1266,10 +1268,12 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 						    ICE_XDP_CONSUMED);
 			xdp->data = NULL;
 			rx_ring->first_desc = ntc;
+			rx_ring->nr_frags = 0;
 			break;
 		}
 		xdp->data = NULL;
 		rx_ring->first_desc = ntc;
+		rx_ring->nr_frags = 0;
 
 		stat_err_bits = BIT(ICE_RX_FLEX_DESC_STATUS0_RXE_S);
 		if (unlikely(ice_test_staterr(rx_desc->wb.status_error0,
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index b3379ff73674..af955b0e5dc5 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -358,6 +358,7 @@ struct ice_rx_ring {
 	struct ice_tx_ring *xdp_ring;
 	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
 	struct xsk_buff_pool *xsk_pool;
+	u32 nr_frags;
 	dma_addr_t dma;			/* physical address of ring */
 	u16 rx_buf_len;
 	u8 dcb_tc;			/* Traffic class of ring */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index 762047508619..afcead4baef4 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -12,26 +12,39 @@
  * act: action to store onto Rx buffers related to XDP buffer parts
  *
  * Set action that should be taken before putting Rx buffer from first frag
- * to one before last. Last one is handled by caller of this function as it
- * is the EOP frag that is currently being processed. This function is
- * supposed to be called only when XDP buffer contains frags.
+ * to the last.
  */
 static inline void
 ice_set_rx_bufs_act(struct xdp_buff *xdp, const struct ice_rx_ring *rx_ring,
 		    const unsigned int act)
 {
-	const struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
-	u32 first = rx_ring->first_desc;
-	u32 nr_frags = sinfo->nr_frags;
+	u32 sinfo_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
+	u32 nr_frags = rx_ring->nr_frags + 1;
+	u32 idx = rx_ring->first_desc;
 	u32 cnt = rx_ring->count;
 	struct ice_rx_buf *buf;
 
 	for (int i = 0; i < nr_frags; i++) {
-		buf = &rx_ring->rx_buf[first];
+		buf = &rx_ring->rx_buf[idx];
 		buf->act = act;
 
-		if (++first == cnt)
-			first = 0;
+		if (++idx == cnt)
+			idx = 0;
+	}
+
+	/* adjust pagecnt_bias on frags freed by XDP prog */
+	if (sinfo_frags < rx_ring->nr_frags && act == ICE_XDP_CONSUMED) {
+		u32 delta = rx_ring->nr_frags - sinfo_frags;
+
+		while (delta) {
+			if (idx == 0)
+				idx = cnt - 1;
+			else
+				idx--;
+			buf = &rx_ring->rx_buf[idx];
+			buf->pagecnt_bias--;
+			delta--;
+		}
 	}
 }
 
-- 
2.34.1


* [PATCH v5 bpf 05/11] i40e: handle multi-buffer packets that are shrunk by xdp prog
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 04/11] ice: work on pre-XDP prog frag count Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 06/11] ice: remove redundant xdp_rxq_info registration Maciej Fijalkowski
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

From: Tirthendu Sarkar <tirthendu.sarkar@intel.com>

XDP programs can shrink packets by calling the bpf_xdp_adjust_tail()
helper function. For multi-buffer packets this may lead to a reduction
of the frag count stored in the skb_shared_info area of the xdp_buff
struct. This results in issues with the current handling of the XDP_PASS
and XDP_DROP cases.

For XDP_PASS, the skb is currently built using the frag count of the
xdp_buff from before it was processed by the XDP prog, which results in
an inconsistent skb when the frag count gets reduced by the XDP prog. To
fix this, get the correct frag count while building the skb instead of
using the pre-obtained frag count.

For XDP_DROP, the current page recycling logic will not reuse the page
but instead adjusts pagecnt_bias so that the page can be freed. This
again results in inconsistent behavior, as the page refcnt has already
been changed by the helper while freeing the frag(s) as part of
shrinking the packet. To fix this, only adjust pagecnt_bias for buffers
that are still part of the packet post-xdp prog run.

Fixes: e213ced19bef ("i40e: add support for XDP multi-buffer Rx")
Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 40 ++++++++++++---------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b82df5bdfac0..377e05dc0c68 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2099,7 +2099,8 @@ static void i40e_put_rx_buffer(struct i40e_ring *rx_ring,
 static void i40e_process_rx_buffs(struct i40e_ring *rx_ring, int xdp_res,
 				  struct xdp_buff *xdp)
 {
-	u32 next = rx_ring->next_to_clean;
+	u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
+	u32 next = rx_ring->next_to_clean, i = 0;
 	struct i40e_rx_buffer *rx_buffer;
 
 	xdp->flags = 0;
@@ -2112,10 +2113,10 @@ static void i40e_process_rx_buffs(struct i40e_ring *rx_ring, int xdp_res,
 		if (!rx_buffer->page)
 			continue;
 
-		if (xdp_res == I40E_XDP_CONSUMED)
-			rx_buffer->pagecnt_bias++;
-		else
+		if (xdp_res != I40E_XDP_CONSUMED)
 			i40e_rx_buffer_flip(rx_buffer, xdp->frame_sz);
+		else if (i++ <= nr_frags)
+			rx_buffer->pagecnt_bias++;
 
 		/* EOP buffer will be put in i40e_clean_rx_irq() */
 		if (next == rx_ring->next_to_process)
@@ -2129,20 +2130,20 @@ static void i40e_process_rx_buffs(struct i40e_ring *rx_ring, int xdp_res,
  * i40e_construct_skb - Allocate skb and populate it
  * @rx_ring: rx descriptor ring to transact packets on
  * @xdp: xdp_buff pointing to the data
- * @nr_frags: number of buffers for the packet
  *
  * This function allocates an skb.  It then populates it with the page
  * data from the current receive descriptor, taking care to set up the
  * skb correctly.
  */
 static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
-					  struct xdp_buff *xdp,
-					  u32 nr_frags)
+					  struct xdp_buff *xdp)
 {
 	unsigned int size = xdp->data_end - xdp->data;
 	struct i40e_rx_buffer *rx_buffer;
+	struct skb_shared_info *sinfo;
 	unsigned int headlen;
 	struct sk_buff *skb;
+	u32 nr_frags = 0;
 
 	/* prefetch first cache line of first page */
 	net_prefetch(xdp->data);
@@ -2180,6 +2181,10 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	memcpy(__skb_put(skb, headlen), xdp->data,
 	       ALIGN(headlen, sizeof(long)));
 
+	if (unlikely(xdp_buff_has_frags(xdp))) {
+		sinfo = xdp_get_shared_info_from_buff(xdp);
+		nr_frags = sinfo->nr_frags;
+	}
 	rx_buffer = i40e_rx_bi(rx_ring, rx_ring->next_to_clean);
 	/* update all of the pointers */
 	size -= headlen;
@@ -2199,9 +2204,8 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	}
 
 	if (unlikely(xdp_buff_has_frags(xdp))) {
-		struct skb_shared_info *sinfo, *skinfo = skb_shinfo(skb);
+		struct skb_shared_info *skinfo = skb_shinfo(skb);
 
-		sinfo = xdp_get_shared_info_from_buff(xdp);
 		memcpy(&skinfo->frags[skinfo->nr_frags], &sinfo->frags[0],
 		       sizeof(skb_frag_t) * nr_frags);
 
@@ -2224,17 +2228,17 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
  * i40e_build_skb - Build skb around an existing buffer
  * @rx_ring: Rx descriptor ring to transact packets on
  * @xdp: xdp_buff pointing to the data
- * @nr_frags: number of buffers for the packet
  *
  * This function builds an skb around an existing Rx buffer, taking care
  * to set up the skb correctly and avoid any memcpy overhead.
  */
 static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
-				      struct xdp_buff *xdp,
-				      u32 nr_frags)
+				      struct xdp_buff *xdp)
 {
 	unsigned int metasize = xdp->data - xdp->data_meta;
+	struct skb_shared_info *sinfo;
 	struct sk_buff *skb;
+	u32 nr_frags;
 
 	/* Prefetch first cache line of first page. If xdp->data_meta
 	 * is unused, this points exactly as xdp->data, otherwise we
@@ -2243,6 +2247,11 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
 	 */
 	net_prefetch(xdp->data_meta);
 
+	if (unlikely(xdp_buff_has_frags(xdp))) {
+		sinfo = xdp_get_shared_info_from_buff(xdp);
+		nr_frags = sinfo->nr_frags;
+	}
+
 	/* build an skb around the page buffer */
 	skb = napi_build_skb(xdp->data_hard_start, xdp->frame_sz);
 	if (unlikely(!skb))
@@ -2255,9 +2264,6 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
 		skb_metadata_set(skb, metasize);
 
 	if (unlikely(xdp_buff_has_frags(xdp))) {
-		struct skb_shared_info *sinfo;
-
-		sinfo = xdp_get_shared_info_from_buff(xdp);
 		xdp_update_skb_shared_info(skb, nr_frags,
 					   sinfo->xdp_frags_size,
 					   nr_frags * xdp->frame_sz,
@@ -2602,9 +2608,9 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget,
 			total_rx_bytes += size;
 		} else {
 			if (ring_uses_build_skb(rx_ring))
-				skb = i40e_build_skb(rx_ring, xdp, nfrags);
+				skb = i40e_build_skb(rx_ring, xdp);
 			else
-				skb = i40e_construct_skb(rx_ring, xdp, nfrags);
+				skb = i40e_construct_skb(rx_ring, xdp);
 
 			/* drop if we failed to retrieve a buffer */
 			if (!skb) {
-- 
2.34.1


* [PATCH v5 bpf 06/11] ice: remove redundant xdp_rxq_info registration
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (4 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 05/11] i40e: handle multi-buffer packets that are shrunk by xdp prog Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-24  8:39   ` Magnus Karlsson
  2024-01-22 22:16 ` [PATCH v5 bpf 07/11] intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers Maciej Fijalkowski
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

xdp_rxq_info struct can be registered by drivers via two functions -
xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows
drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size
which in turn will make it possible to grow the packet via
bpf_xdp_adjust_tail() BPF helper.

Currently, ice registers xdp_rxq_info in two spots:
1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG
2) ice_vsi_cfg_rxq()   // via __xdp_rxq_info_reg(), OK

The commit cited under the Fixes tag took care of setting up frag_size and
updated the registration scheme in 2), but that did not help, as 1) is
called before 2) and, as shown above, uses the old registration
function. This means that 2) sees that xdp_rxq_info is already
registered and never calls __xdp_rxq_info_reg(), which leaves us with
xdp_rxq_info::frag_size set to 0.

To fix this misbehavior, simply remove xdp_rxq_info_reg() call from
ice_setup_rx_ring().
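
The ordering problem can be sketched with a minimal userspace model (the
names and fields below are illustrative, not the real kernel API): the
legacy helper registers without a frag_size, and the frag-aware helper
backs off when it finds the struct already registered:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two registration helpers; not the kernel code. */
struct rxq_model {
	bool registered;
	unsigned int frag_size;
};

/* Models xdp_rxq_info_reg(): registers, but cannot set frag_size. */
static void reg_legacy(struct rxq_model *rxq)
{
	rxq->registered = true;
	rxq->frag_size = 0;
}

/* Models __xdp_rxq_info_reg(): sets frag_size, but is skipped when the
 * struct is already registered - the bug described above. */
static void reg_with_frag_size(struct rxq_model *rxq, unsigned int frag_size)
{
	if (rxq->registered)
		return;
	rxq->registered = true;
	rxq->frag_size = frag_size;
}
```

With the call in 1) removed, the registration in 2) is the only one that
runs and frag_size takes the intended value.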

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 1760e81379cc..765aea630a1f 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -513,11 +513,6 @@ int ice_setup_rx_ring(struct ice_rx_ring *rx_ring)
 	if (ice_is_xdp_ena_vsi(rx_ring->vsi))
 		WRITE_ONCE(rx_ring->xdp_prog, rx_ring->vsi->xdp_prog);
 
-	if (rx_ring->vsi->type == ICE_VSI_PF &&
-	    !xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
-		if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
-				     rx_ring->q_index, rx_ring->q_vector->napi.napi_id))
-			goto err;
 	return 0;
 
 err:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 bpf 07/11] intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (5 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 06/11] ice: remove redundant xdp_rxq_info registration Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-24  8:44   ` Magnus Karlsson
  2024-01-22 22:16 ` [PATCH v5 bpf 08/11] ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

The ice and i40e ZC drivers currently set the offset of a frag within
skb_shared_info to 0, which is incorrect. xdp_buffs that come from
xsk_buff_pool always have 256 bytes of headroom, which needs to be
taken into account when retrieving xdp_buff::data via skb_frag_address().
Otherwise, bpf_xdp_frags_increase_tail() would start its job from
xdp_buff::data_hard_start, overwriting existing payload.
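
The effect of the missing offset can be shown with a small sketch (a
simplified model; skb_frag_address() in the kernel resolves through the
frag's page, but the arithmetic is the same):

```c
#include <assert.h>
#include <stdint.h>

#define XDP_PACKET_HEADROOM 256

/* Simplified view: a frag's address is the page address plus the
 * bv_offset stored in the descriptor. With bv_offset left at 0 the
 * address lands on data_hard_start, 256 bytes before the payload. */
static uint8_t *frag_address(uint8_t *page, uint32_t bv_offset)
{
	return page + bv_offset;
}
```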

Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 3 ++-
 drivers/net/ethernet/intel/ice/ice_xsk.c   | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index fede0bb3e047..65f38a57b3df 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -414,7 +414,8 @@ i40e_add_xsk_frag(struct i40e_ring *rx_ring, struct xdp_buff *first,
 	}
 
 	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++,
-				   virt_to_page(xdp->data_hard_start), 0, size);
+				   virt_to_page(xdp->data_hard_start),
+				   XDP_PACKET_HEADROOM, size);
 	sinfo->xdp_frags_size += size;
 	xsk_buff_add_frag(xdp);
 
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index d9073a618ad6..8b81a1677045 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -825,7 +825,8 @@ ice_add_xsk_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *first,
 	}
 
 	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++,
-				   virt_to_page(xdp->data_hard_start), 0, size);
+				   virt_to_page(xdp->data_hard_start),
+				   XDP_PACKET_HEADROOM, size);
 	sinfo->xdp_frags_size += size;
 	xsk_buff_add_frag(xdp);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 bpf 08/11] ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (6 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 07/11] intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-24  8:51   ` Magnus Karlsson
  2024-01-22 22:16 ` [PATCH v5 bpf 09/11] xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL Maciej Fijalkowski
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

Now that the ice driver correctly sets up frag_size in xdp_rxq_info, let
us make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for
ZC is set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.

Use a bigger hammer: instead of unregistering only xdp_rxq_info's
memory model, unregister it altogether and register it again, so that
xdp_rxq_info ends up with the correct frag_size value.
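
Why the full re-registration is needed can be modeled in a few lines
(illustrative names, not the kernel API): dropping only the memory model
leaves the frag_size from the first registration in place, since
frag_size is only set at (re)registration time:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy xdp_rxq_info: frag_size is written only by rxq_reg(). */
struct rxq_model {
	bool registered;
	unsigned int frag_size;
	int mem_type;
};

static void rxq_reg(struct rxq_model *r, unsigned int frag_size)
{
	r->registered = true;
	r->frag_size = frag_size;
}

static void rxq_unreg(struct rxq_model *r)
{
	r->registered = false;
	r->frag_size = 0;
	r->mem_type = 0;
}

/* Models xdp_rxq_info_unreg_mem_model(): frag_size stays untouched. */
static void rxq_unreg_mem_model(struct rxq_model *r)
{
	r->mem_type = 0;
}
```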

Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_base.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index b25b7f415965..df174c1c3817 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -564,10 +564,15 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 
 		ring->xsk_pool = ice_xsk_pool(ring);
 		if (ring->xsk_pool) {
-			xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
+			xdp_rxq_info_unreg(&ring->xdp_rxq);
 
 			ring->rx_buf_len =
 				xsk_pool_get_rx_frame_size(ring->xsk_pool);
+			/* coverity[check_return] */
+			__xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
+					   ring->q_index,
+					   ring->q_vector->napi.napi_id,
+					   ring->rx_buf_len);
 			err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
 							 MEM_TYPE_XSK_BUFF_POOL,
 							 NULL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 bpf 09/11] xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (7 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 08/11] ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 10/11] i40e: set xdp_rxq_info::frag_size Maciej Fijalkowski
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

The XSK ZC Rx path calculates the size of data that will be posted to
the XSK Rx queue by subtracting xdp_buff::data from xdp_buff::data_end.

In bpf_xdp_frags_increase_tail(), when the underlying memory type of
xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add the offset to data_end of
the tail fragment, so that later on user space will be able to take
into account the amount of bytes added by the XDP program.
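
The accounting can be sketched as follows (a simplified model of the
tail xdp_buff, not the kernel structures):

```c
#include <assert.h>

/* Simplified tail fragment: the length posted to the XSK Rx ring is
 * data_end - data. */
struct tail_model {
	long data;
	long data_end;
};

static long posted_len(const struct tail_model *tail)
{
	return tail->data_end - tail->data;
}

/* Growing the frag must also move data_end, or the extra bytes never
 * show up in the Rx descriptor length seen by user space. */
static void increase_tail(struct tail_model *tail, int offset)
{
	tail->data_end += offset;
}
```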

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 8e0cd8ca461e..9270b4d7acee 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4093,6 +4093,8 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
 	memset(skb_frag_address(frag) + skb_frag_size(frag), 0, offset);
 	skb_frag_size_add(frag, offset);
 	sinfo->xdp_frags_size += offset;
+	if (rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)
+		xsk_buff_get_tail(xdp)->data_end += offset;
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 bpf 10/11] i40e: set xdp_rxq_info::frag_size
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (8 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 09/11] xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-22 22:16 ` [PATCH v5 bpf 11/11] i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
  2024-01-24  8:58 ` [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Magnus Karlsson
  11 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

i40e supports XDP multi-buffer, so it is supposed to use
__xdp_rxq_info_reg() instead of xdp_rxq_info_reg() and set the
frag_size. It cannot simply be converted at the existing call site,
because rx_buf_len could be uninitialized, so let us register
xdp_rxq_info within i40e_configure_rx_ring(), which happens to be
called with an already initialized rx_buf_len value.

Commit 5180ff1364bc ("i40e: use int for i40e_status") converted 'err' to
int, so two variables to deal with return codes are not needed within
i40e_configure_rx_ring(). Remove 'ret' and use 'err' to handle status
from xdp_rxq_info registration.

Fixes: e213ced19bef ("i40e: add support for XDP multi-buffer Rx")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 40 ++++++++++++---------
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  9 -----
 2 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index dc642efe1cfa..f8d513499607 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3586,40 +3586,48 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_hmc_obj_rxq rx_ctx;
 	int err = 0;
 	bool ok;
-	int ret;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	if (ring->vsi->type == I40E_VSI_MAIN)
-		xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
+	ring->rx_buf_len = vsi->rx_buf_len;
+
+	/* XDP RX-queue info only needed for RX rings exposed to XDP */
+	if (ring->vsi->type != I40E_VSI_MAIN)
+		goto skip;
+
+	if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
+		err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
+					 ring->queue_index,
+					 ring->q_vector->napi.napi_id,
+					 ring->rx_buf_len);
+		if (err)
+			return err;
+	}
 
 	ring->xsk_pool = i40e_xsk_pool(ring);
 	if (ring->xsk_pool) {
-		ring->rx_buf_len =
-		  xsk_pool_get_rx_frame_size(ring->xsk_pool);
-		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+		ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
+		err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
 						 MEM_TYPE_XSK_BUFF_POOL,
 						 NULL);
-		if (ret)
-			return ret;
+		if (err)
+			return err;
 		dev_info(&vsi->back->pdev->dev,
 			 "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
 			 ring->queue_index);
 
 	} else {
-		ring->rx_buf_len = vsi->rx_buf_len;
-		if (ring->vsi->type == I40E_VSI_MAIN) {
-			ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
-							 MEM_TYPE_PAGE_SHARED,
-							 NULL);
-			if (ret)
-				return ret;
-		}
+		err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED,
+						 NULL);
+		if (err)
+			return err;
 	}
 
+skip:
 	xdp_init_buff(&ring->xdp, i40e_rx_pg_size(ring) / 2, &ring->xdp_rxq);
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 377e05dc0c68..f0a44e6d4884 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1555,7 +1555,6 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 {
 	struct device *dev = rx_ring->dev;
-	int err;
 
 	u64_stats_init(&rx_ring->syncp);
 
@@ -1576,14 +1575,6 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 	rx_ring->next_to_process = 0;
 	rx_ring->next_to_use = 0;
 
-	/* XDP RX-queue info only needed for RX rings exposed to XDP */
-	if (rx_ring->vsi->type == I40E_VSI_MAIN) {
-		err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
-				       rx_ring->queue_index, rx_ring->q_vector->napi.napi_id);
-		if (err < 0)
-			return err;
-	}
-
 	rx_ring->xdp_prog = rx_ring->vsi->xdp_prog;
 
 	rx_ring->rx_bi =
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 bpf 11/11] i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (9 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 10/11] i40e: set xdp_rxq_info::frag_size Maciej Fijalkowski
@ 2024-01-22 22:16 ` Maciej Fijalkowski
  2024-01-24  8:52   ` Magnus Karlsson
  2024-01-24  8:58 ` [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Magnus Karlsson
  11 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-22 22:16 UTC (permalink / raw)
  To: bpf, ast, daniel, andrii
  Cc: netdev, magnus.karlsson, bjorn, maciej.fijalkowski, echaudro,
	lorenzo, martin.lau, tirthendu.sarkar, john.fastabend, horms

Now that the i40e driver correctly sets up frag_size in xdp_rxq_info,
let us make it work for ZC multi-buffer as well. i40e_ring::rx_buf_len
for ZC is set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.

Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index f8d513499607..7b091ce64cc7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3609,7 +3609,14 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 
 	ring->xsk_pool = i40e_xsk_pool(ring);
 	if (ring->xsk_pool) {
+		xdp_rxq_info_unreg(&ring->xdp_rxq);
 		ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
+		err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
+					 ring->queue_index,
+					 ring->q_vector->napi.napi_id,
+					 ring->rx_buf_len);
+		if (err)
+			return err;
 		err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
 						 MEM_TYPE_XSK_BUFF_POOL,
 						 NULL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  2024-01-22 22:16 ` [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP Maciej Fijalkowski
@ 2024-01-24  1:53   ` Jakub Kicinski
  2024-01-24 12:02     ` Maciej Fijalkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2024-01-24  1:53 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 23:16:02 +0100 Maciej Fijalkowski wrote:
>  
> +static void __shrink_data(struct xdp_buff *xdp, struct xdp_mem_info *mem_info,
> +			  skb_frag_t *frag, int shrink)
> +{
> +	if (mem_info->type == MEM_TYPE_XSK_BUFF_POOL)
> +		xsk_buff_get_tail(xdp)->data_end -= shrink;
> +	skb_frag_size_sub(frag, shrink);

nit: this has just one caller, why not inline these 3 lines?

> +}
> +
> +static bool shrink_data(struct xdp_buff *xdp, skb_frag_t *frag, int shrink)

nit: prefix the function name, please

> +{
> +	struct xdp_mem_info *mem_info = &xdp->rxq->mem;
> +
> +	if (skb_frag_size(frag) == shrink) {
> +		struct page *page = skb_frag_page(frag);
> +		struct xdp_buff *zc_frag = NULL;
> +
> +		if (mem_info->type == MEM_TYPE_XSK_BUFF_POOL) {
> +			zc_frag = xsk_buff_get_tail(xdp);
> +
> +			xsk_buff_del_tail(zc_frag);
> +		}
> +
> +		__xdp_return(page_address(page), mem_info, false, zc_frag);
> +		return true;
> +	}
> +	__shrink_data(xdp, mem_info, frag, shrink);
> +	return false;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
  2024-01-22 22:16 ` [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags Maciej Fijalkowski
@ 2024-01-24  8:20   ` Magnus Karlsson
  2024-01-24 11:42     ` Maciej Fijalkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24  8:20 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 at 23:16, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
> used by drivers to notify data path whether xdp_buff contains fragments
> or not. Data path looks up mentioned flag on first buffer that occupies
> the linear part of xdp_buff, so drivers only modify it there. This is
> sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
> stack or it resides within struct representing driver's queue and
> fragments are carried via skb_frag_t structs. IOW, we are dealing with
> only one xdp_buff.
>
> ZC mode though relies on list of xdp_buff structs that is carried via
> xsk_buff_pool::xskb_list, so ZC data path has to make sure that
> fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
> xsk_buff_free() could misbehave if it would be executed against xdp_buff
> that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
> take place when within supplied XDP program bpf_xdp_adjust_tail() is
> used with negative offset that would in turn release the tail fragment
> from multi-buffer frame.
>
> Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
> result in releasing all the nodes from xskb_list that were produced by
> driver before XDP program execution, which is not what is intended -
> only tail fragment should be deleted from xskb_list and then it should
> be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
> make it up to user space, so from AF_XDP application POV there would be
> no traffic running, however due to free_list getting constantly new
> nodes, driver will be able to feed HW Rx queue with recycled buffers.
> Bottom line is that instead of traffic being redirected to user space,
> it would be continuously dropped.
>
> To fix this, let us clear the mentioned flag on xsk_buff_pool side at
> allocation time, which is what should have been done right from the
> start of XSK multi-buffer support.
>
> Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 1 -
>  drivers/net/ethernet/intel/ice/ice_xsk.c   | 1 -
>  net/xdp/xsk_buff_pool.c                    | 3 +++
>  3 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index e99fa854d17f..fede0bb3e047 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -499,7 +499,6 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>                 xdp_res = i40e_run_xdp_zc(rx_ring, first, xdp_prog);
>                 i40e_handle_xdp_result_zc(rx_ring, first, rx_desc, &rx_packets,
>                                           &rx_bytes, xdp_res, &failure);
> -               first->flags = 0;
>                 next_to_clean = next_to_process;
>                 if (failure)
>                         break;
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index 5d1ae8e4058a..d9073a618ad6 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -895,7 +895,6 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
>
>                 if (!first) {
>                         first = xdp;
> -                       xdp_buff_clear_frags_flag(first);
>                 } else if (ice_add_xsk_frag(rx_ring, first, xdp, size)) {
>                         break;
>                 }
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 28711cc44ced..dc5659da6728 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -555,6 +555,7 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
>
>         xskb->xdp.data = xskb->xdp.data_hard_start + XDP_PACKET_HEADROOM;
>         xskb->xdp.data_meta = xskb->xdp.data;
> +       xskb->xdp.flags = 0;
>
>         if (pool->dma_need_sync) {
>                 dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
> @@ -601,6 +602,7 @@ static u32 xp_alloc_new_from_fq(struct xsk_buff_pool *pool, struct xdp_buff **xd
>                 }
>
>                 *xdp = &xskb->xdp;
> +               xskb->xdp.flags = 0;

Thanks for catching this. I am thinking we should have an if-statement
here and only do this when multi-buffer is enabled. The reason that we
have two different paths for aligned mode and unaligned mode here is
that we do not have to touch the xdp_buff at all at allocation time in
aligned mode, which provides a nice speed-up. So let us only do this
when necessary. What do you think? Same goes for the line in
xp_alloc_reused().

>                 xdp++;
>         }
>
> @@ -621,6 +623,7 @@ static u32 xp_alloc_reused(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u3
>                 list_del_init(&xskb->free_list_node);
>
>                 *xdp = &xskb->xdp;
> +               xskb->xdp.flags = 0;
>                 xdp++;
>         }
>         pool->free_list_cnt -= nb_entries;
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 04/11] ice: work on pre-XDP prog frag count
  2024-01-22 22:16 ` [PATCH v5 bpf 04/11] ice: work on pre-XDP prog frag count Maciej Fijalkowski
@ 2024-01-24  8:37   ` Magnus Karlsson
  2024-01-24 14:05     ` Maciej Fijalkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24  8:37 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 at 23:16, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a
> multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
> socket.
>
> Since support for handling multi-buffer frames was added to XDP, usage
> of bpf_xdp_adjust_tail() helper within XDP program can free the page
> that given fragment occupies and in turn decrease the fragment count
> within skb_shared_info that is embedded in xdp_buff struct. In current
> ice driver codebase, it can become problematic when page recycling logic
> decides not to reuse the page. In such case, __page_frag_cache_drain()
> is used with ice_rx_buf::pagecnt_bias that was not adjusted after
> refcount of page was changed by XDP prog which in turn does not drain
> the refcount to 0 and page is never freed.
>
> To address this, let us store the count of frags before the XDP program
> was executed on Rx ring struct. This will be used to compare with
> current frag count from skb_shared_info embedded in xdp_buff. A smaller
> value in the latter indicates that XDP prog freed frag(s). Then, for
> given delta decrement pagecnt_bias for XDP_DROP verdict.
>
> While at it, let us also handle the EOP frag within
> ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
> needed to be applied against freed frags are performed in the single
> place.
>
> Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.c     | 14 ++++++---
>  drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
>  drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 31 +++++++++++++------
>  3 files changed, 32 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 59617f055e35..1760e81379cc 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -603,9 +603,7 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>                 ret = ICE_XDP_CONSUMED;
>         }
>  exit:
> -       rx_buf->act = ret;
> -       if (unlikely(xdp_buff_has_frags(xdp)))
> -               ice_set_rx_bufs_act(xdp, rx_ring, ret);
> +       ice_set_rx_bufs_act(xdp, rx_ring, ret);
>  }
>
>  /**
> @@ -893,14 +891,17 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>         }
>
>         if (unlikely(sinfo->nr_frags == MAX_SKB_FRAGS)) {
> -               if (unlikely(xdp_buff_has_frags(xdp)))
> -                       ice_set_rx_bufs_act(xdp, rx_ring, ICE_XDP_CONSUMED);
> +               ice_set_rx_bufs_act(xdp, rx_ring, ICE_XDP_CONSUMED);
>                 return -ENOMEM;
>         }
>
>         __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
>                                    rx_buf->page_offset, size);
>         sinfo->xdp_frags_size += size;
> +       /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
> +        * can pop off frags but driver has to handle it on its own
> +        */
> +       rx_ring->nr_frags = sinfo->nr_frags;
>
>         if (page_is_pfmemalloc(rx_buf->page))
>                 xdp_buff_set_frag_pfmemalloc(xdp);
> @@ -1251,6 +1252,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
>
>                 xdp->data = NULL;
>                 rx_ring->first_desc = ntc;
> +               rx_ring->nr_frags = 0;
>                 continue;
>  construct_skb:
>                 if (likely(ice_ring_uses_build_skb(rx_ring)))
> @@ -1266,10 +1268,12 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
>                                                     ICE_XDP_CONSUMED);
>                         xdp->data = NULL;
>                         rx_ring->first_desc = ntc;
> +                       rx_ring->nr_frags = 0;
>                         break;
>                 }
>                 xdp->data = NULL;
>                 rx_ring->first_desc = ntc;
> +               rx_ring->nr_frags = 0;

Are these needed? Or asked in another way, is there some way in which
ice_set_rx_bufs_act() can be executed before ice_add_xdp_frag()? If
not, we could remove them.

Looks good otherwise.

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

>
>                 stat_err_bits = BIT(ICE_RX_FLEX_DESC_STATUS0_RXE_S);
>                 if (unlikely(ice_test_staterr(rx_desc->wb.status_error0,
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index b3379ff73674..af955b0e5dc5 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -358,6 +358,7 @@ struct ice_rx_ring {
>         struct ice_tx_ring *xdp_ring;
>         struct ice_rx_ring *next;       /* pointer to next ring in q_vector */
>         struct xsk_buff_pool *xsk_pool;
> +       u32 nr_frags;
>         dma_addr_t dma;                 /* physical address of ring */
>         u16 rx_buf_len;
>         u8 dcb_tc;                      /* Traffic class of ring */
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
> index 762047508619..afcead4baef4 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
> @@ -12,26 +12,39 @@
>   * act: action to store onto Rx buffers related to XDP buffer parts
>   *
>   * Set action that should be taken before putting Rx buffer from first frag
> - * to one before last. Last one is handled by caller of this function as it
> - * is the EOP frag that is currently being processed. This function is
> - * supposed to be called only when XDP buffer contains frags.
> + * to the last.
>   */
>  static inline void
>  ice_set_rx_bufs_act(struct xdp_buff *xdp, const struct ice_rx_ring *rx_ring,
>                     const unsigned int act)
>  {
> -       const struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> -       u32 first = rx_ring->first_desc;
> -       u32 nr_frags = sinfo->nr_frags;
> +       u32 sinfo_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
> +       u32 nr_frags = rx_ring->nr_frags + 1;
> +       u32 idx = rx_ring->first_desc;
>         u32 cnt = rx_ring->count;
>         struct ice_rx_buf *buf;
>
>         for (int i = 0; i < nr_frags; i++) {
> -               buf = &rx_ring->rx_buf[first];
> +               buf = &rx_ring->rx_buf[idx];
>                 buf->act = act;
>
> -               if (++first == cnt)
> -                       first = 0;
> +               if (++idx == cnt)
> +                       idx = 0;
> +       }
> +
> +       /* adjust pagecnt_bias on frags freed by XDP prog */
> +       if (sinfo_frags < rx_ring->nr_frags && act == ICE_XDP_CONSUMED) {
> +               u32 delta = rx_ring->nr_frags - sinfo_frags;
> +
> +               while (delta) {
> +                       if (idx == 0)
> +                               idx = cnt - 1;
> +                       else
> +                               idx--;
> +                       buf = &rx_ring->rx_buf[idx];
> +                       buf->pagecnt_bias--;
> +                       delta--;
> +               }
>         }
>  }
>
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 06/11] ice: remove redundant xdp_rxq_info registration
  2024-01-22 22:16 ` [PATCH v5 bpf 06/11] ice: remove redundant xdp_rxq_info registration Maciej Fijalkowski
@ 2024-01-24  8:39   ` Magnus Karlsson
  0 siblings, 0 replies; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24  8:39 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 at 23:17, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> xdp_rxq_info struct can be registered by drivers via two functions -
> xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows
> drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size
> which in turn will make it possible to grow the packet via
> bpf_xdp_adjust_tail() BPF helper.
>
> Currently, ice registers xdp_rxq_info in two spots:
> 1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG
> 2) ice_vsi_cfg_rxq()   // via __xdp_rxq_info_reg(), OK
>
> The commit cited under the Fixes tag took care of setting up frag_size
> and updated the registration scheme in 2), but it did not help, as
> 1) is called before 2) and, as shown above, uses the old registration
> function. This means that 2) sees that xdp_rxq_info is already
> registered and never calls __xdp_rxq_info_reg(), which leaves us with
> xdp_rxq_info::frag_size set to 0.
>
> To fix this misbehavior, simply remove xdp_rxq_info_reg() call from
> ice_setup_rx_ring().

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

> Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 1760e81379cc..765aea630a1f 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -513,11 +513,6 @@ int ice_setup_rx_ring(struct ice_rx_ring *rx_ring)
>         if (ice_is_xdp_ena_vsi(rx_ring->vsi))
>                 WRITE_ONCE(rx_ring->xdp_prog, rx_ring->vsi->xdp_prog);
>
> -       if (rx_ring->vsi->type == ICE_VSI_PF &&
> -           !xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
> -               if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
> -                                    rx_ring->q_index, rx_ring->q_vector->napi.napi_id))
> -                       goto err;
>         return 0;
>
>  err:
> --
> 2.34.1
>
>


* Re: [PATCH v5 bpf 07/11] intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
  2024-01-22 22:16 ` [PATCH v5 bpf 07/11] intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers Maciej Fijalkowski
@ 2024-01-24  8:44   ` Magnus Karlsson
  2024-01-24 16:21     ` Maciej Fijalkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24  8:44 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 at 23:17, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> Ice and i40e ZC drivers currently set offset of a frag within
> skb_shared_info to 0, wchih is incorrect. xdp_buffs that come from

Is "wchih" Polish? Just kidding with you ;-)!

> xsk_buff_pool always have 256 bytes of a headroom, so they need to be
> taken into account to retrieve xdp_buff::data via skb_frag_address().
> Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from
> xdp_buff::data_hard_start which would result in overwriting existing
> payload.

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 3 ++-
>  drivers/net/ethernet/intel/ice/ice_xsk.c   | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index fede0bb3e047..65f38a57b3df 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -414,7 +414,8 @@ i40e_add_xsk_frag(struct i40e_ring *rx_ring, struct xdp_buff *first,
>         }
>
>         __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++,
> -                                  virt_to_page(xdp->data_hard_start), 0, size);
> +                                  virt_to_page(xdp->data_hard_start),
> +                                  XDP_PACKET_HEADROOM, size);
>         sinfo->xdp_frags_size += size;
>         xsk_buff_add_frag(xdp);
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index d9073a618ad6..8b81a1677045 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -825,7 +825,8 @@ ice_add_xsk_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *first,
>         }
>
>         __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++,
> -                                  virt_to_page(xdp->data_hard_start), 0, size);
> +                                  virt_to_page(xdp->data_hard_start),
> +                                  XDP_PACKET_HEADROOM, size);
>         sinfo->xdp_frags_size += size;
>         xsk_buff_add_frag(xdp);
>
> --
> 2.34.1
>
>


* Re: [PATCH v5 bpf 08/11] ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  2024-01-22 22:16 ` [PATCH v5 bpf 08/11] ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
@ 2024-01-24  8:51   ` Magnus Karlsson
  2024-01-24 13:58     ` Maciej Fijalkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24  8:51 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 at 23:17, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us
> make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC
> is being set via xsk_pool_get_rx_frame_size() and this needs to be
> propagated up to xdp_rxq_info.
>
> Use a bigger hammer and instead of unregistering only xdp_rxq_info's
> memory model, unregister it altogether and register it again and have
> xdp_rxq_info with correct frag_size value.
>
> Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_base.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index b25b7f415965..df174c1c3817 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -564,10 +564,15 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
>
>                 ring->xsk_pool = ice_xsk_pool(ring);
>                 if (ring->xsk_pool) {
> -                       xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
> +                       xdp_rxq_info_unreg(&ring->xdp_rxq);
>
>                         ring->rx_buf_len =
>                                 xsk_pool_get_rx_frame_size(ring->xsk_pool);
> +                       /* coverity[check_return] */

Why not check the return value here? I can see that the non xsk_pool
path ignores the return value too, but do not understand why.

> +                       __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> +                                          ring->q_index,
> +                                          ring->q_vector->napi.napi_id,
> +                                          ring->rx_buf_len);
>                         err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
>                                                          MEM_TYPE_XSK_BUFF_POOL,
>                                                          NULL);
> --
> 2.34.1
>
>


* Re: [PATCH v5 bpf 11/11] i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  2024-01-22 22:16 ` [PATCH v5 bpf 11/11] i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
@ 2024-01-24  8:52   ` Magnus Karlsson
  0 siblings, 0 replies; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24  8:52 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 at 23:18, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> Now that i40e driver correctly sets up frag_size in xdp_rxq_info, let us
> make it work for ZC multi-buffer as well. i40e_ring::rx_buf_len for ZC
> is being set via xsk_pool_get_rx_frame_size() and this needs to be
> propagated up to xdp_rxq_info.

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index f8d513499607..7b091ce64cc7 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3609,7 +3609,14 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>
>         ring->xsk_pool = i40e_xsk_pool(ring);
>         if (ring->xsk_pool) {
> +               xdp_rxq_info_unreg(&ring->xdp_rxq);
>                 ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
> +               err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> +                                        ring->queue_index,
> +                                        ring->q_vector->napi.napi_id,
> +                                        ring->rx_buf_len);
> +               if (err)
> +                       return err;
>                 err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
>                                                  MEM_TYPE_XSK_BUFF_POOL,
>                                                  NULL);
> --
> 2.34.1
>
>


* Re: [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes
  2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
                   ` (10 preceding siblings ...)
  2024-01-22 22:16 ` [PATCH v5 bpf 11/11] i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
@ 2024-01-24  8:58 ` Magnus Karlsson
  11 siblings, 0 replies; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24  8:58 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Mon, 22 Jan 2024 at 23:16, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> Hey,
>
> after a break followed by dealing with sickness, here is a v5 that makes
> bpf_xdp_adjust_tail() actually usable for ZC drivers that support XDP
> multi-buffer. Since v4 I also tried using bpf_xdp_adjust_tail() with a
> positive offset, which exposed yet more issues, as can be observed in
> the increased commit count compared to v3.

Thanks for this fix that gets bpf_xdp_adjust_tail() to work with AF_XDP
in multi-buffer mode. We clearly need to add a test case for this
helper in our test suite. I have put it on the todo list.

> John, in the end I think we should remove handling
> MEM_TYPE_XSK_BUFF_POOL from __xdp_return(), but it is out of the scope
> for fixes set, IMHO.
>
> Thanks,
> Maciej
>
> v5:
> - pick correct version of patch 5 [Simon]
> - elaborate a bit more on what patch 2 fixes
>
> v4:
> - do not clear frags flag when deleting tail; xsk_buff_pool now does
>   that
> - skip some NULL tests for xsk_buff_get_tail [Martin, John]
> - address problems around registering xdp_rxq_info
> - fix bpf_xdp_frags_increase_tail() for ZC mbuf
>
> v3:
> - add acks
> - s/xsk_buff_tail_del/xsk_buff_del_tail
> - address i40e as well (thanks Tirthendu)
>
> v2:
> - fix !CONFIG_XDP_SOCKETS builds
> - add reviewed-by tag to patch 3
>
>
> Maciej Fijalkowski (10):
>   xsk: recycle buffer in case Rx queue was full
>   xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
>   xsk: fix usage of multi-buffer BPF helpers for ZC XDP
>   ice: work on pre-XDP prog frag count
>   ice: remove redundant xdp_rxq_info registration
>   intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
>   ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
>   xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
>   i40e: set xdp_rxq_info::frag_size
>   i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
>
> Tirthendu Sarkar (1):
>   i40e: handle multi-buffer packets that are shrunk by xdp prog
>
>  drivers/net/ethernet/intel/i40e/i40e_main.c   | 47 ++++++++++++------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 49 +++++++++----------
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  4 +-
>  drivers/net/ethernet/intel/ice/ice_base.c     |  7 ++-
>  drivers/net/ethernet/intel/ice/ice_txrx.c     | 19 ++++---
>  drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
>  drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 31 ++++++++----
>  drivers/net/ethernet/intel/ice/ice_xsk.c      |  4 +-
>  include/net/xdp_sock_drv.h                    | 26 ++++++++++
>  net/core/filter.c                             | 43 ++++++++++++----
>  net/xdp/xsk.c                                 | 12 +++--
>  net/xdp/xsk_buff_pool.c                       |  3 ++
>  12 files changed, 167 insertions(+), 79 deletions(-)
>
> --
> 2.34.1
>
>


* Re: [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
  2024-01-24  8:20   ` Magnus Karlsson
@ 2024-01-24 11:42     ` Maciej Fijalkowski
  2024-01-24 11:49       ` Magnus Karlsson
  0 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-24 11:42 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Wed, Jan 24, 2024 at 09:20:26AM +0100, Magnus Karlsson wrote:
> On Mon, 22 Jan 2024 at 23:16, Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
> > used by drivers to notify data path whether xdp_buff contains fragments
> > or not. Data path looks up mentioned flag on first buffer that occupies
> > the linear part of xdp_buff, so drivers only modify it there. This is
> > sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
> > stack or it resides within struct representing driver's queue and
> > fragments are carried via skb_frag_t structs. IOW, we are dealing with
> > only one xdp_buff.
> >
> > ZC mode though relies on list of xdp_buff structs that is carried via
> > xsk_buff_pool::xskb_list, so ZC data path has to make sure that
> > fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
> > xsk_buff_free() could misbehave if it would be executed against xdp_buff
> > that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
> > take place when within supplied XDP program bpf_xdp_adjust_tail() is
> > used with negative offset that would in turn release the tail fragment
> > from multi-buffer frame.
> >
> > Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
> > result in releasing all the nodes from xskb_list that were produced by
> > driver before XDP program execution, which is not what is intended -
> > only tail fragment should be deleted from xskb_list and then it should
> > be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
> > make it up to user space, so from AF_XDP application POV there would be
> > no traffic running, however due to free_list getting constantly new
> > nodes, driver will be able to feed HW Rx queue with recycled buffers.
> > Bottom line is that instead of traffic being redirected to user space,
> > it would be continuously dropped.
> >
> > To fix this, let us clear the mentioned flag on xsk_buff_pool side at
> > allocation time, which is what should have been done right from the
> > start of XSK multi-buffer support.
> >
> > Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> > Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> > Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 1 -
> >  drivers/net/ethernet/intel/ice/ice_xsk.c   | 1 -
> >  net/xdp/xsk_buff_pool.c                    | 3 +++
> >  3 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index e99fa854d17f..fede0bb3e047 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -499,7 +499,6 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >                 xdp_res = i40e_run_xdp_zc(rx_ring, first, xdp_prog);
> >                 i40e_handle_xdp_result_zc(rx_ring, first, rx_desc, &rx_packets,
> >                                           &rx_bytes, xdp_res, &failure);
> > -               first->flags = 0;
> >                 next_to_clean = next_to_process;
> >                 if (failure)
> >                         break;
> > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > index 5d1ae8e4058a..d9073a618ad6 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > @@ -895,7 +895,6 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> >
> >                 if (!first) {
> >                         first = xdp;
> > -                       xdp_buff_clear_frags_flag(first);
> >                 } else if (ice_add_xsk_frag(rx_ring, first, xdp, size)) {
> >                         break;
> >                 }
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index 28711cc44ced..dc5659da6728 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -555,6 +555,7 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
> >
> >         xskb->xdp.data = xskb->xdp.data_hard_start + XDP_PACKET_HEADROOM;
> >         xskb->xdp.data_meta = xskb->xdp.data;
> > +       xskb->xdp.flags = 0;
> >
> >         if (pool->dma_need_sync) {
> >                 dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
> > @@ -601,6 +602,7 @@ static u32 xp_alloc_new_from_fq(struct xsk_buff_pool *pool, struct xdp_buff **xd
> >                 }
> >
> >                 *xdp = &xskb->xdp;
> > +               xskb->xdp.flags = 0;
> 
> Thanks for catching this. I am thinking we should have an if-statement
> here and only do this when multi-buffer is enabled. The reason that we
> have two different paths for aligned mode and unaligned mode here is
> that we do not have to touch the xdp_buff at all at allocation time in
> aligned mode, which provides a nice speed-up. So let us only do this
> when necessary. What do you think? Same goes for the line in
> xp_alloc_reused().
> 

Good point. How about keeping flags = 0 in xp_alloc() and adding it to
xsk_buff_set_size()? We do touch the xdp_buff there and these two paths cover
batched and non-batched APIs. I do agree that doing it in
xp_alloc_new_from_fq() and in xp_alloc_reused() is not really handy.

> >                 xdp++;
> >         }
> >
> > @@ -621,6 +623,7 @@ static u32 xp_alloc_reused(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u3
> >                 list_del_init(&xskb->free_list_node);
> >
> >                 *xdp = &xskb->xdp;
> > +               xskb->xdp.flags = 0;
> >                 xdp++;
> >         }
> >         pool->free_list_cnt -= nb_entries;
> > --
> > 2.34.1
> >
> >


* Re: [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
  2024-01-24 11:42     ` Maciej Fijalkowski
@ 2024-01-24 11:49       ` Magnus Karlsson
  0 siblings, 0 replies; 28+ messages in thread
From: Magnus Karlsson @ 2024-01-24 11:49 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Wed, 24 Jan 2024 at 12:42, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Jan 24, 2024 at 09:20:26AM +0100, Magnus Karlsson wrote:
> > On Mon, 22 Jan 2024 at 23:16, Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
> > > used by drivers to notify data path whether xdp_buff contains fragments
> > > or not. Data path looks up mentioned flag on first buffer that occupies
> > > the linear part of xdp_buff, so drivers only modify it there. This is
> > > sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
> > > stack or it resides within struct representing driver's queue and
> > > fragments are carried via skb_frag_t structs. IOW, we are dealing with
> > > only one xdp_buff.
> > >
> > > ZC mode though relies on list of xdp_buff structs that is carried via
> > > xsk_buff_pool::xskb_list, so ZC data path has to make sure that
> > > fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
> > > xsk_buff_free() could misbehave if it would be executed against xdp_buff
> > > that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
> > > take place when within supplied XDP program bpf_xdp_adjust_tail() is
> > > used with negative offset that would in turn release the tail fragment
> > > from multi-buffer frame.
> > >
> > > Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
> > > result in releasing all the nodes from xskb_list that were produced by
> > > driver before XDP program execution, which is not what is intended -
> > > only tail fragment should be deleted from xskb_list and then it should
> > > be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
> > > make it up to user space, so from AF_XDP application POV there would be
> > > no traffic running, however due to free_list getting constantly new
> > > nodes, driver will be able to feed HW Rx queue with recycled buffers.
> > > Bottom line is that instead of traffic being redirected to user space,
> > > it would be continuously dropped.
> > >
> > > To fix this, let us clear the mentioned flag on xsk_buff_pool side at
> > > allocation time, which is what should have been done right from the
> > > start of XSK multi-buffer support.
> > >
> > > Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> > > Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> > > Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > ---
> > >  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 1 -
> > >  drivers/net/ethernet/intel/ice/ice_xsk.c   | 1 -
> > >  net/xdp/xsk_buff_pool.c                    | 3 +++
> > >  3 files changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > index e99fa854d17f..fede0bb3e047 100644
> > > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > @@ -499,7 +499,6 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> > >                 xdp_res = i40e_run_xdp_zc(rx_ring, first, xdp_prog);
> > >                 i40e_handle_xdp_result_zc(rx_ring, first, rx_desc, &rx_packets,
> > >                                           &rx_bytes, xdp_res, &failure);
> > > -               first->flags = 0;
> > >                 next_to_clean = next_to_process;
> > >                 if (failure)
> > >                         break;
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > index 5d1ae8e4058a..d9073a618ad6 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > @@ -895,7 +895,6 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> > >
> > >                 if (!first) {
> > >                         first = xdp;
> > > -                       xdp_buff_clear_frags_flag(first);
> > >                 } else if (ice_add_xsk_frag(rx_ring, first, xdp, size)) {
> > >                         break;
> > >                 }
> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > index 28711cc44ced..dc5659da6728 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -555,6 +555,7 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
> > >
> > >         xskb->xdp.data = xskb->xdp.data_hard_start + XDP_PACKET_HEADROOM;
> > >         xskb->xdp.data_meta = xskb->xdp.data;
> > > +       xskb->xdp.flags = 0;
> > >
> > >         if (pool->dma_need_sync) {
> > >                 dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
> > > @@ -601,6 +602,7 @@ static u32 xp_alloc_new_from_fq(struct xsk_buff_pool *pool, struct xdp_buff **xd
> > >                 }
> > >
> > >                 *xdp = &xskb->xdp;
> > > +               xskb->xdp.flags = 0;
> >
> > Thanks for catching this. I am thinking we should have an if-statement
> > here and only do this when multi-buffer is enabled. The reason that we
> > have two different paths for aligned mode and unaligned mode here is
> > that we do not have to touch the xdp_buff at all at allocation time in
> > aligned mode, which provides a nice speed-up. So let us only do this
> > when necessary. What do you think? Same goes for the line in
> > xp_alloc_reused().
> >
>
> Good point. How about keeping flags = 0 in xp_alloc() and adding it to
> xsk_buff_set_size() ? We do touch xdp_buff there and these two paths cover
> batched and non-batched APIs. I do agree that doing it in
> xp_alloc_new_from_fq() and in xp_alloc_reused() is not really handy.

That is an even better idea. Go for it.

> > >                 xdp++;
> > >         }
> > >
> > > @@ -621,6 +623,7 @@ static u32 xp_alloc_reused(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u3
> > >                 list_del_init(&xskb->free_list_node);
> > >
> > >                 *xdp = &xskb->xdp;
> > > +               xskb->xdp.flags = 0;
> > >                 xdp++;
> > >         }
> > >         pool->free_list_cnt -= nb_entries;
> > > --
> > > 2.34.1
> > >
> > >


* Re: [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  2024-01-24  1:53   ` Jakub Kicinski
@ 2024-01-24 12:02     ` Maciej Fijalkowski
  2024-01-24 16:53       ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-24 12:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Tue, Jan 23, 2024 at 05:53:17PM -0800, Jakub Kicinski wrote:
> On Mon, 22 Jan 2024 23:16:02 +0100 Maciej Fijalkowski wrote:
> >  
> > +static void __shrink_data(struct xdp_buff *xdp, struct xdp_mem_info *mem_info,
> > +			  skb_frag_t *frag, int shrink)
> > +{
> > +	if (mem_info->type == MEM_TYPE_XSK_BUFF_POOL)
> > +		xsk_buff_get_tail(xdp)->data_end -= shrink;
> > +	skb_frag_size_sub(frag, shrink);
> 
> nit: this has just one caller, why not inline these 3 lines?

we usually rely on the compiler to do that; we have the rule "no inlines
in source files", no?

> 
> > +}
> > +
> > +static bool shrink_data(struct xdp_buff *xdp, skb_frag_t *frag, int shrink)
> 
> nit: prefix the function name, please

will rename to bpf_xdp_shrink_data(). Thanks for taking a look!

> 
> > +{
> > +	struct xdp_mem_info *mem_info = &xdp->rxq->mem;
> > +
> > +	if (skb_frag_size(frag) == shrink) {
> > +		struct page *page = skb_frag_page(frag);
> > +		struct xdp_buff *zc_frag = NULL;
> > +
> > +		if (mem_info->type == MEM_TYPE_XSK_BUFF_POOL) {
> > +			zc_frag = xsk_buff_get_tail(xdp);
> > +
> > +			xsk_buff_del_tail(zc_frag);
> > +		}
> > +
> > +		__xdp_return(page_address(page), mem_info, false, zc_frag);
> > +		return true;
> > +	}
> > +	__shrink_data(xdp, mem_info, frag, shrink);
> > +	return false;


* Re: [PATCH v5 bpf 08/11] ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  2024-01-24  8:51   ` Magnus Karlsson
@ 2024-01-24 13:58     ` Maciej Fijalkowski
  0 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-24 13:58 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Wed, Jan 24, 2024 at 09:51:47AM +0100, Magnus Karlsson wrote:
> On Mon, 22 Jan 2024 at 23:17, Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us
> > make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC
> > is being set via xsk_pool_get_rx_frame_size() and this needs to be
> > propagated up to xdp_rxq_info.
> >
> > Use a bigger hammer and instead of unregistering only xdp_rxq_info's
> > memory model, unregister it altogether and register it again and have
> > xdp_rxq_info with correct frag_size value.
> >
> > Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_base.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> > index b25b7f415965..df174c1c3817 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_base.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> > @@ -564,10 +564,15 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
> >
> >                 ring->xsk_pool = ice_xsk_pool(ring);
> >                 if (ring->xsk_pool) {
> > -                       xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
> > +                       xdp_rxq_info_unreg(&ring->xdp_rxq);
> >
> >                         ring->rx_buf_len =
> >                                 xsk_pool_get_rx_frame_size(ring->xsk_pool);
> > +                       /* coverity[check_return] */
> 
> Why not check the return value here? I can see that the non xsk_pool
> path ignores the return value too, but do not understand why.

I can't remember now, so maybe let us check retval for both paths. That
won't hurt us.

> 
> > +                       __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> > +                                          ring->q_index,
> > +                                          ring->q_vector->napi.napi_id,
> > +                                          ring->rx_buf_len);
> >                         err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> >                                                          MEM_TYPE_XSK_BUFF_POOL,
> >                                                          NULL);
> > --
> > 2.34.1
> >
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 04/11] ice: work on pre-XDP prog frag count
  2024-01-24  8:37   ` Magnus Karlsson
@ 2024-01-24 14:05     ` Maciej Fijalkowski
  0 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-24 14:05 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Wed, Jan 24, 2024 at 09:37:13AM +0100, Magnus Karlsson wrote:
> On Mon, 22 Jan 2024 at 23:16, Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a
> > multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
> > socket.
> >
> > Since support for handling multi-buffer frames was added to XDP, usage
> > of bpf_xdp_adjust_tail() helper within XDP program can free the page
> > that given fragment occupies and in turn decrease the fragment count
> > within skb_shared_info that is embedded in xdp_buff struct. In current
> > ice driver codebase, it can become problematic when page recycling logic
> > decides not to reuse the page. In such case, __page_frag_cache_drain()
> > is used with ice_rx_buf::pagecnt_bias that was not adjusted after
> > refcount of page was changed by XDP prog which in turn does not drain
> > the refcount to 0 and page is never freed.
> >
> > To address this, let us store the count of frags before the XDP program
> > was executed on Rx ring struct. This will be used to compare with
> > current frag count from skb_shared_info embedded in xdp_buff. A smaller
> > value in the latter indicates that XDP prog freed frag(s). Then, for
> > given delta decrement pagecnt_bias for XDP_DROP verdict.
> >
> > While at it, let us also handle the EOP frag within
> > ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
> > needed to be applied against freed frags are performed in the single
> > place.
> >
> > Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_txrx.c     | 14 ++++++---
> >  drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
> >  drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 31 +++++++++++++------
> >  3 files changed, 32 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > index 59617f055e35..1760e81379cc 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > @@ -603,9 +603,7 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> >                 ret = ICE_XDP_CONSUMED;
> >         }
> >  exit:
> > -       rx_buf->act = ret;
> > -       if (unlikely(xdp_buff_has_frags(xdp)))
> > -               ice_set_rx_bufs_act(xdp, rx_ring, ret);
> > +       ice_set_rx_bufs_act(xdp, rx_ring, ret);
> >  }
> >
> >  /**
> > @@ -893,14 +891,17 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> >         }
> >
> >         if (unlikely(sinfo->nr_frags == MAX_SKB_FRAGS)) {
> > -               if (unlikely(xdp_buff_has_frags(xdp)))
> > -                       ice_set_rx_bufs_act(xdp, rx_ring, ICE_XDP_CONSUMED);
> > +               ice_set_rx_bufs_act(xdp, rx_ring, ICE_XDP_CONSUMED);
> >                 return -ENOMEM;
> >         }
> >
> >         __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
> >                                    rx_buf->page_offset, size);
> >         sinfo->xdp_frags_size += size;
> > +       /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
> > +        * can pop off frags but driver has to handle it on its own
> > +        */
> > +       rx_ring->nr_frags = sinfo->nr_frags;
> >
> >         if (page_is_pfmemalloc(rx_buf->page))
> >                 xdp_buff_set_frag_pfmemalloc(xdp);
> > @@ -1251,6 +1252,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
> >
> >                 xdp->data = NULL;
> >                 rx_ring->first_desc = ntc;
> > +               rx_ring->nr_frags = 0;
> >                 continue;
> >  construct_skb:
> >                 if (likely(ice_ring_uses_build_skb(rx_ring)))
> > @@ -1266,10 +1268,12 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
> >                                                     ICE_XDP_CONSUMED);
> >                         xdp->data = NULL;
> >                         rx_ring->first_desc = ntc;
> > +                       rx_ring->nr_frags = 0;
> >                         break;
> >                 }
> >                 xdp->data = NULL;
> >                 rx_ring->first_desc = ntc;
> > +               rx_ring->nr_frags = 0;
> 
> Are these needed? Or asked in another way, is there some way in which
> ice_set_rx_bufs_act() can be executed before ice_add_xdp_frag()? If
> not, we could remove them.

I am afraid that if you had a fragged packet followed by a non-fragged one,
then ice_set_rx_bufs_act() would incorrectly go over more buffers than it
was supposed to. I think we should keep those, unless I am missing
something?

> 
> Looks good otherwise.
> 
> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> >
> >                 stat_err_bits = BIT(ICE_RX_FLEX_DESC_STATUS0_RXE_S);
> >                 if (unlikely(ice_test_staterr(rx_desc->wb.status_error0,
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > index b3379ff73674..af955b0e5dc5 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > @@ -358,6 +358,7 @@ struct ice_rx_ring {
> >         struct ice_tx_ring *xdp_ring;
> >         struct ice_rx_ring *next;       /* pointer to next ring in q_vector */
> >         struct xsk_buff_pool *xsk_pool;
> > +       u32 nr_frags;
> >         dma_addr_t dma;                 /* physical address of ring */
> >         u16 rx_buf_len;
> >         u8 dcb_tc;                      /* Traffic class of ring */
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
> > index 762047508619..afcead4baef4 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
> > @@ -12,26 +12,39 @@
> >   * act: action to store onto Rx buffers related to XDP buffer parts
> >   *
> >   * Set action that should be taken before putting Rx buffer from first frag
> > - * to one before last. Last one is handled by caller of this function as it
> > - * is the EOP frag that is currently being processed. This function is
> > - * supposed to be called only when XDP buffer contains frags.
> > + * to the last.
> >   */
> >  static inline void
> >  ice_set_rx_bufs_act(struct xdp_buff *xdp, const struct ice_rx_ring *rx_ring,
> >                     const unsigned int act)
> >  {
> > -       const struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> > -       u32 first = rx_ring->first_desc;
> > -       u32 nr_frags = sinfo->nr_frags;
> > +       u32 sinfo_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
> > +       u32 nr_frags = rx_ring->nr_frags + 1;
> > +       u32 idx = rx_ring->first_desc;
> >         u32 cnt = rx_ring->count;
> >         struct ice_rx_buf *buf;
> >
> >         for (int i = 0; i < nr_frags; i++) {
> > -               buf = &rx_ring->rx_buf[first];
> > +               buf = &rx_ring->rx_buf[idx];
> >                 buf->act = act;
> >
> > -               if (++first == cnt)
> > -                       first = 0;
> > +               if (++idx == cnt)
> > +                       idx = 0;
> > +       }
> > +
> > +       /* adjust pagecnt_bias on frags freed by XDP prog */
> > +       if (sinfo_frags < rx_ring->nr_frags && act == ICE_XDP_CONSUMED) {
> > +               u32 delta = rx_ring->nr_frags - sinfo_frags;
> > +
> > +               while (delta) {
> > +                       if (idx == 0)
> > +                               idx = cnt - 1;
> > +                       else
> > +                               idx--;
> > +                       buf = &rx_ring->rx_buf[idx];
> > +                       buf->pagecnt_bias--;
> > +                       delta--;
> > +               }
> >         }
> >  }
> >
> > --
> > 2.34.1
> >
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 07/11] intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
  2024-01-24  8:44   ` Magnus Karlsson
@ 2024-01-24 16:21     ` Maciej Fijalkowski
  0 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-24 16:21 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Wed, Jan 24, 2024 at 09:44:03AM +0100, Magnus Karlsson wrote:
> On Mon, 22 Jan 2024 at 23:17, Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > Ice and i40e ZC drivers currently set offset of a frag within
> > skb_shared_info to 0, wchih is incorrect. xdp_buffs that come from
> 
> Is "wchih" Polish? Just kidding with you ;-)!

Huh, I was missing codespell on my system. There's another one in another
commit:

Commit ab19f322eda5 ("xsk: fix usage of multi-buffer BPF helpers for ZC
XDP")
-----------------------------------------------------------------------------
WARNING: 'appriopriate' may be misspelled - perhaps 'appropriate'?
#64:
appriopriate xsk helpers to do such node operation and use them
^^^^^^^^^^^^

so I'll address those two. Thanks!

> 
> > xsk_buff_pool always have 256 bytes of a headroom, so they need to be
> > taken into account to retrieve xdp_buff::data via skb_frag_address().
> > Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from
> > xdp_buff::data_hard_start which would result in overwriting existing
> > payload.
> 
> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> > Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> > Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 3 ++-
> >  drivers/net/ethernet/intel/ice/ice_xsk.c   | 3 ++-
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index fede0bb3e047..65f38a57b3df 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -414,7 +414,8 @@ i40e_add_xsk_frag(struct i40e_ring *rx_ring, struct xdp_buff *first,
> >         }
> >
> >         __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++,
> > -                                  virt_to_page(xdp->data_hard_start), 0, size);
> > +                                  virt_to_page(xdp->data_hard_start),
> > +                                  XDP_PACKET_HEADROOM, size);
> >         sinfo->xdp_frags_size += size;
> >         xsk_buff_add_frag(xdp);
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > index d9073a618ad6..8b81a1677045 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > @@ -825,7 +825,8 @@ ice_add_xsk_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *first,
> >         }
> >
> >         __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++,
> > -                                  virt_to_page(xdp->data_hard_start), 0, size);
> > +                                  virt_to_page(xdp->data_hard_start),
> > +                                  XDP_PACKET_HEADROOM, size);
> >         sinfo->xdp_frags_size += size;
> >         xsk_buff_add_frag(xdp);
> >
> > --
> > 2.34.1
> >
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  2024-01-24 12:02     ` Maciej Fijalkowski
@ 2024-01-24 16:53       ` Jakub Kicinski
  2024-01-24 16:56         ` Maciej Fijalkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2024-01-24 16:53 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Wed, 24 Jan 2024 13:02:21 +0100 Maciej Fijalkowski wrote:
> > nit: this has just one caller, why not inline these 3 lines?  
> 
> we usually rely on compiler to do that, we have the rule "no inlines in
> source files", no?

I mean Ctrl-x Ctrl-v the code, the function has 3 LoC and one caller.
And a semi-meaningless name. I'm not sure why this code was factored
out.

> > nit: prefix the function name, please  
> 
> will rename to bpf_xdp_shrink_data(). Thanks for taking a look!

👍️

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  2024-01-24 16:53       ` Jakub Kicinski
@ 2024-01-24 16:56         ` Maciej Fijalkowski
  0 siblings, 0 replies; 28+ messages in thread
From: Maciej Fijalkowski @ 2024-01-24 16:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, bjorn,
	echaudro, lorenzo, martin.lau, tirthendu.sarkar, john.fastabend,
	horms

On Wed, Jan 24, 2024 at 08:53:49AM -0800, Jakub Kicinski wrote:
> On Wed, 24 Jan 2024 13:02:21 +0100 Maciej Fijalkowski wrote:
> > > nit: this has just one caller, why not inline these 3 lines?  
> > 
> > we usually rely on compiler to do that, we have the rule "no inlines in
> > source files", no?
> 
> I mean Ctrl-x Ctrl-v the code, the function has 3 LoC and one caller.
> And a semi-meaningless name. I'm not sure why this code was factored
> out.

Ummm. Probably some previous version was heavier, now I see the point :)
Will fix.

> 
> > > nit: prefix the function name, please  
> > 
> > will rename to bpf_xdp_shrink_data(). Thanks for taking a look!
> 
> 👍️

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-01-24 16:56 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-22 22:15 [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 01/11] xsk: recycle buffer in case Rx queue was full Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 02/11] xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags Maciej Fijalkowski
2024-01-24  8:20   ` Magnus Karlsson
2024-01-24 11:42     ` Maciej Fijalkowski
2024-01-24 11:49       ` Magnus Karlsson
2024-01-22 22:16 ` [PATCH v5 bpf 03/11] xsk: fix usage of multi-buffer BPF helpers for ZC XDP Maciej Fijalkowski
2024-01-24  1:53   ` Jakub Kicinski
2024-01-24 12:02     ` Maciej Fijalkowski
2024-01-24 16:53       ` Jakub Kicinski
2024-01-24 16:56         ` Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 04/11] ice: work on pre-XDP prog frag count Maciej Fijalkowski
2024-01-24  8:37   ` Magnus Karlsson
2024-01-24 14:05     ` Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 05/11] i40e: handle multi-buffer packets that are shrunk by xdp prog Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 06/11] ice: remove redundant xdp_rxq_info registration Maciej Fijalkowski
2024-01-24  8:39   ` Magnus Karlsson
2024-01-22 22:16 ` [PATCH v5 bpf 07/11] intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers Maciej Fijalkowski
2024-01-24  8:44   ` Magnus Karlsson
2024-01-24 16:21     ` Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 08/11] ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
2024-01-24  8:51   ` Magnus Karlsson
2024-01-24 13:58     ` Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 09/11] xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 10/11] i40e: set xdp_rxq_info::frag_size Maciej Fijalkowski
2024-01-22 22:16 ` [PATCH v5 bpf 11/11] i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue Maciej Fijalkowski
2024-01-24  8:52   ` Magnus Karlsson
2024-01-24  8:58 ` [PATCH v5 bpf 00/11] net: bpf_xdp_adjust_tail() and Intel mbuf fixes Magnus Karlsson
