* [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames
@ 2025-01-07 15:29 Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer Alexander Lobakin
` (8 more replies)
0 siblings, 9 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
Several months ago, I was looking through my old XDP hints tree[0]
to check whether some patches not directly related to hints could be
sent standalone. Roughly at the same time, Daniel appeared and asked[1]
about GRO for cpumap from that tree.
Currently, cpumap uses its own kthread which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens.
Even though we can't currently pass checksum status from the drivers,
in many cases GRO performs better than listified Rx without the
aggregation, as confirmed by tests.
In order to enable GRO in cpumap, we need to do the following:
* patches 1-2: decouple the GRO struct from the NAPI struct and allow
using it outside of a NAPI entity within the kernel core code;
* patch 3: switch cpumap from netif_receive_skb_list() to
gro_receive_skb() (see the simplified sketch below the lists).
Additional improvements:
* patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
lists;
* patch 5-6: introduce and use a function to get skbs from the NAPI
percpu caches in bulk, not one at a time;
* patch 7-8: use that function in veth as well and remove the one now
superseded by it.
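For reference, here is a simplified sketch of the resulting cpumap
kthread Rx flow (illustrative only, not the exact code from the
patches; names are as in kernel/bpf/cpumap.c):

    void *frames[CPUMAP_BATCH], *skbs[CPUMAP_BATCH];
    u32 i, n, packets = 0;
    bool empty;

    /* rcpu is the cpumap entry, now owning a &gro_node (patch 3) */
    n = __ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);

    /* ... run the cpumap-attached XDP prog, build skbs from the
     * remaining frames via napi_skb_cache_get_bulk() (patches 4-6) ...
     */

    for (i = 0; i < n; i++)
        gro_receive_skb(&rcpu->gro, skbs[i]); /* was netif_receive_skb_list() */

    /* flush either every 64 packets or when the ring gets empty */
    empty = __ptr_ring_empty(rcpu->queue);
    if ((packets += n) >= NAPI_POLL_WEIGHT || empty) {
        gro_flush(&rcpu->gro, !empty && HZ >= 1000);
        gro_normal_list(&rcpu->gro);
        packets = 0;
    }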
My trafficgen UDP GRO tests, small frame sizes:
              GRO off   GRO on
baseline      2.7       N/A       Mpps
patch 3       2.3       4         Mpps
patch 8       2.4       4.7       Mpps

1...3 diff    -17       +48       %
1...8 diff    -11       +74       %
Daniel reported a throughput increase of +14%[2] to +18%[3] in neper's
TCP RR tests. On my system, however, the same test gave up to +100%.
Note that there's a series from Lorenzo[4] which achieves the same, but
in a different way. During the discussions, the approach using a
standalone GRO instance was preferred over the threaded NAPI.
[0] https://github.com/alobakin/linux/tree/xdp_hints
[1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
[2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh
[3] https://lore.kernel.org/bpf/yzda66wro5twmzpmjoxvy4si5zvkehlmgtpi6brheek3sj73tj@o7kd6nurr3o6
[4] https://lore.kernel.org/bpf/20241130-cpumap-gro-v1-0-c1180b1b5758@kernel.org
Alexander Lobakin (8):
net: gro: decouple GRO from the NAPI layer
net: gro: expose GRO init/cleanup to use outside of NAPI
bpf: cpumap: switch to GRO from netif_receive_skb_list()
bpf: cpumap: reuse skb array instead of a linked list to chain skbs
net: skbuff: introduce napi_skb_cache_get_bulk()
bpf: cpumap: switch to napi_skb_cache_get_bulk()
veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
xdp: remove xdp_alloc_skb_bulk()
include/linux/netdevice.h | 35 ++++--
include/linux/skbuff.h | 1 +
include/net/busy_poll.h | 11 +-
include/net/gro.h | 38 ++++--
include/net/xdp.h | 1 -
drivers/net/ethernet/brocade/bna/bnad.c | 1 +
drivers/net/ethernet/cortina/gemini.c | 1 +
drivers/net/veth.c | 3 +-
drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c | 1 +
kernel/bpf/cpumap.c | 131 ++++++++++++++-------
net/core/dev.c | 79 ++++---------
net/core/gro.c | 103 ++++++++++------
net/core/skbuff.c | 62 ++++++++++
net/core/xdp.c | 10 --
14 files changed, 306 insertions(+), 171 deletions(-)
---
From v1[5]:
* use a standalone GRO instance instead of the threaded NAPI (Jakub);
* rebase and send to net-next as it's now more networking than BPF.
[5] https://lore.kernel.org/bpf/20240830162508.1009458-1-aleksander.lobakin@intel.com
--
2.47.1
* [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-09 14:24 ` Paolo Abeni
2025-01-07 15:29 ` [PATCH net-next v2 2/8] net: gro: expose GRO init/cleanup to use outside of NAPI Alexander Lobakin
` (7 subsequent siblings)
8 siblings, 1 reply; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
In fact, these two are not tied closely to each other. The only
requirements for GRO are to use it in BH context and to have some
sane limit on the packet batches, e.g. NAPI has its budget limit
(64/8/etc.).
Factor out the purely GRO fields into a new structure, &gro_node. Embed
it into &napi_struct and adjust all the references. One member, napi_id,
is duplicated in both &napi_struct and &gro_node for performance and
convenience reasons. Three drivers use napi_gro_flush(), which is not
really meant to be exported, so move it to <net/gro.h> and add that
include there. napi_gro_receive() is used in more than 100 drivers,
keep it in <linux/netdevice.h>.
This does not make GRO ready to use outside of the NAPI context
yet.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
---
include/linux/netdevice.h | 35 ++++++++---
include/net/busy_poll.h | 11 +++-
include/net/gro.h | 35 +++++++----
drivers/net/ethernet/brocade/bna/bnad.c | 1 +
drivers/net/ethernet/cortina/gemini.c | 1 +
drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c | 1 +
net/core/dev.c | 62 +++++++++----------
net/core/gro.c | 69 +++++++++++-----------
8 files changed, 124 insertions(+), 91 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e84602e0226c..bab716f86686 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -339,11 +339,27 @@ struct gro_list {
};
/*
- * size of gro hash buckets, must less than bit number of
- * napi_struct::gro_bitmask
+ * size of gro hash buckets, must be <= the number of bits in
+ * gro_node::bitmask
*/
#define GRO_HASH_BUCKETS 8
+/**
+ * struct gro_node - structure to implement Generic Receive Offload
+ * @bitmask: mask of used @hash buckets
+ * @hash: lists of pending GRO skbs sorted by flows
+ * @rx_list: list of pending ``GRO_NORMAL`` skbs
+ * @rx_count: length of rx_list
+ * @napi_id: ID of the NAPI owning this node (0 if outside of NAPI)
+ */
+struct gro_node {
+ unsigned long bitmask;
+ struct gro_list hash[GRO_HASH_BUCKETS];
+ struct list_head rx_list;
+ int rx_count;
+ u32 napi_id;
+};
+
/*
* Structure for per-NAPI config
*/
@@ -369,7 +385,6 @@ struct napi_struct {
unsigned long state;
int weight;
u32 defer_hard_irqs_count;
- unsigned long gro_bitmask;
int (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
/* CPU actively polling if netpoll is configured */
@@ -378,10 +393,8 @@ struct napi_struct {
/* CPU on which NAPI has been scheduled for processing */
int list_owner;
struct net_device *dev;
- struct gro_list gro_hash[GRO_HASH_BUCKETS];
+ struct gro_node gro;
struct sk_buff *skb;
- struct list_head rx_list; /* Pending GRO_NORMAL skbs */
- int rx_count; /* length of rx_list */
unsigned int napi_id;
struct hrtimer timer;
struct task_struct *thread;
@@ -4015,8 +4028,14 @@ int netif_receive_skb(struct sk_buff *skb);
int netif_receive_skb_core(struct sk_buff *skb);
void netif_receive_skb_list_internal(struct list_head *head);
void netif_receive_skb_list(struct list_head *head);
-gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
-void napi_gro_flush(struct napi_struct *napi, bool flush_old);
+gro_result_t gro_receive_skb(struct gro_node *gro, struct sk_buff *skb);
+
+static inline gro_result_t napi_gro_receive(struct napi_struct *napi,
+ struct sk_buff *skb)
+{
+ return gro_receive_skb(&napi->gro, skb);
+}
+
struct sk_buff *napi_get_frags(struct napi_struct *napi);
void napi_get_frags_check(struct napi_struct *napi);
gro_result_t napi_gro_frags(struct napi_struct *napi);
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c858270141bc..d31c8cb9e578 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -122,18 +122,23 @@ static inline void sk_busy_loop(struct sock *sk, int nonblock)
}
/* used in the NIC receive handler to mark the skb */
-static inline void skb_mark_napi_id(struct sk_buff *skb,
- struct napi_struct *napi)
+static inline void __skb_mark_napi_id(struct sk_buff *skb, u32 napi_id)
{
#ifdef CONFIG_NET_RX_BUSY_POLL
/* If the skb was already marked with a valid NAPI ID, avoid overwriting
* it.
*/
if (skb->napi_id < MIN_NAPI_ID)
- skb->napi_id = napi->napi_id;
+ skb->napi_id = napi_id;
#endif
}
+static inline void skb_mark_napi_id(struct sk_buff *skb,
+ struct napi_struct *napi)
+{
+ __skb_mark_napi_id(skb, napi->napi_id);
+}
+
/* used in the protocol handler to propagate the napi_id to the socket */
static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
{
diff --git a/include/net/gro.h b/include/net/gro.h
index b9b58c1f8d19..7aad366452d6 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -506,26 +506,41 @@ static inline int gro_receive_network_flush(const void *th, const void *th2,
int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
int skb_gro_receive_list(struct sk_buff *p, struct sk_buff *skb);
+void __gro_flush(struct gro_node *gro, bool flush_old);
+
+static inline void gro_flush(struct gro_node *gro, bool flush_old)
+{
+ if (!gro->bitmask)
+ return;
+
+ __gro_flush(gro, flush_old);
+}
+
+static inline void napi_gro_flush(struct napi_struct *napi, bool flush_old)
+{
+ gro_flush(&napi->gro, flush_old);
+}
/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
-static inline void gro_normal_list(struct napi_struct *napi)
+static inline void gro_normal_list(struct gro_node *gro)
{
- if (!napi->rx_count)
+ if (!gro->rx_count)
return;
- netif_receive_skb_list_internal(&napi->rx_list);
- INIT_LIST_HEAD(&napi->rx_list);
- napi->rx_count = 0;
+ netif_receive_skb_list_internal(&gro->rx_list);
+ INIT_LIST_HEAD(&gro->rx_list);
+ gro->rx_count = 0;
}
/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
* pass the whole batch up to the stack.
*/
-static inline void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb, int segs)
+static inline void gro_normal_one(struct gro_node *gro, struct sk_buff *skb,
+ int segs)
{
- list_add_tail(&skb->list, &napi->rx_list);
- napi->rx_count += segs;
- if (napi->rx_count >= READ_ONCE(net_hotdata.gro_normal_batch))
- gro_normal_list(napi);
+ list_add_tail(&skb->list, &gro->rx_list);
+ gro->rx_count += segs;
+ if (gro->rx_count >= READ_ONCE(net_hotdata.gro_normal_batch))
+ gro_normal_list(gro);
}
/* This function is the alternative of 'inet_iif' and 'inet_sdif'
diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
index ece6f3b48327..3b9107003b00 100644
--- a/drivers/net/ethernet/brocade/bna/bnad.c
+++ b/drivers/net/ethernet/brocade/bna/bnad.c
@@ -19,6 +19,7 @@
#include <linux/ip.h>
#include <linux/prefetch.h>
#include <linux/module.h>
+#include <net/gro.h>
#include "bnad.h"
#include "bna.h"
diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 991e3839858b..1f8067bdd61a 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -40,6 +40,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
+#include <net/gro.h>
#include "gemini.h"
diff --git a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c
index 7a9c09cd4fdc..6a7a26085fc7 100644
--- a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c
+++ b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c
@@ -41,6 +41,7 @@
#include <linux/types.h>
#include <linux/wait.h>
#include <linux/workqueue.h>
+#include <net/gro.h>
#include "t7xx_dpmaif.h"
#include "t7xx_hif_dpmaif.h"
diff --git a/net/core/dev.c b/net/core/dev.c
index e7223972b9aa..1f4e2a0ef1da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6237,7 +6237,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
return false;
if (work_done) {
- if (n->gro_bitmask)
+ if (n->gro.bitmask)
timeout = napi_get_gro_flush_timeout(n);
n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
}
@@ -6247,15 +6247,14 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (timeout)
ret = false;
}
- if (n->gro_bitmask) {
- /* When the NAPI instance uses a timeout and keeps postponing
- * it, we need to bound somehow the time packets are kept in
- * the GRO layer
- */
- napi_gro_flush(n, !!timeout);
- }
- gro_normal_list(n);
+ /*
+ * When the NAPI instance uses a timeout and keeps postponing
+ * it, we need to bound somehow the time packets are kept in
+ * the GRO layer.
+ */
+ gro_flush(&n->gro, !!timeout);
+ gro_normal_list(&n->gro);
if (unlikely(!list_empty(&n->poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
@@ -6332,19 +6331,15 @@ static void skb_defer_free_flush(struct softnet_data *sd)
static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
{
if (!skip_schedule) {
- gro_normal_list(napi);
+ gro_normal_list(&napi->gro);
__napi_schedule(napi);
return;
}
- if (napi->gro_bitmask) {
- /* flush too old packets
- * If HZ < 1000, flush all packets.
- */
- napi_gro_flush(napi, HZ >= 1000);
- }
+ /* Flush too old packets. If HZ < 1000, flush all packets */
+ gro_flush(&napi->gro, HZ >= 1000);
+ gro_normal_list(&napi->gro);
- gro_normal_list(napi);
clear_bit(NAPI_STATE_SCHED, &napi->state);
}
@@ -6451,7 +6446,7 @@ static void __napi_busy_loop(unsigned int napi_id,
}
work = napi_poll(napi, budget);
trace_napi_poll(napi, work, budget);
- gro_normal_list(napi);
+ gro_normal_list(&napi->gro);
count:
if (work > 0)
__NET_ADD_STATS(dev_net(napi->dev),
@@ -6554,6 +6549,8 @@ static void __napi_hash_add_with_id(struct napi_struct *napi,
napi->napi_id = napi_id;
hlist_add_head_rcu(&napi->napi_hash_node,
&napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+
+ napi->gro.napi_id = napi_id;
}
static void napi_hash_add_with_id(struct napi_struct *napi,
@@ -6624,10 +6621,10 @@ static void init_gro_hash(struct napi_struct *napi)
int i;
for (i = 0; i < GRO_HASH_BUCKETS; i++) {
- INIT_LIST_HEAD(&napi->gro_hash[i].list);
- napi->gro_hash[i].count = 0;
+ INIT_LIST_HEAD(&napi->gro.hash[i].list);
+ napi->gro.hash[i].count = 0;
}
- napi->gro_bitmask = 0;
+ napi->gro.bitmask = 0;
}
int dev_set_threaded(struct net_device *dev, bool threaded)
@@ -6743,8 +6740,8 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
napi->timer.function = napi_watchdog;
init_gro_hash(napi);
napi->skb = NULL;
- INIT_LIST_HEAD(&napi->rx_list);
- napi->rx_count = 0;
+ INIT_LIST_HEAD(&napi->gro.rx_list);
+ napi->gro.rx_count = 0;
napi->poll = poll;
if (weight > NAPI_POLL_WEIGHT)
netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6838,9 +6835,9 @@ static void flush_gro_hash(struct napi_struct *napi)
for (i = 0; i < GRO_HASH_BUCKETS; i++) {
struct sk_buff *skb, *n;
- list_for_each_entry_safe(skb, n, &napi->gro_hash[i].list, list)
+ list_for_each_entry_safe(skb, n, &napi->gro.hash[i].list, list)
kfree_skb(skb);
- napi->gro_hash[i].count = 0;
+ napi->gro.hash[i].count = 0;
}
}
@@ -6859,7 +6856,7 @@ void __netif_napi_del(struct napi_struct *napi)
napi_free_frags(napi);
flush_gro_hash(napi);
- napi->gro_bitmask = 0;
+ napi->gro.bitmask = 0;
if (napi->thread) {
kthread_stop(napi->thread);
@@ -6918,14 +6915,9 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
return work;
}
- if (n->gro_bitmask) {
- /* flush too old packets
- * If HZ < 1000, flush all packets.
- */
- napi_gro_flush(n, HZ >= 1000);
- }
-
- gro_normal_list(n);
+ /* Flush too old packets. If HZ < 1000, flush all packets */
+ gro_flush(&n->gro, HZ >= 1000);
+ gro_normal_list(&n->gro);
/* Some drivers may have called napi_schedule
* prior to exhausting their budget.
@@ -11887,7 +11879,7 @@ static struct hlist_head * __net_init netdev_create_hash(void)
static int __net_init netdev_init(struct net *net)
{
BUILD_BUG_ON(GRO_HASH_BUCKETS >
- 8 * sizeof_field(struct napi_struct, gro_bitmask));
+ BITS_PER_BYTE * sizeof_field(struct gro_node, bitmask));
INIT_LIST_HEAD(&net->dev_base_head);
diff --git a/net/core/gro.c b/net/core/gro.c
index d1f44084e978..77ec10d9cd43 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -253,8 +253,7 @@ int skb_gro_receive_list(struct sk_buff *p, struct sk_buff *skb)
return 0;
}
-
-static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
+static void gro_complete(struct gro_node *gro, struct sk_buff *skb)
{
struct list_head *head = &net_hotdata.offload_base;
struct packet_offload *ptype;
@@ -287,43 +286,43 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
}
out:
- gro_normal_one(napi, skb, NAPI_GRO_CB(skb)->count);
+ gro_normal_one(gro, skb, NAPI_GRO_CB(skb)->count);
}
-static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
- bool flush_old)
+static void __gro_flush_chain(struct gro_node *gro, u32 index, bool flush_old)
{
- struct list_head *head = &napi->gro_hash[index].list;
+ struct list_head *head = &gro->hash[index].list;
struct sk_buff *skb, *p;
list_for_each_entry_safe_reverse(skb, p, head, list) {
if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
return;
skb_list_del_init(skb);
- napi_gro_complete(napi, skb);
- napi->gro_hash[index].count--;
+ gro_complete(gro, skb);
+ gro->hash[index].count--;
}
- if (!napi->gro_hash[index].count)
- __clear_bit(index, &napi->gro_bitmask);
+ if (!gro->hash[index].count)
+ __clear_bit(index, &gro->bitmask);
}
-/* napi->gro_hash[].list contains packets ordered by age.
+/*
+ * gro->hash[].list contains packets ordered by age.
* youngest packets at the head of it.
* Complete skbs in reverse order to reduce latencies.
*/
-void napi_gro_flush(struct napi_struct *napi, bool flush_old)
+void __gro_flush(struct gro_node *gro, bool flush_old)
{
- unsigned long bitmask = napi->gro_bitmask;
+ unsigned long bitmask = gro->bitmask;
unsigned int i, base = ~0U;
while ((i = ffs(bitmask)) != 0) {
bitmask >>= i;
base += i;
- __napi_gro_flush_chain(napi, base, flush_old);
+ __gro_flush_chain(gro, base, flush_old);
}
}
-EXPORT_SYMBOL(napi_gro_flush);
+EXPORT_SYMBOL(__gro_flush);
static unsigned long gro_list_prepare_tc_ext(const struct sk_buff *skb,
const struct sk_buff *p,
@@ -442,7 +441,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
gro_pull_from_frag0(skb, grow);
}
-static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
+static void gro_flush_oldest(struct gro_node *gro, struct list_head *head)
{
struct sk_buff *oldest;
@@ -458,14 +457,15 @@ static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
* SKB to the chain.
*/
skb_list_del_init(oldest);
- napi_gro_complete(napi, oldest);
+ gro_complete(gro, oldest);
}
-static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
+static enum gro_result dev_gro_receive(struct gro_node *gro,
+ struct sk_buff *skb)
{
u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
- struct gro_list *gro_list = &napi->gro_hash[bucket];
struct list_head *head = &net_hotdata.offload_base;
+ struct gro_list *gro_list = &gro->hash[bucket];
struct packet_offload *ptype;
__be16 type = skb->protocol;
struct sk_buff *pp = NULL;
@@ -529,7 +529,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
if (pp) {
skb_list_del_init(pp);
- napi_gro_complete(napi, pp);
+ gro_complete(gro, pp);
gro_list->count--;
}
@@ -540,7 +540,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
goto normal;
if (unlikely(gro_list->count >= MAX_GRO_SKBS))
- gro_flush_oldest(napi, &gro_list->list);
+ gro_flush_oldest(gro, &gro_list->list);
else
gro_list->count++;
@@ -554,10 +554,10 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
ret = GRO_HELD;
ok:
if (gro_list->count) {
- if (!test_bit(bucket, &napi->gro_bitmask))
- __set_bit(bucket, &napi->gro_bitmask);
- } else if (test_bit(bucket, &napi->gro_bitmask)) {
- __clear_bit(bucket, &napi->gro_bitmask);
+ if (!test_bit(bucket, &gro->bitmask))
+ __set_bit(bucket, &gro->bitmask);
+ } else if (test_bit(bucket, &gro->bitmask)) {
+ __clear_bit(bucket, &gro->bitmask);
}
return ret;
@@ -596,13 +596,12 @@ struct packet_offload *gro_find_complete_by_type(__be16 type)
}
EXPORT_SYMBOL(gro_find_complete_by_type);
-static gro_result_t napi_skb_finish(struct napi_struct *napi,
- struct sk_buff *skb,
- gro_result_t ret)
+static gro_result_t gro_skb_finish(struct gro_node *gro, struct sk_buff *skb,
+ gro_result_t ret)
{
switch (ret) {
case GRO_NORMAL:
- gro_normal_one(napi, skb, 1);
+ gro_normal_one(gro, skb, 1);
break;
case GRO_MERGED_FREE:
@@ -623,21 +622,21 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi,
return ret;
}
-gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
+gro_result_t gro_receive_skb(struct gro_node *gro, struct sk_buff *skb)
{
gro_result_t ret;
- skb_mark_napi_id(skb, napi);
+ __skb_mark_napi_id(skb, gro->napi_id);
trace_napi_gro_receive_entry(skb);
skb_gro_reset_offset(skb, 0);
- ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
+ ret = gro_skb_finish(gro, skb, dev_gro_receive(gro, skb));
trace_napi_gro_receive_exit(ret);
return ret;
}
-EXPORT_SYMBOL(napi_gro_receive);
+EXPORT_SYMBOL(gro_receive_skb);
static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
{
@@ -693,7 +692,7 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
__skb_push(skb, ETH_HLEN);
skb->protocol = eth_type_trans(skb, skb->dev);
if (ret == GRO_NORMAL)
- gro_normal_one(napi, skb, 1);
+ gro_normal_one(&napi->gro, skb, 1);
break;
case GRO_MERGED_FREE:
@@ -762,7 +761,7 @@ gro_result_t napi_gro_frags(struct napi_struct *napi)
trace_napi_gro_frags_entry(skb);
- ret = napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
+ ret = napi_frags_finish(napi, skb, dev_gro_receive(&napi->gro, skb));
trace_napi_gro_frags_exit(ret);
return ret;
--
2.47.1
* [PATCH net-next v2 2/8] net: gro: expose GRO init/cleanup to use outside of NAPI
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 3/8] bpf: cpumap: switch to GRO from netif_receive_skb_list() Alexander Lobakin
` (6 subsequent siblings)
8 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
Make the GRO init and cleanup functions global to be able to use GRO
without a NAPI instance. Taking into account the already global
gro_flush(), GRO is now fully usable standalone.
The new functions are not exported, since they're not supposed to be
used outside of the kernel core code.
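A minimal usage sketch of a standalone GRO instance with these helpers
(illustrative only; the skb source and the flush policy are up to the
caller, and cpumap in the next patch is the real user):

    struct gro_node gro;

    gro_init(&gro);

    /* Rx path, must run in BH context */
    local_bh_disable();
    gro_receive_skb(&gro, skb);    /* skb obtained elsewhere */
    /* periodically: */
    gro_flush(&gro, false);        /* flush aggregated packets */
    gro_normal_list(&gro);         /* pass the batched skbs up the stack */
    local_bh_enable();

    gro_cleanup(&gro);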
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
---
include/net/gro.h | 3 +++
net/core/dev.c | 33 +++------------------------------
net/core/gro.c | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 40 insertions(+), 30 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h
index 7aad366452d6..343d5afe7c9e 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -543,6 +543,9 @@ static inline void gro_normal_one(struct gro_node *gro, struct sk_buff *skb,
gro_normal_list(gro);
}
+void gro_init(struct gro_node *gro);
+void gro_cleanup(struct gro_node *gro);
+
/* This function is the alternative of 'inet_iif' and 'inet_sdif'
* functions in case we can not rely on fields of IPCB.
*
diff --git a/net/core/dev.c b/net/core/dev.c
index 1f4e2a0ef1da..f7059e98ce87 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6616,17 +6616,6 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-static void init_gro_hash(struct napi_struct *napi)
-{
- int i;
-
- for (i = 0; i < GRO_HASH_BUCKETS; i++) {
- INIT_LIST_HEAD(&napi->gro.hash[i].list);
- napi->gro.hash[i].count = 0;
- }
- napi->gro.bitmask = 0;
-}
-
int dev_set_threaded(struct net_device *dev, bool threaded)
{
struct napi_struct *napi;
@@ -6738,10 +6727,8 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
INIT_HLIST_NODE(&napi->napi_hash_node);
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
- init_gro_hash(napi);
+ gro_init(&napi->gro);
napi->skb = NULL;
- INIT_LIST_HEAD(&napi->gro.rx_list);
- napi->gro.rx_count = 0;
napi->poll = poll;
if (weight > NAPI_POLL_WEIGHT)
netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6828,19 +6815,6 @@ void napi_enable(struct napi_struct *n)
}
EXPORT_SYMBOL(napi_enable);
-static void flush_gro_hash(struct napi_struct *napi)
-{
- int i;
-
- for (i = 0; i < GRO_HASH_BUCKETS; i++) {
- struct sk_buff *skb, *n;
-
- list_for_each_entry_safe(skb, n, &napi->gro.hash[i].list, list)
- kfree_skb(skb);
- napi->gro.hash[i].count = 0;
- }
-}
-
/* Must be called in process context */
void __netif_napi_del(struct napi_struct *napi)
{
@@ -6855,8 +6829,7 @@ void __netif_napi_del(struct napi_struct *napi)
list_del_rcu(&napi->dev_list);
napi_free_frags(napi);
- flush_gro_hash(napi);
- napi->gro.bitmask = 0;
+ gro_cleanup(&napi->gro);
if (napi->thread) {
kthread_stop(napi->thread);
@@ -12243,7 +12216,7 @@ static int __init net_dev_init(void)
INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
spin_lock_init(&sd->defer_lock);
- init_gro_hash(&sd->backlog);
+ gro_init(&sd->backlog.gro);
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
INIT_LIST_HEAD(&sd->backlog.poll_list);
diff --git a/net/core/gro.c b/net/core/gro.c
index 77ec10d9cd43..29ee36bb0b27 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -793,3 +793,37 @@ __sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
return sum;
}
EXPORT_SYMBOL(__skb_gro_checksum_complete);
+
+void gro_init(struct gro_node *gro)
+{
+ for (u32 i = 0; i < GRO_HASH_BUCKETS; i++) {
+ INIT_LIST_HEAD(&gro->hash[i].list);
+ gro->hash[i].count = 0;
+ }
+
+ gro->bitmask = 0;
+
+ INIT_LIST_HEAD(&gro->rx_list);
+ gro->rx_count = 0;
+
+ gro->napi_id = 0;
+}
+
+void gro_cleanup(struct gro_node *gro)
+{
+ struct sk_buff *skb, *n;
+
+ for (u32 i = 0; i < GRO_HASH_BUCKETS; i++) {
+ list_for_each_entry_safe(skb, n, &gro->hash[i].list, list)
+ kfree_skb(skb);
+
+ gro->hash[i].count = 0;
+ }
+
+ gro->bitmask = 0;
+
+ list_for_each_entry_safe(skb, n, &gro->rx_list, list)
+ kfree_skb(skb);
+
+ gro->rx_count = 0;
+}
--
2.47.1
* [PATCH net-next v2 3/8] bpf: cpumap: switch to GRO from netif_receive_skb_list()
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 2/8] net: gro: expose GRO init/cleanup to use outside of NAPI Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 4/8] bpf: cpumap: reuse skb array instead of a linked list to chain skbs Alexander Lobakin
` (5 subsequent siblings)
8 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
cpumap has its own BH context based on a kthread. It has a sane batch
size of 8 frames per cycle.
GRO can be used here on its own. Adjust the cpumap calls to the upper
stack to use the GRO API instead of netif_receive_skb_list(), which
processes skbs in batches, but doesn't involve the GRO layer at all.
In plenty of tests, GRO performs better than listified receiving even
given that it has to calculate full frame checksums on the CPU.
As GRO passes the skbs to the upper stack in batches of
@gro_normal_batch, i.e. 8 by default, and skb->dev points to the
device where the frame comes from, it is enough to disable the GRO
netdev feature on that device to completely restore the original
behaviour: untouched frames will still be bulked and passed to the
upper stack in batches of 8, as with netif_receive_skb_list().
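For example (the interface name here is just illustrative), the
pre-series behaviour for frames redirected from a given device can be
restored with:

    ethtool -K eth0 gro off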
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
---
kernel/bpf/cpumap.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 774accbd4a22..10d062dddb6f 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -33,8 +33,8 @@
#include <trace/events/xdp.h>
#include <linux/btf_ids.h>
-#include <linux/netdevice.h> /* netif_receive_skb_list */
-#include <linux/etherdevice.h> /* eth_type_trans */
+#include <linux/netdevice.h>
+#include <net/gro.h>
/* General idea: XDP packets getting XDP redirected to another CPU,
* will maximum be stored/queued for one driver ->poll() call. It is
@@ -68,6 +68,7 @@ struct bpf_cpu_map_entry {
struct bpf_cpumap_val value;
struct bpf_prog *prog;
+ struct gro_node gro;
struct completion kthread_running;
struct rcu_work free_work;
@@ -261,10 +262,36 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
return nframes;
}
+static void cpu_map_gro_receive(struct bpf_cpu_map_entry *rcpu,
+ struct list_head *list)
+{
+ struct sk_buff *skb, *tmp;
+
+ list_for_each_entry_safe(skb, tmp, list, list) {
+ skb_list_del_init(skb);
+ gro_receive_skb(&rcpu->gro, skb);
+ }
+}
+
+static void cpu_map_gro_flush(struct bpf_cpu_map_entry *rcpu, bool empty)
+{
+ /*
+ * If the ring is not empty, there'll be a new iteration soon, and we
+ * only need to do a full flush if a tick is long (> 1 ms).
+ * If the ring is empty, to not hold GRO packets in the stack for too
+ * long, do a full flush.
+ * This is equivalent to how NAPI decides whether to perform a full
+ * flush.
+ */
+ gro_flush(&rcpu->gro, !empty && HZ >= 1000);
+ gro_normal_list(&rcpu->gro);
+}
+
static int cpu_map_kthread_run(void *data)
{
struct bpf_cpu_map_entry *rcpu = data;
unsigned long last_qs = jiffies;
+ u32 packets = 0;
complete(&rcpu->kthread_running);
set_current_state(TASK_INTERRUPTIBLE);
@@ -282,6 +309,7 @@ static int cpu_map_kthread_run(void *data)
void *frames[CPUMAP_BATCH];
void *skbs[CPUMAP_BATCH];
LIST_HEAD(list);
+ bool empty;
/* Release CPU reschedule checks */
if (__ptr_ring_empty(rcpu->queue)) {
@@ -361,7 +389,15 @@ static int cpu_map_kthread_run(void *data)
trace_xdp_cpumap_kthread(rcpu->map_id, n, kmem_alloc_drops,
sched, &stats);
- netif_receive_skb_list(&list);
+ cpu_map_gro_receive(rcpu, &list);
+
+ /* Flush either every 64 packets or in case of empty ring */
+ empty = __ptr_ring_empty(rcpu->queue);
+ if ((packets += n) >= NAPI_POLL_WEIGHT || empty) {
+ cpu_map_gro_flush(rcpu, empty);
+ packets = 0;
+ }
+
local_bh_enable(); /* resched point, may call do_softirq() */
}
__set_current_state(TASK_RUNNING);
@@ -430,6 +466,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
rcpu->cpu = cpu;
rcpu->map_id = map->id;
rcpu->value.qsize = value->qsize;
+ gro_init(&rcpu->gro);
if (fd > 0 && __cpu_map_load_bpf_program(rcpu, map, fd))
goto free_ptr_ring;
@@ -458,6 +495,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
if (rcpu->prog)
bpf_prog_put(rcpu->prog);
free_ptr_ring:
+ gro_cleanup(&rcpu->gro);
ptr_ring_cleanup(rcpu->queue, NULL);
free_queue:
kfree(rcpu->queue);
@@ -487,6 +525,7 @@ static void __cpu_map_entry_free(struct work_struct *work)
if (rcpu->prog)
bpf_prog_put(rcpu->prog);
+ gro_cleanup(&rcpu->gro);
/* The queue should be empty at this point */
__cpu_map_ring_cleanup(rcpu->queue);
ptr_ring_cleanup(rcpu->queue, NULL);
--
2.47.1
* [PATCH net-next v2 4/8] bpf: cpumap: reuse skb array instead of a linked list to chain skbs
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
` (2 preceding siblings ...)
2025-01-07 15:29 ` [PATCH net-next v2 3/8] bpf: cpumap: switch to GRO from netif_receive_skb_list() Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 5/8] net: skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
` (4 subsequent siblings)
8 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
cpumap still uses a linked list to store the skbs to be passed to the
stack. Now that we use GRO via gro_receive_skb() instead of listified
Rx, the linked list is just unneeded overhead.
Inside the polling loop, we already have an array of skbs. Let's reuse
it for skbs passed to cpumap (generic XDP) and keep them there in case
of XDP_PASS when a program is installed in the map itself. Don't list
regular xdp_frames after converting them to skbs either; store them
in the mentioned array (but *before* the generic skbs, as the latter
have lower priority) and call gro_receive_skb() for each array element
once they're all processed.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
---
kernel/bpf/cpumap.c | 103 +++++++++++++++++++++++---------------------
1 file changed, 54 insertions(+), 49 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 10d062dddb6f..4c3eeb931bff 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -134,22 +134,23 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
}
}
-static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
- struct list_head *listp,
- struct xdp_cpumap_stats *stats)
+static u32 cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
+ void **skbs, u32 skb_n,
+ struct xdp_cpumap_stats *stats)
{
- struct sk_buff *skb, *tmp;
struct xdp_buff xdp;
- u32 act;
+ u32 act, pass = 0;
int err;
- list_for_each_entry_safe(skb, tmp, listp, list) {
+ for (u32 i = 0; i < skb_n; i++) {
+ struct sk_buff *skb = skbs[i];
+
act = bpf_prog_run_generic_xdp(skb, &xdp, rcpu->prog);
switch (act) {
case XDP_PASS:
+ skbs[pass++] = skb;
break;
case XDP_REDIRECT:
- skb_list_del_init(skb);
err = xdp_do_generic_redirect(skb->dev, skb, &xdp,
rcpu->prog);
if (unlikely(err)) {
@@ -158,7 +159,7 @@ static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
} else {
stats->redirect++;
}
- return;
+ break;
default:
bpf_warn_invalid_xdp_action(NULL, rcpu->prog, act);
fallthrough;
@@ -166,12 +167,15 @@ static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
trace_xdp_exception(skb->dev, rcpu->prog, act);
fallthrough;
case XDP_DROP:
- skb_list_del_init(skb);
- kfree_skb(skb);
+ napi_consume_skb(skb, true);
stats->drop++;
- return;
+ break;
}
}
+
+ stats->pass += pass;
+
+ return pass;
}
static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
@@ -235,42 +239,39 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
#define CPUMAP_BATCH 8
-static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
- int xdp_n, struct xdp_cpumap_stats *stats,
- struct list_head *list)
+struct cpu_map_ret {
+ u32 xdp_n;
+ u32 skb_n;
+};
+
+static bool cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
+ void **skbs, struct cpu_map_ret *ret,
+ struct xdp_cpumap_stats *stats)
{
struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
- int nframes;
if (!rcpu->prog)
- return xdp_n;
+ goto out;
rcu_read_lock_bh();
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
- nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);
+ ret->xdp_n = cpu_map_bpf_prog_run_xdp(rcpu, frames, ret->xdp_n, stats);
+ if (unlikely(ret->skb_n))
+ ret->skb_n = cpu_map_bpf_prog_run_skb(rcpu, skbs, ret->skb_n,
+ stats);
if (stats->redirect)
xdp_do_flush();
- if (unlikely(!list_empty(list)))
- cpu_map_bpf_prog_run_skb(rcpu, list, stats);
-
bpf_net_ctx_clear(bpf_net_ctx);
rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
- return nframes;
-}
-
-static void cpu_map_gro_receive(struct bpf_cpu_map_entry *rcpu,
- struct list_head *list)
-{
- struct sk_buff *skb, *tmp;
+out:
+ if (unlikely(ret->skb_n) && ret->xdp_n)
+ memmove(&skbs[ret->xdp_n], skbs, ret->skb_n * sizeof(*skbs));
- list_for_each_entry_safe(skb, tmp, list, list) {
- skb_list_del_init(skb);
- gro_receive_skb(&rcpu->gro, skb);
- }
+ return !!*(const u64 *)ret;
}
static void cpu_map_gro_flush(struct bpf_cpu_map_entry *rcpu, bool empty)
@@ -305,10 +306,10 @@ static int cpu_map_kthread_run(void *data)
struct xdp_cpumap_stats stats = {}; /* zero stats */
unsigned int kmem_alloc_drops = 0, sched = 0;
gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
- int i, n, m, nframes, xdp_n;
+ struct cpu_map_ret ret = { };
void *frames[CPUMAP_BATCH];
void *skbs[CPUMAP_BATCH];
- LIST_HEAD(list);
+ u32 i, n, m;
bool empty;
/* Release CPU reschedule checks */
@@ -334,7 +335,7 @@ static int cpu_map_kthread_run(void *data)
*/
n = __ptr_ring_consume_batched(rcpu->queue, frames,
CPUMAP_BATCH);
- for (i = 0, xdp_n = 0; i < n; i++) {
+ for (i = 0; i < n; i++) {
void *f = frames[i];
struct page *page;
@@ -342,11 +343,11 @@ static int cpu_map_kthread_run(void *data)
struct sk_buff *skb = f;
__ptr_clear_bit(0, &skb);
- list_add_tail(&skb->list, &list);
+ skbs[ret.skb_n++] = skb;
continue;
}
- frames[xdp_n++] = f;
+ frames[ret.xdp_n++] = f;
page = virt_to_page(f);
/* Bring struct page memory area to curr CPU. Read by
@@ -357,19 +358,21 @@ static int cpu_map_kthread_run(void *data)
}
/* Support running another XDP prog on this CPU */
- nframes = cpu_map_bpf_prog_run(rcpu, frames, xdp_n, &stats, &list);
- if (nframes) {
- m = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
- gfp, nframes, skbs);
- if (unlikely(m == 0)) {
- for (i = 0; i < nframes; i++)
- skbs[i] = NULL; /* effect: xdp_return_frame */
- kmem_alloc_drops += nframes;
- }
+ if (!cpu_map_bpf_prog_run(rcpu, frames, skbs, &ret, &stats)) {
+ local_bh_disable();
+ goto stats;
+ }
+
+ m = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp,
+ ret.xdp_n, skbs);
+ if (unlikely(!m)) {
+ for (i = 0; i < ret.xdp_n; i++)
+ skbs[i] = NULL; /* effect: xdp_return_frame */
+ kmem_alloc_drops += ret.xdp_n;
}
local_bh_disable();
- for (i = 0; i < nframes; i++) {
+ for (i = 0; i < ret.xdp_n; i++) {
struct xdp_frame *xdpf = frames[i];
struct sk_buff *skb = skbs[i];
@@ -379,17 +382,19 @@ static int cpu_map_kthread_run(void *data)
xdp_return_frame(xdpf);
continue;
}
-
- list_add_tail(&skb->list, &list);
}
+stats:
/* Feedback loop via tracepoint.
* NB: keep before recv to allow measuring enqueue/dequeue latency.
*/
trace_xdp_cpumap_kthread(rcpu->map_id, n, kmem_alloc_drops,
sched, &stats);
- cpu_map_gro_receive(rcpu, &list);
+ for (i = 0; i < ret.xdp_n + ret.skb_n; i++) {
+ if (likely(skbs[i]))
+ gro_receive_skb(&rcpu->gro, skbs[i]);
+ }
/* Flush either every 64 packets or in case of empty ring */
empty = __ptr_ring_empty(rcpu->queue);
--
2.47.1
* [PATCH net-next v2 5/8] net: skbuff: introduce napi_skb_cache_get_bulk()
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
` (3 preceding siblings ...)
2025-01-07 15:29 ` [PATCH net-next v2 4/8] bpf: cpumap: reuse skb array instead of a linked list to chain skbs Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-09 13:16 ` Paolo Abeni
2025-01-07 15:29 ` [PATCH net-next v2 6/8] bpf: cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
` (3 subsequent siblings)
8 siblings, 1 reply; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
Add a function to get an array of skbs from the NAPI percpu cache.
It's supposed to be a drop-in replacement for
kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and
xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the
requirement to call it only from BH context) is that it tries to use
as many NAPI cache entries for skbs as possible, and allocates new
ones only if needed.
The logic is as follows:
* there are enough skbs in the cache: decache them and return to the
caller;
* not enough: try refilling the cache first. If there are now enough
skbs, return;
* still not enough: try allocating skbs directly to the output array
with %__GFP_ZERO, maybe we'll be able to get some. If there's now
enough, return;
* still not enough: return as many as we were able to obtain.
Most of the time, if called from the NAPI polling loop, the first case
will be true, sometimes (rarely) the second one. The third and the
fourth happen only under heavy memory pressure.
This can save significant amounts of CPU cycles if there are GRO cycles
and/or Tx completion cycles (anything that ends up in
napi_skb_cache_put()) happening on this CPU.
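A minimal caller sketch (hypothetical, only to illustrate the
contract; the real users are added in the following patches):

    void *skbs[8];
    u32 got;

    /* must run in BH context (NAPI poll, cpumap kthread with BH off, ...) */
    got = napi_skb_cache_get_bulk(skbs, ARRAY_SIZE(skbs));

    /* skbs[0..got - 1] are zeroed skb heads, ready to be finalized with
     * {,__}build_skb_around(); got < 8 only under memory pressure, when
     * kmem_cache_alloc_bulk() couldn't provide the missing part
     */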
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
---
include/linux/skbuff.h | 1 +
net/core/skbuff.c | 62 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 63 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274a..1c089c7c14e1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1315,6 +1315,7 @@ struct sk_buff *build_skb_around(struct sk_buff *skb,
void *data, unsigned int frag_size);
void skb_attempt_defer_free(struct sk_buff *skb);
+u32 napi_skb_cache_get_bulk(void **skbs, u32 n);
struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
struct sk_buff *slab_build_skb(void *data);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a441613a1e6c..42eb31dcc9ce 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -367,6 +367,68 @@ static struct sk_buff *napi_skb_cache_get(void)
return skb;
}
+/**
+ * napi_skb_cache_get_bulk - obtain a number of zeroed skb heads from the cache
+ * @skbs: pointer to an at least @n-sized array to fill with skb pointers
+ * @n: number of entries to provide
+ *
+ * Tries to obtain @n &sk_buff entries from the NAPI percpu cache and writes
+ * the pointers into the provided array @skbs. If there are less entries
+ * available, tries to replenish the cache and bulk-allocates the diff from
+ * the MM layer if needed.
+ * The heads are being zeroed with either memset() or %__GFP_ZERO, so they are
+ * ready for {,__}build_skb_around() and don't have any data buffers attached.
+ * Must be called *only* from the BH context.
+ *
+ * Return: number of successfully allocated skbs (@n if no actual allocation
+ * needed or kmem_cache_alloc_bulk() didn't fail).
+ */
+u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
+{
+ struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+ u32 bulk, total = n;
+
+ local_lock_nested_bh(&napi_alloc_cache.bh_lock);
+
+ if (nc->skb_count >= n)
+ goto get;
+
+ /* Not enough cached skbs. Try refilling the cache first */
+ bulk = min(NAPI_SKB_CACHE_SIZE - nc->skb_count, NAPI_SKB_CACHE_BULK);
+ nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+ GFP_ATOMIC | __GFP_NOWARN, bulk,
+ &nc->skb_cache[nc->skb_count]);
+ if (likely(nc->skb_count >= n))
+ goto get;
+
+ /* Still not enough. Bulk-allocate the missing part directly, zeroed */
+ n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+ GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
+ n - nc->skb_count, &skbs[nc->skb_count]);
+ if (likely(nc->skb_count >= n))
+ goto get;
+
+ /* kmem_cache didn't allocate the number we need, limit the output */
+ total -= n - nc->skb_count;
+ n = nc->skb_count;
+
+get:
+ for (u32 base = nc->skb_count - n, i = 0; i < n; i++) {
+ u32 cache_size = kmem_cache_size(net_hotdata.skbuff_cache);
+
+ skbs[i] = nc->skb_cache[base + i];
+
+ kasan_mempool_unpoison_object(skbs[i], cache_size);
+ memset(skbs[i], 0, offsetof(struct sk_buff, tail));
+ }
+
+ nc->skb_count -= n;
+ local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
+
+ return total;
+}
+EXPORT_SYMBOL_GPL(napi_skb_cache_get_bulk);
+
static inline void __finalize_skb_around(struct sk_buff *skb, void *data,
unsigned int size)
{
--
2.47.1
* [PATCH net-next v2 6/8] bpf: cpumap: switch to napi_skb_cache_get_bulk()
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
` (4 preceding siblings ...)
2025-01-07 15:29 ` [PATCH net-next v2 5/8] net: skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 7/8] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk() Alexander Lobakin
` (2 subsequent siblings)
8 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
Now that cpumap uses GRO, which drops unused skb heads to the NAPI
cache, use napi_skb_cache_get_bulk() to try to reuse cached entries
and lower the MM layer pressure. Always disable BH before checking and
running the cpumap-pinned XDP prog and don't re-enable it in between
that and allocating an skb bulk, as we can access the NAPI caches only
from BH context.
The better GRO aggregates packets, the fewer new skbs will be allocated.
If an aggregated skb contains 16 frags, this means 15 skbs were returned
to the cache, so the next 15 skbs will be built without allocating
anything.
The same trafficgen UDP GRO test now shows:
                GRO off   GRO on
threaded GRO    2.3       4         Mpps
thr bulk GRO    2.4       4.7       Mpps
diff            +4        +17       %

Comparing to the baseline cpumap:

baseline        2.7       N/A       Mpps
thr bulk GRO    2.4       4.7       Mpps
diff            -11       +74       %
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
---
kernel/bpf/cpumap.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 4c3eeb931bff..c6b7f0e1ba61 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -253,7 +253,7 @@ static bool cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
if (!rcpu->prog)
goto out;
- rcu_read_lock_bh();
+ rcu_read_lock();
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
ret->xdp_n = cpu_map_bpf_prog_run_xdp(rcpu, frames, ret->xdp_n, stats);
@@ -265,7 +265,7 @@ static bool cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
xdp_do_flush();
bpf_net_ctx_clear(bpf_net_ctx);
- rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
+ rcu_read_unlock();
out:
if (unlikely(ret->skb_n) && ret->xdp_n)
@@ -305,7 +305,6 @@ static int cpu_map_kthread_run(void *data)
while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
struct xdp_cpumap_stats stats = {}; /* zero stats */
unsigned int kmem_alloc_drops = 0, sched = 0;
- gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
struct cpu_map_ret ret = { };
void *frames[CPUMAP_BATCH];
void *skbs[CPUMAP_BATCH];
@@ -357,21 +356,19 @@ static int cpu_map_kthread_run(void *data)
prefetchw(page);
}
+ local_bh_disable();
+
/* Support running another XDP prog on this CPU */
- if (!cpu_map_bpf_prog_run(rcpu, frames, skbs, &ret, &stats)) {
- local_bh_disable();
+ if (!cpu_map_bpf_prog_run(rcpu, frames, skbs, &ret, &stats))
goto stats;
- }
- m = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp,
- ret.xdp_n, skbs);
+ m = napi_skb_cache_get_bulk(skbs, ret.xdp_n);
if (unlikely(!m)) {
for (i = 0; i < ret.xdp_n; i++)
skbs[i] = NULL; /* effect: xdp_return_frame */
kmem_alloc_drops += ret.xdp_n;
}
- local_bh_disable();
for (i = 0; i < ret.xdp_n; i++) {
struct xdp_frame *xdpf = frames[i];
struct sk_buff *skb = skbs[i];
--
2.47.1
* [PATCH net-next v2 7/8] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
` (5 preceding siblings ...)
2025-01-07 15:29 ` [PATCH net-next v2 6/8] bpf: cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 8/8] xdp: remove xdp_alloc_skb_bulk() Alexander Lobakin
2025-01-07 17:17 ` [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Jesper Dangaard Brouer
8 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
Now that we can bulk-allocate skbs from the NAPI cache, use that
function in veth as well instead of allocating directly from the kmem
caches. veth uses NAPI and GRO, so this is both context-safe and
beneficial.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
drivers/net/veth.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 01251868a9c2..7634ee8843bc 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -684,8 +684,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
void *skbs[VETH_XDP_BATCH];
int i;
- if (xdp_alloc_skb_bulk(skbs, n_xdpf,
- GFP_ATOMIC | __GFP_ZERO) < 0) {
+ if (unlikely(!napi_skb_cache_get_bulk(skbs, n_xdpf))) {
for (i = 0; i < n_xdpf; i++)
xdp_return_frame(frames[i]);
stats->rx_drops += n_xdpf;
--
2.47.1
* [PATCH net-next v2 8/8] xdp: remove xdp_alloc_skb_bulk()
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
` (6 preceding siblings ...)
2025-01-07 15:29 ` [PATCH net-next v2 7/8] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk() Alexander Lobakin
@ 2025-01-07 15:29 ` Alexander Lobakin
2025-01-07 17:17 ` [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Jesper Dangaard Brouer
8 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-07 15:29 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
The only user was veth, which now uses napi_skb_cache_get_bulk().
That function is preferred over a direct allocation and is exported as
well, so remove xdp_alloc_skb_bulk().
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
include/net/xdp.h | 1 -
net/core/xdp.c | 10 ----------
2 files changed, 11 deletions(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 6da0e746cf75..e2f83819405b 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -344,7 +344,6 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
struct net_device *dev);
struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
struct net_device *dev);
-int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp);
struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
static inline
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 67b53fc7191e..eb8762ff16cb 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -619,16 +619,6 @@ void xdp_warn(const char *msg, const char *func, const int line)
};
EXPORT_SYMBOL_GPL(xdp_warn);
-int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp)
-{
- n_skb = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp, n_skb, skbs);
- if (unlikely(!n_skb))
- return -ENOMEM;
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(xdp_alloc_skb_bulk);
-
/**
* xdp_build_skb_from_buff - create an skb from &xdp_buff
* @xdp: &xdp_buff to convert to an skb
--
2.47.1
* Re: [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
` (7 preceding siblings ...)
2025-01-07 15:29 ` [PATCH net-next v2 8/8] xdp: remove xdp_alloc_skb_bulk() Alexander Lobakin
@ 2025-01-07 17:17 ` Jesper Dangaard Brouer
2025-01-08 13:39 ` Alexander Lobakin
2025-01-09 1:26 ` Daniel Xu
8 siblings, 2 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-07 17:17 UTC (permalink / raw)
To: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Toke Høiland-Jørgensen,
Martin KaFai Lau, netdev, bpf, linux-kernel, Jesse Brandeburg,
kernel-team
Awesome work! - some questions below
On 07/01/2025 16.29, Alexander Lobakin wrote:
> Several months ago, I had been looking through my old XDP hints tree[0]
> to check whether some patches not directly related to hints can be sent
> standalone. Roughly at the same time, Daniel appeared and asked[1] about
> GRO for cpumap from that tree.
>
> Currently, cpumap uses its own kthread which processes cpumap-redirected
> frames by batches of 8, without any weighting (but with rescheduling
> points). The resulting skbs get passed to the stack via
> netif_receive_skb_list(), which means no GRO happens.
> Even though we can't currently pass checksum status from the drivers,
> in many cases GRO performs better than the listified Rx without the
> aggregation, confirmed by tests.
>
> In order to enable GRO in cpumap, we need to do the following:
>
> * patches 1-2: decouple the GRO struct from the NAPI struct and allow
> using it out of a NAPI entity within the kernel core code;
> * patch 3: switch cpumap from netif_receive_skb_list() to
> gro_receive_skb().
>
> Additional improvements:
>
> * patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
> lists;
> * patch 5-6: introduce and use function do get skbs from the NAPI percpu
> caches by bulks, not one at a time;
> * patch 7-8: use that function in veth as well and remove the one that
> was now superseded by it.
>
> My trafficgen UDP GRO tests, small frame sizes:
>
How does your trafficgen UDP test manage to get UDP GRO working?
(Perhaps you can share the test?)
What is the "small frame" size being used?
Is the UDP benchmark avoiding (re)calculating the RX checksum?
(via setting the UDP csum to zero)
> GRO off GRO on
> baseline 2.7 N/A Mpps
> patch 3 2.3 4 Mpps
> patch 8 2.4 4.7 Mpps
>
> 1...3 diff -17 +48 %
> 1...8 diff -11 +74 %
>
> Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
> tests. On my system however, the same test gave me up to +100%.
>
I can imagine that the TCP throughput tests will yield a huge
performance boost.
> Note that there's a series from Lorenzo[4] which achieves the same, but
> in a different way. During the discussions, the approach using a
> standalone GRO instance was preferred over the threaded NAPI.
>
It looks like you are keeping the "remote" CPUMAP kthread process design
intact in this series, right?
I think this design works for our use-case. For our use-case, we want to
give "remote" CPU-thread higher scheduling priority. It doesn't matter
if this is a kthread or threaded-NAPI thread, as long as we can see this
as a PID from userspace (by which we adjust the sched priority).
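E.g. a minimal userspace sketch of how that adjustment could look
(assuming the kthread PID has already been looked up; the policy and
priority values here are arbitrary):

        #include <sched.h>
        #include <sys/types.h>

        /* give the cpumap kthread RT priority so it preempts normal tasks */
        static int bump_cpumap_kthread_prio(pid_t pid)
        {
                const struct sched_param param = { .sched_priority = 50 };

                return sched_setscheduler(pid, SCHED_FIFO, &param);
        }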
Great to see this work progressing again :-)))
--Jesper
> [0] https://github.com/alobakin/linux/tree/xdp_hints
> [1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
> [2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh
> [3] https://lore.kernel.org/bpf/yzda66wro5twmzpmjoxvy4si5zvkehlmgtpi6brheek3sj73tj@o7kd6nurr3o6
> [4] https://lore.kernel.org/bpf/20241130-cpumap-gro-v1-0-c1180b1b5758@kernel.org
>
> Alexander Lobakin (8):
> net: gro: decouple GRO from the NAPI layer
> net: gro: expose GRO init/cleanup to use outside of NAPI
> bpf: cpumap: switch to GRO from netif_receive_skb_list()
> bpf: cpumap: reuse skb array instead of a linked list to chain skbs
> net: skbuff: introduce napi_skb_cache_get_bulk()
> bpf: cpumap: switch to napi_skb_cache_get_bulk()
> veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
> xdp: remove xdp_alloc_skb_bulk()
>
> include/linux/netdevice.h | 35 ++++--
> include/linux/skbuff.h | 1 +
> include/net/busy_poll.h | 11 +-
> include/net/gro.h | 38 ++++--
> include/net/xdp.h | 1 -
> drivers/net/ethernet/brocade/bna/bnad.c | 1 +
> drivers/net/ethernet/cortina/gemini.c | 1 +
> drivers/net/veth.c | 3 +-
> drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c | 1 +
> kernel/bpf/cpumap.c | 131 ++++++++++++++-------
> net/core/dev.c | 79 ++++---------
> net/core/gro.c | 103 ++++++++++------
> net/core/skbuff.c | 62 ++++++++++
> net/core/xdp.c | 10 --
> 14 files changed, 306 insertions(+), 171 deletions(-)
>
> ---
> From v1[5]:
> * use a standalone GRO instance instead of the threaded NAPI (Jakub);
> * rebase and send to net-next as it's now more networking than BPF.
>
> [5] https://lore.kernel.org/bpf/20240830162508.1009458-1-aleksander.lobakin@intel.com
* Re: [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames
2025-01-07 17:17 ` [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Jesper Dangaard Brouer
@ 2025-01-08 13:39 ` Alexander Lobakin
2025-01-09 17:02 ` Toke Høiland-Jørgensen
2025-01-09 1:26 ` Daniel Xu
1 sibling, 1 reply; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-08 13:39 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, John Fastabend,
Toke Høiland-Jørgensen, Martin KaFai Lau, netdev, bpf,
linux-kernel, Jesse Brandeburg, kernel-team
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 7 Jan 2025 18:17:06 +0100
> Awesome work! - some questions below
>
> On 07/01/2025 16.29, Alexander Lobakin wrote:
>> Several months ago, I had been looking through my old XDP hints tree[0]
>> to check whether some patches not directly related to hints can be sent
>> standalone. Roughly at the same time, Daniel appeared and asked[1] about
>> GRO for cpumap from that tree.
>>
>> Currently, cpumap uses its own kthread which processes cpumap-redirected
>> frames by batches of 8, without any weighting (but with rescheduling
>> points). The resulting skbs get passed to the stack via
>> netif_receive_skb_list(), which means no GRO happens.
>> Even though we can't currently pass checksum status from the drivers,
>> in many cases GRO performs better than the listified Rx without the
>> aggregation, confirmed by tests.
>>
>> In order to enable GRO in cpumap, we need to do the following:
>>
>> * patches 1-2: decouple the GRO struct from the NAPI struct and allow
>> using it out of a NAPI entity within the kernel core code;
>> * patch 3: switch cpumap from netif_receive_skb_list() to
>> gro_receive_skb().
>>
>> Additional improvements:
>>
>> * patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
>> lists;
>> * patches 5-6: introduce and use a function to get skbs from the NAPI percpu
>> caches in bulk, not one at a time;
>> * patches 7-8: use that function in veth as well and remove the one now
>> superseded by it.
>>
>> My trafficgen UDP GRO tests, small frame sizes:
>>
>
> How does your trafficgen UDP test manage to get UDP GRO working?
> (Perhaps you can share test?)
I usually test as follows:
xdp-trafficgen from xdp-tools on the sender
then, on the receiver:
ethtool -K <iface> rx-udp-gro-forwarding on
There's no socket on the receiver, but this option enables GRO not only
when forwarding, but also for LOCAL_IN traffic when there's simply no
socket. The UDP core then drops the frame during the sk lookup, as no
socket is found.
IOW, I have the following:
* GRO gets performed
* Stack overhead is there, up to UDP lookup
* The final frame is dropped, so no userspace copy overhead.
>
> What is the "small frame" size being used?
xdp-trafficgen currently hardcodes the frame size to 64 bytes. I was
planning to add an option to configure the frame size and send it
upstream, but unfortunately haven't finished it yet.
I realize that with bigger frames the boosts won't be as big, since the
CPU will have to checksum larger buffers. OTOH, TCP benches usually send
MTU-sized buffers (+ TSO), and the perf is still better.
>
> Is the UDP benchmark avoiding (re)calculating the RX checksum?
> (via setting UDP csum to zero)
Oh, I completely forgot about this one. I can imagine even bigger boosts
once the CPU checksumming disappears.
>
>> GRO off GRO on
>> baseline 2.7 N/A Mpps
>> patch 3 2.3 4 Mpps
>> patch 8 2.4 4.7 Mpps
>>
>> 1...3 diff -17 +48 %
>> 1...8 diff -11 +74 %
>>
>> Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
>> tests. On my system however, the same test gave me up to +100%.
>>
>
> I can imagine that the TCP throughput tests will yield a huge
> performance boost.
>
>> Note that there's a series from Lorenzo[4] which achieves the same, but
>> in a different way. During the discussions, the approach using a
>> standalone GRO instance was preferred over the threaded NAPI.
>>
>
> It looks like you are keeping the "remote" CPUMAP kthread process design
> intact in this series, right?
Right, the kthread logic remains the same as before.
>
> I think this design works for our use-case. For our use-case, we want to
> give "remote" CPU-thread higher scheduling priority. It doesn't matter
> if this is a kthread or threaded-NAPI thread, as long as we can see this
> as a PID from userspace (by which we adjust the sched priority).
>
> Great to see this work progressing again :-)))
> --Jesper
Thanks,
Olek
* Re: [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames
2025-01-07 17:17 ` [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Jesper Dangaard Brouer
2025-01-08 13:39 ` Alexander Lobakin
@ 2025-01-09 1:26 ` Daniel Xu
1 sibling, 0 replies; 20+ messages in thread
From: Daniel Xu @ 2025-01-09 1:26 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Lorenzo Bianconi, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, John Fastabend,
Toke Høiland-Jørgensen, Martin KaFai Lau, netdev, bpf,
linux-kernel, Jesse Brandeburg, kernel-team
On Tue, Jan 07, 2025 at 06:17:06PM +0100, Jesper Dangaard Brouer wrote:
> Awesome work! - some questions below
>
> On 07/01/2025 16.29, Alexander Lobakin wrote:
> > Several months ago, I had been looking through my old XDP hints tree[0]
> > to check whether some patches not directly related to hints can be sent
> > standalone. Roughly at the same time, Daniel appeared and asked[1] about
> > GRO for cpumap from that tree.
> >
> > Currently, cpumap uses its own kthread which processes cpumap-redirected
> > frames by batches of 8, without any weighting (but with rescheduling
> > points). The resulting skbs get passed to the stack via
> > netif_receive_skb_list(), which means no GRO happens.
> > Even though we can't currently pass checksum status from the drivers,
> > in many cases GRO performs better than the listified Rx without the
> > aggregation, confirmed by tests.
> >
> > In order to enable GRO in cpumap, we need to do the following:
> >
> > * patches 1-2: decouple the GRO struct from the NAPI struct and allow
> > using it out of a NAPI entity within the kernel core code;
> > * patch 3: switch cpumap from netif_receive_skb_list() to
> > gro_receive_skb().
> >
> > Additional improvements:
> >
> > * patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
> > lists;
> > * patches 5-6: introduce and use a function to get skbs from the NAPI percpu
> > caches in bulk, not one at a time;
> > * patches 7-8: use that function in veth as well and remove the one now
> > superseded by it.
> >
> > My trafficgen UDP GRO tests, small frame sizes:
> >
>
> How does your trafficgen UDP test manage to get UDP GRO working?
> (Perhaps you can share test?)
>
> What is the "small frame" size being used?
>
> Is the UDP benchmark avoiding (re)calculating the RX checksum?
> (via setting UDP csum to zero)
>
> > GRO off GRO on
> > baseline 2.7 N/A Mpps
> > patch 3 2.3 4 Mpps
> > patch 8 2.4 4.7 Mpps
> >
> > 1...3 diff -17 +48 %
> > 1...8 diff -11 +74 %
> >
> > Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
> > tests. On my system however, the same test gave me up to +100%.
> >
>
> I can imagine that the TCP throughput tests will yield a huge
> performance boost.
>
> > Note that there's a series from Lorenzo[4] which achieves the same, but
> > in a different way. During the discussions, the approach using a
> > standalone GRO instance was preferred over the threaded NAPI.
> >
>
> It looks like you are keeping the "remote" CPUMAP kthread process design
> intact in this series, right?
>
> I think this design works for our use-case. For our use-case, we want to
> give "remote" CPU-thread higher scheduling priority. It doesn't matter
> if this is a kthread or threaded-NAPI thread, as long as we can see this
> as a PID from userspace (by which we adjust the sched priority).
>
Similar for us as well - having a schedulable entity helps. I might
have mentioned it in an earlier thread, but with sched-ext, I think
things could get interesting for dynamically tuning the system. We've
got some vague ideas. Probably not this upcoming one, but maybe if any
of the ideas work we'll share them at netdev or something.
> Great to see this work progressing again :-)))
Agreed, thanks for continuing!
Daniel
* Re: [PATCH net-next v2 5/8] net: skbuff: introduce napi_skb_cache_get_bulk()
2025-01-07 15:29 ` [PATCH net-next v2 5/8] net: skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
@ 2025-01-09 13:16 ` Paolo Abeni
2025-01-13 13:47 ` Alexander Lobakin
0 siblings, 1 reply; 20+ messages in thread
From: Paolo Abeni @ 2025-01-09 13:16 UTC (permalink / raw)
To: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski
Cc: Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
On 1/7/25 4:29 PM, Alexander Lobakin wrote:
> Add a function to get an array of skbs from the NAPI percpu cache.
> It's supposed to be a drop-in replacement for
> kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and
> xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the
> requirement to call it only from the BH) is that it tries to use
> as many NAPI cache entries for skbs as possible, and allocate new
> ones only if needed.
>
> The logic is as follows:
>
> * there are enough skbs in the cache: decache them and return to the
> caller;
> * not enough: try refilling the cache first. If there are now enough
> skbs, return;
> * still not enough: try allocating skbs directly to the output array
> with %GFP_ZERO, maybe we'll be able to get some. If there's now
> enough, return;
> * still not enough: return as many as we were able to obtain.
>
> Most of the time, if called from the NAPI polling loop, the first one
> will be true, sometimes (rarely) the second one. The third and the
> fourth -- only under heavy memory pressure.
> It can save significant amounts of CPU cycles if there are GRO cycles
> and/or Tx completion cycles (anything that descends to
> napi_skb_cache_put()) happening on this CPU.
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> Tested-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
> include/linux/skbuff.h | 1 +
> net/core/skbuff.c | 62 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 63 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bb2b751d274a..1c089c7c14e1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1315,6 +1315,7 @@ struct sk_buff *build_skb_around(struct sk_buff *skb,
> void *data, unsigned int frag_size);
> void skb_attempt_defer_free(struct sk_buff *skb);
>
> +u32 napi_skb_cache_get_bulk(void **skbs, u32 n);
> struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
> struct sk_buff *slab_build_skb(void *data);
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index a441613a1e6c..42eb31dcc9ce 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -367,6 +367,68 @@ static struct sk_buff *napi_skb_cache_get(void)
> return skb;
> }
>
> +/**
> + * napi_skb_cache_get_bulk - obtain a number of zeroed skb heads from the cache
> + * @skbs: pointer to an at least @n-sized array to fill with skb pointers
> + * @n: number of entries to provide
> + *
> + * Tries to obtain @n &sk_buff entries from the NAPI percpu cache and writes
> + * the pointers into the provided array @skbs. If there are fewer entries
> + * available, tries to replenish the cache and bulk-allocates the diff from
> + * the MM layer if needed.
> + * The heads are being zeroed with either memset() or %__GFP_ZERO, so they are
> + * ready for {,__}build_skb_around() and don't have any data buffers attached.
> + * Must be called *only* from the BH context.
> + *
> + * Return: number of successfully allocated skbs (@n if no actual allocation
> + * needed or kmem_cache_alloc_bulk() didn't fail).
> + */
> +u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
> +{
> + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> + u32 bulk, total = n;
> +
> + local_lock_nested_bh(&napi_alloc_cache.bh_lock);
> +
> + if (nc->skb_count >= n)
> + goto get;
I (mis?)understood from the commit message this condition should be
likely, too?!?
> + /* Not enough cached skbs. Try refilling the cache first */
> + bulk = min(NAPI_SKB_CACHE_SIZE - nc->skb_count, NAPI_SKB_CACHE_BULK);
> + nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> + GFP_ATOMIC | __GFP_NOWARN, bulk,
> + &nc->skb_cache[nc->skb_count]);
> + if (likely(nc->skb_count >= n))
> + goto get;
> +
> + /* Still not enough. Bulk-allocate the missing part directly, zeroed */
> + n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
> + GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
> + n - nc->skb_count, &skbs[nc->skb_count]);
You should probably cap 'n' at NAPI_SKB_CACHE_SIZE. Also, what about
latency spikes when n == 48 (which should be the maximum possible with
such a limit) here?
/P
* Re: [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer
2025-01-07 15:29 ` [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer Alexander Lobakin
@ 2025-01-09 14:24 ` Paolo Abeni
2025-01-13 13:50 ` Alexander Lobakin
0 siblings, 1 reply; 20+ messages in thread
From: Paolo Abeni @ 2025-01-09 14:24 UTC (permalink / raw)
To: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski
Cc: Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
On 1/7/25 4:29 PM, Alexander Lobakin wrote:
> @@ -623,21 +622,21 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi,
> return ret;
> }
>
> -gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> +gro_result_t gro_receive_skb(struct gro_node *gro, struct sk_buff *skb)
> {
> gro_result_t ret;
>
> - skb_mark_napi_id(skb, napi);
> + __skb_mark_napi_id(skb, gro->napi_id);
Is this the only place where gro->napi_id is needed? If so, what about
moving skb_mark_napi_id() into napi_gro_receive() and removing that field?
/P
* Re: [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames
2025-01-08 13:39 ` Alexander Lobakin
@ 2025-01-09 17:02 ` Toke Høiland-Jørgensen
2025-01-13 13:50 ` Alexander Lobakin
0 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-01-09 17:02 UTC (permalink / raw)
To: Alexander Lobakin, Jesper Dangaard Brouer
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, John Fastabend,
Martin KaFai Lau, netdev, bpf, linux-kernel, Jesse Brandeburg,
kernel-team
Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>> What is the "small frame" size being used?
>
> xdp-trafficgen currently hardcodes the frame size to 64 bytes. I was
> planning to add an option to configure the frame size and send it
> upstream, but unfortunately haven't finished it yet.
Well, I guess I can be of some help here. Just pushed an update to
xdp-trafficgen to support specifying the packet size :)
-Toke
* Re: [PATCH net-next v2 5/8] net: skbuff: introduce napi_skb_cache_get_bulk()
2025-01-09 13:16 ` Paolo Abeni
@ 2025-01-13 13:47 ` Alexander Lobakin
0 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-13 13:47 UTC (permalink / raw)
To: Paolo Abeni
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 9 Jan 2025 14:16:22 +0100
> On 1/7/25 4:29 PM, Alexander Lobakin wrote:
>> Add a function to get an array of skbs from the NAPI percpu cache.
>> It's supposed to be a drop-in replacement for
>> kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and
>> xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the
>> requirement to call it only from the BH) is that it tries to use
>> as many NAPI cache entries for skbs as possible, and allocate new
>> ones only if needed.
[...]
>> +u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
>> +{
>> + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
>> + u32 bulk, total = n;
>> +
>> + local_lock_nested_bh(&napi_alloc_cache.bh_lock);
>> +
>> + if (nc->skb_count >= n)
>> + goto get;
>
> I (mis?)understood from the commit message this condition should be
> likely, too?!?
It depends. I didn't want to make this unlikely() as it will happen
sometimes anyway, while the two unlikely() below can only happen when
the system is low on memory.
>
>> + /* Not enough cached skbs. Try refilling the cache first */
>> + bulk = min(NAPI_SKB_CACHE_SIZE - nc->skb_count, NAPI_SKB_CACHE_BULK);
>> + nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> + GFP_ATOMIC | __GFP_NOWARN, bulk,
>> + &nc->skb_cache[nc->skb_count]);
>> + if (likely(nc->skb_count >= n))
>> + goto get;
>> +
>> + /* Still not enough. Bulk-allocate the missing part directly, zeroed */
>> + n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
>> + GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
>> + n - nc->skb_count, &skbs[nc->skb_count]);
>
> You should probably cap 'n' at NAPI_SKB_CACHE_SIZE. Also, what about
> latency spikes when n == 48 (which should be the maximum possible with
> such a limit) here?
The current users never allocate more than 8 skbs in one bulk. Anyway,
the current approach is meant to be a drop-in replacement for
kmem_cache_alloc_bulk(skbuff_cache), which doesn't cap anything.
Note that this last branch allocates into @skbs directly, not into the
percpu NAPI cache.
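To illustrate the intended usage (a sketch only, not the actual cpumap
code; the helper name and the frame data/size parameters are made up):

        /* Build up to 8 skbs around the given frame buffers, using the
         * NAPI percpu skb cache. Must be called with BH disabled.
         */
        static u32 xdp_frames_to_skbs(void **data, u32 size, u32 n,
                                      struct sk_buff **out)
        {
                void *skbs[8];
                u32 i, m;

                n = min_t(u32, n, ARRAY_SIZE(skbs));
                m = napi_skb_cache_get_bulk(skbs, n);

                for (i = 0; i < m; i++)
                        /* the heads are zeroed => ready for build_skb_around() */
                        out[i] = build_skb_around(skbs[i], data[i], size);

                return m;
        }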
>
> /P
Thanks,
Olek
* Re: [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer
2025-01-09 14:24 ` Paolo Abeni
@ 2025-01-13 13:50 ` Alexander Lobakin
2025-01-13 21:01 ` Jakub Kicinski
0 siblings, 1 reply; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-13 13:50 UTC (permalink / raw)
To: Paolo Abeni
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 9 Jan 2025 15:24:16 +0100
> On 1/7/25 4:29 PM, Alexander Lobakin wrote:
>> @@ -623,21 +622,21 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi,
>> return ret;
>> }
>>
>> -gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
>> +gro_result_t gro_receive_skb(struct gro_node *gro, struct sk_buff *skb)
>> {
>> gro_result_t ret;
>>
>> - skb_mark_napi_id(skb, napi);
>> + __skb_mark_napi_id(skb, gro->napi_id);
>
> Is this the only place where gro->napi_id is needed? If so, what about
> moving skb_mark_napi_id() into napi_gro_receive() and removing that field?
Yes, only here. I thought of this, too. But that would increase the
object code of each napi_gro_receive() caller, as it's now inline, so I
stuck with this variant.
What do you think?
>
> /P
Thanks,
Olek
* Re: [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames
2025-01-09 17:02 ` Toke Høiland-Jørgensen
@ 2025-01-13 13:50 ` Alexander Lobakin
0 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-13 13:50 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Lorenzo Bianconi,
Daniel Xu, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, netdev, bpf, linux-kernel,
Jesse Brandeburg, kernel-team
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Thu, 09 Jan 2025 18:02:59 +0100
> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>
>>> What is the "small frame" size being used?
>>
>> xdp-trafficgen currently hardcodes the frame size to 64 bytes. I was
>> planning to add an option to configure the frame size and send it
>> upstream, but unfortunately haven't finished it yet.
>
> Well, I guess I can be of some help here. Just pushed an update to
> xdp-trafficgen to support specifying the packet size :)
Cool, thanks a lot!
>
> -Toke
Olek
* Re: [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer
2025-01-13 13:50 ` Alexander Lobakin
@ 2025-01-13 21:01 ` Jakub Kicinski
2025-01-14 17:19 ` Alexander Lobakin
0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2025-01-13 21:01 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Paolo Abeni, Andrew Lunn, David S. Miller, Eric Dumazet,
Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
On Mon, 13 Jan 2025 14:50:02 +0100 Alexander Lobakin wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Thu, 9 Jan 2025 15:24:16 +0100
>
> > On 1/7/25 4:29 PM, Alexander Lobakin wrote:
> >> @@ -623,21 +622,21 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi,
> >> return ret;
> >> }
> >>
> >> -gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> >> +gro_result_t gro_receive_skb(struct gro_node *gro, struct sk_buff *skb)
> >> {
> >> gro_result_t ret;
> >>
> >> - skb_mark_napi_id(skb, napi);
> >> + __skb_mark_napi_id(skb, gro->napi_id);
> >
> > Is this the only place where gro->napi_id is needed? If so, what about
> > moving skb_mark_napi_id() into napi_gro_receive() and removing that field?
>
> Yes, only here. I thought of this, too. But that would increase the
> object code of each napi_gro_receive() caller, as it's now inline, so I
> stuck with this variant.
> What do you think?
What if we make napi_gro_receive() a real function (not inline)
and tail call gro_receive_skb()? Is the compiler not clever
enough to optimize that?
Very nice work in general, the napi_id in gro sticks out..
* Re: [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer
2025-01-13 21:01 ` Jakub Kicinski
@ 2025-01-14 17:19 ` Alexander Lobakin
0 siblings, 0 replies; 20+ messages in thread
From: Alexander Lobakin @ 2025-01-14 17:19 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Paolo Abeni, Andrew Lunn, David S. Miller, Eric Dumazet,
Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Toke Høiland-Jørgensen,
Jesper Dangaard Brouer, Martin KaFai Lau, netdev, bpf,
linux-kernel
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 13 Jan 2025 13:01:04 -0800
> On Mon, 13 Jan 2025 14:50:02 +0100 Alexander Lobakin wrote:
>> From: Paolo Abeni <pabeni@redhat.com>
>> Date: Thu, 9 Jan 2025 15:24:16 +0100
>>
>>> On 1/7/25 4:29 PM, Alexander Lobakin wrote:
>>>> @@ -623,21 +622,21 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi,
>>>> return ret;
>>>> }
>>>>
>>>> -gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
>>>> +gro_result_t gro_receive_skb(struct gro_node *gro, struct sk_buff *skb)
>>>> {
>>>> gro_result_t ret;
>>>>
>>>> - skb_mark_napi_id(skb, napi);
>>>> + __skb_mark_napi_id(skb, gro->napi_id);
>>>
>>> Is this the only place where gro->napi_id is needed? If so, what about
>>> moving skb_mark_napi_id() into napi_gro_receive() and removing that field?
>>
>> Yes, only here. I thought of this, too. But that would increase the
>> object code of each napi_gro_receive() caller, as it's now inline, so I
>> stuck with this variant.
>> What do you think?
>
> What if we make napi_gro_receive() a real function (not inline)
> and tail call gro_receive_skb()? Is the compiler not clever
> enough to optimize that?
Worth trying. I'll be glad to do it that way if perf doesn't regress.
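Roughly something like this, I guess (sketch only; it assumes the
gro_node is embedded in &napi_struct as 'napi->gro'):

        /* out of line now, so callers don't grow and struct gro_node
         * doesn't need to carry napi_id anymore
         */
        gro_result_t napi_gro_receive(struct napi_struct *napi,
                                      struct sk_buff *skb)
        {
                skb_mark_napi_id(skb, napi);

                return gro_receive_skb(&napi->gro, skb);
        }
        EXPORT_SYMBOL(napi_gro_receive);

If the compiler turns that into a tail call, the extra cost should be a
single jump.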
>
> Very nice work in general, the napi_id in gro sticks out..
Thanks,
Olek
end of thread (~2025-01-14 17:20 UTC)
Thread overview: 20+ messages
2025-01-07 15:29 [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 1/8] net: gro: decouple GRO from the NAPI layer Alexander Lobakin
2025-01-09 14:24 ` Paolo Abeni
2025-01-13 13:50 ` Alexander Lobakin
2025-01-13 21:01 ` Jakub Kicinski
2025-01-14 17:19 ` Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 2/8] net: gro: expose GRO init/cleanup to use outside of NAPI Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 3/8] bpf: cpumap: switch to GRO from netif_receive_skb_list() Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 4/8] bpf: cpumap: reuse skb array instead of a linked list to chain skbs Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 5/8] net: skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
2025-01-09 13:16 ` Paolo Abeni
2025-01-13 13:47 ` Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 6/8] bpf: cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 7/8] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk() Alexander Lobakin
2025-01-07 15:29 ` [PATCH net-next v2 8/8] xdp: remove xdp_alloc_skb_bulk() Alexander Lobakin
2025-01-07 17:17 ` [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames Jesper Dangaard Brouer
2025-01-08 13:39 ` Alexander Lobakin
2025-01-09 17:02 ` Toke Høiland-Jørgensen
2025-01-13 13:50 ` Alexander Lobakin
2025-01-09 1:26 ` Daniel Xu