* [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
@ 2024-08-30 16:24 Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 1/9] firmware/psci: fix missing '%u' format literal in kthread_create_on_cpu() Alexander Lobakin
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:24 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

Recently, I've been looking through my old XDP hints tree[0] to check
whether some patches not directly related to hints can be sent
standalone. Roughly at the same time, Daniel appeared and asked[1] about
GRO for cpumap from that tree.

Currently, cpumap uses its own kthread which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens.
Even though we can't currently pass the checksum status from the drivers,
in many cases GRO performs better than listified Rx without aggregation,
as confirmed by the tests below.

In order to enable GRO in cpumap, we need to do the following:

* patches 1-3: allow creating CPU-pinned threaded NAPIs;
* patch 4: switch cpumap from a custom kthread to a CPU-pinned
  threaded NAPI;

Additional improvements:

* patch 5: optimize XDP_PASS in cpumap by using arrays instead of linked
  lists;
* patches 6-7: introduce and use a function to get skbs from the NAPI
  percpu caches in bulk, not one at a time;
* patches 8-9: use that function in veth and remove the one it
  supersedes.

My trafficgen UDP GRO tests, small frame sizes:

                GRO off    GRO on
baseline        2.7        N/A       Mpps
thread GRO      2.3        4         Mpps
thr bulk GRO    2.4        4.7       Mpps

1...2 diff      -17        +48       %
1...3 diff      -14        +75       %

Daniel reported a +14% throughput increase in neper's TCP RR tests[2].

[0] https://github.com/alobakin/linux/tree/xdp_hints
[1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
[2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh

Alexander Lobakin (7):
  firmware/psci: fix missing '%u' format literal in
    kthread_create_on_cpu()
  kthread: allow vararg kthread_{create,run}_on_cpu()
  bpf: cpumap: reuse skb array instead of a linked list to chain skbs
  net: skbuff: introduce napi_skb_cache_get_bulk()
  bpf: cpumap: switch to napi_skb_cache_get_bulk()
  veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
  xdp: remove xdp_alloc_skb_bulk()

Lorenzo Bianconi (2):
  net: napi: add ability to create CPU-pinned threaded NAPI
  bpf: cpumap: use CPU-pinned threaded NAPI w/GRO instead of kthread

 include/linux/kthread.h              |  51 ++++---
 include/linux/netdevice.h            |  35 ++++-
 include/linux/skbuff.h               |   1 +
 include/net/xdp.h                    |   1 -
 drivers/firmware/psci/psci_checker.c |   2 +-
 drivers/net/veth.c                   |   3 +-
 kernel/bpf/cpumap.c                  | 210 ++++++++++++---------------
 kernel/kthread.c                     |  22 +--
 net/core/dev.c                       |  18 ++-
 net/core/skbuff.c                    |  62 ++++++++
 net/core/xdp.c                       |  10 --
 11 files changed, 251 insertions(+), 164 deletions(-)

-- 
2.46.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 1/9] firmware/psci: fix missing '%u' format literal in kthread_create_on_cpu()
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-30 23:31   ` Daniel Xu
  2024-08-30 16:25 ` [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu() Alexander Lobakin
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

kthread_create_on_cpu() always requires the format string to contain a
'%u' at the end, as it automatically appends the CPU ID when passing it
to kthread_create_on_node(). The former isn't marked as __printf()
since it's not printf-like itself, which effectively hides this
requirement from the compiler.
If you convert this function to printf-like, you'll see the following:

In file included from drivers/firmware/psci/psci_checker.c:15:
drivers/firmware/psci/psci_checker.c: In function 'suspend_tests':
drivers/firmware/psci/psci_checker.c:401:48: warning: too many arguments for format [-Wformat-extra-args]
     401 |                                                "psci_suspend_test");
         |                                                ^~~~~~~~~~~~~~~~~~~
drivers/firmware/psci/psci_checker.c:400:32: warning: data argument not used by format string [-Wformat-extra-args]
     400 |                                                (void *)(long)cpu, cpu,
         |                                                                   ^
     401 |                                                "psci_suspend_test");
         |                                                ~~~~~~~~~~~~~~~~~~~

Add the missing format literal to fix this. Now the corresponding
kthread will be named "psci_suspend_test-<cpuid>", as intended by
kthread_create_on_cpu().

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408141012.KhvKaxoh-lkp@intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202408141243.eQiEOQQe-lkp@intel.com
Fixes: ea8b1c4a6019 ("drivers: psci: PSCI checker module")
Cc: stable@vger.kernel.org # 4.10+
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/firmware/psci/psci_checker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/psci/psci_checker.c b/drivers/firmware/psci/psci_checker.c
index 116eb465cdb4..ecc511c745ce 100644
--- a/drivers/firmware/psci/psci_checker.c
+++ b/drivers/firmware/psci/psci_checker.c
@@ -398,7 +398,7 @@ static int suspend_tests(void)
 
 		thread = kthread_create_on_cpu(suspend_test_thread,
 					       (void *)(long)cpu, cpu,
-					       "psci_suspend_test");
+					       "psci_suspend_test-%u");
 		if (IS_ERR(thread))
 			pr_err("Failed to create kthread on CPU %d\n", cpu);
 		else
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu()
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 1/9] firmware/psci: fix missing '%u' format literal in kthread_create_on_cpu() Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-30 22:56   ` Stanislav Fomichev
  2024-08-30 16:25 ` [PATCH bpf-next 3/9] net: napi: add ability to create CPU-pinned threaded NAPI Alexander Lobakin
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

Currently, kthread_{create,run}_on_cpu() don't support varargs like
kthread_create{,_on_node}() do, which makes them less convenient to
use.
Convert them to take varargs as the last argument. They still always
append the CPU ID at the end and thus require the format string to
have an excess '%u' at the end; the difference is that the compiler
will now correctly point it out when it's missing.
One more nice side effect is that you can now use the underscored
__kthread_create_on_cpu() if you want to override that rule and not
have the CPU ID appended to the name.
The current callers are not affected in any way.
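
A minimal usage sketch (the thread function, private data and device
below are placeholders, not taken from this series); the trailing '%u'
is still filled in with the CPU ID by the helper itself, while any
preceding specifiers are fed from the varargs:

	/* illustrative only: my_thread_fn, priv and dev are hypothetical */
	struct task_struct *t;

	t = kthread_run_on_cpu(my_thread_fn, priv, cpu, "worker/%s-%u",
			       dev->name);
	if (IS_ERR(t))
		return PTR_ERR(t);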

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/linux/kthread.h | 51 ++++++++++++++++++++++++++---------------
 kernel/kthread.c        | 22 ++++++++++--------
 2 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index b11f53c1ba2e..27a94e691948 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -27,11 +27,21 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 #define kthread_create(threadfn, data, namefmt, arg...) \
 	kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
 
-
-struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
-					  void *data,
-					  unsigned int cpu,
-					  const char *namefmt);
+__printf(4, 5)
+struct task_struct *__kthread_create_on_cpu(int (*threadfn)(void *data),
+					    void *data, unsigned int cpu,
+					    const char *namefmt, ...);
+
+#define kthread_create_on_cpu(threadfn, data, cpu, namefmt, ...)	   \
+	_kthread_create_on_cpu(threadfn, data, cpu, __UNIQUE_ID(cpu_),	   \
+			       namefmt, ##__VA_ARGS__)
+
+#define _kthread_create_on_cpu(threadfn, data, cpu, uc, namefmt, ...) ({   \
+	u32 uc = (cpu);							   \
+									   \
+	__kthread_create_on_cpu(threadfn, data, uc, namefmt,		   \
+				##__VA_ARGS__, uc);			   \
+})
 
 void get_kthread_comm(char *buf, size_t buf_size, struct task_struct *tsk);
 bool set_kthread_struct(struct task_struct *p);
@@ -62,25 +72,28 @@ bool kthread_is_per_cpu(struct task_struct *k);
  * @threadfn: the function to run until signal_pending(current).
  * @data: data ptr for @threadfn.
  * @cpu: The cpu on which the thread should be bound,
- * @namefmt: printf-style name for the thread. Format is restricted
- *	     to "name.*%u". Code fills in cpu number.
+ * @namefmt: printf-style name for the thread. Must have an excess '%u'
+ *	     at the end as kthread_create_on_cpu() fills in CPU number.
  *
  * Description: Convenient wrapper for kthread_create_on_cpu()
  * followed by wake_up_process().  Returns the kthread or
  * ERR_PTR(-ENOMEM).
  */
-static inline struct task_struct *
-kthread_run_on_cpu(int (*threadfn)(void *data), void *data,
-			unsigned int cpu, const char *namefmt)
-{
-	struct task_struct *p;
-
-	p = kthread_create_on_cpu(threadfn, data, cpu, namefmt);
-	if (!IS_ERR(p))
-		wake_up_process(p);
-
-	return p;
-}
+#define kthread_run_on_cpu(threadfn, data, cpu, namefmt, ...)		   \
+	_kthread_run_on_cpu(threadfn, data, cpu, __UNIQUE_ID(task_),	   \
+			    namefmt, ##__VA_ARGS__)
+
+#define _kthread_run_on_cpu(threadfn, data, cpu, ut, namefmt, ...)	   \
+({									   \
+	struct task_struct *ut;						   \
+									   \
+	ut = kthread_create_on_cpu(threadfn, data, cpu, namefmt,	   \
+				   ##__VA_ARGS__);			   \
+	if (!IS_ERR(ut))						   \
+		wake_up_process(ut);					   \
+									   \
+	ut;								   \
+})
 
 void free_kthread_struct(struct task_struct *k);
 void kthread_bind(struct task_struct *k, unsigned int cpu);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f7be976ff88a..e9da0115fb2b 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -559,23 +559,27 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
 EXPORT_SYMBOL(kthread_bind);
 
 /**
- * kthread_create_on_cpu - Create a cpu bound kthread
+ * __kthread_create_on_cpu - Create a cpu bound kthread
  * @threadfn: the function to run until signal_pending(current).
  * @data: data ptr for @threadfn.
  * @cpu: The cpu on which the thread should be bound,
- * @namefmt: printf-style name for the thread. Format is restricted
- *	     to "name.*%u". Code fills in cpu number.
+ * @namefmt: printf-style name for the thread. Must have an excess '%u'
+ *	     at the end as kthread_create_on_cpu() fills in CPU number.
  *
  * Description: This helper function creates and names a kernel thread
  */
-struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
-					  void *data, unsigned int cpu,
-					  const char *namefmt)
+struct task_struct *__kthread_create_on_cpu(int (*threadfn)(void *data),
+					    void *data, unsigned int cpu,
+					    const char *namefmt, ...)
 {
 	struct task_struct *p;
+	va_list args;
+
+	va_start(args, namefmt);
+	p = __kthread_create_on_node(threadfn, data, cpu_to_node(cpu), namefmt,
+				     args);
+	va_end(args);
 
-	p = kthread_create_on_node(threadfn, data, cpu_to_node(cpu), namefmt,
-				   cpu);
 	if (IS_ERR(p))
 		return p;
 	kthread_bind(p, cpu);
@@ -583,7 +587,7 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 	to_kthread(p)->cpu = cpu;
 	return p;
 }
-EXPORT_SYMBOL(kthread_create_on_cpu);
+EXPORT_SYMBOL(__kthread_create_on_cpu);
 
 void kthread_set_per_cpu(struct task_struct *k, int cpu)
 {
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 3/9] net: napi: add ability to create CPU-pinned threaded NAPI
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 1/9] firmware/psci: fix missing '%u' format literal in kthread_create_on_cpu() Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu() Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-31  0:19   ` Daniel Xu
  2024-08-30 16:25 ` [PATCH bpf-next 4/9] bpf: cpumap: use CPU-pinned threaded NAPI w/GRO instead of kthread Alexander Lobakin
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

From: Lorenzo Bianconi <lorenzo@kernel.org>

Add netif_napi_add_percpu() to pin a NAPI in threaded mode to a
particular CPU. This means that if the NAPI is not threaded, it will
run as usual, but when switched to threaded mode, it will always run
on the specified CPU.
It's not meant to be used in drivers, but might be useful when creating
percpu threaded NAPIs, for example, to replace percpu kthreads or
workers where a NAPI context is needed.
The already existing netif_napi_add*() variants are not affected in any
way.
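
A minimal consumer sketch (the dummy netdev, percpu struct and poll
callback are placeholders; the real user is added in the next patch):

	/* illustrative: one threaded NAPI pinned to each possible CPU */
	init_dummy_netdev(dummy);
	dummy->threaded = true;

	for_each_possible_cpu(cpu) {
		struct my_pcpu *p = per_cpu_ptr(pcpu, cpu);

		netif_napi_add_percpu(dummy, &p->napi, my_poll, cpu);
		napi_enable(&p->napi);
	}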

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/linux/netdevice.h | 35 +++++++++++++++++++++++++++++++++--
 net/core/dev.c            | 18 +++++++++++++-----
 2 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5f0dda733b..4d6fb0ccdea1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -377,6 +377,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	int			irq;
+	int			thread_cpuid;
 };
 
 enum {
@@ -2619,8 +2620,18 @@ static inline void netif_napi_set_irq(struct napi_struct *napi, int irq)
  */
 #define NAPI_POLL_WEIGHT 64
 
-void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
-			   int (*poll)(struct napi_struct *, int), int weight);
+void netif_napi_add_weight_percpu(struct net_device *dev,
+				  struct napi_struct *napi,
+				  int (*poll)(struct napi_struct *, int),
+				  int weight, int thread_cpuid);
+
+static inline void netif_napi_add_weight(struct net_device *dev,
+					 struct napi_struct *napi,
+					 int (*poll)(struct napi_struct *, int),
+					 int weight)
+{
+	netif_napi_add_weight_percpu(dev, napi, poll, weight, -1);
+}
 
 /**
  * netif_napi_add() - initialize a NAPI context
@@ -2665,6 +2676,26 @@ static inline void netif_napi_add_tx(struct net_device *dev,
 	netif_napi_add_tx_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
 }
 
+/**
+ * netif_napi_add_percpu() - initialize a CPU-pinned threaded NAPI context
+ * @dev:  network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @thread_cpuid: CPU which this NAPI will be pinned to
+ *
+ * Variant of netif_napi_add() which pins the NAPI to the specified CPU. No
+ * changes in the "standard" mode, but in the threaded one, this NAPI will
+ * always run on the passed CPU no matter where it is scheduled from.
+ */
+static inline void netif_napi_add_percpu(struct net_device *dev,
+					 struct napi_struct *napi,
+					 int (*poll)(struct napi_struct *, int),
+					 int thread_cpuid)
+{
+	netif_napi_add_weight_percpu(dev, napi, poll, NAPI_POLL_WEIGHT,
+				     thread_cpuid);
+}
+
 /**
  *  __netif_napi_del - remove a NAPI context
  *  @napi: NAPI context
diff --git a/net/core/dev.c b/net/core/dev.c
index 98bb5f890b88..93ca3df8e9dd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1428,8 +1428,13 @@ static int napi_kthread_create(struct napi_struct *n)
 	 * TASK_INTERRUPTIBLE mode to avoid the blocked task
 	 * warning and work with loadavg.
 	 */
-	n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
-				n->dev->name, n->napi_id);
+	if (n->thread_cpuid >= 0)
+		n->thread = kthread_run_on_cpu(napi_threaded_poll, n,
+					       n->thread_cpuid, "napi/%s-%u",
+					       n->dev->name);
+	else
+		n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
+					n->dev->name, n->napi_id);
 	if (IS_ERR(n->thread)) {
 		err = PTR_ERR(n->thread);
 		pr_err("kthread_run failed with err %d\n", err);
@@ -6640,8 +6645,10 @@ void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
 }
 EXPORT_SYMBOL(netif_queue_set_napi);
 
-void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
-			   int (*poll)(struct napi_struct *, int), int weight)
+void netif_napi_add_weight_percpu(struct net_device *dev,
+				  struct napi_struct *napi,
+				  int (*poll)(struct napi_struct *, int),
+				  int weight, int thread_cpuid)
 {
 	if (WARN_ON(test_and_set_bit(NAPI_STATE_LISTED, &napi->state)))
 		return;
@@ -6664,6 +6671,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	napi->poll_owner = -1;
 #endif
 	napi->list_owner = -1;
+	napi->thread_cpuid = thread_cpuid;
 	set_bit(NAPI_STATE_SCHED, &napi->state);
 	set_bit(NAPI_STATE_NPSVC, &napi->state);
 	list_add_rcu(&napi->dev_list, &dev->napi_list);
@@ -6677,7 +6685,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 		dev->threaded = false;
 	netif_napi_set_irq(napi, -1);
 }
-EXPORT_SYMBOL(netif_napi_add_weight);
+EXPORT_SYMBOL(netif_napi_add_weight_percpu);
 
 void napi_disable(struct napi_struct *n)
 {
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 4/9] bpf: cpumap: use CPU-pinned threaded NAPI w/GRO instead of kthread
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
                   ` (2 preceding siblings ...)
  2024-08-30 16:25 ` [PATCH bpf-next 3/9] net: napi: add ability to create CPU-pinned threaded NAPI Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 5/9] bpf: cpumap: reuse skb array instead of a linked list to chain skbs Alexander Lobakin
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

From: Lorenzo Bianconi <lorenzo@kernel.org>

Currently, cpumap uses its own kthread which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens.
In order to enable GRO, remove all the custom kthread logic from cpumap
and use CPU-pinned threaded NAPIs with the default weight of 64. When a
cpumap is created, a new logical netdevice called "cpumap" is created to
run these NAPIs (one netdevice per map). Then, a percpu threaded NAPI
context is created for each cpumap entry, IOW for each specified CPU.
Instead of wake_up_process(), the NAPI is now scheduled and runs as
usual: with a budget of 64, napi_complete_done() is called if the budget
is not exhausted. Frames are still processed in batches of 8.
Instead of netif_receive_skb_list(), napi_gro_receive() is now used.
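
The poll callback keeps the standard NAPI contract; a condensed sketch
(the ring-draining helper is illustrative, the full version is in the
diff below):

	static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
	{
		/* drain up to @budget frames from the ptr_ring, build skbs
		 * and feed them to napi_gro_receive() (illustrative helper)
		 */
		int done = drain_ring_and_gro(napi, budget);

		/* re-arm the NAPI only once the ring has been emptied */
		if (done < budget)
			napi_complete_done(napi, done);

		return done;
	}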

Alex's tests with a UDP trafficgen and small frame sizes:

                no GRO    GRO
baseline        2.7       N/A    Mpps
threaded GRO    2.3       4      Mpps
diff            -14       +48    %

Daniel's tests with neper's TCP RR show a +14% throughput
increase[0].

Currently, GRO on cpumap is limited by the fact that the checksum status
is not known, as &xdp_frame doesn't carry such metadata. Once we have a
way to pass it from the drivers, the boost will be much bigger.

Cc: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh [0]
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 kernel/bpf/cpumap.c | 167 ++++++++++++++++++++------------------------
 1 file changed, 76 insertions(+), 91 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index fbdf5a1aabfe..d1cfa4111727 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -28,14 +28,10 @@
 
 #include <linux/sched.h>
 #include <linux/workqueue.h>
-#include <linux/kthread.h>
 #include <linux/completion.h>
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
 
-#include <linux/netdevice.h>   /* netif_receive_skb_list */
-#include <linux/etherdevice.h> /* eth_type_trans */
-
 /* General idea: XDP packets getting XDP redirected to another CPU,
  * will maximum be stored/queued for one driver ->poll() call.  It is
  * guaranteed that queueing the frame and the flush operation happen on
@@ -56,20 +52,22 @@ struct xdp_bulk_queue {
 
 /* Struct for every remote "destination" CPU in map */
 struct bpf_cpu_map_entry {
-	u32 cpu;    /* kthread CPU and map index */
+	u32 cpu;    /* NAPI thread CPU and map index */
 	int map_id; /* Back reference to map */
 
 	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
 	struct xdp_bulk_queue __percpu *bulkq;
 
-	/* Queue with potential multi-producers, and single-consumer kthread */
+	/*
+	 * Queue with potential multi-producers and single-consumer
+	 * NAPI thread
+	 */
 	struct ptr_ring *queue;
-	struct task_struct *kthread;
 
 	struct bpf_cpumap_val value;
 	struct bpf_prog *prog;
+	struct napi_struct napi;
 
-	struct completion kthread_running;
 	struct rcu_work free_work;
 };
 
@@ -77,12 +75,15 @@ struct bpf_cpu_map {
 	struct bpf_map map;
 	/* Below members specific for map type */
 	struct bpf_cpu_map_entry __rcu **cpu_map;
+	/* Dummy netdev to run threaded NAPI */
+	struct net_device *napi_dev;
 };
 
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	u32 value_size = attr->value_size;
 	struct bpf_cpu_map *cmap;
+	struct net_device *dev;
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
@@ -105,19 +106,34 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
 					   sizeof(struct bpf_cpu_map_entry *),
 					   cmap->map.numa_node);
-	if (!cmap->cpu_map) {
-		bpf_map_area_free(cmap);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!cmap->cpu_map)
+		goto free_cmap;
+
+	dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
+	if (!dev)
+		goto free_cpu_map;
+
+	init_dummy_netdev(dev);
+	strscpy(dev->name, "cpumap");
+	dev->threaded = true;
+
+	cmap->napi_dev = dev;
 
 	return &cmap->map;
+
+free_cpu_map:
+	bpf_map_area_free(cmap->cpu_map);
+free_cmap:
+	bpf_map_area_free(cmap);
+
+	return ERR_PTR(-ENOMEM);
 }
 
 static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 {
 	/* The tear-down procedure should have made sure that queue is
 	 * empty.  See __cpu_map_entry_replace() and work-queue
-	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
+	 * invoked __cpu_map_entry_free(). Catch any broken behaviour
 	 * gracefully and warn once.
 	 */
 	void *ptr;
@@ -244,7 +260,6 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (!rcpu->prog)
 		return xdp_n;
 
-	rcu_read_lock_bh();
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);
@@ -256,62 +271,45 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 		cpu_map_bpf_prog_run_skb(rcpu, list, stats);
 
 	bpf_net_ctx_clear(bpf_net_ctx);
-	rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
 
 	return nframes;
 }
 
-static int cpu_map_kthread_run(void *data)
+static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
 {
-	struct bpf_cpu_map_entry *rcpu = data;
-	unsigned long last_qs = jiffies;
+	struct xdp_cpumap_stats stats = {}; /* zero stats */
+	u32 done = 0, kmem_alloc_drops = 0;
+	struct bpf_cpu_map_entry *rcpu;
 
-	complete(&rcpu->kthread_running);
-	set_current_state(TASK_INTERRUPTIBLE);
+	rcu_read_lock();
+	rcpu = container_of(napi, typeof(*rcpu), napi);
 
-	/* When kthread gives stop order, then rcpu have been disconnected
-	 * from map, thus no new packets can enter. Remaining in-flight
-	 * per CPU stored packets are flushed to this queue.  Wait honoring
-	 * kthread_stop signal until queue is empty.
-	 */
-	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
-		struct xdp_cpumap_stats stats = {}; /* zero stats */
-		unsigned int kmem_alloc_drops = 0, sched = 0;
+	while (likely(done < budget)) {
 		gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
 		int i, n, m, nframes, xdp_n;
 		void *frames[CPUMAP_BATCH];
+		struct sk_buff *skb, *tmp;
 		void *skbs[CPUMAP_BATCH];
 		LIST_HEAD(list);
 
-		/* Release CPU reschedule checks */
-		if (__ptr_ring_empty(rcpu->queue)) {
-			set_current_state(TASK_INTERRUPTIBLE);
-			/* Recheck to avoid lost wake-up */
-			if (__ptr_ring_empty(rcpu->queue)) {
-				schedule();
-				sched = 1;
-				last_qs = jiffies;
-			} else {
-				__set_current_state(TASK_RUNNING);
-			}
-		} else {
-			rcu_softirq_qs_periodic(last_qs);
-			sched = cond_resched();
-		}
+		if (__ptr_ring_empty(rcpu->queue))
+			break;
 
 		/*
 		 * The bpf_cpu_map_entry is single consumer, with this
-		 * kthread CPU pinned. Lockless access to ptr_ring
+		 * NAPI thread CPU pinned. Lockless access to ptr_ring
 		 * consume side valid as no-resize allowed of queue.
 		 */
-		n = __ptr_ring_consume_batched(rcpu->queue, frames,
-					       CPUMAP_BATCH);
+		n = min(budget - done, CPUMAP_BATCH);
+		n = __ptr_ring_consume_batched(rcpu->queue, frames, n);
+		done += n;
+
 		for (i = 0, xdp_n = 0; i < n; i++) {
 			void *f = frames[i];
 			struct page *page;
 
 			if (unlikely(__ptr_test_bit(0, &f))) {
-				struct sk_buff *skb = f;
+				skb = f;
 
 				__ptr_clear_bit(0, &skb);
 				list_add_tail(&skb->list, &list);
@@ -340,12 +338,10 @@ static int cpu_map_kthread_run(void *data)
 			}
 		}
 
-		local_bh_disable();
 		for (i = 0; i < nframes; i++) {
 			struct xdp_frame *xdpf = frames[i];
-			struct sk_buff *skb = skbs[i];
 
-			skb = __xdp_build_skb_from_frame(xdpf, skb,
+			skb = __xdp_build_skb_from_frame(xdpf, skbs[i],
 							 xdpf->dev_rx);
 			if (!skb) {
 				xdp_return_frame(xdpf);
@@ -354,17 +350,23 @@ static int cpu_map_kthread_run(void *data)
 
 			list_add_tail(&skb->list, &list);
 		}
-		netif_receive_skb_list(&list);
-
-		/* Feedback loop via tracepoint */
-		trace_xdp_cpumap_kthread(rcpu->map_id, n, kmem_alloc_drops,
-					 sched, &stats);
 
-		local_bh_enable(); /* resched point, may call do_softirq() */
+		list_for_each_entry_safe(skb, tmp, &list, list) {
+			skb_list_del_init(skb);
+			napi_gro_receive(napi, skb);
+		}
 	}
-	__set_current_state(TASK_RUNNING);
 
-	return 0;
+	rcu_read_unlock();
+
+	/* Feedback loop via tracepoint */
+	trace_xdp_cpumap_kthread(rcpu->map_id, done, kmem_alloc_drops, 0,
+				 &stats);
+
+	if (done < budget)
+		napi_complete_done(napi, done);
+
+	return done;
 }
 
 static int __cpu_map_load_bpf_program(struct bpf_cpu_map_entry *rcpu,
@@ -394,6 +396,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 {
 	int numa, err, i, fd = value->bpf_prog.fd;
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
+	const struct bpf_cpu_map *cmap;
 	struct bpf_cpu_map_entry *rcpu;
 	struct xdp_bulk_queue *bq;
 
@@ -432,29 +435,13 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 	if (fd > 0 && __cpu_map_load_bpf_program(rcpu, map, fd))
 		goto free_ptr_ring;
 
-	/* Setup kthread */
-	init_completion(&rcpu->kthread_running);
-	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
-					       "cpumap/%d/map:%d", cpu,
-					       map->id);
-	if (IS_ERR(rcpu->kthread))
-		goto free_prog;
-
-	/* Make sure kthread runs on a single CPU */
-	kthread_bind(rcpu->kthread, cpu);
-	wake_up_process(rcpu->kthread);
-
-	/* Make sure kthread has been running, so kthread_stop() will not
-	 * stop the kthread prematurely and all pending frames or skbs
-	 * will be handled by the kthread before kthread_stop() returns.
-	 */
-	wait_for_completion(&rcpu->kthread_running);
+	cmap = container_of(map, typeof(*cmap), map);
+	netif_napi_add_percpu(cmap->napi_dev, &rcpu->napi, cpu_map_napi_poll,
+			      cpu);
+	napi_enable(&rcpu->napi);
 
 	return rcpu;
 
-free_prog:
-	if (rcpu->prog)
-		bpf_prog_put(rcpu->prog);
 free_ptr_ring:
 	ptr_ring_cleanup(rcpu->queue, NULL);
 free_queue:
@@ -477,11 +464,12 @@ static void __cpu_map_entry_free(struct work_struct *work)
 	 */
 	rcpu = container_of(to_rcu_work(work), struct bpf_cpu_map_entry, free_work);
 
-	/* kthread_stop will wake_up_process and wait for it to complete.
-	 * cpu_map_kthread_run() makes sure the pointer ring is empty
+	/* napi_disable() will wait for the NAPI poll to complete.
+	 * cpu_map_napi_poll() makes sure the pointer ring is empty
 	 * before exiting.
 	 */
-	kthread_stop(rcpu->kthread);
+	napi_disable(&rcpu->napi);
+	netif_napi_del(&rcpu->napi);
 
 	if (rcpu->prog)
 		bpf_prog_put(rcpu->prog);
@@ -498,8 +486,8 @@ static void __cpu_map_entry_free(struct work_struct *work)
  * __cpu_map_entry_free() in a separate workqueue after waiting for an RCU grace
  * period. This means that (a) all pending enqueue and flush operations have
  * completed (because of the RCU callback), and (b) we are in a workqueue
- * context where we can stop the kthread and wait for it to exit before freeing
- * everything.
+ * context where we can stop the NAPI thread and wait for it to exit before
+ * freeing everything.
  */
 static void __cpu_map_entry_replace(struct bpf_cpu_map *cmap,
 				    u32 key_cpu, struct bpf_cpu_map_entry *rcpu)
@@ -579,9 +567,7 @@ static void cpu_map_free(struct bpf_map *map)
 	 */
 	synchronize_rcu();
 
-	/* The only possible user of bpf_cpu_map_entry is
-	 * cpu_map_kthread_run().
-	 */
+	/* The only possible user of bpf_cpu_map_entry is cpu_map_napi_poll() */
 	for (i = 0; i < cmap->map.max_entries; i++) {
 		struct bpf_cpu_map_entry *rcpu;
 
@@ -589,9 +575,10 @@ static void cpu_map_free(struct bpf_map *map)
 		if (!rcpu)
 			continue;
 
-		/* Stop kthread and cleanup entry directly */
+		/* Stop NAPI thread and cleanup entry directly */
 		__cpu_map_entry_free(&rcpu->free_work.work);
 	}
+	bpf_map_area_free(cmap->napi_dev);
 	bpf_map_area_free(cmap->cpu_map);
 	bpf_map_area_free(cmap);
 }
@@ -753,7 +740,7 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 	if (ret < 0)
 		goto trace;
 
-	wake_up_process(rcpu->kthread);
+	napi_schedule(&rcpu->napi);
 trace:
 	trace_xdp_cpumap_enqueue(rcpu->map_id, !ret, !!ret, rcpu->cpu);
 	return ret;
@@ -765,8 +752,6 @@ void __cpu_map_flush(struct list_head *flush_list)
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_flush_to_queue(bq);
-
-		/* If already running, costs spin_lock_irqsave + smb_mb */
-		wake_up_process(bq->obj->kthread);
+		napi_schedule(&bq->obj->napi);
 	}
 }
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 5/9] bpf: cpumap: reuse skb array instead of a linked list to chain skbs
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
                   ` (3 preceding siblings ...)
  2024-08-30 16:25 ` [PATCH bpf-next 4/9] bpf: cpumap: use CPU-pinned threaded NAPI w/GRO instead of kthread Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 6/9] net: skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

cpumap still uses linked lists to store a list of skbs to pass to the
stack. Now that listified Rx has been replaced with napi_gro_receive(),
the linked list is unneeded overhead.
Inside the polling loop, we already have an array of skbs. Let's reuse
it for skbs passed to cpumap (generic XDP) and use napi_gro_receive()
directly in case of XDP_PASS when a program is installed to the map
itself. Don't list regular xdp_frames at all; just call
napi_gro_receive() directly right after building an skb.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 kernel/bpf/cpumap.c | 55 +++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 29 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index d1cfa4111727..d7206f3f6e80 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -150,21 +150,23 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 }
 
 static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
-				     struct list_head *listp,
+				     void **skbs, u32 skb_n,
 				     struct xdp_cpumap_stats *stats)
 {
-	struct sk_buff *skb, *tmp;
 	struct xdp_buff xdp;
 	u32 act;
 	int err;
 
-	list_for_each_entry_safe(skb, tmp, listp, list) {
+	for (u32 i = 0; i < skb_n; i++) {
+		struct sk_buff *skb = skbs[i];
+
 		act = bpf_prog_run_generic_xdp(skb, &xdp, rcpu->prog);
 		switch (act) {
 		case XDP_PASS:
+			napi_gro_receive(&rcpu->napi, skb);
+			stats->pass++;
 			break;
 		case XDP_REDIRECT:
-			skb_list_del_init(skb);
 			err = xdp_do_generic_redirect(skb->dev, skb, &xdp,
 						      rcpu->prog);
 			if (unlikely(err)) {
@@ -181,8 +183,7 @@ static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
 			trace_xdp_exception(skb->dev, rcpu->prog, act);
 			fallthrough;
 		case XDP_DROP:
-			skb_list_del_init(skb);
-			kfree_skb(skb);
+			napi_consume_skb(skb, true);
 			stats->drop++;
 			return;
 		}
@@ -251,8 +252,8 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 #define CPUMAP_BATCH 8
 
 static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
-				int xdp_n, struct xdp_cpumap_stats *stats,
-				struct list_head *list)
+				int xdp_n, void **skbs, u32 skb_n,
+				struct xdp_cpumap_stats *stats)
 {
 	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int nframes;
@@ -267,8 +268,8 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (stats->redirect)
 		xdp_do_flush();
 
-	if (unlikely(!list_empty(list)))
-		cpu_map_bpf_prog_run_skb(rcpu, list, stats);
+	if (unlikely(skb_n))
+		cpu_map_bpf_prog_run_skb(rcpu, skbs, skb_n, stats);
 
 	bpf_net_ctx_clear(bpf_net_ctx);
 
@@ -288,9 +289,7 @@ static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
 		gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
 		int i, n, m, nframes, xdp_n;
 		void *frames[CPUMAP_BATCH];
-		struct sk_buff *skb, *tmp;
 		void *skbs[CPUMAP_BATCH];
-		LIST_HEAD(list);
 
 		if (__ptr_ring_empty(rcpu->queue))
 			break;
@@ -304,15 +303,15 @@ static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
 		n = __ptr_ring_consume_batched(rcpu->queue, frames, n);
 		done += n;
 
-		for (i = 0, xdp_n = 0; i < n; i++) {
+		for (i = 0, xdp_n = 0, m = 0; i < n; i++) {
 			void *f = frames[i];
 			struct page *page;
 
 			if (unlikely(__ptr_test_bit(0, &f))) {
-				skb = f;
+				struct sk_buff *skb = f;
 
 				__ptr_clear_bit(0, &skb);
-				list_add_tail(&skb->list, &list);
+				skbs[m++] = skb;
 				continue;
 			}
 
@@ -327,19 +326,22 @@ static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
 		}
 
 		/* Support running another XDP prog on this CPU */
-		nframes = cpu_map_bpf_prog_run(rcpu, frames, xdp_n, &stats, &list);
-		if (nframes) {
-			m = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
-						  gfp, nframes, skbs);
-			if (unlikely(m == 0)) {
-				for (i = 0; i < nframes; i++)
-					skbs[i] = NULL; /* effect: xdp_return_frame */
-				kmem_alloc_drops += nframes;
-			}
+		nframes = cpu_map_bpf_prog_run(rcpu, frames, xdp_n, skbs, m,
+					       &stats);
+		if (!nframes)
+			continue;
+
+		m = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp,
+					  nframes, skbs);
+		if (unlikely(!m)) {
+			for (i = 0; i < nframes; i++)
+				skbs[i] = NULL; /* effect: xdp_return_frame */
+			kmem_alloc_drops += nframes;
 		}
 
 		for (i = 0; i < nframes; i++) {
 			struct xdp_frame *xdpf = frames[i];
+			struct sk_buff *skb;
 
 			skb = __xdp_build_skb_from_frame(xdpf, skbs[i],
 							 xdpf->dev_rx);
@@ -348,11 +350,6 @@ static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
 				continue;
 			}
 
-			list_add_tail(&skb->list, &list);
-		}
-
-		list_for_each_entry_safe(skb, tmp, &list, list) {
-			skb_list_del_init(skb);
 			napi_gro_receive(napi, skb);
 		}
 	}
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 6/9] net: skbuff: introduce napi_skb_cache_get_bulk()
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
                   ` (4 preceding siblings ...)
  2024-08-30 16:25 ` [PATCH bpf-next 5/9] bpf: cpumap: reuse skb array instead of a linked list to chain skbs Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 7/9] bpf: cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

Add a function to get an array of skbs from the NAPI percpu cache.
It's supposed to be a drop-in replacement for
kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and
xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the
requirement to call it only from BH context) is that it tries to use
as many NAPI cache entries for skbs as possible, and allocates new
ones only when needed.

The logic is as follows:

* there are enough skbs in the cache: decache them and return them to
  the caller;
* not enough: try refilling the cache first. If there are now enough
  skbs, return;
* still not enough: try allocating skbs directly to the output array
  with %GFP_ZERO, maybe we'll be able to get some. If there are now
  enough, return;
* still not enough: return as many as we were able to obtain.

Most of the time, when called from the NAPI polling loop, the first case
will hold; sometimes (rarely) the second one. The third and the fourth
happen only under heavy memory pressure.
It can save a significant amount of CPU cycles if there are GRO cycles
and/or Tx completion cycles (anything that descends to
napi_skb_cache_put()) happening on this CPU.
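
A condensed caller sketch (the batch size and the consumer are
illustrative; the real users are added in the following patches):

	void *skbs[8];
	u32 n, i;

	/* BH context only, e.g. inside a NAPI poll callback */
	n = napi_skb_cache_get_bulk(skbs, ARRAY_SIZE(skbs));

	for (i = 0; i < n; i++) {
		/* each head is zeroed and ready for build_skb_around() */
		use_skb_head(skbs[i]);	/* hypothetical consumer */
	}

	/* if n < ARRAY_SIZE(skbs), the remainder couldn't be allocated */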

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 62 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cf8f6ce06742..2bc3ca79bc6e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1304,6 +1304,7 @@ struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
 void skb_attempt_defer_free(struct sk_buff *skb);
 
+u32 napi_skb_cache_get_bulk(void **skbs, u32 n);
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 struct sk_buff *slab_build_skb(void *data);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a52638363ea5..0a34f3aa00d1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -366,6 +366,68 @@ static struct sk_buff *napi_skb_cache_get(void)
 	return skb;
 }
 
+/**
+ * napi_skb_cache_get_bulk - obtain a number of zeroed skb heads from the cache
+ * @skbs: pointer to an at least @n-sized array to fill with skb pointers
+ * @n: number of entries to provide
+ *
+ * Tries to obtain @n &sk_buff entries from the NAPI percpu cache and writes
+ * the pointers into the provided array @skbs. If there are less entries
+ * available, tries to replenish the cache and bulk-allocates the diff from
+ * the MM layer if needed.
+ * The heads are being zeroed with either memset() or %__GFP_ZERO, so they are
+ * ready for {,__}build_skb_around() and don't have any data buffers attached.
+ * Must be called *only* from the BH context.
+ *
+ * Return: number of successfully allocated skbs (@n if no actual allocation
+ *	   needed or kmem_cache_alloc_bulk() didn't fail).
+ */
+u32 napi_skb_cache_get_bulk(void **skbs, u32 n)
+{
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	u32 bulk, total = n;
+
+	local_lock_nested_bh(&napi_alloc_cache.bh_lock);
+
+	if (nc->skb_count >= n)
+		goto get;
+
+	/* Not enough cached skbs. Try refilling the cache first */
+	bulk = min(NAPI_SKB_CACHE_SIZE - nc->skb_count, NAPI_SKB_CACHE_BULK);
+	nc->skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+					       GFP_ATOMIC | __GFP_NOWARN, bulk,
+					       &nc->skb_cache[nc->skb_count]);
+	if (likely(nc->skb_count >= n))
+		goto get;
+
+	/* Still not enough. Bulk-allocate the missing part directly, zeroed */
+	n -= kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+				   GFP_ATOMIC | __GFP_ZERO | __GFP_NOWARN,
+				   n - nc->skb_count, &skbs[nc->skb_count]);
+	if (likely(nc->skb_count >= n))
+		goto get;
+
+	/* kmem_cache didn't allocate the number we need, limit the output */
+	total -= n - nc->skb_count;
+	n = nc->skb_count;
+
+get:
+	for (u32 base = nc->skb_count - n, i = 0; i < n; i++) {
+		u32 cache_size = kmem_cache_size(net_hotdata.skbuff_cache);
+
+		skbs[i] = nc->skb_cache[base + i];
+
+		kasan_mempool_unpoison_object(skbs[i], cache_size);
+		memset(skbs[i], 0, offsetof(struct sk_buff, tail));
+	}
+
+	nc->skb_count -= n;
+	local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
+
+	return total;
+}
+EXPORT_SYMBOL_GPL(napi_skb_cache_get_bulk);
+
 static inline void __finalize_skb_around(struct sk_buff *skb, void *data,
 					 unsigned int size)
 {
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 7/9] bpf: cpumap: switch to napi_skb_cache_get_bulk()
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
                   ` (5 preceding siblings ...)
  2024-08-30 16:25 ` [PATCH bpf-next 6/9] net: skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 8/9] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk() Alexander Lobakin
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

Now that cpumap uses GRO, which drops unused skb heads to the NAPI
cache, use napi_skb_cache_get_bulk() to try to reuse cached entries
and lower the MM layer pressure. The polling loop already happens in
the BH context, so the switch is safe from that perspective.
The better GRO aggregates packets, the fewer new skbs will be allocated.
If an aggregated skb contains 16 frags, this means 15 skbs were returned
to the cache, so the next 15 skbs will be built without allocating
anything.

The same trafficgen UDP GRO test now shows:

                GRO off   GRO on
threaded GRO    2.3       4         Mpps
thr bulk GRO    2.4       4.7       Mpps
diff            +4        +17       %

Comparing to the baseline cpumap:

baseline        2.7       N/A       Mpps
thr bulk GRO    2.4       4.7       Mpps
diff            -11       +74       %

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 kernel/bpf/cpumap.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index d7206f3f6e80..992f4e30a589 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -286,7 +286,6 @@ static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
 	rcpu = container_of(napi, typeof(*rcpu), napi);
 
 	while (likely(done < budget)) {
-		gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
 		int i, n, m, nframes, xdp_n;
 		void *frames[CPUMAP_BATCH];
 		void *skbs[CPUMAP_BATCH];
@@ -331,8 +330,7 @@ static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
 		if (!nframes)
 			continue;
 
-		m = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp,
-					  nframes, skbs);
+		m = napi_skb_cache_get_bulk(skbs, nframes);
 		if (unlikely(!m)) {
 			for (i = 0; i < nframes; i++)
 				skbs[i] = NULL; /* effect: xdp_return_frame */
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 8/9] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
                   ` (6 preceding siblings ...)
  2024-08-30 16:25 ` [PATCH bpf-next 7/9] bpf: cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-08-30 16:25 ` [PATCH bpf-next 9/9] xdp: remove xdp_alloc_skb_bulk() Alexander Lobakin
  2024-09-03 20:51 ` [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Jakub Kicinski
  9 siblings, 0 replies; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

Now that we can bulk-allocate skbs from the NAPI cache, use that
function in veth as well instead of allocating directly from the kmem
caches. veth already uses NAPI for Rx processing, so this is safe from
the context perspective. veth also uses GRO, so using the NAPI caches
makes a real difference here.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/veth.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 18148e068aa0..774f226666c8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -684,8 +684,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
 	void *skbs[VETH_XDP_BATCH];
 	int i;
 
-	if (xdp_alloc_skb_bulk(skbs, n_xdpf,
-			       GFP_ATOMIC | __GFP_ZERO) < 0) {
+	if (unlikely(!napi_skb_cache_get_bulk(skbs, n_xdpf))) {
 		for (i = 0; i < n_xdpf; i++)
 			xdp_return_frame(frames[i]);
 		stats->rx_drops += n_xdpf;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 9/9] xdp: remove xdp_alloc_skb_bulk()
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
                   ` (7 preceding siblings ...)
  2024-08-30 16:25 ` [PATCH bpf-next 8/9] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk() Alexander Lobakin
@ 2024-08-30 16:25 ` Alexander Lobakin
  2024-09-03 20:51 ` [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Jakub Kicinski
  9 siblings, 0 replies; 26+ messages in thread
From: Alexander Lobakin @ 2024-08-30 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

The only user was veth, which now uses napi_skb_cache_get_bulk().
The latter is preferred over a direct allocation and is exported as
well, so remove this helper.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/xdp.h |  1 -
 net/core/xdp.c    | 10 ----------
 2 files changed, 11 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index e6770dd40c91..bd3363e384b2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -245,7 +245,6 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					   struct net_device *dev);
 struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					 struct net_device *dev);
-int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp);
 struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 static inline
diff --git a/net/core/xdp.c b/net/core/xdp.c
index bcc5551c6424..34d057089d20 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -584,16 +584,6 @@ void xdp_warn(const char *msg, const char *func, const int line)
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
 
-int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp)
-{
-	n_skb = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp, n_skb, skbs);
-	if (unlikely(!n_skb))
-		return -ENOMEM;
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xdp_alloc_skb_bulk);
-
 struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					   struct sk_buff *skb,
 					   struct net_device *dev)
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu()
  2024-08-30 16:25 ` [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu() Alexander Lobakin
@ 2024-08-30 22:56   ` Stanislav Fomichev
  2024-09-03 12:25     ` Alexander Lobakin
  0 siblings, 1 reply; 26+ messages in thread
From: Stanislav Fomichev @ 2024-08-30 22:56 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

On 08/30, Alexander Lobakin wrote:
> Currently, kthread_{create,run}_on_cpu() doesn't support varargs like
> kthread_create{,_on_node}() do, which makes them less convenient to
> use.
> Convert them to take varargs as the last argument. The only difference
> is that they always append the CPU ID at the end and require the format
> string to have an excess '%u' at the end due to that. That's still true;
> meanwhile, the compiler will correctly point out to that if missing.
> One more nice side effect is that you can now use the underscored
> __kthread_create_on_cpu() if you want to override that rule and not
> have CPU ID at the end of the name.
> The current callers are not anyhow affected.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  include/linux/kthread.h | 51 ++++++++++++++++++++++++++---------------
>  kernel/kthread.c        | 22 ++++++++++--------
>  2 files changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index b11f53c1ba2e..27a94e691948 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -27,11 +27,21 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  #define kthread_create(threadfn, data, namefmt, arg...) \
>  	kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
>  
> -
> -struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> -					  void *data,
> -					  unsigned int cpu,
> -					  const char *namefmt);
> +__printf(4, 5)
> +struct task_struct *__kthread_create_on_cpu(int (*threadfn)(void *data),
> +					    void *data, unsigned int cpu,
> +					    const char *namefmt, ...);
> +
> +#define kthread_create_on_cpu(threadfn, data, cpu, namefmt, ...)	   \
> +	_kthread_create_on_cpu(threadfn, data, cpu, __UNIQUE_ID(cpu_),	   \
> +			       namefmt, ##__VA_ARGS__)
> +
> +#define _kthread_create_on_cpu(threadfn, data, cpu, uc, namefmt, ...) ({   \
> +	u32 uc = (cpu);							   \
> +									   \
> +	__kthread_create_on_cpu(threadfn, data, uc, namefmt,		   \
> +				##__VA_ARGS__, uc);			   \
> +})
>  
>  void get_kthread_comm(char *buf, size_t buf_size, struct task_struct *tsk);
>  bool set_kthread_struct(struct task_struct *p);
> @@ -62,25 +72,28 @@ bool kthread_is_per_cpu(struct task_struct *k);
>   * @threadfn: the function to run until signal_pending(current).
>   * @data: data ptr for @threadfn.
>   * @cpu: The cpu on which the thread should be bound,
> - * @namefmt: printf-style name for the thread. Format is restricted
> - *	     to "name.*%u". Code fills in cpu number.
> + * @namefmt: printf-style name for the thread. Must have an excess '%u'
> + *	     at the end as kthread_create_on_cpu() fills in CPU number.
>   *
>   * Description: Convenient wrapper for kthread_create_on_cpu()
>   * followed by wake_up_process().  Returns the kthread or
>   * ERR_PTR(-ENOMEM).
>   */
> -static inline struct task_struct *
> -kthread_run_on_cpu(int (*threadfn)(void *data), void *data,
> -			unsigned int cpu, const char *namefmt)
> -{
> -	struct task_struct *p;
> -
> -	p = kthread_create_on_cpu(threadfn, data, cpu, namefmt);
> -	if (!IS_ERR(p))
> -		wake_up_process(p);
> -
> -	return p;
> -}
> +#define kthread_run_on_cpu(threadfn, data, cpu, namefmt, ...)		   \
> +	_kthread_run_on_cpu(threadfn, data, cpu, __UNIQUE_ID(task_),	   \
> +			    namefmt, ##__VA_ARGS__)
> +
> +#define _kthread_run_on_cpu(threadfn, data, cpu, ut, namefmt, ...)	   \
> +({									   \
> +	struct task_struct *ut;						   \
> +									   \
> +	ut = kthread_create_on_cpu(threadfn, data, cpu, namefmt,	   \
> +				   ##__VA_ARGS__);			   \
> +	if (!IS_ERR(ut))						   \
> +		wake_up_process(ut);					   \
> +									   \
> +	ut;								   \
> +})

Why do you need to use __UNIQUE_ID here? Presumably ({}) in _kthread_run_on_cpu
should be enough to avoid the issue of a non-unique variable in the parent
scope. (and similarly, kthread_run isn't using any __UNIQUE_IDs)

The rest of the patches look good.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 1/9] firmware/psci: fix missing '%u' format literal in kthread_create_on_cpu()
  2024-08-30 16:25 ` [PATCH bpf-next 1/9] firmware/psci: fix missing '%u' format literal in kthread_create_on_cpu() Alexander Lobakin
@ 2024-08-30 23:31   ` Daniel Xu
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel Xu @ 2024-08-30 23:31 UTC (permalink / raw)
  To: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Lorenzo Bianconi, John Fastabend, Jesper Dangaard Brouer,
	Martin KaFai Lau, David Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, bpf@vger.kernel.org, netdev, linux-kernel

On Fri, Aug 30, 2024, at 9:25 AM, Alexander Lobakin wrote:
> kthread_create_on_cpu() always requires format string to contain one
> '%u' at the end, as it automatically adds the CPU ID when passing it
> to kthread_create_on_node(). The former isn't marked as __printf()
> as it's not printf-like itself, which effectively hides this from
> the compiler.
> If you convert this function to printf-like, you'll see the following:
>
> In file included from drivers/firmware/psci/psci_checker.c:15:
> drivers/firmware/psci/psci_checker.c: In function 'suspend_tests':
> drivers/firmware/psci/psci_checker.c:401:48: warning: too many arguments for format [-Wformat-extra-args]
>      401 |                                                "psci_suspend_test");
>          |                                                ^~~~~~~~~~~~~~~~~~~
> drivers/firmware/psci/psci_checker.c:400:32: warning: data argument not used by format string [-Wformat-extra-args]
>      400 |                                                (void *)(long)cpu, cpu,
>          |                                                                   ^
>      401 |                                                "psci_suspend_test");
>          |                                                ~~~~~~~~~~~~~~~~~~~
>
> Add the missing format literal to fix this. Now the corresponding
> kthread will be named "psci_suspend_test-<cpuid>", as intended by
> kthread_create_on_cpu().
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202408141012.KhvKaxoh-lkp@intel.com
> Closes: https://lore.kernel.org/oe-kbuild-all/202408141243.eQiEOQQe-lkp@intel.com
> Fixes: ea8b1c4a6019 ("drivers: psci: PSCI checker module")
> Cc: stable@vger.kernel.org # 4.10+
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/firmware/psci/psci_checker.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/psci/psci_checker.c b/drivers/firmware/psci/psci_checker.c
> index 116eb465cdb4..ecc511c745ce 100644
> --- a/drivers/firmware/psci/psci_checker.c
> +++ b/drivers/firmware/psci/psci_checker.c
> @@ -398,7 +398,7 @@ static int suspend_tests(void)
> 
>  		thread = kthread_create_on_cpu(suspend_test_thread,
>  					       (void *)(long)cpu, cpu,
> -					       "psci_suspend_test");
> +					       "psci_suspend_test-%u");
>  		if (IS_ERR(thread))
>  			pr_err("Failed to create kthread on CPU %d\n", cpu);
>  		else
> -- 
> 2.46.0

Acked-by: Daniel Xu <dxu@dxuuu.xyz>
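
As a side note, the reason the trailing '%u' is mandatory is visible in
kernel/kthread.c; a simplified sketch (not the full function) of what
kthread_create_on_cpu() does with the format:

	/*
	 * The CPU ID is unconditionally passed as an extra format argument,
	 * so a namefmt without a trailing '%u' silently drops it and every
	 * per-CPU thread ends up with the same comm.
	 */
	p = kthread_create_on_node(threadfn, data, cpu_to_node(cpu),
				   namefmt, cpu);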

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 3/9] net: napi: add ability to create CPU-pinned threaded NAPI
  2024-08-30 16:25 ` [PATCH bpf-next 3/9] net: napi: add ability to create CPU-pinned threaded NAPI Alexander Lobakin
@ 2024-08-31  0:19   ` Daniel Xu
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel Xu @ 2024-08-31  0:19 UTC (permalink / raw)
  To: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Lorenzo Bianconi, John Fastabend, Jesper Dangaard Brouer,
	Martin KaFai Lau, David Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, bpf@vger.kernel.org, netdev, linux-kernel



On Fri, Aug 30, 2024, at 9:25 AM, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo@kernel.org>
>
> Add netif_napi_add_percpu() to pin NAPI in threaded mode to a particular
> CPU. This means that if the NAPI is not threaded, it will run as usual,
> but when switching to threaded mode, it will always run on the
> specified CPU.
> It's not meant to be used in drivers, but might be useful when creating
> percpu threaded NAPIs, for example, to replace percpu kthreads or
> workers where a NAPI context is needed.
> The already existing netif_napi_add*() helpers are not affected in any way.
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  include/linux/netdevice.h | 35 +++++++++++++++++++++++++++++++++--
>  net/core/dev.c            | 18 +++++++++++++-----
>  2 files changed, 46 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ca5f0dda733b..4d6fb0ccdea1 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -377,6 +377,7 @@ struct napi_struct {
>  	struct list_head	dev_list;
>  	struct hlist_node	napi_hash_node;
>  	int			irq;
> +	int			thread_cpuid;
>  };
> 
>  enum {
> @@ -2619,8 +2620,18 @@ static inline void netif_napi_set_irq(struct napi_struct *napi, int irq)
>   */
>  #define NAPI_POLL_WEIGHT 64
> 
> -void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
> -			   int (*poll)(struct napi_struct *, int), int weight);
> +void netif_napi_add_weight_percpu(struct net_device *dev,
> +				  struct napi_struct *napi,
> +				  int (*poll)(struct napi_struct *, int),
> +				  int weight, int thread_cpuid);
> +
> +static inline void netif_napi_add_weight(struct net_device *dev,
> +					 struct napi_struct *napi,
> +					 int (*poll)(struct napi_struct *, int),
> +					 int weight)
> +{
> +	netif_napi_add_weight_percpu(dev, napi, poll, weight, -1);
> +}
> 
>  /**
>   * netif_napi_add() - initialize a NAPI context
> @@ -2665,6 +2676,26 @@ static inline void netif_napi_add_tx(struct net_device *dev,
>  	netif_napi_add_tx_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
>  }
> 
> +/**
> + * netif_napi_add_percpu() - initialize a CPU-pinned threaded NAPI context
> + * @dev:  network device
> + * @napi: NAPI context
> + * @poll: polling function
> + * @thread_cpuid: CPU which this NAPI will be pinned to
> + *
> + * Variant of netif_napi_add() which pins the NAPI to the specified CPU. No
> + * changes in the "standard" mode, but in case with the threaded one, this
> + * NAPI will always be run on the passed CPU no matter where scheduled.
> + */
> +static inline void netif_napi_add_percpu(struct net_device *dev,
> +					 struct napi_struct *napi,
> +					 int (*poll)(struct napi_struct *, int),
> +					 int thread_cpuid)
> +{
> +	netif_napi_add_weight_percpu(dev, napi, poll, NAPI_POLL_WEIGHT,
> +				     thread_cpuid);
> +}
> +
>  /**
>   *  __netif_napi_del - remove a NAPI context
>   *  @napi: NAPI context
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 98bb5f890b88..93ca3df8e9dd 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1428,8 +1428,13 @@ static int napi_kthread_create(struct napi_struct *n)
>  	 * TASK_INTERRUPTIBLE mode to avoid the blocked task
>  	 * warning and work with loadavg.
>  	 */
> -	n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
> -				n->dev->name, n->napi_id);
> +	if (n->thread_cpuid >= 0)
> +		n->thread = kthread_run_on_cpu(napi_threaded_poll, n,
> +					       n->thread_cpuid, "napi/%s-%u",
> +					       n->dev->name);
> +	else
> +		n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
> +					n->dev->name, n->napi_id);
>  	if (IS_ERR(n->thread)) {
>  		err = PTR_ERR(n->thread);
>  		pr_err("kthread_run failed with err %d\n", err);
> @@ -6640,8 +6645,10 @@ void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
>  }
>  EXPORT_SYMBOL(netif_queue_set_napi);
> 
> -void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
> -			   int (*poll)(struct napi_struct *, int), int weight)
> +void netif_napi_add_weight_percpu(struct net_device *dev,
> +				  struct napi_struct *napi,
> +				  int (*poll)(struct napi_struct *, int),
> +				  int weight, int thread_cpuid)
>  {
>  	if (WARN_ON(test_and_set_bit(NAPI_STATE_LISTED, &napi->state)))
>  		return;
> @@ -6664,6 +6671,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
>  	napi->poll_owner = -1;
>  #endif
>  	napi->list_owner = -1;
> +	napi->thread_cpuid = thread_cpuid;
>  	set_bit(NAPI_STATE_SCHED, &napi->state);
>  	set_bit(NAPI_STATE_NPSVC, &napi->state);
>  	list_add_rcu(&napi->dev_list, &dev->napi_list);
> @@ -6677,7 +6685,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
>  		dev->threaded = false;
>  	netif_napi_set_irq(napi, -1);
>  }
> -EXPORT_SYMBOL(netif_napi_add_weight);
> +EXPORT_SYMBOL(netif_napi_add_weight_percpu);
> 
>  void napi_disable(struct napi_struct *n)
>  {
> -- 
> 2.46.0

Acked-by: Daniel Xu <dxu@dxuuu.xyz>
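
For illustration, a rough sketch of how a consumer might use the new
helper to get a CPU-pinned threaded NAPI (the helper name comes from this
patch; the device, poll callback and enable sequence below are
hypothetical and simplified):

	/* Hypothetical per-CPU consumer, not taken from this series. */
	static int my_percpu_poll(struct napi_struct *napi, int budget)
	{
		/* process up to @budget frames, feed them to GRO, ... */
		return 0;
	}

	/* Pin the NAPI to @cpu; non-threaded operation is unchanged. */
	netif_napi_add_percpu(dev, napi, my_percpu_poll, cpu);
	/* In threaded mode the NAPI kthread will then only run on @cpu. */
	dev_set_threaded(dev, true);
	napi_enable(napi);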

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu()
  2024-08-30 22:56   ` Stanislav Fomichev
@ 2024-09-03 12:25     ` Alexander Lobakin
  2024-09-03 17:04       ` Stanislav Fomichev
  0 siblings, 1 reply; 26+ messages in thread
From: Alexander Lobakin @ 2024-09-03 12:25 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

From: Stanislav Fomichev <sdf@fomichev.me>
Date: Fri, 30 Aug 2024 15:56:12 -0700

> On 08/30, Alexander Lobakin wrote:
>> Currently, kthread_{create,run}_on_cpu() don't support varargs like
>> kthread_create{,_on_node}() do, which makes them less convenient to
>> use.
>> Convert them to take varargs as the last argument. The only difference
>> is that they always append the CPU ID at the end and require the format
>> string to have an excess '%u' at the end due to that. That's still true;
>> however, the compiler will now correctly point that out if it's missing.
>> One more nice side effect is that you can now use the underscored
>> __kthread_create_on_cpu() if you want to override that rule and not
>> have the CPU ID at the end of the name.
>> The current callers are not affected in any way.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  include/linux/kthread.h | 51 ++++++++++++++++++++++++++---------------
>>  kernel/kthread.c        | 22 ++++++++++--------
>>  2 files changed, 45 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
>> index b11f53c1ba2e..27a94e691948 100644
>> --- a/include/linux/kthread.h
>> +++ b/include/linux/kthread.h
>> @@ -27,11 +27,21 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>>  #define kthread_create(threadfn, data, namefmt, arg...) \
>>  	kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
>>  
>> -
>> -struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
>> -					  void *data,
>> -					  unsigned int cpu,
>> -					  const char *namefmt);
>> +__printf(4, 5)
>> +struct task_struct *__kthread_create_on_cpu(int (*threadfn)(void *data),
>> +					    void *data, unsigned int cpu,
>> +					    const char *namefmt, ...);
>> +
>> +#define kthread_create_on_cpu(threadfn, data, cpu, namefmt, ...)	   \
>> +	_kthread_create_on_cpu(threadfn, data, cpu, __UNIQUE_ID(cpu_),	   \
>> +			       namefmt, ##__VA_ARGS__)
>> +
>> +#define _kthread_create_on_cpu(threadfn, data, cpu, uc, namefmt, ...) ({   \
>> +	u32 uc = (cpu);							   \
>> +									   \
>> +	__kthread_create_on_cpu(threadfn, data, uc, namefmt,		   \
>> +				##__VA_ARGS__, uc);			   \
>> +})
>>  
>>  void get_kthread_comm(char *buf, size_t buf_size, struct task_struct *tsk);
>>  bool set_kthread_struct(struct task_struct *p);
>> @@ -62,25 +72,28 @@ bool kthread_is_per_cpu(struct task_struct *k);
>>   * @threadfn: the function to run until signal_pending(current).
>>   * @data: data ptr for @threadfn.
>>   * @cpu: The cpu on which the thread should be bound,
>> - * @namefmt: printf-style name for the thread. Format is restricted
>> - *	     to "name.*%u". Code fills in cpu number.
>> + * @namefmt: printf-style name for the thread. Must have an excess '%u'
>> + *	     at the end as kthread_create_on_cpu() fills in CPU number.
>>   *
>>   * Description: Convenient wrapper for kthread_create_on_cpu()
>>   * followed by wake_up_process().  Returns the kthread or
>>   * ERR_PTR(-ENOMEM).
>>   */
>> -static inline struct task_struct *
>> -kthread_run_on_cpu(int (*threadfn)(void *data), void *data,
>> -			unsigned int cpu, const char *namefmt)
>> -{
>> -	struct task_struct *p;
>> -
>> -	p = kthread_create_on_cpu(threadfn, data, cpu, namefmt);
>> -	if (!IS_ERR(p))
>> -		wake_up_process(p);
>> -
>> -	return p;
>> -}
>> +#define kthread_run_on_cpu(threadfn, data, cpu, namefmt, ...)		   \
>> +	_kthread_run_on_cpu(threadfn, data, cpu, __UNIQUE_ID(task_),	   \
>> +			    namefmt, ##__VA_ARGS__)
>> +
>> +#define _kthread_run_on_cpu(threadfn, data, cpu, ut, namefmt, ...)	   \
>> +({									   \
>> +	struct task_struct *ut;						   \
>> +									   \
>> +	ut = kthread_create_on_cpu(threadfn, data, cpu, namefmt,	   \
>> +				   ##__VA_ARGS__);			   \
>> +	if (!IS_ERR(ut))						   \
>> +		wake_up_process(ut);					   \
>> +									   \
>> +	ut;								   \
>> +})
> 
> Why do you need to use __UNIQUE_ID here? Presumably ({}) in _kthread_run_on_cpu

It will still trigger a -Wshadow warning if the caller has a variable with
the same name. I know it's only enabled at W=2, but I still feel we
shouldn't introduce any new warnings when it's avoidable.
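
A minimal sketch of the collision __UNIQUE_ID() avoids (the caller below
is hypothetical; worker_fn and priv are placeholders):

	/* W=2 caller with a local whose name a fixed macro-internal
	 * variable could easily reuse.
	 */
	u32 cpu = raw_smp_processor_id();
	struct task_struct *task;

	task = kthread_run_on_cpu(worker_fn, priv, cpu, "worker/%u");
	/*
	 * If _kthread_create_on_cpu() named its internal copy of the CPU
	 * argument simply 'cpu', the u32 declared inside its ({ }) block
	 * would shadow the caller's 'cpu' above and trigger -Wshadow;
	 * __UNIQUE_ID(cpu_) makes such a clash impossible.
	 */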

> should be enough to avoid the issue of non unique variable in the parent
> scope. (and similar kthread_run isn't using any __UNIQUE_IDs)
> 
> The rest of the patches look good.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu()
  2024-09-03 12:25     ` Alexander Lobakin
@ 2024-09-03 17:04       ` Stanislav Fomichev
  0 siblings, 0 replies; 26+ messages in thread
From: Stanislav Fomichev @ 2024-09-03 17:04 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, bpf, netdev,
	linux-kernel

On 09/03, Alexander Lobakin wrote:
> From: Stanislav Fomichev <sdf@fomichev.me>
> Date: Fri, 30 Aug 2024 15:56:12 -0700
> 
> > On 08/30, Alexander Lobakin wrote:
> >> Currently, kthread_{create,run}_on_cpu() don't support varargs like
> >> kthread_create{,_on_node}() do, which makes them less convenient to
> >> use.
> >> Convert them to take varargs as the last argument. The only difference
> >> is that they always append the CPU ID at the end and require the format
> >> string to have an excess '%u' at the end due to that. That's still true;
> >> however, the compiler will now correctly point that out if it's missing.
> >> One more nice side effect is that you can now use the underscored
> >> __kthread_create_on_cpu() if you want to override that rule and not
> >> have the CPU ID at the end of the name.
> >> The current callers are not affected in any way.
> >>
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> ---
> >>  include/linux/kthread.h | 51 ++++++++++++++++++++++++++---------------
> >>  kernel/kthread.c        | 22 ++++++++++--------
> >>  2 files changed, 45 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> >> index b11f53c1ba2e..27a94e691948 100644
> >> --- a/include/linux/kthread.h
> >> +++ b/include/linux/kthread.h
> >> @@ -27,11 +27,21 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
> >>  #define kthread_create(threadfn, data, namefmt, arg...) \
> >>  	kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
> >>  
> >> -
> >> -struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> >> -					  void *data,
> >> -					  unsigned int cpu,
> >> -					  const char *namefmt);
> >> +__printf(4, 5)
> >> +struct task_struct *__kthread_create_on_cpu(int (*threadfn)(void *data),
> >> +					    void *data, unsigned int cpu,
> >> +					    const char *namefmt, ...);
> >> +
> >> +#define kthread_create_on_cpu(threadfn, data, cpu, namefmt, ...)	   \
> >> +	_kthread_create_on_cpu(threadfn, data, cpu, __UNIQUE_ID(cpu_),	   \
> >> +			       namefmt, ##__VA_ARGS__)
> >> +
> >> +#define _kthread_create_on_cpu(threadfn, data, cpu, uc, namefmt, ...) ({   \
> >> +	u32 uc = (cpu);							   \
> >> +									   \
> >> +	__kthread_create_on_cpu(threadfn, data, uc, namefmt,		   \
> >> +				##__VA_ARGS__, uc);			   \
> >> +})
> >>  
> >>  void get_kthread_comm(char *buf, size_t buf_size, struct task_struct *tsk);
> >>  bool set_kthread_struct(struct task_struct *p);
> >> @@ -62,25 +72,28 @@ bool kthread_is_per_cpu(struct task_struct *k);
> >>   * @threadfn: the function to run until signal_pending(current).
> >>   * @data: data ptr for @threadfn.
> >>   * @cpu: The cpu on which the thread should be bound,
> >> - * @namefmt: printf-style name for the thread. Format is restricted
> >> - *	     to "name.*%u". Code fills in cpu number.
> >> + * @namefmt: printf-style name for the thread. Must have an excess '%u'
> >> + *	     at the end as kthread_create_on_cpu() fills in CPU number.
> >>   *
> >>   * Description: Convenient wrapper for kthread_create_on_cpu()
> >>   * followed by wake_up_process().  Returns the kthread or
> >>   * ERR_PTR(-ENOMEM).
> >>   */
> >> -static inline struct task_struct *
> >> -kthread_run_on_cpu(int (*threadfn)(void *data), void *data,
> >> -			unsigned int cpu, const char *namefmt)
> >> -{
> >> -	struct task_struct *p;
> >> -
> >> -	p = kthread_create_on_cpu(threadfn, data, cpu, namefmt);
> >> -	if (!IS_ERR(p))
> >> -		wake_up_process(p);
> >> -
> >> -	return p;
> >> -}
> >> +#define kthread_run_on_cpu(threadfn, data, cpu, namefmt, ...)		   \
> >> +	_kthread_run_on_cpu(threadfn, data, cpu, __UNIQUE_ID(task_),	   \
> >> +			    namefmt, ##__VA_ARGS__)
> >> +
> >> +#define _kthread_run_on_cpu(threadfn, data, cpu, ut, namefmt, ...)	   \
> >> +({									   \
> >> +	struct task_struct *ut;						   \
> >> +									   \
> >> +	ut = kthread_create_on_cpu(threadfn, data, cpu, namefmt,	   \
> >> +				   ##__VA_ARGS__);			   \
> >> +	if (!IS_ERR(ut))						   \
> >> +		wake_up_process(ut);					   \
> >> +									   \
> >> +	ut;								   \
> >> +})
> > 
> > Why do you need to use __UNIQUE_ID here? Presumably ({}) in _kthread_run_on_cpu
> 
> It will still trigger a -Wshadow warning if the caller has a variable with
> the same name. I know it's only enabled at W=2, but I still feel we
> shouldn't introduce any new warnings when it's avoidable.

Makes sense, thanks! That's presumably why kthread_run uses the __k name,
to avoid the warning.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
                   ` (8 preceding siblings ...)
  2024-08-30 16:25 ` [PATCH bpf-next 9/9] xdp: remove xdp_alloc_skb_bulk() Alexander Lobakin
@ 2024-09-03 20:51 ` Jakub Kicinski
  2024-09-03 21:33   ` Lorenzo Bianconi
  2024-09-04 13:13   ` Alexander Lobakin
  9 siblings, 2 replies; 26+ messages in thread
From: Jakub Kicinski @ 2024-09-03 20:51 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>   threaded NAPI;

Could you try to use the backlog NAPI? Allocating a fake netdev and
using NAPI as a threading abstraction feels like an abuse. Maybe try
to factor out the necessary bits? What we want is using the per-cpu 
caches, and feeding GRO. None of the IRQ related NAPI functionality
fits in here.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-03 20:51 ` [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Jakub Kicinski
@ 2024-09-03 21:33   ` Lorenzo Bianconi
  2024-09-05 11:53     ` Jesper Dangaard Brouer
  2024-09-05 17:01     ` Lorenzo Bianconi
  2024-09-04 13:13   ` Alexander Lobakin
  1 sibling, 2 replies; 26+ messages in thread
From: Lorenzo Bianconi @ 2024-09-03 21:33 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 604 bytes --]

> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> > * patch 4: switch cpumap from a custom kthread to a CPU-pinned
> >   threaded NAPI;
> 
> Could you try to use the backlog NAPI? Allocating a fake netdev and
> using NAPI as a threading abstraction feels like an abuse. Maybe try
> to factor out the necessary bits? What we want is using the per-cpu 
> caches, and feeding GRO. None of the IRQ related NAPI functionality
> fits in here.

I was thinking allocating a fake netdev to use NAPI APIs is quite a common
approach, but sure, I will look into it.

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-03 20:51 ` [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Jakub Kicinski
  2024-09-03 21:33   ` Lorenzo Bianconi
@ 2024-09-04 13:13   ` Alexander Lobakin
  2024-09-04 14:50     ` Jakub Kicinski
  1 sibling, 1 reply; 26+ messages in thread
From: Alexander Lobakin @ 2024-09-04 13:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

From: Jakub Kicinski <kuba@kernel.org>
Date: Tue, 3 Sep 2024 13:51:58 -0700

> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
>> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>>   threaded NAPI;
> 
> Could you try to use the backlog NAPI? Allocating a fake netdev and
> using NAPI as a threading abstraction feels like an abuse. Maybe try
> to factor out the necessary bits? What we want is using the per-cpu 
> caches, and feeding GRO. None of the IRQ related NAPI functionality
> fits in here.

Lorenzo will try that, as he wrote. I can only add that in my old tree, I
factored out the GRO bits and used them here just as you suggested. The perf
was the same, but the diffstat was several hundred lines just to factor the
code out, while here the actual switch to NAPI removes more lines than it
adds, and the custom kthread logic is gone, etc. It just looks way more
elegant and simple.
I could say that gro_cells also "abuses" NAPI the same way, don't you
think? But nobody ever objected :>

Thanks,
Olek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-04 13:13   ` Alexander Lobakin
@ 2024-09-04 14:50     ` Jakub Kicinski
  2024-09-04 15:13       ` Alexander Lobakin
  0 siblings, 1 reply; 26+ messages in thread
From: Jakub Kicinski @ 2024-09-04 14:50 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

On Wed, 4 Sep 2024 15:13:54 +0200 Alexander Lobakin wrote:
> > Could you try to use the backlog NAPI? Allocating a fake netdev and
> > using NAPI as a threading abstraction feels like an abuse. Maybe try
> > to factor out the necessary bits? What we want is using the per-cpu 
> > caches, and feeding GRO. None of the IRQ related NAPI functionality
> > fits in here.  
> 
> Lorenzo will try that, as he wrote. I can only add that in my old tree, I
> factored out the GRO bits and used them here just as you suggested. The perf
> was the same, but the diffstat was several hundred lines just to factor the
> code out, while here the actual switch to NAPI removes more lines than it
> adds, and the custom kthread logic is gone, etc. It just looks way more
> elegant and simple.

Once again we seem to be arguing whether lower LoC is equivalent to
better code? :) If we can use backlog NAPI it hopefully won't be as
long. Maybe other, better approaches are within reach, too.

> I could say that gro_cells also "abuses" NAPI the same way, don't you
> think?

"same way"? :] Does it allocate a fake netdev, use NAPI as a threading
abstraction or add extra fields to napi_struct ? 
If other maintainers disagree I won't be upset, but I'm worried
that letting NAPI grow into some generic SW abstraction with broad 
use cases will hinder the ongoing queue config efforts.

> But nobody ever objected :>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-04 14:50     ` Jakub Kicinski
@ 2024-09-04 15:13       ` Alexander Lobakin
  2024-09-04 18:29         ` Jakub Kicinski
  0 siblings, 1 reply; 26+ messages in thread
From: Alexander Lobakin @ 2024-09-04 15:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 4 Sep 2024 07:50:41 -0700

> On Wed, 4 Sep 2024 15:13:54 +0200 Alexander Lobakin wrote:
>>> Could you try to use the backlog NAPI? Allocating a fake netdev and
>>> using NAPI as a threading abstraction feels like an abuse. Maybe try
>>> to factor out the necessary bits? What we want is using the per-cpu 
>>> caches, and feeding GRO. None of the IRQ related NAPI functionality
>>> fits in here.  
>>
>> Lorenzo will try that, as he wrote. I can only add that in my old tree, I
>> factored out the GRO bits and used them here just as you suggested. The perf
>> was the same, but the diffstat was several hundred lines just to factor the
>> code out, while here the actual switch to NAPI removes more lines than it
>> adds, and the custom kthread logic is gone, etc. It just looks way more
>> elegant and simple.
> 
> Once again we seem to be arguing whether lower LoC is equivalent to
> better code? :) If we can use backlog NAPI it hopefully won't be as

And once again, I didn't say that explicitly :D When two patches work the
same way, but one has a far shorter diffstat, we often prefer that one if
it's correct. This one for cpumap looked correct to me and Lorenzo, and
we didn't have any other ideas, so I picked it.
I didn't say "it's better than backlog NAPI because it's shorter"; the
only thing I said re backlog NAPI is that we'll try it. I hadn't thought
of this previously at all, and I'm no backlog expert in general.

> long. Maybe other, better approaches are within reach, too.
> 
>> I could say that gro_cells also "abuses" NAPI the same way, don't you
>> think?
> 
> "same way"? :] Does it allocate a fake netdev, use NAPI as a threading
> abstraction or add extra fields to napi_struct ? 

Wait wait wait, you said "NAPI IRQ-related logic doesn't fit here". I
could say the same for gro_cells -- IRQ-related NAPI logic doesn't fit
there. gro_cells is an SW abstraction.
A fake netdev is used by multiple drivers to use GRO, you know that
(popular for wireless drivers). They also conflict with the queue config
effort.

> If other maintainers disagree I won't be upset, but I'm worried
> that letting NAPI grow into some generic SW abstraction with broad 
> use cases will hinder the ongoing queue config efforts.
> 
>> But nobody ever objected :>

Thanks,
Olek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-04 15:13       ` Alexander Lobakin
@ 2024-09-04 18:29         ` Jakub Kicinski
  0 siblings, 0 replies; 26+ messages in thread
From: Jakub Kicinski @ 2024-09-04 18:29 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Lorenzo Bianconi, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

On Wed, 4 Sep 2024 17:13:59 +0200 Alexander Lobakin wrote:
> >> I could say that gro_cells also "abuses" NAPI the same way, don't you
> >> think?  
> > 
> > "same way"? :] Does it allocate a fake netdev, use NAPI as a threading
> > abstraction or add extra fields to napi_struct ?   
> 
> Wait wait wait, you said "NAPI IRQ-related logic doesn't fit here". I
> could say the same for gro_cells -- IRQ-related NAPI logic doesn't fit
> there. gro_cells is an SW abstraction.

Yes, that 1/4th of my complaint does indeed apply :)

> A fake netdev is used by multiple drivers to use GRO, you know that
> (popular for wireless drivers). They also conflict with the queue config
> effort.

And it did cause us some issues when changing netdev_priv() already.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-03 21:33   ` Lorenzo Bianconi
@ 2024-09-05 11:53     ` Jesper Dangaard Brouer
  2024-09-05 17:01     ` Lorenzo Bianconi
  1 sibling, 0 replies; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2024-09-05 11:53 UTC (permalink / raw)
  To: Lorenzo Bianconi, Jakub Kicinski
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Daniel Xu, John Fastabend, Martin KaFai Lau,
	David S. Miller, Eric Dumazet, Paolo Abeni, bpf, netdev,
	linux-kernel



On 03/09/2024 23.33, Lorenzo Bianconi wrote:
>> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
>>> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>>>    threaded NAPI;
>>
>> Could you try to use the backlog NAPI? Allocating a fake netdev and
>> using NAPI as a threading abstraction feels like an abuse. Maybe try
>> to factor out the necessary bits? What we want is using the per-cpu
>> caches, and feeding GRO. None of the IRQ related NAPI functionality
>> fits in here.
> 
> I was thinking allocating a fake netdev to use NAPI APIs is quite a common
> approach, but sure, I will look into it.
> 

I have a use-case for cpumap where I adjust (increase) kthread priority.

Using backlog NAPI, will I still be able to change scheduling priority?

--Jesper

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-03 21:33   ` Lorenzo Bianconi
  2024-09-05 11:53     ` Jesper Dangaard Brouer
@ 2024-09-05 17:01     ` Lorenzo Bianconi
  2024-09-06  0:20       ` Jakub Kicinski
  1 sibling, 1 reply; 26+ messages in thread
From: Lorenzo Bianconi @ 2024-09-05 17:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1211 bytes --]

> > On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> > > * patch 4: switch cpumap from a custom kthread to a CPU-pinned
> > >   threaded NAPI;
> > 
> > Could you try to use the backlog NAPI? Allocating a fake netdev and
> > using NAPI as a threading abstraction feels like an abuse. Maybe try
> > to factor out the necessary bits? What we want is using the per-cpu 
> > caches, and feeding GRO. None of the IRQ related NAPI functionality
> > fits in here.
> 
> I was thinking allocating a fake netdev to use NAPI APIs is quite a common
> approach, but sure, I will look into it.

At first glance, I think we could use the backlog NAPI APIs here in
order to avoid allocating a dummy netdev. We could implement an approach
similar to the one I used for the cpumap + gro_cell here [0].
In particular, the cpumap kthread pinned on cpu 'n' can schedule the
backlog NAPI associated to cpu 'n'. However according to my understanding
it seems the backlog NAPI APIs (in process_backlog()) do not support GRO,
right? Am I missing something?

Regards,
Lorenzo

[0] https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e

> 
> Regards,
> Lorenzo



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-05 17:01     ` Lorenzo Bianconi
@ 2024-09-06  0:20       ` Jakub Kicinski
  2024-09-06  8:15         ` Lorenzo Bianconi
  0 siblings, 1 reply; 26+ messages in thread
From: Jakub Kicinski @ 2024-09-06  0:20 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> In particular, the cpumap kthread pinned on cpu 'n' can schedule the
> backlog NAPI associated to cpu 'n'. However according to my understanding
> it seems the backlog NAPI APIs (in process_backlog()) do not support GRO,
> right? Am I missing something?

I meant to use the struct directly, not to schedule it. All you need
is GRO - feed it packets, flush it. 
But maybe you can avoid the netdev allocation and patch 3 in other ways.
Using backlog NAPI was just the first thing that came to mind.
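
A very rough sketch of the "feed it packets, flush it" idea (untested;
it assumes the per-CPU backlog napi_struct may be borrowed purely as a
GRO context, skbs[]/n stand for the batch built on the cpumap side, and
the locking/softirq questions are ignored):

	struct napi_struct *napi = &this_cpu_ptr(&softnet_data)->backlog;
	unsigned int i;

	/* Feed the batch of skbs into GRO... */
	for (i = 0; i < n; i++)
		napi_gro_receive(napi, skbs[i]);

	/* ...then flush what GRO still holds and deliver the rest. */
	napi_gro_flush(napi, false);
	gro_normal_list(napi);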

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-06  0:20       ` Jakub Kicinski
@ 2024-09-06  8:15         ` Lorenzo Bianconi
  2024-09-07 13:22           ` Lorenzo Bianconi
  0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Bianconi @ 2024-09-06  8:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 674 bytes --]

> On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> > In particular, the cpumap kthread pinned on cpu 'n' can schedule the
> > backlog NAPI associated to cpu 'n'. However according to my understanding
> > it seems the backlog NAPI APIs (in process_backlog()) do not support GRO,
> > right? Am I missing something?
> 
> I meant to use the struct directly, not to schedule it. All you need
> is GRO - feed it packets, flush it. 

ack, thx for pointing this out.

> But maybe you can avoid the netdev allocation and patch 3 in other ways.
> Using backlog NAPI was just the first thing that came to mind.

ack, I will look into it.

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames
  2024-09-06  8:15         ` Lorenzo Bianconi
@ 2024-09-07 13:22           ` Lorenzo Bianconi
  0 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Bianconi @ 2024-09-07 13:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Daniel Xu, John Fastabend,
	Jesper Dangaard Brouer, Martin KaFai Lau, David S. Miller,
	Eric Dumazet, Paolo Abeni, bpf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1354 bytes --]

> > On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> > > In particular, the cpumap kthread pinned on cpu 'n' can schedule the
> > > backlog NAPI associated to cpu 'n'. However according to my understanding
> > > it seems the backlog NAPI APIs (in process_backlog()) do not support GRO,
> > > right? Am I missing something?
> > 
> > I meant to use the struct directly, not to schedule it. All you need
> > is GRO - feed it packets, flush it. 
> 
> ack, thx for pointing this out.
> 
> > But maybe you can avoid the netdev allocation and patch 3 in other ways.
> > Using backlog NAPI was just the first thing that came to mind.
> 
> ack, I will look into it.
> 
> Regards,
> Lorenzo

Hi all,

I reworked my previous implementation to add GRO support to the cpumap codebase, removing
the dummy netdev dependency and keeping most of the other logic. You can find
the codebase here:
- https://github.com/LorenzoBianconi/bpf-next/commit/e152cf8c212196fccece0b516190827430c0f5f8
I added the two patches below in order to reuse some generic NAPI code:
- https://github.com/LorenzoBianconi/bpf-next/commit/3c73e9c2f07486590749e9b3bfb8a4b3df4cb5e0
- https://github.com/LorenzoBianconi/bpf-next/commit/d435ce2e1b6a991a6264a5aad4a0374a3ca86a51
I have not run any performance tests yet, just functional ones.

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-09-07 13:22 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-30 16:24 [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Alexander Lobakin
2024-08-30 16:25 ` [PATCH bpf-next 1/9] firmware/psci: fix missing '%u' format literal in kthread_create_on_cpu() Alexander Lobakin
2024-08-30 23:31   ` Daniel Xu
2024-08-30 16:25 ` [PATCH bpf-next 2/9] kthread: allow vararg kthread_{create,run}_on_cpu() Alexander Lobakin
2024-08-30 22:56   ` Stanislav Fomichev
2024-09-03 12:25     ` Alexander Lobakin
2024-09-03 17:04       ` Stanislav Fomichev
2024-08-30 16:25 ` [PATCH bpf-next 3/9] net: napi: add ability to create CPU-pinned threaded NAPI Alexander Lobakin
2024-08-31  0:19   ` Daniel Xu
2024-08-30 16:25 ` [PATCH bpf-next 4/9] bpf: cpumap: use CPU-pinned threaded NAPI w/GRO instead of kthread Alexander Lobakin
2024-08-30 16:25 ` [PATCH bpf-next 5/9] bpf: cpumap: reuse skb array instead of a linked list to chain skbs Alexander Lobakin
2024-08-30 16:25 ` [PATCH bpf-next 6/9] net: skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
2024-08-30 16:25 ` [PATCH bpf-next 7/9] bpf: cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
2024-08-30 16:25 ` [PATCH bpf-next 8/9] veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk() Alexander Lobakin
2024-08-30 16:25 ` [PATCH bpf-next 9/9] xdp: remove xdp_alloc_skb_bulk() Alexander Lobakin
2024-09-03 20:51 ` [PATCH bpf-next 0/9] bpf: cpumap: enable GRO for XDP_PASS frames Jakub Kicinski
2024-09-03 21:33   ` Lorenzo Bianconi
2024-09-05 11:53     ` Jesper Dangaard Brouer
2024-09-05 17:01     ` Lorenzo Bianconi
2024-09-06  0:20       ` Jakub Kicinski
2024-09-06  8:15         ` Lorenzo Bianconi
2024-09-07 13:22           ` Lorenzo Bianconi
2024-09-04 13:13   ` Alexander Lobakin
2024-09-04 14:50     ` Jakub Kicinski
2024-09-04 15:13       ` Alexander Lobakin
2024-09-04 18:29         ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).