* [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
@ 2024-09-16 10:13 Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 1/3] net: Add napi_init_for_gro routine Lorenzo Bianconi
` (4 more replies)
0 siblings, 5 replies; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-09-16 10:13 UTC (permalink / raw)
To: bpf
Cc: kuba, aleksander.lobakin, ast, daniel, andrii, dxu,
john.fastabend, hawk, martin.lau, davem, edumazet, pabeni, netdev,
lorenzo.bianconi
Add GRO support to the cpumap codebase by moving the cpu_map_entry kthread
to a NAPI-kthread pinned on the selected CPU.
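In short, the datapath change (condensed from patch 3/3; the first line is
removed from the old kthread loop, the rest is what the new NAPI poll
callback does instead):

	/* old cpu_map_kthread_run(): hand the skb batch straight to the stack */
	netif_receive_skb_list(&list);

	/* new cpu_map_poll(): run each skb through GRO on the NAPI context */
	list_for_each_entry_safe(skb, tmp, &list, list) {
		skb_list_del_init(skb);
		napi_gro_receive(napi, skb);
	}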
Changes in rfc v2:
- get rid of dummy netdev dependency
Lorenzo Bianconi (3):
net: Add napi_init_for_gro routine
net: add napi_threaded_poll to netdevice.h
bpf: cpumap: Add gro support
include/linux/netdevice.h | 3 +
kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
net/core/dev.c | 27 ++++++---
3 files changed, 73 insertions(+), 80 deletions(-)
--
2.46.0
* [RFC/RFT v2 1/3] net: Add napi_init_for_gro routine
2024-09-16 10:13 [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Lorenzo Bianconi
@ 2024-09-16 10:13 ` Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 2/3] net: add napi_threaded_poll to netdevice.h Lorenzo Bianconi
` (3 subsequent siblings)
4 siblings, 0 replies; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-09-16 10:13 UTC (permalink / raw)
To: bpf
Cc: kuba, aleksander.lobakin, ast, daniel, andrii, dxu,
john.fastabend, hawk, martin.lau, davem, edumazet, pabeni, netdev,
lorenzo.bianconi
Introduce the napi_init_for_gro() utility routine to initialize a
napi_struct for GRO. This is a preliminary patch to introduce GRO support
to the cpumap codebase.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
include/linux/netdevice.h | 2 ++
net/core/dev.c | 23 +++++++++++++++++------
2 files changed, 19 insertions(+), 6 deletions(-)
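As a usage note: the point of the split is that a caller with no backing
net_device can set up a napi_struct purely as a GRO context. A minimal
sketch of that call pattern follows (gro_poll() and dequeue_skb() are
hypothetical placeholders; the real user is the cpumap patch in this
series, which passes a NULL dev):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct napi_struct gro_napi;

/* Hypothetical poll callback: drain the caller's own queue (dequeue_skb()
 * is a placeholder) and feed the skbs to GRO instead of the plain
 * netif_receive_skb() path.
 */
static int gro_poll(struct napi_struct *napi, int budget)
{
	struct sk_buff *skb;
	int done = 0;

	while (done < budget && (skb = dequeue_skb()) != NULL) {
		napi_gro_receive(napi, skb);
		done++;
	}

	if (done < budget)
		napi_complete(napi);

	return done;
}

static int gro_napi_setup(void)
{
	/* NULL dev: the NAPI gets hashed and GRO-initialized, while the
	 * device-specific parts (napi_list linkage, dev->threaded kthread
	 * creation) stay in netif_napi_add_weight().
	 */
	return napi_init_for_gro(NULL, &gro_napi, gro_poll, NAPI_POLL_WEIGHT);
}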
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 607009150b5fa..3c4c3ae2170f0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2628,6 +2628,8 @@ static inline void netif_napi_set_irq(struct napi_struct *napi, int irq)
*/
#define NAPI_POLL_WEIGHT 64
+int napi_init_for_gro(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight);
void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
diff --git a/net/core/dev.c b/net/core/dev.c
index f66e614078832..c87c510abc05b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6638,13 +6638,14 @@ void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
}
EXPORT_SYMBOL(netif_queue_set_napi);
-void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
- int (*poll)(struct napi_struct *, int), int weight)
+int napi_init_for_gro(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight)
{
if (WARN_ON(test_and_set_bit(NAPI_STATE_LISTED, &napi->state)))
- return;
+ return -EBUSY;
INIT_LIST_HEAD(&napi->poll_list);
+ INIT_LIST_HEAD(&napi->dev_list);
INIT_HLIST_NODE(&napi->napi_hash_node);
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
@@ -6662,18 +6663,28 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
napi->poll_owner = -1;
#endif
napi->list_owner = -1;
+ napi_hash_add(napi);
+ napi_get_frags_check(napi);
+ netif_napi_set_irq(napi, -1);
+
+ return 0;
+}
+
+void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight)
+{
+ if (napi_init_for_gro(dev, napi, poll, weight))
+ return;
+
set_bit(NAPI_STATE_SCHED, &napi->state);
set_bit(NAPI_STATE_NPSVC, &napi->state);
list_add_rcu(&napi->dev_list, &dev->napi_list);
- napi_hash_add(napi);
- napi_get_frags_check(napi);
/* Create kthread for this napi if dev->threaded is set.
* Clear dev->threaded if kthread creation failed so that
* threaded mode will not be enabled in napi_enable().
*/
if (dev->threaded && napi_kthread_create(napi))
dev->threaded = false;
- netif_napi_set_irq(napi, -1);
}
EXPORT_SYMBOL(netif_napi_add_weight);
--
2.46.0
* [RFC/RFT v2 2/3] net: add napi_threaded_poll to netdevice.h
2024-09-16 10:13 [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 1/3] net: Add napi_init_for_gro routine Lorenzo Bianconi
@ 2024-09-16 10:13 ` Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 3/3] bpf: cpumap: Add gro support Lorenzo Bianconi
` (2 subsequent siblings)
4 siblings, 0 replies; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-09-16 10:13 UTC (permalink / raw)
To: bpf
Cc: kuba, aleksander.lobakin, ast, daniel, andrii, dxu,
john.fastabend, hawk, martin.lau, davem, edumazet, pabeni, netdev,
lorenzo.bianconi
Move the napi_threaded_poll() routine declaration to netdevice.h and drop
the static keyword so that it can be reused by the cpumap codebase.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
include/linux/netdevice.h | 1 +
net/core/dev.c | 4 +---
2 files changed, 2 insertions(+), 3 deletions(-)
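For reference, the cpumap patch that follows reuses the now-exported symbol
roughly like this (copied from patch 3/3), running the generic threaded-poll
loop in a kthread pinned to the target CPU:

	rcpu->napi.thread = kthread_run_on_cpu(napi_threaded_poll,
					       &rcpu->napi, cpu,
					       "cpumap-napi/%d");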
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3c4c3ae2170f0..3bf7e22965cd5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2628,6 +2628,7 @@ static inline void netif_napi_set_irq(struct napi_struct *napi, int irq)
*/
#define NAPI_POLL_WEIGHT 64
+int napi_threaded_poll(void *data);
int napi_init_for_gro(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
diff --git a/net/core/dev.c b/net/core/dev.c
index c87c510abc05b..8c1b3d1df261d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1417,8 +1417,6 @@ void netdev_notify_peers(struct net_device *dev)
}
EXPORT_SYMBOL(netdev_notify_peers);
-static int napi_threaded_poll(void *data);
-
static int napi_kthread_create(struct napi_struct *n)
{
int err = 0;
@@ -6922,7 +6920,7 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
}
}
-static int napi_threaded_poll(void *data)
+int napi_threaded_poll(void *data)
{
struct napi_struct *napi = data;
--
2.46.0
* [RFC/RFT v2 3/3] bpf: cpumap: Add gro support
2024-09-16 10:13 [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 1/3] net: Add napi_init_for_gro routine Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 2/3] net: add napi_threaded_poll to netdevice.h Lorenzo Bianconi
@ 2024-09-16 10:13 ` Lorenzo Bianconi
2024-09-16 15:10 ` [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Alexander Lobakin
2024-10-08 22:39 ` Daniel Xu
4 siblings, 0 replies; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-09-16 10:13 UTC (permalink / raw)
To: bpf
Cc: kuba, aleksander.lobakin, ast, daniel, andrii, dxu,
john.fastabend, hawk, martin.lau, davem, edumazet, pabeni, netdev,
lorenzo.bianconi
Introduce GRO support to the cpumap codebase by moving the cpu_map_entry
kthread to a NAPI-kthread pinned on the selected CPU.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
kernel/bpf/cpumap.c | 123 +++++++++++++++++++-------------------------
1 file changed, 52 insertions(+), 71 deletions(-)
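To summarize the lifecycle changes before diving into the diff (all calls
below are taken from the hunks that follow):

/* __cpu_map_entry_alloc(): per-entry NAPI context plus a threaded-poll
 * kthread pinned to the target CPU, replacing the plain cpumap kthread
 */
napi_init_for_gro(NULL, &rcpu->napi, cpu_map_poll, NAPI_POLL_WEIGHT);
set_bit(NAPI_STATE_THREADED, &rcpu->napi.state);
rcpu->napi.thread = kthread_run_on_cpu(napi_threaded_poll, &rcpu->napi,
				       cpu, "cpumap-napi/%d");
napi_schedule(&rcpu->napi);

/* producers (cpu_map_generic_redirect(), __cpu_map_flush()): kick the NAPI
 * instead of wake_up_process() on the old kthread
 */
napi_schedule(&rcpu->napi);

/* __cpu_map_entry_free(): replaces kthread_stop() */
napi_disable(&rcpu->napi);
__netif_napi_del(&rcpu->napi);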
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index fbdf5a1aabfe4..3ec6739aec5ae 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -62,9 +62,11 @@ struct bpf_cpu_map_entry {
/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
struct xdp_bulk_queue __percpu *bulkq;
- /* Queue with potential multi-producers, and single-consumer kthread */
+ /* Queue with potential multi-producers, and single-consumer
+ * NAPI-kthread
+ */
struct ptr_ring *queue;
- struct task_struct *kthread;
+ struct napi_struct napi;
struct bpf_cpumap_val value;
struct bpf_prog *prog;
@@ -261,58 +263,42 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
return nframes;
}
-static int cpu_map_kthread_run(void *data)
+static int cpu_map_poll(struct napi_struct *napi, int budget)
{
- struct bpf_cpu_map_entry *rcpu = data;
- unsigned long last_qs = jiffies;
+ struct xdp_cpumap_stats stats = {}; /* zero stats */
+ unsigned int kmem_alloc_drops = 0;
+ struct bpf_cpu_map_entry *rcpu;
+ int done = 0;
+ rcu_read_lock();
+ rcpu = container_of(napi, struct bpf_cpu_map_entry, napi);
complete(&rcpu->kthread_running);
- set_current_state(TASK_INTERRUPTIBLE);
- /* When kthread gives stop order, then rcpu have been disconnected
- * from map, thus no new packets can enter. Remaining in-flight
- * per CPU stored packets are flushed to this queue. Wait honoring
- * kthread_stop signal until queue is empty.
- */
- while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
- struct xdp_cpumap_stats stats = {}; /* zero stats */
- unsigned int kmem_alloc_drops = 0, sched = 0;
+ while (done < budget) {
gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
- int i, n, m, nframes, xdp_n;
+ int n, i, m, xdp_n = 0, nframes;
void *frames[CPUMAP_BATCH];
+ struct sk_buff *skb, *tmp;
void *skbs[CPUMAP_BATCH];
LIST_HEAD(list);
- /* Release CPU reschedule checks */
- if (__ptr_ring_empty(rcpu->queue)) {
- set_current_state(TASK_INTERRUPTIBLE);
- /* Recheck to avoid lost wake-up */
- if (__ptr_ring_empty(rcpu->queue)) {
- schedule();
- sched = 1;
- last_qs = jiffies;
- } else {
- __set_current_state(TASK_RUNNING);
- }
- } else {
- rcu_softirq_qs_periodic(last_qs);
- sched = cond_resched();
- }
-
+ if (__ptr_ring_empty(rcpu->queue))
+ break;
/*
* The bpf_cpu_map_entry is single consumer, with this
* kthread CPU pinned. Lockless access to ptr_ring
* consume side valid as no-resize allowed of queue.
*/
- n = __ptr_ring_consume_batched(rcpu->queue, frames,
- CPUMAP_BATCH);
- for (i = 0, xdp_n = 0; i < n; i++) {
+ n = min(budget - done, CPUMAP_BATCH);
+ n = __ptr_ring_consume_batched(rcpu->queue, frames, n);
+ done += n;
+
+ for (i = 0; i < n; i++) {
void *f = frames[i];
struct page *page;
if (unlikely(__ptr_test_bit(0, &f))) {
- struct sk_buff *skb = f;
-
+ skb = f;
__ptr_clear_bit(0, &skb);
list_add_tail(&skb->list, &list);
continue;
@@ -340,12 +326,10 @@ static int cpu_map_kthread_run(void *data)
}
}
- local_bh_disable();
for (i = 0; i < nframes; i++) {
struct xdp_frame *xdpf = frames[i];
- struct sk_buff *skb = skbs[i];
- skb = __xdp_build_skb_from_frame(xdpf, skb,
+ skb = __xdp_build_skb_from_frame(xdpf, skbs[i],
xdpf->dev_rx);
if (!skb) {
xdp_return_frame(xdpf);
@@ -354,17 +338,21 @@ static int cpu_map_kthread_run(void *data)
list_add_tail(&skb->list, &list);
}
- netif_receive_skb_list(&list);
-
- /* Feedback loop via tracepoint */
- trace_xdp_cpumap_kthread(rcpu->map_id, n, kmem_alloc_drops,
- sched, &stats);
- local_bh_enable(); /* resched point, may call do_softirq() */
+ list_for_each_entry_safe(skb, tmp, &list, list) {
+ skb_list_del_init(skb);
+ napi_gro_receive(napi, skb);
+ }
}
- __set_current_state(TASK_RUNNING);
- return 0;
+ rcu_read_unlock();
+ /* Feedback loop via tracepoint */
+ trace_xdp_cpumap_kthread(rcpu->map_id, done, kmem_alloc_drops, 0,
+ &stats);
+ if (done < budget)
+ napi_complete(napi);
+
+ return done;
}
static int __cpu_map_load_bpf_program(struct bpf_cpu_map_entry *rcpu,
@@ -432,18 +420,19 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
if (fd > 0 && __cpu_map_load_bpf_program(rcpu, map, fd))
goto free_ptr_ring;
+ napi_init_for_gro(NULL, &rcpu->napi, cpu_map_poll,
+ NAPI_POLL_WEIGHT);
+ set_bit(NAPI_STATE_THREADED, &rcpu->napi.state);
+
/* Setup kthread */
init_completion(&rcpu->kthread_running);
- rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
- "cpumap/%d/map:%d", cpu,
- map->id);
- if (IS_ERR(rcpu->kthread))
+ rcpu->napi.thread = kthread_run_on_cpu(napi_threaded_poll,
+ &rcpu->napi, cpu,
+ "cpumap-napi/%d");
+ if (IS_ERR(rcpu->napi.thread))
goto free_prog;
- /* Make sure kthread runs on a single CPU */
- kthread_bind(rcpu->kthread, cpu);
- wake_up_process(rcpu->kthread);
-
+ napi_schedule(&rcpu->napi);
/* Make sure kthread has been running, so kthread_stop() will not
* stop the kthread prematurely and all pending frames or skbs
* will be handled by the kthread before kthread_stop() returns.
@@ -477,12 +466,8 @@ static void __cpu_map_entry_free(struct work_struct *work)
*/
rcpu = container_of(to_rcu_work(work), struct bpf_cpu_map_entry, free_work);
- /* kthread_stop will wake_up_process and wait for it to complete.
- * cpu_map_kthread_run() makes sure the pointer ring is empty
- * before exiting.
- */
- kthread_stop(rcpu->kthread);
-
+ napi_disable(&rcpu->napi);
+ __netif_napi_del(&rcpu->napi);
if (rcpu->prog)
bpf_prog_put(rcpu->prog);
/* The queue should be empty at this point */
@@ -498,8 +483,8 @@ static void __cpu_map_entry_free(struct work_struct *work)
* __cpu_map_entry_free() in a separate workqueue after waiting for an RCU grace
* period. This means that (a) all pending enqueue and flush operations have
* completed (because of the RCU callback), and (b) we are in a workqueue
- * context where we can stop the kthread and wait for it to exit before freeing
- * everything.
+ * context where we can stop the NAPI-kthread and wait for it to exit before
+ * freeing everything.
*/
static void __cpu_map_entry_replace(struct bpf_cpu_map *cmap,
u32 key_cpu, struct bpf_cpu_map_entry *rcpu)
@@ -579,9 +564,7 @@ static void cpu_map_free(struct bpf_map *map)
*/
synchronize_rcu();
- /* The only possible user of bpf_cpu_map_entry is
- * cpu_map_kthread_run().
- */
+ /* The only possible user of bpf_cpu_map_entry is the NAPI-kthread. */
for (i = 0; i < cmap->map.max_entries; i++) {
struct bpf_cpu_map_entry *rcpu;
@@ -589,7 +572,7 @@ static void cpu_map_free(struct bpf_map *map)
if (!rcpu)
continue;
- /* Stop kthread and cleanup entry directly */
+ /* Stop NAPI-kthread and cleanup entry directly */
__cpu_map_entry_free(&rcpu->free_work.work);
}
bpf_map_area_free(cmap->cpu_map);
@@ -753,7 +736,7 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
if (ret < 0)
goto trace;
- wake_up_process(rcpu->kthread);
+ napi_schedule(&rcpu->napi);
trace:
trace_xdp_cpumap_enqueue(rcpu->map_id, !ret, !!ret, rcpu->cpu);
return ret;
@@ -765,8 +748,6 @@ void __cpu_map_flush(struct list_head *flush_list)
list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
bq_flush_to_queue(bq);
-
- /* If already running, costs spin_lock_irqsave + smb_mb */
- wake_up_process(bq->obj->kthread);
+ napi_schedule(&bq->obj->napi);
}
}
--
2.46.0
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-09-16 10:13 [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Lorenzo Bianconi
` (2 preceding siblings ...)
2024-09-16 10:13 ` [RFC/RFT v2 3/3] bpf: cpumap: Add gro support Lorenzo Bianconi
@ 2024-09-16 15:10 ` Alexander Lobakin
2024-10-08 22:39 ` Daniel Xu
4 siblings, 0 replies; 36+ messages in thread
From: Alexander Lobakin @ 2024-09-16 15:10 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: bpf, kuba, ast, daniel, andrii, dxu, john.fastabend, hawk,
martin.lau, davem, edumazet, pabeni, netdev, lorenzo.bianconi
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Mon, 16 Sep 2024 12:13:42 +0200
> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
>
> Changes in rfc v2:
> - get rid of dummy netdev dependency
>
> Lorenzo Bianconi (3):
> net: Add napi_init_for_gro routine
> net: add napi_threaded_poll to netdevice.h
> bpf: cpumap: Add gro support
Oh okay, so it still uses a NAPI.
When I'm back from the conferences (next week), I might rebase and send
the solution where I only use the GRO part of it, i.e. no
napi_schedule()/poll()/napi_complete() logic.
>
> include/linux/netdevice.h | 3 +
> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
> net/core/dev.c | 27 ++++++---
> 3 files changed, 73 insertions(+), 80 deletions(-)
Thanks,
Olek
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-09-16 10:13 [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Lorenzo Bianconi
` (3 preceding siblings ...)
2024-09-16 15:10 ` [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Alexander Lobakin
@ 2024-10-08 22:39 ` Daniel Xu
2024-10-09 10:46 ` Lorenzo Bianconi
4 siblings, 1 reply; 36+ messages in thread
From: Daniel Xu @ 2024-10-08 22:39 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: bpf, kuba, aleksander.lobakin, ast, daniel, andrii,
john.fastabend, hawk, martin.lau, davem, edumazet, pabeni, netdev,
lorenzo.bianconi
Hi Lorenzo,
On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
>
> Changes in rfc v2:
> - get rid of dummy netdev dependency
>
> Lorenzo Bianconi (3):
> net: Add napi_init_for_gro routine
> net: add napi_threaded_poll to netdevice.h
> bpf: cpumap: Add gro support
>
> include/linux/netdevice.h | 3 +
> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
> net/core/dev.c | 27 ++++++---
> 3 files changed, 73 insertions(+), 80 deletions(-)
>
> --
> 2.46.0
>
Sorry about the long delay - finally caught up to everything after
conferences.
I re-ran my synthetic tests (including baseline). v2 is somehow showing
2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
variable I changed is kernel version - steering prog is active for both.
Baseline (again)
./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
cpumap NAPI patches v2
Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
Delta 1.04% -3.62% 7.41% 8.57% 30.47%
Thanks,
Daniel
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-10-08 22:39 ` Daniel Xu
@ 2024-10-09 10:46 ` Lorenzo Bianconi
2024-10-09 12:27 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-10-09 10:46 UTC (permalink / raw)
To: Daniel Xu
Cc: bpf, kuba, aleksander.lobakin, ast, daniel, andrii,
john.fastabend, hawk, martin.lau, davem, edumazet, pabeni, netdev,
lorenzo.bianconi
> Hi Lorenzo,
>
> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> > Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> > NAPI-kthread pinned on the selected cpu.
> >
> > Changes in rfc v2:
> > - get rid of dummy netdev dependency
> >
> > Lorenzo Bianconi (3):
> > net: Add napi_init_for_gro routine
> > net: add napi_threaded_poll to netdevice.h
> > bpf: cpumap: Add gro support
> >
> > include/linux/netdevice.h | 3 +
> > kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
> > net/core/dev.c | 27 ++++++---
> > 3 files changed, 73 insertions(+), 80 deletions(-)
> >
> > --
> > 2.46.0
> >
>
> Sorry about the long delay - finally caught up to everything after
> conferences.
>
> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> variable I changed is kernel version - steering prog is active for both.
>
>
> Baseline (again)
>
> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
>
> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
>
> cpumap NAPI patches v2
>
> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
>
> Thanks,
> Daniel
Hi Daniel,
cool, thx for testing it.
@Olek: how do we want to proceed on it? Are you still working on it or do you want me
to send a regular patch for it?
Regards,
Lorenzo
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-10-09 10:46 ` Lorenzo Bianconi
@ 2024-10-09 12:27 ` Alexander Lobakin
2024-10-09 12:47 ` Lorenzo Bianconi
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-10-09 12:27 UTC (permalink / raw)
To: Lorenzo Bianconi, Daniel Xu
Cc: bpf, kuba, ast, daniel, andrii, john.fastabend, hawk, martin.lau,
davem, edumazet, pabeni, netdev, lorenzo.bianconi
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 12:46:00 +0200
>> Hi Lorenzo,
>>
>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
>>> NAPI-kthread pinned on the selected cpu.
>>>
>>> Changes in rfc v2:
>>> - get rid of dummy netdev dependency
>>>
>>> Lorenzo Bianconi (3):
>>> net: Add napi_init_for_gro routine
>>> net: add napi_threaded_poll to netdevice.h
>>> bpf: cpumap: Add gro support
>>>
>>> include/linux/netdevice.h | 3 +
>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
>>> net/core/dev.c | 27 ++++++---
>>> 3 files changed, 73 insertions(+), 80 deletions(-)
>>>
>>> --
>>> 2.46.0
>>>
>>
>> Sorry about the long delay - finally caught up to everything after
>> conferences.
>>
>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
>> variable I changed is kernel version - steering prog is active for both.
>>
>>
>> Baseline (again)
>>
>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
>>
>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
>>
>> cpumap NAPI patches v2
>>
>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
>> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
>>
>> Thanks,
>> Daniel
>
> Hi Daniel,
>
> cool, thx for testing it.
>
> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> to send a regular patch for it?
Hi,
I had a small vacation, sorry. I'm starting to work on it again today.
>
> Regards,
> Lorenzo
Thanks,
Olek
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-10-09 12:27 ` Alexander Lobakin
@ 2024-10-09 12:47 ` Lorenzo Bianconi
2024-10-09 12:50 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-10-09 12:47 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, bpf, kuba, ast, daniel, andrii, john.fastabend, hawk,
martin.lau, davem, edumazet, pabeni, netdev, lorenzo.bianconi
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Wed, 9 Oct 2024 12:46:00 +0200
>
> >> Hi Lorenzo,
> >>
> >> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> >>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> >>> NAPI-kthread pinned on the selected cpu.
> >>>
> >>> Changes in rfc v2:
> >>> - get rid of dummy netdev dependency
> >>>
> >>> Lorenzo Bianconi (3):
> >>> net: Add napi_init_for_gro routine
> >>> net: add napi_threaded_poll to netdevice.h
> >>> bpf: cpumap: Add gro support
> >>>
> >>> include/linux/netdevice.h | 3 +
> >>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
> >>> net/core/dev.c | 27 ++++++---
> >>> 3 files changed, 73 insertions(+), 80 deletions(-)
> >>>
> >>> --
> >>> 2.46.0
> >>>
> >>
> >> Sorry about the long delay - finally caught up to everything after
> >> conferences.
> >>
> >> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> >> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> >> variable I changed is kernel version - steering prog is active for both.
> >>
> >>
> >> Baseline (again)
> >>
> >> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
> >>
> >> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
> >> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
> >> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
> >> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
> >> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
> >> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
> >>
> >> cpumap NAPI patches v2
> >>
> >> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
> >> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
> >> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
> >> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
> >> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
> >> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
> >> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
> >>
> >> Thanks,
> >> Daniel
> >
> > Hi Daniel,
> >
> > cool, thx for testing it.
> >
> > @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> > to send a regular patch for it?
>
> Hi,
>
> I had a small vacation, sorry. I'm starting working on it again today.
ack, no worries. Are you going to rebase the other patches on top of it
or are you going to try a different approach?
Regards,
Lorenzo
>
> >
> > Regards,
> > Lorenzo
>
> Thanks,
> Olek
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-10-09 12:47 ` Lorenzo Bianconi
@ 2024-10-09 12:50 ` Alexander Lobakin
2024-10-22 15:51 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-10-09 12:50 UTC (permalink / raw)
To: Lorenzo Bianconi, Daniel Xu
Cc: bpf, kuba, ast, daniel, andrii, john.fastabend, hawk, martin.lau,
davem, edumazet, pabeni, netdev, lorenzo.bianconi
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 14:47:58 +0200
>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>
>>>> Hi Lorenzo,
>>>>
>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
>>>>> NAPI-kthread pinned on the selected cpu.
>>>>>
>>>>> Changes in rfc v2:
>>>>> - get rid of dummy netdev dependency
>>>>>
>>>>> Lorenzo Bianconi (3):
>>>>> net: Add napi_init_for_gro routine
>>>>> net: add napi_threaded_poll to netdevice.h
>>>>> bpf: cpumap: Add gro support
>>>>>
>>>>> include/linux/netdevice.h | 3 +
>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
>>>>> net/core/dev.c | 27 ++++++---
>>>>> 3 files changed, 73 insertions(+), 80 deletions(-)
>>>>>
>>>>> --
>>>>> 2.46.0
>>>>>
>>>>
>>>> Sorry about the long delay - finally caught up to everything after
>>>> conferences.
>>>>
>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
>>>> variable I changed is kernel version - steering prog is active for both.
>>>>
>>>>
>>>> Baseline (again)
>>>>
>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
>>>>
>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
>>>>
>>>> cpumap NAPI patches v2
>>>>
>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
>>>>
>>>> Thanks,
>>>> Daniel
>>>
>>> Hi Daniel,
>>>
>>> cool, thx for testing it.
>>>
>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>> to send a regular patch for it?
>>
>> Hi,
>>
>> I had a small vacation, sorry. I'm starting working on it again today.
>
> ack, no worries. Are you going to rebase the other patches on top of it
> or are you going to try a different approach?
I'll try the approach without NAPI as Kuba asks and let Daniel test it,
then we'll see.
BTW I'm curious how he got this boost on v2, from what I see you didn't
change the implementation that much?
Thanks,
Olek
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-10-09 12:50 ` Alexander Lobakin
@ 2024-10-22 15:51 ` Alexander Lobakin
2024-11-12 17:43 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-10-22 15:51 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Daniel Xu, bpf, kuba, ast, daniel, andrii, john.fastabend, hawk,
martin.lau, davem, edumazet, pabeni, netdev, lorenzo.bianconi
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 9 Oct 2024 14:50:42 +0200
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Wed, 9 Oct 2024 14:47:58 +0200
>
>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>
>>>>> Hi Lorenzo,
>>>>>
>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
>>>>>> NAPI-kthread pinned on the selected cpu.
>>>>>>
>>>>>> Changes in rfc v2:
>>>>>> - get rid of dummy netdev dependency
>>>>>>
>>>>>> Lorenzo Bianconi (3):
>>>>>> net: Add napi_init_for_gro routine
>>>>>> net: add napi_threaded_poll to netdevice.h
>>>>>> bpf: cpumap: Add gro support
>>>>>>
>>>>>> include/linux/netdevice.h | 3 +
>>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
>>>>>> net/core/dev.c | 27 ++++++---
>>>>>> 3 files changed, 73 insertions(+), 80 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.46.0
>>>>>>
>>>>>
>>>>> Sorry about the long delay - finally caught up to everything after
>>>>> conferences.
>>>>>
>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
>>>>> variable I changed is kernel version - steering prog is active for both.
>>>>>
>>>>>
>>>>> Baseline (again)
>>>>>
>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
>>>>>
>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
>>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
>>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
>>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
>>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
>>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
>>>>>
>>>>> cpumap NAPI patches v2
>>>>>
>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
>>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
>>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
>>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
>>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
>>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
>>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
>>>>>
>>>>> Thanks,
>>>>> Daniel
>>>>
>>>> Hi Daniel,
>>>>
>>>> cool, thx for testing it.
>>>>
>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>> to send a regular patch for it?
>>>
>>> Hi,
>>>
>>> I had a small vacation, sorry. I'm starting working on it again today.
>>
>> ack, no worries. Are you going to rebase the other patches on top of it
>> or are you going to try a different approach?
>
> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> then we'll see.
For now, I have the same results without NAPI as with your series, so
I'll push it soon and let Daniel test.
(I simply decoupled GRO and NAPI and used the former in cpumap, but the
kthread logic didn't change)
>
> BTW I'm curious how he got this boost on v2, from what I see you didn't
> change the implementation that much?
Thanks,
Olek
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-10-22 15:51 ` Alexander Lobakin
@ 2024-11-12 17:43 ` Alexander Lobakin
2024-11-13 23:39 ` Daniel Xu
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-11-12 17:43 UTC (permalink / raw)
To: Daniel Xu
Cc: Lorenzo Bianconi, bpf, kuba, ast, daniel, andrii, john.fastabend,
hawk, martin.lau, davem, edumazet, pabeni, netdev,
lorenzo.bianconi
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Tue, 22 Oct 2024 17:51:43 +0200
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Wed, 9 Oct 2024 14:50:42 +0200
>
>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>> Date: Wed, 9 Oct 2024 14:47:58 +0200
>>
>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>>
>>>>>> Hi Lorenzo,
>>>>>>
>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
>>>>>>> NAPI-kthread pinned on the selected cpu.
>>>>>>>
>>>>>>> Changes in rfc v2:
>>>>>>> - get rid of dummy netdev dependency
>>>>>>>
>>>>>>> Lorenzo Bianconi (3):
>>>>>>> net: Add napi_init_for_gro routine
>>>>>>> net: add napi_threaded_poll to netdevice.h
>>>>>>> bpf: cpumap: Add gro support
>>>>>>>
>>>>>>> include/linux/netdevice.h | 3 +
>>>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
>>>>>>> net/core/dev.c | 27 ++++++---
>>>>>>> 3 files changed, 73 insertions(+), 80 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.46.0
>>>>>>>
>>>>>>
>>>>>> Sorry about the long delay - finally caught up to everything after
>>>>>> conferences.
>>>>>>
>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
>>>>>> variable I changed is kernel version - steering prog is active for both.
>>>>>>
>>>>>>
>>>>>> Baseline (again)
>>>>>>
>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
>>>>>>
>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
>>>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
>>>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
>>>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
>>>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
>>>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
>>>>>>
>>>>>> cpumap NAPI patches v2
>>>>>>
>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
>>>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
>>>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
>>>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
>>>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
>>>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
>>>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
>>>>>>
>>>>>> Thanks,
>>>>>> Daniel
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> cool, thx for testing it.
>>>>>
>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>>> to send a regular patch for it?
>>>>
>>>> Hi,
>>>>
>>>> I had a small vacation, sorry. I'm starting working on it again today.
>>>
>>> ack, no worries. Are you going to rebase the other patches on top of it
>>> or are you going to try a different approach?
>>
>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
>> then we'll see.
>
> For now, I have the same results without NAPI as with your series, so
> I'll push it soon and let Daniel test.
>
> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> kthread logic didn't change)
>
>>
>> BTW I'm curious how he got this boost on v2, from what I see you didn't
>> change the implementation that much?
Hi Daniel,
Sorry for the delay. Please test [0].
[0] https://github.com/alobakin/linux/commits/cpumap-old
Thanks,
Olek
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-12 17:43 ` Alexander Lobakin
@ 2024-11-13 23:39 ` Daniel Xu
2024-11-23 0:10 ` Daniel Xu
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Xu @ 2024-11-13 23:39 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Lorenzo Bianconi, bpf@vger.kernel.org, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev, Lorenzo Bianconi
On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Tue, 22 Oct 2024 17:51:43 +0200
>
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Wed, 9 Oct 2024 14:50:42 +0200
>>
>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>> Date: Wed, 9 Oct 2024 14:47:58 +0200
>>>
>>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>>>
>>>>>>> Hi Lorenzo,
>>>>>>>
>>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
>>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
>>>>>>>> NAPI-kthread pinned on the selected cpu.
>>>>>>>>
>>>>>>>> Changes in rfc v2:
>>>>>>>> - get rid of dummy netdev dependency
>>>>>>>>
>>>>>>>> Lorenzo Bianconi (3):
>>>>>>>> net: Add napi_init_for_gro routine
>>>>>>>> net: add napi_threaded_poll to netdevice.h
>>>>>>>> bpf: cpumap: Add gro support
>>>>>>>>
>>>>>>>> include/linux/netdevice.h | 3 +
>>>>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
>>>>>>>> net/core/dev.c | 27 ++++++---
>>>>>>>> 3 files changed, 73 insertions(+), 80 deletions(-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.46.0
>>>>>>>>
>>>>>>>
>>>>>>> Sorry about the long delay - finally caught up to everything after
>>>>>>> conferences.
>>>>>>>
>>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
>>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
>>>>>>> variable I changed is kernel version - steering prog is active for both.
>>>>>>>
>>>>>>>
>>>>>>> Baseline (again)
>>>>>>>
>>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
>>>>>>>
>>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
>>>>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
>>>>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
>>>>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
>>>>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
>>>>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
>>>>>>>
>>>>>>> cpumap NAPI patches v2
>>>>>>>
>>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
>>>>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
>>>>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
>>>>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
>>>>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
>>>>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
>>>>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Daniel
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> cool, thx for testing it.
>>>>>>
>>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>>>> to send a regular patch for it?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I had a small vacation, sorry. I'm starting working on it again today.
>>>>
>>>> ack, no worries. Are you going to rebase the other patches on top of it
>>>> or are you going to try a different approach?
>>>
>>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
>>> then we'll see.
>>
>> For now, I have the same results without NAPI as with your series, so
>> I'll push it soon and let Daniel test.
>>
>> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
>> kthread logic didn't change)
>>
>>>
>>> BTW I'm curious how he got this boost on v2, from what I see you didn't
>>> change the implementation that much?
>
> Hi Daniel,
>
> Sorry for the delay. Please test [0].
>
> [0] https://github.com/alobakin/linux/commits/cpumap-old
>
> Thanks,
> Olek
Ack. Will do probably early next week.
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-13 23:39 ` Daniel Xu
@ 2024-11-23 0:10 ` Daniel Xu
2024-11-25 15:12 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Xu @ 2024-11-23 0:10 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Lorenzo Bianconi, bpf@vger.kernel.org, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev, Lorenzo Bianconi
Hi Olek,
Here are the results.
On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>
>
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > From: Alexander Lobakin <aleksander.lobakin@intel.com>
> > Date: Tue, 22 Oct 2024 17:51:43 +0200
> >
> >> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> Date: Wed, 9 Oct 2024 14:50:42 +0200
> >>
> >>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200
> >>>
> >>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
> >>>>>
> >>>>>>> Hi Lorenzo,
> >>>>>>>
> >>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> >>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> >>>>>>>> NAPI-kthread pinned on the selected cpu.
> >>>>>>>>
> >>>>>>>> Changes in rfc v2:
> >>>>>>>> - get rid of dummy netdev dependency
> >>>>>>>>
> >>>>>>>> Lorenzo Bianconi (3):
> >>>>>>>> net: Add napi_init_for_gro routine
> >>>>>>>> net: add napi_threaded_poll to netdevice.h
> >>>>>>>> bpf: cpumap: Add gro support
> >>>>>>>>
> >>>>>>>> include/linux/netdevice.h | 3 +
> >>>>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
> >>>>>>>> net/core/dev.c | 27 ++++++---
> >>>>>>>> 3 files changed, 73 insertions(+), 80 deletions(-)
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> 2.46.0
> >>>>>>>>
> >>>>>>>
> >>>>>>> Sorry about the long delay - finally caught up to everything after
> >>>>>>> conferences.
> >>>>>>>
> >>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> >>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> >>>>>>> variable I changed is kernel version - steering prog is active for both.
> >>>>>>>
> >>>>>>>
> >>>>>>> Baseline (again)
> >>>>>>>
> >>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
> >>>>>>>
> >>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >>>>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31
> >>>>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48
> >>>>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04
> >>>>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06
> >>>>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91
> >>>>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76
> >>>>>>>
> >>>>>>> cpumap NAPI patches v2
> >>>>>>>
> >>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >>>>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56
> >>>>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92
> >>>>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48
> >>>>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49
> >>>>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49
> >>>>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388
> >>>>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47%
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Daniel
> >>>>>>
> >>>>>> Hi Daniel,
> >>>>>>
> >>>>>> cool, thx for testing it.
> >>>>>>
> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> >>>>>> to send a regular patch for it?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I had a small vacation, sorry. I'm starting working on it again today.
> >>>>
> >>>> ack, no worries. Are you going to rebase the other patches on top of it
> >>>> or are you going to try a different approach?
> >>>
> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> >>> then we'll see.
> >>
> >> For now, I have the same results without NAPI as with your series, so
> >> I'll push it soon and let Daniel test.
> >>
> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> >> kthread logic didn't change)
> >>
> >>>
> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't
> >>> change the implementation that much?
> >
> > Hi Daniel,
> >
> > Sorry for the delay. Please test [0].
> >
> > [0] https://github.com/alobakin/linux/commits/cpumap-old
> >
> > Thanks,
> > Olek
>
> Ack. Will do probably early next week.
>
Baseline (again)
Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
cpumap v2 Olek
Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
Delta 0.92% -0.53% 0.33% 0.85% -41.32%
It's very interesting that we see -40% tput w/ the patches. I went back
and double checked and it seems the numbers are right. Here's some
output from the profiles I took with:
perf record -e cycles:k -a -- sleep 10
perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
# Event 'cycles:k'
# Baseline Delta Abs Shared Object Symbol
6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter
3.57% -2.56% bpf_prog_954ab9c8c8b5e42f_latency [k] bpf_prog_954ab9c8c8b5e42f_latency
+2.22% bpf_prog_5c74b34eb24d5c9b_steering [k] bpf_prog_5c74b34eb24d5c9b_steering
2.61% -1.88% [kernel.kallsyms] [k] __skb_datagram_iter
0.55% +1.53% [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
4.52% -1.46% [kernel.kallsyms] [k] read_tsc
0.34% +1.42% [kernel.kallsyms] [k] __slab_free
0.97% +1.18% [kernel.kallsyms] [k] do_idle
1.35% +1.17% [kernel.kallsyms] [k] cpuidle_enter_state
1.89% -1.15% [kernel.kallsyms] [k] tcp_ack
2.08% +1.14% [kernel.kallsyms] [k] _raw_spin_lock
+1.13% <redacted>
0.22% +1.02% [kernel.kallsyms] [k] __sock_wfree
2.23% -1.02% [kernel.kallsyms] [k] bpf_dynptr_slice
0.00% +0.98% [kernel.kallsyms] [k] tcp6_gro_receive
2.91% -0.98% [kernel.kallsyms] [k] csum_partial
0.62% +0.94% [kernel.kallsyms] [k] skb_release_data
+0.81% [kernel.kallsyms] [k] memset
0.16% +0.74% [kernel.kallsyms] [k] bnxt_tx_int
0.00% +0.74% [kernel.kallsyms] [k] dev_gro_receive
0.36% +0.74% [kernel.kallsyms] [k] __tcp_transmit_skb
+0.72% [kernel.kallsyms] [k] tcp_gro_receive
1.10% -0.66% [kernel.kallsyms] [k] ep_poll_callback
1.52% -0.65% [kernel.kallsyms] [k] page_pool_put_unrefed_netmem
0.75% -0.57% [kernel.kallsyms] [k] bnxt_rx_pkt
1.10% +0.56% [kernel.kallsyms] [k] native_sched_clock
0.16% +0.53% <redacted>
0.83% -0.53% [kernel.kallsyms] [k] skb_try_coalesce
0.60% +0.53% [kernel.kallsyms] [k] eth_type_trans
1.65% -0.51% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.14% +0.50% [kernel.kallsyms] [k] bnxt_start_xmit
0.54% -0.48% [kernel.kallsyms] [k] __skb_frag_unref
0.91% +0.48% [cls_bpf] [k] 0x0000000000000010
0.00% +0.47% [kernel.kallsyms] [k] ipv6_gro_receive
0.76% -0.45% [kernel.kallsyms] [k] tcp_rcv_established
0.94% -0.45% [kernel.kallsyms] [k] __inet6_lookup_established
0.31% +0.43% [kernel.kallsyms] [k] __sched_text_start
0.21% +0.43% [kernel.kallsyms] [k] poll_idle
0.91% -0.42% [kernel.kallsyms] [k] tcp_try_coalesce
0.91% -0.42% [kernel.kallsyms] [k] kmem_cache_free
1.13% +0.42% [kernel.kallsyms] [k] __bnxt_poll_work
0.48% -0.41% [kernel.kallsyms] [k] tcp_urg
+0.39% [kernel.kallsyms] [k] memcpy
0.51% -0.38% [kernel.kallsyms] [k] _raw_read_unlock_irqrestore
+0.38% [kernel.kallsyms] [k] __skb_gro_checksum_complete
+0.37% [kernel.kallsyms] [k] irq_entries_start
0.16% +0.36% [kernel.kallsyms] [k] bpf_sk_storage_get
0.62% -0.36% [kernel.kallsyms] [k] page_pool_refill_alloc_cache
0.08% +0.35% [kernel.kallsyms] [k] ip6_finish_output2
0.14% +0.34% [kernel.kallsyms] [k] bnxt_poll_p5
0.06% +0.33% [sch_fq] [k] 0x0000000000000020
0.04% +0.32% [kernel.kallsyms] [k] __dev_queue_xmit
0.75% -0.32% [kernel.kallsyms] [k] __xdp_build_skb_from_frame
0.67% -0.31% [kernel.kallsyms] [k] sock_def_readable
0.05% +0.31% [kernel.kallsyms] [k] netif_skb_features
+0.30% [kernel.kallsyms] [k] tcp_gro_pull_header
0.49% -0.29% [kernel.kallsyms] [k] napi_pp_put_page
0.18% +0.29% [kernel.kallsyms] [k] call_function_single_prep_ipi
0.40% -0.28% [kernel.kallsyms] [k] _raw_read_lock_irqsave
0.11% +0.27% [kernel.kallsyms] [k] raw6_local_deliver
0.18% +0.26% [kernel.kallsyms] [k] ip6_dst_check
0.42% -0.26% [kernel.kallsyms] [k] netif_receive_skb_list_internal
0.05% +0.26% [kernel.kallsyms] [k] __qdisc_run
0.75% +0.25% [kernel.kallsyms] [k] __build_skb_around
0.05% +0.25% [kernel.kallsyms] [k] htab_map_hash
0.09% +0.24% [kernel.kallsyms] [k] net_rx_action
0.07% +0.23% <redacted>
0.45% -0.23% [kernel.kallsyms] [k] migrate_enable
0.48% -0.23% [kernel.kallsyms] [k] mem_cgroup_charge_skmem
0.26% +0.23% [kernel.kallsyms] [k] __switch_to
0.15% +0.22% [kernel.kallsyms] [k] sock_rfree
0.30% -0.22% [kernel.kallsyms] [k] tcp_add_backlog
<snip>
5.68% bpf_prog_17fea1bb6503ed98_steering [k] bpf_prog_17fea1bb6503ed98_steering
2.10% [kernel.kallsyms] [k] __skb_checksum_complete
0.71% [kernel.kallsyms] [k] __memset
0.54% [kernel.kallsyms] [k] __memcpy
0.18% [kernel.kallsyms] [k] __irqentry_text_start
<snip>
Please let me know if you want me to collect any other data.
Thanks,
Daniel
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-23 0:10 ` Daniel Xu
@ 2024-11-25 15:12 ` Alexander Lobakin
2024-11-25 17:03 ` Daniel Xu
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Alexander Lobakin @ 2024-11-25 15:12 UTC (permalink / raw)
To: Daniel Xu
Cc: Lorenzo Bianconi, bpf@vger.kernel.org, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev, Lorenzo Bianconi
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Fri, 22 Nov 2024 17:10:06 -0700
> Hi Olek,
>
> Here are the results.
>
> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>
>>
>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
[...]
> Baseline (again)
>
> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
>
> cpumap v2 Olek
>
> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
>
>
> It's very interesting that we see -40% tput w/ the patches. I went back
Oh no, I messed up something =\
Could you please also test not the whole series, but patches 1-3 (up to
"bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
array...")? Would be great to see whether this implementation works
worse right from the start or I just broke something later on.
> and double checked and it seems the numbers are right. Here's the
> some output from some profiles I took with:
>
> perf record -e cycles:k -a -- sleep 10
> perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
>
> # Event 'cycles:k'
> # Baseline Delta Abs Shared Object Symbol
> 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter
BTW, what CONFIG_HZ do you have on the kernel you're testing with?
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-25 15:12 ` Alexander Lobakin
@ 2024-11-25 17:03 ` Daniel Xu
2024-11-25 18:50 ` Jesper Dangaard Brouer
2024-11-25 22:56 ` Daniel Xu
2 siblings, 0 replies; 36+ messages in thread
From: Daniel Xu @ 2024-11-25 17:03 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Lorenzo Bianconi, bpf@vger.kernel.org, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev, Lorenzo Bianconi
On Mon, Nov 25, 2024 at 04:12:24PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
> > Hi Olek,
> >
> > Here are the results.
> >
> > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> >>
> >>
> >> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
> > Baseline (again)
> >
> > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
> > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
> > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
> > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
> > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
> > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
> >
> > cpumap v2 Olek
> >
> > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
> > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
> > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
> > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
> > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
> > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
> > Delta 0.92% -0.53% 0.33% 0.85% -41.32%
> >
> >
> > It's very interesting that we see -40% tput w/ the patches. I went back
>
> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
Will do.
>
> > and double checked and it seems the numbers are right. Here's the
> > some output from some profiles I took with:
> >
> > perf record -e cycles:k -a -- sleep 10
> > perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> >
> > # Event 'cycles:k'
> > # Baseline Delta Abs Shared Object Symbol
> > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter
>
> BTW, what CONFIG_HZ do you have on the kernel you're testing with?
# zgrep CONFIG_HZ /proc/config.gz
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
Just curious - why do you ask?
Thanks,
Daniel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-25 15:12 ` Alexander Lobakin
2024-11-25 17:03 ` Daniel Xu
@ 2024-11-25 18:50 ` Jesper Dangaard Brouer
2024-11-25 21:53 ` Daniel Xu
2024-11-25 22:56 ` Daniel Xu
2 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2024-11-25 18:50 UTC (permalink / raw)
To: Alexander Lobakin, Daniel Xu
Cc: Lorenzo Bianconi, bpf@vger.kernel.org, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, David Miller, Eric Dumazet,
Paolo Abeni, netdev, Lorenzo Bianconi, kernel-team, mfleming
On 25/11/2024 16.12, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
>> Hi Olek,
>>
>> Here are the results.
>>
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
>> Baseline (again)
>>
>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
>>
We need to talk about what we are measuring, and how to control the
experiment setup to get reproducible results.
Especially controlling which CPU cores our code paths are executing on.
In above "baseline" case, we have two processes/tasks executing:
(1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
(2) Userspace netserver process TCP receiving data from socket.
My experience is that you will see two noticeably different
throughput results depending on whether (1) and (2) are
executing on the *same* CPU (multi-tasking context-switching),
or executing in parallel (e.g. pinned) on two different CPU cores.
The netperf command has an option
-T lcpu,remcpu
Request that netperf be bound to local CPU lcpu and/or netserver
be bound to remote CPU rcpu.
Verify setting by listing pinning like this:
for PID in $(pidof netserver); do taskset -pc $PID ; done
You can also set the pinning at runtime like this:
export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU
$PID; done
For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
output and adjust the pinning at runtime to observe the effect quickly.
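As a concrete sketch (the host address, test type and CPU ids below are
just assumptions for illustration), a pinned run with interim output
could look like this:

  # pinned TCP_STREAM run with 1-sec interim results for quick feedback
  netperf -H 192.168.1.2 -T 2,4 -D1 -t TCP_STREAM -l 30
  # confirm the netserver pinning on the receiver side
  for PID in $(pidof netserver); do taskset -pc $PID ; done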
My experience is unfortunately that TCP results have a lot of variation
(thanks for including 5 runs in your benchmarks), as they depend on task
timing, which can be affected by CPU sleep states. The system's CPU
latency setting can be seen in /dev/cpu_dma_latency, which can be read
like this:
sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
To play with /dev/cpu_dma_latency I use tuned-adm, as keeping a value
set requires holding the file open. E.g. I play with these profiles:
sudo tuned-adm profile throughput-performance
sudo tuned-adm profile latency-performance
sudo tuned-adm profile network-latency
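A minimal check sequence (assuming tuned is installed) to verify that a
profile actually changed the latency setting:

  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency   # value before
  sudo tuned-adm profile network-latency
  tuned-adm active                                      # confirm the active profile
  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency   # should now be much lower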
>> cpumap v2 Olek
>>
>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
>>
>>
We now have three processes/tasks executing:
(1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
(2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
(3) Userspace netserver process TCP receiving data from socket.
Again, the performance is going to depend on which CPU cores the
processes/tasks are running on and whether some are sharing the same
CPU. (There are both wakeup timing and cache-line effects.)
There are now more combinations to test...
CPUmap is a CPU scaling facility, and you will likely also see different
CPU utilization on the different cores once you start to pin these to
control the scenarios.
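To give an idea, placement can be inspected and controlled roughly like
this (the kthread name match, CPU ids and IRQ number are only examples):

  # list the cpumap kthreads and the CPU each one currently runs on
  ps -eo pid,psr,comm | grep cpumap
  # pin the netserver process to a CPU of the target set
  for PID in $(pidof netserver); do sudo taskset -pc 4 $PID; done
  # steer the NIC RX interrupt (and thus the RX softirq) to a known CPU
  echo 2 | sudo tee /proc/irq/123/smp_affinity_list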
>> It's very interesting that we see -40% tput w/ the patches. I went back
>
Sad that we see -40% throughput... but do we know what CPU cores the
now three different tasks/processes run on(?)
> Oh no, I messed up something =\
> > Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
>
>> and double checked and it seems the numbers are right. Here's the
>> some output from some profiles I took with:
>>
>> perf record -e cycles:k -a -- sleep 10
>> perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
>>
>> # Event 'cycles:k'
>> # Baseline Delta Abs Shared Object Symbol
>> 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter
>
I really appreciate that you provide perf data and perf diff, but as
described above, we need data and information on what CPU cores are
running which workload.
Fortunately perf diff (and perf report) support sorting like this:
perf diff --sort=cpu,symbol
But then you also need to control the CPUs used in the experiment for the
diff to work.
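For illustration, assuming the experiment is pinned to CPUs 2 and 4, a
per-CPU-aware profiling round could be done like this:

  # record kernel cycles only on the CPUs involved in the experiment
  perf record -e cycles:k -C 2,4 -o perf.data.baseline -- sleep 10
  perf record -e cycles:k -C 2,4 -o perf.data.withpatches -- sleep 10
  # diff with per-CPU resolution
  perf --no-pager diff --sort=cpu,symbol perf.data.baseline perf.data.withpatches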
I hope this makes sense; these kinds of CPU scaling benchmarks are tricky.
--Jesper
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-25 18:50 ` Jesper Dangaard Brouer
@ 2024-11-25 21:53 ` Daniel Xu
2024-11-25 22:19 ` Lorenzo Bianconi
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Xu @ 2024-11-25 21:53 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Lobakin, Lorenzo Bianconi, bpf@vger.kernel.org,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau, David Miller,
Eric Dumazet, Paolo Abeni, netdev, Lorenzo Bianconi, kernel-team,
mfleming
Hi Jesper,
On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
>
>
> On 25/11/2024 16.12, Alexander Lobakin wrote:
> > From: Daniel Xu <dxu@dxuuu.xyz>
> > Date: Fri, 22 Nov 2024 17:10:06 -0700
> >
> > > Hi Olek,
> > >
> > > Here are the results.
> > >
> > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > >
> > > >
> > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> >
> > [...]
> >
> > > Baseline (again)
> > >
> > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> > > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
> > > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
> > > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
> > > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
> > > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
> > > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
> > >
>
> We need to talk about what we are measuring, and how to control the
> experiment setup to get reproducible results.
> Especially controlling on what CPU cores our code paths are executing.
>
> In above "baseline" case, we have two processes/tasks executing:
> (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
> (2) Userspace netserver process TCP receiving data from socket.
"baseline" in this case is still cpumap, just without these GRO patches.
>
> My experience is that you will see two noticeable different
> throughput performance results depending on whether (1) and (2) is
> executing on the *same* CPU (multi-tasking context-switching),
> or executing in parallel (e.g. pinned) on two different CPU cores.
>
> The netperf command have an option
>
> -T lcpu,remcpu
> Request that netperf be bound to local CPU lcpu and/or netserver be
> bound to remote CPU rcpu.
>
> Verify setting by listing pinning like this:
> for PID in $(pidof netserver); do taskset -pc $PID ; done
>
> You can also set pinning runtime like this:
> export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID;
> done
>
> For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> output and adjust pinning runtime to observe the effect quickly.
>
> My experience is unfortunately that TCP results have a lot of variation
> (thanks for incliding 5 runs in your benchmarks), as it depends on tasks
> timing, that can get affected by CPU sleep states. The systems CPU
> latency setting can be seen in /dev/cpu_dma_latency, which can be read
> like this:
>
> sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
>
> For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
> as it requires holding the file open. E.g I play with these profiles:
>
> sudo tuned-adm profile throughput-performance
> sudo tuned-adm profile latency-performance
> sudo tuned-adm profile network-latency
Appreciate the tips - I should keep this saved somewhere.
>
>
> > > cpumap v2 Olek
> > >
> > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> > > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
> > > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
> > > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
> > > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
> > > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
> > > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
> > > Delta 0.92% -0.53% 0.33% 0.85% -41.32%
> > >
> > >
>
>
> We now three processes/tasks executing:
> (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
> (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
> (3) Userspace netserver process TCP receiving data from socket.
>
> Again, now the performance is going to depend on depending on which CPU
> cores the processes/tasks are running and whether some are sharing the
> same CPU. (There are both wakeup timing and cache-line effects).
>
> There are now more combinations to test...
>
> CPUmap is a CPU scaling facility, and you will likely also see different
> CPU utilization on the difference cores one you start to pin these to
> control the scenarios.
>
> > > It's very interesting that we see -40% tput w/ the patches. I went back
> >
>
> Sad that we see -40% throughput... but do we know what CPU cores the
> now three different tasks/processes run on(?)
>
Roughly, yes. For context, my primary use case for cpumap is to provide
some degree of isolation between colocated containers on a single host.
In particular, colocation occurs on AMD Bergamo. And containers are
CPU pinned to their own CCX (roughly). My RX steering program ensures
RX packets destined to a specific container are cpumap redirected to any
of the container's pinned CPUs. It not only provides a good measure of
isolation but ensures resources are properly accounted.
So to answer your question of which CPUs the 3 things run on: the cpumap
kthread and the application run on the same set of cores. More than that,
they share the same L3 cache by design. IRQ/softirq placement is effectively
random given the default RSS config and IRQ affinities.
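For reference, a rough way to check where the RX interrupts actually land
(the interface name and IRQ number are placeholders):

  # per-CPU interrupt counters for the NIC queues
  grep eth0 /proc/interrupts
  # current affinity of one of those IRQs, number taken from the output above
  cat /proc/irq/123/smp_affinity_list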
>
> > Oh no, I messed up something =\
> > > Could you please also test not the whole series, but patches 1-3 (up to
> > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > array...")? Would be great to see whether this implementation works
> > worse right from the start or I just broke something later on.
> >
> > > and double checked and it seems the numbers are right. Here's the
> > > some output from some profiles I took with:
> > >
> > > perf record -e cycles:k -a -- sleep 10
> > > perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > >
> > > # Event 'cycles:k'
> > > # Baseline Delta Abs Shared Object Symbol
> > > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter
> >
>
> I really appreciate that you provide perf data and perf diff, but as
> described above, we need data and information on what CPU cores are
> running which workload.
>
> Fortunately perf diff (and perf report) support doing like this:
> perf diff --sort=cpu,symbol
>
> But then you also need to control the CPUs used in experiment for the
> diff to work.
>
> I hope I made sense as these kind of CPU scaling benchmarks are tricky,
Indeed, sounds quite tricky.
My understanding is that GRO is a powerful general-purpose
optimization, enough so that it should rise above the usual noise on a
reasonably configured system (which mine is).
Maybe we can consider decoupling the cpumap GRO enablement from the
later optimizations?
So in Olek's above series, patches 1-3 seem like they would still
benefit from a simpler testbed. But the more targeted optimizations in
patch 4+ would probably justify a de-noised setup. Possibly a single host
with xdp-trafficgen or something.
Procedurally speaking, maybe it would save some wasted effort if
everyone agreed on the general approach before investing more time into
finer optimizations built on top of the basic GRO support?
Thanks,
Daniel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-25 21:53 ` Daniel Xu
@ 2024-11-25 22:19 ` Lorenzo Bianconi
0 siblings, 0 replies; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-11-25 22:19 UTC (permalink / raw)
To: Daniel Xu
Cc: Jesper Dangaard Brouer, Alexander Lobakin, Lorenzo Bianconi,
bpf@vger.kernel.org, Jakub Kicinski, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, John Fastabend,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev,
kernel-team, mfleming
> Hi Jesper,
>
> On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> >
> >
> > On 25/11/2024 16.12, Alexander Lobakin wrote:
> > > From: Daniel Xu <dxu@dxuuu.xyz>
> > > Date: Fri, 22 Nov 2024 17:10:06 -0700
> > >
> > > > Hi Olek,
> > > >
> > > > Here are the results.
> > > >
> > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > > >
> > > > >
> > > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > >
> > > [...]
> > >
> > > > Baseline (again)
> > > >
> > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> > > > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
> > > > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
> > > > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
> > > > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
> > > > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
> > > > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
> > > >
> >
> > We need to talk about what we are measuring, and how to control the
> > experiment setup to get reproducible results.
> > Especially controlling on what CPU cores our code paths are executing.
> >
> > In above "baseline" case, we have two processes/tasks executing:
> > (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
> > (2) Userspace netserver process TCP receiving data from socket.
>
> "baseline" in this case is still cpumap, just without these GRO patches.
>
> >
> > My experience is that you will see two noticeable different
> > throughput performance results depending on whether (1) and (2) is
> > executing on the *same* CPU (multi-tasking context-switching),
> > or executing in parallel (e.g. pinned) on two different CPU cores.
> >
> > The netperf command have an option
> >
> > -T lcpu,remcpu
> > Request that netperf be bound to local CPU lcpu and/or netserver be
> > bound to remote CPU rcpu.
> >
> > Verify setting by listing pinning like this:
> > for PID in $(pidof netserver); do taskset -pc $PID ; done
> >
> > You can also set pinning runtime like this:
> > export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID;
> > done
> >
> > For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> > output and adjust pinning runtime to observe the effect quickly.
> >
> > My experience is unfortunately that TCP results have a lot of variation
> > (thanks for incliding 5 runs in your benchmarks), as it depends on tasks
> > timing, that can get affected by CPU sleep states. The systems CPU
> > latency setting can be seen in /dev/cpu_dma_latency, which can be read
> > like this:
> >
> > sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
> >
> > For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
> > as it requires holding the file open. E.g I play with these profiles:
> >
> > sudo tuned-adm profile throughput-performance
> > sudo tuned-adm profile latency-performance
> > sudo tuned-adm profile network-latency
>
> Appreciate the tips - I should keep this saved somewhere.
>
> >
> >
> > > > cpumap v2 Olek
> > > >
> > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> > > > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
> > > > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
> > > > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
> > > > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
> > > > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
> > > > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
> > > > Delta 0.92% -0.53% 0.33% 0.85% -41.32%
> > > >
> > > >
> >
> >
> > We now three processes/tasks executing:
> > (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
> > (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
> > (3) Userspace netserver process TCP receiving data from socket.
> >
> > Again, now the performance is going to depend on depending on which CPU
> > cores the processes/tasks are running and whether some are sharing the
> > same CPU. (There are both wakeup timing and cache-line effects).
> >
> > There are now more combinations to test...
> >
> > CPUmap is a CPU scaling facility, and you will likely also see different
> > CPU utilization on the difference cores one you start to pin these to
> > control the scenarios.
> >
> > > > It's very interesting that we see -40% tput w/ the patches. I went back
> > >
> >
> > Sad that we see -40% throughput... but do we know what CPU cores the
> > now three different tasks/processes run on(?)
> >
>
> Roughly, yes. For context, my primary use case for cpumap is to provide
> some degree of isolation between colocated containers on a single host.
> In particular, colocation occurs on AMD Bergamo. And containers are
> CPU pinned to their own CCX (roughly). My RX steering program ensures
> RX packets destined to a specific container are cpumap redirected to any
> of the container's pinned CPUs. It not only provides a good measure of
> isolation but ensures resources are properly accounted.
>
> So to answer your question of which CPUs the 3 things run on: cpumap
> kthread and application run on the same set of cores. More than that,
> they share the same L3 cache by design. irq/softirq is effectively
> random given default RSS config and IRQ affinities.
>
>
> >
> > > Oh no, I messed up something =\
> > > > Could you please also test not the whole series, but patches 1-3 (up to
> > > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > > array...")? Would be great to see whether this implementation works
> > > worse right from the start or I just broke something later on.
> > >
> > > > and double checked and it seems the numbers are right. Here's the
> > > > some output from some profiles I took with:
> > > >
> > > > perf record -e cycles:k -a -- sleep 10
> > > > perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > > >
> > > > # Event 'cycles:k'
> > > > # Baseline Delta Abs Shared Object Symbol
> > > > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter
> > >
> >
> > I really appreciate that you provide perf data and perf diff, but as
> > described above, we need data and information on what CPU cores are
> > running which workload.
> >
> > Fortunately perf diff (and perf report) support doing like this:
> > perf diff --sort=cpu,symbol
> >
> > But then you also need to control the CPUs used in experiment for the
> > diff to work.
> >
> > I hope I made sense as these kind of CPU scaling benchmarks are tricky,
>
> Indeed, sounds quite tricky.
>
> My understanding with GRO is that it's a powerful general purpose
> optimization. Enough that it should rise above the usual noise on a
> reasonably configured system (which mine is).
>
> Maybe we can consider decoupling the cpumap GRO enablement with the
> later optimizations?
I agree. First, we need to identify the best approach to enable GRO on cpumap
(between Olek's approach and what I have suggested) and then we can evaluate
subsequent optimizations.
@Olek: do you agree?
Regards,
Lorenzo
>
> So in Olek's above series, patches 1-3 seem like they would still
> benefit from an simpler testbed. But the more targetted optimizations in
> patch 4+ would probably justify a de-noised setup. Possibly single host
> with xdp-trafficgen or something.
>
> Procedurally speaking, maybe it would save some wasted effort if
> everyone agreed on the general approach before investing more time into
> finer optimizations built on top of the basic GRO support?
>
> Thanks,
> Daniel
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-25 15:12 ` Alexander Lobakin
2024-11-25 17:03 ` Daniel Xu
2024-11-25 18:50 ` Jesper Dangaard Brouer
@ 2024-11-25 22:56 ` Daniel Xu
2024-11-26 10:36 ` Alexander Lobakin
2 siblings, 1 reply; 36+ messages in thread
From: Daniel Xu @ 2024-11-25 22:56 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Lorenzo Bianconi, bpf@vger.kernel.org, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev, Lorenzo Bianconi
On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
>> Hi Olek,
>>
>> Here are the results.
>>
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
>> Baseline (again)
>>
>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
>>
>> cpumap v2 Olek
>>
>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
>>
>>
>> It's very interesting that we see -40% tput w/ the patches. I went back
>
> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
Patches 1-3 reproduce the -40% tput numbers.
With patches 1-4 the numbers get slightly worse (~1 Gbps lower), but the results were noisy.
tcp_rr results were unaffected.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-25 22:56 ` Daniel Xu
@ 2024-11-26 10:36 ` Alexander Lobakin
2024-11-26 17:02 ` Lorenzo Bianconi
2024-12-02 22:47 ` Jakub Kicinski
0 siblings, 2 replies; 36+ messages in thread
From: Alexander Lobakin @ 2024-11-26 10:36 UTC (permalink / raw)
To: Daniel Xu, Jakub Kicinski, Lorenzo Bianconi
Cc: Lorenzo Bianconi, bpf@vger.kernel.org, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, John Fastabend,
Jesper Dangaard Brouer, Martin KaFai Lau, David Miller,
Eric Dumazet, Paolo Abeni, netdev
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Mon, 25 Nov 2024 16:56:49 -0600
>
>
> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>> From: Daniel Xu <dxu@dxuuu.xyz>
>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>
>>> Hi Olek,
>>>
>>> Here are the results.
>>>
>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>
>>>>
>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>
>> [...]
>>
>>> Baseline (again)
>>>
>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
>>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
>>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
>>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
>>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
>>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
>>>
>>> cpumap v2 Olek
>>>
>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
>>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
>>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
>>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
>>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
>>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
>>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
>>>
>>>
>>> It's very interesting that we see -40% tput w/ the patches. I went back
>>
>> Oh no, I messed up something =\
>>
>> Could you please also test not the whole series, but patches 1-3 (up to
>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>> array...")? Would be great to see whether this implementation works
>> worse right from the start or I just broke something later on.
>
> Patches 1-3 reproduces the -40% tput numbers.
Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
cpumap's kthreads instead of NAPI) really performs worse than switching
cpumap to NAPI.
>
> With patches 1-4 the numbers get slightly worse (~1gbps lower) but it was noisy.
Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up on it.
>
> tcp_rr results were unaffected.
@ Jakub,
Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
least for now =\ I took a look at the backlog NAPI and it could be used,
although we'd need a pointer in the backlog to the corresponding cpumap
+ also some synchronization point to make sure the backlog NAPI won't access
an already destroyed cpumap.
Maybe Lorenzo could take a look...
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-26 10:36 ` Alexander Lobakin
@ 2024-11-26 17:02 ` Lorenzo Bianconi
2024-11-26 17:12 ` Jesper Dangaard Brouer
2024-12-02 22:47 ` Jakub Kicinski
1 sibling, 1 reply; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-11-26 17:02 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Jakub Kicinski, Lorenzo Bianconi, bpf@vger.kernel.org,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Mon, 25 Nov 2024 16:56:49 -0600
>
> >
> >
> > On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> >> From: Daniel Xu <dxu@dxuuu.xyz>
> >> Date: Fri, 22 Nov 2024 17:10:06 -0700
> >>
> >>> Hi Olek,
> >>>
> >>> Here are the results.
> >>>
> >>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> >>>>
> >>>>
> >>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> >>
> >> [...]
> >>
> >>> Baseline (again)
> >>>
> >>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
> >>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
> >>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
> >>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
> >>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
> >>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
> >>>
> >>> cpumap v2 Olek
> >>>
> >>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
> >>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
> >>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
> >>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
> >>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
> >>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
> >>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
> >>>
> >>>
> >>> It's very interesting that we see -40% tput w/ the patches. I went back
> >>
> >> Oh no, I messed up something =\
> >>
> >> Could you please also test not the whole series, but patches 1-3 (up to
> >> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> >> array...")? Would be great to see whether this implementation works
> >> worse right from the start or I just broke something later on.
> >
> > Patches 1-3 reproduces the -40% tput numbers.
>
> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
> cpumap's kthreads instead of NAPI) really performs worse than switching
> cpumap to NAPI.
>
> >
> > With patches 1-4 the numbers get slightly worse (~1gbps lower) but it was noisy.
>
> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up on it.
>
> >
> > tcp_rr results were unaffected.
>
> @ Jakub,
>
> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
> least for now =\ I took a look on the backlog NAPI and it could be used,
> although we'd need a pointer in the backlog to the corresponding cpumap
> + also some synchronization point to make sure backlog NAPI won't access
> already destroyed cpumap.
>
> Maybe Lorenzo could take a look...
It seems to me the only difference would be that we would use the shared
backlog_napi kthreads instead of having a dedicated kthread for each cpumap
entry, but we would still need the NAPI poll logic. I can look into it if you
prefer the shared-kthread approach.
@Jakub: what do you think?
Regards,
Lorenzo
>
> Thanks,
> Olek
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-26 17:02 ` Lorenzo Bianconi
@ 2024-11-26 17:12 ` Jesper Dangaard Brouer
2024-11-28 10:41 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2024-11-26 17:12 UTC (permalink / raw)
To: Lorenzo Bianconi, Alexander Lobakin
Cc: Daniel Xu, Jakub Kicinski, Lorenzo Bianconi, bpf@vger.kernel.org,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, David Miller, Eric Dumazet,
Paolo Abeni, netdev
On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>> From: Daniel Xu <dxu@dxuuu.xyz>
>> Date: Mon, 25 Nov 2024 16:56:49 -0600
>>
>>>
>>>
>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>>>
>>>>> Hi Olek,
>>>>>
>>>>> Here are the results.
>>>>>
>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>>>
>>>> [...]
>>>>
>>>>> Baseline (again)
>>>>>
>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
>>>>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
>>>>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
>>>>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
>>>>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
>>>>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
>>>>>
>>>>> cpumap v2 Olek
>>>>>
>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
>>>>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
>>>>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
>>>>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
>>>>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
>>>>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
>>>>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
>>>>>
>>>>>
>>>>> It's very interesting that we see -40% tput w/ the patches. I went back
>>>>
>>>> Oh no, I messed up something =\
>>>>
>>>> Could you please also test not the whole series, but patches 1-3 (up to
>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>>>> array...")? Would be great to see whether this implementation works
>>>> worse right from the start or I just broke something later on.
>>>
>>> Patches 1-3 reproduces the -40% tput numbers.
>>
>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
>> cpumap's kthreads instead of NAPI) really performs worse than switching
>> cpumap to NAPI.
>>
>>>
>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but it was noisy.
>>
>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up on it.
>>
>>>
>>> tcp_rr results were unaffected.
>>
>> @ Jakub,
>>
>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
>> least for now =\ I took a look on the backlog NAPI and it could be used,
>> although we'd need a pointer in the backlog to the corresponding cpumap
>> + also some synchronization point to make sure backlog NAPI won't access
>> already destroyed cpumap.
>>
>> Maybe Lorenzo could take a look...
>
> it seems to me the only difference would be we will use the shared backlog_napi
> kthreads instead of having a dedicated kthread for each cpumap entry but we still
> need the napi poll logic. I can look into it if you prefer the shared kthread
> approach.
I don't like a shared kthread approach. For my use-case I want to give
the "remote" CPU-map kthreads higher scheduling priority (as they will be
running a 2nd XDP BPF DDoS program protecting against overload by
dropping packets).
Thus, I'm not a fan of using the shared backlog_napi, as I don't want
to give backlog NAPI high priority in my use-case.
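As an illustration of the kind of tuning this allows (the kthread name
match and the priority value are only examples), the per-entry cpumap
kthreads can be boosted like this:

  # give every cpumap kthread SCHED_FIFO priority 50
  for PID in $(pgrep cpumap); do sudo chrt -f -p 50 $PID; done
  # verify the new policy/priority
  for PID in $(pgrep cpumap); do chrt -p $PID; done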
> @Jakub: what do you think?
--Jesper
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-26 17:12 ` Jesper Dangaard Brouer
@ 2024-11-28 10:41 ` Alexander Lobakin
2024-11-28 10:56 ` Lorenzo Bianconi
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-11-28 10:41 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Lorenzo Bianconi
Cc: Daniel Xu, Jakub Kicinski, Lorenzo Bianconi, bpf@vger.kernel.org,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, David Miller, Eric Dumazet,
Paolo Abeni, netdev
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 26 Nov 2024 18:12:27 +0100
>
>
>
> On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>> Date: Mon, 25 Nov 2024 16:56:49 -0600
>>>
>>>>
>>>>
>>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>>>>
>>>>>> Hi Olek,
>>>>>>
>>>>>> Here are the results.
>>>>>>
>>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>> Baseline (again)
>>>>>>
>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
>>>>>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
>>>>>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
>>>>>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
>>>>>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
>>>>>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
>>>>>>
>>>>>> cpumap v2 Olek
>>>>>>
>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
>>>>>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
>>>>>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
>>>>>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
>>>>>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
>>>>>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
>>>>>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
>>>>>>
>>>>>>
>>>>>> It's very interesting that we see -40% tput w/ the patches. I went
>>>>>> back
>>>>>
>>>>> Oh no, I messed up something =\
>>>>>
>>>>> Could you please also test not the whole series, but patches 1-3
>>>>> (up to
>>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>>>>> array...")? Would be great to see whether this implementation works
>>>>> worse right from the start or I just broke something later on.
>>>>
>>>> Patches 1-3 reproduces the -40% tput numbers.
>>>
>>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
>>> cpumap's kthreads instead of NAPI) really performs worse than switching
>>> cpumap to NAPI.
>>>
>>>>
>>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but
>>>> it was noisy.
>>>
>>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up
>>> on it.
>>>
>>>>
>>>> tcp_rr results were unaffected.
>>>
>>> @ Jakub,
>>>
>>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
>>> least for now =\ I took a look on the backlog NAPI and it could be used,
>>> although we'd need a pointer in the backlog to the corresponding cpumap
>>> + also some synchronization point to make sure backlog NAPI won't access
>>> already destroyed cpumap.
>>>
>>> Maybe Lorenzo could take a look...
>>
>> it seems to me the only difference would be we will use the shared
>> backlog_napi
>> kthreads instead of having a dedicated kthread for each cpumap entry
>> but we still
>> need the napi poll logic. I can look into it if you prefer the shared
>> kthread
>> approach.
>
> I don't like a shared kthread approach. For my use-case I want to give
> the "remote" CPU-map kthreads higher scheduling priority. (As it will be
> running a 2nd XDP BPF DDoS program protecting against overload by
> dropping packets).
Oh, that is also valid.
Let's see what Jakub replies; for now I'm leaning towards posting the
approach from this RFC with my bulk allocation from the NAPI cache.
>
> Thus, I'm not a fan of using the shared backlog_napi. As I don't want
> to give backlog NAPI high priority, in my use-case.
>
>> @Jakub: what do you think?
>
>
> --Jesper
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-28 10:41 ` Alexander Lobakin
@ 2024-11-28 10:56 ` Lorenzo Bianconi
2024-11-28 10:57 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Bianconi @ 2024-11-28 10:56 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Jesper Dangaard Brouer, Lorenzo Bianconi, Daniel Xu,
Jakub Kicinski, bpf@vger.kernel.org, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, John Fastabend,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> Date: Tue, 26 Nov 2024 18:12:27 +0100
>
> >
> >
> >
> > On 26/11/2024 18.02, Lorenzo Bianconi wrote:
> >>> From: Daniel Xu <dxu@dxuuu.xyz>
> >>> Date: Mon, 25 Nov 2024 16:56:49 -0600
> >>>
> >>>>
> >>>>
> >>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> >>>>> From: Daniel Xu <dxu@dxuuu.xyz>
> >>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
> >>>>>
> >>>>>> Hi Olek,
> >>>>>>
> >>>>>> Here are the results.
> >>>>>>
> >>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>> Baseline (again)
> >>>>>>
> >>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >>>>>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
> >>>>>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
> >>>>>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
> >>>>>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
> >>>>>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
> >>>>>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
> >>>>>>
> >>>>>> cpumap v2 Olek
> >>>>>>
> >>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
> >>>>>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
> >>>>>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
> >>>>>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
> >>>>>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
> >>>>>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
> >>>>>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
> >>>>>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
> >>>>>>
> >>>>>>
> >>>>>> It's very interesting that we see -40% tput w/ the patches. I went
> >>>>>> back
> >>>>>
> >>>>> Oh no, I messed up something =\
> >>>>>
> >>>>> Could you please also test not the whole series, but patches 1-3
> >>>>> (up to
> >>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> >>>>> array...")? Would be great to see whether this implementation works
> >>>>> worse right from the start or I just broke something later on.
> >>>>
> >>>> Patches 1-3 reproduces the -40% tput numbers.
> >>>
> >>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
> >>> cpumap's kthreads instead of NAPI) really performs worse than switching
> >>> cpumap to NAPI.
> >>>
> >>>>
> >>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but
> >>>> it was noisy.
> >>>
> >>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up
> >>> on it.
> >>>
> >>>>
> >>>> tcp_rr results were unaffected.
> >>>
> >>> @ Jakub,
> >>>
> >>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
> >>> least for now =\ I took a look on the backlog NAPI and it could be used,
> >>> although we'd need a pointer in the backlog to the corresponding cpumap
> >>> + also some synchronization point to make sure backlog NAPI won't access
> >>> already destroyed cpumap.
> >>>
> >>> Maybe Lorenzo could take a look...
> >>
> >> it seems to me the only difference would be we will use the shared
> >> backlog_napi
> >> kthreads instead of having a dedicated kthread for each cpumap entry
> >> but we still
> >> need the napi poll logic. I can look into it if you prefer the shared
> >> kthread
> >> approach.
> >
> > I don't like a shared kthread approach. For my use-case I want to give
> > the "remote" CPU-map kthreads higher scheduling priority. (As it will be
> > running a 2nd XDP BPF DDoS program protecting against overload by
> > dropping packets).
>
> Oh, that is also valid.
> Let's see what Jakub replies, for now I'm leaning towards posting
> approach from this RFC with my bulk allocation from the NAPI cache.
I guess it would be better to keep them separate, so we can check the effects
of each change (GRO for cpumap and bulk allocation). You could post your
changes on top of mine if we all agree the proposed approach is fine.
What do you think?
Regards,
Lorenzo
>
> >
> > Thus, I'm not a fan of using the shared backlog_napi. As I don't want
> > to give backlog NAPI high priority, in my use-case.
> >
> >> @Jakub: what do you think?
> >
> >
> > --Jesper
>
> Thanks,
> Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-28 10:56 ` Lorenzo Bianconi
@ 2024-11-28 10:57 ` Alexander Lobakin
0 siblings, 0 replies; 36+ messages in thread
From: Alexander Lobakin @ 2024-11-28 10:57 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Jesper Dangaard Brouer, Lorenzo Bianconi, Daniel Xu,
Jakub Kicinski, bpf@vger.kernel.org, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, John Fastabend,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Thu, 28 Nov 2024 11:56:24 +0100
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>> Date: Tue, 26 Nov 2024 18:12:27 +0100
>>
>>>
>>>
>>>
>>> On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>>> Date: Mon, 25 Nov 2024 16:56:49 -0600
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>>>>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>>>>>>
>>>>>>>> Hi Olek,
>>>>>>>>
>>>>>>>> Here are the results.
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>> Baseline (again)
>>>>>>>>
>>>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>>>> Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43
>>>>>>>> Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17
>>>>>>>> Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82
>>>>>>>> Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15
>>>>>>>> Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06
>>>>>>>> Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126
>>>>>>>>
>>>>>>>> cpumap v2 Olek
>>>>>>>>
>>>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s)
>>>>>>>> Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57
>>>>>>>> Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53
>>>>>>>> Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38
>>>>>>>> Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88
>>>>>>>> Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22
>>>>>>>> Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316
>>>>>>>> Delta 0.92% -0.53% 0.33% 0.85% -41.32%
>>>>>>>>
>>>>>>>>
>>>>>>>> It's very interesting that we see -40% tput w/ the patches. I went
>>>>>>>> back
>>>>>>>
>>>>>>> Oh no, I messed up something =\
>>>>>>>
>>>>>>> Could you please also test not the whole series, but patches 1-3
>>>>>>> (up to
>>>>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>>>>>>> array...")? Would be great to see whether this implementation works
>>>>>>> worse right from the start or I just broke something later on.
>>>>>>
>>>>>> Patches 1-3 reproduces the -40% tput numbers.
>>>>>
>>>>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
>>>>> cpumap's kthreads instead of NAPI) really performs worse than switching
>>>>> cpumap to NAPI.
>>>>>
>>>>>>
>>>>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but
>>>>>> it was noisy.
>>>>>
>>>>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up
>>>>> on it.
>>>>>
>>>>>>
>>>>>> tcp_rr results were unaffected.
>>>>>
>>>>> @ Jakub,
>>>>>
>>>>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
>>>>> least for now =\ I took a look on the backlog NAPI and it could be used,
>>>>> although we'd need a pointer in the backlog to the corresponding cpumap
>>>>> + also some synchronization point to make sure backlog NAPI won't access
>>>>> already destroyed cpumap.
>>>>>
>>>>> Maybe Lorenzo could take a look...
>>>>
>>>> it seems to me the only difference would be we will use the shared
>>>> backlog_napi
>>>> kthreads instead of having a dedicated kthread for each cpumap entry
>>>> but we still
>>>> need the napi poll logic. I can look into it if you prefer the shared
>>>> kthread
>>>> approach.
>>>
>>> I don't like a shared kthread approach. For my use-case I want to give
>>> the "remote" CPU-map kthreads higher scheduling priority. (As it will be
>>> running a 2nd XDP BPF DDoS program protecting against overload by
>>> dropping packets).
>>
>> Oh, that is also valid.
>> Let's see what Jakub replies, for now I'm leaning towards posting
>> approach from this RFC with my bulk allocation from the NAPI cache.
>
> I guess it would be better to keep them separated to check what are the effects
> of each change (GRO for cpumap and bulk allocation). I guess you can post your
> changes on top of mine if we all agree the proposed approach is fine.
> What do you think?
Sounds good as well, I don't have any preference here.
>
> Regards,
> Lorenzo
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-11-26 10:36 ` Alexander Lobakin
2024-11-26 17:02 ` Lorenzo Bianconi
@ 2024-12-02 22:47 ` Jakub Kicinski
2024-12-03 11:01 ` Alexander Lobakin
1 sibling, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2024-12-02 22:47 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Lorenzo Bianconi, Lorenzo Bianconi,
bpf@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Jesper Dangaard Brouer,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
On Tue, 26 Nov 2024 11:36:53 +0100 Alexander Lobakin wrote:
> > tcp_rr results were unaffected.
>
> @ Jakub,
Context? What doesn't work and why?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-02 22:47 ` Jakub Kicinski
@ 2024-12-03 11:01 ` Alexander Lobakin
2024-12-04 0:51 ` Jakub Kicinski
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-12-03 11:01 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Daniel Xu, Lorenzo Bianconi, Lorenzo Bianconi,
bpf@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Jesper Dangaard Brouer,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 2 Dec 2024 14:47:39 -0800
> On Tue, 26 Nov 2024 11:36:53 +0100 Alexander Lobakin wrote:
>>> tcp_rr results were unaffected.
>>
>> @ Jakub,
>
> Context? What doesn't work and why?
My tests show the same perf as Lorenzo's series, but I test with a UDP
trafficgen. Daniel tests TCP and the results are much worse than with
Lorenzo's implementation.
I suspect this is related to how NAPI performs flushes / decides
whether to repoll again or exit versus how the kthread does that (even
though I also try to flush only every 64 frames or when the ring is
empty). Or maybe to the fact that part of the kthread's work happens in
process context outside any softirq, while with NAPI the whole loop runs
inside the RX softirq.
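A rough way to see which context is doing the work during a run (the CPU
list and name patterns are just examples):

  # per-CPU split between softirq (%soft) and process-context (%sys) time
  mpstat -P 2,4 1
  # which CPU the cpumap / NAPI kthreads are currently on
  ps -eo pid,psr,comm | grep -E 'cpumap|napi'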
Jesper said that he'd like to see cpumap still using its own kthread, so
that its priority can be boosted separately from the backlog. That's why
we asked you whether it would be fine to have cpumap as threaded NAPI with
regard to all this :D
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-03 11:01 ` Alexander Lobakin
@ 2024-12-04 0:51 ` Jakub Kicinski
2024-12-04 16:42 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2024-12-04 0:51 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Lorenzo Bianconi, Lorenzo Bianconi,
bpf@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Jesper Dangaard Brouer,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
> >> @ Jakub,
> >
> > Context? What doesn't work and why?
>
> My tests show the same perf as on Lorenzo's series, but I test with UDP
> trafficgen. Daniel tests TCP and the results are much worse than with
> Lorenzo's implementation.
> I suspect this is related to that how NAPI performs flushes / decides
> whether to repoll again or exit vs how kthread does that (even though I
> also try to flush only every 64 frames or when the ring is empty). Or
> maybe to that part of the kthread happens in process context outside any
> softirq, while when using NAPI, the whole loop is inside RX softirq.
>
> Jesper said that he'd like to see cpumap still using own kthread, so
> that its priority can be boosted separately from the backlog. That's why
> we asked you whether it would be fine to have cpumap as threaded NAPI in
> regards to all this :D
Certainly not without a clear understanding of what the problem with
a kthread is.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-04 0:51 ` Jakub Kicinski
@ 2024-12-04 16:42 ` Alexander Lobakin
2024-12-04 21:51 ` Daniel Xu
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-12-04 16:42 UTC (permalink / raw)
To: Jakub Kicinski, Daniel Xu
Cc: Lorenzo Bianconi, Lorenzo Bianconi, bpf@vger.kernel.org,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev
From: Jakub Kicinski <kuba@kernel.org>
Date: Tue, 3 Dec 2024 16:51:57 -0800
> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>> @ Jakub,
>>>
>>> Context? What doesn't work and why?
>>
>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>> trafficgen. Daniel tests TCP and the results are much worse than with
>> Lorenzo's implementation.
>> I suspect this is related to that how NAPI performs flushes / decides
>> whether to repoll again or exit vs how kthread does that (even though I
>> also try to flush only every 64 frames or when the ring is empty). Or
>> maybe to that part of the kthread happens in process context outside any
>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>
>> Jesper said that he'd like to see cpumap still using own kthread, so
>> that its priority can be boosted separately from the backlog. That's why
>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>> regards to all this :D
>
> Certainly not without a clear understanding what the problem with
> a kthread is.
Yes, sure thing.
The bad thing is that I can't reproduce Daniel's problem >_< Previously,
I was testing with the UDP trafficgen and got up to 80% improvement over
the baseline. Now I tested TCP and got up to 70% improvement, no
regressions whatsoever =\
I don't know where this regression on Daniel's setup comes from. Is it a
multi-thread or single-thread test? What app do you use: iperf, netperf,
neper, Microsoft's app (forgot the name)? Do you have multiple NUMA
nodes on your system, and are you sure you didn't cross the node when
redirecting with the GRO patches / that no other NUMA mismatches
happened? What about other random stuff like the RSS hash key, which
affects flow steering?
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-04 16:42 ` Alexander Lobakin
@ 2024-12-04 21:51 ` Daniel Xu
2024-12-05 10:38 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Xu @ 2024-12-04 21:51 UTC (permalink / raw)
To: Alexander Lobakin, Jakub Kicinski
Cc: Lorenzo Bianconi, Lorenzo Bianconi, bpf@vger.kernel.org,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev
On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Tue, 3 Dec 2024 16:51:57 -0800
>
>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>> @ Jakub,
>>>>
>>>> Context? What doesn't work and why?
>>>
>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>> Lorenzo's implementation.
>>> I suspect this is related to that how NAPI performs flushes / decides
>>> whether to repoll again or exit vs how kthread does that (even though I
>>> also try to flush only every 64 frames or when the ring is empty). Or
>>> maybe to that part of the kthread happens in process context outside any
>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>
>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>> that its priority can be boosted separately from the backlog. That's why
>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>> regards to all this :D
>>
>> Certainly not without a clear understanding what the problem with
>> a kthread is.
>
> Yes, sure thing.
>
> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
> was testing with the UDP trafficgen and got up to 80% improvement over
> the baseline. Now I tested TCP and got up to 70% improvement, no
> regressions whatsoever =\
>
> I don't know where this regression on Daniel's setup comes from. Is it
> multi-thread or single-thread test?
8 threads with 16 flows over them (-T8 -F16)
> What app do you use: iperf, netperf,
> neper, Microsoft's app (forgot the name)?
neper, tcp_stream.
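i.e. roughly the following, if I remember neper's client flags right
(plain tcp_stream on the server side):
  tcp_stream                            # server
  tcp_stream -c -H $SERVER -T8 -F16     # client, 8 threads, 16 flows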
> Do you have multiple NUMA
> nodes on your system, are you sure you didn't cross the node when
> redirecting with the GRO patches / no other NUMA mismatches happened?
Single node. Technically EPYC NPS=1, so there are some NUMA
characteristics, but I think the interconnect is supposed to hide them
fairly efficiently.
> Some other random stuff like RSS hash key, which affects flow steering?
Whatever the default is - I'd be willing to bet Kuba set up the
configuration at one point or another, so it's probably sane. And with 5
runs it seems unlikely the hashing would get unlucky and cause an
imbalance.
>
> Thanks,
> Olek
Since I've got the setup handy and am motivated to see this work land,
do you have any other pointers on what I should look for? I'll spend
some time looking at profiles to see if I can identify any hot spots
compared to softirq-based GRO.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-04 21:51 ` Daniel Xu
@ 2024-12-05 10:38 ` Alexander Lobakin
2024-12-05 11:06 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-12-05 10:38 UTC (permalink / raw)
To: Daniel Xu, Jakub Kicinski
Cc: Lorenzo Bianconi, Lorenzo Bianconi, bpf@vger.kernel.org,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Jesper Dangaard Brouer, Martin KaFai Lau,
David Miller, Eric Dumazet, Paolo Abeni, netdev
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Wed, 04 Dec 2024 13:51:08 -0800
>
>
> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>> From: Jakub Kicinski <kuba@kernel.org>
>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>
>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>> @ Jakub,
>>>>>
>>>>> Context? What doesn't work and why?
>>>>
>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>> Lorenzo's implementation.
>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>> maybe to that part of the kthread happens in process context outside any
>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>
>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>> that its priority can be boosted separately from the backlog. That's why
>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>> regards to all this :D
>>>
>>> Certainly not without a clear understanding what the problem with
>>> a kthread is.
>>
>> Yes, sure thing.
>>
>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>> was testing with the UDP trafficgen and got up to 80% improvement over
>> the baseline. Now I tested TCP and got up to 70% improvement, no
>> regressions whatsoever =\
>>
>> I don't know where this regression on Daniel's setup comes from. Is it
>> multi-thread or single-thread test?
>
> 8 threads with 16 flows over them (-T8 -F16)
>
>> What app do you use: iperf, netperf,
>> neper, Microsoft's app (forgot the name)?
>
> neper, tcp_stream.
Let me recheck with neper -T8 -F16, I'll post my results soon.
>
>> Do you have multiple NUMA
>> nodes on your system, are you sure you didn't cross the node when
>> redirecting with the GRO patches / no other NUMA mismatches happened?
>
> Single node. Technically EPYC NPS=1. So there are some numa characteristics
> but I think the interconnect is supposed to hide it fairly efficiently.
>
>> Some other random stuff like RSS hash key, which affects flow steering?
>
> Whatever the default is - I'd be willing to be Kuba set up the configuration
> at one point or another so it's probably sane. And with 5 runs it seems
> unlikely the hashing would get unlucky and cause an imbalance.
>
>>
>> Thanks,
>> Olek
>
> Since I've got the setup handy and am motivated to see this work land,
> do you have any other pointers for things I should look for? I'll spend some
> time looking at profiles to see if I can identify any hot spots compared to
> softirq based GRO.
>
> Thanks,
> Daniel
Thanks for helping with this!
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-05 10:38 ` Alexander Lobakin
@ 2024-12-05 11:06 ` Alexander Lobakin
2024-12-06 0:41 ` Daniel Xu
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-12-05 11:06 UTC (permalink / raw)
To: Daniel Xu
Cc: Jakub Kicinski, Lorenzo Bianconi, Lorenzo Bianconi,
bpf@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Jesper Dangaard Brouer,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu, 5 Dec 2024 11:38:11 +0100
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Wed, 04 Dec 2024 13:51:08 -0800
>
>>
>>
>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>> From: Jakub Kicinski <kuba@kernel.org>
>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>
>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>> @ Jakub,
>>>>>>
>>>>>> Context? What doesn't work and why?
>>>>>
>>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>>> Lorenzo's implementation.
>>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>>> maybe to that part of the kthread happens in process context outside any
>>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>>
>>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>>> that its priority can be boosted separately from the backlog. That's why
>>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>>> regards to all this :D
>>>>
>>>> Certainly not without a clear understanding what the problem with
>>>> a kthread is.
>>>
>>> Yes, sure thing.
>>>
>>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>>> was testing with the UDP trafficgen and got up to 80% improvement over
>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>> regressions whatsoever =\
>>>
>>> I don't know where this regression on Daniel's setup comes from. Is it
>>> multi-thread or single-thread test?
>>
>> 8 threads with 16 flows over them (-T8 -F16)
>>
>>> What app do you use: iperf, netperf,
>>> neper, Microsoft's app (forgot the name)?
>>
>> neper, tcp_stream.
>
> Let me recheck with neper -T8 -F16, I'll post my results soon.
kernel    direct T1    direct T8F16    cpumap    cpumap T8F16
clean     28           51              13        9
GRO       28           51              26        18
(all results in Gbps)
100% gain, no regressions =\
My XDP prog is simple (upstream xdp-tools repo with no changes):
numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
no-touch ens802f0np0
IOW it simply redirects everything to CPU 23 (same NUMA node) from any
Rx queue without looking into headers or packet.
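For reference, the redirect part of such a prog boils down to roughly the
following (a minimal sketch, not the actual xdp-bench source; the map
layout is illustrative and the CPU 23 entry still has to be populated
from user space with a queue size):
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int redirect_to_cpu(struct xdp_md *ctx)
{
	/* Steer every frame from any Rx queue to CPU 23 without touching
	 * headers or payload; a failed map lookup falls back to
	 * XDP_ABORTED (flags == 0).
	 */
	return bpf_redirect_map(&cpu_map, 23, 0);
}

char _license[] SEC("license") = "GPL";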
Do you test with more sophisticated XDP prog?
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-05 11:06 ` Alexander Lobakin
@ 2024-12-06 0:41 ` Daniel Xu
2024-12-06 15:06 ` Alexander Lobakin
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Xu @ 2024-12-06 0:41 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Jakub Kicinski, Lorenzo Bianconi, Lorenzo Bianconi,
bpf@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Jesper Dangaard Brouer,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 5 Dec 2024 11:38:11 +0100
>
> > From: Daniel Xu <dxu@dxuuu.xyz>
> > Date: Wed, 04 Dec 2024 13:51:08 -0800
> >
> >>
> >>
> >> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
> >>> From: Jakub Kicinski <kuba@kernel.org>
> >>> Date: Tue, 3 Dec 2024 16:51:57 -0800
> >>>
> >>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
> >>>>>>> @ Jakub,
> >>>>>>
> >>>>>> Context? What doesn't work and why?
> >>>>>
> >>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
> >>>>> trafficgen. Daniel tests TCP and the results are much worse than with
> >>>>> Lorenzo's implementation.
> >>>>> I suspect this is related to that how NAPI performs flushes / decides
> >>>>> whether to repoll again or exit vs how kthread does that (even though I
> >>>>> also try to flush only every 64 frames or when the ring is empty). Or
> >>>>> maybe to that part of the kthread happens in process context outside any
> >>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
> >>>>>
> >>>>> Jesper said that he'd like to see cpumap still using own kthread, so
> >>>>> that its priority can be boosted separately from the backlog. That's why
> >>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
> >>>>> regards to all this :D
> >>>>
> >>>> Certainly not without a clear understanding what the problem with
> >>>> a kthread is.
> >>>
> >>> Yes, sure thing.
> >>>
> >>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
> >>> was testing with the UDP trafficgen and got up to 80% improvement over
> >>> the baseline. Now I tested TCP and got up to 70% improvement, no
> >>> regressions whatsoever =\
> >>>
> >>> I don't know where this regression on Daniel's setup comes from. Is it
> >>> multi-thread or single-thread test?
> >>
> >> 8 threads with 16 flows over them (-T8 -F16)
> >>
> >>> What app do you use: iperf, netperf,
> >>> neper, Microsoft's app (forgot the name)?
> >>
> >> neper, tcp_stream.
> >
> > Let me recheck with neper -T8 -F16, I'll post my results soon.
>
> kernel direct T1 direct T8F16 cpumap cpumap T8F16
> clean 28 51 13 9 Gbps
> GRO 28 51 26 18 Gbps
>
> 100% gain, no regressions =\
>
> My XDP prog is simple (upstream xdp-tools repo with no changes):
>
> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
> no-touch ens802f0np0
>
> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
> Rx queue without looking into headers or packet.
> Do you test with more sophisticated XDP prog?
Great reminder... my prog is a bit more sophisticated. I forgot we were
doing latency tracking by inserting a timestamp into the frame metadata,
but not clearing it after it was read on the remote CPU, which disables
GRO. So the previous test was paying the penalty of the fixed GRO
overhead without getting any packet merges.
Once I fixed up the prog to reset the metadata pointer, I could see the
wins. Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput.
No latency changes.
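FWIW the fix amounts to handing the metadata space back on the cpumap
side once the timestamp has been consumed - roughly something like this
(hypothetical names, not my actual prog):
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp/cpumap")
int cpumap_consume_tstamp(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	__u64 *tstamp = (void *)(long)ctx->data_meta;

	if ((void *)(tstamp + 1) <= data) {
		/* ... record latency based on *tstamp here ... */

		/* Shrink the metadata area back to zero so the resulting
		 * skbs don't carry differing per-packet metadata, which
		 * keeps GRO from merging them.
		 */
		bpf_xdp_adjust_meta(ctx, sizeof(*tstamp));
	}

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";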
Sorry about the churn.
Daniel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-06 0:41 ` Daniel Xu
@ 2024-12-06 15:06 ` Alexander Lobakin
2024-12-06 23:36 ` Daniel Xu
0 siblings, 1 reply; 36+ messages in thread
From: Alexander Lobakin @ 2024-12-06 15:06 UTC (permalink / raw)
To: Daniel Xu
Cc: Jakub Kicinski, Lorenzo Bianconi, Lorenzo Bianconi,
bpf@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Jesper Dangaard Brouer,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Thu, 5 Dec 2024 17:41:27 -0700
> On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Thu, 5 Dec 2024 11:38:11 +0100
>>
>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>> Date: Wed, 04 Dec 2024 13:51:08 -0800
>>>
>>>>
>>>>
>>>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>>>> From: Jakub Kicinski <kuba@kernel.org>
>>>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>>>
>>>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>>>> @ Jakub,
>>>>>>>>
>>>>>>>> Context? What doesn't work and why?
>>>>>>>
>>>>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>>>>> Lorenzo's implementation.
>>>>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>>>>> maybe to that part of the kthread happens in process context outside any
>>>>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>>>>
>>>>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>>>>> that its priority can be boosted separately from the backlog. That's why
>>>>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>>>>> regards to all this :D
>>>>>>
>>>>>> Certainly not without a clear understanding what the problem with
>>>>>> a kthread is.
>>>>>
>>>>> Yes, sure thing.
>>>>>
>>>>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>>>>> was testing with the UDP trafficgen and got up to 80% improvement over
>>>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>>>> regressions whatsoever =\
>>>>>
>>>>> I don't know where this regression on Daniel's setup comes from. Is it
>>>>> multi-thread or single-thread test?
>>>>
>>>> 8 threads with 16 flows over them (-T8 -F16)
>>>>
>>>>> What app do you use: iperf, netperf,
>>>>> neper, Microsoft's app (forgot the name)?
>>>>
>>>> neper, tcp_stream.
>>>
>>> Let me recheck with neper -T8 -F16, I'll post my results soon.
>>
>> kernel direct T1 direct T8F16 cpumap cpumap T8F16
>> clean 28 51 13 9 Gbps
>> GRO 28 51 26 18 Gbps
>>
>> 100% gain, no regressions =\
>>
>> My XDP prog is simple (upstream xdp-tools repo with no changes):
>>
>> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
>> no-touch ens802f0np0
>>
>> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
>> Rx queue without looking into headers or packet.
>> Do you test with more sophisticated XDP prog?
>
> Great reminder... my prog is a bit more sophisticated. I forgot we were
> doing latency tracking by inserting a timestamp into frame metadata. But
> not clearing it after it was read on remote CPU, which disables GRO. So
> previous test was paying the penalty of fixed GRO overhead without
> getting any packet merges.
>
> Once I fixed up prog to reset metadata pointer I could see the wins.
> Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput. No
> latency changes.
>
> Sorry about the churn.
No problem, crap happens sometimes :)
Let me send my implementation on Monday-Wednesday. I'll include my UDP
and TCP test results, as well as yours (+18%).
BTW, it would be great if you could give me a Tested-by tag, as I assume
the tests were fine and it works for you?
Thanks,
Olek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
2024-12-06 15:06 ` Alexander Lobakin
@ 2024-12-06 23:36 ` Daniel Xu
0 siblings, 0 replies; 36+ messages in thread
From: Daniel Xu @ 2024-12-06 23:36 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Jakub Kicinski, Lorenzo Bianconi, Lorenzo Bianconi,
bpf@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Jesper Dangaard Brouer,
Martin KaFai Lau, David Miller, Eric Dumazet, Paolo Abeni, netdev
On Fri, Dec 6, 2024, at 7:06 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Thu, 5 Dec 2024 17:41:27 -0700
>
>> On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
>>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> Date: Thu, 5 Dec 2024 11:38:11 +0100
>>>
>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>> Date: Wed, 04 Dec 2024 13:51:08 -0800
>>>>
>>>>>
>>>>>
>>>>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>>>>> From: Jakub Kicinski <kuba@kernel.org>
>>>>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>>>>
>>>>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>>>>> @ Jakub,
>>>>>>>>>
>>>>>>>>> Context? What doesn't work and why?
>>>>>>>>
>>>>>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>>>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>>>>>> Lorenzo's implementation.
>>>>>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>>>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>>>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>>>>>> maybe to that part of the kthread happens in process context outside any
>>>>>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>>>>>
>>>>>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>>>>>> that its priority can be boosted separately from the backlog. That's why
>>>>>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>>>>>> regards to all this :D
>>>>>>>
>>>>>>> Certainly not without a clear understanding what the problem with
>>>>>>> a kthread is.
>>>>>>
>>>>>> Yes, sure thing.
>>>>>>
>>>>>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>>>>>> was testing with the UDP trafficgen and got up to 80% improvement over
>>>>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>>>>> regressions whatsoever =\
>>>>>>
>>>>>> I don't know where this regression on Daniel's setup comes from. Is it
>>>>>> multi-thread or single-thread test?
>>>>>
>>>>> 8 threads with 16 flows over them (-T8 -F16)
>>>>>
>>>>>> What app do you use: iperf, netperf,
>>>>>> neper, Microsoft's app (forgot the name)?
>>>>>
>>>>> neper, tcp_stream.
>>>>
>>>> Let me recheck with neper -T8 -F16, I'll post my results soon.
>>>
>>> kernel direct T1 direct T8F16 cpumap cpumap T8F16
>>> clean 28 51 13 9 Gbps
>>> GRO 28 51 26 18 Gbps
>>>
>>> 100% gain, no regressions =\
>>>
>>> My XDP prog is simple (upstream xdp-tools repo with no changes):
>>>
>>> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
>>> no-touch ens802f0np0
>>>
>>> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
>>> Rx queue without looking into headers or packet.
>>> Do you test with more sophisticated XDP prog?
>>
>> Great reminder... my prog is a bit more sophisticated. I forgot we were
>> doing latency tracking by inserting a timestamp into frame metadata. But
>> not clearing it after it was read on remote CPU, which disables GRO. So
>> previous test was paying the penalty of fixed GRO overhead without
>> getting any packet merges.
>>
>> Once I fixed up prog to reset metadata pointer I could see the wins.
>> Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput. No
>> latency changes.
>>
>> Sorry about the churn.
>
> No problem, crap happens sometimes :)
>
> Let me send my implementation on Monday-Wednesday. I'll include my UDP
> and TCP test results, as well as yours (+18%).
>
> BTW would be great if you could give me a Tested-by tag, as I assume the
> tests were fine and it works for you?
Yep, worked great for me.
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, newest: 2024-12-06 23:36 UTC
Thread overview: 36+ messages
2024-09-16 10:13 [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 1/3] net: Add napi_init_for_gro routine Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 2/3] net: add napi_threaded_poll to netdevice.h Lorenzo Bianconi
2024-09-16 10:13 ` [RFC/RFT v2 3/3] bpf: cpumap: Add gro support Lorenzo Bianconi
2024-09-16 15:10 ` [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Alexander Lobakin
2024-10-08 22:39 ` Daniel Xu
2024-10-09 10:46 ` Lorenzo Bianconi
2024-10-09 12:27 ` Alexander Lobakin
2024-10-09 12:47 ` Lorenzo Bianconi
2024-10-09 12:50 ` Alexander Lobakin
2024-10-22 15:51 ` Alexander Lobakin
2024-11-12 17:43 ` Alexander Lobakin
2024-11-13 23:39 ` Daniel Xu
2024-11-23 0:10 ` Daniel Xu
2024-11-25 15:12 ` Alexander Lobakin
2024-11-25 17:03 ` Daniel Xu
2024-11-25 18:50 ` Jesper Dangaard Brouer
2024-11-25 21:53 ` Daniel Xu
2024-11-25 22:19 ` Lorenzo Bianconi
2024-11-25 22:56 ` Daniel Xu
2024-11-26 10:36 ` Alexander Lobakin
2024-11-26 17:02 ` Lorenzo Bianconi
2024-11-26 17:12 ` Jesper Dangaard Brouer
2024-11-28 10:41 ` Alexander Lobakin
2024-11-28 10:56 ` Lorenzo Bianconi
2024-11-28 10:57 ` Alexander Lobakin
2024-12-02 22:47 ` Jakub Kicinski
2024-12-03 11:01 ` Alexander Lobakin
2024-12-04 0:51 ` Jakub Kicinski
2024-12-04 16:42 ` Alexander Lobakin
2024-12-04 21:51 ` Daniel Xu
2024-12-05 10:38 ` Alexander Lobakin
2024-12-05 11:06 ` Alexander Lobakin
2024-12-06 0:41 ` Daniel Xu
2024-12-06 15:06 ` Alexander Lobakin
2024-12-06 23:36 ` Daniel Xu