* [PATCH net-next 0/4] net: rps/rfs improvements
@ 2023-03-28 23:50 Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 1/4] net: napi_schedule_rps() cleanup Eric Dumazet
From: Eric Dumazet @ 2023-03-28 23:50 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Jason Xing, netdev, eric.dumazet, Eric Dumazet
Jason Xing attempted to optimize napi_schedule_rps() by avoiding
unneeded NET_RX_SOFTIRQ raises: see [1], [2].
This is quite complex to implement properly. I chose to implement
the idea, and added a similar optimization in ____napi_schedule().
Overall, in an intensive RPC workload with 32 TX/RX queues and RFS
enabled, I was able to observe a ~10% reduction in NET_RX_SOFTIRQ
invocations.
While this had no impact on throughput or cpu costs in this synthetic
benchmark, we know that firing NET_RX_SOFTIRQ from a softirq handler
can force __do_softirq() to wake up ksoftirqd when need_resched() is
true. This can have a latency impact on stressed hosts.
[1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
[2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
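For readers unfamiliar with that path, here is a simplified sketch of
the relevant logic in __do_softirq() (kernel/softirq.c); most details
are elided, and this is an illustration, not the exact code:

/* Simplified sketch of the tail of __do_softirq() in kernel/softirq.c
 * (handler dispatch and accounting elided). When softirqs keep being
 * re-raised while need_resched() is true, the remaining work is punted
 * to ksoftirqd, which then competes with user tasks for the cpu: this
 * is the latency source mentioned above.
 */
static void do_softirq_sketch(void)
{
	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
	int max_restart = MAX_SOFTIRQ_RESTART;
	u32 pending;

restart:
	/* ... run the handlers of every pending softirq ... */

	pending = local_softirq_pending();
	if (pending) {
		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;
		/* too much work, or a task wants the cpu: defer to ksoftirqd */
		wakeup_softirqd();
	}
}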
Eric Dumazet (4):
net: napi_schedule_rps() cleanup
net: add softnet_data.in_net_rx_action
net: optimize napi_schedule_rps()
net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
include/linux/netdevice.h | 1 +
net/core/dev.c | 46 ++++++++++++++++++++++++++++++---------
2 files changed, 37 insertions(+), 10 deletions(-)
--
2.40.0.348.gf938b09366-goog
* [PATCH net-next 1/4] net: napi_schedule_rps() cleanup
2023-03-28 23:50 [PATCH net-next 0/4] net: rps/rfs improvements Eric Dumazet
@ 2023-03-28 23:50 ` Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 2/4] net: add softnet_data.in_net_rx_action Eric Dumazet
From: Eric Dumazet @ 2023-03-28 23:50 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Jason Xing, netdev, eric.dumazet, Eric Dumazet
napi_schedule_rps() return value is ignored; remove it.
Change the comment to clarify the intent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/dev.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 7172334a418fdfe6132562f4c864ad0c69ebfd74..f7050b95d125014d00f4c876175b1569d82525cd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4582,11 +4582,18 @@ static void trigger_rx_softirq(void *data)
}
/*
- * Check if this softnet_data structure is another cpu one
- * If yes, queue it to our IPI list and return 1
- * If no, return 0
+ * After we queued a packet into sd->input_pkt_queue,
+ * we need to make sure this queue is serviced soon.
+ *
+ * - If this is another cpu queue, link it to our rps_ipi_list,
+ * and make sure we will process rps_ipi_list from net_rx_action().
+ * As we do not know yet if we are called from net_rx_action(),
+ * we have to raise NET_RX_SOFTIRQ. This might change in the future.
+ *
+ * - If this is our own queue, NAPI schedule our backlog.
+ * Note that this also raises NET_RX_SOFTIRQ.
*/
-static int napi_schedule_rps(struct softnet_data *sd)
+static void napi_schedule_rps(struct softnet_data *sd)
{
struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
@@ -4596,11 +4603,10 @@ static int napi_schedule_rps(struct softnet_data *sd)
mysd->rps_ipi_list = sd;
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
- return 1;
+ return;
}
#endif /* CONFIG_RPS */
__napi_schedule_irqoff(&mysd->backlog);
- return 0;
}
#ifdef CONFIG_NET_FLOW_LIMIT
--
2.40.0.348.gf938b09366-goog
* [PATCH net-next 2/4] net: add softnet_data.in_net_rx_action
2023-03-28 23:50 [PATCH net-next 0/4] net: rps/rfs improvements Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 1/4] net: napi_schedule_rps() cleanup Eric Dumazet
@ 2023-03-28 23:50 ` Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 3/4] net: optimize napi_schedule_rps() Eric Dumazet
From: Eric Dumazet @ 2023-03-28 23:50 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Jason Xing, netdev, eric.dumazet, Eric Dumazet
We want to make two optimizations in napi_schedule_rps() and
____napi_schedule() which require knowing whether these helpers are
called from net_rx_action() or from other contexts.
sd.in_net_rx_action is only read/written by the owning cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice.h | 1 +
net/core/dev.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18a5be6ddd0f7c1a7b8169440808bc66c991d8de..c8c634091a65966fee661695d34ba8a7cf2cd8e7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3188,6 +3188,7 @@ struct softnet_data {
#ifdef CONFIG_RPS
struct softnet_data *rps_ipi_list;
#endif
+ bool in_net_rx_action;
#ifdef CONFIG_NET_FLOW_LIMIT
struct sd_flow_limit __rcu *flow_limit;
#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index f7050b95d125014d00f4c876175b1569d82525cd..15331edbacf4ca59aa5772c29e95cacd3c106e3f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6646,6 +6646,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
LIST_HEAD(list);
LIST_HEAD(repoll);
+ sd->in_net_rx_action = true;
local_irq_disable();
list_splice_init(&sd->poll_list, &list);
local_irq_enable();
@@ -6656,6 +6657,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
skb_defer_free_flush(sd);
if (list_empty(&list)) {
+ sd->in_net_rx_action = false;
if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
goto end;
break;
@@ -6682,6 +6684,8 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
list_splice(&list, &sd->poll_list);
if (!list_empty(&sd->poll_list))
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+ else
+ sd->in_net_rx_action = false;
net_rps_action_and_irq_enable(sd);
end:;
--
2.40.0.348.gf938b09366-goog
* [PATCH net-next 3/4] net: optimize napi_schedule_rps()
2023-03-28 23:50 [PATCH net-next 0/4] net: rps/rfs improvements Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 1/4] net: napi_schedule_rps() cleanup Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 2/4] net: add softnet_data.in_net_rx_action Eric Dumazet
@ 2023-03-28 23:50 ` Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ Eric Dumazet
From: Eric Dumazet @ 2023-03-28 23:50 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Jason Xing, netdev, eric.dumazet, Eric Dumazet
Based on an initial patch from Jason Xing.
The idea is to not raise NET_RX_SOFTIRQ from napi_schedule_rps()
when we have queued a packet into another cpu's backlog.
We can do this only when called (indirectly) from net_rx_action(),
which guarantees our rps_ipi_list will be processed before we exit
from net_rx_action().
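(For context, a simplified view of net_rx_action()'s exit path, which
is what provides that guarantee; this is a sketch of existing code in
net/core/dev.c, not part of this patch:)

/* Whatever the poll loop does, net_rx_action() exits through
 * net_rps_action_and_irq_enable(), which sends any pending RPS IPIs,
 * so a NET_RX_SOFTIRQ raise from napi_schedule_rps() is redundant
 * while we are running inside it.
 */
static __latent_entropy void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);

	/* ... poll loop elided ... */

	net_rps_action_and_irq_enable(sd); /* flushes rps_ipi_list */
}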
Link: https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
---
net/core/dev.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 15331edbacf4ca59aa5772c29e95cacd3c106e3f..f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4587,8 +4587,6 @@ static void trigger_rx_softirq(void *data)
*
* - If this is another cpu queue, link it to our rps_ipi_list,
* and make sure we will process rps_ipi_list from net_rx_action().
- * As we do not know yet if we are called from net_rx_action(),
- * we have to raise NET_RX_SOFTIRQ. This might change in the future.
*
* - If this is our own queue, NAPI schedule our backlog.
* Note that this also raises NET_RX_SOFTIRQ.
@@ -4602,7 +4600,11 @@ static void napi_schedule_rps(struct softnet_data *sd)
sd->rps_ipi_next = mysd->rps_ipi_list;
mysd->rps_ipi_list = sd;
- __raise_softirq_irqoff(NET_RX_SOFTIRQ);
+ /* If not called from net_rx_action()
+ * we have to raise NET_RX_SOFTIRQ.
+ */
+ if (!mysd->in_net_rx_action)
+ __raise_softirq_irqoff(NET_RX_SOFTIRQ);
return;
}
#endif /* CONFIG_RPS */
--
2.40.0.348.gf938b09366-goog
* [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-28 23:50 [PATCH net-next 0/4] net: rps/rfs improvements Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 3/4] net: optimize napi_schedule_rps() Eric Dumazet
@ 2023-03-28 23:50 ` Eric Dumazet
2023-03-29 12:47 ` Yunsheng Lin
2023-03-30 9:50 ` Jason Xing
2023-03-29 2:40 ` [PATCH net-next 0/4] net: rps/rfs improvements Jason Xing
From: Eric Dumazet @ 2023-03-28 23:50 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Jason Xing, netdev, eric.dumazet, Eric Dumazet
____napi_schedule() adds a napi to the current cpu's softnet_data
poll_list, then raises NET_RX_SOFTIRQ to make sure net_rx_action()
will process it.
The idea of this patch is to not raise NET_RX_SOFTIRQ when called
indirectly from net_rx_action(), because we can process poll_list from
this point, without going through the full softirq loop.
This needs a change in net_rx_action() to make sure we restart
its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
being raised.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
---
net/core/dev.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
}
list_add_tail(&napi->poll_list, &sd->poll_list);
- __raise_softirq_irqoff(NET_RX_SOFTIRQ);
+ /* If not called from net_rx_action()
+ * we have to raise NET_RX_SOFTIRQ.
+ */
+ if (!sd->in_net_rx_action)
+ __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
#ifdef CONFIG_RPS
@@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
LIST_HEAD(list);
LIST_HEAD(repoll);
+start:
sd->in_net_rx_action = true;
local_irq_disable();
list_splice_init(&sd->poll_list, &list);
@@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
skb_defer_free_flush(sd);
if (list_empty(&list)) {
- sd->in_net_rx_action = false;
- if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
- goto end;
+ if (list_empty(&repoll)) {
+ sd->in_net_rx_action = false;
+ barrier();
+ /* We need to check if ____napi_schedule()
+ * had refilled poll_list while
+ * sd->in_net_rx_action was true.
+ */
+ if (!list_empty(&sd->poll_list))
+ goto start;
+ if (!sd_has_rps_ipi_waiting(sd))
+ goto end;
+ }
break;
}
--
2.40.0.348.gf938b09366-goog
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-28 23:50 [PATCH net-next 0/4] net: rps/rfs improvements Eric Dumazet
2023-03-28 23:50 ` [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ Eric Dumazet
@ 2023-03-29 2:40 ` Jason Xing
2023-03-30 3:04 ` Jakub Kicinski
2023-03-30 12:00 ` patchwork-bot+netdevbpf
From: Jason Xing @ 2023-03-29 2:40 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jason Xing, netdev,
eric.dumazet
On Wed, Mar 29, 2023 at 7:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Jason Xing attempted to optimize napi_schedule_rps() by avoiding
> unneeded NET_RX_SOFTIRQ raises: [1], [2]
>
> This is quite complex to implement properly. I chose to implement
> the idea, and added a similar optimization in ____napi_schedule()
>
[...]
> Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> invocations.
>
> While this had no impact on throughput or cpu costs on this synthetic
> benchmark, we know that firing NET_RX_SOFTIRQ from softirq handler
> can force __do_softirq() to wakeup ksoftirqd when need_resched() is true.
> This can have a latency impact on stressed hosts.
Eric, nice work! You got these numbers.
Could you also put this whole important description above into the 3/4
patch? I believe it is very useful information if any
readers/developers try to track this part through git blame. After
all, I spent a lot of time discovering this point. Thanks.
Otherwise it looks good to me. Please add:
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
And I've done tests on this patchset. Please also add:
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Thanks!
>
> [1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
> [2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
>
>
> Eric Dumazet (4):
> net: napi_schedule_rps() cleanup
> net: add softnet_data.in_net_rx_action
> net: optimize napi_schedule_rps()
> net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
>
> include/linux/netdevice.h | 1 +
> net/core/dev.c | 46 ++++++++++++++++++++++++++++++---------
> 2 files changed, 37 insertions(+), 10 deletions(-)
>
> --
> 2.40.0.348.gf938b09366-goog
>
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-28 23:50 ` [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ Eric Dumazet
@ 2023-03-29 12:47 ` Yunsheng Lin
2023-03-29 15:47 ` Eric Dumazet
2023-03-30 9:50 ` Jason Xing
From: Yunsheng Lin @ 2023-03-29 12:47 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Jason Xing, netdev, eric.dumazet
On 2023/3/29 7:50, Eric Dumazet wrote:
> ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
> then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
>
> Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
> from net_rx_action(), because we can process poll_list from this point,
> without going to full softirq loop.
>
> This needs a change in net_rx_action() to make sure we restart
> its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
> being raised.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Xing <kernelxing@tencent.com>
> ---
> net/core/dev.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> }
>
> list_add_tail(&napi->poll_list, &sd->poll_list);
> - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> + /* If not called from net_rx_action()
> + * we have to raise NET_RX_SOFTIRQ.
> + */
> + if (!sd->in_net_rx_action)
It seems sd->in_net_rx_action may be read/written by different CPUs at the
same time; does it need something like READ_ONCE()/WRITE_ONCE() to access
sd->in_net_rx_action?
> + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> }
>
> #ifdef CONFIG_RPS
> @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> LIST_HEAD(list);
> LIST_HEAD(repoll);
>
> +start:
> sd->in_net_rx_action = true;
> local_irq_disable();
> list_splice_init(&sd->poll_list, &list);
> @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> skb_defer_free_flush(sd);
>
> if (list_empty(&list)) {
> - sd->in_net_rx_action = false;
> - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
> - goto end;
> + if (list_empty(&repoll)) {
> + sd->in_net_rx_action = false;
> + barrier();
Do we need a stronger barrier to prevent out-of-order execution
by the cpu?
Do we need a barrier between list_add_tail() and the sd->in_net_rx_action
check in ____napi_schedule() to pair with the above barrier?
> + /* We need to check if ____napi_schedule()
> + * had refilled poll_list while
> + * sd->in_net_rx_action was true.
> + */
> + if (!list_empty(&sd->poll_list))
> + goto start;
> + if (!sd_has_rps_ipi_waiting(sd))
> + goto end;
> + }
> break;
> }
>
>
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
@ 2023-03-29 13:17 Aiden Leong
2023-03-29 13:18 ` Aiden Leong
From: Aiden Leong @ 2023-03-29 13:17 UTC (permalink / raw)
To: edumazet; +Cc: davem, eric.dumazet, kernelxing, kuba, netdev, pabeni
Hi Eric,
I hope my email is not too off-topic, but I have some confusion about how
maintainers should react to other people's work.
In short, you are stealing Jason's idea and work by rewriting the
implementation yourself, which is not that ethical. Since your patch is
based on his work, but you only signed it off under your own name, it could
raise a lawsuit between Tencent and the Linux community or Google.
I'm here to provoke a conflict because we know your name in this area and
I'd like to show my respect to you, but I do have this kind of confusion in
my mind and wish you could explain it.
There's another story you or Tom Herbert may be interested in: I was working
on Foo Over UDP and implemented the missing features at the previous
company I worked for. The proposal to contribute to the upstream community
was later rejected by our boss over unhappy events very similar to this
one.
Aiden Leong
> Jason Xing attempted to optimize napi_schedule_rps() by avoiding
> unneeded NET_RX_SOFTIRQ raises: [1], [2]
>
> This is quite complex to implement properly. I chose to implement
> the idea, and added a similar optimization in ____napi_schedule()
>
> Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> invocations.
>
> While this had no impact on throughput or cpu costs on this synthetic
> benchmark, we know that firing NET_RX_SOFTIRQ from softirq handler
> can force __do_softirq() to wakeup ksoftirqd when need_resched() is true.
> This can have a latency impact on stressed hosts.
>
> [1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
> [2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
>
>
> Eric Dumazet (4):
> net: napi_schedule_rps() cleanup
> net: add softnet_data.in_net_rx_action
> net: optimize napi_schedule_rps()
> net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
>
> include/linux/netdevice.h | 1 +
> net/core/dev.c | 46 ++++++++++++++++++++++++++++++---------
> 2 files changed, 37 insertions(+), 10 deletions(-)
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-29 13:17 Aiden Leong
@ 2023-03-29 13:18 ` Aiden Leong
2023-03-29 15:33 ` Eric Dumazet
From: Aiden Leong @ 2023-03-29 13:18 UTC (permalink / raw)
To: edumazet; +Cc: davem, eric.dumazet, kernelxing, kuba, netdev, pabeni
On Wednesday, March 29, 2023 9:17:06 PM CST Aiden Leong wrote:
> Hi Eric,
>
> I hope my email is not too off-topic but I have some confusion on how
> maintainers and should react to other people's work.
>
> In short, you are stealing Jason's idea&&work by rewriting your
> implementation which not that ethical. Since your patch is based on his
> work, but you only sign-off it by your name, it's possible to raise lawsuit
> between Tencent and Linux community or Google.
>
> I'm here to provoke a conflict because we know your name in this area and
> I'd to show my respect to you but I do have this kind of confusion in my
> mind and wish you could explain about it.
>
Typo: I'm here NOT to provoke a conflict
> There's another story you or Tom Herbert may be interested in: I was working
> on Foo Over UDP and have implemented the missing features in the previous
> company I worked for. The proposal to contribute to the upstream community
> was rejected later by our boss for unhappy events very similar to this one.
>
> Aiden Leong
>
> > Jason Xing attempted to optimize napi_schedule_rps() by avoiding
> > unneeded NET_RX_SOFTIRQ raises: [1], [2]
> >
> > This is quite complex to implement properly. I chose to implement
> > the idea, and added a similar optimization in ____napi_schedule()
> >
> > Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> > I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> > invocations.
> >
> > While this had no impact on throughput or cpu costs on this synthetic
> > benchmark, we know that firing NET_RX_SOFTIRQ from softirq handler
> > can force __do_softirq() to wakeup ksoftirqd when need_resched() is true.
> > This can have a latency impact on stressed hosts.
> >
> > [1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
>
> > [2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
> > Eric Dumazet (4):
> > net: napi_schedule_rps() cleanup
> > net: add softnet_data.in_net_rx_action
> > net: optimize napi_schedule_rps()
> > net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
> >
> > include/linux/netdevice.h | 1 +
> > net/core/dev.c | 46 ++++++++++++++++++++++++++++++---------
> > 2 files changed, 37 insertions(+), 10 deletions(-)
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-29 13:18 ` Aiden Leong
@ 2023-03-29 15:33 ` Eric Dumazet
2023-03-29 16:21 ` Jason Xing
From: Eric Dumazet @ 2023-03-29 15:33 UTC (permalink / raw)
To: Aiden Leong; +Cc: davem, eric.dumazet, kernelxing, kuba, netdev, pabeni
On Wed, Mar 29, 2023 at 3:18 PM Aiden Leong <aiden.leong@aibsd.com> wrote:
>
> On Wednesday, March 29, 2023 9:17:06 PM CST Aiden Leong wrote:
> > Hi Eric,
> >
> > I hope my email is not too off-topic but I have some confusion on how
> > maintainers and should react to other people's work.
> >
> > In short, you are stealing Jason's idea&&work by rewriting your
> > implementation which not that ethical. Since your patch is based on his
> > work, but you only sign-off it by your name, it's possible to raise lawsuit
> > between Tencent and Linux community or Google.
Seriously?
I really gave enough credit to Jason's work; it seems you missed it.
I am quite tired of reviewing patches and giving ideas on how to
improve things, then getting no credit for my work.
After my feedback on v1, seeing a quite silly v2, obviously not tested,
and with no numbers shown, I wanted to see how hard the complete
implementation would be.
This needed a full understanding of RPS and RFS, not only an "idea" and
wrong claims about fixing a "bug" in the initial implementation, which
was just fine.
Also, results were theoretical at this stage; I added numbers in the
cover letter showing the impact was tiny or not even measurable.
I sent a series of 4 patches; Jason's work on the 3rd one has been
completely documented.
If Jason's managers are not able to see the credit in the patch series
(and cover letter), this is their problem, not mine.
Also, my contributions to linux do not represent the views of my
employer; this should be obvious.
> >
> > I'm here to provoke a conflict because we know your name in this area and
> > I'd to show my respect to you but I do have this kind of confusion in my
> > mind and wish you could explain about it.
> >
> Typo: I'm here NOT to provoke a conflict
> > There's another story you or Tom Herbert may be interested in: I was working
> > on Foo Over UDP and have implemented the missing features in the previous
> > company I worked for. The proposal to contribute to the upstream community
> > was rejected later by our boss for unhappy events very similar to this one.
> >
> > Aiden Leong
> >
> > > Jason Xing attempted to optimize napi_schedule_rps() by avoiding
> > > unneeded NET_RX_SOFTIRQ raises: [1], [2]
> > >
> > > This is quite complex to implement properly. I chose to implement
> > > the idea, and added a similar optimization in ____napi_schedule()
> > >
> > > Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> > > I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> > > invocations.
> > >
> > > While this had no impact on throughput or cpu costs on this synthetic
> > > benchmark, we know that firing NET_RX_SOFTIRQ from softirq handler
> > > can force __do_softirq() to wakeup ksoftirqd when need_resched() is true.
> > > This can have a latency impact on stressed hosts.
> > >
> > > [1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
> >
> > > [2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
> > > Eric Dumazet (4):
> > > net: napi_schedule_rps() cleanup
> > > net: add softnet_data.in_net_rx_action
> > > net: optimize napi_schedule_rps()
> > > net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
> > >
> > > include/linux/netdevice.h | 1 +
> > > net/core/dev.c | 46 ++++++++++++++++++++++++++++++---------
> > > 2 files changed, 37 insertions(+), 10 deletions(-)
>
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-29 12:47 ` Yunsheng Lin
@ 2023-03-29 15:47 ` Eric Dumazet
2023-03-30 2:33 ` Yunsheng Lin
From: Eric Dumazet @ 2023-03-29 15:47 UTC (permalink / raw)
To: Yunsheng Lin
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jason Xing, netdev,
eric.dumazet
On Wed, Mar 29, 2023 at 2:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/3/29 7:50, Eric Dumazet wrote:
> > ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
> > then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
> >
> > Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
> > from net_rx_action(), because we can process poll_list from this point,
> > without going to full softirq loop.
> >
> > This needs a change in net_rx_action() to make sure we restart
> > its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
> > being raised.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Jason Xing <kernelxing@tencent.com>
> > ---
> > net/core/dev.c | 22 ++++++++++++++++++----
> > 1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> > }
> >
> > list_add_tail(&napi->poll_list, &sd->poll_list);
> > - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > + /* If not called from net_rx_action()
> > + * we have to raise NET_RX_SOFTIRQ.
> > + */
> > + if (!sd->in_net_rx_action)
>
> It seems sd->in_net_rx_action may be read/writen by different CPUs at the same
> time, does it need something like READ_ONCE()/WRITE_ONCE() to access
> sd->in_net_rx_action?
You probably missed the 2nd patch, explaining that in_net_rx_action is
only read and written by the current cpu (the one owning the percpu var).
Have you found a caller that is not passing
sd = this_cpu_ptr(&softnet_data) to ____napi_schedule()?
>
> > + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > }
> >
> > #ifdef CONFIG_RPS
> > @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > LIST_HEAD(list);
> > LIST_HEAD(repoll);
> >
> > +start:
> > sd->in_net_rx_action = true;
> > local_irq_disable();
> > list_splice_init(&sd->poll_list, &list);
> > @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > skb_defer_free_flush(sd);
> >
> > if (list_empty(&list)) {
> > - sd->in_net_rx_action = false;
> > - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
> > - goto end;
> > + if (list_empty(&repoll)) {
> > + sd->in_net_rx_action = false;
> > + barrier();
>
> Do we need a stronger barrier to prevent out-of-order execution
> from cpu?
We do not need more than barrier() to make sure the local cpu won't
move this write after the following code.
It should not reorder it, even without the barrier(), because of the
way list_empty() is coded, but a barrier() makes things a bit more
explicit.
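For reference, barrier() is a pure compiler barrier, essentially:

/* From include/linux/compiler.h: an empty asm with a "memory" clobber.
 * It forbids the compiler from moving memory accesses across it, but
 * emits no cpu fence instruction -- which is sufficient here, because
 * the store to sd->in_net_rx_action and the later list_empty() read
 * only need to be ordered against code running on the same cpu.
 */
#define barrier() __asm__ __volatile__("": : :"memory")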
> Do we need a barrier between list_add_tail() and sd->in_net_rx_action
> checking in ____napi_schedule() to pair with the above barrier?
I do not think so.
While in ____napi_schedule(), sd->in_net_rx_action is stable
because we run with hardware IRQ masked.
Thanks.
>
> > + /* We need to check if ____napi_schedule()
> > + * had refilled poll_list while
> > + * sd->in_net_rx_action was true.
> > + */
> > + if (!list_empty(&sd->poll_list))
> > + goto start;
> > + if (!sd_has_rps_ipi_waiting(sd))
> > + goto end;
> > + }
> > break;
> > }
> >
> >
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-29 15:33 ` Eric Dumazet
@ 2023-03-29 16:21 ` Jason Xing
From: Jason Xing @ 2023-03-29 16:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: Aiden Leong, davem, eric.dumazet, kernelxing, kuba, netdev,
pabeni
On Wed, Mar 29, 2023 at 11:36 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 29, 2023 at 3:18 PM Aiden Leong <aiden.leong@aibsd.com> wrote:
> >
> > On Wednesday, March 29, 2023 9:17:06 PM CST Aiden Leong wrote:
> > > Hi Eric,
> > >
> > > I hope my email is not too off-topic but I have some confusion on how
> > > maintainers and should react to other people's work.
> > >
> > > In short, you are stealing Jason's idea&&work by rewriting your
> > > implementation which not that ethical. Since your patch is based on his
> > > work, but you only sign-off it by your name, it's possible to raise lawsuit
> > > between Tencent and Linux community or Google.
>
> Seriously ?
>
> I really gave enough credit to Jason's work, it seems you missed it.
>
> I am quite tired of reviewing patches, and giving ideas of how to
> improve things,
> then having no credits for my work.
>
> After my feedback on v1, seeing a quite silly v2, obviously not tested,
> and with no numbers shown, I wanted to see how hard the complete
> implementation would be.
At first glance, I really don't have any interest in commenting on this.
However, those words 'silly', 'not tested' and 'no numbers' make me feel
very uncomfortable. Actually, I did all of those things. What you said
makes others think I'm like a fool who just does not have any knowledge
about this and proposed one idea with no foundation (out of thin air).
You know that it's not true, so I don't know why you're using these
terrible words in public.
In fact, last night in our private email exchange, you said "I need to
be convinced" to me, and then it was me who listed nearly every step to
prove how it can impact latency under high load, at 3:00 AM.
Well, I wouldn't like to see any further conflicts because of this.
Let's stop here, focus on this patch series and then move on, shall
we?
>
> This needed full understanding of RPS and RFS, not only an "idea" and
> wrong claims about fixing
> a "bug" in the initial implementation which was just fine.
>
> Also, results are theoretical at this stage,I added numbers in the cover letter
> showing the impact was tiny or not even mesurable.
>
> I sent a series of 4 patches, Jason work on the 3rd one has been
> completely documented.
>
> If Jason managers are not able to see the credit in the patch series
> (and cover letter),
> this is their problem, not mine.
>
> Also, my contributions to linux do not represent views of my employer,
> this should be obvious.
>
>
>
> > >
> > > I'm here to provoke a conflict because we know your name in this area and
> > > I'd to show my respect to you but I do have this kind of confusion in my
> > > mind and wish you could explain about it.
> > >
> > Typo: I'm here NOT to provoke a conflict
> > > There's another story you or Tom Herbert may be interested in: I was working
> > > on Foo Over UDP and have implemented the missing features in the previous
> > > company I worked for. The proposal to contribute to the upstream community
> > > was rejected later by our boss for unhappy events very similar to this one.
> > >
> > > Aiden Leong
> > >
> > > > Jason Xing attempted to optimize napi_schedule_rps() by avoiding
> > > > unneeded NET_RX_SOFTIRQ raises: [1], [2]
> > > >
> > > > This is quite complex to implement properly. I chose to implement
> > > > the idea, and added a similar optimization in ____napi_schedule()
> > > >
> > > > Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> > > > I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> > > > invocations.
> > > >
> > > > While this had no impact on throughput or cpu costs on this synthetic
> > > > benchmark, we know that firing NET_RX_SOFTIRQ from softirq handler
> > > > can force __do_softirq() to wakeup ksoftirqd when need_resched() is true.
> > > > This can have a latency impact on stressed hosts.
> > > >
> > > > [1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
> > >
> > > > [2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
> > > > Eric Dumazet (4):
> > > > net: napi_schedule_rps() cleanup
> > > > net: add softnet_data.in_net_rx_action
> > > > net: optimize napi_schedule_rps()
> > > > net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
> > > >
> > > > include/linux/netdevice.h | 1 +
> > > > net/core/dev.c | 46 ++++++++++++++++++++++++++++++---------
> > > > 2 files changed, 37 insertions(+), 10 deletions(-)
> >
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-29 15:47 ` Eric Dumazet
@ 2023-03-30 2:33 ` Yunsheng Lin
2023-03-30 2:57 ` Eric Dumazet
From: Yunsheng Lin @ 2023-03-30 2:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jason Xing, netdev,
eric.dumazet
On 2023/3/29 23:47, Eric Dumazet wrote:
> On Wed, Mar 29, 2023 at 2:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/3/29 7:50, Eric Dumazet wrote:
>>> ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
>>> then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
>>>
>>> Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
>>> from net_rx_action(), because we can process poll_list from this point,
>>> without going to full softirq loop.
>>>
>>> This needs a change in net_rx_action() to make sure we restart
>>> its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
>>> being raised.
>>>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> Cc: Jason Xing <kernelxing@tencent.com>
>>> ---
>>> net/core/dev.c | 22 ++++++++++++++++++----
>>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>> }
>>>
>>> list_add_tail(&napi->poll_list, &sd->poll_list);
>>> - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>> + /* If not called from net_rx_action()
>>> + * we have to raise NET_RX_SOFTIRQ.
>>> + */
>>> + if (!sd->in_net_rx_action)
>>
>> It seems sd->in_net_rx_action may be read/writen by different CPUs at the same
>> time, does it need something like READ_ONCE()/WRITE_ONCE() to access
>> sd->in_net_rx_action?
>
> You probably missed the 2nd patch, explaining the in_net_rx_action is
> only read and written by the current (owning the percpu var) cpu.
>
> Have you found a caller that is not providing to ____napi_schedule( sd
> = this_cpu_ptr(&softnet_data)) ?
You are right.
The one small problem I see is that a ____napi_schedule() call from an
irq handler may preempt the running net_rx_action() on the current cpu.
I am not sure it is worth handling, given that irqs are expected to be
disabled when net_rx_action() is running?
Do we need to protect against buggy hw or a misbehaving driver?
>
>
>
>>
>>> + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>> }
>>>
>>> #ifdef CONFIG_RPS
>>> @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
>>> LIST_HEAD(list);
>>> LIST_HEAD(repoll);
>>>
>>> +start:
>>> sd->in_net_rx_action = true;
>>> local_irq_disable();
>>> list_splice_init(&sd->poll_list, &list);
>>> @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
>>> skb_defer_free_flush(sd);
>>>
>>> if (list_empty(&list)) {
>>> - sd->in_net_rx_action = false;
>>> - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
>>> - goto end;
>>> + if (list_empty(&repoll)) {
>>> + sd->in_net_rx_action = false;
>>> + barrier();
>>
>> Do we need a stronger barrier to prevent out-of-order execution
>> from cpu?
>
> We do not need more than barrier() to make sure local cpu wont move this
> write after the following code.
Is there any reason why we need the barrier() if we are not depending
on how list_empty() is coded?
It is not obvious to me, at least :)
>
> It should not, even without the barrier(), because of the way
> list_empty() is coded,
> but a barrier() makes things a bit more explicit.
In that case, a comment explaining that may help a lot.
Thanks.
>
>> Do we need a barrier between list_add_tail() and sd->in_net_rx_action
>> checking in ____napi_schedule() to pair with the above barrier?
>
> I do not think so.
>
> While in ____napi_schedule(), sd->in_net_rx_action is stable
> because we run with hardware IRQ masked.
>
> Thanks.
>
>
>
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-30 2:33 ` Yunsheng Lin
@ 2023-03-30 2:57 ` Eric Dumazet
2023-03-30 6:47 ` Yunsheng Lin
From: Eric Dumazet @ 2023-03-30 2:57 UTC (permalink / raw)
To: Yunsheng Lin
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jason Xing, netdev,
eric.dumazet
On Thu, Mar 30, 2023 at 4:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/3/29 23:47, Eric Dumazet wrote:
> > On Wed, Mar 29, 2023 at 2:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2023/3/29 7:50, Eric Dumazet wrote:
> >>> ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
> >>> then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
> >>>
> >>> Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
> >>> from net_rx_action(), because we can process poll_list from this point,
> >>> without going to full softirq loop.
> >>>
> >>> This needs a change in net_rx_action() to make sure we restart
> >>> its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
> >>> being raised.
> >>>
> >>> Signed-off-by: Eric Dumazet <edumazet@google.com>
> >>> Cc: Jason Xing <kernelxing@tencent.com>
> >>> ---
> >>> net/core/dev.c | 22 ++++++++++++++++++----
> >>> 1 file changed, 18 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/net/core/dev.c b/net/core/dev.c
> >>> index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
> >>> --- a/net/core/dev.c
> >>> +++ b/net/core/dev.c
> >>> @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> >>> }
> >>>
> >>> list_add_tail(&napi->poll_list, &sd->poll_list);
> >>> - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >>> + /* If not called from net_rx_action()
> >>> + * we have to raise NET_RX_SOFTIRQ.
> >>> + */
> >>> + if (!sd->in_net_rx_action)
> >>
> >> It seems sd->in_net_rx_action may be read/writen by different CPUs at the same
> >> time, does it need something like READ_ONCE()/WRITE_ONCE() to access
> >> sd->in_net_rx_action?
> >
> > You probably missed the 2nd patch, explaining the in_net_rx_action is
> > only read and written by the current (owning the percpu var) cpu.
> >
> > Have you found a caller that is not providing to ____napi_schedule( sd
> > = this_cpu_ptr(&softnet_data)) ?
>
> You are right.
>
> The one small problem I see is that ____napi_schedule() call by a irq handle
> may preempt the running net_rx_action() in the current cpu, I am not sure if
> it worth handling, given that it is expected that the irq should be disabled
> when net_rx_action() is running?
And what will happen? If the interrupt comes before
in_net_rx_action = val;
the interrupt handler will see the old value; this is really fine in
all cases.
If it comes after the assignment, the interrupt handler will see the
new value, because a cpu cannot reorder its own reads/writes.
Otherwise simple things like this would fail:
a = 2;
b = a;
assert(b == 2);
1) Note that the local_irq_disable(); after the first
sd->in_net_rx_action = true;
in net_rx_action() already provides a strong barrier.
2) sd->in_net_rx_action = false before the barrier() is enough to
provide needed safety for _this_ cpu.
3) Final sd->in_net_rx_action = false; at the end of net_rx_action()
is performed while hard irq are masked.
> Do we need to protect against buggy hw or unbehaved driver?
If you think there is an issue, please elaborate with the exact call
site / interruption point, because I do not see any.
>
> >
> >
> >
> >>
> >>> + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >>> }
> >>>
> >>> #ifdef CONFIG_RPS
> >>> @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> >>> LIST_HEAD(list);
> >>> LIST_HEAD(repoll);
> >>>
> >>> +start:
> >>> sd->in_net_rx_action = true;
> >>> local_irq_disable();
> >>> list_splice_init(&sd->poll_list, &list);
> >>> @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> >>> skb_defer_free_flush(sd);
> >>>
> >>> if (list_empty(&list)) {
> >>> - sd->in_net_rx_action = false;
> >>> - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
> >>> - goto end;
> >>> + if (list_empty(&repoll)) {
> >>> + sd->in_net_rx_action = false;
> >>> + barrier();
> >>
> >> Do we need a stronger barrier to prevent out-of-order execution
> >> from cpu?
> >
> > We do not need more than barrier() to make sure local cpu wont move this
> > write after the following code.
>
> Is there any reason why we need the barrier() if we are not depending
> on how list_empty() is coded?
> It seems not obvious to me at least:)
>
> >
> > It should not, even without the barrier(), because of the way
> > list_empty() is coded,
> > but a barrier() makes things a bit more explicit.
>
> In that case, a comment explaining that may help a lot.
>
> Thanks.
>
> >
> >> Do we need a barrier between list_add_tail() and sd->in_net_rx_action
> >> checking in ____napi_schedule() to pair with the above barrier?
> >
> > I do not think so.
> >
> > While in ____napi_schedule(), sd->in_net_rx_action is stable
> > because we run with hardware IRQ masked.
> >
> > Thanks.
> >
> >
> >
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-28 23:50 [PATCH net-next 0/4] net: rps/rfs improvements Eric Dumazet
2023-03-29 2:40 ` [PATCH net-next 0/4] net: rps/rfs improvements Jason Xing
@ 2023-03-30 3:04 ` Jakub Kicinski
2023-03-30 3:15 ` Eric Dumazet
2023-03-30 12:00 ` patchwork-bot+netdevbpf
From: Jakub Kicinski @ 2023-03-30 3:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Paolo Abeni, Jason Xing, netdev, eric.dumazet
On Tue, 28 Mar 2023 23:50:17 +0000 Eric Dumazet wrote:
> Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> invocations.
small clarification on the testing:
invocations == calls to net_rx_action()
or
invocations == calls to __raise_softirq_irqoff(NET_RX_SOFTIRQ)
?
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-30 3:04 ` Jakub Kicinski
@ 2023-03-30 3:15 ` Eric Dumazet
2023-03-30 3:39 ` Eric Dumazet
From: Eric Dumazet @ 2023-03-30 3:15 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S . Miller, Paolo Abeni, Jason Xing, netdev, eric.dumazet
On Thu, Mar 30, 2023 at 5:04 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 28 Mar 2023 23:50:17 +0000 Eric Dumazet wrote:
> > Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> > I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> > invocations.
>
> small clarification on the testing:
>
> invocations == calls to net_rx_action()
> or
> invocations == calls to __raise_softirq_irqoff(NET_RX_SOFTIRQ)
This was from "grep NET_RX /proc/softirqs" (more exactly a tool
parsing /proc/softirqs)
So it should match the number of calls to net_rx_action(), but I can
double check if you want.
(I had a simple hack to enable/disable the optimizations with a hijacked sysctl)
Turn them on/off with
echo 1001 >/proc/sys/net/core/netdev_max_backlog
<gather stats>
echo 1000 >/proc/sys/net/core/netdev_max_backlog
<gather stats>
diff --git a/net/core/dev.c b/net/core/dev.c
index 0c4b21291348d4558f036fb05842dab023f65dc3..f8c6fde6100c8e4812037bd070e11733409bd0a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6653,7 +6653,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 	LIST_HEAD(repoll);
 
 start:
-	sd->in_net_rx_action = true;
+	sd->in_net_rx_action = (netdev_max_backlog & 1);
 	local_irq_disable();
 	list_splice_init(&sd->poll_list, &list);
 	local_irq_enable();
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-30 3:15 ` Eric Dumazet
@ 2023-03-30 3:39 ` Eric Dumazet
2023-03-30 3:57 ` Jakub Kicinski
From: Eric Dumazet @ 2023-03-30 3:39 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S . Miller, Paolo Abeni, Jason Xing, netdev, eric.dumazet
On Thu, Mar 30, 2023 at 5:15 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 30, 2023 at 5:04 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 28 Mar 2023 23:50:17 +0000 Eric Dumazet wrote:
> > > Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> > > I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> > > invocations.
> >
> > small clarification on the testing:
> >
> > invocations == calls to net_rx_action()
> > or
> > invocations == calls to __raise_softirq_irqoff(NET_RX_SOFTIRQ)
>
> This was from "grep NET_RX /proc/softirqs" (more exactly a tool
> parsing /proc/softirqs)
>
> So it should match the number of calls to net_rx_action(), but I can
> double check if you want.
>
> (I had a simple hack to enable/disable the optimizations with a hijacked sysctl)
Trace of real debug session, because numbers ;)
Old platform with two Intel(R) Xeon(R) Gold 6268L CPUs @ 2.80GHz (96
threads total)
600 tcp_rr flows
iroa23:/home/edumazet# cat /proc/sys/net/core/netdev_max_backlog
1000
iroa23:/home/edumazet# ./interrupts
hrtimer:99518 cal:2421783 timer:1034 sched:80505 rcu:7661 rps:2390765
net_tx:43 net_rx:3344637 eth1-rx:295134 eth1-tx:558933 eth1:854074
^C
iroa23:/home/edumazet# echo 1001 >/proc/sys/net/core/netdev_max_backlog
iroa23:/home/edumazet# ./interrupts
hrtimer:99545 cal:2358993 timer:1086 sched:77806 rcu:10928 rps:2419052
net_tx:21 net_rx:3016301 eth1-rx:294612 eth1-tx:560331 eth1:855083
^C
echo "(3344637 - 3016301)/3344637" | bc -ql
.09816790282473105452
-> ~10% decrease.
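(The ./interrupts tool itself is not shown in this thread; a minimal,
hypothetical stand-in that sums the per-cpu NET_RX column of
/proc/softirqs could look like the sketch below.)

/* Hypothetical minimal stand-in for the ./interrupts tool above:
 * prints the sum of the per-cpu NET_RX counters from /proc/softirqs.
 * Run it before and after the test and diff the two totals.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/softirqs", "r");

	if (!f) {
		perror("/proc/softirqs");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned long long total = 0, v;
		char *p = line, *end;

		while (*p == ' ')	/* skip leading padding */
			p++;
		if (strncmp(p, "NET_RX:", 7))
			continue;
		p += 7;
		for (;;) {		/* sum one counter per cpu */
			v = strtoull(p, &end, 10);
			if (end == p)
				break;
			total += v;
			p = end;
		}
		printf("net_rx: %llu\n", total);
	}
	fclose(f);
	return 0;
}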
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-30 3:39 ` Eric Dumazet
@ 2023-03-30 3:57 ` Jakub Kicinski
From: Jakub Kicinski @ 2023-03-30 3:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Paolo Abeni, Jason Xing, netdev, eric.dumazet
On Thu, 30 Mar 2023 05:39:35 +0200 Eric Dumazet wrote:
> > This was from "grep NET_RX /proc/softirqs" (more exactly a tool
> > parsing /proc/softirqs)
> >
> > So it should match the number of calls to net_rx_action(), but I can
> > double check if you want.
> >
> > (I had a simple hack to enable/disable the optimizations with a hijacked sysctl)
>
> Trace of real debug session, because numbers ;)
> Old platform with two Intel(R) Xeon(R) Gold 6268L CPU @ 2.80GHz (96
> threads total)
>
> 600 tcp_rr flows
>
> iroa23:/home/edumazet# cat /proc/sys/net/core/netdev_max_backlog
> 1000
> iroa23:/home/edumazet# ./interrupts
> hrtimer:99518 cal:2421783 timer:1034 sched:80505 rcu:7661 rps:2390765
> net_tx:43 net_rx:3344637 eth1-rx:295134 eth1-tx:558933 eth1:854074
> ^C
> iroa23:/home/edumazet# echo 1001 >/proc/sys/net/core/netdev_max_backlog
> iroa23:/home/edumazet# ./interrupts
> hrtimer:99545 cal:2358993 timer:1086 sched:77806 rcu:10928 rps:2419052
> net_tx:21 net_rx:3016301 eth1-rx:294612 eth1-tx:560331 eth1:855083
> ^C
>
> echo "(3344637 - 3016301)/3344637" | bc -ql
> .09816790282473105452
Very nice, thanks! (I didn't know about /proc/softirqs!)
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-30 2:57 ` Eric Dumazet
@ 2023-03-30 6:47 ` Yunsheng Lin
2023-03-30 7:36 ` Yunsheng Lin
From: Yunsheng Lin @ 2023-03-30 6:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jason Xing, netdev,
eric.dumazet
On 2023/3/30 10:57, Eric Dumazet wrote:
> On Thu, Mar 30, 2023 at 4:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/3/29 23:47, Eric Dumazet wrote:
>>> On Wed, Mar 29, 2023 at 2:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> On 2023/3/29 7:50, Eric Dumazet wrote:
>>>>> ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
>>>>> then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
>>>>>
>>>>> Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
>>>>> from net_rx_action(), because we can process poll_list from this point,
>>>>> without going to full softirq loop.
>>>>>
>>>>> This needs a change in net_rx_action() to make sure we restart
>>>>> its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
>>>>> being raised.
>>>>>
>>>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>>> Cc: Jason Xing <kernelxing@tencent.com>
>>>>> ---
>>>>> net/core/dev.c | 22 ++++++++++++++++++----
>>>>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>> index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
>>>>> --- a/net/core/dev.c
>>>>> +++ b/net/core/dev.c
>>>>> @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>>>> }
>>>>>
>>>>> list_add_tail(&napi->poll_list, &sd->poll_list);
>>>>> - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>>>> + /* If not called from net_rx_action()
>>>>> + * we have to raise NET_RX_SOFTIRQ.
>>>>> + */
>>>>> + if (!sd->in_net_rx_action)
>>>>
>>>> It seems sd->in_net_rx_action may be read/writen by different CPUs at the same
>>>> time, does it need something like READ_ONCE()/WRITE_ONCE() to access
>>>> sd->in_net_rx_action?
>>>
>>> You probably missed the 2nd patch, explaining the in_net_rx_action is
>>> only read and written by the current (owning the percpu var) cpu.
>>>
>>> Have you found a caller that is not providing to ____napi_schedule( sd
>>> = this_cpu_ptr(&softnet_data)) ?
>>
>> You are right.
>>
>> The one small problem I see is that ____napi_schedule() call by a irq handle
>> may preempt the running net_rx_action() in the current cpu, I am not sure if
>> it worth handling, given that it is expected that the irq should be disabled
>> when net_rx_action() is running?
>
> And what will happen ? If the interrupts comes before
>
> in_net_rx_action = val;
>
> The interrupt handler will see the old value, this is fine really in all points.
>
> If it comes after the assignment, the interrupt handler will see the new value,
> because a cpu can not reorder its own reads/writes.
>
> Otherwise simple things like this would fail:
>
> a = 2;
> b = a ;
> assert (b == 2) ;
>
>
> 1) Note that the local_irq_disable(); after the first
>
> sd->in_net_rx_action = true;
>
> in net_rx_action() already provides a strong barrier.
>
> 2) sd->in_net_rx_action = false before the barrier() is enough to
> provide needed safety for _this_ cpu.
>
> 3) Final sd->in_net_rx_action = false; at the end of net_rx_action()
> is performed while hard irq are masked.
>
>
>
>
>> Do we need to protect against buggy hw or unbehaved driver?
>
> If you think there is an issue please elaborate with the exact call
> site/ interruption point, because I do not see any.
I was wondering whether load/store tearing or out-of-order execution
would make something go wrong here.
For load/store tearing, in_net_rx_action in 'struct softnet_data' is a
bool, so I think it should be ok here; would it be better to make it
explicit by using READ_ONCE()/WRITE_ONCE()?
LWN article about load/store tearing:
https://lwn.net/Articles/793253/
For out-of-order execution, I am not sure if it is really a problem
for an irq preempting a softirq on the same CPU yet. For example, in
the code below, suppose the list_empty(&sd->poll_list) check is
executed out of order with "sd->in_net_rx_action = false", and the irq
which calls ____napi_schedule() preempts between the
list_empty(&sd->poll_list) check and "sd->in_net_rx_action = false".
Then ____napi_schedule() will not raise the softirq, as
sd->in_net_rx_action is still true; after the irq finishes, since
list_empty(&sd->poll_list) has already been checked, we may not go to
'start' in net_rx_action().
+ sd->in_net_rx_action = false;
+ barrier();
+ /* We need to check if ____napi_schedule()
+ * had refilled poll_list while
+ * sd->in_net_rx_action was true.
+ */
+ if (!list_empty(&sd->poll_list))
>
>
>>
>>>
>>>
>>>
>>>>
>>>>> + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>>>> }
>>>>>
>>>>> #ifdef CONFIG_RPS
>>>>> @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
>>>>> LIST_HEAD(list);
>>>>> LIST_HEAD(repoll);
>>>>>
>>>>> +start:
>>>>> sd->in_net_rx_action = true;
>>>>> local_irq_disable();
>>>>> list_splice_init(&sd->poll_list, &list);
>>>>> @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
>>>>> skb_defer_free_flush(sd);
>>>>>
>>>>> if (list_empty(&list)) {
>>>>> - sd->in_net_rx_action = false;
>>>>> - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
>>>>> - goto end;
>>>>> + if (list_empty(&repoll)) {
>>>>> + sd->in_net_rx_action = false;
>>>>> + barrier();
>>>>
>>>> Do we need a stronger barrier to prevent out-of-order execution
>>>> from cpu?
>>>
>>> We do not need more than barrier() to make sure local cpu wont move this
>>> write after the following code.
>>
>> Is there any reason why we need the barrier() if we are not depending
>> on how list_empty() is coded?
>> It seems not obvious to me at least:)
>>
>>>
>>> It should not, even without the barrier(), because of the way
>>> list_empty() is coded,
>>> but a barrier() makes things a bit more explicit.
>>
>> In that case, a comment explaining that may help a lot.
>>
>> Thanks.
>>
>>>
>>>> Do we need a barrier between list_add_tail() and sd->in_net_rx_action
>>>> checking in ____napi_schedule() to pair with the above barrier?
>>>
>>> I do not think so.
>>>
>>> While in ____napi_schedule(), sd->in_net_rx_action is stable
>>> because we run with hardware IRQ masked.
>>>
>>> Thanks.
>>>
>>>
>>>
> .
>
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-30 6:47 ` Yunsheng Lin
@ 2023-03-30 7:36 ` Yunsheng Lin
0 siblings, 0 replies; 25+ messages in thread
From: Yunsheng Lin @ 2023-03-30 7:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jason Xing, netdev,
eric.dumazet
On 2023/3/30 14:47, Yunsheng Lin wrote:
> On 2023/3/30 10:57, Eric Dumazet wrote:
>> On Thu, Mar 30, 2023 at 4:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>
>>> On 2023/3/29 23:47, Eric Dumazet wrote:
>>>> On Wed, Mar 29, 2023 at 2:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>>
>>>>> On 2023/3/29 7:50, Eric Dumazet wrote:
>>>>>> ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
>>>>>> then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
>>>>>>
>>>>>> Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
>>>>>> from net_rx_action(), because we can process poll_list from this point,
>>>>>> without going to full softirq loop.
>>>>>>
>>>>>> This needs a change in net_rx_action() to make sure we restart
>>>>>> its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
>>>>>> being raised.
>>>>>>
>>>>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>>>> Cc: Jason Xing <kernelxing@tencent.com>
>>>>>> ---
>>>>>> net/core/dev.c | 22 ++++++++++++++++++----
>>>>>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>>> index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
>>>>>> --- a/net/core/dev.c
>>>>>> +++ b/net/core/dev.c
>>>>>> @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>>>>> }
>>>>>>
>>>>>> list_add_tail(&napi->poll_list, &sd->poll_list);
>>>>>> - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>>>>> + /* If not called from net_rx_action()
>>>>>> + * we have to raise NET_RX_SOFTIRQ.
>>>>>> + */
>>>>>> + if (!sd->in_net_rx_action)
>>>>>
>>>>> It seems sd->in_net_rx_action may be read/writen by different CPUs at the same
>>>>> time, does it need something like READ_ONCE()/WRITE_ONCE() to access
>>>>> sd->in_net_rx_action?
>>>>
>>>> You probably missed the 2nd patch, explaining the in_net_rx_action is
>>>> only read and written by the current (owning the percpu var) cpu.
>>>>
>>>> Have you found a caller that is not providing to ____napi_schedule( sd
>>>> = this_cpu_ptr(&softnet_data)) ?
>>>
>>> You are right.
>>>
>>> The one small problem I see is that a ____napi_schedule() call by an irq handler
>>> may preempt the running net_rx_action() on the current cpu. I am not sure if
>>> it is worth handling, given that it is expected that irqs should be disabled
>>> when net_rx_action() is running?
>>
>> And what will happen? If the interrupt comes before
>>
>> in_net_rx_action = val;
>>
>> The interrupt handler will see the old value; this is really fine in all cases.
>>
>> If it comes after the assignment, the interrupt handler will see the new value,
>> because a cpu cannot reorder its own reads/writes.
>>
>> Otherwise simple things like this would fail:
>>
>> a = 2;
>> b = a;
>> assert(b == 2);
>>
>>
>> 1) Note that the local_irq_disable(); after the first
>>
>> sd->in_net_rx_action = true;
>>
>> in net_rx_action() already provides a strong barrier.
>>
>> 2) sd->in_net_rx_action = false before the barrier() is enough to
>> provide needed safety for _this_ cpu.
>>
>> 3) Final sd->in_net_rx_action = false; at the end of net_rx_action()
>> is performed while hard irq are masked.
>>
>>
>>
>>
>>> Do we need to protect against buggy hw or a misbehaving driver?
>>
>> If you think there is an issue please elaborate with the exact call
>> site/ interruption point, because I do not see any.
>
> I was wondering whether load/store tearing or out-of-order execution
> could make something go wrong here.
>
> For load/store tearing, in_net_rx_action in 'struct softnet_data' is a
> bool, so I think it should be ok here, but would it be better to make
> that explicit by using READ_ONCE()/WRITE_ONCE()?
> LWN article about load/store tearing:
> https://lwn.net/Articles/793253/
>
> For out-of-order execution, I am not sure if it is really a problem
> for an irq preempting a softirq on the same CPU yet. For example, in
> the code below, if the list_empty(&sd->poll_list) check is executed
> out-of-order with "sd->in_net_rx_action = false", and an irq which
> calls ____napi_schedule() preempts between the list_empty(&sd->poll_list)
> check and "sd->in_net_rx_action = false", then ____napi_schedule() will
> not raise the softirq as sd->in_net_rx_action is still true. After the
> irq finishes, as list_empty(&sd->poll_list) has already been checked,
> net_rx_action() may not goto 'start'.
Per the LWN article above, it seems barrier() is enough?
"As it turns out, this isn't a problem. Modern machines have
"exact exceptions" and "exact interrupts", meaning that any
interrupt or exception will appear to have happened at a
specific place in the instruction stream. Consequently,
the handler will see the effect of all prior instructions,
but won't see the effect of any subsequent instructions."
Thanks for clarifying.
>
>
> + sd->in_net_rx_action = false;
> + barrier();
> + /* We need to check if ____napi_schedule()
> + * had refilled poll_list while
> + * sd->in_net_rx_action was true.
> + */
> + if (!list_empty(&sd->poll_list))
>
>
>
>>
>>
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>> + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>>>>> }
>>>>>>
>>>>>> #ifdef CONFIG_RPS
>>>>>> @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
>>>>>> LIST_HEAD(list);
>>>>>> LIST_HEAD(repoll);
>>>>>>
>>>>>> +start:
>>>>>> sd->in_net_rx_action = true;
>>>>>> local_irq_disable();
>>>>>> list_splice_init(&sd->poll_list, &list);
>>>>>> @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
>>>>>> skb_defer_free_flush(sd);
>>>>>>
>>>>>> if (list_empty(&list)) {
>>>>>> - sd->in_net_rx_action = false;
>>>>>> - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
>>>>>> - goto end;
>>>>>> + if (list_empty(&repoll)) {
>>>>>> + sd->in_net_rx_action = false;
>>>>>> + barrier();
>>>>>
>>>>> Do we need a stronger barrier to prevent out-of-order execution
>>>>> by the cpu?
>>>>
>>>> We do not need more than barrier() to make sure the local cpu won't move this
>>>> write after the following code.
>>>
>>> Is there any reason why we need the barrier() if we are not depending
>>> on how list_empty() is coded?
>>> It does not seem obvious to me, at least :)
>>>
>>>>
>>>> It should not, even without the barrier(), because of the way
>>>> list_empty() is coded,
>>>> but a barrier() makes things a bit more explicit.
>>>
>>> In that case, a comment explaining that may help a lot.
>>>
>>> Thanks.
>>>
>>>>
>>>>> Do we need a barrier between list_add_tail() and sd->in_net_rx_action
>>>>> checking in ____napi_schedule() to pair with the above barrier?
>>>>
>>>> I do not think so.
>>>>
>>>> While in ____napi_schedule(), sd->in_net_rx_action is stable
>>>> because we run with hardware IRQ masked.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>> .
>>
>
> .
>
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-28 23:50 ` [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ Eric Dumazet
2023-03-29 12:47 ` Yunsheng Lin
@ 2023-03-30 9:50 ` Jason Xing
2023-03-30 11:39 ` Paolo Abeni
1 sibling, 1 reply; 25+ messages in thread
From: Jason Xing @ 2023-03-30 9:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jason Xing, netdev,
eric.dumazet
On Wed, Mar 29, 2023 at 7:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
> then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
>
> Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
> from net_rx_action(), because we can process poll_list from this point,
> without going to full softirq loop.
>
> This needs a change in net_rx_action() to make sure we restart
> its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
> being raised.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Xing <kernelxing@tencent.com>
> ---
> net/core/dev.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> }
>
> list_add_tail(&napi->poll_list, &sd->poll_list);
> - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> + /* If not called from net_rx_action()
> + * we have to raise NET_RX_SOFTIRQ.
> + */
> + if (!sd->in_net_rx_action)
> + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> }
>
> #ifdef CONFIG_RPS
> @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> LIST_HEAD(list);
> LIST_HEAD(repoll);
>
> +start:
> sd->in_net_rx_action = true;
> local_irq_disable();
> list_splice_init(&sd->poll_list, &list);
> @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> skb_defer_free_flush(sd);
>
> if (list_empty(&list)) {
> - sd->in_net_rx_action = false;
> - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
> - goto end;
> + if (list_empty(&repoll)) {
> + sd->in_net_rx_action = false;
> + barrier();
> + /* We need to check if ____napi_schedule()
> + * had refilled poll_list while
> + * sd->in_net_rx_action was true.
> + */
> + if (!list_empty(&sd->poll_list))
> + goto start;
I noticed that since we decide to go back and restart this loop, it
would be better to check the time_limit. More than that,
skb_defer_free_flush() can consume some time, which should be taken
into account.
Just for your consideration.
> + if (!sd_has_rps_ipi_waiting(sd))
> + goto end;
> + }
> break;
> }
>
> --
> 2.40.0.348.gf938b09366-goog
>
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-30 9:50 ` Jason Xing
@ 2023-03-30 11:39 ` Paolo Abeni
2023-03-30 11:44 ` Eric Dumazet
2023-03-30 12:03 ` Jason Xing
0 siblings, 2 replies; 25+ messages in thread
From: Paolo Abeni @ 2023-03-30 11:39 UTC (permalink / raw)
To: Jason Xing, Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Jason Xing, netdev,
eric.dumazet
On Thu, 2023-03-30 at 17:50 +0800, Jason Xing wrote:
> On Wed, Mar 29, 2023 at 7:53 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
> > then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
> >
> > Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
> > from net_rx_action(), because we can process poll_list from this point,
> > without going to full softirq loop.
> >
> > This needs a change in net_rx_action() to make sure we restart
> > its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
> > being raised.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Jason Xing <kernelxing@tencent.com>
> > ---
> > net/core/dev.c | 22 ++++++++++++++++++----
> > 1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> > }
> >
> > list_add_tail(&napi->poll_list, &sd->poll_list);
> > - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > + /* If not called from net_rx_action()
> > + * we have to raise NET_RX_SOFTIRQ.
> > + */
> > + if (!sd->in_net_rx_action)
> > + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > }
> >
> > #ifdef CONFIG_RPS
> > @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > LIST_HEAD(list);
> > LIST_HEAD(repoll);
> >
> > +start:
> > sd->in_net_rx_action = true;
> > local_irq_disable();
> > list_splice_init(&sd->poll_list, &list);
> > @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > skb_defer_free_flush(sd);
> >
> > if (list_empty(&list)) {
> > - sd->in_net_rx_action = false;
> > - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
> > - goto end;
> > + if (list_empty(&repoll)) {
> > + sd->in_net_rx_action = false;
> > + barrier();
> > + /* We need to check if ____napi_schedule()
> > + * had refilled poll_list while
> > + * sd->in_net_rx_action was true.
> > + */
> > + if (!list_empty(&sd->poll_list))
> > + goto start;
>
> I noticed that since we decide to go back and restart this loop, it
> would be better to check the time_limit. More than that,
> skb_defer_free_flush() can consume some time, which should be taken
> into account.
Note that we can have a __napi_schedule() invocation with
sd->in_net_rx_action set only after executing the napi_poll() call below,
and thus after the related time check (that is, after performing at least
one full iteration of the main for(;;) loop).
I don't think another check right here is needed.
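Schematically (a simplified sketch of the loop being discussed, not the
exact mainline code):

	for (;;) {
		if (list_empty(&list)) {
			/* the refill check / goto start sits here */
			break;
		}
		/* only past this point can ____napi_schedule() run
		 * with sd->in_net_rx_action == true
		 */
		budget -= napi_poll(n, &repoll);
		/* and the time/budget limit is enforced right after
		 * each poll
		 */
		if (unlikely(budget <= 0 ||
			     time_after_eq(jiffies, time_limit))) {
			sd->time_squeeze++;
			break;
		}
	}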
Thanks,
Paolo
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-30 11:39 ` Paolo Abeni
@ 2023-03-30 11:44 ` Eric Dumazet
2023-03-30 12:03 ` Jason Xing
1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2023-03-30 11:44 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jason Xing, David S . Miller, Jakub Kicinski, Jason Xing, netdev,
eric.dumazet
On Thu, Mar 30, 2023 at 1:39 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2023-03-30 at 17:50 +0800, Jason Xing wrote:
> > On Wed, Mar 29, 2023 at 7:53 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
> > > then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
> > >
> > > Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
> > > from net_rx_action(), because we can process poll_list from this point,
> > > without going to full softirq loop.
> > >
> > > This needs a change in net_rx_action() to make sure we restart
> > > its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
> > > being raised.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Cc: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > net/core/dev.c | 22 ++++++++++++++++++----
> > > 1 file changed, 18 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> > > }
> > >
> > > list_add_tail(&napi->poll_list, &sd->poll_list);
> > > - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > > + /* If not called from net_rx_action()
> > > + * we have to raise NET_RX_SOFTIRQ.
> > > + */
> > > + if (!sd->in_net_rx_action)
> > > + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > > }
> > >
> > > #ifdef CONFIG_RPS
> > > @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > > LIST_HEAD(list);
> > > LIST_HEAD(repoll);
> > >
> > > +start:
> > > sd->in_net_rx_action = true;
> > > local_irq_disable();
> > > list_splice_init(&sd->poll_list, &list);
> > > @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > > skb_defer_free_flush(sd);
> > >
> > > if (list_empty(&list)) {
> > > - sd->in_net_rx_action = false;
> > > - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
> > > - goto end;
> > > + if (list_empty(&repoll)) {
> > > + sd->in_net_rx_action = false;
> > > + barrier();
> > > + /* We need to check if ____napi_schedule()
> > > + * had refilled poll_list while
> > > + * sd->in_net_rx_action was true.
> > > + */
> > > + if (!list_empty(&sd->poll_list))
> > > + goto start;
> >
> > I noticed that since we decide to go back and restart this loop, it
> > would be better to check the time_limit. More than that,
> > skb_defer_free_flush() can consume some time, which should be taken
> > into account.
>
> Note that we can have a __napi_schedule() invocation with
> sd->in_net_rx_action set only after executing the napi_poll() call below,
> and thus after the related time check (that is, after performing at least
> one full iteration of the main for(;;) loop).
>
> I don't think another check right here is needed.
I was about to say the same thing.
Back to READ_ONCE() and WRITE_ONCE(): I originally had them in my tree,
then realized they were not needed and, as a matter of fact, confusing.
barrier() is the right thing here, and only one is needed; the others
are implicit, because before hitting the reads we must call out-of-line
functions.
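To illustrate the point (a sketch of the reasoning, assuming the calls
are not inlined):

	sd->in_net_rx_action = false;
	barrier();	/* the only explicit one: stops the compiler
			 * from hoisting the load below above the store
			 */
	if (!list_empty(&sd->poll_list))
		goto start;

	/* elsewhere the barrier is implicit: an out-of-line call such
	 * as skb_defer_free_flush(sd) forces the compiler to commit
	 * the store before the call and to reload sd fields after it
	 */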
* Re: [PATCH net-next 0/4] net: rps/rfs improvements
2023-03-28 23:50 [PATCH net-next 0/4] net: rps/rfs improvements Eric Dumazet
` (5 preceding siblings ...)
2023-03-30 3:04 ` Jakub Kicinski
@ 2023-03-30 12:00 ` patchwork-bot+netdevbpf
6 siblings, 0 replies; 25+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-03-30 12:00 UTC (permalink / raw)
To: Eric Dumazet; +Cc: davem, kuba, pabeni, kernelxing, netdev, eric.dumazet
Hello:
This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Tue, 28 Mar 2023 23:50:17 +0000 you wrote:
> Jason Xing attempted to optimize napi_schedule_rps() by avoiding
> unneeded NET_RX_SOFTIRQ raises: [1], [2]
>
> This is quite complex to implement properly. I chose to implement
> the idea, and added a similar optimization in ____napi_schedule()
>
> Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
> I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
> invocations.
>
> [...]
Here is the summary with links:
- [net-next,1/4] net: napi_schedule_rps() cleanup
https://git.kernel.org/netdev/net-next/c/8fcb76b934da
- [net-next,2/4] net: add softnet_data.in_net_rx_action
https://git.kernel.org/netdev/net-next/c/c59647c0dc67
- [net-next,3/4] net: optimize napi_schedule_rps()
https://git.kernel.org/netdev/net-next/c/821eba962d95
- [net-next,4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
https://git.kernel.org/netdev/net-next/c/8b43fd3d1d7d
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net-next 4/4] net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
2023-03-30 11:39 ` Paolo Abeni
2023-03-30 11:44 ` Eric Dumazet
@ 2023-03-30 12:03 ` Jason Xing
1 sibling, 0 replies; 25+ messages in thread
From: Jason Xing @ 2023-03-30 12:03 UTC (permalink / raw)
To: Paolo Abeni
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Jason Xing,
netdev, eric.dumazet
On Thu, Mar 30, 2023 at 7:39 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2023-03-30 at 17:50 +0800, Jason Xing wrote:
> > On Wed, Mar 29, 2023 at 7:53 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > ____napi_schedule() adds a napi into current cpu softnet_data poll_list,
> > > then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
> > >
> > > Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
> > > from net_rx_action(), because we can process poll_list from this point,
> > > without going to full softirq loop.
> > >
> > > This needs a change in net_rx_action() to make sure we restart
> > > its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
> > > being raised.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Cc: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > net/core/dev.c | 22 ++++++++++++++++++----
> > > 1 file changed, 18 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index f34ce93f2f02e7ec71f5e84d449fa99b7a882f0c..0c4b21291348d4558f036fb05842dab023f65dc3 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4360,7 +4360,11 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> > > }
> > >
> > > list_add_tail(&napi->poll_list, &sd->poll_list);
> > > - __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > > + /* If not called from net_rx_action()
> > > + * we have to raise NET_RX_SOFTIRQ.
> > > + */
> > > + if (!sd->in_net_rx_action)
> > > + __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > > }
> > >
> > > #ifdef CONFIG_RPS
> > > @@ -6648,6 +6652,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > > LIST_HEAD(list);
> > > LIST_HEAD(repoll);
> > >
> > > +start:
> > > sd->in_net_rx_action = true;
> > > local_irq_disable();
> > > list_splice_init(&sd->poll_list, &list);
> > > @@ -6659,9 +6664,18 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > > skb_defer_free_flush(sd);
> > >
> > > if (list_empty(&list)) {
> > > - sd->in_net_rx_action = false;
> > > - if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
> > > - goto end;
> > > + if (list_empty(&repoll)) {
> > > + sd->in_net_rx_action = false;
> > > + barrier();
> > > + /* We need to check if ____napi_schedule()
> > > + * had refilled poll_list while
> > > + * sd->in_net_rx_action was true.
> > > + */
> > > + if (!list_empty(&sd->poll_list))
> > > + goto start;
> >
> > I noticed that since we decide to go back and restart this loop, it
> > would be better to check the time_limit. More than that,
> > skb_defer_free_flush() can consume some time, which should be taken
> > into account.
>
> Note that we can have a __napi_schedule() invocation with
> sd->in_net_rx_action set only after executing the napi_poll() call below,
> and thus after the related time check (that is, after performing at least
> one full iteration of the main for(;;) loop).
Hello Paolo,
It also took me a while to consider. I'm not that sure whether we
should check again, so I added "just for your consideration".
Thanks,
Jason
>
> I don't think another check right here is needed.
>
> Thanks,
>
> Paolo
>