linux-s390.vger.kernel.org archive mirror
* [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT
@ 2024-10-25  7:46 Wenjia Zhang
  2024-10-25 13:56 ` Simon Horman
  2024-10-25 23:58 ` Dust Li
  0 siblings, 2 replies; 7+ messages in thread
From: Wenjia Zhang @ 2024-10-25  7:46 UTC (permalink / raw)
  To: Wen Gu, D. Wythe, Tony Lu, David Miller, Jakub Kicinski,
	Eric Dumazet, Paolo Abeni
  Cc: netdev, linux-s390, Heiko Carstens, Jan Karcher, Gerd Bayer,
	Alexandra Winter, Halil Pasic, Nils Hoppmann, Niklas Schnell,
	Thorsten Winkler, Karsten Graul, Stefan Raspl, Wenjia Zhang

From: Halil Pasic <pasic@linux.ibm.com>

The current value of SMC_WR_BUF_CNT is 16, which leads to heavy
contention on the wr_tx_wait wait queue of the SMC-R link group and its
spinlock when many connections compete for a free send buffer.
Currently up to 256 connections per link group are supported.

To make things worse, when a buffer finally becomes available and
smc_wr_tx_put_slot() signals the link group's wr_tx_wait wait queue,
all waiters are woken up because WQ_FLAG_EXCLUSIVE is not used. Most of
the time only a single waiter can proceed, and the rest contend on the
wait queue's spinlock just to go back to sleep again.

For some reason include/linux/wait.h does not offer a top-level wrapper
macro for wait_event() that combines interruptible, exclusive and
timeout. I did not spend many cycles on whether that combination even
makes sense (at a glance I see no reason why not), and I likewise
refrained from building the interruptible, exclusive and timeout
combination on top of the lower-level __wait_event interface.
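
For reference, a rough sketch of what such a wrapper could look like,
built the same way as wait_event_interruptible_timeout() but with the
exclusive flag set (hypothetical and untested, shown only to illustrate
the combination discussed above):

    /*
     * Hypothetical sketch only, untested and not part of this patch.
     * It mirrors wait_event_interruptible_timeout() from
     * include/linux/wait.h, but passes 1 for the "exclusive" argument
     * of ___wait_event(), so the waiter is queued with
     * WQ_FLAG_EXCLUSIVE and a plain wake_up() wakes at most one such
     * waiter instead of all of them.
     */
    #define __wait_event_interruptible_exclusive_timeout(wq, cond, timeout) \
            ___wait_event(wq, ___wait_cond_timeout(cond),                    \
                          TASK_INTERRUPTIBLE, 1, timeout,                    \
                          __ret = schedule_timeout(__ret))

    #define wait_event_interruptible_exclusive_timeout(wq, cond, timeout)   \
    ({                                                                       \
            long __ret = timeout;                                            \
            might_sleep();                                                   \
            if (!___wait_cond_timeout(cond))                                 \
                    __ret = __wait_event_interruptible_exclusive_timeout(    \
                                    wq, cond, timeout);                      \
            __ret;                                                           \
    })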

To alleviate the TX performance bottleneck and the CPU overhead caused
by the spinlock contention, let us increase SMC_WR_BUF_CNT to 256.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reported-by: Nils Hoppmann <niho@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
---
 net/smc/smc_wr.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f3008dda222a..81e772e241f3 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,7 +19,7 @@
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
+#define SMC_WR_BUF_CNT 256	/* # of ctrl buffers per link */
 
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
-- 
2.43.0



* Re: [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT
  2024-10-25  7:46 [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT Wenjia Zhang
@ 2024-10-25 13:56 ` Simon Horman
  2024-10-25 23:58 ` Dust Li
  1 sibling, 0 replies; 7+ messages in thread
From: Simon Horman @ 2024-10-25 13:56 UTC (permalink / raw)
  To: Wenjia Zhang
  Cc: Wen Gu, D. Wythe, Tony Lu, David Miller, Jakub Kicinski,
	Eric Dumazet, Paolo Abeni, netdev, linux-s390, Heiko Carstens,
	Jan Karcher, Gerd Bayer, Alexandra Winter, Halil Pasic,
	Nils Hoppmann, Niklas Schnell, Thorsten Winkler, Karsten Graul,
	Stefan Raspl

On Fri, Oct 25, 2024 at 09:46:19AM +0200, Wenjia Zhang wrote:
> From: Halil Pasic <pasic@linux.ibm.com>
> 
> The current value of SMC_WR_BUF_CNT is 16 which leads to heavy
> contention on the wr_tx_wait workqueue of the SMC-R linkgroup and its
> spinlock when many connections are  competing for the buffer. Currently
> up to 256 connections per linkgroup are supported.
> 
> To make things worse when finally a buffer becomes available and
> smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, because
> WQ_FLAG_EXCLUSIVE is not used all the waiters get woken up, most of the
> time a single one can proceed, and the rest is contending on the
> spinlock of the wq to go to sleep again.
> 
> For some reason include/linux/wait.h does not offer a top level wrapper
> macro for wait_event with interruptible, exclusive and timeout. I did
> not spend too many cycles on thinking if that is even a combination that
> makes sense (on the quick I don't see why not) and conversely I
> refrained from making an attempt to accomplish the interruptible,
> exclusive and timeout combo by using the abstraction-wise lower
> level __wait_event interface.
> 
> To alleviate the tx performance bottleneck and the CPU overhead due to
> the spinlock contention, let us increase SMC_WR_BUF_CNT to 256.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> Reported-by: Nils Hoppmann <niho@linux.ibm.com>
> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>

Reviewed-by: Simon Horman <horms@kernel.org>



* Re: [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT
  2024-10-25  7:46 [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT Wenjia Zhang
  2024-10-25 13:56 ` Simon Horman
@ 2024-10-25 23:58 ` Dust Li
  2024-10-31 12:30   ` Halil Pasic
  1 sibling, 1 reply; 7+ messages in thread
From: Dust Li @ 2024-10-25 23:58 UTC (permalink / raw)
  To: Wenjia Zhang, Wen Gu, D. Wythe, Tony Lu, David Miller,
	Jakub Kicinski, Eric Dumazet, Paolo Abeni
  Cc: netdev, linux-s390, Heiko Carstens, Jan Karcher, Gerd Bayer,
	Alexandra Winter, Halil Pasic, Nils Hoppmann, Niklas Schnell,
	Thorsten Winkler, Karsten Graul, Stefan Raspl

On 2024-10-25 09:46:19, Wenjia Zhang wrote:
>From: Halil Pasic <pasic@linux.ibm.com>
>
>The current value of SMC_WR_BUF_CNT is 16 which leads to heavy
>contention on the wr_tx_wait workqueue of the SMC-R linkgroup and its
>spinlock when many connections are  competing for the buffer. Currently
>up to 256 connections per linkgroup are supported.
>
>To make things worse when finally a buffer becomes available and
>smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, because
>WQ_FLAG_EXCLUSIVE is not used all the waiters get woken up, most of the
>time a single one can proceed, and the rest is contending on the
>spinlock of the wq to go to sleep again.
>
>For some reason include/linux/wait.h does not offer a top level wrapper
>macro for wait_event with interruptible, exclusive and timeout. I did
>not spend too many cycles on thinking if that is even a combination that
>makes sense (on the quick I don't see why not) and conversely I
>refrained from making an attempt to accomplish the interruptible,
>exclusive and timeout combo by using the abstraction-wise lower
>level __wait_event interface.
>
>To alleviate the tx performance bottleneck and the CPU overhead due to
>the spinlock contention, let us increase SMC_WR_BUF_CNT to 256.

Hi,

Have you tested other values, such as 64? In our internal version, we
have used 64 for some time.

Increasing this to 256 will require a 36K contiguous physical memory
allocation in smc_wr_alloc_link_mem(). Based on my experience, this may
fail on servers that have been running for a long time and have
fragmented memory.

    link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, SMC_WR_BUF_SIZE,
                               GFP_KERNEL);

As we can see, the link->wr_rx_bufs allocation will grow from
16 * 3 * 48 = 2,304 bytes to 256 * 3 * 48 = 36,864 bytes (1 page to
9 pages).
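
Just to double-check that arithmetic, here is a throwaway user-space
snippet (it assumes SMC_WR_BUF_SIZE is 48 bytes and a 4 KiB page size):

    /* Compile-time sanity check of the sizes quoted above (C11). */
    #include <assert.h>

    static_assert(16  * 3 * 48 == 2304,  "old wr_rx_bufs size in bytes");
    static_assert(256 * 3 * 48 == 36864, "new wr_rx_bufs size in bytes");
    static_assert((2304  + 4095) / 4096 == 1, "one 4 KiB page");
    static_assert((36864 + 4095) / 4096 == 9, "nine 4 KiB pages");

    int main(void) { return 0; }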

Best regards,
Dust

>
>Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
>Reported-by: Nils Hoppmann <niho@linux.ibm.com>
>Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
>Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
>---
> net/smc/smc_wr.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
>index f3008dda222a..81e772e241f3 100644
>--- a/net/smc/smc_wr.h
>+++ b/net/smc/smc_wr.h
>@@ -19,7 +19,7 @@
> #include "smc.h"
> #include "smc_core.h"
> 
>-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
>+#define SMC_WR_BUF_CNT 256	/* # of ctrl buffers per link */
> 
> #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
> 
>-- 
>2.43.0
>


* Re: [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT
  2024-10-25 23:58 ` Dust Li
@ 2024-10-31 12:30   ` Halil Pasic
  2024-11-04 16:42     ` Halil Pasic
  0 siblings, 1 reply; 7+ messages in thread
From: Halil Pasic @ 2024-10-31 12:30 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, Wen Gu, D. Wythe, Tony Lu, David Miller,
	Jakub Kicinski, Eric Dumazet, Paolo Abeni, netdev, linux-s390,
	Heiko Carstens, Jan Karcher, Gerd Bayer, Alexandra Winter,
	Nils Hoppmann, Niklas Schnell, Thorsten Winkler, Karsten Graul,
	Stefan Raspl, Halil Pasic

On Sat, 26 Oct 2024 07:58:39 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >For some reason include/linux/wait.h does not offer a top level wrapper
> >macro for wait_event with interruptible, exclusive and timeout. I did
> >not spend too many cycles on thinking if that is even a combination that
> >makes sense (on the quick I don't see why not) and conversely I
> >refrained from making an attempt to accomplish the interruptible,
> >exclusive and timeout combo by using the abstraction-wise lower
> >level __wait_event interface.
> >
> >To alleviate the tx performance bottleneck and the CPU overhead due to
> >the spinlock contention, let us increase SMC_WR_BUF_CNT to 256.  
> 
> Hi,
> 
> Have you tested other values, such as 64? In our internal version, we
> have used 64 for some time.

Yes, we have, but I'm not sure the data can still be found. Let me do
some digging.

> 
> Increasing this to 256 will require a 36K continuous physical memory
> allocation in smc_wr_alloc_link_mem(). Based on my experience, this may
> fail on servers that have been running for a long time and have
> fragmented memory.

Good point! It is possible that I did not give sufficient thought to
this aspect.

Regards,
Halil


* Re: [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT
  2024-10-31 12:30   ` Halil Pasic
@ 2024-11-04 16:42     ` Halil Pasic
  2024-11-05 10:16       ` Paolo Abeni
  2024-11-05 14:34       ` Dust Li
  0 siblings, 2 replies; 7+ messages in thread
From: Halil Pasic @ 2024-11-04 16:42 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, Wen Gu, D. Wythe, Tony Lu, David Miller,
	Jakub Kicinski, Eric Dumazet, Paolo Abeni, netdev, linux-s390,
	Heiko Carstens, Jan Karcher, Gerd Bayer, Alexandra Winter,
	Nils Hoppmann, Niklas Schnell, Thorsten Winkler, Karsten Graul,
	Stefan Raspl, Halil Pasic

On Thu, 31 Oct 2024 13:30:17 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Sat, 26 Oct 2024 07:58:39 +0800
> Dust Li <dust.li@linux.alibaba.com> wrote:
> 
> > >For some reason include/linux/wait.h does not offer a top level wrapper
> > >macro for wait_event with interruptible, exclusive and timeout. I did
> > >not spend too many cycles on thinking if that is even a combination that
> > >makes sense (on the quick I don't see why not) and conversely I
> > >refrained from making an attempt to accomplish the interruptible,
> > >exclusive and timeout combo by using the abstraction-wise lower
> > >level __wait_event interface.
> > >
> > >To alleviate the tx performance bottleneck and the CPU overhead due to
> > >the spinlock contention, let us increase SMC_WR_BUF_CNT to 256.    
> > 
> > Hi,
> > 
> > Have you tested other values, such as 64? In our internal version, we
> > have used 64 for some time.  
> 
> Yes we have, but I'm not sure the data is still to be found. Let me do
> some digging.
> 

We did some digging, and according to that data 64 is not likely to cut
it on the TX side for a highly parallel request-response workload. But
we will take some more measurements in the coming days just to be on
the safe side.

> > 
> > Increasing this to 256 will require a 36K continuous physical memory
> > allocation in smc_wr_alloc_link_mem(). Based on my experience, this may
> > fail on servers that have been running for a long time and have
> > fragmented memory.  
> 
> Good point! It is possible that I did not give sufficient thought to
> this aspect.
> 

A failing allocation would lead to a fallback to TCP, I believe, which
I don't consider a catastrophic failure.

But let us put this patch on hold and see if we can come up with
something better.

Regards,
Halil



* Re: [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT
  2024-11-04 16:42     ` Halil Pasic
@ 2024-11-05 10:16       ` Paolo Abeni
  2024-11-05 14:34       ` Dust Li
  1 sibling, 0 replies; 7+ messages in thread
From: Paolo Abeni @ 2024-11-05 10:16 UTC (permalink / raw)
  To: Halil Pasic, Dust Li
  Cc: Wenjia Zhang, Wen Gu, D. Wythe, Tony Lu, David Miller,
	Jakub Kicinski, Eric Dumazet, netdev, linux-s390, Heiko Carstens,
	Jan Karcher, Gerd Bayer, Alexandra Winter, Nils Hoppmann,
	Niklas Schnell, Thorsten Winkler, Karsten Graul, Stefan Raspl

On 11/4/24 17:42, Halil Pasic wrote:
> On Thu, 31 Oct 2024 13:30:17 +0100
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
>> On Sat, 26 Oct 2024 07:58:39 +0800
>> Dust Li <dust.li@linux.alibaba.com> wrote:
>>
>>>> For some reason include/linux/wait.h does not offer a top level wrapper
>>>> macro for wait_event with interruptible, exclusive and timeout. I did
>>>> not spend too many cycles on thinking if that is even a combination that
>>>> makes sense (on the quick I don't see why not) and conversely I
>>>> refrained from making an attempt to accomplish the interruptible,
>>>> exclusive and timeout combo by using the abstraction-wise lower
>>>> level __wait_event interface.
>>>>
>>>> To alleviate the tx performance bottleneck and the CPU overhead due to
>>>> the spinlock contention, let us increase SMC_WR_BUF_CNT to 256.    
>>>
>>> Hi,
>>>
>>> Have you tested other values, such as 64? In our internal version, we
>>> have used 64 for some time.  
>>
>> Yes we have, but I'm not sure the data is still to be found. Let me do
>> some digging.
>>
> 
> We did some digging and according to that data 64 is not likely to cut
> it on the TX end for highly parallel request-response workload. But we
> will measure some more these days just to be on the safe side.
> 
>>>
>>> Increasing this to 256 will require a 36K continuous physical memory
>>> allocation in smc_wr_alloc_link_mem(). Based on my experience, this may
>>> fail on servers that have been running for a long time and have
>>> fragmented memory.  
>>
>> Good point! It is possible that I did not give sufficient thought to
>> this aspect.
>>
> 
> The failing allocation would lead to a fallback to TCP I believe. Which
> I don't consider a catastrophic failure.
> 
> But let us put this patch on hold and see if we can come up with
> something better.

FTR, I marked this patch as 'changes requested' given the possible risk
of regressions (more frequent fallback to TCP).

We can revive it should an agreement be reached.

Thanks,

Paolo




* Re: [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT
  2024-11-04 16:42     ` Halil Pasic
  2024-11-05 10:16       ` Paolo Abeni
@ 2024-11-05 14:34       ` Dust Li
  1 sibling, 0 replies; 7+ messages in thread
From: Dust Li @ 2024-11-05 14:34 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Wenjia Zhang, Wen Gu, D. Wythe, Tony Lu, David Miller,
	Jakub Kicinski, Eric Dumazet, Paolo Abeni, netdev, linux-s390,
	Heiko Carstens, Jan Karcher, Gerd Bayer, Alexandra Winter,
	Nils Hoppmann, Niklas Schnell, Thorsten Winkler, Karsten Graul,
	Stefan Raspl

On 2024-11-04 17:42:15, Halil Pasic wrote:
>On Thu, 31 Oct 2024 13:30:17 +0100
>Halil Pasic <pasic@linux.ibm.com> wrote:
>
>> On Sat, 26 Oct 2024 07:58:39 +0800
>> Dust Li <dust.li@linux.alibaba.com> wrote:
>> 
>> > >For some reason include/linux/wait.h does not offer a top level wrapper
>> > >macro for wait_event with interruptible, exclusive and timeout. I did
>> > >not spend too many cycles on thinking if that is even a combination that
>> > >makes sense (on the quick I don't see why not) and conversely I
>> > >refrained from making an attempt to accomplish the interruptible,
>> > >exclusive and timeout combo by using the abstraction-wise lower
>> > >level __wait_event interface.
>> > >
>> > >To alleviate the tx performance bottleneck and the CPU overhead due to
>> > >the spinlock contention, let us increase SMC_WR_BUF_CNT to 256.    
>> > 
>> > Hi,
>> > 
>> > Have you tested other values, such as 64? In our internal version, we
>> > have used 64 for some time.  
>> 
>> Yes we have, but I'm not sure the data is still to be found. Let me do
>> some digging.
>> 
>
>We did some digging and according to that data 64 is not likely to cut
>it on the TX end for highly parallel request-response workload. But we
>will measure some more these days just to be on the safe side.
>
>> > 
>> > Increasing this to 256 will require a 36K continuous physical memory
>> > allocation in smc_wr_alloc_link_mem(). Based on my experience, this may
>> > fail on servers that have been running for a long time and have
>> > fragmented memory.  
>> 
>> Good point! It is possible that I did not give sufficient thought to
>> this aspect.
>> 
>
>The failing allocation would lead to a fallback to TCP I believe. Which
>I don't consider a catastrophic failure.

Yes, but a fallback to TCP may not be the only consequence.

When little contiguous physical memory is available, a large contiguous
allocation made without flags like __GFP_NORETRY can trigger memory
compaction. We've encountered problems with this before; the one I
still remember is that the statistics buffer for mlx5 was once
allocated with kmalloc() and later changed to kvmalloc(), because the
large contiguous physical allocation caused problems on production
servers.
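
For illustration, the kmalloc-to-kvmalloc pattern I am referring to
looks roughly like the sketch below. It is a generic, made-up example
(the names are invented), not a proposal for net/smc; in particular,
buffers that are later DMA-mapped as a single region still need to be
physically contiguous, so kvmalloc() is not a drop-in fix everywhere.

    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    struct demo_stats {
            u64 *counters;          /* potentially large array */
            unsigned int nr;
    };

    static int demo_stats_alloc(struct demo_stats *s, unsigned int nr)
    {
            /* was: kcalloc(nr, sizeof(*s->counters), GFP_KERNEL); */
            s->counters = kvcalloc(nr, sizeof(*s->counters), GFP_KERNEL);
            if (!s->counters)
                    return -ENOMEM;
            s->nr = nr;
            return 0;
    }

    static void demo_stats_free(struct demo_stats *s)
    {
            /* kvfree() handles both kmalloc'd and vmalloc'd memory */
            kvfree(s->counters);
            s->counters = NULL;
    }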

>
>But let us put this patch on hold and see if we can come up with
>something better.

Agree

Best regards,
Dust



end of thread

Thread overview: 7 messages
2024-10-25  7:46 [PATCH net-next] net/smc: increase SMC_WR_BUF_CNT Wenjia Zhang
2024-10-25 13:56 ` Simon Horman
2024-10-25 23:58 ` Dust Li
2024-10-31 12:30   ` Halil Pasic
2024-11-04 16:42     ` Halil Pasic
2024-11-05 10:16       ` Paolo Abeni
2024-11-05 14:34       ` Dust Li
