public inbox for linux-s390@vger.kernel.org
* SMC-R throughput drops for specific message sizes
@ 2023-12-01 13:33 Nikolaou Alexandros (SO/PAF1-Mb)
  2023-12-04 16:09 ` Wenjia Zhang
  0 siblings, 1 reply; 11+ messages in thread
From: Nikolaou Alexandros (SO/PAF1-Mb) @ 2023-12-01 13:33 UTC (permalink / raw)
  To: linux-s390@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1547 bytes --]

Dear SMC Maintainers and Contributors,
 
I would like to bring to your attention some observations from recent experiments with SMC-R on the following system:
 
- SMC version: 3E92E1460DA96BE2B2DDC2F (version 1)
- Kernel: both v5.4 and v5.15
- Ubuntu: 20.04
- Benchmarking tool: ‘qperf’
- Measurement: SMC-R vs. TCP/IP throughput for message sizes ranging from 0 to 8 MB.
 

As the attached plot shows, the SMC-R throughput drops substantially in a regular, periodic pattern.

A few configurations have been checked, but the pattern described above persists:
 
- All the min/default/max values of ‘net.ipv4.tcp_rmem’ and ‘net.ipv4.tcp_wmem’ have been utilized.
- Experiments with the Linux kernel v5.4 and v5.15 have been conducted.
- The MTU size has been changed from 1500 up to 9000.
- Several message sizes have been tried.
 
Could you help me understand the reason behind these drops? This behavior is not observed with the TCP/IP stack. Any insights or guidance would be highly appreciated.
 
Thank you very much in advance.
 
Mit freundlichen Grüßen / Best regards

Alexandros Nikolaou

Bosch Service Solutions Magdeburg GmbH | Otto-von-Guericke-Str. 13 | 39104 Magdeburg | GERMANY | www.boschservicesolutions.com
Alexandros.Nikolaou@de.bosch.com

Sitz: Magdeburg, Registergericht: Amtsgericht Stendal, HRB 24039
Geschäftsführung: Robert Mulatz, Daniel Meyer

[-- Attachment #2: smc_valleys.png --]
[-- Type: image/png, Size: 95309 bytes --]

* Re: SMC-R throughput drops for specific message sizes
@ 2023-12-13 15:52 Nikolaou Alexandros (SO/PAF1-Mb)
  0 siblings, 0 replies; 11+ messages in thread
From: Nikolaou Alexandros (SO/PAF1-Mb) @ 2023-12-13 15:52 UTC (permalink / raw)
  To: Wen Gu, Gerd Bayer, D . Wythe, Tony Lu, Nils Hoppmann
  Cc: linux-s390@vger.kernel.org, netdev, Wenjia Zhang, Jan Karcher,
	Dust Li

Dear Gerd,

Thank you for directing this matter to the most relevant group and
individuals. Your support is greatly appreciated. We are also looking deeper into the issue ourselves,
and I will keep this thread updated with any new findings.
Should you need any additional information, please feel free to reach out.

Mit freundlichen Grüßen / Best regards

Alexandros Nikolaou



From: Wen Gu <guwen@linux.alibaba.com>
Sent: Wednesday, December 13, 2023 14:38
To: Gerd Bayer <gbayer@linux.ibm.com>; Nikolaou Alexandros (SO/PAF1-Mb) <Alexandros.Nikolaou@de.bosch.com>; D . Wythe <alibuda@linux.alibaba.com>; Tony Lu <tonylu@linux.alibaba.com>; Nils Hoppmann <niho@linux.ibm.com>
Cc: linux-s390@vger.kernel.org <linux-s390@vger.kernel.org>; netdev <netdev@vger.kernel.org>; Wenjia Zhang <wenjia@linux.ibm.com>; Jan Karcher <jaka@linux.ibm.com>; Dust Li <dust.li@linux.alibaba.com>
Subject: Re: SMC-R throughput drops for specific message sizes
 


On 2023/12/13 20:17, Gerd Bayer wrote:
> Hi Nikolaou,
>
> thank you for providing more details about your setup.
>
> On Wed, 2023-12-06 at 15:28 +0000, Nikolaou Alexandros (SO/PAF1-Mb)
> wrote:
>> Dear Wenjia,
>
> while Wenjia is out, I'm writing primarily to get some more folks'
> attention to this topic. Furthermore, I'm moving the discussion to the
> netdev mailing list where SMC discussions usually take place.
>
>> Thanks for getting back to me. Some further details on the
>> experiments are:
>>  
>> - The tests had been conducted on a one-to-one connection between two
>> Mellanox-powered (mlx5, ConnectX-5) PCs.
>> - Attached you may find the client log of the qperf output. You may
>> notice that for the majority of message size values, the bandwidth is
>> around 3.2 GB/s, which matches the maximum throughput of the
>> Mellanox NICs.
>> The throughput drops substantially in a regular, periodic pattern,
>> though, with the first drop occurring at message sizes of
>> 473616 – 522192 bytes (in steps of 12144 bytes). The
>> corresponding commands for these drops are
>> server: smc_run qperf
>> client: smc_run qperf -v -uu -H worker1 -m 473616 tcp_bw
>> - Our smc version (3E92E1460DA96BE2B2DDC2F, smc-tools-1.2.2) does not
>> provide us with the smcr info, smc_rnics -a and smcr -d
>> stats commands. As an alternative, you may also find attached the
>> output of ibv_devinfo -v.
>> - Buffer size:
>> sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 6291456"
>> sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 6291456"
>> - MTU size: 9000
>>  
>> Should you require further information, please let me know.
>
> Wenjia and I belong to a group of Linux on Z developers that maintains
> the SMC protocol on s390 mainframe systems. Nils Hoppmann is our expert
> for performance and might be able to shed some light on his experiences
> with throughput drops for particular SMC message sizes. Our experience
> is heavily biased towards IBM Z systems, though - with their distinct
> cache and PCI root-complex hardware designs.
>
> Over the last few years there's a group around D. Wythe, Wen Gu and
> Tony Lu who adopted and extended the SMC protocol for use-cases on x86
> architectures. I address them here explicitly, soliciting feedback on
> their experiences.

Certainly. Our team will take a closer look into this matter as well.
We intend to review the thread thoroughly and conduct an analysis within
our environment. Updates and feedback will be provided in this thread.

>
> All in all there are several moving parts involved here, that could
> play a role:
> - firmware level of your Mellanox/NVidia NICs,
> - platform specific hardware designs re. cache and root-complexes,
> interrupt distribution, ...
> - exact code level of the device drivers and the SMC protocol
>
> This is just a heads-up, that there may be requests to try things with
> newer code levels ;)
>
> Thank you,
> Gerd
>
> --
> Gerd Bayer
> Linux on IBM Z Development - IBM Germany R&D
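
For anyone who wants to reproduce the sweep quoted above, the message sizes and client commands can be generated with a few lines of Python. This is only a sketch: the host name worker1 and the qperf options are copied from the quoted command, the helper names are made up for illustration, and the sizes are assumed to be in bytes.

```python
# Sketch: rebuild the qperf client commands for the first range of drops.
# Assumption: sizes are in bytes, as passed to qperf via -m.
FIRST_SIZE = 473616
LAST_SIZE = 522192
STEP = 12144


def sweep_sizes(first=FIRST_SIZE, last=LAST_SIZE, step=STEP):
    """Return every message size from first to last, inclusive."""
    return list(range(first, last + 1, step))


def client_command(size, host="worker1"):
    """Build the client command line from the report for one message size."""
    return f"smc_run qperf -v -uu -H {host} -m {size} tcp_bw"


if __name__ == "__main__":
    for size in sweep_sizes():
        print(client_command(size))
```

The server side stays as quoted (smc_run qperf); only the client command changes per message size.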

* Re: SMC-R throughput drops for specific message sizes
@ 2024-02-01 13:50 Iordache Costin (XC-AS/EAE-UK)
  2024-02-05  3:50 ` Wen Gu
  0 siblings, 1 reply; 11+ messages in thread
From: Iordache Costin (XC-AS/EAE-UK) @ 2024-02-01 13:50 UTC (permalink / raw)
  To: Wen Gu, Gerd Bayer, D . Wythe, Tony Lu, Nils Hoppmann
  Cc: linux-s390@vger.kernel.org, netdev@vger.kernel.org, Wenjia Zhang,
	Jan Karcher, Dust Li

[-- Attachment #1: Type: text/plain, Size: 2462 bytes --]

Hi, 

This is Costin, Alex's colleague. We've got additional updates which we thought would be helpful to share with the community.

As a brief reminder, our hardware/software context is as follows:
     - 2 PCs, each equipped with one Mellanox ConnectX-5 HCA (MT27800), dual port
     - only one HCA port is active/connected on each side (QSFP28 cable)
     - max HCA throughput: 25 Gbps ≈ 3.12 GB/s
     - max/active MTU: 4096
     - kernel: 6.5.0-14
     - benchmarking tool: qperf 0.4.11
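
The line-rate figure in the list above is a plain unit conversion, which can be checked as follows. This is a sketch that ignores protocol overhead; the function name is made up for illustration.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link rate from Gbit/s to GByte/s (8 bits per byte)."""
    return gbit_per_s / 8


if __name__ == "__main__":
    # 25 Gbit/s link rate of the active HCA port
    print(gbit_to_gbyte_per_s(25))
```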

Our goal has been to gauge the benefits of SMC-R vs. TCP/IP. We are particularly interested in maximizing the throughput whilst reducing CPU utilisation and DRAM memory bandwidth for large data (> 2MB) transfers.

Our main issue so far has been that SMC-R halves the throughput for some specific message sizes (whereas TCP/IP does not) - see "SMC-R vs TCP" plot.

Since our last post, we upgraded the kernel from 5.4 to 6.5.0-14, hoping this would alleviate the throughput drops. It did not, so we bit the bullet and delved into the SMC-R code.

The SMC-R source code revealed that the __smc_buf_create and smc_compress_bufsize functions are in charge of computing the size of the RMB buffer and allocating either physically or virtually contiguous memory. We suspected that the throughput drops were caused by the size of this buffer being too small.
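
A rough model of that size computation may help when following the discussion. The sketch below is not the kernel code: it assumes that buffer sizes are rounded up to a power-of-two multiple of 16KB and capped at exponent 5 (16KB << 5 = 512KB), it leaves out any additional limits the kernel applies, and the Python names are illustrative only.

```python
# Simplified model of the RMB size rounding, NOT the kernel implementation.
# Assumptions: minimum size 16 KiB, sizes rounded up to 16 KiB * 2^n,
# default cap n = 5 (512 KiB). Other kernel-side limits are ignored.
MIN_SIZE = 16 * 1024
DEFAULT_MAX_EXP = 5


def compress_bufsize(size, max_exp=DEFAULT_MAX_EXP):
    """Return the exponent n such that the buffer is 16 KiB << n."""
    if size <= MIN_SIZE:
        return 0
    units = (size - 1) >> 14          # number of full 16 KiB units below size
    return min(units.bit_length(), max_exp)


def uncompress_bufsize(exp):
    """Return the buffer size in bytes for exponent exp."""
    return 1 << (exp + 14)


if __name__ == "__main__":
    for requested in (16 * 1024, 100_000, 512 * 1024, 8 * 1024 * 1024):
        granted = uncompress_bufsize(compress_bufsize(requested))
        print(requested, "->", granted)
```

Under this model, raising max_exp (for example to 9) is what allows buffers up to 8MB; with the assumed default cap, any request above 512KB is clamped to 512KB.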

We set out to determine whether there is a correlation between the drops and the size of the RMB buffer, and for that we set the size of the RMB buffer to 128KB, 256KB, 512KB, 1MB, 2MB, 4MB and 8MB and benchmarked the throughput for different message size ranges.

The attached plot collates the benchmark results and shows that the period of the drops coincides with the size of the RMB buffer. Whilst increasing the size of the buffer seems to attenuate the throughput drops, we believe that the real root cause might lie somewhere else in the SMC-R code. We suspect that, for reasons unknown to us, the CDC messages that are sent after the RDMA WRITE operation are delayed in some circumstances.
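
One way to line up measurements taken with different RMB sizes is to express each message size relative to the buffer size. The sketch below only does that arithmetic; it makes no claim about where within a period the drops fall, and the function name is made up for illustration.

```python
def position_in_period(msg_size, rmb_size):
    """Return (full_buffers, offset): how many whole RMB sizes fit into
    msg_size, and the remaining offset in bytes within the next one."""
    return divmod(msg_size, rmb_size)


if __name__ == "__main__":
    KB = 1024
    for rmb in (128 * KB, 256 * KB, 512 * KB):
        print(rmb, position_in_period(473616, rmb))
```

Plotting throughput against the offset rather than the raw message size should make runs with different RMB sizes directly comparable, if the period really equals the buffer size.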

cheers,
Costin.

PS. For the sake of brevity, many details have been omitted on purpose, but we'd be happy to provide them if need be. For example: by default the RMB buffer size is capped at 512KB, so we removed the cap and recompiled the SMC module; we also used alternative tools such as iperf and iperf3 to rule out the possibility that the drops are tool-specific; corking has been disabled; etc.

[-- Attachment #2: smc_qperf_merged-4.jpg --]
[-- Type: image/jpeg, Size: 291219 bytes --]

[parent not found: <GV2PR10MB8037B30A9D2CE67F267D5E61BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM>]

end of thread, other threads:[~2024-03-28 12:18 UTC | newest]

Thread overview: 11+ messages
2023-12-01 13:33 SMC-R throughput drops for specific message sizes Nikolaou Alexandros (SO/PAF1-Mb)
2023-12-04 16:09 ` Wenjia Zhang
2023-12-06 15:28   ` Nikolaou Alexandros (SO/PAF1-Mb)
2023-12-13 12:17     ` Gerd Bayer
2023-12-13 13:38       ` Wen Gu
  -- strict thread matches above, loose matches on Subject: below --
2023-12-13 15:52 Nikolaou Alexandros (SO/PAF1-Mb)
2024-02-01 13:50 Iordache Costin (XC-AS/EAE-UK)
2024-02-05  3:50 ` Wen Gu
2024-02-19  8:44   ` Wen Gu
2024-02-27 11:28     ` Iordache Costin (XC-AS/EAE-UK)
     [not found] <GV2PR10MB8037B30A9D2CE67F267D5E61BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM>
     [not found] ` <GV2PR10MB80376BEB9EE8E03F98CC86A1BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM>
2024-03-28 12:18   ` Goerlitz Andreas (SO/PAF1-Mb)
