public inbox for linux-rdma@vger.kernel.org
* problem with duplicate resends
@ 2026-02-20 18:48 Hebenstreit, Michael
  2026-02-24  9:31 ` Leon Romanovsky
  0 siblings, 1 reply; 2+ messages in thread
From: Hebenstreit, Michael @ 2026-02-20 18:48 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

Hello

We have a problem in a Linux cluster using Omnipath 100 and GPFS. Typically, after a complete reboot, the cluster works correctly for 10-14 days. Then problems start, occurring about once every 2-3 days. This makes the problem very hard to debug.

The problem starts with one or more storage nodes (A, B, C...) being unable to write to a "bad" storage node X. A/B/C/... then throw an IBV_WC_RETRY_EXC_ERR error and close the QP. In response, NodeX also closes the connection. Afterwards GPFS cannot re-establish a new connection fast enough and everything goes south until NodeX is rebooted. GPFS is NOT my question here though.

During the last crash, thanks to a new monitoring system, we discovered that NodeA/B/C/... execute 6 RDMA retries and, accordingly, the RcResend counters in the hfi1 driver go up. But on NodeX the RcDupRew counter goes up in step with all the RcResends. That indicates the resends are unnecessary: NodeX had already received and acknowledged the original requests.
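
The correlation described above could be checked with something like the following minimal Python sketch. It assumes counter snapshots have already been collected into plain dictionaries; that format, the function names, and the sample numbers are hypothetical. Only the counter names RcResend and RcDupRew come from the observation above.

```python
def counter_deltas(before, after):
    """Per-counter increase between two snapshots (dicts of name -> int)."""
    return {name: after[name] - before.get(name, 0) for name in after}


def dup_matches_resends(sender_deltas, receiver_delta, tolerance=0):
    """Return True if the receiver's RcDupRew increase equals the sum of
    the senders' RcResend increases, within the given tolerance."""
    total_resends = sum(d.get("RcResend", 0) for d in sender_deltas)
    dups = receiver_delta.get("RcDupRew", 0)
    return total_resends > 0 and abs(total_resends - dups) <= tolerance


# Made-up numbers for illustration only: two senders, six resends each.
node_a = counter_deltas({"RcResend": 100}, {"RcResend": 106})
node_b = counter_deltas({"RcResend": 40}, {"RcResend": 46})
node_x = counter_deltas({"RcDupRew": 7}, {"RcDupRew": 19})
in_step = dup_matches_resends([node_a, node_b], node_x)
```

If `in_step` is true for an incident window, every resend seen on the senders was counted as a duplicate request on the receiver, which is the pattern described above.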

The operating system is Red Hat Enterprise Linux 8.10 with a very old rdma-core, version 48.

My question - is there any known bug in libibverbs/libhfi1verbs-rdmav34 that could explain this behavior?

Thanks
Michael

------------------------------------------------------------------------------
Michael Hebenstreit         Principal Performance Engineer
Cornelis Networks           Performance Team
Tel.:+1-385-393-5444        E-mail: michael.hebenstreit@cornelisnetworks.com

