From: Leon Romanovsky <leon@kernel.org>
To: "Hebenstreit, Michael" <michael.hebenstreit@cornelisnetworks.com>
Cc: "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: problem with duplicate resends
Date: Tue, 24 Feb 2026 11:31:19 +0200
Message-ID: <20260224093119.GG10607@unreal>
In-Reply-To: <LV2PR01MB9940993E9E56B23CEF734AC4909B68A@LV2PR01MB994099.prod.exchangelabs.com>
On Fri, Feb 20, 2026 at 06:48:21PM +0000, Hebenstreit, Michael wrote:
> Hello
>
> We have a problem in a Linux cluster using Omnipath 100 and GPFS. Typically, after a complete reboot the cluster works correctly for 10-14 days. Then problems start, happening about once every 2-3 days. This makes the problem very hard to debug.
>
> The problem starts with one or more storage nodes (A, B, C...) being unable to write to a "bad" storage node X. A/B/C/... would then throw an IBV_WC_RETRY_EXC_ERR error and close the QP pair. In response NodeX would also close the connection. Afterwards GPFS cannot re-establish a new connection fast enough and everything goes south until the NodeX is rebooted. GPFS is NOT my question here though.
>
> During the last crash, thanks to a new monitoring system, we discovered that NodeA/B/C/.. would execute 6 RDMA retries and accordingly the RcResend counters on the hfi1 driver would go up. But on NodeX the RcDupRew counter would go up in step with all the RcResends. That indicates the resends are incorrect and had already been acknowledged.
>
> The operating system is RedHat EL 8.10 with a very old rdma-core version 48.
>
> My question - is there any known bug in libibverbs/libhfi1verbs-rdmav34 that could explain this behavior?
From the upstream perspective, the answer is no. We are not aware of any
related issues or discussions.
Thanks
>
> Thanks
> Michael
>
> ------------------------------------------------------------------------------
> Michael Hebenstreit Principal Performance Engineer
> Cornelis Networks Performance Team
> Tel.:+1-385-393-5444 E-mail: michael.hebenstreit@cornelisnetworks.com
>
> External recipient
>