* Rare Duplicate Completions
@ 2014-04-04 17:00 Christopher Mitchell
2014-04-08 1:01 ` Bart Van Assche
0 siblings, 1 reply; 3+ messages in thread
From: Christopher Mitchell @ 2014-04-04 17:00 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Hi,
I am working on building a distributed InfiniBand application, testing
using Mellanox ConnectX HCAs. In very rare cases, perhaps once in a
few million operations, I appear to be receiving duplicate completions
or incorrect completions. For instance, I'll send out an RDMA request
and receive a completion for a Verb message response I had just
handled, or send a Verb message request and receive a duplicate Verb
message completion. Needless to say, this is introducing instability
in my application. Does anyone have experience with a bug like this,
or am I encountering some arcane issue in how I'm manipulating the
HCA? I'd be more than happy to furnish relevant sections of code to
help nail down the issue.
Thanks in advance,
Christopher
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Rare Duplicate Completions
From: Bart Van Assche @ 2014-04-08 1:01 UTC (permalink / raw)
To: Christopher Mitchell,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Hello Christopher,
The SCST SRP target driver, for example, contains code that checks for
duplicate and/or missing completions. Although this code is used
intensively, I have not yet seen it log any error messages. Note: one
nontrivial aspect of the RDMA API, which you might already be aware of,
is that even a non-signaled work request generates a completion if that
work request fails. If such non-signaled work requests do not have a
unique wr_id, their error completions might be misinterpreted as
duplicate completions.
Bart.
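
The point about unique wr_id values can be illustrated with a small
bookkeeping sketch. It is not taken from SCST or from the application
discussed in this thread; the names wr_tracker, tracker_post,
tracker_complete and MAX_OUTSTANDING are hypothetical, and the table is
assumed to be at least as large as the send queue depth. The idea is to
hand out a fresh wr_id for every work request, remember whether it was
posted signaled, and classify each polled completion against that record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table size; assumed to be at least the send queue depth. */
#define MAX_OUTSTANDING 1024

enum comp_result {
    COMP_OK,               /* first completion of a signaled request */
    COMP_UNSIGNALED_ERROR, /* error completion of a non-signaled request */
    COMP_UNEXPECTED        /* duplicate, unknown, or otherwise unexplained */
};

struct wr_slot {
    uint64_t id;       /* wr_id currently stored in this slot */
    bool outstanding;  /* posted and not yet matched to a completion */
    bool signaled;     /* whether the request was posted as signaled */
};

struct wr_tracker {
    uint64_t next_id;  /* starts at 1 so that 0 is never a valid wr_id */
    struct wr_slot slots[MAX_OUTSTANDING];
};

static void tracker_init(struct wr_tracker *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
    t->next_id = 1;
}

/* Reserve a unique wr_id for a request that is about to be posted.
 * Returns 0 if the slot still holds a signaled request awaiting its
 * completion, in which case the caller should poll before posting. */
static uint64_t tracker_post(struct wr_tracker *t, bool signaled)
{
    struct wr_slot *s = &t->slots[t->next_id % MAX_OUTSTANDING];

    if (s->outstanding && s->signaled)
        return 0;
    /* Assumption: a non-signaled entry this old has already left the
     * send queue, because the table is at least as deep as the queue. */
    s->id = t->next_id++;
    s->outstanding = true;
    s->signaled = signaled;
    return s->id;
}

/* Classify one polled completion. 'success' stands for the result of
 * comparing the completion status against the success value. */
static enum comp_result tracker_complete(struct wr_tracker *t,
                                         uint64_t wr_id, bool success)
{
    struct wr_slot *s = &t->slots[wr_id % MAX_OUTSTANDING];

    if (!s->outstanding || s->id != wr_id)
        return COMP_UNEXPECTED;
    if (!s->signaled && success)
        return COMP_UNEXPECTED; /* non-signaled requests only complete on error */
    s->outstanding = false;
    /* COMP_OK only means "matched"; the caller still inspects the status. */
    return s->signaled ? COMP_OK : COMP_UNSIGNALED_ERROR;
}
```

In a verbs application the value returned by tracker_post would be
stored in the wr_id field of the work request, and tracker_complete
would be called with the wr_id and status of each entry returned by the
completion queue poll. A second completion for the same wr_id then shows
up as COMP_UNEXPECTED instead of being silently processed twice.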
* Re: Rare Duplicate Completions
From: Christopher Mitchell @ 2014-04-10 21:43 UTC (permalink / raw)
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Bart,
Thanks for the help. Unfortunately, it looks as if all of my
message-passing and RDMA operations use the SIGNALED flag. I'm
checking wc->status (it is IBV_WC_SUCCESS), and the status and
everything else in the duplicate completion appear to be identical to
the first completion. I notice that it only seems to happen when I have
server processes (i.e., listeners) that also open active connections to
other listeners as well as accept connections from client processes. I
perform both RDMA and message-passing across those server-to-server
connections, so on a hunch I tried rewriting the RDMA operations as
message-passing, but the problem still occurs. Thanks in advance for
any other thoughts you might have.
Cheers,
Christopher