* why flipping responder_resources/initiator_depth?
@ 2014-06-22 7:42 Or Gerlitz
[not found] ` <53A688FB.6070600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Or Gerlitz @ 2014-06-22 7:42 UTC (permalink / raw)
To: Hefty, Sean,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)
Cc: Sagi Grimberg, Roi Dayan
Hi Sean,
So we just noted that the IB CM, in cm_format_req_event(), flips the values
of the RDMA READ initiator-depth and responder-resources advertised by the
client in the REQ before it delivers the event to the server
(and the same is done the other way around in cm_format_rep_event()).
Any special reason for that? Specifically, it doesn't comply with our
common sense (...) or with the text in the rdma_connect/rdma_accept man
pages http://linux.die.net/man/3/rdma_connect, which say:
responder_resources - The maximum number of outstanding RDMA read and
atomic operations that the local side will accept from the remote side.
[...]
initiator_depth - The maximum number of outstanding RDMA read and atomic
operations that the local side will have to the remote side. [...]
Agree? If you do, how do we fix it... the bug has been there from day one.
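To make the flip concrete, here is a minimal sketch of the behaviour described above. The struct and function names (wire_req, req_event, format_req_event_flipped) are hypothetical stand-ins and not the kernel's definitions; only the swap itself reflects what is described for cm_format_req_event().

```c
#include <stdint.h>

/* Hypothetical stand-in for the two REQ fields as the client sent them. */
struct wire_req {
	uint8_t responder_resources; /* reads/atomics the sender will accept */
	uint8_t initiator_depth;     /* reads/atomics the sender will issue  */
};

/* Hypothetical stand-in for the event params handed to the listener. */
struct req_event {
	uint8_t responder_resources;
	uint8_t initiator_depth;
};

/* Mimics the flip: each event field is filled from the opposite wire field. */
static struct req_event format_req_event_flipped(const struct wire_req *req)
{
	struct req_event ev;

	ev.responder_resources = req->initiator_depth;
	ev.initiator_depth = req->responder_resources;
	return ev;
}
```

With this sketch, a client that advertises responder_resources=4 and initiator_depth=0 shows up on the server as responder_resources=0 and initiator_depth=4.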
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* RE: why flipping responder_resources/initiator_depth?
[not found] ` <53A688FB.6070600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-06-23 5:00 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A823739931CCAD-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Hefty, Sean @ 2014-06-23 5:00 UTC (permalink / raw)
To: Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)
Cc: Sagi Grimberg, Roi Dayan
> Any special reason for that? specifically, it doesn't comply with our
> common sense (...) and the text in the rdma_connect/rdma_accept man
> pages http://linux.die.net/man/3/rdma_connect that say:
Flipping the values keeps the meaning the same on both sides, with the meaning relative to the local, versus remote side, as opposed to the sender of the message.
* Re: why flipping responder_resources/initiator_depth?
[not found] ` <1828884A29C6694DAF28B7E6B8A823739931CCAD-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2014-06-23 5:55 ` Or Gerlitz
[not found] ` <CAJZOPZKqYiGpxi8bjDu5TBu0G6EX_DjRLvEVhNDTy9L79h6MbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Or Gerlitz @ 2014-06-23 5:55 UTC (permalink / raw)
To: Hefty, Sean
Cc: Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan
On Mon, Jun 23, 2014 at 8:00 AM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> Any special reason for that? specifically, it doesn't comply with our
>> common sense (...) and the text in the rdma_connect/rdma_accept man
>> pages http://linux.die.net/man/3/rdma_connect that say:
> Flipping the values keeps the meaning the same on both sides, with the meaning relative to the local, versus remote side, as opposed to the sender of the message.
But the meaning need not be the same on both sides. It's very
common for the number of inflight rdma-reads initiated/responded to
by each side of the connection to be asymmetrical: it can be N for
the client --> server direction and M for the server --> client
direction. In the real-life example below, N=0.
Take a case where only one side initiates rdma-reads, say the server, which is
common in many transactional storage protocols (SRP, iSER). The
client and server need to negotiate how many inflight rdma-reads the
server is allowed to issue. As I see things, this is the expected
flow:
1. the client to put into the responder_resources they provide to
rdma_connect the maximum number of outstanding RDMA reads that they
will be able to accept from the server side
2. the server to apply a minimum function between the
responder_resources advertised by the client (which they get
in the connection request event params) and how many inflight
rdma-reads their HCA supports
Isn't this paradigm robust and simple? And can't it be extended to the case
where only the client or both sides issue rdma-reads/atomics?
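Step 2 of the flow above boils down to a single clamp. A minimal sketch, assuming the server sees the client's responder_resources unflipped (which is the behaviour argued for here, not what the Linux CM delivers today); server_read_depth and its parameter names are hypothetical:

```c
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper: how many RDMA reads the server may keep in flight
 * towards the client. It is the smaller of what the client said it will
 * accept and what the server's HCA can initiate.
 */
static uint8_t server_read_depth(uint8_t client_responder_resources,
				 uint8_t hca_max_init_rd_atom)
{
	return min_u8(client_responder_resources, hca_max_init_rd_atom);
}
```

A client that advertises 0 makes the result 0 regardless of the HCA limit, which covers the case where the server must not issue reads at all.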
Also note that this flipping decision taken by the Linux CM can keep
applications from properly inter-operating with instances
running over other CMs (e.g. Windows, ESX, realtime/proprietary OSes in
commercial storage boxes, etc).
I find the current practice both non-interoperable and
confusing. I don't see how it follows the terms "local" and
"remote" used in the man page.
Or.
* Re: why flipping responder_resources/initiator_depth?
[not found] ` <CAJZOPZKqYiGpxi8bjDu5TBu0G6EX_DjRLvEVhNDTy9L79h6MbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-23 16:49 ` Jason Gunthorpe
[not found] ` <20140623164938.GA23697-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2014-06-23 16:49 UTC (permalink / raw)
To: Or Gerlitz
Cc: Hefty, Sean, Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan
On Mon, Jun 23, 2014 at 08:55:07AM +0300, Or Gerlitz wrote:
> 1. the client to put into the responder_resources they provide to
> rdma_connect the maximum number of outstanding RDMA reads that they
> will be able to accept from the server side
>
> 2. the server to apply a minimum function between the
> responder_resources advertised by the client (which they get
> in the connection request event params) and how many inflight
> rdma-reads their HCA supports
From a wire perspective the spec is pretty clear what the CM responder
resources and initiator depth are supposed to be, and the behavior of
#2 is mandated in the spec.
From an API perspective it makes sense that the only input to the
API would be 'the initiator depth the caller will use', which is
basically the only thing the caller actually controls: 0 if the client
never uses RDMA READ or ATOMICs, 1 if it is strictly interlocked, and
higher as necessary.
I'm not sure there is a use case to limit QP responder resources at
the caller? Maybe to specify '0' if the caller knows it will never
setup a remote readable MR?
So both sides pass in their desired initiator depth. Both sides limit
that to HCA init depth capabilities. The REQ side plugs that value
into REQ.initiatorDepth and the HCA capability into
REQ.responderResources.
The REQ responder takes min(REQ.responderResources, local
initiatorDepth) and returns that in REP.initiatorDepth. It takes
min(REQ.initiatorDepth, HW respres capability), plugs that into the
local QP, and returns it in REP.responderResources.
The REQ initiator takes that reply, computes
min(REP.responderResources, HW initdepth capability, API depth), and
plugs that into the QP. It checks that REP.initiatorDepth <=
REQ.responderResources and errors if false, and plugs REP.initiatorDepth
into the local QP's responder resources.
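The procedure above can be written down as three small functions. This is only a sketch of the steps as described in this mail, not kernel or spec code; every type and function name (hca_caps, cm_msg, qp_rd_atomic, build_req, handle_req, handle_rep) is hypothetical, and setting the responder's own QP initiator depth from REP.initiatorDepth is an assumption the text implies but does not spell out.

```c
#include <stdint.h>

struct hca_caps {            /* hypothetical: per-QP HW limits            */
	uint8_t max_init_depth;  /* reads/atomics the HCA can initiate        */
	uint8_t max_resp_res;    /* reads/atomics the HCA can respond to      */
};

struct cm_msg {              /* hypothetical: the two REQ/REP wire fields */
	uint8_t initiator_depth;
	uint8_t responder_resources;
};

struct qp_rd_atomic {        /* hypothetical: values programmed in the QP */
	uint8_t initiator_depth;
	uint8_t responder_resources;
};

static uint8_t min_u8(uint8_t a, uint8_t b)
{
	return a < b ? a : b;
}

/* REQ side: desired depth limited to HW, plus the HW responder capability. */
static struct cm_msg build_req(const struct hca_caps *hca, uint8_t api_depth)
{
	struct cm_msg req;

	req.initiator_depth = min_u8(api_depth, hca->max_init_depth);
	req.responder_resources = hca->max_resp_res;
	return req;
}

/* REQ responder: clamp both directions, program the QP, build the REP. */
static struct cm_msg handle_req(const struct hca_caps *hca, uint8_t api_depth,
				const struct cm_msg *req,
				struct qp_rd_atomic *qp)
{
	struct cm_msg rep;
	uint8_t local_depth = min_u8(api_depth, hca->max_init_depth);

	rep.initiator_depth = min_u8(req->responder_resources, local_depth);
	rep.responder_resources = min_u8(req->initiator_depth,
					 hca->max_resp_res);
	qp->initiator_depth = rep.initiator_depth;   /* assumption, see above */
	qp->responder_resources = rep.responder_resources;
	return rep;
}

/* REQ initiator: sanity-check the REP, then program the QP. 0 on success. */
static int handle_rep(const struct hca_caps *hca, uint8_t api_depth,
		      const struct cm_msg *req, const struct cm_msg *rep,
		      struct qp_rd_atomic *qp)
{
	if (rep->initiator_depth > req->responder_resources)
		return -1; /* peer wants more reads than we offered */

	qp->initiator_depth = min_u8(rep->responder_resources,
				     min_u8(hca->max_init_depth, api_depth));
	qp->responder_resources = rep->initiator_depth;
	return 0;
}
```

For example, with both HCAs capable of 16 in each direction, a client passing depth 0 and a server passing depth 8 end up with the server's QP at initiator depth 8 and responder resources 0, and the client's QP at initiator depth 0 and responder resources 8.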
The swapping and general missing handling of RR negotiating in the
whole kernel CM API (not just RDMA CM, but IB CM too) is a
longstanding bug, and I have written user space code that fixes it up
in the past :(
It works OK if both sides hard code 2 or 4, or whatever is 99% of use
cases. It is broken if you are doing what Or is talking about, and
optimizing RR usage because one half of a connection doesn't use RRs at
all.
Jason
* Re: why flipping responder_resources/initiator_depth?
[not found] ` <20140623164938.GA23697-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-06-23 17:38 ` Or Gerlitz
[not found] ` <CAJZOPZJHM1v62kr1_8X2cZXxftNqtC+ngMKNz64eFNFrxyXbAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Or Gerlitz @ 2014-06-23 17:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Hefty, Sean, Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan
On Mon, Jun 23, 2014 at 7:49 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
[...]
> The swapping and general missing handling of RR negotiating in the
> whole kernel CM API (not just RDMA CM, but IB CM too) is a
> longstanding bug, and I have written user space code that fixes it up
> in the past :(
Jason, the swapping takes place in the IB CM indeed, I just used the
wording from the librdmacm man pages to describe the desired
behaviour as I see it. Did you ever report the swapping on this
list in the past? When?
* RE: why flipping responder_resources/initiator_depth?
[not found] ` <CAJZOPZJHM1v62kr1_8X2cZXxftNqtC+ngMKNz64eFNFrxyXbAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-23 18:00 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A823739931FEF9-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Hefty, Sean @ 2014-06-23 18:00 UTC (permalink / raw)
To: Or Gerlitz, Jason Gunthorpe
Cc: Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan
> > The swapping and general missing handling of RR negotiating in the
> > whole kernel CM API (not just RDMA CM, but IB CM too) is a
> > longstanding bug, and I have written user space code that fixes it up
> > in the past :(
>
> Jason, the swapping takes place in the IB CM indeed, I just used the
> wording from the librdmacm man pages to describe the desired
> behaviour as I see it. Did you ever report the swapping on this
> list in the past? When?
The behavior matches the documentation. And the problem is...?
The initiator_depth and responder_resources must be swapped between the REQ and REP. Why is having the RDMA CM do this swapping an issue?
* Re: why flipping responder_resources/initiator_depth?
[not found] ` <1828884A29C6694DAF28B7E6B8A823739931FEF9-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2014-06-23 18:34 ` Jason Gunthorpe
[not found] ` <20140623183455.GA3879-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2014-06-23 18:34 UTC (permalink / raw)
To: Hefty, Sean
Cc: Or Gerlitz, Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan
On Mon, Jun 23, 2014 at 06:00:57PM +0000, Hefty, Sean wrote:
> > > The swapping and general missing handling of RR negotiating in the
> > > whole kernel CM API (not just RDMA CM, but IB CM too) is a
> > > longstanding bug, and I have written user space code that fixes it up
> > > in the past :(
> >
> > Jason, the swapping takes place in the IB CM indeed, I just used the
> > wording from the librdmacm man pages to describe the desired
> > behaviour as I see it. Did you ever report the swapping on this
> > list in the past? When?
>
> The behavior matches the documentation. And the problem is...?
The problem is this whole thing is a giant gotcha if you don't
intimately understand exactly what the spec requires, and naively
assume the kernel does something sane, or even provides you the values
the spec says you need in fields that are named the same as the spec.
If you use the IB CM in userspace you need to hook IB_CM_REQ_RECEIVED
and do something like this:
/* Note: req.responder_resources and req.initiator_depth are swapped
   in the kernel. FIXME: this works around the kernel not implementing
   the negotiation procedure by doing it here. */
rep.responder_resources = min((int)req.responder_resources,
                              devAttr.max_qp_rd_atom);
rep.initiator_depth = min((int)req.initiator_depth,
                          devAttr.max_qp_init_rd_atom);
So
1) The kernel swapped the values before passing them to userspace
(and other kernel consumers). So this becomes very confusing if
you are not aware that req.responder_resources is not actually
what the IBA spec describes as REQ responderResources.
2) The kernel doesn't do anything to help implement the negotiation
required by the IBA spec; for instance, it doesn't limit to HCA
values after getting a REQ.
3) There is no aid to help a simple app developer do this right, and
almost every app I've ever looked at just passes 2 in for both
values and hopes for the best.
4) Other elements of the negotiation procedure I outlined above seem
to be missing, like the sanity check of the REP, and the
generation of REJ if the values are not acceptable.
I haven't looked at how this all plays through with RDMA CM. But
looking quickly, I don't see an obvious similar min in cma_connect_ib.
To my mind, the biggest issue is the common code does not seem to make
it easy for apps to correctly implement the IBA negotiation protocol.
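As a rough illustration of the workaround fragment above, here is a self-contained variant. It is a sketch only: dev_limits is a hypothetical mirror of the two device limits the fragment reads (max_qp_rd_atom and max_qp_init_rd_atom), and req_event_params, rep_params, clamp_to_hca and fill_rep are made-up names. The input fields carry the kernel's flipped naming, as noted in the fragment's comment.

```c
#include <stdint.h>

/* Hypothetical mirror of the two device limits used in the fragment. */
struct dev_limits {
	int max_qp_rd_atom;      /* reads/atomics the HCA can respond to per QP */
	int max_qp_init_rd_atom; /* reads/atomics the HCA can initiate per QP   */
};

/* Values as delivered with the REQ event, i.e. already flipped. */
struct req_event_params {
	uint8_t responder_resources;
	uint8_t initiator_depth;
};

/* Values the application would place in its REP. */
struct rep_params {
	uint8_t responder_resources;
	uint8_t initiator_depth;
};

/* Limit a requested 8-bit value to a device limit reported as an int. */
static uint8_t clamp_to_hca(uint8_t requested, int hca_limit)
{
	if (hca_limit < 0)
		hca_limit = 0;
	return (int)requested < hca_limit ? requested : (uint8_t)hca_limit;
}

/* Do the min() step the kernel leaves to the application. */
static struct rep_params fill_rep(const struct req_event_params *req,
				  const struct dev_limits *dev)
{
	struct rep_params rep;

	rep.responder_resources = clamp_to_hca(req->responder_resources,
					       dev->max_qp_rd_atom);
	rep.initiator_depth = clamp_to_hca(req->initiator_depth,
					   dev->max_qp_init_rd_atom);
	return rep;
}
```

The explicit clamp helper avoids relying on a min() macro and keeps the int-to-uint8_t narrowing in one place.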
Jason
* Re: why flipping responder_resources/initiator_depth?
[not found] ` <20140623183455.GA3879-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-06-25 20:51 ` Or Gerlitz
[not found] ` <CAJZOPZ+3HN7YWPUdhXvFopx2JiqtAzG5cfrK+8we92=KzDyDDA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Or Gerlitz @ 2014-06-25 20:51 UTC (permalink / raw)
To: Jason Gunthorpe, Hefty, Sean
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan
On Mon, Jun 23, 2014 at 9:34 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Jun 23, 2014 at 06:00:57PM +0000, Hefty, Sean wrote:
>> The behavior matches the documentation. And the problem is...?
Sean, which documentation exactly?
[...]
> I haven't looked at how this all plays through with RDMA CM. But
> looking quickly, I don't see an obvious similar min in cma_connect_ib.
Jason,
The rdma-cm doesn't do any flipping, it just provides the relevant IB
CM call with the params as given by the application, see
cma_connect_ib() --> ib_send_cm_req() and
cma_accept_ib() --> ib_send_cm_rep()
As I wrote in the initial note, the flipping is done in the IB CM,
in cm_format_req_event() and cm_format_rep_event(), before it invokes
the callback provided by the upper layer (e.g. SRP/IPoIB/RDMA-CM/UCM).
> To my mind, the biggest issue is the common code does not seem to make
> it easy for apps to correctly implement the IBA negotiation protocol.
And also the Linux CM assumes the peer CM will apply the same flipping practice.
Or.
* Re: why flipping responder_resources/initiator_depth?
[not found] ` <CAJZOPZ+3HN7YWPUdhXvFopx2JiqtAzG5cfrK+8we92=KzDyDDA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-25 21:08 ` Jason Gunthorpe
0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2014-06-25 21:08 UTC (permalink / raw)
To: Or Gerlitz
Cc: Hefty, Sean,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan
On Wed, Jun 25, 2014 at 11:51:41PM +0300, Or Gerlitz wrote:
> The rdma-cm doesn't do any flipping, it just provides the relevant IB
> CM call with the params as given by the application, see
Sure, that is what I mean. The IB spec has a protocol for negotiating
responder resources. I outlined how it works in a prior email.
IB CM and RDMA CM both push responsibility to implement the protocol
to the app.
The app needs to be aware of the flipped names when it implements it.
Otherwise the flipped names don't matter: if you dig into it enough
you can figure out which value from the IB CMA defined MAD ends up in
which structure member during the app callback, for every callback.
If you implement an app that uses responder resources and doesn't
implement the IB defined negotiation protocol, then it is broken.
The most common thing that needs to be done is to limit things to HW
capability, and to use values that reflect the app's use of the QP (e.g.
0 if no RRs are used in a direction). The spec has all the details.
For your immediate case, I suspect the path you need to follow is to
implement the negotiation protocol, then you can have your client
stuff in 0 for initiator_depth and when the protocol is followed the
server will avoid allocating any RRs.
To that end the swapping of the labels is just confusing;
you can still implement the protocol properly.
Bonus points if you can add some kind of core code support to
implement the protocol in common code based on both sides telling the
core what their initiator depth will be.
Jason
Thread overview: 9+ messages
2014-06-22 7:42 why flipping responder_resources/initiator_depth? Or Gerlitz
[not found] ` <53A688FB.6070600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-06-23 5:00 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A823739931CCAD-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-06-23 5:55 ` Or Gerlitz
[not found] ` <CAJZOPZKqYiGpxi8bjDu5TBu0G6EX_DjRLvEVhNDTy9L79h6MbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-06-23 16:49 ` Jason Gunthorpe
[not found] ` <20140623164938.GA23697-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-06-23 17:38 ` Or Gerlitz
[not found] ` <CAJZOPZJHM1v62kr1_8X2cZXxftNqtC+ngMKNz64eFNFrxyXbAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-06-23 18:00 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A823739931FEF9-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-06-23 18:34 ` Jason Gunthorpe
[not found] ` <20140623183455.GA3879-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-06-25 20:51 ` Or Gerlitz
[not found] ` <CAJZOPZ+3HN7YWPUdhXvFopx2JiqtAzG5cfrK+8we92=KzDyDDA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-06-25 21:08 ` Jason Gunthorpe