* why flipping responder_resources/initiator_depth?
@ 2014-06-22 7:42 Or Gerlitz
From: Or Gerlitz @ 2014-06-22 7:42 UTC (permalink / raw)
To: Hefty, Sean,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)
Cc: Sagi Grimberg, Roi Dayan
Hi Sean,
So we just noted that the IB CM, in cm_format_req_event(), flips the values
of the RDMA READ initiator-depth and responder-resources advertised by the
client in the REQ before it delivers the event to the server
(and the same is done the other way around in cm_format_rep_event()).
Any special reason for that? Specifically, it doesn't comply with our
common sense (...) or with the text in the rdma_connect/rdma_accept man
pages http://linux.die.net/man/3/rdma_connect that say:
responder_resources - The maximum number of outstanding RDMA read and
atomic operations that the local side will accept from the remote side.
[...]
initiator_depth - The maximum number of outstanding RDMA read and atomic
operations that the local side will have to the remote side. [...]
Agree? If you do, how do we fix it... the bug has been there from day one.
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
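
As a rough illustration of the flip described in the message above, the following sketch models what the passive side sees when the two REQ values are exchanged before delivery. It is not the kernel's code: the struct and function names (wire_req, req_event_param, format_req_event) are hypothetical stand-ins chosen for this example, and only the exchange of the two fields is taken from the message.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two REQ fields as they travel on the wire. */
struct wire_req {
    uint8_t responder_resources; /* reads/atomics the REQ sender can respond to */
    uint8_t initiator_depth;     /* reads/atomics the REQ sender wants to issue */
};

/* Hypothetical stand-in for the parameters handed to the REQ receiver. */
struct req_event_param {
    uint8_t responder_resources;
    uint8_t initiator_depth;
};

/*
 * Model of the flip: each wire field lands in the event field with the
 * opposite name, so the names read as "local" from the receiver's view.
 */
void format_req_event(const struct wire_req *req,
                      struct req_event_param *param)
{
    param->responder_resources = req->initiator_depth;
    param->initiator_depth = req->responder_resources;
}
```

With a client that advertises responder_resources = 4 and initiator_depth = 0 on the wire, the receiver in this model sees responder_resources = 0 and initiator_depth = 4, which is the source of the confusion raised in the thread.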

* RE: why flipping responder_resources/initiator_depth?
@ 2014-06-23 5:00 Hefty, Sean
From: Hefty, Sean @ 2014-06-23 5:00 UTC (permalink / raw)
To: Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)
Cc: Sagi Grimberg, Roi Dayan

> Any special reason for that? specifically, it doesn't comply with our
> common sense (...) and the text in the rdma_connect/rdma_accept man
> pages http://linux.die.net/man/3/rdma_connect that say:

Flipping the values keeps the meaning the same on both sides: the meaning is
relative to the local versus the remote side, rather than to the sender of
the message.

* Re: why flipping responder_resources/initiator_depth?
@ 2014-06-23 5:55 Or Gerlitz
From: Or Gerlitz @ 2014-06-23 5:55 UTC (permalink / raw)
To: Hefty, Sean
Cc: Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan

On Mon, Jun 23, 2014 at 8:00 AM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> Any special reason for that? specifically, it doesn't comply with our
>> common sense (...) and the text in the rdma_connect/rdma_accept man
>> pages http://linux.die.net/man/3/rdma_connect that say:
> Flipping the values keeps the meaning the same on both sides, with the
> meaning relative to the local, versus remote side, as opposed to the
> sender of the message.

But the meaning need not be the same on both sides. It is very common for
the number of in-flight rdma-reads initiated/responded to by each side of
the connection to be asymmetrical: it can be N for the client --> server
direction and M for the server --> client direction. In the real-life
example below, N=0.

Take a case where only one side initiates rdma-reads, say the server, which
is common in many transactional storage protocols (SRP, iSER). The client
and server need to negotiate how many in-flight rdma-reads the server is
allowed to issue. As I see things, this is the expected flow:

1. The client puts into the responder_resources it provides to
rdma_connect the maximum number of outstanding RDMA reads that it will be
able to accept from the server side.

2. The server applies a minimum function between the responder_resources
advertised by the client (which it gets in the connection request event
params) and the number of in-flight rdma-reads its HCA supports.

Isn't this paradigm robust and simple? And can't it be extended to the case
where only the client, or both sides, issue rdma-reads/atomics?

Also note that this flipping decision taken by the Linux CM can cause
applications not to inter-operate properly with instances running over
other CMs (e.g. Windows, ESX, realtime/proprietary OSes in commercial
storage boxes, etc.).

I find the current practice both non-interoperable and confusing. I don't
see how it follows the terms "local" and "remote" used in the man page.

Or.

* Re: why flipping responder_resources/initiator_depth?
@ 2014-06-23 16:49 Jason Gunthorpe
From: Jason Gunthorpe @ 2014-06-23 16:49 UTC (permalink / raw)
To: Or Gerlitz
Cc: Hefty, Sean, Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan

On Mon, Jun 23, 2014 at 08:55:07AM +0300, Or Gerlitz wrote:
> 1. the client to put into the responder_resources they provide to
> rdma_connect the maximum number of outstanding RDMA read that they
> will be able accept from the server side
>
> 2. the server to apply a minimum function between the
> responder_resources which were advertized by the client (and they get
> in the connection request event params) to how many inflight
> rdma-reads their HCA supports

From a wire perspective the spec is pretty clear what the CM responder
resources and initiator depth are supposed to be, and the behavior of #2 is
mandated in the spec.

From an API perspective it makes sense that the only input to the API would
be 'the initiator depth the caller will use', which is basically the only
thing the caller actually controls: 0 if the client never uses RDMA READ or
ATOMICs, 1 if it is strictly interlocked, and higher as necessary.

I'm not sure there is a use case to limit QP responder resources at the
caller? Maybe to specify '0' if the caller knows it will never set up a
remote readable MR?

So both sides pass in their desired initiator depth. Both sides limit that
to HCA init depth capabilities.

The REQ side plugs that value into REQ.initiatorDepth and the HCA
capability into REQ.responderResources.

The REQ responder takes min(REQ.responderResources, local initiatorDepth)
and returns that in REP.initiatorDepth. It takes min(REQ.initiatorDepth, HW
respres capability), plugs that into the local QP and returns it in
REP.responderResources.

The REQ initiator takes that reply, does min(REP.responderResources, HW
initdepth capability, API depth) and plugs that into the QP. It checks that
REP.initDepth < REQ.responderResources and errors if false, and plugs
REP.initDepth into the local QP's responder resources.

The swapping and the generally missing handling of RR negotiation in the
whole kernel CM API (not just RDMA CM, but IB CM too) is a longstanding
bug, and I have written user space code that fixes it up in the past :(

It works OK if both sides hard code 2 or 4, or whatever is 99% of use
cases. It is broken if you are doing what Or is talking about: optimizing
RR usage because one half of a connection doesn't use RRs at all.

Jason
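
The negotiation steps in the message above can be sketched as three small functions. This is only an illustration under stated assumptions, not code from any CM implementation: the types (side, cm_msg, qp_limits) and function names are hypothetical, the cm_msg fields carry their wire meaning (no flipping), and the REP sanity check is written as "<=" rather than the "<" in the message, on the assumption that a REP.initiatorDepth equal to REQ.responderResources is acceptable since the responder's own min() can produce exactly that value.

```c
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Hypothetical description of one endpoint. */
struct side {
    uint8_t api_init_depth; /* initiator depth the caller says it will use */
    uint8_t hw_init_depth;  /* HCA limit on outstanding reads as initiator */
    uint8_t hw_resp_res;    /* HCA limit on reads it can respond to */
};

/* REQ or REP fields, with their on-the-wire meaning. */
struct cm_msg { uint8_t initiator_depth; uint8_t responder_resources; };

/* Values finally programmed into the local QP. */
struct qp_limits { uint8_t initiator_depth; uint8_t responder_resources; };

/* REQ side: desired depth limited to the HCA, plus the HCA responder cap. */
struct cm_msg build_req(const struct side *c)
{
    struct cm_msg req;
    req.initiator_depth = min_u8(c->api_init_depth, c->hw_init_depth);
    req.responder_resources = c->hw_resp_res;
    return req;
}

/* REQ responder: build the REP and program the local QP. */
struct cm_msg handle_req(const struct side *s, const struct cm_msg *req,
                         struct qp_limits *qp)
{
    struct cm_msg rep;
    uint8_t local_depth = min_u8(s->api_init_depth, s->hw_init_depth);

    rep.initiator_depth = min_u8(req->responder_resources, local_depth);
    rep.responder_resources = min_u8(req->initiator_depth, s->hw_resp_res);
    qp->initiator_depth = rep.initiator_depth;
    qp->responder_resources = rep.responder_resources;
    return rep;
}

/* REQ initiator: validate the REP; returns 0 on success, -1 to reject. */
int handle_rep(const struct side *c, const struct cm_msg *req,
               const struct cm_msg *rep, struct qp_limits *qp)
{
    /* Assumption: "<=" instead of the "<" written in the message. */
    if (rep->initiator_depth > req->responder_resources)
        return -1;
    qp->initiator_depth = min_u8(rep->responder_resources,
                                 min_u8(c->hw_init_depth, c->api_init_depth));
    qp->responder_resources = rep->initiator_depth;
    return 0;
}
```

For the asymmetric storage case discussed in the thread, a client that never issues reads passes api_init_depth = 0. In this sketch the server then ends up with zero responder resources on its QP, while the client's QP keeps responder resources equal to the depth the server is allowed to use.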

* Re: why flipping responder_resources/initiator_depth?
@ 2014-06-23 17:38 Or Gerlitz
From: Or Gerlitz @ 2014-06-23 17:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Hefty, Sean, Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan

On Mon, Jun 23, 2014 at 7:49 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
[...]
> The swapping and general missing handling of RR negotiating in the
> whole kernel CM API (not just RDMA CM, but IB CM too) is a
> longstanding bug, and I have written user space code that fixes it up
> in the past :(

Jason, the swapping indeed takes place in the IB CM; I just used the
wording from the librdmacm man pages to describe the desired behaviour as I
see it. Did you ever report the swapping on this list in the past? When?

* RE: why flipping responder_resources/initiator_depth?
@ 2014-06-23 18:00 Hefty, Sean
From: Hefty, Sean @ 2014-06-23 18:00 UTC (permalink / raw)
To: Or Gerlitz, Jason Gunthorpe
Cc: Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan

> > The swapping and general missing handling of RR negotiating in the
> > whole kernel CM API (not just RDMA CM, but IB CM too) is a
> > longstanding bug, and I have written user space code that fixes it up
> > in the past :(
>
> Jason, the swapping takes place in the IB CM indeed, I just used the
> wording from the librdmacm man pages to described the desired
> behaviour as I see it. Did you ever repored to the swapping on this
> list in the past? when?

The behavior matches the documentation. And the problem is...?

The initiator_depth and responder_resources must be swapped between the REQ
and REP. Why is having the RDMA CM do this swapping an issue?

* Re: why flipping responder_resources/initiator_depth?
@ 2014-06-23 18:34 Jason Gunthorpe
From: Jason Gunthorpe @ 2014-06-23 18:34 UTC (permalink / raw)
To: Hefty, Sean
Cc: Or Gerlitz, Or Gerlitz,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan

On Mon, Jun 23, 2014 at 06:00:57PM +0000, Hefty, Sean wrote:
> > > The swapping and general missing handling of RR negotiating in the
> > > whole kernel CM API (not just RDMA CM, but IB CM too) is a
> > > longstanding bug, and I have written user space code that fixes it up
> > > in the past :(
> >
> > Jason, the swapping takes place in the IB CM indeed, I just used the
> > wording from the librdmacm man pages to described the desired
> > behaviour as I see it. Did you ever repored to the swapping on this
> > list in the past? when?
>
> The behavior matches the documentation. And the problem is...?

The problem is this whole thing is a giant gotcha if you don't intimately
understand exactly what the spec requires, and naively assume the kernel
does something sane, or even provides you the values the spec says you need
in fields that are named the same as the spec.

If you use the IB CM in userspace you need to hook IB_CM_REQ_RECEIVED and
do something like this:

    /* Note, req.responder_resources and req.initiator_depth are swapped
       in the kernel.
       FIXME: this works around the kernel not implementing the
       negotiation procedure by doing it here */
    rep.responder_resources = min((int)req.responder_resources,
                                  devAttr.max_qp_rd_atom);
    rep.initiator_depth = min((int)req.initiator_depth,
                              devAttr.max_qp_init_rd_atom);

So

1) The kernel swapped the values before passing them to userspace (and
other kernel consumers). So this becomes very confusing if you are not
aware that req.responder_resources is not actually what the IBA spec
describes as REQ responderResources.

2) The kernel doesn't do anything to help implement the negotiation
required by the IBA spec; for instance, it doesn't limit to HCA values
after getting a REQ.

3) There is no aid to help a simple app developer do this right, and
almost every app I've ever looked at just passes 2 in for both values and
hopes for the best.

4) Other elements of the negotiation procedure I outlined above seem to be
missing, like the sanity check of the REP, and the generation of REJ if
the values are not acceptable.

I haven't looked at how this all plays through with RDMA CM. But looking
quickly, I don't see an obvious similar min in cma_connect_ib.

To my mind, the biggest issue is the common code does not seem to make it
easy for apps to correctly implement the IBA negotiation protocol.

Jason
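
The fragment quoted in the message above depends on surrounding application state. A self-contained variant might look like the sketch below. The struct names (req_event, rep_param, dev_caps) and the helper clamp_to_cap are hypothetical; only the field names max_qp_rd_atom and max_qp_init_rd_atom are borrowed from struct ibv_device_attr, and the req_event fields are assumed to arrive already swapped, as the message describes.

```c
#include <stdint.h>

/* Hypothetical mirror of the REQ event fields as delivered (already swapped). */
struct req_event { uint8_t responder_resources; uint8_t initiator_depth; };

/* Hypothetical mirror of the REP parameters the application fills in. */
struct rep_param { uint8_t responder_resources; uint8_t initiator_depth; };

/* Only the two device attributes used here, named as in ibv_device_attr. */
struct dev_caps { int max_qp_rd_atom; int max_qp_init_rd_atom; };

/* Limit an 8-bit requested value to a device capability reported as int. */
static uint8_t clamp_to_cap(uint8_t requested, int cap)
{
    if (cap < 0)
        cap = 0;
    /* If requested >= cap, then cap fits in 8 bits and the cast is safe. */
    return (int)requested < cap ? requested : (uint8_t)cap;
}

/* Workaround from the message: limit both values to what the HCA supports. */
void fill_rep(const struct req_event *req, const struct dev_caps *caps,
              struct rep_param *rep)
{
    rep->responder_resources =
        clamp_to_cap(req->responder_resources, caps->max_qp_rd_atom);
    rep->initiator_depth =
        clamp_to_cap(req->initiator_depth, caps->max_qp_init_rd_atom);
}
```

In a real application the capability values would presumably come from a device query such as ibv_query_device(); that call is left out here so the sketch stays free of external dependencies.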

* Re: why flipping responder_resources/initiator_depth?
@ 2014-06-25 20:51 Or Gerlitz
From: Or Gerlitz @ 2014-06-25 20:51 UTC (permalink / raw)
To: Jason Gunthorpe, Hefty, Sean
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan

On Mon, Jun 23, 2014 at 9:34 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Jun 23, 2014 at 06:00:57PM +0000, Hefty, Sean wrote:
>> The behavior matches the documentation. And the problem is...?

Sean, which documentation exactly?

[...]
> I haven't looked at how this all plays through with RDMA CM. But
> looking quickly, I don't see an obvious similar min in cma_connect_ib.

Jason,

The rdma-cm doesn't do any flipping; it just passes the params as given by
the application to the relevant IB CM call, see
cma_connect_ib() --> ib_send_cm_req() and
cma_accept_ib() --> ib_send_cm_rep().

As I wrote in the initial note, the flipping is done in the IB CM, in
cm_format_req_event() and cm_format_rep_event(), before it invokes the
callback provided by the upper layer (e.g. SRP/IPoIB/RDMA-CM/UCM).

> To my mind, the biggest issue is the common code does not seem to make
> it easy for apps to correctly implement the IBA negotiation protocol.

And also, the Linux CM assumes the peer CM will apply the same flipping
practice.

Or.

* Re: why flipping responder_resources/initiator_depth?
@ 2014-06-25 21:08 Jason Gunthorpe
From: Jason Gunthorpe @ 2014-06-25 21:08 UTC (permalink / raw)
To: Or Gerlitz
Cc: Hefty, Sean,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Sagi Grimberg, Roi Dayan

On Wed, Jun 25, 2014 at 11:51:41PM +0300, Or Gerlitz wrote:
> The rdma-cm doesn't do any flipping, it just provides the relevant IB
> CM call with the params as given by the application, see

Sure, that is what I mean.

The IB spec has a protocol for negotiating responder resources. I outlined
how it works in a prior email.

IB CM and RDMA CM both push responsibility for implementing the protocol to
the app. The app needs to be aware of the flipped names when it implements
it. Otherwise the flipped names don't matter: if you dig into it enough you
can figure out what value from the IB CMA defined MAD ends up in what
structure member during the app callback, for every callback.

If you implement an app that uses responder resources and doesn't implement
the IB defined negotiation protocol, then it is broken. The most common
thing that needs to be done is to limit things to HW capability, and to use
values that reflect the app's use of the QP (e.g. 0 if no RRs are used in a
direction). The spec has all the details.

For your immediate case, I suspect the path you need to follow is to
implement the negotiation protocol. Then you can have your client stuff in
0 for initiator_depth, and when the protocol is followed the server will
avoid allocating any RRs.

To that end, the swapping of the labels is merely confusing; you can still
implement the protocol properly.

Bonus points if you can add some kind of core code support to implement the
protocol in common code based on both sides telling the core what their
initiator depth will be.

Jason