* ib_post_send in drivers
@ 2009-11-20 16:16 frank zago
[not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: frank zago @ 2009-11-20 16:16 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hello,
It seems ib_post_send() is implemented slightly differently in the
various hardware drivers (as in kernel 2.6.31). Here are the differences
I've noticed regarding the bad_wr parameter.
amso1100/c2_qp.c : c2_post_send()
* bails out and does not set bad_wr if the 1st check is bad.
cxgb3/iwch_qp.c : post_one_send()
* test for bad_send_wr but it should always be set
cxgb3/iwch_qp.c : iwch_post_send()
* bails out and does not set bad_wr if the 1st 2 checks are bad
ehca/ehca_reqs.c : ehca_post_send()
* bails out and does not set bad_wr if the 1st check is bad.
* test for bad_send_wr but it should always be set
* always return success if at least one post succeeded.
ehca/ehca_reqs.c : post_one_send()
* test for bad_send_wr but it should always be set
nes/nes_verbs.c : nes_post_send()
* bails out and does not set bad_wr if the 1st check is bad.
I think assume most are bugs (especially the ehca driver). I can post a
patch to fix these if confirmed.
Regards,
Frank
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread[parent not found: <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org>]
* Re: ib_post_send in drivers [not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org> @ 2009-11-20 19:26 ` Bart Van Assche [not found] ` <e2e108260911201126o5233ea3v94163083fe3d26fc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-11-23 19:02 ` Roland Dreier 1 sibling, 1 reply; 9+ messages in thread From: Bart Van Assche @ 2009-11-20 19:26 UTC (permalink / raw) To: frank zago; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA On Fri, Nov 20, 2009 at 5:16 PM, frank zago <fzago-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org> wrote: > > It seems ib_post_send() is implemented slightly differently in the various hardware drivers (as in kernel 2.6.31). Here are the differences I've noticed regarding the bad_wr parameter. > > amso1100/c2_qp.c : c2_post_send() > * bails out and does not set bad_wr if the 1st check is bad. > > cxgb3/iwch_qp.c : post_one_send() > * test for bad_send_wr but it should always be set > > cxgb3/iwch_qp.c : iwch_post_send() > * bails out and does not set bad_wr if the 1st 2 checks are bad > > ehca/ehca_reqs.c : ehca_post_send() > * bails out and does not set bad_wr if the 1st check is bad. > * test for bad_send_wr but it should always be set > * always return success if at least one post succeeded. > > ehca/ehca_reqs.c : post_one_send() > * test for bad_send_wr but it should always be set > > nes/nes_verbs.c : nes_post_send() > * bails out and does not set bad_wr if the 1st check is bad. > > I think assume most are bugs (especially the ehca driver). I can post a patch to fix these if confirmed. I would like to add the following item to the above list: mlx4/qp.c: mlx4_ib_post_send() * when passing a list containing more than one item to mlx4_ib_post_send(), and sending the second or later item fails (e.g. because of QP overflow), the preceding items are sent anyway. This behavior makes it almost impossible to get error recovery right for block device implementations that use ib_post_send() (e.g. the SRPT target implementation). If my interpretation of the section about verbs in the InfiniBand Architecture Specification is correct, either all work requests should be processed or none. A quote from section 11.4.1.1, Post Send Request (page 622 in volume 1 of release 1.2.1): If an immediate error is returned, the QP state shall not be affected. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <e2e108260911201126o5233ea3v94163083fe3d26fc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* RE: ib_post_send in drivers [not found] ` <e2e108260911201126o5233ea3v94163083fe3d26fc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-11-20 20:08 ` Sean Hefty [not found] ` <E102E8F5252B43318544B660CF6B78F7-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 9+ messages in thread From: Sean Hefty @ 2009-11-20 20:08 UTC (permalink / raw) To: 'Bart Van Assche', frank zago; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA >mlx4/qp.c: mlx4_ib_post_send() >* when passing a list containing more than one item to >mlx4_ib_post_send(), and sending the second or later item fails (e.g. >because of QP overflow), the preceding items are sent anyway. This >behavior makes it almost impossible to get error recovery right for >block device implementations that use ib_post_send() (e.g. the SRPT >target implementation). Yes - this is the correct behavior. The bad_wr pointer should reference the WR that failed, with all WRs in the list passed that point being returned unprocessed. This is the reason for having the bad_wr in the call. Error recovery shouldn't be any more difficult than posting one WR at a time. >If my interpretation of the section about verbs in the InfiniBand >Architecture Specification is correct, either all work requests should >be processed or none. A quote from section 11.4.1.1, Post Send Request >(page 622 in volume 1 of release 1.2.1): The IB spec does not define an API. For performance reasons, you don't want the implementation to walk through the WR list multiple times - once to check it, then a second time to actually post the requests to the hardware. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <E102E8F5252B43318544B660CF6B78F7-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>]
* Re: ib_post_send in drivers [not found] ` <E102E8F5252B43318544B660CF6B78F7-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org> @ 2009-11-21 11:17 ` Bart Van Assche [not found] ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 9+ messages in thread From: Bart Van Assche @ 2009-11-21 11:17 UTC (permalink / raw) To: Sean Hefty; +Cc: frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA On Fri, Nov 20, 2009 at 9:08 PM, Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote: >>mlx4/qp.c: mlx4_ib_post_send() >>* when passing a list containing more than one item to >>mlx4_ib_post_send(), and sending the second or later item fails (e.g. >>because of QP overflow), the preceding items are sent anyway. This >>behavior makes it almost impossible to get error recovery right for >>block device implementations that use ib_post_send() (e.g. the SRPT >>target implementation). > > Yes - this is the correct behavior. The bad_wr pointer should reference the WR > that failed, with all WRs in the list passed that point being returned > unprocessed. This is the reason for having the bad_wr in the call. Error > recovery shouldn't be any more difficult than posting one WR at a time. > >>If my interpretation of the section about verbs in the InfiniBand >>Architecture Specification is correct, either all work requests should >>be processed or none. A quote from section 11.4.1.1, Post Send Request >>(page 622 in volume 1 of release 1.2.1): > > The IB spec does not define an API. For performance reasons, you don't want the > implementation to walk through the WR list multiple times - once to check it, > then a second time to actually post the requests to the hardware. Thanks for the feedback. I have two further questions: - Where can IB driver developers find detailed specifications of the verbs API they should implement ? I learned about the details of the behavior of the ib_post_send() call by reading the mlx4 source code. Shouldn't this behavior be documented in include/rdma/ib_verbs.h instead ? - Does walking twice over the WR list always result in inferior performance compared to walking once over this list ? Both the iSER protocol and the SRP protocol allow to send large sg lists (e.g. containing 128 elements) at once over the wire. When using asynchronous (buffered) I/O, this maximum is often reached. One interesting performance optimization is to send all 128 sg elements at once using one ib_post_send() call and to request a completion notification for the last WR only. But if the ib_post_send() call returns an immediate error and has sent part of the WR list, no completion notification will be received. So code that calls ib_post_send() has to request a completion notification for each WR, which has a negative performance impact. My opinion is that the current behavior makes ib_post_send() easier to implement, while the behavior specified in the IBAS is more interesting for applications that use the verbs API. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: ib_post_send in drivers [not found] ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-11-21 20:01 ` Jason Gunthorpe [not found] ` <20091121200128.GD1966-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2009-11-23 19:09 ` Roland Dreier 1 sibling, 1 reply; 9+ messages in thread From: Jason Gunthorpe @ 2009-11-21 20:01 UTC (permalink / raw) To: Bart Van Assche; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA On Sat, Nov 21, 2009 at 12:17:32PM +0100, Bart Van Assche wrote: > ib_post_send() has to request a completion notification for each WR, > which has a negative performance impact. My opinion is that the > current behavior makes ib_post_send() easier to implement, while the > behavior specified in the IBAS is more interesting for applications > that use the verbs API. It seems to me an error return from ib_post_send either means the caller is asking for something impossible, or the QP is wrecked, and is thus pretty much non-recoverable. For instance all the errors for mlx4 fit this pattern - and I think that is a reasonable requirement for any implementation of ib_post_send. So in that light the absence/presence of a completion does not seem important - if you get an error back you should always tear down the QP. Are SRP/etc calling ib_post_send in a way that ever returns errors? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <20091121200128.GD1966-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: ib_post_send in drivers [not found] ` <20091121200128.GD1966-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2009-11-21 20:37 ` Bart Van Assche [not found] ` <e2e108260911211237j2b9fd96cn3df78fe5c59fdc5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 9+ messages in thread From: Bart Van Assche @ 2009-11-21 20:37 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA On Sat, Nov 21, 2009 at 9:01 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > > On Sat, Nov 21, 2009 at 12:17:32PM +0100, Bart Van Assche wrote: > > > ib_post_send() has to request a completion notification for each WR, > > which has a negative performance impact. My opinion is that the > > current behavior makes ib_post_send() easier to implement, while the > > behavior specified in the IBAS is more interesting for applications > > that use the verbs API. > > It seems to me an error return from ib_post_send either means the > caller is asking for something impossible, or the QP is wrecked, and > is thus pretty much non-recoverable. For instance all the errors for > mlx4 fit this pattern - and I think that is a reasonable requirement > for any implementation of ib_post_send. > > So in that light the absence/presence of a completion does not seem > important - if you get an error back you should always tear down the > QP. > > Are SRP/etc calling ib_post_send in a way that ever returns errors? ib_post_send() failures in the SRP target implementation have already been observed. The error code returned by ib_post_send() was ENOMEM (queue full). It would be more useful IMHO if ib_post_send() would not post any WR instead of posting part of the WR's passed to ib_post_send() when the 'queue full' condition is hit. Note: SRPT uses a single SRQ per HCA, even when multiple SRP connections are active for the same HCA. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <e2e108260911211237j2b9fd96cn3df78fe5c59fdc5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: ib_post_send in drivers [not found] ` <e2e108260911211237j2b9fd96cn3df78fe5c59fdc5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-11-21 21:52 ` Jason Gunthorpe 0 siblings, 0 replies; 9+ messages in thread From: Jason Gunthorpe @ 2009-11-21 21:52 UTC (permalink / raw) To: Bart Van Assche; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA On Sat, Nov 21, 2009 at 09:37:29PM +0100, Bart Van Assche wrote: > It would be more useful IMHO if ib_post_send() would not > post any WR instead of posting part of the WR's passed to > ib_post_send() when the 'queue full' condition is hit. Note: SRPT uses > a single SRQ per HCA, even when multiple SRP connections are active > for the same HCA. Well, yes, this is useful, but the way verbs is kinda designed, it isn't the verbs layer that worries about that. If your app cannot handle a partial send then it must track the SQ availability on its own and never cause the SQ to overflow. This is what SRP initiator does, for instance. If you are keeping track, and you get ENOMEM then it is a bug. A fair criticism is that verbs doesn't have APIs to export the various Q availabilities - but I think ib_post_send is fine for its intended role. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: ib_post_send in drivers [not found] ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-11-21 20:01 ` Jason Gunthorpe @ 2009-11-23 19:09 ` Roland Dreier 1 sibling, 0 replies; 9+ messages in thread From: Roland Dreier @ 2009-11-23 19:09 UTC (permalink / raw) To: Bart Van Assche; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA > Thanks for the feedback. I have two further questions: > - Where can IB driver developers find detailed specifications of the > verbs API they should implement ? I learned about the details of the > behavior of the ib_post_send() call by reading the mlx4 source code. > Shouldn't this behavior be documented in include/rdma/ib_verbs.h > instead ? As usual things are designed by an ad hoc mixture of the various specs mixed with experience from real implementations, with some details ending up as folklore only documented in mailing list threads. Of course it would be good to improve the documentation, but at the moment many details did not end up getting explicitly written. > - Does walking twice over the WR list always result in inferior > performance compared to walking once over this list ? Both the iSER > protocol and the SRP protocol allow to send large sg lists (e.g. > containing 128 elements) at once over the wire. When using > asynchronous (buffered) I/O, this maximum is often reached. One > interesting performance optimization is to send all 128 sg elements at > once using one ib_post_send() call and to request a completion > notification for the last WR only. But if the ib_post_send() call > returns an immediate error and has sent part of the WR list, no > completion notification will be received. So code that calls > ib_post_send() has to request a completion notification for each WR, > which has a negative performance impact. My opinion is that the > current behavior makes ib_post_send() easier to implement, while the > behavior specified in the IBAS is more interesting for applications > that use the verbs API. I think that yes, walking a list of work requests twice, once to look for immediate errors and once to actually post them is probably going to perform somewhat worse. Also, all of the immediate errors that can be returned are really things that the consumer can avoid -- too many SG entries, invalid opcode, WQ overflow. So error handling in the consumer for this can probably become BUG() or the equivalent. - R. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: ib_post_send in drivers [not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org> 2009-11-20 19:26 ` Bart Van Assche @ 2009-11-23 19:02 ` Roland Dreier 1 sibling, 0 replies; 9+ messages in thread From: Roland Dreier @ 2009-11-23 19:02 UTC (permalink / raw) To: frank zago; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA > It seems ib_post_send() is implemented slightly differently in the > various hardware drivers (as in kernel 2.6.31). Here are the > differences I've noticed regarding the bad_wr parameter. Most of those appear to be bugs. Patches fixing them would be welcome I think. (just to be clear -- splitting this up by driver is probably the best idea, so that individual driver maintainers can review) Thanks for auditing all this. - Roland -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2009-11-23 19:09 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-20 16:16 ib_post_send in drivers frank zago
[not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org>
2009-11-20 19:26 ` Bart Van Assche
[not found] ` <e2e108260911201126o5233ea3v94163083fe3d26fc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-11-20 20:08 ` Sean Hefty
[not found] ` <E102E8F5252B43318544B660CF6B78F7-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
2009-11-21 11:17 ` Bart Van Assche
[not found] ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-11-21 20:01 ` Jason Gunthorpe
[not found] ` <20091121200128.GD1966-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2009-11-21 20:37 ` Bart Van Assche
[not found] ` <e2e108260911211237j2b9fd96cn3df78fe5c59fdc5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-11-21 21:52 ` Jason Gunthorpe
2009-11-23 19:09 ` Roland Dreier
2009-11-23 19:02 ` Roland Dreier
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox