public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* ib_post_send in drivers
@ 2009-11-20 16:16 frank zago
       [not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: frank zago @ 2009-11-20 16:16 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hello,

It seems ib_post_send() is implemented slightly differently in the 
various hardware drivers (as in kernel 2.6.31). Here are the differences 
I've noticed regarding the bad_wr parameter.

amso1100/c2_qp.c : c2_post_send()
 * bails out and does not set bad_wr if the 1st check is bad.

cxgb3/iwch_qp.c : post_one_send()
 * test for bad_send_wr but it should always be set

cxgb3/iwch_qp.c : iwch_post_send()
 * bails out and does not set bad_wr if the 1st 2 checks are bad

ehca/ehca_reqs.c : ehca_post_send()
 * bails out and does not set bad_wr if the 1st check is bad.
 * test for bad_send_wr but it should always be set
 * always return success if at least one post succeeded.

ehca/ehca_reqs.c : post_one_send()
 * test for bad_send_wr but it should always be set

nes/nes_verbs.c : nes_post_send()
 * bails out and does not set bad_wr if the 1st check is bad.

I think assume most are bugs (especially the ehca driver). I can post a 
patch to fix these if confirmed.

Regards,
 Frank

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ib_post_send in drivers
       [not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org>
@ 2009-11-20 19:26   ` Bart Van Assche
       [not found]     ` <e2e108260911201126o5233ea3v94163083fe3d26fc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-11-23 19:02   ` Roland Dreier
  1 sibling, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2009-11-20 19:26 UTC (permalink / raw)
  To: frank zago; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 20, 2009 at 5:16 PM, frank zago <fzago-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org> wrote:
>
> It seems ib_post_send() is implemented slightly differently in the various hardware drivers (as in kernel 2.6.31). Here are the differences I've noticed regarding the bad_wr parameter.
>
> amso1100/c2_qp.c : c2_post_send()
> * bails out and does not set bad_wr if the 1st check is bad.
>
> cxgb3/iwch_qp.c : post_one_send()
> * test for bad_send_wr but it should always be set
>
> cxgb3/iwch_qp.c : iwch_post_send()
> * bails out and does not set bad_wr if the 1st 2 checks are bad
>
> ehca/ehca_reqs.c : ehca_post_send()
> * bails out and does not set bad_wr if the 1st check is bad.
> * test for bad_send_wr but it should always be set
> * always return success if at least one post succeeded.
>
> ehca/ehca_reqs.c : post_one_send()
> * test for bad_send_wr but it should always be set
>
> nes/nes_verbs.c : nes_post_send()
> * bails out and does not set bad_wr if the 1st check is bad.
>
> I think assume most are bugs (especially the ehca driver). I can post a patch to fix these if confirmed.

I would like to add the following item to the above list:

mlx4/qp.c: mlx4_ib_post_send()
* when passing a list containing more than one item to
mlx4_ib_post_send(), and sending the second or later item fails (e.g.
because of QP overflow), the preceding items are sent anyway. This
behavior makes it almost impossible to get error recovery right for
block device implementations that use ib_post_send() (e.g. the SRPT
target implementation).

If my interpretation of the section about verbs in the InfiniBand
Architecture Specification is correct, either all work requests should
be processed or none. A quote from section 11.4.1.1, Post Send Request
(page 622 in volume 1 of release 1.2.1):

If an immediate error is returned, the QP state shall not be affected.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: ib_post_send in drivers
       [not found]     ` <e2e108260911201126o5233ea3v94163083fe3d26fc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-11-20 20:08       ` Sean Hefty
       [not found]         ` <E102E8F5252B43318544B660CF6B78F7-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Sean Hefty @ 2009-11-20 20:08 UTC (permalink / raw)
  To: 'Bart Van Assche', frank zago; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

>mlx4/qp.c: mlx4_ib_post_send()
>* when passing a list containing more than one item to
>mlx4_ib_post_send(), and sending the second or later item fails (e.g.
>because of QP overflow), the preceding items are sent anyway. This
>behavior makes it almost impossible to get error recovery right for
>block device implementations that use ib_post_send() (e.g. the SRPT
>target implementation).

Yes - this is the correct behavior.  The bad_wr pointer should reference the WR
that failed, with all WRs in the list passed that point being returned
unprocessed.  This is the reason for having the bad_wr in the call.  Error
recovery shouldn't be any more difficult than posting one WR at a time.

>If my interpretation of the section about verbs in the InfiniBand
>Architecture Specification is correct, either all work requests should
>be processed or none. A quote from section 11.4.1.1, Post Send Request
>(page 622 in volume 1 of release 1.2.1):

The IB spec does not define an API.  For performance reasons, you don't want the
implementation to walk through the WR list multiple times - once to check it,
then a second time to actually post the requests to the hardware.

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ib_post_send in drivers
       [not found]         ` <E102E8F5252B43318544B660CF6B78F7-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
@ 2009-11-21 11:17           ` Bart Van Assche
       [not found]             ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2009-11-21 11:17 UTC (permalink / raw)
  To: Sean Hefty; +Cc: frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 20, 2009 at 9:08 PM, Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>mlx4/qp.c: mlx4_ib_post_send()
>>* when passing a list containing more than one item to
>>mlx4_ib_post_send(), and sending the second or later item fails (e.g.
>>because of QP overflow), the preceding items are sent anyway. This
>>behavior makes it almost impossible to get error recovery right for
>>block device implementations that use ib_post_send() (e.g. the SRPT
>>target implementation).
>
> Yes - this is the correct behavior.  The bad_wr pointer should reference the WR
> that failed, with all WRs in the list passed that point being returned
> unprocessed.  This is the reason for having the bad_wr in the call.  Error
> recovery shouldn't be any more difficult than posting one WR at a time.
>
>>If my interpretation of the section about verbs in the InfiniBand
>>Architecture Specification is correct, either all work requests should
>>be processed or none. A quote from section 11.4.1.1, Post Send Request
>>(page 622 in volume 1 of release 1.2.1):
>
> The IB spec does not define an API.  For performance reasons, you don't want the
> implementation to walk through the WR list multiple times - once to check it,
> then a second time to actually post the requests to the hardware.

Thanks for the feedback. I have two further questions:
- Where can IB driver developers find detailed specifications of the
verbs API they should implement ? I learned about the details of the
behavior of the ib_post_send() call by reading the mlx4 source code.
Shouldn't this behavior be documented in include/rdma/ib_verbs.h
instead ?
- Does walking twice over the WR list always result in inferior
performance compared to walking once over this list ? Both the iSER
protocol and the SRP protocol allow to send large sg lists (e.g.
containing 128 elements) at once over the wire. When using
asynchronous (buffered) I/O, this maximum is often reached. One
interesting performance optimization is to send all 128 sg elements at
once using one ib_post_send() call and to request a completion
notification for the last WR only. But if the ib_post_send() call
returns an immediate error and has sent part of the WR list, no
completion notification will be received. So code that calls
ib_post_send() has to request a completion notification for each WR,
which has a negative performance impact. My opinion is that the
current behavior makes ib_post_send() easier to implement, while the
behavior specified in the IBAS is more interesting for applications
that use the verbs API.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ib_post_send in drivers
       [not found]             ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-11-21 20:01               ` Jason Gunthorpe
       [not found]                 ` <20091121200128.GD1966-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2009-11-23 19:09               ` Roland Dreier
  1 sibling, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2009-11-21 20:01 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sat, Nov 21, 2009 at 12:17:32PM +0100, Bart Van Assche wrote:

> ib_post_send() has to request a completion notification for each WR,
> which has a negative performance impact. My opinion is that the
> current behavior makes ib_post_send() easier to implement, while the
> behavior specified in the IBAS is more interesting for applications
> that use the verbs API.

It seems to me an error return from ib_post_send either means the
caller is asking for something impossible, or the QP is wrecked, and
is thus pretty much non-recoverable. For instance all the errors for
mlx4 fit this pattern - and I think that is a reasonable requirement
for any implementation of ib_post_send.

So in that light the absence/presence of a completion does not seem
important - if you get an error back you should always tear down the
QP.

Are SRP/etc calling ib_post_send in a way that ever returns errors?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ib_post_send in drivers
       [not found]                 ` <20091121200128.GD1966-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2009-11-21 20:37                   ` Bart Van Assche
       [not found]                     ` <e2e108260911211237j2b9fd96cn3df78fe5c59fdc5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2009-11-21 20:37 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sat, Nov 21, 2009 at 9:01 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>
> On Sat, Nov 21, 2009 at 12:17:32PM +0100, Bart Van Assche wrote:
>
> > ib_post_send() has to request a completion notification for each WR,
> > which has a negative performance impact. My opinion is that the
> > current behavior makes ib_post_send() easier to implement, while the
> > behavior specified in the IBAS is more interesting for applications
> > that use the verbs API.
>
> It seems to me an error return from ib_post_send either means the
> caller is asking for something impossible, or the QP is wrecked, and
> is thus pretty much non-recoverable. For instance all the errors for
> mlx4 fit this pattern - and I think that is a reasonable requirement
> for any implementation of ib_post_send.
>
> So in that light the absence/presence of a completion does not seem
> important - if you get an error back you should always tear down the
> QP.
>
> Are SRP/etc calling ib_post_send in a way that ever returns errors?

ib_post_send() failures in the SRP target implementation have already
been observed. The error code returned by ib_post_send() was ENOMEM
(queue full). It would be more useful IMHO if ib_post_send() would not
post any WR instead of posting part of the WR's passed to
ib_post_send() when the 'queue full' condition is hit. Note: SRPT uses
a single SRQ per HCA, even when multiple SRP connections are active
for the same HCA.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ib_post_send in drivers
       [not found]                     ` <e2e108260911211237j2b9fd96cn3df78fe5c59fdc5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-11-21 21:52                       ` Jason Gunthorpe
  0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2009-11-21 21:52 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sat, Nov 21, 2009 at 09:37:29PM +0100, Bart Van Assche wrote:

> It would be more useful IMHO if ib_post_send() would not
> post any WR instead of posting part of the WR's passed to
> ib_post_send() when the 'queue full' condition is hit. Note: SRPT uses
> a single SRQ per HCA, even when multiple SRP connections are active
> for the same HCA.

Well, yes, this is useful, but the way verbs is kinda designed, it
isn't the verbs layer that worries about that. If your app cannot
handle a partial send then it must track the SQ availability on its
own and never cause the SQ to overflow. This is what SRP initiator
does, for instance.

If you are keeping track, and you get ENOMEM then it is a bug.

A fair criticism is that verbs doesn't have APIs to export the various
Q availabilities - but I think ib_post_send is fine for its intended
role.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ib_post_send in drivers
       [not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org>
  2009-11-20 19:26   ` Bart Van Assche
@ 2009-11-23 19:02   ` Roland Dreier
  1 sibling, 0 replies; 9+ messages in thread
From: Roland Dreier @ 2009-11-23 19:02 UTC (permalink / raw)
  To: frank zago; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA


 > It seems ib_post_send() is implemented slightly differently in the
 > various hardware drivers (as in kernel 2.6.31). Here are the
 > differences I've noticed regarding the bad_wr parameter.

Most of those appear to be bugs.  Patches fixing them would be welcome I
think.  (just to be clear -- splitting this up by driver is probably the
best idea, so that individual driver maintainers can review)

Thanks for auditing all this.

 - Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ib_post_send in drivers
       [not found]             ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-11-21 20:01               ` Jason Gunthorpe
@ 2009-11-23 19:09               ` Roland Dreier
  1 sibling, 0 replies; 9+ messages in thread
From: Roland Dreier @ 2009-11-23 19:09 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Sean Hefty, frank zago, linux-rdma-u79uwXL29TY76Z2rM5mHXA


 > Thanks for the feedback. I have two further questions:
 > - Where can IB driver developers find detailed specifications of the
 > verbs API they should implement ? I learned about the details of the
 > behavior of the ib_post_send() call by reading the mlx4 source code.
 > Shouldn't this behavior be documented in include/rdma/ib_verbs.h
 > instead ?

As usual things are designed by an ad hoc mixture of the various specs
mixed with experience from real implementations, with some details
ending up as folklore only documented in mailing list threads.  Of
course it would be good to improve the documentation, but at the moment
many details did not end up getting explicitly written.

 > - Does walking twice over the WR list always result in inferior
 > performance compared to walking once over this list ? Both the iSER
 > protocol and the SRP protocol allow to send large sg lists (e.g.
 > containing 128 elements) at once over the wire. When using
 > asynchronous (buffered) I/O, this maximum is often reached. One
 > interesting performance optimization is to send all 128 sg elements at
 > once using one ib_post_send() call and to request a completion
 > notification for the last WR only. But if the ib_post_send() call
 > returns an immediate error and has sent part of the WR list, no
 > completion notification will be received. So code that calls
 > ib_post_send() has to request a completion notification for each WR,
 > which has a negative performance impact. My opinion is that the
 > current behavior makes ib_post_send() easier to implement, while the
 > behavior specified in the IBAS is more interesting for applications
 > that use the verbs API.

I think that yes, walking a list of work requests twice, once to look
for immediate errors and once to actually post them is probably going to
perform somewhat worse.  Also, all of the immediate errors that can be
returned are really things that the consumer can avoid -- too many SG
entries, invalid opcode, WQ overflow.  So error handling in the consumer
for this can probably become BUG() or the equivalent.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2009-11-23 19:09 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-20 16:16 ib_post_send in drivers frank zago
     [not found] ` <4B06C0EA.2070501-klaOcWyJdxkshyMvu7JE4pqQE7yCjDx5@public.gmane.org>
2009-11-20 19:26   ` Bart Van Assche
     [not found]     ` <e2e108260911201126o5233ea3v94163083fe3d26fc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-11-20 20:08       ` Sean Hefty
     [not found]         ` <E102E8F5252B43318544B660CF6B78F7-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
2009-11-21 11:17           ` Bart Van Assche
     [not found]             ` <e2e108260911210317t323adfe0k4aa3e821d3a1c63c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-11-21 20:01               ` Jason Gunthorpe
     [not found]                 ` <20091121200128.GD1966-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2009-11-21 20:37                   ` Bart Van Assche
     [not found]                     ` <e2e108260911211237j2b9fd96cn3df78fe5c59fdc5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-11-21 21:52                       ` Jason Gunthorpe
2009-11-23 19:09               ` Roland Dreier
2009-11-23 19:02   ` Roland Dreier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox