About a shortcoming of the verbs API

public inbox for linux-rdma@vger.kernel.org

* About a shortcoming of the verbs API
       [not found] ` <AANLkTi=zowawGDjyh+uKve_NiRNMXcrqjAk0hRxGSMOv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-25 18:54   ` Bart Van Assche
       [not found]     ` <AANLkTinHRnt-jvy0xBOAPUDGcfx6=V6rkRT3t0Ja52FP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2010-07-25 18:54 UTC (permalink / raw)
  To: Linux-RDMA

One of the most common operations when using the verbs API is to
dequeue and process completions. For many applications, e.g. storage
protocols, processing completions in order is a correctness
requirement. Unfortunately with the current IB verbs API it is not
possible to process completions in order on a multiprocessor system
when using notification-based completion processing without
introducing additional locking.

The two most common patterns for notification-based completion processing are:

1. Single completion processing loop.

* Initialization:
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);

* Notification handler:

struct ib_wc wc;
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
while (ib_poll_cq(cq, 1, &wc) > 0) {
    /* process wc */
}

2. Double completion processing loop

* Initialization:
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);

* Notification handler:

struct ib_wc wc;
do {
    while (ib_poll_cq(cq, 1, &wc) > 0) {
        /* process wc */
    }
} while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
                          IB_CQ_REPORT_MISSED_EVENTS) > 0);

A known performance disadvantage of the single completion processing
loop in (1) is that the completion handler can be invoked with an
empty completion queue (see also
http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03148.html).
While less likely, this can also happen with the double completion
processing loop in (2).

What is worse is that neither of the two loops above guarantees that
completions will be processed in order on a multiprocessor system. The
following can happen with both (1) and (2):
* The completion handler is invoked.
* Notifications are reenabled.
* A work completion (A) is popped off the completion queue.
* Completion processing is delayed for whatever reason.
* A new completion is pushed on the completion queue by the HCA.
* A new notification is generated.
* The same completion handler is invoked on another CPU, pops a
completion (B) from the completion queue and processes it.
* The completion handler that was delayed continues and processes
completion (A).

In other words, completions (A) and (B) have been processed out of order.

This is not only a shortcoming of the OFED implementation of the verbs
API, but a shortcoming that is also present in the verb extensions as
defined by the IBTA. In my opinion, defining "poll for completion"
and "request completion notification" as separate verbs is not the
best approach for multiprocessor or multi-core systems.

The only way I know of to prevent out-of-order completion processing
with the current OFED verbs API is to protect the whole completion
processing loop against concurrent execution with a spinlock. It may
be worth extending the verbs API such that it is possible to process
completions in order without additional locking. Apparently
API functions that allow this in a similar context have already been
invented in the past -- see e.g. VipCQNotify() in the Virtual
Interface Architecture Specification.
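
The spinlock workaround mentioned above can be sketched as a small
user-space model. This is an illustration under stated assumptions, not
kernel code: struct mock_cq, mock_cq_push() and mock_poll_cq() are
hypothetical stand-ins for a CQ and ib_poll_cq(), and a C11 atomic_flag
plays the role of the spinlock that serializes the whole drain loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MOCK_CQ_DEPTH 16

/* Hypothetical model of a completion queue holding completion ids. */
struct mock_cq {
    atomic_flag busy;          /* plays the role of the spinlock */
    size_t head, tail;         /* consumer and producer positions */
    int wc[MOCK_CQ_DEPTH];
};

static void mock_cq_init(struct mock_cq *cq)
{
    cq->head = cq->tail = 0;
    atomic_flag_clear(&cq->busy);
}

/* Models the HCA adding a completion (not thread-safe in this model). */
static void mock_cq_push(struct mock_cq *cq, int id)
{
    assert(cq->tail - cq->head < MOCK_CQ_DEPTH);
    cq->wc[cq->tail++ % MOCK_CQ_DEPTH] = id;
}

/* Stand-in for ib_poll_cq(cq, 1, &wc): returns 1 if an entry was popped. */
static int mock_poll_cq(struct mock_cq *cq, int *id)
{
    if (cq->head == cq->tail)
        return 0;
    *id = cq->wc[cq->head++ % MOCK_CQ_DEPTH];
    return 1;
}

/*
 * Completion handler: popping and processing happen under one lock, so
 * a second invocation cannot pop (B) and process it before (A) is done.
 * "Processing" here just records the id in out[].
 */
static size_t locked_completion_handler(struct mock_cq *cq,
                                        int *out, size_t out_len)
{
    size_t n = 0;
    int id;

    while (atomic_flag_test_and_set_explicit(&cq->busy,
                                             memory_order_acquire))
        ; /* spin until the other handler instance has finished */
    while (n < out_len && mock_poll_cq(cq, &id))
        out[n++] = id;
    atomic_flag_clear_explicit(&cq->busy, memory_order_release);
    return n;
}
```

The cost of this approach is that a second handler instance spins for
as long as the first one is processing, which is the extra locking the
text above would like to avoid.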

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <AANLkTinHRnt-jvy0xBOAPUDGcfx6=V6rkRT3t0Ja52FP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]     ` <AANLkTinHRnt-jvy0xBOAPUDGcfx6=V6rkRT3t0Ja52FP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-26 14:21       ` Steve Wise
       [not found]         ` <4C4D99F8.3090206-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  2010-07-26 19:22       ` Roland Dreier
  1 sibling, 1 reply; 30+ messages in thread
From: Steve Wise @ 2010-07-26 14:21 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Linux-RDMA

On 07/25/2010 01:54 PM, Bart Van Assche wrote:
> [ ... ]
>
> The only way I know of to prevent out-of-order completion processing
> with the current OFED verbs API is to protect the whole completion
> processing loop against concurrent execution with a spinlock. Maybe it
> should be considered to extend the verbs API such that it is possible
> to process completions in order without additional locking. Apparently
> API functions that allow this in a similar context have already been
> invented in the past -- see e.g. VipCQNotify() in the Virtual
> Interface Architecture Specification.
>
> Bart.
>

Hey Bart,

Is this the API to which you refer?

http://docsrv.sco.com/cgi-bin/man/man?VipCQNotify+3VI


I don't see how it provides the semantics you desire?


Steve.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <4C4D99F8.3090206-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]         ` <4C4D99F8.3090206-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-26 17:59           ` Bart Van Assche
  0 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2010-07-26 17:59 UTC (permalink / raw)
  To: Steve Wise; +Cc: Linux-RDMA

On Mon, Jul 26, 2010 at 4:21 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> On 07/25/2010 01:54 PM, Bart Van Assche wrote:
>>
>> [ ... ]
>>
>> The only way I know of to prevent out-of-order completion processing
>> with the current OFED verbs API is to protect the whole completion
>> processing loop against concurrent execution with a spinlock. Maybe it
>> should be considered to extend the verbs API such that it is possible
>> to process completions in order without additional locking. Apparently
>> API functions that allow this in a similar context have already been
>> invented in the past -- see e.g. VipCQNotify() in the Virtual
>> Interface Architecture Specification.
>
> Is this the API to which you refer?
>
> http://docsrv.sco.com/cgi-bin/man/man?VipCQNotify+3VI
>
> I don't see how it provides the semantics you desire?

The web page you refer to is owned by a company that is controversial
in the Linux world. A more neutral source is the book "The Virtual
Interface Architecture" (Intel Press, 2002) or the document "Virtual
Interface Architecture Specification" (1997, available online at
http://pllab.cs.nthu.edu.tw/cs5403/Readings/EJB/san_10.pdf). In both
documents it is described that VipCQNotify atomically either dequeues
a work completion or enables notifications. As far as I know, none of
the verb extensions defined by the IBTA allows both operations to be
performed atomically.
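
The "either dequeue or enable notifications" semantics described above
can be modelled as follows. This is a single-threaded sketch of the
semantics only: struct model_cq, poll_or_arm() and model_hca_complete()
are hypothetical names, not the VIPL or verbs API, and a real
implementation would have to make poll_or_arm() atomic with respect to
the producer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CQ_DEPTH 8

enum poll_or_arm_result { GOT_COMPLETION, NOTIFY_ARMED };

/* Hypothetical model of a CQ with a one-shot notification flag. */
struct model_cq {
    int wc[MODEL_CQ_DEPTH];
    size_t head, tail;
    bool armed;                 /* next completion raises a notification */
    unsigned int notifications; /* notifications raised so far */
};

/*
 * Consumer side: return a completion if one is queued, otherwise arm
 * the notification. The two outcomes are mutually exclusive, so the
 * CQ is never armed while the caller still holds unprocessed work.
 */
static enum poll_or_arm_result poll_or_arm(struct model_cq *cq, int *id)
{
    if (cq->head != cq->tail) {
        *id = cq->wc[cq->head++ % MODEL_CQ_DEPTH];
        return GOT_COMPLETION;
    }
    cq->armed = true;
    return NOTIFY_ARMED;
}

/* Producer side: queue a completion and notify only if armed. */
static void model_hca_complete(struct model_cq *cq, int id)
{
    assert(cq->tail - cq->head < MODEL_CQ_DEPTH);
    cq->wc[cq->tail++ % MODEL_CQ_DEPTH] = id;
    if (cq->armed) {
        cq->armed = false;      /* one-shot, like IB_CQ_NEXT_COMP */
        cq->notifications++;
    }
}
```

In this model a notification can only be raised after the consumer has
seen an empty queue, so a second handler is never started while the
first one still has a dequeued completion in hand.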

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: About a shortcoming of the verbs API
       [not found]     ` <AANLkTinHRnt-jvy0xBOAPUDGcfx6=V6rkRT3t0Ja52FP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-07-26 14:21       ` Steve Wise
@ 2010-07-26 19:22       ` Roland Dreier
       [not found]         ` <adamxtejbes.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Roland Dreier @ 2010-07-26 19:22 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Linux-RDMA

 > 2. Double completion processing loop
 > 
 > * Initialization:
 > ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 > 
 > * Notification handler:
 > 
 > struct ib_wc wc;
 > do {
 >     while (ib_poll_cq(cq, 1, &wc) > 0)
 >         /* process wc */
 > } while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
 > IB_CQ_REPORT_MISSED_EVENTS) > 0);

This approach can be used to have race-free in-order processing of
completions using a scheme such as the NAPI processing loop used by the
IPoIB driver (with help from the core networking stack).  Essentially a
completion notification just marks the completion processing routine as
runnable, and the networking core schedules that processing routine in a
single-threaded way until the CQ is drained.
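
A rough user-space model of that scheme, for illustration only: the
names struct poll_ctx, notify_mark_runnable() and poll_once() are made
up for this sketch and are not the NAPI or IPoIB API. The notification
callback only sets a "scheduled" flag; a single scheduler then calls the
poll routine until it reports that the queue is drained.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CQ polling context. */
struct poll_ctx {
    atomic_bool scheduled;  /* a poll pass is pending or running */
    int pending;            /* queued completions (model of the CQ) */
    int processed;          /* completions handled so far */
};

/*
 * Notification callback: do no processing, just mark the poll routine
 * runnable. Returns true if the caller has to queue the poll routine,
 * false if it was already scheduled.
 */
static bool notify_mark_runnable(struct poll_ctx *ctx)
{
    return !atomic_exchange(&ctx->scheduled, true);
}

/*
 * Poll routine, always called from the single scheduler. Handles at
 * most 'budget' completions and returns true if it must run again.
 */
static bool poll_once(struct poll_ctx *ctx, int budget)
{
    int done = 0;

    while (done < budget && ctx->pending > 0) {
        ctx->pending--;
        ctx->processed++;
        done++;
    }
    if (done == budget)
        return true;        /* budget used up: stay scheduled */

    /* Drained: let the next notification schedule us again. */
    atomic_store(&ctx->scheduled, false);

    /* Re-check for work that arrived before the flag was cleared. */
    if (ctx->pending > 0 && !atomic_exchange(&ctx->scheduled, true))
        return true;
    return false;
}
```

Because only the scheduler ever calls poll_once(), completions for one
CQ are handled by one thread of control at a time, whatever CPU the
notifications arrive on.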

Another approach is to just always run the completion processing for a
given CQ on a single CPU and avoid locking entirely.  If you want more
CPUs to spread the work, just use multiple CQs and multiple event vectors.

 > see e.g. VipCQNotify() in the Virtual Interface Architecture
 > Specification.

I don't know of an efficient way to implement this type of "atomic
dequeue completion or enable completions" with any existing hardware.
Do you have an idea how this could be done?

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <adamxtejbes.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]         ` <adamxtejbes.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-27  7:54           ` Or Gerlitz
       [not found]             ` <4C4E90B6.5070002-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
  2010-07-27  8:33           ` Bart Van Assche
  1 sibling, 1 reply; 30+ messages in thread
From: Or Gerlitz @ 2010-07-27  7:54 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Bart Van Assche, Linux-RDMA

Roland Dreier wrote:
>  > do {
>  >     while (ib_poll_cq(cq, 1, &wc) > 0)
>  >         /* process wc */
>  > } while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
>  > IB_CQ_REPORT_MISSED_EVENTS) > 0);

> This approach can be used to have race-free in-order processing of
> completions using a scheme such as the NAPI processing loop used by the
> IPoIB driver (with help from the core networking stack).

Roland, I wasn't sure whether, or how much, the results are buggy, but
the IPoIB poll loop doesn't check whether the return code of
ib_req_notify_cq is negative (error) or positive (more completions to
poll). Any thoughts on the matter?

	if (done < budget) {
		if (dev->features & NETIF_F_LRO)
			lro_flush_all(&priv->lro.lro_mgr);

		napi_complete(napi);
		if (unlikely(ib_req_notify_cq(priv->recv_cq,
					      IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS)) &&
		    napi_reschedule(napi))
			goto poll_more;
	}

	return done;
}

Or.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <4C4E90B6.5070002-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]             ` <4C4E90B6.5070002-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
@ 2010-07-28 17:44               ` Roland Dreier
       [not found]                 ` <ada1vanfqn1.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Roland Dreier @ 2010-07-28 17:44 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Bart Van Assche, Linux-RDMA

 > Roland, I wasn't sure whether, or how much, the results are buggy,
 > but the IPoIB poll loop doesn't check whether the return code of
 > ib_req_notify_cq is negative (error) or positive (more completions
 > to poll), any thoughts on the matter?

I think I had two things in mind here:

 - I don't know of any drivers that actually return any errors ever from
   req_notify_cq so it's not really a practical concern.

 - If we did get an error, there's not much we can do except keep
   polling and try to request notification again later -- exactly the
   same thing we would do if we got a positive return value.
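
To make the two points above concrete, here is a small sketch. The
helper names classify_notify_ret() and must_poll_again() are
hypothetical and exist only for this illustration; the sketch assumes
the usual convention that ib_req_notify_cq() returns a negative value on
error, zero on success, and a positive value when
IB_CQ_REPORT_MISSED_EVENTS was requested and completions may have been
missed.

```c
#include <stdbool.h>

/* Hypothetical classification of an ib_req_notify_cq() return value. */
enum notify_outcome {
    NOTIFY_ARMED,          /* 0: notification requested successfully */
    NOTIFY_MISSED_EVENTS,  /* > 0: completions may already be queued */
    NOTIFY_ERROR           /* < 0: the request itself failed */
};

static enum notify_outcome classify_notify_ret(int ret)
{
    if (ret < 0)
        return NOTIFY_ERROR;
    if (ret > 0)
        return NOTIFY_MISSED_EVENTS;
    return NOTIFY_ARMED;
}

/*
 * Both non-zero cases lead to the same action: keep polling and try
 * to request notification again later.
 */
static bool must_poll_again(int ret)
{
    return classify_notify_ret(ret) != NOTIFY_ARMED;
}
```

A caller that wants to notice a driver that keeps failing could log or
count the NOTIFY_ERROR case separately while still polling again.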

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <ada1vanfqn1.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                 ` <ada1vanfqn1.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-29  6:27                   ` Or Gerlitz
  0 siblings, 0 replies; 30+ messages in thread
From: Or Gerlitz @ 2010-07-29  6:27 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Bart Van Assche, Linux-RDMA

Roland Dreier wrote:
>  - If we did get an error, there's not much we can do except keep
>    polling and try to request notification again later -- exactly the
>    same thing we would do if we got a positive return value.
Basically, you're right. My only concern is the case where a hw driver
keeps returning an error from the notify verb and ipoib isn't aware of
it. But as no driver returns such an error, this can be left for now.

Or.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: About a shortcoming of the verbs API
       [not found]         ` <adamxtejbes.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  2010-07-27  7:54           ` Or Gerlitz
@ 2010-07-27  8:33           ` Bart Van Assche
       [not found]             ` <AANLkTinYuyCqJ6_wq6GH0vQGAY-mwC=7ZLicBnXO+efB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2010-07-27  8:33 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux-RDMA

On Mon, Jul 26, 2010 at 9:22 PM, Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
> [ ... ]
>
> Another approach is to just always run the completion processing for a
> given CQ on a single CPU and avoid locking entirely.  If you want more
> CPUs to spread the work, just use multiple CQs and multiple event vectors.

In the applications I'm familiar with InfiniBand is being used not
only because of its low latency but also because of its high
throughput. In order to handle such loads efficiently, interrupts have
to be spread over multiple CPUs.

Switching from a single receive queue to multiple receive queues is an
interesting alternative, but is not possible without changing the
communication protocol between client and server. Changing the
communication protocol is not always possible, especially when the
communication protocol has been defined by a standards organization.

>  > see e.g. VipCQNotify() in the Virtual Interface Architecture
>  > Specification.
>
> I don't know of an efficient way to implement this type of "atomic
> dequeue completion or enable completions" with any existing hardware.
> Do you have an idea how this could be done?

I am not an expert with regard to HCA programming. But I assume the
above should refer to "reprogrammable firmware" instead of "hardware"?

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <AANLkTinYuyCqJ6_wq6GH0vQGAY-mwC=7ZLicBnXO+efB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]             ` <AANLkTinYuyCqJ6_wq6GH0vQGAY-mwC=7ZLicBnXO+efB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-27 16:50               ` Roland Dreier
       [not found]                 ` <adafwz4g98j.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Roland Dreier @ 2010-07-27 16:50 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Linux-RDMA

 > In the applications I'm familiar with InfiniBand is being used not
 > only because of its low latency but also because of its high
 > throughput.

Yes, I seem to recall hearing that people care about throughput as well.

 > In order to handle such loads efficiently, interrupts have to be
 > spread over multiple CPUs.

Let's look at what you say you want here:

 - strict in-order processing of completions
 - work spread across multiple CPUs

Do you see that the two goals are contradictory?  If you are running
work on multiple CPUs in parallel, then there can't be an order assumed
between CPUs -- otherwise you serialize the processing and lose all the
benefit of parallelism.

 > Switching from a single receive queue to multiple receive queues is an
 > interesting alternative, but is not possible without changing the
 > communication protocol between client and server. Changing the
 > communication protocol is not always possible, especially when the
 > communication protocol has been defined by a standards organization.

If you only have a single client talking to a single server over a
single connection, then yes the opportunities for parallelism are
limited.

By the way, looking at VipCQNotify further, I'm not sure I follow
exactly the race you're worried about.  If you're willing to do your
processing from the completion notification callback (which seems to be
the approach that VipCQNotify forces), then doesn't the following (from
Documentation/infiniband/core_locking.txt):

  The low-level driver is responsible for ensuring that multiple
  completion event handlers for the same CQ are not called
  simultaneously.  The driver must guarantee that only one CQ event
  handler for a given CQ is running at a time.  In other words, the
  following situation is not allowed:

        CPU1                                    CPU2

  low-level driver ->
    consumer CQ event callback:
      /* ... */
      ib_req_notify_cq(cq, ...);
                                        low-level driver ->
      /* ... */                           consumer CQ event callback:
                                            /* ... */
      return from CQ event handler

mean that the problem you are complaining about doesn't actually exist?

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <adafwz4g98j.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                 ` <adafwz4g98j.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-27 18:03                   ` Bart Van Assche
       [not found]                     ` <AANLkTimAk0k-q1EKjaXOadoXvKXbEN9nAky0w1rjixxB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2010-07-27 18:03 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux-RDMA

On Tue, Jul 27, 2010 at 6:50 PM, Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
> [ ... ]
>
> From Documentation/infiniband/core_locking.txt:
>
>  The low-level driver is responsible for ensuring that multiple
>  completion event handlers for the same CQ are not called
>  simultaneously.  The driver must guarantee that only one CQ event
>  handler for a given CQ is running at a time.  In other words, the
>  following situation is not allowed:
>
>        CPU1                                    CPU2
>
>  low-level driver ->
>    consumer CQ event callback:
>      /* ... */
>      ib_req_notify_cq(cq, ...);
>                                        low-level driver ->
>      /* ... */                           consumer CQ event callback:
>                                            /* ... */
>      return from CQ event handler
>
> mean that the problem you are complaining about doesn't actually exist?

As far as I know it is not possible for a HCA to tell whether or not a
CPU has finished executing the interrupt it triggered. So it is not
possible for the HCA to implement the above requirement by delaying
the generation of a new interrupt -- implementing the above
requirement is only possible in the low-level driver. A low-level
driver could e.g. postpone notification reenabling until the end of
the interrupt handler or it could use a spinlock to prevent
simultaneous execution of notification handlers. I have inspected the
source code of one particular low-level driver but could not find any
such provisions. Did I overlook something ?

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <AANLkTimAk0k-q1EKjaXOadoXvKXbEN9nAky0w1rjixxB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                     ` <AANLkTimAk0k-q1EKjaXOadoXvKXbEN9nAky0w1rjixxB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-27 18:20                       ` Jason Gunthorpe
       [not found]                         ` <20100727182046.GT7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2010-07-27 18:20 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Roland Dreier, Linux-RDMA

On Tue, Jul 27, 2010 at 08:03:25PM +0200, Bart Van Assche wrote:

> As far as I know it is not possible for a HCA to tell whether or not a
> CPU has finished executing the interrupt it triggered. So it is not
> possible for the HCA to implement the above requirement by delaying
> the generation of a new interrupt -- implementing the above

Linux does not allow interrupts to re-enter. Read through
handle_edge_irq() in kernel/irq/chip.c to get a sense of how that is
done for MSI. It looked to me like all the CQ callbacks flow from the
interrupt handler in mlx4?

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <20100727182046.GT7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                         ` <20100727182046.GT7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-27 19:28                           ` Bart Van Assche
       [not found]                             ` <AANLkTimAS6znoCCw33ipVV-W-e1BJS93Fxzp-oe0jO4u-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-08-07  7:56                           ` Bart Van Assche
  1 sibling, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2010-07-27 19:28 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Roland Dreier, Linux-RDMA

On Tue, Jul 27, 2010 at 8:20 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>
> On Tue, Jul 27, 2010 at 08:03:25PM +0200, Bart Van Assche wrote:
>
> > As far as I know it is not possible for a HCA to tell whether or not a
> > CPU has finished executing the interrupt it triggered. So it is not
> > possible for the HCA to implement the above requirement by delaying
> > the generation of a new interrupt -- implementing the above
>
> Linux does not allow interrupts to re-enter.. Read through
> kernel/irq/chip.c handle_edge_irq to get a sense of how that is done
> for MSI. Looked to me like all the CQ call backs flowed from the
> interrupt handler in mlx4?

Thanks for the feedback -- I had tried to look up that information but
hadn't found it yet.

I have two more questions:
- Some time ago I observed that the kernel reported soft lockups
because of spin_lock() calls inside a completion handler. These
spinlocks were not locked in any other context than the completion
handler itself. And the lockups disappeared after having replaced the
spin_lock() calls by spin_lock_irqsave(). Can it be concluded from
this observation that completion handlers are not always invoked from
interrupt context ?
- The function handle_edge_irq() in kernel/irq/chip.c invokes the
actual interrupt handler while the spinlock desc->lock is not locked.
Does that mean that a completion interrupt can get lost due to the
following (unlikely) event order ?
* completion interrupt is triggered on CPU 1.
* handle_edge_irq() sets the IRQ_INPROGRESS flag and invokes the
completion handler on CPU 1.
* The completion handler reenables the completion interrupt via
ib_req_notify_cq().
* Before the IRQ_INPROGRESS flag is cleared by handle_edge_irq(), a
new completion interrupt is triggered on CPU 2.
* handle_edge_irq() is invoked on CPU 2 and exits immediately because
IRQ_INPROGRESS is still set.
* handle_edge_irq() clears IRQ_INPROGRESS on CPU 1.

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <AANLkTimAS6znoCCw33ipVV-W-e1BJS93Fxzp-oe0jO4u-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                             ` <AANLkTimAS6znoCCw33ipVV-W-e1BJS93Fxzp-oe0jO4u-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-27 20:15                               ` Jason Gunthorpe
  2010-07-28 17:42                               ` Roland Dreier
  1 sibling, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2010-07-27 20:15 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Roland Dreier, Linux-RDMA

On Tue, Jul 27, 2010 at 09:28:54PM +0200, Bart Van Assche wrote:

> I have two more questions: - Some time ago I observed that the
> kernel reported soft lockups because of spin_lock() calls inside a
> completion handler. These spinlocks were not locked in any other
> context than the completion handler itself. And the lockups
> disappeared after having replaced the spin_lock() calls by
> spin_lock_irqsave(). Can it be concluded from this observation that
> completion handlers are not always invoked from interrupt context ?

I don't know. It wouldn't surprise me if there were some error paths
that called completion handlers outside an IRQ context, but as Roland
pointed out, the API guarantee is that this never happens in parallel
with calls made from the interrupt handler.

> - The function handle_edge_irq() in kernel/irq/chip.c invokes the
> actual interrupt handler while the spinlock desc->lock is not
> locked.  Does that mean that a completion interrupt can get lost due
> to the

It holds desc->lock while manipulating the flags, so IRQ_PENDING will
be set by CPU 2, and CPU 1 will notice it and re-invoke the handler
once it re-locks desc.
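
A simplified, single-threaded model of that flag handling, as I
understand it from reading the code. This is a sketch, not the kernel
function: struct model_irq_desc and both functions are hypothetical, the
real handle_edge_irq() also deals with masking, acking and disabled
interrupts, and the second CPU is simulated by re-entering the flow
handler from inside the device handler.

```c
#include <stdbool.h>

/* Hypothetical model of the relevant parts of an IRQ descriptor. */
struct model_irq_desc {
    bool in_progress;           /* models IRQ_INPROGRESS */
    bool pending;               /* models IRQ_PENDING */
    int handler_runs;           /* times the device handler ran */
    int inject_during_handler;  /* test hook: interrupts raised meanwhile */
};

static void model_handle_edge_irq(struct model_irq_desc *desc);

/* Device handler; may see another interrupt arrive while it runs. */
static void model_handler(struct model_irq_desc *desc)
{
    desc->handler_runs++;
    if (desc->inject_during_handler > 0) {
        desc->inject_during_handler--;
        model_handle_edge_irq(desc);    /* "CPU 2" takes the interrupt */
    }
}

static void model_handle_edge_irq(struct model_irq_desc *desc)
{
    /* The real code holds desc->lock while touching the flags. */
    if (desc->in_progress) {
        desc->pending = true;   /* remember it; do not run the handler */
        return;
    }
    desc->in_progress = true;
    do {
        desc->pending = false;
        /* The lock is dropped while the handler runs. */
        model_handler(desc);
        /* The lock is re-taken and the pending flag is re-checked. */
    } while (desc->pending);
    desc->in_progress = false;
}
```

With one interrupt injected while the handler runs, the handler is
executed twice on the first CPU and never concurrently, so the second
interrupt is deferred rather than lost.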

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: About a shortcoming of the verbs API
       [not found]                             ` <AANLkTimAS6znoCCw33ipVV-W-e1BJS93Fxzp-oe0jO4u-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-07-27 20:15                               ` Jason Gunthorpe
@ 2010-07-28 17:42                               ` Roland Dreier
       [not found]                                 ` <ada62zzfqpj.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Roland Dreier @ 2010-07-28 17:42 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jason Gunthorpe, Linux-RDMA

 > - Some time ago I observed that the kernel reported soft lockups
 > because of spin_lock() calls inside a completion handler. These
 > spinlocks were not locked in any other context than the completion
 > handler itself. And the lockups disappeared after having replaced the
 > spin_lock() calls by spin_lock_irqsave(). Can it be concluded from
 > this observation that completion handlers are not always invoked from
 > interrupt context ?

Did you get a soft lockup report or a lockdep report?  Anyway, the very
next paragraph of the documentation I quoted says:

  The context in which completion event and asynchronous event
  callbacks run is not defined.  Depending on the low-level driver, it
  may be process context, softirq context, or interrupt context.
  Upper level protocol consumers may not sleep in a callback.

So yes, it is possible that a completion callback gets called in
non-interrupt context.

However as far as I know, at least mthca and mlx4 only call completion
callbacks from the interrupt handler.  But without the actual code in
question it's hard to know what the real problem was.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <ada62zzfqpj.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                 ` <ada62zzfqpj.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-28 17:51                                   ` Ralph Campbell
       [not found]                                     ` <1280339513.31421.264.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Ralph Campbell @ 2010-07-28 17:51 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Bart Van Assche, Jason Gunthorpe, Linux-RDMA

Actually, I tried to implement the completion callback
in a workqueue thread but ipoib_cm_handle_tx_wc() calls
netif_tx_lock() which isn't safe unless it is called
from an IRQ handler or netif_tx_lock_bh() is called first.

On Wed, 2010-07-28 at 10:42 -0700, Roland Dreier wrote:
> > - Some time ago I observed that the kernel reported soft lockups
>  > because of spin_lock() calls inside a completion handler. These
>  > spinlocks were not locked in any other context than the completion
>  > handler itself. And the lockups disappeared after having replaced the
>  > spin_lock() calls by spin_lock_irqsave(). Can it be concluded from
>  > this observation that completion handlers are not always invoked from
>  > interrupt context ?
> 
> Did you get a soft lockup report or a lockdep report?  Anyway, the very
> next paragraph of the documentation I quoted says:
> 
>   The context in which completion event and asynchronous event
>   callbacks run is not defined.  Depending on the low-level driver, it
>   may be process context, softirq context, or interrupt context.
>   Upper level protocol consumers may not sleep in a callback.
> 
> So yes, it is possible that a completion callback gets called in
> non-interrupt context.
> 
> However as far as I know, at least mthca and mlx4 only call completion
> callbacks from the interrupt handler.  But without the actual code in
> question it's hard to know what the real problem was.
> 
>  - R.



^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <1280339513.31421.264.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                     ` <1280339513.31421.264.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-07-28 18:01                                       ` Bart Van Assche
  2010-07-28 18:05                                       ` Roland Dreier
  1 sibling, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2010-07-28 18:01 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: Roland Dreier, Jason Gunthorpe, Linux-RDMA

On Wed, Jul 28, 2010 at 7:51 PM, Ralph Campbell
<ralph.campbell-h88ZbnxC6KDQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, 2010-07-28 at 10:42 -0700, Roland Dreier wrote:
>> > - Some time ago I observed that the kernel reported soft lockups
>>  > because of spin_lock() calls inside a completion handler. These
>>  > spinlocks were not locked in any other context than the completion
>>  > handler itself. And the lockups disappeared after having replaced the
>>  > spin_lock() calls by spin_lock_irqsave(). Can it be concluded from
>>  > this observation that completion handlers are not always invoked from
>>  > interrupt context ?
>>
>> Did you get a soft lockup report or a lockdep report?  Anyway, the very
>> next paragraph of the documentation I quoted says:
>>
>>   The context in which completion event and asynchronous event
>>   callbacks run is not defined.  Depending on the low-level driver, it
>>   may be process context, softirq context, or interrupt context.
>>   Upper level protocol consumers may not sleep in a callback.
>>
>> So yes, it is possible that a completion callback gets called in
>> non-interrupt context.
>>
>> However as far as I know, at least mthca and mlx4 only call completion
>> callbacks from the interrupt handler.  But without the actual code in
>> question it's hard to know what the real problem was.
>
> Actually, I tried to implement the completion callback
> in a workqueue thread but ipoib_cm_handle_tx_wc() calls
> netif_tx_lock() which isn't safe unless it is called
> from an IRQ handler or netif_tx_lock_bh() is called first.

Has anyone already tried to enable threaded IRQ mode for IB completion
interrupts ? Threaded IRQs are one of the cornerstones of the
real-time Linux patch for reducing interrupt latency.

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: About a shortcoming of the verbs API
       [not found]                                     ` <1280339513.31421.264.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  2010-07-28 18:01                                       ` Bart Van Assche
@ 2010-07-28 18:05                                       ` Roland Dreier
       [not found]                                         ` <adask33eb36.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Roland Dreier @ 2010-07-28 18:05 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: Bart Van Assche, Jason Gunthorpe, Linux-RDMA

 > Actually, I tried to implement the completion callback
 > in a workqueue thread but ipoib_cm_handle_tx_wc() calls
 > netif_tx_lock() which isn't safe unless it is called
 > from an IRQ handler or netif_tx_lock_bh() is called first.

Oh, sounds like a bug in IPoIB.  I guess we could fix it by just
changing it to netif_tx_lock_bh()?  (Or is that not safe from an IRQ handler?)
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <adask33eb36.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                         ` <adask33eb36.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-28 18:11                                           ` Ralph Campbell
  2010-07-28 18:16                                           ` Roland Dreier
  1 sibling, 0 replies; 30+ messages in thread
From: Ralph Campbell @ 2010-07-28 18:11 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Bart Van Assche, Jason Gunthorpe, Linux-RDMA

On Wed, 2010-07-28 at 11:05 -0700, Roland Dreier wrote:
> > Actually, I tried to implement the completion callback
>  > in a workqueue thread but ipoib_cm_handle_tx_wc() calls
>  > netif_tx_lock() which isn't safe unless it is called
>  > from an IRQ handler or netif_tx_lock_bh() is called first.
> 
> Oh, sounds like a bug in IPoIB.  I guess we could fix it by just
> changing it to netif_tx_lock_bh()?  (Or is that not safe from an IRQ handler?)

netif_tx_lock_bh() is an inline function for
	local_bh_disable();
        netif_tx_lock();

so I meant to say local_bh_disable(), not netif_tx_lock_bh().

Basically, we would need a "irqsave" version of netif_tx_lock()
so that it could be called from either IRQ or non-IRQ context
and save/restore the prior state.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: About a shortcoming of the verbs API
       [not found]                                         ` <adask33eb36.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  2010-07-28 18:11                                           ` Ralph Campbell
@ 2010-07-28 18:16                                           ` Roland Dreier
       [not found]                                             ` <adaocdreal0.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Roland Dreier @ 2010-07-28 18:16 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: Bart Van Assche, Jason Gunthorpe, Linux-RDMA

 >  > Actually, I tried to implement the completion callback
 >  > in a workqueue thread but ipoib_cm_handle_tx_wc() calls
 >  > netif_tx_lock() which isn't safe unless it is called
 >  > from an IRQ handler or netif_tx_lock_bh() is called first.

 > Oh, sounds like a bug in IPoIB.  I guess we could fix it by just
 > changing it to netif_tx_lock_bh()?  (Or is that not safe from an IRQ handler?)

Wait, is this still a problem with IPoIB?  As far as I can tell, the
IPoIB completion handlers don't do anything except enable the NAPI poll
routine or the transmit ring timer (ie they just do napi_schedule() or
mod_timer()), so the context that the CQ callback is called in doesn't
matter.  In particular I don't see any way ipoib_cm_handle_tx_wc() could
be reached except from the NAPI polling loop.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <adaocdreal0.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                             ` <adaocdreal0.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-28 19:05                                               ` Ralph Campbell
  0 siblings, 0 replies; 30+ messages in thread
From: Ralph Campbell @ 2010-07-28 19:05 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Bart Van Assche, Jason Gunthorpe, Linux-RDMA

On Wed, 2010-07-28 at 11:16 -0700, Roland Dreier wrote:
> >  > Actually, I tried to implement the completion callback
>  >  > in a workqueue thread but ipoib_cm_handle_tx_wc() calls
>  >  > netif_tx_lock() which isn't safe unless it is called
>  >  > from an IRQ handler or netif_tx_lock_bh() is called first.
> 
>  > Oh, sounds like a bug in IPoIB.  I guess we could fix it by just
>  > changing it to netif_tx_lock_bh()?  (Or is that not safe from an IRQ handler?)
> 
> Wait, is this still a problem with IPoIB?  As far as I can tell, the
> IPoIB completion handlers don't do anything except enable the NAPI poll
> routine or the transmit ring timer (ie they just do napi_schedule() or
> mod_timer()), so the context that the CQ callback is called in doesn't
> matter.  In particular I don't see any way ipoib_cm_handle_tx_wc() could
> be reached except from the NAPI polling loop.
> 
>  - R.

I don't remember now whether I hit the problem in a backported IPoIB
or in a recent kernel but I did need to single thread and call
local_bh_disable() for completion callbacks or I would get deadlocks.
I just assumed that ULPs were being written with that as a requirement.

This is what makes understanding the "locking conventions" for
IPoIB really complex. Sometimes you need a lock and sometimes
you don't depending on the state of the network stack.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: About a shortcoming of the verbs API
       [not found]                         ` <20100727182046.GT7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2010-07-27 19:28                           ` Bart Van Assche
@ 2010-08-07  7:56                           ` Bart Van Assche
       [not found]                             ` <AANLkTimc3iS8=8ZQ9u8tOLP4-q_e+o0=AncZj-Mbre2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2010-08-07  7:56 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Roland Dreier, Linux-RDMA

On Tue, Jul 27, 2010 at 8:20 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Tue, Jul 27, 2010 at 08:03:25PM +0200, Bart Van Assche wrote:
>
>> As far as I know it is not possible for a HCA to tell whether or not a
>> CPU has finished executing the interrupt it triggered. So it is not
>> possible for the HCA to implement the above requirement by delaying
>> the generation of a new interrupt -- implementing the above
>
> Linux does not allow interrupts to re-enter.. Read through
> kernel/irq/chip.c handle_edge_irq to get a sense of how that is done
> for MSI. Looked to me like all the CQ call backs flowed from the
> interrupt handler in mlx4?

The above implies that one must be careful when applying a common
Linux practice, that is to defer interrupt handling from IRQ context
to tasklet context. Since tasklets are executed with interrupts
enabled, invoking ib_req_notify_cq(cq, IB_CQ_NEXT_COMP) from tasklet
context may cause concurrent execution of an IB IRQ with an IB
tasklet. So if ib_poll_cq() is invoked from tasklet context, the
entire polling loop has to be protected against concurrent execution.
As far as I know such protection against concurrent execution is not
necessary inside tasklets that handle other types of hardware.

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <AANLkTimc3iS8=8ZQ9u8tOLP4-q_e+o0=AncZj-Mbre2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                             ` <AANLkTimc3iS8=8ZQ9u8tOLP4-q_e+o0=AncZj-Mbre2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-08-07 16:32                               ` Roland Dreier
       [not found]                                 ` <adavd7mz8m9.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  2010-08-08  1:38                               ` Jason Gunthorpe
  1 sibling, 1 reply; 30+ messages in thread
From: Roland Dreier @ 2010-08-07 16:32 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jason Gunthorpe, Linux-RDMA

 > The above implies that one must be careful when applying a common
 > Linux practice, that is to defer interrupt handling from IRQ context
 > to tasklet context. Since tasklets are executed with interrupts
 > enabled, invoking ib_req_notify_cq(cq, IB_CQ_NEXT_COMP) from tasklet
 > context may cause concurrent execution of an IB IRQ with an IB
 > tasklet. So if ib_poll_cq() is invoked from tasklet context, the
 > entire polling loop has to be protected against concurrent execution.
 > As far as I know such protection against concurrent execution is not
 > necessary inside tasklets that handle other types of hardware.

Not sure that I follow the problem you're worried about.  A given
tasklet can only be running on one CPU at any one time -- if an
interrupt occurs and reschedules the tasklet then it just runs again
when it exits.

Also I'm not sure I understand why this is special for IB hardware --
standard practice is for interrupt handlers to clear the interrupt
source and reenable interrupts, so I don't see why the same thing you
describe can't happen with any interrupt-generating device that defers
work to a tasklet.
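
A minimal single-threaded sketch of the tasklet behaviour described in
the first paragraph. The names are hypothetical; the two flags stand
in for what the kernel tracks with the TASKLET_STATE_SCHED and
TASKLET_STATE_RUN bits, and "interrupts" are simulated by scheduling
the tasklet from inside its own function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one tasklet; not the kernel's tasklet_struct. */
struct tasklet_model {
	bool scheduled;			/* a run has been requested */
	bool running;			/* the function is executing right now */
	unsigned int runs;		/* completed invocations */
	unsigned int irqs_during_run;	/* interrupts arriving mid-run */
};

static void model_tasklet_schedule(struct tasklet_model *t)
{
	t->scheduled = true;	/* scheduling twice has the same effect as once */
}

/* Tasklet body; interrupts that fire while it runs reschedule it. */
static void model_tasklet_func(struct tasklet_model *t)
{
	assert(t->running);
	t->runs++;
	while (t->irqs_during_run > 0) {
		t->irqs_during_run--;
		model_tasklet_schedule(t);
	}
}

/* Softirq side: run the tasklet until no further run is requested. */
static void model_run_softirq(struct tasklet_model *t)
{
	while (t->scheduled) {
		assert(!t->running);	/* never nested, never concurrent */
		t->scheduled = false;
		t->running = true;
		model_tasklet_func(t);
		t->running = false;
	}
}
```

Three interrupts during the first run lead to exactly one further run
after the first one has exited, which is the "it just runs again"
behaviour.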

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <adavd7mz8m9.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                 ` <adavd7mz8m9.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-08-08 18:19                                   ` Bart Van Assche
       [not found]                                     ` <AANLkTinKsLNoia96AVDA6fP9Es5_2Rq_wTgY=z6wk_FE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2010-08-08 18:19 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Jason Gunthorpe, Linux-RDMA

On Sat, Aug 7, 2010 at 6:32 PM, Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
> Not sure that I follow the problem you're worried about.  A given
> tasklet can only be running on one CPU at any one time -- if an
> interrupt occurs and reschedules the tasklet then it just runs again
> when it exits.
>
> Also I'm not sure I understand why this is special for IB hardware --
> standard practice is for interrupt handlers to clear the interrupt
> source and reenable interrupts, so I don't see why the same thing you
> describe can't happen with any interrupt-generating device that defers
> work to a tasklet.

One of the applications I have been looking at is adding blk-iopoll
support to ib_srp and making it possible to enable and disable
blk-iopoll at runtime via sysfs. A naive implementation could look
e.g. as follows:

/* Poll the IB receive queue once. Returns zero if this function
should be called again. */
static int srp_recv_poll_once(struct ib_cq *cq, struct srp_target_port *target)
{
	struct ib_wc wc;

	do {
		if (ib_poll_cq(cq, 1, &wc) > 0) {
			if (wc.status == IB_WC_SUCCESS) {
				srp_handle_recv(target, &wc);
				return 0;
			} else {
				shost_printk(KERN_ERR, target->scsi_host,
					     PFX "failed receive status %d\n",
					     wc.status);
				target->qp_in_error = 1;
			}
		}
	} while (ib_req_notify_cq(target->recv_cq, IB_CQ_NEXT_COMP
				  | IB_CQ_REPORT_MISSED_EVENTS) > 0);
	return -1;
}

/* blk-iopoll callback function, which is invoked on tasklet context. */
static int srp_iopoll(struct blk_iopoll *iop, int budget)
{
	struct srp_target_port *target;
	int processed;

	target = container_of(iop, struct srp_target_port, iopoll);
	for (processed = 0; processed < budget; processed++) {
		if (srp_recv_poll_once(target->recv_cq, target) != 0) {
			blk_iopoll_complete(iop);
			break;
		}
	}
	return processed;
}

/* receive completion queue notification callback function, which is
typically invoked on IRQ context. */
static void srp_recv_completion(struct ib_cq *cq, void *target_ptr)
{
	struct srp_target_port *target = target_ptr;

	if (target->iopoll_enabled
	    && blk_iopoll_sched_prep(&target->iopoll) == 0)
		blk_iopoll_sched(&target->iopoll);
	else
		while (srp_recv_poll_once(cq, target) == 0)
			;
}

/* sysfs callback function that shows the current value of
target->iopoll_enabled */

/* sysfs callback function that allows setting the value of
target->iopoll_enabled */

As far as I can see with the above implementation enabling blk-iopoll
mode would be fine, but disabling not, because while disabling
blk-iopoll mode it could e.g. happen that srp_iopoll() is still
polling the receive completion queue on one CPU in tasklet context and
srp_recv_completion() starts polling the receive queue simultaneously
on another CPU in IRQ context. This can happen independently of
whether loop (1) or loop (2) is used to poll the IB receive completion
queue (see also http://www.spinics.net/lists/linux-rdma/msg05003.html
for the definitions of loop (1) and (2)). Although it is possible to
wait for completion of a tasklet by calling tasklet_disable(), I don't
think it is safe to call this function from IRQ context because that
might trigger a deadlock.
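
To make the kind of extra serialization this implies concrete, here is
a user-space sketch with C11 atomics. It is not taken from ib_srp and
all names are hypothetical: a single ownership flag decides which
context may run the polling loop, and a context that finds the flag
set backs off instead of polling concurrently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a completion queue with a single-poller flag. */
struct cq_model {
	atomic_bool polling;	/* true while some context owns the loop */
	int pending;		/* completions waiting in the queue */
	int processed;		/* completions handled so far */
};

static void cq_model_init(struct cq_model *cq, int pending)
{
	assert(pending >= 0);
	atomic_init(&cq->polling, false);
	cq->pending = pending;
	cq->processed = 0;
}

/* Returns true if the caller became the only poller. */
static bool cq_try_begin_poll(struct cq_model *cq)
{
	return !atomic_exchange(&cq->polling, true);
}

static void cq_end_poll(struct cq_model *cq)
{
	atomic_store(&cq->polling, false);
}

/* Drain the queue; returns -1 if another context is already polling. */
static int cq_poll_exclusive(struct cq_model *cq)
{
	int n = 0;

	if (!cq_try_begin_poll(cq))
		return -1;
	while (cq->pending > 0) {
		cq->pending--;
		cq->processed++;
		n++;
	}
	cq_end_poll(cq);
	return n;
}
```

A real implementation would also have to make sure that the context
that backs off is not the only one that knows about new completions,
e.g. by having the owner re-check the queue before giving up
ownership. The sketch leaves that out.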

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <AANLkTinKsLNoia96AVDA6fP9Es5_2Rq_wTgY=z6wk_FE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                     ` <AANLkTinKsLNoia96AVDA6fP9Es5_2Rq_wTgY=z6wk_FE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-08-09 14:49                                       ` David Dillow
       [not found]                                         ` <1281365347.4968.5.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: David Dillow @ 2010-08-09 14:49 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Roland Dreier, Jason Gunthorpe, Linux-RDMA

On Sun, 2010-08-08 at 20:19 +0200, Bart Van Assche wrote:
> On Sat, Aug 7, 2010 at 6:32 PM, Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
> > Not sure that I follow the problem you're worried about.  A given
> > tasklet can only be running on one CPU at any one time -- if an
> > interrupt occurs and reschedules the tasklet then it just runs again
> > when it exits.
> >
> > Also I'm not sure I understand why this is special for IB hardware --
> > standard practice is for interrupt handlers to clear the interrupt
> > source and reenable interrupts, so I don't see why the same thing you
> > describe can't happen with any interrupt-generating device that defers
> > work to a tasklet.
> 
> One of the applications I have been looking at is adding blk-iopoll
> support in ib_srp and to make it possible to enable and disable
> blk-iopoll at runtime via sysfs. A naive implementation could look
> e.g. as follows:

I'm not sure it makes sense to enable/disable this at runtime -- we
don't do it for NAPI, why do it for block devices? I'm not even sure I'd
want to see a config option for it in kbuild -- that was done during the
transition to NAPI and it lingered forever for some drivers. I'd rather
we got it correct, and not give people yet another knob to figure out.

I can certainly see a use case for testing the patch's performance,
though.

Dave

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <1281365347.4968.5.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                         ` <1281365347.4968.5.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
@ 2010-08-09 18:45                                           ` Vladislav Bolkhovitin
       [not found]                                             ` <4C604CB4.5060705-d+Crzxg7Rs0@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-09 18:45 UTC (permalink / raw)
  To: David Dillow; +Cc: Bart Van Assche, Roland Dreier, Jason Gunthorpe, Linux-RDMA

David Dillow, on 08/09/2010 06:49 PM wrote:
> On Sun, 2010-08-08 at 20:19 +0200, Bart Van Assche wrote:
>> On Sat, Aug 7, 2010 at 6:32 PM, Roland Dreier<rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>  wrote:
>>> Not sure that I follow the problem you're worried about.  A given
>>> tasklet can only be running on one CPU at any one time -- if an
>>> interrupt occurs and reschedules the tasklet then it just runs again
>>> when it exits.
>>>
>>> Also I'm not sure I understand why this is special for IB hardware --
>>> standard practice is for interrupt handlers to clear the interrupt
>>> source and reenable interrupts, so I don't see why the same thing you
>>> describe can't happen with any interrupt-generating device that defers
>>> work to a tasklet.
>>
>> One of the applications I have been looking at is adding blk-iopoll
>> support in ib_srp and to make it possible to enable and disable
>> blk-iopoll at runtime via sysfs. A naive implementation could look
>> e.g. as follows:
>
> I'm not sure it makes sense to enable/disable this at runtime -- we
> don't do it for NAPI, why do it for block devices? I'm not even sure I'd
> want to see a config option for it in kbuild -- that was done during the
> transition to NAPI and it lingered forever for some drivers. I'd rather
> we got it correct, and not give people yet another knob to figure out.
>
> I can certainly see a use case for testing the patch's performance,
> though.

For testing, this can be done with an #ifdef local to the corresponding file.

Vlad

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <4C604CB4.5060705-d+Crzxg7Rs0@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                             ` <4C604CB4.5060705-d+Crzxg7Rs0@public.gmane.org>
@ 2010-08-09 18:58                                               ` David Dillow
  0 siblings, 0 replies; 30+ messages in thread
From: David Dillow @ 2010-08-09 18:58 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Bart Van Assche, Roland Dreier, Jason Gunthorpe, Linux-RDMA

On Mon, 2010-08-09 at 22:45 +0400, Vladislav Bolkhovitin wrote:
> David Dillow, on 08/09/2010 06:49 PM wrote:
> > I'm not sure it makes sense to enable/disable this at runtime -- we
> > don't do it for NAPI, why do it for block devices? I'm not even sure I'd
> > want to see a config option for it in kbuild -- that was done during the
> > transition to NAPI and it lingered forever for some drivers. I'd rather
> > we got it correct, and not give people yet another knob to figure out.
> >
> > I can certainly see a use case for testing the patch's performance,
> > though.
> 
> For the testing it can be done as a local to the corresponding file #ifdef.

Yes, I was suggesting something that would stay local to Bart's tree and
wouldn't be in the upstream submission. Having it accessible from
userspace without a rebuild may be more convenient for him to do initial
performance testing, but it would be better to do the final tests on the
actual code submitted.

Dave

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: About a shortcoming of the verbs API
       [not found]                             ` <AANLkTimc3iS8=8ZQ9u8tOLP4-q_e+o0=AncZj-Mbre2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-08-07 16:32                               ` Roland Dreier
@ 2010-08-08  1:38                               ` Jason Gunthorpe
       [not found]                                 ` <20100808013822.GA15146-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2010-08-08  1:38 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Roland Dreier, Linux-RDMA

On Sat, Aug 07, 2010 at 09:56:13AM +0200, Bart Van Assche wrote:

> The above implies that one must be careful when applying a common
> Linux practice, that is to defer interrupt handling from IRQ context
> to tasklet context. Since tasklets are executed with interrupts
> enabled, invoking ib_req_notify_cq(cq, IB_CQ_NEXT_COMP) from tasklet
> context may cause concurrent execution of an IB IRQ with an IB
> tasklet. So if ib_poll_cq() is invoked from tasklet context, the
> entire polling loop has to be protected against concurrent execution.
> As far as I know such protection against concurrent execution is not
> necessary inside tasklets that handle other types of hardware.

No, all hardware pretty much works like this. The general flow is:

IRQ happens
 (if level triggered 'ack' the IRQ to the HW, to suppress the level)
SW processes
SW 'does something' to the HW to cause new IRQs to happen
IRQ happens.. repeat..

In the IB case, you get exactly one completion call back and when you
call ib_req_notify_cq/etc then you might get more. There is implicit
locking provided by the HW, in that it does not generate more IRQs
until explicitly told to.

Pretty much every bit of HW works the same way, IRQs stop until the
SW 'does something' to indicate it wants more. If this something is
the very last thing in a tasklet then everything is OK.

A trivial example for ethernet hardware would be something like this..

IRQ happens whenever packet_buffer_wr != packet_buffer_rd
  When HW sends an IRQ it sets a HW bit to disable further IRQ
  messages
SW processes
SW writes packet_buffer_rd to HW
 HW flips off its disable further IRQ bit
IRQ happens whenever packet_buffer_wr != packet_buffer_rd, which may
 be immediately
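
The same flow as a small C model, purely as an illustration: the
struct and function names are invented, there is no real device
behind it, and "hardware" and "software" steps are plain function
calls in one thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the NIC described above; not real driver code. */
struct nic_model {
	unsigned int wr;	/* packet_buffer_wr, advanced by the HW */
	unsigned int rd;	/* packet_buffer_rd, last value written by SW */
	bool irq_disabled;	/* HW bit that suppresses further IRQ messages */
	unsigned int irqs;	/* IRQ messages sent so far */
};

/* HW: send an IRQ if packets are outstanding and IRQs are not suppressed. */
static void nic_maybe_irq(struct nic_model *nic)
{
	if (nic->wr != nic->rd && !nic->irq_disabled) {
		nic->irq_disabled = true;	/* no more IRQs until SW acts */
		nic->irqs++;
	}
}

/* HW: a packet lands in the buffer. */
static void nic_packet_arrives(struct nic_model *nic)
{
	nic->wr++;
	nic_maybe_irq(nic);
}

/* SW: report how far it has processed by writing packet_buffer_rd. */
static void nic_sw_write_rd(struct nic_model *nic, unsigned int new_rd)
{
	/* SW can only have processed packets the HW has already written. */
	assert(new_rd - nic->rd <= nic->wr - nic->rd);
	nic->rd = new_rd;
	nic->irq_disabled = false;	/* HW flips off its disable bit */
	nic_maybe_irq(nic);		/* may fire immediately */
}
```

If a packet arrives after SW sampled packet_buffer_wr but before it
writes packet_buffer_rd, the write re-arms the HW and the next IRQ is
sent right away, so the late packet is not missed.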

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <20100808013822.GA15146-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                 ` <20100808013822.GA15146-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-08-08 18:16                                   ` Bart Van Assche
       [not found]                                     ` <AANLkTi=My1aK3VsYejeVeRSqo+7RNMX2x6osGNbBERvx-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2010-08-08 18:16 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Roland Dreier, Linux-RDMA

On Sun, Aug 8, 2010 at 3:38 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> [ ... ]
>
> No, all hardware pretty much works like this. The general flow is:
>
> IRQ happens
>  (if level triggered 'ack' the IRQ to the HW, to suppress the level)
> SW processes
> SW 'does something' to the HW to cause new IRQs to happen
> IRQ happens.. repeat..
>
> [ ... ]

You might have missed or forgotten the point that was made in the
first message of this thread.

Bart.

[parent not found: <AANLkTi=My1aK3VsYejeVeRSqo+7RNMX2x6osGNbBERvx-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                     ` <AANLkTi=My1aK3VsYejeVeRSqo+7RNMX2x6osGNbBERvx-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-08-08 23:51                                       ` Jason Gunthorpe
       [not found]                                         ` <20100808235104.GA32488-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2010-08-08 23:51 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Roland Dreier, Linux-RDMA

On Sun, Aug 08, 2010 at 08:16:55PM +0200, Bart Van Assche wrote:
> On Sun, Aug 8, 2010 at 3:38 AM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > [ ... ]
> >
> > No, all hardware pretty much works like this. The general flow is:
> >
> > IRQ happens
> >  (if level triggered 'ack' the IRQ to the HW, to suppress the level)
> > SW processes
> > SW 'does something' to the HW to cause new IRQs to happen
> > IRQ happens.. repeat..
> >
> > [ ... ]
> 
> You might have missed or forgotten the point that was made in the
> first message of this thread.

Erm, Roland asserted the problem you were thinking about did not exist
in Linux, and I thought you agreed?

http://www.spinics.net/lists/linux-rdma/msg05031.html

Was there something else in that message?

I agree there is some variation in what HW is sensitive to for
generating IRQs, and I do agree that making ib_req_notify_cq an
event-sensitive condition (i.e. a new CQE was added) rather than a
state-sensitive callback (i.e. the CQ is not empty) often requires more
code. But it does not make IB fundamentally different from everything
else, and it fits within the general flow I outlined above.
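
To make the event-sensitive versus state-sensitive distinction
concrete, here is a small user-space C sketch. Everything in it is an
assumption made for illustration: struct cq_model and its helpers are
invented names and do not correspond to verbs API calls. It only shows
why event-sensitive arming pushes an extra drain step onto the
consumer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical CQ model; these names are illustrative, not verbs API calls. */
struct cq_model {
    int depth;    /* completions currently queued */
    bool armed;   /* a notification has been requested */
    int events;   /* completion callbacks delivered */
};

/* A new completion is added; an armed CQ fires once and disarms. */
static void cqe_arrives(struct cq_model *cq)
{
    cq->depth++;
    if (cq->armed) {
        cq->armed = false;
        cq->events++;
    }
}

/* State-sensitive arming: fires at once if the CQ is already non-empty. */
static void arm_state_sensitive(struct cq_model *cq)
{
    cq->armed = true;
    if (cq->depth > 0) {
        cq->armed = false;
        cq->events++;
    }
}

/* Event-sensitive arming: only a later cqe_arrives() can fire. */
static void arm_event_sensitive(struct cq_model *cq)
{
    cq->armed = true;
}

/* Extra consumer code needed with event-sensitive arming: arm, then
 * drain again to pick up anything queued before the arm took effect. */
static int arm_and_recheck(struct cq_model *cq)
{
    int drained = 0;

    assert(cq->depth >= 0);
    arm_event_sensitive(cq);
    while (cq->depth > 0) {
        cq->depth--;
        drained++;
    }
    return drained;
}
```

With one completion already queued, arm_state_sensitive() delivers a
callback right away, while arm_event_sensitive() delivers none until
another completion arrives; arm_and_recheck() is the additional code
that covers that gap.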

Further, the approach you outline in your follow-on message for
blk_iopoll has problems. Look at how IPoIB does NAPI to see how
this must look.

1) ib_req_notify_cq must only be called if you processed less than
   the budget
2) blk_iopoll_complete must be called prior to ib_req_notify_cq, since
   calling ib_req_notify_cq can immediately generate an interrupt, and
   that interrupt must see the sched bit as cleared. If
   ib_req_notify_cq races then you have to reschedule blk_iopoll
   (and maybe continue looping, depending on your strategy for
    calling blk_iopoll_disable elsewhere).
3) The idea that you can hand off to normal processing if
   blk_iopoll_sched_prep fails in the ISR does not work for anything
   relying on the non-reentrancy of the blk_iopoll callback for
   locking. This seems to describe the SRP driver.
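
The ordering in points 1 and 2 can be sketched as a user-space C
model. This is an illustration under stated assumptions, not driver
code: struct poll_model and the model_* functions are hypothetical
stand-ins for the CQ, the scheduled bit and the re-arm call, and the
loop only loosely follows the shape of a NAPI-style poll routine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; the model_* names are not kernel APIs. */
struct poll_model {
    int cq_depth;       /* completions waiting in the CQ */
    int late_arrivals;  /* completions that land just before the re-arm */
    bool sched;         /* stands in for the iopoll/NAPI "scheduled" bit */
    int rearm_calls;    /* how often notification was requested */
};

/* Dequeue one completion; returns 1 if one was found, 0 otherwise. */
static int model_poll_cq(struct poll_model *m)
{
    if (m->cq_depth == 0)
        return 0;
    m->cq_depth--;
    return 1;
}

/* Re-arm the CQ; returns > 0 if completions are already queued. */
static int model_req_notify(struct poll_model *m)
{
    m->cq_depth += m->late_arrivals;   /* simulate the race window */
    m->late_arrivals = 0;
    m->rearm_calls++;
    return m->cq_depth > 0;
}

/* Try to take the scheduled bit; fails if someone else holds it. */
static bool model_sched_prep(struct poll_model *m)
{
    if (m->sched)
        return false;
    m->sched = true;
    return true;
}

/* Poll callback: returns the number of completions processed. */
static int model_poll(struct poll_model *m, int budget)
{
    int done = 0;

    assert(budget > 0);
    for (;;) {
        while (done < budget && model_poll_cq(m))
            done++;
        if (done >= budget)
            break;                  /* point 1: stay scheduled, no re-arm */
        m->sched = false;           /* point 2: complete before re-arming */
        if (model_req_notify(m) == 0)
            break;                  /* armed with an empty CQ */
        if (!model_sched_prep(m))
            break;                  /* the ISR already rescheduled us */
    }
    return done;
}
```

When the budget is exhausted the model never re-arms and stays
scheduled; when a completion slips in just before the re-arm, the
non-zero return from model_req_notify() sends the loop around once
more instead of leaving that completion stranded.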

There is no easy way you can switch from processing in a non-ISR
context to processing in an interrupt on the fly.. Each relies on
different implicit locking and switching between those two domains is
ugly. Something like this pseudo-code:

srp_supress_ib_req_notify_cq = 1;
blkio_poll_disable();

// now there will be no more blkio calls, and no more interrupts!

// Neuter the ISR while we are piddling:
set_bit(IOPOLL_F_DISABLE, &iop->state);

// Drain the CQ
poll_again:
while (srp_recv_poll_once())
   ;

// Try to switch back to interrupts!
disable_interrupts();
ret = ib_req_notify_cq(priv->recv_cq,
                       IB_CQ_NEXT_COMP |
                       IB_CQ_REPORT_MISSED_EVENTS);
if (ret) {
   enable_interrupts();
   goto poll_again;
}

// OK! We will *definitely* get an interrupt now!
srp_do_not_use_blkio_poll = 1;
enable_interrupts();

Hope this helps,
Jason

[parent not found: <20100808235104.GA32488-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]

* Re: About a shortcoming of the verbs API
       [not found]                                         ` <20100808235104.GA32488-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-08-09  7:57                                           ` Bart Van Assche
  0 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2010-08-09  7:57 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Roland Dreier, Linux-RDMA

On Mon, Aug 9, 2010 at 1:51 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>
> [ ... ]
>
> Further, the approach you outline in your follow-on message for
> blk_iopoll has problems. Look at how IPoIB does NAPI to see how
> this must look.
>
> 1) ib_req_notify_cq must only be called if you processed less than
>   the budget
> 2) blk_iopoll_complete must be called prior to ib_req_notify_cq, since
>   calling ib_req_notify_cq can immediately generate an interrupt, and
>   that interrupt must see the sched bit as cleared. If
>   ib_req_notify_cq races then you have to reschedule blk_iopoll
>   (and maybe continue looping, depending on your strategy for
>    calling blk_iopoll_disable elsewhere).
> 3) The idea that you can hand off to normal processing if
>   blk_iopoll_sched_prep fails in the ISR does not work for anything
>   relying on the non-reentrancy of the blk_iopoll callback for
>   locking. This seems to describe the SRP driver.

Good catches.

> There is no easy way you can switch from processing in a non-ISR
> context to processing in an interrupt on the fly.. Each relies on
> different implicit locking and switching between those two domains is
> ugly. Something like this pseudo-code:
>
> srp_supress_ib_req_notify_cq = 1;
> blkio_poll_disable();
>
> // now there will be no more blkio calls, and no more interrupts!
>
> // Neuter the ISR while we are piddling:
> set_bit(IOPOLL_F_DISABLE, &iop->state);
>
> // Drain the CQ
> poll_again:
> while (srp_recv_poll_once())
>   ;
>
> // Try to switch back to interrupts!
> disable_interrupts();
> ret = ib_req_notify_cq(priv->recv_cq,
>                       IB_CQ_NEXT_COMP |
>                       IB_CQ_REPORT_MISSED_EVENTS);
> if (ret) {
>   enable_interrupts();
>   goto poll_again;
> }
>
> // OK! We will *definitely* get an interrupt now!
> srp_do_not_use_blkio_poll = 1;
> enable_interrupts();

Regarding the above pseudo-code: I do not know of any Linux kernel
version that defines the functions disable_interrupts() and
enable_interrupts(). Functions do exist that disable and re-enable
interrupts on the local CPU, but disabling interrupts on the local CPU
only would make the above code behave incorrectly on multiprocessor
systems. As far as I know there are no functions available to disable
all interrupts on all CPUs, which would be a very expensive operation
anyway. Invoking disable_irq() and enable_irq() could help, but this
requires knowledge of the interrupt number, which is not available in
the context of a completion handler. So I'm not sure whether it is
possible to translate the above approach from pseudo-code to real code.

Bart.

end of thread, other threads:[~2010-08-09 18:58 UTC | newest]

Thread overview: 30+ messages
     [not found] <AANLkTi=zowawGDjyh+uKve_NiRNMXcrqjAk0hRxGSMOv@mail.gmail.com>
     [not found] ` <AANLkTi=zowawGDjyh+uKve_NiRNMXcrqjAk0hRxGSMOv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-25 18:54   ` About a shortcoming of the verbs API Bart Van Assche
     [not found]     ` <AANLkTinHRnt-jvy0xBOAPUDGcfx6=V6rkRT3t0Ja52FP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-26 14:21       ` Steve Wise
     [not found]         ` <4C4D99F8.3090206-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-26 17:59           ` Bart Van Assche
2010-07-26 19:22       ` Roland Dreier
     [not found]         ` <adamxtejbes.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-27  7:54           ` Or Gerlitz
     [not found]             ` <4C4E90B6.5070002-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
2010-07-28 17:44               ` Roland Dreier
     [not found]                 ` <ada1vanfqn1.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-29  6:27                   ` Or Gerlitz
2010-07-27  8:33           ` Bart Van Assche
     [not found]             ` <AANLkTinYuyCqJ6_wq6GH0vQGAY-mwC=7ZLicBnXO+efB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-27 16:50               ` Roland Dreier
     [not found]                 ` <adafwz4g98j.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-27 18:03                   ` Bart Van Assche
     [not found]                     ` <AANLkTimAk0k-q1EKjaXOadoXvKXbEN9nAky0w1rjixxB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-27 18:20                       ` Jason Gunthorpe
     [not found]                         ` <20100727182046.GT7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-27 19:28                           ` Bart Van Assche
     [not found]                             ` <AANLkTimAS6znoCCw33ipVV-W-e1BJS93Fxzp-oe0jO4u-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-27 20:15                               ` Jason Gunthorpe
2010-07-28 17:42                               ` Roland Dreier
     [not found]                                 ` <ada62zzfqpj.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-28 17:51                                   ` Ralph Campbell
     [not found]                                     ` <1280339513.31421.264.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
2010-07-28 18:01                                       ` Bart Van Assche
2010-07-28 18:05                                       ` Roland Dreier
     [not found]                                         ` <adask33eb36.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-28 18:11                                           ` Ralph Campbell
2010-07-28 18:16                                           ` Roland Dreier
     [not found]                                             ` <adaocdreal0.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-28 19:05                                               ` Ralph Campbell
2010-08-07  7:56                           ` Bart Van Assche
     [not found]                             ` <AANLkTimc3iS8=8ZQ9u8tOLP4-q_e+o0=AncZj-Mbre2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-07 16:32                               ` Roland Dreier
     [not found]                                 ` <adavd7mz8m9.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-08-08 18:19                                   ` Bart Van Assche
     [not found]                                     ` <AANLkTinKsLNoia96AVDA6fP9Es5_2Rq_wTgY=z6wk_FE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-09 14:49                                       ` David Dillow
     [not found]                                         ` <1281365347.4968.5.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
2010-08-09 18:45                                           ` Vladislav Bolkhovitin
     [not found]                                             ` <4C604CB4.5060705-d+Crzxg7Rs0@public.gmane.org>
2010-08-09 18:58                                               ` David Dillow
2010-08-08  1:38                               ` Jason Gunthorpe
     [not found]                                 ` <20100808013822.GA15146-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-08-08 18:16                                   ` Bart Van Assche
     [not found]                                     ` <AANLkTi=My1aK3VsYejeVeRSqo+7RNMX2x6osGNbBERvx-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-08 23:51                                       ` Jason Gunthorpe
     [not found]                                         ` <20100808235104.GA32488-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-08-09  7:57                                           ` Bart Van Assche
