* ipoib lockdep warning
@ 2006-07-11 20:10 Zach Brown
2006-07-11 21:16 ` Michael S. Tsirkin
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Zach Brown @ 2006-07-11 20:10 UTC (permalink / raw)
To: openib-general, Linux Kernel Mailing List, Arjan van de Ven,
Ingo Molnar
I get an awfully verbose lockdep warning when I bring up ipoib devices
and wanted to make sure it didn't go unreported. I've put it up at the
following URL:
http://oss.oracle.com/~zab/ipoib-multicast-lockdep-warning.txt
I looked into it a bit and it seems to be on to something. There might
be an AB, BC, CA ordering deadlock.
AB:

priv->lock is held while acquiring query_idr.lock

ipoib_mcast_send()
  spin_lock(&priv->lock);
  ipoib_mcast_sendonly_join()
    ib_sa_mcmember_rec_query()
      send_mad()
        idr_pre_get(&query_idr)
          spin_lock(&idp->lock);
BC:

query_idr.lock is taken with interrupts enabled and so is implicitly
ordered before dev->_xmit_lock which is taken in interrupt context.

ipoib_mcast_join_task()
  ipoib_mcast_join()
    ib_sa_mcmember_rec_query()
      send_mad()
        idr_pre_get(&query_idr)
          spin_lock(&idp->lock)
CA:

dev->_xmit_lock is held while acquiring priv->lock.  This triggers the
lockdep warning that adding the dep between dev->_xmit_lock and
priv->lock connects a soft-irq-safe lock to a soft-irq-unsafe one.

ipoib_mcast_restart_task()
  local_irq_save(flags);
  netif_tx_lock(dev)
    spin_lock(&dev->_xmit_lock);
  spin_lock(&priv->lock);
I can imagine all sorts of potential fixes (block ints when calling idr?
reorder acquiry in ipoib_mcast_restart_task()?) but I'm operating on a
partial view of the paths here so I wasn't comfortable suggesting a fix.
I wouldn't be surprised to hear that there are circumstances that both
lockdep and I don't know about that stop this from being a problem :).
In any case, it'd be fantastic if someone who knows this code could sit
down with lockdep and some ipoib regression suite to shake out lockdep's
complaints.
- z
^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: ipoib lockdep warning
  2006-07-11 20:10 ipoib lockdep warning Zach Brown
@ 2006-07-11 21:16 ` Michael S. Tsirkin
  2006-07-11 21:40   ` Sean Hefty
  2006-07-11 22:54 ` Roland Dreier
  2006-07-12  2:33 ` [openib-general] ipoib lockdep warning Roland Dreier
  2 siblings, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2006-07-11 21:16 UTC (permalink / raw)
  To: Zach Brown, Sean Hefty, Hal Rosenstock, Roland Dreier
  Cc: openib-general, Linux Kernel Mailing List, Arjan van de Ven,
	Ingo Molnar

Quoting r. Zach Brown <zach.brown@oracle.com>:
> BC:
>
> query_idr.lock is taken with interrupts enabled and so is implicitly
> ordered before dev->_xmit_lock which is taken in interrupt context.
>
> ipoib_mcast_join_task()
>   ipoib_mcast_join()
>     ib_sa_mcmember_rec_query()
>       send_mad()
>         idr_pre_get(&query_idr)
>           spin_lock(&idp->lock)

Got to check, but if that's true we have a simple deadlock here:
ib_sa_mcmember_rec_query might get called from interrupt context as
well, deadlocking on idp->lock?

Sean?

> I can imagine all sorts of potential fixes (block ints when calling idr?
> reorder acquiry in ipoib_mcast_restart_task()?) but I'm operating on a
> partial view of the paths here so I wasn't comfortable suggesting a fix.
> I wouldn't be surprised to hear that there are circumstances that both
> lockdep and I don't know about that stop this from being a problem :).

Awesome, thanks for the analysis!  Your help is very much appreciated.

-- 
MST

^ permalink raw reply	[flat|nested] 32+ messages in thread
* RE: ipoib lockdep warning
  2006-07-11 21:16 ` Michael S. Tsirkin
@ 2006-07-11 21:40   ` Sean Hefty
  2006-07-11 21:50     ` [openib-general] " Zach Brown
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Hefty @ 2006-07-11 21:40 UTC (permalink / raw)
  To: 'Michael S. Tsirkin', Zach Brown, Hal Rosenstock, Roland Dreier
  Cc: openib-general, Linux Kernel Mailing List, Arjan van de Ven,
	Ingo Molnar

>Quoting r. Zach Brown <zach.brown@oracle.com>:
>> BC:
>>
>> query_idr.lock is taken with interrupts enabled and so is implicitly
>> ordered before dev->_xmit_lock which is taken in interrupt context.
>>
>> ipoib_mcast_join_task()
>>   ipoib_mcast_join()
>>     ib_sa_mcmember_rec_query()
>>       send_mad()
>>         idr_pre_get(&query_idr)
>>           spin_lock(&idp->lock)
>
>Got to check, but if that's true we have a simple deadlock here:
>ib_sa_mcmember_rec_query might get called from interrupt
>context as well, deadlocking on idp->lock?
>
>Sean?

As a side note, I believe that this is the upstream code and does not
include the latest multicast changes.

I'm not sure if anything calls into the sa_query interfaces from
interrupt context, but I doubt it.  From my brief look at the initially
reported issue, I can't determine if there's an actual problem without
studying the ipoib code more.

- Sean

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 21:40   ` Sean Hefty
@ 2006-07-11 21:50     ` Zach Brown
  0 siblings, 0 replies; 32+ messages in thread
From: Zach Brown @ 2006-07-11 21:50 UTC (permalink / raw)
  To: Sean Hefty
  Cc: 'Michael S. Tsirkin', Hal Rosenstock, Roland Dreier, Ingo Molnar,
	Linux Kernel Mailing List, openib-general, Arjan van de Ven

> As a side note, I believe that this is the upstream code and does not
> include the latest multicast changes.

Indeed, I should have mentioned in my report that I was running
2.6.17-mm6.

- z

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 20:10 ipoib lockdep warning Zach Brown
  2006-07-11 21:16 ` Michael S. Tsirkin
@ 2006-07-11 22:54 ` Roland Dreier
  2006-07-11 23:27   ` Zach Brown
  2006-07-12  2:33 ` [openib-general] ipoib lockdep warning Roland Dreier
  2 siblings, 1 reply; 32+ messages in thread
From: Roland Dreier @ 2006-07-11 22:54 UTC (permalink / raw)
  To: Zach Brown
  Cc: openib-general, Linux Kernel Mailing List, Arjan van de Ven,
	Ingo Molnar

No time to really look at this in detail, but I think the issue is a
slightly bogus conversion to netif_tx_lock().  Can you try this patch
and see if things are better?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ab40488..ddd1946 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -820,9 +820,8 @@ void ipoib_mcast_restart_task(void *dev_

        ipoib_mcast_stop_thread(dev, 0);

-       local_irq_save(flags);
        netif_tx_lock(dev);
-       spin_lock(&priv->lock);
+       spin_lock_irqsave(&priv->lock, flags);

        /*
         * Unfortunately, the networking core only gives us a list of all of
@@ -893,9 +892,8 @@ void ipoib_mcast_restart_task(void *dev_
                }
        }

-       spin_unlock(&priv->lock);
+       spin_unlock_irqrestore(&priv->lock, flags);
        netif_tx_unlock(dev);
-       local_irq_restore(flags);

        /* We have to cancel outside of the spinlock */
        list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 22:54 ` Roland Dreier
@ 2006-07-11 23:27   ` Zach Brown
  2006-07-11 23:43     ` Roland Dreier
  0 siblings, 1 reply; 32+ messages in thread
From: Zach Brown @ 2006-07-11 23:27 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Ingo Molnar, Linux Kernel Mailing List, openib-general,
	Arjan van de Ven

Roland Dreier wrote:
> No time to really look at this in detail, but I think the issue is a
> slightly bogus conversion to netif_tx_lock().  Can you try this patch
> and see if things are better?

> -       local_irq_save(flags);
>         netif_tx_lock(dev);
> -       spin_lock(&priv->lock);
> +       spin_lock_irqsave(&priv->lock, flags);

Hmm, won't that hold dev->_xmit_lock with interrupts enabled?  That
seems like it'd deadlock with dev_watchdog grabbing it from a softirq?

But setting that aside, I don't think this will help as it doesn't
change the order that the locks are acquired in this path.  This can
still be racing with other cpus that are each in the other two paths
(query_idr.lock -> softirq -> dev->_xmit_lock, priv->lock ->
query_idr.lock).  The local state of interrupt masking on this cpu
doesn't stop the other cpus from each grabbing the first lock in their
path and then trying to grab the second.  Imagine that they all race to
grab the first (A, B, C) and succeed and then all get stuck spinning on
their second lock (B, C, A).

Maybe you could get the priv->lock here before dev->_xmit_lock.  Then
we'd have AB, BC, AC, and that's OK.  I'm going to guess that this won't
work because other paths have dev->_xmit_lock -> priv->lock ordering.

Another possibility is masking interrupts when getting query_idr.lock.
That drops the implicit dependency between query_idr and _xmit_lock and
gives us AB, CA -- which is fine.  That means blocking ints while in
idr_pre_get() which allocs which leads to GFP_ATOMIC and more likely
allocation failure.

That's my reading, anyway.

- z

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 23:27   ` Zach Brown
@ 2006-07-11 23:43     ` Roland Dreier
  2006-07-11 23:53       ` Zach Brown
  2006-07-12  9:38       ` Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Roland Dreier @ 2006-07-11 23:43 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linux Kernel Mailing List, openib-general,
	Arjan van de Ven

Hmm, good point.

It sort of seems to me like the idr interfaces are broken by design.
Internally, lib/idr.c uses bare spin_lock(&idp->lock) with no interrupt
disabling or anything in both the idr_pre_get() and idr_get_new() code
paths.  But idr_pre_get() is supposed to be called in a context that can
sleep, while idr_get_new() is supposed to be called with locks held to
serialize things (at least according to http://lwn.net/Articles/103209/).

So, ugh... maybe the best thing to do is change lib/idr.c to use
spin_lock_irqsave() internally?

- R.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 23:43     ` Roland Dreier
@ 2006-07-11 23:53       ` Zach Brown
  2006-07-12  0:06         ` Roland Dreier
  0 siblings, 1 reply; 32+ messages in thread
From: Zach Brown @ 2006-07-11 23:53 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Ingo Molnar, Linux Kernel Mailing List, openib-general,
	Arjan van de Ven

Roland Dreier wrote:
> Hmm, good point.
>
> It sort of seems to me like the idr interfaces are broken by design.
> Internally, lib/idr.c uses bare spin_lock(&idp->lock) with no
> interrupt disabling or anything in both the idr_pre_get() and
> idr_get_new() code paths.

I wasn't thrilled to see that either.  We seem to have a fair precedent
(list.h, rbtree, etc) for leaving serialization to callers.

> So, ugh... maybe the best thing to do is change lib/idr.c to use
> spin_lock_irqsave() internally?

I dunno, it seems to have had _irq() locking in the past?  From the
comment at the top:

 * Modified by George Anzinger to reuse immediately and to use
 * find bit instructions.  Also removed _irq on spinlocks.

- z

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 23:53       ` Zach Brown
@ 2006-07-12  0:06         ` Roland Dreier
  0 siblings, 0 replies; 32+ messages in thread
From: Roland Dreier @ 2006-07-12 0:06 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linux Kernel Mailing List, openib-general,
	Arjan van de Ven

> > So, ugh... maybe the best thing to do is change lib/idr.c to use
> > spin_lock_irqsave() internally?
>
> I dunno, it seems to have had _irq() locking in the past?  From the
> comment at the top:
>
>  * Modified by George Anzinger to reuse immediately and to use
>  * find bit instructions.  Also removed _irq on spinlocks.

Well, _irq would be no good, because we might want to call idr stuff
with interrupts disabled.  But making idr internally _irqsave seems like
the right fix to me.

I think the real issue here is that the sa_query.c stuff wants to use
the idr mechanism to assign "query ids", and other modules want to be
able to start queries from any context.  So if idr uses bare spin_lock
internally, then sa_query.c has no choice but to wrap all idr calls
inside spin_lock_irqsave and do all allocation with GFP_ATOMIC, which
doesn't seem very nice.

- R.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 23:43     ` Roland Dreier
  2006-07-11 23:53       ` Zach Brown
@ 2006-07-12  9:38       ` Ingo Molnar
  2006-07-12 11:09         ` Michael S. Tsirkin
                           ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Ingo Molnar @ 2006-07-12 9:38 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Zach Brown, Linux Kernel Mailing List, openib-general,
	Arjan van de Ven

* Roland Dreier <rdreier@cisco.com> wrote:

> Hmm, good point.
>
> It sort of seems to me like the idr interfaces are broken by design.
[...]
> So, ugh... maybe the best thing to do is change lib/idr.c to use
> spin_lock_irqsave() internally?

i agree that the IDR subsystem should be irq-safe if GFP_ATOMIC is
passed in.  So the _irqsave()/_irqrestore() fix should be done.

But i also think that you should avoid using GFP_ATOMIC for any sort of
reliable IO path and push as much work into process context as possible.
Is it acceptable for your infiniband IO model to fail with -ENOMEM if
GFP_ATOMIC happens to fail, and is the IO retried transparently?

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: ipoib lockdep warning
  2006-07-12  9:38       ` Ingo Molnar
@ 2006-07-12 11:09         ` Michael S. Tsirkin
  2006-07-12 16:31           ` [openib-general] " Sean Hefty
                             ` (2 more replies)
  2006-07-12 19:06         ` [openib-general] ipoib lockdep warning Roland Dreier
  2006-07-12 20:45         ` [PATCH] Convert idr's internal locking to _irqsave variant Roland Dreier
  2 siblings, 3 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2006-07-12 11:09 UTC (permalink / raw)
  To: Ingo Molnar, Sean Hefty
  Cc: Roland Dreier, Zach Brown, Linux Kernel Mailing List,
	openib-general, Arjan van de Ven

Quoting r. Ingo Molnar <mingo@elte.hu>:
> But i also think that you should avoid using GFP_ATOMIC for any sort of
> reliable IO path and push as much work into process context as possible.
> Is it acceptable for your infiniband IO model to fail with -ENOMEM if
> GFP_ATOMIC happens to fail, and is the IO retried transparently?

Yes, this is true for users that pass GFP_ATOMIC to sa_query, at least.
But might not be so for other users: send_mad in sa_query actually gets
gfp_flags parameter, but for some reason it does not pass it to
idr_pre_get, which means even sa query done with GFP_KERNEL flag is
likely to fail.

Sean, it seems we need something like the following - what do you think?

--

Avoid bogus out of memory errors: fix sa_query to actually pass gfp_mask
supplied by the user to idr_pre_get.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index e911c99..aeda484 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -488,13 +488,13 @@ static void init_mad(struct ib_sa_mad *m
        spin_unlock_irqrestore(&tid_lock, flags);
 }

-static int send_mad(struct ib_sa_query *query, int timeout_ms)
+static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
        unsigned long flags;
        int ret, id;

retry:
-       if (!idr_pre_get(&query_idr, GFP_ATOMIC))
+       if (!idr_pre_get(&query_idr, gfp_mask))
                return -ENOMEM;
        spin_lock_irqsave(&idr_lock, flags);
        ret = idr_get_new(&query_idr, query, &id);
@@ -630,7 +630,7 @@ int ib_sa_path_rec_get(struct ib_device

        *sa_query = &query->sa_query;

-       ret = send_mad(&query->sa_query, timeout_ms);
+       ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
        if (ret < 0)
                goto err2;

@@ -752,7 +752,7 @@ int ib_sa_service_rec_query(struct ib_de

        *sa_query = &query->sa_query;

-       ret = send_mad(&query->sa_query, timeout_ms);
+       ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
        if (ret < 0)
                goto err2;

@@ -844,7 +844,7 @@ int ib_sa_mcmember_rec_query(struct ib_d

        *sa_query = &query->sa_query;

-       ret = send_mad(&query->sa_query, timeout_ms);
+       ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
        if (ret < 0)
                goto err2;

-- 
MST

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-12 11:09         ` Michael S. Tsirkin
@ 2006-07-12 16:31           ` Sean Hefty
  2006-07-12 18:56           ` Roland Dreier
  2006-07-13 15:55           ` [PATCH] IB/core: use correct gfp_mask in sa_query Michael S. Tsirkin
  2 siblings, 0 replies; 32+ messages in thread
From: Sean Hefty @ 2006-07-12 16:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Ingo Molnar, Sean Hefty, Roland Dreier, Arjan van de Ven,
	Zach Brown, openib-general, Linux Kernel Mailing List

Michael S. Tsirkin wrote:
> Yes, this is true for users that pass GFP_ATOMIC to sa_query, at least.
> But might not be so for other users: send_mad in sa_query actually gets
> gfp_flags parameter, but for some reason it does not pass it to
> idr_pre_get, which means even sa query done with GFP_KERNEL flag is
> likely to fail.
>
> Sean, it seems we need something like the following - what do you think?

I noticed this same thing looking at the code yesterday.  I can't think
of any reason why your patch wouldn't work.

- Sean

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: ipoib lockdep warning
  2006-07-12 11:09         ` Michael S. Tsirkin
  2006-07-12 16:31           ` [openib-general] " Sean Hefty
@ 2006-07-12 18:56           ` Roland Dreier
  2006-07-13 15:55           ` [PATCH] IB/core: use correct gfp_mask in sa_query Michael S. Tsirkin
  2 siblings, 0 replies; 32+ messages in thread
From: Roland Dreier @ 2006-07-12 18:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Ingo Molnar, Sean Hefty, Zach Brown, Linux Kernel Mailing List,
	openib-general, Arjan van de Ven

> Avoid bogus out of memory errors: fix sa_query to actually pass gfp_mask
> supplied by the user to idr_pre_get.

Yes, this looks right to me.

- R.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH] IB/core: use correct gfp_mask in sa_query
  2006-07-12 11:09         ` Michael S. Tsirkin
  2006-07-12 16:31           ` [openib-general] " Sean Hefty
  2006-07-12 18:56           ` Roland Dreier
@ 2006-07-13 15:55           ` Michael S. Tsirkin
  2 siblings, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2006-07-13 15:55 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, Sean Hefty
  Cc: Roland Dreier, Zach Brown, openib-general, Arjan van de Ven

Andrew, could you please drop the following in -mm and on to Linus?

--

Avoid bogus out of memory errors: fix sa_query to actually pass gfp_mask
supplied by the user to idr_pre_get.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Acked-by: "Sean Hefty" <mshefty@ichips.intel.com>
Acked-by: "Roland Dreier" <rdreier@cisco.com>

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index e911c99..aeda484 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -488,13 +488,13 @@ static void init_mad(struct ib_sa_mad *m
        spin_unlock_irqrestore(&tid_lock, flags);
 }

-static int send_mad(struct ib_sa_query *query, int timeout_ms)
+static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
        unsigned long flags;
        int ret, id;

retry:
-       if (!idr_pre_get(&query_idr, GFP_ATOMIC))
+       if (!idr_pre_get(&query_idr, gfp_mask))
                return -ENOMEM;
        spin_lock_irqsave(&idr_lock, flags);
        ret = idr_get_new(&query_idr, query, &id);
@@ -630,7 +630,7 @@ int ib_sa_path_rec_get(struct ib_device

        *sa_query = &query->sa_query;

-       ret = send_mad(&query->sa_query, timeout_ms);
+       ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
        if (ret < 0)
                goto err2;

@@ -752,7 +752,7 @@ int ib_sa_service_rec_query(struct ib_de

        *sa_query = &query->sa_query;

-       ret = send_mad(&query->sa_query, timeout_ms);
+       ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
        if (ret < 0)
                goto err2;

@@ -844,7 +844,7 @@ int ib_sa_mcmember_rec_query(struct ib_d

        *sa_query = &query->sa_query;

-       ret = send_mad(&query->sa_query, timeout_ms);
+       ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
        if (ret < 0)
                goto err2;

-- 
MST

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-12  9:38       ` Ingo Molnar
  2006-07-12 11:09         ` Michael S. Tsirkin
@ 2006-07-12 19:06         ` Roland Dreier
  2006-07-12 20:45         ` [PATCH] Convert idr's internal locking to _irqsave variant Roland Dreier
  2 siblings, 0 replies; 32+ messages in thread
From: Roland Dreier @ 2006-07-12 19:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, Linux Kernel Mailing List, openib-general,
	Arjan van de Ven

> i agree that the IDR subsystem should be irq-safe if GFP_ATOMIC is
> passed in.  So the _irqsave()/_irqrestore() fix should be done.

OK, I will send the idr change to Andrew.

> But i also think that you should avoid using GFP_ATOMIC for any sort of
> reliable IO path and push as much work into process context as possible.
> Is it acceptable for your infiniband IO model to fail with -ENOMEM if
> GFP_ATOMIC happens to fail, and is the IO retried transparently?

Yes, I think it's OK.  This idr use is in an inherently unreliable path.
With that said, as Michael pointed out, we can change things to use
GFP_ATOMIC less.

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-12  9:38       ` Ingo Molnar
  2006-07-12 11:09         ` Michael S. Tsirkin
  2006-07-12 19:06         ` [openib-general] ipoib lockdep warning Roland Dreier
@ 2006-07-12 20:45         ` Roland Dreier
  2006-07-12 21:14           ` Ingo Molnar
  2006-07-13  1:30           ` Andrew Morton
  2 siblings, 2 replies; 32+ messages in thread
From: Roland Dreier @ 2006-07-12 20:45 UTC (permalink / raw)
  To: akpm
  Cc: Arjan van de Ven, Ingo Molnar, Zach Brown, openib-general,
	linux-kernel

Currently, the code in lib/idr.c uses a bare spin_lock(&idp->lock) to
do internal locking.  This is a nasty trap for code that might call
idr functions from different contexts; for example, it seems perfectly
reasonable to call idr_get_new() from process context and idr_remove()
from interrupt context -- but with the current locking this would lead
to a potential deadlock.

The simplest fix for this is to just convert the idr locking to use
spin_lock_irqsave().

In particular, this fixes a very complicated locking issue detected by
lockdep, involving the ib_ipoib driver's priv->lock and
dev->_xmit_lock, which get involved with the ib_sa module's
query_idr.lock.

Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>

---

diff --git a/lib/idr.c b/lib/idr.c
index 4d09681..16d2143 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -38,14 +38,15 @@ static kmem_cache_t *idr_layer_cache;
 static struct idr_layer *alloc_layer(struct idr *idp)
 {
        struct idr_layer *p;
+       unsigned long flags;

-       spin_lock(&idp->lock);
+       spin_lock_irqsave(&idp->lock, flags);
        if ((p = idp->id_free)) {
                idp->id_free = p->ary[0];
                idp->id_free_cnt--;
                p->ary[0] = NULL;
        }
-       spin_unlock(&idp->lock);
+       spin_unlock_irqrestore(&idp->lock, flags);
        return(p);
 }

@@ -59,12 +60,14 @@ static void __free_layer(struct idr *idp

 static void free_layer(struct idr *idp, struct idr_layer *p)
 {
+       unsigned long flags;
+
        /*
         * Depends on the return element being zeroed.
         */
-       spin_lock(&idp->lock);
+       spin_lock_irqsave(&idp->lock, flags);
        __free_layer(idp, p);
-       spin_unlock(&idp->lock);
+       spin_unlock_irqrestore(&idp->lock, flags);
 }

 /**
@@ -168,6 +171,7 @@ static int idr_get_new_above_int(struct
 {
        struct idr_layer *p, *new;
        int layers, v, id;
+       unsigned long flags;

        id = starting_id;
build_up:
@@ -191,14 +195,14 @@ build_up:
                 * The allocation failed.  If we built part of
                 * the structure tear it down.
                 */
-               spin_lock(&idp->lock);
+               spin_lock_irqsave(&idp->lock, flags);
                for (new = p; p && p != idp->top; new = p) {
                        p = p->ary[0];
                        new->ary[0] = NULL;
                        new->bitmap = new->count = 0;
                        __free_layer(idp, new);
                }
-               spin_unlock(&idp->lock);
+               spin_unlock_irqrestore(&idp->lock, flags);
                return -1;
        }
        new->ary[0] = p;

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-12 20:45         ` [PATCH] Convert idr's internal locking to _irqsave variant Roland Dreier
@ 2006-07-12 21:14           ` Ingo Molnar
  2006-07-13  1:30           ` Andrew Morton
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2006-07-12 21:14 UTC (permalink / raw)
  To: Roland Dreier
  Cc: akpm, Arjan van de Ven, Zach Brown, openib-general, linux-kernel

* Roland Dreier <rdreier@cisco.com> wrote:

> Currently, the code in lib/idr.c uses a bare spin_lock(&idp->lock) to
> do internal locking.  This is a nasty trap for code that might call
> idr functions from different contexts; for example, it seems perfectly
> reasonable to call idr_get_new() from process context and idr_remove()
> from interrupt context -- but with the current locking this would lead
> to a potential deadlock.
>
> The simplest fix for this is to just convert the idr locking to use
> spin_lock_irqsave().
>
> In particular, this fixes a very complicated locking issue detected by
> lockdep, involving the ib_ipoib driver's priv->lock and
> dev->_xmit_lock, which get involved with the ib_sa module's
> query_idr.lock.
>
> Cc: Arjan van de Ven <arjan@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant 2006-07-12 20:45 ` [PATCH] Convert idr's internal locking to _irqsave variant Roland Dreier 2006-07-12 21:14 ` Ingo Molnar @ 2006-07-13 1:30 ` Andrew Morton 2006-07-13 15:42 ` Roland Dreier 1 sibling, 1 reply; 32+ messages in thread From: Andrew Morton @ 2006-07-13 1:30 UTC (permalink / raw) To: Roland Dreier; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel On Wed, 12 Jul 2006 13:45:12 -0700 Roland Dreier <rdreier@cisco.com> wrote: > Currently, the code in lib/idr.c uses a bare spin_lock(&idp->lock) to > do internal locking. This is a nasty trap for code that might call > idr functions from different contexts; for example, it seems perfectly > reasonable to call idr_get_new() from process context and idr_remove() > from interrupt context -- but with the current locking this would lead > to a potential deadlock. > > The simplest fix for this is to just convert the idr locking to use > spin_lock_irqsave(). > > In particular, this fixes a very complicated locking issue detected by > lockdep, involving the ib_ipoib driver's priv->lock and dev->_xmit_lock, > which get involved with the ib_sa module's query_idr.lock. Sigh. It was always a mistake (of the kernel programming 101 type) to put any locking at all in the idr code. At some stage we need to weed it all out and move it to callers. Your fix is yet more fallout from that mistake. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant 2006-07-13 1:30 ` Andrew Morton @ 2006-07-13 15:42 ` Roland Dreier 2006-07-13 20:54 ` Andrew Morton 0 siblings, 1 reply; 32+ messages in thread From: Roland Dreier @ 2006-07-13 15:42 UTC (permalink / raw) To: Andrew Morton; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel > Sigh. It was always a mistake (of the kernel programming 101 type) to put > any locking at all in the idr code. At some stage we need to weed it all > out and move it to callers. > > Your fix is yet more fallout from that mistake. Agreed. Consider me on the hook to fix this up in a better way once my life is a little saner. Maybe I'll try to cook something up on the plane ride to Ottawa. Anyway you can punch me in the stomach if I don't have something in time for 2.6.19. - R. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant 2006-07-13 15:42 ` Roland Dreier @ 2006-07-13 20:54 ` Andrew Morton 2006-07-13 21:03 ` Roland Dreier 0 siblings, 1 reply; 32+ messages in thread From: Andrew Morton @ 2006-07-13 20:54 UTC (permalink / raw) To: Roland Dreier; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel On Thu, 13 Jul 2006 08:42:47 -0700 Roland Dreier <rdreier@cisco.com> wrote: > > Sigh. It was always a mistake (of the kernel programming 101 type) to put > > any locking at all in the idr code. At some stage we need to weed it all > > out and move it to callers. > > > > Your fix is yet more fallout from that mistake. > > Agreed. Consider me on the hook to fix this up in a better way once > my life is a little saner. Maybe I'll try to cook something up on the > plane ride to Ottawa. > I suspect it'll get really ugly. It's a container library which needs to allocate memory when items are added, like the radix-tree. Either it needs to assume GFP_ATOMIC, which is bad and can easily fail or it does weird things like radix_tree_preload(). The basic problem is: idr_pre_get(GFP_KERNEL); spin_lock(my_lock); idr_get_new(..); which is racy, because some other CPU could have got in there and consumed some of the pool which was precharged by idr_pre_get(). It's wildly improbable that it'll actually fail. It requires all of: a) that the race occur b) that the racing thread consume an appreciable amount of the pool c) that this thread also consume an appreciable amount (such that the total of both exceeds the pool size). d) that a (needs to be added) GFP_ATOMIC attempt to replenish the pool inside idr_get_new() fails. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant 2006-07-13 20:54 ` Andrew Morton @ 2006-07-13 21:03 ` Roland Dreier 2006-07-13 21:05 ` Arjan van de Ven 2006-07-13 21:43 ` Andrew Morton 0 siblings, 2 replies; 32+ messages in thread From: Roland Dreier @ 2006-07-13 21:03 UTC (permalink / raw) To: Andrew Morton; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel > I suspect it'll get really ugly. It's a container library which needs to > allocate memory when items are added, like the radix-tree. Either it needs > to assume GFP_ATOMIC, which is bad and can easily fail or it does weird > things like radix_tree_preload(). Actually I don't think it has to be too bad. We could tweak the interface a little bit so that consumers do something like: struct idr_layer *layer = NULL; /* opaque */ retry: spin_lock(&my_idr_lock); ret = idr_get_new(&my_idr, ptr, &id, layer); spin_unlock(&my_idr_lock); if (ret == -EAGAIN) { layer = idr_alloc_layer(&my_idr, GFP_KERNEL); if (!IS_ERR(layer)) goto retry; } in other words make the consumer responsible for passing in new memory that can be used for a new entry (or freed if other entries have become free in the meantime). - R. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant 2006-07-13 21:03 ` Roland Dreier @ 2006-07-13 21:05 ` Arjan van de Ven 2006-07-14 0:18 ` Roland Dreier 2006-07-13 21:43 ` Andrew Morton 1 sibling, 1 reply; 32+ messages in thread From: Arjan van de Ven @ 2006-07-13 21:05 UTC (permalink / raw) To: Roland Dreier Cc: Andrew Morton, mingo, zach.brown, openib-general, linux-kernel On Thu, 2006-07-13 at 14:03 -0700, Roland Dreier wrote: > > I suspect it'll get really ugly. It's a container library which needs to > > allocate memory when items are added, like the radix-tree. Either it needs > > to assume GFP_ATOMIC, which is bad and can easily fail or it does weird > > things like radix_tree_preload(). > > Actually I don't think it has to be too bad. We could tweak the > interface a little bit so that consumers do something like: > > it does get harder if this is needed for your IB device to do more work, so that your swap device on your IB can take more IO's to free up ram.. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant 2006-07-13 21:05 ` Arjan van de Ven @ 2006-07-14 0:18 ` Roland Dreier 2006-07-14 6:20 ` Arjan van de Ven 0 siblings, 1 reply; 32+ messages in thread From: Roland Dreier @ 2006-07-14 0:18 UTC (permalink / raw) To: Arjan van de Ven Cc: Andrew Morton, mingo, zach.brown, openib-general, linux-kernel Arjan> it does get harder if this is needed for your IB device to Arjan> do more work, so that your swap device on your IB can take Arjan> more IO's to free up ram.. That's the classic problem, but it's more a matter of the consumer using GFP_NOIO in the right places. - R. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-14  0:18 ` Roland Dreier
@ 2006-07-14  6:20 ` Arjan van de Ven
  0 siblings, 0 replies; 32+ messages in thread
From: Arjan van de Ven @ 2006-07-14  6:20 UTC (permalink / raw)
To: Roland Dreier
Cc: Andrew Morton, mingo, zach.brown, openib-general, linux-kernel

On Thu, 2006-07-13 at 17:18 -0700, Roland Dreier wrote:
>     Arjan> it does get harder if this is needed for your IB device to
>     Arjan> do more work, so that your swap device on your IB can take
>     Arjan> more IO's to free up ram..
>
> That's the classic problem, but it's more a matter of the consumer
> using GFP_NOIO in the right places.

GFP_NOIO isn't going to save you in the cases where the memory really
is running low and you need the memory to do more IO...

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-13 21:03 ` Roland Dreier
  2006-07-13 21:05 ` Arjan van de Ven
@ 2006-07-13 21:43 ` Andrew Morton
  2006-07-14  1:08 ` Roland Dreier
  1 sibling, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2006-07-13 21:43 UTC (permalink / raw)
To: Roland Dreier; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel

On Thu, 13 Jul 2006 14:03:21 -0700
Roland Dreier <rdreier@cisco.com> wrote:

> > I suspect it'll get really ugly.  It's a container library which needs to
> > allocate memory when items are added, like the radix-tree.  Either it needs
> > to assume GFP_ATOMIC, which is bad and can easily fail or it does weird
> > things like radix_tree_preload().
>
> Actually I don't think it has to be too bad.  We could tweak the
> interface a little bit so that consumers do something like:
>
> 	struct idr_layer *layer = NULL; /* opaque */
>
> retry:
> 	spin_lock(&my_idr_lock);
> 	ret = idr_get_new(&my_idr, ptr, &id, layer);
> 	spin_unlock(&my_idr_lock);
>
> 	if (ret == -EAGAIN) {
> 		layer = idr_alloc_layer(&my_idr, GFP_KERNEL);
> 		if (!IS_ERR(layer))
> 			goto retry;
> 	}
>
> in other words make the consumer responsible for passing in new memory
> that can be used for a new entry (or freed if other entries have
> become free in the meantime).
>

Good point, a try-again loop would work.  Do we really need the caller to
maintain a cache?  I suspect something like

drat:
	if (idr_pre_get(GFP_KERNEL) == ENOMEM)
		give_up();
	spin_lock();
	ret = idr_get_new();
	spin_unlock();
	if (ret == ENOMEM)
		goto drat;

would do it.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-13 21:43 ` Andrew Morton
@ 2006-07-14  1:08 ` Roland Dreier
  2006-07-14  1:18 ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Roland Dreier @ 2006-07-14  1:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel

 > Good point, a try-again loop would work.  Do we really need the caller to
 > maintain a cache?  I suspect something like
 >
 > drat:
 > 	if (idr_pre_get(GFP_KERNEL) == ENOMEM)
 > 		give_up();
 > 	spin_lock();
 > 	ret = idr_get_new();
 > 	spin_unlock();
 > 	if (ret == ENOMEM)
 > 		goto drat;
 >
 > would do it.

The problem (for my tiny brain at least) is that I don't know where
idr_pre_get() can put the memory it allocates if there's no lock in
the idr structure -- how do you maintain internal consistency if no
locks are held when filling the cache?

Having the caller hold a chunk of memory in a stack variable was the
trick I came up with to get around that.

 - R.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-14  1:08 ` Roland Dreier
@ 2006-07-14  1:18 ` Andrew Morton
  2006-07-14  1:30 ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2006-07-14  1:18 UTC (permalink / raw)
To: Roland Dreier; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel

On Thu, 13 Jul 2006 18:08:17 -0700
Roland Dreier <rdreier@cisco.com> wrote:

> > Good point, a try-again loop would work.  Do we really need the caller to
> > maintain a cache?  I suspect something like
> >
> > drat:
> > 	if (idr_pre_get(GFP_KERNEL) == ENOMEM)
> > 		give_up();
> > 	spin_lock();
> > 	ret = idr_get_new();
> > 	spin_unlock();
> > 	if (ret == ENOMEM)
> > 		goto drat;
> >
> > would do it.
>
> The problem (for my tiny brain at least) is that I don't know where
> idr_pre_get() can put the memory it allocates if there's no lock in
> the idr structure -- how do you maintain internal consistency if no
> locks are held when filling the cache?

argh.  Aren't you supposed to be on vacation or something?

> Having the caller hold a chunk of memory in a stack variable was the
> trick I came up with to get around that.

Yes, that certainly works.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-14  1:18 ` Andrew Morton
@ 2006-07-14  1:30 ` Andrew Morton
  2006-07-17 15:57 ` Roland Dreier
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2006-07-14  1:30 UTC (permalink / raw)
To: rdreier, arjan, mingo, zach.brown, openib-general, linux-kernel

On Thu, 13 Jul 2006 18:18:35 -0700
Andrew Morton <akpm@osdl.org> wrote:

> > Having the caller hold a chunk of memory in a stack variable was the
> > trick I came up with to get around that.
>
> Yes, that certainly works.

Problem is, I think, you'll need to preallocate IDR_FREE_MAX items.  And
then free them all again when none of them were consumed (usual).

Yes, storing the preallocated nodes in the idr itself requires locking.
But that locking is 100% private to the IDR implementation.  It locks
only the preload list and not the user's stuff.

radix_tree_preload() effectively does this.  Except the preload list is
kernel-wide.  It's split across CPUs and uses
local_irq_disable/preempt_disable locking tricks as a performance
optimisation.  But conceptually it's the same.  Simply copying that
would give something which is known to work...

It seems like a large amount of fuss, but when you think about it the
problem isn't simple.

^ permalink raw reply	[flat|nested] 32+ messages in thread
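The radix_tree_preload() shape Andrew describes can be sketched in single-threaded userspace. This is an illustrative analogy, not the kernel code: the names are invented, `PRELOAD_MAX` stands in for the per-insert worst case, and the per-CPU lists plus preempt_disable() locking he mentions are reduced to comments since plain userspace C needs neither.

```c
#include <assert.h>
#include <stdlib.h>

#define PRELOAD_MAX 4	/* worst-case number of nodes one insert can need */

struct node { struct node *next; };

/*
 * One cache of preallocated nodes.  The real radix_tree_preload()
 * keeps one such list per CPU and pins the task to its CPU with
 * preempt_disable(); that locking is private to the implementation
 * and never touches the caller's own locks.
 */
static struct node *preload_list;
static int preload_cnt;

/* Top up the cache before the caller takes its own spinlocks. */
static int tree_preload(void)
{
	/* kernel: preempt_disable() here, sleeping allocation below */
	while (preload_cnt < PRELOAD_MAX) {
		struct node *n = calloc(1, sizeof(*n));	/* GFP_KERNEL-style */

		if (!n)
			return -1;			/* out of memory */
		n->next = preload_list;
		preload_list = n;
		preload_cnt++;
	}
	return 0;
}

/* Take a node from the cache; guaranteed to succeed right after preload. */
static struct node *tree_node_alloc(void)
{
	struct node *n = preload_list;

	if (n) {
		preload_list = n->next;
		preload_cnt--;
	}
	return n;
}
```

After a successful `tree_preload()`, up to `PRELOAD_MAX` calls to `tree_node_alloc()` cannot fail, which is exactly why the caller's subsequent locked insert never needs to allocate; unconsumed nodes simply stay cached for the next preload instead of being freed.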
* Re: [PATCH] Convert idr's internal locking to _irqsave variant
  2006-07-14  1:30 ` Andrew Morton
@ 2006-07-17 15:57 ` Roland Dreier
  0 siblings, 0 replies; 32+ messages in thread
From: Roland Dreier @ 2006-07-17 15:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: arjan, mingo, zach.brown, openib-general, linux-kernel

 > Problem is, I think, you'll need to preallocate IDR_FREE_MAX items.  And
 > then free them all again when none of them were consumed (usual).

Actually I think it's OK if we just pass in no more than one extra
layer for each try to add something with idr_get_new().  In the worst
case, this leads to a lot of extra calls to idr_get_new(), but in the
usual case it's fine.

I'm including a lightly tested big patch with all my idr changes for
comments -- I'll split it up into a form more suitable for merging.
(Some of the changes are unrelated and obviously good, eg using
kmem_cache_zalloc() instead of a slab cache with a constructor that
does memset(0)).

I'm not sure I'm thrilled with this approach, but it does seem to be a
net win.  With an allyesconfig with debugging options turned off (so
spinlocks shrink back down to 8 bytes), I get the following:

   text    data     bss     dec     hex filename
24347759 5971210 2322176 32641145 1f21079 vmlinux.old
24347370 5970474 2320704 32638548 1f20654 vmlinux.new

Most of the savings comes from ocfs2, which has a static array of 255
structures that each contain an idr -- so removing the lock from
struct idr saves 255 * 8 = 2040 bytes.
However, even without factoring that in, this does seem to be a net win: add/remove: 2/4 grow/shrink: 23/51 up/down: 719/-3215 (-2496) function old new delta idr_get_new_above 38 554 +516 dm_create 957 1000 +43 ipath_init_one 3294 3329 +35 get_layer - 32 +32 idr_alloc_layer - 16 +16 sd_probe 871 881 +10 rtc_device_register 493 503 +10 proc_register 277 286 +9 mmc_add_host_sysfs 126 135 +9 idr_add_uobj 80 85 +5 cma_alloc_port 224 229 +5 sys_timer_create 876 880 +4 set_anon_super 173 177 +4 sctp_process_init 1312 1316 +4 ib_ucm_ctx_alloc 197 201 +4 ib_create_cm_id 287 290 +3 proc_mkdir_mode 95 97 +2 vfs_kern_mount 279 280 +1 send_mad 325 326 +1 proc_symlink 141 142 +1 proc_file_write 40 41 +1 kill_block_super 56 57 +1 get_sb_single 175 176 +1 free_proc_entry 108 109 +1 deactivate_super 126 127 +1 proc_readdir 353 352 -1 proc_getattr 40 39 -1 get_sb_nodev 150 149 -1 o2net_send_message_vec 2032 2030 -2 hwmon_device_register 198 196 -2 create_proc_entry 170 168 -2 __put_super_and_need_restart 54 52 -2 v9fs_read_work 1424 1421 -3 v9fs_mux_init 1333 1330 -3 v9fs_mux_flush_cb 303 300 -3 v9fs_mux_destroy 369 366 -3 v9fs_mux_cancel 382 379 -3 o2net_init 441 438 -3 inotify_add_watch 285 280 -5 v9fs_session_init 1490 1484 -6 unnamed_dev_idr 32 24 -8 unit_table 32 24 -8 tcp_ps 32 24 -8 sdp_ps 32 24 -8 sd_index_idr 32 24 -8 sctp_assocs_id 32 24 -8 rtc_idr 32 24 -8 query_idr 32 24 -8 proc_inum_idr 32 24 -8 posix_timers_id 32 24 -8 mmc_host_idr 32 24 -8 lpfc_hba_index 32 24 -8 ib_uverbs_srq_idr 32 24 -8 ib_uverbs_qp_idr 32 24 -8 ib_uverbs_pd_idr 32 24 -8 ib_uverbs_mw_idr 32 24 -8 ib_uverbs_mr_idr 32 24 -8 ib_uverbs_cq_idr 32 24 -8 ib_uverbs_ah_idr 32 24 -8 i2c_adapter_idr 32 24 -8 hwmon_idr 32 24 -8 ctx_id_table 32 24 -8 cm 112 104 -8 allocated_ptys 32 24 -8 _minor_idr 32 24 -8 i2c_add_adapter 494 484 -10 idr_cache_ctor 12 - -12 v9fs_get_idpool 149 134 -15 ib_cm_init 237 220 -17 free_layer 55 34 -21 idr_init 88 62 -26 idr_get_new 40 13 -27 idr_remove 357 328 -29 infinipath_init 
227 197 -30 lpfc_pci_probe_one 2752 2710 -42 ptmx_open 427 379 -48 idr_pre_get 59 - -59 alloc_layer 66 - -66 idr_get_new_above_int 533 - -533 o2net_nodes 99960 97920 -2040 --- diff --git a/arch/powerpc/mm/mmu_context_64.c b/arch/powerpc/mm/mmu_context_64.c index 90a06ac..85d6a03 100644 --- a/arch/powerpc/mm/mmu_context_64.c +++ b/arch/powerpc/mm/mmu_context_64.c @@ -26,20 +26,21 @@ static DEFINE_IDR(mmu_context_idr); int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { + struct idr_layer *layer = NULL; int index; int err; again: - if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL)) - return -ENOMEM; - spin_lock(&mmu_context_lock); - err = idr_get_new_above(&mmu_context_idr, NULL, 1, &index); + err = idr_get_new_above(&mmu_context_idr, NULL, 1, &index, layer); spin_unlock(&mmu_context_lock); - if (err == -EAGAIN) + if (err == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return -ENOMEM; goto again; - else if (err) + } else if (err) return err; if (index > MAX_CONTEXT) { diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c index bfdb902..0ae099e 100644 --- a/drivers/char/tty_io.c +++ b/drivers/char/tty_io.c @@ -2135,6 +2135,7 @@ #ifdef CONFIG_UNIX98_PTYS static int ptmx_open(struct inode * inode, struct file * filp) { struct tty_struct *tty; + struct idr_layer *layer = NULL; int retval; int index; int idr_ret; @@ -2143,24 +2144,27 @@ static int ptmx_open(struct inode * inod /* find a device that is not in use. 
*/ down(&allocated_ptys_lock); - if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) { - up(&allocated_ptys_lock); - return -ENOMEM; - } - idr_ret = idr_get_new(&allocated_ptys, NULL, &index); - if (idr_ret < 0) { - up(&allocated_ptys_lock); - if (idr_ret == -EAGAIN) - return -ENOMEM; - return -EIO; - } - if (index >= pty_limit) { - idr_remove(&allocated_ptys, index); - up(&allocated_ptys_lock); - return -EIO; - } + do { + idr_ret = idr_get_new(&allocated_ptys, NULL, &index, layer); + if (idr_ret == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) { + idr_ret = -ENOMEM; + break; + } + continue; + } + + if (index >= pty_limit) { + idr_remove(&allocated_ptys, index); + idr_ret = -EIO; + } + } while (idr_ret == -EAGAIN); up(&allocated_ptys_lock); + if (idr_ret) + return idr_ret == -EAGAIN ? -ENOMEM : -EIO; + mutex_lock(&tty_mutex); retval = init_dev(ptm_driver, index, &tty); mutex_unlock(&tty_mutex); diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c index 106fa01..82d6d04 100644 --- a/drivers/hwmon/hwmon.c +++ b/drivers/hwmon/hwmon.c @@ -38,20 +38,21 @@ static DEFINE_SPINLOCK(idr_lock); */ struct class_device *hwmon_device_register(struct device *dev) { + struct idr_layer *layer = NULL; struct class_device *cdev; int id, err; again: - if (unlikely(idr_pre_get(&hwmon_idr, GFP_KERNEL) == 0)) - return ERR_PTR(-ENOMEM); - spin_lock(&idr_lock); - err = idr_get_new(&hwmon_idr, NULL, &id); + err = idr_get_new(&hwmon_idr, NULL, &id, layer); spin_unlock(&idr_lock); - if (unlikely(err == -EAGAIN)) + if (unlikely(err == -EAGAIN)) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return ERR_PTR(-ENOMEM); goto again; - else if (unlikely(err)) + } else if (unlikely(err)) return ERR_PTR(err); id = id & MAX_ID_MASK; diff --git a/drivers/i2c/i2c-core.c b/drivers/i2c/i2c-core.c index 9cb277d..f8aa8ea 100644 --- a/drivers/i2c/i2c-core.c +++ b/drivers/i2c/i2c-core.c @@ -151,22 +151,24 @@ static struct device_attribute dev_attr_ int i2c_add_adapter(struct 
i2c_adapter *adap) { int id, res = 0; + struct idr_layer *layer = NULL; struct list_head *item; struct i2c_driver *driver; mutex_lock(&core_lists); - if (idr_pre_get(&i2c_adapter_idr, GFP_KERNEL) == 0) { - res = -ENOMEM; - goto out_unlock; - } - - res = idr_get_new(&i2c_adapter_idr, adap, &id); - if (res < 0) { - if (res == -EAGAIN) +again: + res = idr_get_new(&i2c_adapter_idr, adap, &id, layer); + if (res == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) { res = -ENOMEM; - goto out_unlock; + goto out_unlock; + } + goto again; } + if (res < 0) + goto out_unlock; adap->nr = id & MAX_ID_MASK; mutex_init(&adap->bus_lock); diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f85c97f..cf14d01 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -298,6 +298,7 @@ static int cm_init_av_by_path(struct ib_ static int cm_alloc_id(struct cm_id_private *cm_id_priv) { + struct idr_layer *layer = NULL; unsigned long flags; int ret; static int next_id; @@ -305,9 +306,15 @@ static int cm_alloc_id(struct cm_id_priv do { spin_lock_irqsave(&cm.lock, flags); ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, next_id++, - (__force int *) &cm_id_priv->id.local_id); + (__force int *) &cm_id_priv->id.local_id, + layer); spin_unlock_irqrestore(&cm.lock, flags); - } while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) ); + if (ret == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + ret = -ENOMEM; + } + } while (ret == -EAGAIN); return ret; } @@ -3347,7 +3354,6 @@ static int __init ib_cm_init(void) cm.remote_qp_table = RB_ROOT; cm.remote_sidr_table = RB_ROOT; idr_init(&cm.local_id_table); - idr_pre_get(&cm.local_id_table, GFP_KERNEL); cm.wq = create_workqueue("ib_cm"); if (!cm.wq) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d6f99d5..314b150 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1433,6 +1433,7 @@ static 
void cma_bind_port(struct rdma_bi static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, unsigned short snum) { + struct idr_layer *layer = NULL; struct rdma_bind_list *bind_list; int port, start, ret; @@ -1443,8 +1444,13 @@ static int cma_alloc_port(struct idr *ps start = snum ? snum : sysctl_local_port_range[0]; do { - ret = idr_get_new_above(ps, bind_list, start, &port); - } while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL)); + ret = idr_get_new_above(ps, bind_list, start, &port, layer); + if (ret == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + ret = -ENOMEM; + } + } while (ret == -EAGAIN); if (ret) goto err; diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index aeda484..824b652 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -490,17 +490,20 @@ static void init_mad(struct ib_sa_mad *m static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask) { + struct idr_layer *layer = NULL; unsigned long flags; int ret, id; retry: - if (!idr_pre_get(&query_idr, gfp_mask)) - return -ENOMEM; spin_lock_irqsave(&idr_lock, flags); - ret = idr_get_new(&query_idr, query, &id); + ret = idr_get_new(&query_idr, query, &id, layer); spin_unlock_irqrestore(&idr_lock, flags); - if (ret == -EAGAIN) - goto retry; + if (ret == -EAGAIN) { + layer = idr_alloc_layer(gfp_mask); + if (layer) + goto retry; + ret = -ENOMEM; + } if (ret) return ret; diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index c1c6fda..22d5e24 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -173,6 +173,7 @@ static void ib_ucm_cleanup_events(struct static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) { + struct idr_layer *layer = NULL; struct ib_ucm_context *ctx; int result; @@ -186,13 +187,15 @@ static struct ib_ucm_context *ib_ucm_ctx INIT_LIST_HEAD(&ctx->events); do { - result = 
idr_pre_get(&ctx_id_table, GFP_KERNEL); - if (!result) - goto error; - mutex_lock(&ctx_id_mutex); - result = idr_get_new(&ctx_id_table, ctx, &ctx->id); + result = idr_get_new(&ctx_id_table, ctx, &ctx->id, layer); mutex_unlock(&ctx_id_mutex); + + if (result == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + result = -ENOMEM; + } } while (result == -EAGAIN); if (result) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index bdf5d50..71dea88 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -109,18 +109,20 @@ static void put_uobj_write(struct ib_uob static int idr_add_uobj(struct idr *idr, struct ib_uobject *uobj) { + struct idr_layer *layer = NULL; int ret; -retry: - if (!idr_pre_get(idr, GFP_KERNEL)) - return -ENOMEM; + do { + spin_lock(&ib_uverbs_idr_lock); + ret = idr_get_new(idr, uobj, &uobj->id, layer); + spin_unlock(&ib_uverbs_idr_lock); - spin_lock(&ib_uverbs_idr_lock); - ret = idr_get_new(idr, uobj, &uobj->id); - spin_unlock(&ib_uverbs_idr_lock); - - if (ret == -EAGAIN) - goto retry; + if (ret == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + ret = -ENOMEM; + } + } while (ret == -EAGAIN); return ret; } diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 823131d..6f2a711 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -168,15 +168,11 @@ static void ipath_free_devdata(struct pc static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev) { + struct idr_layer *layer = NULL; unsigned long flags; struct ipath_devdata *dd; int ret; - if (!idr_pre_get(&unit_table, GFP_KERNEL)) { - dd = ERR_PTR(-ENOMEM); - goto bail; - } - dd = vmalloc(sizeof(*dd)); if (!dd) { dd = ERR_PTR(-ENOMEM); @@ -187,7 +183,19 @@ static struct ipath_devdata *ipath_alloc spin_lock_irqsave(&ipath_devs_lock, flags); - ret = idr_get_new(&unit_table, dd, 
&dd->ipath_unit); + do { + ret = idr_get_new(&unit_table, dd, &dd->ipath_unit, layer); + if (ret == -EAGAIN) { + spin_unlock_irqrestore(&ipath_devs_lock, flags); + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) { + dd = ERR_PTR(-ENOMEM); + goto bail; + } + spin_lock_irqsave(&ipath_devs_lock, flags); + } + } while (ret == -EAGAIN); + if (ret < 0) { printk(KERN_ERR IPATH_DRV_NAME ": Could not allocate unit ID: error %d\n", -ret); @@ -1754,10 +1762,6 @@ static int __init infinipath_init(void) * the PCI subsystem. */ idr_init(&unit_table); - if (!idr_pre_get(&unit_table, GFP_KERNEL)) { - ret = -ENOMEM; - goto bail; - } ret = pci_register_driver(&ipath_driver); if (ret < 0) { @@ -1780,7 +1784,7 @@ static int __init infinipath_init(void) goto bail_group; } - goto bail; + return ret; bail_group: ipath_driver_remove_group(&ipath_driver.driver); @@ -1791,7 +1795,6 @@ bail_pci: bail_unit: idr_destroy(&unit_table); -bail: return ret; } diff --git a/drivers/md/dm.c b/drivers/md/dm.c index c99bf9f..e2dd20a 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -807,23 +807,30 @@ static void free_minor(int minor) */ static int specific_minor(struct mapped_device *md, int minor) { + struct idr_layer *layer = NULL; int r, m; if (minor >= (1 << MINORBITS)) return -EINVAL; - r = idr_pre_get(&_minor_idr, GFP_KERNEL); - if (!r) - return -ENOMEM; - spin_lock(&_minor_lock); +again: if (idr_find(&_minor_idr, minor)) { r = -EBUSY; goto out; } - r = idr_get_new_above(&_minor_idr, MINOR_ALLOCED, minor, &m); + r = idr_get_new_above(&_minor_idr, MINOR_ALLOCED, minor, &m, layer); + if (r == -EAGAIN) { + spin_unlock(&_minor_lock); + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return -ENOMEM; + + spin_lock(&_minor_lock); + goto again; + } if (r) goto out; @@ -840,18 +847,21 @@ out: static int next_free_minor(struct mapped_device *md, int *minor) { + struct idr_layer *layer = NULL; int r, m; - r = idr_pre_get(&_minor_idr, GFP_KERNEL); - if (!r) - return -ENOMEM; - +again: 
spin_lock(&_minor_lock); - - r = idr_get_new(&_minor_idr, MINOR_ALLOCED, &m); - if (r) { - goto out; + r = idr_get_new(&_minor_idr, MINOR_ALLOCED, &m, layer); + if (r == -EAGAIN) { + spin_unlock(&_minor_lock); + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return -ENOMEM; + goto again; } + if (r) + goto out; if (m >= (1 << MINORBITS)) { idr_remove(&_minor_idr, m); diff --git a/drivers/mmc/mmc_sysfs.c b/drivers/mmc/mmc_sysfs.c index a2a35fd..bf61ff4 100644 --- a/drivers/mmc/mmc_sysfs.c +++ b/drivers/mmc/mmc_sysfs.c @@ -280,14 +280,19 @@ struct mmc_host *mmc_alloc_host_sysfs(in */ int mmc_add_host_sysfs(struct mmc_host *host) { + struct idr_layer *layer = NULL; int err; - if (!idr_pre_get(&mmc_host_idr, GFP_KERNEL)) - return -ENOMEM; - +again: spin_lock(&mmc_host_lock); - err = idr_get_new(&mmc_host_idr, host, &host->index); + err = idr_get_new(&mmc_host_idr, host, &host->index, layer); spin_unlock(&mmc_host_lock); + if (err == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return -ENOMEM; + goto again; + } if (err) return err; diff --git a/drivers/rtc/class.c b/drivers/rtc/class.c index 1cb61a7..3388059 100644 --- a/drivers/rtc/class.c +++ b/drivers/rtc/class.c @@ -42,19 +42,21 @@ struct rtc_device *rtc_device_register(c struct rtc_class_ops *ops, struct module *owner) { + struct idr_layer *layer = NULL; struct rtc_device *rtc; int id, err; - if (idr_pre_get(&rtc_idr, GFP_KERNEL) == 0) { - err = -ENOMEM; - goto exit; - } - - +again: mutex_lock(&idr_lock); - err = idr_get_new(&rtc_idr, NULL, &id); + err = idr_get_new(&rtc_idr, NULL, &id, layer); mutex_unlock(&idr_lock); + if (err == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return ERR_PTR(-ENOMEM); + goto again; + } if (err < 0) goto exit; diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c index 81755a3..401608b 100644 --- a/drivers/scsi/lpfc/lpfc_init.c +++ b/drivers/scsi/lpfc/lpfc_init.c @@ -1421,6 +1421,7 @@ lpfc_pci_probe_one(struct 
pci_dev *pdev, struct lpfc_hba *phba; struct lpfc_sli *psli; struct lpfc_iocbq *iocbq_entry = NULL, *iocbq_next = NULL; + struct idr_layer *layer = NULL; unsigned long bar0map_len, bar2map_len; int error = -ENODEV, retval; int i; @@ -1443,10 +1444,16 @@ lpfc_pci_probe_one(struct pci_dev *pdev, phba->pcidev = pdev; /* Assign an unused board number */ - if (!idr_pre_get(&lpfc_hba_index, GFP_KERNEL)) - goto out_put_host; - - error = idr_get_new(&lpfc_hba_index, NULL, &phba->brd_no); +again: + error = idr_get_new(&lpfc_hba_index, NULL, &phba->brd_no, layer); + if (error == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) { + error = -ENOMEM; + goto out_put_host; + } + goto again; + } if (error) goto out_put_host; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 3225d31..fe504be 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1617,6 +1617,7 @@ static int sd_probe(struct device *dev) struct scsi_device *sdp = to_scsi_device(dev); struct scsi_disk *sdkp; struct gendisk *gd; + struct idr_layer *layer = NULL; u32 index; int error; @@ -1636,13 +1637,19 @@ static int sd_probe(struct device *dev) if (!gd) goto out_free; - if (!idr_pre_get(&sd_index_idr, GFP_KERNEL)) - goto out_put; - +again: spin_lock(&sd_index_lock); - error = idr_get_new(&sd_index_idr, NULL, &index); + error = idr_get_new(&sd_index_idr, NULL, &index, layer); spin_unlock(&sd_index_lock); + if (error == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) { + error = -ENOMEM; + goto out_put; + } + goto again; + } if (index >= SD_MAX_DISKS) error = -EBUSY; if (error) diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c index 22f7ccd..d03c3ba 100644 --- a/fs/9p/v9fs.c +++ b/fs/9p/v9fs.c @@ -198,25 +198,26 @@ struct v9fs_session_info *v9fs_inode2v9s int v9fs_get_idpool(struct v9fs_idpool *p) { + struct idr_layer *layer = NULL; int i = 0; int error; retry: - if (idr_pre_get(&p->pool, GFP_KERNEL) == 0) - return 0; - if (down_interruptible(&p->lock) == -EINTR) { eprintk(KERN_WARNING, 
"Interrupted while locking\n"); return -1; } /* no need to store exactly p, we just need something non-null */ - error = idr_get_new(&p->pool, p, &i); + error = idr_get_new(&p->pool, p, &i, layer); up(&p->lock); - if (error == -EAGAIN) + if (error == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return -1; goto retry; - else if (error) + } else if (error) return -1; return i; diff --git a/fs/inotify.c b/fs/inotify.c index 723836a..7e12bed 100644 --- a/fs/inotify.c +++ b/fs/inotify.c @@ -131,12 +131,16 @@ EXPORT_SYMBOL_GPL(put_inotify_watch); static int inotify_handle_get_wd(struct inotify_handle *ih, struct inotify_watch *watch) { + struct idr_layer *layer = NULL; int ret; do { - if (unlikely(!idr_pre_get(&ih->idr, GFP_KERNEL))) - return -ENOSPC; - ret = idr_get_new_above(&ih->idr, watch, ih->last_wd+1, &watch->wd); + ret = idr_get_new_above(&ih->idr, watch, ih->last_wd+1, &watch->wd, layer); + if (ret == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + ret = -ENOSPC; + } } while (ret == -EAGAIN); if (likely(!ret)) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index b650efa..f9d1b04 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -175,19 +175,21 @@ static u8 o2net_num_from_nn(struct o2net static int o2net_prep_nsw(struct o2net_node *nn, struct o2net_status_wait *nsw) { + struct idr_layer *layer = NULL; int ret = 0; do { - if (!idr_pre_get(&nn->nn_status_idr, GFP_ATOMIC)) { - ret = -EAGAIN; - break; - } spin_lock(&nn->nn_lock); - ret = idr_get_new(&nn->nn_status_idr, nsw, &nsw->ns_id); + ret = idr_get_new(&nn->nn_status_idr, nsw, &nsw->ns_id, layer); if (ret == 0) list_add_tail(&nsw->ns_node_item, &nn->nn_status_list); spin_unlock(&nn->nn_lock); + if (ret == -EAGAIN) { + layer = idr_alloc_layer(GFP_ATOMIC); + if (!layer) + ret = -ENOMEM; + } } while (ret == -EAGAIN); if (ret == 0) { diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 4ba0300..60bdd42 100644 --- a/fs/proc/generic.c +++ 
b/fs/proc/generic.c @@ -318,19 +318,20 @@ #define PROC_DYNAMIC_FIRST 0xF0000000UL */ static unsigned int get_inode_number(void) { + struct idr_layer *layer = NULL; int i, inum = 0; int error; retry: - if (idr_pre_get(&proc_inum_idr, GFP_KERNEL) == 0) - return 0; - spin_lock(&proc_inum_lock); - error = idr_get_new(&proc_inum_idr, NULL, &i); + error = idr_get_new(&proc_inum_idr, NULL, &i, layer); spin_unlock(&proc_inum_lock); - if (error == -EAGAIN) + if (error == -EAGAIN) { + layer = idr_alloc_layer(GFP_KERNEL); + if (!layer) + return 0; goto retry; - else if (error) + } else if (error) return 0; inum = (i & MAX_ID_MASK) + PROC_DYNAMIC_FIRST; diff --git a/fs/super.c b/fs/super.c index 6d4e817..67455f5 100644 --- a/fs/super.c +++ b/fs/super.c @@ -607,19 +607,20 @@ static DEFINE_SPINLOCK(unnamed_dev_lock) int set_anon_super(struct super_block *s, void *data) { + struct idr_layer *layer = NULL; int dev; int error; retry: - if (idr_pre_get(&unnamed_dev_idr, GFP_ATOMIC) == 0) - return -ENOMEM; spin_lock(&unnamed_dev_lock); - error = idr_get_new(&unnamed_dev_idr, NULL, &dev); + error = idr_get_new(&unnamed_dev_idr, NULL, &dev, layer); spin_unlock(&unnamed_dev_lock); - if (error == -EAGAIN) - /* We raced and lost with another CPU. 
*/ + if (error == -EAGAIN) { + layer = idr_alloc_layer(GFP_ATOMIC); + if (!layer) + return -ENOMEM; goto retry; - else if (error) + } else if (error) return -EAGAIN; if ((dev & MAX_ID_MASK) == (1 << MINORBITS)) { diff --git a/include/linux/idr.h b/include/linux/idr.h index 8268034..de34c4e 100644 --- a/include/linux/idr.h +++ b/include/linux/idr.h @@ -44,7 +44,7 @@ #define MAX_ID_MASK (MAX_ID_BIT - 1) #define MAX_LEVEL (MAX_ID_SHIFT + IDR_BITS - 1) / IDR_BITS /* Number of id_layer structs to leave in free list */ -#define IDR_FREE_MAX MAX_LEVEL + MAX_LEVEL +#define IDR_FREE_MAX (MAX_LEVEL + MAX_LEVEL) struct idr_layer { unsigned long bitmap; /* A zero bit means "space here" */ @@ -57,7 +57,6 @@ struct idr { struct idr_layer *id_free; int layers; int id_free_cnt; - spinlock_t lock; }; #define IDR_INIT(name) \ @@ -66,7 +65,6 @@ #define IDR_INIT(name) \ .id_free = NULL, \ .layers = 0, \ .id_free_cnt = 0, \ - .lock = __SPIN_LOCK_UNLOCKED(name.lock), \ } #define DEFINE_IDR(name) struct idr name = IDR_INIT(name) @@ -75,9 +73,10 @@ #define DEFINE_IDR(name) struct idr name */ void *idr_find(struct idr *idp, int id); -int idr_pre_get(struct idr *idp, gfp_t gfp_mask); -int idr_get_new(struct idr *idp, void *ptr, int *id); -int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id); +struct idr_layer *idr_alloc_layer(gfp_t gfp_mask); +int idr_get_new(struct idr *idp, void *ptr, int *id, struct idr_layer *layer); +int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id, + struct idr_layer *layer); void *idr_replace(struct idr *idp, void *ptr, int id); void idr_remove(struct idr *idp, int id); void idr_destroy(struct idr *idp); diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index ac6dc87..18891b2 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -434,6 +434,7 @@ sys_timer_create(const clockid_t which_c struct sigevent __user *timer_event_spec, timer_t __user * created_timer_id) { + struct idr_layer *layer = NULL; 
 	int error = 0;
 	struct k_itimer *new_timer = NULL;
 	int new_timer_id;
@@ -451,17 +452,16 @@ sys_timer_create(const clockid_t which_c
 	spin_lock_init(&new_timer->it_lock);
 retry:
-	if (unlikely(!idr_pre_get(&posix_timers_id, GFP_KERNEL))) {
-		error = -EAGAIN;
-		goto out;
-	}
 	spin_lock_irq(&idr_lock);
 	error = idr_get_new(&posix_timers_id, (void *) new_timer,
-			    &new_timer_id);
+			    &new_timer_id, layer);
 	spin_unlock_irq(&idr_lock);
-	if (error == -EAGAIN)
-		goto retry;
-	else if (error) {
+	if (error == -EAGAIN) {
+		layer = idr_alloc_layer(GFP_KERNEL);
+		if (layer)
+			goto retry;
+	}
+	if (error) {
 		/*
 		 * Wierd looking, but we return EAGAIN if the IDR is
 		 * full (proper POSIX return value for this)
diff --git a/lib/idr.c b/lib/idr.c
index 16d2143..187ce4d 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -35,65 +35,39 @@
 #include <linux/idr.h>

 static kmem_cache_t *idr_layer_cache;

-static struct idr_layer *alloc_layer(struct idr *idp)
+static struct idr_layer *get_layer(struct idr *idp)
 {
 	struct idr_layer *p;
-	unsigned long flags;

-	spin_lock_irqsave(&idp->lock, flags);
-	if ((p = idp->id_free)) {
+	p = idp->id_free;
+	if (p) {
 		idp->id_free = p->ary[0];
 		idp->id_free_cnt--;
 		p->ary[0] = NULL;
 	}
-	spin_unlock_irqrestore(&idp->lock, flags);
-	return(p);
-}

-/* only called when idp->lock is held */
-static void __free_layer(struct idr *idp, struct idr_layer *p)
-{
-	p->ary[0] = idp->id_free;
-	idp->id_free = p;
-	idp->id_free_cnt++;
+	return p;
 }

 static void free_layer(struct idr *idp, struct idr_layer *p)
 {
-	unsigned long flags;
-
-	/*
-	 * Depends on the return element being zeroed.
-	 */
-	spin_lock_irqsave(&idp->lock, flags);
-	__free_layer(idp, p);
-	spin_unlock_irqrestore(&idp->lock, flags);
+	if (idp->id_free_cnt < IDR_FREE_MAX) {
+		p->ary[0] = idp->id_free;
+		idp->id_free = p;
+		idp->id_free_cnt++;
+	} else
+		kmem_cache_free(idr_layer_cache, p);
 }

 /**
- * idr_pre_get - reserver resources for idr allocation
- * @idp:	idr handle
+ * idr_alloc_layer - reserve resources for idr allocation
  * @gfp_mask:	memory allocation flags
- *
- * This function should be called prior to locking and calling the
- * following function.  It preallocates enough memory to satisfy
- * the worst possible allocation.
- *
- * If the system is REALLY out of memory this function returns 0,
- * otherwise 1.
  */
-int idr_pre_get(struct idr *idp, gfp_t gfp_mask)
+struct idr_layer *idr_alloc_layer(gfp_t gfp_mask)
 {
-	while (idp->id_free_cnt < IDR_FREE_MAX) {
-		struct idr_layer *new;
-		new = kmem_cache_alloc(idr_layer_cache, gfp_mask);
-		if (new == NULL)
-			return (0);
-		free_layer(idp, new);
-	}
-	return 1;
+	return kmem_cache_zalloc(idr_layer_cache, gfp_mask);
 }
-EXPORT_SYMBOL(idr_pre_get);
+EXPORT_SYMBOL(idr_alloc_layer);

 static int sub_alloc(struct idr *idp, void *ptr, int *starting_id)
 {
@@ -136,7 +110,7 @@ static int sub_alloc(struct idr *idp, vo
 		 * Create the layer below if it is missing.
 		 */
 		if (!p->ary[m]) {
-			if (!(new = alloc_layer(idp)))
+			if (!(new = get_layer(idp)))
 				return -1;
 			p->ary[m] = new;
 			p->count++;
@@ -171,14 +145,13 @@ static int idr_get_new_above_int(struct
 {
 	struct idr_layer *p, *new;
 	int layers, v, id;
-	unsigned long flags;

 	id = starting_id;
 build_up:
 	p = idp->top;
 	layers = idp->layers;
 	if (unlikely(!p)) {
-		if (!(p = alloc_layer(idp)))
+		if (!(p = get_layer(idp)))
 			return -1;
 		layers = 1;
 	}
@@ -190,19 +163,17 @@ build_up:
 		layers++;
 		if (!p->count)
 			continue;
-		if (!(new = alloc_layer(idp))) {
+		if (!(new = get_layer(idp))) {
 			/*
 			 * The allocation failed.  If we built part of
 			 * the structure tear it down.
 			 */
-			spin_lock_irqsave(&idp->lock, flags);
 			for (new = p; p && p != idp->top; new = p) {
 				p = p->ary[0];
 				new->ary[0] = NULL;
 				new->bitmap = new->count = 0;
-				__free_layer(idp, new);
+				free_layer(idp, new);
 			}
-			spin_unlock_irqrestore(&idp->lock, flags);
 			return -1;
 		}
 		new->ary[0] = p;
@@ -216,7 +187,7 @@ build_up:

 	v = sub_alloc(idp, ptr, &id);
 	if (v == -2)
 		goto build_up;
-	return(v);
+	return v;
 }

 /**
@@ -225,20 +196,25 @@ build_up:
  * @ptr: pointer you want associated with the ide
  * @start_id: id to start search at
  * @id: pointer to the allocated handle
+ * @layer: pointer to extra storage
  *
  * This is the allocate id function.  It should be called with any
  * required locks.
  *
  * If memory is required, it will return -EAGAIN, you should unlock
- * and go back to the idr_pre_get() call.  If the idr is full, it will
- * return -ENOSPC.
+ * and call idr_alloc_layer() to get a new layer to pass in as the
+ * @layer parameter.  If the idr is full, it will return -ENOSPC.
  *
  * @id returns a value in the range 0 ... 0x7fffffff
  */
-int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id)
+int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id,
+		      struct idr_layer *layer)
 {
 	int rv;

+	if (layer)
+		free_layer(idp, layer);
+
 	rv = idr_get_new_above_int(idp, ptr, starting_id);
 	/*
 	 * This is a cheap hack until the IDR code can be fixed to
@@ -260,33 +236,20 @@ EXPORT_SYMBOL(idr_get_new_above);
  * @idp: idr handle
  * @ptr: pointer you want associated with the ide
  * @id: pointer to the allocated handle
+ * @layer: pointer to extra storage
  *
  * This is the allocate id function.  It should be called with any
  * required locks.
  *
  * If memory is required, it will return -EAGAIN, you should unlock
- * and go back to the idr_pre_get() call.  If the idr is full, it will
- * return -ENOSPC.
+ * and call idr_alloc_layer() to get a new layer to pass in as the
+ * @layer parameter.  If the idr is full, it will return -ENOSPC.
  *
  * @id returns a value in the range 0 ... 0x7fffffff
  */
-int idr_get_new(struct idr *idp, void *ptr, int *id)
+int idr_get_new(struct idr *idp, void *ptr, int *id, struct idr_layer *layer)
 {
-	int rv;
-
-	rv = idr_get_new_above_int(idp, ptr, 0);
-	/*
-	 * This is a cheap hack until the IDR code can be fixed to
-	 * return proper error values.
-	 */
-	if (rv < 0) {
-		if (rv == -1)
-			return -EAGAIN;
-		else /* Will be -3 */
-			return -ENOSPC;
-	}
-	*id = rv;
-	return 0;
+	return idr_get_new_above(idp, ptr, 0, id, layer);
 }
 EXPORT_SYMBOL(idr_get_new);
@@ -349,11 +312,6 @@ void idr_remove(struct idr *idp, int id)
 		idp->top = p;
 		--idp->layers;
 	}
-	while (idp->id_free_cnt >= IDR_FREE_MAX) {
-		p = alloc_layer(idp);
-		kmem_cache_free(idr_layer_cache, p);
-		return;
-	}
 }
 EXPORT_SYMBOL(idr_remove);
@@ -364,7 +322,7 @@ EXPORT_SYMBOL(idr_remove);
 void idr_destroy(struct idr *idp)
 {
 	while (idp->id_free_cnt) {
-		struct idr_layer *p = alloc_layer(idp);
+		struct idr_layer *p = get_layer(idp);
 		kmem_cache_free(idr_layer_cache, p);
 	}
 }
@@ -445,17 +403,11 @@ void *idr_replace(struct idr *idp, void
 }
 EXPORT_SYMBOL(idr_replace);

-static void idr_cache_ctor(void * idr_layer, kmem_cache_t *idr_layer_cache,
-		unsigned long flags)
-{
-	memset(idr_layer, 0, sizeof(struct idr_layer));
-}
-
 static int init_id_cache(void)
 {
 	if (!idr_layer_cache)
 		idr_layer_cache = kmem_cache_create("idr_layer_cache",
-			sizeof(struct idr_layer), 0, 0, idr_cache_ctor, NULL);
+			sizeof(struct idr_layer), 0, 0, NULL, NULL);
 	return 0;
 }
@@ -470,6 +422,5 @@ void idr_init(struct idr *idp)
 {
 	init_id_cache();
 	memset(idp, 0, sizeof(struct idr));
-	spin_lock_init(&idp->lock);
 }
 EXPORT_SYMBOL(idr_init);
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 2a87736..fc712c4 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1928,6 +1928,7 @@ int sctp_process_init(struct sctp_associ
 	 * association.
 	 */
 	if (!asoc->temp) {
+		struct idr_layer *layer = NULL;
 		int assoc_id;
 		int error;
@@ -1937,15 +1938,16 @@ int sctp_process_init(struct sctp_associ
 			goto clean_up;
 retry:
-		if (unlikely(!idr_pre_get(&sctp_assocs_id, gfp)))
-			goto clean_up;
 		spin_lock_bh(&sctp_assocs_id_lock);
 		error = idr_get_new_above(&sctp_assocs_id, (void *)asoc, 1,
-					  &assoc_id);
+					  &assoc_id, layer);
 		spin_unlock_bh(&sctp_assocs_id_lock);
-		if (error == -EAGAIN)
-			goto retry;
-		else if (error)
+		if (error == -EAGAIN) {
+			layer = idr_alloc_layer(gfp);
+			if (layer)
+				goto retry;
+		}
+		if (error)
 			goto clean_up;
 		asoc->assoc_id = (sctp_assoc_t) assoc_id;

^ permalink raw reply related	[flat|nested] 32+ messages in thread
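The caller-side pattern the patch above converts everything to — try the insert with the lock held, and on -EAGAIN drop the lock, allocate a spare layer, and retry passing the spare in — can be sketched in plain userspace C. This is a toy model, not the kernel code: `my_idr`, `get_new_id()`, `demo_alloc()`, and the `fake_lock()` helpers are made-up names, `-11` stands in for `-EAGAIN`, and a flag stands in for the spinlock the real callers hold.

```c
#include <assert.h>
#include <stdlib.h>

struct layer { struct layer *next; };

struct my_idr {
    struct layer *free_list;  /* preallocated layers, like idp->id_free */
    int next_id;
};

/* Stand-in for the spinlock the caller holds (idr_lock in the hunk above). */
static int lock_held;
static void fake_lock(void)   { lock_held = 1; }
static void fake_unlock(void) { lock_held = 0; }

/* idr_alloc_layer() analogue: a plain allocation that touches no idr
 * state, so it can safely block without any lock held. */
static struct layer *alloc_layer_outside_lock(void)
{
    return calloc(1, sizeof(struct layer));
}

/* idr_get_new() analogue: must be called with the lock held.  A spare
 * layer handed in by the caller is stashed first; -11 plays -EAGAIN. */
static int get_new_id(struct my_idr *idr, struct layer *spare, int *id)
{
    assert(lock_held);
    if (spare) {                        /* stash caller's preallocation */
        spare->next = idr->free_list;
        idr->free_list = spare;
    }
    if (!idr->free_list)
        return -11;                     /* caller must preallocate */
    struct layer *l = idr->free_list;   /* consume one layer per id */
    idr->free_list = l->next;
    free(l);
    *id = idr->next_id++;
    return 0;
}

/* Caller-side retry loop, mirroring the sys_timer_create() hunk above. */
static int demo_alloc(struct my_idr *idr, int *id)
{
    struct layer *spare = NULL;
    int err;
retry:
    fake_lock();
    err = get_new_id(idr, spare, id);
    fake_unlock();
    if (err == -11) {
        spare = alloc_layer_outside_lock();  /* no locks held here */
        if (spare)
            goto retry;
    }
    return err;
}
```

The point of the shape is that the only allocation happens with no locks held, so the idr itself no longer needs an internal lock at all, which is what lets the patch delete `idp->lock` and the lockdep chain through it.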
* Re: [openib-general] ipoib lockdep warning
  2006-07-11 20:10 ipoib lockdep warning Zach Brown
  2006-07-11 21:16 ` Michael S. Tsirkin
  2006-07-11 22:54 ` Roland Dreier
@ 2006-07-12  2:33 ` Roland Dreier
  2006-07-12  6:49   ` Arjan van de Ven
  2 siblings, 1 reply; 32+ messages in thread
From: Roland Dreier @ 2006-07-12 2:33 UTC (permalink / raw)
To: Zach Brown
Cc: openib-general, Linux Kernel Mailing List, Arjan van de Ven,
	Ingo Molnar

OK, the patch to lib/idr.c below is at least one way to fix this.
However, with that applied I get the lockdep warning below, which
seems to be a false positive -- I'm not sure what the right fix is,
but the blame really seems to fall on udp_ioctl for poking into the
sk_buff_head lock itself.

Ingo and/or Arjan, any thoughts on the idr.c change or the
sk_buff_head warning?

Thanks,
  Roland

Here's the idr.c change:

diff --git a/lib/idr.c b/lib/idr.c
index 4d09681..16d2143 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -38,14 +38,15 @@ static kmem_cache_t *idr_layer_cache;
 static struct idr_layer *alloc_layer(struct idr *idp)
 {
 	struct idr_layer *p;
+	unsigned long flags;

-	spin_lock(&idp->lock);
+	spin_lock_irqsave(&idp->lock, flags);
 	if ((p = idp->id_free)) {
 		idp->id_free = p->ary[0];
 		idp->id_free_cnt--;
 		p->ary[0] = NULL;
 	}
-	spin_unlock(&idp->lock);
+	spin_unlock_irqrestore(&idp->lock, flags);
 	return(p);
 }

@@ -59,12 +60,14 @@ static void __free_layer(struct idr *idp

 static void free_layer(struct idr *idp, struct idr_layer *p)
 {
+	unsigned long flags;
+
 	/*
 	 * Depends on the return element being zeroed.
 	 */
-	spin_lock(&idp->lock);
+	spin_lock_irqsave(&idp->lock, flags);
 	__free_layer(idp, p);
-	spin_unlock(&idp->lock);
+	spin_unlock_irqrestore(&idp->lock, flags);
 }

 /**
@@ -168,6 +171,7 @@ static int idr_get_new_above_int(struct
 {
 	struct idr_layer *p, *new;
 	int layers, v, id;
+	unsigned long flags;

 	id = starting_id;
 build_up:
@@ -191,14 +195,14 @@ build_up:
 			 * The allocation failed.  If we built part of
 			 * the structure tear it down.
 			 */
-			spin_lock(&idp->lock);
+			spin_lock_irqsave(&idp->lock, flags);
 			for (new = p; p && p != idp->top; new = p) {
 				p = p->ary[0];
 				new->ary[0] = NULL;
 				new->bitmap = new->count = 0;
 				__free_layer(idp, new);
 			}
-			spin_unlock(&idp->lock);
+			spin_unlock_irqrestore(&idp->lock, flags);
 			return -1;
 		}
 		new->ary[0] = p;

And here's the warning I get, which appears to be a false positive:

======================================================
[ INFO: hard-safe -> hard-unsafe lock order detected ]
------------------------------------------------------
swapper/0 [HC0[0]:SC1[2]:HE0:SE0] is trying to acquire:
 (&skb_queue_lock_key){-+..}, at: [<ffffffff8038683e>] skb_queue_tail+0x1d/0x47

and this task is already holding:
 (&priv->lock){.+..}, at: [<ffffffff881efd5b>] ipoib_mcast_send+0x29/0x413 [ib_ipoib]
which would create a new lock dependency:
 (&priv->lock){.+..} -> (&skb_queue_lock_key){-+..}

but this new dependency connects a hard-irq-safe lock:
 (&priv->tx_lock){+...}
... which became hard-irq-safe at:
  [<ffffffff8024172a>] lock_acquire+0x4a/0x69
  [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
  [<ffffffff881ee615>] ipoib_ib_completion+0x340/0x3ff [ib_ipoib]
  [<ffffffff881ca8a5>] mthca_cq_completion+0x65/0x6b [ib_mthca]
  [<ffffffff881ca1bc>] mthca_eq_int+0x70/0x3d2 [ib_mthca]
  [<ffffffff881ca644>] mthca_arbel_interrupt+0x3b/0x98 [ib_mthca]
  [<ffffffff802537de>] handle_IRQ_event+0x28/0x64
  [<ffffffff802538c6>] __do_IRQ+0xac/0x117
  [<ffffffff8020becf>] do_IRQ+0xf7/0x108
  [<ffffffff80209bd8>] common_interrupt+0x64/0x65

to a hard-irq-unsafe lock:
 (&skb_queue_lock_key){-+..}
... which became hard-irq-unsafe at:
...
  [<ffffffff8024172a>] lock_acquire+0x4a/0x69
  [<ffffffff803eb349>] _spin_lock_bh+0x26/0x33
  [<ffffffff803c6de8>] udp_ioctl+0x46/0x87
  [<ffffffff803cd1ab>] inet_ioctl+0x8c/0x8f
  [<ffffffff8038203c>] sock_ioctl+0x1c0/0x1ea
  [<ffffffff8028591e>] do_ioctl+0x26/0x74
  [<ffffffff80285bb6>] vfs_ioctl+0x24a/0x264
  [<ffffffff80285c11>] sys_ioctl+0x41/0x68
  [<ffffffff802096a1>] system_call+0x7d/0x83

other info that might help us debug this:

2 locks held by swapper/0:
 #0:  (&priv->tx_lock){+...}, at: [<ffffffff881ed6ef>] ipoib_start_xmit+0x42/0x66d [ib_ipoib]
 #1:  (&priv->lock){.+..}, at: [<ffffffff881efd5b>] ipoib_mcast_send+0x29/0x413 [ib_ipoib]

the hard-irq-safe lock's dependencies:
-> (&priv->tx_lock){+...} ops: 0 {
   initial-use at:
    [<ffffffff8024172a>] lock_acquire+0x4a/0x69
    [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
    [<ffffffff881ef705>] ipoib_mcast_join_finish+0x340/0x379 [ib_ipoib]
    [<ffffffff881ef7e0>] ipoib_mcast_join_complete+0xa2/0x294 [ib_ipoib]
    [<ffffffff881e6389>] ib_sa_mcmember_rec_callback+0x4b/0x57 [ib_sa]
    [<ffffffff881e648d>] recv_handler+0x3d/0x4a [ib_sa]
    [<ffffffff881b9c20>] ib_mad_completion_handler+0x3df/0x5e4 [ib_mad]
    [<ffffffff80238422>] run_workqueue+0xa0/0xf7
    [<ffffffff8023861d>] worker_thread+0xee/0x122
    [<ffffffff8023b661>] kthread+0xd0/0xfb
    [<ffffffff8020a5f1>] child_rip+0x7/0x12
   in-hardirq-W at:
    [<ffffffff8024172a>] lock_acquire+0x4a/0x69
    [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
    [<ffffffff881ee615>] ipoib_ib_completion+0x340/0x3ff [ib_ipoib]
    [<ffffffff881ca8a5>] mthca_cq_completion+0x65/0x6b [ib_mthca]
    [<ffffffff881ca1bc>] mthca_eq_int+0x70/0x3d2 [ib_mthca]
    [<ffffffff881ca644>] mthca_arbel_interrupt+0x3b/0x98 [ib_mthca]
    [<ffffffff802537de>] handle_IRQ_event+0x28/0x64
    [<ffffffff802538c6>] __do_IRQ+0xac/0x117
    [<ffffffff8020becf>] do_IRQ+0xf7/0x108
    [<ffffffff80209bd8>] common_interrupt+0x64/0x65
 }
 ... key at: [<ffffffff881f9a28>] __key.23652+0x0/0xffffffffffff804f [ib_ipoib]
 -> (&priv->lock){.+..} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff881ef2c6>] ipoib_mcast_start_thread+0x6e/0x87 [ib_ipoib]
     [<ffffffff881eee7d>] ipoib_ib_dev_up+0x52/0x58 [ib_ipoib]
     [<ffffffff881ec23f>] ipoib_open+0x65/0x102 [ib_ipoib]
     [<ffffffff8038c74c>] dev_open+0x3a/0x80
     [<ffffffff8038ba90>] dev_change_flags+0x65/0x139
     [<ffffffff803ccb56>] devinet_ioctl+0x240/0x5e2
     [<ffffffff803cd18f>] inet_ioctl+0x70/0x8f
     [<ffffffff8038203c>] sock_ioctl+0x1c0/0x1ea
     [<ffffffff8028591e>] do_ioctl+0x26/0x74
     [<ffffffff80285bb6>] vfs_ioctl+0x24a/0x264
     [<ffffffff80285c11>] sys_ioctl+0x41/0x68
     [<ffffffff802096a1>] system_call+0x7d/0x83
    in-softirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb316>] _spin_lock+0x21/0x2e
     [<ffffffff881efd5a>] ipoib_mcast_send+0x28/0x413 [ib_ipoib]
     [<ffffffff881eda0e>] ipoib_start_xmit+0x361/0x66d [ib_ipoib]
     [<ffffffff8038dfc8>] dev_hard_start_xmit+0x1ab/0x221
     [<ffffffff8039a024>] __qdisc_run+0xfa/0x1cd
     [<ffffffff8038e174>] dev_queue_xmit+0x136/0x263
     [<ffffffff80390209>] neigh_connected_output+0xae/0xc7
     [<ffffffff8813f4de>] ip6_output2+0x254/0x28c [ipv6]
     [<ffffffff8813fcf8>] ip6_output+0x7e2/0x7f8 [ipv6]
     [<ffffffff8814dbf7>] ndisc_send_ns+0x38f/0x4c1 [ipv6]
     [<ffffffff88145048>] addrconf_dad_timer+0xfb/0x11e [ipv6]
     [<ffffffff80231a8e>] run_timer_softirq+0x150/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
  }
  ... key at: [<ffffffff881f9a30>] __key.23651+0x0/0xffffffffffff8047 [ib_ipoib]
  -> (&idr_lock){....} ops: 0 {
     initial-use at:
      [<ffffffff8024172a>] lock_acquire+0x4a/0x69
      [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
      [<ffffffff881e6161>] send_mad+0x39/0x141 [ib_sa]
      [<ffffffff881e6a20>] ib_sa_mcmember_rec_query+0x147/0x17c [ib_sa]
      [<ffffffff881f02a4>] ipoib_mcast_join+0x15f/0x1f2 [ib_ipoib]
      [<ffffffff881f0582>] ipoib_mcast_join_task+0x24b/0x2e7 [ib_ipoib]
      [<ffffffff80238422>] run_workqueue+0xa0/0xf7
      [<ffffffff8023861d>] worker_thread+0xee/0x122
      [<ffffffff8023b661>] kthread+0xd0/0xfb
      [<ffffffff8020a5f1>] child_rip+0x7/0x12
   }
   ... key at: [<ffffffff881ea488>] __key.17990+0x0/0xffffffffffffc8c2 [ib_sa]
  -> (query_idr.lock){....} ops: 0 {
     initial-use at:
      [<ffffffff8024172a>] lock_acquire+0x4a/0x69
      [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
      [<ffffffff802e4581>] free_layer+0x1c/0x40
      [<ffffffff802e45d6>] idr_pre_get+0x31/0x42
      [<ffffffff881e614d>] send_mad+0x25/0x141 [ib_sa]
      [<ffffffff881e6a20>] ib_sa_mcmember_rec_query+0x147/0x17c [ib_sa]
      [<ffffffff881f02a4>] ipoib_mcast_join+0x15f/0x1f2 [ib_ipoib]
      [<ffffffff881f0582>] ipoib_mcast_join_task+0x24b/0x2e7 [ib_ipoib]
      [<ffffffff80238422>] run_workqueue+0xa0/0xf7
      [<ffffffff8023861d>] worker_thread+0xee/0x122
      [<ffffffff8023b661>] kthread+0xd0/0xfb
      [<ffffffff8020a5f1>] child_rip+0x7/0x12
   }
   ... key at: [<ffffffff881e9f30>] query_idr+0x30/0xffffffffffffce4a [ib_sa]
  ...
  acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
   [<ffffffff802e4530>] alloc_layer+0x18/0x4d
   [<ffffffff802e48de>] idr_get_new_above_int+0x37/0x225
   [<ffffffff802e4adb>] idr_get_new+0xf/0x2f
   [<ffffffff881e6177>] send_mad+0x4f/0x141 [ib_sa]
   [<ffffffff881e6a20>] ib_sa_mcmember_rec_query+0x147/0x17c [ib_sa]
   [<ffffffff881f02a4>] ipoib_mcast_join+0x15f/0x1f2 [ib_ipoib]
   [<ffffffff881f0582>] ipoib_mcast_join_task+0x24b/0x2e7 [ib_ipoib]
   [<ffffffff80238422>] run_workqueue+0xa0/0xf7
   [<ffffffff8023861d>] worker_thread+0xee/0x122
   [<ffffffff8023b661>] kthread+0xd0/0xfb
   [<ffffffff8020a5f1>] child_rip+0x7/0x12

  ... acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
   [<ffffffff881e6069>] ib_sa_cancel_query+0x1b/0x73 [ib_sa]
   [<ffffffff881eefb5>] wait_for_mcast_join+0x35/0xd2 [ib_ipoib]
   [<ffffffff881ef385>] ipoib_mcast_stop_thread+0xa6/0xe6 [ib_ipoib]
   [<ffffffff881f07da>] ipoib_mcast_restart_task+0x4b/0x3d9 [ib_ipoib]
   [<ffffffff80238422>] run_workqueue+0xa0/0xf7
   [<ffffffff8023861d>] worker_thread+0xee/0x122
   [<ffffffff8023b661>] kthread+0xd0/0xfb
   [<ffffffff8020a5f1>] child_rip+0x7/0x12

 -> (&mad_agent_priv->lock){....} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
     [<ffffffff881b936d>] ib_post_send_mad+0x3e0/0x512 [ib_mad]
     [<ffffffff881bb251>] agent_send_response+0x11b/0x155 [ib_mad]
     [<ffffffff881b9b2b>] ib_mad_completion_handler+0x2ea/0x5e4 [ib_mad]
     [<ffffffff80238422>] run_workqueue+0xa0/0xf7
     [<ffffffff8023861d>] worker_thread+0xee/0x122
     [<ffffffff8023b661>] kthread+0xd0/0xfb
     [<ffffffff8020a5f1>] child_rip+0x7/0x12
  }
  ... key at: [<ffffffff881c1424>] __key.17415+0x0/0xffffffffffffaf74 [ib_mad]
 -> (base_lock_keys + cpu){++..} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff8023197f>] run_timer_softirq+0x41/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff8020bed4>] do_IRQ+0xfc/0x108
     [<ffffffff80209bd8>] common_interrupt+0x64/0x65
    in-hardirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb316>] _spin_lock+0x21/0x2e
     [<ffffffff80231bad>] __mod_timer+0x93/0xc8
     [<ffffffff80231c0e>] mod_timer+0x2c/0x2f
     [<ffffffff80377fb2>] i8042_interrupt+0x2a/0x222
     [<ffffffff802537de>] handle_IRQ_event+0x28/0x64
     [<ffffffff802538c6>] __do_IRQ+0xac/0x117
     [<ffffffff8020becf>] do_IRQ+0xf7/0x108
     [<ffffffff80209bd8>] common_interrupt+0x64/0x65
     [<7ffffffffffffffd>] 0x7ffffffffffffffd
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
    in-softirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff8023197f>] run_timer_softirq+0x41/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff8020bed4>] do_IRQ+0xfc/0x108
     [<ffffffff80209bd8>] common_interrupt+0x64/0x65
  }
  ... key at: [<ffffffff805a16e0>] base_lock_keys+0x0/0x20
  ... acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
   [<ffffffff80231316>] lock_timer_base+0x21/0x47
   [<ffffffff80231c84>] try_to_del_timer_sync+0x14/0x62
   [<ffffffff80231ce3>] del_timer_sync+0x11/0x1a
   [<ffffffff881b8a7b>] ib_mad_complete_send_wr+0x10b/0x1f8 [ib_mad]
   [<ffffffff881b8c8d>] ib_mad_send_done_handler+0x125/0x17a [ib_mad]
   [<ffffffff881b9db2>] ib_mad_completion_handler+0x571/0x5e4 [ib_mad]
   [<ffffffff80238422>] run_workqueue+0xa0/0xf7
   [<ffffffff8023861d>] worker_thread+0xee/0x122
   [<ffffffff8023b661>] kthread+0xd0/0xfb
   [<ffffffff8020a5f1>] child_rip+0x7/0x12

 -> (base_lock_keys + cpu#4){++..} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff8023197f>] run_timer_softirq+0x41/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
    in-hardirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
     [<ffffffff80231316>] lock_timer_base+0x21/0x47
     [<ffffffff80231c2f>] del_timer+0x1e/0x5f
     [<ffffffff8034f281>] scsi_delete_timer+0x12/0x5d
     [<ffffffff8034d064>] scsi_done+0xd/0x1e
     [<ffffffff8035d444>] ata_scsi_qc_complete+0xb1/0xc3
     [<ffffffff803583e9>] __ata_qc_complete+0x218/0x225
     [<ffffffff803584c5>] ata_qc_complete+0xcf/0xd5
     [<ffffffff80359cf0>] ata_hsm_qc_complete+0x1c6/0x1d8
     [<ffffffff8035a31a>] ata_hsm_move+0x618/0x638
     [<ffffffff8035a8ba>] ata_interrupt+0x161/0x1ae
     [<ffffffff802537de>] handle_IRQ_event+0x28/0x64
     [<ffffffff802538c6>] __do_IRQ+0xac/0x117
     [<ffffffff8020becf>] do_IRQ+0xf7/0x108
     [<ffffffff80209bd8>] common_interrupt+0x64/0x65
    in-softirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff8023197f>] run_timer_softirq+0x41/0x1dc
     [<ffffffff8022e345>]
     __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
  }
  ... key at: [<ffffffff805a16f8>] base_lock_keys+0x18/0x20
  ... acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb316>] _spin_lock+0x21/0x2e
   [<ffffffff80231bad>] __mod_timer+0x93/0xc8
   [<ffffffff802386c6>] queue_delayed_work+0x75/0x7d
   [<ffffffff881b776a>] wait_for_response+0xdf/0xe8 [ib_mad]
   [<ffffffff881b8a31>] ib_mad_complete_send_wr+0xc1/0x1f8 [ib_mad]
   [<ffffffff881b8c8d>] ib_mad_send_done_handler+0x125/0x17a [ib_mad]
   [<ffffffff881b9db2>] ib_mad_completion_handler+0x571/0x5e4 [ib_mad]
   [<ffffffff80238422>] run_workqueue+0xa0/0xf7
   [<ffffffff8023861d>] worker_thread+0xee/0x122
   [<ffffffff8023b661>] kthread+0xd0/0xfb
   [<ffffffff8020a5f1>] child_rip+0x7/0x12

 -> (base_lock_keys + cpu#2){++..} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff8023197f>] run_timer_softirq+0x41/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
    in-hardirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb316>] _spin_lock+0x21/0x2e
     [<ffffffff80231bad>] __mod_timer+0x93/0xc8
     [<ffffffff80231c0e>] mod_timer+0x2c/0x2f
     [<ffffffff80377fb2>] i8042_interrupt+0x2a/0x222
     [<ffffffff802537de>] handle_IRQ_event+0x28/0x64
     [<ffffffff802538c6>] __do_IRQ+0xac/0x117
     [<ffffffff8020becf>] do_IRQ+0xf7/0x108
     [<ffffffff80209bd8>] common_interrupt+0x64/0x65
    in-softirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff8023197f>] run_timer_softirq+0x41/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
  }
  ... key at: [<ffffffff805a16e8>] base_lock_keys+0x8/0x20
  ... acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb316>] _spin_lock+0x21/0x2e
   [<ffffffff80231bad>] __mod_timer+0x93/0xc8
   [<ffffffff802386c6>] queue_delayed_work+0x75/0x7d
   [<ffffffff881b776a>] wait_for_response+0xdf/0xe8 [ib_mad]
   [<ffffffff881b8a31>] ib_mad_complete_send_wr+0xc1/0x1f8 [ib_mad]
   [<ffffffff881b8c8d>] ib_mad_send_done_handler+0x125/0x17a [ib_mad]
   [<ffffffff881b9db2>] ib_mad_completion_handler+0x571/0x5e4 [ib_mad]
   [<ffffffff80238422>] run_workqueue+0xa0/0xf7
   [<ffffffff8023861d>] worker_thread+0xee/0x122
   [<ffffffff8023b661>] kthread+0xd0/0xfb
   [<ffffffff8020a5f1>] child_rip+0x7/0x12

  ... acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
   [<ffffffff881b7a16>] ib_modify_mad+0x27/0x14c [ib_mad]
   [<ffffffff881b7b45>] ib_cancel_mad+0xa/0xd [ib_mad]
   [<ffffffff881e60b8>] ib_sa_cancel_query+0x6a/0x73 [ib_sa]
   [<ffffffff881eefb5>] wait_for_mcast_join+0x35/0xd2 [ib_ipoib]
   [<ffffffff881ef385>] ipoib_mcast_stop_thread+0xa6/0xe6 [ib_ipoib]
   [<ffffffff881f07da>] ipoib_mcast_restart_task+0x4b/0x3d9 [ib_ipoib]
   [<ffffffff80238422>] run_workqueue+0xa0/0xf7
   [<ffffffff8023861d>] worker_thread+0xee/0x122
   [<ffffffff8023b661>] kthread+0xd0/0xfb
   [<ffffffff8020a5f1>] child_rip+0x7/0x12

 -> (modlist_lock){.+..} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
     [<ffffffff80247f0c>] module_text_address+0x15/0x3b
     [<ffffffff803ee75b>] __register_kprobe+0x2bc/0x2da
     [<ffffffff803ee8e3>] register_kprobe+0xc/0xf
     [<ffffffff80806064>] arch_init_kprobes+0xf/0x12
     [<ffffffff808094be>] init_kprobes+0x3f/0x52
     [<ffffffff8020718b>] init+0x143/0x308
     [<ffffffff8020a5f1>] child_rip+0x7/0x12
    in-softirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
     [<ffffffff80247e1e>] is_module_address+0x14/0x80
     [<ffffffff8023ede1>] static_obj+0x6c/0x79
     [<ffffffff8023ee8b>] lockdep_init_map+0x7d/0xc3
     [<ffffffff802f64d2>] __spin_lock_init+0x2e/0x58
     [<ffffffff881ef221>] ipoib_mcast_alloc+0x97/0xce [ib_ipoib]
     [<ffffffff881efe3b>] ipoib_mcast_send+0x109/0x413 [ib_ipoib]
     [<ffffffff881eda0e>] ipoib_start_xmit+0x361/0x66d [ib_ipoib]
     [<ffffffff8038dfc8>] dev_hard_start_xmit+0x1ab/0x221
     [<ffffffff8039a024>] __qdisc_run+0xfa/0x1cd
     [<ffffffff8038e174>] dev_queue_xmit+0x136/0x263
     [<ffffffff80390209>] neigh_connected_output+0xae/0xc7
     [<ffffffff8813f4de>] ip6_output2+0x254/0x28c [ipv6]
     [<ffffffff8813fcf8>] ip6_output+0x7e2/0x7f8 [ipv6]
     [<ffffffff8814d736>] ndisc_send_rs+0x33d/0x46f [ipv6]
     [<ffffffff881427b9>] addrconf_dad_completed+0x90/0xe2 [ipv6]
     [<ffffffff88144fc1>] addrconf_dad_timer+0x74/0x11e [ipv6]
     [<ffffffff80231a8e>] run_timer_softirq+0x150/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
  }
  ... key at: [<ffffffff804cfd98>] modlist_lock+0x18/0x40
  ...
  acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
   [<ffffffff80247e1e>] is_module_address+0x14/0x80
   [<ffffffff8023ede1>] static_obj+0x6c/0x79
   [<ffffffff8023ee8b>] lockdep_init_map+0x7d/0xc3
   [<ffffffff802f64d2>] __spin_lock_init+0x2e/0x58
   [<ffffffff881ef221>] ipoib_mcast_alloc+0x97/0xce [ib_ipoib]
   [<ffffffff881f0903>] ipoib_mcast_restart_task+0x174/0x3d9 [ib_ipoib]
   [<ffffffff80238422>] run_workqueue+0xa0/0xf7
   [<ffffffff8023861d>] worker_thread+0xee/0x122
   [<ffffffff8023b661>] kthread+0xd0/0xfb
   [<ffffffff8020a5f1>] child_rip+0x7/0x12

 -> (&qp->sq.lock){.+..} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb644>] _spin_lock_irq+0x28/0x35
     [<ffffffff881cdef3>] mthca_modify_qp+0x67/0xd7b [ib_mthca]
     [<ffffffff881a68fa>] ib_modify_qp+0xc/0xf [ib_core]
     [<ffffffff881b8394>] ib_mad_init_device+0x2db/0x55c [ib_mad]
     [<ffffffff881a83fe>] ib_register_device+0x20d/0x300 [ib_core]
     [<ffffffff881d2f53>] mthca_register_device+0x3e0/0x430 [ib_mthca]
     [<ffffffff881c4f26>] mthca_init_one+0xbc2/0xcbc [ib_mthca]
     [<ffffffff802fcab2>] pci_device_probe+0x4b/0x72
     [<ffffffff803481b7>] driver_probe_device+0x59/0xaf
     [<ffffffff803482d3>] __driver_attach+0x58/0x90
     [<ffffffff80347665>] bus_for_each_dev+0x48/0x7a
     [<ffffffff80348047>] driver_attach+0x1b/0x1e
     [<ffffffff803479d2>] bus_add_driver+0x74/0x112
     [<ffffffff80348705>] driver_register+0x8c/0x91
     [<ffffffff802fc684>] __pci_register_driver+0x60/0x84
     [<ffffffff88022016>] 0xffffffff88022016
     [<ffffffff80247a93>] sys_init_module+0x174c/0x1884
     [<ffffffff802096a1>] system_call+0x7d/0x83
    in-softirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
     [<ffffffff881cff9d>] mthca_arbel_post_send+0x2f/0x5be [ib_mthca]
     [<ffffffff881ee043>] ipoib_send+0x117/0x1d6 [ib_ipoib]
     [<ffffffff881f012d>] ipoib_mcast_send+0x3fb/0x413 [ib_ipoib]
     [<ffffffff881eda0e>] ipoib_start_xmit+0x361/0x66d [ib_ipoib]
     [<ffffffff8038dfc8>] dev_hard_start_xmit+0x1ab/0x221
     [<ffffffff8039a024>] __qdisc_run+0xfa/0x1cd
     [<ffffffff8038e174>] dev_queue_xmit+0x136/0x263
     [<ffffffff80390209>] neigh_connected_output+0xae/0xc7
     [<ffffffff8813f4de>] ip6_output2+0x254/0x28c [ipv6]
     [<ffffffff8813fcf8>] ip6_output+0x7e2/0x7f8 [ipv6]
     [<ffffffff8814dbf7>] ndisc_send_ns+0x38f/0x4c1 [ipv6]
     [<ffffffff88145048>] addrconf_dad_timer+0xfb/0x11e [ipv6]
     [<ffffffff80231a8e>] run_timer_softirq+0x150/0x1dc
     [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
     [<ffffffff8020a94d>] call_softirq+0x1d/0x28
     [<ffffffff8022e0bc>] irq_exit+0x56/0x59
     [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
     [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
  }
  ... key at: [<ffffffff881e4060>] __key.20374+0x0/0xffffffffffff231a [ib_mthca]
 -> (&qp->rq.lock){+...} ops: 0 {
    initial-use at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb316>] _spin_lock+0x21/0x2e
     [<ffffffff881cdf02>] mthca_modify_qp+0x76/0xd7b [ib_mthca]
     [<ffffffff881a68fa>] ib_modify_qp+0xc/0xf [ib_core]
     [<ffffffff881b8394>] ib_mad_init_device+0x2db/0x55c [ib_mad]
     [<ffffffff881a83fe>] ib_register_device+0x20d/0x300 [ib_core]
     [<ffffffff881d2f53>] mthca_register_device+0x3e0/0x430 [ib_mthca]
     [<ffffffff881c4f26>] mthca_init_one+0xbc2/0xcbc [ib_mthca]
     [<ffffffff802fcab2>] pci_device_probe+0x4b/0x72
     [<ffffffff803481b7>] driver_probe_device+0x59/0xaf
     [<ffffffff803482d3>] __driver_attach+0x58/0x90
     [<ffffffff80347665>] bus_for_each_dev+0x48/0x7a
     [<ffffffff80348047>] driver_attach+0x1b/0x1e
     [<ffffffff803479d2>] bus_add_driver+0x74/0x112
     [<ffffffff80348705>] driver_register+0x8c/0x91
     [<ffffffff802fc684>] __pci_register_driver+0x60/0x84
     [<ffffffff88022016>] 0xffffffff88022016
     [<ffffffff80247a93>] sys_init_module+0x174c/0x1884
     [<ffffffff802096a1>] system_call+0x7d/0x83
    in-hardirq-W at:
     [<ffffffff8024172a>] lock_acquire+0x4a/0x69
     [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
     [<ffffffff881cd749>] mthca_arbel_post_receive+0x35/0x23a [ib_mthca]
     [<ffffffff881ee183>] ipoib_ib_post_receive+0x81/0xf9 [ib_ipoib]
     [<ffffffff881ee55b>] ipoib_ib_completion+0x286/0x3ff [ib_ipoib]
     [<ffffffff881ca8a5>] mthca_cq_completion+0x65/0x6b [ib_mthca]
     [<ffffffff881ca1bc>] mthca_eq_int+0x70/0x3d2 [ib_mthca]
     [<ffffffff881ca644>] mthca_arbel_interrupt+0x3b/0x98 [ib_mthca]
     [<ffffffff802537de>] handle_IRQ_event+0x28/0x64
     [<ffffffff802538c6>] __do_IRQ+0xac/0x117
     [<ffffffff8020becf>] do_IRQ+0xf7/0x108
     [<ffffffff80209bd8>] common_interrupt+0x64/0x65
  }
  ... key at: [<ffffffff881e4058>] __key.20375+0x0/0xffffffffffff2322 [ib_mthca]
  ... acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb316>] _spin_lock+0x21/0x2e
   [<ffffffff881cdf02>] mthca_modify_qp+0x76/0xd7b [ib_mthca]
   [<ffffffff881a68fa>] ib_modify_qp+0xc/0xf [ib_core]
   [<ffffffff881b8394>] ib_mad_init_device+0x2db/0x55c [ib_mad]
   [<ffffffff881a83fe>] ib_register_device+0x20d/0x300 [ib_core]
   [<ffffffff881d2f53>] mthca_register_device+0x3e0/0x430 [ib_mthca]
   [<ffffffff881c4f26>] mthca_init_one+0xbc2/0xcbc [ib_mthca]
   [<ffffffff802fcab2>] pci_device_probe+0x4b/0x72
   [<ffffffff803481b7>] driver_probe_device+0x59/0xaf
   [<ffffffff803482d3>] __driver_attach+0x58/0x90
   [<ffffffff80347665>] bus_for_each_dev+0x48/0x7a
   [<ffffffff80348047>] driver_attach+0x1b/0x1e
   [<ffffffff803479d2>] bus_add_driver+0x74/0x112
   [<ffffffff80348705>] driver_register+0x8c/0x91
   [<ffffffff802fc684>] __pci_register_driver+0x60/0x84
   [<ffffffff88022016>] 0xffffffff88022016
   [<ffffffff80247a93>] sys_init_module+0x174c/0x1884
   [<ffffffff802096a1>] system_call+0x7d/0x83
  ...
  acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
   [<ffffffff881cff9d>] mthca_arbel_post_send+0x2f/0x5be [ib_mthca]
   [<ffffffff881ee043>] ipoib_send+0x117/0x1d6 [ib_ipoib]
   [<ffffffff881f012d>] ipoib_mcast_send+0x3fb/0x413 [ib_ipoib]
   [<ffffffff881eda0e>] ipoib_start_xmit+0x361/0x66d [ib_ipoib]
   [<ffffffff8038dfc8>] dev_hard_start_xmit+0x1ab/0x221
   [<ffffffff8039a024>] __qdisc_run+0xfa/0x1cd
   [<ffffffff8038e174>] dev_queue_xmit+0x136/0x263
   [<ffffffff80390209>] neigh_connected_output+0xae/0xc7
   [<ffffffff8813f4de>] ip6_output2+0x254/0x28c [ipv6]
   [<ffffffff8813fcf8>] ip6_output+0x7e2/0x7f8 [ipv6]
   [<ffffffff8814dbf7>] ndisc_send_ns+0x38f/0x4c1 [ipv6]
   [<ffffffff88145048>] addrconf_dad_timer+0xfb/0x11e [ipv6]
   [<ffffffff80231a8e>] run_timer_softirq+0x150/0x1dc
   [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
   [<ffffffff8020a94d>] call_softirq+0x1d/0x28
   [<ffffffff8022e0bc>] irq_exit+0x56/0x59
   [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
   [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70

  ... acquired at:
   [<ffffffff8024172a>] lock_acquire+0x4a/0x69
   [<ffffffff803eb316>] _spin_lock+0x21/0x2e
   [<ffffffff881efd5a>] ipoib_mcast_send+0x28/0x413 [ib_ipoib]
   [<ffffffff881eda0e>] ipoib_start_xmit+0x361/0x66d [ib_ipoib]
   [<ffffffff8038dfc8>] dev_hard_start_xmit+0x1ab/0x221
   [<ffffffff8039a024>] __qdisc_run+0xfa/0x1cd
   [<ffffffff8038e174>] dev_queue_xmit+0x136/0x263
   [<ffffffff80390209>] neigh_connected_output+0xae/0xc7
   [<ffffffff8813f4de>] ip6_output2+0x254/0x28c [ipv6]
   [<ffffffff8813fcf8>] ip6_output+0x7e2/0x7f8 [ipv6]
   [<ffffffff8814dbf7>] ndisc_send_ns+0x38f/0x4c1 [ipv6]
   [<ffffffff88145048>] addrconf_dad_timer+0xfb/0x11e [ipv6]
   [<ffffffff80231a8e>] run_timer_softirq+0x150/0x1dc
   [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
   [<ffffffff8020a94d>] call_softirq+0x1d/0x28
   [<ffffffff8022e0bc>] irq_exit+0x56/0x59
   [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
   [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70

the hard-irq-unsafe lock's dependencies:
-> (&skb_queue_lock_key){-+..} ops: 0 {
   initial-use at:
    [<ffffffff8024172a>] lock_acquire+0x4a/0x69
    [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
    [<ffffffff8038683d>] skb_queue_tail+0x1c/0x47
    [<ffffffff8039ff72>] netlink_broadcast+0x212/0x2ea
    [<ffffffff802e57c4>] kobject_uevent+0x3c9/0x43d
    [<ffffffff8034897f>] store_uevent+0x16/0x1e
    [<ffffffff803488ef>] class_device_attr_store+0x1b/0x1e
    [<ffffffff802b0c33>] sysfs_write_file+0xb7/0xe4
    [<ffffffff8027448e>] vfs_write+0xad/0x154
    [<ffffffff80274d8b>] sys_write+0x46/0x6f
    [<ffffffff802096a1>] system_call+0x7d/0x83
   in-softirq-W at:
    [<ffffffff8024172a>] lock_acquire+0x4a/0x69
    [<ffffffff803eb7da>] _spin_lock_irqsave+0x2b/0x3c
    [<ffffffff80386aca>] skb_dequeue+0x18/0x5c
    [<ffffffff80387015>] skb_queue_purge+0x18/0x26
    [<ffffffff80394261>] neigh_timer_handler+0x26f/0x346
    [<ffffffff80231a8e>] run_timer_softirq+0x150/0x1dc
    [<ffffffff8022e345>] __do_softirq+0x6b/0xf7
    [<ffffffff8020a94d>] call_softirq+0x1d/0x28
    [<ffffffff8022e0bc>] irq_exit+0x56/0x59
    [<ffffffff802146ee>] smp_apic_timer_interrupt+0x59/0x5f
    [<ffffffff8020a2c1>] apic_timer_interrupt+0x69/0x70
   hardirq-on-W at:
    [<ffffffff8024172a>] lock_acquire+0x4a/0x69
    [<ffffffff803eb349>] _spin_lock_bh+0x26/0x33
    [<ffffffff803c6de8>] udp_ioctl+0x46/0x87
    [<ffffffff803cd1ab>] inet_ioctl+0x8c/0x8f
    [<ffffffff8038203c>] sock_ioctl+0x1c0/0x1ea
    [<ffffffff8028591e>] do_ioctl+0x26/0x74
    [<ffffffff80285bb6>] vfs_ioctl+0x24a/0x264
    [<ffffffff80285c11>] sys_ioctl+0x41/0x68
    [<ffffffff802096a1>] system_call+0x7d/0x83
 }
 ... key at: [<ffffffff807ccac8>] skb_queue_lock_key+0x0/0x18

stack backtrace:

Call Trace:
 [<ffffffff8020ac99>] show_trace+0xaa/0x238
 [<ffffffff8020b037>] dump_stack+0x13/0x15
 [<ffffffff8023ff9e>] check_usage+0x282/0x293
 [<ffffffff802412b7>] __lock_acquire+0x85c/0xa29
 [<ffffffff8024172b>] lock_acquire+0x4b/0x69
 [<ffffffff803eb7db>] _spin_lock_irqsave+0x2c/0x3c
 [<ffffffff8038683e>] skb_queue_tail+0x1d/0x47
 [<ffffffff881efecc>] :ib_ipoib:ipoib_mcast_send+0x19a/0x413
 [<ffffffff881eda0f>] :ib_ipoib:ipoib_start_xmit+0x362/0x66d
 [<ffffffff8038dfc9>] dev_hard_start_xmit+0x1ac/0x221
 [<ffffffff8039a025>] __qdisc_run+0xfb/0x1cd
 [<ffffffff8038e175>] dev_queue_xmit+0x137/0x263
 [<ffffffff8039020a>] neigh_connected_output+0xaf/0xc7
 [<ffffffff8813f4df>] :ipv6:ip6_output2+0x255/0x28c
 [<ffffffff8813fcf9>] :ipv6:ip6_output+0x7e3/0x7f8
 [<ffffffff8814d737>] :ipv6:ndisc_send_rs+0x33e/0x46f
 [<ffffffff881427ba>] :ipv6:addrconf_dad_completed+0x91/0xe2
 [<ffffffff88144fc2>] :ipv6:addrconf_dad_timer+0x75/0x11e
 [<ffffffff80231a8f>] run_timer_softirq+0x151/0x1dc
 [<ffffffff8022e346>] __do_softirq+0x6c/0xf7
 [<ffffffff8020a94e>] call_softirq+0x1e/0x28
ib0: no IPv6 routers present

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-12  2:33 ` [openib-general] ipoib lockdep warning Roland Dreier
@ 2006-07-12  6:49 ` Arjan van de Ven
  2006-07-12 19:01 ` Roland Dreier
  0 siblings, 1 reply; 32+ messages in thread
From: Arjan van de Ven @ 2006-07-12 6:49 UTC (permalink / raw)
To: Roland Dreier
Cc: Zach Brown, openib-general, Linux Kernel Mailing List, Ingo Molnar

On Tue, 2006-07-11 at 19:33 -0700, Roland Dreier wrote:
> OK, the patch to lib/idr.c below is at least one way to fix this.
> However, with that applied I get the lockdep warning below, which
> seems to be a false positive -- I'm not sure what the right fix is,
> but the blame really seems to fall on udp_ioctl for poking into the
> sk_buff_head lock itself.

this does not have to be a false positive!
It is not legal to take ANY non-hardirq-safe lock after having taken a
lock that's used in hardirq context.
(having said that, the skb_queue_tail lock needs special treatment for
some real false positives; Linus merged that already)

> And here's the warning I get, which appears to be a false positive:
>
> ======================================================
> [ INFO: hard-safe -> hard-unsafe lock order detected ]
> ------------------------------------------------------
> swapper/0 [HC0[0]:SC1[2]:HE0:SE0] is trying to acquire:
>  (&skb_queue_lock_key){-+..}, at: [<ffffffff8038683e>] skb_queue_tail+0x1d/0x47
>
> and this task is already holding:
>  (&priv->lock){.+..}, at: [<ffffffff881efd5b>] ipoib_mcast_send+0x29/0x413 [ib_ipoib]
> which would create a new lock dependency:
>  (&priv->lock){.+..} -> (&skb_queue_lock_key){-+..}

Let's call priv->lock lock A, and the skb_queue lock lock B, to keep
this short:

CPU 0                              CPU 1
soft irq context                   user or softirq context

take lock B
 (anywhere in net stack)
                                   take lock A (ipoib_mcast_send)
interrupt happens
take lock A (spins)
 in ISR somewhere
                                   take lock B (spins)
                                   (in skb_queue_tail() called from
                                    ipoib_mcast_send)

Now this assumes your queue is shared with the networking stack, so
that the first "take lock B" on CPU 0 can happen without disabling
interrupts (as several of the skb functions used by the networking
stack do, but not the simple ones used by drivers for private queues).
If the queue used in ipoib_mcast_send() is a private queue, you
probably want to apply the patch below (already in/on its way to
mainline):

Index: linux-2.6.18-rc1/include/linux/skbuff.h
===================================================================
--- linux-2.6.18-rc1.orig/include/linux/skbuff.h
+++ linux-2.6.18-rc1/include/linux/skbuff.h
@@ -606,10 +606,17 @@ static inline __u32 skb_queue_len(const
 extern struct lock_class_key skb_queue_lock_key;
 
+/*
+ * This function creates a split out lock class for each invocation;
+ * this is needed for now since a whole lot of users of the skb-queue
+ * infrastructure in drivers have different locking usage (in hardirq)
+ * than the networking core (in softirq only). In the long run either the
+ * network layer or drivers should need annotation to consolidate the
+ * main types of usage into 3 classes.
+ */
 static inline void skb_queue_head_init(struct sk_buff_head *list)
 {
 	spin_lock_init(&list->lock);
-	lockdep_set_class(&list->lock, &skb_queue_lock_key);
 	list->prev = list->next = (struct sk_buff *)list;
 	list->qlen = 0;
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [openib-general] ipoib lockdep warning
  2006-07-12  6:49 ` Arjan van de Ven
@ 2006-07-12 19:01 ` Roland Dreier
  0 siblings, 0 replies; 32+ messages in thread
From: Roland Dreier @ 2006-07-12 19:01 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Zach Brown, openib-general, Linux Kernel Mailing List, Ingo Molnar

 > this does not have to be a false positive!
 > It is not legal to take ANY non-hardirq safe lock after having taken a
 > lock that's used in hardirq context.
 > (having said that the skb_queue_tail lock needs a special treatment for
 > some real false positives; Linus merged that already)

...

 > Now this assumes your queue is shared with the networking stack,

Right, understood -- but ipoib has a private skb queue, which is why I
said it was a false positive.

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 32+ messages in thread
end of thread, other threads:[~2006-07-17 15:57 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-11 20:10 ipoib lockdep warning Zach Brown
2006-07-11 21:16 ` Michael S. Tsirkin
2006-07-11 21:40 ` Sean Hefty
2006-07-11 21:50 ` [openib-general] " Zach Brown
2006-07-11 22:54 ` Roland Dreier
2006-07-11 23:27 ` Zach Brown
2006-07-11 23:43 ` Roland Dreier
2006-07-11 23:53 ` Zach Brown
2006-07-12  0:06 ` Roland Dreier
2006-07-12  9:38 ` Ingo Molnar
2006-07-12 11:09 ` Michael S. Tsirkin
2006-07-12 16:31 ` [openib-general] " Sean Hefty
2006-07-12 18:56 ` Roland Dreier
2006-07-13 15:55 ` [PATCH] IB/core: use correct gfp_mask in sa_query Michael S. Tsirkin
2006-07-12 19:06 ` [openib-general] ipoib lockdep warning Roland Dreier
2006-07-12 20:45 ` [PATCH] Convert idr's internal locking to _irqsave variant Roland Dreier
2006-07-12 21:14 ` Ingo Molnar
2006-07-13  1:30 ` Andrew Morton
2006-07-13 15:42 ` Roland Dreier
2006-07-13 20:54 ` Andrew Morton
2006-07-13 21:03 ` Roland Dreier
2006-07-13 21:05 ` Arjan van de Ven
2006-07-14  0:18 ` Roland Dreier
2006-07-14  6:20 ` Arjan van de Ven
2006-07-13 21:43 ` Andrew Morton
2006-07-14  1:08 ` Roland Dreier
2006-07-14  1:18 ` Andrew Morton
2006-07-14  1:30 ` Andrew Morton
2006-07-17 15:57 ` Roland Dreier
2006-07-12  2:33 ` [openib-general] ipoib lockdep warning Roland Dreier
2006-07-12  6:49 ` Arjan van de Ven
2006-07-12 19:01 ` Roland Dreier
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox