From: ebiederm@xmission.com (Eric W. Biederman)
To: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@elte.hu>, Jan Beulich <jbeulich@novell.com>,
tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
linux-kernel@vger.kernel.org, Gautham R Shenoy <ego@in.ibm.com>,
Alexey Dobriyan <adobriyan@gmail.com>,
netdev@vger.kernel.org
Subject: Re: [PATCH 2/2] sysctl: lockdep support for sysctl reference counting.
Date: Tue, 31 Mar 2009 15:44:11 -0700
Message-ID: <m1hc198g90.fsf@fess.ebiederm.org>
In-Reply-To: <1238513726.8530.564.camel@twins> (Peter Zijlstra's message of "Tue\, 31 Mar 2009 17\:35\:26 +0200")
Peter Zijlstra <peterz@infradead.org> writes:
> On Tue, 2009-03-31 at 06:40 -0700, Eric W. Biederman wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Sat, 2009-03-21 at 00:42 -0700, Eric W. Biederman wrote:
>> >> It is possible to get lock ordering deadlocks between locks
>> >> and waiting for the sysctl used count to drop to zero. We have
>> >> recently observed one of these in the networking code.
>> >>
>> >> So teach the sysctl code how to speak lockdep so the kernel
>> >> can warn about these kinds of rare issues proactively.
>> >
>> > It would be very good to extend this changelog with a more detailed
>> > explanation of the deadlock in question.
>> >
>> > Let me see if I got it right:
>> >
>> > We're holding a lock, while waiting for the refcount to drop to 0.
>> > Dropping that refcount is blocked on that lock.
>> >
>> > Something like that?
>>
>> Exactly.
>>
>> I must have written an explanation so many times that it got
>> lost when I wrote that commit message.
>>
>> In particular the problem can be seen with /proc/sys/net/ipv4/conf/*/forwarding.
>>
>> The problem is that the handler for forwarding takes the rtnl_lock
>> with the reference count held.
>>
>> Then we call unregister_sysctl_table under the rtnl_lock,
>> which waits for the reference count to go to zero.
>
>> >> +
>> >> +# define lock_sysctl() __raw_spin_lock(&sysctl_lock.raw_lock)
>> >> +# define unlock_sysctl() __raw_spin_unlock(&sysctl_lock.raw_lock)
>> >
>> > Uhmm, Please explain that -- without a proper explanation this is a NAK.
>>
>> If the refcount is to be considered a lock, sysctl_lock must be considered
>> the internals of that lock. lockdep gets extremely confused otherwise.
>> Since the spinlock is static to this file I'm not especially worried
>> about it.
>
> Usually lock internal locks still get lockdep coverage. Let's see if we
> can find a way for this to be true even here. I suspect the below to
> cause the issue:
>
>> >> /* called under sysctl_lock, will reacquire if has to wait */
>> >> @@ -1478,47 +1531,54 @@ static void start_unregistering(struct ctl_table_header *p)
>> >>          * if p->used is 0, nobody will ever touch that entry again;
>> >>          * we'll eliminate all paths to it before dropping sysctl_lock
>> >>          */
>> >> +        table_acquire(p);
>> >>          if (unlikely(p->used)) {
>> >>                  struct completion wait;
>> >> +                table_contended(p);
>> >> +
>> >>                  init_completion(&wait);
>> >>                  p->unregistering = &wait;
>> >> -                spin_unlock(&sysctl_lock);
>> >> +                unlock_sysctl();
>> >>                  wait_for_completion(&wait);
>> >> -                spin_lock(&sysctl_lock);
>> >> +                lock_sysctl();
>> >>          } else {
>> >>                  /* anything non-NULL; we'll never dereference it */
>> >>                  p->unregistering = ERR_PTR(-EINVAL);
>> >>          }
>> >> +        table_acquired(p);
>> >> +
>> >>          /*
>> >>           * do not remove from the list until nobody holds it; walking the
>> >>           * list in do_sysctl() relies on that.
>> >>           */
>> >>          list_del_init(&p->ctl_entry);
>> >> +
>> >> +        table_release(p);
>> >>  }
>
> There you acquire the table while holding the spinlock, generating:
> sysctl_lock -> table_lock, however you then release the sysctl_lock and
> re-acquire it, generating table_lock -> sysctl_lock.
>
> Humm, can't we write that differently?

That is an artifact of sysctl_lock being used to implement
table_lock, as best as I can tell.  The case you point
out I could probably make work by playing with where I claim
the lock is acquired.

__sysctl_head_next on the read side is trickier.

We come in with table_lock held for read.
We grab sysctl_lock.
We release table_lock (aka the reference count is decremented).
We grab table_lock on the next table (aka the reference count is incremented).
We release sysctl_lock.

If we generate lockdep annotations for that, it would seem to transition
through the states:
    table_lock
    table_lock -> sysctl_lock
    sysctl_lock
    sysctl_lock -> table_lock
    table_lock

Short of saying table_lock is an implementation detail, used to
make certain operations atomic, I do not see how to model this case.
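
To make the state sequence above concrete, here is a toy userspace
model of it.  This is only a sketch and not how lockdep is actually
implemented: order_model, model_acquire() and friends are made-up names,
and the model just records "B was taken while A was held" edges and
reports when both directions have been seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes; both tables share one class, as in the patch. */
enum lock_class { TABLE_LOCK, SYSCTL_LOCK, NR_CLASSES };

struct order_model {
	bool held[NR_CLASSES];
	/* edge[a][b] is set when b was acquired while a was held */
	bool edge[NR_CLASSES][NR_CLASSES];
};

static void model_init(struct order_model *m)
{
	memset(m, 0, sizeof(*m));
}

static void model_acquire(struct order_model *m, enum lock_class c)
{
	/* Record an ordering edge from every class currently held. */
	for (int h = 0; h < NR_CLASSES; h++)
		if (m->held[h] && h != (int)c)
			m->edge[h][c] = true;
	m->held[c] = true;
}

static void model_release(struct order_model *m, enum lock_class c)
{
	assert(m->held[c]);
	m->held[c] = false;
}

/* True when some pair of classes has been taken in both orders. */
static bool model_has_inversion(const struct order_model *m)
{
	for (int a = 0; a < NR_CLASSES; a++)
		for (int b = a + 1; b < NR_CLASSES; b++)
			if (m->edge[a][b] && m->edge[b][a])
				return true;
	return false;
}

/* Replay the __sysctl_head_next steps listed above. */
static bool head_next_inverts(void)
{
	struct order_model m;

	model_init(&m);
	model_acquire(&m, TABLE_LOCK);	/* we come in holding a reference */
	model_acquire(&m, SYSCTL_LOCK);	/* table_lock -> sysctl_lock */
	model_release(&m, TABLE_LOCK);	/* reference count decremented */
	model_acquire(&m, TABLE_LOCK);	/* sysctl_lock -> table_lock */
	model_release(&m, SYSCTL_LOCK);
	return model_has_inversion(&m);
}
```

In this model head_next_inverts() reports an inversion even though
nothing can actually deadlock there, which is the confusion I am
describing.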

Let me take a slightly simpler case and ask how that gets modeled.
Looking at rwsem: OK, all of the annotations are outside of the
spin_lock.  So in some sense we are sloppy, and fib to lockdep
about when we acquire/release a lock.  In another sense
we are simply respecting the abstraction.

I guess I can take a look and see if I can model things in a slightly
more lossy fashion so I don't need to do the __raw_spin_lock thing.

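
Roughly what I have in mind, as a userspace sketch rather than kernel
code: the internal lock (a pthread mutex standing in for sysctl_lock) is
taken and dropped entirely inside the helpers, and the "annotation" is
made outside of it.  The names dep_model, use_table() and unuse_table()
here are hypothetical, and the plain counter only stands in for what
lockdep would track per task.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lockdep map: counts references the
 * validator believes are held.  Single-threaded sketch only. */
struct dep_model {
	int held;
};

struct table_header {
	pthread_mutex_t internal;	/* plays the role of sysctl_lock */
	int used;			/* the reference count */
	bool unregistering;		/* set once removal has started */
	struct dep_model dep;
};

static void table_init(struct table_header *t)
{
	pthread_mutex_init(&t->internal, NULL);
	t->used = 0;
	t->unregistering = false;
	t->dep.held = 0;
}

/* Take a reference; report it only after the internal lock is dropped. */
static bool use_table(struct table_header *t)
{
	bool ok;

	pthread_mutex_lock(&t->internal);
	ok = !t->unregistering;
	if (ok)
		t->used++;
	pthread_mutex_unlock(&t->internal);

	if (ok)
		t->dep.held++;		/* "acquire" seen outside the internal lock */
	return ok;
}

/* Drop a reference; report it before the internal lock is taken. */
static void unuse_table(struct table_header *t)
{
	assert(t->dep.held > 0);
	t->dep.held--;			/* "release" seen outside the internal lock */

	pthread_mutex_lock(&t->internal);
	t->used--;
	pthread_mutex_unlock(&t->internal);
}
```

The validator's view is then slightly out of step with the real
refcount for a few instructions on either side, which is the lossy
part, but it never sees the internal lock nested with the reference.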
>> >> @@ -1951,7 +2011,13 @@ struct ctl_table_header *__register_sysctl_paths(
>> >>                          return NULL;
>> >>                  }
>> >>  #endif
>> >> -        spin_lock(&sysctl_lock);
>> >> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> >> +        {
>> >> +                static struct lock_class_key __key;
>> >> +                lockdep_init_map(&header->dep_map, "sysctl_used", &__key, 0);
>> >> +        }
>> >> +#endif
>> >
>> > This means every sysctl thingy gets the same class, is that
>> > intended/desired?
>>
>> There is only one place we initialize it, and as far as I know really
>> only one place we take it. Which is the definition of a lockdep
>> class as far as I know.
>
> Indeed, just checking.

The only difference I can possibly see is read side versus write side,
or in my case refcount side versus wait side.

Eric