Re: Race condition in ipv6 code

All of lore.kernel.org
 help / color / mirror / Atom feed

From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Francesco Ruggeri <fruggeri@aristanetworks.com>,
	netdev@vger.kernel.org, Stephen Hemminger <shemminger@vyatta.com>
Subject: Re: Race condition in ipv6 code
Date: Thu, 12 Jan 2012 23:40:52 -0800	[thread overview]
Message-ID: <m1ipkgqccb.fsf@fess.ebiederm.org> (raw)
In-Reply-To: <1326434549.2617.14.camel@edumazet-laptop> (Eric Dumazet's message of "Fri, 13 Jan 2012 07:02:29 +0100")

Eric Dumazet <eric.dumazet@gmail.com> writes:

> Le jeudi 12 janvier 2012 à 16:11 -0800, Eric W. Biederman a écrit :
>
>> Because the rtnl_lock is broad and we have ABBA deadlocks if we don't in
>> particular we hold the rtnl_lock over sysctl registration and removal.
>> sysctl removal blocks until all of the callers into the sysctl methods
>> namely addrconf_sysctl_forward in this case finish executing.
>> 
>> CPU 0                                         CPU 1
>> 
>> rtnl_lock()                                   use_count++
>> unregister_netdevice()                           addrconf_ctl_foward
>>   unregister_sysctl_table()                        rtnl_lock()
>>      wait for use_count of addrconf_ctl_forward
>>        to == 0
>>      
>> 
>> I smacked lockdep around so it would warn about the sysfs ones.
>> The proc and sysctl ones I never did manage to get lockdep warnings
>> but a ABBA deadlock is most definitely possible.
>> 
>> Any solutions better than simply restarting the system call are welcome.
>> 
>> Perhaps for these heavy weigh methods we should create a work struct
>> and go schedule work to perform the change instead of trying to do the
>> work synchronously in the sysctl handler.
>> 
>
> This idea was discussed in netconf 2011 (I focused on the ability to
> dismantle netdevices at high rates), and I admit I had not implemented
> yet.

Interesting.  That is a discussion I would have been interested to be
a part of.

In my queue I have sysctl cleanup and speed improvements I should be
posting them for review in the next week or two.

That leaves /proc/net/dev_snmp6/.... as the scalability bottleneck for
the number of network devices.

I have gotten the rcu_barrier in device unregister out from under the
trylock.

At a practical level the rcu_barrier seems to be the primary bottleneck
to remove netdevices at a high rate, although I am certain there are
other issues that crop up.  If I go with a networking path that can
batch network devices, and am not bottlenecked by proc, sysfs or sysctl
I can remove 10,000 or so network devices a second.

I have been scratching my head trying to figure out if it is possible
to batch network device deletes that come in via a netlink DELLINK message.

> The problem with rtnl_trylock() is that we make very litle progress if
> many tasks are competing for rtnl, and we have no fairness.

Yes.  That code path uses rtnl_trylock() simply because I needed a
simple and correct fix.  It was never intended to be the long term
solution, but rather a stop gap to at least remove the possibility of
deadlock in the kernel.  It is deployed fairly widely because I found
a lot of examples of that bug.

> The task holding RTNL might be descheduled and all cpus running other
> threads spinning in rtnl_trylock()/restart_syscall(). Almost a deadlock.
>
> Maybe waiting RTNL for at least a couple of ms would help, instead of
> instantly failing rtnl_trylock() and looping.

Yes.  If the rtnl_trylock and syscall_restart looping is a problem
we should use a lock other than the rtnl_lock for those variables
or we should use a work queue and take the rtnl_lock in a context
where we don't need the try lock.

I don't think there is much point in getting sophisticated about
rtnl_trylock().  It was a good stopgap but we really should remove
the ABBA deadlock.

As for Stephen Hemminger's suggestion to make a: proc_do_intvec_and_rtnl_lock
function.  We could make a helper that does what Francesco did but in a
slightly nicer fashion.  What we can't do is make a version that does
not have the ABBA deadlock, which makes the nasty rtnl_trylock feasible.
Given that we still would have the nasty effects of the rtnl_trylock and
syscall_restart idiom I don't expect creating a helper and encouraging
people to build code that is going to need that logic is a particularly
good idea.

The other option is to look at dropping the rtnl_lock in addrconf_ifdown
around the addrconf_sysctl_unregister call.  It doesn't seem obvious to
me that we can.

I just thought through the sysctl case just to be certain that we can't
take the locks in the other order, and I just don't see a way to do it.
The problem is that the use tracking guarantees that none of the user
supplied data is being used and it is safe to remove a module.  If the
lock comes from the module kaboom.

Hmm.  I partially take back my conclusion that I can't fix the problem
at the sysctl level.  If we want to take rtnl_lock for every
networking sysctl I think I could enhance the ctl_table_set or the
ctl_table_root concept to have a mutex we could take over every
networking sysctl.  With copy_from_user being in the middle of that
I'm not particularly fond of that idea.  Nor am I fond of taking
the rtnl_mutex in even more situations than we take it now.  Plus it
seems like more deep magic that will confuse people.

So I really think the best solution to avoid the locking craziness is to
have a wrapper that gets the value from userspace and calls
schedule_work to get another thread to actually process the change.  I
don't see any problems with writing a helper function for that.  The
only downside with using schedule_work is that we return to userspace
before the change has been fully installed in the kernel.  I don't
expect that would be a problem but stranger things have happened.

Eric

next prev parent reply	other threads:[~2012-01-13  7:38 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-12  2:13 Race condition in ipv6 code Francesco Ruggeri
2012-01-12  6:31 ` Eric Dumazet
2012-01-12  6:44   ` David Miller
2012-01-12 20:48     ` Francesco Ruggeri
2012-01-13  0:11   ` Eric W. Biederman
2012-01-13  6:02     ` Eric Dumazet
2012-01-13  7:40       ` Eric W. Biederman [this message]
2012-01-13 17:04         ` Ben Greear
2012-01-14  5:46           ` Eric W. Biederman
2012-01-14 18:31             ` Ben Greear
2012-01-20  2:54               ` Eric W. Biederman
2012-01-13  1:17 ` Eric W. Biederman
2012-01-13  1:57   ` Stephen Hemminger
2012-01-13 22:02   ` Francesco Ruggeri

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m1ipkgqccb.fsf@fess.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=eric.dumazet@gmail.com \
    --cc=fruggeri@aristanetworks.com \
    --cc=netdev@vger.kernel.org \
    --cc=shemminger@vyatta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.