From mboxrd@z Thu Jan 1 00:00:00 1970 From: ebiederm@xmission.com (Eric W. Biederman) Subject: Re: Race condition in ipv6 code Date: Thu, 12 Jan 2012 23:40:52 -0800 Message-ID: References: <1326349911.2741.1.camel@edumazet-laptop> <1326434549.2617.14.camel@edumazet-laptop> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Francesco Ruggeri , netdev@vger.kernel.org, Stephen Hemminger To: Eric Dumazet Return-path: Received: from out01.mta.xmission.com ([166.70.13.231]:43234 "EHLO out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752151Ab2AMHib convert rfc822-to-8bit (ORCPT ); Fri, 13 Jan 2012 02:38:31 -0500 In-Reply-To: <1326434549.2617.14.camel@edumazet-laptop> (Eric Dumazet's message of "Fri, 13 Jan 2012 07:02:29 +0100") Sender: netdev-owner@vger.kernel.org List-ID: Eric Dumazet writes: > Le jeudi 12 janvier 2012 =C3=A0 16:11 -0800, Eric W. Biederman a =C3=A9= crit : > >> Because the rtnl_lock is broad and we have ABBA deadlocks if we don'= t in >> particular we hold the rtnl_lock over sysctl registration and remova= l. >> sysctl removal blocks until all of the callers into the sysctl metho= ds >> namely addrconf_sysctl_forward in this case finish executing. >>=20 >> CPU 0 CPU 1 >>=20 >> rtnl_lock() use_count++ >> unregister_netdevice() addrconf_ctl_foward >> unregister_sysctl_table() rtnl_lock() >> wait for use_count of addrconf_ctl_forward >> to =3D=3D 0 >> =20 >>=20 >> I smacked lockdep around so it would warn about the sysfs ones. >> The proc and sysctl ones I never did manage to get lockdep warnings >> but a ABBA deadlock is most definitely possible. >>=20 >> Any solutions better than simply restarting the system call are welc= ome. >>=20 >> Perhaps for these heavy weigh methods we should create a work struct >> and go schedule work to perform the change instead of trying to do t= he >> work synchronously in the sysctl handler. >>=20 > > This idea was discussed in netconf 2011 (I focused on the ability to > dismantle netdevices at high rates), and I admit I had not implemente= d > yet. Interesting. That is a discussion I would have been interested to be a part of. In my queue I have sysctl cleanup and speed improvements I should be posting them for review in the next week or two. That leaves /proc/net/dev_snmp6/.... as the scalability bottleneck for the number of network devices. I have gotten the rcu_barrier in device unregister out from under the trylock. At a practical level the rcu_barrier seems to be the primary bottleneck to remove netdevices at a high rate, although I am certain there are other issues that crop up. If I go with a networking path that can batch network devices, and am not bottlenecked by proc, sysfs or sysctl I can remove 10,000 or so network devices a second. I have been scratching my head trying to figure out if it is possible to batch network device deletes that come in via a netlink DELLINK mess= age. > The problem with rtnl_trylock() is that we make very litle progress i= f > many tasks are competing for rtnl, and we have no fairness. Yes. That code path uses rtnl_trylock() simply because I needed a simple and correct fix. It was never intended to be the long term solution, but rather a stop gap to at least remove the possibility of deadlock in the kernel. It is deployed fairly widely because I found a lot of examples of that bug. > The task holding RTNL might be descheduled and all cpus running other > threads spinning in rtnl_trylock()/restart_syscall(). Almost a deadlo= ck. > > Maybe waiting RTNL for at least a couple of ms would help, instead of > instantly failing rtnl_trylock() and looping. Yes. If the rtnl_trylock and syscall_restart looping is a problem we should use a lock other than the rtnl_lock for those variables or we should use a work queue and take the rtnl_lock in a context where we don't need the try lock. I don't think there is much point in getting sophisticated about rtnl_trylock(). It was a good stopgap but we really should remove the ABBA deadlock. As for Stephen Hemminger's suggestion to make a: proc_do_intvec_and_rtn= l_lock function. We could make a helper that does what Francesco did but in a slightly nicer fashion. What we can't do is make a version that does not have the ABBA deadlock, which makes the nasty rtnl_trylock feasible= =2E Given that we still would have the nasty effects of the rtnl_trylock an= d syscall_restart idiom I don't expect creating a helper and encouraging people to build code that is going to need that logic is a particularly good idea. The other option is to look at dropping the rtnl_lock in addrconf_ifdow= n around the addrconf_sysctl_unregister call. It doesn't seem obvious to me that we can. I just thought through the sysctl case just to be certain that we can't take the locks in the other order, and I just don't see a way to do it. The problem is that the use tracking guarantees that none of the user supplied data is being used and it is safe to remove a module. If the lock comes from the module kaboom. Hmm. I partially take back my conclusion that I can't fix the problem at the sysctl level. If we want to take rtnl_lock for every networking sysctl I think I could enhance the ctl_table_set or the ctl_table_root concept to have a mutex we could take over every networking sysctl. With copy_from_user being in the middle of that I'm not particularly fond of that idea. Nor am I fond of taking the rtnl_mutex in even more situations than we take it now. Plus it seems like more deep magic that will confuse people. So I really think the best solution to avoid the locking craziness is t= o have a wrapper that gets the value from userspace and calls schedule_work to get another thread to actually process the change. I don't see any problems with writing a helper function for that. The only downside with using schedule_work is that we return to userspace before the change has been fully installed in the kernel. I don't expect that would be a problem but stranger things have happened. Eric