From mboxrd@z Thu Jan 1 00:00:00 1970 From: ebiederm@xmission.com (Eric W. Biederman) Subject: Re: [PATCH 1/1] net: race condition when removing virtual net_device Date: Mon, 16 Sep 2013 03:45:06 -0700 Message-ID: <87six5kpu5.fsf@xmission.com> References: <1379008796-2121-1-git-send-email-fruggeri@aristanetworks.com> <87txhp249u.fsf@xmission.com> <871u4t1d9t.fsf@xmission.com> <87d2ocrx9b.fsf@xmission.com> Mime-Version: 1.0 Content-Type: text/plain Cc: "David S. Miller" , Eric Dumazet , Jiri Pirko , Alexander Duyck , Cong Wang , netdev@vger.kernel.org To: Francesco Ruggeri Return-path: Received: from out01.mta.xmission.com ([166.70.13.231]:43830 "EHLO out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753380Ab3IPKpV (ORCPT ); Mon, 16 Sep 2013 06:45:21 -0400 In-Reply-To: (Francesco Ruggeri's message of "Sun, 15 Sep 2013 19:54:43 -0700") Sender: netdev-owner@vger.kernel.org List-ID: Francesco Ruggeri writes: > On Fri, Sep 13, 2013 at 6:46 PM, Eric W. Biederman > wrote: >> Francesco Ruggeri writes: >> >>> On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman >>> wrote: >>> >>> I do have some concerns about both correctness and feasibility of this approach. >>> >>> About 2), namespace dependent operations triggered by unregistering >>> the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and >>> probably more) would not be done in the namespaces where they should. >> >> Yes they will. That is what dev_change_net_namespace does. > > You are right, I don't know what I was thinking. >> It was worth a second look. I can not find anything wrong with your >> patch but I can not convince myself that it is correct either. The >> moving around the loopback device in the net dev todo list to prevent >> deadlock I can't imagine why you are doing that. >> > > That is in order not to introduce a potential deadlock when multiple > namespaces are destroyed in default_device_exit_batch. > Take the same veth scenario as before: > v0 in namespace ns0 (loopback_dev lo0) and similarly for v1, ns1 and lo1. > Assume two processes destroy ns0 and ns1. cleanup_net is executed in a > workqueue and the two namespaces can end up being processed in the > same instance of cleanup_net/ops_exit_list/default_device_exit_batch. > default_device_exit_batch traverses each namespace's dev_base list and > unregister_netdevice_queue is executed in this order: > v0 v1 lo0 v1 v0 lo1. > unregister_netdevice_queue is executed twice on v0 and v1 but that is > ok because it uses list_move_tail and only the last one sticks. > The resulting list for unregister_netdevice_many is: > lo0 v1 v0 lo1. > If v0 holds a reference to lo0 there will be a deadlock in > netdev_run_todo if v0 does not come before lo0 in net_todo_list. By > pushing all loopback_devs to the end of net_todo_list this situation > is avoided. > > This is the sequence with today's (actually 3.4) code: > > unregister_netdevice_queue: v0 (ns ffff880037aecd00) > unregister_netdevice_queue: v1 (ns ffff880037aed800) > unregister_netdevice_queue: lo (ns ffff880037aecd00) > unregister_netdevice_queue: v1 (ns ffff880037aed800) > unregister_netdevice_queue: v0 (ns ffff880037aecd00) > unregister_netdevice_queue: lo (ns ffff880037aed800) > unregister_netdevice_many: lo (ns ffff880037aecd00) v1 (ns > ffff880037aed800) v0 (ns ffff880037aecd00) lo (ns ffff880037aed800) > netdev_run_todo: lo (ns ffff880037aecd00) v1 (ns ffff880037aed800) v0 > (ns ffff880037aecd00) lo (ns ffff880037aed800) Interesting. So we have a very small chance of hillarity ensuing with dst_ifdown. But we probably won't notice because even if lo has been freed the memory likely hasn't been recycled. So dst_hold likely does not affect anyone. So it seems real problems only happens when we send the rebroadcast. Which explains why this has gone unnoticed for so long. I really don't want to support concurrency during the network namespace shutdown. That quickly becomes too crazy to think about. I believe the weird reordering of veth devices is solved or largely solved by: commit f45a5c267da35174e22cec955093a7513dc1623d Author: Eric Dumazet Date: Fri Feb 8 20:10:49 2013 +0000 veth: fix NULL dereference in veth_dellink() commit d0e2c55e7c940 (veth: avoid a NULL deref in veth_stats_one) added another NULL deref in veth_dellink(). # ip link add name veth1 type veth peer name veth0 # rmmod veth We crash because veth_dellink() is called twice, so we must take care of NULL peer. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Which unfortunately means you really need to test 3.11 or 3.12-rc1 to make headway on this issue. If there is further veth specific madness we can solve it by tweaking the veth driver. > and this is the sequence after pushing the loopback_devs to the tail > of net_todo_list: > > unregister_netdevice_queue: v0 (ns ffff8800379f8000) > unregister_netdevice_queue: v1 (ns ffff8800378c0b00) > unregister_netdevice_queue: lo (ns ffff8800379f8000) > unregister_netdevice_queue: v1 (ns ffff8800378c0b00) > unregister_netdevice_queue: v0 (ns ffff8800379f8000) > unregister_netdevice_queue: lo (ns ffff8800378c0b00) > unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns > ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00) > netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo > (ns ffff8800379f8000) lo (ns ffff8800378c0b00) I am scratching my head. That isn't what I would expect from putting loopback devices at the end of the list. Even more I am not comfortable with any situation where preserving the order in which the devices are unregistered is not sufficient to make things work correctly. That quickly gets into fragile layering problems, and I don't know how anyone would be able to maintain that. I still don't see a better solution than dev_change_net_namespace, for the network devices that have this problem. Network devices are not supposed to be findable after the dance at the start of cleanup_net. It might be possible to actually build asynchronous network device unregistration and opportunistick batching. Aka cleanup network devices much like we clean up network namespaces now, and by funnelling them all into a single threaded work queue that would probably stop the races, as well as improve performance. I don't have the energy to do that right now unfortunately :( > Should we take this discussion offline? > I do appreciate your spending time on this. If anyone complains we can drop them off of the CC but it is useful to leave a record of the thought process that went into this, so we should at the minimum keep netdev in the loop.