* BUG ? ipip unregister_netdevice_many()
@ 2010-10-07 8:48 Hans Schillstrom
2010-10-08 11:19 ` Daniel Lezcano
2010-10-08 16:06 ` Eric W. Biederman
0 siblings, 2 replies; 28+ messages in thread
From: Hans Schillstrom @ 2010-10-07 8:48 UTC (permalink / raw)
To: netdev@vger.kernel.org; +Cc: Eric W. Biederman, Daniel Lezcano
Hello
I'm trying to exit a network name space and it doesn't work (or am I doing something wrong?)
The only netdevices left are lo and the tunnels ip6tnl0, sit0 and tunl0 when exiting netns.
A netns is created by lxc-execute with two interfaces eth0 eth1 (macvlan)
(see conf file at the end)
Kernel: net-next-2.6 top from 4 october 2010
I added some printk's inn ipip.c ipip_exit_net()
...
rtnl_lock();
printk(KERN_ERR "ipip_exit_net(enter)\n");
ipip_destroy_tunnels(ipn, &list);
printk(KERN_ERR "ipip_exit_net(1)\n");
unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
printk(KERN_ERR "ipip_exit_net(2)\n");
unregister_netdevice_many(&list);
printk(KERN_ERR "ipip_exit_net(3)\n");
rtnl_unlock();
printk(KERN_ERR "ipip_exit_net(exit)\n");
Exit steps:
===== Screen dump =====
# ifconfig eth0 0.0.0.0 down
# ifconfig eth1 0.0.0.0 down
# ifconfig lo 0.0.0.0 down
# ip li de eth0
# ip li de eth1
# ifconfig -a
ip6tnl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
NOARP MTU:1460 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
LOOPBACK MTU:16436 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
tunl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
# ps
PID USER VSZ STAT COMMAND
1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
2 root 4540 S /bin/ash /var/bin/init
7 root 6640 S inetd
8 root 4544 S /bin/ash
26 root 4544 R ps
# lsmod
Module Size Used by Not tainted
macvlan 8709 0
pcnet32 29549 0
tg3 112093 0
libphy 21043 1 tg3
# kill 7 2
# ps
PID USER VSZ STAT COMMAND
1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
8 root 4544 S /bin/ash
28 root 4544 R ps
# exit ( here is the exit from netns )
# ipip_exit_net(enter)
ipip_exit_net(1)
ipip_exit_net(2)
------------[ cut here ]------------
WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953 unregister_sysctl_table+0xc7/0xf9()
Hardware name: Bochs
Modules linked in: macvlan pcnet32 tg3 libphy
Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3+ #7
Call Trace:
[<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
[<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
[<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
[<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
[<ffffffff81342108>] addrconf_ifdown+0x415/0x45e
[<ffffffff81342b98>] addrconf_notify+0x756/0x7fe
[<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
[<ffffffff813622b3>] ? ip6mr_device_event+0x8d/0x9e
[<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
[<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
[<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
[<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
[<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
[<ffffffff81324209>] ipip_exit_net+0xea/0x11a
[<ffffffff812bc941>] ? cleanup_net+0x0/0x198
[<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
[<ffffffff812bca39>] cleanup_net+0xf8/0x198
[<ffffffff810568c7>] process_one_work+0x2a2/0x44d
[<ffffffff81056e35>] worker_thread+0x1db/0x34e
[<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
[<ffffffff8105a030>] kthread+0x82/0x8a
[<ffffffff81003954>] kernel_thread_helper+0x4/0x10
[<ffffffff81059fae>] ? kthread+0x0/0x8a
[<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
---[ end trace 939b5185219f32e7 ]---
ipip_exit_net(3)
ipip_exit_net(exit)
unregister_netdevice: waiting for lo to become free. Usage count = 4
unregister_netdevice: waiting for lo to become free. Usage count = 4
unregister_netdevice: waiting for lo to become free. Usage count = 4
....
...
===== End of screen dump =====
lxc conf file:
# Container with network virtualized using the vlan device driver
# Local eth0 uplink
lxc.utsname = fee_0
lxc.network.type = macvlan
lxc.network.flags = up
lxc.network.link = eth1
lxc.network.hwaddr = 00:00:04:01:01:01
lxc.network.ipv4 = 192.168.1.21/24
lxc.network.ipv6 = 2003::2:1:1/96
# local eth1 downlink - to the RS farm
lxc.network.type = macvlan
lxc.network.flags = up
lxc.network.link = eth0
lxc.network.hwaddr = 00:00:03:01:01:01
lxc.network.ipv4 = 192.168.0.21/24
lxc.network.ipv6 = 2003::1:1:1/96
lxc.mount.entry = /var/lib/lxc/fee_0/var /var none rw,bind 0 0
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-07 8:48 BUG ? ipip unregister_netdevice_many() Hans Schillstrom
@ 2010-10-08 11:19 ` Daniel Lezcano
2010-10-08 11:53 ` Hans Schillstrom
2010-10-08 16:51 ` Eric W. Biederman
2010-10-08 16:06 ` Eric W. Biederman
1 sibling, 2 replies; 28+ messages in thread
From: Daniel Lezcano @ 2010-10-08 11:19 UTC (permalink / raw)
To: Hans Schillstrom; +Cc: netdev@vger.kernel.org, Eric W. Biederman
On 10/07/2010 10:48 AM, Hans Schillstrom wrote:
> Hello
> I'm trying to exit a network name space and it doesn't work (or am I doing something wrong?)
> The only netdevices left are lo and the tunnels ip6tnl0, sit0 and tunl0 when exiting netns.
>
> A netns is created by lxc-execute with two interfaces eth0 eth1 (macvlan)
> (see conf file at the end)
>
> Kernel: net-next-2.6 top from 4 october 2010
>
Hi Hans,
I tried to reproduce your problem but I just get a big kernel crash when
exiting the container :/
The stack is different but it may be related to the same problem.
BUG: unable to handle kernel paging request at ffff88003ba453a0
IP: [<ffffffff813020b6>] macvlan_stop+0x57/0x7d
PGD 180b063 PUD 180f063 PMD 1ffdb067 PTE 3ba45160
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/devices/virtual/net/mc0PyXBA/type
CPU 0
Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc7-next-20101007+ #11 /Bochs
RIP: 0010:[<ffffffff813020b6>] [<ffffffff813020b6>] macvlan_stop+0x57/0x7d
RSP: 0018:ffff88003f111c30 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88003bdd1e60 RCX: 000000000000c100
RDX: ffff88003ba453a0 RSI: ffff88003f111d70 RDI: ffffffff810300e5
RBP: ffff88003f111c50 R08: ffff88003f111d70 R09: 00000000000000cc
R10: 0000000000000001 R11: ffff88003f111ba0 R12: ffff88003bdd1800
R13: ffff880039fec800 R14: ffff88003f111d70 R15: ffff88003ba06830
FS: 0000000000000000(0000) GS:ffffffff8181b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88003ba453a0 CR3: 000000003c284000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:0 (pid: 5, threadinfo ffff88003f110000, task
ffff88003f0f20a0)
Stack:
ffffffff81657a10 ffff88003bdd1800 ffffffff81657a10 ffff88003bb23800
<0> ffff88003f111c70 ffffffff81362d09 ffff88003bdd1800 ffff88003f111cf0
<0> ffff88003f111c90 ffffffff81362d31 ffff88003ba067c0 ffff88003bdd1800
Call Trace:
[<ffffffff81362d09>] __dev_close+0x75/0x83
[<ffffffff81362d31>] dev_close+0x1a/0x3f
[<ffffffff81362e38>] rollback_registered_many+0xe2/0x21c
[<ffffffff81362f88>] unregister_netdevice_many+0x16/0x6d
[<ffffffff8136314d>] default_device_exit_batch+0xa7/0xbb
[<ffffffff8135db06>] ops_exit_list+0x4e/0x56
[<ffffffff8135e285>] cleanup_net+0xf5/0x195
[<ffffffff8103e084>] process_one_work+0x25d/0x3e7
[<ffffffff8103e027>] ? process_one_work+0x200/0x3e7
[<ffffffff8135e190>] ? cleanup_net+0x0/0x195
[<ffffffff8103e54a>] worker_thread+0x1b5/0x342
[<ffffffff8103e395>] ? worker_thread+0x0/0x342
[<ffffffff81041495>] kthread+0x7c/0x84
[<ffffffff810034f4>] kernel_thread_helper+0x4/0x10
[<ffffffff814389ba>] ? restore_args+0x0/0x30
[<ffffffff81041419>] ? kthread+0x0/0x84
[<ffffffff810034f0>] ? kernel_thread_helper+0x0/0x10
Code: 00 00 02 74 0b 83 ce ff 4c 89 ef e8 ab eb 05 00 49 8b b4 24 a0 02
00 00 4c 89 ef e8 b9 52 06 00 48 8b 43 18 48 8b 53 20 48 85 c0 <48> 89
02 74 04 48 89 50 08 48 be 00 02 20 00 00 00 ad de 48 89
RIP [<ffffffff813020b6>] macvlan_stop+0x57/0x7d
RSP <ffff88003f111c30>
CR2: ffff88003ba453a0
---[ end trace 05c41c2103816005 ]---
BUG: unable to handle kernel paging request at fffffffffffffff8
IP: [<ffffffff810410bf>] kthread_data+0xb/0x11
PGD 180c067 PUD 180d067 PMD 0
Oops: 0000 [#2] DEBUG_PAGEALLOC
last sysfs file: /sys/devices/virtual/net/mc0PyXBA/type
CPU 0
Pid: 5, comm: kworker/u:0 Tainted: G D
2.6.36-rc7-next-20101007+ #11 /Bochs
RIP: 0010:[<ffffffff810410bf>] [<ffffffff810410bf>] kthread_data+0xb/0x11
RSP: 0018:ffff88003f111868 EFLAGS: 00010096
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88003f111fd8
RDX: ffff88003f0f20a0 RSI: 0000000000000000 RDI: ffff88003f0f20a0
RBP: ffff88003f111868 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000246 R11: 09f911029d74e35b R12: 0000000000000000
R13: ffff88003f111948 R14: ffff88003f0c60a0 R15: ffff88003f0f2218
FS: 0000000000000000(0000) GS:ffffffff8181b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 000000003cb19000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:0 (pid: 5, threadinfo ffff88003f110000, task
ffff88003f0f20a0)
Stack:
ffff88003f111888 ffffffff8103d4e4 ffff88003f111888 ffff88003f0f2310
<0> ffff88003f110010 ffff88003f0f20a0 ffff88003f111fd8 ffff88003f111fd8
Call Trace:
[<ffffffff8103d4e4>] wq_worker_sleeping+0x10/0x76
[<ffffffff81435ffe>] schedule+0xf4/0x405
[<ffffffff8102ebc4>] do_exit+0x647/0x660
[<ffffffff81005ba0>] oops_end+0xb3/0xbb
[<ffffffff8101c6b8>] no_context+0x1f5/0x204
[<ffffffff8101c854>] __bad_area_nosemaphore+0x18d/0x1b0
[<ffffffff8101c885>] bad_area_nosemaphore+0xe/0x10
[<ffffffff8101cb52>] do_page_fault+0x16b/0x34d
[<ffffffff81300a85>] ? ei_set_multicast_list+0x1f/0x3d
[<ffffffff81437f46>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff81438b9f>] page_fault+0x1f/0x30
[<ffffffff810300e5>] ? local_bh_enable_ip+0xb7/0xbd
[<ffffffff813020b6>] ? macvlan_stop+0x57/0x7d
[<ffffffff81362d09>] __dev_close+0x75/0x83
[<ffffffff81362d31>] dev_close+0x1a/0x3f
[<ffffffff81362e38>] rollback_registered_many+0xe2/0x21c
[<ffffffff81362f88>] unregister_netdevice_many+0x16/0x6d
[<ffffffff8136314d>] default_device_exit_batch+0xa7/0xbb
[<ffffffff8135db06>] ops_exit_list+0x4e/0x56
[<ffffffff8135e285>] cleanup_net+0xf5/0x195
[<ffffffff8103e084>] process_one_work+0x25d/0x3e7
[<ffffffff8103e027>] ? process_one_work+0x200/0x3e7
[<ffffffff8135e190>] ? cleanup_net+0x0/0x195
[<ffffffff8103e54a>] worker_thread+0x1b5/0x342
[<ffffffff8103e395>] ? worker_thread+0x0/0x342
[<ffffffff81041495>] kthread+0x7c/0x84
[<ffffffff810034f4>] kernel_thread_helper+0x4/0x10
[<ffffffff814389ba>] ? restore_args+0x0/0x30
[<ffffffff81041419>] ? kthread+0x0/0x84
[<ffffffff810034f0>] ? kernel_thread_helper+0x0/0x10
Code: 5c 41 5d 41 5e c9 c3 90 55 48 8b 04 25 40 a0 81 81 48 8b 80 18 02
00 00 48 89 e5 8b 40 f0 c9 c3 48 8b 87 18 02 00 00 55 48 89 e5 <48> 8b
40 f8 c9 c3 48 89 f0 c1 ee 06 55 89 f6 83 e0 3f 48 c1 e6
RIP [<ffffffff810410bf>] kthread_data+0xb/0x11
RSP <ffff88003f111868>
CR2: fffffffffffffff8
---[ end trace 05c41c2103816006 ]---
Thanks
-- Daniel
> I added some printk's inn ipip.c ipip_exit_net()
> ...
> rtnl_lock();
> printk(KERN_ERR "ipip_exit_net(enter)\n");
> ipip_destroy_tunnels(ipn,&list);
> printk(KERN_ERR "ipip_exit_net(1)\n");
> unregister_netdevice_queue(ipn->fb_tunnel_dev,&list);
> printk(KERN_ERR "ipip_exit_net(2)\n");
> unregister_netdevice_many(&list);
> printk(KERN_ERR "ipip_exit_net(3)\n");
> rtnl_unlock();
> printk(KERN_ERR "ipip_exit_net(exit)\n");
>
>
> Exit steps:
> ===== Screen dump =====
>
> # ifconfig eth0 0.0.0.0 down
> # ifconfig eth1 0.0.0.0 down
> # ifconfig lo 0.0.0.0 down
> # ip li de eth0
> # ip li de eth1
> # ifconfig -a
> ip6tnl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> NOARP MTU:1460 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> LOOPBACK MTU:16436 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> sit0 Link encap:IPv6-in-IPv4
> NOARP MTU:1480 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> tunl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> NOARP MTU:1480 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> # ps
> PID USER VSZ STAT COMMAND
> 1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
> 2 root 4540 S /bin/ash /var/bin/init
> 7 root 6640 S inetd
> 8 root 4544 S /bin/ash
> 26 root 4544 R ps
> # lsmod
> Module Size Used by Not tainted
> macvlan 8709 0
> pcnet32 29549 0
> tg3 112093 0
> libphy 21043 1 tg3
> # kill 7 2
> # ps
> PID USER VSZ STAT COMMAND
> 1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
> 8 root 4544 S /bin/ash
> 28 root 4544 R ps
> # exit ( here is the exit from netns )
> # ipip_exit_net(enter)
> ipip_exit_net(1)
> ipip_exit_net(2)
> ------------[ cut here ]------------
> WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953 unregister_sysctl_table+0xc7/0xf9()
> Hardware name: Bochs
> Modules linked in: macvlan pcnet32 tg3 libphy
> Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3+ #7
> Call Trace:
> [<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
> [<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
> [<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
> [<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
> [<ffffffff81342108>] addrconf_ifdown+0x415/0x45e
> [<ffffffff81342b98>] addrconf_notify+0x756/0x7fe
> [<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
> [<ffffffff813622b3>] ? ip6mr_device_event+0x8d/0x9e
> [<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
> [<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
> [<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
> [<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
> [<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
> [<ffffffff81324209>] ipip_exit_net+0xea/0x11a
> [<ffffffff812bc941>] ? cleanup_net+0x0/0x198
> [<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
> [<ffffffff812bca39>] cleanup_net+0xf8/0x198
> [<ffffffff810568c7>] process_one_work+0x2a2/0x44d
> [<ffffffff81056e35>] worker_thread+0x1db/0x34e
> [<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
> [<ffffffff8105a030>] kthread+0x82/0x8a
> [<ffffffff81003954>] kernel_thread_helper+0x4/0x10
> [<ffffffff81059fae>] ? kthread+0x0/0x8a
> [<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
> ---[ end trace 939b5185219f32e7 ]---
> ipip_exit_net(3)
> ipip_exit_net(exit)
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> ....
> ...
> ===== End of screen dump =====
>
> lxc conf file:
> # Container with network virtualized using the vlan device driver
> # Local eth0 uplink
> lxc.utsname = fee_0
> lxc.network.type = macvlan
> lxc.network.flags = up
> lxc.network.link = eth1
> lxc.network.hwaddr = 00:00:04:01:01:01
> lxc.network.ipv4 = 192.168.1.21/24
> lxc.network.ipv6 = 2003::2:1:1/96
> # local eth1 downlink - to the RS farm
> lxc.network.type = macvlan
> lxc.network.flags = up
> lxc.network.link = eth0
> lxc.network.hwaddr = 00:00:03:01:01:01
> lxc.network.ipv4 = 192.168.0.21/24
> lxc.network.ipv6 = 2003::1:1:1/96
> lxc.mount.entry = /var/lib/lxc/fee_0/var /var none rw,bind 0 0
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 11:19 ` Daniel Lezcano
@ 2010-10-08 11:53 ` Hans Schillstrom
2010-10-08 12:28 ` Hans Schillstrom
2010-10-08 16:51 ` Eric W. Biederman
1 sibling, 1 reply; 28+ messages in thread
From: Hans Schillstrom @ 2010-10-08 11:53 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: netdev@vger.kernel.org, Eric W. Biederman
On Friday 08 October 2010 13:19:08 Daniel Lezcano wrote:
Hello
> On 10/07/2010 10:48 AM, Hans Schillstrom wrote:
> > Hello
> > I'm trying to exit a network name space and it doesn't work (or am I doing something wrong?)
> > The only netdevices left are lo and the tunnels ip6tnl0, sit0 and tunl0 when exiting netns.
> >
> > A netns is created by lxc-execute with two interfaces eth0 eth1 (macvlan)
> > (see conf file at the end)
> >
> > Kernel: net-next-2.6 top from 4 october 2010
> >
>
> Hi Hans,
>
> I tried to reproduce your problem but I just get a big kernel crash when
> exiting the container :/
>
> The stack is different but it may be related to the same problem.
>
> BUG: unable to handle kernel paging request at ffff88003ba453a0
> IP: [<ffffffff813020b6>] macvlan_stop+0x57/0x7d
> PGD 180b063 PUD 180f063 PMD 1ffdb067 PTE 3ba45160
> Oops: 0002 [#1] DEBUG_PAGEALLOC
> last sysfs file: /sys/devices/virtual/net/mc0PyXBA/type
> CPU 0
> Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc7-next-20101007+ #11 /Bochs
> RIP: 0010:[<ffffffff813020b6>] [<ffffffff813020b6>] macvlan_stop+0x57/0x7d
> RSP: 0018:ffff88003f111c30 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff88003bdd1e60 RCX: 000000000000c100
> RDX: ffff88003ba453a0 RSI: ffff88003f111d70 RDI: ffffffff810300e5
> RBP: ffff88003f111c50 R08: ffff88003f111d70 R09: 00000000000000cc
> R10: 0000000000000001 R11: ffff88003f111ba0 R12: ffff88003bdd1800
> R13: ffff880039fec800 R14: ffff88003f111d70 R15: ffff88003ba06830
> FS: 0000000000000000(0000) GS:ffffffff8181b000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff88003ba453a0 CR3: 000000003c284000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/u:0 (pid: 5, threadinfo ffff88003f110000, task
> ffff88003f0f20a0)
> Stack:
> ffffffff81657a10 ffff88003bdd1800 ffffffff81657a10 ffff88003bb23800
> <0> ffff88003f111c70 ffffffff81362d09 ffff88003bdd1800 ffff88003f111cf0
> <0> ffff88003f111c90 ffffffff81362d31 ffff88003ba067c0 ffff88003bdd1800
> Call Trace:
> [<ffffffff81362d09>] __dev_close+0x75/0x83
> [<ffffffff81362d31>] dev_close+0x1a/0x3f
> [<ffffffff81362e38>] rollback_registered_many+0xe2/0x21c
> [<ffffffff81362f88>] unregister_netdevice_many+0x16/0x6d
> [<ffffffff8136314d>] default_device_exit_batch+0xa7/0xbb
> [<ffffffff8135db06>] ops_exit_list+0x4e/0x56
> [<ffffffff8135e285>] cleanup_net+0xf5/0x195
> [<ffffffff8103e084>] process_one_work+0x25d/0x3e7
> [<ffffffff8103e027>] ? process_one_work+0x200/0x3e7
> [<ffffffff8135e190>] ? cleanup_net+0x0/0x195
> [<ffffffff8103e54a>] worker_thread+0x1b5/0x342
> [<ffffffff8103e395>] ? worker_thread+0x0/0x342
> [<ffffffff81041495>] kthread+0x7c/0x84
> [<ffffffff810034f4>] kernel_thread_helper+0x4/0x10
> [<ffffffff814389ba>] ? restore_args+0x0/0x30
> [<ffffffff81041419>] ? kthread+0x0/0x84
> [<ffffffff810034f0>] ? kernel_thread_helper+0x0/0x10
> Code: 00 00 02 74 0b 83 ce ff 4c 89 ef e8 ab eb 05 00 49 8b b4 24 a0 02
> 00 00 4c 89 ef e8 b9 52 06 00 48 8b 43 18 48 8b 53 20 48 85 c0 <48> 89
> 02 74 04 48 89 50 08 48 be 00 02 20 00 00 00 ad de 48 89
> RIP [<ffffffff813020b6>] macvlan_stop+0x57/0x7d
> RSP <ffff88003f111c30>
> CR2: ffff88003ba453a0
> ---[ end trace 05c41c2103816005 ]---
> BUG: unable to handle kernel paging request at fffffffffffffff8
> IP: [<ffffffff810410bf>] kthread_data+0xb/0x11
> PGD 180c067 PUD 180d067 PMD 0
> Oops: 0000 [#2] DEBUG_PAGEALLOC
> last sysfs file: /sys/devices/virtual/net/mc0PyXBA/type
> CPU 0
> Pid: 5, comm: kworker/u:0 Tainted: G D
> 2.6.36-rc7-next-20101007+ #11 /Bochs
> RIP: 0010:[<ffffffff810410bf>] [<ffffffff810410bf>] kthread_data+0xb/0x11
> RSP: 0018:ffff88003f111868 EFLAGS: 00010096
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88003f111fd8
> RDX: ffff88003f0f20a0 RSI: 0000000000000000 RDI: ffff88003f0f20a0
> RBP: ffff88003f111868 R08: 0000000000000002 R09: 0000000000000001
> R10: 0000000000000246 R11: 09f911029d74e35b R12: 0000000000000000
> R13: ffff88003f111948 R14: ffff88003f0c60a0 R15: ffff88003f0f2218
> FS: 0000000000000000(0000) GS:ffffffff8181b000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: fffffffffffffff8 CR3: 000000003cb19000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/u:0 (pid: 5, threadinfo ffff88003f110000, task
> ffff88003f0f20a0)
> Stack:
> ffff88003f111888 ffffffff8103d4e4 ffff88003f111888 ffff88003f0f2310
> <0> ffff88003f110010 ffff88003f0f20a0 ffff88003f111fd8 ffff88003f111fd8
> Call Trace:
> [<ffffffff8103d4e4>] wq_worker_sleeping+0x10/0x76
> [<ffffffff81435ffe>] schedule+0xf4/0x405
> [<ffffffff8102ebc4>] do_exit+0x647/0x660
> [<ffffffff81005ba0>] oops_end+0xb3/0xbb
> [<ffffffff8101c6b8>] no_context+0x1f5/0x204
> [<ffffffff8101c854>] __bad_area_nosemaphore+0x18d/0x1b0
> [<ffffffff8101c885>] bad_area_nosemaphore+0xe/0x10
> [<ffffffff8101cb52>] do_page_fault+0x16b/0x34d
> [<ffffffff81300a85>] ? ei_set_multicast_list+0x1f/0x3d
> [<ffffffff81437f46>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff81438b9f>] page_fault+0x1f/0x30
> [<ffffffff810300e5>] ? local_bh_enable_ip+0xb7/0xbd
> [<ffffffff813020b6>] ? macvlan_stop+0x57/0x7d
> [<ffffffff81362d09>] __dev_close+0x75/0x83
> [<ffffffff81362d31>] dev_close+0x1a/0x3f
> [<ffffffff81362e38>] rollback_registered_many+0xe2/0x21c
> [<ffffffff81362f88>] unregister_netdevice_many+0x16/0x6d
> [<ffffffff8136314d>] default_device_exit_batch+0xa7/0xbb
> [<ffffffff8135db06>] ops_exit_list+0x4e/0x56
> [<ffffffff8135e285>] cleanup_net+0xf5/0x195
> [<ffffffff8103e084>] process_one_work+0x25d/0x3e7
> [<ffffffff8103e027>] ? process_one_work+0x200/0x3e7
> [<ffffffff8135e190>] ? cleanup_net+0x0/0x195
> [<ffffffff8103e54a>] worker_thread+0x1b5/0x342
> [<ffffffff8103e395>] ? worker_thread+0x0/0x342
> [<ffffffff81041495>] kthread+0x7c/0x84
> [<ffffffff810034f4>] kernel_thread_helper+0x4/0x10
> [<ffffffff814389ba>] ? restore_args+0x0/0x30
> [<ffffffff81041419>] ? kthread+0x0/0x84
> [<ffffffff810034f0>] ? kernel_thread_helper+0x0/0x10
> Code: 5c 41 5d 41 5e c9 c3 90 55 48 8b 04 25 40 a0 81 81 48 8b 80 18 02
> 00 00 48 89 e5 8b 40 f0 c9 c3 48 8b 87 18 02 00 00 55 48 89 e5 <48> 8b
> 40 f8 c9 c3 48 89 f0 c1 ee 06 55 89 f6 83 e0 3f 48 c1 e6
> RIP [<ffffffff810410bf>] kthread_data+0xb/0x11
> RSP <ffff88003f111868>
> CR2: fffffffffffffff8
> ---[ end trace 05c41c2103816006 ]---
>
>
>
> Thanks
> -- Daniel
I did the same setup without any tunnel modules loaded and then it almost worked
except free-ing of the loopback interface :-(
"unregister_netdevice: waiting for lo to become free. Usage count = 4"
When adding a tunnel module (here ip6_tunnel) you'll have the crash
ex:
/var/lib/lxc # ifconfig -a
ip6tnl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
NOARP MTU:1460 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
LOOPBACK MTU:16436 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
# ifconfig
# ps
PID USER VSZ STAT COMMAND
1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
2 root 4540 S /bin/ash /var/bin/init
6 root 4544 S /bin/ash
16 root 4544 R ps
# kill 2
# ^D (exit of ns)
# ------------[ cut here ]------------
WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953 unregister_sysctl_table+0xc7/0xf9()
Hardware name: Bochs
Modules linked in: macvlan ip6_tunnel tunnel6 pcnet32 tg3 libphy
Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3 #2
Call Trace:
[<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
[<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
[<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
[<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
[<ffffffff81340c75>] addrconf_ifdown+0x415/0x45e
[<ffffffff81341705>] addrconf_notify+0x756/0x7fe
[<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
[<ffffffff81360eb3>] ? ip6mr_device_event+0x8d/0x9e
[<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
[<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
[<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
[<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
[<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
[<ffffffffa0047244>] ip6_tnl_exit_net+0xa4/0xb8 [ip6_tunnel]
[<ffffffff812bc941>] ? cleanup_net+0x0/0x198
[<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
[<ffffffff812bca39>] cleanup_net+0xf8/0x198
[<ffffffff810568c7>] process_one_work+0x2a2/0x44d
[<ffffffff81056e35>] worker_thread+0x1db/0x34e
[<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
[<ffffffff8105a030>] kthread+0x82/0x8a
[<ffffffff81003954>] kernel_thread_helper+0x4/0x10
[<ffffffff81059fae>] ? kthread+0x0/0x8a
[<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
---[ end trace eb3bc950cf9a8748 ]---
unregister_netdevice: waiting for lo to become free. Usage count = 4
unregister_netdevice: waiting for lo to become free. Usage count = 4
unregister_netdevice: waiting for lo to become free. Usage count = 4
Regards
Hans
>
> > I added some printk's inn ipip.c ipip_exit_net()
> > ...
> > rtnl_lock();
> > printk(KERN_ERR "ipip_exit_net(enter)\n");
> > ipip_destroy_tunnels(ipn,&list);
> > printk(KERN_ERR "ipip_exit_net(1)\n");
> > unregister_netdevice_queue(ipn->fb_tunnel_dev,&list);
> > printk(KERN_ERR "ipip_exit_net(2)\n");
> > unregister_netdevice_many(&list);
> > printk(KERN_ERR "ipip_exit_net(3)\n");
> > rtnl_unlock();
> > printk(KERN_ERR "ipip_exit_net(exit)\n");
> >
> >
> > Exit steps:
> > ===== Screen dump =====
> >
> > # ifconfig eth0 0.0.0.0 down
> > # ifconfig eth1 0.0.0.0 down
> > # ifconfig lo 0.0.0.0 down
> > # ip li de eth0
> > # ip li de eth1
> > # ifconfig -a
> > ip6tnl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> > NOARP MTU:1460 Metric:1
> > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> >
> > lo Link encap:Local Loopback
> > inet addr:127.0.0.1 Mask:255.0.0.0
> > LOOPBACK MTU:16436 Metric:1
> > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> >
> > sit0 Link encap:IPv6-in-IPv4
> > NOARP MTU:1480 Metric:1
> > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> >
> > tunl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> > NOARP MTU:1480 Metric:1
> > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> >
> > # ps
> > PID USER VSZ STAT COMMAND
> > 1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
> > 2 root 4540 S /bin/ash /var/bin/init
> > 7 root 6640 S inetd
> > 8 root 4544 S /bin/ash
> > 26 root 4544 R ps
> > # lsmod
> > Module Size Used by Not tainted
> > macvlan 8709 0
> > pcnet32 29549 0
> > tg3 112093 0
> > libphy 21043 1 tg3
> > # kill 7 2
> > # ps
> > PID USER VSZ STAT COMMAND
> > 1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
> > 8 root 4544 S /bin/ash
> > 28 root 4544 R ps
> > # exit ( here is the exit from netns )
> > # ipip_exit_net(enter)
> > ipip_exit_net(1)
> > ipip_exit_net(2)
> > ------------[ cut here ]------------
> > WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953 unregister_sysctl_table+0xc7/0xf9()
> > Hardware name: Bochs
> > Modules linked in: macvlan pcnet32 tg3 libphy
> > Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3+ #7
> > Call Trace:
> > [<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
> > [<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
> > [<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
> > [<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
> > [<ffffffff81342108>] addrconf_ifdown+0x415/0x45e
> > [<ffffffff81342b98>] addrconf_notify+0x756/0x7fe
> > [<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
> > [<ffffffff813622b3>] ? ip6mr_device_event+0x8d/0x9e
> > [<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
> > [<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
> > [<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
> > [<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
> > [<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
> > [<ffffffff81324209>] ipip_exit_net+0xea/0x11a
> > [<ffffffff812bc941>] ? cleanup_net+0x0/0x198
> > [<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
> > [<ffffffff812bca39>] cleanup_net+0xf8/0x198
> > [<ffffffff810568c7>] process_one_work+0x2a2/0x44d
> > [<ffffffff81056e35>] worker_thread+0x1db/0x34e
> > [<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
> > [<ffffffff8105a030>] kthread+0x82/0x8a
> > [<ffffffff81003954>] kernel_thread_helper+0x4/0x10
> > [<ffffffff81059fae>] ? kthread+0x0/0x8a
> > [<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
> > ---[ end trace 939b5185219f32e7 ]---
> > ipip_exit_net(3)
> > ipip_exit_net(exit)
> > unregister_netdevice: waiting for lo to become free. Usage count = 4
> > unregister_netdevice: waiting for lo to become free. Usage count = 4
> > unregister_netdevice: waiting for lo to become free. Usage count = 4
> > ....
> > ...
> > ===== End of screen dump =====
> >
> > lxc conf file:
> > # Container with network virtualized using the vlan device driver
> > # Local eth0 uplink
> > lxc.utsname = fee_0
> > lxc.network.type = macvlan
> > lxc.network.flags = up
> > lxc.network.link = eth1
> > lxc.network.hwaddr = 00:00:04:01:01:01
> > lxc.network.ipv4 = 192.168.1.21/24
> > lxc.network.ipv6 = 2003::2:1:1/96
> > # local eth1 downlink - to the RS farm
> > lxc.network.type = macvlan
> > lxc.network.flags = up
> > lxc.network.link = eth0
> > lxc.network.hwaddr = 00:00:03:01:01:01
> > lxc.network.ipv4 = 192.168.0.21/24
> > lxc.network.ipv6 = 2003::1:1:1/96
> > lxc.mount.entry = /var/lib/lxc/fee_0/var /var none rw,bind 0 0
> >
>
--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 11:53 ` Hans Schillstrom
@ 2010-10-08 12:28 ` Hans Schillstrom
2010-10-08 15:53 ` Daniel Lezcano
2010-10-08 16:45 ` Eric W. Biederman
0 siblings, 2 replies; 28+ messages in thread
From: Hans Schillstrom @ 2010-10-08 12:28 UTC (permalink / raw)
To: Daniel Lezcano, Eric W. Biederman; +Cc: netdev@vger.kernel.org
Hi Eric,
Any advice how to trace this down ?
This rollback_registered_many() seems to have on the lists before...
All IPv4 and IPv6 tunnels causes this crash, all you have to do is load the tunnel module(s)
enter a new ns and exit from it.
Have not tested any more devices than tunnels,
I did an "ip link delete" on my macvlans before exiting the ns.
snip
> # ------------[ cut here ]------------
> WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953 unregister_sysctl_table+0xc7/0xf9()
> Hardware name: Bochs
> Modules linked in: macvlan ip6_tunnel tunnel6 pcnet32 tg3 libphy
> Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3 #2
> Call Trace:
> [<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
> [<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
> [<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
> [<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
> [<ffffffff81340c75>] addrconf_ifdown+0x415/0x45e
> [<ffffffff81341705>] addrconf_notify+0x756/0x7fe
> [<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
> [<ffffffff81360eb3>] ? ip6mr_device_event+0x8d/0x9e
> [<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
> [<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
> [<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
> [<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
> [<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
> [<ffffffffa0047244>] ip6_tnl_exit_net+0xa4/0xb8 [ip6_tunnel]
> [<ffffffff812bc941>] ? cleanup_net+0x0/0x198
> [<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
> [<ffffffff812bca39>] cleanup_net+0xf8/0x198
> [<ffffffff810568c7>] process_one_work+0x2a2/0x44d
> [<ffffffff81056e35>] worker_thread+0x1db/0x34e
> [<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
> [<ffffffff8105a030>] kthread+0x82/0x8a
> [<ffffffff81003954>] kernel_thread_helper+0x4/0x10
> [<ffffffff81059fae>] ? kthread+0x0/0x8a
> [<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
> ---[ end trace eb3bc950cf9a8748 ]---
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4
--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 12:28 ` Hans Schillstrom
@ 2010-10-08 15:53 ` Daniel Lezcano
2010-10-08 16:17 ` Daniel Lezcano
2010-10-08 16:45 ` Eric W. Biederman
1 sibling, 1 reply; 28+ messages in thread
From: Daniel Lezcano @ 2010-10-08 15:53 UTC (permalink / raw)
To: Hans Schillstrom; +Cc: Eric W. Biederman, netdev@vger.kernel.org
On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
> Hi Eric,
> Any advice how to trace this down ?
> This rollback_registered_many() seems to have on the lists before...
> All IPv4 and IPv6 tunnels causes this crash, all you have to do is load the tunnel module(s)
> enter a new ns and exit from it.
>
> Have not tested any more devices than tunnels,
> I did an "ip link delete" on my macvlans before exiting the ns.
>
Ah ! I succeed to reproduce it.
It does not appear immediately in fact.
I am trying to simplify the configuration but I am falling in the bug I
talked about in the previous email.
> snip
>
>> # ------------[ cut here ]------------
>> WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953 unregister_sysctl_table+0xc7/0xf9()
>> Hardware name: Bochs
>> Modules linked in: macvlan ip6_tunnel tunnel6 pcnet32 tg3 libphy
>> Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3 #2
>> Call Trace:
>> [<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
>> [<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
>> [<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
>> [<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
>> [<ffffffff81340c75>] addrconf_ifdown+0x415/0x45e
>> [<ffffffff81341705>] addrconf_notify+0x756/0x7fe
>> [<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
>> [<ffffffff81360eb3>] ? ip6mr_device_event+0x8d/0x9e
>> [<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
>> [<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
>> [<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
>> [<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
>> [<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
>> [<ffffffffa0047244>] ip6_tnl_exit_net+0xa4/0xb8 [ip6_tunnel]
>> [<ffffffff812bc941>] ? cleanup_net+0x0/0x198
>> [<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
>> [<ffffffff812bca39>] cleanup_net+0xf8/0x198
>> [<ffffffff810568c7>] process_one_work+0x2a2/0x44d
>> [<ffffffff81056e35>] worker_thread+0x1db/0x34e
>> [<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
>> [<ffffffff8105a030>] kthread+0x82/0x8a
>> [<ffffffff81003954>] kernel_thread_helper+0x4/0x10
>> [<ffffffff81059fae>] ? kthread+0x0/0x8a
>> [<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
>> ---[ end trace eb3bc950cf9a8748 ]---
>> unregister_netdevice: waiting for lo to become free. Usage count = 4
>> unregister_netdevice: waiting for lo to become free. Usage count = 4
>> unregister_netdevice: waiting for lo to become free. Usage count = 4
>>
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-07 8:48 BUG ? ipip unregister_netdevice_many() Hans Schillstrom
2010-10-08 11:19 ` Daniel Lezcano
@ 2010-10-08 16:06 ` Eric W. Biederman
1 sibling, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-08 16:06 UTC (permalink / raw)
To: Hans Schillstrom; +Cc: netdev@vger.kernel.org, Daniel Lezcano
Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> Hello
> I'm trying to exit a network name space and it doesn't work (or am I doing something wrong?)
> The only netdevices left are lo and the tunnels ip6tnl0, sit0 and tunl0 when exiting netns.
>
> A netns is created by lxc-execute with two interfaces eth0 eth1 (macvlan)
> (see conf file at the end)
>
> Kernel: net-next-2.6 top from 4 october 2010
>
> I added some printk's inn ipip.c ipip_exit_net()
> ...
> rtnl_lock();
> printk(KERN_ERR "ipip_exit_net(enter)\n");
> ipip_destroy_tunnels(ipn, &list);
> printk(KERN_ERR "ipip_exit_net(1)\n");
> unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
> printk(KERN_ERR "ipip_exit_net(2)\n");
> unregister_netdevice_many(&list);
> printk(KERN_ERR "ipip_exit_net(3)\n");
> rtnl_unlock();
> printk(KERN_ERR "ipip_exit_net(exit)\n");
>
>
> Exit steps:
> ===== Screen dump =====
>
> # ifconfig eth0 0.0.0.0 down
> # ifconfig eth1 0.0.0.0 down
> # ifconfig lo 0.0.0.0 down
> # ip li de eth0
> # ip li de eth1
> # ifconfig -a
> ip6tnl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> NOARP MTU:1460 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> LOOPBACK MTU:16436 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> sit0 Link encap:IPv6-in-IPv4
> NOARP MTU:1480 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> tunl0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> NOARP MTU:1480 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> # ps
> PID USER VSZ STAT COMMAND
> 1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
> 2 root 4540 S /bin/ash /var/bin/init
> 7 root 6640 S inetd
> 8 root 4544 S /bin/ash
> 26 root 4544 R ps
> # lsmod
> Module Size Used by Not tainted
> macvlan 8709 0
> pcnet32 29549 0
> tg3 112093 0
> libphy 21043 1 tg3
> # kill 7 2
> # ps
> PID USER VSZ STAT COMMAND
> 1 root 12412 S /usr/lib64/lxc/lxc-init -- /var/bin/init
> 8 root 4544 S /bin/ash
> 28 root 4544 R ps
> # exit ( here is the exit from netns )
> # ipip_exit_net(enter)
> ipip_exit_net(1)
> ipip_exit_net(2)
> ------------[ cut here ]------------
> WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953
> unregister_sysctl_table+0xc7/0xf9()
This warning is caused by removing the parent directory
before the child in the sysctl tables. Not strictly fatal but
it is a problem. It may be worth looking at which sysctl
tables ipip registers to see if we can rectify this.
> Hardware name: Bochs
> Modules linked in: macvlan pcnet32 tg3 libphy
> Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3+ #7
> Call Trace:
> [<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
> [<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
> [<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
> [<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
> [<ffffffff81342108>] addrconf_ifdown+0x415/0x45e
> [<ffffffff81342b98>] addrconf_notify+0x756/0x7fe
> [<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
> [<ffffffff813622b3>] ? ip6mr_device_event+0x8d/0x9e
> [<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
> [<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
> [<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
> [<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
> [<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
> [<ffffffff81324209>] ipip_exit_net+0xea/0x11a
> [<ffffffff812bc941>] ? cleanup_net+0x0/0x198
> [<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
> [<ffffffff812bca39>] cleanup_net+0xf8/0x198
> [<ffffffff810568c7>] process_one_work+0x2a2/0x44d
> [<ffffffff81056e35>] worker_thread+0x1db/0x34e
> [<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
> [<ffffffff8105a030>] kthread+0x82/0x8a
> [<ffffffff81003954>] kernel_thread_helper+0x4/0x10
> [<ffffffff81059fae>] ? kthread+0x0/0x8a
> [<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
> ---[ end trace 939b5185219f32e7 ]---
> ipip_exit_net(3)
> ipip_exit_net(exit)
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4
Nasty. Someone has left a reference lying around to one of the network
devices. It is a reference that we can transfer to the loopback device
at device exit time, but we never drop the reference and so the loopback
interface never frees up.
Ouch!
There is the painful method of instrumenting of dev_hold and dev_release
that may give you a clue. It may also be worth seeing which kinds of
device reference we transfer from the loopback device when a device
exits.
Eric
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 15:53 ` Daniel Lezcano
@ 2010-10-08 16:17 ` Daniel Lezcano
2010-10-08 16:58 ` Eric W. Biederman
0 siblings, 1 reply; 28+ messages in thread
From: Daniel Lezcano @ 2010-10-08 16:17 UTC (permalink / raw)
Cc: Hans Schillstrom, Eric W. Biederman, netdev@vger.kernel.org
On 10/08/2010 05:53 PM, Daniel Lezcano wrote:
> On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
>> Hi Eric,
>> Any advice how to trace this down ?
>> This rollback_registered_many() seems to have on the lists before...
>> All IPv4 and IPv6 tunnels causes this crash, all you have to do is
>> load the tunnel module(s)
>> enter a new ns and exit from it.
>>
>> Have not tested any more devices than tunnels,
>> I did an "ip link delete" on my macvlans before exiting the ns.
>
> Ah ! I succeed to reproduce it.
> It does not appear immediately in fact.
>
> I am trying to simplify the configuration but I am falling in the bug I
> talked about in the previous email.
Ok, so after investigating, we just need a macvlan and specify an ipv6
address for it (inside a new netns of course), and the loopback is not
released. I compiled out the tunnels, so they are not related to this
problem I think.
That reduces the scope of investigation :)
Looking forward ...
-- Daniel
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 12:28 ` Hans Schillstrom
2010-10-08 15:53 ` Daniel Lezcano
@ 2010-10-08 16:45 ` Eric W. Biederman
2010-10-08 17:20 ` David Miller
1 sibling, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-08 16:45 UTC (permalink / raw)
To: Hans Schillstrom; +Cc: Daniel Lezcano, netdev@vger.kernel.org
Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> Hi Eric,
> Any advice how to trace this down ?
> This rollback_registered_many() seems to have on the lists before...
That is just the core piece that does device registration.
> All IPv4 and IPv6 tunnels causes this crash, all you have to do is load the tunnel module(s)
> enter a new ns and exit from it.
Ouch.
> Have not tested any more devices than tunnels,
> I did an "ip link delete" on my macvlans before exiting the ns.
My hunch is that we have dst entry problems, as I know those hop network
interfaces when we destroy network devices, but I have seen weird issues
with the route cache as well.
Grumble.
Eric
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 11:19 ` Daniel Lezcano
2010-10-08 11:53 ` Hans Schillstrom
@ 2010-10-08 16:51 ` Eric W. Biederman
1 sibling, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-08 16:51 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Hans Schillstrom, netdev@vger.kernel.org
Daniel Lezcano <daniel.lezcano@free.fr> writes:
> On 10/07/2010 10:48 AM, Hans Schillstrom wrote:
>> Hello
>> I'm trying to exit a network name space and it doesn't work (or am I doing something wrong?)
>> The only netdevices left are lo and the tunnels ip6tnl0, sit0 and tunl0 when exiting netns.
>>
>> A netns is created by lxc-execute with two interfaces eth0 eth1 (macvlan)
>> (see conf file at the end)
>>
>> Kernel: net-next-2.6 top from 4 october 2010
>>
>
> Hi Hans,
>
> I tried to reproduce your problem but I just get a big kernel crash when exiting
> the container :/
>
> The stack is different but it may be related to the same problem.
Double ouch. My guess that this is more related to the recent macvlan
changes.
It looks like there is plenty of debugging work to do :(
Ouch Ouch Ouch!
Eric
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 16:17 ` Daniel Lezcano
@ 2010-10-08 16:58 ` Eric W. Biederman
2010-10-08 17:29 ` Daniel Lezcano
0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-08 16:58 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Hans Schillstrom, netdev@vger.kernel.org
Daniel Lezcano <daniel.lezcano@free.fr> writes:
> On 10/08/2010 05:53 PM, Daniel Lezcano wrote:
>> On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
>>> Hi Eric,
>>> Any advice how to trace this down ?
>>> This rollback_registered_many() seems to have on the lists before...
>>> All IPv4 and IPv6 tunnels causes this crash, all you have to do is
>>> load the tunnel module(s)
>>> enter a new ns and exit from it.
>>>
>>> Have not tested any more devices than tunnels,
>>> I did an "ip link delete" on my macvlans before exiting the ns.
>>
>> Ah ! I succeed to reproduce it.
>> It does not appear immediately in fact.
>>
>> I am trying to simplify the configuration but I am falling in the bug I
>> talked about in the previous email.
>
> Ok, so after investigating, we just need a macvlan and specify an ipv6 address
> for it (inside a new netns of course), and the loopback is not released. I
> compiled out the tunnels, so they are not related to this problem I think.
>
> That reduces the scope of investigation :)
This reproduces the unable to free nedevice problem? Or the bad
pointer in macvlan_close problem?
Eric
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 16:45 ` Eric W. Biederman
@ 2010-10-08 17:20 ` David Miller
2010-10-08 17:32 ` Eric W. Biederman
0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-10-08 17:20 UTC (permalink / raw)
To: ebiederm; +Cc: hans.schillstrom, daniel.lezcano, netdev
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 08 Oct 2010 09:45:15 -0700
> My hunch is that we have dst entry problems, as I know those hop network
> interfaces when we destroy network devices, but I have seen weird issues
> with the route cache as well.
While we're on this topic, can someone explain to me what the special
CONFIG_NET_NS code in net/ipv4/route.c:rt_do_flush() is trying to
accomplish?
If the issue is that there is an implicit ordering of releasing of
'dst' entries that must be maintained, we really ought to formalize
it (f.e. with dependency pointers or something like that).
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 16:58 ` Eric W. Biederman
@ 2010-10-08 17:29 ` Daniel Lezcano
2010-10-08 17:47 ` Daniel Lezcano
0 siblings, 1 reply; 28+ messages in thread
From: Daniel Lezcano @ 2010-10-08 17:29 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Hans Schillstrom, netdev@vger.kernel.org
On 10/08/2010 06:58 PM, Eric W. Biederman wrote:
> Daniel Lezcano<daniel.lezcano@free.fr> writes:
>
>
>> On 10/08/2010 05:53 PM, Daniel Lezcano wrote:
>>
>>> On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
>>>
>>>> Hi Eric,
>>>> Any advice how to trace this down ?
>>>> This rollback_registered_many() seems to have on the lists before...
>>>> All IPv4 and IPv6 tunnels causes this crash, all you have to do is
>>>> load the tunnel module(s)
>>>> enter a new ns and exit from it.
>>>>
>>>> Have not tested any more devices than tunnels,
>>>> I did an "ip link delete" on my macvlans before exiting the ns.
>>>>
>>> Ah ! I succeed to reproduce it.
>>> It does not appear immediately in fact.
>>>
>>> I am trying to simplify the configuration but I am falling in the bug I
>>> talked about in the previous email.
>>>
>> Ok, so after investigating, we just need a macvlan and specify an ipv6 address
>> for it (inside a new netns of course), and the loopback is not released. I
>> compiled out the tunnels, so they are not related to this problem I think.
>>
>> That reduces the scope of investigation :)
>>
> This reproduces the unable to free nedevice problem? Or the bad
> pointer in macvlan_close problem?
>
The free netdevice problem.
For the macvlan problem, you have to create 2 macvlans on the *same*
physical interface, move them to the network namespace and then exit
this net_ns. You have to repeat this operation several times because
that happen randomly after a few iterations.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 17:20 ` David Miller
@ 2010-10-08 17:32 ` Eric W. Biederman
2010-10-12 20:05 ` David Miller
0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-08 17:32 UTC (permalink / raw)
To: David Miller; +Cc: hans.schillstrom, daniel.lezcano, netdev
David Miller <davem@davemloft.net> writes:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Fri, 08 Oct 2010 09:45:15 -0700
>
>> My hunch is that we have dst entry problems, as I know those hop network
>> interfaces when we destroy network devices, but I have seen weird issues
>> with the route cache as well.
>
> While we're on this topic, can someone explain to me what the special
> CONFIG_NET_NS code in net/ipv4/route.c:rt_do_flush() is trying to
> accomplish?
>
> If the issue is that there is an implicit ordering of releasing of
> 'dst' entries that must be maintained, we really ought to formalize
> it (f.e. with dependency pointers or something like that).
It is just dealing with not flushing the entire routing cache, just the
routes that have expired. Which prevents one network namespace from
flushing it's routes and DOS'ing another.
The practical consequence is that the hash chains have to be picked
apart with some entries kept and some released based upon
rt_is_expired().
I went through it a year or so ago with a fine tooth comb and it made
sense and seems correct, but it does seem overly convoluted.
Eric
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 17:29 ` Daniel Lezcano
@ 2010-10-08 17:47 ` Daniel Lezcano
0 siblings, 0 replies; 28+ messages in thread
From: Daniel Lezcano @ 2010-10-08 17:47 UTC (permalink / raw)
Cc: Eric W. Biederman, Hans Schillstrom, netdev@vger.kernel.org
On 10/08/2010 07:29 PM, Daniel Lezcano wrote:
> On 10/08/2010 06:58 PM, Eric W. Biederman wrote:
>> Daniel Lezcano<daniel.lezcano@free.fr> writes:
>>
>>> On 10/08/2010 05:53 PM, Daniel Lezcano wrote:
>>>> On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
>>>>> Hi Eric,
>>>>> Any advice how to trace this down ?
>>>>> This rollback_registered_many() seems to have on the lists before...
>>>>> All IPv4 and IPv6 tunnels causes this crash, all you have to do is
>>>>> load the tunnel module(s)
>>>>> enter a new ns and exit from it.
>>>>>
>>>>> Have not tested any more devices than tunnels,
>>>>> I did an "ip link delete" on my macvlans before exiting the ns.
>>>> Ah ! I succeed to reproduce it.
>>>> It does not appear immediately in fact.
>>>>
>>>> I am trying to simplify the configuration but I am falling in the bug I
>>>> talked about in the previous email.
>>> Ok, so after investigating, we just need a macvlan and specify an
>>> ipv6 address
>>> for it (inside a new netns of course), and the loopback is not
>>> released. I
>>> compiled out the tunnels, so they are not related to this problem I
>>> think.
>>>
>>> That reduces the scope of investigation :)
>> This reproduces the unable to free nedevice problem? Or the bad
>> pointer in macvlan_close problem?
>
> The free netdevice problem.
>
> For the macvlan problem, you have to create 2 macvlans on the *same*
Just to clarify. The problem happens with a different macvlans on
different physical interfaces but it is harder to reproduce.
Sauf indication contraire ci-dessus:
Compagnie IBM France
Siège Social : Tour Descartes, 2, avenue Gambetta, La Défense 5, 92400
Courbevoie
RCS Nanterre 552 118 465
Forme Sociale : S.A.S.
Capital Social : 542.737.118 ?
SIREN/SIRET : 552 118 465 02430
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-08 17:32 ` Eric W. Biederman
@ 2010-10-12 20:05 ` David Miller
2010-10-13 11:19 ` Jarek Poplawski
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: David Miller @ 2010-10-12 20:05 UTC (permalink / raw)
To: ebiederm; +Cc: hans.schillstrom, daniel.lezcano, netdev
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 08 Oct 2010 10:32:40 -0700
> It is just dealing with not flushing the entire routing cache, just the
> routes that have expired. Which prevents one network namespace from
> flushing it's routes and DOS'ing another.
That's a very indirect and obfuscated way of handling it.
And I still don't know why we let the first contiguous set of expired
entries in the chain get freed outside of the lock, and the rest
inside the lock. That really isn't explained by anything I've read.
How about we just do exactly what's intended, and with no ifdefs?
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/route.h b/include/net/route.h
index 7e5e73b..8d24761 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -106,7 +106,7 @@ extern int ip_rt_init(void);
extern void ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
__be32 src, struct net_device *dev);
extern void rt_cache_flush(struct net *net, int how);
-extern void rt_cache_flush_batch(void);
+extern void rt_cache_flush_batch(struct net *net);
extern int __ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
extern int ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
extern int ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 919f2ad..4039f56 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -999,7 +999,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
rt_cache_flush(dev_net(dev), 0);
break;
case NETDEV_UNREGISTER_BATCH:
- rt_cache_flush_batch();
+ rt_cache_flush_batch(dev_net(dev));
break;
}
return NOTIFY_DONE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0755aa4..6ad730c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -712,13 +712,14 @@ static inline int rt_is_expired(struct rtable *rth)
* Can be called by a softirq or a process.
* In the later case, we want to be reschedule if necessary
*/
-static void rt_do_flush(int process_context)
+static void rt_do_flush(struct net *net, int process_context)
{
unsigned int i;
struct rtable *rth, *next;
- struct rtable * tail;
for (i = 0; i <= rt_hash_mask; i++) {
+ struct rtable *list, **pprev;
+
if (process_context && need_resched())
cond_resched();
rth = rt_hash_table[i].chain;
@@ -726,41 +727,27 @@ static void rt_do_flush(int process_context)
continue;
spin_lock_bh(rt_hash_lock_addr(i));
-#ifdef CONFIG_NET_NS
- {
- struct rtable ** prev, * p;
- rth = rt_hash_table[i].chain;
+ pprev = &rt_hash_table[i].chain;
+ rth = *pprev;
+ while (rth) {
+ next = rth->dst.rt_next;
+ if (dev_net(rth->dst.dev) == net) {
+ *pprev = next;
- /* defer releasing the head of the list after spin_unlock */
- for (tail = rth; tail; tail = tail->dst.rt_next)
- if (!rt_is_expired(tail))
- break;
- if (rth != tail)
- rt_hash_table[i].chain = tail;
-
- /* call rt_free on entries after the tail requiring flush */
- prev = &rt_hash_table[i].chain;
- for (p = *prev; p; p = next) {
- next = p->dst.rt_next;
- if (!rt_is_expired(p)) {
- prev = &p->dst.rt_next;
- } else {
- *prev = next;
- rt_free(p);
- }
- }
+ rth->dst.rt_next = list;
+ list = rth;
+ } else
+ pprev = &rth->dst.rt_next;
+
+ rth = next;
}
-#else
- rth = rt_hash_table[i].chain;
- rt_hash_table[i].chain = NULL;
- tail = NULL;
-#endif
+
spin_unlock_bh(rt_hash_lock_addr(i));
- for (; rth != tail; rth = next) {
- next = rth->dst.rt_next;
- rt_free(rth);
+ for (; list; list = next) {
+ next = list->dst.rt_next;
+ rt_free(list);
}
}
}
@@ -906,13 +893,13 @@ void rt_cache_flush(struct net *net, int delay)
{
rt_cache_invalidate(net);
if (delay >= 0)
- rt_do_flush(!in_softirq());
+ rt_do_flush(net, !in_softirq());
}
/* Flush previous cache invalidated entries from the cache */
-void rt_cache_flush_batch(void)
+void rt_cache_flush_batch(struct net *net)
{
- rt_do_flush(!in_softirq());
+ rt_do_flush(net, !in_softirq());
}
static void rt_emergency_hash_rebuild(struct net *net)
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-12 20:05 ` David Miller
@ 2010-10-13 11:19 ` Jarek Poplawski
2010-10-13 21:58 ` David Miller
2010-10-13 22:16 ` Daniel Lezcano
2010-10-14 4:40 ` Eric W. Biederman
2 siblings, 1 reply; 28+ messages in thread
From: Jarek Poplawski @ 2010-10-13 11:19 UTC (permalink / raw)
To: David Miller; +Cc: ebiederm, hans.schillstrom, daniel.lezcano, netdev
On 2010-10-12 22:05, David Miller wrote:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Fri, 08 Oct 2010 10:32:40 -0700
>
>> It is just dealing with not flushing the entire routing cache, just the
>> routes that have expired. Which prevents one network namespace from
>> flushing it's routes and DOS'ing another.
>
> That's a very indirect and obfuscated way of handling it.
>
> And I still don't know why we let the first contiguous set of expired
> entries in the chain get freed outside of the lock, and the rest
> inside the lock. That really isn't explained by anything I've read.
>
> How about we just do exactly what's intended, and with no ifdefs?
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
...
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 0755aa4..6ad730c 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -712,13 +712,14 @@ static inline int rt_is_expired(struct rtable *rth)
> * Can be called by a softirq or a process.
> * In the later case, we want to be reschedule if necessary
> */
> -static void rt_do_flush(int process_context)
> +static void rt_do_flush(struct net *net, int process_context)
> {
> unsigned int i;
> struct rtable *rth, *next;
> - struct rtable * tail;
>
> for (i = 0; i <= rt_hash_mask; i++) {
> + struct rtable *list, **pprev;
Isn't "list = NULL" needed here?
Jarek P.
...
> + rth->dst.rt_next = list;
> + list = rth;
> + } else
> + pprev = &rth->dst.rt_next;
> +
> + rth = next;
...
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-13 11:19 ` Jarek Poplawski
@ 2010-10-13 21:58 ` David Miller
2010-10-14 6:41 ` Hans Schillstrom
0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-10-13 21:58 UTC (permalink / raw)
To: jarkao2; +Cc: ebiederm, hans.schillstrom, daniel.lezcano, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 13 Oct 2010 11:19:47 +0000
>> -static void rt_do_flush(int process_context)
>> +static void rt_do_flush(struct net *net, int process_context)
>> {
>> unsigned int i;
>> struct rtable *rth, *next;
>> - struct rtable * tail;
>>
>> for (i = 0; i <= rt_hash_mask; i++) {
>> + struct rtable *list, **pprev;
>
> Isn't "list = NULL" needed here?
Yes it is, thanks for catching that.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-12 20:05 ` David Miller
2010-10-13 11:19 ` Jarek Poplawski
@ 2010-10-13 22:16 ` Daniel Lezcano
2010-10-13 23:23 ` David Miller
2010-10-14 4:40 ` Eric W. Biederman
2 siblings, 1 reply; 28+ messages in thread
From: Daniel Lezcano @ 2010-10-13 22:16 UTC (permalink / raw)
To: David Miller; +Cc: ebiederm, hans.schillstrom, netdev
On 10/12/2010 10:05 PM, David Miller wrote:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Fri, 08 Oct 2010 10:32:40 -0700
>
>
>> It is just dealing with not flushing the entire routing cache, just the
>> routes that have expired. Which prevents one network namespace from
>> flushing it's routes and DOS'ing another.
>>
> That's a very indirect and obfuscated way of handling it.
>
I agree.
> And I still don't know why we let the first contiguous set of expired
> entries in the chain get freed outside of the lock, and the rest
> inside the lock. That really isn't explained by anything I've read.
>
> How about we just do exactly what's intended, and with no ifdefs?
>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Dave,
do you mind to wait I test the patch before merging it ?
I would like to stress a bit this routine with multiple containers.
Thanks
-- Daniel
> Signed-off-by: David S. Miller<davem@davemloft.net>
>
> diff --git a/include/net/route.h b/include/net/route.h
> index 7e5e73b..8d24761 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -106,7 +106,7 @@ extern int ip_rt_init(void);
> extern void ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
> __be32 src, struct net_device *dev);
> extern void rt_cache_flush(struct net *net, int how);
> -extern void rt_cache_flush_batch(void);
> +extern void rt_cache_flush_batch(struct net *net);
> extern int __ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
> extern int ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
> extern int ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 919f2ad..4039f56 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -999,7 +999,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
> rt_cache_flush(dev_net(dev), 0);
> break;
> case NETDEV_UNREGISTER_BATCH:
> - rt_cache_flush_batch();
> + rt_cache_flush_batch(dev_net(dev));
> break;
> }
> return NOTIFY_DONE;
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 0755aa4..6ad730c 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -712,13 +712,14 @@ static inline int rt_is_expired(struct rtable *rth)
> * Can be called by a softirq or a process.
> * In the later case, we want to be reschedule if necessary
> */
> -static void rt_do_flush(int process_context)
> +static void rt_do_flush(struct net *net, int process_context)
> {
> unsigned int i;
> struct rtable *rth, *next;
> - struct rtable * tail;
>
> for (i = 0; i<= rt_hash_mask; i++) {
> + struct rtable *list, **pprev;
> +
> if (process_context&& need_resched())
> cond_resched();
> rth = rt_hash_table[i].chain;
> @@ -726,41 +727,27 @@ static void rt_do_flush(int process_context)
> continue;
>
> spin_lock_bh(rt_hash_lock_addr(i));
> -#ifdef CONFIG_NET_NS
> - {
> - struct rtable ** prev, * p;
>
> - rth = rt_hash_table[i].chain;
> + pprev =&rt_hash_table[i].chain;
> + rth = *pprev;
> + while (rth) {
> + next = rth->dst.rt_next;
> + if (dev_net(rth->dst.dev) == net) {
> + *pprev = next;
>
> - /* defer releasing the head of the list after spin_unlock */
> - for (tail = rth; tail; tail = tail->dst.rt_next)
> - if (!rt_is_expired(tail))
> - break;
> - if (rth != tail)
> - rt_hash_table[i].chain = tail;
> -
> - /* call rt_free on entries after the tail requiring flush */
> - prev =&rt_hash_table[i].chain;
> - for (p = *prev; p; p = next) {
> - next = p->dst.rt_next;
> - if (!rt_is_expired(p)) {
> - prev =&p->dst.rt_next;
> - } else {
> - *prev = next;
> - rt_free(p);
> - }
> - }
> + rth->dst.rt_next = list;
> + list = rth;
> + } else
> + pprev =&rth->dst.rt_next;
> +
> + rth = next;
> }
> -#else
> - rth = rt_hash_table[i].chain;
> - rt_hash_table[i].chain = NULL;
> - tail = NULL;
> -#endif
> +
> spin_unlock_bh(rt_hash_lock_addr(i));
>
> - for (; rth != tail; rth = next) {
> - next = rth->dst.rt_next;
> - rt_free(rth);
> + for (; list; list = next) {
> + next = list->dst.rt_next;
> + rt_free(list);
> }
> }
> }
> @@ -906,13 +893,13 @@ void rt_cache_flush(struct net *net, int delay)
> {
> rt_cache_invalidate(net);
> if (delay>= 0)
> - rt_do_flush(!in_softirq());
> + rt_do_flush(net, !in_softirq());
> }
>
> /* Flush previous cache invalidated entries from the cache */
> -void rt_cache_flush_batch(void)
> +void rt_cache_flush_batch(struct net *net)
> {
> - rt_do_flush(!in_softirq());
> + rt_do_flush(net, !in_softirq());
> }
>
> static void rt_emergency_hash_rebuild(struct net *net)
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-13 22:16 ` Daniel Lezcano
@ 2010-10-13 23:23 ` David Miller
2010-10-14 3:57 ` Eric Dumazet
0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-10-13 23:23 UTC (permalink / raw)
To: daniel.lezcano; +Cc: ebiederm, hans.schillstrom, netdev
From: Daniel Lezcano <daniel.lezcano@free.fr>
Date: Thu, 14 Oct 2010 00:16:15 +0200
> do you mind to wait I test the patch before merging it ?
> I would like to stress a bit this routine with multiple containers.
Yes, it would be great if you could test this.
Please make sure you get the fix for the bug that
Jarek found ('list' needs to be initialized to NULL)
I've included the latest version below:
diff --git a/include/net/route.h b/include/net/route.h
index 7e5e73b..8d24761 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -106,7 +106,7 @@ extern int ip_rt_init(void);
extern void ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
__be32 src, struct net_device *dev);
extern void rt_cache_flush(struct net *net, int how);
-extern void rt_cache_flush_batch(void);
+extern void rt_cache_flush_batch(struct net *net);
extern int __ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
extern int ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
extern int ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 919f2ad..4039f56 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -999,7 +999,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
rt_cache_flush(dev_net(dev), 0);
break;
case NETDEV_UNREGISTER_BATCH:
- rt_cache_flush_batch();
+ rt_cache_flush_batch(dev_net(dev));
break;
}
return NOTIFY_DONE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0755aa4..6ad730c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -712,13 +712,14 @@ static inline int rt_is_expired(struct rtable *rth)
* Can be called by a softirq or a process.
* In the later case, we want to be reschedule if necessary
*/
-static void rt_do_flush(int process_context)
+static void rt_do_flush(struct net *net, int process_context)
{
unsigned int i;
struct rtable *rth, *next;
- struct rtable * tail;
for (i = 0; i <= rt_hash_mask; i++) {
+ struct rtable *list = NULL, **pprev;
+
if (process_context && need_resched())
cond_resched();
rth = rt_hash_table[i].chain;
@@ -726,41 +727,27 @@ static void rt_do_flush(int process_context)
continue;
spin_lock_bh(rt_hash_lock_addr(i));
-#ifdef CONFIG_NET_NS
- {
- struct rtable ** prev, * p;
- rth = rt_hash_table[i].chain;
+ pprev = &rt_hash_table[i].chain;
+ rth = *pprev;
+ while (rth) {
+ next = rth->dst.rt_next;
+ if (dev_net(rth->dst.dev) == net) {
+ *pprev = next;
- /* defer releasing the head of the list after spin_unlock */
- for (tail = rth; tail; tail = tail->dst.rt_next)
- if (!rt_is_expired(tail))
- break;
- if (rth != tail)
- rt_hash_table[i].chain = tail;
-
- /* call rt_free on entries after the tail requiring flush */
- prev = &rt_hash_table[i].chain;
- for (p = *prev; p; p = next) {
- next = p->dst.rt_next;
- if (!rt_is_expired(p)) {
- prev = &p->dst.rt_next;
- } else {
- *prev = next;
- rt_free(p);
- }
- }
+ rth->dst.rt_next = list;
+ list = rth;
+ } else
+ pprev = &rth->dst.rt_next;
+
+ rth = next;
}
-#else
- rth = rt_hash_table[i].chain;
- rt_hash_table[i].chain = NULL;
- tail = NULL;
-#endif
+
spin_unlock_bh(rt_hash_lock_addr(i));
- for (; rth != tail; rth = next) {
- next = rth->dst.rt_next;
- rt_free(rth);
+ for (; list; list = next) {
+ next = list->dst.rt_next;
+ rt_free(list);
}
}
}
@@ -906,13 +893,13 @@ void rt_cache_flush(struct net *net, int delay)
{
rt_cache_invalidate(net);
if (delay >= 0)
- rt_do_flush(!in_softirq());
+ rt_do_flush(net, !in_softirq());
}
/* Flush previous cache invalidated entries from the cache */
-void rt_cache_flush_batch(void)
+void rt_cache_flush_batch(struct net *net)
{
- rt_do_flush(!in_softirq());
+ rt_do_flush(net, !in_softirq());
}
static void rt_emergency_hash_rebuild(struct net *net)
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-13 23:23 ` David Miller
@ 2010-10-14 3:57 ` Eric Dumazet
2010-10-14 23:28 ` Paul E. McKenney
0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-10-14 3:57 UTC (permalink / raw)
To: David Miller; +Cc: daniel.lezcano, ebiederm, hans.schillstrom, netdev
Le mercredi 13 octobre 2010 à 16:23 -0700, David Miller a écrit :
> From: Daniel Lezcano <daniel.lezcano@free.fr>
> Date: Thu, 14 Oct 2010 00:16:15 +0200
>
> > do you mind to wait I test the patch before merging it ?
> > I would like to stress a bit this routine with multiple containers.
>
> Yes, it would be great if you could test this.
>
> Please make sure you get the fix for the bug that
> Jarek found ('list' needs to be initialized to NULL)
>
> I've included the latest version below:
>
> diff --git a/include/net/route.h b/include/net/route.h
> index 7e5e73b..8d24761 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -106,7 +106,7 @@ extern int ip_rt_init(void);
> extern void ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
> __be32 src, struct net_device *dev);
> extern void rt_cache_flush(struct net *net, int how);
> -extern void rt_cache_flush_batch(void);
> +extern void rt_cache_flush_batch(struct net *net);
> extern int __ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
> extern int ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
> extern int ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 919f2ad..4039f56 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -999,7 +999,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
> rt_cache_flush(dev_net(dev), 0);
> break;
> case NETDEV_UNREGISTER_BATCH:
> - rt_cache_flush_batch();
> + rt_cache_flush_batch(dev_net(dev));
> break;
> }
> return NOTIFY_DONE;
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 0755aa4..6ad730c 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -712,13 +712,14 @@ static inline int rt_is_expired(struct rtable *rth)
> * Can be called by a softirq or a process.
> * In the later case, we want to be reschedule if necessary
> */
> -static void rt_do_flush(int process_context)
> +static void rt_do_flush(struct net *net, int process_context)
> {
> unsigned int i;
> struct rtable *rth, *next;
> - struct rtable * tail;
>
> for (i = 0; i <= rt_hash_mask; i++) {
> + struct rtable *list = NULL, **pprev;
> +
> if (process_context && need_resched())
> cond_resched();
> rth = rt_hash_table[i].chain;
> @@ -726,41 +727,27 @@ static void rt_do_flush(int process_context)
> continue;
>
> spin_lock_bh(rt_hash_lock_addr(i));
> -#ifdef CONFIG_NET_NS
> - {
> - struct rtable ** prev, * p;
>
> - rth = rt_hash_table[i].chain;
> + pprev = &rt_hash_table[i].chain;
> + rth = *pprev;
> + while (rth) {
> + next = rth->dst.rt_next;
> + if (dev_net(rth->dst.dev) == net) {
if (net_eq(dev_net(rth->dst.dev), net)) {
> + *pprev = next;
>
> - /* defer releasing the head of the list after spin_unlock */
> - for (tail = rth; tail; tail = tail->dst.rt_next)
> - if (!rt_is_expired(tail))
> - break;
> - if (rth != tail)
> - rt_hash_table[i].chain = tail;
> -
> - /* call rt_free on entries after the tail requiring flush */
> - prev = &rt_hash_table[i].chain;
> - for (p = *prev; p; p = next) {
> - next = p->dst.rt_next;
> - if (!rt_is_expired(p)) {
> - prev = &p->dst.rt_next;
> - } else {
> - *prev = next;
> - rt_free(p);
> - }
> - }
> + rth->dst.rt_next = list;
> + list = rth;
I was wondering about RCU rules here.
We change pointers while a reader might enter in a loop.
It seems fine : At soon as we spin_unlock(), the loop should be closed.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
minor coding style : You should add a brace in the else clause :
pprev = &rt_hash_table[i].chain;
for (rth = *pprev; rth != NULL; rth = next) {
next = rth->dst.rt_next;
if (net_eq(dev_net(rth->dst.dev), net)) {
*pprev = next;
rth->dst.rt_next = list;
list = rth;
} else {
pprev = &rth->dst.rt_next;
}
}
Thanks !
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-12 20:05 ` David Miller
2010-10-13 11:19 ` Jarek Poplawski
2010-10-13 22:16 ` Daniel Lezcano
@ 2010-10-14 4:40 ` Eric W. Biederman
2010-10-14 4:50 ` David Miller
2 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-14 4:40 UTC (permalink / raw)
To: David Miller; +Cc: hans.schillstrom, daniel.lezcano, netdev
David Miller <davem@davemloft.net> writes:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Fri, 08 Oct 2010 10:32:40 -0700
>
>> It is just dealing with not flushing the entire routing cache, just the
>> routes that have expired. Which prevents one network namespace from
>> flushing it's routes and DOS'ing another.
>
> That's a very indirect and obfuscated way of handling it.
>
> And I still don't know why we let the first contiguous set of expired
> entries in the chain get freed outside of the lock, and the rest
> inside the lock. That really isn't explained by anything I've read.
>
> How about we just do exactly what's intended, and with no ifdefs?
I'm all for no ifdefs.
And reading the code your version looks much simpler and easier
to read and I am all for that.
However I think the test should still be rt_is_expired(), because
that is what rt_do_flush() is doing removing the expired entries
from the list.
The only difference being that we remove the assumption that all hash
table entries must be expired at this point.
We have very straight forwardly expired all of the route table entries
for the namespace that go this going earlier.
Eric
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/include/net/route.h b/include/net/route.h
> index 7e5e73b..8d24761 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -106,7 +106,7 @@ extern int ip_rt_init(void);
> extern void ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
> __be32 src, struct net_device *dev);
> extern void rt_cache_flush(struct net *net, int how);
> -extern void rt_cache_flush_batch(void);
> +extern void rt_cache_flush_batch(struct net *net);
> extern int __ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
> extern int ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
> extern int ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 919f2ad..4039f56 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -999,7 +999,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
> rt_cache_flush(dev_net(dev), 0);
> break;
> case NETDEV_UNREGISTER_BATCH:
> - rt_cache_flush_batch();
> + rt_cache_flush_batch(dev_net(dev));
I believe this change is actually wrong. dev here is the first
element of a list of network devices, and that list may span multiple
network namespaces.
> break;
> }
> return NOTIFY_DONE;
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 0755aa4..6ad730c 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -712,13 +712,14 @@ static inline int rt_is_expired(struct rtable *rth)
> * Can be called by a softirq or a process.
> * In the later case, we want to be reschedule if necessary
> */
> -static void rt_do_flush(int process_context)
> +static void rt_do_flush(struct net *net, int process_context)
> {
> unsigned int i;
> struct rtable *rth, *next;
> - struct rtable * tail;
>
> for (i = 0; i <= rt_hash_mask; i++) {
> + struct rtable *list, **pprev;
> +
> if (process_context && need_resched())
> cond_resched();
> rth = rt_hash_table[i].chain;
> @@ -726,41 +727,27 @@ static void rt_do_flush(int process_context)
> continue;
>
> spin_lock_bh(rt_hash_lock_addr(i));
> -#ifdef CONFIG_NET_NS
> - {
> - struct rtable ** prev, * p;
>
> - rth = rt_hash_table[i].chain;
> + pprev = &rt_hash_table[i].chain;
> + rth = *pprev;
> + while (rth) {
> + next = rth->dst.rt_next;
> + if (dev_net(rth->dst.dev) == net) {
> + *pprev = next;
>
> - /* defer releasing the head of the list after spin_unlock */
> - for (tail = rth; tail; tail = tail->dst.rt_next)
> - if (!rt_is_expired(tail))
> - break;
> - if (rth != tail)
> - rt_hash_table[i].chain = tail;
> -
> - /* call rt_free on entries after the tail requiring flush */
> - prev = &rt_hash_table[i].chain;
> - for (p = *prev; p; p = next) {
> - next = p->dst.rt_next;
> - if (!rt_is_expired(p)) {
> - prev = &p->dst.rt_next;
> - } else {
> - *prev = next;
> - rt_free(p);
> - }
> - }
> + rth->dst.rt_next = list;
> + list = rth;
> + } else
> + pprev = &rth->dst.rt_next;
> +
> + rth = next;
> }
> -#else
> - rth = rt_hash_table[i].chain;
> - rt_hash_table[i].chain = NULL;
> - tail = NULL;
> -#endif
> +
> spin_unlock_bh(rt_hash_lock_addr(i));
>
> - for (; rth != tail; rth = next) {
> - next = rth->dst.rt_next;
> - rt_free(rth);
> + for (; list; list = next) {
> + next = list->dst.rt_next;
> + rt_free(list);
> }
> }
> }
> @@ -906,13 +893,13 @@ void rt_cache_flush(struct net *net, int delay)
> {
> rt_cache_invalidate(net);
> if (delay >= 0)
> - rt_do_flush(!in_softirq());
> + rt_do_flush(net, !in_softirq());
> }
>
> /* Flush previous cache invalidated entries from the cache */
> -void rt_cache_flush_batch(void)
> +void rt_cache_flush_batch(struct net *net)
> {
> - rt_do_flush(!in_softirq());
> + rt_do_flush(net, !in_softirq());
> }
>
> static void rt_emergency_hash_rebuild(struct net *net)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-14 4:40 ` Eric W. Biederman
@ 2010-10-14 4:50 ` David Miller
2010-10-14 5:20 ` Eric W. Biederman
0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-10-14 4:50 UTC (permalink / raw)
To: ebiederm; +Cc: hans.schillstrom, daniel.lezcano, netdev
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Wed, 13 Oct 2010 21:40:49 -0700
> However I think the test should still be rt_is_expired(), because
> that is what rt_do_flush() is doing removing the expired entries
> from the list.
I can't see a reason for that test.
Everything calling into this code path has created a condition
that requires that all routing cache entries for that namespace
be deleted.
This function is meant to unconditionally flush the entire table.
I believe you added that extraneous test, and it never existed there
before.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-14 4:50 ` David Miller
@ 2010-10-14 5:20 ` Eric W. Biederman
2010-10-14 15:09 ` David Miller
0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-14 5:20 UTC (permalink / raw)
To: David Miller; +Cc: hans.schillstrom, daniel.lezcano, netdev
David Miller <davem@davemloft.net> writes:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Wed, 13 Oct 2010 21:40:49 -0700
>
>> However I think the test should still be rt_is_expired(), because
>> that is what rt_do_flush() is doing removing the expired entries
>> from the list.
>
> I can't see a reason for that test.
>
> Everything calling into this code path has created a condition
> that requires that all routing cache entries for that namespace
> be deleted.
>
> This function is meant to unconditionally flush the entire table.
>
> I believe you added that extraneous test, and it never existed there
> before.
At the point network namespaces entered the picture the logic was:
void rt_cache_flush(struct net *net, int delay)
{
rt_cache_invalidate();
if (delay >= 0)
rt_do_flush(!in_softirq());
}
/* Strictly speaking rt_is_expired was just open coded in
* rt_check_expire. But this is the check that was used.
*/
static inline int rt_is_expired(struct rtable *rth)
{
return rth->rt_genid != atomic_read(&rt_genid);
}
static void rt_cache_invalidate(void)
{
unsigned char shuffle;
get_random_bytes(&shuffle, sizeof(shuffle));
atomic_add(shuffle + 1U, &rt_genid);
}
static void rt_do_flush(int process_context)
{
unsigned int i;
struct rtable *rth, *next;
for (i = 0; i <= rt_hash_mask; i++) {
if (process_context && need_resched())
cond_resched();
rth = rt_hash_table[i].chain;
if (!rth)
continue;
spin_lock_bh(rt_hash_lock_addr(i));
rth = rt_hash_table[i].chain;
rt_hash_table[i].chain = NULL;
tail = NULL;
spin_unlock_bh(rt_hash_lock_addr(i));
for(; rth != tail; rth = next)
{
next = rth->dst.rt_next;
rt_free(rth);
}
}
}
Because of the rt_cache_invalidate() in rt_cache_flush() this
guaranteed that rt_is_expired() was true for every route cache entry,
and this also guaranteed that every routing cache entry we were flush
atomically became inaccessible.
So rt_is_expired() has always been valid, but in practice it was just
always optimized out as being redundant.
With the network namespace support we limit the scope of the test of
the invalidate to just a single network namespace, and as such
rt_is_expired stops being true for every cache entry. So we cannot
unconditionally throw away entire chains.
All of which can be either done by network namespace equality or by
rt_is_expired(). Although Denis picked rt_is_expired() when he made
his change.
The only place it makes a noticable difference in practice is what
happens when we do batched deleletes of lots of network devices in
different network namespaces.
During batched network device deletes in fib_netdev_event we do
rt_cache_flush(dev_net(dev), -1) for each network device. and then a
final rt_cache_flush_batch() to remove the invalidated entries. These
devices can be from multiple network namespaces, so I suspect that is
a savings worth having.
So if we are going to change the tests we need to do something with
rt_cache_flush_batch(). Further I do not see what is confusing about
a test that asks if the routing cache entry is unusable. Is
rt_cache_expired() a bad name?
Eric
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-13 21:58 ` David Miller
@ 2010-10-14 6:41 ` Hans Schillstrom
0 siblings, 0 replies; 28+ messages in thread
From: Hans Schillstrom @ 2010-10-14 6:41 UTC (permalink / raw)
To: David Miller
Cc: jarkao2@gmail.com, ebiederm@xmission.com, daniel.lezcano@free.fr,
netdev@vger.kernel.org
On Wednesday 13 October 2010 23:58:56 David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 13 Oct 2010 11:19:47 +0000
>
> >> -static void rt_do_flush(int process_context)
> >> +static void rt_do_flush(struct net *net, int process_context)
> >> {
> >> unsigned int i;
> >> struct rtable *rth, *next;
> >> - struct rtable * tail;
> >>
> >> for (i = 0; i <= rt_hash_mask; i++) {
> >> + struct rtable *list, **pprev;
> >
> > Isn't "list = NULL" needed here?
>
> Yes it is, thanks for catching that.
>
It solves the crach but....
#
Slab corruption: size-4096 start=ffff88000f950000, len=4096
010: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b
unregister_netdevice: waiting for lo to become free. Usage count = 4
Slab corruption: size-4096 start=ffff88000f9af000, len=4096
010: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b
unregister_netdevice: waiting for lo to become free. Usage count = 4
unregister_netdevice: waiting for lo to become free. Usage count = 4
unregister_netdevice: waiting for lo to become free. Usage count = 4
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-14 5:20 ` Eric W. Biederman
@ 2010-10-14 15:09 ` David Miller
2010-10-14 18:35 ` Eric W. Biederman
0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-10-14 15:09 UTC (permalink / raw)
To: ebiederm; +Cc: hans.schillstrom, daniel.lezcano, netdev
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Wed, 13 Oct 2010 22:20:28 -0700
> With the network namespace support we limit the scope of the test of
> the invalidate to just a single network namespace, and as such
> rt_is_expired stops being true for every cache entry. So we cannot
> unconditionally throw away entire chains.
>
> All of which can be either done by network namespace equality or by
> rt_is_expired(). Although Denis picked rt_is_expired() when he made
> his change.
Right, and I choose to use namespace equality which will completely
compile into no code at all when namespace support is not in the
kernel.
Therefore, making the non-namespace case equivalent and as efficient
as it always was.
> The only place it makes a noticable difference in practice is what
> happens when we do batched deleletes of lots of network devices in
> different network namespaces.
>
> During batched network device deletes in fib_netdev_event we do
> rt_cache_flush(dev_net(dev), -1) for each network device. and then a
> final rt_cache_flush_batch() to remove the invalidated entries. These
> devices can be from multiple network namespaces, so I suspect that is
> a savings worth having.
How can it make a real difference even in this case? We'll obliterate
all the entries, and then on subsequent passes we'll find nothing
matching that namespace any more.
Show me performance tests that show it makes any difference, please.
> So if we are going to change the tests we need to do something with
> rt_cache_flush_batch(). Further I do not see what is confusing about
> a test that asks if the routing cache entry is unusable. Is
> rt_cache_expired() a bad name?
It's not a bad name, it's just an unnecessary test that we don't need
to even make in this specific place.
Redundancy tends to accumulate.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-14 15:09 ` David Miller
@ 2010-10-14 18:35 ` Eric W. Biederman
0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-10-14 18:35 UTC (permalink / raw)
To: David Miller; +Cc: hans.schillstrom, daniel.lezcano, netdev, Octavian Purdila
David Miller <davem@davemloft.net> writes:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Wed, 13 Oct 2010 22:20:28 -0700
>
>> With the network namespace support we limit the scope of the test of
>> the invalidate to just a single network namespace, and as such
>> rt_is_expired stops being true for every cache entry. So we cannot
>> unconditionally throw away entire chains.
>>
>> All of which can be either done by network namespace equality or by
>> rt_is_expired(). Although Denis picked rt_is_expired() when he made
>> his change.
>
> Right, and I choose to use namespace equality which will completely
> compile into no code at all when namespace support is not in the
> kernel.
>
> Therefore, making the non-namespace case equivalent and as efficient
> as it always was.
Almost you still have the hash list inversion, which means you have
to at look at the rtable entry even on a one list long hash chain.
Perhaps I am looking at it wrong but once you look at the entries
I don't see the difference in the number of cache line faults
between one variant of the code and the other.
>> The only place it makes a noticable difference in practice is what
>> happens when we do batched deleletes of lots of network devices in
>> different network namespaces.
>>
>> During batched network device deletes in fib_netdev_event we do
>> rt_cache_flush(dev_net(dev), -1) for each network device. and then a
>> final rt_cache_flush_batch() to remove the invalidated entries. These
>> devices can be from multiple network namespaces, so I suspect that is
>> a savings worth having.
>
> How can it make a real difference even in this case? We'll obliterate
> all the entries, and then on subsequent passes we'll find nothing
> matching that namespace any more.
>
> Show me performance tests that show it makes any difference, please.
Octavian did you happen to measure the performance difference when you
added batching of routing table flushes?
>> So if we are going to change the tests we need to do something with
>> rt_cache_flush_batch(). Further I do not see what is confusing about
>> a test that asks if the routing cache entry is unusable. Is
>> rt_cache_expired() a bad name?
>
> It's not a bad name, it's just an unnecessary test that we don't need
> to even make in this specific place.
As long as we do something that is correct in the batched flush case
I am happy either way.
Eric
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
@ 2010-10-14 19:21 Octavian Purdila
0 siblings, 0 replies; 28+ messages in thread
From: Octavian Purdila @ 2010-10-14 19:21 UTC (permalink / raw)
To: Eric W. Biederman, David Miller; +Cc: hans.schillstrom, daniel.lezcano, netdev
> > How can it make a real difference even in this case? We'll obliterate
> > all the entries, and then on subsequent passes we'll find nothing
> > matching that namespace any more.
> >
> > Show me performance tests that show it makes any difference, please.
>
> Octavian did you happen to measure the performance difference when you
> added batching of routing table flushes?
>
Unfortunatelly I dont't have the numbers anymore, but I remember it was noticeable when using a large number of interfaces (10K) - if I remember correctly around 1 second out of 10 for the whole unregister process.
BTW, another bottleneck for mass unregister while interfaces are up is dev_deactivate / dev_close. I experimented with "batchifying" it and for 32K interfaces the time went down from 5mins to 30s.
The patch I have is not pretty, it basically creates another 2 functions for each of dev_close and dev_deactivate for pre and post synchronise_rcu barrier.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: BUG ? ipip unregister_netdevice_many()
2010-10-14 3:57 ` Eric Dumazet
@ 2010-10-14 23:28 ` Paul E. McKenney
0 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2010-10-14 23:28 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, daniel.lezcano, ebiederm, hans.schillstrom, netdev
On Thu, Oct 14, 2010 at 05:57:11AM +0200, Eric Dumazet wrote:
> Le mercredi 13 octobre 2010 à 16:23 -0700, David Miller a écrit :
> > From: Daniel Lezcano <daniel.lezcano@free.fr>
[ . . . ]
> > for (i = 0; i <= rt_hash_mask; i++) {
> > + struct rtable *list = NULL, **pprev;
> > +
> > if (process_context && need_resched())
> > cond_resched();
> > rth = rt_hash_table[i].chain;
> > @@ -726,41 +727,27 @@ static void rt_do_flush(int process_context)
> > continue;
> >
> > spin_lock_bh(rt_hash_lock_addr(i));
> > -#ifdef CONFIG_NET_NS
> > - {
> > - struct rtable ** prev, * p;
> >
> > - rth = rt_hash_table[i].chain;
> > + pprev = &rt_hash_table[i].chain;
> > + rth = *pprev;
> > + while (rth) {
> > + next = rth->dst.rt_next;
> > + if (dev_net(rth->dst.dev) == net) {
>
> if (net_eq(dev_net(rth->dst.dev), net)) {
>
>
> > + *pprev = next;
> >
> > - /* defer releasing the head of the list after spin_unlock */
> > - for (tail = rth; tail; tail = tail->dst.rt_next)
> > - if (!rt_is_expired(tail))
> > - break;
> > - if (rth != tail)
> > - rt_hash_table[i].chain = tail;
> > -
> > - /* call rt_free on entries after the tail requiring flush */
> > - prev = &rt_hash_table[i].chain;
> > - for (p = *prev; p; p = next) {
> > - next = p->dst.rt_next;
> > - if (!rt_is_expired(p)) {
> > - prev = &p->dst.rt_next;
> > - } else {
> > - *prev = next;
> > - rt_free(p);
> > - }
> > - }
> > + rth->dst.rt_next = list;
> > + list = rth;
>
> I was wondering about RCU rules here.
> We change pointers while a reader might enter in a loop.
> It seems fine : At soon as we spin_unlock(), the loop should be closed.
I don't see where the structure pointed to by list came from, but
especially if it is newly allocated, we do need an rcu_assign_pointer().
Thanx, Paul
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> minor coding style : You should add a brace in the else clause :
>
> pprev = &rt_hash_table[i].chain;
> for (rth = *pprev; rth != NULL; rth = next) {
> next = rth->dst.rt_next;
> if (net_eq(dev_net(rth->dst.dev), net)) {
> *pprev = next;
> rth->dst.rt_next = list;
> list = rth;
> } else {
> pprev = &rth->dst.rt_next;
> }
> }
>
> Thanks !
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2010-10-14 23:28 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-07 8:48 BUG ? ipip unregister_netdevice_many() Hans Schillstrom
2010-10-08 11:19 ` Daniel Lezcano
2010-10-08 11:53 ` Hans Schillstrom
2010-10-08 12:28 ` Hans Schillstrom
2010-10-08 15:53 ` Daniel Lezcano
2010-10-08 16:17 ` Daniel Lezcano
2010-10-08 16:58 ` Eric W. Biederman
2010-10-08 17:29 ` Daniel Lezcano
2010-10-08 17:47 ` Daniel Lezcano
2010-10-08 16:45 ` Eric W. Biederman
2010-10-08 17:20 ` David Miller
2010-10-08 17:32 ` Eric W. Biederman
2010-10-12 20:05 ` David Miller
2010-10-13 11:19 ` Jarek Poplawski
2010-10-13 21:58 ` David Miller
2010-10-14 6:41 ` Hans Schillstrom
2010-10-13 22:16 ` Daniel Lezcano
2010-10-13 23:23 ` David Miller
2010-10-14 3:57 ` Eric Dumazet
2010-10-14 23:28 ` Paul E. McKenney
2010-10-14 4:40 ` Eric W. Biederman
2010-10-14 4:50 ` David Miller
2010-10-14 5:20 ` Eric W. Biederman
2010-10-14 15:09 ` David Miller
2010-10-14 18:35 ` Eric W. Biederman
2010-10-08 16:51 ` Eric W. Biederman
2010-10-08 16:06 ` Eric W. Biederman
-- strict thread matches above, loose matches on Subject: below --
2010-10-14 19:21 Octavian Purdila
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).