From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
Date: Thu, 29 Mar 2012 08:36:15 +0200
Message-ID: <1333002975.2325.82.camel@edumazet-glaptop>
References: <002501cd0d74$317fd100$947f7300$%huangpeng@huawei.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Cc: linux-kernel@vger.kernel.org, harry.majun@huawei.com, zhoukang7@huawei.com, netdev
To: "Peter Huang (Peng)"
Return-path:
Received: from mail-ee0-f46.google.com ([74.125.83.46]:52621 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753848Ab2C2GgU (ORCPT ); Thu, 29 Mar 2012 02:36:20 -0400
In-Reply-To: <002501cd0d74$317fd100$947f7300$%huangpeng@huawei.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Thu, 2012-03-29 at 14:21 +0800, Peter Huang (Peng) wrote:
> In our environment, we encountered a kernel Oops that caused a
> restart.
>

CC netdev, since it's more appropriate.

> Below is what happened:
> kernel: 2.6.32.36-0.5-xen  OS: xen + dom-0 + guest (rhel5.5)
> 1. Destroy one VM.
> 2. The ipsan path has a problem, which delays the destroy process by
>    about 10s.
> 3. A customer-defined script finds through the libvirt API that the VM
>    no longer exists.
> 4. br0 (attached to the VM we destroyed) is deleted by the script.
> 5. The delayed VM destroy process reaches the tap device release. This
>    decrements the reference count of skb->_skb_dst (which points to
>    fake_rtable), but deleting br0 has already released this struct, and
>    unfortunately the OS has reused this memory and marked it read-only.
> 6. The Oops happens and causes a restart.
>
> After analyzing the stack dump, we found that many ipv6 multicast
> packets existed during our VM destroy, and their skb->_skb_dst pointed
> to (struct) fake_rtable. Grepping the kernel source finds only one
> reference, to fake_rtable's MTU setting.
>
> So I'm wondering what fake_rtable stands for, and where it is used.
> If fake_rtable's dst is not used, we can set dst to NULL to avoid our
> problem.
> I have also added a patch that sets skb->_skb_dst to NULL when
> "skb->_skb_dst == (unsigned long)&to->br->fake_rtable".
>
> BTW, we also verified a similar scenario on kernel-3.3: br0 has eth0
> and eth1 attached, and eth1 is connected to our guest, which multicasts
> ipv6 packets. You can get a
> "WARNING: at net/core/dst.c:274 dst_release+0x6d/0x70()"
> by using the attached fake_rtable_verify.c:
> #gcc fake_rtable_verify.c
> #./a.out &
> #sleep 30 //make sure ipv6 pkts are in tap00's receive queue.
> #ifconfig br0 down
> #brctl delbr br0 //delete br0, will also delete net_device's fake_rtable.
> #sleep 50
> #kill -9 `pidof a.out` //tap00's delete will do dst_release, which writes to memory already freed.
>
> Below is the Oops stack dump info:
> ///////////////////////////////////////////////////////////////////////////////
> RIP: e030:[] {dst_release+0x11}
> RSP: e02b:ffff88008b185b70 EFLAGS: 00010286
> RAX: 00000000ffffffff RBX: ffff880033d184c0 RCX: 0000000000000000
> RDX: ffff88008b54f080 RSI: 0000000012df12df RDI: ffff88008b54efc0
> RBP: ffff8800f4a3f500 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000002 R11: ffffffff8018c1e0 R12: ffff8800f4a3f400
> R13: 0000000000000001 R14: ffff8800f4a3f4e0 R15: ffff8800351030c0
> FS: 00007f4cbd080700(0000) GS:ffff880002008000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff88008b54f080 CR3: 000000008a27c000 CR4: 0000000000002620
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> {dump_trace+0x65}
> {notifier_call_chain+0x37}
> {notify_die+0x2d}
> {__die+0x8b}
> {no_context+0xd1}
> {__bad_area_nosemaphore+0x175}
> {page_fault+0x28}
> {dst_release+0x11}
> {skb_release_head_state+0xbd}
> {__kfree_skb+0x9}
> {pfifo_fast_reset+0x5b}
> {qdisc_reset+0x13}
> {dev_deactivate_queue+0x57}
> {dev_deactivate+0x3f}
> {dev_close+0x65}
> {rollback_registered+0x3e}
> {unregister_netdevice+0x15}
> {tun:tun_chr_close+0xe5}
> {__fput+0xcd}
> {filp_close+0x56}
> {put_files_struct+0x7a}
> {do_exit+0x752}
> {do_group_exit+0x3f}
> {get_signal_to_deliver+0x229}
> {do_notify_resume+0x11d}
> {int_signal+0x12}
> [<00007f4cbc7fd57d>]
> ///////////////////////////////////////////////////////////////////////////////
>
> Signed-off-by: Peter Huang(Peng)
> ---
> diff -Nur a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> @@ -91,6 +91,9 @@
>      skb->dev = to->dev;
>      skb_forward_csum(skb);
>
> +    if (skb->_skb_dst == (unsigned long)&to->br->fake_rtable)
> +        skb_dst_set(skb, NULL);
> +
>      NF_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD, skb, indev, skb->dev,
>          br_forward_finish);
>  }

Did you check whether the current kernel has this bug?

I remember we already fixed this; maybe you need a backport.
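
For readers following the thread, the lifetime problem in the report and the effect of the proposed check can be sketched in user-space C. This is only a toy model under stated assumptions, not kernel code: every toy_* type and function below is a hypothetical stand-in (for the bridge, the skb and the dst handling), and the model assumes the only dst a packet can hold is the one embedded in the bridge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures; not kernel API. */
struct toy_dst    { int refcnt; };
struct toy_bridge { struct toy_dst fake_rtable; }; /* dst lives inside the bridge */
struct toy_skb    { struct toy_dst *dst; };

/* Models the check proposed in the patch: before the packet leaves the
 * bridge code, drop any pointer into the bridge's embedded dst. */
static void toy_forward(struct toy_bridge *br, struct toy_skb *skb)
{
    if (skb->dst == &br->fake_rtable)
        skb->dst = NULL;
}

/* Models packet teardown: the dst is only touched if one is still
 * attached. Returns 1 if a dst was dereferenced, 0 otherwise. */
static int toy_skb_release(struct toy_skb *skb)
{
    if (skb->dst == NULL)
        return 0;
    skb->dst->refcnt--;
    skb->dst = NULL;
    return 1;
}

/* Replays the reported ordering: packet queued, bridge deleted, packet
 * released afterwards. Returns 0 when the release never touched the
 * freed bridge memory, -1 on allocation failure. */
int toy_demo(void)
{
    struct toy_bridge *br = malloc(sizeof(*br));
    struct toy_skb skb;

    if (br == NULL)
        return -1;

    br->fake_rtable.refcnt = 1;
    skb.dst = &br->fake_rtable;   /* packet references the embedded dst */

    toy_forward(br, &skb);        /* the proposed check clears it here */
    assert(skb.dst == NULL);

    free(br);                     /* "brctl delbr br0" */
    return toy_skb_release(&skb); /* delayed tap teardown */
}
```

Calling toy_demo() from a main() should return 0 in this model. Without the toy_forward() call, toy_skb_release() would decrement a counter inside memory that free() has already released, which is the same shape of access that the dst_release frame in the trace above points at.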