From: "Peter Huang (Peng)"
Subject: Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
Date: Thu, 29 Mar 2012 14:40:01 +0800
Message-ID: <002601cd0d76$c4987440$4dc95cc0$%huangpeng@huawei.com>
References: <002501cd0d74$317fd100$947f7300$%huangpeng@huawei.com> <1333002975.2325.82.camel@edumazet-glaptop>
In-reply-to: <1333002975.2325.82.camel@edumazet-glaptop>
To: 'Eric Dumazet'
Cc: linux-kernel@vger.kernel.org, harry.majun@huawei.com, zhoukang7@huawei.com, 'netdev'
Sender: netdev-owner@vger.kernel.org

We have already checked the current kernel (3.3); it has the same problem.
I am not sure whether this modification could cause other problems,
because I don't know where fake_rtable is used.

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
Sent: March 29, 2012 14:36
To: Peter Huang (Peng)
Cc: linux-kernel@vger.kernel.org; harry.majun@huawei.com; zhoukang7@huawei.com; netdev
Subject: Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.

On Thu, 2012-03-29 at 14:21 +0800, Peter Huang (Peng) wrote:
> In our environment, we encountered a kernel Oops that caused a
> restart.
>

CC netdev, since it's more appropriate

> Below is what happened:
> kernel: 2.6.32.36-0.5-xen OS: xen + dom-0 + guest (rhel5.5)
> 1. Destroy one VM.
> 2. The ipsan path had a problem, which delayed the destroy process by about 10s.
> 3. A customer-defined script finds, through the libvirt API, that the VM
> no longer exists.
> 4. br0 (related to the VM we destroyed before) is deleted by the script.
> 5. The delayed VM destroy process reaches the tap device release, which
> decrements the reference count of skb->_skb_dst (skb->_skb_dst points to
> fake_rtable), but deleting br0 has already released this struct, and
> unfortunately the OS has reused this memory and marked it read-only.
> 6. The Oops happens and causes a restart.
>
> After analyzing the stack dump, we found that during our VM destroy
> lots of ipv6 multicast packets existed, and their skb->_skb_dst pointed to
> (struct) fake_rtable.
> Grepping the kernel source finds only one reference, to fake_rtable's
> MTU setting.
>
> So I'm wondering what fake_rtable stands for, and where we are using
> it.
> If fake_rtable's dst is not used, we can set dst to NULL to avoid our
> problem.
> I also added the patch, which sets skb->_skb_dst to NULL when
> "skb->_skb_dst == (unsigned long)&to->br->fake_rtable".
>
> BTW, we also verified a similar scenario on kernel-3.3, where br0 has
> eth0 and eth1 attached, and eth1 is connected to our guest, which
> multicasts ipv6 packets. You can get a
> "WARNING: at net/core/dst.c:274 dst_release+0x6d/0x70()"
> by using the attached fake_rtable_verify.c:
> #gcc fake_rtable_verify.c
> #./a.out &
> #sleep 30 //make sure ipv6 pkts are in tap00's receive queue.
> #ifconfig br0 down
> #brctl delbr br0 //deleting br0 will also delete net_device's fake_rtable.
> #sleep 50
> #kill -9 `pidof a.out` //deleting tap00 will do dst_release, and this will
> write to the memory already freed.
>
> Below is the Oops stack dump info:
> ///////////////////////////////////////////////////////////////////////////////
> RIP: e030:[]
> {dst_release+0x11}
> RSP: e02b:ffff88008b185b70 EFLAGS: 00010286
> RAX: 00000000ffffffff RBX: ffff880033d184c0 RCX: 0000000000000000
> RDX: ffff88008b54f080 RSI: 0000000012df12df RDI: ffff88008b54efc0
> RBP: ffff8800f4a3f500 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000002 R11: ffffffff8018c1e0 R12: ffff8800f4a3f400
> R13: 0000000000000001 R14: ffff8800f4a3f4e0 R15: ffff8800351030c0
> FS: 00007f4cbd080700(0000) GS:ffff880002008000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff88008b54f080 CR3: 000000008a27c000 CR4: 0000000000002620
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> {dump_trace+0x65}
> {notifier_call_chain+0x37}
> {notify_die+0x2d}
> {__die+0x8b}
> {no_context+0xd1}
> {__bad_area_nosemaphore+0x175}
> {page_fault+0x28}
> {dst_release+0x11}
> {skb_release_head_state+0xbd}
> {__kfree_skb+0x9}
> {pfifo_fast_reset+0x5b}
> {qdisc_reset+0x13}
> {dev_deactivate_queue+0x57}
> {dev_deactivate+0x3f}
> {dev_close+0x65}
> {rollback_registered+0x3e}
> {unregister_netdevice+0x15}
> {tun:tun_chr_close+0xe5}
> {__fput+0xcd}
> {filp_close+0x56}
> {put_files_struct+0x7a}
> {do_exit+0x752}
> {do_group_exit+0x3f}
> {get_signal_to_deliver+0x229}
> {do_notify_resume+0x11d}
> {int_signal+0x12}
> [<00007f4cbc7fd57d>]
> ///////////////////////////////////////////////////////////////////////////////
>
> Signed-off-by: Peter Huang(Peng)
> ---
> diff -Nur a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> @@ -91,6 +91,9 @@
>  	skb->dev = to->dev;
>  	skb_forward_csum(skb);
>
> +	if (skb->_skb_dst == (unsigned long)&to->br->fake_rtable)
> +		skb_dst_set(skb, NULL);
> +
>  	NF_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD, skb, indev, skb->dev,
>  		br_forward_finish);
>  }

Did you check whether the current kernel has this bug?

I remember we already fixed this, so maybe you need a backport.
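
The failure sequence in steps 1-6 of the report, and the check added by the
proposed patch, can be modelled in user space. This is only a minimal sketch,
not kernel code: struct dst_entry, struct bridge, struct packet,
detach_fake_dst() and late_free_touches_bridge() are hypothetical stand-ins.
It assumes, as the report describes, that fake_rtable is freed together with
the bridge even while queued packets still reference it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel definitions. */
struct dst_entry {
    int refcnt;
};

struct bridge {
    struct dst_entry fake_rtable;   /* embedded: shares the bridge's lifetime */
};

struct packet {
    struct dst_entry *dst;          /* plays the role of skb->_skb_dst */
};

/* Model of the check in the proposed patch: detach the packet from the
 * bridge's embedded entry before it is forwarded to another device. */
static void detach_fake_dst(struct packet *pkt, const struct bridge *br)
{
    if (pkt->dst == &br->fake_rtable)
        pkt->dst = NULL;
}

/* Returns true if freeing the queued packet after the bridge is gone
 * would have to write to the bridge's already freed memory. */
static bool late_free_touches_bridge(bool apply_fix)
{
    struct bridge *br = malloc(sizeof(*br));
    struct packet pkt;
    bool dangling;

    assert(br != NULL);
    br->fake_rtable.refcnt = 1;
    pkt.dst = &br->fake_rtable;     /* packet entered through the bridge */

    if (apply_fix)
        detach_fake_dst(&pkt, br);  /* done at forward time */

    /* The packet now waits in the tap device's queue. In this model,
     * deleting the bridge frees the embedded entry regardless of any
     * outstanding references. */
    dangling = (pkt.dst != NULL);   /* recorded before the free; no deref after */
    free(br);

    /* A real release path would decrement pkt.dst->refcnt at this point. */
    return dangling;
}
```

Calling late_free_touches_bridge(false) models the reported crash path, where
the queued packet still points into the freed bridge; calling
late_free_touches_bridge(true) models the forward path with the check applied.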