netdev.vger.kernel.org archive mirror
* Reproducible VLAN/e1000e crash in 2.6.36 vanilla.
From: Ben Greear @ 2010-10-25 17:57 UTC
  To: NetDev


To re-create, set up two 802.1q VLANs on different physical interfaces on the same system,
set up routing rules such that send-to-self works, and pass traffic (UDP/IPv4 in this case,
but the protocol doesn't seem to matter).
Stop traffic, then attempt to create additional 802.1q VLANs on the same physical interfaces.
The crash only appears to happen after traffic has been sent on the interface.

It will likely also crash if one system is sending to another, but so far we've
only tested sending-to-self.

This appears very reproducible for us, and appears to be the same problem that
I had reported against our hacked kernel here:

http://www.spinics.net/lists/netdev/msg144748.html


[root@ct503-60 ~]# general protection fault: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/virtual/net/eth2.103/type
CPU 2
Modules linked in: 8021q garp bridge stp llc veth arc4 michael_mic macvlan pktgen fuse nfs lockd fscach]

Pid: 0, comm: kworker/0:1 Not tainted 2.6.36 #32 X8DTU/X8DTU
RIP: 0010:[<ffffffff813cada1>]  [<ffffffff813cada1>] vlan_hwaccel_do_receive+0x64/0xca
RSP: 0018:ffff880001a43c10  EFLAGS: 00010287
RAX: 0000000000000002 RBX: ffff88031d1b0200 RCX: ffff88032d600000
RDX: ffff880001a43c00 RSI: ffff88031d1b0200 RDI: 0000000000000001
RBP: ffff880001a43c30 R08: 0000000000000067 R09: ffff8803217268c0
R10: ffff88031d1b0228 R11: 00000000000005f2 R12: ffff88032d600000
R13: ffff10032f040890 R14: 0000000000000000 R15: ffff880330b6ae00
FS:  0000000000000000(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001d07be8 CR3: 0000000001642000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 0, threadinfo ffff8803321f0000, task ffff88033209f700)
Stack:
  ffff880001a43c30 ffff88031d1b0200 ffff88032d600908 ffff88031d1b0208
<0> ffff880001a43c90 ffffffff81344313 ffff88031d1b0200 ffff88032d600908
<0> ffff880001a43c70 ffffffff81061d07 000000004cc5bf73 ffff88031d1b0200
Call Trace:
  <IRQ>
  [<ffffffff81344313>] __netif_receive_skb+0x36/0x3b3
  [<ffffffff81061d07>] ? ktime_get_real+0x11/0x3e
  [<ffffffff813454a5>] netif_receive_skb+0x67/0x6e
  [<ffffffff81345b8c>] napi_skb_finish+0x24/0x3b
  [<ffffffff813cb07f>] vlan_gro_receive+0x7b/0x80
  [<ffffffffa016d5b9>] e1000_receive_skb+0x51/0x6d [e1000e]
  [<ffffffffa016eeb0>] e1000_clean_rx_irq+0x1ed/0x292 [e1000e]
  [<ffffffffa016f287>] e1000_clean+0x75/0x221 [e1000e]
  [<ffffffff81345690>] net_rx_action+0xad/0x19c
  [<ffffffff81048926>] __do_softirq+0xa8/0x135
  [<ffffffff8100a99c>] call_softirq+0x1c/0x30
  [<ffffffff8100c085>] do_softirq+0x41/0x7e
  [<ffffffff81048ab8>] irq_exit+0x36/0x85
  [<ffffffff8100b7bf>] do_IRQ+0xad/0xc4
  [<ffffffff813ed4d3>] ret_from_intr+0x0/0x11
  <EOI>
  [<ffffffff8120dab2>] ? intel_idle+0xe6/0x112
  [<ffffffff8120da95>] ? intel_idle+0xc9/0x112
  [<ffffffff8131d121>] cpuidle_idle_call+0xab/0xe6
  [<ffffffff81008dd5>] cpu_idle+0x59/0xb5
  [<ffffffff813e6da8>] start_secondary+0x1a9/0x1ae
Code: 0d 0f b7 c0 41 8b 44 85 04 66 c7 83 bc 00 00 00 00 00 89 43 78 4d 8b ad d8 00 00 00 e8 c1 95 e0 f




-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Reproducible VLAN/e1000e crash in 2.6.36 vanilla.
From: Ben Greear @ 2010-10-25 21:18 UTC
  To: NetDev

Bleh, I think I see the problem.

If a NIC is in promiscuous mode, it can receive VLAN packets for which there
are no VLAN devices.

static gro_result_t
vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
		unsigned int vlan_tci, struct sk_buff *skb)
{
	struct sk_buff *p;
	struct net_device *vlan_dev;
	u16 vlan_id;

	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
		skb->deliver_no_wcard = 1;

	skb->skb_iif = skb->dev->ifindex;
	__vlan_hwaccel_put_tag(skb, vlan_tci);
	vlan_id = vlan_tci & VLAN_VID_MASK;
	vlan_dev = vlan_group_get_device(grp, vlan_id);

	if (vlan_dev)
		skb->dev = vlan_dev;
	else if (vlan_id) {
		if (!(skb->dev->flags & IFF_PROMISC))
			goto drop;
		skb->pkt_type = PACKET_OTHERHOST;
	}

You hit that else branch, and then skb->dev remains the physical
device.
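
A minimal user-space sketch of that branch, to make the three outcomes
concrete. Everything prefixed model_/MODEL_ is a hypothetical stand-in, not
a kernel API, and the flat lookup table is an assumption that replaces
vlan_group_get_device(). Only the "no VLAN device, NIC is promiscuous" case
leaves skb->dev pointing at the physical device while the tag stays set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_VID_MASK  0x0fff   /* low 12 bits of the TCI carry the VLAN ID */
#define MODEL_PROMISC  0x1      /* hypothetical stand-in for IFF_PROMISC */
#define MODEL_MAX_VID  4096

enum model_pkt_type { MODEL_PACKET_HOST, MODEL_PACKET_OTHERHOST };
enum model_verdict  { MODEL_TO_VLAN_DEV, MODEL_STAYS_ON_REAL_DEV, MODEL_DROP };

struct model_dev   { unsigned int flags; };
struct model_group { struct model_dev *vlan_devs[MODEL_MAX_VID]; };

struct model_skb {
	struct model_dev *dev;
	uint16_t vlan_tci;
	enum model_pkt_type pkt_type;
};

/* Mirrors the if / else-if chain quoted above; not the kernel function. */
static enum model_verdict
model_vlan_gro_common(struct model_group *grp, uint16_t vlan_tci,
		      struct model_skb *skb)
{
	uint16_t vlan_id = vlan_tci & VLAN_VID_MASK;
	struct model_dev *vlan_dev;

	assert(grp && skb && skb->dev);
	skb->vlan_tci = vlan_tci;		/* the tag is recorded in every case */
	vlan_dev = grp->vlan_devs[vlan_id];

	if (vlan_dev) {
		skb->dev = vlan_dev;		/* normal case: retarget to the VLAN device */
		return MODEL_TO_VLAN_DEV;
	}
	if (vlan_id) {
		if (!(skb->dev->flags & MODEL_PROMISC))
			return MODEL_DROP;
		skb->pkt_type = MODEL_PACKET_OTHERHOST;
	}
	/* skb->dev is still the physical device here */
	return MODEL_STAYS_ON_REAL_DEV;
}
```

With VLAN 103 configured and the NIC promiscuous, a frame tagged 103 is
retargeted, while a frame tagged 104 keeps the physical device in skb->dev.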

Later, it's passed to:

int vlan_hwaccel_do_receive(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct vlan_rx_stats     *rx_stats;

	skb->dev = vlan_dev_info(dev)->real_dev;
	netif_nit_deliver(skb);


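which does no checking before assuming that skb->dev is a VLAN
device. On a physical device, vlan_dev_info(dev) interprets the driver's
private data as VLAN info, so real_dev is garbage.

Things go downhill rapidly after that.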


Maybe this code in dev.c should check that skb->dev is a
VLAN device before passing it to the hwaccel code?

static int __netif_receive_skb(struct sk_buff *skb)
{
	struct packet_type *ptype, *pt_prev;
	rx_handler_func_t *rx_handler;
	struct net_device *orig_dev;
	struct net_device *master;
	struct net_device *null_or_orig;
	struct net_device *orig_or_bond;
	int ret = NET_RX_DROP;
	__be16 type;

	if (!netdev_tstamp_prequeue)
		net_timestamp_check(skb);

	if (vlan_tx_tag_present(skb) && vlan_hwaccel_do_receive(skb))
		return NET_RX_SUCCESS;
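
One way to picture the check suggested above, as a user-space sketch and not
a patch. It assumes VLAN devices can be recognised by a flag (the kernel sets
IFF_802_1Q_VLAN in priv_flags for them; the thread does not say whether an
eventual fix would use it). All model_/MODEL_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_IFF_802_1Q_VLAN 0x1	/* stand-in for the kernel's IFF_802_1Q_VLAN */

struct model_dev {
	unsigned int priv_flags;
	struct model_dev *real_dev;	/* only meaningful for VLAN devices */
};

struct model_skb {
	struct model_dev *dev;
	bool vlan_tag_present;
};

static bool model_is_vlan_dev(const struct model_dev *dev)
{
	return (dev->priv_flags & MODEL_IFF_802_1Q_VLAN) != 0;
}

/*
 * Returns the underlying device that the VLAN receive path would use, or
 * NULL when the VLAN path must be skipped because the skb is untagged or
 * skb->dev is not a VLAN device (the promiscuous case in this thread).
 */
static struct model_dev *model_hwaccel_real_dev(const struct model_skb *skb)
{
	assert(skb && skb->dev);
	if (!skb->vlan_tag_present || !model_is_vlan_dev(skb->dev))
		return NULL;
	return skb->dev->real_dev;
}
```

The point of the guard is that real_dev is only ever read from a device known
to be a VLAN device; a tagged skb still sitting on the physical device skips
the VLAN path and continues through normal receive processing.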


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Reproducible VLAN/e1000e crash in 2.6.36 vanilla.
From: John Fastabend @ 2010-10-25 21:34 UTC
  To: Ben Greear; +Cc: NetDev

On 10/25/2010 2:18 PM, Ben Greear wrote:
> Bleh, I think I see the problem.
> 
> If a NIC is in promis mode, it can receive VLAN packets for which there
> are no VLAN devices.
> 
> static gro_result_t
> vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
>                  unsigned int vlan_tci, struct sk_buff *skb)
> {
>          struct sk_buff *p;
>          struct net_device *vlan_dev;
>          u16 vlan_id;
> 
>          if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>                  skb->deliver_no_wcard = 1;
> 
>          skb->skb_iif = skb->dev->ifindex;
>          __vlan_hwaccel_put_tag(skb, vlan_tci);
>          vlan_id = vlan_tci & VLAN_VID_MASK;
>          vlan_dev = vlan_group_get_device(grp, vlan_id);
> 
>          if (vlan_dev)
>                  skb->dev = vlan_dev;
>          else if (vlan_id) {
>                  if (!(skb->dev->flags & IFF_PROMISC))
>                          goto drop;
>                  skb->pkt_type = PACKET_OTHERHOST;
>          }
> 
> You hit that else branch, and then skb->dev remains the physical
> device.
> 
> Later, it's passed to:
> 
> int vlan_hwaccel_do_receive(struct sk_buff *skb)
> {
> 	struct net_device *dev = skb->dev;
> 	struct vlan_rx_stats     *rx_stats;
> 
> 	skb->dev = vlan_dev_info(dev)->real_dev;
> 	netif_nit_deliver(skb);
> 

This looks like it should already be fixed in net-next:

bool vlan_hwaccel_do_receive(struct sk_buff **skbp)
{
        struct sk_buff *skb = *skbp;
        u16 vlan_id = skb->vlan_tci & VLAN_VID_MASK;
        struct net_device *vlan_dev;
        struct vlan_rx_stats *rx_stats;

        vlan_dev = vlan_find_dev(skb->dev, vlan_id);
        if (!vlan_dev) {
                if (vlan_id)
                        skb->pkt_type = PACKET_OTHERHOST;
                return false;
        }

If the vlan_dev is not found, it does not set skb->dev and returns false.
Then, in __netif_receive_skb:

        if (vlan_tx_tag_present(skb)) {
                if (pt_prev) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = NULL;
                }
                if (vlan_hwaccel_do_receive(&skb)) {
                        ret = __netif_receive_skb(skb);
                        goto out;
                } else if (unlikely(!skb))
                        goto out;
        }
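
The flow of the two excerpts above can be modelled in user space as follows.
This is a sketch under assumptions: the model_/MODEL_ names are hypothetical,
the small array stands in for vlan_find_dev(), and clearing the tag after a
successful lookup is assumed here (the excerpt does not show it) so that the
re-run does not take the VLAN branch again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_VID_MASK   0x0fff
#define MODEL_MAX_VLANS 4

enum model_pkt_type { MODEL_PACKET_HOST, MODEL_PACKET_OTHERHOST };

struct model_dev {
	/* VLAN devices stacked on this device; unused slots have dev == NULL */
	struct { uint16_t vid; struct model_dev *dev; } vlans[MODEL_MAX_VLANS];
};

struct model_skb {
	struct model_dev *dev;
	uint16_t vlan_tci;	/* 0 means "no tag" in this model */
	enum model_pkt_type pkt_type;
};

static struct model_dev *model_vlan_find_dev(struct model_dev *real_dev,
					     uint16_t vid)
{
	for (size_t i = 0; i < MODEL_MAX_VLANS; i++)
		if (real_dev->vlans[i].dev && real_dev->vlans[i].vid == vid)
			return real_dev->vlans[i].dev;
	return NULL;
}

/* Looks the VLAN device up itself and never trusts skb->dev to be one. */
static bool model_hwaccel_do_receive(struct model_skb *skb)
{
	uint16_t vlan_id = skb->vlan_tci & VLAN_VID_MASK;
	struct model_dev *vlan_dev = model_vlan_find_dev(skb->dev, vlan_id);

	if (!vlan_dev) {
		if (vlan_id)
			skb->pkt_type = MODEL_PACKET_OTHERHOST;
		return false;		/* skb->dev is left untouched */
	}
	skb->dev = vlan_dev;
	skb->vlan_tci = 0;		/* assumed: tag consumed before the re-run */
	return true;
}

/* Returns the device the packet ends up on after receive processing. */
static struct model_dev *model_netif_receive(struct model_skb *skb)
{
	assert(skb && skb->dev);
	if (skb->vlan_tci && model_hwaccel_do_receive(skb))
		return model_netif_receive(skb);	/* re-run on the VLAN device */
	return skb->dev;
}
```

A frame tagged for an unconfigured VLAN stays on the physical device and is
marked as being for another host, which is the case that crashes in 2.6.36.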




* Re: Reproducible VLAN/e1000e crash in 2.6.36 vanilla.
From: Eric Dumazet @ 2010-10-25 21:38 UTC
  To: John Fastabend; +Cc: Ben Greear, NetDev

On Monday, 25 October 2010 at 14:34 -0700, John Fastabend wrote:
> Looks like this should be fixed on net-next,
> 
> bool vlan_hwaccel_do_receive(struct sk_buff **skbp)
> {
>         struct sk_buff *skb = *skbp;
>         u16 vlan_id = skb->vlan_tci & VLAN_VID_MASK;
>         struct net_device *vlan_dev;
>         struct vlan_rx_stats *rx_stats;
> 
>         vlan_dev = vlan_find_dev(skb->dev, vlan_id);
>         if (!vlan_dev) {
>                 if (vlan_id)
>                         skb->pkt_type = PACKET_OTHERHOST;
>                 return false;
>         }
> 
> If the vlan_dev is not found do not set skb->dev and return false then
> in __netif_receive_skb,
> 
>       if (vlan_tx_tag_present(skb)) {
>                 if (pt_prev) {
>                         ret = deliver_skb(skb, pt_prev, orig_dev);
>                         pt_prev = NULL;
>                 }
>                 if (vlan_hwaccel_do_receive(&skb)) {
>                         ret = __netif_receive_skb(skb);
>                         goto out;
>                 } else if (unlikely(!skb))
>                         goto out;
>         }
> 

Yes, but net-next is a totally different beast for VLANs ;)

We should make a patch for 2.6.36 rather than bringing in the huge VLAN
changes added for 2.6.37.




