Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: fold network name hash (v2)
From: Stephen Hemminger @ 2009-10-28 15:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, linux-kernel, akpm, torvalds, opurdila,
	viro
In-Reply-To: <4AE7DF8E.3020607@gmail.com>

On Wed, 28 Oct 2009 07:07:10 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephen Hemminger a écrit :
> > The full_name_hash does not produce a value that is evenly distributed
> > over the lower 8 bits. This causes name hash to be unbalanced with large
> > number of names. There is a standard function to fold in upper bits
> > so use that.
> > 
> > This is independent of possible improvements to full_name_hash()
> > in future.
> 
> >  static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
> >  {
> >  	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
> > -	return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
> > +	return &net->dev_name_head[hash_long(hash, NETDEV_HASHBITS)];
> >  }
> >  
> >  static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
> 
> full_name_hash() returns an "unsigned int", which is guaranteed to be 32 bits
> 
> You should therefore use hash_32(hash, NETDEV_HASHBITS),
> not hash_long() that maps to hash_64() on 64 bit arches, which is
> slower and certainly not any better with a 32bits input.

OK, I was following precedent. Only a couple places use hash_32, most use
hash_long().

Using the upper bits does give better distribution.
With 100,000 network names:

               Time       Ratio       Max   StdDev
hash_32       0.002123     1.00       422  11.07
hash_64       0.002927     1.00       400   3.97

The time field is pretty meaningless for such a small sample

^ permalink raw reply

* Re: WAN device configuration, again...
From: Dan Williams @ 2009-10-28 16:03 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: netdev
In-Reply-To: <m3d447hcks.fsf@intrepid.localdomain>

On Wed, 2009-10-28 at 14:28 +0100, Krzysztof Halasa wrote:
> Hi,
> 
> I'm currently at final stages of "producing" two WAN drivers and there
> is one thing to solve: they have really complex options. It's no longer
> a V.35 with ca. 4 clock modes, a clock rate and few encodings etc. They
> need many options unique to each driver/board. I think I need a more
> capable interface to configure the devices than the current ioctl-based
> one.
> 
> I think of something:
> - using netlink or similar interface

If you're doing a new config interface, I'd suggest netlink like the
wireless guys did to replace WEXT with cfg80211.  Using netlink makes
your interface easily available from programs/libraries without having
to screenscrape anything.  If you want some advice on netlink API stuff,
ask Johannes Berg.

Dan

> - with potentially unlimited "payload" size (data may be transfered in
>   smaller packets)
> - the "command" and "response" should be variable-length ASCII-based,
>   instead of fixed structures. This way I don't have to duplicate all
>   option handling in userspace, only the specific driver has to know
>   about them.
> 
> Comments? Perhaps there is already an example?
> Should I use something else?
> 
> I also thought about using /sys read/write calls, but I'm not sure it's
> a good idea.


^ permalink raw reply

* RE: [PATCH] udev: create empty regular files to represent net interfaces
From: Jordan_Hargrave @ 2009-10-28 16:09 UTC (permalink / raw)
  To: dannf, Matt_Domsch
  Cc: kay.sievers, linux-hotplug, Narendra_K, netdev, Charles_Rose,
	bhutchings
In-Reply-To: <20091028150913.GB3612@ldl.fc.hp.com>

I was thinking, if we're not planning on use the chardev/kernel route.  There already exists an ifindex file in /sys/class/net/XXX/ifindex.
Should udev/helper be creating links to this, or is it better to keep everything under the /dev/ tree?
Using this method would require the patch to udev to handle renaming events.

--jordan hargrave
Dell Enterprise Linux Engineering



-----Original Message-----
From: dann frazier [mailto:dannf@hp.com]
Sent: Wed 10/28/2009 10:09
To: Domsch, Matt
Cc: Kay Sievers; linux-hotplug@vger.kernel.org; K, Narendra; netdev@vger.kernel.org; Hargrave, Jordan; Rose, Charles; Ben Hutchings
Subject: Re: [PATCH] udev: create empty regular files to represent net interfaces
 
On Wed, Oct 28, 2009 at 08:03:08AM -0500, Matt Domsch wrote:
> On Wed, Oct 28, 2009 at 09:23:57AM +0100, Kay Sievers wrote:
[...]
> > That all sounds very much like something which will hit us back some
> > day. I'm not sure, if udev should publish such dead text files in
> > /dev, it does not seem to fit the usual APIs/assumptions where /sys
> > and /dev match, and libudev provides access to both. It all sounds
> > more like a database for a possible netdevname library, which does not
> > need to be public in /dev, right?
> 
> Right, it doesn't need to be in /dev.  We could have udev rules that
> simply call yet another program to maintain that database, in yet
> another way.

Or have udev maintain them in a private directory (e.g.,
/var/lib/udev/netalias). Personally, I like the approach of having
udev manage them as files - its an abstraction our users already get,
and they don't have to learn two mechanisms when aliasing disks and
nics (SYMLINK ftw). Plus there's obviously a lot of code reuse to be
had (most of my patch was moving code into a common section).

If we want to hide the file implementation - we could invent another
udev construct that basically aliases SYMLINK (e.g. NETALIAS) that
works iff the device is a netdevice. That would let us switch out
implementations in the future, but would obviously be much more
invasive.

-- 
dann frazier



^ permalink raw reply

* RE: [PATCH] udev: create empty regular files to represent net interfaces
From: Jordan_Hargrave @ 2009-10-28 16:09 UTC (permalink / raw)
  To: dannf, Matt_Domsch
  Cc: kay.sievers, linux-hotplug, Narendra_K, netdev, Charles_Rose,
	bhutchings
In-Reply-To: <20091028150913.GB3612@ldl.fc.hp.com>

I was thinking, if we're not planning on use the chardev/kernel route.  There already exists an ifindex file in /sys/class/net/XXX/ifindex.
Should udev/helper be creating links to this, or is it better to keep everything under the /dev/ tree?
Using this method would require the patch to udev to handle renaming events.


--jordan hargrave
Dell Enterprise Linux Engineering



-----Original Message-----
From: dann frazier [mailto:dannf@hp.com]
Sent: Wed 10/28/2009 10:09
To: Domsch, Matt
Cc: Kay Sievers; linux-hotplug@vger.kernel.org; K, Narendra; netdev@vger.kernel.org; Hargrave, Jordan; Rose, Charles; Ben Hutchings
Subject: Re: [PATCH] udev: create empty regular files to represent net interfaces
 
On Wed, Oct 28, 2009 at 08:03:08AM -0500, Matt Domsch wrote:
> On Wed, Oct 28, 2009 at 09:23:57AM +0100, Kay Sievers wrote:
[...]
> > That all sounds very much like something which will hit us back some
> > day. I'm not sure, if udev should publish such dead text files in
> > /dev, it does not seem to fit the usual APIs/assumptions where /sys
> > and /dev match, and libudev provides access to both. It all sounds
> > more like a database for a possible netdevname library, which does not
> > need to be public in /dev, right?
> 
> Right, it doesn't need to be in /dev.  We could have udev rules that
> simply call yet another program to maintain that database, in yet
> another way.

Or have udev maintain them in a private directory (e.g.,
/var/lib/udev/netalias). Personally, I like the approach of having
udev manage them as files - its an abstraction our users already get,
and they don't have to learn two mechanisms when aliasing disks and
nics (SYMLINK ftw). Plus there's obviously a lot of code reuse to be
had (most of my patch was moving code into a common section).

If we want to hide the file implementation - we could invent another
udev construct that basically aliases SYMLINK (e.g. NETALIAS) that
works iff the device is a netdevice. That would let us switch out
implementations in the future, but would obviously be much more
invasive.

-- 
dann frazier



^ permalink raw reply

* Re: [PATCH] udev: create empty regular files to represent net interfaces
From: Greg KH @ 2009-10-28 16:11 UTC (permalink / raw)
  To: Jordan_Hargrave
  Cc: dannf, Matt_Domsch, kay.sievers, linux-hotplug, Narendra_K,
	netdev, Charles_Rose, bhutchings
In-Reply-To: <5DDAB7BA7BDB58439DD0EED0B8E9A3AE02E827AF@ausx3mpc102.aus.amer.dell.com>


A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Wed, Oct 28, 2009 at 11:09:36AM -0500, Jordan_Hargrave@Dell.com wrote:
> I was thinking, if we're not planning on use the chardev/kernel route.
> There already exists an ifindex file in /sys/class/net/XXX/ifindex.
> Should udev/helper be creating links to this, or is it better to keep
> everything under the /dev/ tree?  Using this method would require the
> patch to udev to handle renaming events.

Please never create symlinks out of /dev/ to /sys that doesn't make
sense at all and probably violates part of the LSB somewhere...

thanks,

greg k-h

^ permalink raw reply

* Re: Oops in net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c::ipv6_confirm(), kernel 2.6.30.8
From: Patrick McHardy @ 2009-10-28 16:29 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: netfilter-devel, netdev
In-Reply-To: <200910170903.n9H93bKI012269@int-mx03.intmail.prod.int.phx2.redhat.com>

Chuck Ebbert wrote:
> general protection fault: 0000 [#1] SMP 
> last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings
> CPU 0 
> Modules linked in: tun fuse rfcomm sco bridge stp llc bnep l2cap autofs4
> w83627ehf hwmon_vid sunrpc sit tunnel4 nf_nat_sip nf_conntrack_sip nf_nat_ftp
> nf_conntrack_ftp ipt_LOG xt_owner iptable_mangle ipt_MASQUERADE iptable_nat
> nf_nat xt_limit nf_conntrack_ipv6 xt_mac ip6t_LOG ip6table_filter ip6_tables
> p4_clockmod freq_table speedstep_lib squashfs nls_utf8 dm_multipath raid1
> kvm_intel kvm uinput ipv6 ppdev snd_hda_codec_realtek snd_hda_intel
> snd_hda_codec snd_hwdep snd_pcm nouveau pcspkr i2c_i801 firewire_ohci snd_timer
> btusb firewire_core e1000 snd bluetooth drm iTCO_wdt iTCO_vendor_support
> crc_itu_t i2c_algo_bit asus_atk0110 i82975x_edac soundcore sky2 edac_core
> parport_pc i2c_core floppy hwmon snd_page_alloc parport raid456 raid6_pq
> async_xor async_memcpy async_tx xor [last unloaded: freq_table]
> Pid: 4104, comm: qemu-kvm Not tainted 2.6.30.8-64.fc11.x86_64.debug #1 System
> Product Name
> RIP: 0010:[<ffffffffa03624e1>]  [<ffffffffa03624e1>] ipv6_confirm+0xd0/0x147
> [nf_conntrack_ipv6]
> RSP: 0018:ffff880035203668  EFLAGS: 00010212
> RAX: 0000000000000030 RBX: ffff8801f90a1080 RCX: 0000000000000002
> RDX: ffffffff81783f40 RSI: 0000000000000030 RDI: ffff8801f90a1080
> RBP: ffff880035203698 R08: ffffffffa04520ee R09: ffff880035203748
> R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81783f40
> R13: 6b6b6b6b6b6b6b6b R14: 0000000000000002 R15: 0000000000000004
> FS:  00007f944e44b740(0000) GS:ffff880035200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fffbc54ef60 CR3: 000000020f8d8000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process qemu-kvm (pid: 4104, threadinfo ffff88020f89c000, task
> ffff8802104a4760)
> Stack:
>  3a00000000000246 00000000696092b1 0000000080000000 ffff8801f90a1080
>  ffffffffa04520ee ffffffff81783bd0 ffff880035203708 ffffffff8142b117
>  ffff8800352036c8 ffff880035203748 ffff880210492060 0000000000000000
> Call Trace:
>  <IRQ> <0> [<ffffffffa04520ee>] ? br_nf_dev_queue_xmit+0x0/0xa1 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffffa04520ee>] ? br_nf_dev_queue_xmit+0x0/0xa1 [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa04520ee>] ? br_nf_dev_queue_xmit+0x0/0xa1 [bridge]
>  [<ffffffffa044c29d>] ? br_dev_queue_push_xmit+0x0/0xae [bridge]
>  [<ffffffffa045263b>] nf_hook_thresh.clone.0+0x4c/0x62 [bridge]
>  [<ffffffffa0452d92>] br_nf_post_routing+0x1a8/0x1e4 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffffa044c29d>] ? br_dev_queue_push_xmit+0x0/0xae [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa044c29d>] ? br_dev_queue_push_xmit+0x0/0xae [bridge]
>  [<ffffffffa044c39d>] nf_hook_thresh.clone.0+0x52/0x68 [bridge]
>  [<ffffffffa044c3ed>] br_forward_finish+0x3a/0x62 [bridge]
>  [<ffffffffa0452aaa>] br_nf_forward_finish+0xb3/0xd2 [bridge]
>  [<ffffffffa045263b>] ? nf_hook_thresh.clone.0+0x4c/0x62 [bridge]
>  [<ffffffffa045318a>] br_nf_forward_ip+0x1af/0x1de [bridge]
>  [<ffffffffa044c3b3>] ? br_forward_finish+0x0/0x62 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffffa044c3b3>] ? br_forward_finish+0x0/0x62 [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa044c3b3>] ? br_forward_finish+0x0/0x62 [bridge]
>  [<ffffffffa044c415>] ? __br_forward+0x0/0xab [bridge]
>  [<ffffffffa044c39d>] nf_hook_thresh.clone.0+0x52/0x68 [bridge]
>  [<ffffffffa044c499>] __br_forward+0x84/0xab [bridge]
>  [<ffffffffa044c1ca>] br_flood+0x82/0xd9 [bridge]
>  [<ffffffff814086ee>] ? netif_receive_skb+0x120/0x44c
>  [<ffffffffa044c249>] br_flood_forward+0x28/0x3e [bridge]
>  [<ffffffffa044d36a>] br_handle_frame_finish+0x13a/0x167 [bridge]
>  [<ffffffffa04529da>] br_nf_pre_routing_finish_ipv6+0xb7/0xd4 [bridge]
>  [<ffffffffa045263b>] ? nf_hook_thresh.clone.0+0x4c/0x62 [bridge]
>  [<ffffffffa04534e8>] br_nf_pre_routing+0x32f/0x577 [bridge]
>  [<ffffffffa044d230>] ? br_handle_frame_finish+0x0/0x167 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffff8123bbf6>] ? kobject_put+0x54/0x6f
>  [<ffffffffa044d230>] ? br_handle_frame_finish+0x0/0x167 [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa044d230>] ? br_handle_frame_finish+0x0/0x167 [bridge]
>  [<ffffffffa044d21a>] nf_hook_thresh.clone.0+0x52/0x68 [bridge]
>  [<ffffffffa044d533>] br_handle_frame+0x19c/0x1d9 [bridge]
>  [<ffffffff814088fa>] netif_receive_skb+0x32c/0x44c

> Code: 2c 75 1d f6 05 1a fc 1c e2 40 74 60 f6 05 17 fc 1c e2 04 74 57 80 3d ad
> 4d 00 00 00 74 4e eb 62 44 89 f1 4c 89 e2 89 c6 48 89 df <41> ff 55 50 83 f8 01
> 75 3d 4c 8b a3 88 00 00 00 4d 85 e4 74 2c 
> RIP  [<ffffffffa03624e1>] ipv6_confirm+0xd0/0x147 [nf_conntrack_ipv6]
>  RSP <ffff880035203668>
> ---[ end trace 5dc400d9f2f8290b ]---
> 
>    c: f6 05 17 fc 1c e2 04  testb  $0x4,-0x1de303e9(%rip)
>   13: 74 57                 je     0x6c
>   15: 80 3d ad 4d 00 00 00  cmpb   $0x0,0x4dad(%rip)
>   1c: 74 4e                 je     0x6c
>   1e: eb 62                 jmp    0x82
>   20: 44 89 f1              mov    %r14d,%ecx
>   23: 4c 89 e2              mov    %r12,%rdx
>   26: 89 c6                 mov    %eax,%esi
>   28: 48 89 df              mov    %rbx,%rdi
> 
>    0: 41 ff 55 50           callq  *0x50(%r13)  <===========
>    4: 83 f8 01              cmp    $0x1,%eax
>    7: 75 3d                 jne    0x46
>    9: 4c 8b a3 88 00 00 00  mov    0x88(%rbx),%r12
>   10: 4d 85 e4              test   %r12,%r12
>   13: 74 2c                 je     0x41
> 
> R13: 6b6b6b6b6b6b6b6b  
> 
> Corresponds to:
> net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c:178:
> 
>         ret = helper->help(skb, protoff, ct, ctinfo);  

Did you unload any helper modules before this happened?

^ permalink raw reply

* [PATCH] vmxnet3: remove duplicate #include
From: Shreyas Bhatewara @ 2009-10-28 16:30 UTC (permalink / raw)
  To: netdev; +Cc: weiyi.huang, pv-drivers


Remove duplicate headerfile includes from vmxnet3_int.h

Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Bhavesh Davda <davda@vmware.com>

---

diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 3c0d70d..4450816 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -27,16 +27,11 @@
 #ifndef _VMXNET3_INT_H
 #define _VMXNET3_INT_H
 
-#include <linux/types.h>
 #include <linux/ethtool.h>
 #include <linux/delay.h>
-#include <linux/device.h>
 #include <linux/netdevice.h>
 #include <linux/pci.h>
-#include <linux/ethtool.h>
 #include <linux/compiler.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/ioport.h>

^ permalink raw reply related

* Re: TG3, kvm, ipv6 & tso data corruption bug?
From: Matt Carlson @ 2009-10-28 16:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: netdev@vger.kernel.org, Linux kernel Mailing List,
	Matthew Carlson, Michael Chan, KVM list
In-Reply-To: <4AE8595F.1080404@redhat.com>

On Wed, Oct 28, 2009 at 07:46:55AM -0700, Rik van Riel wrote:
> I have been tracking down what I thought was a KVM related network
> issue for a while, however it appears it could be a hardware issue.
> 
> The symptom is that data in network packets gets corrupted, before
> the checksum is calculated.  This means the remote host can get
> corrupted data, with no way to calculate it (except application
> level checksums).  Luckily ssh has such checksums, so my rsync over
> ssh backup script discovered this issue.
> 
> On a very regular basis, I got this message from ssh:
> 
> 	Corrupted MAC on input.
> 
> I have played around a bit and narrowed it down to the following:
> 
> ipv4          => no problem
> ipv6 w/o tso  => no problem
> ipv6 with tso => occasional data corruption
> 
> Disabling tso with ethtool -K eth0 tso off makes the problem stop.
> 
> I am running Fedora 12's 2.6.31.1-56.fc12.x86_64 kernel, with the
> following hardware:
> 
> 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761 
> Gigabit Ethernet PCIe (rev 10)
> 
> I do not know enough about the network layer to know whether this is
> fixable in software or whether TSO offloading for ipv6 should just
> be disabled on this model.

This problem sounds familiar.  There are chip bugs in this area, but as
far as I know, they should have been worked around.  Let me see if this
is indeed the same bug resurfacing.


^ permalink raw reply

* RE: [Pv-drivers] [PATCH] vmxnet3: remove duplicate #include
From: Bhavesh Davda @ 2009-10-28 16:32 UTC (permalink / raw)
  To: Shreyas Bhatewara, netdev@vger.kernel.org
  Cc: pv-drivers@vmware.com, weiyi.huang@gmail.com
In-Reply-To: <alpine.LRH.2.00.0910280910250.24555@sbhatewara-dev1.eng.vmware.com>

Correction:

< Signed-off-by: Bhavesh Davda <davda@vmware.com>
> Signed-off-by: Bhavesh Davda <bhavesh@vmware.com>

Thanks

- Bhavesh
 
Bhavesh P. Davda

> -----Original Message-----
> From: pv-drivers-bounces@vmware.com [mailto:pv-drivers-
> bounces@vmware.com] On Behalf Of Shreyas Bhatewara
> Sent: Wednesday, October 28, 2009 9:31 AM
> To: netdev@vger.kernel.org
> Cc: pv-drivers@vmware.com; weiyi.huang@gmail.com
> Subject: [Pv-drivers] [PATCH] vmxnet3: remove duplicate #include
> 
> 
> Remove duplicate headerfile includes from vmxnet3_int.h
> 
> Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
> Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
> Signed-off-by: Bhavesh Davda <davda@vmware.com>
> 
> ---
> 
> diff --git a/drivers/net/vmxnet3/vmxnet3_int.h
> b/drivers/net/vmxnet3/vmxnet3_int.h
> index 3c0d70d..4450816 100644
> --- a/drivers/net/vmxnet3/vmxnet3_int.h
> +++ b/drivers/net/vmxnet3/vmxnet3_int.h
> @@ -27,16 +27,11 @@
>  #ifndef _VMXNET3_INT_H
>  #define _VMXNET3_INT_H
> 
> -#include <linux/types.h>
>  #include <linux/ethtool.h>
>  #include <linux/delay.h>
> -#include <linux/device.h>
>  #include <linux/netdevice.h>
>  #include <linux/pci.h>
> -#include <linux/ethtool.h>
>  #include <linux/compiler.h>
> -#include <linux/module.h>
> -#include <linux/moduleparam.h>
>  #include <linux/slab.h>
>  #include <linux/spinlock.h>
>  #include <linux/ioport.h>
> _______________________________________________
> Pv-drivers mailing list
> Pv-drivers@vmware.com
> http://mailman2.vmware.com/mailman/listinfo/pv-drivers

^ permalink raw reply

* phylib printouts
From: Kristoffer Glembo @ 2009-10-28 16:33 UTC (permalink / raw)
  To: netdev

Hi,

There are some printouts in the phy abstraction library using pr_info(). 
  Personally I would prefer if they used pr_debug() instead.

Especially the pr_info("%s: probed\n", bus->name) in mdiobus_register() 
since in some cases I can't tolerate a console printout while the link 
is down (we have an ethernet debug link and read the UART output through 
the link).

Of course this is a very specific case but in general in embedded 
systems (which I think is where the phylib is mainly used) I prefer the 
driver and other kernel code to be rather silent if not debugging.

Any opinions? :)

Best regards,
Kristoffer Glembo

^ permalink raw reply

* Re: [PATCH] vlan: allow VLAN ID 0 to be used
From: Patrick McHardy @ 2009-10-28 16:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benny Amorsen, Gertjan Hofman, Matt Carlson,
	netdev@vger.kernel.org, David S. Miller
In-Reply-To: <4AE6C54B.7080805@gmail.com>

Eric Dumazet wrote:
> Benny Amorsen a écrit :
>> Eric Dumazet <eric.dumazet@gmail.com> writes:
>>
>>> Here is the patch I cooked that permitted VLAN 0 to be used with tg3
>>> (and other HW accelerated vlan nics I suppose)
>>>
>>> [PATCH] vlan: allow VLAN ID 0 to be used
>>>
>>> We currently use a 16 bit field (vlan_tci) to store VLAN ID on a skb.
>>>
>>> 0 value is used a special value, meaning VLAN ID not set.
>>> This forbids use of VLAN ID 0
>> Are you sure you actually want to do this?
>>
>> VLAN 0 IS special. Frames received on VLAN 0 should be treated just as
>> if they had no VLAN tag at all, except that they have an 802.1p value.
>>
>> Sending frames with VLAN 0 should have something to do with whether
>> the sender wants to use 802.1p, which doesn't really have much to do
>> with VLAN's at all...
>>
>> It would be nice if the unsuspecting user was at least warned that their
>> use of VLAN 0 is non-standard and may cause surprising results like
>> leakage into the "native" VLAN. That could be done in /sbin/ip or
>> /sbin/vconfig, of course.
>>
> 
> Quotting http://en.wikipedia.org/wiki/IEEE_802.1Q
> 
> VLAN Identifier (VID): a 12-bit field specifying the VLAN to which the frame belongs.
>  A value of 0 means that the frame doesn't belong to any VLAN; in this case the 802.1Q
>  tag specifies only a priority and is referred to as a priority tag.
> A value of hex FFF is reserved for implementation use.
> All other values may be used as VLAN identifiers, allowing up to 4094 VLANs
> 
> 
> So we expect to generate a 802.1Q frame, even with a VID=0 field.
> Before patch, device sends a non 802.1Q frame, which is not what was wanted by user.
> (Maybe he wants to check its device/network is able to transport 1522 bytes frames, who knows...)
> 
> To use non tagged frames, user selects eth0 device, and to send tagged frames, he selects eth0.0

I agree. Quoting IEEE802.1Q:

0 The null VLAN ID. Indicates that the tag header contains only priority
  information; no VLAN identifier is present in the frame. ...

which implies that its valid to use 0 as tag in the header.

> Now, maybe eth0 and eth0.0 should share same IP addresses, because incoming frame
> with ID=0 tag should be received by eth0 device, but I am not sure standard requires this.

I don't think the standard makes any demands regarding this.

The current behaviour seems better to me since its symetrical
to the output path, where the VLAN device is necessary to be
able to specify a priority mapping.



^ permalink raw reply

* Re: [Fwd: Re: [PATCH] Multicast packet reassembly can fail]
From: Steve Chen @ 2009-10-28 17:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <4AE86420.3040607@gmail.com>

On Wed, 2009-10-28 at 16:32 +0100, Eric Dumazet wrote:
> Steve Chen a écrit :
> > Following is the e-mail I replied to Rick Jones in case you haven't got
> > it yet.  I have not posted the test code yet.  The test code is a tar gz
> > file about 11k in size.  Lot of people will get very upset if I send it
> > to the mailing list.  If you can suggest a place to up load, I'll be
> > happy to do so.
> > 
> > Regards,
> 
> > Here is the scenario this patch tries to address
> > 
> > <src node> ---->  <switch>  ----> <eth0 dest node>
> >                             \--->  <eth1 dest node>
> 
> If each fragment is received twice on host, once by eth0, once by eth1,
> should we deliver datagram once or twice ?

The application received it once.  IIRC the duplicate packet is drop in
the routing code.

> 
> Once should be enough, even if in the non fragmented case, it will
> be delivered twice (kernel cannot detect duplicates, user app might do itself)

Routing code drops the duplicate packet for none-fragmented case as
well.

> 
> 
> > 
> > For this specific case, src/dst address, protocol, IP ID and fragment
> > offset are all identical.  The only difference is the ingress interface.
> > A good follow up question would be why would anyone in their right mind
> > multicast to the same destination?  well, I don't know.  I can not get
> > the people who reported the problem to tell me either.   Since someone
> > found the need to do this,  perhaps others may find it useful too.
> > 
> 
> Then, if a 2000 bytes message is fragmented in two packets, one coming
> from eth0, one coming from eth1, I suspect your patch drops the message,
> unless eth0/eth1 are part of a bonding device...

Actually, the patch tries to prevent packet drop for this exact
scenario.  Please consider the following scenarios
1.  Packet comes in the fragment reassemble code in the following order
(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
Packet from both interfaces get reassembled and gets further processed.

2. Packet can some times arrive in (perhaps other orders as well)
(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
packet from eth1 is dropped in the routing code.

> 
> That would break common routing setups, using two links to aggregate bandwidth ?

I don't believe it would.  The aggregate bandwidth will work the same as
before.  The attributes (src/dst addr, protocol, interface, etc.) should
generate a unique hash key.  If hash collision should happen with the
addition of iif << 5, the code still compare the original src addr along
with interface number, so there should be no issues.

> Nothing in IP forces all packets of a datagram to follow an unique route.
> 
My understanding as well.

Regards,

Steve


^ permalink raw reply

* Re: [net-next-2.6 PATCH v4 3/3] TCPCT part 1c: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-28 17:14 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: Eric Dumazet
In-Reply-To: <4AE8527E.4080603@gmail.com>

Eric Dumazet wrote:
> William Allen Simpson a écrit :
>> suggested for many years, and is also useful without SYN data, allowing
>> several related concepts to use the same extension option.
>>
>>    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
>>    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
> 
> Sorry this link might be interesting to you, but I found nothing that explains
> your patches.
> 
>>    "Re: what a new TCP header might look like", May 12, 1998.
>>    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
> 
> Same here....
> 
They explain the "several related concepts" -- indeed the entire scope of
the whole project.  That's why this patch series is labeled "part 1".


> I tried to find an RFC or document about this stuff and failed.
> 
This patch series was fairly clearly described in Adam's draft last year.
It's expired now, but I'll send you a copy privately.


> Before reading implementation code, I like to have english text that describes
> the new concept/design.
> 
Before writing text, I like to have implementation code.... :-)

In fact, it has long been my position that we shouldn't publish IETF RFCs
without running code (and I never have).  Even PPP over SOnet/SDH had
preliminary hardware before publication.  Unfortunately, those with
standards-bodies-itis have a sad history of the opposite.

The Usenix ;login: overview and other documents are embargoed until
(December) publication, but I'll send you some recent galleys privately.


> (BTW I found http://ttcplinux.sourceforge.net/theses/ETTCP.pdf and found it interesting,
> I wonder what happened to this)
> 
I found it too, and deliberately extended Adam's sockopt to encompass it.
It's one reason this was renamed TCP Cookie *Transactions* (TCPCT).  That
will come along later, probably about parts 4 or 5.

^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Rick Jones @ 2009-10-28 17:18 UTC (permalink / raw)
  To: Mark Huth; +Cc: Steve Chen, netdev
In-Reply-To: <4AE8776D.4020609@mvista.com>

>> It has been hours since my last good Emily Litella moment so I'll ask 
>> - isn't the combination of source and dest addr, protocol, IP ID and 
>> fragment offset supposed to take care of this?  How does the ingress 
>> interface have anything to do with it?
>>
>> rick jones
> 
> The problem we've seen arises only when there are multiple interfaces 
> each receiving the same multicast packets.  In that case there are 
> multiple packets with the same key.  Steve was able to track down a 
> packet loss due to re-assembly failure under certain arrival order 
> conditions.
> 
> The proposed fix eliminated the packet loss in this case.  There might 
> be a different problem in the re-assembly code that we have masked by 
> separating the packets into streams from each interface.  Now that you 
> mention it, the re-assembly code should be robust in the face of some 
> duplicated and mis-ordered packets.  We can look more closely at that code.

If I understand correctly, the idea here is to say that when multiple interfaces 
receive fragments of copies of the same  IP datagram that both copies will 
"survive" and flow up the stack?

I'm basing that on your description, and an email from Steve that reads:

> Actually, the patch tries to prevent packet drop for this exact
> scenario.  Please consider the following scenarios
> 1.  Packet comes in the fragment reassemble code in the following order
> (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> Packet from both interfaces get reassembled and gets further processed.
> 
> 2. Packet can some times arrive in (perhaps other orders as well)
> (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> packet from eth1 is dropped in the routing code.

Doesn't that rather fly in the face of the weak-end-system model followed by Linux?

I can see where scenario one leads to two IP datagrams making it up the stack, 
but I would have thought that was simply an "accident" of the situation that 
cannot reasonably be prevented, not justification to cause scenario two to send 
two datagrams up the stack.

rick jones

^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Mark Huth @ 2009-10-28 16:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: Steve Chen, netdev
In-Reply-To: <4AE780CB.8070401@hp.com>

Rick Jones wrote:
> Steve Chen wrote:
>> Multicast packet reassembly can fail
>>
>> When multicast connections with multiple fragments are received by the 
>> same
>> node from more than one Ethernet ports, race condition between fragments
>> from each Ethernet port can cause fragment reassembly to fail leading to
>> packet drop.  This is because packets from each Ethernet port appears 
>> identical
>> to the the code that reassembles the Ethernet packet.
>>
>> The solution is evaluate the Ethernet interface number in addition to 
>> all other
>> parameters so that every packet can be uniquely identified.  The existing
>> iif field in struct ipq is now used to generate the hash key, and iif 
>> is also
>> used for comparison in case of hash collision.
>>
>> Please note that q->saddr ^ (q->iif << 5) is now being passed into
>> ipqhashfn to generate the hash key.  This is borrowed from the routing
>> code.
>>
>> Signed-off-by: Steve Chen <schen@mvista.com>
>> Signed-off-by: Mark Huth <mhuth@mvista.com>
> 
> It has been hours since my last good Emily Litella moment so I'll ask - 
> isn't the combination of source and dest addr, protocol, IP ID and 
> fragment offset supposed to take care of this?  How does the ingress 
> interface have anything to do with it?
> 
> rick jones
The problem we've seen arises only when there are multiple interfaces 
each receiving the same multicast packets.  In that case there are 
multiple packets with the same key.  Steve was able to track down a 
packet loss due to re-assembly failure under certain arrival order 
conditions.

The proposed fix eliminated the packet loss in this case.  There might 
be a different problem in the re-assembly code that we have masked by 
separating the packets into streams from each interface.  Now that you 
mention it, the re-assembly code should be robust in the face of some 
duplicated and mis-ordered packets.  We can look more closely at that code.

Mark Huth


^ permalink raw reply

* [PATCH net-next-2.6] be2net: Add the new PCI IDs to PCI_DEVICE_TABLE.
From: Ajit Khaparde @ 2009-10-28 17:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

This patch adds the PCI IDs for the next generation chip to the
PCI_DEVICE_ID table.

Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
---
 drivers/net/benet/be_main.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index 4520db7..43180dc 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -31,8 +31,10 @@ MODULE_PARM_DESC(rx_frag_size, "Size of a fragment that holds rcvd data.");
 
 static DEFINE_PCI_DEVICE_TABLE(be_dev_ids) = {
 	{ PCI_DEVICE(BE_VENDOR_ID, BE_DEVICE_ID1) },
+	{ PCI_DEVICE(BE_VENDOR_ID, BE_DEVICE_ID2) },
 	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID1) },
 	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID2) },
+	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID3) },
 	{ 0 }
 };
 MODULE_DEVICE_TABLE(pci, be_dev_ids);
-- 
1.6.0.4


^ permalink raw reply related

* Re: [RFC] multiqueue changes
From: Patrick McHardy @ 2009-10-28 17:27 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Eric Dumazet, David S. Miller, Linux Netdev List
In-Reply-To: <20091008120039.GA8691@ff.dom.local>

Jarek Poplawski wrote:
> On Thu, Oct 08, 2009 at 09:03:44AM +0000, Jarek Poplawski wrote:
> ...
>> num_tx_queues. But it seems we should rather use more often
>> real_num_tx_queue in schedulers code like dumps and maybe more
>> (like e.g. sch_multiq does).
> 
> ...i.e. probably everywhere between dev_activate and dev_deactivate
> all qdisc operations could use real_num_tx_queues (including a test
> like: netif_is_real_multique), unless I miss something.

We don't seem to be supporting changing real_num_tx_queues for
registered devices currently (at least I couldn't find it).
So I guess it depends on how this would be implemented.

Simply changing the dev->real_num_tx_queues value while the
device is down would require qdisc operations to operate on
all possible queues since the amount of queues in use could
be changed after the qdisc is created/configured, but before
the device is set up. This approach has more complications
like switching between mq and non-mq root qdiscs, taking care
of non-default root qdisc (cloning them to the new queues), etc.

A simpler alternative would be to destroy the existing root
qdisc on any change to real_num_tx_queues and have dev_activate()
set it up from scratch. In this case, we could (as you suggested)
use real_num_tx_queues, which should fix the problem Eric reported.

^ permalink raw reply

* Re: [Fwd: Re: [PATCH] Multicast packet reassembly can fail]
From: Eric Dumazet @ 2009-10-28 17:26 UTC (permalink / raw)
  To: Steve Chen; +Cc: netdev
In-Reply-To: <1256749559.3153.447.camel@linux-1lbu>

Steve Chen a écrit :
> On Wed, 2009-10-28 at 16:32 +0100, Eric Dumazet wrote:
>> If each fragment is received twice on host, once by eth0, once by eth1,
>> should we deliver datagram once or twice ?
> 
> The application received it once.  IIRC the duplicate packet is drop in
> the routing code.
> 
>> Once should be enough, even if in the non fragmented case, it will
>> be delivered twice (kernel cannot detect duplicates, user app might do itself)
> 
> Routing code drops the duplicate packet for none-fragmented case as
> well.

Really ? How so ? Receiving two copies of the same packet is legal.

> 
>>
>>> For this specific case, src/dst address, protocol, IP ID and fragment
>>> offset are all identical.  The only difference is the ingress interface.
>>> A good follow up question would be why would anyone in their right mind
>>> multicast to the same destination?  well, I don't know.  I can not get
>>> the people who reported the problem to tell me either.   Since someone
>>> found the need to do this,  perhaps others may find it useful too.
>>>
>> Then, if a 2000 bytes message is fragmented in two packets, one coming
>> from eth0, one coming from eth1, I suspect your patch drops the message,
>> unless eth0/eth1 are part of a bonding device...
> 
> Actually, the patch tries to prevent packet drop for this exact
> scenario.  Please consider the following scenarios
> 1.  Packet comes in the fragment reassemble code in the following order
> (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> Packet from both interfaces get reassembled and gets further processed.

Yes your patch does this, so each multicast application receives two copies of the
same datagram.

> 
> 2. Packet can some times arrive in (perhaps other orders as well)
> (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> packet from eth1 is dropped in the routing code.

Really ? how so ? I dont see how it can happen, unless you use RPF ?

current situation should be :

(eth0 frag1) : We create a context, store frag1 in it
(eth1 frag1) : We find this context, and drop frag1 since we already have the data
                  (maybe the bug is here, if we cannot cope with a duplicate ?)
(eth0 frag2) : We find this context, store frag2 -> complete datagram and deliver it
(eth1 frag2) : We find context, drop frag2 since datagram was completed.

               (or maybe we create a new context that will timeout later, maybe this is your problem ?)

Net effect : We deliver the datagram correctly.


> 
>> That would break common routing setups, using two links to aggregate bandwidth ?
> 
> I don't believe it would.  The aggregate bandwidth will work the same as
> before.  The attributes (src/dst addr, protocol, interface, etc.) should
> generate a unique hash key.  If hash collision should happen with the
> addition of iif << 5, the code still compare the original src addr along
> with interface number, so there should be no issues.

What about the obvious :

(eth0 frag1),  (eth1 frag2)

Your patch creates two contexts since hashes are different,
that will timeout and no packet delivered at all


^ permalink raw reply

* [PATCH] AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
From: Neil Horman @ 2009-10-28 17:39 UTC (permalink / raw)
  To: netdev; +Cc: davem, nhorman, eric.dumazet

Augment raw_send_hdrinc to correct for incorrect ip header length values

A series of oopses was reported to me recently.  Apparently when using AF_RAW
sockets to send data to peers that were reachable via ipsec encapsulation,
people could panic or BUG halt their systems.

I've tracked the problem down to user space sending an invalid ip header over an
AF_RAW socket with IP_HDRINCL set to 1.

Basically what happens is that userspace sends down an ip frame that includes
only the header (no data), but sets the ip header ihl value to a large number,
one that is larger than the total amount of data passed to the sendmsg call.  In
raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
that was passed in, but assume the data is all valid.  Later during ipsec
encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
to provide headroom for the ipsec headers.  During this operation, the
skb->transport_header is repointed to a spot computed by
skb->network_header + the ip header length (ihl).  Since so little data was
passed in relative to the value of ihl provided by the raw socket, we point
transport header to an unknown location, resulting in various crashes.

So, what to do about this?  My first thought was to simply return -EINVAL, and
let user space sort it out.  I'm still thinking that might be the best way, but
I thought I'd try this first, just in case someone has reason to try to
send such a bogus frame through the kernel.  This solution simply checks the
value of ihl in raw_send_hdrinc and expands the skb to fit, filling the new
space with  IPOPT_NOOP options.  I've confirmed that it fixes the crashes that
were reported.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 raw.c |   42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)


diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 9ef8c08..412304f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -351,13 +351,42 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	skb->ip_summed = CHECKSUM_NONE;
 
 	skb->transport_header = skb->network_header;
-	err = memcpy_fromiovecend((void *)iph, from, 0, length);
-	if (err)
-		goto error_fault;
+	err = -EFAULT;
+	if (memcpy_fromiovecend((void *)iph, from, 0, length))
+		goto error_free;
 
-	/* We don't modify invalid header */
 	iphlen = iph->ihl * 4;
-	if (iphlen >= sizeof(*iph) && iphlen <= length) {
+	/*
+	 * We don't modify invalid header, but we do want to be sure
+	 * that we have enough space in the skb to hold the header
+	 * and all its options, so if iph->ihl is greater than
+	 * the length of the iovec, we need to reallocate the skb, lest
+	 * odd things happen farther down the stack
+	 */
+	if (iphlen > length) {
+		size_t new_length;
+		int i;
+		char *pad;
+ 
+		/*
+		 * someone passed in a bogus ip header, in which the
+		 * the iph->ihl value was longer than the actual data
+		 * buffer.  We need to at least meet the ihl requirement
+		 * and since we don't mess with the ip header here
+		 * lets expand the skb
+		 */
+		new_length = iphlen - length;
+		err = -ENOMEM;
+		if (pskb_expand_head(skb, 0,
+		     new_length, GFP_KERNEL) < 0)
+			goto error_free;
+		pad = skb_put(skb, new_length);
+		for (i = 0; i < new_length; i++)
+			pad[i] = IPOPT_NOOP;
+ 
+	}
+ 
+	if (iphlen >= sizeof(*iph)) {
 		if (!iph->saddr)
 			iph->saddr = rt->rt_src;
 		iph->check   = 0;
@@ -380,8 +409,7 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 out:
 	return 0;
 
-error_fault:
-	err = -EFAULT;
+error_free:
 	kfree_skb(skb);
 error:
 	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);

^ permalink raw reply related

* Re: [PATCH] Multicast packet reassembly can fail
From: Steve Chen @ 2009-10-28 17:50 UTC (permalink / raw)
  To: Rick Jones; +Cc: Mark Huth, netdev
In-Reply-To: <4AE87D03.4020708@hp.com>

On Wed, 2009-10-28 at 10:18 -0700, Rick Jones wrote:
> >> It has been hours since my last good Emily Litella moment so I'll ask 
> >> - isn't the combination of source and dest addr, protocol, IP ID and 
> >> fragment offset supposed to take care of this?  How does the ingress 
> >> interface have anything to do with it?
> >>
> >> rick jones
> > 
> > The problem we've seen arises only when there are multiple interfaces 
> > each receiving the same multicast packets.  In that case there are 
> > multiple packets with the same key.  Steve was able to track down a 
> > packet loss due to re-assembly failure under certain arrival order 
> > conditions.
> > 
> > The proposed fix eliminated the packet loss in this case.  There might 
> > be a different problem in the re-assembly code that we have masked by 
> > separating the packets into streams from each interface.  Now that you 
> > mention it, the re-assembly code should be robust in the face of some 
> > duplicated and mis-ordered packets.  We can look more closely at that code.
> 
> If I understand correctly, the idea here is to say that when multiple interfaces 
> receive fragments of copies of the same  IP datagram that both copies will 
> "survive" and flow up the stack?
> 
> I'm basing that on your description, and an email from Steve that reads:
> 
> > Actually, the patch tries to prevent packet drop for this exact
> > scenario.  Please consider the following scenarios
> > 1.  Packet comes in the fragment reassemble code in the following order
> > (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> > Packet from both interfaces get reassembled and gets further processed.
> > 
> > 2. Packet can some times arrive in (perhaps other orders as well)
> > (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> > Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> > packet from eth1 is dropped in the routing code.
> 
> Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
> 
> I can see where scenario one leads to two IP datagrams making it up the stack, 
> but I would have thought that was simply an "accident" of the situation that 
> cannot reasonably be prevented, not justification to cause scenario two to send 
> two datagrams up the stack.

For scenario 2, the routing code drops the 2nd packet.  As a result, no
packet make it to the application.  If someone is willing to suggest an
alternative, I can certainly rework the patch and retest.

Regards,

Steve


^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Rick Jones @ 2009-10-28 18:10 UTC (permalink / raw)
  To: Steve Chen; +Cc: Mark Huth, netdev
In-Reply-To: <1256752203.3153.461.camel@linux-1lbu>

>>If I understand correctly, the idea here is to say that when multiple interfaces 
>>receive fragments of copies of the same  IP datagram that both copies will 
>>"survive" and flow up the stack?
>>
>>I'm basing that on your description, and an email from Steve that reads:
>>
>>
>>>Actually, the patch tries to prevent packet drop for this exact
>>>scenario.  Please consider the following scenarios
>>>1.  Packet comes in the fragment reassemble code in the following order
>>>(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
>>>Packet from both interfaces get reassembled and gets further processed.
>>>
>>>2. Packet can some times arrive in (perhaps other orders as well)
>>>(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
>>>Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
>>>packet from eth1 is dropped in the routing code.
>>
>>Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
>>
>>I can see where scenario one leads to two IP datagrams making it up the stack, 
>>but I would have thought that was simply an "accident" of the situation that 
>>cannot reasonably be prevented, not justification to cause scenario two to send 
>>two datagrams up the stack.
> 
> 
> For scenario 2, the routing code drops the 2nd packet.  As a result, no
> packet make it to the application.  If someone is willing to suggest an
> alternative, I can certainly rework the patch and retest.

I'll ask my next potentially Emily Litella question - don't multicast IP 
applications bind to multicast IP addresses and not interfaces?  That is to say, 
doesn't the first datagram completed get delivered to all applications on the 
host which have bound to the corresponding multicast IP (and port number...) ?

rick jones

^ permalink raw reply

* Re: [PATCH] AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
From: Eric Dumazet @ 2009-10-28 18:13 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <20091028173955.GB7422@hmsreliant.think-freely.org>

Neil Horman a écrit :
> Augment raw_send_hdrinc to correct for incorrect ip header length values
> 
> A series of oopses was reported to me recently.  Apparently when using AF_RAW
> sockets to send data to peers that were reachable via ipsec encapsulation,
> people could panic or BUG halt their systems.
> 
> I've tracked the problem down to user space sending an invalid ip header over an
> AF_RAW socket with IP_HDRINCL set to 1.
> 
> Basically what happens is that userspace sends down an ip frame that includes
> only the header (no data), but sets the ip header ihl value to a large number,
> one that is larger than the total amount of data passed to the sendmsg call.  In
> raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
> that was passed in, but assume the data is all valid.  Later during ipsec
> encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
> to provide headroom for the ipsec headers.  During this operation, the
> skb->transport_header is repointed to a spot computed by
> skb->network_header + the ip header length (ihl).  Since so little data was
> passed in relative to the value of ihl provided by the raw socket, we point
> transport header to an unknown location, resulting in various crashes.
> 
> So, what to do about this?  My first thought was to simply return -EINVAL, and
> let user space sort it out.  I'm still thinking that might be the best way, but
> I thought I'd try this first, just in case someone has reason to try to
> send such a bogus frame through the kernel.  This solution simply checks the
> value of ihl in raw_send_hdrinc and expands the skb to fit, filling the new
> space with  IPOPT_NOOP options.  I've confirmed that it fixes the crashes that
> were reported.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 

Thanks a lot for this detailed info, I wish everything could be explained like this !

I believe we should drop the request, since padding it is not what was expected by user.

^ permalink raw reply

* Re: wanPMC-CxT1E1
From: Greg KH @ 2009-10-28 18:11 UTC (permalink / raw)
  To: Bob Beers; +Cc: netdev
In-Reply-To: <4f6ba3b0910271048n10ff37fek9af191b133892e1e@mail.gmail.com>

On Tue, Oct 27, 2009 at 01:48:53PM -0400, Bob Beers wrote:
> On Mon, Oct 26, 2009 at 4:41 PM, Greg KH <greg@kroah.com> wrote:
> > Getting it to build on 2.6.31 is more important than RHEL5, we can't do
> > anything with an old kernel like that.
> 
> ok, so where do I start, I have a system ready to start
>  git cloning, and creating patches. I googled for a while
>  but didn't find a nice recipe for participating in the -staging
>  process.

I'll try to get it to build in the kernel tree today and send you a copy
of the patch to try out.

thanks,

greg k-h

^ permalink raw reply

* Re: [Fwd: Re: [PATCH] Multicast packet reassembly can fail]
From: Steve Chen @ 2009-10-28 18:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: mhuth, netdev
In-Reply-To: <4AE87ECD.7080408@gmail.com>

On Wed, 2009-10-28 at 18:26 +0100, Eric Dumazet wrote:
> Steve Chen a écrit :
> > On Wed, 2009-10-28 at 16:32 +0100, Eric Dumazet wrote:
> >> If each fragment is received twice on host, once by eth0, once by eth1,
> >> should we deliver datagram once or twice ?
> > 
> > The application received it once.  IIRC the duplicate packet is drop in
> > the routing code.
> > 
> >> Once should be enough, even if in the non fragmented case, it will
> >> be delivered twice (kernel cannot detect duplicates, user app might do itself)
> > 
> > Routing code drops the duplicate packet for none-fragmented case as
> > well.
> 
> Really ? How so ? Receiving two copies of the same packet is legal.

I will have to double check exactly where the packet drop happens.  I
thought it was somewhere in routing, but it could be in netfilter.

> 
> > 
> >>
> >>> For this specific case, src/dst address, protocol, IP ID and fragment
> >>> offset are all identical.  The only difference is the ingress interface.
> >>> A good follow up question would be why would anyone in their right mind
> >>> multicast to the same destination?  well, I don't know.  I can not get
> >>> the people who reported the problem to tell me either.   Since someone
> >>> found the need to do this,  perhaps others may find it useful too.
> >>>
> >> Then, if a 2000 bytes message is fragmented in two packets, one coming
> >> from eth0, one coming from eth1, I suspect your patch drops the message,
> >> unless eth0/eth1 are part of a bonding device...
> > 
> > Actually, the patch tries to prevent packet drop for this exact
> > scenario.  Please consider the following scenarios
> > 1.  Packet comes in the fragment reassemble code in the following order
> > (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> > Packet from both interfaces get reassembled and gets further processed.
> 
> Yes your patch does this, so each multicast application receives two copies of the
> same datagram.
> 
> > 
> > 2. Packet can some times arrive in (perhaps other orders as well)
> > (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> > Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> > packet from eth1 is dropped in the routing code.
> 
> Really ? how so ? I dont see how it can happen, unless you use RPF ?
> 
> current situation should be :
> 
> (eth0 frag1) : We create a context, store frag1 in it
> (eth1 frag1) : We find this context, and drop frag1 since we already have the data
>                   (maybe the bug is here, if we cannot cope with a duplicate ?)
> (eth0 frag2) : We find this context, store frag2 -> complete datagram and deliver it
> (eth1 frag2) : We find context, drop frag2 since datagram was completed.

Yes, this is exactly what is happening in the current code.

> 
>                (or maybe we create a new context that will timeout later, maybe this is your problem ?)
> 
> Net effect : We deliver the datagram correctly.
> 
> 
> > 
> >> That would break common routing setups, using two links to aggregate bandwidth ?
> > 
> > I don't believe it would.  The aggregate bandwidth will work the same as
> > before.  The attributes (src/dst addr, protocol, interface, etc.) should
> > generate a unique hash key.  If hash collision should happen with the
> > addition of iif << 5, the code still compare the original src addr along
> > with interface number, so there should be no issues.
> 
> What about the obvious :
> 
> (eth0 frag1),  (eth1 frag2)
> 
> Your patch creates two contexts since hashes are different,
> that will timeout and no packet delivered at all
> 
I see the point you are making.  I assumed, probably incorrectly, that
since eth0 and eth1 have different IP address.  I would get a complete
series of fragments for each interface.  Perhaps, I should really be
looking up the stack to see why packets were dropped.  Please correct me
if I'm mistaken.  The normal behavior is that application should be
receiving either 2 (scenario 1) or 1 (scenario 2) packets.

Regards,

Steve


^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Steve Chen @ 2009-10-28 18:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: Mark Huth, netdev
In-Reply-To: <4AE88907.3030206@hp.com>

On Wed, 2009-10-28 at 11:10 -0700, Rick Jones wrote:
> >>If I understand correctly, the idea here is to say that when multiple interfaces 
> >>receive fragments of copies of the same  IP datagram that both copies will 
> >>"survive" and flow up the stack?
> >>
> >>I'm basing that on your description, and an email from Steve that reads:
> >>
> >>
> >>>Actually, the patch tries to prevent packet drop for this exact
> >>>scenario.  Please consider the following scenarios
> >>>1.  Packet comes in the fragment reassemble code in the following order
> >>>(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> >>>Packet from both interfaces get reassembled and gets further processed.
> >>>
> >>>2. Packet can some times arrive in (perhaps other orders as well)
> >>>(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> >>>Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> >>>packet from eth1 is dropped in the routing code.
> >>
> >>Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
> >>
> >>I can see where scenario one leads to two IP datagrams making it up the stack, 
> >>but I would have thought that was simply an "accident" of the situation that 
> >>cannot reasonably be prevented, not justification to cause scenario two to send 
> >>two datagrams up the stack.
> > 
> > 
> > For scenario 2, the routing code drops the 2nd packet.  As a result, no
> > packet make it to the application.  If someone is willing to suggest an
> > alternative, I can certainly rework the patch and retest.
> 
> I'll ask my next potentially Emily Litella question - don't multicast IP 
> applications bind to multicast IP addresses and not interfaces?  That is to say, 
> doesn't the first datagram completed get delivered to all applications on the 
> host which have bound to the corresponding multicast IP (and port number...) ?
I actually don't know who Emily Litella is until today.  This mailing
list is great not just for learning networking stuff :).  In the test
code I received, one of the step to setup is to configure the IP address
of the interface that the application is expecting the packet.  It
appears to bind on interface based on that casual observation.  I'll
have to study the code in detail to be able to say for sure.

Regards,

Steve



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox