Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 04/11] bnx2x: Add TX fault check for fiber PHYs
From: Ben Hutchings @ 2011-05-23  5:20 UTC (permalink / raw)
  To: Yaniv Rosner; +Cc: davem, netdev, eilong
In-Reply-To: <1306063927.20872.86.camel@lb-tlvb-dmitry>

On Sun, 2011-05-22 at 14:32 +0300, Yaniv Rosner wrote:
> In case TX fault is detected on Fiber PHYs, declare the link as down
> until TX fault is gone.
[...]
> --- a/drivers/net/bnx2x/bnx2x_reg.h
> +++ b/drivers/net/bnx2x/bnx2x_reg.h
> @@ -6037,6 +6037,7 @@ Theotherbitsarereservedandshouldbezero*/
>  #define MDIO_PMA_REG_BCM_CTRL		0x0096
>  #define MDIO_PMA_REG_FEC_CTRL		0x00ab
>  #define MDIO_PMA_REG_RX_ALARM_CTRL	0x9000
> +#define MDIO_PMA_REG_TX_ALARM_CTRL	0x9001
>  #define MDIO_PMA_REG_LASI_CTRL		0x9002
>  #define MDIO_PMA_REG_RX_ALARM		0x9003
>  #define MDIO_PMA_REG_TX_ALARM		0x9004

By the way, the LASI registers are already named in <linux/mdio.h>:

#define MDIO_PMA_LASI_RXCTRL	0x9000	/* RX_ALARM control */
#define MDIO_PMA_LASI_TXCTRL	0x9001	/* TX_ALARM control */
#define MDIO_PMA_LASI_CTRL	0x9002	/* LASI control */
#define MDIO_PMA_LASI_RXSTAT	0x9003	/* RX_ALARM status */
#define MDIO_PMA_LASI_TXSTAT	0x9004	/* TX_ALARM status */
#define MDIO_PMA_LASI_STAT	0x9005	/* LASI status */

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [v3 00/39] faster tree-based sysctl implementation
From: Eric W. Biederman @ 2011-05-23  4:27 UTC (permalink / raw)
  To: Lucian Adrian Grijincu
  Cc: linux-kernel, netdev, Alexey Dobriyan, Octavian Purdila,
	David S . Miller
In-Reply-To: <BANLkTim7xn8tM4wTyd_m_nwUQg_CUTapeA@mail.gmail.com>

Lucian Adrian Grijincu <lucian.grijincu@gmail.com> writes:

> Hi
>
> This is version 3 of a patch series that introduces a faster/leaner
> sysctl internal implementation.
>
> Due to high number of patches and low general interest I'll just point
> you to the tree/branch:
>
>   git://github.com/luciang/linux-2.6-new-sysctl.git   v3-new-sysctl-alg
>
>
> Patches are on top of v2.6.39. I did not pick a more recent (random)
> point in Linus' tree to rebase these onto to not mess up testing.


Thanks for keeping going on this.

This patchset looks like it is deserving of some close scrutiny, and
not just the high level design overview I have given the previous
patches.  This is going to be a busy week for me so I probably won't
get through all of the patches for a while.

I will mention a couple of nits I noticed while I was skimming through
your patches.
- There can be multiple proc superblocks and thus multiple inodes
  referring to the same /proc/sys file, if there are multiple pid
  namespaces. 

- I have a hope to move /proc/sys into /proc/<pid>/sys so we don't have
  to look at current to determine the namespace we want to display.
  That would allow the deeply magic sysctl_is_seen check to be removed
  from proc_sys_compare.  That is not your problem, but of an
  explanation why the namespaces are passed through.

Eric

^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Changli Gao @ 2011-05-23  2:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, David Miller, Jiri Pirko, Nicolas de Pesloüan,
	netdev, shemminger, kaber, fubar, eric.dumazet, andy, Jesse Gross
In-Reply-To: <m1ei3qdwa6.fsf@fess.ebiederm.org>

On Mon, May 23, 2011 at 9:45 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>> In another side, is there a specification which defines the
>> hw-accel-vlan-rx?
>
> I don't know.
>
> I have just been trying to clean up the mess since some of the
> hw-accel-vlan code broke my use case, by delivering packets with
> priority but no vlan (aka vlan 0 packets) twice to my pf_packet sockets.
>

OK. But if we have decided to simulate the hw-accel-vlan-rx, I think
we'd better adjust the place where we put the emulation code. The very
beginnings of netif_rx() and neif_receive_skb() are better. Then rps
can support vlan packets without any change.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
From: Rusty Russell @ 2011-05-23  2:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <20110522121008.GA12155-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Sun, 22 May 2011 15:10:08 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Sat, May 21, 2011 at 11:49:59AM +0930, Rusty Russell wrote:
> > On Fri, 20 May 2011 02:11:56 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > Current code might introduce a lot of latency variation
> > > if there are many pending bufs at the time we
> > > attempt to transmit a new one. This is bad for
> > > real-time applications and can't be good for TCP either.
> > 
> > Do we have more than speculation to back that up, BTW?
> 
> Need to dig this up: I thought we saw some reports of this on the list?

I think so too, but a reference needs to be here too.

It helps to have exact benchmarks on what's being tested, otherwise we
risk unexpected interaction with the other optimization patches.

> > >  	struct sk_buff *skb;
> > >  	unsigned int len;
> > > -
> > > -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > > +	bool c;
> > > +	int n;
> > > +
> > > +	/* We try to free up at least 2 skbs per one sent, so that we'll get
> > > +	 * all of the memory back if they are used fast enough. */
> > > +	for (n = 0;
> > > +	     ((c = virtqueue_get_capacity(vi->svq) < capacity) || n < 2) &&
> > > +	     ((skb = virtqueue_get_buf(vi->svq, &len)));
> > > +	     ++n) {
> > >  		pr_debug("Sent skb %p\n", skb);
> > >  		vi->dev->stats.tx_bytes += skb->len;
> > >  		vi->dev->stats.tx_packets++;
> > >  		dev_kfree_skb_any(skb);
> > >  	}
> > > +	return !c;
> > 
> > This is for() abuse :)
> > 
> > Why is the capacity check in there at all?  Surely it's simpler to try
> > to free 2 skbs each time around?
> 
> This is in case we can't use indirect: we want to free up
> enough buffers for the following add_buf to succeed.

Sure, or we could just count the frags of the skb we're taking out,
which would be accurate for both cases and far more intuitive.

ie. always try to free up twice as much as we're about to put in.

Can we hit problems with OOM?  Sure, but no worse than now...

The problem is that this "virtqueue_get_capacity()" returns the worst
case, not the normal case.  So using it is deceptive.

> I just wanted to localize the 2+MAX_SKB_FRAGS logic that tries to make
> sure we have enough space in the buffer. Another way to do
> that is with a define :).

To do this properly, we should really be using the actual number of sg
elements needed, but we'd have to do most of xmit_skb beforehand so we
know how many.

Cheers,
Rusty.

^ permalink raw reply

* Re: [PATCH] netns: add /proc/*/net/id symlink
From: Eric W. Biederman @ 2011-05-23  2:02 UTC (permalink / raw)
  To: David Lamparter; +Cc: Alexey Dobriyan, davem, netdev, Linux Containers
In-Reply-To: <20110523014303.GA2351982@jupiter.n2.diac24.net>

David Lamparter <equinox@diac24.net> writes:

>> ... Eric W. Biederman wrote:
>> Now it probably needs to be better documented that /proc/*/net/*
>> have the same inode number if the network namespace is the
>> same, as everyone including myself overlooked this very handy
>> existing property.
>
> Eh, so did I. But, yes, very nice.
>
> On Sat, May 21, 2011 at 05:15:38PM -0700, Eric W. Biederman wrote:
>> Additionally that solution will work for comparing network namespaces
>> that don't happen to have any processes in them at the moment.  Because
>> fstat works on file descriptors.
>
> Hm. I have a peeve here. Assume I am a... rogue admin, whatever. I have
> root on a router. I create a new network namespace, put a macvlan of
> eth0 in it and a macvlan of eth1. I enable ip_forward.
>
> Then I make a mount namespace, bind-mount the net namespace, bind mount
> the mount namespace and terminate all processes that reference it (yes
> this does work, i just checked [!]).

You must be using an older version of my patchset than what I have
queued for Linus.  Bind mounting the mount namepsace and creating
reference counting loops is a weird and ugly case.  So for the moment I
am not supporting the mount namespace, until I can think through
the consequences.

> Now I can use it to bypass all firewall rules, IDS, whatever.
>
> How is any normal admin, monitoring script or whatever else able to
> detect this?

Which is why we I proceed slowly and cautiously with adding new kernel
interfaces.  It is hard to think of everything until you can actually
put it into use, and play with it.

Other than not allowing bind mounting the mount namespace I don't
have any all encompassing really good answers at the moment.

I do have a few small answers.  For network namespaces you can look in
/proc/slabinfo and see how many you have, unless slub is lying to you.
On the switch your server is connected to you can look at the mac table
and see which mac addresses are currently in use, and notice if there
are unaccounted for mac addresses.

Eric

^ permalink raw reply

* [v3 00/39] faster tree-based sysctl implementation
From: Lucian Adrian Grijincu @ 2011-05-23  1:56 UTC (permalink / raw)
  To: Eric W . Biederman, linux-kernel
  Cc: netdev, Lucian Adrian Grijincu, Alexey Dobriyan, Octavian Purdila,
	David S . Miller

Hi

This is version 3 of a patch series that introduces a faster/leaner
sysctl internal implementation.

Due to high number of patches and low general interest I'll just point
you to the tree/branch:

  git://github.com/luciang/linux-2.6-new-sysctl.git   v3-new-sysctl-alg


Patches are on top of v2.6.39. I did not pick a more recent (random)
point in Linus' tree to rebase these onto to not mess up testing.



Changes since v2
(http://thread.gmane.org/gmane.linux.kernel/1137032/focus=194748):
- added a compatibility layer to support old registering complex
  sysctl trees. This layer will be deleted once all users of the old
  are changed.
- subdirectories and netns correspondent dirs are now held in rbtrees
- split of from the patches that make changes in the rest of the tree
- rebased on top of 2.6.39




Changes since v1 (http://thread.gmane.org/gmane.linux.kernel/1133667):
- rebased on top of 2.6.39-rc6
- split the patch that adds the new algorithm and data structures.
- fixed a few bugs lingering in the old code
- shrinked a reference counter
- added a new reference counter to maintain ownership information
- added method to register an empty sysctl dir and converted some users
- added checks enforcing the rule that a non-netns specific directory may
 not be registered after a netns specific one has already been registered.
- added cookie support: register a piece of data with the header to be
 used to make simple conversions on the ctl_table. This saves memory where
 we need to register sysctl tables with the same content affecting
 different pieces of data.
- enforced sysctl checks



$ time modprobe dummy numdummies=N


NOTE: these stats are from v2. v3 should be a bit slower due to:
- the compatibility layer
- the old stats used cookies to prevent kmemdups() on ctl_table arrays
- the old patches had an optimisation for directories with many
subdirs that was replaced (in v3) with rbtrees

Without this patch series :(
- ipv4 only
 -  N=1000  time= 0m 06s
 -  N=2000  time= 0m 30s
 -  N=4000  time= 2m 35s
- ipv4 and ipv6
 -  N=1000  time= 0m 24s
 -  N=2000  time= 2m 14s
 -  N=4000  time=10m 16s
 -  N=5000  time=16m  3s

With this patch series    :)
- ipv4 only
 -  N=1000  time= 0m  0.33s
 -  N=2000  time= 0m  1.25s
 -  N=4000  time= 0m  5.31s
- ipv4 and ipv6
 -  N=1000  time= 0m  0.41s
 -  N=2000  time= 0m  1.62s
 -  N=4000  time= 0m  7.64s
 -  N=5000  time= 0m 12.35s
 -  N=8000  time= 0m 36.95s




Patches marked with RFC: are patches where reviewers should pay more
attention as I may have missed something.


Part 1: introduce compatibility layer:
  sysctl: introduce temporary sysctl wrappers
  sysctl: register only tables of sysctl files


Part 2: minimal changes to sysctl users:
  sysctl: call sysctl_init before the first sysctl registration
  sysctl: no-child: manually register kernel/random
  sysctl: no-child: manually register kernel/keys
  sysctl: no-child: manually register fs/inotify
  sysctl: no-child: manually register fs/epoll
  sysctl: no-child: manually register root tables


Part 3: cleanups simplifying the new algorithm (created when asked to
break up the new algorithm patch:
  sysctl: faster reimplementation of sysctl_check_table
  sysctl: remove useless ctl_table->parent field
  sysctl: simplify find_in_table
  sysctl: sysctl_head_grab defaults to root header on NULL
  sysctl: delete useless grab_header function
  sysctl: rename ->used to ->ctl_use_refs
  sysctl: rename sysctl_head_grab/finish to sysctl_use_header/unuse
  sysctl: rename sysctl_head_next to sysctl_use_next_header
  sysctl: split ->count into ctl_procfs_refs and ctl_header_refs
  sysctl: rename sysctl_head_get/put to sysctl_proc_inode_get/put
  sysctl: rename (un)use_table to __sysctl_(un)use_header
  sysctl: simplify ->permissions hook
  sysctl: group root-specific operations
  sysctl: introduce ctl_table_group
  sysctl: move removal from list out of start_unregistering


Part 4: new algorithm/data structures:
  sysctl: faster tree-based sysctl implementation


Part 5: checks/warns requested during review:
  sysctl: add duplicate entry and sanity ctl_table checks
  sysctl: alloc ctl_table_header with kmem_cache
  RFC: sysctl: change type of ctl_procfs_refs to u8
  sysctl: check netns-specific registration order respected
  sysctl: warn if registration/unregistration order is not respected
  RFC: sysctl: always perform sysctl checks


Part 6: Eric requested rbtrees for subdirs. I also used rbtrees for
netns correspondent dirs. Hope that adding rbtrees after using the old
list-based implementation is good enough. The rbtree code complicates
things a bit and would uglify the patch adding the new algorithm.
  sysctl: reorder members of ctl_table_header (cleanup)
  sysctl: add ctl_type member
  RFC: sysctl: replace subdirectory list with rbtree
  RFC: sysctl: replace netns corresp list with rbtree
  sysctl: union-ize some ctl_table_header fields


Part 7: Eric requested ability to register an empty dir:
  sysctl: add register_sysctl_dir: register an empty sysctl directory


Part 8: unrequested feature I'd like to piggy back :)
  sysctl: add ctl_cookie and ctl_cookie_handler
  sysctl: add cookie to __register_sysctl_paths
  sysctl: add register_net_sysctl_table_net_cookie


 drivers/char/random.c            |   27 +-
 fs/eventpoll.c                   |   22 +-
 fs/notify/inotify/inotify_user.c |   22 +-
 fs/proc/inode.c                  |    2 +-
 fs/proc/proc_sysctl.c            |  233 +++++---
 include/linux/inotify.h          |    2 -
 include/linux/key.h              |    3 -
 include/linux/poll.h             |    2 -
 include/linux/sysctl.h           |  221 +++++---
 include/net/net_namespace.h      |    4 +-
 init/main.c                      |    1 +
 kernel/Makefile                  |    5 +-
 kernel/sysctl.c                  | 1161 ++++++++++++++++++++++++++++----------
 kernel/sysctl_check.c            |  325 +++++++----
 lib/Kconfig.debug                |    8 -
 net/sysctl_net.c                 |   86 ++--
 security/keys/key.c              |    7 +
 security/keys/sysctl.c           |   18 +-
 18 files changed, 1495 insertions(+), 654 deletions(-)


-- 
 .
..: Lucian

^ permalink raw reply

* Re: [PATCH] netns: add /proc/*/net/id symlink
From: David Lamparter @ 2011-05-23  1:47 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric W. Biederman, Alexey Dobriyan, davem, netdev,
	Linux Containers
In-Reply-To: <20110523014303.GA2351982@jupiter.n2.diac24.net>

On Mon, May 23, 2011 at 03:43:03AM +0200, David Lamparter wrote:
> Then I make a mount namespace, bind-mount the net namespace, bind mount
> the mount namespace and terminate all processes that reference it (yes
> this does work, i just checked [!]).

Actually, Eric, bind-mounting a mount namespace inside itself should
probably be forbidden? No idea if you changed that (running a year-old
version of your patches here). Not only can you lose network namespaces
inside those self-referential mount namespaces but also references to
block devices, unix socket connections, etc. pp.


-David


^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Eric W. Biederman @ 2011-05-23  1:45 UTC (permalink / raw)
  To: Changli Gao
  Cc: Ben Greear, David Miller, Jiri Pirko, Nicolas de Pesloüan,
	netdev, shemminger, kaber, fubar, eric.dumazet, andy, Jesse Gross
In-Reply-To: <BANLkTik6etR8V1qeDnYBKt6BaODBe8kUOw@mail.gmail.com>

Changli Gao <xiaosuo@gmail.com> writes:

> On Mon, May 23, 2011 at 8:38 AM, Changli Gao <xiaosuo@gmail.com> wrote:
>> On Mon, May 23, 2011 at 6:38 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>>
>>>> Many years ago we supported the REORDER, but we suggested disabling
>>>> it for most users because it was a performance drag.  Funny that now
>>>> it seems to be the opposite!
>>>
>>> Yes it is funny.  I looked in history a while back and what I saw was
>>> that REORDER was always enabled by default and it took some serious
>>> effort to figure out how to get vconfig to disable REORDER. ip doesn't
>>> admit that REORDER can be disabled at all.
>>>
>>
>> Really?
>>
>> Quoted from the manual page of vconfig
>>       set_flag [vlan-device] 0 | 1
>>              When  1,  ethernet  header  reorders  are turned on. Dumping the
>>              device will appear as a common ethernet  device  without  vlans.
>>              When  0(default)  however,  ethernet  headers are not reordered,
>>              which results in vlan tagged packets when  dumping  the  device.
>>              Usually the default gives no problems, but some packet filtering
>>              programs might have problems with it.
>>
>> reordered is disabled by default. I also concern the performance.
>> Untag and then tag are expensive for the NICs which don't support
>> hw-accel-vlan-rx/tx.
>>
>
> For ip:
> localhost ~ # ip link add link eth0 vlan1 type vlan help
> Usage: ... vlan id VLANID [ FLAG-LIST ]
>                           [ ingress-qos-map QOS-MAP ] [ egress-qos-map QOS-MAP ]

Apparently I was blind when I looked at iproute.  I am certain I didn't
see it there but iproute clearly has the option to set or clear reorder_hdr.

> VLANID := 0-4095
> FLAG-LIST := [ FLAG-LIST ] FLAG
> FLAG := [ reorder_hdr { on | off } ] [ gvrp { on | off } ]
>         [ loose_binding { on | off } ]
> QOS-MAP := [ QOS-MAP ] QOS-MAPPING
> QOS-MAPPING := FROM:TO
>
> After checking the code, I found reorder_hdr is off by default.

In iproute reorder_hdr is not modified by default which is subtly
different.

> In another side, is there a specification which defines the
> hw-accel-vlan-rx?

I don't know.

I have just been trying to clean up the mess since some of the
hw-accel-vlan code broke my use case, by delivering packets with
priority but no vlan (aka vlan 0 packets) twice to my pf_packet sockets.

Eric


^ permalink raw reply

* Re: [PATCH] netns: add /proc/*/net/id symlink
From: David Lamparter @ 2011-05-23  1:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Dobriyan, davem, netdev, equinox, Linux Containers
In-Reply-To: <m1vcx3r3o5.fsf@fess.ebiederm.org>

> ... Eric W. Biederman wrote:
> Now it probably needs to be better documented that /proc/*/net/*
> have the same inode number if the network namespace is the
> same, as everyone including myself overlooked this very handy
> existing property.

Eh, so did I. But, yes, very nice.

On Sat, May 21, 2011 at 05:15:38PM -0700, Eric W. Biederman wrote:
> Additionally that solution will work for comparing network namespaces
> that don't happen to have any processes in them at the moment.  Because
> fstat works on file descriptors.

Hm. I have a peeve here. Assume I am a... rogue admin, whatever. I have
root on a router. I create a new network namespace, put a macvlan of
eth0 in it and a macvlan of eth1. I enable ip_forward.

Then I make a mount namespace, bind-mount the net namespace, bind mount
the mount namespace and terminate all processes that reference it (yes
this does work, i just checked [!]).

Now I can use it to bypass all firewall rules, IDS, whatever.

How is any normal admin, monitoring script or whatever else able to
detect this?

> With the /proc/<pid>/ns/net file and bind mounts I have solved the
> deeper problem of how do we get userspace policy into the naming of
> namespaces.  With those files and the setns system call I have solved
> the other problem of what is a good way to refer to namespaces without
> assuming a global name.  So once those changes are merged I expect there
> to be much less pressure to misuse any kind of identifier we can have.
> 
> And if we only make the guarantee about inode consistency for the
> /proc/<pid>/ns/FILE files I don't expect it will make maintenance
> of procfs any harder than it already is.


-David


^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Eric W. Biederman @ 2011-05-23  1:39 UTC (permalink / raw)
  To: Changli Gao
  Cc: Ben Greear, David Miller, Jiri Pirko, Nicolas de Pesloüan,
	netdev, shemminger, kaber, fubar, eric.dumazet, andy, Jesse Gross
In-Reply-To: <BANLkTinXFYzzj3R97mKH-90xM64BDwN8aA@mail.gmail.com>

Changli Gao <xiaosuo@gmail.com> writes:

> On Mon, May 23, 2011 at 6:38 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>>> Many years ago we supported the REORDER, but we suggested disabling
>>> it for most users because it was a performance drag.  Funny that now
>>> it seems to be the opposite!
>>
>> Yes it is funny.  I looked in history a while back and what I saw was
>> that REORDER was always enabled by default and it took some serious
>> effort to figure out how to get vconfig to disable REORDER. ip doesn't
>> admit that REORDER can be disabled at all.
>>
>
> Really?

> Quoted from the manual page of vconfig
>        set_flag [vlan-device] 0 | 1
>               When  1,  ethernet  header  reorders  are turned on. Dumping the
>               device will appear as a common ethernet  device  without  vlans.
>               When  0(default)  however,  ethernet  headers are not reordered,
>               which results in vlan tagged packets when  dumping  the  device.
>               Usually the default gives no problems, but some packet filtering
>               programs might have problems with it.
>
> reordered is disabled by default. I also concern the performance.
> Untag and then tag are expensive for the NICs which don't support
> hw-accel-vlan-rx/tx.

Yes really the vconfig man page is wrong.  Reorder has been enabled
by default since the code was merged.  See below.

Dhcp among other things does not work if you disable reorder.

As for performace we are moving 12 bytes which should be cache hot
4 bytes later in what should be the same cache line.  I expect my
16Mhz 386 will slow down a little for an operations like that, I
don't expect much else will.  I can't see anything short of benchmark
numbers being persuasive.

The other reality is that the code is honestly and truly broken in
the corner cases.  If we don't consolidate the code paths it will
never operate correctly.

Not that I agree 100% with all of the decisions but the code has
been broken for a while for most users so we really need to make
it consistent and fix the bugs.

Furthermore to the best of my knowledge no one actually clears
the reorder bit.  Changli the fact that you don't know that the
reorder bit is set by default suggest that you don't clear it
in the cases you are concerned about.

If no one is clearing the reorder bit in practice all that we
really have for the vlan changes are code motion and simplification
which should only have a positive impact on performance.

Eric

Aka the vlan code came in at:

commit 15a435fe2c0f649e2879e51e05fa466c2bb12731
Author: torvalds <torvalds>
Date:   Tue Feb 5 20:30:01 2002 +0000

    v2.4.13.5 -> v2.4.13.6
    
      - me: remember to bump the version number ;)
      - Hugh Dickins: export "free_lru_page()" for modules
      - Jeff Garzik: don't change nopage arguments, just make the last a dummy one
      - David Miller: sparc and net updates (netfilter, VLAN etc)
      - Nikita Danilov: reiserfs cleanups
      - Jan Kara: quota initialization race
      - Tigran Aivazian: make the x86 microcode update driver happy about
      hyperthreaded P4's
      - me: shrink dcache/icache more aggressively
      - me: fix up oom-killer so that it actually works
    
    BKrev: 3c6040c9uewuD0S9AwFCfH3_YMyRHQ

[snip]
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
new file mode 100644
index 0000000..e549e33
--- /dev/null
+++ b/net/8021q/vlan.c
[snip]
+/* DO reorder the header by default */
+unsigned short vlan_default_dev_flags = 1;

[snip]
+/*  Attach a VLAN device to a mac address (ie Ethernet Card).
+ *  Returns the device that was created, or NULL if there was
+ *  an error of some kind.
+ */
+struct net_device *register_802_1Q_vlan_device(const char* eth_IF_name,
+					       unsigned short VLAN_ID)
+{
[snip]
+	VLAN_DEV_INFO(new_dev)->vlan_id = VLAN_ID; /* 1 through 0xFFF */
+	VLAN_DEV_INFO(new_dev)->real_dev = real_dev;
+	VLAN_DEV_INFO(new_dev)->dent = NULL;
+	VLAN_DEV_INFO(new_dev)->flags = vlan_default_dev_flags;
[snip]

^ permalink raw reply

* (unknown)
From: xufeng zhang @ 2011-05-23  1:36 UTC (permalink / raw)
  To: netdev

subscribe netdev

^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Changli Gao @ 2011-05-23  1:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, David Miller, Jiri Pirko, Nicolas de Pesloüan,
	netdev, shemminger, kaber, fubar, eric.dumazet, andy, Jesse Gross
In-Reply-To: <BANLkTinXFYzzj3R97mKH-90xM64BDwN8aA@mail.gmail.com>

On Mon, May 23, 2011 at 8:38 AM, Changli Gao <xiaosuo@gmail.com> wrote:
> On Mon, May 23, 2011 at 6:38 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>>> Many years ago we supported the REORDER, but we suggested disabling
>>> it for most users because it was a performance drag.  Funny that now
>>> it seems to be the opposite!
>>
>> Yes it is funny.  I looked in history a while back and what I saw was
>> that REORDER was always enabled by default and it took some serious
>> effort to figure out how to get vconfig to disable REORDER. ip doesn't
>> admit that REORDER can be disabled at all.
>>
>
> Really?
>
> Quoted from the manual page of vconfig
>       set_flag [vlan-device] 0 | 1
>              When  1,  ethernet  header  reorders  are turned on. Dumping the
>              device will appear as a common ethernet  device  without  vlans.
>              When  0(default)  however,  ethernet  headers are not reordered,
>              which results in vlan tagged packets when  dumping  the  device.
>              Usually the default gives no problems, but some packet filtering
>              programs might have problems with it.
>
> reordered is disabled by default. I also concern the performance.
> Untag and then tag are expensive for the NICs which don't support
> hw-accel-vlan-rx/tx.
>

For ip:
localhost ~ # ip link add link eth0 vlan1 type vlan help
Usage: ... vlan id VLANID [ FLAG-LIST ]
                          [ ingress-qos-map QOS-MAP ] [ egress-qos-map QOS-MAP ]

VLANID := 0-4095
FLAG-LIST := [ FLAG-LIST ] FLAG
FLAG := [ reorder_hdr { on | off } ] [ gvrp { on | off } ]
        [ loose_binding { on | off } ]
QOS-MAP := [ QOS-MAP ] QOS-MAPPING
QOS-MAPPING := FROM:TO

After checking the code, I found reorder_hdr is off by default.

In another side, is there a specification which defines the hw-accel-vlan-rx?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: avoid synchronize_rcu() in dev_deactivate_many
From: David Miller @ 2011-05-23  1:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: greearb, netdev, kaber
In-Reply-To: <1305884529.3173.11.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 20 May 2011 11:42:09 +0200

> dev_deactivate_many() issues one synchronize_rcu() call after qdiscs set
> to noop_qdisc.
> 
> This call is here to make sure they are no outstanding qdisc-less
> dev_queue_xmit calls before returning to caller.
> 
> But in dismantle phase, we dont have to wait, because we wont activate
> again the device, and we are going to wait one rcu grace period later in
> rollback_registered_many().
> 
> After this patch, device dismantle uses one synchronize_net() and one
> rcu_barrier() call only, so we have a ~30% speedup and a smaller RTNL
> latency.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Patrick McHardy <kaber@trash.net>,
> CC: Ben Greear <greearb@candelatech.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: remove synchronize_net() from netdev_set_master()
From: David Miller @ 2011-05-23  1:03 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, shemminger, jpirko
In-Reply-To: <1305869860.3156.49.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 20 May 2011 07:37:40 +0200

> In the old days, we used to access dev->master in __netif_receive_skb()
> in a rcu_read_lock section.
> 
> So one synchronize_net() call was needed in netdev_set_master() to make
> sure another cpu could not use old master while/after we release it.
> 
> We now use netdev_rx_handler infrastructure and added one
> synchronize_net() call in bond_release()/bond_release_all()
> 
> Remove the obsolete synchronize_net() from netdev_set_master() and add
> one in bridge del_nbp() after its netdev_rx_handler_unregister() call.
> 
> This makes enslave -d a bit faster.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Jiri Pirko <jpirko@redhat.com>
> CC: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [Patch 3/3] net: rename NETDEV_BONDING_DESLAVE to NETDEV_RELEASE
From: David Miller @ 2011-05-23  1:03 UTC (permalink / raw)
  To: amwang; +Cc: netdev, andy, nhorman
In-Reply-To: <1305877152-30970-3-git-send-email-amwang@redhat.com>

From: Amerigo Wang <amwang@redhat.com>
Date: Fri, 20 May 2011 15:39:12 +0800

> s/NETDEV_BONDING_DESLAVE/NETDEV_RELEASE/ as Andy suggested.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>
> Cc: Andy Gospodarek <andy@greyhouse.net>
> Cc: Neil Horman <nhorman@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH 4/3] rtnetlink: ignore NETDEV_RELEASE and NETDEV_JOIN event
From: David Miller @ 2011-05-23  1:03 UTC (permalink / raw)
  To: amwang; +Cc: netdev
In-Reply-To: <1305882392-21241-1-git-send-email-amwang@redhat.com>

From: Amerigo Wang <amwang@redhat.com>
Date: Fri, 20 May 2011 17:06:32 +0800

> These two events are not expected to be caught by userspace.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>

Applied.

^ permalink raw reply

* Re: [Patch 2/3] bridge: call NETDEV_JOIN notifiers when add a slave
From: David Miller @ 2011-05-23  1:03 UTC (permalink / raw)
  To: amwang; +Cc: netdev, nhorman
In-Reply-To: <1305877152-30970-2-git-send-email-amwang@redhat.com>

From: Amerigo Wang <amwang@redhat.com>
Date: Fri, 20 May 2011 15:39:11 +0800

> In the previous patch I added NETDEV_JOIN, now
> we can notify netconsole when adding a device to a bridge too.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>
> Cc: Neil Horman <nhorman@redhat.com>

Applied.

^ permalink raw reply

* Re: [V3 Patch 1/3] netpoll: disable netpoll when enslave a device
From: David Miller @ 2011-05-23  1:03 UTC (permalink / raw)
  To: amwang; +Cc: netdev, nhorman
In-Reply-To: <1305877152-30970-1-git-send-email-amwang@redhat.com>

From: Amerigo Wang <amwang@redhat.com>
Date: Fri, 20 May 2011 15:39:10 +0800

> V3: rename NETDEV_ENSLAVE to NETDEV_JOIN
> 
> Currently we do nothing when we enslave a net device which is running netconsole.
> Neil pointed out that we may get weird results in such case, so let's disable
> netpoll on the device being enslaved. I think it is too harsh to prevent
> the device being ensalved if it is running netconsole.
> 
> By the way, this patch also removes the NETDEV_GOING_DOWN from netconsole
> netdev notifier, because netpoll will check if the device is running or not
> and we don't handle NETDEV_PRE_UP neither.
> 
> This patch is based on net-next-2.6.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>
> Cc: Neil Horman <nhorman@redhat.com>

Applied.

^ permalink raw reply

* Re: ip_rt_bug questions.
From: David Miller @ 2011-05-23  1:02 UTC (permalink / raw)
  To: davej; +Cc: netdev
In-Reply-To: <20110521171642.GA18411@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Sat, 21 May 2011 13:16:42 -0400

> Add a stack backtrace to the ip_rt_bug path for debugging
> 
> Signed-off-by: Dave Jones <davej@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: skb_trim explicitely check the linearity instead of data_len
From: David Miller @ 2011-05-23  1:02 UTC (permalink / raw)
  To: emmanuel.grumbach; +Cc: netdev
In-Reply-To: <1306043169-26659-1-git-send-email-emmanuel.grumbach@intel.com>

From: emmanuel.grumbach@intel.com
Date: Sun, 22 May 2011 08:46:09 +0300

> From: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
> 
> The purpose of the check on data_len is to check linearity, so use the inline
> helper for this. No overhead and more explicit.
> 
> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>

Applied.

^ permalink raw reply

* Re: [Patch] pktgen: refactor pg_init() code
From: David Miller @ 2011-05-23  1:02 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev
In-Reply-To: <BANLkTikd8Ft0N6ovGp8KOVksg2Rj4RxRaA@mail.gmail.com>

From: Américo Wang <xiyou.wangcong@gmail.com>
Date: Sun, 22 May 2011 18:52:08 +0800

> This also shrinks the object size a little.
> 
>    text	   data	    bss	    dec	    hex	filename
>   28834	    186	      8	  29028	   7164	net/core/pktgen.o
>   28816	    186	      8	  29010	   7152	net/core/pktgen.o.AFTER
> 
> 
> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
> Cc: "David Miller" <davem@davemloft.net>,

Applied.

^ permalink raw reply

* Re: [Patch] pktgen: use vzalloc_node() instead of vmalloc_node() + memset()
From: David Miller @ 2011-05-23  1:02 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev
In-Reply-To: <BANLkTimWSEwkoP3GOjZ3C-yG9jZRVM+yNA@mail.gmail.com>

From: Américo Wang <xiyou.wangcong@gmail.com>
Date: Sun, 22 May 2011 18:17:11 +0800

> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
> Cc: "David S. Miller" <davem@davemloft.net>

Applied.

^ permalink raw reply

* Re: [PATCH] net: filter: move forward declarations to avoid compile warnings
From: David Miller @ 2011-05-23  1:01 UTC (permalink / raw)
  To: heiko.carstens; +Cc: linux-kernel, netdev, eric.dumazet
In-Reply-To: <20110522170811.GA2758@osiris.boeblingen.de.ibm.com>

From: Heiko Carstens <heiko.carstens@de.ibm.com>
Date: Sun, 22 May 2011 19:08:11 +0200

> From: Heiko Carstens <heiko.carstens@de.ibm.com>
> 
> Get rid of this compile warning:
> 
> In file included from arch/s390/kernel/compat_linux.c:37:0:
> include/linux/filter.h:139:23: warning: 'struct sk_buff' declared inside parameter list
> 
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-2.6 0/4] bnx2x fixes
From: David Miller @ 2011-05-23  1:01 UTC (permalink / raw)
  To: dmitry; +Cc: netdev, eilong, vladz
In-Reply-To: <1306095292.9376.10.camel@lb-tlvb-dmitry>

From: "Dmitry Kravkov" <dmitry@broadcom.com>
Date: Sun, 22 May 2011 23:14:52 +0300

> Please consider applying following fixes to net-2.6.

All applied, thanks.

^ permalink raw reply

* [PATCH 5/5] net: Remove linux/prefetch.h include from linux/skbuff.h
From: David Miller @ 2011-05-23  0:58 UTC (permalink / raw)
  To: netdev


No longer needed.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/skbuff.h |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 09901fd..8cac356 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -28,7 +28,6 @@
 #include <net/checksum.h>
 #include <linux/rcupdate.h>
 #include <linux/dmaengine.h>
-#include <linux/prefetch.h>
 #include <linux/hrtimer.h>
 
 /* Don't change this without changing skb_csum_unnecessary! */
-- 
1.7.4.4


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox